Blue-Green and Canary Deployments: A Production Guide for Engineering Teams

Blue-green and canary deployments give you a way to release software without taking down your service or discovering a bug when it's already affecting everyone. Here's how they work and when to use each.

Anurag Verma

The thing about production bugs is they always surface at the worst time. A deployment goes out at 2pm on a Tuesday, error rates spike at 2:03pm, and you’re scrambling to figure out whether to roll forward or roll back while customers notice.

Blue-green and canary deployments don’t eliminate bugs, but they change the blast radius. Instead of every user hitting a broken release simultaneously, you get an early warning before the problem reaches everyone.

Blue-Green Deployments

Blue-green is the simpler of the two strategies. You maintain two identical production environments, called blue and green. At any given time, one is live (serving traffic) and the other is idle (ready to receive the next deployment).

The workflow:

  1. Traffic is routed to blue (the current live environment)
  2. You deploy the new version to green (currently idle)
  3. You run smoke tests against green directly, before it sees real traffic
  4. You switch the load balancer to route all traffic to green
  5. Blue stays up for a period as fallback; if something goes wrong, you switch back

The rollback is instant. You flip the load balancer back to blue. No re-deploying, no waiting for containers to spin up.

This is the key advantage over rolling deployments, where rolling back means either deploying the old version again (slow) or having a mixed fleet of old and new pods (messy).
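
The workflow and its rollback can be modelled in a few lines. This is a toy sketch, not a real load balancer: the BlueGreenRouter class and its method names are invented for illustration, and the only point it makes is that cutover and rollback are the same pointer flip.

```python
from dataclasses import dataclass, field


@dataclass
class BlueGreenRouter:
    """Toy model of a load balancer that sends all traffic to one slot."""

    live: str = "blue"
    versions: dict = field(default_factory=lambda: {"blue": "1.4.2", "green": None})

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def deploy_to_idle(self, version: str) -> None:
        # The live slot keeps serving while the idle slot is updated.
        self.versions[self.idle] = version

    def switch(self) -> None:
        # Cutover and rollback are the same operation: flip the pointer.
        if self.versions[self.idle] is None:
            raise RuntimeError("idle slot has nothing deployed")
        self.live = self.idle

    def serving(self) -> str:
        return self.versions[self.live]


router = BlueGreenRouter()
router.deploy_to_idle("1.4.3")      # green now holds the new version
assert router.serving() == "1.4.2"  # users still see blue
router.switch()                     # cut over to green
assert router.serving() == "1.4.3"
router.switch()                     # rollback: nothing is redeployed
assert router.serving() == "1.4.2"
```

Nothing is rebuilt or restarted in switch(), which is why the rollback is fast.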

What It Takes

You need two identical environments. In Kubernetes this is two separate Deployments with distinct labels, and a Service that targets one set of labels at a time:

# blue deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
      slot: blue
  template:
    metadata:
      labels:
        app: api
        slot: blue
    spec:
      containers:
        - name: api
          image: myregistry/api:1.4.2

---
# service pointing at blue
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api
    slot: blue    # change this to "green" to cut over
  ports:
    - port: 80
      targetPort: 3000

To deploy green and cut over:

# deploy new version to green
kubectl apply -f api-green.yaml

# verify green is healthy
kubectl rollout status deployment/api-green

# run your smoke tests against green directly (this assumes a separate
# Service, not shown here, that selects slot: green and has its own ClusterIP)
# then patch the service selector
kubectl patch service api -p '{"spec":{"selector":{"slot":"green"}}}'

Cutting back to blue is the same patch in reverse.
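
If you script the cutover, it helps to generate the patch body rather than hand-edit JSON. A minimal sketch, assuming the Service is named api as in the manifest above; the helper names selector_patch and cutover_command are made up for this example, and the code only builds the command without running it.

```python
import json


def selector_patch(slot: str) -> str:
    """Return the JSON patch body that points the Service at one slot."""
    if slot not in ("blue", "green"):
        raise ValueError(f"unknown slot: {slot}")
    return json.dumps({"spec": {"selector": {"slot": slot}}}, separators=(",", ":"))


def cutover_command(slot: str, service: str = "api") -> list[str]:
    """Build the kubectl argv for a cutover; the caller decides whether to run it."""
    return ["kubectl", "patch", "service", service, "-p", selector_patch(slot)]


print(cutover_command("green"))  # cut over
print(cutover_command("blue"))   # cut back
```

Rejecting unknown slot names matters here: a typo in the selector would leave the Service matching no pods at all.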

The Database Problem

Blue-green gets complicated when your deployment includes database schema changes. If green uses a new schema that blue doesn’t understand, you can’t safely roll back to blue without rolling back the migration too.

The standard approach is to make schema changes backward-compatible: additive migrations only, column renames split into two phases (first add the new column and backfill it while the old column stays in place, then remove the old column in a later release), and no dropped columns until after the old version is fully retired.
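
The two-phase rename can be sketched with an in-memory SQLite database. The table and column names (users, name, display_name) are hypothetical, and a real migration would go through your migration tool rather than raw statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('Ada'), ('Grace')")

# Phase 1: add the new column and backfill it. Nothing is removed,
# so code that only knows the old column keeps working.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
conn.execute("UPDATE users SET display_name = name WHERE display_name IS NULL")


def read_old(conn):
    # Query used by the old version (blue): only knows about `name`.
    return [row[0] for row in conn.execute("SELECT name FROM users ORDER BY id")]


def read_new(conn):
    # Query used by the new version (green): reads `display_name`.
    return [row[0] for row in conn.execute("SELECT display_name FROM users ORDER BY id")]


# Both versions work against the same schema, so switching back to blue is safe.
assert read_old(conn) == read_new(conn) == ["Ada", "Grace"]

# Phase 2 happens in a later release, once blue is retired:
#   ALTER TABLE users DROP COLUMN name
```

While both versions are live, the new version also has to keep writing the old column; otherwise rows created on green would look empty to blue after a rollback.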

If a schema change is not backward-compatible, blue-green is the wrong tool for that release. Do a more careful migration with a maintenance window instead.

When Blue-Green Makes Sense

Blue-green is well-suited for:

  • Services with fast smoke test suites. The benefit is testing against production infrastructure before real users hit it. A 45-minute suite still catches problems before cutover, but every release then waits 45 minutes before traffic can move.
  • Teams that need instant rollback. If your rollback SLA is “seconds, not minutes,” blue-green delivers that.
  • Stateless services. No session state means traffic can flip between environments cleanly.

It’s less suited for services with large state that’s expensive to keep synchronized across two environments, or for teams without the infrastructure budget to run two full environments continuously.

Canary Deployments

Canary releases take a more gradual approach. Instead of flipping all traffic from one environment to another, you route a small percentage of traffic to the new version while the majority continues hitting the old version. You increase the percentage over time as you gain confidence.

The name comes from “canary in a coal mine.” A small cohort of users experiences the new version first. If they’re fine, you continue. If they’re not, you’ve limited the impact.

100% traffic → v1.4.2

After first canary increment:
  5% traffic → v1.4.3 (canary)
  95% traffic → v1.4.2 (stable)

After validation:
  25% → v1.4.3
  75% → v1.4.2

Full rollout:
  100% → v1.4.3
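
One way to picture the split is a stable hash of the user ID into 100 buckets. This is a sketch under an assumption: many load balancers pick a backend per request at random instead, whereas hashing keeps each user on one version and keeps early canary users in the canary as the percentage grows. The function names are illustrative.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(user_id: str, canary_percent: int) -> str:
    """Send users whose bucket falls below the canary percentage to the canary."""
    return "canary" if bucket(user_id) < canary_percent else "stable"


users = [f"user-{i}" for i in range(10_000)]
for percent in (5, 25, 100):
    share = sum(route(u, percent) == "canary" for u in users) / len(users)
    print(f"{percent}% target -> {share:.1%} of users on canary")
```

Because the bucket never changes, a user routed to the canary at 5% is still on it at 25%, which makes their bug reports easier to attribute to one version.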

Implementing Canary in Kubernetes with Argo Rollouts

The cleanest way to implement canary in Kubernetes is Argo Rollouts. It replaces the standard Deployment resource with a Rollout resource that understands canary and blue-green strategies natively:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  replicas: 10
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: myregistry/api:1.4.3
  strategy:
    canary:
      steps:
        - setWeight: 5        # 5% of traffic to new version
        - pause: {duration: 5m}
        - setWeight: 25
        - pause: {duration: 10m}
        - setWeight: 50
        - pause: {}           # manual promotion required before proceeding
        - setWeight: 100
      analysis:
        templates:
          - templateName: error-rate-check
        startingStep: 1
        args:
          - name: service-name
            value: api

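The pause: {} step with no duration holds the rollout at 50% until a human (or your CD system) promotes it. The analysis runs in the background from step index 1 (the first pause) onward and can automatically abort the rollout if error rates exceed a threshold. One caveat: this manifest configures no traffic router, so Argo Rollouts approximates each weight by scaling pods. With 10 replicas, a 5% weight rounds up to one canary pod, which receives roughly 10% of requests; exact percentages need a trafficRouting section backed by an ingress controller or service mesh.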

The analysis template looks like this:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
spec:
  metrics:
    - name: error-rate
      interval: 1m
      successCondition: result[0] < 0.05   # a measurement fails at an error rate of 5% or more
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              status=~"5.."
            }[2m]))
            /
            sum(rate(http_requests_total{
              service="{{args.service-name}}"
            }[2m]))

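A measurement fails when the error rate reaches 5%. With failureLimit: 3, Argo Rollouts tolerates up to three failed measurements; on the next failure it marks the analysis as failed, aborts the rollout, and shifts traffic back to the stable version so you can investigate before retrying.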
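
How the successCondition and failureLimit fields in the template combine can be modelled roughly as follows. This is a simplified sketch, not Argo Rollouts code: it assumes failureLimit is the number of failed measurements tolerated, and it ignores inconclusive and errored measurements.

```python
def run_analysis(measurements, threshold=0.05, failure_limit=3):
    """Return 'Failed' once failed measurements exceed failure_limit, else 'Successful'.

    Assumption: failure_limit is the number of failed measurements tolerated,
    so the run fails on failure number failure_limit + 1.
    """
    failures = 0
    for error_rate in measurements:
        if not error_rate < threshold:  # mirrors successCondition: result[0] < 0.05
            failures += 1
            if failures > failure_limit:
                return "Failed"
    return "Successful"


print(run_analysis([0.01, 0.20, 0.01, 0.02]))        # one bad minute is tolerated
print(run_analysis([0.20, 0.20, 0.20, 0.20, 0.01]))  # a fourth failure ends the run
```

A nonzero limit trades detection speed for resilience against a single noisy measurement; set it to 0 if one bad reading should stop the rollout.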

Without Argo Rollouts

If you’re not ready to add Argo Rollouts, you can approximate canary with two standard Deployments and weight-based routing in your ingress. With ingress-nginx:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "5"
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-new
                port:
                  number: 80

The main Ingress (not shown) routes to the api-stable Service. This canary Ingress, which must use the same host and path as the main one, routes roughly 5% of requests to api-new. Increase the weight gradually as confidence builds. This approach lacks the automated analysis and promotion controls of Argo Rollouts, so you’re managing the progression manually.

Choosing Between the Two

Scenario                                    Recommendation
You need instant rollback                   Blue-green
You want early signal from real traffic     Canary
Database schema is changing                 Neither cleanly; plan migrations separately
You have good observability                 Canary (automated analysis gates)
You have limited infra budget               Canary (reuses existing fleet)
Your service has long-lived connections     Canary (avoids connection drops on cutover)

Canary is generally the better default for long-running services because it gives you real production signal at limited risk. Blue-green is better when you need the ability to cut back instantly without the partial-traffic awkwardness of having two versions live simultaneously.

Many teams run both: canary for gradual rollout of normal releases, blue-green for deployments where you need a clean switch (new infrastructure, major version upgrades, or anything requiring post-deploy verification before traffic shifts).

What You Need Before Either Strategy Is Useful

Neither strategy works well without the following in place:

Meaningful health checks. Your readiness probe needs to verify that the new version can actually serve traffic, not just that the process started. If your health check just returns 200 immediately without testing database connectivity or configuration loading, you’ll route traffic to broken pods.
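
A readiness handler along these lines reports ready only when every dependency check passes. It is a framework-agnostic sketch: the check functions are hypothetical stand-ins, and in a real service the returned status code would be sent from whatever endpoint your readiness probe calls.

```python
from typing import Callable, Dict, Tuple


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run every dependency check; return 200 only if all of them pass."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:  # a crashing check counts as not ready
            results[name] = f"error: {exc}"
    status = 200 if all(value == "ok" for value in results.values()) else 503
    return status, results


# Hypothetical checks; real ones would ping the database and validate config.
def database_reachable() -> bool:
    return True


def config_loaded() -> bool:
    return False  # simulate a missing setting


status, detail = readiness({"database": database_reachable, "config": config_loaded})
print(status, detail)
```

Returning the per-check detail alongside the status makes a failing probe much quicker to diagnose from logs.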

Request-level observability. You need per-version error rates and latency. Without this, canary analysis gates are guessing. Tag your metrics with the deployment version (app_version: 1.4.3) from the start.
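
The reason per-version numbers matter shows up in simple arithmetic. The sketch below uses made-up request records of the form (app_version, status_code) to compare a blended error rate with a per-version one.

```python
from collections import defaultdict


def error_rate_by_version(requests):
    """Group (app_version, status_code) records and return the 5xx share per version."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for version, status in requests:
        totals[version] += 1
        if 500 <= status <= 599:
            errors[version] += 1
    return {version: errors[version] / totals[version] for version in totals}


# Made-up sample: 95 requests on stable, 5 on canary, 2 canary failures.
sample = [("1.4.2", 200)] * 95 + [("1.4.3", 200)] * 3 + [("1.4.3", 500)] * 2
per_version = error_rate_by_version(sample)
overall = sum(1 for _, status in sample if 500 <= status <= 599) / len(sample)

print(per_version)  # the canary fails 40% of its requests
print(overall)      # the blended rate is only 2%
```

A gate that watches only the blended rate would pass this canary at a 5% threshold even though two in five of its requests fail.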

Fast smoke tests. For blue-green specifically, the value is catching problems before traffic flips. A suite that takes 20 minutes holds up every cutover for 20 minutes; one that takes 2 minutes keeps the verification step cheap enough to run on every release.

Defined promotion criteria. “It looks okay” is not a criterion. Before you start a canary, write down the numbers that constitute success: error rate below X, p95 latency below Y, no new error classes in logs. Then check those numbers before each increment.
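
Written-down criteria translate directly into a check you can run before each increment. This is a minimal sketch: the Criteria fields, the thresholds, and the error class names are placeholders, and the p95 here uses a simple nearest-rank method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criteria:
    max_error_rate: float      # "error rate below X"
    max_p95_latency_ms: float  # "p95 latency below Y"


def p95(latencies_ms):
    """Nearest-rank 95th percentile using integer arithmetic for the rank."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def may_promote(criteria, error_rate, latencies_ms, known_errors, seen_errors):
    """Return (ok, reasons); every criterion must hold before the next increment."""
    reasons = []
    if error_rate >= criteria.max_error_rate:
        reasons.append(f"error rate {error_rate:.3f} not below {criteria.max_error_rate}")
    if p95(latencies_ms) >= criteria.max_p95_latency_ms:
        reasons.append("p95 latency too high")
    new_classes = set(seen_errors) - set(known_errors)
    if new_classes:
        reasons.append(f"new error classes: {sorted(new_classes)}")
    return (not reasons, reasons)


criteria = Criteria(max_error_rate=0.01, max_p95_latency_ms=300)
ok, reasons = may_promote(
    criteria,
    error_rate=0.004,
    latencies_ms=[120, 140, 180, 210, 250],
    known_errors={"TimeoutError"},
    seen_errors={"TimeoutError", "KeyError"},
)
print(ok, reasons)
```

Returning the reasons, not just a boolean, gives whoever is on call something concrete to look at when a promotion is blocked.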

The tooling is the easy part. The hard part is having the observability and promotion criteria defined before you need them.
