April 15, 2026
On-Prem Blue/Green Deployment Journey
How we migrated from a single-process restart deployment to safer, zero-downtime blue/green releases on a self-hosted Linux server.
Quick navigation
- Background
- Why blue/green?
- Architecture overview
- What we changed
- Issues we hit
- The deploy script in detail
- Final workflow
- Rollback strategy
- Useful commands
- Lessons learned
Background
We run a Node.js REST API on a self-hosted Linux server, managed by PM2 and fronted by Nginx as a reverse proxy. For a long time, our deploy process was simple to the point of being dangerous:
- SSH into the server
- git pull
- npm ci && npm run build
- pm2 restart api
That last step caused a brief but real window of downtime. During a restart, PM2 kills the old process before the new one is ready. For low-traffic internal services, that was tolerable. As our API became more critical — serving mobile clients and third-party integrations — "tolerable" stopped being good enough.
We also had a rollback problem: if a bad deploy slipped through, our recovery path was to re-pull the previous commit and restart again, meaning another downtime window.
Why blue/green?
Blue/green deployment is a release technique where you maintain two identical production environments (called blue and green). At any moment, one instance is live (serving traffic) and the other is idle (standing by). When you deploy:
- You update and start the idle instance.
- You run health checks against it.
- Only if it passes, you flip the traffic switch to point at it.
- The previously live instance becomes the new idle one.
The key benefits:
- Zero downtime: traffic switches at the load-balancer level, atomically. No restart gap.
- Instant rollback: if something goes wrong post-switch, you flip back to the previous instance in seconds — no rebuild needed.
- Safe validation window: you can smoke-test the new instance internally before it ever receives real traffic.
This pattern is common in cloud environments (AWS CodeDeploy has native support, and Kubernetes teams typically switch a Service selector between two Deployments), but it translates cleanly to a single on-prem server too, which is what we'll cover here.
Architecture overview
                   ┌───────────────────────────────┐
Internet ────────► │ Nginx :80 (reverse proxy)     │
                   │ upstream: api_active_backend  │
                   │ → resolves to :3001 OR :3002  │
                   └───────────────┬───────────────┘
                                   │
                  ┌────────────────┴────────────────┐
                  │                                 │
          PM2: api_blue                     PM2: api_green
          Node.js :3001                     Node.js :3002
          (e.g. currently LIVE)             (e.g. currently IDLE)
Nginx reads a small include file (active_backend.conf) that sets a variable pointing to either the blue or green upstream. Swapping which one is live is just a matter of writing a new value to that file and reloading Nginx — no process restarts, no dropped connections.
What we changed
PM2 processes
We replaced the single api PM2 process with two named processes:
| Process | Port | Role |
|---|---|---|
| api_blue | 3001 | Starts as live |
| api_green | 3002 | Starts as idle |
Both processes run from the same codebase directory, identical in every way except the port they bind to (set via environment variable).
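For reference, the two processes can be defined in a single PM2 ecosystem file. Here is a minimal sketch (the file name, script path, and env values are illustrative rather than our exact config), assuming the app reads its port from the PORT environment variable:

// ecosystem.config.js (illustrative sketch, not our exact config)
module.exports = {
  apps: [
    {
      name: "api_blue",
      script: "./dist/server.js",              // assumed build entry point
      env: { NODE_ENV: "production", PORT: 3001 },
    },
    {
      name: "api_green",
      script: "./dist/server.js",
      env: { NODE_ENV: "production", PORT: 3002 },
    },
  ],
};

Both entries point at the same build output; pm2 start ecosystem.config.js brings them up together.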
Nginx upstream config
We split the Nginx config into two layers:
/etc/nginx/conf.d/api_upstreams.conf — lives in http context, never changes:
upstream api_blue { server 127.0.0.1:3001; }
upstream api_green { server 127.0.0.1:3002; }
/etc/nginx/snippets/active_backend.conf — referenced inside the server block, swapped by the deploy script:
set $active_backend "api_blue";
The server block's proxy_pass uses $active_backend to route:
location /api/ {
proxy_pass http://$active_backend;
}
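Putting it together, the server block ends up looking roughly like this (a trimmed sketch; TLS, the frontend location, and most proxy headers are omitted):

server {
    listen 80;
    server_name _;    # placeholder; use your real hostname

    # Swapped by the deploy script: sets $active_backend to api_blue or api_green
    include /etc/nginx/snippets/active_backend.conf;

    location /api/ {
        proxy_pass http://$active_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

Because $active_backend always resolves to one of the named upstreams defined in api_upstreams.conf, Nginx doesn't need a resolver directive for the variable-based proxy_pass.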
Deploy script (/root/deploy.sh)
The wrapper script orchestrates the full flow:
- Pull latest code from the deploy branch
- Run npm ci and npm run build
- Detect which instance is currently idle
- Restart the idle instance with the new build
- Poll /uptime on the idle instance until healthy (with retries + grace period)
- If healthy: rewrite active_backend.conf and reload Nginx
- If unhealthy: abort without touching Nginx (the live instance stays up)
Issues we hit
1. Nginx context errors
Our first attempt put the upstream blocks inside a server block include, and the set directive in conf.d. Both were wrong.
Nginx context rules:
- upstream — must be in http context
- set — must be in server, location, or if context
Fix: separate the concerns into two files as described above. The conf.d directory is included at http level, so upstream definitions go there. The snippets include is referenced from inside the server block, so set goes there.
2. Health check hitting the wrong endpoint
Our early health check was curl <server_ip>/uptime. This returned a 200 — but it was the frontend HTML, not the API, because it was hitting port 80 which Nginx routes to the frontend by default.
The correct check must hit the instance's port directly, bypassing Nginx entirely:
# Wrong — hits Nginx on port 80, may route to frontend
curl http://<server_ip>/uptime
# Correct — hits the API process directly
curl http://127.0.0.1:3001/uptime
curl http://127.0.0.1:3002/uptime
This matters for blue/green because you're validating the new instance before it receives routed traffic. If you check via Nginx, you're checking the current live instance, not the one you just deployed.
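For completeness, the health checks only need a fast 200 from the app itself. A minimal Express-style /uptime handler (hypothetical, not our actual implementation) would be:

// Hypothetical minimal health endpoint; the real handler returns more detail
const express = require("express");
const app = express();

app.get("/uptime", (req, res) => {
  res.status(200).json({ status: "ok", uptimeSeconds: process.uptime() });
});

// PORT is injected per instance (3001 for api_blue, 3002 for api_green)
app.listen(process.env.PORT || 3001);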
3. Runtime dependency in devDependencies
One of our packages (request-ip) was mistakenly listed under devDependencies rather than dependencies. Locally it worked fine because node_modules was populated from a full npm install. On the server after npm ci --production, it wasn't installed, causing a runtime crash on first request.
Fix: move it to dependencies, commit and push. Don't patch it on the server directly — that's exactly how environment drift starts.
Lesson: treat npm ci failures and startup crashes as signals to audit your dependency categorization. The distinction between dependencies and devDependencies matters when you deploy with --production or --omit=dev.
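One way to apply and verify the fix locally before pushing, sketched here assuming a reasonably recent npm (flag names differ slightly across versions):

# Move the package into dependencies (updates package.json and the lockfile)
npm install request-ip --save-prod

# Simulate the server's production install in a clean tree
rm -rf node_modules
npm ci --omit=dev
node -e "require('request-ip')"   # should exit silently if the fix worked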
4. Startup timing / warm-up
Node.js apps (especially those with database connection pools, cache warming, or lazy-loaded modules) can take a few seconds before they're truly ready to serve. Our health check was firing immediately after pm2 restart, catching the process before it was up.
Fix:
- Add a startup grace delay (e.g. sleep 3) before the first health check attempt
- Retry the health check N times with a short sleep between attempts rather than failing on the first non-200
Example retry loop:
MAX_RETRIES=10
RETRY_DELAY=3
for i in $(seq 1 $MAX_RETRIES); do
if curl -fsS "http://127.0.0.1:${IDLE_PORT}/uptime" > /dev/null 2>&1; then
echo "Health check passed on attempt $i"
HEALTHY=true
break
fi
echo "Attempt $i failed, retrying in ${RETRY_DELAY}s..."
sleep $RETRY_DELAY
done
The deploy script in detail
Here is an annotated version of the core deploy logic (simplified for clarity):
#!/usr/bin/env bash
set -euo pipefail
APP_DIR="/var/www/html/myapp/backend"
SNIPPETS_FILE="/etc/nginx/snippets/active_backend.conf"
BLUE_PORT=3001
GREEN_PORT=3002
# --- 1. Determine live and idle instances ---
ACTIVE=$(grep -oP '(?<=set \$active_backend ")[^"]+' "$SNIPPETS_FILE")
if [[ "$ACTIVE" == "api_blue" ]]; then
IDLE_NAME="api_green"
IDLE_PORT=$GREEN_PORT
else
IDLE_NAME="api_blue"
IDLE_PORT=$BLUE_PORT
fi
echo "Active: $ACTIVE → Deploying to idle: $IDLE_NAME (:$IDLE_PORT)"
# --- 2. Pull and build ---
cd "$APP_DIR"
git pull origin dev
npm ci
npm run build
# --- 3. Restart idle instance ---
pm2 restart "$IDLE_NAME"
# --- 4. Health check with retries ---
sleep 3 # startup grace
HEALTHY=false
for i in $(seq 1 10); do
if curl -fsS "http://127.0.0.1:${IDLE_PORT}/uptime" > /dev/null 2>&1; then
HEALTHY=true
break
fi
sleep 3
done
# --- 5. Switch or abort ---
if [[ "$HEALTHY" == "true" ]]; then
echo "set \$active_backend \"${IDLE_NAME}\";" > "$SNIPPETS_FILE"
nginx -t && systemctl reload nginx
echo "✅ Switched traffic to $IDLE_NAME"
else
echo "❌ Health check failed — Nginx NOT updated. $ACTIVE remains live."
exit 1
fi
Final workflow
Day-to-day deploy (after setup is complete)
# On developer machine
git push origin dev
# On server — that's it
./deploy.sh
The deploy script handles the git pull internally, so there's no need to run it manually first.
What happens under the hood
- Script detects the active instance (e.g. api_blue on :3001)
- Pulls latest code, installs deps, builds
- Restarts the idle instance (api_green on :3002) with the new code
- Polls http://127.0.0.1:3002/uptime until healthy
- Writes set $active_backend "api_green"; to active_backend.conf
- Reloads Nginx — traffic switches instantly
- api_blue continues running on :3001 as the new idle instance
Verification after deploy
# Check both instances are running
pm2 list | grep api
# Direct health checks
curl -fsS http://127.0.0.1:3001/uptime
curl -fsS http://127.0.0.1:3002/uptime
# Check which instance is currently active
cat /etc/nginx/snippets/active_backend.conf
# End-to-end check through Nginx
curl -fsS http://127.0.0.1/api/uptime
Rollback strategy
Rollback is now essentially free, because the previous build is still running as the idle instance.
Instant rollback (no rebuild):
# Flip back to the previous instance
echo 'set $active_backend "api_blue";' > /etc/nginx/snippets/active_backend.conf
nginx -t && systemctl reload nginx
This takes under a second and doesn't touch either PM2 process. The only time you need a full redeploy rollback (git revert + rebuild) is if:
- Both instances were deployed with the bad version (i.e. you deployed twice without noticing the issue)
- The problem is in a shared resource (database migration, config file) rather than the app code itself
For database migrations, blue/green only solves the application layer. If your deploy includes breaking DB schema changes, you need a separate migration strategy (expand/contract pattern, feature flags, etc.) — that's beyond the scope of this post but worth planning for.
Useful commands
Check state
# Which instance is currently serving traffic?
cat /etc/nginx/snippets/active_backend.conf
# Are both PM2 instances healthy?
pm2 list | grep "api_blue\|api_green"
# Nginx config test
nginx -t
Health checks
# Direct instance checks (bypass Nginx)
curl -fsS http://127.0.0.1:3001/uptime
curl -fsS http://127.0.0.1:3002/uptime
# End-to-end through Nginx (confirms routing)
curl -fsS http://127.0.0.1/api/uptime
Logs
# Tail recent logs for each instance
pm2 logs api_blue --lines 120 --nostream
pm2 logs api_green --lines 120 --nostream
# Nginx access/error logs
tail -f /var/log/nginx/access.log
tail -f /var/log/nginx/error.log
Manual traffic switch (without deploy)
# Switch to blue
echo 'set $active_backend "api_blue";' > /etc/nginx/snippets/active_backend.conf
nginx -t && systemctl reload nginx
# Switch to green
echo 'set $active_backend "api_green";' > /etc/nginx/snippets/active_backend.conf
nginx -t && systemctl reload nginx
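If you switch manually often, the same detection logic from the deploy script can be wrapped into a small toggle helper (a sketch, not something from our actual setup):

#!/usr/bin/env bash
# toggle.sh: hypothetical helper that flips traffic to whichever instance is idle
set -euo pipefail

SNIPPETS_FILE="/etc/nginx/snippets/active_backend.conf"
ACTIVE=$(grep -oP '(?<=set \$active_backend ")[^"]+' "$SNIPPETS_FILE")

if [[ "$ACTIVE" == "api_blue" ]]; then NEXT="api_green"; else NEXT="api_blue"; fi

echo "set \$active_backend \"${NEXT}\";" > "$SNIPPETS_FILE"
nginx -t && systemctl reload nginx
echo "Traffic switched: $ACTIVE → $NEXT"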
Lessons learned
Start with health checks you can trust. Our biggest early mistake was a health check that appeared to pass but was actually hitting the wrong process. Before you wire up any automated switching logic, manually verify that your health endpoint returns what you think it does from the right address and port.
Separate config concerns in Nginx. The http/server/location context hierarchy trips people up. When adding dynamic behavior, draw out which context each directive lives in before writing config. It will save you from cryptic "directive not allowed here" errors.
Dependency hygiene matters more on servers. Local development is forgiving — you often have lingering packages from previous installs. npm ci on a clean CI or server environment exposes sloppiness in your package.json. Run npm ci locally in a clean directory occasionally to catch these before they reach production.
Keep both instances warm. Keep both PM2 processes running at all times. An idle instance that's already warm (connected to the database, module cache populated) will respond to health checks faster and be a more reliable rollback target than one that was stopped.
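PM2's standard persistence commands keep both processes alive across reboots; a sketch assuming systemd:

# Start both instances (e.g. from an ecosystem file), then persist the list
pm2 start ecosystem.config.js
pm2 save

# Prints a command to run as root that installs a boot-time unit
pm2 startup systemd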
Document your assumptions. This setup assumes a single server. If you ever scale to multiple servers behind a load balancer, the blue/green logic moves up a level (to the load balancer) and the per-process approach described here becomes less relevant. Make sure your team knows what the design was optimized for.
Thanks for reading. If you're running a Node.js API on a self-hosted server and haven't moved away from pm2 restart deploys yet, this pattern is relatively straightforward to set up and makes a meaningful difference in deploy confidence.