April 15, 2026
On-Prem Blue/Green Deployment Journey
How we migrated from a single-process restart deployment to safer, zero-downtime blue/green releases on a self-hosted Linux server.
Quick navigation
- Background
- Why blue/green?
- Architecture overview
- What we changed
- Issues we hit
- The deploy script in detail
- Final workflow
- Rollback strategy
- Useful commands
- Lessons learned
Background
We run a Node.js REST API on a self-hosted Linux server, managed by PM2 and fronted by Nginx as a reverse proxy. For a long time, our deploy process was simple to the point of being dangerous:
- SSH into the server
- git pull
- npm ci && npm run build
- pm2 restart api
That last step caused a brief but real window of downtime. During a restart, PM2 kills the old process before the new one is ready. For low-traffic internal services, that was tolerable. As our API became more critical — serving mobile clients and third-party integrations — "tolerable" stopped being good enough.
We also had a rollback problem: if a bad deploy slipped through, our recovery path was to re-pull the previous commit and restart again, meaning another downtime window.
Why blue/green?
Blue/green deployment is a release technique where you maintain two identical production environments (called blue and green). At any moment, one instance is live (serving traffic) and the other is idle (standing by). When you deploy:
- You update and start the idle instance.
- You run health checks against it.
- Only if it passes, you flip the traffic switch to point at it.
- The previously live instance becomes the new idle one.
The key benefits:
- Zero downtime: traffic switches at the load-balancer level, atomically. No restart gap.
- Instant rollback: if something goes wrong post-switch, you flip back to the previous instance in seconds — no rebuild needed.
- Safe validation window: you can smoke-test the new instance internally before it ever receives real traffic.
This pattern is common in cloud environments (AWS CodeDeploy has native support, and Kubernetes teams typically switch a Service selector between two Deployments), but it translates cleanly to a single on-prem server too, which is what we'll cover here.
Architecture overview
                   ┌───────────────────────────────┐
Internet ────────► │ Nginx :80 (reverse proxy)     │
                   │ upstream: api_active_backend  │
                   │ → resolves to :3001 OR :3002  │
                   └───────────────┬───────────────┘
                                   │
                  ┌────────────────┴────────────────┐
                  │                                 │
          PM2: api_blue                     PM2: api_green
          Node.js :3001                     Node.js :3002
          (e.g. currently LIVE)             (e.g. currently IDLE)
Nginx reads a small include file (active_backend.conf) that sets a variable pointing to either the blue or green upstream. Swapping which one is live is just a matter of writing a new value to that file and reloading Nginx — no process restarts, no dropped connections.
What we changed
PM2 processes
We replaced the single api PM2 process with two named processes:
| Process | Port | Role |
|---|---|---|
| api_blue | 3001 | Starts as live |
| api_green | 3002 | Starts as idle |
Both processes run from the same codebase directory, identical in every way except the port they bind to (set via environment variable).
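For reference, the two processes can be defined in a single PM2 ecosystem file. Here is a minimal sketch (the file name, script path, and env values are illustrative rather than our exact config), assuming the app reads its port from the PORT environment variable:

// ecosystem.config.js (illustrative sketch, not our exact config)
module.exports = {
  apps: [
    {
      name: "api_blue",
      script: "./dist/server.js",              // assumed build entry point
      env: { NODE_ENV: "production", PORT: 3001 },
    },
    {
      name: "api_green",
      script: "./dist/server.js",
      env: { NODE_ENV: "production", PORT: 3002 },
    },
  ],
};

Both entries point at the same build output; pm2 start ecosystem.config.js brings them up together.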
Nginx upstream config
We split the Nginx config into two layers:
/etc/nginx/conf.d/api_upstreams.conf — lives in http context, never changes:
upstream api_blue { server 127.0.0.1:3001; }
upstream api_green { server 127.0.0.1:3002; }
/etc/nginx/snippets/active_backend.conf — referenced inside the server block, swapped by the deploy script:
set $active_backend "api_blue";
The server block's proxy_pass uses $active_backend to route:
location /api/ {
proxy_pass http://$active_backend;
}
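Putting it together, the server block ends up looking roughly like this (a trimmed sketch; TLS, the frontend location, and most proxy headers are omitted):

server {
    listen 80;
    server_name _;    # placeholder; use your real hostname

    # Swapped by the deploy script: sets $active_backend to api_blue or api_green
    include /etc/nginx/snippets/active_backend.conf;

    location /api/ {
        proxy_pass http://$active_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

Because $active_backend always resolves to one of the named upstreams defined in api_upstreams.conf, Nginx doesn't need a resolver directive for the variable-based proxy_pass.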
Deploy script (/root/deploy.sh)
The wrapper script orchestrates the full flow:
- Pull latest code from the deploy branch
- Run npm ci and npm run build
- Detect which instance is currently idle
- Restart the idle instance with the new build
- Poll /uptime on the idle instance until healthy (with retries + grace period)
- If healthy: rewrite active_backend.conf and reload Nginx
- If unhealthy: abort without touching Nginx (the live instance stays up)
Issues we hit
1. Nginx context errors
Our first attempt put the upstream blocks inside a server block include, and the set directive in conf.d. Both were wrong.
Nginx context rules:
- upstream — must be in http context
- set — must be in server, location, or if context
Fix: separate the concerns into two files as described above. The conf.d directory is included at http level, so upstream definitions go there. The snippets include is referenced from inside the server block, so set goes there.
2. Health check hitting the wrong endpoint
Our early health check was curl <server_ip>/uptime. This returned a 200 — but it was the frontend HTML, not the API, because it was hitting port 80 which Nginx routes to the frontend by default.
The correct check must hit the instance's port directly, bypassing Nginx entirely:
# Wrong — hits Nginx on port 80, may route to frontend
curl http://<server_ip>/uptime
# Correct — hits the API process directly
curl http://127.0.0.1:3001/uptime
curl http://127.0.0.1:3002/uptime
This matters for blue/green because you're validating the new instance before it receives routed traffic. If you check via Nginx, you're checking the current live instance, not the one you just deployed.
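For completeness, the health checks only need a fast 200 from the app itself. A minimal Express-style /uptime handler (hypothetical, not our actual implementation) would be:

// Hypothetical minimal health endpoint; the real handler returns more detail
const express = require("express");
const app = express();

app.get("/uptime", (req, res) => {
  res.status(200).json({ status: "ok", uptimeSeconds: process.uptime() });
});

// PORT is injected per instance (3001 for api_blue, 3002 for api_green)
app.listen(process.env.PORT || 3001);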
3. Runtime dependency in devDependencies
One of our packages (request-ip) was mistakenly listed under devDependencies rather than dependencies. Locally it worked fine because node_modules was populated from a full npm install. On the server after npm ci --production, it wasn't installed, causing a runtime crash on first request.
Fix: move it to dependencies, commit and push. Don't patch it on the server directly — that's exactly how environment drift starts.
Lesson: treat npm ci failures and startup crashes as signals to audit your dependency categorization. The distinction between dependencies and devDependencies matters when you deploy with --production or --omit=dev.
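One way to apply and verify the fix locally before pushing, sketched here assuming a reasonably recent npm (flag names differ slightly across versions):

# Move the package into dependencies (updates package.json and the lockfile)
npm install request-ip --save-prod

# Simulate the server's production install in a clean tree
rm -rf node_modules
npm ci --omit=dev
node -e "require('request-ip')"   # should exit silently if the fix worked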
4. Startup timing / warm-up
Node.js apps (especially those with database connection pools, cache warming, or lazy-loaded modules) can take a few seconds before they're truly ready to serve. Our health check was firing immediately after pm2 restart, catching the process before it was up.
Fix:
- Add a startup grace delay (e.g. sleep 3) before the first health check attempt
- Retry the health check N times with a short sleep between attempts rather than failing on the first non-200
Example retry loop:
MAX_RETRIES=10
RETRY_DELAY=3
for i in $(seq 1 $MAX_RETRIES); do
if curl -fsS "http://127.0.0.1:${IDLE_PORT}/uptime" > /dev/null 2>&1; then
echo "Health check passed on attempt $i"
HEALTHY=true
break
fi
echo "Attempt $i failed, retrying in ${RETRY_DELAY}s..."
sleep $RETRY_DELAY
done
The deploy script in detail
Here is an annotated version of the core deploy logic (simplified for clarity):
#!/usr/bin/env bash
set -euo pipefail
APP_DIR="/var/www/html/myapp/backend"
SNIPPETS_FILE="/etc/nginx/snippets/active_backend.conf"
BLUE_PORT=3001
GREEN_PORT=3002
# --- 1. Determine live and idle instances ---
ACTIVE=$(grep -oP '(?<=set \$active_backend ")[^"]+' "$SNIPPETS_FILE")
if [[ "$ACTIVE" == "api_blue" ]]; then
IDLE_NAME="api_green"
IDLE_PORT=$GREEN_PORT
else
IDLE_NAME="api_blue"
IDLE_PORT=$BLUE_PORT
fi
echo "Active: $ACTIVE → Deploying to idle: $IDLE_NAME (:$IDLE_PORT)"
# --- 2. Pull and build ---
cd "$APP_DIR"
git pull origin dev
npm ci
npm run build
# --- 3. Restart idle instance ---
pm2 restart "$IDLE_NAME"
# --- 4. Health check with retries ---
sleep 3 # startup grace
HEALTHY=false
for i in $(seq 1 10); do
if curl -fsS "http://127.0.0.1:${IDLE_PORT}/uptime" > /dev/null 2>&1; then
HEALTHY=true
break
fi
sleep 3
done
# --- 5. Switch or abort ---
if [[ "$HEALTHY" == "true" ]]; then
echo "set \$active_backend \"${IDLE_NAME}\";" > "$SNIPPETS_FILE"
nginx -t && systemctl reload nginx
echo "✅ Switched traffic to $IDLE_NAME"
else
echo "❌ Health check failed — Nginx NOT updated. $ACTIVE remains live."
exit 1
fi
Final workflow
Day-to-day deploy (after setup is complete)
# On developer machine
git push origin dev
# On server — that's it
./deploy.sh
The deploy script handles the git pull internally, so there's no need to run it manually first.
What happens under the hood
- Script detects the active instance (e.g. api_blue on :3001)
- Pulls latest code, installs deps, builds
- Restarts the idle instance (api_green on :3002) with the new code
- Polls http://127.0.0.1:3002/uptime until healthy
- Writes set $active_backend "api_green"; to active_backend.conf
- Reloads Nginx — traffic switches instantly
- api_blue continues running on :3001 as the new idle instance
Verification after deploy
# Check both instances are running
pm2 list | grep api
# Direct health checks
curl -fsS http://127.0.0.1:3001/uptime
curl -fsS http://127.0.0.1:3002/uptime
# Check which instance is currently active
cat /etc/nginx/snippets/active_backend.conf
# End-to-end check through Nginx
curl -fsS http://127.0.0.1/api/uptime
Rollback strategy
Rollback is now essentially free, because the previous build is still running as the idle instance.
Instant rollback (no rebuild):
# Flip back to the previous instance
echo 'set $active_backend "api_blue";' > /etc/nginx/snippets/active_backend.conf
nginx -t && systemctl reload nginx
This takes under a second and doesn't touch either PM2 process. The only time you need a full redeploy rollback (git revert + rebuild) is if:
- Both instances were deployed with the bad version (i.e. you deployed twice without noticing the issue)
- The problem is in a shared resource (database migration, config file) rather than the app code itself
For database migrations, blue/green only solves the application layer. If your deploy includes breaking DB schema changes, you need a separate migration strategy (expand/contract pattern, feature flags, etc.) — that's beyond the scope of this post but worth planning for.
Useful commands
Check state
# Which instance is currently serving traffic?
cat /etc/nginx/snippets/active_backend.conf
# Are both PM2 instances healthy?
pm2 list | grep "api_blue\|api_green"
# Nginx config test
nginx -t
Health checks
# Direct instance checks (bypass Nginx)
curl -fsS http://127.0.0.1:3001/uptime
curl -fsS http://127.0.0.1:3002/uptime
# End-to-end through Nginx (confirms routing)
curl -fsS http://127.0.0.1/api/uptime
Logs
# Tail recent logs for each instance
pm2 logs api_blue --lines 120 --nostream
pm2 logs api_green --lines 120 --nostream
# Nginx access/error logs
tail -f /var/log/nginx/access.log
tail -f /var/log/nginx/error.log
Manual traffic switch (without deploy)
# Switch to blue
echo 'set $active_backend "api_blue";' > /etc/nginx/snippets/active_backend.conf
nginx -t && systemctl reload nginx
# Switch to green
echo 'set $active_backend "api_green";' > /etc/nginx/snippets/active_backend.conf
nginx -t && systemctl reload nginx
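If you switch manually often, the same detection logic from the deploy script can be wrapped into a small toggle helper (a sketch, not something from our actual setup):

#!/usr/bin/env bash
# toggle.sh: hypothetical helper that flips traffic to whichever instance is idle
set -euo pipefail

SNIPPETS_FILE="/etc/nginx/snippets/active_backend.conf"
ACTIVE=$(grep -oP '(?<=set \$active_backend ")[^"]+' "$SNIPPETS_FILE")

if [[ "$ACTIVE" == "api_blue" ]]; then NEXT="api_green"; else NEXT="api_blue"; fi

echo "set \$active_backend \"${NEXT}\";" > "$SNIPPETS_FILE"
nginx -t && systemctl reload nginx
echo "Traffic switched: $ACTIVE → $NEXT"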
Lessons learned
Start with health checks you can trust. Our biggest early mistake was a health check that appeared to pass but was actually hitting the wrong process. Before you wire up any automated switching logic, manually verify that your health endpoint returns what you think it does from the right address and port.
Separate config concerns in Nginx. The http/server/location context hierarchy trips people up. When adding dynamic behavior, draw out which context each directive lives in before writing config. It will save you from cryptic "directive not allowed here" errors.
Dependency hygiene matters more on servers. Local development is forgiving — you often have lingering packages from previous installs. npm ci on a clean CI or server environment exposes sloppiness in your package.json. Run npm ci locally in a clean directory occasionally to catch these before they reach production.
Keep both instances warm. Keep both PM2 processes running at all times. An idle instance that's already warm (connected to the database, module cache populated) will respond to health checks faster and be a more reliable rollback target than one that was stopped.
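PM2's standard persistence commands keep both processes alive across reboots; a sketch assuming systemd:

# Start both instances (e.g. from an ecosystem file), then persist the list
pm2 start ecosystem.config.js
pm2 save

# Prints a command to run as root that installs a boot-time unit
pm2 startup systemd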
Document your assumptions. This setup assumes a single server. If you ever scale to multiple servers behind a load balancer, the blue/green logic moves up a level (to the load balancer) and the per-process approach described here becomes less relevant. Make sure your team knows what the design was optimized for.
Thanks for reading. If you're running a Node.js API on a self-hosted server and haven't moved away from pm2 restart deploys yet, this pattern is relatively straightforward to set up and makes a meaningful difference in deploy confidence.