March 25, 2026

I Found 119 Zombie Connections Silently Strangling Our API — Here's How I Killed Them

Production MySQL was slow and timing out—not mainly because of size, but because 119 sleeping connections had leaked and starved the pool. How I found them, added a read replica, and automated cleanup.

TL;DR: Our production database had been getting heavier and slower for months. API response times tanked, timeout errors kept appearing, and no one could figure out why. When I finally dug in, I found 119 zombie connections that had been accumulating long before I touched anything. Here's how I discovered them, why I set up a MySQL read replica to rescue the database, and the automated cleanup that's been running flawlessly ever since.


The Setup (Or: How I Got Here)

Hey, I'm Kaung Myat Han (Eric). I'm a full-stack developer in Bangkok working on a high-traffic web application. Nothing glamorous—mostly Node.js, MySQL, and the occasional 3 AM production incident.

Our main database had grown to 17GB over time. Performance was degrading week by week. API response times that used to be 200ms were now hitting 2–3 seconds. Heavy analytics queries and Excel exports would completely lock up the application. Users were seeing timeout errors for no apparent reason.

At first, I thought it was just database size. "17GB isn't that big, but maybe MySQL is struggling?" I started investigating.

What I found changed everything—and it had nothing to do with the data itself.

Spoiler alert: The database size wasn't the main problem. 119 zombie database connections had been quietly accumulating in the system—a connection leak that existed long before I started digging. They were choking the life out of our connection pool. Setting up a read replica wasn't the original plan, but once I understood how bad the situation was—a bloated, slow database and a leaking connection pool—it became the right call to fix both problems properly.



Discovering the Problem

This is the part nobody told me to check first. I was preparing to set up a read replica to take load off the primary database—but before touching anything, I wanted to understand why the database was so slow to begin with.

The Symptoms

These errors had been appearing more and more frequently in the logs:

TimeoutError: unknown timed out
    at Timeout._onTimeout (.../retry-as-promised/dist/index.js:62:24)
{
  previous: undefined
}

That previous: undefined was weird. It meant the query wasn't even reaching the database—it was timing out before execution.

The Investigation

I checked the usual suspects:

  • Slow query log? Nothing unusual.
  • CPU usage? Normal.
  • Disk I/O? A bit high, but nothing crazy.
  • Memory? Fine.

Then I ran this:

SELECT id, user, host, command, time, state
FROM information_schema.processlist
WHERE user = 'app_user'
ORDER BY time DESC;

And there it was:

+--------+----------+---------+------+-------+
| id     | user     | command | time | state |
+--------+----------+---------+------+-------+
| 119283 | app_user | Sleep   |  213 |       |
| 119284 | app_user | Sleep   |  213 |       |
| 119285 | app_user | Sleep   |  101 |       |
| 119286 | app_user | Sleep   |   87 |       |
... (115 more rows)

119 connections. All sleeping. Some for over 3 minutes.

Our Sequelize pool was configured for a maximum of 80 connections.

What Was Actually Happening

This wasn't something I introduced. The connection leak had been baked into the application for a long time, slowly accumulating. The chain of events looked like this:

1. App checks out a connection from the pool
2. Something goes wrong (unhandled error? transaction never committed?)
3. Connection never gets released back to the pool
4. Repeat 119 times over several months
5. Pool is now "full" even though most connections are zombies
6. New request comes in, needs a database connection
7. Waits in queue for a free connection that will never come
8. After 60 seconds: timeout fires
9. Error: "unknown timed out" with "previous: undefined"

The database wasn't slow because of its size. It was starved. The connection pool was exhausted by zombie connections that had been leaking from the application for months.
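
To make step 2 in that chain concrete, here's the shape of code that produces it. This is a hypothetical sketch, not our actual code (sequelize and the Order model are stand-ins), but the pattern is classic: an unmanaged Sequelize transaction with no rollback path.

// Hypothetical sketch of the leak pattern. The transaction checks a
// connection out of the pool; if anything between here and commit()
// throws, neither commit nor rollback ever runs, so the connection is
// never returned. It sits in SHOW PROCESSLIST as "Sleep" forever.
// (Assume `sequelize` and the `Order` model come from the app's init code.)
async function markOrderProcessed(orderId: number) {
  const t = await sequelize.transaction(); // connection checked out here

  // If either of these throws (bad data, lock timeout, a crash)...
  const order = await Order.findByPk(orderId, { transaction: t });
  await order!.update({ status: 'processed' }, { transaction: t });

  await t.commit(); // ...this line never runs: one more zombie
}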

Why This Made Everything Worse

With 119 idle connections sitting there doing nothing:

  • Every new query had to compete for one of the few actually available connections
  • Heavy analytics queries would grab connections and hold them even longer
  • The database was spending resources managing connections that weren't doing anything useful
  • The 17GB database size added its own overhead on top of the connection problem

The full picture became clear: The database was carrying dead weight—both in zombie connections and in heavy read traffic that didn't need to hit the primary at all. The right move was to set up a read replica to offload analytics queries, and get the connection situation under control.


Setting Up the Replica

With the problem fully understood, the plan was:

  1. Set up a read replica to handle all the heavy analytics and reporting queries
  2. Keep the primary lean and focused on real-time application traffic
  3. Fix the connection leak so it stops accumulating

The Architecture

Same physical server, two MySQL instances:

Primary (port 5231)
├─ Handles all writes
├─ Serves live user queries
└─ Replicates to →

Replica (port 5232)
├─ Read-only
├─ Analytics & Excel exports only
└─ Only syncs the application database

Not exactly rocket science, but there's always something.

Step 1: Make Sure the Primary Is Ready

First, I checked the primary's config to confirm binary logging was enabled:

cat /etc/mysql/mysql.conf.d/mysqld.cnf | grep -E "(server-id|log_bin|bind-address|port)"
server-id = 1
log-bin = /var/log/mysql/mysql-bin.log
binlog-format = ROW
binlog-do-db = app_db    # Only replicate the database we need
port = 5231
bind-address = 0.0.0.0

The binlog-do-db = app_db part is important—it means we're only logging changes to the one database we care about, not every random schema on the server.

Step 2: Create the Replication User (First Gotcha)

CREATE USER 'replica_user'@'localhost'
  IDENTIFIED WITH mysql_native_password BY 'your_strong_password_here';
GRANT REPLICATION SLAVE ON *.* TO 'replica_user'@'localhost';
FLUSH PRIVILEGES;

Notice mysql_native_password? Yeah, that took me an hour to figure out.

The replica initially threw this error:

Authentication requires secure connection

Turns out MySQL 8's default caching_sha2_password expects SSL, which I didn't have configured for local replication. Switching to the older password plugin fixed it. Not ideal for internet-facing traffic, but perfectly fine for 127.0.0.1.

Step 3: Grab the Binary Log Position

This is basically your "save point" for replication:

SHOW MASTER STATUS;
+------------------+----------+--------------+
| File             | Position | Binlog_Do_DB |
+------------------+----------+--------------+
| mysql-bin.000085 | 28392997 | app_db       |
+------------------+----------+--------------+

Write that down. You'll need it in a minute. (Newer versions of mysqldump can embed these coordinates in the dump for you via --source-data, formerly --master-data, which saves copying them by hand.)

Step 4: Configure the Replica

I created a separate config file for the replica instance:

# /etc/mysql/mysql-replica.cnf
[mysqld]
server-id = 2                    # Must be different from the primary!
port = 5232
bind-address = 127.0.0.1         # Local only

# A second instance needs its own datadir, socket, and pid file
# (paths assumed to match the relay log location below)
datadir = /var/lib/mysql-replica
socket = /var/lib/mysql-replica/mysqld.sock
pid-file = /var/lib/mysql-replica/mysqld.pid

read_only = 1                    # Safety: prevent accidental writes
relay-log = /var/lib/mysql-replica/replica-relay-bin
replicate-do-db = app_db         # Filter again at the replica level

innodb_buffer_pool_size = 8G
max_connections = 200

# These timeout settings became VERY important later...
wait_timeout = 120
interactive_timeout = 120

Step 5: Set Up Systemd Service

Since I'm running two MySQL instances on one server, I needed a separate systemd service:

# /etc/systemd/system/mysql-replica.service
[Unit]
Description=MySQL Replica Server
After=network.target mysql.service

[Service]
Type=notify
User=mysql
ExecStart=/usr/sbin/mysqld --defaults-file=/etc/mysql/mysql-replica.cnf
LimitNOFILE=65535
Restart=on-failure

[Install]
WantedBy=multi-user.target

Then reload systemd and start the replica:

sudo systemctl daemon-reload
sudo systemctl start mysql-replica

If you get an Address already in use error on port 5232, check for a stray mysqld process still holding the port:

ps aux | grep mysqld | grep 5232
sudo kill [pid]
sudo systemctl start mysql-replica

Step 6: Configure Replication

-- On the replica:
STOP REPLICA;

CHANGE REPLICATION SOURCE TO
  SOURCE_HOST='127.0.0.1',
  SOURCE_PORT=5231,
  SOURCE_USER='replica_user',
  SOURCE_PASSWORD='your_strong_password_here',
  SOURCE_LOG_FILE='mysql-bin.000085',
  SOURCE_LOG_POS=28392997;

START REPLICA;

Step 7: Import the Data (This Took All Night)

Here's where things got interesting.

I started replication, but the database didn't exist on the replica yet. MySQL immediately threw:

Unknown database 'app_db'

So I created the database and started importing a SQL dump:

mysql -h 127.0.0.1 -P 5232 -u root -e "CREATE DATABASE app_db;"

# Use nohup so it keeps running if SSH disconnects
nohup mysql -h 127.0.0.1 -P 5232 -u root app_db < /home/db_dump.sql > /tmp/import.log 2>&1 &

This dump had 274 tables. At 3 AM, my SSH connection dropped. I woke up the next morning expecting disaster, but MySQL had just silently kept going. It was at table 195.

Lesson learned: nohup is your best friend for long-running imports. And because a standard mysqldump file starts each table with DROP TABLE IF EXISTS, re-running the import after a failure is safe; each table just gets rebuilt from scratch.

Step 8: Check Replication Status

SHOW REPLICA STATUS\G

What you want to see:

Replica_IO_Running: Yes
Replica_SQL_Running: Yes
Seconds_Behind_Source: 0

If Seconds_Behind_Source is climbing, your replica is falling behind.
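
On the application side, something still has to send the heavy reads to port 5232. One option is Sequelize's built-in read replication, shown below as a minimal sketch with placeholder credentials: SELECTs outside a transaction go to the read pool, while writes and anything transactional go to the write pool.

import { Sequelize } from 'sequelize';

// Sketch of Sequelize read/write splitting (placeholder credentials).
// Each pool gets its own copy of the pool settings below.
const sequelize = new Sequelize({
  dialect: 'mysql',
  database: 'app_db',
  replication: {
    write: { host: '127.0.0.1', port: 5231, username: 'app_user', password: 'app_password' },
    read: [{ host: '127.0.0.1', port: 5232, username: 'app_user', password: 'app_password' }],
  },
  pool: { max: 20, min: 2, acquire: 30000, idle: 10000 },
});

An equally valid design is a second Sequelize instance pointed only at the replica, used explicitly by the reporting code; that makes the routing visible at the call site instead of implicit.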


The Fix (Four Layers Deep)

Now that I understood the problem and had the replica set up, I needed to get the connection leak under control. I implemented a four-layer solution.

Layer 1: Immediate Cleanup (Emergency Response)

First, I killed all those zombie connections that had been building up:

-- Find all the zombies
SELECT id, time
FROM information_schema.processlist
WHERE command = 'Sleep'
  AND time > 60
  AND user = 'app_user';

-- Kill them
KILL 119283;
KILL 119284;
-- ... (I wrote a script to generate these commands)

This cleared the backlog immediately. The API started responding normally again.

But I knew they'd come back. The leak was in the application code somewhere, and I didn't have time to audit thousands of lines of code right away.

The solution: A nightly cron job to automatically clean up zombies before they accumulate.

#!/bin/bash
# /usr/local/bin/kill_zombie_connections.sh

# Kill connections that have been sleeping for more than 5 minutes.
# -N suppresses the column-header row, so only "KILL <id>;" lines are
# piped into the second mysql call. (A login-path file or ~/.my.cnf is
# a better home for the password than the command line.)
mysql -N -h 127.0.0.1 -P 5231 -u admin -p'your_password' -e "
  SELECT CONCAT('KILL ', id, ';')
  FROM information_schema.processlist
  WHERE command = 'Sleep'
    AND time > 300
    AND user = 'app_user'
" | mysql -h 127.0.0.1 -P 5231 -u admin -p'your_password'

echo "$(date): Cleaned up zombie connections" >> /var/log/mysql_cleanup.log

Then added it to crontab to run every night at 3 AM:

crontab -e
# Add this line:
0 3 * * * /usr/local/bin/kill_zombie_connections.sh

This has been running every night for weeks now. It typically kills 10–20 connections per night—proof that the underlying leak is still somewhere in the code, but at least it's no longer accumulating into a crisis.

Layer 2: MySQL Safety Net

Added to /etc/mysql/mysql.conf.d/mysqld.cnf:

[mysqld]
wait_timeout = 120           # Kill idle connections after 2 minutes
interactive_timeout = 120

This is your "circuit breaker." Even if the app leaks connections, MySQL will eventually clean them up on its own. (SET GLOBAL wait_timeout = 120 applies the change to new connections without a restart; the config edit is what makes it permanent.)

sudo systemctl restart mysql

Layer 3: Fix the Application Config

// src/sequelize_init.ts - BEFORE vs AFTER

// BEFORE
const sequelize = new Sequelize(config, {
  pool: {
    max: 80,          // Huge pool hides leaks
    acquire: 600000,  // Wait 10 minutes for a connection (!?)
  },
  retry: {
    timeout: 60000,   // Retry timeout lower than acquire → weird errors
  },
});

// AFTER
const sequelize = new Sequelize(config, {
  pool: {
    max: 20,          // Smaller pool: leaks fail LOUDLY
    min: 2,
    acquire: 30000,   // Fail fast: 30 seconds, not 10 minutes
    idle: 10000,      // Return idle connections faster
    evict: 5000,      // Check for idle connections every 5s
  },

  dialectOptions: {
    connectTimeout: 10000,
  },

  retry: {
    match: [
      /ETIMEDOUT/,
      /ECONNRESET/,
      /SequelizeConnectionError/,
    ],
    max: 2,
    timeout: 35000,   // Must be HIGHER than pool.acquire
  },

  // Temporary: log slow queries to find the leak source
  // (benchmark: true is required, otherwise `timing` is always undefined)
  benchmark: true,
  logging: (sql, timing) => {
    if (timing && timing > 5000) {
      console.warn(`[SLOW ${timing}ms] ${sql.substring(0, 200)}`);
    }
  },
});

Why These Changes Work

Setting          Old        New       Why
pool.max         80         20        Smaller pool = leaks cause visible errors instead of silent death
pool.acquire     600s       30s       Fail fast; don't let users wait 10 minutes
retry.timeout    60s        35s       Must be higher than acquire, or you get "unknown timed out"
idle / evict     defaults   10s/5s    Aggressively reclaim leaked connections

The key insight: a large connection pool hides bugs. If you have 80 connections available, you won't notice when 40 of them leak. With 20, you'll hit the limit faster and see errors immediately.

Layer 4: Find the Actual Leak

I added monitoring to track pool usage in real time:

// Note: .pool isn't in Sequelize's public TypeScript types, hence the cast.
// Sequelize v6's pool (sequelize-pool) calls the in-use count .using;
// older generic-pool-based versions called it .borrowed.
setInterval(() => {
  const pool = (sequelize.connectionManager as any).pool;
  console.log(`Pool: ${pool.available} free, ${pool.using} in use, ${pool.size} total`);
}, 30000);

And a quick query I run occasionally to check for long-running transactions:

SELECT id, time, LEFT(info, 150) as query
FROM information_schema.processlist
WHERE time > 10
  AND user = 'app_user'
  AND command != 'Sleep'
ORDER BY time DESC;

The actual code fix—finding where connections leak and adding proper .release() calls—is still on the backlog. But with the cron job, MySQL timeouts, and tighter pool config, the system is stable enough that I can tackle it properly when the time comes.
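
For reference, the fix that audit will apply is mostly mechanical. Sequelize's managed transactions commit on success, roll back on throw, and return the connection either way; unmanaged ones need an explicit rollback path. A sketch, again with a stand-in Order model:

// Managed transaction: Sequelize handles commit, rollback, and release.
async function markOrderProcessed(orderId: number) {
  await sequelize.transaction(async (t) => {
    const order = await Order.findByPk(orderId, { transaction: t });
    await order!.update({ status: 'processed' }, { transaction: t });
  }); // connection returns to the pool here, success or failure
}

// Unmanaged transaction: the rollback path must be written by hand.
async function markOrderProcessedManually(orderId: number) {
  const t = await sequelize.transaction();
  try {
    const order = await Order.findByPk(orderId, { transaction: t });
    await order!.update({ status: 'processed' }, { transaction: t });
    await t.commit();
  } catch (err) {
    await t.rollback(); // without this line, the connection leaks
    throw err;
  }
}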


Current Status

It's been several weeks since I completed the setup, and everything is running fantastically:

✅ Read replica is working perfectly

  • Import completed successfully (274 tables)
  • Seconds_Behind_Source stays under 2 seconds even during peak traffic
  • All analytics and reporting queries route to the replica
  • Primary database handles only real-time application traffic

✅ Connection leak is fully contained

  • Cron job runs nightly, kills 10–20 zombie connections on average
  • Haven't seen a single TimeoutError: unknown timed out in production since the fix
  • Pool metrics stay healthy (typically 3–5 connections in use, 15–17 available)

✅ Performance is dramatically better

  • API response times back to 200–400ms average
  • Heavy reports no longer impact real-time users
  • Database handles traffic spikes without degradation

Next steps:

  • Code audit to find and permanently fix the actual leak
  • Set up Prometheus/Grafana to graph connection pool metrics over time (a starter sketch follows this list)
  • Consider increasing replica buffer pool if analytics queries grow heavier
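
For the Prometheus item, a starter sketch: it assumes the prom-client and express packages, and that src/sequelize_init.ts exports the sequelize instance; metric names and the port are arbitrary.

// Sketch: expose Sequelize pool gauges on /metrics for Prometheus.
import express from 'express';
import { Gauge, register } from 'prom-client';
import { sequelize } from './sequelize_init'; // assumed export

const poolInUse = new Gauge({ name: 'db_pool_in_use', help: 'Connections checked out' });
const poolFree = new Gauge({ name: 'db_pool_free', help: 'Connections available' });

setInterval(() => {
  const pool = (sequelize.connectionManager as any).pool;
  poolInUse.set(pool.using);
  poolFree.set(pool.available);
}, 15000);

const app = express();
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType);
  res.send(await register.metrics());
});
app.listen(9091);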

What I Learned

About MySQL Replication

  • Server IDs must be unique across all instances (primary, replicas, everything)
  • For local replication without SSL, use mysql_native_password instead of the default caching_sha2_password
  • nohup is a lifesaver for long-running operations over SSH
  • The binary log position is your "checkpoint"—get it with SHOW MASTER STATUS before you do anything
  • Always verify replication health with SHOW REPLICA STATUS\G

About Connection Pools

  • High max connections hide bugs. Start small (like 20), then scale up only if you actually need to.
  • Align your timeouts correctly: connect < acquire < retry < app timeout
  • SHOW PROCESSLIST is your best friend for debugging connection issues
  • Set wait_timeout in MySQL as a safety net—it'll clean up zombies even if your app doesn't
  • If you see TimeoutError: unknown timed out with previous: undefined, it's almost certainly pool exhaustion, not a slow query

About Debugging Production Issues

  1. When errors spike, run SHOW PROCESSLIST first—it shows you the current state, not just stale logs
  2. Kill zombie connections immediately during an incident; fix the root cause later
  3. Lower connection limits temporarily to make problems visible (counterintuitive, but it works)
  4. Fix the symptom first (timeouts, zombie cleanup), then hunt for the root cause at your own pace

Final Thoughts

I started this thinking, "The database is slow, I'll just add a replica."

What I actually found was a connection leak that had been silently living in the system for months, slowly filling the pool with dead connections. The replica setup gave me a reason to look closely at the primary's health—and that's when the real problem revealed itself.

What I ended up learning:

  • How binary log replication works under the hood
  • Why MySQL authentication plugins matter
  • The real cost of connection pool exhaustion
  • How timeout cascades mask the actual problem
  • That infrastructure work often surfaces the application bugs hiding underneath

The replica setup itself? Straightforward once you know the gotchas. But the most valuable part was what the investigation uncovered before I even started the setup.

The best part? The fix is simple and automated. A cron job, some timeout tweaks, and a tighter pool config. No massive code refactor, no downtime, no drama.

The main takeaway: When your database is slow, don't just assume it's the queries or the size. Run SHOW PROCESSLIST. You might be surprised at what's been sleeping there for months.

If you're dealing with mysterious timeout errors, here's your checklist:

  1. Run SHOW PROCESSLIST and count what's sleeping
  2. Set wait_timeout in MySQL as a safety net
  3. Tighten your connection pool config (smaller max, lower acquire timeout)
  4. Add a cron job to kill zombies nightly
  5. Monitor your pool usage metrics (a minimal watcher sketch follows this list)
  6. Fix the actual leak in the code (eventually, properly)
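
For items 1 and 5 on that checklist, a tiny watcher can complain before your users do. A sketch assuming the mysql2 package; credentials and thresholds are placeholders:

// Sketch: count sleeping connections every minute and warn when they
// pile up. Assumes mysql2; credentials are placeholders.
import mysql from 'mysql2/promise';

async function countSleepers(): Promise<number> {
  const conn = await mysql.createConnection({
    host: '127.0.0.1', port: 5231, user: 'admin', password: 'your_password',
  });
  try {
    const [rows] = await conn.query(
      "SELECT COUNT(*) AS n FROM information_schema.processlist WHERE command = 'Sleep' AND time > 60"
    );
    return (rows as Array<{ n: number }>)[0].n;
  } finally {
    await conn.end(); // don't let the watcher leak connections itself
  }
}

setInterval(() => {
  countSleepers()
    .then((n) => { if (n > 20) console.warn(`[zombie-watch] ${n} connections sleeping > 60s`); })
    .catch((err) => console.error('[zombie-watch] check failed:', err));
}, 60000);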

And if you're setting up a read replica: mysql_native_password for local replication, nohup for long imports, and always check SHOW REPLICA STATUS\G after you start.


Kaung Myat Han (Eric) is a full-stack developer based in Bangkok, working on high-traffic Node.js/MySQL systems. He writes about the messy, real-world problems that don't make it into the documentation.

Found this useful? Share it with someone who's debugging connection pool issues right now.