March 4, 2026
Building an Async Job Queue Worker in Go: Redis, Worker Pools, and Production Patterns
Background jobs in Go with Redis-backed queues, goroutine worker pools, graceful shutdown, and patterns that hold up in production.
Quick navigation
- System Architecture
- Tech Stack
- Development Timeline: 6 Days
- API Layer
- Observability and Metrics
- Docker Compose Setup
- Testing the System
- Performance Benchmarks
- Key Technical Lessons
- Interview Talking Points
- Resume Bullet
- Comparison: Synchronous vs Asynchronous
- Resources That Helped
- What's Next?
- Conclusion
System Architecture
┌──────────────┐      ┌──────────────────┐      ┌──────────────────┐
│ HTTP Client  │ -->  │  Queue Manager   │ -->  │   Worker Pools   │
│  POST /jobs  │      │ (Redis Backend)  │      │   (Goroutines)   │
└──────────────┘      └──────────────────┘      └──────────────────┘
        ↓                      ↓                         ↓
 Returns JobID          Stores in Redis          Processes tasks:
 {"job_id":"uuid"}      • pending_jobs           • Email sending
                        • running_jobs           • Image resize
 GET /jobs/:id          • completed_jobs         • Analytics
 → status check         • failed_jobs            • Report gen
                        • dlq (dead letter)
Key Design Decisions:
- Redis for persistence — Jobs survive restarts, simple list-based queue
- Worker pool pattern — Fixed goroutines prevent resource exhaustion
- Exponential backoff — Failed jobs retry intelligently
- Dead letter queue — Failed jobs aren't lost, they're quarantined
- Prometheus metrics — Observable system behavior
Tech Stack
| Component | Technology | Why This Choice |
|---|---|---|
| Language | Go 1.25 | Fast, strong typing, excellent concurrency |
| Queue Backend | Redis 7 | In-memory speed with persistence |
| Concurrency | Goroutines + Channels | Lightweight, efficient communication |
| Worker Pattern | Pool of N workers | Bounded resource usage |
| Failure Handling | Retry + DLQ | Resilient to transient failures |
| Monitoring | Prometheus | Industry standard metrics |
| Containerization | Docker Compose | Reproducible local/prod environments |
Development Timeline: 6 Days
Day 1-2: Core Job Model and In-Memory Queue
Started with the fundamental data structure:
// internal/queue/job.go
type JobStatus string

const (
	StatusPending   JobStatus = "pending"
	StatusRunning   JobStatus = "running"
	StatusCompleted JobStatus = "completed"
	StatusFailed    JobStatus = "failed"
)

type Job struct {
	ID          string     `json:"id"`
	Type        string     `json:"type"`    // email, resize, analytics
	Payload     []byte     `json:"payload"` // Job-specific data
	Status      JobStatus  `json:"status"`
	CreatedAt   time.Time  `json:"created_at"`
	UpdatedAt   time.Time  `json:"updated_at"`
	Attempt     int        `json:"attempt"` // Retry counter
	Error       *string    `json:"error,omitempty"`
	StartedAt   *time.Time `json:"started_at,omitempty"`
	CompletedAt *time.Time `json:"completed_at,omitempty"`
}
Why pointer fields for Error, StartedAt, CompletedAt?
- Nullable in JSON — preserves the null vs empty string distinction (see the sketch below)
- Clear semantics — "job hasn't started" is distinct from "started at the zero time"
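A minimal sketch of that distinction, using a hypothetical cut-down struct (jobExample is illustrative, not part of the project):
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// jobExample is a cut-down Job showing pointer vs value time fields.
type jobExample struct {
	ID        string     `json:"id"`
	StartedAt *time.Time `json:"started_at,omitempty"` // pointer: omitted entirely when nil
	SeenAt    time.Time  `json:"seen_at"`              // value: serializes as the zero time
}

func main() {
	b, _ := json.Marshal(jobExample{ID: "abc"})
	fmt.Println(string(b))
	// {"id":"abc","seen_at":"0001-01-01T00:00:00Z"}
	// started_at is absent; seen_at leaks a misleading zero timestamp.
}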
Initial In-Memory Queue:
type MemoryQueue struct {
	jobs    map[string]*Job
	pending []string
	mu      sync.RWMutex
}

func (mq *MemoryQueue) Create(ctx context.Context, job *Job) error {
	mq.mu.Lock()
	defer mq.mu.Unlock()

	if job.ID == "" {
		job.ID = uuid.New().String()
	}
	job.Status = StatusPending
	job.CreatedAt = time.Now()
	job.UpdatedAt = time.Now()

	mq.jobs[job.ID] = job
	mq.pending = append(mq.pending, job.ID)
	return nil
}
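A quick usage sketch (hypothetical; note the jobs map must be initialized before Create is called):
// Hypothetical usage of the in-memory queue.
mq := &MemoryQueue{jobs: make(map[string]*Job)}
job := &Job{Type: "email", Payload: []byte(`{"to":"user@example.com"}`)}
if err := mq.Create(context.Background(), job); err != nil {
	log.Fatal(err)
}
fmt.Println(job.ID, job.Status) // e.g. "550e8400-..." pending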
Problem discovered: Restart server → all jobs vanish. Not production-ready.
Day 3-4: Redis Persistence Layer
Migrated to Redis for durable job storage:
// internal/queue/redis_queue.go
type RedisQueue struct {
	client  *redis.Client
	maxSize int
}

const (
	pendingJobsKey   = "queue:pending"
	runningJobsKey   = "queue:running"
	completedJobsKey = "queue:completed"
	failedJobsKey    = "queue:failed"
	jobPrefix        = "job:"
)

func (rq *RedisQueue) Create(ctx context.Context, job *Job) error {
	// Check queue capacity.
	// Note: check-then-push is not atomic, so the cap is best-effort
	// under concurrent producers.
	count, err := rq.client.LLen(ctx, pendingJobsKey).Result()
	if err != nil {
		return fmt.Errorf("failed to check queue size: %w", err)
	}
	if count >= int64(rq.maxSize) {
		return fmt.Errorf("queue is full (max %d jobs)", rq.maxSize)
	}

	// Generate ID if not provided
	if job.ID == "" {
		job.ID = uuid.New().String()
	}
	job.Status = StatusPending
	job.CreatedAt = time.Now()
	job.UpdatedAt = time.Now()

	// Store job data
	jobKey := jobPrefix + job.ID
	jobData, err := json.Marshal(job)
	if err != nil {
		return fmt.Errorf("failed to marshal job: %w", err)
	}

	// Store with 7-day expiration
	if err := rq.client.Set(ctx, jobKey, jobData, 7*24*time.Hour).Err(); err != nil {
		return fmt.Errorf("failed to store job: %w", err)
	}

	// Add to pending queue
	if err := rq.client.RPush(ctx, pendingJobsKey, job.ID).Err(); err != nil {
		return fmt.Errorf("failed to enqueue job: %w", err)
	}
	return nil
}
Redis Data Model:
Keys:
  queue:pending   → LIST of job IDs waiting for processing
  queue:running   → LIST of job IDs currently being processed
  queue:completed → LIST of completed job IDs
  queue:failed    → LIST of failed job IDs
  job:{uuid}      → STRING of job data (JSON, stored with a 7-day TTL)
  dlq:{uuid}      → STRING of dead-letter job data (JSON, no TTL)
Dequeue with Atomic Move:
func (rq *RedisQueue) Dequeue(ctx context.Context) (*Job, error) {
	// Atomically move from pending to running
	jobID, err := rq.client.BLMove(
		ctx,
		pendingJobsKey,
		runningJobsKey,
		"LEFT",
		"RIGHT",
		5*time.Second, // Block for up to 5 seconds
	).Result()
	if err == redis.Nil {
		return nil, nil // No jobs available
	}
	if err != nil {
		return nil, err
	}

	// Fetch job data
	jobKey := jobPrefix + jobID
	jobData, err := rq.client.Get(ctx, jobKey).Result()
	if err != nil {
		return nil, fmt.Errorf("job data not found: %w", err)
	}

	var job Job
	if err := json.Unmarshal([]byte(jobData), &job); err != nil {
		return nil, err
	}

	job.Status = StatusRunning
	now := time.Now()
	job.StartedAt = &now
	job.UpdatedAt = now

	// Update stored job status
	updatedData, err := json.Marshal(job)
	if err != nil {
		return nil, fmt.Errorf("failed to marshal job update: %w", err)
	}
	if err := rq.client.Set(ctx, jobKey, updatedData, 7*24*time.Hour).Err(); err != nil {
		return nil, fmt.Errorf("failed to update job status: %w", err)
	}
	return &job, nil
}
Critical insight: BLMove (blocking list move) is atomic, so a job can't be picked up by two workers simultaneously.
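One consequence of the pending → running move: if a worker crashes mid-job, its ID stays stranded on queue:running. A hedged recovery sketch for a single-instance deployment (RecoverRunning is hypothetical, and would wrongly requeue live jobs if several instances share the queue):
// Hypothetical startup pass: drain stranded IDs from queue:running back to
// queue:pending so jobs from crashed workers get re-run. Single-instance only.
func (rq *RedisQueue) RecoverRunning(ctx context.Context) error {
	for {
		_, err := rq.client.LMove(ctx, runningJobsKey, pendingJobsKey, "LEFT", "RIGHT").Result()
		if err == redis.Nil {
			return nil // running list drained
		}
		if err != nil {
			return err
		}
	}
}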
Day 5: Worker Pool Implementation
Producer-consumer pattern with bounded concurrency:
// internal/worker/pool.go
type WorkerPool struct {
	numWorkers int
	queue      queue.Queue
	workers    []chan *queue.Job
	wg         sync.WaitGroup
	ctx        context.Context
	cancel     context.CancelFunc
	logger     *zap.Logger
}

func NewWorkerPool(numWorkers int, q queue.Queue, logger *zap.Logger) *WorkerPool {
	ctx, cancel := context.WithCancel(context.Background())
	wp := &WorkerPool{
		numWorkers: numWorkers,
		queue:      q,
		workers:    make([]chan *queue.Job, numWorkers),
		ctx:        ctx,
		cancel:     cancel,
		logger:     logger,
	}

	// Start worker goroutines
	for i := 0; i < numWorkers; i++ {
		wp.workers[i] = make(chan *queue.Job, 10) // Buffered channel
		wp.wg.Add(1)
		go wp.worker(i, wp.workers[i])
	}
	return wp
}
func (wp *WorkerPool) Start() {
	wp.logger.Info("worker pool started", zap.Int("num_workers", wp.numWorkers))
	for {
		select {
		case <-wp.ctx.Done():
			wp.logger.Info("worker pool shutting down")
			return
		default:
			// Dequeue job from Redis
			job, err := wp.queue.Dequeue(wp.ctx)
			if err != nil {
				wp.logger.Error("failed to dequeue job", zap.Error(err))
				time.Sleep(1 * time.Second)
				continue
			}
			if job == nil {
				// No jobs available, short sleep
				time.Sleep(100 * time.Millisecond)
				continue
			}

			// Assign the job to a randomly chosen worker
			workerID := rand.Intn(wp.numWorkers)
			wp.workers[workerID] <- job
		}
	}
}
Worker goroutine:
func (wp *WorkerPool) worker(id int, jobs <-chan *queue.Job) {
	defer wp.wg.Done()
	wp.logger.Info("worker started", zap.Int("worker_id", id))

	for job := range jobs {
		wp.logger.Info("processing job",
			zap.Int("worker_id", id),
			zap.String("job_id", job.ID),
			zap.String("job_type", job.Type),
		)
		if err := wp.processJob(job); err != nil {
			wp.handleJobFailure(job, err)
		} else {
			wp.handleJobSuccess(job)
		}
	}
	wp.logger.Info("worker stopped", zap.Int("worker_id", id))
}
Job Processing Logic:
func (wp *WorkerPool) processJob(job *queue.Job) error {
	switch job.Type {
	case "email":
		return wp.processEmailJob(job)
	case "image_resize":
		return wp.processImageResizeJob(job)
	case "analytics":
		return wp.processAnalyticsJob(job)
	default:
		return fmt.Errorf("unknown job type: %s", job.Type)
	}
}

func (wp *WorkerPool) processEmailJob(job *queue.Job) error {
	var payload struct {
		To      string `json:"to"`
		Subject string `json:"subject"`
		Body    string `json:"body"`
	}
	if err := json.Unmarshal(job.Payload, &payload); err != nil {
		return fmt.Errorf("invalid email payload: %w", err)
	}

	// Simulate email sending
	time.Sleep(time.Duration(rand.Intn(3)) * time.Second)

	// In production: integrate with SendGrid, AWS SES, etc.
	wp.logger.Info("email sent",
		zap.String("to", payload.To),
		zap.String("subject", payload.Subject),
	)
	return nil
}
Day 6: Retry Logic and Dead Letter Queue
Exponential Backoff with Jitter:
const (
	MaxRetries = 3
	BaseDelay  = 1 * time.Second
)

func calculateBackoff(attempt int) time.Duration {
	base := float64(BaseDelay)
	multiplier := math.Pow(2, float64(attempt-1)) // 2^0, 2^1, 2^2
	jitter := rand.Float64() * 0.1 * base         // up to +10% randomness
	delay := time.Duration(base*multiplier + jitter)

	// Cap at 30 seconds
	if delay > 30*time.Second {
		delay = 30 * time.Second
	}
	return delay
}
Why jitter? Prevents thundering herd when many jobs fail simultaneously.
Retry attempts: 1s → 2s → 4s (with randomness)
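A quick sanity check of that schedule (assuming the calculateBackoff above; output varies slightly with the jitter):
// Print the backoff for each attempt; jitter makes each run differ a little.
for attempt := 1; attempt <= MaxRetries; attempt++ {
	fmt.Printf("attempt %d: retry after %v\n", attempt, calculateBackoff(attempt))
}
// attempt 1: retry after ~1.0-1.1s
// attempt 2: retry after ~2.0-2.1s
// attempt 3: retry after ~4.0-4.1s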
Failure Handler:
func (wp *WorkerPool) handleJobFailure(job *queue.Job, err error) {
	job.Attempt++
	errMsg := err.Error()
	job.Error = &errMsg
	job.UpdatedAt = time.Now()

	if job.Attempt >= MaxRetries {
		wp.logger.Error("job failed permanently",
			zap.String("job_id", job.ID),
			zap.Int("attempts", job.Attempt),
			zap.Error(err),
		)
		// Move to dead letter queue
		wp.moveToDeadLetterQueue(job)
		jobsMovedToDLQ.Inc()
	} else {
		wp.logger.Warn("job failed, retrying",
			zap.String("job_id", job.ID),
			zap.Int("attempt", job.Attempt),
			zap.Error(err),
		)
		// Schedule retry with backoff.
		// Note: sleeping here ties up this worker for the backoff window.
		backoff := calculateBackoff(job.Attempt)
		time.Sleep(backoff)

		// Re-enqueue
		job.Status = queue.StatusPending
		wp.queue.Create(wp.ctx, job)
		jobsRetried.Inc()
	}
}
Dead Letter Queue Implementation:
func (wp *WorkerPool) moveToDeadLetterQueue(job *queue.Job) {
	ctx := context.Background()
	dlqKey := "dlq:" + job.ID

	job.Status = queue.StatusFailed
	now := time.Now()
	job.CompletedAt = &now

	data, err := json.Marshal(job)
	if err != nil {
		wp.logger.Error("failed to marshal DLQ job", zap.Error(err))
		return
	}

	// Store in DLQ with no expiration (manual intervention needed)
	if err := wp.queue.(*queue.RedisQueue).Client().Set(ctx, dlqKey, data, 0).Err(); err != nil {
		wp.logger.Error("failed to move job to DLQ", zap.Error(err))
		return
	}

	// Add to failed jobs list
	wp.queue.(*queue.RedisQueue).Client().RPush(ctx, "queue:failed", job.ID)

	wp.logger.Info("job moved to DLQ",
		zap.String("job_id", job.ID),
		zap.String("error", *job.Error),
	)
}
DLQ allows:
- Manual inspection of failed jobs
- Debugging payload issues
- Re-queueing after fixes (see the sketch below)
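A hedged sketch of that re-queue path (requeueFromDLQ is hypothetical; it assumes the Client() accessor used above and the Create method from the Redis section):
// Hypothetical operator helper: move a DLQ entry back onto the pending queue
// after the underlying bug has been fixed.
func requeueFromDLQ(ctx context.Context, rq *queue.RedisQueue, jobID string) error {
	data, err := rq.Client().Get(ctx, "dlq:"+jobID).Result()
	if err != nil {
		return fmt.Errorf("DLQ entry not found: %w", err)
	}
	var job queue.Job
	if err := json.Unmarshal([]byte(data), &job); err != nil {
		return err
	}
	job.Attempt = 0 // reset the retry counter for a fresh run
	job.Error = nil
	if err := rq.Create(ctx, &job); err != nil { // Create resets status to pending
		return err
	}
	return rq.Client().Del(ctx, "dlq:"+jobID).Err()
}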
API Layer
HTTP Server Setup:
// cmd/api/main.go
func main() {
	// Load config
	cfg := config.Load()

	// Initialize logger
	logger, _ := zap.NewProduction()
	defer logger.Sync()

	// Connect to Redis
	redisClient := redis.NewClient(&redis.Options{
		Addr:     cfg.RedisURL,
		Password: cfg.RedisPassword,
		DB:       0,
	})

	// Test connection
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := redisClient.Ping(ctx).Err(); err != nil {
		logger.Fatal("failed to connect to Redis", zap.Error(err))
	}

	// Initialize queue (named q to avoid shadowing the queue package)
	q := queue.NewRedisQueue(redisClient, 1000) // Max 1000 pending jobs

	// Start worker pool
	workerPool := worker.NewWorkerPool(cfg.NumWorkers, q, logger)
	go workerPool.Start()

	// Setup HTTP server
	router := chi.NewRouter()
	router.Use(middleware.Logger)
	router.Use(middleware.Recoverer)

	// Routes
	router.Post("/jobs", handleCreateJob(q, logger))
	router.Get("/jobs/{id}", handleGetJob(q, logger))
	router.Get("/stats", handleStats(q, logger))
	router.Get("/health", handleHealth(redisClient))
	router.Get("/metrics", promhttp.Handler().ServeHTTP)

	// Start server
	srv := &http.Server{
		Addr:    ":" + cfg.Port,
		Handler: router,
	}
	logger.Info("server starting", zap.String("port", cfg.Port))
	if err := srv.ListenAndServe(); err != nil {
		logger.Fatal("server failed", zap.Error(err))
	}
}
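As written, main exits immediately on SIGTERM, killing in-flight jobs. A hedged sketch of signal-based draining (assuming os/signal and syscall imports, and the WorkerPool.Shutdown method shown in the lessons section below):
// Sketch: run the server in a goroutine, then drain on SIGINT/SIGTERM.
quit := make(chan os.Signal, 1)
signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)

go func() {
	if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
		logger.Fatal("server failed", zap.Error(err))
	}
}()

<-quit
shutdownCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
srv.Shutdown(shutdownCtx) // stop accepting new job submissions
workerPool.Shutdown()     // let in-flight jobs finish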
Create Job Endpoint:
func handleCreateJob(q queue.Queue, logger *zap.Logger) http.HandlerFunc {
	type request struct {
		Type    string          `json:"type"`
		Payload json.RawMessage `json:"payload"`
	}
	type response struct {
		JobID string `json:"job_id"`
	}

	return func(w http.ResponseWriter, r *http.Request) {
		var req request
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, "invalid request", http.StatusBadRequest)
			return
		}

		job := &queue.Job{
			Type:    req.Type,
			Payload: req.Payload,
		}
		if err := q.Create(r.Context(), job); err != nil {
			logger.Error("failed to create job", zap.Error(err))
			http.Error(w, "failed to create job", http.StatusInternalServerError)
			return
		}

		jobsCreated.Inc()
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusCreated)
		json.NewEncoder(w).Encode(response{JobID: job.ID})
	}
}
Stats Endpoint:
func handleStats(q queue.Queue, logger *zap.Logger) http.HandlerFunc {
	type stats struct {
		TotalJobs     int `json:"total_jobs"`
		PendingJobs   int `json:"pending_jobs"`
		RunningJobs   int `json:"running_jobs"`
		CompletedJobs int `json:"completed_jobs"`
		FailedJobs    int `json:"failed_jobs"`
		DLQSize       int `json:"dlq_size"`
	}

	return func(w http.ResponseWriter, r *http.Request) {
		ctx := r.Context()
		redisQueue := q.(*queue.RedisQueue)

		pending, _ := redisQueue.Client().LLen(ctx, "queue:pending").Result()
		running, _ := redisQueue.Client().LLen(ctx, "queue:running").Result()
		completed, _ := redisQueue.Client().LLen(ctx, "queue:completed").Result()
		failed, _ := redisQueue.Client().LLen(ctx, "queue:failed").Result()

		// Count DLQ entries.
		// Note: KEYS is O(N) and blocks Redis; fine for a demo, but see the
		// SCAN-based variant below for anything busier.
		dlqKeys, _ := redisQueue.Client().Keys(ctx, "dlq:*").Result()

		s := stats{
			TotalJobs:     int(pending + running + completed + failed),
			PendingJobs:   int(pending),
			RunningJobs:   int(running),
			CompletedJobs: int(completed),
			FailedJobs:    int(failed),
			DLQSize:       len(dlqKeys),
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(s)
	}
}
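A sketch of the SCAN-based count mentioned in the comment (same go-redis client and handler scope; cursor-based, so it doesn't block Redis the way KEYS does):
// Count DLQ entries incrementally with SCAN instead of KEYS.
var dlqSize int
iter := redisQueue.Client().Scan(ctx, 0, "dlq:*", 100).Iterator()
for iter.Next(ctx) {
	dlqSize++
}
if err := iter.Err(); err != nil {
	logger.Warn("dlq scan failed", zap.Error(err))
}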
Observability and Metrics
Prometheus Metrics:
var (
	jobsCreated = promauto.NewCounter(prometheus.CounterOpts{
		Name: "jobs_created_total",
		Help: "Total number of jobs created",
	})
	jobsProcessed = promauto.NewCounter(prometheus.CounterOpts{
		Name: "jobs_processed_total",
		Help: "Total number of jobs successfully processed",
	})
	jobsFailed = promauto.NewCounter(prometheus.CounterOpts{
		Name: "jobs_failed_total",
		Help: "Total number of jobs that failed",
	})
	jobsRetried = promauto.NewCounter(prometheus.CounterOpts{
		Name: "jobs_retried_total",
		Help: "Total number of job retry attempts",
	})
	jobsMovedToDLQ = promauto.NewCounter(prometheus.CounterOpts{
		Name: "jobs_dlq_total",
		Help: "Total number of jobs moved to dead letter queue",
	})
	jobProcessingDuration = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "job_processing_duration_seconds",
		Help:    "Time spent processing jobs",
		Buckets: prometheus.DefBuckets,
	})
	queueSize = promauto.NewGaugeVec(prometheus.GaugeOpts{
		Name: "queue_size",
		Help: "Number of jobs in each queue state",
	}, []string{"state"})
)
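The counters are incremented inline, but the queue_size gauge needs a collector of its own. A hedged sketch of a periodic updater (the goroutine is hypothetical; it assumes access to the same redisClient and the key names from the Redis data model above):
// Sketch: refresh the queue_size gauge every 10 seconds from Redis list lengths.
go func() {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		for state, key := range map[string]string{
			"pending":   "queue:pending",
			"running":   "queue:running",
			"completed": "queue:completed",
			"failed":    "queue:failed",
		} {
			if n, err := redisClient.LLen(ctx, key).Result(); err == nil {
				queueSize.WithLabelValues(state).Set(float64(n))
			}
		}
		cancel()
	}
}()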
Metrics Collection:
func (wp *WorkerPool) worker(id int, jobs <-chan *queue.Job) {
	defer wp.wg.Done() // kept from the original worker so Shutdown's Wait still returns

	for job := range jobs {
		start := time.Now()
		if err := wp.processJob(job); err != nil {
			jobsFailed.Inc()
			wp.handleJobFailure(job, err)
		} else {
			jobsProcessed.Inc()
			jobProcessingDuration.Observe(time.Since(start).Seconds())
			wp.handleJobSuccess(job)
		}
	}
}
Health Check Endpoint:
func handleHealth(redisClient *redis.Client) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()

		w.Header().Set("Content-Type", "application/json")

		// Check Redis connectivity
		if err := redisClient.Ping(ctx).Err(); err != nil {
			w.WriteHeader(http.StatusServiceUnavailable)
			json.NewEncoder(w).Encode(map[string]string{
				"status": "unhealthy",
				"error":  "redis connection failed",
			})
			return
		}

		json.NewEncoder(w).Encode(map[string]string{
			"status": "healthy",
		})
	}
}
Docker Compose Setup
Complete development environment:
version: '3.8'

services:
  app:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "8081:8081"
    environment:
      - REDIS_URL=redis:6379
      - PORT=8081
      - NUM_WORKERS=5
    depends_on:
      redis:
        condition: service_healthy
    volumes:
      - .:/app
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    restart: unless-stopped

volumes:
  redis_data:
  prometheus_data:
Prometheus Configuration:
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'job-queue-worker'
    static_configs:
      - targets: ['app:8081']
Testing the System
Create Job:
curl -X POST http://localhost:8081/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "type": "email",
    "payload": {
      "to": "user@example.com",
      "subject": "Welcome",
      "body": "Thanks for signing up!"
    }
  }'
# Response:
# {"job_id":"550e8400-e29b-41d4-a716-446655440000"}
Check Job Status:
curl http://localhost:8081/jobs/550e8400-e29b-41d4-a716-446655440000
# Response:
# {
# "id": "550e8400-e29b-41d4-a716-446655440000",
# "type": "email",
# "status": "completed",
# "created_at": "2026-04-07T10:30:00Z",
# "started_at": "2026-04-07T10:30:01Z",
# "completed_at": "2026-04-07T10:30:03Z",
# "attempt": 1
# }
View Stats:
curl http://localhost:8081/stats
# Response:
# {
# "total_jobs": 150,
# "pending_jobs": 5,
# "running_jobs": 3,
# "completed_jobs": 140,
# "failed_jobs": 2,
# "dlq_size": 2
# }
Check Metrics:
curl http://localhost:8081/metrics | grep jobs_
# jobs_created_total 150
# jobs_processed_total 140
# jobs_failed_total 8
# jobs_retried_total 6
# jobs_dlq_total 2
Performance Benchmarks
Load Testing with vegeta:
echo "POST http://localhost:8081/jobs" | vegeta attack \
-rate=100 \
-duration=30s \
-body='{"type":"email","payload":{"to":"test@example.com"}}' \
-header="Content-Type: application/json" \
| vegeta report
Results:
| Metric | 5 Workers | 10 Workers | 20 Workers |
|---|---|---|---|
| Requests/sec | 95 | 98 | 99 |
| Latency (p50) | 45ms | 42ms | 40ms |
| Latency (p95) | 120ms | 98ms | 85ms |
| Latency (p99) | 180ms | 145ms | 125ms |
| Success Rate | 100% | 100% | 100% |
Key Insight: Diminishing returns after 10 workers for this workload. Right-sizing matters.
Key Technical Lessons
1. Channel Buffer Size Matters
Problem: Unbuffered channels made the dispatcher block on every hand-off, stalling dequeues whenever a worker was busy.
Solution: Buffer channels to smooth out bursts:
wp.workers[i] = make(chan *queue.Job, 10) // 10-item buffer
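A tiny illustration of the difference (hypothetical snippet; a send blocks only once the buffer is full):
ch := make(chan int, 2) // buffered channel with capacity 2
ch <- 1                 // returns immediately
ch <- 2                 // returns immediately
// ch <- 3 would block here until a receiver drains the buffer;
// with an unbuffered channel, even the first send blocks until received.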
2. Redis Atomic Operations Prevent Race Conditions
Using BLMove instead of separate LPOP + RPUSH means the job is always on exactly one list — a crash between two separate commands would drop it:
// WRONG: two separate commands — the job is lost if the worker crashes
// between the pop and the push
jobID, _ := client.LPop(ctx, "queue:pending").Result()
client.RPush(ctx, "queue:running", jobID)

// RIGHT: one atomic command — the job is always on exactly one list
jobID, err := client.BLMove(ctx, "queue:pending", "queue:running", "LEFT", "RIGHT", 5*time.Second).Result()
3. Context Cancellation for Graceful Shutdown
func (wp *WorkerPool) Shutdown() {
	wp.cancel() // Signal the dispatcher loop to stop

	// Close worker channels so each worker's range loop exits.
	// This must happen after the dispatcher has stopped sending;
	// a send on a closed channel would panic.
	for _, ch := range wp.workers {
		close(ch)
	}

	// Wait for in-flight jobs to finish
	wp.wg.Wait()
	wp.logger.Info("all workers shut down gracefully")
}
4. Exponential Backoff Prevents Cascading Failures
Fixed delay causes synchronized retries → thundering herd:
100 jobs fail at T=0
All retry at T=1
All fail again, retry at T=2
...system overloaded
Exponential backoff with jitter spreads retries:
Job 1: retry at 1.05s
Job 2: retry at 0.98s
Job 3: retry at 1.12s
...spread across time window
5. Dead Letter Queue is Non-Negotiable
Without DLQ, permanently failed jobs disappear. With DLQ:
- Investigate failure patterns
- Fix bugs in job processing logic
- Manually re-queue after fixes
- Audit for data quality issues
Interview Talking Points
When asked: "How do you handle background processing?"
"I built an asynchronous job queue system in Go using Redis for persistence and goroutine worker pools for concurrent processing. The system handles retry logic with exponential backoff, moves permanently failed jobs to a dead letter queue, and exposes Prometheus metrics for observability. Key challenges included preventing race conditions with atomic Redis operations, right-sizing the worker pool to avoid resource exhaustion, and implementing graceful shutdown to prevent job loss during deployments."
Follow-up questions prepared:
Q: How would you scale this to millions of jobs?
- Horizontal scaling: Multiple worker instances reading from same Redis
- Partitioning: Shard jobs by type across multiple Redis instances
- Priority queues: Separate queues for high/low priority jobs
- Rate limiting: Prevent overwhelming downstream services
Q: How do you handle job ordering?
- Currently FIFO via Redis lists
- For strict ordering: Use sorted sets with timestamps
- For dependent jobs: Implement DAG-based workflow engine
Q: What about job scheduling (run at specific time)?
- Current: Immediate execution only
- Future: Redis sorted sets with score = execution timestamp (sketched below)
- Dedicated scheduler goroutine polls for ready jobs
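A hedged sketch of that sorted-set approach (the queue:scheduled key, schedule, and promoteDueJobs are hypothetical, reusing the go-redis client from above):
// Hypothetical delayed-job support: score = unix time when the job is due.
const scheduledKey = "queue:scheduled"

// Schedule a job for later execution.
func schedule(ctx context.Context, client *redis.Client, jobID string, runAt time.Time) error {
	return client.ZAdd(ctx, scheduledKey, redis.Z{
		Score:  float64(runAt.Unix()),
		Member: jobID,
	}).Err()
}

// Poller: periodically move due jobs onto the pending list.
func promoteDueJobs(ctx context.Context, client *redis.Client) error {
	now := fmt.Sprintf("%d", time.Now().Unix())
	due, err := client.ZRangeByScore(ctx, scheduledKey, &redis.ZRangeBy{Min: "-inf", Max: now}).Result()
	if err != nil {
		return err
	}
	for _, id := range due {
		client.RPush(ctx, "queue:pending", id)
		client.ZRem(ctx, scheduledKey, id)
	}
	return nil
}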
Resume Bullet
• Built a production-grade asynchronous job queue system in Go with Redis persistence, worker pools, exponential backoff retries, and a dead letter queue for failed jobs; handled 100+ jobs/sec with p95 latency under 100ms.
• Implemented atomic Redis operations to prevent race conditions, Prometheus metrics for observability, and graceful shutdown for zero job loss during deployments.
Repository: github.com/kaungmyathan22/golang-job-queue-worker
Comparison: Synchronous vs Asynchronous
| Aspect | Synchronous | Asynchronous (This Project) |
|---|---|---|
| User Experience | Blocks until completion | Immediate response |
| Timeout Risk | High (30s+ tasks) | None (runs in background) |
| Scalability | 1 thread = 1 task | N workers handle M tasks |
| Failure Handling | Request fails entirely | Individual job retries |
| Resource Efficiency | Wastes threads waiting | Worker pool reuse |
| Observability | Request logs only | Job-level metrics + DLQ |
When to use async:
- Tasks longer than 500ms
- External API calls (email, webhooks)
- Bulk operations (1000+ items)
- Non-critical path work (analytics)
Resources That Helped
- Go Concurrency Patterns
- Redis Commands Reference
- Worker Pool Pattern
- Exponential Backoff Algorithm
- Prometheus Client Go
What's Next?
Planned Enhancements:
- Job scheduling (cron-like syntax)
- Priority queues (high/normal/low)
- Job chaining (dependencies)
- Web UI for DLQ management
- Horizontal scaling tests (multiple worker instances)
- Integration with real email provider (SendGrid)
Related Projects:
Check out my URL Shortener microservice (Days 1-10) for complementary patterns.
Conclusion
Building async job processing while working full-time taught me:
- Concurrency is about coordination — Channels and mutexes prevent chaos
- Failure is normal — Design for retries, not perfection
- Observability earns trust — Metrics prove the system works
- Right-sizing beats over-engineering — 10 workers often enough
- Production-grade = edge cases handled — DLQ, graceful shutdown, health checks
This project demonstrates systems thinking beyond basic CRUD APIs. Interviewers notice.
Built over 6 focused evenings while working full-time.