Deploying a Go service to AWS ECS: a practical walkthrough
Infrastructure, CI/CD with OIDC, ECR lifecycle quirks, and what actually drives rolling-deploy speed on ECS with an ALB — including health checks, ENI limits, and deregistration delay.
Quick navigation
- The stack at a glance
- The Go application
- Infrastructure architecture
- The CI/CD pipeline
- ECR image management
- Understanding rolling deployments
- Autoscaling
- Observability
- Practical lessons
- What's next
I recently built and deployed a production-ready Go backend to AWS ECS using the EC2 launch type, Terraform, and GitHub Actions. This article walks through the whole setup — the infrastructure, the CI/CD pipeline, the deployment behavior, and the things I learned the hard way by watching health check logs scroll by in real time.
The stack at a glance
- Language: Go (chi, zap, viper, go-playground/validator)
- Container registry: Amazon ECR
- Orchestration: Amazon ECS (EC2 launch type, not Fargate)
- Load balancer: Application Load Balancer (ALB)
- Infrastructure: Terraform, modularized
- CI/CD: GitHub Actions with OIDC auth (no long-lived AWS keys)
- Networking: VPC with public/private subnets, NAT gateway, awsvpc task networking
Repository: github.com/kaungmyathan18/golang-ecs-deployment
The Go application
Most of this walkthrough leans on Terraform, because that is where ECS networking, capacity, and ALB behavior are configured. The runnable service in the repo is a small REST API (user CRUD + health) with a layout that keeps transport code separate from domain logic:
cmd/server/
main.go — entrypoint, graceful shutdown, chi router + HTTP server timeouts
internal/
apiresponse/ — success JSON envelope + RFC 7807 problem+json helpers
config/ — viper + env (optional local `.env` via InitEnv)
handler/ — HTTP handlers (`api.go`, `health.go`)
observability/ — zap logging, metrics hook, OTel scaffold
repository/ — in-memory user store (easy to swap for Postgres later)
service/ — use-cases over the repository
validation/ — go-playground/validator → 422 problem details
The Dockerfile is a multi-stage build: compile with golang:1.26-alpine, ship a static binary on alpine:3.21, listen on 8080 (matching container_port in Terraform).
# Dockerfile (abbreviated)
FROM golang:1.26-alpine AS build
WORKDIR /src
COPY go.mod go.sum* ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o /out/server ./cmd/server
FROM alpine:3.21
WORKDIR /app
RUN apk add --no-cache ca-certificates
COPY --from=build /out/server ./server
EXPOSE 8080
ENTRYPOINT ["./server"]
cmd/server/main.go attaches chi middleware (RealIP, recoverer, RequestID, optional CORS, structured request logging) and registers two groups: /health for the load balancer and /api/v1 for the app.
// cmd/server/main.go — setupRouter excerpt
func setupRouter(
cfg *config.Config,
logger *zap.Logger,
metrics *observability.Metrics,
health *handler.HealthHandler,
api *handler.APIHandler,
) *chi.Mux {
r := chi.NewRouter()
r.Use(middleware.RealIP)
r.Use(middleware.Recoverer)
r.Use(middleware.RequestID)
// cors + LoggingMiddleware + metrics ...
r.Route("/health", func(sr chi.Router) {
sr.Get("/", health.Live) // ALB default path /health
sr.Get("/live", health.Live)
sr.Get("/ready", health.Ready)
})
r.Route("/api/v1", func(ar chi.Router) {
ar.Use(middleware.Timeout(60 * time.Second))
ar.Post("/users", api.CreateUser)
ar.Get("/users/{id}", api.GetUser)
ar.Get("/users", api.ListUsers)
})
return r
}
/health/live returns plain ok for a simple liveness probe. /health/ready returns JSON (including a version string) so you can poll during deploys and see mixed versions when the ALB has more than one healthy target — that is what produced the alternating 1.1.0 / 1.2.0 lines later in this post (bump the version in code when you cut a release).
Infrastructure architecture
In the repo, Terraform lives under terraform/ and is split into focused child modules under terraform/modules/:
terraform/
bootstrap/ — S3 bucket + DynamoDB table for remote state / locking
config/ — backend *.hcl, per-env *.tfvars
modules/
network/ — VPC, subnets, IGW, NAT, route tables
security_groups/ — ALB SG, ECS task SG, ECS instance SG
ecr/ — ECR repository with lifecycle policy
iam/ — ECS execution role, task role, instance profile
secrets/ — SSM SecureString parameters
alb/ — ALB, target group, HTTP/HTTPS listeners
ecs/ — Cluster, ASG, launch template, service, task def, autoscaling, CloudWatch
acm/ — ACM certificate with Route53 DNS validation
route53_alias/ — A record alias to the ALB
main.tf — root module composes the child modules
Everything is composed in terraform/main.tf. Environment-specific values live in terraform/config/*.tfvars and remote state is configured via terraform/config/backend-*.hcl, bootstrapped from terraform/bootstrap/.
Networking
The VPC uses the classic two-tier layout: public subnets for the ALB, private subnets for ECS tasks and EC2 instances. Tasks use awsvpc networking mode, meaning each task gets its own ENI. This matters for capacity planning — a t3.small supports only 3 ENIs, one of which the instance itself uses, and during rolling deploys you can briefly need more ENIs than usual while old and new tasks overlap.
# terraform/modules/network/main.tf
resource "aws_subnet" "public" {
count = length(var.public_subnet_cidrs)
map_public_ip_on_launch = true
# ALB lives here
}
resource "aws_subnet" "private" {
count = length(var.private_subnet_cidrs)
# ECS tasks live here, outbound via NAT
}
ECS service configuration
The most important deployment settings are on the ECS service:
# terraform/modules/ecs/service.tf
resource "aws_ecs_service" "app" {
desired_count = var.desired_count
launch_type = "EC2"
deployment_minimum_healthy_percent = var.ecs_deployment.minimum_healthy_percent
deployment_maximum_percent = var.ecs_deployment.maximum_percent
lifecycle {
ignore_changes = [desired_count] # let autoscaling manage this
}
}
The defaults are minimum_healthy_percent = 100 and maximum_percent = 200. More on why those matter in a moment.
EC2 capacity
Rather than Fargate, this setup uses an Auto Scaling Group of EC2 instances with the ECS-optimized AMI:
# terraform/modules/ecs/capacity.tf
resource "aws_launch_template" "ecs_instance" {
image_id = data.aws_ssm_parameter.ecs_optimized_ami.value
instance_type = var.ecs_instance_type # default: t3.small
user_data = base64encode(<<EOT
#!/bin/bash
echo ECS_CLUSTER=${aws_ecs_cluster.app.name} >> /etc/ecs/ecs.config
EOT
)
}
The t3.micro is explicitly called out in the variables as too small — its 1 GiB of RAM barely fits a 512 MiB task after the ECS agent and OS take their share. t3.small is the minimum practical choice, and even then you need headroom for the rolling-deploy overlap, when old and new tasks run side by side on the same instance.
ALB and target group
The ALB forwards HTTP on port 80 (or HTTPS on 443 with ACM + Route53 when enable_https = true) to an ip-type target group, which is required for awsvpc tasks:
# terraform/modules/alb/main.tf
resource "aws_lb_target_group" "app" {
target_type = "ip" # required for awsvpc
health_check {
path = var.health_check_path
interval = 30
healthy_threshold = 2
unhealthy_threshold = 3
}
}
The health check defaults (30s interval, 2 consecutive passes = healthy) mean a new task takes at least 60 seconds before the ALB considers it healthy. This is the floor on how fast a rolling deploy can proceed.
The CI/CD pipeline
OIDC auth (no long-lived keys)
GitHub Actions uses OIDC to assume an IAM role, eliminating stored AWS credentials:
- uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
aws-region: ${{ vars.AWS_REGION }}
The IAM role's trust policy allows sts:AssumeRoleWithWebIdentity from the GitHub OIDC provider, scoped to your repo.
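A trust policy for that role can look like the following sketch (account ID, provider ARN, and the `OWNER/REPO` scoping are placeholders you would substitute for your own):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:OWNER/REPO:*"
        }
      }
    }
  ]
}
```

The `sub` condition is what scopes the role to a single repository; tighten it further (e.g. to a specific branch or environment) for production roles.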
Build and push to ECR
Images are tagged with the git SHA for immutability plus latest as a floating pointer:
docker build --platform linux/amd64 -t my-app:local .
docker tag my-app:local "${REGISTRY}/${REPO}:${GITHUB_SHA}"
docker tag my-app:local "${REGISTRY}/${REPO}:latest"
docker push "${REGISTRY}/${REPO}:${GITHUB_SHA}"
docker push "${REGISTRY}/${REPO}:latest"
Platform is explicitly linux/amd64 because ECS EC2 instances are AMD64. Building on an Apple Silicon Mac without --platform linux/amd64 produces an ARM image that ECS will refuse to run.
Deploying to ECS
After images are pushed to ECR, the production path in this repo (ci.yml’s deploy-ecs job) does two important things:
- Clones the task definition ECS is already running, swaps `containerDefinitions[0].image` to the immutable `…ECR_REPO:${GITHUB_SHA}` URI (via `aws ecs describe-task-definition` + `jq`), and calls `register-task-definition`.
- Calls `aws ecs update-service` with `--task-definition family:revision` and `--force-new-deployment`.
That avoids relying on :latest alone on EC2-backed ECS, where cached images can otherwise make "why is my task still old?" debugging painful. For a one-off manual deploy, update-service --force-new-deployment against the current task definition revision is still valid — it just will not pick up a new tag if the task definition still points at an older image reference.
Dependency and security updates
During development, Trivy flagged several issues worth bumping:
- chi — host header / open-redirect issue, fixed in v5.2.2
- golang.org/x/crypto — SSH-related CVEs, fixed in v0.45.0
- golang.org/x/net — pulled in as a transitive fix alongside crypto v0.45.0
Bumping x/crypto to v0.45.0 raised the minimum Go version in the toolchain at the time; today this repo tracks Go 1.26.x in go.mod, the Dockerfile build stage, and actions/setup-go in CI. The recurring lesson: when a security bump moves the floor Go version, align local, CI, and the container build in one change so you are not debugging “works on my laptop” skew.
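Concretely, there are three places that must agree (paths and exact versions here are illustrative of the pattern, not copied from the repo):

```text
# go.mod
go 1.26

# Dockerfile (build stage)
FROM golang:1.26-alpine AS build

# .github/workflows/ci.yml
- uses: actions/setup-go@v5
  with:
    go-version: "1.26.x"
```

Bumping all three in the same commit keeps local builds, CI, and the container image on an identical toolchain.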
ECR image management
ECR is configured with MUTABLE tags and a lifecycle policy that keeps the last 20 images:
# terraform/modules/ecr/main.tf
resource "aws_ecr_lifecycle_policy" "app" {
policy = jsonencode({
rules = [{
rulePriority = 1
description = "Keep the last 20 images."
selection = {
tagStatus = "any"
countType = "imageCountMoreThan"
countNumber = 20
}
action = { type = "expire" }
}]
})
}
The floating latest tag and orphaned images
With mutable tags, pushing a new latest moves the tag to the new image and leaves the previous image untagged. In the ECR console this shows up as a - in the Image tags column — not an error, just an orphan from the previous push.
Similarly, the "Last pulled at" column showing - means ECS never actually pulled that image. This happens when you push multiple images in quick succession during active development — each new push supersedes the previous one before ECS gets a chance to pull it.
Understanding rolling deployments
This is where the real learning happened. The setup uses minimum_healthy_percent = 100 and maximum_percent = 200.
What those numbers mean
With desired_count = 1:
- ECS cannot drop below 100% of 1 = 1 healthy task
- ECS can run up to 200% of 1 = 2 tasks simultaneously
So the rollout looks like: spin up 1 new task → wait for ALB health → drain and stop 1 old task.
With desired_count = 2:
- ECS keeps at least 2 healthy tasks throughout
- Can temporarily run up to 4 tasks during the overlap
Watching the rollout live
During a deploy from v1.1.0 to v1.2.0, hitting the health endpoint every second tells the whole story:
18:59:10 → version: 1.1.0 # still old
18:59:11 → version: 1.2.0 # new task became healthy, ALB started routing to it
18:59:12 → version: 1.1.0 # ALB still has both targets; round-robin
...
# Mixed phase: both tasks healthy, ALB splits traffic ~50/50
...
19:07:51 → version: 1.2.0 # old task fully drained and gone
19:07:52 → version: 1.2.0 # only new task remains
The mixed phase — where responses alternated between 1.1.0 and 1.2.0 — lasted about 8 minutes. That's not ECS being slow; it's the ALB deregistration_delay (default: 300 seconds) keeping the old target registered and draining before fully removing it.
The real levers for faster deploys
Increasing desired_count makes deploys smoother (fewer gaps), not necessarily faster. The actual knobs:
| Setting | Default | Dev-friendly value | Impact |
|---|---|---|---|
| ALB health check interval | 30s | 10s | Time-to-healthy for new tasks |
| ALB healthy threshold | 2 checks | 2 checks | Combined with interval: 20s vs 60s |
| deregistration_delay | 300s | 30s | How long old tasks stay in "draining" |
| Image size | varies | minimize | Faster ECR pull |
The 300s deregistration delay is the biggest culprit for the long mixed phase in the logs above. For dev, setting it to 30s would collapse that 8-minute overlap to under a minute.
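In Terraform, that knob lives on the target group. A per-environment override might look like this sketch (the `var.env` variable name is an assumption; the repo may wire it differently):

```hcl
# terraform/modules/alb/main.tf — illustrative
resource "aws_lb_target_group" "app" {
  # ...
  # Prod keeps the 300s default so in-flight requests can finish;
  # dev drops to 30s so rolling deploys converge quickly.
  deregistration_delay = var.env == "prod" ? 300 : 30
}
```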
Why desired_count = 1 has a gap
With a single task, there's a moment — after the old task starts draining but before the new task passes all health checks — where you can have zero healthy targets. That's the "silence" in the health check stream: requests hit the ALB, find no healthy targets, and return 503.
With desired_count = 2, one task can remain healthy while the other cycles. No zero-healthy-target moment. This is why ECS recommends at least 2 tasks for any service that needs to stay up during deploys.
ECS's mental model
A common misconception is that ECS follows a fixed script of "kill one, start one." Its actual logic is:
- Start new tasks (as many as maximum_percent allows)
- Wait for ALB health checks to pass on the new tasks
- Drain and stop old tasks (never dropping below minimum_healthy_percent)
This happens in waves, not one-by-one, and the exact order depends on available capacity. If the cluster is full and can't place a new task, the rollout stalls until capacity appears. This is where understanding your EC2 instance sizing and ENI limits becomes critical.
Autoscaling
The service has both application-level (ECS task count) and infrastructure-level (EC2 instance count) autoscaling configured.
Task autoscaling
# terraform/modules/ecs/service.tf
resource "aws_appautoscaling_policy" "cpu" {
target_tracking_scaling_policy_configuration {
target_value = 70
predefined_metric_specification {
predefined_metric_type = "ECSServiceAverageCPUUtilization"
}
}
}
resource "aws_appautoscaling_policy" "memory" {
target_tracking_scaling_policy_configuration {
target_value = 75
predefined_metric_specification {
predefined_metric_type = "ECSServiceAverageMemoryUtilization"
}
}
}
CPU above 70% or memory above 75% triggers scale-out. The lifecycle { ignore_changes = [desired_count] } on the service resource prevents Terraform from fighting with autoscaling every time you run apply.
Instance autoscaling
The ASG handles EC2 capacity, with min/max/desired capacity as separate variables. For dev this is 1–2 instances; for prod you'd widen the range and potentially add Capacity Provider managed scaling.
Observability
CloudWatch is set up with log groups, metric alarms, and a dashboard:
# terraform/modules/ecs/cloudwatch.tf
resource "aws_cloudwatch_metric_alarm" "alb_5xx" {
metric_name = "HTTPCode_ELB_5XX_Count"
threshold = 5
# 2 evaluation periods of 60s each
}
resource "aws_cloudwatch_metric_alarm" "ecs_cpu_high" {
metric_name = "CPUUtilization"
threshold = 80
}
Log retention is 7 days for dev and 30 days for prod, controlled by the env variable.
Practical lessons
Watch the deregistration delay. The default 300 seconds is appropriate for prod (gives in-flight requests time to complete) but brutal in dev when you're iterating fast. For dev environments, 30 seconds is plenty.
Size your instances for the overlap. With minimum_healthy_percent = 100 and maximum_percent = 200, a deploy with desired_count = 2 can briefly need 4 tasks running. On a t3.small with awsvpc, you'll hit ENI limits before CPU limits.
Tag with the git SHA, float latest. SHA-tagged images give you a permanent audit trail of what was deployed when. latest is convenient for quick local pushes but unreliable as a deployment target — use it as a shortcut, not a source of truth.
desired_count = 1 will show gaps during deploys. This is expected and unavoidable given minimum_healthy_percent = 100. If your health check script sees 503s during a deploy and you're on desired_count = 1, it's working as designed. Move to 2 if you want that eliminated.
OIDC over stored credentials, always. The GitHub Actions OIDC setup takes maybe 30 extra minutes to configure but eliminates the credential rotation problem permanently.
What's next
A few things this setup doesn't yet have that would be worth adding:
- Canary or blue/green deployments — ECS supports CodeDeploy blue/green which gives you a proper traffic cutover with rollback, rather than the ALB round-robin during rolling updates
- ECS Capacity Provider managed scaling — ties EC2 instance scaling directly to task placement demand, smoother than independent ASG scaling
- Structured logging with correlation IDs — the CloudWatch log groups are there but the value multiplies when your Go app emits JSON with request IDs
- Health check tuning per environment — dev and prod probably want different intervals, thresholds, and deregistration delays rather than sharing one Terraform variable set