Deploying a Go service to AWS ECS: a practical walkthrough

Infrastructure, CI/CD with OIDC, ECR lifecycle quirks, and what actually drives rolling-deploy speed on ECS with an ALB — including health checks, ENI limits, and deregistration delay.

I recently built and deployed a production-ready Go backend to AWS ECS using the EC2 launch type, Terraform, and GitHub Actions. This article walks through the whole setup — the infrastructure, the CI/CD pipeline, the deployment behavior, and the things I learned the hard way by watching health check logs scroll by in real time.

The stack at a glance

  • Language: Go (chi, zap, viper, go-playground/validator)
  • Container registry: Amazon ECR
  • Orchestration: Amazon ECS (EC2 launch type, not Fargate)
  • Load balancer: Application Load Balancer (ALB)
  • Infrastructure: Terraform, modularized
  • CI/CD: GitHub Actions with OIDC auth (no long-lived AWS keys)
  • Networking: VPC with public/private subnets, NAT gateway, awsvpc task networking

Repository: github.com/kaungmyathan18/golang-ecs-deployment

The Go application

Most of this article leans on Terraform, because that is where ECS networking, capacity, and ALB behavior are configured. The runnable service in the repo is a small REST API (user CRUD + health) with a layout that keeps transport code separate from domain logic:

cmd/server/
  main.go              — entrypoint, graceful shutdown, chi router + HTTP server timeouts
internal/
  apiresponse/         — success JSON envelope + RFC 7807 problem+json helpers
  config/              — viper + env (optional local `.env` via InitEnv)
  handler/             — HTTP handlers (`api.go`, `health.go`)
  observability/       — zap logging, metrics hook, OTel scaffold
  repository/          — in-memory user store (easy to swap for Postgres later)
  service/             — use-cases over the repository
  validation/          — go-playground/validator → 422 problem details

The Dockerfile is a multi-stage build: compile with golang:1.26-alpine, ship a static binary on alpine:3.21, listen on 8080 (matching container_port in Terraform).

# Dockerfile (abbreviated)
FROM golang:1.26-alpine AS build
WORKDIR /src
COPY go.mod go.sum* ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o /out/server ./cmd/server

FROM alpine:3.21
WORKDIR /app
RUN apk add --no-cache ca-certificates
COPY --from=build /out/server ./server
EXPOSE 8080
ENTRYPOINT ["./server"]

cmd/server/main.go attaches chi middleware (RealIP, Recoverer, RequestID, optional CORS, structured request logging) and registers two groups: /health for the load balancer and /api/v1 for the app.

// cmd/server/main.go — setupRouter excerpt
func setupRouter(
	cfg *config.Config,
	logger *zap.Logger,
	metrics *observability.Metrics,
	health *handler.HealthHandler,
	api *handler.APIHandler,
) *chi.Mux {
	r := chi.NewRouter()
	r.Use(middleware.RealIP)
	r.Use(middleware.Recoverer)
	r.Use(middleware.RequestID)
	// cors + LoggingMiddleware + metrics ...
	r.Route("/health", func(sr chi.Router) {
		sr.Get("/", health.Live) // ALB default path /health
		sr.Get("/live", health.Live)
		sr.Get("/ready", health.Ready)
	})
	r.Route("/api/v1", func(ar chi.Router) {
		ar.Use(middleware.Timeout(60 * time.Second))
		ar.Post("/users", api.CreateUser)
		ar.Get("/users/{id}", api.GetUser)
		ar.Get("/users", api.ListUsers)
	})
	return r
}

/health/live returns plain ok for a simple liveness probe. /health/ready returns JSON (including a version string) so you can poll during deploys and see mixed versions when the ALB has more than one healthy target — that is what produced the alternating 1.1.0 / 1.2.0 lines later in this post (bump the version in code when you cut a release).

Infrastructure architecture

In the repo, Terraform lives under terraform/ and is split into focused child modules under terraform/modules/:

terraform/
  bootstrap/         — S3 bucket + DynamoDB table for remote state / locking
  config/            — backend *.hcl, per-env *.tfvars
  modules/
    network/         — VPC, subnets, IGW, NAT, route tables
    security_groups/ — ALB SG, ECS task SG, ECS instance SG
    ecr/             — ECR repository with lifecycle policy
    iam/             — ECS execution role, task role, instance profile
    secrets/         — SSM SecureString parameters
    alb/             — ALB, target group, HTTP/HTTPS listeners
    ecs/             — Cluster, ASG, launch template, service, task def, autoscaling, CloudWatch
    acm/             — ACM certificate with Route53 DNS validation
    route53_alias/   — A record alias to the ALB
  main.tf            — root module composes the child modules

Everything is composed in terraform/main.tf. Environment-specific values live in terraform/config/*.tfvars and remote state is configured via terraform/config/backend-*.hcl, bootstrapped from terraform/bootstrap/.
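
To make the wiring concrete, here is a sketch of how the root module composes a few of the children. The module input and output names (vpc_id, public_subnet_ids, private_subnet_ids, target_group_arn) are illustrative and may not match the repo exactly:

# terraform/main.tf — composition sketch (illustrative names)
module "network" {
  source               = "./modules/network"
  vpc_cidr             = var.vpc_cidr
  public_subnet_cidrs  = var.public_subnet_cidrs
  private_subnet_cidrs = var.private_subnet_cidrs
}

module "alb" {
  source            = "./modules/alb"
  vpc_id            = module.network.vpc_id
  public_subnet_ids = module.network.public_subnet_ids
  health_check_path = "/health"
}

module "ecs" {
  source             = "./modules/ecs"
  private_subnet_ids = module.network.private_subnet_ids
  target_group_arn   = module.alb.target_group_arn
  desired_count      = var.desired_count
}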

Networking

The VPC uses the classic two-tier layout: public subnets for the ALB, private subnets for ECS tasks and EC2 instances. Tasks use awsvpc networking mode, meaning each task gets its own ENI. This matters for capacity planning — a t3.small instance has a limited ENI count, and during rolling deploys you can briefly need more ENIs than usual when old and new tasks overlap.

# terraform/modules/network/main.tf
resource "aws_subnet" "public" {
  count                   = length(var.public_subnet_cidrs)
  map_public_ip_on_launch = true
  # ALB lives here
}

resource "aws_subnet" "private" {
  count = length(var.private_subnet_cidrs)
  # ECS tasks live here, outbound via NAT
}

ECS service configuration

The most important deployment settings are on the ECS service:

# terraform/modules/ecs/service.tf
resource "aws_ecs_service" "app" {
  desired_count = var.desired_count
  launch_type   = "EC2"

  deployment_minimum_healthy_percent = var.ecs_deployment.minimum_healthy_percent
  deployment_maximum_percent         = var.ecs_deployment.maximum_percent

  lifecycle {
    ignore_changes = [desired_count] # let autoscaling manage this
  }
}

The defaults are minimum_healthy_percent = 100 and maximum_percent = 200. More on why those matter in a moment.

EC2 capacity

Rather than Fargate, this setup uses an Auto Scaling Group of EC2 instances with the ECS-optimized AMI:

# terraform/modules/ecs/capacity.tf
resource "aws_launch_template" "ecs_instance" {
  image_id        = data.aws_ssm_parameter.ecs_optimized_ami.value
  instance_type   = var.ecs_instance_type # default: t3.small

  user_data = base64encode(<<EOT
#!/bin/bash
echo ECS_CLUSTER=${aws_ecs_cluster.app.name} >> /etc/ecs/ecs.config
EOT
  )
}

The t3.micro is explicitly called out in the variables as too small — 1 GiB RAM barely fits a 512 MiB task definition after the ECS agent and OS overhead. t3.small is the minimum practical choice, and you need headroom for the rolling deploy overlap.
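
For context on that 512 MiB figure, here is a hedged sketch of the task-level sizing — the variable names and exact values are illustrative, not copied from the repo's task definition:

# terraform/modules/ecs/task.tf — sizing sketch (values are illustrative)
resource "aws_ecs_task_definition" "app" {
  family                   = "${var.project}-app"
  network_mode             = "awsvpc"
  requires_compatibilities = ["EC2"]
  cpu                      = 256 # 0.25 vCPU
  memory                   = 512 # MiB — roughly half a t3.small after agent and OS overhead

  container_definitions = jsonencode([{
    name         = "app"
    image        = var.image_uri
    essential    = true
    portMappings = [{ containerPort = 8080, protocol = "tcp" }]
  }])
}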

ALB and target group

The ALB forwards HTTP on port 80 (or HTTPS on 443 with ACM + Route53 when enable_https = true) to an ip-type target group, which is required for awsvpc tasks:

# terraform/modules/alb/main.tf
resource "aws_lb_target_group" "app" {
  target_type = "ip" # required for awsvpc

  health_check {
    path                = var.health_check_path
    interval            = 30
    healthy_threshold   = 2
    unhealthy_threshold = 3
  }
}

The health check defaults (30s interval, 2 consecutive passes = healthy) mean a new task takes at least 60 seconds before the ALB considers it healthy. This is the floor on how fast a rolling deploy can proceed.

The CI/CD pipeline

OIDC auth (no long-lived keys)

GitHub Actions uses OIDC to assume an IAM role, eliminating stored AWS credentials:

- uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
    aws-region: ${{ vars.AWS_REGION }}

The IAM role's trust policy allows sts:AssumeRoleWithWebIdentity from the GitHub OIDC provider, scoped to your repo.
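
For reference, a minimal Terraform sketch of that trust relationship — resource and role names are illustrative, and the workflow job also needs `permissions: id-token: write` to request the OIDC token:

# terraform/modules/iam/github_oidc.tf — sketch (names illustrative)
resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["6938fd4d98bab03faadb97b34396831e3780aea1"] # GitHub's published thumbprint at the time of writing
}

resource "aws_iam_role" "github_actions" {
  name = "github-actions-deploy"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRoleWithWebIdentity"
      Principal = { Federated = aws_iam_openid_connect_provider.github.arn }
      Condition = {
        StringEquals = { "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com" }
        StringLike   = { "token.actions.githubusercontent.com:sub" = "repo:kaungmyathan18/golang-ecs-deployment:*" }
      }
    }]
  })
}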

Build and push to ECR

Images are tagged with the git SHA for immutability plus latest as a floating pointer:

docker build --platform linux/amd64 -t my-app:local .

docker tag my-app:local "${REGISTRY}/${REPO}:${GITHUB_SHA}"
docker tag my-app:local "${REGISTRY}/${REPO}:latest"

docker push "${REGISTRY}/${REPO}:${GITHUB_SHA}"
docker push "${REGISTRY}/${REPO}:latest"

Platform is explicitly linux/amd64 because the ECS EC2 instances are AMD64. Building on an Apple Silicon Mac without --platform linux/amd64 produces an arm64 image that those instances cannot run.

Deploying to ECS

After images are pushed to ECR, the production path in this repo (ci.yml’s deploy-ecs job) does two important things:

  1. Clones the task definition ECS is already running, swaps containerDefinitions[0].image to the immutable …ECR_REPO:${GITHUB_SHA} URI (via aws ecs describe-task-definition + jq), and registers the result as a new revision with aws ecs register-task-definition.
  2. Calls aws ecs update-service with --task-definition family:revision and --force-new-deployment.

That avoids relying on :latest alone on EC2-backed ECS, where cached images can otherwise make “why is my task still old?” debugging painful. For a one-off manual deploy, update-service --force-new-deployment against the current task definition revision is still valid — it just will not pick up a new tag if the task definition still points at an older image reference.

Dependency and security updates

During development, Trivy flagged several issues worth bumping:

  • chi — host header / open-redirect issue, fixed in v5.2.2
  • golang.org/x/crypto — SSH-related CVEs, fixed in v0.45.0
  • golang.org/x/net — pulled in as a transitive fix with crypto v0.45.0

Bumping x/crypto to v0.45.0 raised the minimum Go version in the toolchain at the time; today this repo tracks Go 1.26.x in go.mod, the Dockerfile build stage, and actions/setup-go in CI. The recurring lesson: when a security bump forces a newer minimum Go version, align local, CI, and the container build in one change so you are not debugging “works on my laptop” skew.

ECR image management

ECR is configured with MUTABLE tags and a lifecycle policy that keeps the last 20 images:

# terraform/modules/ecr/main.tf
resource "aws_ecr_lifecycle_policy" "app" {
  policy = jsonencode({
    rules = [{
      rulePriority = 1
      description  = "Keep the last 20 images."
      selection = {
        tagStatus   = "any"
        countType   = "imageCountMoreThan"
        countNumber = 20
      }
      action = { type = "expire" }
    }]
  })
}

The floating latest tag and orphaned images

With mutable tags, pushing a new latest moves the tag to the new image and leaves the previous image untagged. In the ECR console this shows up as a - in the Image tags column — not an error, just an orphan from the previous push.

Similarly, the "Last pulled at" column showing - means ECS never actually pulled that image. This happens when you push multiple images in quick succession during active development — each new push supersedes the previous one before ECS gets a chance to pull it.
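
If the untagged leftovers bother you, ECR lifecycle rules can target them directly. A sketch — not what the repo currently configures — that expires untagged images a day after they are orphaned, alongside the keep-last-20 rule:

# terraform/modules/ecr/main.tf — optional extra rule (not in the repo as-is)
resource "aws_ecr_lifecycle_policy" "app" {
  repository = aws_ecr_repository.app.name
  policy = jsonencode({
    rules = [
      {
        rulePriority = 1
        description  = "Expire untagged images after 1 day."
        selection = {
          tagStatus   = "untagged"
          countType   = "sinceImagePushed"
          countUnit   = "days"
          countNumber = 1
        }
        action = { type = "expire" }
      },
      {
        rulePriority = 2
        description  = "Keep the last 20 images."
        selection = {
          tagStatus   = "any"
          countType   = "imageCountMoreThan"
          countNumber = 20
        }
        action = { type = "expire" }
      }
    ]
  })
}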

Understanding rolling deployments

This is where the real learning happened. The setup uses minimum_healthy_percent = 100 and maximum_percent = 200.

What those numbers mean

With desired_count = 1:

  • ECS cannot drop below 100% of 1 = 1 healthy task
  • ECS can run up to 200% of 1 = 2 tasks simultaneously

So the rollout looks like: spin up 1 new task → wait for ALB health → drain and stop 1 old task.

With desired_count = 2:

  • ECS keeps at least 2 healthy tasks throughout
  • Can temporarily run up to 4 tasks during the overlap

Watching the rollout live

During a deploy from v1.1.0 to v1.2.0, hitting the health endpoint every second tells the whole story:

18:59:10  → version: 1.1.0   # still old
18:59:11  → version: 1.2.0   # new task became healthy, ALB started routing to it
18:59:12  → version: 1.1.0   # ALB still has both targets; round-robin
...
# Mixed phase: both tasks healthy, ALB splits traffic ~50/50
...
19:07:51  → version: 1.2.0   # old task fully drained and gone
19:07:52  → version: 1.2.0   # only new task remains

The mixed phase — where responses alternated between 1.1.0 and 1.2.0 — lasted about 8 minutes. That's not ECS being slow; it's the ALB deregistration_delay (default: 300 seconds) keeping the old target registered and draining before fully removing it.

The real levers for faster deploys

Increasing desired_count makes deploys smoother (fewer gaps), not necessarily faster. The actual knobs:

Setting                       Default    Dev-friendly value   Impact
ALB health check interval     30s        10s                  Time-to-healthy for new tasks
ALB healthy threshold         2 checks   2 checks             Combined with interval: 20s vs 60s
deregistration_delay          300s       30s                  How long old tasks stay in "draining"
Image size                    varies     minimize             Faster ECR pull

The 300s deregistration delay is the biggest culprit for the long mixed phase in the logs above. For dev, setting it to 30s would collapse that 8-minute overlap to under a minute.
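
In Terraform both knobs live on the target group. A sketch with dev-friendly values (variable names assumed, not copied from the repo):

# terraform/modules/alb/main.tf — dev-friendly tuning (sketch)
resource "aws_lb_target_group" "app" {
  target_type          = "ip"
  port                 = 8080
  protocol             = "HTTP"
  vpc_id               = var.vpc_id
  deregistration_delay = 30 # default 300 — the main driver of the long mixed phase

  health_check {
    path                = var.health_check_path
    interval            = 10 # default 30
    healthy_threshold   = 2  # 2 × 10s ≈ 20s to healthy instead of 60s
    unhealthy_threshold = 3
  }
}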

Why desired_count = 1 has a gap

With a single task, there's a moment — after the old task starts draining but before the new task passes all health checks — where you can have zero healthy targets. That's the "silence" in the health check stream: requests hit the ALB, find no healthy targets, and return 503.

With desired_count = 2, one task can remain healthy while the other cycles. No zero-healthy-target moment. This is why ECS recommends at least 2 tasks for any service that needs to stay up during deploys.

ECS's mental model

A common misconception is that ECS follows a fixed script of "kill one, start one." Its actual logic is:

  1. Start new tasks (as many as maximum_percent allows)
  2. Wait for ALB health checks to pass on new tasks
  3. Drain and stop old tasks (limited by minimum_healthy_percent)

This happens in waves, not one-by-one, and the exact order depends on available capacity. If the cluster is full and can't place a new task, the rollout stalls until capacity appears. This is where understanding your EC2 instance sizing and ENI limits becomes critical.

Autoscaling

The service has both application-level (ECS task count) and infrastructure-level (EC2 instance count) autoscaling configured.

Task autoscaling

# terraform/modules/ecs/service.tf
resource "aws_appautoscaling_policy" "cpu" {
  target_tracking_scaling_policy_configuration {
    target_value = 70
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}

resource "aws_appautoscaling_policy" "memory" {
  target_tracking_scaling_policy_configuration {
    target_value = 75
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageMemoryUtilization"
    }
  }
}

CPU above 70% or memory above 75% triggers scale-out. The lifecycle { ignore_changes = [desired_count] } on the service resource prevents Terraform from fighting with autoscaling every time you run apply.
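
Those policies attach to a scaling target that sets the min/max task range for the service. A sketch — the capacity bounds are illustrative, and the cluster/service references assume the names used earlier:

# terraform/modules/ecs/service.tf — scaling target sketch
resource "aws_appautoscaling_target" "ecs" {
  service_namespace  = "ecs"
  resource_id        = "service/${aws_ecs_cluster.app.name}/${aws_ecs_service.app.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 1
  max_capacity       = 4
}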

Instance autoscaling

The ASG handles EC2 capacity, with min/max/desired capacity as separate variables. For dev this is 1–2 instances; for prod you'd widen the range and potentially add Capacity Provider managed scaling.
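
A hedged sketch of that ASG with dev-sized capacity — the numbers and the subnet variable name are illustrative:

# terraform/modules/ecs/capacity.tf — ASG sketch (dev-sized)
resource "aws_autoscaling_group" "ecs" {
  min_size            = 1
  max_size            = 2
  desired_capacity    = 1
  vpc_zone_identifier = var.private_subnet_ids

  launch_template {
    id      = aws_launch_template.ecs_instance.id
    version = "$Latest"
  }
}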

Observability

CloudWatch is set up with log groups, metric alarms, and a dashboard:

# terraform/modules/ecs/cloudwatch.tf
resource "aws_cloudwatch_metric_alarm" "alb_5xx" {
  metric_name = "HTTPCode_ELB_5XX_Count"
  threshold   = 5
  # 2 evaluation periods of 60s each
}

resource "aws_cloudwatch_metric_alarm" "ecs_cpu_high" {
  metric_name = "CPUUtilization"
  threshold   = 80
}

Log retention is 7 days for dev and 30 days for prod, controlled by the env variable.
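
The retention switch is a one-liner on the log group; a sketch assuming the env variable mentioned above and an illustrative project variable for the name:

# terraform/modules/ecs/cloudwatch.tf — retention sketch
resource "aws_cloudwatch_log_group" "app" {
  name              = "/ecs/${var.project}-app"
  retention_in_days = var.env == "prod" ? 30 : 7
}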

Practical lessons

Watch the deregistration delay. The default 300 seconds is appropriate for prod (gives in-flight requests time to complete) but brutal in dev when you're iterating fast. For dev environments, 30 seconds is plenty.

Size your instances for the overlap. With minimum_healthy_percent = 100 and maximum_percent = 200, a deploy with desired_count = 2 can briefly need 4 tasks running. On a t3.small with awsvpc, you'll hit ENI limits before CPU limits.

Tag with the git SHA, float latest. SHA-tagged images give you a permanent audit trail of what was deployed when. latest is convenient for quick local pushes but unreliable as a deployment target — use it as a shortcut, not a source of truth.

desired_count = 1 will show gaps during deploys. This is expected and unavoidable given minimum_healthy_percent = 100. If your health check script sees 503s during a deploy and you're on desired_count = 1, it's working as designed. Move to 2 if you want that eliminated.

OIDC over stored credentials, always. The GitHub Actions OIDC setup takes maybe 30 extra minutes to configure but eliminates the credential rotation problem permanently.

What's next

A few things this setup doesn't yet have that would be worth adding:

  • Canary or blue/green deployments — ECS supports CodeDeploy blue/green which gives you a proper traffic cutover with rollback, rather than the ALB round-robin during rolling updates
  • ECS Capacity Provider managed scaling — ties EC2 instance scaling directly to task placement demand, smoother than independent ASG scaling
  • Structured logging with correlation IDs — the CloudWatch log groups are there but the value multiplies when your Go app emits JSON with request IDs
  • Health check tuning per environment — dev and prod probably want different intervals, thresholds, and deregistration delays rather than sharing one Terraform variable set