by skunxicat

Mastering ECS Autoscaling: Target Tracking, Step, and Scheduled Together

asciicast

Autoscaling in AWS ECS looks straightforward at first glance — “just add a target tracking policy and you’re done.”
But in practice, it’s full of nuance. Over the past weeks I’ve been experimenting with real workloads, real queues, and real ECS services, and I’ve learned that you need more than one lever to get it right.

This post walks through how I combined Target Tracking, Step Scaling, and Scheduled Scaling to build a resilient and cost-effective autoscaling setup — and the unexpected lessons I learned along the way.


Introduction

By default, ECS autoscaling often relies on CPU or memory utilization metrics to adjust the number of running tasks. While this works well for many applications, queue-driven workloads require a more nuanced approach. Instead of just resource usage, scaling based on the backlog of work waiting in the queue provides a more accurate reflection of demand. A common pattern is to use a custom metric like backlog-per-task (e.g., messages per running task) to ensure the number of tasks matches the amount of work, enabling more precise and efficient scaling. This approach is especially useful because it accounts for the variable nature of demand—if demand were constant, you could simply size your tasks once and forget about scaling, but in reality, demand fluctuates, so dynamic scaling is required.


Example Scenario

The workload is queue-driven:

  • Demand: measured by QueueDepth (ApproximateNumberOfMessagesVisible)
  • Capacity: measured by DesiredTaskCount / RunningTaskCount from ECS

The core idea: scale ECS workers up or down so that the number of running tasks matches the amount of work waiting in the queue.

demand-capacity

Goal: keep backlog low, keep latency under control, minimize cost.

Target Tracking Policy

Target Tracking provides a way to automatically adjust the number of running tasks based on a custom metric that reflects the workload per task.

resource "aws_appautoscaling_policy" "ecs_default" {
  name               = "${local.pascal_prefix}AvailabilityAutoscalingPolicy"
  policy_type        = "TargetTrackingScaling"
  scalable_dimension = aws_appautoscaling_target.ecs_default.scalable_dimension
  resource_id        = aws_appautoscaling_target.ecs_default.resource_id
  service_namespace  =  aws_appautoscaling_target.ecs_default.service_namespace

  target_tracking_scaling_policy_configuration {
    target_value = 1  # 1 task per message
    disable_scale_in = false
    scale_in_cooldown = 0
    scale_out_cooldown = 180
    
    # Backlog per task - considers both queue depth AND running tasks
    customized_metric_specification {
      metrics {
        label = "ApproximateNumberOfMessagesVisible"
        id    = "m1"
        metric_stat {
          metric {
            metric_name = "ApproximateNumberOfMessagesVisible"
            namespace   = "AWS/SQS"
            dimensions {
              name  = "QueueName"
              value = "${aws_sqs_queue.default.name}"
            }
          }
          stat = "Sum"
        }
        return_data = false
      }
      metrics {
        label = "ApproximateNumberOfMessagesNotVisible"
        id    = "m3"
        metric_stat {
          metric {
            metric_name = "ApproximateNumberOfMessagesNotVisible"
            namespace   = "AWS/SQS"
            dimensions {
              name  = "QueueName"
              value = "${aws_sqs_queue.default.name}"
            }
          }
          stat = "Sum"
        }
        return_data = false
      }
      metrics {
        label = "RunningTaskCount"
        id    = "m2"
        metric_stat {
          metric {
            metric_name = "RunningTaskCount"
            namespace   = "ECS/ContainerInsights"
            dimensions {
              name  = "ClusterName"
              value = "${local.ecs_cluster_name}"
            }
            dimensions {
              name  = "ServiceName"
              value = "${aws_ecs_service.default.name}"
            }
          }
          stat = "Maximum"
        }
        return_data = false
      }
      metrics {
        label       = "SafeRunningTaskCount"
        id          = "f1"
        expression  = "FILL(FILL(m2,REPEAT),0)"
        return_data = false
      }
      metrics {
        label       = "backLogPerTask"
        id          = "e1"
        expression  = "IF(f1 > 0, (m1 + m3) / f1, m1 + m3)"
        return_data = true
      }
    }
  }
}

Why this Target Tracking policy (reasoning)

This policy implements backlog-per-task as the control metric and then applies Target Tracking to keep it near a chosen set‑point.

Metric composition

  • Backlog = ApproximateNumberOfMessagesVisible + ApproximateNumberOfMessagesNotVisible
    Using both captures waiting and in-flight messages, which better reflects real work remaining than visible-only depth.
  • Capacity = RunningTaskCount from ECS/ContainerInsights scoped to the service.
    We then compute:
    IF(RunningTaskCount > 0, Backlog / RunningTaskCount, Backlog)
    so that when capacity is zero we don’t divide by zero and the metric equals the backlog.
  • FILL(REPEAT, 0) is applied to RunningTaskCount to avoid gaps and transient null samples that would otherwise stall Target Tracking decisions.

Why target_value = 1

  • The set‑point represents messages per running task. A value of 1 means: on average, each task should have roughly one message to work on.
  • You can derive a principled target from your latency budget L (seconds) and average processing time per message T (seconds):
    target_value ≈ L / T  (messages per task)
    Example: if you can tolerate 10s queue latency and tasks process a message in 0.1s, target_value ≈ 100.

Cooldowns & stability

  • scale_out_cooldown = 180 provides damping so the service doesn’t overreact to short spikes (ECS placement and image cold start also take time).
  • scale_in_cooldown = 0 lets the fleet shed capacity as soon as the signal allows, but in practice scale‑in remains conservative due to Target Tracking’s internal flapping protection.
  • If you observe oscillations, increase scale‑in cooldown (e.g., 300–600s) and/or widen the effective dead‑band by slightly lowering the target.

Why not queue depth alone

  • Raw ApproximateNumberOfMessagesVisible ignores current capacity; 1,000 messages with 2 tasks is not the same as 1,000 messages with 50 tasks. Backlog‑per‑task normalizes by capacity and produces proportional behavior.

Stat/period choices

  • Using stat = Sum on 60s periods is a good default for SQS depth. If your producers are very bursty, consider 30s periods and apply light smoothing in the reader to reduce one‑sample noise.

About the Alarms

When you attach a Target Tracking policy, AWS automatically creates the necessary CloudWatch alarms (both upper and lower thresholds) that drive the scaling actions. These alarms have their datapoint-to-alarm and evaluation periods carefully tuned to work optimally with the scaling algorithm. It is generally better not to override these alarms directly. Instead, adjust the target value of the policy—which can be a floating point number—to find the right balance between responsiveness and stability for your workload.

If you’re curious what Target Tracking is doing under the hood, you can use the AWS CLI to see the exact CloudWatch alarms it generates:

aws application-autoscaling describe-scaling-policies \
    --policy-names Ql4bAirswitchRyanairAvailabilityAutoscalingPolicy \
    --service-namespace ecs \
    | jq \
    '.ScalingPolicies|first.Alarms'
[
  {
    "AlarmName": "TargetTracking-service/arn:aws:ecs:eu-central-1:703177223665:cluster/ql4b-airswitch/ql4b-airswitch-ryanair-AlarmHigh-3712aff2-5d0a-40fb-a92e-7ca8d34631da",
    "AlarmARN": "arn:aws:cloudwatch:eu-central-1:703177223665:alarm:TargetTracking-service/arn:aws:ecs:eu-central-1:703177223665:cluster/ql4b-airswitch/ql4b-airswitch-ryanair-AlarmHigh-3712aff2-5d0a-40fb-a92e-7ca8d34631da"
  },
  {
    "AlarmName": "TargetTracking-service/arn:aws:ecs:eu-central-1:703177223665:cluster/ql4b-airswitch/ql4b-airswitch-ryanair-AlarmLow-90f5b055-3d73-4c45-af2d-21054f8e6f36",
    "AlarmARN": "arn:aws:cloudwatch:eu-central-1:703177223665:alarm:TargetTracking-service/arn:aws:ecs:eu-central-1:703177223665:cluster/ql4b-airswitch/ql4b-airswitch-ryanair-AlarmLow-90f5b055-3d73-4c45-af2d-21054f8e6f36"
  }
]
aws cloudwatch describe-alarms \
    --alarm-names TargetTracking-service/arn:aws:ecs:eu-central-1:703177223665:cluster/ql4b-airswitch/ql4b-airswitch-ryanair-AlarmHigh-3712aff2-5d0a-40fb-a92e-7ca8d34631da TargetTracking-service/arn:aws:ecs:eu-central-1:703177223665:cluster/ql4b-airswitch/ql4b-airswitch-ryanair-AlarmLow-90f5b055-3d73-4c45-af2d-21054f8e6f36 \
    | jq '
        .MetricAlarms[]|{ 
            AlarmActions,
            EvaluationPeriods,
            Threshold,
            ComparisonOperator,
            AlarmDescription,
            AlarmName,
            TrackingMetric:.Metrics[]|select(.ReturnData == true) 
    }'
{
  "AlarmActions": [
    "arn:aws:autoscaling:eu-central-1:703177223665:scalingPolicy:c21da43a-bdbf-4ece-924e-f66ecdce25cb:resource/ecs/service/arn:aws:ecs:eu-central-1:703177223665:cluster/ql4b-airswitch/ql4b-airswitch-ryanair:policyName/Ql4bAirswitchRyanairAvailabilityAutoscalingPolicy:createdBy/cb04dcaa-1f20-4fea-84e0-cb3f109d9138"
  ],
  "EvaluationPeriods": 3,
  "Threshold": 1.0,
  "ComparisonOperator": "GreaterThanThreshold",
  "AlarmDescription": "DO NOT EDIT OR DELETE. For TargetTrackingScaling policy arn:aws:autoscaling:eu-central-1:703177223665:scalingPolicy:c21da43a-bdbf-4ece-924e-f66ecdce25cb:resource/ecs/service/arn:aws:ecs:eu-central-1:703177223665:cluster/ql4b-airswitch/ql4b-airswitch-ryanair:policyName/Ql4bAirswitchRyanairAvailabilityAutoscalingPolicy:createdBy/cb04dcaa-1f20-4fea-84e0-cb3f109d9138.",
  "AlarmName": "TargetTracking-service/arn:aws:ecs:eu-central-1:703177223665:cluster/ql4b-airswitch/ql4b-airswitch-ryanair-AlarmHigh-3712aff2-5d0a-40fb-a92e-7ca8d34631da",
  "TrackingMetric": {
    "Id": "e1",
    "Expression": "IF(f1 > 0, (m1 + m3) / f1, m1 + m3)",
    "Label": "backLogPerTask",
    "ReturnData": true,
    "Period": 60
  }
}
{
  "AlarmActions": [
    "arn:aws:autoscaling:eu-central-1:703177223665:scalingPolicy:c21da43a-bdbf-4ece-924e-f66ecdce25cb:resource/ecs/service/arn:aws:ecs:eu-central-1:703177223665:cluster/ql4b-airswitch/ql4b-airswitch-ryanair:policyName/Ql4bAirswitchRyanairAvailabilityAutoscalingPolicy:createdBy/cb04dcaa-1f20-4fea-84e0-cb3f109d9138"
  ],
  "EvaluationPeriods": 15,
  "Threshold": 0.9,
  "ComparisonOperator": "LessThanThreshold",
  "AlarmDescription": "DO NOT EDIT OR DELETE. For TargetTrackingScaling policy arn:aws:autoscaling:eu-central-1:703177223665:scalingPolicy:c21da43a-bdbf-4ece-924e-f66ecdce25cb:resource/ecs/service/arn:aws:ecs:eu-central-1:703177223665:cluster/ql4b-airswitch/ql4b-airswitch-ryanair:policyName/Ql4bAirswitchRyanairAvailabilityAutoscalingPolicy:createdBy/cb04dcaa-1f20-4fea-84e0-cb3f109d9138.",
  "AlarmName": "TargetTracking-service/arn:aws:ecs:eu-central-1:703177223665:cluster/ql4b-airswitch/ql4b-airswitch-ryanair-AlarmLow-90f5b055-3d73-4c45-af2d-21054f8e6f36",
  "TrackingMetric": {
    "Id": "e1",
    "Expression": "IF(f1 > 0, (m1 + m3) / f1, m1 + m3)",
    "Label": "backLogPerTask",
    "ReturnData": true,
    "Period": 60
  }
}

See it Working (Hands-On)

asciicast


asciicast


The Problem

A few challenges appear when relying only on the standard AWS baseline approach — scaling based on CPU/memory or backlog-per-task target tracking alone:

  • Cold start lag: scaling from 0 tasks to 1 task is slow (metrics take 2–3 minutes to register)
  • Flapping: target tracking policies can oscillate if cooldowns aren’t tuned
  • Business-hour workloads: leaving capacity at 1 task overnight wastes money

These issues require more than just a single target tracking policy.

The Challenge: Availability vs Reactivity

  • Target Tracking is designed for stability, not speed.
  • With min_capacity = 0, scale-out from cold start can take 2–3 minutes due to metric lag.
  • Customers don’t care about metrics — they care about response time.

That’s when you realize: one scaling policy isn’t enough.


Target Tracking — The Baseline

Target tracking works best as the “cruise control” for your service:

  • Uses a backlog-per-task metric (Messages / RunningTasks).
  • Scales smoothly up and down, one task at a time.
  • Conservative by design — avoids overshoot.

⚠️ Lesson learned: Be careful with flapping detection. The algorithm suppresses actions if it sees rapid in/out oscillations. Cooldowns and target values must be tuned to keep it responsive.


Step Scaling — Solving 0 → 1

Target tracking fails at cold start, so I added a step scaling alarm:

  • Trigger: If RunningTasks == 0 and Queue > 0 → add 1 task.
  • Cooldown: 120 seconds.
  • Purpose: bootstrap the service quickly when it’s asleep.

This small rule makes a huge difference in perceived responsiveness.


Scheduled Scaling — Predictable Baseline

Not all workloads are random. Mine had clear business-hour patterns:

  • 9am Mon–Fri → ensure at least 1 task (warm baseline).
  • 11pm Mon–Fri → allow scale-to-zero (cost optimization).

By scheduling, I eliminated the worst cold-start delays right when users are active.


Visualizing Scaling in the Terminal

Instead of dashboards, I used asciigraph and asciinema to watch metrics in real time from the terminal.
This turned out to be incredibly useful — seeing the queue depth rise, tasks spin up, and autoscaling react gave me intuition that CloudWatch graphs alone don’t show.

asciicast


Lessons Learned

  1. Target Tracking keeps things balanced, but don’t expect fast wake-ups.
  2. Step Scaling is essential for cold-start responsiveness.
  3. Scheduled Scaling saves money while keeping service warm during peaks.
  4. Observability (even in ASCII!) builds intuition you can’t get from alarms alone.

References

  1. https://aws.amazon.com/blogs/containers/amazon-elastic-container-service-ecs-auto-scaling-using-custom-metrics/
  2. https://serverlessland.com/patterns/sqs-ecs-autoscaling?ref=search
  3. https://aws.amazon.com/blogs/compute/scaling-an-asg-using-target-tracking-with-a-dynamic-sqs-target/
  1. https://adamtuttle.codes/blog/2022/scaling-fargate-based-on-sqs-queue-depth/
  2. https://github.com/hariohmprasath/ecs-event-driven-scaling

Closing Thoughts

Autoscaling isn’t about a single magic policy — it’s about combining strategies.
With this hybrid approach, I reached the sweet spot: responsive, stable, and cost-optimized ECS scaling.