by skunxicat

Self-Managing ECS Tasks

How ECS Task Protection enables zero-downtime deployments and prevents message loss

The Deployment Problem

You deploy a new version of your booking consumer. ECS starts new tasks and kills old ones. But what happens to the booking request that was being processed when the old task got terminated?

Traditional result: Processing is cut off mid-booking. The message is redelivered later at best or lost at worst, the customer's booking fails, and a support ticket is created.

The ECS Task Protection Solution

The Task Protection API lets containers tell ECS: “Don’t kill me, I’m working on something important.”

// The beautiful simplicity
.on("message_received", async () => {
  await taskEnableProtection();    // "Don't kill me, I'm working"
})
.on("message_processed", async () => {
  await taskDisableProtection();   // "OK, I'm ready to die"
})

How Task Protection Works

Without Protection:

ECS: "New deployment, stopping task" *SIGTERM*
Task: "But I'm processing a €500 booking!"
ECS: "Time's up" *SIGKILL once the stop timeout expires, 30 seconds by default*
Result: Lost booking, angry customer

With Protection:

ECS: "New deployment, killing task"
Task: "I'm protected, wait please"
ECS: "OK, I'll wait until your protection expires (2 hours by default)"
Task: *finishes booking* "Done, kill me now"
ECS: "Thanks for being responsible" *SIGTERM*
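
One way a container can send the "I'm protected" message is through the local task protection endpoint that the ECS container agent exposes at $ECS_AGENT_URI/task-protection/v1/state. The sketch below is illustrative only: setTaskProtection is a hypothetical helper name, the response shape (a top-level "protection" object on success) is assumed from the AWS documentation, and fetch is assumed to be available (Node.js 18+).

```javascript
// Sketch: toggle scale-in protection via the ECS agent's local endpoint.
// setTaskProtection is a hypothetical helper, not an AWS SDK function.
const setTaskProtection = async (
  enabled,
  {
    agentUri = process.env.ECS_AGENT_URI, // injected by ECS into the container
    fetchImpl = globalThis.fetch,         // injectable so the helper can be tested
  } = {}
) => {
  if (!agentUri) {
    throw new Error("ECS_AGENT_URI is not set; is this running on ECS?");
  }

  const response = await fetchImpl(`${agentUri}/task-protection/v1/state`, {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ ProtectionEnabled: enabled }),
  });

  const payload = await response.json();
  // Assumption: a successful call returns { protection: { ... } }.
  if (!response.ok || !payload.protection) {
    throw new Error(`Task protection update failed: ${JSON.stringify(payload)}`);
  }
  return payload.protection;
};
```

With this approach the container does not need to look up its own task ARN, because the agent already knows which task is calling.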

Implementation

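// Task protection helpers (AWS SDK for JavaScript v2)
// CLUSTER_NAME and getTaskArn() are assumed to be defined elsewhere in the service.
const AWS = require('aws-sdk');
const ecs = new AWS.ECS();
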
const taskEnableProtection = async () => {
  const taskArn = await getTaskArn();
  await ecs.updateTaskProtection({
    cluster: CLUSTER_NAME,
    tasks: [taskArn],
    protectionEnabled: true
  }).promise();
};

const taskDisableProtection = async () => {
  const taskArn = await getTaskArn();
  await ecs.updateTaskProtection({
    cluster: CLUSTER_NAME,
    tasks: [taskArn],
    protectionEnabled: false
  }).promise();
};

const taskProtectionEnabled = async () => {
  const taskArn = await getTaskArn();
  // getTaskProtection is the API call that reports a task's protection state
  const result = await ecs.getTaskProtection({
    cluster: CLUSTER_NAME,
    tasks: [taskArn]
  }).promise();

  return result.protectedTasks?.[0]?.protectionEnabled || false;
};
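
The helpers above call getTaskArn(), which is not shown. One possible implementation, offered as a sketch rather than the author's version, reads the task metadata endpoint (v4). ECS sets ECS_CONTAINER_METADATA_URI_V4 in each container, and the /task path returns JSON that includes a TaskARN field. The caching and the injectable fetchImpl parameter are assumptions added for illustration.

```javascript
// Sketch: resolve this task's ARN from the ECS task metadata endpoint (v4).
// The ARN never changes during a task's lifetime, so it is cached after the first lookup.
let cachedTaskArn = null;

const getTaskArn = async ({
  metadataUri = process.env.ECS_CONTAINER_METADATA_URI_V4,
  fetchImpl = globalThis.fetch, // built into Node.js 18+; injectable for tests
} = {}) => {
  if (cachedTaskArn) return cachedTaskArn;

  if (!metadataUri) {
    throw new Error("ECS_CONTAINER_METADATA_URI_V4 is not set; is this running on ECS?");
  }

  const response = await fetchImpl(`${metadataUri}/task`);
  if (!response.ok) {
    throw new Error(`Task metadata request failed with status ${response.status}`);
  }

  const metadata = await response.json();
  cachedTaskArn = metadata.TaskARN;
  return cachedTaskArn;
};
```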

Consumer Integration

const app = Consumer.create({
  queueUrl: BOOKING_QUEUE,
  handleMessage: processBookingMessage
})
.on("message_received", async (message) => {
  Logger.info("message_received");
  
  // Enable protection before processing
  if (!isLocal) {
    Logger.info("taskEnableProtection");
    await taskEnableProtection();    
  }
  
  Logger.debug(message);
})
.on("message_processed", async (message) => {
  Logger.info("message_processed");
  
  // Disable protection after processing
  if (!isLocal && await taskProtectionEnabled()) {
    Logger.info("taskDisableProtection");
    await taskDisableProtection();
  }
})
.on("processing_error", async (err) => {
  Logger.error("processing_error", err.message);
  
  // Always disable protection on error
  if (!isLocal && await taskProtectionEnabled()) {
    Logger.info("taskDisableProtection");
    await taskDisableProtection();
  }
})
.on("error", async (err) => {
  Logger.error("error", err.message);
  
  // Always disable protection on error
  if (!isLocal && await taskProtectionEnabled()) {
    Logger.info("taskDisableProtection");
    await taskDisableProtection();
  }
});
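
The handlers above toggle protection once per message. If the consumer is ever configured to handle several messages concurrently, the first message to finish would drop protection while others are still in flight. One possible guard, sketched under that assumption, is a small in-flight counter. createProtectionGate is a hypothetical helper; enable and disable stand in for taskEnableProtection and taskDisableProtection.

```javascript
// Sketch: enable protection on the first in-flight message and
// disable it only after the last one has finished.
const createProtectionGate = ({ enable, disable }) => {
  let inFlight = 0;
  let queue = Promise.resolve(); // serializes enable/disable calls

  const run = (fn) => {
    const next = queue.then(fn);
    queue = next.catch(() => {}); // keep the chain usable after a failure
    return next;
  };

  return {
    // Call when a message is received.
    acquire: () =>
      run(async () => {
        inFlight += 1;
        if (inFlight === 1) await enable();
      }),
    // Call when a message is processed or fails.
    release: () =>
      run(async () => {
        if (inFlight === 0) return; // unmatched release, nothing to do
        inFlight -= 1;
        if (inFlight === 0) await disable();
      }),
    get inFlight() {
      return inFlight;
    },
  };
};
```

In the consumer, message_received would call gate.acquire(), while message_processed and processing_error would call gate.release(). Serializing the calls keeps a late disable from overtaking a newer enable.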

Signal Handling

.on("signal", (signal) => {
  switch (signal) {
    case 'SIGINT':
      if (app.isRunning) app.stop();
      break;
      
    case 'SIGTERM':
      Logger.info("This task is about to terminate");
      if (app.isRunning) {
        // Stop polling new messages
        app.stop(); 
      }
      break;
      
    default: 
      Logger.info(`Unhandled signal (${signal}), ignore`);
  }
})
.once("unalive", () => {
  Logger.info("**UNALIVE** event, ready to move to a better life");
});

// Process signal handlers
const signalHandler = (signal) => {
  Logger.info(signal, "signal received");
  app.emit('signal', signal);
};

process.on('SIGINT', signalHandler);
process.on('SIGTERM', signalHandler);
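
After SIGTERM, ECS waits for the container's stop timeout (30 seconds by default) before sending SIGKILL, so stopping the poller is only half of a graceful shutdown: the process also needs to wait for the in-flight message. A minimal sketch of that wait follows. waitForIdle is a hypothetical helper, and the 25-second default is an assumption chosen to stay under the default stop timeout.

```javascript
// Sketch: wait until the handler is idle or a deadline passes.
// Resolves true if isBusy() became false in time, false if the timeout elapsed first.
const waitForIdle = async (isBusy, { timeoutMs = 25000, pollMs = 100 } = {}) => {
  const deadline = Date.now() + timeoutMs;
  while (isBusy()) {
    if (Date.now() >= deadline) return false;
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
  return true;
};

// Possible usage (assumes the app object and its isHandlingMessage flag
// from this article):
//
// process.once('SIGTERM', async () => {
//   app.stop(); // stop polling for new messages
//   const drained = await waitForIdle(() => app.isHandlingMessage);
//   process.exit(drained ? 0 : 1);
// });
```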

The Complete Lifecycle

const handleMessage = async (message) => {
  app.isHandlingMessage = true;
  let job = { id: null };
  
  try {
    // Process the message
    await handle(message, job);
    
  } finally {
    // Cleanup and notifications
    if (job.id != null) {
      job = await Job.get(job.id);
      await snsNotification(JSON.stringify(job));
    }
    
    app.isHandlingMessage = false;
    
    // Announce readiness to shut down only after cleanup has finished
    if (!app.isRunning) {
      app.emit("unalive");
    }
  }
};

Benefits

🛡️ Zero Message Loss:

  • Tasks finish processing before termination
  • No interrupted booking requests
  • Graceful handling of long-running operations

⚡ Zero-Downtime Deployments:

  • New tasks start while old tasks finish
  • Seamless version transitions
  • No service interruption

📊 Better Observability:

// You can see protection status in logs
Logger.info("taskEnableProtection");   // Task is now protected
Logger.info("taskDisableProtection");  // Task ready for termination

🎯 Predictable Behavior:

  • Tasks always complete their work
  • Clean shutdown sequences
  • Consistent error handling

Production Results

From processing €27M+ in bookings:

  • Zero booking losses during deployments
  • 99.9% success rate maintained during updates
  • Graceful shutdowns in 100% of deployments
  • No customer impact from infrastructure changes

Task Protection Limits

Important constraints:

  • Protection expiry: 2 hours by default, configurable up to 48 hours via expiresInMinutes
  • Protection scope: Only prevents termination by ECS service scale-in events and deployments
  • No protection from: Instance termination, AZ failures
  • Best for: Message processing, not long-running jobs
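
The protection window can be set per call. UpdateTaskProtection accepts an expiresInMinutes parameter, which defaults to 120 and has a documented range of 1 to 2,880 minutes (48 hours). The sketch below builds the request parameters and clamps the value into that range. buildProtectionParams is a hypothetical helper, not part of the AWS SDK.

```javascript
// Sketch: build updateTaskProtection parameters with an explicit expiry.
const MIN_EXPIRY_MINUTES = 1;
const MAX_EXPIRY_MINUTES = 2880; // 48 hours

const buildProtectionParams = ({ cluster, taskArn, enabled, expiresInMinutes }) => {
  const params = { cluster, tasks: [taskArn], protectionEnabled: enabled };

  // The expiry only makes sense when enabling protection; omitting it
  // leaves the service default of 120 minutes in effect.
  if (enabled && expiresInMinutes !== undefined) {
    const minutes = Math.round(expiresInMinutes);
    params.expiresInMinutes = Math.min(
      MAX_EXPIRY_MINUTES,
      Math.max(MIN_EXPIRY_MINUTES, minutes)
    );
  }
  return params;
};
```

Choosing an expiry close to the longest expected processing time means a crashed or hung handler does not hold up a deployment for the full default window.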

When to Use Task Protection

Perfect for:

  • SQS message processing
  • Financial transactions
  • Critical business operations
  • Any work that can’t be safely interrupted

Not needed for:

  • Stateless HTTP APIs
  • Idempotent operations
  • Quick processing (<30 seconds)
  • Operations with external timeouts

Alternative Patterns

For longer operations:

// Break work into smaller chunks
const processLargeDataset = async (dataset) => {
  const chunks = chunkArray(dataset, 100);
  
  for (const chunk of chunks) {
    await taskEnableProtection();
    try {
      await processChunk(chunk);
    } finally {
      // Release protection even if the chunk fails
      await taskDisableProtection();
    }
    
    // Allow graceful shutdown between chunks
    if (!app.isRunning) break;
  }
};

For HTTP APIs:

// Use ALB connection draining (deregistration delay) instead
// No task protection needed for stateless requests

The Philosophy

Containers should be responsible citizens.

Instead of forcing the orchestrator to guess when it’s safe to kill a container, let the container communicate its state clearly.

“I’m busy” vs “I’m ready” is better than “Kill me randomly”

Monitoring

// CloudWatch metrics for task protection
await cloudWatch.putMetricData({
  Namespace: 'ECS/TaskProtection',
  MetricData: [{
    MetricName: 'ProtectedTasks',
    Value: await taskProtectionEnabled() ? 1 : 0,
    Dimensions: [
      { Name: 'ServiceName', Value: SERVICE_NAME },
      { Name: 'TaskId', Value: TASK_ID }
    ]
  }]
}).promise(); // in SDK v2 the request is only sent once .promise() (or .send()) is called

Conclusion

ECS Task Protection is the missing piece for reliable container deployments.

It bridges the gap between “fire and forget” deployments and “zero data loss” requirements.

Your containers can finally be polite: “Please wait, I’m finishing something important.”


This pattern has enabled thousands of zero-downtime deployments while processing millions of critical business transactions.