Self-Managing ECS Tasks
How ECS Task Protection enables zero-downtime deployments and prevents message loss
The Deployment Problem
You deploy a new version of your booking consumer. ECS starts new tasks and kills old ones. But what happens to the booking request that was being processed when the old task got terminated?
Traditional result: Message lost, customer booking fails, support ticket created.
The ECS Task Protection Solution
The Task Protection API lets a container tell ECS: “Don’t kill me, I’m working on something important.”
// The beautiful simplicity
.on("message_received", async () => {
  await taskEnableProtection(); // "Don't kill me, I'm working"
})
.on("message_processed", async () => {
  await taskDisableProtection(); // "OK, I'm ready to die"
})
How Task Protection Works
Without Protection:
ECS: "New deployment, killing task"
Task: "But I'm processing a €500 booking!"
ECS: "Too late" *SIGTERM, then SIGKILL once the stop timeout (30 seconds by default) runs out*
Result: Lost booking, angry customer
With Protection:
ECS: "New deployment, killing task"
Task: "I'm protected, wait please"
ECS: "OK, I'll wait until your protection expires (2 hours by default)"
Task: *finishes booking* "Done, kill me now"
ECS: "Thanks for being responsible" *SIGTERM*
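The exchange above maps onto a small HTTP call. Besides the SDK, ECS exposes a task protection endpoint inside the container through the ECS_AGENT_URI environment variable, so a task can change its own state without knowing its cluster or task ARN. The following is a minimal sketch, assuming Node.js 18+ for the global fetch; `setTaskProtection` and its options are illustrative names, not part of the original implementation.

```javascript
// Sketch: toggle protection through the ECS agent endpoint available inside the task.
// `setTaskProtection`, `agentUri` and `fetchImpl` are hypothetical names for this example.
const setTaskProtection = async (
  enabled,
  {
    expiresInMinutes,
    agentUri = process.env.ECS_AGENT_URI,
    fetchImpl = globalThis.fetch,
  } = {}
) => {
  if (!agentUri) {
    throw new Error("ECS_AGENT_URI is not set; is this running inside an ECS task?");
  }
  const body = { ProtectionEnabled: enabled };
  // ExpiresInMinutes only makes sense when enabling; ECS uses 120 when it is omitted.
  if (enabled && expiresInMinutes != null) {
    body.ExpiresInMinutes = expiresInMinutes;
  }
  const response = await fetchImpl(`${agentUri}/task-protection/v1/state`, {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!response.ok) {
    throw new Error(`Task protection request failed with HTTP ${response.status}`);
  }
  return response.json();
};
```

Passing `fetchImpl` explicitly keeps the function easy to exercise outside ECS, for example in a unit test with a stubbed fetch.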
Implementation
// Task protection helpers (AWS SDK for JavaScript v2)
const AWS = require("aws-sdk");
const ecs = new AWS.ECS();

// Ask ECS not to stop this task during scale-in or deployments
const taskEnableProtection = async () => {
  const taskArn = await getTaskArn();
  await ecs.updateTaskProtection({
    cluster: CLUSTER_NAME,
    tasks: [taskArn],
    protectionEnabled: true
  }).promise();
};

// Allow ECS to stop this task again
const taskDisableProtection = async () => {
  const taskArn = await getTaskArn();
  await ecs.updateTaskProtection({
    cluster: CLUSTER_NAME,
    tasks: [taskArn],
    protectionEnabled: false
  }).promise();
};

// Read the current protection state. GetTaskProtection reports it;
// the DescribeTasks response does not carry a protection flag.
const taskProtectionEnabled = async () => {
  const taskArn = await getTaskArn();
  const result = await ecs.getTaskProtection({
    cluster: CLUSTER_NAME,
    tasks: [taskArn]
  }).promise();
  return result.protectedTasks?.[0]?.protectionEnabled ?? false;
};
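The helpers call `getTaskArn()`, which is not shown in this article. One plausible way to write it is to read the task metadata endpoint (version 4), whose `/task` response includes a `TaskARN` field. This is a sketch under that assumption, using Node.js 18+ for the global fetch; the cache variable and option names are hypothetical.

```javascript
// Sketch: resolve the current task's ARN from the ECS task metadata endpoint v4.
// The ARN never changes during a task's lifetime, so it is cached after the first lookup.
let cachedTaskArn = null;

const getTaskArn = async ({
  metadataUri = process.env.ECS_CONTAINER_METADATA_URI_V4,
  fetchImpl = globalThis.fetch,
} = {}) => {
  if (cachedTaskArn) return cachedTaskArn;
  if (!metadataUri) {
    throw new Error("ECS_CONTAINER_METADATA_URI_V4 is not set");
  }
  const response = await fetchImpl(`${metadataUri}/task`);
  if (!response.ok) {
    throw new Error(`Metadata request failed with HTTP ${response.status}`);
  }
  const metadata = await response.json();
  cachedTaskArn = metadata.TaskARN;
  return cachedTaskArn;
};
```

Caching avoids an extra HTTP round trip on every message, which matters because the helpers resolve the ARN before each protection call.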
Consumer Integration
const { Consumer } = require("sqs-consumer");

const app = Consumer.create({
  queueUrl: BOOKING_QUEUE,
  handleMessage: processBookingMessage
})
  .on("message_received", async (message) => {
    Logger.info("message_received");
    // Enable protection before processing
    if (!isLocal) {
      Logger.info("taskEnableProtection");
      await taskEnableProtection();
    }
    Logger.debug(message);
  })
  .on("message_processed", async (message) => {
    Logger.info("message_processed");
    // Disable protection after processing
    if (!isLocal && await taskProtectionEnabled()) {
      Logger.info("taskDisableProtection");
      await taskDisableProtection();
    }
  })
  .on("processing_error", async (err) => {
    Logger.error("processing_error", err.message);
    // Always disable protection when a message fails
    if (!isLocal && await taskProtectionEnabled()) {
      Logger.info("taskDisableProtection");
      await taskDisableProtection();
    }
  })
  .on("error", async (err) => {
    Logger.error("error", err.message);
    // Always disable protection on consumer errors
    if (!isLocal && await taskProtectionEnabled()) {
      Logger.info("taskDisableProtection");
      await taskDisableProtection();
    }
  });
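Two details of the event-based wiring deserve care. Node.js does not await EventEmitter listeners, so processing can begin before the enable call has returned. And if several messages are in flight at once, the first `message_processed` event would drop protection for the rest. A minimal sketch of one way to handle both is to count in-flight messages and wrap the handler itself; `createProtectionTracker` and `withProtection` are hypothetical names, and `enable` and `disable` stand in for `taskEnableProtection` and `taskDisableProtection`.

```javascript
// Sketch: keep protection on while at least one message is in flight.
const createProtectionTracker = ({ enable, disable }) => {
  let inFlight = 0;
  // Serialize state changes so an enable and a disable never overlap.
  let queue = Promise.resolve();
  const enqueue = (fn) => {
    const result = queue.then(fn);
    queue = result.catch(() => {}); // keep the chain usable after a failure
    return result;
  };
  return {
    acquire: () =>
      enqueue(async () => {
        // Enable first: if the API call throws, the message is not counted.
        if (inFlight === 0) await enable();
        inFlight += 1;
      }),
    release: () =>
      enqueue(async () => {
        if (inFlight === 0) return;
        inFlight -= 1;
        if (inFlight === 0) await disable();
      }),
    get inFlight() {
      return inFlight;
    },
  };
};

// Wrap a message handler so protection is awaited before work starts
// and released afterwards, whether the handler succeeds or throws.
const withProtection = (tracker, handler) => async (message) => {
  await tracker.acquire();
  try {
    return await handler(message);
  } finally {
    await tracker.release();
  }
};
```

With this in place, the consumer could be created with `handleMessage: withProtection(tracker, processBookingMessage)`, leaving the event listeners for logging only. If the final disable call fails, protection still lapses on its own when it expires.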
Signal Handling
app
  .on("signal", (signal) => {
    switch (signal) {
      case 'SIGINT':
        if (app.isRunning) app.stop();
        break;
      case 'SIGTERM':
        Logger.info("This task is about to terminate");
        if (app.isRunning) {
          // Stop polling for new messages; in-flight work continues
          app.stop();
        }
        break;
      default:
        Logger.info(`Unhandled signal (${signal}), ignore`);
    }
  })
  .once("unalive", () => {
    Logger.info("**UNALIVE** event, ready to move to a better life");
  });

// Forward process signals to the consumer as a custom "signal" event
const signalHandler = (signal) => {
  Logger.info(signal, "signal received");
  app.emit('signal', signal);
};
process.on('SIGINT', signalHandler);
process.on('SIGTERM', signalHandler);
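After SIGTERM, ECS sends SIGKILL once the container's stop timeout elapses (30 seconds by default), so the process has a bounded window to finish. A small polling helper can make that window explicit. This is a sketch only: `waitForIdle` and its options are hypothetical names, and the 25-second default is an assumed margin under the default stop timeout.

```javascript
// Sketch: wait until in-flight work finishes or a deadline passes.
// Resolves to true when idle, false when the deadline was reached first.
const waitForIdle = async (isBusy, { timeoutMs = 25000, intervalMs = 100 } = {}) => {
  const deadline = Date.now() + timeoutMs;
  while (isBusy()) {
    if (Date.now() >= deadline) return false; // gave up: work is still in flight
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return true; // nothing in flight: safe to exit
};

// Possible use in the SIGTERM branch, assuming the `app` object from this article:
//   app.stop();
//   const idle = await waitForIdle(() => app.isHandlingMessage);
//   process.exit(idle ? 0 : 1);
```

Exiting as soon as the consumer is idle also shortens deployments, because ECS does not have to wait for the full stop timeout.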
The Complete Lifecycle
const handleMessage = async (message) => {
  app.isHandlingMessage = true;
  let job = { id: null };
  try {
    // Process the message
    await handle(message, job);
  } finally {
    // Cleanup and notifications
    if (job.id != null) {
      job = await Job.get(job.id);
      await snsNotification(JSON.stringify(job));
    }
    app.isHandlingMessage = false;
    // Announce shutdown readiness only after cleanup has finished
    if (!app.isRunning) {
      app.emit("unalive");
    }
  }
};
Benefits
🛡️ Zero Message Loss:
- Tasks finish processing before termination
- No interrupted booking requests
- Graceful handling of long-running operations
⚡ Zero-Downtime Deployments:
- New tasks start while old tasks finish
- Seamless version transitions
- No service interruption
📊 Better Observability:
// You can see protection status in logs
Logger.info("taskEnableProtection"); // Task is now protected
Logger.info("taskDisableProtection"); // Task ready for termination
🎯 Predictable Behavior:
- Tasks always complete their work
- Clean shutdown sequences
- Consistent error handling
Production Results
From processing €27M+ in bookings:
- Zero booking losses during deployments
- 99.9% success rate maintained during updates
- Graceful shutdowns in 100% of deployments
- No customer impact from infrastructure changes
Task Protection Limits
Important constraints:
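- Default protection time: 2 hours (expiresInMinutes accepts 1 minute to 48 hours)
- Protection scope: Only prevents ECS termination during service auto scaling scale-in and deployments
- No protection against: Instance termination, AZ failures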
- Best for: Message processing, not long-running jobs
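The expiry window is worth setting deliberately. UpdateTaskProtection accepts an `expiresInMinutes` value between 1 and 2,880 (48 hours) and falls back to 120 when it is omitted. A short window limits how long a hung task can block a deployment. The sketch below derives the value from an expected processing time; `buildProtectionParams`, `expectedSeconds` and `safetyFactor` are hypothetical names chosen for this example.

```javascript
// Sketch: build UpdateTaskProtection parameters with an explicit expiry.
const MIN_EXPIRY_MINUTES = 1;
const MAX_EXPIRY_MINUTES = 2880; // 48 hours

const buildProtectionParams = ({ cluster, taskArn, expectedSeconds, safetyFactor = 2 }) => {
  // Leave headroom over the expected duration, rounded up to whole minutes.
  const minutes = Math.ceil((expectedSeconds * safetyFactor) / 60);
  // Clamp into the range the API accepts.
  const expiresInMinutes = Math.min(
    MAX_EXPIRY_MINUTES,
    Math.max(MIN_EXPIRY_MINUTES, minutes)
  );
  return { cluster, tasks: [taskArn], protectionEnabled: true, expiresInMinutes };
};

// A message expected to take 90 seconds gets a 3-minute protection window.
const params = buildProtectionParams({
  cluster: "bookings",
  taskArn: "arn:aws:ecs:eu-west-1:123456789012:task/bookings/example",
  expectedSeconds: 90,
});
console.log(params.expiresInMinutes); // 3
```

The resulting object can be passed to `ecs.updateTaskProtection(params).promise()` in place of the fixed parameters used by the helpers.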
When to Use Task Protection
Perfect for:
- SQS message processing
- Financial transactions
- Critical business operations
- Any work that can’t be safely interrupted
Not needed for:
- Stateless HTTP APIs
- Idempotent operations
- Quick processing (<30 seconds)
- Operations with external timeouts
Alternative Patterns
For longer operations:
// Break work into smaller chunks
const processLargeDataset = async (dataset) => {
  const chunks = chunkArray(dataset, 100);
  for (const chunk of chunks) {
    await taskEnableProtection();
    try {
      await processChunk(chunk);
    } finally {
      // Release protection even if the chunk fails
      await taskDisableProtection();
    }
    // Allow graceful shutdown between chunks
    if (!app.isRunning) break;
  }
};
For HTTP APIs:
// Use ALB connection draining instead
// No task protection needed for stateless requests
The Philosophy
Containers should be responsible citizens.
Instead of forcing the orchestrator to guess when it’s safe to kill a container, let the container communicate its state clearly.
“I’m busy” vs “I’m ready” is better than “Kill me randomly”
Monitoring
// CloudWatch metrics for task protection
await cloudWatch.putMetricData({
  Namespace: 'ECS/TaskProtection',
  MetricData: [{
    MetricName: 'ProtectedTasks',
    Value: await taskProtectionEnabled() ? 1 : 0,
    Dimensions: [
      { Name: 'ServiceName', Value: SERVICE_NAME },
      { Name: 'TaskId', Value: TASK_ID }
    ]
  }]
}).promise(); // with SDK v2 the request is only sent once .promise() is called
Conclusion
ECS Task Protection is the missing piece for reliable container deployments.
It bridges the gap between “fire and forget” deployments and “zero data loss” requirements.
Your containers can finally be polite: “Please wait, I’m finishing something important.”
This pattern has enabled thousands of zero-downtime deployments while processing millions of critical business transactions.