Lambda Performance Deep Dive
Container Images, Raw TCP, and the UPX Trap
I built lambda-shell-runtime, a custom AWS Lambda runtime that lets you write serverless functions in Bash. It worked great, but I discovered something that challenged conventional wisdom about Lambda packaging formats.
The Journey: From Pure Bash to Hybrid Architecture
Initially, my shell runtime used pure Bash for everything - including communication with the Lambda Runtime API using curl. This worked, but the overhead was noticeable:
# Pure Bash approach - lots of process spawning
# -i includes the response headers, which carry the request ID
response=$(curl -si "$AWS_LAMBDA_RUNTIME_API/2018-06-01/runtime/invocation/next")
request_id=$(echo "$response" | grep -i lambda-runtime-aws-request-id | cut -d' ' -f2 | tr -d '\r')
# ... more parsing and HTTP calls
Every Lambda invocation spawned multiple processes for HTTP communication and JSON parsing. The performance impact was clear.
The Hybrid Solution: Go + Bash
To eliminate this overhead, I created a hybrid approach:
- Go binary handles Lambda Runtime API communication (fast HTTP client)
- Bash functions handle business logic (simple scripting)
// Fast Go HTTP client for Lambda API
func (c *runtimeAPIClient) getNextInvocation() (string, []byte, error) {
	resp, err := c.httpClient.Get(c.baseURL + "next")
	// ... handle response
}

// Execute shell function for business logic
func executeShellHandler(handlerFile, handlerFunc string, eventData []byte) ([]byte, error) {
	shellCmd := fmt.Sprintf("source %s && %s", handlerFile, handlerFunc)
	cmd := exec.Command("bash", "-c", shellCmd)
	cmd.Stdin = bytes.NewReader(eventData)
	return cmd.Output()
}
This hybrid runtime delivered ~30ms cold start times when packaged as a container image.
The Experiment: Do We Even Need Custom Runtimes?
Then I had a thought: “What if I’m overcomplicating this?”
AWS provides provided.al2023 - an OS-only runtime where you can bring your own bootstrap. Instead of maintaining a custom runtime, I could:
- Compile my Go bootstrap to a bootstrap binary
- Package it with my shell handler in a ZIP file
- Use the standard provided.al2023 runtime
This would eliminate the need for my custom runtime entirely. Time to test this theory.
The Surprising Results
I benchmarked two approaches with identical code:
Both variants used the exact same Go bootstrap binary, the same shell handler logic, identical memory settings, and were executed under the same cold-start testing conditions. The only variable was the packaging format.
Approach 1: ZIP Package + provided.al2023
- 5MB Go bootstrap binary (compiled with -ldflags="-w -s")
- Shell handler script
- Standard provided.al2023 runtime
Approach 2: Container Image
- Same 5MB Go bootstrap
- Same shell handler
- Custom container image
The Benchmark Results
I ran controlled benchmarks with 60 cold start measurements for each approach. The results were striking:
ZIP Package Performance (provided.al2023)
{
"init_count": 60,
"init_total_ms": 2557.03,
"init_average_ms": 42.61,
"p20_ms": 40.41,
"p40_ms": 40.92,
"p60_ms": 41.16,
"p80_ms": 47.30
}
Container Image Performance (Optimized)
{
"init_average_ms": 27.68,
"p20_ms": 20.91,
"p40_ms": 24.01,
"p60_ms": 27.74,
"p80_ms": 32.50
}
Complete Performance Comparison
| Metric | ZIP Package | Container (Original) | Container (Optimized) | Raw TCP | Raw TCP + UPX |
|---|---|---|---|---|---|
| Average Init | 42.61ms | 33.51ms | 27.68ms | 21.31ms | 56.01ms |
| P20 | 40.41ms | 25.71ms | 20.91ms | 19.99ms | 51.51ms |
| P40 | 40.92ms | 28.21ms | 24.01ms | 20.24ms | 53.24ms |
| P60 | 41.16ms | 31.88ms | 27.74ms | 20.82ms | 56.53ms |
| P80 | 47.30ms | 36.46ms | 32.50ms | 23.74ms | 60.24ms |
Key findings:
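- Container images are 21-36% faster across all percentiles (original container image vs ZIP)
- Lower latency at every percentile: container init times span 25-36ms (P20-P80) vs 40-47ms for ZIP. The ZIP spread is actually narrower, so the container's advantage is speed rather than variance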
- Consistent advantage: No scenario where ZIP packages performed better
Expected result: ZIP package should be faster (conventional wisdom)
Actual result: Container images consistently outperformed ZIP packages by a significant margin.
Why Container Images Won
This doesn’t mean container images are always faster — for very small deployments (typically under ~1MB), the difference is often negligible and ZIP packages may still be the simpler choice.
This result challenges the typical assumption that ZIP packages are always faster. Here’s what I believe is happening:
ZIP Package Overhead (provided.al2023)
- Download phase: Lambda downloads 5MB bootstrap from S3 during cold start
- Extraction phase: Unzip and extract files to container filesystem
- Permission setup: Configure file permissions and execution context
- Process startup: Launch the bootstrap binary
Container Image Advantages
- Pre-built layers: Bootstrap is already in optimized image layers
- No runtime I/O: No S3 download or extraction during cold start
- Optimized filesystem: File permissions and structure pre-configured
- Layer caching: Lambda’s container infrastructure efficiently caches layers
The 5MB Threshold Theory
The key insight: there’s a crossover point where container images become more efficient than ZIP packages.
For small deployments (< 1MB), ZIP extraction is negligible. But as your bootstrap grows:
- ZIP: Linear increase in download + extraction time
- Container: Constant startup time (layers are pre-cached)
My 5MB Go binary crossed this threshold. The network I/O and filesystem operations for ZIP extraction exceeded the container startup overhead.
Implications for Lambda Architecture
When to Choose Container Images
- Large custom runtimes (> 2-3MB)
- Complex dependencies that benefit from pre-installation
- Custom system configurations that can be baked into the image
When ZIP Packages Still Win
- Small, simple functions (< 1MB)
- Frequent code changes (faster deployment)
- Standard runtime compatibility requirements
The Optimization Deep Dive: Raw TCP vs HTTP Client
After discovering container images outperformed ZIP packages, I wondered: “Can we optimize the runtime itself?”
The Go bootstrap was using Go’s standard net/http package for Lambda Runtime API communication. While robust, it’s heavy - the HTTP client alone adds ~3MB to the binary and significant initialization overhead.
The Raw TCP Socket Experiment
I replaced the HTTP client with raw TCP sockets:
// Before: Heavy HTTP client
resp, err := c.httpClient.Get(c.baseURL + "next")
// After: Raw TCP socket
conn, err := net.Dial("tcp", c.host)
fmt.Fprintf(conn, "GET /2018-06-01/runtime/invocation/next HTTP/1.1\r\nHost: %s\r\n\r\n", c.host)
This eliminated the entire net/http package dependency, reducing the binary from 5.7MB to 2.3MB.
The UPX Compression Trap
With a smaller binary, I tried UPX compression to reduce it further:
- Binary size: 2.3MB → 676KB (70% reduction)
- Cold start performance: Actually got worse!
Performance Results
| Approach | Binary Size | Avg Init Time | Performance |
|---|---|---|---|
| HTTP Client | 5.7MB | ~42ms | Baseline |
| Raw TCP | 2.3MB | 21ms | 50% faster |
| Raw TCP + UPX | 676KB | ~56ms | 33% slower than baseline |
Runtime Optimization Comparison
| Metric | HTTP Client | Raw TCP | Raw TCP + UPX | Best Performance |
|---|---|---|---|---|
| Binary Size | 5.7MB | 2.3MB | 676KB | Raw TCP + UPX |
| Average Init | ~42ms | 21.31ms | 56.01ms | Raw TCP |
| P20 | ~40ms | 19.99ms | 51.51ms | Raw TCP |
| P40 | ~41ms | 20.24ms | 53.24ms | Raw TCP |
| P60 | ~42ms | 20.82ms | 56.53ms | Raw TCP |
| P80 | ~47ms | 23.74ms | 60.24ms | Raw TCP |
Why UPX Backfired Dramatically
UPX compression didn’t just hurt performance - it made it 33% worse than the original HTTP client:
- Average: 56.01ms vs 42ms baseline (33% slower)
- P80: 60.24ms vs 47ms baseline (28% slower)
- Decompression penalty: ~35ms overhead per cold start
The performance penalty was much worse than expected because:
- Heavy decompression cost: UPX decompression takes significant CPU time
- ARM64 architecture: Decompression is likely slower on Lambda’s arm64 processors
- Memory pressure: Decompression requires additional memory allocation during init
- No caching benefit: Each container instance must decompress independently
Critical insight: In Lambda’s execution model, a 70% file size reduction led to a roughly 160% increase in average init time relative to the uncompressed raw TCP binary (21.31ms to 56.01ms).
The Meta-Lesson
This experiment taught me something valuable: sometimes the best way to validate your architecture is to try to replace it.
I set out to prove my custom runtime was unnecessary, but instead discovered:
- The hybrid Go+Bash approach has real performance benefits
- Container images can outperform ZIP packages for larger deployments
- Conventional wisdom doesn’t always apply at scale
Measuring Your Own Workloads
If you’re curious about your own Lambda performance characteristics:
# Get Lambda performance stats
aws logs tail --since 1h "/aws/lambda/your-function" \
| grep "REPORT" \
| grep -o -E 'Init Duration: (.+) ms' \
| cut -d' ' -f 3
This extracts just the init duration values from CloudWatch logs, which you can then analyze for averages, percentiles, and trends.
Benchmark both packaging approaches with your actual code. The results might surprise you.
Conclusion
This journey from pure Bash to optimized hybrid runtime revealed multiple performance insights that challenge conventional Lambda wisdom:
- Container images can outperform ZIP packages for larger runtimes (>2-3MB)
- Raw TCP sockets cut average init time by about 50% compared with the net/http client (~42ms to 21ms)
- File compression can hurt performance - UPX made things worse, not better
- Code simplicity often beats size optimization in Lambda’s execution model
The Lambda ecosystem is more nuanced than simple rules. As we build sophisticated serverless applications, understanding these performance characteristics becomes crucial.
My lambda-shell-runtime project started as a way to bring Bash scripting to serverless. It ended up revealing that the most valuable discoveries come from systematically challenging your assumptions with real benchmarks.
Sometimes the best way to validate your architecture is to try to replace it.
Want to experiment with hybrid Lambda runtimes? Check out the lambda-shell-runtime project and the benchmark code used in this analysis.