by skunxicat

Lambda Performance Deep Dive

Container Images, Raw TCP, and the UPX Trap

I built lambda-shell-runtime, a custom AWS Lambda runtime that lets you write serverless functions in Bash. It worked great, but I discovered something that challenged conventional wisdom about Lambda packaging formats.

The Journey: From Pure Bash to Hybrid Architecture

Initially, my shell runtime used pure Bash for everything - including communication with the Lambda Runtime API using curl. This worked, but the overhead was noticeable:

# Pure Bash approach - lots of process spawning
headers=$(mktemp)
event=$(curl -sS -D "$headers" "http://$AWS_LAMBDA_RUNTIME_API/2018-06-01/runtime/invocation/next")
request_id=$(grep -i '^lambda-runtime-aws-request-id:' "$headers" | cut -d' ' -f2 | tr -d '\r')
# ... more parsing and HTTP calls

Every Lambda invocation spawned multiple processes for HTTP communication and JSON parsing. The performance impact was clear.

The Hybrid Solution: Go + Bash

To eliminate this overhead, I created a hybrid approach:

  • Go binary handles Lambda Runtime API communication (fast HTTP client)
  • Bash functions handle business logic (simple scripting)

// Fast Go HTTP client for Lambda API
func (c *runtimeAPIClient) getNextInvocation() (string, []byte, error) {
    resp, err := c.httpClient.Get(c.baseURL + "next")
    // ... handle response
}

// Execute shell function for business logic  
func executeShellHandler(handlerFile, handlerFunc string, eventData []byte) ([]byte, error) {
    shellCmd := fmt.Sprintf("source %s && %s", handlerFile, handlerFunc)
    cmd := exec.Command("bash", "-c", shellCmd)
    cmd.Stdin = bytes.NewReader(eventData)
    return cmd.Output()
}

This hybrid runtime delivered ~30ms cold start times when packaged as a container image.

The Experiment: Do We Even Need Custom Runtimes?

Then I had a thought: “What if I’m overcomplicating this?”

AWS provides provided.al2023 - an OS-only runtime where you can bring your own bootstrap. Instead of maintaining a custom runtime, I could:

  1. Compile my Go bootstrap to a bootstrap binary
  2. Package it with my shell handler in a ZIP file
  3. Use the standard provided.al2023 runtime

This would eliminate the need for my custom runtime entirely. Time to test this theory.

The Surprising Results

I benchmarked two approaches with identical code. Both variants used the exact same Go bootstrap binary, the same shell handler logic, identical memory settings, and were executed under the same cold-start testing conditions. The only variable was the packaging format.

Approach 1: ZIP Package + provided.al2023

  • 5MB Go bootstrap binary (compiled with -ldflags="-w -s")
  • Shell handler script
  • Standard provided.al2023 runtime

Approach 2: Container Image

  • Same 5MB Go bootstrap
  • Same shell handler
  • Custom container image

The Benchmark Results

I ran controlled benchmarks with 60 cold start measurements for each approach. The results were striking:

ZIP Package Performance (provided.al2023)

{
  "init_count": 60,
  "init_total_ms": 2557.03,
  "init_average_ms": 42.61,
  "p20_ms": 40.41,
  "p40_ms": 40.92,
  "p60_ms": 41.16,
  "p80_ms": 47.30
}
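
For reference, summary fields like the ones in this JSON can be derived from raw init durations with a few lines of Go. This is a sketch only: the nearest-rank percentile method, the initStats type, and the sample values are assumptions for illustration, not the benchmark's actual code or data.

```go
package main

import (
	"fmt"
	"sort"
)

// initStats mirrors the shape of the summary JSON (hypothetical type).
type initStats struct {
	Count   int
	TotalMs float64
	AvgMs   float64
	P       map[int]float64 // percentile -> value in ms
}

// percentile returns the nearest-rank percentile of an ascending-sorted slice.
// The rank is ceil(p/100 * n), computed with integer arithmetic.
func percentile(sorted []float64, p int) float64 {
	if len(sorted) == 0 {
		return 0
	}
	rank := (p*len(sorted) + 99) / 100
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// summarize computes count, total, average, and P20/P40/P60/P80.
func summarize(samples []float64) initStats {
	sorted := append([]float64(nil), samples...) // leave the input untouched
	sort.Float64s(sorted)

	s := initStats{Count: len(sorted), P: map[int]float64{}}
	for _, v := range sorted {
		s.TotalMs += v
	}
	if s.Count > 0 {
		s.AvgMs = s.TotalMs / float64(s.Count)
	}
	for _, p := range []int{20, 40, 60, 80} {
		s.P[p] = percentile(sorted, p)
	}
	return s
}

func main() {
	// Made-up sample values, not measurements from this article.
	s := summarize([]float64{50, 10, 40, 20, 30})
	fmt.Printf("count=%d total=%.2f avg=%.2f p20=%.2f p80=%.2f\n",
		s.Count, s.TotalMs, s.AvgMs, s.P[20], s.P[80])
}
```

Other percentile definitions (for example, linear interpolation) give slightly different values on small samples, so it helps to use the same method for every variant being compared.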

Container Image Performance (Optimized)

{
  "init_average_ms": 27.68,
  "p20_ms": 20.91,
  "p40_ms": 24.01,
  "p60_ms": 27.74,
  "p80_ms": 32.50
}

Complete Performance Comparison

| Metric | ZIP Package | Container (Original) | Container (Optimized) | Raw TCP | Raw TCP + UPX |
|---|---|---|---|---|---|
| Average Init | 42.61ms | 33.51ms | 27.68ms | 21.31ms | 56.01ms |
| P20 | 40.41ms | 25.71ms | 20.91ms | 19.99ms | 51.51ms |
| P40 | 40.92ms | 28.21ms | 24.01ms | 20.24ms | 53.24ms |
| P60 | 41.16ms | 31.88ms | 27.74ms | 20.82ms | 56.53ms |
| P80 | 47.30ms | 36.46ms | 32.50ms | 23.74ms | 60.24ms |

Key findings:

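  • Container images are faster at every percentile: 21-36% for the original image and 31-48% for the optimized image
  • Lower range: the original container's P20-P80 range (25-36ms) sits entirely below the ZIP range (40-47ms), although the container spread is somewhat wider (about 11ms vs 7ms)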
  • Consistent advantage: No scenario where ZIP packages performed better
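
The relative differences can be checked directly against the comparison table. The sketch below computes how much lower each container figure is than the corresponding ZIP figure; pctFaster is a hypothetical helper, and the numbers are copied from the table above.

```go
package main

import "fmt"

// pctFaster reports how much lower candidateMs is than baselineMs, in percent.
func pctFaster(baselineMs, candidateMs float64) float64 {
	return (baselineMs - candidateMs) / baselineMs * 100
}

func main() {
	// Values copied from the comparison table: average, P20, P40, P60, P80.
	labels := []string{"avg", "p20", "p40", "p60", "p80"}
	zip := []float64{42.61, 40.41, 40.92, 41.16, 47.30}
	original := []float64{33.51, 25.71, 28.21, 31.88, 36.46}
	optimized := []float64{27.68, 20.91, 24.01, 27.74, 32.50}

	for i, label := range labels {
		fmt.Printf("%-3s  original: %.1f%%  optimized: %.1f%%\n",
			label, pctFaster(zip[i], original[i]), pctFaster(zip[i], optimized[i]))
	}
}
```

Against the original container build this gives roughly 21% to 36%; against the optimized build, roughly 31% to 48%.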

Expected result: ZIP package should be faster (conventional wisdom)

Actual result: Container images consistently outperformed ZIP packages by a significant margin.

Why Container Images Won

This doesn’t mean container images are always faster. For very small deployments (typically under ~1MB), the difference is often negligible and ZIP packages may still be the simpler choice.

This result challenges the typical assumption that ZIP packages are always faster. Here’s what I believe is happening:

ZIP Package Overhead (provided.al2023)

  1. Download phase: Lambda downloads 5MB bootstrap from S3 during cold start
  2. Extraction phase: Unzip and extract files to container filesystem
  3. Permission setup: Configure file permissions and execution context
  4. Process startup: Launch the bootstrap binary

Container Image Advantages

  1. Pre-built layers: Bootstrap is already in optimized image layers
  2. No runtime I/O: No S3 download or extraction during cold start
  3. Optimized filesystem: File permissions and structure pre-configured
  4. Layer caching: Lambda’s container infrastructure efficiently caches layers

The 5MB Threshold Theory

The key insight: there’s a crossover point where container images become more efficient than ZIP packages.

For small deployments (< 1MB), ZIP extraction is negligible. But as your bootstrap grows:

  • ZIP: Linear increase in download + extraction time
  • Container: Constant startup time (layers are pre-cached)

My 5MB Go binary crossed this threshold. The network I/O and filesystem operations for ZIP extraction exceeded the container startup overhead.

Implications for Lambda Architecture

When to Choose Container Images

  • Large custom runtimes (> 2-3MB)
  • Complex dependencies that benefit from pre-installation
  • Custom system configurations that can be baked into the image

When ZIP Packages Still Win

  • Small, simple functions (< 1MB)
  • Frequent code changes (faster deployment)
  • Standard runtime compatibility requirements

The Optimization Deep Dive: Raw TCP vs HTTP Client

After discovering container images outperformed ZIP packages, I wondered: “Can we optimize the runtime itself?”

The Go bootstrap was using Go’s standard net/http package for Lambda Runtime API communication. While robust, it’s heavy - the HTTP client alone adds ~3MB to the binary and significant initialization overhead.

The Raw TCP Socket Experiment

I replaced the HTTP client with raw TCP sockets:

// Before: Heavy HTTP client
resp, err := c.httpClient.Get(c.baseURL + "next")

// After: Raw TCP socket
conn, err := net.Dial("tcp", c.host)
fmt.Fprintf(conn, "GET /2018-06-01/runtime/invocation/next HTTP/1.1\r\nHost: %s\r\n\r\n", c.host)

This eliminated the entire net/http package dependency, reducing the binary from 5.7MB to 2.3MB.

The UPX Compression Trap

With a smaller binary, I tried UPX compression to reduce it further:

  • Binary size: 2.3MB → 676KB (70% reduction)
  • Cold start performance: Actually got worse!

Performance Results

| Approach | Binary Size | Avg Init Time | Performance |
|---|---|---|---|
| HTTP Client | 5.7MB | ~42ms | Baseline |
| Raw TCP | 2.3MB | 21ms | 50% faster |
| Raw TCP + UPX | 676KB | ~56ms | Slower than baseline |

Runtime Optimization Comparison

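| Metric | HTTP Client | Raw TCP | Raw TCP + UPX | Best Performance |
|---|---|---|---|---|
| Binary Size | 5.7MB | 2.3MB | 676KB | Raw TCP + UPX |
| Average Init | ~42ms | 21.31ms | 56.01ms | Raw TCP |
| P20 | ~40ms | 19.99ms | 51.51ms | Raw TCP |
| P40 | ~41ms | 20.24ms | 53.24ms | Raw TCP |
| P60 | ~42ms | 20.82ms | 56.53ms | Raw TCP |
| P80 | ~47ms | 23.74ms | 60.24ms | Raw TCP |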

Why UPX Backfired Dramatically

UPX compression didn’t just hurt performance - it made it 33% worse than the original HTTP client:

  • Average: 56.01ms vs 42ms baseline (33% slower)
  • P80: 60.24ms vs 47ms baseline (28% slower)
  • Decompression penalty: ~35ms overhead per cold start

The performance penalty was much worse than I expected. The likely causes:

  1. Heavy decompression cost: UPX unpacks the binary at startup, which takes significant CPU time
  2. Lambda’s ARM64 architecture: Decompression may be slower on ARM processors
  3. Memory pressure: Decompression requires additional memory allocation during init
  4. No caching benefit: Each container instance must decompress independently

Critical insight: In Lambda’s execution model, a 70% reduction in file size came with a roughly 160% increase in average init time (21.31ms to 56.01ms).

The Meta-Lesson

This experiment taught me something valuable: sometimes the best way to validate your architecture is to try to replace it.

I set out to prove my custom runtime was unnecessary, but instead discovered:

  1. The hybrid Go+Bash approach has real performance benefits
  2. Container images can outperform ZIP packages for larger deployments
  3. Conventional wisdom doesn’t always apply at scale

Measuring Your Own Workloads

If you’re curious about your own Lambda performance characteristics:

# Get Lambda performance stats
aws logs tail --since 1h "/aws/lambda/your-function" \
  | grep "REPORT" \
  | grep -o -E 'Init Duration: [0-9.]+ ms' \
  | cut -d' ' -f 3

This extracts just the init duration values from CloudWatch logs, which you can then analyze for averages, percentiles, and trends.

Benchmark both packaging approaches with your actual code. The results might surprise you.

Conclusion

This journey from pure Bash to optimized hybrid runtime revealed multiple performance insights that challenge conventional Lambda wisdom:

  1. Container images can outperform ZIP packages for larger runtimes (>2-3MB)
  2. Raw TCP sockets delivered roughly 50% faster cold starts than the net/http client in this benchmark (21ms vs ~42ms)
  3. File compression can hurt performance - UPX made things worse, not better
  4. Code simplicity often beats size optimization in Lambda’s execution model

The Lambda ecosystem is more nuanced than simple rules. As we build sophisticated serverless applications, understanding these performance characteristics becomes crucial.

My lambda-shell-runtime project started as a way to bring Bash scripting to serverless. It ended up revealing that the most valuable discoveries come from systematically challenging your assumptions with real benchmarks.

Sometimes the best way to validate your architecture is to try to replace it.


Want to experiment with hybrid Lambda runtimes? Check out the lambda-shell-runtime project and the benchmark code used in this analysis.