Error Handling & Retries (AWS & GCP)

Resilient systems don't ignore errors—they anticipate, handle, and learn from them. Both AWS Lambda and Google Cloud Functions provide mechanisms for retrying failed operations, routing errors to queues, and logging failures. Understanding these patterns helps you build systems that recover gracefully.


Simple Explanation

What it is

Error handling is how your system reacts when something fails. Retries are how it tries again safely.

Why we need it

Cloud services fail sometimes. Without a plan, a small failure turns into a full outage.

Benefits

  • Higher reliability when transient failures occur.
  • Safer recovery through retries and dead-letter queues.
  • Clearer debugging because errors are captured and tracked.

Tradeoffs

  • More complexity in control flow and testing.
  • Risk of duplicate work if idempotency is not handled.

Real-world examples (architecture only)

  • Payment failure -> Retry with backoff -> DLQ after max retries.
  • File processing error -> Send to error queue for manual review.

Part 1: AWS Error Handling

Try-Catch Basics

Always wrap code that might fail:

import json


def handler(event, context):
    try:
        result = risky_operation()
        return {"statusCode": 200, "body": json.dumps(result)}
    except Exception as exc:
        print("Error:", exc)
        return {"statusCode": 500, "body": json.dumps({"error": str(exc)})}

Error Types

Application Errors

Bug in your code:

# ❌ Forgot to parse
data = event.get("body")  # Still a JSON string
count = len(data) + 1  # Counts characters, not items

# ✅ Parse first
data = json.loads(event.get("body") or "{}")
count = len(data.get("items", []))

Service Errors

AWS service temporarily unavailable:

# DynamoDB might be busy
try:
    result = ddb.get_item(**params)
except Exception as exc:
    code = getattr(exc, "response", {}).get("Error", {}).get("Code")
    if code == "ProvisionedThroughputExceededException":
        # Too many requests to DynamoDB
        # Retry with exponential backoff
        pass

Configuration Errors

Missing environment variables:

import os

# ❌ Fails silently if env var missing
table_name = os.environ.get("TABLE_NAME")

# ✅ Fail fast with clear error
table_name = os.environ.get("TABLE_NAME")
if not table_name:
    raise RuntimeError("TABLE_NAME environment variable not set")

Retry Strategy

Manual Retry

import time


def with_retry(fn, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            delay = 2 ** attempt
            print(f"Attempt {attempt} failed, retrying in {delay}s")
            time.sleep(delay)


def handler(event, context):
    result = with_retry(lambda: ddb.get_item(**params))
    return result

Exponential Backoff

Wait longer between retries:

Attempt 1: Fail
Wait 2 seconds
Attempt 2: Fail
Wait 4 seconds
Attempt 3: Fail
Wait 8 seconds
Attempt 4: Fail
Give up

This reduces load on an overwhelmed service.
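A minimal helper that produces this schedule (the base and cap values here are illustrative, not platform defaults):

def backoff_delay(attempt, base=2, cap=30):
    # Delay doubles each attempt (2s, 4s, 8s, ...) but is capped so callers never wait too long
    return min(cap, base ** attempt)


print([backoff_delay(n) for n in range(1, 4)])  # [2, 4, 8]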

Jitter

Add randomness to prevent thundering herd:

# Without jitter: All instances retry at same time
# With jitter: Stagger retry times

import random

jitter = random.random()
delay = (2 ** attempt) + jitter

Lambda Retry Behavior

Sync vs. Async

Synchronous (API Gateway):

  • You wait for response
  • No automatic retry
  • Handle errors yourself

Asynchronous (S3, SNS triggers):

  • Fire and forget
  • Lambda automatically retries twice
  • Failed events go to DLQ

Configure Async Behavior

OrderProcessingFunction:
  Type: AWS::Serverless::Function
  Properties:
    EventInvokeConfig:
      MaximumEventAgeInSeconds: 3600  # Max age 1 hour
      MaximumRetryAttempts: 2
      DestinationConfig:
        OnFailure:
          Type: SQS
          Destination: !GetAtt DeadLetterQueue.Arn

After two failed retries, the event is sent to the DLQ for investigation.

Error Response Formats

REST API Errors

return {
    "statusCode": 500,
    "body": json.dumps({
        "error": "Internal Server Error",
        "message": "Failed to process request",
        "requestId": context.aws_request_id,
    }),
}

Specific Error Codes

try:
    result = operation()
    return {"statusCode": 200, "body": result}
except Exception as exc:
    code = getattr(exc, "code", None)
    if code == "ValidationError":
        return {"statusCode": 400, "body": str(exc)}
    if code == "NotFound":
        return {"statusCode": 404, "body": str(exc)}
    return {"statusCode": 500, "body": "Server error"}

Transient vs. Permanent Errors

Transient (retry):

  • Network timeout
  • Service temporarily down
  • Throttling

Permanent (don't retry):

  • Invalid input (400)
  • Authentication failed (401)
  • S3 bucket doesn't exist (404)

def should_retry(error):
    retryable_statuses = {408, 429, 500, 502, 503, 504}
    return getattr(error, "statusCode", None) in retryable_statuses

Circuit Breaker Pattern

Stop retrying if service is broken:

import time

failure_count = 0
FAILURE_THRESHOLD = 5
circuit_open = False
circuit_until = 0


def handler(event, context):
    global failure_count, circuit_open, circuit_until

    if circuit_open and time.time() < circuit_until:
        return {"statusCode": 503, "body": "Service unavailable"}

    try:
        result = external_service_call()
        failure_count = 0
        circuit_open = False
        return result
    except Exception:
        failure_count += 1
        if failure_count > FAILURE_THRESHOLD:
            circuit_open = True
            circuit_until = time.time() + 60
        raise

When circuit opens, fail fast instead of hammering broken service.

Monitoring Errors

import json


def log_error(error, context):
    print(json.dumps({
        "level": "ERROR",
        "message": str(error),
        "requestId": context.aws_request_id,
        "functionName": context.function_name,
        "remainingMs": context.get_remaining_time_in_millis(),
    }))


def handler(event, context):
    try:
        # Your code
        pass
    except Exception as exc:
        log_error(exc, context)
        raise

Create alarm on error rate:

HighErrorRateAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    Namespace: AWS/Lambda
    MetricName: Errors
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 1
    Threshold: 10
    ComparisonOperator: GreaterThanOrEqualToThreshold
    TreatMissingData: notBreaching

Dead Letter Queues (DLQ)

Catch failures for later investigation:

ProcessingFunction:
  Type: AWS::Serverless::Function
  Properties:
    EventInvokeConfig:
      DestinationConfig:
        OnFailure:
          Type: SQS
          Destination: !GetAtt FailedItems.Arn

FailedItems:
  Type: AWS::SQS::Queue
  Properties:
    QueueName: failed-items-dlq
    MessageRetentionPeriod: 1209600  # 14 days

Later, process failed items manually or with a retry function, as sketched below.
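A sketch of that manual replay step, assuming the DLQ URL is supplied via an environment variable and process() stands in for your original business logic (both names are illustrative):

import json
import os

import boto3

sqs = boto3.client("sqs")
DLQ_URL = os.environ["DLQ_URL"]  # assumed env var pointing at failed-items-dlq


def redrive_handler(event, context):
    # Pull a small batch of failed events and try them again
    response = sqs.receive_message(QueueUrl=DLQ_URL, MaxNumberOfMessages=10)
    for message in response.get("Messages", []):
        record = json.loads(message["Body"])
        # Lambda failure destinations wrap the event in an invocation record
        failed_event = record.get("requestPayload", record)
        try:
            process(failed_event)
            # Delete only after reprocessing succeeds
            sqs.delete_message(QueueUrl=DLQ_URL, ReceiptHandle=message["ReceiptHandle"])
        except Exception as exc:
            print("Still failing, leaving in DLQ:", exc)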

Chaos Engineering

Intentionally inject failures to test resilience:

import random


def should_fail():
    return random.random() < 0.05  # 5% failure rate


def handler(event, context):
    if should_fail():
        raise RuntimeError("Simulated failure for testing")
    # Normal operation

Enable this only in staging, and verify that the system recovers gracefully.
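One way to keep this out of production is to gate the failure rate behind an environment variable that is only set in staging (CHAOS_FAILURE_RATE is an illustrative name):

import os
import random

# Unset (defaults to 0.0) in production, e.g. 0.05 in staging
CHAOS_FAILURE_RATE = float(os.environ.get("CHAOS_FAILURE_RATE", "0"))


def maybe_inject_failure():
    if random.random() < CHAOS_FAILURE_RATE:
        raise RuntimeError("Simulated failure for testing")


def handler(event, context):
    maybe_inject_failure()
    # Normal operation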

Best Practices (AWS)

  1. Fail fast on permanent errors — Don't retry 404s, 401s, 400s
  2. Exponential backoff — Wait 1s, 2s, 4s, 8s between retries
  3. Jitter — Add randomness to prevent thundering herd
  4. Circuit breakers — Stop retrying if service is broken
  5. DLQs — Route failed async events to queues for investigation
  6. Logging — Include request ID, status, and error details

Part 2: GCP Error Handling

Try-Catch Basics

Cloud Functions also uses standard try-catch:

import functions_framework


@functions_framework.http
def handle_request(request):
    try:
        result = process_request(request.get_json(silent=True) or {})
        return ({"success": True, "result": result}, 200)
    except Exception as exc:
        print("Error:", exc)
        return ({"error": str(exc)}, 500)

Error Types

Application Errors:

# Invalid input
payload = request.get_json(silent=True) or {}
if not payload.get("id"):
    return ({"error": "Missing id"}, 400)

Service Errors:

try:
    doc = firestore.collection("items").document(doc_id).get()
except Exception as exc:
    if getattr(exc, "code", None) == "UNAVAILABLE":
        return ({"error": "Service unavailable"}, 503)

Configuration Errors:

import os

project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")
if not project_id:
    raise RuntimeError("GOOGLE_CLOUD_PROJECT not set")

Retry Strategy

Manual retry with exponential backoff:

import random
import time


def is_retryable(error):
    retryable_codes = {"UNAVAILABLE", "DEADLINE_EXCEEDED", "INTERNAL"}
    return getattr(error, "code", None) in retryable_codes


def with_retry(fn, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts or not is_retryable(exc):
                raise
            delay = (2 ** attempt) + random.random()
            print(f"Attempt {attempt} failed, retrying in {delay}s")
            time.sleep(delay)


@functions_framework.http
def my_function(request):
    try:
        payload = request.get_json(silent=True) or {}
        doc = with_retry(lambda: firestore.collection("items").document(payload["id"]).get())
        return ({"data": doc.to_dict()}, 200)
    except Exception as exc:
        return ({"error": str(exc)}, 500)

Cloud Tasks for Guaranteed Delivery

For critical operations that must succeed, use Cloud Tasks:

import json
import os

import functions_framework
from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()


@functions_framework.http
def enqueue_order(request):
    project = os.environ.get("GOOGLE_CLOUD_PROJECT", "PROJECT_ID")
    queue = "order-processing"
    location = "us-central1"

    body = json.dumps(request.get_json(silent=True) or {}).encode("utf-8")
    task = {
        "http_request": {
            "http_method": tasks_v2.HttpMethod.POST,
            "url": "https://YOUR-FUNCTION-URL/processOrder",
            "headers": {"Content-Type": "application/json"},
            "body": body,
        },
        # schedule_time is optional; omit it to dispatch the task immediately
    }

    parent = client.queue_path(project, location, queue)
    response = client.create_task(request={"parent": parent, "task": task})
    return ({"taskName": response.name}, 200)

Cloud Tasks automatically retries failed tasks with exponential backoff; the maximum number of attempts and the backoff bounds are configurable on the queue.
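Cloud Tasks decides whether to retry based on the HTTP status the target returns: any 2xx counts as success, anything else is retried. A minimal sketch of the processOrder target referenced above, where handle_order() stands in for your business logic:

import functions_framework


@functions_framework.http
def process_order(request):
    order = request.get_json(silent=True) or {}
    try:
        handle_order(order)  # assumed business logic
        return ("OK", 200)  # 2xx tells Cloud Tasks the task succeeded
    except Exception as exc:
        print("Order failed, will be retried:", exc)
        return ({"error": str(exc)}, 500)  # non-2xx triggers a retry with backoff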

Pub/Sub Retry Policy

For event-driven workflows, use Pub/Sub with dead-letter topics:

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("PROJECT_ID", "order-processing-sub")


def callback(message):
    try:
        process_order(message.data)
        message.ack()
    except Exception as exc:
        print("Error processing message:", exc)
        message.nack()  # nack triggers redelivery and counts toward max delivery attempts


# Keep the returned future; call .result() to block the main thread while messages stream in
streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)

Configure via gcloud:

gcloud pubsub subscriptions create order-processing-sub \
  --topic=orders \
  --dead-letter-topic=order-dlq \
  --max-delivery-attempts=5

Error Reporting (GCP Native)

Report errors to Cloud Error Reporting:

import functions_framework
from google.cloud import error_reporting

client = error_reporting.Client()


@functions_framework.http
def my_function(request):
    try:
        risky_operation()
        return ("OK", 200)
    except Exception:
        client.report_exception()
        return ("Internal error", 500)

View aggregated errors and trends in Cloud Console.


AWS vs. GCP Error Handling

Feature                    | AWS Lambda                                  | Google Cloud Functions
Sync invocation failures   | Return HTTP status code                     | Return HTTP status code
Async invocation failures  | Automatic retry × 2 (configurable)          | Event retried by source (e.g., Pub/Sub × 5)
Failed event destination   | DLQ (SQS/SNS)                               | Dead-letter topic (Pub/Sub) or GCS bucket
Retry backoff              | Manual in code                              | Source handles (Pub/Sub exponential backoff)
Guaranteed delivery        | SQS has built-in durability                 | Cloud Tasks, Pub/Sub
Error monitoring           | CloudWatch alarms, X-Ray                    | Cloud Logging, Error Reporting, Cloud Alerting
Circuit breaker            | Manual implementation                       | Manual implementation
Max execution timeout      | 15 minutes (900 seconds)                    | 9 minutes (1st gen); up to 60 minutes (2nd gen HTTP)
Max event age              | Configurable (MaximumEventAgeInSeconds)     | Configurable per Pub/Sub subscription

Key Differences

  • Async retry behavior: AWS Lambda retries asynchronously-invoked functions automatically; GCP relies on the event source's retry policy
  • Delivery guarantees: SQS standard queues and Cloud Tasks both provide at-least-once delivery (SQS FIFO adds exactly-once processing), so consumers should be idempotent; see the sketch after this list
  • DLQ routing: AWS uses SNS/SQS; GCP uses Pub/Sub topics or Cloud Storage
  • Error visibility: CloudWatch requires you to set up alarms; Cloud Error Reporting aggregates errors automatically
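Because both platforms deliver events at least once, handlers should be idempotent. A minimal sketch of request-level deduplication, using an in-memory set purely for illustration (production code would use a conditional write in DynamoDB or Firestore instead):

processed_ids = set()  # survives warm invocations only; use a durable store in production


def handle_once(message_id, payload):
    if message_id in processed_ids:
        print("Duplicate delivery, skipping:", message_id)
        return
    process(payload)  # your business logic
    processed_ids.add(message_id)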

Transient vs. Permanent Errors

Both platforms follow the same principle:

Transient (retry):

  • Network timeout (408, 504)
  • Service temporarily unavailable (429, 503)
  • Deadline exceeded
  • Temporary Firestore lock

Permanent (don't retry):

  • Invalid input (400)
  • Authentication failed (401)
  • Authorization failed (403)
  • Resource not found (404)
  • Invalid message format (JSON parse error)

# Shared logic for both platforms
def should_retry(error):
    if getattr(error, "statusCode", None):
        retryable_codes = {408, 429, 500, 502, 503, 504}
        return error.statusCode in retryable_codes
    retryable_codes = {"UNAVAILABLE", "DEADLINE_EXCEEDED", "INTERNAL"}
    return getattr(error, "code", None) in retryable_codes

Circuit Breaker Pattern

Prevent hammering broken services:

# Works on both AWS and GCP
import json
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_count = 0
        self.threshold = failure_threshold
        self.timeout = timeout
        self.state = "CLOSED"
        self.next_attempt = time.time()

    def execute(self, fn):
        if self.state == "OPEN":
            if time.time() < self.next_attempt:
                raise RuntimeError("Circuit breaker is OPEN")
            self.state = "HALF_OPEN"

        try:
            result = fn()
            self.on_success()
            return result
        except Exception:
            self.on_failure()
            raise

    def on_success(self):
        self.failure_count = 0
        self.state = "CLOSED"

    def on_failure(self):
        self.failure_count += 1
        if self.failure_count >= self.threshold:
            self.state = "OPEN"
            self.next_attempt = time.time() + self.timeout


breaker = CircuitBreaker()


def aws_handler(event, context):
    try:
        result = breaker.execute(lambda: ddb.get_item(**params))
        return {"statusCode": 200, "body": json.dumps(result)}
    except Exception:
        return {"statusCode": 503, "body": "Service unavailable"}


@functions_framework.http
def gcp_handler(request):
    try:
        payload = request.get_json(silent=True) or {}
        doc = breaker.execute(lambda: firestore.collection("items").document(payload["id"]).get())
        return ({"data": doc.to_dict()}, 200)
    except Exception:
        return ({"error": "Service unavailable"}, 503)

Best Practices (Both Platforms)

  1. Fail fast on permanent errors — Retry only on transient failures
  2. Use exponential backoff — Base delay × 2^attempt with jitter
  3. Implement circuit breakers — Stop retrying broken services
  4. Route failures somewhere — DLQ (AWS), dead-letter topic (GCP)
  5. Monitor error rates — Alert on spikes
  6. Log with context — Request ID, user ID, attempt number (see the sketch after this list)
  7. Test error paths — Inject failures to verify recovery
  8. Document recovery — How to manually replay failed messages
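For point 6, a small helper that emits one JSON log line per event works on both platforms, since CloudWatch Logs Insights and Cloud Logging can both query JSON fields (the field names are illustrative):

import json
import time


def log_with_context(level, message, **context):
    # Pass requestId, userId, attempt, etc. as keyword arguments
    print(json.dumps({"level": level, "message": message, "timestamp": time.time(), **context}))


# Example inside a retry loop
log_with_context("WARN", "Retrying upstream call", requestId="abc-123", attempt=2)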

Hands-On: Multi-Cloud Resilient API

AWS Lambda

import json
import os
import time

import boto3

ddb = boto3.client("dynamodb")


def with_retry(fn, max_attempts=3):
    # Uses the shared should_retry() helper defined earlier
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if not should_retry(exc) or attempt == max_attempts:
                raise
            time.sleep(2 ** attempt)


def handler(event, context):
    try:
        result = with_retry(lambda: ddb.get_item(
            TableName=os.environ.get("TABLE_NAME"),
            Key={"id": {"S": event.get("id")}},
        ))
        return {"statusCode": 200, "body": json.dumps(result)}
    except Exception as exc:
        print("Final error:", exc)
        return {"statusCode": 500, "body": json.dumps({"error": str(exc)})}

Deploy with DLQ:

sam deploy --template-file template.yaml

Google Cloud

import time

import functions_framework
from google.cloud import firestore

db = firestore.Client()


def with_retry(fn, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if not should_retry(exc) or attempt == max_attempts:
                raise
            time.sleep(2 ** attempt)


@functions_framework.http
def get_order(request):
    try:
        payload = request.get_json(silent=True) or {}
        doc = with_retry(lambda: db.collection("orders").document(payload["id"]).get())
        return ({"data": doc.to_dict()}, 200)
    except Exception as exc:
        print("Final error:", exc)
        return ({"error": str(exc)}, 500)

Deploy:

gcloud functions deploy getOrder \
  --runtime python312 \
  --trigger-http \
  --entry-point get_order \
  --allow-unauthenticated

Key Takeaway

Resilient systems expect failures and plan for them. Retry transient errors with exponential backoff, fail fast on permanent errors, open circuit breakers when a dependency stays down, and never lose messages. Both AWS and GCP provide the primitives; your job is to use them correctly.