# Handling Failures in Metaflow
## Why Failure Handling Matters

Failures are inevitable in real-world data and ML pipelines, whether caused by bugs, transient compute issues, bad input data, or flaky external services. Metaflow makes it easy to recover, retry, or debug these failures without manual intervention or reruns from scratch.
## Failure Types & Handling Methods
### 1. Transient Infrastructure Failures

- Examples:
  - Spot instance interruption
  - Network blip
  - AWS Batch / Kubernetes job eviction
  - Metadata service temporarily unavailable
- Symptoms:
  - Task crashes with an infrastructure-related error
  - Intermittent success/failure on retry

Recommended Handling:

| Feature | How to Use |
|---|---|
| `@retry` decorator | Retries a step N times automatically on failure |
| `@catch` decorator | Stores the exception and lets the flow continue |
| AWS Batch / K8s retries | Enabled automatically via the Metaflow runtime |
| Resumable runs | Resume from the last successful step using `resume` |

```python
@retry(times=3)
@step
def download_data(self):
    ...
```
### 2. Code Errors / Exceptions

- Examples:
  - Python exceptions (e.g., `KeyError`, `ValueError`)
  - Logic bugs in your ML model or data pipeline
- Symptoms:
  - Step fails consistently due to code logic
  - Doesn't recover with retries

Recommended Handling:

| Feature | How to Use |
|---|---|
| `@catch` decorator | Catches the error and allows fallback or logging |
| Cards / Logs | Use `@card` to capture logs and show the crash cause |
| Step isolation | Each step is isolated, so only the faulty step needs fixing |

```python
@catch(var='error_info')
@step
def risky_logic(self):
    raise ValueError("Oops!")
```
### 3. External Service Failures

- Examples:
  - HTTP API times out
  - Database is temporarily unavailable
  - File download fails due to a remote issue
- Symptoms:
  - Step fails randomly depending on external service status

Recommended Handling:

| Feature | How to Use |
|---|---|
| `@retry(times=X)` | Retry the transient call automatically |
| Exponential backoff | Implement inside your step logic (e.g., with `time.sleep`) |
| `@catch` | Allow fallback to cached data or offline mode |
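The exponential-backoff pattern mentioned in the table can be sketched as a small helper called from inside a step body. This is a standard-library-only sketch; `call_with_backoff` and its parameters are illustrative names, not Metaflow APIs:

```python
import random
import time

def call_with_backoff(fn, max_attempts=4, base_delay=1.0):
    """Call fn(), retrying with exponential backoff plus jitter.

    Re-raises the last exception once all attempts are exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise  # out of attempts: let @retry or @catch take over
            # delays of base, 2x, 4x, ... plus random jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

Combining an in-process backoff loop like this with `@retry` gives two layers of protection: fast retries for brief blips, and full task re-execution for longer outages.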
### 4. Bad Input / Data Validation Failures

- Examples:
  - Missing column
  - Null values in a required field
  - Incompatible schema
- Symptoms:
  - Step crashes on specific datasets or flow runs

Recommended Handling:

| Feature | How to Use |
|---|---|
| Explicit data validation | Check inputs at the start of the step |
| `@catch` with a branch | Handle invalid cases separately |
| Raise a descriptive exception | Fail the step with a clear, actionable message |

```python
@step
def validate_data(self):
    if "target" not in self.df.columns:
        raise ValueError("Missing target column in input data")
```
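Validation checks like the one above get easier to test when pulled into a plain helper that collects every problem at once. A minimal sketch; the helper name and the required-column defaults are illustrative, not part of Metaflow:

```python
def find_schema_problems(columns, required=("target", "features")):
    """Return human-readable schema problems; an empty list means valid."""
    missing = [c for c in required if c not in columns]
    return [f"missing required column: {c}" for c in missing]
```

Inside a step you would raise if the returned list is non-empty, so the failure message names every missing column instead of only the first one encountered.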
### 5. Permanent Failures (e.g., logic bug, corrupted data)

- Examples:
  - Wrong data logic
  - Wrong model architecture
  - Reproducible crash on every retry
- Symptoms:
  - Step fails every time, even with retries or catching

Recommended Handling:

| Feature | How to Use |
|---|---|
| Debug using `resume` | Fix the step, then resume from where it failed |
| Cards / logs | Help visualize what failed (`@card`) |
| `@catch` for fallback | Log and skip, or run fallback logic |
## Summary Table

| Failure Type | Example | Handling Strategy |
|---|---|---|
| Transient infra | Job evicted, spot instance lost | `@retry`, `resume` |
| Code exception | `ValueError`, logic bug | `@catch`, cards, isolate the step |
| External API | HTTP 500, timeout | `@retry`, exponential backoff |
| Input validation | Missing columns, schema mismatch | Explicit checks, `@catch` |
| Permanent bug | Reproducible crash | Debug + `resume`, improve the code |
## Metaflow Retry vs Catch

| Feature | Purpose | Scope | Re-executes Step? |
|---|---|---|---|
| `@retry(times=3)` | Auto-retry on failure | Transient errors | Yes |
| `@catch(var='err')` | Catch the exception and continue | Any exception | No (continues to the next step) |
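The two decorators compose: with `@catch` stacked above `@retry` on the same step, retries are exhausted first and only then does the catch engage. The combined semantics can be illustrated in plain Python; this is a hypothetical helper showing the behavior, not how Metaflow is implemented:

```python
def retry_then_catch(fn, times=3):
    """Mimic @catch stacked above @retry(times=times).

    Returns (result, None) on success, or (None, last_exception)
    once the original attempt plus `times` retries have all failed.
    """
    last_exc = None
    for _ in range(times + 1):  # first attempt + `times` retries
        try:
            return fn(), None
        except Exception as exc:
            last_exc = exc
    return None, last_exc  # "caught": the caller continues instead of crashing
```

With `@retry(times=3)` a step runs at most four times; the artifact named in `@catch(var=...)` plays the role of the returned exception here.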
## Best Practices

- Use `@retry` generously for infra and network volatility
- Use `@catch` when you want the flow to continue even if a step fails
- Use `resume` to avoid rerunning successful steps
- Use `@card` and step logs to capture failure insights
- Make each step idempotent and isolated, so reruns are safe
Below is a complete failure-handling template flow that demonstrates:

- `@retry` for transient errors
- `@catch` for graceful fallback
- Support for `resume` so you can fix and continue after a failure

## Template: FailureHandlingFlow.py
```python
from metaflow import FlowSpec, step, retry, catch, card, Parameter
import random


class FailureHandlingFlow(FlowSpec):

    simulate_api_failure = Parameter("simulate_api_failure", type=bool, default=True)
    simulate_code_bug = Parameter("simulate_code_bug", type=bool, default=False)

    @step
    def start(self):
        print("Starting the flow.")
        self.next(self.api_call)

    # Retry up to 3 times on failure (e.g., network)
    @retry(times=3, minutes_between_retries=0.03)  # ~2 seconds between retries
    @step
    def api_call(self):
        print("Simulating API call...")
        if self.simulate_api_failure and random.random() < 0.7:
            raise RuntimeError("API request failed! Retrying...")
        print("API call succeeded.")
        self.api_data = {"message": "API call successful"}
        self.next(self.risky_logic)

    # Catch unexpected logic bugs and fall back
    @catch(var="error_info")
    @step
    def risky_logic(self):
        print("Running some logic with potential bugs...")
        if self.simulate_code_bug:
            raise ValueError("Logic bug occurred!")
        self.result = "Success!"
        self.next(self.summarize)

    # Generate a card with the outcome
    @card
    @step
    def summarize(self):
        if getattr(self, "error_info", None):
            print("Step failed but was caught:", self.error_info)
        else:
            print("Logic succeeded:", self.result)
        print("API data:", self.api_data)
        self.next(self.end)

    @step
    def end(self):
        print("Flow completed.")


if __name__ == "__main__":
    FailureHandlingFlow()
```
## How to Use It

Normal run (all success):

```bash
python FailureHandlingFlow.py run --simulate_api_failure False --simulate_code_bug False
```

Simulate a flaky API (auto-retries):

```bash
python FailureHandlingFlow.py run --simulate_api_failure True --simulate_code_bug False
```

You'll see up to 3 retries before the step proceeds.

Simulate a logic bug (caught by `@catch`):

```bash
python FailureHandlingFlow.py run --simulate_code_bug True
```

The `risky_logic` step will fail, but the flow will continue and summarize the error.

### Use resume after fixing a failure

If a non-retriable failure happens (e.g., a bug in the code), you can fix it and then:

```bash
python FailureHandlingFlow.py resume --origin-run-id <RUN_ID>
```

Metaflow will resume from the last failed step instead of restarting everything. Without `--origin-run-id`, `resume` picks up your most recent run.
## Features Demonstrated

| Feature | Where |
|---|---|
| `@retry(times=3)` | On the flaky `api_call` step |
| `@catch(var=...)` | Around `risky_logic` to catch exceptions |
| `@card` | Auto-attached report summarizing the result/error |
| `resume` | CLI command to pick up a failed run after fixing the bug |
| `Parameter` | Toggles failure behavior for testing |