# Handling Failures in Metaflow
## Why Failure Handling Matters

Failures are inevitable in real-world data and ML pipelines, whether caused by bugs, transient compute issues, bad input data, or flaky external services. Metaflow makes it easy to recover, retry, or debug these failures without manual intervention or reruns from scratch.
## Failure Types & Handling Methods
### 1. Transient Infrastructure Failures

- Examples:
  - Spot instance interruption
  - Network blip
  - AWS Batch / Kubernetes job eviction
  - Metadata service temporarily unavailable
- Symptoms:
  - Task crashes with an infrastructure-related error
  - Intermittent success/failure on retry

Recommended Handling:

| Feature | How to Use |
|---|---|
| `@retry` decorator | Retries a step N times automatically on failure |
| `@catch` decorator | Stores the exception and lets the flow continue |
| AWS Batch / K8s retries | Enabled automatically via the Metaflow runtime |
| Resumable runs | Resume from the last successful step using `resume` |

```python
@retry(times=3)
@step
def download_data(self):
    ...
```
### 2. Code Errors / Exceptions

- Examples:
  - Python exceptions (e.g., `KeyError`, `ValueError`)
  - Logic bugs in your ML model or data pipeline
- Symptoms:
  - Step fails consistently due to code logic
  - Doesn't recover with retries

Recommended Handling:

| Feature | How to Use |
|---|---|
| `@catch` decorator | Catches the error and allows fallback or logging |
| Cards / Logs | Use `@card` to capture logs and show the crash cause |
| Step isolation | Each step is isolated, so only the faulty step needs fixing |

```python
@catch(var='error_info')
@step
def risky_logic(self):
    raise ValueError("Oops!")
```
### 3. External Service Failures

- Examples:
  - HTTP API times out
  - Database is temporarily unavailable
  - File download fails due to a remote issue
- Symptoms:
  - Step fails randomly depending on external service status

Recommended Handling:

| Feature | How to Use |
|---|---|
| `@retry(times=X)` | Retry the transient call automatically |
| Exponential backoff | Implement inside your step logic (e.g., with `time.sleep`) |
| `@catch` | Allow fallback to cached data or offline mode |
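The exponential-backoff pattern mentioned in the table can be sketched as a small helper called from inside a step body. This is a standard-library-only sketch; `call_with_backoff` and its parameters are illustrative names, not Metaflow APIs:

```python
import random
import time

def call_with_backoff(fn, max_attempts=4, base_delay=1.0):
    """Call fn(), retrying with exponential backoff plus jitter.

    Re-raises the last exception once all attempts are exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise  # out of attempts: let @retry or @catch take over
            # delays of base, 2x, 4x, ... plus random jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

Combining an in-process backoff loop like this with `@retry` gives two layers of protection: fast retries for brief blips, and full task re-execution for longer outages.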
### 4. Bad Input / Data Validation Failures

- Examples:
  - Missing column
  - Null values in a required field
  - Incompatible schema
- Symptoms:
  - Step crashes on specific datasets or flow runs

Recommended Handling:

| Feature | How to Use |
|---|---|
| Explicit data validation | Check inputs at the start of the step |
| `@catch` with a branch | Handle invalid cases separately |
| Raise a descriptive exception | Fail the step with a clear, actionable message |

```python
@step
def validate_data(self):
    if "target" not in self.df.columns:
        raise ValueError("Missing target column in input data")
```
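Validation checks like the one above get easier to test when pulled into a plain helper that collects every problem at once. A minimal sketch; the helper name and the required-column defaults are illustrative, not part of Metaflow:

```python
def find_schema_problems(columns, required=("target", "features")):
    """Return human-readable schema problems; an empty list means valid."""
    missing = [c for c in required if c not in columns]
    return [f"missing required column: {c}" for c in missing]
```

Inside a step you would raise if the returned list is non-empty, so the failure message names every missing column instead of only the first one encountered.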
### 5. Permanent Failures (e.g., logic bug, corrupted data)

- Examples:
  - Wrong data logic
  - Wrong model architecture
  - Reproducible crash on every retry
- Symptoms:
  - Step fails every time, even with retries or catching

Recommended Handling:

| Feature | How to Use |
|---|---|
| Debug using `resume` | Fix the step, then resume from where it failed |
| Cards / logs | Help visualize what failed (`@card`) |
| `@catch` for fallback | Log and skip, or run fallback logic |
## Summary Table

| Failure Type | Example | Handling Strategy |
|---|---|---|
| Transient infra | Job evicted, spot instance lost | `@retry`, `resume` |
| Code exception | `ValueError`, logic bug | `@catch`, cards, isolate the step |
| External API | HTTP 500, timeout | `@retry`, exponential backoff |
| Input validation | Missing columns, schema mismatch | Explicit checks, `@catch` |
| Permanent bug | Reproducible crash | Debug + `resume`, improve the code |
## Metaflow Retry vs Catch

| Feature | Purpose | Scope | Re-executes Step? |
|---|---|---|---|
| `@retry(times=3)` | Auto-retry on failure | Transient errors | Yes |
| `@catch(var='err')` | Catch the exception and continue | Any exception | No (continues to the next step) |
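The two decorators compose: with `@catch` stacked above `@retry` on the same step, retries are exhausted first and only then does the catch engage. The combined semantics can be illustrated in plain Python; this is a hypothetical helper showing the behavior, not how Metaflow is implemented:

```python
def retry_then_catch(fn, times=3):
    """Mimic @catch stacked above @retry(times=times).

    Returns (result, None) on success, or (None, last_exception)
    once the original attempt plus `times` retries have all failed.
    """
    last_exc = None
    for _ in range(times + 1):  # first attempt + `times` retries
        try:
            return fn(), None
        except Exception as exc:
            last_exc = exc
    return None, last_exc  # "caught": the caller continues instead of crashing
```

With `@retry(times=3)` a step runs at most four times; the artifact named in `@catch(var=...)` plays the role of the returned exception here.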
## Best Practices

- Use `@retry` generously for infra and network volatility
- Use `@catch` when you want the flow to continue even if a step fails
- Use `resume` to avoid rerunning successful steps
- Use `@card` and step logs to capture failure insights
- Make each step idempotent and isolated, so reruns are safe
Below is a complete failure-handling template flow that demonstrates:

- `@retry` for transient errors
- `@catch` for graceful fallback
- Support for `resume` so you can fix and continue after a failure

## Template: FailureHandlingFlow.py
```python
from metaflow import FlowSpec, step, retry, catch, card, Parameter
import random


class FailureHandlingFlow(FlowSpec):

    simulate_api_failure = Parameter("simulate_api_failure", type=bool, default=True)
    simulate_code_bug = Parameter("simulate_code_bug", type=bool, default=False)

    @step
    def start(self):
        print("Starting the flow.")
        self.next(self.api_call)

    # Retry up to 3 times on failure (e.g., network)
    @retry(times=3, minutes_between_retries=0.03)  # ~2 seconds between retries
    @step
    def api_call(self):
        print("Simulating API call...")
        if self.simulate_api_failure and random.random() < 0.7:
            raise RuntimeError("API request failed! Retrying...")
        print("API call succeeded.")
        self.api_data = {"message": "API call successful"}
        self.next(self.risky_logic)

    # Catch unexpected logic bugs and fall back
    @catch(var="error_info")
    @step
    def risky_logic(self):
        print("Running some logic with potential bugs...")
        if self.simulate_code_bug:
            raise ValueError("Logic bug occurred!")
        self.result = "Success!"
        self.next(self.summarize)

    # Generate a card with the outcome
    @card
    @step
    def summarize(self):
        if getattr(self, "error_info", None):
            print("Step failed but was caught:", self.error_info)
        else:
            print("Logic succeeded:", self.result)
        print("API data:", self.api_data)
        self.next(self.end)

    @step
    def end(self):
        print("Flow completed.")


if __name__ == "__main__":
    FailureHandlingFlow()
```
## How to Use It

Normal run (all success):

```bash
python FailureHandlingFlow.py run --simulate_api_failure False --simulate_code_bug False
```

Simulate a flaky API (auto-retries):

```bash
python FailureHandlingFlow.py run --simulate_api_failure True --simulate_code_bug False
```

You'll see up to 3 retries before the step proceeds.

Simulate a logic bug (caught by `@catch`):

```bash
python FailureHandlingFlow.py run --simulate_code_bug True
```

The `risky_logic` step will fail, but the flow will continue and summarize the error.

### Use resume after fixing a failure

If a non-retriable failure happens (e.g., a bug in the code), you can fix it and then:

```bash
python FailureHandlingFlow.py resume --origin-run-id <RUN_ID>
```

Metaflow will resume from the last failed step instead of restarting everything. Without `--origin-run-id`, `resume` picks up your most recent run.
## Features Demonstrated

| Feature | Where |
|---|---|
| `@retry(times=3)` | On the flaky `api_call` step |
| `@catch(var=...)` | Around `risky_logic` to catch exceptions |
| `@card` | Auto-attached report summarizing the result/error |
| `resume` | CLI command to pick up a failed run after fixing the bug |
| `Parameter` | Toggles failure behavior for testing |