Handling Failures in Metaflow


Why Failure Handling Matters

Failures are inevitable in real-world data and ML pipelines, whether caused by bugs, transient compute issues, bad input data, or flaky external services. Metaflow provides tools to retry, recover from, or debug these failures without rerunning the whole flow from scratch.


Failure Types & Handling Methods

1. Transient Infrastructure Failures

  • Examples:
    • Spot instance interruption
    • Network blip
    • AWS Batch / Kubernetes job eviction
    • Metadata service temporarily unavailable
  • Symptoms:
    • Task crashes with infrastructure-related error
    • Intermittent success/failure on retry

Recommended Handling:

| Feature | How to Use |
| --- | --- |
| @retry decorator | Retries a step automatically on failure (3 retries by default) |
| @catch decorator | Catches the failure once retries are exhausted so the flow can continue |
| AWS Batch / Kubernetes | @retry also applies to steps that run remotely, so evicted or interrupted jobs are attempted again |
| Resumable runs | Use the resume command to continue from the failed step |

@retry(times=3)
@step
def download_data(self):
    ...

2. Code Errors / Exceptions

  • Examples:
    • Python exceptions (e.g., KeyError, ValueError)
    • Logic bugs in your ML model, data pipeline
  • Symptoms:
    • Step fails consistently due to code logic
    • Doesn't recover with retries

Recommended Handling:

| Feature | How to Use |
| --- | --- |
| @catch decorator | Catches the error and allows fallback or logging |
| Cards / Logs | Check the task logs for the traceback; use @card to report results and caught errors |
| Step Isolation | Each step is isolated, so only the faulty step needs fixing |

@catch(var='error_info')
@step
def risky_logic(self):
    # @catch stores the exception in self.error_info and lets the flow continue
    raise ValueError("Oops!")

3. External Service Failures

  • Examples:
    • HTTP API times out
    • Database is temporarily unavailable
    • File download fails due to remote issue
  • Symptoms:
    • Step fails randomly depending on external service status

Recommended Handling:

| Feature | How to Use |
| --- | --- |
| @retry(times=X) | Retry the transient call automatically |
| Exponential backoff | Implement inside your step logic (e.g., with time.sleep) |
| @catch | Allow fallback to cached data or offline mode |
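
The exponential backoff mentioned in the table could look roughly like the sketch below. It is plain Python meant to be called from inside a step. The helper name call_with_backoff and its parameters are hypothetical and not part of Metaflow. When the last attempt fails, the exception propagates so that @retry or @catch on the step can take over.

```python
import random
import time


def call_with_backoff(fn, attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn(); after each failure wait base_delay * 2**n (capped), then try again."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the step itself fail
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% random jitter so parallel tasks do not retry in lockstep
            sleep(delay + random.uniform(0, delay / 10))
```

Inside a step this might be used as self.payload = call_with_backoff(fetch), where fetch is whatever function performs the external call.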

4. Bad Input / Data Validation Failures

  • Examples:
    • Missing column
    • Null values in required field
    • Incompatible schema
  • Symptoms:
    • Step crashes on specific datasets or flow runs

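Recommended Handling:

| Feature | How to Use |
| --- | --- |
| Explicit data validation | Check inputs at the start of the step |
| @catch with branch | Catch the validation error, then handle invalid cases separately in the next step |
| Raise an exception | Fail the step with a clear reason |

@step
def validate_data(self):
    # Raising an exception fails the step and records the reason in the logs
    if "target" not in self.df.columns:
        raise ValueError("Missing target column in input data")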
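
A slightly fuller validation pass might collect every problem before failing, so one run reports all issues rather than only the first. The sketch below is plain Python over a list of dict rows. The names DataValidationError and validate_rows are hypothetical, and a real flow would likely run the same checks against its own data structure (for example a DataFrame).

```python
class DataValidationError(ValueError):
    """Raised when input rows do not meet a step's requirements (hypothetical name)."""


def validate_rows(rows, required_columns):
    """Check each row for missing or null required fields; raise once with all problems."""
    problems = []
    for i, row in enumerate(rows):
        for col in required_columns:
            if col not in row:
                problems.append(f"row {i}: missing column '{col}'")
            elif row[col] is None:
                problems.append(f"row {i}: null value in '{col}'")
    if problems:
        raise DataValidationError("; ".join(problems))
    return len(rows)  # number of rows that passed validation
```

Calling this at the top of a step keeps the failure reason in one readable message, which is also what a @catch variable would capture.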

5. Permanent Failures (e.g., logic bug, corrupted data)

  • Examples:
    • Wrong data logic
    • Wrong model architecture
    • Reproducible crash on every retry
  • Symptoms:
    • Step fails every time, even with retries or catching

Recommended Handling:

| Feature | How to Use |
| --- | --- |
| Debug using resume | Fix the step, then resume from where it failed |
| Use cards/logs | Help visualize what failed (@card) |
| Use @catch for fallback | Log the error, then skip the step's work or run fallback logic |

Summary Table

| Failure Type | Example | Handling Strategy |
| --- | --- | --- |
| Transient Infra | Job evicted, spot instance lost | @retry, resume |
| Code Exception | ValueError, logic bug | @catch, card, isolate step |
| External API | HTTP 500, timeout | @retry, exponential backoff |
| Input Validation | Missing columns, schema mismatch | Raise an exception, @catch |
| Permanent Bug | Reproducible crash | Debug + resume, improve code |

Metaflow Retry vs Catch

| Feature | Purpose | Scope | Re-executes Step? |
| --- | --- | --- | --- |
| @retry(times=3) | Auto-retry on failure | Transient errors | Yes |
| @catch(var='err') | Catch exception and continue | Any exception | No (continues to next step) |
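
The difference between the two decorators can be illustrated with a toy model in plain Python. This is not Metaflow's implementation. The function run_step and its parameters are hypothetical, and it only mimics the behavior described above: retrying re-executes the step, while catching records the error after the final attempt and lets execution continue.

```python
def run_step(step_fn, retries=0, catch=False):
    """Toy model of a step: returns (result, caught_error, attempts_used)."""
    attempts = 0
    last_error = None
    for _ in range(1 + retries):  # first attempt plus the allowed retries
        attempts += 1
        try:
            return step_fn(), None, attempts
        except Exception as exc:
            last_error = exc  # remember the failure and try again, if allowed
    if catch:
        # Like @catch: keep the error as data and carry on
        return None, last_error, attempts
    raise last_error  # without catch, the failure stops the run
```

In this model a flaky function benefits from retries, while a function with a deterministic bug uses up every attempt and only the catch option keeps the run alive.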

Best Practices

  • Use @retry generously for infra + network volatility
  • Use @catch when you want to continue even if a step fails
  • Use resume to avoid rerunning successful steps
  • Store failure details as artifacts (for example, the @catch variable) and use @card to report them
  • Make each step idempotent and isolated, so reruns are safe
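
As one possible illustration of the idempotency advice, a step that writes a file outside Metaflow could skip work that an earlier attempt already finished and publish its output atomically. The helper write_once is a hypothetical name, and the sketch assumes a local filesystem. Object stores or databases would need their own equivalent.

```python
import os
import tempfile


def write_once(path, produce_bytes):
    """Write produce_bytes() to path unless it already exists; return True if written."""
    if os.path.exists(path):
        return False  # an earlier attempt already completed this work
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(produce_bytes())
        os.replace(tmp_path, path)  # atomic rename within the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # leave no partial output behind for the retry
        raise
    return True
```

Because the final file appears only after a complete write, a retried or resumed step sees either the finished output or nothing at all.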

The following template flow demonstrates:

  • @retry for transient errors
  • @catch for graceful fallback
  • Support for resume so you can fix and continue after failure

Template: FailureHandlingFlow.py

from metaflow import FlowSpec, step, retry, catch, card, Parameter
import random

class FailureHandlingFlow(FlowSpec):

    simulate_api_failure = Parameter("simulate_api_failure", type=bool, default=True)
    simulate_code_bug = Parameter("simulate_code_bug", type=bool, default=False)

    @step
    def start(self):
        print("Starting the flow.")
        self.next(self.api_call)

    # Retry up to 3 times on failure (e.g. network), for 4 attempts in total
    @retry(times=3, minutes_between_retries=0.03)  # 0.03 minutes, about 2 seconds
    @step
    def api_call(self):
        print("Simulating API call...")
        # Fail about 70% of the time when the simulation flag is on
        if self.simulate_api_failure and random.random() < 0.7:
            raise RuntimeError("API request failed! Retrying...")
        print("API call succeeded.")
        self.api_data = {"message": "API call successful"}
        self.next(self.risky_logic)

    # Catch unexpected logic bugs and fall back
    @catch(var="error_info")
    @step
    def risky_logic(self):
        print("Running some logic with potential bugs...")
        if self.simulate_code_bug:
            raise ValueError("Logic bug occurred!")
        self.result = "Success!"
        self.next(self.summarize)

    # Generate a card with the outcome
    @card
    @step
    def summarize(self):
        # @catch sets error_info to None when risky_logic succeeds,
        # so test its value rather than its existence
        if self.error_info:
            print("Step failed but caught:", self.error_info)
        else:
            print("Logic succeeded:", self.result)

        print("API Data:", self.api_data)
        self.next(self.end)

    @step
    def end(self):
        print("Flow completed.")

if __name__ == "__main__":
    FailureHandlingFlow()

How to Use It

Normal run (all success):

python FailureHandlingFlow.py run --simulate_api_failure False --simulate_code_bug False

Simulate flaky API (auto-retries):

python FailureHandlingFlow.py run --simulate_api_failure True --simulate_code_bug False

Each failed attempt of api_call is retried, up to 3 times. If the first attempt and all 3 retries fail, the step and the run fail.
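
How often the flaky run still fails can be estimated with a small calculation. It assumes the attempts are independent, which holds for the random.random() simulation in the template but not necessarily for real outages. The function name is hypothetical.

```python
def prob_all_attempts_fail(p_fail, retries):
    """Probability that the first attempt and every retry fail, assuming independence."""
    return p_fail ** (1 + retries)


# Template settings: 70% failure chance per attempt, 3 retries (4 attempts in total)
print(round(prob_all_attempts_fail(0.7, 3), 4))  # 0.2401
```

So with the template's settings, roughly one run in four exhausts its retries.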


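Simulate logic bug (caught by @catch):

python FailureHandlingFlow.py run --simulate_code_bug True

The risky_logic step fails, but @catch lets the flow continue, and summarize reports the error. Because simulate_api_failure defaults to True, api_call may still need retries in this run; add --simulate_api_failure False to isolate the logic bug.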


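Use resume after fixing a failure

If a failure cannot be fixed by retrying (e.g., a bug in the code), fix the code and then run:

python FailureHandlingFlow.py resume --origin-run-id <RUN_ID>

Omit --origin-run-id to resume the most recent run. Metaflow reuses the results of the steps that already succeeded and re-executes from the failed step onward instead of restarting everything.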


Features Demonstrated

| Feature | Where |
| --- | --- |
| @retry(times=3) | On the flaky api_call step |
| @catch(var=...) | Around risky_logic to catch exceptions |
| @card | Report attached to summarize showing the result or error |
| resume | CLI command to pick up a failed run after fixing the bug |
| Parameter | Toggles failure behavior for testing |