AI/ML notes

Deep dive

How everything works under the hood

Lifecycle Mnemonic

DAG ➜ start ➜ next step ➜ isolate ➜ run ➜ pass artifacts ➜ repeat ➜ end
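The mnemonic can be sketched as a toy scheduler in plain Python. This is a conceptual sketch only, not Metaflow's actual code; all names here (`STEPS`, `run_flow`, the step functions) are made up for illustration:

```python
# Toy sketch of the lifecycle: a linear DAG of steps, each run in
# turn, passing an artifact dict forward to the next step.

def start(artifacts):
    artifacts["x"] = 10          # "pass artifacts"
    return "train"               # "next step"

def train(artifacts):
    artifacts["model"] = artifacts["x"] * 2
    return "end"

def end(artifacts):
    return None                  # no successor: the flow is done

STEPS = {"start": start, "train": train, "end": end}  # the DAG

def run_flow():
    artifacts, current = {}, "start"
    while current is not None:               # "repeat ... end"
        current = STEPS[current](artifacts)  # "isolate ➜ run"
    return artifacts

print(run_flow())  # {'x': 10, 'model': 20}
```

Real Metaflow runs each step in its own process and persists artifacts between them, but the control loop has the same shape.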

🧠 Key Files and Responsibilities

| Location | Description |
| --- | --- |
| metaflow/flowspec.py | 🧠 Core class: FlowSpec, the base class your flows inherit. Handles @step methods, self.next(...), data artifact tracking, etc. |
| metaflow/runtime.py | 🚀 Entry point for execution when you do python myflow.py run. Manages flow instantiation and hands off to the runtime engine. |
| metaflow/runtime/engine.py | ⚙️ Abstract engine that defines the interface for running steps. Backend-specific engines (local, batch, etc.) subclass this. |
| metaflow/runtime/step_functions/step_function.py | 🌿 Handles calling step methods, managing inputs, outputs, and error handling. |
| metaflow/runtime/executor.py | 🔄 Responsible for executing steps, resolving dependencies, managing state transitions, etc. |
| metaflow/datastore/ | 📦 Code for persisting and retrieving data artifacts across steps. |
| metaflow/decorators/ | 💡 Contains the logic for decorators like @batch, @retry, @conda, etc. They wrap the step execution logic. |

🪜 How Execution Works Internally (Code Flow)

Here’s a simplified flow of what happens when you run:

python myflow.py run

1. runtime.py

  • Parses CLI args.
  • Locates the flow class in your script.
  • Instantiates it.

2. FlowSpec (flowspec.py)

  • Collects all methods decorated with @step.
  • Builds up internal metadata (_steps, _edges, etc.).
  • Runs the entry point step (start()).

3. self.next(...) (in FlowSpec)

  • Records edges in the DAG (who runs after whom).
  • Actual execution is deferred until Metaflow resolves the step DAG at runtime.
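The edge-recording idea can be sketched in plain Python (a toy, not the real FlowSpec; `MiniFlow`, `_edges`, and `_current` are invented names): `next()` only appends an edge, it never calls the target.

```python
class MiniFlow:
    """Toy sketch: next() only records DAG edges; nothing runs yet."""
    def __init__(self):
        self._edges = []
        self._current = None

    def next(self, *targets):
        # Record (current step, successor) pairs instead of executing.
        for t in targets:
            self._edges.append((self._current, t.__name__))

    def start(self):
        self._current = "start"
        self.next(self.train)

    def train(self):
        self._current = "train"
        self.next(self.end)

    def end(self):
        pass

f = MiniFlow()
f.start(); f.train()
print(f._edges)  # [('start', 'train'), ('train', 'end')]
```

This deferral is what lets the engine schedule successors on a different process or machine.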

4. engine.py + executor.py

  • The engine resolves the next step(s) to run.
  • Each step is isolated and run (locally or via a backend like AWS Batch).

5. step_function.py

  • Runs the step method (step_fn(self)).
  • Captures outputs, stores artifacts.

6. datastore/

  • Persists variables assigned in the step (self.x = ...) as pickled blobs.
  • Automatically loads them in the next step.
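Conceptually, persistence looks like pickling each artifact to a shared location and unpickling it in the next step's process. A simplified sketch (not the real datastore code; `save_artifact`/`load_artifact` are made-up helpers):

```python
import os
import pickle
import tempfile

# Toy datastore: each artifact is a pickled blob keyed by name.
store_dir = tempfile.mkdtemp()

def save_artifact(name, value):
    with open(os.path.join(store_dir, name + ".pkl"), "wb") as f:
        pickle.dump(value, f)

def load_artifact(name):
    with open(os.path.join(store_dir, name + ".pkl"), "rb") as f:
        return pickle.load(f)

# Step A assigns self.x = ... -> persisted as a blob
save_artifact("x", {"rows": 1000, "cols": 8})
# Step B (possibly a different process or machine) loads it back
print(load_artifact("x"))  # {'rows': 1000, 'cols': 8}
```

This is also why artifacts must be picklable: anything assigned to self has to survive a serialize/deserialize round trip between step processes.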

🧭 Entry Points to Explore

  • metaflow/flowspec.py: FlowSpec
  • metaflow/runtime.py: main()
  • metaflow/runtime/executor.py: StepExecutor
  • metaflow/runtime/engine.py: Engine
  • metaflow/runtime/local/local_runner.py (for local runs)

🧪 Tip for Debugging or Hacking

Add print statements or breakpoints in:

  • FlowSpec.next() — to trace DAG construction.
  • StepExecutor._run_step_function() — to watch step execution.
  • DataStore classes — to track how variables get stored/loaded.
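A generic way to add such tracing without editing Metaflow's source is to wrap the method you want to watch with a print-and-delegate decorator. The sketch below uses a dummy class as a stand-in; swap in the real class and method you are exploring:

```python
import functools

def trace(fn):
    """Print every call to fn with its arguments, then delegate."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        print(f"TRACE {fn.__name__} args={args[1:]} kwargs={kwargs}")
        return fn(*args, **kwargs)
    return wrapper

class DummyFlow:                 # stand-in for the class under study
    def next(self, *steps):
        return steps

DummyFlow.next = trace(DummyFlow.next)  # monkeypatch the method
DummyFlow().next("train", "validate")   # prints the call, returns ('train', 'validate')
```

The same pattern works for any method: patch it at import time, run the flow, and read the trace.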

Stages of metaflow

In Metaflow (as in many dynamic workflow systems), there's a meaningful distinction between definition time and runtime.


📌 1. Definition Time

  • When: Happens when Python imports and parses your script.
  • What happens:
    • Your FlowSpec class is defined and can be inspected via reflection.
    • All methods decorated with @step are registered.
    • All decorators (@batch, @retry, etc.) are applied and locked in.
    • The DAG structure is not yet built—just the static wiring is made available.
  • Key constraint: This is when Python and Metaflow set up metadata and decorators.
  • Limitation: You can’t use flow artifacts or conditional logic based on runtime parameters here.

🧠 Think of it as the "blueprint building" phase—just loading and preparing the class, not actually running steps.
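The blueprint/runtime split is plain-Python behavior: decorator arguments are evaluated once, while the class body executes, long before any step method runs. A toy demonstration (the `resources` decorator below is invented, not Metaflow's @batch):

```python
evaluated_at = []

def resources(cpu):                  # toy decorator, NOT @batch
    evaluated_at.append("definition time")
    def deco(fn):
        fn.cpu = cpu                 # the value is locked in here
        return fn
    return deco

class MyFlow:
    @resources(cpu=2)                # runs NOW, while the class is defined
    def start(self):
        evaluated_at.append("runtime")

print(evaluated_at)   # ['definition time']  -- start() hasn't run yet
MyFlow().start()
print(evaluated_at)   # ['definition time', 'runtime']
```

Whatever `cpu` evaluates to at import is what the decorator keeps; by the time `start()` runs, that decision was made long ago.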


📌 2. Runtime

  • When: Starts when you execute something like python myflow.py run.
  • What happens:
    • Metaflow starts from the entry point (start()).
    • Your @step method runs in isolation.
    • Variables assigned (e.g., self.x = 10) become artifacts.
    • Metaflow builds the DAG dynamically based on self.next(...).
    • Backend (local, AWS Batch, etc.) actually runs your code step by step.

🔄 Each step is its own "process," run with serialized inputs and outputs passed between steps.


📌 3. Scheduling Time (Optional)

If you're using Metaflow with Airflow, Argo Workflows, or a scheduler:

  • There's a third layer called scheduling or orchestration time.
  • This happens before runtime, but after definition time, and is when the scheduler:
    • Decides when and where to trigger a new flow run.
    • May inspect the static flow definition.
    • Doesn’t run steps or generate artifacts—just orchestrates triggers.

🧠 Why This Matters

  • Conditional logic based on flow parameters:

    @batch(cpu=2 if self.cpu_heavy else 1)  # ❌ Won’t work at definition time!

    self.cpu_heavy doesn’t exist yet — self only exists at runtime, not at definition time.

  • self.next(...) is called at runtime, allowing dynamic DAG construction.

  • @step, @batch, etc., are used at definition time, and cannot see runtime context.
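The failure mode in the @batch example above is reproducible in plain Python: `self` is simply not in scope while the class body (and therefore the decorator argument) is being evaluated. Again, `resources` here is an invented stand-in:

```python
def resources(cpu):          # toy stand-in for @batch
    def deco(fn):
        fn.cpu = cpu
        return fn
    return deco

err = None
try:
    class BadFlow:
        cpu_heavy = True
        # self does not exist in a class body -> NameError
        @resources(cpu=2 if self.cpu_heavy else 1)
        def start(self):
            pass
except NameError as e:
    err = e

print(err)   # name 'self' is not defined
```

The usual workarounds are to pass such values as flow parameters, or to override decorator settings at launch time rather than in the class body.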


✅ Rule of Thumb

| Phase | What you can use |
| --- | --- |
| Definition time | Class names, static imports, constants, decorators |
| Runtime | Flow parameters, self variables, control flow, self.next(...) |
| Scheduling time | Environment variables, scheduled triggers |