AI/ML notes


🧱 Metaflow Object Hierarchy

Here's the hierarchy in top-down order:

Flow
 └── Run
      └── Step
           └── Task
                └── Artifact

Think of it like a tree-shaped object model that helps you programmatically navigate past executions, debug, inspect data, or build automation.


1. Flow

Represents the entire flow class (your pipeline).

from metaflow import Flow
flow = Flow('MyFlow')
  • .latest_run → most recent run
  • .runs() → iterator of all runs
  • .name → name of the flow

2. Run

Represents a single execution of your flow.

run = Flow('MyFlow').latest_run
  • .id → unique run ID (e.g., 1699374571123927)
  • .steps() → iterator of steps in this run
  • .successful → boolean, True if the run succeeded
  • .created_at → timestamp
  • .user_tags, .system_tags

You can loop through runs:

for run in Flow('MyFlow').runs():
    print(run.id, run.successful)

3. Step

Represents a single step (e.g., start, train, join) inside a run.

step = run['train']
  • .name → step name
  • .tasks() → iterator of tasks for this step (can be more than one for foreach)
  • .successful → True if the step finished without error

4. Task

Represents a single task (actual execution unit of a step).

task = step.task  # or next(step.tasks())
  • .attempt → retry attempt number
  • .finished → True if the task finished
  • .stdout, .stderr
  • .tags
  • .metadata

Most importantly:

task.data.<artifact_name>

This is how you retrieve artifacts from past runs.


5. Artifact

This is the actual data (variable) persisted by a step.

acc = task.data.accuracy

These are the self.var = ... assignments in your step code.

Artifacts:

  • Are stored in Metaflow's data store
  • Are tied to the task, step, and run that created them
  • Are read-only from the perspective of another run

🧪 Example: Traversing the Object Tree

from metaflow import Flow

flow = Flow('GridSearchFlow')
run = flow.latest_run
step = run['train_model']  # Step where training occurred
for task in step.tasks():
    result = task.data.result
    print(result)

🏷️ ✅ Common Properties (Available on All Objects)

  • user_tags → tags added manually (CLI or code)
  • system_tags → tags generated automatically (e.g., user, runtime, version)
  • tags → union of user and system tags
  • created_at → when the object was created
  • parent → the parent object (e.g., a task's parent is a step)
  • pathspec → fully qualified string path (e.g., MyFlow/123/start/abcde)
  • path_components → pathspec split into a list (e.g., ['MyFlow', '123', 'start', 'abcde'])

These are useful when writing generic traversal utilities.
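As a sketch of such a traversal utility, here is a pure string helper that derives every ancestor pathspec from any object's pathspec (mirroring what repeated .parent lookups would walk; the example pathspec is the one from above):

```python
def ancestor_pathspecs(pathspec):
    """Yield the pathspecs of an object's ancestors, nearest first.

    'MyFlow/123/start/abcde' yields 'MyFlow/123/start',
    then 'MyFlow/123', then 'MyFlow'.
    """
    parts = pathspec.split('/')
    for i in range(len(parts) - 1, 0, -1):
        yield '/'.join(parts[:i])
```

Each yielded string can be handed to Task(...), Step(...), Run(...), or Flow(...) to instantiate the corresponding client object.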


🔄 Flow-Level (Flow object)

Accessed via:

from metaflow import Flow
flow = Flow('MyFlow')
  • runs() → iterator of all runs in the current namespace
  • latest_run → most recent run (finished or not)
  • latest_successful_run → most recent successful and finished run

➡️ Great for programmatically pulling the last N runs, analyzing outputs, or rerunning failed jobs.
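Pulling the last N runs can be sketched like this (assuming runs() yields newest first, and using the placeholder flow name 'MyFlow'):

```python
from itertools import islice

def last_n_runs(runs, n):
    """Take the first n items from an iterator of runs (newest first)."""
    return list(islice(runs, n))

if __name__ == "__main__":
    from metaflow import Flow
    for run in last_n_runs(Flow('MyFlow').runs(), 5):
        print(run.id, run.successful)
```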


πŸ” Run-Level (Run object)

run = Flow('MyFlow').latest_run
  • steps() → iterator of steps in this run
  • data → shortcut to run.end_task.data, i.e., the final step's artifacts
  • successful → True if the run finished successfully
  • finished → True if the run finished (success or fail)
  • finished_at → datetime of when the run finished
  • code → the code used in the run, if saved
  • end_task → shortcut to the task of the last step in the DAG
  • trigger → info on what triggered the run (e.g., a schedule or user)

➡️ The data property is especially handy to get final results fast without needing to drill down manually.
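Since a run's data only contains artifacts that were actually persisted, a defensive read helper can be handy. This sketch assumes the .data container supports `in` membership tests (as the client docs suggest):

```python
def artifact_or_default(data, name, default=None):
    """Read an artifact from a .data container, falling back if it was never set."""
    if name in data:
        return getattr(data, name)
    return default
```

For example, `artifact_or_default(Flow('MyFlow').latest_successful_run.data, 'accuracy')` with a placeholder flow and artifact name.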


πŸ” Step-Level (Step object)

step = run['train']
  • task → the single Task of this step (or any one of them, if there are multiple)
  • tasks() → iterator of all Tasks (for foreach steps)
  • finished_at → when the step completed (i.e., all tasks finished)
  • environment_info → execution environment metadata (e.g., Python version, OS)

➡️ If using foreach, this is where .tasks() becomes important for aggregating parallel task results.
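A sketch of that aggregation pattern (the 'GridSearchFlow' flow, 'train' step, and result artifact names are placeholders):

```python
def summarize(values):
    """Aggregate per-task numeric results from a foreach fan-out."""
    vals = list(values)
    return {"count": len(vals), "best": max(vals), "mean": sum(vals) / len(vals)}

if __name__ == "__main__":
    from metaflow import Flow
    step = Flow('GridSearchFlow').latest_run['train']
    print(summarize(task.data.result for task in step.tasks()))
```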


βš™οΈ Task-Level (Task object)

task = step.task
  • data → artifact values (i.e., variables set via self.var = ...)
  • artifacts → container of individual DataArtifact objects (rather than just values)
  • successful → True if the task succeeded
  • finished → True if the task completed (even if it failed)
  • finished_at → when the task finished
  • exception → exception info if the task failed
  • stdout / stderr → standard output/error strings
  • code → source code used for this task (if persisted)
  • environment_info → dict with system-level info (e.g., Python, Conda, image)

➡️ The data and stdout properties are the most common to use in inspection scripts or dashboards.


🧪 Minimal Working Example (From Docs)

from metaflow import Step

step = Step('DebugFlow/2/start')  # format: FlowName/RunID/StepName

if step.task.successful:
    print("Finished at:", step.task.finished_at)
    print("Stdout:")
    print(step.task.stdout)
    print("Artifacts:", [artifact.id for artifact in step.task.artifacts])

🧠 Bonus Tips

  • pathspec can be used to fetch any object directly using Step(...), Task(...), etc.
  • Use .tags in combination with filters to group runs or detect anomalies
  • code and environment_info are useful for reproducibility and audit logging
  • Combine .data with pandas.DataFrame or plotly to visualize experiment results

✅ Summary Table

  • Flow level (Flow object): latest_run, runs() → get recent runs
  • Run level (Run object): data, successful, end_task → check results
  • Step level (Step object): tasks(), finished_at → inspect step behavior
  • Task level (Task object): data, stdout, exception → debug and extract outputs