AI/ML notes

Parameters

🧩 What Are Parameters in Metaflow?

Metaflow provides the Parameter class to declare runtime parameters that are accessible via self.<param_name> in your steps.

They make your flow:

  • Reusable across runs
  • Testable with different configurations
  • Easy to integrate with automation (e.g., production schedulers or experiment scripts)

🛠️ Basic Syntax

from metaflow import FlowSpec, step, Parameter

class MyFlow(FlowSpec):

    my_param = Parameter(
        'my_param',
        help='This is a sample parameter',
        default='default_value'
    )

    @step
    def start(self):
        print("Parameter value:", self.my_param)
        self.next(self.end)

    @step
    def end(self):
        print("Flow complete.")

if __name__ == '__main__':
    MyFlow()

🧪 Example CLI Run

You can override the parameter from the command line like this:

python my_flow.py run --my_param hello_world

Expected output:

Parameter value: hello_world
Flow complete.

If you don’t pass the parameter, the default will be used:

python my_flow.py run

Output:

Parameter value: default_value
Flow complete.

🧮 Parameter Types

Parameters without an explicit type are treated as strings. You can pass type= to control parsing (Metaflow can also infer the type from the default value, if one is given).

Examples:

from metaflow import FlowSpec, step, Parameter

class TypedFlow(FlowSpec):

    count = Parameter('count', help='Number of items', type=int, default=3)
    threshold = Parameter('threshold', type=float, default=0.8)
    active = Parameter('active', type=bool, default=True)

    @step
    def start(self):
        print("count:", self.count)
        print("threshold:", self.threshold)
        print("active:", self.active)
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    TypedFlow()

CLI:

python typed_flow.py run --count 10 --threshold 0.95 --active False

Output:

count: 10
threshold: 0.95
active: False

Note: exact boolean handling has varied across Metaflow versions; some versions accept an explicit value (--active False), while others expose a bool parameter as a flag (--active). Run python typed_flow.py run --help to see how your installed version handles it.


🧪 Testing via Jupyter or Script

If running interactively (e.g., from Jupyter or testing):

from metaflow import FlowSpec, step, Parameter

class MyTestFlow(FlowSpec):
    name = Parameter('name', default='default')

    @step
    def start(self):
        print("Hello,", self.name)
        self.next(self.end)

    @step
    def end(self):
        pass

# Instantiating the class does NOT run the flow. Save it as my_test_flow.py
# and launch it with the Runner API (available in recent Metaflow versions):
from metaflow import Runner

run = Runner('my_test_flow.py').run(name='Alice')
print(run.status)

✅ Recap

Feature and description:

  • Parameter(...): declare a configurable parameter
  • type=: set the expected type (e.g., int, float, bool)
  • default=: provide a fallback value
  • --param value: CLI syntax to override a parameter

Let's walk through a simple ML hyperparameter sweep using Metaflow parameters, simulating a grid search over a model training process.

We'll use:

  • Parameter to pass hyperparameters like learning rates and regularization strengths
  • Metaflow's foreach fan-out (self.next(self.step_name, foreach='artifact_name')) to run each parameter combination as its own branch
  • A dummy ML model using scikit-learn’s LogisticRegression trained on sklearn.datasets.make_classification
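
The fan-out grid is just the cartesian product of the two hyperparameter lists. As a plain-Python sketch of what the flow's start step computes (using the same comma-separated defaults as the flow below):

```python
from itertools import product

# Parse comma-separated CLI strings into floats, as the start step does
learning_rates = [float(x) for x in '0.01,0.1,1.0'.split(',')]
c_values = [float(x) for x in '0.1,1.0,10.0'.split(',')]

# Every (learning_rate, C) combination becomes one foreach branch
param_grid = list(product(learning_rates, c_values))
print(len(param_grid))  # 3 x 3 = 9 configurations
```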

🧪 Example: Grid Search with Metaflow Parameters

from metaflow import FlowSpec, step, Parameter
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

import numpy as np

class GridSearchFlow(FlowSpec):

    learning_rates = Parameter(
        'learning_rates',
        help='Comma-separated learning rates',
        default='0.01,0.1,1.0'
    )

    c_values = Parameter(
        'c_values',
        help='Comma-separated C values (inverse regularization)',
        default='0.1,1.0,10.0'
    )

    @step
    def start(self):
        # Metaflow parameters are read-only, so parse them into new names
        lrs = [float(x) for x in self.learning_rates.split(',')]
        cs = [float(x) for x in self.c_values.split(',')]

        # Cartesian product of hyperparameters
        from itertools import product
        self.param_grid = list(product(lrs, cs))

        print(f"Total configs to try: {len(self.param_grid)}")
        self.next(self.train_model, foreach='param_grid')

    @step
    def train_model(self):
        lr, c = self.input
        print(f"Training with learning_rate={lr}, C={c}")

        # Generate dummy data
        X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

        # Simulate learning rate via max_iter, as LogisticRegression doesn’t support lr directly
        model = LogisticRegression(C=c, max_iter=int(1000 * lr), solver='lbfgs')
        model.fit(X_train, y_train)
        preds = model.predict(X_test)

        acc = accuracy_score(y_test, preds)
        print(f"Accuracy: {acc:.4f}")

        self.result = {
            'learning_rate': lr,
            'C': c,
            'accuracy': acc
        }
        self.next(self.aggregate)

    @step
    def aggregate(self, inputs):
        # Join step: gather the result artifact from every foreach branch
        self.results = [inp.result for inp in inputs]
        # Sort results by accuracy
        self.results.sort(key=lambda x: x['accuracy'], reverse=True)
        print("Top results:")
        for r in self.results[:3]:
            print(r)
        self.next(self.end)

    @step
    def end(self):
        print("Grid search complete!")

if __name__ == '__main__':
    GridSearchFlow()

🖥️ Run it via CLI

python grid_search_flow.py run \
  --learning_rates 0.01,0.1 \
  --c_values 0.1,1.0

This runs 4 parallel model trainings:

  • (0.01, 0.1)
  • (0.01, 1.0)
  • (0.1, 0.1)
  • (0.1, 1.0)

Each train_model step runs in parallel by virtue of foreach='param_grid'.
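
The join step's ranking is ordinary Python. A standalone sketch of the aggregate logic, with made-up accuracies standing in for real branch results:

```python
# Hypothetical per-branch results, as each train_model step would produce
results = [
    {'learning_rate': 0.01, 'C': 0.1, 'accuracy': 0.84},
    {'learning_rate': 0.01, 'C': 1.0, 'accuracy': 0.86},
    {'learning_rate': 0.1,  'C': 0.1, 'accuracy': 0.85},
    {'learning_rate': 0.1,  'C': 1.0, 'accuracy': 0.88},
]

# Same logic as the aggregate step: best accuracy first
results.sort(key=lambda r: r['accuracy'], reverse=True)
print(results[0])  # {'learning_rate': 0.1, 'C': 1.0, 'accuracy': 0.88}
```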


📌 Key Concepts Demonstrated

Concept and usage:

  • Parameter: inject hyperparameters via the CLI
  • self.next(..., foreach='param_grid'): dynamically fan out steps across combinations
  • self.input: access the (learning_rate, C) tuple in each branch
  • join step (aggregate(self, inputs)): collect results and rank the best models

🧪 JSON Config Parameter

from metaflow import FlowSpec, step, Parameter, JSONType

class JsonParamFlow(FlowSpec):

    config = Parameter(
        'config',
        type=JSONType,
        help='JSON string of model config',
        default='{"lr": 0.1, "batch_size": 32, "dropout": 0.3}'
    )

    @step
    def start(self):
        print("Parsed config:", self.config)
        print("Learning Rate:", self.config['lr'])
        print("Batch Size:", self.config['batch_size'])
        print("Dropout Rate:", self.config['dropout'])
        self.next(self.end)

    @step
    def end(self):
        print("Done.")

if __name__ == '__main__':
    JsonParamFlow()

🧪 CLI Run with JSON

python json_param_flow.py run \
  --config '{"lr": 0.05, "batch_size": 64, "dropout": 0.2}'

✅ Metaflow will automatically convert this into:

{
    "lr": 0.05,
    "batch_size": 64,
    "dropout": 0.2
}

And you'll get CLI output like:

Parsed config: {'lr': 0.05, 'batch_size': 64, 'dropout': 0.2}
Learning Rate: 0.05
Batch Size: 64
Dropout Rate: 0.2
Done.

⚠️ Pro Tip for Bash Users

Make sure to:

  • Use single quotes ' around the whole JSON string
  • Use double quotes " inside for the JSON keys/values

So this works:

'{"key": "value"}'

But this fails:

"{'key': 'value'}"  # Not valid JSON

🧠 Tradeoff: JSONType vs JSON File

🧷 Option 1: JSONType (Inline Parameter)

--config '{"lr": 0.01, "batch_size": 32}'

Pros:

  • Quick and easy for small configs
  • Great for prototyping or ad hoc runs
  • Can be used directly in CI/CD or Airflow triggers

Cons:

  • Shell quoting issues (nesting a double-quoted JSON string inside another double-quoted string is a headache)
  • Not human-friendly for large configs
  • Hard to version or document properly
  • Gets messy fast when nested

📁 Option 2: Pass JSON file path as string

from metaflow import FlowSpec, step, Parameter

class MyFlow(FlowSpec):
    config_file = Parameter('config_file', default='config.json')

    @step
    def start(self):
        import json
        with open(self.config_file, 'r') as f:
            self.config = json.load(f)

        print("Config:", self.config)
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    MyFlow()

Run it with:

python flow.py run --config_file configs/model_v1.json

Pros:

  • Clean, readable, reusable
  • Easily version-controlled
  • Good for deep configs or experiments
  • Less prone to quoting errors

Cons:

  • Slightly more boilerplate (need to read the file)
  • More moving parts in a fully automated run
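
The file-based option boils down to a single json.load call. A self-contained sketch that writes and then reads a throwaway config (the file name and fields are illustrative):

```python
import json
import os
import tempfile

# Write a sample config file (stand-in for configs/model_v1.json)
config = {'lr': 0.01, 'batch_size': 32, 'dropout': 0.3}
path = os.path.join(tempfile.mkdtemp(), 'config.json')
with open(path, 'w') as f:
    json.dump(config, f, indent=2)

# What the flow's start step does with the --config_file path
with open(path, 'r') as f:
    loaded = json.load(f)
print(loaded['lr'])  # 0.01
```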

🧪 When Metaflow's JSONType Shines

JSONType is best for lightweight structured inputs that change often, e.g.:

python train.py run \
  --config '{"model":"xgb", "features":["f1","f2"], "cv":5}'

Or if you're running experiments programmatically from a notebook, use the Runner API (available in recent Metaflow versions; flows can't be run by instantiating the class directly):

from metaflow import Runner

Runner('my_flow.py').run(config='{"lr": 0.01, "dropout": 0.2}')

It’s not intended for storing full model configs, deployment setups, or anything you'd want under version control.

Situation and best choice:

  • Small structured configs (e.g., 2–5 fields): type=JSONType
  • Running flows from CI or notebooks where passing JSON inline is convenient: type=JSONType
  • Large, nested, or reused configs (e.g., 10+ fields, multiple layers): JSON file path, loaded inside the flow
  • Config shared across multiple flows/scripts: JSON file
  • Config that needs Git version control: JSON file