Orchestration with Airflow

Make ML pipelines a breeze with Airflow and Anyscale

An Airflow DAG using the Ray Provider

Orchestrating a flow of interdependent tasks is a critical piece of infrastructure that most ML or data platforms within an organization will need.
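For context, here is a minimal sketch of such a flow in plain Airflow using the TaskFlow API; the task names and schedule are illustrative only, not part of the Ray example below:

from datetime import datetime

from airflow.decorators import dag, task

@dag(start_date=datetime(2021, 1, 1), schedule_interval=None, catchup=False)
def simple_pipeline():
    @task
    def extract() -> list:
        # Upstream task: produce some raw records
        return [1, 2, 3]

    @task
    def transform(records: list) -> int:
        # Downstream task: depends on extract() through its return value
        return sum(records)

    # Passing one task's output into another wires the dependency edge
    transform(extract())

simple_pipeline_dag = simple_pipeline()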

In Airflow parlance, the recommended integration point is the Airflow-Ray Provider. The diagram below shows how this fits together alongside other operators, tasks, and your storage layer:

Airflow gets a supercharge when paired with Ray's zero-copy Plasma object store
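To make the caption concrete: Ray keeps task inputs and outputs in a shared-memory object store, so large objects are passed between tasks by reference rather than reserialized for each consumer. A plain-Ray sketch, independent of Airflow:

import ray

ray.init()

# Large objects live in Ray's shared-memory object store; workers on
# the same node read them zero-copy instead of deserializing a copy.
big_list = list(range(1_000_000))
obj_ref = ray.put(big_list)

@ray.remote
def total(values):
    return sum(values)

# Passing the ObjectRef hands the task the stored object by reference
print(ray.get(total.remote(obj_ref)))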

Writing a DAG that uses the Ray provider looks like this:

import anyscale
anyscale.connect()
# ... now everything below runs on your Anyscale Ray cloud!

import pandas as pd

from airflow.decorators import dag
from ray_provider.decorators import ray_task
from ray_provider.xcom.ray_backend import RayBackend  # Ray-backed XCom

default_args = {
    'on_success_callback': RayBackend.on_success_callback,
    'on_failure_callback': RayBackend.on_failure_callback,
    # ...
}
task_args = {"ray_conn_id": "ray_cluster_connection"}

# ...

@dag(
    default_args=default_args,
    # ...
)
def ray_example_dag():
    @ray_task(**task_args)
    def sum_cols(df: pd.DataFrame) -> pd.DataFrame:
        # Sum each column and return the result as a one-row DataFrame
        return pd.DataFrame(df.sum()).T

example_dag = ray_example_dag()

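Because each @ray_task's return value stays in the Ray object store, chaining tasks is just function composition inside the DAG body. A sketch continuing the example above (load_data is a hypothetical upstream task added here for illustration; imports and task_args are as defined earlier):

def ray_example_dag():
    @ray_task(**task_args)
    def load_data() -> pd.DataFrame:
        # Hypothetical source task; in practice this would read from
        # your storage layer
        return pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

    @ray_task(**task_args)
    def sum_cols(df: pd.DataFrame) -> pd.DataFrame:
        return pd.DataFrame(df.sum()).T

    # Calling one task on another's output creates the dependency edge;
    # the intermediate DataFrame travels through Ray's object store
    # rather than Airflow's XCom metadata database.
    sum_cols(load_data())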
A full example that you can pull down and try out is available on GitHub.
