

You can now schedule and orchestrate workflows developed with Metaflow on Apache Airflow.

In data engineering, Apache Airflow is a household brand. It is easy to see why many companies get started with it: It is readily familiar to most data engineers, it is quick to set up, and, as proven by the millions of data pipelines powered by it since 2014, it can clearly keep DAGs running.

The same data pipelines that contribute to Airflow's popularity have also contributed to countless hours of debugging and missed SLAs, revealing fundamental issues in Airflow's design. The widely documented issues can be summarized in two categories: suboptimal developer experience and operational headaches, discussed below.

Today, we are releasing support for orchestrating Metaflow workflows using Airflow. The integration is motivated by our human-centric approach to data science: Still today, many data scientists, ML engineers, and data engineers are required to use Airflow. We want to provide them with a better user experience and a stable API, which allows them to develop projects faster and start future-proofing their projects with minimal operational disruption.

While walking through a DAG seems like a schoolbook exercise, it is easy to underestimate the number of engineering-years and battle scars it takes to build a real-world, production-grade workflow orchestrator. These challenges are not limited to large companies: In a modern experimentation-driven culture every variant counts, so even smaller companies can accumulate surprisingly many workflows quickly. In 2023, developing and deploying a new workflow variant should be as easy as opening a pull request.

Each workflow can spawn thousands of tasks – imagine conducting a hyperparameter search as part of a nightly model training workflow, as sketched below. And every workflow and task needs to be executed in a highly available manner while reacting to a torrent of external events in real time.
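To make that fan-out concrete, here is a minimal sketch of such a nightly search expressed as a Metaflow flow. The flow name, the parameter grid, and the stand-in scoring line are illustrative inventions; `foreach` is Metaflow's construct for fanning a step out into one task per item, so a large grid turns into equally many parallel tasks:

```python
from metaflow import FlowSpec, step

class NightlyTrainingFlow(FlowSpec):  # hypothetical flow name

    @step
    def start(self):
        # An illustrative grid; real searches can easily reach
        # thousands of combinations, i.e. thousands of tasks.
        self.params = [{"lr": lr, "depth": d}
                       for lr in (0.01, 0.05, 0.1)
                       for d in (4, 6, 8)]
        # foreach fans out: one train task per element of self.params.
        self.next(self.train, foreach="params")

    @step
    def train(self):
        self.combo = self.input  # this task's parameter combination
        # Stand-in for real model training and evaluation.
        self.score = -abs(self.combo["lr"] - 0.05) * self.combo["depth"]
        self.next(self.join)

    @step
    def join(self, inputs):
        # Join step: pick the best combination across all branches.
        self.best = max(inputs, key=lambda task: task.score).combo
        self.next(self.end)

    @step
    def end(self):
        print("best params:", self.best)

if __name__ == "__main__":
    NightlyTrainingFlow()
```

The same flow runs locally during development and can later be handed to an orchestrator unchanged, which is the premise of this integration.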
We talked about these topics at length when we released our integration to another production-grade orchestrator, AWS Step Functions.

Herein lies the root cause of many issues in Airflow: It served its original use cases well, but its design and architecture are not suitable for the increasing demands of modern data (science) stacks. Airflow is perfectly capable of orchestrating a set of basic data pipelines, but with the increasing demands of ML and data science – and more modern data organizations – its cracks are becoming visible.

Fixing these issues while maintaining backward compatibility with the millions of existing Airflow pipelines is nigh impossible. Airflow will surely keep improving, as it did with the major release of Airflow 2.0, but migrating existing pipelines to new, untried APIs is not necessarily easier than migrating to another, more modern orchestrator.

As a result, many companies find themselves in a pickle: They have a hairball of business-critical data pipelines orchestrated by Airflow, encapsulating years of accumulated business logic. At the same time, they are becoming increasingly aware that the system is slowing down their development velocity and causing avoidable operational overhead.

Develop with Metaflow, deploy on Airflow

We want to provide a new path for teams that find themselves in this situation. Our new Airflow integration allows you to develop workflows in Metaflow, using its data scientist-friendly, productivity-boosting APIs, and deploy them on your existing Airflow server, as shown in the video below:

The resulting workflows get scheduled like any other Airflow workflows, and they live happily side by side with your existing Airflow-native workflows.

Consider the benefits of using Metaflow for workflow development compared to Airflow: Under the hood, Metaflow translates its flows to Airflow-compatible DAGs automatically, so the operational concerns are invisible to data scientists, who can benefit from the features of Metaflow.
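Concretely, turning a flow into an Airflow DAG is a one-liner in the Metaflow CLI. A minimal sketch, assuming the hypothetical flow file from above, a Metaflow version that includes the Airflow integration, and a default `$AIRFLOW_HOME/dags` setup:

```bash
# Compile the Metaflow flow into an Airflow-compatible DAG file.
# File names are illustrative.
python nightly_training_flow.py airflow create nightly_training_dag.py

# Place the generated file where the Airflow scheduler looks for DAGs;
# it is then scheduled like any hand-written Airflow DAG.
mv nightly_training_dag.py "$AIRFLOW_HOME/dags/"
```

From Airflow's perspective, the generated file is just another DAG definition, which is why these workflows can coexist with your Airflow-native ones.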
