Data Pipelines - Airflow vs Pinball vs Luigi

With increasingly more companies considering themselves "data-driven" and with the vast amounts of "big data" being used, data pipelines or workflows have become an integral part of data engineering.

Having outgrown hacked-together cron jobs and bash scripts, there are now several data pipeline frameworks which make it easier to schedule, debug, and retry jobs. These jobs are typically long-running (several hours to several days) and processing several billion rows of data, for example as part of an ETL process, mapreduce jobs, or data migration.

Why Airflow, Luigi, and Pinball?

A client I consult for is considering building its own data pipeline framework to handle sensor / electric meter data. A first step to this was reviewing the currently available data pipeline frameworks, of which there are many.

In this post, I review 3 of these:

Airflow, open sourced by AirBnB
Luigi, open sourced by Spotify
Pinball, open sourced by Pinterest

These were chosen as they most closely matched my client's requirements, namely:

Python-based, which ruled out Scala and Java-based frameworks. I would have loved to look at Mario or Suro if this were not a consideration.
Emphasis on code to configure / run pipelines, which ruled out XML / configuration heavy frameworks.
Reasonably general purpose (ruled out frameworks specialising in scientific / statistical analysis)
Reasonably popular / growing community

Admittedly I was only able to spend about a day looking at the source code for these and have not used these in a production environment, so please correct me if I have misunderstood anything!

Terminology

It can be mildly confusing reading about these 3 frameworks since they use slightly different terminology to refer to similar concepts.

AirflowLuigiPipelineCollection of work to be done (I refer to this as the data pipeline)DAG (Directed Acyclic Graph)Not really supported, Tasks are grouped together into a DAG to be run. Most of the code treats Tasks as the main unit of work.WorkflowMain unit of work. Does not refer to an individual unit of data, but rather a batch of these. eg TopArtistsJob rather than Artist A, Artist B, etcJobsTasksTokensClass processing the main unit of workOperatorsTasks / WorkersJobs / Workers

Configuring a pipeline

AirflowLuigiPinball

Create a python class which imports existing Operator classes
Ships with numerous Operators, so a DAG can be constructed more dynamically with existing Operators
example constructor

Requires subclassing one of the small number of Tasks, not as dynamic.
example constructor

Create a config dictionary with jobs and schedules
example constructor

User Interface and Metadata

All 3 frameworks ship with an admin panel of sorts which shows the status of your workflow / tasks (ie the metadata).

AirflowLuigiPinballUI

Comprehensive, with multiple screens
Built with Flask

Overview of Tasks only

Detailed, looks like Sidekiq
Built with Django

Metadata / Job status

Job status is stored in a database
Operators mark jobs as passed / failed
Last_updated is refreshed frequently with a heartbeat function
kill_zombies() is called to clear all jobs with older heartbeats

Task status is stored in database
Similar to Airflow, but fewer details

Workers 'claim' messages from the queue with an ownership timestamp on the message
This lease claim gets renewed frequently
Messages with older lease claims are requeued. Messages successfully processed are archived to S3 file system using Secor.
Job status is stored to database.

Scaling

Read more about using multiprocessing vs multithreading in Python.

AirflowLuigiPinballScaling

DAGs can be constructed with multiple Operators
Scale out by adding Celery workers

Create multiple Tasks

Add Workers

Parallel Execution

Subprocess

Subprocess

Threading

Dependency Management

This refers to not processing a job / task unless one or more conditions have been met first.

AirflowLuigiPinball

Operators can be constructed with depends_on_past parameter

Tasks can be constructed with requires() method

                        
class ArtistToplistToDatabase(luigi.postgres.CopyToTable):

    def requires(self):
        return Top10Artists(self.date_interval, self.use_hadoop)

Jobs can require other jobs to finish first before starting, eg child_job requires parent_job

                        
WORKFLOWS = {
    'example_workflow': WorkflowConfig(
        jobs={
            'parent_job':JobConfig(SomeJob, []),
            'child_job':JobConfig(AnotherJob, ['parent_job'])
        },
        final_job_config=....,
        notify_emails=....
}

Disadvantages

Here are my perceived disadvantages of each framework compared to my client's requirements:

AirflowLuigiPinball

No Kafka support, uses Celery (RabbitMQ, Redis)
Seems more suitable for scheduled batch jobs, rather than streaming data. In particular, updating the metadata DB for streaming data may have a performance impact.
Possibly too full-featured / overkill.

Metadata and UI is not as useful as Airflow.
Relatively small number of Tasks, requires writing subclasses for most of our requirements.
Possibly not suitable for streaming data, same performance concern as Airflow.

Uses threading.
Possibly not suitable for streaming data, same performance concern as Airflow.

Data Pipelines - Airflow vs Pinball vs Luigi

Why Airflow, Luigi, and Pinball?

Terminology

Configuring a pipeline

User Interface and Metadata

Scaling

Dependency Management

Disadvantages

Further Reading