Data Pipelines - Airflow vs Pinball vs Luigi

Jan 12th, 2016 in  by Michael Cho

Review of 3 common Python-based data pipeline / workflow frameworks from AirBnb, Pinterest, and Spotify.

With increasingly more companies considering themselves "data-driven" and with the vast amounts of "big data" being used, data pipelines or workflows have become an integral part of data engineering.

Having outgrown hacked-together cron jobs and bash scripts, there are now several data pipeline frameworks which make it easier to schedule, debug, and retry jobs. These jobs are typically long-running (several hours to several days) and processing several billion rows of data, for example as part of an ETL process, mapreduce jobs, or data migration.

Why Airflow, Luigi, and Pinball?

A client I consult for is considering building its own data pipeline framework to handle sensor / electric meter data. A first step to this was reviewing the currently available data pipeline frameworks, of which there are many.

In this post, I review 3 of these:

  1. Airflow, open sourced by AirBnB
  2. Luigi, open sourced by Spotify
  3. Pinball, open sourced by Pinterest

These were chosen as they most closely matched my client's requirements, namely:

  • Python-based, which ruled out Scala and Java-based frameworks. I would have loved to look at Mario or Suro if this were not a consideration.
  • Emphasis on code to configure / run pipelines, which ruled out XML / configuration heavy frameworks.
  • Reasonably general purpose (ruled out frameworks specialising in scientific / statistical analysis)
  • Reasonably popular / growing community

Admittedly I was only able to spend about a day looking at the source code for these and have not used these in a production environment, so please correct me if I have misunderstood anything!

Terminology

It can be mildly confusing reading about these 3 frameworks since they use slightly different terminology to refer to similar concepts. 

 AirflowLuigiPipeline
Collection of work to be done (I refer to this as the data pipeline) DAG (Directed Acyclic Graph) Not really supported, Tasks are grouped together into a DAG to be run. Most of the code treats Tasks as the main unit of work. Workflow
Main unit of work. Does not refer to an individual unit of data, but rather a batch of these. eg TopArtistsJob rather than Artist A, Artist B, etc Jobs Tasks Tokens
Class processing the main unit of work Operators Tasks / Workers Jobs / Workers

 

Configuring a pipeline

AirflowLuigiPinball
  • Create a python class which imports existing Operator classes
  • Ships with numerous Operators, so a DAG can be constructed more dynamically with existing Operators
  • example constructor

User Interface and Metadata

All 3 frameworks ship with an admin panel of sorts which shows the status of your workflow / tasks (ie the metadata).

 AirflowLuigiPinball
UI
Metadata / Job status
  • Job status is stored in a database
  • Operators mark jobs as passed / failed
  • Last_updated is refreshed frequently with a heartbeat function
  • kill_zombies() is called to clear all jobs with older heartbeats
  • Task status is stored in database
  • Similar to Airflow, but fewer details
  • Workers 'claim' messages from the queue with an ownership timestamp on the message
  • This lease claim gets renewed frequently
  • Messages with older lease claims are requeued. Messages successfully processed are archived to S3 file system using Secor.
  • Job status is stored to database.

Scaling

Read more about using multiprocessing vs multithreading in Python.

 AirflowLuigiPinball
Scaling
  • DAGs can be constructed with multiple Operators
  • Scale out by adding Celery workers
  • Create multiple Tasks
  • Add Workers
Parallel Execution
  • Subprocess
  • Subprocess
  • Threading

 

Dependency Management

This refers to not processing a job / task unless one or more conditions have been met first.

AirflowLuigiPinball
  • Operators can be constructed with depends_on_past parameter
  • Tasks can be constructed with requires() method
                        
class ArtistToplistToDatabase(luigi.postgres.CopyToTable):

    def requires(self):
        return Top10Artists(self.date_interval, self.use_hadoop)
                        
                    
  • Jobs can require other jobs to finish first before starting, eg child_job requires parent_job
                        
WORKFLOWS = {
    'example_workflow': WorkflowConfig(
        jobs={
            'parent_job':JobConfig(SomeJob, []),
            'child_job':JobConfig(AnotherJob, ['parent_job'])
        },
        final_job_config=....,
        notify_emails=....
}
                        
                    

 

Disadvantages

Here are my perceived disadvantages of each framework compared to my client's requirements:

AirflowLuigiPinball
  • No Kafka support, uses Celery (RabbitMQ, Redis)
  • Seems more suitable for scheduled batch jobs, rather than streaming data. In particular, updating the metadata DB for streaming data may have a performance impact.
  • Possibly too full-featured / overkill.
  • Metadata and UI is not as useful as Airflow.
  • Relatively small number of Tasks, requires writing subclasses for most of our requirements.
  • Possibly not suitable for streaming data, same performance concern as Airflow.
  • Uses threading.
  • Possibly not suitable for streaming data, same performance concern as Airflow.

 

Further Reading

If you have read so far, you may find the following links useful too:

 


Other articles you may like

Primer to Python multiprocessing, multithreading, and asyncio
Oct 24th, 2018
Method delegation in Python
Jul 11th, 2018
Using Python enums in SQLAlchemy models
May 16th, 2018
Python command-line scripts with argparse
Feb 15th, 2018
SQLAlchemy commit(), flush(), expire(), refresh(), merge() - what's the difference?
Nov 2nd, 2017