Luigi

@rantav totango Tuesday, July 15, 14

WHO AM I?
• A developer
• Google, Microsoft, Outbrain, Gigaspaces, Totango etc.
• Hector, flask-restful-swagger, meteor-migrations, monitoring...
• Podcast: reversim.com
• devdev.io
• Gormim

WHAT IS LUIGI?
• A Workflow Engine.
• Who the fuck needs a workflow engine?
• You do!!!
• If you run Hadoop (or other ETL jobs)
• If you have dependencies b/w them (who doesn’t?!)
• If they fail (s/if/when/)
• Luigi doesn’t replace Hadoop, Scalding, Pig, Hive, Redshift.
• It orchestrates them

SCREENSHOTS

HOW DO YOU ETL YOUR DATA?
• Hadoop
• Spark
• Redshift
• Postgres
• Ad-hoc java/python/ruby/go/...

RUNNING ONE JOB IS EASY, RUNNING MANY IS HARD
• 100s of concurrent jobs, 1000s daily.
• Job dependencies.
• E.g. first copy the file, then crunch it.
• Errors / retries
• Idempotency
• Monitoring / Visuals

THE WRONG WAY TO DO IT

EXAMPLE WORKFLOW
[Diagram: Log data (several sources) → Subsample and extract features → Features → Train classification model → Model → Upload model to servers]
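
The dependency chain in this example can be written down as a small graph and ordered with the standard library. This is only a sketch: the step names are shorthand for the boxes on the slide, and `graphlib` (Python 3.9+) stands in for what a workflow engine does internally. It is not Luigi code.

```python
from graphlib import TopologicalSorter

# Each key depends on the steps in its set.
# The names are hypothetical shorthand for the boxes in the diagram.
workflow = {
    "extract_features": {"log_data"},
    "train_model": {"extract_features"},
    "upload_model": {"train_model"},
}


def execution_order(graph):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())
```

Calling `execution_order(workflow)` lists the steps from `log_data` through `upload_model`, which is the order a scheduler would have to respect.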

THE CRON PHENOMENON
Don’t try this at home!!!
THE WRONG WAY TO DO IT

ENTER LUIGI
• Like Makefile - but in python
• And - For data
• Integrates well with data targets
• Hadoop, Spark, Databases
• Atomic file/db operations
• Visualization
• CLI - really nice developer interface!
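
The "atomic file operations" bullet can be illustrated without Luigi: write to a temporary file and rename it into place only on success, so a target either exists completely or not at all. The helper name `atomic_write` is hypothetical, and this is a sketch of the general technique rather than Luigi's own implementation.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def atomic_write(path):
    """Yield a writable file; move it to `path` only if the block succeeds."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as f:
            yield f
        os.replace(tmp_path, path)  # atomic rename
    except BaseException:
        # Leave no half-written target behind.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this pattern, a job that crashes halfway never leaves an output file that a later "does the output exist?" check would mistake for a finished result.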

LUIGI TASK

RUN FROM THE CLI

TASK PARAMETERS

AWESOME HADOOP (MR) SUPPORT

WEB UI

PROCESS SYNCHRONIZATION

USED BY

SEMI-DEEP DIVE
Programming for Luigi

LUIGI TASKS
• Implement 4 methods:
def input(self) (optional)
def output(self)
def run(self)
def requires(self)

LUIGI TASKS
• Or extend one of the predefined tasks
• S3CopyToTable
• RedshiftManifestTask
• SparkJob
• HiveQueryTask
• HadoopJobTask

EXAMPLE LOCAL WORDCOUNT

import luigi

class WordCount(luigi.Task):
    date_interval = luigi.DateIntervalParameter()

    def requires(self):
        # InputText is a task defined elsewhere (not shown on the slide)
        return [InputText(date) for date in self.date_interval.dates()]

    def output(self):
        return luigi.LocalTarget('/var/tmp/text-count/%s' % self.date_interval)

    def run(self):
        counts = {}
        for file in self.input():
            for line in file.open('r'):
                for word in line.strip().split():
                    counts[word] = counts.get(word, 0) + 1

        # output data
        with self.output().open('w') as f:
            for word, n in counts.items():
                f.write("%s\t%d\n" % (word, n))

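EXAMPLE HADOOP WORDCOUNT

# Module paths as of this talk (2014); current Luigi releases expose these
# as luigi.contrib.hadoop and luigi.contrib.hdfs.
import luigi, luigi.hadoop, luigi.hdfs

class WordCount(luigi.hadoop.JobTask):
    date_interval = luigi.DateIntervalParameter()

    def requires(self):
        # InputText is a task defined elsewhere (not shown on the slide)
        return [InputText(date) for date in self.date_interval.dates()]

    def output(self):
        return luigi.hdfs.HdfsTarget('/tmp/text-count/%s' % self.date_interval)

    def mapper(self, line):
        # Emit (word, 1) for every word in the input line
        for word in line.strip().split():
            yield word, 1

    def reducer(self, key, values):
        # Sum the counts collected for each word
        yield key, sum(values)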
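
To see what the `mapper` and `reducer` pair computes, the same two functions can be run locally in plain Python. This is a rough sketch: `run_local` is a hypothetical helper that imitates the group-by-key step Hadoop performs between map and reduce, and it is not part of Luigi.

```python
from collections import defaultdict


def mapper(line):
    # Emit (word, 1) for every word in the line.
    for word in line.strip().split():
        yield word, 1


def reducer(key, values):
    # Sum all the counts that were emitted for one word.
    yield key, sum(values)


def run_local(lines):
    """Group mapper output by key, then feed each group to the reducer."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)

    result = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            result[out_key] = out_value
    return result
```

For example, `run_local(["a b a", "b c"])` counts `a` twice, `b` twice and `c` once.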

LUIGI TARGETS
• HDFS
• Local File
• Postgres / MySQL, Redshift, ElasticSearch
• ... Easy to extend

DEFINING A TARGET
• Implement: def exists(self)
And optionally: connect or open / close

EXAMPLE MYSQL TARGET

class MySqlTarget(luigi.Target):
    def touch(self, connection=None):
        ...

    def exists(self, connection=None):
        # Fall back to a fresh connection when the caller does not pass one
        if connection is None:
            connection = self.connect()
        cursor = connection.cursor()
        cursor.execute("""SELECT 1 FROM {marker_table}
            WHERE update_id = %s
            LIMIT 1""".format(marker_table=self.marker_table),
            (self.update_id,)
        )
        row = cursor.fetchone()
        return row is not None

    def connect(self, autocommit=False):
        ...

    def create_marker_table(self):
        ...
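
The marker-table idea behind this target can be tried end to end with SQLite from the standard library. This is a minimal sketch under stated assumptions: `MarkerTableTarget`, the `table_updates` default and the use of `sqlite3` are all illustrative choices, not the MySQL target shipped with Luigi.

```python
import sqlite3


class MarkerTableTarget:
    """Toy target: it 'exists' once a marker row for update_id is present."""

    def __init__(self, connection, update_id, marker_table="table_updates"):
        self.connection = connection
        self.update_id = update_id
        # The table name is formatted into SQL, so it must be a trusted identifier.
        self.marker_table = marker_table

    def create_marker_table(self):
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS {t} (update_id TEXT PRIMARY KEY)".format(
                t=self.marker_table
            )
        )

    def touch(self):
        """Record that the update identified by update_id has completed."""
        self.create_marker_table()
        self.connection.execute(
            "INSERT OR IGNORE INTO {t} (update_id) VALUES (?)".format(
                t=self.marker_table
            ),
            (self.update_id,),
        )
        self.connection.commit()

    def exists(self):
        self.create_marker_table()
        row = self.connection.execute(
            "SELECT 1 FROM {t} WHERE update_id = ? LIMIT 1".format(
                t=self.marker_table
            ),
            (self.update_id,),
        ).fetchone()
        return row is not None
```

A task that loads data into a database would call `touch()` as its last step, so that `exists()` only becomes true after the load has finished.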

THE GRAND SCHEME
[Diagram: Run Task → Check Deps → Run Deps → Use Deps Output → Invoke Run → Write Output]
[Diagram labels: self.requires(), target.exists(), Check Scheduler, self.input(), self.output()]
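
The flow on this slide can be approximated in a few lines of plain Python. This is a toy model, not Luigi's implementation: the `Task` base class, `build` and the two example tasks are hypothetical, targets are reduced to file paths, and the scheduler check is left out.

```python
import os
import tempfile


class Task:
    """Toy stand-in for a workflow task (not Luigi's real base class)."""

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

    def input(self):
        # Inputs are simply the outputs of the required tasks.
        return [dep.output() for dep in self.requires()]

    def complete(self):
        # A task counts as done when its output already exists.
        return os.path.exists(self.output())


def build(task, log):
    """Run `task` after its dependencies, skipping anything already complete."""
    if task.complete():
        return
    for dep in task.requires():
        build(dep, log)
    task.run()
    log.append(type(task).__name__)


class WriteNumbers(Task):
    def __init__(self, workdir):
        self.workdir = workdir

    def output(self):
        return os.path.join(self.workdir, "numbers.txt")

    def run(self):
        with open(self.output(), "w") as f:
            f.write("1\n2\n3\n")


class SumNumbers(Task):
    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return [WriteNumbers(self.workdir)]

    def output(self):
        return os.path.join(self.workdir, "sum.txt")

    def run(self):
        (numbers_path,) = self.input()
        with open(numbers_path) as f:
            total = sum(int(line) for line in f)
        with open(self.output(), "w") as f:
            f.write(str(total))
```

Running `build(SumNumbers(tempfile.mkdtemp()), [])` first runs `WriteNumbers`, then `SumNumbers`; a second call on the same directory does nothing because both outputs already exist.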

OPEN SOURCE

WHAT DID I DO?
• Add Redshift support
• Add MySQL support
• Various small features (improved notifications, dep.py, historydb etc)
• Various bug reports
• And fixes!

LUIGI @ TOTANGO
• Daily computation
• Hourly computation
• Ad-hoc data loading (for data analysis activities, to Redshift)

TOTANGO’S SETUP
[Diagram: Luigi Workers and the Luigi Scheduler; arrows labeled invoke, coordinate, report, read / write]

MOAR SCREENSHOTS

GAMEBOY!!!

GAMEBOY IS
• A Totango specific controller for Luigi
• The transition process (to Luigi)
• Provide high level overview
• Manual re-run of tasks
• Monitor progress, performance, run times, queues, worker load etc...
• Implemented using Flask and AngularJS

IS GAMEBOY OPEN SOURCE?
• Well, no. At least not right away
• Right now it’s very Totango-specific.
• Integrations to Librato-metrics
• Queries on Totango Databases
• Uses Jenkins for controlling executions
• Displays Totango’s Account metadata (Totango’s business logic)
• Maybe some other day...

WHAT ELSE IS OUT THERE?
• Oozie
• Azkaban
• AWS Data Pipeline
• Chronos
• spring-batch
• Dataswarm (Facebook)
• River (Outbrain internal)
• What’s your favorite WF engine? (did you build one?)

MY OTHER PROJECTS
• https://github.com/hector-client/hector
• https://github.com/rantav/flask-restful-swagger
• https://github.com/sebastien/monitoring
• https://github.com/rantav/meteor-migrations
• https://github.com/rantav/node-github-list-packages
• https://github.com/rantav/devdev

REFS
• https://github.com/spotify/luigi
• Facebook’s Dataswarm https://www.youtube.com/watch?v=M0VCbhfQ3HQ
• Outbrain’s River https://www.youtube.com/watch?v=EzsckTggDiM
