fuck needs a workflow engine?
• You do!!!
• If you run Hadoop (or other ETL jobs)
• If you have dependencies between them (who doesn’t?!)
• If they fail (s/if/when/)
• Luigi doesn’t replace Hadoop, Scalding, Pig, Hive, Redshift.
• It orchestrates them
Tuesday, July 15, 14
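To make the dependency and failure points concrete, here is a minimal sketch of what a workflow engine does at its core: run a task only after its dependencies, skip work that is already complete, and stop downstream tasks when something upstream fails. The `Task` class and `build` function are hypothetical stand-ins written for this illustration; they are not Luigi's classes or its scheduler.

```python
class Task:
    """Hypothetical stand-in for a workflow task (not Luigi's Task class)."""

    def __init__(self, name, requires=(), fail=False):
        self.name = name
        self.requires = list(requires)
        self.fail = fail      # simulate a failing job
        self.done = False

    def complete(self):
        return self.done

    def run(self):
        if self.fail:
            raise RuntimeError("%s failed" % self.name)
        self.done = True


def build(task, log):
    """Run task after its dependencies, skipping work that is already complete."""
    if task.complete():
        return True
    for dep in task.requires:
        if not build(dep, log):
            # an upstream task failed, so this one cannot run
            log.append("skip " + task.name)
            return False
    try:
        task.run()
    except RuntimeError:
        log.append("fail " + task.name)
        return False
    log.append("run " + task.name)
    return True


# extract -> transform -> load, with transform failing on the first attempt
extract = Task("extract")
transform = Task("transform", [extract], fail=True)
load = Task("load", [transform])

log = []
build(load, log)        # extract runs, transform fails, load is skipped
transform.fail = False  # "fix" the job
build(load, log)        # only transform and load run; extract is not repeated
```

The second `build` call is the interesting part: because `extract` already reports itself complete, a re-run after a failure only picks up the work that is still missing.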
And - For data
• Integrates well with data targets
• Hadoop, Spark, Databases
• Atomic file/db operations
• Visualization
• CLI - really nice developer interface!
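The "atomic file operations" bullet matters because a workflow engine decides whether a task is done by checking whether its output exists, so a half-written output file must never be visible. A common way to get this on a local filesystem is to write to a temporary file and rename it into place. The sketch below illustrates that general technique; `atomic_write` is a hypothetical helper, not Luigi's API, and it is not meant to describe how Luigi's targets are implemented internally.

```python
import os
import tempfile


def atomic_write(path, data):
    """Write data to path so readers never observe a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the destination directory so the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(data)
        # os.replace swaps the file into place in a single step
        os.replace(tmp_path, path)
    except BaseException:
        # clean up the temp file if anything went wrong before the rename
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the process dies mid-write, only the temporary file is left behind and the real output path is still absent, so the task is correctly seen as not complete.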
        return [InputText(date) for date in self.date_interval.dates()]

    def output(self):
        return luigi.LocalTarget('/var/tmp/text-count/%s' % self.date_interval)

    def run(self):
        # count word occurrences across all input files
        counts = {}
        for target in self.input():
            for line in target.open('r'):
                for word in line.strip().split():
                    counts[word] = counts.get(word, 0) + 1

        # output data
        f = self.output().open('w')
        for word, n in counts.items():
            f.write("%s\t%d\n" % (word, n))
        f.close()
        return [InputText(date) for date in self.date_interval.dates()]

    def output(self):
        return luigi.hdfs.HdfsTarget('/tmp/text-count/%s' % self.date_interval)

    def mapper(self, line):
        for word in line.strip().split():
            yield word, 1

    def reducer(self, key, values):
        yield key, sum(values)
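To see how the `mapper` and `reducer` above fit together, here is a small local simulation of the map, group-by-key, and reduce steps in plain Python. The two functions are written as free functions with the same bodies as the methods on the slide; `run_local` is a hypothetical helper for this illustration and is not how Luigi submits the job to Hadoop.

```python
from collections import defaultdict


def mapper(line):
    # emit (word, 1) for every word on the line
    for word in line.strip().split():
        yield word, 1


def reducer(key, values):
    # sum all the 1s emitted for this word
    yield key, sum(values)


def run_local(lines):
    """Simulate map -> shuffle (group by key) -> reduce in memory."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)

    results = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results


counts = run_local(["a b a", "b c"])
print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```

On a real cluster the grouping step is done by Hadoop between the map and reduce phases; the task only has to supply the two functions.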
The transition process (to Luigi)
• Provide a high-level overview
• Manual re-run of tasks
• Monitor progress, performance, run times, queues, worker load, etc.
• Implemented using Flask and AngularJS
right away
• Right now it’s very Totango-specific.
• Integration with Librato Metrics
• Queries on Totango databases
• Uses Jenkins for controlling executions
• Displays Totango’s account metadata (Totango’s business logic)
• Maybe some other day...