…a process in database usage, and especially in data warehousing, that involves:
• Extracting data from outside sources
• Transforming it to fit operational needs, which can include quality levels
• Loading it into the end target
@ Wikipedia
ETL
2. Update data: which data is different
3. Data consistency is horrible
4. Large datasets
5. Quick transforms
6. Every source has a different format
ETL software problems:
1. Hard to use & no user-friendly GUI
2. Expensive software
3. Console tool is not available
Most common problems
• Information Server (DataStage)
• SAS Data Integration Studio
• PowerCenter Informatica
• Elixir Repertoire
• Data Migrator
• SQL Server Integration Services
• Talend Studio for Data Integration
• DataFlow Manager
• Pentaho Kettle
• …
ETL Tools
…a known (specific) set of data, so that test results are repeatable. Some frameworks use fixtures to load default (or new) data into a relational database. Not suitable for periodically updated data migration!
Fixtures
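The fixture idea above can be sketched in a few lines. This is a generic illustration, not tied to any particular framework: a fixed data set is (re)loaded into a relational database before each test, so results are repeatable. The table and rows are invented for the example.

```python
import sqlite3

# Hypothetical fixture data: a known, fixed set of rows.
FIXTURE_ROWS = [
    (1, "alice", "alice@example.com"),
    (2, "bob", "bob@example.com"),
]

def load_fixture(conn):
    """Reset the table to a known state so every test run sees the same data."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
    )
    conn.execute("DELETE FROM users")  # wipe leftovers from earlier runs
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", FIXTURE_ROWS)
    conn.commit()

conn = sqlite3.connect(":memory:")
load_fixture(conn)
print(conn.execute("SELECT COUNT(*) FROM users").fetchone()[0])  # 2
```

Because the fixture always deletes and reloads, it cannot cope with data that is updated periodically at the source — which is exactly the limitation the slide points out.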
- Migration file management
- Check differences between migrations
- Configuration in YAML format
- Programming skills are not required
- Transformations use TARR, with Conditions and Statements
- Quick transforms and manipulations
- Easy to extend (for special uses)
- 9 source types & 8 target types
Features
2. Update data: which data is different
3. Data consistency is horrible
4. Large datasets
5. Quick transforms
6. Every source has a different format
Unexpected challenges:
7. Filtering data via different source(s)
8. Type conversion
9. Readable field map for every source type
Solutions
…which data is different: every row gets a hash generated from the field values' final stage, plus a unique id based on the marked* fields in the configuration file. The row's primary key and the generated hash are saved into a pickle migration file. If the unique id exists with the same hash, we skip the row; if the hash is different, we update the row.
* Not required
Solutions
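The skip/update decision described above can be sketched as follows. This is an illustration of the idea, not mETL's actual code: the unique id comes from the marked key fields, the hash covers the row's final field values, and the state dictionary is what would be persisted between runs with pickle.

```python
import hashlib

def unique_id(row, key_fields):
    # Build the row's identity from the marked fields only.
    return "|".join(str(row[f]) for f in key_fields)

def row_hash(row):
    # Hash over all final field values, in a stable order.
    payload = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def decide(row, key_fields, seen):
    """Return 'insert', 'skip' or 'update' and record the row's hash in `seen`."""
    uid, h = unique_id(row, key_fields), row_hash(row)
    if uid not in seen:
        action = "insert"
    elif seen[uid] == h:
        action = "skip"      # same unique id, same hash: unchanged row
    else:
        action = "update"    # same unique id, different hash: changed row
    seen[uid] = h
    return action

seen = {}  # this dict would be written to the migration file with pickle.dump
print(decide({"id": 1, "name": "a"}, ["id"], seen))  # insert
print(decide({"id": 1, "name": "a"}, ["id"], seen))  # skip
print(decide({"id": 1, "name": "b"}, ["id"], seen))  # update
```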
Quick transforms: mETL has 8 transforms to produce consistent data sets. These transforms are supported by statements and conditions. It uses the TARR* package, a non-parallel data flow language created by CEU Economics, with multi-thread support.
* It will be open source as well.
Solutions
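The "transforms supported by conditions" pattern can be pictured with a minimal sketch. This is not TARR's or mETL's real API — just an illustration of the idea that each transform runs only on rows satisfying its attached condition:

```python
# Minimal sketch: a pipeline of (condition, transform) pairs applied per row.
def apply_transforms(rows, transforms):
    for row in rows:
        for condition, transform in transforms:
            if condition(row):
                row = transform(row)
        yield row

# Illustrative pipeline: normalize names, clamp negative prices.
pipeline = [
    (lambda r: r.get("name") is not None,
     lambda r: {**r, "name": r["name"].strip().title()}),
    (lambda r: r.get("price", 0) < 0,
     lambda r: {**r, "price": 0}),
]

rows = [{"name": "  alice ", "price": -3}, {"name": "bob", "price": 5}]
print(list(apply_transforms(rows, pipeline)))
```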
…file with all fields and information. You don't have to write program code: just change the field mapping settings and run the console script again with a different source. These YAML files can be saved and edited whenever the source format changes. Easy to maintain.
Solutions
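A field-mapping file of this kind might look roughly like the sketch below. The key names here are purely illustrative — they are not mETL's actual YAML schema — the point is only that source, field map, and target live together in one editable file:

```yaml
# Hypothetical field-mapping sketch -- illustrative keys, not mETL's real schema
source:
  type: CSV
  resource: products.csv
fields:
  - name: title
    map: 0          # column index in the CSV source
    type: String
  - name: price
    map: 2
    type: Float
target:
  type: Database
  table: products
```

When the source format changes, only the `map` entries need editing; the console script is then rerun unchanged.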
…final type. But: all fields have to be the same type, so we have to manage type conversion at runtime with transforms:
• ConvertType: change the Field's type and add a default value if necessary.
• Set: set any value on the Field.
After the transform process, the type change is forced automatically.
Solutions
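The two transforms above can be sketched like this. The function names mirror the slide (ConvertType, Set), but the code is only an illustration of the behaviour, not mETL's implementation:

```python
# ConvertType: cast the field value to the target type, falling back to a
# default when the cast fails (e.g. "n/a" -> int).
def convert_type(value, to_type, default=None):
    try:
        return to_type(value)
    except (TypeError, ValueError):
        return default

# Set: overwrite the field with any value; the final type is forced afterwards.
def set_value(_old_value, new_value):
    return new_value

print(convert_type("42", int))       # 42
print(convert_type("n/a", int, 0))   # cast fails, default used: 0
print(set_value("anything", 3.14))   # 3.14
```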
…every source input into dicts and lists, and give a route to the data. If an item on the route does not exist, we set the field's value to empty.
E.g. in XML: pictures/picture/0/picture_thumb/url
E.g. in CSV, XLS, …: 5
E.g. in a database: created
Easy to learn, because it does not use different keywords (e.g. index, route, cell, …) for every source type.
Solutions
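The route mechanism can be sketched as a small resolver over nested dicts and lists — one slash-separated syntax for every source type, with a missing item yielding an empty value. Illustrative code, not mETL's implementation:

```python
def resolve_route(data, route, empty=""):
    """Walk a slash-separated route through nested dicts/lists.

    Numeric parts index lists; other parts are dict keys. Any missing
    step returns the empty value instead of raising.
    """
    current = data
    for part in route.split("/"):
        try:
            key = int(part) if isinstance(current, list) else part
            current = current[key]
        except (KeyError, IndexError, TypeError, ValueError):
            return empty
    return current

# Example document shaped like the XML route on the slide.
doc = {"pictures": {"picture": [{"picture_thumb": {"url": "http://example.com/t.jpg"}}]}}
print(resolve_route(doc, "pictures/picture/0/picture_thumb/url"))
print(resolve_route(doc, "pictures/picture/1/picture_thumb/url"))  # missing -> ""
```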