warehouse specialist, IT project manager, Python expert. Built databases for Nissan, led IT development for Union Insurance, Central European University, Procter & Gamble and Profession.hu. Organizer of the Budapest Database Meetup and creator of the mETL business intelligence tool. ! Bence has more than 9 years of experience in development and project management. ! email: [email protected] twitter: @bfaludi Positions * Senior Database Manager @ Mito Europe * Head Of Development @ Ozmo * Organizer @ Budapest Database Meetup Responsibilities * Data warehouse design * Mathematical predictions * Data Cleansing & Analytics * Data Consulting * ETL & Python/Go Development * IT Project Management
correcting or removing corrupted or inaccurate records from a record set. ‣ Data cleansing also involves activities like, harmonization and standardization of data.
1 Define structure and transformations & operations on the data 2 Execution run the operation workflow 3 Post-processing verify the result manually* 4 ETL load final dataset into database 5
address information into parts ‣ address standardization ‣ roman number conversion ‣ missing accents ‣ missing special characters ‣ remove test or scam data ‣ first name, last name classification and sorting ‣ detecting company’s name ‣ formatting phone numbers to E164 ‣ replacement of phone number’s national code
Number Email ! Address Kovács Péter Koin Finances Kft. Nador 23 Buda 1051 +36 1 132 1111 fp kukac gmail pont hu Dr. Kovacs Akos Sziv út 49, 1em/11 Budapest VI. kerület 1788333 [email protected] Farkas Ambrus János Player Media 8. kerulet 06- (31)-987/777 fa.janos [at] gmai [dot] com Éva Ambrusné Hegedűs Hegedűs Bt. 21 Ady Endre Str Érd 1221 6301234567, 06301238888 Mito Europe Kft. Bence Faludi Nádor u. 23. V. emelet Budapest 1053 [email protected]
Code Phone! Number Email ! Address Kovács Péter Koin Finances Kft. Nador 23 Buda 1051 +36 1 132 1111 fp kukac gmail pont hu Dr. Kovacs Akos Sziv út 49, 1em/11 Budapest VI. kerület 1788333 [email protected] Farkas Ambrus János Player Media 8. kerulet 06- (31)-987/777 fa.janos [at] gmai [dot] com Éva Ambrusné Hegedűs Hegedűs Bt. 21 Ady Endre Str Érd 1221 6301234567, 06301238888 Mito Europe Kft. Bence Faludi Nádor u. 23. V. emelet Budapest 1053 [email protected] Full Name Company Name Phone Email Address City Postal Code
János Éva Ambrusné Hegedűs Mito Europe Kft. FIRSTNAME LASTNAME FIRSTNAME LASTNAME TITLE FIRSTNAME LASTNAME FIRSTNAME LASTNAME FIRSTNAME LASTNAME FIRSTNAME MARRIED LASTNAME LASTNAME COMPANY POSTFIX UNKNOWN UNKNOWN 80 percent of the records contain name parts Common format (up to 75%): [Lastname] [Firstname] Send to manual check Mark it, not valid name value Full Name FULL NAME COMPANY NAME FULL NAME FULL NAME FULL NAME
Bence Faludi Company’s name Mark it, not valid company’s name COMPANY POSTFIX UNKNOWN UNKNOWN FIRSTNAME LASTNAME UNKNOWN UNKNOWN LASTNAME COMPANY POSTFIX COMPANY NAME COMPANY NAME COMPANY NAME FULL NAME We won’t send it to manual check.
1051 Sziv út 49, 1em/11 Budapest VI. kerület 8. kerulet 21 Ady Str Érd Nádor u. 23. V. emelet Budapest 1053 CITY PART CITY N LASTNAME STR SUFFIX CITY CITY CITY W DISTRICT CITY W DISTRICT ZIP STR SUFFIX N FIRSTNAME RN FLOOR SUFFIX STR SUFFIX N UNKNOWN FLOOR SUFFIX N N FIRSTNAME N CITY CITY CITY ZIP DISTRICT CITY DISTRICT CITY CITY ZIP STREET STREET STREET STREET ZIP
1 132 1111 fp kukac gmail pont hu 1788333 [email protected] 06- (31)-987/777 fa.janos [at] gmai [dot] com 1221 6301234567, 06301238888 [email protected] PHONE N PHONE PHONE UNKNOWN AT UNKNOWN DOT DOMAIN SUFFIX EMAIL EMAIL UNKNOWN AT UNKNOWN DOT DOMAIN SUFFIX PHONE ZIP N ZIP MULTIPLE PHONE EMAIL EMAIL EMAIL EMAIL PHONE PHONE PHONE Mark it, not valid email address & phone number N PHONE
Number Email ! Address Kovács Péter Koin Finances Kft. Nador 23 Buda 1051 +36 1 132 1111 fp kukac gmail pont hu Dr. Kovacs Akos Sziv út 49, 1em/11 Budapest VI. kerület 1788333 [email protected] Farkas Ambrus János Player Media 8. kerulet 06- (31)-987/777 fa.janos [at] gmai [dot] com Éva Ambrusné Hegedűs Hegedűs Bt. 21 Ady Str Érd 1221 6301234567, 06301238888 Mito Europe Kft. Bence Faludi Nádor u. 23. V. emelet Budapest 1053 [email protected] FULL NAME COMPANY NAME FULL NAME FULL NAME FULL NAME COMPANY NAME COMPANY NAME COMPANY NAME FULL NAME STREET STREET STREET CITY CITY CITY STREET ZIP DISTRICT CITY DISTRICT CITY CITY ZIP ZIP PHONE PHONE PHONE EMAIL EMAIL EMAIL EMAIL MULTIPLE PHONE
last name in Hungary ‣ Available titles and name prefixes Dr, I, Özv, … ‣ Postal code ranges 1000-9999 ‣ Cities, city parts, districts Budapest, Debrecen, Margitsziget, Békásmegyer, … ‣ Regular expressions Email, Phone number, … ! ‣ List of possible terms and spelling variations: ‣ District suffix kerület, ker., … ‣ Street type út, utca, u., … ‣ Floor suffix emelet, em., … ‣ Email components pont, kukac, at, dot, … ‣ Company abbreviations Bt., Kft., Zrt., …
1 Define structure and transformations & operations on the data 2 Execution run the operation workflow 3 Post-processing verify the result manually* 4 ETL load final dataset into database 5
type Full name Full name data type Company name Company name data type City City data type City w empty/constant value Address Full Str. Address data type Address Full Str. Address data type
name data type Last name Last name data type Full name Full Name data type Title Title data type Middle name First name data type Need to define the expected ordering, if it is different then the default one.
First name data type Last name Last name data type Address Full Str. Address data type Full Street Full Street data type Number Street Number data type Company name Company name data type City City data type City City data type Postal Code Postal Code data type Postal Code Postal Code data type District District data type Email Email data type Email Email data type Phone Number Phone Number data type Phone 1. Phone Number data type Phone 2. Phone Number data type Output structure Country Country data type
Street Full Street data type City City data type Postal Code Postal Code data type Phone 1. Phone Number data type Email Email data type Last name Last name data type Number Street Number data type District District data type Phone 2. Phone Number data type Output structure Country Country data type 1 2 4 6 5 3 7 8 9 10 11
the data 2 Execution run the operation workflow 3 Post-processing verify the result manually* 4 ETL load final dataset into database 5 Preparations detect anomalies, augment, rearrange, standardize automatically* 1
types Önkormányzat, Hivatal, … ‣ List of companies ‣ Company name Mito Europe ‣ Company form Kft. ‣ Company form *long Korlátolt felelősségű társaság ‣ Full Company name Mito Europe Kft. ‣ Full Company name *long Mito Europe Korlátolt felelősségű társaság ‣ Regular expressions without accent ‣ We cannot fix the spellings ‣ Leibenstein distance from the elements of the company list. ‣ We can fix the spellings
names ‣ List of titles ‣ Before Dr. Kovácsné-Nagy Benjamin ‣ Split the full name into parts. ‣ Convert into lowercase. 1. ‣ kovácsné-nagy ‣ dr ‣ benjamin
names ‣ List of titles ‣ Generate lowercase NFDK format of the name 2. ‣ kovácsné-nagy ‣ dr ‣ benjamin ‣ kovácsné-nagy kovacsne-nagy ‣ benjamin benjamin ‣ dr dr
names ‣ List of titles ‣ Validate the original name and the NFDK format name. ‣ Use the name lists for the validation. ‣ Classify the identified parts. 3. ‣ kovácsné-nagy kovacsne-nagy ‣ benjamin benjamin ‣ dr dr Not exists kovácsné-nagy not exists ! kovacsne-nagy not exists dr benjamin benjámin? TITLE FIRST NAME
names ‣ List of titles ‣ Try to split the non- existing names. ‣ Try to remove the affixes. ‣ Classify the identified parts. 4. ‣ kovácsné-nagy kovacsne-nagy ‣ kovácsné kovacsne ‣ nagy nagy ‣ kovács / né kovacs / ne LAST NAME MARRIED LAST NAME
names ‣ List of titles ‣ Leibenstein distance from the elements of the name list. ! ! ‣ Extend the name to the complex form. ‣ Fix the spelling based on population data. 5. ‣ benjamin benjamin (~5001) benjámin (~6610) benjamin ‣ benjámin Use the common one if other parts not contain accent. Use if other parts contain accent. FIRST NAME FIRST NAME Mark it, manual check required of the name
names ‣ List of titles ‣ Classify the gender 6. ‣ kovács /né last name ‣ benjamin first name FEMALE MALE Mark it, manual check required of the name If multiple genders were determined for the different part of the name
names ‣ List of titles ‣ Classify the gender 6. ‣ anikó first name ‣ benjamin first name FEMALE MALE Set the gender to Unknown If multiple genders were determined for the same part of the name
names ‣ List of titles ‣ Standardize to the given order. ! ! ‣ Convert to titlecase. 7. ‣ kovács /né last name ‣ benjamin first name ‣ dr title ‣ nagy last name ‣ Kovács /né last name ‣ Benjamin first name ‣ Dr title ‣ Nagy last name Data Type Merge - Expecting ordering: [Title]* [Firstname] [Lastname]+ Mark ambiguous names, manual correction required
from district ‣ from city part ‣ from postal code ‣ Fill out the postal code if missing 1. ‣ Validate the street by regular expressions ‣ street name ‣ street name parts ‣ Use the closest version from the matching records ‣ postal code ‣ district ‣ neighbourhood ‣ city 2. ‣ Map of cities & postal codes. ‣ List of streets, street types
of streets, street types ‣ Fix the Postal Code if the difference is under a given score. 2 letters in Hungary ‣ Postal Code exists but doesn’t. Manual check required with the suggestion of the valid city name if we can find the given street information. ‣ City is exists but Postal Code is not. Check the street information. If street is not filled out then manual check is required but the default suggestion is to clear the field’s value. ‣ City and Postal Code not existing. Clear the values. ‣ Postal Code exists but City is not filled out. Use the suggestion of the valid city name. ‣ City exists but Postal Code is not filled out. Check the street information and fill out with the matching record. CITY != POSTAL CODE
of streets, street types ‣ Fix postal code and other information ‣ Set match level street, city part, district, etc. 3. ‣ Standardise the address information. ‣ Determine house number: regular expressions per country 4.
1 Define structure and transformations & operations on the data 2 Execution run the operation workflow 3 Post-processing verify the result manually* 4 ETL load final dataset into database 5
the correct classification. ‣ 2nd building, 4th floor ‣ 2-4 building ‣ Check roman numbers in addresses. ‣ Building: I ‣ House Number: 1 ‣ All abbreviations should be removed.
Validate collected titles and prefixes. ‣ Check all last names that were modified during the execution. ‣ Validate splittings if name contains hyphen. ‣ Not common first name validation.
Not existing address or mayor fixes on the given input. ‣ Not identified city name when postal code related to multiple cities. ‣ Names with many parts. ‣ Scam or test record possibility. ‣ etc.
1 Define structure and transformations & operations on the data 2 Execution run the operation workflow 3 Post-processing verify the result manually* 4 ETL load final dataset into database 5
process in database usage and especially in data warehousing that involves: ‣ Extracting data from outside sources ‣ Transforming it to fit operation needs, which can include quality levels ‣ Loading it into the end target ‣ Tools: ‣ Pentaho Kettle, IBM Information Server, Oracle Warehouse Builder (OWB), mETL
application. ‣ Necessary and usable data is not available publicly. ‣ Data cleansing professionals with experience are indispensable. ‣ Every country has there own exceptions. ‣ Business rules are different every time. ‣ Use existing services.