

Dwolla, Tableau, & AWS: Building Scalable Event-Driven Data Pipelines for Payments

Events are the atomic building blocks of data. This is definitely the case at Dwolla, a payments company that allows anyone (or anything) connected to the internet to move money quickly, safely, and at the lowest cost possible. This session is a deep dive into how Dwolla manages events and data of all shapes and sizes using Amazon Web Services (EC2, EMR, RDS, Redshift, S3) and Tableau. It also introduces Arbalest (https://github.com/Dwolla/arbalest), Dwolla's open-source data pipeline orchestration tool for processing all of this data at scale. [This presentation was given at the MGM Grand in Las Vegas for Tableau 15, Oct. 21, 2015.]

Fredrick Galoso

October 21, 2015



Transcript

  1. • Launched nationally in 2010
     • 70+ employees across 3 offices (DSM, SF, NYC)
     • Direct to Consumer (B2C), Direct to Business (B2B), through financial institutions, through other fintech companies and platforms
     • Partnerships with BBVA Compass, Comenity Capital Bank, US Department of Treasury
  2. Scaling infrastructure to meet demand: AWS VPC
     • Flexibility
     • Reduce response time
     • Cost savings
     • Predictability
     • Leverage best practices
     • Reuse puzzle pieces
     • Complexity
  3. Key Use Cases
     • Product management
     • Marketing and sales
     • Fraud and compliance
     • Customer insights and demographics
  4. Pain points
     • Which report has accounts in a HIFCA zip code?
     • Why is this report taking so long to load?
     • How do I manipulate this raw data to answer my specific question?
  5. Data Growing Pains
     Blunt tools
     • Data discovery difficult
     • Poor performance
     • Unable to take advantage of all data
     No ubiquity or consistency in facts
     • Error-prone, labor-intensive, manual data massaging
  6. Why Tableau? Reduce time to cognition
     • Business intelligence, visualization > data sheets
     • Dashboard discoverability
     • Reports load in seconds instead of minutes
     • Support for all of our data sources
  7. Why Tableau? Reduce time to answers
     • Eliminate BI “chewing gum and duct tape”
     • Create dashboards in hours instead of days
     • Free up engineering resources
  8. Tableau at Dwolla
     Rich, actionable data visualizations; immediate success with integration in less than 1 year
     • ~30 Server users, 5 Desktop users
     • Hundreds of workbooks and dashboards
     • Discoverability and measurement of many heterogeneous data sources
  9. Do we have the right data? Can we adapt? Can we answer, “What if?”
  10. Building Flexibility
      • Need to save enough information to be able to answer future inquiries
      • Data must be granular, specific, and flexible enough to adapt
  11. Typical RDBMS Record
      A user has an email address which can be in one of two states: created or verified

      email_address      | status
      [email protected]  | created
  12. Typical RDBMS Update
      Jane verifies her email address:

      UPDATE Users SET status = 'verified' WHERE email_address = '[email protected]';

      email_address      | status
      [email protected]  | verified
  13. What happened? Can we answer the following?
      • When did the user become verified?
      • What was the verification velocity?
  14. What happened? Even if we changed the schema:
      • Context?
      • Explosion of individual record size
      • Tight coupling between storage of value and structure
  15. Atomic values
      Transaction atomicity: operations that are all or nothing, indivisible or irreducible
      Semantic atomicity: values that have indivisible meaning; they cannot be broken down further and represent a time-based fact
  16. Semantically atomic
      We can derive values from atoms, but semantically atomic values cannot themselves be derived.
  17. Semantically Atomic State: Events
      • Unique identity
      • Specific value
      • Structure
      • Time-based fact
      • Immutable, does not change
      • Separate the what of the data from how it is stored
  18. Context-Specific Event Values

      user.created
      {
        "email_address": "[email protected]",
        "timestamp": "2015-08-18T06:36:40Z"
      }

      user.verified
      {
        "email_address": "[email protected]",
        "timestamp": "2015-08-18T07:38:40Z",
        "workflow": "mobile"
      }
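      With events like these, the questions from slide 13 become straightforward queries. A minimal sketch, assuming the user.created and user.verified events have been imported into Redshift tables named user_created and user_verified (hypothetical names, not Dwolla's actual schema):

        SELECT v.email_address,
               v."timestamp" AS verified_at,                                        -- when the user became verified
               DATEDIFF(minute, c."timestamp", v."timestamp") AS minutes_to_verify  -- verification velocity
        FROM user_created c
        JOIN user_verified v ON v.email_address = c.email_address;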
  19. Maintaining Semantic Atomicity
      • Additive schema changes
      • Use a new event type if new properties, or changes to existing events, would change the fundamental meaning or identity
  20. Embracing Event Streams
      • Lowest common denominator for downstream systems
      • Apps can transform and store event streams specific to their use cases and bounded context
  21. Embracing Event Streams
      • Eliminate breaking schema changes and side effects
      • Derived values can be destroyed and recreated from scratch
      • Extension of the log, big data’s streaming abstraction, but explicitly semi-structured
  22. Data Structure Spectrum: Scale
      Unstructured: logs (100s of millions+)
      Semi-structured: events (100s of millions+)
      Structured: application databases, data warehouse (100s of millions)
  23. Semi-structured Data Infrastructure
      Unstructured: logs (100s of millions+)
      Semi-structured: Amazon S3 (100s of millions+)
      Structured: SQL Server, MySQL, Amazon Redshift (100s of millions)
  24. Typical Big Data Batch Analysis
      1. Data is now in an easier-to-consume, semi-structured form, but I need to do something with it
      2. Write job
      3. Run job
      4. Wait
      5. Get results
  25. Typical Big Data Batch Analysis
      6. Grief
      • Denial
      • Anger: how did I miss this?!
      • Bargaining: maybe I can salvage
      • Acceptance: map. reduce. all. the. things. again
  26. Interactive Analysis at Scale
      • SQL, already a great abstraction
      • Apache Pig
      • Apache Hive
      • Cloudera Impala
      • Shark on Apache Spark
      • Amazon Redshift
  27. Structured Data Infrastructure
      Unstructured: logs (100s of millions+)
      Semi-structured: Amazon S3 (100s of millions+)
      Structured: SQL Server, MySQL, Amazon Redshift (100s of millions)
  28. Why Amazon Redshift?
      • Cost effective and faster than alternatives (Airbnb, Pinterest)
      • Column store (think Apache Cassandra)
      • ParAccel C++ backend
      • dist (sharding and parallelism hint) and sort (order hint) keys, as sketched below
      • Speed up the analysis feedback loop (Bit.ly)
      • Flexibility in data consumption/manipulation; talks PostgreSQL (Kickstarter)
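      To illustrate those keys, a minimal sketch of what an event table could look like in Redshift; the table and column names are assumptions for the example, not the actual Dwolla schema:

        CREATE TABLE user_verified (
          email_address VARCHAR(255),
          "timestamp"   TIMESTAMP,
          workflow      VARCHAR(32)
        )
        DISTKEY (email_address)  -- shard rows across nodes by email_address so joins on it stay node-local
        SORTKEY ("timestamp");   -- store rows in time order so range scans over time are fast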
  29. Arbalest
      • Big data ingestion from S3 to Redshift (the underlying COPY is sketched below)
      • Schema creation
      • Highly available data import strategies
      • Running data import jobs
      • Generating and uploading prerequisite artifacts for import
      • Open source: github.com/dwolla/arbalest
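      Arbalest orchestrates the import; inside Redshift the heavy lifting is the COPY command. A minimal hand-rolled sketch of such an import, with a hypothetical bucket, JSONPaths file, and placeholder credentials (this is not Arbalest's API):

        COPY user_verified
        FROM 's3://example-events-bucket/user.verified/2015/08/18/'
        CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>'
        JSON 's3://example-events-bucket/jsonpaths/user_verified.jsonpaths'  -- maps JSON properties to columns
        TIMEFORMAT 'auto'
        GZIP;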
  30. Configuration as Code
      • Encapsulate best practices into a lightweight Python library
      • Handle failures
      • Strategies for time series or sparse ingestion
  31. Configuration as Code
      • Validation of event schemas
      • Transformation is plain-ole-SQL
      • Idempotent operations (example below)
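      One common way to keep a transformation idempotent is to rebuild the derived table from the events inside a transaction, so re-running the job yields the same result; a sketch with hypothetical table names:

        BEGIN;
        DELETE FROM daily_verified_users;   -- throw away the derived values
        INSERT INTO daily_verified_users    -- and recreate them from the immutable events
        SELECT TRUNC("timestamp") AS verified_date,
               COUNT(*)           AS verified_users
        FROM user_verified
        GROUP BY TRUNC("timestamp");
        COMMIT;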
  32. Data Archival and Automation
      • Minimize TCO based on data recency and frequency needs
      • Hot: Amazon Redshift
      • Warm: Amazon S3
      • Cold: Amazon Glacier
      • Simple archival of event-based data warehouse:

      DELETE FROM google_analytics_user_created WHERE timestamp < '2015-01-01';
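      Because events are immutable, rows can be unloaded to the warm tier before being deleted from the hot tier; a minimal sketch using Redshift's UNLOAD, with a hypothetical archive bucket and placeholder credentials:

        UNLOAD ('SELECT * FROM google_analytics_user_created WHERE "timestamp" < ''2015-01-01''')
        TO 's3://example-archive-bucket/google_analytics_user_created/'
        CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>'
        GZIP;
        -- Archived files can then transition to Amazon Glacier via an S3 lifecycle rule.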