Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Apache Drill - Low Latency ANSI SQL on Hadoop D...

Apache Drill - Low Latency ANSI SQL on Hadoop Data & NoSQL at the same time

Talk by Richard Shaw Software Architect @MapR at Data Science London meetup @ds_ldn

Data Science London

September 23, 2014
Tweet

More Decks by Data Science London

Other Decks in Technology

Transcript

  1. ® ® ©  2014  MapR  Technologies   Sept  18th,  2014

      Richard  Shaw  –  Solu>ons  Architect  
  2. ® Hadoop Distributions Open Source Open Source Distribu9on  A  

    Distribu9on  C   MANAGEMENT   Open Source MANAGEMENT   ARCHITECTURAL  INNOVATIONS  
  3. ®

  4. ®

  5. ® Silos make analysis very difficult •  How  do  I

     iden>fy  a   unique  {customer,   trade}  across  data   sets?     •  How  can  I  guarantee   the  lack  of  anomalous   behavior  if  I  can’t  see   all  data?  
  6. ® Here’s an idea Give Users The Power To Query

    Across Silos ..Irrespective of Data Types
  7. ® Rethink SQL for Big Data Preserve   • ANSI  SQL

      • Familiar  and  ubiquitous   • Performance   • Interac>ve  nature  crucial  for  BI/Analy>cs   • One  technology   • Painful  to  manage  different  technologies   • Enterprise  ready   • System-­‐of-­‐record,  HA,  DR,  Security,  Mul>-­‐ tenancy,  …   Invent   • Flexible  data-­‐model   • Allow  schemas  to  evolve  rapidly   • Support  semi-­‐structured  data  types   • Agility   • Self-­‐service  possible  when  developer  and  DBA  is   same   • Scalability   • In  all  dimensions:  data,  speed,  schemas,  processes,   management  
  8. ® Self-Describing Data Ubiquitous Centralised  schema   -­‐  Sta>c  

    -­‐  Managed  by  the  DBAs   -­‐  In  a  centralised  repository     Long,  me>culous  data  prepara>on  process  (ETL,   create/alter  schema,  etc.)   Self-­‐describing,  or  schema-­‐less,  data   -­‐  Dynamic/evolving   -­‐  Managed  by  the  applica>ons   -­‐  Embedded  in  the  data     Less  schema,  more  suitable  for  data  that  has   higher  volume,  variety  and  velocity   Apache  Drill  
  9. ® Drill   •  Apache  open  source  project    

    •  Scale-­‐out  execu>on  engine  for  low-­‐latency  SQL     queries     •  Unified  SQL-­‐based  API  for  zero  day  analy>cs     &  opera>onal  applica>ons     •  Flexible  data  sources  
  10. ® Drill & Dremel   •  Inspired  by  Google  Tech

        •  SQL  querying  of  Google  data  over  GFS  &  BigTable     •  In  use  produc>on  use  since  2006  -­‐  8  YEARS!     •  Tens  of  thousand  of  concurrent  users  over  PB  of  data     •  Dremel  paper  released  2010  
  11. ® Drill Zookeeper   DFS/HBase   DFS/HBase   DFS/HBase  

    Drillbit   Distributed  Cache   Drillbit   Distributed  Cache   Drillbit   Distributed  Cache   Query   1.  Query  comes  to  any  Drillbit  (JDBC,  ODBC,  CLI,  protobuf)   2.  Drillbit  generates  execu>on  plan  based  on  query  op>miza>on  &  locality   3.  Fragments  are  farmed  to  individual  nodes   4.  Result  is  returned  to  driving  node   c   c   c  
  12. ® A Drill Database •  What  is  a  database  with

     Drill/MapR?  There  isn’t  one   •  Just  a  directory,  with  a  bunch  of  related  files  or  other   sources   ~/work/bugs   symptom    version  date  bugid  dump-­‐name   app    crash  3.1.1  14/7/14  12345  cust1.tgz   app  slow    3.1.0  12/7/14  45678  cust2.tgz   Customers   BugList   name  rep  se  dump-­‐name   xxxx  dkim  junhyuk  cust1.tgz   yyyy  yoshi  aki  cust2.tgz  
  13. ® Data Source is in the Query !select timestamp, message!

    !from dfs1.logs.`AppServerLogs/2014/Jan/ p001.parquet` where errorLevel > 2     This  is  a  cluster  in  Apache  Drill   -­‐  DFS   -­‐  HBase   -­‐  Hive  meta-­‐store   A  work-­‐space   -­‐  Typically  a  sub-­‐ directory   -­‐  HIVE  database   A    table   -­‐  pathnames   -­‐  Hbase  table   -­‐  Hive  table  
  14. ® Can be an entire directory tree // On a

    file! select errorLevel, count(*)
 from dfs.logs.`/AppServerLogs/2014/Jan/ part0001.parquet` group by errorLevel;! ! // On the entire data collection: all years, all months! select errorLevel, count(*)
 from dfs.logs.`/AppServerLogs`
 group by errorLevel!
  15. ® Combine data sources on the fly •  JSON  

    •  CSV   •  ORC  (ie,  all  Hive  types)   •  Parquet   •  HBase  tables   •  …  can  combine  them   Select    USERS.name,    USERS.emails.work     from        dfs.logs.`/data/logs`    LOGS,      dfs.users.`/profiles.json`    USERS,   where        LOGS.uid  =  USERS.uid      and        errorLevel  >  5   order  by    count(*);    
  16. ® Queries are simple select      b.bugid,  b.symptom,  b.date

      from          dfs.bugs.’/Customers’  c,  dfs.bugs.’/BugList’  b   where      c.dump-­‐name  =  b.dump-­‐name   Let’s  say  I  want  to  cross-­‐reference  against  your  list:    select  bugid,  symptom    from  dfs.bugs.’/Buglist’  b,    dfs.yourbugs.’/YourBugFile’  b2    where    b.bugid  =  b2.xxx  
  17. ® What does it mean? •  No  ETL   • 

    Reach  out  directly  to  the  par>cular  table/file   •  As  long  as  the  permissions  are  fine,  you  can  do  it   •  No  need  to  have  the  meta-­‐data   – None  needed  
  18. ® a •  Schema  can  change  over  course  of  query

      •  Operators  are  able  to  reconfigure  themselves  on  schema   change  events   – Minimize  flexibility  overhead   – Support  more  advanced  execu>on  op>miza>on  based  on  actual  data   characteris>cs  
  19. ® Querying JSON {    name:  classic      

       fillings:  [                {  name:  sugar  cal:    400  }]}     {  name:  choco        fillings:  [              {  name:  sugar    cal:  400  }              {  name:  chocolate  cal:  300  }]}     {    name:  bostoncreme          fillings:    [              {  name:    sugar    cal:  400  }              {  name:    cream    cal:  1000  }              {  name:    jelly    cal:  600  }]}       donuts.json  
  20. ® Another example !select d.name, count( d.fillings),! !from (select convert_from(

    cf1.donut-json, json)as d ! ! from hbase.user.`donuts` );   •  convert_from(  xx,  json)    invokes  the  json  parser  inside  Drill   •  What  if  you  could  plug  in  any  parser   –  XML?   –  Another  NoSQL  Database  format   –  Any  other  file  format  
  21. ® No ETL •  Basically,  Drill  is  querying  the  raw

     data  directly   •  Joining  with  processed  data   •  NO  ETL   •  Folks,  this  is  very,  very  powerful   •  NO  ETL  
  22. ® Seamless integration with Apache Hive •  Low  latency  queries

     on  Hive  tables   •  Support  for  100s  of  Hive  file  formats     •  Ability  to  reuse  Hive  UDFs   •  Support  for  mul>ple  Hive  metastores  in  a  single  query  
  23. ® Apache Drill FLEXIBLE  SCHEMA   MANAGEMENT   FRICTIONLESS  ANALYTICS

      ON  NESTED  DATA   PLUG  AND  PLAY     WITH  EXISTING   Analyze  data,  self-­‐ described  or  central   metadata           Reuse  investments  in  SQL/ BI  tools   and  Apache  Hive   Analyze  semi  structured   &  nested  data   …  and  with  an  architecture  built  ground  up  for  Low  Latency  queries  at  Scale  
  24. ® Apache Drill Roadmap • Low-latency SQL • Schema-less execution • Files &

    HBase/M7 support • Hive integration • BI and SQL tool support via ODBC/JDBC Data exploration/ad-hoc queries 1.0 • HBase query speedup • Nested data functions • Advanced SQL functionality Advanced analytics and operational data 1.1 • Ultra low latency queries • Single row insert/update/ delete • Workload management Operational SQL 2.0
  25. ® Apache Drill Resources •  Drill  0.5     • 

    Ge{ng  started  with  Drill  is  easy   –   Download  Drill  Sandbox  from  mapr.com     •  Mailing  lists   –  drill-­‐[email protected]   –  drill-­‐[email protected]   •  Docs:    h}ps://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+Wiki     •  Fork  us  on  GitHub:      h}p://github.com/apache/incubator-­‐drill/   •  Create  a  JIRA:      h}ps://issues.apache.org/jira/browse/DRILL  
  26. ® Active Drill Community •  Large  community,  growing  rapidly  

    – 35-­‐40  contributors,  16  commi}ers   – Microso•,  Linked-­‐in,  Oracle,  Facebook,  Visa,  Lucidworks,   Concurrent,  many  universi>es   •  In  2014   – over  20  meet-­‐ups,  many  more  coming  soon   – 2  hackathons,  with  40+  par>cipants   •  Encourage  you  to  join,  learn,  contribute  and  have  fun  …  
  27. ® Drill at MapR •  World-­‐class  SQL  team,  ~20  people

      •  150+  years  combined  experience  building  commercial   databases   •  Oracle,  DB2,  ParAccel,  Teradata,  SQLServer,  Ver>ca   •  Team  works  on    Drill,  Hive,  Impala   •  Fixed  some  of  the  toughest  problems  in  Apache  Hive