Apache Drill - Low Latency ANSI SQL on Hadoop Data & NoSQL at the same time

© 2014 MapR Technologies, Sept 18th, 2014
Richard Shaw – Solutions Architect

Pre-Slideware Summary
Low Latency ANSI SQL on Hadoop Data & NoSQL, at the same time

MapR Enterprise Hadoop
• Top Ranked
• 500+ Customers
• Cloud Leaders

WORLDWIDE PRESENCE & CUSTOMER SUPPORT
[Map: HQ marked]

One of Our Key Strengths: We Innovate

Hadoop Distributions
[Diagram labels: Distribution A, Distribution C; Open Source, Management, Architectural Innovations]

Silos make analysis very difficult
• How do I identify a unique {customer, trade} across data sets?
• How can I guarantee the lack of anomalous behavior if I can't see all data?

Here's an idea
Give users the power to query across silos, irrespective of data types

Rethink SQL for Big Data
Preserve
• ANSI SQL
  – Familiar and ubiquitous
• Performance
  – Interactive nature crucial for BI/Analytics
• One technology
  – Painful to manage different technologies
• Enterprise ready
  – System-of-record, HA, DR, Security, Multi-tenancy, …
Invent
• Flexible data-model
  – Allow schemas to evolve rapidly
  – Support semi-structured data types
• Agility
  – Self-service possible when developer and DBA are the same
• Scalability
  – In all dimensions: data, speed, schemas, processes, management

SQL is here to stay

Hadoop is here to stay

Self-Describing Data Ubiquitous
Centralised schema
- Static
- Managed by the DBAs
- In a centralised repository
Long, meticulous data preparation process (ETL, create/alter schema, etc.)
Self-describing, or schema-less, data
- Dynamic/evolving
- Managed by the applications
- Embedded in the data
Less schema, more suitable for data that has higher volume, variety and velocity
Apache Drill

Drill
• Apache open source project
• Scale-out execution engine for low-latency SQL queries
• Unified SQL-based API for zero day analytics & operational applications
• Flexible data sources

Drill & Dremel
• Inspired by Google Tech
• SQL querying of Google data over GFS & BigTable
• In production use since 2006 - 8 YEARS!
• Tens of thousands of concurrent users over PB of data
• Dremel paper released 2010

Drill
[Diagram: Zookeeper; three nodes, each with a Drillbit, a Distributed Cache and DFS/HBase; Query]
1. Query comes to any Drillbit (JDBC, ODBC, CLI, protobuf)
2. Drillbit generates execution plan based on query optimization & locality
3. Fragments are farmed to individual nodes
4. Result is returned to driving node
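
The four steps above can be pictured as a toy scatter/gather, sketched here in Python. This is only an illustration of the flow, not Drill's planner or RPC layer: the node names, the in-memory rows and the helpers `plan_fragments`, `run_fragment` and `run_query` are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cluster: each node holds some local rows (a stand-in for DFS/HBase data).
NODES = {
    "node1": [{"errorLevel": 1}, {"errorLevel": 3}],
    "node2": [{"errorLevel": 3}, {"errorLevel": 5}],
    "node3": [{"errorLevel": 1}],
}

def plan_fragments(nodes):
    """Step 2: plan one fragment per node that holds data (locality)."""
    return [name for name, rows in nodes.items() if rows]

def run_fragment(node_name):
    """Step 3: a fragment counts rows per errorLevel on its own node's data."""
    counts = {}
    for row in NODES[node_name]:
        counts[row["errorLevel"]] = counts.get(row["errorLevel"], 0) + 1
    return counts

def run_query():
    """Steps 1 and 4: the driving node fans fragments out and merges the partial results."""
    merged = {}
    with ThreadPoolExecutor() as pool:
        for partial in pool.map(run_fragment, plan_fragments(NODES)):
            for level, n in partial.items():
                merged[level] = merged.get(level, 0) + n
    return merged

result = run_query()
print(result)
```

Any node could play the driving role in this sketch, which mirrors the point that a query may arrive at any Drillbit.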

A Drill Database
• What is a database with Drill/MapR? There isn't one
• Just a directory, with a bunch of related files or other sources
~/work/bugs
  BugList
    symptom    version  date     bugid  dump-name
    app crash  3.1.1    14/7/14  12345  cust1.tgz
    app slow   3.1.0    12/7/14  45678  cust2.tgz
  Customers
    name  rep    se       dump-name
    xxxx  dkim   junhyuk  cust1.tgz
    yyyy  yoshi  aki      cust2.tgz

Data Source is in the Query
    select timestamp, message
    from dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`
    where errorLevel > 2
• dfs1: this is a cluster in Apache Drill
  - DFS
  - HBase
  - Hive meta-store
• logs: a work-space
  - Typically a sub-directory
  - HIVE database
• `AppServerLogs/2014/Jan/p001.parquet`: a table
  - pathnames
  - HBase table
  - Hive table
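
To make the three-part name concrete, here is a minimal Python sketch that splits a source reference into the parts the slide calls cluster, work-space and table. The `split_source` helper and its regular expression are assumptions made for illustration; they are not how Drill parses names.

```python
import re

def split_source(qualified):
    """Split 'cluster.workspace.`table`' into its three parts."""
    match = re.fullmatch(r"(\w+)\.(\w+)\.`([^`]+)`", qualified)
    if match is None:
        raise ValueError(f"not a three-part source name: {qualified!r}")
    cluster, workspace, table = match.groups()
    return {"cluster": cluster, "workspace": workspace, "table": table}

parts = split_source("dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`")
print(parts)
```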

Can be an entire directory tree
    -- On a file
    select errorLevel, count(*)
    from dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet`
    group by errorLevel;
    -- On the entire data collection: all years, all months
    select errorLevel, count(*)
    from dfs.logs.`/AppServerLogs`
    group by errorLevel;
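
The idea that a "table" can be one file or a whole directory tree can be mimicked in a few lines of Python. This is a sketch under stated assumptions: JSON-lines files stand in for Parquet, the sample tree and its values are invented, and `count_by_error_level` is a hypothetical helper rather than anything Drill provides.

```python
import json
import os
import tempfile
from collections import Counter

def count_by_error_level(path):
    """Count records per errorLevel in one file, or in every file under a directory."""
    if os.path.isdir(path):
        files = [os.path.join(root, name)
                 for root, _dirs, names in os.walk(path)
                 for name in names]
    else:
        files = [path]
    counts = Counter()
    for file_path in files:
        with open(file_path) as handle:
            for line in handle:
                counts[json.loads(line)["errorLevel"]] += 1
    return dict(counts)

# Build a tiny invented tree: AppServerLogs/<year>/<month>/part0001.json
base = tempfile.mkdtemp()
logs = os.path.join(base, "AppServerLogs")
sample = {("2013", "Dec"): [1, 2], ("2014", "Jan"): [2, 2, 3]}
for (year, month), levels in sample.items():
    folder = os.path.join(logs, year, month)
    os.makedirs(folder)
    with open(os.path.join(folder, "part0001.json"), "w") as handle:
        for level in levels:
            handle.write(json.dumps({"errorLevel": level}) + "\n")

one_file = count_by_error_level(os.path.join(logs, "2014", "Jan", "part0001.json"))
whole_tree = count_by_error_level(logs)
print(one_file, whole_tree)
```

The same function serves both cases; only the path changes, which is the point the two queries make.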

Combine data sources on the fly
• JSON
• CSV
• ORC (i.e., all Hive types)
• Parquet
• HBase tables
• … can combine them
    select USERS.name, USERS.emails.work
    from dfs.logs.`/data/logs` LOGS, dfs.users.`/profiles.json` USERS
    where LOGS.uid = USERS.uid and errorLevel > 5
    group by USERS.name, USERS.emails.work
    order by count(*);

Queries are simple
    select b.bugid, b.symptom, b.`date`
    from dfs.bugs.`/Customers` c, dfs.bugs.`/BugList` b
    where c.`dump-name` = b.`dump-name`
Let's say I want to cross-reference against your list:
    select b.bugid, b.symptom
    from dfs.bugs.`/BugList` b, dfs.yourbugs.`/YourBugFile` b2
    where b.bugid = b2.xxx
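
For readers who want to see what the first join returns, here is a small Python sketch that runs the same logic over the sample BugList and Customers rows shown earlier in the deck. The `join_on` helper is hypothetical and is a plain hash join, not a description of how Drill executes joins.

```python
# Sample rows taken from the deck's ~/work/bugs example.
bug_list = [
    {"symptom": "app crash", "version": "3.1.1", "date": "14/7/14",
     "bugid": 12345, "dump-name": "cust1.tgz"},
    {"symptom": "app slow", "version": "3.1.0", "date": "12/7/14",
     "bugid": 45678, "dump-name": "cust2.tgz"},
]
customers = [
    {"name": "xxxx", "rep": "dkim", "se": "junhyuk", "dump-name": "cust1.tgz"},
    {"name": "yyyy", "rep": "yoshi", "se": "aki", "dump-name": "cust2.tgz"},
]

def join_on(left, right, key):
    """Inner join: index the right side by key, then probe it with each left row."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [(l, r) for l in left for r in index.get(l[key], [])]

# Equivalent of: select b.bugid, b.symptom, b.date ... where c.dump-name = b.dump-name
rows = [(b["bugid"], b["symptom"], b["date"])
        for _c, b in join_on(customers, bug_list, "dump-name")]
print(rows)
```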

What does it mean?
• No ETL
• Reach out directly to the particular table/file
• As long as the permissions are fine, you can do it
• No need to have the meta-data
  – None needed

• Schema can change over the course of a query
• Operators are able to reconfigure themselves on schema change events
  – Minimize flexibility overhead
  – Support more advanced execution optimization based on actual data characteristics
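
A rough Python sketch of the schema-change idea: records arrive in batches, and a scan flags each batch whose schema differs from the previous one so that downstream operators know when to reconfigure. The batch contents and the `batch_schema` and `scan` helpers are invented for illustration and do not reflect Drill's internal interfaces.

```python
def batch_schema(batch):
    """Schema of a batch: sorted (field, type name) pairs seen in its records."""
    fields = {}
    for record in batch:
        for key, value in record.items():
            fields[key] = type(value).__name__
    return tuple(sorted(fields.items()))

def scan(batches):
    """Yield (schema_changed, batch); the first batch always counts as a change."""
    current = None
    for batch in batches:
        schema = batch_schema(batch)
        changed = schema != current
        current = schema
        yield changed, batch

batches = [
    [{"name": "classic", "cal": 400}],
    [{"name": "choco", "cal": 700}],
    [{"name": "bostoncreme", "cal": 2000, "vegan": False}],  # new field appears mid-query
]
events = [changed for changed, _batch in scan(batches)]
print(events)
```

Only batches flagged as changed would need any operator setup work; the unchanged ones flow through with no extra cost, which is one way to read "minimize flexibility overhead".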

Querying JSON
donuts.json
    { "name": "classic", "fillings": [ { "name": "sugar", "cal": 400 } ] }
    { "name": "choco", "fillings": [ { "name": "sugar", "cal": 400 }, { "name": "chocolate", "cal": 300 } ] }
    { "name": "bostoncreme", "fillings": [ { "name": "sugar", "cal": 400 }, { "name": "cream", "cal": 1000 }, { "name": "jelly", "cal": 600 } ] }

Another example
    select d.name, count(d.fillings)
    from (select convert_from(cf1.`donut-json`, json) as d
          from hbase.user.`donuts`);
• convert_from(xx, json) invokes the JSON parser inside Drill
• What if you could plug in any parser?
  – XML?
  – Another NoSQL database format
  – Any other file format
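
As an illustration of what this query is after, the Python sketch below parses donut records stored as JSON strings (a stand-in for the HBase column) and counts the fillings of each donut. It assumes the donut data from the previous slide; `json.loads` plays the role the slide gives to convert_from, and the variable names are hypothetical.

```python
import json

# JSON documents held as raw strings, as they might sit in an HBase column.
raw_rows = [
    '{"name": "classic", "fillings": [{"name": "sugar", "cal": 400}]}',
    '{"name": "choco", "fillings": [{"name": "sugar", "cal": 400},'
    ' {"name": "chocolate", "cal": 300}]}',
    '{"name": "bostoncreme", "fillings": [{"name": "sugar", "cal": 400},'
    ' {"name": "cream", "cal": 1000}, {"name": "jelly", "cal": 600}]}',
]

# Parse each raw value into a nested record, then count the nested fillings.
donuts = [json.loads(raw) for raw in raw_rows]
filling_counts = [(d["name"], len(d["fillings"])) for d in donuts]
print(filling_counts)
```

Swapping `json.loads` for another parser is all it would take to read a different format here, which is the "plug in any parser" idea in miniature.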

No ETL
• Basically, Drill is querying the raw data directly
• Joining with processed data
• NO ETL
• Folks, this is very, very powerful
• NO ETL

Seamless integration with Apache Hive
• Low latency queries on Hive tables
• Support for 100s of Hive file formats
• Ability to reuse Hive UDFs
• Support for multiple Hive metastores in a single query

A Quick Tour through Apache Drill

Apache Drill
• FLEXIBLE SCHEMA MANAGEMENT: Analyze data, self-described or central metadata
• FRICTIONLESS ANALYTICS ON NESTED DATA: Analyze semi structured & nested data
• PLUG AND PLAY WITH EXISTING: Reuse investments in SQL/BI tools and Apache Hive
… and with an architecture built ground up for Low Latency queries at Scale

Apache Drill Roadmap
1.0: Data exploration/ad-hoc queries
• Low-latency SQL
• Schema-less execution
• Files & HBase/M7 support
• Hive integration
• BI and SQL tool support via ODBC/JDBC
1.1: Advanced analytics and operational data
• HBase query speedup
• Nested data functions
• Advanced SQL functionality
2.0: Operational SQL
• Ultra low latency queries
• Single row insert/update/delete
• Workload management

Apache Drill Resources
• Drill 0.5
• Getting started with Drill is easy
  – Download Drill Sandbox from mapr.com
• Mailing lists
  – drill-[email protected]
  – drill-[email protected]
• Docs: https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+Wiki
• Fork us on GitHub: http://github.com/apache/incubator-drill/
• Create a JIRA: https://issues.apache.org/jira/browse/DRILL

Active Drill Community
• Large community, growing rapidly
  – 35-40 contributors, 16 committers
  – Microsoft, LinkedIn, Oracle, Facebook, Visa, Lucidworks, Concurrent, many universities
• In 2014
  – over 20 meet-ups, many more coming soon
  – 2 hackathons, with 40+ participants
• Encourage you to join, learn, contribute and have fun …

Drill at MapR
• World-class SQL team, ~20 people
• 150+ years combined experience building commercial databases
• Oracle, DB2, ParAccel, Teradata, SQLServer, Vertica
• Team works on Drill, Hive, Impala
• Fixed some of the toughest problems in Apache Hive

Thank you!
Richard Shaw
[email protected]
@aggress
