Data availability
• Data is at the heart of business interactions
• Devices and sensors
Ability to process
• Lower cost of storage & ingestion
• New programming models
• New scale and capabilities for SQL
• Easily available software (Open Source)
Cloud consumption model
• Easy on-ramp, cost-effective experimentation
• Unlimited scale, low TCO
• Combine Open Source software and platform services
Scale
• Scan multiple TBs in seconds
• Interactive query performance
• No limits on amount of data
Ease of Use and Adoption
• No administration / provisioning
• Convenience of SQL
• Open interfaces (REST, WebUI, ODBC)
Advanced “Big Data” Storage
• Familiar database structure
• Easy data management and ACLs
• Fast, atomic imports
BigQuery Streaming
• Stream events row-by-row via simple API
• Real-time availability of data
• 100,000 rows per second
• Low cost: $0.01 per 100,000 rows
Use cases
• Server logs, mobile apps, gaming, in-app real-time analytics
Customer example:
Google Analytics Use Cases
Analysis over a longer period of time
• “Between 2010-2014, how has conversion rate changed based on the average page load latency?”
Data mashups for deeper, broader insights
• “Which customers that spent at least $1,000 within the last year visited the site this past month but did not purchase?” (sketched below)
Complex real-life questions can be answered over Big Data
• “Which coupon codes are used most frequently by our referrer sites, and do those customers generate repeat business?”
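As a rough sketch of how the mashup question might look in BigQuery's legacy SQL: the tables ([demo.orders], [demo.site_visits]), their columns (customer_id, order_total, order_date, visit_date), and the date literals standing in for “last year” and “this past month” are all hypothetical.

    SELECT spend.customer_id
    FROM (
      -- Customers who spent at least $1,000 within the last year
      SELECT customer_id, SUM(order_total) AS total_spend
      FROM [demo.orders]
      WHERE order_date >= '2014-01-01'
      GROUP BY customer_id
      HAVING total_spend >= 1000
    ) AS spend
    JOIN (
      -- ...who visited the site this past month
      SELECT customer_id
      FROM [demo.site_visits]
      WHERE visit_date >= '2014-12-01'
      GROUP BY customer_id
    ) AS visits ON spend.customer_id = visits.customer_id
    LEFT OUTER JOIN (
      -- ...excluding anyone who purchased this past month
      SELECT customer_id
      FROM [demo.orders]
      WHERE order_date >= '2014-12-01'
      GROUP BY customer_id
    ) AS buyers ON spend.customer_id = buyers.customer_id
    WHERE buyers.customer_id IS NULL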
“…combined with Tableau’s intuitive visualization tools, enabled our analysts to interactively explore huge quantities of data – hundreds of millions of rows – with incredible efficiency. Previously, analyses would require hours or days to complete, if they would even complete at all. With Google BigQuery it takes minutes, if that, to process. This time-to-insight was previously impossible.”
– Giovanni DeMeo, Vice President Global Marketing and Analytics

“It is incredibly fast and easy to use. Our data was already quite big (at least we like to think so) but we can’t help feeling that BQ has a lot more to offer and would be able to work with 100 times that amount without breaking a sweat. It’s got a short learning curve that allows for quick iterations and rapid product development. The SQL-like query language is an easy transition to make for any engineer and much quicker and easier than using a MapReduce model.”
– Graham Polley, Shine Technologies
BigQuery Storage: More than just file storage
Durability & Fast Reads
• Highly available and durable
• Optimized columnar format and file management
• Fast table reads (without querying)
Rich Metadata
• Familiar database structure
• TTLs, project and dataset ACLs
• Table Decorators - “time travel” (see the sketch below)
Data Imports
• High-throughput, low-latency streaming
• Fast and high-frequency bulk imports
• Atomic create / append / replace operations
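A quick sketch of table decorators in legacy SQL; the table name [demo.events] is hypothetical:

    -- Snapshot decorator: query the table as it existed one hour ago.
    -- Times are in milliseconds; a leading minus means "relative to now".
    SELECT COUNT(*) FROM [demo.events@-3600000]

    -- Range decorator: query only the data added in the last ten minutes.
    SELECT COUNT(*) FROM [demo.events@-600000-]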
"$.book.title"); SQL with differentiated functions for quick data analysis Optimized for SQL • HOST, DOMAIN, REGEXP_MATCH, Analytic functions, etc Structured Data and Flexibility • JSON, JOINs, Nested / Repeated fields BigQuery Storage: More than just file storage
Google Analytics Data
/table/publicdata:samples.github_timeline   27892
/table/publicdata:samples.wikipedia          53204
/table/publicdata:samples.natality          104234
/table/publicdata:samples.shakespeare
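Counts like these can be produced from the analytics table used in the query below. A sketch, assuming the standard Google Analytics export schema (hits.page.pagePath):

    SELECT hits.page.pagePath AS path, COUNT(*) AS pageviews
    FROM [bigquerytestdefault:demo.analytics]
    WHERE hits.page.pagePath CONTAINS '/table/publicdata:samples.'
    GROUP BY path
    ORDER BY pageviews DESC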
How much time do visitors spend going from page A to page B?

    SELECT
      COUNT(visitId) AS count_visitors,
      AVG(hitTime_lead - hitTime) / 1000 AS average_duration_seconds
    FROM (
      -- For each hit, find the next page viewed in the same session
      SELECT
        visitId, hitTime, path,
        LEAD(path, 1) OVER (PARTITION BY visitId ORDER BY hitTime) AS path_lead,
        LEAD(hitTime, 1) OVER (PARTITION BY visitId ORDER BY hitTime) AS hitTime_lead
      FROM (
        -- Drop sessions whose hits were all on page A, or all on page B
        SELECT visitId, hits.time AS hitTime, hits.page.pagePath AS path
        FROM [bigquerytestdefault:demo.analytics]
        OMIT RECORD IF
          EVERY(hits.page.pagePath CONTAINS 'publicdata:samples.github_nested')
          OR EVERY(hits.page.pagePath CONTAINS 'publicdata:samples.github_timeline')
      )
    )
    WHERE path CONTAINS 'publicdata:samples.github_nested'
      AND path_lead CONTAINS 'publicdata:samples.github_timeline'
SQL Limits
• Some scenarios are difficult to express in SQL
• Other scenarios are nearly impossible
Procedural language
• Unblocks difficult scenarios
• Often easier to express with less code
SQL + JavaScript User Defined Functions
• Optimized for SQL: database-like structure, with advanced SQL
• JavaScript functions operating over BigQuery tables (sketched below)
• Powerful, simple, fast!
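A minimal sketch of how a legacy JavaScript UDF was invoked from SQL, assuming a hypothetical urlDecode function registered with the query request; the table [demo.server_logs] and its url column are also illustrative:

    -- Legacy UDFs are table-valued: they take a subquery as input.
    SELECT decoded_url, COUNT(*) AS hits
    FROM urlDecode(SELECT url FROM [demo.server_logs])
    GROUP BY decoded_url
    ORDER BY hits DESC

    -- The JavaScript side, supplied alongside the query, might look like:
    --   function urlDecodeFn(row, emit) {
    --     emit({decoded_url: decodeURIComponent(row.url)});
    --   }
    --   bigquery.defineFunction('urlDecode', ['url'],
    --       [{name: 'decoded_url', type: 'string'}], urlDecodeFn);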
On-Demand Query Pricing
Querying Big Data
• Leverages shared pool of resources
• Concurrent query quota applies
Pricing
• Flat $5/TB processed (reduced from $35/TB)
• No contracts. Pay as you go!
Reservation Capability
How much reserved capacity is needed?
• Average query size: 2.5 TB (5%)
  • Columnar data: process only the columns referenced by the query (sketched below)
  • Table partitioning combined with Table Wildcards
  • Table Decorators for accessing only recent data
• Number of analysts: 2
• Queries per analyst per day: 75
• Estimated total processing: 2.5 TB * 150 queries = 375 TB per day
• Reservation: 5 GB per second
• Total capability: 5 GB per second * 86,400 seconds per day = 432 TB per day
Since 375 TB is below the 432 TB daily capability, a 5 GB per second reservation covers the expected load.
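The columnar effect is easy to see: BigQuery bills for the columns a query actually reads, not the whole table. A sketch against the public Shakespeare sample:

    -- Scans only the 'corpus' column, a small fraction of the table's bytes
    SELECT corpus, COUNT(*) AS line_count
    FROM [publicdata:samples.shakespeare]
    GROUP BY corpus

    -- A SELECT * over the same table would scan every column
    -- and cost proportionally more.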
Reserved Capacity Model
Scenarios
• Volatility in workload throughout the day
• End-of-quarter financials
• Product launch increases load
Consistency, with Flexibility
• Predictable costs
• Guaranteed capacity
• Additional resources when needed: ability to leverage On-Demand resources
Scenario #1 “No Burst”
• Usage was throttled at peak utilization. Helpful in that it avoids the risk of using the entire budget in a short time
• TB processed: 375 TB

Scenario #2 “Burst with Cap”
• Queries above Reserved Capacity leveraged On-Demand, so no slowdown for users
• TB processed: 375 TB
• On-Demand cost: $0

Scenario #3 “Burst with Cap”
• Queries above Reserved Capacity leveraged On-Demand, but were halted after the daily cap was hit
• TB processed: 430 TB (the maximum amount possible for a 5 GB per second reservation)
• On-Demand cost: $0

Scenario #4 “Burst with No Cap”
• Queries above Reserved Capacity leveraged On-Demand, and total usage continued to grow
• TB processed: 510 TB (80 TB more than the 430 TB reservation maximum)
• On-Demand cost: $400 (80 TB * $5 per TB On-Demand price)