
A Platform for Big Data

Matt Wood

June 11, 2013

Transcript

  1. 2 TRILLION OBJECTS

    [Chart: quarterly time axis, Q4 2006 to Q1 2013]
  2. 5.5 MILLION HADOOP CLUSTERS

    [Chart: time axis in three-week steps, 5/22/2010 to 4/6/2013]
  3. The Data Analysis Gap

    [Chart: data volume over time, 1990 to 2020. Labels: Enterprise Data, Generated data, Data in Warehouse, Available for analysis.] Sources: Gartner, User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC, Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
  4. GENERATION COLLECTION & STORAGE COMPUTATION & ANALYSIS COLLABORATION & SHARING

    AMAZON S3 AMAZON DYNAMODB AMAZON GLACIER AMAZON RDS AWS IMPORT/EXPORT
  5. GENERATION COLLECTION & STORAGE COMPUTATION & ANALYSIS COLLABORATION & SHARING

    AMAZON S3 AMAZON DYNAMODB AMAZON GLACIER AMAZON RDS AMAZON CC2 AMAZON HS1 AMAZON CR1 AWS IMPORT/EXPORT AMAZON CG1
  6. GENERATION COLLECTION & STORAGE COMPUTATION & ANALYSIS COLLABORATION & SHARING

    AMAZON S3 AMAZON DYNAMODB AMAZON GLACIER AMAZON RDS AMAZON CC2 AMAZON HS1 AMAZON CR1 AWS DATA PIPELINE AMAZON SWF AWS CLOUDFORMATION AWS IMPORT/EXPORT AMAZON CG1
  7. GENERATION COLLECTION & STORAGE COMPUTATION & ANALYSIS COLLABORATION & SHARING

    AMAZON S3 AMAZON DYNAMODB AMAZON GLACIER AMAZON RDS AMAZON CC2 AMAZON HS1 AMAZON CR1 AWS DATA PIPELINE AMAZON SWF AWS CLOUDFORMATION AWS IMPORT/EXPORT AMAZON REDSHIFT AMAZON ELASTIC MAPREDUCE AMAZON CG1
  8. GENERATION COLLECTION & STORAGE COMPUTATION & ANALYSIS COLLABORATION & SHARING

    AMAZON S3 AMAZON DYNAMODB AMAZON GLACIER AMAZON RDS AMAZON CC2 AMAZON HS1 AMAZON CR1 AWS DATA PIPELINE AMAZON SWF AWS CLOUDFORMATION AWS IMPORT/EXPORT AMAZON REDSHIFT AMAZON ELASTIC MAPREDUCE AMAZON CG1
  9. KEYS & VALUES (AMAZON DYNAMODB)

    ORDER ID, DATE, ORDER TOTAL, MERCHANT. Hash key, Range key, Secondary index.
  10. KEYS & VALUES (AMAZON DYNAMODB)

    ORDER ID, DATE, ORDER TOTAL, MERCHANT. Hash key, Range key, Secondary index, Projected attribute.
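
The key roles on this slide can be sketched as the parameters of a `CreateTable` request. This is only an illustration: the transcript does not say which attribute plays which role, so the mapping below (ORDER ID as hash key, DATE as range key, ORDER TOTAL as the secondary index key, MERCHANT as the projected attribute) is an assumption based on the order of the labels, and the table name, index name, attribute types and capacity numbers are hypothetical.

```python
# Request parameters in the shape accepted by DynamoDB's CreateTable action.
# Assumption: attribute-to-role mapping follows the label order on the slide.
# Hypothetical: table name, index name, attribute types, capacity numbers.
create_table_request = {
    "TableName": "Orders",
    "AttributeDefinitions": [
        {"AttributeName": "OrderId", "AttributeType": "S"},
        {"AttributeName": "Date", "AttributeType": "S"},
        {"AttributeName": "OrderTotal", "AttributeType": "N"},
    ],
    # Primary key: hash key plus range key.
    "KeySchema": [
        {"AttributeName": "OrderId", "KeyType": "HASH"},
        {"AttributeName": "Date", "KeyType": "RANGE"},
    ],
    # Local secondary index: same hash key, alternative range key.
    "LocalSecondaryIndexes": [
        {
            "IndexName": "OrderTotalIndex",
            "KeySchema": [
                {"AttributeName": "OrderId", "KeyType": "HASH"},
                {"AttributeName": "OrderTotal", "KeyType": "RANGE"},
            ],
            # Projected attribute: copied into the index so index reads can return it.
            "Projection": {
                "ProjectionType": "INCLUDE",
                "NonKeyAttributes": ["Merchant"],
            },
        }
    ],
    "ProvisionedThroughput": {"ReadCapacityUnits": 10, "WriteCapacityUnits": 5},
}


def key_roles(request):
    """Map each primary-key attribute name to its role (HASH or RANGE)."""
    return {k["AttributeName"]: k["KeyType"] for k in request["KeySchema"]}


print(key_roles(create_table_request))
```

With an SDK such as boto3, a dictionary like this could be passed as keyword arguments to the DynamoDB client's `create_table` call; the sketch stops short of that so it runs without credentials.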
  11. API (AMAZON DYNAMODB)

    CreateTable, UpdateTable, DeleteTable, DescribeTable, ListTables, Query, Scan, PutItem, GetItem, UpdateItem, DeleteItem, BatchGetItem, BatchWriteItem
  12. READS, WRITES, UPDATES (AMAZON DYNAMODB)

    Item-level transactions only. Conditional and atomic updates. Counts. Top/bottom n values. Results paged to 1 MB in size.
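
To make "conditional and atomic updates" concrete, here is a minimal in-memory sketch of the two semantics. `ItemStore` and its methods are hypothetical stand-ins written for this illustration, not a DynamoDB client; in the real service the atomicity is enforced server-side per item.

```python
class ConditionFailed(Exception):
    """Raised when the stored value no longer matches what the caller expected."""


class ItemStore:
    """Tiny in-memory stand-in for a key-value table (hypothetical)."""

    def __init__(self):
        self._items = {}

    def put(self, key, item):
        self._items[key] = dict(item)

    def get(self, key):
        return dict(self._items[key])

    def conditional_update(self, key, attr, new_value, expected):
        # Write only if the attribute still holds the value the caller read earlier.
        if self._items[key].get(attr) != expected:
            raise ConditionFailed(f"{attr} changed since it was read")
        self._items[key][attr] = new_value

    def atomic_add(self, key, attr, delta):
        # Increment or decrement in one step; the caller never reads first.
        self._items[key][attr] = self._items[key].get(attr, 0) + delta
        return self._items[key][attr]


store = ItemStore()
store.put("order-1", {"status": "NEW", "views": 0})

store.conditional_update("order-1", "status", "PAID", expected="NEW")
store.atomic_add("order-1", "views", 1)
store.atomic_add("order-1", "views", -1)

try:
    # A second writer still believes the status is NEW; its write is rejected.
    store.conditional_update("order-1", "status", "SHIPPED", expected="NEW")
    stale_write_rejected = False
except ConditionFailed:
    stale_write_rejected = True

print(store.get("order-1"), stale_write_rejected)
```

The rejected write is the building block of optimistic concurrency control: read, compute, write conditionally, and retry if the condition fails.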
  13. PROVISIONED THROUGHPUT (AMAZON DYNAMODB)

    Provision the IO your application needs. Pay per unit of provisioned capacity. Consistent, predictable performance, irrespective of scale. Designed for uniform workloads.
  14. READ THROUGHPUT (AMAZON DYNAMODB)

    IO per 4 KB item. Strong and eventual consistency. Mix and match consistency.
  15. WRITE THROUGHPUT (AMAZON DYNAMODB)

    IO per 1 KB item. Atomic increment and decrement. Optimistic concurrency control.
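
The read and write unit sizes on the two throughput slides translate into a simple sizing calculation. The sketch below assumes that item sizes round up to the next unit boundary and that an eventually consistent read costs half a strongly consistent one; the function names and the example workload are hypothetical.

```python
import math

READ_UNIT_BYTES = 4 * 1024   # "IO per 4kb item" on the read slide
WRITE_UNIT_BYTES = 1 * 1024  # "IO per 1kb item" on the write slide


def read_units(item_bytes, reads_per_second, strongly_consistent=True):
    """Read capacity needed for a steady rate of single-item reads."""
    per_read = math.ceil(item_bytes / READ_UNIT_BYTES)
    units = per_read * reads_per_second
    # Assumption: eventually consistent reads cost half as much.
    return units if strongly_consistent else math.ceil(units / 2)


def write_units(item_bytes, writes_per_second):
    """Write capacity needed for a steady rate of single-item writes."""
    return math.ceil(item_bytes / WRITE_UNIT_BYTES) * writes_per_second


# Hypothetical workload: 6 KB items read 100 times/s, 3 KB items written 50 times/s.
print(read_units(6 * 1024, 100))                             # strongly consistent
print(read_units(6 * 1024, 100, strongly_consistent=False))  # eventually consistent
print(write_units(3 * 1024, 50))
```

"Mix and match consistency" then becomes a per-request cost decision: the same table can serve cheap eventually consistent reads alongside strongly consistent ones.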
  16. YOUR APP, DYNAMODB: READ THROUGHPUT, WRITE THROUGHPUT

    THROUGHPUT: 14.2% 14.2% 14.2% 14.2% 14.2% 14.2% 14.2%. KEY ACCESS: 14.2% 14.2% 14.2% 14.2% 14.2% 14.2% 14.2%.
  17. YOUR APP, DYNAMODB: READ THROUGHPUT, WRITE THROUGHPUT

    THROUGHPUT: 14.2% 14.2% 14.2% 14.2% 14.2% 14.2% 14.2%. KEY ACCESS: 0% 50% 0% 50% 0% 0% 0%.
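
The contrast between these two slides can be put into numbers. The sketch below is a simplified model, not a description of DynamoDB internals: it assumes provisioned throughput is split evenly across seven slices (the 14.2% figures) and asks what fraction of the total an application can use before its busiest slice saturates. The function name is hypothetical.

```python
def usable_fraction(capacity_shares, access_shares):
    """Fraction of total provisioned throughput reachable before any slice saturates.

    capacity_shares: share of total throughput each slice can serve.
    access_shares:   share of the application's requests that land on each slice.
    """
    limits = [cap / acc for cap, acc in zip(capacity_shares, access_shares) if acc > 0]
    if not limits:
        return 0.0  # no traffic at all
    return min(1.0, min(limits))


even_capacity = [1 / 7] * 7

# First slide: key access spread evenly across all seven slices.
uniform = usable_fraction(even_capacity, [1 / 7] * 7)

# Second slide: all traffic lands on two of the seven slices.
skewed = usable_fraction(even_capacity, [0, 0.5, 0, 0.5, 0, 0, 0])

print(f"uniform access: {uniform:.1%} of provisioned throughput usable")
print(f"skewed access:  {skewed:.1%} of provisioned throughput usable")
```

Under this model the skewed pattern can use only two sevenths of what was provisioned, which is why the earlier slide describes the service as designed for uniform workloads.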
  18. GENERATION COLLECTION & STORAGE COMPUTATION & ANALYSIS COLLABORATION & SHARING

    AMAZON S3 AMAZON DYNAMODB AMAZON GLACIER AMAZON RDS AMAZON CC2 AMAZON HS1 AMAZON CR1 AWS DATA PIPELINE AMAZON SWF AWS CLOUDFORMATION AWS IMPORT/EXPORT AMAZON REDSHIFT AMAZON ELASTIC MAPREDUCE AMAZON CG1
  19. GENERATION COLLECTION & STORAGE COMPUTATION & ANALYSIS COLLABORATION & SHARING

    AMAZON S3 AMAZON DYNAMODB AMAZON GLACIER AMAZON RDS AMAZON CC2 AMAZON HS1 AMAZON CR1 AWS DATA PIPELINE AMAZON SWF AWS CLOUDFORMATION AWS IMPORT/EXPORT AMAZON REDSHIFT AMAZON ELASTIC MAPREDUCE AMAZON CG1
  20. Elastic MapReduce

    [Diagram labels: Code; Name node; Input data; S3/HDFS; Elastic cluster; Queries + BI via JDBC, Pig, Hive; S3, DynamoDB, Redshift]
  21. Elastic MapReduce

    [Diagram labels: Code; Name node; Input data; Output; S3/HDFS; Elastic cluster; Queries + BI via JDBC, Pig, Hive; S3, DynamoDB, Redshift]
  22. HADOOP ALL THE WAY DOWN (ELASTIC MAPREDUCE)

    Pig, Hive, Mesos, Avro, Spark, Shark, MapR, Informatica, Mahout, Nutch, Flume, Accumulo, Cascading, Oozie, HBase, Sqoop
  23. On-demand instance: $0.50 per hour

    $0.0350 today: 7% of on-demand price. “Overclock” by 14x.
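
The two figures on this slide follow directly from the prices. The sketch below just reproduces the arithmetic; reading $0.0350 as a discounted hourly price (presumably a Spot price, though the transcript does not say so) and the $100 budget are assumptions.

```python
ON_DEMAND_PER_HOUR = 0.50     # from the slide
DISCOUNTED_PER_HOUR = 0.0350  # from the slide; assumed to be an hourly price

# Share of the on-demand price actually paid: the slide's "7%".
price_ratio = DISCOUNTED_PER_HOUR / ON_DEMAND_PER_HOUR

# Instance-hours bought per on-demand instance-hour: the slide's "14x".
overclock = ON_DEMAND_PER_HOUR / DISCOUNTED_PER_HOUR


def instance_hours(budget, hourly_price):
    """Instance-hours a fixed budget buys at a given hourly price."""
    return budget / hourly_price


budget = 100.0  # hypothetical budget in dollars
print(f"price ratio: {price_ratio:.0%}")
print(f"overclock factor: {overclock:.1f}x")
print(f"${budget:.0f} buys {instance_hours(budget, ON_DEMAND_PER_HOUR):.0f} on-demand hours "
      f"or {instance_hours(budget, DISCOUNTED_PER_HOUR):.0f} discounted hours")
```

Read this way, "overclock" means running roughly fourteen times as many instances for the same spend, so a parallel job can finish correspondingly sooner.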
  24. GENERATION COLLECTION & STORAGE COMPUTATION & ANALYSIS COLLABORATION & SHARING

    AMAZON S3 AMAZON DYNAMODB AMAZON GLACIER AMAZON RDS AMAZON CC2 AMAZON HS1 AMAZON CR1 AWS DATA PIPELINE AMAZON SWF AWS CLOUDFORMATION AWS IMPORT/EXPORT AMAZON REDSHIFT AMAZON ELASTIC MAPREDUCE AMAZON CG1
  25. GENERATION COLLECTION & STORAGE COMPUTATION & ANALYSIS COLLABORATION & SHARING

    AMAZON S3 AMAZON DYNAMODB AMAZON GLACIER AMAZON RDS AMAZON CC2 AMAZON HS1 AMAZON CR1 AWS DATA PIPELINE AMAZON SWF AWS CLOUDFORMATION AWS IMPORT/EXPORT AMAZON REDSHIFT AMAZON ELASTIC MAPREDUCE AMAZON CG1
  26. COLUMNAR STORE (REDSHIFT)

    Designed for columnar access. Automatic data compression. Large block size. Best practices for data loading. Continual incremental backup to S3.
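
Why columnar access and compression go together can be shown with a toy example. The sketch below is a generic illustration of column-oriented layout with run-length encoding, not a description of how Redshift stores data; the sample rows and function names are hypothetical.

```python
from itertools import groupby

# Hypothetical order rows, as a row-oriented store would hold them.
rows = [
    {"order_id": 1, "merchant": "acme", "total": 20.0},
    {"order_id": 2, "merchant": "acme", "total": 35.0},
    {"order_id": 3, "merchant": "acme", "total": 15.0},
    {"order_id": 4, "merchant": "globex", "total": 50.0},
    {"order_id": 5, "merchant": "globex", "total": 10.0},
]


def to_columns(rows):
    """Pivot row dictionaries into one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# An aggregate over one column touches only that column's values.
total_sum = sum(columns["total"])

# Values within a column are similar, so they compress well.
encoded_merchant = run_length_encode(columns["merchant"])

print(total_sum)
print(encoded_merchant)
```

An analytic query such as a sum over `total` never reads `merchant` or `order_id`, and the repetitive `merchant` column shrinks from five values to two pairs.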
  27. LEADER COMPUTE COMPUTE COMPUTE S3 BI TOOLS READ ONLY LEADER

    COMPUTE COMPUTE COMPUTE S3 COMPUTE COMPUTE
  28. HI1 ON EC2

    2 x 1 TB SSDs. 4 KB random reads: 120k IOPS. 4 KB random writes: 10k - 80k IOPS.
  29. create external table items_db (id string, votes bigint, views bigint)

    stored by 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
    tblproperties ("dynamodb.table.name" = "items",
      "dynamodb.column.mapping" = "id:id,votes:votes,views:views");
  30. CREATE EXTERNAL TABLE orders_s3_new_export (order_id string, customer_id string, order_date int, total double)

    PARTITIONED BY (year string, month string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://export_bucket';

    INSERT OVERWRITE TABLE orders_s3_new_export PARTITION (year='2012', month='01')
    SELECT * from orders_ddb_2012_01;
  31. 98% time saved for clinical trial simulations

                                                             Internal System   AWS
    Individual Clinical Trial Simulation Run Time (min)      56                56
    Total Number of Clinical Trial Simulations               2000              2000
    No. Servers                                              2                 256
    No. CPUs                                                 32                2048
    Total Analysis Run Time (hr)                             60                1.2
    Cost                                                     ??                $336
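
The headline figure can be checked against the numbers on this slide. The sketch below recomputes the time saved from the two reported run times and compares them with an idealized estimate; the estimate assumes one simulation per CPU at a time and perfect parallelism, which is a simplification not stated on the slide.

```python
SIMULATIONS = 2000
MINUTES_PER_SIMULATION = 56

REPORTED_HOURS_INTERNAL = 60   # 32 CPUs
REPORTED_HOURS_AWS = 1.2       # 2048 CPUs


def ideal_hours(simulations, minutes_each, cpus):
    """Wall-clock hours assuming one simulation per CPU and no overhead."""
    return simulations * minutes_each / cpus / 60


internal_ideal = ideal_hours(SIMULATIONS, MINUTES_PER_SIMULATION, 32)
aws_ideal = ideal_hours(SIMULATIONS, MINUTES_PER_SIMULATION, 2048)

# Time saved and speedup, from the reported run times.
time_saved = 1 - REPORTED_HOURS_AWS / REPORTED_HOURS_INTERNAL
speedup = REPORTED_HOURS_INTERNAL / REPORTED_HOURS_AWS

print(f"ideal internal run time: {internal_ideal:.1f} h (reported: {REPORTED_HOURS_INTERNAL})")
print(f"ideal AWS run time:      {aws_ideal:.2f} h (reported: {REPORTED_HOURS_AWS})")
print(f"time saved: {time_saved:.0%}, speedup: {speedup:.0f}x")
```

The idealized estimates land close to the reported 60 and 1.2 hours, which suggests the workload parallelizes almost linearly across simulations.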
  32. Reduced burden on pediatric subjects

                                         Traditional Design   Design Optimized Using Clinical Trial Simulation
    # of subjects                        60                   40
    # of blood samples per subject       12                   5
    Length of stay per subject           72 hours             26 hours
    Length of study                      2.5 years            1.7 years
    Total study cost                     $700K                $250K

    Length and cost projected based on historical data in pediatric subjects.