Oracle Big Data Discovery : Extending into Machine Learning - A Quantified-Self Case-Study

A year ago Mark Rittman decided to get fit, lose weight and record all of his health, smart home and personal comms data in a single place - which he then processed, analysed and modelled using Oracle Big Data Discovery and Python machine learning algorithms. This case study tells the story of how it was done, and uses that story to walk through new features in the BDD 1.2.x release.

Mark Rittman

June 29, 2016

Transcript

  1. ORACLE BIG DATA DISCOVERY : EXTENDING INTO MACHINE LEARNING - A QUANTIFIED SELF CASE STUDY

    ODTUG KSCOPE’16, CHICAGO
    MARK RITTMAN, ORACLE ACE DIRECTOR
  2. ABOUT THE SPEAKER

    Mark Rittman, CTO, Rittman Mead (T: @MARKRITTMAN)
    KSCOPE’16, CHICAGO, JUNE 2016
    Oracle ACE Director, blogger + ODTUG member
    Regular columnist for Oracle Magazine
    Past ODTUG Executive Board Member
    Author of two books on Oracle BI
    Co-founder & CTO of Rittman Mead
    15+ years in Oracle BI, DW, ETL + now Big Data
    Implementor, trainer, consultant + company founder
    Hobbies include football, tech + now … cycling
  3. SO WHERE HAVE I BEEN FOR THE PAST 6 MONTHS?

    Strangely quiet on the blog, the occasional tweet about Christian’s laptop, where have I been?
    On sabbatical … looking at emerging real-time Hadoop-based BI technologies
    Taking time-out from day-to-day consulting and the Oracle-centric BI world
    Building prototypes and making contact with startups, analysts, open-source teams
    Asking myself the question “what will an analytics platform look like in 5 years’ time?”
  4. CYCLING … WITH A GEEK TWIST

    Recording routes, speed, cadence for later analysis, gamification and ride history
    Something that makes cycling and all workouts more interesting today
    Record routes you took using GPS in phone
    Specialised bike computers for more detailed and accurate speed, cadence data
    Upload into smart phone, load into services such as Strava, Cyclometer, Apple Health
    Review and analyse cycling style, set goals
    Compare and compete against yourself (“gamification”) or others (league tables)
  5. PART OF A WIDER ECOSYSTEM OF HEALTH DEVICES

    All bought by me over the past year, part of my “get fit” initiative
    Jawbone UP health band for workouts, sleep tracking
    Withings Smart Scale for weight
    Apple Health, Apple Watch and iPhone with M7 Motion co-processor
    Each of which integrates or forms its own ecosystem
  6. USING JAWBONE UP PUBLIC API AS AGGREGATOR

    Apple HealthKit was an option for data aggregation, but has no central cloud store
    Can manually download HealthKit data using iOS apps, or use the Hipbone iOS app for Dropbox download
    Jawbone UP API was the most robust and widely supported ecosystem API
    Download data as CSV file, or automate using API
    Access all of Jawbone UP health metrics
    Integrate weight data from Withings scale
    Workout data from Strava
  7. ANOTHER PERSONAL PROJECT : HOME AUTOMATION

    Smart appliances, Internet-connected heating and lights, Sensors and Home Automation platforms
    Another personal project has been home automation, IoT and the “smart home”
    Started with Nest thermostat and Philips Hue lights
    Extended the Nest system to include Nest Protect and Nest Cam
    Used Apple HomeKit, HomeBridge, Apple TV and Domoticz for Siri voice control
    Added Samsung Smart Things hub for Z-wave, Zigbee compatibility
  8. HOME AUTOMATION / IOT NETWORK

    Linking Apple HomeKit, Samsung Smart Things and other IoT devices with Siri + Hadoop Logging
    Diagram labels: PHILIPS HUE LIGHTING; NEST PROTECT (X2), THERMOSTAT, CAM; WITHINGS SMART SCALES; AIRPLAY SPEAKERS; HOMEBRIDGE HOMEKIT / SMART THINGS CONNECTOR; SAMSUNG SMART THINGS HUB (Z-WAVE, ZIGBEE); DOOR, MOTION, MOISTURE, PRESENCE SENSORS; SIRI ON IPHONE, WATCH; HADOOP CLUSTER; SMART THINGS WATCH APP; APPLE HOMEKIT, APPLE TV, SIRI
  9. USING JAWBONE EVENTS TO TRIGGER SWITCHES

    IFTTT can also drive actions directly in Hue, Nest and other Smart Devices
    Use Jawbone UP events and IFTTT to trigger Smart Things actions
    When I wake up, boil the kettle
    If my sleep was lower than usual last night, dim the lights early
  10. “PERSONAL DATA LAKE” LOGICAL ARCHITECTURE

    All built and currently running, combination of real-time and batch loading
    Data extracted or transported to target platform using LogStash, CSV file batch loads
    Landed into Elasticsearch indexes, then exposed as Hive tables using Storage Handler
    Cataloged, visualised and analysed using Oracle Big Data Discovery + Python ML
    Diagram labels: Data Transfer; Data Access; “Personal” Data Lake; Jupyter Web Notebook; 6 Node Hadoop Cluster (CDH5.5); Discovery & Development Labs, Oracle Big Data Discovery 1.2; Data sets and samples; Models and programs; Oracle DV Desktop; Models; BDD Shell, Python, Spark ML; Data Factory; LogStash via HTTP; Manual CSV U/L; Data streams (CSV, IFTTT or API call); Staging; Elasticsearch Indexes (three indexes, one for each data source); Hive Tables w/ Elastic Storage Handler (index data turned into tabular format); Health Data; Unstructured Comms Data; Smart Home Sensor Data
  11. AND USE MACHINE LEARNING FOR INSIGHTS…

    Find correlations, predict outcomes based on regression analysis, classify and cluster data
    As well as visualising the combined dataset, we could also use “machine learning”
    Advanced analytics, classification, regression, clustering
    Run algorithms on the full dataset to answer questions like:
    “What are the biggest determinants of weight gain or loss for me?”
    “On a good day, what is the typical combination of behaviours I exhibit?”
    “If I raised my cadence RPM average, how much further could I cycle per day?”
    “Is working late or missing lunch self-defeating in terms of overall weekly output?”
    Diagram labels: Discovery & Development Labs, Oracle Big Data Discovery 1.2; Models; BDD Shell, Python, Spark ML
  12. ORACLE BIG DATA DISCOVERY - WHAT IS IT?

    Brief summary for the one person in this session who’s not seen Oracle’s marketing
    Oracle’s first Hadoop-Native BI & data discovery tool
    Catalog, visualize, data wrangle and search the datasets you land in Hadoop
    Initial releases focused on these areas of functionality, and OEID migrations
    … but lacked functionality that a data scientist would require
  13. BIG DATA DISCOVERY 1.0/1.1 - PARTIAL SOLUTION

    Missing full data tidying, data aggregation features, plus no real machine learning or stats features
    BDD Catalog + Data Wrangling features
    BDD Data Processing CLI
    BDD Dashboards
    Partial solution - no aggs, null-handling, no materialised joins
    No solution for M/L or Predictive Analytics
  14. NEW FEATURES IN ORACLE BDD 1.2

    BDD 1.2.0 Release Theme : “Developer Productivity”
  15. BDD SHELL - WHAT IS IT?

    Note comment about Jupyter - more on this later
    Interactive tool designed to work with BDD without using Studio's front-end
    Exposes all BDD concepts (views, datasets, data sources etc)
    Supports Apache Spark
    HiveContext and SQLContext exposed
    BDD Shell SDK for easy access to BDD features, functionality
    Access to third-party libraries such as Pandas, Spark ML, NumPy
    Use with web-based notebook such as IPython, Jupyter, Zeppelin
  16. WORKFLOW TO INGEST, TIDY AND PREPARE DATA

    1. Create Apache Hive tables over Elasticsearch Indexes using Storage Handler, and any CSV files or JSON documents from Jawbone UP / Google Locations
    2. Import Hive table data into DGraph, and auto-enrich
    3. Perform exploratory analysis on the imported data
    4. Transform data to create one table, with one row of readings per period
    5. Aggregate rows as appropriate (e.g. weekly averages + counts, for weight analysis)
    6. Deal with nulls and missing data
    7. Expose dataset through BDD Shell / Jupyter web notebook UI
    8. Do any further transformations (e.g. pct chg on prior period) using Python Pandas
    9. Run machine learning algorithms on data using Pandas, PySpark, Spark ML
  17. BASE BDD DATASET - JAWBONE UP EXTRACT

    Initially manually downloaded from Jawbone UP website; long term route would be direct via API
    Data extract contains one row per day, data in various categories
    Base activity data (steps, active time, active calories expended)
    Sleep data (time asleep, time in-bed, light and deep sleep, resting heart-rate)
    Mood if recorded; food ingested if recorded
    Workout data as provided by Strava integration
    Weight data as provided by Withings integration
  18. PERFORM EXPLORATORY ANALYSIS ON DATA

    Understand the “spread” of data using histograms
    Use box-plot charts to identify outliers and range of “usual” values
    Sort attributes by strongest correlation to a target attribute
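
The box-plot outlier check on slide 18 is done in the BDD Studio UI in the talk. As a rough equivalent outside BDD, here is a minimal pandas sketch of the usual box-plot convention (values more than 1.5 times the interquartile range beyond the quartiles). The step counts are made up for illustration and the 1.5 multiplier is the common default, not something stated in the deck.

```python
import pandas as pd

# Hypothetical daily step counts; not taken from the talk's dataset.
steps = pd.Series([8000, 9000, 8500, 9500, 10000, 9200, 30000])

# Quartiles and interquartile range, as drawn by a box plot.
q1 = steps.quantile(0.25)
q3 = steps.quantile(0.75)
iqr = q3 - q1

# Whisker limits under the conventional 1.5 * IQR rule (an assumption here).
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Anything outside the whiskers is flagged as an outlier.
outliers = steps[(steps < lower) | (steps > upper)]
print(outliers.tolist())
```

Values inside the whiskers give the range of "usual" readings that the slide refers to.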
  19. TRANSFORM (“WRANGLE”) DATA AS NEEDED
  20. DEALING WITH MISSING DATA (“NULLS”)

    Very typical with self-recorded healthcare and workout data
    Most machine-learning algorithms expect every attribute to have a value per row
    Self-recorded data is typically sporadically recorded, lots of gaps in data
    Need to decide what to do with columns of poorly populated values
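
Slide 20 handles nulls inside BDD, and the transcript does not record which strategy was chosen. One common approach, sketched here in pandas as an assumption rather than the talk's method, is to drop columns that are mostly empty and fill the remaining gaps with each column's median. The column names, values and the 50% threshold are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical daily extract; column names are illustrative only.
daily = pd.DataFrame(
    {
        "steps": [9000, 12000, np.nan, 7000, 11000, 8000],
        "sleep_hours": [7.5, np.nan, 6.0, 8.0, 7.0, 6.5],
        "mood": [np.nan, np.nan, np.nan, 4.0, np.nan, np.nan],
    },
    index=pd.date_range("2016-01-04", periods=6, freq="D"),
)

# Fraction of rows in which each column actually has a value.
populated = daily.notna().mean()

# Drop poorly populated columns (the 0.5 cut-off is an arbitrary choice).
kept = daily.loc[:, populated >= 0.5]

# Fill the remaining gaps with the column median so every row is complete.
clean = kept.fillna(kept.median())
print(clean)
```

Median filling keeps every row but flattens real variation, so for a sparsely recorded attribute such as mood, dropping the column is often the safer choice.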
  21. JOINING DATASETS TO MATERIALIZE RELATED DATA

    Previous versions of BDD allowed you to create joins for views
    Used in visualisations, equivalent to a SQL view i.e. SELECT only
    BDD 1.2.x allows you to add new joined attributes to data view, i.e. materialise
    In this instance, use to bring in data on emails, and on geolocation
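
In the talk the materialised join on slide 21 is performed by BDD itself. To show what "materialising" a joined attribute means in practice, here is a small pandas sketch under assumed table and column names: an email log is counted per day and the count is written onto the daily health rows as a new column.

```python
import pandas as pd

# Hypothetical daily health rows and a hypothetical email log.
health = pd.DataFrame(
    {
        "date": pd.to_datetime(["2016-01-04", "2016-01-05", "2016-01-06"]),
        "steps": [9000, 12000, 7000],
    }
)
emails = pd.DataFrame(
    {
        "date": pd.to_datetime(["2016-01-04", "2016-01-04", "2016-01-06"]),
        "subject": ["status", "invoice", "agenda"],
    }
)

# Reduce the email log to one row per day before joining.
emails_per_day = emails.groupby("date").size().rename("emails_sent").reset_index()

# Left join keeps every health row; days with no emails get a count of 0.
joined = health.merge(emails_per_day, on="date", how="left")
joined["emails_sent"] = joined["emails_sent"].fillna(0).astype(int)
print(joined)
```

Unlike a view-level join, the result is a physical column that later aggregation and modelling steps can use directly.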
  22. AGGREGATE DATA TO WEEK LEVEL

    Only sensible option when looking at change in weight compared to prior period - day-level too short
    New feature in BDD 1.2.x is ability to aggregate (“rollup”) data
    Previous releases only supported row-level transforms
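
The weekly rollup on slide 22 uses BDD 1.2.x's aggregation feature. The same idea can be sketched in pandas with `resample`, shown here as an illustration on invented data: weekly average weight, a count of weight readings, and total steps. Column names are assumptions, and the weeks end on Sunday because that is the default for the `"W"` frequency.

```python
import numpy as np
import pandas as pd

# Two hypothetical weeks of daily readings, starting on a Monday.
daily = pd.DataFrame(
    {
        "weight_kg": [90.0, 89.8, np.nan, 89.6, 89.4, np.nan, 89.2,
                      89.0, 88.8, 88.6, 88.4, 88.2, 88.0, 87.8],
        "steps": [10000] * 7 + [12000] * 7,
    },
    index=pd.date_range("2016-01-04", periods=14, freq="D"),
)

# Roll up to one row per week: average weight, total steps.
weekly = daily.resample("W").agg({"weight_kg": "mean", "steps": "sum"})

# Count of non-null weight readings per week, to show how well populated each week is.
weekly["weight_readings"] = daily["weight_kg"].resample("W").count()
print(weekly)
```

Keeping the reading count next to the average makes it easy to spot weeks where the mean rests on only one or two weigh-ins.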
  23. USE BDD SHELL API TO IDENTIFY MAIN DATASET
  24. USE PYTHON PANDAS TO CALCULATE % CHG W/W
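
The code shown on slide 24 is not captured in this transcript. A week-on-week percentage change of the kind the slide title describes can be sketched with pandas `pct_change`; the weights below are invented and the column names are assumptions.

```python
import pandas as pd

# Hypothetical weekly average weights, one row per week ending Sunday.
weekly = pd.DataFrame(
    {"weight_kg": [90.0, 89.1, 89.1, 88.2]},
    index=pd.date_range("2016-01-10", periods=4, freq="W"),
)

# Percentage change versus the prior week; the first week has no prior period, so it is NaN.
weekly["weight_pct_chg"] = weekly["weight_kg"].pct_change() * 100
print(weekly)
```

Working with percentage change rather than absolute weight removes the long-term trend, which makes the weekly values more comparable as a modelling target.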
  25. IDENTIFY CORRELATIONS IN ATTRIBUTES
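
Slide 25's notebook code is not in the transcript. One way to rank attributes by their correlation with a target, sketched here with pandas `corr` (Pearson by default), is shown below. All numbers are synthetic and were constructed so that the hypothetical `emails_sent` column is strongly inversely related to the target; they do not reproduce the talk's data or results.

```python
import pandas as pd

# Synthetic weekly table; column names and values are illustrative only.
weekly = pd.DataFrame(
    {
        "weight_pct_chg": [-1.0, -0.5, 0.0, 0.5, 1.0, -1.5],
        "emails_sent": [200, 150, 100, 50, 0, 250],
        "steps": [8000, 9500, 7000, 9000, 8500, 10000],
        "sleep_hours": [7.0, 7.0, 7.5, 6.5, 7.0, 7.2],
    }
)

# Pearson correlation of every attribute with the target, excluding the target itself.
corr = weekly.corr()["weight_pct_chg"].drop("weight_pct_chg")

# Rank by absolute strength so strong negative correlations are not buried.
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Sorting on the absolute value matters here because an inverse relationship is just as informative as a positive one.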
  26. PERFORM LINEAR REGRESSION ON DATA
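
The transcript does not show which library the talk used for the regression on slide 26 (the workflow slide mentions Pandas, pySpark and Spark ML). As a stand-in, this sketch fits an ordinary least squares model with NumPy's `lstsq`. The target is generated from known, made-up coefficients so the fit can be checked by eye; nothing here reflects the talk's actual data or model.

```python
import numpy as np

# Synthetic weekly predictors (hypothetical names and values).
emails_sent = np.array([200, 150, 100, 50, 0, 250], dtype=float)
steps_thousands = np.array([8.0, 9.5, 7.0, 9.0, 8.5, 10.0])

# Synthetic target built from known coefficients: intercept 1.0, -0.01 per email, -0.05 per 1,000 steps.
weight_pct_chg = 1.0 - 0.01 * emails_sent - 0.05 * steps_thousands

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones_like(emails_sent), emails_sent, steps_thousands])

# Ordinary least squares: minimise ||X @ coef - y||^2.
coef, residuals, rank, singular_values = np.linalg.lstsq(X, weight_pct_chg, rcond=None)
intercept, b_emails, b_steps = coef
print(intercept, b_emails, b_steps)
```

The fitted coefficients can be read the same way the findings slide reads its model: the sign and size of each coefficient indicate the direction and strength of that attribute's relationship with the target.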
  27. INITIAL FINDINGS IN THE EXERCISE

    Most influential variable/attribute in my weight loss / gain is “# of emails sent”
    Inverse correlation - the more emails I sent, the more weight I lost - but why?
    In my case - unusual set of circumstances that led to late nights, burst of intense work
    So busy I skipped meals, didn’t snack, stress and overwork perhaps
    And then compensated once work was over by getting out on bike and exercising
    Correlation and most influential variable will probably change in time
    This is where the data, measuring it, and analysing it comes in
    Useful basis for experimenting
    And bring in the Smart Home data too
  28. ORACLE BIG DATA DISCOVERY : EXTENDING INTO MACHINE LEARNING - A QUANTIFIED SELF CASE STUDY

    ODTUG KSCOPE’16, CHICAGO
    MARK RITTMAN, ORACLE ACE DIRECTOR