Big Data for Oracle Developers - Towards Spark, Real-Time Analytics & Predictive Modeling

Most Oracle DBAs and developers associate Hadoop and Big Data technologies with cheap offload storage and processing for their data warehouse, but recent advances in the platform bring in-memory, real-time analytics capabilities to Hadoop along with machine learning, predictive modeling and automatic schema detection - all of which will eventually make their way to Oracle Big Data Appliance. Come and hear what these new technologies can bring to your Oracle Big Data and Data Warehousing platforms.

Mark Rittman

May 31, 2016

Transcript

  1. [email protected] www.rittmanmead.com @rittmanmead

    Big Data for Oracle Developers & DBAs - Towards Spark, Real-Time and Predictive Analytics
    Mark Rittman, CTO, Rittman Mead
    IlOUG Tech Day 2016, Day 2 Keynote, 31st May 2016 @ 9.15am
  2. About the Speaker

    • Mark Rittman, Co-Founder of Rittman Mead
      ‣ Oracle ACE Director, specialising in Oracle BI&DW
      ‣ 14 Years Experience with Oracle Technology
      ‣ Regular columnist for Oracle Magazine
    • Author of two Oracle Press Oracle BI books
      ‣ Oracle Business Intelligence Developers Guide
      ‣ Oracle Exalytics Revealed
      ‣ Writer for Rittman Mead Blog : http://www.rittmanmead.com/blog
    • Email : [email protected]
    • Twitter : @markrittman
  3. Why is Hadoop of Interest to Us?

    • Gives us the ability to store more data, at more detail, for longer
    • Provides a cost-effective way to analyse vast amounts of data
    • Hadoop & NoSQL technologies can give us “schema-on-read” capabilities
    • There are vast amounts of innovation in this area we can harness
    • And it’s very complementary to Oracle BI & DW
  4. Flexible Cheap Storage for Logs, Feeds + Social Data

    [Diagram labels: $50k Hadoop Node; Raw Data; Real-time Feeds, batch and API]
    • Sources: Voice + Chat Transcripts, Call Center Logs, Chat Logs, iBeacon Logs, Website Logs, CRM Data, Transactions, Social Feeds, Demographics
    • Uses: Customer 360 Apps, Predictive Models, SQL-on-Hadoop, Business analytics
  5. Incorporate Hadoop Data Reservoirs into DW Design

    [Diagram: information management reference architecture]
    • Data Sources
      ‣ Data Engines & Poly-structured sources: Content, Docs, Web & Social Media, SMS
      ‣ Structured Data Sources: Operational Data, COTS Data, Master & Ref. Data, Streaming & BAM
    • Data Ingestion and Information Interpretation
    • Raw Data Reservoir: immutable raw data reservoir. Raw data at rest is not interpreted
    • Foundation Data Layer: immutable modelled data. Business Process Neutral form. Abstracted from business process changes
    • Access & Performance Layer: past, current and future interpretation of enterprise data. Structured to support agile access & navigation
    • Discovery Lab Sandboxes: project based data stores to support specific discovery objectives
    • Rapid Development Sandboxes: project based data stored to facilitate rapid content / presentation delivery
    • Consumers: Virtualization & Query Federation, Enterprise Performance Management, Pre-built & Ad-hoc BI Assets, Information Services, Data Science
  6. Deployed on Oracle Big Data Appliance

    • Oracle Engineered system for big data processing and analysis
    • Start with Oracle Big Data Appliance Starter Rack - expand up to 18 nodes per rack
    • Cluster racks together for horizontal scale-out using enterprise-quality infrastructure

    Oracle Big Data Appliance Starter Rack + Expansion
    • Cloudera CDH + Oracle software
    • 18 High-spec Hadoop Nodes with InfiniBand switches for internal Hadoop traffic, optimised for network throughput
    • 1 Cisco Management Switch
    • Single place for support for H/W + S/W

    [Diagram labels: Enriched Customer Profile, Modeling, Scoring, InfiniBand]
  7. Hadoop Tenets : Simplified Distributed Processing

    • Hadoop, through MapReduce, breaks processing down into simple stages
      ‣ Map : select the columns and values you’re interested in, pass through as key/value pairs
      ‣ Reduce : aggregate the results
    • Most ETL jobs can be broken down into filtering, projecting and aggregating
    • Hadoop then automatically runs job on cluster
      ‣ Share-nothing small chunks of work
      ‣ Run the job on the node where the data is
      ‣ Handle faults etc
      ‣ Gather the results back in

    [Diagram: three Mappers (Filter, Project) feed two Reducers (Aggregate). Output: one HDFS file per reducer, in a directory]
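
The map, shuffle and reduce stages described on the slide above can be sketched in plain Python. This is only an illustration of the pattern, not Hadoop's API: the sample records, the `mapper`/`reducer` names and the `run_job` helper are all hypothetical, and the "splits" merely stand in for HDFS blocks held on different nodes.

```python
from collections import defaultdict

# Hypothetical sample records: (page, bytes_served, http_status). Not from the deck.
RECORDS = [
    ("/blog", 1200, 200),
    ("/home", 300, 200),
    ("/blog", 800, 200),
    ("/blog", 50, 404),
    ("/home", 700, 200),
]

def mapper(record):
    """Filter + project: keep successful requests, emit (key, value) pairs."""
    page, size, status = record
    if status == 200:
        yield page, size

def reducer(key, values):
    """Aggregate every value that shares a key."""
    return key, sum(values)

def run_job(splits, mapper, reducer):
    # Map phase: each split is processed independently (share-nothing).
    shuffled = defaultdict(list)
    for split in splits:
        for record in split:
            for key, value in mapper(record):
                shuffled[key].append(value)  # shuffle: group values by key
    # Reduce phase: one reducer call per key.
    return dict(reducer(k, v) for k, v in sorted(shuffled.items()))

# Two "input splits", standing in for blocks stored on different nodes.
splits = [RECORDS[:3], RECORDS[3:]]
totals = run_job(splits, mapper, reducer)
print(totals)
```

Because the mapper only looks at one record at a time and the reducer only at one key at a time, the framework is free to run either on whichever node holds the data.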
  8. Hive as the Hadoop SQL Access Layer

    • MapReduce jobs are typically written in Java, but Hive can make this simpler
    • Hive is a query environment over Hadoop/MapReduce to support SQL-like queries
    • Hive server accepts HiveQL queries via HiveODBC or HiveJDBC, automatically creates MapReduce jobs against data previously loaded into the Hive HDFS tables
    • Approach used by ODI and OBIEE to gain access to Hadoop data
    • Allows Hadoop data to be accessed just like any other data source (sort of...)
  9. Hive Provides a SQL Interface for BI + ETL Tools

    • Data integration tools such as Oracle Data Integrator can load and process Hadoop data
    • BI tools such as Oracle Business Intelligence 12c can report on Hadoop data
    • Generally use MapReduce and Hive to access data
      ‣ ODBC and JDBC access to Hive tabular data
      ‣ Allows Hadoop unstructured/semi-structured data on HDFS to be accessed like RDBMS

    [Diagram: Access direct Hive or extract using ODI12c for structured OBIEE dashboard analysis. Example questions: What pages are people visiting? Who is referring to us on Twitter? What content has the most reach?]
  10. Common Developer Understanding of Hadoop Today

    • Most Oracle DBAs and developers know about Hadoop, but assume…
      ‣ Hadoop is just for batch (because of the MapReduce JVM spin-up issue)
      ‣ Hadoop is just for large datasets, not ad-hoc work or micro batches
      ‣ Hadoop will always be slow because it stages everything to disk
      ‣ All Hadoop can do is Map (select, filter) and Reduce (aggregate)
      ‣ Hadoop == MapReduce
  11. Hadoop 1.0 and MapReduce

    • MapReduce’s great innovation was to break processing down into distributed jobs
    • Jobs that have no functional dependency on each other, only upstream tasks
    • Provides a framework that is infinitely scalable and very fault tolerant
    • Hadoop handled job scheduling and resource management
      ‣ All MapReduce code had to do was provide the “map” and “reduce” functions
      ‣ Automatic distributed processing
      ‣ Slow but extremely powerful
  12. MapReduce - Scales By Writing Intermediate Results to Disk

    • A typical Hive or Pig script compiles down into multiple MapReduce jobs
    • Each job stages its intermediate results to disk
    • Safe, but slow - write to disk, spin-up separate JVMs for each job

        SELECT LOWER(hashtags.text), COUNT(*) AS total_count
        FROM (
          SELECT * FROM tweets
          WHERE regexp_extract(created_at,"(2015)*",1) = "2015"
        ) tweets
        LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
        GROUP BY LOWER(hashtags.text)
        ORDER BY total_count DESC
        LIMIT 15

        MapReduce Jobs Launched:
        Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 5.34 sec HDFS Read: 10952994 HDFS Write: 5239 SUCCESS
        Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 2.1 sec HDFS Read: 9983 HDFS Write: 164 SUCCESS
        Total MapReduce CPU Time Spent: 7 seconds 440 msec
        OK
  13. MapReduce 2 and YARN

    • MapReduce 2 (MR2) splits the functionality of the JobTracker by separating resource management and job scheduling/monitoring
    • Introduces YARN (Yet Another Resource Negotiator)
    • Permits processing frameworks other than MR
      ‣ For example, Apache Spark
    • Maintains backwards compatibility with MR1
    • Introduced with CDH5+

    [Diagram: Clients submit work to the Resource Manager, which coordinates several Node Managers]
  14. Apache Tez

    • Runs on top of YARN, provides a faster execution engine than MapReduce for Hive, Pig etc
    • Models processing as an entire data flow graph (DAG), rather than separate job steps
      ‣ DAG (Directed Acyclic Graph) is a new programming style for distributed systems
      ‣ Dataflow steps pass data between them as streams, rather than writing/reading from disk
    • Supports in-memory computation, enables Hive on Tez (Stinger) and Pig on Tez
    • Favoured In-memory / Hive v2 route by Hortonworks

    [Diagram: Input Data flows through a Tez DAG of Map() and Reduce() steps to Output Data]
  15. Tez Advantage - Drop-In Replacement for MR with Hive, Pig

    • set hive.execution.engine=mr : 4m 17s
    • set hive.execution.engine=tez : 2m 25s
  16. Cloudera Impala - Fast, MPP-style Access to Hadoop Data

    • Cloudera’s answer to Hive query response time issues
    • MPP SQL query engine running on Hadoop, bypasses MapReduce for direct data access
    • Mostly in-memory, but spills to disk if required
    • Uses Hive metastore to access Hive table metadata
    • Similar SQL dialect to Hive - not as rich though and no support for Hive SerDes, storage handlers etc
  17. Apache Parquet - Column-Orientated Storage for Analytics

    • Beginners usually store data in HDFS using text file formats (CSV) but these have limitations
    • Apache AVRO often used for general-purpose processing
      ‣ Splittability, schema evolution, in-built metadata, support for block compression
    • Parquet now commonly used with Impala due to column-orientated storage
      ‣ Mirrors work in RDBMS world around column-store
      ‣ Only return (project) the columns you require across a wide table
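
The benefit of column-orientated storage for projection can be illustrated with a small plain-Python sketch. It does not use Parquet or any Parquet library; the table, the column names and the "values touched" counters are assumptions made purely to show why reading one column from a wide table is cheaper when data is laid out by column.

```python
# Hypothetical wide table stored row-wise: every row carries every column.
rows = [
    {"id": 1, "country": "UK", "revenue": 10.0, "notes": "a" * 50},
    {"id": 2, "country": "IL", "revenue": 25.5, "notes": "b" * 50},
    {"id": 3, "country": "UK", "revenue": 4.5, "notes": "c" * 50},
]

def to_columns(rows):
    """Pivot row-oriented records into one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}

columns = to_columns(rows)

def values_touched_row_store(rows):
    # A row store reads whole rows, then discards the fields it does not need.
    return sum(len(row) for row in rows)

def values_touched_column_store(columns, wanted):
    # A column store reads only the columns the query projects.
    return sum(len(columns[name]) for name in wanted)

wanted = ["revenue"]
total_revenue = sum(columns["revenue"])
print(total_revenue)
print(values_touched_row_store(rows), values_touched_column_store(columns, wanted))
```

In this toy case the row layout touches every field of every row, while the column layout touches only the `revenue` values; the gap widens as the table gets wider.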
  18. Cloudera Kudu - Combining Best of HBase and Column-Store

    • But Parquet (and HDFS) have significant limitations for real-time analytics applications
      ‣ Append-only orientation, focus on column-store makes streaming ingestion harder
    • Cloudera Kudu aims to combine best of HDFS + HBase
      ‣ Real-time analytics-optimised
      ‣ Supports updates to data
      ‣ Fast ingestion of data
      ‣ Accessed using SQL-style tables and get/put/update/delete API
  19. Oracle Big Data SQL

    • Part of Oracle Big Data 4.0 (BDA-only)
      ‣ Also requires Oracle Database 12c, Oracle Exadata Database Machine
    • Extends Oracle Data Dictionary to cover Hive
    • Extends Oracle SQL and SmartScan to Hadoop
    • Extends Oracle Security Model over Hadoop
      ‣ Fine-grained access control
      ‣ Data redaction, data masking
      ‣ Uses fast C-based readers where possible (vs. Hive MapReduce generation)
      ‣ Map Hadoop parallelism to Oracle PQ
      ‣ Big Data SQL engine works on top of YARN
      ‣ Like Spark, Tez, MR2

    [Diagram labels: SQL Queries, Exadata Database Server, Exadata Storage Servers, Hadoop Cluster, Oracle Big Data SQL, SmartScan]
  20. Introducing Apache Drill - “We Don’t Need No Roads”

    • Apache Drill is another SQL-on-Hadoop project that focuses on schema-free data discovery
    • Inspired by Google Dremel, innovation is querying raw data with schema optional
    • Automatically infers and detects schema from semi-structured datasets and NoSQL DBs
    • Join across different silos of data e.g. JSON records, Hive tables and HBase database
    • Aimed at different use-cases than Hive - low-latency queries, discovery (think Endeca vs OBIEE)
  21. Self-Describing Data - Parquet, AVRO, JSON etc

    • Most modern datasource formats embed their schema in the data (“schema-on-read”)
    • Apache Drill makes these as easy to join to traditional datasets as “point me at the data”
    • Cuts out unnecessary work in defining Hive schemas for data that’s self-describing
    • Supports joining across files, databases, NoSQL etc
  22. Apache Drill and Text Files

    • Files can exist either on the local filesystem, or on HDFS
    • Connection to directory or file defined in storage configuration
    • Can work with CSV, TXT, TSV etc
    • First row of file can provide schema (column names)

        SELECT * FROM dfs.`/tmp/csv_with_header.csv2`;
        +-------+------+------+------+
        | name  | num1 | num2 | num3 |
        +-------+------+------+------+
        | hello | 1    | 2    | 3    |
        | hello | 1    | 2    | 3    |
        | hello | 1    | 2    | 3    |
        | hello | 1    | 2    | 3    |
        | hello | 1    | 2    | 3    |
        | hello | 1    | 2    | 3    |
        | hello | 1    | 2    | 3    |
        +-------+------+------+------+
        7 rows selected (0.12 seconds)

        SELECT * FROM dfs.`/tmp/csv_no_header.csv`;
        +------------------------+
        | columns                |
        +------------------------+
        | ["hello","1","2","3"]  |
        | ["hello","1","2","3"]  |
        | ["hello","1","2","3"]  |
        | ["hello","1","2","3"]  |
        | ["hello","1","2","3"]  |
        | ["hello","1","2","3"]  |
        | ["hello","1","2","3"]  |
        +------------------------+
        7 rows selected (0.112 seconds)
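
The difference between the two result sets above (named columns when a header row is present, a single `columns` array otherwise) can be mimicked with Python's standard `csv` module. This is a sketch of the idea only, not of how Drill is implemented; the helper names and the inline sample text are assumptions.

```python
import csv
import io

def read_with_header(text):
    """First row supplies the column names; each later row becomes a dict."""
    return list(csv.DictReader(io.StringIO(text)))

def read_without_header(text):
    """No header: expose each row as one array-valued field named `columns`."""
    return [{"columns": row} for row in csv.reader(io.StringIO(text))]

# Inline stand-ins for the two files queried on the slide.
with_header = "name,num1,num2,num3\nhello,1,2,3\nhello,1,2,3\n"
no_header = "hello,1,2,3\nhello,1,2,3\n"

print(read_with_header(with_header)[0])
print(read_without_header(no_header)[0])
```

Either way the values arrive as strings, so any numeric typing still has to be applied by the query.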
  23. Apache Drill and JSON Documents

    • JSON (JavaScript Object Notation) documents are often used for data interchange
      ‣ Exports from Twitter and other consumer services
      ‣ Web service responses and other B2B interfaces
    • A more lightweight form of XML that is “self-describing”
    • Handles evolving schemas, and optional attributes
    • Drill treats each document as a row, and has features to
      ‣ Flatten nested data (extract elements from arrays)
      ‣ Generate key/value pairs for loosely structured data

        use dfs.iot;
        show files;
        select in_reply_to_user_id, text from `all_tweets.json` limit 5;
        +---------------------+--------------------------------------------+
        | in_reply_to_user_id | text                                       |
        +---------------------+--------------------------------------------+
        | null                | BI Forum 2013 in Brighton has now sold-out |
        | null                | "Football has become a numbers game        |
        | null                | Just bought Lyndsay Wise’s Book            |
        | null                | An Oracle BI "Blast from the Past"         |
        | 14716125            | Dilbert on Agile Programming               |
        +---------------------+--------------------------------------------+
        5 rows selected (0.229 seconds)

        select name, flatten(fillings) as f
        from dfs.users.`/donuts.json`
        where f.cal < 300;
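
What "flatten nested data" means for a query like the one above can be shown in a few lines of Python. The documents below are hypothetical (only loosely modelled on the donuts example, not taken from it), and the `flatten` helper is an illustration of the concept rather than Drill's implementation.

```python
import json

# Hypothetical documents, one JSON object per line. Not real data.
DOCS = """
{"name": "Cake", "fillings": [{"name": "sugar", "cal": 500}, {"name": "plain", "cal": 100}]}
{"name": "Raised", "fillings": [{"name": "jam", "cal": 250}]}
{"name": "Bar", "fillings": []}
"""

def flatten(docs, array_field):
    """Emit one output row per array element, repeating the parent fields."""
    for doc in docs:
        for element in doc.get(array_field, []):
            row = {k: v for k, v in doc.items() if k != array_field}
            row[array_field] = element
            yield row

docs = [json.loads(line) for line in DOCS.strip().splitlines()]

# Equivalent in spirit to: select name, flatten(fillings) as f ... where f.cal < 300
rows = [
    (r["name"], r["fillings"]["name"])
    for r in flatten(docs, "fillings")
    if r["fillings"]["cal"] < 300
]
print(rows)
```

In this sketch a document with an empty array contributes no rows, and a document with two array elements contributes two candidate rows before the filter is applied.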
  24. Apache Drill and Hive, HBase, Parquet Sources etc

    • Drill can connect to Hive to make use of metastore (incl. multiple Hive metastores)
    • NoSQL databases (HBase etc)
    • Parquet files (native storage format - columnar + self describing)

        USE hbase;
        SELECT * FROM students;
        +-------------+---------------------+------------------------------------------------------+
        | row_key     | account             | address                                              |
        +-------------+---------------------+------------------------------------------------------+
        | [B@e6d9eb7  | {"name":"QWxpY2U="} | {"state":"Q0E=","street":"MTIzIEJhbGxtZXIgQXY="}     |
        | [B@2823a2b4 | {"name":"Qm9i"}     | {"state":"Q0E=","street":"MSBJbmZpbml0ZSBMb29w"}     |
        | [B@3b8eec02 | {"name":"RnJhbms="} | {"state":"Q0E=","street":"NDM1IFdhbGtlciBDdA=="}     |
        | [B@242895da | {"name":"TWFyeQ=="} | {"state":"Q0E=","street":"NTYgU291dGhlcm4gUGt3eQ=="} |
        +-------------+---------------------+------------------------------------------------------+

        SELECT firstname,lastname FROM hiveremote.`customers` limit 10;
        +------------+------------+
        | firstname  | lastname   |
        +------------+------------+
        | Essie      | Vaill      |
        | Cruz       | Roudabush  |
        | Billie     | Tinnes     |
        | Zackary    | Mockus     |
        | Rosemarie  | Fifield    |
        | Bernard    | Laboy      |
        | Marianne   | Earman     |
        +------------+------------+

        SELECT * FROM dfs.`iot_demo/geodata/region.parquet`;
        +--------------+--------------+-----------------------+
        | R_REGIONKEY  | R_NAME       | R_COMMENT             |
        +--------------+--------------+-----------------------+
        | 0            | AFRICA       | lar deposits. blithe  |
        | 1            | AMERICA      | hs use ironic, even   |
        | 2            | ASIA         | ges. thinly even pin  |
        | 3            | EUROPE       | ly final courts cajo  |
        | 4            | MIDDLE EAST  | uickly special accou  |
        +--------------+--------------+-----------------------+
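
The HBase values in the `students` output above are raw bytes, displayed base64-encoded. As a hedged illustration of what they contain, the sketch below decodes two of those rows with Python's standard `base64` module; the `decode_row` helper is hypothetical. Within Drill itself the usual route is, as far as I know, to wrap the columns in `CONVERT_FROM(..., 'UTF8')`.

```python
import base64

# Two rows copied from the `students` output on the slide.
rows = [
    {"account": {"name": "QWxpY2U="},
     "address": {"state": "Q0E=", "street": "MTIzIEJhbGxtZXIgQXY="}},
    {"account": {"name": "Qm9i"},
     "address": {"state": "Q0E=", "street": "MSBJbmZpbml0ZSBMb29w"}},
]

def decode(value):
    """Base64 text -> bytes -> UTF-8 string."""
    return base64.b64decode(value).decode("utf-8")

def decode_row(row):
    """Decode every column in every column family of one row."""
    return {
        family: {column: decode(value) for column, value in columns.items()}
        for family, columns in row.items()
    }

for row in rows:
    print(decode_row(row))
```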
  25. Apache Drill vs. Apache Hive

    • Drill developed for real-time, ad-hoc data exploration with schema discovery on-the-fly
    • Individual analysts exploring new datasets, leveraging corporate metadata/data to help
    • Hive is more about large-scale, centrally curated set-based big data access
    • Drill models data conceptually as JSON, vs. Hive’s tabular approach
    • Drill introspects schema from whatever it connects to, vs. formal modeling in Hive

    [Diagram: query types on a spectrum from Apache Drill to Apache Hive]
    • Interactive Queries (Data Discovery, Tableau/VA): 100ms - 3mins
    • Reporting Queries (Canned Reports, OBIEE): 3mins - 20mins
    • ETL & Batch Queries (ODI, Scripting, Informatica): 20mins - hours
  26. Apache Spark

    • Another DAG execution engine running on YARN
    • More mature than Tez, with richer API and more vendor support
    • Uses concept of an RDD (Resilient Distributed Dataset)
      ‣ RDDs are like tables or Pig relations, but can be cached in-memory
      ‣ Great for in-memory transformations, or iterative/cyclic processes
    • Spark jobs consist of a DAG of tasks operating on RDDs
    • Access through Scala, Python or Java APIs
    • Related projects include
      ‣ Spark SQL
      ‣ Spark Streaming
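
The two RDD ideas on the slide above, lazily chained transformations and in-memory caching, can be imitated with a toy class. `MiniRDD` is entirely hypothetical: it is single-process, has no partitions or fault tolerance, and only borrows method names (`map`, `filter`, `cache`, `count`) to make the behaviour easy to compare.

```python
class MiniRDD:
    """Toy stand-in for an RDD: a lazy chain of transformations over a source."""

    def __init__(self, compute):
        self._compute = compute       # function that produces this dataset's records
        self._cache_enabled = False
        self._cached = None
        self.evaluations = 0          # how many times the records were recomputed

    @classmethod
    def from_list(cls, data):
        return cls(lambda: list(data))

    def map(self, fn):
        # Nothing runs here: we only record how to derive the new dataset.
        return MiniRDD(lambda: [fn(x) for x in self.collect()])

    def filter(self, pred):
        return MiniRDD(lambda: [x for x in self.collect() if pred(x)])

    def cache(self):
        self._cache_enabled = True
        return self

    def collect(self):
        # An "action": walks back up the chain and materialises the records.
        if self._cache_enabled and self._cached is not None:
            return self._cached
        self.evaluations += 1
        result = self._compute()
        if self._cache_enabled:
            self._cached = result
        return result

    def count(self):
        return len(self.collect())


lines = MiniRDD.from_list(["GET /biapps11g/a", "GET /blog", "GET /biapps11g/b"])
parsed = lines.map(str.lower).cache()              # still nothing computed
hits = parsed.filter(lambda l: "/biapps11g/" in l)
posts = parsed.filter(lambda l: "/blog" in l)
print(hits.count(), posts.count(), parsed.evaluations)
```

Both counts reuse the cached `parsed` records, so the upstream `map` step is evaluated once rather than once per downstream action.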
  27. Rich Developer Support + Wide Developer Ecosystem

    • Native support for multiple languages with identical APIs
      ‣ Python - prototyping, data wrangling
      ‣ Scala - functional programming features
      ‣ Java - lower-level, application integration
    • Use of closures, iterations, and other common language constructs to minimize code
    • Integrated support for distributed + functional programming
    • Unified API for batch and streaming

        scala> val logfile = sc.textFile("logs/access_log")
        14/05/12 21:18:59 INFO MemoryStore: ensureFreeSpace(77353) called with curMem=234759, maxMem=309225062
        14/05/12 21:18:59 INFO MemoryStore: Block broadcast_2 stored as values to memory (estimated size 75.5 KB, free 294.6 MB)
        logfile: org.apache.spark.rdd.RDD[String] = MappedRDD[31] at textFile at <console>:15

        scala> logfile.count()
        14/05/12 21:19:06 INFO FileInputFormat: Total input paths to process : 1
        14/05/12 21:19:06 INFO SparkContext: Starting job: count at <console>:1
        ...
        14/05/12 21:19:06 INFO SparkContext: Job finished: count at <console>:18, took 0.192536694 s
        res7: Long = 154563

        scala> val logfile = sc.textFile("logs/access_log").cache
        scala> val biapps11g = logfile.filter(line => line.contains("/biapps11g/"))
        biapps11g: org.apache.spark.rdd.RDD[String] = FilteredRDD[34] at filter at <console>:17

        scala> biapps11g.count()
        ...
        14/05/12 21:28:28 INFO SparkContext: Job finished: count at <console>:20, took 0.387960876 s
        res9: Long = 403
  28. Spark SQL - Adding SQL Processing to Apache Spark

    • Spark SQL, and Data Frames, allow RDDs in Spark to be processed using SQL queries
    • Bring in and federate additional data from JDBC sources
    • Load, read and save data in Hive, Parquet and other structured tabular formats

        val accessLogsFilteredDF = accessLogs
          .filter( r => ! r.agent.matches(".*(spider|robot|bot|slurp).*"))
          .filter( r => ! r.endpoint.matches(".*(wp-content|wp-admin).*")).toDF()
          .registerTempTable("accessLogsFiltered")

        // Triple-quoted string so the SQL can span several lines
        val topTenPostsLast24Hour = sqlContext.sql("""SELECT p.POST_TITLE, p.POST_AUTHOR, COUNT(*) as total
          FROM accessLogsFiltered a
          JOIN posts p ON a.endpoint = p.POST_SLUG
          GROUP BY p.POST_TITLE, p.POST_AUTHOR
          ORDER BY total DESC LIMIT 10 """)

        // Persist top ten table for this window to HDFS as parquet file
        topTenPostsLast24Hour.save("/user/oracle/rm_logs_batch_output/topTenPostsLast24Hour.parquet",
          "parquet", SaveMode.Overwrite)
  29. Hadoop Security Initially Was a Mess

    • Clusters by default are unsecured (vulnerable to account spoofing) & need Kerberos enabled
    • Data access controlled by POSIX-style permissions on HDFS files
    • Hive and Impala can use Apache Sentry RBAC
      ‣ Result is data duplication and complexity
      ‣ No consistent API or abstracted security model

    [Diagram: example HDFS directories]
        /user/mrittman/scratchpad
        /user/ryeardley/scratchpad
        /user/mpatel/scratchpad
        /data/rm_website_analysis/logfiles/incoming
        /data/rm_website_analysis/logfiles/archive
        /data/rm_website_analysis/tweets/incoming
        /data/rm_website_analysis/tweets/archive
  30. Oracle Big Data SQL : Extend Oracle Security to Hadoop

    • Use standard Oracle Security over Hadoop & NoSQL
      ‣ Grant & Revoke Privileges
      ‣ Redact Data
      ‣ Apply Virtual Private Database
      ‣ Provides Fine-grain Access Control
    • Great solution to extend existing Oracle security model over Hadoop datasets

    [Diagram labels: Redacted data subset, SQL, JSON, Customer data in Oracle DB]

        DBMS_REDACT.ADD_POLICY(
          object_schema => 'txadp_hive_01',
          object_name   => 'customer_address_ext',
          column_name   => 'ca_street_name',
          policy_name   => 'customer_address_redaction',
          function_type => DBMS_REDACT.RANDOM,
          expression    => 'SYS_CONTEXT(''SYS_SESSION_ROLES'', ''REDACTION_TESTER'')=''TRUE'''
        );
  31. Cloudera RecordService

    • Provides a higher level, logical abstraction for data (i.e. Tables or Views)
      ‣ Can be used with Spark & Spark SQL, with Predicate pushdown, projection
    • Returns schemed objects (instead of paths and bytes) in similar way to HCatalog
    • Unified data access path allows platform-wide performance improvements
    • Secure service that does not execute arbitrary user code
      ‣ Central location for all authorization checks using Sentry metadata.
  32. Spark MLlib : Adding Machine Learning Capabilities to Spark

    • Part of Spark, extends Scala, Java & Python API
    • Integrated workflow including ML pipelines
    • Currently supports following algorithms:
      ‣ Binary classification
      ‣ Regression
      ‣ Clustering
      ‣ Collaborative filtering
      ‣ Dimensionality Reduction

        // Compute raw scores on the test set.
        val scoreAndLabels = test.map { point =>
          val score = model.predict(point.features)
          (score, point.label)
        }
        // Get evaluation metrics.
        val metrics = new BinaryClassificationMetrics(scoreAndLabels)
        val auROC = metrics.areaUnderROC()
        println("Area under ROC = " + auROC)
        // Save and load model
        model.save(sc, "myModelPath")
        val sameModel = SVMModel.load(sc, "myModelPath")
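
To make the "area under ROC" metric in the snippet above less opaque, here is a small plain-Python sketch that computes it from `(score, label)` pairs. It uses the pairwise-ranking definition of AUC (the fraction of positive/negative pairs where the positive scores higher), which is not how MLlib computes it internally; the function name and the sample pairs are assumptions.

```python
def area_under_roc(score_and_labels):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win. This is an O(P*N) pairwise version, which is
    fine for small samples but not for cluster-sized data.
    """
    positives = [s for s, label in score_and_labels if label == 1]
    negatives = [s for s, label in score_and_labels if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical (score, label) pairs, shaped like scoreAndLabels on the slide.
score_and_labels = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.2, 0), (0.1, 0)]
auc = area_under_roc(score_and_labels)
print("Area under ROC = %.3f" % auc)
```

A value of 1.0 means every positive example was scored above every negative one, while 0.5 is what random scoring would give on average.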
  33. Example Usage : Oracle Big Data Preparation Cloud Service

    • Data enrichment tool aimed at domain experts, not programmers
    • Uses machine-learning to automate data classification + profiling steps
    • Automatically highlight sensitive data, and offer to redact or obfuscate
    • Dramatically reduce the time required to onboard new data sources
    • Hosted in Oracle Cloud for zero-install
      ‣ File upload and download from browser
      ‣ Automate for production data loads

    [Diagram labels]
    • Raw Data: data stored in the original format (usually files) such as SS7, ASN.1, JSON etc.
    • Mapped Data: data sets produced by mapping and transforming raw data
    • Voice + Chat Transcripts
  34. Use of Machine Learning to Identify Data Patterns

    • Automatically profile, parse and classify incoming datasets using Spark MLlib Word2Vec
    • Spot and obfuscate sensitive data automatically, automatically suggest column names
  35. Summary

    • Hadoop is evolving
      ‣ Hadoop 2.0 breaks the dependency on MapReduce
      ‣ Spark, Tez etc allow us to create execution plans that run in-memory, faster than before
      ‣ New streaming models allow us to process data via sockets, micro batches or continuously
    • And Oracle developers can make use of these new capabilities
      ‣ Oracle Big Data SQL can access Hadoop data loaded in real-time
      ‣ OBIEE, particularly in 11.1.1.9, can access Impala
      ‣ ODI is likely to support Hive on Tez and Hive on Spark shortly, and will have support for Spark in the future
  36. Big Data for Oracle Devs - Towards Spark, Real-Time and Predictive Analytics

    Mark Rittman, CTO, Rittman Mead
    Riga Dev Day 2016, Riga, March 2016