Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Clickstream Analysis With Spark - Understanding...

Clickstream Analysis With Spark - Understanding Visitors In Realtime

Users leave thousands of traces per second on a successful ecommerce site. It’s very pragmatic to analyse and react on this trace event stream in realtime. This is called clickstream analysis. In the talk I’ll present a software architecture based on Apache Spark which is able to process thousands of clickstream events per second. A product based on this architecture is in production since mid 2015 and is still performing well. The building blocks of the architecture beside Spark are Kafka to handle the inbound event stream, Spark Streaming for initial stream processing and Parquet as serialization format. I argue why we’ve chosen these technologies and what experiences we had in developing, launching and operating the product.

Josef Adersberger

February 18, 2016
Tweet

More Decks by Josef Adersberger

Other Decks in Technology

Transcript

  1. User Journey Analysis C V VT VT VT C X

    C V V V V V V V V C V V C V V V VT VT V V V VT C V X Event stream: User journeys: Web / Ad tracking KPIs: § Unique users § Conversions § Ad costs / conversion value § … V X VT C Click View View Time Conversion
  2. „Larry & Friends“ Architecture Collector Aggregation SQL DB Runs not

    well for more than 1 TB data in terms of ingestion speed, query time and optimization efforts
  3. „Hadoop & Friends“ Architecture Collector Batch Processor [Hadoop] Collector Event

    Data Lake Batch Processor Analytics DB JSON Stream Aggregation takes too long Cumbersome programming model (can be solved with pig, cascading et al.) Not interactive enough
  4. κ-Architecture Collector Stream Processor Analytics DB Persistence JSON Stream Cumbersome

    programming model Over-engineered: We only need 15min real-time ;-) Stateful aggregations (unique x, conversions) require a separate DB with high throughput and fast aggregations & lookups.
  5. λ-Architecture Collector Event Processor Event Data Lake Batch Processor Analytics

    DB Speed Layer Batch Layer JSON Stream Cumbersome programming model Complex architecture Redundant logic
  6. Functional Architecture Strange Events Ingestion Raw Event Stream Collection Events

    Processing Analytics Warehouse Fact Entries Atomic Event Frames Data Lake Master Data Integration § Buffers load peeks § Ensures message delivery (fire & forget for client) § Create user journeys and unique user sets § Enrich dimensions § Aggregate events to KPIs § Ability to replay for schema evolution § The representation of truth § Multidimensional data model § Interactive queries for actions in realtime and data exploration § Eternal memory for all events (even strange ones) § One schema per event type. Time partitioned. class Analytics Model ´ factª WebFact ´ dimensionª Zeit ´ dimensionª Kampagne Jahr Quartal Monat Woche Tag Stunde Minute Kunde + Land: String Partner ´ dimensionª Tracking Tracking Group SensorTag + Typ: SensorTagType Platzierung + Format: ImageSize + Kostenmodell: KostenmodellArt Werbemittel + AdGroup: String + Format: ImageSize + Grˆ fle: KiloBytes + LandingPage: URL + Motif: URL Kampagne ´ dimensionª Client Kategorie Dev ice + Bezeichner: String + Hersteller: String + Typ: String Brow ser + Typ: String + Version: int ´ dimensionª Ausspielort Land Region Stadt ´ dimensionª Kanal Kanal ´ dimensionª Vermarktung ´ enumerationª SensorTagType ORDER_TAG MASTER_TAG CUSTOM_TAG Betriebssystem + Typ: String + Version: Version ? Dimension: Unabh‰ ng ig es Pr‰ dikat auf Metriken bei der Analyse ("kann isoliert dar¸ ber nachdenken / isoliert dazu Analysen fahren") ? H ierarchie: Sub-Pr‰ dikat auf Metriken. Erzeug t mehr als eine (zueinander diskunkte) Teilmeng en der Metriken. Entspricht den g ‰ ng ig en Drill-Down-Pfaden in den Reports bzw. den Batch-Ag g reg ate-Up-Pfaden in der Ag g reg ationslog ik. Semantische Unterstrukturen: "ist Teil von & kann nicht existieren ohne". ? Asssoziation: Nicht verwendet. Separates Stammdatenmodell. ? Attribut: Ermˆ g licht eine weitere (querschneidende) Einschr‰ nkung der Metrikmeng e erg ‰ nzend zu den Hierarchien. Domain Website Tracking Site Vermarkter Auslieferungs- Domain Referral ´ enumerati... KostenmodellArt CPC CPM CPO CPA ´ abstractª DimensionValue + id: int + name: String + sourceId: String WebsiteFact + Bounces: int + Verweildauer: float + Visits: int BasicAdFact + Clicks: int + Sichtbare Views: int + Validierte Clicks: int + View (angefragt): int + View (ausgeliefert): int + View (gemessen): int ´ dimensionª Produkt Shop Produkt + Produktkategorie: String ´ dimensionª Zeitfenster Letzte X Tage ´ dimensionª User User Segment ´ dimensionª Order OrderStatus + Status: OrderStatus ´ enumerationª OrderStatus IN_BEARBEITUNG ERFOLGREICH (AKTIVIERT) ABGELEHNT NICHT_IN_BEARBEITUNG UniquesFact + Unique Clicks: int + Unique Users: int + Unique Views: int AdCostFact + CPC: int + Kosten: float Conv ersionFact + PC: int + PR: int + PV: int + Umsatz PC: float + Umsatz PR: float + Umsatz PV: float AdVisibilityFact + Sichtbarkeitsdauer: float Activ atedOrderFact + Orders: int + Umsatz: float TrackingFact + Orders: int + Page Impressions: int + Umsatz: float X = {7, 14, 28, 30} § Fault tolerant message handling § Event handling: Apply schema, time-partitioning, De-dup, sanity checks, pre-aggregation, filtering, fraud detection § Tolerates delayed events § High throughput, moderate latency (~ 1min)
  7. Series Connection of Streaming and Batching - all based on

    Spark. Ingestion Raw Event Stream Collection Event Data Lake Processing Analytics Warehouse Fact Entries SQL Interface Atomic Event Frames § Cool programming model § Uniform dev&ops § Simple solution § High compression ratio due to column-oriented storage § High scan speed § Cool programming model § Uniform dev&ops § High performance § Interface to R out-of-the-box § Useful libs: MLlib, GraphX, NLP, … § Good connectivity (JDBC, ODBC, …) § Interactive queries § Uniform ops § Can easily be replaced due to Hive Metastore § Obvious choice for cloud-scale messaging § Way the best throughput and scalability of all evaluated alternatives
  8. Technology Mapping https://github.com/qaware/big-data-landscape User Interface Data Lake Data Warehouse Ingestion

    Processing Data Science Interactive Analysis Reporting & Dashboards Data Sources Analytics Micro Analytics Services instead of reporting servers. Charting Libraries: Dashboards: Analytics Frontends Algorithm Libraries Structured Data Lake: The eternal memory. Efficient data serialization formats: § Integated compression § Column-oriented storage § Predicate pushdown Distributed Filesystem or NoSQL DB Data Workflows ETL Jobs Massive Parallelization Pig Open Studio Data Logicstics Stream Processing NewSQL: SQL meets NoSQL. Polyglott Persistence Index Machines: Fast aggregation and search. In-Memory Databases: Fast access. Time Series Databases Atlas
  9. Polyglott Analytics Data Lake Analytics Warehouse SQL lane R lane

    Timeseries lane Reporting Data Exploration Data Science
  10. No Retention Paranoia Data Lake Analytics Warehouse § Eternal memory

    § Close to raw events § Allows replays and refills into warehouse Aggressive forgetting with clearly defined retention policy per aggregation level like: § 15min:30d § 1h:4m § … Events Strange Events
  11. Continuous Tuning Ingestion Raw Event Stream Collection Event Data Lake

    Processing Analytics Warehouse Fact Entries SQL Interface Atomic Event Frames Load Generator Throughput & latency probes
  12. In Numbers Overall dev effort until the first release: 250

    person days Dimensions: 10 KPIs: 26 Integrated 3rd party systems: 7 Inbound data volume per day: 80GB New data in DWH per day: 2GB Total price of cheapest cluster which is able to handle production load:
  13. Bonus Topic: Roadmap Spark Kafka HDFS IaaS Simplify Ops with

    Mesos Faster aggregation & easier updates with Spark-on-Solr http://qaware.blogspot.de/2015/06/solr-with-sparks-or-how-to-submit-spark.html
  14. Bonus Topic: Smart Aggregation Ingestion Event Data Lake Processing Analytics

    Warehouse Fact Entries Analytics Atomic Event Frames 1 2 3
  15. Architecture follows requirements class Analytics Model ´ factª WebFact ´

    dimensionª Zeit ´ dimensionª Kampagne Jahr Quartal Monat Woche Tag Stunde Minute Kunde + Land: String Partner ´ dimensionª Tracking Tracking Group SensorTag + Typ: SensorTagType Platzierung + Format: ImageSize + Kostenmodell: KostenmodellArt Werbemittel + AdGroup: String + Format: ImageSize + Grˆ fle: KiloBytes + LandingPage: URL + Motif: URL Kampagne ´ dimensionª Client Kategorie Dev ice + Bezeichner: String + Hersteller: String + Typ: String Browser + Typ: String + Version: int ´ dimensionª Ausspielort Land Region Stadt ´ dimensionª Kanal Kanal ´ dimensionª Vermarktung ´ enumerationª SensorTagType ORDER_TAG MASTER_TAG CUSTOM_TAG Betriebssystem + Typ: String + Version: Version ? Dimension: Unabh‰ ngiges Pr‰ dikat auf Metriken bei der Analyse ("kann isoliert dar¸ ber nachdenken / isoliert dazu Analysen fahren") ? H ierarchie: Sub-Pr‰ dikat auf Metriken. Erzeugt mehr als eine (zueinander diskunkte) Teilmengen der Metriken. Entspricht den g‰ ngigen Drill-Down-Pfaden in den Reports bzw. den Batch-Aggregate-Up-Pfaden in der Aggregationslogik. Semantische Unterstrukturen: "ist Teil von & kann nicht existieren ohne". ? Asssoziation: Nicht verwendet. Separates Stammdatenmodell. ? Attribut: Ermˆ glicht eine weitere (querschneidende) Einschr‰ nkung der Metrikmenge erg‰ nzend zu den Hierarchien. Domain Website Tracking Site Vermarkter Auslieferungs- Domain Referral ´ enumerati... KostenmodellArt CPC CPM CPO CPA ´ abstractª DimensionValue + id: int + name: String + sourceId: String WebsiteFact + Bounces: int + Verweildauer: float + Visits: int BasicAdFact + Clicks: int + Sichtbare Views: int + Validierte Clicks: int + View (angefragt): int + View (ausgeliefert): int + View (gemessen): int ´ dimensionª Produkt Shop Produkt + Produktkategorie: String ´ dimensionª Zeitfenster Letzte X Tage ´ dimensionª User User Segment ´ dimensionª Order OrderStatus + Status: OrderStatus ´ enumerationª OrderStatus IN_BEARBEITUNG ERFOLGREICH (AKTIVIERT) ABGELEHNT NICHT_IN_BEARBEITUNG UniquesFact + Unique Clicks: int + Unique Users: int + Unique Views: int AdCostFact + CPC: int + Kosten: float Conv ersionFact + PC: int + PR: int + PV: int + Umsatz PC: float + Umsatz PR: float + Umsatz PV: float AdVisibilityFact + Sichtbarkeitsdauer: float Activ atedOrderFact + Orders: int + Umsatz: float TrackingFact + Orders: int + Page Impressions: int + Umsatz: float X = {7, 14, 28, 30} act Processing Processing Website Order Activation Ad Cost Conversion Uniques + Overlap Ad Visibility Basic Ads Tracking Dimensionsv erzeichnis aktualisieren Page Impressions und Orders zählen Aggregation TrackingFact Ad Impressions zählen Aggregation BasicAdFact CTR und Sichtbarkeitsrate berechnen Inferenz der Sichtbarkeitsdauer Aggregation AdVisibilityFact Dimensionsraum aufspannen Pro Vektor die Ev ents und die dort enthaltenen UserIds und Interaktionsarten ermitteln Menge der UserIds pro Interaktionsart erzeugen und deren Mächtigkeit bestimmen UniquesFact User Journeys erstellen (bzw. LV/LC pro User) Attribution v on Conv ersion auf Basis v orkonfiguriertem Conv ersion Modell Conv ersions zählen (PV, PC, PR) Warenkorbwert ermitteln bei Order-Conv ersion (PV, PC, PR) Aggregation Conv ersionFact Kosten pro Ev ent ermitteln Aggregation Nicht zuweisbare Kosten ermitteln CostFact Inferenz der Visits Visits zählen Bounces zählen Analyse der Verweildauer Aggregation WebsiteFact Order Status ermitteln Anzahl der nicht getrackten Orders ermitteln Aggregation Activ atedOrderFact :Analytics Warehouse :Event Data Lake «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» Data Processing Workflow Multidimensional Data Model