Upgrade to Pro — share decks privately, control downloads, hide ads and more …

How to Find Patterns in Your Data with SQL

Chris
May 01, 2020

How to Find Patterns in Your Data with SQL

An introduction to the SQL row pattern matching clause match_recognzie.

Chris

May 01, 2020
Tweet

More Decks by Chris

Other Decks in Technology

Transcript

  1. oracle.com/free New Free Tier Always Free Oracle Cloud Infrastructure Services

    you can use for unlimited time 30-Day Free Trial Free credits you can use for more services +
  2. 2 How to Find Patterns in Your Data With SQL

    Chris Saxon, @ChrisRSaxon & @SQLDaily blogs.oracle.com/sql youtube.com/c/TheMagicofSQL asktom.oracle.com
  3. 4 How to Find Patterns in Your Data With SQL

    Chris Saxon, @ChrisRSaxon & @SQLDaily blogs.oracle.com/sql youtube.com/c/TheMagicofSQL asktom.oracle.com
  4. The following is intended to outline our general product direction.

    It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, timing, and pricing of any features or functionality described for Oracle’s products may change and remains at the sole discretion of Oracle Corporation. Statements in this presentation relating to Oracle’s future plans, expectations, beliefs, intentions and prospects are “forward-looking statements” and are subject to material risks and uncertainties. A detailed discussion of these factors and other risks that affect our business is contained in Oracle’s Securities and Exchange Commission (SEC) filings, including our most recent reports on Form 10-K and Form 10-Q under the heading “Risk Factors.” These filings are available on the SEC’s website or on Oracle’s website at http://www.oracle.com/investor. All information in this presentation is current as of September 2019 and Oracle undertakes no duty to update any statement in light of new information or future events. Safe Harbor
  5. * => zero or more matches + => one or

    more matches {n,m} => N through M matches (either optional)
  6. RUN_DATE TIME_IN_S DISTANCE_IN_KM 01 Jan 2018 310 1 02 Jan

    2018 1,600 5 03 Jan 2018 3,580 11 06 Jan 2018 1,550 5 07 Jan 2018 300 1 10 Jan 2018 280 1 13 Jan 2018 1,530 5 14 Jan 2018 295 1 15 Jan 2018 292 1
  7. RUN_DATE TIME_IN_S DISTANCE_IN_KM 01 Jan 2018 310 1 02 Jan

    2018 1,600 5 03 Jan 2018 3,580 11 06 Jan 2018 1,550 5 07 Jan 2018 300 1 10 Jan 2018 280 1 13 Jan 2018 1,530 5 14 Jan 2018 295 1 15 Jan 2018 292 1 #1 #3 #2 #4
  8. lag ( run_date ) over ( order by run_date )

    Get the previous row's date
  9. RUN_DATE RN TIME_IN_S DISTANCE_IN_KM 01 Jan 2018 1 310 1

    02 Jan 2018 2 1,600 5 03 Jan 2018 3 3,580 11 06 Jan 2018 4 1,550 5 07 Jan 2018 5 300 1 10 Jan 2018 6 280 1 13 Jan 2018 7 1,530 5 14 Jan 2018 8 295 1 15 Jan 2018 9 292 1 consecutive => constant gap
  10. RUN_DATE RN TIME_IN_S DISTANCE_IN_KM 01 Jan 2018 1 310 1

    02 Jan 2018 2 1,600 5 03 Jan 2018 3 3,580 11 06 Jan 2018 4 1,550 5 07 Jan 2018 5 300 1 10 Jan 2018 6 280 1 13 Jan 2018 7 1,530 5 14 Jan 2018 8 295 1 15 Jan 2018 9 292 1 - - - - - - - - -
  11. RUN_DATE RN RUN_DATE - RN TIME_IN_S DISTANCE_IN_KM 01 Jan 2018

    1 31 Dec 2017 310 1 02 Jan 2018 2 31 Dec 2017 1,600 5 03 Jan 2018 3 31 Dec 2017 3,580 11 06 Jan 2018 4 02 Jan 2018 1,550 5 07 Jan 2018 5 02 Jan 2018 300 1 10 Jan 2018 6 04 Jan 2018 280 1 13 Jan 2018 7 06 Jan 2018 1,530 5 14 Jan 2018 8 06 Jan 2018 295 1 15 Jan 2018 9 06 Jan 2018 292 1 - - - - - - - - -
  12. RUN_DATE RN RUN_DATE - RN TIME_IN_S DISTANCE_IN_KM 01 Jan 2018

    1 31 Dec 2017 310 1 02 Jan 2018 2 31 Dec 2017 1,600 5 03 Jan 2018 3 31 Dec 2017 3,580 11 06 Jan 2018 4 02 Jan 2018 1,550 5 07 Jan 2018 5 02 Jan 2018 300 1 10 Jan 2018 6 04 Jan 2018 280 1 13 Jan 2018 7 06 Jan 2018 1,530 5 14 Jan 2018 8 06 Jan 2018 295 1 15 Jan 2018 9 06 Jan 2018 292 1 - - - - - - - - -
  13. with grps as ( select run_date , run_date - row_number

    () over ( order by run_date ) grp from running_log r ) select min ( run_date ), count (*) from grps group by grp
  14. RUN_DATE TIME_IN_S DISTANCE_IN_KM 01 Jan 2018 310 1 02 Jan

    2018 1,600 5 03 Jan 2018 3,580 11 06 Jan 2018 1,550 5 07 Jan 2018 300 1 10 Jan 2018 280 1 13 Jan 2018 1,530 5 14 Jan 2018 295 1 15 Jan 2018 292 1 this = prev + 1
  15. RUN_DATE TIME_IN_S DISTANCE_IN_KM 01 Jan 2018 310 1 02 Jan

    2018 1,600 5 03 Jan 2018 3,580 11 06 Jan 2018 1,550 5 07 Jan 2018 300 1 10 Jan 2018 280 1 13 Jan 2018 1,530 5 14 Jan 2018 295 1 15 Jan 2018 292 1 this = prev + 1 this = prev + 3
  16. RUN_DATE TIME_IN_S DISTANCE_IN_KM 01 Jan 2018 310 1 02 Jan

    2018 1,600 5 03 Jan 2018 3,580 11 06 Jan 2018 1,550 5 07 Jan 2018 300 1 10 Jan 2018 280 1 13 Jan 2018 1,530 5 14 Jan 2018 295 1 15 Jan 2018 292 1 this = prev + 1 this = prev + 3 this ≠ prev + 1
  17. pattern ( init consecutive* ) define consecutive as run_date =

    prev ( run_date ) + 1 Undefined => "Always true" > 0 matches
  18. RUN_DATE VARIABLE TIME_IN_S DISTANCE_IN_KM 01 Jan 2018 INIT 310 1

    02 Jan 2018 CONSECUTIVE 1,600 5 03 Jan 2018 CONSECUTIVE 3,580 11 06 Jan 2018 INIT 1,550 5 07 Jan 2018 CONSECUTIVE 300 1 10 Jan 2018 INIT 280 1 13 Jan 2018 INIT 1,530 5 14 Jan 2018 CONSECUTIVE 295 1 15 Jan 2018 CONSECUTIVE 292 1
  19. pattern ( init consecutive* ) define consecutive as run_date =

    prev ( run_date ) + 1 Which row is prev?!
  20. match_recognize ( order by run_date measures first ( run_date )

    as start_date, count (*) as days pattern ( init consecutive* ) define consecutive as run_date = prev ( run_date ) + 1 ); How many consecutive rows? First row in group
  21. START_DATE DAYS 01 Jan 2018 3 06 Jan 2018 2

    10 Jan 2018 1 13 Jan 2018 3
  22. RUN_DATE TIME_IN_S DISTANCE_IN_KM 01 Jan 2018 310 1 02 Jan

    2018 1,600 5 03 Jan 2018 3,580 11 06 Jan 2018 1,550 5 07 Jan 2018 300 1 10 Jan 2018 280 1 13 Jan 2018 1,530 5 14 Jan 2018 295 1 15 Jan 2018 292 1 #1 #3 #2
  23. RUN_DATE TRUNC(RUN_DATE, 'IW') TIME_IN_S DISTANCE_IN_KM 01 Jan 2018 01 Jan

    2018 310 1 02 Jan 2018 01 Jan 2018 1,600 5 03 Jan 2018 01 Jan 2018 3,580 11 06 Jan 2018 01 Jan 2018 1,550 5 07 Jan 2018 01 Jan 2018 300 1 10 Jan 2018 08 Jan 2018 280 1 13 Jan 2018 08 Jan 2018 1,530 5 14 Jan 2018 08 Jan 2018 295 1 15 Jan 2018 15 Jan 2018 292 1
  24. select trunc ( run_date , 'iw' ), count(*) from running_log

    group by trunc ( run_date , 'iw' ) having count (*) >= 3
  25. pattern ( init same_week* ) define same_week as trunc (

    run_date, 'iw' ) = prev ( trunc ( run_date, 'iw' ) )
  26. pattern ( init same_week {2, } ) define same_week as

    trunc ( run_date, 'iw' ) = prev ( trunc ( run_date, 'iw' ) ) Two or more matches
  27. match_recognize ( order by run_date measures first ( run_date )

    as start_date, count (*) as days pattern ( init same_week {2, } ) define same_week as trunc ( run_date, 'iw' ) = prev ( trunc ( run_date, 'iw' ) ) );
  28. match_recognize ( order by run_date measures first ( run_date )

    as start_date, count (*) as days pattern ( init same_week {2, } ) define same_week as trunc ( run_date, 'iw' ) = prev ( trunc ( run_date, 'iw' ) ) );
  29. match_recognize ( order by run_date measures first ( run_date )

    as start_date, count (*) as days pattern ( init consecutive* ) define consecutive as run_date = prev ( run_date ) + 1 );
  30. RUN_DATE TIME_IN_S DISTANCE_IN_KM 01 Jan 2018 310 1 02 Jan

    2018 1,600 5 03 Jan 2018 3,580 11 06 Jan 2018 1,550 5 07 Jan 2018 300 1 10 Jan 2018 280 1 13 Jan 2018 1,530 5 14 Jan 2018 295 1 15 Jan 2018 292 1
  31. RUN_DATE TIME_IN_S DISTANCE_IN_KM 01 Jan 2018 310 1 02 Jan

    2018 1,600 5 03 Jan 2018 3,580 11 06 Jan 2018 1,550 5 07 Jan 2018 300 1 10 Jan 2018 280 1 13 Jan 2018 1,530 5 14 Jan 2018 295 1 15 Jan 2018 292 1 #1 #2
  32. RUN_DATE TIME_IN_S DISTANCE_IN_KM 01 Jan 2018 01 – 07 Jan

    2018 310 1 02 Jan 2018 1,600 5 03 Jan 2018 3,580 11 06 Jan 2018 1,550 5 07 Jan 2018 300 1 10 Jan 2018 08 – 14 Jan 2018 280 1 13 Jan 2018 1,530 5 14 Jan 2018 295 1 15 Jan 2018 15 – 21 Jan 2018 292 1
  33. RUN_DATE TIME_IN_S DISTANCE_IN_KM 01 Jan 2018 01 – 07 Jan

    2018 310 1 02 Jan 2018 1,600 5 03 Jan 2018 3,580 11 06 Jan 2018 1,550 5 07 Jan 2018 300 1 10 Jan 2018 10 – 16 Jan 2018 280 1 13 Jan 2018 1,530 5 14 Jan 2018 295 1 15 Jan 2018 292 1
  34. with rws as ( select r.*, row_number() over ( order

    by run_date ) rn from running_log r ), within_7 ( run_date, time_in_s, distance_in_km, rn, grp_start ) as ( select run_date, time_in_s, distance_in_km, rn, run_date grp_start from rws where rn = 1 union all select r.run_date, r.time_in_s, r.distance_in_km, r.rn, case when r.run_date < w.grp_start + 7 then grp_start else r.run_date end grp_start from within_7 w join rws r on w.rn + 1 = r.rn ) select grp, w.* from within_7 w
  35. select * from running_log model dimension by ( row_number() over

    ( order by run_date ) rn ) measures ( run_date, 1 grp, run_date grp_start ) rules ( grp_start[1] = run_date[cv()], grp_start[any] = case when run_date[cv()] < grp_start[cv()-1] + 7 then grp_start[cv() - 1] else run_date[cv()] end , grp[any] = case when run_date[cv()] < grp_start[cv()-1] + 7 then grp[cv() - 1] else nvl(grp[cv() - 1] + 1, 1) end );
  36. match_recognize ( order by run_date measures first ( run_date )

    as start_date, count (*) as days pattern ( within7 {3, } ) define within7 as run_date < first ( run_date ) + 7 );
  37. match_recognize ( order by run_date measures classifier () as faster

    pattern ( slower faster* ) define faster as time_in_s < prev ( time_in_s ) );
  38. match_recognize ( order by run_date measures classifier () as faster

    one row per match pattern ( slower faster* ) define faster as time_in_s < prev ( time_in_s ) );
  39. match_recognize ( order by run_date measures classifier () as faster

    all rows per match pattern ( slower faster* ) define faster as time_in_s < prev ( time_in_s ) );
  40. RUN_DATE FASTER TIME_IN_S DISTANCE_IN_KM 01 Jan 2018 SLOWER 310 1

    02 Jan 2018 SLOWER 1,600 5 03 Jan 2018 SLOWER 3,580 11 06 Jan 2018 FASTER 1,550 5 07 Jan 2018 FASTER 300 1 10 Jan 2018 FASTER 280 1 13 Jan 2018 SLOWER 1,530 5 14 Jan 2018 FASTER 295 1 15 Jan 2018 FASTER 292 1
  41. RUN_DATE FASTER TIME_IN_S DISTANCE_IN_KM 01 Jan 2018 SLOWER 310 1

    02 Jan 2018 SLOWER 1,600 5 03 Jan 2018 SLOWER 3,580 11 06 Jan 2018 FASTER 1,550 5 07 Jan 2018 FASTER 300 1 10 Jan 2018 FASTER 280 1 13 Jan 2018 SLOWER 1,530 5 14 Jan 2018 FASTER 295 1 15 Jan 2018 FASTER 292 1 SLOWER!
  42. RUN_DATE TIME_IN_S DISTANCE_IN_KM 01 Jan 2018 310 1 07 Jan

    2018 300 1 10 Jan 2018 280 1 14 Jan 2018 295 1 15 Jan 2018 292 1 02 Jan 2018 1,600 5 06 Jan 2018 1,550 5 13 Jan 2018 1,530 5 03 Jan 2018 3,580 11
  43. match_recognize ( partition by distance_in_km order by run_date measures classifier

    () as faster all rows per match pattern ( slower faster* ) define faster as time_in_s < prev ( time_in_s ) );
  44. RUN_DATE FASTER TIME_IN_S DISTANCE_IN_KM 01 Jan 2018 SLOWER 310 1

    07 Jan 2018 FASTER 300 1 10 Jan 2018 FASTER 280 1 14 Jan 2018 SLOWER 295 1 15 Jan 2018 FASTER 292 1 02 Jan 2018 SLOWER 1,600 5 06 Jan 2018 FASTER 1,550 5 13 Jan 2018 FASTER 1,530 5 03 Jan 2018 SLOWER 3,580 11
  45. match_recognize ( order by run_date measures first ( run_date )

    as strt , round ( avg ( time_in_s / distance_in_km ), 2 ) as mean_pace, sum ( distance_in_km ) as dist pattern ( ten_k+ ) define ten_k as sum ( distince_in_km ) <= 10 );
  46. STRT MEAN_PACE DIST 01 Jan 2018 315.00 6 06 Jan

    2018 296.67 7 13 Jan 2018 297.67 7 Where's my 11 km run?
  47. pattern ( under_10k* over_10k ) define under_10k as sum (

    distance_in_km ) < 10, over_10k as sum ( distance_in_km ) >= 10 ); Includes under_10k values
  48. match_recognize ( order by run_date measures first ( run_date )

    as strt , round ( avg ( time_in_s / distance_in_km ), 2 ) as mean_pace sum ( distance_in_km ) as dist pattern ( under_10k* over_10k ) define under_10k as sum ( distance_in_km ) < 10, over_10k as sum ( distance_in_km ) >= 10 );
  49. match_recognize ( order by run_date measures first ( run_date )

    as strt , round ( avg ( time_in_s / distance_in_km ), 2 ) as mean_pace sum ( distance_in_km ) as dist after match skip past last row pattern ( under_10k* over_10k ) define under_10k as sum ( distance_in_km ) < 10, over_10k as sum ( distance_in_km ) >= 10 );
  50. match_recognize ( order by run_date measures first ( run_date )

    as strt , round ( avg ( time_in_s / distance_in_km ), 2 ) as mean_pace sum ( distance_in_km ) as dist after match skip to next row pattern ( under_10k* over_10k ) define under_10k as sum ( distance_in_km ) < 10, over_10k as sum ( distance_in_km ) >= 10 );
  51. STRT MEAN_PACE DIST 01 Jan 2018 318.48 17 02 Jan

    2018 322.73 16 03 Jan 2018 325.45 11 06 Jan 2018 299.00 12
  52. How often did I run 5 km Followed by 2+

    1 km runs Within 7 days?
  53. pattern ( five_km one_km {2,} ) define five_km as distance_in_km

    = 5, one_km as distance_in_km = 1 and run_date < first ( run_date ) + 7
  54. match_recognize ( order by run_date measures first ( run_date )

    as start_date, count (*) as total_runs pattern ( five_km one_km {2,} ) define five_km as distance_in_km = 5, one_km as distance_in_km = 1 and run_date < first ( run_date ) + 7 );
  55. Row Pattern Matching Use Cases Fraud Analytics 2+ $1 trx

    between acts 1 $10,000 trx in 7 days Stock Market Trends Price rose 3 days Then fell 3 days Customer Retention 2+ orders/month for years Max 2 orders past 6 mths Date Ranges Finding gaps & overlaps
  56. classifier => Which variable matched? match_number => Which group is

    this? all rows per match with unmatched rows => Show me everything!
  57. match_recognize ( order by run_date measures classifier () as var,

    match_number () as grp all rows per match with unmatched rows pattern ( five_km one_km {2,} ) define five_km as distance_in_km = 5, one_km as distance_in_km = 1 and run_date < first ( run_date ) + 7 );
  58. RUN_DATE VAR GRP TIME_IN_S DISTANCE_IN_KM 01 Jan 2018 310 1

    02 Jan 2018 1,600 5 03 Jan 2018 3,580 10 06 Jan 2018 FIVE_KM 1 1,550 5 07 Jan 2018 ONE_KM 1 300 1 10 Jan 2018 ONE_KM 1 280 1 13 Jan 2018 FIVE_KM 2 1,530 5 14 Jan 2018 ONE_KM 2 295 1 15 Jan 2018 ONE_KM 2 292 1
  59. RUN_DATE VAR GRP TIME_IN_S DISTANCE_IN_KM 01 Jan 2018 310 1

    02 Jan 2018 1,600 5 03 Jan 2018 3,580 10 06 Jan 2018 FIVE_KM 1 1,550 5 07 Jan 2018 ONE_KM 1 300 1 10 Jan 2018 ONE_KM 1 280 1 13 Jan 2018 FIVE_KM 2 1,530 5 14 Jan 2018 ONE_KM 2 295 1 15 Jan 2018 ONE_KM 2 292 1
  60. iTunes & PDF FREE! SQL for Data Warehousing and Analytics

    https://oracle-big-data.blogspot.co.uk Keith Laker Analytic SQL PM