How to Find Patterns in Your Data with SQL

How to Find Patterns in Your Data With SQL
Chris Saxon, @ChrisRSaxon & @SQLDaily blogs.oracle.com/sql youtube.com/c/TheMagicofSQL asktom.oracle.com

Am I Improving? Can I Beat My PB? Am I Training Regularly?

Safe Harbor

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, timing, and pricing of any features or functionality described for Oracle’s products may change and remains at the sole discretion of Oracle Corporation. Statements in this presentation relating to Oracle’s future plans, expectations, beliefs, intentions and prospects are “forward-looking statements” and are subject to material risks and uncertainties. A detailed discussion of these factors and other risks that affect our business is contained in Oracle’s Securities and Exchange Commission (SEC) filings, including our most recent reports on Form 10-K and Form 10-Q under the heading “Risk Factors.” These filings are available on the SEC’s website or on Oracle’s website at http://www.oracle.com/investor. All information in this presentation is current as of September 2019 and Oracle undertakes no duty to update any statement in light of new information or future events.

This presentation contains <regular expressions>!

I thought this was about SQL!

* => zero or more matches
+ => one or more matches
{n,m} => N through M matches (either optional)
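The quantifiers behave the same way in ordinary regular expressions, where each character plays the role that a row plays in a SQL pattern. A small Python sketch using the standard re module; the helper name full_match and the sample strings are illustrative assumptions, not part of the talk:

```python
import re


def full_match(pattern: str, text: str) -> bool:
    """Return True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


# "*" accepts zero or more repeats, so a lone "a" still matches.
star_ok = full_match(r"ab*", "a")
# "+" needs at least one repeat, so a lone "a" does not match.
plus_ok = full_match(r"ab+", "a")
# "{2,3}" accepts two through three repeats.
range_ok = full_match(r"ab{2,3}", "abbb")
# Omitting the upper bound, "{2,}", means two or more.
open_ok = full_match(r"ab{2,}", "abbbbb")

print(star_ok, plus_ok, range_ok, open_ok)
```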

Am I running every day?

RUN_DATE     TIME_IN_S  DISTANCE_IN_KM
01 Jan 2018        310               1
02 Jan 2018      1,600               5
03 Jan 2018      3,580              11
06 Jan 2018      1,550               5
07 Jan 2018        300               1
10 Jan 2018        280               1
13 Jan 2018      1,530               5
14 Jan 2018        295               1
15 Jan 2018        292               1

Groups of consecutive days: #1 01 to 03 Jan, #2 06 to 07 Jan, #3 10 Jan, #4 13 to 15 Jan

How do I know if rows are consecutive?

current value = previous value + 1

lag ( run_date ) over ( order by run_date )
Get the previous row's date
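To make the "current value = previous value + 1" test concrete, here is a minimal Python sketch of what lag ( run_date ) provides: each row is paired with the date from the row before it. It emulates the idea outside the database; the variable names are assumptions made for illustration.

```python
from datetime import date, timedelta

# Run dates from the running log shown in the slides (January 2018).
run_dates = [date(2018, 1, d) for d in (1, 2, 3, 6, 7, 10, 13, 14, 15)]

# Like lag(): shift the ordered dates down one row; the first row has no predecessor.
previous_dates = [None] + run_dates[:-1]

# A row is consecutive when its date is exactly one day after the previous row's date.
is_consecutive = [
    prev is not None and cur == prev + timedelta(days=1)
    for cur, prev in zip(run_dates, previous_dates)
]

for cur, prev, flag in zip(run_dates, previous_dates, is_consecutive):
    print(cur, prev, flag)
```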

RUN_DATE     RN  TIME_IN_S  DISTANCE_IN_KM
01 Jan 2018   1        310               1
02 Jan 2018   2      1,600               5
03 Jan 2018   3      3,580              11
06 Jan 2018   4      1,550               5
07 Jan 2018   5        300               1
10 Jan 2018   6        280               1
13 Jan 2018   7      1,530               5
14 Jan 2018   8        295               1
15 Jan 2018   9        292               1

consecutive => constant gap

RUN_DATE     RN  RUN_DATE - RN  TIME_IN_S  DISTANCE_IN_KM
01 Jan 2018   1  31 Dec 2017          310               1
02 Jan 2018   2  31 Dec 2017        1,600               5
03 Jan 2018   3  31 Dec 2017        3,580              11
06 Jan 2018   4  02 Jan 2018        1,550               5
07 Jan 2018   5  02 Jan 2018          300               1
10 Jan 2018   6  04 Jan 2018          280               1
13 Jan 2018   7  06 Jan 2018        1,530               5
14 Jan 2018   8  06 Jan 2018          295               1
15 Jan 2018   9  06 Jan 2018          292               1

Tabibitosan Method

row_number () over ( order by run_date )

run_date - row_number () over ( order by run_date ) grp

with grps as (
  select run_date,
         run_date - row_number () over ( order by run_date ) grp
  from   running_log r
)
select min ( run_date ), count (*)
from   grps
group  by grp
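The Tabibitosan idea can be traced by hand outside the database. The following Python sketch mirrors the query: number the ordered dates, subtract the row number from each date, and group on the result. It is an emulation for illustration only, and the names groups and streaks are assumptions.

```python
from datetime import date, timedelta

# Run dates from the running log shown in the slides (January 2018).
run_dates = [date(2018, 1, d) for d in (1, 2, 3, 6, 7, 10, 13, 14, 15)]

groups = {}
for rn, run_date in enumerate(sorted(run_dates), start=1):
    # Consecutive dates rise in step with the row number,
    # so run_date - rn stays constant within a streak.
    grp = run_date - timedelta(days=rn)
    groups.setdefault(grp, []).append(run_date)

# Equivalent of: select min ( run_date ), count (*) ... group by grp
streaks = [(min(dates), len(dates)) for dates in groups.values()]

for start, days in streaks:
    print(start, days)
```

The grouping keys come out as 31 Dec 2017, 02 Jan, 04 Jan and 06 Jan 2018, the same values as the RUN_DATE - RN column.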

12c Pattern Matching

select *               -- output
from   running_log     -- input
match_recognize ( );

Comparing each row with the one before it:
this = prev + 1 (for example, 02 Jan follows 01 Jan)
this = prev + 3 (for example, 06 Jan follows 03 Jan), so this ≠ prev + 1

current value = previous value + 1

define consecutive as run_date = prev ( run_date ) + 1

pattern ( init consecutive* )
define consecutive as run_date = prev ( run_date ) + 1

pattern ( init consecutive* )
define consecutive as run_date = prev ( run_date ) + 1

init: undefined => "always true"
consecutive*: zero or more matches

RUN_DATE     VARIABLE     TIME_IN_S  DISTANCE_IN_KM
01 Jan 2018  INIT               310               1
02 Jan 2018  CONSECUTIVE      1,600               5
03 Jan 2018  CONSECUTIVE      3,580              11
06 Jan 2018  INIT             1,550               5
07 Jan 2018  CONSECUTIVE        300               1
10 Jan 2018  INIT               280               1
13 Jan 2018  INIT             1,530               5
14 Jan 2018  CONSECUTIVE        295               1
15 Jan 2018  CONSECUTIVE        292               1

prev ( run_date ) + 1
Which row is prev?!

order by run_date
pattern ( init consecutive* )
define consecutive as run_date = prev ( run_date ) + 1

match_recognize (
  order by run_date
  measures
    first ( run_date ) as start_date,   -- first row in group
    count (*) as days                   -- how many consecutive rows?
  pattern ( init consecutive* )
  define consecutive as run_date = prev ( run_date ) + 1
);

START_DATE   DAYS
01 Jan 2018     3
06 Jan 2018     2
10 Jan 2018     1
13 Jan 2018     3

So which is better? pattern matching 12c 8i* ~speed

Am I running >= 3 times/week?

Groups by week: #1 01 to 07 Jan, #2 10 to 14 Jan, #3 15 Jan

How do I know if runs are in the same week?

latest Monday = prev latest Monday

trunc ( run_date , 'iw' )
Returns the start of the ISO week… Monday!

RUN_DATE     TRUNC(RUN_DATE, 'IW')  TIME_IN_S  DISTANCE_IN_KM
01 Jan 2018  01 Jan 2018                  310               1
02 Jan 2018  01 Jan 2018                1,600               5
03 Jan 2018  01 Jan 2018                3,580              11
06 Jan 2018  01 Jan 2018                1,550               5
07 Jan 2018  01 Jan 2018                  300               1
10 Jan 2018  08 Jan 2018                  280               1
13 Jan 2018  08 Jan 2018                1,530               5
14 Jan 2018  08 Jan 2018                  295               1
15 Jan 2018  15 Jan 2018                  292               1

select trunc ( run_date , 'iw' ), count (*)
from   running_log
group  by trunc ( run_date , 'iw' )

select trunc ( run_date , 'iw' ), count (*)
from   running_log
group  by trunc ( run_date , 'iw' )
having count (*) >= 3
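The weekly count can be checked by hand with a short Python sketch. The helper iso_week_start is an illustrative stand-in for trunc ( run_date, 'iw' ): it returns the Monday of the week containing the date. This emulates the grouping outside the database and is not Oracle code.

```python
from collections import Counter
from datetime import date, timedelta

# Run dates from the running log shown in the slides (January 2018).
run_dates = [date(2018, 1, d) for d in (1, 2, 3, 6, 7, 10, 13, 14, 15)]


def iso_week_start(d: date) -> date:
    """Monday of the ISO week containing d (weekday() is 0 for Monday)."""
    return d - timedelta(days=d.weekday())


# group by week start, then keep weeks "having count (*) >= 3"
runs_per_week = Counter(iso_week_start(d) for d in run_dates)
busy_weeks = sorted((week, n) for week, n in runs_per_week.items() if n >= 3)

for week, n in busy_weeks:
    print(week, n)
```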

latest Monday = prev latest Monday

define same_week as trunc ( run_date, 'iw' ) = prev ( trunc ( run_date, 'iw' ) )

pattern ( init same_week* )
define same_week as trunc ( run_date, 'iw' ) = prev ( trunc ( run_date, 'iw' ) )

pattern ( init same_week {2, } )
define same_week as trunc ( run_date, 'iw' ) = prev ( trunc ( run_date, 'iw' ) )

{2, } => Two or more matches

match_recognize (
  order by run_date
  measures
    first ( run_date ) as start_date,
    count (*) as days
  pattern ( init same_week {2, } )
  define same_week as trunc ( run_date, 'iw' ) = prev ( trunc ( run_date, 'iw' ) )
);

Am I running >= 3 times in 7 days?

RUN_DATE     TIME_IN_S  DISTANCE_IN_KM
01 – 07 Jan 2018
01 Jan 2018        310               1
02 Jan 2018      1,600               5
03 Jan 2018      3,580              11
06 Jan 2018      1,550               5
07 Jan 2018        300               1
08 – 14 Jan 2018
10 Jan 2018        280               1
13 Jan 2018      1,530               5
14 Jan 2018        295               1
15 – 21 Jan 2018
15 Jan 2018        292               1

RUN_DATE     TIME_IN_S  DISTANCE_IN_KM
01 – 07 Jan 2018
01 Jan 2018        310               1
02 Jan 2018      1,600               5
03 Jan 2018      3,580              11
06 Jan 2018      1,550               5
07 Jan 2018        300               1
10 – 16 Jan 2018
10 Jan 2018        280               1
13 Jan 2018      1,530               5
14 Jan 2018        295               1
15 Jan 2018        292               1

current day < first day + 7

11.2 Recursive With

with rws as (
  select r.*,
         row_number() over ( order by run_date ) rn
  from   running_log r
), within_7 ( run_date, time_in_s, distance_in_km, rn, grp_start ) as (
  select run_date, time_in_s, distance_in_km, rn,
         run_date grp_start
  from   rws
  where  rn = 1
  union all
  select r.run_date, r.time_in_s, r.distance_in_km, r.rn,
         case
           when r.run_date < w.grp_start + 7 then w.grp_start
           else r.run_date
         end grp_start
  from   within_7 w
  join   rws r
  on     w.rn + 1 = r.rn
)
select w.*
from   within_7 w

10g Model

select * from running_log
model
  dimension by ( row_number() over ( order by run_date ) rn )
  measures ( run_date, 1 grp, run_date grp_start )
  rules (
    grp_start[1] = run_date[cv()],
    grp_start[any] = case
      when run_date[cv()] < grp_start[cv()-1] + 7 then grp_start[cv() - 1]
      else run_date[cv()]
    end,
    grp[any] = case
      when run_date[cv()] < grp_start[cv()-1] + 7 then grp[cv() - 1]
      else nvl(grp[cv() - 1] + 1, 1)
    end
  );

current day < first day + 7

define within7 as run_date < first ( run_date ) + 7

pattern ( within7 {3, } )
define within7 as run_date < first ( run_date ) + 7

match_recognize (
  order by run_date
  measures
    first ( run_date ) as start_date,
    count (*) as days
  pattern ( within7 {3, } )
  define within7 as run_date < first ( run_date ) + 7
);
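The rule "current day < first day + 7" can be walked through row by row. The Python sketch below is an emulation of the within7 {3, } pattern under the default behaviour of resuming after the last matched row; it is not how the database implements matching, and the function name is an assumption. A greedy scan is enough here because the condition depends only on the first row of the group.

```python
from datetime import date, timedelta

# Run dates from the running log shown in the slides (January 2018).
run_dates = [date(2018, 1, d) for d in (1, 2, 3, 6, 7, 10, 13, 14, 15)]


def groups_within_7_days(dates, min_runs=3):
    """Return (start_date, runs) for each group of min_runs or more runs
    that all fall within 7 days of the group's first run."""
    dates = sorted(dates)
    matches, i = [], 0
    while i < len(dates):
        j = i
        # Keep taking rows while: current day < first day + 7
        while j < len(dates) and dates[j] < dates[i] + timedelta(days=7):
            j += 1
        if j - i >= min_runs:
            matches.append((dates[i], j - i))
            i = j          # resume after the last matched row
        else:
            i += 1         # no match starting here; try the next row
    return matches


seven_day_groups = groups_within_7_days(run_dates)
print(seven_day_groups)
```

Unlike the ISO week grouping, the second window starts on the first run after the previous group (10 Jan), so it also captures the 15 Jan run.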

Am I getting faster?

current time < prev time

define faster as time_in_s < prev ( time_in_s )

pattern ( slower faster* )
define faster as time_in_s < prev ( time_in_s )

match_recognize (
  order by run_date
  measures classifier () as faster
  pattern ( slower faster* )
  define faster as time_in_s < prev ( time_in_s )
);

FASTER
SLOWER
SLOWER
FASTER
FASTER

match_recognize (
  order by run_date
  measures classifier () as faster
  one row per match
  pattern ( slower faster* )
  define faster as time_in_s < prev ( time_in_s )
);

match_recognize (
  order by run_date
  measures classifier () as faster
  all rows per match
  pattern ( slower faster* )
  define faster as time_in_s < prev ( time_in_s )
);

RUN_DATE     FASTER  TIME_IN_S  DISTANCE_IN_KM
01 Jan 2018  SLOWER        310               1
02 Jan 2018  SLOWER      1,600               5
03 Jan 2018  SLOWER      3,580              11
06 Jan 2018  FASTER      1,550               5
07 Jan 2018  FASTER        300               1
10 Jan 2018  FASTER        280               1
13 Jan 2018  SLOWER      1,530               5
14 Jan 2018  FASTER        295               1
15 Jan 2018  FASTER        292               1

RUN_DATE     FASTER  TIME_IN_S  DISTANCE_IN_KM
01 Jan 2018  SLOWER        310               1
02 Jan 2018  SLOWER      1,600               5
03 Jan 2018  SLOWER      3,580              11
06 Jan 2018  FASTER      1,550               5
07 Jan 2018  FASTER        300               1
10 Jan 2018  FASTER        280               1
13 Jan 2018  SLOWER      1,530               5
14 Jan 2018  FASTER        295               1
15 Jan 2018  FASTER        292               1

SLOWER!

RUN_DATE     TIME_IN_S  DISTANCE_IN_KM
01 Jan 2018        310               1
07 Jan 2018        300               1
10 Jan 2018        280               1
14 Jan 2018        295               1
15 Jan 2018        292               1
02 Jan 2018      1,600               5
06 Jan 2018      1,550               5
13 Jan 2018      1,530               5
03 Jan 2018      3,580              11

match_recognize (
  partition by distance_in_km
  order by run_date
  measures classifier () as faster
  all rows per match
  pattern ( slower faster* )
  define faster as time_in_s < prev ( time_in_s )
);

RUN_DATE     FASTER  TIME_IN_S  DISTANCE_IN_KM
01 Jan 2018  SLOWER        310               1
07 Jan 2018  FASTER        300               1
10 Jan 2018  FASTER        280               1
14 Jan 2018  SLOWER        295               1
15 Jan 2018  FASTER        292               1
02 Jan 2018  SLOWER      1,600               5
06 Jan 2018  FASTER      1,550               5
13 Jan 2018  FASTER      1,530               5
03 Jan 2018  SLOWER      3,580              11
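The effect of partition by distance_in_km is that each run is compared only with the previous run of the same distance. A minimal Python sketch of that comparison follows; it emulates the slower faster* classification rather than calling the database, and the names labels and prev_time are assumptions.

```python
from datetime import date
from itertools import groupby

# (run_date, time_in_s, distance_in_km) rows from the running log in the slides.
runs = [
    (date(2018, 1, 1), 310, 1),
    (date(2018, 1, 2), 1600, 5),
    (date(2018, 1, 3), 3580, 11),
    (date(2018, 1, 6), 1550, 5),
    (date(2018, 1, 7), 300, 1),
    (date(2018, 1, 10), 280, 1),
    (date(2018, 1, 13), 1530, 5),
    (date(2018, 1, 14), 295, 1),
    (date(2018, 1, 15), 292, 1),
]

labels = {}
# partition by distance_in_km order by run_date
ordered = sorted(runs, key=lambda r: (r[2], r[0]))
for distance, rows in groupby(ordered, key=lambda r: r[2]):
    prev_time = None  # no previous row at the start of each partition
    for run_date, time_in_s, _ in rows:
        # faster: time_in_s < prev ( time_in_s ); anything else starts a new "slower"
        faster = prev_time is not None and time_in_s < prev_time
        labels[run_date] = "FASTER" if faster else "SLOWER"
        prev_time = time_in_s

for run_date, label in labels.items():
    print(run_date, label)
```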

Can I run 10k in < 50 minutes?

Is my average pace < 300 s/km for runs with a total distance <= 10 km?

cumulative dist <= 10 km

define ten_k as sum ( distance_in_km ) <= 10

sum ( distance_in_km ) => Returns the running total

pattern ( ten_k+ )
define ten_k as sum ( distance_in_km ) <= 10

match_recognize (
  order by run_date
  measures
    first ( run_date ) as strt,
    round ( avg ( time_in_s / distance_in_km ), 2 ) as mean_pace,
    sum ( distance_in_km ) as dist
  pattern ( ten_k+ )
  define ten_k as sum ( distance_in_km ) <= 10
);

STRT         MEAN_PACE  DIST
01 Jan 2018     315.00     6
06 Jan 2018     296.67     7
13 Jan 2018     297.67     7

Where's my 11 km run?
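Tracing the running total by hand shows why the 11 km run disappears: on its own it already exceeds 10 km, so it can never satisfy ten_k. The Python sketch below emulates the ten_k+ pattern with a greedy scan that resumes after the last matched row; it is an illustration, not Oracle's matching engine, and the function name is an assumption.

```python
from datetime import date

# (run_date, time_in_s, distance_in_km) rows from the running log in the slides.
runs = [
    (date(2018, 1, 1), 310, 1),
    (date(2018, 1, 2), 1600, 5),
    (date(2018, 1, 3), 3580, 11),
    (date(2018, 1, 6), 1550, 5),
    (date(2018, 1, 7), 300, 1),
    (date(2018, 1, 10), 280, 1),
    (date(2018, 1, 13), 1530, 5),
    (date(2018, 1, 14), 295, 1),
    (date(2018, 1, 15), 292, 1),
]


def groups_up_to(rows, limit_km=10):
    """Group consecutive runs while the running total distance stays <= limit_km."""
    matches, i = [], 0
    while i < len(rows):
        total, j = 0, i
        # ten_k: sum ( distance_in_km ) <= 10, evaluated as a running total
        while j < len(rows) and total + rows[j][2] <= limit_km:
            total += rows[j][2]
            j += 1
        if j > i:
            paces = [time_s / km for _, time_s, km in rows[i:j]]
            matches.append((rows[i][0], round(sum(paces) / len(paces), 2), total))
            i = j
        else:
            i += 1  # this run alone breaks the limit, so it matches nothing
    return matches


ten_k_groups = groups_up_to(runs)
print(ten_k_groups)
```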

any runs: cumulative dist < 10
and
one run: cumulative dist >= 10

pattern ( )

pattern ( under_10k* over_10k )

pattern ( under_10k* over_10k )
define under_10k as sum ( distance_in_km ) < 10,
       over_10k as sum ( distance_in_km ) >= 10 );

sum ( distance_in_km ) in over_10k => Includes under_10k values

match_recognize (
  order by run_date
  measures
    first ( run_date ) as strt,
    round ( avg ( time_in_s / distance_in_km ), 2 ) as mean_pace,
    sum ( distance_in_km ) as dist
  pattern ( under_10k* over_10k )
  define under_10k as sum ( distance_in_km ) < 10,
         over_10k as sum ( distance_in_km ) >= 10
);

STRT         MEAN_PACE  DIST
01 Jan 2018     318.48    17
06 Jan 2018     299.00    12

Hmmm….

match_recognize (
  order by run_date
  measures
    first ( run_date ) as strt,
    round ( avg ( time_in_s / distance_in_km ), 2 ) as mean_pace,
    sum ( distance_in_km ) as dist
  after match skip past last row
  pattern ( under_10k* over_10k )
  define under_10k as sum ( distance_in_km ) < 10,
         over_10k as sum ( distance_in_km ) >= 10
);

match_recognize (
  order by run_date
  measures
    first ( run_date ) as strt,
    round ( avg ( time_in_s / distance_in_km ), 2 ) as mean_pace,
    sum ( distance_in_km ) as dist
  after match skip to next row
  pattern ( under_10k* over_10k )
  define under_10k as sum ( distance_in_km ) < 10,
         over_10k as sum ( distance_in_km ) >= 10
);

STRT         MEAN_PACE  DIST
01 Jan 2018     318.48    17
02 Jan 2018     322.73    16
03 Jan 2018     325.45    11
06 Jan 2018     299.00    12
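The difference between the two skip options is only where the search for the next match begins. This Python sketch emulates under_10k* over_10k with a flag for the two behaviours: resume after the last matched row, or resume at the row after the match's first row. It is an illustration under those assumptions, with hypothetical names, not the database's implementation.

```python
from datetime import date

# (run_date, time_in_s, distance_in_km) rows from the running log in the slides.
runs = [
    (date(2018, 1, 1), 310, 1),
    (date(2018, 1, 2), 1600, 5),
    (date(2018, 1, 3), 3580, 11),
    (date(2018, 1, 6), 1550, 5),
    (date(2018, 1, 7), 300, 1),
    (date(2018, 1, 10), 280, 1),
    (date(2018, 1, 13), 1530, 5),
    (date(2018, 1, 14), 295, 1),
    (date(2018, 1, 15), 292, 1),
]


def first_10k_windows(rows, skip_to_next_row, target_km=10):
    """Find runs of rows whose running total first reaches target_km."""
    matches, i = [], 0
    while i < len(rows):
        total, end = 0, None
        for j in range(i, len(rows)):
            total += rows[j][2]
            if total >= target_km:   # over_10k: the row that tips the total
                end = j
                break
        if end is None:
            i += 1                   # total never reaches the target from here
            continue
        paces = [time_s / km for _, time_s, km in rows[i:end + 1]]
        matches.append((rows[i][0], round(sum(paces) / len(paces), 2), total))
        i = i + 1 if skip_to_next_row else end + 1
    return matches


past_last_row = first_10k_windows(runs, skip_to_next_row=False)
to_next_row = first_10k_windows(runs, skip_to_next_row=True)
print(past_last_row)
print(to_next_row)
```

Skipping to the next row lets matches overlap, which is why the runs on 02 and 03 Jan now start groups of their own.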

00:48:19

What About Query Performance?

MATCH RECOGNIZE SORT Non-deterministic

MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO

How often did I run 5 km followed by 2+ 1 km runs within 7 days?

pattern ( five_km one_km {2,} )

pattern ( five_km one_km {2,} )
define five_km as distance_in_km = 5,
       one_km as distance_in_km = 1 and run_date < first ( run_date ) + 7

match_recognize (
  order by run_date
  measures
    first ( run_date ) as start_date,
    count (*) as total_runs
  pattern ( five_km one_km {2,} )
  define five_km as distance_in_km = 5,
         one_km as distance_in_km = 1 and run_date < first ( run_date ) + 7
);

START_DATE   TOTAL_RUNS
06 Jan 2018           3
13 Jan 2018           3
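The combined pattern can be checked against the data with a row-by-row scan. This Python sketch emulates five_km one_km {2,}: a 5 km run, then two or more 1 km runs dated within 7 days of that first run. The function name and parameters are assumptions for illustration; the database's matcher works differently internally.

```python
from datetime import date, timedelta

# (run_date, time_in_s, distance_in_km) rows from the running log in the slides.
runs = [
    (date(2018, 1, 1), 310, 1),
    (date(2018, 1, 2), 1600, 5),
    (date(2018, 1, 3), 3580, 11),
    (date(2018, 1, 6), 1550, 5),
    (date(2018, 1, 7), 300, 1),
    (date(2018, 1, 10), 280, 1),
    (date(2018, 1, 13), 1530, 5),
    (date(2018, 1, 14), 295, 1),
    (date(2018, 1, 15), 292, 1),
]


def five_then_ones(rows, min_ones=2):
    """Return (start_date, total_runs) for each 5 km run followed by
    min_ones or more 1 km runs within 7 days of the 5 km run."""
    matches, i = [], 0
    while i < len(rows):
        start_date, _, km = rows[i]
        j = i + 1
        if km == 5:  # five_km
            # one_km: distance is 1 and run_date < first ( run_date ) + 7
            while (j < len(rows) and rows[j][2] == 1
                   and rows[j][0] < start_date + timedelta(days=7)):
                j += 1
        if km == 5 and j - (i + 1) >= min_ones:
            matches.append((start_date, j - i))
            i = j
        else:
            i += 1
    return matches


five_one_matches = five_then_ones(runs)
print(five_one_matches)
```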

Why would I want to do that?!

Row Pattern Matching Use Cases

Fraud Analytics
  2+ $1 trx between acts
  1 $10,000 trx in 7 days
Stock Market Trends
  Price rose 3 days
  Then fell 3 days
Customer Retention
  2+ orders/month for years
  Max 2 orders past 6 mths
Date Ranges
  Finding gaps & overlaps

How do I debug it?

(Regular) [exprsion]+ are easy to missteak

regex101.com

classifier => Which variable matched?

classifier => Which variable matched?
match_number => Which group is this?
all rows per match with unmatched rows => Show me everything!

match_recognize (
  order by run_date
  measures
    classifier () as var,
    match_number () as grp
  all rows per match with unmatched rows
  pattern ( five_km one_km {2,} )
  define five_km as distance_in_km = 5,
         one_km as distance_in_km = 1 and run_date < first ( run_date ) + 7
);

RUN_DATE     VAR      GRP  TIME_IN_S  DISTANCE_IN_KM
01 Jan 2018                      310               1
02 Jan 2018                    1,600               5
03 Jan 2018                    3,580              11
06 Jan 2018  FIVE_KM    1      1,550               5
07 Jan 2018  ONE_KM     1        300               1
10 Jan 2018  ONE_KM     1        280               1
13 Jan 2018  FIVE_KM    2      1,530               5
14 Jan 2018  ONE_KM     2        295               1
15 Jan 2018  ONE_KM     2        292               1
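The debugging output is easier to read once you see how each label is assigned. The Python sketch below tags every row with the variable it matched and its match number, leaving unmatched rows blank (None), in the spirit of classifier (), match_number () and "all rows per match with unmatched rows". It is an emulation with assumed names, not Oracle code.

```python
from datetime import date, timedelta

# (run_date, time_in_s, distance_in_km) rows from the running log in the slides.
runs = [
    (date(2018, 1, 1), 310, 1),
    (date(2018, 1, 2), 1600, 5),
    (date(2018, 1, 3), 3580, 11),
    (date(2018, 1, 6), 1550, 5),
    (date(2018, 1, 7), 300, 1),
    (date(2018, 1, 10), 280, 1),
    (date(2018, 1, 13), 1530, 5),
    (date(2018, 1, 14), 295, 1),
    (date(2018, 1, 15), 292, 1),
]


def label_rows(rows, min_ones=2):
    """Return one (var, grp) pair per row; unmatched rows get (None, None)."""
    labels = [(None, None)] * len(rows)
    grp, i = 0, 0
    while i < len(rows):
        j = i + 1
        if rows[i][2] == 5:
            while (j < len(rows) and rows[j][2] == 1
                   and rows[j][0] < rows[i][0] + timedelta(days=7)):
                j += 1
        if rows[i][2] == 5 and j - (i + 1) >= min_ones:
            grp += 1                          # match_number ()
            labels[i] = ("FIVE_KM", grp)      # classifier () for the first row
            for k in range(i + 1, j):
                labels[k] = ("ONE_KM", grp)
            i = j
        else:
            i += 1                            # row stays unmatched
    return labels


row_labels = label_rows(runs)
for (run_date, time_s, km), (var, grp) in zip(runs, row_labels):
    print(run_date, var, grp, time_s, km)
```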

Want more?

livesql.oracle.com

FREE! SQL for Data Warehousing and Analytics (iTunes & PDF)
Keith Laker, Analytic SQL PM
https://oracle-big-data.blogspot.co.uk

#MakeDataGreatAgain oracle-big-data.blogspot.co.uk
