Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Finding the Longest Common Substring & Gestalt ...

Chris
October 20, 2020

Finding the Longest Common Substring & Gestalt Pattern Matching with SQL & PL/SQL

A deep-dive into solving the longest common substring problem and Ratcliff-Obershelp similarity algorithm using SQL & PL/SQL.

Discusses the algorithm to do this and the performance of the different solutions.

Chris

October 20, 2020
Tweet

More Decks by Chris

Other Decks in Technology

Transcript

  1. Your SQL Office Hours begins soon… Longest Common Substring Gestalt

    Pattern Matching Chris Saxon @ChrisRSaxon & @SQLDaily https://blogs.oracle.com/sql https://www.youtube.com/c/TheMagicofSQL Stew Ashton @StewAshton https://stewashton.wordpress.com/
  2. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | Help me find the longest common substring* for two words *should be subsequence
  3. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | Find the Longest Common Substring WIKIMEDIA REMEDYA
  4. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | Find the Longest Common Substring WIKIMEDIA REMEDYA MED
  5. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | LCS solution : concept Given two strings, A and B : • LEN is the potential substring length, starting from the length of B and decrementing to 1 • For each value of LEN, try all possible starting points B_STRT • Test whether the substring of B starting at B_STRT for length LEN is within A • Stop at first substring found • Return A starting point, B starting point and substring length LEN B-STRT B = "SQL" 3 1 "SQL" 2 1 "SQ" 2 2 "QL" 1 1 "S" 1 2 "Q" 1 3 "L"
  6. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | a) lcs solution in SQL with input(a, b) as ( select 'antidisestablishmentarianism', 'disentanglement' from dual ) select instr(a, substr(b,b_strt,len)) a_strt, b_strt, len, substr(b, b_strt, len) lcs from input i, lateral( -- calculates LEN select length(i.b) + 1 - level len from dual connect by length(i.b) + 1 - level > 0 ) s, lateral( -- calculates B_STRT select level b_strt from dual connect by level + s.len - 1 <= length(i.b) ) where instr(a, substr(b,b_strt,len)) > 0 and rownum = 1; A_STRT B_STRT LEN LCS ------ ------ --- ---- 5 1 4 dise
  7. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | SQL implementation: imperfections • Assumes the lateral clauses generate rows in order of decreasing length • Compare to the reply "Another recursive variant", where Iudith Mentzel specifically orders the rows before returning the first one. • Starts with length of B, even when A is shorter.
  8. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | Was SQL the right choice? For • Can be done using hierarchical queries • Uses "fast dual" so no additional data access Against • Assumes "procedural" implementation: process in order, stop at first hit. • It's just a calculation with no data access other than the input strings • How will it work as part of the Pattern Recognition solution?
  9. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | Pattern Recognition: concept 1. Get lcs of A and B: starting points, length of lcs 2. Add lcs length of left sides & right sides; 3. Continue recursively until no match; 4. Multiply cumulative length by 2, divide by sum of lengths of A and B. A_STRT B_STRT LEN cumul level 1 level 2 level 3 4 4 2 2 OU aFbOUcRd iFjOUkRl A_STRT B_STRT LEN cumul level 1 level 2 level 3 4 4 2 2 OU aFbOUcRd iFjOUkRl 2 2 1 3 F aFb iFj A_STRT B_STRT LEN cumul level 1 level 2 level 3 4 4 2 2 OU aFbOUcRd iFjOUkRl 2 2 1 3 F aFb iFj 0 3 a i 0 3 b j A_STRT B_STRT LEN cumul level 1 level 2 level 3 4 4 2 2 OU aFbOUcRd iFjOUkRl 2 2 1 3 F aFb iFj 0 3 a i 0 3 b j 2 2 1 4 R cRd kRl A_STRT B_STRT LEN cumul level 1 level 2 level 3 4 4 2 2 OU aFbOUcRd iFjOUkRl 2 2 1 3 F aFb iFj 0 3 a i 0 3 b j 2 2 1 4 R cRd kRl 0 4 c k 0 4 d l
  10. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | b) Gestalt solution in PL/SQL, lcs uses SQL create or replace function gestalt( l in varchar2, r in varchar2 ) return number is function lcs ( l in varchar2, r in varchar2 ) return number is ll varchar2(4000); rl varchar2(4000); lr varchar2(4000); rr varchar2(4000); begin for rec in (<SELECT lcs data>) loop ll := substr(l, 1, rec.l_strt - 1); rl := substr(r, 1, rec.r_strt - 1); lr := substr(l, rec.l_strt + rec.len); rr := substr(r, rec.r_strt + rec.len); return rec.len + lcs(ll, rl) + lcs(lr, rr); end loop; return 0; end lcs; begin return 2 * lcs(l, r) / ( length(l) + length(r) ); end gestalt; FOR loop not correct here, use SELECT INTO as in Iudith Mentzel's reply
  11. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | Solution b) drawbacks PL/SQL Runtime Engine SQL Engine PL/SQL Statement Executor PL/SQL Block SQL Statement Executor 1 2 3 4
  12. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | Solution b) drawbacks PL/SQL Runtime Engine SQL Engine PL/SQL Statement Executor PL/SQL Block SQL Statement Executor 1 2 3 4 5 6 SCN 1234 SCN 4567
  13. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | c) Gestalt solution in PL/SQL, including for lcs function lcs ( l in varchar2, r in varchar2 ) return number is l_start pls_integer := 0; r_start pls_integer := 0; lr_length pls_integer := least(length(l),length(r)); ll varchar2(128); rl varchar2(128); lr varchar2(128); rr varchar2(128); begin if l is null or r is null then return 0; end if; <<len_loop>> loop l_start := 1; loop r_start:=instr(r,substr(l,l_start,lr_length)); exit len_loop when r_start > 0; exit when l_start = length(l)-lr_length+1; l_start := l_start + 1; end loop; exit when lr_length = 1; lr_length := lr_length - 1; end loop; if r_start = 0 then return 0; else <snipped> return lr_length + lcs(ll, rl) + lcs(lr, rr); end if; end lcs;
  14. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | Solution c) advantages • No row generation, no fake data access (DUAL) • Starting length is lesser of two string lengths • Procedural order: • Test lengths from longest to shortest • Test starting point from the left • Stop at first find • Minimal context switches • Result: performance
  15. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | 1 167 164 325 0 50 100 150 200 250 300 350 PL/SQL PL/SQL + SQL SQL with PL/SQL Recursive SQL SQL model Time in s Time to process ~57,000 pairs 1,980
  16. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | Why is SQL so Slooooooowwwwww? • Processing many rows – ( n2 + n ) / 2 • Set logic => no exit early guarantees • lateral join => nested loops
  17. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | Further Reading Ask TOM Question https://asktom.oracle.com/pls/apex/asktom.search? tag=longest-common-substring-ratcliff-obershelp- similarity-algorithm On Cursors, Context Switches, and Mistakes https://blogs.oracle.com/oraclemagazine/on- cursors-context-switches-and-mistakes
  18. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | Ryan McGuire / Gratisography See you soon! asktom.oracle.com #AskTOMOfficeHours