Readable Regular Expressions in Java

Jeanne Boyarsky

July 20, 2023

Transcript

  1. twitter.com/jeanneboyarsky mastodon.social/@jeanneboyarsky
    History
    1943: Patterns in neuroscience
    1951: Stephen Kleene describing neural networks
    1960s: Pattern matching in text editors, lexical parsing in compilers
    1980s: Perl
    2002: Java 1.4, regex in core Java
  2. Greedy Quantifiers
    Symbol    Number of j’s
    j         1
    j?        0-1
    j*        0 or more
    j+        1 or more
    j{5}      5
    j{5,6}    5-6
    j{5,}     5 or more
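The quantifier table can be checked directly with `Pattern.matches`, which requires the whole input to match. This is a minimal sketch; the class and helper names are illustrative and not from the deck.

```java
import java.util.regex.Pattern;

class GreedyQuantifierDemo {
    // Hypothetical helper: true when the entire input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("j?", ""));             // true: zero j's is allowed
        System.out.println(matches("j?", "jj"));           // false: at most one j
        System.out.println(matches("j*", ""));             // true: zero or more
        System.out.println(matches("j+", ""));             // false: needs at least one
        System.out.println(matches("j{5}", "jjjjj"));      // true: exactly five
        System.out.println(matches("j{5,6}", "jjjjjjj"));  // false: seven is too many
        System.out.println(matches("j{5,}", "jjjjjjj"));   // true: five or more
    }
}
```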
  3. Puzzle time - teams
    Can you come up with 10 ways of matching your assigned regex? (Try at regex101.com if you aren’t sure what will match.)
    • One or more x’s
    • Zero or more x’s
    • Two x’s
    Note: for this game, you can only have two x’s in each regex.
  4. Sample solutions
    One or more: • x+ • x{1,} • xx* • x{1}x* • x{1,1}x* • x{1}x{0,} • x{1,1}x{0,} • xx{0,} • x{0,0}x{1,} • (x|x)+ • etc.
    Zero or more: • x* • x*x* • x{0}x* • x{0,}x* • x{0,0}x* • x*x{0} • x*x{0,} • x*x{0,0} • x{0}x{0,} • (x|x)* • etc.
    Two: • xx • x{2} • x{2,2} • x{1}x{1} • x{1,1}x{1,1} • x{0}x{2} • x{0,0}x{2} • x{2}x{0} • (x|x){2} • etc.
  5. Common Character Classes
    Regex       Matches
    [123]       Any of 1, 2 or 3
    [1-3]       Any of 1, 2 or 3
    [^5]        Any character but “5”
    [a-zA-Z]    Letter
    \d          Digit
    \s          Whitespace
    \w          Word character (letter, digit or underscore)
  6. Less Common Character Classes
    Regex             Matches           Longer form
    \D                Not digit         [^0-9]
    \S                Not whitespace    [^\s]
    \W                Not word char     [^a-zA-Z0-9_]
    [1-3[x-z]]        Union             [1-3x-z]
    [[m-p]&&[l-n]]    Intersection      [mn]
    [m-p&&[^o]]       Subtraction       [mnp]
    Clarity Understanding
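To see that the union, intersection and subtraction forms accept the same characters as their longer forms, one option is to test each candidate character on its own. This is a small sketch; `matching` and the candidate string are made up for illustration.

```java
import java.util.regex.Pattern;

class CharacterClassDemo {
    // Hypothetical helper: returns the candidate characters accepted by a single-character class.
    static String matching(String charClass, String candidates) {
        var pattern = Pattern.compile(charClass);
        var kept = new StringBuilder();
        for (char c : candidates.toCharArray()) {
            if (pattern.matcher(String.valueOf(c)).matches()) {
                kept.append(c);
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        var candidates = "0123lmnopxyz";
        System.out.println(matching("[1-3[x-z]]", candidates));      // 123xyz (union)
        System.out.println(matching("[[m-p]&&[l-n]]", candidates));  // mn (intersection)
        System.out.println(matching("[m-p&&[^o]]", candidates));     // mnp (subtraction)
    }
}
```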
  7. Match a Pattern
    var phoneNumbers = """
        111-111-1111
        222-222-2222
        """;
    var pattern = Pattern.compile("[0-9]{3}-[0-9]{3}-[0-9]{4}");
    var matcher = pattern.matcher(phoneNumbers);
    while (matcher.find())
        System.out.println(matcher.group());
    You promised readable regex!
  8. Refactored
    var phoneNumbers = """
        111-111-1111
        222-222-2222
        """;
    var threeDigits = "[0-9]{3}";
    var fourDigits = "[0-9]{4}";
    var dash = "-";
    var regex = threeDigits + dash + threeDigits + dash + fourDigits;
    var pattern = Pattern.compile(regex);
    var matcher = pattern.matcher(phoneNumbers);
    while (matcher.find())
        System.out.println(matcher.group());
  9. Escaping
    var phoneNumbers = """
        111-111-1111
        222-222-2222
        """;
    var threeDigits = "\\d{3}";
    var fourDigits = "\\d{4}";
    var dash = "-";
    var regex = threeDigits + dash + threeDigits + dash + fourDigits;
    var pattern = Pattern.compile(regex);
    var matcher = pattern.matcher(phoneNumbers);
    while (matcher.find())
        System.out.println(matcher.group());
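The doubled backslash on this slide is Java string escaping: the Java literal "\\d" is the two-character regex \d. The sketch below shows the same rule, plus the case where the input itself contains a backslash. The class and helper names are illustrative only.

```java
import java.util.regex.Pattern;

class EscapingDemo {
    // Hypothetical helper: true when the entire input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // The Java literal "\\d{3}" holds five characters: \ d { 3 }
        System.out.println("\\d{3}".length());         // 5
        System.out.println(matches("\\d{3}", "111"));  // true

        // A literal backslash in the input needs \\ in the regex, so four in Java source.
        System.out.println(matches("\\\\d", "\\d"));   // true: backslash followed by d

        // Pattern.quote is an alternative to escaping by hand.
        System.out.println(matches(Pattern.quote("\\") + "d", "\\d"));  // true
    }
}
```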
  10. Groups
    var phoneNumbers = """
        111-111-1111
        222-222-2222
        """;
    var areaCodeGroup = "(\\d{3})";
    var threeDigits = "\\d{3}";
    var fourDigits = "\\d{4}";
    var dash = "-";
    var regex = areaCodeGroup + dash + threeDigits + dash + fourDigits;
    var pattern = Pattern.compile(regex);
    var matcher = pattern.matcher(phoneNumbers);
    while (matcher.find())
        System.out.println(matcher.group(1));
  11. What is the output?
    var numbers = "012";
    var regex = "((\\d)(\\d))\\d";
    var pattern = Pattern.compile(regex);
    var matcher = pattern.matcher(numbers);
    while (matcher.find()) {
        System.out.format("%s %s ", matcher.group(), matcher.group(0));
        System.out.format("%s %s ", matcher.group(1), matcher.group(2));
        System.out.format("%s %s", matcher.group(3), matcher.group(4));
    }
    Output: 012 012 01 0
    Index out of bounds: no group 4
  12. Named Capturing Groups
    var phoneNumbers = """
        111-111-1111
        """;
    var areaCodeGroup = "(?<areaCode>\\d{3})";
    var threeDigits = "\\d{3}";
    var fourDigits = "\\d{4}";
    var dash = "-";
    var regex = areaCodeGroup + dash + threeDigits + dash + fourDigits;
    var pattern = Pattern.compile(regex);
    var matcher = pattern.matcher(phoneNumbers);
    while (matcher.find())
        System.out.println(matcher.group("areaCode"));
  13. Exact match
    var string = "Elevation high";
    var regex = "[a-zA-Z ]+";
    System.out.println(string.matches(regex));
    That’s a lot of ceremony!
  14. Replace
    var before = "123 Sesame Street";
    var after = before.replaceAll("\\d", "");
    System.out.println(after);
    Now THAT is easy to read
  15. What does this print?
    var string = "Mile High City";
    string = string.replaceAll("^\\w+", "");
    string = string.replaceAll("\\w+$", "");
    string = string.strip();
    System.out.println(string);
    Output: High
  16. With readability this time?
    var string = "Mile High City";
    var firstWord = "^\\w+";
    var lastWord = "\\w+$";
    string = string.replaceAll(firstWord, "");
    string = string.replaceAll(lastWord, "");
    string = string.strip();
    System.out.println(string);
  17. What about now?
    var string = "Mile High City";
    var boundaryAndWord = "\\b\\w+";
    string = string.replaceAll(boundaryAndWord, "");
    string = string.strip();
    System.out.println(string);
    Output: blank. Both the start of the string and the spaces are boundaries.
  18. What does this print?
    var text = "\\___/";
    var regex = "\\_.*/";
    System.out.println(text.matches(regex));
    Output: false
    Need four backslashes in the regex to print true.
  19. Flags
    Flag    Name                Purpose
    (?i)    CASE_INSENSITIVE    Case insensitive (ASCII)
    (?m)    MULTILINE           ^ and $ match at line breaks
    (?s)    DOTALL              . matches line break
    (?d)    UNIX_LINES          Only \n counts as a line break
    (?x)    COMMENTS            Ignores whitespace and # to end of line
    + Unicode ones
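Each embedded flag can be compared against the same pattern without it. This is a minimal sketch; the `found` helper and the sample inputs are illustrative and not from the deck.

```java
import java.util.regex.Pattern;

class FlagDemo {
    // Hypothetical helper: true when the regex is found anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        var lines = "a\nb\nc";
        System.out.println(found("hello", "HELLO"));       // false
        System.out.println(found("(?i)hello", "HELLO"));   // true: case insensitive
        System.out.println(found("^b$", lines));           // false: ^ and $ anchor the whole input
        System.out.println(found("(?m)^b$", lines));       // true: ^ and $ also anchor each line
        System.out.println(found("a.b", "a\nb"));          // false: . skips line breaks
        System.out.println(found("(?s)a.b", "a\nb"));      // true: . matches the line break
        // (?x) lets the pattern carry whitespace and a trailing comment
        System.out.println(found("(?x) \\d{5}  # five digits", "80202"));  // true
    }
}
```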
  20. Comments
    var fiveDigits = "\\d{5}";
    var optionalFourDigitSuffix = "(-\\d{4})?";
    var regex = fiveDigits + optionalFourDigitSuffix;
    var pattern = Pattern.compile(regex);

    var regex = """
        \\d{5}      # five digits
        (-\\d{4})?  # optional four digits
        """;
    var pattern = Pattern.compile(regex, Pattern.COMMENTS);
    Which is more readable? When would the other be?
  21. Embedding Flag
    var html = """
        <html>
        …
        <body>
        <p>Ready!</p>
        </body>
        </html>
        """;
    var body = html.replaceFirst("(?s)^.*<body>", "")
        .replaceFirst("(?s)</body>.*$", "")
        .strip();
    System.out.println(body);
    Output: <p>Ready!</p>
  22. So I have to say what I don’t want?
    var regex = "(?s).*<body>(.*)</body>.*";
    var body = html
        .replaceFirst(regex, "$1")
        .strip();
    System.out.println(body);
    Output: <p>Ready!</p>
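A replacement string can also refer to a named group with `${name}`, which may read better than `$1`. The sketch below assumes a group name of `content` and an inline HTML string; both are made up for illustration.

```java
class NamedGroupReplaceDemo {
    // Hypothetical helper: keeps only what sits between <body> and </body>.
    static String extractBody(String html) {
        var regex = "(?s).*<body>(?<content>.*)</body>.*";
        // ${content} refers to the named capturing group above
        return html.replaceFirst(regex, "${content}").strip();
    }

    public static void main(String[] args) {
        var html = "<html>\n<body>\n<p>Ready!</p>\n</body>\n</html>\n";
        System.out.println(extractBody(html));  // <p>Ready!</p>
    }
}
```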
  23. Huh?
    var dotAllMode = "(?s)";
    var anyChars = ".*";
    var captureAnyChars = "(.*)";
    var startBody = "<body>";
    var endBody = "</body>";
    var bodyPart = startBody + captureAnyChars + endBody;
    var regex = dotAllMode + anyChars + bodyPart + anyChars;
    var body = html.replaceFirst(regex, "$1")
        .strip();
    System.out.println(body);
    Output: <p>Ready!</p>
  24. Where did the close * go?
    var text = "* -aa- -b- *";
    var pattern = Pattern.compile("-([a-z]+)-");
    var matcher = pattern.matcher(text);
    var builder = new StringBuilder();
    while (matcher.find())
        matcher.appendReplacement(builder, "x");
    System.out.println(builder);
    Output: * x x
  25. Where did the close * go?
    var text = "* -aa- -b- *";
    var pattern = Pattern.compile("-([a-z]+)-");
    var matcher = pattern.matcher(text);
    var builder = new StringBuilder();
    while (matcher.find())
        matcher.appendReplacement(builder, "x");
    matcher.appendTail(builder);
    System.out.println(builder);
    Output: * x x *
  26. What does this do?
    var text = "* -aa- -b- *";
    var pattern = Pattern.compile("-([a-z]+)-");
    var matcher = pattern.matcher(text);
    var builder = new StringBuilder();
    while (matcher.find())
        matcher.appendReplacement(builder, "$");
    System.out.println(builder);
    IllegalArgumentException: Illegal group reference: group index is missing
  27. Fix
    var text = "* -aa- -b- *";
    var pattern = Pattern.compile("-([a-z]+)-");
    var matcher = pattern.matcher(text);
    var builder = new StringBuilder();
    while (matcher.find()) {
        var replace = Matcher.quoteReplacement("$");
        matcher.appendReplacement(builder, replace);
    }
    System.out.println(builder);
    Output: * $ $
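The same distinction applies to `String.replaceAll`: `$1` in a replacement is a group reference unless it is passed through `Matcher.quoteReplacement`. This is a short sketch; the helper names are illustrative.

```java
import java.util.regex.Matcher;

class ReplacementDemo {
    private static final String DASHED_WORD = "-([a-z]+)-";

    // Uses $1 as a reference to the captured word.
    static String bracket(String text) {
        return text.replaceAll(DASHED_WORD, "[$1]");
    }

    // Treats the replacement as plain text, even when it contains $ or \.
    static String literal(String text, String replacement) {
        return text.replaceAll(DASHED_WORD, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        var text = "* -aa- -b- *";
        System.out.println(bracket(text));        // * [aa] [b] *
        System.out.println(literal(text, "$1"));  // * $1 $1 *
    }
}
```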
  28. Quantifier Types
    Sample    Type          Description
    z?        Greedy        Read whole string and backtrack
    z??       Reluctant     Look at one character at a time
    z?+       Possessive    Read whole string/never backtrack
  29. Comparing
    var text = "Poem: row row row your boat";
    System.out.println(text.matches(".*(row )+your boat"));
    System.out.println(text.matches(".*?(row )+your boat"));
    System.out.println(text.matches(".*+(row )+your boat"));
    Output:
    true (extra backtracking)
    true (faster)
    false
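The three quantifier types are easier to tell apart with `find()`, where what gets matched differs and not only whether a match exists. This is a minimal sketch; the input string and helper name are assumptions for illustration.

```java
import java.util.regex.Pattern;

class QuantifierTypeDemo {
    // Hypothetical helper: returns the first match, or null when nothing matches.
    static String firstMatch(String regex, String input) {
        var matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        var text = "<a><b>";
        System.out.println(firstMatch("<.+>", text));   // <a><b>: greedy takes as much as it can
        System.out.println(firstMatch("<.+?>", text));  // <a>: reluctant stops at the first >
        System.out.println(firstMatch("<.++>", text));  // null: possessive never gives back the >
    }
}
```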
  30. Looking
    var text = "1 fish 2 fish red fish blue fish";
    var regex = "\\w+ fish(?! blue)";
    var pattern = Pattern.compile(regex);
    var matcher = pattern.matcher(text);
    while (matcher.find())
        System.out.println(matcher.group());
    Output:
    1 fish
    2 fish
    blue fish
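The negative lookahead `(?! blue)` has siblings: positive lookahead `(?=...)` and lookbehind `(?<=...)`. The sketch below reuses the slide's text; the `allMatches` helper is illustrative and relies on `Matcher.results()`, so it assumes Java 9 or later.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class LookaroundDemo {
    // Hypothetical helper: collects every match of the regex in the input.
    static List<String> allMatches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).results()
            .map(MatchResult::group)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        var text = "1 fish 2 fish red fish blue fish";
        // Positive lookahead: a word followed by " fish", without consuming " fish"
        System.out.println(allMatches("\\w+(?= fish)", text));       // [1, 2, red, blue]
        // Positive lookbehind: a word preceded by "red "
        System.out.println(allMatches("(?<=red )\\w+", text));       // [fish]
        // Negative lookahead, as on the slide
        System.out.println(allMatches("\\w+ fish(?! blue)", text));  // [1 fish, 2 fish, blue fish]
    }
}
```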
  31. Indexes
    var text = "i am sam. I am SAM. Sam i am";
    var pattern = Pattern.compile("(?i)sam");
    var matcher = pattern.matcher(text);
    while (matcher.find())
        System.out.println(matcher.start() + "-" + matcher.end());
    Output:
    5-8
    15-18
    20-23
  32. What’s wrong?
    Pattern regex = Pattern.compile("myRegex");
    Matcher matcher = regex.matcher("s");
    Performance, since the pattern is not static
    Readability tradeoff
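One common way to address the performance point is to compile the pattern once into a `static final` field and reuse it. The sketch below borrows the zip code regex from the Comments slide; the class, constant and method names are illustrative.

```java
import java.util.regex.Pattern;

class ZipCodes {
    private static final String FIVE_DIGITS = "\\d{5}";
    private static final String OPTIONAL_FOUR_DIGIT_SUFFIX = "(-\\d{4})?";

    // Compiled once when the class loads, then shared by every call.
    private static final Pattern ZIP_CODE =
        Pattern.compile(FIVE_DIGITS + OPTIONAL_FOUR_DIGIT_SUFFIX);

    static boolean isZipCode(String candidate) {
        return ZIP_CODE.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isZipCode("80202"));       // true
        System.out.println(isZipCode("80202-1234"));  // true
        System.out.println(isZipCode("8020"));        // false
    }
}
```

A `Pattern` is immutable and safe to share between threads; a `Matcher` is not, which is why a new one is created per call.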
  33. What’s wrong?
    if (dateString.matches("^(?:(?:31(\\/|-|\\.)(?:0?[13578]|"
        + "1[02]))\\1|(?:(?:29|30)(\\/|-|\\.)(?:0?[13-9]|1[0-2])\\2))"
        + "(?:(?:1[6-9]|[2-9]\\d)?\\d{2})$|^(?:29(\\/|-|\\.)0?2\\3(?:"
        + "(?:(?:1[6-9]|[2-9]\\d)?(?:0[48]|[2468][048]|[13579][26])|"
        + "(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\\d|"
        + "2[0-8])(\\/|-|\\.)(?:(?:0?[1-9])|(?:1[0-2]))\\4(?:"
        + "(?:1[6-9]|[2-9]\\d)?\\d{2})$")) {
        handleDate(dateString);
    }
    Too complicated
    I draw the line way before this :)
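One possible alternative, not taken from the deck, is to let `java.time` do the calendar arithmetic instead of encoding leap years in a regex. The sketch assumes a single `dd/MM/uuuu` format, which is narrower than what the regex above accepts (it also allows `-` and `.` separators, single-digit days and months, and two-digit years).

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;

class DateCheck {
    // STRICT rejects dates that do not exist, such as 31/04 or 29/02 in a non-leap year.
    private static final DateTimeFormatter DAY_MONTH_YEAR =
        DateTimeFormatter.ofPattern("dd/MM/uuuu").withResolverStyle(ResolverStyle.STRICT);

    // Hypothetical helper: true when the text is a real calendar date in dd/MM/uuuu form.
    static boolean isValid(String dateString) {
        try {
            LocalDate.parse(dateString, DAY_MONTH_YEAR);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("29/02/2024"));  // true: leap year
        System.out.println(isValid("29/02/2023"));  // false: not a leap year
        System.out.println(isValid("31/04/2023"));  // false: April has 30 days
    }
}
```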
  34. What’s wrong?
    String regex = request.getParameter("regex");
    String input = request.getParameter("input");
    return input.matches(regex);
    Denial of service opportunity. Need to validate.
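What validation looks like depends on the application. If users only need to search for literal text, one option is to quote their input so it never runs as a regex. This is a sketch under that assumption, not the deck's recommended fix; the class and method names are illustrative.

```java
import java.util.regex.Pattern;

class SafeSearch {
    // Hypothetical helper: treats the user-supplied text as a literal, not as a regex.
    static boolean containsLiteral(String input, String userText) {
        var literal = Pattern.compile(Pattern.quote(userText));
        return literal.matcher(input).find();
    }

    public static void main(String[] args) {
        // A regex with nested quantifiers is searched for as plain characters.
        var hostile = "(a+)+$";
        System.out.println(containsLiteral("aaaaaaaaaaaaaaaaaaaaaaaa!", hostile));  // false
        System.out.println(containsLiteral("price (a+)+$ here", hostile));          // true
    }
}
```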
  35. Beyond English
    Problem: "cc̈d̈d".replaceAll("[c̈d̈]", "X");
    Reason: Incorrectly assumes a Unicode grapheme cluster is one code point.
    Fix: "cc̈d̈d".replaceAll("c̈|d̈", "X");

    Problem: Pattern.compile("söme pättern", Pattern.CASE_INSENSITIVE);
    Reason: By default, case insensitive is ASCII only.
    Fix: Pattern.compile("söme pättern", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    Problem: Pattern p = Pattern.compile("é|ë|è");
    Reason: Could be code point or cluster.
    Fix: Pattern p = Pattern.compile("é|ë|è", Pattern.CANON_EQ);
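The ASCII-only default of `CASE_INSENSITIVE` can be seen by running the same match with and without `UNICODE_CASE`. In this sketch the helper name is illustrative, and the non-ASCII letters are written as Unicode escapes so the example does not depend on source file encoding.

```java
import java.util.regex.Pattern;

class UnicodeCaseDemo {
    // Hypothetical helper: case-insensitive whole-string match, optionally Unicode aware.
    static boolean matchesIgnoringCase(String regex, String input, boolean unicodeAware) {
        int flags = Pattern.CASE_INSENSITIVE;
        if (unicodeAware) {
            flags |= Pattern.UNICODE_CASE;
        }
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        var regex = "s\u00f6me";  // "söme" with a lowercase o-umlaut
        var input = "S\u00d6ME";  // "SÖME" with an uppercase O-umlaut
        System.out.println(matchesIgnoringCase(regex, input, false));  // false: only ASCII letters fold
        System.out.println(matchesIgnoringCase(regex, input, true));   // true
    }
}
```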
  36. Kotlin
    val text = "Mary had a little lamb"
    val regex = Regex("\\b\\w{3,4} ")
    print(regex.find(text)?.value)
    Output: Mary
  37. Kotlin
    val text = "Mary had a little lamb"
    val regex = "\\b\\w{3,4} ".toRegex()
    regex.findAll(text)
        .map { it.groupValues[0] }
        .forEach { print(it) }
    Output: Mary had
  38. Kotlin
    val text = "Mary had a little lamb."
    val wordBoundary = "\\b"
    val threeOrFourChars = "\\w{3,4}"
    val space = " "
    val regex = Regex(wordBoundary + threeOrFourChars + space)
    println(regex.replaceFirst(text, "_"))
    Output: _had a little lamb.
  39. Kotlin - SuperExpressive
    anyOf {
        string("hello")
            .digit()
            .word()
            .char('.')
            .char('#')
    }
    Justin Lee
    https://github.com/evanchooly/super-expressive
  40. Scala
    val text = "Mary had a little lamb"
    val regex = """\b\w{3,4} """.r
    val optional = regex findFirstIn text
    println(optional.getOrElse("No Match"))
    Output: Mary
  41. Scala
    val text = "Mary had a little lamb."
    val regex = """\b\w{3,4} """.r
    val it = regex findAllIn text
    it foreach print
    Output: Mary had
  42. Scala
    import scala.util.matching.Regex
    val text = "Mary had a little lamb."
    val wordBoundary = """\b"""
    val threeOrFourChars = """\w{3,4}"""
    val space = " "
    val regex = new Regex(wordBoundary + threeOrFourChars + space)
    println(regex replaceFirstIn(text, "_"))
    Output: _had a little lamb.
  43. Groovy
    def text = 'Mary had a little lamb'
    def regex = /\b\w{3,4} /
    def matcher = text =~ regex
    print matcher[0]
    Output: Mary
  44. Groovy
    def text = 'Mary had a little lamb'
    def regex = /\b\w{3,4} /
    def matcher = text =~ regex
    print matcher.findAll().join(' ')
    Output: Mary had
  45. Groovy
    def text = 'Mary had a little lamb.'
    def wordBoundary = "\\b"
    def threeOrFourChars = "\\w{3,4}"
    def space = " "
    def regex = /$wordBoundary$threeOrFourChars$space/
    println text.replaceFirst(regex) { it -> '_' }
    println text.replaceFirst(regex, '_')
    Output:
    _had a little lamb.
    _had a little lamb.
  46. Clojure
    (ns clojure.examples.example
      (:gen-class))
    (defn Replacer []
      (def text "Mary had a little lamb.")
      (def wordBoundary "\\b")
      (def threeOrFourChars "\\w{3,4}")
      (def space " ")
      (def regex (str wordBoundary threeOrFourChars space))
      (def pat (re-pattern regex))
      (println (clojure.string/replace-first text pat "_")))
    (Replacer)
    Output: _had a little lamb.
  47. Puzzle Time
    Challenge before book draw: regexcrossword.com
    Answer key: https://github.com/deepaksood619/RegexCrossword
    Experienced: questionably tough; I needed the answer key for two