Lenient HTML parsing with Purescript

Justin Woo
February 10, 2017

Small talk about my lenient HTML parsing Purescript library

https://github.com/justinwoo/purescript-lenient-html-parser

Transcript

  1. What is Purescript?

    • Functional programming language inspired by Haskell
    • Produces first-class Javascript without runtime costs
    • Strict evaluation
    • Great FFI: write as much or as little “raw” Javascript as you want
  2. What is a parser?

    Technically: “A parser is a software component that takes input data (frequently text) and builds a data structure” (Wikipedia)

    For my uses: something that will take my text and give me data in the type I specified, OR some error so I know what to do with it
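As a rough illustration of that working definition (not code from the talk), a parser can be thought of as a function from text to either an error or a typed value. `MyError`, `parseAge`, and the module name are hypothetical names invented for this sketch:

```purescript
module ParserIdea where

import Prelude

import Data.Either (Either(..))
import Data.Int (fromString)
import Data.Maybe (Maybe(..))

-- Hypothetical error type, invented for this sketch.
newtype MyError = MyError String

-- Text in; either an error to act on, or data in the type I asked for.
parseAge :: String -> Either MyError Int
parseAge input = case fromString input of
  Just n | n >= 0 -> Right n
  _ -> Left (MyError ("not a valid age: " <> input))
```

The caller is forced to handle both the `Left` and the `Right` case, which is the "so I know what to do with it" part of the definition.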
  3. What is HTML?

    This weird jumbled mess of XML that I have to get through in order to scrape pages

        <table>
          <tr>
            <td>
              <a href="//the-link-i-actually-want">
  4. Why “Lenient”?

    Because…
    • I’m too lazy to read the HTML spec
    • Many websites are broken; your web browser auto-corrects them
    • Want to greedily collect as much crap as possible
  5. Why write one?

    • For fun!
      ◦ For some twisted definition of “fun”
    • Existing implementations I saw try to nest tags correctly to preserve nesting
      ◦ I only want a flat list of tags, in that I accept that everything in HTML is suffering
    • Learning experience
    • Can’t do this with regex
      ◦ (correctly)
      ◦ (sanely)
    • Don’t want to just use cheerio via FFI because that’s not as cool
      ◦ Even though that’s what I did before out of laziness
  6. Modeling lenient HTML

    Use Algebraic Data Types!

        data Tag
          = TagOpen TagName Attributes
          | TagSingle TagName Attributes
          | TagClose TagName
          | TNode String

    I really have no idea what the actual HTML spec names for these are

        newtype TagName = TagName String
        type Attributes = List Attribute
        data Attribute = Attribute Name Value
        newtype Name = Name String
        newtype Value = Value String
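To make the payoff of a flat model concrete, here is a small sketch (not from the talk) that pulls `href` values out of a flat `List Tag` in a single pass, with no tree traversal. The type definitions are repeated from the slide so the module stands alone; `hrefs` and the module name are hypothetical.

```purescript
module FlatTagsSketch where

import Prelude

import Data.Foldable (find)
import Data.List (List, mapMaybe)
import Data.Maybe (Maybe(..))

-- Types repeated from the slide.
data Tag
  = TagOpen TagName Attributes
  | TagSingle TagName Attributes
  | TagClose TagName
  | TNode String

newtype TagName = TagName String
type Attributes = List Attribute
data Attribute = Attribute Name Value
newtype Name = Name String
newtype Value = Value String

-- Hypothetical helper: collect the href of every opening <a> tag.
-- Because the list is flat, there is no recursion into child nodes.
hrefs :: List Tag -> List String
hrefs = mapMaybe hrefOf
  where
  hrefOf (TagOpen (TagName "a") attrs) = valueOf <$> find isHref attrs
  hrefOf _ = Nothing

  isHref (Attribute (Name name) _) = name == "href"

  valueOf (Attribute _ (Value value)) = value
```

Anchors without an `href` attribute, closing tags, and text nodes all fall through to `Nothing` and are dropped by `mapMaybe`.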
  7. Purescript-string-parsers in a nutshell

    • Bunch of primitive combinators (“A function or definition with no free variables.”)
      ◦ char :: Char -> Parser Char
      ◦ string :: String -> Parser String
      ◦ Etc.
    • A runner
      ◦ runParser :: forall a. Parser a -> String -> Either ParseError a
    • (Some other utilities that are useful but that I don’t use for this project)
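A minimal sketch of how those pieces fit together, assuming the module layout the library had around the time of the talk (`Text.Parsing.StringParser` and `Text.Parsing.StringParser.String`). Later releases renamed these modules, so the imports may need adjusting for your installed version. `greeting` and the module name are hypothetical.

```purescript
module CombinatorSketch where

import Prelude

import Data.Either (Either)
import Text.Parsing.StringParser (ParseError, Parser, runParser)
import Text.Parsing.StringParser.String (char, string)

-- Hypothetical parser: the literal "hello" followed by '!'.
-- `<*` runs both parsers in order and keeps the result of the left one.
greeting :: Parser String
greeting = string "hello" <* char '!'

-- Right with the matched string on success, Left with a ParseError otherwise.
result :: Either ParseError String
result = runParser greeting "hello!"
```

Larger parsers in the rest of the talk are built the same way: small primitives glued together with `<*`, `*>`, `<|>`, and `do` blocks, then handed to `runParser` once at the end.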
  8. Building block 1: skip “space”

    I want to
    • Skip both whitespace and comments (I don’t need them!)
    • Recursively skip “spaces” (my grouping of those two)
      ◦ Purescript being strictly evaluated, this requires a trick called “fix”
        ▪ fix :: forall l. Lazy l => (l -> l) -> l
        ▪ Our Parser type comes with an instance for Lazy, so we’re good to go!
  9. Skip “space” cont.

    And so, my “algorithm”:
    1) Try to skip a comment and start over skipping spaces. If this fails, try the following:
    2) Try to skip one or many whitespace characters and start over skipping spaces. If this fails, try the following:
    3) There is no more to do. Return Unit.

        skipSpace :: Parser Unit
        skipSpace = fix \_ ->
          (comment *> skipSpace)
            <|> (many1 ws *> skipSpace)
            <|> pure unit
          where
            ws = satisfy \c ->
              c == '\n' || c == '\r' || c == '\t' || c == ' '
  10. Skip comment

    What’s a comment?
    • It begins with “<!--”
    • It has basically any character
    • It ends with “-->”
    • I don’t care about the contents, I just need it parsed away

        comment :: Parser Unit
        comment = do
          string "<!--"
          manyTill anyChar $ string "-->"
          pure unit
  11. BB 2&3: lexemes and name parser

    Lexeme: basically, the smallest block that I actually care about.

        lexeme :: forall p. Parser p -> Parser p
        lexeme p = p <* skipSpace

    I.e. for a given parser, I want to use it to parse my target and then snip off all of the trailing “spaces”

    Name: tag and attribute names can have just about any character in my lenient HTML, except for particles.

        validNameString :: Parser String
        validNameString =
          flattenChars
            <$> many1 (noneOf ['=', ' ', '<', '>', '/', '"'])

        flattenChars :: List Char -> String
        flattenChars = trim <<< fromCharArray <<< fromFoldable
  12. Getting to work

    • What I want in the end is a flat List of Tags
    • Tags are separated by arbitrary amounts of space (and text nodes)
    • My HTML input might have leading spaces, so I should clear those
    • A Tag is going to be either an actual tag or a text node

        tags :: Parser (List Tag)
        tags = do
          skipSpace
          many tag

        tag :: Parser Tag
        tag = lexeme do
          tagOpenOrSingleOrClose <|> tnode
  13. What is a text node?

    • Well, has text
    • Doesn’t have an angle bracket in the front
      ◦ All text nodes that want to display an angle bracket require escaping?
    • Just about any text up until we see the opening of a tag (i.e. an actual tag or comment block) can be considered part of the text node

        tnode :: Parser Tag
        tnode = lexeme do
          TNode <<< flattenChars <$> many1 (satisfy ((/=) '<'))
  14. What is a “normal tag”?

    • Starts with angle bracket
    • It could be a closing tag
      ◦ Close the tag by parsing out a slash
      ◦ Grab the name string
      ◦ Finalize by parsing out the closing angle bracket
      ◦ Return the closing tag
    • Otherwise, it’s an open tag or a single (“self-closing”??)

        tagOpenOrSingleOrClose :: Parser Tag
        tagOpenOrSingleOrClose = lexeme $
          char '<' *> (closeTag <|> tagOpenOrSingle)

        closeTag :: Parser Tag
        closeTag = lexeme do
          char '/'
          name <- validNameString
          char '>'
          pure $ TagClose (TagName name)
  15. What is an “open or single tag”?

    • Has a name
    • Has zero to many attributes
      ◦ First need to answer what an attribute is
    • What is an “attribute”?
      ◦ Has a name
      ◦ Might have an equal sign if it has a value
        ▪ We need to parse out ="
        ▪ Grab everything inside until the next quote
      ◦ Otherwise we can pretend it is attrib=""
        ▪ As far as I know

        attribute :: Parser Attribute
        attribute = lexeme do
          name <- validNameString
          value <- (flattenChars <$> getValue) <|> pure ""
          pure $ Attribute (Name name) (Value value)
          where
            getValue = string "=\"" *> manyTill (noneOf ['"']) (char '"')
  16. What is an “open or single tag”? cont.

    • Finally, we can do this!
    • Get our name
    • Get our zero-to-many attributes
    • Then we need to figure out
      ◦ If it ends with angle bracket right away, then it’s a normal open tag
      ◦ If it ends with slash-angle bracket, then we know it’s a single/“self-closing” tag
      ◦ Otherwise I think it’s fair to fail the parser here and complain about broken HTML.

        tagOpenOrSingle :: Parser Tag
        tagOpenOrSingle = lexeme do
          tagName <- lexeme $ TagName <$> validNameString
          attrs <- many attribute <|> pure mempty
          let spec' = spec tagName attrs
          closeTagOpen spec'
            <|> closeTagSingle spec'
            <|> fail "no closure in sight for tag opening"
          where
            spec tagName attrs constructor = constructor tagName attrs
            closeTagOpen f = char '>' *> pure (f TagOpen)
            closeTagSingle f = string "/>" *> pure (f TagSingle)
  17. That’s it!

    We handled all four cases of Tags that we’re interested in, so we’re done writing our parser. We can add a few convenience functions just for our sake:

        parse :: forall a. Parser a -> String -> Either ParseError a
        parse p s = runParser p s

        parseTags :: String -> Either ParseError (List Tag)
        parseTags s = parse tags s
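A caller might use `parseTags` roughly as follows. This is a sketch under two assumptions: that the library exposes `parseTags` from a module named `LenientHtmlParser` (the name used for the test suite later in the talk), and that `ParseError` has a `Show` instance. `summarize` and the module name `UsageSketch` are hypothetical.

```purescript
module UsageSketch where

import Prelude

import Data.Either (Either(..))
import Data.List (length)
-- Assumed module name; check the repository for the actual export list.
import LenientHtmlParser (parseTags)

-- Hypothetical helper: turn a parse result into a one-line report.
summarize :: String -> String
summarize html = case parseTags html of
  Right tags -> "parsed " <> show (length tags) <> " tags"
  Left err -> "could not parse: " <> show err
```

Since the result is an `Either`, the failure branch cannot be forgotten: the `case` has to say what happens when the HTML is too broken even for a lenient parser.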
  18. Setup and Utility methods

    • I used purescript-test-unit here
    • Comes with
      ◦ runTest (for the whole thing)
      ◦ Suite (identify your suite)
      ◦ Test (define test cases)
      ◦ assert/fail
    • Need utilities for
      ◦ Testing my parser, taking a parser and input string
      ◦ Testing for tags to be created from a snippet of HTML
    • Then need to throw everything at it

        testParser p s expected =
          case parse p s of
            Right x -> do
              assert "parsing worked:" $ x == expected
            Left e ->
              failure $ "parsing failed: " <> show e

        expectTags str exp =
          case parseTags str of
            Right x -> do
              assert "this should work" $ x == exp
            Left e -> do
              failure (show e)
  19. Parser tests

        main = runTest do
          suite "LenientHtmlParser" do
            test "tnode" $
              testParser tnode "a b c " $ TNode "a b c"
            test "attribute" $
              testParser attribute "abc=\"1223\"" $ Attribute (Name "abc") (Value "1223")
            test "empty attribute" $
              testParser attribute "abc=\"\"" $ Attribute (Name "abc") (Value "")
            test "tag close" $
              testParser tag "</crap>" $ TagClose (TagName "crap")
            test "tag single" $
              testParser tag "<crap/>" $ TagSingle (TagName "crap") mempty
            test "tag open" $
              testParser tag "<crap> " $ TagOpen (TagName "crap") mempty
            test "tag open with attr" $
              testParser tag "<crap a=\"sdf\"> " $ TagOpen (TagName "crap") (pure (Attribute (Name "a") (Value "sdf")))
  20. HTML parsing “real world” test sample

    Basically, just sanity tests with fixtures, e.g.

        testHtml = """
        <!DOCTYPE html>
        <!-- whatever -->
        <table>
          <tr>
            <td>Trash</td>
            <td class="target">
              <a href="http://mylink">
                [悪因悪果] 今季のゴミ - 01 [140p].avi
              </a>
            </td>
          </tr>
        </table>
        """

        test "parseTags" do
          expectTags testHtml expectedTestTags
        test "multiple comments" do
          expectTags testMultiCommentHtml expectedMultiCommentTestTags
        test "test fixtures/crap.html" do
          text <- readTextFile UTF8 "fixtures/crap.html"
          either (failure <<< show) (const success) (parseTags text)
  21. Conclusions

    Hopefully I’ve shown you that
    • Writing an HTML (or any format) parser in Purescript is fun
    • You don’t necessarily have to be an expert on FP crap to get started
      ◦ Functor, Applicative, Alternative, Monad, Monoid, Foldable, etc. were used here fairly transparently
      ◦ Why worry about abstract details when you know the concrete instantiation works?
    • Being able to model the right data structure in the beginning saves a whole lot of work
      ◦ E.g. what if our Tag type was just { type :: String, content :: { name :: String, attribute :: List String } }? This would allow us to represent too many impossible states and be frustrating
        ▪ It’d be hard to work with
        ▪ And the compiler would hardly know anything either
    • Repo here: https://github.com/justinwoo/purescript-lenient-html-parser
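The "impossible states" point can be sketched as follows. This is an illustration, not code from the talk: the record is the one from the slide with the `type` field renamed to `tagType` (a change made here only to sidestep the `type` keyword), and `LooseTag`, `nonsense`, and the module name are hypothetical.

```purescript
module ImpossibleStates where

import Data.List (List(..), (:))

-- The loose record from the slide, with `type` renamed to `tagType`.
type LooseTag =
  { tagType :: String
  , content :: { name :: String, attribute :: List String }
  }

-- This type-checks, yet it means nothing: a misspelled tag type,
-- and a "closing" tag that somehow carries an attribute.
nonsense :: LooseTag
nonsense =
  { tagType: "clsoe"
  , content: { name: "td", attribute: "class" : Nil }
  }
```

With the `Tag` ADT from slide 6 the same mistakes cannot be written down: `TagClose` takes only a `TagName`, so there is no slot for attributes, and a misspelled constructor is rejected by the compiler.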