Ferret: an open-source library to extract data from web news pages

Victor Martinez

November 25, 2017

Transcript

  1. Federal University of Bahia, Computer Science Department

    Ferret: an open-source library to extract data from web news pages. Victor Martinez. Advisor: Ivan Machado
  6. URL, language / HTML → Ferret → JSON

    { 'title': 'This is the title', 'publish_date': '2017-04-06T14:00:00', 'content': '<p>Dissertation … </p>', 'lang': 'en', 'html': '<!DOCTYPE><head>' }
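
The record on this slide can be mimicked with a small sketch using only Python's standard library. This is not Ferret's API: `build_record` and `_TitleParser` are hypothetical names, only the title is actually extracted here, and the remaining fields are passed in by the caller so the output has the same five keys as the slide.

```python
import json
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Hypothetical helper: collects the text of the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        # Only start collecting if no title has been captured yet.
        if tag == "title" and not self.title:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def build_record(html, lang, publish_date=None, content=None):
    """Return a dict with the same keys as the JSON shown on the slide.

    Assumption: date and content extraction are out of scope for this
    sketch, so those values are supplied by the caller.
    """
    parser = _TitleParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "publish_date": publish_date,
        "content": content,
        "lang": lang,
        "html": html,
    }


page = (
    "<!DOCTYPE html><head><title>This is the title</title></head>"
    "<body><p>Dissertation</p></body>"
)
record = build_record(
    page,
    lang="en",
    publish_date="2017-04-06T14:00:00",
    content="<p>Dissertation</p>",
)
print(json.dumps(record, indent=2))
```

Keeping the raw `html` in the record, as the slide does, lets a consumer re-run extraction later without fetching the page again.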
  8. Baeza-Yates and Ribeiro-Neto, 2013

    There are many pages on the Web for which the HTML does not adhere to the HTML specification correctly.
  9. Ofuonye et al., 2010

    Approximately 95% of HTML documents on the web do not adhere to W3C HTML standards.
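
Because so much markup is non-conforming, an extractor generally needs a parser that tolerates errors rather than rejecting the page. The sketch below illustrates that idea with Python's standard `html.parser`, which does not raise on unclosed or misnested tags. It is only an illustration of lenient parsing, not the method Ferret uses; `ParagraphText` and `paragraphs_from` are hypothetical names.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Hypothetical helper: gathers the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._open = False

    def handle_starttag(self, tag, attrs):
        # A new <p> starts a new paragraph even if the previous one
        # was never closed.
        if tag == "p":
            self.paragraphs.append("")
            self._open = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._open = False

    def handle_data(self, data):
        if self._open:
            self.paragraphs[-1] += data


def paragraphs_from(html):
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return [p.strip() for p in parser.paragraphs if p.strip()]


# Unclosed <p>, misnested </p></b>, and an unclosed <div>.
malformed = "<p>First paragraph<p>Second <b>paragraph</p></b><div>footer"
result = paragraphs_from(malformed)
print(result)
```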
  12. Research Contributions

    1. A study on Data Mining, Web Mining and Web Article Extraction
    2. A study aimed to extract data from web news pages
    3. Ferret: an open-source library to extract data from web news pages
  15. Future Work

    1. Stimulate contributions
    2. Quality Attributes
    3. Extract other elements
    4. Work with other languages
    5. Benchmark with existing projects
    6. Test and analysis of content extraction