processing capabilities. The JVM is a high-performance platform with true multi-threading capabilities. Excellent Java libraries for natural language processing exist.
of segmenting a text into sentences. The problem is harder than it looks:
• Ruby is awesome. Ruby is great!
• “Stop it!”, Mr. Smith shouted across the yard. He was clearly angry.
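To see why the second example is harder than the first, here is a minimal sketch of a naive splitter that breaks after every `.`, `!` or `?` followed by whitespace. The method name `naive_sentences` is made up for this illustration; it is a deliberately simplistic baseline and not how OpenNLP detects sentences.

```ruby
# Naive baseline (hypothetical helper): split after '.', '!' or '?'
# whenever the punctuation mark is followed by whitespace.
def naive_sentences(text)
  text.split(/(?<=[.!?])\s+/)
end

easy = 'Ruby is awesome. Ruby is great!'
hard = '"Stop it!", Mr. Smith shouted across the yard. He was clearly angry.'

p naive_sentences(easy)
# => ["Ruby is awesome.", "Ruby is great!"]

p naive_sentences(hard)
# The abbreviation "Mr." is mistaken for a sentence end, so the text is
# cut into three pieces instead of two.
```

The rule works for the first example, but abbreviations such as "Mr." defeat it, which is one reason a trained sentence detection model is used instead.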
Named entities are noun phrases that refer to individuals, organizations, locations, etc. Named Entity Recognition is concerned with identifying named entities in a given text.
summer EuRuKo comes to Athens for two days on the 28th and 29th of June .]

m = OpenNLP::Models.named_entity_recognition_model(:location)
f = OpenNLP::NameFinder.new(m)
ranges = f.process(tokens)
ranges.map { |r| tokens[r] } # => ["Athens"]

Named Entity Recognition
pipeline:
• Operation represents a single processing component.
• ComposedOperation represents a processing pipeline, but can also be used as a component in another pipeline.
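The relationship between the two building blocks can be sketched in plain Ruby. The classes `MiniOperation` and `MiniPipeline` and the example steps below are hypothetical stand-ins written for this illustration; they are not the composable_operations implementation, only an assumption about the general idea: a pipeline is itself an operation, so it can be nested inside another pipeline.

```ruby
# Hypothetical stand-in for a single processing component.
class MiniOperation
  def self.perform(input)
    new.execute(input)
  end
end

# Hypothetical stand-in for a pipeline. It inherits from MiniOperation,
# so a pipeline can be used wherever a single operation is expected.
class MiniPipeline < MiniOperation
  def self.steps
    @steps ||= []
  end

  def self.use(operation)
    steps << operation
  end

  # Feed the output of each step into the next one.
  def execute(input)
    self.class.steps.reduce(input) { |value, step| step.perform(value) }
  end
end

class Strip < MiniOperation
  def execute(text)
    text.strip
  end
end

class Downcase < MiniOperation
  def execute(text)
    text.downcase
  end
end

class SplitWords < MiniOperation
  def execute(text)
    text.split
  end
end

class Normalize < MiniPipeline
  use Strip
  use Downcase
end

class Words < MiniPipeline
  use Normalize # a pipeline used as a component of another pipeline
  use SplitWords
end

p Words.perform('  Ruby IS awesome ')
# => ["ruby", "is", "awesome"]
```

Keeping the step list in a class-level instance variable means each pipeline class has its own list, so `Normalize` and `Words` do not interfere with each other.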
require 'composable_operations'
include ComposableOperations

class SentenceDetection < Operation
  processes :text
  property :language, default: :en, converts: :to_sym, required: true, accepts: [:en, :de]

  def execute
    detector = OpenNLP::SentenceDetector.new(model)
    detector.process(text)
  end

  protected

  # Select the sentence detection model for the configured language.
  def model
    case language
    when :en
      OpenNLP::English.sentence_detection_model
    when :de
      OpenNLP::German.sentence_detection_model
    end
  end
end

Pre-Processing Pipeline
include ComposableOperations

class Tokenization < Operation
  processes :sentences
  property :language, default: :en, converts: :to_sym, required: true, accepts: [:en, :de]

  def execute
    tokenizer = OpenNLP::Tokenizer.new(model)
    Array(sentences).map do |sentence|
      tokenizer.process(sentence)
    end
  end

  protected

  def model
    # ...
  end
end

Pre-Processing Pipeline
require 'composable_operations'
include ComposableOperations

class POSTagging < Operation
  processes :sentences
  property :language, default: :en, converts: :to_sym, required: true, accepts: [:en, :de]

  def execute
    tagger = OpenNLP::POSTagger.new(model)
    # Pair every token with its tag, sentence by sentence.
    sentences.map.with_index do |sent, sent_idx|
      tags = tagger.process(sent)
      tags.map.with_index do |tag, tkn_idx|
        [sentences[sent_idx][tkn_idx], tag]
      end
    end
  end

  protected

  def model
    # ...
  end
end

Pre-Processing Pipeline
CooccurrenceCalculation
  use CooccurrenceGraphConstruction
  use PageRankCalculation
  use NodeSortingAndExtraction
end

KeywordRanking.perform(...)

Keyword Extraction Pipeline
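The slides do not show what the individual stages do internally. The following plain-Ruby sketch is therefore an assumption: a TextRank-style reading of the four stage names (co-occurrence counting, graph construction, PageRank, sorting). All method names, the window size, the damping factor and the sample tokens are invented for this illustration.

```ruby
# Stage 1 (assumed): count how often two distinct words appear within
# `window` tokens of each other.
def cooccurrences(tokens, window: 2)
  counts = Hash.new(0)
  tokens.each_with_index do |word, i|
    tokens[(i + 1)..(i + window)].to_a.each do |other|
      next if other == word
      counts[[word, other].sort] += 1
    end
  end
  counts
end

# Stage 2 (assumed): turn the pair counts into an undirected, weighted
# adjacency list.
def build_graph(counts)
  graph = Hash.new { |hash, key| hash[key] = Hash.new(0) }
  counts.each do |(a, b), weight|
    graph[a][b] += weight
    graph[b][a] += weight
  end
  graph
end

# Stage 3 (assumed): weighted PageRank by power iteration with a fixed
# number of rounds.
def page_rank(graph, damping: 0.85, iterations: 50)
  nodes = graph.keys
  ranks = nodes.to_h { |node| [node, 1.0 / nodes.size] }
  iterations.times do
    ranks = nodes.to_h do |node|
      incoming = graph[node].sum do |neighbour, weight|
        ranks[neighbour] * weight / graph[neighbour].values.sum
      end
      [node, (1 - damping) / nodes.size + damping * incoming]
    end
  end
  ranks
end

# Stage 4 (assumed): sort the nodes by score and keep the best ones.
def top_keywords(ranks, limit: 3)
  ranks.sort_by { |word, score| [-score, word] }.first(limit).map(&:first)
end

tokens = %w[ruby nlp ruby jvm ruby pipeline ruby library]
ranks = page_rank(build_graph(cooccurrences(tokens)))
p top_keywords(ranks)
# "ruby" co-occurs with every other word and therefore ranks first.
```

In a real pipeline the tokens would come from the pre-processing steps, typically filtered by part-of-speech tag so that only candidate words such as nouns and adjectives enter the graph.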
GitHub: t6d
Twitter: t6d

Code can be found on GitHub:
* http://github.com/t6d/opennlp
* http://github.com/t6d/opennlp-english
* http://github.com/t6d/opennlp-german
* http://github.com/t6d/opennlp-examples
* http://github.com/t6d/keyword_extractor
* http://github.com/t6d/composable_operations
* http://github.com/t6d/smart_properties

Any questions? Feel free to approach me anytime throughout the conference or send me a tweet, if that’s what you prefer.

Summary