Sentence Segmentation or Sentence Tokenization is the process of identifying different sentences among group of words. Spacy library designed for Natural Language Processing, perform the sentence segmentation with much higher accuracy. However, lets first talk about, how we as a human identify the start and end of the sentence? Mostly with the help of the punctuation, right? And in most of the cases we say a sentence ends with a dot ‘.’ character. So with this basic idea, we would say that we can split the string based on dot and get the different sentences. Do you think this logic would be enough to get all sentence tokens?Read more
Tag Archives: spaCy
Parts of Speech tagging is the next step of the tokenization. Once we have done tokenization, spaCy can parse and tag a given Doc. spaCy is pre-trained using statistical modelling. This model consists of binary data and is trained on enough examples to make predictions that generalize across the language. Example, a word following “the” in English is most likely a noun.Read more
A Quick Guide to Tokenization, Lemmatization, Stop Words, and Phrase Matching using spaCy | NLP | Part 2
“spaCy” is designed specifically for production use. It helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning. In this article you will learn about Tokenization, Lemmatization, Stop Words and Phrase Matching operations using spaCy.Read more
What is Spacy
spaCy is an open-source Python library that parses and “understands” large volumes of text.
(You can download the complete Notebook from here.)Read more