Parts of Speech Tagging and Dependency Parsing using spaCy | NLP | Part 3

Part-of-speech (POS) tagging is the step that follows tokenization. Once the text has been tokenized, spaCy can parse and tag a given Doc using a trained statistical model. The model is distributed as binary data and is trained on enough examples to make predictions that generalize across the language. For example, a word following “the” in English is most likely a noun.

Finding the correct part of speech is challenging for several reasons:

  • Enabling a machine to understand and process raw text is not easy.
  • The same word can play different roles in different contexts within a sentence.
  • Sometimes completely different words convey almost the same meaning.
  • Even splitting text into useful word-like units can be difficult in many languages.

While it’s possible to solve some problems starting from only the raw characters, it’s usually better to use linguistic knowledge to add useful information. That’s exactly what spaCy is designed to do: you put in raw text and get back a Doc object that comes with a variety of annotations.
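As a minimal sketch of "raw text in, Doc out", the snippet below uses spacy.blank("en"), which provides the tokenizer only. That choice is an assumption made here so the example runs without downloading a trained pipeline; a trained pipeline is what adds the POS and dependency annotations discussed in this article.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only, no tagger or parser).
# A trained pipeline would additionally fill in tags and dependencies.
nlp = spacy.blank("en")

doc = nlp("spaCy tags words.")            # raw text in, Doc object out
tokens = [token.text for token in doc]    # a Doc is a sequence of Token objects
print(type(doc).__name__, tokens)
```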

Reference: spacy.io/

This is Part 3 of the NLP spaCy series of articles. You can find the first two parts at the links below:

Part 1: spacy-installation-and-basic-operations-nlp-text-processing-library

Part 2: guide-to-tokenization-lemmatization-stop-words-and-phrase-matching-using-spacy

In this section we’ll cover coarse POS tags (noun, verb, adjective), fine-grained tags (plural noun, past-tense verb, superlative adjective), dependency parsing, and visualization of the dependency tree.


Coarse-grained Part-of-speech Tags

Every token is assigned a POS Tag from the following list:

| POS | DESCRIPTION | EXAMPLES |
|---|---|---|
| ADJ | adjective | big, old, green, incomprehensible, first |
| ADP | adposition | in, to, during |
| ADV | adverb | very, tomorrow, down, where, there |
| AUX | auxiliary | is, has (done), will (do), should (do) |
| CONJ | conjunction | and, or, but |
| CCONJ | coordinating conjunction | and, or, but |
| DET | determiner | a, an, the |
| INTJ | interjection | psst, ouch, bravo, hello |
| NOUN | noun | girl, cat, tree, air, beauty |
| NUM | numeral | 1, 2017, one, seventy-seven, IV, MMXIV |
| PART | particle | ’s, not |
| PRON | pronoun | I, you, he, she, myself, themselves, somebody |
| PROPN | proper noun | Mary, John, London, NATO, HBO |
| PUNCT | punctuation | ., (, ), ? |
| SCONJ | subordinating conjunction | if, while, that |
| SYM | symbol | $, %, §, ©, +, −, ×, ÷, =, :), 😝 |
| VERB | verb | run, runs, running, eat, ate, eating |
| X | other | sfpksdpsxmsa |
| SPACE | space | |
Coarse-grained POS Tags

Fine-grained Part-of-speech Tags

Tokens are subsequently given a fine-grained tag as determined by morphology:

| POS | POS Description | Fine-grained Tag | Tag Description | Morphology |
|---|---|---|---|---|
| ADJ | adjective | AFX | affix | Hyph=yes |
| ADJ | | JJ | adjective | Degree=pos |
| ADJ | | JJR | adjective, comparative | Degree=comp |
| ADJ | | JJS | adjective, superlative | Degree=sup |
| ADJ | | PDT | predeterminer | AdjType=pdt PronType=prn |
| ADJ | | PRP$ | pronoun, possessive | PronType=prs Poss=yes |
| ADJ | | WDT | wh-determiner | PronType=int rel |
| ADJ | | WP$ | wh-pronoun, possessive | Poss=yes PronType=int rel |
| ADP | adposition | IN | conjunction, subordinating or preposition | |
| ADV | adverb | EX | existential there | AdvType=ex |
| ADV | | RB | adverb | Degree=pos |
| ADV | | RBR | adverb, comparative | Degree=comp |
| ADV | | RBS | adverb, superlative | Degree=sup |
| ADV | | WRB | wh-adverb | PronType=int rel |
| CONJ | conjunction | CC | conjunction, coordinating | ConjType=coor |
| DET | determiner | DT | determiner | |
| INTJ | interjection | UH | interjection | |
| NOUN | noun | NN | noun, singular or mass | Number=sing |
| NOUN | | NNS | noun, plural | Number=plur |
| NOUN | | WP | wh-pronoun, personal | PronType=int rel |
| NUM | numeral | CD | cardinal number | NumType=card |
| PART | particle | POS | possessive ending | Poss=yes |
| PART | | RP | adverb, particle | |
| PART | | TO | infinitival to | PartType=inf VerbForm=inf |
| PRON | pronoun | PRP | pronoun, personal | PronType=prs |
| PROPN | proper noun | NNP | noun, proper singular | NounType=prop Number=sing |
| PROPN | | NNPS | noun, proper plural | NounType=prop Number=plur |
| PUNCT | punctuation | -LRB- | left round bracket | PunctType=brck PunctSide=ini |
| PUNCT | | -RRB- | right round bracket | PunctType=brck PunctSide=fin |
| PUNCT | | , | punctuation mark, comma | PunctType=comm |
| PUNCT | | : | punctuation mark, colon or ellipsis | |
| PUNCT | | . | punctuation mark, sentence closer | PunctType=peri |
| PUNCT | | '' | closing quotation mark | PunctType=quot PunctSide=fin |
| PUNCT | | "" | closing quotation mark | PunctType=quot PunctSide=fin |
| PUNCT | | `` | opening quotation mark | PunctType=quot PunctSide=ini |
| PUNCT | | HYPH | punctuation mark, hyphen | PunctType=dash |
| PUNCT | | LS | list item marker | NumType=ord |
| PUNCT | | NFP | superfluous punctuation | |
| SYM | symbol | # | symbol, number sign | SymType=numbersign |
| SYM | | $ | symbol, currency | SymType=currency |
| SYM | | SYM | symbol | |
| VERB | verb | BES | auxiliary “be” | |
| VERB | | HVS | forms of “have” | |
| VERB | | MD | verb, modal auxiliary | VerbType=mod |
| VERB | | VB | verb, base form | VerbForm=inf |
| VERB | | VBD | verb, past tense | VerbForm=fin Tense=past |
| VERB | | VBG | verb, gerund or present participle | VerbForm=part Tense=pres Aspect=prog |
| VERB | | VBN | verb, past participle | VerbForm=part Tense=past Aspect=perf |
| VERB | | VBP | verb, non-3rd person singular present | VerbForm=fin Tense=pres |
| VERB | | VBZ | verb, 3rd person singular present | VerbForm=fin Tense=pres Number=sing Person=3 |
| X | other | ADD | email | |
| X | | FW | foreign word | Foreign=yes |
| X | | GW | additional word in multi-word expression | |
| X | | XX | unknown | |
| SPACE | space | _SP | space | |
| | | NIL | missing tag | |
Fine-grained Tags

View token tags

Recall from the tokenization article that we can obtain a particular token by its index position.

  • To view the coarse POS tag use token.pos_
  • To view the fine-grained tag use token.tag_
  • To view the description of either type of tag use spacy.explain(tag)

spaCy encodes all strings as hash values to reduce memory usage and improve efficiency. As a result, token.pos and token.tag return integer IDs. To get the readable string form of an attribute, add an underscore to its name (token.pos_, token.tag_); the text equivalents live in doc.vocab.
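The difference between the integer and string attributes can be sketched as follows. This example assumes spaCy v3, whose Doc constructor accepts pos and tags directly; the annotations are supplied by hand here so that no trained pipeline is needed, whereas in practice they would be predicted by calling a trained nlp object on raw text.

```python
import spacy
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Assumption: hand-written annotations (spaCy v3 Doc constructor) stand in
# for the predictions of a trained pipeline.
words = ["The", "fox", "jumped"]
doc = Doc(Vocab(), words=words,
          pos=["DET", "NOUN", "VERB"],
          tags=["DT", "NN", "VBD"])

for token in doc:
    # token.pos / token.tag are integer IDs; the underscore forms are strings.
    print(token.text, token.pos, token.pos_, token.tag, token.tag_,
          spacy.explain(token.tag_))
```

spacy.explain() works for both coarse and fine-grained tags, so spacy.explain(token.pos_) can be used in the same way.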

Note: When printing token attributes side by side in an f-string, a nested width specifier such as {10} (for example {token.text:{10}}) pads the value to a minimum width of 10 characters, counted from the first character of the token. It is there only to line up the columns and make the output easier to read. You can use any number in place of 10 to get the spacing you want.
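The width specifier can be tried out in plain Python, without spaCy. The rows below are hypothetical (text, POS, tag) values standing in for token attributes.

```python
# Hypothetical (text, POS, tag) rows standing in for token attributes.
rows = [("The", "DET", "DT"), ("fox", "NOUN", "NN"), ("jumped", "VERB", "VBD")]

lines = []
for text, pos, tag in rows:
    # {text:{10}} pads text to a minimum width of 10 characters,
    # {pos:{8}} pads the POS tag to a minimum width of 8.
    lines.append(f"{text:{10}} {pos:{8}} {tag}")

print("\n".join(lines))
```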

Working with POS Tags

  • In the English language, it is very common that the same string of characters can have different meanings, even within the same sentence.
  • For this reason, morphology is important.
  • spaCy uses machine learning algorithms to best predict the use of a token in a sentence.
  • Is “I read books on NLP” present or past tense?
  • Is wind a verb or a noun?

Let’s look at these questions with the help of two examples.

In the first example, spaCy assumed that “read” was present tense.
In the second example, the present-tense form would be “I am reading a book”, so spaCy assigned the past tense.
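A comparison like the one described above can be sketched with a small helper. The name tag_of is hypothetical, and the commented usage assumes a trained English pipeline (en_core_web_sm is used as an example and must be installed separately). The second sentence in the usage is an assumed stand-in for the article's second example, and the tags you get depend on the pipeline version.

```python
import spacy

def tag_of(nlp, text, word):
    """Return (fine-grained tag, explanation) for the first token whose text
    equals `word`, or None if the word does not occur in `text`.
    `nlp` is any spaCy pipeline; without a tagger the tag is an empty string."""
    for token in nlp(text):
        if token.text == word:
            explanation = spacy.explain(token.tag_) if token.tag_ else None
            return (token.tag_, explanation)
    return None

# Usage with a trained pipeline (assumed to be installed):
#   nlp = spacy.load("en_core_web_sm")
#   print(tag_of(nlp, "I read books on NLP.", "read"))
#   print(tag_of(nlp, "I read a book on NLP.", "read"))
```

Because the tagger looks at the surrounding words, the same string “read” can receive different tags in the two sentences.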

Counting POS Tags

The Doc.count_by() method accepts a specific token attribute as its argument, and returns a frequency count of the given attribute as a dictionary object. Keys in the dictionary are the integer values of the given attribute ID, and values are the frequency. Counts of zero are not included.

For example, an entry of 96: 1 means that the tag with ID 96 appeared once, and 83: 3 means that the tag with ID 83 appeared three times in the sentence. This isn’t very helpful until you decode the attribute ID by looking it up in doc.vocab.
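A minimal sketch of counting and decoding, assuming spaCy v3 and a hand-annotated Doc in place of a trained pipeline's output. The integer IDs printed by the first print call depend on the spaCy version, so none are shown here.

```python
from spacy.attrs import POS
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Assumption: POS tags are written by hand instead of being predicted.
words = ["The", "quick", "fox", "saw", "the", "dog", "."]
pos = ["DET", "ADJ", "NOUN", "VERB", "DET", "NOUN", "PUNCT"]
doc = Doc(Vocab(), words=words, pos=pos)

pos_counts = doc.count_by(POS)    # {attribute ID: frequency}
print(pos_counts)

# Decode each integer ID back to its string through the vocab's string store.
decoded = {doc.vocab.strings[k]: v for k, v in pos_counts.items()}
print(decoded)
```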

Create a frequency list of POS tags from the entire document

Since POS_counts is a dictionary, we can obtain a list of (key, value) pairs with POS_counts.items().
By sorting the list we have access to each tag and its count, in order.

In each pair, k holds the integer ID of the tag and v holds its frequency.
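The sorted frequency list can be sketched like this, assuming spaCy v3 and a hand-annotated Doc. The variable name POS_counts follows the text above; report is a hypothetical name used to collect the rows.

```python
from spacy.attrs import POS
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Assumption: POS tags are written by hand instead of being predicted.
words = ["She", "saw", "a", "red", "car", "and", "a", "blue", "bike"]
pos = ["PRON", "VERB", "DET", "ADJ", "NOUN", "CCONJ", "DET", "ADJ", "NOUN"]
doc = Doc(Vocab(), words=words, pos=pos)

POS_counts = doc.count_by(POS)

report = []
for k, v in sorted(POS_counts.items()):      # sorted by integer ID
    name = doc.vocab.strings[k]              # decode the ID to its tag name
    report.append((k, name, v))
    print(f"{k}. {name:{5}}: {v}")
```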

Counting Fine-grained Tags

Why did the ID numbers get so big? In spaCy, certain text values are hardcoded into Doc.vocab and take up the first several hundred ID numbers. Strings like ‘NOUN’ and ‘VERB’ are used frequently by internal operations. Others, like fine-grained tags, are assigned hash values as needed.

Why don’t SPACE tags appear? In spaCy, only strings of spaces (two or more) are assigned tokens. Single spaces are not.
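Counting fine-grained tags works the same way with spacy.attrs.TAG. The sketch below assumes spaCy v3 and hand-written tags; printing the raw keys shows the hash values that the paragraph above refers to.

```python
from spacy.attrs import TAG
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Assumption: fine-grained tags are written by hand instead of being predicted.
words = ["The", "quick", "fox", "saw", "the", "dog", "."]
tags = ["DT", "JJ", "NN", "VBD", "DT", "NN", "."]
doc = Doc(Vocab(), words=words, tags=tags)

tag_counts = doc.count_by(TAG)    # keys are hash values of the tag strings

for k, v in sorted(tag_counts.items()):
    print(f"{k}. {doc.vocab.strings[k]:{4}}: {v}")

decoded = {doc.vocab.strings[k]: v for k, v in tag_counts.items()}
```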

Fine-grained POS Tag Examples

These are some grammatical examples of specific fine-grained tags. We’ve removed punctuation and rarely used tags:

| POS | TAG | DESCRIPTION | EXAMPLE |
|---|---|---|---|
| ADJ | AFX | affix | The Flintstones were a pre-historic family. |
| ADJ | JJ | adjective | This is a good sentence. |
| ADJ | JJR | adjective, comparative | This is a better sentence. |
| ADJ | JJS | adjective, superlative | This is the best sentence. |
| ADJ | PDT | predeterminer | Waking up is half the battle. |
| ADJ | PRP$ | pronoun, possessive | His arm hurts. |
| ADJ | WDT | wh-determiner | It’s blue, which is odd. |
| ADJ | WP$ | wh-pronoun, possessive | We don’t know whose it is. |
| ADP | IN | conjunction, subordinating or preposition | It arrived in a box. |
| ADV | EX | existential there | There is cake. |
| ADV | RB | adverb | He ran quickly. |
| ADV | RBR | adverb, comparative | He ran quicker. |
| ADV | RBS | adverb, superlative | He ran fastest. |
| ADV | WRB | wh-adverb | When was that? |
| CONJ | CC | conjunction, coordinating | The balloon popped and everyone jumped. |
| DET | DT | determiner | This is a sentence. |
| INTJ | UH | interjection | Um, I don’t know. |
| NOUN | NN | noun, singular or mass | This is a sentence. |
| NOUN | NNS | noun, plural | These are words. |
| NOUN | WP | wh-pronoun, personal | Who was that? |
| NUM | CD | cardinal number | I want three things. |
| PART | POS | possessive ending | Fred’s name is short. |
| PART | RP | adverb, particle | Put it back! |
| PART | TO | infinitival to | I want to go. |
| PRON | PRP | pronoun, personal | I want you to go. |
| PROPN | NNP | noun, proper singular | Kilroy was here. |
| PROPN | NNPS | noun, proper plural | The Flintstones were a pre-historic family. |
| VERB | MD | verb, modal auxiliary | This could work. |
| VERB | VB | verb, base form | I want to go. |
| VERB | VBD | verb, past tense | This was a sentence. |
| VERB | VBG | verb, gerund or present participle | I am going. |
| VERB | VBN | verb, past participle | The treasure was lost. |
| VERB | VBP | verb, non-3rd person singular present | I want to go. |
| VERB | VBZ | verb, 3rd person singular present | He wants to go. |
Fine-Grained POS Tags Example

Dependency Parsing

  • Dependency parsing is the process of extracting the dependencies of a sentence to represent its grammatical structure.
  • It defines the dependency relationship between headwords and their dependents.
  • The head of a sentence has no dependency and is called the root of the sentence.
  • The verb is usually the head of the sentence. All other words are linked to the headword.

The dependencies can be mapped in a directed graph representation:

  • Words are the nodes.
  • The grammatical relationships are the edges.
  • Dependency parsing helps you know what role a word plays in the text and how different words relate to each other.
  • It’s also used in shallow parsing and named entity recognition.
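The head and dependent relationships described above can be inspected through token.head and token.dep_. The sketch below assumes spaCy v3, whose Doc constructor accepts heads (absolute token indices) and deps; the parse is written by hand for illustration, whereas a trained pipeline would predict it.

```python
from spacy.attrs import DEP
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Assumption: a hand-written parse stands in for a trained parser's output.
words = ["The", "fox", "jumped", "over", "the", "dog"]
heads = [1, 2, 2, 2, 5, 3]    # absolute index of each token's head
deps = ["det", "nsubj", "ROOT", "prep", "det", "pobj"]
doc = Doc(Vocab(), words=words, heads=heads, deps=deps)

# Each token is a node; (token, label, head) is a labelled edge of the graph.
edges = [(t.text, t.dep_, t.head.text) for t in doc]
for edge in edges:
    print(edge)

# The root is its own head and carries the label ROOT.
root = [t.text for t in doc if t.dep_ == "ROOT"]

# Dependency labels can be counted like POS tags, using spacy.attrs.DEP.
dep_counts = {doc.vocab.strings[k]: v for k, v in doc.count_by(DEP).items()}
```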

The Doc.count_by() method used earlier with spacy.attrs.POS and spacy.attrs.TAG also accepts spacy.attrs.DEP, which counts dependency labels in the same way.

Visualizing Parts of Speech

spaCy offers an outstanding visualizer called displaCy:

displaCy Visualization

The dependency parse shows the coarse POS tag for each token, as well as the dependency tag if given:
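A minimal rendering sketch, assuming spaCy v3 and a hand-annotated Doc so that no trained pipeline is required. displacy.render() with jupyter=False returns the SVG markup as a string; inside a Jupyter notebook you would pass jupyter=True to display it inline, and displacy.serve() starts a local web server instead.

```python
from spacy import displacy
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Assumption: tags and parse are written by hand instead of being predicted.
words = ["The", "fox", "jumped", "over", "the", "dog"]
pos = ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN"]
heads = [1, 2, 2, 2, 5, 3]    # absolute index of each token's head
deps = ["det", "nsubj", "ROOT", "prep", "det", "pobj"]
doc = Doc(Vocab(), words=words, pos=pos, heads=heads, deps=deps)

# style="dep" draws the dependency arcs; "distance" sets the token spacing.
svg = displacy.render(doc, style="dep", jupyter=False, options={"distance": 110})
print(svg[:60])
```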

Handling Large Text

displacy.serve() accepts a single Doc or list of Doc objects. Since large texts are difficult to view in one line, you may want to pass a list of spans instead. Each span will appear on its own line:
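Passing a list of spans can be sketched as follows, again assuming spaCy v3 and a hand-written parse. displacy.render() is used in place of displacy.serve() so that the example returns instead of starting a web server; both accept the same list of sentence spans.

```python
from spacy import displacy
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Assumption: a hand-written parse of two short sentences.
words = ["Cats", "sleep", ".", "Dogs", "bark", "."]
heads = [1, 1, 1, 4, 4, 4]    # absolute index of each token's head
deps = ["nsubj", "ROOT", "punct", "nsubj", "ROOT", "punct"]
doc = Doc(Vocab(), words=words, heads=heads, deps=deps)

# Sentence boundaries are derived from the parse: one span per sentence.
spans = list(doc.sents)

# Each span is rendered as its own graphic, one after the other.
svg = displacy.render(spans, style="dep", jupyter=False, options={"distance": 110})
```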

Customizing the Appearance

Besides setting the distance between tokens, you can pass other arguments to the options parameter:

| NAME | TYPE | DESCRIPTION | DEFAULT |
|---|---|---|---|
| compact | bool | “Compact mode” with square arrows that takes up less space. | False |
| color | str | Text color (HEX, RGB or color names). | #000000 |
| bg | str | Background color (HEX, RGB or color names). | #ffffff |
| font | str | Font name or font family for all text. | Arial |

For a full list of options visit https://spacy.io/api/top-level#displacy_options
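The options from the table can be combined in a single dictionary. The sketch below assumes spaCy v3 and a hand-annotated Doc; the particular color, background, and font values are arbitrary choices for illustration.

```python
from spacy import displacy
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Assumption: tags and parse are written by hand instead of being predicted.
words = ["The", "fox", "jumped"]
pos = ["DET", "NOUN", "VERB"]
heads = [1, 2, 2]             # absolute index of each token's head
deps = ["det", "nsubj", "ROOT"]
doc = Doc(Vocab(), words=words, pos=pos, heads=heads, deps=deps)

# Illustrative values only; any valid CSS color or font name can be used.
options = {
    "distance": 110,
    "compact": True,
    "color": "yellow",
    "bg": "#09a3d5",
    "font": "Times",
}

svg = displacy.render(doc, style="dep", jupyter=False, options=options)
```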

That covers part-of-speech tagging and dependency parsing with spaCy. Hope you enjoyed the post.

Thank You!

Post Credit: Jose Portilla, Udemy video
