Parts of Speech tagging is the next step of the Tokenization. Once we have done Tokenization, spaCy can parse and tag a given Doc. spaCy is pre-trained using statistical modelling. This model consists of binary data and is trained on enough examples to make predictions that generalize across the language. Example, a word following “the” in English is most likely a noun.

Jupyter Notebook: Parts of Speech Tagging using spaCy Download

It is always challenging to find the correct parts of speech due to the following reasons:

Enabling machine to understand and process raw text is not easy.
Same word plays differently in different context of a sentence.
Sometime words which are completely different, tells almost the same meaning.
Even splitting text into useful word-like units can be difficult in many languages.
While it’s possible to solve some problems starting from only the raw characters, it’s usually better to use linguistic knowledge to add useful information.
That’s exactly what spaCy is designed to do: you put in raw text, and get back a Doc object, that comes with a variety of annotations.

Reference spacy.io/

This is the Part 3 of NLP spaCy Series of articles. you can find the first two parts in the below links:

Part 1: spacy-installation-and-basic-operations-nlp-text-processing-library

Part 2: guide-to-tokenization-lemmatization-stop-words-and-phrase-matching-using-spacy

In this section we’ll cover coarse POS tags (noun, verb, adjective), fine-grained tags (plural noun, past-tense verb, superlative adjective and Dependency Parsing and Visualization of dependency Tree.

Coarse-grained Part-of-speech Tags

Every token is assigned a POS Tag from the following list:

POS	DESCRIPTION	EXAMPLES
ADJ	adjective	big, old, green, incomprehensible, first
ADP	adposition	in, to, during
ADV	adverb	very, tomorrow, down, where, there
AUX	auxiliary	is, has (done), will (do), should (do)
CONJ	conjunction	and, or, but
CCONJ	coordinating conjunction	and, or, but
DET	determiner	a, an, the
INTJ	interjection	psst, ouch, bravo, hello
NOUN	noun	girl, cat, tree, air, beauty
NUM	numeral	1, 2017, one, seventy-seven, IV, MMXIV
PART	particle	‘s, not,
PRON	pronoun	I, you, he, she, myself, themselves, somebody
PROPN	proper noun	Mary, John, London, NATO, HBO
PUNCT	punctuation	., (, ), ?
SCONJ	subordinating conjunction	if, while, that
SYM	symbol	$, %, §, ©, +, −, ×, ÷, =, :), 😝
VERB	verb	run, runs, running, eat, ate, eating
X	other	sfpksdpsxmsa
SPACE	space

Coarse-grained POS Tags

Fine-grained Part-of-speech Tags

Tokens are subsequently given a fine-grained tag as determined by morphology:

POS	Description	Fine-grained Tag	Description	Morphology
ADJ	adjective	AFX	affix	Hyph=yes
ADJ		JJ	adjective	Degree=pos
ADJ		JJR	adjective, comparative	Degree=comp
ADJ		JJS	adjective, superlative	Degree=sup
ADJ		PDT	predeterminer	AdjType=pdt PronType=prn
ADJ		PRP$	pronoun, possessive	PronType=prs Poss=yes
ADJ		WDT	wh-determiner	PronType=int rel
ADJ		WP$	wh-pronoun, possessive	Poss=yes PronType=int rel
ADP	adposition	IN	conjunction, subordinating or preposition
ADV	adverb	EX	existential there	AdvType=ex
ADV		RB	adverb	Degree=pos
ADV		RBR	adverb, comparative	Degree=comp
ADV		RBS	adverb, superlative	Degree=sup
ADV		WRB	wh-adverb	PronType=int rel
CONJ	conjunction	CC	conjunction, coordinating	ConjType=coor
DET	determiner	DT	determiner
INTJ	interjection	UH	interjection
NOUN	noun	NN	noun, singular or mass	Number=sing
NOUN		NNS	noun, plural	Number=plur
NOUN		WP	wh-pronoun, personal	PronType=int rel
NUM	numeral	CD	cardinal number	NumType=card
PART	particle	POS	possessive ending	Poss=yes
PART		RP	adverb, particle
PART		TO	infinitival to	PartType=inf VerbForm=inf
PRON	pronoun	PRP	pronoun, personal	PronType=prs
PROPN	proper noun	NNP	noun, proper singular	NounType=prop Number=sign
PROPN		NNPS	noun, proper plural	NounType=prop Number=plur
PUNCT	punctuation	-LRB-	left round bracket	PunctType=brck PunctSide=ini
PUNCT		-RRB-	right round bracket	PunctType=brck PunctSide=fin
PUNCT		,	punctuation mark, comma	PunctType=comm
PUNCT		:	punctuation mark, colon or ellipsis
PUNCT		.	punctuation mark, sentence closer	PunctType=peri
PUNCT		”	closing quotation mark	PunctType=quot PunctSide=fin
PUNCT		“”	closing quotation mark	PunctType=quot PunctSide=fin
PUNCT			opening quotation mark	PunctType=quot PunctSide=ini
PUNCT		HYPH	punctuation mark, hyphen	PunctType=dash
PUNCT		LS	list item marker	NumType=ord
PUNCT		NFP	superfluous punctuation
SYM	symbol	#	symbol, number sign	SymType=numbersign
SYM		$	symbol, currency	SymType=currency
SYM		SYM	symbol
VERB	verb	BES	auxiliary “be”
VERB		HVS	forms of “have”
VERB		MD	verb, modal auxiliary	VerbType=mod
VERB		VB	verb, base form	VerbForm=inf
VERB		VBD	verb, past tense	VerbForm=fin Tense=past
VERB		VBG	verb, gerund or present participle	VerbForm=part Tense=pres Aspect=prog
VERB		VBN	verb, past participle	VerbForm=part Tense=past Aspect=perf
VERB		VBP	verb, non-3rd person singular present	VerbForm=fin Tense=pres
VERB		VBZ	verb, 3rd person singular present	VerbForm=fin Tense=pres Number=sing Person=3
X	other	ADD	email
X		FW	foreign word	Foreign=yes
X		GW	additional word in multi-word expression
X		XX	unknown
SPACE	space	_SP	space
		NIL	missing tag

Fine-grained Tags

View token tags

Recall Tokenization We can obtain a particular token by its index position.

To view the coarse POS tag use token.pos_
To view the fine-grained tag use token.tag_
To view the description of either type of tag use spacy.explain(tag)

spaCy encodes all strings to hash values to reduce memory usage and improve efficiency. So to get the readable string representation of an attribute, we need to add an underscore _ to its name: Note that token.pos and token.tag return integer hash values; by adding the underscores we get the text equivalent that lives in doc.vocab.

Note: In the above example to format the representation I have added: {10} this is nothing but to give spacing between each token. Just to have better look and feel. No other specific reason. This count start from the first character of the token. You can add any number instead of {10} to have spacing as you wish.

Working with POS Tags

In the English language, it is very common that the same string of characters can have different meanings, even within the same sentence.
For this reason, morphology is important.
spaCy uses machine learning algorithms to best predict the use of a token in a sentence.
Is “I read books on NLP” present or past tense?
Is wind a verb or a noun?

Let’s understand all this with the help of below examples.

In the first example, spaCy assumed that read was Present Tense.
In the second example the present tense form would be I am reading a book, so spaCy assigned the past tense.

Counting POS Tags

The Doc.count_by() method accepts a specific token attribute as its argument, and returns a frequency count of the given attribute as a dictionary object. Keys in the dictionary are the integer values of the given attribute ID, and values are the frequency. Counts of zero are not included.

It means tag which has key as 96 is appeared only once and ta with key as 83 has appeared three times in the sentence. This isn’t very helpful until you decode the attribute ID:

Create a frequency list of POS tags from the entire document

Since POS_counts returns a dictionary, we can obtain a list of keys with POS_counts.items().
By sorting the list we have access to the tag and its count, in order.

k contains the key number of the tag and v contains the frequency number.

Counting fine-grained Tag

Why did the ID numbers get so big? In spaCy, certain text values are hardcoded into Doc.vocab and take up the first several hundred ID numbers. Strings like ‘NOUN’ and ‘VERB’ are used frequently by internal operations. Others, like fine-grained tags, are assigned hash values as needed.

Why don’t SPACE tags appear? In spaCy, only strings of spaces (two or more) are assigned tokens. Single spaces are not.

Fine-grained POS Tag Examples¶

These are some grammatical examples (shown in bold) of specific fine-grained tags. We’ve removed punctuation and rarely used tags:

POS	TAG	DESCRIPTION	EXAMPLE
ADJ	AFX	affix	The Flintstones were a pre-historic family.
ADJ	JJ	adjective	This is a good sentence.
ADJ	JJR	adjective, comparative	This is a better sentence.
ADJ	JJS	adjective, superlative	This is the best sentence.
ADJ	PDT	predeterminer	Waking up is half the battle.
ADJ	PRP$	pronoun, possessive	His arm hurts.
ADJ	WDT	wh-determiner	It’s blue, which is odd.
ADJ	WP$	wh-pronoun, possessive	We don’t know whose it is.
ADP	IN	conjunction, subordinating or preposition	It arrived in a box.
ADV	EX	existential there	There is cake.
ADV	RB	adverb	He ran quickly.
ADV	RBR	adverb, comparative	He ran quicker.
ADV	RBS	adverb, superlative	He ran fastest.
ADV	WRB	wh-adverb	When was that?
CONJ	CC	conjunction, coordinating	The balloon popped and everyone jumped.
DET	DT	determiner	This is a sentence.
INTJ	UH	interjection	Um, I don’t know.
NOUN	NN	noun, singular or mass	This is a sentence.
NOUN	NNS	noun, plural	These are words.
NOUN	WP	wh-pronoun, personal	Who was that?
NUM	CD	cardinal number	I want three things.
PART	POS	possessive ending	Fred‘s name is short.
PART	RP	adverb, particle	Put it back!
PART	TO	infinitival to	I want to go.
PRON	PRP	pronoun, personal	I want you to go.
PROPN	NNP	noun, proper singular	Kilroy was here.
PROPN	NNPS	noun, proper plural	The Flintstones were a pre-historic family.
VERB	MD	verb, modal auxiliary	This could work.
VERB	VB	verb, base form	I want to go.
VERB	VBD	verb, past tense	This was a sentence.
VERB	VBG	verb, gerund or present participle	I am going.
VERB	VBN	verb, past participle	The treasure was lost.
VERB	VBP	verb, non-3rd person singular present	I want to go.
VERB	VBZ	verb, 3rd person singular present	He wants to go.

Fine-Grained POS Tags Example

Dependency Parsing

Dependency parsing is the process of extracting the dependencies of a sentence to represent its grammatical structure.
It defines the dependency relationship between headwords and their dependents.
The head of a sentence has no dependency and is called the root of the sentence.
The verb is usually the head of the sentence. All other words are linked to the headword.

The dependencies can be mapped in a directed graph representation:

Words are the nodes.
The grammatical relationships are the edges.
Dependency parsing helps you know what role a word plays in the text and how different words relate to each other.
It’s also used in shallow parsing and named entity recognition.

Here we’ve shown spacy.attrs.POS, spacy.attrs.TAG and spacy.attrs.DEP.

Visualizing Parts of Speech

spaCy offers an outstanding visualizer called displaCy:

The dependency parse shows the coarse POS tag for each token, as well as the dependency tag if given:

Handling Large Text

displacy.serve() accepts a single Doc or list of Doc objects. Since large texts are difficult to view in one line, you may want to pass a list of spans instead. Each span will appear on its own line:

Customizing the Appearance

Besides setting the distance between tokens, you can pass other arguments to the options parameter:

NAME	TYPE	DESCRIPTION	DEFAULT
compact	bool	“Compact mode” with square arrows that takes up less space.	False
color	unicode	Text color (HEX, RGB or color names).	#000000
bg	unicode	Background color (HEX, RGB or color names).	#ffffff
font	unicode	Font name or font family for all text.	Arial

For a full list of options visit https://spacy.io/api/top-level#displacy_options

This is all about text Parts of Speech Tagging using spaCy. Hope you enjoyed the post.

Thank You!

Post Credit: Jose Portila Udemy Video