Named Entity Recognition NER using spaCy | NLP | Part 4

Named Entity Recognition (NER) is arguably the most important, or at least the starting, step in Information Retrieval. Information Retrieval is the technique of extracting important and useful information from unstructured raw text documents. Named Entity Recognition works by locating the named entities present in unstructured text and classifying them into standard categories such as person names, locations, organizations, time expressions, quantities, monetary values, percentages, codes, etc. spaCy comes with an extremely fast statistical entity recognition system that assigns labels to contiguous spans of tokens.


spaCy provides the option to add arbitrary classes to the entity recognition system and to update the model with new examples, beyond the entities already defined within the model.

spaCy has an 'ner' pipeline component that identifies token spans matching a predetermined set of named entities. These are available as the 'ents' property of a Doc object.

# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')
# Write a function to display basic entity info:
def show_ents(doc):
    if doc.ents:
        for ent in doc.ents:
            print(ent.text + ' - ' + str(ent.start_char) + ' - ' + str(ent.end_char) + ' - ' + ent.label_ + ' - ' + str(spacy.explain(ent.label_)))
    else:
        print('No named entities found.')
doc1 = nlp("Apple is looking at buying U.K. startup for $1 billion")
show_ents(doc1)
Output:

Text        Start  End  Label  Description
Apple       0      5    ORG    Companies, agencies, institutions.
U.K.        27     31   GPE    Geopolitical entity, i.e. countries, cities, states.
$1 billion  44     54   MONEY  Monetary values, including unit.
doc2 = nlp(u'May I go to Washington, DC next May to see the Washington Monument?')
show_ents(doc2)

Here we see tokens combine to form the entities "next May" and "the Washington Monument".

doc3 = nlp(u'Can I please borrow 500 dollars from you to buy some Microsoft stock?')
for ent in doc3.ents:
    print(ent.text, ent.start, ent.end, ent.start_char, ent.end_char, ent.label_)

Entity Annotations

Doc.ents are token spans with their own set of annotations.

ent.text        The original entity text
ent.label       The entity type's hash value
ent.label_      The entity type's string description
ent.start       The token span's start index position in the Doc
ent.end         The token span's stop index position in the Doc
ent.start_char  The entity text's start character index in the Doc text
ent.end_char    The entity text's stop character index in the Doc text

Accessing Entity Annotations

The standard way to access entity annotations is the doc.ents property, which produces a sequence of Span objects. The entity type is accessible either as a hash value using ent.label or as a string using ent.label_.

The Span object acts as a sequence of tokens, so you can iterate over the entity or index into it. You can also get the text form of the whole entity, as though it were a single token.
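For instance, using doc1 from the earlier example (a minimal sketch, assuming en_core_web_sm still tags "$1 billion" as an entity):

# A Span behaves like a sequence of tokens
money = doc1.ents[2]              # the "$1 billion" span
for token in money:
    print(token.text)             # $, 1, billion
print(money.text)                 # the whole entity as a single string: $1 billion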

You can also access token-level entity annotations using the token.ent_iob and token.ent_type attributes. token.ent_iob indicates whether the token begins, is inside, or is outside an entity. If no entity type is set on a token, token.ent_type_ returns an empty string.

doc = nlp("San Francisco considers banning sidewalk delivery robots")
# document level
for e in doc.ents:
    print(e.text, e.start_char, e.end_char, e.label_)
# OR
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)
#token level
# doc[0], doc[1], … are the individual tokens of the Doc.
ent_san = [doc[0].text, doc[0].ent_iob_, doc[0].ent_type_]
ent_francisco = [doc[1].text, doc[1].ent_iob_, doc[1].ent_type_]
print(ent_san)
print(ent_francisco)
IOB SCHEME
I – Token is inside an entity.
O – Token is outside an entity.
B – Token is the beginning of an entity.
Text       ent_iob  ent_iob_  ent_type_  Description
San        3        B         "GPE"      beginning of an entity
Francisco  1        I         "GPE"      inside an entity
considers  2        O         ""         outside an entity
banning    2        O         ""         outside an entity
sidewalk   2        O         ""         outside an entity
delivery   2        O         ""         outside an entity
robots     2        O         ""         outside an entity

Note: In the above example, only "San Francisco" is recognized as a named entity; hence the rest of the tokens are marked as outside an entity. Within "San Francisco", "San" is the beginning of the entity and "Francisco" is inside the entity.

NER Tags

Tags are accessible through the .label_ property of an entity.

TYPE         DESCRIPTION                                            EXAMPLE
PERSON       People, including fictional.                           Fred Flintstone
NORP         Nationalities or religious or political groups.        The Republican Party
FAC          Buildings, airports, highways, bridges, etc.           Logan International Airport, The Golden Gate
ORG          Companies, agencies, institutions, etc.                Microsoft, FBI, MIT
GPE          Countries, cities, states.                             France, UAR, Chicago, Idaho
LOC          Non-GPE locations, mountain ranges, bodies of water.   Europe, Nile River, Midwest
PRODUCT      Objects, vehicles, foods, etc. (Not services.)         Formula 1
EVENT        Named hurricanes, battles, wars, sports events, etc.   Olympic Games
WORK_OF_ART  Titles of books, songs, etc.                           The Mona Lisa
LAW          Named documents made into laws.                        Roe v. Wade
LANGUAGE     Any named language.                                    English
DATE         Absolute or relative dates or periods.                 20 July 1969
TIME         Times smaller than a day.                              Four hours
PERCENT      Percentage, including "%".                             Eighty percent
MONEY        Monetary values, including unit.                       Twenty Cents
QUANTITY     Measurements, as of weight or distance.                Several kilometers, 55kg
ORDINAL      "first", "second", etc.                                 9th, Ninth
CARDINAL     Numerals that do not fall under another type.          2, Two, Fifty-two

User Defined Named Entity and Adding it to a Span

Normally we would have spaCy build a library of named entities by training it on several samples of text.
Sometimes, however, we want to assign a named entity to a specific token that is not recognized by the trained spaCy model. We can do this as shown in the code below.

Example 1
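Here is a minimal sketch of adding a single-token entity. The sentence, and the assumption that the small model misses "Tesla", are only illustrative:

from spacy.tokens import Span
doc = nlp(u'Tesla to build a U.K. factory for $6 million')
show_ents(doc)                       # "Tesla" may not be recognized by the small model
# Get the hash value of the ORG entity label
ORG = doc.vocab.strings[u'ORG']
# Create a Span for the new entity: Doc, start token, end token (exclusive), label
new_ent = Span(doc, 0, 1, label=ORG)
# Add the new entity to the existing entities of the Doc
doc.ents = list(doc.ents) + [new_ent]
show_ents(doc)                       # "Tesla" should now appear as an ORG entity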

Example 2
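Similarly, a sketch of labelling a multi-token span, assuming the model leaves the phrase untagged:

doc = nlp(u'Our company created a brand new vacuum cleaner')
show_ents(doc)                       # "vacuum cleaner" is assumed not to be tagged
PROD = doc.vocab.strings[u'PRODUCT']
# "vacuum" and "cleaner" are tokens 6 and 7, so the Span runs from 6 to 8 (end is exclusive)
new_ent = Span(doc, 6, 8, label=PROD)
doc.ents = list(doc.ents) + [new_ent]
show_ents(doc)                       # "vacuum cleaner" should now appear as a PRODUCT entity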

Adding Named Entities to All Matching Spans

What if we want to tag all occurrences of a token? In this section we show how to use the PhraseMatcher to identify a series of spans in the Doc:

doc = nlp(u'Our company plans to introduce a new vacuum cleaner. If successful, the vacuum cleaner will be our first product.')
show_ents(doc)
#output: first - 99 - 104 - ORDINAL - "first", "second", etc.
#Import PhraseMatcher and create a matcher object:
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
#Create the desired phrase patterns:
phrase_list = ['vacuum cleaner', 'vacuum-cleaner']
phrase_patterns = [nlp(text) for text in phrase_list]
#Apply the patterns to our matcher object (in spaCy 3.x the call is matcher.add('newproduct', phrase_patterns)):
matcher.add('newproduct', None, *phrase_patterns)
#Apply the matcher to our Doc object:
matches = matcher(doc)
#See what matches occur:
matches
#output: [(2689272359382549672, 7, 9), (2689272359382549672, 14, 16)]
#Here we create Spans from each match, and create named entities from them:
from spacy.tokens import Span
PROD = doc.vocab.strings[u'PRODUCT']
new_ents = [Span(doc, match[1], match[2], label=PROD) for match in matches]
#match[1] contains the start token index and match[2] the stop token index (exclusive) in the doc.
doc.ents = list(doc.ents) + new_ents
show_ents(doc)
#output:
#vacuum cleaner - 37 - 51 - PRODUCT - Objects, vehicles, foods, etc. (not services)
#vacuum cleaner - 72 - 86 - PRODUCT - Objects, vehicles, foods, etc. (not services)
#first - 99 - 104 - ORDINAL - "first", "second", etc.

Counting Entities

While spaCy may not have a built-in tool for counting entities, we can pass a conditional statement into a list comprehension:
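For example, to count how many MONEY entities appear in a document (the sample sentence is just for illustration):

doc = nlp(u'Originally priced at $29.50, the sweater was marked down to five dollars.')
# Keep only the entities whose label is MONEY, then count them
money_ents = [ent for ent in doc.ents if ent.label_ == 'MONEY']
print(len(money_ents))     # expected: 2, if both amounts are tagged as MONEY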

Visualizing NER

spaCy ships with a built-in visualizer called "displaCy", which helps us explore the behaviour of the entity recognition model interactively.

If you are training a model, it’s very useful to run the visualization yourself.

You can pass a Doc or a list of Doc objects to displaCy and run displacy.serve to run the web server, or displacy.render to generate the raw mark-up.

#Import the displaCy library
from spacy import displacy
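A minimal usage sketch (the sample sentence is arbitrary):

doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million.')
# Render the entity visualization inline (jupyter=True is for notebooks)
displacy.render(doc, style='ent', jupyter=True)
# Or serve it on a local web server instead:
# displacy.serve(doc, style='ent')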

Visualizing Sentences Line by Line
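One way to do this is to re-process each sentence so displaCy renders it as its own Doc (a sketch, extending the doc above with a second sentence):

doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million. '
          u'By contrast, Sony sold only 7 thousand Walkman music players.')
for sent in doc.sents:
    # Each sentence is rendered separately, line by line
    displacy.render(nlp(sent.text), style='ent', jupyter=True)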

Viewing Specific Entities

You can pass a list of entity types to restrict the visualization:
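For example, to show only organizations and products (using the same doc as above):

# The 'ents' option restricts which entity labels are highlighted
options = {'ents': ['ORG', 'PRODUCT']}
displacy.render(doc, style='ent', jupyter=True, options=options)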

Styling: customize color and effects

You can also pass background color and gradient options:
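For example (the color values are just illustrative CSS):

colors = {'ORG': 'linear-gradient(90deg, #aa9cfc, #fc9ce7)', 'PRODUCT': 'radial-gradient(yellow, green)'}
options = {'ents': ['ORG', 'PRODUCT'], 'colors': colors}
displacy.render(doc, style='ent', jupyter=True, options=options)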

That is all about Named Entity Recognition (NER) and its visualization using spaCy. Hope you enjoyed the post.

In the next article, I will describe Sentence Segmentation. Stay tuned!

If you have any feedback to improve the content, or any other thoughts, please write them in the comment section below. Your comments are very valuable.

Previous Articles in spaCy NLP Series:

Thank You!

Post Credit: Jose Portilla's Udemy Videos