Named Entity Recognition (NER) is usually the first step in information extraction, the task of pulling important and useful information out of unstructured raw text documents. NER locates the named entities in unstructured text and classifies them into standard categories such as person names, locations, organizations, time expressions, quantities, monetary values, percentages and codes. spaCy comes with a fast statistical entity recognition system that assigns labels to contiguous spans of tokens.
spaCy also lets you add arbitrary classes to the entity recognition system and update the model with new examples, beyond the entities the model already defines.
spaCy's ‘ner’ pipeline component identifies token spans that fit a predetermined set of named entity types. The results are available through the ‘ents’ property of a Doc object.
# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')

# Write a function to display basic entity info:
def show_ents(doc):
    if doc.ents:
        for ent in doc.ents:
            print(ent.text + ' - ' + str(ent.start_char) + ' - ' + str(ent.end_char) + ' - ' + ent.label_ + ' - ' + str(spacy.explain(ent.label_)))
    else:
        print('No named entities found.')

doc1 = nlp("Apple is looking at buying U.K. startup for $1 billion")
show_ents(doc1)
| Text | Start char | End char | Label | Description |
| Apple | 0 | 5 | ORG | Companies, agencies, institutions. |
| U.K. | 27 | 31 | GPE | Geopolitical entity, i.e. countries, cities, states. |
| $1 billion | 44 | 54 | MONEY | Monetary values, including unit. |
doc2 = nlp(u'May I go to Washington, DC next May to see the Washington Monument?')
show_ents(doc2)
Here we see tokens combine to form the entities "next May" and "the Washington Monument".
doc3 = nlp(u'Can I please borrow 500 dollars from you to buy some Microsoft stock?')
for ent in doc3.ents:
    print(ent.text, ent.start, ent.end, ent.start_char, ent.end_char, ent.label_)
Doc.ents are token spans with their own set of annotations.
| Attribute | Description |
| ent.text | The original entity text |
| ent.label | The entity type’s hash value |
| ent.label_ | The entity type’s string description |
| ent.start | The token span’s start index position in the Doc |
| ent.end | The token span’s stop index position in the Doc |
| ent.start_char | The entity text’s start index position in the Doc |
| ent.end_char | The entity text’s stop index position in the Doc |
Accessing Entity Annotations
The standard way to access entity annotations is the doc.ents property, which produces a sequence of Span objects. The entity type is accessible either as a hash value using ent.label or as a string using ent.label_.
The Span object acts as a sequence of tokens, so you can iterate over the entity or index into it. You can also get the text form of the whole entity, as though it were a single token.
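As a minimal sketch of iterating over and indexing into an entity, the snippet below sets one entity by hand. It uses a blank English pipeline (spacy.blank("en"), an assumption made here so the result does not depend on what a trained model predicts) and hard-coded token indices chosen for this sentence.

```python
import spacy
from spacy.tokens import Span

# Blank pipeline: tokenizer only, no trained NER (assumption for a deterministic sketch)
nlp = spacy.blank("en")
doc = nlp("San Francisco considers banning sidewalk delivery robots")

# Mark tokens 0 and 1 ("San Francisco") as a GPE entity by hand
doc.ents = [Span(doc, 0, 2, label="GPE")]

ent = doc.ents[0]
tokens_in_ent = [token.text for token in ent]  # iterate over the entity
first_token = ent[0].text                      # index into the entity
whole_text = ent.text                          # text of the whole entity

print(whole_text, ent.label_, tokens_in_ent, first_token, len(ent))
```

With a trained model such as en_core_web_sm you would skip the manual Span and read doc.ents directly.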
You can also access token entity annotations using the token.ent_iob and token.ent_type attributes, or their string forms token.ent_iob_ and token.ent_type_. token.ent_iob indicates whether an entity starts, continues or ends at that token. If no entity type is set on a token, token.ent_type_ returns an empty string.
doc = nlp("San Francisco considers banning sidewalk delivery robots")

# document level
for e in doc.ents:
    print(e.text, e.start_char, e.end_char, e.label_)

# OR
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)
# token level
# doc[0], doc[1] … hold the individual tokens.
ent_san = [doc[0].text, doc[0].ent_iob_, doc[0].ent_type_]
ent_francisco = [doc[1].text, doc[1].ent_iob_, doc[1].ent_type_]
print(ent_san)
print(ent_francisco)
IOB SCHEME
I – Token is inside an entity.
O – Token is outside an entity.
B – Token is the beginning of an entity.
| Token | IOB | Description |
| San | B | beginning of an entity |
| Francisco | I | inside an entity |
| considers | O | outside an entity |
| banning | O | outside an entity |
| sidewalk | O | outside an entity |
| delivery | O | outside an entity |
| robots | O | outside an entity |
Note: In the above example only San Francisco is recognized as a named entity, so the rest of the tokens are marked as outside any entity. Within San Francisco, San is the beginning of the entity and Francisco is inside it.
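To make the IOB scheme concrete, here is a small pure-Python sketch that derives one tag per token from a list of entity spans. The helper iob_tags is hypothetical (it is not part of spaCy), and it writes the tag and type together in the common "B-GPE" style, whereas spaCy keeps them apart in token.ent_iob_ and token.ent_type_.

```python
def iob_tags(tokens, entity_spans):
    """Return one IOB tag per token.

    entity_spans holds (start, end, label) token-index triples, end exclusive.
    This helper is illustrative only and is not a spaCy function.
    """
    tags = ["O"] * len(tokens)            # every token starts outside any entity
    for start, end, label in entity_spans:
        tags[start] = "B-" + label        # first token begins the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label        # remaining tokens are inside it
    return tags


tokens = "San Francisco considers banning sidewalk delivery robots".split()
tags = iob_tags(tokens, [(0, 2, "GPE")])
for token, tag in zip(tokens, tags):
    print(token, tag)
```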
Tags are accessible through the .label_ property of an entity.
| Type | Description | Example |
| PERSON | People, including fictional. | Fred Flintstone |
| NORP | Nationalities or religious or political groups. | The Republican Party |
| FAC | Buildings, airports, highways, bridges, etc. | Logan International Airport, The Golden Gate |
| ORG | Companies, agencies, institutions, etc. | Microsoft, FBI, MIT |
| GPE | Countries, cities, states. | France, UAR, Chicago, Idaho |
| LOC | Non-GPE locations, mountain ranges, bodies of water. | Europe, Nile River, Midwest |
| PRODUCT | Objects, vehicles, foods, etc. (Not services.) | Formula 1 |
| EVENT | Named hurricanes, battles, wars, sports events, etc. | Olympic Games |
| WORK_OF_ART | Titles of books, songs, etc. | The Mona Lisa |
| LAW | Named documents made into laws. | Roe v. Wade |
| LANGUAGE | Any named language. | English |
| DATE | Absolute or relative dates or periods. | 20 July 1969 |
| TIME | Times smaller than a day. | Four hours |
| PERCENT | Percentage, including “%”. | Eighty percent |
| MONEY | Monetary values, including unit. | Twenty Cents |
| QUANTITY | Measurements, as of weight or distance. | Several kilometers, 55kg |
| ORDINAL | “first”, “second”, etc. | 9th, Ninth |
| CARDINAL | Numerals that do not fall under another type. | 2, Two, Fifty-two |
User Defined Named Entity and Adding it to a Span
Normally we would have spaCy build a library of named entities by training it on several samples of text.
Sometimes we want to assign a named entity to a specific token that the trained spaCy model does not recognize. We can do this by creating a Span with the desired label and appending it to doc.ents.
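A minimal sketch of that idea follows. The sentence and the choice of "Tesla" as the missing entity are illustrative assumptions, and a blank English pipeline is used so that doc.ents starts out empty and the result is predictable. With a trained model the same two lines (create the Span, extend doc.ents) apply.

```python
import spacy
from spacy.tokens import Span

# Blank pipeline (assumption): no trained NER, so doc.ents starts empty
nlp = spacy.blank("en")
doc = nlp("Tesla to build a U.K. factory for $6 million")

# Token 0 ("Tesla") becomes an ORG entity; the stop index 1 is exclusive
new_ent = Span(doc, 0, 1, label="ORG")

# doc.ents is read as a tuple, so rebuild it as a list and assign it back
doc.ents = list(doc.ents) + [new_ent]

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

Note that spaCy rejects overlapping entities, so with a trained model this only works if the new span does not overlap a span already in doc.ents.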
Adding Named Entities to All Matching Spans
What if we want to tag all occurrences of a token? In this section we show how to use the PhraseMatcher to identify a series of spans in the Doc:
doc = nlp(u'Our company plans to introduce a new vacuum cleaner. If successful, the vacuum cleaner will be our first product.')
show_ents(doc)
# output: first - 99 - 104 - ORDINAL - "first", "second", etc.

# Import PhraseMatcher and create a matcher object:
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
#Create the desired phrase patterns:
phrase_list = ['vacuum cleaner', 'vacuum-cleaner']
phrase_patterns = [nlp(text) for text in phrase_list]
#Apply the patterns to our matcher object:
matcher.add('newproduct', None, *phrase_patterns)  # spaCy v2 signature; in spaCy v3 use matcher.add('newproduct', phrase_patterns)
#Apply the matcher to our Doc object:
matches = matcher(doc)
# See what matches occur:
matches
# output: [(2689272359382549672, 7, 9), (2689272359382549672, 14, 16)]
# Here we create Spans from each match, and create named entities from them:
from spacy.tokens import Span
PROD = doc.vocab.strings[u'PRODUCT']
# match[1] is the start token index and match[2] the stop token index (exclusive) of the match in the doc.
new_ents = [Span(doc, match[1], match[2], label=PROD) for match in matches]
doc.ents = list(doc.ents) + new_ents
show_ents(doc)
output:
vacuum cleaner - 37 - 51 - PRODUCT - Objects, vehicles, foods, etc. (not services)
vacuum cleaner - 72 - 86 - PRODUCT - Objects, vehicles, foods, etc. (not services)
first - 99 - 104 - ORDINAL - "first", "second", etc.
spaCy has no dedicated function for counting entities of a given type, but a list comprehension with a condition on ent.label_ does the job.
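Here is a small sketch of counting entities this way. The sentence, the hand-set entities and the blank pipeline are all assumptions made to keep the counts predictable. The Counter variant is an addition for counting every label in one pass.

```python
from collections import Counter

import spacy
from spacy.tokens import Span

# Blank pipeline (assumption): entities are set by hand below
nlp = spacy.blank("en")
doc = nlp("Apple and Microsoft compete with Google in London")

# Token indices are specific to this sentence; stop indices are exclusive
doc.ents = [
    Span(doc, 0, 1, label="ORG"),  # Apple
    Span(doc, 2, 3, label="ORG"),  # Microsoft
    Span(doc, 5, 6, label="ORG"),  # Google
    Span(doc, 7, 8, label="GPE"),  # London
]

# Conditional inside a list comprehension: keep only ORG entities, then count
org_count = len([ent for ent in doc.ents if ent.label_ == "ORG"])

# Count every label in one pass
label_counts = Counter(ent.label_ for ent in doc.ents)

print(org_count, dict(label_counts))
```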
spaCy includes a built-in visualizer called “displaCy”, which helps us explore the behaviour of the entity recognition model interactively.
If you are training a model, it’s very useful to run the visualization yourself.
You can pass a Doc or a list of Doc objects to displaCy and run displacy.serve to run the web server, or displacy.render to generate the raw mark-up.
#Import the displaCy library
from spacy import displacy
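As a rough sketch of displacy.render outside a notebook, the snippet below asks for the raw HTML as a string by passing jupyter=False. The sentence, the hand-set entities and the blank pipeline are assumptions so that the example runs without a trained model.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# Blank pipeline (assumption): entities are set by hand below
nlp = spacy.blank("en")
doc = nlp("Apple plans to open an office in London")
doc.ents = [Span(doc, 0, 1, label="ORG"), Span(doc, 7, 8, label="GPE")]

# style="ent" selects the entity visualizer; jupyter=False returns the markup
html = displacy.render(doc, style="ent", jupyter=False)

print(html[:200])
```

In a Jupyter notebook you would pass jupyter=True instead to display the result inline.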
Visualizing Sentences Line by Line
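One way to do this, sketched under a few assumptions: spaCy v3 (for the nlp.add_pipe("sentencizer") call), a blank pipeline with hand-set entities, and Span.as_doc() to turn each sentence into its own Doc before rendering.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# Blank pipeline plus a rule-based sentence splitter (spaCy v3 API, an assumption)
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("Apple opened an office in London. Microsoft hired staff in Paris.")
doc.ents = [
    Span(doc, 0, 1, label="ORG"),    # Apple
    Span(doc, 5, 6, label="GPE"),    # London
    Span(doc, 7, 8, label="ORG"),    # Microsoft
    Span(doc, 11, 12, label="GPE"),  # Paris
]

# Render each sentence separately; as_doc() keeps the entity annotations
html_per_sentence = [
    displacy.render(sent.as_doc(), style="ent", jupyter=False)
    for sent in doc.sents
]

for html in html_per_sentence:
    print(html[:120])
```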
Viewing Specific Entities
You can pass a list of entity types to restrict the visualization:
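A short sketch of that restriction, using the "ents" key of the options dictionary. The sentence, hand-set entities and blank pipeline are again assumptions for a predictable result.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# Blank pipeline (assumption): entities are set by hand below
nlp = spacy.blank("en")
doc = nlp("Apple plans to open an office in London")
doc.ents = [Span(doc, 0, 1, label="ORG"), Span(doc, 7, 8, label="GPE")]

# Only ORG entities are highlighted; other entities are left as plain text
options = {"ents": ["ORG"]}
html = displacy.render(doc, style="ent", options=options, jupyter=False)

print(html[:200])
```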
Styling: customize color and effects
You can also pass background color and gradient options:
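As an illustration, the "colors" key of the options dictionary maps entity labels to CSS background values, which can be plain colors or gradients. The specific gradients, sentence and hand-set entities below are arbitrary choices for this sketch.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# Blank pipeline (assumption): entities are set by hand below
nlp = spacy.blank("en")
doc = nlp("Apple plans to open an office in London")
doc.ents = [Span(doc, 0, 1, label="ORG"), Span(doc, 7, 8, label="GPE")]

# Map each label to a CSS background value (example values, chosen arbitrarily)
colors = {
    "ORG": "linear-gradient(90deg, #aa9cfc, #fc9ce7)",
    "GPE": "radial-gradient(yellow, green)",
}
options = {"ents": ["ORG", "GPE"], "colors": colors}
html = displacy.render(doc, style="ent", options=options, jupyter=False)

print(html[:200])
```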
That covers Named Entity Recognition (NER) and its visualization using spaCy. Hope you enjoyed the post.
In the next article I will cover sentence segmentation. Stay tuned!