As a beginner, a good first step towards Natural Language Processing (NLP) is to work on identifying Named Entities in a given natural language text. I came across this task in one of my projects, and I will be writing about the steps I took to arrive at my solution, using references available on the internet.
I used the Stanford NER tool for Named Entity Recognition.
Setting Up the Stanford NER Tool with Python NLTK:
First, download the NER package. (On a Mac, please install Homebrew from https://brew.sh/ before proceeding, so that wget is available.)
wget https://nlp.stanford.edu/software/stanford-ner-2018-02-27.zip
Unzip it and remove the archive:
unzip stanford-ner-2018-02-27.zip
rm stanford-ner-2018-02-27.zip
Place the extracted package wherever you like; I will keep it in my home folder for convenience.
The next step is to set up the environment paths.
export STANFORDTOOLSDIR=$HOME
export CLASSPATH=$STANFORDTOOLSDIR/stanford-ner-2018-02-27/stanford-ner.jar
export STANFORD_MODELS=$STANFORDTOOLSDIR/stanford-ner-2018-02-27/classifiers
We need to have Java installed to run the NER package.
Install a Java JDK if you don't have one already.
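A quick way to check for Java and install it is sketched below; the package names are examples for common platforms, so use whatever your system's package manager provides:

```shell
# Check whether Java is already available
java -version

# If not, install a JDK, e.g. on Ubuntu/Debian:
sudo apt-get install default-jdk

# or on a Mac with Homebrew:
brew install openjdk
```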
Let's get into writing the actual text analysis code.
Before that, let's install the Python NLTK package:
pip install nltk
>>> import nltk
>>> nltk.download('punkt')
Identifying Named Entities:
#importing packages
from nltk.tag import StanfordNERTagger
import argparse
from nltk import word_tokenize
from re import sub
from itertools import groupby
# Stanford NER comes with a 3-class (Location, Person, Organization), a 4-class (Location, Person, Organization, Misc)
# and a 7-class (Location, Person, Organization, Money, Percent, Date, Time) classifier.
# st_ner = StanfordNERTagger('english.muc.7class.distsim.crf.ser.gz')
st_ner = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
# tag each word in each sentence with one of the NER labels
def sentence_tagger(sentence_list):
    named_entities = st_ner.tag_sents(sentence_list)
    return named_entities
# merge adjacent words that share the same tag into a single entity
def get_nodes(tagged_words):
    ent = []
    for tag, chunk in groupby(tagged_words, lambda x: x[1]):
        if tag != "O":
            # join the words of the chunk, then strip any space left before punctuation
            tuple1 = (sub(r'\s+([?.!,"])', r'\1', " ".join(w for w, t in chunk)), tag)
            ent.append(tuple1)
    return ent
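To see what the merging step does on its own, here is a small self-contained sketch. The tagged list below is hand-written to mimic the (word, tag) pairs the Stanford tagger returns; it is illustrative data, not actual tagger output:

```python
from itertools import groupby
from re import sub

# Hand-written example of tagger-style output for
# "Bill Gates is married to Melinda Gates."
tagged = [("Bill", "PERSON"), ("Gates", "PERSON"), ("is", "O"),
          ("married", "O"), ("to", "O"), ("Melinda", "PERSON"),
          ("Gates", "PERSON"), (".", "O")]

entities = []
# groupby collects runs of adjacent pairs that share the same tag
for tag, chunk in groupby(tagged, lambda x: x[1]):
    if tag != "O":
        # join the words, then remove any space before punctuation
        phrase = sub(r'\s+([?.!,"])', r'\1', " ".join(w for w, t in chunk))
        entities.append((phrase, tag))

print(entities)  # [('Bill Gates', 'PERSON'), ('Melinda Gates', 'PERSON')]
```

Note that groupby only merges adjacent runs, which is exactly what we want here: two separate mentions of PERSON words separated by "O" words stay separate entities.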
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-s", "--sentence", default='Bill Gates is married to Melinda Gates.')
    args = parser.parse_args()
    sentence_lis = [args.sentence]
    sentence_list = [word_tokenize(sent) for sent in sentence_lis]
    print(sentence_list)
    named_tags = sentence_tagger(sentence_list)
    print(named_tags)
    for ne in named_tags:
        named_entities = get_nodes(ne)
        print(named_entities)