For a beginner, a good first step into Natural Language Processing (NLP) is identifying named entities in a piece of natural-language text. I came across this task in one of my projects, and in this post I describe the steps I took to arrive at my solution, using references available on the internet.
I used the Stanford NER tool for Named Entity Recognition.
Setting Up the Stanford NER Tool with Python nltk:
First, download the NER package. (On a Mac, please follow https://brew.sh/ to set up Homebrew before proceeding.)
wget https://nlp.stanford.edu/software/stanford-ner-2018-02-27.zip
Unzip it and remove the archive:
unzip stanford-ner-2018-02-27.zip
rm stanford-ner-2018-02-27.zip
Place the extracted folder wherever you like. I put it in my home folder for convenience.
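The download, unzip, and cleanup steps above can also be done from Python. The following is a minimal sketch, assuming Python 3; the helper names extract_zip and download_and_extract are made up for illustration and are not part of any library.

```python
import os
import urllib.request
import zipfile

NER_URL = "https://nlp.stanford.edu/software/stanford-ner-2018-02-27.zip"


def extract_zip(zip_path, target_dir):
    """Extract zip_path into target_dir and return the sorted top-level names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return sorted({name.split("/")[0] for name in archive.namelist()})


def download_and_extract(url=NER_URL, target_dir=os.path.expanduser("~")):
    """Download the archive, unpack it into target_dir, then delete the zip."""
    zip_path = os.path.join(target_dir, os.path.basename(url))
    urllib.request.urlretrieve(url, zip_path)
    top_level = extract_zip(zip_path, target_dir)
    os.remove(zip_path)  # same effect as the rm step
    return top_level
```

Calling download_and_extract() with no arguments would place the extracted folder in the home directory, which matches the layout used in the rest of this post.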
The next step is to set up the environment paths.
export STANFORDTOOLSDIR=$HOME
export CLASSPATH=$STANFORDTOOLSDIR/stanford-ner-2018-02-27/stanford-ner.jar
export STANFORD_MODELS=$STANFORDTOOLSDIR/stanford-ner-2018-02-27/classifiers
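If you would rather not touch your shell profile, the same two variables can be set from inside Python before the tagger is created. This is only a sketch under that assumption; stanford_env and missing_paths are hypothetical helper names, and the release folder name is taken from the download step above.

```python
import os


def stanford_env(tools_dir, release="stanford-ner-2018-02-27"):
    """Return the two variables that the export lines above define."""
    base = os.path.join(tools_dir, release)
    return {
        "CLASSPATH": os.path.join(base, "stanford-ner.jar"),
        "STANFORD_MODELS": os.path.join(base, "classifiers"),
    }


def missing_paths(env):
    """List the variables whose target path does not exist on disk."""
    return [name for name, path in env.items() if not os.path.exists(path)]


env = stanford_env(os.path.expanduser("~"))
os.environ.update(env)  # affects the current Python process only
print(missing_paths(env))  # an empty list means both paths were found
```

A non-empty list is a quick hint that the folder was extracted somewhere other than the home directory.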
We need Java installed to run the NER package, so install a Java JDK if you do not already have one.
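Before moving on, it can help to confirm that a java executable is actually reachable. A small sketch, assuming Python 3; java_path and java_version are illustrative names, not part of nltk or Stanford NER.

```python
import shutil
import subprocess


def java_path():
    """Return the full path of the `java` executable, or None if it is not on PATH."""
    return shutil.which("java")


def java_version():
    """Return the text printed by `java -version`, or None when Java is absent."""
    path = java_path()
    if path is None:
        return None
    # `java -version` writes to stderr, so stderr is merged into stdout here.
    result = subprocess.run(
        [path, "-version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout.strip()


print(java_path() or "java not found on PATH")
```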
Now let's get to the actual text analysis code.
Before that, let's install the Python nltk package and download its punkt tokenizer data:
pip install nltk
>>> import nltk
>>> nltk.download('punkt')
Identifying Named Entities:
# importing packages
from nltk.tag import StanfordNERTagger
import argparse
from nltk import word_tokenize
from re import sub
from itertools import groupby

# Stanford NER comes with 3-class (Location, Person, Organization),
# 4-class (Location, Person, Organization, Misc) and
# 7-class (Location, Person, Organization, Money, Percent, Date, Time) classifiers.
# st_ner = StanfordNERTagger('english.muc.7class.distsim.crf.ser.gz')
st_ner = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')


# tags each word with one of the NER labels
def sentence_tagger(sentence_list):
    named_entities = st_ner.tag_sents(sentence_list)
    return named_entities


# merge adjacent words with the same tag
def get_nodes(tagged_words):
    ent = []
    for tag, chunk in groupby(tagged_words, lambda x: x[1]):
        if tag != "O":
            # join the words of the chunk and re-attach punctuation to the preceding word
            tuple1 = (sub(r'\s+([?.!,"])', r'\1', " ".join(w for w, t in chunk)), tag)
            ent.append(tuple1)
    return ent


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-s", "--sentence", default='Bill Gates is married to Melinda Gates.')
    args = parser.parse_args()
    sentence_lis = [args.sentence]
    # tokenize each sentence into a list of words
    sentence_list = [word_tokenize(sent) for sent in sentence_lis]
    print(sentence_list)
    named_tags = sentence_tagger(sentence_list)
    print(named_tags)
    for ne in named_tags:
        named_entities = get_nodes(ne)
        print(named_entities)

Full project on GitHub: https://github.com/apogre/python_ner
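The merging step in get_nodes can be tried on its own, without Java or the Stanford models. The sketch below uses a hand-written list in the (word, tag) shape that the tagger returns; it is not actual tagger output, and merge_entities is just an illustrative rename of the same logic.

```python
from itertools import groupby
from re import sub

# Hand-written sample in (word, tag) form; NOT real tagger output.
tagged = [
    ("Bill", "PERSON"), ("Gates", "PERSON"), ("is", "O"), ("married", "O"),
    ("to", "O"), ("Melinda", "PERSON"), ("Gates", "PERSON"), (".", "O"),
]


def merge_entities(tagged_words):
    """Join runs of adjacent words that share the same non-"O" tag."""
    entities = []
    for tag, chunk in groupby(tagged_words, key=lambda pair: pair[1]):
        if tag != "O":
            text = " ".join(word for word, _ in chunk)
            # Re-attach punctuation that tokenization split off ("Inc ." -> "Inc.").
            entities.append((sub(r'\s+([?.!,"])', r"\1", text), tag))
    return entities


print(merge_entities(tagged))
# [('Bill Gates', 'PERSON'), ('Melinda Gates', 'PERSON')]
```

One consequence of grouping purely by tag is that two different entities sitting next to each other with the same label are merged into one: the tokens "Paris" and "France", both tagged LOCATION, would come back as the single entity "Paris France".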