Monday, October 8, 2018

Text Analysis (Named Entity Recognition) using Stanford NLP and Python NLTK

For a beginner, a good first step into Natural Language Processing (NLP) is identifying Named Entities in natural language text. I came across this task in one of my projects, and in this post I describe the steps I took to build my solution using references available on the internet.

I used the Stanford NER tool for Named Entity Recognition.

Setting up the Stanford NER tool with Python NLTK:

First, download the NER Package: (For Mac, please follow https://brew.sh/ before proceeding)

wget https://nlp.stanford.edu/software/stanford-ner-2018-02-27.zip

Unzip the archive and remove the zip file:

unzip stanford-ner-2018-02-27.zip

rm stanford-ner-2018-02-27.zip

Place the extracted package wherever you like; I will keep it in my home folder for convenience.

The next step is to set up the environment variables:

export STANFORDTOOLSDIR=$HOME

export CLASSPATH=$STANFORDTOOLSDIR/stanford-ner-2018-02-27/stanford-ner.jar

export STANFORD_MODELS=$STANFORDTOOLSDIR/stanford-ner-2018-02-27/classifiers
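With these variables set, you can sanity-check the download directly from the command line using the Stanford NER CRF classifier's standard CLI. This is a sketch: `sample.txt` is any plain-text file you supply, and it assumes the paths exported above.

```shell
# Run the bundled 3-class CRF classifier over a plain-text file
java -cp "$CLASSPATH" edu.stanford.nlp.ie.crf.CRFClassifier \
  -loadClassifier "$STANFORD_MODELS/english.all.3class.distsim.crf.ser.gz" \
  -textFile sample.txt
```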

We also need Java installed to run the NER package.
 
Install Java JDK: 


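For example, on a Mac with Homebrew already set up (step one above), one way to get a JDK is the `openjdk` formula; the exact formula name may vary with your Homebrew version.

```shell
# Install a JDK via Homebrew, then confirm Java is on the PATH
brew install openjdk
java -version
```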
Let's get into writing the actual text analysis code.

Before that, let's install the Python nltk package:

pip install nltk

>>> import nltk
>>> nltk.download('punkt')

Identifying Named Entities:

# Import packages
from nltk.tag import StanfordNERTagger
import argparse
from nltk import word_tokenize
from re import sub
from itertools import groupby

# Stanford NER comes with a 3-class (Location, Person, Organization),
# a 4-class (adds Misc) and a 7-class (Location, Person, Organization,
# Money, Percent, Date, Time) classifier.

# st_ner = StanfordNERTagger('english.muc.7class.distsim.crf.ser.gz')
st_ner = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')

# Tag each tokenized sentence with NER labels
def sentence_tagger(sentence_list):
    named_entities = st_ner.tag_sents(sentence_list)
    return named_entities

# Merge adjacent words that share the same (non-O) tag into a single entity
def get_nodes(tagged_words):
    ent = []
    for tag, chunk in groupby(tagged_words, lambda x: x[1]):
        if tag != "O":
            tuple1 = (sub(r'\s+([?.!,"])', r'\1', " ".join(w for w, t in chunk)), tag)
            ent.append(tuple1)
    return ent


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-s", "--sentence", default='Bill Gates is married to Melinda Gates.')
    args = parser.parse_args()
    sentences = [args.sentence]
    sentence_list = [word_tokenize(sent) for sent in sentences]
    print(sentence_list)
    named_tags = sentence_tagger(sentence_list)
    print(named_tags)
    for ne in named_tags:
        named_entities = get_nodes(ne)
        print(named_entities)
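To see what the merging step does on its own, here is the `get_nodes` function run against a hand-labeled `(word, tag)` list. The tag sequence below is illustrative input in the format the tagger returns, not actual tagger output, so it runs without the Stanford jars.

```python
from itertools import groupby
from re import sub

# Same merging logic as above: group adjacent words by tag,
# join each non-O group into one entity string
def get_nodes(tagged_words):
    ent = []
    for tag, chunk in groupby(tagged_words, lambda x: x[1]):
        if tag != "O":
            ent.append((sub(r'\s+([?.!,"])', r'\1', " ".join(w for w, t in chunk)), tag))
    return ent

# Hand-labeled example sentence
tagged = [('Bill', 'PERSON'), ('Gates', 'PERSON'), ('is', 'O'),
          ('married', 'O'), ('to', 'O'), ('Melinda', 'PERSON'),
          ('Gates', 'PERSON'), ('.', 'O')]
print(get_nodes(tagged))  # [('Bill Gates', 'PERSON'), ('Melinda Gates', 'PERSON')]
```

Note how the two adjacent PERSON tokens on each side are collapsed into a single two-word entity, while the O-tagged tokens are dropped.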

Full Project on Github: https://github.com/apogre/python_ner
 
