Sunday, October 21, 2018

Relation Triples Extraction with Stanford OpenIE


Setting up Stanford-OpenIE for Relation Triples:
To extract relation triples from a sentence, we can use an unofficial cross-platform Python wrapper for the state-of-the-art Open Information Extraction library from Stanford University.

More details on: https://github.com/philipperemy/Stanford-OpenIE-Python.git

First, we can clone the project.

git clone https://github.com/philipperemy/Stanford-OpenIE-Python.git

Then, I had to remove the "-" characters from the folder name so that it can be imported as a module.

mv Stanford-OpenIE-Python/ StanfordOpenIEPython/

cd StanfordOpenIEPython/



To use the wrapper as a module, we create an empty __init__.py file.

touch __init__.py 

Example code:

from StanfordOpenIEPython.main import stanford_ie
import argparse
import os

FILE_PATH = 'StanfordOpenIEPython/sentences.txt'

def triples_extractor(sentence):
    # overwrite the input file so that only the current sentence is processed
    try:
        os.remove(FILE_PATH)
    except OSError:
        pass
    with open(FILE_PATH, 'a') as text:
        text.write(sentence)
    # the wrapper returns a list of [subject, relation, object] lists
    triples_raw = stanford_ie('sentences.txt', verbose=True)
    triples = [[trip.lstrip() for trip in triple] for triple in triples_raw]
    return triples


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-s", "--sentence", default='Bill Gates is married to Melinda Gates.')
    args = parser.parse_args()
    sentence = args.sentence
    print(triples_extractor(sentence))



Output:
[['Bill Gates', 'is', 'married'], ['Bill Gates', 'is married to', 'Melinda Gates']]
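The wrapper hands back plain lists of strings. If downstream code needs structured records instead, a small sketch of a post-processing step (the `Triple` type and `to_records` helper are my own names, not part of the wrapper):

```python
from collections import namedtuple

# Hypothetical record type for a (subject, relation, object) triple
Triple = namedtuple('Triple', ['subject', 'relation', 'obj'])

def to_records(triples_raw):
    """Convert the wrapper's list-of-lists output into named records,
    stripping stray whitespace from each element."""
    return [Triple(*(part.strip() for part in triple)) for triple in triples_raw]

records = to_records([['Bill Gates', 'is', 'married'],
                      ['Bill Gates', 'is married to', 'Melinda Gates']])
print(records[1].subject, '|', records[1].relation, '|', records[1].obj)
# Bill Gates | is married to | Melinda Gates
```

Named fields make later steps (e.g. building a knowledge graph) easier to read than index-based access like `triple[1]`.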
 

More on: https://github.com/apogre/python_ner/blob/master/triples.py

Monday, October 8, 2018

Text Analysis (Named Entity Recognition) using Stanford NLP and Python NLTK

As a beginner, the first step towards Natural Language Processing (NLP) is to work on identifying Named Entities in a given natural language text. I came across this task in one of my projects and I will be writing about the steps I took to come up with my solution using references available on the internet.

I used the Stanford NER tool for Named Entity Recognition.

Setting Up Stanford NER Tool with python nltk:

First, download the NER Package: (For Mac, please follow https://brew.sh/ before proceeding)

wget https://nlp.stanford.edu/software/stanford-ner-2018-02-27.zip

Unzip the archive and remove the zip file:

unzip stanford-ner-2018-02-27.zip

rm stanford-ner-2018-02-27.zip

Place the package wherever you like; I will keep it in my home folder for convenience.

Next step is to set up the environment paths.

export STANFORDTOOLSDIR=$HOME

export CLASSPATH=$STANFORDTOOLSDIR/stanford-ner-2018-02-27/stanford-ner.jar

export STANFORD_MODELS=$STANFORDTOOLSDIR/stanford-ner-2018-02-27/classifiers

We need to have Java installed to run the NER package.
 
Install Java JDK: 


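The post doesn't spell out the install command; one common route (the package names below are my assumption and vary by platform and Java version):

```shell
# macOS, via Homebrew
brew install openjdk

# Debian/Ubuntu alternative
sudo apt-get install default-jdk

# confirm Java is available
java -version
```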
Let's get into writing the actual text analysis code.

Before that, let's install the Python nltk package:

pip install nltk

>>> import nltk
>>> nltk.download('punkt')

Identifying Named Entities:

# import the required packages

from nltk.tag import StanfordNERTagger
import argparse
from nltk import word_tokenize
from re import sub
from itertools import groupby

# Stanford NER comes with 3-class (Location, Person, Organization),
# 4-class (Location, Person, Organization, Misc) and 7-class (Location,
# Person, Organization, Money, Percent, Date, Time) classifiers.

# st_ner = StanfordNERTagger('english.muc.7class.distsim.crf.ser.gz')
st_ner = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')

# tag each word with one of the NER labels
def sentence_tagger(sentence_list):
    named_entities = st_ner.tag_sents(sentence_list)
    return named_entities

# merge adjacent words that share the same tag
def get_nodes(tagged_words):
    ent = []
    for tag, chunk in groupby(tagged_words, lambda x:x[1]):
        if tag != "O":
            tuple1 = (sub(r'\s+([?.!,"])', r'\1', " ".join(w for w, t in chunk)), tag)
            ent.append(tuple1)
    return ent


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-s", "--sentence", default='Bill Gates is married to Melinda Gates.')
    args = parser.parse_args()
    sentence_lis = [args.sentence]
    sentence_list = [word_tokenize(sent) for sent in sentence_lis]
    print(sentence_list)
    named_tags = sentence_tagger(sentence_list)
    print(named_tags)
    for ne in named_tags:
        named_entities = get_nodes(ne)
        print(named_entities)
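The merging step in get_nodes can be tried without running the tagger at all. A sketch with hand-written tags, mirroring the (word, tag) pairs that tag_sents produces for one sentence:

```python
from itertools import groupby
from re import sub

def get_nodes(tagged_words):
    # group consecutive words by their tag; keep only non-'O' groups,
    # joining each group's words into a single entity string
    ent = []
    for tag, chunk in groupby(tagged_words, lambda x: x[1]):
        if tag != "O":
            ent.append((sub(r'\s+([?.!,"])', r'\1', " ".join(w for w, t in chunk)), tag))
    return ent

tagged = [('Bill', 'PERSON'), ('Gates', 'PERSON'), ('is', 'O'),
          ('married', 'O'), ('to', 'O'),
          ('Melinda', 'PERSON'), ('Gates', 'PERSON'), ('.', 'O')]
print(get_nodes(tagged))
# [('Bill Gates', 'PERSON'), ('Melinda Gates', 'PERSON')]
```

Because groupby only merges *consecutive* items with the same key, two separate people in one sentence stay separate entities even though they share the PERSON tag.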

Full Project on Github: https://github.com/apogre/python_ner