- Thu 22 August 2019
- Projects
- Tony Hall
- #NLP Banking
Terms & Conditions for products and services are typically long and wordy, but they often contain important obligations for either the provider or the recipient. Businesses need to ensure they are living up to the obligations they have set out in their terms and conditions, and large organisations typically have thousands of pages of such documents. Solving the business problem "what obligations do I, as a business, have buried in all my terms and conditions documents?" can be automated using pre-trained Natural Language Processing (NLP) models available for Python.
Below is an example which takes a Terms and Conditions .pdf booklet from my bank (ANZ) and structures the text into pages and sentences (using the Python Natural Language Toolkit, NLTK). It then uses a pre-trained machine learning model from the spaCy library to identify the named entities in each sentence (i.e. the service provider 'ANZ' and the customer or 'merchant'), and extracts all sentences in which a named entity is followed by an obligation, for example 'ANZ will provide...' or 'The Merchant must pay...'.
Businesses can then confirm they are meeting their obligations to customers using this shortlist of 'obligation' sentences.
Step 1 - Download the Terms and Conditions from the bank's website and convert to lists of pages and sentences
Install and import packages for extracting information from .pdf documents and natural language processing
!pip install wget pdfminer3k utils https://github.com/timClicks/slate/archive/master.zip nltk spacy
!python -m spacy download en_core_web_sm
#libraries for getting the .pdf document and converting to string
import wget
import slate
#NLP libraries
from nltk.tokenize import sent_tokenize
import nltk
nltk.download('punkt')
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
Download the .pdf document from the ANZ Bank website and pull it into a list using the Slate package
url = 'https://www.anz.com.au/content/dam/anzcomau/documents/pdf/fastpaynext-tc.pdf'
filename = wget.download(url)
#Use slate to pull the .pdf into a list of strings (one per page) called 'document'
with open(filename, 'rb') as f:
    document = slate.PDF(f)
Use the NLTK library to tokenize (i.e. break down) the pages into lists of sentences
#create a list called 'tokendoc' of pages. Tokenize each page.
tokendoc = []
for page in document:
    tokendoc.append(sent_tokenize(page))
Each sentence of the document can now be accessed using the tokendoc variable and the relevant page and sentence numbers. The example below shows page 4, sentence 2.
tokendoc[4][2]
Strip all the newline characters out of all sentences in the text
#define a function to remove the \n characters from a list of strings
def strip_newlines(strings):
    stripped = [s.replace('\n', '') for s in strings]
    return stripped
#strip the \n characters from all sentences on all pages
for i, page in enumerate(tokendoc):
    tokendoc[i] = strip_newlines(page)
Page four with all newline characters removed is shown below:
tokendoc[4]
Named Entity Recognition
An instance of the 'en_core_web_sm' pre-trained NLP model can be used to detect named entities (ANZ, the Merchant, etc.) and label them as follows:
TYPE | DESCRIPTION |
---|---|
PERSON | People, including fictional. |
NORP | Nationalities or religious or political groups. |
FAC | Buildings, airports, highways, bridges, etc. |
ORG | Companies, agencies, institutions, etc. |
GPE | Countries, cities, states. |
LOC | Non-GPE locations, mountain ranges, bodies of water. |
PRODUCT | Objects, vehicles, foods, etc. (Not services.) |
EVENT | Named hurricanes, battles, wars, sports events, etc. |
WORK_OF_ART | Titles of books, songs, etc. |
LAW | Named documents made into laws. |
LANGUAGE | Any named language. |
DATE | Absolute or relative dates or periods. |
TIME | Times smaller than a day. |
PERCENT | Percentage, including "%". |
MONEY | Monetary values, including unit. |
QUANTITY | Measurements, as of weight or distance. |
ORDINAL | “first”, “second”, etc. |
CARDINAL | Numerals that do not fall under another type. |
nlp = en_core_web_sm.load()
sentence = tokendoc[4][5]
doc = nlp(sentence)
Below you can see the entity IOB tag and entity type for each token in the selected sentence:
print([(X, X.ent_iob_, X.ent_type_) for X in doc])
We can also visually represent the labels for a full page, such as page 12 below:
for sentence in tokendoc[12]:
    if nlp(sentence).ents:
        displacy.render(nlp(sentence), jupyter=True, style='ent')
The pre-trained model correctly identifies 'ANZ' and the 'Merchant' as organisations and 'Australian' as a nationality. It gets a few other entities wrong, such as 'Transaction', which it classifies as a person. A model trained on specific banking terms and conditions documents is likely to be much more accurate. However, for the purposes of identifying obligations, this model only needs to correctly identify named entities in general, and ANZ and Merchant as organisations in particular, which it achieves.
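Short of training a custom model, one lightweight way to correct specific labels is spaCy's EntityRuler, which pins known terms to a label before the statistical recogniser runs. The snippet below is a sketch only, assuming the spaCy v2.x API used above; the pattern list is illustrative rather than part of the original pipeline:
from spacy.pipeline import EntityRuler
# pin the terms we care about to the ORG label (illustrative patterns)
ruler = EntityRuler(nlp)
ruler.add_patterns([{"label": "ORG", "pattern": "ANZ"},
                    {"label": "ORG", "pattern": "Merchant"}])
# adding the ruler before the 'ner' component means its spans are respected by the recogniser
nlp.add_pipe(ruler, before="ner")
Re-running the displacy cell above should then show 'ANZ' and 'Merchant' consistently tagged as organisations.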
It's also possible to extract other information about each word, including:
- Text: The original word text.
- Lemma: The base form of the word.
- POS: The simple part-of-speech tag.
- Tag: The detailed part-of-speech tag.
- Dep: Syntactic dependency, i.e. the relation between tokens.
- Shape: The word shape – capitalization, punctuation, digits.
- is alpha: Is the token an alpha character?
- is stop: Is the token part of a stop list, i.e. the most common words of the language?
sentence = tokendoc[12][9]
doc = nlp(sentence)
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)
Extract obligations
Obligations in the text are recognisable by a named entity followed by a keyword indicating an obligation, such as 'will' or 'must'.
First we define the obligation keywords:
# define a list of words that include obligations by an entity
ob_words = ['must', 'will','provides', 'is obliged to', 'has to', 'needs to', 'is required to']
Then we define a function which returns the entity when a sentence contains an obligation:
def obligation(sentence):
    s = nlp(sentence)
    # collect the text of every named entity the model finds in the sentence
    entities = []
    for word in s.ents:
        entities.append(word.text)
    # return the entity if the token immediately following it is an obligation keyword
    for word in s:
        if word.text in entities:
            if word.nbor().text in ob_words:
                return word.text
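As a quick sanity check, the function can be called on a single made-up sentence (not from the document); it should return 'ANZ', provided the model tags 'ANZ' as a named entity:
obligation('ANZ will provide the Merchant with a monthly statement.')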
Now we can call the function for every page and every sentence in the document, printing any sentences that contain obligations (along with the page numbers, and any errors, to keep track of where we are):
obligation_count = 0
sentence_count = 0
Entity = 'ANZ'
for page_num, page in enumerate(tokendoc):
    for sentence in page:
        sentence_count = sentence_count + 1
        try:
            if obligation(sentence) == Entity:
                obligation_count = obligation_count + 1
                print('Page', page_num, ':')
                print(sentence)
        except IndexError:
            # word.nbor() raises IndexError when an entity is the last token in a sentence
            print('Page', page_num, ':')
            print("error")
print('The total number of obligations for', Entity,'is: ',obligation_count )
print('The total number of sentences in the document is: ',sentence_count)
Success! This document contains 604 sentences but only 24 contain obligations for ANZ, saving significant review time. We could now write these to .csv for further processing or audit.
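As a minimal sketch of that last step (reusing the pandas import from earlier; the output filename is just illustrative), the matching sentences could be collected into a DataFrame and written to a .csv file:
# collect the obligation sentences into rows instead of printing them
rows = []
for page_num, page in enumerate(tokendoc):
    for sentence in page:
        try:
            if obligation(sentence) == Entity:
                rows.append({'page': page_num, 'entity': Entity, 'sentence': sentence})
        except IndexError:
            pass
# write the shortlist out for review or audit ('anz_obligations.csv' is an illustrative name)
obligations_df = pd.DataFrame(rows)
obligations_df.to_csv('anz_obligations.csv', index=False)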