Dataist Dogma

Reflections and projects in Data Science, Machine Learning and AI

"A critical examination of the Dataist dogma is likely to be not only the greatest scientific challenge of the twenty-first century, but also the most urgent political and economic project" - Yuval Noah Harari - Homo Deus: a Brief History of the Future (2016)

Using NLP to extract terms and conditions


Terms & Conditions for products and services are typically long and wordy, but they often contain important obligations for either the provider or the recipient. Businesses need to ensure they are living up to the obligations they have set out for themselves in their terms and conditions, and large organisations typically have thousands of pages of such documents. The business problem "what obligations do I have as a business, buried in all my terms and conditions documents?" can be automated using pre-trained Natural Language Processing (NLP) models available for Python.

Below is an example which takes a Terms and Conditions .pdf booklet from my bank (ANZ) and structures the text into pages and sentences (using the Python Natural Language Toolkit, NLTK). It then uses a pre-trained machine learning model from the spaCy library to identify the named entities in each sentence (i.e. the service provider 'ANZ' and the customer or 'Merchant'). Finally, it extracts all sentences in which a named entity is followed by an obligation, for example sentences such as 'ANZ will provide...' or 'The Merchant must pay...'.

Businesses can then confirm they are meeting their obligations to customers using this shortlist of 'obligation' sentences.

Step 1 - Download the Terms and Conditions from the bank's website and convert them to lists of pages and sentences

Install and import packages for extracting information from .pdf documents and natural language processing

In [4]:
!pip install wget pdfminer3k utils https://github.com/timClicks/slate/archive/master.zip nltk spacy
!python -m spacy download en_core_web_sm
Collecting https://github.com/timClicks/slate/archive/master.zip
  Downloading https://github.com/timClicks/slate/archive/master.zip
     / 286kB 3.0MB/s
Requirement already satisfied (use --upgrade to upgrade): slate==0.5.2 from https://github.com/timClicks/slate/archive/master.zip in /opt/conda/envs/Python36/lib/python3.6/site-packages
Requirement already satisfied: wget in /opt/conda/envs/Python36/lib/python3.6/site-packages (3.2)
Requirement already satisfied: pdfminer3k in /opt/conda/envs/Python36/lib/python3.6/site-packages (1.3.1)
Requirement already satisfied: utils in /opt/conda/envs/Python36/lib/python3.6/site-packages (0.9.0)
Requirement already satisfied: nltk in /opt/conda/envs/Python36/lib/python3.6/site-packages (3.4)
Requirement already satisfied: spacy in /opt/conda/envs/Python36/lib/python3.6/site-packages (2.1.8)
Requirement already satisfied: setuptools in /opt/conda/envs/Python36/lib/python3.6/site-packages (from slate==0.5.2) (40.8.0)
Requirement already satisfied: pytest>=2.0 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from pdfminer3k) (4.2.1)
Requirement already satisfied: ply>=3.4 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from pdfminer3k) (3.11)
Requirement already satisfied: six in /opt/conda/envs/Python36/lib/python3.6/site-packages (from nltk) (1.12.0)
Requirement already satisfied: singledispatch in /opt/conda/envs/Python36/lib/python3.6/site-packages (from nltk) (3.4.0.3)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from spacy) (2.21.0)
Requirement already satisfied: srsly<1.1.0,>=0.0.6 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from spacy) (0.1.0)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from spacy) (2.0.2)
Requirement already satisfied: numpy>=1.15.0 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from spacy) (1.15.4)
Requirement already satisfied: wasabi<1.1.0,>=0.2.0 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from spacy) (0.2.2)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from spacy) (1.0.2)
Requirement already satisfied: thinc<7.1.0,>=7.0.8 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from spacy) (7.0.8)
Requirement already satisfied: plac<1.0.0,>=0.9.6 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from spacy) (0.9.6)
Requirement already satisfied: blis<0.3.0,>=0.2.2 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from spacy) (0.2.4)
Requirement already satisfied: preshed<2.1.0,>=2.0.1 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from spacy) (2.0.1)
Requirement already satisfied: py>=1.5.0 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from pytest>=2.0->pdfminer3k) (1.7.0)
Requirement already satisfied: attrs>=17.4.0 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from pytest>=2.0->pdfminer3k) (18.2.0)
Requirement already satisfied: atomicwrites>=1.0 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from pytest>=2.0->pdfminer3k) (1.3.0)
Requirement already satisfied: pluggy>=0.7 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from pytest>=2.0->pdfminer3k) (0.8.1)
Requirement already satisfied: more-itertools>=4.0.0 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from pytest>=2.0->pdfminer3k) (5.0.0)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from requests<3.0.0,>=2.13.0->spacy) (2019.6.16)
Requirement already satisfied: idna<2.9,>=2.5 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from requests<3.0.0,>=2.13.0->spacy) (2.8)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from requests<3.0.0,>=2.13.0->spacy) (1.24.1)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from requests<3.0.0,>=2.13.0->spacy) (3.0.4)
Requirement already satisfied: tqdm<5.0.0,>=4.10.0 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from thinc<7.1.0,>=7.0.8->spacy) (4.31.1)
Building wheels for collected packages: slate
  Building wheel for slate (setup.py) ... done
  Stored in directory: /home/dsxuser/.tmp/pip-ephem-wheel-cache-cphzc9n2/wheels/74/e8/2c/ea67445a8f160ee922447f33ac6f768ee0244b4962db8d5fc3
Successfully built slate
Collecting en_core_web_sm==2.1.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz#egg=en_core_web_sm==2.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz (11.1MB)
     |████████████████████████████████| 11.1MB 1.6MB/s eta 0:00:01
Building wheels for collected packages: en-core-web-sm
  Building wheel for en-core-web-sm (setup.py) ... done
  Stored in directory: /home/dsxuser/.tmp/pip-ephem-wheel-cache-g3feniig/wheels/39/ea/3b/507f7df78be8631a7a3d7090962194cf55bc1158572c0be77f
Successfully built en-core-web-sm
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-2.1.0
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
In [5]:
#libraries for getting the .pdf document and converting to string
import wget
import slate

#NLP libraries
from nltk.tokenize import sent_tokenize
import nltk
nltk.download('punkt')
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
[nltk_data] Downloading package punkt to /home/dsxuser/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

Download the pdf document from the ANZ Bank website and pull the document into a list using the Slate package

In [6]:
url = 'https://www.anz.com.au/content/dam/anzcomau/documents/pdf/fastpaynext-tc.pdf'
filename = wget.download(url)
In [7]:
#Use slate to pull the .pdf into a list of strings (one per page) called 'document'
with open(filename, 'rb') as f:
    document = slate.PDF(f)

Use the NLTK library to tokenize (i.e. break down) the pages into lists of sentences

In [8]:
#create a list called 'tokendoc' of pages. Tokenize each page.
tokendoc = []
for page in document:
    tokendoc.append(sent_tokenize(page))

Each sentence of the document can now be accessed through the tokendoc variable using the relevant page and sentence indices. The example shows the sentence at index 2 on the page at index 4 (Python list indices are zero-based).

In [10]:
tokendoc[4][2]
Out[10]:
'The Merchant Agreement (or \nAgreement) consists of:\n\n5\n\nA .'

Strip all the newline characters out of all sentences in the text

In [11]:
#define a function to remove the \n characters from a list of strings
def strip_newlines(sentences):
    #'sentences' avoids shadowing the built-in name 'list'
    return [s.replace('\n', '') for s in sentences]
In [12]:
#strip the \n characters from all sentences on all pages
for i, page in enumerate(tokendoc):
    tokendoc[i] = strip_newlines(page)

Page four with all newline characters removed is shown below:

In [14]:
tokendoc[4]
Out[14]:
['1.',
 'Your Merchant AgreementThese General Conditions are part of your Merchant Agreement with ANZ.',
 'The Merchant Agreement (or Agreement) consists of:5A .',
 'Your Letter of Of fer;B.',
 'These General Conditions;C .',
 'The ANZ FastPay Next Generation App Terms and Conditions and Licence Agreement;D.E.  The ANZ FastPay Next Generation Merchant Operating Guide; and Any Special Conditions set out in your Letter of Of fer or otherwise agreed in writing by you and ANZ to be Special Conditions,as varied from time to time in accordance with these General Conditions.',
 'It is advisable that you read all documents referred to above as these are the terms on which ANZ will provide ANZ FastPay.',
 'Some words and expressions have special meanings in these General Conditions.',
 'The meanings are described in Conditions 39 and 40.',
 '2.',
 'Provision of ANZ FastPaya) ANZ agrees to provide the Merchant with ANZ FastPay in accordance with the Agreement, provided that the Merchant meets its obligations under the Agreement.',
 'b) ANZ will provide ANZ FastPay unless:i.ii.',
 'iii.',
 'the Agreement is terminated (including as a result of a breach by the Merchant of its obligations under the Agreement);  ANZ FastPay is suspended in accordance with the Agreement (including as a result of a breach by the Merchant of its obligations under the Agreement); or there is a change in Law or to the regulations, by-laws, rules or requirements of a third party that enables the use or operation of ANZ FastPay, or of a Nominated Card Scheme, that prevents ANZ from providing ANZ FastPay.']
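
The output above shows a side effect of replacing each '\n' with an empty string: words on either side of a line break are fused together, as in 'AgreementThese'. A minimal sketch of one alternative is given below, assuming it is acceptable to collapse every run of whitespace into a single space. The helper normalise_whitespace is hypothetical and is not part of the original notebook; the sample string is the raw sentence shown in Out[10].

```python
import re

def normalise_whitespace(sentences):
    """Collapse every run of whitespace (including newlines) into one space."""
    return [re.sub(r'\s+', ' ', s).strip() for s in sentences]

# Raw sentence as returned by the tokenizer before any clean-up (see Out[10])
raw = ['The Merchant Agreement (or \nAgreement) consists of:\n\n5\n\nA .']
cleaned = normalise_whitespace(raw)
print(cleaned)
# ['The Merchant Agreement (or Agreement) consists of: 5 A .']
```

Stray page numbers (the '5' above) and list labels still remain in the text, so this only addresses the fused words, not every extraction artefact.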

Named Entity Recognition

An instance of the 'en_core_web_sm' pre-trained NLP model can be used to detect named entities (ANZ, Merchant etc.) and label them with the following types:

TYPE         DESCRIPTION
PERSON       People, including fictional.
NORP         Nationalities or religious or political groups.
FAC          Buildings, airports, highways, bridges, etc.
ORG          Companies, agencies, institutions, etc.
GPE          Countries, cities, states.
LOC          Non-GPE locations, mountain ranges, bodies of water.
PRODUCT      Objects, vehicles, foods, etc. (Not services.)
EVENT        Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART  Titles of books, songs, etc.
LAW          Named documents made into laws.
LANGUAGE     Any named language.
DATE         Absolute or relative dates or periods.
TIME         Times smaller than a day.
PERCENT      Percentage, including "%".
MONEY        Monetary values, including unit.
QUANTITY     Measurements, as of weight or distance.
ORDINAL      "first", "second", etc.
CARDINAL     Numerals that do not fall under another type.
In [19]:
nlp = en_core_web_sm.load()

sentance = tokendoc[4][5]
doc = nlp(sentance)

Below you can see the IOB tag (B = beginning of an entity, I = inside an entity, O = outside any entity) and the entity type for each token in the selected sentence

In [21]:
print([(X, X.ent_iob_, X.ent_type_) for X in doc])
[(The, 'B', 'LAW'), (ANZ, 'I', 'LAW'), (FastPay, 'I', 'LAW'), (Next, 'I', 'LAW'), (Generation, 'I', 'LAW'), (App, 'I', 'LAW'), (Terms, 'I', 'LAW'), (and, 'I', 'LAW'), (Conditions, 'I', 'LAW'), (and, 'O', ''), (Licence, 'O', ''), (Agreement;D.E., 'O', ''), ( , 'O', ''), (The, 'O', ''), (ANZ, 'O', ''), (FastPay, 'O', ''), (Next, 'O', ''), (Generation, 'O', ''), (Merchant, 'O', ''), (Operating, 'O', ''), (Guide, 'O', ''), (;, 'O', ''), (and, 'O', ''), (Any, 'O', ''), (Special, 'O', ''), (Conditions, 'O', ''), (set, 'O', ''), (out, 'O', ''), (in, 'O', ''), (your, 'O', ''), (Letter, 'O', ''), (of, 'O', ''), (Of, 'O', ''), (fer, 'O', ''), (or, 'O', ''), (otherwise, 'O', ''), (agreed, 'O', ''), (in, 'O', ''), (writing, 'O', ''), (by, 'O', ''), (you, 'O', ''), (and, 'O', ''), (ANZ, 'B', 'ORG'), (to, 'O', ''), (be, 'O', ''), (Special, 'B', 'ORG'), (Conditions, 'I', 'ORG'), (,, 'O', ''), (as, 'O', ''), (varied, 'O', ''), (from, 'O', ''), (time, 'O', ''), (to, 'O', ''), (time, 'O', ''), (in, 'O', ''), (accordance, 'O', ''), (with, 'O', ''), (these, 'O', ''), (General, 'O', ''), (Conditions, 'O', ''), (., 'O', '')]
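
To make the IOB tags easier to read, the per-token output can be merged back into whole entities. The sketch below works on plain (text, iob, label) tuples, abbreviated by hand from the output above, and joins each 'B' token with the 'I' tokens that follow it. The function group_entities is a hypothetical helper written for illustration; in spaCy itself the grouped spans are already available through doc.ents.

```python
def group_entities(tagged_tokens):
    """Merge (text, iob, label) triples into (entity_text, label) pairs."""
    entities = []
    current_words, current_label = [], None
    for text, iob, label in tagged_tokens:
        if iob == 'B':
            # A new entity begins: store any entity still in progress
            if current_words:
                entities.append((' '.join(current_words), current_label))
            current_words, current_label = [text], label
        elif iob == 'I' and current_words:
            # Continuation of the current entity
            current_words.append(text)
        else:
            # 'O' token: close the entity in progress, if any
            if current_words:
                entities.append((' '.join(current_words), current_label))
            current_words, current_label = [], None
    if current_words:
        entities.append((' '.join(current_words), current_label))
    return entities

# Abbreviated sample in the same shape as the printed output
tagged = [
    ('The', 'B', 'LAW'), ('ANZ', 'I', 'LAW'), ('FastPay', 'I', 'LAW'),
    ('and', 'O', ''), ('ANZ', 'B', 'ORG'), ('to', 'O', ''), ('be', 'O', ''),
    ('Special', 'B', 'ORG'), ('Conditions', 'I', 'ORG'), (',', 'O', ''),
]
print(group_entities(tagged))
# [('The ANZ FastPay', 'LAW'), ('ANZ', 'ORG'), ('Special Conditions', 'ORG')]
```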

We can also visually represent the labels for a full page, such as the page at index 12 below:

In [29]:
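for sentance in tokendoc[12]:
    #parse each sentence once and reuse the result
    sent_doc = nlp(str(sentance))
    #only render sentences that contain at least one named entity
    if sent_doc.ents:
        displacy.render(sent_doc, jupyter=True, style='ent')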
12specif ically regulating or prohibiting the retention by merchants of Cardholders ORG ’ personal identif ication numbers, passwords or other codes or information that can be used to access a Cardholder ’s ORG account will constitute a breach of this undertaking.
13o DATE ) The Merchant must ensure that it processes all Transactions ORG in accordance with the requirements of any Nominated Card Scheme WORK_OF_ART rules that ANZ ORG notif ies to the Merchant ORG .
p) The Merchant must ensure that each Transaction PERSON is recorded in Australian NORP dollars.
Authorisationsa) The Merchant ORG must seek prior authorisation from the Authorisation Centre FAC for any Transaction where:i. ii.
in the case of a Credit Transaction ORG , the value is in excess of the Authorised Floor Limit ORG ; the Transaction ORG , if processed, would result in the total dollar value of all Transactions ORG processed in a calendar week DATE exceeding the Weekly Transaction Limit LAW ; the Merchant is aware that, or considers it is possible that, a signature is a forgery or is unauthorised or there is an unauthorised use or forgery of the Nominated Card ORG ; the account number appearing on the Nominated Card LOC does not correspond with the number printed, encoded or otherwise shown on the Nominated Card ORG ; the Cardholder PERSON presents a Nominated Card WORK_OF_ART at a time which is not within current validity dates shown on the Nominated Card ORG ; the signature panel on the Nominated Card LOC is blank or the signature has been altered or defaced; vii.
the ANZ FastPay App ORG instructs the Merchant ORG to contact the Authorisation Centre;viii LAW .
the Transaction ORG is of a certain type or class which has been notif ied to the Merchant ORG by ANZ ORG as a type or class of Transaction ORG requiring authorisation.

The pre-trained model correctly identifies 'ANZ' and the 'Merchant' as organisations and 'Australian' as a nationality. It gets a few other entities wrong, such as 'Transaction', which it classifies as a person. A model trained on banking terms and conditions documents specifically is likely to be much more accurate. However, for the purposes of identifying obligations, this model only needs to correctly identify named entities in general, and ANZ and Merchant as organisations in particular, which it achieves.

It's also possible to extract other information about each word, including:

  • Text: The original word text.
  • Lemma: The base form of the word.
  • POS: The simple part-of-speech tag.
  • Tag: The detailed part-of-speech tag.
  • Dep: Syntactic dependency, i.e. the relation between tokens.
  • Shape: The word shape – capitalization, punctuation, digits.
  • is alpha: Is the token an alpha character?
  • is stop: Is the token part of a stop list, i.e. the most common words of the language?

sentance = tokendoc[12][9]
doc = nlp(sentance)
for token in doc:
    #the attributes ending in an underscore return readable strings rather than integer IDs
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)

Extract obligations

Obligations in the text are recognisable as a named entity followed directly by a keyword indicating an obligation, such as 'will' or 'must'.

First we define the obligation keywords:

In [50]:
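# define a list of words and phrases that signal an obligation by an entity
# note: obligation() below compares one token at a time, so only the single-word entries can match
ob_words = ['must', 'will','provides', 'is obliged to', 'has to', 'needs to', 'is required to']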

Then we define a function which returns relevant entities when a sentance has an obligation in it:

In [51]:
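def obligation(sentance):
    #return the text of the first named entity that is directly followed
    #by an obligation keyword (returns None if no such entity is found)
    s = nlp(sentance)

    #collect the text of every named entity found in the sentence
    entities = [ent.text for ent in s.ents]

    for word in s:
        #a single token can only equal a single-token entity such as 'ANZ'
        if word.text in entities:
            #nbor() returns the next token and raises IndexError on the last token
            if word.nbor().text in ob_words:
                return word.text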
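
Because the function above compares a single following token against ob_words, the multi-word entries in that list ('is obliged to', 'has to', 'needs to', 'is required to') can never match. A minimal sketch of phrase-aware matching is shown below. It works on plain whitespace-split tokens rather than spaCy tokens, which is a simplifying assumption, and find_obligor, OB_PHRASES and the first and third sample sentences are hypothetical names and examples introduced for illustration only.

```python
OB_PHRASES = ['must', 'will', 'provides', 'is obliged to', 'has to',
              'needs to', 'is required to']

def find_obligor(tokens, entities, phrases=OB_PHRASES):
    """Return the first entity directly followed by an obligation phrase.

    tokens   -- list of word strings (here produced by str.split)
    entities -- set of single-word entity names to look for
    """
    phrase_tokens = [p.split() for p in phrases]
    for i, token in enumerate(tokens):
        if token not in entities:
            continue
        following = tokens[i + 1:]
        for phrase in phrase_tokens:
            # Compare as many following tokens as the phrase has words
            if following[:len(phrase)] == phrase:
                return token
    # No entity is followed by an obligation phrase
    return None

names = {'ANZ', 'Merchant'}
print(find_obligor('ANZ is required to notify the Merchant'.split(), names))  # ANZ
print(find_obligor('The Merchant must pay the fee'.split(), names))           # Merchant
print(find_obligor('Fees are payable to ANZ'.split(), names))                 # None
```

Slicing the remaining tokens also means an entity at the very end of a sentence simply fails to match instead of raising an error.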

Now we can call the function for every page and every sentence in the document, printing any sentences that contain obligations (as well as all the page numbers and any errors just to keep track of where we are)

In [61]:
obligation_count = 0
sentence_count = 0
Entity = 'ANZ'
#page_number is the zero-based index of the page in tokendoc
for page_number, page in enumerate(tokendoc):
    for sentance in page:
        sentence_count += 1
        try:
            if obligation(sentance) == Entity:
                obligation_count += 1
                print('Page', page_number, ':')
                print(sentance)
        except Exception:
            #obligation() can fail, e.g. when an entity is the last token of a sentence
            print('Page', page_number, ':')
            print("error")
print('The total number of obligations for', Entity, 'is: ', obligation_count)
print('The total number of sentences in the document is: ', sentence_count)
Page 3 :
error
Page 4 :
It is advisable that you read all documents referred to above as these are the terms on which ANZ will provide ANZ FastPay.
Page 4 :
b) ANZ will provide ANZ FastPay unless:i.ii.
Page 6 :
Card Readera) ANZ will provide the Merchant with a Card Reader;b) ANZ will provide the Merchant with additional Card Readers if requested.
Page 13 :
Where the Merchant ’s Authorised Floor Limit is changed for any other reason, ANZ will provide the Merchant with reasonable notice of the change.
Page 15 :
16b) ANZ will issue a monthly statement to the Merchant showing a summary of Transactions processed by ANZ to the Merchant Account during the previous month.
Page 15 :
During that 30 day period, ANZ will investigate the Transaction to determine whether ANZ will either:iii.
Page 16 :
Immediately prior to the end of any deferred period, ANZ will review the relevant circumstance set out in (i) to (iv) above, to determine whether deferred settlements should continue and what period that deferral should be.
Page 16 :
ANZ will advise the Merchant in writing of its decision on completing the review.
Page 19 :
b) If ANZ receives a payment from a Cardholder relating to an Invalid Transaction that has been charged back to the Merchant, ANZ will pay an amount equal to that payment to the Merchant less any amount which ANZ is entitled to withhold or set-of f under the Agreement.
Page 20 :
b) If a Retention Notice is given to the Merchant and a Retention Account has not previously been established in relation to the Merchant, ANZ will establish a Retention Account in relation to the Merchant.
Page 20 :
d) Once the balance of the Retention Account reaches the Retention Amount, ANZ will continue to deduct further Retained Proceeds from the Merchant ’s settlement proceeds processed through ANZ FastPay and retain these Retained Proceeds in the Retention Account, but will release a corresponding amount to the Merchant Account so that the balance of the Retention Account (af ter any deductions made in accordance with this Agreement) remains at the Retention Amount.
Page 22 :
ANZ will notify the Merchant of any such obligations and, to the extent practicable, will provide the Merchant with a reasonable period of time to comply with such obligations.
Page 22 :
b) ANZ will notify the Merchant of any noncompliance alert received from a Nominated Card Scheme as a result of the Merchant ’s breach of the Nominated Card Scheme rules (“ANZ Notice”).The ANZ Notice must:i. ii.
Page 24 :
242513.3 Privacy and confidentialitya) ANZ will collect and use information about you during the course of your relationship with ANZ.
Page 27 :
the circumstances in which ANZ may collect personal information from other sources(including from a third party); how to access personal information and seek correction of personal information; and how you can raise concerns that ANZ has breached the Privacy Act or an applicable code and how ANZ will deal with those matters.
Page 27 :
Collecting sensitive informationm) ANZ will not collect sensitive information about you, such as information about your health, without your consent.
Page 29 :
c) If ANZ debits the Merchant Account, ANZ will give the Merchant written notice that ANZ has done this.
Page 31 :
ANZ will also take reasonable steps to mitigate any claims, damages, actions, losses or liabilities which are the subject of this indemnity.
Page 31 :
ANZ Liabilitya) To the extent permitted by Law, ANZ will not be liable for any loss or damage (including consequential loss or damage) suf fered by the Merchant under the Agreement including, but not limited to any loss or damage:i. ii.
Page 32 :
ANZ will take all commercially reasonable steps to reduce the duration should such interruption or breakdown occur but will not otherwise have any liability for any failure, delay or other matter resulting from it.
Page 38 :
If the actual liability proves to be less than the amount set of f combined or appropriated, ANZ must pay the Merchant the amount of the dif ference.
Page 39 :
Despite this clause, ANZ will always give you notice in accordance with any applicable laws or industry codes (such as the Banking Code of Practice) which require any minimum notice periods or specif ic methods of notif ication.
Page 42 :
In specif ying the type of Security and amount secured under this clause, ANZ will ac t in accordance with what is reasonably necessary to protec t its legitimate commercial interests.
Page 47 :
If this is not possible, ANZ will keep the Merchant informed on the progress of the matter and how long ANZ expects it will take to resolve the complaint.
The total number of obligations for ANZ is:  24
The total number of sentences in the document is:  604

Success! This document contains 604 sentences, but only 24 of them (about 4%) were flagged as obligations for ANZ, saving significant review time. We could now write these to a .csv file for further processing or audit.
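
As a rough sketch of that last step, the flagged sentences could be written out with Python's built-in csv module. The example below assumes the obligations have been collected as (page, sentence) pairs, which the loop above does not do as written; the two rows are copied from the printed output, and write_obligations is a hypothetical helper. It writes to an in-memory buffer so that it runs on its own.

```python
import csv
import io

def write_obligations(rows, out_file):
    """Write (page, sentence) rows to an open text file as CSV with a header."""
    writer = csv.writer(out_file)
    writer.writerow(['page', 'sentence'])
    writer.writerows(rows)

# Sample rows copied from the printed output (page index, sentence)
obligations = [
    (16, 'ANZ will advise the Merchant in writing of its decision on completing the review.'),
    (29, 'c) If ANZ debits the Merchant Account, ANZ will give the Merchant written notice that ANZ has done this.'),
]

buffer = io.StringIO()
write_obligations(obligations, buffer)
print(buffer.getvalue())
```

To write to disk instead, the same function can be passed a file opened with open('obligations.csv', 'w', newline='', encoding='utf-8'); the file name here is only an example.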
