An Expert’s Guide on How to Protect Data Using NLP
Protecting data and keeping sensitive information private is very important for companies and their customers. Several large data leaks in the past have led to trust issues towards the companies involved. Getting value from text data with machine learning requires large collections of documents, but access to them can be a privacy issue for customers in finance, law, medicine and many other fields. As a freelancer in data science and machine learning, you need access to large quantities of data to build accurate and useful models. For some of your clients, however, it might be scary or impossible (with good reason) to disclose their data raw and unprotected.
So how can you work with sensitive text data at scale yet keep the contained information as secure as possible? In this article, I'm going to show you some of my methods to work with sensitive text data and discuss what the caveats are.
Get a sample dataset
To explain the methods, we use the 20 Newsgroups dataset which is also easily available through scikit-learn. The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. So, we're facing a document classification problem here. We start by loading the data through the scikit-learn API.
from sklearn import datasets
train_dataset = datasets.fetch_20newsgroups(subset="train")
test_dataset = datasets.fetch_20newsgroups(subset="test")
train_texts = train_dataset.data
train_labels = train_dataset.target
We have the following categories for the documents:
print(train_dataset.target_names)
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
Let's have a look at an example.
So, this document is about cars, which we would probably have guessed.
The baseline setup
We first set up the machine learning pipeline that we will use throughout the article. For simplicity, we use a bag-of-words TF-IDF model with a Naive Bayes classifier, a simple, effective and popular method for text classification. The proposed methods would also work with more complex models such as neural networks.
from hashlib import sha256  # used later to hash entities and tokens
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

def run_ml(texts, labels, tokenizer=None):
    # Hold out 15% of the documents for validation
    tr_txt, vl_txt, y_tr, y_vl = train_test_split(texts, labels,
                                                  train_size=0.85, test_size=0.15,
                                                  random_state=42)
    # Bag of words -> TF-IDF weighting -> Naive Bayes classifier
    text_clf = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenizer)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultinomialNB()),
    ])
    text_clf.fit(tr_txt, y_tr)
    y_pred = text_clf.predict(vl_txt)
    print("Accuracy: {:.1%}".format(accuracy_score(y_vl, y_pred)))
To compute a performance baseline, we first run the pipeline on the unprotected raw data.
run_ml(train_texts, train_labels)
Accuracy: 85.0%
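To make the TF-IDF weighting used in this pipeline more concrete, here is a minimal pure-Python sketch. It assumes the smoothed IDF formula and L2 normalization that scikit-learn's TfidfTransformer applies by default; the toy documents and the function name tfidf_vectors are made up for illustration and are not part of the pipeline above.

```python
import math
from collections import Counter

# Toy corpus (illustrative only): each document is a list of tokens
toy_docs = [["car", "engine", "car"], ["engine", "space"], ["space", "orbit"]]

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts, like CountVectorizer
        weights = {term: tf[term] * idf[term] for term in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

vectors = tfidf_vectors(toy_docs)
print(vectors[0])
```

In the first toy document, "car" gets a higher weight than "engine" because it occurs twice there and appears in fewer documents overall.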
Work with anonymized documents
Since we want to do machine learning with the documents, we want to preserve as much information as possible. Depending on how critical your data is, you can pick from several ways to do this. I'll show you three of them, and we compare their performance on our simple machine learning model. The basic idea is that (most) machine learning models treat the token Hello the same way as the token 16ff566c558eb688e. As long as the relationships between the tokens are preserved, everything will work fine. In the end, all the model sees is numbers.
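The idea that a model only sees counts, not the tokens themselves, can be checked with a small sketch. It uses only the standard library and a whitespace tokenizer as a stand-in for the real vectorizer; the helper names bag_of_words and pseudonymize, the example sentence and the 16-character truncation of the hash are assumptions made for illustration.

```python
from collections import Counter
from hashlib import sha256

def bag_of_words(text):
    """Count lowercased, whitespace-separated tokens."""
    return Counter(text.lower().split())

def pseudonymize(text):
    """Replace every token with a shortened SHA-256 hash of its lowercased form."""
    return " ".join(
        sha256(token.encode("utf-8")).hexdigest()[:16]
        for token in text.lower().split()
    )

sample = "Hello world hello again"
plain_counts = bag_of_words(sample)
hashed_counts = bag_of_words(pseudonymize(sample))

# The token names differ, but the count profile is the same
print(sorted(plain_counts.values()))
print(sorted(hashed_counts.values()))
```

Because the mapping is consistent, every hashed token occurs exactly as often as the word it replaces, which is all a bag-of-words model relies on.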
1. Remove personally identifiable information automatically
We start out with a method that automatically removes personally identifiable information such as names and locations. Depending on your dataset, you might also remove credit card information or certain IDs with this method. A simple approach is to use a named entity tagger to find this information in the text and then replace it with a placeholder, such as the entity type. For tokenization and named entity recognition, we will use the awesome spaCy library.
import spacy
from spacy import displacy

# Load the small English pipeline, which includes a named entity recognizer
nlp = spacy.load("en_core_web_sm")
doc = nlp(train_texts[0])
Let's see what the entity tagger found.
Subject: WHAT car is this!?
I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
all I know. If anyone can tell me a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.
Thanks,
- IL
So, the tagger detected the person Lerxst and the location University of Maryland. We remove these entity types from the text now. You already see a caveat of this method, since it didn't recognize the email address. But you could filter it out with a regular expression.
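The regular-expression filter mentioned above could look like the following sketch. The pattern is a deliberately simple assumption that catches common address shapes, not a full implementation of the email address standard, and the names EMAIL_PATTERN and mask_emails as well as the example address are hypothetical.

```python
import re

# Simplified pattern: local part, "@", domain with at least one dot
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def mask_emails(text, placeholder="EMAIL"):
    """Replace everything that looks like an email address with a placeholder."""
    return EMAIL_PATTERN.sub(placeholder, text)

masked = mask_emails("From: jane.doe@example.com (Jane)")
print(masked)  # From: EMAIL (Jane)
```

Such a filter could be applied to each document before or after the entity replacement step, since it works on plain strings.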
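from tqdm import tqdm_notebook

# The function header is not shown in the original; the name replace_by_type is assumed
def replace_by_type(texts):
    new_texts = []
    # Only the entity recognizer is needed, so the tagger and parser are disabled
    for doc in tqdm_notebook(nlp.pipe(texts, disable=["tagger", "parser"]), total=len(texts)):
        new_txt = doc.text
        for ent in doc.ents:
            # Replace every occurrence of the entity text with its type, e.g. PERSON
            new_txt = new_txt.replace(ent.text, ent.label_)
        new_texts.append(new_txt)
    return new_texts

train_texts_type = replace_by_type(train_texts)
run_ml(train_texts_type, train_labels)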
Accuracy: 79.0%
We can see that this method performs only moderately below the baseline (79.0% versus 85.0% accuracy). How large the drop is depends on the dataset and the problem at hand. On the downside, the method gives no guarantee that all relevant personal information is found, so keep that in mind.
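Since the tagger can miss entities, it may be worth spot-checking the anonymized texts. The sketch below assumes you have a list of known sensitive terms (for example, customer names from a database) and counts how many documents still contain one of them. The function name count_leaks and the example data are hypothetical.

```python
def count_leaks(texts, sensitive_terms):
    """Return how many texts still contain at least one sensitive term."""
    lowered_terms = [term.lower() for term in sensitive_terms]
    return sum(
        any(term in text.lower() for term in lowered_terms)
        for text in texts
    )

# Made-up example: the second document still contains a real name
anonymized = ["PERSON works at ORG", "Contact Jane Doe at ORG"]
known_names = ["Jane Doe", "John Roe"]
leaks = count_leaks(anonymized, known_names)
print(leaks)  # 1
```

A result above zero does not tell you everything that leaked, but it is a cheap signal that the automatic removal needs additional rules.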
2. Hash personally identifiable information automatically
For some use cases, replacing all personal information by its type can cause problems. For example, specific locations can carry information relevant to your problem. So we modify the previous approach: instead of replacing the detected personal information by its type, we replace it by its hash value. The same entity always maps to the same hash, so the information stays available to the machine learning model while the original text is hidden.
def replace_by_hash(texts):
    new_texts = []
    for doc in tqdm_notebook(nlp.pipe(texts, disable=["tagger", "parser"]), total=len(texts)):
        # Map each entity text to the SHA-256 hash of its lowercased form
        ents = dict()
        for ent in doc.ents:
            ents[ent.text] = sha256(ent.text.lower().encode('utf-8')).hexdigest()
        new_txt = doc.text
        for ent in sorted(ents, key=len):
            try:
                int(ent)  # purely numeric entities are left untouched
            except ValueError:
                # Skip very short entities to avoid replacing parts of other words
                if len(ent) > 2:
                    new_txt = new_txt.replace(ent, ents[ent])
        new_texts.append(new_txt)
    return new_texts

train_texts_hash = replace_by_hash(train_texts)
run_ml(train_texts_hash, train_labels)
Accuracy: 83.3%
This method performs well by keeping more relevant information. However, it still suffers from the privacy problems of the previous approach. Also, the hashes can be cracked by brute force or count-based statistical approaches.
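To illustrate why plain hashes can be cracked, here is a minimal sketch of a dictionary attack. It assumes the attacker knows that entities were lowercased and hashed with unsalted SHA-256, as in the code above, and has a list of candidate names. The candidate list and the helper names hash_entity and crack are made up for illustration.

```python
from hashlib import sha256

def hash_entity(text):
    """Hash an entity the same way as the anonymization step."""
    return sha256(text.lower().encode("utf-8")).hexdigest()

# The attacker precomputes hashes for names they consider likely
candidate_names = ["Alice Smith", "Bob Jones", "University of Maryland"]
lookup = {hash_entity(name): name for name in candidate_names}

def crack(hashed_value):
    """Return the original name if its hash is in the lookup table, else None."""
    return lookup.get(hashed_value)

leaked_hash = hash_entity("University of Maryland")
print(crack(leaked_hash))                   # University of Maryland
print(crack(hash_entity("Someone Else")))   # None
```

The attack only works for values the attacker can guess, but names, cities and organizations are often easy to enumerate.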
3. Go fully encrypted: encrypt every word in the document
One way to mitigate the problems of the previous two methods is to go fully encrypted. To keep as much information as possible, we map every word to a unique secret value that cannot easily be inverted. A hash function works well here. This does not change the vector space of our bag of words, but it makes the text impossible for humans to understand.
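# The function header is not shown in the original; the name hash_all_tokens is assumed
def hash_all_tokens(texts):
    new_texts = []
    for doc in tqdm_notebook(nlp.pipe(texts, disable=["tagger", "parser"]), total=len(texts)):
        # Replace every token with the SHA-256 hash of its lowercased text
        enc_text = []
        for token in doc:
            enc_text.append(sha256(token.text.lower().encode('utf-8')).hexdigest())
        new_texts.append(" ".join(enc_text))
    return new_texts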
Accuracy: 79.0%
We can see that we kept most of the performance compared to the baseline. One serious downside of this method is that you can no longer interpret the dataset after applying it. This hides the content from human readers, but it also makes the machine learning workflow more tedious. Also note that the method is still vulnerable to statistical and brute-force attacks that try to recover the original words.
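The statistical attack mentioned above can be sketched with a simple frequency count. The example assumes the attacker sees only the hashed documents and knows that "the" is usually the most frequent word in English text. The toy corpus and the helper name hash_token are made up for illustration.

```python
from collections import Counter
from hashlib import sha256

def hash_token(token):
    """Hash a token the same way as the full-encryption step."""
    return sha256(token.lower().encode("utf-8")).hexdigest()

# Toy corpus and its fully hashed version
plain_docs = ["the car is red", "the engine of the car", "is the car fast"]
hashed_docs = [" ".join(hash_token(t) for t in doc.split()) for doc in plain_docs]

# The attacker counts hashed tokens without knowing the original words
hashed_counts = Counter(tok for doc in hashed_docs for tok in doc.split())
most_common_hash, frequency = hashed_counts.most_common(1)[0]

# Guess: the most frequent hash stands for the most frequent English word
guess = {most_common_hash: "the"}
print(frequency)
print(guess[most_common_hash])
```

With a larger corpus, the same reasoning extends to other frequent words and to word pairs, which is why consistent word-level hashing hides content from casual readers but not from a determined attacker.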
Wrap-up
We saw three methods for working with text datasets while keeping sensitive personal or commercial information safe. One serious drawback is that they make it harder (or impossible) to diagnose your models. But they can help you get started with your projects faster, and you can use them as a starting point to craft the method that fits your use case best.
Here is a quick overview of the covered methods:
Method | Keeps information? | Secure? | Legible by humans? | Speed | Transfer learning possible? |
---|---|---|---|---|---|
1. Replace entities by type | some loss | can miss personal information but cannot be inverted | yes | fast | yes |
2. Hash entities | no loss | can miss personal information and can be inverted | mostly | quite fast | restricted |
3. Hash every word | no loss | very secure but can be brute forced | no | slow | hardly |
I hope this article helps you in your day-to-day work as a data scientist, machine learning engineer or especially as a freelancer in these fields. Nonetheless, always keep in mind that none of these methods is completely safe against attacks!
The author of this article, Tobias Sterbak, is a data science expert and part of our freelancer community. If you are in need of highly skilled data science experts reach out to our expertmatch team.
Aug 2019 - 11 min read
Tobias Sterbak
Tobias Sterbak is a freelance machine learning consultant, providing state-of-the-art natural language processing and machine learning solutions for companies in multiple industries. He also shares his knowledge about machine learning and natural language processing on www.depends-on-the-definition.com.