A comprehensive guide to text pre-processing with Python

Part I — Theoretical Background

Mor Kapronczay
HCLTech-Starschema Blog

--

This is Part 1 of a pair of tutorials on text pre-processing in Python. In this first part, I’ll lay out the theoretical foundations. In the second part, I’ll demonstrate the steps described below in Python on texts in different languages, discussing how their effects differ due to the differing structures of those languages.

Introduction

Text mining is an important topic for business and academia alike. Business applications include chatbots, automated news tagging and many others, while in academia, political texts are often the subject of rigorous analysis. I believe these fields have a lot to learn from each other, so I decided to create this guide, based on an academic paper, about text pre-processing for business-oriented readers.

If you have ever done a machine learning project, you are probably familiar with the concepts of data cleaning and feature engineering. If you have ever done a text mining project (not necessarily one involving machine learning), you surely know that these concepts carry additional relevance there. Text pre-processing, in my view, is one of the most interesting areas where they are put to use.

From text to data

“All models are wrong, but some are useful.”

This saying, attributed to George E. P. Box, is something of a cliché in statistics. Still, during text mining, when creating data out of raw text, the practitioner gains a deep appreciation of what this concise sentence refers to: one needs to create a representation of the text that is a useful simplification.

In a sense, text mining is meaning mining. Finding the meaning in text can be extremely hard, and the methodology needed is deeply context-dependent. A text mining practitioner has to remove unnecessary information from the text (for example, words that do not contain relevant information) as well as redundancy (where two word forms refer to the same meaning). To achieve both, text pre-processing steps need to be performed first.

Consequently, text pre-processing can be thought of as a form of dimensionality reduction. In a text mining problem, dimensionality refers to the number of unique tokens in the pre-processed text. One aims to minimize this number while bearing in mind the trade-off against the information lost.

In most cases a corpus, a particular set of texts that are somehow connected to each other, is represented using a term-frequency matrix. In this matrix, every row corresponds to a document in the corpus, and every column to a unique token (word) of the text. A value X(i, j) in the matrix means that word j occurs X(i, j) times in document i. This is called the bag-of-words approach, as the text is represented as word counts, regardless of word position inside the document.
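As a minimal illustration, such a matrix can be built with scikit-learn’s CountVectorizer (the toy corpus below is made up purely for demonstration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# A toy corpus: each string is one document.
corpus = [
    "The house is on the hill.",
    "Rose planted a rose near the house.",
    "The United Nations met in the United States.",
]

# Build the term-frequency (document-term) matrix with default tokenization.
vectorizer = CountVectorizer()
tf_matrix = vectorizer.fit_transform(corpus)

# Columns are the unique tokens (the dimensionality of the representation),
# rows are documents, and each value is the count X(i, j).
print(vectorizer.get_feature_names_out())
print(tf_matrix.toarray())
```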

Word clouds are just a fancy way to show word counts in a corpus. More frequent terms are displayed in a bigger font size.
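As a quick sketch, such a plot can be generated with the third-party wordcloud package and matplotlib (assuming both are installed; the input text below is made up):

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Made-up text; in practice you would pass in the concatenated corpus.
text = "house house house rose rose united nations united states hill"

# Font size scales with how often each term occurs in the input text.
cloud = WordCloud(width=600, height=300, background_color="white").generate(text)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```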

For text mining use cases, dimensionality reduction needs to be applied differently than in other settings. The next section shows that during text pre-processing one has at least 128 ways to proceed, since treating each of the seven pre-processing steps below as a binary include-or-skip decision already yields 2^7 = 128 combinations. In practice, however, these choices are rarely truly binary.

Text pre-processing steps

The following steps are discussed from the perspective of a text miner who uses a bag-of-words representation of text; a short Python sketch after the list illustrates several of them in code. Please note that this process only applies to bag-of-words representations; other types of representations require different processes!

  1. Punctuation and special characters
    The standard approach is to remove punctuation and any non-alphanumeric characters from the text. But think about Twitter data, where hashtags (‘#’) can hold important meaning, or email data, where ‘@’ can be an important character.
  2. Numbers
    The removal of numbers is a much more controversial topic, with applications on each side of the argument. Numbers can be an integral part of addresses, scores, etc., while for tasks like document classification they tend to be less useful.
  3. Lower-casing
    Lower-casing is usually considered a routine task. For example, ‘House’ at the beginning of a sentence and ‘house’ in the middle of one refer to the same entity. However, the name ‘Rose’ and ‘rose’, the flower, refer to different entities.
  4. Stemming or lemmatizing
    Stemming and lemmatizing are the most powerful dimensionality reduction tools in text mining. Stemming reduces words to a basic form by applying rule-based heuristics such as suffix stripping. Lemmatizing, on the other hand, relies on dictionaries that explicitly encode which word forms are replaced with which canonical form of the word.
    While both are great concepts, they have their own drawbacks. Stemming in many cases reduces words with different meanings to the same form, while lemmatizers can be domain-specific, and creating a lemmatization dictionary is time-consuming.
  5. Stopword removal
    Stopword removal refers to removing words that lack meaningful information. There are general stopwords, like “the” or “are”, which are included in most text mining software packages. The choice concerning these words is quite straightforward: in most use cases they are meaningless and can be removed.
    However, the full list of words that most likely do not convey any meaning is highly domain- and use-case-specific. In a corpus of legislation texts, the word “congress” could be considered a stopword, whereas an algorithm tagging newspaper articles by topic would make great use of that specific word.
  6. N-gram inclusion
    This step refers to the inclusion of word sequences of length N as unique tokens. Widening the unit of analysis this way causes dimensionality to skyrocket, yet in some cases it is necessary: some words have a substantially different meaning in multi-word expressions. “United States” and “United Nations” both contain the word “United”, but refer to different entities. One is advised to filter the resulting n-grams to prevent an explosion of dimensionality due to their combinatorial nature.
  7. Infrequently used terms
    The distribution of word counts tends to have a long tail, which means a corpus contains many very infrequently used terms. It is common practice to remove words that appear in less than 0.5% or 1% of documents. The gains can be twofold: an enormous decrease in dimensionality and the removal of uninformative, sparse features.
    On the other hand, in some cases these rare terms are precisely the key features of the analysis, for example identifiers of rare topics in documents. Therefore, this step also has to be carried out cautiously.
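As mentioned before the list, here is a minimal Python sketch of several of these steps using NLTK. The library choice, the regular expressions, the thresholds mentioned in the comments and the example sentence are illustrative assumptions rather than the only way to proceed:

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the NLTK resources used below.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

text = "The United Nations met in the United States; 45 delegates tweeted #summit."

# 1. Remove punctuation and special characters, but keep '#' on purpose.
cleaned = re.sub(r"[^\w\s#]", " ", text)

# 2. Remove numbers (skip this step if numbers carry meaning in your domain).
cleaned = re.sub(r"\d+", " ", cleaned)

# 3. Lower-case everything.
cleaned = cleaned.lower()

# Tokenize (the default tokenizer splits '#' from the word that follows;
# nltk.tokenize.TweetTokenizer would keep hashtags intact).
tokens = nltk.word_tokenize(cleaned)

# 5. Remove general-purpose English stopwords.
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]

# 4. Stemming (rule-based suffix stripping) vs. lemmatizing (dictionary lookup).
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stems = [stemmer.stem(t) for t in tokens]
lemmas = [lemmatizer.lemmatize(t) for t in tokens]

# 6. Bigram inclusion: treat two-word sequences as additional tokens.
bigrams = ["_".join(pair) for pair in nltk.bigrams(tokens)]

# 7. Infrequent terms are typically dropped at vectorization time, e.g. with
#    scikit-learn's CountVectorizer(min_df=0.01), which keeps only terms that
#    appear in at least 1% of the documents.

print("stems:  ", stems)
print("lemmas: ", lemmas)
print("bigrams:", bigrams)
```

A natural follow-up is to feed the processed tokens into a CountVectorizer (as in the earlier example) to obtain the reduced term-frequency matrix.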

Takeaways

As you can see, deciding what information to keep and what to drop is the name of the game in text pre-processing. This is because the large number of unique words in natural language creates a desperate need for dimensionality reduction. Never take a rule of thumb for granted: your analysis may require substantially different considerations, as every text mining problem is unique in some way.

In Part 2, these steps are performed in Python on comparable corpora in four different languages, showing how the effects of these pre-processing steps differ across languages.

References:

Denny, M., & Spirling, A. (2018). Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It. Political Analysis, 26(2), 168–189. doi:10.1017/pan.2017.44
