Text Preprocessing in NLP: A Practical Guide to Cleaning Your Data
Learn essential text preprocessing techniques like tokenization, stopword removal, stemming, lemmatization, and n-grams analysis to clean raw text for better NLP model performance.

A hands-on journey through essential text preprocessing techniques using Python and NLTK
TL;DR
Text preprocessing is the foundation of any successful NLP project. Raw text data is messy—it contains inconsistent casing, stopwords, punctuation, and various forms of the same word. In this guide, we walk through a complete preprocessing pipeline using real hotel review data, covering:
- Lowercasing – Standardize text case
- Stopword Removal – Remove common words that add little meaning
- Punctuation Handling – Clean special characters while preserving meaning
- Tokenization – Split text into individual words
- Stemming – Reduce words to their root form
- Lemmatization – Reduce words to their dictionary form
- N-grams – Extract word combinations for context
Clean data leads to better models. Period.
Introduction
If you've ever tried to build a machine learning model with text data, you've probably realized one harsh truth: garbage in, garbage out. Raw text is incredibly messy. It contains typos, inconsistent formatting, irrelevant words, and countless variations of the same concept.
Text preprocessing is the art (and science) of transforming raw, unstructured text into a clean, structured format that machine learning algorithms can understand. Without proper preprocessing, your models will struggle to find patterns, leading to poor performance and unreliable predictions.
In this blog post, I'll walk you through a practical text preprocessing pipeline I built while working with TripAdvisor hotel reviews. By the end, you'll understand not just how to preprocess text, but why each step matters.
Why Does Text Preprocessing Matter?
- Reduces Noise: Raw text contains a lot of noise—stopwords, punctuation, and formatting inconsistencies that don't contribute to meaning.
- Reduces Dimensionality: By normalizing words (through stemming/lemmatization), you reduce the vocabulary size, making your models more efficient.
- Improves Model Performance: Clean data helps models focus on what matters, leading to better accuracy and faster training.
- Ensures Consistency: Standardizing text ensures that "Hotel", "HOTEL", and "hotel" are treated as the same word.
Setting Up the Environment
First, let's import the necessary libraries. We'll be using NLTK (Natural Language Toolkit), which is a powerful library for working with human language data.
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
import re
import pandas as pd
import matplotlib.pyplot as plt

# Download the NLTK resources used in this guide (only needed once)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

Loading the Data
For this project, we're working with TripAdvisor hotel reviews—a perfect dataset for practicing NLP preprocessing because reviews are naturally messy and conversational.
data = pd.read_csv("tripadvisor_hotel_reviews.csv")

Let's examine our dataset:
data.info()

Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 109 entries, 0 to 108
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Review  109 non-null    object
 1   Rating  109 non-null    int64
dtypes: int64(1), object(1)
memory usage: 1.8+ KB

We have 109 hotel reviews with their corresponding ratings. Simple, but perfect for learning!
Step 1: Lowercasing
The first and simplest preprocessing step is converting all text to lowercase. This ensures that "Hotel", "HOTEL", and "hotel" are treated as the same word.
data['review_lowercase'] = data['Review'].str.lower()

Why Lowercasing Matters
Without lowercasing, your model would treat "Great" and "great" as completely different words. This increases vocabulary size unnecessarily and can confuse your model.
Before: "Nice Hotel Expensive Parking"
After: "nice hotel expensive parking"
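The whole step is one vectorized call; a minimal sketch with made-up review strings:

```python
import pandas as pd

# .str.lower() maps every character to lowercase, so "Hotel", "HOTEL",
# and "hotel" all collapse into the same token later in the pipeline.
reviews = pd.Series(["Nice Hotel Expensive Parking", "GREAT Stay, Would Return"])
lowered = reviews.str.lower()
print(lowered.tolist())
# ['nice hotel expensive parking', 'great stay, would return']
```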
Step 2: Stopword Removal
Stopwords are common words like "the", "is", "at", "which", and "on" that appear frequently but carry little meaningful information. Removing them helps focus on the words that actually matter.
en_stopwords = stopwords.words('english')
en_stopwords.remove("not")  # Keep "not" - it's important for sentiment!

⚠️ Pro Tip: Notice that we removed "not" from our stopwords list. In sentiment analysis, "not good" has a completely different meaning than "good". Context matters!
data['review_no_stopwords'] = data['review_lowercase'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in en_stopwords])
)

Example Output
Let's see how the first review looks after stopword removal:
Before:
nice hotel expensive parking got good deal stay hotel anniversary, arrived late
evening took advice previous reviews did valet parking, check quick easy, little
disappointed non-existent view room room clean nice size...

After:
nice hotel expensive parking got good deal stay hotel anniversary, arrived late
evening took advice previous reviews valet parking, check quick easy, little
disappointed non-existent view room room clean nice size, bed comfortable woke
stiff neck high pillows, not soundproof like heard music room night morning loud
bangs doors opening closing hear people talking hallway, maybe noisy neighbors,
aveda bath products nice, not goldfish stay nice touch taken advantage staying
longer, location great walking distance shopping, overall nice experience pay 40
parking night,

Notice how words like "did" and "the" are removed, but "not" is preserved because we explicitly kept it for sentiment analysis purposes.
Step 3: Handling Punctuation
Punctuation removal requires careful thought. We don't want to blindly remove everything—some symbols might carry meaning.
Preserving Meaningful Symbols
In our hotel reviews, we noticed that some reviewers used * to represent "star" (as in "5* hotel"). Let's preserve this meaning:
data['review_no_stopwords_no_punct'] = data['review_no_stopwords'].apply(
    lambda x: re.sub(r"[*]", "star", x)
)

Now we can safely remove the remaining punctuation:
data['review_no_stopwords_no_punct'] = data['review_no_stopwords_no_punct'].apply(
    lambda x: re.sub(r"[^\w\s]", "", x)
)

Understanding the Regex
- [*] - Matches the asterisk character
- [^\w\s] - Matches anything that is NOT a word character (\w) or whitespace (\s)
💡 Key Insight: Always analyze your data before preprocessing. Understanding the quirks in your specific dataset helps you make better decisions about what to keep and what to remove.
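To see the two substitutions in isolation, here's a quick sketch on a made-up review fragment:

```python
import re

# Order matters: replace '*' with "star" first, then strip what's left
text = "loved this 5* hotel, great location!"
text = re.sub(r"[*]", "star", text)   # '*' -> 'star' (preserves the rating meaning)
text = re.sub(r"[^\w\s]", "", text)   # drop anything that's not a word char or whitespace
print(text)  # loved this 5star hotel great location
```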
Step 4: Tokenization
Tokenization is the process of splitting text into individual units (tokens), typically words. This is a fundamental step that prepares the text for further analysis.
data['tokenized'] = data['review_no_stopwords_no_punct'].apply(word_tokenize)

Example Output
data['tokenized'][0]

Output:
['nice', 'hotel', 'expensive', 'parking', 'got', 'good', 'deal', 'stay',
'hotel', 'anniversary', 'arrived', 'late', 'evening', 'took', 'advice',
'previous', 'reviews', 'valet', 'parking', 'check', 'quick', 'easy',
'little', 'disappointed', 'nonexistent', 'view', 'room', 'room', 'clean',
'nice', 'size', 'bed', 'comfortable', 'woke', 'stiff', 'neck', 'high',
'pillows', 'not', 'soundproof', 'like', 'heard', 'music', 'room', 'night',
'morning', 'loud', 'bangs', 'doors', 'opening', 'closing', 'hear', 'people',
'talking', 'hallway', 'maybe', 'noisy', 'neighbors', 'aveda', 'bath',
'products', 'nice', 'not', 'goldfish', 'stay', 'nice', 'touch', 'taken',
'advantage', 'staying', 'longer', 'location', 'great', 'walking', 'distance',
'shopping', 'overall', 'nice', 'experience', 'pay', '40', 'parking', 'night']

Now each review is represented as a list of 83 individual words, ready for further processing.
Step 5: Stemming
Stemming reduces words to their root form by chopping off word endings. It's a crude but fast approach.
ps = PorterStemmer()
data["stemmed"] = data["tokenized"].apply(
    lambda tokens: [ps.stem(token) for token in tokens]
)

Example Output
data['stemmed'][0]

Output:
['nice', 'hotel', 'expens', 'park', 'got', 'good', 'deal', 'stay',
'hotel', 'anniversari', 'arriv', 'late', 'even', 'took', 'advic',
'previou', 'review', 'valet', 'park', 'check', 'quick', 'easi',
'littl', 'disappoint', 'nonexist', 'view', 'room', 'room', 'clean',
'nice', 'size', 'bed', 'comfort', 'woke', 'stiff', 'neck', 'high',
'pillow', 'not', 'soundproof', 'like', 'heard', 'music', 'room', 'night',
'morn', 'loud', 'bang', 'door', 'open', 'close', 'hear', 'peopl',
'talk', 'hallway', 'mayb', 'noisi', 'neighbor', 'aveda', 'bath',
'product', 'nice', 'not', 'goldfish', 'stay', 'nice', 'touch', 'taken',
'advantag', 'stay', 'longer', 'locat', 'great', 'walk', 'distanc',
'shop', 'overal', 'nice', 'experi', 'pay', '40', 'park', 'night']

Stemming in Action
| Original | Stemmed |
| ----------- | ----------- |
| expensive | expens |
| parking | park |
| anniversary | anniversari |
| arrived | arriv |
| evening | even |
Notice how stemming can be aggressive—"evening" becomes "even", which might not always be desirable. This is where lemmatization comes in.
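That aggressiveness is easy to verify on individual words with the same PorterStemmer:

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
# Porter stemming strips suffixes by rule, so stems need not be real words
print(ps.stem("evening"))    # even  ("ing" stripped, losing the time-of-day meaning)
print(ps.stem("expensive"))  # expens
print(ps.stem("parking"))    # park
```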
Step 6: Lemmatization
Lemmatization is more sophisticated than stemming. It uses vocabulary and morphological analysis to return the proper dictionary form (lemma) of a word.
lemmatizer = WordNetLemmatizer()
data["lemmatized"] = data["tokenized"].apply(
    lambda tokens: [lemmatizer.lemmatize(token) for token in tokens]
)

Example Output
data['lemmatized'][0]

Output:
['nice', 'hotel', 'expensive', 'parking', 'got', 'good', 'deal', 'stay',
'hotel', 'anniversary', 'arrived', 'late', 'evening', 'took', 'advice',
'previous', 'review', 'valet', 'parking', 'check', 'quick', 'easy',
'little', 'disappointed', 'nonexistent', 'view', 'room', 'room', 'clean',
'nice', 'size', 'bed', 'comfortable', 'woke', 'stiff', 'neck', 'high',
'pillow', 'not', 'soundproof', 'like', 'heard', 'music', 'room', 'night',
'morning', 'loud', 'bang', 'door', 'opening', 'closing', 'hear', 'people',
'talking', 'hallway', 'maybe', 'noisy', 'neighbor', 'aveda', 'bath',
'product', 'nice', 'not', 'goldfish', 'stay', 'nice', 'touch', 'taken',
'advantage', 'staying', 'longer', 'location', 'great', 'walking', 'distance',
'shopping', 'overall', 'nice', 'experience', 'pay', '40', 'parking', 'night']

Notice how lemmatization preserves more readable words compared to stemming (e.g., "morning" stays as "morning" instead of becoming "morn").
Stemming vs. Lemmatization
| Original | Stemmed | Lemmatized |
| -------- | -----------| ---------- |
| running | run | running |
| better | better | better |
| studies | studi | study |
| feet | feet | foot |
When to use which?
- Stemming: Faster, good for search engines and information retrieval
- Lemmatization: More accurate, better for text analysis and NLP tasks where meaning matters
Step 7: N-grams Analysis
N-grams are contiguous sequences of n items from a given text. They help capture context and word relationships.
Unigrams (n=1)
Single words and their frequencies:
tokens_clean = sum(data['lemmatized'], [])
unigrams = pd.Series(nltk.ngrams(tokens_clean, 1)).value_counts()
print(unigrams)

Output:
(hotel,) 292
(room,) 275
(great,) 126
(not,) 122
(stay,) 95
...

Bigrams (n=2)
Two-word combinations reveal common phrases:
bigrams = pd.Series(nltk.ngrams(tokens_clean, 2)).value_counts()
print(bigrams)

Output:
(great, location) 24
(space, needle) 21
(hotel, monaco) 16
(great, hotel) 12
(staff, friendly) 12
...

4-grams (n=4)
Longer sequences capture more specific patterns:
ngrams_4 = pd.Series(nltk.ngrams(tokens_clean, 4)).value_counts()
print(ngrams_4)

Output:
(high, floor, great, view) 2
(definitely, stay, crowne, plaza) 2
(needle, experience, music, project) 2
(nice, hotel, husband, stayed) 2
(really, comfortable, clean, location) 2Why N-grams Matter
N-grams help capture:
- Common phrases: "great location", "staff friendly"
- Named entities: "space needle", "hotel monaco"
- Sentiment patterns: Understanding what words commonly appear together
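Under the hood, nltk.ngrams simply slides a window of length n over the token list and yields tuples of adjacent tokens; a minimal sketch with made-up tokens:

```python
import nltk

# A list of 5 tokens yields 5 - 2 + 1 = 4 bigrams
tokens = ["great", "location", "near", "space", "needle"]
print(list(nltk.ngrams(tokens, 2)))
# [('great', 'location'), ('location', 'near'), ('near', 'space'), ('space', 'needle')]
```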
The Complete Preprocessing Pipeline
Here's a summary of our complete pipeline:
# 1. Load data
data = pd.read_csv("tripadvisor_hotel_reviews.csv")
# 2. Lowercase
data['review_lowercase'] = data['Review'].str.lower()
# 3. Remove stopwords (keeping "not")
en_stopwords = stopwords.words('english')
en_stopwords.remove("not")
data['review_no_stopwords'] = data['review_lowercase'].apply(
lambda x: ' '.join([word for word in x.split()
if word not in en_stopwords])
)
# 4. Handle punctuation
data['review_clean'] = data['review_no_stopwords'].apply(
lambda x: re.sub(r"[*]", "star", x)
)
data['review_clean'] = data['review_clean'].apply(
lambda x: re.sub(r"([^\w\s])", "", x)
)
# 5. Tokenize
data['tokenized'] = data['review_clean'].apply(word_tokenize)
# 6. Lemmatize
lemmatizer = WordNetLemmatizer()
data['lemmatized'] = data['tokenized'].apply(
lambda tokens: [lemmatizer.lemmatize(token) for token in tokens]
)

Key Takeaways
- Order matters: The sequence of preprocessing steps affects your results. Lowercase before stopword removal, tokenize before stemming/lemmatization.
- Context is king: Don't blindly apply preprocessing. Understand your data and domain—keeping "not" for sentiment analysis was crucial.
- Stemming vs. Lemmatization: Choose based on your use case. Speed vs. accuracy.
- N-grams reveal patterns: Single words tell you frequency; n-grams tell you context.
- Document your pipeline: Preprocessing decisions should be reproducible and explainable.
What's Next?
With clean, preprocessed text, you're ready to:
- Build classification models (sentiment analysis, topic classification)
- Create word embeddings (Word2Vec, GloVe)
- Perform topic modeling (LDA, NMF)
- Build search engines (TF-IDF, BM25)
Remember: The quality of your preprocessing directly impacts the quality of your models. Take the time to understand your data and make informed decisions at each step.