Text Preprocessing in NLP: A Practical Guide to Cleaning Your Data
Learn essential text preprocessing techniques like tokenization, stopword removal, stemming, lemmatization, and n-grams analysis to clean raw text for better NLP model performance.

A hands-on journey through essential text preprocessing techniques using Python and NLTK
TL;DR
Text preprocessing is the foundation of any successful NLP project. Raw text data is messy—it contains inconsistent casing, stopwords, punctuation, and various forms of the same word. In this guide, we walk through a complete preprocessing pipeline using real hotel review data, covering:
- Lowercasing – Standardize text case
- Stopword Removal – Remove common words that add little meaning
- Punctuation Handling – Clean special characters while preserving meaning
- Tokenization – Split text into individual words
- Stemming – Reduce words to their root form
- Lemmatization – Reduce words to their dictionary form
- N-grams – Extract word combinations for context
Clean data leads to better models. Period.
Introduction
If you've ever tried to build a machine learning model with text data, you've probably realized one harsh truth: garbage in, garbage out. Raw text is incredibly messy. It contains typos, inconsistent formatting, irrelevant words, and countless variations of the same concept.
Text preprocessing is the art (and science) of transforming raw, unstructured text into a clean, structured format that machine learning algorithms can understand. Without proper preprocessing, your models will struggle to find patterns, leading to poor performance and unreliable predictions.
In this blog post, I'll walk you through a practical text preprocessing pipeline I built while working with TripAdvisor hotel reviews. By the end, you'll understand not just how to preprocess text, but why each step matters.
Why Does Text Preprocessing Matter?
- Reduces Noise: Raw text contains a lot of noise—stopwords, punctuation, and formatting inconsistencies that don't contribute to meaning.
- Reduces Dimensionality: By normalizing words (through stemming/lemmatization), you reduce the vocabulary size, making your models more efficient.
- Improves Model Performance: Clean data helps models focus on what matters, leading to better accuracy and faster training.
- Ensures Consistency: Standardizing text ensures that "Hotel", "HOTEL", and "hotel" are treated as the same word.
Setting Up the Environment
First, let's import the necessary libraries. We'll be using NLTK (Natural Language Toolkit), which is a powerful library for working with human language data.
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
import re
import pandas as pd
import matplotlib.pyplot as plt

# Download the NLTK resources used in this guide (only needed once)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

Loading the Data
For this project, we're working with TripAdvisor hotel reviews—a perfect dataset for practicing NLP preprocessing because reviews are naturally messy and conversational.
data = pd.read_csv("tripadvisor_hotel_reviews.csv")

Let's examine our dataset:
data.info()

Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 109 entries, 0 to 108
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Review  109 non-null    object
 1   Rating  109 non-null    int64
dtypes: int64(1), object(1)
memory usage: 1.8+ KB

We have 109 hotel reviews with their corresponding ratings. Simple, but perfect for learning!
Step 1: Lowercasing
The first and simplest preprocessing step is converting all text to lowercase. This ensures that "Hotel", "HOTEL", and "hotel" are treated as the same word.
data['review_lowercase'] = data['Review'].str.lower()

Why Lowercasing Matters
Without lowercasing, your model would treat "Great" and "great" as completely different words. This increases vocabulary size unnecessarily and can confuse your model.
Before: "Nice Hotel Expensive Parking"
After: "nice hotel expensive parking"
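The whole step is one vectorized call; a minimal sketch with made-up review strings:

```python
import pandas as pd

# .str.lower() maps every character to lowercase, so "Hotel", "HOTEL",
# and "hotel" all collapse into the same token later in the pipeline.
reviews = pd.Series(["Nice Hotel Expensive Parking", "GREAT Stay, Would Return"])
lowered = reviews.str.lower()
print(lowered.tolist())
# ['nice hotel expensive parking', 'great stay, would return']
```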
Step 2: Stopword Removal
Stopwords are common words like "the", "is", "at", "which", and "on" that appear frequently but carry little meaningful information. Removing them helps focus on the words that actually matter.
en_stopwords = stopwords.words('english')
en_stopwords.remove("not")  # Keep "not" - it's important for sentiment!

⚠️ Pro Tip: Notice that we removed "not" from our stopwords list. In sentiment analysis, "not good" has a completely different meaning than "good". Context matters!
data['review_no_stopwords'] = data['review_lowercase'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in en_stopwords])
)

Example Output
Let's see how the first review looks after stopword removal:
Before:
nice hotel expensive parking got good deal stay hotel anniversary, arrived late
evening took advice previous reviews did valet parking, check quick easy, little
disappointed non-existent view room room clean nice size...

After:
nice hotel expensive parking got good deal stay hotel anniversary, arrived late
evening took advice previous reviews valet parking, check quick easy, little
disappointed non-existent view room room clean nice size, bed comfortable woke
stiff neck high pillows, not soundproof like heard music room night morning loud
bangs doors opening closing hear people talking hallway, maybe noisy neighbors,
aveda bath products nice, not goldfish stay nice touch taken advantage staying
longer, location great walking distance shopping, overall nice experience pay 40
parking night,

Notice how words like "did" and "the" are removed, but "not" is preserved because we explicitly kept it for sentiment analysis purposes.
Step 3: Handling Punctuation
Punctuation removal requires careful thought. We don't want to blindly remove everything—some symbols might carry meaning.
Preserving Meaningful Symbols
In our hotel reviews, we noticed that some reviewers used * to represent "star" (as in "5* hotel"). Let's preserve this meaning:
data['review_no_stopwords_no_punct'] = data['review_no_stopwords'].apply(
    lambda x: re.sub(r"[*]", "star", x)
)

Now we can safely remove the remaining punctuation:
data['review_no_stopwords_no_punct'] = data['review_no_stopwords_no_punct'].apply(
    lambda x: re.sub(r"[^\w\s]", "", x)
)

Understanding the Regex
- [*] - Matches the asterisk character
- [^\w\s] - Matches anything that is NOT a word character (\w) or whitespace (\s)
💡 Key Insight: Always analyze your data before preprocessing. Understanding the quirks in your specific dataset helps you make better decisions about what to keep and what to remove.
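To see the two substitutions in isolation, here's a quick sketch on a made-up review fragment:

```python
import re

# Order matters: replace '*' with "star" first, then strip what's left
text = "loved this 5* hotel, great location!"
text = re.sub(r"[*]", "star", text)   # '*' -> 'star' (preserves the rating meaning)
text = re.sub(r"[^\w\s]", "", text)   # drop anything that's not a word char or whitespace
print(text)  # loved this 5star hotel great location
```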
Step 4: Tokenization
Tokenization is the process of splitting text into individual units (tokens), typically words. This is a fundamental step that prepares the text for further analysis.
data['tokenized'] = data['review_no_stopwords_no_punct'].apply(word_tokenize)

Example Output
data['tokenized'][0]

Output:
['nice', 'hotel', 'expensive', 'parking', 'got', 'good', 'deal', 'stay',
'hotel', 'anniversary', 'arrived', 'late', 'evening', 'took', 'advice',
'previous', 'reviews', 'valet', 'parking', 'check', 'quick', 'easy',
'little', 'disappointed', 'nonexistent', 'view', 'room', 'room', 'clean',
'nice', 'size', 'bed', 'comfortable', 'woke', 'stiff', 'neck', 'high',
'pillows', 'not', 'soundproof', 'like', 'heard', 'music', 'room', 'night',
'morning', 'loud', 'bangs', 'doors', 'opening', 'closing', 'hear', 'people',
'talking', 'hallway', 'maybe', 'noisy', 'neighbors', 'aveda', 'bath',
'products', 'nice', 'not', 'goldfish', 'stay', 'nice', 'touch', 'taken',
'advantage', 'staying', 'longer', 'location', 'great', 'walking', 'distance',
'shopping', 'overall', 'nice', 'experience', 'pay', '40', 'parking', 'night']

Now each review is represented as a list of 83 individual words, ready for further processing.
Step 5: Stemming
Stemming reduces words to their root form by chopping off word endings. It's a crude but fast approach.
ps = PorterStemmer()
data["stemmed"] = data["tokenized"].apply(
    lambda tokens: [ps.stem(token) for token in tokens]
)

Example Output
data['stemmed'][0]

Output:
['nice', 'hotel', 'expens', 'park', 'got', 'good', 'deal', 'stay',
'hotel', 'anniversari', 'arriv', 'late', 'even', 'took', 'advic',
'previou', 'review', 'valet', 'park', 'check', 'quick', 'easi',
'littl', 'disappoint', 'nonexist', 'view', 'room', 'room', 'clean',
'nice', 'size', 'bed', 'comfort', 'woke', 'stiff', 'neck', 'high',
'pillow', 'not', 'soundproof', 'like', 'heard', 'music', 'room', 'night',
'morn', 'loud', 'bang', 'door', 'open', 'close', 'hear', 'peopl',
'talk', 'hallway', 'mayb', 'noisi', 'neighbor', 'aveda', 'bath',
'product', 'nice', 'not', 'goldfish', 'stay', 'nice', 'touch', 'taken',
'advantag', 'stay', 'longer', 'locat', 'great', 'walk', 'distanc',
'shop', 'overal', 'nice', 'experi', 'pay', '40', 'park', 'night']

Stemming in Action
| Original | Stemmed |
| ----------- | ----------- |
| expensive | expens |
| parking | park |
| anniversary | anniversari |
| arrived | arriv |
| evening | even |
Notice how stemming can be aggressive—"evening" becomes "even", which might not always be desirable. This is where lemmatization comes in.
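That aggressiveness is easy to verify on individual words with the same PorterStemmer:

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
# Porter stemming strips suffixes by rule, so stems need not be real words
print(ps.stem("evening"))    # even  ("ing" stripped, losing the time-of-day meaning)
print(ps.stem("expensive"))  # expens
print(ps.stem("parking"))    # park
```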
Step 6: Lemmatization
Lemmatization is more sophisticated than stemming. It uses vocabulary and morphological analysis to return the proper dictionary form (lemma) of a word.
lemmatizer = WordNetLemmatizer()
data["lemmatized"] = data["tokenized"].apply(
    lambda tokens: [lemmatizer.lemmatize(token) for token in tokens]
)

Example Output
data['lemmatized'][0]

Output:
['nice', 'hotel', 'expensive', 'parking', 'got', 'good', 'deal', 'stay',
'hotel', 'anniversary', 'arrived', 'late', 'evening', 'took', 'advice',
'previous', 'review', 'valet', 'parking', 'check', 'quick', 'easy',
'little', 'disappointed', 'nonexistent', 'view', 'room', 'room', 'clean',
'nice', 'size', 'bed', 'comfortable', 'woke', 'stiff', 'neck', 'high',
'pillow', 'not', 'soundproof', 'like', 'heard', 'music', 'room', 'night',
'morning', 'loud', 'bang', 'door', 'opening', 'closing', 'hear', 'people',
'talking', 'hallway', 'maybe', 'noisy', 'neighbor', 'aveda', 'bath',
'product', 'nice', 'not', 'goldfish', 'stay', 'nice', 'touch', 'taken',
'advantage', 'staying', 'longer', 'location', 'great', 'walking', 'distance',
'shopping', 'overall', 'nice', 'experience', 'pay', '40', 'parking', 'night']

Notice how lemmatization preserves more readable words compared to stemming (e.g., "morning" stays as "morning" instead of becoming "morn").
Stemming vs. Lemmatization
| Original | Stemmed | Lemmatized |
| -------- | -----------| ---------- |
| running | run | running |
| better | better | better |
| studies | studi | study |
| feet | feet | foot |
When to use which?
- Stemming: Faster, good for search engines and information retrieval
- Lemmatization: More accurate, better for text analysis and NLP tasks where meaning matters
Step 7: N-grams Analysis
N-grams are contiguous sequences of n items from a given text. They help capture context and word relationships.
Unigrams (n=1)
Single words and their frequencies:
tokens_clean = sum(data['lemmatized'], [])
unigrams = pd.Series(nltk.ngrams(tokens_clean, 1)).value_counts()
print(unigrams)

Output:
(hotel,) 292
(room,) 275
(great,) 126
(not,) 122
(stay,) 95
...

Bigrams (n=2)
Two-word combinations reveal common phrases:
bigrams = pd.Series(nltk.ngrams(tokens_clean, 2)).value_counts()
print(bigrams)

Output:
(great, location) 24
(space, needle) 21
(hotel, monaco) 16
(great, hotel) 12
(staff, friendly) 12
...

4-grams (n=4)
Longer sequences capture more specific patterns:
ngrams_4 = pd.Series(nltk.ngrams(tokens_clean, 4)).value_counts()
print(ngrams_4)

Output:
(high, floor, great, view) 2
(definitely, stay, crowne, plaza) 2
(needle, experience, music, project) 2
(nice, hotel, husband, stayed) 2
(really, comfortable, clean, location) 2Why N-grams Matter
N-grams help capture:
- Common phrases: "great location", "staff friendly"
- Named entities: "space needle", "hotel monaco"
- Sentiment patterns: Understanding what words commonly appear together
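Under the hood, nltk.ngrams simply slides a window of length n over the token list and yields tuples of adjacent tokens; a minimal sketch with made-up tokens:

```python
import nltk

# A list of 5 tokens yields 5 - 2 + 1 = 4 bigrams
tokens = ["great", "location", "near", "space", "needle"]
print(list(nltk.ngrams(tokens, 2)))
# [('great', 'location'), ('location', 'near'), ('near', 'space'), ('space', 'needle')]
```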
The Complete Preprocessing Pipeline
Here's a summary of our complete pipeline:
# 1. Load data
data = pd.read_csv("tripadvisor_hotel_reviews.csv")
# 2. Lowercase
data['review_lowercase'] = data['Review'].str.lower()
# 3. Remove stopwords (keeping "not")
en_stopwords = stopwords.words('english')
en_stopwords.remove("not")
data['review_no_stopwords'] = data['review_lowercase'].apply(
lambda x: ' '.join([word for word in x.split()
if word not in en_stopwords])
)
# 4. Handle punctuation
data['review_clean'] = data['review_no_stopwords'].apply(
lambda x: re.sub(r"[*]", "star", x)
)
data['review_clean'] = data['review_clean'].apply(
lambda x: re.sub(r"([^\w\s])", "", x)
)
# 5. Tokenize
data['tokenized'] = data['review_clean'].apply(word_tokenize)
# 6. Lemmatize
lemmatizer = WordNetLemmatizer()
data['lemmatized'] = data['tokenized'].apply(
lambda tokens: [lemmatizer.lemmatize(token) for token in tokens]
)

Key Takeaways
- Order matters: The sequence of preprocessing steps affects your results. Lowercase before stopword removal, tokenize before stemming/lemmatization.
- Context is king: Don't blindly apply preprocessing. Understand your data and domain—keeping "not" for sentiment analysis was crucial.
- Stemming vs. Lemmatization: Choose based on your use case. Speed vs. accuracy.
- N-grams reveal patterns: Single words tell you frequency; n-grams tell you context.
- Document your pipeline: Preprocessing decisions should be reproducible and explainable.
What's Next?
With clean, preprocessed text, you're ready to:
- Build classification models (sentiment analysis, topic classification)
- Create word embeddings (Word2Vec, GloVe)
- Perform topic modeling (LDA, NMF)
- Build search engines (TF-IDF, BM25)
Remember: The quality of your preprocessing directly impacts the quality of your models. Take the time to understand your data and make informed decisions at each step.