Have you ever wondered how search engines know that “running,” “runs,” and “ran” all come from the root word “run”?
Or how chatbots can take many different word forms and still use them to respond meaningfully?
The secret lies in stemming, one of the most basic techniques in Natural Language Processing (NLP). Stemming identifies the base form of a word by removing prefixes and suffixes to get at the root meaning.
Stemming allows machines to analyze text more easily, ultimately enhancing search result precision, sentiment analysis, and even spam detection.
But how does this work, and why should we care about NLP? Let’s find out.
What is Stemming?
Stemming is a natural language processing technique that reduces words to their root or base form (also known as the “stem”).
The purpose of stemming is to simplify text by consolidating words with similar meanings, enabling better analysis in applications such as search engines, text mining, and information retrieval.
For example, the words “running,” “runner,” and “ran” share the same root meaning related to the action of moving quickly.
By converting these variants to their root form, “run,” we streamline data processing and boost the precision of analysis.
Step-by-Step Process of Stemming
Step 1: Identify the Word
Begin with a word that may include prefixes, root forms, and suffixes. For instance:
Input Word: “believable”
Step 2: Analyze the Word Structure
Examine the word’s components to identify its root and any affixes. For “believable”:
- Root: “believe”
- Suffix: “-able”
Step 3: Remove Affixes
Next, apply rules to strip the recognized affixes and reach the word’s stem. Most stemming algorithms (including Porter) strip suffixes only, so removing “-able” reduces “believable” to the stem “believ” (not a dictionary word, but a consistent form for matching).
Step 4: Apply Stemming Algorithm
This step involves using a specific algorithm designed to remove affixes systematically. Some commonly used stemming algorithms include:
Porter Stemmer: A widely-used stemming algorithm that applies a set of rules to remove common suffixes. For instance, it would stem:
- “running” → “run”
- “happiness” → “happi” (in this case, it strips more aggressively)
Snowball Stemmer: An improvement over the Porter Stemmer that refines several rules and supports multiple languages. It might yield:
- “fairly” → “fair” (where Porter gives “fairli”)
- “running” → “run”
Step 5: Return the Reduced Form
Once the algorithm processes the word, it returns the simplified or stemmed version suitable for analysis. Using the Porter Stemmer as an example:
- Output for “running”: “run”
- Output for “fishing”: “fish”
These outputs can vary depending on the algorithm’s design and rules.
Step 6: Handle Irregular Forms
Some words do not obey standard rules, and stemming algorithms sometimes produce “stems” that are not actual words; these are still useful for matching. For example:
Input Word: “better”
Stemmed Form (using Porter): “better” is returned unchanged, since it has no recognizable affixes for the rules to strip.
Step 7: Final Output and Usage
The final output is a list or set of unique stems representing your original words. This list serves analytic purposes such as:
- Reducing the number of unique tokens, allowing a model to generalize better.
- Merging similar meanings and grammatical variations of words, which improves search functionality.
Example of Stemming:
Consider the input words: [“connection”, “connects”, “connected”, “connecting”, “connections”]
Stemming Process:
- “connection” → “connect”
- “connects” → “connect”
- “connected” → “connect”
- “connecting” → “connect”
- “connections” → “connect”
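The steps above can be sketched as a toy rule-based stemmer. The suffix list and minimum-stem-length check below are illustrative assumptions, not the actual Porter rules:

```python
# Toy suffix-stripping stemmer: a simplified sketch of the stemming process,
# not a faithful Porter implementation.
SUFFIXES = ["ions", "ing", "ion", "ed", "s"]

def toy_stem(word: str) -> str:
    # Try the longest suffix first so "ions" wins over "s"
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        # Only strip if a reasonably long stem remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

words = ["connection", "connects", "connected", "connecting", "connections"]
print([toy_stem(w) for w in words])
# ['connect', 'connect', 'connect', 'connect', 'connect']
```

Real algorithms such as Porter add many refinements (measure conditions, consonant handling, multi-step passes), but the core idea of ordered suffix rules is the same.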
Types of Stemming Algorithms
1. Porter Stemmer
Description
Developed by Martin Porter in 1980, this is one of the most popular stemming algorithms. It uses a set of rules to iteratively strip suffixes from words to produce stems.
How it Works
The algorithm processes words in multiple steps, where each step applies specific rules to remove common suffixes such as “-ing,” “-ed,” and “-es.”
Example: “running” → “run”, “happiness” → “happi”
2. Lovins Stemmer
Description
Created by Julie Beth Lovins in 1968, this was one of the first stemming algorithms used but is less widely adopted today.
How it Works
It removes suffixes based on a large set of predefined rules, identifying and stripping the longest matching ending in a single pass.
Example: “fishing” → “fish”, “runner” → “run”
3. Paice & Husk Stemmer
Description
Introduced by Paice and Husk in 1990, this is a more elaborate stemming method driven by a comprehensive, iterative rule set.
How it Works
Unlike other more basic stemming algorithms, it not only strips suffixes but also addresses special cases based on pre-defined conditions and affix changes.
Example: “happily” → “happy”
4. Dawson Stemmer
Description
This algorithm extends the Lovins Stemmer, covering a larger set of suffixes while keeping its single-pass approach.
How it Works
The Dawson Stemmer applies a series of rules for affix removal but is designed to reduce errors associated with truncating words too aggressively.
Example: “administered” → “administer”
5. Snowball Stemmer
Description
Also known as the “Porter2” stemmer, developed by Martin Porter as an improvement over the original Porter Stemmer. It supports multiple languages.
How it Works
It applies a more elaborate set of rules and works effectively across different languages, producing more intuitive results than its predecessor.
Example: “running” → “run”, “better” → “better”
6. Lancaster Stemmer
Description
A more aggressive stemming algorithm developed by Chris Paice. It uses a simple set of rules for suffix stripping but tends to be harsher than the Porter Stemmer.
How it Works
It frequently removes more characters than other stemmers and may produce stems that are not actual words, sometimes obscuring the original meaning.
Example: “believes” → “believ”, “connection” → “connect”
7. N-Gram Stemmer
Description
This technique derives words by splitting them into n-grams (contiguous sets of n items from a sample of text).
How it Works
It exploits patterns in strings rather than performing basic suffix stripping, extracting similarities from shared character sequences.
Example: For “running” and “runner,” an n-gram model notices common character sequences and groups the words together.
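A minimal sketch of this idea uses character bigrams and the Dice coefficient; the choice of n = 2 and of Dice as the similarity measure are illustrative assumptions:

```python
def char_ngrams(word: str, n: int = 2) -> set:
    """Set of contiguous character n-grams in a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def dice_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient over the two words' n-gram sets."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return 2 * len(ga & gb) / (len(ga) + len(gb))

# "running" and "runner" share the bigrams ru, un, nn
print(round(dice_similarity("running", "runner"), 2))  # 0.55
```

Word pairs whose similarity exceeds a chosen threshold can then be clustered together, without ever computing an explicit stem.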
Comparison of Stemming Algorithms
| Stemming Algorithm | Approach | Strengths | Weaknesses |
|---|---|---|---|
| Porter Stemmer | Rule-based, stepwise suffix removal | Popular, balanced accuracy | Sometimes over-stems words |
| Lovins Stemmer | Longest suffix removal | Fast and simple | Less accurate |
| Paice-Husk Stemmer | Iterative rule-based stripping | More aggressive than Porter | Can remove too much |
| Dawson Stemmer | Extended Lovins | Handles more suffixes | Computationally expensive |
| Snowball Stemmer | Improved Porter, supports multiple languages | More precise than Porter | Still rule-based |
| Lancaster Stemmer | Aggressive truncation | Very fast | Over-stemming issues |
| N-Gram Stemmer | Character n-grams | Works well for noisy text | Less traditional stems |
Applications of Stemming in NLP
1. Search Engines and Information Retrieval
Real-Life Example: If you type “buying shoes” into Google, the search engine also surfaces results containing “buy,” “bought,” or “shoe purchase,” because stemming reduces words to a common base form. This lets Google present more relevant results.
Benefit: Improves search accuracy by linking various word forms with a shared root.
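A toy version of stem-based retrieval might look like the following; the suffix rules and the tiny document set are hypothetical, and real engines use full stemming algorithms plus inverted indexes:

```python
def simple_stem(word: str) -> str:
    """Crude suffix stripper, for illustration only."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stems(text: str) -> set:
    """Set of stems appearing in a text."""
    return {simple_stem(w) for w in text.lower().split()}

docs = {
    1: "where to buy running shoes",
    2: "history of the shoe industry",
    3: "weather forecast for tomorrow",
}

query = stems("buying shoes")  # {'buy', 'shoe'}
matches = [doc_id for doc_id, text in docs.items() if query & stems(text)]
print(matches)  # [1, 2]
```

Because both the query and the documents are stemmed, “buying” matches “buy” and “shoes” matches “shoe”, so documents 1 and 2 are retrieved while the unrelated document 3 is not.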
2. Text Classification and Sentiment Analysis
Real-Life Example: Movie review analysis on platforms like IMDb or Rotten Tomatoes uses stemming to group words like “amazing,” “amazingly,” and “amazement” under the root “amaz,” helping sentiment analysis models determine if a review is positive or negative.
Benefit: Ensures consistency in analyzing sentiment, leading to more accurate predictions.
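As a sketch, a crude lexicon-based sentiment check could match review words against stemmed positive terms. The suffix list and the one-stem lexicon below are hypothetical:

```python
# Hypothetical positive-sentiment lexicon, keyed by stem:
# "amazing", "amazingly", and "amazement" all share the stem "amaz".
POSITIVE_STEMS = {"amaz"}

def crude_stem(word: str) -> str:
    """Strip a few illustrative suffixes, longest first."""
    for suffix in ("ingly", "ement", "ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

review = "an amazingly fun movie full of amazement"
positive_hits = [w for w in review.split() if crude_stem(w) in POSITIVE_STEMS]
print(positive_hits)  # ['amazingly', 'amazement']
```

Counting such hits (against positive and negative lexicons) is the basic mechanism behind simple lexicon-based sentiment scoring.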
3. Document Clustering and Topic Modeling
Real-Life Example: News aggregators such as Google News utilize stemming to categorize similar stories. For example, stories that include “political,” “politician,” and “politics” can be categorized under a single topic so that users will have similar stories in one location.
Benefit: Facilitates grouping large volumes of text into useful topics.
4. Spam Detection and Filtering
Real-Life Example: Spam filters such as Gmail’s can match on word stems, so grammatical variants of a flagged term, for example “freeing” or “freed” alongside “free,” are treated alike. (Deliberate misspellings such as “fr33” or “freeeee” need separate normalization, since stemming only handles regular affixes.)
Benefit: Improves email filtering by catching the different word forms of spammy terms.
5. Plagiarism Detection and Text Similarity
Real-Life Example: Tools like Turnitin and Grammarly use stemming when detecting textual similarity.
If a student changes “arguing” to “argues” or “argued,” the software still identifies the similarity because all of these words reduce to the same stem.
Benefit: Enhances plagiarism detection by focusing on content rather than minor word changes.
Implementing Stemming in Python
Stemming in Python can be implemented using the Natural Language Toolkit (NLTK). Below are different ways to perform stemming in Python.
1. Using Porter Stemmer (NLTK)
The Porter Stemmer is one of the most widely used stemming algorithms, known for its simple and effective approach.
from nltk.stem import PorterStemmer
# Initialize the stemmer
porter = PorterStemmer()
# Example words
words = ["running", "flies", "easily", "arguing", "university"]
# Apply stemming
stemmed_words = [porter.stem(word) for word in words]
print(stemmed_words)
Output:
['run', 'fli', 'easili', 'argu', 'univers']
Observation:
- “flies” → “fli” (aggressive stemming)
- “easily” → “easili” (may not be ideal for NLP tasks)
2. Using Snowball Stemmer (NLTK)
The Snowball Stemmer (also known as Porter2) is an improved version of the Porter Stemmer and supports multiple languages.
from nltk.stem import SnowballStemmer
# Initialize Snowball Stemmer for English
snowball = SnowballStemmer("english")
# Example words
words = ["running", "flies", "easily", "arguing", "university"]
# Apply stemming
stemmed_words = [snowball.stem(word) for word in words]
print(stemmed_words)
Output:
['run', 'fli', 'easili', 'argu', 'univers']
Benefit:
- More accurate than the original Porter Stemmer
- Supports multiple languages like French, German, and Spanish
3. Using Lancaster Stemmer (NLTK)
The Lancaster Stemmer is more aggressive than the Porter and Snowball Stemmers, often over-stemming words.
from nltk.stem import LancasterStemmer
# Initialize Lancaster Stemmer
lancaster = LancasterStemmer()
# Example words
words = ["running", "flies", "easily", "arguing", "university"]
# Apply stemming
stemmed_words = [lancaster.stem(word) for word in words]
print(stemmed_words)
Output:
['run', 'fli', 'easy', 'argu', 'univers']
Drawback:
- Over-stemming can lead to loss of word meaning
4. Comparing Different Stemmers
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer
# Initialize stemmers
porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()
# Example word
word = "running"
# Apply stemming using different algorithms
print(f"Original Word: {word}")
print(f"Porter Stemmer: {porter.stem(word)}")
print(f"Snowball Stemmer: {snowball.stem(word)}")
print(f"Lancaster Stemmer: {lancaster.stem(word)}")
Output:
Original Word: running
Porter Stemmer: run
Snowball Stemmer: run
Lancaster Stemmer: run
Observation:
- All three stemmers produce “run” for “running”
- The impact varies for different words
Drawbacks of Stemming in NLP
1. Over-Stemming (False Positives)
Issue: Stemming can be too aggressive and incorrectly reduce words to an unrelated root, causing a loss of meaning.
Example: The Porter Stemmer reduces “university” to “univers”, which is not a valid word. Similarly, “organization” and “organ” can end up with the same stem even though their meanings differ.
Impact: May result in inappropriate search results or misinterpretation during text analysis.
2. Under-Stemming (False Negatives)
Issue: Some stemming algorithms fail to reduce words that should have the same root, leaving different forms of the same word unconnected.
Example: The word “running” might be reduced to “run”, but “runner” may remain unchanged, leading to inconsistencies.
Impact: Reduces the effectiveness of text matching and clustering.
3. Loss of Context and Meaning
Issue: Stemming removes suffixes without understanding the word’s context, sometimes altering the intended or the actual meaning.
Example: The Porter Stemmer reduces “news” to “new”, even though “news” carries a distinct meaning of its own.
Impact: This can cause errors in sentiment analysis, search results, and language understanding.
4. Inconsistency Across Different Languages
Issue: Stemming algorithms are often language-specific and may not work well across multiple languages without significant modifications.
Example: The English word “going” stems to “go”, but in French, “manger” (to eat) has many inflected forms (“mange,” “mangeons,” “mangent”) that require language-specific handling.
Impact: Limits the ability to use the same stemming approach across multilingual datasets.
5. Not Suitable for Complex NLP Tasks
Issue: Stemming is a rule-based method that takes no account of word semantics or syntax, which makes it unsuitable for more complex NLP tasks such as machine translation or contextual understanding.
Example: In voice assistants or chatbots, basic stemming will not be able to correctly interpret user intent.
Impact: Advanced methods such as lemmatization or deep learning models are required for advanced NLP applications.
Conclusion
Stemming is a fundamental NLP technique that enhances AI and ML models by simplifying words to their root forms and improving tasks like search optimization, chatbot responses, and text analysis.
However, its limitations, such as over-stemming and loss of meaning, make lemmatization a more precise alternative for complex applications like sentiment analysis and machine translation.
If you want to explore such techniques hands-on, Great Learning’s AI and ML course offers in-depth training on NLP, deep learning, and real-world AI applications to help you strengthen your knowledge.