Have you ever wondered how search engines know that “running,” “runs,” and “ran” all come from the root word “run”?
Or how chatbots can take many different word forms and still use them to respond meaningfully?
The secret lies in stemming, one of the most basic techniques in Natural Language Processing (NLP). Stemming identifies the base form of a word by removing prefixes and suffixes to get at the root meaning.
Stemming allows machines to analyze text more easily, ultimately enhancing search result precision, sentiment analysis, and even spam detection.
But how does this work, and why should we care about NLP? Let’s find out.
What is Stemming?
Stemming is a natural language processing technique that reduces words to their root or base form (also known as the “stem”).
The purpose of stemming is to simplify text by consolidating words with similar meanings, enabling better analysis in applications such as search engines, text mining, and information retrieval.
For example, the words “running,” “runner,” and “ran” share the same root meaning related to the action of moving quickly.
By converting these variants to their root form, “run,” we streamline data processing and boost the precision of analysis.
Step-by-Step Process of Stemming
Step 1: Identify the Word
Begin with a word that may include prefixes, root forms, and suffixes. For instance:
Input Word: “believable”
Step 2: Analyze the Word Structure
Examine the word’s components to identify its root and any affixes. For “believable”:
- Root: “believe”
- Suffix: “-able”
Step 3: Remove Affixes
Next, apply rules to strip the recognized affixes and reach the word’s stem. Most stemming algorithms (including Porter) strip suffixes only, so removing “-able” reduces “believable” to the stem “believ” (not a dictionary word, but a consistent form for matching).
Step 4: Apply Stemming Algorithm
This step involves using a specific algorithm designed to remove affixes systematically. Some commonly used stemming algorithms include:
Porter Stemmer: A widely-used stemming algorithm that applies a set of rules to remove common suffixes. For instance, it would stem:
- “running” → “run”
- “happiness” → “happi” (in this case, it strips more aggressively)
Snowball Stemmer: An improvement over the Porter Stemmer that refines several rules and supports multiple languages. It might yield:
- “fairly” → “fair” (where Porter gives “fairli”)
- “running” → “run”
Step 5: Return the Reduced Form
Once the algorithm processes the word, it returns the simplified or stemmed version suitable for analysis. Using the Porter Stemmer as an example:
- Output for “running”: “run”
- Output for “fishing”: “fish”
These outputs can vary depending on the algorithm’s design and rules.
Step 6: Handle Irregular Forms
Some words do not obey standard rules, and stemming algorithms sometimes produce “stems” that are not actual words; these are still useful for matching. For example:
Input Word: “better”
Stemmed Form (using Porter): “better” is returned unchanged, since it has no recognizable affixes for the rules to strip.
Step 7: Final Output and Usage
The final output is a list or set of unique stems representing your original words. This list serves analytic purposes such as:
- Reducing the number of unique tokens, allowing a model to generalize better.
- Merging similar meanings and grammatical variations of words, which improves search functionality.
Example of Stemming:
Consider the input words: [“connection”, “connects”, “connected”, “connecting”, “connections”]
Stemming Process:
- “connection” → “connect”
- “connects” → “connect”
- “connected” → “connect”
- “connecting” → “connect”
- “connections” → “connect”
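The steps above can be sketched as a toy rule-based stemmer. The suffix list and minimum-stem-length check below are illustrative assumptions, not the actual Porter rules:

```python
# Toy suffix-stripping stemmer: a simplified sketch of the stemming process,
# not a faithful Porter implementation.
SUFFIXES = ["ions", "ing", "ion", "ed", "s"]

def toy_stem(word: str) -> str:
    # Try the longest suffix first so "ions" wins over "s"
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        # Only strip if a reasonably long stem remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

words = ["connection", "connects", "connected", "connecting", "connections"]
print([toy_stem(w) for w in words])
# ['connect', 'connect', 'connect', 'connect', 'connect']
```

Real algorithms such as Porter add many refinements (measure conditions, consonant handling, multi-step passes), but the core idea of ordered suffix rules is the same.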
Types of Stemming Algorithms
1. Porter Stemmer
Description
Developed by Martin Porter in 1980, this is one of the most popular stemming algorithms. It uses a set of rules to iteratively strip suffixes from words to produce stems.
How it Works
The algorithm processes words in multiple steps, where each step applies specific rules to remove common suffixes such as “-ing,” “-ed,” and “-es.”
Example: “running” → “run”, “happiness” → “happi”
2. Lovins Stemmer
Description
Created by Julie Beth Lovins in 1968, this was one of the first stemming algorithms used but is less widely adopted today.
How it Works
It removes suffixes based on a large set of predefined rules, identifying and stripping the longest matching ending in a single pass.
Example: “fishing” → “fish”, “runner” → “run”
3. Paice & Husk Stemmer
Description
Introduced by Paice and Husk in 1990, this is a more elaborate stemming method driven by a comprehensive, iterative rule set.
How it Works
Unlike other more basic stemming algorithms, it not only strips suffixes but also addresses special cases based on pre-defined conditions and affix changes.
Example: “happily” → “happy”
4. Dawson Stemmer
Description
This algorithm extends the Lovins Stemmer, covering a larger set of suffixes while keeping its single-pass approach.
How it Works
The Dawson Stemmer applies a series of rules for affix removal but is designed to reduce errors associated with truncating words too aggressively.
Example: “administered” → “administer”
5. Snowball Stemmer
Description
Also known as the “Porter2” stemmer, developed by Martin Porter as an improvement over the original Porter Stemmer. It supports multiple languages.
How it Works
It applies a more elaborate set of rules and works effectively across different languages, producing more intuitive results than its predecessor.
Example: “running” → “run”, “better” → “better”
6. Lancaster Stemmer
Description
A more aggressive stemming algorithm developed by Chris Paice. It uses a simple set of rules for suffix stripping but tends to be harsher than the Porter Stemmer.
How it Works
It frequently removes more characters than other stemmers and may produce stems that are not actual words, sometimes obscuring the original meaning.
Example: “believes” → “believ”, “connection” → “connect”
7. N-Gram Stemmer
Description
This technique derives words by splitting them into n-grams (contiguous sets of n items from a sample of text).
How it Works
It exploits patterns in strings rather than performing basic suffix stripping, extracting similarities from shared character sequences.
Example: For “running” and “runner,” an n-gram model notices common character sequences and groups the words together.
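A minimal sketch of this idea uses character bigrams and the Dice coefficient; the choice of n = 2 and of Dice as the similarity measure are illustrative assumptions:

```python
def char_ngrams(word: str, n: int = 2) -> set:
    """Set of contiguous character n-grams in a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def dice_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient over the two words' n-gram sets."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return 2 * len(ga & gb) / (len(ga) + len(gb))

# "running" and "runner" share the bigrams ru, un, nn
print(round(dice_similarity("running", "runner"), 2))  # 0.55
```

Word pairs whose similarity exceeds a chosen threshold can then be clustered together, without ever computing an explicit stem.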
Comparison of Stemming Algorithms
| Stemming Algorithm | Approach | Strengths | Weaknesses |
|---|---|---|---|
| Porter Stemmer | Rule-based, stepwise suffix removal | Popular, balanced accuracy | Sometimes over-stems words |
| Lovins Stemmer | Longest suffix removal | Fast and simple | Less accurate |
| Paice-Husk Stemmer | Iterative rule-based stripping | More aggressive than Porter | Can remove too much |
| Dawson Stemmer | Extended Lovins | Handles more suffixes | Computationally expensive |
| Snowball Stemmer | Improved Porter, supports multiple languages | More precise than Porter | Still rule-based |
| Lancaster Stemmer | Aggressive truncation | Very fast | Over-stemming issues |
| N-Gram Stemmer | Character n-grams | Works well for noisy text | Less traditional stems |
Applications of Stemming in NLP
1. Search Engines and Information Retrieval
Real-Life Example: If you type “buying shoes” into Google, the search engine also surfaces results containing “buy,” “bought,” or “shoe purchase,” because stemming reduces words to a common base form. This lets Google present more relevant results.
Benefit: Improves search accuracy by linking various word forms with a shared root.
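A toy version of stem-based retrieval might look like the following; the suffix rules and the tiny document set are hypothetical, and real engines use full stemming algorithms plus inverted indexes:

```python
def simple_stem(word: str) -> str:
    """Crude suffix stripper, for illustration only."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stems(text: str) -> set:
    """Set of stems appearing in a text."""
    return {simple_stem(w) for w in text.lower().split()}

docs = {
    1: "where to buy running shoes",
    2: "history of the shoe industry",
    3: "weather forecast for tomorrow",
}

query = stems("buying shoes")  # {'buy', 'shoe'}
matches = [doc_id for doc_id, text in docs.items() if query & stems(text)]
print(matches)  # [1, 2]
```

Because both the query and the documents are stemmed, “buying” matches “buy” and “shoes” matches “shoe”, so documents 1 and 2 are retrieved while the unrelated document 3 is not.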
2. Text Classification and Sentiment Analysis
Real-Life Example: Movie review analysis on platforms like IMDb or Rotten Tomatoes uses stemming to group words like “amazing,” “amazingly,” and “amazement” under the root “amaz,” helping sentiment analysis models determine if a review is positive or negative.
Benefit: Ensures consistency in analyzing sentiment, leading to more accurate predictions.
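As a sketch, a crude lexicon-based sentiment check could match review words against stemmed positive terms. The suffix list and the one-stem lexicon below are hypothetical:

```python
# Hypothetical positive-sentiment lexicon, keyed by stem:
# "amazing", "amazingly", and "amazement" all share the stem "amaz".
POSITIVE_STEMS = {"amaz"}

def crude_stem(word: str) -> str:
    """Strip a few illustrative suffixes, longest first."""
    for suffix in ("ingly", "ement", "ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

review = "an amazingly fun movie full of amazement"
positive_hits = [w for w in review.split() if crude_stem(w) in POSITIVE_STEMS]
print(positive_hits)  # ['amazingly', 'amazement']
```

Counting such hits (against positive and negative lexicons) is the basic mechanism behind simple lexicon-based sentiment scoring.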
3. Document Clustering and Topic Modeling
Real-Life Example: News aggregators such as Google News utilize stemming to categorize similar stories. For example, stories that include “political,” “politician,” and “politics” can be categorized under a single topic so that users will have similar stories in one location.
Benefit: Facilitates grouping large volumes of text into useful topics.
4. Spam Detection and Filtering
Real-Life Example: Spam filters such as Gmail’s can match on word stems, so grammatical variants of a flagged term, for example “freeing” or “freed” alongside “free,” are treated alike. (Deliberate misspellings such as “fr33” or “freeeee” need separate normalization, since stemming only handles regular affixes.)
Benefit: Improves email filtering by catching the different word forms of spammy terms.
5. Plagiarism Detection and Text Similarity
Real-Life Example: Tools like Turnitin and Grammarly use stemming when detecting textual similarity.
If a student changes “arguing” to “argues” or “argued,” the software still identifies the similarity because all of these words reduce to the same stem.
Benefit: Enhances plagiarism detection by focusing on content rather than minor word changes.
Implementing Stemming in Python
Stemming in Python can be implemented using the Natural Language Toolkit (NLTK). Below are different ways to perform stemming in Python.
1. Using Porter Stemmer (NLTK)
The Porter Stemmer is one of the most widely used stemming algorithms, known for its simple and effective approach.
from nltk.stem import PorterStemmer
# Initialize the stemmer
porter = PorterStemmer()
# Example words
words = ["running", "flies", "easily", "arguing", "university"]
# Apply stemming
stemmed_words = [porter.stem(word) for word in words]
print(stemmed_words)
Output:
['run', 'fli', 'easili', 'argu', 'univers']
Observation:
- “flies” → “fli” (aggressive stemming)
- “easily” → “easili” (may not be ideal for NLP tasks)
2. Using Snowball Stemmer (NLTK)
The Snowball Stemmer (also known as Porter2) is an improved version of the Porter Stemmer and supports multiple languages.
from nltk.stem import SnowballStemmer
# Initialize Snowball Stemmer for English
snowball = SnowballStemmer("english")
# Example words
words = ["running", "flies", "easily", "arguing", "university"]
# Apply stemming
stemmed_words = [snowball.stem(word) for word in words]
print(stemmed_words)
Output:
['run', 'fli', 'easili', 'argu', 'univers']
Benefit:
- More accurate than the original Porter Stemmer
- Supports multiple languages like French, German, and Spanish
3. Using Lancaster Stemmer (NLTK)
The Lancaster Stemmer is more aggressive than the Porter and Snowball Stemmers, often over-stemming words.
from nltk.stem import LancasterStemmer
# Initialize Lancaster Stemmer
lancaster = LancasterStemmer()
# Example words
words = ["running", "flies", "easily", "arguing", "university"]
# Apply stemming
stemmed_words = [lancaster.stem(word) for word in words]
print(stemmed_words)
Output:
['run', 'fli', 'easy', 'argu', 'univers']
Drawback:
- Over-stemming can lead to loss of word meaning
4. Comparing Different Stemmers
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer
# Initialize stemmers
porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()
# Example word
word = "running"
# Apply stemming using different algorithms
print(f"Original Word: {word}")
print(f"Porter Stemmer: {porter.stem(word)}")
print(f"Snowball Stemmer: {snowball.stem(word)}")
print(f"Lancaster Stemmer: {lancaster.stem(word)}")
Output:
Original Word: running
Porter Stemmer: run
Snowball Stemmer: run
Lancaster Stemmer: run
Observation:
- All three stemmers produce “run” for “running”
- The impact varies for different words
Drawbacks of Stemming in NLP
1. Over-Stemming (False Positives)
Issue: Stemming can be too aggressive and incorrectly reduce words to an unrelated root, causing a loss of meaning.
Example: The Porter Stemmer reduces “university” to “univers”, which is not a valid word. Similarly, “organization” and “organ” can end up with the same stem even though their meanings differ.
Impact: May result in inappropriate search results or misinterpretation during text analysis.
2. Under-Stemming (False Negatives)
Issue: Some stemming algorithms fail to reduce words that should have the same root, leaving different forms of the same word unconnected.
Example: The word “running” might be reduced to “run”, but “runner” may remain unchanged, leading to inconsistencies.
Impact: Reduces the effectiveness of text matching and clustering.
3. Loss of Context and Meaning
Issue: Stemming removes suffixes without understanding the word’s context, sometimes altering the intended or the actual meaning.
Example: The Porter Stemmer reduces “news” to “new”, even though “news” carries a distinct meaning of its own.
Impact: This can cause errors in sentiment analysis, search results, and language understanding.
4. Inconsistency Across Different Languages
Issue: Stemming algorithms are often language-specific and may not work well across multiple languages without significant modifications.
Example: The English word “going” stems to “go”, but in French, “manger” (to eat) has many inflected forms (“mange,” “mangeons,” “mangent”) that require language-specific handling.
Impact: Limits the ability to use the same stemming approach across multilingual datasets.
5. Not Suitable for Complex NLP Tasks
Issue: Stemming is a rule-based method that takes no account of word semantics or syntax, which makes it unsuitable for more complex NLP tasks such as machine translation or contextual understanding.
Example: In voice assistants or chatbots, basic stemming will not be able to correctly interpret user intent.
Impact: Advanced methods such as lemmatization or deep learning models are required for advanced NLP applications.
Conclusion
Stemming is a fundamental NLP technique that enhances AI and ML models by simplifying words to their root forms and improving tasks like search optimization, chatbot responses, and text analysis.
However, its limitations, such as over-stemming and loss of meaning, make lemmatization a more precise alternative for complex applications like sentiment analysis and machine translation.
If you want to explore such techniques hands-on, Great Learning’s AI and ML course offers in-depth training on NLP, deep learning, and real-world AI applications to help you strengthen your knowledge.