The Natural Language Toolkit (NLTK) is a suite of Python libraries designed to identify and tag the parts of speech found in text written in natural languages such as English. To build NLTK-based applications for natural language processing and to perform text analysis, additional Python packages such as gensim and pattern are also useful.
- Tokenization: Tokenization is the process of splitting the given text into smaller units called tokens. Tokens can be words, numbers, or punctuation marks. It is also known as word segmentation.
- Stemming: For grammatical reasons, language is highly varied: many different inflected forms of the same word exist, in English and in other languages. For machine learning projects, the machine must recognize that these different words share the same base form, so extracting a word's base form is very helpful when analyzing text. Stemming does this by heuristically chopping off word endings.
- Lemmatization: Lemmatization is another way of reducing words to their base form by removing inflectional endings, and it typically relies on vocabulary and morphological analysis. The base form of a word produced by lemmatization is called a lemma. A combined sketch of all three steps follows this list.
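To make these three steps concrete, here is a minimal sketch using NLTK's word_tokenize, PorterStemmer, and WordNetLemmatizer. The example sentence and the punkt/wordnet data downloads are illustrative assumptions, not part of the original text:

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the NLTK data these steps rely on
nltk.download("punkt")
nltk.download("wordnet")

text = "The foxes were jumping over the walls"

# Tokenization: split the text into word and punctuation tokens
tokens = word_tokenize(text)
print(tokens)  # ['The', 'foxes', 'were', 'jumping', 'over', 'the', 'walls']

# Stemming: heuristically strip endings to reach a base form
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])  # e.g. 'jumping' -> 'jump', 'foxes' -> 'fox'

# Lemmatization: use vocabulary and morphology to find the lemma
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in tokens])  # e.g. 'walls' -> 'wall'
```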
Running the NLP Script
```python
import nltk
```
Here,
- DT is the determiner
- VBP is the verb
- JJ is the adjective
- IN is the preposition
- NN is the noun
sentence=[("a","DT"),("clever","JJ"),("fox","NN"),("was","VBP"),("jumping","VBP"),("over","IN"),("the","DT"),("wall","NN")]
grammar = "NP:{<DT>?<JJ>*<NN>}"
parser_chunking = nltk.RegexpParser(grammar)
parser_chunking.parse(sentence)
Output = parser_chunking.parse(sentence)
output.draw()
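Note that output.draw() opens a graphical tree window, which requires a GUI (Tkinter). In a headless environment, a simple alternative, sketched below, is to print the chunk tree as text instead:

```python
# Text-only alternative: print the chunk tree instead of drawing it
print(output)
# Expected chunk tree (formatting may vary slightly):
# (S
#   (NP a/DT clever/JJ fox/NN)
#   was/VBP
#   jumping/VBP
#   over/IN
#   (NP the/DT wall/NN))
```

Both noun phrases, "a clever fox" and "the wall", match the grammar's optional-determiner/adjectives/noun pattern and are grouped into NP chunks, while the remaining words stay at the sentence level.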