Python has rich library support for Natural Language Processing. From text processing, tokenizing texts and determining their lemmas, to syntactic analysis, parsing a text and assigning syntactic roles, to semantic processing, e.g. recognizing named entities, sentiment analysis and document classification, everything is offered by at least one library. So, where do you start?
The goal of this article is to give an overview of relevant Python libraries for each of the core NLP tasks. Each library is introduced with a brief description and a concrete code snippet for the respective NLP task. Continuing my introduction to NLP blog article, this article only covers libraries for the core NLP tasks of text processing, syntactic and semantic analysis, and document semantics. Additionally, in the NLP utilities category, libraries for corpus management and datasets are presented.
The following libraries are covered:
This article originally appeared at my blog admantium.com.
Core NLP Tasks
Text Processing
Tasks: tokenization, lemmatization, stemming
The NLTK library provides a complete toolkit for text processing, including tokenization, stemming, and lemmatization.
from nltk.tokenize import sent_tokenize, word_tokenize
# Source: Wikipedia, Artificial Intelligence, https://en.wikipedia.org/wiki/Artificial_intelligence
paragraph = '''Artificial intelligence was founded as an academic discipline in 1956, and in the years since it has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success, and renewed funding. AI research has tried and discarded many different approaches, including simulating the brain, modeling human problem solving, formal logic, large databases of knowledge, and imitating animal behavior. In the first decades of the 21st century, highly mathematical and statistical machine learning has dominated the field, and this technique has proved highly successful, helping to solve many challenging problems throughout industry and academia.'''
sentences = []
for sent in sent_tokenize(paragraph):
    sentences.append(word_tokenize(sent))
sentences[0]
# ['Artificial', 'intelligence', 'was', 'founded', 'as', 'an', 'academic', 'discipline'
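The same library also covers stemming and lemmatization. A minimal sketch; the lemmatizer additionally requires the wordnet resource to be downloaded:
from nltk.stem import PorterStemmer, WordNetLemmatizer
# nltk.download('wordnet') is required once before using the lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(stemmer.stem('approaches'))
# approach
print(lemmatizer.lemmatize('approaches'))
# approach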
With TextBlob, the same text processing tasks are supported. It distinguishes itself from NLTK by richer semantic results and its easy-to-use data structures: parsing a sentence already generates rich semantic information.
from textblob import TextBlob
text = '''
Artificial intelligence was founded as an academic discipline in 1956, and in the years since it has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success, and renewed funding. AI research has tried and discarded many different approaches, including simulating the brain, modeling human problem solving, formal logic, large databases of knowledge, and imitating animal behavior. In the first decades of the 21st century, highly mathematical and statistical machine learning has dominated the field, and this technique has proved highly successful, helping to solve many challenging problems throughout industry and academia.
'''
blob = TextBlob(text)
blob.ngrams()
#[WordList(['Artificial', 'intelligence', 'was']),
# WordList(['intelligence', 'was', 'founded']),
# WordList(['was', 'founded', 'as']),
blob.tokens
# WordList(['Artificial', 'intelligence', 'was', 'founded', 'as', 'an', 'academic', 'discipline', 'in', '1956', ',', 'and', 'in',
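Lemmatization is available on the word level; a small sketch using the blob from above:
blob.words[3]
# 'founded'
blob.words[3].lemmatize('v')
# 'found'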
And with Spacy, a modern NLP library, text processing is just the first step in a rich pipeline of mostly semantic tasks. Unlike the other libraries, Spacy requires loading a model of the target language first. Instead of relying on heuristics, recent models are artificial neural networks, especially transformers, which provide richer abstractions and integrate more easily with other components.
import spacy
nlp = spacy.load('en_core_web_lg')
text = '''
Artificial intelligence was founded as an academic discipline in 1956, and in the years since it has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success, and renewed funding. AI research has tried and discarded many different approaches, including simulating the brain, modeling human problem solving, formal logic, large databases of knowledge, and imitating animal behavior. In the first decades of the 21st century, highly mathematical and statistical machine learning has dominated the field, and this technique has proved highly successful, helping to solve many challenging problems throughout industry and academia.
'''
doc = nlp(text)
tokens = [token for token in doc]
print(tokens)
# [Artificial, intelligence, was, founded, as, an, academic, discipline
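The loaded model also provides lemmas directly on each token; a brief sketch:
print([token.lemma_ for token in doc if not token.is_space][:6])
# e.g. ['artificial', 'intelligence', 'be', 'found', 'as', 'an']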
Text Syntax
Tasks: parsing, part-of-speech tagging, noun phrase extraction
Beginning with NLTK, all syntactic tasks are supported. Their output is provided as native Python data structures, and it can always be displayed as simple text output.
from nltk.tokenize import word_tokenize
from nltk import pos_tag, RegexpParser
text = '''
Artificial intelligence was founded as an academic discipline in 1956, and in the years since it has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success, and renewed funding. AI research has tried and discarded many different approaches, including simulating the brain, modeling human problem solving, formal logic, large databases of knowledge, and imitating animal behavior. In the first decades of the 21st century, highly mathematical and statistical machine learning has dominated the field, and this technique has proved highly successful, helping to solve many challenging problems throughout industry and academia.
'''
pos_tag(word_tokenize(text))
# [('Artificial', 'JJ'),
# ('intelligence', 'NN'),
# ('was', 'VBD'),
# ('founded', 'VBN'),
# ('as', 'IN'),
# ('an', 'DT'),
# ('academic', 'JJ'),
# ('discipline', 'NN'),
# noun chunk parser
# source: https://www.nltk.org/book_1ed/ch07.html
grammar = "NP: {<DT>?<JJ>*<NN>}"
parser = RegexpParser(grammar)
parser.parse(pos_tag(word_tokenize(text)))
#(S
# (NP Artificial/JJ intelligence/NN)
# was/VBD
# founded/VBN
# as/IN
# (NP an/DT academic/JJ discipline/NN)
# in/IN
# 1956/CD
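The resulting tree can also be traversed programmatically, for example to collect only the noun phrases; a brief sketch based on the parse above:
tree = parser.parse(pos_tag(word_tokenize(text)))
noun_phrases = [' '.join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees(lambda t: t.label() == 'NP')]
print(noun_phrases[:2])
# ['Artificial intelligence', 'an academic discipline']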
TextBlob provides POS tags immediately when the text is processed. A separate parse method creates a parse tree that contains rich syntactic information.
from textblob import TextBlob
text = '''
Artificial intelligence was founded as an academic discipline in 1956, and in the years since it has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success, and renewed funding. AI research has tried and discarded many different approaches, including simulating the brain, modeling human problem solving, formal logic, large databases of knowledge, and imitating animal behavior. In the first decades of the 21st century, highly mathematical and statistical machine learning has dominated the field, and this technique has proved highly successful, helping to solve many challenging problems throughout industry and academia.
'''
blob = TextBlob(text)
blob.tags
#[('Artificial', 'JJ'),
# ('intelligence', 'NN'),
# ('was', 'VBD'),
# ('founded', 'VBN'),
blob.parse()
# Artificial/JJ/B-NP/O
# intelligence/NN/I-NP/O
# was/VBD/B-VP/O
# founded/VBN/I-VP/O
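Noun phrase extraction, one of the listed tasks, is also built in:
blob.noun_phrases
# e.g. WordList(['artificial intelligence', 'academic discipline', ...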
The Spacy library uses neural network models, including transformers, to support its syntactic tasks.
import spacy
nlp = spacy.load('en_core_web_lg')
for token in nlp(text):
    print(f'{token.text:<20}{token.pos_:>5}{token.tag_:>5}')
#Artificial ADJ JJ
#intelligence NOUN NN
#was AUX VBD
#founded VERB VBN
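Noun chunks are part of the same parse; a short sketch:
doc = nlp(text)
for chunk in list(doc.noun_chunks)[:3]:
    print(chunk.text)
# e.g.
# Artificial intelligence
# an academic discipline
# the years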
Text Semantics
Tasks: named entity recognition, word sense disambiguation, semantic role labeling
Semantic analysis is an area in which NLP approaches start to differ. When using NLTK, the generated syntactic information is looked up in dictionaries to identify, for example, named entities. Therefore, when working with newer texts, entities might not be recognized.
from nltk import download as nltk_download
from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk
nltk_download('maxent_ne_chunker')
nltk_download('words')
# Source: Wikipedia, Spacecraft, https://en.wikipedia.org/wiki/Spacecraft
text = '''
As of 2016, only three nations have flown crewed spacecraft: USSR/Russia, USA, and China. The first crewed spacecraft was Vostok 1, which carried Soviet cosmonaut Yuri Gagarin into space in 1961, and completed a full Earth orbit. There were five other crewed missions which used a Vostok spacecraft. The second crewed spacecraft was named Freedom 7, and it performed a sub-orbital spaceflight in 1961 carrying American astronaut Alan Shepard to an altitude of just over 187 kilometers (116 mi). There were five other crewed missions using Mercury spacecraft.
'''
print(ne_chunk(pos_tag(word_tokenize(text))))
# (S
# As/IN
# of/IN
# [...]
# (ORGANIZATION USA/NNP)
# [...]
# which/WDT
# carried/VBD
# (GPE Soviet/JJ)
# cosmonaut/NN
# (PERSON Yuri/NNP Gagarin/NNP)
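The recognized entities are subtrees of the chunk result and can be collected programmatically; a minimal sketch:
from nltk import Tree

tree = ne_chunk(pos_tag(word_tokenize(text)))
entities = [(subtree.label(), ' '.join(word for word, tag in subtree.leaves()))
            for subtree in tree if isinstance(subtree, Tree)]
print(entities)
# e.g. [..., ('ORGANIZATION', 'USA'), ..., ('GPE', 'Soviet'), ('PERSON', 'Yuri Gagarin'), ...]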
The transformer models that the Spacy library uses contain an implicit "timestamp": their training time. This determines which texts the model consumed, and therefore which entities the model is capable of recognizing.
import spacy
nlp = spacy.load('en_core_web_lg')
text = '''
As of 2016, only three nations have flown crewed spacecraft: USSR/Russia, USA, and China. The first crewed spacecraft was Vostok 1, which carried Soviet cosmonaut Yuri Gagarin into space in 1961, and completed a full Earth orbit. There were five other crewed missions which used a Vostok spacecraft. The second crewed spacecraft was named Freedom 7, and it performed a sub-orbital spaceflight in 1961 carrying American astronaut Alan Shepard to an altitude of just over 187 kilometers (116 mi). There were five other crewed missions using Mercury spacecraft.
'''
doc = nlp(text)
for token in doc.ents:
    print(f'{token.text:<25}{token.label_:<15}')
# 2016 DATE
# only three CARDINAL
# USSR GPE
# Russia GPE
# USA GPE
# China GPE
# first ORDINAL
# Vostok 1 PRODUCT
# Soviet NORP
# Yuri Gagarin PERSON
Document Semantics
Tasks: text classification, topic modeling, sentiment analysis, toxicity recognition
Sentiment analysis is another task in which NLP approaches differ: looking up word meanings in dictionaries versus learned word similarities encoded in word or document vectors.
TextBlob has a built-in sentiment analysis that returns the polarity (overall positive or negative connotation) and the subjectivity (degree of personal opinion) of a text.
from textblob import TextBlob
text = '''
Artificial intelligence was founded as an academic discipline in 1956, and in the years since it has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success, and renewed funding. AI research has tried and discarded many different approaches, including simulating the brain, modeling human problem solving, formal logic, large databases of knowledge, and imitating animal behavior. In the first decades of the 21st century, highly mathematical and statistical machine learning has dominated the field, and this technique has proved highly successful, helping to solve many challenging problems throughout industry and academia.
'''
blob = TextBlob(text)
blob.sentiment
#Sentiment(polarity=0.16180290297937355, subjectivity=0.42155589508530683)
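Sentiment can also be inspected per sentence:
for sentence in blob.sentences:
    print(f'{sentence.sentiment.polarity:+.2f} {str(sentence)[:60]}')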
Spacy does not include text classification out of the box, but it can be extended with a text categorizer as a separate pipeline step. The following code is lengthy and contains several Spacy-internal objects and data structures - a future article will explain this in more detail.
## train single label categorization from multi-label dataset
import spacy
from spacy.tokens import DocBin

def convert_single_label(dataset, filename):
    # dataset (a corpus reader) and get_text() are assumed to be defined elsewhere
    db = DocBin()
    nlp = spacy.load('en_core_web_lg')
    for index, fileid in enumerate(dataset):
        # one-hot encode the first category of each document
        cat_dict = {cat: 0 for cat in dataset.categories()}
        cat_dict[dataset.categories(fileid).pop()] = 1
        doc = nlp(get_text(fileid))
        doc.cats = cat_dict
        db.add(doc)
    db.to_disk(filename)
## load trained model and apply to text
nlp = spacy.load('textcat_multilabel_model/model-best')
text = dataset.raw(42)
doc = nlp(text)
estimated_cats = sorted(doc.cats.items(), key=lambda i:float(i[1]), reverse=True)
print(dataset.categories(42))
# ['orange']
print(estimated_cats)
# [('nzdlr', 0.998894989490509), ('money-supply', 0.9969857335090637), ... ('orange', 0.7344251871109009),
SciKit Learn is a general-purpose machine learning library that provides many clustering and classification algorithms. It works on numerical input only, and therefore requires the text to be vectorized, for example using Gensim's pre-trained word vectors or SciKit Learn's built-in feature vectorizers. To give just one example, here is a snippet that vectorizes pre-extracted document features and then applies the KMeans clustering algorithm to them.
from sklearn.feature_extraction import DictVectorizer
from sklearn.cluster import KMeans
vectorizer = DictVectorizer(sparse=False)
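# dataset['train'] is assumed to hold one feature dictionary per document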
x_train = vectorizer.fit_transform(dataset['train'])
kmeans = KMeans(n_clusters=8, random_state=0, n_init="auto").fit(x_train)
print(kmeans.labels_.shape)
# (8551, )
print(kmeans.labels_)
# [4 4 4 ... 6 6 6]
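When starting from raw text instead of feature dictionaries, one of the built-in vectorizers can produce the numerical input. A minimal sketch, assuming raw_texts is a hypothetical list of plain document strings:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# raw_texts is a hypothetical list of plain document strings
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
x_train = vectorizer.fit_transform(raw_texts)
kmeans = KMeans(n_clusters=8, random_state=0, n_init='auto').fit(x_train)
print(kmeans.labels_[:10])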
Finally, Gensim is a library specialized in topic modeling for large-scale corpora. The following snippet loads a built-in dataset, vectorizes the tokens of each document, and performs the LDA clustering algorithm. Running on a CPU only, this can take up to 15 minutes.
# source: https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html, https://radimrehurek.com/gensim/auto_examples/howtos/run_downloader_api.html
import logging
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.models import LdaModel
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
docs = api.load('text8')
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
_ = dictionary[0]
id2word = dictionary.id2token
# Define and train the model
model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=2000,
    alpha='auto',
    eta='auto',
    iterations=400,
    num_topics=10,
    passes=20,
    eval_every=None
)
print(model.num_topics)
# 10
print(model.top_topics(corpus)[6])
# ([(4.201401e-06, 'done'),
# (4.1998064e-06, 'zero'),
# (4.1478743e-06, 'eight'),
# (4.1257395e-06, 'one'),
# (4.1166854e-06, 'two'),
# (4.085097e-06, 'six'),
# (4.080696e-06, 'language'),
# (4.050306e-06, 'system'),
# (4.041121e-06, 'network'),
# (4.0385708e-06, 'internet'),
# (4.0379923e-06, 'protocol'),
# (4.035399e-06, 'open'),
# (4.033435e-06, 'three'),
# (4.0334166e-06, 'interface'),
# (4.030141e-06, 'four'),
# (4.0283044e-06, 'seven'),
# (4.0163245e-06, 'no'),
# (4.0149207e-06, 'i'),
# (4.0072555e-06, 'object'),
# (4.007036e-06, 'programming')],
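A trained model can also be applied to a new document by converting it to the same bag-of-words representation; a brief sketch with a hypothetical token list:
unseen_tokens = ['machine', 'learning', 'dominates', 'artificial', 'intelligence', 'research']
bow = dictionary.doc2bow(unseen_tokens)
print(model.get_document_topics(bow))
# e.g. [(3, 0.42), (7, 0.31), ...]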
Utilities
Corpus Management
NLTK offers corpus readers for plaintext, markdown, and even Twitter feeds in JSON format. A reader is created by passing a file path, and it then provides basic statistics as well as an iterator to work through all found files.
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
corpus = PlaintextCorpusReader('wikipedia_articles', r'.*\.txt')
print(corpus.fileids())
# ['AI_alignment.txt', 'AI_safety.txt', 'Artificial_intelligence.txt', 'Machine_learning.txt', ...]
print(len(corpus.sents()))
# 47289
print(len(corpus.words()))
# 1146248
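The same statistics are available per file; a small sketch using the reader from above:
for fileid in corpus.fileids()[:3]:
    print(fileid, len(corpus.words(fileid)), len(corpus.sents(fileid)))
# prints the word and sentence count for each file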
Gensim processes text files to form a word-vector representation of each document, which can then be used for its main use case, topic modeling. The documents need to be processed by an iterator that wraps traversing the directory, and then the corpus is built as a word-vector collection. However, this corpus representation is hard to externalize and reuse with other libraries. The following snippet is an excerpt from above - it loads a dataset that's included in Gensim, then creates a word-vector based representation.
import gensim.downloader as api
from gensim.corpora import Dictionary
docs = api.load('text8')
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
print('Number of unique tokens: %d' % len(dictionary))
# Number of unique tokens: 253854
print('Number of documents: %d' % len(corpus))
# Number of documents: 1701
Datasets
NLTK offers several ready-to-use datasets, for example a Reuters news excerpt, European Parliament proceedings, and open books from the Gutenberg collection. See the complete dataset and model list.
from nltk.corpus import reuters
print(len(reuters.fileids()))
#10788
print(reuters.categories()[:43])
# ['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', 'dmk', 'earn', 'fuel', 'gas', 'gnp', 'gold', 'grain', 'groundnut', 'groundnut-oil', 'heat', 'hog', 'housing', 'income', 'instal-debt', 'interest', 'ipi', 'iron-steel', 'jet', 'jobs', 'l-cattle', 'lead', 'lei', 'lin-oil']
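Individual documents and their categories are accessible through the same reader:
fileid = reuters.fileids()[0]
print(reuters.categories(fileid))
print(reuters.raw(fileid)[:100])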
SciKit Learn includes datasets from newsgroups, real estate, and even IT intrusion detection, see the complete list. Here is a quick example for the newsgroups dataset.
from sklearn.datasets import fetch_20newsgroups
dataset = fetch_20newsgroups()
dataset.data[1]
# "From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject: SI Clock Poll - Final Call\nSummary: Final call for SI clock reports\nKeywords: SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: University of Washington\nLines: 11\nNNTP-Posting-Host: carson.u.washington.edu\n\nA fair number of brave souls who upgraded their SI clock oscillator have\nshared their experiences for this poll.
Conclusion
For an NLP project in Python, an abundance of library choices exists. To help you get started, this article provided an NLP-task-driven overview with compact library explanations and code snippets. Starting with text processing, you saw how to create tokens and lemmas from a text. Continuing with syntactic analysis, you learned how to generate part-of-speech tags and the grammatical structure of sentences. And arriving at semantics, recognizing named entities in a text, as well as detecting text sentiment, can also be solved in a few lines of code. For the additional tasks of corpus management and accessing pre-structured datasets, you also saw library examples. To summarize, this article should give you a good start into your next NLP project when working on core NLP tasks.