When I first started learning about NLP (Natural Language Processing, i.e. processing text data), I wanted a beginner's guide that gave me a framework for understanding the topics and terminology I would need to search for when looking for learning resources. I struggled to find one. However, just this week, @omarsar0 ("elvis") on Twitter posted some mind maps that look really useful, so I'm posting them here in case other beginners find them useful too.
Text Mining Mind Map:
NLP Mind Map:
Also, here are some beginner's tutorials and code examples (in Python) that I've found really helpful for getting started:
- Machine Learning with Text in Scikit-Learn (Kevin Markham)
- Natural Language Processing with Python (DataQuest)
- Natural Language Processing Course (Kaggle)
...and a useful book is "Applied Text Analysis with Python"
There are many more complicated "state of the art" (SOTA) methods not covered in the resources above (e.g. Word2Vec, GloVe, ELMo, BERT, and SOTA models since BERT) but I recommend staying away from those until you understand text mining with the traditional methods.
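To make "traditional methods" a little more concrete, here's a minimal sketch of a bag-of-words representation, one of the classic building blocks covered in the tutorials above. It assumes scikit-learn is installed, and the three example sentences are made up purely for illustration.

```python
# A bag-of-words sketch: each document becomes a row of word counts.
from sklearn.feature_extraction.text import CountVectorizer

# made-up example documents (not from any real dataset)
docs = [
    "the food was great",
    "the service was slow",
    "great food and great service",
]

# learn the vocabulary and count how often each word occurs in each document
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# vocabulary_ maps each word to its column index; sort words by that index
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(vocab)             # one entry per column
print(counts.toarray())  # one row per document, one column per word
```

Each row of the printed matrix is one document and each column is one word from the vocabulary, so "great" shows up with a count of 2 in the third row. Word order is thrown away, which is why it's called a "bag" of words.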
There are also many different tasks that can be performed using NLP techniques (e.g. translating between languages, summarising text, question answering, and more). I recommend starting out with "text classification" or "sentiment analysis" (which is a type of text classification). There are lots of free tutorials and examples online for sentiment analysis, e.g. trying to classify whether a Yelp review is positive or negative.
Perhaps even before that, I'd recommend importing some text data and creating a wordcloud (this tutorial will help). If you don't know what a wordcloud is, below is an example. It's a way to visualise the frequency of each word in some text: the more often a word appears, the larger it is drawn.
I created the wordcloud above using this code:
# import matplotlib so that the wordcloud can be displayed
import matplotlib.pyplot as plt
# Jupyter-only magic to show plots inline; remove this line in a plain Python script
%matplotlib inline
# import wordcloud so that the wordcloud can be created
from wordcloud import WordCloud
# create a string of text
text_string = "NLP, NLP, NLP, NLP, NLP, NLP, NLP, NLP, NLP, \
text, text, text, spacy, spacy,\
sentiment analysis, translation, stopwords,\
tokenisation, tokenisation, tokenisation,\
part-of-speech tagging, bag of words, TF-IDF,\
embedding, summarisation, language modelling,\
question answering, text classification,\
text classification, RNN, LSTM"
# create a wordcloud from the string of text
my_wordcloud = WordCloud(background_color="white",
                         max_words=50,
                         ).generate(text_string)
# display the wordcloud
plt.imshow(my_wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
Note: you may need to install wordcloud first (e.g. with !pip install wordcloud if you're writing Python code in a Jupyter Notebook).
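Once you've made a wordcloud, the sentiment analysis starting point I mentioned earlier is a natural next step. Here's a rough sketch of what that can look like, assuming scikit-learn is installed. The six "reviews" and their labels are invented for illustration (a real project would use a proper dataset such as Yelp reviews), and Naive Bayes is just one common choice of traditional classifier, not the only option.

```python
# A tiny sentiment classifier: bag-of-words counts fed into Naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# made-up training reviews and labels (far too few for real use)
train_texts = [
    "great food and friendly staff",
    "loved it, will come back",
    "delicious meal, great value",
    "terrible service and cold food",
    "awful experience, never again",
    "rude staff and awful food",
]
train_labels = [
    "positive", "positive", "positive",
    "negative", "negative", "negative",
]

# chain the two steps so raw text goes in and a label comes out
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# classify two new (also made-up) reviews
new_reviews = ["great meal, friendly staff", "awful service, rude staff"]
predictions = list(model.predict(new_reviews))

for review, label in zip(new_reviews, predictions):
    print(f"{label}: {review}")
```

With this little data the model is only picking up on a handful of words like "great" and "awful", but the overall shape (turn text into counts, fit a classifier, predict on new text) is the same one the tutorials above walk through on real datasets.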
NLP is a massive field and it can be daunting and confusing to get started. It's a really interesting field though and well worth the effort. I'm planning on writing more about NLP in the future, as I'm learning a lot about it as part of my Data Science MSc project. In the meantime, I hope the resources I've mentioned here can help to make the journey a bit easier for total beginners.