Learn NLP the Easy Way — Text Representation

Chris Kuo / Dr. Dataman
21 min read · May 9, 2024

A computer operates on zeros and ones, and algorithms operate on numerical values. A computer does not understand beautiful texts such as the plays of William Shakespeare or the novels of Leo Tolstoy. Raw text therefore has to be converted into numerical values before an algorithm can process it, and this conversion is the first step in NLP.
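
To make the idea concrete, here is a minimal sketch of turning text into numbers. The sentence and the tiny vocabulary are invented for illustration; each word is simply mapped to an integer id:

```python
# A minimal sketch: map each word in a tiny, made-up vocabulary to an
# integer id, then represent a sentence as a sequence of those ids.
sentence = "to be or not to be"
vocabulary = {"to": 0, "be": 1, "or": 2, "not": 3}

ids = [vocabulary[word] for word in sentence.split()]
print(ids)  # [0, 1, 2, 3, 0, 1]
```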

This chapter is for absolute NLP beginners. We will learn about the basic text representations: Bag-of-Words, Bag-of-N-grams, and TF-IDF. We will learn how to code them with Gensim, scikit-learn, and NLTK. We will cover the following topics:

  • What text representation is
  • The transition from one-hot encoding to Bag-of-Words to Bag-of-N-grams
  • What TF-IDF is
  • How to perform Bag-of-Words (BoW) and TF-IDF encoding in Gensim (see the sketch after this list)
  • The real-world applications of BoW and TF-IDF
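
To preview the kind of code we will write, here is a minimal Gensim sketch of BoW and TF-IDF encoding. The toy sentences are invented for illustration; the chapter walks through the API step by step:

```python
from gensim import corpora, models

# A tiny toy corpus; these sentences are made up for illustration.
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs can be friends",
]

# Tokenize by whitespace; a real pipeline would also normalize and remove stop words.
tokenized = [doc.lower().split() for doc in documents]

# Map every unique token to an integer id.
dictionary = corpora.Dictionary(tokenized)

# Bag-of-Words: each document becomes a list of (token_id, count) pairs.
bow_corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]
print(bow_corpus[0])          # e.g. [(0, 1), (1, 1), (2, 1), (3, 1), (4, 2)]

# TF-IDF: reweight the counts by how informative each token is across the corpus.
tfidf = models.TfidfModel(bow_corpus)
print(tfidf[bow_corpus[0]])   # (token_id, weight) pairs
```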

By the end of this chapter, you will be able to describe the BoW, Bag-of-N-grams, and TF-IDF methods and their advantages and disadvantages. You will be able to name some real-world applications of BoW and TF-IDF, and you will be able to encode raw texts with these techniques in Gensim, scikit-learn, and NLTK.

To learn NLP the easy way, you can visit “The Handbook of NLP with Gensim.”
