[ML 3] NLP 🗣️fundamentals

🌟What is NLP?

📚NLP (Natural Language Processing) analyzes natural language in its many forms:

  1. Text 📚 (webpage 💻, SMS 📲, email 📨, and menus 🍽️)

  2. Audio 🔊 (Siri)

  3. Signs and gestures 🖖

  4. Others (songs 🎤, music sheet 🎼, and Morse code 🧑‍💻)

🌟There are countless more examples of natural languages that provide a more direct interaction between machines 🤖 and humans 🧑‍💻.

💫NLP dates back to the 1950s, when Alan Turing's paper "Computing Machinery and Intelligence" 📃 proposed what is now called the Turing Test: can a computer 🤖 convince a human 🧑‍💻 that it is also human?

Tokenization

🌟Tokenization splits text 📃 into fragments (tokens): words, characters, or sentences. A cleanup step often follows to remove details that are redundant for the task (punctuation marks ⁉️, emoticons 😀, and digits 🔢).
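
💻As a minimal sketch of the idea, the snippet below uses only Python's standard library rather than NLTK. The helper names split_sentences and split_words are hypothetical, and the regular expressions are a deliberately naive assumption: they split sentences after ., ! or ? and keep only alphabetic words, so punctuation and tokens containing digits are dropped. Real tokenizers handle many more edge cases (abbreviations, URLs, emoticons).

```python
import re


def split_sentences(text):
    """Naively split text into sentences after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part.strip() for part in parts if part.strip()]


def split_words(text):
    """Return alphabetic words only; punctuation and digit tokens are dropped."""
    # \b keeps "2nd" from leaking through as "nd"; apostrophes inside a word are kept.
    return re.findall(r"\b[A-Za-z]+(?:'[A-Za-z]+)*\b", text)


sample = "I'm reading a book. It is the 2nd Edition!"
print(split_sentences(sample))
print(split_words(sample))
```

Note that this sketch keeps "I'm" as one token, whereas NLTK's word_tokenize splits contractions and keeps punctuation marks as separate tokens.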

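import nltk
from nltk.corpus import names  # corpus of first names, used by names.words() below
from nltk.tokenize import word_tokenize  # to tokenize words
from nltk.tokenize import sent_tokenize  # to tokenize sentences

nltk.download('punkt')  # tokenizer models (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('names')  # data for the names corpus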

sent = """

    I'm reading a book.
    It is Python Machine Learning By Example,
    2nd Edition, by Yuxi (Hayden) Liu.

"""

print(names.words()[:10])

print(word_tokenize(sent))

print(sent_tokenize(sent))

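💻The code snippet prints three things in order: the first ten entries of the names corpus, the word tokens of sent (punctuation marks come out as separate tokens, and "I'm" is split into "I" and "'m"), and the sentences found in sent.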

📰Newsgroup data

📰The 20 Newsgroups dataset contains roughly 20,000 documents spread across 20 online newsgroups. 🌟

from sklearn.datasets import fetch_20newsgroups

groups = fetch_20newsgroups()

groups.keys()

📰The returned object behaves like a key-value dictionary; groups.keys() ⬆️ lists the keys under which the data is stored.

groups.target_names

📰target_names holds the names of the newsgroups; groups.target 🔽 stores the same labels encoded as integers 🔢.

groups.target
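
💻To make the integer encoding concrete, here is a minimal sketch on a toy example. The three newsgroup names are a hand-picked subset and the labels list is made up for illustration, so the ids below are not the ones the real dataset assigns; the real groups.target comes from the dataset itself.

```python
# Toy subset of newsgroup names (assumption: not the full list of 20).
target_names = ["alt.atheism", "comp.graphics", "sci.space"]

# Each name is encoded as its position in target_names.
name_to_id = {name: idx for idx, name in enumerate(target_names)}

# Made-up document labels, one per document.
labels = ["sci.space", "alt.atheism", "sci.space"]

# Encode names as integers, then decode them back to names.
target = [name_to_id[name] for name in labels]
decoded = [target_names[idx] for idx in target]

print(target)
print(decoded)
```

Indexing target_names with an integer label recovers the readable name, which is how groups.target and groups.target_names relate to each other.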

import seaborn as sns

sns.histplot(groups.target, discrete=True)  # histplot replaces distplot, which is deprecated since seaborn 0.11

💻The Seaborn package 🎒produces a histogram of the topics to measure how the news 📰 categories are distributed.
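
💻The histogram boils down to counting how many documents carry each label. Below is a minimal sketch of that count, assuming a made-up target list and a hypothetical helper label_distribution (neither comes from the dataset or from Seaborn). A roughly flat distribution suggests the categories are balanced.

```python
from collections import Counter


def label_distribution(target, target_names):
    """Return the share of documents per category name."""
    counts = Counter(target)
    total = len(target)
    return {
        name: counts.get(idx, 0) / total
        for idx, name in enumerate(target_names)
    }


# Made-up labels for eight documents across three toy categories.
toy_names = ["alt.atheism", "comp.graphics", "sci.space"]
toy_target = [0, 1, 1, 2, 2, 2, 2, 1]

distribution = label_distribution(toy_target, toy_names)
for name, share in distribution.items():
    print(f"{name}: {share:.3f}")
```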

💻Seaborn Installation Guide 🎓

python -m pip install -U matplotlib
pip install seaborn

🌟Seaborn draws its histogram 📊 on top of the matplotlib library, so pip installs both libraries.

conda install -c conda-forge matplotlib
conda install seaborn

🌟The same installation with conda 🎓⬆️.