City College, Spring 2019

Intro to Data Science

Week 11: NLP, Text as Data, and Bayes Rule

April 29, 2019

Today's Agenda
  1. What is NLP?
  2. Bag of Words Strategies
  3. Bayesian Analysis
  4. Tools and Topics in NLP
What is NLP?

Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages.

  1. Identify the structure and meaning of words, sentences, texts and conversations.
  2. Deep understanding of broad language.

Source.
NLP Applications: Quora Questions
NLP Applications: Uber One Click Chat
NLP Applications: Facebook Translation

Was this a prequel worth making?

More workmanlike than magical, "Fantastic Beasts: The Crimes of Grindelwald" nevertheless feels like an upgrade from its predecessor, one that adds star power, introduces key characters and lays the foundation for a genuine "Wizarding World" franchise. To call J.K. Rowling's mythology-heavy plot dense would be an understatement, but the film has enough epic heft to feel like a genuine blockbuster.

The labored second chapter in J.K. Rowling's Harry Potter spinoff series is as cumbersome as its title. "Fantastic Beasts: The Crimes of Grindelwald" is a gangly, overly complicated snooze, a rudderless, magic-free visit into Rowling's world of wizards and wizarding. Even the beasts aren't all that fantastic, and the visual effects aren't either.
How do we turn this into a model?
Bag of Words Strategies

Map words to feature vectors.
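As a minimal sketch of this mapping, the snippet below counts words with the Python standard library. The helper names (`tokenize`, `build_vocabulary`, `bag_of_words`) and the two short documents are illustrative assumptions, not part of the lecture; in practice a library vectorizer would usually do this work.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_vocabulary(documents):
    """Map each distinct word to a column index, in sorted order."""
    words = sorted({word for doc in documents for word in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}

def bag_of_words(text, vocabulary):
    """Return a count vector with one entry per vocabulary word."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:
            vector[vocabulary[word]] = count
    return vector

# Hypothetical mini-corpus, loosely echoing the reviews above.
docs = [
    "the film feels like a genuine blockbuster",
    "the film is a snooze and the effects are not fantastic",
]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]
```

Each document becomes a fixed-length list of counts, and each position can be read back as a specific word, which is why these vectors are easy to interpret.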

Bag of Words Strategies

Advantages:

  • Easy to calculate
  • Vectors are interpretable


Disadvantages:

  • Word order often contains important context!
  • Related: Interactions are hard (but not impossible) to model
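To make the word-order point concrete, here is a small sketch with two made-up sentences. Their single-word counts are identical, so a plain bag of words cannot tell them apart. Counting adjacent word pairs (bigrams) is one common way to recover some order; the `ngrams` helper is a hypothetical name used only for this illustration.

```python
from collections import Counter

def ngrams(text, n):
    """Return the list of n-word tuples in the text, in order."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = "dog bites man"
b = "man bites dog"

# Same words, so the unigram bags match exactly.
same_unigrams = Counter(ngrams(a, 1)) == Counter(ngrams(b, 1))
# Different adjacent pairs, so the bigram bags differ.
same_bigrams = Counter(ngrams(a, 2)) == Counter(ngrams(b, 2))
```

The trade-off is that the number of possible bigrams grows much faster than the number of words, so the feature vectors get larger and sparser.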

Data Science Models

Bey's Bayesian Analysis

Bayes's Rule

Know this and how to apply it before any interview. It can get tricky.
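For practice, Bayes's rule states that P(A | B) = P(B | A) P(A) / P(B). The sketch below applies it to a made-up question: if a review contains a particular word, how likely is the review to be positive? All three input probabilities are invented for illustration, and the `posterior` helper is a hypothetical name.

```python
from fractions import Fraction

def posterior(prior, likelihood, likelihood_given_not):
    """P(A | B) via Bayes's rule, expanding P(B) by total probability."""
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence

# Assumed numbers, not from the lecture:
p_positive = Fraction(4, 10)             # P(positive)
p_word_given_positive = Fraction(3, 10)  # P(word | positive)
p_word_given_negative = Fraction(1, 10)  # P(word | negative)

result = posterior(p_positive, p_word_given_positive, p_word_given_negative)
# result is Fraction(2, 3): seeing the word raises P(positive) from 0.4 to about 0.67
```

The tricky part in interview questions is usually the denominator: P(B) has to account for every way B can happen, here both positive and negative reviews.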
Bayes's Rule and Text
From Bayes's Rule to Naive Bayes

How to calculate?


Make some assumptions.
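One standard way to write this step, using notation assumed here rather than taken from the slides (c is a class such as positive or negative, and w_1 through w_n are the words in a document):

```latex
P(c \mid w_1, \dots, w_n)
  = \frac{P(w_1, \dots, w_n \mid c)\, P(c)}{P(w_1, \dots, w_n)}
  \propto P(c) \prod_{i=1}^{n} P(w_i \mid c)
```

The equality is Bayes's rule. The final step drops the denominator, which is the same for every class, and replaces the joint likelihood with a product of per-word probabilities, which is only valid if the words are independent of one another given the class.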

What Makes Naive Bayes Naive?

The conditional independence assumption: given the class of a document, each word is assumed to appear independently of every other word. For this to hold true, no words can be more likely to appear with each other than with any others.


Examples of word pairs that break this assumption: "hot" and "dog", "ball" and "game", "harry" and "potter", "computer" and "science"
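A minimal sketch of a Naive Bayes text classifier follows, using only the standard library. The four training snippets are invented, the function names are hypothetical, and add-one (Laplace) smoothing is an assumption added here so that a word unseen in one class does not zero out the whole product. Scores are summed in log space to avoid multiplying many small probabilities.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Count class frequencies and per-class word frequencies."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary

def predict(text, class_counts, word_counts, vocabulary):
    """Return the class with the highest log posterior score."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log P(c)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # log P(w | c) with add-one smoothing (an assumption of this sketch)
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

# Invented training data for illustration only.
training = [
    ("a genuine blockbuster with epic heft", "positive"),
    ("an upgrade with star power", "positive"),
    ("a cumbersome complicated snooze", "negative"),
    ("rudderless and overly complicated", "negative"),
]
model = train(training)
label = predict("complicated snooze", *model)
```

Because each word contributes its own independent term, the model treats "harry" and "potter" as two separate pieces of evidence even though they almost always appear together.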

Tools and Topics in NLP

Other Applications of Text Analysis:
  • Part of Speech Tagging
  • Named Entity Recognition
  • Dependency Parsing
  • Topic Modeling

For more, check out the Stanford parser and named entity recognizer and this interactive topic modeling explorer.

Helpful Open Source Tools
  • scikit-learn: Has helpful tools for featurization and basic text analysis.
  • NLTK: Python package offering a wide array of functionality, including sentiment analysis, part of speech tagging, and more.
  • Gensim: Python package offering specialized tools for topic modeling and a few other domains.
  • Stanford CoreNLP: Java library offering a wide array of functionality and state-of-the-art performance. (Easy interface with Python available through NLTK.)
Let's work some magic.

Assignment 8: Due Monday, May 6 by 6:30pm

DataCamp's Natural Language Processing Fundamentals in Python

  • This will be the final homework assignment for the semester! Remember you are able to drop two.
  • The course should appear as a single assignment within your existing DataCamp account.
  • Each section will appear separately and will be worth one point toward the total grade for the homework, plus an additional point for overall effort.
  • Course claims to take 4 hours - as always, use your time wisely.