City College, Fall 2019

Intro to Data Science

Week 11: NLP, Text as Data, and Bayes Rule

November 18, 2019

Today's Agenda
  1. Project Notes
  2. What is NLP?
  3. Bag of Words Strategies
  4. Bayesian Analysis
  5. Tools and Topics in NLP
  6. Exam Results
Reminder: First Project Submission Due November 23 by 11:59pm

Due this Saturday:

  1. An email to the professor containing:
    1. A creative team name, the names of all members of the team, and a link to the team’s repository on GitHub. The repository need not be public, but access must be granted to the professor (grantmlong).
    2. An attached csv with predictions against test2.csv. The csv must consist exclusively of two columns, a header row with the titles rental_id and predictions, and the 2,000 rental ids with their corresponding rent predictions. An example of the required formatting can be found here, with suggested methodology for creating the submission file here.

  2. A markdown file posted in the project GitHub repo entitled initial_findings.md containing:
    1. A 200-300 word explanation of the expected performance of the model in terms of mean squared error and the key features driving the team’s modeling performance.
    2. A 200-300 word summary outlining the team’s intended strategy to improve the predictions for the final round.
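
As a rough illustration of the two-column prediction file described in item 1 above, here is a minimal sketch using only Python's standard library. The function name write_submission and the example ids and predictions are hypothetical; this is not the linked methodology, just one way to produce a csv with the required rental_id and predictions header.

```python
import csv
import io


def write_submission(rows, out_file):
    """Write (rental_id, prediction) pairs as a two-column csv with a header row."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["rental_id", "predictions"])
    for rental_id, prediction in rows:
        writer.writerow([rental_id, prediction])


# Hypothetical ids and predictions, for illustration only.
example_rows = [(101, 2450.0), (102, 3100.5)]

# Write to an in-memory buffer so the result can be inspected before saving.
buffer = io.StringIO()
write_submission(example_rows, buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

To save to disk instead, the same function can be passed a file opened with open(path, "w", newline=""), where path is whatever filename the team chooses.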

Issues with Multicollinearity

Models with multicollinearity can still be useful.

What is NLP?

Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages.

  1. Identify the structure and meaning of words, sentences, texts and conversations.
  2. Deep understanding of broad language.

Source.
NLP Applications: Quora Questions
NLP Applications: Uber One Click Chat
NLP Applications: Facebook Translation

Was this a prequel worth making?

More workmanlike than magical, "Fantastic Beasts: The Crimes of Grindelwald" nevertheless feels like an upgrade from its predecessor, one that adds star power, introduces key characters and lays the foundation for a genuine "Wizarding World" franchise. To call J.K. Rowling's mythology-heavy plot dense would be an understatement, but the film has enough epic heft to feel like a genuine blockbuster.

The labored second chapter in J.K. Rowling's Harry Potter spinoff series is as cumbersome as its title. "Fantastic Beasts: The Crimes of Grindelwald" is a gangly, overly complicated snooze, a rudderless, magic-free visit into Rowling's world of wizards and wizarding. Even the beasts aren't all that fantastic, and the visual effects aren't either.
How do we turn this into a model?
Bag of Words Strategies

Map words to feature vectors.
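
One way to picture the mapping from words to feature vectors is a simple count vector: build a vocabulary from the documents, then count how often each vocabulary word appears in a given text. The sketch below uses only the standard library; the helper names (tokenize, build_vocabulary, to_count_vector) and the two short example documents are assumptions made for illustration.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(documents):
    """Assign each distinct word a column index, in sorted order."""
    words = sorted({word for doc in documents for word in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}


def to_count_vector(text, vocabulary):
    """Return a list of word counts, one entry per vocabulary word."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[word]] = count
    return vector


# Hypothetical mini-corpus, for illustration only.
docs = ["the film is a snooze", "the film is a genuine blockbuster"]
vocab = build_vocabulary(docs)
vectors = [to_count_vector(doc, vocab) for doc in docs]
print(sorted(vocab))
print(vectors)
```

Each position in a vector corresponds to one vocabulary word, which is what makes the vectors easy to interpret.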

Bag of Words Strategies

Advantages:

  • Easy to calculate
  • Vectors are interpretable


Disadvantages:

  • Word order often contains important context!
  • Related: Interactions are hard (but not impossible) to model
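
To make the word-order disadvantage concrete, the sketch below compares two sentences that contain the same words in a different order. Counting single words cannot tell them apart, while counting pairs of adjacent words (bigrams) can. Bigrams are just one assumed way to capture some interactions; the function name ngram_counts and the example sentences are hypothetical.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count the n-grams (tuples of n consecutive words) in a text."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


first = "dog bites man"
second = "man bites dog"

# Single-word counts are identical, so a plain bag of words sees no difference.
same_unigrams = ngram_counts(first, 1) == ngram_counts(second, 1)

# Adjacent word pairs differ, so some ordering information is recovered.
same_bigrams = ngram_counts(first, 2) == ngram_counts(second, 2)

print(same_unigrams, same_bigrams)
```

The trade-off is that the number of possible bigrams grows much faster than the number of words, so the feature vectors get larger and sparser.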

Data Science Models

Bey's Bayesian Analysis

Bayes's Rule

Know this and how to apply it before any interview. It can get tricky.
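
For reference, the standard statement of Bayes's Rule is written out below, followed by a small worked example. The events A and B and all of the numbers are hypothetical, chosen only to show the step that tends to trip people up: expanding the denominator over both cases.

```latex
% Bayes's Rule for events A and B, with P(B) > 0
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}

% Expanding the denominator with the law of total probability
P(B) = P(B \mid A)\, P(A) + P(B \mid \neg A)\, P(\neg A)

% Worked example with hypothetical numbers:
%   P(A) = 0.2, \quad P(B \mid A) = 0.5, \quad P(B \mid \neg A) = 0.05
P(A \mid B) = \frac{0.5 \times 0.2}{0.5 \times 0.2 + 0.05 \times 0.8}
            = \frac{0.10}{0.14} \approx 0.71
```

Even though B is ten times more likely under A than under not-A, the posterior is about 0.71 rather than 0.9 or higher, because A starts out with a prior of only 0.2.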
Bayes's Rule and Text
From Bayes's Rule to Naive Bayes

How to calculate?


Make some assumptions.
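
A sketch of the usual derivation, in notation assumed here rather than taken from the slides: c is a class (for example, a positive or negative review) and w_1, ..., w_n are the words in a document.

```latex
% Bayes's Rule applied to a document with words w_1, ..., w_n and class c
P(c \mid w_1, \dots, w_n) = \frac{P(w_1, \dots, w_n \mid c)\, P(c)}{P(w_1, \dots, w_n)}

% The joint likelihood is hard to estimate directly, so assume the words
% are conditionally independent given the class:
P(w_1, \dots, w_n \mid c) \approx \prod_{i=1}^{n} P(w_i \mid c)

% The denominator is the same for every class, so the predicted class is
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```

Each factor P(w_i | c) can be estimated from word counts within a class, which is what makes the calculation tractable.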

What Makes Naive Bayes Naive?

The conditional independence assumption: given the document's class, the presence of one word tells us nothing about the presence of any other. For this to hold true, we're assuming no words are more likely to appear with each other than any others.


Examples: "hot" and "dog", "ball" and "game", "harry" and "potter", "computer" and "science"
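
Putting the pieces together, here is a minimal Naive Bayes text classifier using only the standard library. It is a sketch under stated assumptions, not a reference implementation: the function names, the four-document training set, and the use of add-one smoothing and log probabilities are all choices made for this illustration.

```python
from collections import Counter, defaultdict
import math


def train_naive_bayes(labeled_docs):
    """Collect per-class document counts and per-class word counts."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def predict(text, class_counts, word_counts, vocabulary):
    """Return the class with the highest log posterior score."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        # Start from the log prior, log P(class).
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # skip words never seen in training
            # log P(word | class) with add-one smoothing (an assumption of this sketch)
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Tiny hypothetical training set, for illustration only.
training = [
    ("a genuine blockbuster", "positive"),
    ("epic and magical", "positive"),
    ("a complicated snooze", "negative"),
    ("cumbersome and labored", "negative"),
]
model = train_naive_bayes(training)
print(predict("a magical blockbuster", *model))
print(predict("a labored snooze", *model))
```

Each word contributes its own term to the score independently of its neighbors, which is the naive assumption in action: "harry" and "potter" would be treated as two unrelated pieces of evidence.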

Tools and Topics in NLP

Other Applications of Text Analysis:
  • Part of Speech Tagging
  • Named Entity Recognition
  • Dependency Parsing
  • Topic Modeling

For more, check out the Stanford parser and named entity recognizer and this interactive topic modeling explorer.

Helpful Open Source Tools
  • scikit-learn: Has helpful tools for featurization and basic text analysis.
  • NLTK: Python package offering a wide array of functionality, including sentiment analysis, part of speech tagging, and more.
  • Gensim: Python package offering specialized tools for topic modeling and a few other domains.
  • Stanford CoreNLP: Java library offering a wide array of functionality and state-of-the-art performance. (Easy interface with Python available through NLTK.)
Let's work some magic.

Exam Results
                  Multiple Choice   Short Answer   Exam Total
Points Possible   70                66             136
Mean              56                48             104
Median            57                49             104
Std Dev           5.4               11.9           16.0

Answer key available on the course page.