City College, Fall 2018

Intro to Data Science

Week 6: Regression vs. Classification

October 15, 2018

Today's Agenda
  1. Regression vs. Classification
  2. Logistic Regression
  3. Measuring Performance for Classification Models
  4. Midterm Recap
Week 5 Recap
  1. Linear Regression
  2. Assumptions for Linear Models
  3. Measuring Performance for Linear Models
HW Recap
  1. How was DataCamp?
  2. Data Exercise
    • Was the exercise clear?
    • Why would it be problematic to model with two collinear predictors?
    • How would you apply linear regression to your project data?
Data Science Models
Generalized Linear Models

A flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution.
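
As a rough sketch of what this definition means in symbols (the notation here is an assumption, not taken from the slides): a GLM relates the expected response to a linear combination of the predictors through a link function g.

```latex
g\bigl(\mathbb{E}[y \mid x]\bigr) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p

% Identity link, normal errors:   g(\mu) = \mu                      (linear regression)
% Logit link, Bernoulli response: g(\mu) = \log\frac{\mu}{1 - \mu}  (logistic regression)
```

Choosing the identity link gives back ordinary linear regression; choosing the logit link gives logistic regression.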

Regression vs. Classification

Is this a good forecast?

Regression analysis estimates the conditional expectation of the dependent variable given the independent variables.


Classification is the problem of identifying to which of a set of categories a new observation belongs.

Logistic Regression

The coefficients have no closed-form solution, so the model is fit by iterative optimization, such as gradient descent.
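
To make the optimization step concrete, here is a minimal sketch of batch gradient descent for a one-feature logistic regression. The function names, learning rate, step count, and toy data are all illustrative assumptions; in practice a library solver would normally be used instead.

```python
import math


def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, n_steps=5000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by gradient descent on the mean log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            # For the log loss, the gradient is (prediction - label) times the input.
            error = sigmoid(w * x + b) - y
            grad_w += error * x / n
            grad_b += error / n
        # Step against the gradient to reduce the loss.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical toy data: hours studied vs. passed (1) or failed (0).
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
passed = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(hours, passed)
p_low = sigmoid(w * 1.0 + b)   # predicted pass probability after 1 hour
p_high = sigmoid(w * 4.0 + b)  # predicted pass probability after 4 hours
print(f"w={w:.2f}, b={b:.2f}, p(1h)={p_low:.2f}, p(4h)={p_high:.2f}")
```

The toy labels overlap on purpose (one pass at 2.0 hours, one fail at 2.5), so the weights settle at finite values rather than growing without bound.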

Logistic Regression Output

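Logistic regression and many other classification models output a continuous value between 0 and 1. That value can be read as a score or probability for the positive class; comparing it to a threshold (often 0.5) turns it into a class label.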
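
A small sketch of turning such a continuous output into a class label. The helper name `to_label`, the default threshold of 0.5, and the scores are assumptions chosen for illustration.

```python
def to_label(probability, threshold=0.5):
    """Convert a model score in [0, 1] into a class label (1 = positive, 0 = negative)."""
    return 1 if probability >= threshold else 0


scores = [0.05, 0.40, 0.50, 0.93]  # hypothetical model outputs

default_labels = [to_label(p) for p in scores]
strict_labels = [to_label(p, threshold=0.9) for p in scores]

print(default_labels)  # [0, 0, 1, 1]
print(strict_labels)   # [0, 0, 0, 1]
```

Raising the threshold makes the model more reluctant to predict the positive class, which usually trades recall for precision.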

Measuring Classification Performance
  1. Confusion Matrix
  2. Precision
  3. Recall
  4. Accuracy
(as explained by the zombie apocalypse)
Confusion Matrix
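
A confusion matrix counts how predictions line up with the true labels. The sketch below assumes binary labels where 1 means zombie and 0 means human; the function name and the data are made up for illustration.

```python
def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1  # zombie correctly flagged as zombie
        elif a == 0 and p == 1:
            counts["FP"] += 1  # human wrongly flagged as zombie
        elif a == 1 and p == 0:
            counts["FN"] += 1  # zombie missed
        else:
            counts["TN"] += 1  # human correctly left alone
    return counts


# Hypothetical labels: 4 zombies followed by 6 humans.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

matrix = confusion_matrix(actual, predicted)
print(matrix)  # {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 5}
```

Precision, recall, and accuracy are all ratios built from these four counts.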
Precision

Zombie apocalypse use case: you're hunting zombies, and you need to kill as many zombies as possible without killing any humans. Precision asks: of everyone you targeted, what fraction were actually zombies?
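
As a sketch, precision is TP / (TP + FP). The counts below are hypothetical, and returning 0.0 when nothing was predicted positive is a convention assumed here, not a universal rule.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    if tp + fp == 0:
        return 0.0  # assumed convention when no positives were predicted
    return tp / (tp + fp)


# Hypothetical hunt: 3 targets were zombies, 1 was a human.
hunt_precision = precision(tp=3, fp=1)
print(hunt_precision)  # 0.75
```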

Recall

Zombie apocalypse use case: you discover a cure for zombies, but can only apply it to k infected people. Recall asks: of all the people who are actually infected, what fraction did you find?
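
As a sketch, recall is TP / (TP + FN). The counts are hypothetical, and returning 0.0 when there are no actual positives is an assumed convention.

```python
def recall(tp, fn):
    """Fraction of actual positives that the model found."""
    if tp + fn == 0:
        return 0.0  # assumed convention when there are no actual positives
    return tp / (tp + fn)


# Hypothetical outbreak: 5 people are infected, and you identified 3 of them.
cure_recall = recall(tp=3, fn=2)
print(round(cure_recall, 2))  # 0.6
```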

Accuracy

Zombie apocalypse use case: zombies have infected roughly half the population, and you're throwing them a party. You are putting together an invite list and want to make sure you invite an equal number of zombies and humans. Accuracy asks: of everyone you labeled, what fraction did you label correctly?
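
As a sketch, accuracy is (TP + TN) / (TP + TN + FP + FN). The counts below are hypothetical; the second call illustrates why accuracy is most informative when the classes are roughly balanced, as in the use case above.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    if total == 0:
        return 0.0  # assumed convention for an empty evaluation set
    return (tp + tn) / total


# Balanced population: 50 zombies and 50 humans.
balanced = accuracy(tp=40, tn=45, fp=5, fn=10)

# Imbalanced population: 5 zombies and 95 humans, and the model calls everyone human.
lazy = accuracy(tp=0, tn=95, fp=0, fn=5)

print(round(balanced, 2), round(lazy, 2))  # 0.85 0.95
```

The second model scores higher on accuracy even though it never finds a single zombie, which is why precision and recall are reported alongside it.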

Wrap Up
  1. Linear Regression
  2. Assumptions for Linear Models
  3. Measuring Performance for Linear Models
  4. Regression vs. Classification
  5. Logistic Regression
Midterm: October 22, 6:30pm
  • 45-Minute Written Exam
  • No computer needed, closed book, closed notes
  • Part I: Multiple Choice
  • Part II: Short Answer
Review
Week 2: Where to Find Data
  • The data science lifecycle.
  • Structured vs unstructured data.
  • Common sources of data.
  • Common ways to access data.
Week 3: Processing and Cleaning Data
  • Elements of the ETL Process
  • Processing Tools
  • Data Cleaning Considerations for Data Scientists
    • Missing Values
    • Handling Outliers
    • Normalizing Data
Week 4: Statistics and the Stories We Tell Ourselves
  • Types of Data
  • Useful Statistical Distributions
  • Important Summary Statistics
  • Independence
  • Key Theorems
Week 5: Intro to Linear Models
  • What Makes Linear Regression Linear
  • Assumptions for Linear Models
  • Measuring Performance for Linear Models
Week 6: Regression vs Classification
  • Regression vs. Classification
  • Logistic Regression
  • Measuring Performance for Classification Models