City College, Spring 2019

Intro to Data Science

Week 6: Regression vs. Classification

March 18, 2019

Today's Agenda
  1. Regression vs. Classification
  2. Logistic Regression
  3. Measuring Performance for Classification Models
  4. Midterm Recap
Week 5 Recap
  1. Linear Regression
  2. Assumptions for Linear Models
  3. Measuring Performance for Linear Models
Data Science Models
Key Metrics: Linear Regression
  1. R-squared
  2. Adjusted R-squared
  3. Coefficients
  4. P-values
R-Squared: Share of the target variation that is explained by the model.
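
As a rough illustration of the definition above, R-squared can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the sample numbers are made up for this sketch and do not come from the course materials.

```python
def r_squared(y_true, y_pred):
    """Share of the variation in y_true that the predictions explain."""
    mean_y = sum(y_true) / len(y_true)
    # Variation left unexplained by the model (residual sum of squares).
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # Total variation of the target around its mean.
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical targets and model predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]

print(round(r_squared(y_true, y_pred), 3))  # 0.975
```

A model that always predicts the mean of the target scores 0, and a model that predicts every value exactly scores 1.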

Generalized Linear Models

A flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution.

Regression vs. Classification

Is this a good forecast?

Regression analysis estimates the conditional expectation of the dependent variable given the independent variables.


Classification is the problem of identifying to which of a set of categories a new observation belongs.

Logistic Regression

No closed-form solution: the coefficients are found by iterative optimization, such as gradient descent.
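
A minimal sketch of what fitting by gradient descent can look like, assuming a single feature, batch updates, and log loss as the objective. The names `fit_logistic`, `lr`, and `steps`, along with the toy data, are assumptions made for illustration. In practice a library routine would normally be used instead.

```python
import math


def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit p(y=1 | x) = sigmoid(w*x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction errors under the current parameters.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradient of the average log loss with respect to w and b.
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        # Step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical data: larger x values are more often labeled 1.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(xs, ys)
print(w, b)
print(sigmoid(w * 0.5 + b), sigmoid(w * 4.0 + b))
```

The two labels that overlap in the middle of the toy data keep the classes from being perfectly separable, so the fitted coefficients settle at finite values.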

Logistic Regression Output

Logistic regression and many other classification models output a continuous value between 0 and 1.
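
A value between 0 and 1 only becomes a class label once a cutoff is chosen. The sketch below assumes a default cutoff of 0.5. The helper name `to_labels` and the scores are hypothetical.

```python
def to_labels(probabilities, threshold=0.5):
    """Turn model outputs in [0, 1] into class labels 0 or 1."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical model outputs for four observations.
scores = [0.05, 0.40, 0.50, 0.93]

labels = to_labels(scores)                 # [0, 0, 1, 1]
strict = to_labels(scores, threshold=0.9)  # [0, 0, 0, 1]
print(labels, strict)
```

Raising the threshold makes the model more reluctant to predict the positive class, which tends to trade recall for precision.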

Measuring Classification Performance
  1. Confusion Matrix
  2. Precision
  3. Recall
  4. Accuracy
(as explained by the zombie apocalypse)
Confusion Matrix
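
A confusion matrix counts how often each actual class is paired with each predicted class. A minimal sketch for the two-class case, assuming 1 marks the positive class (say, zombie) and 0 the negative class (human). The function name `confusion_counts` and the labels are invented for illustration.

```python
def confusion_counts(actual, predicted):
    """Count the four outcomes of a binary classifier (1 = positive class)."""
    pairs = list(zip(actual, predicted))
    return {
        "tp": sum(1 for a, p in pairs if a == 1 and p == 1),  # true positives
        "fp": sum(1 for a, p in pairs if a == 0 and p == 1),  # false positives
        "fn": sum(1 for a, p in pairs if a == 1 and p == 0),  # false negatives
        "tn": sum(1 for a, p in pairs if a == 0 and p == 0),  # true negatives
    }


# Hypothetical labels: 1 = zombie, 0 = human.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

counts = confusion_counts(actual, predicted)
print(counts)  # {'tp': 3, 'fp': 1, 'fn': 1, 'tn': 3}
```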
Precision

zombie apocalypse use case: you're hunting zombies, and you need to kill as many zombies as possible without killing any humans

Recall

zombie apocalypse use case: you discover a cure for zombies, but can only apply it to k infected people

Accuracy

zombie apocalypse use case: zombies have infected roughly half the population, and you're throwing them a party. you are putting together an invite list and want to make sure you invite an equal number of zombies and humans.
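
Precision, recall, and accuracy can all be computed from the four confusion-matrix counts. The sketch below uses their standard definitions. The counts are hypothetical, with "zombie" treated as the positive class.

```python
def precision(tp, fp):
    """Of everything predicted positive, the share that really is positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Of everything that really is positive, the share the model found."""
    return tp / (tp + fn)


def accuracy(tp, fp, fn, tn):
    """Share of all predictions, positive or negative, that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical counts for 100 individuals (positive = zombie).
tp, fp, fn, tn = 40, 10, 20, 30

print(round(precision(tp, fp), 3))         # 0.8
print(round(recall(tp, fn), 3))            # 0.667
print(round(accuracy(tp, fp, fn, tn), 3))  # 0.7
```

Accuracy is most informative when the classes are roughly balanced, as in the party example. With rare positives, a model can score high accuracy while its precision or recall is poor.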

Midterm: March 25, 6:30pm
  • 90-Minute Written Exam
  • No computer needed, closed book, closed notes
  • Part I: Multiple Choice
  • Part II: Short Answer
Review
Week 2: Where to Find Data
  • The data science lifecycle.
  • Structured vs unstructured data.
  • Common sources of data.
  • Common ways to access data.
Week 3: Processing and Cleaning Data
  • Elements of the ETL Process
  • Processing Tools
  • Data Cleaning Considerations for Data Scientists
    • Missing Values
    • Handling Outliers
    • Normalizing Data
Week 4: Statistics and the Stories We Tell Ourselves
  • Types of Data
  • Useful Statistical Distributions
  • Important Summary Statistics
  • Independence
  • Key Theorems
Week 5: Intro to Linear Models
  • What Makes Linear Regression Linear
  • Assumptions for Linear Models
  • Measuring Performance for Linear Models
Week 6: Regression vs. Classification
  • Regression vs. Classification
  • Logistic Regression
  • Measuring Performance for Classification Models