Intro to Data Science

Week 6: Regression vs. Classification

March 18, 2019

Today's Agenda
1. Regression vs. Classification
2. Logistic Regression
3. Measuring Performance for Classification Models
4. Midterm Recap
Week 5 Recap
1. Linear Regression
2. Assumptions for Linear Models
3. Measuring Performance for Linear Models
Key Metrics: Linear Regression
1. R-squared
3. Coefficients
4. P-values
R-Squared: Share of the target variation that is explained by the model.
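As a rough illustration of that definition, R-squared can be computed as 1 minus the ratio of the residual sum of squares to the total sum of squares. The function name and the sample numbers below are made up for this sketch and are not from the course materials.

```python
def r_squared(y_true, y_pred):
    """Share of the variation in y_true that the predictions explain."""
    mean_y = sum(y_true) / len(y_true)
    # Variation left unexplained by the model (residual sum of squares).
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # Total variation of the target around its mean.
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical targets and model predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.5, 8.5]

score = r_squared(y_true, y_pred)
print(round(score, 2))  # approximately 0.95

# A model that always predicts the mean explains none of the variation.
baseline = r_squared(y_true, [6.0] * 4)
```

Under this definition, always predicting the mean of the target scores 0, and a perfect fit scores 1.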

Generalized Linear Models

A flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution.

Regression vs. Classification

Is this a good forecast?

Regression analysis estimates the conditional expectation of the dependent variable given the independent variables.

Classification is the problem of identifying to which of a set of categories a new observation belongs.

Logistic Regression Output

Logistic regression and many other classification models output a continuous value between 0 and 1.
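A minimal sketch of where that 0-to-1 value comes from, assuming a single feature: the logistic (sigmoid) function squashes a linear score into the range (0, 1), and a threshold turns that score into a class label. The intercept, coefficient, and 0.5 threshold are hypothetical values chosen for illustration, not fitted parameters.

```python
import math


def sigmoid(z):
    """Map any moderate real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, intercept, coef):
    """Model output: a continuous score between 0 and 1."""
    return sigmoid(intercept + coef * x)


def predict_label(x, intercept, coef, threshold=0.5):
    """Turn the continuous score into a class label (1 or 0)."""
    return 1 if predict_proba(x, intercept, coef) >= threshold else 0


# Hypothetical parameters for illustration only.
intercept, coef = -4.0, 2.0

for x in [0.0, 2.0, 4.0]:
    p = predict_proba(x, intercept, coef)
    print(x, round(p, 3), predict_label(x, intercept, coef))
```

The threshold is a modeling choice: raising it generally makes positive predictions rarer, which tends to trade recall for precision.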

Measuring Classification Performance
1. Confusion Matrix
2. Precision
3. Recall
4. Accuracy
(as explained by the zombie apocalypse)
Precision

zombie apocalypse use case: you're hunting zombies, and you need to kill as many zombies as possible without killing any humans

Recall

zombie apocalypse use case: you discover a cure for zombies, but can only apply it to k infected people

Accuracy

zombie apocalypse use case: zombies have infected roughly half the population, and you're throwing them a party. You are putting together an invite list and want to make sure you invite an equal number of zombies and humans.
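The three metrics can all be derived from the four confusion matrix counts. The sketch below assumes the label 1 means "zombie" and 0 means "human"; the labels and predictions are invented for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # zombies flagged as zombies
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # humans flagged as zombies
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # zombies that were missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # humans left alone
    return tp, fp, fn, tn


def precision(tp, fp):
    """Of everyone flagged as a zombie, what share really is one?"""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Of all real zombies, what share did we flag?"""
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp, fp, fn, tn):
    """Share of all predictions, zombie or human, that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical ground truth and model predictions (1 = zombie, 0 = human).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)  # 2 1 2 5
print(round(precision(tp, fp), 3), recall(tp, fn), accuracy(tp, fp, fn, tn))
```

Returning 0.0 when a denominator is zero is one convention among several; some libraries raise a warning or let the caller choose the fallback value.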

Midterm: March 25, 6:30pm
• 90-Minute Written Exam
• No computer needed, closed book, closed notes
• Part I: Multiple Choice
Review
Week 2: Where to Find Data
• The data science lifecycle.
• Structured vs unstructured data.
• Common sources of data.
• Common ways to access data.
Week 3: Processing and Cleaning Data
• Elements of the ETL Process
• Processing Tools
• Data Cleaning Considerations for Data Scientists
• Missing Values
• Handling Outliers
• Normalizing Data
Week 4: Statistics and the Stories We Tell Ourselves
• Types of Data
• Useful Statistical Distributions
• Important Summary Statistics
• Independence
• Key Theorems
Week 5: Intro to Linear Models
• What Makes Linear Regression Linear
• Assumptions for Linear Models
• Measuring Performance for Linear Models
Week 6: Regression vs Classification
• Regression vs. Classification
• Logistic Regression
• Measuring Performance for Classification Models