##### City College, Spring 2019

### Intro to Data Science

Week 6: Regression vs. Classification

March 18, 2019

###### Today's Agenda

- Regression vs. Classification
- Logistic Regression
- Measuring Performance for Classification Models
- Midterm Recap

###### Week 5 Recap

- Linear Regression
- Assumptions for Linear Models
- Measuring Performance for Linear Models

###### Data Science Models

###### Key Metrics: Linear Regression

- R-squared
- Adjusted R-squared
- Coefficients
- P-values

R-squared: The share of the variation in the target variable that is explained by the model.
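As a minimal sketch of the definition above, R-squared can be computed as one minus the ratio of residual variation to total variation, and adjusted R-squared then penalizes for the number of features. The function names and the toy data are illustrative assumptions, not part of the course material.

```python
def r_squared(y_true, y_pred):
    """Share of the variation in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    # Variation the model fails to explain (residual sum of squares).
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # Total variation of the target around its mean.
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_features):
    """R-squared penalized for the number of features in the model."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Hypothetical observed targets and model predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.5, 8.5]

r2 = r_squared(y_true, y_pred)
adj_r2 = adjusted_r_squared(r2, n_samples=4, n_features=1)
print(r2, adj_r2)
```
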

Generalized linear model: *A flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution.*

###### Regression vs. Classification

*Is this a good forecast?*

**Regression** analysis estimates the conditional expectation of the dependent variable given the independent variables.

**Classification** is the problem of identifying to which of a set of categories a new observation belongs.
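The two definitions above can be contrasted with a deliberately simple sketch: for each value of an input, a regression-style estimate is the average of the numeric target (a conditional mean), while a classification-style estimate is the most common category. The data (hours studied, exam scores, pass/fail labels) and the helper names are hypothetical and only serve the illustration.

```python
from collections import Counter, defaultdict


def conditional_mean(xs, ys):
    """Regression-style estimate: average numeric target for each x value."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    return {x: sum(vals) / len(vals) for x, vals in groups.items()}


def majority_class(xs, labels):
    """Classification-style estimate: most common category for each x value."""
    groups = defaultdict(list)
    for x, label in zip(xs, labels):
        groups[x].append(label)
    return {x: Counter(vals).most_common(1)[0][0] for x, vals in groups.items()}


# Hypothetical data: hours studied, exam score, and pass/fail outcome.
hours = [1, 1, 1, 3, 3, 3]
scores = [50, 55, 60, 80, 85, 90]
outcomes = ["fail", "fail", "pass", "pass", "pass", "fail"]

score_estimates = conditional_mean(hours, scores)    # numeric output
outcome_estimates = majority_class(hours, outcomes)  # categorical output
print(score_estimates)
print(outcome_estimates)
```
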

###### Logistic Regression

###### Logistic Regression Output

*Logistic regression and many other classification models output a continuous value between 0 and 1.*
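To illustrate where that 0-to-1 value comes from, here is a minimal sketch of a logistic regression prediction: a weighted sum of the features is passed through the sigmoid function, and a threshold turns the resulting score into a class label. The weights, bias, feature values, and the 0.5 default threshold are assumptions chosen for the example, not fitted values.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1 (numerically stable)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(features, weights, bias):
    """Logistic regression score: sigmoid of the weighted sum of features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict_label(prob, threshold=0.5):
    """Turn the continuous score into a class label (1 or 0)."""
    return 1 if prob >= threshold else 0


# Hypothetical coefficients and one observation.
weights = [0.8, -0.5]
bias = -0.3
observation = [2.0, 1.0]

prob = predict_proba(observation, weights, bias)
label = predict_label(prob)
strict_label = predict_label(prob, threshold=0.8)
print(prob, label, strict_label)
```

Raising the threshold makes the model more reluctant to predict the positive class, which is one reason the same model can score differently on the metrics that follow.
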

###### Measuring Classification Performance

- Confusion Matrix
- Precision
- Recall
- Accuracy

(as explained by the zombie apocalypse)

###### Confusion Matrix
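
A confusion matrix counts how a classifier's predictions line up with the true labels. The sketch below is one way to tally the four cells for a binary problem. It assumes the encoding 1 = zombie (positive class) and 0 = human, and the labels and predictions are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Tally true/false positives and negatives for a binary classifier."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            counts["tp"] += 1  # predicted zombie, really a zombie
        elif predicted == positive:
            counts["fp"] += 1  # predicted zombie, really a human
        elif actual == positive:
            counts["fn"] += 1  # predicted human, really a zombie
        else:
            counts["tn"] += 1  # predicted human, really a human
    return counts


# Hypothetical data: 1 = zombie, 0 = human.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 1, 0, 1, 0]

matrix = confusion_counts(actual, predicted)
print(matrix)  # {'tp': 3, 'fp': 1, 'tn': 3, 'fn': 1}
```
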

###### Precision

**Zombie apocalypse use case:** You're hunting zombies, and you need to kill as many zombies as possible without killing any humans.
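
Precision is the share of predicted positives that are truly positive: TP / (TP + FP). In the hunting scenario, it answers "of everyone flagged as a zombie, how many really were zombies?" The sketch below assumes 1 = zombie and 0 = human, uses made-up data, and returns 0.0 when nothing is flagged, which is a convention chosen here rather than a universal rule.

```python
def precision(y_true, y_pred, positive=1):
    """Share of predicted positives that are actually positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for a, p in pairs if p == positive and a == positive)
    fp = sum(1 for a, p in pairs if p == positive and a != positive)
    if tp + fp == 0:
        return 0.0  # assumed convention: no positive predictions were made
    return tp / (tp + fp)


# Hypothetical data: 1 = zombie, 0 = human.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
flagged = [1, 1, 0, 0, 1, 0, 1, 0]

zombie_precision = precision(actual, flagged)
print(zombie_precision)  # 0.75: one of the four flagged targets was human
```
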

###### Recall

**Zombie apocalypse use case:** You discover a cure for zombies, but can only apply it to *k* infected people.
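
Recall is the share of actual positives that the model finds: TP / (TP + FN). The sketch below assumes 1 = infected and 0 = healthy, uses made-up data, and returns 0.0 when there are no actual positives, which is a convention chosen for this example.

```python
def recall(y_true, y_pred, positive=1):
    """Share of actual positives that the model correctly identifies."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    if tp + fn == 0:
        return 0.0  # assumed convention: there were no actual positives
    return tp / (tp + fn)


# Hypothetical data: 1 = infected, 0 = healthy.
actual = [1, 1, 1, 1, 0, 0]
identified = [1, 1, 0, 0, 0, 0]

infected_recall = recall(actual, identified)
print(infected_recall)  # 0.5: two of the four infected people were missed
```

In this toy data every identified person really is infected, so precision would be perfect even though recall is only 0.5, which is why the two metrics are usually reported together.
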

###### Accuracy

**Zombie apocalypse use case:** Zombies have infected roughly half the population, and you're throwing them a party. You are putting together an invite list and want to make sure you invite an equal number of zombies and humans.
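
Accuracy is the share of all predictions that are correct: (TP + TN) / total. It is most informative when the classes are roughly balanced, as in the scenario above. The sketch below uses made-up data with 1 = zombie and 0 = human, and also shows how accuracy can look good on imbalanced data even when the model never finds a zombie.

```python
def accuracy(y_true, y_pred):
    """Share of all predictions that match the true label."""
    correct = sum(1 for a, p in zip(y_true, y_pred) if a == p)
    return correct / len(y_true)


# Hypothetical balanced population: half zombies (1), half humans (0).
actual = [1, 0, 1, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 0, 1, 1, 1, 0]
balanced_accuracy_score = accuracy(actual, predicted)

# Hypothetical imbalanced population: 1 zombie among 10 people.
rare_actual = [1] + [0] * 9
always_human = [0] * 10  # a "model" that never predicts zombie
imbalanced_accuracy_score = accuracy(rare_actual, always_human)

print(balanced_accuracy_score)    # 0.75
print(imbalanced_accuracy_score)  # 0.9, despite missing the only zombie
```
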

###### Midterm: March 25, 6:30pm

- 90-Minute Written Exam
- No computer needed, closed book, closed notes
- Part I: Multiple Choice
- Part II: Short Answer

###### Week 2: Where to Find Data

- The data science lifecycle.
- Structured vs unstructured data.
- Common sources of data.
- Common ways to access data.

###### Week 3: Processing and Cleaning Data

- Elements of the ETL Process
- Processing Tools
- Data Cleaning Considerations for Data Scientists
- Missing Values
- Handling Outliers
- Normalizing Data

###### Week 4: Statistics and the Stories We Tell Ourselves

- Types of Data
- Useful Statistical Distributions
- Important Summary Statistics
- Independence
- Key Theorems

###### Week 5: Intro to Linear Models

- What Makes Linear Regression Linear
- Assumptions for Linear Models
- Measuring Performance for Linear Models

###### Week 6: Regression vs. Classification

- Regression vs. Classification
- Logistic Regression
- Measuring Performance for Classification Models