City College, Spring 2019
Intro to Data Science
Week 9: Trees and the Variance-Bias Tradeoff
April 8, 2019
Today's Agenda
 Classification Review
 Linear vs. Nonlinear Classification
 Decision Trees
 Midterm Review
Semester Recap
 Loading and Transforming Data
 Exploratory Data Analysis
 Linear Models for Regression and Classification
Data Science Models
Linear vs Nonlinear Classification Models
Linear models are most effective when the outcome can be modeled as a weighted sum of the explanatory variables (each variable multiplied by a coefficient) and, for classification, the classes are linearly separable.
Nonlinear models are better suited to outcomes that depend on interactions between explanatory variables.
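To make the contrast concrete, here is a minimal sketch on synthetic data (not the Titanic set). It assumes scikit-learn is installed and uses an XOR-style label, where the class depends only on the interaction of two features. The data, variable names, and settings are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, made-up data: the label is 1 when exactly one feature is positive.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A linear classifier draws a single straight boundary.
linear = LogisticRegression().fit(X_train, y_train)
# A tree can carve the plane into rectangles, one split at a time.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

linear_acc = linear.score(X_test, y_test)
tree_acc = tree.score(X_test, y_test)
print(f"logistic regression test accuracy: {linear_acc:.2f}")
print(f"decision tree test accuracy:       {tree_acc:.2f}")
```

No single line separates the two classes here, so the linear model should do little better than guessing, while the tree can follow the quadrant structure.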
Linearly Separable Data
Linearly Inseparable Data
Titanic Disaster
2,224 passengers and crew, 710 survivors
 Training set: 891 observations, 38 percent survival rate
 Features include: age, sex, socioeconomic class, embarkation point, fare paid
 Rich potential for additional feature engineering
 Prediction goal: who survives?
Survival Rates Among Subgroups
Branching a tree relies on a greedy heuristic: at each node, choose the split that most reduces an impurity measure, typically entropy.
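The greedy step can be sketched in a few lines of plain Python. The rows below are a tiny made-up sample for illustration only (not real Titanic records), and the helper names entropy, information_gain, and best_feature are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Drop in entropy from splitting parent into left and right."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

def best_feature(rows, labels, features):
    """Greedy choice: the binary feature whose split has the highest gain."""
    def gain(feature):
        left = [y for row, y in zip(rows, labels) if row[feature]]
        right = [y for row, y in zip(rows, labels) if not row[feature]]
        return information_gain(labels, left, right)
    return max(features, key=gain)

# Made-up toy passengers: 1 = survived, 0 = did not survive.
rows = [
    {"is_female": True, "first_class": True},
    {"is_female": True, "first_class": False},
    {"is_female": False, "first_class": True},
    {"is_female": False, "first_class": False},
    {"is_female": True, "first_class": False},
    {"is_female": False, "first_class": True},
]
labels = [1, 1, 0, 0, 1, 1]

chosen = best_feature(rows, labels, ["is_female", "first_class"])
print("first split on:", chosen)
```

A full tree repeats this choice inside each resulting subgroup. It is greedy because it never revisits an earlier split.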
Computers Make Branching Easy
Bias-Variance Tradeoff
An ideal model both accurately captures the regularities in its training data and generalizes well to unseen data. Unfortunately, it is typically impossible to do both simultaneously.
Source.
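One way to see the tradeoff is to compare training and test accuracy as a tree is allowed to grow deeper. This is a rough sketch on synthetic data from scikit-learn's make_moons. The noise level, sample size, and depths are arbitrary assumptions.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data, so a perfect fit on the training set means fitting noise.
X, y = make_moons(n_samples=600, noise=0.35, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

scores = {}
for depth in (1, 3, None):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train={scores[depth][0]:.2f}, test={scores[depth][1]:.2f}")
```

A depth-1 tree is too simple to fit even the training data well (high bias). The unrestricted tree fits the training data perfectly, and any gap between its training and test accuracy reflects variance.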
How do we prevent unnecessary splits?
Hyperparameters: values set before the learning process. For trees, they limit growth to avoid overfitting.
 max_depth
 min_samples_split
 min_samples_leaf
 max_features
max_depth=3
max_leaf_nodes=6
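In scikit-learn, the hyperparameters above are keyword arguments to DecisionTreeClassifier. The sketch below uses synthetic data, and the values chosen for min_samples_split, min_samples_leaf, and max_features are arbitrary illustrations rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 8 features.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Cap the number of levels below the root.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Cap the total number of leaves instead of the depth.
capped = DecisionTreeClassifier(max_leaf_nodes=6, random_state=0).fit(X, y)

# Require enough samples before splitting and in every leaf,
# and consider only a random subset of features at each split.
restricted = DecisionTreeClassifier(
    min_samples_split=20,
    min_samples_leaf=10,
    max_features=4,
    random_state=0,
).fit(X, y)

for name, model in [("max_depth=3", shallow),
                    ("max_leaf_nodes=6", capped),
                    ("min_samples/max_features", restricted)]:
    print(f"{name}: depth={model.get_depth()}, leaves={model.get_n_leaves()}")
```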
How do we choose hyperparameters?
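One common approach, offered here as an assumption rather than the only answer, is a cross-validated grid search: try each combination of candidate values and keep the one with the best average validation score. The sketch uses scikit-learn's GridSearchCV on synthetic data, and the candidate grid is arbitrary.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data standing in for a real training set.
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)

# Candidate values to try (illustrative, not tuned for any real problem).
param_grid = {
    "max_depth": [2, 3, 5, None],
    "min_samples_leaf": [1, 5, 20],
}

# 5-fold cross-validation: each combination is scored on held-out folds.
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print("best hyperparameters:", search.best_params_)
print("mean cross-validated accuracy:", round(search.best_score_, 3))
```

Scoring on held-out folds rather than on the training data keeps the search from simply rewarding the deepest tree.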
Let's make our own trees.
Midterm Results

                   Multiple Choice   Short Answer   Exam Total
Points Possible                 58             36           94
Mean                            53             35           87
Median                          53             36           88
Std Dev                       3.58           2.09         4.51
Answer key available on the course page: Version 1, Version 2.
Assignment 7: Due Monday, April 15 by 6:30pm
DataCamp's Machine Learning with Tree-Based Models in Python

The course should appear as a single assignment within your existing DataCamp account.

Each section will appear separately and will be worth one point toward the total grade for the homework.

The course claims to take 5 hours, but I found it shorter than some of the past courses. Nonetheless, use your time wisely.