City College, Fall 2019

Intro to Data Science

Week 7: Trees and the Variance-Bias Tradeoff

October 21, 2019

Today's Agenda
  1. Classification Review
  2. Linear vs. Nonlinear Classification
  3. Decision Trees
Semester Recap
  1. Loading and Transforming Data
  2. Exploratory Data Analysis
  3. Linear Models for Regression and Classification
Project Teams
Team Member 1 Member 2 Member 3 Member 4
Team 1 Ahsun Phurpa Sagun Sanjay
Team 2 Naeem Ishraq Garland
Team 3 Steven Brian
Team 4 Anthony Jake
Team 5 Charlie Connie Julia Luis
Team 6 Jin Xin Xiaohong Hua
Team 7 Enger Jon Kieran Martin
Team 8 Fayrouz Frank Ali
Team 9 Carlos Oscar Quetorah Abigail
Team 10 Dzhonibek Hasibul Abdur
Team 11 Runmin Jie Hongjie H. Bida
Team 12 Yorli Inna Lam Michele
Team 13 Jacinta Andrew Rabia
Team 14 Lizbeth Kareem Yu Xuan
Data Science Models
Linear vs Nonlinear Classification Models

Linear models are most effective when the outcome can be modeled with a combination of coefficients of explanatory variables and the data is linearly separable.


Nonlinear models are better suited for outcomes which rely on interactions between different explanatory variables.

Linearly Separable Data
Linearly Inseparable Data
Titanic Disaster

2224 passengers, 710 survivors

Titanic Prediction Data Set
  1. Training set: 891 observations, 38 percent survival rate
  2. Features include: age, sex, socio-economic class, embarkation point, fare paid
  3. Rich potential for additional feature engineering
  4. Prediction goal: who survives?
Survival Rates Among Subgroups
Survival Rates Among Subgroups

Branching a tree relies on a greedy heuristic, typically entropy.


Computers Make Branching Easy
Bias-Variance Tradeoff

An ideal model that both accurately captures the regularities in its training data, but also generalizes well to unseen data. Unfortunately, it is typically impossible to do both simultaneously.


Source.
How do we prevent unnecessary splits?
Hyperparameters: values set before the learning process to avoid overfitting.
Common decision tree hyperparameters:
  1. max_depth
  2. min_samples_split
  3. min_samples_leaf
  4. max_features
max_depth=3
max_leaf_nodes=6
How do we choose hyperparameters?
Cross Validation

See here for a slightly deeper discussion.
Let's make our own trees.

Assignment 7: Due Monday, November 4 by 6:30pm

DataCamp's Deep Learning with PyTorch

  • The course should appear collectively as assignment within your existing DataCamp account.
  • Each section will appear separately and will be worth oen point toward the total grade for the homework.
  • Course claims to take 5 hours, but I found it shorter than some of the past courses. Nonetheless, use your time wisely.