Decision and Classification Trees, Clearly Explained!!!
-
Decision Trees Overview:
- A decision tree makes decisions based on whether statements are true or false.
- There are two main types:
- Classification trees: Classify things into categories.
- Regression trees: Predict numeric values (e.g., predicting mouse size based on diet).
-
Working with Classification Trees:
- Classification trees can handle mixed data types (e.g., numeric and yes/no data).
- They may ask the same question multiple times with different thresholds.
- Final classifications (called leaves) can be repeated.
-
How to Use a Classification Tree:
- Start at the top of the tree and work your way down until you reach a leaf.
- By convention, in many trees:
- True: Go left.
- False: Go right.
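The traversal above can be sketched in a few lines of Python. The nested-dict tree and the patient fields here are hypothetical, not from the video:

```python
# A tiny classification tree: each internal node asks a yes/no question;
# True answers go left, False answers go right, until we reach a leaf.
tree = {
    "question": lambda patient: patient["chest_pain"],  # True -> left branch
    "left": {"leaf": "heart disease"},
    "right": {"leaf": "no heart disease"},
}

def classify(node, patient):
    # Start at the top and walk down until we reach a leaf.
    while "leaf" not in node:
        node = node["left"] if node["question"](patient) else node["right"]
    return node["leaf"]

print(classify(tree, {"chest_pain": True}))   # heart disease
print(classify(tree, {"chest_pain": False}))  # no heart disease
```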
-
Building the Tree and Impurity:
- Check video
-
Overfitting and How to Prevent It:
- Overfitting occurs when the tree fits the training data too well, making it less generalizable to new data.
- To address overfitting:
- Pruning the tree: Cutting back splits to prevent overly specific predictions.
- Limiting tree growth: Requiring a minimum number of data points per leaf.
- Cross-validation: Testing different tree settings to find the best balance between accuracy and overfitting.
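The growth-limiting idea above can be sketched as a simple guard: refuse any split that would leave a leaf with too few data points. The minimum of 20 is a hypothetical setting you would tune with cross-validation:

```python
def can_split(left_count, right_count, min_leaf=20):
    # Limiting tree growth: accept a split only if both resulting
    # leaves keep at least min_leaf data points (min_leaf is a
    # tuning value, typically chosen with cross-validation).
    return left_count >= min_leaf and right_count >= min_leaf

print(can_split(25, 30))  # True  -> split is allowed
print(can_split(25, 5))   # False -> split rejected, one leaf too small
```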
StatQuest: Decision Trees, Part 2 - Feature Selection and Missing Data
-
Recap: Building a Decision Tree (Heart Disease Example):
- We built a tree to predict the likelihood of a patient having heart disease based on symptoms.
- The process:
- First, we asked if the patient had good blood circulation.
- If yes, we then asked if they had blocked arteries.
- If they had blocked arteries, we asked if they had chest pain.
- If yes, there's a good chance they have heart disease (17 similar patients did, 3 didn’t).
- If no, there's a good chance they do not have heart disease.
-
Impurity Calculations and Splitting:
- If a patient had good circulation but no blocked arteries, we did not ask about chest pain, because the impurity was lower without the split.
- Impurity is a measure of how mixed the groups are in terms of having or not having heart disease.
- If chest pain didn’t reduce the impurity in any split, it wouldn’t be used in the tree, despite being part of the data.
- This is an example of automatic feature selection.
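The impurity measure used in the video is Gini impurity. A sketch of computing it for a leaf, and the size-weighted impurity of a whole split, using the 17-vs-3 leaf from the recap above (the second child's counts are made up):

```python
def gini(yes, no):
    # Gini impurity: 1 - P(yes)^2 - P(no)^2.
    # 0.0 means the group is pure; 0.5 means a perfect 50/50 mix.
    total = yes + no
    if total == 0:
        return 0.0
    p = yes / total
    return 1 - p ** 2 - (1 - p) ** 2

def weighted_gini(children):
    # Impurity of a split = size-weighted average of each child's Gini.
    total = sum(yes + no for yes, no in children)
    return sum((yes + no) / total * gini(yes, no) for yes, no in children)

# The leaf from the recap: 17 patients with heart disease, 3 without.
print(round(gini(17, 3), 3))  # 0.255

# A candidate split is only used if its weighted impurity is lower than
# the node's impurity before splitting -- otherwise the feature is skipped.
print(round(weighted_gini([(17, 3), (2, 18)]), 4))  # 0.2175
```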
-
Feature Selection and Simplicity:
- We could set a threshold for impurity reduction:
- The reduction must be large enough to justify making the split.
- This creates simpler trees and prevents overfitting.
- Overfitting:
- A tree performs well on the training data but poorly on new data.
- Requiring larger reductions in impurity for splits helps prevent overfitting.
- In a nutshell: Feature selection helps to build simpler, more effective decision trees by automatically excluding less important variables.
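The threshold idea in code; the 0.05 minimum reduction is a hypothetical value you would tune:

```python
def split_is_worthwhile(impurity_before, impurity_after, min_reduction=0.05):
    # Keep a split only if it reduces impurity by at least min_reduction;
    # tiny reductions are not worth the extra tree complexity.
    return (impurity_before - impurity_after) >= min_reduction

print(split_is_worthwhile(0.48, 0.40))  # True  (reduction ~0.08)
print(split_is_worthwhile(0.48, 0.46))  # False (reduction ~0.02)
```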
-
Handling Missing Data:
- In the original decision tree, we skipped patients if they had missing data (e.g., for blocked arteries).
- But we could handle missing data in several ways:
- Most common value: if "yes" occurs more often than "no" in that column, assign "yes" to the missing value.
- Using correlated columns: Find a column that correlates with the missing data (e.g., chest pain and blocked arteries).
- If similar patterns exist (e.g., "no" for both in previous patients), we could fill in the missing blocked arteries value based on the chest pain value.
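Both options can be sketched on a toy table of patients (the records are made up for illustration):

```python
from collections import Counter

patients = [
    {"chest_pain": "yes", "blocked_arteries": "yes"},
    {"chest_pain": "no",  "blocked_arteries": "no"},
    {"chest_pain": "yes", "blocked_arteries": "yes"},
    {"chest_pain": "no",  "blocked_arteries": None},  # missing value
]

# Option 1: fill with the most common observed value in the column.
observed = [p["blocked_arteries"] for p in patients if p["blocked_arteries"]]
most_common = Counter(observed).most_common(1)[0][0]

# Option 2: use a correlated column -- copy the chest_pain value, since
# the two columns agree for every complete patient in this toy data.
for p in patients:
    if p["blocked_arteries"] is None:
        p["blocked_arteries"] = p["chest_pain"]

print(most_common)                        # yes (2 "yes" vs 1 "no")
print(patients[3]["blocked_arteries"])    # no (copied from chest_pain)
```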
-
Example with Numeric Data:
- If weight data is missing, we could:
- Use the mean or median of the available weight data to fill the missing value.
- Use correlated data: If height is highly correlated with weight, we can run a linear regression to predict the missing weight based on the height data.
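A sketch of both numeric options, mean imputation and a least-squares fit of weight on height, using made-up measurements:

```python
heights = [150, 160, 170, 180]
weights = [50,  58,  66,  74]   # toy data; here weight = 0.8*height - 70

# Option 1: fill the missing weight with the mean of the observed weights.
mean_weight = sum(weights) / len(weights)
print(mean_weight)  # 62.0

# Option 2: fit a simple linear regression (least squares) of weight on
# height, then predict the missing weight from the patient's height.
n = len(heights)
mh = sum(heights) / n
mw = sum(weights) / n
slope = (sum((h - mh) * (w - mw) for h, w in zip(heights, weights))
         / sum((h - mh) ** 2 for h in heights))
intercept = mw - slope * mh

missing_height = 175
predicted_weight = slope * missing_height + intercept
print(predicted_weight)
```

The regression-based fill is usually preferable when height and weight are highly correlated, because it tailors the imputed value to the individual patient instead of using one global average.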
Pruning
Decision trees / Random forest -> AdaBoost
Gradient boost