Decision and Classification Trees, Clearly Explained!!!
-
Decision Trees Overview:
- A decision tree makes decisions based on whether statements are true or false.
- There are two main types:
- Classification trees: Classify things into categories.
- Regression trees: Predict numeric values (e.g., predicting mouse size based on diet).
-
Working with Classification Trees:
- Classification trees can handle mixed data types (e.g., numeric and yes/no data).
- They may ask the same question multiple times with different thresholds.
- Final classifications (called leaves) can be repeated.
-
How to Use a Classification Tree:
- Start at the top of the tree and work your way down until you reach a leaf.
- By convention, in many trees:
- True: Go left.
- False: Go right.
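The traversal above can be sketched in a few lines of Python. The nested-dict tree and the patient fields here are hypothetical, not from the video:

```python
# A tiny classification tree: each internal node asks a yes/no question;
# True answers go left, False answers go right, until we reach a leaf.
tree = {
    "question": lambda patient: patient["chest_pain"],  # True -> left branch
    "left": {"leaf": "heart disease"},
    "right": {"leaf": "no heart disease"},
}

def classify(node, patient):
    # Start at the top and walk down until we reach a leaf.
    while "leaf" not in node:
        node = node["left"] if node["question"](patient) else node["right"]
    return node["leaf"]

print(classify(tree, {"chest_pain": True}))   # heart disease
print(classify(tree, {"chest_pain": False}))  # no heart disease
```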
-
Building the Tree and Impurity:
- Check video
-
Overfitting and How to Prevent It:
- Overfitting occurs when the tree fits the training data too well, making it less generalizable to new data.
- To address overfitting:
- Pruning the tree: Cutting back splits to prevent overly specific predictions.
- Limiting tree growth: Requiring a minimum number of data points per leaf.
- Cross-validation: Testing different tree settings to find the best balance between accuracy and overfitting.
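The growth-limiting idea above can be sketched as a simple guard: refuse any split that would leave a leaf with too few data points. The minimum of 20 is a hypothetical setting you would tune with cross-validation:

```python
def can_split(left_count, right_count, min_leaf=20):
    # Limiting tree growth: accept a split only if both resulting
    # leaves keep at least min_leaf data points (min_leaf is a
    # tuning value, typically chosen with cross-validation).
    return left_count >= min_leaf and right_count >= min_leaf

print(can_split(25, 30))  # True  -> split is allowed
print(can_split(25, 5))   # False -> split rejected, one leaf too small
```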
StatQuest: Decision Trees, Part 2 - Feature Selection and Missing Data
-
Recap: Building a Decision Tree (Heart Disease Example):
- We built a tree to predict the likelihood of a patient having heart disease based on symptoms.
- The process:
- First, we asked if the patient had good blood circulation.
- If yes, we then asked if they had blocked arteries.
- If they had blocked arteries, we asked if they had chest pain.
- If yes, there's a good chance they have heart disease (17 similar patients did, 3 didn’t).
- If no, there's a good chance they do not have heart disease.
-
Impurity Calculations and Splitting:
- If a patient had good circulation but no blocked arteries, we did not ask about chest pain, because the impurity was lower without the split.
- Impurity is a measure of how mixed the groups are in terms of having or not having heart disease.
- If chest pain didn’t reduce the impurity in any split, it wouldn’t be used in the tree, despite being part of the data.
- This is an example of automatic feature selection.
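The impurity measure used in the video is Gini impurity. A sketch of computing it for a leaf, and the size-weighted impurity of a whole split, using the 17-vs-3 leaf from the recap above (the second child's counts are made up):

```python
def gini(yes, no):
    # Gini impurity: 1 - P(yes)^2 - P(no)^2.
    # 0.0 means the group is pure; 0.5 means a perfect 50/50 mix.
    total = yes + no
    if total == 0:
        return 0.0
    p = yes / total
    return 1 - p ** 2 - (1 - p) ** 2

def weighted_gini(children):
    # Impurity of a split = size-weighted average of each child's Gini.
    total = sum(yes + no for yes, no in children)
    return sum((yes + no) / total * gini(yes, no) for yes, no in children)

# The leaf from the recap: 17 patients with heart disease, 3 without.
print(round(gini(17, 3), 3))  # 0.255

# A candidate split is only used if its weighted impurity is lower than
# the node's impurity before splitting -- otherwise the feature is skipped.
print(round(weighted_gini([(17, 3), (2, 18)]), 4))  # 0.2175
```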
-
Feature Selection and Simplicity:
- We could set a threshold for impurity reduction:
- The reduction must be large enough to justify making the split.
- This creates simpler trees and prevents overfitting.
- Overfitting:
- A tree performs well on the training data but poorly on new data.
- Requiring larger reductions in impurity for splits helps prevent overfitting.
- In a nutshell: Feature selection helps to build simpler, more effective decision trees by automatically excluding less important variables.
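The threshold idea in code; the 0.05 minimum reduction is a hypothetical value you would tune:

```python
def split_is_worthwhile(impurity_before, impurity_after, min_reduction=0.05):
    # Keep a split only if it reduces impurity by at least min_reduction;
    # tiny reductions are not worth the extra tree complexity.
    return (impurity_before - impurity_after) >= min_reduction

print(split_is_worthwhile(0.48, 0.40))  # True  (reduction ~0.08)
print(split_is_worthwhile(0.48, 0.46))  # False (reduction ~0.02)
```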
-
Handling Missing Data:
- In the original decision tree, we skipped patients if they had missing data (e.g., for blocked arteries).
- But we could handle missing data in several ways:
- Most common value: if "yes" occurs more often than "no" in that column, assign "yes" to the missing value.
- Using correlated columns: Find a column that correlates with the missing data (e.g., chest pain and blocked arteries).
- If similar patterns exist (e.g., "no" for both in previous patients), we could fill in the missing blocked arteries value based on the chest pain value.
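Both options can be sketched on a toy table of patients (the records are made up for illustration):

```python
from collections import Counter

patients = [
    {"chest_pain": "yes", "blocked_arteries": "yes"},
    {"chest_pain": "no",  "blocked_arteries": "no"},
    {"chest_pain": "yes", "blocked_arteries": "yes"},
    {"chest_pain": "no",  "blocked_arteries": None},  # missing value
]

# Option 1: fill with the most common observed value in the column.
observed = [p["blocked_arteries"] for p in patients if p["blocked_arteries"]]
most_common = Counter(observed).most_common(1)[0][0]

# Option 2: use a correlated column -- copy the chest_pain value, since
# the two columns agree for every complete patient in this toy data.
for p in patients:
    if p["blocked_arteries"] is None:
        p["blocked_arteries"] = p["chest_pain"]

print(most_common)                        # yes (2 "yes" vs 1 "no")
print(patients[3]["blocked_arteries"])    # no (copied from chest_pain)
```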
-
Example with Numeric Data:
- If weight data is missing, we could:
- Use the mean or median of the available weight data to fill the missing value.
- Use correlated data: If height is highly correlated with weight, we can run a linear regression to predict the missing weight based on the height data.
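A sketch of both numeric options, mean imputation and a least-squares fit of weight on height, using made-up measurements:

```python
heights = [150, 160, 170, 180]
weights = [50,  58,  66,  74]   # toy data; here weight = 0.8*height - 70

# Option 1: fill the missing weight with the mean of the observed weights.
mean_weight = sum(weights) / len(weights)
print(mean_weight)  # 62.0

# Option 2: fit a simple linear regression (least squares) of weight on
# height, then predict the missing weight from the patient's height.
n = len(heights)
mh = sum(heights) / n
mw = sum(weights) / n
slope = (sum((h - mh) * (w - mw) for h, w in zip(heights, weights))
         / sum((h - mh) ** 2 for h in heights))
intercept = mw - slope * mh

missing_height = 175
predicted_weight = slope * missing_height + intercept
print(predicted_weight)
```

The regression-based fill is usually preferable when height and weight are highly correlated, because it tailors the imputed value to the individual patient instead of using one global average.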
Pruning
Decision trees / Random forest -> AdaBoost
Gradient boost