Summer 2025
2025-08-01
The idea for decision trees came from Morgan and Sonquist. [@article]
A decision tree is an example of supervised machine learning, and a random forest is an ensemble classifier that builds on decision trees by injecting randomness.
In this presentation, random forest and decision trees are used to predict the quality of wine based on its chemical properties.
Random Forest and Decision Tree methodologies can be applied across datasets and industries.
Examples:
- healthcare: risk for cardiovascular disease [@su2020prediction] and diabetes [@xu2017risk]
- agriculture: predicting sugarcane production [@everingham2016accurate]
- emergency management: predicting the severity of highway accidents [@chen2020modeling]
Medical researchers used random forest methodology to identify individuals at high risk for cardiovascular disease; as a prediction model, the random forest was well suited to evaluating many candidate predictor variables and the possible interactions among them. [@su2020prediction]
Modeling Type 2 Diabetes Mellitus is difficult because of interactions between genetic, environmental, and behavioral factors that classic statistical methods cannot accurately capture. Both a decision tree model and a random forest model were created. [@esmaily2018comparison] Comparing the two models showed that BMI, triglycerides, and family history were the top risk factors.
Everingham applied random forests to a dataset collected from 1992 to 2013, containing variables related to weather and soil quality, to predict sugarcane yield. [@everingham2016accurate] While the models' accuracy is not perfect, combined with other predictions they can serve as a guide for farming activities and planning.
Decision Trees represent a sequence of rules in the shape of a tree or a flowchart.
The decision tree is made up of the root node, the decision nodes, the terminal nodes, and the branches.
Several algorithms are used to build decision trees: ID3, Chi-Square Automatic Interaction Detection (CHAID), Multivariate Adaptive Regression Splines (MARS), and Conditional Inference Trees.
In R, the CART algorithm underlies both single decision trees (the rpart package) and the trees grown inside a random forest (the randomForest package).
CART handles both regression and classification, using Gini impurity to choose classification splits and variance reduction for regression splits.
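As a minimal sketch of fitting a CART classification tree in R (the data frame `wine` and factor column `quality` are illustrative names, not the exact objects from this analysis):

```r
library(rpart)

# Fit a CART classification tree with Gini-based splits;
# assumes a data frame `wine` with a factor column `quality`
tree <- rpart(quality ~ ., data = wine, method = "class",
              parms = list(split = "gini"))

printcp(tree)  # complexity table, useful for pruning decisions
```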
The Gini impurity measures the likelihood of randomly selected data being incorrectly classified.
\[ Gini(p) = 1 - \sum_{i=1}^{C} p_i^2 \]
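Here \(p_i\) is the proportion of class \(i\) among the \(C\) classes at a node. A tiny R helper makes the behavior concrete (the function name and example proportions are made up):

```r
# Gini impurity for a vector of class proportions summing to 1
gini <- function(p) 1 - sum(p^2)

gini(c(0.5, 0.5))  # 0.5: a maximally impure two-class node
gini(c(1, 0))      # 0:   a perfectly pure node
```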
Random forest builds many decision trees and then aggregates their individual predictions, using a majority vote for classification and averaging for regression. Its main steps are listed below, with a short code sketch following the list.
- Training data selection: each tree is trained on a bootstrap sample drawn from the training data.
- Tree growth: each tree is grown deep, typically without pruning.
- Random attribute selection: only a random subset of attributes is considered at each split, which decorrelates the trees.
- Majority vote: the forest predicts the class that the most trees vote for.
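A minimal sketch of these steps in R with the randomForest package (again assuming an illustrative `wine` data frame with a factor `quality` column; the `ntree` and `mtry` values are placeholders, not the tuned settings from this analysis):

```r
library(randomForest)

set.seed(42)                         # reproducible bootstrap samples
rf <- randomForest(quality ~ ., data = wine,
                   ntree = 500,      # number of trees grown
                   mtry  = 3,        # attributes tried at each split
                   importance = TRUE)

rf$confusion    # out-of-bag confusion matrix (majority-vote predictions)
varImpPlot(rf)  # which chemical properties the forest relies on
```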
Initial exploration helps reveal structure and potential predictive patterns.
- White wine dominates the medium and high quality categories.
- Red wine is more evenly distributed across categories.
- It is important to account for this imbalance in classification; one approach is sketched after this list.
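One way to handle the imbalance is stratified, balanced bootstrap sampling per tree via randomForest's `strata` and `sampsize` arguments (a sketch, again with illustrative `wine`, `type`, and `quality` names; `quality` is assumed to be a factor):

```r
library(randomForest)

# Inspect the class balance across wine types and quality levels
table(wine$type, wine$quality)

# Draw an equal-sized, stratified bootstrap sample for every tree
n_min  <- min(table(wine$quality))
rf_bal <- randomForest(quality ~ ., data = wine,
                       strata   = wine$quality,
                       sampsize = rep(n_min, nlevels(wine$quality)))
```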
Strongest correlations:
- Sulphates (r = 0.04): low linear correlation, but the variable may still be important.

Correlation captures only linear relationships, not interactions; a quick correlation check follows below.
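Such a check might look like this in R (a sketch assuming a raw UCI-style data frame with a numeric `quality` score; the column names are illustrative):

```r
# Pearson correlation of every numeric predictor with quality
num_cols <- wine[sapply(wine, is.numeric)]
sort(round(cor(num_cols)["quality", ], 2), decreasing = TRUE)
```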
To compare model performance, we used overall accuracy and the average AUC:
| Dataset | Model | Accuracy | Avg AUC |
|---|---|---|---|
| Red Wine | Random Forest | 74.1% | 0.879 |
| Red Wine | Decision Tree | 63.9% | 0.777 |
| White Wine | Random Forest | 71.6% | 0.882 |
| White Wine | Decision Tree | 56.1% | 0.710 |
| Combined | Random Forest | 72.1% | 0.884 |
| Combined | Logistic Regression | 56.7% | 0.756 |
| Combined | Decision Tree | 55.5% | 0.723 |
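For reference, a sketch of how such metrics can be computed in R, assuming a held-out data frame `test` and the fitted `rf` model from earlier; the pROC package's Hand-Till multiclass AUC is one way to obtain an averaged AUC, though it may differ from the exact averaging used above:

```r
library(pROC)

pred_class <- predict(rf, newdata = test)                 # class labels
pred_prob  <- predict(rf, newdata = test, type = "prob")  # class probabilities

accuracy <- mean(pred_class == test$quality)
avg_auc  <- multiclass.roc(test$quality, pred_prob)$auc   # Hand & Till (2001)

c(accuracy = accuracy, avg_auc = as.numeric(avg_auc))
```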
Random Forest handled class imbalance and complex interactions best.
Feature importance shifts slightly by wine type, but alcohol remains consistently relevant.
Future work: include more detailed chemical breakdowns (e.g., esters, alcohol subtypes)