Using Random Forest to Predict Wine Quality

Summer 2025

Bailyn Rowe, Gabriel Gonzalez Rincon, Morgan Watkins (Advisor: Dr. Cohen)

2025-08-01

Introduction

The idea for decision trees originated with Morgan and Sonquist [@article].

A decision tree is an example of supervised machine learning, and random forest is an ensemble classifier that builds on decision trees by adding randomness.

In this presentation, random forest and decision trees are used to predict the quality of wine based on its chemical properties.

Potential Applications of Random Forest

Random Forest and Decision Tree methodologies can be applied across datasets and industries.

Examples:

- healthcare: risk for cardiovascular disease [@su2020prediction] and diabetes [@xu2017risk]

- agriculture: predicting sugarcane production [@everingham2016accurate]

- emergency management: predicting the severity of highway accidents [@chen2020modeling]

Literature Review

Medical researchers used random forest methodology to conduct risk assessments identifying individuals at high risk for cardiovascular disease; random forest was a helpful prediction model because it could evaluate many possible predictor variables and the complex relationships among them [@su2020prediction].

Modeling Type 2 Diabetes Mellitus is difficult because of interactions between genetic, environmental, and behavioral factors that classic statistical methods cannot accurately capture. Both a decision tree model and a random forest model were created [@esmaily2018comparison]; comparing the two showed that BMI, triglycerides, and family history were the top risk factors.

Everingham et al. applied random forest to a dataset collected from 1992 to 2013, containing variables relating to weather and soil quality, and used it to predict sugarcane yield [@everingham2016accurate]. While the models' accuracy is not perfect, their predictions, combined with other forecasts, can serve as a guide for farming activities and planning.

Decision Trees

Decision Trees represent a sequence of rules in the shape of a tree or a flowchart.

The decision tree is made up of the root node, the decision nodes, the terminal nodes, and the branches.

Several different algorithms are used to build decision trees: ID3, Chi-Square Automatic Interaction Detection (CHAID), Multivariate Adaptive Regression Splines (MARS), and Conditional Inference Trees.

Classification and Regression Trees (CART)

The CART algorithm is the tree-builder behind R's rpart and randomForest packages, and therefore underlies the random forests grown in this analysis.

CART is useful for both regression and classification. It chooses splits using Gini impurity for classification and variance reduction for regression.

The Gini impurity measures the likelihood that a randomly selected observation would be misclassified if it were labeled at random according to the class distribution at a node.

\[ Gini(p) = 1 - \sum_{i=1}^{C} p_i^{2} \]

where \(p_i\) is the proportion of observations in class \(i\) and \(C\) is the number of classes at the node.
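As a quick check of the formula, here is a minimal R sketch (the function name `gini_impurity` is ours, not part of any package):

```r
# Gini impurity of a node: 1 - sum(p_i^2) over the class proportions p_i.
gini_impurity <- function(labels) {
  p <- table(labels) / length(labels)  # class proportions p_i
  1 - sum(p^2)
}

gini_impurity(c("Low", "Low", "Low"))      # 0: a pure node
gini_impurity(c("Low", "Medium", "High"))  # 1 - 3*(1/3)^2 = 0.667, maximal for 3 classes
```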

Random Forest

Random forest creates multiple decision trees on random subsets of the training data, then aggregates the trees' individual predictions, by majority vote for classification or by averaging for regression, to produce a final prediction:

  1. Training Data Selection

  2. Tree Growth

  3. Random Attribute Selection

  4. Majority Vote
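A minimal, self-contained sketch of these four steps in R, using the randomForest package on the built-in iris data rather than the wine data, purely to show the mechanics:

```r
library(randomForest)

set.seed(42)
idx <- sample(nrow(iris), floor(0.7 * nrow(iris)))  # simple 70/30 split

# randomForest() performs steps 1-4 internally: each tree is grown on a
# bootstrap sample of the rows (1, 2), each split considers only a random
# subset of mtry features (3), and classification is by majority vote (4).
rf <- randomForest(Species ~ ., data = iris[idx, ], ntree = 500)

pred <- predict(rf, newdata = iris[-idx, ])  # majority-vote class labels
mean(pred == iris$Species[-idx])             # hold-out accuracy
```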

Analysis, Results and Conclusion

Data Exploration

  • Combined dataset: 6,497 wines
    • Red: 1,599
    • White: 4,898
  • Quality scores (0–10) converted to categories:
    • Low (≤5), Medium (6), High (≥7)
  • Features: 11 chemical attributes

Initial exploration helps reveal structure and potential predictive patterns.
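A sketch of the loading step, assuming the semicolon-separated CSVs hosted on the legacy UCI Machine Learning Repository (the URLs are our assumption and may need updating):

```r
# Wine-quality CSVs use ";" as the field separator.
base  <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality"
red   <- read.csv(file.path(base, "winequality-red.csv"),   sep = ";")
white <- read.csv(file.path(base, "winequality-white.csv"), sep = ";")

nrow(red)    # 1599 red wines
nrow(white)  # 4898 white wines
```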

Wine Quality by Wine Type

  • White wine dominates medium and high categories

  • Red wine more evenly distributed

Important to account for this imbalance in classification.

Preprocessing Steps

  • Merged datasets and added “type” variable
  • Created “quality_category” (Low, Medium, High)
  • Selected features based on:
    • Correlation
    • Prior domain knowledge
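A sketch of these steps in R, continuing from the loading sketch above (variable names such as `wine` and `quality_category` mirror the slide text but are otherwise our choices):

```r
# Merge the two datasets and record the wine type.
red$type   <- "red"
white$type <- "white"
wine <- rbind(red, white)  # 6,497 wines

# Bin the 0-10 quality score: Low (<= 5), Medium (6), High (>= 7).
wine$quality_category <- cut(wine$quality,
                             breaks = c(-Inf, 5, 6, Inf),
                             labels = c("Low", "Medium", "High"))

table(wine$type, wine$quality_category)  # class balance by wine type
```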

Correlation Results

  • Correlation measures the linear relationship between each feature and the numeric quality score.

  • Strongest correlations:

    • Alcohol (r = 0.44)
    • Density (r = –0.31)
    • Volatile Acidity (r = –0.27)
  • Sulphates (r = 0.04): low correlation, but may still be important

  • Correlation captures linear relationships only — not interactions.
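One way to reproduce these correlations in R, continuing from the preprocessing sketch (exact r values depend on the combined data):

```r
# Pearson correlation of each chemical feature with the raw quality score.
features <- setdiff(names(wine), c("quality", "quality_category", "type"))
cors <- sapply(wine[features], cor, y = wine$quality)
sort(round(cors, 2))  # e.g., alcohol near 0.44, density near -0.31
```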

Modeling Approach

  • Classification goal: predict Low, Medium, High wine quality
  • Train/test split: 70%/30%
  • Models trained:
    • Random Forest (all datasets)
    • Decision Tree (all datasets)
    • Logistic Regression (combined only)
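A sketch of the training setup under the same assumptions as the earlier snippets (the 70/30 split is from the slide; the seed, formula, and object names are ours):

```r
library(randomForest)
library(rpart)
library(nnet)  # multinom() for multi-class logistic regression

set.seed(123)
train_idx <- sample(nrow(wine), floor(0.7 * nrow(wine)))  # 70/30 split
train_set <- wine[train_idx, ]
test_set  <- wine[-train_idx, ]

# Exclude the raw score (and type) so models see only the 11 chemical features.
form <- quality_category ~ . - quality - type

rf_fit <- randomForest(form, data = train_set, ntree = 500, importance = TRUE)
dt_fit <- rpart(form, data = train_set, method = "class")  # single CART tree
lr_fit <- multinom(form, data = train_set, trace = FALSE)  # combined data only
```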

Evaluation Metrics

To compare model performance, we used:

  • Accuracy: Overall correctness
  • AUC (macro): Average across classes
  • Recall (Sensitivity): Measures ability to detect actual positives
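A sketch of computing these metrics for the fitted forest, assuming the caret and pROC packages (`multiclass.roc()` averages pairwise AUCs, which we use here as a stand-in for the macro-average AUC):

```r
library(caret)  # confusionMatrix(): accuracy and per-class recall
library(pROC)   # multiclass.roc(): multi-class AUC

pred_class <- predict(rf_fit, newdata = test_set)
pred_prob  <- predict(rf_fit, newdata = test_set, type = "prob")

cm <- confusionMatrix(pred_class, test_set$quality_category)
cm$overall["Accuracy"]       # overall correctness
cm$byClass[, "Sensitivity"]  # recall for Low / Medium / High

multiclass.roc(test_set$quality_category, pred_prob)$auc
```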

Accuracy & AUC

| Dataset    | Model               | Accuracy | Avg AUC |
|------------|---------------------|----------|---------|
| Red Wine   | Random Forest       | 74.1%    | 0.879   |
| Red Wine   | Decision Tree       | 63.9%    | 0.777   |
| White Wine | Random Forest       | 71.6%    | 0.882   |
| White Wine | Decision Tree       | 56.1%    | 0.710   |
| Combined   | Random Forest       | 72.1%    | 0.884   |
| Combined   | Logistic Regression | 56.7%    | 0.756   |
| Combined   | Decision Tree       | 55.5%    | 0.723   |

  • Random Forest outperformed all others
  • Logistic Regression underperformed
  • Best AUC: Random Forest on white wine

Random Forest handled class imbalance and complex interactions best.

Class-Level Performance – Combined Dataset

Variable Importance – Combined

  • Top features: Alcohol, Volatile Acidity, Sulphates
  • Importance reveals interactions, unlike correlation
  • Tree models consider feature interactions and splits across samples
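In R, the importances summarized above come straight out of the fitted forest; a minimal sketch, reusing `rf_fit` from the modeling snippet:

```r
importance(rf_fit)  # mean decrease in accuracy and in Gini per feature
varImpPlot(rf_fit, main = "Variable Importance - Combined Wines")
```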

Variable Importance – Red vs White

  • Red Wine: Sulphates > Volatile Acidity > Alcohol
  • White Wine: Alcohol > Volatile Acidity > Free Sulfur Dioxide

Feature importance shifts slightly by wine type, but alcohol remains consistently relevant.

Conclusion

  • Alcohol, volatile acidity, and sulphates are the most influential features
  • Random Forest showed the best predictive performance
  • Correlation analysis offered insights, but feature importance revealed deeper interactions
  • Results align with prior literature

Future work: include more detailed chemical breakdowns (e.g., esters, alcohol subtypes)

Thanks for your attention!