Using Random Forest to Predict Wine Quality

Summer 2025

Bailyn Rowe, Gabriel Gonzalez Rincon, Morgan Watkins (Advisor: Dr. Cohen)

2025-08-01

Introduction

The idea for decision trees originated with Morgan and Sonquist [@article].

A decision tree is an example of supervised machine learning, and random forest is an ensemble classifier that builds on decision trees by adding randomness.

In this presentation, random forest and decision trees are used to predict the quality of wine based on its chemical properties.

Potential Applications of Random Forest

Random Forest and Decision Tree methodologies can be applied across datasets and industries.

Examples:

- healthcare: risk for cardiovascular disease [@su2020prediction] and diabetes [@xu2017risk]

- agriculture: predicting sugarcane production [@everingham2016accurate]

- emergency management: predicting the severity of highway accidents [@chen2020modeling]

Literature Review

Medical researchers used random forest methodology to conduct risk assessments identifying individuals at high risk for cardiovascular disease; random forest was a helpful prediction model because it could evaluate many possible predictor variables and the complex relationships among them [@su2020prediction].

Modeling Type 2 Diabetes Mellitus is difficult because of interactions between genetic, environmental, and behavioral factors that classic statistical methods cannot accurately capture. Both a decision tree model and a random forest model were created [@esmaily2018comparison]; comparing the two showed that BMI, triglycerides, and family history were the top risk factors.

Everingham et al. applied random forest to a dataset collected from 1992 to 2013, containing variables relating to weather and soil quality, and used it to predict sugarcane yield [@everingham2016accurate]. While the models' accuracy is not perfect, their predictions, combined with other forecasts, can serve as a guide for farming activities and planning.

Decision Trees

Decision Trees represent a sequence of rules in the shape of a tree or a flowchart.

The decision tree is made up of the root node, the decision nodes, the terminal nodes, and the branches.

Several different algorithms are used to build decision trees: ID3, Chi-Square Automatic Interaction Detection (CHAID), Multivariate Adaptive Regression Splines (MARS), and Conditional Inference Trees.

Classification and Regression Trees (CART)

The CART algorithm is the tree-builder behind R's rpart and randomForest packages, and therefore underlies the random forests grown in this analysis.

CART is useful for both regression and classification. It chooses splits using Gini impurity for classification and variance reduction for regression.

The Gini impurity measures the likelihood that a randomly selected observation would be misclassified if it were labeled at random according to the class distribution at a node.

\[ Gini(p) = 1 - \sum_{i=1}^{C} p_i^{2} \]

where \(p_i\) is the proportion of observations in class \(i\) and \(C\) is the number of classes at the node.
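As a quick check of the formula, here is a minimal R sketch (the function name `gini_impurity` is ours, not part of any package):

```r
# Gini impurity of a node: 1 - sum(p_i^2) over the class proportions p_i.
gini_impurity <- function(labels) {
  p <- table(labels) / length(labels)  # class proportions p_i
  1 - sum(p^2)
}

gini_impurity(c("Low", "Low", "Low"))      # 0: a pure node
gini_impurity(c("Low", "Medium", "High"))  # 1 - 3*(1/3)^2 = 0.667, maximal for 3 classes
```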

Random Forest

Random forest creates multiple decision trees on random subsets of the training data, then aggregates the trees' individual predictions, by majority vote for classification or by averaging for regression, to produce a final prediction:

  1. Training Data Selection

  2. Tree Growth

  3. Random Attribute Selection

  4. Majority Vote
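A minimal, self-contained sketch of these four steps in R, using the randomForest package on the built-in iris data rather than the wine data, purely to show the mechanics:

```r
library(randomForest)

set.seed(42)
idx <- sample(nrow(iris), floor(0.7 * nrow(iris)))  # simple 70/30 split

# randomForest() performs steps 1-4 internally: each tree is grown on a
# bootstrap sample of the rows (1, 2), each split considers only a random
# subset of mtry features (3), and classification is by majority vote (4).
rf <- randomForest(Species ~ ., data = iris[idx, ], ntree = 500)

pred <- predict(rf, newdata = iris[-idx, ])  # majority-vote class labels
mean(pred == iris$Species[-idx])             # hold-out accuracy
```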

Analysis, Results and Conclusion

Data Exploration

  • Combined dataset: 6,497 wines
    • Red: 1,599
    • White: 4,898
  • Quality scores (0–10) converted to categories:
    • Low (≤5), Medium (6), High (≥7)
  • Features: 11 chemical attributes

Initial exploration helps reveal structure and potential predictive patterns.
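A sketch of the loading step, assuming the semicolon-separated CSVs hosted on the legacy UCI Machine Learning Repository (the URLs are our assumption and may need updating):

```r
# Wine-quality CSVs use ";" as the field separator.
base  <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality"
red   <- read.csv(file.path(base, "winequality-red.csv"),   sep = ";")
white <- read.csv(file.path(base, "winequality-white.csv"), sep = ";")

nrow(red)    # 1599 red wines
nrow(white)  # 4898 white wines
```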

Wine Quality by Wine Type

  • White wine dominates medium and high categories

  • Red wine more evenly distributed

Important to account for this imbalance in classification.

Preprocessing Steps

  • Merged datasets and added “type” variable
  • Created “quality_category” (Low, Medium, High)
  • Selected features based on:
    • Correlation
    • Prior domain knowledge
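A sketch of these steps in R, continuing from the loading sketch above (variable names such as `wine` and `quality_category` mirror the slide text but are otherwise our choices):

```r
# Merge the two datasets and record the wine type.
red$type   <- "red"
white$type <- "white"
wine <- rbind(red, white)  # 6,497 wines

# Bin the 0-10 quality score: Low (<= 5), Medium (6), High (>= 7).
wine$quality_category <- cut(wine$quality,
                             breaks = c(-Inf, 5, 6, Inf),
                             labels = c("Low", "Medium", "High"))

table(wine$type, wine$quality_category)  # class balance by wine type
```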

Correlation Results

  • Correlation measures the linear relationship between each feature and the numeric quality score.

  • Strongest correlations:

    • Alcohol (r = 0.44)
    • Density (r = –0.31)
    • Volatile Acidity (r = –0.27)
  • Sulphates (r = 0.04): low correlation, but may still be important

  • Correlation captures linear relationships only — not interactions.
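One way to reproduce these correlations in R, continuing from the preprocessing sketch (exact r values depend on the combined data):

```r
# Pearson correlation of each chemical feature with the raw quality score.
features <- setdiff(names(wine), c("quality", "quality_category", "type"))
cors <- sapply(wine[features], cor, y = wine$quality)
sort(round(cors, 2))  # e.g., alcohol near 0.44, density near -0.31
```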

Modeling Approach

  • Classification goal: predict Low, Medium, High wine quality
  • Train/test split: 70%/30%
  • Models trained:
    • Random Forest (all datasets)
    • Decision Tree (all datasets)
    • Logistic Regression (combined only)
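A sketch of the training setup under the same assumptions as the earlier snippets (the 70/30 split is from the slide; the seed, formula, and object names are ours):

```r
library(randomForest)
library(rpart)
library(nnet)  # multinom() for multi-class logistic regression

set.seed(123)
train_idx <- sample(nrow(wine), floor(0.7 * nrow(wine)))  # 70/30 split
train_set <- wine[train_idx, ]
test_set  <- wine[-train_idx, ]

# Exclude the raw score (and type) so models see only the 11 chemical features.
form <- quality_category ~ . - quality - type

rf_fit <- randomForest(form, data = train_set, ntree = 500, importance = TRUE)
dt_fit <- rpart(form, data = train_set, method = "class")  # single CART tree
lr_fit <- multinom(form, data = train_set, trace = FALSE)  # combined data only
```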

Evaluation Metrics

To compare model performance, we used:

  • Accuracy: Overall correctness
  • AUC (macro): Average across classes
  • Recall (Sensitivity): Measures ability to detect actual positives
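A sketch of computing these metrics for the fitted forest, assuming the caret and pROC packages (`multiclass.roc()` averages pairwise AUCs, which we use here as a stand-in for the macro-average AUC):

```r
library(caret)  # confusionMatrix(): accuracy and per-class recall
library(pROC)   # multiclass.roc(): multi-class AUC

pred_class <- predict(rf_fit, newdata = test_set)
pred_prob  <- predict(rf_fit, newdata = test_set, type = "prob")

cm <- confusionMatrix(pred_class, test_set$quality_category)
cm$overall["Accuracy"]       # overall correctness
cm$byClass[, "Sensitivity"]  # recall for Low / Medium / High

multiclass.roc(test_set$quality_category, pred_prob)$auc
```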

Accuracy & AUC

| Dataset    | Model               | Accuracy | Avg AUC |
|------------|---------------------|----------|---------|
| Red Wine   | Random Forest       | 74.1%    | 0.879   |
| Red Wine   | Decision Tree       | 63.9%    | 0.777   |
| White Wine | Random Forest       | 71.6%    | 0.882   |
| White Wine | Decision Tree       | 56.1%    | 0.710   |
| Combined   | Random Forest       | 72.1%    | 0.884   |
| Combined   | Logistic Regression | 56.7%    | 0.756   |
| Combined   | Decision Tree       | 55.5%    | 0.723   |

  • Random Forest outperformed all others
  • Logistic Regression underperformed
  • Best AUC: Random Forest on white wine

Random Forest handled class imbalance and complex interactions best.

Class-Level Performance – Combined Dataset

Variable Importance – Combined

  • Top features: Alcohol, Volatile Acidity, Sulphates
  • Importance reveals interactions, unlike correlation
  • Tree models consider feature interactions and splits across samples
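In R, the importances summarized above come straight out of the fitted forest; a minimal sketch, reusing `rf_fit` from the modeling snippet:

```r
importance(rf_fit)  # mean decrease in accuracy and in Gini per feature
varImpPlot(rf_fit, main = "Variable Importance - Combined Wines")
```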

Variable Importance – Red vs White

  • Red Wine: Sulphates > Volatile Acidity > Alcohol
  • White Wine: Alcohol > Volatile Acidity > Free Sulfur Dioxide

Feature importance shifts slightly by wine type, but alcohol remains consistently relevant.

Conclusion

  • Alcohol, volatile acidity, and sulphates are the most influential features
  • Random Forest showed the best predictive performance
  • Correlation analysis offered insights, but feature importance revealed deeper interactions
  • Results align with prior literature

Future work: include more detailed chemical breakdowns (e.g., esters, alcohol subtypes)

Thanks for your attention!