Modeling wine preferences by data mining from physicochemical properties
Abstract
We propose a data mining approach to predict human wine taste preferences that
is based on easily available analytical tests at the certification step. A large dataset
(when compared to other studies in this domain) is considered, with white and red
vinho verde samples (from Portugal). Three regression techniques were applied, under a computationally efficient procedure that performs simultaneous variable and model selection. The support vector machine achieved promising results, outperforming the multiple regression and neural network methods. Such a model is useful to support the oenologist's wine tasting evaluations and to improve wine production. Furthermore, similar techniques can help in target marketing by modeling consumer tastes from niche markets.
1 Introduction

Once viewed as a luxury good, nowadays wine is increasingly enjoyed by a wider range of consumers. Portugal is a top ten wine exporting country, with 3.17% of the market share in 2005 [11]. Exports of its vinho verde wine (from
the northwest region) have increased by 36% from 1997 to 2007 [8]. To support
its growth, the wine industry is investing in new technologies for both wine
making and selling processes. Wine certification and quality assessment are
key elements within this context. Certification prevents the illegal adulteration
of wines (to safeguard human health) and assures quality for the wine market.
Quality evaluation is often part of the certification process and can be used to improve wine making (by identifying the most influential factors) and to stratify wines, such as premium brands (useful for setting prices). Wine certification is generally assessed by physicochemical and sensory tests. Physicochemical laboratory tests routinely used to characterize wine include determination of density, alcohol or pH values, while sensory tests rely mainly on human experts. It should be stressed that taste is the least understood of the human senses [25], thus wine classification is a difficult task. On the other hand, advances in information technologies have made it possible to collect, store and process massive, often highly complex datasets. All these data hold valuable information, such as trends and patterns, which can be used to improve
decision making and optimize chances of success [28]. Data mining (DM) tech-
niques [33] aim at extracting high-level knowledge from raw data. There are
several DM algorithms, each one with its own advantages. When modeling con-
tinuous data, the linear/multiple regression (MR) is the classic approach. The
backpropagation algorithm was first introduced in 1974 [32] and later popular-
ized in 1986 [23]. Since then, neural networks (NNs) have become increasingly
used. More recently, support vector machines (SVMs) have also been proposed
[4][26]. Due to their higher flexibility and nonlinear learning capabilities, both
NNs and SVMs are gaining attention within the DM field, often attaining high predictive performances. SVMs present theoretical advantages over NNs, such as the absence of local minima in the learning phase. In effect, the SVM was recently considered one of the most influential DM algorithms. While the MR model is easier to interpret, it is still possible to extract knowledge from NNs and SVMs, given in terms of input variable importance [18][7].
When applying these DM methods, variable and model selection are critical issues. Variable selection is useful to discard irrelevant inputs, leading to simpler models that are easier to interpret and that usually give better performances. Complex models may overfit the data, losing the capability to generalize, while a model that is too simple will present limited learning capabilities. Indeed, both NNs and SVMs have hyperparameters that need to be adjusted [16], such as the number of NN hidden nodes or the SVM kernel parameter.
The use of decision support systems by the wine industry is mainly focused on the wine production phase. Despite the potential of DM techniques to predict wine quality based on physicochemical data, their use is rather scarce
and mostly considers small datasets. For example, in 1991 the "Wine" dataset was donated to the UCI repository [1]. The data contain 178 examples with 13 chemical measurements, and the goal is to classify three cultivars from Italy. This dataset is very easy to discriminate and has been mainly used as a benchmark for new DM classifiers. In 1997 [27], a NN fed with 15 input variables (e.g. Zn and Mg levels) was used to predict six geographic wine origins. The data included 170 samples from Germany and a 100% predictive rate was reported. In 2001 [30], NNs were used to model wine quality based on grape maturity levels and chemical analysis (e.g. titrable acidity), using only a small number of samples. Another study addressed the sensory quality of Italian wine; the authors argued that mapping chemical parameters with a sensory taste panel is a very difficult task, and instead a NN fed with mineral measurements (e.g. Zn and Mg) was used to discriminate 54 samples into two red wine classes. DM methods have also been applied to related food science tasks, such as predicting meat preferences [7]. Yet, in the field of wine quality only one application has been reported, where spectral measurements from 147 bottles were successfully used to predict 3 categories of rice wine age [35].
In this paper, we present a case study for modeling taste preferences based on analytical data that are easily available at the wine certification step. Building such a model is valuable not only for certification entities but also for wine producers and even consumers. It can be used to support the oenologist's wine tasting evaluations and to improve wine production. Moreover, measuring the impact of the physicochemical tests on the final wine quality can help in target marketing [24], i.e. by applying similar techniques to model the preferences of niche markets. In addition, we use a sensitivity analysis procedure that measures input relevance and guides the variable selection process. Also, we propose a parsimony search method to select the best SVM kernel parameter. The approach is applied to vinho verde wine (from the Minho region of Portugal) taste preferences, showing its impact in this domain. In contrast with previous studies, a large dataset is considered, with a total of 4898 white and 1599 red samples. Wine preferences are modeled under a regression approach, which preserves the order of the grades, and we show how the definition of a tolerance concept is useful for assessing different performance levels. We believe that this integrated approach is valuable within this domain.

The paper is organized as follows: Section 2 presents the wine data, DM models and the variable selection approach; in Section 3, the experiments are described and the obtained results are analyzed; finally, conclusions are drawn in Section 4.
2 Materials and methods
This study will consider vinho verde, a unique product from the Minho (northwest) region of Portugal, which is particularly appreciated due to its freshness (especially in the summer). This wine accounts for 15% of the total Portuguese production [8], and around 10% is exported, mostly white wine. In this work, we will analyze the two most common variants, white and red (rosé is also produced), from the demarcated region of vinho verde. Only protected designation of origin samples that were tested at the official certification entity (CVRVV) were considered; the CVRVV is an inter-professional organization with the goal of improving the quality and marketing of vinho verde. The data were recorded by a computerized system, which automatically manages the process of wine sample testing from producer requests to laboratory and sensory analysis. Each entry denotes a given test (analytical or sensory), and the database was transformed in order to include a distinct wine sample (with all tests) per row. To avoid discarding examples, only the most common physicochemical tests were selected. Since the red and white tastes are quite different, the analysis was performed separately; thus, two datasets were built, with 1599 red and 4898 white examples (Table 1 presents the physicochemical statistics per dataset). Each sample was evaluated by a minimum of three sensory assessors (using blind tastes), who graded the wine on a scale that ranges
from 0 (very bad) to 10 (excellent). The final sensory score is given by the me-
dian of these evaluations. Fig. 1 plots the histograms of the target variables,
denoting a typical normal shape distribution (i.e. with more normal grades than extreme ones).

We will adopt a regression approach, which preserves the order of the preferences. For instance, if the true grade is 3, then a model that predicts 4 is better than one that predicts 7. A regression dataset is made up of N examples, each mapping an input vector with I input variables $(x_{k1}, \ldots, x_{kI})$ to a target value $y_k$. The regression performance is commonly measured by an error metric, such as the mean absolute deviation (MAD) [33]:
$MAD = \sum_{i=1}^{N} |y_i - \hat{y}_i| / N \quad (1)$
where $\hat{y}_i$ is the predicted value for the i-th input pattern. The regression error characteristic (REC) curve [2] is also used to compare regression models, with the ideal model presenting an area of 1.0. The curve plots the absolute error tolerance (x-axis) versus the percentage of points correctly predicted within the tolerance (y-axis). Since the quality grades are ordered classes, a classification analysis is also possible by building a confusion matrix, which matches the predicted values (in columns) with the desired classes (in rows). For an ordered output, the predicted class is given by $p_i = y_i$, if $|y_i - \hat{y}_i| \leq T$, else $p_i = y_i'$, where $y_i'$ denotes the closest class to $\hat{y}_i$, given that $y_i' \neq y_i$. From the matrix, several metrics can be computed, such as the accuracy and precision (i.e. the predicted column accuracies) [33].
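For illustration purposes only, these metrics can be computed in R (the tool used in Section 3) with a minimal sketch, assuming two numeric vectors y (true grades) and yhat (predictions):

  # Mean absolute deviation (Eq. 1) and accuracy within a tolerance T
  mad_metric <- function(y, yhat) mean(abs(y - yhat))
  acc_tol    <- function(y, yhat, tol) mean(abs(y - yhat) <= tol)

  y    <- c(5, 6, 5, 7)
  yhat <- c(5.2, 6.6, 4.4, 5.9)
  mad_metric(y, yhat)         # 0.625
  acc_tol(y, yhat, tol = 0.5) # 0.25 (one of the four predictions is within T = 0.5)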
The holdout validation procedure is commonly used to estimate the generalization capability of a model [19]. This method randomly partitions the data into training and test subsets. The former subset is used to fit the model (typically with 2/3 of the data), while the latter (with the remaining 1/3) is used to compute the predictive estimate. A more robust estimate is given by k-fold cross-validation [9], where the data are divided into k partitions of equal size. One subset is
tested each time and the remaining data are used for fitting the model. The
process is repeated sequentially until all subsets have been tested. Therefore,
under this scheme, all data are used for training and testing. However, this
method requires around k times more computation, since k models are fitted.
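A minimal R sketch of this k-fold scheme follows (this is an illustration, not the RMiner implementation; the data frame d with a quality output column and the fit_fn/pred_fn helpers are assumptions):

  # k-fold cross-validation: each partition is tested once, while the
  # remaining k-1 partitions are used to fit the model
  kfold_mad <- function(d, k = 5, fit_fn, pred_fn) {
    fold <- sample(rep(1:k, length.out = nrow(d)))   # random equal-size partitions
    errs <- sapply(1:k, function(i) {
      model <- fit_fn(d[fold != i, ])                # fit on the training partitions
      yhat  <- pred_fn(model, d[fold == i, ])        # predict the held-out fold
      mean(abs(d$quality[fold == i] - yhat))         # MAD on the test fold
    })
    mean(errs)                                       # average over the k folds
  }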
We will adopt the most common NN type, the multilayer perceptron, where neurons are grouped into layers and connected by feedforward links [3]. For regression, we use an architecture with one hidden layer of H hidden nodes with a logistic activation function and one output node with a linear function [16]:

$\hat{y} = w_{o,0} + \sum_{j=I+1}^{o-1} \frac{1}{1 + \exp(-\sum_{i=1}^{I} x_i w_{j,i} - w_{j,0})} \cdot w_{o,j} \quad (2)$
where $w_{i,j}$ denotes the weight of the connection from node j to node i and o denotes the output node. The performance is sensitive to the topology choice: a NN with H = 0 is equivalent to the MR model. By increasing H, more complex mappings can be performed, yet an excess value of H will overfit the data. Thus, a simple grid search is adopted to search through the range $\{0, 1, 2, 3, \ldots, H_{max}\}$ (i.e. from the simplest NN to more complex ones). For each H value, a NN is trained and its generalization estimate is measured; the search stops when this estimate decreases or after the maximum value is reached ($H_{max}$).
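A minimal sketch of this grid search, using the nnet package adopted in Section 3 (the data frame d, its quality column and the idx holdout index are assumptions of this illustration, not the exact RMiner code):

  library(nnet)

  idx  <- sample(nrow(d), floor(2/3 * nrow(d)))  # 2/3 holdout split for validation
  best <- list(H = 0, mad = Inf)
  for (H in 0:11) {
    m <- nnet(quality ~ ., data = d[idx, ], size = H, linout = TRUE,
              skip = (H == 0), trace = FALSE)  # H = 0: skip-layer only, i.e. a linear (MR) model
    mad <- mean(abs(d$quality[-idx] - predict(m, d[-idx, ])))
    if (mad < best$mad) best <- list(H = H, mad = mad) else break  # parsimony: stop on first decrease
  }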
In SVM regression [26], the input x is transformed into a high-dimensional feature space, by using a nonlinear mapping ($\phi$) that does not need to be explicitly known but that depends on a kernel function (K). The aim of an SVM is to find the best linear separating hyperplane, tolerating a small error ($\epsilon$) when fitting the data, in the feature space:
$\hat{y} = w_0 + \sum_{i=1}^{m} w_i \phi_i(x) \quad (3)$
The -insensitive loss function sets an insensitive tube around the residuals
and the tiny errors within the tube are discarded (Fig. 2).
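For illustration, the $\epsilon$-insensitive loss of Fig. 2 can be written as a one-line R function (a sketch):

  # residuals inside the tube cost nothing; outside, the penalty grows linearly
  eps_loss <- function(residual, eps) pmax(abs(residual) - eps, 0)
  eps_loss(c(-0.05, 0.02, 0.30), eps = 0.1)  # 0.00 0.00 0.20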
We will adopt the popular gaussian kernel, which presents fewer parameters than other kernels (e.g. polynomial) [31]: $K(x, x') = \exp(-\gamma ||x - x'||^2), \gamma > 0$. Under this setup, the SVM performance is affected by three parameters: C (a trade-off between fitting the errors and the flatness of the mapping), $\epsilon$ (the width of the insensitive tube) and $\gamma$ (the kernel parameter). To reduce the search space, the first two values will be set using the heuristics [5]: C = 3 (for a standardized output) and $\epsilon = \hat{\sigma}/\sqrt{N}$, where $\hat{\sigma} = 1.5/N \times \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$ and $\hat{y}_i$ is the value predicted by a 3-nearest neighbor algorithm. The kernel parameter ($\gamma$) produces the highest impact in the SVM performance, with
values that are too large or too small leading to poor predictions. A practical
method to set γ is to start the search from one of the extremes and then search
towards the middle of the range while the predictive estimate increases [31].
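A minimal sketch of this search, using the kernlab package adopted in Section 3 (d, idx and the heuristic eps value are assumed to be defined as in the earlier sketches):

  library(kernlab)

  gammas <- 2^seq(-15, 3, by = 2)   # from one extreme towards the other
  best   <- list(gamma = NA, mad = Inf)
  for (g in gammas) {
    m <- ksvm(quality ~ ., data = d[idx, ], type = "eps-svr", C = 3,
              epsilon = eps, kernel = "rbfdot", kpar = list(sigma = g))
    mad <- mean(abs(d$quality[-idx] - predict(m, d[-idx, ])))
    if (mad < best$mad) best <- list(gamma = g, mad = mad) else break  # stop when the estimate degrades
  }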
Sensitivity analysis [18] is a simple procedure that is applied after the train-
ing phase and analyzes the model responses when the inputs are changed.
Originally proposed for NNs, this sensitivity method can also be applied to
other algorithms, such as SVM [7]. Let $\hat{y}_{a,j}$ denote the output obtained by holding all input variables at their average values except $x_a$, which varies through its entire range with $j \in \{1, \ldots, L\}$ levels. If a given input variable is relevant, then it should produce a high variance ($V_a$) of the responses. Thus, its relative importance ($R_a$) can be given by:

$V_a = \sum_{j=1}^{L} (\hat{y}_{a,j} - \overline{\hat{y}_{a}})^2 / (L-1)$
$R_a = V_a / \sum_{i=1}^{I} V_i \times 100\ (\%) \quad (4)$

where $\overline{\hat{y}_{a}}$ denotes the mean of the L responses.
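The following sketch illustrates the computation of Eq. (4) (our illustration; an all-numeric data frame d with the quality output and a predict-compatible model are assumptions):

  # vary one input over L levels while holding the others at their mean values
  input_importance <- function(model, d, inputs, L = 6) {
    V <- sapply(inputs, function(a) {
      probe <- d[rep(1, L), ]                                      # L copies of one row
      probe[] <- lapply(d, function(col) rep(mean(col), L))        # all columns at their mean
      probe[[a]] <- seq(min(d[[a]]), max(d[[a]]), length.out = L)  # sweep x_a over its range
      yhat <- predict(model, probe)
      sum((yhat - mean(yhat))^2) / (L - 1)                         # V_a
    })
    100 * V / sum(V)                                               # R_a (%)
  }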
In this work, the Ra values will be used to measure the importance of the inputs
and also to discard irrelevant inputs, guiding the variable selection algorithm.
We will adopt the popular backward selection, which starts with all variables
and iteratively deletes one input until a stopping criterion is met [14]. Yet, we guide the variable deletion (at each step) by the sensitivity analysis, in a process that requires fewer model fits (when compared to the standard backward procedure) and that in [18] has outperformed other variable selection methods. Similarly to [36], the variable and model selection will be performed simultaneously, i.e. in each backward iteration several models are searched, with the one that presents the best generalization estimate being selected. The procedure, illustrated by the sketch given after the list, is:

(1) Start with all $F = I$ input variables.
(2) If the model contains a hyperparameter $P \in \{P_1, \ldots, P_k\}$ (e.g. NN or SVM), start with $P_1$ and go through the remaining range until the generalization estimate decreases. When a single holdout split is used, the available data are further split into training (to fit the model) and validation sets (to obtain the predictive estimate).
(3) After fitting the model, compute the relative importances ($R_i$) of all $x_i \in F$ variables and delete the least relevant input. Return to step (2).
(4) Select the best F (and P in case of NN or SVM) values, i.e., the input variables and model that provide the best predictive estimates. Finally, retrain this configuration with all available data.
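A minimal sketch of the whole procedure (the fit_best_P helper stands for the step (2) parsimony search and input_importance for the sensitivity procedure above; both names are assumptions of this illustration):

  inputs <- setdiff(names(d), "quality")   # step (1): start with all F = I variables
  best   <- list(mad = Inf)
  while (length(inputs) >= 1) {
    r <- fit_best_P(d[, c(inputs, "quality")])        # step (2): search the hyperparameter P
    if (r$mad < best$mad) best <- list(mad = r$mad, inputs = inputs, model = r$model)
    R <- input_importance(r$model, d[, c(inputs, "quality")], inputs)  # step (3)
    inputs <- inputs[-which.min(R)]                    # delete the least relevant input
  }
  # step (4): best$inputs and best$model keep the selected configuration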
3 Empirical results
All experiments were conducted using the R environment, an open source, multiple platform (e.g. Windows, Linux) and high-level matrix programming language for statistical and data analysis. All experiments reported in this work were written in R and conducted on a Linux server. In particular, we adopted the RMiner [6], a library for the R tool that facilitates the use of DM techniques in classification and regression tasks.
Before fitting the models, the data were first standardized to a zero mean and one standard deviation [16]. RMiner uses the efficient BFGS algorithm to train the NNs (nnet R package), while the SVM fit is based on the Sequential Minimal Optimization algorithm (kernlab R package). We adopted the default R suggestions [29]. The only exceptions are the hyperparameters (H and $\gamma$), which will be set using the procedure described in the previous section and with the search ranges of $H \in \{0, 1, \ldots, 11\}$ [36] and $\gamma \in \{2^{-15}, 2^{-13}, \ldots, 2^{3}\}$ [31]. While the maximum number of searches is 12/10, in practice the parsimony approach (step 2 of Section 2.4) will reduce this number substantially. To decrease the computational effort during the selection phase, we adopted the simpler 2/3 and 1/3 holdout split as the internal validation method, as a compromise between the pressure towards simpler models and the increase of computational cost. During the search, the stopping criterion was set to 2 iterations without any improvement of the generalization estimate.
To evaluate the selected models, we adopted 20 runs of the more robust 5-fold cross-validation, in a total of 20 × 5 = 100 experiments for each tested configuration. Statistical confidence will be given by the t-student test at the 95% confidence level [13]. The results are summarized in Table 2. The test set errors are shown in terms of the mean and confidence intervals. Three metrics are present: MAD, the classification accuracy for different tolerances (i.e. T = 0.25, 0.5 and 1.0) and Kappa (T = 0.5). The selected models are described in terms of the average number of inputs and the hyperparameter value (H or $\gamma$). The last row shows the total computational time required in seconds.
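For illustration, the mean and its 95% interval for one configuration can be obtained in R from the vector of the 100 test errors (a sketch with placeholder values):

  errs <- rnorm(100, mean = 0.45, sd = 0.02)   # placeholder: the 20x5 test fold errors
  mean(errs)                                   # point estimate, as reported in Table 2
  t.test(errs, conf.level = 0.95)$conf.int     # 95% t-student confidence interval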
For both tasks and all error metrics, the SVM is the best choice. The differences
are higher for small tolerances and in particular for the white wine (e.g. for
T = 0.25, the SVM accuracy is almost two times better when compared to
other methods). This effect is clearly visible when plotting the full REC curves
(Fig. 3). The Kappa statistic [33] measures the accuracy when compared with
a random classifier (which presents a Kappa value of 0%). The higher the
statistic, the more accurate the result. The most practical tolerance values are
T = 0.5 and T = 1.0. The former tolerance rounds the regression response
into the nearest class, while the latter accepts a response that is correct within
one of the two closest classes (e.g. a 3.1 value can be interpreted as grade 3
or 4 but not 2 or 5). For T = 0.5, the SVM accuracy improvement is 3.3
pp for red wine (6.2 pp for Kappa), a value that increases to 12.0 pp for the
white task (20.4 pp for Kappa). The NN is quite similar to MR in the red wine
modeling, thus similar performances were achieved. For the white data, a more complex NN configuration was selected, leading to slightly better results. Regarding the variable selection, the average number of deleted inputs ranges from 0.9 to 1.8, showing that most of the physicochemical tests used are relevant. In terms of computational effort, the SVM is the most expensive method. A detailed analysis of the SVM predictions is given by the average confusion matrixes for T = 0.5 (Table 3). To simplify the visualization, the 3
and 9 grade predictions were omitted, since these were always empty. Most of
the values are close to the diagonals (in bold), denoting a good fit by the model.
The true predictive accuracy for each class is given by the precision metric (e.g. for the grade 4 and white wine, $precision_{T=0.5} = 19/(19+7+4) = 63.3\%$). This statistic is important in practice, since in a real deployment setting the actual values are unknown and all predictions within a given column would be treated the same. For a tolerance of 0.5, the SVM red wine accuracies
are around 57.7 to 67.5% in the intermediate grades (5 to 7) and very low (0%/20%) for the extreme classes (3, 4 and 8), which are less frequent (Fig. 1). In general, the white data results are better: 60.3/63.3% for classes 6 and 4, 67.8/72.6% for grades 7 and 5, and a surprising 85.5% for the class 8 (the exceptions are the 3 and 9 extremes with 0%, not shown in the table). When the tolerance is increased to T = 1.0, the precisions improve substantially, ranging from 81.9% to 100% (Table 3).
The average SVM relative importance plots (Ra values) of the analytical tests
are shown in Fig. 4. It should be noted that all 11 inputs are shown, since different sets of variables can be selected in each simulation. In several cases, the obtained results confirm the oenological theory. For instance, an
increase in the alcohol (4th and 2nd most relevant factor) tends to result in
a higher quality wine. Also, the rankings are different within each wine type.
For instance, the citric acid and residual sugar levels are more important in
white wine, where the equilibrium between the freshness and sweet taste is
more appreciated. Moreover, the volatile acidity has a negative impact, since
acetic acid is the key ingredient in vinegar. The most intriguing result is the
high importance of sulphates, ranked first for both cases. Oenologically, this result could be interesting, since sulphates are related to the fermenting nutrition, which is very important to improve the wine aroma.
4 Conclusions and implications
In recent years, the interest in wine has increased, leading to growth of the wine industry. To support this growth, the industry is investing in new technologies for both wine making and selling processes. Quality certification is a crucial step for both processes and is currently largely dependent on wine tasting by
human experts. This work aims at the prediction of wine preferences from
objective analytical tests that are available at the certification step. A large
dataset (with 4898 white and 1599 red entries) was considered, including vinho
verde samples from the northwest region of Portugal. This case study was ad-
dressed by two regression tasks, where each wine type preference is modeled in a continuous scale, from 0 (very bad) to 10 (excellent). This approach preserves the order of the classes, allowing the evaluation of distinct accuracies, according to the degree of the error tolerance (T) that is accepted.
Due to advances in the data mining (DM) field, it is possible to extract knowl-
edge from raw data. Indeed, powerful techniques such as neural networks
(NNs) and more recently support vector machines (SVMs) are emerging. While being more flexible models (i.e. no a priori restriction is imposed), the performance of NNs and SVMs depends on a correct setting of their hyperparameters. On the other hand, the multiple regression (MR) is easier to interpret than NN/SVM, with the latter requiring extra procedures for knowledge extraction, given in terms of the relative importance of the inputs. A simultaneous variable and model selection scheme is also proposed, where the variable selection is guided by sensitivity analysis and the model selection is based on a parsimony search that starts from a reasonable value and is stopped when the generalization estimate decreases.
Encouraging results were achieved, with the SVM model providing the best performances, in particular for the white vinho verde wine, which is the most common type. When admitting only the correctly classified classes (T = 0.5), the overall accuracies are 62.4% (red) and 64.6% (white). It should be noted that the datasets contain six/seven
classes (from 3 to 8/9). These accuracies are much better than the ones expected from a random classifier. The performance is substantially improved when the tolerance is set to accept responses that are correct within one of the two nearest classes (T = 1.0), obtaining a global accuracy of 89.0% (red) and 86.8% (white). In particular, for both tasks the majority of the classes present individual precisions above 90% at this tolerance (Table 3).
The superiority of SVM over NN is probably due to the differences in the train-
ing phase. The SVM algorithm guarantees an optimum fit, while NN training
may fall into a local minimum. Also, the SVM cost function (Fig. 2) gives a linear penalty to large errors. In contrast, the NN algorithm minimizes the sum of squared errors, being more sensitive to outliers, and this effect results in a higher SVM accuracy for low error tolerances. As argued in [15], comparing methods is a difficult task, with researchers tending to favor models that they know better. We adopted the default sug-
gestions of the R tool [29], except for the hyperparameters (which were set
using a grid search). Since the default settings are more commonly used, this
seems a reasonable assumption for the comparison. Nevertheless, different results might be achieved if other NN architectures or SVM kernel functions were used. Under the tested setup, the SVM algorithm provided the best results while requiring more computation. Yet, the SVM fitting can still be achieved within a reasonable time: for example, one run of the 5-fold cross-validation testing takes around 26 minutes for the larger white dataset.
The result of this work is important for the wine industry. At the certification
phase and by Portuguese law, the sensory analysis has to be performed by hu-
man tasters. Yet, the evaluations are based on the experience and knowledge of the experts, which are prone to subjective factors. The proposed data-driven approach is based on objective tests, and thus it could be integrated into a decision support system, aiding the speed and quality of the oenologist performance. For instance, the expert could repeat the tasting only if her/his
grade is far from the one predicted by the DM model. In effect, within this
domain the T = 1.0 distance is accepted as a good quality control process and,
as shown in this study, high accuracies were achieved for this tolerance. The
model could also be used to improve the training of oenology students. Furthermore, the measure of input importance brought interesting insights regarding the impact of the analytical tests. Since some variables can be controlled in the production process, this information can be used to improve wine quality. For instance, the alcohol concentration can be increased or decreased by monitoring the grape sugar concentration prior to the harvest. Also, the residual sugar levels depend on the fermentation, which is carried out by yeasts. Moreover, the volatile acidity produced during the malolactic fermentation in red wine depends on the control of the lactic bacteria activity.
Another interesting application is target marketing [24]. Specific consumer preferences from niche and/or profitable markets (e.g. for a particular country) could be measured during promotion campaigns (e.g. free wine tastings) and modeled using the proposed DM techniques.
Acknowledgments
We would like to thank Cristina Lagido and the anonymous reviewers for their helpful comments. This work was supported by the FCT project PTDC/EIA/64541/2006.
References

[1] A. Asuncion and D. Newman. UCI Machine Learning Repository, University of California, Irvine, 2007.

[2] J. Bi and K. Bennett. Regression Error Characteristic Curves. In Proc. of 20th Int. Conf. on Machine Learning (ICML), Washington DC, USA, 2003.
[3] C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press,
1995.
[4] B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proc. of the 5th Annual Workshop on Computational Learning Theory, pages 144–152, 1992.
[5] V. Cherkassky and Y. Ma. Practical Selection of SVM Parameters and Noise Estimation for SVM Regression. Neural Networks, 17(1):113–126, 2004.

[6] P. Cortez. RMiner: Data Mining with Neural Networks and Support Vector Machines using R.
1998.
Publishers, 1999.
[11] FAO. FAOSTAT - Food and Agriculture Organization agriculture trade domain
statistics. http://faostat.fao.org/site/535/DesktopDefault.aspx?PageID=535,
July 2008.
Austria, 1996.
[15] D. Hand. Classifier technology and the illusion of progress. Statistical Science,
21(1):1–15, 2006.
[17] Z. Huang, H. Chen, C. Hsu, W. Chen, and S. Wu. Credit rating analysis with support vector machines and neural networks: a market comparative study. Decision Support Systems, 37(4):543–558, 2004.
[18] R. Kewley, M. Embrechts, and C. Breneman. Data Strip Mining for the Virtual Design of Pharmaceuticals with Neural Networks. IEEE Transactions on Neural Networks, 11(3):668–679, 2000.

[23] D. Rumelhart, G. Hinton, and R. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructures of Cognition, volume 1, pages 318–362. MIT Press, 1986.
[24] M. Shaw, C. Subramaniam, G. Tan, and M. Welge. Knowledge management and data mining for marketing. Decision Support Systems, 31(1):127–137, 2001.
[29] W. Venables and B. Ripley. Modern Applied Statistics with S. Springer, 4th
edition, 2003.
[30] S. Vlassides, J. Ferrier, and D. Block. Using Historical Data for Bioprocess Optimization: Modeling Wine Characteristics Using Artificial Neural Networks and Archived Process Information. Biotechnology and Bioengineering, 73(1), 2001.
[31] W. Wang, Z. Xu, W. Lu, and X. Zhang. Determination of the spread parameter in the Gaussian kernel for classification and regression. Neurocomputing, 55(3):643–663, 2003.
[32] P. Werbos. Beyond regression: New tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard University, Cambridge, MA, 1974.
[33] I.H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2nd edition, 2005.

[34] X. Wu et al. Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1):1–37, 2008.

[35] H. Yu, H. Lin, H. Xu, Y. Ying, B. Li, and X. Pan. Prediction of Enological Parameters and Discrimination of Rice Wine Age Using Least-Squares Support Vector Machines and Near Infrared Spectroscopy. Journal of Agricultural and Food Chemistry, 2008.
Table 1
The physicochemical data statistics per wine type (min, max and mean values)

                                            Red wine          White wine
Attribute (units)                           Min  Max  Mean    Min  Max  Mean
fixed acidity (g(tartaric acid)/dm3)        4.6  15.9  8.3    3.8  14.2  6.9
volatile acidity (g(acetic acid)/dm3)       0.1   1.6  0.5    0.1   1.1  0.3
Table 2
The wine modeling results (test set errors and selected models; best values in bold)

          Red wine            White wine
          MR    NN    SVM     MR    NN    SVM
Table 3
The average confusion matrixes (T = 0.5) and precision values (T = 0.5 and 1.0); rows are actual classes, columns are SVM predictions

                    Red wine                   White wine
Actual class        4     5     6     7    8   4     5     6     7     8
3                   1     7     2     0    0   0     2     17    0     0
4                   1     36    15    1    0   19    55    88    1     0
⋮
8                   0     0     10    8    0   0     3     71    43    59
9                                              0     1     3     2     0
Precision (T=0.5)   20.0% 67.5% 57.7% 58.6% 0.0%   63.3% 72.6% 60.3% 67.8% 85.5%
Precision (T=1.0)   93.8% 90.9% 86.6% 90.2% 100%   90.0% 93.3% 81.9% 90.3% 96.2%
Fig. 1. The histograms for the red (left) and white (right) sensory preferences (x-axis: quality grade, 3 to 8 for red and 3 to 9 for white; y-axis: frequency in wine samples)
Fig. 2. Example of a linear SVM regression and the $\epsilon$-insensitive loss function (points outside the $\pm\epsilon$ tube are the support vectors)
Fig. 3. The red (left) and white (right) wine average test set REC curves for the SVM, NN and MR models (x-axis: absolute error tolerance, 0 to 2; y-axis: accuracy, 0 to 100%)
Fig. 4. The red (top) and white (bottom) wine input importances for the SVM (relative importance in %, axis from 0 to 20). The red wine plot lists, from top to bottom: sulphates, pH, alcohol, volatile acidity, fixed acidity, residual sugar, chlorides, density and citric acid; the white wine plot lists: sulphates, alcohol, residual sugar, citric acid, volatile acidity, density, pH, chlorides and fixed acidity.