IEEE SIGNAL PROCESSING MAGAZINE, VOLUME 30, ISSUE 4, 2013
1
Kernel Multivariate Analysis Framework for
Supervised Subspace Learning: A Tutorial on
Linear and Kernel Multivariate Methods
arXiv:1310.5089v1 [stat.ML] 18 Oct 2013
Jerónimo Arenas-Garcı́a, Senior Member, IEEE, Kaare Brandt Petersen,
Gustavo Camps-Valls, Senior Member, IEEE, and Lars Kai Hansen
Abstract: Feature extraction and dimensionality reduction are important tasks in many fields of science dealing with signal processing
and analysis. The relevance of these techniques is increasing as
current sensory devices are developed with ever higher resolution, and
problems involving multimodal data sources become more common.
A plethora of feature extraction methods are available in the literature
collectively grouped under the field of Multivariate Analysis (MVA).
This paper provides a uniform treatment of several methods: Principal
Component Analysis (PCA), Partial Least Squares (PLS), Canonical
Correlation Analysis (CCA) and Orthonormalized PLS (OPLS), as
well as their non-linear extensions derived by means of the theory
of reproducing kernel Hilbert spaces. We also review their connections to other methods for classification and statistical dependence
estimation, and introduce some recent developments to deal with the
extreme cases of large-scale and low-sized problems. To illustrate
the wide applicability of these methods in both classification and
regression problems, we analyze their performance in a benchmark of
publicly available data sets, and pay special attention to specific real
applications involving audio processing for music genre prediction
and hyperspectral satellite images for Earth and climate monitoring.
I. I NTRODUCTION
As sensory devices develop with ever higher resolution and
the combination of diverse data sources is more common,
feature extraction and dimensionality reduction become increasingly important in automatic learning. This is especially
true in fields dealing with intrinsically high-dimensional signals, such as those acquired for image analysis, spectroscopy,
neuroimaging, and remote sensing, but also for situations when
many heterogeneous features are computed from a signal and
stacked together for classification, clustering, or prediction.
Multivariate analysis (MVA) constitutes a family of methods
for dimensionality reduction successfully used in several scientific areas [1]. The goal of MVA algorithms is to exploit
correlations among the variables to find a reduced set of
features that are relevant for the learning task. Among the most
well-known MVA methods are Principal Component Analysis
(PCA), Partial Least Squares (PLS), and Canonical Correlation
Analysis (CCA). PCA disregards the target data and exploits
correlations between the input variables to maximize the
variance of the projections, while PLS and CCA look for
projections that maximize, respectively, the covariance and the
correlation between the features and the target data. Therefore,
they should in principle be preferred to PCA for regression
or classification problems. In this paper, we consider also a
fourth MVA method known as Orthonormalized PLS (OPLS)
that is also well-suited to supervised problems, with certain
optimality in least-squares (LS) multiregression. A common
advantage of these methods is that they can be formulated
using standard linear algebra, and can be implemented as
standard or generalized eigenvalue problems. Furthermore,
implementations and variants of these methods have been
proposed that operate either in a blockwise or iterative manner
to improve speed or numerical stability.
No matter how refined the various methods of MVA are,
they are still constrained to account for linear input-output
relations. Hence, they can be severely challenged when features exhibit non-linear relations. In order to address these
problems, non-linear versions of MVA methods have been
developed and these can be classified into two fundamentally
different approaches [2]: 1) The modified methods in which
the linear relations among the latent variables are substituted
by non-linear relations [3], [4]; and 2) Variants in which the
algorithms are reformulated to fit a kernel-based approach [5]–
[7]. In this paper, we will review the second approach, in
which the input data is mapped by a non-linear function into
a high-dimensional space where the ordinary MVA problems
are stated. A central property of this kernel approach is the
exploitation of the so-called kernel trick, by which the inner
products in the transformed space are replaced with a kernel
function working solely with input space data so the explicit
non-linear mapping is not explicitly necessary.
An appealing property of the resulting kernel algorithms
is that they obtain the flexibility of non-linear expressions
using straightforward methods from linear algebra. However,
supervised kernel MVA methods are hampered in applications
involving large datasets or a small number of labeled samples.
Sparse and incremental versions have been presented to deal
with the former problem, while the field of semisupervised
learning has recently emerged for the latter. Remedies to
these problems involve a particular kind of regularization,
guided either by selection of a reduced number of basis
functions or by considering the information about the manifold
conveyed by the unlabeled samples. Both approaches will be
also reviewed in this paper. Concretely, we aim:
1) To review linear and kernel MVA algorithms, providing
their theoretical characterization and comparing their
main properties under a common framework.
2) To present relations between kernel MVA and other
2
IEEE SIGNAL PROCESSING MAGAZINE, VOLUME 30, ISSUE 4, 2013
TABLE I
ACRONYMS AND NOTATION USED IN THE PAPER .
AR
ECMWF
GMM
HSIC
IASI
(k)CCA
kFD
(k)MVA
(k)OPLS
(k)PCA
(k)PLS
MSE
LDA
LS
OA
RBF
rkCCA
rkOPLS
rkPCA
RMSE
RTM
skPLS
UCI
Autoregressive
European Centre for Medium-Range Weather Forecasts
Gaussian Mixture Model
Hilbert-Schmidt Independence Criterion
Infrared Atmospheric Sounding Interferometer
(Kernel) Canonical Correlation Analysis
Kernel Fisher Discriminant
(Kernel) Multivariate Analysis
(Kernel) Orthonormalized Partial Least Squares
(Kernel) Principal Component Analysis
(Kernel) Partial Least Squares
Mean-square error
Linear Discriminant Analysis
Least-squares
Overall Accuracy
Radial Basis Function
reduced complexity kCCA
reduced complexity kOPLS
reduced complexity kPCA
Root-mean-square Error
Radiative transfer models
Sparse kPLS
University of California, Irvine
feature extraction methods based on Fisher’s discriminant analysis and nonparametric kernel dependence
estimates.
3) To review sparse and semisupervised approaches that
make the kernel variants practical for large-scale or
undersampled labeled datasets. We will illustrate how
these approaches overcome some of the difficulties that
may have limited a more widespread use of the most
powerful kernel MVA methods.
4) To illustrate the wide applicability of these methods,
for which we consider several publicly available data
sets, and two real scenarios involving audio processing
for music genre prediction and hyperspectral satellite
images for Earth and climate monitoring. Methods will
be assessed in terms of accuracy and robustness to the
number of extracted features.
We continue the paper with a brief review of linear and
kernel MVA algorithms, as well as connections to other
methods. Then, Section III introduces some extensions that
increase the applicability of kernel MVA methods in real applications. Section IV provides illustrative evidence of method’s
performance. Finally, we conclude the paper in Section V with
some discussion and future lines of research.
II. M ULTIVARIATE A NALYSIS IN R EPRODUCING K ERNEL
H ILBERT S PACES
This section reviews the framework of MVA both in the
linear case and with kernel methods. Interesting connections
are also pointed out between particular classification and
dependence estimation kernel methods. Table I provides a list
of acronyms, and basic notation and variables that will be used
throughout the paper.
l
u
n=l+u
d
m
X
Y
Cx , Cy
Cxy
X†
W
I
kAkF
nf
ui , vi
U, V
X′ , Y ′
F
φ(x)
k(xi , xj )
Φ
Kx = ΦΦ⊤
A
Size of labelled dataset
Number of unlabeled data
Total number of training samples
Dimension of input space
Dimension of output space
Centered input data (size l × d)
Centered output data (size l × m)
Input, output data sample covariance matrices
Input-output sample cross-covariance matrix
Moore-Penrose pseudoinverse of matrix X
Regression coefficients matrix
Identity matrix
Frobenius norm of matrix A
Number of extracted features
ith projection vector for the input, output data
[u1 , . . . , unf ], [v1 , . . . , vnf ]. Projection matrices
Extracted features for the input, output data
Reproducing kernel Hilbert Space
Mapping of x in feature space
hφ(xi ), φ(xj )iF . Kernel function
[φ(x1 ), . . . , φ(xl )]⊤ . Input data in feature space
Gram Matrix
[α1 , · · · , αnf ]. Coefficients for U = Φ⊤ A
A. Problem statement and notation
Let us consider a supervised regression or classification
problem, and let X and Y be columnwise-centered input and
target data matrices of sizes l × d and l × m, respectively.
Here, l is the number of training data points in the problem,
and d and m are the dimensions of the input and output spaces,
respectively. The target data can be either a set of variables that
need to be approximated, or a matrix that encodes the class
membership information. The sample covariance matrices are
given by Cx = 1l X⊤ X and Cy = 1l Y⊤ Y, whereas the covariance between the input and output data is Cxy = 1l X⊤ Y.
The objective of standard linear multiregression is to adjust
a linear model for predicting the output variables from the
b = XW, where W contains the regression
input features, Y
model coefficients. Ordinary least-squares (LS) regression
solution is W = X† Y, where X† = (X⊤ X)−1 X⊤ is the
Moore-Penrose pseudoinverse of X. Highly correlated input
variables can result in rank-deficient Cx , making the inversion
of this matrix unfeasible. The same situation is encountered
in the small-sample-size case (i.e., when l < d, which is
usually the case when using kernel extensions). Including
a Tikhonov’s regularization term leads to better conditioned
solutions: for instance, by also minimizing the Frobenius norm
of the weights matrix kWk2F , one obtains the regularized
LS solution W = (X⊤ X + λI)−1 X⊤ Y, where parameter
λ controls the amount of regularization.
The solution suggested by MVA to the above problem
consists in projecting the input data into a subspace that preserves the most relevant information for the learning problem.
Therefore, MVA methods obtain a transformed set of features
via a linear transformation of the original data, X′ = XU,
where U = [u1 , u2 , . . . , unf ] will be referred hereafter as the
projection matrix, ui being the ith projection vector and nf the
ARENAS-GARCÍA ET AL.: KERNEL MULTIVARIATE ANALYSIS FRAMEWORK
3
number of extracted features1 . Some MVA methods consider
also a feature extraction in the output space, Y′ = YV, with
V = [v1 , v2 , . . . , vnf ].
Generally speaking, MVA methods look for projections of
the input data that are “maximally aligned” with the targets,
and the different methods are characterized by the particular
objectives they maximize. Table II provides a summary of
the MVA methods that we will discuss in the rest of the
section. An interesting property of linear MVA methods is
that they are based on first and second order moments, and
that their solutions can be formulated in terms of (generalized)
eigenvalue problems. Thus, standard linear algebra methods
can be readily applied.
sensing, where signals typically are acquired in a range of
highly correlated spectral wavelengths.
Rather than maximizing covariance, CCA maximizes the
correlation between projected input and output data [17]. In
this way, CCA can more conveniently deal with directions of
the input or output spaces that present very high variance, and
that would therefore be over-emphasized by PLS, even if the
correlation between the projected input and output data is not
very significant.
A final method we will pay attention to is Orthonormalized
PLS (OPLS), also known as multilinear regression [18] or
semi-penalized CCA [19]. OPLS is optimal for performing
multilinear LS regression on the features extracted from the
training data, i.e.,
U = arg min kY − X′ Wk2F .
B. Linear Multivariate Analysis
Principal Component Analysis, which is also known as the
Hotelling transform or the Karhunen-Loève transform [9], is
probably the most widely used MVA method and the oldest
dating back to 1901 [10]. PCA selects the maximum variance
projections of the input data, imposing an orthonormality
constraint for the projection vectors (see Table II). PCA works
under the hypothesis that high variance projections contain
the relevant information for the learning task at hand. PCA is
based on the input data alone, and is therefore an unsupervised
feature extraction method. Methods that explicitly look for
the projections that better explain the target data should in
principle be preferred in a supervised setup. Nevertheless, PCA
and its kernel version, kPCA, are used as a preprocessing stage
in many supervised problems, likely because of their simplicity
and ability to discard irrelevant directions [11], [12].
The PLS algorithm [13] is based on latent variables that
account for the information in Cxy . In order to do so, PLS
extracts the projections that maximize the covariance between
the projected input and output data, again under orthonormality
constraints for the projection vectors. This is done either as an
iterative procedure or by solving an eigenvalue problem. In the
iterative schemes, the data sets X and Y are recursively transformed in a process which subtracts the information contained
in the already estimated latent variables. This process, which is
often referred to as deflation, can be done in a number of ways
that define the many variants of PLS. Perhaps the most popular
PLS method was presented in [14]. The algorithm, hereafter
referred to as PLS2, assumes a linear relation between X and
Y that implies a certain deflation scheme, where the latent
variable of X is used to deflate also Y [7, p. 182]. Several
other variants of PLS exist such as ‘PLS Mode A’ and PLS-SB;
see [15] for a discussion of the early history of PLS and [16]
for a well-written contemporary overview. Among its many
advantages, PLS does not involve matrix inversion and deals
efficiently with highly correlated data. This has justified its
very extensive use in fields such as chemometrics and remote
1 Note that strictly speaking U is not a projection operator, since it implies
a transformation from Rd to Rnf and does not satisfy the idempotent
property of projection operators. Nevertheless, if the columns of U are linearly
independent, vectors ui constitute a basis of the subspace of Rd where the
data is projected, and it is in this sense that we refer to ui and U as projection
vectors and matrix, and to X′ = XU as projected data. This nomenclature
has been widely adopted in the machine learning field [8].
(1)
U
with W = X′† Y being the matrix containing the optimal
regression coefficients. It can be shown that this optimization problem is equivalent to the one stated in Table II.
Alternatively, this problem can also be associated with the
maximization of a Rayleigh coefficient involving projections
(u⊤ Cxy v)2
of both input and output data, (u⊤ Cx u)(v
⊤ v) . It is in this sense
that this method is called semi-penalized CCA, since it disregards variance of the projected input data, but rewards those
input features that better predict large variance projections of
the target data. This asymmetry makes sense in supervised
subspace learning where matrix Y contains target values to
be approximated from the extracted input features. In fact,
it has been shown that for classification problems OPLS is
equivalent to Linear Discriminant Analysis (LDA), provided
an appropriate labeling scheme is used for Y [19]. However,
in “two-view learning” problems in which X and Y represent
different views of the data [7, Sec. 6.5], one would like
to extract features that can predict both data representations
simultaneously, and CCA could be preferred to OPLS.
A common framework for PCA, PLS, CCA and OPLS was
proposed in [18], where it was shown that these methods can
be reformulated as (generalized) eigenvalue problems, so that
linear algebra packages can be used to solve them. Concretely:
PCA : Cx u = λu
PLS :
0
Cxy
C⊤
0
xy
u
v
=λ
u
v
Cx 0
0 Cy
(2)
OPLS : Cxy C⊤
xy u = λCx u
CCA :
0
Cxy
C⊤
xy 0
u
v
=λ
u
v
We can see that CCA and OPLS require the inversion
of matrices Cx and Cy . If these are rank-deficient, then
it becomes necessary to first extract the dimensions with
non-zero variance using PCA, and then solve the CCA or
OPLS problems. A very common approach is to solve the
above problems using a two-step iterative procedure. In the
first step, the projection vectors corresponding to the largest
(generalized) eigenvalue are chosen, for which there exist
4
IEEE SIGNAL PROCESSING MAGAZINE, VOLUME 30, ISSUE 4, 2013
TABLE II
S UMMARY OF LINEAR AND KERNEL MVA METHODS . F OR EACH METHOD IT IS STATED THE OBJECTIVE TO MAXIMIZE (1 ST ROW ), CONSTRAINTS FOR
THE OPTIMIZATION (2 ND ROW ), AND MAXIMUM NUMBER OF FEATURES ( LAST ROW ).
PCA
⊤
u Cx u
⊤
U U=I
r(X)
PLS
⊤
u Cxy v
U⊤ U = I
V⊤ V = I
r(X)
CCA
⊤
u Cxy v
U ⊤ Cx U = I
⊤
V Cy V = I
r(Cxy )
OPLS
u
⊤
Cxy C⊤
xy u
⊤
kPCA
α
⊤
K2x α
⊤
U Cx U = I
A Kx A = I
r(Cxy )
r(Kx )
kPLS
kCCA
⊤
⊤
α Kx Yv
A ⊤ Kx A = I
V⊤ V = I
r(Kx )
kOPLS
α Kx Yv
A⊤ K2x A = I
V ⊤ Cy V = I
r(Kx Y)
⊤
α Kx YY ⊤ Kx α
A⊤ K2x A = I
r(Kx Y)
Vectors u and α are column vectors in matrices U and A, respectively. r(·) denotes the rank of a matrix.
efficient methods such as the power method. The second step
is known as deflation, and consists in removing from the data
or covariance matrices the variance that can be already explained with the features extracted in the first step. Alternative
solutions for these methods can be obtained by reformulating
them as regularized least squares minimization problems. For
instance, the work in [20]–[22] introduced sparse versions of
PCA, CCA and OPLS by adding sparsity promotion terms,
such as LASSO or ℓ1 -norm on the projection vectors, to the
LS functional.
Figure 1 illustrates the features extracted by the methods
for a toy classification problem with three classes. The data
was generated from three noisy sinusoid fragments, so that
a certain overlap exists between classes. For the application
of supervised methods, class membership is defined in matrix
Y using 1-of-c encoding [23]. Above each scatter plot we
provide, for the first extracted feature, its sample variance,
the largest covariance and correlation that can be achieved
with any linear transformation of the output data, and the optimum mean-square-error (MSE) when that feature is used to
1
approximate the target data, i.e., lm
kY−Xu1 (Xu1 )† Yk2F . As
expected, the first projection of PCA, PLS and CCA maximize,
respectively, the variance, covariance and correlation, whereas
OPLS finds the projection that minimizes the MSE. However,
since these methods can just perform linear transformations of
the data, they are not able to capture any non-linear relations
between the input variables.
being the new argument for the optimization2 This is typically
referred to as the kernel trick and has been used to develop
kernel versions of the previous linear MVA, as indicated in
the last four columns of Table II.
For PCA, it was Schölkopf, Smola and Müller who in 1998
introduced a kernel version denoted kPCA [6]. Lai and Fyfe
in 2000 first introduced the kernel version of CCA denoted
kCCA [24] (see also [7]). Later, Rosipal and Trejo presented
a non-linear kernel variant of PLS in [25]. In that paper, Kx
and the Y matrix are deflated the same way, which goes more
in line with the PLS2 variant than to the traditional algorithm
‘PLS Mode A’, and therefore we will denote it as kPLS2. A
kernel variant of Orthonormalized PLS was presented in [26]
and is here referred to as kOPLS. Specific versions of kernel
methods to deal with signal processing applications have also
been proposed, such as the temporal kCCA of [27], that is
designed to exploit temporal structure in the data.
As for the linear case, kernel MVA methods can be implemented as (generalized) eigenvalue problems:
kPCA :
Kx α = λα
kPLS :
The framework of kernel MVA (kMVA) algorithms is aimed
at extracting nonlinear projections while actually working with
linear algebra. Let us first consider a function φ : Rd → F
that maps input data into a Hilbert feature space F. The new
mapped data set is defined as Φ = [φ(x1 ), · · · , φ(xl )]⊤ , and
the features extracted from the input data will now be given
by Φ′ = ΦU, where matrix U is of size dim(F) × nf . The
direct application of this idea suffers from serious practical
limitations when the dimension of F is very large, which is
typically the case.
To implement practical kernel MVA algorithms we need
to rewrite the equations in the first half of Table II in terms
of inner products in F only. For doing so, we rely on the
availability of a kernel matrix Kx = ΦΦ⊤ of dimension l × l,
and on the Representer’s Theorem [7], which states that the
projection vectors can be written as a linear combination of the
training samples, i.e, U = Φ⊤ A, matrix A = [α1 , . . . , αnf ]
α
v
=λ
α
v
Kx Kx 0
0
Cy
(3)
kOPLS : Kx YY⊤ Kx α = λKx Kx α
kCCA :
C. Kernel Multivariate Analysis
0
Kx Y
YKx 0
0
Kx Y
YKx 0
α
v
=λ
α
v
It should be noted that the output data could also be mapped to
some feature space H, as it was considered for kCCA in [24]
for a multi-view learning case. Here, we consider that it is
the actual labels in Y which need to be well-represented by
the extracted input features, so we will deal with the original
representation of the output data.
For illustrative purposes, we have incorporated to Fig. 1 the
projections obtained in the toy problem by kMVA methods
using the radial basis function
(RBF) kernel, k(xi , xj ) =
exp −kxi − xj k2 /(2σ 2 ) . Input data was normalized to zero
mean and unit variance, and the kernel width σ was selected
as the median of all pairwise distances between samples [28].
The same σ has been used for all methods, so that features are
extracted from the same mapping of the input data. We can
see that the non-linear mapping improves class separability. As
expected, kPCA, kPLS and kCCA maximize in F the same
variance, covariance and correlation objectives, respectively,
2 In this paper, we assume that data is centered in feature space, what can
easily be done through a simple modification of the kernel matrix [7].
ARENAS-GARCÍA ET AL.: KERNEL MULTIVARIATE ANALYSIS FRAMEWORK
5
PCA
PLS
OPLS
CCA
var = 1.272; MSE = 0.153
var = 1.134; MSE = 0.139
var = 1.001; MSE = 0.135
var = 0.957; MSE = 0.136
cov = 0.353; corr = 0.299
cov = 0.404; corr = 0.401
cov = 0.391; corr = 0.431
cov = 0.381; corr = 0.433
Original data
kPCA
kPLS
kOPLS
kCCA
var = 0.161; MSE = 0.136
var = 0.161; MSE = 0.135
var = 3.41e−6; MSE = 0.084
var = 2.45e−5; MSE = 0.124
cov = 0.156; corr = 0.421
cov = 0.157; corr = 0.43
cov = 0.001; corr = 0.856
cov = 0.002; corr = 0.915
Fig. 1. Features extracted by different MVA methods in a three-class problem. For the first feature extracted by each method we show its variance (var),
the mean-square-error when the projected data is used to approximate Y (MSE), and the largest covariance (cov) and correlation (corr) that can be achieved
with any linear projection of the target data.
as their linear counterparts. kOPLS looks for the directions of
data in F that can provide the best approximation of Y in the
MSE sense. This example illustrates also that maximizing the
variance or even the covariance may not be the best choice
for supervised learning.
Although kernel MVA methods can still be described in
terms of linear equations, their direct solution faces several
problems. In particular, it is well-known that kOPLS and
kCCA can easily overfit the training data, so regularization
is normally required to alleviate numerical instabilities [7],
[26]. A second important problem is related to the computational cost. Since Kx is of size l × l, methods’ complexity
scales quadratically with l in terms of memory, and cubically
with respect to the computation time. Further, the solution
of the maximization problem (matrix A) is not sparse, so
that feature extraction for new data requires the evaluation
of l kernel functions per pattern, becoming computationally
expensive for large l. Finally, it is worth mentioning the
opposite situation: when l is small, the extracted features may
be useless, especially for high dimensional F [12]. Actually,
the information content of the features is elusive and has not
been characterized so far. These issues limit the applicability of
supervised kMVA in real-life scenarios with either very large
or very small labeled data sets. In Section III, we describe
sparse and semisupervised approaches for kMVA that tackle
both difficulties.
D. Relations with other methods
As already stated, close connections have been established
among Fisher’s LDA, CCA, PLS, and OPLS for classification [19]; such links extend to their kernel counterparts as
well. Under the framework of Rayleigh coefficients, Mika et
al. [29], [30] extended LDA to its kernel version for binary
problems, and Baudat and Anouar [31] proposed the generalized discriminant analysis (GDA) for multiclass problems.
A great many kernel discriminants have appeared since then,
focused on alleviating problems such as those induced by highdimensional small-sized datasets, the presence of noise, high
levels of collinearity, or unbalanced classes. The number and
heterogeneity of these methods makes difficult their unified
treatment.
Besides, in the recent years interesting connections of
kMVA and statistical dependence estimates have been established. For instance, the Hilbert-Schmidt Independence
Criterion (HSIC) [32] is a simple yet very effective method
to estimate statistical dependence between random variables.
HSIC corresponds to estimating the norm of the crosscovariance in F, whose empirical (biased) estimator is
1
HSIC := (l−1)
2 Tr(Kx Ky ), where Kx works with samples
in the source domain and Ky = YY⊤ . It can be shown
that, if the RKHS kernel is universal, HSIC asymptotically
tends to zero when the input and output data are independent.
The so-called Hilbert-Schmidt Component Analysis (HSCA)
method iteratively seeks for projections that maximize dependence with the target variables and simultaneously minimize the dependence with previously extracted features, both
in HSIC terms. This objective translates into the iterative
resolution of the generalized eigen-decomposition problem
Kx Ky Kx α = λKx Kf Kx α, where Kf is a kernel matrix
of already extracted features in the previous iteration. If one
is only interested in maximizing source-target dependence in
HSIC terms, the problem boils down to solving kOPLS.
Similarly, the connection between kCCA and other kernel
measures of dependence, such as the kernel Generalized
Variance (kGV) or the kernel Mutual Information (kMI), was
introduced in [33]. The empirical kGV estimates dependence
between input-output data with a function that depends on
the entire spectrum of the associated correlation operator in
RKHS, kGV(θ) = − 21 log(Π(1 − λ2i )), where λi are the
i
6
IEEE SIGNAL PROCESSING MAGAZINE, VOLUME 30, ISSUE 4, 2013
solutions to the generalized eigenvalue problem
and was later used in [35] to directly approximate the feature
mapping itself rather than the kernel, thus giving rise to sparse
α
0
Kx Ky
=
versions of kPLS and kCCA.
v
Ky Kx
0
Among the reduced set methods, a sparse kPCA (skPCA)
α
θKx Kx + η(1 − θ)Kx
0
was
proposed by Tipping in [36], where the sparsity in the
,
λ
v
0
θKy Ky + η(1 − θ)Ky
representation is obtained by assuming a generative model for
where Kx and Ky are defined using RKHS kernels obtained the data in F that follows a normal distribution and includes
via convolution of the associated Parzen windows, η is a a noise term with variance vn . The maximum likelihood
scaling factor, and θ is a parameter in the range [0, 1]. Gretton estimation of the covariance matrix is shown to depend on
et al. [33] showed that, under certain conditions, kGV reduces just a subset of the training data, and so it does the resulting
solution. A sparse kPLS (skPLS) was introduced in [37]. The
to kMI for θ = 0 and to kCCA for θ = 1 (cf. Eq. 3).
It is worth noting that the previous kernel measures of statis- method uses a fraction of the training samples for computing
tical dependence hold connections with Information Theoretic the projections. Each projection vector is found through a sort
Learning concepts as well. For instance, it can be shown that of ε-insensitive loss similar to the one used in the support
HSIC is intimately related to the concept of correntropy [34]. vector regression method. The sparsification is induced via
All these connections could shed light in the future about the a multi-step adaptation with high computational burden. In
informative content of the extracted features in a principled spite of obtaining sparse solutions, the algorithms from [36]
and [37] still require the computation of the full kernel matrix
way.
during the training.
A reduced complexity kOPLS (rkOPLS) was proposed
III. E XTENSIONS FOR LARGE SCALE AND
in [26] by imposing sparsity in the projection vectors repSEMISUPERVISED PROBLEMS
resentation a priori, U = Φ⊤
r β, where Φr is a subset of the
Supervised kernel MVA methods are hampered either by
training data containing r samples (r ≪ l) and β is the new
the wealth or the scarcity of labeled samples, which can make
argument for the maximization problem, which now becomes:
these methods impractical for many applications. We next
summarize some extensions to deal with large scale problems
max β ⊤ Krl YY⊤ K⊤
rl β
(4)
and semisupervised situations in which few labeled data is
⊤
⊤
subject to : β Krl Krl β = 1,
available.
Since kernel matrix Krl = Φr Φ⊤ involves the inner products
in F of all training points with the patterns in the reduced set,
A. Sparse Kernel Feature Extraction
this method still takes into account all data during the training
A critical bottleneck of kernel methods is that for a dataset
phase, and is therefore different from simple subsampling.
of l samples, the kernel matrices are l × l, which, even for a
This sparsification procedure avoids the computation of the
moderate number of samples, quickly becomes a problem with
full kernel matrix at any step of the algorithm. An additional
respect to both memory and computation time. Furthermore,
advantage of this method is that matrices Krl YY⊤ K⊤
rl and
in kernel MVA this is also an issue during the extraction
Krl K⊤
are
both
of
size
r
×
r,
and
can
be
expressed
as
sums
rl
of features for test data, since kernel MVA solutions will
over the training data, so the storage requirements become just
in general depend on all training data (i.e., matrix A will
quadratic with r. Furthermore, the sparsity constraint acts as a
generally be dense): evaluating thousands of kernels for every
regularizer that can significantly improve the generalization
new input vector is, in most applications, simply not acceptability of the method. In the experiments section, we will
able. Furthermore, these so-called dense solutions may result
apply the same sparsification procedure to kPCA and kCCA,
in severe problems of overfitting, which is particularly true
obtaining reduced complexity versions of these methods to
for kOPLS and kCCA [7], [26]. To address these problems,
which we will refer to as rkPCA and rkCCA. Interestingly, the
several solutions have been proposed to obtain sparse solutions
extension to kPLS2 is not straightforward, since the deflation
that can be expressed as a combination of a subset of the
step would still require the full kernel matrix Kll .
training data, and therefore require only r kernel evaluations
Alternatively, two sparse kPLS schemes were presented
per sample (with r ≪ l) for feature extraction. Note that,
in [38] under the name of Sparse Maximal Aligment (SMA)
in contrast to the many linear MVA algorithms that induce
and Sparse Maximal Covariance (SMC). Here kPLS iteratively
sparsity with respect to the original variables, in this subsection
estimates projections that either maximize the kernel alignwe review methods that obtain sparse solutions in terms of the
ment (c.1) or the covariance (c.2) of the projected data and
samples (i.e., sparsity in the αi vectors).
the true labels:
The approaches to obtain sparse solutions can be broadly
max β ⊤ Kj Y
divided into low rank approximation methods, that aim at
working with reduced r × r matrices (r ≪ l), and reduced
(5)
subject to (c.1) : β ⊤ K2j β = 1,
set methods that work with l × r matrices. Following the first
⊤
subject to (c.2) : β Kj β = 1,
approach, the Nyström low-rank approximation of an l × l
−1
kernel matrix Kll is expressed as K̃ll = Klr Krr Krl , where where K1 = Kx and Kj denotes the deflated kernel matrix
subscripts indicate row and column dimensions. The method at iteration j, according to [38, Eq. (3)]. The method imposes
was originally exploited in the context of Gaussian processes, the additional constraint that the cardinality of β is 1. This
ARENAS-GARCÍA ET AL.: KERNEL MULTIVARIATE ANALYSIS FRAMEWORK
restriction explicitly enforces sparsity through an ℓ0 -norm
in the weights space. At each iteration, β is obtained by
performing an exhaustive search over all training patterns.
However, the complexity of the algorithm can be significantly
reduced by constraining the search to just p randomly chosen
samples.
In Table III, we summarize some computational and implementation issues of the aforementioned sparse kMVA methods,
and of standard non-sparse kMVA and linear methods. An
analysis of the properties of each algorithm provides some
hints that can help us choose the algorithm for a particular
application. Firstly, a critical step when using kernel methods is the selection of an appropriate kernel function and
tuning its parameters. To avoid overfitting, kMVA methods
can be adjusted using cross-validation at the cost of higher
computational cost. Sparse methods can help in this respect
by regularizing the solution. Secondly, most methods can be
implemented as either eigenvalue or generalized eigenvalue
problems, whose complexity typically scales cubically with
the size of the analyzed matrices. Therefore, both for memory
and computational reasons, only linear MVA and the sparse
approaches from [26] and [38] are affordable when dealing
with large data sets. A final advantage of sparse kMVA is the
reduced number of kernel evaluations to extract features for
new data.
B. Semisupervised Kernel Feature Extraction
When few labeled samples are available, the extracted
features do not capture the structure of the data manifold well,
and hence using them for classification or regression may
lead to very poor results. Recently, semisupervised learning
approaches have been introduced to alleviate these problems.
Two approaches are encountered: the information conveyed
by the unlabeled samples is either modeled with graphs or via
kernel functions derived from generative clustering models.
Notationally, we are given l labeled and u unlabeled samples, a total of n = l + u. The semisupervised kCCA (sskCCA) has been recently introduced in [28] by using the
graph Laplacian. The method essentially solves the standard
kCCA using kernel matrices computed with both labeled and
unlabeled data, which are further regularized with the graph
Laplacian:
α
0
Kxnl Kyln
=
y
v
0
Kxnl Kln
λ
Kxnl Kxln + Rxnn
0
0
Kynl Kyln + Rynn
α
v
, (6)
where Rxnn = αx Kxnn + γx Kxnn Lxnn Kxnn and Rynn =
αy Kynn + γy Kynn Lynn Kynn . For notation compactness,
subindexes here indicate the size of the corresponding matrices
while superscripts denote whether they involve input or output
data. Parameters αx , αy , γx and γy trade off the contribution of
labeled and unlabeled samples, and L = D−1/2 (D−M)D1/2
represents the (normalized) graph Laplacian for the input and
target domains, where D is the degree matrix whose entries
are the sums of the rows of the corresponding similarity matrix
7
P
M, i.e. Dii = j Mij . It should be noted that for n = l and
null regularization, one obtains the standard kCCA (cf. Eq. 3).
Note also that this form of regularization through the graph
Laplacian can be applied to any kMVA method. A drawback
of this approach is that it involves tuning several parameters
and working with larger matrices of size 2n × 2n, which can
make its application difficult in practice.
Alternatively, cluster kernels have been used to develop
semisupervised versions of kernel methods in general, and
of kMVA methods in particular. The approach was used for
kPLS and kOPLS in [39]. Essentially, the method relies on
combining a kernel function based on labeled information
only, ks (xi , xj ), and a generative kernel directly learned
by clustering all (labeled and unlabeled) data, kc (xi , xj ).
Building kc requires first running a clustering algorithm, such
as Expectation-Maximization assuming a Gaussian mixture
model (GMM) with different initializations, q = 1, . . . , Q,
and with different number of clusters, g = 2, . . . , G + 1. This
results in Q · G cluster assignments where each sample xi has
its corresponding posterior probability vector π i (q, g) ∈ Rg .
The probabilistic cluster kernel kc is computed by averaging
all the dot products between posterior probability vectors,
kc (xi , xj ) =
Q G+1
1 XX
π i (q, g)⊤ π j (q, g),
Z q=1 g=2
where Z is a normalization factor. The final kernel function
is defined as the weighted sum of both kernels, k(xi , xj ) =
βks (xi , xj )+(1−β)kc (xi , xj ), where β ∈ [0, 1] is a scalar parameter to be adjusted. Intuitively, the cluster kernel accounts
for probabilistic similarities at small and large scales (number
of clusters) between all samples along the data manifold. The
method does not require computationally demanding procedures (e.g. current GMM clustering algorithms scale linearly
with n), and the kMVA still relies just on the labeled data,
and thus requires an l × l kernel matrix. All these properties
are quite appealing from the practitioner’s point of view.
IV. E XPERIMENTAL R ESULTS
In this section, we illustrate through different application
examples the use and capabilities of the supervised kernel multivariate feature extraction framework. We start by
comparing the performance of the linear and kernel MVA
methods in a benchmark of classification problems from the
publicly available Machine Learning Repository at University
of California, Irvine (UCI)3 . We then consider two real applications to show the potential of these algorithms: satellite
image processing [40] and audio processing for music genre
prediction [41]. The size of the data set used in this second
scenario is sufficiently large to make standard kMVA methods
impractical, a situation that we will use to illustrate the benefits
of the sparse extensions.
A. UCI repository benchmark
Our first battery of experiments deals with standard benchmark data sets taken from the UCI repository, and will be
3 http://archive.ics.uci.edu/ml/.
8
IEEE SIGNAL PROCESSING MAGAZINE, VOLUME 30, ISSUE 4, 2013
TABLE III
M AIN PROPERTIES OF ( K )MVA METHODS . C OMPUTATIONAL COMPLEXITY AND IMPLEMENTATION ISSUES ARE CATEGORIZED FOR THE CONSIDERED
DENSE AND SPARSE METHODS , INDICATING FROM LEFT TO RIGHT: THE FREE PARAMETERS , NUMBER OF KERNEL EVALUATIONS (KE) DURING
TRAINING , STORAGE REQUIREMENTS , WHETHER AN EIGENPROBLEM (EIG) OR GENERALIZED EIGENPROBLEM (GEV) NEEDS TO BE SOLVED , AND THE
NUMBER OF KERNELS THAT NEED TO BE EVALUATED TO EXTRACT PROJECTIONS FOR NEW DATA .
Method
PCA
PLS
CCA
OPLS
kPCA
kPLS
kCCA
kOPLS
skPCA [36]
skPLS [37]
rkPCA
rkCCA
rkOPLS [26]
SMA / SMC [38]‡
Parameters
none
none
none
none
kernel
kernel
kernel
kernel
kernel, vn
kernel, ν, ε
kernel, r
kernel, r
kernel, r
kernel, r
KE tr.
none
none
none
none
l2
l2
l2
l2
l2
l2
rl
rl
rl
l2
Storage Req.
O(d2 )
O((d + m)2 )
O((d + m)2 )
O(d2 )
O(l2 )
O((l + m)2 )
O((l + m)2 )
O(l2 )
O(l2 )
O(l2 )
O(r2 )
O((r + m)2 )
O(r2 )
O(l2 )
EIG / GEV / Other
EIG
EIG
GEV
GEV
EIG
EIG
GEV
GEV
ML + EIG†
ν-SVR
EIG
GEV
GEV
Ex. search
KE test
none
none
none
none
l
l
l
l
vn dependent
ν, ε dependent
r
r
r
r
†
A maximum likelihood estimation step is required. ‡ By constraining the search to p random samples at each step of the algorithm, kernel evaluations and
storage requirements during training can be reduced to rlp and O(lp), respectively.
oriented to discuss some important properties of the standard
linear and kernel MVA methods. The main properties of the
selected problems are given in Table IV, namely the number
of training and test patterns (l, ltest ), the dimensionality of the
input space (d), the number of classes (c), the ratio of training
patterns per dimension, and the Kullback-Leibler divergence
(KL) between the sample probabilities of each class and a
uniform distribution, that can be seen as an indicator of balance
among classes. The train/test partition has left 60% of the
total data (or alternatively a maximum of 500 samples) in the
training set, so that standard kMVA complexity is kept under
control. Since all selected problems involve a classification
task, matrix Y was used to store the class information using
1-of-c coding. To obtain the overall accuracies (OA) displayed
in the table, we have trained an LS model to approximate Y,
followed by a “winner-takes-all” activation function.
We start the discussion by comparing the performance
of linear methods. For OPLS and CCA we have used the
maximum number of features (c − 1), whereas for PCA nf
has been fixed through a 10-fold cross-validation scheme on
the training set, and is indicated next to the results of the
method. When the maximum number of features are extracted,
linear OPLS and CCA become identical. We can see that
in most cases they perform similarly to PCA, but requiring
significantly fewer features. It is important to see the very poor
performance of OPLS and CCA in two particular problems:
semeion and sonar. We see that the ratio l/d is very small
for these problems, so the input covariance matrix is likely
to be ill-conditioned. Then, very low variance directions of
the input space are used by CCA and OPLS to overfit the
data. To avoid this problem, it becomes necessary to regularize
these algorithms, e.g., by loading the main diagonal of the
covariance matrix Cx or by executing the method on the
projections already extracted by PCA [11]. The results of
regularized OPLS and CCA following the latter approach,
and using the maximum of (c − 1) features, are given in
the last two columns of Table IV, where we can see that
the regularization does indeed help to overcome the “small-
sample-size” problem. For completeness, we present also the
results of the PLS2 approach. To get a fair comparison with
the other supervised schemes, PLS2 is also trained to extract
(c − 1) features. Thus, we can conclude that OPLS and CCA
allow to obtain more discriminative features than the other
methods, but one needs to be aware of the likely need to
regularize the solution.
Next, we turn our attention to non-linear versions, whose
results have been displayed in the bottom half of the table. An
RBF kernel has been used in all cases, selecting the kernel
width with 10-fold cross-validation. A first consideration is
that kernel approaches considerably outperform the linear
schemes. Since the RBF kernel implies a mapping to a very
high dimensional space, it is not surprising that standard
kOPLS and kCCA are even more prone to overfitting than
before, this being a well-known problem of these methods.
As before, regularized solutions allow kOPLS and kCCA
to achieve comparable performance to kPCA, but retaining
a much smaller number of features (c − 1), which demonstrates the superior discriminative capabilities of the features
extracted by these methods. For completeness, kPLS2 was
used to extract the same number of features as kOPLS and
kCCA, achieving considerably smaller OAs in all problems.
Nevertheless, with kPLS2 it is possible to extract a larger
number of projections, and this method is known to be more
robust to overfitting than kOPLS and kCCA.
B. Remote sensing image analysis
The last few hundred years human activities have precipitated an environmental crisis on Earth. In the last decade,
advanced statistical methods have been introduced to quantify
our impact on the land/vegetation and atmosphere, to better
understand their interactions. Nowadays, multi- and hyperspectral sensors mounted on satellite or airborne platforms
may acquire the reflected energy by the Earth with high spatial
detail and in several wavelengths. Recent infrared sounders
also allow us to estimate the profiles of atmospheric parameters with unprecedented accuracy and vertical resolution. Here,
ARENAS-GARCÍA ET AL.: KERNEL MULTIVARIATE ANALYSIS FRAMEWORK
9
TABLE IV
M AIN CHARACTERISTICS OF THE DATA SETS THAT COMPOSE THE UCI BENCHMARK AND PERFORMANCE OF THE DIFFERENT ( K )MVA FEATURE
EXTRACTION METHODS . A S A FIGURE OF MERIT WE USE THE OVERALL ACCURACY (OA, [%]) ± THE BINOMIAL STANDARD DEVIATION . B EST RESULTS
FOR EACH PROBLEM ARE HIGHLIGHTED IN BOLDFACE . T HE NUMBER OF EXTRACTED FEATURES IS INDICATED FOR PCA AND K PCA, WHEREAS ALL
OTHER METHODS USE c − 1 FEATURES .
data set
car
glass
optdigits
semeion
sonar
vehicle
vowel
yeast
data set
car
glass
optdigits
semeion
sonar
vehicle
vowel
yeast
l
500
128
500
500
125
500
500
500
ltest
1228
86
5120
1093
83
346
490
984
d
6
9
62
256
60
18
13
8
c
4
6
10
10
2
4
11
10
l/d nf,PCA
83.3
5
14.2
9
8.06
27
1.95
78
2.08
7
27.8
18
38.5
13
62.5
7
KL nf,kPCA
0.55 197
0.28
17
0
330
0
321
0
97
0
229
0
310
0.58
56
PCA
79 ±1.2
57 ±5.3
88.8 ±0.4
83.8 ±1.1
74.7 ±4.8
78 ±2.2
48.8 ±2.3
55.9 ±1.6
kPCA
93 ±0.7
60.5 ±5.3
95.4 ±0.3
89.5 ±0.9
84.3 ±4
81.5 ±2.1
92.7 ±1.2
58.8 ±1.6
PLS2
79.3 ±1.2
50 ±5.4
86.6 ±0.5
82.4 ±1.2
67.5 ±5.1
63.9 ±2.6
46.9 ±2.3
55 ±1.6
kPLS2
80.8 ±1.1
60.5 ±5.3
82.2 ±0.5
79 ±1.2
67.5 ±5.1
49.7 ±2.7
53.1 ±2.3
56.8 ±1.6
we pay attention to the performance of several kMVA methods
for both image segmentation of hyperspectral images, and
estimation of climate parameters from infrared sounders.
a) Hyperspectral image classification: The first case
study deals with image segmentation of hyperspectral images [42]. We have used the standard AVIRIS image taken over
NW Indiana’s Indian Pine test site in June 19924 . We removed
20 noisy bands covering the region of water absorption, and
finally worked with 200 spectral bands. The high number
of narrow spectral bands induce a high collinearity among
features. Discriminating among the major crop classes in the
area can be very difficult (in particular, given the moderate
spatial resolution of 20 meters), which has made the scene a
challenging benchmark to validate classification accuracy of
hyperspectral imaging algorithms. The image is 145 × 145
pixels and contains 16 quite unbalanced classes (ranging from
20 − 2468 pixels). Among the available 10366 labeled pixels,
20% were used for training the feature extractors, and the
remaining 80% for testing. The discriminative power of all
extracted features was tested using a simple classifier consisting of a linear least squares model followed by a “winner-takes
all” activation function.
Figure 2 shows the test classification accuracy for a varying
number of extracted features, nf . For linear models, OPLS
performs better than all other methods for any number of
extracted features. Even though CCA provides similar results
for nf = 10, it involves a slightly more complex generalized
eigenproblem. When the maximum number of projections is
used, all methods result in the same error. Nevertheless, while
PCA and PLS2 require 200 features (i.e., the dimensionality
of the input space), CCA and OPLS only need 15 features to
achieve virtually the same performance.
We also considered non-linear kPCA, kPLS2, kOPLS and
kCCA, using an RBF kernel whose width was adjusted using
5-fold cross-validation in the training set. The same conclu4 The calibrated data is available online (along with detailed ground-truth
information) from http://dynamo.ecn.purdue.edu/∼biehl/MultiSpec.
OPLS
79.6 ±1.1
57 ±5.3
90.3 ±0.4
69.1 ±1.4
65.1 ±5.2
78 ±2.2
48.8 ±2.3
55 ±1.6
kOPLS
92.1 ±0.8
41.9 ±5.3
95.2 ±0.3
89.3 ±0.9
80.7 ±4.3
76.6 ±2.3
92.4 ±1.2
48.1 ±1.6
CCA
79.6 ±1.1
57 ±5.3
90.3 ±0.4
69.1 ±1.4
65.1 ±5.2
78 ±2.2
48.8 ±2.3
55 ±1.6
kCCA
71.8 ±1.3
29.1 ±4.9
93.3 ±0.4
89.4 ±0.9
80.7 ±4.3
75.1 ±2.3
92 ±1.2
33.8 ±1.5
reg. OPLS
79 ±1.2
57 ±5.3
88.8 ±0.4
83.8 ±1.1
74.7 ±4.8
78 ±2.2
48.8 ±2.3
55.9 ±1.6
reg. kOPLS
93 ±0.7
62.8 ±5.2
95.3 ±0.3
89 ±0.9
84.3 ±4
82.1 ±2.1
93.1 ±1.1
58.7 ±1.6
reg. CCA
79 ±1.2
57 ±5.3
88.8 ±0.4
83.8 ±1.1
74.7 ±4.8
78 ±2.2
48.8 ±2.3
55.9 ±1.6
reg. kCCA
91.6 ±0.8
60.5 ±5.3
88.8 ±0.4
83.9 ±1.1
49.4 ±5.5
72.8 ±2.4
88.4 ±1.4
58.9 ±1.6
sions obtained for the linear case apply also to MVA methods
in kernel feature space. The features extracted by kOPLS allow
to achieve a slightly better Overall Accuracy (OA) than kCCA,
and both methods perform significantly better than kPLS2 and
kPCA. In the limit of nf , all methods achieve similar accuracy.
The classification maps obtained for nf = 10 confirm these
conclusions: higher accuracies lead to smoother maps and
smaller error in large spatially homogeneous vegetation covers.
b) Temperature estimation from infrared sounding data:
The second case study focuses on the estimation of temperature from spaceborne very high spectral resolution infrared
sounding instruments. Despite the constant advances in sensor
design and retrieval techniques, it is not trivial to invert the
full information of the atmospheric state contained by such
hyperspectral measurements. Statistical regression and feature
extraction methods have overcome the numerical difficulties
of radiative transfer models (RTMs), and enable fast retrievals
from high volumes of data [43].
We concentrate here on the Infrared Atmospheric Sounding
Interferometer (IASI) onboard the MetOp-A satellite data.
IASI spectra consist of 8461 spectral channels (input features)
with a spatial resolution of 25 km at nadir. Due to its large
spatial coverage and low radiometric noise, IASI provides
twice daily global measurements of key atmospheric species
such as ozone, carbon monoxide, methane and methanol.
Due to the impossibility to obtain real radiosound data for
the whole atmospheric column, we resorted to the standard
hybrid approach for developing the prediction models: we used
synthetic data for training the models, and then applied it to
a full IASI orbit (91800 ‘pixels’ on March 4th, 2008). A total
amount of 67475 synthetic samples were simulated with an
infrared RTM according to input profiles of temperature at 90
pressure levels.
We are confronted here with a challenging multi-output
regression problem (xi ∈ R8461 and yi ∈ R90 ), which needs
fast methods for retrieval (prediction). Note that the IASI
mission delivers approximately 1.3 × 106 spectra per day,
10
IEEE SIGNAL PROCESSING MAGAZINE, VOLUME 30, ISSUE 4, 2013
Overall Accuracy (%)
90
RGB composite
PLS2 (61.5%)
CCA (67.5%)
kOPLS (80.4%)
80
70
PCA
PLS2
CCA
OPLS
kPCA
kPLS2
kCCA
kOPLS
60
50
40
0
1
10
10
Number of features
2
10
Fig. 2. Average classification accuracy (%) for linear and kernel MVA methods as a function of the number of extracted features, along some classification
maps for the case of 10 extracted features.
ECMWF world temperature map [K]
p [hPa]
10
10
10
1
ECMWF
PCA (2.54)
kPCA (1.72)
kOPLS (1.55)
PCA
PLS2
CCA
OPLS
kPCA
kPLS2
kCCA
kOPLS
2
3
0
2
4
RMSE [K]
6
8
Fig. 3. RMSE atmospheric temperature profiles (left); surface temperature [K] world map provided by the official ECMWF model, http://www.ecmwf.int/,
on March 4, 2008 (middle); and estimated surface temperature maps in California/Mexico area for several methods along with the averaged RMSE across the
whole atmospheric column given in brackets.
which gives a rate of about 29 Gbytes/day to be processed.
We compared several MVA methods followed by LS in terms
of the root-MSE (RMSE) computed as the discrepancy to the
official European Centre for Medium-Range Weather Forecasts
(ECMWF) estimations. We obtain the RMSE for all spatial
‘pixels’ in the orbit and for all layers of the atmosphere. We
fixed a maximum nf = 100, and used only l = 2000 samples
for learning the transformation and the LS regression weights.
For the kernel approaches, an RBF kernel is used, selecting
the width using 5-fold cross validation.
The left panel of Fig. 3 illustrates the RMSE obtained for
different pressure levels, using a spatial average over all the
pixels. Results show that kernel methods outperform linear
approaches in RMSE terms, specially in the lower parts of
the atmosphere, probably due to the presence of clouds and
haze. Estimated surface temperature maps for a particular area
are also given in the right panel of Fig. 3 for PCA, kPCA,
and kOPLS. These plots reveal that kernel methods yield
closer maps to those provided by the ECMWF in averaged
RMSE terms. These results can be of high value because the
kernel-based estimations are obtained with a drastic reduction
in computational time compared to the physical-inversion
methods used by the ECMWF.
C. Analysis of integrated short-time music features for genre
prediction
In this subsection, we consider the problem of predicting
the genre of a song using the audio data only, a task which
has recently been a subject of much interest. The data set we
analyze has been previously used in [26], [41], and consists of
1317 snippets of 30 seconds distributed evenly over 11 music
genres: alternative, country, easy listening, electronica, jazz,
latin, pop&dance, rap&hip-hop, r&b, reggae and rock. This
is a rather complex data set with an average of 1.83 songs
per artist. An estimate of human performance on this data set
has been carried out via subjective tests, providing an average
accuracy rate around 55%.
The music snippets are MP3 (MPEG1-layer3) encoded
music with a bitrate of 128 kbps or higher, down sampled
to 22050 Hz, and they are processed following the method
in [41]: Mel Frequency Cepstral Coefficients (MFCC) features
are extracted from overlapping frames of the song, using a
ARENAS-GARCÍA ET AL.: KERNEL MULTIVARIATE ANALYSIS FRAMEWORK
11
window size of 20 ms. Then, a multivariate autoregressive
(AR) model is adjusted for every 1.2 seconds of the song to
capture temporal correlation, and finally the parameters of the
AR model are stacked into a 135 length feature vector for
every such frame.
For training and testing the system we have split the data
set into two subsets with 817 and 500 songs, respectively.
After processing the audio data, we have 57388 and 36556
135-dimensional vectors in the training and test partitions,
an amount which for standard kernel MVA methods is prohibitively large. For this reason, in this subsection we study
the performance of linear MVA methods, as well as of sparse
kernel methods that promote sparsity following the approach
of [26]: rkPCA, rkOPLS, and rkCCA. For completeness, we
have also considered the kPLS2 method of [7]; in this case the
deflation scheme does not allow to use reduced set methods, so
we applied mere subsampling. As in the previous subsections,
we use an RBF kernel, with a 10-fold validation scheme over
the training data to adjust the kernel width. We also resorted
to an LS scheme followed by “winner-takes-all” to carry out
the classification.
Figure 4 illustrates the performance of all methods for
different number of features. For the sparse methods, the
influence of the machine size, measured as the number of
kernel evaluations required to extract features for new data,
is also analyzed. In rkPCA, rkOPLS and rkCCA this is given
by the cardinality of the reduced set, whereas in kPLS2 it
coincides with the number of samples used to train the feature
extractor. Since every song consists of about 70 AR vectors,
we can measure the classification accuracy in two different
ways: 1) On the level of individual AR vectors, or 2) by
majority voting across the AR vectors of a given song. The
left panel of Fig. 4 illustrates the discriminative power of
the features extracted by all considered methods. Overall,
the best performance is obtained by OPLS- and CCA-type
methods, with the kernel schemes outperforming the linear
ones for nf > 5. The poor performance of PCA and rkPCA
makes evident the need of exploiting label information during
the feature extraction step. Finally, it is evident that a mere
subsampling does not provide kPLS2 with enough data to
extract relevant features, and this method performs worse than
its linear counterpart.
On the right panel of Fig. 4 we analyze the accuracy of
kernel methods as a function of machine size. As before,
it is clear that rkOPLS and rkCCA are the best performing
methods both at the AR and song levels. Increasing the size
of the machine results in better performance, although the
improvement is not very noticeable in excess of 250 nodes.
Altogether, these results allow us to conclude that sparse
methods can be used to enhance the applicability of kMVA
for large data sets. In this subsection, we have focused on the
sparse-promotion technique of [26], but similar advantages can
be expected from other sparse approaches.
of these techniques in real world applications is becoming
increasingly popular. Beyond standard PCA, there is a plethora
of linear and kernel methods that are generally better suited
for supervised applications since they seek for projections
that maximize the alignment with the target variables. We
analyzed the commonalities and basic differences of the most
representative MVA approaches in the literature, as well
as the relationships to existing kernel discriminative feature
extraction and statistical dependence estimation approaches.
We also studied recent methods to make kernel MVA more
suitable to real life applications, both for large scale data sets
and for problems with few labeled data. In such approaches,
sparse and semi-supervised learning extensions have been
successfully introduced for most of the models. Actually,
seeking for the appropriate features that facilitate classification
or regression cuts to the heart of manifold learning. We have
illustrated MVA methods in challenging real problems that
typically exhibit complex manifolds. A representative subset
of the UCI repository has been used to illustrate the general applicability of the standard MVA implementations in moderate
data sizes. We have completed the panorama with challenging
real-life applications: the prediction of music genre from
recorded signals, and the classification of land-cover classes
and estimation of climate variables to monitor our Planet.
This tutorial presented the framework of kernel multivariate
methods and outlined relations and possible extensions. The
adoption of the methods in many disciplines of science and
engineering is nowadays a fact. New exciting advances in the
theory and applications are yet to come.
V. C ONCLUSIONS
We reviewed the field of supervised feature extraction
from the unified framework of multivariate analysis. The use
ACKNOWLEDGMENTS
The authors would like to thank the reviewers and the
special issue editors for useful comments that improved the
presentation of the present manuscript significantly.
This work was partially supported by Banco Santander
and Universidad Carlos III de Madrid’s Excellence Chair
programme, and by the Spanish Ministry of Economy and
Competitiveness (MINECO) under projects TIN2012-38102C03-01, TEC2011-22480 and PRI-PIBIN-2011-1266.
R EFERENCES
[1] H. Wold, “Estimation of principal components and related models by
iterative least squares,” in P. R. Krishnaiah (ed.) Multivariate Analysis,
pp. 391–420, Academic Press, 1966.
[2] R. Rosipal, “Nonlinear Partial Least Squares: An Overview,” in
Chemoinformatics and Advanced Machine Learning Perspectives: Complex Computational Methods and Collaborative Techniques, H. Lodhi
and Y. Yamanishi (eds.), ACCM, IGI Global, pp. 169–189, 2011.
[3] S. Wold, N. Kettaneh-Wold, and B. Skagerberg, “Nonlinear PLS
Modeling,” Chemometrics and Intelligent Laboratory Systems, vol. 7,
pp. 53–65, 1989.
[4] S. Qin and T. McAvoy, “Non-linear PLS modelling using neural
networks,” Computers & Chemical Engineering, vol. 16, pp. 379–391,
1992.
[5] B. E. Boser, I. Guyon, and V. N. Vapnik, “A training algorithm for
optimal margin classifiers,” in Proc. COLT’92, Pittsburgh, PA, 1992,
pp. 144–152.
[6] B. Schölkopf, A. Smola and K.-R. Muller. “Nonlinear Component
Analysis as a Kernel Eigenvalue Problem,” Neural Computation, vol.
10, pp. 1299–1319, 1998.
[7] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis,
Cambridge University Press, 2004.
12
IEEE SIGNAL PROCESSING MAGAZINE, VOLUME 30, ISSUE 4, 2013
45
45
Accuracy rates (%)
Accuracy rates (%)
40
35
30
25
20
15
Baseline (Random Guess)
10
5 0
10
1
10
Number of Features
PCA
PLS2
CCA
OPLS
rkPCA
kPLS2
rkCCA
rkOPLS
10
40
35
rkPCA, AR
kPLS2, AR
rkCCA, AR
rkOPLS, AR
rkPCA, song
kPLS2, song
rkCCA, song
rkOPLS, song
30
25
2
20
100
250
500
Machine size
750
Fig. 4. Genre classification accuracy of linear and sparse kernel MVA methods. Left: Accuracy at the song level as a function of the number of extracted
features. Right: Accuracy of the sparse methods measured as percentage of correctly classified AR patterns and songs for different machine sizes.
[8] IEEE Signal Process. Mag., special issue on Dimensionality Reduction
via Subspace and Submanifold Learning, vol. 28, Mar. 2011.
[9] I. T. Jollife, Principal Component Analysis, Springer-Verlag, 1986.
[10] K. Pearson, “On lines and planes of closest fit to systems of points in
space,” Philosophical Mag., vol. 2 pp. 559–572, 1901.
[11] M. L. Braun, J. M. Buhmann, and K.-R. Müller, “On Relevant
Dimensions in Kernel Feature Spaces,” Journal of Machine Learning
Research, vol. 9, 1875–1908, 2008.
[12] T. J. Abrahamsen and L. K. Hansen, “A Cure for Variance Inflation in
High Dimensional Kernel Principal Component Analysis,” Journal of
Machine Learning Research, vol. 12, 2027–2044, 2011.
[13] H. Wold, “Non-linear estimation by iterative least squares procedures,”
in F. David (ed.) Research papers in Statistics, pp. 411–444, New York,
NY: Wiley, 1966.
[14] S. Wold, et al., “Multivariate Data Analysis in Chemistry,” in B. R.
Kowalski (ed.) Chemometrics, Mathematics and Statistics in Chemistry,
pp. 17–95. Reidel Publishing Company, Holland, 1984.
[15] P. Geladi, “Notes on the history and nature of partial least squares (PLS)
modelling”, Journal of Chemometrics, vol. 2, pp. 231–246, 1988.
[16] N. Kramer and R. Rosipal, “Overview and recent advances in partial
least squares,” in C. Saunders et al. (eds.) Subspace, Latent Structure
and Feature Selection Techniques, pp. 34–51, Springer-Verlag, 2006.
[17] H. Hotelling, “Relations between two sets of variates,” Biometrika, 28,
321–377, 1936.
[18] M. Borga, T. Landelius, and H. Knutsson, “A unified approach to
PCA, PLS, MLR and CCA,” Tech. Report LiTH-ISY-R-1992, Linköping,
Sweden, 1997.
[19] M. Barker and W. Rayens, “Partial least squares for discrimination,”
Journal of Chemometrics, vol. 17, pp. 166–173, 2003.
[20] H. Zou, T. Hastie, and R. Tibshirani, “Sparse Principal Component
Analysis,” Journal of Computational and Graphical Statistics, vol. 15,
pp. 265–286, 2006.
[21] D. R. Hardoon and J. Shawe-Taylor, “Sparse Canonical Correlation
Analysis,” Machine Learning, vol. 83, pp. 331–353, 2011.
[22] M. van Gerven, Z. Chao, and T. Heskes “On the decoding of intracranial
data using sparse orthonormalized partial least squares,” Journal of
Neural Engineering, vol. 9, 2012.
[23] C. M. Bishop, Neural networks for pattern recognition, Oxford University Press, 1995.
[24] P. L. Lai and C. Fyfe, “Kernel and non-linear Canonical Correlation
Analysis,” Intl. Journal of Neural Systems, vol. 10, pp. 365–377, 2000.
[25] R. Rosipal and L. J. Trejo, “Kernel partial least squares regression in
reproducing Hilbert spaces,” Journal of Machine Learning Research, 2,
97–123, 2001.
[26] J. Arenas-Garcı́a, K. B. Petersen, and L. K. Hansen, “Sparse kernel
orthonormalized PLS for feature extraction in large data sets,” in NIPS,
19, MIT Press, 2007.
[27] F. Biessmann, et.al., “Temporal kernel CCA and its application in
multimodal neuronal data analysis,” Machine Learning, vol. 79, pp.
5–27, 2010.
[28] M. Blaschko, J. Shelton, A. Bartels, C. Lampert, and A. Gretton, “Semisupervised Kernel Canonical Correlation Analysis with Application to
Human fMRI”, Patt. Recogn. Lett. vol. 32, pp.1572–1583, 2011.
[29] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller, “Fisher
discriminant analysis with kernels,” in Proc. IEEE Neural Networks for
Signal Processing Workshop, Madison, WI, Aug. 1999, pp. 41–48.
[30] S. Mika et al., “Constructing Descriptive and Discriminative Nonlinear
Features: Rayleigh Coefficients in Kernel Feature Spaces,” IEEE Trans.
Patt. Anal. Mach. Intell., vol. 25, pp. 623–628, 2003.
[31] G. Baudat and F. Anouar, “Generalized Discriminant Analysis Using a
Kernel Approach,” Neural Computation, vol. 12, pp. 2385–2404, 2000.
[32] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf, “Measuring
statistical dependence with Hilbert-Schmidt norms,” in Proc. 16th Intl.
Conf. Algorithmic Learning Theory, Springer, 2005, pp. 63–77.
[33] A. Gretton, R. Herbrich and A. Hyvärinen, “Kernel methods for
measuring independence”, Journal of Machine Learning Research, vol.
6, 2075–2129, 2005.
[34] J. C. Principe, Information Theoretic Learning: Renyi’s Entropy and
Kernel Perspectives, Springer, 2010.
[35] L. Hoegaerts, J. A. K. Suykens, J. Vanderwalle, and B. De Moor, “Subset
based least squares subspace regression in RKHS,” Neurocomputing,
vol. 63, pp. 293–323, 2005.
[36] M. E. Tipping, “Sparse Kernel Principal Component Analysis,” in NIPS,
13, MIT Press, 2001.
[37] M. Momma and K. Bennett, “Sparse kernel partial least squares
regression,” in Proc. Conf. on Learning Theory, 2003.
[38] C. Dhanjal, S. R. Gunn, and J. Shawe-Taylor, “Efficient sparse kernel
feature extraction based on partial least squares,” IEEE Trans. Patt.
Anal. and Mach. Intell., vol. 31, pp. 1347–1360, 2009.
[39] E. Izquierdo-Verdiguier, J. Arenas-Garcı́a, S. Muñoz-Romero, L.
Gómez-Chova and G. Camps-Valls, “Semisupervised Kernel Orthonormalized Partial Least Squares,” in Proc. IEEE Mach. Learn. Sign. Proc.
Workshop, Santander, Spain, 2012.
[40] J. Arenas-Garcı́a and G. Camps-Valls, “Efficient Kernel OPLS for
remote sensing applications,” IEEE Trans. Geosc. Rem. Sens., 44, 2872–
2881, 2008.
[41] A. Meng, P. Ahrendt, J. Larsen, and L. K. Hansen, “Temporal Feature
Integration for Music Genre Classification,” IEEE Trans. Audio, Speech
and Lang. Process., vol. 15, pp. 1654–1664, 2007.
[42] J. Arenas-Garcı́a and K. B. Petersen, “Kernel multivariate analsis in
remote sensing feature extraction,” in G. Camps-Valls and L. Bruzzone
(eds.) Kernel methods for Remote Sensing Data Analysis, Wiley, 2009.
[43] G. Camps-Valls, J. Muñoz-Marı́, L. Gómez-Chova, L. Guanter, and X.
Calbet, “Nonlinear Statistical Retrieval of Atmospheric Profiles from
MetOp-IASI and MTG-IRS Infrared Sounding Data,” IEEE Trans.
Geoscience and Remote Sensing, 2012.