To cite this article: Andrew J. Landgraf & Yoonkyung Lee (2019): Generalized Principal
Component Analysis: Projection of Saturated Model Parameters, Technometrics, DOI:
10.1080/00401706.2019.1668854
Abstract
Principal component analysis (PCA) is very useful for a wide variety of data analysis tasks, but its implicit connection to the Gaussian distribution can be undesirable for discrete data such as binary and multi-category responses or counts. We generalize PCA to handle various types of data using the generalized linear model framework. In contrast to the existing approach of matrix factorizations for exponential family data, our generalized PCA provides low-rank estimates of the natural parameters by projecting the saturated model parameters. This difference in formulation means that the number of parameters does not increase with the number of cases and that principal component scores on new data can be computed with simple matrix multiplication. Supplementary materials for this article are available online.
1 Introduction
The abundance of high dimensional data in many fields makes understanding and capturing the regular structures underlying the data crucial for subsequent modeling and prediction. Low dimensional projections of data are often
primary tools for coping with high dimensionality along with other techniques
for sparsity or structural simplicity. This paper concerns an extension of
standard principal component analysis (PCA), a widely used tool for low
dimensional approximation of data. Because the implicit connection of PCA to
the Gaussian distribution may be undesirable, the proposed extension aims to
expand the scope of PCA to various types of discrete data from binary and
multi-category responses to counts.
Several approaches to dimensionality reduction for non-Gaussian data have been proposed when all of the variables are of one particular type. For example, probabilistic PCA (Tipping and Bishop, 1999) has been extended to binary data (Tipping, 1998) as well as data from a multinomial distribution (Buntine, 2002). Logistic PCA (Schein et al., 2003) fits a low-rank estimate of the natural parameter matrix for Bernoulli data. Non-negative matrix factorization (Lee and Seung, 1999, NNMF) finds a low-rank estimate of the mean parameters assuming that the data come from a Poisson distribution and under the constraints that the factors are non-negative.
These factorization approaches have been further generalized (Udell et al., 2016), incorporating more loss functions and data types, as well as
penalties on the row and column factors. Probabilistic generalizations of
exponential family PCA using generative latent factor models have also been
formulated as both directed (Mohamed et al., 2009) and undirected graphical
models (Welling et al., 2005), with restricted Boltzmann machines (Hinton and
Salakhutdinov, 2006) being a special case of the latter. In the Bayesian
exponential family PCA framework (Mohamed et al., 2009), Li and Tao (2010)
considered a prior on the loadings for automatic determination of principal
components as latent variables. As an earlier reference to the generative
modeling approaches, Bartholomew and Knott (1999) provide a
comprehensive treatment of latent variable models and factor analysis for
categorical data from a statistical modeling perspective using the general
linear latent variable model. Contrary to the conventional take on exponential
family distributions, the latest development of an alternative exponential family
PCA in Liu et al. (2018) focuses on the mean parameters rather than the
natural parameters in characterizing principal components, and it involves the
eigen-decomposition of a new covariance matrix estimator particularly
designed for high dimensional data.
We propose a new formulation of generalized PCA which extends Pearson (1901)'s motivation for the optimal data representation with minimum mean squared error to data from members of the exponential family. In contrast to the existing approach of matrix factorizations for exponential family data, our generalized PCA provides low-rank estimates of the natural parameters by projecting the saturated model parameters. The novelty of our formulation is that principal component scores on new data can be computed with simple matrix multiplication, and the number of parameters does not increase with the
number of cases. This structural simplicity results in substantial savings in
computing time for training and testing. Furthermore, the principal component scores are more interpretable and stable, as they are a linear
combination of the natural parameters from the saturated model. From a
practical point of view, our generalized PCA is designed to be an exploratory
and descriptive tool for dimension reduction for non-Gaussian data in much
the same way as standard PCA. Compared to generative modeling
approaches that require elicitation of a formal probability model for latent
factors, it is conceptually simple, easy to use, and effectively free of additional
parameters that the user has to specify other than the number of components.
In addition, the deterministic view of principal components and their linear
algebraic characterization simplifies computation substantially.
A special case of the proposed generalization for binary data (called logistic
PCA) is presented in Landgraf and Lee (2015). In this paper, we pay
particular attention to dimensionality reduction of count data either with a
Poisson or multinomial assumption for applications. Count datasets occur in
many domains of interest, such as natural language processing and recommendation systems. For the Netflix competition, the goal was to minimize the squared error in prediction of movie ratings. Low-rank models extending the singular value decomposition, which is related to PCA, were the most successful single methods used (Feuerverger et al., 2012). However, if the goal is to predict which movies are most likely to be watched, other distributional assumptions may be more appropriate.
Our formulation also accommodates case weights, for example giving more weight to an observation that represents a proportion based on more trials. With missing data, the case weight can be set to either zero or one, depending on whether it is missing or observed. Missing data are common in collaborative filtering problems, where the goal is often to fill in the missing values of a matrix as accurately as possible using the observed entries. Section 5 demonstrates the
usefulness of the proposed formulation for visualization, matrix completion,
and recommendation. Finally, Section 6 offers conclusions and future
research directions.
2 Generalized PCA
2.1 Preliminaries
This section reviews concepts from generalized linear models that are important for developing our method: the notion of a natural parameter and a saturated model, and deviance as a data-fit criterion. An exponential family distribution has density (or mass) function of the form
f(x | θ, φ) = exp{ [xθ − b(θ)] / a(φ) + c(x, φ) },   (2.1)
where θ is the canonical natural parameter and φ is the scale parameter, which is assumed to be known. The mean and variance of the distribution are easily obtained from the cumulant function b(·):
E(X) = b′(θ)  and  var(X) = a(φ) b″(θ).
The natural parameter θ is μ for Gaussian with mean μ, logit(p) for Bernoulli with success probability p, and log(λ) for Poisson with mean parameter λ. For simplicity of exposition, we will further assume that a(φ) = 1, which is the case for the binomial and Poisson distributions and for the Gaussian distribution with unit variance, but will revisit the scale function a(·) when discussing normalization in Section 2.5.3.
The canonical link function g(·) is defined such that g(b′(θ)) = θ. The best possible fit to the data is referred to as the saturated model, where the mean b′(θ) is set equal to the data x, implying that the saturated model parameter is θ̃ = g(x). For Bernoulli, b(θ) = log(1 + exp(θ)) and θ̃ = logit(x), and for Poisson, b(θ) = exp(θ) and θ̃ = log(x). This data-driven saturated model will play an important role in defining a low-rank data approximation for generalized PCA. As we discuss in the next section, g(x) may not be finite for some distributions and values of x.
The deviance compares a given model fit to the saturated model,
D(x; θ) = 2{ log f(x | θ̃, φ) − log f(x | θ, φ) },
which equals (x − θ)² for the Gaussian distribution with mean θ and unit variance. Suppose that there are n independent observations on d variables from the same distribution family, and the variables are conditionally independent given their natural parameters. Then, the deviance of an n × d matrix of natural parameters, Θ = [θ_ij], with a data matrix of the same size, X = [x_ij], is given by
D(X; Θ) = Σ_{i=1}^n Σ_{j=1}^d D(x_ij; θ_ij).
Note that this setting posits individual parameters for each observation. The same deviance form also appears in Collins et al. (2001) under the implicit assumption of conditional independence of data given individual parameters.
Our generalization of PCA starts from the formulation that Pearson (1901) originally considered for dimensionality reduction. Pearson's formulation looks for the optimal projection of the original d-dimensional points x_i in a k-dimensional subspace (k < d) under squared error loss. That is, it seeks a center μ and a d × k orthonormal matrix U minimizing
Σ_{i=1}^n || x_i − μ − UU^T(x_i − μ) ||².
The matrices U containing the first k eigenvectors of the sample covariance matrix give the optimal representation, coinciding with the data projection with k principal components.
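As a quick numerical check of this equivalence (not part of the original derivation), the following Python sketch compares the squared reconstruction error of the eigenvector solution with that of an arbitrary orthonormal basis; all names are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 5, 2
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))   # correlated Gaussian-like data

mu = X.mean(axis=0)
Xc = X - mu                                              # center the data

# Leading k eigenvectors of the sample covariance matrix
S = Xc.T @ Xc / n
eigval, eigvec = np.linalg.eigh(S)
U = eigvec[:, np.argsort(eigval)[::-1][:k]]              # d x k loadings

# Squared reconstruction error of the rank-k projection x_i -> mu + UU^T(x_i - mu)
err_pca = np.sum((Xc - Xc @ U @ U.T) ** 2)

# Any other orthonormal basis gives a larger (or equal) error
Q, _ = np.linalg.qr(rng.normal(size=(d, k)))
err_other = np.sum((Xc - Xc @ Q @ Q.T) ** 2)
print(err_pca <= err_other)   # True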
For the Gaussian distribution with unit variance, the natural parameter θ_ij is the mean, the deviance is proportional to squared error loss, and the natural parameter from the saturated model is the observation x_ij itself. In this framework, we want to estimate the natural parameters θ_ij for data x_ij in a k-dimensional subspace such that they minimize the deviance
D(X; Θ) = Σ_{i=1}^n Σ_{j=1}^d D_j(x_ij; θ_ij) = 2 Σ_{i=1}^n Σ_{j=1}^d { −x_ij θ_ij + b_j(θ_ij) } + constant.
Since each variable of the dataset may come from a different distribution, the deviance and b(·) functions have been subscripted by column. Letting θ̃_ij be the natural parameter for the saturated model of the jth variable and θ̃_i be the vector for the ith case, the estimated natural parameters in a k-dimensional subspace are defined as
θ̂_ij = μ_j + [UU^T(θ̃_i − μ)]_j,
or Θ̂ = 1_n μ^T + (Θ̃ − 1_n μ^T)UU^T in matrix notation.
Combining the deviance with this parameterization, the objective to be minimized over μ and U is
D(X; 1_n μ^T + (Θ̃ − 1_n μ^T)UU^T) = 2 Σ_{i,j} [ −x_ij { μ_j + [UU^T(θ̃_i − μ)]_j } + b_j( μ_j + [UU^T(θ̃_i − μ)]_j ) ] + constant.
One complication is that the saturated model parameter θ̃_ij = g_j(x_ij), depending on the distribution and the values of x_ij, may be ∞ or −∞. For instance, θ̃_ij = −∞ for x_ij = 0 under the Poisson distribution. Numerically, we will approximate these infinite values with a constant of magnitude m, treating m as a tuning parameter that can be chosen data-adaptively by cross-validation. We find that typical values of m chosen by cross-validation are in the range of 1 to 10, which corresponds to the range of e^{−1} ≈ 0.367 to e^{−10} ≈ 4.5 × 10^{−5} for approximation of the Poisson mean
parameter 0, for example. While the matrix factorization approach in Collins et al. (2001) approximates the natural parameter matrix Θ with Θ̂ = AB^T using separate rank-k matrices A (n × k) and B (d × k) when μ = 0, our projection approach approximates it with Θ̂ = Θ̃UU^T by projecting the saturated model parameters Θ̃ to the column space of U, which simplifies the case-specific factor A to Θ̃U. This simpler form greatly facilitates computation in general and brings added stability to estimates especially when data are sparse, as demonstrated in Section 4.
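To make the contrast concrete, here is a small NumPy sketch (ours, not taken from the paper's software) of how the projection-based fit can be formed for Poisson data once loadings U and main effects μ are available: the saturated parameters log(x) are capped at −m for zero counts, and natural-parameter estimates for new cases require only matrix multiplication. The function names and the placeholder U and μ are illustrative.

import numpy as np

def saturated_params_poisson(X, m=5.0):
    """Saturated natural parameters log(x), with log(0) approximated by -m."""
    theta = np.full(X.shape, -m, dtype=float)
    pos = X > 0
    theta[pos] = np.log(X[pos])
    return theta

def project_natural_params(theta_tilde, U, mu):
    """Theta_hat = 1 mu^T + (Theta_tilde - 1 mu^T) U U^T."""
    return mu + (theta_tilde - mu) @ U @ U.T

# Example with arbitrary loadings (in practice U and mu come from the fitting algorithm)
rng = np.random.default_rng(1)
X = rng.poisson(2.0, size=(6, 4)).astype(float)
U, _ = np.linalg.qr(rng.normal(size=(4, 2)))    # placeholder orthonormal loadings
mu = np.log(X.mean(axis=0) + 1e-8)              # placeholder main effects

theta_hat = project_natural_params(saturated_params_poisson(X), U, mu)
mean_hat = np.exp(theta_hat)                    # fitted Poisson means

# Scores for a new case: one matrix multiplication, no per-case optimization
x_new = rng.poisson(2.0, size=4).astype(float)
scores_new = (saturated_params_poisson(x_new[None, :]) - mu) @ U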
The duality between the total squared error, Σ_{i=1}^n ||x_i − x̄||², and the Gaussian deviance for the null model with column means only also allows a natural way to extend the notion of the variance explained by principal components in standard PCA to the deviance explained in generalized PCA. Analogous to the eigenvalues of the sample covariance matrix, the reduction in deviance for an additional component can be defined as a natural measure of its importance, and it can guide the choice of the number of components in much the same way as the proportion of variance explained does for PCA.
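A hedged sketch of this measure, assuming the fitted natural parameters for the k-component model and for the null (main-effects-only) model are available, with the Poisson deviance used only as an example:

import numpy as np

def poisson_deviance(X, theta):
    """Total Poisson deviance of natural parameters theta = log(mean) for data X."""
    xlogx = np.where(X > 0, X * np.log(np.where(X > 0, X, 1.0)), 0.0)   # 0*log(0) = 0
    return 2.0 * np.sum(xlogx - X * theta - X + np.exp(theta))

def deviance_explained(X, theta_hat_k, theta_null):
    """Proportion of the null-model deviance removed by the k-component fit."""
    return 1.0 - poisson_deviance(X, theta_hat_k) / poisson_deviance(X, theta_null)

The marginal contribution of the lth component can then be read off as the difference between the values for l and l − 1 components.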
While the principal component scores in our formulation are treated as fixed functions of the saturated model parameters, latent-factor model approaches (e.g. Welling et al., 2005) posit principal components as unobservable random variables whose realizations determine the natural parameter values. This stochastic characterization of principal components entails inherently more complex computation involving MCMC sampling schemes. To contrast the two approaches, we note that the fitted values of the natural parameters in our formulation are θ̂_ij = μ_j + [UU^T(θ̃_i − μ)]_j, while those values in latent-factor models are determined by the realized values of the random factors. To see what optimization problems standard PCA and the proposed generalization involve, we derive necessary conditions
for the generalized PCA solutions. The orthonormality constraints on U
naturally lead to the application of the method of Lagrange multipliers. In order to derive the conditions, consider the Lagrangian
L(U, μ, Λ) = Σ_{i,j} D_j( x_ij; μ_j + [UU^T(θ̃_i − μ)]_j ) + tr( Λ(U^T U − I_k) ),
where Λ is a k × k symmetric matrix of Lagrange multipliers for the orthonormality
constraints.
Taking the gradient of the Lagrangian with respect to U, μ, and Λ and setting it to zero yields the conditions
[ (Θ̃ − 1_n μ^T)^T (X − b′(Θ̂)) + (X − b′(Θ̂))^T (Θ̃ − 1_n μ^T) ] U = UΛ,   (2.2)
( I_d − UU^T ) (X − b′(Θ̂))^T 1_n = 0_d,  and   (2.3)
U^T U = I_k,   (2.4)
where b′(Θ̂) denotes the n × d matrix whose (i, j) element equals b′_j( μ_j + [UU^T(θ̃_i − μ)]_j ). For the Gaussian distribution, b′(Θ̂) = Θ̂ and Θ̃ = X, which simplify the condition (2.2) to the eigen equation for a covariance matrix. Thus, (2.2) is analogous to the eigen equation for standard PCA. (2.3) can be interpreted as the condition that the residuals projected onto the null space of U sum to zero.
To illustrate the difference between standard PCA and generalized PCA, we simulated a two-dimensional count dataset with 100 rows where the mean matrix has a rank of one. The data are plotted in Figure 1. The area of the points is proportional to the number of observations with that pair of counts. The one-dimensional representations for standard PCA and Poisson PCA are displayed in the figure. The two solid lines indicate the one-dimensional mean space given by standard PCA (red) and Poisson PCA (blue).
The most obvious difference between the representations is that the Poisson PCA representation is non-linear. It is also constrained to be non-negative unlike the standard PCA projection, which is slightly negative for the second variable for the projection of (0, 0). The dashed lines in the figure show the correspondence between selected observations and their one-dimensional representations: standard PCA gives projections in the count space, while Poisson PCA gives projections in the natural parameter space. When transforming back to the count space, the Poisson PCA representation therefore becomes a curve.
The Poisson PCA estimates seem to fit those counts close to zero more tightly than the larger counts. This is due to the properties of the Poisson distribution, where the variance is equal to the mean. As shown in Lemma 1 of the supplementary material, if (λ − x) is held constant, where λ is the mean, the deviance monotonically decreases to zero as either λ or x increases to ∞. A large difference of the Poisson PCA fit from an observation in the count space may not correspond to a large deviance when the estimate or observation is relatively large. This is why some of the highlighted projections are more distant from the observation in the count space for Poisson PCA than for standard PCA and also why the Poisson PCA fits the values close to 0 more tightly.
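The following small computation (ours, for illustration only) shows this property directly: with the gap λ − x fixed, the unit Poisson deviance shrinks as the counts grow, so the same raw error is penalized far less at large counts.

import numpy as np

def poisson_unit_deviance(x, lam):
    """Poisson deviance 2*[x*log(x/lam) - (x - lam)] for a single observation."""
    xlog = x * np.log(x / lam) if x > 0 else 0.0
    return 2.0 * (xlog - (x - lam))

# Same absolute error (lam - x = 2) at different count scales
for x in [0, 1, 10, 100, 1000]:
    print(x, round(poisson_unit_deviance(x, x + 2.0), 4))
# The deviance decreases toward zero as x grows, even though |lam - x| stays constant.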
2.5.1 Weights
When each observation x_ij has a weight w_ij, it can easily be incorporated into the deviance as Σ_{i,j} w_ij D_j(x_ij; θ_ij), analogous to the weighted least squares criterion. One example where weights naturally occur is when a given observation is associated with multiple replications. For example, if an observation is a proportion of "successes" in w Bernoulli trials, then w is the natural weight for the observation.
2.5.2 Missing Data
In many practical applications, data may be missing and filling in the missing values is often an important task, for example in collaborative filtering. Letting Ω ⊂ {1, …, n} × {1, …, d} be the set of indices for which x_ij is observed, we define the deviance over non-missing elements only and solve for the parameters that minimize the weighted deviance, Σ_{(i,j)∈Ω} w_ij D_j(x_ij; θ_ij). This is equivalent to setting w_ij = 0 for all (i, j) ∉ Ω. In order to estimate the natural parameters θ̂_ij = μ_j + [UU^T(θ̃_i − μ)]_j for (i, j) ∉ Ω, however, the saturated model parameters are still required for both observed and missing data of the ith case. Therefore, we must extend the definition of saturated model parameters to account for missing data. Since the saturated model has the best possible fit to the data, the data can be fit exactly when x_ij is known, but when x_ij is unknown, our best guess for the observation is the main effect of the jth variable. We therefore set θ̃_ij = g_j(x_ij) if (i, j) ∈ Ω and θ̃_ij = μ_j if (i, j) ∉ Ω.
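A minimal sketch of this convention, assuming Poisson data with np.nan marking missing entries; only the weight matrix and the saturated parameters change. The function names are our own.

import numpy as np

def saturated_params_missing(X, mu, m=5.0):
    """theta_tilde = g(x) for observed entries, mu_j for missing ones (Poisson case)."""
    obs = ~np.isnan(X)
    Xf = np.where(obs, X, 0.0)                        # placeholder for missing cells
    theta = np.tile(mu, (X.shape[0], 1)).astype(float)
    theta[obs & (Xf > 0)] = np.log(Xf[obs & (Xf > 0)])
    theta[obs & (Xf == 0)] = -m                       # log(0) approximated by -m
    return theta

def missing_weights(X):
    """w_ij = 1 if x_ij is observed, 0 if missing."""
    return (~np.isnan(X)).astype(float)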
One consequence of this definition is that the estimated natural parameters only depend on the saturated model parameters of the non-missing data. The estimated natural parameter is
θ̂_ij = μ_j + Σ_{l: (i,l)∈Ω} [UU^T]_{jl} ( θ̃_il − μ_l ).
Further, using μ_j in lieu of g_j(x_ij) does not complicate its estimation because it cancels out in the centered term θ̃_ij − μ_j.

2.5.3 Normalization
Letting τ_j, j = 1, …, d, be the normalizing factors we assign to the variables, W = [w_ij], and D_j(X_j, W_j; θ_j) := Σ_{i=1}^n w_ij D_j(x_ij; θ_ij), we want the lth "partial" deviance after normalization,
D^e_l := min_μ Σ_{j=1}^d (1/τ_j) D_j( X_j, W_j; 1_n μ_j + [ (Θ̃ − 1_n μ^T) e_l e_l^T ]_j ),
to be the same for all l. Note that τ_j can also be interpreted as an estimate for the scale parameter function, a_j(φ_j), in equation (2.1) up to a proportionality constant.
When U = e_l, the deviance is minimized with respect to the main effects μ if μ̂_j = g_j(X̄_j^w) for j ≠ l, where X̄_j^w := X_j^T W_j / (1_n^T W_j) is the weighted mean of the jth variable and g_j(·) is the canonical link function for the jth variable. Therefore, θ̂_ij = g_j(X̄_j^w) if j ≠ l and θ̂_ij = g_j(x_ij) if j = l. This implies that
D^e_l = Σ_{j=1, j≠l}^d (1/τ_j) D_j( X_j, W_j; 1_n g_j(X̄_j^w) ) + (1/τ_l) D_l( X_l, W_l; g_l(X_l) ).
D_l( X_l, W_l; g_l(X_l) ) = 0 by definition because g_l(X_l) gives the fit of the saturated model.
The goal is to find τ_j > 0, j = 1, …, d, such that D^e_l is the same for all l,
(1/τ_1) D_1( X_1, W_1; 1_n g_1(X̄_1^w) ) = ⋯ = (1/τ_d) D_d( X_d, W_d; 1_n g_d(X̄_d^w) ).
These are equal if τ_j ∝ D_j( X_j, W_j; 1_n g_j(X̄_j^w) ), which is greater than zero as long as the variance of the jth variable is greater than zero. We set
τ_j = (1/n) D_j( X_j, W_j; 1_n g_j(X̄_j^w) )
to make the normalizing factors more interpretable. τ_j can be interpreted as the average deviance of the null (intercept-only) model for the jth variable.
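As an illustration (ours, with unit weights for brevity), τ_j for a Poisson or Gaussian column can be computed directly as the average null-model deviance of that column; a column with at least one nonzero count is assumed.

import numpy as np

def tau_poisson(x):
    """Average null-model Poisson deviance of one column: (1/n) * D(x; 1*log(xbar))."""
    xbar = x.mean()
    xlogx = np.where(x > 0, x * np.log(np.where(x > 0, x, 1.0)), 0.0)
    return 2.0 * np.sum(xlogx - x * np.log(xbar) - x + xbar) / len(x)

def tau_gaussian(x):
    """Average null-model Gaussian deviance: the (biased) sample variance."""
    return np.mean((x - x.mean()) ** 2)

Two columns can then be put on a comparable footing by dividing their deviance contributions by these factors.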
For the Gaussian distribution, standardizing the variables before applying PCA corresponds to minimizing
|| (X − 1_n μ^T) τ^{−1/2} − (X − 1_n μ^T) τ^{−1/2} UU^T ||_F²,
whereas our proposed normalization minimizes
|| (X − 1_n μ^T) τ^{−1/2} − (X − 1_n μ^T) UU^T τ^{−1/2} ||_F²,
where τ is the diagonal matrix of the normalizing factors. The normalization factor for the Gaussian distribution is the sample variance of the variable (see Table 1). Finally, the normalization factor for the Poisson distribution is more complicated; however, the analysis shown in Figure 9 of the supplementary material shows that τ_j is smallest when the mean is close to 0 and levels off as the mean gets larger, meaning two variables with large means will have similar weights and variables with mean close to 0 will have the highest weights.
3 Computation
Since orthonormal constraints are not convex (Wen and Yin, 2013), finding a global solution for generalized PCA is infeasible in most scenarios. To deal with this difficulty, we develop an iterative majorization-minimization (MM) algorithm.
3.1 MM Algorithm
In many cases, a quadratic approximation to a smooth objective can be minimized with a closed-form solution. For example, in the iteratively reweighted least squares algorithm for generalized linear models (GLMs), the deviance is approximated by a quadratic function, which leads to a weighted least squares problem. To minimize the weighted deviance Σ_{i,j} w_ij D_j(x_ij; θ_ij), we first quadratically approximate the (i, j)th element of the weighted deviance at θ_ij^(t) by
Q( θ_ij | θ_ij^(t) ) := w_ij D_j( x_ij; θ_ij^(t) ) + 2 w_ij ( b′_j(θ_ij^(t)) − x_ij )( θ_ij − θ_ij^(t) ) + w_ij b″_j(θ_ij^(t)) ( θ_ij − θ_ij^(t) )².
From exponential family theory, b′_j(θ_ij^(t)) = E_{θ_ij^(t)}( X_ij ) and b″_j(θ_ij^(t)) = var_{θ_ij^(t)}( X_ij ).
Even though the Taylor approximation is quadratic, it can still be difficult to optimize under the orthonormality constraints because of the non-constant curvature terms w_ij b″_j(θ_ij^(t)). Therefore, we additionally majorize the quadratic approximation. Much like the Taylor approximation, MM algorithms iteratively replace the difficult objective with a simpler one. A function M(θ | θ^(t)) is said to majorize an objective f(θ) at θ^(t) if M(θ | θ^(t)) ≥ f(θ) for all θ and M(θ^(t) | θ^(t)) = f(θ^(t)). We majorize the quadratic approximation by
M( θ_ij | θ_ij^(t) ) := w_ij D_j( x_ij; θ_ij^(t) ) + 2 w_ij ( b′_j(θ_ij^(t)) − x_ij )( θ_ij − θ_ij^(t) ) + v_i^(t) ( θ_ij − θ_ij^(t) )²,
where
v_i^(t) := max_j w_ij b″_j( θ_ij^(t) ),   (3.1)
so that v_i^(t) ≥ w_ij b″_j(θ_ij^(t)) for all j. Summing over all elements and completing the square,
M( Θ | Θ^(t) ) = Σ_{i,j} M( θ_ij | θ_ij^(t) ) = Σ_{i,j} v_i^(t) [ θ_ij − θ_ij^(t) − (w_ij / v_i^(t)) ( x_ij − b′_j(θ_ij^(t)) ) ]² + constant.
Let
z_ij^(t) := θ_ij^(t) + (w_ij / v_i^(t)) ( x_ij − b′_j(θ_ij^(t)) )   (3.2)
be the adjusted response variables, Z^(t) = [ z_ij^(t) ], and V^(t) be an n × n diagonal matrix with ith diagonal equal to v_i^(t). The majorizing function can be rewritten as
M( Θ | Θ^(t) ) = || (V^(t))^{1/2} (Θ̃ − 1_n μ^T) UU^T − (V^(t))^{1/2} ( Z^(t) − 1_n μ^T ) ||_F² + constant.
With U fixed, the main effects are updated as
μ^(t+1) = ( 1_n^T V^(t) 1_n )^{−1} [ Z^(t) − Θ̃ U^(t) (U^(t))^T ]^T V^(t) 1_n.
Let Θ̃_c^(t) := Θ̃ − 1_n (μ^(t+1))^T and Z_c^(t) := Z^(t) − 1_n (μ^(t+1))^T be the column-centered matrices. With μ fixed at μ^(t+1), the majorizing function is minimized over U by the first k eigenvectors of
(Θ̃_c^(t))^T V^(t) Z_c^(t) + (Z_c^(t))^T V^(t) Θ̃_c^(t) − (Θ̃_c^(t))^T V^(t) Θ̃_c^(t).   (3.3)
The derivation is given in the supplementary material.
Algorithm 1: Majorization-minimization algorithm for generalized PCA
input: Data matrix (X), m, number of principal components (k), weights (W)
output: d × k principal component loadings matrix (U) and main effects (μ)
Set t = 1 and initialize μ^(1) and U^(1). We recommend setting μ^(1) to the column means of Θ̃ and U^(1) to the first k right singular vectors of the column-centered Θ̃.
repeat
1. Compute θ_ij^(t) = μ_j^(t) + [ U^(t) (U^(t))^T ( θ̃_i − μ^(t) ) ]_j.
2. Set the weights v_i^(t) and the adjusted response variables z_ij^(t) using equations (3.1) and (3.2).
3. Update μ^(t+1) = ( 1_n^T V^(t) 1_n )^{−1} [ Z^(t) − Θ̃ U^(t) (U^(t))^T ]^T V^(t) 1_n.
4. Set U^(t+1) to the first k eigenvectors of (3.3).
5. t ← t + 1.
until the deviance converges
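The following is a compact NumPy sketch of Algorithm 1 for the Poisson case, written here to make the updates concrete rather than to reproduce the authors' implementation; variable names, the convergence rule, and the default value of m are illustrative choices.

import numpy as np

def poisson_deviance(X, Theta, W):
    xlogx = np.where(X > 0, X * np.log(np.where(X > 0, X, 1.0)), 0.0)
    return 2.0 * np.sum(W * (xlogx - X * Theta - X + np.exp(Theta)))

def generalized_pca_poisson(X, k, W=None, m=5.0, max_iter=500, tol=1e-6):
    """MM algorithm sketch for Poisson generalized PCA (projection of log counts)."""
    n, d = X.shape
    W = np.ones((n, d)) if W is None else W
    # Saturated natural parameters: log(x), with log(0) approximated by -m
    Theta_tilde = np.where(X > 0, np.log(np.where(X > 0, X, 1.0)), -m)

    # Initialization: column means and leading right singular vectors
    mu = Theta_tilde.mean(axis=0)
    _, _, Vt = np.linalg.svd(Theta_tilde - mu, full_matrices=False)
    U = Vt[:k].T

    dev_old = np.inf
    for _ in range(max_iter):
        # Step 1: current natural-parameter estimates
        Theta = mu + (Theta_tilde - mu) @ U @ U.T
        mean = np.exp(Theta)                      # b'(theta) = b''(theta) = exp(theta)

        # Step 2: row-wise majorization weights (3.1) and adjusted responses (3.2)
        v = np.maximum(np.max(W * mean, axis=1), 1e-10)
        Z = Theta + (W / v[:, None]) * (X - mean)

        # Step 3: update the main effects mu
        proj = Theta_tilde @ U @ U.T
        mu = ((Z - proj) * v[:, None]).sum(axis=0) / v.sum()

        # Step 4: update U from the first k eigenvectors of (3.3)
        Tc = Theta_tilde - mu
        Zc = Z - mu
        M = Tc.T @ (v[:, None] * Zc) + Zc.T @ (v[:, None] * Tc) - Tc.T @ (v[:, None] * Tc)
        eigval, eigvec = np.linalg.eigh(M)
        U = eigvec[:, np.argsort(eigval)[::-1][:k]]

        # Step 5: check convergence of the weighted deviance
        dev = poisson_deviance(X, mu + (Theta_tilde - mu) @ U @ U.T, W)
        if abs(dev_old - dev) < tol * (abs(dev_old) + 1.0):
            break
        dev_old = dev
    return U, mu

Because the U update only needs a d × d eigen-decomposition, the per-iteration cost stays modest even when n is large.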
As in iteratively reweighted least squares, at each iteration an adjusted response variable z_ij^(t) is generated, which is equal to the previous estimate plus the residual weighted by the inverse of the estimated variance. In our algorithm, a common weight v_i^(t) is shared within each row so that the update for U reduces to a single eigen-decomposition. The numerical results in Section 4 show that generalized PCA typically has far fewer iterations than matrix factorization methods which rely on a singular value decomposition in each iteration. The additional experiment in the supplementary material demonstrates the stability of Algorithm 1 with random initialization and the effectiveness of the suggested initialization using singular vectors for faster convergence.
Our algorithm for generalized PCA can be trivially extended to missing data by setting the weights of the missing entries to zero.
4 Simulation Study
We simulated count datasets with missing data and compared how well standard PCA, our generalized PCA, and the matrix factorization (MF)-based method exponential family PCA (Collins et al., 2001) under a Poisson distribution (which we call "Poisson PCA" and "Poisson MF", respectively) fill in the missing values on both training and test data.
To simulate count data, we used a latent cluster variable Ci for each row,
whose values are uniformly distributed over {1, …, 5}. Given Ci = k, Xij was simulated from a Poisson distribution with mean λ_jk, where Λ = [λ_jk] is a d × 5 matrix of positive means. The elements of Λ were simulated from a gamma
distribution. This latent structure implies that both the mean matrix and the
natural parameter (logarithm of mean) matrix have a rank of 5. Due to the
latent clusters, simulated data would show over-dispersion if each column of
the data matrix was treated as observations from a single Poisson distribution.
Our proposed method can handle this type of over-dispersion naturally by
modeling the parameters of a Poisson mixture with a necessary number of
components.
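A small simulation sketch under this setup is given below; the gamma shape and scale are our own choices for illustration, not values stated in the text.

import numpy as np

def simulate_cluster_counts(n=100, d=50, n_clusters=5, shape=2.0, scale=2.0, seed=0):
    """Counts from a latent-cluster Poisson model; the mean matrix has rank n_clusters."""
    rng = np.random.default_rng(seed)
    Lambda = rng.gamma(shape, scale, size=(d, n_clusters))   # d x 5 positive means
    C = rng.integers(0, n_clusters, size=n)                  # latent cluster per row
    X = rng.poisson(Lambda[:, C].T)                          # n x d count matrix
    return X, C, Lambda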
We considered settings in which the probability of a zero count, P(X = 0), ranges from 0.25 to 0.9 and the mean of the non-zero counts, E(X | X > 0), ranges from 3 to 15. For each setting, we simulated a training dataset with 100 rows (n = 100) and 50 columns (d = 50). We also simulated a test dataset using the same Λ, but with 1000 rows. For each scenario, we simulated thirty-three training and test datasets, and the results presented below are the median over the replications.
Low-rank models are commonly used for matrix completion problems with
missing or incomplete data (Feuerverger et al., 2012). We examine whether
and under what conditions it is possible to improve the estimates of missing
values. To emulate the situation with missing data, we randomly set 20% of
the entries of the training matrix to be missing. We then applied standard
PCA, Poisson PCA, and Poisson MF to the simulated data with k = 1, …, 10.
Note that typical implementations of PCA cannot estimate the loadings in the
presence of missing values, although there are some options available (e.g.
Tipping and Bishop (1999)). We used Algorithm 1 for both standard PCA and
Poisson PCA to handle missing data. For Poisson MF, we used the same
majorization as Poisson PCA and performed iterative singular value
decompositions (SVD).
Filling in the missing values of the training matrix is a typical matrix completion problem. Using the projection formulation we have developed, a fitted value for missing x_ij is given by
x̂_ij = g_j^{−1}( μ_j + [UU^T(θ̃_i − μ)]_j ),
where θ̃_ij is set to μ_j and w_ij = 0 since x_ij is missing. g_j^{−1}(·) is the identity function for standard PCA and exp(·) for Poisson PCA. For standard PCA, we set any fitted values below zero to zero, since negative values are infeasible.
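Given loadings and main effects estimated on the incomplete matrix (for instance by the Algorithm 1 sketch above with zero weights on missing cells), filling in the held-out entries requires only the projection and the inverse link. This is our illustrative code for the Poisson case, with np.nan marking missing entries.

import numpy as np

def complete_matrix_poisson(X_missing, U, mu, m=5.0):
    """Fill missing entries with exp(mu_j + [UU^T(theta_tilde_i - mu)]_j)."""
    obs = ~np.isnan(X_missing)
    Xf = np.where(obs, X_missing, 0.0)
    # Saturated parameters: log(x) for observed entries (log(0) ~ -m), mu_j for missing
    theta_tilde = np.where(obs & (Xf > 0), np.log(np.where(Xf > 0, Xf, 1.0)),
                           np.where(obs, -m, mu))
    theta_hat = mu + (theta_tilde - mu) @ U @ U.T
    fitted = np.exp(theta_hat)
    return np.where(obs, X_missing, fitted)       # keep observed values as-is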
We evaluated the mean squared error of the filled-in estimates compared to the actual values held out. When there is no missing data, PCA is known to give an optimal low-rank projection of the data with respect to mean squared error. However, this is not guaranteed with missing data. Further, Poisson PCA and Poisson MF minimize the Poisson deviance rather than squared error.
Figure 2 shows the median over the replications of the mean squared prediction error of PCA, Poisson PCA, and Poisson MF on the missing data. Each row represents a different degree of sparsity, ranging from 25% zeros on the top row to 90% zeros on the bottom row. The columns represent the mean of the non-zero elements, ranging from 3 to 15. Within each panel, we estimated missing counts with 1 to 10 components for each method, and the error is plotted against the number of components.
For most scenarios, the minimum mean squared error occurs at around five
components, which is the true rank of the mean and natural parameter
matrices. Comparing Poisson PCA to PCA, standard PCA fills in the missing
values more accurately than Poisson PCA for the datasets with the least amount of sparsity (the top row). As the degree of sparsity increases, the advantage of Poisson PCA emerges, even in terms of mean squared error, the loss function that PCA is optimized for. One reason may be that, when the count is 0, standard PCA penalizes fitted values below 0 as much as fitted values above 0. By contrast, for Poisson PCA it is not possible to have negative fitted values because the projection in the log space is properly transformed back to the feasible range for the mean parameter. Further, as outlined in Section 2.4, the Poisson deviance gives higher weight to the errors for smaller counts than larger counts and therefore attempts to fit small counts more closely.
The prediction error of Poisson MF either increases much more quickly than that of the other methods as components are added or is never competitive. We have observed that Poisson MF tends to overfit in these settings. On average, Poisson MF takes many more iterations to reach convergence (and thus more computing time) because the extra score parameters for the individual rows allow more incremental improvements in the data fit. The results on the test data are nearly the same as those above and are therefore not included for brevity. The extent of overfitting by Poisson MF is reduced in testing because the loadings have been fixed.
We highlight the difference in testing time between Poisson PCA and Poisson MF in Table 2. The testing time for Poisson PCA is nearly constant, because it only involves matrix multiplication, while it is much larger and more varied for Poisson MF, which must re-estimate the row score parameters for each new case.
In the zero-inflated setting of Figure 3, the model misspecification results in an 80% increase in median MSE. Note that these two settings closely match the characteristics of the real application in Section 5 in terms of sparsity level and non-zero mean value.
Finally, we investigated how our method performs when d ≫ n by simulating from the original experimental setup with P(X = 0) = 0.9, E(X | X > 0) = 3, n = 100, and true rank of 5, but increasing d from 50 to 1000. Figure 4 shows that, counterintuitively, all methods get better with increased dimension, especially Poisson MF, although Poisson PCA typically performs the best. This seeming blessing of dimensionality is due to the fact that it becomes easier to identify the latent clusters with additional variables in this setting, as they arise from the same probabilistic structure and each dimension adds information on the cluster membership.

5 Application to the Million Song Dataset
We next illustrate the proposed methods on listening data from the Million Song Dataset (MSD; Bertin-Mahieux et al., 2011).
The first subset of the MSD that we analyzed is the listen history on the songs
from the ten most popular artists. We selected the 200 most popular songs by
these artists and the 1200 most active listeners of those songs. There is quite a bit of dispersion in the count matrix created from these users and songs: 93% of the entries in the matrix are 0, the mean of the non-zero entries is 2.8, and the maximum count is 391.
We applied Poisson PCA to the data and examined the proportion of the deviance explained by principal components first. Figure 5 displays the cumulative and marginal percentage of Poisson deviance explained by the components. A small number of components would thus be justifiable to account for the major variation in the data. For visualization of the songs in a low-dimensional latent space, we proceeded with the first two components, which explain 32% of the deviance.
In addition to Poisson PCA, we analyzed the count matrix using standard PCA on the counts, and non-negative matrix factorization (NNMF) with Poisson deviance. Since Poisson PCA projects the log of the counts, we also compared Poisson PCA to standard PCA on the log of the counts. The log of zero is undefined, so we numerically approximate it by −m with a large positive m for both Poisson PCA and standard PCA on the log transformed data. For each method, m was varied from 0.1 to 15 and was chosen by cross-validation. For Poisson PCA, the optimal m was 8.0 and for standard PCA, it was 1.2.
The loadings for the three PCAs are in Figure 6. If a user listens to one song
from an artist, it is likely that he or she will listen to other songs by the same
artist. Therefore, we expect that songs by the same artists would be close in
the latent space. For standard PCA on the count data, which is displayed on
the top-left panel of Figure 6, the first component is dominated by Florence +
the Machine’s songs because the two largest observations in the data were
for songs by this artist. Songs by Justin Bieber dominate the second principal
component direction because two songs by the artist were among the top ten
most popular songs or variables.
In comparison to PCA on the count data, loadings from PCA on the log of the count data are easier to interpret, as displayed on the top-right panel of Figure 6. This shows the improvement due to the log transformation of the data. Songs by The Black Keys and Jack Johnson are distinguished from the rest of the artists. These two artists have the most songs in the data. However, songs by the other eight artists are clumped together.
The bottom-left panel of Figure 6 has the loadings from Poisson PCA. This plot shows the improvement from both the log transformation and the Poisson deviance fit. Instead of only two distinguishable artists, there are now at least five distinguishable clusters of songs which correspond to artists. Further, there are two clusters for the Black Keys.
The bottom-right panel has the loadings from NNMF, which are quite different
from the others due to the non-negativity constraints. While the non-negativity
induces sparsity, it becomes more difficult to differentiate the artists with the
loadings.
Next, we computed the loadings for Poisson PCA with normalization factors,
as displayed in Figure 7. The loadings of normalized PCA for the count or log
counts are not included because they were qualitatively very similar to the
loadings from the log of the counts. The left plot in Figure 7 shows the un-
normalized Poisson PCA loadings and the right plot shows the normalized
Poisson PCA loadings. The area of each point is proportional to the number of
times the songs were played and the colors represent the same artists as in
Figure 6.
The most heavily played songs are less dominant and the songs are more tightly clustered in the normalized version. Finally, in the un-normalized version, songs by artists cluster in ray shapes from the origin, while they cluster more tightly further away from the origin in the normalized version.
These differences can be explained by the normalization factors. The songs furthest from the origin in the un-normalized version are the songs that were played the most times. For the songs in this dataset, the Poisson null deviance was in general larger for songs with higher means. Therefore, the songs with the smallest average play counts, especially those less than 1, were effectively given higher weights in the optimization when normalization is added. This caused the ray shapes to turn into tighter clusters and the songs by Florence + the Machine to have less influence.
Next, we used the data for song recommendation, a setting where it is desirable to expose the users to items they have not interacted with before and that they may be interested in. The recommendations we make are based on the rankings of songs, with θ̂_ij from PCA as estimated preference scores for each user.
We started with the 500 most popular songs and only included the songs that
were listened to by at least 20 users and the users that listened to at least 30
of the songs. This resulted in 1040 users and 281 songs. We randomly chose
80% of the users for training and 20% for testing. We set five of the positive
play counts to zero for each user in the test set to use them as true positives
in ranking. These five songs represent songs the user is interested in, but that
they have not yet been exposed to.
We then applied the PC loadings and main effects learned from the training
data to the test data to get a low-dimensional representation of each user’s
listens and used it to rank the songs with zero plays, comparing how high the
songs that were actually non-zero ranked using AUC. Finally we averaged the
AUC from each user’s prediction to get an overall AUC, which can range from
0 to 1, where 1 corresponds to perfect ranking and 0.5 corresponds to random
guessing.
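A sketch of this evaluation (our code; the score matrix would come from the estimated θ̂ for the test users): for each user, the AUC is the probability that a held-out positive song is ranked above a song with true zero plays, with ties counted as one half.

import numpy as np

def user_auc(scores, held_out_pos, true_zero):
    """AUC of ranking held-out positives above true-zero songs for one user."""
    pos = scores[held_out_pos]                   # scores of the held-out songs
    neg = scores[true_zero]                      # scores of songs never listened to
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def mean_auc(score_matrix, held_out_mask, zero_mask):
    """Average the per-user AUC over all test users."""
    aucs = [user_auc(score_matrix[i], np.where(held_out_mask[i])[0],
                     np.where(zero_mask[i])[0])
            for i in range(score_matrix.shape[0])]
    return float(np.mean(aucs))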
We have set up the experiment in a situation where typical matrix factorization methods may be computationally costly to use. For our method, computing the low-dimensional representation for new users only requires matrix multiplication, while it requires fitting a generalized linear model for each user with the matrix factorization-based methods.
We compare the AUC of standard PCA, Poisson PCA, and multinomial PCA, which is developed in more detail in the supplementary material, using the count data directly in Figure 8. For all of the PCA methods, we varied the number of principal components. As a baseline, we used the popularity of each song for every user, which ranks the songs based on the average play count of the songs. This popularity ranking is the same as the ranking that all of the PCA methods would give with main effects but zero principal components.
Standard PCA with the play count data performed very poorly – even worse
than the popularity model with no personalization – regardless of the number
of principal components. Poisson PCA and multinomial PCA both did much
better, with multinomial PCA consistently having the highest AUC. The
multinomial assumption may be more appropriate for the data because a user
has a finite amount of time to listen to songs and must pick between them.
Also, the multinomial deviance implicitly gives higher weight to play counts of
0 coming from highly active users than from the users who had low total play
counts. The supplementary material contains an additional analysis of this
data, which further showcases generalized PCA on binary and weighted data.
6 Conclusion
We have proposed a new formulation of generalized PCA based on the projection of saturated model parameters and have highlighted its properties in contrast to other generalizations. Notably, principal component scores are a linear combination of the natural parameters from the saturated model, and computing the scores or the low-rank estimate on new data does not require any extra optimization. The formulation is further extended to include weights, missing data, and variable normalization, which can improve the interpretability of the loadings and the quality of the recommendations as we have shown in numerical examples.
There are many ways that the proposed formulation can be extended further. When there are more variables than cases, regularization of the loadings may be beneficial.
Supplementary Material
Acknowledgments
The authors thank the editor, associate editor and referees for their helpful
comments.
Funding
References
Bartholomew, D. J. and M. Knott (1999). Latent Variable Models and Factor Analysis. Wiley.
Bertin-Mahieux, T., D. P. Ellis, B. Whitman, and P. Lamere (2011). The million song dataset. In Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011).
Choi, Y., J. Taylor, and R. Tibshirani (2017). Selecting the number of principal components: Estimation of the true rank of a noisy matrix. The Annals of Statistics 45(6), 2590-2617.
Lee, D. D. and H. S. Seung (1999). Learning the parts of objects by non-negative matrix factorization. Nature 401(6755), 788-791.
Lee, S., J. Z. Huang, and J. Hu (2010). Sparse logistic principal components analysis for binary data. The Annals of Applied Statistics 4(3), 1579-1601.
Li, J. and D. Tao (2010). Simple exponential family PCA. In International Conference on Artificial Intelligence and Statistics, pp. 453-460.
Liu, L. T., E. Dobriban, and A. Singer (2018). ePCA: High dimensional exponential family PCA. The Annals of Applied Statistics 12(4), 2121-2150.
Mohamed, S., K. Heller, and Z. Ghahramani (2009). Bayesian exponential family PCA. In Advances in Neural Information Processing Systems, pp. 1089-1096.
Singh, A. P. and G. J. Gordon (2008). A unified view of matrix factorization models. In Machine Learning and Knowledge Discovery in Databases, pp. 358-373. Springer.
Tipping, M. E. (1998). Probabilistic visualisation of high-dimensional binary data. In M. Kearns, S. Solla, and D. Cohn (Eds.), Advances in Neural Information Processing Systems 11, pp. 592-598.
Udell, M., C. Horn, R. Zadeh, and S. Boyd (2016). Generalized low rank models. Foundations and Trends in Machine Learning 9(1), 1-118.
Fig. 1 Simulated two-dimensional count data and their one-dimensional representations from standard PCA and Poisson PCA
Fig. 2 Mean squared error of matrix completion using standard PCA, Poisson PCA, and Poisson MF on simulated count datasets. The median MSE was taken over the replications. The dashed horizontal lines represent the mean squared errors when using column means as predictions.
Fig. 3 Mean squared error of matrix completion using standard PCA, Poisson PCA, and Poisson MF on simulated count datasets from the setting of P(X = 0) = 0.5 and E(X | X > 0) = 3 but with zero inflation. Each panel shows the median MSE over 200 replications. The dashed horizontal lines represent the mean squared errors when using column means as predictions.
Fig. 4 Mean squared error of matrix completion using standard PCA, Poisson PCA, and Poisson MF on simulated count datasets. Each panel represents a different data dimension. The median MSE was taken over the replications.
Fig. 5 Cumulative and marginal percentage of Poisson deviance explained by the principal components for the million song dataset
Fig. 6 The first two principal component loadings of the million song dataset using standard PCA on the count data, standard PCA on the log of the count data, Poisson PCA, and non-negative matrix factorization
Fig. 7 The first two principal component loadings of the million song dataset using Poisson PCA without normalization (left plot) and with normalization (right plot). The area of each point is proportional to the number of times the song was played
Table 1 Normalization factors τ_j for the Gaussian and Poisson distributions
Distribution   τ_j
Gaussian       Σ_i (x_ij − X̄_j)² / n
Poisson        2 [ (1/n) Σ_i x_ij log(x_ij) − X̄_j log(X̄_j) ]
Table 2 Mean number of iterations to train the algorithms and mean time in seconds to estimate the scores for 1000 test records, assuming a Poisson distribution and using five principal components. Poisson PCA refers to our generalized PCA and Poisson MF refers to exponential family PCA. Standard error of the mean is in parentheses.

Scenario                          Number of Iterations                 Testing Time (seconds)
P(X = 0)   E(X | X > 0)           Poisson PCA     Poisson MF           Poisson PCA       Poisson MF
0.25       3                      12.0 (0.19)     268.0 (10.43)        0.017 (0.0011)    6.1 (0.17)
0.5        9                      301.8 (8.72)    1792.3 (54.77)       0.017 (0.0009)    53.4 (3.44)