
Technometrics

ISSN: 0040-1706 (Print) 1537-2723 (Online) Journal homepage: https://www.tandfonline.com/loi/utch20

Generalized Principal Component Analysis:


Projection of Saturated Model Parameters

Andrew J. Landgraf & Yoonkyung Lee

To cite this article: Andrew J. Landgraf & Yoonkyung Lee (2019): Generalized Principal
Component Analysis: Projection of Saturated Model Parameters, Technometrics, DOI:
10.1080/00401706.2019.1668854

To link to this article: https://doi.org/10.1080/00401706.2019.1668854

Generalized Principal Component Analysis:
Projection of Saturated Model Parameters

Andrew J. Landgraf and Yoonkyung Lee


Department of Statistics, Ohio State University

Abstract
Principal component analysis (PCA) is very useful for a wide variety of data analysis tasks, but its implicit connection to the Gaussian distribution can be undesirable for discrete data such as binary and multi-category responses or counts. We generalize PCA to handle various types of data using the generalized linear model framework. In contrast to the existing approach of matrix factorizations for exponential family data, our generalized PCA provides low-rank estimates of the natural parameters by projecting the saturated model parameters. This difference in formulation leads to the favorable properties that the number of parameters does not grow with the sample size and simple matrix multiplication suffices for computation of the principal component scores on new data. A practical algorithm which can incorporate missing data and case weights is developed for finding the projection matrix. Examples on simulated and real count data show the improvement of generalized PCA over standard PCA for matrix completion, visualization, and collaborative filtering. Supplementary material for this article is available online.

Keywords: Binary data; Count data; Dimensionality reduction; Exponential family; Low rank model

1 Introduction
Abundance of high dimensional data in many fields makes understanding and
capturing the regular structures underlying the data crucial for subsequent
modeling and prediction. Low dimensional projections of data are often
primary tools for coping with high dimensionality along with other techniques
for sparsity or structural simplicity. This paper concerns an extension of
standard principal component analysis (PCA), a widely used tool for low
dimensional approximation of data. Because the implicit connection of PCA to
the Gaussian distribution may be undesirable, the proposed extension aims to
expand the scope of PCA to various types of discrete data from binary and
multi-category responses to counts.

Several approaches to dimensionality reduction for non-Gaussian data have been proposed when all of the variables are of one particular type. For example, probabilistic PCA (Tipping and Bishop, 1999) has been extended to binary data (Tipping, 1998) as well as data from a multinomial distribution (Buntine, 2002). Logistic PCA (Schein et al., 2003) fits a low-rank estimate of the natural parameter matrix for Bernoulli data. Non-negative matrix factorization (Lee and Seung, 1999, NNMF) finds a low-rank estimate of the mean parameters assuming that the data come from a Poisson distribution and under the constraints that the factors are non-negative.
Other techniques have been developed for the dimensionality reduction of many possible types of non-Gaussian data. Collins et al. (2001) introduced a general framework for low-rank matrix factorizations of the natural parameters of exponential family data by maximizing the constrained likelihood of exponential family distributions. This matrix factorization formulation has further been generalized (Gordon, 2002; Singh and Gordon, 2008; Udell et al., 2016), incorporating more loss functions and data types, as well as penalties on the row and column factors. Probabilistic generalizations of exponential family PCA using generative latent factor models have also been formulated as both directed (Mohamed et al., 2009) and undirected graphical models (Welling et al., 2005), with restricted Boltzmann machines (Hinton and Salakhutdinov, 2006) being a special case of the latter. In the Bayesian exponential family PCA framework (Mohamed et al., 2009), Li and Tao (2010) considered a prior on the loadings for automatic determination of principal components as latent variables. As an earlier reference to the generative modeling approaches, Bartholomew and Knott (1999) provide a comprehensive treatment of latent variable models and factor analysis for categorical data from a statistical modeling perspective using the general linear latent variable model. Contrary to the conventional take on exponential family distributions, the latest development of an alternative exponential family PCA in Liu et al. (2018) focuses on the mean parameters rather than the natural parameters in characterizing principal components, and it involves the eigen-decomposition of a new covariance matrix estimator particularly designed for high dimensional data.

We propose a new formulation of generalized PCA which extends Pearson (1901)'s motivation for the optimal data representation with minimum mean squared error to data from members of the exponential family. In contrast to the existing approach of matrix factorizations for exponential family data, our generalized PCA provides low-rank estimates of the natural parameters by projecting the saturated model parameters. The novelty of our approach is the deliberate use of the notion of a saturated model in defining principal components for exponential family data via projection of its natural parameters. This projection approach primarily concerns the structure underlying the joint distribution of the variables for dimensionality reduction, as in standard PCA, whereas the matrix factorization approach treats both variables and cases interchangeably. Due to the structural difference, our generalized PCA has several advantages over the matrix factorization methods. Applying principal components to new data only requires matrix multiplication, and the number of parameters does not increase with the number of cases. This structural simplicity results in substantial savings in computing time for training and testing. Furthermore, the principal component scores are more interpretable and stable, as they are a linear combination of the natural parameters from the saturated model. From a practical point of view, our generalized PCA is designed to be an exploratory and descriptive tool for dimension reduction for non-Gaussian data in much the same way as standard PCA. Compared to generative modeling approaches that require elicitation of a formal probability model for latent factors, it is conceptually simple, easy to use, and effectively free of additional parameters that the user has to specify other than the number of components. In addition, the deterministic view of principal components and their linear algebraic characterization simplifies computation substantially.

A special case of the proposed generalization for binary data (called logistic PCA) is presented in Landgraf and Lee (2015). In this paper, we pay particular attention to dimensionality reduction of count data either with a Poisson or multinomial assumption for applications. Count datasets occur in many domains of interest, such as natural language processing and recommendation systems. For the Netflix competition, the goal was to minimize the squared error in prediction of movie ratings. Low-rank models extending the singular value decomposition, which is related to PCA, were the most successful single methods used (Feuerverger et al., 2012). However, if the goal is to predict which movies are most likely to be watched, other distributional assumptions may be more appropriate.

In addition, we broaden the scope of generalized PCA by allowing for observation weights and missing data. Weights can be useful for indicating a differential weight naturally associated with each observation. For example, if an observation is a proportion, it should get higher weight for being associated with more trials. With missing data, the case weight can be set to either zero or one, depending on whether it is missing or observed. Missing data are common in collaborative filtering problems, where the goal is often to fill in the missing values of a matrix as accurately as possible using the observed values. We also introduce variable normalization, which allows variables of different types and magnitudes to be on the same scale, similar to how PCA can be applied to the correlation matrix instead of the covariance matrix.

For implementation, the principal component loadings are determined by a majorization-minimization (MM) algorithm, which is conceptually similar to the iteratively reweighted least squares algorithm. Numerically, we examine the relative performance of the generalized PCA in matrix completion with simulated count datasets, varying the sparsity and the magnitude of the positive counts. Further, using a count dataset of users' song listening history, we show the improvement in visualizing the loadings and recommending new songs to users with our generalization of PCA.

The generalized PCA formulation is introduced in Section 2, along with first-order optimality conditions for solutions and extensions to handle differential case weights, missing data, and variable normalization. Section 3 presents the computational algorithm for generalized PCA. The simulated and real data examples are provided in Sections 4 and 5, respectively, which show the usefulness of the proposed formulation for visualization, matrix completion, and recommendation. Finally, Section 6 offers conclusions and future research directions.

2 Generalized PCA
2.1 Preliminaries

Generalized PCA presented in this paper is based on generalized linear models and exponential family distributions (McCullagh and Nelder, 1989). We briefly review basic elements of exponential families and generalized linear models that are important for developing our method. The notion of a natural parameter and a saturated model, and deviance as a data-fit criterion are central to the formulation of generalized PCA. Of particular interest is how the general notions reduce to the Gaussian counterparts, which will then be used to bridge standard PCA to generalized PCA.

The probability density or mass function of a random variable $X$ with an exponential family distribution can be written as

$$f(x \mid \theta, \phi) = \exp\left\{ \frac{x\theta - b(\theta)}{a(\phi)} + c(x, \phi) \right\}, \qquad (2.1)$$

where $\theta$ is the canonical natural parameter and $\phi$ is the scale parameter, which is assumed to be known. The mean and variance of the distribution are easily obtained from

$$E(X) = \frac{\partial}{\partial\theta} b(\theta) = b'(\theta) \quad \text{and} \quad \mathrm{var}(X) = a(\phi)\,\frac{\partial^2}{\partial\theta^2} b(\theta) = a(\phi)\, b''(\theta).$$

The natural parameter $\theta$ is $\mu$ for Gaussian with mean $\mu$, $\mathrm{logit}(p)$ for Bernoulli with success probability $p$, and $\log(\lambda)$ for Poisson with mean parameter $\lambda$. For simplicity of exposition, we will further assume that $a(\phi) = 1$, which is the case for the binomial and Poisson distributions and for the Gaussian distribution with unit variance, but will revisit the scale function $a(\cdot)$ when discussing normalization in Section 2.5.3.

The canonical link function $g(\cdot)$ is defined such that $g(b'(\theta)) = \theta$. The best possible fit to the data is referred to as the saturated model, where the mean $b'(\theta)$ is set equal to the data $x$, implying that the saturated model parameter $\tilde\theta$ is $g(x)$.
For instance, for the Gaussian distribution, $b(\theta) = \theta^2/2$ and $\tilde\theta = x$; for Bernoulli, $b(\theta) = \log(1 + \exp(\theta))$ and $\tilde\theta = \mathrm{logit}(x)$; and for Poisson, $b(\theta) = \exp(\theta)$ and $\tilde\theta = \log x$. This data-driven saturated model will play an important role in defining a low-rank data approximation for generalized PCA. As we discuss in the next section, $g(x)$ may not be finite for some distributions and values of $x$.
Finally, the deviance for $\theta$ based on observation $x$, taken as a data-fit criterion, is defined as

$$D(x; \theta) = 2\left\{ \log f(x \mid \tilde\theta, \phi) - \log f(x \mid \theta, \phi) \right\},$$

which reduces to squared error loss, $(x - \theta)^2$, for $x$ from the Gaussian distribution with mean $\theta$ and unit variance. Suppose that there are $n$ independent observations on $d$ variables from the same distribution family, and the variables are conditionally independent given their natural parameters. Then, the deviance of an $n \times d$ matrix of natural parameters, $\Theta = [\theta_{ij}]$, with a data matrix of the same size, $X = [x_{ij}]$, is given by

$$D(X; \Theta) = \sum_{i=1}^{n} \sum_{j=1}^{d} D(x_{ij}; \theta_{ij}).$$

Note that this setting posits individual parameters for each observation. The same deviance form also appears in Collins et al. (2001) under the implicit assumption of conditional independence of data given individual parameters.

2.2 Basic Formulation

Our formulation of generalized PCA extends the characterization of PCA that Pearson (1901) originally considered for dimensionality reduction. Pearson's formulation looks for the optimal projection of the original $d$-dimensional points $\mathbf{x}_i$ in a $k$-dimensional subspace ($k < d$) under squared error loss. That is, finding a center $\boldsymbol\mu \in \mathbb{R}^d$ and a rank-$k$ matrix $\mathbf{U} \in \mathbb{R}^{d \times k}$ with $\mathbf{U}^\top\mathbf{U} = \mathbf{I}_k$ that minimize $\sum_{i=1}^{n} \| \mathbf{x}_i - \boldsymbol\mu - \mathbf{U}\mathbf{U}^\top(\mathbf{x}_i - \boldsymbol\mu) \|^2$. It is well known that $\boldsymbol\mu = \bar{\mathbf{x}}$ and $\mathbf{U}$ containing the first $k$ eigenvectors of the sample covariance matrix give the optimal representation, coinciding with the data projection with $k$ principal components.

This formulation can be interpreted as a dimensionality reduction technique for Gaussian data. If $x_{ij}$ comes from a Gaussian distribution with known variance, the natural parameter $\theta_{ij}$ is the mean, the deviance is proportional to squared error loss, and the natural parameter from the saturated model is $\tilde\theta_{ij} = x_{ij}$. Hence, the approximation of $\mathbf{x}_i$ by $\boldsymbol\mu + \mathbf{U}\mathbf{U}^\top(\mathbf{x}_i - \boldsymbol\mu)$ can be viewed equivalently as that of approximating the saturated model parameters $\tilde{\boldsymbol\theta}_i$ by $\boldsymbol\mu + \mathbf{U}\mathbf{U}^\top(\tilde{\boldsymbol\theta}_i - \boldsymbol\mu)$. With these characteristics in mind, we can generalize Pearson's formulation as a method for projecting the saturated model into a lower dimensional space by minimizing the deviance for natural parameters restricted to a subspace.

Pearson's original formulation of a low-rank representation of data does not technically require the normality assumption; however, this dual interpretation is very conducive to extensions of standard PCA to data from other exponential family distributions. Taking this approach, Landgraf and Lee (2015) introduced an extension of PCA to binary data as a low-rank projection of the logit parameters of the saturated model. This paper further generalizes it to other types of data, including counts and multicategory responses. For this generalization, we allow each variable to be from a different member of the exponential family and specify the appropriate distribution for each variable.

In this framework, we want to estimate the natural parameters $\theta_{ij}$ for data $x_{ij}$ in a $k$-dimensional subspace such that they minimize the deviance

$$D(X; \Theta) = \sum_{i=1}^{n}\sum_{j=1}^{d} D_j(x_{ij}; \theta_{ij}) = -2 \sum_{i=1}^{n}\sum_{j=1}^{d} \{ x_{ij}\theta_{ij} - b_j(\theta_{ij}) \} + \text{constant}.$$

Since each variable of the dataset may come from a different distribution, the deviance and $b(\cdot)$ functions have been subscripted by column. Letting $\tilde\theta_{ij}$ be the natural parameter for the saturated model of the $j$th variable and $\tilde{\boldsymbol\theta}_i$ be the vector for the $i$th case, the estimated natural parameters in a $k$-dimensional subspace are defined as $\hat\theta_{ij} = \mu_j + [\mathbf{U}\mathbf{U}^\top(\tilde{\boldsymbol\theta}_i - \boldsymbol\mu)]_j$, or $\hat\Theta = \mathbf{1}_n\boldsymbol\mu^\top + (\tilde\Theta - \mathbf{1}_n\boldsymbol\mu^\top)\mathbf{U}\mathbf{U}^\top$ in matrix notation.

We wish to find $\boldsymbol\mu \in \mathbb{R}^d$ and $\mathbf{U} \in \mathbb{R}^{d \times k}$ with $\mathbf{U}^\top\mathbf{U} = \mathbf{I}_k$ that minimize

$$-\sum_{i,j} x_{ij}\left( \mu_j + [\mathbf{U}\mathbf{U}^\top(\tilde{\boldsymbol\theta}_i - \boldsymbol\mu)]_j \right) + \sum_{i,j} b_j\left( \mu_j + [\mathbf{U}\mathbf{U}^\top(\tilde{\boldsymbol\theta}_i - \boldsymbol\mu)]_j \right)$$
$$= -\left\langle X,\ \mathbf{1}_n\boldsymbol\mu^\top + (\tilde\Theta - \mathbf{1}_n\boldsymbol\mu^\top)\mathbf{U}\mathbf{U}^\top \right\rangle + \sum_{i,j} b_j\left( \mu_j + [\mathbf{U}\mathbf{U}^\top(\tilde{\boldsymbol\theta}_i - \boldsymbol\mu)]_j \right),$$

where $\langle \mathbf{A}, \mathbf{B} \rangle = \mathrm{tr}(\mathbf{A}^\top\mathbf{B})$ is the trace inner product. For some distributions and values of $x_{ij}$, $\tilde\theta_{ij}$ may be $\infty$ or $-\infty$. For instance, $\tilde\theta_{ij} = -\infty$ for $x_{ij} = 0$ under the Poisson distribution. Numerically, we will approximate $\infty$ with a large positive constant $m$, treating it as a tuning parameter that can be chosen data-adaptively by cross-validation. We find that typical values of $m$ chosen by cross-validation are in the range of 1 to 10, which corresponds to the range of $e^{-1} \approx 0.367$ to $e^{-10} \approx 4.5 \times 10^{-5}$ for approximation of the Poisson mean parameter 0, for example. While the matrix factorization approach in Collins et al. (2001) approximates the natural parameter matrix $\Theta$ with $\hat\Theta = \mathbf{A}\mathbf{B}^\top$ using separate rank-$k$ matrices $\mathbf{A} \in \mathbb{R}^{n \times k}$ and $\mathbf{B} \in \mathbb{R}^{d \times k}$ when $\boldsymbol\mu = \mathbf{0}$, our projection approach approximates it with $\hat\Theta = \tilde\Theta\mathbf{U}\mathbf{U}^\top$ by projecting the saturated model parameters $\tilde\Theta$ to the column space of $\mathbf{U}$, which simplifies the case-specific factor $\mathbf{A}$ to $\tilde\Theta\mathbf{U}$. This simpler form greatly facilitates computation in general and brings added stability to estimates especially when data are sparse, as demonstrated in Section 4.
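As a concrete illustration of this formulation, the sketch below (Python/NumPy, illustrative only and not the authors' implementation) evaluates the fitted natural parameters $\hat\Theta = \mathbf{1}\boldsymbol\mu^\top + (\tilde\Theta - \mathbf{1}\boldsymbol\mu^\top)\mathbf{U}\mathbf{U}^\top$ and the corresponding Poisson objective, reusing the `poisson_saturated_params` and `poisson_deviance` helpers sketched in Section 2.1.

```python
import numpy as np

def gpca_fitted_nat_params(theta_tilde, U, mu):
    """Project the saturated parameters: Theta_hat = 1 mu' + (Theta_tilde - 1 mu') U U'."""
    centered = theta_tilde - mu          # broadcasts the offset mu over rows
    return mu + centered @ U @ U.T

def gpca_objective(X, theta_tilde, U, mu):
    """Deviance of the rank-k projection of the saturated model parameters."""
    return poisson_deviance(X, gpca_fitted_nat_params(theta_tilde, U, mu))

# Principal component scores for (possibly new) cases are just a matrix product:
# scores = (theta_tilde_new - mu) @ U
```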

The duality between the total squared error, $\sum_{i=1}^{n} \| \mathbf{x}_i - \bar{\mathbf{x}} \|^2$, and the Gaussian deviance for the null model with column means only also allows a natural way to extend the notion of the variance explained by principal components in standard PCA to the deviance explained in generalized PCA. Analogous to the eigenvalues of the sample covariance matrix, the reduction in deviance for an additional component can be defined as a natural measure of its importance. This measure can be used to determine the number of principal components in a similar way as the scree plot or cumulative percentage of variation is used for PCA.
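One way this quantity could be computed, reusing the helpers sketched above, is as the relative reduction in deviance from the main-effects-only (null) model; this is an illustrative calculation under those assumptions, not the authors' code.

```python
import numpy as np

def deviance_explained(X, theta_tilde, U, mu):
    """Proportion of null-model deviance explained by the projection with loadings U.
    The null model uses the main effects mu only (zero principal components)."""
    null_fit = np.broadcast_to(np.asarray(mu, dtype=float), np.asarray(X, dtype=float).shape)
    d_null = poisson_deviance(X, null_fit)
    d_model = gpca_objective(X, theta_tilde, U, mu)
    return 1.0 - d_model / d_null
```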
In our new formulation, principal components are characterized as a basis for a linear subspace for the natural parameters, and their scores are obtained through simple projection of the saturated model parameters to the subspace. In contrast to this linear algebraic approach where the natural parameters are treated as fixed, latent-factor model approaches (e.g. Welling et al., 2005) posit principal components as unobservable random variables whose realizations determine the natural parameter values. This stochastic characterization of principal components entails inherently more complex computation involving MCMC sampling schemes.
To contrast the two approaches, we note that the fitted values of the natural parameters in our formulation are $\hat\theta_{ij} = \mu_j + [\mathbf{U}\mathbf{U}^\top(\tilde{\boldsymbol\theta}_i - \boldsymbol\mu)]_j$, while those values for exponential family harmoniums in Welling et al. (2005) cannot be calculated in closed form generally. However, the mean field approximation to the fitted values, assuming that the latent variables are normally distributed, is $\hat\theta_{ij} = \mu_j + [\mathbf{U}\mathbf{U}^\top\mathbf{x}_i]_j$ (Salakhutdinov et al., 2007).

2.3 First-Order Optimality Conditions

To examine the relation between the optimization problems that standard PCA and the proposed generalization involve, we derive necessary conditions for the generalized PCA solutions. The orthonormality constraints on $\mathbf{U}$ naturally lead to the application of the method of Lagrange multipliers. In order to incorporate the orthonormal constraints $\mathbf{U}^\top\mathbf{U} = \mathbf{I}_k$, we consider the Lagrangian,

$$L(\mathbf{U}, \boldsymbol\mu, \Lambda) = \sum_{i,j} D_j\left( x_{ij};\ \mu_j + [\mathbf{U}\mathbf{U}^\top(\tilde{\boldsymbol\theta}_i - \boldsymbol\mu)]_j \right) + \mathrm{tr}\left( \Lambda(\mathbf{U}^\top\mathbf{U} - \mathbf{I}_k) \right),$$

where $\Lambda$ is a $k \times k$ symmetric matrix with the Lagrange multipliers. See Wen and Yin (2013) for more general discussions of optimization with orthogonality constraints.

Taking the gradient of the Lagrangian with respect to $\mathbf{U}$, $\boldsymbol\mu$, and $\Lambda$ and setting them equal to 0 gives the first-order optimality conditions for generalized PCA solutions:

$$\left[ \big(\tilde\Theta - \mathbf{1}_n\boldsymbol\mu^\top\big)^\top \big(X - b'(\hat\Theta)\big) + \big(X - b'(\hat\Theta)\big)^\top \big(\tilde\Theta - \mathbf{1}_n\boldsymbol\mu^\top\big) \right] \mathbf{U} = \mathbf{U}\Lambda, \qquad (2.2)$$

$$\big(\mathbf{I}_d - \mathbf{U}\mathbf{U}^\top\big)\big(X - b'(\hat\Theta)\big)^\top \mathbf{1}_n = \mathbf{0}_d, \quad \text{and} \qquad (2.3)$$

$$\mathbf{U}^\top\mathbf{U} = \mathbf{I}_k. \qquad (2.4)$$

The details of the calculation can be found in the supplementary material. $b'(\hat\Theta)$ is a matrix given by applying $b'(\cdot)$ elementwise. Each element is the expected value of $x_{ij}$ at the optimal parameter $\hat\theta_{ij}$; that is, the $ij$th element equals $b'_j(\mu_j + [\mathbf{U}\mathbf{U}^\top(\tilde{\boldsymbol\theta}_i - \boldsymbol\mu)]_j)$. For the Gaussian distribution, $b'(\theta) = \theta$ and $\tilde\Theta = X$, which simplify the condition (2.2) to the eigen equation for a covariance matrix. Thus, (2.2) is analogous to the eigen equation for standard PCA. (2.3) can be interpreted as the condition that the residuals projected onto the null space of $\mathbf{U}$ sum to zero.
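For the Gaussian case these conditions can be checked numerically: with $b'(\theta) = \theta$ and $\tilde\Theta = X$, the standard PCA solution (eigenvectors of the sample cross-product matrix, with $\boldsymbol\mu$ the column means) satisfies (2.2)-(2.4), with $\Lambda = \mathbf{0}$ in this parameterization. The following self-contained sketch is offered as an illustration under those assumptions, not as a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 6, 2
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))   # correlated Gaussian data

mu = X.mean(axis=0)
Xc = X - mu
S = Xc.T @ Xc                                            # proportional to the sample covariance
U = np.linalg.eigh(S)[1][:, ::-1][:, :k]                 # first k eigenvectors (largest eigenvalues)

Theta_hat = mu + Xc @ U @ U.T                            # fitted natural parameters (Gaussian means)
R = X - Theta_hat                                        # X - b'(Theta_hat)

cond_22 = (Xc.T @ R + R.T @ Xc) @ U                      # left side of (2.2); here Lambda = 0
cond_23 = (np.eye(d) - U @ U.T) @ R.T @ np.ones(n)       # left side of (2.3)
print(np.allclose(cond_22, 0, atol=1e-6),
      np.allclose(cond_23, 0, atol=1e-6),
      np.allclose(U.T @ U, np.eye(k)))
```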

2.4 Geometry of the Projection

To illustrate the difference in the low-rank representation of data using standard PCA and generalized PCA, we simulated a two-dimensional count dataset with 100 rows where the mean matrix has a rank of one. The data are plotted in Figure 1. The area of the points is proportional to the number of observations with that pair of counts. The one-dimensional representations for standard PCA and Poisson PCA are displayed in the figure. The two solid lines indicate the one-dimensional mean space given by standard PCA (red) and Poisson PCA (blue).
The most obvious difference between the representations is that the Poisson PCA representation is non-linear. It is also constrained to be non-negative, unlike the standard PCA projection, which is slightly negative for the second variable for the projection of (0, 0). The dashed lines in the figure show the correspondence between selected observations and their projections onto the lower dimensional space. Standard PCA gives projections in the count space, while Poisson PCA gives projections in the natural parameter space. When transforming back to the count space, the Poisson PCA fits are no longer projections in the Euclidean sense.

The Poisson PCA estimates seem to fit those counts close to zero more tightly than the larger counts. This is due to the properties of the Poisson distribution, where the mean is proportional to the variance. As shown in Lemma 1 of the supplementary material, if $(x - \lambda)$ is held constant, where $\lambda$ is the mean, the deviance monotonically decreases to zero as either $\lambda$ or $x$ increases to $\infty$. A large difference of the Poisson PCA fit from an observation in the count space may not correspond to a large deviance when the estimate or observation is relatively large. This is why some of the highlighted projections are more distant from the observation in the count space for Poisson PCA than for standard PCA and also why Poisson PCA fits the values close to 0 more tightly.

2.5 Further Extensions

Generalized PCA is flexible enough to allow for a variety of extensions,


incorporating data weights, missing data and normalization of variables of
different types or scales.

2.5.1 Weights

When each observation $x_{ij}$ has a weight $w_{ij}$, it can easily be incorporated into the deviance as $\sum_{i,j} w_{ij} D_j(x_{ij}; \theta_{ij})$, analogous to the weighted least squares criterion. One example where weights can naturally occur is when a given observation is associated with multiple replications. For example, if an observation is a proportion of "successes" in $w$ Bernoulli trials, then $w$ is the natural weight for the observation.

2.5.2 Missing Data


In many practical applications, data may be missing and filling in the missing values is often an important task. For example, in collaborative filtering, the goal is to use observed users' feedback on items to estimate their unobserved or missing feedback. One desirable feature of dimensionality reduction techniques is the ability to not only accommodate missing data, but also fill in values that are missing. Techniques that can account for constraints in the domain have an added advantage of only filling in feasible values.

Letting $\Omega \subseteq \{1,\ldots,n\} \times \{1,\ldots,d\}$ be the set of indices for which $x_{ij}$ is observed, we define the deviance over non-missing elements only and solve for the parameters that minimize the weighted deviance, $\sum_{(i,j) \in \Omega} w_{ij} D_j(x_{ij}; \theta_{ij})$. This is equivalent to setting $w_{ij} = 0$ for all $(i,j) \notin \Omega$. In order to estimate the natural parameters $\hat\theta_{ij} = \mu_j + [\mathbf{U}\mathbf{U}^\top(\tilde{\boldsymbol\theta}_i - \boldsymbol\mu)]_j$ for $(i,j) \in \Omega$, however, the saturated model parameters are still required for both observed and missing data of the $i$th case. Therefore, we must extend the definition of saturated model parameters to account for missing data. Since the saturated model has the best possible fit to the data, the data can be fit exactly when $x_{ij}$ is known, but when $x_{ij}$ is unknown, our best guess for the observation is the main effect of the $j$th variable. We therefore set $\tilde\theta_{ij} = g_j(x_{ij})$ if $(i,j) \in \Omega$ and $\tilde\theta_{ij} = \mu_j$ if $(i,j) \notin \Omega$.

One consequence of this definition is that the estimated natural parameters only depend on the saturated model parameters of the non-missing data. The estimated natural parameter is $\hat\theta_{ij} = \mu_j + \sum_{l:(i,l) \in \Omega} [\mathbf{U}\mathbf{U}^\top]_{jl}(\tilde\theta_{il} - \mu_l)$. Further, $\mu_j$ being in lieu of $\tilde\theta_{ij}$ does not complicate its estimation because it cancels out in the expression of $(\tilde{\boldsymbol\theta}_i - \boldsymbol\mu)$ when $x_{ij}$ is missing.
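A sketch of how this missing-data convention could be implemented for the Poisson case (Python/NumPy, with hypothetical helper names): observed entries get $\tilde\theta_{ij} = g_j(x_{ij})$ (with $-m$ standing in for $\log 0$), missing entries get $\tilde\theta_{ij} = \mu_j$ so that they drop out of $(\tilde{\boldsymbol\theta}_i - \boldsymbol\mu)$, and the case weights of missing entries are set to zero.

```python
import numpy as np

def poisson_saturated_params_missing(X, mu, m=5.0):
    """theta_tilde_ij = log(x_ij) if observed (log 0 ~ -m), mu_j if missing (NaN in X)."""
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    theta_tilde = np.tile(np.asarray(mu, dtype=float), (X.shape[0], 1))
    pos = observed & (X > 0)
    zero = observed & (X == 0)
    theta_tilde[pos] = np.log(X[pos])
    theta_tilde[zero] = -m
    return theta_tilde, observed.astype(float)   # weights: 1 if observed, 0 if missing

# Usage: the returned weights multiply the element-wise deviance, so missing cells
# contribute nothing, while theta_tilde is still defined everywhere and can be projected by U U'.
```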


2.5.3 Normalization
If data are on different scales in standard PCA, it is common practice to normalize the variables by dividing them by their standard deviations first and apply PCA to the standardized variables, which is equivalent to performing PCA on the correlation matrix. Difficulties in normalization can occur when the variables come from different distributions. Further, standardization makes sense only for location-scale families. For example, if we standardize a Bernoulli random variable, it is no longer Bernoulli. How do we compare a binary variable with mean 1/4 to a count variable with mean 10 to a continuous variable with variance 5?

We derive a notion of variable normalization for generalized PCA that has a similar effect to that of standardization for PCA. Specifically, as a result of normalizing variables before PCA, all of the standard basis vectors $\mathbf{e}_j$, $j = 1,\ldots,d$, as loadings explain the same amount of variance, which is proportional to the reduction in Gaussian deviance with those loadings. We extend this criterion to generalized PCA by weighting the variables such that the reduction in deviance (or equivalently the deviance) is the same with each of the standard basis vectors as loadings.

Letting $\tau_j$, $j = 1,\ldots,d$, be the normalizing factors we assign to the variables, $W = [w_{ij}]$, and $D_j(X_j, W_j; \boldsymbol\theta_j) := \sum_{i=1}^{n} w_{ij} D_j(x_{ij}; \theta_{ij})$, we want the $l$th "partial" deviance after normalization

$$D_{e_l} := \min_{\boldsymbol\mu} \sum_{j=1}^{d} \frac{1}{\tau_j} D_j\left( X_j, W_j;\ \mathbf{1}_n\mu_j + [(\tilde\Theta - \mathbf{1}_n\boldsymbol\mu^\top)\mathbf{e}_l\mathbf{e}_l^\top]_j \right)$$

to be the same for all $l$. Note that $\tau_j$ can also be interpreted as an estimate for the scale parameter function, $a_j(\phi_j)$, in equation (2.1) up to a proportionality constant.

When $\mathbf{U} = \mathbf{e}_l$, the deviance is minimized with respect to the main effects $\boldsymbol\mu$ if $\hat\mu_j = g_j(\bar X^w_j)$ for $j \neq l$, where $\bar X^w_j := X_j^\top W_j / (\mathbf{1}_n^\top W_j)$ is the weighted mean of the $j$th variable and $g_j(\cdot)$ is the canonical link function for the $j$th variable. Therefore, $\hat\theta_{ij} = g_j(\bar X^w_j)$ if $j \neq l$ and $\hat\theta_{ij} = g_j(x_{ij})$ if $j = l$. This implies that

$$D_{e_l} = \sum_{j=1, j \neq l}^{d} \frac{1}{\tau_j} D_j\left( X_j, W_j; \mathbf{1}_n g_j(\bar X^w_j) \right) + \frac{1}{\tau_l} D_l\left( X_l, W_l; g_l(X_l) \right).$$

$D_l(X_l, W_l; g_l(X_l)) = 0$ by definition because $g_l(X_l)$ gives the fit of the saturated model.

The goal is to find $\tau_j > 0$, $j = 1,\ldots,d$, such that $D_{e_l}$ is the same for all $l$,

$$\frac{1}{\tau_1} D_1\left( X_1, W_1; \mathbf{1}_n g_1(\bar X^w_1) \right) = \cdots = \frac{1}{\tau_d} D_d\left( X_d, W_d; \mathbf{1}_n g_d(\bar X^w_d) \right).$$

These are equal if $\tau_j = D_j(X_j, W_j; \mathbf{1}_n g_j(\bar X^w_j))$, which is greater than zero as long as the variance of the $j$th variable is greater than zero. We set $\tau_j = \frac{1}{n} D_j(X_j, W_j; \mathbf{1}_n g_j(\bar X^w_j))$ to make the normalizing factors more interpretable. $\tau_j$ can be interpreted as the average deviance of the null (intercept-only) model for the $j$th variable.

The expression of the normalization factor $\tau$ is summarized in Table 1 for some exponential family distributions with no missing data and constant weights. The resulting objective function for the Gaussian distribution is very similar to performing PCA on the correlation matrix. Let $\boldsymbol\tau := \mathrm{diag}(\tau_1,\ldots,\tau_d)$ be a $d$-dimensional diagonal matrix with the normalizing factors on the diagonals. PCA on the correlation matrix minimizes (where $\|\cdot\|_F$ is the Frobenius norm) $\| (X - \mathbf{1}_n\boldsymbol\mu^\top)\boldsymbol\tau^{-1/2} - (X - \mathbf{1}_n\boldsymbol\mu^\top)\boldsymbol\tau^{-1/2}\mathbf{U}\mathbf{U}^\top \|_F^2$, whereas our proposed normalization minimizes $\| (X - \mathbf{1}_n\boldsymbol\mu^\top)\boldsymbol\tau^{-1/2} - (X - \mathbf{1}_n\boldsymbol\mu^\top)\mathbf{U}\mathbf{U}^\top\boldsymbol\tau^{-1/2} \|_F^2$. The normalization factor for the Bernoulli distribution is a monotonic function of the Bernoulli variance. Finally, the normalization factor for the Poisson distribution is more complicated; however, the analysis shown in Figure 9 of the supplementary material shows that $\tau_j$ is smallest when the mean is close to 0 and levels off as the mean gets larger, meaning two variables with large means will have similar weights and variables with mean close to 0 will have the highest weights.
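The normalizing factors can be computed directly from their definition as the average null-model deviance of each variable. A Poisson-case sketch (Python/NumPy, unit weights, no missing data; illustrative only and assuming every column has at least one positive count):

```python
import numpy as np

def poisson_normalizing_factors(X):
    """tau_j = (1/n) * D_j(X_j; 1_n g_j(mean(X_j))): average deviance of the
    intercept-only model for column j, used as the weight 1/tau_j in the objective."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    col_means = X.mean(axis=0)               # weighted mean with unit weights
    xlogx = np.zeros(X.shape)
    pos = X > 0
    xlogx[pos] = X[pos] * np.log(X[pos])
    # column-wise Poisson deviance against the constant-mean fit
    dev_null = 2.0 * np.sum(xlogx - X - X * np.log(col_means) + col_means, axis=0)
    return dev_null / n
```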
ed

3 Computation
Since orthonormal constraints are not convex (Wen and Yin, 2013), finding a global solution for generalized PCA is infeasible in most scenarios. To deal with the difficulties imposed by orthonormality, we use an MM algorithm, which is also what Lee et al. (2010) and Landgraf and Lee (2015) used for similar problems applied to binary data. In contrast to the Bernoulli case, however, the deviance cannot be majorized for some members of the exponential family, for example Poisson. To deal with this, we first approximate the deviance with a second-order Taylor expansion and then majorize the Taylor expansion.

3.1 MM Algorithm
In many cases, a quadratic approximation to a smooth objective can be minimized with a closed-form solution. For example, in the iteratively reweighted least squares algorithm for generalized linear models (GLMs), the deviance is approximated by a quadratic function, which leads to a weighted least squares problem. To minimize the weighted deviance $\sum_{i,j} w_{ij} D(x_{ij}; \theta_{ij})$, we first quadratically approximate the $ij$th element of the weighted deviance at $\theta_{ij}^{(t)}$ by

$$Q(\theta_{ij} \mid \theta_{ij}^{(t)}) := w_{ij} D(x_{ij}; \theta_{ij}^{(t)}) + 2 w_{ij}\big( b'_j(\theta_{ij}^{(t)}) - x_{ij} \big)(\theta_{ij} - \theta_{ij}^{(t)}) + w_{ij}\, b''_j(\theta_{ij}^{(t)})(\theta_{ij} - \theta_{ij}^{(t)})^2.$$

From exponential family theory, $b'_j(\theta_{ij}^{(t)}) = E_{\theta_{ij}^{(t)}}(X_{ij})$ and $b''_j(\theta_{ij}^{(t)}) = \mathrm{var}_{\theta_{ij}^{(t)}}(X_{ij})$.

Even though the Taylor approximation is quadratic, it can still be difficult to optimize with orthonormal constraints because of the non-constant $w_{ij}\, b''_j(\theta_{ij}^{(t)})$. Therefore, we additionally majorize the quadratic approximation. Much like the Taylor approximation, MM algorithms iteratively replace the difficult objective with a simpler one. A function $M(\theta \mid \theta^{(t)})$ is said to majorize an objective $f(\theta)$ at $\theta^{(t)}$ if it satisfies two conditions: the tangency condition ($M(\theta^{(t)} \mid \theta^{(t)}) = f(\theta^{(t)})$) and the domination condition ($M(\theta \mid \theta^{(t)}) \geq f(\theta)$ for all $\theta$). We majorize the $ij$th element of the quadratic approximation with

$$M(\theta_{ij} \mid \theta_{ij}^{(t)}) := w_{ij} D(x_{ij}; \theta_{ij}^{(t)}) + 2 w_{ij}\big( b'_j(\theta_{ij}^{(t)}) - x_{ij} \big)(\theta_{ij} - \theta_{ij}^{(t)}) + v_i^{(t)}(\theta_{ij} - \theta_{ij}^{(t)})^2,$$

where

$$v_i^{(t)} := \max_{j=1,\ldots,d} w_{ij}\, b''_j(\theta_{ij}^{(t)}). \qquad (3.1)$$

Clearly, the tangency condition is satisfied for both the quadratic approximation and the deviance, $M(\theta_{ij}^{(t)} \mid \theta_{ij}^{(t)}) = Q(\theta_{ij}^{(t)} \mid \theta_{ij}^{(t)}) = w_{ij} D(x_{ij}; \theta_{ij}^{(t)})$, and the domination condition is satisfied for the quadratic approximation, $M(\theta \mid \theta_{ij}^{(t)}) \geq Q(\theta \mid \theta_{ij}^{(t)})$ for all $\theta$.
Finding the principal component loadings using this approach requires initializing $\hat\Theta = \mathbf{1}_n\boldsymbol\mu^\top + (\tilde\Theta - \mathbf{1}_n\boldsymbol\mu^\top)\mathbf{U}\mathbf{U}^\top$ with $\mathbf{U}^{(1)}$ and $\boldsymbol\mu^{(1)}$ and iteratively minimizing

$$M(\Theta \mid \Theta^{(t)}) = \sum_{i,j} M(\theta_{ij} \mid \theta_{ij}^{(t)}) = \sum_{i,j} v_i^{(t)}\left( \theta_{ij} - \left\{ \theta_{ij}^{(t)} + \frac{w_{ij}}{v_i^{(t)}}\big( x_{ij} - b'_j(\theta_{ij}^{(t)}) \big) \right\} \right)^2 + \text{constant},$$

with respect to $\boldsymbol\mu$ and $\mathbf{U}$.

Let

$$z_{ij}^{(t)} := \theta_{ij}^{(t)} + \frac{w_{ij}}{v_i^{(t)}}\big( x_{ij} - b'_j(\theta_{ij}^{(t)}) \big) \qquad (3.2)$$

be the adjusted response variables, $Z^{(t)} = [z_{ij}^{(t)}]$, and $V^{(t)}$ be an $n \times n$ diagonal matrix with $i$th diagonal equal to $v_i^{(t)}$. The majorizing function can be rewritten as

$$M(\Theta \mid \Theta^{(t)}) = \left\| \big(V^{(t)}\big)^{1/2}\big(\tilde\Theta - \mathbf{1}_n\boldsymbol\mu^\top\big)\mathbf{U}\mathbf{U}^\top - \big(V^{(t)}\big)^{1/2}\big(Z^{(t)} - \mathbf{1}_n\boldsymbol\mu^\top\big) \right\|_F^2 + \text{constant}.$$

Holding $\mathbf{U}$ fixed, $M(\Theta \mid \Theta^{(t)})$ is minimized with respect to $\boldsymbol\mu$ when

$$\boldsymbol\mu^{(t+1)} = \left( \mathbf{1}_n^\top V^{(t)}\mathbf{1}_n \right)^{-1}\left( Z^{(t)} - \tilde\Theta\mathbf{U}^{(t)}(\mathbf{U}^{(t)})^\top \right)^\top V^{(t)}\mathbf{1}_n.$$

Let $\tilde\Theta_c^{(t)} := \tilde\Theta - \mathbf{1}_n(\boldsymbol\mu^{(t+1)})^\top$ and $Z_c^{(t)} := Z^{(t)} - \mathbf{1}_n(\boldsymbol\mu^{(t+1)})^\top$ be the column-centered saturated parameters and adjusted responses. When holding $\boldsymbol\mu$ fixed, $M(\Theta \mid \Theta^{(t)})$ is minimized with respect to orthonormal $\mathbf{U}$ by the first $k$ eigenvectors of

$$\big(\tilde\Theta_c^{(t)}\big)^\top V^{(t)} Z_c^{(t)} + \big(Z_c^{(t)}\big)^\top V^{(t)} \tilde\Theta_c^{(t)} - \big(\tilde\Theta_c^{(t)}\big)^\top V^{(t)} \tilde\Theta_c^{(t)}. \qquad (3.3)$$

The complete proof of the minimizers of $M(\Theta \mid \Theta^{(t)})$ is in the supplementary material.
Algorithm 1: Majorization-minimization algorithm for generalized PCA

input: data matrix ($X$), $m$, number of principal components ($k$), weights ($W$)
output: $d \times k$ principal component loadings matrix ($\mathbf{U}$) and main effects ($\boldsymbol\mu$)

Set $t = 1$ and initialize $\boldsymbol\mu^{(1)}$ and $\mathbf{U}^{(1)}$. We recommend setting $\boldsymbol\mu^{(1)}$ to the column means of $\tilde\Theta$ and $\mathbf{U}^{(1)}$ to the first $k$ right singular vectors of the column-centered $\tilde\Theta$.

repeat
1. Set the case weights and the adjusted response variables using equations (3.1) and (3.2), where $\theta_{ij}^{(t)} = \mu_j^{(t)} + [\mathbf{U}^{(t)}(\mathbf{U}^{(t)})^\top(\tilde{\boldsymbol\theta}_i - \boldsymbol\mu^{(t)})]_j$.
2. $\boldsymbol\mu^{(t+1)} = \left( \mathbf{1}_n^\top V^{(t)}\mathbf{1}_n \right)^{-1}\left( Z^{(t)} - \tilde\Theta\mathbf{U}^{(t)}(\mathbf{U}^{(t)})^\top \right)^\top V^{(t)}\mathbf{1}_n$.
3. Set $\mathbf{U}^{(t+1)}$ to the first $k$ eigenvectors of (3.3).
4. $t \leftarrow t + 1$.
until the deviance converges
Algorithm 1 summarizes the steps of the MM algorithm for finding the loadings and main effects of generalized PCA. This algorithm is similar to iteratively reweighted least squares for GLMs. At each iteration, an adjusted response variable $z_{ij}^{(t)}$ is generated, which is equal to the previous estimate plus the residual weighted by the inverse of the estimated variance. In our algorithm, a case-wise upper bound on the estimated variance is used instead. This makes minimization of $M(\Theta \mid \Theta^{(t)})$ a weighted least squares problem, which can be solved by an eigen-decomposition. Further, this algorithm can be adapted to incorporate a dispersion parameter and accommodate over-dispersion through modification of the $b''_j(\theta_{ij})$ terms.
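As an illustration of how Algorithm 1 could be implemented, the following is a minimal Python/NumPy sketch for the Poisson case with unit weights and no missing data. It follows the updates (3.1)-(3.3), but it is not the authors' R implementation, and the convergence check and numerical safeguards are simplified.

```python
import numpy as np

def poisson_gpca(X, k, m=5.0, max_iter=200, tol=1e-6):
    """Minimal MM sketch for Poisson generalized PCA (unit weights, no missing data)."""
    X = np.asarray(X, dtype=float)
    # saturated natural parameters, with log(0) approximated by -m
    theta_tilde = np.where(X > 0, np.log(np.maximum(X, 1e-12)), -m)

    mu = theta_tilde.mean(axis=0)                        # suggested initialization
    U = np.linalg.svd(theta_tilde - mu, full_matrices=False)[2][:k].T

    last_dev = np.inf
    for _ in range(max_iter):
        theta_hat = mu + (theta_tilde - mu) @ U @ U.T    # current natural parameters
        means = np.exp(theta_hat)                        # b'(theta) for Poisson
        v = means.max(axis=1)                            # (3.1): case-wise bound on b''(theta)
        Z = theta_hat + (X - means) / v[:, None]         # (3.2): adjusted responses
        # update the main effects, then re-center
        mu = (Z - theta_tilde @ U @ U.T).T @ v / v.sum()
        Tc, Zc = theta_tilde - mu, Z - mu
        # (3.3): first k eigenvectors of the symmetrized weighted cross-product
        A = Tc.T @ (v[:, None] * Zc)
        M = A + A.T - Tc.T @ (v[:, None] * Tc)
        U = np.linalg.eigh(M)[1][:, ::-1][:, :k]
        dev = 2.0 * np.sum(np.exp(theta_hat) - X * theta_hat)   # deviance up to a constant
        if abs(last_dev - dev) < tol * (abs(last_dev) + 1):
            break
        last_dev = dev
    return U, mu
```

On new cases, the principal component scores are then just $(\tilde\Theta_{\text{new}} - \mathbf{1}\boldsymbol\mu^\top)\mathbf{U}$, a single matrix multiplication.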

Due to the structure of the MM algorithm, the quadratic approximation of the deviance is guaranteed to decrease or stay the same at every iteration. The binomial, multinomial, and Gaussian deviances with weights can be majorized with $v_i^{(t)} = \max_j \{ w_{ij} \max_\theta b''_j(\theta) \}$, and in that case, the deviance is guaranteed to decrease or stay the same every iteration, which implies that the deviance will converge (Vaida, 2005). We have found that our algorithm typically converges more quickly than a full majorization. The computational complexity of this MM algorithm depends largely on the eigen-decomposition in each iteration and the number of iterations to reach convergence. Eigen-decomposition has worst-case complexity of $O(d^3)$ per iteration (Pan and Chen, 1999). The simulations in Section 4 show that generalized PCA typically has far fewer iterations than matrix factorization methods, which rely on a singular value decomposition in each iteration. The additional experiment in the supplementary material demonstrates the stability of Algorithm 1 with random initialization and the effectiveness of the suggested initialization using singular vectors for faster convergence.

Our algorithm for generalized PCA can be trivially extended to missing data by setting the weight $w_{ij} = 0$ if observation $x_{ij}$ is missing. An R (R Core Team, 2016) implementation of this algorithm can be found at github.com/andland/generalizedPCA.

4 Simulation Study
We simulated count datasets with missing data and compared how well standard PCA, our generalized PCA, and the matrix factorization (MF)-based method exponential family PCA (Collins et al., 2001) under a Poisson distribution (which we call "Poisson PCA" and "Poisson MF", respectively) fill in the missing values on both training and test data. In the simulation, we controlled two specified characteristics of count data: the level of sparsity (the proportion of zero counts) and the average of the non-zero counts.

4.1 Simulation Setup

To simulate count data, we used a latent cluster variable $C_i$ for each row, whose values are uniformly distributed over $\{1,\ldots,5\}$. Given $C_i = k$, $X_{ij}$ was simulated from a Poisson distribution with mean $\lambda_{jk}$, where $\Lambda = [\lambda_{jk}]$ is a $d \times 5$ matrix of positive means. The elements of $\Lambda$ were simulated from a gamma distribution. This latent structure implies that both the mean matrix and the natural parameter (logarithm of mean) matrix have a rank of 5. Due to the latent clusters, simulated data would show over-dispersion if each column of the data matrix was treated as observations from a single Poisson distribution. Our proposed method can handle this type of over-dispersion naturally by modeling the parameters of a Poisson mixture with a necessary number of components.

When $P(X \mid \lambda)$ is the Poisson probability mass function and $P(\lambda)$ is the gamma probability density function, the marginal distribution of $X$ is negative binomial. In our simulations, we altered the shape and scale parameters of the gamma distribution for $\lambda_{jk}$ such that the marginal proportion of zero counts, $P(X = 0)$, ranges from 0.25 to 0.9 and the mean of the non-zero counts, $E(X \mid X > 0)$, ranges from 3 to 15. For each setting, we simulated a training dataset with 100 rows ($n = 100$) and 50 columns ($d = 50$). We also simulated a test dataset using the same $\Lambda$, but with 1000 rows. For each scenario, we simulated thirty-three training and test datasets, and the results presented below are the median over the replications.
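A sketch of this data-generating mechanism (Python/NumPy; the gamma shape and scale values below are placeholders, not the values used in the paper, which were tuned to hit the target sparsity and conditional mean):

```python
import numpy as np

def simulate_count_data(n=100, d=50, n_clusters=5, shape=1.0, scale=2.0, seed=0):
    """Latent-cluster Poisson counts: rank-5 mean matrix, negative-binomial marginals."""
    rng = np.random.default_rng(seed)
    Lam = rng.gamma(shape, scale, size=(d, n_clusters))    # positive means lambda_{jk}
    clusters = rng.integers(0, n_clusters, size=n)          # C_i uniform on {1,...,5}
    X = rng.poisson(Lam[:, clusters].T)                     # X_ij ~ Poisson(lambda_{j, C_i})
    return X, clusters, Lam

X_train, _, Lam = simulate_count_data()
print((X_train == 0).mean(), X_train[X_train > 0].mean())   # sparsity and non-zero mean
```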

When applying generalized PCA with distributions where it is necessary to provide $m$, we used five-fold cross-validation on a range of 1 to 10 to select the optimal value with respect to predictive deviance. In estimating the principal component loadings of the simulated datasets, we did not include main effects (i.e. $\boldsymbol\mu = \mathbf{0}$) in order to highlight the expected optimal performance at the true rank.


4.2 Matrix Completion

Low-rank models are commonly used for matrix completion problems with missing or incomplete data (Feuerverger et al., 2012). We examine whether and under what conditions it is possible to improve the estimates of missing values. To emulate the situation with missing data, we randomly set 20% of the entries of the training matrix to be missing. We then applied standard PCA, Poisson PCA, and Poisson MF to the simulated data with $k = 1,\ldots,10$. Note that typical implementations of PCA cannot estimate the loadings in the presence of missing values, although there are some options available (e.g. Tipping and Bishop (1999)). We used Algorithm 1 for both standard PCA and Poisson PCA to handle missing data. For Poisson MF, we used the same majorization as Poisson PCA and performed iterative singular value decompositions (SVD).

Filling in the missing values of the training matrix is a typical matrix completion problem. Using the projection formulation we have developed, a fitted value for missing $x_{ij}$ is given by $\hat x_{ij} = g^{-1}\left( \mu_j + [\mathbf{U}\mathbf{U}^\top(\tilde{\boldsymbol\theta}_i - \boldsymbol\mu)]_j \right)$, where $\tilde\theta_{ij}$ is zero since $x_{ij}$ is missing and we set $\boldsymbol\mu = \mathbf{0}$. $g^{-1}(\cdot)$ is the identity function for standard PCA and $\exp(\cdot)$ for Poisson PCA. For standard PCA, we set any fitted values below zero to zero, since negative values are infeasible.
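A sketch of this fill-in step for Poisson PCA (Python/NumPy, with $\boldsymbol\mu = \mathbf{0}$ as in the simulation; `U` would come from a fit such as the MM sketch in Section 3, and `missing` is an assumed boolean mask of held-out entries):

```python
import numpy as np

def poisson_pca_fill(X, missing, U, m=5.0):
    """Fill missing counts with exp of the projected saturated natural parameters (mu = 0)."""
    X = np.asarray(X, dtype=float)
    theta_tilde = np.zeros(X.shape)            # missing entries: theta_tilde = mu_j = 0
    obs = ~missing
    pos = obs & (X > 0)
    theta_tilde[pos] = np.log(X[pos])
    theta_tilde[obs & (X == 0)] = -m           # observed zeros: log(0) approximated by -m
    theta_hat = theta_tilde @ U @ U.T          # projection; g^{-1} = exp maps back to counts
    X_filled = X.copy()
    X_filled[missing] = np.exp(theta_hat)[missing]
    return X_filled
```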

We evaluated the mean squared error of the filled-in estimates compared to the actual values held out. When there is no missing data, PCA is known to give an optimal low-rank projection of the data with respect to mean squared error. However, this is not guaranteed with missing data. Further, Poisson PCA involves the projection of a non-linear function (logarithm) of the data, which might be advantageous in some scenarios.

Figure 2 shows the median over the replications of the mean squared prediction error of PCA, Poisson PCA, and Poisson MF on the missing data. Each row represents a different degree of sparsity, ranging from 25% zeros on the top row to 90% zeros on the bottom row. The columns represent the mean of the non-zero elements, ranging from 3 to 15. Within each panel, we estimated missing counts with 1 to 10 components for each method, with the points representing the median over the simulations. Poisson MF occasionally produced very large MSEs, and those values exceeding the top border are clipped in the plots. Finally, the dashed horizontal lines represent the predicted mean squared error when using the column means as estimates for the missing values. We also examined non-negative matrix factorization (NNMF) and exponential family harmoniums (EFH). We did not include the results for NNMF because they were similar to Poisson MF, except more variable. EFH requires a difficult optimization and there are no open source packages that implement it. We did not include the EFH results out of concern that our implementation is not a fair comparison with an expertly tuned algorithm. We found that EFH gave estimates that were generally worse than Poisson PCA, especially as the sparsity and conditional mean increased, which may be attributable to the projection of sparse data, as opposed to the natural parameters (see Section 2.2 for a comparison of their forms).

For most scenarios, the minimum mean squared error occurs at around five components, which is the true rank of the mean and natural parameter matrices. Comparing Poisson PCA to PCA, standard PCA fills in the missing values more accurately than Poisson PCA for the datasets with the least amount of sparsity (the top row). As the degree of sparsity increases, the advantage of Poisson PCA emerges, even in terms of mean squared error, the loss function PCA is optimized for. One reason may be that, when the count is 0, standard PCA penalizes fitted values below 0 as much as fitted values above 0. By contrast, for Poisson PCA it is not possible to have negative fitted values because the projection in the log space is properly transformed back to the feasible range for the mean parameter. Further, as outlined in Section 2.4, the Poisson deviance gives higher weight to the errors for smaller counts than larger counts and therefore attempts to fit small counts more closely.

Poisson MF shows signs of overfitting when the conditional mean of positive counts is low and sparsity is high (the left and bottom plots). The MSE either increases much more quickly than for the other methods or is never competitive. We have observed that Poisson MF tends to overfit when the conditional mean is low because of the increased flexibility in the scores as individual parameters. As discussed in Section 2.4, for a given residual, the Poisson deviance is higher when the count is closer to zero. When most of the values in a given row are zero or close to it, the scores for that row may have a large absolute value in order to fit the small counts as closely as possible. These extreme scores will result in very large predicted values for missing data if the sign of the loadings corresponding to the missing variable is opposite the sign of the non-missing variable loadings. For example, in the one-dimensional case with score $a_i$ and loadings $b_j$ and $b_l = -b_j$, if $\exp(a_i b_j) \approx 0$, then $\exp(a_i b_l) \gg 1$.

The implementations of Poisson PCA and Poisson MF are similar, varying only in how each majorization is minimized (eigen-decomposition for PCA and SVD for MF). In Table 2, we compared the number of iterations to reach convergence for both algorithms for three different simulation scenarios with five principal components. For both algorithms, convergence is reached when the decrease in deviance between iterations is below the same threshold. On average, Poisson MF takes many more iterations to reach convergence (and thus more computing time) because the extra score parameters for the individual rows allow more incremental improvements in the data fit.

We extended this simulation to fill in missing observations on the test data, which would require additional computation for matrix factorization techniques. In contrast, our method can easily apply the same loadings learned from the training data to the test data via matrix multiplication. The results on the test data are nearly the same as those above and are therefore not included for brevity. The extent of overfitting by Poisson MF is reduced in testing because the loadings have been fixed.
We highlight the difference in testing time between Poisson PCA and Poisson MF for 1000 records in Table 2, when they were executed on a Windows machine with an Intel® Xeon® processor (2.13 GHz) and 128 GB memory. The testing time for Poisson PCA is nearly constant, because it only involves matrix multiplication, while it is much larger and more varied for Poisson MF, because it requires additional optimization for the scores.

To see the effect of model misspecification and over-dispersion as a special form, we altered this experiment to simulate from the zero-inflated Poisson distribution. Using the setting of $P(X = 0) = 0.5$ and $E(X \mid X > 0) = 3$, we added the probability that any element of the matrix could be zero as a factor and varied this zero-inflation probability from 0 (the original simulation setup) to 0.8, which has the same sparsity level as the $P(X = 0) = 0.9$ setting, albeit with a different data generating mechanism. The results in Figure 3 show that the rank with lowest MSE gets smaller for higher degrees of sparsity, which was also seen in Figure 2, possibly due to lower signal-to-noise ratio in the sparse settings. Poisson PCA performs uniformly better than PCA when the zero-inflation parameter is less than 0.6 and is typically much better than Poisson MF. While Poisson PCA maintains relatively good performance in this experiment, comparison between the setting of $P(X = 0) = 0.9$ and $E(X \mid X > 0) = 3$ without zero inflation and the comparable setting with zero-inflation probability of 0.8 indicates that the lowest MSE increases from 0.80 to 1.44, and model misspecification results in an 80% increase in median MSE. Note that these two settings closely match the characteristics of the real application in Section 5 in terms of sparsity level and non-zero mean value.

Finally, we investigated how our method performs when $d \gg n$ by simulating from the original experimental setup with $P(X = 0) = 0.9$, $E(X \mid X > 0) = 3$, $n = 100$, and true rank of 5, but increasing $d$ from 50 to 1000. Figure 4 shows that, counterintuitively, all methods get better with increased dimension, especially Poisson MF, although Poisson PCA typically performs the best. This seeming blessing of dimensionality is due to the fact that it becomes easier to identify the latent clusters with additional variables in this setting, as they arise from the same probabilistic structure and each dimension adds information on clustering. The signal-to-noise ratio measured in terms of the relative magnitude of leading eigenvalues for the underlying principal components increases with dimension.



5 Million Song Dataset Analysis

To show the benefits of generalized PCA in data analysis, we analyze the Million Song Dataset (MSD) (Bertin-Mahieux et al., 2011). MSD contains the listening history of 110 thousand users on one million songs. For each user/song pair, the dataset lists the number of times that the user has listened to the song. Most user/song pairs have a count of zero because users only listen to a small subset of the songs. These data can be organized into a sparse "user-song frequency" matrix. Since the relationship between the songs in terms of the users' preferences is of primary interest, the songs are taken as variables and the users are taken as cases. Generalized PCA is performed on two subsets of the users and songs to show the benefits of the method for visualization and recommendation. Unlike with the simulations, we included a main effect in our analyses of MSD.

5.1 Visualizing the Loadings

The first subset of the MSD that we analyzed is the listening history on the songs from the ten most popular artists. We selected the 200 most popular songs by these artists and the 1200 most active listeners of those songs. There is quite a bit of dispersion in the count matrix created from these users and songs: 93% of the entries in the matrix are 0, the mean of the non-zero entries is 2.8, and the maximum count is 391.

We applied Poisson PCA to the data and examined the proportion of the deviance explained by principal components first. Figure 5 displays the cumulative and marginal percentage of Poisson deviance explained by the first 25 components on this dataset. In the marginal plot, which is analogous to the scree plot in standard PCA, there is a big drop-off in the amount of deviance explained after both the second and third components. Two or three components would thus be justifiable to account for the major variation in the data. For visualization of the songs in a low-dimensional latent space, we proceeded with the first two components, which explain 32% of the deviance.
In addition to Poisson PCA, we analyzed the count matrix using standard PCA on the counts, and non-negative matrix factorization (NNMF) with Poisson deviance. Since Poisson PCA projects the log of the counts, we also compared Poisson PCA to standard PCA on the log of the counts. The log of zero is undefined, so we numerically approximate it by $-m$ with a large positive $m$ for both Poisson PCA and standard PCA on the log-transformed data. For each method, $m$ was varied from 0.1 to 15 and was chosen by cross-validation. For Poisson PCA, the optimal $m$ was 8.0 and for standard PCA, it was 1.2.
The loadings for the three PCAs are in Figure 6. If a user listens to one song
from an artist, it is likely that he or she will listen to other songs by the same
artist. Therefore, we expect that songs by the same artists would be close in
the latent space. For standard PCA on the count data, which is displayed on
the top-left panel of Figure 6, the first component is dominated by Florence +
the Machine’s songs because the two largest observations in the data were
for songs by this artist. Songs by Justin Bieber dominate the second principal
component direction because two songs by the artist were among the top ten
most popular songs or variables.

In comparison to PCA on the count data, loadings from PCA on the log of the count data are easier to interpret, as displayed on the top-right panel of Figure 6. This shows the improvement due to the log transformation of the data. Songs by The Black Keys and Jack Johnson are distinguished from the rest of the artists. These two artists have the most songs in the data. However, songs by the other eight artists are clumped together.
The bottom-left panel of Figure 6 has the loadings from Poisson PCA. This plot shows the improvement from both the log transformation and the optimized projection with respect to Poisson deviance, which is more natural for count data. Instead of only two dominant groups of songs corresponding to artists, there are now at least five distinguishable clusters of songs which correspond to artists. Further, there are two clusters for the Black Keys, one of which, in the first quadrant, corresponds to songs from a recently released album. The loadings from Poisson MF (not shown in the figure) were qualitatively similar to Poisson PCA.


The bottom-right panel has the loadings from NNMF, which are quite different from the others due to the non-negativity constraints. While the non-negativity induces sparsity, it becomes more difficult to differentiate the artists with the loadings.

Next, we computed the loadings for Poisson PCA with normalization factors,
as displayed in Figure 7. The loadings of normalized PCA for the count or log
counts are not included because they were qualitatively very similar to the
loadings from the log of the counts. The left plot in Figure 7 shows the un-
normalized Poisson PCA loadings and the right plot shows the normalized
Poisson PCA loadings. The area of each point is proportional to the number of
times the songs were played and the colors represent the same artists as in
Figure 6.

A notable difference between the sets of loadings is that songs by Florence + the Machine, which had very high play counts, are furthest away from the origin in the un-normalized version, but close to the origin in the normalized loadings. Also, the songs from the Black Keys' new album are more distinctly clustered in the normalized version. Finally, in the un-normalized version, songs by artists cluster in ray shapes from the origin, while they cluster more tightly further away from the origin in the normalized version.

These differences can be explained by the normalization factors. The songs furthest from the origin in the un-normalized version are the songs that were played the most times. For the songs in this dataset, the Poisson null deviance was in general larger for songs with higher means. Therefore, the songs with the smallest average play counts (especially those less than 1) were effectively given higher weights in the optimization when normalization is added. This caused the ray shapes to turn into tighter clusters and the songs by Florence + the Machine to have less influence.

5.2 Recommending New Songs to Users


In this second example, we consider recommending users songs that they have not listened to before. This is a common task in recommender systems, where it is desirable to expose the users to items they have not interacted with before and that they may be interested in. The recommendations we make are based on the rankings of songs, with $\hat\theta_{ij}$ from PCA as estimated preference scores for each user.

We started with the 500 most popular songs and only included the songs that
were listened to by at least 20 users and the users that listened to at least 30
of the songs. This resulted in 1040 users and 281 songs. We randomly chose
80% of the users for training and 20% for testing. We set five of the positive
play counts to zero for each user in the test set to use them as true positives
in ranking. These five songs represent songs the user is interested in, but that
they have not yet been exposed to.

We then applied the PC loadings and main effects learned from the training data to the test data to get a low-dimensional representation of each user's listens and used it to rank the songs with zero plays, comparing how high the songs that were actually non-zero ranked using AUC. Finally, we averaged the AUC from each user's prediction to get an overall AUC, which can range from 0 to 1, where 1 corresponds to perfect ranking and 0.5 corresponds to random guessing.
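A sketch of this evaluation (Python/NumPy and SciPy; `U` and `mu` are the loadings and main effects learned on training users, `X_test` holds the test users' play counts with the five held-out songs zeroed, and `held_out` is an assumed boolean mask marking those songs). The per-user AUC is computed in its Mann-Whitney form directly from the ranks of the estimated preference scores.

```python
import numpy as np
from scipy.stats import rankdata

def user_auc(scores, positives):
    """AUC of a ranking: probability that a held-out positive song outranks a true zero."""
    ranks = rankdata(scores)                         # average ranks handle ties
    n_pos = positives.sum()
    n_neg = positives.size - n_pos
    return (ranks[positives].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def recommend_auc(X_test, held_out, U, mu, m=5.0):
    """Score new users by matrix multiplication only, then average the per-user AUC."""
    X = np.asarray(X_test, dtype=float)
    theta_tilde = np.where(X > 0, np.log(np.maximum(X, 1e-12)), -m)
    theta_hat = mu + (theta_tilde - mu) @ U @ U.T    # estimated preference scores
    aucs = []
    for i in range(X.shape[0]):
        zero_cols = X[i] == 0                        # candidate songs, including held-out ones
        aucs.append(user_auc(theta_hat[i, zero_cols], held_out[i, zero_cols]))
    return float(np.mean(aucs))
```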

We have set up the experiment in a situation where typical matrix factorization methods may be computationally costly to use. For our method, computing the low-dimensional representation for new users only requires matrix multiplication, while it requires fitting a generalized linear model for each user with the matrix factorization-based methods.
We compared standard PCA to Poisson PCA and multinomial PCA (which is developed in more detail in the supplementary material) using the count data directly in Figure 8. For all of the PCA methods, we varied the number of principal components from 2 to 50 and chose $m$ by cross-validation. For reference, we measured the AUC of the rank order of the songs using the overall popularity of each song for every user, which ranks the songs based on the average play count of the songs. This popularity ranking is the same as the ranking that all of the PCA methods would give with main effects but zero principal components.

Standard PCA with the play count data performed very poorly – even worse
than the popularity model with no personalization – regardless of the number
of principal components. Poisson PCA and multinomial PCA both did much
better, with multinomial PCA consistently having the highest AUC. The
multinomial assumption may be more appropriate for the data because a user
has a finite amount of time to listen to songs and must pick between them.
Also, the multinomial deviance implicitly gives higher weight to play counts of
0 coming from highly active users than from the users who had low total play
counts. The supplementary material contains an additional analysis of this
data, which further showcases generalized PCA on binary and weighted data.

6 Conclusion

In this paper, we have introduced a generalization of PCA, focusing on binary
and count data, which preserves many of the desirable features of standard
PCA in contrast to other generalizations. Notably, principal component scores
are a linear combination of the natural parameters from the saturated model,
and computing the scores or the low-rank estimate on new data does not
require any extra optimization. The formulation is further extended to include
weights, missing data, and variable normalization, which can improve the
interpretability of the loadings and the quality of the recommendations, as we
have shown in numerical examples.

There are many ways that the proposed formulation can be extended further.
When there are more variables than cases, regularization of the loadings may
be helpful to improve estimation of a low dimensional structure or its
interpretability. As an alternative to the descriptive approach to selection of
the number of principal components presented in this paper, inferential
approaches can be taken by casting the problem as hypothesis testing (Choi
et al., 2017). Finally, when there is a response variable of interest and
dimensionality reduction is considered for covariates as in principal
component regression, our formulation can be extended for supervised
dimensionality reduction with covariates of various types.

Supplementary Material

The online supplementary material includes (i) derivation of the first-order
optimality conditions (2.2-2.4); (ii) Poisson distribution properties; (iii)
formulation of multinomial PCA; (iv) proof for the MM algorithm; (v)
examination of the stability of the algorithm with random initialization; (vi)
additional analysis of the song dataset.

Acknowledgments

The authors thank the editor, associate editor and referees for their helpful
comments.

Funding

This research was supported in part by National Science Foundation grants
DMS-12-09194 and DMS-15-13566.

References

Bartholomew, D. J. and M. Knott (1999). Latent variable models and factor
analysis. Wiley.

Bertin-Mahieux, T., D. P. Ellis, B. Whitman, and P. Lamere (2011). The million
song dataset. In Proceedings of the 12th International Conference on Music
Information Retrieval (ISMIR 2011).

Buntine, W. (2002). Variational extensions to EM and multinomial PCA. In
European Conference on Machine Learning, pp. 23–34. Springer.

Choi, Y., J. Taylor, and R. Tibshirani (2017). Selecting the number of principal
components: Estimation of the true rank of a noisy matrix. The Annals of
Statistics 45(6), 2590–2617.

Collins, M., S. Dasgupta, and R. E. Schapire (2001). A generalization of
principal components analysis to the exponential family. In T. Dietterich,
S. Becker, and Z. Ghahramani (Eds.), Advances in Neural Information
Processing Systems 14, pp. 617–624.

Feuerverger, A., Y. He, and S. Khatri (2012). Statistical significance of the
Netflix challenge. Statistical Science 27(2), 202–231.

Gordon, G. J. (2002). Generalized² linear² models. In S. Becker, S. Thrun,
and K. Obermayer (Eds.), Advances in Neural Information Processing
Systems 15, pp. 593–600. MIT Press.

Hinton, G. E. and R. R. Salakhutdinov (2006). Reducing the dimensionality of
data with neural networks. Science 313(5786), 504–507.

Landgraf, A. J. and Y. Lee (2015). Dimensionality reduction for binary data
through the projection of natural parameters. arXiv preprint arXiv:1510.06112.

Lee, D. D. and H. S. Seung (1999). Learning the parts of objects by non-
negative matrix factorization. Nature 401(6755), 788–791.

Lee, S., J. Z. Huang, and J. Hu (2010). Sparse logistic principal components
analysis for binary data. The Annals of Applied Statistics 4(3), 1579–1601.

Li, J. and D. Tao (2010). Simple exponential family PCA. In International
Conference on Artificial Intelligence and Statistics, pp. 453–460.

Liu, L. T., E. Dobriban, and A. Singer (2018). ePCA: High dimensional
exponential family PCA. The Annals of Applied Statistics 12(4), 2121–2150.

McCullagh, P. and J. A. Nelder (1989). Generalized linear models (Second
edition). London: Chapman & Hall.

Mohamed, S., Z. Ghahramani, and K. A. Heller (2009). Bayesian exponential
family PCA. In Advances in Neural Information Processing Systems,
pp. 1089–1096.

Pan, V. Y. and Z. Q. Chen (1999). The complexity of the matrix eigenproblem.
In Proceedings of the Thirty-First Annual ACM Symposium on Theory of
Computing, pp. 507–516. ACM.

Pearson, K. (1901). On lines and planes of closest fit to systems of points in
space. The London, Edinburgh, and Dublin Philosophical Magazine and
Journal of Science 2(11), 559–572.

R Core Team (2016). R: A Language and Environment for Statistical
Computing. Vienna, Austria: R Foundation for Statistical Computing.

Salakhutdinov, R., A. Mnih, and G. Hinton (2007). Restricted Boltzmann
machines for collaborative filtering. In Proceedings of the 24th International
Conference on Machine Learning, pp. 791–798. ACM.

Schein, A. I., L. K. Saul, and L. H. Ungar (2003). A generalized linear model
for principal component analysis of binary data. In Proceedings of the Ninth
International Workshop on Artificial Intelligence and Statistics, Volume 38.

Singh, A. P. and G. J. Gordon (2008). A unified view of matrix factorization
models. In Machine Learning and Knowledge Discovery in Databases,
pp. 358–373. Springer.

Tipping, M. E. (1998). Probabilistic visualisation of high-dimensional binary
data. In M. Kearns, S. Solla, and D. Cohn (Eds.), Advances in Neural
Information Processing Systems 11, pp. 592–598.

Tipping, M. E. and C. M. Bishop (1999). Probabilistic principal component
analysis. Journal of the Royal Statistical Society: Series B (Statistical
Methodology) 61(3), 611–622.

Udell, M., C. Horn, R. Zadeh, and S. Boyd (2016). Generalized low rank
models. Foundations and Trends in Machine Learning 9(1), 1–118.

Vaida, F. (2005). Parameter convergence for EM and MM algorithms.
Statistica Sinica, 831–840.

Welling, M., M. Rosen-Zvi, and G. E. Hinton (2005). Exponential family
harmoniums with an application to information retrieval. In Advances in Neural
Information Processing Systems, pp. 1481–1488.

Wen, Z. and W. Yin (2013). A feasible method for optimization with
orthogonality constraints. Mathematical Programming 142(1-2), 397–434.

Fig. 1 Projection of two-dimensional count data using standard PCA and
Poisson PCA

Fig. 2 Mean squared error of matrix completion using standard PCA, Poisson
PCA, and Poisson MF on simulated count datasets. The median MSE was
taken over the replications. Each panel represents a different combination of
the proportion of zeros and the mean of the counts greater than zero. The
dashed horizontal lines represent the mean squared errors when using
column means as predictions.

Fig. 3 Mean squared error of matrix completion using standard PCA, Poisson
PCA, and Poisson MF on simulated count datasets from the setting of
$P(X = 0) = 0.5$ and $E(X \mid X > 0) = 3$ but with zero inflation. Each panel
represents a different zero-inflation probability. The median MSE was taken
over 200 replications. The dashed horizontal lines represent the mean
squared errors when using column means as predictions.

Fig. 4 Mean squared error of matrix completion using standard PCA, Poisson
PCA, and Poisson MF on simulated count datasets from the setting of
$P(X = 0) = 0.9$ and $E(X \mid X > 0) = 3$ but with increasing dimensionality. Each
panel represents a different data dimension. The median MSE was taken over
50 replications. The dashed horizontal lines represent the mean squared
errors when using column means as predictions.

Fig. 5 Cumulative and marginal percent of the deviance explained with
Poisson PCA on the subset of the MSD with the ten most popular artists

Fig. 6 The first two principal component loadings of the million song dataset
using standard PCA on the count data, standard PCA on the log of the count
data, Poisson PCA, and non-negative matrix factorization

Fig. 7 The first two principal component loadings of the million song dataset
using Poisson PCA without normalization (left plot) and with normalization
(right plot). The area of each point is proportional to the number of times the
song was played

Fig. 8 Comparison of different generalized PCA methods for recommending
songs to users with the play count data

Table 1 Normalization factors for some exponential family distributions

Distribution   Normalization Factor ($\tau_j$)
Gaussian       $\sum_i (x_{ij} - \bar{X}_j)^2 / n$
Bernoulli      $-2\left(\bar{X}_j \log(\bar{X}_j) + (1 - \bar{X}_j)\log(1 - \bar{X}_j)\right)$
Poisson        $-2\left(\bar{X}_j \log(\bar{X}_j) - (1/n)\sum_i x_{ij}\log(x_{ij})\right)$

Table 2 Mean number of iterations to train the algorithms and mean time in
seconds to estimate the scores for 1000 test records, assuming a Poisson
distribution and using five principal components. Poisson PCA refers to our
generalized PCA and Poisson MF refers to exponential family PCA. Standard
error of the mean is in parentheses.

Scenario                     Number of Iterations              Testing Time (seconds)
P(X = 0)   E(X | X > 0)      Poisson PCA    Poisson MF         Poisson PCA      Poisson MF
0.25       3                 12.0 (0.19)    268.0 (10.43)      0.017 (0.0011)   6.1 (0.17)
0.5        9                 301.8 (8.72)   1792.3 (54.77)     0.017 (0.0009)   53.4 (3.44)
0.9        15                178.5 (8.16)   3851.8 (147.50)    0.015 (0.0007)   176.6 (6.23)
