logisticPCA
logisticPCA is an R package for dimensionality reduction of binary data. Three methods are implemented: exponential family PCA (logistic SVD), logistic PCA, and convex logistic PCA.
We assume that there are n observations of d-dimensional binary data, which can be represented as an n × d matrix, X. If we assume that each element, xij, is Bernoulli with probability pij, the natural parameter, θij, is the logit of the probability $$ \theta_{ij} = \log \frac{p_{ij}}{1 - p_{ij}}. $$
Collins et al. (2001) proposed exponential family PCA to extend PCA to binary and other types of data. For binary data, they assume that the logit of the probability matrix can be written as a matrix factorization, logit(P) = 1nμT + ABT, where A and B are of a lower rank, k, and μ is a d-dimensional vector of main effects.
Due to the matrix factorization representation, we will refer to this formulation as logistic SVD below.
Logistic PCA extends Pearson (1901)’s initial formulation of principal component analysis. Pearson’s formulation seeks to find a rank-k projection of the data which is as close to the original data as possible, in terms of mean squared error. That is, it minimizes $$ \frac{1}{n} \sum_{i = 1}^n \| (\mathbf{x}_i - \boldsymbol{\mu}) - \mathbf{U}\mathbf{U}^T (\mathbf{x}_i - \boldsymbol{\mu}) \|^2 $$ over μ and the d × k orthonormal matrix U.
We re-interpret this and use generalized linear model theory. If X is distributed Gaussian, the natural parameter is E(X) and the natural parameter from the saturated model is X itself. Pearson’s formulation can be interpreted as projecting the natural parameters from the saturated model (X) to minimize the Gaussian deviance (squared error).
To extend PCA to binary data, we need to instead project the natural parameters from the Bernoulli saturated model and minimize the Bernoulli deviance. If X is distributed Bernoulli, the natural parameter from the saturated model is ∞ if X = 1 and −∞ if X = 0. To make this computationally feasible, we use a large number ±m instead of ±∞.
Finally, letting $\tilde{\boldsymbol{\theta}}_i$ be the d-dimensional vector of natural parameters from the saturated model, the natural parameters are estimated by
$$ \hat{\boldsymbol{\theta}}_i = \boldsymbol{\mu} + \mathbf{U}\mathbf{U}^T (\tilde{\boldsymbol{\theta}}_i - \boldsymbol{\mu}) $$
and μ and U are chosen to minimize the Bernoulli deviance,
$$ D(\mathbf{X} | \hat{\boldsymbol{\Theta}}) = \sum_{i = 1}^n \sum_{j = 1}^d -2 x_{ij} \hat{\theta}_{ij} + 2 \log( 1 + \exp(\hat{\theta}_{ij}) ). $$
The main difference between logistic PCA and exponential family PCA is how the principal component scores are represented. Exponential family PCA solves for the PC scores A, whereas in logistic PCA (and standard PCA) the PC scores are linear combinations of the natural parameters from the saturated model. The PC scores for the ith observation are $$\mathbf{U}^T (\tilde{\boldsymbol{\theta}}_i - \boldsymbol{\mu}). $$
This gives logistic PCA several benefits over exponential family PCA: there is no matrix of scores A to estimate, so the number of parameters does not grow with the number of observations, and, as shown below, principal component scores for new observations can be computed quickly from the fitted μ and U.
Convex logistic PCA is formulated the same way as logistic PCA above except for one difference. Instead of minimizing over rank-k projection matrices, UUT, we minimize over the convex hull of rank-k projection matrices, referred to as the Fantope.
The convex relaxation is not guaranteed to give low-rank solutions, so it may not be appropriate if interpretability is strongly desired. However, since the problem is convex, it can also be solved more quickly and reliably than the formulation with a projection matrix.
To show how it works, we use a binary dataset of how members of the US Congress voted on different bills in 1984. It also includes the political party of each member of Congress, which we will use for validation. Use help(house_votes84) to see the source of the data.
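Before fitting anything, load the package and the votes data. A minimal sketch, assuming the dataset is accessible via data() and attaching ggplot2 because the plot methods used later build on ggplot2:

library(logisticPCA)
library(ggplot2)  # the plot methods below add ggplot2 layers
data("house_votes84")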
The three formulations described above are implemented in the functions `logisticSVD`, `logisticPCA`, and `convexLogisticPCA`. They return S3 objects of classes `lsvd`, `lpca`, and `clpca`, respectively. `logisticSVD` returns `mu`, `A`, and `B`; `logisticPCA` returns `mu` and `U`; and `convexLogisticPCA` returns `mu` and `H`, the d × d Fantope matrix. All of them take a binary data matrix as the first argument (which can include missing data) and the rank of the approximation, `k`, as the second argument. Additionally, for `logisticPCA` and `convexLogisticPCA`, it is necessary to specify `m`, which is used to approximate the natural parameters from the saturated model. Larger values of `m` give fitted probabilities closer to 0 or 1, and smaller values give fitted probabilities closer to 0.5. The functions `cv.lpca` and `cv.clpca` perform row-wise cross-validation to help select `m`.
This information is summarized in the table below. The returned objects that are in parentheses are derived from other parameters.
| Formulation | Function | Class | Returns | Specify m? |
|---|---|---|---|---|
| Exponential Family PCA | `logisticSVD` | `lsvd` | `mu`, `A`, `B` | No |
| Logistic PCA | `logisticPCA` | `lpca` | `mu`, `U`, (`PCs`) | Yes |
| Convex Logistic PCA | `convexLogisticPCA` | `clpca` | `mu`, `H`, (`U`, `PCs`) | Yes |
For each of the formulations, we will fit the parameters assuming a two-dimensional latent representation. To estimate A and B, use `logisticSVD`. For this and the other formulations, printing the returned object gives a summary of the fit.
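A sketch of the call, using the object name `logsvd_model` that the plotting code later in this document expects; the binary data matrix is the first argument and `k` is the rank:

logsvd_model = logisticSVD(house_votes84, k = 2)
logsvd_model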
## 435 rows and 16 columns
## Rank 2 solution
##
## 63.6% of deviance explained
## 549 iterations to converge
For logistic PCA, we first want to decide which `m` to use, using cross-validation. We assume `k = 2` and try values of `m` from 1 to 10.
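A sketch of this step, assuming `cv.lpca` takes the candidate ranks and values of `m` through `ks` and `ms` arguments and that its result can be plotted directly (the warnings shown below appear to come from that plotting step):

logpca_cv = cv.lpca(house_votes84, ks = 2, ms = 1:10)
plot(logpca_cv)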
## Warning in type.convert.default(colnames(x)): 'as.is' should be specified by
## the caller; using TRUE
## Warning in type.convert.default(rownames(x)): 'as.is' should be specified by
## the caller; using TRUE
It looks like the optimal `m` is 5, which we can use to fit with all the data. We will also use the same `m` for the convex formulation.
logpca_model = logisticPCA(house_votes84, k = 2, m = which.min(logpca_cv))
clogpca_model = convexLogisticPCA(house_votes84, k = 2, m = which.min(logpca_cv))
Each of the formulations has a plot method to make it easier to see the results of the fit and assess convergence. There are three options for the `type` of plot. The first is `type = "trace"`, which plots the deviance as a function of iteration. For logistic PCA and logistic SVD, the deviance should decrease at each iteration, but this is not necessarily the case for convex logistic PCA. For example, convex logistic PCA converged in 12 iterations, whereas logistic SVD took 549 iterations to converge.
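Trace plots can be produced directly from the fitted objects, for example:

plot(clogpca_model, type = "trace")
plot(logsvd_model, type = "trace")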
With these formulations, each member of congress is approximated in a two-dimensional latent space. Below, we look at the PC scores for the congressmen, colored by their political party. All three formulations do a good job of separating the political parties based on voting record alone.
party = rownames(house_votes84)
plot(logsvd_model, type = "scores") + geom_point(aes(colour = party)) +
ggtitle("Exponential Family PCA") + scale_colour_manual(values = c("blue", "red"))
plot(logpca_model, type = "scores") + geom_point(aes(colour = party)) +
ggtitle("Logistic PCA") + scale_colour_manual(values = c("blue", "red"))
plot(clogpca_model, type = "scores") + geom_point(aes(colour = party)) +
ggtitle("Convex Logistic PCA") + scale_colour_manual(values = c("blue", "red"))
For convex logistic PCA, we do not necessarily get a two-dimensional space from the Fantope matrix `H`. However, we use the first k eigenvectors of `H` as an estimate of `U`.
One can also examine the latent space of the variables by using `type = "loadings"`.
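For instance, with the logistic PCA fit:

plot(logpca_model, type = "loadings")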
The `fitted` function provides fitted values of either probabilities or natural parameters.
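For example, a sketch of requesting fitted probabilities from the logistic PCA model; the `type = "response"` value follows the usual GLM convention and is an assumption here, and its output is shown below:

head(fitted(logpca_model, type = "response"))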
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.1575019 0.9075000 0.09229721 0.959218468 0.9986316 0.9982951 0.019043928
## [2,] 0.1437479 0.8202556 0.07417629 0.973745875 0.9986520 0.9980874 0.024706395
## [3,] 0.3432904 0.9997140 0.51135413 0.254949046 0.9967767 0.9988433 0.004112127
## [4,] 0.5858706 0.9998458 0.92339249 0.008711673 0.9159592 0.9903378 0.029769976
## [5,] 0.4657483 0.9995484 0.79074232 0.052783059 0.9742113 0.9949505 0.019372131
## [6,] 0.3798881 0.9904405 0.63830758 0.206453209 0.9679469 0.9903685 0.064734180
## [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,] 0.01432709 0.011477307 0.4792791 0.3316243 0.91157574 0.9602252 0.9995889
## [2,] 0.01455433 0.013016539 0.5094768 0.2002938 0.93258935 0.9588701 0.9997380
## [3,] 0.02379132 0.007081943 0.2611630 0.9913548 0.39722843 0.9567721 0.9714899
## [4,] 0.29774883 0.055012835 0.2193926 0.9979127 0.05401718 0.8224117 0.3859787
## [5,] 0.13154573 0.030119049 0.2589835 0.9925267 0.16240789 0.8877036 0.8173755
## [6,] 0.17096413 0.058714933 0.3591397 0.9028277 0.34250135 0.8571757 0.9521942
## [,15] [,16]
## [1,] 0.04969826 0.4417728
## [2,] 0.04567344 0.4533561
## [3,] 0.12396075 0.5084981
## [4,] 0.38413493 0.9433503
## [5,] 0.24711454 0.8636518
## [6,] 0.20613266 0.9066621
This could be useful when some of the binary observations are missing, as many are in this dataset. The fitted probabilities give an estimate of the true values.
Finally, suppose that after fitting the data, the voting record of a new congressman appears. The `predict` function provides predicted probabilities or natural parameters for that new congressman, based on the previously fit model and the new data. In addition, there is an option to predict the PC scores on the new data. This may be useful if the low-dimensional scores are inputs to some other model. For example, let’s make up five fake congressmen.
d = ncol(house_votes84)
votes_fake = matrix(sample(c(0, 1), 5 * d, replace = TRUE), 5, d,
dimnames = list(NULL, colnames(house_votes84)))
Now, we can use the models we fit before to estimate the PC scores of these congressmen in the low-dimensional space. One advantage of logistic PCA is that it is very quick to estimate these scores on new observations, whereas logistic SVD must solve for `A` on the new data.
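A sketch of scoring the new data with the logistic PCA fit; `type = "PCs"` is an assumed argument value, matching the `PCs` entry in the table above, and the resulting scores are shown below:

predict(logpca_model, votes_fake, type = "PCs")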
## [,1] [,2]
## [1,] -0.5317935 -2.517244
## [2,] 0.6056201 -1.114874
## [3,] 1.0517947 4.381758
## [4,] 2.8183681 8.680464
## [5,] 7.0402739 1.246301
Since logistic PCA requires iteratively computing an eigendecomposition, it can be slow for large data. When `k` is small relative to the number of columns, however, the eigendecomposition can be sped up by only solving for the first `k` eigenvectors. This can be done for all three formulations using the `partial_decomp` argument, which we have implemented with the RSpectra package. Be careful when using this argument, because it can substantially slow down computation if the number of columns is small (e.g. ~20) or if `k` is of comparable size to the number of columns. An example of the speed-up is below.
set.seed(33)
nrow = 100
ncol = 500
sim_dat = matrix(sample(c(0, 1), nrow * ncol, replace = TRUE), nrow, ncol)
ptm1 <- proc.time()
logpca_slow = logisticPCA(sim_dat, k = 2, m = 4, partial_decomp = FALSE)
elapsed1 = proc.time() - ptm1
ptm2 <- proc.time()
logpca_fast = logisticPCA(sim_dat, k = 2, m = 4, partial_decomp = TRUE)
elapsed2 = proc.time() - ptm2
Without the partial eigendecomposition, the fit took 2.7 seconds to run; with it, the fit took 0.7 seconds.
We can also compare the results of the two versions to make sure they match. The code below confirms that the projection matrices are the same. Note that the ordering of the columns in the `U` matrix does not matter, which is why we compare the projection matrices UU^T rather than `U` itself.
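A sketch of that check, assuming both fitted objects store the loadings in the `U` component:

all.equal(logpca_slow$U %*% t(logpca_slow$U),
          logpca_fast$U %*% t(logpca_fast$U))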
## [1] TRUE