Expectation-Maximization for K-subspaces

The EM method can also be adapted to the problem of fitting subspaces. We need to assume a statistical mixture model for the dataset.

We assume that the dataset \(Y\) is sampled from a mixture of \(K\) component distributions, where each component is centered around a subspace. A latent (hidden) discrete random variable \(z \in \{1, \dots, K \}\) picks the component distribution from which a sample \(y\) is drawn. Let the \(k\)-th component be centered around the subspace \(\UUU_k\) with an orthonormal basis \(Q_k \in \RR^{M \times D_k}\). Then, we can write

\[y = Q_k \alpha + B_k \beta\]

where \(B_k \in \RR^{M \times (M - D_k)}\) is an orthonormal basis for the orthogonal complement \(\UUU_k^{\perp}\), \(Q_k \alpha\) is the component of \(y\) lying perfectly in \(\UUU_k\), and \(B_k \beta\) is the component lying in \(\UUU_k^{\perp}\), representing the error of projecting \(y\) onto the subspace. We will assume that both \(\alpha\) and \(\beta\) are sampled from multivariate isotropic normal distributions, i.e. \(\alpha \sim \NNN(0, \sigma'^2_{k} I)\) and \(\beta \sim \NNN(0, \sigma^2_{k} I)\). Assuming that \(\alpha\) and \(\beta\) are independent, the inverse of the covariance matrix of \(y\) is given by

\[\Sigma_k^{-1} = \sigma'^{-2}_k Q_k Q_k^T + \sigma^{-2}_k B_k B_k^T.\]
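
To see this, note that with \(\alpha\) and \(\beta\) independent and zero-mean,

\[\Sigma_k = \mathbb{E}[y y^T] = Q_k \mathbb{E}[\alpha \alpha^T] Q_k^T + B_k \mathbb{E}[\beta \beta^T] B_k^T = \sigma'^2_k Q_k Q_k^T + \sigma^2_k B_k B_k^T,\]

and since the columns of \(Q_k\) and \(B_k\) together form an orthonormal basis of \(\RR^M\) (so that \(Q_k^T B_k = 0\) and \(Q_k Q_k^T + B_k B_k^T = I\)), multiplying this expression by the stated \(\Sigma_k^{-1}\) indeed gives the identity.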

Since \(y\) is expected to be very close to the subspace \(\UUU_k\), we have \(\sigma^2_k \ll \sigma'^2_k\). In the limit \(\sigma'^2_k \to \infty\), we have \(\Sigma_k^{-1} \to \sigma^{-2}_k B_k B_k^T\). In other words, \(y\) is distributed uniformly along the subspace, and its location inside the subspace (given by \(Q_k \alpha\)) is not important to us. All we care about is that \(y\) should belong to one of the subspaces \(\UUU_k\), with the projection error \(B_k \beta\) being small and normally distributed.

The component distributions are therefore:

\[f(y | z = k) = \frac{1}{(2 \pi \sigma_k^2)^{(M - D_k)/2}} \exp \left ( - \frac{y^T B_k B_k^T y}{2 \sigma_k^2}\right ).\]
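
To make the formula concrete, the following NumPy sketch evaluates this log-density for a single point. The function name and argument layout are illustrative only; it assumes \(B_k\) is stored as an \(M \times (M - D_k)\) array with orthonormal columns.

```python
import numpy as np

def component_log_density(y, B_k, sigma2_k):
    """Log of f(y | z = k) for the k-th subspace component.

    y        : (M,) data point
    B_k      : (M, M - D_k) orthonormal basis of the orthogonal complement of U_k
    sigma2_k : variance of the off-subspace (projection error) component
    """
    m_minus_d = B_k.shape[1]          # M - D_k
    residual = B_k.T @ y              # coordinates of y in the complement of U_k
    return (-0.5 * m_minus_d * np.log(2 * np.pi * sigma2_k)
            - residual @ residual / (2 * sigma2_k))
```

Working in log space avoids numerical underflow when \(\sigma_k^2\) is small.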

\(z\) is multinomially distributed with \(p (z = k) = \pi_k\). The parameter set for this model is then \(\theta = \{\pi_k, B_k, \sigma_k \}_{k=1}^K\), which is unknown and needs to be estimated from the dataset \(Y\). The marginal distribution \(f(y| \theta)\) and the incomplete likelihood function \(l(Y | \theta)\) can be derived just as in the EM method described earlier. We again introduce auxiliary variables \(w_{sk}\) and convert the ML estimation problem into an iterative estimation problem.
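
Spelled out, the marginal takes the usual mixture form

\[f(y | \theta) = \sum_{k=1}^K \pi_k f(y | z = k),\]

and (assuming the samples are drawn independently) the incomplete likelihood of the dataset is the product \(\prod_{s=1}^S f(y_s | \theta)\), which is maximized through its logarithm.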

Estimates for \(\hat{w}_{sk}\) in the E-step remain the same.
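
That is, using the current parameter estimates,

\[\hat{w}_{sk} = \frac{\hat{\pi}_k f(y_s | z = k)}{\sum_{l=1}^K \hat{\pi}_l f(y_s | z = l)},\]

with the component densities \(f(y_s | z = k)\) evaluated as above.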

Estimates of the parameters in \(\theta\) in the M-step are computed as follows. We compute the weighted sample covariance matrix for the \(k\)-th cluster as

\[\hat{\Sigma}_k = \sum_{s=1}^S w_{sk} y_s y_s^T.\]

The columns of \(\hat{B}_k\) are the eigenvectors associated with the smallest \(M - D_k\) eigenvalues of \(\hat{\Sigma}_k\). \(\pi_k\) and \(\sigma_k\) are estimated as follows:

\[\hat{\pi}_k = \frac{\sum_{s=1}^S w_{sk}}{S}.\]
\[\hat{\sigma}^2_k = \frac{\sum_{s=1}^S w_{sk} \| \hat{B}_k^T y_s \|_2^2 } {(M - D_k) \sum_{s=1}^S w_{sk}}.\]
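
Putting the E-step and M-step together, here is a minimal NumPy sketch of one possible implementation. The function name em_k_subspaces, the argument layout, the random initialization, and the simplifying assumption that all \(K\) subspaces share the same dimension \(D\) are ours, not part of the text; a practical implementation would also add a convergence check.

```python
import numpy as np

def em_k_subspaces(Y, K, D, n_iter=100, seed=0):
    """One possible EM loop for K-subspaces (all subspaces of dimension D).

    Y : (M, S) data matrix with one sample per column.
    Returns mixing weights pi, complement bases B of shape (K, M, M - D),
    noise variances sigma2, and the final responsibilities W of shape (S, K).
    """
    rng = np.random.default_rng(seed)
    M, S = Y.shape
    C = M - D                                      # codimension of each subspace

    # --- initialization: uniform weights, unit variances, random complement bases ---
    pi = np.full(K, 1.0 / K)
    sigma2 = np.full(K, 1.0)
    B = np.empty((K, M, C))
    for k in range(K):
        A = rng.standard_normal((M, C))
        B[k], _ = np.linalg.qr(A)                  # random orthonormal complement basis

    for _ in range(n_iter):
        # --- E-step: responsibilities w_{sk} ---
        log_resp = np.empty((S, K))
        for k in range(K):
            res2 = np.sum((B[k].T @ Y) ** 2, axis=0)          # ||B_k^T y_s||^2 per sample
            log_resp[:, k] = (np.log(pi[k])
                              - 0.5 * C * np.log(2 * np.pi * sigma2[k])
                              - res2 / (2 * sigma2[k]))
        log_resp -= log_resp.max(axis=1, keepdims=True)        # numerical stability
        W = np.exp(log_resp)
        W /= W.sum(axis=1, keepdims=True)

        # --- M-step: update pi_k, B_k, sigma2_k ---
        for k in range(K):
            w = W[:, k]
            pi[k] = w.sum() / S
            Sigma_k = (Y * w) @ Y.T                            # weighted sample covariance
            evals, evecs = np.linalg.eigh(Sigma_k)             # eigenvalues in ascending order
            B[k] = evecs[:, :C]                                # smallest M - D eigenvectors
            res2 = np.sum((B[k].T @ Y) ** 2, axis=0)
            sigma2[k] = (w @ res2) / (C * w.sum()) + 1e-12     # small floor avoids division by zero

    return pi, B, sigma2, W
```

Working with log-responsibilities and subtracting the per-sample maximum before exponentiating keeps the E-step numerically stable when \(\sigma_k^2\) is small.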

The primary conceptual difference between \(K\)-subspaces and the EM algorithm is this: at each iteration, \(K\)-subspaces gives a definite assignment of every point to one of the subspaces, while EM views the membership as a random variable and uses its expected value \(w_{sk}\) to give a “probabilistic” (soft) assignment of a data point to a subspace.

Both of these algorithms require the number of subspaces and the dimension of each subspace as input, and both depend on a good initialization of the subspaces to converge to an optimal solution.