Spectral Clustering

Spectral clustering is a graph-based clustering algorithm [VL07]. It operates on the similarity graph \(\GGG = \{T, W\}\) to obtain the clustering \(\CCC\) of \(X\). More specifically, the following steps are performed. The degree of a vertex \(t_s \in T\) is defined as \(d_s = \sum_{j = 1}^S w_{s j}\). The degree matrix \(D\) is defined as the diagonal matrix with the degrees \(\{ d_s \}_{s =1 }^S\). The unnormalized graph Laplacian is defined as \(\LLL = D - W\). The normalized graph Laplacian is defined as \(\LLL_{\text{rw}} \triangleq D^{-1} \LLL = I - D^{-1} W\) (we specifically use the random walk version of the normalized graph Laplacian as defined in [VL07]; there are other ways to define a normalized graph Laplacian). The subscript \(\text{rw}\) stands for random walk.

We compute \(\LLL_{\text{rw}}\) and examine its eigen-structure to estimate the number of clusters \(C\) and the label vector \(L\). If \(C\) is known in advance, usually the first \(C\) eigenvectors of \(\LLL_{\text{rw}}\) corresponding to the smallest eigenvalues are taken and their rows are clustered using the K-means algorithm [SM00]. Since we do not make any assumption on the number of clusters, we need to estimate it. A simple way is to use the eigengap statistic. After arranging the eigenvalues in increasing order, we choose the number \(C\) such that the eigenvalues \(\lambda_1, \dots, \lambda_C\) are very small and \(\lambda_{C + 1}\) is comparatively large. This is guided by the theoretical result that if a graph has \(C\) connected components then exactly \(C\) eigenvalues of \(\LLL_{\text{rw}}\) are 0. However, when the subspaces are not clearly separated and noise is introduced, this approach becomes unreliable. We therefore adopt a more robust approach based on analyzing the eigenvectors, as described in [ZMP04]. The approach of [ZMP04], originally formulated for a slightly different definition of the graph Laplacian \((D^{-1/2} W D^{-1/2})\) [NJW+02], has been adapted to work with the Laplacian \(\LLL_{\text{rw}}\) defined above.
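The construction of \(\LLL_{\text{rw}}\) and the eigengap heuristic can be summarized in a few lines. The following is a minimal NumPy sketch; the function names (build_laplacian, estimate_c_by_eigengap) are illustrative and not part of any library, and it assumes every vertex has a nonzero degree.

```python
import numpy as np

def build_laplacian(W):
    """Random-walk normalized Laplacian L_rw = I - D^{-1} W."""
    d = W.sum(axis=1)                 # vertex degrees d_s
    D_inv = np.diag(1.0 / d)          # assumes every degree is nonzero
    return np.eye(W.shape[0]) - D_inv @ W

def estimate_c_by_eigengap(L_rw):
    """Estimate the number of clusters as the index of the largest eigengap."""
    # L_rw is not symmetric in general, but it is similar to the symmetric
    # normalized Laplacian, so its eigenvalues are real; we sort the real parts.
    eigvals = np.sort(np.real(np.linalg.eigvals(L_rw)))
    gaps = np.diff(eigvals)           # lambda_{k+1} - lambda_k
    return int(np.argmax(gaps)) + 1   # C such that lambda_1, ..., lambda_C are small
```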

Robust estimation of the number of clusters

In step 6, we estimate the number of clusters from the graph Laplacian. It can be easily shown that \(0\) is an eigenvalue of \(\LLL_{\text{rw}}\) with the eigenvector \(\OneVec_S\) [VL07]. Further, the multiplicity of the eigenvalue 0 equals the number of connected components in \(\GGG\). In fact, the adjacency matrix can be factored as

\[\begin{split}W = \begin{bmatrix} W_1 & \dots & 0\\ \vdots & \ddots & \vdots \\ 0 & \dots & W_P \end{bmatrix} \Gamma\end{split}\]

where \(W_p \in \RR^{S_p \times S_p}\) is the adjacency matrix for the \(p\)-th connected component of \(\GGG\) corresponding to the subspace \(\UUU^p\) and \(\Gamma\) is the unknown permutation matrix. The graph Laplacian for each \(W_p\) has an eigenvalue \(0\) with the eigenvector \(\OneVec_{S_p}\). Thus, if we look at the \(P\)-dimensional eigenspace of \(\LLL_{\text{rw}}\) corresponding to the eigenvalue \(0\), then there exists a basis \(\widehat{V} \in \RR^{S \times P}\) such that each row of \(\widehat{V}\) is a unit vector in \(\RR^P\) and the columns contain \(S_1, \dots, S_P\) ones. The actual eigenvectors obtained through any numerical method will be a rotated version of \(\widehat{V}\) given by \(V = \widehat{V} R\). [ZMP04] suggests a cost function over the entries of \(V\) such that the cost is minimized when the rows of \(V\) are close to coordinate vectors. It then estimates a rotation matrix, expressed as a product of Givens rotations, which rotates \(V\) to minimize the cost. The parameters of the rotation matrix are the angles of the Givens rotations, which are estimated through a gradient descent process. Since \(P\) is unknown, the algorithm is run over multiple values of \(C\), and we choose the value which gives the minimum cost. Note that we reuse the rotated version of \(V\) obtained for a particular value of \(C\) when we go on to examine \(C+1\) eigenvectors. This may appear ad hoc, but it is seen to help the gradient descent converge faster in the next iteration.
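To make the cost concrete, here is a simplified sketch of the alignment cost from [ZMP04] together with a single Givens rotation; the full method descends over all \(C(C-1)/2\) Givens angles, which is omitted here. The function names are illustrative only.

```python
import numpy as np

def alignment_cost(Z):
    """Cost of [ZMP04]: sum over rows of Z_ij^2 / max_j Z_ij^2.
    The minimum value S is attained when every row is axis-aligned."""
    M = np.max(np.abs(Z), axis=1, keepdims=True)   # largest-magnitude entry per row
    return np.sum((Z / M) ** 2)

def givens_rotation(C, i, j, theta):
    """C x C Givens rotation by angle theta in the (i, j) plane."""
    G = np.eye(C)
    G[i, i] = G[j, j] = np.cos(theta)
    G[i, j] = -np.sin(theta)
    G[j, i] = np.sin(theta)
    return G

# For eigenvectors V (S x C), one step of the search evaluates
# alignment_cost(V @ givens_rotation(C, i, j, theta)) for candidate angles.
```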

When \(S\) is small, we can compute a full eigendecomposition of \(\LLL_{\text{rw}}\) to get the eigenvectors. However, this is time consuming when \(S\) is large (say 1000+). An important question is how many eigenvectors we really need to examine. As \(C\) increases, the number of Givens rotation parameters grows as \(C(C-1)/2\). Thus, if we examine too many eigenvectors, we lose speed unnecessarily. We can use the eigengap statistic described above to decide how many eigenvectors to examine.
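For large \(S\), a sketch along the following lines computes only the first few eigenvectors. It works with the symmetric Laplacian \(D^{-1/2} \LLL D^{-1/2}\), which shares its eigenvalues with \(\LLL_{\text{rw}}\); the eigenvectors of \(\LLL_{\text{rw}}\) are recovered by premultiplying with \(D^{-1/2}\). The use of SciPy's eigsh and the parameter k_max (which could come from the eigengap statistic) are assumptions of this sketch, not a prescription.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags, identity
from scipy.sparse.linalg import eigsh

def smallest_eigenvectors(W, k_max):
    """Return the k_max smallest eigenvalues and corresponding eigenvectors of L_rw."""
    W = csr_matrix(W)
    d = np.asarray(W.sum(axis=1)).ravel()          # vertex degrees
    D_inv_sqrt = diags(1.0 / np.sqrt(d))
    L_sym = identity(W.shape[0]) - D_inv_sqrt @ W @ D_inv_sqrt
    # 'SA' requests the smallest algebraic eigenvalues; shift-invert (sigma=0)
    # can be more robust in practice since the smallest eigenvalue is near 0.
    eigvals, eigvecs = eigsh(L_sym, k=k_max, which='SA')
    return eigvals, D_inv_sqrt @ eigvecs           # eigenvectors of L_rw
```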

Finally, we assign a label to each data point to identify the cluster it belongs to. As described above, we maintain the rotated version of \(V\) during the estimation of the rotation matrix. Once we have zeroed in on the right value of \(C\), assigning a label to \(x^s\) is straightforward. We simply perform non-maximum suppression on the rows of \(V\), i.e., we keep the largest (in magnitude) entry in each row of \(V\) and set the rest to zero. The column index of the largest entry in the \(s\)-th row of \(V\) is the label \(l_s\) for \(x^s\). This completes the clustering process.
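This label assignment step reduces to a row-wise argmax; a minimal sketch, assuming the rotated eigenvector matrix is available as a NumPy array:

```python
import numpy as np

def assign_labels(V_rotated):
    """Non-maximum suppression on rows: the label l_s is the column index
    of the largest-magnitude entry in the s-th row."""
    return np.argmax(np.abs(V_rotated), axis=1)
```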

While the eigengap-statistic-based estimation of the number of clusters is quick, it requires running an additional K-means step on the first \(C\) eigenvectors to assign the labels. In contrast, the eigenvector-based estimation of the number of clusters is more involved and slower, but it allows us to pick the labels very quickly.