Data Clustering Introduction

In this section, we summarize some traditional, general-purpose data clustering algorithms. These algorithms serve as building blocks for various subspace clustering algorithms. The objective of data clustering is to group data points into clusters such that points within each cluster are more related to each other than points across different clusters. The relationship can be measured in various ways: distance between points, similarity of points, etc. In distance-based clustering, we group the points into \(K\) clusters such that the distances among points within the same cluster are significantly smaller than those between clusters. In similarity-based clustering, points within the same cluster are more similar to each other than points from different clusters. Graph-based clustering treats each point as a node in a graph [with appropriate edges] and splits the graph into connected components. Compare this with subspace clustering, which assumes that the points in each cluster are sampled from one subspace [even though they may be far apart within that subspace]. A concrete sketch of the graph-based view is given below.
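As an illustration of the graph-based view, the following minimal Python sketch builds an \(\epsilon\)-neighborhood graph over the data points and returns its connected components as clusters. The threshold `eps` and the Euclidean metric are illustrative assumptions, not prescribed by the text.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def graph_clustering(X, eps):
    """Cluster the rows of X by splitting an epsilon-neighborhood
    graph into connected components (illustrative sketch)."""
    D = squareform(pdist(X))   # pairwise Euclidean distances
    A = (D <= eps) & (D > 0)   # edge between distinct points closer than eps
    n_clusters, labels = connected_components(csr_matrix(A), directed=False)
    return n_clusters, labels
```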

The simplest distance measure is the standard Euclidean distance. However, it is susceptible to the choice of basis. This can be improved by adopting a statistical model for the data in each cluster. We assume that the data in the \(k\)-th cluster are sampled from a probability distribution with mean \(\mu_k\) and covariance \(\Sigma_k\). An appropriate distance measure from the mean of a distribution, invariant to the choice of basis, is the Mahalanobis distance:

\[d^2 (x_s, \mu_k) = \| x_s - \mu_k\|_{\Sigma_k}^2 = (x_s - \mu_k)^T \Sigma_k^{-1}(x_s - \mu_k).\]

For Gaussian distributions, this is, up to an additive constant, proportional to the negative log-likelihood of a sample point. A simple way to measure the similarity between two points is the absolute value of their inner product. Alternatively, one can look at the angle between two points, or the inner product of the normalized points. Another way to measure similarity is to take the inverse of an appropriate distance measure.
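A minimal sketch of the Mahalanobis distance computation, assuming \(\Sigma_k\) is positive definite. Solving a linear system avoids forming \(\Sigma_k^{-1}\) explicitly.

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu),
    assuming Sigma is positive definite."""
    d = x - mu
    z = np.linalg.solve(Sigma, d)   # z = Sigma^{-1} (x - mu)
    return float(d @ z)
```

Note that for any invertible change of basis \(B\), transforming the data to \(Bx\) and the covariance to \(B \Sigma_k B^T\) leaves this quantity unchanged, which is the basis invariance mentioned above.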

Measurement of Clustering Performance

In general, a clustering \(\CCC\) of a set \(Y\) constructed by a clustering algorithm is a set \(\{\CCC_1, \dots, \CCC_C\}\) of non-empty, disjoint subsets of \(Y\) whose union equals \(Y\). Clearly, \(|\CCC_c| > 0\) for every \(c\).

The clustering process may identify an incorrect number of clusters, so \(C\) may not equal \(K\). Moreover, even if \(K = C\), vectors may be placed in the wrong clusters. Ideally, we want \(K = C\) and \(\CCC_c = Y_k\) under a bijective mapping between \(1 \leq c \leq C\) and \(1 \leq k \leq K\). In practice, a clustering algorithm estimates the number of clusters \(C\) and assigns a label \(l_s\), \(1 \leq l_s \leq C\), to each vector \(y_s\), \(1 \leq s \leq S\). All the labels can be collected in a label vector \(L \in \{1, \dots, C\}^S\). The permutation matrix \(\Gamma\) can easily be obtained from \(L\).

Following [WW07], we briefly establish the measures used in this work to evaluate clustering performance in synthetic experiments. We have a reference clustering of the vectors in \(Y\), given by \(\BBB = \{Y_1, \dots, Y_K\}\), which is known to us in advance (either by construction in synthetic experiments, or as ground truth for real-life data-sets). The clustering obtained by the algorithm is given by \(\CCC = \{\CCC_1, \dots, \CCC_C\}\). For two arbitrary vectors \(y_i, y_j \in Y\), there are four possibilities: a) they belong to the same cluster in both \(\BBB\) and \(\CCC\) (true positive); b) they are in the same cluster in \(\BBB\) but different clusters in \(\CCC\) (false negative); c) they are in different clusters in \(\BBB\) but the same cluster in \(\CCC\) (false positive); d) they are in different clusters in both \(\BBB\) and \(\CCC\) (true negative).

Consider clusters \(Y_i \in \BBB\) and \(\CCC_j \in \CCC\). The elements common to \(Y_i\) and \(\CCC_j\) are given by \(Y_i \cap \CCC_j\). We define \(\text{precision}_{ij} \triangleq \frac{|Y_i \cap \CCC_j|}{|\CCC_j|}\), and the overall precision for \(\CCC_j\) as \(\text{precision}(\CCC_j) \triangleq \underset{i}{\max}\,(\text{precision}_{ij})\). Similarly, we define \(\text{recall}_{ij} \triangleq \frac{|Y_i \cap \CCC_j|}{|Y_i|}\), and the overall recall for \(Y_i\) as \(\text{recall}(Y_i) \triangleq \underset{j}{\max}\,(\text{recall}_{ij})\). We define the \(F\)-score as \(F_{ij} \triangleq \frac{2\, \text{precision}_{ij}\, \text{recall}_{ij}}{\text{precision}_{ij} + \text{recall}_{ij}}\), and the overall \(F\)-score for \(Y_i\) as \(F(Y_i) \triangleq \underset{j}{\max}\,(F_{ij})\). We note that the cluster \(\CCC_j\) for which the maximum is achieved is the best matching cluster for \(Y_i\). Finally, we define the overall \(F\)-score for the clustering as \(F(\BBB, \CCC) \triangleq \frac{1}{S}\sum_{i=1}^{K} |Y_i| F(Y_i)\), where \(S\) is the total number of vectors in \(Y\). We also define the clustering ratio \(\eta \triangleq \frac{C}{K}\).
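The following sketch transcribes these definitions directly, assuming each clustering is represented as a list of sets of point indices (the representation is an assumption made for illustration).

```python
def overall_f_score(ref, est):
    """Overall F-score F(B, C) between a reference clustering `ref`
    and an estimated clustering `est`, each a list of sets of indices."""
    S = sum(len(Y_i) for Y_i in ref)       # total number of vectors
    total = 0.0
    for Y_i in ref:
        best_f = 0.0                       # F(Y_i) = max_j F_ij
        for C_j in est:
            common = len(Y_i & C_j)        # |Y_i intersect C_j|
            if common == 0:
                continue
            precision = common / len(C_j)  # precision_ij
            recall = common / len(Y_i)     # recall_ij
            f = 2 * precision * recall / (precision + recall)
            best_f = max(best_f, f)
        total += len(Y_i) * best_f
    return total / S
```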

There are different ways to define clustering error. For the special case where the number of clusters is known in advance, and we ensure that the data-set is divided into exactly that many clusters, the subspace clustering error can be defined as follows:

\[\text{subspace clustering error} = \frac{\text{\# of misclassified points}} {\text{total \# of points}}.\]

This definition is adopted from [EV13] so that the results in this work can be compared with theirs. It can be used after a proper one-to-one mapping between the original labels and the cluster labels assigned by the clustering algorithm has been identified. We can compute this mapping by comparing \(F\)-scores.
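A sketch of this procedure is shown below: each estimated cluster is mapped to the reference cluster with which it attains the highest \(F\)-score, and the misclassification rate is computed under that mapping. The greedy matching here is an illustrative simplification; a strictly one-to-one assignment would require, e.g., the Hungarian algorithm.

```python
import numpy as np

def subspace_clustering_error(true_labels, est_labels):
    """Fraction of misclassified points after mapping each estimated
    cluster to its best matching reference cluster by F-score."""
    true_labels = np.asarray(true_labels)
    est_labels = np.asarray(est_labels)
    mapping = {}
    for c in np.unique(est_labels):
        in_c = est_labels == c
        best_k, best_f = None, -1.0
        for k in np.unique(true_labels):
            in_k = true_labels == k
            common = np.sum(in_c & in_k)
            if common == 0:
                continue
            precision = common / np.sum(in_c)
            recall = common / np.sum(in_k)
            f = 2 * precision * recall / (precision + recall)
            if f > best_f:
                best_k, best_f = k, f
        mapping[c] = best_k            # best matching reference label
    mapped = np.array([mapping[c] for c in est_labels])
    return float(np.mean(mapped != true_labels))
```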