# Clustering

So far we focused on ML methods that use the ERM principle and lean a hypothesis by minimizing the discrepancy between its predictions and the true labels on a training set. These methods are referred to as supervised methods as they require labeled data points for which the true label values have been determined by some human (who serves as a “supervisor”). This and the following chapter discus ML methods which do not require to know the label of any data point. These methods are often referred to as “unsupervised” since they do not require a “supervisor” to provide the label values for any data point.

One important family of unsupervised ML methods aim at clustering a given set of data points such as those depicted in Figure fig_scatterplot_clustering. The basic idea of clustering is to decompose a set of data points into few subsets or clusters that consist of similar data points. For the dataset in Figure fig_scatterplot_clustering it seems reasonable to define two clusters, one cluster $\big\{ \featurevec^{(1)}, \featurevec^{(5)}, \featurevec^{(6)}, \featurevec^{(7)} \big\}$ and a second cluster $\big\{ \featurevec^{(2)}, \featurevec^{(3)}, \featurevec^{(4)}, \featurevec^{(8)} \big\}$ .

Formally, clustering methods learn a hypothesis that assign each data point either to precisely one cluster (see Section Hard Clustering with kmeans) or several clusters with different degrees of belonging (see Section Soft Clustering with Gaussian Mixture Models ). Different clustering methods use different measures for the similarity between data points. For data points characterized by (numeric) Euclidean feature vectors, the similarity between data points can be naturally defined in terms of the Euclidean distance between feature vectors. Section Connectivity-based Clustering discusses clustering methods that use notions of similarity that are not based on a Euclidean space.

There is a strong conceptual link between clustering methods and the classification methods discussed in Chapter ch_some_examples. Both type of methods learn a hypothesis that reads in the features of a data point and delivers a prediction for some quantity of interest. In classification methods, this quantity of interest is some generic label of a data point. For clustering methods, this quantity of interest for a data point is the cluster assignment (for hard clustering) of the degree of belonging (for soft clustering). A main difference between clustering and classification is that clustering methods do not require the true label (cluster assignment or degree of belonging) of a single data point.

Classification methods learn a good hypothesis via minimizing their average loss incurred on a training set of labeled data points. In contrast, clustering methods do not have access to a single labeled data point. To find the correct labels (cluster assignments) clustering methods rely solely on the intrinsic geometry of the data points. We will see that clustering methods use this intrinsic geometry to define an empirical risk incurred by a candidate hypothesis. Like classification methods, also clustering methods use an instance of the ERM principle (see Chapter Empirical Risk Minimization) to find a good hypothesis (clustering).

This chapter discusses two main flavours of clustering methods:

Hard clustering methods learn a hypothesis $h$ that reads in the feature vector $\featurevec$ of a data point and delivers a predicted cluster assignment $\hat{\truelabel}=h(\featurevec) \in \{1,\ldots,\nrcluster\}$. Thus, assigns each data point to one single cluster. Section Hard Clustering with kmeans will discuss one of the most widely-used hard clustering algorithms which is known as k-means.

In contrast to hard clustering methods, soft clustering methods assign each data point to several clusters with varying degree of belonging. These methods learn a hypothesis that delivers a vector $\hat{\labelvec}=\big(\hat{\truelabel}_{1},\ldots,\hat{\truelabel}_{\nrcluster}\big)^{T}$ with entry $\hat{\truelabel}_{\clusteridx} \in [0,1]$ being the predicted degree by which the data point belongs to the $\clusteridx$-th cluster. Hard clustering is an extreme case of soft clustering where we enforce each degree of belonging to take only values in $\{0,1\}$. Moreover, hard clustering requires that for each data point only of the corresponding degree of belonging (one for each cluster) is non-zero.

The main focus of this chapter is on methods that require data points being represented by numeric feature vectors (see Sections Hard Clustering with kmeans and Soft Clustering with Gaussian Mixture Models ). These methods define the similarity between data points using the Euclidean distance between their feature vectors. Some applications generate data points for which it is not obvious how to obtain numeric feature vectors such that their Euclidean distances reflect the similarity between data points. It is then desirable to use a more flexible notion of similarity which does not require to determine (useful) numeric feature vectors of data points.

Maybe the most fundamental concept to represent similarities between data points is a similarity graph. The nodes of the similarity graph are the individual data points of a dataset. Similar data points are connected by edges (links) that might be assigned some weight that quantities the amount of similarity. Section Connectivity-based Clustering discusses clustering methods that use a graph to represent similarities between data points.

## Hard Clustering with k-means

Consider a dataset $\dataset$ which consists of $\samplesize$ data points that are indexed by $\sampleidx=1,\ldots,\samplesize$. The data points are characterized via their numeric feature vectors $\featurevec^{(\sampleidx)} \in \mathbb{R}^{\featuredim}$, for $\sampleidx=1,\ldots,\samplesize$. It will be convenient for the following discussion if we identify a data point with its feature vector. In particular, we refer by $\featurevec^{(\sampleidx)}$ to the $\sampleidx$-th data point. hard clustering methods decompose (or cluster) the dataset into a given number $\nrcluster$ of different clusters $\cluster^{(1)},\ldots,\cluster^{(\nrcluster)}$. These methods assign each data point $\featurevec^{(\sampleidx)}$ to one and only one cluster $\cluster^{(\clusteridx)}$ with the cluster index $\clusteridx \in \{1,\ldots,\nrcluster\}$.

Let us define for each data point its label $\truelabel^{(\sampleidx)} \in \{1,\ldots,\nrcluster\}$ as the index of the cluster to which the $\sampleidx$th data point actually belongs to. The $\clusteridx$-th cluster consists of all data points with $\truelabel^{(\sampleidx)}=\clusteridx$,

[$] \cluster^{(\clusteridx)} \defeq \big\{ \sampleidx \in \{1,\ldots,\samplesize\} : \truelabel^{(\sampleidx)} = \clusteridx \big\}.[$]

We can interpret hard clustering methods as ML methods that compute predictions $\hat{\truelabel}^{(\sampleidx)}$ for the (“correct”) cluster assignments $\truelabel^{(\sampleidx)}$. The predicted cluster assignments result in the predicted clusters

[$]$$\label{equ_def_individual_cluster} \widehat{\cluster}^{(\clusteridx)} \defeq \big\{ \sampleidx \in \{1,\ldots,\samplesize\} : \hat{\truelabel}^{(\sampleidx)} = \clusteridx \big\}\mbox{, for } \clusteridx =1,\ldots,\nrcluster.$$[$]

We now discuss a hard clustering method which is known as k-means. This method does not require the knowledge of the label or (true) cluster assignment $\truelabel^{(\sampleidx)}$ for any data point in $\dataset$. This method computes predicted cluster assignments $\hat{\truelabel}^{(\sampleidx)}$ based solely from the intrinsic geometry of the feature vectors $\featurevec^{(\sampleidx)} \in \mathbb{R}^{\featuredim}$ for all $\sampleidx=1,\ldots,\samplesize$. Since it does not require any labeled data points, k-means is often referred to as being an unsupervised method. However, note that k-means requires the number $\nrcluster$ of clusters to be given as an input (or hyper-) parameter.

The k-means method represents the $\clusteridx$-th cluster $\widehat{\cluster}^{(\clusteridx)}$ by a representative feature vector $\clustermean^{(\clusteridx)} \in \mathbb{R}^{\featuredim}$. It seems reasonable to assign data points in $\dataset$ to clusters $\widehat{\cluster}^{(\clusteridx)}$ such that they are well concentrated around the cluster representatives $\clustermean^{(\clusteridx)}$. We make this informal requirement precise by defining the clustering error

[$] $$\label{equ_def_emp_risk_kmeans} \emperror \big( \{\clustermean^{(\clusteridx)}\}_{\clusteridx=1}^{\nrcluster},\{\hat{y}^{(\sampleidx)}\}_{\sampleidx=1}^{\samplesize} \mid \dataset \big) =(1/\samplesize) \sum_{\sampleidx=1}^{\samplesize} {\left\|\vx^{(\sampleidx)}-\clustermean^{(\hat{y}^{(\sampleidx)})}\right\|^2}.$$ [$]

Note that the clustering error $\emperror$ \eqref{equ_def_emp_risk_kmeans} depends on both, the cluster assignments $\hat{\truelabel}^{(\sampleidx)}$, which define the cluster equ_def_individual_cluster, and the cluster representatives $\clustermean^{(\clusteridx)}$, for $\clusteridx=1,\ldots,\nrcluster$.

Finding the optimal cluster means $\{\clustermean^{(\clusteridx)}\}_{\clusteridx=1}^{\nrcluster}$ and cluster assignments $\{\hat{\truelabel}^{(\sampleidx)}\}_{\sampleidx=1}^{\samplesize}$ that minimize the clustering error \eqref{equ_def_emp_risk_kmeans} is computationally challenging. The difficulty stems from the fact that the clustering error is a non-convex function of the cluster means and assignments. While jointly optimizing the cluster means and assignments is hard, separately optimizing either the cluster means for given assignments or vice-versa is easy. In what follows, we present simple closed-form solutions for these sub-problems. The $k$-means method simply combines these solutions in an alternating fashion.

It can be shown that for given predictions (cluster assignments) $\hat{\truelabel}^{(\sampleidx)}$, the clustering error \eqref{equ_def_emp_risk_kmeans} is minimized by setting the cluster representatives equal to the cluster means

[$]$$\label{eq_def_mean_optimal} \clustermean^{(\clusteridx)}\defeq \big(1/|\widehat{\cluster}^{(\clusteridx)}| \big) \sum_{\hat{\truelabel}^{(\sampleidx)} = \clusteridx} \featurevec^{(\sampleidx)}.$$[$]

To evaluate \eqref{eq_def_mean_optimal} we need to know the predicted cluster assignments $\hat{\truelabel}^{(\sampleidx)}$. The crux is that the optimal predictions $\hat{\truelabel}^{(\sampleidx)}$, in the sense of minimizing clustering error \eqref{equ_def_emp_risk_kmeans}, depend themselves on the choice for the cluster representatives $\clustermean^{(\clusteridx)}$. In particular, for given cluster representative $\clustermean^{(\clusteridx)}$ with $\clusteridx=1,\ldots,\nrcluster$, the clustering error is minimized by the cluster assignments

[$]$$\label{equ_def_clustera_assgt-nearst_mean} \hat{\truelabel}^{(\sampleidx)} \in \argmin_{\clusteridx \in \{1,\ldots,\nrcluster\}} \big\| \featurevec^{(\sampleidx)} - \clustermean^{(\clusteridx)} \big\|.$$[$]

Here, we denote by $\argmin\limits_{\clusteridx' \in \{1,\ldots,\nrcluster\}} \| \featurevec^{(\sampleidx)} - \clustermean^{(\clusteridx')} \|$ the set of all cluster indices $\clusteridx \in \{1,\ldots,\nrcluster\}$ such that $\| \featurevec^{(\sampleidx)} - \clustermean^{(\clusteridx)} \| = \min_{\clusteridx' \in \{1,\ldots,\nrcluster\}} \| \featurevec^{(\sampleidx)} - \clustermean^{(\clusteridx')} \|$.

Note that \eqref{equ_def_clustera_assgt-nearst_mean} assigns the $\sampleidx$th datapoint to those cluster $\cluster^{(\clusteridx)}$ whose cluster mean $\clustermean^{(\clusteridx)}$ is nearest (in Euclidean distance) to $\featurevec^{(\sampleidx)}$. Thus, if we knew the optimal cluster representatives, we could predict the cluster assignments using \eqref{equ_def_clustera_assgt-nearst_mean}. However, we do not know the optimal cluster representatives unless we have found good predictions for the cluster assignments $\hat{\truelabel}^{(\sampleidx)}$ (see \eqref{eq_def_mean_optimal}).

To recap: We have characterized the optimal choice \eqref{eq_def_mean_optimal} for the cluster representatives for given cluster assignments and the optimal choice \eqref{equ_def_clustera_assgt-nearst_mean} for the cluster assignments for given cluster representatives. It seems natural, starting from some initial guess for the cluster representatives, to alternate between the cluster assignment update \eqref{equ_def_clustera_assgt-nearst_mean} and the update \eqref{eq_def_mean_optimal} for the cluster means. This alternating optimization strategy is illustrated in Figure fig_flow_kmeans and summarized in Algorithm k-means. Note that Algorithm k-means, which is maybe the most basic variant of $k$-means, simply alternates between the two updates \eqref{eq_def_mean_optimal} and \eqref{equ_def_clustera_assgt-nearst_mean} until some stopping criterion is satisfied.

Algorithm k-means requires the specification of the number $\nrcluster$ of clusters and initial choices for the cluster means $\clustermean^{(\clusteridx)}$, for $\clusteridx=1,\ldots,\nrcluster$. Those quantities are hyper-parameters that must be tuned to the specific geometry of the given dataset $\dataset$. This tuning can be based on probabilistic models for the dataset and its cluster structure (see Section equ_prob_models_data and ). Alternatively, if Algorithm k-means is used as pre-processing within an overall supervised ML method (see Chapter The Landscape of ML), the validation error (see Section sec_modsel) of the overall method might guide the choice of the number $\nrcluster$ of clusters.

Choosing Number of Clusters. The choice for the number $\nrcluster$ of clusters typically depends on the role of the clustering method within an overall ML application. If the clustering method serves as a pre-processing for a supervised ML problem, we could try out different values of the number $\nrcluster$ and determine, for each choice $\nrcluster$, the corresponding validation error. We then pick the value of $\nrcluster$ which results in the smallest validation error. If the clustering method is mainly used as a tool for data visualization, we might prefer a small number of clusters. The choice for the number $\nrcluster$ of clusters can also be guided by the so-called “elbow-method”. Here, we run the $k$-means Algorithm k-means for several different choices of $\nrcluster$. For each value of $\nrcluster$, Algorithm k-means delivers a clustering with clustering error

[$]\error^{(\nrcluster)} = \emperror \big(\{\clustermean^{(\clusteridx)}\}_{\clusteridx=1}^{\nrcluster},\{\hat{\truelabel}^{(\sampleidx)}\}_{\sampleidx=1}^{\samplesize} \mid \dataset \big).[$]

K-means Algorithm

Input: dataset $\dataset=\{ \mathbf{x}^{(i)}\}_{i=1}^{n}$; number $\nrcluster$ of clusters; initial cluster means $\clustermean^{(\clusteridx)}$ for $\clusteridx=1,\ldots,\nrcluster$.

1. repeat
2. for each datapoint $\mathbf{x}^{(i)}$, $i\!=\!1,\ldots,n$, do
[$]$$\label{equ_cluster_assign_update} \hat{y}^{(i)} \defeq \argmin\limits_{\clusteridx' \in \{1,\ldots,\nrcluster\}} \| \mathbf{x}^{(i)} - \clustermean^{(\clusteridx')} \| \quad \mbox{ (update cluster assignments)}$$[$]
3. for each cluster $\clusteridx\!=\!1,\ldots,\nrcluster$ do
[$]$$\label{equ_cluster_mean_update} \clustermean^{(\clusteridx)} \defeq \frac{1}{|\{i: \hat{y}^{(i)}= \clusteridx\}|} \sum_{i: \hat{y}^{(i)} = \clusteridx} \mathbf{x}^{(i)} \quad \mbox{ (update cluster means) }$$[$]
4. until stopping criterion is met
5. compute final clustering error $\error^{(\nrcluster)} \defeq (1/n) \sum_{i=1}^{n} {\left\|\mathbf{x}^{(i)}-\clustermean^{(\hat{y}^{(i)})}\right\|^2}$

Output: cluster means $\clustermean^{(\clusteridx)}$, for $\clusteridx=1,\ldots,\nrcluster$, cluster assignments $\hat{y}^{(i)} \in \{1,\ldots,\nrcluster\}$, for $i=1,\ldots,n$, final clustering error $\error^{(\nrcluster)}$

We then plot the minimum empirical error $\error^{(\nrcluster)}$ as a function of the number $\nrcluster$ of clusters. Figure fig_ellbow depicts an example for such a plot which typically starts with a steep decrease for increasing $\nrcluster$ and then flattening out for larger values of $\nrcluster$. Note that for $\nrcluster \geq \samplesize$ we can achieve zero clustering error since each datapoint $\featurevec^{(\sampleidx)}$ can be assigned to a separate cluster $\cluster^{(\clusteridx)}$ whose mean coincides with that datapoint, $\featurevec^{(\sampleidx)} = \clustermean^{(\clusteridx)}$.

Cluster-Means Initialization. We briefly mention some popular strategies for choosing the initial cluster means in Algorithm k-means. One option is to initialize the cluster means with realizations of iid random vectors whose probability distribution is matched to the dataset $\dataset = \{ \vx^{(\sampleidx)} \}_{\sampleidx=1}^{\samplesize}$ (see Section sec_max_iikelihood). For example, we could use a multivariate normal distribution $\mathcal{N}(\vx;\widehat{\clustermean}, \widehat{\clustercov})$ with the sample mean $\widehat{\clustermean} = (1/\samplesize) \sum_{\sampleidx=1}^{\samplesize} \featurevec^{(\sampleidx)}$ and the sample covariance $\widehat{\clustercov} = (1/\samplesize) \sum_{\sampleidx=1}^{\samplesize} (\featurevec^{(\sampleidx)}\!-\!\widehat{\clustermean}) (\featurevec^{(\sampleidx)}\!-\!\widehat{\clustermean})^{T}$. Alternatively, we could choose the initial cluster means $\clustermean^{(\clusteridx)}$ by selecting $\nrcluster$ different data points $\featurevec^{(\sampleidx)}$ from $\dataset$. This selection process might combine random choices with an optimization of the distances between cluster means . Finally, the cluster means might also be chosen by evenly partitioning the principal component of the dataset (see Chapter Feature Learning).

Interpretation as ERM. For a practical implementation of Algorithm k-means we need to decide when to stop updating the cluster means and assignments (see \eqref{equ_cluster_assign_update} and \eqref{equ_cluster_mean_update}). To this end it is useful to interpret Algorithm k-means as a method for iteratively minimizing the clustering error \eqref{equ_def_emp_risk_kmeans}. As can be verified easily, the updates \eqref{equ_cluster_assign_update} and \eqref{equ_cluster_mean_update} always modify (update) the cluster means or assignments in such a way that the clustering error \eqref{equ_def_emp_risk_kmeans} is never increased. Thus, each new iteration of Algorithm k-means results in cluster means and assignments with a smaller (or the same) clustering error compared to the cluster means and assignments obtained after the previous iteration. Algorithm k-means implements a form of ERM (see Chapter Empirical Risk Minimization) using the clustering error \eqref{equ_def_emp_risk_kmeans} as the empirical risk incurred by the predicted cluster assignments $\hat{y}^{(\sampleidx)}$. Note that after completing a full iteration of Algorithm k-means, the cluster means $\big\{ \clustermean^{(\clusteridx)} \big\}_{\clusteridx=1}^{\nrcluster}$ are fully determined by the cluster assignments $\big\{ \hat{y}^{(\sampleidx)} \big\}_{\sampleidx=1}^{\samplesize}$ via \eqref{equ_cluster_mean_update}. It seems natural to terminate Algorithm k-means if the decrease in the clustering error achieved by the most recent iteration is below a prescribed (small) threshold.

Clustering and Classification. There is a strong conceptual link between Algorithm k-means and classification methods (see e.g. Section sec_nearest_neighbour_methods). Both methods essentially learn a hypothesis $h(\featurevec)$ that maps the feature vector $\featurevec$ to a predicted label $\hat{\truelabel}=h(\featurevec)$ from a finite set. The practical meaning of the label values is different for Algorithm k-means and classification methods. For classification methods, the meaning of the label values is essentially defined by the training set (of labeled data points) used for ERM equ_def_ERM_funs. On the other hand, clustering methods use the predicted label $\hat{\truelabel}=h(\featurevec)$ as a cluster index.

Another main difference between Algorithm k-means and most classification methods is the choice for the empirical risk used to evaluate the quality or usefulness of a given hypothesis $h(\cdot)$. Classification methods typically use an average loss over labeled data points in a training set as empirical risk. In contrast, Algorithm k-means uses the clustering error \eqref{equ_def_emp_risk_kmeans} as a form of empirical risk. Consider a hypothesis that resembles the cluster assignments $\hat{\truelabel}^{(\sampleidx)}$ obtained after completing an iteration in Algorithm k-means, $\hat{\truelabel}^{(\sampleidx)} = h\big( \featurevec^{(\sampleidx)} \big)$. Then we can rewrite the resulting clustering error achieved after this iteration as

[$]$$\label{equ_emp_risk_clustering} \emperror\big( h | \dataset \big) = (1/\samplesize) \sum_{\sampleidx=1}^{\samplesize} \left\|\featurevec^{(\sampleidx)}- \frac{\sum_{\sampleidx' \in \dataset^{(\sampleidx)}} \featurevec^{(\sampleidx')}}{\big| \dataset^{(\sampleidx)} \big|} \right\|^2. \mbox{ with } \dataset^{(\sampleidx)} \defeq \big\{ \sampleidx' : h\big(\featurevec^{(\sampleidx)} \big) = h\big(\featurevec^{(\sampleidx')} \big) \big\}.$$[$]

Note that the $\sampleidx$-th summand in \eqref{equ_emp_risk_clustering} depends on the entire dataset $\dataset$ and not only on (the features of) the $\sampleidx$-th data point $\featurevec^{(\sampleidx)}$.

Some Practicalities. For a practical implementation of Algorithm k-means we need to fix three issues.

Issue Fix
Tie-breaking We need to specify what to do if several different cluster indices $\clusteridx\!\in\!\{1,\ldots,\nrcluster\}$ achieve the minimum value in the cluster assignment update \eqref{equ_cluster_assign_update} during step equ_step_cluster_asst_update.
Empty-cluster The cluster assignment update \eqref{equ_cluster_assign_update} in step equ_step_cluster_update of Algorithm k-means might result in a cluster $\clusteridx$ with no datapoints associated with it, $|\{ i: \hat{y}^{(i)} = \clusteridx \}|=0$. For such a cluster $\clusteridx$, the update \eqref{equ_cluster_mean_update} is not well-defined.
Stopping-criterion We need to specify a criterion used in step equ_def_stop_criterion_kmeans of Algorithm k-means to decide when to stop iterating.

Algorithm k-means II is obtained from Algorithm k-means by fixing those three issues . Step equ_step_cluster_assg_kmeans2 of Algorithm k-means II solves the first issue mentioned above (“tie breaking”), arising when there are several cluster clusters whose means have minimum distance to a data point $\featurevec^{(\sampleidx)}$, by assigning $\vx^{(\sampleidx)}$ to the cluster with smallest cluster index (see \eqref{equ_cluster_assign_update2}).

K-means II

Input: dataset $\dataset=\{ \featurevec^{(\sampleidx)}\}_{\sampleidx=1}^{\samplesize}$; number $\nrcluster$ of clusters; tolerance $\varepsilon \geq 0$; initial cluster means $\big\{ \clustermean^{(\clusteridx)} \big\}_{\clusteridx=1}^{\nrcluster}$

1. Initialize. set iteration counter $\itercntr \defeq 0$; $\error_{0}\defeq 0$
1. repeat
2. for all datapoints $\sampleidx\!=\!1,\ldots,\samplesize$,
[$]$$\label{equ_cluster_assign_update2} \hat{\truelabel}^{(\sampleidx)} \defeq \min \{ \argmin\limits_{\clusteridx' \in \{1,\ldots,\nrcluster\}} \| \featurevec^{(\sampleidx)} - \clustermean^{(\clusteridx')} \| \} \quad \mbox{ (update cluster assignments)}$$[$]
3. for all clusters $\clusteridx\!=\!1,\ldots,\nrcluster$, update the activity indicator
[$]b^{(\clusteridx)} \defeq \begin{cases} 1 & \mbox{ if } |\{\sampleidx: \hat{y}^{(\sampleidx)}= \clusteridx\}| \gt 0 \\ 0 & \mbox{ else.} \end{cases}[$]
for all $\clusteridx\!=\!1,\ldots,\nrcluster$ with $b^{(\clusteridx)}=1$,
[$]$$\label{equ_cluster_mean_update2} \clustermean^{(\clusteridx)} \defeq \frac{1}{|\{ \sampleidx: \hat{\truelabel}^{(\sampleidx)}= \clusteridx\}|} \sum_{\{ \sampleidx: \hat{\truelabel}^{(\sampleidx)} = \clusteridx\}} \featurevec^{(\sampleidx)} \quad \mbox{ (update cluster means) }$$[$]
4. $\itercntr \defeq \itercntr+1$ (increment iteration counter)
5. $\error_{\itercntr} \defeq \emperror \big( \{\clustermean^{(\clusteridx)}\}_{\clusteridx=1}^{\nrcluster},\{\hat{\truelabel}^{(\sampleidx)}\}_{\sampleidx=1}^{\samplesize} \mid \dataset \big)$ (evaluate clustering error \eqref{equ_def_emp_risk_kmeans})
6. $\itercntr \gt 1$ and $\error_{\itercntr\!-\!1} - \error_{\itercntr} \leq \varepsilon$ (check for sufficient decrease in clustering error)
7. $\error^{(\nrcluster)} \defeq (1/\samplesize) \sum_{\sampleidx=1}^{\samplesize} {\left\|\featurevec^{(\sampleidx)}-\clustermean^{(\hat{\truelabel}^{(\sampleidx)})}\right\|^2}$ (compute final clustering error)

Output: cluster assignments $\hat{\truelabel}^{(\sampleidx)}\!\in\!\{1,\ldots,\nrcluster\}$, cluster means $\clustermean^{(\clusteridx)}$, clustering error $\error^{(\nrcluster)}$.

Step equ_def_cluster_indicators of Algorithm k-means II resolves the “empty cluster” issue by computing the variables $b^{(\clusteridx)} \in \{0,1\}$ for $\clusteridx=1,\ldots,\nrcluster$. The variable $b^{(\clusteridx)}$ indicates if the cluster with index $\clusteridx$ is active ($b^{(\clusteridx)}= 1$) or the cluster $\clusteridx$ is inactive ($b^{(\clusteridx)}=0$). The cluster $\clusteridx$ is defined to be inactive if there are no data points assigned to it during the preceding cluster assignment step \eqref{equ_cluster_assign_update2}. The cluster activity indicators $b^{(\clusteridx)}$ allows to restrict the cluster mean updates \eqref{equ_cluster_mean_update2} only to the clusters $\clusteridx$ with at least one data point $\featurevec^{(\sampleidx)}$. To obtain a stopping criterion, step equ_def_clustering_error_kmeans2 Algorithm k-means II monitors the clustering error $\error_{\itercntr}$ incurred by the cluster means and assignments obtained after $\itercntr$ iterations. Algorithm k-means II continues updating cluster assignments \eqref{equ_cluster_assign_update2} and cluster means \eqref{equ_cluster_mean_update2} as long as the decrease is above a given threshold $\varepsilon \geq0$.

For Algorithm k-means II to be useful we must ensure that the stopping criterion is met within a finite number of iterations. In other words, we must ensure that the clustering error decrease can be made arbitrarily small within a sufficiently large (but finite) number of iterations. To this end, it is useful to represent Algorithm k-means II as a fixed-point iteration

[$]$$\label{equ_fixed_point_clustering} \{ \hat{\truelabel}^{(\sampleidx)} \}_{\sampleidx=1}^{\samplesize} \mapsto \mathcal{P} \{ \hat{\truelabel}^{(\sampleidx)} \}_{\samplesize=1}^{\samplesize}.$$[$]

The operator $\mathcal{P}$, which depends on the dataset $\dataset$, reads in a list of cluster assignments and delivers an improved list of cluster assignments aiming at reducing the associated clustering error \eqref{equ_def_emp_risk_kmeans}. Each iteration of Algorithm k-means II updates the cluster assignments $\hat{\truelabel}^{(\sampleidx)}$ by applying the operator $\mathcal{P}$. Representing Algorithm k-means II as a fixed-point iteration \eqref{equ_fixed_point_clustering} allows for an elegant proof of the convergence of Algorithm k-means II within a finite number of iterations (even for $\varepsilon = 0$) .

Figure 1.1 depicts the evolution of the cluster assignments and cluster means during the iterations Algorithm k-means II. Each subplot corresponds to one iteration of Algorithm k-means II and depicts the cluster means before that iteration and the clustering assignments (via the marker symbols) after the corresponding iteration. In particular, the upper left subplot depicts the cluster means before the first iteration (which are the initial cluster means) and the cluster assignments obtained after the first iteration of Algorithm k-means II.

Consider running Algorithm k-means II with tolerance $\varepsilon=0$ (see step step_stop_criterion_kmeans2) such that the iterations are continued until there is no decrease in the clustering error $\error^{(\itercntr)}$ (see step equ_def_clustering_error_kmeans2 of Algorithm k-means II). As discussed above, Algorithm k-means II will terminate after a finite number of iterations. Moreover, for $\varepsilon=0$, the delivered cluster assignments $\big\{ \hat{\truelabel}^{(\sampleidx)} \big\}_{\sampleidx=1}^{\samplesize}$ are fully determined by the delivered clustered means $\big\{ \clustermean^{(\clusteridx)} \big\}_{\clusteridx=1}^{\nrcluster}$,

[$]$$\label{equ_condition_clusterassmean_kmeans2} \hat{\truelabel}^{(\sampleidx)} = \min \{ \argmin\limits_{\clusteridx' \in \{1,\ldots,\nrcluster\}} \| \featurevec^{(\sampleidx)} - \clustermean^{(\clusteridx')} \| \}.$$[$]

Indeed, if \eqref{equ_condition_clusterassmean_kmeans2} does not hold one can show the final iteration $r$ would still decrease the clustering error and the stopping criterion in step step_stop_criterion_kmeans2 would not be met.

If cluster assignments and cluster means satisfy the condition \eqref{equ_condition_clusterassmean_kmeans2}, we can rewrite the clustering error \eqref{equ_def_emp_risk_kmeans} as a function of the cluster means solely,

[$]$$\label{equ_def_clustering_error_means} \emperror\big( \big\{ \clustermean^{(\clusteridx)} \big\}_{\clusteridx=1}^{\nrcluster} | \dataset \big) \defeq (1/\samplesize) \sum_{\sampleidx=1}^{\samplesize} \min\limits_{\clusteridx' \in \{1,\ldots,\nrcluster\}} \| \featurevec^{(\sampleidx)} - \clustermean^{(\clusteridx')} \|^{2} .$$[$]

Even for cluster assignments and cluster means that do not satisfy \eqref{equ_condition_clusterassmean_kmeans2}, we can still use \eqref{equ_def_clustering_error_means} to lower bound the clustering error \eqref{equ_def_emp_risk_kmeans},

[$]\emperror\big( \big\{ \clustermean^{(\clusteridx)} \big\}_{\clusteridx=1}^{\nrcluster} | \dataset \big)\leq \emperror \big( \{\clustermean^{(\clusteridx)}\}_{\clusteridx=1}^{\nrcluster},\{\hat{\truelabel}^{(\sampleidx)}\}_{\sampleidx=1}^{\samplesize} \mid \dataset \big)[$]

.

Algorithm k-means II iteratively improves the cluster means in order to minimize \eqref{equ_def_clustering_error_means}. Ideally, we would like Algorithm k-means II to deliver cluster means that achieve the global minimum of \eqref{equ_def_clustering_error_means} (see Figure fig_emp_risk_k_means). However, for some combination of dataset $\dataset$ and initial cluster means, Algorithm k-means II delivers cluster means that form only a local optimum of $\emperror\big( \big\{ \clustermean^{(\clusteridx)} \big\}_{\clusteridx=1}^{\nrcluster} | \dataset \big)$ which is strictly worse (larger) than its global optimum (see Figure fig_emp_risk_k_means).

The tendency of Algorithm k-means II to get trapped around a local minimum of \eqref{equ_def_clustering_error_means} depends on the initial choice for cluster means. It is therefore useful to repeat Algorithm k-means II several times, with each repetition using a different initial choice for the cluster means. We then pick the cluster assignments $\{ \hat{\truelabel}^{(\sampleidx)} \}_{\sampleidx=1}^{\samplesize}$ obtained for the repetition that resulted in the smallest clustering error $\error^{(\nrcluster)}$ (see step equ_def_step_final_clustering_kmeans2).

## Soft Clustering with Gaussian Mixture Models

Consider a dataset $\dataset=\{\featurevec^{(1)},\ldots,\featurevec^{(\samplesize)}\}$ that we wish to group into a given number of $\nrcluster$ different clusters. The hard clustering methods discussed in Section Hard Clustering with kmeans deliver cluster assignments $\hat{\truelabel}^{(\sampleidx)}$, for $\sampleidx=1,\ldots,\samplesize$. The cluster assignment $\hat{\truelabel}^{(\sampleidx)}$ is the index of the cluster to which the $\sampleidx$th data point $\featurevec^{(\sampleidx)}$ is assigned to. These cluster assignments $\hat{\truelabel}$ provide rather coarse-grained information. Two data points $\featurevec^{(\sampleidx)}, \featurevec^{(\sampleidx')}$ might be assigned to the same cluster $\clusteridx$ although their distances to the cluster mean $\clustermean^{(\clusteridx)}$ might differ significantly. Intuitively, these two data points have a different degree of belonging to the cluster $\clusteridx$.

For some clustering applications it is desirable to quantify the degree by which a data point belongs to a cluster. Soft clustering methods use a continues range, such as the closed interval $[0,1]$, of possible values for the degree of belonging. In contrast, hard clustering methods use only two possible values for the degree of belonging to a specific cluster, either “full belonging” or no “belonging at all”. While hard clustering methods assign a given data point to precisely one cluster, soft clustering methods typically assign a data point to several different clusters with non-zero degree of belonging.

This chapter discusses soft clustering methods that compute, for each data point $\featurevec^{(\sampleidx)}$ in the dataset $\dataset$, a vector $\widehat{\labelvec}^{(\sampleidx)}=\big(\hat{\truelabel}_{1}^{(\sampleidx)},\ldots,\hat{\truelabel}_{\nrcluster}^{(\sampleidx)}\big)^{T}$. We can interpret the entry $\hat{\truelabel}_{\clusteridx}^{(\sampleidx)} \in [0,1]$ as the degree by which the data point $\featurevec^{(\sampleidx)}$ belongs to the cluster $\cluster^{(\clusteridx)}$. For $\hat{\truelabel}_{\clusteridx}^{(\sampleidx)} \approx 1$, we are quite confident in the data point $\featurevec^{(\sampleidx)}$ belonging to cluster $\cluster^{(\clusteridx)}$. In contrast, for $\hat{\truelabel}_{\clusteridx}^{(\sampleidx)} \approx 0$, we are quite confident that the data point $\featurevec^{(\sampleidx)}$ is outside the cluster $\cluster^{(\clusteridx)}$.

A widely used soft clustering method uses a probabilistic model for the data points $\dataset = \{ \featurevec^{(\sampleidx)} \}_{\sampleidx=1}^{\samplesize}$. Within this model, each cluster $\cluster^{(\clusteridx)}$, for $\clusteridx=1,\ldots,\nrcluster$, is represented by a multivariate normal distribution

[$]$$\label{equ_def_mvn_distribution} \mathcal{N} \big(\featurevec ; \clustermean^{(\clusteridx)}, \clustercov^{(\clusteridx)} \big) \!=\! \frac{1}{\sqrt{{\rm det } \{ 2 \pi \clustercov \} }} \exp\big( - (1/2) \big(\featurevec\!-\!\clustermean^{(\clusteridx)} \big)^{T} \big( \clustercov^{(\clusteridx)} \big)^{-1} \big(\featurevec\!-\!\clustermean^{(\clusteridx)}\big) \big).$$[$]

The probability distribution \eqref{equ_def_mvn_distribution} is parametrized by a cluster-specific mean vector $\clustermean^{(\clusteridx)}$ and an (invertible) cluster-specific covariance matrix $\clustercov^{(\clusteridx)}$.[1] Let us interpret a specific data point $\featurevec^{(\sampleidx)}$ as a realization drawn from the probability distribution \eqref{equ_def_mvn_distribution} of a specific cluster $\clusteridx^{(\sampleidx)}$,

[$]$$\label{equ_cond_dist_GMM} \featurevec^{(\sampleidx)} \sim \mathcal{N} \big(\featurevec ; \clustermean^{(\clusteridx)}, \clustercov^{(\clusteridx)} \big) \mbox{ with cluster index } \clusteridx=\clusteridx^{(\sampleidx)}.$$[$]

We can think of $\clusteridx^{(\sampleidx)}$ as the true index of the cluster to which the data point $\featurevec^{(\sampleidx)}$ belongs to. The variable $\clusteridx^{(\sampleidx)}$ selects the cluster distributions \eqref{equ_def_mvn_distribution} from which the feature vector $\featurevec^{(\sampleidx)}$ has been generated (drawn). We will therefore refer to the variable $\clusteridx^{(\sampleidx)}$ as the (true) cluster assignment for the $\sampleidx$th data point. Similar to the feature vectors $\featurevec^{(\sampleidx)}$ we also interpret the cluster assignments $\clusteridx^{(\sampleidx)}$, for $\sampleidx=1,\ldots,\samplesize$ as realizations of iid rvs.

In contrast to the feature vectors $\featurevec^{(\sampleidx)}$, we do not observe (know) the true cluster indices $\clusteridx^{(\sampleidx)}$. After all, the goal of soft clustering is to estimate the cluster indices $\clusteridx^{(\sampleidx)}$. We obtain a soft clustering method by estimating the cluster indices $\clusteridx^{(\sampleidx)}$ based solely on the data points in $\dataset$. To compute these estimates we assume that the (true) cluster indices $\clusteridx^{(\sampleidx)}$ are realizations of iid rvs with the common probability distribution (or probability mass function)

[$]$$\label{equ_def_cluster_prob} p_{\clusteridx} \defeq \prob{\clusteridx^{(\sampleidx)}=\clusteridx} \mbox{ for } \clusteridx=1,\ldots,\nrcluster.$$[$]

The (prior) probabilities $p_{\clusteridx}$, for $\clusteridx=1,\ldots,\nrcluster$, are either assumed known or estimated from data . The choice for the probabilities $p_{\clusteridx}$ could reflect some prior knowledge about different sizes of the clusters. For example, if cluster $\cluster^{(1)}$ is known to be larger than cluster $\cluster^{(2)}$, we might choose the prior probabilities such that $p_{1} \gt p_{2}$.

The probabilistic model given by \eqref{equ_cond_dist_GMM}, \eqref{equ_def_cluster_prob} is referred to as a GMM. This name is quite natural as the common marginal distribution for the feature vectors $\featurevec^{(\sampleidx)}$, for $\sampleidx=1,\ldots,\samplesize$, is a (additive) mixture of multivariate normal (Gaussian) distributions,

[$]$$\label{equ_def_GMM} \prob{\featurevec^{(\sampleidx)}} = \sum_{\clusteridx=1}^{\nrcluster} \underbrace{\prob{\clusteridx^{(\sampleidx)}=\clusteridx}}_{p_{\clusteridx}} \underbrace{\prob{\featurevec^{(\sampleidx)} | \clusteridx^{(\sampleidx)}=\clusteridx}}_{\mathcal{N}(\featurevec^{(\sampleidx)};\clustermean^{(\clusteridx)}, \clustercov^{(\clusteridx)})}.$$[$]

As already mentioned, the cluster assignments $\clusteridx^{(\sampleidx)}$ are hidden (unobserved) rvs. We thus have to infer or estimate these variables from the observed data points $\featurevec^{(\sampleidx)}$ which realizations or iid rvs with the common distribution \eqref{equ_def_GMM}.

The GMM (see \eqref{equ_cond_dist_GMM} and \eqref{equ_def_cluster_prob}) lends naturally to a rigorous definition for the degree $\truelabel^{(\sampleidx)}_{\clusteridx}$ by which data point $\featurevec^{(\sampleidx)}$ belongs to cluster $\clusteridx$.[2] Let us define the label value $\truelabel^{(\sampleidx)}_{\clusteridx}$ as the “a-posteriori” probability of the cluster assignment $\clusteridx^{(\sampleidx)}$ being equal to $\clusteridx \in \{1,\ldots,\nrcluster\}$:

[$]$$\label{equ_def_deg_belonging_prob}\truelabel^{(\sampleidx)}_{\clusteridx} \defeq \prob{ \clusteridx^{(\sampleidx)} = \clusteridx | \dataset}.$$[$]

By their very definition \eqref{equ_def_deg_belonging_prob}, the degrees of belonging $y^{(\sampleidx)}_{\clusteridx}$ always sum to one,

[$]$$\label{equ_dob_sum_to_one} \sum_{\clusteridx=1}^{\nrcluster} \truelabel^{(\sampleidx)}_{\clusteridx}=1 \mbox{ for each } \sampleidx=1,\ldots,\samplesize.$$[$]

We emphasize that we use the conditional cluster probability \eqref{equ_def_deg_belonging_prob}, conditioned on the dataset $\dataset$, for defining the degree of belonging $\truelabel^{(\sampleidx)}_{\clusteridx}$. This is reasonable since the degree of belonging $\truelabel^{(\sampleidx)}_{\clusteridx}$ depends on the overall (cluster) geometry of the dataset $\dataset$.

The definition \eqref{equ_def_deg_belonging_prob} for the label values (degree of belongings) $\truelabel^{(\sampleidx)}_{\clusteridx}$ involves the GMM parameters $\{\clustermean^{(\clusteridx)},\clustercov^{(\clusteridx)},p_{\clusteridx}\}_{\clusteridx=1}^{\nrcluster}$ (see \eqref{equ_def_GMM}). Since we do not know these parameters beforehand we cannot evaluate the conditional probability in \eqref{equ_def_deg_belonging_prob}. A principled approach to solve this problem is to evaluate \eqref{equ_def_deg_belonging_prob} with the true GMM parameters replaced by some estimates $\{\widehat{\clustermean}^{(\clusteridx)},\widehat{\clustercov}^{(\clusteridx)},\hat{p}_{\clusteridx}\}_{\clusteridx=1}^{\nrcluster}$. Plugging in the GMM parameter estimates into \eqref{equ_def_deg_belonging_prob} provides us with predictions $\hat{\truelabel}^{(\sampleidx)}_{\clusteridx}$ for the degrees of belonging. However, to compute the GMM parameter estimates we would have already needed the degrees of belonging $\truelabel^{(\sampleidx)}_{\clusteridx}$. This situation is similar to hard clustering where ultime goals is to jointly optimize cluster means and assignments (see Section Hard Clustering with kmeans).

Similar to the spirit of Algorithm k-means for hard clustering, we solve the above dilemma of soft clustering by an alternating optimization scheme. This scheme, which is illustrated in Figure fig_flow_softclustering, alternates between updating (optimizing) the predicted degrees of belonging (or soft cluster assignments) $\hat{\truelabel}^{(\sampleidx)}_{\clusteridx}$, for $\sampleidx=1,\ldots,\samplesize$ and $\clusteridx=1,\ldots,\nrcluster$, given the current GMM parameter estimates $\{\widehat{\clustermean}^{(\clusteridx)},\widehat{\clustercov}^{(\clusteridx)},\hat{p}_{\clusteridx}\}_{\clusteridx=1}^{\nrcluster}$ and then updating (optimizing) these GMM parameter estimates based on the updated predictions $\hat{\truelabel}^{(\sampleidx)}_{\clusteridx}$. We summarize the resulting soft clustering method in Algorithm Soft-Clustering Algorithm. Each iteration of Algorithm Soft-Clustering Algorithm consists of an update \eqref{equ_update_soft_cluster_assignm} for the degrees of belonging followed by an update (step equ_GMM_update_step) for the GMM parameters.

To analyze Algorithm Soft-Clustering Algorithm it is helpful to interpret (the features of) data points $\featurevec^{(\sampleidx)}$ as realizations of iid rvs distributed according to a GMM \eqref{equ_cond_dist_GMM}-\eqref{equ_def_cluster_prob}. We can then understand Algorithm Soft-Clustering Algorithm as a method for estimating the GMM parameters based on observing realizations drawn from the GMM \eqref{equ_cond_dist_GMM}-\eqref{equ_def_cluster_prob}. We can estimate the parameters of a probability distribution using the maximum likelihood method (see Section sec_max_iikelihood and ). As its name suggests, maximum likelihood methods estimate the GMM parameters by maximizing the probability (density)

[$]$$\label{equ_def_prob_den_GMM} %f^{({\rm GMM})} \big( \{ \clustermean^{(\clusteridx)}, \clustercov^{(\clusteridx)}, p_{\clusteridx} \}_{\clusteridx=1}^{\nrcluster}\big) \defeq p \big(\dataset; \{ \clustermean^{(\clusteridx)}, \clustercov^{(\clusteridx)}, p_{\clusteridx} \}_{\clusteridx=1}^{\nrcluster}\big)$$[$]

of actually observing the data points in the dataset $\dataset$.

It can be shown that Algorithm Soft-Clustering Algorithm is an instance of a generic approximate maximum likelihood technique referred to as expectation maximization EM (see for more details). In particular, each iteration of Algorithm Soft-Clustering Algorithm updates the GMM parameter estimates such that the corresponding probability density \eqref{equ_def_prob_den_GMM} does not decrease . If we denote the GMM parameter estimate obtained after $\itercntr$ iterations of Algorithm Soft-Clustering Algorithm by ${\boldsymbol \theta}^{(\itercntr)}$ ,

[$]$$\label{equ_proability_increasing_GMM}p \big(\dataset;{\boldsymbol \theta}^{(\itercntr\!+\!1)} \big) \geq p \big(\dataset;{\boldsymbol \theta}^{(\itercntr)} \big)$$[$]

A Soft-Clustering Algorithm

Input: dataset $\dataset=\{ \featurevec^{(\sampleidx)}\}_{\sampleidx=1}^{\samplesize}$; number $\nrcluster$ of clusters, initial GMM parameter estimates $\{\widehat{\clustermean}^{(\clusteridx)},\widehat{\clustercov}^{(\clusteridx)},\hat{p}_{\clusteridx}\}_{\clusteridx=1}^{\nrcluster}$

• repeat
• for each $\sampleidx=1,\ldots,\samplesize$ and $\clusteridx=1,\ldots,\nrcluster$, update degrees of belonging
[]\begin{align} \label{equ_update_soft_cluster_assignm} \hat{\truelabel}_{\clusteridx}^{(\sampleidx)} & \defeq \frac{\hat{p}_{\clusteridx} \mathcal{N}(\featurevec^{(\sampleidx)};\widehat{\clustermean}^{(\clusteridx)},\widehat{\clustercov}^{(\clusteridx)})}{\sum_{\clusteridx'=1}^{\nrcluster} \hat{p}_{\clusteridx'}\mathcal{N}(\featurevec^{(\sampleidx)};\widehat{\clustermean}^{(\clusteridx')},\widehat{\clustercov}^{(\clusteridx')})} %\nonumber \\ \end{align}[]
• for each $\clusteridx \in \{1,\ldots,\nrcluster\}$, update GMM parameter estimates:
• $\hat{p}_{\clusteridx}\!\defeq\!\samplesize_{\clusteridx} /\samplesize$ with effective cluster size $\samplesize_{\clusteridx} \defeq \sum\limits_{\sampleidx=1}^{\samplesize} \hat{\truelabel}_{\clusteridx}^{(\sampleidx)}$ (cluster probability)
• $\widehat{\clustermean}^{(\clusteridx)} \defeq (1/\samplesize_{\clusteridx}) \sum\limits_{\sampleidx=1}^{\samplesize} \hat{\truelabel}_{\clusteridx}^{(\sampleidx)} \featurevec^{(\sampleidx)}$ (cluster mean)
• $\widehat{\clustercov}^{(\clusteridx)} \defeq (1/\samplesize_{\clusteridx}) {\sum\limits_{\sampleidx=1}^{\samplesize} \hat{\truelabel}_{\clusteridx}^{(\sampleidx)} \big(\featurevec^{(\sampleidx)}\!-\!\widehat{\clustermean}^{(\clusteridx)}\big) \big(\featurevec^{(\sampleidx)}\!-\!\widehat{\clustermean}^{(\clusteridx)}\big)^{T} }$ (cluster covariance matrix)
• until stopping criterion met

Output: predicted degrees of belonging $\widehat{\labelvec}^{(\sampleidx)}=(\hat{\truelabel}_{1}^{(\sampleidx)},\ldots,\hat{y}_{\nrcluster}^{(\sampleidx)})^{T}$ for $\sampleidx=1,\ldots,\samplesize$.

As for Algorithm k-means, we can also interpret Algorithm Soft-Clustering Algorithm as an instance of the ERM principle discussed in Chapter ch_Optimization. Indeed, maximizing the probability density \eqref{equ_def_prob_den_GMM} is equivalent to minimizing the empirical risk

[$]$$\label{equ_def_emp_risk_soft_clustering} \emperror \big({\boldsymbol \theta} \mid \dataset \big) \defeq - \log p \big( \dataset; {\boldsymbol \theta} \big) \mbox{ with GMM parameters } {\boldsymbol \theta} \defeq \{ \clustermean^{(\clusteridx)}, \clustercov^{(\clusteridx)}, p_{\clusteridx} \}_{\clusteridx=1}^{\nrcluster}$$[$]

The empirical risk \eqref{equ_def_emp_risk_soft_clustering} is the negative logarithm of the probability (density) \eqref{equ_def_prob_den_GMM} of observing the dataset $\dataset$ as iid realizations of the GMM \eqref{equ_def_GMM}. The monotone increase in the probability density \eqref{equ_proability_increasing_GMM} achieved by the iterations of Algorithm Soft-Clustering Algorithm translate into a monotone decrease of the empirical risk,

[$]$$\label{equ_emprisk_decreasing_GMM}\emperror \big({\boldsymbol \theta}^{(\itercntr)} \mid \dataset \big) \leq \emperror \big({\boldsymbol \theta}^{(\itercntr-1)} \mid \dataset \big) \mbox{ with iteration counter } \itercntr.$$[$]

The monotone decrease \eqref{equ_emprisk_decreasing_GMM} in the empirical risk \eqref{equ_def_emp_risk_soft_clustering} achieved by the iterations of Algorithm Soft-Clustering Algorithm naturally lends to a stopping criterion. Let $\error_{\itercntr}$ denote the empirical risk \eqref{equ_def_emp_risk_soft_clustering} achieved by the GMM parameter estimates ${\boldsymbol \theta}^{(\itercntr)}$ obtained after $\itercntr$ iterations in Algorithm Soft-Clustering Algorithm. Algorithm Soft-Clustering Algorithm stops iterating as soon as the decrease $\error_{\itercntr\!-\!1} - \error_{\itercntr}$ achieved by the $\itercntr$-th iteration of Algorithm Soft-Clustering Algorithm falls below a given (positive) threshold $\varepsilon\gt0$.

Similar to Algorithm k-means, also Algorithm Soft-Clustering Algorithm might get trapped in local minima of the underlying empirical risk. The GMM parameters delivered by Algorithm Soft-Clustering Algorithm might only be a local minimum of \eqref{equ_def_emp_risk_soft_clustering} but not the global minimum (see Figure fig_emp_risk_k_means for the analogous situation in hard clustering). As for hard clustering Algorithm k-means, we typically repeat Algorithm Soft-Clustering Algorithm several times. During each repetition of Algorithm Soft-Clustering Algorithm, we use a different (randomly chosen) initialization for the GMM parameter estimates ${\boldsymbol \theta} = \{ \widehat{\clustermean}^{(\clusteridx)}, \widehat{\clustercov}^{(\clusteridx)}, \hat{p}_{\clusteridx} \}_{\clusteridx=1}^{\nrcluster}$. Each repetition of Algorithm Soft-Clustering Algorithm results in a potentially different set of GMM parameter estimates and degrees of belongings $\hat{y}^{(\sampleidx)}_{\clusteridx}$. We then use the results for that repetition that achieves the smallest empirical risk \eqref{equ_def_emp_risk_soft_clustering}.

Let us point out an interesting link between soft clustering methods based on GMM (see Algorithm Soft-Clustering Algorithm) and hard clustering with k-means (see Algorithm k-means). Consider the GMM \eqref{equ_cond_dist_GMM} with prescribed cluster covariance matrices

[$]$$\label{equ_def_special_case}\clustercov^{(\clusteridx)}= \sigma^{2} \mathbf{I} \mbox{ for all } \clusteridx \in \{1,\ldots,\nrcluster\},$$[$]

with some given variance $\sigma^{2}\gt0$. We assume the cluster covariance matrices in the GMM to be given by \eqref{equ_def_special_case} and therefore can replace the covariance matrix updates in Algorithm Soft-Clustering Algorithm with the assignment $\widehat{\clustercov}^{(\clusteridx)} \defeq \sigma^{2} \mathbf{I}$. It can be verified easily that for sufficiently small variance $\sigma^{2}$ in \eqref{equ_def_special_case}, the update \eqref{equ_update_soft_cluster_assignm} tends to enforce $\hat{y}_{\clusteridx}^{(\sampleidx)} \in \{0,1\}$. In other words, each data point $\featurevec^{(\sampleidx)}$ becomes then effectively associated with exactly one single cluster $\clusteridx$ whose cluster mean $\widehat{\clustermean}^{(\clusteridx)}$ is nearest to $\featurevec^{(\sampleidx)}$. For $\sigma^{2} \rightarrow 0$, the soft clustering update \eqref{equ_update_soft_cluster_assignm} in Algorithm Soft-Clustering Algorithm reduces to the (hard) cluster assignment update \eqref{equ_cluster_assign_update} in k-means Algorithm k-means. We can interpret Algorithm k-means as an extreme case of Algorithm Soft-Clustering Algorithm that is obtained by fixing the covariance matrices in the GMM to $\sigma^{2} \mathbf{I}$ with a sufficiently small $\sigma^{2}$.

Combining GMM with linear regression. Let us sketch how Algorithm Soft-Clustering Algorithm could be combined with linear regression methods (see Section sec_lin_reg). The idea is to first compute the degree of belongings to the clusters for each data point. We then learn separate linear predictors for each cluster using the degree of belongings as weights for the individual loss terms in the training error. To predict the label of a new data point, we first compute the predictions obtained for each cluster-specific linear hypothesis. These cluster-specific predictions are then averaged using the degree of belongings for the new data point as weights.

## Connectivity-based Clustering

The clustering methods discussed in Sections Hard Clustering with kmeans and Soft Clustering with Gaussian Mixture Models can only be applied to data points which are characterized by numeric feature vectors. These methods define the similarity between data points using the Euclidean distance between the feature vectors of these data points. As illustrated in Figure fig_GMM_kmeans_geometry, these methods can only produce “Euclidean shaped” clusters that are contained either within hyper-spheres (Algorithm k-means) or hyper-ellipsoids (Algorithm Soft-Clustering Algorithm).

Some applications generate data points for which the construction of useful numeric features is difficult. Even if we can easily obtain numeric features for data points, the Euclidean distances between the resulting feature vectors might not reflect the actual similarities between data points. As a case in point, consider data points representing text documents. We could use the histogram of a respecified list of words as numeric features for a text document. In general, a small Euclidean distance between histograms of text documents does not imply that the text documents have similar meanings. Moreover, clusters of similar text documents might have highly complicated shapes in the space of feature vectors that cannot be grouped within hyper-ellipsoids. For datasets with such “non-Euclidean” cluster shapes, k-means or GMM are not suitable as clustering methods. We should then replace the Euclidean distance between feature vectors with another concept to determine or measure the similarity between data points.

Connectivity-based clustering methods do not require any numeric features of data points. These methods cluster data points based on explicitly specifying for any two different data points if they are similar and to what extend. A convenient mathematical tool to represent similarities between the data points of a dataset $\dataset$ is a weighted undirected graph $\graph=\big(\nodes,\edges\big)$. We refer to this graph as the similarity graph of the dataset $\dataset$ (see Figure fig_connect_clustering). The nodes $\nodes$ in this similarity graph $\graph$ represent data points in $\dataset$ and the undirected edges connect nodes that represent similar data points. The extend of the similarity is represented by the weights $W_{\nodeidx,\nodeidx'}$ for each edge $\{\nodeidx,\nodeidx'\} \in \edges$.

Given a similarity graph $\graph$ of a dataset, connectivity-based clustering methods determine clusters as subsets of nodes that are well connected within the cluster but weakly connected between different clusters. Different concepts for quantifying the connectivity between nodes in a graph yield different clustering methods. Spectral clustering methods use eigenvectors of a graph Laplacian matrix to measure the connectivity between nodes . Flow-based clustering methods measure the connectivity between two nodes via the amount of flow that can be routed between them . Note that we might use these connectivity measures to construct meaningful numerical feature vectors for the nodes in the empirical graph. These feature vectors can then be fed into the hard-clustering Algorithm k-means II or the soft clustering Algorithm Soft-Clustering Algorithm (see Figure fig_connect_clustering).

The algorithm DBSCAN considers two data points $\sampleidx,\sampleidx'$ as connected if one of them (say $\sampleidx$) is a core node and the other node ($\sampleidx'$) can be reached via a sequence (path) of connected core nodes

[$]\sampleidx^{(1)},\ldots,\sampleidx^{(\itercntr)} \mbox{ , with } \{\sampleidx,\sampleidx^{(1)}\}, \{\sampleidx^{(1)},\sampleidx^{(2)}\},\ldots,\{\sampleidx^{(\itercntr)},\sampleidx'\} \in \edges.[$]

DBSCAN considers a node to be a core node if it has a sufficiently large number of neighbours . The minimum number of neighbours required for a node to be considered a core node is a hyper-parameter of DBSCAN. When DBSCAN is applied to data points with numeric feature vectors, it defines two data points as connected if the Euclidean distance between their feature vectors does not exceed a given threshold $\varepsilon$ (see Figure fig_DBSCAN).

In contrast to k-means and GMM, DBSCAN does not require the number of clusters to be specified. The number of clusters is determined automatically by DBSCAN and depends on its hyper-parameters. DBSCAN also performs an implicit outlier detection. The outliers delivered by DBSCAN are those data points which do not belong to the same cluster as any other data point.

## Clustering as Preprocessing

In applications it might be benefiial to combine clustering methods with supervised methods such as linear regression. As a point in case consider a dataset that consists of data points obtained from two different data generation processes. Let us denote the data points generated by one process by $\dataset^{(1)}$ and the other one by $\dataset^{(2)}$. Each datapoint is characterzed by features and a label. While there would be an accurate lienar hypothesis for predicting the label of datapoints in $\dataset^{(1)}$ and another linear hypothesis for $\dataset^{(2)}$ these two are very different.

We could try to use clustering methods to assign any given data point to the corresponding data generation process. If we are lucky, the resulting clusters resemble (approximately) the two data generation processes $\dataset^{(1)}$ and $\dataset^{(2)}$. Once we have successfully clustered the data points, we can learn a separate (tailored) hypothesis for ach cluster. More generally, we can use the predicted cluster assignments obtained from the methods of Section Hard Clustering with kmeans - Connectivity-based Clustering as additional features for each data point.

Let us illustrate the above ideas by combining Algorithm k-means with linear regression. We first group data points into a given number $\nrcluster$ of clusters and then learn separate linear predictors $h^{(\clusteridx)}(\featurevec)=\big( \weights^{(\clusteridx)} \big)^{T} \featurevec$ for each cluster $\clusteridx=1,\ldots,\nrcluster$. To predict the label of a new data point with features $\featurevec$, we first assign to the cluster $\clusteridx'$ with the nearest cluster mean. We then use the linear predictor $h^{(\clusteridx')}$ assigned to cluster $\clusteridx'$ to compute the predicted label $\hat{\truelabel} = h^{(\clusteridx')}(\featurevec)$.

## Notes

1. Note that the expression \eqref{equ_def_mvn_distribution} is only valid for an invertible (non-singular) covariance matrix $\clustercov$.
2. Remember that the degree of belonging $\truelabel^{(\sampleidx)}_{\clusteridx}$ is considered as the (unknown) label value of a data point. The choice or definition for the labels of data points is a design choice. In particular, we can define the labels of data points using a hypothetical probabilistic model such as the GMM.

## References

Jung, Alexander (2022). Machine Learning: The Basics. Signapore: Springer. doi:10.1007/978-981-16-8193-6.

Jung, Alexander (2022). "Machine Learning: The Basics". arXiv:1805.05052.