Prelude to greedy pursuit algorithms

In this chapter we review some matching pursuit algorithms which can help us solve the sparse approximation and sparse recovery problems discussed in earlier chapters.

The presentation in this chapter is based on a number of sources, including [BDDH11], [BD09], [Ela10], [NT09], [Tro04] and [TG07].

Let us recall the definitions of sparse approximation and recovery problems from previous chapters.

Let \(\DDD\) be a signal dictionary with \(\Phi \in \CC^{N \times D}\) as its synthesis matrix. The \((\mathcal{D}, K)\)-sparse approximation problem can be written as

\[\begin{split}\begin{aligned} & \underset{\alpha}{\text{minimize}} & & \| x - \Phi \alpha \|_2 \\ & \text{subject to} & & \| \alpha \|_0 \leq K. \end{aligned}\end{split}\]

With the help of the synthesis matrix \(\Phi\), the \((\mathcal{D}, K)\)-exact-sparse problem can be written as

\[\begin{split}\begin{aligned} & \underset{\alpha}{\text{minimize}} & & \| \alpha \|_0 \\ & \text{subject to} & & x = \Phi \alpha\\ & \text{and} & & \| \alpha \|_0 \leq K \end{aligned}\end{split}\]

We also recall the problem of sparse signal recovery from compressive measurements. Let \(\Phi \in \CC^{M \times N}\) be a sensing matrix. Let \(x \in \CC^N\) be an unknown signal which is assumed to be sparse or compressible. Let \(y = \Phi x\) be a measurement vector in \(\CC^M\).

Then the signal recovery problem is to recover \(x\) from \(y\) subject to

\[y = \Phi x\]

assuming \(x\) to be \(K\)-sparse or at least \(K\)-compressible.
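To make the measurement model concrete, here is a minimal numpy sketch that builds a toy instance of this problem. The dimensions, the real-valued Gaussian sensing matrix and the variable names (`Phi`, `x`, `y`, `support`) are illustrative assumptions only; nothing in the problem statement requires these particular choices.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, K = 64, 256, 8                             # illustrative sizes only

Phi = rng.standard_normal((M, N)) / np.sqrt(M)   # sensing matrix (Gaussian, for illustration)
x = np.zeros(N)                                  # unknown K-sparse signal
support = rng.choice(N, size=K, replace=False)   # its unknown support
x[support] = rng.standard_normal(K)              # its unknown non-zero values

y = Phi @ x                                      # measurement vector available to the recovery algorithm
```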

We note that the sparse approximation and sparse recovery problems have essentially the same structure; they are in fact dual to each other. Thus we will see that the same set of algorithms can be adapted to solve both problems.

In the sequel we will see many variations of the above problems.

Our first problem

We start by attacking a very simple version of the \((\mathcal{D}, K)\)-exact-sparse problem.

Setting up notation

  • \(x \in \CC^N\) is our signal of interest and it is known.
  • \(\DDD\) is the dictionary in which we are looking for a sparse representation of \(x\).
  • \(\Phi \in \CC^{N \times D}\) is the synthesis matrix for \(\DDD\).
  • The sparse representation of \(x\) in \(\DDD\) is given by
\[x = \Phi \alpha.\]
  • It is assumed that \(\alpha \in \CC^D\) is sparse with \(\| \alpha \|_0 \leq K\).
  • We also assume that \(\alpha\) is the sparsest possible representation of \(x\) that we are looking for.
  • We know \(x\) and we know \(\Phi\); we do not know \(\alpha\), and it is what we are looking for.

Thus we need to solve the optimization problem given by

\[\underset{\alpha}{\text{minimize}}\, \| \alpha \|_0 \; \text{subject to} \; x = \Phi \alpha. \tag{1}\]

For the unknown vector \(\alpha\), we need to find

  • the sparsest support for the solution, i.e. \(\{ i \,|\, \alpha_i \neq 0 \}\);
  • the non-zero values \(\alpha_i\) over this support.

If we are able to find the support of the solution \(\alpha\), then the non-zero values of \(\alpha\) over that support can be computed easily by least squares, as sketched below.
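For concreteness, here is a minimal numpy sketch of that least squares step. The function name `solve_on_support` and the reliance on `np.linalg.lstsq` are illustrative assumptions; any least squares solver over the selected columns would do.

```python
import numpy as np

def solve_on_support(Phi, x, support):
    """Fit the non-zero values of alpha on a known candidate support."""
    Phi_S = Phi[:, list(support)]                  # columns of Phi indexed by the support
    z, *_ = np.linalg.lstsq(Phi_S, x, rcond=None)  # least squares coefficients over those columns
    alpha = np.zeros(Phi.shape[1], dtype=z.dtype)  # full-length coefficient vector
    alpha[list(support)] = z                       # place the fitted values on the support
    return alpha
```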

Note that the support is discrete in nature (An index \(i\) either belongs to the support or it does not). Hence algorithms which will seek the support will also be discrete in nature.

We now build up a case for greedy algorithms before jumping into specific algorithms later.

Let us begin with a much simplified version of the problem.

Let the columns of the matrix \(\Phi\) be represented as

\[\Phi = \begin{bmatrix} \phi_1 & \phi_2 & \dots & \phi_D \end{bmatrix} .\]

Let \(\spark (\Phi) > 2\). Then no two columns of \(\Phi\) are linearly dependent and, as discussed earlier, for any \(x\) there is at most one \(1\)-sparse explanation vector.

We now assume that such a representation exists and look for the optimal solution vector \(\alpha^*\) with exactly one non-zero value, i.e. \(\| \alpha^*\|_0 = 1\).

Let \(i\) be the index at which \(\alpha^*_i \neq 0\).

Thus \(x = \alpha^*_i \phi_i\), i.e. \(x\) is a scalar multiple of \(\phi_i\) (the \(i\)-th column of \(\Phi\)).

Of course, we do not know the value of the index \(i\).

We can find it by comparing \(x\) with each column of \(\Phi\) and identifying the column which best matches it.

Consider the least squares minimization problem:

\[\epsilon(j) = \underset{z_j}{\text{minimize}}\, \| \phi_j z_j - x \|_2,\]

where \(z_j \in \CC\) is a scalar.

From linear algebra, this problem finds the projection of \(x\) onto \(\phi_j\), and \(\epsilon(j)\) represents the magnitude of the error between \(x\) and its projection onto \(\phi_j\).

The optimal solution is given by

\[z_j^* = \frac{\phi_j^H x }{\| \phi_j \|_2^2} = \phi_j^H x\]

since the columns of a dictionary are assumed to have unit norm.

Plugging it back into the expression for the minimum squared error, we get

\[\begin{split}\epsilon^2(j) &= \underset{z_j}{\text{minimize}}\, \| \phi_j z_j - x \|_2^2\\ &=\left \| \phi_j \phi_j^H x - x \right \|_2^2\\ &= \| x\|_2^2 - |\phi_j^H x |^2.\end{split}\]

Since \(x\) is a scalar multiple of \(\phi_i\), we have \(\epsilon(i) = 0\). Thus, if we compute \(\epsilon(j)\) for \(j = 1, \dots, D\), the minimum value \(0\) is attained at \(j = i\).

And \(\epsilon(i) = 0\) means

\[\| x\|_2^2 - |\phi_i^H x |^2 = 0 \implies \| x\|_2^2 = |\phi_i^H x |^2.\]

This is a special case of the Cauchy-Schwarz inequality, which holds with equality precisely when \(x\) and \(\phi_i\) are collinear.

The sparse representation is given by

\[\begin{split}\alpha = \begin{bmatrix} 0 \\ \vdots \\ z_i^* \\ \vdots \\ 0 \end{bmatrix}\end{split}\]

Since \(x \in \CC^N\) and \(\phi_j \in \CC^N\), the computation of \(\epsilon(j)\) requires \(\bigO{N}\) time.

Since we may need to do this for all \(D\) columns, finding the index \(i\) takes \(\bigO{ND}\) time.
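A minimal numpy sketch of this \(1\)-sparse recovery procedure follows. It assumes unit-norm columns; the function name `recover_1_sparse` is purely illustrative.

```python
import numpy as np

def recover_1_sparse(Phi, x):
    """1-sparse recovery assuming unit-norm columns and spark(Phi) > 2."""
    correlations = Phi.conj().T @ x                 # phi_j^H x for every column, O(ND) work
    i = int(np.argmax(np.abs(correlations)))        # epsilon(j) is smallest where |phi_j^H x| is largest
    alpha = np.zeros(Phi.shape[1], dtype=correlations.dtype)
    alpha[i] = correlations[i]                      # z_i^* = phi_i^H x for a unit-norm column
    return alpha
```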

Now let us make our life more complex. We now suppose that \(\spark(\Phi) > 2 K\). Then a sparse representation \(\alpha\) of \(x\) with up to \(K\) non-zero values is unique if it exists (as discussed earlier). We assume that it exists. Since \(x = \Phi \alpha\), \(x\) is a linear combination of up to \(K\) columns of \(\Phi\).

One approach could be to check all \(\binom{D}{K}\) possible subsets of \(K\) columns of \(\Phi\).

But \(\binom{D}{K}\) is \(\bigO{D^{K}}\), and for each subset of \(K\) columns solving the least squares problem takes \(\bigO{N K^2}\) time. Hence the overall complexity of the recovery process would be \(\bigO{D^{K} N K^2}\), which is prohibitively expensive.
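To see why, here is a brute-force sketch of this exhaustive search in numpy; it is feasible only for tiny \(D\) and \(K\) and serves purely to illustrate the combinatorial cost. The function name `exhaustive_exact_sparse` and the tolerance parameter are illustrative assumptions.

```python
import itertools
import numpy as np

def exhaustive_exact_sparse(Phi, x, K, tol=1e-10):
    """Try every K-subset of columns; feasible only for tiny D and K."""
    D = Phi.shape[1]
    for support in itertools.combinations(range(D), K):  # D choose K subsets
        Phi_S = Phi[:, list(support)]
        z, *_ = np.linalg.lstsq(Phi_S, x, rcond=None)     # O(N K^2) per subset
        if np.linalg.norm(Phi_S @ z - x) <= tol:          # exact representation found
            alpha = np.zeros(D, dtype=z.dtype)
            alpha[list(support)] = z
            return alpha
    return None                                           # no K-sparse representation exists
```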

A way around this is to adopt a greedy strategy, in which we abandon the hopeless exhaustive search and instead attempt a series of single-term updates to the solution vector \(\alpha\).

Since this is an iterative procedure, let us denote the approximation at iteration \(k\) by \(\alpha^k\). A minimal code sketch of the scheme appears after the following list.

  • We start with \(\alpha^0 = 0\).

  • At each iteration we choose one new column of \(\Phi\) and fill in a value at the corresponding index in \(\alpha^k\).

  • The column and value are chosen such that they maximally reduce the \(l_2\) error between \(x\) and its approximation, i.e.

    \[\| x -\Phi \alpha^{k + 1} \|_2 < \| x -\Phi \alpha^{k} \|_2\]

    and the error reduction is as large as possible.

  • We stop when the \(l_2\) error falls below a specified threshold.
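As promised, here is a minimal numpy sketch of the simplest instance of this scheme, essentially the plain matching pursuit update with one coefficient adjusted per iteration. Unit-norm columns are assumed, and unlike the description above a column may be selected more than once; the function name and stopping parameters are illustrative, not part of any specific algorithm discussed later.

```python
import numpy as np

def greedy_single_term(Phi, x, max_iter=100, tol=1e-6):
    """One-coefficient-per-iteration greedy scheme (plain matching pursuit style)."""
    alpha = np.zeros(Phi.shape[1], dtype=np.result_type(Phi.dtype, x.dtype))
    residual = x.copy()                            # r^0 = x - Phi alpha^0 = x
    for _ in range(max_iter):
        correlations = Phi.conj().T @ residual     # phi_j^H r for every column
        j = int(np.argmax(np.abs(correlations)))   # column giving the largest error reduction
        alpha[j] += correlations[j]                # single-term update (unit-norm columns assumed)
        residual = x - Phi @ alpha                 # updated approximation error
        if np.linalg.norm(residual) < tol:         # stop once the l2 error is small enough
            break
    return alpha
```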

Many variations of this scheme are possible.

  • We can choose more than one atom in each iteration.
  • In fact, we can choose as many as \(K\) atoms in each iteration.
  • We can also drop some previously chosen atoms in an iteration if they appear to be incorrect choices.

Not every chosen atom may be a correct one. Some algorithms have mechanisms to identify atoms which are more likely to be part of the support than others, and to drop the unlikely ones.

We are now ready to explore different greedy algorithms.