Coherence

Finding the spark of a dictionary \(\DDD\) is NP-hard, since it requires examining a combinatorially large number of selections of columns from \(\DDD\). In this section we consider the coherence of a dictionary, which is computationally tractable and quite useful in characterizing the solutions of sparse approximation problems.

Definition

The coherence of a dictionary \(\DDD\) is defined as the maximum absolute inner product between two distinct atoms in the dictionary:

\[\mu = \underset{j \neq k}{\text{max}} | \langle d_{\omega_j}, d_{\omega_k} \rangle | = \underset{j \neq k}{\text{max}} | (\DDD^H \DDD)_{jk} |.\]

If the dictionary consists of two orthonormal bases, the coherence is also known as mutual coherence or proximity. Since the atoms within each orthonormal basis are orthogonal to each other, the coherence is determined only by the inner products of atoms from one basis with atoms from the other.

We note that \(d_{\omega_i}\) is the \(i\) -th column of synthesis matrix \(\DDD\) . Also \(\DDD^H \DDD\) is the Gram matrix for \(\DDD\) whose elements are nothing but the inner-products of columns of \(\DDD\) .

We note that by definition \(\| d_{\omega} \|_2 = 1\); hence, by the Cauchy-Schwarz inequality, \(\mu \leq 1\). Since absolute values are taken, \(\mu \geq 0\). Thus, \(0 \leq \mu \leq 1\).
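As a quick numerical illustration (a sketch using NumPy; the random dictionary below is hypothetical, chosen only for demonstration), the coherence can be computed directly from the Gram matrix:

```python
import numpy as np

# Compute the coherence mu of a dictionary with unit-norm atoms as the
# largest off-diagonal entry of |D^H D| (the absolute Gram matrix).

def coherence(D):
    G = np.abs(D.conj().T @ D)     # |Gram matrix|, entries |<d_j, d_k>|
    np.fill_diagonal(G, 0.0)       # ignore the unit diagonal (j = k)
    return G.max()

rng = np.random.default_rng(0)
D = rng.standard_normal((8, 16))   # hypothetical random dictionary
D /= np.linalg.norm(D, axis=0)     # normalize atoms to unit norm

mu = coherence(D)
assert 0.0 <= mu <= 1.0            # the bound 0 <= mu <= 1 derived above
```

The final assertion checks the bound \(0 \leq \mu \leq 1\) derived above.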

For an orthonormal basis \(\Psi\) all atoms are orthogonal to each other, hence

\[| \langle \psi_{\omega_j}, \psi_{\omega_k} \rangle |= 0 \text{ whenever } j \neq k.\]

Thus \(\mu = 0\) .

In the following, we will use the notation \(|A|\) to denote the matrix consisting of the absolute values of the entries of a matrix \(A\), i.e.

\[| A |_{i j} = | A _{i j} |.\]

The off-diagonal entries of the Gram matrix are captured by the matrix \(\DDD^H \DDD - I\) . Note that all diagonal entries in \(\DDD^H \DDD - I\) are zero since atoms of \(\DDD\) are unit norm. Moreover, each of the entries in \(| \DDD^H \DDD - I |\) is dominated by \(\mu(\DDD)\) .

The inner product between two atoms \(| \langle d_{\omega_j}, d_{\omega_k} \rangle |\) is a measure of how much they look alike, i.e. how strongly they are correlated. Coherence simply picks the two atoms that are most alike and returns their correlation. In this sense \(\mu\) is a rather blunt measure of the quality of a dictionary, yet it is quite useful.

If a dictionary is uniform in the sense that there is not much variation in \(| \langle d_{\omega_j}, d_{\omega_k} \rangle |\) , then \(\mu\) captures the behavior of the dictionary quite well.

Definition

We say that a dictionary is incoherent if the coherence of the dictionary is small.

We are looking for dictionaries which are incoherent. In the sequel we will see how incoherence plays a role in sparse approximation.

Example

The coherence of a two ortho-basis is bounded by

\[\frac{1}{\sqrt{N}} \leq \mu \leq 1.\]

The coherence of the Dirac-Fourier pair of bases is \(\frac{1}{\sqrt{N}}\), which attains this lower bound.
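This value can be verified numerically (a sketch with NumPy; the choice \(N = 16\) is arbitrary):

```python
import numpy as np

# The Dirac-Fourier dictionary [I F]: identity basis concatenated with the
# unitary DFT basis. Its coherence should equal 1/sqrt(N).
N = 16
F = np.fft.fft(np.eye(N)) / np.sqrt(N)    # unitary DFT, unit-norm columns
D = np.hstack([np.eye(N), F])

G = np.abs(D.conj().T @ D)
np.fill_diagonal(G, 0.0)
mu = G.max()

assert abs(mu - 1 / np.sqrt(N)) < 1e-12   # mu = 1/sqrt(16) = 0.25
```

The only nonzero off-diagonal inner products are the cross terms between the two bases, each of magnitude exactly \(1/\sqrt{N}\).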

Example: Coherence of a multi-ONB dictionary

A dictionary of concatenated orthonormal bases is called a multi-ONB. For certain values of \(N\), it is possible to build a multi-ONB which contains \(N\) or even \(N+1\) bases yet retains the minimal possible coherence \(\mu = \frac{1}{\sqrt{N}}\).
Theorem

A lower bound on the coherence of a general dictionary \(\DDD \in \CC^{N \times D}\), known as the Welch bound, is given by

\[\mu \geq \sqrt{\frac{D-N}{N(D-1)}}.\]
Definition

If each atomic inner product meets this bound, the dictionary is called an optimal Grassmannian frame.
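A quick numerical sanity check of the Welch bound (a sketch; the random unit-norm dictionaries are hypothetical, and the theorem guarantees the bound for every dictionary, random or not):

```python
import numpy as np

# Verify mu >= sqrt((D - N) / (N (D - 1))) on random unit-norm dictionaries.

def coherence(Phi):
    G = np.abs(Phi.conj().T @ Phi)
    np.fill_diagonal(G, 0.0)
    return G.max()

N, D = 8, 24
welch = np.sqrt((D - N) / (N * (D - 1)))   # the Welch lower bound

rng = np.random.default_rng(1)
for _ in range(100):
    Phi = rng.standard_normal((N, D))
    Phi /= np.linalg.norm(Phi, axis=0)     # unit-norm columns
    assert coherence(Phi) >= welch - 1e-12  # bound holds in every trial
```

Random Gaussian dictionaries typically sit well above the bound; approaching it requires carefully constructed (Grassmannian) frames.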

The definition of coherence can be extended to arbitrary matrices \(\Phi \in \CC^{N \times D}\) .

Definition

The coherence of a matrix \(\Phi \in \CC^{N \times D}\) is defined as the maximum absolute normalized inner product between two distinct columns in the matrix. Let

\[\Phi = \begin{bmatrix} \phi_1 & \phi_2 & \dots & \phi_D \end{bmatrix}.\]

Then coherence of \(\Phi\) is given by

\[\mu(\Phi) = \underset{j \neq k}{\text{max}} \frac{ | \langle \phi_j, \phi_k \rangle |} {\| \phi_j \|_2 \| \phi_k \|_2}.\]

It is assumed that none of the columns in \(\Phi\) is a zero vector.

Lower bounds for spark

Coherence of a matrix is easy to compute. More interestingly, it also provides a lower bound on the spark of a matrix.

Theorem

For any matrix \(\Phi \in \CC^{N \times D}\) (with non-zero columns) the following relationship holds

\[\spark(\Phi) \geq 1 + \frac{1}{\mu(\Phi)}.\]
Proof

We note that scaling a column of \(\Phi\) changes neither the spark nor the coherence of \(\Phi\). Therefore, we assume without loss of generality that the columns of \(\Phi\) are normalized.

We now construct the Gram matrix of \(\Phi\) given by \(G = \Phi^H \Phi\) . We note that

\[G_{k k} = 1 \quad \Forall 1 \leq k \leq D\]

since each column of \(\Phi\) is unit norm.

Also

\[|G_{k j}| \leq \mu(\Phi) \quad \Forall 1 \leq k, j \leq D , \; k \neq j.\]

Consider any \(p\) columns from \(\Phi\) and construct their Gram matrix. This is nothing but a \(p \times p\) principal submatrix of \(G\).

From the Gershgorin disk theorem, if this submatrix is diagonally dominant, i.e. if

\[\sum_{j \neq i} |G_{i j}| < | G_{i i}| \quad \Forall i,\]

then this submatrix of \(G\) is positive definite and so the corresponding \(p\) columns from \(\Phi\) are linearly independent.

But

\[|G_{i i}| = 1\]

and

\[\sum_{j \neq i} |G_{i j}| \leq (p-1) \mu(\Phi)\]

for the minor under consideration. Hence for \(p\) columns to be linearly independent the following condition is sufficient

\[(p-1) \mu (\Phi) < 1.\]

Thus if

\[p < 1 + \frac{1}{\mu(\Phi)},\]

then every set of \(p\) columns from \(\Phi\) is linearly independent.

Hence, the smallest possible set of linearly dependent columns must satisfy

\[p \geq 1 + \frac{1}{\mu(\Phi)}.\]

This establishes the lower bound that

\[\spark(\Phi) \geq 1 + \frac{1}{\mu(\Phi)}.\]
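The bound can be illustrated by brute force on a tiny matrix (a sketch; the \(4 \times 8\) random matrix is hypothetical, and the `spark` helper below is exponential-time, usable only for small \(D\)):

```python
import numpy as np
from itertools import combinations

def coherence(Phi):
    Phi = Phi / np.linalg.norm(Phi, axis=0)    # normalize columns
    G = np.abs(Phi.conj().T @ Phi)
    np.fill_diagonal(G, 0.0)
    return G.max()

def spark(Phi):
    """Size of the smallest linearly dependent set of columns."""
    n, d = Phi.shape
    for p in range(1, d + 1):
        for cols in combinations(range(d), p):
            if np.linalg.matrix_rank(Phi[:, cols]) < p:
                return p
    return d + 1   # no dependent subset; spark is conventionally infinite

rng = np.random.default_rng(0)
Phi = rng.standard_normal((4, 8))   # hypothetical small matrix
mu = coherence(Phi)
sp = spark(Phi)
assert sp >= 1 + 1 / mu             # the coherence lower bound on spark
```

For a generic \(4 \times 8\) Gaussian matrix the spark is 5 (any 4 columns are independent, any 5 are dependent), comfortably above the coherence bound.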

This bound on spark doesn’t make any assumptions on the structure of the dictionary. In fact, imposing additional structure on the dictionary can give better bounds. Let us look at an example for a two ortho-basis [DE03].

Theorem

Let \(\DDD\) be a two ortho-basis. Then

\[\spark (\DDD) \geq \frac{2}{\mu(\DDD)}.\]
Proof

It can be shown that for any nonzero vector \(v \in \NullSpace(\DDD)\)

\[\| v \|_0 \geq \frac{2}{\mu(\DDD)}.\]

But

\[\spark(\DDD) = \underset{v \in \NullSpace(\DDD), \; v \neq 0} {\min} \| v \|_0.\]

Thus

\[\spark(\DDD) \geq \frac{2}{\mu(\DDD)}.\]

For maximally incoherent two orthonormal bases, we know that \(\mu = \frac{1}{\sqrt{N}}\) . A perfect example is the pair of Dirac and Fourier bases. In this case \(\spark(\DDD) \geq 2 \sqrt{N}\) .
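A brute-force check of the smallest interesting case (a sketch; \(N = 4\) is chosen because it is a perfect square, where a 2-periodic Dirac comb equals a sum of two Fourier atoms, giving a dependent set of exactly \(2\sqrt{N} = 4\) columns):

```python
import numpy as np
from itertools import combinations

# Dirac-Fourier dictionary for N = 4: spark should equal 2*sqrt(N) = 4.
N = 4
F = np.fft.fft(np.eye(N)) / np.sqrt(N)     # unitary DFT, unit-norm columns
DD = np.hstack([np.eye(N), F])

def spark(Phi):
    n, d = Phi.shape
    for p in range(1, d + 1):
        for cols in combinations(range(d), p):
            if np.linalg.matrix_rank(Phi[:, cols]) < p:
                return p
    return d + 1

sp = spark(DD)
# e_0 + e_2 = f_0 + f_2 = [1, 0, 1, 0]: four dependent columns, and the
# bound spark >= 2*sqrt(N) rules out any smaller dependent set.
assert sp == 4
```

So for this dictionary the lower bound \(\spark(\DDD) \geq 2\sqrt{N}\) is met with equality.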

Uniqueness-Coherence

We can now establish a uniqueness condition for the sparse solution of \(y = \Phi x\).

Theorem

Consider a solution \(x^*\) to the under-determined system \(y = \Phi x\) . If \(x^*\) obeys

\[\| x^* \|_0 < \frac{1}{2} \left (1 + \frac{1}{\mu(\Phi)} \right )\]

then it is necessarily the sparsest solution.

Proof
This is a straightforward application of the spark uniqueness theorem combined with the coherence-based lower bound on the spark.

It is interesting to compare the two uniqueness theorems: spark uniqueness theorem and coherence uniqueness theorem.

The first, based on spark, is sharp and far more powerful than the second, based on coherence.

Coherence can never be smaller than \(\frac{1}{\sqrt{N}}\); therefore the bound on \(\| x^* \|_0\) above can never be larger than \(\frac{\sqrt{N} + 1}{2}\).

However, spark can easily be as large as \(N\), and then the bound on \(\| x^* \|_0\) can be as large as \(\frac{N}{2}\).

Thus, we note that coherence gives a weaker bound than spark for supportable sparsity levels of unique solutions. The advantage that coherence has is that it is easily computable and doesn’t require any special structure on the dictionary (two ortho basis has a special structure).

Singular values of sub-dictionaries

Theorem

Let \(\DDD\) be a dictionary and \(\DDD_{\Lambda}\) be a sub-dictionary. Let \(\mu\) be the coherence of \(\DDD\). Let \(K = | \Lambda |\). Then the eigenvalues of \(G = \DDD_{\Lambda}^H \DDD_{\Lambda}\) satisfy:

\[1 - (K - 1) \mu \leq \lambda \leq 1 + (K - 1) \mu.\]

Moreover, the singular values of the sub-dictionary \(\DDD_{\Lambda}\) satisfy

\[\sqrt{1 - (K - 1) \mu} \leq \sigma (\DDD_{\Lambda}) \leq \sqrt{1 + (K - 1) \mu}.\]
Proof

We recall from Gershgorin’s theorem that for any square matrix \(A \in \CC^{K \times K}\) , every eigenvalue \(\lambda\) of \(A\) satisfies

\[| \lambda - a_{ii} | \leq \sum_{j \neq i} |a_{ij}| \text{ for some } i \in \{ 1, \dots, K\}.\]

Now consider the matrix \(G = \DDD_{\Lambda}^H \DDD_{\Lambda}\) with diagonal elements equal to 1 and off-diagonal elements bounded in magnitude by \(\mu\) . Then

\[| \lambda - 1 | \leq \sum_{j \neq i} |G_{ij}| \leq \sum_{j \neq i} \mu = (K - 1) \mu.\]

Thus,

\[- (K - 1) \mu \leq \lambda - 1 \leq (K - 1) \mu \iff 1 - (K - 1) \mu \leq \lambda \leq 1 + (K - 1) \mu\]

This gives us a lower bound on the smallest eigenvalue:

\[\lambda_{\min} (G) \geq 1 - (K - 1) \mu.\]

Since \(G\) is positive definite ( \(\DDD_{\Lambda}\) has full column rank), its eigenvalues are positive. Thus, the above lower bound is useful only if

\[1 - (K - 1) \mu > 0 \iff 1 > (K - 1) \mu \iff \mu < \frac{1}{K - 1}.\]

We also get an upper bound on the eigenvalues of \(G\) given by

\[\lambda_{\max} (G) \leq 1 + (K - 1) \mu.\]

The bounds on the singular values of \(\DDD_{\Lambda}\) follow directly, since the singular values of \(\DDD_{\Lambda}\) are the square roots of the eigenvalues of \(G\).
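These bounds are easy to check empirically (a sketch; the random dictionary, seed, and index set are hypothetical):

```python
import numpy as np

# Draw a random unit-norm dictionary, pick a sub-dictionary of K atoms,
# and confirm all eigenvalues of its Gram matrix lie in
# [1 - (K-1)mu, 1 + (K-1)mu].
rng = np.random.default_rng(7)
Dfull = rng.standard_normal((64, 128))
Dfull /= np.linalg.norm(Dfull, axis=0)     # unit-norm atoms

A = np.abs(Dfull.T @ Dfull)
np.fill_diagonal(A, 0.0)
mu = A.max()                               # coherence of the dictionary

K = 4
Lam = rng.choice(128, size=K, replace=False)
G = Dfull[:, Lam].T @ Dfull[:, Lam]        # Gram matrix of sub-dictionary
eigs = np.linalg.eigvalsh(G)

assert eigs.min() >= 1 - (K - 1) * mu - 1e-12
assert eigs.max() <= 1 + (K - 1) * mu + 1e-12
```

Note that for this random dictionary \((K-1)\mu\) may exceed 1, in which case the lower bound is negative and hence vacuous, exactly as the proof remarks.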

Embeddings using sub-dictionaries

Theorem

Let \(\DDD\) be a real dictionary and \(\DDD_{\Lambda}\) be a sub-dictionary with \(K = |\Lambda|\) . Let \(\mu\) be the coherence of \(\DDD\) . Let \(v \in \RR^K\) be an arbitrary vector. Then

\[| v |^T [I - \mu (\OneMat - I)] | v | \leq \| \DDD_{\Lambda} v \|_2^2 \leq | v |^T [I + \mu (\OneMat - I)] | v |\]

where \(\OneMat\) is a \(K\times K\) matrix of all ones. Moreover

\[(1 - (K - 1) \mu) \| v \|_2^2 \leq \| \DDD_{\Lambda} v \|_2^2 \leq (1 + (K - 1) \mu)\| v \|_2^2.\]
Proof

We can easily write

\[\| \DDD_{\Lambda} v \|_2^2 = v^T \DDD_{\Lambda}^T \DDD_{\Lambda} v\]
\[\begin{aligned} v^T \DDD_{\Lambda}^T \DDD_{\Lambda} v &= \sum_{i=1}^K \sum_{j=1}^K v_i d_{\lambda_i}^T d_{\lambda_j} v_j. \end{aligned}\]

The terms in the R.H.S. for \(i = j\) are given by

\[v_i d_{\lambda_i}^T d_{\lambda_i} v_i = | v_i |^2.\]

Summing over \(i = 1, \cdots, K\) , we get

\[\sum_{i=1}^K | v_i |^2 = \| v \|_2^2 = v^T v = | v |^T | v | = | v |^T I | v |.\]

We are now left with \(K^2 - K\) off diagonal terms. Each of these terms is bounded by

\[- \mu |v_i| |v_j | \leq v_i d_{\lambda_i}^T d_{\lambda_j} v_j \leq \mu |v_i| |v_j |.\]

Summing over the \(K^2 - K\) off-diagonal terms we get:

\[\sum_{i \neq j} |v_i| |v_j | = \sum_{i, j} |v_i| |v_j | - \sum_{i = j} |v_i| |v_j | = | v |^T(\OneMat - I ) | v |.\]

Thus,

\[- \mu | v |^T (\OneMat - I ) | v | \leq \sum_{i \neq j} v_i d_{\lambda_i}^T d_{\lambda_j} v_j \leq \mu | v |^T (\OneMat - I ) | v |\]

Thus,

\[| v |^T I | v |- \mu | v |^T (\OneMat - I ) | v | \leq v^T \DDD_{\Lambda}^T \DDD_{\Lambda} v \leq | v |^T I | v |+ \mu | v |^T (\OneMat - I )| v |.\]

We get the result by slight reordering of terms:

\[| v |^T [I - \mu (\OneMat - I)] | v | \leq \| \DDD_{\Lambda} v \|_2^2 \leq | v |^T [I + \mu (\OneMat - I)] | v |\]

We recall that

\[| v |^T \OneMat | v | = \| v \|_1^2.\]

Thus, the inequalities can be written as

\[(1 + \mu) \| v \|_2^2 - \mu \| v \|_1^2 \leq \| \DDD_{\Lambda} v \|_2^2 \leq (1 - \mu) \| v \|_2^2 + \mu \| v \|_1^2.\]

Alternatively,

\[\| v \|_2^2 - \mu \left (\| v \|_1^2 - \| v \|_2^2 \right ) \leq \| \DDD_{\Lambda} v \|_2^2 \leq \| v \|_2^2 + \mu \left (\| v \|_1^2 - \| v \|_2^2\right ) .\]

Finally

\[\| v \|_1^2 \leq K \| v \|_2^2 \implies \| v \|_1^2 - \| v \|_2^2 \leq (K - 1) \| v \|_2^2.\]

This gives us

\[( 1- (K - 1) \mu ) \| v \|_2^2 \leq \| \DDD_{\Lambda} v \|_2^2 \leq ( 1 + (K - 1) \mu ) \| v \|_2^2 .\]

We now present the above theorem for the complex case. The proof is based on singular values. This proof is simpler and more general than the one presented above.

Theorem

Let \(\DDD\) be a dictionary and \(\DDD_{\Lambda}\) be a sub-dictionary with \(K = |\Lambda|\) . Let \(\mu\) be the coherence of \(\DDD\) . Let \(v \in \CC^K\) be an arbitrary vector. Then

\[(1 - (K - 1) \mu) \| v \|_2^2 \leq \| \DDD_{\Lambda} v \|_2^2 \leq (1 + (K - 1) \mu)\| v \|_2^2.\]
Proof

Recall that

\[\sigma_{\min}^2(\DDD_{\Lambda}) \| v \|_2^2 \leq \| \DDD_{\Lambda} v \|_2^2 \leq \sigma_{\max}^2(\DDD_{\Lambda}) \| v \|_2^2.\]

A previous result tells us:

\[1 - (K - 1) \mu \leq \sigma^2 (\DDD_{\Lambda}) \leq 1 + (K - 1) \mu.\]

Thus,

\[\sigma_{\min}^2(\DDD_{\Lambda}) \| v \|_2^2 \geq (1 - (K - 1) \mu) \| v \|_2^2\]

and

\[\sigma_{\max}^2(\DDD_{\Lambda}) \| v \|_2^2 \leq (1 + (K - 1) \mu)\| v \|_2^2.\]

This gives us the result

\[(1 - (K - 1) \mu) \| v \|_2^2 \leq \| \DDD_{\Lambda} v \|_2^2 \leq (1 + (K - 1) \mu)\| v \|_2^2.\]
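The complex-case embedding bounds can likewise be verified numerically (a sketch with hypothetical random data):

```python
import numpy as np

# Check (1-(K-1)mu)||v||^2 <= ||D_Lam v||^2 <= (1+(K-1)mu)||v||^2
# for random complex vectors v.
rng = np.random.default_rng(11)
Dfull = rng.standard_normal((64, 128))
Dfull /= np.linalg.norm(Dfull, axis=0)     # unit-norm atoms

A = np.abs(Dfull.T @ Dfull)
np.fill_diagonal(A, 0.0)
mu = A.max()                               # coherence

K = 4
sub = Dfull[:, rng.choice(128, size=K, replace=False)]
lo, hi = 1 - (K - 1) * mu, 1 + (K - 1) * mu

for _ in range(100):
    v = rng.standard_normal(K) + 1j * rng.standard_normal(K)
    ratio = np.linalg.norm(sub @ v) ** 2 / np.linalg.norm(v) ** 2
    assert lo - 1e-9 <= ratio <= hi + 1e-9
```

The ratio \(\| \DDD_{\Lambda} v \|_2^2 / \| v \|_2^2\) is squeezed between the squared extreme singular values of the sub-dictionary, which is exactly what the proof exploits.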