Sparse-Plex

A journey into sparse and redundant representations.

About Sparse-Plex

Sparse-plex is a MATLAB library for solving sparse representation problems.

_images/union_of_subspaces.png

This is an example of a union of subspaces model. While the ambient space is \(\RR^3\), the data points actually fall in one of three 2-d planes. The black points lie in the \(xy\)-plane, the yellow points in the \(yz\)-plane and the red points in the \(zx\)-plane. Each of the 3 planes is a subspace of the ambient 3-dimensional space. Once an appropriate basis for each of the subspaces is chosen, the data points require only 2 coordinates to identify them within the subspace. In this case, it is easy to see that the standard basis for \(\RR^3\) also contains the basis vectors for the individual subspaces. Thus, in the standard basis, each data point has only 2 non-zero coordinates.

The library website is: http://indigits.github.io/sparse-plex/.

Online documentation is hosted at: http://sparse-plex.readthedocs.org/en/latest/.

The project is hosted on GITHUB at: https://github.com/indigits/sparse-plex.

It contains implementations of many state-of-the-art algorithms. Some implementations are simple and straightforward, while extra effort has gone into optimizing others for speed.

In addition to these, the library provides implementations of many other algorithms which are building blocks for the sparse recovery algorithms.

The library aims to solve:

  • Single vector sparse recovery or sparse approximation problems
  • Multiple vector joint sparse recovery or sparse approximation problems

The library provides:

  • Various simple dictionaries and sensing matrices
  • Implementations of pursuit algorithms
    • Matching pursuit
    • Orthogonal matching pursuit
    • Compressive sampling matching pursuit
    • Basis pursuit
  • Some joint recovery algorithms
    • Cluster orthogonal matching pursuit
  • Some clustering algorithms
    • Spectral clustering
    • Sparse subspace clustering using l_1 minimization
    • Sparse subspace clustering using orthogonal matching pursuit
  • Various utilities for working with matrices, signals, norms, distances, signal comparison, vector spaces
  • Some visualization utilities
  • Some combinatoric systems
  • Various constructions for synthetic sparse signals
  • Some optimization algorithms
    • Steepest descent
    • Conjugate gradient descent
  • Detection and estimation algorithms
    • Compressive binary detector

The documentation contains several how-to tutorials meant to help beginners in the area ramp up quickly. The documentation is not a complete user manual; it doesn’t describe all parameters and the behavior of every function in detail. Rather, it provides code examples to explain how things work. Users are encouraged to read the source code and relevant papers to gain a deeper understanding of the methods.

Getting Started

Requirements

While much of the library can be used on a stock MATLAB distribution with standard toolboxes, some parts of the library depend on specific third-party libraries. These dependencies are explained below.

MATLAB toolboxes

  • Signal processing toolbox
  • Image processing toolbox
  • Statistics toolbox
  • Optimization toolbox

Third party library dependencies (optional)

We repeat that only some parts of the library and examples depend on third-party libraries. You can install them on an as-needed basis; you don’t need to install them in advance.

Installation

  • Download the sparse-plex library from http://indigits.github.io/sparse-plex/.
  • Unzip it in a suitable folder.
  • Add the following commands to your MATLAB startup script (a sketch follows this list):
    • Change directory to the root directory of sparse-plex.
    • Run spx_setup function.
    • Change back to whatever directory you want to be in.
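A minimal sketch of the corresponding startup.m entries (the paths below are illustrative; adjust them to your installation):

% startup.m (illustrative paths)
cd('C:\work\sparse-plex');   % root directory of sparse-plex
spx_setup;                   % run the sparse-plex setup function
cd('C:\work');               % change back to your preferred directory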

Note

Make sure that MATLAB has write permissions to the directory in which you install sparse-plex. Some functions in sparse-plex create some MAT files for caching of intermediate results. Moreover, the sparse-plex setup script also creates a local settings file. For creating these files, write access is needed.

Getting acquainted

The online library documentation includes a number of step-by-step demonstrations. Follow these tutorials to get familiar with the library.

Running examples

  • Change directory to the root directory of sparse-plex.
  • Go into examples directory.
  • Browse the examples.
  • Run the example you want.

Checking the source code

  • Change directory to the root directory of sparse-plex.
  • Go into library directory.
  • Browse the source code.
    • The source code for spx library is maintained in +spx directory.
    • Unit-tests for the library are maintained in tests directory.

Verifying the installation

A number of unit tests are included in the software to verify its integrity. The unit tests are based on MATLAB’s built-in unit testing framework. A minimal session is sketched after the steps below.

  • Change directory to the root directory of sparse-plex.
  • Move to the directory library/tests.
  • Execute the runalltests.m script.
  • Verify that all unit tests pass.
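A minimal session for the steps above (assuming your current directory is the sparse-plex root):

cd library/tests
runalltests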

Building MATLAB Extensions

Some of the fast implementations of various algorithms are written in C as MATLAB extensions. You will need to build them before using them.

This section assumes that you have the necessary build tools available in your MATLAB installation. See What You Need to Build MEX Files for details.

  • Go to the sparse-plex\library\+spx\+fast\private directory inside MATLAB.
  • Run the make.m script.

The script make.m contains the necessary commands to invoke the mex compiler on each of the source files in this private directory. The script builds only those files which have been modified since the last build.

Building documentation

Only if you really want to do it! Normally, you can read it online.

You will need Python Sphinx and related packages (such as the Pygments library) to build the documentation from scratch.

  • Change directory to the root directory of sparse-plex.
  • Go into docs directory.
  • Build the documentation using Sphinx tool chain.

Here is the command for building the documentation automatically as changes are made to it:

sphinx-autobuild --port=9102 . _build\html

Configuring test data directories

Several examples in sparse-plex are developed on top of standard data sets. These include (but are not limited to):

  • Standard test images
  • Yale Extended B Faces database (cropped images)

In order to execute these examples, access to the data is needed. The data is not distributed along with this software. You can download the data and store it on your computer wherever you wish. To provide access to this data, you need to tell sparse-plex where the data resides. This is done by editing the spx_local.ini file. When you download and unzip the library, this file doesn’t exist; when you run spx_setup, spx_defaults.ini is copied to spx_local.ini.

All you need to do is point these settings to the directories which hold the test datasets.

Specific settings in spx_local.ini are:

  • standard_test_images_dir
  • yale_faces_db_dir

For more information, read the file.
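For illustration, the relevant entries in spx_local.ini might look like the following (the paths are hypothetical and the exact layout should mirror spx_defaults.ini):

standard_test_images_dir=C:\datasets\standard_test_images
yale_faces_db_dir=C:\datasets\CroppedYale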

Demos

Dirac DCT Tutorial

This tutorial is based on examples\ex_dirac_dct_two_ortho_basis.m.

In this tutorial we will:

  • Construct a DCT basis
  • Construct a Dirac-DCT dictionary.
  • Construct a signal which is a mixture of few impulses and a few sinusoids.
  • Construct its representation in the DCT basis.
  • Recover its representation in the Dirac-DCT dictionary using the following sparse recovery algorithms:
    • Matching Pursuit
    • Orthogonal Matching Pursuit
    • Basis Pursuit
  • Measure the recovery error for different sparse recovery algorithms.

Signal space dimension:

N = 256;

Dirac basis:

I = eye(N);

DCT basis:

Psi = dctmtx(N)';
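Since dctmtx returns an orthonormal DCT matrix, the columns of Psi form an orthonormal basis. A quick sanity check (not part of the original example):

norm(Psi' * Psi - eye(N))   % should be on the order of machine precision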

Visualizing the DCT basis:

imagesc(Psi) ;
colormap(gray);
colorbar;
axis image;
title('\Psi');
_images/dct_256.png

Combining the Dirac and DCT orthonormal bases to form a two-ortho dictionary:

Phi = [I  Psi];

Visualizing the dictionary:

imagesc(Phi) ;
colormap(gray);
colorbar;
axis image;
title('\Phi');
_images/dirac_dct_256.png

Constructing a signal which is a combination of impulses and cosines:

alpha = zeros(2*N, 1);
alpha(20) = 1;
alpha(30) = -.4;
alpha(100) = .6;
alpha(N + 4) = 1.2;
alpha(N + 58) = -.8;
x = Phi * alpha;
K  = 5;
_images/impulse_cosine_combination_signal.png
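The coefficient vector alpha has exactly five non-zero entries, so the signal is 5-sparse in this dictionary. A quick check (not part of the original script):

nnz(alpha)   % returns 5, matching K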

Finding the representation in DCT basis:

x_dct = Psi' * x;
_images/impulse_cosine_dct_basis.png

Sparse representation in the Dirac DCT dictionary

_images/impulse_cosine_dirac_dct.png

Obtaining the sparse representation using matching pursuit algorithm:

solver = spx.pursuit.single.MatchingPursuit(Phi, K);
result = solver.solve(x);
mp_solution = result.z;
mp_diff = alpha - mp_solution;
% Recovery error
mp_recovery_error = norm(mp_diff) / norm(x);
_images/dirac_dct_mp_solution.png

Matching pursuit recovery error: 0.0353.

Obtaining the sparse representation using orthogonal matching pursuit algorithm:

solver = spx.pursuit.single.OrthogonalMatchingPursuit(Phi, K);
result = solver.solve(x);
omp_solution = result.z;
omp_diff = alpha - omp_solution;
% Recovery error
omp_recovery_error = norm(omp_diff) / norm(x);
_images/dirac_dct_omp_solution.png

Orthogonal Matching pursuit recovery error: 0.0000.

Obtaining a sparse approximation via basis pursuit:

solver = spx.pursuit.single.BasisPursuit(Phi, x);
result = solver.solve_l1_noise();
l1_solution = result;
l1_diff = alpha - l1_solution;
% Recovery error
l1_recovery_error = norm(l1_diff) / norm(x);
_images/dirac_dct_l_1_solution.png

l_1 recovery error: 0.0010.

Basic CS Tutorial

This tutorial is based on examples\ex_simple_compressed_sensing_demo.m.

In this tutorial we will:

  • Create sparse signals (with Gaussian and bi-uniform distributed non-zero samples).
  • Look at how to identify support of a signal.
  • Construct a Gaussian sensing matrix.
  • Visualize the sensing matrix.
  • Compute random measurements on the sparse signal with the sensing matrix.
  • Add measurement noise to the measurements.
  • Recover the sparse vector using the following sparse recovery algorithms:
    • Matching Pursuit
    • Orthogonal Matching Pursuit
    • Basis Pursuit
  • Measure the recovery error for different sparse recovery algorithms.

Basic setup:

% Signal space
N = 1000;
% Number of measurements
M = 200;
% Sparsity level
K = 8;

Choosing the support randomly:

Omega = randperm(N, K);

Constructing a sparse vector with Gaussian entries:

% Initializing a zero vector
x = zeros(N, 1);
% Filling it with non-zero Gaussian entries at specified support
x(Omega) = 4 * randn(K, 1);
_images/k_sparse_gaussian_signal.png

Constructing a bi-uniform sparse vector:

a = 1;
b = 2;
% unsigned magnitudes of non-zero entries
xm = a + (b-a).*rand(K, 1);
% Generate sign for non-zero entries randomly
sgn = sign(randn(K, 1));
% Combine sign and magnitude
x(Omega) = sgn .* xm;
_images/k_sparse_biuniform_signal.png

Identifying support:

find(x ~= 0)'
% 98   127   277   544   630   815   905   911

Constructing a Gaussian sensing matrix:

Phi = randn(M, N);
% Scale so that each entry has variance 1/M (standard deviation 1/sqrt(M))
Phi = Phi ./ sqrt(M);

Computing norm of each column:

column_norms = sqrt(sum(Phi .* conj(Phi)));

Norm histogram

_images/guassian_sensing_matrix_histogram.png

Constructing a Gaussian dictionary with normalized columns:

for i=1:N
    v = column_norms(i);
    % Scale it down
    Phi(:, i) = Phi(:, i) / v;
end
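As an aside, the loop above can be replaced by a single vectorized statement (an equivalent alternative; implicit expansion requires MATLAB R2016b or later):

% equivalent to the loop above; run one or the other, not both
Phi = Phi ./ column_norms;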

Visualizing the sensing matrix:

imagesc(Phi) ;
colormap(gray);
colorbar;
axis image;
_images/gaussian_matrix.png

Making random measurements from sparse high dimensional vector:

y0 = Phi * x;
_images/measurement_vector_biuniform.png

Adding some measurement noise:

SNR = 15;
snr = db2pow(SNR);
noise = randn(M, 1);
% compute the norms of the signal and the noise
signalNorm = norm(y0);
noiseNorm = norm(noise);
actualNormRatio = signalNorm / noiseNorm;
requiredNormRatio = sqrt(snr);
gain_factor = actualNormRatio  / requiredNormRatio;
noise = gain_factor .* noise;
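We can verify that the scaled noise achieves the desired SNR (a quick check, not part of the original script):

achieved_snr = 20 * log10(norm(y0) / norm(noise))   % 15 dB by construction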

Measurement vector with noise:

y = y0 + noise;
_images/measurement_vector_biuniform_noisy.png

Sparse recovery using matching pursuit:

solver = spx.pursuit.single.MatchingPursuit(Phi, K);
result = solver.solve(y);
mp_solution = result.z;

Recovery error:

mp_diff = x - mp_solution;
mp_recovery_error = norm(mp_diff) / norm(x);
_images/cs_matching_pursuit_solution.png

Matching pursuit recovery error: 0.1612.

Sparse recovery using orthogonal matching pursuit:

solver = spx.pursuit.single.OrthogonalMatchingPursuit(Phi, K);
result = solver.solve(y);
omp_solution = result.z;
omp_diff = x - omp_solution;
omp_recovery_error = norm(omp_diff) / norm(x);
_images/cs_orthogonal_matching_pursuit_solution.png

Orthogonal Matching pursuit recovery error: 0.0301.

Sparse recovery using l_1 minimization:

solver = spx.pursuit.single.BasisPursuit(Phi, y);
result = solver.solve_l1_noise();
l1_solution = result;
l1_diff = x - l1_solution;
l1_recovery_error = norm(l1_diff) / norm(x);
_images/cs_l_1_minimization_solution.png

l_1 recovery error: 0.1764.

Sparse Signal Models

Outline

In this chapter we develop initial concepts of sparse signal models.

_images/sparse_representation_framework.png

A bird’s eye view of the sparse representations and compressive sensing framework. Signals (like speech, images, etc.) reside in a signal space \(\RR^N\). Analytical or trained dictionaries can be constructed such that the signals can have a sparse representation in such dictionaries. These sparse representations reside in a representation space \(\RR^D\). A sparse approximation algorithm \(\Delta_a\) can construct a representation \(\alpha\) for a signal \(x\) in the dictionary \(\DDD\). The approximation error is \(e\). A small number of \(M\) random measurements are sufficient to capture all the information in \(x\). The sensing process \(y = \Phi x + n\) constructs the measurement vector \(y \in \RR^M\) for a given signal where \(n\) is the measurement noise. In order to get \(x\) from \(y\), we first need to recover the sparse representation \(\alpha\) using the sparse recovery algorithm \(\Delta_r\). Then \(x \approx \DDD \alpha\).

We begin our study with a review of solutions of under-determined systems. We build a case for solutions which promote sparsity.

We show that although real-life signals may not be exactly sparse, they are often compressible and can be well approximated by sparse signals.

We then review orthonormal bases and explain their inadequacy in exploiting the sparsity of many signals of interest. We develop the Dirac-Fourier dictionary as an example of a two-ortho dictionary and demonstrate how it can exploit signal sparsity better than the Dirac basis or the Fourier basis individually.

We follow this with a general discussion of redundant signal dictionaries. We show how they can be used to create sparse and redundant signal representations.

We study various properties of signal dictionaries which are useful in characterizing the capabilities of a signal dictionary in exploiting signal sparsity.

In this chapter, our signals of interest will typically lie in the finite \(N\)-dimensional complex vector space \(\CC^N\). Sometimes we will restrict our attention to the \(N\) dimensional Euclidean space to simplify discussion.

We will be concerned with different representations of our signals of interest in \(\CC^D\) where \(D \geq N\). This aspect will become clearer as we go along in this chapter.

Sparsity

We quickly define the notion of sparsity in a signal.

We recall the definition of \(l_0\)-“norm” (don’t forget the quotes) of \(x \in \CC^N\) given by

\[\| x \|_0 = | \supp(x) |\]

where \(\supp(x) = \{ i : x_i \neq 0\}\) denotes the support of \(x\).

Informally, we say that a signal \(x \in \CC^N\) is sparse if \(\| x \|_0 \ll N\).

More generally if \(x =\DDD \alpha\) where \(\DDD \in \CC^{N \times D}\) with \(D > N\) is some signal dictionary (to be formally defined later), then \(x\) is sparse in dictionary \(\DDD\) if \(\| \alpha \|_0 \ll D\).

Sometimes we simply say that \(x\) is \(K\)-sparse if \(\| x \|_0 \leq K\) where \(K < N\). We do not specifically require that \(K \ll N\).

An even more general definition of sparsity is the degrees of freedom a signal may have.

As an example, consider all points on the surface of the unit sphere in \(\RR^N\). Every point \(x\) on the surface satisfies \(\| x \|_2 = 1\). Thus if we choose the values of \(N-1\) components of \(x\), the value of the remaining component is fixed up to a sign. Thus the number of degrees of freedom \(x\) has on the surface of the unit sphere in \(\RR^N\) is \(N-1\). Such a surface is a manifold in the ambient Euclidean space. Of special interest are low-dimensional manifolds where the number of degrees of freedom is \(K \ll N\).

Sparse solutions for under-determined linear systems

The discussion in this section is largely based on chapter 1 of [Ela10].

Consider a matrix \(\Phi \in \CC^{M \times N}\) with \(M < N\).

Define an under-determined system of linear equations:

\[\Phi x = y\]

where \(y \in \CC^M\) is known and \(x \in \CC^N\) is unknown.

This system has \(N\) unknowns and \(M\) linear equations. There are more unknowns than equations.

Let the columns of \(\Phi\) be given by \(\phi_1, \phi_2, \dots, \phi_N\).

Column space of \(\Phi\) (vector space spanned by all columns of \(\Phi\)) is denoted by \(\ColSpace(\Phi)\) i.e.

\[\ColSpace(\Phi) = \left \{ \sum_{i=1}^{N} c_i \phi_i \; : \; c_i \in \CC \right \}.\]

We know that \(\ColSpace(\Phi) \subset \CC^M\).

Clearly \(\Phi x \in \ColSpace(\Phi)\) for every \(x \in \CC^N\). Thus if \(y \notin \ColSpace(\Phi)\), then we have no solution. But if \(y \in \ColSpace(\Phi)\), then we have infinitely many solutions.

Let \(\NullSpace(\Phi)\) represent the null space of \(\Phi\) given by

\[\NullSpace(\Phi) = \{ x \in \CC^N : \Phi x = 0\}.\]

Let \(\widehat{x}\) be a solution of \(y = \Phi x\). And let \(z \in \NullSpace(\Phi)\). Then

\[\Phi (\widehat{x} + z) = \Phi \widehat{x} + \Phi z = y + 0 = y.\]

Thus the set \(\widehat{x} + \NullSpace(\Phi)\) forms the complete set of solutions to the problem \(y = \Phi x\), where

\[\widehat{x} + \NullSpace(\Phi) = \{\widehat{x} + z \; : \; z \in \NullSpace(\Phi)\}.\]
Example: An under-determined system

As a running example in this section, we will consider a simple under-determined system in \(\RR^2\). The system is specified by

\[\Phi = \begin{bmatrix} 3 & 4 \end{bmatrix}\]

and

\[\begin{split}x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\end{split}\]

with

\[\Phi x = y = 12.\]

where \(x\) is unknown and \(y\) is known. Alternatively

\[\begin{split}\begin{bmatrix} 3 & 4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = 12\end{split}\]

or more simply

\[3 x_1 + 4 x_2 = 12.\]

The solution space of this system is a line in \(\RR^2\) which is shown in the figure below.

_images/underdetermined_system.png

The specification of the under-determined system as above doesn’t give us any reason to prefer one particular point on the line as the solution.

Two specific solutions are of interest:

  • \((x_1, x_2) = (4,0)\) lies on the \(x_1\) axis.
  • \((x_1, x_2) = (0,3)\) lies on the \(x_2\) axis.

In both of these solutions, one component is 0, which makes these solutions sparse.

It is easy to visualize sparsity in this simplified 2-dimensional setup, but the situation becomes more difficult when we are looking at high-dimensional signal spaces. We need well-defined criteria to promote sparse solutions.

Regularization

Are all these solutions equivalent, or can we say that one solution is better than another in some sense? In order to suggest that one solution is better than the others, we need to define a criterion for comparing two solutions.

In optimization theory, this idea is known as regularization.

We define a cost function \(J(x) : \CC^N \to \RR\) which measures the desirability of a given solution \(x\) out of the infinitely many possible solutions. The higher the cost, the lower the desirability of the solution.

Thus the goal of the optimization problem is to find a desired \(x\) with minimum possible cost.

In optimization literature, the cost function is one type of objective function. While the objective of an optimization problem might be either minimized or maximized, cost is always minimized.

We can write this optimization problem as

\[\begin{split}\begin{aligned} & \underset{x}{\text{minimize}} & & J(x) \\ & \text{subject to} & & y = \Phi x. \end{aligned}\end{split}\]

If \(J(x)\) is convex, then it is possible to find a global minimum cost solution over the solution set.

If \(J(x)\) is not convex, then it may not be possible to find a global minimum; we may have to settle for a local minimum.

A variety of such cost function based criteria can be considered.

\(l_2\) Regularization

One of the most common criteria is to choose a solution with the smallest \(l_2\) norm.

The problem can then be reformulated as an optimization problem

\[\begin{split}\begin{aligned} & \underset{x}{\text{minimize}} & & \| x \|_2 \\ & \text{subject to} & & y = \Phi x. \end{aligned}\end{split}\]

In fact, minimizing \(\| x \|_2\) is the same as minimizing its square \(\| x \|_2^2 = x^H x\).

So an equivalent formulation is

\[\begin{split}\begin{aligned} & \underset{x}{\text{minimize}} & & x^H x \\ & \text{subject to} & & y = \Phi x. \end{aligned}\end{split}\]
Example: Minimum \(l_2\) norm solution for an under-determined system

We continue with our running example.

We can write \(x_2\) as

\[x_2 = 3 - \frac{3}{4} x_1.\]

With this definition the squared \(l_2\) norm of \(x\) becomes

\[\begin{split}\| x \|_2^2 = x_1^2 + x_2^2 &= x_1^2 + \left ( 3 - \frac{3}{4} x_1 \right )^2\\ & = \frac{25}{16} x_1^2 - \frac{9}{2} x_1 + 9.\end{split}\]

Minimizing \(\| x \|_2^2\) over all \(x\) is the same as minimizing it over all \(x_1\).

Since \(\| x \|_2^2\) is a quadratic function of \(x_1\), we can simply differentiate it and equate the derivative to 0, giving us

\[\frac{25}{8} x_1 - \frac{9}{2} = 0 \implies x_1 = \frac{36}{25} = 1.44.\]

This gives us

\[x_2 = \frac{48}{25} = 1.92.\]

Thus the optimal \(l_2\) norm solution is obtained at \((x_1, x_2) = (1.44, 1.92)\).

We note that the minimum \(l_2\) norm at this solution is

\[\| x \|_2 = \frac{12}{5} = 2.4.\]

It is instructive to note that the \(l_2\) norm cost function prefers a non-sparse solution to the optimization problem.

We can view this solution graphically by drawing \(l_2\) norm balls of different radii in figure below. The ball which just touches the solution space line (i.e. the line is tangent to the ball) gives us the optimal solution.

_images/underdetermined_system_l2_balls.png

All other norm balls either don’t touch the solution line at all, or they cross it at exactly two points.

A formal solution to the \(l_2\) norm minimization problem can be easily obtained using Lagrange multipliers.

We define the Lagrangian

\[\mathcal{L}(x) = \|x\|_2^2 + \lambda^H (\Phi x - y)\]

with \(\lambda \in \CC^M\) being the Lagrange multipliers for the (equality) constraint set.

Differentiating \(\mathcal{L}(x)\) w.r.t. \(x\) we get

\[\frac{\partial \mathcal{L}(x)} {\partial x} = 2 x + \Phi^H \lambda.\]

By equating the derivative to \(0\) we obtain the optimal value of \(x\) as

\[x^* = - \frac{1}{2} \Phi^H \lambda.\]

Plugging this solution back into the constraint \(\Phi x= y\) gives us

\[\Phi x^* = - \frac{1}{2} (\Phi \Phi^H) \lambda= y\implies \lambda = -2(\Phi \Phi^H)^{-1} y.\]

In the above, we implicitly assume that \(\Phi\) is a full rank matrix; thus, \(\Phi \Phi^H\) is invertible and positive definite.

Substituting \(\lambda\) back into the expression above, we obtain the well-known closed-form least squares solution via the pseudo-inverse

\[x^* = \Phi^H (\Phi \Phi^H)^{-1} y = \Phi^{\dag} y.\]
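We can quickly verify this closed-form expression on the running example \(3 x_1 + 4 x_2 = 12\) (a minimal check, not library code):

Phi = [3 4];
y = 12;
x_star = Phi' * ((Phi * Phi') \ y)   % returns [1.44; 1.92], same as pinv(Phi) * y
norm(x_star)                         % returns 2.4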

We would like to mention that there are several iterative approaches to solve the \(l_2\) norm minimization problem (like gradient descent and conjugate gradient descent). For large systems, they are more effective than computing the pseudo-inverse.

The beauty of \(l_2\) norm minimization lies in its simplicity and availability of closed form analytical solutions. This has led to its prevalence in various fields of science and engineering. But \(l_2\) norm is by no means the only suitable cost function. Rather the simplicity of \(l_2\) norm often drives engineers away from trying other possible cost functions. In the sequel, we will look at various other possible cost functions.

Convexity

Convex optimization problems have the attractive feature that it is possible to find a globally optimal solution whenever one exists.

The solution space \(\Omega = \{x : \Phi x = y\}\) is convex. Thus the feasible set of the optimization problem is also convex. All that remains is to choose a cost function \(J(x)\) which is convex. This will ensure that a global minimum can be found through convex optimization techniques. Moreover, if \(J(x)\) is strictly convex, then the global minimum is guaranteed to be unique. Thus, even though we may not have a nice closed-form expression for the solution of a strictly convex cost minimization problem, the guaranteed existence and uniqueness of the solution, as well as well-developed algorithms for solving the problem, make convex cost functions very appealing.

We recall that all \(l_p\) norms with \(p \geq 1\) are convex functions. In particular, the \(l_{\infty}\) and \(l_1\) norms are very interesting and popular, where

\[l_{\infty}(x) = \underset{1 \leq i \leq N}{\max} \; |x_i|\]

and

\[l_1(x) = \sum_{i=1}^{N} |x_i|.\]

In the following section we will attempt to find a unique solution to our optimization problem using \(l_1\) norm.

\(l_1\) Regularization

In this section we will restrict our attention to the Euclidean space case where \(x \in \RR^N\), \(\Phi \in \RR^{M \times N}\) and \(y \in \RR^M\).

We choose our cost function \(J(x) = l_1(x)\).

The cost minimization problem can be reformulated as

\[\begin{split}\begin{aligned} & \underset{x}{\text{minimize}} & & \| x \|_1 \\ & \text{subject to} & & \Phi x = y. \end{aligned}\end{split}\]
Example: Minimum \(l_1\) norm solution for an under-determined system

We continue with our running example.

Again we can view this solution graphically by drawing \(l_1\) norm balls of different radii in the figure below. The ball which just touches the solution space line gives us the optimal solution.

_images/underdetermined_system_l1_balls.png

As we can see from the figure the minimum \(l_1\) norm solution is given by \((x_1,x_2) = (0,3)\).

It is interesting to note that the \(l_1\) norm promotes sparser solutions, while the \(l_2\) norm promotes solutions in which the signal energy is distributed amongst all components.

It’s time to have a closer look at our cost function \(J(x) = \|x \|_1\). This function is convex yet not strictly convex.

Example: \(\| x\|_1\) is not strictly convex

Consider again \(x \in \RR^2\). For \(x \in \RR_+^2\) (the first quadrant),

\[\|x \|_1 = x_1 + x_2.\]

Hence for any \(c_1, c_2 \geq 0\) and \(x, y \in \RR_+^2\):

\[\|(c_1 x + c_2 y)\|_1 = (c_1 x + c_2 y)_1 + (c_1 x + c_2 y)_2 = c_1 \| x\|_1 + c_2 \| y \|_1.\]

Thus, \(l_1\)-norm is not strictly convex. Consequently, a unique solution may not exist for \(l_1\) norm minimization problem.

As an example consider the under-determined system

\[3 x_1 + 3 x_2 = 12.\]

We can easily visualize that the solution line passes through the points \((0,4)\) and \((4,0)\). Moreover, in the first quadrant it is parallel to the boundary of the \(l_1\)-norm ball of radius \(4\); see again the figure above. This gives us infinitely many solutions to the minimization problem.

We can still observe that

  • These solutions are gathered in a small line segment that is bounded (a bounded convex set) and
  • There exist two solutions \((4,0)\) and \((0,4)\) amongst these solutions which have only 1 non-zero component.

For the \(l_1\) norm minimization problem, since \(J(x)\) is not strictly convex, a unique solution is not guaranteed. In specific cases, there may be infinitely many solutions. Yet what we can claim is that:

  • these solutions are gathered in a set that is bounded and convex, and
  • among these solutions, there exists at least one solution with at most \(M\) non-zeros (the number of constraints in \(\Phi x = y\)).

The theorem below establishes the second claim.

Theorem
Let \(S\) denote the solution set of the \(l_1\) norm minimization problem. \(S\) contains at least one solution \(\widehat{x}\) with \(\| \widehat{x} \|_0 \leq M\).
Proof

We have

  • \(S\) is convex and bounded.
  • \(\Phi x^* = y \, \Forall x^* \in S\).
  • Since \(\Phi \in \RR^{M \times N}\) is full rank and \(M < N\), we have \(\text{rank}(\Phi) = M\).

Let \(x^* \in S\) be an optimal solution with \(\| x^* \|_0 = L > M\).

Consider the \(L\) columns of \(\Phi\) which correspond to \(\supp(x^*)\).

Since \(L > M\) and \(\text{rank}(\Phi) = M\), these columns are linearly dependent.

Thus there exists a vector \(h \in \RR^N\) with \(\supp(h) \subseteq \supp(x^*)\) such that

\[\Phi h = 0.\]

Note that since we are only considering those columns of \(\Phi\) which correspond to \(\supp(x^*)\), we require \(h_i = 0\) whenever \(x^*_i = 0\).

Consider a new vector

\[x = x^* + \epsilon h\]

where \(\epsilon\) is small enough such that every element in \(x\) has the same sign as \(x^*\).

As long as

\[|\epsilon| \leq \underset{i \in \supp(x^*)}{\min} \frac{|x^*_i|}{|h_i|} = \epsilon_0\]

such an \(x\) can be constructed.

Note that \(x_i = 0\) whenever \(x^*_i = 0\).

Clearly

\[\Phi x = \Phi (x^* + \epsilon h) = y + \epsilon 0 = y.\]

Thus \(x\) is a feasible solution to the \(l_1\) minimization problem, though it need not be an optimal solution.

But since \(x^*\) is optimal, the \(l_1\) norm of \(x\) must be greater than or equal to the \(l_1\) norm of \(x^*\):

\[\|x \|_1 = \|x^* + \epsilon h \|_1 \geq \| x^* \|_1 \Forall |\epsilon| \leq \epsilon_0.\]

Now look at \(\|x \|_1\) as a function of \(\epsilon\) in the region \(|\epsilon| \leq \epsilon_0\).

In this region, \(l_1\) function is continuous and differentiable since all vectors \(x^* + \epsilon h\) have the same sign pattern. If we define \(y^* = | x^* |\) (the vector of absolute values), then

\[\| x^* \|_1 = \| y^* \|_1 = \sum_{i=1}^N y^*_i.\]

Since the sign patterns don’t change, we have

\[|x_i| = |x^*_i + \epsilon h_i | = y^*_i + \epsilon h_i \sgn(x^*_i).\]

Thus

\[\begin{split}\begin{aligned} \|x \|_1 &= \sum_{i=1}^N |x_i| \\ &= \sum_{i=1}^N \left (y^*_i + \epsilon h_i \sgn(x^*_i) \right) \\ &= \| x^* \|_1 + \epsilon \sum_{i=1}^N h_i \sgn(x^*_i)\\ &= \| x^* \|_1 + \epsilon h^T \sgn(x^*). \end{aligned}\end{split}\]

The quantity \(h^T \sgn(x^*)\) is a constant. The inequality \(\|x \|_1 \geq \| x^* \|_1\) applies to both positive and negative values of \(\epsilon\) in the region \(|\epsilon | \leq \epsilon_0\). This is possible only when inequality is in fact an equality.

This implies that the addition / subtraction of \(\epsilon h\) under these conditions does not change the \(l_1\) length of the solution. Thus, \(x \in S\) is also an optimal solution.

This can happen only if

\[h^T \sgn(x^*) = 0.\]

We now wish to tune \(\epsilon\) such that one entry in \(x^*\) gets nulled while keeping the solution's \(l_1\) length unchanged.

We choose \(i\) corresponding to \(\epsilon_0\) (defined above) and pick

\[\epsilon = \frac{-x^*_i}{h_i}.\]

Clearly for the corresponding

\[x = x^* + \epsilon h\]

the \(i\)-th entry is nulled while the others keep their signs, and the \(l_1\) norm is also preserved. Thus, we have a new optimal solution with at most \(L-1\) non-zeros. It is possible that more than one entry gets nulled in this operation.

We can repeat this procedure till we are left with \(M\) non-zero elements.

Beyond this point we cannot proceed, since once only \(M\) non-zeros remain, \(\text{rank}(\Phi) = M\) no longer lets us claim that the corresponding columns of \(\Phi\) are linearly dependent.

We thus note that the \(l_1\) norm has a tendency to prefer sparse solutions. This is a well-known and fundamental property of linear programming.

\(l_1\) norm minimization problem as a linear programming problem

We now show that \(l_1\) norm minimization problem in \(\RR^N\) is in fact a linear programming problem.

Recalling the problem:

(1)\[\begin{split}\begin{aligned} & \underset{x \in \RR^N}{\text{minimize}} & & \| x \|_1 \\ & \text{subject to} & & y = \Phi x. \end{aligned}\end{split}\]

Let us write \(x\) as \(u - v\), where \(u, v \in \RR^N\) are both non-negative vectors such that \(u\) collects the positive entries of \(x\) while \(v\) collects the magnitudes of its negative entries.

Example: \(x = u - v\)

Let

\[x = (-1, 0, 0, 2, 0, 0, 0, 4, 0, 0, -3, 0, 0, 0, 0, 2, 10).\]

Then

\[u = (0, 0, 0, 2, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 2, 10).\]

And

\[v = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0).\]

Clearly \(x = u - v\).

We note here that by definition

\[\supp(u) \cap \supp(v) = \EmptySet\]

i.e. support of \(u\) and \(v\) do not overlap.

We now construct a vector

\[\begin{split}z = \begin{bmatrix} u \\ v \end{bmatrix} \in \RR^{2N}.\end{split}\]

We can now verify that

\[\| x \|_1 = \|u\|_1 + \| v \|_1 = 1^T z.\]

And

\[\begin{split}\Phi x = \Phi (u - v) = \Phi u - \Phi v = \begin{bmatrix} \Phi & -\Phi \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} \Phi & -\Phi \end{bmatrix} z\end{split}\]

where \(z \succeq 0\).

Hence the optimization problem (1) can be recast as

(2)\[\begin{split}\begin{aligned} & \underset{z \in \RR^{2N}}{\text{minimize}} & & 1^T z \\ & \text{subject to} & & \begin{bmatrix} \Phi & -\Phi \end{bmatrix} z = y\\ & \text{and} & & z \succeq 0. \end{aligned}\end{split}\]

This optimization problem has the classic linear programming structure, since both the objective function and the constraints are affine.

Let \(z^* =\begin{bmatrix} u^* \\ v^* \end{bmatrix}\) be an optimal solution to the problem (2).

In order to show that the two optimization problems are equivalent, we need to verify that our assumption about the decomposition of \(x\) into positive entries in \(u\) and negative entries in \(v\) is indeed satisfied by the optimal solution \(u^*\), \(v^*\), i.e. that the supports of \(u^*\) and \(v^*\) do not overlap.

Since \(z \succeq 0\), we have \(\langle u^* , v^* \rangle \geq 0\). If the supports of \(u^*\) and \(v^*\) do not overlap, then \(\langle u^* , v^* \rangle = 0\); if they overlap, then \(\langle u^* , v^* \rangle > 0\).

Now for the sake of contradiction, let us assume that support of \(u^*\) and \(v^*\) do overlap for the optimal solution \(z^*\).

Let \(k\) be one of the indices at which both \(u_k \neq 0\) and \(v_k \neq 0\). Since \(z \succeq 0\), hence \(u_k > 0\) and \(v_k > 0\).

Without loss of generality let us assume that \(u_k > v_k > 0\).

In the equality constraint

\[\begin{split}\begin{bmatrix} \Phi & -\Phi \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = y\end{split}\]

Both of these coefficients multiply the same column of \(\Phi\) with opposite signs giving us a term

\[\phi_k (u_k - v_k).\]

Now if we replace the two entries in \(z^*\) by

\[u_k' = u_k - v_k\]

and

\[v_k' = 0\]

to obtain a new vector \(z'\), we see that there is no impact on the equality constraint

\[\begin{bmatrix} \Phi & -\Phi \end{bmatrix} z = y.\]

Also the positivity constraint

\[z \succeq 0\]

is satisfied. This means that \(z'\) is a feasible solution.

On the other hand, the objective function value \(1^T z\) reduces by \(2 v_k\) for \(z'\). This contradicts our assumption that \(z^*\) is an optimal solution.

Hence for the optimal solution of (2) we have

\[\supp(u^*) \cap \supp(v^*) = \EmptySet\]

thus

\[x^* = u^* - v^*\]

is indeed the desired solution for the optimization problem (1).
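As an illustration, the linear program (2) can be handed to a generic LP solver. The following is a minimal sketch using MATLAB's linprog from the Optimization Toolbox (not library code; the library's own solvers such as spx.pursuit.single.BasisPursuit are the intended tools), applied to the running example:

% minimize ||x||_1 subject to Phi x = y, recast as the LP (2)
Phi = [3 4]; y = 12;            % running example: 3 x_1 + 4 x_2 = 12
[M, N] = size(Phi);
f = ones(2*N, 1);               % objective 1^T z
Aeq = [Phi, -Phi];              % equality constraint [Phi -Phi] z = y
lb = zeros(2*N, 1);             % z >= 0
z = linprog(f, [], [], Aeq, y, lb);
x = z(1:N) - z(N+1:end)         % recover x = u - v; gives (0, 3)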

Dictionary based representations

Dictionaries

Definition: Dictionary

A dictionary for \(\CC^N\) is a finite collection \(\mathcal{D}\) of unit-norm vectors which span the whole space.

The elements of a dictionary are called atoms and they are denoted by \(\phi_{\omega}\) where \(\omega\) is drawn from an index set \(\Omega\).

The whole dictionary structure is written as

\[\mathcal{D} = \{\phi_{\omega} : \omega \in \Omega \}\]

where

\[\| \phi_{\omega} \|_2 = 1 \Forall \omega \in \Omega\]

and

\[x = \sum_{\omega \in \Omega} c_{\omega} \phi_{\omega} \Forall x \in \CC^N.\]

We use the letter \(D\) to denote the number of elements in the dictionary, i.e.

\[D = | \Omega |.\]

This definition is adapted from [Tro04].

The indices may have an interpretation, such as the time-frequency or time-scale localization of an atom, or they may simply be labels without any underlying meaning.

Note

In most cases, the dictionary is a matrix of size \(N \times D\) where \(D\) is the number of columns or atoms in the dictionary. The index set in this situation is \([1:D]\) which is the set of integers from 1 to \(D\).

Example

Let’s construct a simple Dirac-DCT dictionary of dimensions \(4 \times 8\).

>> A = spx.dict.simple.dirac_dct_mtx(4); A

A =

    1.0000         0         0         0    0.5000    0.6533    0.5000    0.2706
         0    1.0000         0         0    0.5000    0.2706   -0.5000   -0.6533
         0         0    1.0000         0    0.5000   -0.2706   -0.5000    0.6533
         0         0         0    1.0000    0.5000   -0.6533    0.5000   -0.2706

This dictionary consists of two parts. The left part is a \(4 \times 4\) identity matrix and the right part is a \(4 \times 4\) DCT matrix.

The rank of this dictionary is 4. Since the columns come from \(\RR^4\), any 5 columns are linearly dependent.

It is interesting to note that there exists a set of 4 columns in this dictionary which is linearly dependent.

>> B = A(:, [1, 4, 5, 7]); B

B =

    1.0000         0    0.5000    0.5000
         0         0    0.5000   -0.5000
         0         0    0.5000   -0.5000
         0    1.0000    0.5000    0.5000

>> rank(B)

ans =

     3

This is a crucial difference between an orthogonal basis and an overcomplete dictionary. In an orthogonal basis for \(\RR^N\), all \(N\) vectors are linearly independent. As we create overcomplete dictionaries, it is possible that there exist some subsets of columns of size \(N\) or less which are linearly dependent.

Let’s quickly examine the null space of \(B\):

>> c = null(B)

c =

   -0.5000
   -0.5000
    0.5000
    0.5000

>> B * c

ans =

   1.0e-16 *

    0.5551
   -0.2776
   -0.8327
   -0.2776

Note that the dictionary need not provide a unique representation for any vector \(x \in \CC^N\), but it provides at least one representation for each \(x \in \CC^N\).

Example: Non-unique representations

We will construct a vector in the null space of \(A\):

>> n = zeros(8,1); n([1,4,5,7]) = c; n

n =

   -0.5000
         0
         0
   -0.5000
    0.5000
         0
    0.5000
         0

Consider the vector:

>> x = [4 ,2,2,5]';

Following calculation shows two different representations of \(x\) in \(A\):

>> alpha  = [2, 0, 0, 3, 4, 0, 0, 0]'
>> A * alpha

ans =

     4
     2
     2
     5

>> A * (alpha + n)

ans =

     4
     2
     2
     5

>> beta = alpha + n

beta =

    1.5000
         0
         0
    2.5000
    4.5000
         0
    0.5000
         0

Both alpha and beta are valid representations of x in A. While alpha has 3 non-zero entries, beta has 4. In that sense, alpha is a sparser representation of x in A.

Constructing x from A requires only 3 columns if we choose the alpha representation, but it requires 4 columns if we choose the beta representation.

When \(D=N\) we have a set of unit norm vectors which span the whole of \(\CC^N\). Thus, we have a basis (not necessarily an orthonormal basis). A dictionary cannot have \(D < N\). The more interesting case is when \(D > N\).

Note

There are also applications of undercomplete dictionaries where the number of atoms \(D\) is less than the ambient space dimension \(N\). However, we will not be considering them unless specifically mentioned.

Redundant dictionaries and sparse signals

With \(D > N\), clearly there are more atoms than necessary to provide a representation of a signal \(x \in \CC^N\). Thus such a dictionary is able to provide multiple representations of the same vector \(x\). We call such dictionaries redundant dictionaries or over-complete dictionaries.

In contrast a basis with \(D=N\) is called a complete dictionary.

A special class of signals consists of those signals which have a sparse representation in a given dictionary \(\mathcal{D}\).

Definition
A signal \(x \in \CC^N\) is called \((\mathcal{D},K)\)-sparse if it can be expressed as a linear combination of at-most \(K\) atoms from the dictionary \(\mathcal{D}\) where \(K \ll D\).

It is usually expected that \(K \ll N\) also holds.

Let \(\Lambda \subset \Omega\) be a subset of indices with \(|\Lambda|=K\).

Let \(x\) be any signal in \(\CC^N\) such that \(x\) can be expressed as

\[x = \sum_{\lambda \in \Lambda} b_{\lambda} \phi_{\lambda} \quad \text{where } b_{\lambda} \in \CC.\]

Note that this is not the only possible representation of \(x\) in \(\mathcal{D}\). This is just one of the possible representations of \(x\). The special thing about this representation is that it is \(K\)-sparse i.e. only at most \(K\) atoms from the dictionary are being used.

Now there are \(\binom{D}{K}\) ways in which we can choose a set of \(K\) atoms from the dictionary \(\mathcal{D}\).

Thus the set of \((\mathcal{D},K)\)-sparse signals is given by

\[\Sigma_{(\mathcal{D},K)} = \left \{x \in \CC^N : x = \sum_{\lambda \in \Lambda} b_{\lambda} \phi_{\lambda} \text{ for some } \Lambda \subset \Omega \text{ with } |\Lambda|=K \right \}.\]

This set \(\Sigma_{(\mathcal{D},K)}\) is dependent on the chosen dictionary \(\mathcal{D}\). In the sequel, we will simply refer to it as \(\Sigma_K\).

Example: K-sparse signals for the standard basis

For the special case where \(\mathcal{D}\) is nothing but the standard basis of \(\CC^N\), then

\[\Sigma_K = \{ x : \|x \|_0 \leq K\}\]

i.e. the set of signals which has \(K\) or less non-zero elements.

Example

In contrast if we choose an orthonormal basis \(\Psi\) such that every \(x\in\CC^N\) can be expressed as

\[x = \Psi \alpha\]

then with the dictionary \(\mathcal{D} = \Psi\), the set of \(K\)-sparse signals is given by

\[\Sigma_K = \{ x = \Psi \alpha : \| \alpha \|_0 \leq K\}.\]

We also note that, for a fixed index set \(\Lambda\) with \(|\Lambda| = K < N\), the set of signals of the form \(x = \sum_{\lambda \in \Lambda} b_{\lambda} \phi_{\lambda}\) forms a subspace of \(\CC^N\).

So we have \(\binom{D}{K}\) such \(K\)-sparse subspaces associated with the dictionary \(\mathcal{D}\), and the \(K\)-sparse signals lie in the union of all these subspaces.

Sparse approximation problem

In the sparse approximation problem, we attempt to express a given signal \(x \in \CC^N\) as a linear combination of \(K\) atoms from the dictionary \(\mathcal{D}\), where \(K \ll N\) and typically \(N \ll D\), i.e. the number of atoms in the dictionary \(\mathcal{D}\) is typically much larger than the ambient signal space dimension \(N\).

Naturally we wish to obtain a best possible sparse representation of \(x\) over the atoms
\(\phi_{\omega} \in \mathcal{D}\) which minimizes the approximation error.

Let \(\Lambda\) denote the index set of atoms which are used to create a \(K\)-sparse representation of \(x\) where \(\Lambda \subset \Omega\) with \(|\Lambda| = K\).

Let \(x_{\Lambda}\) represent an approximation of \(x\) over the set of atoms indexed by \(\Lambda\).

Then we can write \(x_{\Lambda}\) as

\[x_{\Lambda} = \sum_{\lambda \in \Lambda} b_{\lambda} \phi_{\lambda} \quad \text{where } b_{\lambda} \in \CC.\]

We put all complex valued coefficients \(b_{\lambda}\) in the sum into a list \(b\).

The approximation error is given by

\[e = \| x - x_{\Lambda} \|_2.\]

We would like to minimize the approximation error over all possible choices of \(K\) atoms and corresponding set of coefficients \(b_{\lambda}\).

Thus the sparse approximation problem can be cast as a minimization problem given by

(1)\[\underset{|\Lambda| = K}{\text{min}} \, \underset{b}{\text{min}} \left \| x - \sum_{\lambda \in \Lambda} b_{\lambda} \phi_{\lambda} \right \|_2.\]

If we choose a particular \(\Lambda\), then the inner minimization problem becomes a straightforward least squares problem, as sketched below. But there are \(\binom{D}{K}\) possible choices of \(\Lambda\), and solving the inner least squares problem for each of them is prohibitively expensive.
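To make the inner problem concrete, here is a minimal sketch (illustrative, not library code) that solves the least squares fit for one particular choice of \(\Lambda\), using the small Dirac-DCT dictionary from the earlier example:

Phi = spx.dict.simple.dirac_dct_mtx(4);   % 4 x 8 Dirac-DCT dictionary
x = [4; 2; 2; 5];
Lambda = [1, 4, 5];                       % one candidate set of K = 3 atoms
b = Phi(:, Lambda) \ x;                   % inner least squares solution
e = norm(x - Phi(:, Lambda) * b)          % error is 0 here; x is exactly 3-sparse in Phi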

We reemphasize here that in this formulation we are using a fixed dictionary \(\mathcal{D}\) while the vector \(x \in \CC^N\) is arbitrary.

This problem is known as \((\mathcal{D}, K)\)-sparse approximation problem.

A related problem, known as the \((\mathcal{D}, K)\)-exact-sparse problem, is one where it is known a priori that \(x\) is a linear combination of at most \(K\) atoms from the given dictionary \(\mathcal{D}\), i.e. \(x\) is a \(K\)-sparse signal in the dictionary \(\mathcal{D}\) as defined in the previous section.

This formulation simplifies the minimization problem (1) since it is known a priori that for \(K\)-sparse signals, a \(0\) approximation error can be achieved. The only problem is to find a set of subspaces from the \(\binom{D}{K}\) possible \(K\)-sparse subspaces which are able to provide a \(K\)-sparse representation of \(x\) and amongst them choose one. It is imperative to note that even the \(K\)-sparse representation need not be unique.

Clearly the exact-sparse problem is simpler than the sparse approximation problem. Thus, if the exact-sparse problem is NP-hard, then so is the harder sparse approximation problem. It is expected that studying the exact-sparse problem will provide insights into solving the sparse approximation problem.

It would be useful to get some uniqueness conditions for general dictionaries which guarantee that the sparse representation of a vector is unique in the dictionary. Such conditions would help us guarantee the uniqueness of exact-sparse problem.

Synthesis and analysis

The atoms of a dictionary \(\mathcal{D}\) can be organized into a \(N \times D\) matrix as follows:

\[\Phi \triangleq \begin{bmatrix} \phi_{\omega_1} & \phi_{\omega_2} & \dots & \phi_{\omega_D} \end{bmatrix}.\]

where \(\Omega = \{\omega_1, \omega_2, \dots, \omega_D\}\) is the index set for the atoms of \(\mathcal{D}\). We remind that \(\phi_{\omega} \in \CC^N\); hence each atom has a column vector representation in the standard basis for \(\CC^N\).

The order of columns doesn’t matter as long as it remains fixed once chosen.

Thus, in matrix terminology, a representation of \(x \in \CC^N\) in the dictionary can be written as

\[x = \Phi b\]

where \(b \in \CC^D\) is a vector of coefficients to produce a superposition \(x\) from the atoms of dictionary \(\mathcal{D}\). Clearly with \(D > N\), \(b\) is not unique. Rather for every vector \(z \in \NullSpace(\Phi)\), we have:

\[\Phi (b + z) = \Phi b + \Phi z = x + 0 = x.\]
Definition
The matrix \(\Phi\) is called a synthesis matrix since \(x\) is synthesized from the columns of \(\Phi\) with the coefficient vector \(b\).

We can also view the synthesis matrix \(\Phi\) as a linear operator from \(\CC^D\) to \(\CC^N\).

There is another way to look at \(x\) through \(\Phi\).

Definition: Analysis matrix

The conjugate transpose \(\Phi^H\) of the synthesis matrix \(\Phi\) is called the analysis matrix. It maps a given vector \(x \in \CC^N\) to a list of inner products with the dictionary:

\[c = \Phi^H x\]

where \(c \in \CC^D\).

Remark
Note that in general \(x \neq \Phi (\Phi^H x)\) unless \(\mathcal{D}\) is an orthonormal basis.
Definition: \((\mathcal{D}, K)\)-exact-sparse problem

With the help of synthesis matrix \(\Phi\), the \((\mathcal{D}, K)\)-exact-sparse can now be written as

(2)\[\begin{split}\begin{aligned} & \underset{\alpha}{\text{minimize}} & & \| \alpha \|_0 \\ & \text{subject to} & & x = \Phi \alpha\\ & \text{and} & & \| \alpha \|_0 \leq K \end{aligned}\end{split}\]
Definition: \((\mathcal{D}, K)\)-sparse approximation problem

With the help of synthesis matrix \(\Phi\), the \((\mathcal{D}, K)\)-sparse approximation can now be written as

(3)\[\begin{split}\begin{aligned} & \underset{\alpha}{\text{minimize}} & & \| x - \Phi \alpha \|_2 \\ & \text{subject to} & & \| \alpha \|_0 \leq K. \end{aligned}\end{split}\]

p-norms and sparse signals

l1 , l2 and max norms

There are some simple and useful results on relationships between different \(p\)-norms listed in this section. We also discuss some interesting properties of \(l_1\)-norm specifically.

Definition

Let \(v \in \CC^N\). Let the entries in \(v\) be represented as

\[v_i = r_i \exp (j \theta_i)\]

where \(r_i = | v_i |\) with the convention that \(\theta_i = 0\) whenever \(r_i = 0\).

The sign vector for \(v\) denoted by \(\sgn(v)\) is defined as

\[\begin{split}\sgn(v) = \begin{bmatrix}\sgn(v_1) \\ \vdots \\ \sgn(v_N) \end{bmatrix}\end{split}\]

where

\[\begin{split}\sgn(v_i) = \left\{ \begin{array}{ll} \exp (j \theta_i) & \mbox{if } r_i \neq 0;\\ 0 & \mbox{if } r_i = 0. \end{array} \right.\end{split}\]
Lemma

For any \(v \in \CC^N\) :

\[\| v \|_1 = \sgn(v)^H v = \langle v , \sgn(v) \rangle.\]
Proof
\[\| v \|_1 = \sum_{i=1}^N r_i = \sum_{i=1}^N \left [r_i e^{j \theta_i} \right ] e^{- j \theta_i} = \sum_{i=1}^N v_i e^{- j \theta_i} = \sgn(v)^H v.\]

Note that whenever \(v_i = 0\), corresponding \(0\) entry in \(\sgn(v)\) has no effect on the sum.

Lemma

Suppose \(v \in \CC^N\). Then

\[\| v \|_2 \leq \| v\|_1 \leq \sqrt{N} \| v \|_2.\]
Proof

For the lower bound, we go as follows

\[\| v \|_2^2 = \sum_{i=1}^N | v_i|^2 \leq \left ( \sum_{i=1}^N | v_i|^2 + \sum_{i \neq j} | v_i | | v_j| \right ) = \left ( \sum_{i=1}^N | v_i| \right )^2 = \| v \|_1^2.\]

This gives us

\[\| v \|_2 \leq \| v \|_1.\]

We can write \(l_1\) norm as

\[\| v \|_1 = \langle v, \sgn (v) \rangle.\]

By the Cauchy-Schwarz inequality we have

\[\langle v, \sgn (v) \rangle \leq \| v \|_2 \| \sgn (v) \|_2\]

Since \(\sgn(v)\) can have at most \(N\) non-zero values, each with magnitude 1,

\[\| \sgn (v) \|_2^2 \leq N \implies \| \sgn (v) \|_2 \leq \sqrt{N}.\]

Thus, we get

\[\| v \|_1 \leq \sqrt{N} \| v \|_2.\]
Lemma

Let \(v \in \CC^N\). Then

\[\| v \|_2 \leq \sqrt{N} \| v \|_{\infty}\]
Proof
\[\| v \|_2^2 = \sum_{i=1}^N | v_i |^2 \leq N \underset{1 \leq i \leq N}{\max} ( | v_i |^2) = N \| v \|_{\infty}^2.\]

Thus

\[\| v \|_2 \leq \sqrt{N} \| v \|_{\infty}.\]
Lemma

Let \(v \in \CC^N\). Let \(1 \leq p, q \leq \infty\). Then

\[\| v \|_q \leq \| v \|_p \text{ whenever } p \leq q.\]
Proof
TBD
Lemma

Let \(\OneVec \in \CC^N\) be the vector of all ones i.e. \(\OneVec = (1, \dots, 1)\). Let \(v \in \CC^N\) be some arbitrary vector. Let \(| v |\) denote the vector of absolute values of entries in \(v\). i.e. \(|v|_i = |v_i| \Forall 1 \leq i \leq N\). Then

\[\| v \|_1 = \OneVec^T | v | = \OneVec^H | v |.\]
Proof
\[\OneVec^T | v | = \sum_{i=1}^N | v |_i = \sum_{i=1}^N | v_i | = \| v \|_1.\]

Finally since \(\OneVec\) consists only of real entries, hence its transpose and Hermitian transpose are same.

Lemma

Let \(\OneMat \in \CC^{N \times N}\) be a square matrix of all ones. Let \(v \in \CC^N\) be some arbitrary vector. Then

\[|v|^T \OneMat | v | = \| v \|_1^2.\]
Proof

We know that

\[\OneMat = \OneVec \OneVec^T\]

Thus,

\[|v|^T \OneMat | v | = |v|^T \OneVec \OneVec^T | v | = (\OneVec^T | v | )^T \OneVec^T | v | = \| v \|_1 \| v \|_1 = \| v \|_1^2.\]

We used the fact that \(\| v \|_1 = \OneVec^T | v |\).

Theorem
The \(k\)-th largest (in magnitude) entry of a vector \(x \in \CC^N\), denoted by \(x_{(k)}\), obeys
\[| x_{(k)} | \leq \frac{\| x \|_1}{k}\]
Proof

Let \(n_1, n_2, \dots, n_N\) be a permutation of \(\{ 1, 2, \dots, N \}\) such that

\[|x_{n_1} | \geq | x_{n_2} | \geq \dots \geq | x_{n_N} |.\]

Thus, the \(k\)-th largest entry in \(x\) is \(x_{n_k}\). It is clear that

\[\| x \|_1 = \sum_{i=1}^N | x_i | = \sum_{i=1}^N |x_{n_i} |\]

Obviously

\[|x_{n_1} | \leq \sum_{i=1}^N |x_{n_i} | = \| x \|_1.\]

Similarly

\[k |x_{n_k} | = |x_{n_k} | + \dots + |x_{n_k} | \leq |x_{n_1} | + \dots + |x_{n_k} | \leq \sum_{i=1}^N |x_{n_i} | \leq \| x \|_1.\]

Thus

\[|x_{n_k} | \leq \frac{\| x \|_1}{k}.\]
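A quick numerical illustration of this bound (not library code):

x = [-1 5 8 0 0 -3 0 0 0 0];
xs = sort(abs(x), 'descend');   % sorted magnitudes: 8 5 3 1 0 ...
k = 2;
xs(k) <= norm(x, 1) / k         % 5 <= 17/2, returns logical 1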

Sparse signals

In this section we explore some useful properties of \(\Sigma_K\), the set of \(K\)-sparse signals in standard basis for \(\CC^N\).

We recall that

\[\Sigma_K = \{ x \in \CC^N : \| x \|_0 \leq K \}.\]

We established before that this set is a union of \(\binom{N}{K}\) subspaces of \(\CC^N\), each of which is constructed by an index set \(\Lambda \subset \{1, \dots, N \}\) with \(| \Lambda | = K\) choosing \(K\) specific dimensions of \(\CC^N\).

We first present some lemmas which connect the \(l_1\), \(l_2\) and \(l_{\infty}\) norms of vectors in \(\Sigma_K\).

Lemma

Suppose \(u \in \Sigma_K\). Then

\[\frac{\| u\|_1}{\sqrt{K}} \leq \| u \|_2 \leq \sqrt{K} \| u \|_{\infty}.\]
Proof

We can write \(l_1\) norm as

\[\| u \|_1 = \langle u, \sgn (u) \rangle.\]

By the Cauchy-Schwarz inequality we have

\[\langle u, \sgn (u) \rangle \leq \| u \|_2 \| \sgn (u) \|_2\]

Since \(u \in \Sigma_K\), \(\sgn(u)\) can have at most \(K\) non-zero values each with magnitude 1. Thus, we have

\[\| \sgn (u) \|_2^2 \leq K \implies \| \sgn (u) \|_2 \leq \sqrt{K}\]

Thus we get the lower bound

\[\| u \|_1 \leq \| u \|_2 \sqrt{K} \implies \frac{\| u \|_1}{\sqrt{K}} \leq \| u \|_2.\]

Now \(| u_i | \leq \max(| u_i |) = \| u \|_{\infty}\). So we have

\[\| u \|_2^2 = \sum_{i= 1}^{N} | u_i |^2 \leq K \| u \|_{\infty}^2\]

since there are only \(K\) non-zero terms in the expansion of \(\| u \|_2^2\).

This establishes the upper bound:

\[\| u \|_2 \leq \sqrt{K} \| u \|_{\infty}\]

Compressible signals

In this section, we first look at some general results and definitions related to \(K\)-term approximations of arbitrary signals \(x \in \CC^N\). We then define the notion of a compressible signal and study properties related to it.

K-term approximation of general signals

Definition

Let \(x \in \CC^N\). Let \(T \subset \{ 1, 2, \dots, N\}\) be any index set. Further, let

\[T = \{t_1, t_2, \dots, t_{|T|}\}\]

such that

\[t_1 < t_2 < \dots < t_{|T|}.\]

Let \(x_T \in \CC^{|T|}\) be defined as

(1)\[x_T = \begin{pmatrix} x_{t_1} & x_{t_2} & \dots & x_{t_{|T|}} \end{pmatrix}.\]

Then \(x_T\) is a restriction of the signal \(x\) on the index set \(T\).

Alternatively let \(x_T \in \CC^N\) be defined as

(2)\[\begin{split}x_{T}(i) = \left\{ \begin{array}{ll} x(i) & \mbox{if $i \in T$ };\\ 0 & \mbox{otherwise}. \end{array} \right.\end{split}\]

In other words, \(x_T \in \CC^N\) keeps the entries in \(x\) indexed by \(T\) while sets all other entries to 0. Then we say that \(x_T\) is obtained by masking \(x\) with \(T\). As an abuse of notation, we will use any of the two definitions whenever we are referring to \(x_T\). The definition being used should be obvious from the context.

Example: Restrictions on index sets

Let

\[x = \begin{pmatrix} -1 & 5 & 8 & 0 & 0 & -3 & 0 & 0 & 0 & 0 \end{pmatrix} \in \CC^{10}.\]

Let

\[T = \{ 1, 3, 7, 8\}.\]

Then

\[x_T = \begin{pmatrix} -1 & 0 & 8 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix} \in \CC^{10}.\]

Since \(|T| = 4\), sometimes we will also write

\[x_T = \begin{pmatrix} -1 & 8 & 0 & 0 \end{pmatrix} \in \CC^4.\]
Definition

Let \(x \in \CC^N\) be an arbitrary signal. Consider any index set \(T \subset \{1, \dots, N \}\) with \(|T| = K\). Then \(x_T\) is a \(K\)-term approximation of \(x\).

Clearly for any \(x \in \CC^N\) there are \(\binom{N}{K}\) possible \(K\)-term approximations of \(x\).

Example: K-term approximation

Let

\[x = \begin{pmatrix} -1 & 5 & 8 & 0 & 0 & -3 & 0 & 0 & 0 & 0 \end{pmatrix} \in \CC^{10}.\]

Let \(T= \{ 1, 6 \}\). Then

\[x_T = \begin{pmatrix} -1 & 0 & 0 & 0 & 0 & -3 & 0 & 0 & 0 & 0 \end{pmatrix}\]

is a \(2\)-term approximation of \(x\).

If we choose \(T= \{7,8,9,10\}\), the corresponding \(4\)-term approximation of \(x\) is

\[ \begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}.\]
Definition

Let \(x \in \CC^N\) be an arbitrary signal. Let \(\lambda_1, \dots, \lambda_N\) be indices of entries in \(x\) such that

\[| x_{\lambda_1} | \geq | x_{\lambda_2} | \geq \dots \geq | x_{\lambda_N} |.\]

In case of ties, the order is resolved lexicographically, i.e. if \(|x_i| = |x_j|\) and \(i < j\) then \(i\) will appear first in the sequence \(\lambda_k\).

Consider the index set \(\Lambda_K = \{ \lambda_1, \lambda_2, \dots, \lambda_K\}\). The restriction of \(x\) on \(\Lambda_K\), given by \(x_{\Lambda_K}\) (see above), contains the \(K\) largest entries of \(x\) while setting all other entries to 0. This is known as the \(K\) largest entries approximation of \(x\).

This signal is denoted henceforth as \(x|_K\). i.e.

\[x|_K = x_{\Lambda_K}\]

where \(\Lambda_K\) is the index set corresponding to \(K\) largest entries in \(x\) (magnitude wise).

Example: Largest entries approximation

Let

\[x = \begin{pmatrix} -1 & 5 & 8 & 0 & 0 & -3 & 0 & 0 & 0 & 0 \end{pmatrix}.\]

Then

\[x|_1 = \begin{pmatrix} 0 & 0 & 8 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}.\]
\[x|_2 = \begin{pmatrix} 0 & 5 & 8 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}.\]
\[x|_3 = \begin{pmatrix} 0 & 5 & 8 & 0 & 0 & -3 & 0 & 0 & 0 & 0 \end{pmatrix}\]
\[x|_4 = x.\]

All further \(K\) largest entries approximations are the same as \(x\).
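The \(K\) largest entries approximation is easy to compute by sorting the entries by magnitude. A minimal sketch in plain MATLAB (MATLAB's sort is stable, so ties are resolved in favor of the smaller index, matching the convention above):

x = [-1 5 8 0 0 -3 0 0 0 0]';
K = 2;
[~, idx] = sort(abs(x), 'descend');   % indices sorted by decreasing magnitude
xK = zeros(size(x));
xK(idx(1:K)) = x(idx(1:K));           % x|_2 = [0 5 8 0 0 0 0 0 0 0]'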

A pertinent question at this point is: which \(K\)-term approximation of \(x\) is the best \(K\)-term approximation? Certainly in order to compare two approximations we need some criterion. Let us choose \(l_p\) norm as the criterion. The next lemma gives an interesting result for best \(K\)-term approximations in \(l_p\) norm sense.

Lemma

Let \(x \in \CC^N\). Let the best \(K\) term approximation of \(x\) be obtained by the following optimization program:

(3)\[\begin{split}\begin{aligned} & \underset{T \subset \{1, \dots, N\}}{\text{maximize}} & & \| x_T \|_p \\ & \text{subject to} & & |T| = K. \end{aligned}\end{split}\]

where \(p \in [1, \infty]\).

Let an optimal solution for this optimization problem be denoted by \(x_{T^*}\). Then

\[\| x|_K \|_p = \| x_{T^*} \|_p.\]

i.e. the \(K\)-largest entries approximation of \(x\) is an optimal solution to (3) .

Proof

For \(p=\infty\), the result is obvious. In the following, we focus on \(p \in [1, \infty)\).

We note that maximizing \(\| x_T \|_p\) is equivalent to maximizing \(\| x_T \|^p_p\).

Let \(\lambda_1, \dots, \lambda_N\) be indices of entries in \(x\) such that

\[| x_{\lambda_1} | \geq | x_{\lambda_2} | \geq \dots \geq | x_{\lambda_N} |.\]

Further let \(\{ \omega_1, \dots, \omega_N\}\) be any permutation of \(\{1, \dots, N \}\).

Clearly

\[\| x|_K \|_p^{p} = \sum_{i=1}^K |x_{\lambda_i}|^{p} \geq \sum_{i=1}^K |x_{\omega_i}|^{p}.\]

Thus if \(T^*\) corresponds to an optimal solution of (3) then

\[\| x|_K \|_p^{p} = \| x_{T^*} \|_p^{p}.\]

Thus \(x|_K\) is an optimal solution to (3) .

This lemma helps us establish that whenever we are looking for a best \(K\)-term approximation of \(x\) under any \(l_p\) norm, all we have to do is pick the \(K\) largest entries in \(x\).

Definition

Let \(\Phi \in \CC^{M \times N}\). Let \(T \subset \{ 1, 2, \dots, N\}\) be any index set. Further let

\[T = \{t_1, t_2, \dots, t_{|T|}\}\]

such that

\[t_1 < t_2 < \dots < t_{|T|}.\]

Let \(\Phi_T \in \CC^{M \times |T|}\) be defined as

(4)\[\Phi_T = \begin{bmatrix} \phi_{t_1} & \phi_{t_2} & \dots & \phi_{t_{|T|}} \end{bmatrix}.\]

Then \(\Phi_T\) is a restriction of the matrix \(\Phi\) on the index set \(T\).

Alternatively let \(\Phi_T \in \CC^{M \times N}\) be defined as

(5)\[\begin{split}(\Phi_{T})_i = \left\{ \begin{array}{ll} \phi_i & \mbox{if $i \in T$ };\\ 0 & \mbox{otherwise}. \end{array} \right.\end{split}\]

In other words, \(\Phi_T \in \CC^{M \times N}\) keeps the columns in \(\Phi\) indexed by \(T\) while setting all other columns to 0. Then we say that \(\Phi_T\) is obtained by masking \(\Phi\) with \(T\). As an abuse of notation, we will use either of the two definitions whenever we are referring to \(\Phi_T\). The definition being used should be obvious from the context.

Lemma

Let \(\supp(x) = \Lambda\). Then

\[\Phi x = \Phi_{\Lambda} x_{\Lambda}.\]
Proof
\[\Phi x = \sum_{i=1}^N x_i \phi_i = \sum_{\lambda_i \in \Lambda} x_{\lambda_i} \phi_{\lambda_i} = \Phi_{\Lambda} x_{\Lambda}.\]
Remark
The lemma remains valid whether we use the restriction or the mask version of \(x_{\Lambda}\) notation as long as same version is used for both \(\Phi\) and \(x\).
Corollary

Let \(S\) and \(T\) be two disjoint index sets such that for some \(x \in \CC^N\)

\[x = x_T + x_S\]

using the mask version of \(x_T\) notation. Then the following holds

\[\Phi x = \Phi_T x_T + \Phi_S x_S.\]
Proof

Straightforward application of previous result:

\[\Phi x = \Phi x_T + \Phi x_S = \Phi_T x_T + \Phi_S x_S.\]
Lemma

Let \(T\) be any index set. Let \(\Phi \in \CC^{M \times N}\) and \(y \in \CC^M\). Then

\[[\Phi^H y]_T = \Phi_T^H y.\]
Proof
\[\begin{split}\Phi^H y = \begin{bmatrix} \langle \phi_1 , y \rangle\\ \vdots \\ \langle \phi_N , y \rangle\\ \end{bmatrix}\end{split}\]

Now let

\[T = \{ t_1, \dots, t_K \}.\]

Then

\[\begin{split}[\Phi^H y]_T = \begin{bmatrix} \langle \phi_{t_1} , y \rangle\\ \vdots \\ \langle \phi_{t_K} , y \rangle\\ \end{bmatrix} = \Phi_T^H y.\end{split}\]
Remark
The lemma remains valid whether we use the restriction or the mask version of \(\Phi_T\) notation.

Compressible signals

We will now define the notion of a compressible signal in terms of the decay rate of magnitude of its entries when sorted in descending order.

Definition

Let \(x \in \CC^N\) be an arbitrary signal. Let \(\lambda_1, \dots, \lambda_N\) be indices of entries in \(x\) such that

\[| x_{\lambda_1} | \geq | x_{\lambda_2} | \geq \dots \geq | x_{\lambda_N} |.\]

In case of ties, the order is resolved lexicographically, i.e. if \(|x_i| = |x_j|\) and \(i < j\) then \(i\) will appear first in the sequence \(\lambda_k\). Define

(6)\[\widehat{x} = (x_{\lambda_1}, x_{\lambda_2}, \dots, x_{\lambda_N}).\]

The signal \(x\) is called \(p\)-compressible with magnitude \(R\) if there exists \(p \in (0, 1]\) such that

(7)\[| \widehat{x}_i |\leq R \cdot i^{-\frac{1}{p}} \quad \forall i=1, 2,\dots, N.\]
Lemma

Let \(x\) be \(p\)-compressible with \(p=1\). Then

\[\| x \|_1 \leq R (1 + \ln (N)).\]
Proof

Recalling \(\widehat{x}\) from (6) it’s straightforward to see that

\[\|x\|_1 = \|\widehat{x}\|_1\]

since the \(l_1\) norm doesn’t depend on the ordering of entries in \(x\).

Now since \(x\) is \(1\)-compressible, hence from (7) we have

\[|\widehat{x}_i | \leq R \frac{1}{i}.\]

This gives us

\[\|\widehat{x}\|_1 \leq \sum_{i=1}^N R \frac{1}{i} = R \sum_{i=1}^N \frac{1}{i}.\]

The sum on the R.H.S. is the \(N\)-th Harmonic number (sum of reciprocals of first \(N\) natural numbers). A simple upper bound on Harmonic numbers is

\[H_k \leq 1 + \ln(k).\]

This completes the proof.

We now demonstrate how a compressible signal is well approximated by a sparse signal.

Lemma

Let \(x\) be a \(p\)-compressible signal and let \(x|_K\) be its best \(K\)-term approximation. Then the \(l_1\) norm of approximation error satisfies

(8)\[\| x - x|_K\|_1 \leq C_p \cdot R \cdot K^{1 - \frac{1}{p}}\]

with

\[C_p = \left (\frac{1}{p} - 1 \right)^{-1}.\]

Moreover the \(l_2\) norm of approximation error satisfies

\[\| x - x|_K\|_2 \leq D_p \cdot R \cdot K^{1 - \frac{1}{p}}\]

with

\[D_p = \left (\frac{2}{p} - 1 \right )^{-1/2}.\]
Proof
\[\| x - x|_K\|_1 = \sum_{i=K+1}^N |x_{\lambda_i}| \leq R \sum_{i=K+1}^N i^{-\frac{1}{p}}.\]

We now approximate the R.H.S. sum with an integral.

\[\sum_{i=K+1}^N i^{-\frac{1}{p}} \leq \int_{x=K}^N x^{-\frac{1}{p}} d x \leq \int_{x=K}^{\infty} x^{-\frac{1}{p}} d x.\]

Now

\[\int_{x=K}^{\infty} x^{-\frac{1}{p}} d x = \left [ \frac{x^{1-\frac{1}{p}}}{1-\frac{1}{p}} \right ]_{K}^{\infty} = C_p K^{1 - \frac{1}{p}}.\]

We can similarly show the result for \(l_2\) norm.
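A small numerical illustration of this result (a plain MATLAB sketch; the parameters are arbitrary illustrative choices):

N = 1000; p = 0.5; R = 1;
x = R * (1:N)'.^(-1/p);       % a p-compressible signal, already sorted
K = 50;
xK = x; xK(K+1:end) = 0;      % best K-term approximation x|_K
err = norm(x - xK, 1);        % l1 approximation error, about 0.019
Cp = 1 / (1/p - 1);           % C_p = 1 for p = 0.5
bound = Cp * R * K^(1 - 1/p); % the bound C_p R K^{1 - 1/p} = 0.02
fprintf('error = %.4f, bound = %.4f\n', err, bound);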

Tools for dictionary analysis

In this and following sections we review various properties associated with a dictionary \(\mathcal{D}\) which are useful in understanding the behavior and capabilities of a dictionary.

We recall that a dictionary \(\mathcal{D}\) consists of a finite number of unit norm vectors in \(\CC^N\) called atoms which span the signal space \(\CC^N\). Atoms of the dictionary are indexed by an index set \(\Omega\). i.e.

\[\mathcal{D} = \{ d_{\omega} : \omega \in \Omega \}\]

with \(|\Omega| = D\), \(N \leq D\), and \(\| d_{\omega} \|_2 = 1\) for every atom.

A vector \(x \in \CC^N\) can be represented in terms of the synthesis matrix consisting of the atoms of \(\mathcal{D}\) and a coefficient vector \(\alpha \in \CC^D\) as

\[x = \DDD \alpha.\]

Note that we are using the same symbol \(\DDD\) to represent the dictionary as a set of atoms as well as the corresponding synthesis matrix.

We can write the matrix \(\DDD\) consisting of its columns as

\[\DDD = \begin{bmatrix} d_1 & \dots & d_D \end{bmatrix}\]

This shouldn’t be causing any confusion in the sequel. When we write the subscript as \(d_{\omega_i}\) where \(\omega_i \in \Omega\) we are referring to the atoms of the dictionary \(\mathcal{D}\) indexed by the set \(\Omega\), while when we write the subscript as \(d_i\) we are referring to a column of corresponding synthesis matrix. In this case, \(\Omega\) will simply mean the index set \(\{ 1, \dots, D \}\). Obviously \(|\Omega| = D\) holds still.

Often, we will be working with a subset of atoms in a dictionary. Usually such a subset of atoms will be indexed by an index set \(\Lambda \subseteq \Omega\). \(\Lambda\) will take the form of \(\Lambda \subseteq \{\omega_1, \dots, \omega_D\}\) or \(\Lambda \subseteq \{1, \dots, D\}\) depending upon whether we are talking about the subset of atoms in the dictionary or a subset of columns from the corresponding synthesis matrix.

We will need the notion of a sub-dictionary [Tro06] described below.

Definition

A sub-dictionary is a linearly independent collection of atoms. Let \(\Lambda \subset \{\omega_1, \dots, \omega_D\}\) be the index set for the atoms in the sub-dictionary. We denote the sub-dictionary as \(\DDD_{\Lambda}\). We also use \(\DDD_{\Lambda}\) to denote the corresponding matrix with \(\Lambda \subset \{1, \dots, D\}\).

Remark
A subdictionary is full rank.

This is obvious since it is a collection of linearly independent atoms.

For a sub-dictionary, we will often write \(K = | \Lambda |\) and denote its Gram matrix by \(G = \DDD_{\Lambda}^H \DDD_{\Lambda}\). Sometimes, we will also be considering \(G^{-1}\). \(G^{-1}\) has a useful interpretation in terms of the dual vectors for the atoms in \(\DDD_{\Lambda}\) [TRO04].

Let \(\{ d_{\lambda} \}_{\lambda \in \Lambda}\) denote the atoms in \(\DDD_{\Lambda}\). Let \(\{ c_{\lambda} \}_{\lambda \in \Lambda}\) be chosen such that

\[\langle d_{\lambda} , c_{\lambda} \rangle = 1\]

and

\[\langle d_{\lambda} , c_{\omega} \rangle = 0 \text { for } \lambda, \omega \in \Lambda, \lambda \neq \omega.\]

Each dual vector \(c_{\lambda}\) is orthogonal to the atoms in the sub-dictionary at different indices and is long enough so that its inner product with \(d_{\lambda}\) is one. In this sense, the dual system inverts the sub-dictionary. In fact the dual vectors are nothing but the columns of the matrix \(B = (\DDD_{\Lambda}^{\dag})^H\). Now, a simple calculation:

\[B^H B = (\DDD_{\Lambda}^{\dag}) (\DDD_{\Lambda}^{\dag})^H = (\DDD_{\Lambda}^H \DDD_{\Lambda})^{-1} \DDD_{\Lambda}^H \DDD_{\Lambda} (\DDD_{\Lambda}^H \DDD_{\Lambda})^{-1} = (\DDD_{\Lambda}^H \DDD_{\Lambda})^{-1} = G^{-1}.\]

Therefore, the inverse Gram matrix lists the inner products between the dual vectors.
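We can verify this relationship numerically. The sketch below (plain MATLAB; the random sub-dictionary is only for illustration) builds the dual vectors through the pseudo-inverse and checks that their Gram matrix equals \(G^{-1}\):

N = 8; K = 4;                                 % ambient dimension, sub-dictionary size
DL = randn(N, K) + 1i * randn(N, K);          % K random atoms in C^N
DL = DL * diag(1 ./ sqrt(sum(abs(DL).^2)));   % normalize the atoms
B = pinv(DL)';                                % columns of B are the dual vectors
G = DL' * DL;                                 % Gram matrix of the sub-dictionary
disp(norm(DL' * B - eye(K)));                 % biorthogonality: close to zero
disp(norm(B' * B - inv(G)));                  % B^H B = G^{-1}: close to zero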

Sometimes we will be discussing tools which also apply for general matrices. We will use the symbol \(\Phi\) for representing general matrices. Whenever the dictionary is an orthonormal basis, we will use the symbol \(\Psi\).

Spark

Definition
The spark of a given matrix \(\Phi\) is the smallest number of columns of \(\Phi\) that are linearly dependent. If all columns are linearly independent, then the spark is defined to be number of columns plus one.

Note that the definition of spark applies to all matrices (wide, tall or square). It is not restricted to the synthesis matrices for a dictionary.

Correspondingly, the spark of a dictionary is defined as the minimum number of atoms which are linearly dependent.

We recall that the rank of a matrix is defined as the maximum number of columns which are linearly independent. The definition of spark bears a remarkable resemblance, yet the spark is very hard to obtain as it requires a combinatorial search over all possible subsets of columns of \(\Phi\).

Example: Spark
  • Spark of the \(3 \times 3\) identity matrix

    \[\begin{split}\begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}\end{split}\]

    is 4 since all columns are linearly independent.

  • Spark of the \(2 \times 4\) matrix

    \[\begin{split}\begin{pmatrix} 1 & 0 & -1 & 0\\ 0 & 1 & 0 & -1 \end{pmatrix}\end{split}\]

    is 2 since column 1 and 3 are linearly dependent.

  • If a matrix has a column with all zero entries, then the spark of such a matrix is 1. This is a trivial case and we will not consider such matrices in the sequel.

  • In general for an \(N \times D\) synthesis matrix, \(\spark(\DDD) \in [2, N+1]\).

A naive combinatorial algorithm to calculate the spark of a matrix is given below.

_images/spark_naive_algorithm.png

A naive algorithm to compute the spark of a matrix
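A hedged sketch of such a naive algorithm in plain MATLAB is given below (it is only an illustration of the combinatorial search and is usable only for very small matrices; the library provides an implementation as spx.dict.spark):

function s = spark_naive(Phi)
% Naive spark computation: search for the smallest linearly dependent
% set of columns by brute force. Combinatorial cost!
[M, D] = size(Phi);
for k = 1 : min(M + 1, D)
    subsets = nchoosek(1:D, k);            % all subsets of k columns
    for i = 1 : size(subsets, 1)
        if rank(Phi(:, subsets(i, :))) < k
            s = k;                          % found a dependent set of size k
            return;
        end
    end
end
s = D + 1;                                  % all columns are linearly independent
end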

Spark is useful in characterizing the uniqueness of the solution of a \((\DDD, K)\)- exact-sparse problem.

Remark

The \(l_0\)-“norm” of non-zero vectors belonging to the null space of a matrix \(\Phi\) is greater than or equal to \(\spark(\Phi)\):

\[\| x \|_0 \geq \spark(\Phi) \quad \Forall x\in \NullSpace(\Phi), x \neq 0.\]
Proof

If \(x \in \NullSpace(\Phi)\) is non-zero, then \(\Phi x = 0\) and the non-zero entries in \(x\) pick a set of columns in \(\Phi\) which are linearly dependent. Clearly, \(\| x \|_0\) is the number of columns in this set. By definition, the spark of \(\Phi\) is the minimum number of linearly dependent columns, hence the result

\[\| x \|_0 \geq \spark(\Phi) \quad \Forall x\in \NullSpace(\Phi), x \neq 0.\]

We now present a criteria based on spark which characterizes the uniqueness of a sparse solution to the problem \(y = \Phi x\).

Theorem

Consider a solution \(x^*\) to the under-determined system \(y = \Phi x\). If \(x^*\) obeys

\[\| x^* \|_0 < \frac{\spark(\Phi)}{2}\]

then it is necessarily the sparsest solution.

Proof

Let \(x'\) be some other solution to the problem. Then

\[\Phi x' = \Phi x^* \implies \Phi (x' - x^*) = 0 \implies (x' - x^*) \in \NullSpace(\Phi).\]

Now based on previous remark we have

\[\| x' - x^* \|_0 \geq \spark(\Phi).\]

Now, since \(\supp(x' - x^*) \subseteq \supp(x') \cup \supp(x^*)\), we have

\[\| x' \|_0 + \| x^* \|_0 \geq \| x' - x^* \|_0 \geq \spark(\Phi).\]

Hence, if \(\| x^* \|_0 < \frac{\spark(\Phi)}{2}\), then we have

\[\| x' \|_0 > \frac{\spark(\Phi)}{2}\]

for all other solutions \(x'\) to the equation \(y = \Phi x\).

Thus \(x^*\) is necessarily the sparsest possible solution.

This result is quite useful as it establishes a global optimality criterion for the \((\DDD, K)\)- exact-sparse problem.

As long as \(K < \frac{1}{2}\spark(\Phi)\) this theorem guarantees that the solution to the \((\DDD, K)\)- exact-sparse problem is unique. This is quite a surprising result for a non-convex combinatorial optimization problem. We are able to guarantee global uniqueness of the solution based on a simple check on its sparsity.

Note that we are only saying that if a sufficiently sparse solution is found then it is unique. We are not claiming that it is possible to find such a solution.

Obviously, the larger the spark, the higher the sparsity levels for which we can guarantee uniqueness. So a natural question is: how large can the spark of a dictionary be? We consider a few examples.

Example: Spark of Gaussian dictionaries

Consider a dictionary \(\DDD\) whose atoms \(d_{i}\) are random vectors independently drawn from the normal distribution. Since a dictionary requires all its atoms to be unit norm, we divide each of the random vectors by its norm.

We know that with probability \(1\), any set of \(N\) independent Gaussian random vectors is linearly independent. Also, since \(d_i \in \CC^N\), any set of \(N+1\) atoms is always linearly dependent.

Thus \(\spark(\DDD) = N +1\).

Thus, if a solution to exact-sparse problem contains \(\frac{N}{2}\) or fewer non-zero entries then it is necessarily unique with probability 1.
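A minimal sketch for constructing such a Gaussian dictionary in plain MATLAB (the sizes are arbitrary illustrative choices):

N = 16; D = 48;                            % ambient dimension, number of atoms
DD = randn(N, D);                          % atoms drawn from the normal distribution
DD = DD * diag(1 ./ sqrt(sum(DD.^2)));     % normalize each atom to unit norm
% With probability 1, every set of N columns is linearly independent,
% so spark(DD) = N + 1 = 17 here.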

Example: Spark of Dirac Fourier basis

For

\[\DDD = \begin{bmatrix} I & F \end{bmatrix} \in \CC^{N \times 2N}\]

it can be shown (when \(N\) is a perfect square) that

\[\spark(\DDD) = 2 \sqrt{N}.\]

In this case, the sparsity level of a unique solution must be less than \(\sqrt{N}\).

Example: Spark of a partial Hadamard matrix

Let’s construct a Hadamard matrix of size \(20 \times 20\):

PhiA = hadamard(20);

Let’s print it:

>> PhiA

PhiA =

     1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
     1 -1 -1  1  1 -1 -1 -1 -1  1 -1  1 -1  1  1  1  1 -1 -1  1
     1 -1  1  1 -1 -1 -1 -1  1 -1  1 -1  1  1  1  1 -1 -1  1 -1
     1  1  1 -1 -1 -1 -1  1 -1  1 -1  1  1  1  1 -1 -1  1 -1 -1
     1  1 -1 -1 -1 -1  1 -1  1 -1  1  1  1  1 -1 -1  1 -1 -1  1
     1 -1 -1 -1 -1  1 -1  1 -1  1  1  1  1 -1 -1  1 -1 -1  1  1
     1 -1 -1 -1  1 -1  1 -1  1  1  1  1 -1 -1  1 -1 -1  1  1 -1
     1 -1 -1  1 -1  1 -1  1  1  1  1 -1 -1  1 -1 -1  1  1 -1 -1
     1 -1  1 -1  1 -1  1  1  1  1 -1 -1  1 -1 -1  1  1 -1 -1 -1
     1  1 -1  1 -1  1  1  1  1 -1 -1  1 -1 -1  1  1 -1 -1 -1 -1
     1 -1  1 -1  1  1  1  1 -1 -1  1 -1 -1  1  1 -1 -1 -1 -1  1
     1  1 -1  1  1  1  1 -1 -1  1 -1 -1  1  1 -1 -1 -1 -1  1 -1
     1 -1  1  1  1  1 -1 -1  1 -1 -1  1  1 -1 -1 -1 -1  1 -1  1
     1  1  1  1  1 -1 -1  1 -1 -1  1  1 -1 -1 -1 -1  1 -1  1 -1
     1  1  1  1 -1 -1  1 -1 -1  1  1 -1 -1 -1 -1  1 -1  1 -1  1
     1  1  1 -1 -1  1 -1 -1  1  1 -1 -1 -1 -1  1 -1  1 -1  1  1
     1  1 -1 -1  1 -1 -1  1  1 -1 -1 -1 -1  1 -1  1 -1  1  1  1
     1 -1 -1  1 -1 -1  1  1 -1 -1 -1 -1  1 -1  1 -1  1  1  1  1
     1 -1  1 -1 -1  1  1 -1 -1 -1 -1  1 -1  1 -1  1  1  1  1 -1
     1  1 -1 -1  1  1 -1 -1 -1 -1  1 -1  1 -1  1  1  1  1 -1 -1

We will now select 10 rows randomly from it:

>> rng default;
>> rows = randperm(20, 10)

rows =

     6    18     7    16    12    13     3     4    19    20

>> Phi = PhiA(rows, :)

Phi =

     1 -1 -1 -1 -1  1 -1  1 -1  1  1  1  1 -1 -1  1 -1 -1  1  1
     1 -1 -1  1 -1 -1  1  1 -1 -1 -1 -1  1 -1  1 -1  1  1  1  1
     1 -1 -1 -1  1 -1  1 -1  1  1  1  1 -1 -1  1 -1 -1  1  1 -1
     1  1  1 -1 -1  1 -1 -1  1  1 -1 -1 -1 -1  1 -1  1 -1  1  1
     1  1 -1  1  1  1  1 -1 -1  1 -1 -1  1  1 -1 -1 -1 -1  1 -1
     1 -1  1  1  1  1 -1 -1  1 -1 -1  1  1 -1 -1 -1 -1  1 -1  1
     1 -1  1  1 -1 -1 -1 -1  1 -1  1 -1  1  1  1  1 -1 -1  1 -1
     1  1  1 -1 -1 -1 -1  1 -1  1 -1  1  1  1  1 -1 -1  1 -1 -1
     1 -1  1 -1 -1  1  1 -1 -1 -1 -1  1 -1  1 -1  1  1  1  1 -1
     1  1 -1 -1  1  1 -1 -1 -1 -1  1 -1  1 -1  1  1  1  1 -1 -1

Let’s measure its spark:

>> spx.dict.spark(Phi)

ans =

     8

We can also find out the set of 8 columns which are linearly dependent:

>> [spark, columns] = spx.dict.spark(Phi)

spark =

     8


columns =

     1     2     3     7    11    14    19    20

Let’s find out this sub-matrix

>> PhiD = Phi(:, columns)

PhiD =

     1 -1 -1 -1  1 -1  1  1
     1 -1 -1  1 -1 -1  1  1
     1 -1 -1  1  1 -1  1 -1
     1  1  1 -1 -1 -1  1  1
     1  1 -1  1 -1  1  1 -1
     1 -1  1 -1 -1 -1 -1  1
     1 -1  1 -1  1  1  1 -1
     1  1  1 -1 -1  1 -1 -1
     1 -1  1  1 -1  1  1 -1
     1  1 -1 -1  1 -1 -1 -1

Let’s verify that this matrix is indeed singular:

>> rank(PhiD)

ans =

     7

We can find out a vector in its null space:

>> z = null(PhiD)'

z =

    0.4472    0.2236    0.2236    0.4472    0.4472    0.2236   -0.2236    0.4472

Verify that it is indeed a null space vector:

>> norm (PhiD * z')

ans =

   1.1776e-15

The rank of this matrix is 10. If every set of 10 columns were linearly independent, then the spark would have been 11 and the matrix would be a full spark matrix. Unfortunately, that is not the case. However, the spark is still quite large.

We can normalize the columns of this matrix to make it a proper dictionary:

>> Phi = spx.norm.normalize_l2(Phi);

Let’s verify the column-wise norms:

>> spx.norm.norms_l2_cw(Phi)

ans =

  Columns 1 through 12

    1.0000    1.0000    1.0000    1.0000    1.0000    1.0000    1.0000    1.0000    1.0000    1.0000    1.0000    1.0000

  Columns 13 through 20

    1.0000    1.0000    1.0000    1.0000    1.0000    1.0000    1.0000    1.0000

The coherence of this dictionary (to be discussed in the next section) is 0.6, which is moderate (but not low).

Coherence

Finding the spark of a dictionary \(\DDD\) is NP-hard since it involves a combinatorial search over the subsets of columns of \(\DDD\). In this section we consider the coherence of a dictionary, which is computationally tractable and quite useful in characterizing the solutions of sparse approximation problems.

Definition

The coherence of a dictionary \(\DDD\) is defined as the maximum absolute inner product between two distinct atoms in the dictionary:

\[\mu = \underset{j \neq k}{\text{max}} | \langle d_{\omega_j}, d_{\omega_k} \rangle | = \underset{j \neq k}{\text{max}} | (\DDD^H \DDD)_{jk} |.\]

If the dictionary consists of two orthonormal bases, then coherence is also known as mutual coherence or proximity. Since the atoms within each orthonormal basis are orthogonal to each other, the coherence is determined only by the inner products of atoms from one basis with another basis.

We note that \(d_{\omega_i}\) is the \(i\) -th column of synthesis matrix \(\DDD\) . Also \(\DDD^H \DDD\) is the Gram matrix for \(\DDD\) whose elements are nothing but the inner-products of columns of \(\DDD\) .

We note that by definition \(\| d_{\omega} \|_2 = 1\) hence \(\mu \leq 1\) and since absolute values are considered hence \(\mu \geq 0\) . Thus, \(0 \leq \mu \leq 1\).

For an orthonormal basis \(\Psi\) all atoms are orthogonal to each other, hence

\[| \langle \psi_{\omega_j}, \psi_{\omega_k} \rangle |= 0 \text{ whenever } j \neq k.\]

Thus \(\mu = 0\) .

In the following, we will use the notation \(|A|\) to denote a matrix consisting of absolute values of entries in a matrix \(A\) . i.e.

\[| A |_{i j} = | A _{i j} |.\]

The off-diagonal entries of the Gram matrix are captured by the matrix \(\DDD^H \DDD - I\) . Note that all diagonal entries in \(\DDD^H \DDD - I\) are zero since atoms of \(\DDD\) are unit norm. Moreover, each of the entries in \(| \DDD^H \DDD - I |\) is dominated by \(\mu(\DDD)\) .

The inner product between any two atoms \(| \langle d_{\omega_j}, d_{\omega_k} \rangle |\) is a measure of how much they look alike or how much they are correlated. Coherence just picks up the two vectors which are most alike and returns their correlation. In a way \(\mu\) is quite a blunt measure of the quality of a dictionary, yet it is quite useful.

If a dictionary is uniform in the sense that there is not much variation in \(| \langle d_{\omega_j}, d_{\omega_k} \rangle |\) , then \(\mu\) captures the behavior of the dictionary quite well.

Definition

We say that a dictionary is incoherent if the coherence of the dictionary is small.

We are looking for dictionaries which are incoherent. In the sequel we will see how incoherence plays a role in sparse approximation.

Example

The coherence of two ortho-bases is bounded by

\[\frac{1}{\sqrt{N}} \leq \mu \leq 1.\]

The coherence of Dirac Fourier basis is \(\frac{1}{\sqrt{N}}\) .

Example: Coherence of a multi-ONB dictionary
A dictionary of concatenated orthonormal bases is called a multi-ONB. For some \(N\) , it is possible to build a multi-ONB which contains \(N\) or even \(N+1\) bases yet retains the minimal coherence \(\mu = \frac{1}{\sqrt{N}}\) possible.
Theorem

A lower bound on the coherence of a general dictionary is given by

\[\mu \geq \sqrt{\frac{D-N}{N(D-1)}}\]
Definition

If each atomic inner product meets this bound, the dictionary is called an optimal Grassmannian frame .

The definition of coherence can be extended to arbitrary matrices \(\Phi \in \CC^{N \times D}\) .

Definition

The coherence of a matrix \(\Phi \in \CC^{N \times D}\) is defined as the maximum absolute normalized inner product between two distinct columns in the matrix. Let

\[\Phi = \begin{bmatrix} \phi_1 & \phi_2 & \dots & \phi_D \end{bmatrix}.\]

Then coherence of \(\Phi\) is given by

(1)\[\mu(\Phi) = \underset{j \neq k}{\text{max}} \frac{ | \langle \phi_j, \phi_k \rangle |} {\| \phi_j \|_2 \| \phi_k \|_2}\]

It is assumed that none of the columns in \(\Phi\) is a zero vector.
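Computing the coherence from the Gram matrix is straightforward. A minimal sketch in plain MATLAB (the library provides its own implementation as spx.dict.coherence, used later in this documentation):

function mu = coherence_demo(Phi)
% Coherence: maximum absolute normalized inner product between
% two distinct columns of Phi.
Phi = Phi * diag(1 ./ sqrt(sum(abs(Phi).^2)));   % normalize the columns
G = abs(Phi' * Phi);                             % absolute Gram matrix
G = G - diag(diag(G));                           % ignore the diagonal
mu = max(G(:));
end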

Lower bounds for spark

Coherence of a matrix is easy to compute. More interestingly it also provides a lower bound on the spark of a matrix.

Theorem

For any matrix \(\Phi \in \CC^{N \times D}\) (with non-zero columns) the following relationship holds

\[\spark(\Phi) \geq 1 + \frac{1}{\mu(\Phi)}.\]
Proof

We note that scaling of a column of \(\Phi\) doesn’t change either the spark or coherence of \(\Phi\) . Therefore, we assume that the columns of \(\Phi\) are normalized.

We now construct the Gram matrix of \(\Phi\) given by \(G = \Phi^H \Phi\) . We note that

\[G_{k k} = 1 \quad \Forall 1 \leq k \leq D\]

since each column of \(\Phi\) is unit norm.

Also

\[|G_{k j}| \leq \mu(\Phi) \quad \Forall 1 \leq k, j \leq D , k \neq j.\]

Consider any \(p\) columns from \(\Phi\) and construct their Gram matrix. This is nothing but a \(p \times p\) principal submatrix of \(G\).

From the Gershgorin disk theorem, if this minor is diagonally dominant, i.e. if

\[\sum_{j \neq i} |G_{i j}| < | G_{i i}| \Forall i\]

then this sub-matrix of \(G\) is positive definite and so corresponding \(p\) columns from \(\Phi\) are linearly independent.

But

\[|G_{i i}| = 1\]

and

\[\sum_{j \neq i} |G_{i j}| \leq (p-1) \mu(\Phi)\]

for the minor under consideration. Hence for \(p\) columns to be linearly independent the following condition is sufficient

\[(p-1) \mu (\Phi) < 1.\]

Thus if

\[p < 1 + \frac{1}{\mu(\Phi)},\]

then every set of \(p\) columns from \(\Phi\) is linearly independent.

Hence, the smallest possible set of linearly dependent columns must satisfy

\[p \geq 1 + \frac{1}{\mu(\Phi)}.\]

This establishes the lower bound that

\[\spark(\Phi) \geq 1 + \frac{1}{\mu(\Phi)}.\]

This bound on spark doesn’t make any assumptions on the structure of the dictionary. In fact, imposing additional structure on the dictionary can give better bounds. Let us look at an example for a two ortho-basis [DE03].

Theorem

Let \(\DDD\) be a two ortho-basis. Then

\[\spark (\DDD) \geq \frac{2}{\mu(\DDD)}.\]
Proof

It can be shown that for any non-zero vector \(v \in \NullSpace(\DDD)\)

\[\| v \|_0 \geq \frac{2}{\mu(\DDD)}.\]

But

\[\spark(\DDD) = \underset{v \in \NullSpace(\DDD)} {\min}( \| v \|_0).\]

Thus

\[\spark(\DDD) \geq \frac{2}{\mu(\DDD)}.\]

For maximally incoherent two orthonormal bases, we know that \(\mu = \frac{1}{\sqrt{N}}\) . A perfect example is the pair of Dirac and Fourier bases. In this case \(\spark(\DDD) \geq 2 \sqrt{N}\) .

Uniqueness-Coherence

We can now establish a uniqueness condition for the sparse solution of \(y = \Phi x\).

Theorem

Consider a solution \(x^*\) to the under-determined system \(y = \Phi x\) . If \(x^*\) obeys

\[\| x^* \|_0 < \frac{1}{2} \left (1 + \frac{1}{\mu(\Phi)} \right )\]

then it is necessarily the sparsest solution.

Proof
This is a straightforward application of spark uniqueness theorem and spark lower bound on coherence.

It is interesting to compare the two uniqueness theorems: spark uniqueness theorem and coherence uniqueness theorem.

The first one is sharp and far more powerful than the second one, which is based on coherence.

Coherence can never be smaller than \(\frac{1}{\sqrt{N}}\) , therefore the bound on \(\| x^* \|_0\) in above can never be larger than \(\frac{\sqrt{N} + 1}{2}\) .

However, spark can easily be as large as \(N\) and then the bound on \(\| x^* \|_0\) can be as large as \(\frac{N}{2}\).

Thus, we note that coherence gives a weaker bound than spark for supportable sparsity levels of unique solutions. The advantage that coherence has is that it is easily computable and doesn’t require any special structure on the dictionary (two ortho basis has a special structure).

Singular values of sub-dictionaries

Theorem

Let \(\DDD\) be a dictionary and \(\DDD_{\Lambda}\) be a sub-dictionary. Let \(\mu\) be the coherence of \(\DDD\) . Let \(K = | \Lambda |\) . Then the eigen values of \(G = \DDD_{\Lambda}^H \DDD_{\Lambda}\) satisfy:

\[1 - (K - 1) \mu \leq \lambda \leq 1 + (K - 1) \mu.\]

Moreover, the singular values of the sub-dictionary \(\DDD_{\Lambda}\) satisfy

\[\sqrt{1 - (K - 1) \mu} \leq \sigma (\DDD_{\Lambda}) \leq \sqrt{1 + (K - 1) \mu}.\]
Proof

We recall from Gershgorin’s theorem that for any square matrix \(A \in \CC^{K \times K}\) , every eigen value \(\lambda\) of \(A\) satisfies

\[| \lambda - a_{ii} | \leq \sum_{j \neq i} |a_{ij}| \text{ for some } i \in \{ 1, \dots, K\}.\]

Now consider the matrix \(G = \DDD_{\Lambda}^H \DDD_{\Lambda}\) with diagonal elements equal to 1 and off diagonal elements bounded by a value \(\mu\) . Then

\[| \lambda - 1 | \leq \sum_{j \neq i} |a_{ij}| \leq \sum_{j \neq i} \mu = (K - 1) \mu.\]

Thus,

\[- (K - 1) \mu \leq \lambda - 1 \leq (K - 1) \mu \iff 1 - (K - 1) \mu \leq \lambda \leq 1 + (K - 1) \mu\]

This gives us a lower bound on the smallest eigen value.

\[\lambda_{\min} (G) \geq 1 - (K - 1) \mu.\]

Since \(\DDD_{\Lambda}\) is full rank, \(G\) is positive definite and its eigen values are positive. Thus, the above lower bound is useful only if

\[1 - (K - 1) \mu > 0 \iff 1 > (K - 1) \mu \iff \mu < \frac{1}{K - 1}.\]

We also get an upper bound on the eigen values of \(G\) given by

\[\lambda_{\max} (G) \leq 1 + (K - 1) \mu.\]

The bounds on singular values of \(\DDD_{\Lambda}\) are obtained as a straight-forward extension by taking square roots on the expressions.
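We can check these bounds numerically on a small sub-dictionary. The sketch below uses the Dirac-DCT dictionary and library functions demonstrated elsewhere in this documentation; the index set is an arbitrary illustrative choice:

N = 64;
DD = spx.dict.simple.dirac_dct_mtx(N);    % a Dirac-DCT dictionary
mu = spx.dict.coherence(DD);              % its coherence, sqrt(2/N)
Lambda = [1 5 9 70 75];                   % an index set with K = 5 atoms
K = numel(Lambda);
s = svd(DD(:, Lambda));                   % singular values of the sub-dictionary
% The squared singular values should lie in [1 - (K-1) mu, 1 + (K-1) mu]
fprintf('[%.4f, %.4f] contains [%.4f, %.4f]\n', ...
    1 - (K-1)*mu, 1 + (K-1)*mu, min(s)^2, max(s)^2);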

Embeddings using sub-dictionaries

Theorem

Let \(\DDD\) be a real dictionary and \(\DDD_{\Lambda}\) be a sub-dictionary with \(K = |\Lambda|\) . Let \(\mu\) be the coherence of \(\DDD\) . Let \(v \in \RR^K\) be an arbitrary vector. Then

\[| v |^T [I - \mu (\OneMat - I)] | v | \leq \| \DDD_{\Lambda} v \|_2^2 \leq | v |^T [I + \mu (\OneMat - I)] | v |\]

where \(\OneMat\) is a \(K\times K\) matrix of all ones. Moreover

\[(1 - (K - 1) \mu) \| v \|_2^2 \leq \| \DDD_{\Lambda} v \|_2^2 \leq (1 + (K - 1) \mu)\| v \|_2^2.\]
Proof

We can easily write

\[\| \DDD_{\Lambda} v \|_2^2 = v^T \DDD_{\Lambda}^T \DDD_{\Lambda} v\]
\[\begin{aligned} v^T \DDD_{\Lambda}^T \DDD_{\Lambda} v &= \sum_{i=1}^K \sum_{j=1}^K v_i d_{\lambda_i}^T d_{\lambda_j} v_j. \end{aligned}\]

The terms in the R.H.S. for \(i = j\) are given by

\[v_i d_{\lambda_i}^T d_{\lambda_i} v_i = | v_i |^2.\]

Summing over \(i = 1, \cdots, K\) , we get

\[\sum_{i=1}^K | v_i |^2 = \| v \|_2^2 = v^T v = | v |^T | v | = | v |^T I | v |.\]

We are now left with \(K^2 - K\) off diagonal terms. Each of these terms is bounded by

\[- \mu |v_i| |v_j | \leq v_i d_{\lambda_i}^T d_{\lambda_j} v_j \leq \mu |v_i| |v_j |.\]

Summing over the \(K^2 - K\) off-diagonal terms we get:

\[\sum_{i \neq j} |v_i| |v_j | = \sum_{i, j} |v_i| |v_j | - \sum_{i = j} |v_i| |v_j | = | v |^T(\OneMat - I ) | v |.\]

Thus,

\[- \mu | v |^T (\OneMat - I ) | v | \leq \sum_{i \neq j} v_i d_{\lambda_i}^T d_{\lambda_j} v_j \leq \mu | v |^T (\OneMat - I ) | v |\]

Thus,

\[| v |^T I | v |- \mu | v |^T (\OneMat - I ) | v | \leq v^T \DDD_{\Lambda}^T \DDD_{\Lambda} v \leq | v |^T I | v |+ \mu | v |^T (\OneMat - I )| v |.\]

We get the result by slight reordering of terms:

\[| v |^T [I - \mu (\OneMat - I)] | v | \leq \| \DDD_{\Lambda} v \|_2^2 \leq | v |^T [I + \mu (\OneMat - I)] | v |\]

We recall that

\[| v |^T \OneMat | v | = \| v \|_1^2.\]

Thus, the inequalities can be written as

\[(1 + \mu) \| v \|_2^2 - \mu \| v \|_1^2 \leq \| \DDD_{\Lambda} v \|_2^2 \leq (1 - \mu) \| v \|_2^2 + \mu \| v \|_1^2.\]

Alternatively,

\[\| v \|_2^2 - \mu \left (\| v \|_1^2 - \| v \|_2^2 \right ) \leq \| \DDD_{\Lambda} v \|_2^2 \leq \| v \|_2^2 + \mu \left (\| v \|_1^2 - \| v \|_2^2\right ) .\]

Finally

\[\| v \|_1^2 \leq K \| v \|_2^2 \implies \| v \|_1^2 - \| v \|_2^2 \leq (K - 1) \| v \|_2^2.\]

This gives us

\[( 1- (K - 1) \mu ) \| v \|_2^2 \leq \| \DDD_{\Lambda} v \|_2^2 \leq ( 1 + (K - 1) \mu ) \| v \|_2^2 .\]

We now present the above theorem for the complex case. The proof is based on singular values. This proof is simpler and more general than the one presented above.

Theorem

Let \(\DDD\) be a dictionary and \(\DDD_{\Lambda}\) be a sub-dictionary with \(K = |\Lambda|\) . Let \(\mu\) be the coherence of \(\DDD\) . Let \(v \in \CC^K\) be an arbitrary vector. Then

\[(1 - (K - 1) \mu) \| v \|_2^2 \leq \| \DDD_{\Lambda} v \|_2^2 \leq (1 + (K - 1) \mu)\| v \|_2^2.\]
Proof

Recall that

\[\sigma_{\min}^2(\DDD_{\Lambda}) \| v \|_2^2 \leq \| \DDD_{\Lambda} v \|_2^2 \leq \sigma_{\max}^2(\DDD_{\Lambda}) \| v \|_2^2.\]

A previous result tells us:

\[1 - (K - 1) \mu \leq \sigma^2 (\DDD_{\Lambda}) \leq 1 + (K - 1) \mu.\]

Thus,

\[\sigma_{\min}^2(\DDD_{\Lambda}) \| v \|_2^2 \geq (1 - (K - 1) \mu) \| v \|_2^2\]

and

\[\sigma_{\max}^2(\DDD_{\Lambda}) \| v \|_2^2 \leq (1 + (K - 1) \mu)\| v \|_2^2.\]

This gives us the result

\[(1 - (K - 1) \mu) \| v \|_2^2 \leq \| \DDD_{\Lambda} v \|_2^2 \leq (1 + (K - 1) \mu)\| v \|_2^2.\]

Babel function

Recalling the definition of coherence, we note that it reflects only the extreme correlations between atoms of dictionary. If most of the inner products are small compared to one dominating inner product, then the value of coherence is highly misleading.

In [Tro04], Tropp introduced the Babel function, which measures the maximum total coherence between a fixed atom and a collection of other atoms. The Babel function quantifies the extent to which the atoms of a dictionary are speaking the same language.

Definition

The Babel function for a dictionary \(\DDD\) is defined by

(1)\[\mu_1(p) \triangleq \underset{|\Lambda| = p}{\max} \; \underset {\psi}{\max} \sum_{\Lambda} | \langle \psi, d_{\lambda} \rangle |,\]

where the vector \(\psi\) ranges over the atoms indexed by \(\Omega \setminus \Lambda\). We define

\[\mu_1(0) = 0\]

for sparsity level \(p=0\).

Let us understand what is going on here. For each value of \(p\) we consider all \(\binom{D}{p}\) possible selections of \(p\) atoms from \(\mathcal{D}\).

Let the atoms spanning one such subspace be identified by an index set \(\Lambda \subset \Omega\).

All other atoms are indexed by the index set \(\Gamma = \Omega \setminus \Lambda\).

Let

\[\Psi = \{ \psi_{\gamma} : \gamma \in \Gamma \}\]

denote the atoms indexed by \(\Gamma\).

We pick a vector \(\psi \in \Psi\) and compute its inner products with all atoms indexed by \(\Lambda\). We then compute the sum of the absolute values of these inner products over all \(\{ d_{\lambda} : \lambda \in \Lambda\}\).

We repeat this for every \(\psi \in \Psi\) and take the maximum value of the sum over all \(\psi\).

We finally compute the maximum over all possible \(p\)-subsets \(\Lambda\).

This number is the value of the Babel function for sparsity level \(p\).

We first make a few observations over the properties of Babel function.

Babel function is a generalization of coherence.

Remark

For \(p=1\) we observe that

\[\mu_1(1) = \mu(\DDD)\]

the coherence of \(\mathcal{D}\).

Remark
\(\mu_1\) is a non-decreasing function of \(p\).
Proof

This is easy to see since the sum

\[\sum_{\Lambda} | \langle \psi, d_{\lambda} \rangle |\]

cannot decrease as \(p = | \Lambda|\) increases.

In particular for some value of \(p\) let \(\Lambda^p\) and \(\psi^p\) denote the set and vector for which the maximum in (1) is achieved. Now pick some column which is not \(\psi^p\) and is not indexed by \(\Lambda^p\) and add it to \(\Lambda^p\) to form \(\Lambda^{p + 1}\). Note that \(\Lambda^{p + 1}\) and \(\psi^p\) might not be the worst case for sparsity level \(p+1\) in (1). Clearly

\[\sum_{\Lambda^{p + 1}} | \langle \psi^p, d_{\lambda} \rangle | \geq \sum_{\Lambda^{p}} | \langle \psi^p, d_{\lambda} \rangle |\]

Hence \(\mu_1(p+1)\) cannot be less than \(\mu_1(p)\).

Lemma

Babel function is upper bounded by coherence as per

\[\mu_1(p) \leq p \; \mu(\DDD).\]
Proof
Each of the \(p\) inner products in the sum is bounded in magnitude by \(\mu(\DDD)\), hence

\[\sum_{\Lambda} | \langle \psi, d_{\lambda} \rangle | \leq p \; \mu(\DDD).\]

This leads to

\[\mu_1(p) = \underset{|\Lambda| = p}{\max} \; \underset {\psi}{\max} \sum_{\Lambda} | \langle \psi, d_{\lambda} \rangle | \leq \underset{|\Lambda| = p}{\max} \; \underset {\psi}{\max} \left (p \; \mu(\DDD)\right) = p \; \mu(\DDD).\]

Computation of Babel function

It might seem at first that the computation of the Babel function is combinatorial and hence prohibitively expensive. But this is not the case.

We will demonstrate this through an example in this section. Our example synthesis matrix will be

\[\begin{split}\DDD = \begin{bmatrix} 0.5 & 0 & 0 & 0.6533 & 1 & 0.5 & -0.2706 & 0\\ 0.5 & 1 & 0 & 0.2706 & 0 & -0.5 & 0.6533 & 0\\ 0.5 & 0 & 1 & -0.2706 & 0 & -0.5 & -0.6533 & 0\\ 0.5 & 0 & 0 & -0.6533 & 0 & 0.5 & 0.2706 & 1 \end{bmatrix}\end{split}\]

From the synthesis matrix \(\DDD\) we first construct its Gram matrix given by

\[G = \DDD^H \DDD.\]

We then take absolute value of each entry in \(G\) to construct \(|G|\).

For the running example

\[\begin{split}|G| = \begin{bmatrix} 1 & 0.5 & 0.5 & 0 & 0.5 & 0 & 0 & 0.5\\ 0.5 & 1 & 0 & 0.2706 & 0 & 0.5 & 0.6533 & 0\\ 0.5 & 0 & 1 & 0.2706 & 0 & 0.5 & 0.6533 & 0\\ 0 & 0.2706 & 0.2706 & 1 & 0.6533 & 0 & 0 & 0.6533\\ 0.5 & 0 & 0 & 0.6533 & 1 & 0.5 & 0.2706 & 0\\ 0 & 0.5 & 0.5 & 0 & 0.5 & 1 & 0 & 0.5\\ 0 & 0.6533 & 0.6533 & 0 & 0.2706 & 0 & 1 & 0.2706\\ 0.5 & 0 & 0 & 0.6533 & 0 & 0.5 & 0.2706 & 1 \end{bmatrix}\end{split}\]

We now sort every row in descending order to obtain a new matrix \(G'\).

\[\begin{split}G' = \begin{bmatrix} 1 & 0.5 & 0.5 & 0.5 & 0.5 & 0 & 0 & 0\\ 1 & 0.6533 & 0.5 & 0.5 & 0.2706 & 0 & 0 & 0\\ 1 & 0.6533 & 0.5 & 0.5 & 0.2706 & 0 & 0 & 0\\ 1 & 0.6533 & 0.6533 & 0.2706 & 0.2706 & 0 & 0 & 0\\ 1 & 0.6533 & 0.5 & 0.5 & 0.2706 & 0 & 0 & 0\\ 1 & 0.5 & 0.5 & 0.5 & 0.5 & 0 & 0 & 0\\ 1 & 0.6533 & 0.6533 & 0.2706 & 0.2706 & 0 & 0 & 0\\ 1 & 0.6533 & 0.5 & 0.5 & 0.2706 & 0 & 0 & 0 \end{bmatrix}\end{split}\]

The first entry in each row is now \(1\). This corresponds to \(\langle d_i, d_i \rangle\), which doesn't appear in the calculation of \(\mu_1(p)\), hence we disregard the whole of the first column.

Now look at column 2 in \(G'\). In the \(i\)-th row it is nothing but

\[\underset{j \neq i}{\max} | \langle d_i, d_j \rangle |.\]

Thus,

\[\mu (\DDD) = \mu_1(1) = \underset{1 \leq j \leq D} {\max} {G'}_{j, 2}\]

i.e. the coherence is given by the maximum in the 2nd column of \(G'\).

In the running example

\[\mu (\DDD) = \mu_1(1) = 0.6533.\]

Looking carefully we can note that for \(\psi = d_i\) the maximum value of the sum

\[\sum_{\Lambda} | \langle \psi, d_{\lambda} \rangle |\]

with \(| \Lambda| = p\) is given by the sum of the entries in columns 2 through \((p+1)\) of the \(i\)-th row.

Thus

\[\mu_1 (p) = \underset{1 \leq i \leq D} {\max} \sum_{j = 2}^{p + 1} G'_{i j}.\]

For the running example the Babel function values are given by

\[\begin{pmatrix} 0.6533 & 1.3066 & 1.6533 & 2 & 2 & 2 & 2 \end{pmatrix}.\]

We see that Babel function stops increasing after \(p=4\). Actually \(\DDD\) is constructed by shuffling the columns of two orthonormal bases. Hence many of the inner products are 0 in \(G\).

Babel function and spark

We first note that Babel function tells something about linear independence of columns of \(\DDD\).

Lemma

Let \(\mu_1\) be the Babel function for a dictionary \(\DDD\). If

\[\mu_1(p) < 1\]

then all selections of \(p+1\) columns from \(\DDD\) are linearly independent.

Proof

We recall from the proof of this result that if

\[p + 1 < 1 + \frac{1}{\mu(\DDD)} \implies p < \frac{1}{\mu(\DDD)}\]

then every set of \((p+1)\) columns from \(\DDD\) are linearly independent.

We also know from this result that

\[p \; \mu(\DDD) \geq \mu_1(p) \implies \mu(\DDD) \geq \frac{\mu_1(p)}{p} \implies \frac{1}{\mu(\DDD)} \leq \frac{p} {\mu_1(p)}.\]

Thus if

\[p < \frac{p} {\mu_1(p)} \implies 1 < \frac{1} {\mu_1(p)} \implies \mu_1(p) < 1\]

then all selections of \(p+1\) columns from \(\DDD\) are linearly independent.

This leads us to a lower bound on spark from Babel function .

Lemma

A lower bound of spark of a dictionary \(\DDD\) is given by

\[\spark(\DDD) \geq \underset{1 \leq p \leq N} {\min}\{p : \mu_1(p-1)\geq 1\}.\]
Proof

For all \(j \leq p-2\) we are given that \(\mu_1(j) < 1\). Thus all sets of \(p-1\) columns from \(\DDD\) are linearly independent (using this result).

Finally \(\mu_1(p-1) \geq 1\), hence we cannot say definitively whether a set of \(p\) columns from \(\DDD\) is linearly dependent or not. This establishes the lower bound on spark.

An earlier version of this result also appeared in [DE03] theorem 6.

Babel function and singular values

Theorem

Let \(\DDD\) be a dictionary and \(\Lambda\) be an index set with \(|\Lambda| = K\). The singular values of \(\DDD_{\Lambda}\) are bounded by

\[1 - \mu_1(K - 1) \leq \sigma^2 \leq 1 + \mu_1 (K - 1).\]
Proof

Consider the Gram matrix

\[G = \DDD_{\Lambda}^H \DDD_{\Lambda}.\]

\(G\) is a \(K\times K\) square matrix.

Also let

\[\Lambda = \{ \lambda_1, \lambda_2, \dots, \lambda_K\}\]

so that

\[\DDD_{\Lambda} = \begin{bmatrix} d_{\lambda_1} & d_{\lambda_2} & \dots & d_{\lambda_K} \end{bmatrix}.\]

The Gershgorin Disc Theorem states that every eigenvalue of \(G\) lies in one of the \(K\) discs

\[\Delta_k = \left \{ z : |z - G_{k k}|\leq \sum_{j \neq k } | G_{j k}| \right \}\]

Since \(d_i\) are unit norm, hence \(G_{k k} = 1\).

Also we note that

\[\sum_{j \neq k } | G_{j k}| = \sum_{j \neq k } | \langle d_{\lambda_j}, d_{\lambda_k} \rangle | \leq \mu_1(K-1)\]

since there are \(K-1\) terms in sum and \(\mu_1(K-1)\) is an upper bound on all such sums.

Thus if \(z\) is an eigen value of \(G\) then we have

\[\begin{split}\begin{aligned} &| z -1 | \leq \mu_1(K-1) \\ \implies &- \mu_1(K-1) \leq z - 1 \leq \mu_1(K-1) \\ \implies &1 - \mu_1(K-1) \leq z \leq 1 + \mu_1(K-1). \end{aligned}\end{split}\]

This is OK since \(G\) is positive semi-definite, thus, the eigen values of \(G\) are real.

But the eigen values of \(G\) are nothing but the squared singular values of \(\DDD_{\Lambda}\). Thus we get

\[1 - \mu_1(K-1) \leq \sigma^2 \leq 1 + \mu_1(K-1).\]
Corollary
Let \(\DDD\) be a dictionary and \(\Lambda\) be an index set with \(|\Lambda| = K\). If \(\mu_1(K-1) < 1\) then the squared singular values of \(\DDD_{\Lambda}\) exceed \((1 - \mu_1 (K-1))\).
Proof

From previous theorem we have

\[1 - \mu_1(K-1) \leq \sigma^2 \leq 1 + \mu_1(K-1).\]

Since the singular values are always non-negative, the lower bound is useful only when \(\mu_1(K-1) < 1\). When it holds we have

\[\sigma(\DDD_{\Lambda}) \geq \sqrt{1 - \mu_1(K-1)}.\]
Theorem
Let \(\mu_1(K -1 ) < 1\). If a signal can be written as a linear combination of \(k\) atoms, then any other exact representation of the signal requires at least \((K - k + 1)\) atoms.
Proof

If \(\mu_1(K -1 ) < 1\), then the singular values of any sub-matrix of \(K\) atoms are non-zero. Thus, the minimum number of atoms required to form a linearly dependent set is at least \(K + 1\). Let the number of atoms in any other exact representation of the signal be \(l\). Then

\[k + l \geq K + 1 \implies l \geq K - k + 1.\]

Babel function and gram matrix

Let \(\Lambda\) index a subdictionary and let \(G = \DDD_{\Lambda}^H \DDD_{\Lambda}\) denote the Gram matrix of the subdictionary \(\DDD_{\Lambda}\). Assume \(K = | \Lambda |\).

Theorem
\[\| G \|_{\infty} = \| G \|_{1} \leq 1 + \mu_1(K - 1).\]
Proof

Since \(G\) is Hermitian, hence the two norms are equal:

\[\| G \|_{\infty} = \| G^H \|_{1} = \| G \|_{1}.\]

Now each row consists of a diagonal entry \(1\) and \(K-1\) off diagonal entries. The absolute sum of all the off-diagonal entries in a row is upper bounded by \(\mu_1(K -1)\). Thus, the absolute sum of all the entries in a row is upper bounded by \(1 + \mu_1(K - 1)\). Since \(\| G \|_{\infty}\) is nothing but the maximum \(l_1\) norm of rows of \(G\), hence

\[\| G \|_{\infty} \leq 1 + \mu_1(K - 1).\]
Theorem

Suppose that \(\mu_1(K - 1) < 1\). Then

\[\| G^{-1} \|_{\infty} = \| G^{-1} \|_{1} \leq \frac{1}{1 - \mu_1(K - 1)}\]
Proof

Since \(G\) is Hermitian, hence the two operator norms are equal:

\[\| G^{-1} \|_{\infty} = \| G^{-1} \|_{1}.\]

As usual we can write \(G\) as \(G = I + A\) where \(A\) consists of the off-diagonal entries of \(G\) (recall that since atoms are unit norm, the diagonal entries of \(G\) are 1).

Each row of \(A\) lists inner products between a fixed atom and \(K-1\) other atoms (leaving the 0 at the diagonal entry). Therefore

\[\| A \|_{\infty \to \infty} \leq \mu_1(K - 1)\]

(since \(l_1\) norm of any row is upper bounded by the babel number \(\mu_1(K - 1)\) ). Now \(G^{-1}\) can be written as a Neumann series

\[G^{-1} = \sum_{k=0}^{\infty}(-A)^k.\]

Thus

\[\| G^{-1} \|_{\infty} = \| \sum_{k=0}^{\infty}(-A)^k \|_{\infty} \leq \sum_{k=0}^{\infty} \| (-A)^k \|_{\infty} = \sum_{k=0}^{\infty} \| A \|_{\infty}^k = \frac{1}{1 - \| A \|_{\infty}}.\]

Finally

\[\begin{split}\begin{aligned} \| A \|_{\infty} \leq \mu_1(K - 1) &\iff 1 - \| A \|_{\infty} \geq 1 - \mu_1(K - 1)\\ &\iff \frac{1}{1 - \| A \|_{\infty}} \leq \frac{1}{1 - \mu_1(K - 1)}. \end{aligned}\end{split}\]

Thus

\[\| G^{-1} \|_{\infty} \leq \frac{1}{1 - \mu_1(K - 1)}.\]

Quasi incoherent dictionaries

Definition

When the Babel function of a dictionary grows slowly, we say that the dictionary is quasi-incoherent .

Implementing the Babel function

We will implement the babel function in Matlab. Here is the signature of the function:

function [ babel ] = babel( Phi )

Let’s compute the Gram matrix:

G = Phi' * Phi;

We now take the absolute values of all entries in the gram matrix:

absG = abs(G);

We sort the rows in absG in descending order:

GS = sort(absG, 2,'descend');

We compute the cumulative sums over each row of GS leaving out the first column:

rowSums = cumsum(GS(:, 2:end), 2);

The babel function is now obtained by simply taking maximum over each column:

babel = max(rowSums);

This implementation is available in the sparse-plex library as spx.dict.babel.

Dirac-DCT dictionary

Definition
The Dirac-DCT dictionary is a two-ortho dictionary consisting of the union of the Dirac and the DCT bases.

This dictionary is suitable for real signals since both Dirac and DCT are totally real bases \(\in \RR^{N \times N}\).

The dictionary is obtained by combining the \(N \times N\) identity matrix (Dirac basis) with the \(N \times N\) DCT matrix for signals in \(\RR^N\).

Let \(\Psi_{\text{DCT}, N}\) denote the DCT matrix for \(\RR^N\). Let \(I_N\) denote the identity matrix for \(\RR^N\). Then

\[\DD_{\text{DCT}} = \begin{bmatrix} I_N & \Psi_{\text{DCT}, N} \end{bmatrix}.\]

Let

\[\Psi_{\text{DCT}, N} = \begin{bmatrix} \psi_1 & \psi_2 & \dots & \psi_N \end{bmatrix}\]

The \(k\)-th column of \(\Psi_{\text{DCT}, N}\) is given by

(1)\[\psi_k(n) = \sqrt{\frac{2}{N}} \Omega_k \cos \left (\frac{\pi}{2 N} (2 n - 1) (k - 1) \right ), n = 1, \dots, N,\]

with \(\Omega_k = \frac{1}{\sqrt{2}}\) for \(k=1\) and \(\Omega_k = 1\) for \(2 \leq k \leq N\).

Note that for \(k=1\), the entries become

\[\sqrt{\frac{2}{N}} \frac{1}{\sqrt{2}} \cos 0 = \sqrt{\frac{1}{N}}.\]

Thus, the \(l_2\) norm of \(\psi_1\) is 1. We can similarly verify that the \(l_2\) norms of the other columns are also 1.
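We can construct \(\Psi_{\text{DCT}, N}\) directly from (1) and check these claims numerically (a plain MATLAB sketch; the size is an illustrative choice):

N = 16;
[n, k] = ndgrid(1:N, 1:N);                       % row index n, column index k
Psi = sqrt(2/N) * cos(pi/(2*N) * (2*n - 1) .* (k - 1));
Psi(:, 1) = Psi(:, 1) / sqrt(2);                 % Omega_1 = 1/sqrt(2)
disp(norm(Psi' * Psi - eye(N)));                 % orthonormal: close to zero
DD = [eye(N) Psi];                               % the Dirac-DCT dictionary
disp(max(max(abs(DD' * DD) - eye(2*N))));        % coherence, about sqrt(2/N)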

Theorem
The Dirac-DCT dictionary has coherence \(\sqrt{\frac{2}{N}}\).
Proof

The coherence of a two ortho basis where one basis is Dirac basis is given by the magnitude of the largest entry in the other basis. For \(\Psi_{\text{DCT}, N}\), the largest value is obtained when \(\Omega_k = 1\) and the \(\cos\) term evaluates to 1. Clearly,

\[\mu (\DD_{\text{DCT}}) = \sqrt{\frac{2}{N}}.\]
Theorem

The \(p\)-babel function for Dirac-DCT dictionary is given by

\[\mu_p(k) = k^{\frac{1}{p}} \mu \Forall 1\leq k \leq N.\]

In particular, the standard babel function is given by

\[\mu_1(k) = k\mu\]
Proof
TODO prove it.

Hands-on with Dirac DCT dictionaries

Example: Constructing a Dirac DCT dictionary

We need to specify the dimension of the ambient space:

N = 256;

We are ready to construct the dictionary:

Phi = spx.dict.simple.dirac_dct_mtx(N);

Let’s visualize the dictionary:

imagesc(Phi);
colorbar;
_images/demo_dirac_dct_1.png

Measuring the coherence of the dictionary:

>> spx.dict.coherence(Phi)

ans =

    0.0884

We can cross-check with the theoretical estimate:

>> sqrt(2/N)

ans =

    0.0884

Let’s construct the babel function for this dictionary:

mu1 = spx.dict.babel(Phi);

We can plot it:

plot(mu1);
grid on;
_images/demo_dirac_dct_babel.png

We note that the babel function increases linearly for the initial part and saturates to a value of 16 afterwards.

Dirac-Hadamard dictionary

Definition
The Dirac-Hadamard dictionary is a two-ortho dictionary consisting of the union of the Dirac and the Hadamard bases.

This dictionary is suitable for real signals since both Dirac and Hadamard are totally real bases \(\in \RR^{N \times N}\).

\(N\), \(N/12\) or \(N/20\) must be a power of 2 to allow for the construction of Hadamard matrix.

Hadamard matrix is special in the sense that all the entries are either 1 or -1. Thus, multiplication with the matrix can be achieved by simple additions and subtractions:

>> A = hadamard(12)

A =

     1     1     1     1     1     1     1     1     1     1     1     1
     1    -1     1    -1     1     1     1    -1    -1    -1     1    -1
     1    -1    -1     1    -1     1     1     1    -1    -1    -1     1
     1     1    -1    -1     1    -1     1     1     1    -1    -1    -1
     1    -1     1    -1    -1     1    -1     1     1     1    -1    -1
     1    -1    -1     1    -1    -1     1    -1     1     1     1    -1
     1    -1    -1    -1     1    -1    -1     1    -1     1     1     1
     1     1    -1    -1    -1     1    -1    -1     1    -1     1     1
     1     1     1    -1    -1    -1     1    -1    -1     1    -1     1
     1     1     1     1    -1    -1    -1     1    -1    -1     1    -1
     1    -1     1     1     1    -1    -1    -1     1    -1    -1     1
     1     1    -1     1     1     1    -1    -1    -1     1    -1    -1

>> A' * A

ans =

    12     0     0     0     0     0     0     0     0     0     0     0
     0    12     0     0     0     0     0     0     0     0     0     0
     0     0    12     0     0     0     0     0     0     0     0     0
     0     0     0    12     0     0     0     0     0     0     0     0
     0     0     0     0    12     0     0     0     0     0     0     0
     0     0     0     0     0    12     0     0     0     0     0     0
     0     0     0     0     0     0    12     0     0     0     0     0
     0     0     0     0     0     0     0    12     0     0     0     0
     0     0     0     0     0     0     0     0    12     0     0     0
     0     0     0     0     0     0     0     0     0    12     0     0
     0     0     0     0     0     0     0     0     0     0    12     0
     0     0     0     0     0     0     0     0     0     0     0    12

While constructing the Dirac-Hadamard dictionary, we need to ensure that the columns of the dictionary are normalized.

Hands-on with Dirac Hadamard dictionaries

Example: Constructing a Dirac Hadamard dictionary

We need to specify the dimension of the ambient space:

N = 256;

We are ready to construct the dictionary:

Phi = spx.dict.simple.dirac_hadamard_mtx(N);

Let’s visualize the dictionary:

imagesc(Phi);
colorbar;
_images/demo_dirac_hadamard_1.png

Measuring the coherence of the dictionary:

>> spx.dict.coherence(Phi)

ans =

    0.0625

Let’s construct the babel function for this dictionary:

mu1 = spx.dict.babel(Phi);

We can plot it:

plot(mu1);
grid on;
_images/demo_dirac_hadamard_babel.png

We note that the babel function increases linearly for the initial part and saturates to a value of 16 afterwards.

Example: Normalization in Dirac Hadamard dictionary

We can construct a Dirac Hadamard dictionary for a small size to see the effect of normalization:

>> spx.dict.simple.dirac_hadamard_mtx(4)

ans =

    1.0000         0         0         0    0.5000    0.5000    0.5000    0.5000
         0    1.0000         0         0    0.5000   -0.5000    0.5000   -0.5000
         0         0    1.0000         0    0.5000    0.5000   -0.5000   -0.5000
         0         0         0    1.0000    0.5000   -0.5000   -0.5000    0.5000

Dictionaries with Wavelet Toolbox

MATLAB Wavelet Toolbox provides good support for constructing multi-basis dictionaries (dictionaries that are constructed by concatenating one or more subdictionaries which are either orthogonal bases or wavelet packets).

Constructing Dictionaries

Example: Dirac DCT Dictionary

We need to specify the dimension of the signal space \(\RR^N\):

N  = 32;

We can now construct the dictionary:

Phi = wmpdictionary(N, 'lstcpt', {'RnIdent', 'dct'});
_images/wmp_dirac_dct_N_32.png

The name-value pair argument lstcpt takes the list of constituent subdictionaries.

Example: Symlet DCT Dictionary

We wish to combine a symlet ONB with 4 vanishing moments and a 5-level decomposition with a DCT basis.

We can now construct the dictionary:

N = 256;
[Phi, nb_atoms] = wmpdictionary(N, 'lstcpt', { {'sym4', 5}, 'dct'});
_images/wmp_sym4_dct_N_256.png

The vector nb_atoms tells us the number of atoms in each subdictionary:

>> nb_atoms
nb_atoms =

    256   256
Example: Symlet, Symlet Packets, DCT Dictionary

Here we will combine symlets with the wavelet packet version of symlets and the DCT ONB.

  1. symlet with 4 vanishing moments and 5 level decomposition
  2. wavelet packet symlet with 4 vanishing moments and 5 level decomposition
  3. DCT basis
N = 256;
[Phi, nb_atoms] = wmpdictionary(N, 'lstcpt', { {'sym4', 5}, {'wpsym4', 5}, 'dct'});
_images/wmp_sym4_wpsym4_dct_N_256.png

We can visualize the atoms in this dictionary one by one. sparse-plex provides a method to visualize the atoms one by one and save the visualizations in the form of an MP4 video file:

spx.graphics.multi_basis_dict_movie('sym4_wpsym4_dct.mp4', ...
Phi, nb_atoms, {'sym4', 'wpsym4', 'dct'})

We have specified the name of the output video file, the dictionary to be visualized, number of atoms in each subdictionary and names of subdictionaries.

Compressive Sensing

Introduction to compressive sensing

In this section we formally define the problem of compressed sensing.

Compressive sensing refers to the idea that for sparse or compressible signals, a small number of nonadaptive measurements carries sufficient information to approximate the signal well. In the literature it is also known as compressed sensing and compressive sampling. Different authors prefer different names.

In this section we will represent a signal dictionary as well as its synthesis matrix as \(\DD\) .

We recall the definition of sparse signals. A signal \(x \in \CC^N\) is \(K\) -sparse in \(\DD\) if there exists a representation \(\alpha\) for \(x\) which has at most \(K\) non-zeros. i.e.

\[x = \DD \alpha\]

and

\[\| \alpha \|_0 \leq K.\]

The dictionary could be standard basis, Fourier basis, wavelet basis, a wavelet packet dictionary, a multi-ONB or even a randomly generated dictionary.

Real-life signals are rarely exactly sparse, yet they are compressible in the sense that their entries decay rapidly when sorted by magnitude. As a result, compressible signals are well approximated by sparse signals. Note that we are talking about the sparsity or compressibility of the signal in a suitable dictionary. Thus, we mean that the signal \(x\) has a representation \(\alpha\) in \(\DD\) in which the coefficients decay rapidly when sorted by magnitude.

Definition

In compressed sensing, a measurement is a linear functional applied to a signal

\[y = \langle x, f \rangle.\]

The compressed sensor makes multiple such linear measurements. This can best be represented by the action of a sensing matrix \(\Phi\) on the signal \(x\) given by

\[y = \Phi x\]

where \(\Phi \in \CC^{M \times N}\) represents \(M\) different measurements made on the signal \(x\) by the sensing process. Each row of \(\Phi\) represents one linear measurement.

The vector \(y \in \CC^M\) is known as measurement vector .

\(\CC^N\) forms the signal space while \(\CC^M\) forms the measurement space .

We also note that above can be written as

\[y = \Phi x = \Phi \DD \alpha = (\Phi \DD) \alpha.\]

It is assumed that the signal \(x\) is \(K\) -sparse or \(K\) -compressible in \(\DD\) and \(K \ll N\) .

The objective is to recover \(x\) from \(y\) given that \(\Phi\) and \(\DD\) are known.

We do this by first recovering the sparse representation \(\alpha\) from \(y\) and then computing \(x = \DD \alpha\) .

If \(M \geq N\) then the problem is a straightforward least squares problem. So we don't consider it here.

The more interesting case is when \(K < M \ll N\), i.e. the number of measurements is much smaller than the dimension of the ambient signal space, while being larger than the sparsity level \(K\) of the signal.

We note that once \(\alpha\) is found, computing \(x\) is straightforward. We can therefore remove the dictionary from consideration and look at the simplified problem: recover \(x\) from \(y\) with

\[y = \Phi x\]

where \(x \in \CC^N\) itself is assumed to be \(K\) -sparse or \(K\) -compressible and \(\Phi \in \CC^{M \times N}\) is the sensing matrix.
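The following MATLAB sketch illustrates this measurement model for an exactly \(K\)-sparse signal. The signal length, sparsity level and number of measurements are illustrative choices, and a plain Gaussian matrix is used rather than any of the sparse-plex constructors.

% Sketch: compressive measurements of a K-sparse signal (illustrative sizes).
N = 256;                          % ambient signal dimension
M = 64;                           % number of measurements, K < M << N
K = 8;                            % sparsity level

% Construct a K-sparse signal x in R^N
x = zeros(N, 1);
support = randperm(N, K);         % random support of size K
x(support) = randn(K, 1);         % random non-zero values

% Gaussian sensing matrix Phi of size M x N
Phi = randn(M, N) / sqrt(M);

% Measurement vector y in R^M
y = Phi * x;
fprintf('N = %d, M = %d, K = %d, length(y) = %d\n', N, M, K, length(y));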

Note

The definition above doesn’t consider the noise introduced during taking the measurements. We will introduce noise later.

The sensing matrix

There are two ways to look at the sensing matrix. First view is in terms of its columns

(1)\[\Phi = \begin{bmatrix} \phi_1 & \phi_2 & \dots & \phi_N \end{bmatrix}\]

where \(\phi_i \in \CC^M\) are the columns of sensing matrix. In this view we see that

\[y = \sum_{i=1}^{N} x_i \phi_i\]

i.e. \(y\) belongs to the column span of \(\Phi\) and one representation of \(y\) in \(\Phi\) is given by \(x\) .

This view looks very similar to a dictionary and its atoms but there is a difference. In a dictionary, we require each atom to be unit norm. We don’t require columns of the sensing matrix \(\Phi\) to be unit norm.

The second view of the sensing matrix \(\Phi\) is in terms of its rows. We write

(2)\[\begin{split}\Phi = \begin{bmatrix} \chi_1^H \\ \chi_2^H \\ \vdots \\ \chi_M^H \end{bmatrix}\end{split}\]

where \(\chi_i \in \CC^N\) are conjugate transposes of the rows of \(\Phi\). This view gives us the following result:

\[\begin{split}\begin{bmatrix} y_1\\ y_2 \\ \vdots y_M \end{bmatrix} = \begin{bmatrix} \chi_1^H \\ \chi_2^H \\ \vdots \\ \chi_M^H \end{bmatrix} x = \begin{bmatrix} \chi_1^H x\\ \chi_2^H x\\ \vdots \\ \chi_M^H x \end{bmatrix} = \begin{bmatrix} \langle x , \chi_1 \rangle \\ \langle x , \chi_2 \rangle \\ \vdots \\ \langle x , \chi_M \rangle \\ \end{bmatrix}\end{split}\]

In this view \(y_i\) is a measurement given by the inner product of \(x\) with \(\chi_i\) \(( \langle x , \chi_i \rangle = \chi_i^H x)\) .

We will call \(\chi_i\) a sensing vector. There are \(M\) such sensing vectors in \(\CC^N\) comprising \(\Phi\), corresponding to the \(M\) measurements in the measurement space \(\CC^M\).

Note

Dictionary design focuses on creating sparsest possible representations of the signals in a particular domain. Sensing matrix design focuses on reducing the number of measurements as much as possible while still being able to recover the sparse representation from the measurements.

Number of measurements

A fundamental question of the compressed sensing framework is: How many measurements are necessary to acquire \(K\)-sparse signals? By necessary we mean that \(y\) carries enough information about \(x\) such that \(x\) can be recovered from \(y\).

If \(M < K\) then recovery is not possible.

We further note that the sensing matrix \(\Phi\) should not map two different \(K\) -sparse signals to the same measurement vector. Thus, we will need \(M \geq 2K\) and each collection of \(2K\) columns in \(\Phi\) must be non-singular.

Think
Why do we need \(2K\) or more measurements? What happens if some set of \(2K\) or fewer columns of \(\Phi\) forms a linearly dependent set?

If the \(K\)-column sub matrices of \(\Phi\) are badly conditioned, then it is possible that some sparse signals get mapped to very similar measurement vectors. Thus it is numerically unstable to recover the signal. Moreover, if noise is present, stability further degrades.

In [CT06] Candès and Tao showed that the geometry of sparse signals should be preserved under the action of a sensing matrix. In particular, the distance between two sparse signals shouldn't change by much during sensing.

They quantified this idea in the form of a restricted isometric constant of a matrix \(\Phi\) as the smallest number \(\delta_K\) for which the following holds

\[(1 - \delta_K) \| x \|_2^2 \leq \| \Phi x \|_2^2 \leq (1 + \delta_K) \| x \|_2^2 \Forall x : \| x \|_0 \leq K.\]

We will study more about this property known as restricted isometry property (RIP) later. Here we just sketch the implications of RIP for compressed sensing.

When \(\delta_K < 1\) then the inequalities imply that every collection of \(K\) columns from \(\Phi\) is non-singular. Since we need every collection of \(2K\) columns to be non-singular, we actually need \(\delta_{2K} < 1\) which is the minimum requirement for recovery of \(K\) sparse signals.

Further, if \(\delta_{2K} \ll 1\), then the sensing operator very nearly maintains the \(l_2\) distance between any two \(K\)-sparse signals. In consequence, it is possible to invert the sensing process stably.

It is now known that many randomly generated matrices have excellent RIP behavior. One can show that if \(\delta_{2K} \leq 0.1\) , then with

\[M = \bigO{K \ln ^{\alpha} N}\]

measurements, one can recover \(x\) with high probability.

Some of the typical random matrices which have suitable RIP properties are

  • Gaussian sensing matrices
  • Partial Fourier matrices
  • Rademacher sensing matrices
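The sketch below constructs small instances of these three families in plain MATLAB. The \(1/\sqrt{M}\) scaling (unit expected column norm) is one common convention, not the only one; dftmtx comes from the Signal Processing Toolbox.

% Sketch: three common random sensing matrix constructions (M x N).
M = 64; N = 256;

% Gaussian: i.i.d. entries drawn from N(0, 1/M)
Phi_gauss = randn(M, N) / sqrt(M);

% Rademacher: i.i.d. entries +-1/sqrt(M) with equal probability
Phi_rad = sign(randn(M, N)) / sqrt(M);

% Partial Fourier: M rows of the N x N unitary DFT matrix chosen at random
F = dftmtx(N) / sqrt(N);
rows = randperm(N, M);
Phi_fourier = F(rows, :) * sqrt(N / M);   % rescaled so that columns have unit norm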

Signal recovery

The second fundamental problem in compressed sensing is: Given the compressed measurements \(y\) how do we recover the signal \(x\)? This problem is known as SPARSE-RECOVERY problem.

A naive formulation of the problem, minimize \(\| x \|_0\) subject to \(y = \Phi x\), is hopeless to solve directly since it entails a combinatorial explosion of the search space.

Over the years, people have developed a number of algorithms to tackle the sparse recovery problem.

The algorithms can be broadly classified into following categories

  • Greedy pursuits: These algorithms attempt to build the approximation of the signal iteratively by making locally optimal choices at each step. Examples of such algorithms include OMP (orthogonal matching pursuit), stage-wise OMP, regularized OMP, CoSaMP (compressive sampling matching pursuit) and IHT (iterative hard thresholding).
  • Convex relaxation: These techniques relax the \(l_0\) “norm” minimization problem into a suitable convex optimization problem. This relaxation is valid for a large class of signals of interest. Once the problem has been formulated as a convex optimization problem, a number of solvers are available, e.g. interior point methods, projected gradient methods and iterative thresholding.
  • Combinatorial algorithms: These methods are based on research in group testing and are specifically suited for situations where highly structured measurements of the signal are taken. This class includes algorithms like Fourier sampling, chaining pursuit, and HHS pursuit.

A major emphasis of the following chapters will be the study of these sparse recovery algorithms.

In the following we present examples of real life problems which can be modeled as compressed sensing problems.

Error correction in linear codes

The classical error correction problem was discussed in one of the seminal founding papers on compressed sensing [CT05].

Let \(f \in \RR^N\) be a “plaintext” message being sent over a communication channel.

In order to make the message robust against errors in the communication channel, we encode it with an error correcting code.

We consider \(A \in \RR^{D \times N}\) with \(D > N\) as a linear code. \(A\) is essentially a collection of codewords given by

\[A = \begin{bmatrix} a_1 & a_2 & \dots & a_N \end{bmatrix}\]

where \(a_i \in \RR^D\) are the codewords.

We construct the “ciphertext”

\[x = A f\]

where \(x \in \RR^D\) is sent over the communication channel. \(x\) is a redundant representation of \(f\) which is expected to be robust against small errors during transmission.

\(A\) is assumed to be full column rank. Thus \(A^T A\) is invertible and we can easily see that

\[f = A^{\dag} x\]

where

\[A^{\dag} = (A^T A)^{-1}A^T\]

is the left pseudo inverse of \(A\) . The communication channel is going to add some error. What we actually receive is

\[y = x + e = A f + e\]

where \(e \in \RR^D\) is the error being introduced by the channel.

The least squares solution by minimizing the error \(l_2\) norm is given by

\[f' = A^{\dag} y = A^{\dag} (A f + e) = f + A^{\dag} e.\]

Since \(A^{\dag} e\) is usually non-zero (we cannot assume that \(A^{\dag}\) will annihilate \(e\) ), hence \(f'\) is not an exact replica of \(f\).

What is needed is an exact reconstruction of \(f\). To achieve this, a common assumption in literature is that error vector \(e\) is in fact sparse. i.e.

\[\| e \|_0 \leq K \ll D.\]

To reconstruct \(f\) it is sufficient to reconstruct \(e\) since once \(e\) is known we can get

\[x = y -e\]

and from there \(f\) can be faithfully reconstructed.

The question is: for a given sparsity level \(K\) for the error vector \(e\) can one reconstruct \(e\) via practical algorithms? By practical we mean algorithms which are of polynomial time w.r.t. the length of “ciphertext” (D).

The approach in [CT05] is as follows.

We construct a matrix \(F \in \RR^{M \times D}\) which can annihilate \(A\) i.e.

\[FA = 0.\]

We then apply \(F\) to \(y\) giving us

\[\tilde{y} = F (A f + e) = Fe.\]

Therefore the decoding problem is reduced to that of reconstructing a sparse vector \(e \in \RR^D\) from the measurements \(Fe \in \RR^M\) where we would like to have \(M \ll D\) .

With this the problem of finding \(e\) can be cast as problem of finding a sparse solution for the under-determined system given by

(3)\[\begin{split}\begin{aligned} & \underset{e \in \Sigma_K}{\text{minimize}} & & \| e \|_0 \\ & \text{subject to} & & \tilde{y} = F e\\ \end{aligned}\end{split}\]

This now becomes the compressed sensing problem. The natural questions are

  • How many measurements \(M\) are necessary (in \(F\) ) to be able to recover \(e\) exactly?
  • How should \(F\) be constructed?
  • How do we recover \(e\) from \(\tilde{y}\) ?

These questions are discussed in upcoming sections.
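A small MATLAB sketch of this reduction is shown below. The sizes, the random choice of \(A\) and the use of null to build an annihilator \(F\) are illustrative assumptions, not the specific construction of [CT05].

% Sketch: reducing channel decoding to a sparse recovery problem.
N = 128;  D = 256;  K = 10;       % message length, codeword length, number of errors

A = randn(D, N);                  % a random linear code (full column rank w.h.p.)
f = randn(N, 1);                  % "plaintext" message
x = A * f;                        % "ciphertext"

% Sparse error introduced by the channel
e = zeros(D, 1);
e(randperm(D, K)) = randn(K, 1);
y = x + e;                        % received vector

% Annihilator F: its rows span the left null space of A, so F * A = 0
F = null(A')';                    % size (D - N) x D
y_tilde = F * y;                  % equals F * e; recover the sparse e from y_tilde
fprintf('norm(F*A) = %g (should be close to 0)\n', norm(F * A, 'fro'));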

Recovery of exactly sparse signals

The null space of a matrix \(\Phi\) is denoted as

\[\NullSpace(\Phi) = \{ v \in \RR^N :\Phi v = 0\}.\]

The set of \(K\) -sparse signals is defined as

\[\Sigma_K = \{ x \in \RR^N : \|x\|_0 \leq K\}.\]
Example: K-sparse signals

Let \(N=10\) .

  • \(x=(1,2, 1, -1, 2 , -3, 4, -2, 2, -2) \in \RR^{10}\) is not a sparse signal.
  • \(x=(0,0,0,0,1,0,0,-1,0,0)\in \RR^{10}\) is a 2-sparse signal. It is also a 4-sparse signal.
Lemma
If \(a\) and \(b\) are two \(K\)-sparse signals then \(a - b\) is a \(2K\)-sparse signal.
Proof
\((a - b)_i\) is non-zero only if at least one of \(a_i\) and \(b_i\) is non-zero. Hence the number of non-zero components of \(a - b\) cannot exceed \(2K\). Hence \(a - b\) is a \(2K\)-sparse signal.
Example: Difference of K-sparse signals

Let N = 5.

  • Let \(a = (0,1,-1,0, 0)\) and \(b = (0,2,0,-1, 0)\). Then \(a - b = (0,-1,-1,1, 0)\) is a 3 sparse hence 4 sparse signal.
  • Let \(a = (0,1,-1,0, 0)\) and \(b = (0,2,-1,0, 0)\). Then \(a - b = (0,-1,-2,0, 0)\) is a 2 sparse hence 4 sparse signal.
  • Let \(a = (0,1,-1,0, 0)\) and \(b = (0,0,0,1, -1)\). Then \(a - b = (0,1,-1,-1, 1)\) is a 4 sparse signal.
Lemma
A sensing matrix \(\Phi\) uniquely represents all \(x \in \Sigma_K\) if and only if \(\NullSpace(\Phi)\) contains no non-zero vector from \(\Sigma_{2K}\).
Proof

Let \(a\) and \(b\) be two distinct \(K\)-sparse signals. Then \(\Phi a\) and \(\Phi b\) are the corresponding measurements. If \(\Phi\) uniquely represents all \(K\)-sparse signals, then \(\Phi a \neq \Phi b\). Thus \(\Phi (a - b) \neq 0\), and the non-zero \(2K\)-sparse vector \(a - b\) does not belong to \(\NullSpace(\Phi)\).

Conversely, let \(x\) be a non-zero vector in \(\NullSpace(\Phi) \cap \Sigma_{2K}\). Thus \(\Phi x = 0\) and \(\#x \leq 2K\). Then we can find distinct \(y, z \in \Sigma_K\) such that \(x = z - y\). Thus \(\Phi z = \Phi y\), and \(\Phi\) doesn't uniquely represent \(y, z \in \Sigma_K\).

There are many equivalent ways of characterizing above condition.

The spark

We recall that the spark of a matrix \(\Phi\) is defined as the minimum number of columns of \(\Phi\) which form a linearly dependent set.

Definition
A signal \(x \in \RR^N\) is called an explanation of a measurement \(y \in \RR^M\) w.r.t. sensing matrix \(\Phi\) if \(y = \Phi x\) .
Theorem
For any measurement \(y \in \RR^M\), there exists at most one signal \(x \in \Sigma_K\) such that \(y = \Phi x\) if and only if \(\spark(\Phi) > 2K\) .
Proof

We need to show

  • If for every measurement, there is only one \(K\) -sparse explanation, then \(\spark(\Phi) > 2K\) .
  • If \(\spark(\Phi) > 2K\) then for every measurement, there is only one \(K\) -sparse explanation.

Assume that for every \(y \in \RR^M\) there exists at most one \(K\) sparse signal \(x \in \RR^N\) such that \(y = \Phi x\) .

Now assume that \(\spark(\Phi) \leq 2K\) . Thus there exists a set of at most \(2K\) columns which are linearly dependent.

Thus there exists a non-zero \(v \in \Sigma_{2K}\) such that \(\Phi v = 0\). Thus \(v \in \NullSpace (\Phi)\).

Thus \(\NullSpace (\Phi)\) contains a non-zero vector from \(\Sigma_{2K}\).

Hence \(\Phi\) doesn’t uniquely represent each signal \(x \in \Sigma_K\) . A contradiction.

Hence \(\spark(\Phi) > 2K\) .

Now suppose that \(\spark(\Phi) > 2K\) .

Assume that for some \(y\) there exist two different K-sparse explanations \(x, x'\) such that \(y = \Phi x =\Phi x'\) .

Thus \(\Phi (x - x') = 0\) . Thus \(x - x ' \in \NullSpace (\Phi)\) and \(x - x' \in \Sigma_{2K}\) .

Thus \(\spark(\Phi) \leq 2K\) . A contradiction.

Since \(\spark(\Phi) \in [2, M+1]\) and we require that \(\spark(\Phi) > 2K\) hence we require that \(M \geq 2K\) .

Recovery of approximately sparse signals

The spark is a useful criterion for characterizing sensing matrices for truly sparse signals. But it doesn't work well for approximately sparse signals. We need more restrictive criteria on \(\Phi\) to ensure recovery of approximately sparse signals from compressed measurements.

In this context we will deal with two types of errors:

  • Approximation error: Let us approximate a signal \(x\) using only \(K\) coefficients. Let us call the approximation \(\widehat{x}\). Then \(e_a = (x - \widehat{x})\) is the approximation error.
  • Recovery error: Let \(\Phi\) be a sensing matrix and let \(\Delta\) be a recovery algorithm. Then \(x'= \Delta(\Phi x)\) is the recovered signal vector. The error \(e_r = (x - x')\) is the recovery error.

In this section we will

  • Formalize the notion of null space property (NSP) of a matrix \(\Phi\) .
  • Describe a measure for performance of an arbitrary recovery algorithm \(\Delta\) .
  • Establish the connection between NSP and performance guarantee for recovery algorithms.

Suppose we approximate \(x\) by a \(K\)-sparse signal \(\widehat{x} \in \Sigma_K\). Then the minimum error in the \(l_p\) norm is given by

\[\sigma_K(x)_p = \min_{\widehat{x} \in \Sigma_K} \| x - \widehat{x}\|_p.\]

The specific \(\widehat{x} \in \Sigma_K\) for which this minimum is achieved is called the best \(K\)-term approximation of \(x\).
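Because the best \(K\)-term approximation in any \(l_p\) norm is obtained by keeping the \(K\) largest magnitude entries of \(x\), \(\sigma_K(x)_p\) is easy to compute. The sketch below uses an illustrative compressible vector.

% Sketch: computing sigma_K(x)_p, the best K-term approximation error.
x = (1:10)'.^(-2);                % a compressible vector with power-law decay
K = 3;  p = 1;

[~, idx] = sort(abs(x), 'descend');
x_hat = zeros(size(x));
x_hat(idx(1:K)) = x(idx(1:K));    % best K-term approximation: keep K largest entries
sigma_K = norm(x - x_hat, p);
fprintf('sigma_%d(x)_%d = %.4f\n', K, p, sigma_K);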

In the following, we will need some new notation.

Let \(I = \{1,2,\dots, N\}\) be the set of indices for signal \(x \in \RR^N\) .

Let \(\Lambda \subset I\) be a subset of indices.

Let \(\Lambda^c = I \setminus \Lambda\) .

\(x_{\Lambda}\) will denote a signal vector obtained by setting the entries of \(x\) indexed by \(\Lambda^c\) to zero.

Example

Let N = 4. Then \(I = \{1,2,3,4\}\) . Let \(\Lambda = \{1,3\}\) . Then \(\Lambda^c = \{2, 4\}\) .

Now let \(x = (-1,1,2,-4)\) . Then \(x_{\Lambda} = (-1, 0, 2, 0)\) .

\(\Phi_{\Lambda}\) will denote a \(M\times N\) matrix obtained by setting the columns of \(\Phi\) indexed by \(\Lambda^c\) to zero.

Example

Let N = 4. Then \(I = \{1,2,3,4\}\) . Let \(\Lambda = \{1,3\}\) . Then \(\Lambda^c = \{2, 4\}\) .

Now let \(x = (-1,1,2,-4)\). Then \(x_{\Lambda} = (-1, 0, 2, 0)\).

Now let

\[\begin{split}\Phi = \begin{pmatrix} 1 & 0 & -1 & 1\\ -1 & -2 & 2 & 3 \end{pmatrix}\end{split}\]

Then

\[\begin{split}\Phi_{\Lambda} = \begin{pmatrix} 1 & 0 & -1 & 0\\ -1 & 0 & 2 & 0 \end{pmatrix}\end{split}\]
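These restrictions are easy to form in MATLAB. The sketch below reproduces the example above by zeroing out the entries and columns indexed by \(\Lambda^c\).

% Sketch: forming x_Lambda and Phi_Lambda for the example above.
N = 4;
x = [-1; 1; 2; -4];
Phi = [1 0 -1 1; -1 -2 2 3];

Lambda = [1 3];                        % chosen index set
Lambda_c = setdiff(1:N, Lambda);       % complement {2, 4}

x_L = x;     x_L(Lambda_c) = 0;        % x_Lambda   = (-1, 0, 2, 0)
Phi_L = Phi; Phi_L(:, Lambda_c) = 0;   % Phi_Lambda: columns 2 and 4 set to zero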
Definition

A matrix \(\Phi\) satisfies the null space property (NSP) of order \(K\) if there exists a constant \(C > 0\) such that,

\[\| h_{\Lambda}\|_2 \leq C \frac{\| h_{{\Lambda}^c}\|_1 }{\sqrt{K}}\]

holds \(\forall h \in \NullSpace (\Phi)\) and \(\forall \Lambda\) such that \(|\Lambda| \leq K\) .

  • Let \(h\) be a non-zero \(K\)-sparse vector. Choosing \(\Lambda\) to be the support of \(h\), we have \(|\Lambda| \leq K\) and \(h_{{\Lambda}^c} = 0\), so \(\| h_{{\Lambda}^c}\|_1 = 0\) while \(\| h_{\Lambda}\|_2 > 0\). Hence the above condition cannot be satisfied. Thus such a vector \(h\) cannot belong to \(\NullSpace(\Phi)\) if \(\Phi\) satisfies the NSP.
  • Essentially vectors in \(\NullSpace (\Phi)\) shouldn’t be concentrated in a small subset of indices.
  • If \(\Phi\) satisfies NSP then the only \(K\) -sparse vector in \(\NullSpace(\Phi)\) is \(h = 0\) .

Measuring the performance of a recovery algorithm

Let \(\Delta : \RR^M \rightarrow \RR^N\) represent a recovery method to recover approximately sparse \(x\) from \(y\) .

\(l_2\) recovery error is given by

\[\| \Delta (\Phi x) - x \|_2.\]

The \(l_1\) error for \(K\) -term approximation is given by \(\sigma_K(x)_1\) .

We will be interested in guarantees of the form

(1)\[ \| \Delta (\Phi x) - x \|_2 \leq C \frac{\sigma_K (x)_1}{\sqrt{K}}\]

Why this particular recovery guarantee formulation?

  • Exact recovery of K-sparse signals. \(\sigma_K (x)_1 = 0\) if \(x \in \Sigma_K\) .
  • Robust recovery of non-sparse signals
  • Recovery dependent on how well the signals are approximated by \(K\) -sparse vectors.
  • Such guarantees are known as instance optimal guarantees.
  • Also known as uniform guarantees.

Why the specific choice of norms?

  • Different choices of \(l_p\) norms lead to different guarantees.
  • \(l_2\) norm on the LHS is a typical least squares error.
  • \(l_2\) norm on the RHS would require a prohibitively large number of measurements.
  • \(l_1\) norm on the RHS helps us keep the number of measurements less.

If an algorithm \(\Delta\) provides instance optimal guarantees as defined above, what kind of requirements does it place on the sensing matrix \(\Phi\) ?

We show that NSP of order \(2K\) is a necessary condition for providing uniform guarantees.

Theorem
Let \(\Phi : \RR^N \rightarrow \RR^M\) denote a sensing matrix and \(\Delta : \RR^M \rightarrow \RR^N\) denote an arbitrary recovery algorithm. If the pair \((\Phi, \Delta)\) satisfies instance optimal guarantee (1), then \(\Phi\) satisfies NSP of the order \(2K\) .
Proof

We are given that

  • \((\Phi, \Delta)\) form an encoder-decoder pair.
  • Together, they satisfy the instance optimal guarantee (1).
  • Thus they are able to recover all sparse signals exactly.
  • For non-sparse signals, they are able to recover their \(K\) -sparse approximation with bounded recovery error.

We need to show that if \(h \in \NullSpace(\Phi)\), then \(h\) satisfies

\[\| h_{\Lambda}\|_2 \leq C \frac{\| h_{{\Lambda}^c}\|_1 }{\sqrt{2K}}\]

where \(\Lambda\) corresponds to \(2K\) largest magnitude entries in \(h\) .

Note that we have used \(2K\) in this expression, since we need to show that \(\Phi\) satisfies NSP of order \(2K\) .

Let \(h \in \NullSpace(\Phi)\) .

Let \(\Lambda\) be the indices corresponding to the \(2K\) largest (in magnitude) entries of \(h\). Thus

\[h = h_{\Lambda} + h_{\Lambda^c}.\]

Split \(\Lambda\) into \(\Lambda_0\) and \(\Lambda_1\) such that \(|\Lambda_0| = |\Lambda_1| = K\) . Now

\[h_{\Lambda} = h_{\Lambda_0} + h_{\Lambda_1}.\]

Let

\[x = h_{\Lambda_0} + h_{\Lambda^c}.\]

Let

\[x' = - h_{\Lambda_1}.\]

Then

\[h = x - x'.\]

By assumption \(h \in \NullSpace(\Phi)\)

Thus

\[\Phi h = \Phi(x - x') = 0 \implies \Phi x = \Phi x'.\]

But since \(x' \in \Sigma_K\) (recall that \(\Lambda_1\) indexes only \(K\) entries) and \(\Delta\) is able to recover all \(K\) -sparse signals exactly, hence

\[x' = \Delta (\Phi x').\]

Thus

\[\Delta (\Phi x) = \Delta (\Phi x') = x'.\]

i.e. the recovery algorithm \(\Delta\) recovers \(x'\) for the signal \(x\). Note that \(x\) itself is in general not \(K\)-sparse, while \(x'\) is.

Finally we also have (since \(h\) contains some additional non-zero entries)

\[\| h_{\Lambda} \|_2 \leq \| h \|_2 = \| x - x'\|_2 = \| x - \Delta (\Phi x)\| _2.\]

But as per instance optimal recovery guarantee (1) for \((\Phi, \Delta)\) pair, we have

\[\| \Delta (\Phi x) - x \|_2 \leq C \frac{\sigma_K (x)_1}{\sqrt{K}}.\]

Thus

\[\| h_{\Lambda} \|_2 \leq C \frac{\sigma_K (x)_1}{\sqrt{K}}.\]

But

\[\sigma_K (x)_1 = \min_{\widehat{x} \in \Sigma_K} \| x - \widehat{x}\|_1.\]

Recall that \(x =h_{\Lambda_0} + h_{\Lambda^c}\) where \(\Lambda_0\) indexes \(K\) entries of \(h\) which are (magnitude wise) larger than all entries indexed by \(\Lambda^c\) . Thus the best \(l_1\) -norm \(K\) term approximation of \(x\) is given by \(h_{\Lambda_0}\) .

Hence

\[\sigma_K (x)_1 = \| h_{\Lambda^c} \|_1.\]

Thus we finally have

\[\| h_{\Lambda} \|_2 \leq C \frac{\| h_{\Lambda^c} \|_1}{\sqrt{K}} = \sqrt{2}C \frac{\| h_{\Lambda^c} \|_1}{\sqrt{2K}} \quad \forall h \in \NullSpace(\Phi).\]

Thus \(\Phi\) satisfies the NSP of order \(2K\) .

It turns out that NSP of order \(2K\) is also sufficient to establish a guarantee of the form above for a practical recovery algorithm.

Recovery in presence of measurement noise

Measurement vector in the presence of noise is given by

\[y =\Phi x + e\]

where \(e\) is the measurement noise or error. \(\| e \|_2\) is the \(l_2\) size of measurement error.

Recovery error as usual is given by

\[\| \Delta (y) - x \|_2 = \| \Delta (\Phi x + e) - x \|_2\]

Stability of a recovery algorithm is characterized by comparing variation of recovery error w.r.t. measurement error.

NSP is both necessary and sufficient for establishing guarantees of the form:

\[\| \Delta (\Phi x) - x \|_2 \leq C \frac{\sigma_K (x)_1}{\sqrt{K}}\]

These guarantees do not account for presence of noise during measurement. We need stronger conditions for handling noise. The restricted isometry property for sensing matrices comes to our rescue.

Restricted isometry property

A matrix \(\Phi\) satisfies the restricted isometry property (RIP) of order \(K\) if there exists \(\delta_K \in (0,1)\) such that

(1)\[ (1- \delta_K) \| x \|^2_2 \leq \| \Phi x \|^2_2 \leq (1 + \delta_K) \| x \|^2_2\]

holds for all \(x \in \Sigma_K = \{ x : \| x\|_0 \leq K \}\) .

  • If a matrix satisfies RIP of order \(K\) , then we can see that it approximately preserves the size of a \(K\)-sparse vector.
  • If a matrix satisfies RIP of order \(2K\) , then we can see that it approximately preserves the distance between any two \(K\)-sparse vectors since difference vectors would be \(2K\) sparse.
  • We say that the matrix is nearly orthonormal for sparse vectors.
  • If a matrix satisfies RIP of order \(K\) with a constant \(\delta_K\) , it automatically satisfies RIP of any order \(K' < K\) with a constant \(\delta_{K'} \leq \delta_{K}\) .

Stability

Informally, a recovery algorithm is stable if recovery error is small in the presence of small measurement error.

Is RIP necessary and sufficient for sparse signal recovery from noisy measurements?

Let us look at the necessary part. We will define a notion of stability of the recovery algorithm.

Definition

Let \(\Phi : \RR^N \rightarrow \RR^M\) be a sensing matrix and \(\Delta : \RR^M \rightarrow \RR^N\) be a recovery algorithm. We say that the pair \((\Phi, \Delta)\) is \(C\)-stable if for any \(x \in \Sigma_K\) and any \(e \in \RR^M\) we have that

\[\| \Delta(\Phi x + e) - x\|_2 \leq C \| e\|_2.\]
  • Error is added to the measurements.
  • LHS is \(l_2\) norm of recovery error.
  • RHS consists of scaling of the \(l_2\) norm of measurement error.
  • The definition says that recovery error is bounded by a multiple of the measurement error.
  • Thus, adding a small amount of measurement noise shouldn’t be causing arbitrarily large recovery error.

It turns out that \(C\)-stability requires \(\Phi\) to satisfy RIP.

Theorem

If a pair \((\Phi, \Delta)\) is \(C\)-stable then

\[\frac{1}{C} \| x\|_2 \leq \| \Phi x \|_2\]

for all \(x \in \Sigma_{2K}\) .

Proof

Any \(x \in \Sigma_{2K}\) can be written in the form of \(x = y - z\) where \(y, z \in \Sigma_K\) .

So let \(x \in \Sigma_{2K}\) . Split it in the form of \(x = y -z\) with \(y, z \in \Sigma_{K}\) .

Define

\[e_y = \frac{\Phi (z - y)}{2} \quad \text{and} \quad e_z = \frac{\Phi (y - z)}{2}\]

Thus

\[e_y - e_z = \Phi (z - y) \implies \Phi y + e_y = \Phi z + e_z\]

We have

\[\Phi y + e_y = \Phi z + e_z = \frac{\Phi (y + z)}{2}.\]

Also, we have

\[\| e_y \|_2 = \| e_z \|_2 = \frac{\| \Phi (y - z) \|_2}{2} = \frac{\| \Phi x \|_2}{2}\]

Let

\[y' = \Delta (\Phi y + e_y) = \Delta (\Phi z + e_z)\]

Since \((\Phi, \Delta)\) is \(C\)-stable, hence we have

\[\| y'- y\|_2 \leq C \| e_y\|_2.\]

also

\[\| y'- z\|_2 \leq C \| e_z\|_2.\]

Using the triangle inequality

\[\begin{split}\| x \|_2 &= \| y - z\|_2 = \| y - y' + y' - z \|_2\\ &\leq \| y - y' \|_2 + \| y' - z\|_2\\ &\leq C \| e_y \|_2 + C \| e_z \|_2 = C (\| e_y \|_2 + \| e_z \|_2) = C \| \Phi x \|_2\end{split}\]

Thus we have \(\forall x \in \Sigma_{2K}\)

\[\frac{1}{C}\| x \|_2 \leq \| \Phi x \|_2\]

This theorem gives us the lower bound for RIP property of order \(2K\) in (1) with \(\delta_{2K} = 1 - \frac{1}{C^2}\) as a necessary condition for \(C\)-stable recovery algorithms.

Note that the smaller the constant \(C\), the lower the bound on the recovery error (w.r.t. the measurement error). But as \(C \to 1\), \(\delta_{2K} \to 0\); thus reducing the impact of measurement noise requires the sensing matrix \(\Phi\) to be designed with tighter RIP constraints.

\(C\)-stability doesn’t require an upper bound on the RIP property in (1).

It turns out that RIP is also sufficient for a variety of algorithms to be able to successfully recover a sparse signal from noisy measurements. We will discuss this later.

Measurement bounds

As stated in previous section, for a \((\Phi, \Delta)\) pair to be \(C\)-stable we require that \(\Phi\) satisfies RIP of order \(2K\) with a constant \(\delta_{2K}\).

Let us ignore \(\delta_{2K}\) for the time being and look at relationship between \(M\) , \(N\) and \(K\).

We have a sensing matrix \(\Phi\) of size \(M\times N\) and expect it to provide RIP of order \(2K\) .

How many measurements \(M\) are necessary?

We will assume that \(K < N / 2\). This assumption is valid for approximately sparse signals.

Before we start figuring out the bounds, let us develop a special subset of \(\Sigma_K\) sets.

Consider the set

\[U = \{ x \in \{0, +1, -1\}^N : \| x\|_0 = K \}\]

Some explanation: By \(A^N\) we mean \(A \times A \times \dots \times A\) i.e. \(N\) times Cartesian product of \(A\) .

When we say \(\| x\|_0 = K\) , we mean that only \(K\) terms in each member of \(U\) can be non-zero (i.e. \(-1\) or \(+1\) ).

So \(U\) is a set of signal vectors \(x\) of length \(N\) where each sample takes values from \(\{0, +1, -1\}\) and number of allowed non-zero samples is fixed at \(K\) .

An example below explains it further.

Example: U for N=6 and K=2

Each vector in \(U\) will have 6 elements out of which \(2\) can be non zero. There are \(\binom{6}{2}\) ways of choosing the non-zero elements. Some of those sets are listed below as examples:

\[\begin{split}&(+1,+1,0,0,0,0)\\ &(+1,-1,0,0,0,0)\\ &(0,-1,0,+1,0,0)\\ &(0,0,0,0,-1,+1)\\ &(0,0,-1,-1,0,0)\end{split}\]

\(U\) is a grid in the union of subspaces \(\Sigma_K\).

Revisiting

\[U = \{ x \in \{0, +1, -1\}^N : \| x\|_0 = K \}\]

It’s now obvious that

\[\| x \|_2^2 = K \quad \forall x \in U.\]

Since there are \(\binom{N}{K}\) ways of choosing \(K\) non-zero elements and each non zero element can take either of the two values \(+1\) or \(-1\) , hence the cardinality of set \(U\) is given by:

\[|U| = \binom{N}{K} 2^K\]
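For the earlier example with \(N=6\) and \(K=2\), this count is easily verified:

% Sketch: cardinality of U for N = 6, K = 2.
N = 6; K = 2;
card_U = nchoosek(N, K) * 2^K;    % choose the support, then a sign for each entry
fprintf('|U| = %d\n', card_U);    % prints |U| = 60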

By definition

\[U \subset \Sigma_K.\]

Further, let \(x, y \in U\).

Then \(x - y\) will have a maximum of \(2K\) non-zero elements. The non-zero elements would have values \(\in \{-2,-1,1,2\}\) .

Thus \(\| x - y \|_0 = R \leq 2K\).

Further, \(\| x - y \|_2^2 \geq R\).

Hence

\[\| x - y \|_0 \leq \| x - y \|_2^2 \quad \forall x, y \in U.\]

We now state a lemma which will help us in getting to the bounds.

Lemma

Let \(K\) and \(N\) satisfying \(K < \frac{N}{2}\) be given. There exists a set \(X \subset \Sigma_K\) such that for any \(x \in X\) we have \(\| x \|_2 \leq \sqrt{K}\) and for any \(x, y \in X\) with \(x \neq y\) ,

\[\| x - y \|_2 \geq \sqrt{\frac{K}{2}}.\]

and

\[\ln | X | \geq \frac{K}{2} \ln \left( \frac{N}{K} \right) .\]

The lemma establishes the existence of a set in the union of subspaces \(\Sigma_K\) within a sphere of radius \(\sqrt{K}\) whose points are sufficiently apart and whose size is sufficiently large.

Proof

We just need to find one set \(X\) which satisfies the requirements of this lemma. We have to construct a set \(X\) such that

  • \(\| x \|_2 \leq \sqrt{K} \quad \forall x \in X.\)
  • \(\| x - y \|_2 \geq \sqrt{\frac{K}{2}} \quad \forall x, y \in X.\)
  • \(\ln | X | \geq \frac{K}{2} \ln \left( \frac{N}{K} \right)\) or equivalently \(|X| \geq \left( \frac{N}{K} \right)^{\frac{K}{2}}\) .

We will construct \(X\) by picking vectors from \(U\) . Thus \(X \subset U\) .

Since \(X \subset U\), we have \(\| x \|_2 = \sqrt{K}\) for every \(x \in X\), so the first requirement is satisfied.

Consider any fixed \(x \in U\) .

How many elements \(y\) are there in \(U\) such that \(\|x - y\|_2^2 < \frac{K}{2}\) ?

Define

\[U_x^2 = \left \{ y \in U : \|x - y\|_2^2 < \frac{K}{2} \right \}\]

Clearly by requirements in the lemma, if \(x \in X\) then \(U_x^2 \cap X = \phi\) . i.e. no vector in \(U_x^2\) belongs to \(X\) .

How many elements are there in \(U_x^2\) ?

Let us find an upper bound. \(\forall x, y \in U\) we have \(\|x - y\|_0 \leq \|x - y\|_2^2\) .

If \(x\) and \(y\) differ in \(\frac{K}{2}\) or more places, then naturally \(\|x - y\|_2^2 \geq \frac{K}{2}\) .

Hence, if \(\|x - y\|_2^2 < \frac{K}{2}\) then \(\|x - y\|_0 < \frac{K}{2}\), and so \(\|x - y\|_0 \leq \frac{K}{2}\) for every \(y \in U_x^2\).

So define

\[U_x^0 = \left \{ y \in U : \|x - y\|_0 \leq \frac{K}{2} \right \}\]

We have

\[U_x^2 \subseteq U_x^0\]

Thus we have an upper bound given by

\[| U_x^2 | \leq | U_x^0 |.\]

Let us look at \(U_x^0\) carefully.

We can choose \(\frac{K}{2}\) indices where \(x\) and \(y\) may differ in \(\binom{N}{\frac{K}{2}}\) ways.

At each of these \(\frac{K}{2}\) indices, \(y_i\) can take one of the three values \(0, +1, -1\).

Thus we have an upper bound

\[| U_x^2 | \leq | U_x^0 | \leq \binom {N}{\frac{K}{2}} 3^{\frac{K}{2}}.\]

We now describe an iterative process for building \(X\) from vectors in \(U\) .

Say we have added \(j\) vectors to \(X\) namely \(x_1, x_2,\dots, x_j\) .

Then

\[(U^2_{x_1} \cup U^2_{x_2} \cup \dots \cup U^2_{x_j}) \cap X = \phi\]

Number of vectors in \(U^2_{x_1} \cup U^2_{x_2} \cup \dots \cup U^2_{x_j}\) is bounded by \(j \binom {N}{ \frac{K}{2}} 3^{\frac{K}{2}}\) .

Thus we have at least

\[\binom{N}{K} 2^K - j \binom {N}{ \frac{K}{2}} 3^{\frac{K}{2}}\]

vectors left in \(U\) to choose from for adding in \(X\) .

We can keep adding vectors to \(X\) till there are no more suitable vectors left.

So we can construct a set of size \(|X|\) provided

(2)\[|X| \binom {N}{ \frac{K}{2}} 3^{\frac{K}{2}} \leq \binom{N}{K} 2^K\]

Now

\[\frac{\binom{N}{K}}{\binom{N}{\frac{K}{2}}} = \frac {\left ( \frac{K}{2} \right ) ! \left (N - \frac{K}{2} \right ) ! } {K! (N-K)!} = \prod_{i=1}^{\frac{K}{2}} \frac{N - K + i}{ K/ 2 + i}\]

Note that \(\frac{N - K + i}{ K/ 2 + i}\) is a decreasing function of \(i\) .

Its minimum value is achieved for \(i=\frac{K}{2}\) as \((\frac{N}{K} - \frac{1}{2})\) .

So we have

\[\begin{split}&\frac{N - K + i}{ K/ 2 + i} \geq \frac{N}{K} - \frac{1}{2}\\ &\implies \prod_{i=1}^{\frac{K}{2}} \frac{N - K + i}{ K/ 2 + i} \geq \left ( \frac{N}{K} - \frac{1}{2} \right )^{\frac{K}{2}}\\ &\implies \frac{\binom{N}{K}}{\binom{N}{\frac{K}{2}}} \geq \left ( \frac{N}{K} - \frac{1}{2} \right )^{\frac{K}{2}}\end{split}\]

Rephrasing (2) we have

\[|X| \left( \frac{3}{4} \right )^{\frac{K}{2}} \leq \frac{\binom{N}{K}}{\binom{N}{\frac{K}{2}}}\]

So if

\[|X| \left( \frac{3}{4} \right ) ^{\frac{K}{2}} \leq \left ( \frac{N}{K} - \frac{1}{2} \right )^{\frac{K}{2}}\]

then (2) will be satisfied.

Now it is given that \(K < \frac{N}{2}\) . So we have:

\[\begin{split}& K < \frac{N}{2}\\ &\implies \frac{N}{K} > 2\\ &\implies \frac{N}{4K} > \frac{1}{2}\\ &\implies \frac{N}{K} - \frac{N}{4K} < \frac{N}{K} - \frac{1}{2}\\ &\implies \frac{3N}{4K} < \frac{N}{K} - \frac{1}{2}\\ &\implies \left( \frac{3N}{4K} \right) ^ {\frac{K}{2}}< \left ( \frac{N}{K} - \frac{1}{2} \right )^{\frac{K}{2}}\\\end{split}\]

Thus we have

\[\left( \frac{N}{K} \right) ^ {\frac{K}{2}} \left( \frac{3}{4} \right) ^ {\frac{K}{2}} < \frac{\binom{N}{K}}{\binom{N}{\frac{K}{2}}}\]

Choose

\[|X| = \left( \frac{N}{K} \right) ^ {\frac{K}{2}}\]

Clearly, this value of \(|X|\) satisfies (2). Hence \(X\) can have at least these many elements. Thus

\[\begin{split}&|X| \geq \left( \frac{N}{K} \right) ^ {\frac{K}{2}}\\ &\implies \ln |X| \geq \frac{K}{2} \ln \left( \frac{N}{K} \right)\end{split}\]

which completes the proof.

We can now establish following bound on the required number of measurements to satisfy RIP.

At this moment, we won’t worry about exact value of \(\delta_{2K}\) . We will just assume that \(\delta_{2K}\) is small in range \((0, \frac{1}{2}]\) .

Theorem

Let \(\Phi\) be an \(M \times N\) matrix that satisfies RIP of order \(2K\) with constant \(\delta_{2K} \in (0, \frac{1}{2}]\) . Then

\[M \geq C K \ln \left ( \frac{N}{K} \right )\]

where \(C = \frac{1}{2 \ln (\sqrt{24} + 1)} \approx 0.28173\) .

Proof

Since \(\Phi\) satisfies RIP of order \(2K\) we have

\[\begin{split}& (1 - \delta_{2K}) \| x \|^2_2 \leq \| \Phi x \|^2_2 \leq (1 + \delta_{2K}) \| x\|^2_2 \quad \forall x \in \Sigma_{2K}.\\ & \implies (1 - \delta_{2K}) \| x - y \|^2_2 \leq \| \Phi x - \Phi y\|^2_2 \leq (1 + \delta_{2K}) \| x - y\|^2_2 \quad \forall x, y \in \Sigma_K.\end{split}\]

Also

\[\delta_{2K} \leq \frac{1}{2} \implies 1 - \delta_{2K} \geq \frac{1}{2} \text{ and } 1 + \delta_{2K} \leq \frac{3}{2}\]

Consider the set \(X \subset U \subset \Sigma_K\) developed in above.

We have

\[\begin{split}&\| x - y\|^2_2 \geq \frac{K}{2} \quad \forall x, y \in X\\ &\implies (1 - \delta_{2K}) \| x - y \|^2_2 \geq \frac{K}{4}\\ &\implies \| \Phi x - \Phi y\|^2_2 \geq \frac{K}{4}\\ &\implies \| \Phi x - \Phi y\|_2 \geq \sqrt{\frac{K}{4}} \quad \forall x, y \in X\end{split}\]

Also

\[\begin{split}&\| \Phi x \|^2_2 \leq (1 + \delta_{2K}) \| x\|^2_2 \leq \frac{3}{2} \| x\|^2_2 \quad \forall x \in X \subset \Sigma_K \subset \Sigma_{2K}\\ &\implies \| \Phi x \|_2 \leq \sqrt {\frac{3}{2}} \| x\|_2 \leq \sqrt {\frac{3K}{2}} \quad \forall x \in X.\end{split}\]

since \(\| x\|_2 \leq \sqrt{K} \quad \forall x \in X\) .

So we have a lower bound:

(3)\[\| \Phi x - \Phi y\|_2 \geq \sqrt{\frac{K}{4}} \quad \forall x, y \in X.\]

and an upper bound:

(4)\[\| \Phi x \|_2 \leq \sqrt {\frac{3K}{2}} \quad \forall x \in X.\]

What do these bounds mean? Let us start with the lower bound. \(\Phi x\) and \(\Phi y\) are projections of \(x\) and \(y\) in \(\RR^M\) (measurement space).

Construct \(l_2\) balls of radius \(\sqrt{\frac{K}{4}} / 2= \sqrt{\frac{K}{16}}\) in \(\RR^M\) around \(\Phi x\) and \(\Phi y\) .

Lower bound says that these balls are disjoint. Since \(x, y\) are arbitrary, this applies to every \(x \in X\).

Upper bound tells us that all vectors \(\Phi x\) lie in a ball of radius \(\sqrt {\frac{3K}{2}}\) around origin in \(\RR^M\) .

Thus, the set of all balls lies within a larger ball of radius \(\sqrt {\frac{3K}{2}} + \sqrt{\frac{K}{16}}\) around origin in \(\RR^M\) .

So we require that the volume of the larger ball MUST be greater than the sum of volumes of \(|X|\) individual balls.

Since volume of an \(l_2\) ball of radius \(r\) is proportional to \(r^M\) , we have:

\[\begin{split}&\left ( \sqrt {\frac{3K}{2}} + \sqrt{\frac{K}{16}} \right )^M \geq |X| \left ( \sqrt{\frac{K}{16}} \right )^M\\ & \implies (\sqrt {24} + 1)^M \geq |X| \\ & \implies M \geq \frac{\ln |X| }{\ln (\sqrt {24} + 1) }\end{split}\]

Again from above we have

\[\ln |X| \geq \frac{K}{2} \ln \left ( \frac{N}{K} \right ).\]

Putting back we get

\[M \geq \frac{\frac{K}{2} \ln \left ( \frac{N}{K} \right ) }{\ln (\sqrt {24} + 1) }\]

which establishes a lower bound on the number of measurements \(M\) .

Example: Lower bounds on M for RIP of order 2K
  1. \(N=1000, K=100 \implies M \geq 65\) .
  2. \(N=1000, K=200 \implies M \geq 91\) .
  3. \(N=1000, K=400 \implies M \geq 104\) .
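These numbers follow directly from the bound in the theorem. A quick MATLAB check (rounding up to the next integer) is shown below.

% Sketch: evaluating the measurement lower bound M >= C K ln(N/K).
C = 1 / (2 * log(sqrt(24) + 1));  % approximately 0.28173
N = 1000;
for K = [100 200 400]
    M_min = ceil(C * K * log(N / K));
    fprintf('N = %d, K = %d  =>  M >= %d\n', N, K, M_min);
end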

Some remarks are in order:

  • The theorem only establishes a necessary lower bound on \(M\) . It doesn’t mean that if we choose an \(M\) larger than the lower bound then \(\Phi\) will have RIP of order \(2K\) with any constant \(\delta_{2K} \in (0, \frac{1}{2}]\) .
  • The restriction \(\delta_{2K} \leq \frac{1}{2}\) is arbitrary and is made for convenience. In general, we can work with \(0 < \delta_{2K} \leq \delta_{\text{max}} < 1\) and develop the bounds accordingly.
  • This result fails to capture the dependence of \(M\) on the RIP constant \(\delta_{2K}\) directly. The Johnson-Lindenstrauss lemma, which concerns embeddings of finite sets of points in low-dimensional spaces, helps us resolve this.
  • We haven’t made significant efforts to optimize the constants. Still they are quite reasonable.

The RIP and the NSP

RIP and NSP are connected. If a matrix \(\Phi\) satisfies RIP then it also satisfies NSP (under certain conditions).

Thus RIP is strictly stronger than NSP (under certain conditions).

We will need the following lemma, which applies to any arbitrary \(h \in \RR^N\). The lemma will be proved later.

Lemma

Suppose that \(\Phi\) satisfies RIP of order \(2K\), and let \(h \in \RR^N, h \neq 0\) be arbitrary. Let \(\Lambda_0\) be any subset of \(\{1,2,\dots, N\}\) such that \(|\Lambda_0| \leq K\).

Define \(\Lambda_1\) as the index set corresponding to the \(K\) entries of \(h_{\Lambda_0^c}\) with largest magnitude, and set \(\Lambda = \Lambda_0 \cup \Lambda_1\). Then

\[\| h_{\Lambda} \|_2 \leq \alpha \frac{\| h_{\Lambda_0^c} \|_1 }{ \sqrt{K}} + \beta \frac{| \langle \Phi h_{\Lambda}, \Phi h \rangle | }{\| h_{\Lambda} \|_2},\]

where

\[\alpha = \frac{\sqrt{2} \delta_{2K}}{ 1 - \delta_{2K}} , \beta = \frac{1}{ 1 - \delta_{2K}}.\]

Let us understand this lemma a bit. If \(h \in \NullSpace (\Phi)\), then the lemma simplifies to

\[\| h_{\Lambda} \|_2 \leq \alpha \frac{\| h_{\Lambda_0^c} \|_1 }{ \sqrt{K}}\]
  • \(\Lambda_0\) maps to the initial few ( \(K\) or less) elements we chose.
  • \(\Lambda_0^c\) maps to all other elements.
  • \(\Lambda_1\) maps to largest (in magnitude) \(K\) elements of \(\Lambda_0^c\) .
  • \(h_{\Lambda}\) contains a maximum of \(2K\) non-zero elements.
  • \(\Phi\) satisfies RIP of order \(2K\) .
  • Thus \((1 - \delta_{2K}) \| h_{\Lambda} \|^2_2 \leq \| \Phi h_{\Lambda} \|^2_2 \leq (1 + \delta_{2K}) \| h_{\Lambda} \|^2_2\) .

We now state the connection between RIP and NSP.

Theorem

Suppose that \(\Phi\) satisfies RIP of order \(2K\) with \(\delta_{2K} < \sqrt{2} - 1\) . Then \(\Phi\) satisfies the NSP of order \(2K\) with constant

\[C= \frac {\sqrt{2} \delta_{2K}} {1 - (1 + \sqrt{2})\delta_{2K}}\]
Proof

We are given

\[(1- \delta_{2K}) \| x \|^2_2 \leq \| \Phi x \|^2_2 \leq (1 + \delta_{2K}) \| x \|^2_2\]

holds for all \(x \in \Sigma_{2K}\) where \(\delta_{2K} < \sqrt{2} - 1\).

We have to show that:

\[\| h_{\Lambda}\|_2 \leq C \frac{\| h_{{\Lambda}^c}\|_1 }{\sqrt{K}}\]

holds \(\forall h \in \NullSpace (\Phi)\) and \(\forall \Lambda\) such that \(|\Lambda| \leq 2K\).

Let \(h \in \NullSpace(\Phi)\) . Then \(\Phi h = 0\) .

Let \(\Lambda_m\) denote the \(2K\) largest entries of \(h\). Then

\[\| h_{\Lambda}\|_2 \leq \| h_{\Lambda_m}\|_2 \quad \forall \Lambda : |\Lambda| \leq 2K.\]

Similarly

\[\| h_{\Lambda^c}\|_1 \geq \| h_{\Lambda_m^c}\|_1 \quad \forall \Lambda : |\Lambda| \leq 2K.\]

Thus if we show that \(\Phi\) satisfies NSP of order \(2K\) for \(\Lambda_m\) , i.e.

\[\| h_{\Lambda_m}\|_2 \leq C \frac{\| h_{{\Lambda_m}^c}\|_1 }{\sqrt{K}}\]

then we would have shown it for all \(\Lambda\) such that \(|\Lambda| \leq 2K\) . So let \(\Lambda = \Lambda_m\) .

We can divide \(\Lambda\) into two components \(\Lambda_0\) and \(\Lambda_1\) of size \(K\) each.

Since \(\Lambda\) maps to the largest \(2K\) entries in \(h\) hence whatever entries we choose in \(\Lambda_0\) , the largest \(K\) entries in \(\Lambda_0^c\) will be \(\Lambda_1\) .

Hence, as per the lemma above, we have

\[\| h_{\Lambda} \|_2 \leq \alpha \frac{\| h_{\Lambda_0^c}\|_1}{\sqrt{K}}\]

Also

\[\Lambda = \Lambda_0 \cup \Lambda_1 \implies \Lambda_0 = \Lambda \setminus \Lambda_1 = \Lambda \cap \Lambda_1^c \implies \Lambda_0^c = \Lambda_1 \cup \Lambda^c\]

Thus we have

\[\| h_{\Lambda_0^c} \|_1 = \| h_{\Lambda_1} \|_1 + \| h_{\Lambda^c} \|_1\]

We have to get rid of \(\Lambda_1\) .

Since \(h_{\Lambda_1} \in \Sigma_K\), by applying the bound \(\| x \|_1 \leq \sqrt{K} \| x \|_2\), which holds for every \(x \in \Sigma_K\), we get

\[\| h_{\Lambda_1} \|_1 \leq \sqrt{K} \| h_{\Lambda_1} \|_2\]

Hence

\[\| h_{\Lambda} \|_2 \leq \alpha \left ( \| h_{\Lambda_1} \|_2 + \frac{\| h_{\Lambda^c} \|_1}{\sqrt{K}} \right)\]

But since \(\Lambda_1 \subset \Lambda\) , hence \(\| h_{\Lambda_1} \|_2 \leq \| h_{\Lambda} \|_2\) , hence

\[\begin{split}&\| h_{\Lambda} \|_2 \leq \alpha \left ( \| h_{\Lambda} \|_2 + \frac{\| h_{\Lambda^c} \|_1}{\sqrt{K}} \right)\\ \implies &(1 - \alpha) \| h_{\Lambda} \|_2 \leq \alpha \frac{\| h_{\Lambda^c} \|_1}{\sqrt{K}}\\ \implies &\| h_{\Lambda} \|_2 \leq \frac{\alpha}{1 - \alpha} \frac{\| h_{\Lambda^c} \|_1}{\sqrt{K}} \quad \text{ if } \alpha < 1.\end{split}\]

Note that the inequality is also satisfied for \(\alpha = 1\) in which case, we don’t need to bring \(1-\alpha\) to denominator.

Now

\[\begin{split}&\alpha \leq 1\\ \implies &\frac{\sqrt{2} \delta_{2K}}{ 1 - \delta_{2K}} \leq 1 \\ \implies &\sqrt{2} \delta_{2K} \leq 1 - \delta_{2K}\\ \implies &(\sqrt{2} + 1) \delta_{2K} \leq 1\\ \implies &\delta_{2K} \leq \sqrt{2} - 1\end{split}\]

Putting

\[C = \frac{\alpha}{1 - \alpha} = \frac {\sqrt{2} \delta_{2K}} {1 - (1 + \sqrt{2})\delta_{2K}}\]

we see that \(\Phi\) satisfies NSP of order \(2K\) whenever \(\Phi\) satisfies RIP of order \(2K\) with \(\delta_{2K} \leq \sqrt{2} -1\) .

Note that for \(\delta_{2K} = \sqrt{2} - 1\) , \(C=\infty\) .

Matrices satisfying RIP

The natural question at this moment is how to construct matrices which satisfy RIP.

There are two different approaches

  • Deterministic approach
  • Randomized approach

Known deterministic approaches so far tend to require \(M\) to be very large (\(O(K^2 \ln N)\) or \(O(K N^{\alpha})\)).

We can overcome this limitation by randomizing matrix construction.

Construction process:

  • Input \(M\) and \(N\) .
  • Generate \(\Phi\) by choosing \(\Phi_{ij}\) as independent realizations from some probability distribution.

Suppose that \(\Phi\) is drawn from normal distribution.

It can be shown that the rank of \(\Phi\) is \(M\) with probability 1.

Example: Random matrices are full rank

We can verify this fact by doing a small computer simulation.

M = 6;
N = 20;
trials = 10000;
n_full_rank = 0;
for i=1:trials
    % Create a Gaussian random matrix of size M x N
    A = randn(M,N);
    % Obtain its rank
    R = rank(A);
    % Check whether the rank equals M or not
    if R == M
        n_full_rank = n_full_rank + 1;
    end
end
fprintf('Number of trials: %d\n',trials);
fprintf('Number of full rank matrices: %d\n',n_full_rank);
percentage = n_full_rank*100/trials;
fprintf('Percentage of full rank matrices: %.2f %%\n', percentage);

The above program generates a number of random matrices, measures their ranks and verifies whether they are full rank or not.

Here is a sample output:

Number of trials: 10000
Number of full rank matrices: 10000
Percentage of full rank matrices: 100.00 %

Thus, if we choose \(M=2K\), any subset of \(2K\) columns will be linearly independent with probability 1, and (after suitable normalization) the matrix will satisfy the RIP of order \(2K\) with some constant \(\delta_{2K} < 1\).

But this construction doesn’t tell us exact value of \(\delta_{2K}\) .

In order to find out \(\delta_{2K}\), we must consider all possible \(K\)-dimensional subspaces formed by columns of \(\Phi\).

This is computationally impossible for reasonably large \(N\) and \(K\).
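For very small \(N\) and \(K\), however, the exhaustive computation is feasible. The sketch below computes \(\delta_K\) of a small Gaussian matrix by examining every \(K\)-column submatrix; the sizes are illustrative and the approach clearly does not scale.

% Sketch: exhaustive computation of the RIP constant delta_K for a tiny matrix.
M = 20; N = 25; K = 2;
Phi = randn(M, N) / sqrt(M);          % Gaussian matrix with E(Phi_ij^2) = 1/M

supports = nchoosek(1:N, K);          % all K-column index sets
delta_K = 0;
for i = 1:size(supports, 1)
    S = Phi(:, supports(i, :));       % M x K submatrix
    s = svd(S);                       % its singular values
    % RIP on this support requires 1 - delta <= s_min^2 and s_max^2 <= 1 + delta
    delta_K = max(delta_K, max(abs(s.^2 - 1)));
end
fprintf('delta_%d of this matrix = %.4f\n', K, delta_K);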

What is the alternative?

We can start with a chosen value of \(\delta_{2K}\) and try to construct a matrix which matches it.

Before we proceed further, we should take a detour and review sub-Gaussian distributions; they are covered in detail in a later section.

We now state the main theorem of this section.

Theorem

Suppose that \(X = [X_1, X_2, \dots, X_M]\) where each \(X_i\) is i.i.d. with \(X_i \sim \Sub (c^2)\) and \(\EE (X_i^2) = \sigma^2\) . Then

\[\EE (\| X\|_2^2) = M \sigma^2\]

Moreover, for any \(\alpha \in (0,1)\) and for any \(\beta \in [c^2/\sigma^2, \beta_{\text{max}}]\), there exists a constant \(\kappa^* \geq 4\) depending only on \(\beta_{\text{max}}\) and the ratio \(\sigma^2/c^2\) such that

\[\PP(\| X\|_2^2 \leq \alpha M \sigma^2) \leq \exp \left ( -\frac{M(1-\alpha)^2}{\kappa^*} \right )\]

and

\[\PP(\| X\|_2^2 \geq \beta M \sigma^2) \leq \exp \left ( -\frac{M(\beta-1)^2}{\kappa^*} \right )\]

The theorem states that the squared length of the random vector \(X\) is concentrated around its mean value. If we choose \(\sigma\) such that \(M \sigma^2 = 1\), then we have \(\alpha \leq \| X \|_2^2 \leq \beta\) with very high probability.

Conditions on random distribution for RIP

Let us get back to our business of constructing a matrix \(\Phi\) using random distributions which satisfies RIP with a given \(\delta\) .

We will impose some conditions on the random distribution.

  • We require that the distribution will yield a matrix that is norm-preserving. This requires that
(1)\[ \EE (\Phi_{ij}^2) = \frac{1}{M}\]

Hence variance of distribution should be \(\frac{1}{M}\).

  • We require that distribution is a sub-Gaussian distribution i.e. there exists a constant \(c > 0\) such that

    (2)\[ \EE(\exp(\Phi_{ij} t)) \leq \exp \left (\frac{c^2 t^2}{2} \right )\]

    This says that the moment generating function of the distribution is dominated by a Gaussian distribution.

    In other words, tails of the distribution decay at least as fast as the tails of a Gaussian distribution.

We will further assume that entries of \(\Phi\) are strictly sub-Gaussian. i.e. they must satisfy (2) with

\[c^2 = \EE (\Phi_{ij}^2) = \frac{1}{M}\]

Under these conditions we have the following result.

Corollary

Suppose that \(\Phi\) is an \(M\times N\) matrix whose entries \(\Phi_{ij}\) are i.i.d., with \(\Phi_{ij}\) drawn according to a strictly sub-Gaussian distribution with \(c^2 = \frac{1}{M}\).

Let \(Y = \Phi x\) for \(x \in \RR^N\). Then for any \(\epsilon > 0\) and any \(x \in \RR^N\) ,

\[\EE ( \| Y \|_2^2) = \| x \|_2^2\]

and

\[\PP \left( \left| \| Y \|^2_2 - \| x \|_2^2 \right| \geq \epsilon \| x \|_2^2 \right) \leq 2 \exp \left ( - \frac{M \epsilon^2}{\kappa^*} \right)\]

where \(\kappa^* = \frac{2}{1 - \ln(2)} \approx 6.5178\) .

This means that the squared norm of a sub-Gaussian random vector \(Y = \Phi x\) concentrates strongly around its expected value \(\| x \|_2^2\).
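A quick empirical check of this concentration is shown below; the sizes and the number of trials are illustrative, and the Gaussian distribution is used as a particular strictly sub-Gaussian distribution.

% Sketch: norm concentration of Y = Phi * x for Gaussian Phi with entry variance 1/M.
M = 64; N = 256; trials = 1000;
x = randn(N, 1);                      % an arbitrary fixed vector
norms_sq = zeros(trials, 1);
for t = 1:trials
    Phi = randn(M, N) / sqrt(M);      % strictly sub-Gaussian entries, c^2 = 1/M
    norms_sq(t) = norm(Phi * x)^2;
end
fprintf('||x||^2 = %.3f, mean ||Phi x||^2 = %.3f, std = %.3f\n', ...
    norm(x)^2, mean(norms_sq), std(norms_sq));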

Sub Gaussian random matrices satisfy the RIP

Using this result we now state that sub-Gaussian matrices satisfy the RIP.

Theorem

Fix \(\delta \in (0,1)\) . Let \(\Phi\) be an \(M\times N\) random matrix whose entries \(\Phi_{ij}\) are i.i.d. with \(\Phi_{ij}\) drawn according to a strictly sub-Gaussian distribution with \(c^2 = \frac{1}{M}\) . If

\[M \geq \kappa_1 K \ln \left ( \frac{N}{K} \right ),\]

then \(\Phi\) satisfies the RIP of order \(K\) with the prescribed \(\delta\) with probability exceeding \(1 - 2e^{-\kappa_2 M}\) , where \(\kappa_1\) is arbitrary and

\[\kappa_2 = \frac{\delta^2 }{2 \kappa^*} - \frac{1}{\kappa_1} \ln \left ( \frac{42 e}{\delta} \right )\]

We note that this theorem achieves \(M\) of the same order as the lower bound obtained earlier, up to a constant.

This is much better than deterministic approaches.

Advantages of random construction

There are a number of advantages of the random sensing matrix construction approach:

  • One can show that for random construction, the measurements are democratic. This means that all measurements are equal in importance and it is possible to recover the signal from any sufficiently large subset of the measurements. Thus by using random \(\Phi\) one can be robust to the loss or corruption of a small fraction of measurements.
  • In general we are more interested in \(x\) which is sparse in some basis \(\Psi\) . In this setting, we require that \(\Phi \Psi\) satisfy the RIP. Deterministic construction would explicitly require taking \(\Psi\) into account. But if \(\Phi\) is random, we can avoid this issue. If \(\Phi\) is Gaussian and \(\Psi\) is an orthonormal basis, then one can easily show that \(\Phi \Psi\) will also have a Gaussian distribution. Thus if \(M\) is high, \(\Phi \Psi\) will also satisfy RIP with very high probability.

Similar results hold for other sub-Gaussian distributions as well.

Subgaussian distributions

In this section we review subgaussian distributions and matrices drawn from subgaussian distributions.

Examples of subgaussian distributions include

  • Gaussian distribution
  • Rademacher distribution taking values \(\pm \frac{1}{\sqrt{M}}\)
  • Any zero mean distribution with a bounded support
Definition

A random variable \(X\) is called subgaussian if there exists a constant \(c > 0\) such that

(1)\[M_X(t) = \EE [\exp(X t) ] \leq \exp \left (\frac{c^2 t^2}{2} \right )\]

holds for all \(t \in \RR\). We use the notation \(X \sim \Sub (c^2)\) to denote that \(X\) satisfies the constraint (1). We also say that \(X\) is \(c\)-subgaussian.

\(\EE [\exp(X t) ]\) is moment generating function of \(X\).

\(\exp \left (\frac{c^2 t^2}{2} \right )\) is moment generating function of a Gaussian random variable with variance \(c^2\).

The definition means that for a subgaussian variable \(X\), its M.G.F. is bounded by the M.G.F. of a Gaussian random variable \(\sim \mathcal{N}(0, c^2)\).

Example: Gaussian r.v. as subgaussian r.v.

Consider zero-mean Gaussian random variable \(X \sim \mathcal{N}(0, \sigma^2)\) with variance \(\sigma^2\). Then

\[\EE [\exp(X t) ] = \exp\left ( \frac{\sigma^2 t^2}{2} \right )\]

Putting \(c = \sigma\) we see that (1) is satisfied. Hence, \(X\sim \Sub(\sigma^2)\) is a subgaussian r.v. or \(X\) is \(\sigma\)-subgaussian.

Example: Rademacher distribution

Consider \(X\) with

\[\PP_X(x) = \frac{1}{2}\delta(x-1) + \frac{1}{2}\delta(x + 1)\]

i.e. \(X\) takes a value \(1\) with probability \(0.5\) and value \(-1\) with probability \(0.5\).

Then

\[\EE [\exp(X t) ] = \frac{1}{2} \exp(-t) + \frac{1}{2} \exp(t) = \cosh t \leq \exp \left ( \frac{t^2}{2} \right)\]

Thus \(X \sim \Sub(1)\) or \(X\) is 1-subgaussian.
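The bound \(\cosh t \leq \exp(t^2/2)\) used above follows by comparing the Taylor series term by term (since \((2n)! \geq 2^n n!\)); it can also be checked numerically:

% Sketch: numerical check that cosh(t) <= exp(t^2 / 2) on a grid.
t = linspace(-5, 5, 1001);
gap = exp(t.^2 / 2) - cosh(t);
fprintf('min of exp(t^2/2) - cosh(t) over the grid = %.3e\n', min(gap));  % non-negative, zero at t = 0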

Example: Uniform distribution

Consider \(X\) as uniformly distributed over the interval \([-a, a]\) for some \(a > 0\). i.e.

\[\begin{split}f_X(x) = \begin{cases} \frac{1}{2 a} & -a \leq x \leq a\\ 0 & \text{otherwise} \end{cases}\end{split}\]

Then

\[\EE [\exp(X t) ] = \frac{1}{2 a} \int_{-a}^{a} \exp(x t)d x = \frac{1}{2 a t} [e^{at} - e^{-at}] = \sum_{n = 0}^{\infty}\frac{(at)^{2 n}}{(2 n + 1)!}\]

But \((2n+1)! \geq n! 2^n\). Hence we have

\[\sum_{n = 0}^{\infty}\frac{(at)^{2 n}}{(2 n + 1)!} \leq \sum_{n = 0}^{\infty}\frac{(at)^{2 n}}{( n! 2^n)} = \sum_{n = 0}^{\infty}\frac{(a^2 t^2 / 2)^{n}}{( n!)} = \exp \left (\frac{a^2 t^2}{2} \right )\]

Thus

\[\EE [\exp(X t) ] \leq \exp \left ( \frac{a^2 t^2}{2} \right ).\]

Hence \(X \sim \Sub(a^2)\) or \(X\) is \(a\)-subgaussian.

Example: Random variable with bounded support

Consider \(X\) as a zero mean, bounded random variable i.e.

\[\PP(|X| \leq B) = 1\]

for some \(B \in \RR^+\) and

\[\EE(X) = 0.\]

Then, the following upper bound holds:

\[\EE [ \exp(X t) ] = \int_{-B}^{B} \exp(x t) f_X(x) d x \leq \exp\left (\frac{B^2 t^2}{2} \right )\]

This result, essentially Hoeffding's lemma, can be proven with some advanced calculus. Thus \(X \sim \Sub(B^2)\), i.e. \(X\) is \(B\)-subgaussian.

There are some useful properties of subgaussian random variables.

Lemma

If \(X \sim \Sub(c^2)\) then

\[\EE (X) = 0\]

and

\[\EE(X^2) \leq c^2\]

Thus subgaussian random variables are always zero-mean.

Their variance is always bounded by the variance of the bounding Gaussian distribution.

Proof
Expanding the moment generating function as a power series, we have

\[\sum_{n = 0}^{\infty} \frac{t^n}{n!} \EE (X^n) = \EE \left( \sum_{n = 0}^{\infty} \frac{(X t)^n}{n!} \right ) = \EE \left ( \exp(X t) \right )\]

But since \(X \sim \Sub(c^2)\) hence

\[\sum_{n = 0}^{\infty} \frac{t^n}{n!} \EE (X^n) \leq \exp \left ( \frac{c^2 t^2}{2} \right) = \sum_{n = 0}^{\infty} \frac{c^{2 n} t^{2 n}}{2^n n!}\]

Restating

\[\EE (X) t + \EE (X^2) \frac{t^2}{2!} \leq \frac{c^2 t^2}{2} + \smallO{t^2} \text{ as } t \to 0.\]

Dividing throughout by \(t > 0\) and letting \(t \to 0\) we get \(\EE (X) \leq 0\).

Dividing throughout by \(t < 0\) and letting \(t \to 0\) we get \(\EE (X) \geq 0\).

Thus \(\EE (X) = 0\). So \(\Var(X) = \EE (X^2)\).

Now we are left with

\[\EE (X^2) \frac{t^2}{2!} \leq \frac{c^2 t^2}{2} + \smallO{t^2} \text{ as } t \to 0.\]

Dividing throughout by \(t^2\) and letting \(t \to 0\) we get \(\Var(X) \leq c^2\).

Subgaussian variables have a linear structure.

Theorem

If \(X \sim \Sub(c^2)\) i.e. \(X\) is \(c\)-subgaussian, then for any \(\alpha \in \RR\), the r.v. \(\alpha X\) is \(|\alpha| c\)-subgaussian.

If \(X_1, X_2\) are r.v.s such that \(X_i\) is \(c_i\)-subgaussian, then \(X_1 + X_2\) is \((c_1 + c_2)\)-subgaussian.

Proof

Let \(X\) be \(c\)-subgaussian. Then

\[\EE [\exp(X t) ] \leq \exp \left (\frac{c^2 t^2}{2} \right )\]

Now for \(\alpha \neq 0\), we have

\[\EE [\exp(\alpha X t) ] \leq \exp \left (\frac{\alpha^2 c^2 t^2}{2} \right ) = \exp \left (\frac{(|\alpha | c)^2 t^2}{2} \right )\]

Hence \(\alpha X\) is \(|\alpha| c\)-subgaussian.

Now consider \(X_1\) as \(c_1\)-subgaussian and \(X_2\) as \(c_2\)-subgaussian.

\[\EE (\exp(X_i t) ) \leq \exp \left (\frac{c_i^2 t^2}{2} \right )\]

Let \(p, q >1\) be two numbers s.t. \(\frac{1}{p} + \frac{1}{q} = 1\).

Using Hölder's inequality, we have

\[\begin{split}\EE (\exp((X_1 + X_2)t) ) &\leq \left [ \EE (\exp(X_1 t) )^p\right ]^{\frac{1}{p}} \left [ \EE (\exp(X_2 t) )^q\right ]^{\frac{1}{q}}\\ &= \left [ \EE (\exp( p X_1 t) )\right ]^{\frac{1}{p}} \left [ \EE (\exp(q X_2 t) )\right ]^{\frac{1}{q}}\\ &\leq \left [ \exp \left (\frac{(p c_1)^2 t^2}{2} \right ) \right ]^{\frac{1}{p}} \left [ \exp \left (\frac{(q c_2)^2 t^2}{2} \right ) \right ]^{\frac{1}{q}}\\ &= \exp \left ( \frac{t^2}{2} ( p c_1^2 + q c_2^2) \right ) \\ &= \exp \left ( \frac{t^2}{2} ( p c_1^2 + \frac{p}{p - 1} c_2^2) \right )\end{split}\]

Since this is valid for any \(p > 1\), we can minimize the r.h.s. over \(p > 1\). It suffices to minimize the term

\[r = p c_1^2 + \frac{p}{p - 1} c_2^2.\]

We have

\[\frac{\partial r}{\partial p} = c_1^2 - \frac{1}{(p-1)^2}c_2^2\]

Equating it to 0 gives us

\[p - 1 = \frac{c_2}{c_1} \implies p = \frac{c_1 + c_2}{c_1} \implies \frac{p}{p -1} = \frac{c_1 + c_2}{c_2}\]

Taking second derivative, we can verify that this is indeed a minimum value.

Thus

\[r_{\min} = (c_1 + c_2)^2\]

Hence we have the result

\[\EE (\exp((X_1 + X_2)t) ) \leq \exp \left (\frac{(c_1+ c_2)^2 t^2}{2} \right )\]

Thus \(X_1 + X_2\) is \((c_1 + c_2)\)-subgaussian.

If \(X_1\) and \(X_2\) are independent, then \(X_1 + X_2\) is \(\sqrt{c_1^2 + c_2^2}\)-subgaussian.

If \(X\) is \(c\)-subgaussian then naturally, \(X\) is \(d\)-subgaussian for any \(d \geq c\). A question arises as to what is the minimum value of \(c\) such that \(X\) is \(c\)-subgaussian.

Definition

For a centered random variable \(X\), the subgaussian moment of \(X\), denoted by \(\sigma(X)\), is defined as

\[\sigma(X) = \inf \left \{ c \geq 0 \; | \; \EE (\exp(X t) ) \leq \exp \left (\frac{c^2 t^2}{2} \right ), \Forall t \in \RR. \right \}\]

\(X\) is subgaussian if and only if \(\sigma(X)\) is finite.

We can also show that \(\sigma(\cdot)\) is a norm on the space of subgaussian random variables. And this normed space is complete.

For centered Gaussian r.v. \(X \sim \mathcal{N}(0, \sigma^2)\), the subgaussian moment coincides with the standard deviation. \(\sigma(X) = \sigma\).

Sometimes it is useful to consider more restrictive class of subgaussian random variables.

Definition

A random variable \(X\) is called strictly subgaussian if \(X \sim \Sub(\sigma^2)\) where \(\sigma^2 = \EE(X^2)\), i.e. the inequality

\[\EE (\exp(X t) ) \leq \exp \left (\frac{\sigma^2 t^2}{2} \right )\]

holds true for all \(t \in \RR\).

We will denote strictly subgaussian variables by \(X \sim \SSub (\sigma^2)\).

Example: Gaussian distribution
If \(X \sim \mathcal{N} (0, \sigma^2)\) then \(X \sim \SSub(\sigma^2)\).

Characterization of subgaussian random variables

We quickly review Markov’s inequality which will help us establish the results in this section.

Lemma

Let \(X\) be a non-negative random variable. And let \(t > 0\). Then

\[\PP (X \geq t ) \leq \frac{\EE (X)}{t}.\]
Theorem

For a centered random variable \(X\), the following statements are equivalent:

  • moment generating function condition:
\[\EE [\exp(X t) ] \leq \exp \left (\frac{c^2 t^2}{2} \right ) \Forall t \in \RR.\]
  • subgaussian tail estimate: There exists \(a > 0\) such that
\[\PP(|X| \geq \lambda) \leq 2 \exp (- a \lambda^2) \Forall \lambda > 0.\]
  • \(\psi_2\)-condition: There exists some \(b > 0\) such that
\[\EE [\exp (b X^2) ] \leq 2.\]
Proof
\((1) \implies (2)\) Using Markov’s inequality, for any \(t > 0\) we have
\[\begin{split}\PP(X \geq \lambda) &= \PP (t X \geq t \lambda) = \PP \left(e^{t X} \geq e^{t \lambda} \right )\\ &\leq \frac{\EE \left ( e^{t X} \right ) }{e^{t \lambda}} \leq \exp \left ( - t \lambda + \frac{c^2 t^2}{2}\right ) \Forall t > 0.\end{split}\]

Since this is valid for all \(t > 0\), it holds in particular for the value of \(t\) that minimizes the r.h.s.

The minimum value is obtained for \(t = \frac{\lambda}{c^2}\).

Thus we get

\[\PP(X \geq \lambda) \leq \exp \left ( - \frac{\lambda^2}{2 c^2}\right )\]

Since \(X\) is \(c\)-subgaussian, hence \(-X\) is also \(c\)-subgaussian.

Hence

\[\PP (X \leq - \lambda) = \PP (-X \geq \lambda) \leq \exp \left ( - \frac{\lambda^2}{2 c^2}\right )\]

Thus

\[\PP(|X| \geq \lambda) = \PP (X \leq - \lambda) + \PP(X \geq \lambda) \leq 2 \exp \left ( - \frac{\lambda^2}{2 c^2}\right )\]

Thus we can choose \(a = \frac{1}{2 c^2}\) to complete the proof.

\((2)\implies (3)\)

TODO PROVE THIS

\[\EE (\exp (b X^2)) \leq 1 + \int_0^{\infty} 2 b t \exp (b t^2) \PP (|X| > t)d t\]

\((3)\implies (1)\)

TODO PROVE THIS

More properties

We also have the following result on the exponential moment of a subgaussian random variable.

Lemma

Suppose \(X \sim \Sub(c^2)\). Then

\[\EE \left [\exp \left ( \frac{\lambda X^2}{2 c^2} \right ) \right ] \leq \frac{1}{\sqrt{1 - \lambda}}\]

for any \(\lambda \in [0,1)\).

Proof

We are given that

\[\begin{split}&\EE (\exp(X t) ) \leq \exp \left (\frac{c^2 t^2}{2} \right )\\ &\implies \int_{-\infty}^{\infty} \exp(t x) f_X(x) d x \leq \exp \left (\frac{c^2 t^2}{2} \right ) \Forall t \in \RR\\\end{split}\]

Multiplying on both sides with \(\exp \left ( -\frac{c^2 t^2}{2 \lambda} \right )\) :

\[\int_{-\infty}^{\infty} \exp \left (t x - \frac{c^2 t^2}{2 \lambda}\right ) f_X(x) d x \leq \exp \left (\frac{c^2 t^2}{2}\frac{\lambda-1}{\lambda} \right ) = \exp \left (-\frac{t^2}{2}\frac{c^2 (1 - \lambda)}{\lambda} \right )\]

Integrating on both sides w.r.t. \(t\) we get:

\[\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \exp \left (t x - \frac{c^2 t^2}{2 \lambda}\right ) f_X(x) d x d t \leq \int_{-\infty}^{\infty} \exp \left (-\frac{t^2}{2}\frac{c^2 (1 - \lambda)}{\lambda} \right ) d t\]

which reduces to:

\[\begin{split}&\frac{1}{c} \sqrt{2 \pi \lambda} \int_{-\infty}^{\infty} \exp \left ( \frac{\lambda x^2}{2 c^2} \right ) f_X(x) d x \leq \frac{1}{c} \sqrt {\frac{2 \pi \lambda}{1 - \lambda}}\\ \implies & \EE \left (\exp \left ( \frac{\lambda X^2}{2 c^2} \right ) \right ) \leq \frac{1}{\sqrt{1 - \lambda}}\end{split}\]

which completes the proof.

Subgaussian random vectors

The linearity property of subgaussian r.v.s can be extended to random vectors also. This is stated more formally in the following result.

Theorem
Suppose that \(X = [X_1, X_2,\dots, X_N]\), where each \(X_i\) is i.i.d. with \(X_i \sim \Sub(c^2)\). Then for any \(\alpha \in \RR^N\), \(\langle X, \alpha \rangle \sim \Sub(c^2 \| \alpha \|^2_2)\). Similarly if each \(X_i \sim \SSub(\sigma^2)\), then for any \(\alpha \in \RR^N\), \(\langle X, \alpha \rangle \sim \SSub(\sigma^2 \| \alpha \|^2_2)\).

Norm of a subgaussian random vector

Let \(X\) be a random vector where each \(X_i\) is i.i.d. with \(X_i \sim \Sub (c^2)\).

Consider the \(l_2\) norm \(\| X \|_2\). It is a random variable in its own right.

It would be useful to understand the average behavior of the norm.

Suppose \(N=1\). Then \(\| X \|_2 = |X_1|\).

Also \(\| X \|^2_2 = X_1^2\). Thus \(\EE (\| X \|^2_2) = \sigma^2\).

  • It looks like \(\EE (\| X \|^2_2)\) should be connected with \(\sigma^2\).
  • Norm can increase or decrease compared to the average value.
  • A ratio based measure between actual value and average value would be useful.
  • What is the probability that the norm increases beyond a given factor?
  • What is the probability that the norm reduces beyond a given factor?

These bounds are stated formally in the following theorem.

Theorem

Suppose that \(X = [X_1, X_2,\dots, X_N]\), where each \(X_i\) is i.i.d. with \(X_i \sim \Sub(c^2)\).

Then

(2)\[\EE (\| X \|_2^2 ) = N \sigma^2.\]

Moreover, for any \(\alpha \in (0,1)\) and for any \(\beta \in [\frac{c^2}{\sigma^2}, \beta_{\max}]\), there exists a constant \(\kappa^* \geq 4\) depending only on \(\beta_{\max}\) and the ratio \(\frac{\sigma^2}{c^2}\) such that

(3)\[\PP (\| X \|_2^2 \leq \alpha N \sigma^2) \leq \exp \left ( - \frac{ N (1 - \alpha)^2}{\kappa^*} \right )\]

and

(4)\[\PP (\| X \|_2^2 \geq \beta N \sigma^2) \leq \exp \left ( - \frac{ N (\beta - 1)^2}{\kappa^*} \right )\]
  • First equation gives the average value of the square of the norm.
  • Second inequality states the upper bound on the probability that norm could reduce beyond a factor given by \(\alpha < 1\).
  • Third inequality states the upper bound on the probability that norm could increase beyond a factor given by \(\beta > 1\).
  • Note that if \(X_i\) are strictly subgaussian, then \(c=\sigma\). Hence \(\beta \in (1, \beta_{\max})\).
Proof

Since \(X_i\) are independent hence

\[\EE \left [ \| X \|_2^2 \right ] = \EE \left [ \sum_{i=1}^N X_i^2 \right ] = \sum_{i=1}^N \EE \left [ X_i^2 \right ] = N \sigma^2.\]

This proves the first part. That was easy enough.

Now let us look at the expansion bound (4).

By applying Markov’s inequality for any \(\lambda > 0\) we have:

\[\begin{split}\PP (\| X \|_2^2 \geq \beta N \sigma^2) &= \PP \left ( \exp (\lambda \| X \|_2^2 ) \geq \exp (\lambda \beta N \sigma^2) \right) \\ & \leq \frac{\EE (\exp (\lambda \| X \|_2^2 )) }{\exp (\lambda \beta N \sigma^2)} = \frac{\prod_{i=1}^{N}\EE (\exp ( \lambda X_i^2 )) }{\exp (\lambda \beta N \sigma^2)}\end{split}\]

Since \(X_i\) is \(c\)-subgaussian, from the lemma above on the exponential moment of a subgaussian random variable we have

\[\EE (\exp ( \lambda X_i^2 )) = \EE \left (\exp \left ( \frac{2 c^2\lambda X_i^2}{2 c^2} \right ) \right) \leq \frac{1}{\sqrt{1 - 2 c^2 \lambda}}.\]

Thus:

\[\prod_{i=1}^{N}\EE (\exp ( \lambda X_i^2 )) \leq \left ( \frac{1}{1 - 2 c^2 \lambda} \right )^{\frac{N}{2}}.\]

Putting it back we get:

\[\PP (\| X \|_2^2 \geq \beta N \sigma^2) \leq \left (\frac{\exp (- 2\lambda \beta \sigma^2)}{1 - 2 c^2 \lambda}\right )^{\frac{N}{2}}.\]

Since the above is valid for every \(\lambda \in \left(0, \frac{1}{2 c^2}\right)\), we can minimize the R.H.S. over this range by setting the derivative w.r.t. \(\lambda\) to \(0\).

Thus we get optimum \(\lambda\) as:

\[\lambda = \frac{\beta \sigma^2 - c^2 }{2 c^2 \beta \sigma^2}.\]

Plugging this back we get:

\[\PP (\| X \|_2^2 \geq \beta N \sigma^2) \leq \left ( \beta \frac{\sigma^2}{c^2} \exp \left ( 1 - \beta \frac{\sigma^2}{c^2} \right ) \right ) ^{\frac{N}{2}}.\]

Proceeding similarly for the reduction bound (3), we get

\[\PP (\| X \|_2^2 \leq \alpha N \sigma^2) \leq \left ( \alpha \frac{\sigma^2}{c^2} \exp \left ( 1 - \alpha \frac{\sigma^2}{c^2} \right ) \right ) ^{\frac{N}{2}}.\]

We now need to bring these bounds into the form stated in the theorem. This requires some algebraic manipulation.

Consider the function

\[f(\gamma) = \frac{2 (\gamma - 1)^2}{(\gamma-1) - \ln \gamma} \Forall \gamma > 0.\]

By differentiating twice, we can show that this is a strictly increasing function.

Let us have \(\gamma \in (0, \gamma_{\max}]\).

Define

\[\kappa^* = \max \left ( 4, \frac{2 (\gamma_{\max} - 1)^2}{(\gamma_{\max}-1) - \ln \gamma_{\max}} \right )\]

Clearly

\[\kappa^* \geq \frac{2 (\gamma - 1)^2}{(\gamma-1) - \ln \gamma} \Forall \gamma \in (0, \gamma_{\max}].\]

Which gives us:

\[\ln (\gamma) \leq (\gamma - 1) - \frac{2 (\gamma - 1)^2}{\kappa^*}.\]

Hence by exponentiating on both sides we get:

\[\gamma \leq \exp \left [ (\gamma - 1) - \frac{2 (\gamma - 1)^2}{\kappa^*} \right ].\]

By slight manipulation we get:

\[\gamma \exp ( 1 - \gamma) \leq \exp \left [ -\frac{2 (1 - \gamma )^2}{\kappa^*} \right ].\]

We now choose

\[\gamma = \alpha \frac{\sigma^2}{c^2}\]

Substituting we get:

\[\PP (\| X \|_2^2 \leq \alpha N \sigma^2) \leq \left ( \gamma \exp \left ( 1 - \gamma \right ) \right ) ^{\frac{N}{2}} \leq \exp \left [ -\frac{N (1 - \gamma )^2}{\kappa^*} \right ] .\]

Finally

\[c \geq \sigma \implies \frac{\sigma^2}{c^2}\leq 1 \implies \gamma \leq \alpha \implies 1 - \gamma \geq 1 - \alpha\]

Since \(1 - \gamma \geq 1 - \alpha > 0\), we have \((1 - \gamma)^2 \geq (1 - \alpha)^2\), and thus

\[\PP (\| X \|_2^2 \leq \alpha N \sigma^2) \leq \exp \left [ -\frac{N (1 - \alpha )^2}{\kappa^*} \right ] .\]

Similarly, choosing \(\gamma = \beta \frac{\sigma^2}{c^2}\) proves the other bound.

We can now map \(\gamma_{\max}\) to some \(\beta_{\max}\) by:

\[\gamma_{\max} = \frac {\beta_{\max} \sigma^2 }{c^2}.\]

This result tells us that given a vector with entries drawn from a subgaussian distribution, we can expect the squared norm of the vector to concentrate around its expected value \(N\sigma^2\).
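
The following Monte Carlo sketch (plain MATLAB, not a library function) illustrates this concentration for the strictly subgaussian case of standard Gaussian entries, where \(c = \sigma = 1\); the values of \(\alpha\) and \(\beta\) below are arbitrary.

% Empirically check the concentration of || X ||_2^2 around N * sigma^2
% for N(0,1) entries (a strictly subgaussian case with sigma = c = 1).
N = 100;            % dimension of the random vector
trials = 10000;     % number of Monte Carlo trials
X = randn(N, trials);
sq_norms = sum(X.^2);          % squared l2 norms, one per trial
fprintf('mean of ||X||_2^2 / N : %.4f\n', mean(sq_norms) / N);
alpha = 0.7; beta = 1.3;
fprintf('P(||X||_2^2 <= alpha N): %.4f\n', mean(sq_norms <= alpha * N));
fprintf('P(||X||_2^2 >= beta  N): %.4f\n', mean(sq_norms >= beta * N));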

Rademacher sensing matrices

In this section we collect several results related to Rademacher sensing matrices.

Definition

A Rademacher sensing matrix \(\Phi \in \RR^{M \times N}\) with \(M < N\) is constructed by drawing each entry \(\phi_{ij}\) independently from a Rademacher distribution given by

(1)\[\PP_X(x) = \frac{1}{2}\delta\left(x-\frac{1}{\sqrt{M}}\right) + \frac{1}{2}\delta\left(x+\frac{1}{\sqrt{M}}\right).\]

Thus \(\phi_{ij}\) takes a value \(\pm \frac{1}{\sqrt{M}}\) with equal probability.

We can pull the scale factor \(\frac{1}{\sqrt{M}}\) out of the matrix \(\Phi\) by writing

\[\Phi = \frac{1}{\sqrt{M}} \Chi\]

With that we can draw individual entries of \(\Chi\) from a simpler Rademacher distribution given by

(2)\[\PP_X(x) = \frac{1}{2}\delta(x-1) + \frac{1}{2}\delta(x + 1).\]

Thus entries in \(\Chi\) take values of \(\pm 1\) with equal probability.

This construction is useful since it allows us to implement the multiplication with \(\Phi\) in terms of just additions and subtractions. The scaling can be implemented towards the end in the signal processing chain.

We note that

\[\EE(\phi_{ij}) = 0.\]
\[\EE(\phi_{ij}^2) = \frac{1}{M}.\]

In fact, we have the stronger deterministic statement

\[\phi_{ij}^2 = \frac{1}{M}.\]

We can write

\[\Phi = \begin{bmatrix} \phi_1 & \dots & \phi_N \end{bmatrix}\]

where \(\phi_j \in \RR^M\) is a Rademacher random vector with independent entries.

We note that

\[\EE (\| \phi_j \|_2^2) = \EE \left ( \sum_{i=1}^M \phi_{ij}^2 \right ) = \sum_{i=1}^M (\EE (\phi_{ij}^2)) = M \frac{1}{M} = 1.\]

Actually in this case we also have

\[\| \phi_j \|_2^2 = 1.\]

Thus the squared length of each of the columns in \(\Phi\) is \(1\) .
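
A Rademacher sensing matrix can be sketched directly with stock MATLAB as follows (sparse-plex provides its own dictionary constructors; this is only an illustrative construction):

M = 100;
N = 1000;
% Draw +1/-1 entries with equal probability; sign(randn) returns 0 with probability zero
Chi = sign(randn(M, N));
% Pull in the scale factor 1/sqrt(M)
Phi = Chi / sqrt(M);
% Every column has unit norm by construction
column_norms = sqrt(sum(Phi.^2));
fprintf('min norm: %.4f, max norm: %.4f\n', min(column_norms), max(column_norms));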

Lemma

Let \(z \in \RR^M\) be a Rademacher random vector with i.i.d entries \(z_i\) that take a value \(\pm \frac{1}{\sqrt{M}}\) with equal probability. Let \(u \in \RR^M\) be an arbitrary unit norm vector. Then

\[\PP \left ( | \langle z, u \rangle | > \epsilon \right ) \leq 2 \exp \left (- \epsilon^2 \frac{M}{2} \right ).\]

Representative values of this bound are plotted below.

_images/rademacher_rand_vec_tail_bound.png

Tail bound for the probability of inner product of a Rademacher random vector with a unit norm vector

Proof
This can be proven using Hoeffding's inequality. To be elaborated later.

A particular application of this lemma is when \(u\) itself is another (independently chosen) unit norm Rademacher random vector.

The lemma establishes that the probability of the inner product of two independent unit norm Rademacher random vectors being large is very small. In other words, independently chosen unit norm Rademacher random vectors are incoherent with high probability. This is a very useful result, as we will see later in the measurement of coherence of Rademacher sensing matrices.

Joint correlation

Columns of \(\Phi\) satisfy a joint correlation property ([TG07]) which is described in the following lemma.

Lemma

Let \(\{u_k\}\) be a sequence of \(K\) vectors (where \(u_k \in \RR^M\) ) whose \(l_2\) norms do not exceed one. Independently choose \(z \in \RR^M\) to be a random vector with i.i.d. entries \(z_i\) that take a value \(\pm \frac{1}{\sqrt{M}}\) with equal probability. Then

\[\PP\left(\max_{k} | \langle z, u_k\rangle | \leq \epsilon \right) \geq 1 - 2 K \exp \left( - \epsilon^2 \frac{M}{2} \right).\]
Proof

Let us call \(\gamma = \max_{k} | \langle z, u_k\rangle |\) .

We note that if for any \(u_k\) , \(\| u_k \|_2 <1\) and we increase the length of \(u_k\) by scaling it, then \(\gamma\) will not decrease and hence \(\PP(\gamma \leq \epsilon)\) will not increase. Thus if we prove the bound for vectors \(u_k\) with \(\| u_k\|_2 = 1 \Forall 1 \leq k \leq K\) , it will be applicable for all \(u_k\) whose \(l_2\) norms do not exceed one. Hence we will assume that \(\| u_k \|_2 = 1\) .

From previous lemma we have

\[\PP \left ( | \langle z, u_k \rangle | > \epsilon \right ) \leq 2 \exp \left (- \epsilon^2 \frac{M}{2} \right ).\]

Now the event

\[\left \{ \max_{k} | \langle z, u_k\rangle | > \epsilon \right \} = \bigcup_{ k= 1}^K \{| \langle z, u_k\rangle | > \epsilon\}\]

i.e. if any of the inner products (absolute value) is greater than \(\epsilon\) then the maximum is greater.

We recall Boole’s inequality which states that

\[\PP \left(\bigcup_{i} A_i \right) \leq \sum_{i} \PP(A_i).\]

Thus

\[\PP\left(\max_{k} | \langle z, u_k\rangle | > \epsilon \right) \leq 2 K \exp \left (- \epsilon^2 \frac{M}{2} \right ).\]

This gives us

\[\begin{split}\begin{aligned} \PP\left(\max_{k} | \langle z, u_k\rangle | \leq \epsilon \right) &= 1 - \PP\left(\max_{k} | \langle z, u_k\rangle | > \epsilon \right) \\ &\geq 1 - 2 K \exp \left(- \epsilon^2 \frac{M}{2} \right). \end{aligned}\end{split}\]

Coherence of Rademacher sensing matrix

We show that the coherence of a Rademacher sensing matrix is fairly small with high probability (adapted from [TG07]).

Lemma

Fix \(\delta \in (0,1)\) . For an \(M \times N\) Rademacher sensing matrix \(\Phi\) as defined above, the coherence statistic

\[\mu \leq \sqrt{ \frac{4}{M} \ln \left( \frac{N}{\delta}\right)}\]

with probability exceeding \(1 - \delta\) .

_images/rademacher_coherence_bound.png

Coherence bounds for Rademacher sensing matrices

Proof

We recall the definition of coherence as

\[\mu = \underset{j \neq k}{\max} | \langle \phi_j, \phi_k \rangle | = \underset{j < k}{\max} | \langle \phi_j, \phi_k \rangle |.\]

Since \(\Phi\) is a Rademacher sensing matrix, each of its columns has unit norm. Consider some \(1 \leq j < k \leq N\) identifying columns \(\phi_j\) and \(\phi_k\). We note that they are independent of each other. Thus from above we have

\[\PP \left ( |\langle \phi_j, \phi_k \rangle | > \epsilon \right ) \leq 2 \exp \left (- \epsilon^2 \frac{M}{2} \right ).\]

Now there are \(\frac{N(N-1)}{2}\) such pairs of \((j, k)\) . Hence by applying Boole’s inequality

\[\PP \left ( \underset{j < k} {\max} |\langle \phi_j, \phi_k \rangle | > \epsilon \right ) \leq 2 \frac{N(N-1)}{2} \exp \left (- \epsilon^2 \frac{M}{2} \right ) \leq N^2 \exp \left (- \epsilon^2 \frac{M}{2} \right ).\]

Thus, we have

\[\PP \left ( \mu > \epsilon \right )\leq N^2 \exp \left (- \epsilon^2 \frac{M}{2} \right ).\]

What we need to do now is to choose a suitable value of \(\epsilon\) so that the R.H.S. of this inequality is simplified.

We choose

\[\epsilon^2 = \frac{4}{M} \ln \left ( \frac{N}{\delta}\right ).\]

This gives us

\[\epsilon^2 \frac{M}{2} = 2 \ln \left ( \frac{N}{\delta}\right ) \implies \exp \left (- \epsilon^2 \frac{M}{2} \right ) = \left ( \frac{\delta}{N} \right)^2.\]

Putting back we get

\[\PP \left ( \mu > \epsilon \right )\leq N^2 \left ( \frac{\delta}{N} \right)^2 \leq \delta^2.\]

This justifies why we need \(\delta \in (0,1)\) .

Finally

\[\PP \left ( \mu \leq \sqrt{ \frac{4}{M} \ln \left( \frac{N}{\delta}\right)} \right ) = \PP (\mu \leq \epsilon) = 1 - \PP (\mu > \epsilon) \geq 1 - \delta^2\]

and

\[1 - \delta^2 > 1 - \delta\]

which completes the proof.
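
The bound can be checked with a short MATLAB experiment (an illustrative sketch; the matrix is built here with stock MATLAB rather than a library constructor, and the sizes and \(\delta\) are arbitrary):

% Empirically compare the coherence of a Rademacher sensing matrix
% with the bound sqrt((4/M) ln(N/delta)).
M = 256; N = 1024; delta = 0.1;
Phi = sign(randn(M, N)) / sqrt(M);
G = Phi' * Phi;                       % Gram matrix; diagonal entries are 1
G(1:N+1:end) = 0;                     % zero out the diagonal
mu = max(abs(G(:)));                  % coherence
bound = sqrt(4 / M * log(N / delta));
fprintf('coherence: %.4f, bound: %.4f\n', mu, bound);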

Gaussian sensing matrices

In this section we collect several results related to Gaussian sensing matrices.

Definition
A Gaussian sensing matrix \(\Phi \in \RR^{M \times N}\) with \(M < N\) is constructed by drawing each entry \(\phi_{ij}\) independently from a Gaussian random distribution \(\Gaussian(0, \frac{1}{M})\) .

We note that

\[\EE(\phi_{ij}) = 0.\]
\[\EE(\phi_{ij}^2) = \frac{1}{M}.\]

We can write

\[\Phi = \begin{bmatrix} \phi_1 & \dots & \phi_N \end{bmatrix}\]

where \(\phi_j \in \RR^M\) is a Gaussian random vector with independent entries.

We note that

\[\EE (\| \phi_j \|_2^2) = \EE \left ( \sum_{i=1}^M \phi_{ij}^2 \right ) = \sum_{i=1}^M (\EE (\phi_{ij}^2)) = M \frac{1}{M} = 1.\]

Thus the expected value of squared length of each of the columns in \(\Phi\) is \(1\) .

Joint correlation

Columns of \(\Phi\) satisfy a joint correlation property ([TG07]) which is described in the following lemma.

Lemma

Let \(\{u_k\}\) be a sequence of \(K\) vectors (where \(u_k \in \RR^M\) ) whose \(l_2\) norms do not exceed one. Independently choose \(z \in \RR^M\) to be a random vector with i.i.d. \(\Gaussian(0, \frac{1}{M})\) entries. Then

\[\PP\left(\max_{k} | \langle z, u_k\rangle | \leq \epsilon \right) \geq 1 - K \exp \left( - \epsilon^2 \frac{M}{2} \right).\]
Proof

Let us call \(\gamma = \max_{k} | \langle z, u_k\rangle |\) .

We note that if for any \(u_k\) , \(\| u_k \|_2 <1\) and we increase the length of \(u_k\) by scaling it, then \(\gamma\) will not decrease and hence \(\PP(\gamma \leq \epsilon)\) will not increase. Thus if we prove the bound for vectors \(u_k\) with \(\| u_k\|_2 = 1 \Forall 1 \leq k \leq K\) , it will be applicable for all \(u_k\) whose \(l_2\) norms do not exceed one. Hence we will assume that \(\| u_k \|_2 = 1\) .

Now consider \(\langle z, u_k \rangle\) . Since \(z\) is a Gaussian random vector, hence \(\langle z, u_k \rangle\) is a Gaussian random variable. Since \(\| u_k \| =1\) hence

\[\langle z, u_k \rangle \sim \Gaussian \left(0, \frac{1}{M} \right).\]

We recall a well known tail bound for Gaussian random variables which states that

\[\PP ( | \langle z, u_k \rangle | > \epsilon) \; = \; \sqrt{\frac{2}{\pi}} \int_{\epsilon \sqrt{M}}^{\infty} \exp \left( -\frac{x^2}{2}\right) d x \; \leq \; \exp \left (- \epsilon^2 \frac{M}{2} \right).\]

Now the event

\[\left \{ \max_{k} | \langle z, u_k\rangle | > \epsilon \right \} = \bigcup_{ k= 1}^K \{| \langle z, u_k\rangle | > \epsilon\}\]

i.e. if any of the inner products (absolute value) is greater than \(\epsilon\) then the maximum is greater.

We recall Boole’s inequality which states that

\[\PP \left(\bigcup_{i} A_i \right) \leq \sum_{i} \PP(A_i).\]

Thus

\[\PP\left(\max_{k} | \langle z, u_k\rangle | > \epsilon \right) \leq K \exp \left(- \epsilon^2 \frac{M}{2} \right).\]

This gives us

\[\begin{split}\begin{aligned} \PP\left(\max_{k} | \langle z, u_k\rangle | \leq \epsilon \right) &= 1 - \PP\left(\max_{k} | \langle z, u_k\rangle | > \epsilon \right) \\ &\geq 1 - K \exp \left(- \epsilon^2 \frac{M}{2} \right). \end{aligned}\end{split}\]

Hands on with Gaussian sensing matrices

We will show several examples of working with Gaussian sensing matrices through the sparse-plex library.

Example: Constructing a Gaussian sensing matrix

Let’s specify the size of representation space:

N = 1000;

Let’s specify the number of measurements:

M = 100;

Let’s construct the sensing matrix:

Phi = spx.dict.simple.gaussian_mtx(M, N, false);

By default the function gaussian_mtx constructs a matrix with normalized columns. When we set the third argument to false as above, it constructs a matrix with unnormalized columns.

We can visualize the matrix easily:

imagesc(Phi);
colorbar;
_images/demo_gaussian_1.png

Let’s compute the norms of each of the columns:

column_norms = spx.norm.norms_l2_cw(Phi);

Let’s look at the mean value:

>> mean(column_norms)

ans =

    0.9942

We can see that the mean value is very close to unity as expected.

Let’s compute the standard deviation:

>> std(column_norms)

ans =

    0.0726

As expected, the column norms are concentrated around their mean.

We can examine the variation in norm values by looking at the quantile values:

>> quantile(column_norms, [0.1, 0.25, 0.5, 0.75, 0.9])

ans =

    0.8995    0.9477    0.9952    1.0427    1.0871

The histogram of column norms can help us visualize it better:

hist(column_norms);
_images/demo_gaussian_1_norm_hist.png

The singular values of the matrix give us a deeper understanding of how well behaved the matrix is:

singular_values = svd(Phi);
figure;
plot(singular_values);
ylim([0, 5]);
grid;
_images/demo_gaussian_1_singular_values.png

As we can see, singular values decrease quite slowly.

The condition number captures the variation in singular values:

>> max(singular_values)

ans =

    4.1177

>> min(singular_values)

ans =

    2.2293

>> cond(Phi)

ans =

    1.8471

The source code can be downloaded here.

Examples

In this section we will look at several examples which can be modeled using sparse and redundant representations and measured using compressed sensing techniques.

Several examples in this section have been incorporated from Sparco [BFH+07] (a testing framework for sparse reconstruction).

Piecewise cubic polynomial signal

This example was discussed in [CR04]. Our signal of interest is a piecewise cubic polynomial signal as shown here.

_images/signal.png

A piecewise cubic polynomial signal

It has a sparse representation in a wavelet basis.

_images/representation.png

Sparse representation of signal in wavelet basis

We can sort the wavelet coefficients by magnitude and plot them in descending order to visualize how sparse the representation is.

_images/representation_sorted.png

Wavelet coefficients sorted by magnitude

The chosen basis is a Daubechies wavelet basis \(\Psi\).

_images/dictionary.png

Daubechies-8 wavelet basis

A Gaussian random sensing matrix \(\Phi\) is used to generate the measurement vector \(y\)

_images/sensing_matrix.png

Gaussian sensing matrix \(\Phi\)

The measurements are shown here:

_images/measurements.png

Measurement vector \(y = \Phi x + e\)

Finally the product of \(\Phi\) and \(\Psi\) given by \(\Phi \Psi\) will be used for actual recovery of sparse representation.

_images/recovery_matrix.png

Recovery matrix \(\Phi \Psi\)

Fundamental equations are:

\[x = \Psi \alpha\]

and

\[y = \Phi x + e = \Phi \Psi \alpha + e.\]

with \(x \in \RR^N\). In this example \(N = 2048\). \(\Psi\) is a complete dictionary of size \(N \times N\). Thus we have \(D = N\) and \(\alpha \in \RR^N\). \(\Phi \in \RR^{M \times N}\). In this example, the number of measurements \(M=600\). The measurement vector \(y \in \RR^M\). For this problem we chose \(e = 0\).

Sparse signal recovery problem is denoted as

\[\widehat{\alpha} = \text{recovery}(\Phi \Psi, y, K).\]

where \(\widehat{\alpha}\) is a \(K\)-sparse approximation of \(\alpha\).

Closely examining the coefficients in \(\alpha\) we can note that \(\max(\alpha_i) = 78.0546\). Further if we put different thresholds over magnitudes of entries in \(\alpha\) we can find the number of coefficients higher than the threshold as listed in the table below. A choice of \(M = 600\) looks quite reasonable given the decay of entries in \(\alpha\).

Entries in wavelet representation of piecewise cubic polynomial signal higher than a threshold

  Threshold   Entries higher than threshold
  1           129
  1E-1        173
  1E-2        186
  1E-4        197
  1E-8        199
  1E-12       200
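
The kind of analysis behind this table can be sketched in MATLAB as follows, assuming the Wavelet Toolbox is available. The polynomial pieces and breakpoints below are arbitrary and differ from the original Sparco problem, so the counts will not match the table exactly; the point is only to show the fast decay of wavelet coefficients of a piecewise cubic signal.

% Build an arbitrary piecewise cubic polynomial signal of length N
N = 2048;
t = linspace(0, 1, N);
x = zeros(1, N);
breaks = [0, 0.3, 0.7, 1];
coeffs = [ 4 -2  1 0.5;      % one cubic per piece, rows are polynomial coefficients
          -3  1  2 -1;
           2  3 -1 0.2];
for k = 1:3
    % points at the breakpoints are simply overwritten by the next piece
    idx = t >= breaks(k) & t <= breaks(k+1);
    x(idx) = polyval(coeffs(k, :), t(idx));
end
% Daubechies-8 wavelet decomposition
level = wmaxlev(N, 'db8');
alpha = wavedec(x, level, 'db8');
% Count coefficients whose magnitude exceeds various thresholds
thresholds = [1 1e-1 1e-2 1e-4 1e-8 1e-12];
for th = thresholds
    fprintf('threshold %g : %d entries\n', th, sum(abs(alpha) > th));
end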

Data Analysis

Principal Component Analysis

Principal component analysis (PCA) [Jol02] is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors are an uncorrelated orthogonal basis set. If a multivariate dataset is visualized as a set of coordinates in a high-dimensional data space (1 axis per variable), PCA can supply the user with a lower-dimensional picture, a projection of this object when viewed from its most informative viewpoint. PCA can be thought of as fitting an n-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component.

Consider a data-matrix \(X \in \RR^{n \times p}\) \((n \geq p)\) with each column representing one feature (or random variable) and each row representing one feature vector (or observation vector). Assume that \(X\) has column wise zero sample mean. The principal components decomposition of \(X\) is given by \(T = X V\) where \(V\) is a \(p \times p\) matrix whose columns are eigen vectors of \(X^T X\). If each row of \(X\) (resp. \(T\)) is given by a (column) vector \(x\) (resp. \(t\)), then they are related by \(t = V^T x\) or \(x = V t\). Each principal component \(t_i\) is obtained by taking the inner product of an eigen vector \(v^i\) in \(V\) with \(x\).

\(T\) can be obtained straight-away from the SVD of \(X = U \Sigma V^T\), giving \(T = X V = U \Sigma\). Note that \(T^T T = \Sigma^T \Sigma\), implying that the columns of \(T\) are orthogonal to each other. In other words, the features (or random variables) corresponding to each column of \(T\) are uncorrelated. Recall that \(T^T T\) is proportional to the empirical covariance matrix of \(T\), and \(\sigma_1 \geq \dots \geq \sigma_p\) shows how the variance of individual columns in \(T\) decreases. The form \(T = U \Sigma\) is also known as the polar decomposition of \(T\).

The dimensionality reduction of data-set in \(X\) is obtained by keeping just the first \(k\) columns of \(T\).
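
A minimal MATLAB sketch of PCA via the SVD, following the description above (the synthetic data and the choice \(k = 2\) are arbitrary):

% PCA via the SVD of a mean-centered data matrix X (rows are observations)
X = randn(200, 5) * diag([5 3 2 1 0.5]);        % synthetic data with decaying variances
X = X - repmat(mean(X), size(X, 1), 1);          % remove the column-wise mean
[U, Sigma, V] = svd(X, 'econ');
T = U * Sigma;                                   % principal components, T = X V
k = 2;
T_k = T(:, 1:k);                                 % dimensionality reduction: keep first k columns
sv = diag(Sigma);
explained = sv.^2 / sum(sv.^2);                  % fraction of variance per component
fprintf('variance explained by first %d components: %.2f%%\n', k, 100 * sum(explained(1:k)));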

Data Clustering

Data Clustering Introduction

In this section, we summarize some of the traditional and general purpose data clustering algorithms. These algorithms get used as building blocks for various subspace clustering algorithms. The objective of data clustering is to group the data points into clusters such that points within each cluster are more related to each other than points across different clusters. The relationship can be measured in various ways: distance between points, similarity of points, etc. In distance based clustering, we group the points into \(K\) clusters such that the distance among points in the same group is significantly smaller than those between clusters. In similarity based clustering, the points within the same cluster are more similar to each other than points from different clusters. A graph based clustering will treat each point as a node on a graph [with appropriate edges] and split the graph into connected components. Compare this with subspace clustering which assumes that points in each cluster are sampled from one subspace [even though they may be far apart within the subspace].

The simplest distance measure is the standard Euclidean distance. But it is susceptible to the choice of basis. This can be improved by adopting a statistical model for the data in each cluster. We assume that the data in the \(k\)-th cluster is sampled from a probability distribution with mean \(\mu_k\) and covariance \(\Sigma_k\). An appropriate distance measure from the mean of a distribution which is invariant of the choice of basis is the Mahalanobis distance:

\[d^2 (x_s, \mu_k) = \| x_s - \mu_k\|_{\Sigma_k}^2 = (x_s - \mu_k)^T \Sigma_k^{-1}(x_s - \mu_k).\]

For Gaussian distributions, this is proportional to the negative of the log-likelihood of a sample point. A simple way to measure similarity between two points is the absolute value of the inner product. Alternatively, one can look at the angle between two points or inner product of the normalized points. Another way to measure similarity is to consider the inverse of an appropriate distance measure.
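
A small MATLAB sketch of the squared Mahalanobis distance of a point from a cluster mean, using the sample mean and covariance of the cluster (the synthetic cluster data and the query point below are arbitrary):

% Squared Mahalanobis distance of a query point from a cluster
Xk = randn(500, 3) * [2 0 0; 0.5 1 0; 0 0 0.2] + 1;  % synthetic cluster data, rows are points
mu_k = mean(Xk)';                                    % cluster mean (column vector)
Sigma_k = cov(Xk);                                   % cluster covariance
x = [3; 1; 1];                                       % query point
d2 = (x - mu_k)' * (Sigma_k \ (x - mu_k));           % (x - mu)' Sigma^{-1} (x - mu)
fprintf('squared Mahalanobis distance: %.4f\n', d2);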

Measurement of clustering performance

In general a clustering \(\CCC\) of a set \(Y\) constructed by a clustering algorithm is a set \(\{\CCC_1, \dots, \CCC_C\}\) of non-empty disjoint subsets of \(Y\) such that their union equals \(Y\). Clearly: \(|\CCC_c| > 0\).

The clustering process may identify an incorrect number of clusters and \(C\) may not be equal to \(K\). Moreover, even if \(K = C\), the vectors may be placed in wrong clusters. Ideally, we want \(K = C\) and \(\CCC_c = Y_k\) with a bijective mapping between \(1 \leq c \leq C\) and \(1 \leq k \leq K\). In practice, a clustering algorithm estimates the number of clusters \(C\) and assigns a label \(l_s\), \(1 \leq s \leq S\) to each vector \(y_s\) where \(1\leq l_s \leq C\). All the labels can be put in a label vector \(L\) where \(L \in \{1, \dots, C\}^S\). The permutation matrix \(\Gamma\) can be easily obtained from \(L\).

Following [WW07], we will quickly establish the measures used in this work for clustering performance of synthetic experiments. We have a reference clustering of vectors in \(Y\) given by \(\BBB = \{Y_1, \dots, Y_K\}\) which is known to us in advance (either by construction in synthetic experiments or as ground truth with real life data-sets). The clustering obtained by the algorithm is given by \(\CCC= \{\CCC_1, \dots, \CCC_C\}\). For two arbitrary vectors \(y_i, y_j \in Y\), there are four possibilities: a) they belong to same cluster in both \(\BBB\) and \(\CCC\) (true positive), b) they are in same cluster in \(\BBB\) but different cluster in \(\CCC\) (false negative) c) they are in different clusters in \(\BBB\) but in same cluster in \(\CCC\) (false positive) d) they are in different clusters in both \(\BBB\) and \(\CCC\) (true negative).

Consider some cluster \(Y_i \in \BBB\) and \(\CCC_j \in \CCC\). The elements common to \(Y_i\) and \(\CCC_j\) are given by \(Y_i \cap \CCC_j\). We define \(\text{precision}_{ij} \triangleq \frac{|Y_i \cap \CCC_j|}{|\CCC_j|}.\) We define the overall precision for \(\CCC_j\) as \(\text{precision}(\CCC_j) \triangleq \underset{i}{\max}(\text{precision}_{ij}).\) We define \(\text{recall}_{ij} \triangleq \frac{|Y_i \cap \CCC_j|}{|Y_i|}\). We define the overall recall for \(Y_i\) as \(\text{recall}(Y_i) \triangleq \underset{j}{\max}(\text{recall}_{ij})\). We define the \(F\) score as \(F_{ij} \triangleq \frac{2 \text{precision}_{ij} \text{recall}_{ij} }{\text{precision}_{ij} + \text{recall}_{ij}}.\) We define the overall \(F\)-score for \(Y_i\) as \(F(Y_i) \triangleq \underset{j}{\max}(F_{ij}).\) We note that the cluster \(\CCC_j\) for which the maximum is achieved is the best matching cluster for \(Y_i\). Finally, we define the overall \(F\)-score for the clustering as \(F(\BBB, \CCC) \triangleq \frac{1}{S}\sum_{i=1}^K |Y_i | F(Y_i)\) where \(S\) is the total number of vectors in \(Y\). We also define a clustering ratio given by the factor \(\eta \triangleq \frac{C}{K}\).
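
These definitions translate directly into MATLAB. The following sketch computes the precision, recall and \(F\) matrices and the overall \(F\)-score from an arbitrary confusion matrix CM with \(\text{CM}(i, j) = |Y_i \cap \CCC_j|\). (The library's ClusterComparison class, demonstrated later, provides these measures; this is only an illustration of the formulas.)

% Precision, recall and F matrices from a confusion matrix
CM = [5 1 0; 0 4 2; 1 0 6];                    % arbitrary example confusion matrix
cluster_sizes_C = sum(CM, 1);                  % |C_j|, column sums
cluster_sizes_Y = sum(CM, 2);                  % |Y_i|, row sums
precision = CM ./ repmat(cluster_sizes_C, size(CM, 1), 1);
recall    = CM ./ repmat(cluster_sizes_Y, 1, size(CM, 2));
F = 2 * precision .* recall ./ (precision + recall);
F(isnan(F)) = 0;                               % 0/0 entries where CM(i, j) = 0
% Overall F-score: size-weighted average of the best F-score per reference cluster
overall_F = sum(cluster_sizes_Y .* max(F, [], 2)) / sum(cluster_sizes_Y);
fprintf('overall F-score: %.4f\n', overall_F);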

There are different ways to define clustering error. For the special case where the number of clusters is known in advance, and we ensure that the data-set is divided into exactly those many clusters, it is possible to define subspace clustering error as follows:

\[\text{subspace clustering error} = \frac{\text{\# of misclassified points}} {\text{total \# of points}}.\]

The definition is adopted from [EV13] for comparing the results in this work with their results. This definition can be used after a proper one-one mapping between original labels and cluster labels assigned by the clustering algorithms has been identified. We can compute this mapping by comparing \(F\)-scores.

K-means Clustering

_images/alg_k_means_clustering.png

K-means clustering algorithm

K-means clustering algorithm [M+67][DHS12][Har75] is an iterative clustering method. We start with an initial set of means and covariance matrices for each cluster. In each iteration, we segment the data points into individual clusters by choosing the nearest mean. Then, we estimate the new mean and covariance matrices. We return a label vector \(L[1:S]\) which maps each point to the corresponding cluster. A within-cluster-scatter can be defined as

\[w(L) = \frac{1}{S} \sum_{k=1}^K \sum_{L(s) = k} \| y_s - \mu_k \|^2_{\Sigma_k}.\]

This represents the average (squared) distance of each point to the respective cluster mean. The \(K\)-means algorithm reduces the scatter in each iteration. It is guaranteed to converge to a local minimum.

A simpler version of this algorithm is based on the Euclidean distance and doesn't compute or update the covariance matrices for each cluster.
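
The simpler Euclidean version can be run directly with MATLAB's kmeans function from the Statistics Toolbox; the following sketch also computes the Euclidean within-cluster scatter defined above (the synthetic data is arbitrary):

% K-means on synthetic data with the Euclidean within-cluster scatter
rng(0);
Y = [randn(100, 2); randn(100, 2) + 5];        % two well separated clusters, rows are points
K = 2;
[labels, means] = kmeans(Y, K);
scatter_w = 0;
for k = 1:K
    diffs = Y(labels == k, :) - repmat(means(k, :), sum(labels == k), 1);
    scatter_w = scatter_w + sum(sum(diffs.^2));
end
scatter_w = scatter_w / size(Y, 1);            % average squared distance to cluster mean
fprintf('within-cluster scatter: %.4f\n', scatter_w);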

Spectral Clustering

Spectral clustering is a graph based clustering algorithm [VL07]. It operates on a similarity graph \(\GGG = \{T, W\}\), with vertex set \(T\) and weighted adjacency matrix \(W\), to obtain the clustering \(\CCC\) of \(X\). More specifically, the following steps are performed. The degree of a vertex \(t_s \in T\) is defined as \(d_s = \sum_{j = 1}^S w_{s j}\). The degree matrix \(D\) is defined as the diagonal matrix with the degrees \(\{ d_s \}_{s =1 }^S\). The unnormalized graph Laplacian is defined as \(\LLL = D - W\). The normalized graph Laplacian is defined as \(\LLL_{\text{rw}} \triangleq D^{-1} \LLL = I - D^{-1} W\) (we specifically use the random walk version of the normalized graph Laplacian as defined in [VL07]; there are other ways to define a normalized graph Laplacian). The subscript \(\text{rw}\) stands for random walk.

We compute \(\LLL_{\text{rw}}\) and examine its eigen-structure to estimate the number of clusters \(C\) and the label vector \(L\). If \(C\) is known in advance, usually the first \(C\) eigen vectors of \(\LLL_{\text{rw}}\) corresponding to the smallest eigen values are taken and their rows are clustered using the K-means algorithm [SM00]. Since we don't make any assumption on the number of clusters, we need to estimate it. A simple way is to track the eigen-gap statistic. After arranging the eigen values in increasing order, we can choose the number \(C\) such that the eigen values \(\lambda_1, \dots, \lambda_C\) are very small and \(\lambda_{C + 1}\) is large. This is guided by the theoretical result that if a graph has \(C\) connected components then exactly \(C\) eigen values of \(\LLL_{\text{rw}}\) are 0.

However, when the subspaces are not clearly separated and noise is introduced, this approach becomes tricky. We go for a more robust approach by analyzing the eigen vectors as described in [ZMP04]. The approach of [ZMP04], with a slightly different definition of the graph Laplacian \((D^{-1/2} W D^{-1/2})\) [NJW+02], has been adapted for working with the Laplacian \(\LLL_{\text{rw}}\) as defined above.

Robust estimation of number of clusters

In step 6, we estimate the number of clusters from the Graph Laplacian. It can be easily shown that \(0\) is an eigen value of \(\LLL_{\text{rw}}\) with an eigen vector \(\OneVec_S\) [VL07]. Further, the multiplicity of eigen value 0 equals the number of connected components in \(\GGG\). In fact the adjacency matrix can be factored as

\[\begin{split}W = \begin{bmatrix} W_1 & \dots & 0\\ \vdots & \ddots & \vdots \\ 0 & \dots & W_P \end{bmatrix} \Gamma\end{split}\]

where \(W_p \in \RR^{S_p \times S_p}\) is the adjacency matrix for the \(p\)-th connected component of \(\GGG\) corresponding to the subspace \(\UUU^p\) and \(\Gamma\) is the unknown permutation matrix. The graph Laplacian for each \(W_p\) has an eigen value \(0\) and the eigen-vector \(\OneVec_{S_p}\). Thus, if we look at the \(P\)-dimensional eigen-space of \(\LLL_{\text{rw}}\) corresponding to eigen value \(0\), then there exists a basis \(\widehat{V} \in \RR^{S \times P}\) such that each row of \(\widehat{V}\) is a unit vector in \(\RR^P\) and the columns contain \(S_1, \dots, S_P\) ones. Actual eigen vectors obtained through any numerical method will be a rotated version of \(\widehat{V}\) given by \(V = \widehat{V} R\). [ZMP04] suggests a cost function over the entries in \(V\) such that the cost is minimized when the rows of \(V\) are close to coordinate vectors. It then estimates a rotation matrix as a product of Givens rotations which can rotate \(V\) to minimize the cost. The parameters of the rotation matrix are the angles of Givens rotations which are estimated through a Gradient descent process. Since \(P\) is unknown, the algorithm is run over multiple values of \(C\) and we choose the value which gives minimum cost. Note that, we reuse the rotated version of \(V\) obtained for a particular value of \(C\) when we go for examining \(C+1\) eigen-vectors. This may appear to be ad-hoc, but is seen to help in faster convergence of the gradient descent algorithm for next iteration.

When \(S\) is small, we can do a complete SVD of \(\LLL_{\text{rw}}\) to get the eigen vectors. However, this is time consuming when \(S\) is large (say 1000+). An important question is how many eigen vectors we really need to examine! As \(C\) increases, the number of Givens rotation parameters increase as \(C(C-1)/2\). Thus, if we examine too many eigen-vectors, we will lose out unnecessarily on speed. We can actually use the eigen-gap statistic described above to decide how many eigen vectors we should examine.

Finally, we assign labels to each data point to identify the cluster they belong to. As described above, we maintain the rotated version of \(V\) during the estimation of rotation matrix. Once, we have zeroed in on the right value of \(C\), then assigning labels to \(x^s\) is straight-forward. We simply perform non-maximum suppression on the rows of V, i.e. we keep the largest (magnitude) entry in each row of \(V\) and assign zero to the rest. The column number of the largest entry in the \(s\)-th row of \(V\) is the label \(l_s\) for \(x^s\). This completes the clustering process.

While eigen gap statistic based estimation of number of clusters is quick, it requires running an additional \(K\)-means algorithm step on the first \(C\) eigen vectors to assign the labels. In contrast, eigen vector based estimation of number of clusters is involved and slow but it allows us to pick the labels very quickly.

Expectation Maximization

Expectation-Maximization (EM) [DLR77] method is a maximum likelihood based estimation paradigm. It requires an explicit probabilistic model of the mixed data-set. The algorithm estimates model parameters and the segmentation of data in Maximum-Likelihood (ML) sense.

We assume that \(y_s\) are samples drawn from multiple “component” distributions and each component distribution is centered around a mean. Let there be \(K\) such component distributions. We introduce a latent (hidden) discrete random variable \(z \in \{1, \dots, K\}\) associated with the random variable \(y\) such that \(z_s = k\) if \(y_s\) is drawn from \(k\)-th component distribution. The random vector \((y, z) \in \RR^M \times \{1, \dots, K\}\) completely describes the event that a point \(y\) is drawn from a component indexed by the value of \(z\).

We assume that \(z\) is subject to a multinomial (marginal) distribution. i.e.:

\[p(z= k) = \pi_k \geq 0, \quad \pi_1 + \dots + \pi_K = 1.\]

Each component distribution can then be modeled as a conditional (continuous) distribution \(f(y | z)\). If each of the components is a multivariate normal distribution, then we have \(f(y | z = k) \sim \NNN(\mu_k, \Sigma_k)\) where \(\mu_k\) is the mean and \(\Sigma_k\) is the covariance matrix of the \(k\)-th component distribution. The parameter set for this model is then \(\theta = \{\pi_k, \mu_k, \Sigma_k \}_{k=1}^K\) which is unknown in general and needs to be estimated from the dataset \(Y\).

With \((y, z)\) being the complete random vector, the marginal PDF of \(y\) given \(\theta\) is given by

\[f(y | \theta) = \sum_{k = 1}^K f(y | z = k, \theta) \, p (z = k | \theta) = \sum_{k = 1}^K \pi_k f(y | z=k, \theta).\]

The log-likelihood function for the dataset \(Y = \{ y_s\}_{s=1}^S\) is given by

\[l (Y; \theta) = \sum_{s=1}^S \ln f(y_s | \theta).\]

An ML estimate of the parameters, namely \(\hat{\theta}_{\ML}\), is obtained by maximizing \(l (Y; \theta)\) over the parameter space. The statistic \(l (Y; \theta)\) is called the incomplete log-likelihood function since it is marginalized over \(z\). It is very difficult to compute and maximize directly. The EM method provides an alternate means of maximizing \(l (Y; \theta)\) by utilizing the latent r.v. \(z\).

We start with noting that

\[f(y | \theta) p ( z | y , \theta) = f(y, z | \theta),\]
\[\sum_{k=1}^K p(z = k | y , \theta) = 1.\]

Thus, \(l (Y; \theta)\) can be rewritten as

\[\begin{split}\begin{aligned} l (Y; \theta) &= \sum_{s=1}^S \sum_{k=1}^K p(z_s = k | y_s , \theta) \ln \frac{f(y_s, z_s =k | \theta)}{p(z_s=k | y_s, \theta)}\\ &= \sum_{s, k} p(z_s = k | y_s , \theta) \ln f(y_s, z_s =k | \theta) \\ &- \sum_{s, k} p(z_s = k | y_s , \theta) \ln p(z_s=k | y_s, \theta) . \end{aligned}\end{split}\]

The first term is expected complete log-likelihood function and the second term is the conditional entropy of \(z_s\) given \(y_s\) and \(\theta\).

Let us introduce auxiliary variables \(w_{sk} (\theta) = p(z_s = k | y_s , \theta)\). \(w_{sk}\) basically represents the expected membership of \(y_s\) in the \(k\)-th cluster. Put \(w_{sk}\) in a matrix \(W (\theta)\) and write:

\[l'(Y; \theta, W) = \sum_{s=1}^S \sum_{k=1}^K w_{sk} \ln f(y_s, z_s =k | \theta).\]
\[h( z | y; W) = - \sum_{s=1}^S \sum_{k=1}^K w_{sk} \ln w_{sk}.\]

Then, we have

\[l(Y; \theta, W) = l'(Y; \theta, W) + h( z | y; W)\]

where, we have written \(l\) as a function of both \(\theta\) and \(W\). An iterative maximization approach can be introduced as follows:

  • Maximize \(l(Y; \theta, W)\) w.r.t. \(W\) keeping \(\theta\) as constant.
  • Maximize \(l(Y; \theta, W)\) w.r.t. \(\theta\) keeping \(W\) as constant.
  • Repeat the previous two steps till convergence.

This is essentially the EM algorithm. Step 1 is known as E-step and step 2 is known as the M-step. In the E-step, we are estimating the expected membership of each sample being drawn from each component distribution. In the M-step, we are maximizing the expected complete log-likelihood function as the conditional entropy term doesn’t depend on \(\theta\).

Using Lagrange multiplier, we can show that the optimal \(\hat{w}_{sk}\) in the E-step is given by

\[\hat{w}_{sk} = \frac{\pi_k f(y_s | z_s = k, \theta )} {\sum_{l=1}^K \pi_l f(y_s | z_s = l, \theta )}.\]

A closed form solution for the \(M\)-step depends on the particular choice of the component distributions. We provide a closed form solution for the special case when each of the components is an isotropic normal distribution (\(\NNN(\mu_k, \sigma_k^2 I)\)).

\[\begin{split}\begin{aligned} &\hat{\mu}_k = \frac{\sum_{s=1}^S w_{sk} y_s} {\sum_{s=1}^S w_{sk}},\\ &\hat{\sigma}_k^2 = \frac{\sum_{s=1}^S w_{sk} \| y_s - \mu_k \|_2^2} {M \sum_{s=1}^S w_{sk}},\\ &\hat{\pi}_k = \frac{\sum_{s=1}^S w_{sk}}{S}. \end{aligned}\end{split}\]

In \(K\)-means, each \(y_s\) gets hard assigned to a specific cluster. In EM, we have a soft assignment given by \(w_{sk}\).
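
The following compact MATLAB sketch implements the E-step and M-step equations above for a mixture of isotropic Gaussians. It is an illustrative implementation, not the library's; the synthetic data, initialization and iteration count are arbitrary.

% EM for a mixture of K isotropic Gaussians in R^M (data columns are points)
rng(0);
M = 2; K = 2; S = 400;
Y = [randn(M, S/2) + 3, randn(M, S/2) - 3];       % two well separated components
pik = ones(1, K) / K;                             % mixing weights
mu = Y(:, randperm(S, K));                        % initial means: random data points
sigma2 = ones(1, K);                              % initial isotropic variances
for iter = 1:50
    % E-step: expected memberships w(s, k)
    W = zeros(S, K);
    for k = 1:K
        d2 = sum((Y - repmat(mu(:, k), 1, S)).^2);
        W(:, k) = pik(k) * exp(-d2' / (2 * sigma2(k))) / (2 * pi * sigma2(k))^(M/2);
    end
    W = W ./ repmat(sum(W, 2), 1, K);
    % M-step: update means, variances and mixing weights
    for k = 1:K
        wk = W(:, k);
        mu(:, k) = (Y * wk) / sum(wk);
        d2 = sum((Y - repmat(mu(:, k), 1, S)).^2);
        sigma2(k) = (d2 * wk) / (M * sum(wk));
    end
    pik = sum(W) / S;
end
[~, labels] = max(W, [], 2);    % hard assignment derived from the soft memberships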

The EM method is a good method for a hybrid dataset consisting of a mixture of component distributions. Yet, its applicability is limited. We need to have a good idea of the number of components beforehand. Further, for a Gaussian Mixture Model (GMM), it fails to work if the variance in some of the directions is arbitrarily small [Vap13]. For example, a subspace like distribution is one where the data has large variance within a subspace but almost zero variance orthogonal to the subspace. The EM method tends to fail with subspace like distributions.

Hands-on spectral clustering

Example: Clustering rings

In this example, we will cluster 2D data which form three different rings in the plane.

Sample data is available in the data directory.

Let us load the data:

dataset_file = fullfile(spx.data_dir, 'clustering', ...
    'self_tuning_paper_clustering_data');
data = load(dataset_file);
datasets = data.XX;
raw_data = datasets{1};
num_clusters = data.group_num(1);

The raw data is organized in a matrix where each row represents one 2D point. The number of data points is the number of rows in the dataset. Let's plot the data to get a better understanding:

X = raw_data(:, 1);
Y = raw_data(:, 2);
figure;
axis equal;
plot(X, Y, '.', 'MarkerSize',16);
_images/demo_sc_1_unscaled.png

We can see that the data is organized in three different rings. This data set is unlikely to be clustered properly by the K-means algorithm.

It is good practice to scale the data before clustering it:

raw_data = raw_data - repmat(mean(raw_data),size(raw_data,1),1);
raw_data = raw_data/max(max(abs(raw_data)));
X = raw_data(:, 1);
Y = raw_data(:, 2);
figure;
axis equal;
plot(X, Y, '.', 'MarkerSize',16);
_images/demo_sc_1_scaled.png

The next step is to compute pairwise distances between the points in the dataset:

sqrt_dist_mat = spx.commons.distance.sqrd_l2_distances_rw(raw_data);

We convert the distances into a Gaussian similarity. To compute the similarity, we will need to provide the scale value:

scale = 0.04;
% Compute the similarity matrix
sim_mat = spx.cluster.similarity.gauss_sim_from_sqrd_dist_mat(sqrt_dist_mat, scale);

We are now ready to perform spectral clustering on the data.

Create the spectral clustering algorithm instance:

clusterer = spx.cluster.spectral.Clustering(sim_mat);

Inform it about the expected number of clusters:

clusterer.NumClusters = num_clusters;

There are two different spectral clustering algorithms available. We will use the random walk version:

cluster_labels = clusterer.cluster_random_walk();

We can summarize the results of clustering:

>> tabulate(cluster_labels)
  Value    Count   Percent
      1       99     33.11%
      2      139     46.49%
      3       61     20.40%

Let’s plot the data points in different colors depending on which cluster they belong to:

figure;
colors = [1,0,0;0,1,0;0,0,1;1,1,0;1,0,1;0,1,1;0,0,0];
hold on;
axis equal;
for c=1:num_clusters
    % Identify points in this cluster
    points = raw_data(cluster_labels == c, :);
    X = points(:, 1);
    Y = points(:, 2);
    plot(X, Y, '.','Color',colors(c,:), 'MarkerSize',16);
end
_images/demo_sc_1_clustered.png

Complete example code can be downloaded here.

Inside Unnormalized Spectral Clustering

In this section, we will start with a similarity matrix and go through the steps of unnormalized spectral clustering.

We will consider a simple case of 8 data points which are known to be falling into two clusters.

We construct an undirected graph \(G\) where the nodes in same cluster are connected to each other and nodes in different clusters are not connected to each other.

In this simple example, we will assume that the graph is unweighted.

The adjacency matrix for the graph is \(W\):

>> W = [ones(4) zeros(4); zeros(4) ones(4)]
W =

     1     1     1     1     0     0     0     0
     1     1     1     1     0     0     0     0
     1     1     1     1     0     0     0     0
     1     1     1     1     0     0     0     0
     0     0     0     0     1     1     1     1
     0     0     0     0     1     1     1     1
     0     0     0     0     1     1     1     1
     0     0     0     0     1     1     1     1

We have arranged the adjacency matrix in a manner so that the clusters are easily visible.

Let’s just get the number of nodes:

>> [num_nodes, ~] = size(W);

Let’s also assign the true labels to these nodes which will be used for verification later:

>> true_labels = [1 1 1 1 2 2 2 2];

We construct the degree matrix \(D\) for the graph:

>> Degree = diag(sum(W))
Degree =

     4     0     0     0     0     0     0     0
     0     4     0     0     0     0     0     0
     0     0     4     0     0     0     0     0
     0     0     0     4     0     0     0     0
     0     0     0     0     4     0     0     0
     0     0     0     0     0     4     0     0
     0     0     0     0     0     0     4     0
     0     0     0     0     0     0     0     4

The unnormalized Laplacian is given by \(L = D - W\):

>> Laplacian = Degree - W
Laplacian =

     3    -1    -1    -1     0     0     0     0
    -1     3    -1    -1     0     0     0     0
    -1    -1     3    -1     0     0     0     0
    -1    -1    -1     3     0     0     0     0
     0     0     0     0     3    -1    -1    -1
     0     0     0     0    -1     3    -1    -1
     0     0     0     0    -1    -1     3    -1
     0     0     0     0    -1    -1    -1     3

We now compute the singular value decomposition of the Laplacian \(U \Sigma V^T = L\):

>> [~, S, V] = svd(Laplacian);
>> singular_values = diag(S);
>> fprintf('Singular values: \n');
>> spx.io.print.vector(singular_values);
Singular values:
4.00 4.00 4.00 4.00 4.00 4.00 0.00 0.00

We know that the number of connected components in an undirected graph is equal to the number of singular values of the Laplacian which are zero. On inspection, we can see that there are indeed two such zeros.

For more general cases, the smaller singular values may not be exactly zero. We need to find the knee of the singular value curve.

_images/simple_unnormalized_singular_values.png

A simple way to find it is to look at the changes between consecutive singular values and find the place where the change is largest:

>> sv_changes = diff( singular_values(1:end-1) );
>> spx.io.print.vector(sv_changes);
0.00 0.00 0.00 -0.00 0.00 -4.00

Note that it is known that the Laplacian always has one singular value which is 0. Thus, we need to look at the changes only in the remaining singular values.

Locate the largest change:

>> [min_val , ind_min ] = min(sv_changes)
min_val =

   -4.0000


ind_min =

     6

The number of clusters is now easy to determine:

>> num_clusters = num_nodes - ind_min
num_clusters =

     2

We pick up the right singular vectors corresponding to the 2 smallest singular values:

>> Kernel = V(:,num_nodes-num_clusters+1:num_nodes);

Each row of this matrix corresponds to one data point. At this point, standard k-means clustering can be invoked to group the points into the number of clusters determined above:

% Maximum iteration for KMeans Algorithm
>> max_iterations = 1000;
% Replication for KMeans Algorithm
>> replicates = 100;
>> labels = kmeans(Kernel, num_clusters, ...
    'start','sample', ...
    'maxiter',max_iterations,...
    'replicates',replicates, ...
    'EmptyAction','singleton'...
    );

Print the labels given by k-means:

>> spx.io.print.vector(labels, 0);
1 1 1 1 2 2 2 2

As expected, the algorithm has been able to group the points into two clusters. The labels are matching with the original true labels.

Complete example code can be downloaded here.

sparse-plex includes a function which implements the unnormalized spectral clustering algorithm. We can use it on the data above as follows:

>> result = spx.cluster.spectral.simple.unnormalized(W);
>> result.labels'

ans =

     1     1     1     1     2     2     2     2

Inside Normalized (Random Walk) Spectral Clustering

In this section, we will look at a spectral clustering method using normalized Laplacians. The primary difference is the way the graph Laplacian is computed: \(L = I - D^{-1} W\).

We will use the third example from self tuning paper for demonstration here.

_images/st3_nrw_raw_data.png

There are three clusters in the dataset. While two of the clusters have very clear convex shapes, the third one forms a half moon. It is this cluster which causes problems with a simple algorithm like k-means. The half moon is the first cluster, while the other two above it are clusters 2 and 3 (from left to right).

Loading the dataset:

dataset_file = fullfile(spx.data_dir, 'clustering', ...
    'self_tuning_paper_clustering_data');
data = load(dataset_file);
datasets = data.XX;
raw_data = datasets{3};
% Scale the raw_data
raw_data = raw_data - repmat(mean(raw_data),size(raw_data,1),1);
raw_data = raw_data/max(max(abs(raw_data)));
num_clusters = data.group_num(1);
X = raw_data(:, 1);
Y = raw_data(:, 2);
% plot it
axis equal;
plot(X, Y, '.', 'MarkerSize',16);

Let’s compute the pairwise (squared) \(\ell_2\) distances between the points:

sqrt_dist_mat = spx.commons.distance.sqrd_l2_distances_rw(raw_data);
imagesc(sqrt_dist_mat);
title('Distance Matrix');
_images/st3_nrw_distances.png
  • In cluster 2 and 3, the distances are quite small between all pairs of points.
  • In cluster 1, every point is near only some of the other points. The points form a kind of chain structure; they keep getting farther and farther apart. This is visible from the gradual color change from blue to yellow in the off diagonal parts of the first (diagonal) sub-part in the image above.

We map the (squared) distances to similarity values between 0 to 1:

scale = 0.04;
W = spx.cluster.similarity.gauss_sim_from_sqrd_dist_mat(sqrt_dist_mat, scale);
imagesc(W);
title('Similarity Matrix');

The transformation involved here is

\[w_{i j} = \exp \left ( - \frac{d_{i j}^2}{\sigma^2} \right).\]
_images/st3_nrw_similarity.png

The chain structure of similarity in cluster 1 is clearly visible here. In cluster 2 and 3 points are fairly similar to each other.
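
For reference, the same transformation can be written in plain MATLAB. How the scale argument of gauss_sim_from_sqrd_dist_mat maps onto \(\sigma^2\) is not spelled out here, so \(\sigma^2\) is kept explicit in this sketch:

% Plain-MATLAB Gaussian similarity from squared pairwise distances
sigma2 = 0.04;                            % assumed value of sigma^2
% sqrt_dist_mat (computed above) holds squared pairwise distances
W_plain = exp(-sqrt_dist_mat / sigma2);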

We now compute the graph Laplacian:

[num_nodes, ~] = size(W);
Degree = diag(sum(W));
DegreeInv = Degree^(-1);
Laplacian = speye(num_nodes) - DegreeInv * W;
imagesc(Laplacian);
title('Normalized Random Walk Laplacian');

Note how the Laplacian has been computed slightly differently.

_images/st3_nrw_laplacian.png

Let’s look at the singular values:

[~, S, V] = svd(Laplacian);
singular_values = diag(S);
plot(singular_values, 'b.-');
grid on;
title('Singular values of the Laplacian');
_images/st3_nrw_singular_values.png

This time no clear knee is visible in the singular value plot. We can verify this by looking at the differences:

sv_changes = diff( singular_values(1:end-1) );
plot(sv_changes, 'b.-');
grid on;
title('Changes in singular values');
_images/st3_nrw_sv_changes.png

Finding the largest change in singular values will not give us the correct number of clusters:

>> [min_val , ind_min ] = min(sv_changes);
>> num_clusters = num_nodes - ind_min;
>> num_clusters

num_clusters =

    26

However, by data inspection, we can clearly see that there are only 3 clusters of interest.

In this case, since the data are well segregated, the number of singular values which are close to zero actually matches the number of clusters:

>> num_clusters = sum(singular_values < 1e-6)
num_clusters =

     3

Let's verify by printing the 11 smallest singular values:

>> spx.io.print.vector(singular_values(end-10:end), 6)
0.027412 0.023242 0.014100 0.012101 0.006108 0.003988 0.001655 0.000471 0.000000 0.000000 0.000000

We will stick to this way of computing number of clusters here.

Let’s pick up the right singular vectors corresponding to the last 3 singular values:

% Choose the last num_clusters eigen vectors
Kernel = V(:,num_nodes-num_clusters+1:num_nodes);

Time to perform k-means clustering on the row vectors of this kernel:

% Maximum iteration for KMeans Algorithm
max_iterations = 1000;
% Replication for KMeans Algorithm
replicates = 100;
cluster_labels = kmeans(Kernel, num_clusters, ...
    'start','plus', ...
    'maxiter',max_iterations,...
    'replicates',replicates, ...
    'EmptyAction','singleton'...
    );

The labels are returned in the variable cluster_labels.

Let’s plot the original data by assigning different colors to points belonging to different labels:

hold on;
axis equal;
for c=1:num_clusters
    % Identify points in this cluster
    points = raw_data(cluster_labels == c, :);
    X = points(:, 1);
    Y = points(:, 2);
    plot(X, Y, '.', 'MarkerSize',16);
end
hold off;
_images/st3_nrw_clustered_data.png

We can see that the clusters have been clearly identified.

Utility Functions for Clustering Experiments

We provide some utility functions which are quite useful in setting up clustering experiments.

Suppose you stack data vectors from different clusters together in a matrix column-wise. You wish to assign labels to each column of the matrix. We provide a function to automatically choose such labels.

Let’s choose some cluster sizes:

>> cluster_sizes = [  4 3 3 2];

Let’s generate labels for these clusters:

>> labels = spx.cluster.labels_from_cluster_sizes(cluster_sizes)

labels =

     1     1     1     1     2     2     2     3     3     3     4     4

Notice how the first 4 labels are 1, the next 3 labels are 2, the next 3 are 3 and the final 2 are 4.

Let’s randomly reorder the labels. This is a typical preprocessing step before feeding data to a clustering algorithm: it destroys any inherent ordering in the data. (A sketch of shuffling a data matrix together with its labels follows the output below.)

>> labels = labels(randperm(numel(labels)))

labels =

3 2 2 3 4 3 1 1 4 1 2 1
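
In a real experiment the data matrix has to be shuffled with the same permutation as the labels. A minimal sketch, assuming a hypothetical data matrix X with one column per data point:

% X is a hypothetical data matrix with one column per data point
num_points = numel(labels);
perm = randperm(num_points);
X = X(:, perm);            % shuffle the data columns
labels = labels(perm);     % shuffle the labels identically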

It is often useful to recover the size of each cluster from a label vector. We provide a function for that too:

>> spx.cluster.cluster_sizes_from_labels(labels)

ans =

     4     3     3     2

Comparing Clusterings

In Measurement of clustering performance, we looked at the theoretical aspects of comparing two different clusterings.

In this section, we will learn the tools available in sparse-plex library for comparing clusterings.

Example

In this example, we will consider a set of 14 objects which are clustered into 4 different clusters by two different algorithms, algorithm A and B. Algorithm A could be human annotations themselves, in which case the labels are the ground truth against which we will compare the results of B.

We assume that the number of clusters is known in advance to be 4 and the two algorithms are generating the labels 1, 2, 3, 4.

Algorithm A outputs the following labels:

A  = [2 1 3 2 4 2 1 1 1 1 4 3 3 3];

It puts 5 objects into cluster 1, 3 into cluster 2, 4 into cluster 3, and 2 in cluster 4.

Algorithm B outputs the following labels:

B  = [4 2 3 4 2 4 2 2 3 2 1 3 3 3];

It puts only 1 object in cluster 1, 5 in cluster 2, 5 in cluster 3 and 3 in cluster 4.

An easy way to determine this is the tabulate function:

>> tabulate(A)
  Value    Count   Percent
      1        5     35.71%
      2        3     21.43%
      3        4     28.57%
      4        2     14.29%

By inspection, we can see that the two algorithms are assigning different labels in most cases.

We need to figure out the label mapping between the two clusterings, i.e. how the labels assigned by A relate to the labels assigned by B. For example, when A assigns label 1 to some object, what is the most likely label assigned by B?

sparse-plex provides a cluster comparison tool:

>> cc = spx.cluster.ClusterComparison(A, B);

In order to compare the two clusterings, the first tool is the confusion matrix:

>> cm = cc.confusionMatrix(); cm

cm =

     0     4     1     0
     0     0     0     3
     0     0     4     0
     1     1     0     0

In the confusion matrix, the rows represent the labels assigned by A and columns represent the labels assigned by B.

e.g. for the 5 objects assigned to cluster 1 by A, 4 of them were assigned to cluster 2 by B and 1 was assigned to cluster 3 by B.

The confusion matrix is a very useful tool for identifying the label mapping. In this case, cluster 1 of algorithm A and cluster 2 of algorithm B are likely to be the same cluster.
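
As a sanity check, the same counts can be obtained directly with MATLAB's accumarray (a sketch, not the library's internal implementation):

% rows: labels assigned by A, columns: labels assigned by B
cm_check = accumarray([A(:) B(:)], 1, [4 4])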

From Measurement of clustering performance, we would like to get the precision, recall and f1-measure numbers between the two clusterings.

ClusterComparison provides a method to get all of these metrics:

>> fm = cc.fMeasure();
>> fm.precisionMatrix

ans =

         0    0.8000    0.2000         0
         0         0         0    1.0000
         0         0    0.8000         0
    1.0000    0.2000         0         0

>> fm.recallMatrix

ans =

         0    0.8000    0.2000         0
         0         0         0    1.0000
         0         0    1.0000         0
    0.5000    0.5000         0         0

>> fm.fMatrix

ans =

         0    0.8000    0.2000         0
         0         0         0    1.0000
         0         0    0.8889         0
    0.6667    0.2857         0         0

>> fm.precision

ans =

    0.8571

>> fm.recall

ans =

    0.8571

>> fm.fMeasure

ans =

    0.8492
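
As a sanity check on where these aggregate numbers come from (this is an observation from the printed matrices, not a statement of the library’s exact formula): weighting the best value in each row of fm.fMatrix by the size of the corresponding cluster in A gives

\[\frac{5 \times 0.8 + 3 \times 1.0 + 4 \times 0.8889 + 2 \times 0.6667}{14} \approx 0.8492,\]

which matches fm.fMeasure. The same size-weighted average of the best per-row values of fm.recallMatrix gives \(12/14 \approx 0.8571\), matching fm.recall.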

A label map is also computed using the f1 matrix:

>> fm.labelMap'

ans =

     2     4     3     1

The map suggests a mapping from labels of A to labels of B as follows: 1->2, 2->4, 3->3, 4->1.

It also provides the labels of B after remapping them to the label space of A:

>> fm.remappedLabels'

ans =

     2     1     3     2     1     2     1     1     3     1     4     3     3     3

We can look at the number of places the remapped labels of B differ from the original A labels:

>> fm.remappedLabels' ~= A

ans =

  1×14 logical array

   0   0   0   0   1   0   0   0   1   0   0   0   0   0

We see that after remapping of labels, A and B differ in only 2 places. The clustering done by B is actually very close to the clustering done by A.

The ClusterComparison class provides a helpful method for printing the results stored in the fm object:

>> spx.cluster.ClusterComparison.printF1MeasureResult(fm)
F1-measure: 0.85, Precision: 0.86, Recall: 0.86, Misclassification rate: 0.14, Clusters: A: 4, B: 4, Clustering ratio: 1.00
Label map:
1=>2, 2=>4, 3=>3, 4=>1,

Label mapping using Hungarian method

Label mapping is essentially an assignment problem. We want to match the labels assigned by the two algorithms in such a way that the clustering error is minimized.

The Hungarian algorithm is used in assignment problems when we want to minimize cost.

sparse-plex includes an implementation of Hungarian label mapping by Niclas Borlin.

Example

We can perform the assignment as follows:

>> C = bestMap(A, B)'; C

ans =

     2 1 3 2 1 2 1 1 3 1 4 3 3 3

>> C ~= A

ans =

  1×14 logical array

   0   0   0   0   1   0   0   0   1   0   0   0   0   0

In this case, the mapping given by the Hungarian method is the same as the mapping generated by the \(f_1\) measure method. Sometimes it is not.

The bestMap method is easy to use.

Clustering Error

If two clusterings use the same number of labels, then a simpler clustering error metric is quite useful.

We start with an example set of true labels A and estimated labels B:

A  = [2 1 3 2 4 2 1 1 1 1 4 3 3 3];
B  = [4 2 3 4 2 4 2 2 3 2 1 3 3 3];

Total number of labels:

num_labels = numel(A);
num_labels =

    14

Let’s use the Hungarian mapping technique to find the mapping of labels between A and B:

mapped_B = bestMap(A, B)'
mapped_B =

     2 1 3 2 1 2 1 1 3 1 4 3 3 3

After this mapping, the mapped B labels look pretty much like A. The positions where they still differ from A are where the algorithm has made mistakes:

mistakes = mapped_B ~= A
mistakes =

  1×14 logical array

   0   0   0   0   1   0   0   0   1   0   0   0   0   0

Total number of mistakes:

num_mistakes = sum(mistakes)
num_mistakes =

     2

Clustering error is nothing but the ratio of mistakes made and total number of data points:

clustering_error  = num_mistakes / num_labels
clustering_error =

    0.1429

In percentage:

clustering_error_perc = clustering_error * 100
clustering_error_perc =

   14.2857

Accuracy can be computed from error:

clustering_acc_perc = 100 - clustering_error_perc

Sparse-Plex provides a function which does all of this together:

>> spx.cluster.clustering_error_hungarian_mapping(A, B)

ans =

  struct with fields:

           num_labels: 14
    num_missed_points: 2
                error: 0.1429
           error_perc: 14.2857
        mapped_labels: [2 1 3 2 1 2 1 1 3 1 4 3 3 3]
               misses: [0 0 0 0 1 0 0 0 1 0 0 0 0 0]

Pursuit Algorithms

Prelude to greedy pursuit algorithms

In this chapter we will review some matching pursuit algorithms which can help us solve the sparse approximation problem and the sparse recovery problem discussed here.

The presentation in this chapter is based on a number of sources including [BDDH11][BD09][Ela10][NT09][Tro04][TG07].

Let us recall the definitions of sparse approximation and recovery problems from previous chapters.

From here let \(\DDD\) be a signal dictionary with \(\Phi \in \CC^{N \times D}\) being its synthesis matrix. The \((\mathcal{D}, K)\)-sparse approximation can be written as

\[\begin{split}\begin{aligned} & \underset{\alpha}{\text{minimize}} & & \| x - \Phi \alpha \|_2 \\ & \text{subject to} & & \| \alpha \|_0 \leq K. \end{aligned}\end{split}\]

From here with the help of synthesis matrix \(\Phi\), the \((\mathcal{D}, K)\)-exact-sparse problem can be written as

\[\begin{split}\begin{aligned} & \underset{\alpha}{\text{minimize}} & & \| \alpha \|_0 \\ & \text{subject to} & & x = \Phi \alpha\\ & \text{and} & & \| \alpha \|_0 \leq K \end{aligned}\end{split}\]

From here we recall the sparse signal recovery from compressed measurements problem as following. Let \(\Phi \in \CC^{M \times N}\) be a sensing matrix. Let \(x \in \CC^N\) be an unknown signal which is assumed to be sparse or compressible. Let \(y = \Phi x\) be a measurement vector in \(\CC^M\).

Then the signal recovery problem is to recover \(x\) from \(y\) subject to

\[y = \Phi x\]

assuming \(x\) to be \(K\) sparse or at least \(K\) compressible.

We note that the sparse approximation problem and the sparse recovery problem have pretty much the same structure. They are in fact dual to each other. Thus, we will see that the same set of algorithms can be adapted to solve both problems.

In the sequel we will see many variations of above problems.

Our first problem

We will start with attacking a very simple version of \((\mathcal{D}, K)\)-exact-sparse problem.

Setting up notation

  • \(x \in \CC^N\) is our signal of interest and it is known.
  • \(\DDD\) is the dictionary in which we are looking for a sparse representation of \(x\).
  • \(\Phi \in \CC^{N \times D}\) is the synthesis matrix for \(\DDD\).
  • The sparse representation of \(x\) in \(\DDD\) is given by
\[x = \Phi \alpha.\]
  • It is assumed that \(\alpha \in \CC^D\) is sparse with \(\| \alpha \|_0 \leq K\).
  • Also, we assume that \(\alpha\) is the sparsest possible solution for \(x\); it is what we are looking for.
  • We know \(x\), we know \(\Phi\), we don’t know \(\alpha\). We are looking for it.

Thus we need to solve the optimization problem given by

(1)\[\underset{\alpha}{\text{minimize}}\, \| \alpha \|_0 \; \text{subject to} \, x = \Phi \alpha.\]

For the unknown vector \(\alpha\), we need to find

  • the sparsest support for the solution i.e. \(\{ i | \alpha_i \neq 0 \}\)
  • the non-zero values \(\alpha_i\) over this support.

If we are able to find the support for the solution \(\alpha\), then we may assume that the non-zero values of \(\alpha\) can be easily computed by least squares methods.

Note that the support is discrete in nature (An index \(i\) either belongs to the support or it does not). Hence algorithms which will seek the support will also be discrete in nature.

We now build up a case for greedy algorithms before jumping into specific algorithms later.

Let us begin with a much simplified version of the problem.

Let the columns of the matrix \(\Phi\) be represented as

\[\Phi = \begin{bmatrix} \phi_1 & \phi_2 & \dots & \phi_D \end{bmatrix} .\]

Let \(\spark (\Phi) > 2\). Thus no two columns in \(\Phi\) are linearly dependent and, as per here, for any \(x\) there is at most one \(1\)-sparse explanation vector.

We now assume that such a representation exists and we would be looking for optimal solution vector \(\alpha^*\) that has only one non-zero value, i.e. \(\| \alpha^*\|_0 = 1\).

Let \(i\) be the index at which \(\alpha^*_i \neq 0\).

Thus \(x = \alpha^*_i \phi_i\), i.e. \(x\) is a scalar multiple of \(\phi_i\) (the \(i\)-th column of \(\Phi\)).

Of course we don’t know what is the value of index \(i\).

We can find this by comparing \(x\) with each column of \(\Phi\) and find the column which best matches it.

Consider the least squares minimization problem:

\[\epsilon(j) = \underset{z_j}{\text{minimize}}\, \| \phi_j z_j - x \|_2.\]

where \(z_j \in \CC\) is a scalar.

From linear algebra, it attempts to find the projection of \(x\) over \(\phi_j\) and \(\epsilon(j)\) represents the magnitude of error between \(x\) and the projection of \(x\) over \(\phi_j\).

Optimal solution is given by

\[z_j^* = \frac{\phi_j^H x }{\| \phi_j \|_2^2} = \phi_j^H x\]

since columns of a dictionary are assumed to be unit norm.

Plugging it back into the expression of minimum squared error we get

\[\begin{split}\epsilon^2(j) &= \underset{z_j}{\text{minimize}}\, \| \phi_j z_j - x \|_2^2\\ &=\left \| \phi_j \phi_j^H x - x \right \|_2^2\\ &= \| x\|_2^2 - |\phi_j^H x |^2.\end{split}\]

Now since \(x\) is a scalar multiple of \(\phi_i\), hence \(\epsilon(i) = 0\), thus if we look at \(\epsilon(j)\) for \(j = 1, \dots, D\), the minimum value \(0\) will be obtained for \(j = i\).

And \(\epsilon(i) = 0\) means

\[\| x\|_2^2 - |\phi_i^H x |^2 = 0 \implies \| x\|_2^2 = |\phi_i^H x |^2.\]

This is a special case of Cauchy-Schwartz inequality when \(x\) and \(\phi_i\) are collinear.

The sparse representation is given by

\[\begin{split}\alpha = \begin{bmatrix} 0 \\ \vdots \\ z_i^* \\ \vdots \\ 0 \end{bmatrix}\end{split}\]

Since \(x \in \CC^N\) and \(\phi_j \in \CC^N\), hence computation of \(\epsilon(j)\) requires \(\bigO{N}\) time.

Since we may need to do it for all \(D\) columns, hence finding the index \(i\) takes \(\bigO{ND}\) time.
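
A minimal MATLAB sketch of this 1-sparse search, assuming Phi has unit norm columns and x is the given signal:

inner_products = Phi' * x;                        % phi_j^H x for all j
errors = norm(x)^2 - abs(inner_products).^2;      % epsilon^2(j) for all j
[~, i] = min(errors);                             % index of the best matching column
alpha = zeros(size(Phi, 2), 1);
alpha(i) = inner_products(i);                     % z_i^* = phi_i^H x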

Now let us make our life more complex. Suppose that \(\spark(\Phi) > 2 K\). Thus a sparse representation \(\alpha\) of \(x\) with up to \(K\) non-zero values is unique if it exists (see again here). We assume it exists. Since \(x=\Phi \alpha\), \(x\) is a linear combination of up to \(K\) columns of \(\Phi\).

One approach could be to check out all \(\binom{D}{K}\) possible subsets of \(K\) columns from \(\Phi\).

But \(D \choose K\) is \(\bigO{D^{K}}\) and for each subset of \(K\) columns solving the least squares problem will take \(\bigO{N K^2}\) time. Hence overall complexity of the recovery process would be \(\bigO{D^{K} N K^2}\). This is prohibitively expensive.

A way around is to adopt a greedy strategy, in which we abandon the hopeless exhaustive search and attempt a series of single term updates to the solution vector \(\alpha\).

Since this is an iterative procedure, let us call the approximation at each iteration as \(\alpha^k\) where \(k\) is the iteration index.

  • We start with \(\alpha^0 = 0\).

  • At each iteration we choose one new column in \(\Phi\) and fill in a value at corresponding index in \(\alpha^k\).

  • The column and value are chosen such that it maximally reduces the \(l_2\) error between \(x\) and the approximation. i.e.

    \[\| x -\Phi \alpha^{k + 1} \|_2 < \| x -\Phi \alpha^{k} \|_2\]

    and the error reduction is as high as possible.

  • We stop when the \(l_2\) error reduces below a specific threshold.

Many variations to this scheme are possible.

  • We can choose more than one atom in each iteration.
  • In fact, we can even choose \(K\) atoms in each iteration.
  • We can drop some previously chosen atoms in an iteration too if they seem to be incorrect choices.

Not every chosen atom may be a correct one. Some algorithms have mechanisms to identify atoms which are more likely to be part of the support than others and thus drop the unlikely ones.

We are now ready to explore different greedy algorithms.

Matching Pursuit

Algorithm

_images/alg_matching_pursuit.png

Matching Pursuit

The matching pursuit algorithm is a very simple iterative approach to solve the sparse recovery problem. We are given the signal \(y\) and the dictionary \(\Phi\) and we are to recover the sparse representation \(x\) satisfying \(y = \Phi x\).

In each iteration of matching pursuit:

  • A current estimate of the representation vector \(x\) is maintained in the variable \(z\).
  • Current residual \(r = y - \Phi z\) is maintained.
  • The inner product of the residual with all the atoms in \(\Phi\) is computed.
  • We look for the atom which has the largest inner product in magnitude.
  • Contribution from this atom is added to the representation.
  • Residual is reduced accordingly.

Note that the norm of the residual is guaranteed to decrease monotonically in each iteration until the algorithm converges.

The algorithm can be motivated as follows.

Let \(\Lambda\) be the support of the representation vector \(x\). Then

\[y = \sum_{j \in \Lambda} \phi_{j} x_{j}.\]

For some \(k \in \Lambda\)

\[\langle y, \phi_k \rangle = \sum_{j \in \Lambda} \langle \phi_{j} , \phi_k \rangle x_{j}.\]

If the atoms formed an orthonormal set, this would have reduced to \(x_{k} = \langle y, \phi_k \rangle\) and picking the largest inner product would give us the largest non-zero entry in \(x\).

In fact, if \(\Phi\) was an orthonormal basis, then matching pursuit recovers the representation of \(y\) in exactly \(K\) iterations where \(K = |\Lambda|\) by successively picking up nonzero coefficients in \(x\) in the order of descending magnitude. We hope that the algorithm is useful even when the atoms in \(\Phi_{\Lambda}\) are not orthogonal.

Now, let us look at the iterative structure. Assume that the current estimate \(z\) satisfies \(\supp(z) \subseteq \Lambda\). Then \(\Phi z \in \Range(\Phi_{\Lambda})\). Since \(y \in \Range(\Phi_{\Lambda})\) as well, the residual \(r = y - \Phi z\) also belongs to \(\Range(\Phi_{\Lambda})\).

Finally, if the atoms in \(\Phi\) are nearly orthogonal to each other, then the largest inner product of \(r\) will be for one of the atoms in \(\Lambda\). This atom is indexed by the variable \(k\). Then \(h_k\) is the projection of the residual \(r\) on the atom \(\phi_k\).

We add this projection coefficient to \(z_k\) and remove the projection from the residual. The support of \(z\) continues to be within \(\Lambda\).

Since the atoms are not orthogonal, matching pursuit typically takes a much larger number of iterations than the sparsity level \(K\). However, under suitable conditions, it does converge to the correct solution.

Hands-on with Matching Pursuit

Matching pursuit on a 2-sparse vector

In this example, we will reconstruct a 2-sparse representation vector \(x\) from a signal \(y = \Phi x\). We will develop a basic implementation of matching pursuit along the way.

From this example, we know of a way to construct a dictionary with high spark:

rng default;
N = 20;
M = 10;
K = 2;
PhiA = hadamard(N);
rows = randperm(N, M);
PhiB = PhiA(rows, :);

Let’s print its contents:

>> PhiB

PhiB =

     1 -1 -1 -1 -1  1 -1  1 -1  1  1  1  1 -1 -1  1 -1 -1  1  1
     1 -1 -1  1 -1 -1  1  1 -1 -1 -1 -1  1 -1  1 -1  1  1  1  1
     1 -1 -1 -1  1 -1  1 -1  1  1  1  1 -1 -1  1 -1 -1  1  1 -1
     1  1  1 -1 -1  1 -1 -1  1  1 -1 -1 -1 -1  1 -1  1 -1  1  1
     1  1 -1  1  1  1  1 -1 -1  1 -1 -1  1  1 -1 -1 -1 -1  1 -1
     1 -1  1  1  1  1 -1 -1  1 -1 -1  1  1 -1 -1 -1 -1  1 -1  1
     1 -1  1  1 -1 -1 -1 -1  1 -1  1 -1  1  1  1  1 -1 -1  1 -1
     1  1  1 -1 -1 -1 -1  1 -1  1 -1  1  1  1  1 -1 -1  1 -1 -1
     1 -1  1 -1 -1  1  1 -1 -1 -1 -1  1 -1  1 -1  1  1  1  1 -1
     1  1 -1 -1  1  1 -1 -1 -1 -1  1 -1  1 -1  1  1  1  1 -1 -1

Let’s normalize its columns:

Phi = spx.norm.normalize_l2(PhiB);

Bi-Gaussian discusses ways to generate synthetic sparse vectors.

Let’s generate our 2-sparse representation vector:

rng(100);
gen = spx.data.synthetic.SparseSignalGenerator(N, K);
x =  gen.biGaussian();

Let’s print \(x\):

>> spx.io.print.sparse_signal(x);
(6,1.6150) (11,-1.2390)   N=20, K=2

This is a nice helper function to print sparse vectors. It prints a sequence of tuples where each tuple consists of the index of a non-zero value and corresponding value.

The support for this vector is:

>> spx.commons.sparse.support(x)'

ans =

     6    11

Let’s construct our 10-dimensional signal from it:

y = Phi * x;

Let’s print it:

>> spx.io.print.vector(y)
0.12 -0.12 -0.90 0.90 0.90 0.90 -0.90 -0.12 0.90 0.12

Our problem is now setup. Our job now is to recover \(x\) from \(\Phi\) and \(y\).

Initialize the estimated representation and current residual:

z = zeros(N, 1);
r = y;

We will run the matching pursuit iterations up to 100 times:

for i=1:100

The following code fragments make up the body of each matching pursuit iteration. We start by computing the inner products of the current residual with each atom:

inner_products = Phi' * r;

Find the index of best matching atom \(k\)

[max_abs_inner_product, index]  = max(abs(inner_products));

Corresponding signed inner product \(h_k\):

max_inner_product = inner_products(index);

Update the representation:

z(index) = z(index) + max_inner_product;

Remove the projection of the atom from the residual:

r = r - max_inner_product * Phi(:, index);

Compute the norm of residual:

norm_residual = norm(r);

If the norm is less than a threshold, we break out of loop:

if norm_residual < 1e-4
    break;
end

It will be instructive to print the current residual norm, the selected atom index, the signed inner product and the estimated coefficients in the \(z\) variable in each iteration:

fprintf('[%d]: k: %d, r_norm: %.4f, h_k: %.4f, estimate: ', i, index, norm_residual, max_inner_product);
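
For reference, here is how the fragments above assemble into a single loop (a sketch; the call printing the estimate via spx.io.print.sparse_signal is an assumption, and the downloadable script mentioned below is the authoritative version):

for i=1:100
    % match the residual against all atoms
    inner_products = Phi' * r;
    % identify the best matching atom
    [max_abs_inner_product, index]  = max(abs(inner_products));
    max_inner_product = inner_products(index);
    % update the representation and reduce the residual
    z(index) = z(index) + max_inner_product;
    r = r - max_inner_product * Phi(:, index);
    norm_residual = norm(r);
    if norm_residual < 1e-4
        break;
    end
    fprintf('[%d]: k: %d, r_norm: %.4f, h_k: %.4f, estimate: ', ...
        i, index, norm_residual, max_inner_product);
    spx.io.print.sparse_signal(z);   % presumed source of the "estimate:" output
end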

Here is the output of running this algorithm for this problem:

[1]: k: 6, r_norm: 1.2140, h_k: 1.8628, estimate: (6,1.8628)   N=20, K=1
[2]: k: 11, r_norm: 0.2428, h_k: -1.1894, estimate: (6,1.8628) (11,-1.1894)   N=20, K=2
[3]: k: 6, r_norm: 0.0486, h_k: -0.2379, estimate: (6,1.6249) (11,-1.1894)   N=20, K=2
[4]: k: 11, r_norm: 0.0097, h_k: -0.0476, estimate: (6,1.6249) (11,-1.2370)   N=20, K=2
[5]: k: 6, r_norm: 0.0019, h_k: -0.0095, estimate: (6,1.6154) (11,-1.2370)   N=20, K=2
[6]: k: 11, r_norm: 0.0004, h_k: -0.0019, estimate: (6,1.6154) (11,-1.2389)   N=20, K=2
[7]: k: 6, r_norm: 0.0001, h_k: -0.0004, estimate: (6,1.6150) (11,-1.2389)   N=20, K=2

It took us 7 iterations, but the residual norm came close to 0. We can see that the non-zero values in \(z\) closely match the corresponding values in \(x\). Matching pursuit has been successful. We can also notice that the reconstruction alternates between atoms 6 and 11 in successive iterations. Also, the residual norm keeps decreasing with each iteration.

The complete code can be downloaded here.

Example: When matching pursuit fails

Although the spark of the dictionary in the previous example is \(8\), matching pursuit can fail to recover signals which are 3-sparse.

Here is an example output of running matching pursuit on a 3-sparse vector for 20 iterations:

The representation: (6,-1.9014) (8,1.3481) (11,1.6150)   N=20, K=3
[1]: k: 6, r_norm: 1.9189, h_k: -2.7636, estimate: (6,-2.7636)   N=20, K=1
[2]: k: 11, r_norm: 1.2654, h_k: 1.4425, estimate: (6,-2.7636) (11,1.4425)   N=20, K=2
[3]: k: 8, r_norm: 0.7712, h_k: 1.0032, estimate: (6,-2.7636) (8,1.0032) (11,1.4425)   N=20, K=3
[4]: k: 6, r_norm: 0.3449, h_k: 0.6898, estimate: (6,-2.0738) (8,1.0032) (11,1.4425)   N=20, K=3
[5]: k: 8, r_norm: 0.2069, h_k: 0.2759, estimate: (6,-2.0738) (8,1.2791) (11,1.4425)   N=20, K=3
[6]: k: 11, r_norm: 0.1542, h_k: 0.1380, estimate: (6,-2.0738) (8,1.2791) (11,1.5805)   N=20, K=3
[7]: k: 6, r_norm: 0.0690, h_k: 0.1380, estimate: (6,-1.9359) (8,1.2791) (11,1.5805)   N=20, K=3
[8]: k: 8, r_norm: 0.0414, h_k: 0.0552, estimate: (6,-1.9359) (8,1.3343) (11,1.5805)   N=20, K=3
[9]: k: 16, r_norm: 0.0308, h_k: 0.0276, estimate: (6,-1.9359) (8,1.3343) (11,1.5805) (16,0.0276)   N=20, K=4
[10]: k: 14, r_norm: 0.0241, h_k: -0.0193, estimate: (6,-1.9359) (8,1.3343) (11,1.5805) (14,-0.0193) (16,0.0276)
  N=20, K=5
[11]: k: 10, r_norm: 0.0197, h_k: 0.0138, estimate: (6,-1.9359) (8,1.3343) (10,0.0138) (11,1.5805) (14,-0.0193)
(16,0.0276)   N=20, K=6
[12]: k: 6, r_norm: 0.0151, h_k: 0.0127, estimate: (6,-1.9232) (8,1.3343) (10,0.0138) (11,1.5805) (14,-0.0193)
(16,0.0276)   N=20, K=6
[13]: k: 11, r_norm: 0.0115, h_k: 0.0097, estimate: (6,-1.9232) (8,1.3343) (10,0.0138) (11,1.5902) (14,-0.0193)
(16,0.0276)   N=20, K=6
[14]: k: 15, r_norm: 0.0095, h_k: -0.0065, estimate: (6,-1.9232) (8,1.3343) (10,0.0138) (11,1.5902) (14,-0.0193)
(15,-0.0065) (16,0.0276)   N=20, K=7
[15]: k: 13, r_norm: 0.0078, h_k: 0.0055, estimate: (6,-1.9232) (8,1.3343) (10,0.0138) (11,1.5902) (13,0.0055)
(14,-0.0193) (15,-0.0065) (16,0.0276)   N=20, K=8
[16]: k: 1, r_norm: 0.0056, h_k: -0.0054, estimate: (1,-0.0054) (6,-1.9232) (8,1.3343) (10,0.0138) (11,1.5902)
(13,0.0055) (14,-0.0193) (15,-0.0065) (16,0.0276)   N=20, K=9
[17]: k: 20, r_norm: 0.0044, h_k: -0.0035, estimate: (1,-0.0054) (6,-1.9232) (8,1.3343) (10,0.0138) (11,1.5902)
(13,0.0055) (14,-0.0193) (15,-0.0065) (16,0.0276) (20,-0.0035)
  N=20, K=10
[18]: k: 2, r_norm: 0.0034, h_k: 0.0028, estimate: (1,-0.0054) (2,0.0028) (6,-1.9232) (8,1.3343) (10,0.0138)
(11,1.5902) (13,0.0055) (14,-0.0193) (15,-0.0065) (16,0.0276)
(20,-0.0035)   N=20, K=11
[19]: k: 4, r_norm: 0.0025, h_k: 0.0023, estimate: (1,-0.0054) (2,0.0028) (4,0.0023) (6,-1.9232) (8,1.3343)
(10,0.0138) (11,1.5902) (13,0.0055) (14,-0.0193) (15,-0.0065)
(16,0.0276) (20,-0.0035)   N=20, K=12
[20]: k: 17, r_norm: 0.0021, h_k: -0.0014, estimate: (1,-0.0054) (2,0.0028) (4,0.0023) (6,-1.9232) (8,1.3343)
(10,0.0138) (11,1.5902) (13,0.0055) (14,-0.0193) (15,-0.0065)
(16,0.0276) (17,-0.0014) (20,-0.0035)   N=20, K=13

The sparse vector is supported on atoms 6, 8 and 11. If we order the atoms by the magnitude of their coefficients, the order is 6, 11 and 8.

  • Atom 6 is discovered in the first iteration.
  • Atom 11 is discovered in the second iteration.
  • Atom 8 is discovered in the third iteration.
  • The coefficients for atoms 6, 8 and 11 continue to be updated up to the 8-th iteration.
  • In the 9-th iteration, it discovers an incorrect atom 16.
  • In the following iterations, it keeps discovering more incorrect atoms: 14, 10, 15, 13, 1, 20, etc.
  • The algorithm is side-tracked after the 9-th iteration. The residual no longer belongs to the range \(\Range(\Phi_{\Lambda})\).
  • After 20 iterations, as many as 13 atoms are involved in the representation.
  • Yet, most of the energy is concentrated in atoms 6, 8 and 11 only. In that sense, MP hasn't failed completely.
  • A simple thresholding can remove the spurious contributions from the incorrect atoms, as sketched below.
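
A minimal sketch of such a cleanup, keeping only the \(K\) largest-magnitude coefficients of the estimate z:

K = 3;
[~, order] = sort(abs(z), 'descend');
z_clean = zeros(size(z));
z_clean(order(1:K)) = z(order(1:K));   % retain the K strongest atoms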

Orthogonal Matching Pursuit

The OMP Algorithm

Orthogonal Matching Pursuit (OMP) addresses some of the limitations of Matching Pursuit. In particular, in each iteration:

  • The current estimate is computed by performing a least squares estimation on the subdictionary formed by atoms selected so far.
  • It ensures that the residual is totally orthogonal to already selected atoms.
  • It also means that an atom is selected only once.
  • Further, if all the atoms in the support are selected by OMP correctly, then the least squares estimate is able to achieve perfect recovery. The residual becomes 0.
  • In other words, if OMP is recovering a K-sparse representation, then it can recover it in exactly K iterations (if in each iteration it recovers one atom correctly).
  • OMP performs far better than MP in terms of the set of signals it can recover correctly.
  • At the same time, OMP is a much more complex algorithm (due to the least squares step).
_images/algorithm_orthogonal_matching_pursuit1.png

Orthogonal Matching Pursuit

The core OMP algorithm is presented above. The algorithm is iterative.

  • We start with the initial estimate of solution as \(x=0\).
  • We also maintain the support of \(x\) i.e. the set of indices for which \(x\) is non-zero in a variable \(\Lambda\). We start with an empty support.
  • In each (\(k\)-th) iteration we attempt to reduce the difference between the actual signal \(y\) and the approximate signal based on current solution \(x^{k}\) given by \(r^{k} = y - \Phi x^{k}\).
  • We do this by choosing a new index in \(x\) given by \(\lambda^{k+1}\) for the column \(\phi_{\lambda^{k+1}}\) which most closely matches our current residual.
  • We include this to our support for \(x\), estimate new solution vector \(x^{k+1}\) and compute new residual.
  • We stop when the residual magnitude is below a threshold \(\epsilon\) defined by us.

Each iteration of the algorithm consists of the following stages:

  1. Match For each column \(\phi_j\) in our dictionary, we measure the projection of the residual from the previous iteration onto the column.

  2. Identify We identify the atom with the largest inner product (in magnitude) and store its index in the variable \(\lambda^{k+1}\).

  3. Update support We include \(\lambda^{k+1}\) in the support set \(\Lambda^{k}\).

  4. Update representation In this step we find the solution of minimizing \(\| \Phi x - y \|^2\) over the support \(\Lambda^{k+1}\) as our next candidate solution vector.

    By keeping \(x_i = 0\) for \(i \notin \Lambda^{k+1}\) we are essentially leaving out corresponding columns \(\phi_i\) from our calculations.

    Thus we pick up only the columns specified by \(\Lambda^{k+1}\) from \(\Phi\). Let us call this matrix \(\Phi_{\Lambda^{k+1}}\). The size of this matrix is \(N \times | \Lambda^{k+1} |\). Let us call the corresponding sub-vector \(x_{\Lambda^{k+1}}\).

    E.g. suppose \(D=4\), then \(\Phi = \begin{bmatrix} \phi_1 & \phi_2 & \phi_3 & \phi_4 \end{bmatrix}\). Let \(\Lambda^{k+1} = \{1, 4\}\). Then \(\Phi_{\Lambda^{k+1}} = \begin{bmatrix} \phi_1 & \phi_4 \end{bmatrix}\) and \(x_{\Lambda^{k+1}} = (x_1, x_4)\).

    Our minimization problem then reduces to minimizing \(\|\Phi_{\Lambda^{k+1}} x_{\Lambda^{k+1}} - y \|_2\).

    We use standard least squares estimate for getting the coefficients for \(x_{\Lambda^{k+1}}\) over these indices. We put back \(x_{\Lambda^{k+1}}\) to obtain our new solution estimate \(x^{k+1}\).

    In the running example after obtaining the values \(x_1\) and \(x_4\), we will have \(x^{k+1} = (x_1, 0 , 0, x_4)\).

    The solution to this minimization problem is given by

    \[\Phi_{\Lambda^{k+1}}^H ( \Phi_{\Lambda^{k+1}}x_{\Lambda^{k+1}} - y ) = 0 \implies x_{\Lambda^{k+1}} = ( \Phi_{\Lambda^{k+1}}^H \Phi_{\Lambda^{k+1}} )^{-1} \Phi_ {\Lambda^{k+1}}^H y.\]

    Interestingly, we note that \(r^{k+1} = y - \Phi x^{k+1} = y - \Phi_{\Lambda^{k+1}} x_{\Lambda^{k+1}}\), thus

    \[\Phi_{\Lambda^{k+1}}^H r^{k+1} = 0\]

    which means that the columns of \(\Phi\) which are part of the support \(\Lambda^{k+1}\) are necessarily orthogonal to the residual \(r^{k+1}\). This implies that these columns will not be considered in the coming iterations for extending the support. This orthogonality is the reason behind the name of the algorithm: Orthogonal Matching Pursuit.

  5. Update residual We finally update the residual vector to \(r^{k+1}\) based on the new solution vector estimate. A compact sketch of the full iteration follows this list.
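
Putting the five stages together, a minimal MATLAB sketch of these iterations could look as follows (illustrative variable names; the library's optimized implementation is spx.fast.omp, described later in this chapter):

% Inputs: Phi (dictionary with unit norm columns), y (signal),
% K (sparsity level), epsilon (residual norm threshold).
[~, D] = size(Phi);
x = zeros(D, 1);
support = [];
r = y;
for k = 1:K
    % match and identify
    [~, lambda] = max(abs(Phi' * r));
    % update support
    support = [support, lambda];
    % update representation: least squares over the selected atoms
    x = zeros(D, 1);
    x(support) = Phi(:, support) \ y;
    % update residual; it is orthogonal to the selected atoms
    r = y - Phi(:, support) * x(support);
    if norm(r) < epsilon
        break;
    end
end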

Hands-on with Orthogonal Matching Pursuit
Example

Let us consider a synthesis matrix of size \(10 \times 20\). Thus \(N=10\) and \(D=20\). In order to fit into the display, we will present the matrix in two 10 column parts.


\[\begin{split}\begin{aligned} \Phi_a = \frac{1}{\sqrt{10}} \begin{bmatrix} -1 & -1 & -1 & 1 & -1 & -1 & 1 & 1 & -1 & 1\\ 1 & 1 & 1 & 1 & 1 & -1 & -1 & 1 & -1 & -1\\ -1 & -1 & -1 & -1 & -1 & 1 & 1 & 1 & 1 & 1\\ 1 & -1 & -1 & 1 & 1 & 1 & -1 & 1 & 1 & 1\\ 1 & 1 & 1 & -1 & -1 & 1 & -1 & -1 & 1 & 1\\ 1 & -1 & 1 & -1 & -1 & -1 & 1 & -1 & 1 & -1\\ -1 & -1 & 1 & 1 & -1 & -1 & -1 & -1 & 1 & -1\\ 1 & -1 & 1 & 1 & -1 & 1 & -1 & -1 & -1 & 1\\ -1 & 1 & -1 & 1 & 1 & -1 & -1 & -1 & 1 & 1\\ 1 & 1 & 1 & 1 & -1 & 1 & -1 & 1 & -1 & 1 \end{bmatrix}\\ \Phi_b = \frac{1}{\sqrt{10}} \begin{bmatrix} 1 & -1 & -1 & -1 & 1 & 1 & 1 & -1 & -1 & -1\\ 1 & 1 & 1 & -1 & -1 & -1 & -1 & -1 & -1 & 1\\ -1 & 1 & 1 & 1 & 1 & 1 & -1 & -1 & -1 & -1\\ 1 & -1 & 1 & -1 & 1 & 1 & 1 & -1 & -1 & -1\\ 1 & -1 & -1 & 1 & 1 & 1 & -1 & 1 & 1 & -1\\ -1 & 1 & 1 & 1 & -1 & 1 & -1 & 1 & -1 & 1\\ -1 & 1 & 1 & -1 & 1 & -1 & -1 & -1 & 1 & 1\\ 1 & -1 & -1 & 1 & 1 & -1 & -1 & 1 & -1 & 1\\ 1 & 1 & 1 & 1 & -1 & -1 & 1 & 1 & 1 & -1\\ -1 & -1 & 1 & 1 & -1 & 1 & 1 & -1 & -1 & 1 \end{bmatrix} \end{aligned}\end{split}\]

with

\[\Phi = \begin{bmatrix}\Phi_a & \Phi_b \end{bmatrix}.\]

You may verify that each column is unit norm.

It is known that \(\Rank(\Phi) = 10\) and \(\spark(\Phi)= 6\). Thus if a signal \(y\) has a \(2\) sparse representation in \(\Phi\) then the representation is necessarily unique.

We now consider a signal \(y\) given by

\[\begin{split}\small y = \begin{pmatrix} 4.74342 & -4.74342 & 1.58114 & -4.74342 & -1.58114 \\ 1.58114 & -4.74342 & -1.58114 & -4.74342 & -4.74342 \end{pmatrix}. \normalsize\end{split}\]

For saving space, we have written it as an n-tuple over two rows. You should treat it as a column vector of size \(10 \times 1\).

It is known that the vector has a two sparse representation in \(\Phi\). Let us go through the steps of OMP and see how it works.

In step 0, \(r^0= y\), \(x^0 = 0\), and \(\Lambda^0 = \EmptySet\).

We now compute absolute value of inner product of \(r^0\) with each of the columns. They are given by

\[\begin{split}\small \begin{pmatrix} 4 & 4 & 4 & 7 & 3 & 1 & 11 & 1 & 2 & 1 \\ 2 & 1 & 7 & 0 & 2 & 4 & 0 & 2 & 1 & 3 \end{pmatrix} \normalsize\end{split}\]

We quickly note that the maximum occurs at index 7 with value 11.

We modify our support to \(\Lambda^1 = \{ 7 \}\).

We now solve the least squares problem

\[\text{minimize} \left \| y - [\phi_7] x_7 \right \|_2^2.\]

The solution gives us \(x_7 = 11.00\). Thus we get

\[\begin{split}x^1 = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 11 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}.\end{split}\]

Again note that to save space we have presented \(x\) over two rows. You should consider it as a \(20 \times 1\) column vector.

This leaves us the residual as

\[\begin{split}r^1 = y - \Phi x^1 = \begin{pmatrix} 1.26491 & -1.26491 & -1.89737 & -1.26491 & 1.89737 \\ -1.89737 & -1.26491 & 1.89737 & -1.26491 & -1.26491 \end{pmatrix}.\end{split}\]

We can cross check that the residual is indeed orthogonal to the columns already selected, for

\[\langle r^1 , \phi_7 \rangle = 0.\]

Next we compute inner product of \(r^1\) with all the columns in \(\Phi\) and take absolute values. They are given by

\[\begin{split}\begin{pmatrix} 0.4 & 0.4 & 0.4 & 0.4 & 0.8 & 1.2 & 0 & 1.2 & 2 & 1.2 \\ 2.4 & 3.2 & 4.8 & 0 & 2 & 0.4 & 0 & 2 & 1.2 & 0.8 \end{pmatrix}\end{split}\]

We quickly note that the maximum occurs at index 13 with value \(4.8\).

We modify our support to \(\Lambda^2 = \{ 7, 13 \}\).

We now solve the least squares problem

\[\begin{split}\text{minimize} \left \| y - \begin{bmatrix} \phi_7 & \phi_{13} \end{bmatrix} \begin{bmatrix} x_7 \\ x_{13} \end{bmatrix} \right \|_2^2.\end{split}\]

This gives us \(x_7 = 10\) and \(x_{13} = -5\).
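
In MATLAB, this least squares step is a one-liner (assuming \(\Phi\) and \(y\) are set up as above):

x_sub = Phi(:, [7 13]) \ y    % approximately [10; -5]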

Thus we get

\[\begin{split}x^2 = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 10 & 0 & 0 & 0 \\ 0 & 0 & -5 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}\end{split}\]

Finally the residual we get at step 2 is

\[\begin{split}r^2 = y - \Phi x^2 = 10^{-14} \begin{pmatrix} 0 & 0 & -0.111022 & 0 & 0.111022 \\ -0.111022 & 0 & 0.111022 & 0 & 0 \end{pmatrix}\end{split}\]

The magnitude of the residual is negligibly small. We conclude that our OMP algorithm has converged and has been able to recover the exact 2-sparse representation of \(y\) in \(\Phi\).

Exact recovery conditions

Recall the \((\mathcal{D}, K)\)-exact-sparse problem discussed in Sparse approximation problem. OMP is a good and fast algorithm for solving this problem.

In terms of theoretical understanding, it is quite useful to know of certain conditions under which a sparse representation can be exactly recovered from a given signal using OMP. Such guarantees are known as exact recovery guarantees.

In this section, following Tropp in [Tro04], we will closely look at some conditions under which OMP is guaranteed to recover the solution for \((\mathcal{D}, K)\)-exact-sparse problem.

We rephrase the OMP algorithm following the conventions in \((\mathcal{D}, K)\)-exact-sparse problem.

_images/algorithm_omp_x_alpha_version.png

It is known that \(x = \Phi \alpha\) where \(\alpha\) contains at most \(K\) non-zero entries. Neither the support nor the entries of \(\alpha\) are known to the algorithm. OMP is given only \(\Phi\), \(x\) and \(K\), and it estimates \(\alpha\). The estimate returned by OMP is denoted as \(\widehat{\alpha}\).

Let \(\Lambda_{\text{opt}} = \supp(\alpha)\) be the set of indices at which optimal representation \(\alpha\) has non-zero entries. Then we can write

\[x = \sum_{i \in \Lambda_{\text{opt}}} \alpha_i \phi_i.\]

From the synthesis matrix \(\Phi\) we can extract a \(N \times K\) matrix \(\Phi_{\text{opt}}\) whose columns are indexed by \(\Lambda_{\text{opt}}\).

\[\Phi_{\text{opt}} \triangleq \begin{bmatrix} \phi_{\lambda_1} & \dots & \phi_{\lambda_K} \end{bmatrix}\]

where \(\lambda_i \in \Lambda_{\text{opt}}\). Thus, we can also write

\[x = \Phi_{\text{opt}} \alpha_{\text{opt}}\]

where \(\alpha_{\text{opt}} \in \CC^K\) is a vector of \(K\) complex entries.

Now, the columns of \(\Phi_{\text{opt}}\) are linearly independent. Hence \(\Phi_{\text{opt}}\) has full column rank.

We define another matrix \(\Psi_{\text{opt}}\) whose columns are the remaining \(D - K\) columns of \(\Phi\). Thus \(\Psi_{\text{opt}}\) consists of atoms or columns which do not participate in the optimum representation of \(x\).

OMP starts with an empty support. In every step, it picks one column from \(\Phi\) and adds it to the support of the approximation. If we can ensure that it never selects any column from \(\Psi_{\text{opt}}\), we will be guaranteed that the correct \(K\)-sparse representation is recovered.

We will use mathematical induction and assume that OMP has succeeded in its first \(k\) steps and has chosen \(k\) columns from \(\Phi_{\text{opt}}\) so far. At this point it is left with the residual \(r^k\).

In \((k+1)\)-th iteration, we compute inner product of \(r^k\) with all columns in \(\Phi\) and choose the column which has highest inner product.

We note that maximum value of inner product of \(r^k\) with any of the columns in \(\Psi_{\text{opt}}\) is given by

\[\| \Psi_{\text{opt}}^H r^k \|_{\infty}.\]

Correspondingly, maximum value of inner product of \(r^k\) with any of the columns in \(\Phi_{\text{opt}}\) is given by

\[\| \Phi_{\text{opt}}^H r^k \|_{\infty}.\]

Since we have already shown that \(r^k\) is orthogonal to the columns already chosen, those columns do not contribute to this maximum.

In order to make sure that none of the columns in \(\Psi_{\text{opt}}\) is selected, we need

\[\| \Psi_{\text{opt}}^H r^k \|_{\infty} < \| \Phi_{\text{opt}}^H r^k \|_{\infty}.\]
Definition

We define a ratio

(1)\[\rho(r^k) \triangleq \frac{\| \Psi_{\text{opt}}^H r^k \|_{\infty}}{\| \Phi_{\text{opt}}^H r^k \|_{\infty}}.\]

This ratio is known as the greedy selection ratio.
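
For numerical experiments, the greedy selection ratio for a residual r is a one-liner, given the two sub-matrices (a sketch):

rho = norm(Psi_opt' * r, inf) / norm(Phi_opt' * r, inf);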

We can see that as long as \(\rho(r^k) < 1\), OMP will make a right decision at \((k+1)\)-th stage. If \(\rho(r^k) = 1\) then there is no guarantee that OMP will make the right decision. We will assume pessimistically that OMP makes wrong decision in such situations.

We note that this definition of \(\rho(r^k)\) looks very similar to matrix \(p\)-norms defined in p-norm for matrices. It is suggested to review the properties of \(p\)-norms for matrices at this point.

We now present a condition which guarantees that \(\rho(r^k) < 1\) is always satisfied.

Theorem

A sufficient condition for Orthogonal Matching Pursuit to resolve \(x\) completely in \(K\) steps is that

(2)\[\underset{\psi}{\max} \| \Phi_{\text{opt}}^{\dag} \psi \|_1 < 1,\]

where \(\psi\) ranges over columns in \(\Psi_{\text{opt}}\).

Moreover, Orthogonal Matching Pursuit is a correct algorithm for \((\mathcal{D}, K)\)-exact-sparse problem whenever the condition holds for every superposition of \(K\) atoms from \(\DD\).

Proof

In (2), \(\Phi_{\text{opt}}^{\dag}\) is the pseudo-inverse of \(\Phi_{\text{opt}}\)

\[\Phi_{\text{opt}}^{\dag} = (\Phi_{\text{opt}}^H \Phi_{\text{opt}})^{-1} \Phi_{\text{opt}}^H.\]

What we need to show is if (2) holds true then \(\rho(r^k)\) will always be less than 1.

We note that the projection operator for the column span of \(\Phi_{\text{opt}}\) is given by

\[\Phi_{\text{opt}} (\Phi_{\text{opt}}^H \Phi_{\text{opt}})^{-1} \Phi_{\text{opt}}^H = (\Phi_{\text{opt}}^{\dag})^H \Phi_{\text{opt}}^H.\]

We also note that, by assumption, \(x \in \ColSpace(\Phi_{\text{opt}})\), and the approximation at the \(k\)-th step, \(x^k = \Phi \alpha^k\), also belongs to \(\ColSpace(\Phi_{\text{opt}})\); hence \(r^k = x - x^k\) belongs to \(\ColSpace(\Phi_{\text{opt}})\) as well.

Thus

\[r^k = (\Phi_{\text{opt}}^{\dag})^H \Phi_{\text{opt}}^H r^k\]

i.e. applying the projection operator for \(\Phi_{\text{opt}}\) on \(r^k\) doesn’t change it.

Using this we can rewrite \(\rho(r^k)\) as

\[\rho(r^k) = \frac{\| \Psi_{\text{opt}}^H r^k \|_{\infty}}{\| \Phi_{\text{opt}}^H r^k \|_{\infty}} = \frac{\| \Psi_{\text{opt}}^H (\Phi_{\text{opt}}^{\dag})^H \Phi_{\text{opt}}^H r^k \|_{\infty}} {\| \Phi_{\text{opt}}^H r^k \|_{\infty}}.\]

We see \(\Phi_{\text{opt}}^H r^k\) appearing both in numerator and denominator.

Now consider the matrix \(\Psi_{\text{opt}}^H (\Phi_{\text{opt}}^{\dag})^H\) and recall the definition of matrix \(\infty\)-norm from here

\[\| A\|_{\infty} = \underset{x \neq 0}{\max } \frac{\| A x \|_{\infty}}{\| x \|_{\infty}} \geq \frac{\| A x \|_{\infty}}{\| x \|_{\infty}} \Forall x \neq 0.\]

Thus

\[\| \Psi_{\text{opt}}^H (\Phi_{\text{opt}}^{\dag})^H \|_{\infty} \geq \frac{\| \Psi_{\text{opt}}^H (\Phi_{\text{opt}}^{\dag})^H \Phi_{\text{opt}}^H r^k \|_{\infty}} {\| \Phi_{\text{opt}}^H r^k \|_{\infty}}\]

which gives us

\[\rho(r^k) \leq \| \Psi_{\text{opt}}^H (\Phi_{\text{opt}}^{\dag})^H \|_{\infty} = \| \left ( \Phi_{\text{opt}}^{\dag} \Psi_{\text{opt}} \right )^H \|_{\infty}.\]

Finally we recall that \(\| A\|_{\infty}\) is max row sum norm while \(\| A\|_1\) is max column sum norm. Hence

\[\| A\|_{\infty} = \| A^T \|_1= \| A^H \|_1\]

which means

\[\| \left ( \Phi_{\text{opt}}^{\dag} \Psi_{\text{opt}} \right )^H \|_{\infty} = \| \Phi_{\text{opt}}^{\dag} \Psi_{\text{opt}} \|_1.\]

Thus

\[\rho(r^k) \leq \| \Phi_{\text{opt}}^{\dag} \Psi_{\text{opt}} \|_1.\]

Now the columns of \(\Phi_{\text{opt}}^{\dag} \Psi_{\text{opt}}\) are nothing but \(\Phi_{\text{opt}}^{\dag} \psi\) where \(\psi\) ranges over columns of \(\Psi_{\text{opt}}\).

Thus in terms of max column sum norm

\[\rho(r^k) \leq \underset{\psi}{\max} \| \Phi_{\text{opt}}^{\dag} \psi \|_1.\]

Thus, assuming that OMP has made \(k\) correct decisions and \(r^k\) lies in \(\ColSpace( \Phi_{\text{opt}})\), we have \(\rho(r^k) < 1\) whenever

\[\underset{\psi}{\max} \| \Phi_{\text{opt}}^{\dag} \psi \|_1 < 1.\]

Finally, the initial residual \(r^0 = x\), which lies in the column space of \(\Phi_{\text{opt}}\). By the above argument, OMP will select an optimal column in each step. Since the residual is always orthogonal to the columns already selected, it will never select the same column twice. Thus in \(K\) steps it retrieves all \(K\) atoms which comprise \(x\).
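
In numerical experiments where the true support is known, the exact recovery condition (2) can be checked directly. A sketch, assuming Phi is the synthesis matrix and Lambda is the index set of the optimal support:

Phi_opt = Phi(:, Lambda);
Psi_opt = Phi(:, setdiff(1:size(Phi, 2), Lambda));
erc = max(sum(abs(pinv(Phi_opt) * Psi_opt), 1));   % max column l1-norm
erc_holds = erc < 1;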

Babel function estimates

There is a small problem with this result. Since we don’t know the support a priori, it is not possible to verify that

\[\underset{\psi}{\max} \| \Phi_{\text{opt}}^{\dag} \psi \|_1 < 1\]

holds. Of course, verifying this for all possible \(K\)-column sub-matrices is computationally prohibitive.

It turns out that the Babel function (recall from Babel function) comes to our help. We show how the Babel function guarantees that the exact recovery condition for OMP holds.

Theorem

Suppose that \(\mu_1\) is the Babel function for a dictionary \(\DD\) with synthesis matrix \(\Phi\). The exact recovery condition holds whenever

(3)\[\mu_1 (K - 1) + \mu_1(K) < 1.\]

Thus, Orthogonal Matching Pursuit is a correct algorithm for \((\mathcal{D}, K)\)-exact-sparse problem whenever (3) holds.

In other words, for sufficiently small \(K\) for which (3) holds, OMP will recover any arbitrary superposition of \(K\) atoms from \(\DD\).

Proof

We can write

\[\underset{\psi}{\max} \| \Phi_{\text{opt}}^{\dag} \psi \|_1 = \underset{\psi}{\max} \| (\Phi_{\text{opt}}^H \Phi_{\text{opt}})^{-1} \Phi_{\text{opt}}^H \psi \|_1\]

We recall from here that operator-norm is subordinate i.e.

\[\| A x \|_1 \leq \| A \|_1 \| x \|_1.\]

Thus with \(A = (\Phi_{\text{opt}}^H \Phi_{\text{opt}})^{-1}\) we have

\[\| (\Phi_{\text{opt}}^H \Phi_{\text{opt}})^{-1} \Phi_{\text{opt}}^H \psi \|_1 \leq \| (\Phi_{\text{opt}}^H \Phi_{\text{opt}})^{-1} \|_1 \| \Phi_{\text{opt}}^H \psi \|_1.\]

With this we have

\[\underset{\psi}{\max} \| \Phi_{\text{opt}}^{\dag} \psi \|_1 \leq \| (\Phi_{\text{opt}}^H \Phi_{\text{opt}})^{-1} \|_1 \underset{\psi}{\max} \| \Phi_{\text{opt}}^H \psi \|_1.\]

Now let us look at \(\| \Phi_{\text{opt}}^H \psi \|_1\) closely. There are \(K\) columns in \(\Phi_{\text{opt}}\). For each column, we compute its inner product with \(\psi\), and then take the sum of the absolute values of these inner products.

Also recall the definition of Babel function:

\[\mu_1(K) = \underset{|\Lambda| = K}{\max} \; \underset {\psi}{\max} \sum_{\Lambda} | \langle \psi, \phi_{\lambda} \rangle |.\]

Clearly

\[\underset{\psi}{\max} \| \Phi_{\text{opt}}^H \psi \|_1 = \underset{\psi}{\max} \sum_{\Lambda_{\text{opt}}} | \langle \psi, \phi_{\lambda_i} \rangle | \leq \mu_1(K).\]

We also need to provide a bound on \(\| (\Phi_{\text{opt}}^H \Phi_{\text{opt}})^{-1} \|_1\) which requires more work.

First note that since all columns in \(\Phi\) are unit norm, the diagonal of \(\Phi_{\text{opt}}^H \Phi_{\text{opt}}\) contains unit entries. Thus we can write

\[\Phi_{\text{opt}}^H \Phi_{\text{opt}} = I_K + A\]

where \(A\) contains the off diagonal terms in \(\Phi_{\text{opt}}^H \Phi_{\text{opt}}\).

Looking carefully, each column of \(A\) lists the inner products between one atom of \(\Phi_{\text{opt}}\) and the remaining \(K-1\) atoms. By the definition of the Babel function,

\[\|A \|_1 = \max_{k} \sum_{j \neq k} | \langle \phi_{\lambda_k}, \phi_{\lambda_j} \rangle | \leq \mu_1(K -1).\]

Now, whenever \(\| A \|_1 < 1\), the Neumann series \(\sum(-A)^k\) converges to the inverse \((I_K + A)^{-1}\).

Thus we have

\[\begin{split}\begin{aligned} \| (\Phi_{\text{opt}}^H \Phi_{\text{opt}})^{-1} \|_1 &= \| ( I_K + A )^{-1} \|_1 \\ &= \| \sum_{ k = 0}^{\infty} (-A)^k \|_1\\ & \leq \sum_{ k = 0}^{\infty} \| A\|^k_1 \\ &= \frac{1}{1 - \| A \|_1}\\ & \leq \frac{1}{1 - \mu_1(K-1)}. \end{aligned}\end{split}\]

Thus putting things together we get

\[\underset{\psi}{\max} \| \Phi_{\text{opt}}^{\dag} \psi \|_1 \leq \frac{\mu_1(K)}{1 - \mu_1(K-1)}.\]

Thus whenever

\[\mu_1 (K - 1) + \mu_1(K) < 1.\]

we have

\[\frac{\mu_1(K)}{1 - \mu_1(K-1)} < 1 \implies \underset{\psi}{\max} \| \Phi_{\text{opt}}^{\dag} \psi \|_1 < 1.\]
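
For reference, the Babel function itself can be computed from the Gram matrix; sparse-plex may provide its own utility for this, but a generic sketch (assuming Phi has unit norm columns) is:

K_max = 10;                            % largest k of interest
G = abs(Phi' * Phi);                   % absolute inner products between atoms
G = G - diag(diag(G));                 % drop the unit diagonal
G_sorted = sort(G, 2, 'descend');      % sort each row in decreasing order
mu1 = max(cumsum(G_sorted(:, 1:K_max), 2), [], 1);   % mu1(k) for k = 1..K_max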

Sparse approximation conditions

We now remove the assumption that \(x\) is \(K\)-sparse in \(\Phi\). This is the typical situation for real life signals, as they are not truly sparse.

In this section we will look at conditions under which OMP can successfully solve the \((\mathcal{D}, K)\)-sparse approximation problem.

_images/algorithm_omp_x_alpha_version.png

Let \(x\) be an arbitrary signal and suppose that \(\alpha_{\text{opt}}\) is an optimum \(K\)-term approximation representation of \(x\). i.e. \(\alpha_{\text{opt}}\) is a solution to (3) and the optimal \(K\)-term approximation of \(x\) is given by

\[x_{\text{opt}} = \Phi \alpha_{\text{opt}}.\]

We note that \(\alpha_{\text{opt}}\) may not be unique.

Let \(\Lambda_{\text{opt}}\) be the support of \(\alpha_{\text{opt}}\) which identifies the atoms in \(\Phi\) that participate in the \(K\)-term approximation of \(x\).

Once again let \(\Phi_{\text{opt}}\) be the sub-matrix consisting of columns of \(\Phi\) indexed by \(\Lambda_{\text{opt}}\).

We assume that the columns in \(\Phi_{\text{opt}}\) are linearly independent. This is easily established: if any atom in this set were linearly dependent on the other atoms, we could always find a better solution using fewer atoms.

Again let \(\Psi_{\text{opt}}\) be the matrix of \((D - K)\) columns which are not indexed by \(\Lambda_{\text{opt}}\).

We note that if \(\Lambda_{\text{opt}}\) is identified then finding \(\alpha_{\text{opt}}\) is a straightforward least squares problem.

We now present a condition under which Orthogonal Matching Pursuit is able to recover the optimal atoms.

Theorem

Assume that \(\mu_1(K) < \frac{1}{2}\), and suppose that at the \(k\)-th iteration, the support \(S^k\) for \(\alpha^k\) consists only of atoms from an optimal \(k\)-term approximation of the signal \(x\). At step \((k+1)\), Orthogonal Matching Pursuit will recover another atom indexed by \(\Lambda_{\text{opt}}\) whenever

(1)\[\| x - \Phi \alpha^k \|_2 > \sqrt{1 + \frac{K ( 1 - \mu_1(K))}{(1 - 2 \mu_1(K))^2} } \; \| x - \Phi \alpha_{\text{opt}}\|_2.\]

A few remarks are in order.

  • \(\| x - \Phi \alpha^k \|_2\) is the approximation error norm at \(k\)-th iteration.
  • \(\| x - \Phi \alpha_{\text{opt}}\|_2\) is the optimum approximation error after \(K\) iterations.
  • The theorem says that OMP makes absolute progress whenever the current error is larger than optimum error by a factor.
  • As a result of this theorem, we note that every optimal \(K\)-term approximation of \(x\) contains the same kernel of atoms. The optimum error is always independent of choice of atoms in \(K\) term approximation (since it is optimum). Initial error is also independent of choice of atoms (since initial support is empty). OMP always selects the same set of atoms by design.
Proof

Let us assume that after \(k\) steps, OMP has recovered an approximation \(x^k\) given by

\[x^k = \Phi \alpha^k\]

where \(S^k = \supp(\alpha^k)\) chooses \(k\) columns from \(\Phi\) all of which belong to \(\Phi_{\text{opt}}\).

Let the residual at \(k\)-th stage be

\[r^k = x - x^k = x - \Phi \alpha^k.\]

Recalling from previous section, a sufficient condition for recovering another optimal atom is

\[\rho(r^k) = \frac{\| \Psi_{\text{opt}}^H r^k \|_{\infty}}{\| \Phi_{\text{opt}}^H r^k \|_{\infty}} < 1.\]

One difference from previous section is that \(r^k \notin \ColSpace(\Phi_{\text{opt}})\).

We can write

\[r^k = x - x^k = (x - x_{\text{opt}}) + (x_{\text{opt}} - x^k).\]

Note that \((x - x_{\text{opt}})\) is nothing but the residual left after \(K\) iterations.

We also note that since residual in OMP is always orthogonal to already selected columns, hence

\[\Phi_{\text{opt}}^H (x - x_{\text{opt}}) = 0.\]

We will now use these expressions to simplify \(\rho(r^k)\).

\[\begin{split}\begin{aligned} \rho(r^k) &= \frac{\| \Psi_{\text{opt}}^H r^k \|_{\infty}} {\| \Phi_{\text{opt}}^H r^k \|_{\infty}}\\ &= \frac{\| \Psi_{\text{opt}}^H (x - x_{\text{opt}}) + \Psi_{\text{opt}}^H (x_{\text{opt}} - x^k)\|_{\infty}} {\| \Phi_{\text{opt}}^H (x - x_{\text{opt}}) + \Phi_{\text{opt}}^H (x_{\text{opt}} - x^k) \|_{\infty}}\\ & = \frac{\| \Psi_{\text{opt}}^H (x - x_{\text{opt}}) + \Psi_{\text{opt}}^H (x_{\text{opt}} - x^k)\|_{\infty}} {\| \Phi_{\text{opt}}^H (x_{\text{opt}} - x^k) \|_{\infty}}\\ &\leq \frac{\| \Psi_{\text{opt}}^H (x - x_{\text{opt}})\|_{\infty}} {\| \Phi_{\text{opt}}^H (x_{\text{opt}} - x^k) \|_{\infty}} + \frac{\| \Psi_{\text{opt}}^H (x_{\text{opt}} - x^k)\|_{\infty}} {\| \Phi_{\text{opt}}^H (x_{\text{opt}} - x^k) \|_{\infty}} \end{aligned}\end{split}\]

We now define two new terms

\[\rho_{\text{err}}(r^k) \triangleq \frac{\| \Psi_{\text{opt}}^H (x - x_{\text{opt}})\|_{\infty}} {\| \Phi_{\text{opt}}^H (x_{\text{opt}} - x^k) \|_{\infty}}\]

and

\[\rho_{\text{opt}}(r^k) \triangleq \frac{\| \Psi_{\text{opt}}^H (x_{\text{opt}} - x^k)\|_{\infty}} {\| \Phi_{\text{opt}}^H (x_{\text{opt}} - x^k) \|_{\infty}}.\]

With these we have

(2)\[\rho(r^k) \leq \rho_{\text{opt}}(r^k) + \rho_{\text{err}}(r^k)\]

Now \(x_{\text{opt}}\) has an exact \(K\)-term representation in \(\Phi\) given by \(\alpha_{\text{opt}}\). Hence \(\rho_{\text{opt}}(r^k)\) is nothing but \(\rho(r^k)\) for corresponding exact-sparse problem.

From the proof of here we recall

\[\rho_{\text{opt}}(r^k) \leq \frac{\mu_1(K)}{1 - \mu_1(K-1)} \leq \frac{\mu_1(K)}{1 - \mu_1(K)}\]

since

\[\mu_1(K-1) \leq \mu_1(K) \implies 1 - \mu_1(K-1) \geq 1 - \mu_1(K).\]

The remaining problem is \(\rho_{\text{err}}(r^k)\). Let us look at its numerator and denominator one by one.

\(\| \Psi_{\text{opt}}^H (x - x_{\text{opt}})\|_{\infty}\) essentially is the maximum (absolute) inner product between any column in \(\Psi_{\text{opt}}\) with \(x - x_{\text{opt}}\).

We can write

\[\| \Psi_{\text{opt}}^H (x - x_{\text{opt}})\|_{\infty} \leq \underset{\psi}{\max} | \psi^H (x - x_{\text{opt}}) | \leq \underset{\psi}{\max} \|\psi \|_2 \| x - x_{\text{opt}}\|_2 = \| x - x_{\text{opt}}\|_2\]

since all columns in \(\Phi\) are unit norm. In between we used Cauchy-Schwartz inequality.

Now look at denominator \(\| \Phi_{\text{opt}}^H (x_{\text{opt}} - x^k) \|_{\infty}\) where \((x_{\text{opt}} - x^k) \in \CC^N\) and \(\Phi_{\text{opt}} \in \CC^{N \times K}.\) Thus

\[\Phi_{\text{opt}}^H (x_{\text{opt}} - x^k) \in \CC^{K}.\]

Now for every \(v \in \CC^K\) we have

\[\| v \|_2 \leq \sqrt{K} \| v\|_{\infty}.\]

Hence

\[\| \Phi_{\text{opt}}^H (x_{\text{opt}} - x^k) \|_{\infty} \geq K^{-1/2} \| \Phi_{\text{opt}}^H (x_{\text{opt}} - x^k) \|_2.\]

Since \(\Phi_{\text{opt}}\) has full column rank, hence its singular values are non-zero. Thus

\[\| \Phi_{\text{opt}}^H (x_{\text{opt}} - x^k) \|_2 \geq \sigma_{\text{min}}(\Phi_{\text{opt}}) \| x_{\text{opt}} - x^k \|_2.\]

From here we have

\[\sigma_{\text{min}}(\Phi_{\text{opt}}) \geq \sqrt{1 - \mu_1(K-1)} \geq \sqrt{1 - \mu_1(K)}.\]

Combining these observations we get

\[\rho_{\text{err}}(r^k) \leq \frac{\sqrt{K} \| x - x_{\text{opt}}\|_2} {\sqrt{1 - \mu_1(K)} \| x_{\text{opt}} - x^k \|_2}.\]

Now from (2) \(\rho(r^k) <1\) whenever \(\rho_{\text{opt}}(r^k) + \rho_{\text{err}}(r^k) < 1\).

Thus a sufficient condition for \(\rho(r^k) <1\) can be written as

\[\frac{\mu_1(K)}{1 - \mu_1(K)} + \frac{\sqrt{K} \| x - x_{\text{opt}}\|_2} {\sqrt{1 - \mu_1(K)} \| x_{\text{opt}} - x^k \|_2} < 1.\]

We need to simplify this expression a bit. Multiplying by \((1 - \mu_1(K))\) on both sides we get

\[\begin{split}\begin{aligned} &\mu_1(K) + \frac{\sqrt{K} \sqrt{1 - \mu_1(K)} \| x - x_{\text{opt}}\|_2} { \| x_{\text{opt}} - x^k \|_2} < 1 - \mu_1(K)\\ \implies & \frac{\sqrt{K(1 - \mu_1(K))} \| x - x_{\text{opt}}\|_2} { \| x_{\text{opt}} - x^k \|_2} < 1 - 2 \mu_1(K)\\ \implies & \| x_{\text{opt}} - x^k \|_2 > \frac{\sqrt{K(1 - \mu_1(K))}} {1 - 2 \mu_1(K)}\| x - x_{\text{opt}}\|_2. \end{aligned}\end{split}\]

We assumed \(\mu_1(K) < \frac{1}{2}\) thus \(1 - 2 \mu_1(K) > 0\) which validates the steps above.

Finally we remember that \((x - x_{\text{opt}}) \perp \ColSpace(\Phi_{\text{opt}})\) and \((x_{\text{opt}} - x^k) \in \ColSpace(\Phi_{\text{opt}})\) thus \((x - x_{\text{opt}})\) and \((x_{\text{opt}} - x^k)\) are orthogonal to each other. Thus by applying Pythagorean theorem we have

\[\| x - x^k\|_2^2 = \| x - x_{\text{opt}} \|_2^2 + \| x_{\text{opt}} - x^k \|_2^2.\]

Thus we have

\[\| x - x^k\|_2^2 > \frac{K(1 - \mu_1(K))} {(1 - 2 \mu_1(K))^2}\| x - x_{\text{opt}}\|_2^2 + \| x - x_{\text{opt}}\|_2^2.\]

This gives us a sufficient condition

(3)\[\| x - x^k\|_2 > \sqrt{1 + \frac{K(1 - \mu_1(K))} {(1 - 2 \mu_1(K))^2}}\| x - x_{\text{opt}}\|_2.\]

i.e. whenever (3) holds true, we have \(\rho(r^k) < 1\) which leads to OMP making a correct choice and choosing an atom from the optimal set.

Putting \(x^k = \Phi \alpha^k\) and \(x_{\text{opt}} = \Phi \alpha_{\text{opt}}\) we get back (1) which is the desired result.

This result establishes that as long as (1) holds for each of the steps from 1 to \(K\), OMP will recover a \(K\) term optimum approximation \(x_{\text{opt}}\). If \(x \in \CC^N\) is completely arbitrary, then it may not be possible that (1) holds for all the \(K\) iterations. In this situation, a question arises as to what is the worst \(K\)-term approximation error that OMP will incur if (1) doesn’t hold true all the way.

This is answered in following corollary of previous theorem.

Corollary

Assume that \(\mu_1(K) < \frac{1}{2}\) and let \(x \in \CC^N\) be a completely arbitrary signal. Orthogonal Matching Pursuit produces a \(K\)-term approximation \(x^K\) which satisfies

(4)\[\| x - x^K \|_2 \leq \sqrt{1 + C(\DD, K)} \| x - x_{\text{opt}} \|_2\]

where \(x_{\text{opt}}\) is the optimum \(K\)-term approximation of \(x\) in dictionary \(\DD\) (i.e. \(x_{\text{opt}} = \Phi \alpha_{\text{opt}}\) where \(\alpha_{\text{opt}}\) is an optimal solution of (3) ). \(C(\DD, K)\) is a constant depending upon the dictionary \(\DD\) and the desired sparsity level \(K\). An estimate of \(C(\DD, K)\) is given by

\[C(\DD, K) \leq \frac{K ( 1 - \mu_1(K))}{(1 - 2 \mu_1(K))^2}.\]
Proof

Suppose that OMP runs fine for the first \(p\) steps, where \(p < K\). Thus (1) keeps holding for the first \(p\) steps. We now assume that (1) breaks down at step \(p+1\), so that OMP is no longer guaranteed to make an optimal choice of column from \(\Phi_{\text{opt}}\). Thus at step \(p+1\) we have

\[\| x - x^p \|_2 \leq \sqrt{1 + \frac{K(1 - \mu_1(K))} {(1 - 2 \mu_1(K))^2}} \| x - x_{\text{opt}} \|_2.\]

Any further iterations of OMP will only reduce the error further (although not in an optimal way). This gives us

\[\| x - x^K \|_2 \leq \| x - x^p \|_2 \leq \sqrt{1 + \frac{K(1 - \mu_1(K))} {(1 - 2 \mu_1(K))^2}} \| x - x_{\text{opt}} \|_2.\]

Choosing

\[C(\DD, K) = \frac{K ( 1 - \mu_1(K))}{(1 - 2 \mu_1(K))^2}\]

we can rewrite this as

\[\| x - x^K \|_2 \leq \sqrt{1 + C(\DD, K)} \| x - x_{\text{opt}} \|_2.\]

This is a very useful result. It establishes that even if OMP is not able to recover the optimum \(K\)-term representation of \(x\), it always constructs an approximation whose error lies within a constant factor of optimum approximation error where the constant factor is given by \(\sqrt{1 + C(\DD, K)}\).

If the optimum approximation error \(\| x - x_{\text{opt}} \|_2\) is small, then \(\| x - x^K \|_2\) will also not be too large.

If \(\| x - x_{\text{opt}} \|_2\) is moderate, then OMP may inflate the approximation error to a higher value. But in this case, sparse approximation is probably not the right tool for representing the signal over this dictionary.
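As a quick numerical illustration of the bound, the following MATLAB snippet evaluates the worst-case error inflation factor \(\sqrt{1 + C(\DD, K)}\) for an assumed (illustrative) value of the Babel function \(\mu_1(K)\); the numbers are hypothetical and not tied to any particular dictionary:

mu1 = 0.1;                               % assumed value of mu_1(K) (illustrative)
K = 4;                                   % desired sparsity level
C = K * (1 - mu1) / (1 - 2 * mu1)^2;     % estimate of C(D, K)
factor = sqrt(1 + C);                    % worst-case error inflation factor
fprintf('C(D, K) = %.4f, factor = %.4f\n', C, factor);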

Fast Implementation of OMP

As part of sparse-plex, we provide a fast CPU based implementation of OMP. It is up to 4 times faster than the OMP implementation in OMPBOX.

This is written in C and uses the BLAS and LAPACK features available in MATLAB. The implementation is available in the function spx.fast.omp. The corresponding C code is in omp.c.

For a \(100 \times 1000\) sensing matrix, the implementation can recover the sparse representation of a signal in a few hundred microseconds (depending upon the number of non-zero coefficients in the sparse representation, i.e. the sparsity level) on an Intel i7 2.4 GHz laptop with 16 GB RAM.

Read Building MATLAB Extensions for how to build the mex files for fast OMP implementation.

A Simple Example

Let’s create a Gaussian sensing matrix:

M = 100;
N = 1000;
A = spx.dict.simple.gaussian_mtx(M, N);

See Hands on with Gaussian sensing matrices for details.

Let’s create 1000 sparse signals with sparsity 7:

S = 1000;
K = 7;
gen = spx.data.synthetic.SparseSignalGenerator(N, K, S);
X =  gen.biGaussian();

See Generation of synthetic sparse representations for details.

Let’s compute their measurements using the Gaussian matrix:

Y = A*X;

Let’s recover the representations from the measurements:

start_time = tic;
result = spx.fast.omp(A, Y, K, 1e-12);
elapsed_time = toc(start_time);
fprintf('Time taken: %.2f seconds\n', elapsed_time);
fprintf('Per signal time: %.2f usec', elapsed_time * 1e6/ S);

Time taken: 0.17 seconds
Per signal time: 169.56 usec

See The OMP Algorithm for a review of OMP algorithm.

We are taking just 169 microseconds per signal.

Let’s verify that all the signals have been recovered correctly:

cmpare = spx.commons.SparseSignalsComparison(X, result, K);
cmpare.summarize();

Signal dimension: 1000
Number of signals: 1000
Combined reference norm: 159.67639347
Combined estimate norm: 159.67639347
Combined difference norm: 0.00000000
Combined SNR: 307.9221 dB

All signals have indeed been recovered correctly. See Comparing sparse or approximately sparse signals for details about SparseSignalsComparison.

Example code can be downloaded here.

Benchmarks
System configuration
  • OS: Windows 7 Professional 64 Bit
  • Processor: Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
  • Memory (RAM): 16.0 GB
  • Hard Disk: SATA 120GB
  • MATLAB: R2017b

The method for benchmarking has been adopted from the file ompspeedtest.m in the OMPBOX package by Ron Rubinstein.

We compare the following algorithms:

  • The Cholesky decomposition based OMP implementation in OMPBOX.
  • Our C version in sparse-plex.

The workload consists of a Gaussian dictionary of size \(512 \times 1000\). Sufficient signals are chosen so that the benchmark runs for a reasonable duration. An 8-sparse representation is constructed for each randomly generated signal in the given dictionary.

Speed summary for 6917 signals, dictionary size 512 x 1000:
Call syntax                      Algorithm               Total time
--------------------------------------------------------------------
OMP(D,X,[],T)                    OMP-Cholesky            16.65 seconds
SPX-OMP(D, X, T)                 SPX-OMP-Cholesky         4.29 seconds

Our implementation is close to 4 times faster.

The benchmark generation code is in ex_fast_omp_speed_test.m.

Batch OMP

In this section, we develop an efficient version of OMP known as Batch OMP [RZE08].

In OMP, given a signal \(\bar{y}\) and a dictionary \(\Phi\), our goal is to iteratively construct a sparse representation \(x\) such that \(\bar{y} \approx \Phi x\) satisfying either a target sparsity \(K\) of \(x\) or a target error \(\| \bar{y} - \Phi x\|_2 \leq \epsilon\). The algorithm picks an atom from \(\Phi\) in each iteration and computes a least squares estimate \(y\) of \(\bar{y}\) on the selected atoms. The residual \(r = \bar{y} - y\) is used to select the next atom by choosing the atom which matches best with the residual. Let \(I\) be the set of atoms selected in OMP after some iterations.

Recalling the OMP steps in the next iteration:

  1. Matching atoms with residuals: \(h = \Phi^T r\)
  2. Finding the new atom (best match with the residual): \(i = \underset{j}{\text{arg max}} |h_j|\)
  3. Support update: \(I = I \cup \{ i \}\)
  4. Least squares: \(x_I = \Phi_I^{\dag} \bar{y}\)
  5. Approximation update: \(y = \Phi_I x_I = \Phi_I \Phi_I^{\dag} \bar{y}\)
  6. Residual update: \(r = \bar{y} - y = (I - \Phi_I \Phi_I^{\dag}) \bar{y}\)
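For reference, a minimal (unoptimized) MATLAB sketch of these six steps could look as follows; here Phi is the dictionary, ybar is the signal \(\bar{y}\) and K is the target sparsity. The variable names are illustrative and this is not the library's optimized implementation:

I = [];                          % selected support
r = ybar;                        % initial residual
for iter = 1:K
    h = Phi' * r;                % 1. match atoms with the residual
    [~, i] = max(abs(h));        % 2. find the best matching atom
    I = [I, i];                  % 3. update the support
    xI = Phi(:, I) \ ybar;       % 4. least squares over the selected atoms
    y = Phi(:, I) * xI;          % 5. update the approximation
    r = ybar - y;                % 6. update the residual
end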

Batch OMP is useful when we are trying to reconstruct representations of multiple signals at the same time.

Least Squares in OMP using Cholesky Update

Here we review how the least squares step can be implemented efficiently using Cholesky updates.

In the following, we will denote

  • the matrix \(\Phi^T \Phi\) by the symbol \(G\)
  • the matrix \(\Phi^T \Phi_I\) by the symbol \(G_I\)
  • the matrix \((\Phi_I^T \Phi_I)\) by the symbol \(G_{I, I}\).

Note that \(G_I\) is formed by taking the columns indexed by \(I\) from \(G\). The matrix \(G_{I, I}\) is formed by taking the rows and columns both indexed by \(I\) from \(G\).

We have

\[x_I = (\Phi_I^T \Phi_I)^{-1} \Phi_I^T \bar{y}\]

and

\[y = \Phi_I x_I = \Phi_I \Phi_I^{\dag} \bar{y} = \Phi_I (\Phi_I^T \Phi_I)^{-1} \Phi_I^T \bar{y}.\]

We can rewrite:

\[(\Phi_I^T \Phi_I) x_I = \Phi_I^T \bar{y}.\]

If we perform a Cholesky decomposition of the Gram matrix \(G_{I, I} = \Phi_I^T \Phi_I\) as \(G_{I, I} = L L^T\), then we have:

\[L L^T x_I = \Phi_I^T \bar{y}.\]

Solving this equation involves

  • Computing \(b = \Phi_I^T \bar{y}\)
  • Solving the triangular system \(L u = b\)
  • Solving the triangular system \(L^T x_I = u\)
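Assuming L holds the current Cholesky factor, PhiI = Phi(:, I) and ybar holds the signal \(\bar{y}\), these three steps can be written directly in MATLAB (a sketch of the linear algebra only, not the library code):

b = PhiI' * ybar;    % b = Phi_I^T ybar
u = L \ b;           % forward substitution: solve L u = b
xI = L' \ u;         % back substitution: solve L^T x_I = u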

We also need an efficient way of computing \(L\). It so happens that the Cholesky decomposition \(G_{I, I} = L L^T\) can be updated incrementally in each iteration of OMP. Let

  • \(I^k\) denote the index set of chosen atoms after k iterations.
  • \(\Phi_{I^k}\) denote the corresponding subdictionary of chosen atoms.
  • \(G_{I^k, I^k}\) denote the Gram matrix \(\Phi_{I^k}^T \Phi_{I^k}\).
  • \(L^k\) denote the Cholesky factor of \(G_{I^k, I^k}\).
  • \(i^k\) be the index of the atom chosen in the \(k\)-th iteration.

The Cholesky update process aims to compute \(L^k\) given \(L^{k-1}\) and \(i^k\). Note that we can write

\[\begin{split}G_{I^k, I^k} = \Phi_{I^k}^T \Phi_{I^k} = \begin{bmatrix} \Phi_{I^{k-1}}^T \Phi_{I^{k-1}} & \Phi_{I^{k-1}}^T \phi_{i^k}\\ \phi_{i^k}^T \Phi_{I^{k-1}} & \phi_{i^k}^T \phi_{i^k} \end{bmatrix}.\end{split}\]

Define \(v = \Phi_{I^{k-1}}^T \phi_{i^k}\). Note that \(\phi_{i^k}^T \phi_{i^k} = 1\) for dictionaries with unit norm columns. This gives us:

\[\begin{split}\begin{split} G_{I^k, I^k} = \begin{bmatrix} G_{I^{k-1}, I^{k-1}} & v \\ v^T & 1 \end{bmatrix}.\end{split}\end{split}\]

This can be solved to give us an equation for update of Cholesky decomposition:

\[\begin{split}\begin{split}L^k = \begin{bmatrix} L^{k - 1} & 0 \\ w^T & \sqrt{1 - w^T w} \end{bmatrix}\end{split}\end{split}\]

where \(w\) is the solution of the triangular system \(L^{k - 1} w = v\).
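A sketch of this update in MATLAB, where L_prev holds \(L^{k-1}\), v holds \(\Phi_{I^{k-1}}^T \phi_{i^k}\), and unit norm dictionary columns are assumed:

w = L_prev \ v;                             % solve L^{k-1} w = v
L = [L_prev, zeros(size(L_prev, 1), 1); ... % append a zero column
     w',     sqrt(1 - w' * w)];             % new last row of L^k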

Removing residuals from the computation

An interesting observation on OMP is that the real goal of OMP is to identify the index set of atoms participating in the sparse representation of \(\bar{y}\). The computation of residuals is just a way of achieving the same. If the index set has been identified, then the sparse representation is given by \(x_I = \Phi_I^{\dag} \bar{y}\) with all other entries in \(x\) set to zero and the sparse approximation of \(y\) is given by \(\Phi_I x_I\).

The selection of atoms doesn’t really need the residual explicitly. All it needs is a way to update the inner products of atoms in \(\Phi\) with the current residual. In this section, we will rewrite the OMP steps in a way that doesn’t require explicit computation of residual.

We begin with pre-computation of \(\bar{h} = \Phi^T \bar{y}\). This is the initial value of \(h\) (the inner products of atoms in dictionary with the current residual). This computation is anyway needed for OMP. Now, let’s expand the calculation of \(h\):

\[\begin{split}\begin{aligned} h &= \Phi^T r \\ &= \Phi^T (\bar{y} - y) \\ &= \Phi^T (I - \Phi_I \Phi_I^{\dag}) \bar{y}\\ &= \Phi^T \bar{y} - \Phi^T \Phi_I \Phi_I^{\dag} \bar{y}\\ &= \bar{h} - G_I G_{I, I}^{-1} \Phi_I^T \bar{y}\\ &= \bar{h} - G_I x_I. \end{aligned}\end{split}\]

But \(\Phi_I^T \bar{y}\) is nothing but \(\bar{h}_I\). Thus,

\[h = \bar{h} - G_I G_{I, I}^{-1} \bar{h}_I.\]

This means that if \(\bar{h} = \Phi^T \bar{y}\) and \(G = \Phi^T \Phi\) have been precomputed, then \(h\) can be computed for each iteration without explicitly computing the residual.
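Assuming \(\bar{h}\) and \(G\) are available as the variables h_bar and G, and I holds the current support, this update can be sketched as:

xI = G(I, I) \ h_bar(I);     % least squares coefficients on the current support
h  = h_bar - G(:, I) * xI;   % updated correlations, no residual needed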

If we are reconstructing just one signal, then the computation of \(G\) is very expensive. But, if we are reconstructing thousands of signals together in batch, computation of \(G\) is actually a minuscule factor in overall computation. This is the essential trick in Batch OMP algorithm.

There is one more issue to address. A typical halting criterion in OMP is the error-based stopping criterion, which compares the norm of the residual with a threshold. If the residual norm goes below the threshold, we stop OMP. If the residual is not computed explicitly, then it becomes challenging to apply this criterion. However, there is a way out. In the following, let

  • \(x_{I^k} = \Phi_{I^k}^{\dag} \bar{y}\) be the non-zero entries in the k-th sparse representation
  • \(x^k\) denote the k-th sparse representation
  • \(y^k\) be the k-th sparse approximation \(y^k = \Phi x^k = \Phi_{I^k} x_{I^k}\)
  • \(r^k\) be the residual \(\bar{y} - y^k\).

We start by writing a residual update equation. We have:

\[\begin{split}\begin{aligned} r^k &= \bar{y} - y^k = \bar{y} - \Phi x^k \\ r^{k-1} &= \bar{y} - y^{k-1} = \bar{y} - \Phi x^{k -1}. \end{aligned}\end{split}\]

Combining the two, we get:

\[r^k = r^{k -1} + \Phi (x^{k -1 } - x^k) = r^{k -1} + y^{k -1} - y^k.\]

Due to the orthogonality of the residual, we have \(\langle r^k, y^k \rangle = 0\). Using this property and a long derivation (in eq 2.8 of [RZE08]), we obtain the relationship:

\[\| r^k \|_2^2 = \| r^{k -1} \|_2^2 - (x^k)^T G x^k + (x^{k-1})^T G x^{k-1}.\]

We introduce the symbols \(\epsilon^k = \| r^k \|_2^2\) and \(\delta^k = (x^k)^T G x^k\). The previous equation reduces to:

\[\epsilon^k = \epsilon^{k-1} - \delta^{k} + \delta^{k-1}.\]

Thus, we just need to keep track of the quantity \(\delta^k\). Note that \(\delta^0 = 0\) since the initial estimate \(x^0 = 0\) for OMP.

Recall that

\[\begin{split}\begin{aligned} G x &= G_I x_I \\ &= G_I \Phi_I^{\dag} \bar{y}\\ & = G_I (\Phi_I^T \Phi_I)^{-1} \Phi_I^T \bar{y}\\ &= G_I G_{I, I}^{-1} \Phi_I^T \bar{y}\\ &= G_I G_{I, I}^{-1} \bar{h}_I \end{aligned}\end{split}\]

which has already been computed for updating \(h\) and can be reused. So

\[\delta^k = (x^k)^T G x^k = (x^k)^T \left( G_{I^k} G_{{I^k}, {I^k}}^{-1} \bar{h}_{I^k} \right)\]

which is a simple inner product.

The Batch OMP Algorithm

The batch OMP algorithm is described in the figure below.

The inputs are

  • The Gram matrix \(G = \Phi^T \Phi\).
  • The initial correlation vector \(\bar{h} = \Phi^T \bar{y}\).
  • The squared norm \(\epsilon^0\) of the signal \(\bar{y}\) whose sparse representation we are constructing.
  • The upper bound on the desired sparsity level \(K\)
  • Residual norm (squared) threshold \(\epsilon\).

It returns the sparse representation \(x\).

Note that the algorithm doesn’t need direct access to either the dictionary \(\Phi\) or the signal \(\bar{y}\).

_images/algorithm_batch_omp.png

Note

The sparse vector \(x\) is usually returned as a pair of vectors \(I\) and \(x_I\). This is more efficient in terms of space utilization.
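The following is a minimal MATLAB sketch of the iteration described above. It follows the equations in this section rather than the optimized C code behind spx.fast.batch_omp, assumes unit norm dictionary columns, and uses illustrative names:

function [I, xI] = batch_omp_sketch(G, h_bar, eps0, K, eps_stop)
% G: Gram matrix Phi'*Phi, h_bar: Phi'*ybar, eps0: ||ybar||_2^2,
% K: maximum sparsity, eps_stop: threshold on the squared residual norm.
I = []; L = []; xI = [];
h = h_bar; eps_k = eps0; delta_prev = 0;
for k = 1:K
    [~, i] = max(abs(h));                   % select the best matching atom
    if k == 1
        L = 1;                              % Cholesky factor of G(I, I)
    else
        w = L \ G(I, i);                    % Cholesky update for the new atom
        L = [L, zeros(k - 1, 1); w', sqrt(1 - w' * w)];
    end
    I = [I, i];
    xI = L' \ (L \ h_bar(I));               % solve G(I, I) x_I = h_bar(I)
    beta = G(:, I) * xI;                    % beta = G x^k, reused below
    h = h_bar - beta;                       % update the correlations
    delta = xI' * beta(I);                  % delta^k = (x^k)^T G x^k
    eps_k = eps_k - delta + delta_prev;     % update the squared residual norm
    delta_prev = delta;
    if eps_k < eps_stop
        break;                              % error based stopping criterion
    end
end
end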

Fast Batch OMP Implementation

As part of sparse-plex, we provide a fast CPU based implementation of Batch OMP. It is up to 3 times faster than the Batch OMP implementation in OMPBOX.

This is written in C and uses the BLAS and LAPACK features available in MATLAB. The implementation is available in the function spx.fast.batch_omp. The corresponding C code is in batch_omp.c.

A Simple Example

Let’s create a Gaussian matrix (with normalized columns):

M = 400;
N = 1000;
Phi = spx.dict.simple.gaussian_mtx(M, N);

See Hands on with Gaussian sensing matrices for details.

Let’s create a few thousand sparse signals:

K = 16;
S = 5000;
X = spx.data.synthetic.SparseSignalGenerator(N, K, S).biGaussian();

See Generation of synthetic sparse representations for details.

Let’s compute their measurements using the Gaussian matrix:

Y = Phi*X;

We wish to recover \(X\) from \(Y\) and \(\Phi\).

Let’s precompute the Gram matrix:

G = Phi' * Phi;

Let’s precompute the correlation vectors for each signal:

DtY = Phi' * Y;

Let’s perform sparse recovery using Batch OMP and time it:

start_time = tic;
result = spx.fast.batch_omp(Phi, [], G, DtY, K, 1e-12);
elapsed_time = toc(start_time);
fprintf('Time taken: %.2f seconds\n', elapsed_time);
fprintf('Per signal time: %.2f usec', elapsed_time * 1e6/ S);

Time taken: 0.52 seconds
Per signal time: 103.18 usec

We note that the reconstruction has happened very quickly, taking only about 100 microseconds per signal.

We can verify the correctness of the result:

cmpare = spx.commons.SparseSignalsComparison(X, result, K);
cmpare.summarize();

Signal dimension: 1000
Number of signals: 5000
Combined reference norm: 536.04604784
Combined estimate norm: 536.04604784
Combined difference norm: 0.00000000
Combined SNR: 302.5784 dB

All signals have indeed been recovered correctly.
See Comparing sparse or approximately sparse signals for details about SparseSignalsComparison.

For comparison, let’s see the time taken by Fast OMP implementation:

fprintf('Reconstruction with Fast OMP\n');
start_time = tic;
result = spx.fast.omp(Phi, Y, K, 1e-12);
elapsed_time = toc(start_time);
fprintf('Time taken: %.2f seconds\n', elapsed_time);
fprintf('Per signal time: %.2f usec', elapsed_time * 1e6/ S);

Reconstruction with Fast OMP
Time taken: 4.39 seconds
Per signal time: 878.88 usec

See Fast Implementation of OMP for details about our fast OMP implementation.

Fast Batch OMP implementation is more than 8 times faster than fast OMP implementation for this problem configuration (M, N, K, S).

Benchmarks

System configuration
  • OS: Windows 7 Professional 64 Bit
  • Processor: Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
  • Memory (RAM): 16.0 GB
  • Hard Disk: SATA 120GB
  • MATLAB: R2017b

The method for benchmarking has been adopted from the file ompspeedtest.m in the OMPBOX package by Ron Rubinstein.

We compare the following algorithms:

  • The Batch OMP implementation in OMPBOX, with and without a precomputed \(D^T X\).
  • Our C version of Batch OMP in sparse-plex, with and without a precomputed \(D^T X\).

The workload consists of a Gaussian dictionary of size \(512 \times 1000\). Sufficient signals are chosen so that the benchmark runs for a reasonable duration. An 8-sparse representation is constructed for each randomly generated signal in the given dictionary.

Speed summary for 178527 signals, dictionary size 512 x 1000:
Call syntax                      Algorithm                Total time
---------------------------------------------------------------------
OMP(D,X,G,T)                     Batch-OMP                60.83 seconds
OMP(DtX,G,T)                     Batch-OMP with DTX       12.73 seconds
SPX-Batch-OMP(D, X, G, [], T)    SPX-Batch-OMP            19.78 seconds
SPX-Batch-OMP([], [], G, Dtx, T) SPX-Batch-OMP DTX         7.25 seconds

Gain SPX/OMPBOX without DTX: 3.08
Gain SPX/OMPBOX with DTX:    1.76

Our implementation is up to 3 times faster on this large workload.

The benchmark generation code is in ex_fast_batch_omp_speed_test.m.

Orthogonal least squares

Compressive sampling matching pursuit

Iterative hard thresholding

Hard thresholding pursuit

Framework for study of performance of pursuit algorithms

Experimental study of pursuit algorithms for sparse recovery needs the following components:

  • Generation of synthetic sparse representations
  • Generation of synthetic compressible representations
  • Addition of measurement error
  • Measurement of recovery error
  • Phase transition diagrams

The sparse-plex library provides a wide variety of functions to help with the study of pursuit algorithms.

Generation of synthetic sparse representations

A sparse representation is constructed in an appropriate representation space.

The class spx.data.synthetic.SparseSignalGenerator provides various methods for generating synthetic sparse representations from different distributions.

  • Uniform
  • Bi-uniform
  • Gaussian
  • Complex Gaussian
  • Rademacher
  • Bi-Gaussian
  • Real spherical rows
  • Complex spherical rows

It takes the following parameters:

\(N\)

The dimension of the representation space

\(K\)

The sparsity level of representations

\(S\) (optional)

Number of sparse representations to generate with a common support. Default value is 1.

The generator first uniformly selects a random support of \(K\) indices from the index set \([1, N]\).

After that it provides various ways to generate the non-zero values.

Uniform
Example: Sparse representations with uniformly distributed non-zero values

We create the sparse signal generator instance:

N = 32;
K = 4;
gen = spx.data.synthetic.SparseSignalGenerator(N, K);

We generate a sparse vector:

rep =  gen.uniform();

Let’s plot it:

stem(rep, '.');
_images/demo_sparse_uniform_1.png

Note that all the non-zero entries are positive and they are distributed uniformly over \([0, 1]\).

We can easily identify the support for the representation:

>> spx.commons.sparse.support(rep)'

ans =

     4    27    29    32

The \(\ell_0\)-“norm” can be calculated easily too:

>> spx.commons.sparse.l0norm(rep)

ans =

     4

Let’s cross-check with the support used by the generator:

>> gen.Omega

ans =

    27    29     4    32

By default, the non-zero values are chosen from the range \([0, 1]\).

We can specify a custom range \([a, b]\) by calling:

rep =  gen.uniform(a, b);
Bi-uniform

The problem with the previous example is that all the non-zero entries are positive. We would like the sign of the non-zero entries to change with equal probability as well. This can be achieved using the bi-uniform generator.

  • The non-zero values are generated using uniform distribution.
  • A sign for each non-zero entry is chosen with equal probability.
  • The non-zero value is multiplied by the sign.
Example: Sparse representations with bi-uniformly distributed non-zero values

The setup steps are the same:

N = 32;
K = 4;
gen = spx.data.synthetic.SparseSignalGenerator(N, K);

The representation generation step changes:

rep =  gen.biUniform();

Plotting:

stem(rep, '.');
_images/demo_sparse_biuniform_1.png
Example: Changing the range of values

We will generate the magnitudes between 2 and 4:

rep =  gen.biUniform(2, 4);

Plotting:

stem(rep, '.');
_images/demo_sparse_biuniform_2.png
Gaussian
Example: Sparse representations with Gaussian distributed non-zero values

Let’s increase the dimensions of our representation space and sparsity level:

N = 128;
K = 8;
gen = spx.data.synthetic.SparseSignalGenerator(N, K);

Let’s generate non-zero entries using Gaussian distribution:

rep =  gen.gaussian();

Plot it:

stem(rep, '.');
_images/demo_sparse_gaussian_1.png
Bi-Gaussian

While the non-zero values from the Gaussian distribution have both signs, we can see that some of them are very small. Such values are problematic for sparse recovery algorithms which do not handle very small non-zero values well, or which require that the dynamic range between the large and small non-zero values is not too high. Small non-zero values are also problematic in the presence of noise, as it is hard to distinguish them from the noise.

To address these concerns, we have a bi-Gaussian distribution.

The way it works is as follows:

  • Generate non-zero values using Gaussian distribution.
  • Let a value be \(x\).
  • Let an offset \(\alpha > 0\) be given.
  • If \(x > 0\), then \(x = x + \alpha\).
  • If \(x < 0\), then \(x = x - \alpha\).

Default value of offset is 1.
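A minimal sketch of how such non-zero values could be generated (illustrative only; the library's own implementation is the biGaussian method of spx.data.synthetic.SparseSignalGenerator):

K = 8;                                     % number of non-zero values
offset = 1;                                % offset alpha > 0 (default 1)
values = randn(K, 1);                      % Gaussian non-zero values
values = values + offset * sign(values);   % push the values away from zero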

Example: Sparse representations with bi-Gaussian distributed non-zero values

Setup:

N = 128;
K = 8;
gen = spx.data.synthetic.SparseSignalGenerator(N, K);

Generating the representation vector:

rep =  gen.biGaussian();

Plot it:

stem(rep, '.');
_images/demo_sparse_bigaussian_1.png

Let’s pick up the non-zero values from this vector:

>> nz_rep = rep(rep~=0)'; nz_rep

nz_rep =

   -1.0631   -2.3499    1.7147   -1.2050    1.7254    4.5784    4.0349    3.7694

Let’s estimate the dynamic range:

>> anz_rep = abs(nz_rep);
>> dr = max(anz_rep) / min(anz_rep)

dr =

    4.3068

The bi-Gaussian distribution is quite flexible.

  • The non-zero values take both positive and negative signs.
  • Quite large non-zero values are possible (though rare).
  • Values too close to zero are not possible.
  • The dynamic range between the largest and smallest non-zero values remains modest.
Rademacher

Sometimes, you want a sparse representation where the non-zero values are either \(+1\) or \(-1\). In this case, the non-zero values should be drawn from the Rademacher distribution.

Example: Sparse representations with Rademacher distributed non-zero values

Setup:

N = 128;
K = 8;
gen = spx.data.synthetic.SparseSignalGenerator(N, K);

Generating Rademacher distributed non-zero values:

rep =  gen.rademacher();

Plot it:

stem(rep, '.');
_images/demo_sparse_rademacher_1.png

Generating compressible signals

[Cev09] describes a set of probability distributions, dubbed compressible priors, whose independent and identically distributed realizations result in p-compressible signals.

The authors provide a MATLAB function randcs.m for generating compressible signals. It is included in sparse-plex.

Subspace Clustering

Introduction

High dimensional data-sets are now pervasive in various signal processing applications. For example, high resolution surveillance cameras are now commonplace generating millions of images continually. A major factor in the success of current generation signal processing algorithms is the fact that, even though these data-sets are high dimensional, their intrinsic dimension is often much smaller than the dimension of the ambient space.

One resorts to inferring (or learning) a quantitative model \(\mathbb{M}\) of a given set of data points \(Y = \{ y_1, \dots, y_S\} \subset \RR^M\). Such a model enables us to obtain a low dimensional representation of a high dimensional data set. The low dimensional representations enable efficient implementation of acquisition, compression, storage, and various statistical inferencing tasks without losing significant precision. There is no such thing as a perfect model. Rather, we seek a model \(\mathbb{M}^*\) that is best amongst a restricted class of models \(\mathcal{M} = \{ \mathbb{M} \}\) which is rich enough to describe the data set to a desired accuracy yet restricted enough so that selecting the best model is tractable.

In the absence of training data, the problem of modeling falls into the category of unsupervised learning. There are two common viewpoints of data modeling. A statistical viewpoint assumes that data points are random samples from a probabilistic distribution. Statistical models try to learn the distribution from the dataset. In contrast, a geometrical viewpoint assumes that data points belong to a geometrical object (a smooth manifold or a topological space). A geometrical model attempts to learn the shape of the object to which the data points belong. Examples of statistical modeling include maximum likelihood, maximum a posteriori estimates, Bayesian models, etc. An example of geometrical models is Principal Component Analysis (PCA), which assumes that data is drawn from a low dimensional subspace of the high dimensional ambient space. PCA is simple to implement and has found tremendous success in different fields, e.g., pattern recognition, data compression, image processing, computer vision, etc. (PCA can also be viewed as a statistical model: when the data points are independent samples drawn from a Gaussian distribution, the geometric formulation of PCA coincides with its statistical formulation.)

The assumption that all the data points in a data set are drawn from a single model is, however, often too restrictive. In practice, it often occurs that if we group or segment the data set \(Y\) into multiple disjoint subsets: \(Y = Y_1 \cup \dots \cup Y_K\), then each subset can be modeled sufficiently well by a model \(\mathbb{M}_k^*\) (\(1 \leq k \leq K\)) chosen from a simple model class. Each model \(\mathbb{M}_k^*\) is called a primitive or component model. In this sense, the data set \(Y\) is called a mixed dataset and the collection of primitive models is called a hybrid model for the dataset. Let us look at some examples of mixed data sets.

Consider the problem of vanishing point detection in computer vision. Under perspective projection, a group of parallel lines pass through a common point in the image plane which is known as the vanishing point for the group. For a typical scene consisting of multiple sets of parallel lines, the problem of detecting all vanishing points in the image plane from the set of edge segments (identified in the image) can be transformed into clustering points (in edge segments) into multiple 2D subspaces in \(\RR^3\) (world coordinates of the scene).

In the motion segmentation problem, an image sequence consisting of multiple moving objects is segmented so that each segment consists of motion from only one object. This is a fundamental problem in applications such as motion capture, vision based navigation, target tracking and surveillance. We first track the trajectories of feature points (from all objects) over the image sequence. It has been shown (see here) that the trajectories of feature points for the rigid motion of a single object form a low dimensional subspace. Thus the motion segmentation problem can be solved by segmenting the feature point trajectories for different objects separately and estimating the motion of each object from the corresponding trajectories.

In a face clustering problem, we have a collection of unlabeled images of different faces taken under varying illumination conditions. Our goal is to cluster images of the same face into one group each. For a Lambertian object, it has been shown that the set of images taken under different lighting conditions forms a cone in the image space. This cone can be well approximated by a low-dimensional subspace [BJ03][HYL+03]. The images of the face of each person form one low dimensional subspace and the face clustering problem reduces to clustering the collection of images into multiple subspaces.

As the examples above suggest, a typical hybrid model for a mixed data set consists of multiple primitive models where each primitive is a (low dimensional) subspace. The data set is modeled as being sampled from a collection or arrangement \(\UUU\) of linear (or affine) subspaces \(\UUU_k \subset \RR^M\) : \(\UUU = \{ \UUU_1 , \dots , \UUU_K \}\). The union of the subspaces (in the sequel, we use the terms arrangement and union interchangeably; for more discussion see here) is denoted as \(Z_{\UUU} = \UUU_1 \cup \dots \cup \UUU_K\). This is indeed a geometric model. In such modeling problems, the individual subspaces (the dimension and orientation of each subspace and the total number of subspaces) and the membership of a data point (a single image in the face clustering problem) to a particular subspace are unknown beforehand. This entails the need for algorithms which can simultaneously identify the subspaces involved and cluster/segment the data points from individual subspaces into separate groups. This problem is known as subspace clustering and is the focus of this section. An earlier detailed introduction to subspace clustering can be found in [Vid10].

An example of a statistical hybrid model is a Gaussian Mixture Model (GMM) where one assumes that the sample points are drawn independently from a mixture of Gaussian distributions. A way of estimating such a mixture model is the expectation maximization (EM) method.

The fundamental difficulty in the estimation of hybrid models is the “chicken-and-egg” relationship between data segmentation and model estimation. If the data segmentation was known, one could easily fit a primitive model to each subset. Alternatively, if the constituent primitive models were known, one could easily segment the data by choosing the best model for each data point. An iterative approach starts with an initial (hopefully good) guess of primitive models or data segments. It then alternates between estimating the models for each segment and segmenting the data based on current primitive models till the solution converges. On the contrary, a global algorithm can perform the segmentation and primitive modeling simultaneously. In the sequel, we will look at a variety of algorithms for solving the subspace clustering problem.

Notation and problem formulation

First some general notation for vectors and matrices. For a vector \(v \in \RR^n\), its support is denoted by \(\supp(v)\) and is defined as \(\supp(v) \triangleq \{i : v_i \neq 0, 1 \leq i \leq n \}\). \(|v|\) denotes a vector obtained by taking the absolute values of entries in \(v\). \(\OneVec_n \in \RR^n\) denotes a vector whose each entry is \(1\). \(\| v \|_p\) denotes the \(\ell_p\) norm of \(v\). \(\| v \|_0\) denotes the \(\ell_0\)-“norm” of \(v\). Let \(A\) be any \(m \times n\) real matrix (\(A \in \RR^{m \times n}\)). \(a_{i, j}\) is the element at the \(i\)-th row and \(j\)-th column of \(A\). \(a_j\) with \(1 \leq j \leq n\) denotes the \(j\)-th column vector of \(A\). \(\underline{a}_i\) with \(1 \leq i \leq m\) denotes the \(i\)-th row vector of \(A\). \(a_{j,k}\) is the \(k\)-th entry in \(a_j\). \(\underline{a}_{i,k}\) is the \(k\)-th entry in \(\underline{a}_i\). \(A_{\Lambda}\) denotes a submatrix of \(A\) consisting of columns indexed by \(\Lambda \subset \{1, \dots, n \}\). \(\underline{A}_{\Lambda}\) denotes a submatrix of \(A\) consisting of rows indexed by \(\Lambda \subset \{1, \dots, m \}\). \(|A|\) denotes matrix consisting of absolute values of entries in \(A\).

\(\supp(A)\) denotes the index set of non-zero rows of \(A\). Clearly, \(\supp(A) \subseteq \{1, \dots, m\}\). \(\| A \|_{0}\) denotes the number of non-zero rows of \(A\). Clearly, \(\| A \|_{0} = |\supp(A)|\). We note that while \(\| A \|_{0}\) is not a norm, its behavior is similar to the \(\ell_0\)-“norm” for vectors \(v \in \RR^n\) defined as \(\| v \|_0 \triangleq | \supp(v) |\).

We use \(f(x)\) and \(F(x)\) to denote the PDF and CDF of a continuous random variable. We use \(p(x)\) to denote the PMF of a discrete random variable. We use \(\PP(E)\) to denote the probability of an event.

Problem formulation

The data set can be modeled as a set of data points lying in a union of low dimensional linear or affine subspaces in a Euclidean space \(\RR^M\) where \(M\) denotes the dimension of ambient space. Let the data set be \(\{ y_j \in \RR^M \}_{j=1}^S\) drawn from the union of subspaces under consideration. \(S\) is the total number of data points being analyzed simultaneously. We put the data points together in a data matrix as

\[Y \triangleq \begin{bmatrix} y_1 & \dots & y_S \end{bmatrix}.\]

The data matrix \(Y\), of course, is known to us.

We will slightly abuse the notation and let \(Y\) denote the set of data points \(\{ y_j \in \RR^M \}_{j=1}^S\) also. We will use the terms data points and vectors interchangeably in the sequel. Let the vectors be drawn from a set of \(K\) (linear or affine) subspaces. The number of subspaces may not be known in advance. The subspaces are indexed by a variable \(k\) with \(1 \leq k \leq K\). The \(k\)-th subspace is denoted by \(\UUU_k\). Let the (linear or affine) dimension of the \(k\)-th subspace be \(\dim(\UUU_k) = D_k\) with \(D_k \leq D\). Here \(D\) is an upper bound on the dimension of individual subspaces. We may or may not know \(D\). We assume that none of the subspaces is contained in another. A pair of subspaces may not intersect (e.g. parallel lines or planes), may have a trivial intersection (lines passing through the origin), or a non-trivial intersection (two planes intersecting at a line). The collection of subspaces may also be independent or disjoint.

The vectors in \(Y\) can be grouped (or segmented or clustered) as submatrices \(Y_1, Y_2, \dots, Y_K\) such that all vectors in \(Y_k\) lie in subspace \(\UUU_k\). Thus, we can write

\[Y^* = Y \Gamma = \begin{bmatrix} y_1 & \dots & y_S \end{bmatrix} \Gamma = \begin{bmatrix} Y_1 & \dots & Y_K \end{bmatrix}\]

where \(\Gamma\) is an \(S \times S\) unknown permutation matrix placing each vector in the right subspace. This segmentation is straightforward if the (affine) subspaces do not intersect or the subspaces intersect trivially at one point (e.g. any pair of linear subspaces passes through the origin). Let there be \(S_k\) vectors in \(Y_k\) with \(S = S_1 + \dots + S_K\). Naturally, we may not have any prior information about the number of points in individual subspaces. We do typically require that there are enough vectors drawn from each subspace so that they can span the corresponding subspace. This requirement may vary for individual subspace clustering algorithms. For example, for linear subspaces, sparse representation based algorithms require that whenever a vector is removed from \(Y_k\), the remaining set of vectors spans \(\UUU_k\). This guarantees that every vector in \(Y_k\) can be represented in terms of the other vectors in \(Y_k\). The minimum required \(S_k\) for which this is possible is \(S_k = D_k + 1\) when the data points from each subspace are in general position (i.e. \(\spark(Y_k) = D_k + 1\)).

Let \(Q_k\) be an orthonormal basis for subspace \(\UUU_k\). Then, the subspaces can be described as

\[\UUU_k = \{ y \in \RR^M : y = \mu_k + Q_k \alpha, \; \alpha \in \RR^{D_k} \}, \quad 1 \leq k \leq K.\]

For linear subspaces, \(\mu_k = 0\). We will abuse \(Y_k\) to also denote the set of vectors from the \(k\)-th subspace.

The basic objective of subspace clustering algorithms is to obtain a clustering or segmentation of vectors in \(Y\) into \(Y_1, \dots, Y_K\). This involves finding out the number of subspaces/clusters \(K\), and placing each vector \(y_s\) in its cluster correctly. Alternatively, if we can identify \(\Gamma\) and the numbers \(S_1, \dots, S_K\) correctly, we have solved the clustering problem. Since the clusters fall into different subspaces, as part of subspace clustering, we may also identify the dimensions \(\{D_k\}_{k=1}^K\) of individual subspaces, the bases \(\{ Q_k \}_{k=1}^K\) and the offset vectors \(\{ \mu_k \}_{k=1}^K\) in case of affine subspaces. These quantities emerge due to modeling the clustering problem as a subspace clustering problem. However, they are not essential outputs of the subspace clustering algorithms. Some subspace clustering algorithms may not calculate them, yet they are useful in the analysis of the algorithm. See here for a quick review of data clustering terminology.

Noisy case

We also consider clustering of data points which are contaminated with noise. The data points do not perfectly lie in a subspace but can be approximated as a sum of a component which lies perfectly in a subspace and a noise component. Let

\[y_s = \bar{y}_s + e_s , \quad \Forall 1 \leq s \leq S\]

be the \(s\)-th vector that is obtained by corrupting an error free vector \(\bar{y}_s\) (which perfectly lies in a low dimensional subspace) with a noise vector \(e_s \in \RR^M\). The clustering problem remains the same. Our goal would be to characterize the behavior of the clustering algorithm in the presence of noise at different levels.

Algorithms

A number of algorithms have been developed to address the subspace clustering problem over the last three decades. They can be largely classified under: algebraic methods, iterative methods, statistical methods, spectral clustering and sparse representation based methods. Some algorithms combine ideas from different approaches. In the following, we review a set of representative algorithms from the literature.

Algebraic methods include: matrix factorization based algorithms, Generalized Principal Component Analysis (GPCA).

Iterative methods include: \(K\)-plane clustering, \(K\)-subspace clustering, Expectation-Maximization based subspace clustering.

Statistical methods include: Mixture of Probabilistic Principal Component Analysis (MPPCA), ALC, Random Sampling Consensus (RANSAC).

Spectral clustering based methods include: Spectral Curvature Clustering (SCC).

Sparse representations based methods in turn use spectral clustering as a post processing step. These methods include: Low Rank Representation (LRR), Sparse Subspace Clustering via \(\ell_1\) minimization (SSC-\(\ell_1\)), Sparse Subspace Clustering via Orthogonal Matching Pursuit (SSC-OMP).

Some algorithms assume that the subspaces are independent. Some algorithms are capable of handling subspaces which may not be independent but are disjoint. Some algorithms can allow for arbitrary intersection between subspaces too. The performance of an algorithm depends on a number of parameters: ambient space dimension, number of subspaces, dimension of each subspace, number of points in each subspace and their distribution within the subspace, the separation between subspaces (in terms of say subspace angles). We provide relevant commentary on the features and capabilities of each algorithm.

Some algorithms have explicit support for handling affine subspaces. Many of them are designed for linear subspaces only. This is not a handicap in general as a \(d\)-dimensional affine subspace in \(\RR^M\) can easily be mapped to a \(d+1\)-dimensional linear subspace in \(\RR^{M + 1}\) by using homogeneous coordinates. This representation is one-to-one. The only downside is that we have to add one more coordinate in the ambient space. This may not be an issue if \(M\) is large.

When \(M\) is very large (say images), then it may be useful to perform a dimensionality reduction in advance before applying a subspace clustering algorithm. With the union of subspaces being \(Z_{\UUU} = \UUU_1 \cup \dots \cup \UUU_K\), two situations are possible. The linear span of \(Z_{\UUU}\) is a proper low dimensional subspace of \(\RR^M\). In this case, a direct PCA on the dataset is quite effective in achieving the dimensionality reduction. Alternatively, the dimension of \(\text{span}(Z_{\UUU})\) may be very large even though the individual subspace dimensions \(D_k\) are small. Now, let \(D_{\max} = \max \{ D_k \}\). If \(D_{\max}\) is known and \(D_{\max} < M - 1\), then we can choose a \(D_{\max}+1\) dimensional subspace which can preserve the separation and dimension of all the subspaces \(\UUU_k\) and project all the points to it. Such a subspace may be chosen either randomly or using special purpose methods [BK00]. Note that such a projection may not preserve distances between points or angles between subspaces fully. An approximately distance preserving projection may require a larger dimension subspace [DG99].

Matrix Factorization based algorithms

Basic matrix factorization based algorithms were developed for solving the motion segmentation problem in [BB91][Gea98][CK98][Kan01]. These algorithms are primarily algebraic in nature. See here for the motivation from motion segmentation problem.

The following derivation is applicable if the subspaces are linear and independent.

We start with the equation:

\[Y^* = Y \Gamma = \begin{bmatrix} y_1 & \dots & y_S \end{bmatrix} \Gamma = \begin{bmatrix} Y_1 & \dots & Y_K \end{bmatrix}.\]

Under the independence assumption, we have

\[\Rank (Y) = \Rank(Y^*) = \sum_{k=1}^K \Rank(Y_k).\]

Note that each \(Y_k \in \RR^{M \times S_k}\) can be factorized via SVD as

\[Y_k = U_k \Sigma_k V_k^T\]

where \(U_k \in \RR^{M \times D_k}\), \(\Sigma_k = \text{diag}(\sigma_{k 1}, \dots, \sigma_{k D_k}) \in \RR^{D_k \times D_k}\) and \(V_k \in \RR^{S_k \times D_k}\). Columns of \(U_k\) form an orthonormal basis for the subspace \(\UUU_k\). Columns of \(\Sigma_k V_k^T\) give the coordinates of the points in \(Y_k\) in the orthonormal basis \(U_k\). The singular values are non-zero since \(Y_k\) spans \(\UUU_k\). Alternatively, \(D_k\) can be obtained by counting the non-zero singular values in the SVD of \(Y_k\). Denoting:

\[\begin{split}\hat{U} = \begin{bmatrix} U_1 & \dots & U_K \end{bmatrix}\\ \hat{\Sigma} = \begin{bmatrix} \Sigma_1 & \dots & 0 \\ \vdots & \ddots & \vdots\\ 0 & \dots & \Sigma_K \end{bmatrix}\\ \hat{V} = \begin{bmatrix} V_1 & \dots & 0 \\ \vdots & \ddots & \vdots\\ 0 & \dots & V_K \end{bmatrix},\end{split}\]

we can write

\[Y^* = \hat{U} \hat{\Sigma} \hat{V}^T.\]

This is a valid SVD of \(Y^*\) if the subspaces \(\UUU_k\) are independent. This differs from the standard SVD of \(Y^*\) only in the permutation of singular values in \(\Sigma\) as the standard SVD of \(Y^*\) will require them to be ordered in decreasing order. Nevertheless,

\[Y = Y^* \Gamma^{-1} = Y^* \Gamma^T = \hat{U} \hat{\Sigma} \hat{V}^T \Gamma^T = \hat{U} \hat{\Sigma} (\Gamma \hat{V})^T.\]

It is clear that both \(Y\) and \(Y^*\) share the same singular values. Let the SVD of \(Y\) be \(Y = U \Sigma V^T\). Let \(\Sigma = \hat{\Sigma}\hat{\Gamma}\) where \(\hat{\Gamma}\) permutes the singular values in \(\hat{\Sigma}\) in decreasing order. Then \(\hat{\Sigma} = \Sigma \hat{\Gamma}^T\) and

\[Y = \hat{U} \hat{\Sigma} (\Gamma \hat{V})^T = \hat{U} \Sigma \hat{\Gamma}^T (\Gamma \hat{V})^T = \hat{U} \Sigma (\Gamma \hat{V} \hat{\Gamma})^T.\]

Matching terms, we see that \(U = \hat{U}\) and \(V = \Gamma \hat{V} \hat{\Gamma}\). Thus \(\hat{V}\) is obtained by permuting the rows and columns of \(V\) where \(\Gamma\) and \(\hat{\Gamma}\) are unknown permutations.

Let \(W = VV^T\) and \(\hat{W} = \hat{V} \hat{V}^T\). Then

\[W = VV^T = \Gamma \hat{V} \hat{\Gamma} \hat{\Gamma}^T \hat{V}^T \Gamma^T = \Gamma \hat{V} \hat{V}^T \Gamma^T = \Gamma \hat{W} \Gamma^T.\]

Alternatively

\[\hat{W} = \Gamma^T W \Gamma.\]

Thus, \(\hat{W}\) can be obtained by identical row and column permutations of \(W\) given by \(\Gamma\).

The matrix \(W\) is very useful. But first let’s check out \(\hat{W}\). Note that \(\hat{V}\) is a \(K \times K\) block diagonal matrix with blocks \(V_k\). Thus

\[\begin{split}\hat{V} \hat{V}^T = \begin{bmatrix} V_1 & \dots & 0 \\ \vdots & \ddots & \vdots\\ 0 & \dots & V_K \end{bmatrix} \begin{bmatrix} V_1^T & \dots & 0 \\ \vdots & \ddots & \vdots\\ 0 & \dots & V_K^T \end{bmatrix}.\end{split}\]

Simplifying, we obtain

\[\begin{split}\hat{W} = \begin{bmatrix} V_1 V_1^T & \dots & 0 \\ \vdots & \ddots & \vdots\\ 0 & \dots & V_K V_K^T \end{bmatrix}.\end{split}\]

\(V_k V_k^T\) is an \(S_k \times S_k\) non-zero matrix. \(\hat{W}\) is an \(S \times S\) matrix. Clearly, \(\hat{W}_{i j} = 0\) if the \(i\)-th and \(j\)-th columns in \(Y^*\) belong to different subspaces. Since \(W\) is obtained by permuting the rows and columns of \(\hat{W}\) by \(\Gamma\), hence \(W_{ij} = 0\) if the \(i\)-th and \(j\)-th columns in the unsorted data matrix \(Y\) come from different subspaces. A simple algorithm for data segmentation is thus obtained which puts the \(i\)-th and \(j\)-th columns of \(Y\) in the same cluster if the corresponding entry \(W_{ij}\) is non-zero.
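A sketch of this procedure in MATLAB, assuming that the subspaces are linear and independent and that the data is noise free; the threshold used to decide which entries of \(W\) are treated as zero is illustrative:

[~, ~, V] = svd(Y, 'econ');            % SVD of the data matrix
r = rank(Y);                           % total dimension of the union of subspaces
Vr = V(:, 1:r);                        % right singular vectors of non-zero singular values
W = abs(Vr * Vr');                     % W_ij = 0 for points from different subspaces
A = W > 1e-8;                          % treat tiny entries as zero
S = size(Y, 2);
labels = zeros(1, S);                  % cluster labels
c = 0;
for s = 1:S
    if labels(s) == 0
        c = c + 1;
        labels(A(s, :)) = c;           % points sharing non-zero W entries with s
    end
end

This one-pass grouping relies on \(W_{ij}\) being generically non-zero for every pair of points drawn from the same subspace.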

K-plane clustering

K-plane clustering [BM00] is a variation of the K-means algorithm [DHS12]. In \(K\)-means, we choose a point as the center of each cluster. In \(K\)-plane clustering, we instead choose a hyperplane at the center of each cluster. This algorithm can be used for solving subspace clustering problem when each subspace \(\UUU_k\) is deemed to be a hyperplane of \(\RR^M\). See here for a quick review of affine subspaces. In our notation, we will be estimating \(K\) hyperplanes \(\mathcal{H}_k\) with \(1 \leq k \leq K\). We also assume that \(K\) is known in advance. Each of the planes is defined as

\[\mathcal{H}_k = \{ x | x \in \RR^M , x^T w_k = d_k\}.\]

The algorithm seeks to choose planes such that the sum of squares of distances of each point in \(Y\) to the nearest plane is minimized.

The algorithm alternates between cluster assignment step (where each point is assigned to the nearest plane) and cluster update step (where a new nearest plane is computed for each cluster).

We assume that the normal vector \(w_k\) is unit norm, i.e. \(\| w_k \|_2 = 1\). Thus, the distance of a point \(y_s\) from a plane \(\mathcal{H}_k\) is \(| \langle w_k , y_s \rangle - d_k |\).

In the cluster assignment step, the closest plane for the point \(y_s\) is chosen as

\[k(s) = \underset{k \in 1, \dots, K}{\text{arg min}} | \langle w_k , y_s \rangle - d_k |\]

where \(k(s)\) denotes the assignment of the \(s\)-th point to the \(k\)-th cluster. Next, we look at the problem of finding the nearest hyperplane to a given set of points. Let \(\{y_{k 1}, y_{k 2}, \dots, y_{k n_k} \}\) be the set of points assigned to the \(k\)-th cluster at a given iteration. We can stack the vectors \(y_{k n}\) in a matrix \(Y_k = \begin{bmatrix} y_{k 1} & \dots & y_{k n_k} \end{bmatrix}\). If \(\Rank(Y_k) < M\), then it is easy to find a hyperplane which contains all the points and the minimum distance is 0. In particular, if \(\Rank(Y_k) = M-1\), then this hyperplane is the range of the columns of \(Y_k\): \(\Range(Y_k)\). Otherwise, any hyperplane containing \(\Range(Y_k)\) would work fine.

In the general case, for an arbitrary hyperplane specified by \((w, d)\), the sum of squared distances from the plane is given by

\[\sum_{n=1}^{n_k}| \langle w , y_{k n} \rangle - d |^2 = \| Y_k^T w - d \OneVec_{n_k} \|_2^2.\]

The cluster update step thus is equivalent to finding the solution to the optimization problem:

\[\begin{split}\begin{aligned} \underset{w, d}{\text{minimize}} \| Y_k^T w - d \OneVec_{n_k} \|_2^2\\ \text{subject to } w^T w = 1. \end{aligned}\end{split}\]

To solve this problem, we define a matrix

\[B \triangleq Y_k \left ( I - \frac{\OneVec \OneVec^T}{n_k} \right ) Y_k^T.\]

A global solution to this problem is obtained at any eigenvector \(w\) of \(B\) corresponding to a minimum eigenvalue of \(B\) and \(d = \frac{\OneVec^T Y_k^T w}{n_k}\) [BM00]. When \(Y_k\) is degenerate (\(\Rank(Y_k) < M\)), then the minimum eigenvalue of \(B\) is 0 and the minimum distance is 0.
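A sketch of the cluster update step for a single cluster in MATLAB, where Yk holds the points currently assigned to the cluster as columns (illustrative variable names):

nk = size(Yk, 2);
B = Yk * (eye(nk) - ones(nk) / nk) * Yk';   % B = Yk (I - 1 1^T / n_k) Yk^T
[Weig, Lambda] = eig((B + B') / 2);         % symmetric eigen-decomposition
[~, idx] = min(diag(Lambda));               % pick a minimum eigenvalue
w = Weig(:, idx);                           % unit norm normal vector of the plane
d = mean(Yk' * w);                          % d = 1^T Yk^T w / n_k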

Finally, it can also be shown that the \(K\)-plane clustering algorithm terminates in a finite number of steps at a cluster assignment that is locally optimal. This concludes our discussion of \(K\)-plane clustering.

K-subspace clustering

K-subspace clustering [HYL+03] is a generalization of K-means [see here] and K-plane clustering. In K-means, we cluster points around centroids; in K-plane, we cluster points around hyperplanes; and in K-subspace clustering, we cluster points around subspaces. This algorithm requires the number of subspaces \(K\) and their dimensions \(\{ D_1, \dots, D_K \}\) to be known in advance. We present the version for linear subspaces with \(\mu_k = 0\). Fitting the dataset \(Y\) to \(K\) subspaces reduces to identifying an orthonormal basis \(Q_k \in \RR^{M \times D_k}\) for each subspace. If the data points fit perfectly, then for every \(s\) in \(\{ 1, \dots , S\}\) there exists a \(k\) in \(\{1, \dots, K\}\) such that \(y_s = Q_k \alpha_s\) (i.e. \(y_s\) belongs to the \(k\)-th subspace with basis \(Q_k\)). If a data point belongs to an intersection of two or more subspaces, then we can arbitrarily assign the data point to one of them.

Lastly, data points may not be lying perfectly in the subspace. The orthoprojector for each subspace is given by \(Q_k Q_k^T\). Thus, the projection of a point \(y_s\) on a subspace \(\UUU_k\) is \(Q_k Q_k^T y_s\) and the error is \((I - Q_k Q_k^T) y_s\). The (squared) distance from the subspace is then \(\|(I - Q_k Q_k^T) y_s\|_2^2\). The point can be assigned to the subspace closest to it.

Given that a set of points \(Y_k\) are assigned to the subspace \(\UUU_k\), the orthonormal basis \(Q_k\) can be estimated for \(\UUU_k\) by performing principal component analysis here.

This gives us a straightforward iterative method for fitting the subspaces.

  • Start with initial subspace bases \(Q_1^{(0)}, \dots, Q_K^{(0)}\).
  • Assign points to subspaces by using minimum distance criteria.
  • Estimate the bases for each subspace.
  • Repeat steps 2 and 3 until the clustering stops changing.

Initial subspaces can be chosen randomly.
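A minimal MATLAB sketch of this iteration for linear subspaces, assuming the number of subspaces K, a common subspace dimension D and a maximum iteration count max_iters are given (these names are illustrative; initialization and degenerate cases are handled naively):

[M, S] = size(Y);
Q = cell(1, K);
for k = 1:K
    [Q{k}, ~] = qr(randn(M, D), 0);            % random initial orthonormal bases
end
labels = zeros(1, S);
for iter = 1:max_iters
    old_labels = labels;
    for s = 1:S                                % assignment step
        errs = zeros(1, K);
        for k = 1:K
            e = Y(:, s) - Q{k} * (Q{k}' * Y(:, s));
            errs(k) = norm(e)^2;               % squared distance to subspace k
        end
        [~, labels(s)] = min(errs);
    end
    for k = 1:K                                % update step via PCA
        Yk = Y(:, labels == k);
        if ~isempty(Yk)
            [U, ~, ~] = svd(Yk, 'econ');
            Q{k} = U(:, 1:min(D, size(U, 2))); % principal directions
        end
    end
    if isequal(labels, old_labels)
        break;                                 % clustering stopped changing
    end
end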

Expectation-Maximization for K-subspaces

The EM method can be adapted for fitting of subspaces also. We need to assume a statistical mixture model for the dataset.

We assume that the dataset \(Y\) is sampled from a mixture of \(K\) component distributions where each component is centered around a subspace. A latent (hidden) discrete random variable \(z \in \{1, \dots, K \}\) picks the component distribution from which a sample \(y\) is drawn. Let the \(k\)-th component be centered around the subspace \(\UUU_k\) which has an orthogonal basis \(Q_k\). Then, we can write

\[y = Q_k \alpha + B_k \beta\]

where \(B_k \in \RR^{M \times (M - D_k)}\) is an orthonormal basis for the subspace \(\UUU_k^{\perp}\), \(Q_k \alpha\) is the component of \(y\) lying perfectly in \(\UUU_k\) and \(B_k \beta\) is the component lying in \(\UUU_k^{\perp}\) representing the projection error (to the subspace). We will assume that both \(\alpha\) and \(\beta\) are sampled from isotropic multivariate normal distributions, i.e. \(\alpha \sim \NNN(0, \sigma'^2_{k} I)\) and \(\beta \sim \NNN(0, \sigma^2_{k} I)\). Assuming that \(\alpha\) and \(\beta\) are independent, the inverse of the covariance matrix of \(y\) is given by

\[\Sigma_k^{-1} = \sigma'^{-2}_k Q_k Q_k^T + \sigma^{-2}_k B_k B_k^T.\]

Since \(y\) is expected to be very close the to the subspace \(\UUU_k\), hence \(\sigma^2_k \ll \sigma'^2_k\). In the limit \(\sigma'^2_k \to \infty\), we have \(\Sigma_k^{-1} \to \sigma^{-2}_k B_k B_k^T\). Basically, this means that \(y\) is uniformly distributed in the subspace and its location inside the subspace (given by \(Q_k \alpha\)) is not important to us. All we care about is that \(y\) should belong to one of the subspaces \(\UUU_k\) with \(B_k \beta\) capturing the projection error being small and normally distributed.

The component distributions are therefore:

\[f(y | z = k) = \frac{1}{(2 \pi \sigma_k^2)^{(M - D_k)/2}} \exp \left ( - \frac{y^T B_k B_k^T y}{2 \sigma_k^2}\right ).\]

\(z\) has a multinomial distribution with \(p (z = k) = \pi_k\). The parameter set for this model is then \(\theta = \{\pi_k, B_k, \sigma_k \}_{k=1}^K\), which is unknown and needs to be estimated from the dataset \(Y\). The marginal distribution \(f(y| \theta)\) and the incomplete likelihood function \(l(Y | \theta)\) can be derived just like here. We again introduce auxiliary variables \(w_{sk}\) and convert the ML estimation problem into an iterative estimation problem.

Estimates for \(\hat{w}_{sk}\) in the E-step remain the same.

Estimates of parameters in \(\theta\) in M-step are computed as follows. We compute the weighted sample covariance matrix for the \(k\)-th cluster as

\[\hat{\Sigma}_k = \sum_{s=1}^S w_{sk} y_s y_s^T.\]

\(\hat{B}_k\) consists of the eigenvectors associated with the smallest \(M - D_k\) eigenvalues of \(\hat{\Sigma}_k\). \(\pi_k\) and \(\sigma_k\) are estimated as follows:

\[\hat{\pi_k} = \frac{\sum_{s=1}^S w_{sk}}{S}.\]
\[\hat{\sigma}^2_k = \frac{\sum_{s=1}^S w_{sk} \| \hat{B}^T_k y_s \|_2^2 } {(M - D_k) \sum_{s=1}^S w_{sk}}.\]
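A sketch of these M-step updates for a single component \(k\) in MATLAB, where the column w(:, k) holds the current responsibilities \(\hat{w}_{sk}\) and Y is the \(M \times S\) data matrix (illustrative names; this only spells out the update formulas above):

wk = w(:, k);                                   % responsibilities for component k
Sigma_k = Y * diag(wk) * Y';                    % weighted sample covariance
[Veig, Lambda] = eig((Sigma_k + Sigma_k') / 2); % symmetrize for numerical safety
[~, order] = sort(diag(Lambda), 'ascend');
B_k = Veig(:, order(1:(M - D_k)));              % smallest M - D_k eigenvectors
pi_k = sum(wk) / S;                             % cluster weight
proj = B_k' * Y;                                % projection errors B_k^T y_s
sigma2_k = sum(wk' .* sum(proj.^2, 1)) / ((M - D_k) * sum(wk));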

The primary conceptual difference between \(K\)-subspaces and the EM algorithm is: at each iteration, \(K\)-subspaces gives a definite assignment of every point to one of the subspaces, while EM views the membership as a random variable and uses its expected value \(w_{sk}\) to give a “probabilistic” assignment of a data point to a subspace.

Both of these algorithms require the number of subspaces and the dimension of each subspace as input, and they depend on a good initialization of the subspaces to converge to an optimal solution.

Generalized PCA

Generalized Principal Component Analysis (GPCA) is an algebraic subspace clustering technique based on polynomial fitting and differentiation [VMS03][VH04][HMV04][VMS05][VTH08]. The basic idea is that a union of subspaces can be represented as the zero set of a set of homogeneous polynomials. Once the set of polynomials has been fitted to the given dataset, individual component subspaces can be identified via polynomial differentiation and division. See here for a quick review of ideas from algebraic geometry which are used in the development of the GPCA algorithm.

We will assume that \(\UUU_k\) are linear subspaces. If they are affine, we simply take their homogeneous embeddings.

Representing the union of subspaces with a set of homogeneous polynomials

Consider the \(k\)-th subspace \(\UUU_k \subset \RR^M\) with dimension \(D_k\) and its orthogonal complement \(\UUU_k^{\perp}\) with dimension \(D'_k = M - D_k\). Choose a basis for \(\UUU_k^{\perp}\) as:

\[B_k = [b_{k_1}, \dots, b_{k_{D'_k}}] \in \RR^{M \times D'_k}.\]

Recall that for each \(y \in \UUU_k\), \(b_{k_i}^T y = 0\) as vectors in \(\UUU_k^{\perp}\) are orthogonal to vectors in \(\UUU_k\). Note that each of the forms \(b_{k_i}^T y\) is a homogeneous polynomial of degree 1. The solutions of \(b_{k_i}^T y = 0\) are (linear) hyperplanes of dimension \(M-1\) and the subspace \(\UUU_k\) is the intersection of these hyperplanes. In other words:

\[\begin{split}\begin{aligned} \UUU_k &= \{ y \in \RR^M : B_k^T y = 0 \}\\ &= \left \{ y \in \RR^M : \bigwedge_{i=1}^{D'_k} (b_{k_i}^T y = 0) \right \} . \end{aligned}\end{split}\]

Note that \(y \in Z_{\UUU}\) if and only if \((y \in \UUU_1) \vee \dots \vee (y \in \UUU_K)\). Alternatively:

\[\bigvee_{k=1}^K (y\in \UUU_k) \Leftrightarrow \bigvee_{k=1}^K \bigwedge_{j=1}^{D'_k} (b_{k_j}^T y = 0) \Leftrightarrow \bigwedge_{\sigma} \bigvee_{k=1}^K (b_{k_{\sigma(k)}}^T y = 0)\]

where \(\sigma\) denotes an arbitrary choice of one normal vector \(b_{k_{\sigma(k)}}\) from each basis \(B_k\) and we are considering all such choices. If \(y\in Z_{\UUU}\), it belongs to some \(\UUU_k\), and \(b_{k_i}^T y = 0\) for each \(b_{k_i}\) in \(B_k\). Hence, for each choice \(\sigma\), \(b_{k_{\sigma(k)}}^T y = 0\) and the RHS is true. Conversely, assume that the RHS is true. If \(y \notin Z_{\UUU}\), then from each \(B_k\) we could pick one normal vector \(b\) such that \(b^T y \neq 0\). This choice would make the RHS false, a contradiction; hence \(y \in Z_{\UUU}\). The total number of choices \(\sigma\) is \(\prod_{k=1}^K D'_k\). Interestingly:

\[\bigvee_{k=1}^K (b_{k_{\sigma(k)}}^T y = 0) \Leftrightarrow \left ( \prod_{k=1}^K (b_{k_{\sigma(k)}}^T y) = 0\right ) \iff (p^K_{\sigma}(y) = 0)\]

where \(p^K_{\sigma}(y)\) is a homogeneous polynomial of degree \(K\) in \(M\) variables.

Therefore, a union of \(K\) subspaces can be represented as the zero set of a set of homogeneous polynomials of the form:

(1)\[p^K(y) = \prod_{k=1}^K (b_k^T y ) = c_K^T v_K(y),\]

where \(b_k \in \RR^M\) is a normal vector to the \(k\)-th subspace and \(v_K(y)\) is the Veronese embedding (see here) of \(y \in \RR^M\) into \(\RR^{A_{K}(M)}\). The problem of fitting \(K\) subspaces to the given dataset is then equivalent to the problem of fitting homogeneous polynomials \(p^K(y)\) such that all the points in the dataset belong to the zero set of these polynomials. Fitting such polynomials doesn’t require iterative data segmentation and model estimation since they depend on all the points in the dataset. Once the polynomials have been identified, the remaining task is to split their zero set into individual subspaces identified by \(B_k\).

In the following, we assume that the number of subspaces \(K\) is known beforehand. We consider the task of estimating \(K\) later.

Fitting polynomials to data

Let \(I(Z_{\UUU})\) be the vanishing ideal of \(Z_{\UUU}\). Since, the number of subspaces \(K\) is known, we only need to consider the homogeneous component \(I_K\) of \(I(Z_{\UUU})\) (3).

The vanishing ideal \(I(\UUU_k)\) of \(\UUU_k\) is generated by the set of linear forms

\[\GGG_k = \{l(y) = b^T y, b \in B_k \}.\]

If the subspace arrangement is transversal, \(I_K\) is generated by products of \(K\) linear forms that vanish on the \(K\) subspaces. Any polynomial \(p(y) \in I_K\) can be written as a summation of products of linear forms

\[p(y) = \sum l_1 (y) l_2(y) \dots l_K(y)\]

where \(l_k(y)\) is a linear form in \(I(\UUU_k)\). Using the Veronese map, each polynomial in \(I_K\) can also be written as:

\[p(y) = c_K^T v_K(y) = \sum c_{k_1, \dots, k_M} y_1^{k_1} \dots y_M^{k_M} = 0\]

where \(k_1 + \dots + k_M = K\) and \(c_{k_1, \dots, k_M} \in \RR\) represents the coefficient of monomial \(y^{\underline{K}} = y_1^{k_1} \dots y_M^{k_M}\). Fitting the polynomial \(p(y)\) is equivalent to identifying its coefficient vector \(c_K\). Since \(p(y) = 0\) is satisfied by each data point \(y_s \in Y\), we have \(c_K^T v_K(y_s) = 0\) for all \(s = 1, \dots, S\). We define

\[\begin{split}V_K(M) = \begin{bmatrix} v_K(y_1)^T\\ \vdots\\ v_K(y_S)^T \end{bmatrix} \in \RR^{S \times A_K(M) }\end{split}\]

as the embedded data matrix. Then, we have

\[V_K(M) c_K = 0 \in \RR^S.\]

The coefficient vector \(c_K\) of every polynomial in \(I_K\) is in the null space of \(V_K(M)\). To ensure that every polynomial obtained from \(V_K(M)\) is in \(I_K\), we require that

\[\text{dim} (\NullSpace (V_K(M))) = \text{dim} (I_K) = h_I(K)\]

where \(h_I\) is the Hilbert function of \(I(Z_{\UUU})\) (2). Equivalently, the rank of \(V_K(M)\) needs to satisfy:

\[\text{rank}(V_K(M)) = A_K(M) - h_I(K).\]

This condition is typically satisfied with \(S \geq (A_K(M) - 1)\) points in general position. Assuming this, a basis for \(I_K\) can be constructed from the set of \(h_I(K)\) singular vectors of \(V_K(M)\) associated with its \(h_I(K)\) zero singular values. In the presence of moderate noise, we can still estimate the coefficients of the polynomials in the least squares sense from the singular vectors associated with the \(h_I(K)\) smallest singular values.
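The following minimal sketch (plain MATLAB, independent of the sparse-plex helpers; all variable names here are our own) illustrates this construction for \(K = 2\) planes in \(\RR^3\): the degree-2 Veronese embedding of the data has a one-dimensional null space, and its generator is the coefficient vector of \(p^2(y) = (b_1^T y)(b_2^T y)\).

rng default;
S = 200;
% two random planes in R^3 specified through their unit normal vectors
b1 = randn(3, 1); b1 = b1 / norm(b1);   % unit normal of the first plane
b2 = randn(3, 1); b2 = b2 / norm(b2);   % unit normal of the second plane
B1 = null(b1');                         % 3 x 2 orthonormal basis of the first plane
B2 = null(b2');                         % 3 x 2 orthonormal basis of the second plane
Y  = [B1 * randn(2, S/2), B2 * randn(2, S/2)];
% degree-2 Veronese embedding: [y1^2, y1*y2, y1*y3, y2^2, y2*y3, y3^2]
V2 = zeros(S, 6);
for s = 1:S
    y = Y(:, s);
    V2(s, :) = [y(1)^2, y(1)*y(2), y(1)*y(3), y(2)^2, y(2)*y(3), y(3)^2];
end
% the coefficient vector of the fitted polynomial is the right singular
% vector associated with the smallest singular value
[~, ~, W] = svd(V2, 'econ');
c = W(:, end);
max(abs(V2 * c))                        % every data point lies in the zero set of p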

Subspaces by polynomial differentiation

Now that we have obtained a basis for the polynomials in \(I_K\), the next step is to calculate the basis vectors \(B_k\) for each \(\UUU_k^{\perp}\). The key observation is that, for any \(p \in I_K\) and any point \(y\) lying in \(\UUU_k\) (and in no other subspace), the gradient \(\nabla p(y)\) is orthogonal to \(\UUU_k\); collecting the gradients of a basis of \(I_K\) at such a point therefore spans \(\UUU_k^{\perp}\).
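Continuing the sketch above (and reusing its variables), the gradient of the fitted polynomial at a point from the first plane recovers, up to scale, the normal vector \(b_1\) of that plane:

y = Y(:, 1);                        % a point from the first plane
% gradient of p(y) = c' * v_2(y) with respect to y
grad_p = [2*c(1)*y(1) + c(2)*y(2) + c(3)*y(3);
          c(2)*y(1) + 2*c(4)*y(2) + c(5)*y(3);
          c(3)*y(1) + c(5)*y(2) + 2*c(6)*y(3)];
subspace(grad_p, b1)                % angle in radians; close to zero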

Sparse Subspace Clustering (SSC)

Sparse representations using overcomplete dictionaries have become a popular approach for solving a number of signal and image processing problems in the last couple of decades [Ela10]. The dictionary [Tro04][RBE10] consists of a set of prototype signals called atoms which are representative of the particular class of signals of interest. Signals are then approximated by a sparse linear combination of these atoms (i.e. linear combinations of as few atoms as possible). A wide range of sparse recovery algorithms have been developed to decompose a given signal in terms of the atoms from the dictionary in order to obtain the sparsest possible representation [TW10]. Essentially, it is expected that the signals reside in low dimensional subspaces of the ambient signal space, and a good dictionary contains well chosen elementary signals called atoms such that a small set of those atoms can span (or approximate) any of the low dimensional subspaces in the class of signals under consideration. Two typical approaches for computing the sparse representation (a.k.a. sparse coding or recovery) of a given signal in a given dictionary are convex relaxation (\(\ell_1\)-minimization) [CDS98][Tro04][CT05][DET06][Don06] and greedy pursuits [MZ93][PRK93][TG07][NT09].

Sparse Subspace Clustering (SSC), introduced in [EV09][EV13] is a method which utilizes the idea of sparse representations for solving the subspace clustering problem. It treats the dataset \(Y\) itself as an (unstructured) dictionary and suggests that a sparse representation of each point in a union of subspaces may be constructed from other data points in the dataset.

A dataset in which each point can be expressed as a linear combination of other points in the dataset is said to satisfy the self-expressiveness property. The self-expressive representation of a point \(y_s\) in \(Y\) is given by

\[y_s = Y c_s, \; c_{ss} = 0, \text{ or } Y = Y C, \quad \text{diag}(C) = 0\]

where \(C = \begin{bmatrix}c_1, \dots, c_S \end{bmatrix} \in \RR^{S \times S}\) is the matrix of representation coefficients.

In general, the representation \(c_s\) for a vector \(y_s\) need not be unique. Now, let \(y_s\) belong to the \(k\)-th subspace \(\UUU_k\). Let \(Y^{-s}\) denote the dataset \(Y\) excluding the point \(y_s\) and \(Y_k^{-s}\) denote the set of points in \(Y_k\) excluding \(y_s\). If \(Y_k^{-s}\) spans the subspace \(\UUU_k\), then a representation of \(y_s\) can be constructed entirely from the points in \(Y_k^{-s}\). A representation is called subspace preserving if it consists only of points from the same subspace. Now, if \(c_i\) is a subspace preserving representation of \(y_i\) and \(y_j\) belongs to a different subspace, then \(c_{ij} = 0\). Thus, if \(C\) consists entirely of subspace preserving representations, then \(C_{ij} = 0\) whenever \(y_i\) and \(y_j\) belong to different subspaces.

Note that \(C\) may not be symmetric, i.e., even if \(y_j\) participates in the representation of \(y_i\), \(y_i\) may not participate in the representation of \(y_j\), or the representation coefficients \(C_{ij}\) and \(C_{ji}\) may be different. But we can construct a symmetric matrix \(W = | C | + |C|^T\), where \(|C|\) denotes taking the absolute value of each entry in \(C\). The matrix \(W\) can be used as an affinity matrix for the points from the union of subspaces such that the affinity of points from different subspaces is 0. \(W\) can then be used to partition \(Y\) into \(Y_k\) via spectral clustering [VL07] (see here for a review of spectral clustering).

The remaining issue is constructing a subspace preserving representation \(C\) of \(Y\). This is where the sparse recovery methods developed in the sparse representations literature come to our rescue. [EV09][EV13] proposed the use of \(\ell_1\)-minimization by solving

\[c_s^* = \underset{c}{\text{arg min}} \| c \|_1 \text{ s.t. } y_s = Y c, \; c_{s} = 0.\]

They proved theoretically that, if the subspaces \(\{\UUU_k\}\) are independent, then \(\ell_1\) minimization can recover subspace preserving representations. They also showed that if the subspaces are disjoint, then under certain conditions, subspace preserving representations can be obtained.

Subsequently, [DSB13][YV15] showed that Orthogonal Matching Pursuit (OMP) [PRK93][TG07] can also be used for obtaining subspace preserving representations under appropriate conditions. We will refer to these two variants of SSC as SSC-\(\ell_1\) and SSC-OMP respectively. The essential SSC method is described below.

_images/alg_ssc.png
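As a rough illustration of the OMP variant, the sketch below computes one column of the self-expressive representation with a plain OMP loop. This is only a conceptual sketch, not the library's optimized spx.cluster.ssc.SSC_OMP implementation used later in this documentation; the function name ssc_omp_column is our own, X is a data matrix with unit-norm columns, s is the index of the point being represented, and D is an assumed sparsity level.

function z = ssc_omp_column(X, s, D)
% Compute a D-sparse self-expressive representation of X(:, s) in terms of
% the other columns of X using orthogonal matching pursuit.
    S = size(X, 2);
    x = X(:, s);
    residual = x;
    support = [];
    for iter = 1:D
        correlations = abs(X' * residual);
        correlations(s) = 0;                 % the point must not pick itself
        [~, idx] = max(correlations);
        support = [support idx];
        coeffs = X(:, support) \ x;          % least squares fit on the current support
        residual = x - X(:, support) * coeffs;
    end
    z = zeros(S, 1);
    z(support) = coeffs;
end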

SSC by Basis Pursuit

Hands-on SSC-BP with Synthetic Data

In this example, we will select a set of random subspaces in an ambient space and pick random points within those subspaces. We will make the data noisy and then use sparse subspace clustering by basis pursuit to solve the clustering problem.

Configure the random number generator for repeatability of experiment:

rng default;

Let’s choose the ambient space dimension:

M = 50;

The number of subspaces to be drawn in this ambient space:

K = 10;

Dimension of each of the subspaces:

D = 20;

Choose random subspaces (by choosing bases for them):

bases = spx.data.synthetic.subspaces.random_subspaces(M, K, D);

See Random Subspaces for details.

Compute the smallest principal angles between them:

>> angles_matrix = spx.la.spaces.smallest_angles_deg(bases)
angles_matrix =

         0   13.7806   21.2449   12.6763   18.2977   14.5865   19.0584   14.1622   20.4491   15.9609
   13.7806         0   12.7650   14.3358   15.5764   12.5790   18.1699   14.8446   19.3907   13.2812
   21.2449   12.7650         0   14.7511   13.2121   10.7509   16.1944   11.7819   15.3850   19.7930
   12.6763   14.3358   14.7511         0   14.1313   15.6603   14.1016   13.4738   13.1950   19.8852
   18.2977   15.5764   13.2121   14.1313         0   13.1154   18.3977   15.4241   12.2688   16.7764
   14.5865   12.5790   10.7509   15.6603   13.1154         0    7.6558   13.6178   13.3462   10.5027
   19.0584   18.1699   16.1944   14.1016   18.3977    7.6558         0   12.6955   13.8088   17.2580
   14.1622   14.8446   11.7819   13.4738   15.4241   13.6178   12.6955         0   13.8851   17.1396
   20.4491   19.3907   15.3850   13.1950   12.2688   13.3462   13.8088   13.8851         0    8.4910
   15.9609   13.2812   19.7930   19.8852   16.7764   10.5027   17.2580   17.1396    8.4910         0

See Hands on with Principal Angles for details.

Let’s quickly look at the minimum angle between any of the pairs of subspaces:

>> angles = spx.matrix.off_diag_upper_tri_elements(angles_matrix)';
>> min(angles)
ans =

    7.6558

Some of the subspaces are indeed very closely aligned.

Let’s choose the number of points we will draw for each subspace:

>> Sk = 4 * D

Sk =

    80

Number of points that will be drawn in each subspace:

cluster_sizes = Sk * ones(1, K);

Total number of points to be drawn:

S = sum(cluster_sizes);

Let’s generate these points on the unit sphere in each subspace:

points_result = spx.data.synthetic.subspaces.uniform_points_on_subspaces(bases, cluster_sizes);
X0 = points_result.X;

See Uniformly Distributed Points in Subspaces for more details.

Let’s add some noise to the data points:

% noise level
sigma = 0.5;
% Generate noise
Noise = sigma * spx.data.synthetic.uniform(M, S);
% Add noise to signal
X = X0 + Noise;

See Uniformly Distributed Points in Space for the spx.data.synthetic.uniform function details.

Let’s normalize the noisy data points:

X = spx.norm.normalize_l2(X);

Let’s create true labels for each of the data points:

true_labels = spx.cluster.labels_from_cluster_sizes(cluster_sizes);

See Utility Functions for Clustering Experiments for labels_from_cluster_sizes function.

It is time to apply the sparse subspace clustering algorithm. The following steps are involved:

  1. Compute the sparse representations using basis pursuit.
  2. Convert the representations into a Graph adjacency matrix.
  3. Apply spectral clustering on the adjacency matrix.

Basis Pursuit based Representation Computation

Let’s allocate storage for storing the representation of each point in terms of other points:

Z = zeros(S, S);

Note that there are exactly S points and each has to have a representation in terms of others. The diagonal elements of Z must be zero since a data point cannot participate in its own representation.

We will use CVX to construct the sparse representation of each point in terms of other points using basis pursuit:

start_time = tic;
fprintf('Processing %d signals\n', S);
for s=1:S
    fprintf('.');
    if (mod(s, 50) == 0)
        fprintf('\n');
    end
    x = X(:, s);
    cvx_begin
    % storage for  l1 solver
    variable z(S, 1);
    minimize norm(z, 1)
    subject to
    x == X*z;
    z(s) == 0;
    cvx_end
    Z(:, s)  = z;
end
elapsed_time  = toc(start_time);
fprintf('\n Time spent: %.2f seconds\n', elapsed_time);

The constraint x == X*z is forcing each data point to be represented in terms of other data points.

The constraint z(s) == 0 ensures that a data point cannot participate in its own representation. In other words, the diagonal elements of the matrix Z are forced to be zero.

The output of this loop looks like:

Processing 800 signals
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................

 Time spent: 313.70 seconds

CVX based basis pursuit is indeed a slow algorithm.

Graph adjacency matrix

The sparse representation matrix Z is not symmetric. Also, the sparse representation coefficients are not always positive.

We need to make it symmetric and non-negative so that it can be used as the adjacency matrix of a graph:

W = abs(Z) + abs(Z).';

Spectral Clustering

See Hands-on spectral clustering for a detailed introduction to spectral clustering.

We can now apply spectral clustering on this matrix. We will choose normalized symmetric spectral clustering:

clustering_result = spx.cluster.spectral.simple.normalized_symmetric(W);

The labels assigned by the clustering algorithms:

cluster_labels = clustering_result.labels;

Performance of the Algorithm

Time to compare the clusterings and measure clustering accuracy and error. We will use the Hungarian mapping trick to map between original cluster labels and estimated cluster labels by clustering algorithm:

comparsion_result = spx.cluster.clustering_error_hungarian_mapping(cluster_labels, true_labels, K);

See Clustering Error for Hungarian mapping based clustering error.

The clustering accuracy and error:

clustering_error_perc = comparsion_result.error_perc;
clustering_acc_perc = 100 - comparsion_result.error_perc;

Let’s print it:

>> fprintf('\nclustering error: %0.2f %%, clustering accuracy: %0.2f %% \n'...
    , clustering_error_perc, clustering_acc_perc);
clustering error: 7.00 %, clustering accuracy: 93.00 %

We have achieved pretty good accuracy despite very closely aligned subspaces and a significant amount of noise.

Subspace Preserving Representations

Let’s also get the subspace preserving representation statistics:

spr_stats = spx.cluster.subspace.subspace_preservation_stats(Z, cluster_sizes);
spr_error = spr_stats.spr_error;
spr_flag = spr_stats.spr_flag;
spr_perc = spr_stats.spr_perc;

See Performance Metrics for Sparse Subspace Clustering for more details.

Print it:

>> fprintf('mean spr error: %0.2f, preserving : %0.2f %%\n', spr_stats.spr_error, spr_stats.spr_perc);
mean spr error: 0.68, preserving : 0.00 %

Complete example code can be downloaded here.

SSC by Orthogonal Matching Pursuit

Motion Segmentation

The theory of structure from motion and motion segmentation has evolved over a set of papers [TK91][TK92][BB91][PK97][Gea98][CK98][Kan01]. In this section, we review the essential ideas from this series of work.

A typical image sequence (from a single camera shot) may contain multiple objects moving independently of each other. In the simplest model, we can assume that images in a sequence are views of a single moving object observed by a stationary camera or a stationary object observed by a moving camera. Only rigid motions are considered. In either case, the object is moving with respect to the camera. The structure from motion problem focuses on recovering the (3D) shape and motion information of the moving object. In the general case, there are multiple objects moving independently. Thus, we also need to perform a motion segmentation such that motions of different objects can be separated and (either after or simultaneously) shape and motion of each object can be inferred.

This problem is typically solved in two stages. In the first stage, a frame-to-frame correspondence problem is solved, which identifies a set of feature points whose coordinates can be tracked over the sequence as each point moves from one position to another. We obtain a set of trajectories for these points over the frames in the video. If there is a single moving object, or the scene is static and the observer is moving, then all the feature points belong to the same object. Otherwise, we need to cluster the feature points according to the different objects moving in different directions. In the second stage, these trajectories are analyzed to group the feature points into separate objects and to recover the shape and motion of each object. In this section, we assume that the feature trajectories have been obtained by an appropriate method. Our focus is to identify the moving objects and obtain the shape and motion information for each object from the trajectories.

Modeling structure from motion for single object

We start with the simple model of a static camera and a moving object. All feature point trajectories belong to the moving object. Our objective is to demonstrate that the subspace spanned by feature trajectories of a single moving object is a low dimensional subspace.

Let the image sequence consist of \(F\) frames denoted by \(1 \leq f \leq F\). Let us assume that \(S\) feature points of the moving object have been tracked over this image sequence. Let \((u_{fs}, v_{fs})\) be the image coordinates of the \(s\)-th point in \(f\)-th frame. We form the feature trajectory vector for the \(s\)-th point by stacking its coordinates for the \(F\) frames vertically as

\[y_s = \begin{bmatrix} u_{1s} & v_{1s} & u_{2s} & v_{2s} & \dots & u_{Fs} & v_{Fs} \end{bmatrix}^T.\]

Putting together the feature trajectory vectors of \(S\) points in a single feature trajectory matrix, we obtain

\[Y = \begin{bmatrix} y_1 & y_2 &\dots & y_S \end{bmatrix}.\]

This is the data matrix under consideration from which the shape and motion of the object need to be inferred.

We need two coordinate systems. We use the camera coordinate system as the world coordinate system with the \(Z\)-axis along the optical axis. The coordinates of different points in the object are changing from frame to frame in the world coordinate system as the object is moving. We also establish a coordinate system within the object with origin at the centroid of the feature points such that the coordinates of individual points do not change from frame to frame in the object coordinate system. The (rigid) motion of the object is then modeled by the translation (of the centroid) and rotation of its coordinate system with respect to the world coordinate system. Let \((a_s, b_s, c_s)\) be the coordinate of the \(s\)-th point in the object coordinate system. Then, the matrix

\[\begin{split}\begin{bmatrix} a_1 & a_2 & \dots & a_S\\ b_1 & b_2 & \dots & b_S\\ c_1 & c_2 & \dots & c_S\\ \end{bmatrix}\end{split}\]

represents the shape of the object (w.r.t. its centroid).

Let us choose an orthonormal basis in the object coordinate system. Let \(d_f\) be the position of the centroid and \((i_f, j_f, k_f)\) be the (orthonormal) basis vectors of the object coordinate system in the \(f\)-th frame. Then, the position of the \(s\)-th point in the world coordinate system in \(f\)-th frame is given by

\[h_{fs} = d_f + a_s i_f + b_s j_f + c_s k_f.\]

Assuming orthographic projection and letting \(h_{fs} = (u_{fs}, v_{fs}, w_{fs})\), the image coordinates are obtained by chopping off the third component \(w_{fs}\). We define the rotation matrix for the \(f\)-th frame as

\[\begin{split}R_f \triangleq \begin{bmatrix} i_f & j_f & k_f \end{bmatrix} = \begin{bmatrix} \underline{i}_f \\ \underline{j}_f \\ \underline{k}_f \end{bmatrix}\end{split}\]

where \(\underline{i}_f\), \(\underline{j}_f\), \(\underline{k}_f\) are the row vectors of \(R_f\). Let \(x_s = (a_s, b_s, c_s, 1)\) be the homogeneous coordinates of the \(s\)-th point in object coordinate system. We can write the homogeneous coordinates in camera coordinate system as

\[\begin{split}\begin{bmatrix} h_{fs}\\ 1 \end{bmatrix} = \begin{bmatrix} R_f & d_f \\ 0_{1 \times 3} & 1 \end{bmatrix} x_s.\end{split}\]

If we write \(d_f = (d_{fi}, d_{fj}, d_{fk})\), then, the data matrix \(Y\) can be factorized as

\[\begin{split}Y = \begin{bmatrix} u_{11} & \dots & u_{1S}\\ v_{11} & \dots & v_{1S}\\ \vdots & \dots & \vdots \\ \vdots & \dots & \vdots \\ u_{F1} & \dots & u_{FS}\\ v_{F1} & \dots & v_{FS} \end{bmatrix} = \left[ \begin{array}{c|c} \underline{i}_1 & d_{1i}\\ \underline{j}_1 & d_{1j}\\ \vdots & \vdots \\ \vdots & \vdots \\ \underline{i}_F & d_{Fi}\\ \underline{j}_F & d_{Fj} \end{array} \right] \begin{bmatrix} x_1 & \dots & x_S \end{bmatrix}.\end{split}\]

We rewrite this as

\[Y = \mathbb{M} \mathbb{S}\]

where \(\mathbb{M}\) represents the motion information of the object and \(\mathbb{S}\) represents the shape information of the object (the last row of \(\mathbb{S}\), as formulated above, consists of 1's). This factorization is known as the Tomasi-Kanade factorization of shape and motion information of a moving object. Note that \(\mathbb{M} \in \RR^{2F \times 4}\) and \(\mathbb{S} \in \RR^{4 \times S}\). Thus, the rank of \(Y\) is at most 4, and the feature trajectories of the rigid motion of an object span an (up to) 4-dimensional subspace of the trajectory space \(\RR^{2F}\).
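This rank bound is easy to check numerically. Below is a minimal sketch in plain MATLAB (synthetic data, no sparse-plex dependencies, all names our own) that builds the trajectory matrix of a single rigid motion under orthographic projection and verifies that its rank is 4.

rng default;
F = 30; S = 25;                       % number of frames and feature points
shape = randn(3, S);
shape = shape - mean(shape, 2);       % object coordinates centered at the centroid
X = [shape; ones(1, S)];              % homogeneous object coordinates
Y = zeros(2*F, S);
for f = 1:F
    [Q, ~] = qr(randn(3));            % a random orthogonal matrix serves as R_f here
    d = randn(3, 1);                  % translation of the centroid in frame f
    H = [Q d] * X;                    % positions in the camera frame
    Y(2*f-1:2*f, :) = H(1:2, :);      % orthographic projection: keep (u, v), drop depth
end
rank(Y)                               % returns 4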

Solving the structure from motion problem

We digress a bit to understand how to perform the factorization of \(Y\) into \(\mathbb{M}\) and \(\mathbb{S}\). Using SVD, \(Y\) can be decomposed as

\[Y = U \Sigma V^T.\]

Since \(Y\) is at most rank \(4\), we keep only the first 4 singular values as \(\Sigma = \text{diag}(\sigma_1, \sigma_2, \sigma_3, \sigma_4)\). Matrices \(U \in \RR^{2F \times 4}\) and \(V \in \RR^{S \times 4}\) are the left and right singular matrices respectively.

There is no unique factorization of \(Y\) in general. One simple factorization can be obtained by defining:

\[\widehat{\mathbb{M}} = U \Sigma^{\frac{1}{2}}, \quad \widehat{\mathbb{S}} = \Sigma^{\frac{1}{2}} V^T.\]
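Continuing the synthetic check above, this factorization can be computed directly (a sketch only; \(\widehat{\mathbb{M}}\) and \(\widehat{\mathbb{S}}\) are named M_hat and S_hat here):

[U, Sig, V] = svd(Y, 'econ');
U4 = U(:, 1:4); Sig4 = Sig(1:4, 1:4); V4 = V(:, 1:4);   % keep the rank-4 part
M_hat = U4 * sqrt(Sig4);
S_hat = sqrt(Sig4) * V4';
norm(Y - M_hat * S_hat, 'fro')        % close to zero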

But for any \(4 \times 4\) invertible matrix \(A\),

\[\mathbb{M} = \widehat{\mathbb{M}} A, \quad \mathbb{S} = A^{-1}\widehat{\mathbb{S}}\]

is also a possible solution since \(\mathbb{M} \mathbb{S} = \widehat{\mathbb{M}} \widehat{\mathbb{S}} = Y\). Remember that \(\mathbb{M}\) is not an arbitrary matrix but represents the rigid motion of an object. There is considerable structure inside the motion matrix. These structural constraints can be used to compute an appropriate \(A\) and thus obtain \(\mathbb{M}\) from \(\widehat{\mathbb{M}}\). To proceed further, let us break \(A\) into two parts

\[A = \left[\begin{array}{c|c} A_R & a_t \end{array}\right]\]

where \(A_R \in \RR^{4 \times 3}\) is the rotational component and \(a_t \in \RR^4\) is related to translation. We can now write:

\[\mathbb{M} = \left [ \begin{array}{c|c} \widehat{\mathbb{M}} A_R & \widehat{\mathbb{M}} a_t \end{array} \right]\]

Rotational constraints Recall that \(R_f\) is a rotation matrix, hence its rows are unit norm and orthogonal to each other. Thus every row of \(\widehat{\mathbb{M}} A_R\) is unit norm and every pair of rows (for a given frame) is orthogonal. This yields the following constraints.

\[\widehat{m}_{2f-1} A_R A_R^T \widehat{m}_{2f-1}^T = 1 \quad \widehat{m}_{2f} A_R A_R^T \widehat{m}_{2f}^T = 1\]
\[\widehat{m}_{2f-1} A_R A_R^T \widehat{m}_{2f}^T = 0\]

where \(\widehat{m}_k\) denotes the \(k\)-th row of \(\widehat{\mathbb{M}}\), and the constraints hold for \(1 \leq f \leq F\). This over-constrained system can be solved for the entries of \(A_R\) using least squares techniques.

Translational constraints Recall that the image of the centroid of a set of points under an isometry (rigid motion) is the centroid of the images of the points under the same isometry. The homogeneous coordinates of the centroid in the object coordinate system are \((0, 0, 0, 1)\). The coordinates of the centroid in the \(f\)-th image are \((\frac{1}{S} \sum_s {u_{f s}}, \frac{1}{S} \sum_s {v_{f s}} )\). Substituting back, we obtain

\[\begin{split}\frac{1}{S} \begin{bmatrix} \sum_s {u_{1 s}}\\ \sum_s {v_{1 s}}\\ \vdots\\ \sum_s {u_{F s}}\\ \sum_s {v_{F s}}\\ \end{bmatrix} = \left [ \begin{array}{c|c} \widehat{\mathbb{M}} A_R & \widehat{\mathbb{M}} a_t \end{array} \right] \begin{bmatrix} 0 \\ 0 \\ 0 \\1 \end{bmatrix} = \widehat{\mathbb{M}} a_t .\end{split}\]

A least squares solution for \(a_t\) is straight-forward.
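Continuing the same synthetic example (reusing Y and M_hat from the sketches above), the least squares estimate of \(a_t\) takes one line:

centroid_track = mean(Y, 2);          % per-frame image centroids stacked into a 2F x 1 vector
a_t = M_hat \ centroid_track;         % least squares solution for the translation-related component
norm(M_hat * a_t - centroid_track)    % close to zero for this noiseless synthetic data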

Modeling motion for multiple objects

The generalization from modeling the motion of one object to multiple objects is straight-forward. Let there be \(K\) objects in the scene moving independently. (Our realization of an object is a set of feature points undergoing the same rotation and translation over a sequence of images. The notions of locality, color, connectivity etc. play no role in this definition. It is possible that two visually distinct objects undergo the same rotation and translation within a given image sequence. For the purpose of inferring an object from its motion, these two visually distinct objects are treated as one.) Let \(S_1, S_2, \dots, S_K\) feature points be tracked for objects \(1,2, \dots, K\) respectively for \(F\) frames with \(S = \sum_k S_k\). Let these feature trajectories be put in a data matrix \(Y \in \RR^{2F \times S}\). In general, we don’t know which feature point belongs to which object and how many feature points there are for each object. Of course, there is at least one feature point for each object (otherwise the object isn’t being tracked at all). We could permute the columns of \(Y\) via an (unknown) permutation \(\Gamma\) so that the feature points of each object are placed contiguously, giving us

\[Y^* = Y \Gamma = \begin{bmatrix} Y_1 & Y_2 & \dots & Y_K \end{bmatrix}.\]

Clearly, each submatrix \(Y_k\) (\(1 \leq k \leq K\)) which consists of feature trajectories of one object spans an (up to) 4 dimensional subspace. Now, the problem of motion segmentation is essentially separating \(Y\) into \(Y_k\) which reduces to a standard subspace clustering problem.

Let us dig a bit deeper to see how the motion shape factorization identity changes for the multi-object formulation. Each data submatrix \(Y_k\) can be factorized as

\[Y_k = U_k \Sigma_k V_k^T = \mathbb{M}_k \mathbb{S}_k = \widehat{\mathbb{M}}_k A_k A_k^{-1} \widehat{\mathbb{S}}_k.\]

\(Y^*\) now has the canonical factorization:

\[\begin{split}Y^* = \begin{bmatrix} \mathbb{M}_1 & \dots & \mathbb{M}_K \end{bmatrix} \begin{bmatrix} \mathbb{S}_1 & \dots & 0 \\ \vdots & \ddots & \vdots\\ 0 & \dots & \mathbb{S}_K \end{bmatrix}.\end{split}\]

If we further denote:

\[\begin{split}\mathbb{M} = \begin{bmatrix} \mathbb{M}_1 & \dots & \mathbb{M}_K \end{bmatrix}\\ \widehat{\mathbb{M}} = \begin{bmatrix} \widehat{\mathbb{M}}_1 & \dots & \widehat{\mathbb{M}}_K \end{bmatrix}\\ \mathbb{S} = \begin{bmatrix} \mathbb{S}_1 & \dots & 0 \\ \vdots & \ddots & \vdots\\ 0 & \dots & \mathbb{S}_K \end{bmatrix}\\ \widehat{\mathbb{S}} = \begin{bmatrix} \widehat{\mathbb{S}}_1 & \dots & 0 \\ \vdots & \ddots & \vdots\\ 0 & \dots & \widehat{\mathbb{S}}_K \end{bmatrix}\\ A = \begin{bmatrix} A_1 & \dots & 0 \\ \vdots & \ddots & \vdots\\ 0 & \dots & A_K \end{bmatrix}\\ U = \begin{bmatrix} U_1 & \dots & U_K \end{bmatrix}\\ \Sigma = \begin{bmatrix} \Sigma_1 & \dots & 0 \\ \vdots & \ddots & \vdots\\ 0 & \dots & \Sigma_K \end{bmatrix}\\ V = \begin{bmatrix} V_1 & \dots & 0 \\ \vdots & \ddots & \vdots\\ 0 & \dots & V_K \end{bmatrix},\end{split}\]

then we obtain a factorization similar to the single object case given by

\[\begin{split}Y^* = \mathbb{M} \mathbb{S} = \widehat{\mathbb{M}} A A^{-1}\widehat{\mathbb{S}}\\ \mathbb{S} = A^{-1}\widehat{\mathbb{S}} = A^{-1} \Sigma^{\frac{1}{2}} V^T\\ \mathbb{M} = \widehat{\mathbb{M}} A = U \Sigma^{\frac{1}{2}} A.\end{split}\]

Thus, once the segmentation of \(Y\) in terms of the unknown permutation \(\Gamma\) has been obtained, the (sorted) data matrix \(Y^*\) can be factorized into shape and motion components as appropriate.

Limitations Our discussion so far has established that the feature trajectories for each moving object span a 4-dimensional subspace. There are a number of reasons why this is only approximately valid: perspective distortion of the camera, tracking errors, and pixel quantization. Thus, a subspace clustering algorithm should allow for the presence of noise or corruption of data in real life applications.

Synthetic Data Generation

Random Subspaces

Subspace clustering is focused on segmenting data which fall in different subspaces, where the subspaces are either independent or disjoint and are sufficiently oriented away from each other.

For testing algorithms, it is useful to pick random subspaces of an ambient signal space and then draw data points within these subspaces.

A way to pick a random subspace is to pick a basis for the subspace. Then, all linear combinations of the basis elements fall in the subspace, and every vector in the subspace can be expressed as such a combination.

Let’s pick a random plane in the 3-Dimensional space:

>> basis = orth(randn(3, 2))
basis =

   -0.2634    0.6981
   -0.5459   -0.6769
    0.7954   -0.2334

What we are doing here is constructing a 3x2 Gaussian random matrix and orthogonalizing its columns. With probability 1, the Gaussian random matrix is full rank. Hence, this is a safe way of choosing a basis for a random plane.

We can verify that the basis is indeed orthonormal:

>> basis'*basis
ans =

    1.0000         0
         0    1.0000
Visualizing Subspaces

It is possible to visualize 2D subspaces in 3D space.

Let’s pick one subspace:

rng(10);
A = orth(randn(3, 2))

Identify its basis vectors:

e1 = A(:, 1);
e2 = A(:, 2);

Identify the corner points of a square around its basis vectors:

corners = [e1+e2, e2-e1, -e1-e2, -e2+e1];

Visualize it:

fill3(corners(1,:),corners(2,:),corners(3,:),'r');
grid on;
hold on;
alpha(0.3);

Add the arrows of basis vectors from origin:

quiver3(0, 0, 0, e1(1), e1(2), e1(3), 'color', 'r');
quiver3(0, 0, 0, e2(1), e2(2), e2(3), 'color', 'r');
_images/random_subspace_a_3d.png

Let’s add one more basis:

B = orth(randn(3, 2));
e1 = B(:, 1);
e2 = B(:, 2);
corners = [e1+e2, e2-e1, -e1-e2, -e2+e1];
fill3(corners(1,:),corners(2,:),corners(3,:),'g');
alpha(0.3);
quiver3(0, 0, 0, e1(1), e1(2), e1(3), 'color', spx.graphics.rgb('DarkGreen'));
quiver3(0, 0, 0, e2(1), e2(2), e2(3), 'color', spx.graphics.rgb('DarkGreen'));
_images/random_subspace_a_b_3d.png
Multiple Subspaces

sparse-plex provides a way to draw multiple random subspaces of a given dimension from an ambient space.

Let’s pick the dimension of the ambient space:

M = 10;

Let’s pick the dimension of subspaces:

D = 4;

Let’s pick the number of subspaces to be drawn:

K = 2;

Let’s draw the bases for each random subspace:

import spx.data.synthetic.subspaces.random_subspaces;
bases = random_subspaces(M, K, D);

The result bases is a cell array containing an orthonormal basis for each subspace:

>> bases{1}

ans =

   -0.1178   -0.1432    0.0438   -0.0100
    0.1311   -0.0110   -0.4409    0.1758
    0.5198   -0.6404    0.0422   -0.3980
    0.5211   -0.0172   -0.2929    0.6334
   -0.2253   -0.1194   -0.2797    0.0920
    0.4695    0.1059    0.5408    0.1396
    0.1919    0.0765   -0.1441   -0.3519
    0.0940    0.0145   -0.4542   -0.4078
    0.3209    0.6274   -0.2325   -0.2118
   -0.0855   -0.3791   -0.2537    0.2153

>> bases{2}

ans =

    0.4784   -0.0579   -0.4213   -0.0206
    0.1213   -0.0591    0.3498    0.2351
    0.3077   -0.2110    0.2573    0.0042
   -0.5581   -0.5284    0.0988   -0.1403
    0.1128    0.5914    0.2518   -0.1872
   -0.1804   -0.0095    0.0707   -0.1351
   -0.0728    0.2774   -0.2063    0.3801
   -0.4417    0.3878    0.2071    0.4004
    0.0695   -0.2496   -0.1836    0.7344
    0.3158   -0.1732    0.6608    0.1647

Verify orthonormality:

>> Psi = bases{1}
>> Psi' * Psi

ans =

    1.0000   -0.0000   -0.0000   -0.0000
   -0.0000    1.0000    0.0000    0.0000
   -0.0000    0.0000    1.0000   -0.0000
   -0.0000    0.0000   -0.0000    1.0000

Principal Angles

If \(\UUU\) and \(\VVV\) are two linear subspaces of \(\RR^M\), then the smallest principal angle between them denoted by \(\theta\) is defined as [BjorckG73]

\[\cos \theta = \underset{u \in \UUU, v \in \VVV}{\max} \frac{u^T v}{\| u \|_2 \| v \|_2}.\]

For the functions provided in sparse-plex for measuring principal angles, see Hands on with Principal Angles.

Uniformly Distributed Points in Space

If we wish to generate points uniformly distributed on the unit sphere, we follow a two-step procedure:

  1. Generate independent standard Gaussian random vectors.
  2. Normalize their lengths.

Here is an example.

Let the ambient space dimension be:

>> M = 4;

Let the number of points to be generated be:

>> S = 6;

Let’s generate the Gaussian random vectors:

>> X = randn(M, S)

X =

   -0.6568   -0.2926   -0.4930    0.6113    1.8045    0.6001
   -1.4814   -0.5408   -0.1807    0.1093   -0.7231    0.5939
    0.1555   -0.3086    0.0458    1.8140    0.5265   -2.1860
    0.8186   -1.0966   -0.0638    0.3120   -0.2603   -1.3270

Let’s normalize them:

>> X = spx.norm.normalize_l2(X)

X =

   -0.3605   -0.2260   -0.9286    0.3147    0.8886    0.2228
   -0.8130   -0.4177   -0.3404    0.0563   -0.3561    0.2205
    0.0853   -0.2384    0.0863    0.9338    0.2593   -0.8117
    0.4492   -0.8471   -0.1201    0.1606   -0.1282   -0.4928

Verify that they are indeed on the unit-sphere:

>> spx.norm.norms_l2_cw(X)

ans =

    1.0000    1.0000    1.0000    1.0000    1.0000    1.0000

We provide a reusable function to generate uniformly distributed points on the unit sphere:

>> spx.data.synthetic.uniform(M, S)

ans =

   -0.6788    0.5450   -0.3194   -0.1977   -0.6098   -0.4051
    0.1893    0.3660    0.6441    0.2742    0.3803    0.1614
    0.6926   -0.7056   -0.0138   -0.0292   -0.0341   -0.6422
   -0.1540    0.2667    0.6949   -0.9407   -0.6946    0.6305

Uniformly Distributed Points in Subspaces

For subspace clustering purposes, individual vectors are usually normalized. They then fall onto the surface of the unit sphere of the subspace to which they belong.

For experimentation, it is useful to generate uniformly distributed points on the unit sphere of a random subspace.

It is actually very easy to do. Let’s start with a simple example of a random 2D plane inside 3D space.

Let’s choose a random plane:

basis = orth(randn(3, 2));

Let’s choose coordinates of some points in this basis where the coordinates are Gaussian distributed:

num_points = 100;
coefficients = randn(2, num_points);

Let’s normalize the coefficients:

coefficients = spx.norm.normalize_l2(coefficients);

The coordinates of these points in the 3D space can be easily calculated now:

uniform_points = basis * coefficients;

Verify that these points are indeed on unit sphere:

>> max(abs(spx.norm.norms_l2_cw(uniform_points) - 1))

ans =

   4.4409e-16

Time to visualize everything. First the plane:

e1 = basis(:, 1);
e2 = basis(:, 2);
corners = [e1+e2, e2-e1, -e1-e2, -e2+e1];
spx.graphics.figure.full_screen;
fill3(corners(1,:),corners(2,:),corners(3,:),'r');
grid on;
hold on;
alpha(0.3);

Then the unit vectors:

quiver3(0, 0, 0, e1(1), e1(2), e1(3), 'color', 'blue');
quiver3(0, 0, 0, e2(1), e2(2), e2(3), 'color', 'blue');

Finally the points:

x = uniform_points(1, :);
y = uniform_points(2, :);
z = uniform_points(3, :);
plot3(x, y, z, '.', 'color', spx.graphics.rgb('Brown') );

We might as well draw the origin too:

plot3(0, 0, 0, '+k', 'MarkerSize', 10, 'color', spx.graphics.rgb('DarkRed'));
_images/uniform_points_2d_subspace.png

Complete example code can be downloaded here.

Uniformly distributed points in multiple subspaces

We provide a useful function which can generate uniformly distributed points on one or more subspaces.

We first need to choose the bases for the subspaces for which we will draw uniformly distributed points. Here we will choose those bases randomly.

Ambient space dimension:

M = 10;

Number of subspaces:

K = 4;

Dimension of each subspace:

D = 5;

Bases for each subspace:

bases = spx.data.synthetic.subspaces.random_subspaces(M, K, D);

Now, let’s decide how many points we need in each subspace:

cluster_sizes = [10 4 4 8];

Let’s generate uniformly distributed points in each subspace:

data_points = spx.data.synthetic.subspaces.uniform_points_on_subspaces(bases, cluster_sizes);

The returned value contains the data matrix containing the points and start and end indices for each cluster of points (for each subspace):

>> data_points

data_points =

  struct with fields:

                X: [10×26 double]
    start_indices: [1 11 15 19]
      end_indices: [10 14 18 26]

Let’s look at the start and end indices for each cluster:

>> data_points.start_indices

ans =

     1    11    15    19

>> data_points.end_indices

ans =

    10    14    18    26

Verify the size of the data matrix:

>> size(data_points.X)

ans =

    10    26

Let’s look at the data points for the 2nd cluster:

>> data_points.X(:, data_points.start_indices(2):data_points.end_indices(2))

ans =

    0.0987    0.5278   -0.4014    0.2963
   -0.1793    0.0614    0.1551    0.2283
    0.4603    0.1510   -0.0926    0.0340
    0.3573   -0.1289    0.1654    0.4519
    0.1202   -0.0495    0.1382   -0.4503
   -0.1857   -0.6572   -0.1129   -0.3851
    0.4265    0.1540   -0.6315   -0.0117
   -0.4420    0.4131    0.0530    0.1565
    0.0262    0.1973   -0.2354    0.1153
   -0.4377   -0.1022    0.5385   -0.5155

Complete example code can be downloaded here.

Performance Metrics for Sparse Subspace Clustering

Consider a sparse representation matrix \(C\) where each signal has been represented in terms of other signals. With \(S\) signals, the matrix is of size \(S \times S\) and the diagonal elements of the matrix are zero.

We use the following metrics for comparison of algorithms.

Percentage of subspace preserving representations (p%) [YV15]

This is the percentage of points whose representations are subspace-preserving. Due to the imprecision of solvers, coefficients with absolute values less than \(10^{-3}\) are considered zero. A subspace preserving \(C\) gives \(p = 100\).

Subspace preserving representation error (e%) [EV13]

For each column \(c_s\) in \(C\), we compute the fraction of its \(\ell_1\) norm that comes from other subspaces and average over all \(1 \leq s \leq S\).

\[e\% = \frac{100}{S} \sum_{s=1}^S \left ( 1 - \frac{\sum_{i=1}^S w_{is} | c_{is}| }{\| c_s \|_1} \right )\]

where \(w_{is} \in \{0, 1\}\) is the true affinity: \(w_{is} = 1\) if points \(i\) and \(s\) lie in the same subspace and 0 otherwise. A subspace-preserving \(C\) gives \(e=0\).

Clustering accuracy (a %) [YV15]

This is the percentage of correctly labeled data points. It is computed by matching the estimated and true labels as

\[a\% = \frac{100}{S} \underset{\pi}{\max} \sum_{ks} L^{\text{est}}_{\pi(k) s} L^{\text{true}}_{ks}\]

where \(\pi\) is a permutation of the \(K\) cluster labels, \(L_{ks} = 1\) if point \(s\) belongs to cluster \(k\), and 0 otherwise. This assumes that either the number of subspaces/clusters is known a priori to the clustering algorithm or the clustering algorithm has inferred it correctly.
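For small \(K\), the maximization over permutations can be evaluated by brute force, which makes the formula concrete. The sketch below is only illustrative, with made-up toy labels; the library's spx.cluster.clustering_error_hungarian_mapping (used in the hands-on examples) computes the corresponding clustering error via a Hungarian mapping, which scales to larger \(K\).

% Toy labels: the estimated clustering equals the true one up to a relabeling,
% except for one mislabeled point.
true_labels = [1 1 1 2 2 3 3 3];
est_labels  = [2 2 2 3 3 1 1 2];
K = 3;
S = numel(true_labels);
best_matches = 0;
P = perms(1:K);                        % all K! permutations of the cluster labels
for p = 1:size(P, 1)
    pi_map = P(p, :);                  % pi_map(k) is the estimated label matched to true label k
    matches = sum(pi_map(true_labels) == est_labels);
    best_matches = max(best_matches, matches);
end
accuracy_percent = 100 * best_matches / S    % 87.5 for this toy example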

Running time (t)

This is the time taken by each clustering task, as measured in MATLAB.

Hands-on with Subspace Preservation Metrics

Let’s consider a data set of 10 points:

X =

    0.2813   -0.9343    0.2368   -0.7846    0.7908         0         0         0         0         0
    0.9596    0.3566   -0.9716    0.6200    0.6120   -0.4064    0.9962    0.9613   -0.0830    0.7051
         0         0         0         0         0    0.9137   -0.0866   -0.2757    0.9965    0.7091

The points are drawn from a 3 dimensional space. The first 5 points are drawn from the X-Y plane and the last 5 points from the Y-Z plane.

We constructed the sparse representations of these data points in terms of other points using basis pursuit. The representations are:

C =

    0.0000   -0.0000   -0.0000    0.0000    0.8565   -0.0000    0.3284    0.0000    0.0000    0.3615
    0.0000   -0.0000   -0.0000    0.7476   -0.5885   -0.0000    0.0000    0.0000    0.0000    0.0000
   -0.0000   -0.0000   -0.0000   -0.3638   -0.0000    0.0000   -0.3902   -0.0000   -0.0000   -0.4295
    0.0000    0.8797   -0.3018    0.0000   -0.0000   -0.0000    0.0000    0.0000    0.0000    0.0000
    0.3558   -0.3085   -0.0000   -0.0000   -0.0000   -0.0000    0.0000    0.0000    0.0000    0.0000
   -0.0000    0.0000    0.0000   -0.0000   -0.0000   -0.0000   -0.0000   -0.2187    0.8167    0.0000
    0.6854   -0.0000   -0.7247    0.0000    0.0000   -0.0000    0.0000    0.8757    0.0000    0.0000
    0.0000   -0.0000   -0.0000    0.0000    0.0000   -0.3520    0.3141    0.0000   -0.0000    0.0000
    0.0000   -0.0000   -0.0000    0.0000    0.0000    0.8195   -0.0000   -0.0000   -0.0000    0.7116
    0.0837   -0.0000   -0.0885    0.0000    0.0000    0.0000    0.0000    0.0000    0.3530    0.0000

For subspace preserving representations:

  • In the first 5 columns, non-zero entries should appear in first 5 rows.
  • In the last 5 columns, non-zero entries should appear in last 5 rows.

On inspection, we can see that column 1 is not subspace preserving while column 2 is.

Let’s go through the steps of computing the metrics. We will work on column 1.

Let’s assign cluster labels to each of the columns:

cluster_sizes = [5 5];
labels = spx.cluster.labels_from_cluster_sizes(cluster_sizes)
>> spx.io.print.vector(labels, 0)
1 1 1 1 1 2 2 2 2 2

Let’s compute absolute values of \(C\):

C = abs(C);

We will allocate some space to keep flags indicating whether each column contains a subspace preserving representation and the amount of \(\ell_1\)-error in each column (here \(S = 10\)):

spr_flags = zeros(1, S);
spr_errors = zeros(1, S);

Let’s pick up the first column:

c1 = C(:, 1);

The label assigned to this column is:

k = labels(1);

which happens to be 1 (first cluster).

Identify the rows which contain non-zero values:

non_zero_indices = (c1 >= 1e-3);

Each non-zero value is a contribution from some other column. We wish to identify the cluster to which those columns belong:

non_zero_labels = labels(non_zero_indices)
non_zero_labels =

     1     2     2

Notice how only one of the contributors is from the 1st cluster while the other two are from the 2nd cluster. Cross check this in the \(C\) matrix display above.

Verify if all the contributors are from the same cluster and store it in the spr_flags variable:

spr_flags(1) = all(non_zero_labels == k)
0

Next, let’s identify the columns which come from the same cluster as the current cluster:

w = labels == k;

Coefficients from same cluster are:

c1k = c1(w);

Subspace preserving representation error is given by:

spr_errors(1) = 1 - sum(c1k) / sum (c1)
>> spr_errors(1)

ans =

    0.6837

We provide a function which does this whole sequence of operations on all data points:

spr_stats = spx.cluster.subspace.subspace_preservation_stats(C, cluster_sizes);

The flags whether a representation is subspace preserving or not for each data point:

>> spr_stats.spr_flags

ans =

     0     1     0     1     1     1     0     1     1     0

Indicator if all representations are subspace preserving or not:

>> spr_stats.spr_flag
0

Data point wise subspace preserving representation error:

>> spr_stats.spr_errors

ans =

    0.6837    0.0000    0.7293    0.0000    0.0000    0.0000    0.6958    0.0000    0.0000    0.5264

Average representation error:

>> spr_stats.spr_error

ans =

    0.2635

This is about 26% error.

Percentage of data points having subspace preserving representations:

>> spr_stats.spr_perc

ans =

    60

Not too bad given that the number of data points was very small.

Complete example code can be downloaded here.

Sparse Subspace Clustering with MNIST Digits

In this section, we discuss using SSC algorithms on MNIST dataset.

MNIST dataset [LBBH98] contains gray scale images of handwritten digits 0-9. The dataset consists of \(60,000\) images. Following [YRV16], for each image, we compute a set of feature vectors using a scattering convolution network [BM13]. The feature vector is translation invariant and deformation stable. Each feature vector is of length \(3,472\). The feature vectors are available here.

MNIST Dataset

Please download the file MNIST_SC.mat and place it in sparse-plex/data/mnist directory.

We provide a wrapper class to load the data from this dataset:

md = spx.data.image.ChongMNISTDigits;

Beware, the whole dataset is 1GB in size and can take 10-20 seconds to load depending upon your system capability.

Let’s look at the structure md:

>> md
md =

  ChongMNISTDigits with properties:

                Y: [3472×60000 double]
           labels: [1×60000 double]
           digits: [0 1 2 3 4 5 6 7 8 9]
    cluster_sizes: [5923 6742 5958 6131 5842 5421 5918 6265 5851 5949]
  • The \(Y\) matrix contains one feature vector (as column) per example digit.
  • The labels array contains information about the digit represented in each column of \(Y\).
  • cluster_sizes is the number of examples of each digit in this dataset.

Seeing some labels:

>> md.labels(1:10)

ans =

     5     0     4     1     9     2     1     3     1     4

Number of examples of digit 5:

>> sum(md.labels == 5)

ans =

        5421

>> md.cluster_sizes(5+1)

ans =

        5421

The object md provides a method to find out the column indices for a given digit in the labels array. Let’s find all the indices for digit 4:

>> four_indices = md.digit_indices(4);
>> numel(four_indices)

ans =

        5842

Let’s check out some of these indices and verify them in the labels array:

>> four_indices(1:4)

ans =

     3    10    21    27

>> md.labels(four_indices(1:4))

ans =

     4     4     4     4

We can select a subset of samples from this dataset along with the labels as follows:

>> indices = [1 10 11 40];
>> [Y, labels] = md.selected_samples(indices);
>> labels

labels =

     5     4     3     6

SSC-OMP on MNIST Dataset

In this section, we will go through the steps of applying the SSC-OMP algorithm on the MNIST dataset.

We will work on all the digits:

digit_set = 0:9;

Number of samples for each digit:

num_samples_per_digit = 400;

Number of clusters or corresponding low dimensional subspaces:

K = length(digit_set);

Sizes of each cluster:

cluster_sizes = num_samples_per_digit*ones(1, K);

Let’s draw the chosen number of samples for each digit (400 here) from the MNIST dataset described above:

sample_list = [];
for k=1:K
    digit = digit_set(k);
    digit_indices = md.digit_indices(digit);
    num_digit_samples = length(digit_indices);
    choices = randperm(num_digit_samples, cluster_sizes(k));
    selected_indices = digit_indices(choices);
    sample_list = [sample_list selected_indices];
end

We have picked the column numbers of samples/examples for each digit and concatenated them into sample_list.

Time to pickup the samples from the dataset along with labels:

[Y, true_labels] = md.selected_samples(sample_list);

The feature vectors are 3472 dimensional. We don’t really need this much detail. We will perform PCA to reduce the dimension to 500:

fprintf('Performing PCA\n');
tstart = tic;
Y = spx.la.pca.low_rank_approx(Y, 500);
elapsed_time = toc (tstart);
fprintf('Time taken in PCA %.2f seconds\n', elapsed_time);
Performing PCA
Time taken in PCA 17.69 seconds

The ambient space dimension M and the number of data vectors S:

[M, S] = size(Y);

Time to perform sparse subspace clustering with orthogonal matching pursuit:

tstart = tic;
fprintf('Performing SSC OMP\n');
import spx.cluster.ssc.OMP_REPR_METHOD;
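% Note: D (the sparsity level / subspace dimension passed to SSC_OMP) is
% assumed to have been set beforehand; it is not defined in this snippet.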
solver = spx.cluster.ssc.SSC_OMP(Y, D, K, 1e-3, OMP_REPR_METHOD.FLIPPED_OMP_MATLAB);
solver.Quiet = true;
clustering_result = solver.solve();
elapsed_time = toc (tstart);
fprintf('Time taken in SSC-OMP %.2f seconds\n', elapsed_time);
Performing SSC OMP
Time taken in SSC-OMP 10.54 seconds

Let’s collect the statistics related to clustering error and subspace preserving representations error:

connectivity = clustering_result.connectivity;
% estimated number of clusters
estimated_num_subspaces = clustering_result.num_clusters;
% Time to compare the clustering
cluster_labels = clustering_result.labels;
fprintf('Measuring clustering error and accuracy\n');
comparsion_result = spx.cluster.clustering_error_hungarian_mapping(cluster_labels, true_labels, K);
clustering_error_perc = comparsion_result.error_perc;
clustering_acc_perc = 100 - comparsion_result.error_perc;
spr_stats = spx.cluster.subspace.subspace_preservation_stats(clustering_result.Z, cluster_sizes);
spr_error = spr_stats.spr_error;
spr_flag = spr_stats.spr_flag;
spr_perc = spr_stats.spr_perc;
fprintf('\nclustering error: %0.2f %% , clustering accuracy: %0.2f %%\n, mean spr error: %0.4f preserving : %0.2f %%\n, connectivity: %0.2f, elapsed time: %0.2f sec',...
    clustering_error_perc, clustering_acc_perc,...
    spr_stats.spr_error, spr_stats.spr_perc,...
    connectivity, elapsed_time);
fprintf('\n\n');

Results

Measuring clustering error and accuracy

clustering error: 6.42 % , clustering accuracy: 93.58 %
, mean spr error: 0.3404 preserving : 0.00 %
, connectivity: -1.00, elapsed time: 10.54 sec

SSC-OMP on MNIST Benchmarks

The table below reports the performance of the SSC-OMP algorithm on the MNIST dataset. The data consists of a randomly chosen set of images for each of the 10 digits. Scattering network features are extracted from the images and projected to dimension 500 using PCA. The number of images per digit is varied from 50 to 400 across experiments.

Images per Digit      a%       e%       t
50                  82.18    42.11    0.36
80                  87.39    39.79    0.81
100                 87.20    38.86    1.11
150                 89.16    37.33    2.02
200                 89.68    36.39    3.25
300                 92.19    35.18    6.27
400                 91.13    34.26    7.07

Benchmarks on SSC-MC-OMP

The section describing SSC-MC-OMP algorithm is under development.

Here we report the benchmarks using the SSC-MC-OMP algorithm.

Clustering Accuracy a%

Images per Digit     1-4      2.1-4    42.1-4   2-4
50                  82.18    82.87    82.68    83.81
80                  87.39    87.14    85.34    86.82
100                 87.20    87.47    86.75    89.17
150                 89.16    89.15    88.06    89.09
200                 89.68    90.23    88.17    88.31
300                 92.19    91.18    87.80    91.89
400                 91.13    91.52    90.16    91.50

Subspace Preserving Representation Error e%

Images per Digit     1-4      2.1-4    42.1-4   2-4
50                  42.11    41.63    41.46    41.00
80                  39.79    39.10    38.85    38.19
100                 38.86    38.12    37.80    37.06
150                 37.33    36.56    36.11    35.19
200                 36.39    35.50    34.99    34.00
300                 35.18    34.15    33.59    32.60
400                 34.26    33.26    32.70    31.57

Time t

Images per Digit     1-4      2.1-4    42.1-4    2-4
50                   2.07     3.26      5.95     9.22
80                   3.57     6.22     11.67    15.77
100                  4.71     8.39     15.88    20.61
150                  8.97    15.98     30.88    37.71
200                 13.50    24.94     48.13    57.25
300                 30.50    56.81    120.77   121.76
400                 50.38    95.76    177.78   192.57

Yale Faces Dataset

Loading the faces:

yf = spx.data.image.YaleFaces();
yf.load();

Number of subjects:

ns = yf.num_subjects();

Images to load per subject:

ni = yf.ImagesToLoadPerSubject;

Images of a particular subject:

Y = yf.get_subject_images(i);

Resized images of a particular subject:

Y = yf.get_subject_images_resized(i)

Total images:

yf.num_total_images()

Size of image in pixels:

yf.image_size()

Image by global index across all subjects:

yf.get_image_by_glob_idx(index)

Resize all images in buffer:

yf.resize_all(width, height)
yf.resize_all(42, 48);

Describe the contents of the database:

yf.describe()

Create a canvas of images randomly chosen from all subjects:

canvas = yf.create_random_canvas();
imshow(canvas);
colormap(gray);
axis image;
axis off;

Creating a canvas for a particular subject:

yf.resize_all(42, 48);
canvas = yf.create_subject_canvas(1);
imshow(canvas);
colormap(gray);
axis image;
axis off;

Pick ten random images from each subject:

images = yf.training_set_a()

Dictionary Learning

THIS CHAPTER IS NOT DEVELOPED YET.

Dictionary Learning

UNDER DEVELOPMENT

Set Theory

Introduction

This chapter is a background material on basic concepts in set theory and basic properties of real numbers.

We look at

  • Basic properties of sets
  • Concept of a function
  • Cartesian products
  • Relations
  • Notion of order in sets
  • Countable and uncountable sets

Concepts are developed sequentially, with each concept building upon previously defined ones. Examples are added wherever suitable for better understanding. In the examples, we frequently use sets of real numbers, natural numbers, and integers. Some of their properties may not have been defined before they are used in examples. This has been done keeping in mind that the reader has an intuitive understanding of these numbers and that the examples are easier to visualize.

The presentation in this chapter largely follows [AB98].

Sets

In this section we will review basic concepts of set theory.

Definition
A set is a collection of objects viewed as a single entity.

Actually, it’s not a formal definition. It is just a working definition which we will use going forward.

  • Sets are denoted by capital letters.
  • Objects in a set are called members, elements or points.
  • \(x \in A\) means that element \(x\) belongs to set \(A\).
  • \(x \notin A\) means that \(x\) doesn’t belong to set \(A\).
  • \(\{ a,b,c\}\) denotes a set with elements \(a\), \(b\), and \(c\). Their order is not relevant.
Definition
A set with only one element is known as a singleton set.
Definition
Two sets \(A\) and \(B\) are said to be equal (\(A=B\)) if they have precisely the same elements. i.e. if \(x \in A\) then \(x \in B\) and vice versa. Otherwise they are not equal (\(A \neq B\)).
Definition
A set \(A\) is called a subset of another set \(B\) if every element of \(A\) belongs to \(B\). This is denoted as \(A \subseteq B\). Formally \(A \subseteq B \iff (x \in A \implies x \in B)\).

Clearly, \(A = B \iff (A \subseteq B \text{ and } B \subseteq A)\).

Definition
If \(A \subseteq B\) and \(A \neq B\) then \(A\) is called a proper subset of \(B\) denoted by \(A \subset B\).
Definition
A set without any elements is called the empty or void set. It is denoted by \(\EmptySet\).
Definition

We define fundamental set operations below

  • The union \(A \cup B\) of \(A\) and \(B\) is defined as
\[A \cup B = \{ x : x \in A \text{ or } x \in B\}.\]
  • The intersection \(A \cap B\) of \(A\) and \(B\) is defined as
\[A \cap B = \{ x : x \in A \text{ and } x \in B\}.\]
  • The difference \(A \setminus B\) of \(A\) and \(B\) is defined as
\[A \setminus B = \{ x : x \in A \text{ and } x \notin B\}.\]
Definition
\(A\) and \(B\) are called disjoint if \(A \cap B = \EmptySet\).

Some useful identities

  • \((A \cup B) \cap C = (A \cap C) \cup (B \cap C)\).
  • \((A \cap B) \cup C = (A \cup C) \cap (B \cup C)\).
  • \((A \cup B) \setminus C = (A \setminus C) \cup (B \setminus C)\).
  • \((A \cap B) \setminus C = (A \setminus C) \cap (B \setminus C)\).
Definition

Symmetric difference between \(A\) and \(B\) is defined as

\[A \Delta B = ( A \setminus B) \cup (B \setminus A)\]

i.e. the elements which are in \(A\) but not in \(B\) and the elements which are in \(B\) but not in \(A\).
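For example, with \(A = \{1, 2, 3\}\) and \(B = \{2, 3, 4\}\), we have

\[A \Delta B = (A \setminus B) \cup (B \setminus A) = \{1\} \cup \{4\} = \{1, 4\}.\]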

Family of sets

Definition
A family of sets is a nonempty set \(\mathcal{F}\) whose members are sets themselves. If for each element \(i\) of a non-empty set \(I\), a subset \(A_i\) of a fixed set \(X\) is assigned, then \(\{ A_i\}_{i \in I}\) (or \(\{ A_i : i \in I\}\) or simply \(\{A_i\}\)) denotes the family whose members are the sets \(A_i\). The set \(I\) is called the index set of the family and its members are known as indices.
Example: Index sets

Following are some examples of index sets

  • \(\{1,2,3,4\}\): the family consists of only 4 sets.
  • \(\{0,1,2,3\}\): the family again consists of only 4 sets but the indices are different.
  • \(\Nat\): the sets in the family are indexed by natural numbers; there are countably infinitely many of them.
  • \(\ZZ\): the sets in the family are indexed by integers; there are countably infinitely many of them.
  • \(\QQ\): the sets in the family are indexed by rational numbers; there are countably infinitely many of them.
  • \(\RR\): there are uncountably many sets in the family.

If \(\mathcal{F}\) is a family of sets, then by letting \(I=\mathcal{F}\) and \(A_i = i \quad \forall i \in I\), we can express \(\mathcal{F}\) in the form of \(\{ A_i\}_{i \in I}\).

Definition

Let \(\{ A_i\}_{i \in I}\) be a family of sets.

  • The union of the family is defined to be

    \[\bigcup_{i\in I} A_i = \{ x : \exists i \in I \text{ such that } x \in A_i\}.\]
  • The intersection of the family is defined to be

    \[\bigcap_{i \in I} A_i = \{ x : x \in A_i \quad \forall i \in I\}.\]

We will also use the simpler notation \(\bigcup A_i\), \(\bigcap A_i\) for denoting the union and intersection of the family.

If \(I =\Nat = \{1,2,3,\dots\}\) (the set of natural numbers), then we will denote union and intersection by \(\bigcup_{i=1}^{\infty}A_i\) and \(\bigcap_{i=1}^{\infty}A_i\).

We now have the generalized distributive law:

\[\begin{split}&\left ( \bigcup_{i\in I} A_i \right ) \cap B = \bigcup_{i\in I} \left ( A_i \cap B \right )\\ &\left ( \bigcap_{i\in I} A_i \right ) \cup B = \bigcap_{i\in I} \left ( A_i \cup B \right )\end{split}\]
Definition
A family of sets \(\{ A_i\}_{i \in I}\) is called pairwise disjoint if for each pair of distinct indices \(i, j \in I\), the sets \(A_i\) and \(A_j\) are disjoint, i.e. \(A_i \cap A_j = \EmptySet\).
Definition
The set of all subsets of a set \(A\) is called its power set and is denoted by \(\Power (A)\).

In the following \(X\) is a big fixed set (sort of a frame of reference) and we will be considering different subsets of it.

Let \(X\) be a fixed set. If \(P(x)\) is a property well defined for all \(x \in X\), then the set of all \(x\) for which \(P(x)\) is true is denoted by \(\{x \in X : P(x)\}\).

Definition
Let \(A\) be a set. Its complement w.r.t. a fixed set \(X\) is the set \(A^c = X \setminus A\).

We have

  • \((A^c)^c = A\).
  • \(A \cap A^c = \EmptySet\).
  • \(A \cup A^c = X\).
  • \(A\setminus B = A \cap B^c\).
  • \(A \subseteq B \iff B^c \subseteq A^c\).
  • \((A \cup B)^c = A^c \cap B^c\).
  • \((A \cap B)^c = A^c \cup B^c\).

Functions

Definition

A function from a set \(A\) to a set \(B\), in symbols \(f : A \to B\) (or \(A \xrightarrow{f} B\) or \(x \mapsto f(x)\)) is a specific rule that assigns to each element \(x \in A\) a unique element \(y \in B\).

We say that the element \(y\) is the value of the function \(f\) at \(x\) (or the image of \(x\) under \(f\)) and denote as \(f(x)\), that is, \(y = f(x)\).

We also sometimes say that \(y\) is the output of \(f\) when the input is \(x\).

The set \(A\) is called domain of \(f\). The set \(\{y \in B : \exists x \in A \text{ with } y = f(x)\}\) is called the range of \(f\).

Example: Dirichlet's unruly indicator function for rational numbers
\[\begin{split}g(x) = \left\{ \begin{array}{ll} 1 & \mbox{if $x \in \QQ$};\\ 0 & \mbox{if $x \notin \QQ$}. \end{array} \right.\end{split}\]

This function is not continuous anywhere on the real line.

Example: Absolute value function
\[\begin{split}| x | = \left\{ \begin{array}{ll} x & \mbox{if $x \geq 0$};\\ -x & \mbox{if $x < 0$}. \end{array} \right.\end{split}\]

This function is continuous but not differentiable at \(x=0\).

Definition
Two functions \(f : A \to B\) and \(g : A \to B\) are said to be equal, in symbols \(f = g\) if \(f(x) = g(x) \quad \forall x \in A\).
Definition
A function \(f : A \to B\) is called onto or surjective if the range of \(f\) is all of \(B\). i.e. for every \(y \in B\), there exists (at least one) \(x \in A\) such that \(y = f(x)\).
Definition
A function \(f : A \to B\) is called one-one or injective if \(x_1 \neq x_2 \implies f(x_1) \neq f(x_2)\).
Definition

Let \(f : X \to Y\) be a function. If \(A \subseteq X\), then image of \(A\) under \(f\) denoted as \(f(A)\) (a subset of \(Y\)) is defined by

\[f(A) = \{ y \in Y : \exists x \in A \text{ such that } y = f(x)\}.\]
Definition

If \(B\) is a subset of \(Y\) then the inverse image \(f^{-1}(B)\) of \(B\) under \(f\) is the subset of \(X\) defined by

\[f^{-1} (B) = \{ x \in X : f(x) \in B\}.\]

Let \(\{A_i\}_{i \in I}\) be a family of subsets of \(X\).

Let \(\{B_i\}_{i \in I}\) be a family of subsets of \(Y\).

Then the following results hold:

\[\begin{split}& f ( \cup_{i \in I} A_i) = \cup_{i \in I} f(A_i)\\ & f (\cap_{i \in I} A_i) \subseteq \cap_{i \in I} f(A_i) \\ & f^{-1} (\cup_{i \in I} B_i) = \cup_{i \in I}f^{-1} (B_i)\\ & f^{-1} (\cap_{i \in I} B_i) = \cap_{i \in I}f^{-1} (B_i)\\ & f^{-1}(B^c) = (f^{-1}(B))^c\end{split}\]
Definition

Given two functions \(f : X \to Y\) and \(g : Y \to Z\), their composition \(g \circ f\) is the function \(g \circ f : X \to Z\) defined by

\[(g \circ f)(x) = g(f(x)) \quad \forall x \in X.\]
Theorem
Given two one-one functions \(f : X \to Y\) and \(g : Y \to Z\), their composition \(g \circ f\) is one-one.
Proof
Let \(x_1, x_2 \in X\). We need to show that \(g(f(x_1)) = g(f(x_2)) \implies x_1 = x_2\). Since \(g\) is one-one, hence \(g(f(x_1)) = g(f(x_2)) \implies f(x_1) = f(x_2)\). Further, since \(f\) is one-one, hence \(f(x_1) = f(x_2) \implies x_1 = x_2\).
Theorem
Given two onto functions \(f : X \to Y\) and \(g : Y \to Z\), their composition \(g \circ f\) is onto.
Proof
Let \(z \in Z\). We need to show that there exists \(x \in X\) such that \(g(f(x)) = z\). Since \(g\) is onto, for every \(z \in Z\) there exists \(y \in Y\) such that \(z = g(y)\). Further, since \(f\) is onto, for every \(y \in Y\) there exists \(x \in X\) such that \(y = f(x)\). Combining the two, for every \(z \in Z\) there exists \(x \in X\) such that \(z = g(f(x))\).
Theorem
Given two one-one onto functions \(f : X \to Y\) and \(g : Y \to Z\), their composition \(g \circ f\) is one-one onto.
Proof
This is a direct result of combining the previous two theorems.
Definition

If a function \(f : X \to Y\) is one-one and onto, then for every \(y \in Y\) there exists a unique \(x \in X\) such that \(y = f(x)\). This unique element is denoted by \(f^{-1}(y)\). Thus a function \(f^{-1} : Y \to X\) can be defined by

\[f^{-1}(y) = x \text{ whenever } f(x) = y.\]

The function \(f^{-1}\) is called the inverse of \(f\).

We can see that \((f \circ f^{-1})(y) = y\) for all \(y \in Y\).

Also \((f^{-1} \circ f) (x) = x\) for all \(x \in X\).

Definition

We define an identity function on a set \(X\) as

\[\begin{split}\begin{aligned} I_X : &X \to X\\ & I_X(x) = x \quad \forall x \in X \end{aligned}\end{split}\]
Remark
The identity function is bijective.

Thus we have:

\[\begin{split}& f \circ f^{-1} = I_Y.\\ & f^{-1} \circ f = I_X.\end{split}\]

If \(f : X \to Y\) is one-one, then we can define a function \(g : X \to f(X)\) given by \(g(x) = f(x)\). This function is one-one and onto. Thus \(g^{-1}\) exists. We will use this idea to define an inverse function for a one-one function \(f\) as \(f^{-1} : f(X) \to X\) given by \(f^{-1}(y) = x\) whenever \(f(x) = y\), \(\Forall y \in f(X)\). Clearly \(f^{-1}\) so defined is one-one and onto between \(X\) and \(f(X)\).

Theorem
Given two one-one functions \(f : X \to Y\) and \(g : Y \to X\), there exists a one-one onto function \(h : X \to Y\).
Proof

Clearly, we can define a one-one onto function \(f^{-1} : f(X) \to X\) and another one-one onto function \(g^{-1} : g(Y) \to Y\). Let the two-sided sequence \(C_x\) be defined as

\[\dots, f^{-1} (g^{-1}(x)), g^{-1}(x), x , f(x), g(f(x)), f(g(f(x))), \dots.\]

Note that the elements in the sequence alternate between \(X\) and \(Y\). On the left side, the sequence stops whenever \(f^{-1}(y)\) or \(g^{-1}(x)\) is not defined. On the right side the sequence goes on infinitely.

We call the sequence an \(X\)-stopper if it stops at an element of \(X\), or a \(Y\)-stopper if it stops at an element of \(Y\). If any element on the left side repeats, then the sequence on the left will keep on repeating. We call the sequence doubly infinite if all the elements (on the left) are distinct, or cyclic if the elements repeat. Define \(Z = X \cup Y\). If an element \(z \in Z\) occurs in two sequences, then the two sequences must be identical by definition. Otherwise, the two sequences must be disjoint. Thus the sequences form a partition of \(Z\). All elements within one equivalence class of \(Z\) are reachable from each other through one such sequence. The elements from different sequences are not reachable from each other at all. Thus, we need to define a bijection between the elements of \(X\) and \(Y\) that belong to the same sequence, separately for each sequence.

For an \(X\)-stopper sequence \(C\), every element \(y \in C \cap Y\) is reachable via \(f\). Hence \(f\) serves as the bijection between the elements of \(X\) and \(Y\) in \(C\). For a \(Y\)-stopper sequence \(C\), every element \(x \in C \cap X\) is reachable via \(g\). Hence \(g\) serves as the bijection. For a cyclic or doubly infinite sequence \(C\), every element \(y \in C \cap Y\) is reachable via \(f\) and every element \(x \in C \cap X\) is reachable via \(g\). Thus either \(f\) or \(g\) can serve as the bijection.

Sequence

Definition

Any function \(x : \Nat \to X\), where \(\Nat = \{1,2,3,\dots\}\) is the set of natural numbers, is called a sequence of \(X\).

We say that \(x(n)\) denoted by \(x_n\) is the \(n^{\text{th}}\) term in the sequence.

We denote the sequence by \(\{ x_n \}\).

Note that sequence may have repeated elements and the order of elements in a sequence is important.

Definition
A subsequence of a sequence \(\{ x_n \}\) is a sequence \(\{ y_n \}\) for which there exists a strictly increasing sequence \(\{ k_n \}\) of natural numbers (i.e. \(1 \leq k_1 < k_2 < k_3 < \ldots)\) such that \(y_n = x_{k_n}\) holds for each \(n\).

Cartesian product

Definition

Let \(\{ A_i \}_{i \in I}\) be a family of sets. Then the Cartesian product \(\prod_{i \in I} A_i\) or \(\prod A_i\) is defined to be the set consisting of all functions \(f : I \to \cup_{i \in I}A_i\) such that \(x_i = f(i) \in A_i\) for each \(i \in I\).

Such a function is called a choice function and often denoted by \((x_i)_{i \in I}\) or simply by \((x_i)\).

If a family consists of two sets, say \(A\) and \(B\), then the Cartesian product of the sets \(A\) and \(B\) is designated by \(A \times B\). The members of \(A \times B\) are denoted as ordered pairs.

\[A \times B = \{ (a, b) : a \in A \text{ and } b \in B \}.\]

Similarly the Cartesian product of a finite family of sets \(\{ A_1, \dots, A_n\}\) is written as \(A_1 \times \dots \times A_n\) and its members are denoted as \(n\)-tuples, i.e.:

\[A_1 \times \dots \times A_n = \{(a_1, \dots, a_n) : a_i \in A_i \forall i = 1,\dots,n\}.\]

Note that \((a_1,\dots, a_n) = (b_1,\dots,b_n)\) if and only if \(a_i = b_i \forall i = 1,\dots,n\).

If \(A_1 = A_2 = \dots = A_n = A\), then it is standard to write \(A_1 \times \dots \times A_n\) as \(A^n\).

Example: \(A^n\)

Let \(A = \{ 0, +1, -1\}\).

Then \(A^2\) is

\[\begin{split}\{\\ &(0,0), (0,+1), (0,-1),\\ &(+1,0), (+1,+1), (+1,-1),\\ &(-1,0), (-1,+1), (-1,-1)\\ \}.\end{split}\]

And \(A^3\) is given by

\[\begin{split} \{\\ &(0,0,0), (0,0,+1), (0,0,-1),\\ &(0,+1,0), (0,+1,+1), (0,+1,-1),\\ &(0,-1,0), (0,-1,+1), (0,-1,-1),\\ &(+1,0,0), (+1,0,+1), (+1,0,-1),\\ &(+1,+1,0), (+1,+1,+1), (+1,+1,-1),\\ &(+1,-1,0), (+1,-1,+1), (+1,-1,-1),\\ &(-1,0,0), (-1,0,+1), (-1,0,-1),\\ &(-1,+1,0), (-1,+1,+1), (-1,+1,-1),\\ &(-1,-1,0), (-1,-1,+1), (-1,-1,-1)\\ &\}.\end{split}\]
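
For a finite set stored as a numeric vector, the ordered pairs of \(A^2\) can be enumerated in MATLAB. A minimal sketch (the rows come out in a different order than the listing above, but the same \(3^2 = 9\) pairs appear):

    A = [0 +1 -1];
    [X, Y] = ndgrid(A, A);      % all combinations of two coordinates drawn from A
    pairs = [X(:) Y(:)];        % each row is one ordered pair (a1, a2)
    size(pairs, 1)              % 9 ordered pairs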

If the family of sets \(\{A_i\}_{i \in I}\) satisfies \(A_i = A \forall i \in I\), then \(\prod_{i \in I} A_i\) is written as \(A^I\).

\[A^I = \{ f | f : I \to A\}.\]

i.e. \(A^I\) is the set of all functions from \(I\) to \(A\).

Example
  • Let \(A = \{0, 1\}\). \(A^{\RR}\) is a set of all functions on \(\RR\) which can take only one of the two values \(0\) or \(1\). \(A^{\Nat}\) is a set of all sequences of zeros and ones.
  • \(\RR^\RR\) is a set of all functions from \(\RR\) to \(\RR\).

Axiom of choice

If a Cartesian product is non-empty, then each \(A_i\) must be non-empty.

We can therefore ask: if each \(A_i\) is non-empty, is the Cartesian product \(\prod A_i\) non-empty?

An affirmative answer cannot be proven within the usual axioms of set theory.

This requires us to introduce the axiom of choice.

Axiom
Axiom of choice. If \(\{A_i\}_{i \in I}\) is a nonempty family of sets such that \(A_i\) is nonempty for each \(i \in I\), then \(\prod A_i\) is nonempty.

Another way to state the axiom of choice is:

Axiom
If \(\{A_i\}_{i \in I}\) is a nonempty family of pairwise disjoint sets such that \(A_i \neq \EmptySet\) for each \(i \in I\), then there exists a set \(E \subseteq \cup_{i \in I} A_i\) such that \(E \cap A_i\) consists of precisely one element for each \(i \in I\).

Relations

Definition

A binary relation on a set \(X\) is defined as a subset \(\mathcal{R}\) of \(X \times X\).

If \((x,y) \in \mathcal{R}\) then \(x\) is said to be in relation \(\mathcal{R}\) with \(y\). This is denoted by \(x \mathcal{R} y\).

The most interesting relations are equivalence relations.

Definition

A relation \(\mathcal{R}\) on a set \(X\) is said to be an equivalence relation if it satisfies the following properties:

  • \(x \mathcal{R} x\) for each \(x \in X\) (reflexivity).
  • If \(x \mathcal{R} y\) then \(y \mathcal{R} x\) (symmetry).
  • If \(x \mathcal{R} y\) and \(y \mathcal{R} z\) then \(x \mathcal{R} z\) (transitivity).

We can now introduce equivalence classes on a set.

Definition

Let \(\mathcal{R}\) be an equivalence relation on a set \(X\). Then the equivalence class determined by the element \(x \in X\) is denoted by \([x]\) and is defined as

\[[x] = \{ y \in X : x \mathcal{R} y\}\]

i.e. all elements in \(X\) which are related to \(x\).

We can now look at some properties of equivalence classes and relations.

Lemma
Any two equivalence classes are either disjoint or else they coincide.
Example: Equivalence classes

Let \(X\) be the set of integers \(\ZZ\). Let \(\mathcal{R}\) be defined as

\[x \mathcal{R} y \iff 2 \mid (x-y)\]

i.e. \(x\) and \(y\) are related if the difference of \(x\) and \(y\) given by \(x-y\) is divisible by \(2\).

Clearly the set of odd integers and the set of even integers form two disjoint equivalence classes.
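
A minimal MATLAB sketch of this partition, restricted to a finite window of integers for illustration:

    X = -5:5;                          % a finite window of integers
    evens = X(mod(X, 2) == 0);         % the equivalence class containing 0
    odds  = X(mod(X, 2) ~= 0);         % the equivalence class containing 1
    isempty(intersect(evens, odds))    % the classes are disjoint: true
    isequal(sort([evens odds]), X)     % together they cover X: true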

Lemma
Let \(\mathcal{R}\) be an equivalence relation on a set \(X\). Since \(x \in [x]\) for each \(x \in X\), there exists a family \(\{A_i\}_{i \in I}\) of pairwise disjoint sets (a family of equivalence classes) such that \(X = \cup_{i \in I} A_i\).
Definition

If a set \(X\) can be represented as a union of a family \(\{A_i\}_{i \in I}\) of pairwise disjoint sets i.e.

\[X = \cup_{i \in I} A_i\]

then we say that \(\{A_i\}_{i \in I}\) is a partition of \(X\).

A partition over a set \(X\) also defines an equivalence relation on it.

Lemma

If there exists a family \(\{A_i\}_{i \in I}\) of pairwise disjoint sets which partitions a set \(X\), (i.e. \(X = \cup_{i \in I} A_i\)), then by letting

\[\mathcal{R} = \{(x,y) \in X \times X : \exists i \in I \text{ such that } x, y \in A_i\}\]

an equivalence relation is defined on \(X\) whose equivalence classes are precisely the sets \(A_i\).

In words, the relation \(\mathcal{R}\) includes only those tuples \((x,y)\) from the Cartesian product \(X\times X\) for which there exists one set \(A_i\) in the family of sets \(\{A_i\}_{i \in I}\) such that both \(x\) and \(y\) belong to \(A_i\).

Order

Another important type of relation is an order relation.

Definition

A relation, denoted by \(\leq\), on a set \(X\) is said to be a partial order for \(X\) (or that \(X\) is partially ordered by \(\leq\)) if it satisfies the following properties:

  • \(x \leq x\) holds for every \(x \in X\) (reflexivity).
  • If \(x \leq y\) and \(y \leq x\), then \(x = y\) (antisymmetry).
  • If \(x \leq y\) and \(y \leq z\), then \(x \leq z\) (transitivity).

An alternative notation for \(x \leq y\) is \(y \geq x\).

Definition
A set equipped with a partial order is known as a partially ordered set.
Example: Partially ordered set

Consider a set \(A = \{1,2,3\}\). Consider the power set of \(A\) which is

\[X = \{\EmptySet, \{1\}, \{2\}, \{3\}, \{1,2\} , \{2,3\} , \{1,3\}, \{1,2,3\} \}.\]

Define a relation \(\mathcal{R}\) on \(X\) such that \(x \mathcal{R} y\) if \(x \subseteq y\).

Clearly

  • \(x \subseteq x \quad \forall x \in X\).
  • If \(x \subseteq y\) and \(y \subseteq x\) then \(x =y\).
  • If \(x \subseteq y\) and \(y \subseteq z\) then \(x \subseteq z\).

Thus the relation \(\mathcal{R}\) defines a partial order on the power set \(X\).

We can look at how elements are ordered within a set a bit more closely.

Definition

A subset \(Y\) of a partially ordered set \(X\) is called a chain if for every \(x, y \in Y\) either \(x \leq y\) or \(y \leq x\) holds.

A chain is also known as a totally ordered set.

  • In a partially ordered set \(X\), we don’t require that for every \(x,y \in X\), either \(x \leq y\) or \(y \leq x\) should hold. Thus there could be elements which are not connected by the order relation.
  • In a totally ordered set \(Y\), for every \(x,y \in Y\) we require that either \(x \leq y\) or \(y \leq x\).
  • If a set is totally ordered, then it is partially ordered also.
Example: Chain

Continuing from previous example consider a subset \(Y\) of \(X\) defined by

\[Y = \{\EmptySet, \{1\}, \{1,2\}, \{1,2,3\} \}.\]

Clearly for every \(x, y \in Y\), either \(x \subseteq y\) or \(y \subseteq x\) holds.

Hence \(Y\) is a chain or a totally ordered set within \(X\).

Example: More ordered sets
  • The set of natural numbers \(\Nat\) is totally ordered.
  • The set of integers \(\ZZ\) is totally ordered.
  • The set of real numbers \(\RR\) is totally ordered.
  • Suppose we define an order relation in the set of complex numbers as follows. Let \(x+jy\) and \(u+jv\) be two complex numbers. We say that
\[x+jy \leq u+jv \iff x \leq u \text{ and } y \leq v.\]

With this definition, the set of complex numbers \(\CC\) is partially ordered.

  • \(\RR\) is a totally ordered subset of \(\CC\) since the imaginary component is 0 for all real numbers in the complex plane.
  • In fact any line or a ray or a line segment in the complex plane represents a totally ordered set in the complex plane.

We can now define the notion of upper bounds in a partially ordered set.

Definition
If \(Y\) is a subset of a partially ordered set \(X\) such that \(y \leq u\) holds for all \(y \in Y\) and for some \(u \in X\), then \(u\) is called an upper bound of \(Y\).

Note that there can be more than one upper bounds of \(Y\). Upper bound is not required to be unique.

Definition
An element \(m \in X\) is called a maximal element whenever the relation \(m \leq x\) implies \(x = m\).

This means that there is no other element in \(X\) which is greater than \(m\).

A maximal element need not be unique. A partially ordered set may contain more than one maximal element.

Example: Maximal elements

Consider the following set

\[Z = \{\EmptySet, \{1\}, \{2\}, \{3\}, \{1,2\} , \{2,3\} , \{1,3\} \}.\]

The set is partially ordered w.r.t. the relation \(\subseteq\).

There are three maximal elements in this set namely \(\{1,2\} , \{2,3\} , \{1,3\}\).

Example: Ordered sets without a maximal element
  • The set of natural numbers \(\Nat\) has no maximal element.

What are the conditions under which a maximal element is guaranteed in a partially ordered set \(X\)?

The following statement, known as Zorn's lemma, guarantees the existence of maximal elements in certain partially ordered sets.

Lemma
If every chain in a partially ordered set \(X\) has an upper bound in \(X\), then \(X\) has a maximal element.

The following is the corresponding notion of lower bound.

Definition
If \(Y\) is a subset of a partially ordered set \(X\) such that \(u \leq y\) holds for all \(y \in Y\) and for some \(u \in X\), then \(u\) is called a lower bound of \(Y\).
Definition
An element \(m \in X\) is called a minimal element whenever the relation \(x \leq m\) implies \(x = m\).

As before there can be more than one minimal elements in a set.

Countable and uncountable sets

In this section, we deal with questions concerning the size of a set.

When do we say that two sets have same number of elements?

If we can find a one-to-one correspondence between two sets \(A\) and \(B\) then we can say that the two sets \(A\) and \(B\) have same number of elements.

In other words, if there exists a function \(f : A \to B\) that is one-to-one and onto (hence invertible), we say that \(A\) and \(B\) have same number of elements.

Definition
Two sets \(A\) and \(B\) are said to be equivalent (denoted as \(A \sim B\)) if there exists a function \(f : A \to B\) that is one-to-one and onto. When two sets are equivalent, we say that they have the same cardinality.

Note that two sets may be equivalent yet not equal to each other.

Example: Equivalent sets
  • The set of natural numbers \(\Nat\) is equivalent to the set of integers \(\ZZ\). Consider the function \(f : \Nat \to \ZZ\) given by

    \[\begin{split}f (n) = \left\{ \begin{array}{ll} (n - 1) / 2 & \mbox{if $n$ is odd};\\ -n / 2 & \mbox{if $n$ is even}. \end{array} \right.\end{split}\]

    It is easy to show that this function is one-one and onto; a small MATLAB sketch of this map appears after this list.

  • \(\Nat\) is equivalent to the set of even natural numbers \(E\). Consider the function \(f : \Nat \to E\) given by \(f(n) = 2n\). This is one-one and onto.

  • \(\Nat\) is equivalent to the set of rational numbers \(\QQ\).

  • The sets \(\{a, b, c\}\) and \(\{1,4, 9\}\) are equivalent but not equal.
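
A minimal MATLAB sketch of the map \(f : \Nat \to \ZZ\) from the first item, evaluated on the first few natural numbers; the output interleaves the non-negative and negative integers:

    f = @(n) (mod(n, 2) == 1) .* (n - 1) / 2 - (mod(n, 2) == 0) .* n / 2;
    f(1:10)    % 0  -1  1  -2  2  -3  3  -4  4  -5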

Theorem

Let \(A, B, C\) be sets. Then:

  1. \(A \sim A\).
  2. If \(A \sim B\), then \(B \sim A\).
  3. If \(A \sim B\), and \(B \sim C\), then \(A \sim C\).

Thus it is an equivalence relation.

Proof

(i). Construct a function \(f : A \to A\) given by \(f (a) = a \Forall a \in A\). This is a one-one and onto function. Hence \(A \sim A\).

(ii). It is given that \(A \sim B\). Thus, there exists a function \(f : A \to B\) which is one-one and onto. Thus, there exists an inverse function \(g : B \to A\) which is one-one and onto. Thus, \(B \sim A\).

(iii). It is given that \(A \sim B\) and \(B \sim C\). Thus there exist two one-one and onto functions \(f : A \to B\) and \(g : B \to C\). Define a function \(h : A \to C\) given by \(h = g \circ f\). Since the composition of bijective functions is bijective, \(h\) is one-one and onto. Thus, \(A \sim C\).

We now look closely at the set of natural numbers \(\Nat = \{1,2,3,\dots\}\).

Definition
Any subset of \(\Nat\) of the form \(\{1,\dots, n\}\) is called a segment of \(\Nat\). \(n\) is called the number of elements of the segment.

Clearly, two segments \(\{1,\dots,m\}\) and \(\{1,\dots,n\}\) are equivalent only if \(m= n\).

Thus a proper subset of a segment cannot be equivalent to the segment.

Definition
A set that is equivalent to a segment is called a finite set.

The number of elements of a set which is equivalent to a segment is equal to the number of elements in the segment.

The empty set is also considered to be finite with zero elements.

Definition
A set that is not finite is called an infinite set.

It should be noted that so far we have defined number of elements only for sets which are equivalent to a segment.

Definition
A set \(A\) is called countable if it is equivalent to \(\Nat\), i.e., if there exists a one-to-one correspondence of \(\Nat\) with the elements of \(A\).

A countable set \(A\) is usually written as \(A = \{a_1, a_2, \dots\}\) which indicates the one-to-one correspondence of \(A\) with the set of natural numbers \(\Nat\).

This notation is also known as the enumeration of \(A\).

Definition
An infinite set which is not countable is called an uncountable set.

With the definitions in place, we are now ready to study the connections between countable, uncountable and finite sets.

Theorem
Every infinite set contains a countable subset.
Proof
Let \(A\) be an infinite set. Clearly \(A \neq \EmptySet\). Pick an element \(a_1 \in A\). Consider \(A_1 = A \setminus \{a_1 \}\). Since \(A\) is infinite, hence \(A_1\) is nonempty. Pick an element \(a_2 \in A_1\). Clearly, \(a_2 \neq a_1\). Consider the set \(A_2 = A \setminus \{a_1, a_2 \}\). Again, by the same argument, since \(A\) is infinite, \(A_2\) is non-empty. We can pick \(a_3 \in A_2\). Proceeding in the same way we construct a set \(B = \{a_1, a_2, a_3, \dots \}\). The set is countable and by construction it is a subset of \(A\).
Theorem
Every nonempty subset of \(\Nat\) has a least element.
Theorem

If a subset \(S\) of \(\Nat\) satisfies the following properties:

  1. \(1 \in S\) and
  2. \(n \in S \implies n + 1 \in S\),

then \(S = \Nat\).

The principle of mathematical induction is applied as follows. We consider a set \(S = \{ n \in \Nat : n \mbox{ satisfies } P \}\) where \(P\) is some property that the members of this set satisfy. We show that \(1\) satisfies the property \(P\). Further, we show that if \(n\) satisfies property \(P\), then \(n + 1\) also has to satisfy \(P\). Then applying the principle of mathematical induction, we claim that \(S = \Nat\), i.e. every number \(n \in \Nat\) satisfies the property \(P\).

Theorem
Every subset of a countable set is either finite or countable. i.e. if \(A\) is countable and \(B \subseteq A\), then either \(B\) is finite or \(B \sim A\).
Proof
Let \(A\) be a countable set and \(B \subseteq A\). If \(B\) is finite, then there is nothing to prove. So we consider \(B\) as infinite and show that it is countable. Since \(A\) is countable, hence \(A \sim \Nat\). Thus, \(B\) is equivalent to a subset of \(\Nat\). Without loss of generality, let us assume that \(B\) is a subset of \(\Nat\). We now construct a mapping \(f : \Nat \to B\) as follows. Let \(b_1\) be the least element of \(B\) (which exists due to well ordering principle). We assign \(f(1) = b_1\). Now, let \(b_2\) be the least element of \(B \setminus \{ b_1\}\). We assign \(f(2) = b_2\). Similarly, assuming that \(f(1) = b_1, f(2) = b_2, \dots , f(n) = b_n\) has been assigned, we assign \(f(n+1) =\) the least element of \(B \setminus \{b_1, \dots, b_n\}\). This least element again exists due to well ordering principle. This completes the definition of \(f\) using the principle of mathematical induction. It is easy to show that the function is one-one and onto. This proves that \(B \sim \Nat\).

We present different characterizations of a countable set.

Theorem

Let \(A\) be an infinite set. The following are equivalent:

  1. \(A\) is countable.
  2. There exists a subset \(B\) of \(\Nat\) and a function \(f: B \to A\) that is onto.
  3. There exists a function \(g : A \to \Nat\) that is one-one.
Proof

(i) \(\implies\) (ii). Since \(A\) is countable, there exists a function \(f : \Nat \to A\) which is one-one and onto. Choosing \(B = \Nat\), we get the result.

(ii) \(\implies\) (iii). We are given that there exists a subset \(B\) of \(\Nat\) and a function \(f: B \to A\) that is onto. For some \(a \in A\), consider \(f^{-1}(a) = \{ b \in B : f(b) = a \}\). Since \(f\) is onto, \(f^{-1}(a)\) is non-empty. Since \(f^{-1}(a)\) is a set of natural numbers, it has a least element due to the well ordering principle. Further, if \(a_1, a_2 \in A\) are distinct, then \(f^{-1}(a_1)\) and \(f^{-1}(a_2)\) are disjoint and the corresponding least elements are distinct. Assign \(g(a) = \text{ least element of } f^{-1}(a) \Forall a \in A\). Such a function is well defined by construction. Clearly, the function is one-one.

(iii) \(\implies\) (i). We are given that there exists a function \(g : A \to \Nat\) that is one-one. Clearly, \(A \sim g(A)\) where \(g(A) \subseteq \Nat\). Since \(A\) is infinite, \(g(A)\) is also infinite. By the earlier result that every subset of a countable set is either finite or countable, \(g(A)\) is countable, implying \(g(A) \sim \Nat\). Thus, \(A \sim g(A) \sim \Nat\) and \(A\) is countable.

Theorem

Let \(\{A_1, A_2, \dots \}\) be a countable family of sets where each \(A_i\) is a countable set. Then

\[A = \bigcup_{i=1}^{\infty} A_i\]

is countable.

Proof
Let \(A_n = \{a_1^n, a_2^n, \dots\} \Forall n \in \Nat\). Further, let \(B = \{2^k 3^n : k, n \in \Nat \}\). Note that every element of \(B\) is a natural number, hence \(B \subseteq \Nat\). Since \(B\) is an infinite subset of \(\Nat\), it is countable by the earlier result on subsets of countable sets, i.e. \(B \sim \Nat\). We note that if \(b_1 = 2^{k_1} 3^{n_1}\) and \(b_2 = 2^{k_2} 3^{n_2}\), then \(b_1 = b_2\) if and only if \(k_1 = k_2\) and \(n_1 = n_2\). Now define a mapping \(f : B \to A\) given by \(f (2^k 3^n) = a^n_k\) (picking the \(k\)-th element from the \(n\)-th set). Clearly, \(f\) is well defined and onto. Thus, by the characterization of countable sets above, \(A\) is countable.
Theorem
Let \(\{A_1, A_2, \dots, A_n \}\) be a finite collection of sets such that each \(A_i\) is countable. Then their Cartesian product \(A = A_1 \times A_2 \times \dots \times A_n\) is countable.
Proof

Let \(A_i = \{a_1^i, a_2^i, \dots\} \Forall 1 \leq i \leq n\). Choose \(n\) distinct prime numbers \(p_1, p_2, \dots, p_n\). Consider the set \(B = \{p_1^{k_1}p_2^{k_2} \dots p_n^{k_n} : k_1, k_2, \dots, k_n \in \Nat \}\). Clearly, \(B \subset \Nat\). Define a function \(f : A \to \Nat\) as

\[f (a^1_{k_1}, a^2_{k_2}, \dots, a^n_{k_n}) = p_1^{k_1}p_2^{k_2} \dots p_n^{k_n}.\]

By the fundamental theorem of arithmetic, every natural number has a unique prime factorization. Thus, \(f\) is one-one. Invoking the characterization of countable sets via one-one maps into \(\Nat\), \(A\) is countable.

Theorem
The set of rational numbers \(\QQ\) is countable.
Proof
Let \(\frac{p}{q}\) be a positive rational number with \(p > 0\) and \(q > 0\) having no common factor. Consider the mapping \(f(\frac{p}{q}) = 2^p 3^q\). This is a one-one mapping into the natural numbers. Hence, by the characterization of countable sets via one-one maps into \(\Nat\), the set of positive rational numbers is countable. Similarly, the set of negative rational numbers is countable. Since \(\QQ\) is the union of the positive rationals, the negative rationals and \(\{0\}\), it is countable by the result on countable unions of countable sets.
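
A minimal numerical illustration of the injection used in this proof, with a few hypothetical rationals \(p/q\) in lowest terms:

    p = [1 1 2 3 5];  q = [2 3 3 4 7];    % the rationals 1/2, 1/3, 2/3, 3/4, 5/7
    codes = 2.^p .* 3.^q                  % 18  54  108  648  69984
    numel(unique(codes)) == numel(codes)  % distinct rationals get distinct codes: true
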
Theorem
The set of all finite subsets of \(\Nat\) is countable.
Proof

Let \(F\) denote the set of finite subsets of \(\Nat\). Let \(f \in F\). Then we can write \(f = \{n_1, \dots, n_k\}\) where \(k\) is the number of elements in \(f\). Consider the sequence of prime numbers \(\{p_n\}\) where \(p_n\) denotes \(n\)-th prime number. Now, define a mapping \(g : F \to \Nat\) as

\[g (f ) = \prod_{i=1}^k p_{n_i}.\]

The mapping \(g\) is one-one, since the prime decomposition of a natural number is unique. Hence, invoking the characterization of countable sets via one-one maps into \(\Nat\), \(F\) is countable.

Corollary
The set of all finite subsets of a countable set is countable.
Definition
We say that \(A \preceq B\) whenever there exists a one-one function \(f : A \to B\). In other words, \(A\) is equivalent to a subset of \(B\).

In this sense, \(B\) has at least as many elements as \(A\).

Theorem

The relation \(\preceq\) satisfies following properties

  1. \(A \preceq A\) for all sets \(A\).
  2. If \(A \preceq B\) and \(B \preceq C\), then \(A \preceq C\).
  3. If \(A \preceq B\) and \(B \preceq A\), then \(A \sim B\).
Proof

(i). We can use the identity function \(f (a ) = a \Forall a \in A\).

(ii). Straightforward application of the result that composition of injective functions is injective.

(iii). Straightforward application of Schröder-Bernstein theorem.

Theorem
If \(A\) is a set, then \(A \preceq \Power (A)\) and \(A \nsim \Power (A)\).
Proof

If \(A = \EmptySet\), then \(\Power(A) = \{ \EmptySet\}\) and the result is trivial. So, let us consider non-empty \(A\). We can choose \(f : A \to \Power(A)\) given by \(f (x) = \{ x\} \Forall x \in A\). This is clearly a one-one function leading to \(A \preceq \Power (A)\).

Now, for the sake of contradiction, let us assume that \(A \sim \Power (A)\). Then, there exists a bijective function \(g : A \to \Power(A)\). Consider the set \(B = \{ a \in A : a \notin g(a) \}\). Since \(B \subseteq A\), and \(g\) is bijective, there exists \(a \in A\) such that \(g (a) = B\).

Now if \(a \in B\) then \(a \notin g(a) = B\). And if \(a \notin B\), then \(a \in g(a) = B\). This is impossible, hence \(A \nsim \Power(A)\).

Definition
For every set \(A\) a symbol (playing the role of a number) can be assigned that designates the number of elements in the set. This number is known as the cardinal number of the set and is denoted by \(\Card{A}\) or \(| A |\). It is also known as cardinality.

Note that the cardinal numbers are different from natural numbers, real numbers etc. If \(A\) is finite, with \(A = \{a_1, a_2, \dots, a_n \}\), then \(\Card{A} = n\). We use the symbol \(\aleph_0\) to denote the cardinality of \(\Nat\). By saying \(A\) has the cardinality of \(\aleph_0\), we simply mean that \(A \sim \Nat\).

If \(a\) and \(b\) are two cardinal numbers, then by \(a \leq b\), we mean that there exist two sets \(A\) and \(B\) such that \(\Card{A} = a\), \(\Card{B} = b\) and \(A \preceq B\). By \(a < b\), we mean that \(A \preceq B\) and \(A \nsim B\). \(a \leq b\) and \(b \leq a\) guarantees that \(a = b\).

It can be shown that \(\Power(\Nat) \sim \RR\). The cardinality of \(\RR\) is denoted by \(\mathfrak{c}\).

Definition
A cardinal number \(a\) satisfying \(\aleph_0 \leq a\) is known as infinite cardinal number.
Definition
The cardinality of \(\RR\) denoted by \(\mathfrak{c}\) is known as the cardinality of the continuum.
Theorem
Let \(2 = \{ 0, 1 \}\). Then \(2^X \sim \Power (X)\) for every set \(X\).
Proof

\(2^X\) is the set of all functions \(f : X \to 2\), i.e. a function from \(X\) to \(\{ 0, 1 \}\) which can take only one of the two values \(0\) and \(1\).

Define a function \(g : \Power (X) \to 2^X\) as follows. Let \(y \in \Power(X)\). Then \(g(y)\) is a function \(f : X \to \{ 0, 1 \}\) given by

\[\begin{split}f(x) = \left\{ \begin{array}{ll} 1 & \mbox{if $x \in y$};\\ 0 & \mbox{if $x \notin y$}. \end{array} \right.\end{split}\]

The function \(g\) is one-one and onto. Thus \(2^X \sim \Power(X)\).
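
The bijection in this proof can be made concrete for a small set: each subset of a 3-element set corresponds to one of the \(2^3\) indicator functions. A minimal MATLAB sketch (the element names are arbitrary):

    X = {'a', 'b', 'c'};
    n = numel(X);
    for k = 0:2^n - 1
        ind = bitget(k, 1:n);    % one indicator function f : X -> {0, 1}
        disp(X(ind == 1));       % the corresponding subset {x in X : f(x) = 1}
    end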

We denote the cardinal number of \(\Power(X)\) by \(2^{\Card{X}}\). Thus, \(\mathfrak{c} = 2^{\aleph_0}\).

The following inequalities of cardinal numbers hold:

\[0 < 1 < 2 < \dots < n < \dots < \aleph_0 < 2^{\aleph_0} = \mathfrak{c} < 2^{\mathfrak{c}} < 2^{2^{\mathfrak{c}}} < \dots.\]

Linear Algebra

Vector Spaces

Algebraic structures

In mathematics, the term algebraic structure refers to an arbitrary set with one or more operations defined on it. Simpler algebraic structures include groups, rings, and fields. More complex algebraic structures like vector spaces are built on top of the simpler structures. We will develop the notion of vector spaces as a progression of these algebraic structures.

Groups

A group is a set with a single binary operation. It is one of the simplest algebraic structures.

Definition
Let \(G\) be a set and let \(*\) be a binary operation defined on \(G\) as:
\[\begin{split}\begin{aligned} * : &G \times G \to G\\ &(g_1, g_2) \to * (g_1, g_2) \\ &\triangleq g_1 * g_2 \end{aligned}\end{split}\]

such that the binary operation \(*\) satisfies following requirements.

  1. [Closure] The set \(G\) is closed under the binary operation \(*\). i.e.

    \[\forall g_1, g_2 \in G, g_1 * g_2 \in G.\]
  2. [Associativity] For every \(g_1, g_2, g_3 \in G\)

    \[g_1 * (g_2 * g_3) = (g_1 * g_2) * g_3\]
  3. [Identity element] There exists an element \(e \in G\) such that

    \[g * e = e * g = g \quad \forall g \in G\]
  4. [Inverse element] For every \(g \in G\) there exists an element \(g^{-1} \in G\) such that

    \[g * g^{-1} = g^{-1} * g = e\]

Then the set \(G\) together with the operator \(*\) denoted as \((G, *)\) is known as a group.

Above requirements are known as group axioms. Note that commutativity is not a requirement of a group.

In the sequel we will write \(g_1 * g_2\) as \(g_1 g_2\).

Commutative groups

A commutative group is a richer structure than a group. Its elements also satisfy commutativity property.

Definition

Let \((G, *)\) be a group such that it satisfies

  • [Commutativity] For every \(g_1, g_2 \in G\)
\[g_1 g_2 = g_2 g_1\]

Then \((G,*)\) is known as a commutative group or an Abelian group.

In the sequel we may simply write a group \((G, *)\) as \(G\) when the underlying operation \(*\) is clear from context.

Rings

A ring is a set with two binary operations defined over it with some requirements as described below.

Definition

Let \(R\) be a set with two binary operations \(+\) (addition) and \(\cdot\) (multiplication) defined over it as:

\[\begin{split}\begin{aligned} + : &R \times R \to R\\ &(r_1, r_2) \to r_1 + r_2 \end{aligned}\end{split}\]
\[\begin{split}\begin{aligned} \cdot : &R \times R \to R\\ &(r_1, r_2) \to r_1 \cdot r_2 \end{aligned}\end{split}\]

such that \((R, +, \cdot)\) satisfies following requirements:

  1. \((R, +)\) is an Abelian group.

  2. \(R\) is closed under multiplication.

    \[r_1 \cdot r_2 \in R \quad \forall r_1, r_2 \in R\]
  3. Multiplication is associative.

    \[r_1 \cdot (r_2 \cdot r_3) = (r_1 \cdot r_2) \cdot r_3 \quad \forall r_1, r_2, r_3 \in R\]
  4. Multiplication distributes over addition.

    \[\begin{split}\begin{aligned} &r_1 \cdot (r_2 + r_3) = (r_1 \cdot r_2) + (r_1 \cdot r_3) \quad \forall r_1, r_2, r_3 \in R\\ &(r_1 + r_2) \cdot r_3 = (r_1 \cdot r_3) + (r_2 \cdot r_3) \quad \forall r_1, r_2, r_3 \in R \end{aligned}\end{split}\]

Then \((R, +, \cdot)\) is known as an associative ring.

We denote the identity element for \(+\) as \(0\) and call it additive identity.

In the sequel we will write \(r_1 \cdot r_2\) as \(r_1 r_2\).

We may simply write a ring \((R, +, \cdot)\) as \(R\) when the underlying operations \(+,\cdot\) are clear from context.

There is a hierarchy of ring like structures. In particular we mention:

  • Associative ring with identity
  • Field
Definition

Let \((R, +, \cdot)\) be an associative ring such that it satisfies following additional requirement:

  • There exists an element \(1 \in R\) (known as multiplicative identity) such that

    \[1 \cdot r = r \cdot 1 = r \quad \forall r \in R\]

Then \((R, +, \cdot)\) is known as an associative ring with identity.

Fields

Field is the richest algebraic structure on one set with two operations.

Definition

Let \(F\) be a set with two binary operations \(+\) (addition) and \(\cdot\) (multiplication) defined over it as:

\[\begin{split}\begin{aligned} + : &F \times F \to F\\ &(x_1, x_2) \to x_1 + x_2 \end{aligned}\end{split}\]
\[\begin{split}\begin{aligned} \cdot : &F \times F \to F\\ &(x_1, x_2) \to x_1 \cdot x_2 \end{aligned}\end{split}\]

such that \((F, +, \cdot)\) satisfies following requirements:

  1. \((F, +)\) is an Abelian group (with additive identity as \(0 \in F\)).

  2. \((F \setminus \{0\}, \cdot)\) is an Abelian group (with multiplicative identity as \(1 \in F\)).

  3. Multiplication distributes over addition:

    \[\alpha \cdot (\beta + \gamma) = (\alpha \cdot \beta) + (\alpha \cdot \gamma) \quad \forall \alpha, \beta, \gamma \in F\]

Then \((F, +, \cdot)\) is known as a field.

Example: Fields
  • The set of real numbers \(\RR\) is a field.
  • The set of complex numbers \(\CC\) is a field.
  • The Galois field GF-2 is the set \(\{ 0, 1 \}\) with modulo-2 addition and multiplication; a small sketch of its operation tables follows this list.
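
A minimal sketch of the two GF-2 operations, tabulated over all pairs of elements with modulo-2 arithmetic:

    x = [0 0 1 1];  y = [0 1 0 1];    % all pairs of elements of the field
    mod(x + y, 2)                     % addition:        0 1 1 0
    mod(x .* y, 2)                    % multiplication:  0 0 0 1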

Vector space

We are now ready to define a vector space. A vector space involves two sets. One set \(\VV\) contains the vectors. The other set \(\mathrm{F}\) (a field) contains scalars which are used to scale the vectors.

Definition
A set \(\VV\) is called a vector space over the field \(\mathrm{F}\) (or an \(\mathrm{F}\)-vector space) if there exist two mappings
\[\begin{split}\begin{aligned} + : &\VV \times \VV \to \VV\\ &(v_1, v_2) \to v_1 + v_2 \quad v_1, v_2 \in \VV \end{aligned}\end{split}\]
\[\begin{split}\begin{aligned} \cdot : &\mathrm{F} \times \VV \to \VV\\ &(\alpha, v) \to \alpha \cdot v \triangleq \alpha v \quad \alpha \in \mathrm{F}; v \in \VV \end{aligned}\end{split}\]

which satisfy following requirements:

  1. \((\VV, +)\) is an Abelian group.

  2. Scalar multiplication \(\cdot\) distributes over vector addition \(+\):

    \[\alpha (v_1 + v_2) = \alpha v_1 + \alpha v_2 \quad \forall \alpha \in \mathrm{F}; \forall v_1, v_2 \in \VV.\]
  3. Scalar multiplication \(\cdot\) distributes over addition in \(\mathrm{F}\):

    \[( \alpha + \beta) v = (\alpha v) + (\beta v) \quad \forall \alpha, \beta \in \mathrm{F}; \forall v \in \VV.\]
  4. Multiplication in \(\mathrm{F}\) commutes over scalar multiplication:

    \[(\alpha \beta) \cdot v = \alpha \cdot (\beta \cdot v) = \beta \cdot (\alpha \cdot v) = (\beta \alpha) \cdot v \quad \forall \alpha, \beta \in \mathrm{F}; \forall v \in \VV.\]
  5. Scalar multiplication from multiplicative identity \(1 \in \mathrm{F}\) satisfies the following:

    \[1 v = v \quad \forall v \in \VV.\]

Some remarks are in order:

  • \(\VV\) as defined above is also known as an \(\mathrm{F}\) vector space.
  • Elements of \(\VV\) are known as vectors.
  • Elements of \(\mathrm{F}\) are known as scalars.
  • There are two \(0\) involved: \(0 \in \mathrm{F}\) and \(0 \in \VV\). It should be clear from context which \(0\) is being referred to.
  • \(0 \in \VV\) is known as the zero vector.
  • All vectors in \(\VV \setminus \{0\}\) are non-zero vectors.
  • We will typically denote elements of \(\mathrm{F}\) by \(\alpha, \beta, \dots\).
  • We will typically denote elements of \(\VV\) by \(v_1, v_2, \dots\).

We quickly look at some vector spaces which will appear again and again in our discussions.

Example: N-tuples as a vector space

Let \(\mathrm{F}\) be some field.

The set of all \(N\)-tuples \((a_1, a_2, \dots, a_N)\) with \(a_1, a_2, \dots, a_N \in \mathrm{F}\) is denoted as \(\mathrm{F}^N\). This is a vector space with the operations of coordinate-wise addition and scalar multiplication.

Let \(u, v \in \mathrm{F}^N\) with

\[u = (u_1, \dots, u_N)\]

and

\[v = (v_1, \dots, v_N).\]

Addition is defined as

\[u + v \triangleq (u_1 + v_1, \dots, u_N + v_N).\]

Let \(c \in \mathrm{F}\). Scalar multiplication is defined as

\[c u \triangleq (c u_1, \dots, c u_N).\]

\(u, v\) are called equal if \(u_1 = v_1, \dots, u_N = v_N\).

In matrix notation, vectors in \(\mathrm{F}^N\) are also written as row vectors

\[u = \begin{bmatrix} u_1 & \dots & u_N \end{bmatrix}\]

or column vectors

\[\begin{split}u = \begin{bmatrix} u_1 \\ \vdots \\ u_N \end{bmatrix}\end{split}\]
Example: Matrices

Let \(\mathrm{F}\) be some field. A matrix is an array of the form

\[\begin{split}\begin{bmatrix} a_{11} & a_{12} & \dots & a_{1N} \\ a_{21} & a_{22} & \dots & a_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ a_{M 1} & a_{M 2} & \dots & a_{MN} \\ \end{bmatrix}\end{split}\]

with \(M\) rows and \(N\) columns where \(a_{ij} \in \mathrm{F}\).

The set of these matrices is denoted as \(\mathrm{F}^{M \times N}\) which is a vector space with operations of matrix addition and scalar multiplication.

Let \(A, B \in \mathrm{F}^{M \times N}\). Matrix addition is defined by

\[(A + B)_{ij} \triangleq A_{ij} + B_{ij}.\]

Let \(c \in \mathrm{F}\). Scalar multiplication is defined by

\[(cA)_{ij} \triangleq c A_{ij}.\]
Example: Polynomials

Let \(\mathrm{F}[x]\) denote the set of all polynomials with coefficients drawn from field \(\mathrm{F}\). i.e. if \(f(x) \in \mathrm{F}[x]\), then it can be written as

\[f(x) = a_n x^n + a_{n-1}x^{n -1} + \dots + a_1 x + a_0\]

where \(a_i \in \mathrm{F}\).

The set \(\mathrm{F}[x]\) is a vector space with usual operations of addition and scalar multiplication

\[f(x) + g(x) = (a_n + b_n)x^n + \dots + (a_1 + b_1 ) x + (a_0 + b_0).\]
\[c f(x) = c a_n x^n + \dots + c a_1 x + c a_0.\]

Some useful results are presented without proof.

Theorem
Let \(\VV\) be an \(\mathrm{F}\) vector space. Let \(x, y, z\) be some vectors in \(\VV\) such that \(x + z = y + z\). Then \(x = y\).

This is known as the cancellation law of vector spaces.

Corollary
The \(0\) vector in a vector space \(\VV\) is unique.
Corollary
The additive inverse of a vector \(x\) in \(\VV\) is unique.
Theorem

In a vector space \(\VV\) the following statements are true

  • \(0x = 0 \Forall x \in \VV\).
  • \((-a)x = - (ax) = a(-x) \Forall a \in \mathrm{F} \text{ and } x \in \VV\).
  • \(a 0 = 0 \Forall a \in \mathrm{F}\).

Linear independence

Definition

A linear combination of two vectors \(v_1, v_2 \in \VV\) is defined as

\[\alpha v_1 + \beta v_2\]

where \(\alpha, \beta \in \mathrm{F}\).

A linear combination of \(p\) vectors \(v_1,\dots, v_p \in \VV\) is defined as

\[\sum_{i=1}^{p} \alpha_i v_i\]

where \(\alpha_i \in \mathrm{F}\).
Definition

Let \(\VV\) be a vector space and let \(S\) be a nonempty subset of \(\VV\). A vector \(v \in \VV\) is called a linear combination of vectors of \(S\) if there exist a finite number of vectors \(s_1, s_2, \dots, s_n \in S\) and scalars \(a_1, \dots, a_n\) in \(\mathrm{F}\) such that

\[v = a_1 s_1 + a_2 s_2 + \dots + a_n s_n.\]

We also say that \(v\) is a linear combination of \(s_1, s_2, \dots, s_n\) and \(a_1, a_2, \dots, a_n\) are the coefficients of linear combination.

Note that \(0\) is a trivial linear combination of any subset of \(\VV\).

Note that linear combination may refer to the expression itself or its value. e.g. two different linear combinations may have same value.

Note that a linear combination always consists of a finite number of vectors.

Definition

A finite set of non-zero vectors \(\{v_1, \cdots, v_p\} \subset \VV\) is called linearly dependent if there exist \(\alpha_1,\dots,\alpha_p \in \mathrm{F}\) not all \(0\) such that

\[\sum_{i=1}^{p} \alpha_i v_i = 0.\]
Definition

A set \(S \subseteq \VV\) is called linearly dependent if there exist a finite number of distinct vectors \(u_1, u_2, \dots, u_n \in S\) and scalars \(a_1, a_2, \dots, a_n \in \mathrm{F}\) not all zero, such that

\[a_1 u_1 + a_2 u_2 + \dots + a_n u_n = 0.\]
Definition
A set \(S \subseteq \VV\) is called linearly independent if it is not linearly dependent.
Definition

More specifically a finite set of non-zero vectors \(\{v_1, \cdots, v_n\} \subset \VV\) is called linearly independent if

\[\sum_{i=1}^{n} \alpha_i v_i = 0 \implies \alpha_i = 0 \Forall 1 \leq i \leq n.\]
Example: Linearly dependent and independent sets
  • The empty set is linearly independent.
  • A set of a single non-zero vector \(\{v\}\) is always linearly independent. Prove!
  • If two vectors are linearly dependent, we say that they are collinear.
  • Alternatively if two vectors are linearly independent, we say that they are not collinear.
  • If a set \(\{v_1, \cdots, v_p\}\) is linearly independent, then any subset of it will be linearly independent. Prove!
  • Adding another vector \(v\) to the set may make it linearly dependent. When?
  • It is possible to have an infinite set to be linearly independent. Consider the set of polynomials \(\{1, x, x^2, x^3, \dots\}\). This set is infinite, yet linearly independent.
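
For vectors in \(\RR^N\) stored as the columns of a matrix, a standard numerical check of linear independence is to compare the rank of the matrix with the number of columns. A minimal sketch with hypothetical vectors (rank is computed in floating point, so this is a numerical test, not a proof):

    V = [1 0 1; 0 1 1; 0 0 0];    % columns v1, v2, v3; here v3 = v1 + v2
    rank(V) == size(V, 2)         % false: the columns are linearly dependent
    W = [1 0; 0 1; 1 1];          % two columns in R^3
    rank(W) == size(W, 2)         % true: the columns are linearly independent
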
Theorem
Let \(\VV\) be a vector space. Let \(S_1 \subseteq S_2 \subseteq \VV\). If \(S_1\) is linearly dependent, then \(S_2\) is linearly dependent.
Corollary
Let \(\VV\) be a vector space. Let \(S_1 \subseteq S_2 \subseteq \VV\). If \(S_2\) is linearly independent, then \(S_1\) is linearly independent.

Span

Vectors can be combined to form other vectors. It makes sense to consider the set of all vectors which can be created by combining a given set of vectors.

Definition

Let \(S \subset \VV\) be a subset of vectors. The span of \(S\) denoted as \(\langle S \rangle\) or \(\Span(S)\) is the set of all possible linear combinations of vectors belonging to \(S\).

\[\Span(S) \triangleq \langle S \rangle \triangleq \{ v \in \VV : v = \sum_{i=1}^{p} \alpha_i v_i \quad \text{for some} \quad v_i \in S; \alpha_i \in \mathrm{F}; p \in \mathbb{N}\}\]

For convenience we define \(\Span(\EmptySet) = \{ 0 \}\).

Span of a finite set of vectors \(\{v_1, \cdots, v_p\}\) is denoted by \(\langle v_1, \cdots, v_p \rangle\).

\[\langle v_1, \cdots, v_p \rangle = \left \{\sum_{i=1}^{p} \alpha_i v_i | \alpha_i \in \mathrm{F} \right \}.\]

We say that a set of vectors \(S \subseteq \VV\) spans \(\VV\) if \(\langle S \rangle = \VV\).
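
Membership in a span can also be tested numerically: \(x \in \langle v_1, \dots, v_p \rangle\) exactly when appending \(x\) as an extra column does not increase the rank. A minimal sketch with hypothetical vectors in \(\RR^3\):

    v1 = [1; 0; 1];  v2 = [0; 1; 1];
    S = [v1 v2];
    x = [2; 3; 5];                % x = 2 v1 + 3 v2, so x lies in the span
    y = [1; 1; 1];                % y does not lie in the span
    rank([S x]) == rank(S)        % true
    rank([S y]) == rank(S)        % false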

Lemma
Let \(S \subseteq \VV\), then \(\Span (S) \subseteq \VV\).
Definition

Let \(S \subset \VV\). We say that \(S\) spans (or generates) \(\VV\) if

\[\langle S \rangle = \VV.\]

In this case we also say that vectors of \(S\) span (or generate) \(\VV\).

Theorem
Let \(S\) be a linearly independent subset of a vector space \(\VV\) and let \(v \in \VV \setminus S\). Then \(S \cup \{ v \}\) is linearly dependent if and only if \(v \in \Span(S)\).

Basis

Definition
A set of linearly independent vectors \(\mathcal{B}\) is called a basis of \(\VV\) if \(\langle \mathcal{B} \rangle = \VV\), i.e. \(\mathcal{B}\) spans \(\VV\).
Example: Basis examples
  • Since \(\Span(\EmptySet) = \{ 0 \}\) and \(\EmptySet\) is linearly independent, \(\EmptySet\) is a basis for the zero vector space \(\{ 0 \}\).
  • The basis \(\{ e_1, \dots, e_N\}\) with \(e_1 = (1, 0, \dots, 0)\), \(e_2 = (0, 1, \dots, 0)\), \(\dots\), \(e_N = (0, 0, \dots, 1)\), is called the standard basis for \(\mathrm{F}^N\).
  • The set \(\{1, x, x^2, x^3, \dots\}\) is the standard basis for \(\mathrm{F}[x]\). Indeed, an infinite basis. Note that though the basis itself is infinite, yet every polynomial \(p \in \mathrm{F}[x]\) is a linear combination of finite number of elements from the basis.

We review some properties of bases.

Theorem

Let \(\VV\) be a vector space and \(\mathcal{B} = \{ v_1, v_2, \dots, v_n\}\) be a subset of \(\VV\). Then \(\mathcal{B}\) is a basis for \(\VV\) if and only if each \(v \in \VV\) can be uniquely expressed as a linear combination of vectors of \(\mathcal{B}\):

\[v = a_1 v_1 + a_2 v_2 + \dots + a_n v_n\]

for unique scalars \(a_1, \dots, a_n\).

This theorem states that a basis \(\mathcal{B}\) provides a unique representation to each vector \(v \in \VV\), where the representation is defined as the \(n\)-tuple \((a_1, a_2, \dots, a_n)\).

If the basis is infinite, then the above theorem needs to be modified as follows:

Theorem

Let \(\VV\) be a vector space and \(\mathcal{B}\) be a subset of \(\VV\). Then \(\mathcal{B}\) is a basis for \(\VV\) if and only if each \(v \in \VV\) can be uniquely expressed as a linear combination of vectors of \(\mathcal{B}\):

\[v = a_1 v_1 + a_2 v_2 + \dots + a_n v_n\]

for unique scalars \(a_1, \dots, a_n\) and unique vectors \(v_1, v_2, \dots v_n \in \mathcal{B}\).

Theorem
If a vector space \(\VV\) is spanned by a finite set \(S\), then some subset of \(S\) is a basis for \(\VV\). Hence \(\VV\) has a finite basis.
Theorem

Let \(\VV\) be a vector space that is spanned by a set \(G\) containing exactly \(n\) vectors. Let \(L\) be a linearly independent subset of \(\VV\) containing exactly \(m\) vectors.

Then \(m \leq n\) and there exists a subset \(H\) of \(G\) containing exactly \(n-m\) vectors such that \(L \cup H\) spans \(\VV\).

Corollary
Let \(\VV\) be a vector space having a finite basis. Then every basis for \(\VV\) contains the same number of vectors.
Definition

A vector space \(\VV\) is called finite-dimensional if it has a basis consisting of a finite number of vectors. This unique number of vectors in any basis \(\mathcal{B}\) of the vector space \(\VV\) is called the dimension or dimensionality of the vector space. It is denoted as \(\dim \VV\). We say:

\[\dim \VV \triangleq |\mathcal{B}|\]

If \(\VV\) is not finite-dimensional, then we say that \(\VV\) is infinite-dimensional.

Example: Vector space dimensions
  • Dimension of \(\mathrm{F}^N\) is \(N\).
  • Dimension of \(\mathrm{F}^{M \times N}\) is \(MN\).
  • The vector space of polynomials \(\mathrm{F}[x]\) is infinite dimensional.
Lemma

Let \(\VV\) be a vector space with dimension \(n\).

  1. Any finite spanning set for \(\VV\) contains at least \(n\) vectors, and a spanning set that contains exactly \(n\) vectors is a basis for \(\VV\).
  2. Any linearly independent subset of \(\VV\) that contains exactly \(n\) vectors is a basis for \(\VV\).
  3. Every linearly independent subset of \(\VV\) can be extended to a basis for \(\VV\).
Definition
For a finite dimensional vector space \(\VV\), an ordered basis for \(\VV\) is a basis for \(\VV\) with a specific order. In other words, it is a finite sequence of linearly independent vectors in \(\VV\) that spans \(\VV\).

Typically we will write an ordered basis as \(\BBB = \{ v_1, v_2, \dots, v_n\}\) and assume that the basis vectors are ordered in the order they appear.

With the help of an ordered basis, we can define a coordinate vector.

Definition

Let \(\BBB = \{ v_1, \dots, v_n\}\) be an ordered basis for \(\VV\), and for \(x \in \VV\), let \(\alpha_1, \dots, \alpha_n\) be unique scalars such that

\[x = \sum_{i=1}^n \alpha_i v_i.\]

The coordinate vector of \(x\) relative to \(\BBB\) is defined as

\[\begin{split}[x]_{\BBB} = \begin{bmatrix} \alpha_1\\ \vdots\\ \alpha_n \end{bmatrix}.\end{split}\]
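
In \(\RR^n\), if the vectors of an ordered basis are placed as the columns of a matrix \(B\), then the coordinate vector solves the linear system \(B [x]_{\BBB} = x\). A minimal MATLAB sketch with a hypothetical basis of \(\RR^2\):

    B = [1 1; 1 -1];    % columns form an ordered basis of R^2
    x = [3; 1];
    c = B \ x           % coordinate vector [x]_B, here [2; 1]
    B * c               % reconstructs x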

Subspace

Definition

Let \(\WW\) be a subset of \(\VV\). Then \(\WW\) is called a subspace if \(\WW\) is a vector space in its own right under the same vector addition \(+\) and scalar multiplication \(\cdot\) operations, i.e.

\[\begin{split}\begin{aligned} + : &\WW \times \WW \to \WW\\ &(w_1, w_2) \to w_1 + w_2 \quad w_1, w_2 \in \WW \end{aligned}\end{split}\]
\[\begin{split}\begin{aligned} \cdot : &\mathrm{F} \times \WW \to \WW\\ &(\alpha, w) \to \alpha \cdot w \triangleq \alpha w \quad \alpha \in \mathrm{F}; w \in \WW \end{aligned}\end{split}\]

are defined by restricting \(+ : \VV \times \VV \to \VV\) and \(\cdot : \mathrm{F} \times \VV \to \VV\) to \(\WW\), and \(\WW\) is closed under these operations.

Example: Subspaces
  • \(\VV\) is a subspace of \(\VV\).
  • \(\{0\}\) is a subspace of any \(\VV\).
Theorem

A subset \(\WW \subseteq \VV\) is a subspace of \(\VV\) if and only if

  • \(0 \in\WW\)
  • \(x + y \in\WW\) whenever \(x, y \in\WW\)
  • \(\alpha x \in\WW\) whenever \(\alpha \in \mathrm{F}\) and \(x \in\WW\).
Example: Symmetric matrices

A matrix \(M \in \mathrm{F}^{N \times N}\) is symmetric if

\[M^T = M.\]

The set of symmetric matrices forms a subspace of the set of all \(N \times N\) matrices.

Example: Diagonal matrices

A matrix \(M\) is called diagonal if \(M_{ij} = 0\) whenever \(i \neq j\).

The set of diagonal matrices is a subspace of \(\mathrm{F}^{M \times N}\).
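
The subspace criteria above can be spot-checked numerically for the set of symmetric matrices; a minimal sketch with arbitrary \(2 \times 2\) matrices (an illustration, not a proof):

    A = [1 2; 2 3];  B = [0 5; 5 -1];    % two symmetric matrices
    alpha = 2.5;
    isequal((A + B)', A + B)             % the sum is symmetric: true
    isequal((alpha * A)', alpha * A)     % a scalar multiple is symmetric: true
    isequal(zeros(2)', zeros(2))         % the zero matrix is symmetric: true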

Theorem
Any intersection of subspaces of a vector space \(\VV\) is a subspace of \(\VV\).

We note that a union of subspaces is not necessarily a subspace, since it is not closed under addition.

Theorem

The span of a set \(S \subset \VV\) given by \(\langle S \rangle\) is a subspace of \(\VV\).

Moreover any subspace of \(\VV\) that contains \(S\) must also contain the span of \(S\).

This theorem is quite useful. It allows us to construct subspaces from a given basis.

Let \(\mathcal{B}\) be a basis of an \(n\) dimensional space \(\VV\). There are \(n\) vectors in \(\mathcal{B}\). We can create \(2^n\) distinct subsets of \(\mathcal{B}\). Thus we can construct \(2^n\) distinct subspaces of \(\VV\).

Choosing some other basis lets us construct another set of subspaces.

An \(n\)-dimensional vector space has infinitely many bases. Correspondingly, there are infinitely many possible subspaces.

If \(W_1\) and \(W_2\) are two subspaces of \(\VV\) then we say that \(W_1\) is smaller than \(W_2\) if \(W_1 \subset W_2\).

Theorem

Let \(\WW\) be the smallest subspace containing vectors \(\{ v_1, \dots, v_p \}\). Then

\[\WW = \langle v_1, \dots, v_p \rangle.\]

i.e. \(\WW\) is same as the span of \(\{ v_1, \dots, v_p \}\).

Theorem

Let \(\WW\) be a subspace of a finite-dimensional vector space \(\VV\). Then \(\WW\) is finite dimensional and

\[\dim \WW \leq \dim \VV.\]

Moreover, if

\[\dim \WW = \dim \VV,\]

then \(\WW = \VV\).

Corollary
If \(\WW\) is a subspace for a finite-dimensional vector space \(\VV\) then any basis for \(\WW\) can be extended to a basis for \(\VV\).
Definition

Let \(\VV\) be a finite dimensional vector space and \(\WW\) be a subspace of \(\VV\). The codimension of \(\WW\) is defined as

\[\text{codim} \WW = \dim \VV - \dim \WW.\]

Linear transformations

In this section, we will be using symbols \(\VV\) and \(\WW\) to represent arbitrary vector spaces over a field \(\FF\). Unless specified the two vector spaces won’t be related in any way.

Following results can be restated for more general situations where \(\VV\) and \(\WW\) are defined over different fields, but we will assume that they are defined over the same field \(\FF\) for simplicity of discourse.

Definition

We call a map \(\TT : \VV \to \WW\) a linear transformation from \(\VV\) to \(\WW\) if for all \(x, y \in \VV\) and \(\alpha \in \FF\), we have

  • \(\TT(x + y) = \TT(x) + \TT(y)\) and
  • \(\TT(\alpha x) = \alpha \TT(x)\)

A linear transformation is also known as a linear map or a linear operator. Usually when the domain (\(\VV\)) and co-domain (\(\WW\)) for a linear transformation are same, then the term linear operator is used.

Remark
If \(\TT\) is linear then \(\TT(0) = 0\).

This is straightforward since

\[\TT(0 + 0) = \TT(0) + \TT(0) \implies \TT(0) = \TT(0) + \TT(0) \implies \TT(0) = 0.\]
Lemma
\(\TT\) is linear \(\iff \TT(\alpha x + y) = \alpha \TT(x) + \TT(y) \Forall x, y \in \VV, \alpha \in \FF\)
Proof

Assuming \(\TT\) to be linear we have

\[\TT(\alpha x + y) = \TT(\alpha x) + \TT(y) = \alpha \TT(x) + \TT(y).\]

Now for the converse, assume

\[\TT(\alpha x + y) = \alpha \TT(x) + \TT(y) \Forall x, y \in \VV, \alpha \in \FF.\]

Choosing both \(x\) and \(y\) to be 0 and \(\alpha=1\) we get

\[\TT(0 + 0) = \TT(0) + \TT(0) \implies \TT(0) = 0.\]

Choosing \(y=0\) we get

\[\TT(\alpha x + 0) = \alpha \TT(x) + \TT(0) = \alpha \TT(x).\]

Choosing \(\alpha = 1\) we get

\[\TT(x + y) = \TT(x) + \TT(y).\]

Thus \(\TT\) is a linear transformation.

Remark
If \(\TT\) is linear then \(\TT(x - y) = \TT(x) - \TT(y)\)
\[\TT(x - y) = \TT(x + (-1)y) = \TT(x) + \TT((-1)y) = \TT(x) +(-1)\TT(y) = \TT(x) - \TT(y).\]
Remark

\(\TT\) is linear \(\iff\) for \(x_1, \dots, x_n \in \VV\) and \(\alpha_1, \dots, \alpha_n \in \FF\),

\[\TT\left (\sum_{i=1}^{n} \alpha_i x_i \right ) = \sum_{i=1}^{n} \alpha_i \TT(x_i).\]

We can use mathematical induction to prove this.

Some special linear transformations need mention.

Definition

The identity transformation \(\mathrm{I}_{\VV} : \VV \to \VV\) is defined as

\[\mathrm{I}_{\VV}(x) = x, \Forall x \in \VV.\]
Definition

The zero transformation \(\mathrm{0} : \VV \to \WW\) is defined as

\[0(x) = 0, \Forall x \in \VV.\]

In this definition the symbol \(0\) takes on multiple meanings: on the left hand side it denotes a linear transformation from \(\VV\) to \(\WW\), while on the right hand side it denotes the \(0\) vector in \(\WW\).

From the context usually it should be obvious whether we are talking about \(0 \in \FF\) or \(0 \in \VV\) or \(0 \in \WW\) or \(0\) as a linear transformation from \(\VV\) to \(\WW\).

Null space and range

Definition

The null space or kernel of a linear transformation \(\TT : \VV \to \WW\) denoted by \(\NullSpace(\TT)\) or \(\Kernel(\TT)\) is defined as

\[\Kernel(\TT) = \NullSpace(\TT) \triangleq \{ x \in \VV : \TT(x) = 0\}\]
Theorem
The null space of a linear transformation \(\TT : \VV \to \WW\) is a subspace of \(\VV\).
Proof

Let \(v_1, v_2 \in \Kernel(\TT)\). Then

\[\TT(\alpha v_1 + v_2) = \alpha \TT(v_1) + \TT(v_2) = \alpha 0 + 0 = 0.\]

Thus \(\alpha v_1 + v_2 \in \Kernel(\TT)\). Thus \(\Kernel(\TT)\) is a subspace of \(\VV\).

Definition

The range or image of a linear transformation \(\TT : \VV \to \WW\) denoted by \(\Range(\TT)\) or \(\Image(\TT)\) is defined as

\[\Range(\TT) = \Image(\TT) \triangleq \{\TT(x) \Forall x \in \VV \}.\]

We note that \(\Image(\TT) \subseteq \WW\).

Theorem
The image of a linear transformation \(\TT : \VV \to \WW\) is a subspace of \(\WW\).
Proof

Let \(w_1, w_2 \in \Image(\TT)\). Then there exist \(v_1, v_2 \in \VV\) such that

\[w_1 = \TT(v_1); w_2 = \TT(v_2).\]

Thus

\[\alpha w_1 + w_2 = \alpha \TT(v_1) + \TT(v_2) = \TT(\alpha v_1 + v_2).\]

Thus \(\alpha w_1 + w_2 \in \Image(\TT)\). Hence \(\Image(\TT)\) is a subspace of \(\WW\).

Theorem

Let \(\TT : \VV \to \WW\) be a linear transformation. Let \(\mathcal{B} = \{v_1, v_2, \dots, v_n\}\) be some basis of \(\VV\). Then

\[\Image(\TT) = \langle \TT(\mathcal{B}) \rangle = \langle\{\TT(v_1), \TT(v_2), \dots, \TT(v_n) \} \rangle.\]

i.e. The image of a basis of \(\VV\) under a linear transformation \(\TT\) spans the range of the transformation.

Proof

Let \(w\) be some arbitrary vector in \(\Image(\TT)\). Then there exists \(v \in \VV\) such that \(w = \TT(v)\). Now

\[v = \sum_{i=1}^n c_i v_i\]

since \(\mathcal{B}\) forms a basis for \(\VV\).

Thus

\[w = \TT(v) = \TT(\sum_{i=1}^n c_i v_i) = \sum_{i=1}^n c_i(\TT(v_i)).\]

This means that \(w \in \langle \TT(\mathcal{B}) \rangle\).

Definition

For vector spaces \(\VV\) and \(\WW\) and linear \(\TT : \VV \to \WW\), if \(\Kernel(\TT)\) is finite dimensional then the nullity of \(\TT\) is defined as

\[\Nullity(\TT) = \dim \Kernel(\TT)\]

i.e. the dimension of the null space or kernel of \(\TT\).

Definition

For vector spaces \(\VV\) and \(\WW\) and linear \(\TT : \VV \to \WW\), if \(\Image(\TT)\) is finite dimensional then the rank of \(\TT\) is defined as

\[\Rank(\TT) = \dim \Image(\TT)\]

i.e. the dimension of the range or image of \(\TT\).

Theorem

For vector spaces \(\VV\) and \(\WW\) and linear \(\TT : \VV \to \WW\) if \(\VV\) is finite dimensional, then

\[\dim \VV = \Nullity(\TT) + \Rank(\TT).\]

This is known as the dimension theorem.
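
The dimension theorem is easy to check numerically when \(\TT\) is given by a matrix acting on \(\RR^n\). Below is a minimal sketch in plain MATLAB (no sparse-plex functions assumed; the matrix sizes are arbitrary):

% a 5x8 matrix of rank 3, viewed as a linear map from R^8 to R^5
A = randn(5, 3) * randn(3, 8);
n = size(A, 2);          % dimension of the domain V = R^8
r = rank(A);             % Rank(T), the dimension of the image
Z = null(A);             % orthonormal basis for the kernel of T
nullity = size(Z, 2);    % Nullity(T), the dimension of the kernel
% dimension theorem: dim V = Nullity(T) + Rank(T)
assert(n == nullity + r);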

Theorem
For vector spaces \(\VV\) and \(\WW\) and linear \(\TT : \VV \to \WW\), \(\TT\) is one-one if and only if \(\Kernel(\TT) = \{ 0\}\).
Proof

If \(\TT\) is one-one, then

\[v_1 \neq v_2 \implies \TT(v_1) \neq \TT(v_2).\]

Let \(v \neq 0\). Since \(\TT(0) = 0\) and \(\TT\) is one-one, we have \(\TT(v) \neq \TT(0) = 0\). Thus \(\Kernel(\TT) = \{ 0\}\).

For converse let us assume that \(\Kernel(\TT) = \{ 0\}\). Let \(v_1, v_2 \in V\) be two vectors in \(V\) such that

\[\begin{split}&\TT(v_1) = \TT(v_2) \\ \implies &\TT(v_1 - v_2) = 0 \\ \implies &v_1 - v_2 \in \Kernel(\TT)\\ \implies &v_1 - v_2 = 0 \\ \implies &v_1 = v_2.\end{split}\]

Thus \(\TT\) is one-one.

Theorem

For vector spaces \(\VV\) and \(\WW\) of equal finite dimensions and linear \(\TT : \VV \to \WW\), the following are equivalent.

  1. \(\TT\) is one-one.
  2. \(\TT\) is onto.
  3. \(\Rank(\TT) = \dim (\VV)\).
Proof

From (1) to (2)

Let \(\mathcal{B} = \{v_1, v_2, \dots v_n \}\) be some basis of \(\VV\) with \(\dim \VV = n\).

Let us assume that \(\TT(\mathcal{B})\) are linearly dependent. Thus there exists a linear relationship

\[\sum_{i=1}^{n}\alpha_i \TT(v_i) = 0\]

where \(\alpha_i\) are not all 0.

Now

\[\begin{split}&\sum_{i=1}^{n}\alpha_i \TT(v_i) = 0 \\ \implies &\TT\left(\sum_{i=1}^{n}\alpha_i v_i\right) = 0\\ \implies &\sum_{i=1}^{n}\alpha_i v_i \in \Kernel(\TT)\\ \implies &\sum_{i=1}^{n}\alpha_i v_i = 0\end{split}\]

since \(\TT\) is one-one. This means that \(v_i\) are linearly dependent. This contradicts our assumption that \(\mathcal{B}\) is a basis for \(\VV\).

Thus \(\TT(\mathcal{B})\) are linearly independent.

Since \(\TT\) is one-one, all vectors in \(\TT(\mathcal{B})\) are distinct, hence

\[| \TT(\mathcal{B}) | = n.\]

Since the vectors in \(\TT(\mathcal{B})\) span \(\Image(\TT)\) and are linearly independent, they form a basis of \(\Image(\TT)\). But

\[\dim \VV = \dim \WW = n\]

and \(\TT(\mathcal{B})\) are a set of \(n\) linearly independent vectors in \(\WW\).

Hence \(\TT(\mathcal{B})\) form a basis of \(\WW\). Thus

\[\Image(\TT) = \langle \TT(\mathcal{B}) \rangle = \WW.\]

Thus \(\TT\) is onto.

From (2) to (3): \(\TT\) being onto means \(\Image(\TT) = \WW\), thus

\[\Rank(\TT) = \dim \WW = \dim \VV.\]

From (3) to (1): We know that

\[\dim \VV = \Rank(\TT) + \Nullity(\TT).\]

But it is given that \(\Rank(\TT) = \dim \VV\). Thus

\[\Nullity(\TT) = 0.\]

Thus \(\TT\) is one-one.

Bracket operator

Recall the definition of coordinate vector from here. Conversion of a given vector to its coordinate vector representation can be shown to be a linear transformation.

Definition

Let \(\VV\) be a finite dimensional vector space over a field \(\FF\) where \(\dim \VV = n\). Let \(\BBB = \{ v_1, \dots, v_n\}\) be an ordered basis in \(\VV\). We define a bracket operator from \(\VV\) to \(\FF^n\) as

\[\begin{split}\begin{aligned} \Bracket_{\BBB} : &\VV \to \FF^n\\ & x \to [x]_{\BBB}\\ & \triangleq \begin{bmatrix} \alpha_1\\ \vdots\\ \alpha_n \end{bmatrix} \end{aligned}\end{split}\]

where

\[x = \sum_{i=1}^n \alpha_i v_i.\]

In other words, the bracket operator takes a vector \(v\) from a finite dimensional space \(\VV\) to its representation in \(\FF^n\) for a given basis \(\BBB\).

We now show that the bracket operator is linear.

Theorem

Let \(\VV\) be a finite dimensional vector space over a field \(\FF\) where \(\dim \VV = n\). Let \(\BBB = \{ v_1, \dots, v_n\}\) be an ordered basis in \(\VV\). The bracket operator \(\Bracket_{\BBB} : \VV \to \FF^n\) as defined here is a linear operator.

Moreover \(\Bracket_{\BBB}\) is a one-one and onto mapping.

Proof

Let \(x, y \in \VV\) such that

\[x = \sum_{i=1}^n \alpha_i v_i.\]

and

\[y = \sum_{i=1}^n \beta_i v_i.\]

Then

\[c x + y = c \sum_{i=1}^n \alpha_i v_i + \sum_{i=1}^n \beta_i v_i = \sum_{i=1}^n (c \alpha_i + \beta_i ) v_i.\]

Thus

\[\begin{split}[c x + y]_{\BBB} = \begin{bmatrix} c \alpha_1 + \beta_1 \\ \vdots\\ c \alpha_n + \beta_n \end{bmatrix} = c \begin{bmatrix} \alpha_1 \\ \vdots\\ \alpha_n \end{bmatrix} + \begin{bmatrix} \beta_1 \\ \vdots\\ \beta_n \end{bmatrix} = c [x]_{\BBB} + [y]_{\BBB}.\end{split}\]

Thus \(\Bracket_{\BBB}\) is linear.

We can see that by definition \(\Bracket_{\BBB}\) is one-one. Now since \(\dim \VV = n = \dim \FF^n\), hence \(\Bracket_{\BBB}\) is onto due to here.
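
When \(\VV = \RR^n\) and the basis vectors are collected as the columns of an invertible matrix, applying the bracket operator amounts to solving a linear system. A small sketch in plain MATLAB (the basis below is made up for illustration):

% an ordered basis of R^3, stored as the columns of B
B = [1 1 0; 0 1 1; 1 0 1];
x = [2; 3; 4];           % a vector in R^3
c = B \ x;               % its coordinate vector [x]_B
% reconstruct x from its coordinates: x = sum_i c_i v_i
assert(norm(B * c - x) < 1e-12);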

Matrix representations

It is much easier to work with a matrix representation of a linear transformation. In this section we describe how matrix representations of a linear transformation are developed.

In order to develop a representation for the map \(\TT : \VV \to \WW\) we first need to choose a representation for vectors in \(\VV\) and \(\WW\). This can be easily done by choosing a basis in \(\VV\) and another in \(\WW\). Once the bases are chosen, then we can represent vectors as coordinate vectors.

Definition

Let \(\VV\) and \(\WW\) be finite dimensional vector spaces with ordered bases \(\BBB = \{v_1, \dots, v_n\}\) and \(\Gamma = \{w_1, \dots,w_m\}\) respectively. Let \(\TT : \VV \to \WW\) be a linear transformation. For each \(v_j \in \BBB\) we can find a unique representation for \(\TT(v_j)\) in \(\Gamma\) given by

\[\TT(v_j) = \sum_{i=1}^{m} a_{ij} w_i \Forall 1 \leq j \leq n.\]

The \(m\times n\) matrix \(A\) defined by \(A_{ij} = a_{ij}\) is the matrix representation of \(\TT\) in the ordered bases \(\BBB\) and \(\Gamma\), denoted as

\[A = [\TT]_{\BBB}^{\Gamma}.\]

If \(\VV = \WW\) and \(\BBB = \Gamma\) then we write

\[A = [\TT]_{\BBB}.\]

The \(j\)-th column of \(A\) is the representation of \(\TT(v_j)\) in \(\Gamma\).

In order to justify the matrix representation of \(\TT\) we need to show that application of \(\TT\) is the same as multiplication by \(A\). This is stated formally below.

Theorem
\[[\TT (v)]_{\Gamma} = [\TT]_{\BBB}^{\Gamma} [v]_{\BBB} \Forall v \in \VV.\]
Proof

Let

\[v = \sum_{j=1}^{n} c_j v_j.\]

Then

\[\begin{split}[v]_{\BBB} = \begin{bmatrix} c_1\\ \vdots\\ c_n \end{bmatrix}\end{split}\]

Now

\[\begin{split}\TT(v) &= \TT\left( \sum_{j=1}^{n} c_j v_j \right)\\ &= \sum_{j=1}^{n} c_j \TT(v_j)\\ &= \sum_{j=1}^{n} c_j \sum_{i=1}^{m} a_{ij} w_i\\ &= \sum_{i=1}^{m} \left ( \sum_{j=1}^{n} a_{ij} c_j \right ) w_i\\\end{split}\]

Thus

\[\begin{split}[\TT (v)]_{\Gamma} = \begin{bmatrix} \sum_{j=1}^{n} a_{1 j} c_j\\ \vdots\\ \sum_{j=1}^{n} a_{m j} c_j \end{bmatrix} = A \begin{bmatrix} c_1\\ \vdots\\ c_n \end{bmatrix} = [\TT]_{\BBB}^{\Gamma} [v]_{\BBB}.\end{split}\]
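
This theorem can be verified numerically. In the sketch below (plain MATLAB; the map and the bases are made up for illustration), the linear map \(\TT : \RR^3 \to \RR^2\) is given by a matrix \(M\) acting in the standard bases, and \(A = [\TT]_{\BBB}^{\Gamma}\) is built column by column by expressing \(\TT(v_j)\) in the basis \(\Gamma\):

M = [1 2 0; 0 1 -1];          % T(v) = M*v in the standard bases
B = [1 1 0; 0 1 1; 1 0 1];    % ordered basis B of R^3 (columns)
G = [2 1; 1 1];               % ordered basis Gamma of R^2 (columns)
% j-th column of A is [T(v_j)]_Gamma, obtained by solving G * a_j = M * v_j
A = G \ (M * B);
% check [T(v)]_Gamma == A [v]_B for a random v
v = randn(3, 1);
lhs = G \ (M * v);            % [T(v)]_Gamma
rhs = A * (B \ v);            % A [v]_B
assert(norm(lhs - rhs) < 1e-12);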

Vector space of linear transformations

If we consider the set of linear transformations from \(\VV\) to \(\WW\), we can impose some structure on it and take advantage of that structure.

First of all we will define basic operations like addition and scalar multiplication on the general set of functions from a vector space \(\VV\) to another vector space \(\WW\).

Definition

Let \(\TT\) and \(\UU\) be arbitrary functions from vector space \(\VV\) to vector space \(\WW\) over the field \(\FF\). Then addition of functions is defined as

\[(\TT + \UU)(v) = \TT(v) + \UU(v) \Forall v \in \VV.\]

Scalar multiplication on a function is defined as

\[(\alpha \TT)(v) = \alpha (\TT (v)) \Forall \alpha \in \FF, v \in \VV.\]

With these definitions we have

\[(\alpha \TT + \UU)(v) = (\alpha \TT)(v) + \UU(v) = \alpha (\TT (v)) + \UU(v).\]

We are now ready to show that with the addition and scalar multiplication as defined above, the set of linear transformations from \(\VV\) to \(\WW\) actually forms a vector space.

Theorem

Let \(\VV\) and \(\WW\) be vector spaces over field \(\FF\). Let \(\TT\) and \(\UU\) be some linear transformations from \(\VV\) to \(\WW\). Let addition and scalar multiplication of linear transformations be defined as in here. Then \(\alpha \TT + \UU\) where \(\alpha \in \FF\) is a linear transformation.

Moreover the set of linear transformations from \(\VV\) to \(\WW\) forms a vector space.

Proof

We first show that \(\alpha \TT + \UU\) is linear.

Let \(x,y \in \VV\) and \(\beta \in \FF\). Then we need to show that

\[\begin{split}(\alpha \TT + \UU) (x + y) = (\alpha \TT + \UU) (x) + (\alpha \TT + \UU) (y)\\ (\alpha \TT + \UU) (\beta x) = \beta ((\alpha \TT + \UU) (x)).\end{split}\]

Starting with the first one:

\[\begin{split}(\alpha \TT + \UU)(x + y) &= (\alpha \TT)(x + y) + \UU(x + y)\\ &= \alpha ( \TT (x + y) ) + \UU(x) + \UU(y)\\ &= \alpha \TT (x) + \alpha \TT(y) + \UU(x) + \UU(y)\\ &= (\alpha \TT) (x) + \UU (x) + (\alpha \TT)(y) + \UU(y)\\ &= (\alpha \TT + \UU)(x) + (\alpha \TT + \UU)(y).\end{split}\]

Now the next one

\[\begin{split}(\alpha \TT + \UU) (\beta x) &= (\alpha \TT ) (\beta x) + \UU (\beta x)\\ &= \alpha (\TT(\beta x)) + \beta (\UU (x))\\ &= \alpha (\beta (\TT (x))) + \beta (\UU (x))\\ &= \beta (\alpha (\TT (x))) + \beta (\UU(x))\\ &= \beta ((\alpha \TT)(x) + \UU(x))\\ &= \beta((\alpha \TT + \UU)(x)).\end{split}\]

We can now easily verify that the set of linear transformations from \(\VV\) to \(\WW\) satisfies all the requirements of a vector space. Hence it is a vector space (of linear transformations from \(\VV\) to \(\WW\)).

Definition

Let \(\VV\) and \(\WW\) be vector spaces over field \(\FF\). Then the vector space of linear transformations from \(\VV\) to \(\WW\) is denoted by \(\LinTSpace(\VV, \WW)\).

When \(\VV = \WW\) then it is simply denoted by \(\LinTSpace(\VV)\).

The addition and scalar multiplication as defined in here carries forward to matrix representations of linear transformations also.

Theorem

Let \(\VV\) and \(\WW\) be finite dimensional vector spaces over field \(\FF\) with \(\BBB\) and \(\Gamma\) being their respective bases. Let \(\TT\) and \(\UU\) be some linear transformations from \(\VV\) to \(\WW\).

Then the following hold

  • \([\TT + \UU]_{\BBB}^{\Gamma} = [\TT]_{\BBB}^{\Gamma} + [\UU]_{\BBB}^{\Gamma}\)
  • \([\alpha \TT]_{\BBB}^{\Gamma} = \alpha [\TT]_{\BBB}^{\Gamma} \Forall \alpha \in \FF\)

Inner product spaces

Inner product

Inner product is a generalization of the notion of dot product.

Definition

An inner product over a \(K\)-vector space \(V\) is any map

\[\begin{split}\begin{aligned} \langle, \rangle : &V \times V \to K (\RR \text{ or } \CC )\\ & (v_1, v_2) \to \langle v_1, v_2 \rangle \end{aligned}\end{split}\]

satisfying following requirements:

  1. Positive definiteness

    (1)\[ \langle v, v \rangle \geq 0 \text{ and } \langle v, v \rangle = 0 \iff v = 0\]
  2. Conjugate symmetry

    (2)\[ \langle v_1, v_2 \rangle = \overline{\langle v_2, v_1 \rangle} \quad \forall v_1, v_2 \in V\]
  3. Linearity in the first argument

    (3)\[\begin{split} \begin{aligned} &\langle \alpha v, w \rangle = \alpha \langle v, w \rangle \quad \forall v, w \in V; \forall \alpha \in K\\ &\langle v_1 + v_2, w \rangle = \langle v_1, w \rangle + \langle v_2, w \rangle \quad \forall v_1, v_2,w \in V \end{aligned}\end{split}\]

Remarks

  • Linearity in first argument extends to any arbitrary linear combination:
\[\left \langle \sum \alpha_i v_i, w \right \rangle = \sum \alpha_i \langle v_i, w \rangle\]
  • Similarly we have conjugate linearity in second argument for any arbitrary linear combination:
\[\left \langle v, \sum \alpha_i w_i \right \rangle = \sum \overline{\alpha_i} \langle v, w_i \rangle\]

Orthogonality

Definition

A set of non-zero vectors \(\{v_1, \dots, v_p\}\) is called orthogonal if

\[\langle v_i, v_j \rangle = 0 \text{ if } i \neq j \quad \forall 1 \leq i, j \leq p\]
Definition

A set of non-zero vectors \(\{v_1, \dots, v_p\}\) is called orthonormal if

(4)\[\begin{split}\begin{aligned} &\langle v_i, v_j \rangle = 0 \text{ if } i \neq j \quad \forall 1 \leq i, j \leq p\\ &\langle v_i, v_i \rangle = 1 \quad \forall 1 \leq i \leq p \end{aligned}\end{split}\]

i.e. \(\langle v_i, v_j \rangle = \delta(i, j)\).

Remarks:

  • A set of orthogonal vectors is linearly independent. Prove!
Definition
A \(K\)-vector space \(V\) equipped with an inner product \(\langle, \rangle : V \times V \to K\) is known as an inner product space or a pre-Hilbert space.

Norm

Norms are a generalization of the notion of length.
Definition

A norm over a \(K\)-vector space \(V\) is any map

\[\begin{split}\begin{aligned} \| \| : &V \to \RR \\ & v \to \| v\| \end{aligned}\end{split}\]

satisfying following requirements:

  1. Positive definiteness

    (5)\[ \| v\| \geq 0 \quad \forall v \in V \text{ and } \| v\| = 0 \iff v = 0\]
  2. Scalar multiplication

    \[\| \alpha v \| = | \alpha | \| v \| \quad \forall \alpha \in K; \forall v \in V\]
  3. Triangle inequality

    \[\| v_1 + v_2 \| \leq \| v_1 \| + \| v_2 \| \quad \forall v_1, v_2 \in V\]
Definition
A \(K\)-vector space \(V\) equipped with a norm \(\| \| : V \to \RR\) is known as a normed linear space.

Projection

Definition

A projection is a linear transformation \(P\) from a vector space \(V\) to itself such that \(P^2=P\). i.e. if \(P v = \beta\), then \(P \beta = \beta\). Thus whenever \(P\) is applied twice to any vector, it gives the same result as if it was applied once.

Thus \(P\) is an idempotent operator.

Example: Projection operators

Consider the operator \(P : \RR^3 \to \RR^3\) defined as

\[\begin{split}P = \begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}.\end{split}\]

Then application of \(P\) on any arbitrary vector is given by

\[\begin{split}P \begin{pmatrix} x \\ y \\z \end{pmatrix} = \begin{pmatrix} x \\ y \\ 0 \end{pmatrix}\end{split}\]

A second application doesn’t change it

\[\begin{split}P \begin{pmatrix} x \\ y \\0 \end{pmatrix} = \begin{pmatrix} x \\ y \\ 0 \end{pmatrix}\end{split}\]

Thus \(P\) is a projection operator.

Usually we can directly verify the property by computing \(P^2\) as

\[\begin{split}P^2 = \begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix} = P.\end{split}\]

Orthogonal projection

Consider a projection operator \(P : V \to V\) where \(V\) is an inner product space.

The range of \(P\) is given by

\[\Range(P) = \{v \in V | v = P x \text{ for some } x \in V \}.\]

The null space of \(P\) is given by

\[\NullSpace(P) = \{ v \in V | P v = 0\}.\]
Definition

A projection operator \(P : V \to V\) over an inner product space \(V\) is called orthogonal projection operator if its range \(\Range(P)\) and the null space \(\NullSpace(P)\) as defined above are orthogonal to each other. i.e.

\[\langle r, n \rangle = 0 \Forall r \in \Range(P) , \Forall n \in \NullSpace(P).\]
Lemma
A projection operator is orthogonal if and only if it is self adjoint.
Example: Orthogonal projection on a line

Consider a unit norm vector \(u \in \RR^N\). Thus \(u^T u = 1\).

Consider

\[P_u = u u^T.\]

Now

\[P_u^2 = (u u^T) (u u^T) = u (u^T u) u^T = u u^T = P_u.\]

Thus \(P_u\) is a projection operator.

Now

\[P_u^T = (u u^T)^T = u u^T = P_u\]

Thus \(P_u\) is self-adjoint. Hence \(P_u\) is an orthogonal projection operator.

Now

\[P_u u = (u u^T) u = u (u^T u) = u.\]

Thus \(P_u\) leaves \(u\) intact. i.e. Projection of \(u\) on to \(u\) is \(u\) itself.

Let \(v \in u^{\perp}\) i.e. \(\langle u, v \rangle = 0\).

Then

\[P_u v = (u u^T) v = u (u^T v) = u \langle u, v \rangle = 0.\]

Thus \(P_u\) annihilates all vectors orthogonal to \(u\).

Now any vector \(x \in \RR^N\) can be broken down into two components

\[x = x_{\parallel} + x_{\perp}\]

such that \(\langle u , x_{\perp} \rangle =0\) and \(x_{\parallel}\) is collinear with \(u\).

Then

\[P_u x = u u^T x_{\parallel} + u u^T x_{\perp} = x_{\parallel}.\]

Thus \(P_u\) retains the projection of \(x\) on \(u\) given by \(x_{\parallel}\).
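
These properties are easy to check numerically. A small sketch in plain MATLAB:

u = randn(5, 1);
u = u / norm(u);              % unit norm vector
P = u * u';                   % projection onto the line spanned by u
assert(norm(P * P - P) < 1e-12);   % idempotent
assert(norm(P' - P) < 1e-12);      % self-adjoint
assert(norm(P * u - u) < 1e-12);   % leaves u intact
V = null(u');                 % orthonormal basis of the orthogonal complement of u
assert(norm(P * V) < 1e-12);       % annihilates everything orthogonal to u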

Example: Projection onto the column space of a matrix

Let \(A \in \RR^{M \times N}\) with \(N \leq M\) be a matrix given by

\[A = \begin{bmatrix} a_1 & a_2 & \dots & a_N \end{bmatrix}\]

where \(a_i \in \RR^M\) are its columns which are linearly independent.

The column space of \(A\) is given by

\[C(A) = \{ A x \Forall x \in \RR^N \} \subseteq \RR^M.\]

It can be shown that \(A^T A\) is invertible.

Consider the operator

\[P_A = A (A^T A)^{-1} A^T.\]

Now

\[P_A^2 = A (A^T A)^{-1} A^T A (A^T A)^{-1} A^T = A (A^T A)^{-1} A^T = P_A.\]

Thus \(P_A\) is a projection operator.

\[P_A^T = (A (A^T A)^{-1} A^T)^T = A ((A^T A)^{-1} )^T A^T = A (A^T A)^{-1} A^T = P_A.\]

Thus \(P_A\) is self-adjoint.

Hence \(P_A\) is an orthogonal projection operator on the column space of \(A\).
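
Numerically it is better to form \(P_A\) with the backslash operator than with an explicit inverse. A small sketch in plain MATLAB:

A = randn(10, 4);             % 10x4 matrix; its columns are independent with probability 1
P = A * ((A' * A) \ A');      % orthogonal projector onto the column space of A
assert(norm(P * P - P) < 1e-10);   % idempotent
assert(norm(P' - P) < 1e-10);      % self-adjoint
x = randn(10, 1);
r = x - P * x;                % residual after projection
assert(norm(A' * r) < 1e-10);      % residual is orthogonal to every column of A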

Parallelogram identity

Theorem
\[2 \| x \|_2^2 + 2 \| y \|_2^2 = \|x + y \|_2^2 + \| x - y \|_2^2. \Forall x, y \in V.\]
Proof
\[\| x + y \|_2^2 = \langle x + y, x + y \rangle = \langle x, x \rangle + \langle y , y \rangle + \langle x , y \rangle + \langle y , x \rangle.\]
\[\| x - y \|_2^2 = \langle x - y, x - y \rangle = \langle x, x \rangle + \langle y , y \rangle - \langle x , y \rangle - \langle y , x \rangle.\]

Thus

\[\|x + y \|_2^2 + \| x - y \|_2^2 = 2 ( \langle x, x \rangle + \langle y , y\rangle) = 2 \| x \|_2^2 + 2 \| y \|_2^2.\]

When the inner product is real valued, the following identity is quite useful.

Theorem
\[\langle x, y \rangle = \frac{1}{4} \left ( \|x + y \|_2^2 - \| x - y \|_2^2 \right ). \Forall x, y \in V.\]
Proof
\[\| x + y \|_2^2 = \langle x + y, x + y \rangle = \langle x, x \rangle + \langle y , y \rangle + \langle x , y \rangle + \langle y , x \rangle.\]
\[\| x - y \|_2^2 = \langle x - y, x - y \rangle = \langle x, x \rangle + \langle y , y \rangle - \langle x , y \rangle - \langle y , x \rangle.\]

Thus

\[\|x + y \|_2^2 - \| x - y \|_2^2 = 2 ( \langle x , y \rangle + \langle y , x \rangle) = 4 \langle x , y \rangle\]

since for real inner products

\[\langle x , y \rangle = \langle y , x \rangle.\]

Polarization identity

When the inner product is complex valued, the polarization identity is quite useful.

Theorem
\[\langle x, y \rangle = \frac{1}{4} \left ( \|x + y \|_2^2 - \| x - y \|_2^2 + i \| x + i y \|_2^2 - i \| x -i y \|_2^2 \right ) \Forall x, y \in V.\]
Proof
\[\| x + y \|_2^2 = \langle x + y, x + y \rangle = \langle x, x \rangle + \langle y , y \rangle + \langle x , y \rangle + \langle y , x \rangle.\]
\[\| x - y \|_2^2 = \langle x - y, x - y \rangle = \langle x, x \rangle + \langle y , y \rangle - \langle x , y \rangle - \langle y , x \rangle.\]
\[\| x + i y \|_2^2 = \langle x + i y, x + i y \rangle = \langle x, x \rangle + \langle i y , i y \rangle + \langle x , i y \rangle + \langle i y , x \rangle.\]
\[\| x - i y \|_2^2 = \langle x - i y, x - i y \rangle = \langle x, x \rangle + \langle i y , i y \rangle - \langle x , i y \rangle - \langle i y , x \rangle.\]

Thus

\[\begin{split} \|x + y \|_2^2 - \| x - y \|_2^2 + & i \| x + i y \|_2^2 - i \| x -i y \|_2^2\\ &= 2 \langle x, y \rangle + 2 \langle y , x \rangle + 2 i \langle x , i y \rangle + 2 i \langle i y , x \rangle\\ &= 2 \langle x, y \rangle + 2 \langle y , x \rangle + 2 \langle x, y \rangle - 2 \langle y , x \rangle\\ & = 4 \langle x, y \rangle.\end{split}\]

The Euclidean space

In this book we will be generally concerned with the Euclidean space \(\RR^N\). This section summarizes important results for this space.

\(\RR^2\) (the 2-dimensional plane) and \(\RR^3\) (the 3-dimensional space) are the most familiar spaces to us.

\(\RR^N\) is a generalization in \(N\) dimensions.

Definition
Let \(\RR\) denote the field of real numbers. For any positive integer \(N\), the set of all \(N\)-tuples of real numbers forms an \(N\)-dimensional vector space over \(\RR\) which is denoted as \(\RR^N\) and sometimes called the real coordinate space.

An element \(x\) in \(\RR^N\) is written as

\[x = (x_1, x_2, \ldots, x_N),\]

where each \(x_i\) is a real number.

Vector space operations on \(\RR^N\) are defined by:

\[\begin{split}&x + y = (x_1 + y_1, x_2 + y_2, \dots, x_N + y_N), \quad \forall x, y \in \RR^N.\\ & \alpha x = (\alpha x_1, \alpha x_2, \dots, \alpha x_N) \quad \forall x \in \RR^N, \alpha \in \RR .\end{split}\]

\(\RR^N\) comes with the standard ordered basis \(B = \{e_1, e_2, \dots, e_N\}\):

(1)\[\begin{split}\begin{aligned} & e_1 = (1,0,\dots, 0),\\ & e_2 = (0,1,\dots, 0),\\ &\vdots\\ & e_N = (0,0,\dots, 1) \end{aligned}\end{split}\]

An arbitrary vector \(x\in\RR^N\) can be written as

\[x = \sum_{i=1}^{N}x_i e_i\]

Inner product

Standard inner product (a.k.a. dot product) is defined as:

\[\langle x, y \rangle = \sum_{i=1}^{N} x_i y_i = x_1 y_1 + x_2 y_2 + \dots + x_N y_N \quad \forall x, y \in \RR^N.\]

This makes \(\RR^N\) an inner product space.

The result is always a real number. Hence we have symmetry:

\[\langle x, y \rangle = \langle y, x \rangle\]

Norm

The length of the vector (a.k.a. Euclidean norm or \(\ell_2\) norm) is defined as:

\[\| x \| = \sqrt{\langle x, x \rangle} = \sqrt{\sum_{i=1}^{N} x_i^2} \quad \forall x \in \RR^N.\]

This makes \(\RR^N\) a normed linear space.

The angle \(\theta\) between two vectors is given by:

\[\theta = \cos^{-1} \frac{ \langle x, y \rangle }{\| x \| \| y \|}\]
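
For example, the angle between two vectors in the plane \(z = 0\) can be computed directly from this formula in MATLAB:

x = [1; 0; 0];
y = [1; 1; 0];
cos_theta = dot(x, y) / (norm(x) * norm(y));
theta_rad = acos(cos_theta);     % pi/4, approximately 0.7854
theta_deg = rad2deg(theta_rad);  % 45 degrees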

Distance

Distance between two vectors is defined as:

\[d(x,y) = \| x - y \| = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2}\]

This distance function is known as Euclidean metric.

This makes \(\RR^N\) a metric space.

\(\ell_p\) norms

In addition to the standard Euclidean norm, we define a family of norms indexed by \(p \in [1, \infty]\) known as \(\ell_p\) norms over \(\RR^N\).

Definition

\(\ell_p\) norm is defined as:

(2)\[\begin{split} \| x \|_p = \begin{cases} \left ( \sum_{i=1}^{N} | x_i |^p \right ) ^ {\frac{1}{p}} & p \in [1, \infty)\\ \underset{1 \leq i \leq N}{\max} |x_i| & p = \infty \end{cases}\end{split}\]
\(\ell_2\) norm

As we can see from the definition, the \(\ell_2\) norm is the same as the Euclidean norm. So we have:

\[\| x \| = \| x \|_2\]
\(\ell_1\) norm

From above definition we have

\[\|x\|_1 = \sum_{i=1}^N |x_i|= |x_1| + |x_2| + \dots + | x_N|\]

We use norms as a measure of strength of a signal or size of an error. Different norms signify different aspects of the signal.

Quasi-norms

In some cases it is useful to extend the notion of \(\ell_p\) norms to the case where \(0 < p < 1\).

In such cases the norm as defined in (2) doesn't satisfy the triangle inequality, hence it is not a proper norm function. We call such functions quasi-norms.

\(\ell_0\)-“norm”

Of specific mention is \(\ell_0\)-“norm”. It isn’t even a quasi-norm. Note the use of quotes around the word norm to distinguish \(\ell_0\)-“norm” from usual norms.

Definition

\(\ell_0\)-“norm” is defined as:

\[\| x \|_0 = | \supp(x) |\]

where \(\supp(x) = \{ i : x_i \neq 0\}\) denotes the support of \(x\).

Note that \(\| x \|_0\) defined above doesn’t follow the definition in (2).

Yet we can show that:

\[\lim_{p\to 0} \| x \|_p^p = | \supp(x) |\]

which justifies the notation.
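
In MATLAB, the built-in norm function covers the \(\ell_p\) norms for \(p \in [1, \infty]\), while the \(\ell_0\)-“norm” is simply the number of non-zero entries:

x = [3 0 -4 0 1];
l1   = norm(x, 1);     % 8
l2   = norm(x, 2);     % sqrt(26), the Euclidean norm
linf = norm(x, inf);   % 4
l0   = nnz(x);         % 3, the size of the support of x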

N dimensional complex space

In this section we review important features of N dimensional complex vector space \(\CC^N\).

Definition
Let \(\CC\) denote the field of complex numbers. For any positive integer \(N\), the set of all \(N\)-tuples of complex numbers forms an \(N\)-dimensional vector space over \(\CC\) which is denoted as \(\CC^N\) and sometimes called complex vector space.

An element \(x\) in \(\CC^N\) is written as

\[x = (x_1, x_2, \ldots, x_N),\]

where each \(x_i\) is a complex number.

Vector space operations on \(\CC^N\) are defined by:

\[\begin{split}&x + y = (x_1 + y_1, x_2 + y_2, \dots, x_N + y_N), \quad \forall x, y \in \CC^N.\\ & \alpha x = (\alpha x_1, \alpha x_2, \dots, \alpha x_N) \quad \forall x \in \CC^N, \alpha \in \CC .\end{split}\]

\(\CC^N\) comes with the standard ordered basis \(B = \{e_1, e_2, \dots, e_N\}\):

(1)\[\begin{split}\begin{aligned} & e_1 = (1,0,\dots, 0),\\ & e_2 = (0,1,\dots, 0),\\ &\vdots\\ & e_N = (0,0,\dots, 1) \end{aligned}\end{split}\]

We note that the basis is same as the basis for \(N\) dimensional real vector space (the Euclidean space).

An arbitrary vector \(x\in\CC^N\) can be written as

\[x = \sum_{i=1}^{N}x_i e_i\]

Inner product

Standard inner product is defined as:

\[\langle x, y \rangle = \sum_{i=1}^{N} x_i \overline{y_i} = x_1 \overline{y_1} + x_2 \overline{y_2} + \dots + x_N \overline{y_N} \quad \forall x, y \in \CC^N.\]

where \(\overline{y_i}\) denotes the complex conjugate.

This makes \(\CC^N\) an inner product space.

This satisfies the inner product rule:

\[\langle x, y \rangle = \overline{\langle y, x \rangle}\]

Norm

The length of the vector (a.k.a. \(\ell_2\) norm) is defined as:

\[\| x \| = \sqrt{\langle x, x \rangle} = \sqrt{\sum_{i=1}^{N} x_i \overline{x_i} } = \sqrt{\sum_{i=1}^{N} |x_i|^2 } \quad \forall x \in \CC^N.\]

This makes \(\CC^N\) a normed linear space.

Distance

Distance between two vectors is defined as:

\[d(x,y) = \| x - y \| = \sqrt{\sum_{i=1}^{N} |x_i - y_i|^2}\]

This makes \(\CC^N\) a metric space.

\(\ell_p\) norms

In addition to standard Euclidean norm, we define a family of norms indexed by \(p \in [1, \infty]\) known as \(\ell_p\) norms over \(\CC^N\).

Definition

\(\ell_p\) norm is defined as:

(2)\[\begin{split} \| x \|_p = \begin{cases} \left ( \sum_{i=1}^{N} | x_i |^p \right ) ^ {\frac{1}{p}} & p \in [1, \infty)\\ \underset{1 \leq i \leq N}{\max} |x_i| & p = \infty \end{cases}\end{split}\]

As in the real case, the \(\ell_2\) norm is the same as the Euclidean norm defined above. So we have:

\[\| x \| = \| x \|_2\]
\(\ell_1\) norm

From above definition we have

\[\|x\|_1 = \sum_{i=1}^N |x_i|= |x_1| + |x_2| + \dots + | x_N|\]

We use norms as a measure of strength of a signal or size of an error. Different norms signify different aspects of the signal.

Quasi-norms

In some cases it is useful to extend the notion of \(\ell_p\) norms to the case where \(0 < p < 1\).

In such cases the norm as defined in (2) doesn't satisfy the triangle inequality, hence it is not a proper norm function. We call such functions quasi-norms.

\(\ell_0\) “norm”

Of specific mention is \(\ell_0\) “norm”. It isn’t even a quasi-norm. Note the use of quotes around the word norm to distinguish \(\ell_0\) “norm” from usual norms.

Definition

\(\ell_0\) “norm” is defined as:

(3)\[ \| x \|_0 = | \supp(x) |\]

where \(\supp(x) = \{ i : x_i \neq 0\}\) denotes the support of \(x\).

Note that \(\| x \|_0\) defined above doesn’t follow the definition in (2).

Yet we can show that:

\[\lim_{p\to 0} \| x \|_p^p = | \supp(x) |\]

which justifies the notation.

Affine Subspaces Review

For a detailed introduction to affine concepts, see [KW79]. For a vector \(v \in \RR^n\), the function \(f\) defined by \(f (x) = x + v, x \in \RR^n\) is a translation of \(\RR^n\) by \(v\). The image of any set \(\mathcal{S}\) under \(f\) is the \(v\)-translate of \(\mathcal{S}\). A translation of space is a one to one isometry of \(\RR^n\) onto \(\RR^n\).

A translate of a \(d\)-dimensional, linear subspace of \(\RR^n\) is a \(d\)-dimensional flat or simply \(d\)-flat in \(\RR^n\). Flats of dimension 1, 2, and \(n-1\) are also called lines, planes, and hyperplanes, respectively. Flats are also known as affine subspaces.

Every \(d\)-flat in \(\RR^n\) is congruent to the Euclidean space \(\RR^d\). Flats are closed sets.

An affine combination of the vectors \(v_1, \dots, v_m\) is a linear combination in which the sum of coefficients is 1. Thus, \(b\) is an affine combination of \(v_1, \dots, v_m\) if \(b = k_1 v_1 + \dots + k_m v_m\) and \(k_1 + \dots + k_m = 1\). The set of affine combinations of a set of vectors \(\{ v_1, \dots, v_m \}\) is their affine span. A finite set of vectors \(\{v_1, \dots, v_m\}\) is called affine independent if the only zero-sum linear combination of theirs representing the null vector is the null combination. i.e. \(k_1 v_1 + \dots + k_m v_m = 0\) and \(k_1 + \dots + k_m = 0\) implies \(k_1 = \dots = k_m = 0\). Otherwise, the set is affinely dependent. A finite set of two or more vectors is affine independent if and only if none of them is an affine combination of the others.

Vectors vs. Points An n-tuple \((x_1, \dots, x_n)\) is used to refer to a point \(X\) in \(\RR^n\) as well as to a vector from origin \(O\) to \(X\) in \(\RR^n\). In basic linear algebra, the terms vector and point are used interchangeably. While discussing geometrical concepts (affine or convex sets etc.), it is useful to distinguish between vectors and points. When the terms “dependent” and “independent” are used without qualification to points, they refer to affine dependence/independence. When used for vectors, they mean linear dependence/independence.

The span of \(k+1\) independent points is a \(k\)-flat and is the unique \(k\)-flat that contains all \(k+1\) points. Every \(k\)-flat contains \(k+1\) (affine) independent points. Each set of \(k+1\) independent points in the \(k\)-flat forms an affine basis for the flat. Each point of a \(k\)-flat is represented by one and only one affine combination of a given affine basis for the flat. The coefficients of the affine combination of a point are the affine coordinates of the point in the given affine basis of the \(k\)-flat. A \(d\)-flat is contained in a linear subspace of dimension \(d+1\). This can be easily obtained by choosing an affine basis for the flat and constructing its linear span.

A function \(f\) defined on a vector space \(V\) is an affine function or affine transformation or affine mapping if it maps every affine combination of vectors \(u, v\) in \(V\) onto the same affine combination of their images. If \(f\) is real valued, then \(f\) is an affine functional. A property which is invariant under an affine mapping is called affine invariant. The image of a flat under an affine function is a flat.

Every affine function differs from a linear function by a translation. A functional is an affine functional if and only if there exists a unique vector \(a \in \RR^n\) and a unique real number \(k\) such that \(f(x) = \langle a, x \rangle + k\). Affine functionals are continuous. If \(a \neq 0\), then the linear functional \(f(x) = \langle a, x \rangle\) and the affine functional \(g(x) = \langle a, x \rangle + k\) map bounded sets onto bounded sets, neighborhoods onto neighborhoods, balls onto balls and open sets onto open sets.

Hyperplanes and Half spaces

Corresponding to a hyperplane \(\mathcal{H}\) in \(\RR^n\) (an \(n-1\)-flat), there exists a non-null vector \(a\) and a real number \(k\) such that \(\mathcal{H}\) is the graph of \(\langle a , x \rangle = k\). The vector \(a\) is orthogonal to \(PQ\) for all \(P, Q \in \mathcal{H}\). All non-null vectors \(a\) that have this property are normal to the hyperplane. The directions of \(a\) and \(-a\) are called opposite normal directions of \(\mathcal{H}\). Conversely, the graph of \(\langle a , x \rangle = k\), \(a \neq 0\), is a hyperplane for which \(a\) is a normal vector. If \(\langle a, x \rangle = k\) and \(\langle b, x \rangle = h\), \(a \neq 0\), \(b \neq 0\) are both representations of a hyperplane \(\mathcal{H}\), then there exists a real non-zero number \(\lambda\) such that \(b = \lambda a\) and \(h = \lambda k\). Obviously, we can find a unit norm normal vector for \(\mathcal{H}\). Each point \(P\) in space has a unique foot (nearest point) \(P_0\) in a hyperplane \(\mathcal{H}\). The distance of the point \(P\) with vector \(p\) from a hyperplane \(\mathcal{H} : \langle a , x \rangle = k\) is given by

\[d(P, \mathcal{H}) = \frac{|\langle a, p \rangle - k|}{\| a \|_2}.\]

The coordinate \(p_0\) of the foot \(P_0\) is given by

\[p_0 = p - \frac{\langle a, p \rangle - k}{\| a \|_2^2} a.\]

Hyperplanes \(\mathcal{H}\) and \(\mathcal{K}\) are parallel if they don’t intersect. This occurs if and only if they have a common normal direction. They are different constant sets of the same linear functional. If \(\mathcal{H}_1 : \langle a , x \rangle = k_1\) and \(\mathcal{H}_2 : \langle a, x \rangle = k_2\) are parallel hyperplanes, then the distance between the two hyperplanes is given by

\[d(\mathcal{H}_1 , \mathcal{H}_2) = \frac{| k_1 - k_2|}{\| a \|_2}.\]

If \(\langle a, x \rangle = k\), \(a \neq 0\), is a hyperplane \(\mathcal{H}\), then the graphs of \(\langle a , x \rangle > k\) and \(\langle a , x \rangle < k\) are the opposite sides or opposite open half spaces of \(\mathcal{H}\). The graphs of \(\langle a , x \rangle \geq k\) and \(\langle a , x \rangle \leq k\) are the opposite closed half spaces of \(\mathcal{H}\). \(\mathcal{H}\) is the face of the four half-spaces. Corresponding to a hyperplane \(\mathcal{H}\), there exists a unique pair of sets \(\mathcal{S}_1\) and \(\mathcal{S}_2\) that are the opposite sides of \(\mathcal{H}\). Open half spaces are open sets and closed half spaces are closed sets. If \(A\) and \(B\) belong to the opposite sides of a hyperplane \(\mathcal{H}\), then there exists a unique point of \(\mathcal{H}\) that is between \(A\) and \(B\).
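
The distance and foot formulas above translate directly into MATLAB. A small sketch with a made-up hyperplane in \(\RR^3\):

a = [1; 2; 2]; k = 6;            % hyperplane <a, x> = k in R^3
p = [3; 1; -2];                  % an arbitrary point P
d  = abs(a' * p - k) / norm(a);  % distance of P from the hyperplane
p0 = p - ((a' * p - k) / norm(a)^2) * a;   % foot of the perpendicular from P
assert(abs(a' * p0 - k) < 1e-12);          % the foot lies on the hyperplane
assert(abs(norm(p - p0) - d) < 1e-12);     % and it realizes the distance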

General Position

A general position for a set of points or other geometric objects is a notion of genericity. It means the general case situation as opposed to more special and coincidental cases. For example, generically, two lines in a plane intersect in a single point. The special cases are when the two lines are either parallel or coincident. Three points in a plane in general are not collinear. If they are, then it is a degenerate case. A set of \(n+1\) or more points in \(\RR^n\) is said to be in general position if every subset of \(n\) points is linearly independent. In general, a set of \(k+1\) or more points in a \(k\)-flat is said to be in general linear position if no hyperplane contains more than \(k\) points.

Matrix Factorizations

Singular Value Decomposition

A non-negative real value \(\sigma\) is a singular value for a matrix \(A \in \RR^{m \times n}\) if and only if there exist unit length vectors \(u \in \RR^m\) and \(v \in \RR^n\) such that \(A v = \sigma u\) and \(A^T u = \sigma v\). The vectors \(u\) and \(v\) are called left singular and right singular vectors for \(\sigma\) respectively. For every \(A \in \RR^{m \times n}\) with \(k = \min(m, n)\), there exist two orthogonal matrices \(U \in \RR^{m \times m}\) and \(V \in \RR^{n \times n}\) and a sequence of real numbers \(\sigma_1 \geq \dots \geq \sigma_k \geq 0\) such that \(U^T A V = \Sigma\) where \(\Sigma = \text{diag}(\sigma_1, \dots, \sigma_k, 0, \dots, 0) \in \RR^{m \times n}\) (extra columns or rows are filled with zeros). The decomposition of \(A\) given by \(A = U \Sigma V^T\) is called the singular value decomposition of \(A\). The first \(k\) columns of \(U\) and \(V\) are the left and right singular vectors of \(A\) corresponding to the singular values \(\sigma_1, \dots, \sigma_k\). The rank of \(A\) is equal to the number of non-zero singular values, which equals the rank of \(\Sigma\). The eigen values of the positive semi-definite matrices \(A^T A\) and \(A A^T\) are given by \(\sigma_1^2, \dots, \sigma_k^2\) (the remaining eigen values being 0). Specifically, \(A^T A = V \Sigma^T \Sigma V^T\) and \(A A^T = U \Sigma \Sigma^T U^T\). We can rewrite \(A = \sum_{i=1}^k \sigma_i u_i v_i^T\). \(\sigma_1 u_1 v_1^T\) is the best rank-1 approximation of \(A\) in the Frobenius norm sense. The \(2\)-norm (spectral norm) of \(A\) is given by its largest singular value \(\sigma_1\). The Moore-Penrose pseudo-inverse of \(\Sigma\) is easily obtained by taking the transpose of \(\Sigma\) and inverting the non-zero singular values. Further, \(A^{\dag} = V \Sigma^{\dag} U^T\). The non-zero singular values of \(A^{\dag}\) are just reciprocals of the non-zero singular values of \(A\). Geometrically, the singular values of \(A\) are precisely the lengths of the semi-axes of the hyper-ellipsoid \(E\) defined by \(E = \{ A x | \| x \|_2 = 1 \}\) (i.e. the image of the unit sphere under \(A\)). Thus, if \(A\) is a data matrix, then the SVD of \(A\) is strongly connected with the principal component analysis of \(A\).
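
These facts can be explored with MATLAB's svd and pinv functions. A brief sketch (matrix sizes chosen arbitrarily):

A = randn(6, 3) * randn(3, 8);    % a 6x8 matrix of rank 3
[U, S, V] = svd(A);               % full SVD: A = U*S*V'
sigmas = diag(S)';                % singular values in non-increasing order
r = nnz(sigmas > 1e-10);          % numerical rank = number of non-zero singular values
assert(r == rank(A));
assert(norm(A - U * S * V') < 1e-10);          % reconstruction
assert(abs(norm(A, 2) - sigmas(1)) < 1e-10);   % 2-norm equals the largest singular value
A_dag = pinv(A);                  % Moore-Penrose pseudo-inverse, computed via the SVD
assert(norm(A * A_dag * A - A) < 1e-8);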

Principal Angles

If \(\UUU\) and \(\VVV\) are two linear subspaces of \(\RR^M\), then the smallest principal angle between them denoted by \(\theta\) is defined as [BjorckG73]

\[\cos \theta = \underset{u \in \UUU, v \in \VVV}{\max} \frac{u^T v}{\| u \|_2 \| v \|_2}.\]

In other words, we try to find unit norm vectors in the two spaces which are maximally aligned with each other. The angle between them is the smallest principal angle. Note that \(\theta \in [0, \pi /2 ]\) (\(\cos \theta\) as defined above is always non-negative). If we have \(U\) and \(V\) as matrices whose column spans are the subspaces \(\UUU\) and \(\VVV\) respectively, then in order to find the principal angles, we construct orthonormal bases \(Q_U\) and \(Q_V\). We then compute the inner product matrix \(G = Q_U^T Q_V\). The SVD of \(G\) gives the principal angles. In particular, the smallest principal angle is given by \(\cos \theta = \sigma_1\), the largest singular value.

Hands on with Principal Angles

We will generate two random 4-D subspaces in an ambient space \(\RR^{10}\):

% subspace dimension
D = 4;
% ambient dimension
M = 10;
% Number of subspaces
K = 2;
import spx.data.synthetic.subspaces.random_subspaces;
bases = random_subspaces(M, K, D);

Finding the smallest principal angle between two subspaces is quite easy.

Let’s give some convenient names to the two bases:

>> A = bases{1};
>> B = bases{2};

Now let’s compute the inner products matrix between the basis vectors of the two bases:

>> G = A' * B
G =

   -0.3416   -0.4993    0.1216    0.2732
   -0.3780    0.0173   -0.5111    0.4413
    0.1296   -0.1153   -0.4123   -0.5332
   -0.2152   -0.4476    0.2282   -0.5022

Compute the singular values for G:

>> sigmas = svd(G)'
sigmas =

    0.9676    0.8197    0.6738    0.1664

The largest inner product between unit vectors drawn from A and B is given by:

>> largest_product = sigmas(1)
largest_product =

    0.9676

It is clear that this is very high. The corresponding smallest principal angle (in radians) is:

>> smallest_angle_rad  = acos(largest_product)
smallest_angle_rad =

    0.2551

Or in degrees:

>> smallest_angle_deg = rad2deg(smallest_angle_rad)
smallest_angle_deg =

   14.6143

sparse-plex provides a number of convenience functions for measuring principal angles.

We start with functions which can tell us about the smallest principal angle between a pair of subspaces.

The smallest principal angle in degrees:

>> spx.la.spaces.smallest_angle_deg(A, B)

ans =

   14.6143

The smallest principal angle in radians:

>> spx.la.spaces.smallest_angle_rad(A, B)

ans =

    0.2551

The smallest principal angle in cosine version:

>> spx.la.spaces.smallest_angle_cos(A, B)

ans =

    0.9676

If we have more than two subspaces, then we have a way of computing the smallest principal angle between each pair of them.

Let’s draw 6 subspaces from \(\RR^{10}\):

>> K = 6;
>> bases = random_subspaces(M, K, D);

We now want pairwise smallest principal angles between them:

>> angles = spx.la.spaces.smallest_angles_deg(bases)
angles =

         0   19.9756   32.3022   21.1835   47.2059   24.9171
   19.9756         0   14.9874   17.8499   20.5399   42.5358
   32.3022   14.9874         0   34.6420   21.9036   34.4935
   21.1835   17.8499   34.6420         0   14.0794   26.5235
   47.2059   20.5399   21.9036   14.0794         0   39.5866
   24.9171   42.5358   34.4935   26.5235   39.5866         0

We can pull off the upper off-diagonal entries in the matrix to look at the distribution of angles:

>> angles = spx.matrix.off_diag_upper_tri_elements(angles)'
angles =

  Columns 1 through 13

   19.9756   32.3022   14.9874   21.1835   17.8499   34.6420   47.2059   20.5399   21.9036   14.0794   24.9171   42.5358   34.4935

  Columns 14 through 15

   26.5235   39.5866

For more information about off_diag_upper_tri_elements, see Working with matrices.

The statistics:

>> max(angles)
ans =

   47.2059

>> min(angles)
ans =

   14.0794

>> mean(angles)
ans =

   27.5151

>> std(angles)
ans =

   10.3412

There is quite a variation in the distribution of angles. While some pairs of subspaces are so closely aligned that their smallest principal angle is as low as 14 degrees, there are some pairs for which the smallest principal angle is as high as 47 degrees.

While it is possible to select two subspaces which are arbitrarily close to each other, the distribution of principal angles gives us an idea as to how close/aligned the subspaces are likely to be if chosen randomly.

Above, we computed the smallest principal angles in degrees. We can also compute them in radians:

>> angles = spx.la.spaces.smallest_angles_rad(bases)
angles =

         0    0.3486    0.5638    0.3697    0.8239    0.4349
    0.3486         0    0.2616    0.3115    0.3585    0.7424
    0.5638    0.2616         0    0.6046    0.3823    0.6020
    0.3697    0.3115    0.6046         0    0.2457    0.4629
    0.8239    0.3585    0.3823    0.2457         0    0.6909
    0.4349    0.7424    0.6020    0.4629    0.6909         0

Or directly the largest singular values for each pair of subspaces:

>> angles = spx.la.spaces.smallest_angles_cos(bases)
angles =

    1.0000    0.9398    0.8452    0.9324    0.6794    0.9069
    0.9398    1.0000    0.9660    0.9519    0.9364    0.7369
    0.8452    0.9660    1.0000    0.8227    0.9278    0.8242
    0.9324    0.9519    0.8227    1.0000    0.9700    0.8948
    0.6794    0.9364    0.9278    0.9700    1.0000    0.7707
    0.9069    0.7369    0.8242    0.8948    0.7707    1.0000

Matrix Algebra

Introduction

In this chapter we collect results related to matrix algebra which are relevant to this book. Some specific topics which are typically not found in standard books are also covered here.

Standard notation in this chapter is given here. Matrices are denoted by capital letters \(A\), \(B\) etc.. They can be rectangular with \(m\) rows and \(n\) columns. Their elements or entries are referred to with small letters \(a_{i j}\), \(b_{i j}\) etc. where \(i\) denotes the \(i\)-th row of the matrix and \(j\) denotes the \(j\)-th column of the matrix. Thus

\[\begin{split}A = \begin{bmatrix} a_{1 1} & a_{1 2} & \dots & a_{1 n}\\ a_{2 1} & a_{2 2} & \dots & a_{2 n}\\ \vdots & \vdots & \ddots & \vdots\\ a_{m 1} & a_{m 2} & \dots & a_{m n} \end{bmatrix}\end{split}\]

Mostly we consider complex matrices belonging to \(\CC^{m \times n}\). Sometimes we will restrict our attention to real matrices belonging to \(\RR^{m \times n}\).

Definition
An \(m \times n\) matrix is called square matrix if \(m = n\).
Definition
An \(m \times n\) matrix is called tall matrix if \(m > n\) i.e. the number of rows is greater than columns.
Definition
An \(m \times n\) matrix is called wide matrix if \(m < n\) i.e. the number of columns is greater than rows.
Definition
Let \(A= [a_{i j}]\) be an \(m \times n\) matrix. The main diagonal consists of entries \(a_{i j}\) where \(i = j\). i.e. the main diagonal is \(\{a_{11}, a_{22}, \dots, a_{k k} \}\) where \(k = \min(m, n)\). The main diagonal is also known as leading diagonal, major diagonal, primary diagonal or principal diagonal. The entries of \(A\) which are not on the main diagonal are known as off diagonal entries.
Definition

A diagonal matrix is a matrix (usually a square matrix) whose entries outside the main diagonal are zero.

Whenever we refer to a diagonal matrix which is not square, we will use the term rectangular diagonal matrix.

A square diagonal matrix \(A\) is also represented by \(\Diag(a_{11}, a_{22}, \dots, a_{n n})\) which lists only the diagonal (non-zero) entries in \(A\).

The transpose of a matrix \(A\) is denoted by \(A^T\) while the Hermitian transpose is denoted by \(A^H\). For real matrices \(A^T = A^H\).

When matrices are square, we have the number of rows and columns both equal to \(n\) and they belong to \(\CC^{n \times n}\).

If not specified, the square matrices will be of size \(n \times n\) and rectangular matrices will be of size \(m \times n\). If not specified the vectors (column vectors) will be of size \(n \times 1\) and belong to either \(\RR^n\) or \(\CC^n\). Corresponding row vectors will be of size \(1 \times n\).

For statements which are valid both for real and complex matrices, sometimes we might say that matrices belong to \(\FF^{m \times n}\) while the scalars belong to \(\FF\) and vectors belong to \(\FF^n\) where \(\FF\) refers to either the field of real numbers or the field of complex numbers. Note that this is not consistently followed at the moment. Most results are written only for \(\CC^{m \times n}\) while still being applicable for \(\RR^{m \times n}\).

Identity matrix for \(\FF^{n \times n}\) is denoted as \(I_n\) or simply \(I\) whenever the size is clear from context.

Sometimes we will write a matrix in terms of its column vectors. We will use the notation

\[A = \begin{bmatrix} a_1 & a_2 & \dots & a_n \end{bmatrix}\]

indicating \(n\) columns.

When we write a matrix in terms of its row vectors, we will use the notation

\[\begin{split}A = \begin{bmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_m^T \end{bmatrix}\end{split}\]

indicating \(m\) rows with \(a_i\) being column vectors whose transposes form the rows of \(A\).

The rank of a matrix \(A\) is written as \(\Rank(A)\), while the determinant as \(\det(A)\) or \(|A|\).

We say that an \(m \times n\) matrix \(A\) is left-invertible if there exists an \(n \times m\) matrix \(B\) such that \(B A = I\). We say that an \(m \times n\) matrix \(A\) is right-invertible if there exists an \(n \times m\) matrix \(B\) such that \(A B= I\).

We say that a square matrix \(A\) is invertible when there exists another square matrix \(B\) of same size such that \(AB = BA = I\). A square matrix is invertible iff it is both left and right invertible. The inverse of a square invertible matrix is denoted by \(A^{-1}\).

A special left or right inverse is the pseudo inverse, which is denoted by \(A^{\dag}\).

Column space of a matrix is denoted by \(\ColSpace(A)\), the null space by \(\NullSpace(A)\), and the row space by \(\RowSpace(A)\).

We say that a matrix is symmetric when \(A = A^T\), conjugate symmetric or Hermitian when \(A^H =A\).

When a square matrix is not invertible, we say that it is singular. A non-singular matrix is invertible.

The eigen values of a square matrix are written as \(\lambda_1, \lambda_2, \dots\) while the singular values of a rectangular matrix are written as \(\sigma_1, \sigma_2, \dots\).

The inner product or dot product of two column / row vectors \(u\) and \(v\) belonging to \(\RR^n\) is defined as

(1)\[u \cdot v = \langle u, v \rangle = \sum_{i=1}^n u_i v_i.\]

The inner product or dot product of two column / row vectors \(u\) and \(v\) belonging to \(\CC^n\) is defined as

(2)\[u \cdot v = \langle u, v \rangle = \sum_{i=1}^n u_i \overline{v_i}.\]

Block matrix

Definition

A block matrix is a matrix whose entries themselves are matrices with following constraints

  • Entries in every row are matrices with same number of rows.
  • Entries in every column are matrices with same number of columns.

Let \(A\) be an \(m \times n\) block matrix. Then

\[\begin{split}A = \begin{bmatrix} A_{11} & A_{12} & \dots & A_{1 n}\\ A_{21} & A_{22} & \dots & A_{2 n}\\ \vdots & \vdots & \ddots & \vdots\\ A_{m 1} & A_{m 2} & \dots & A_{m n}\\ \end{bmatrix}\end{split}\]

where \(A_{i j}\) is a matrix with \(r_i\) rows and \(c_j\) columns.

A block matrix is also known as a partitioned matrix.

Example: 2x2 block matrices

Quite frequently we will be using \(2 \times 2\) block matrices.

\[\begin{split}P = \begin{bmatrix} P_{11} & P_{12} \\ P_{21} & P_{22} \end{bmatrix}.\end{split}\]

An example

\[\begin{split}P = \left[ \begin{array}{c c | c} a & b & c \\ d & e & f \\ \hline g & h & i \end{array} \right]\end{split}\]

We have

\[\begin{split}P_{11} = \begin{bmatrix} a & b \\ d & e \end{bmatrix} \; P_{12} = \begin{bmatrix} c \\ f \end{bmatrix} \; P_{21} = \begin{bmatrix} g & h \end{bmatrix} \; P_{22} = \begin{bmatrix} i \end{bmatrix}\end{split}\]
  • \(P_{11}\) and \(P_{12}\) have \(2\) rows.
  • \(P_{21}\) and \(P_{22}\) have \(1\) row.
  • \(P_{11}\) and \(P_{21}\) have \(2\) columns.
  • \(P_{12}\) and \(P_{22}\) have \(1\) column.
Lemma

Let \(A = [A_{ij}]\) be an \(m \times n\) block matrix with \(A_{ij}\) being an \(r_i \times c_j\) matrix. Then \(A\) is an \(r \times c\) matrix where

\[r = \sum_{i=1}^m r_i\]

and

\[c = \sum_{j=1}^n c_j.\]
Remark
Sometimes it is convenient to think of a regular matrix as a block matrix whose entries are \(1 \times 1\) matrices themselves.
Definition

Let \(A = [A_{ij}]\) be an \(m \times n\) block matrix with \(A_{ij}\) being a \(p_i \times q_j\) matrix. Let \(B = [B_{jk}]\) be an \(n \times p\) block matrix with \(B_{jk}\) being a \(q_j \times r_k\) matrix. Then the two block matrices are compatible for multiplication and their multiplication is defined by \(C = AB = [C_{i k}]\) where

\[C_{i k} = \sum_{j=1}^n A_{i j} B_{j k}\]

and \(C_{i k}\) is a \(p_i \times r_k\) matrix.
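
Block multiplication can be checked in MATLAB by holding the blocks in cell arrays. A small sketch with a \(2 \times 2\) block partition (the block sizes are chosen arbitrarily, but the column sizes of \(A\) match the row sizes of \(B\)):

A = randn(5, 4); B = randn(4, 4);
Ab = mat2cell(A, [2 3], [2 2]);   % blocks A_{ij}: row sizes [2 3], column sizes [2 2]
Bb = mat2cell(B, [2 2], [1 3]);   % blocks B_{jk}: row sizes [2 2], column sizes [1 3]
Cb = cell(2, 2);                  % blocks C_{ik} of the product
for i = 1:2
    for k = 1:2
        % C_{ik} = sum_j A_{ij} B_{jk}
        Cb{i, k} = Ab{i, 1} * Bb{1, k} + Ab{i, 2} * Bb{2, k};
    end
end
assert(norm(cell2mat(Cb) - A * B) < 1e-12);   % agrees with the ordinary product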

Definition
A block diagonal matrix is a block matrix whose off diagonal entries are zero matrices.

Linear independence, span, rank

Spaces associated with a matrix

Definition

The column space of a matrix is defined as the vector space spanned by columns of the matrix.

Let \(A\) be an \(m \times n\) matrix with

\[A = \begin{bmatrix} a_1 & a_2 & \dots & a_n \end{bmatrix}\]

Then the column space is given by

\[\ColSpace(A) = \{ x \in \FF^m : x = \sum_{i=1}^n \alpha_i a_i \; \text{for some } \alpha_i \in \FF \}.\]
Definition

The row space of a matrix is defined as the vector space spanned by rows of the matrix.

Let \(A\) be an \(m \times n\) matrix with

\[\begin{split}A = \begin{bmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_m^T \end{bmatrix}\end{split}\]

Then the row space is given by

\[\RowSpace(A) = \{ x \in \FF^n : x = \sum_{i=1}^m \alpha_i a_i \; \text{for some } \alpha_i \in \FF \}.\]

Rank

Definition
The column rank of a matrix is defined as the maximum number of columns which are linearly independent. In other words column rank is the dimension of the column space of a matrix.
Definition
The row rank of a matrix is defined as the maximum number of rows which are linearly independent. In other words row rank is the dimension of the row space of a matrix.
Theorem
The column rank and row rank of a matrix are equal.
Definition
The rank of a matrix is defined to be equal to its column rank which is equal to its row rank.
Lemma

For an \(m \times n\) matrix \(A\)

\[0 \leq \Rank(A) \leq \min(m, n).\]
Lemma
The rank of a matrix is 0 if and only if it is a zero matrix.
Definition

An \(m \times n\) matrix \(A\) is called full rank if

\[\Rank (A) = \min(m, n).\]

In other words it is either a full column rank matrix or a full row rank matrix or both.

Lemma

Let \(A\) be an \(m \times n\) matrix and \(B\) be an \(n \times p\) matrix then

\[\Rank(AB) \leq \min (\Rank(A), \Rank(B)).\]
Lemma

Let \(A\) be an \(m \times n\) matrix and \(B\) be an \(n \times p\) matrix. If \(B\) is of rank \(n\) then

\[\Rank(AB) = \Rank(A).\]
Lemma

Let \(A\) be an \(m \times n\) matrix and \(B\) be an \(n \times p\) matrix. If \(A\) is of rank \(n\) then

\[\Rank(AB) = \Rank(B).\]
Lemma
The rank of a diagonal matrix is equal to the number of non-zero elements on its main diagonal.
Proof
The columns which correspond to diagonal entries which are zero are zero columns. Other columns are linearly independent. The number of linearly independent rows is also the same. Hence their count gives us the rank of the matrix.
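
The rank results above are easy to verify numerically. A small sketch in plain MATLAB (the sizes and ranks are chosen arbitrarily):

A = randn(6, 3) * randn(3, 5);   % a 6x5 matrix of rank 3
B = randn(5, 4);                 % a 5x4 matrix of rank 4 (with probability 1)
assert(rank(A * B) <= min(rank(A), rank(B)));
C = randn(5, 7);                 % a 5x7 matrix of rank 5, i.e. full row rank
assert(rank(A * C) == rank(A));  % multiplying by a rank-n matrix preserves the rank of A
D = diag([3 0 -2 0]);            % a diagonal matrix
assert(rank(D) == nnz(diag(D))); % rank = number of non-zero diagonal entries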

Invertible matrices

Definition

A square matrix \(A\) is called invertible if there exists another square matrix \(B\) of same size such that

\[AB = BA = I.\]

The matrix \(B\) is called the inverse of \(A\) and is denoted as \(A^{-1}\).

Lemma
If \(A\) is invertible then its inverse \(A^{-1}\) is also invertible and the inverse of \(A^{-1}\) is nothing but \(A\).
Lemma
Identity matrix \(I\) is invertible.
Proof
\[I I = I \implies I^{-1} = I.\]
Lemma
If \(A\) is invertible then columns of \(A\) are linearly independent.
Proof

Assume \(A\) is invertible, then there exists a matrix \(B\) such that

\[AB = BA = I.\]

Assume that columns of \(A\) are linearly dependent. Then there exists \(u \neq 0\) such that

\[A u = 0 \implies BA u = 0 \implies I u = 0 \implies u = 0\]

a contradiction. Hence columns of \(A\) are linearly independent.

Lemma
If an \(n\times n\) matrix \(A\) is invertible then columns of \(A\) span \(\FF^n\).
Proof

Assume \(A\) is invertible, then there exists a matrix \(B\) such that

\[AB = BA = I.\]

Now let \(x \in \FF^n\) be any arbitrary vector. We need to show that there exists \(\alpha \in \FF^n\) such that

\[x = A \alpha.\]

But

\[x = I x = AB x = A ( B x).\]

Thus if we choose \(\alpha = Bx\), then

\[x = A \alpha.\]

Thus columns of \(A\) span \(\FF^n\).

Lemma
If \(A\) is invertible, then columns of \(A\) form a basis for \(\FF^n\).
Proof
In \(\FF^n\) a basis is a set of vectors which is linearly independent and spans \(\FF^n\). By here and here, columns of an invertible matrix \(A\) satisfy both conditions. Hence they form a basis.
Lemma
If \(A\) is invertible, then \(A^T\) is invertible.
Proof

Assume \(A\) is invertible, then there exists a matrix \(B\) such that

\[AB = BA = I.\]

Applying transpose on both sides we get

\[B^T A^T = A^T B^T = I.\]

Thus \(B^T\) is inverse of \(A^T\) and \(A^T\) is invertible.

Lemma
If \(A\) is invertible then \(A^H\) is invertible.
Proof

Assume \(A\) is invertible, then there exists a matrix \(B\) such that

\[AB = BA = I.\]

Applying conjugate transpose on both sides we get

\[B^H A^H = A^H B^H = I.\]

Thus \(B^H\) is inverse of \(A^H\) and \(A^H\) is invertible.

Lemma
If \(A\) and \(B\) are invertible then \(AB\) is invertible.
Proof

We note that

\[(AB) (B^{-1}A^{-1}) = A (B B^{-1})A^{-1} = A I A^{-1} = I.\]

Similarly

\[(B^{-1}A^{-1}) (AB) = B^{-1} (A^{-1} A ) B = B^{-1} I B = I.\]

Thus \(B^{-1}A^{-1}\) is the inverse of \(AB\).

Lemma
The set of \(n \times n\) invertible matrices under the matrix multiplication operation form a group.
Proof

We verify the properties of a group

  • [Closure] If \(A\) and \(B\) are invertible then \(AB\) is invertible. Hence the set is closed.
  • [Associativity] Matrix multiplication is associative.
  • [Identity element] \(I\) is invertible and \(AI = IA = A\) for all invertible matrices.
  • [Inverse element] If \(A\) is invertible then \(A^{-1}\) is also invertible.

Thus the set of invertible matrices is indeed a group under matrix multiplication.

Lemma

An \(n \times n\) matrix \(A\) is invertible if and only if it is full rank i.e.

\[\Rank(A) = n.\]
Corollary
An invertible matrix and its inverse have the same rank.

Similar matrices

Definition

An \(n \times n\) matrix \(B\) is similar to an \(n \times n\) matrix \(A\) if there exists an \(n \times n\) non-singular matrix \(C\) such that

\[B = C^{-1} A C.\]
Lemma
If \(B\) is similar to \(A\) then \(A\) is similar to \(B\). Thus similarity is a symmetric relation.
Proof
\[B = C^{-1} A C \implies A = C B C^{-1} \implies A = (C^{-1})^{-1} B C^{-1}\]

Thus there exists a matrix \(D = C^{-1}\) such that

\[A = D^{-1} B D.\]

Thus \(A\) is similar to \(B\).

Lemma
Similar matrices have the same rank.
Proof

Let \(B\) be similar to \(A\). Thus there exists an invertible matrix \(C\) such that

\[B = C^{-1} A C.\]

Since \(C\) is invertible, we have \(\Rank (C) = \Rank(C^{-1}) = n\). Now, using the rank lemmas above, \(\Rank (AC) = \Rank (A)\) and \(\Rank(C^{-1} (AC) ) = \Rank (AC) = \Rank(A)\). Thus

\[\Rank(B) = \Rank(A).\]
Lemma
Similarity is an equivalence relation on the set of \(n \times n\) matrices.
Proof
Let \(A, B, C\) be \(n \times n\) matrices. We verify the three properties of an equivalence relation.

  • [Reflexivity] \(A\) is similar to itself via the invertible matrix \(I\).
  • [Symmetry] If \(A\) is similar to \(B\) then \(B\) is similar to \(A\) (previous lemma).
  • [Transitivity] If \(B\) is similar to \(A\) via \(P\), i.e. \(B = P^{-1}AP\), and \(C\) is similar to \(B\) via \(Q\), i.e. \(C = Q^{-1} B Q\), then \(C\) is similar to \(A\) via \(PQ\), since \(C = (PQ)^{-1} A (P Q)\).

Thus similarity is an equivalence relation on the set of square matrices, and if \(A\) is any \(n \times n\) matrix then the set of \(n \times n\) matrices similar to \(A\) forms an equivalence class.

Gram matrices

Definition

Gram matrix of columns of \(A\) is given by

\[G = A^H A\]
Definition

Gram matrix of rows of \(A\) is given by

\[G = A A^H\]

This is also known as the frame operator of \(A\).

Remark
Usually when we talk about Gram matrix of a matrix we are looking at the Gram matrix of its column vectors.
Remark
For real matrix \(A \in \RR^{m \times n}\), the Gram matrix of its column vectors is given by \(A^T A\) and the Gram matrix for its row vectors is given by \(A A^T\).

The following results apply equally well to the real case.

Lemma
The columns of a matrix are linearly dependent if and only if the Gram matrix of its column vectors \(A^H A\) is not invertible.
Proof

Let \(A\) be an \(m\times n\) matrix and \(G = A^H A\) be the Gram matrix of its columns.

If columns of \(A\) are linearly dependent, then there exists a vector \(u \neq 0\) such that

\[A u = 0.\]

Thus

\[G u = A^H A u = 0.\]

Hence the columns of \(G\) are also dependent and \(G\) is not invertible.

Conversely let us assume that \(G\) is not invertible, thus columns of \(G\) are dependent and there exists a vector \(v \neq 0\) such that

\[G v = 0.\]

Now

\[v^H G v = v^H A^H A v = (A v)^H (A v) = \| A v \|_2^2.\]

From the previous equation, we have

\[\| A v \|_2^2 = 0 \implies A v = 0.\]

Since \(v \neq 0\) hence columns of \(A\) are also linearly dependent.

Corollary
The columns of a matrix are linearly independent if and only if the Gram matrix of its column vectors \(A^H A\) is invertible.
Proof

Columns of \(A\) can be dependent only if its Gram matrix is not invertible. Thus if the Gram matrix is invertible, then the columns of \(A\) are linearly independent.

The Gram matrix is not invertible only if columns of \(A\) are linearly dependent. Thus if columns of \(A\) are linearly independent then the Gram matrix is invertible.

Corollary
Let \(A\) be a full column rank matrix. Then \(A^H A\) is invertible.
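These facts are easy to check numerically. The following is a minimal MATLAB sketch using only stock functions (the sizes and the random matrices are arbitrary illustrations, not part of the library):

    % a random tall matrix has full column rank with probability 1
    A = randn(8, 3);
    G = A' * A;           % Gram matrix of the columns of A
    rank(G)               % 3, so G is invertible
    % repeating a column makes the columns dependent
    B = [A, A(:, 1)];
    rank(B' * B)          % 3 < 4, so the Gram matrix of B is not invertible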
Lemma

The null space of \(A\) and the null space of its Gram matrix \(A^H A\) coincide, i.e.

\[\NullSpace(A) = \NullSpace(A^H A).\]
Proof

Let \(u \in \NullSpace(A)\). Then

\[A u = 0 \implies A^H A u = 0.\]

Thus

\[u \in \NullSpace(A^HA ) \implies \NullSpace(A) \subseteq \NullSpace(A^H A).\]

Now let \(u \in \NullSpace(A^H A)\). Then

\[A^H A u = 0 \implies u^H A^H A u = 0 \implies \| A u \|_2^2 = 0 \implies A u = 0.\]

Thus we have

\[u \in \NullSpace(A ) \implies \NullSpace(A^H A) \subseteq \NullSpace(A).\]
Lemma
The rows of a matrix \(A\) are linearly dependent if and only if the Gram matrix of its row vectors \(AA^H\) is not invertible.
Proof

The rows of \(A\) are linearly dependent if and only if the columns of \(A^H\) are linearly dependent. Assume that the rows are dependent; then there exists a vector \(v \neq 0\) s.t.

\[A^H v = 0\]

Thus

\[G v = A A^H v = 0.\]

Since \(v \neq 0\) hence \(G\) is not invertible.

Converse: assuming that \(G\) is not invertible, there exists a vector \(u \neq 0\) s.t.

\[G u = 0.\]

Now

\[u^H G u = u^H A A^H u = (A^H u)^H (A^H u) = \| A^H u \|_2^2 = 0 \implies A^H u = 0.\]

Since \(u \neq 0\) hence columns of \(A^H\) and consequently rows of \(A\) are linearly dependent.

Corollary
The rows of a matrix \(A\) are linearly independent if and only if the Gram matrix of its row vectors \(AA^H\) is invertible.
Corollary
Let \(A\) be a full row rank matrix. Then \(A A^H\) is invertible.

Pseudo inverses

Definition

Let \(A\) be an \(m \times n\) matrix. An \(n\times m\) matrix \(A^{\dag}\) is called its Moore-Penrose pseudo-inverse if it satisfies all of the following criteria:

  • \(A A^{\dag} A = A\).
  • \(A^{\dag} A A^{\dag} = A^{\dag}\).
  • \(\left(A A^{\dag} \right)^H = A A^{\dag}\) i.e. \(A A^{\dag}\) is Hermitian.
  • \((A^{\dag} A)^H = A^{\dag} A\) i.e. \(A^{\dag} A\) is Hermitian.
Theorem
For any matrix \(A\) there exists precisely one matrix \(A^{\dag}\) which satisfies all the requirements above.

We omit the proof for this. The pseudo-inverse can actually be obtained from the singular value decomposition of \(A\), as discussed later.
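The four defining properties can be verified numerically with MATLAB's built-in pinv. This is only an illustrative sketch with an arbitrary random matrix:

    A = randn(4, 6);
    P = pinv(A);                 % Moore-Penrose pseudo-inverse of A
    norm(A*P*A - A)              % ~ 0, first property
    norm(P*A*P - P)              % ~ 0, second property
    norm((A*P)' - A*P)           % ~ 0, A*P is Hermitian (symmetric for real A)
    norm((P*A)' - P*A)           % ~ 0, P*A is Hermitian (symmetric for real A)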

Lemma

Let \(D = \Diag(d_1, d_2, \dots, d_n)\) be an \(n \times n\) diagonal matrix. Then its Moore-Penrose pseudo-inverse is \(D^{\dag} = \Diag(c_1, c_2, \dots, c_n)\) where

\[\begin{split}c_i = \left\{ \begin{array}{ll} \frac{1}{d_i} & \mbox{if $d_i \neq 0$};\\ 0 & \mbox{if $d_i = 0$}. \end{array} \right.\end{split}\]
Proof

We note that \(D^{\dag} D = D D^{\dag} = F = \Diag(f_1, f_2, \dots f_n)\) where

\[\begin{split}f_i = \left\{ \begin{array}{ll} 1 & \mbox{if $d_i \neq 0$};\\ 0 & \mbox{if $d_i = 0$}. \end{array} \right.\end{split}\]

We now verify the requirements listed in the definition above.

\[D D^{\dag} D = F D = D.\]
\[D^{\dag} D D^{\dag} = F D^{\dag} = D^{\dag}\]

\(D^{\dag} D = D D^{\dag} = F\) is a diagonal hence Hermitian matrix.

Lemma

Let \(D = \Diag(d_1, d_2, \dots, d_p)\) be an \(m \times n\) rectangular diagonal matrix where \(p = \min(m, n)\). Then its Moore-Penrose pseudo-inverse is an \(n \times m\) rectangular diagonal matrix \(D^{\dag} = \Diag(c_1, c_2, \dots, c_p)\) where

\[\begin{split}c_i = \left\{ \begin{array}{ll} \frac{1}{d_i} & \mbox{if $d_i \neq 0$};\\ 0 & \mbox{if $d_i = 0$}. \end{array} \right.\end{split}\]
Proof

\(F = D^{\dag} D = \Diag(f_1, f_2, \dots f_n)\) is an \(n \times n\) matrix where

\[\begin{split}f_i = \left\{ \begin{array}{ll} 1 & \mbox{if $d_i \neq 0$};\\ 0 & \mbox{if $d_i = 0$};\\ 0 & \mbox{if $i > p$}. \end{array} \right.\end{split}\]

\(G = D D^{\dag} = \Diag(g_1, g_2, \dots, g_m)\) is an \(m \times m\) matrix where

\[\begin{split}g_i = \left\{ \begin{array}{ll} 1 & \mbox{if $d_i \neq 0$};\\ 0 & \mbox{if $d_i = 0$};\\ 0 & \mbox{if $i > p$}. \end{array} \right.\end{split}\]

We now verify the requirements listed in the definition above.

\[D D^{\dag} D = D F = D.\]
\[D^{\dag} D D^{\dag} = D^{\dag} G = D^{\dag}\]

\(F = D^{\dag} D\) and \(G = D D^{\dag}\) are both diagonal hence Hermitian matrices.

Lemma

If \(A\) is full column rank then its Moore-Penrose pseudo-inverse is given by

\[A^{\dag} = (A^H A)^{-1} A^H.\]

It is a left inverse of \(A\).

Proof

Since \(A\) has full column rank, \(A^H A\) is invertible by the corollary above.

First of all we verify that it is a left inverse.

\[A^{\dag} A = (A^H A)^{-1} A^H A = I.\]

We now verify all the properties.

\[A A^{\dag} A = A I = A.\]
\[A^{\dag} A A^{\dag} = I A^{\dag} = A^{\dag}.\]

Hermitian properties:

\[\left(A A^{\dag} \right)^H = \left(A (A^H A)^{-1} A^H \right)^H = \left(A (A^H A)^{-1} A^H \right) = A A^{\dag}.\]
\[(A^{\dag} A)^H = I^H = I = A^{\dag} A.\]
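For a full column rank matrix this formula can be compared directly against pinv. A sketch with an arbitrary random tall matrix:

    A = randn(7, 3);             % full column rank with probability 1
    Adag = (A' * A) \ A';        % (A^H A)^{-1} A^H
    norm(Adag - pinv(A))         % ~ 0, the two agree
    norm(Adag * A - eye(3))      % ~ 0, Adag is a left inverse of A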
Lemma

If \(A\) is full row rank then its Moore-Penrose pseudo-inverse is given by

\[A^{\dag} = A^H (A A^H)^{-1} .\]

It is a right inverse of \(A\).

Proof

Since \(A\) has full row rank, \(A A^H\) is invertible by the corollary above.

First of all we verify that it is a right inverse.

\[A A^{\dag} = A A^H (A A^H)^{-1}= I.\]

We now verify all the properties.

\[A A^{\dag} A = I A = A.\]
\[A^{\dag} A A^{\dag} = A^{\dag} I = A^{\dag}.\]

Hermitian properties:

\[\left(A A^{\dag} \right)^H = I^H = I = A A^{\dag}.\]
\[(A^{\dag} A)^H = \left (A^H (A A^H)^{-1} A \right )^H = A^H (A A^H)^{-1} A = A^{\dag} A.\]
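Similarly for the full row rank case (again only a sketch with an arbitrary random wide matrix):

    A = randn(3, 7);             % full row rank with probability 1
    Adag = A' / (A * A');        % A^H (A A^H)^{-1}
    norm(Adag - pinv(A))         % ~ 0
    norm(A * Adag - eye(3))      % ~ 0, Adag is a right inverse of A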

Trace and determinant

Trace

Definition

The trace of a square matrix is defined as the sum of the entries on its main diagonal. Let \(A\) be an \(n\times n\) matrix, then

\[\Trace (A) = \sum_{i=1}^n a_{ii}\]

where \(\Trace(A)\) denotes the trace of \(A\).

Lemma

The trace of a square matrix and its transpose are equal.

\[\Trace(A) = \Trace(A^T).\]
Lemma

Trace of sum of two square matrices is equal to the sum of their traces.

\[\Trace(A + B) = \Trace(A) + \Trace(B).\]
Lemma

Let \(A\) be an \(m \times n\) matrix and \(B\) be an \(n \times m\) matrix. Then

\[\Trace(AB) = \Trace(BA).\]
Proof

Let \(AB = C = [c_{ij}]\). Then

\[c_{ij} = \sum_{k=1}^n a_{i k} b_{k j}.\]

Thus

\[c_{ii} = \sum_{k=1}^n a_{i k} b_{k i}.\]

Now

\[\Trace(C) = \sum_{i=1}^m c_{ii} = \sum_{i=1}^m \sum_{k=1}^n a_{i k} b_{k i} = \sum_{k=1}^n \sum_{i=1}^m a_{i k} b_{k i} = \sum_{k=1}^n \sum_{i=1}^m b_{k i} a_{i k}.\]

Let \(BA = D = [d_{ij}]\). Then

\[d_{ij} = \sum_{k=1}^m b_{i k} a_{k j}.\]

Thus

\[d_{ii} = \sum_{k=1}^m b_{i k} a_{k i}.\]

Hence

\[\Trace(D) = \sum_{i=1}^n d_{ii} = \sum_{i=1}^n \sum_{k=1}^m b_{i k} a_{k i} = \sum_{i=1}^m \sum_{k=1}^n b_{k i} a_{i k}.\]

This completes the proof.
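A quick numerical check of this identity (the matrix sizes are arbitrary):

    A = randn(4, 6);
    B = randn(6, 4);
    trace(A * B) - trace(B * A)  % ~ 0 up to round-off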

Lemma

Let \(A \in \FF^{m \times n}\), \(B \in \FF^{n \times p}\), \(C \in \FF^{p \times m}\) be three matrices. Then

\[\Trace(ABC) = \Trace(BCA) = \Trace(CAB).\]
Proof

Let \(AB = D\). Then

\[\Trace(ABC) = \Trace(DC) = \Trace(CD) = \Trace(CAB).\]

Similarly the other result can be proved.

Lemma
Trace of similar matrices is equal.
Proof

Let \(B\) be similar to \(A\). Thus

\[B = C^{-1} A C\]

for some invertible matrix \(C\). Then

\[\Trace(B) = \Trace(C^{-1} A C ) = \Trace (C C^{-1} A) = \Trace(A).\]

Here we used the identity \(\Trace(AB) = \Trace(BA)\) proved above.

Determinants

Following are some results on the determinant of an \(n \times n\) square matrix \(A\).

Lemma
\[\det(\alpha A) = \alpha^n \det(A).\]
Lemma

Determinant of a square matrix and its transpose are equal.

\[\det(A) = \det(A^T).\]
Lemma

Let \(A\) be a complex square matrix. Then

\[\det(A^H) = \overline{\det(A)}.\]
Proof
\[\det(A^H) = \det(\overline{A}^T) = \det(\overline{A}) = \overline{\det(A)}.\]
Lemma

Let \(A\) and \(B\) be two \(n\times n\) matrices. Then

\[\det (A B) = \det(A) \det(B).\]
Lemma

Let \(A\) be an invertible matrix. Then

\[\det(A^{-1}) = \frac{1}{\det(A)}.\]
Lemma
\[\det(A^{p}) = \left(\det(A) \right)^p.\]
Lemma

Determinant of a triangular matrix is the product of its diagonal entries. i.e. if \(A\) is upper or lower triangular matrix then

\[\det(A) = \prod_{i=1}^n a_{i i}.\]
Lemma

Determinant of a diagonal matrix is the product of its diagonal entries. i.e. if \(A\) is a diagonal matrix then

\[\det(A) = \prod_{i=1}^n a_{i i}.\]
Lemma
Determinant of similar matrices is equal.
Proof

Let \(B\) be similar to \(A\). Thus

\[B = C^{-1} A C\]

for some invertible matrix \(C\). Hence

\[\det(B) = \det(C^{-1} A C ) = \det (C^{-1}) \det (A) \det(C).\]

Now

\[\det (C^{-1}) \det (A) \det(C) = \frac{1}{\det(C)} \det (A) \det(C) = \det(A).\]

We used the product rule for determinants and the formula for the determinant of an inverse.

Lemma

Let \(u\) and \(v\) be vectors in \(\FF^n\). Then

\[\det(I + u v^T) = 1 + u^T v.\]
Lemma

Let \(A\) be a square matrix and let \(\epsilon \approx 0\). Then

\[\det(I + \epsilon A ) \approx 1 + \epsilon \Trace(A).\]
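A numerical illustration of this first order approximation, with an arbitrary random matrix and a small \(\epsilon\):

    A = randn(5);
    epsilon = 1e-6;
    det(eye(5) + epsilon * A)    % approximately equal to the next line
    1 + epsilon * trace(A)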

Unitary and orthogonal matrices

Orthogonal matrix

Definition

A real square matrix \(U\) is called orthogonal if the columns of \(U\) form an orthonormal set. In other words, let

\[U = \begin{bmatrix} u_1 & u_2 & \dots & u_n \end{bmatrix}\]

with \(u_i \in \RR^n\). Then we have

\[u_i \cdot u_j = \delta_{i , j}.\]
Lemma
An orthogonal matrix \(U\) is invertible with \(U^T = U^{-1}\).
Proof

Let

\[U = \begin{bmatrix} u_1 & u_2 & \dots & u_n \end{bmatrix}\]

be orthogonal with

\[\begin{split}U^T = \begin{bmatrix} u_1^T \\ u_2^T \\ \vdots \\ u_n^T \end{bmatrix}\end{split}\]

Then

\[\begin{split}U^T U = \begin{bmatrix} u_1^T \\ u_2^T \\ \vdots \\ u_n^T \end{bmatrix} \begin{bmatrix} u_1 & u_2 & \dots & u_n \end{bmatrix} = \begin{bmatrix} u_i \cdot u_j \end{bmatrix} = I.\end{split}\]

Since columns of \(U\) are linearly independent and span \(\RR^n\), hence \(U\) is invertible. Thus

\[U^T = U^{-1}.\]
Lemma
Determinant of an orthogonal matrix is \(\pm 1\).
Proof

Let \(U\) be an orthogonal matrix. Then

\[\det (U^T U) = \det (I) \implies \left ( \det (U) \right )^2 = 1\]

Thus we have

\[\det(U) = \pm 1.\]
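Both properties are easy to check numerically; applying orth to a random square matrix yields an orthogonal matrix with probability 1 (a minimal sketch):

    U = orth(randn(4));          % square matrix with orthonormal columns
    norm(U' * U - eye(4))        % ~ 0, so U^T = U^{-1}
    det(U)                       % +1 or -1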

Unitary matrix

Definition

A complex square matrix \(U\) is called unitary if the columns of \(U\) form an orthonormal set. In other words, let

\[U = \begin{bmatrix} u_1 & u_2 & \dots & u_n \end{bmatrix}\]

with \(u_i \in \CC^n\). Then we have

\[u_i \cdot u_j = \langle u_i , u_j \rangle = u_j^H u_i = \delta_{i , j}.\]
Lemma
A unitary matrix \(U\) is invertible with \(U^H = U^{-1}\).
Proof

Let

\[U = \begin{bmatrix} u_1 & u_2 & \dots & u_n \end{bmatrix}\]

be unitary with

\[\begin{split}U^H = \begin{bmatrix} u_1^H \\ u_2^H \\ \vdots \\ u_n^H \end{bmatrix}\end{split}\]

Then

\[\begin{split}U^H U = \begin{bmatrix} u_1^H \\ u_2^H \\ \vdots \\ u_n^H \end{bmatrix} \begin{bmatrix} u_1 & u_2 & \dots & u_n \end{bmatrix} = \begin{bmatrix} u_i^H u_j \end{bmatrix} = I.\end{split}\]

Since columns of \(U\) are linearly independent and span \(\CC^n\), hence \(U\) is invertible. Thus

\[U^H = U^{-1}.\]
Lemma
The magnitude of determinant of a unitary matrix is \(1\).
Proof

Let \(U\) be a unitary matrix. Then

\[\det (U^H U) = \det (I) \implies \det(U^H) \det(U) = 1 \implies \overline{\det(U)}{\det(U)} = 1.\]

Thus we have

\[|\det(U) |^2 = 1 \implies |\det(U) | = 1.\]

F-unitary matrix

We provide a common definition for unitary matrices over any field \(\FF\). This definition applies to both real and complex matrices.

Definition

A square matrix \(U \in \FF^{n \times n}\) is called \(\FF\)-unitary if the columns of \(U\) form an orthonormal set. In other words, let

\[U = \begin{bmatrix} u_1 & u_2 & \dots & u_n \end{bmatrix}\]

with \(u_i \in \FF^n\). Then we have

\[\langle u_i , u_j \rangle = u_j^H u_i = \delta_{i , j}.\]

We note that a suitable definition of inner product transports the definition appropriately into orthogonal matrices over \(\RR\) and unitary matrices over \(\CC\).

When we are talking about \(\FF\) unitary matrices, then we will use the symbol \(U^H\) to mean its inverse. In the complex case, it will map to its conjugate transpose, while in real case it will map to simple transpose.

This definition helps us simplify some of the discussions in the sequel (like singular value decomposition).

Following results apply equally to orthogonal matrices for real case and unitary matrices for complex case.

Lemma

\(\FF\)-unitary matrices preserve norm. i.e.

\[\| U x \|_2 = \|x \|_2.\]
Proof
\[\| U x \|_2^2 = (U x)^H (U x) = x^H U^H U x = x^H I x = \| x\|_2^2.\]
Remark

For the real case we have

\[\| U x \|_2^2 = (U x)^T (U x) = x^T U^T U x = x^T I x = \| x\|_2^2.\]
Lemma

\(\FF\)-unitary matrices preserve inner product. i.e.

\[\langle U x, U y \rangle = \langle x, y \rangle.\]
Proof
\[\langle U x, U y \rangle = (U y)^H U x = y^H U^H U x = y^H x.\]
Remark

For the real case we have

\[\langle U x, U y \rangle = (U y)^T U x = y^T U^T U x = y^T x.\]
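A numerical check of both invariance properties for the real case (sketch with arbitrary random data):

    U = orth(randn(5));          % an orthogonal (F-unitary) matrix
    x = randn(5, 1); y = randn(5, 1);
    norm(U * x) - norm(x)        % ~ 0, the norm is preserved
    (U * y)' * (U * x) - y' * x  % ~ 0, the inner product is preserved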

Eigen values

Much of the discussion in this section will be equally applicable to real as well as complex matrices. We will use the complex notation mostly and make specific remarks for real matrices wherever needed.

Definition

A scalar \(\lambda\) is an eigen value of an \(n \times n\) matrix \(A = [ a_{ij} ]\) if there exists a non null vector \(x\) such that

(1)\[Ax = \lambda x.\]

A non null vector \(x\) which satisfies this equation is called an eigen vector of \(A\) for the eigen value \(\lambda\).

An eigen value is also known as a characteristic value, proper value or a latent value.

We note that (1) can be written as

\[Ax = \lambda I_n x \implies (A - \lambda I_n) x = 0.\]

Thus \(\lambda\) is an eigen value of \(A\) if and only if the matrix \(A - \lambda I\) is singular.

Definition
The set comprising of eigen values of a matrix \(A\) is known as its spectrum.
Remark
For each eigen vector \(x\) for a matrix \(A\) the corresponding eigen value \(\lambda\) is unique.
Proof

Assume that for \(x\) there are two eigen values \(\lambda_1\) and \(\lambda_2\), then

\[A x = \lambda_1 x = \lambda_2 x \implies (\lambda_1 - \lambda_2 ) x = 0.\]

This can happen only when either \(x = 0\) or \(\lambda_1 = \lambda_2\). Since \(x\) is an eigen vector, it cannot be 0. Thus \(\lambda_1 = \lambda_2\).

Remark

If \(x\) is an eigen vector for \(A\), then the corresponding eigen value is given by

\[\lambda = \frac{x^H A x }{x^H x}.\]
Proof
\[A x = \lambda x \implies x^H A x = \lambda x^H x \implies \lambda = \frac{x^H A x }{x^H x}.\]

since \(x\) is non-zero.

Remark

An eigen vector \(x\) of \(A\) for eigen value \(\lambda\) belongs to the null space of \(A - \lambda I\), i.e.

\[x \in \NullSpace(A - \lambda I).\]

In other words \(x\) is a nontrivial solution to the homogeneous system of linear equations given by

\[(A - \lambda I) z = 0.\]
Definition
Let \(\lambda\) be an eigen value for a square matrix \(A\). Then its eigen space is the null space of \(A - \lambda I\) i.e. \(\NullSpace(A - \lambda I)\).
Remark

The set comprising all the eigen vectors of \(A\) for an eigen value \(\lambda\) is given by

\[\NullSpace(A - \lambda I) \setminus \{ 0 \}\]

since \(0\) cannot be an eigen vector.

Definition
Let \(\lambda\) be an eigen value for a square matrix \(A\). The dimension of its eigen space \(\NullSpace(A - \lambda I)\) is known as the geometric multiplicity of the eigen value \(\lambda\).
Remark

Clearly

\[\dim (\NullSpace(A - \lambda I)) = n - \Rank(A - \lambda I).\]
Remark

A scalar \(\lambda\) can be an eigen value of a square matrix \(A\) if and only if

\[\det (A - \lambda I) = 0.\]

\(\det (A - \lambda I)\) is a polynomial in \(\lambda\) of degree \(n\).

Remark
\[\det (A - \lambda I) = p(\lambda) = \alpha_n \lambda^n + \alpha_{n-1} \lambda^{n-1} + \dots + \alpha_1 \lambda + \alpha_0\]

where \(\alpha_i\) depend on entries in \(A\).

In this sense, an eigen value of \(A\) is a root of the equation

\[p(\lambda) = 0.\]

It is easy to show that \(\alpha_n = (-1)^n\).

Definition

For any square matrix \(A\), the polynomial given by \(p(\lambda) = \det(A - \lambda I )\) is known as its characteristic polynomial. The equation given by

\[p(\lambda) = 0\]

is known as its characteristic equation. The eigen values of \(A\) are the roots of its characteristic polynomial or solutions of its characteristic equation.
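In MATLAB, the coefficients of the characteristic polynomial are returned by the built-in poly, which uses the convention \(\det(\lambda I - A)\), i.e. \((-1)^n p(\lambda)\) in the notation above; its roots are the eigen values. A sketch with an arbitrary random matrix:

    A = randn(4);
    c = poly(A);                 % coefficients of det(lambda*I - A)
    sort(roots(c))               % eigen values as roots of the characteristic polynomial
    sort(eig(A))                 % the same values up to ordering and round-off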

Lemma

For real square matrices, if we restrict eigen values to real values, then the characteristic polynomial can be factored as

\[p(\lambda) = (-1)^n (\lambda - \lambda_1)^{r_1} \dots (\lambda - \lambda_k)^{r_k} q(\lambda).\]

The polynomial has \(k\) distinct real roots. For each root \(\lambda_i\), \(r_i\) is a positive integer indicating how many times the root appears. \(q(\lambda)\) is a polynomial that has no real roots. The following is true

\[r_1 + \dots + r_k + deg(q(\lambda)) = n.\]

Clearly \(k \leq n\).

For complex square matrices where eigen values can be complex (including real square matrices), the characteristic polynomial can be factored as

\[p(\lambda) = (-1)^n (\lambda - \lambda_1)^{r_1} \dots (\lambda - \lambda_k)^{r_k}.\]

The polynomial can be completely factorized into first degree polynomials. There are \(k\) distinct roots or eigen values. The following is true

\[r_1 + \dots + r_k = n.\]

Thus including the duplicates there are exactly \(n\) eigen values for a complex square matrix.

Remark
It is quite possible that a real square matrix doesn’t have any real eigen values.
Definition
The number of times an eigen value appears in the factorization of the characteristic polynomial of a square matrix \(A\) is known as its algebraic multiplicity. In other words \(r_i\) is the algebraic multiplicity for \(\lambda_i\) in above factorization.
Remark
In the above factorization, the set \(\{\lambda_1, \dots, \lambda_k \}\) forms the spectrum of \(A\).

Let us consider the sum of the \(r_i\), which gives the total number of roots of \(p(\lambda)\) counted with multiplicity.

\[m = \sum_{i=1}^k r_i.\]

With this there are \(m\) not-necessarily distinct roots of \(p(\lambda)\). Let us write \(p(\lambda)\) as

\[p(\lambda) = (-1)^n (\lambda - c_1) (\lambda - c_2)\dots (\lambda - c_m)q(\lambda).\]

where \(c_1, c_2, \dots, c_m\) are \(m\) scalars (not necessarily distinct) of which \(r_1\) scalars are \(\lambda_1\), \(r_2\) are \(\lambda_2\) and so on. Obviously for the complex case \(q(\lambda)=1\).

We will refer to the set (allowing repetitions) \(\{c_1, c_2, \dots, c_m \}\) as the eigen values of the matrix \(A\) where \(c_i\) are not necessarily distinct. In contrast the spectrum of \(A\) refers to the set of distinct eigen values of \(A\). The symbol \(c\) has been chosen based on the other name for eigen values (the characteristic values).

We can put together the eigen vectors of a matrix into another matrix. This can be a very useful tool. We start with a simple idea.

Lemma

Let \(A\) be an \(n \times n\) matrix. Let \(u_1, u_2, \dots, u_r\) be \(r\) non-zero vectors from \(\FF^n\). Let us construct an \(n \times r\) matrix

\[U = \begin{bmatrix} u_1 & u_2 & \dots & u_r \end{bmatrix}.\]

Then all the \(r\) vectors are eigen vectors of \(A\) if and only if there exists a diagonal matrix \(D = \Diag(d_1, \dots, d_r)\) such that

\[A U = U D.\]
Proof

Expanding the equation, we can write

\[\begin{bmatrix} A u_1 & A u_2 & \dots & A u_r \end{bmatrix} = \begin{bmatrix} d_1 u_1 & d_2 u_2 & \dots & d_r u_r \end{bmatrix}.\]

Clearly we want

\[A u_i = d_i u_i\]

where \(u_i\) are non-zero. This is possible only when \(d_i\) is an eigen value of \(A\) and \(u_i\) is an eigen vector for \(d_i\).

Converse: Assume that \(u_i\) are eigen vectors. Choose \(d_i\) to be corresponding eigen values. Then the equation holds.
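This is exactly the relation returned by MATLAB's eig: the eigen vector matrix V and the diagonal eigen value matrix D satisfy A V = V D. A minimal sketch:

    A = randn(4);
    [V, D] = eig(A);             % columns of V are eigen vectors, D is diagonal
    norm(A * V - V * D)          % ~ 0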

Lemma
\(0\) is an eigen value of a square matrix \(A\) if and only if \(A\) is singular.
Proof

Let \(0\) be an eigen value of \(A\). Then there exists \(u \neq 0\) such that

\[A u = 0 u = 0.\]

Thus \(u\) is a non-trivial solution of the homogeneous linear system. Thus \(A\) is singular.

Converse: Assuming that \(A\) is singular, there exists \(u \neq 0\) s.t.

\[A u = 0 = 0 u.\]

Thus \(0\) is an eigen value of \(A\).

Lemma
If a square matrix \(A\) is singular, then \(\NullSpace(A)\) is the eigen space for the eigen value \(\lambda = 0\).
Proof
This is straightforward from the definition of the eigen space given above.
Remark
Clearly the geometric multiplicity of \(\lambda=0\) equals \(\Nullity(A) = n - \Rank(A)\).
Lemma
Let \(A\) be a square matrix. Then \(A\) and \(A^T\) have same eigen values.
Proof

The eigen values of \(A^T\) are given by

\[\det (A^T - \lambda I) = 0.\]

But

\[A^T - \lambda I = A^T - (\lambda I )^T = (A - \lambda I)^T.\]

Hence, since a matrix and its transpose have equal determinants,

\[\det (A^T - \lambda I) = \det \left ( (A - \lambda I)^T \right ) = \det (A - \lambda I).\]

Thus the characteristic polynomials of \(A\) and \(A^T\) are the same. Hence the eigen values are the same. In other words, the spectra of \(A\) and \(A^T\) are the same.

Remark

If \(x\) is an eigen vector with a non-zero eigen value \(\lambda\) for \(A\) then \(Ax\) and \(x\) are collinear.

In other words, the angle between \(Ax\) and \(x\) is \(0^{\circ}\) when \(\lambda\) is positive and \(180^{\circ}\) when \(\lambda\) is negative. Let us look at the inner product:

\[\langle Ax, x \rangle = x^H A x = x^H \lambda x = \lambda \| x\|_2^2.\]

Meanwhile

\[\| A x \|_2 = \| \lambda x \|_2 = |\lambda| \| x \|_2.\]

Thus

\[|\langle Ax, x \rangle | = \| Ax \|_2 \| x \|_2.\]

The angle \(\theta\) between \(Ax\) and \(x\) is given by

\[\cos \theta = \frac{\langle Ax, x \rangle}{\| Ax \|_2 \| x \|_2} = \frac{\lambda \| x\|_2^2}{|\lambda| \| x \|_2^2} = \pm 1.\]
Lemma
Let \(A\) be a square matrix and \(\lambda\) be an eigen value of \(A\). Let \(p \in \Nat\). Then \(\lambda^p\) is an eigen value of \(A^{p}\).
Proof

For \(p=1\) the statement holds trivially since \(\lambda^1\) is an eigen value of \(A^1\). Assume that the statement holds for some value of \(p\), i.e. if \(u\) is an eigen vector of \(A\) for the eigen value \(\lambda\), then \(A^p u = \lambda^p u\). Now

\[A^{p + 1} u = A^ p ( A u) = A^{p} \lambda u = \lambda A^{p} u = \lambda \lambda^p u = \lambda^{p + 1} u.\]

Thus \(\lambda^{p + 1}\) is an eigen value for \(A^{p + 1}\) with the same eigen vector \(u\). With the principle of mathematical induction, the proof is complete.

Lemma
Let a square matrix \(A\) be non singular and let \(\lambda \neq 0\) be some eigen value of \(A\). Then \(\lambda^{-1}\) is an eigen value of \(A^{-1}\). Moreover, all eigen values of \(A^{-1}\) are obtained by taking inverses of eigen values of \(A\) i.e. if \(\mu \neq 0\) is an eigen value of \(A^{-1}\) then \(\frac{1}{\mu}\) is an eigen value of \(A\) also. Also, \(A\) and \(A^{-1}\) share the same set of eigen vectors.
Proof

Let \(u \neq 0\) be an eigen vector of \(A\) for the eigen value \(\lambda\). Then

\[A u = \lambda u \implies u = A^{-1} \lambda u \implies \frac{1}{\lambda} u = A^{-1} u.\]

Thus \(u\) is also an eigen vector of \(A^{-1}\) for the eigen value \(\frac{1}{\lambda}\).

Now let \(B = A^{-1}\). Then \(B^{-1} = A\). Thus if \(\mu\) is an eigen value of \(B\) then \(\frac{1}{\mu}\) is an eigen value of \(B^{-1} = A\).

Thus if \(A\) is invertible then eigen values of \(A\) and \(A^{-1}\) have one to one correspondence.

This result is very useful: if it can be shown that a matrix \(A\) is similar to a diagonal or triangular matrix, whose eigen values are easy to obtain, then determining the eigen values of \(A\) becomes straightforward.

Invariant subspaces

Definition

Let \(A\) be a square \(n\times n\) matrix and let \(\WW\) be a subspace of \(\FF^n\), i.e. \(\WW \leq \FF^n\). Then \(\WW\) is invariant relative to \(A\) if

\[A w \in \WW \Forall w \in \WW.\]

i.e. \(A (\WW) \subseteq \WW\), or for every vector \(w \in \WW\) its image \(A w\) is also in \(\WW\). Thus the action of \(A\) on \(\WW\) doesn't take us outside of \(\WW\).

We also say that \(\WW\) is \(A\)-invariant.

Eigen vectors are generators of invariant subspaces.

Lemma

Let \(A\) be an \(n \times n\) matrix. Let \(x_1, x_2, \dots, x_r\) be \(r\) eigen vectors of \(A\). Let us construct an \(n \times r\) matrix

\[X = \begin{bmatrix} x_1 & x_2 & \dots & x_r \end{bmatrix}.\]

Then the column space of \(X\) i.e. \(\ColSpace(X)\) is invariant relative to \(A\).

Proof

Let us assume that \(c_1, c_2, \dots, c_r\) are the eigen values corresponding to \(x_1, x_2, \dots, x_r\) (not necessarily distinct).

Let any vector \(x \in \ColSpace(X)\) be given by

\[x = \sum_{i=1}^r \alpha_i x_i.\]

Then

\[A x= A \sum_{i=1}^r \alpha_i x_i = \sum_{i=1}^r \alpha_i A x_i = \sum_{i=1}^r \alpha_i c_i x_i.\]

Clearly \(Ax\) is also a linear combination of the \(x_i\), hence it belongs to \(\ColSpace(X)\). Thus \(\ColSpace(X)\) is invariant relative to \(A\), i.e. it is \(A\)-invariant.

Triangular matrices

Lemma
Let \(A\) be an \(n\times n\) upper or lower triangular matrix. Then its eigen values are the entries on its main diagonal.
Proof

If \(A\) is triangular then \(A - \lambda I\) is also triangular with diagonal entries \((a_{i i} - \lambda)\). Using the formula for the determinant of a triangular matrix, we have

\[p(\lambda) = \det (A - \lambda I) = \prod_{i=1}^n (a_{i i} - \lambda).\]

Clearly the roots of characteristic polynomial are \(a_{i i}\).

Several small results follow from this lemma.

Corollary

Let \(A = [a_{i j}]\) be an \(n \times n\) triangular matrix.

  1. The characteristic polynomial of \(A\) is \(p(\lambda) = (-1)^n \prod_{i=1}^n (\lambda - a_{i i})\).
  2. A scalar \(\lambda\) is an eigen value of \(A\) iff it is one of the diagonal entries of \(A\).
  3. The algebraic multiplicity of an eigen value \(\lambda\) is equal to the number of times it appears on the main diagonal of \(A\).
  4. The spectrum of \(A\) is given by the distinct entries on the main diagonal of \(A\).

A diagonal matrix is naturally both an upper triangular matrix as well as a lower triangular matrix. Similar results hold for the eigen values of a diagonal matrix also.

Lemma

Let \(A = [a_{i j}]\) be an \(n \times n\) diagonal matrix.

  1. Its eigen values are the entries on its main diagonal.
  2. The characteristic polynomial of \(A\) is \(p(\lambda) = (-1)^n \prod_{i=1}^n (\lambda - a_{i i})\).
  3. A scalar \(\lambda\) is an eigen value of \(A\) iff it is one of the diagonal entries of \(A\).
  4. The algebraic multiplicity of an eigen value \(\lambda\) is equal to the number of times it appears on the main diagonal of \(A\).
  5. The spectrum of \(A\) is given by the distinct entries on the main diagonal of \(A\).

There is also a result for the geometric multiplicity of eigen values for a diagonal matrix.

Lemma
Let \(A = [a_{i j}]\) be an \(n \times n\) diagonal matrix. The geometric multiplicity of an eigen value \(\lambda\) is equal to the number of times it appears on the main diagonal of \(A\).
Proof

The unit vectors \(e_i\) are eigen vectors for \(A\) since

\[A e_i = a_{i i } e_i.\]

They are linearly independent. Thus if a particular eigen value appears \(r\) times on the main diagonal, then there are \(r\) linearly independent eigen vectors for that eigen value. Hence its geometric multiplicity equals its algebraic multiplicity.

Similar matrices

Some very useful results are available for similar matrices.

Lemma
Similar matrices have the same characteristic polynomial and spectrum.
Proof

Let \(B\) be similar to \(A\). Thus there exists an invertible matrix \(C\) such that

\[B = C^{-1} A C.\]

Now

\[B - \lambda I = C^{-1} A C - \lambda I = C^{-1} A C - \lambda C^{-1} C = C^{-1} ( AC - \lambda C) = C^{-1} (A - \lambda I) C.\]

Thus \(B - \lambda I\) is similar to \(A - \lambda I\). Hence, since similar matrices have equal determinants,

\[\det(B - \lambda I ) = \det(A - \lambda I ).\]

This means that the characteristic polynomials of \(A\) and \(B\) are the same. Since eigen values are nothing but the roots of the characteristic polynomial, they are the same too. Hence the spectrum (the set of distinct eigen values) is the same.

Corollary

If \(A\) and \(B\) are similar to each other then

  1. An eigen value has same algebraic and geometric multiplicity for both \(A\) and \(B\).
  2. The (not necessarily distinct) eigen values of \(A\) and \(B\) are same.

Although the eigen values are the same, the eigen vectors in general are different.

Lemma

Let \(A\) and \(B\) be similar with

\[B = C^{-1} A C\]

for some invertible matrix \(C\). If \(u\) is an eigen vector of \(A\) for an eigen value \(\lambda\), then \(C^{-1} u\) is an eigen vector of \(B\) for the same eigen value.

Proof

\(u\) is an eigen vector of \(A\) for an eigen value \(\lambda\). Thus we have

\[A u = \lambda u.\]

Thus

\[B C^{-1} u = C^{-1} A C C^{-1} u = C^{-1} A u = C^{-1} \lambda u = \lambda C^{-1} u.\]

Now \(u \neq 0\) and \(C^{-1}\) is non singular. Thus \(C^{-1} u \neq 0\). Thus \(C^{-1}u\) is an eigen vector of \(B\).

Theorem
Let \(\lambda\) be an eigen value of a square matrix \(A\). Then the geometric multiplicity of \(\lambda\) is less than or equal to its algebraic multiplicity.
Corollary
If an \(n \times n\) matrix \(A\) has \(n\) distinct eigen values, then each of them has a geometric (and algebraic) multiplicity of \(1\).
Proof
The algebraic multiplicity of an eigen value is at least 1, and the algebraic multiplicities sum to at most \(n\). Since there are \(n\) distinct eigen values, each of them has algebraic multiplicity \(1\). Now the geometric multiplicity of an eigen value is at least 1 and at most its algebraic multiplicity, hence it is also \(1\).
Corollary

Let an \(n \times n\) matrix \(A\) have \(k\) distinct eigen values \(\lambda_1, \lambda_2, \dots, \lambda_k\) with algebraic multiplicities \(r_1, r_2, \dots, r_k\) and geometric multiplicities \(g_1, g_2, \dots g_k\) respectively. Then

\[\sum_{i=1}^k g_i \leq \sum_{i=1}^k r_i \leq n.\]

Moreover if

\[\sum_{i=1}^k g_i = \sum_{i=1}^k r_i\]

then

\[g_i = r_i \text{ for every } 1 \leq i \leq k.\]

Linear independence of eigen vectors

Theorem
Let \(A\) be an \(n\times n\) square matrix. Let \(x_1, x_2, \dots , x_k\) be any \(k\) eigen vectors of \(A\) for distinct eigen values \(\lambda_1, \lambda_2, \dots, \lambda_k\) respectively. Then \(x_1, x_2, \dots , x_k\) are linearly independent.
Proof

We first prove the simpler case with 2 eigen vectors \(x_1\) and \(x_2\) and corresponding eigen values \(\lambda_1\) and \(\lambda_2\) respectively.

Let there be a linear relationship between \(x_1\) and \(x_2\) given by

\[\alpha_1 x_1 + \alpha_2 x_2 = 0.\]

Multiplying both sides with \((A - \lambda_1 I)\) we get

\[\begin{split}\begin{aligned} & \alpha_1 (A - \lambda_1 I) x_1 + \alpha_2(A - \lambda_1 I) x_2 = 0\\ \implies & \alpha_1 (\lambda_1 - \lambda_1) x_1 + \alpha_2(\lambda_1 - \lambda_2) x_2 = 0\\ \implies & \alpha_2(\lambda_1 - \lambda_2) x_2 = 0. \end{aligned}\end{split}\]

Since \(\lambda_1 \neq \lambda_2\) and \(x_2 \neq 0\) , hence \(\alpha_2 = 0\).

Similarly by multiplying with \((A - \lambda_2 I)\) on both sides, we can show that \(\alpha_1 = 0\). Thus \(x_1\) and \(x_2\) are linearly independent.

Now for the general case, consider a linear relationship between \(x_1, x_2, \dots , x_k\) given by

\[\alpha_1 x_1 + \alpha_2 x_2 + \dots + \alpha_k x_k = 0.\]

Multiplying both sides by \(\prod_{i \neq j, i=1}^k (A - \lambda_i I)\) kills every term except the \(j\)-th one and leaves \(\alpha_j \prod_{i \neq j} (\lambda_j - \lambda_i) x_j = 0\). Since the eigen values are distinct and \(x_j \neq 0\), we get \(\alpha_j = 0\). Thus the only linear relationship is the trivial one. This completes the proof.

For an eigen value with geometric multiplicity greater than \(1\) there are multiple linearly independent eigen vectors corresponding to it. In this context, the above theorem can be generalized further.

Theorem
Let \(\lambda_1, \lambda_2, \dots, \lambda_k\) be \(k\) distinct eigen values of \(A\). Let \(\{x_1^j, x_2^j, \dots x_{g_j}^j\}\) be any \(g_j\) linearly independent eigen vectors from the eigen space of \(\lambda_j\) where \(g_j\) is the geometric multiplicity of \(\lambda_j\). Then the combined set of eigen vectors given by \(\{x_1^1, \dots x_{g_1}^1, \dots x_1^k, \dots x_{g_k}^k\}\) consisting of \(\sum_{j=1}^k g_j\) eigen vectors is linearly independent.

This result puts an upper limit on the number of linearly independent eigen vectors of a square matrix.

Lemma

Let \(\{ \lambda_1, \dots, \lambda_k \}\) represent the spectrum of an \(n \times n\) matrix \(A\). Let \(g_1, \dots, g_k\) be the geometric multiplicities of \(\lambda_1, \dots, \lambda_k\) respectively. Then the maximum number of linearly independent eigen vectors of \(A\) is

\[\sum_{i=1}^k g_i.\]

Moreover if

\[\sum_{i=1}^k g_i = n\]

then a set of \(n\) linearly independent eigen vectors of \(A\) can be found which forms a basis for \(\FF^n\).

Diagonalization

Diagonalization is one of the fundamental operations in linear algebra. This section discusses diagonalization of square matrices in depth.

Definition
An \(n \times n\) matrix \(A\) is said to be diagonalizable if it is similar to a diagonal matrix. In other words there exists an \(n\times n\) non-singular matrix \(P\) such that \(D = P^{-1} A P\) is a diagonal matrix. If this happens then we say that \(P\) diagonalizes \(A\) or \(A\) is diagonalized by \(P\).
Remark
\[D = P^{-1} A P \iff P D = A P \iff P D P^{-1} = A.\]

We note that if we restrict to real matrices, then \(P\) and \(D\) should also be real. If \(A \in \CC^{n \times n}\) (it may still be real) then \(P\) and \(D\) can be complex.

The next theorem is the culmination of a variety of results studied so far.

Theorem

Let \(A\) be a diagonalizable matrix with \(D = P^{-1} A P\) being its diagonalization. Let \(D = \Diag(d_1, d_2, \dots, d_n)\). Then the following hold

  1. \(\Rank(A) = \Rank(D)\) which equals the number of non-zero entries on the main diagonal of \(D\).

  2. \(\det(A) = d_1 d_2 \dots d_n\).

  3. \(\Trace(A) = d_1 + d_2 + \dots d_n\).

  4. The characteristic polynomial of \(A\) is

    \[p(\lambda) = (-1)^n (\lambda - d_1) (\lambda -d_2) \dots (\lambda - d_n).\]
  5. The spectrum of \(A\) comprises the distinct scalars on the diagonal entries in \(D\).

  6. The (not necessarily distinct) eigenvalues of \(A\) are the diagonal elements of \(D\).

  7. The columns of \(P\) are (linearly independent) eigenvectors of \(A\).

  8. The algebraic and geometric multiplicities of an eigenvalue \(\lambda\) of \(A\) equal the number of diagonal elements of \(D\) that equal \(\lambda\).

Proof

By the definition of diagonalization, \(D\) and \(A\) are similar. Since similar matrices have equal determinants,

\[\det(A) = \det(D).\]

Since the determinant of a diagonal matrix is the product of its diagonal entries,

\[\det(D) = \prod_{i=1}^n d_i.\]

Now, since similar matrices have equal trace,

\[\Trace(A) = \Trace(D) = \sum_{i=1}^n d_i.\]

Further, since similar matrices have the same characteristic polynomial and spectrum, these are shared by \(A\) and \(D\). The eigen values of a diagonal matrix are nothing but its diagonal entries, hence they are also the eigen values of \(A\).

\[D = P^{-1} A P \implies A P = P D.\]

Now writing

\[P = \begin{bmatrix} p_1 & p_2 & \dots & p_n \end{bmatrix}\]

we have

\[A P = \begin{bmatrix} A p_1 & A p_2 & \dots & A p_n \end{bmatrix} = P D = \begin{bmatrix} d_1 p_1 & d_2 p_2 & \dots & d_n p_n \end{bmatrix}.\]

Thus \(p_i\) are eigen vectors of \(A\).

Since the characteristic polynomials of \(A\) and \(D\) are same, hence the algebraic multiplicities of eigen values are same.

From the lemma on eigen vectors of similar matrices, there is a one to one correspondence between the eigen vectors of \(A\) and \(D\) through the change of basis given by \(P\). Thus the linear independence relationships between the eigen vectors remain the same. Hence the geometric multiplicities of individual eigenvalues are also the same.

This completes the proof.

So far we have verified various results which are available if a matrix \(A\) is diagonalizable. We haven’t yet identified the conditions under which \(A\) is diagonalizable. We note that not every matrix is diagonalizable. The following theorem gives necessary and sufficient conditions under which a matrix is diagonalizable.

Theorem
An \(n \times n\) matrix \(A\) is diagonalizable by an \(n \times n\) non-singular matrix \(P\) if and only if the columns of \(P\) are (linearly independent) eigenvectors of \(A\).
Proof

We note that since \(P\) is non-singular hence columns of \(P\) have to be linearly independent.

The necessary condition part was proven in the previous theorem. We now show that if \(P\) consists of \(n\) linearly independent eigen vectors of \(A\) then \(A\) is diagonalizable.

Let the columns of \(P\) be \(p_1, p_2, \dots, p_n\) and corresponding (not necessarily distinct) eigen values be \(d_1, d_2, \dots , d_n\). Then

\[A p_i = d_i p_i.\]

Thus by letting \(D = \Diag (d_1, d_2, \dots, d_n)\), we have

\[A P = P D.\]

Now since columns of \(P\) are linearly independent, hence \(P\) is invertible. This gives us

\[D = P^{-1} A P.\]

Thus \(A\) is similar to a diagonal matrix \(D\). This validates the sufficient condition.

A corollary follows.

Corollary
An \(n \times n\) matrix is diagonalizable if and only if there exists a linearly independent set of \(n\) eigenvectors of \(A\).

Now we know that geometric multiplicities of eigen values of \(A\) provide us information about linearly independent eigenvectors of \(A\).

Corollary

Let \(A\) be an \(n \times n\) matrix. Let \(\lambda_1, \lambda_2, \dots, \lambda_k\) be its \(k\) distinct eigen values (comprising its spectrum). Let \(g_j\) be the geometric multiplicity of \(\lambda_j\). Then \(A\) is diagonalizable if and only if

\[\sum_{j=1}^k g_j = n.\]
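When such a set of \(n\) linearly independent eigen vectors exists, the eigen vector matrix returned by eig diagonalizes the matrix. A sketch; a random matrix has distinct eigen values with probability 1 and is therefore diagonalizable:

    A = randn(5);
    [P, D] = eig(A);             % A P = P D
    norm(P \ (A * P) - D)        % ~ 0, so D = P^{-1} A P
    norm(A - P * D / P)          % ~ 0, so A = P D P^{-1}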

Symmetric matrices

This subsection is focused on real symmetric matrices.

Following is a fundamental property of real symmetric matrices.

Theorem
Every real symmetric matrix has an eigen value.

The proof of this result is beyond the scope of this book.

Lemma
Let \(A\) be an \(n \times n\) real symmetric matrix. Let \(\lambda_1\) and \(\lambda_2\) be any two distinct eigen values of \(A\) and let \(x_1\) and \(x_2\) be any two corresponding eigen vectors. Then \(x_1\) and \(x_2\) are orthogonal.
Proof

By definition we have \(A x_1 = \lambda_1 x_1\) and \(A x_2 = \lambda_2 x_2\). Thus

\[\begin{split}\begin{aligned} & x_2^T A x_1 = \lambda_1 x_2^T x_1\\ \implies & x_1^T A^T x_2 = \lambda_1 x_1^T x_2 \\ \implies & x_1^T A x_2 = \lambda_1 x_1^T x_2\\ \implies & x_1^T \lambda_2 x_2 = \lambda_1 x_1^T x_2\\ \implies & (\lambda_1 - \lambda_2) x_1^T x_2 = 0 \\ \implies & x_1^T x_2 = 0. \end{aligned}\end{split}\]

Thus \(x_1\) and \(x_2\) are orthogonal. Along the way we took the transpose of both sides and used the facts that \(A = A^T\) and \(\lambda_1 - \lambda_2 \neq 0\).

Definition

A real \(n \times n\) matrix \(A\) is said to be orthogonally diagonalizable if there exists an orthogonal matrix \(U\) which can diagonalize \(A\), i.e.

\[D = U^T A U\]

is a real diagonal matrix.

Lemma
Every orthogonally diagonalizable matrix \(A\) is symmetric.
Proof

We have a diagonal matrix \(D\) such that

\[A = U D U^T.\]

Taking transpose on both sides we get

\[A^T = U D^T U^T = U D U^T = A.\]

Thus \(A\) is symmetric.

Theorem
Every symmetric matrix \(A\) is orthogonally diagonalizable.

We skip the proof of this theorem.
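For a real symmetric matrix, eig returns an orthogonal eigen vector matrix, which illustrates the theorem. A minimal sketch:

    S = randn(5); S = (S + S') / 2;   % a real symmetric matrix
    [U, D] = eig(S);
    norm(U' * U - eye(5))             % ~ 0, U is orthogonal
    norm(S - U * D * U')              % ~ 0, S = U D U^T
    isreal(diag(D))                   % true, the eigen values are real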

Hermitian matrices

Following is a fundamental property of Hermitian matrices.

Theorem
Every Hermitian matrix has an eigen value.

The proof of this result is beyond the scope of this book.

Lemma
The eigenvalues of a Hermitian matrix are real.
Proof

Let \(A\) be a Hermitian matrix and let \(\lambda\) be an eigen value of \(A\). Let \(u\) be a corresponding eigen vector. Then

\[\begin{split}\begin{aligned} & A u = \lambda u\\ \implies & u^H A^H = u^H \overline{\lambda} \\ \implies & u^H A^H u = u^H \overline{\lambda} u\\ \implies & u^H A u = \overline{\lambda} u^H u \\ \implies & u^H \lambda u = \overline{\lambda} u^H u \\ \implies &\|u\|_2^2 (\lambda - \overline{\lambda}) = 0\\ \implies & \lambda = \overline{\lambda} \end{aligned}\end{split}\]

thus \(\lambda\) is real. We used the facts that \(A = A^H\) and \(u \neq 0 \implies \|u\|_2 \neq 0\).

Lemma
Let \(A\) be an \(n \times n\) complex Hermitian matrix. Let \(\lambda_1\) and \(\lambda_2\) be any two distinct eigen values of \(A\) and let \(x_1\) and \(x_2\) be any two corresponding eigen vectors. Then \(x_1\) and \(x_2\) are orthogonal.
Proof

By definition we have \(A x_1 = \lambda_1 x_1\) and \(A x_2 = \lambda_2 x_2\). Thus

\[\begin{split}\begin{aligned} & x_2^H A x_1 = \lambda_1 x_2^H x_1\\ \implies & x_1^H A^H x_2 = \lambda_1 x_1^H x_2 \\ \implies & x_1^H A x_2 = \lambda_1 x_1^H x_2\\ \implies & x_1^H \lambda_2 x_2 = \lambda_1 x_1^H x_2\\ \implies & (\lambda_1 - \lambda_2) x_1^H x_2 = 0 \\ \implies & x_1^H x_2 = 0. \end{aligned}\end{split}\]

Thus \(x_1\) and \(x_2\) are orthogonal. Along the way we took the conjugate transpose of both sides and used the facts that \(A = A^H\) and \(\lambda_1 - \lambda_2 \neq 0\).

Definition

A complex \(n \times n\) matrix \(A\) is said to be unitary diagonalizable if there exists a unitary matrix \(U\) which can diagonalize \(A\), i.e.

\[D = U^H A U\]

is a complex diagonal matrix.

Lemma
Let \(A\) be a unitary diagonalizable matrix whose diagonalization \(D\) is real. Then \(A\) is Hermitian.
Proof

We have a real diagonal matrix \(D\) such that

\[A = U D U^H.\]

Taking conjugate transpose on both sides we get

\[A^H = U D^H U^H = U D U^H = A.\]

Thus \(A\) is Hermitian. We used the fact that \(D^H = D\) since \(D\) is real.

Theorem
Every Hermitian matrix \(A\) is unitary diagonalizable.

We skip the proof of this theorem. The theorem means that if \(A\) is Hermitian then \(A = U \Lambda U^H\), where \(U\) is unitary and \(\Lambda\) is a real diagonal matrix.

Definition

Let \(A\) be an \(n \times n\) Hermitian matrix. Let \(\lambda_1, \dots \lambda_n\) be its eigen values such that \(|\lambda_1| \geq |\lambda_2| \geq \dots \geq |\lambda_n |\). Let

\[\Lambda = \Diag(\lambda_1, \dots, \lambda_n).\]

Let \(U\) be a unitary matrix whose columns are orthonormal eigen vectors corresponding to \(\lambda_1, \dots, \lambda_n\). Then the eigen value decomposition of \(A\) is defined as

\[A = U \Lambda U^H.\]

If the \(\lambda_i\) are distinct, then the decomposition is unique. If they are not distinct, then the eigen vectors for a repeated eigen value can be chosen in more than one way and the decomposition is not unique.

Remark

Let \(\Lambda\) be a diagonal matrix as in the definition above. Consider some vector \(x \in \CC^n\).

\[x^H \Lambda x = \sum_{i=1}^n \lambda_i | x_i |^2.\]

Now if \(\lambda_i \geq 0\) then

\[x^H \Lambda x \leq \lambda_1 \sum_{i=1}^n | x_i |^2 = \lambda_1 \| x \|_2^2.\]

Also

\[x^H \Lambda x \geq \lambda_n \sum_{i=1}^n | x_i |^2 = \lambda_n \| x \|_2^2.\]
Lemma

Let \(A\) be a Hermitian matrix with non-negative eigen values. Let \(\lambda_1\) be its largest and \(\lambda_n\) its smallest eigen value. Then

\[\lambda_n \| x\|_2^2 \leq x^H A x \leq \lambda_1 \|x \|_2^2 \Forall x \in \CC^n.\]
Proof

\(A\) has an eigen value decomposition given by

\[A = U \Lambda U^H.\]

Let \(x \in \CC^n\) and let \(v = U^H x\). Clearly \(\| x \|_2 = \| v \|_2\). Then

\[x^H A x = x^H U \Lambda U^H x = v^H \Lambda v.\]

From the previous remark we have

\[\lambda_n \| v \|_2^2 \leq v^H \Lambda v \leq \lambda_1 \| v \|_2^2.\]

Thus we get

\[\lambda_n \| x \|_2^2 \leq x^H A x \leq \lambda_1 \| x \|_2^2.\]
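A numerical check of these bounds, using a random positive semi-definite matrix as an arbitrary example of a Hermitian matrix with non-negative eigen values:

    B = randn(5); A = B' * B;    % symmetric with non-negative eigen values
    lams = eig(A);               % real eigen values in ascending order
    x = randn(5, 1);
    q = x' * A * x;
    [min(lams) * norm(x)^2, q, max(lams) * norm(x)^2]   % q lies between the two bounds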

Miscellaneous properties

This subsection lists some miscellaneous properties of eigen values of a square matrix.

Lemma
\(\lambda\) is an eigen value of \(A\) if and only if \(\lambda + k\) is an eigen value of \(A + k I\). Moreover \(A\) and \(A + kI\) share the same eigen vectors.
Proof
\[\begin{split}\begin{aligned} & A x = \lambda x \\ \iff & A x + k x = \lambda x + k x \\ \iff & (A + k I ) x = (\lambda + k) x. \end{aligned}\end{split}\]

Thus \(\lambda\) is an eigen value of \(A\) with an eigen vector \(x\) if and only if \(\lambda + k\) is an eigen value of \(A + kI\) with the same eigen vector \(x\).

Diagonally dominant matrices

Definition

Let \(A = [a_{ij}]\) be a square matrix in \(\CC^{n \times n}\). \(A\) is called diagonally dominant if

\[| a_{ii} | \geq \sum_{j \neq i } |a_{ij}|\]

holds true for all \(1 \leq i \leq n\). i.e. the absolute value of the diagonal element is greater than or equal to the sum of absolute values of all the off diagonal elements on that row.

Definition

Let \(A = [a_{ij}]\) be a square matrix in \(\CC^{n \times n}\). \(A\) is called strictly diagonally dominant if

\[| a_{ii} | > \sum_{j \neq i } |a_{ij}|\]

holds true for all \(1 \leq i \leq n\), i.e. the absolute value of the diagonal element is strictly greater than the sum of absolute values of all the off diagonal elements in that row.

Example: Strictly diagonally dominant matrix

Let us consider

\[\begin{split}A = \begin{bmatrix} -4 & -2 & -1 & 0\\ -4 & 7 & 2 & 0\\ 3 & -4 & 9 & 1\\ 2 & -1 & -3 & 15 \end{bmatrix}\end{split}\]

We can see that the strict diagonal dominance condition is satisfied for each row as follows:

\[\begin{split}\begin{aligned} & \text{ row 1}: \quad & |-4| > |-2| + |-1| + |0| = 3 \\ & \text{ row 2}: \quad & |7| > |-4| + |2| + |0| = 6 \\ & \text{ row 3}: \quad & |9| > |3| + |-4| + |1| = 8 \\ & \text{ row 4}: \quad & |15| > |2| + |-1| + |-3| = 6 \end{aligned}\end{split}\]

Strictly diagonally dominant matrices have a very special property. They are always non-singular.

Theorem
Strictly diagonally dominant matrices are non-singular.
Proof

Suppose that \(A\) is strictly diagonally dominant and singular. Then there exists a vector \(u \in \CC^n\) with \(u\neq 0\) such that

(2)\[A u = 0.\]

Let

\[u = \begin{bmatrix}u_1 & u_2 & \dots & u_n \end{bmatrix}^T.\]

We first show that the entries in \(u\) cannot all be equal in magnitude. Assume, to the contrary, that

\[c = | u_1 | = | u_2 | = \dots = | u_n|.\]

Since \(u \neq 0\), we have \(c \neq 0\). Now for any row \(i\) in (2), we have

\[\begin{split}\begin{aligned} & \sum_{j=1}^n a_{ij} u_j = 0\\ \implies & \sum_{j=1}^n \pm a_{ij} c = 0\\ \implies & \sum_{j=1}^n \pm a_{ij} = 0\\ \implies & \mp a_{ii} = \sum_{j \neq i} \pm a_{ij}\\ \implies & |a_{ii}| = | \sum_{j \neq i} \pm a_{ij}|\\ \implies & |a_{ii}| \leq \sum_{j \neq i} |a_{ij}| \quad {\text{ using triangle inequality}} \end{aligned}\end{split}\]

but this contradicts our assumption that \(A\) is strictly diagonally dominant. Thus not all entries in \(u\) can be equal in magnitude.

Let us now assume that the largest entry in \(u\) lies at index \(i\) with \(|u_i| = c\). Without loss of generality we can scale down \(u\) by \(c\) to get another vector in which all entries are less than or equal to 1 in magnitude while \(i\)-th entry is \(\pm 1\). i.e. \(u_i = \pm 1\) and \(|u_j| \leq 1\) for all other entries.

Now from (2) we get for the \(i\)-th row

\[\begin{split}\begin{aligned} & \sum_{j=1}^n a_{ij} u_j = 0\\ \implies & \pm a_{ii} = \sum_{j \neq i} u_j a_{ij}\\ \implies & |a_{ii}| \leq \sum_{j \neq i} |u_j a_{ij}| \leq \sum_{j \neq i} |a_{ij}| \end{aligned}\end{split}\]

which again contradicts our assumption that \(A\) is strictly diagonally dominant.

Hence strictly diagonally dominant matrices are non-singular.
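The dominance condition and the non-singularity of the example matrix above can be checked numerically with a short sketch:

    A = [-4 -2 -1  0; -4  7  2  0; 3 -4  9  1; 2 -1 -3 15];
    d = abs(diag(A));                 % |a_ii|
    r = sum(abs(A), 2) - d;           % row sums of off-diagonal magnitudes
    all(d > r)                        % true, A is strictly diagonally dominant
    rank(A)                           % 4, so A is non-singular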

Gershgorin’s theorem

We are now ready to examine Gershgorin's theorem, which provides very useful bounds on the spectrum of a square matrix.

Theorem

Every eigen value \(\lambda\) of a square matrix \(A \in \CC^{n\times n}\) satisfies

(3)\[| \lambda - a_{ii}| \leq \sum_{j\neq i} |a_{ij}| \text{ for some } i \in \{1,2, \dots, n \}.\]
Proof

The proof is a straightforward application of the non-singularity of strictly diagonally dominant matrices.

We know that for an eigen value \(\lambda\), \(\det(\lambda I - A) = 0\), i.e. the matrix \((\lambda I - A)\) is singular. Hence it cannot be strictly diagonally dominant, by the previous theorem.

Thus looking at each row \(i\) of \((\lambda I - A)\) we can say that

\[| \lambda - a_{ii}| > \sum_{j\neq i} |a_{ij}|\]

cannot be true for all rows simultaneously. i.e. it must fail at least for one row. This means that there exists at least one row \(i\) for which

\[| \lambda - a_{ii}| \leq \sum_{j\neq i} |a_{ij}|\]

holds true.

What this theorem means is pretty simple. Consider a disc in the complex plane for the \(i\)-th row of \(A\) whose center is given by \(a_{ii}\) and whose radius is given by \(r = \sum_{j\neq i} |a_{ij}|\) i.e. the sum of magnitudes of all non-diagonal entries in \(i\)-th row.

There are \(n\) such discs corresponding to \(n\) rows in \(A\). (3) means that every eigen value must lie within the union of these discs. It cannot lie outside.

This idea is crystallized in the following definition.

Definition

For \(i\)-th row of matrix \(A\) we define the radius \(r_i = \sum_{j\neq i} |a_{ij}|\) and the center \(c_i = a_{ii}\). Then the set given by

\[D_i = \{z \in \CC : | z - a_{ii} | \leq r_i \}\]

is called the \(i\)-th Gershgorin’s disc of \(A\).

We note that the definition is equally valid for real as well as complex matrices. For real matrices, the centers of the discs lie on the real line. For complex matrices, the centers may lie anywhere in the complex plane.
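A short numerical check that every eigen value lies in at least one row disc, reusing the example matrix from the previous subsection (implicit expansion, i.e. MATLAB R2016b or later, is assumed):

    A = [-4 -2 -1  0; -4  7  2  0; 3 -4  9  1; 2 -1 -3 15];
    c = diag(A);                      % disc centers a_ii
    r = sum(abs(A), 2) - abs(c);      % disc radii
    lams = eig(A);
    inside = abs(lams.' - c) <= r;    % entry (i, j) is true if eigen value j lies in disc i
    all(any(inside, 1))               % true, each eigen value lies in some disc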

Clearly there is nothing magical about the rows of \(A\). We can as well consider the columns of \(A\).

Theorem

Every eigen value of a matrix \(A\) must lie in a Gershgorin disc corresponding to the columns of \(A\) where the Gershgorin disc for \(j\)-th column is given by

\[D_j = \{z \in \CC : | z - a_{jj} | \leq r_j \}\]

with

\[r_j = \sum_{i \neq j} |a_{ij}|\]
Proof
We know that the eigen values of \(A\) are the same as the eigen values of \(A^T\) and the columns of \(A\) are nothing but the rows of \(A^T\). Hence the eigen values of \(A\) must satisfy the conditions of the previous theorem applied to \(A^T\). This completes the proof.

Singular values

In the previous section we saw the diagonalization of square matrices, which results in an eigen value decomposition of the matrix. This factorization is very useful, yet it is not applicable in all situations. In particular, the eigen value decomposition is useless if the square matrix is not diagonalizable, or if the matrix is not square at all. Moreover, the decomposition is particularly convenient only for real symmetric or Hermitian matrices, where the diagonalizing matrix is an \(\FF\)-unitary matrix. Otherwise, one has to work with the inverse of the diagonalizing matrix as well.

Fortunately there happens to be another decomposition which applies to all matrices and it involves just \(\FF\)-unitary matrices.

Definition

A non-negative real number \(\sigma\) is a singular value for a matrix \(A \in \FF^{m \times n}\) if and only if there exist unit-length vectors \(u \in \FF^m\) and \(v \in \FF^n\) such that

\[A v = \sigma u\]

and

\[A^H u = \sigma v\]

hold. The vectors \(u\) and \(v\) are called left-singular and right-singular vectors for \(\sigma\) respectively.

We first present the basic result of singular value decomposition. We will not prove this result completely although we will present proofs of some aspects.

Theorem

For every \(A \in \FF^{m \times n}\) with \(k = \min(m , n)\), there exist two \(\FF\)-unitary matrices \(U \in \FF^{m \times m}\) and \(V \in \FF^{n \times n}\) and a sequence of real numbers

\[\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_k \geq 0\]

such that

(1)\[U^H A V = \Sigma\]

where

\[\Sigma = \Diag(\sigma_1, \sigma_2, \dots, \sigma_k) \in \FF^{ m \times n}.\]

The non-negative real numbers \(\sigma_i\) are the singular values of \(A\) as per the definition above.

The sequence of real numbers \(\sigma_i\) doesn’t depend on the particular choice of \(U\) and \(V\).

\(\Sigma\) is rectangular with the same size as \(A\). The singular values of \(A\) lie on the principal diagonal of \(\Sigma\). All other entries in \(\Sigma\) are zero.

It is certainly possible that some of the singular values are 0 themselves.
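MATLAB's built-in svd returns exactly this factorization. A sketch with an arbitrary random matrix:

    A = randn(4, 6);
    [U, S, V] = svd(A);               % U is 4x4, S is 4x6, V is 6x6
    norm(U' * A * V - S)              % ~ 0
    diag(S)'                          % singular values, non-negative and descending
    norm(U' * U - eye(4)) + norm(V' * V - eye(6))   % ~ 0, U and V are unitary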

Remark

Since \(U^H A V = \Sigma\) hence

(2)\[A = U \Sigma V^H.\]
Definition

The decomposition of a matrix \(A \in \FF^{m \times n}\) given by

(3)\[A = U \Sigma V^H\]

is known as its singular value decomposition.

Remark

When \(\FF\) is \(\RR\) then the decomposition simplifies to

(4)\[U^T A V = \Sigma\]

and

\[A = U \Sigma V^T.\]
Remark
Clearly there can be at most \(k= \min(m , n)\) distinct singular values of \(A\).
Remark

We can also write

(5)\[A V = U \Sigma.\]
Remark

Let us expand

\[\begin{split}A = U \Sigma V^H = \begin{bmatrix} u_1 & u_2 & \dots & u_m \end{bmatrix} \begin{bmatrix} \sigma_{ij} \end{bmatrix} \begin{bmatrix} v_1^H \\ v_2^H \\ \vdots \\ v_n^H \end{bmatrix} = \sum_{i=1}^m \sum_{j=1}^n \sigma_{ij} u_i v_j^H.\end{split}\]
Remark

Alternatively, let us expand

\[\begin{split}\Sigma = U^H AV = \begin{bmatrix} u_1^H \\ u_2^H \\ \vdots \\ u_m^H \end{bmatrix} A \begin{bmatrix} v_1 & v_2 & \dots & v_n \end{bmatrix} = \begin{bmatrix} u_i^H A v_j \end{bmatrix}\end{split}\]

This gives us

\[\sigma_{i j} = u_i^H A v_j.\]

The following lemma verifies that \(\Sigma\) indeed consists of singular values of \(A\) as per the definition above.

Lemma
Let \(A = U \Sigma V^H\) be a singular value decomposition of \(A\). Then the main diagonal entries of \(\Sigma\) are singular values. The first \(k = \min(m, n)\) column vectors in \(U\) and \(V\) are left and right singular vectors of \(A\).
Proof

We have

\[AV = U \Sigma.\]

Let us expand R.H.S.

\[U \Sigma = \begin{bmatrix}\sum_{j=1}^m u_{i j} \sigma_{j k} \end{bmatrix} = [u_{i k} \sigma_k] = \begin{bmatrix} \sigma_1 u_1 & \sigma_2 u_2 & \dots \sigma_k u_k & 0 & \dots & 0 \end{bmatrix}\]

where \(0\) columns in the end appear \(n - k\) times.

Expanding the L.H.S. we get

\[A V = \begin{bmatrix} A v_1 & A v_2 & \dots & A v_n \end{bmatrix}.\]

Thus by comparing both sides we get

\[A v_i = \sigma_i u_i \; \text{ for } \; 1 \leq i \leq k\]

and

\[A v_i = 0 \text{ for } k < i \leq n.\]

Now let us start with

\[A = U \Sigma V^H \implies A^H = V \Sigma^H U^H \implies A^H U = V \Sigma^H.\]

Let us expand R.H.S.

\[V \Sigma^H = \begin{bmatrix}\sum_{j=1}^n v_{i j} \sigma_{j k} \end{bmatrix} = [v_{i k} \sigma_k] = \begin{bmatrix} \sigma_1 v_1 & \sigma_2 v_2 & \dots \sigma_k v_k & 0 & \dots & 0 \end{bmatrix}\]

where \(0\) columns appear \(m - k\) times.

Expanding the L.H.S. we get

\[ A^H U = \begin{bmatrix} A^H u_1 & A^H u_2 & \dots & A^H u_m \end{bmatrix}.\]

Thus by comparing both sides we get

\[A^H u_i = \sigma_i v_i \; \text{ for } \; 1 \leq i \leq k\]

and

\[A^H u_i = 0 \text{ for } k < i \leq m.\]

We now consider the three cases.

For \(m = n\), we have \(k = m =n\). And we get

\[A v_i = \sigma_i u_i, A^H u_i = \sigma_i v_i \; \text{ for } \; 1 \leq i \leq m\]

Thus \(\sigma_i\) is a singular value of \(A\) and \(u_i\) is a left singular vector while \(v_i\) is a right singular vector.

For \(m < n\), we have \(k = m\). We get for first \(m\) vectors in \(V\)

\[A v_i = \sigma_i u_i, A^H u_i = \sigma_i v_i \; \text{ for } \; 1 \leq i \leq m.\]

Finally for remaining \(n-m\) vectors in \(V\), we can write

\[A v_i = 0.\]

They belong to the null space of \(A\).

For \(m > n\), we have \(k = n\). We get for first \(n\) vectors in \(U\)

\[A v_i = \sigma_i u_i, A^H u_i = \sigma_i v_i \; \text{ for } \; 1 \leq i \leq n.\]

Finally for remaining \(m - n\) vectors in \(U\), we can write

\[A^H u_i = 0.\]
Lemma

\(\Sigma \Sigma^H\) is an \(m \times m\) matrix given by

\[\Sigma \Sigma^H = \Diag(\sigma_1^2, \sigma_2^2, \dots \sigma_k^{2}, 0, 0,\dots 0)\]

where the number of \(0\)'s following \(\sigma_k^{2}\) is \(m - k\).

Lemma

\(\Sigma^H \Sigma\) is an \(n \times n\) matrix given by

\[\Sigma^H \Sigma = \Diag(\sigma_1^2, \sigma_2^2, \dots \sigma_k^{2}, 0, 0,\dots 0)\]

where the number of \(0\)'s following \(\sigma_k^{2}\) is \(n - k\).

Lemma

Let \(A \in \FF^{m \times n}\) have a singular value decomposition given by

\[A = U \Sigma V^H.\]

Then

\[\Rank(A) = \Rank(\Sigma).\]

In other words, the rank of \(A\) equals the number of non-zero singular values of \(A\). Since the singular values are ordered in descending order in \(\Sigma\), the first \(r\) singular values \(\sigma_1, \dots, \sigma_r\) are the non-zero ones.

Proof
This is a straightforward application of here and here. Further, since the only non-zero entries of \(\Sigma\) appear on its main diagonal, its rank equals the number of non-zero singular values \(\sigma_i\).
Corollary

Let \(r = \Rank(A)\). Then \(\Sigma\) can be split as a block matrix

\[\begin{split}\Sigma = \left [ \begin{array}{c | c} \Sigma_r & 0\\ \hline 0 & 0 \end{array} \right ]\end{split}\]

where \(\Sigma_r\) is an \(r \times r\) diagonal matrix of the non-zero singular values \(\Diag(\sigma_1, \sigma_2, \dots, \sigma_r)\). All other sub-matrices in \(\Sigma\) are 0.
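
A small MATLAB sketch relating \(\Rank(A)\) to the number of non-zero singular values (the rank-2 test matrix and the tolerance 1e-10 are arbitrary choices for illustration):

    % build a 5 x 4 matrix of rank 2 as the sum of two rank-1 terms
    A = randn(5, 1) * randn(1, 4) + randn(5, 1) * randn(1, 4);
    s = svd(A);               % singular values in descending order
    r = sum(s > 1e-10);       % count the non-zero ones (up to a tolerance)
    [rank(A), r]              % both report 2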

Lemma
The eigen values of Hermitian matrix \(A^H A \in \FF^{n \times n}\) are \(\sigma_1^2, \sigma_2^2, \dots, \sigma_k^{2}, 0, 0, \dots, 0\) with \(n - k\) \(0\)'s after \(\sigma_k^{2}\). Moreover the eigen vectors are the columns of \(V\).
Proof
\[A^H A = \left ( U \Sigma V^H \right)^H U \Sigma V^H = V \Sigma^H U^H U \Sigma V^H = V \Sigma^H \Sigma V^H.\]

We note that \(A^H A\) is Hermitian. Hence \(A^HA\) is diagonalized by \(V\) and the diagonalization of \(A^H A\) is \(\Sigma^H \Sigma\). Thus the eigen values of \(A^H A\) are \(\sigma_1^2, \sigma_2^2, \dots, \sigma_k^{2}, 0, 0, \dots, 0\) with \(n - k\) \(0\)'s after \(\sigma_k^{2}\).

Clearly

\[(A^H A) V = V (\Sigma^H \Sigma)\]

thus columns of \(V\) are the eigen vectors of \(A^H A\).

Lemma
The eigen values of Hermitian matrix \(A A^H \in \FF^{m \times m}\) are \(\sigma_1^2, \sigma_2^2, \dots, \sigma_k^{2}, 0, 0, \dots, 0\) with \(m - k\) \(0\)'s after \(\sigma_k^{2}\). Moreover the eigen vectors are the columns of \(U\).
Proof
\[A A^H = U \Sigma V^H \left ( U \Sigma V^H \right)^H = U \Sigma V^H V \Sigma^H U^H = U \Sigma \Sigma^H U^H.\]

We note that \(A A^H\) is Hermitian. Hence \(A A^H\) is diagonalized by \(U\) and the diagonalization of \(A A^H\) is \(\Sigma \Sigma^H\). Thus the eigen values of \(A A^H\) are \(\sigma_1^2, \sigma_2^2, \dots, \sigma_k^{2}, 0, 0, \dots, 0\) with \(m - k\) \(0\)'s after \(\sigma_k^{2}\).

Clearly

\[(A A^H) U = U (\Sigma \Sigma^H)\]

thus columns of \(U\) are the eigen vectors of \(A A^H\).

Lemma
The Gram matrices \(A A^H\) and \(A^H A\) share the same eigen values except for some extra zeros. Their eigen values are the squares of singular values of \(A\) and some extra zeros. In other words singular values of \(A\) are the square roots of non-zero eigen values of the Gram matrices \(A A^H\) or \(A^H A\).
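
The following MATLAB sketch (illustrative only; the 3 x 5 test matrix is arbitrary) checks this relationship between the eigen values of the Gram matrices and the singular values:

    A  = randn(3, 5);
    s  = svd(A);                          % the 3 singular values of A
    e1 = sort(eig(A * A'), 'descend');    % eigen values of A A^H (3 values)
    e2 = sort(eig(A' * A), 'descend');    % eigen values of A^H A (5 values)
    [e1, s.^2]       % the two columns agree up to rounding
    e2'              % s.^2 followed by 5 - 3 = 2 (numerical) zeros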

The largest singular value

Lemma

For all \(u \in \FF^n\) the following holds

\[\| \Sigma u \|_2 \leq \sigma_1 \| u \|_2\]

Moreover for all \(u \in \FF^m\) the following holds

\[\| \Sigma^H u \|_2 \leq \sigma_1 \| u \|_2\]
Proof

Let us expand the term \(\Sigma u\).

\[\begin{split}\begin{bmatrix} \sigma_1 & 0 & \dots & \dots & 0 \\ 0 & \sigma_2 & \dots & \dots & 0 \\ \vdots & \vdots & \ddots & \dots & 0\\ 0 & \vdots & \sigma_k & \dots & 0 \\ 0 & 0 & \vdots & \dots & 0 \end{bmatrix} \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_k \\ \vdots \\ u_n \end{bmatrix} = \begin{bmatrix} \sigma_1 u_1 \\ \sigma_2 u_2 \\ \vdots \\ \sigma_k u_k \\ 0 \\ \vdots \\ 0 \end{bmatrix}\end{split}\]

Now since \(\sigma_1\) is the largest singular value, hence

\[|\sigma_i u_i| \leq |\sigma_1 u_i| \Forall 1 \leq i \leq k.\]

Thus

\[\sum_{i=1}^n |\sigma_1 u_i|^2 \geq \sum_{i=1}^k |\sigma_i u_i|^2 = \| \Sigma u \|_2^2\]

or

\[\sigma_1^2 \| u \|_2^2 \geq \| \Sigma u \|_2^2.\]

The result follows.

A simpler representation of \(\Sigma u\) can be given using here. Let \(r = \Rank(A)\). Thus

\[\begin{split}\Sigma = \left [ \begin{array}{c | c} \Sigma_r & 0\\ \hline 0 & 0 \end{array} \right ]\end{split}\]

We split entries in \(u\) as \(u = [(u_1, \dots, u_r )( u_{r + 1} \dots u_n)]^T\). Then

\[\begin{split}\Sigma u = \left [ \begin{array}{c} \Sigma_r \begin{bmatrix} u_1 & \dots& u_r \end{bmatrix}^T\\ 0 \begin{bmatrix} u_{r + 1} & \dots& u_n \end{bmatrix}^T \end{array} \right ] = \begin{bmatrix} \sigma_1 u_1 & \sigma_2 u_2 & \dots & \sigma_r u_r & 0 & \dots & 0 \end{bmatrix}^T\end{split}\]

Thus

\[\| \Sigma u \|_2^2 = \sum_{i=1}^r |\sigma_i u_i |^2 \leq \sigma_1^2 \sum_{i=1}^r |u_i |^2 \leq \sigma_1^2 \|u\|_2^2.\]

The second result can be proven similarly.

Lemma

Let \(\sigma_1\) be the largest singular value of an \(m \times n\) matrix \(A\). Then

\[\| A x \|_2 \leq \sigma_1 \| x \|_2 \Forall x \in \FF^n.\]

Moreover

\[\| A^H x \|_2 \leq \sigma_1 \| x \|_2 \Forall x \in \FF^m.\]
Proof
\[\| A x \|_2 = \| U \Sigma V^H x \|_2 = \| \Sigma V^H x \|_2\]

since \(U\) is unitary. Now from previous lemma we have

\[\| \Sigma V^H x \|_2 \leq \sigma_1 \| V^H x \|_2 = \sigma_1 \| x \|_2\]

since \(V^H\) is also unitary. Thus we get the result

\[\| A x \|_2 \leq \sigma_1 \| x \|_2 \Forall x \in \FF^n.\]

Similarly

\[\| A^H x \|_2 = \| V \Sigma^H U^H x \|_2 = \| \Sigma^H U^H x \|_2\]

since \(V\) is unitary. Now from previous lemma we have

\[\| \Sigma^H U^H x \|_2 \leq \sigma_1 \| U^H x \|_2 = \sigma_1 \| x \|_2\]

since \(U^H\) is also unitary. Thus we get the result

\[\| A^H x \|_2 \leq \sigma_1 \| x \|_2 \Forall x \in \FF^m.\]

There is a direct connection between the largest singular value and \(2\)-norm of a matrix (see here).

Corollary

The largest singular value of \(A\) is nothing but its \(2\)-norm. i.e.

\[\sigma_1 = \underset{\|u \|_2 = 1}{\max} \| A u \|_2.\]
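
A crude numerical check of this variational characterization (a MATLAB sketch; sampling 2000 random unit vectors is an arbitrary choice and only approximates the maximum):

    A = randn(5, 4);
    sigma1 = max(svd(A));
    best = 0;
    for t = 1:2000
        u = randn(4, 1);
        u = u / norm(u);                 % a random unit vector
        best = max(best, norm(A * u));   % never exceeds sigma1
    end
    [best, sigma1]                       % best <= sigma1 and typically close to it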

SVD and pseudo inverse

Lemma

Let \(A = U \Sigma V^H\) and let \(r = \Rank (A)\). Let \(\sigma_1, \dots, \sigma_r\) be the \(r\) non-zero singular values of \(A\). Then the Moore-Penrose pseudo-inverse of \(\Sigma\) is an \(n \times m\) matrix \(\Sigma^{\dag}\) given by

\[\begin{split}\Sigma^{\dag} = \left [ \begin{array}{c | c} \Sigma_r^{-1} & 0\\ \hline 0 & 0 \end{array} \right ]\end{split}\]

where \(\Sigma_r = \Diag(\sigma_1, \dots, \sigma_r)\).

Essentially \(\Sigma^{\dag}\) is obtained by transposing \(\Sigma\) and inverting all its non-zero (positive real) values.

Proof
Straightforward application of here.
Corollary

The rank of \(\Sigma\) and the rank of its pseudo-inverse \(\Sigma^{\dag}\) are the same, i.e.

\[\Rank (\Sigma) = \Rank(\Sigma^{\dag}).\]
Proof
The number of non-zero diagonal entries in \(\Sigma\) and \(\Sigma^{\dag}\) is the same.
Lemma

Let \(A\) be an \(m \times n\) matrix and let \(A = U \Sigma V^H\) be its singular value decomposition. Let \(\Sigma^{\dag}\) be the pseudo inverse of \(\Sigma\) as per here. Then the Moore-Penrose pseudo-inverse of \(A\) is given by

\[A^{\dag} = V \Sigma^{\dag} U^H.\]
Proof

As usual we verify the requirements for a Moore-Penrose pseudo-inverse as per here. We note that since \(\Sigma^{\dag}\) is the pseudo-inverse of \(\Sigma\) it already satisfies necessary criteria.

First requirement:

\[A A^{\dag} A = U \Sigma V^H V \Sigma^{\dag} U^H U \Sigma V^H = U \Sigma \Sigma^{\dag} \Sigma V^H = U \Sigma V^H = A.\]

Second requirement:

\[A^{\dag} A A^{\dag} = V \Sigma^{\dag} U^H U \Sigma V^H V \Sigma^{\dag} U^H = V \Sigma^{\dag} \Sigma \Sigma^{\dag} U^H = V \Sigma^{\dag} U^H = A^{\dag}.\]

We now consider

\[A A^{\dag} = U \Sigma V^H V \Sigma^{\dag} U^H = U \Sigma \Sigma^{\dag} U^H.\]

Thus

\[\left ( A A^{\dag} \right )^H = \left ( U \Sigma \Sigma^{\dag} U^H \right )^H = U \left ( \Sigma \Sigma^{\dag} \right )^H U^H = U \Sigma \Sigma^{\dag} U^H = A A^{\dag}\]

since \(\Sigma \Sigma^{\dag}\) is Hermitian.

Finally we consider

\[A^{\dag} A = V \Sigma^{\dag} U^H U \Sigma V^H = V \Sigma^{\dag} \Sigma V^H.\]

Thus

\[\left ( A^{\dag} A \right )^H = \left ( V \Sigma^{\dag} \Sigma V^H\right )^H = V \left ( \Sigma^{\dag} \Sigma \right )^H V^H = V \Sigma^{\dag} \Sigma V^H = A^{\dag} A\]

since \(\Sigma^{\dag} \Sigma\) is also Hermitian.

This completes the proof.
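
The construction \(A^{\dag} = V \Sigma^{\dag} U^H\) is easy to compare against MATLAB's built-in pinv (a sketch; the test matrix and the tolerance 1e-10 are arbitrary):

    A = randn(5, 3);
    [U, S, V] = svd(A);
    r = sum(diag(S) > 1e-10);            % number of non-zero singular values
    % Sigma-dagger: transpose Sigma and invert its non-zero diagonal entries
    Sdag = zeros(size(A, 2), size(A, 1));
    Sdag(1:r, 1:r) = diag(1 ./ diag(S(1:r, 1:r)));
    Adag = V * Sdag * U';
    norm(Adag - pinv(A))                 % close to 0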

Finally we can connect the singular values of \(A\) with the singular values of its pseudo-inverse.

Corollary

The rank of any \(m \times n\) matrix \(A\) and the rank of its pseudo-inverse \(A^{\dag}\) are the same, i.e.

\[\Rank (A) = \Rank(A^{\dag}).\]
Proof
We have \(\Rank(A) = \Rank(\Sigma)\). Also it is easy to verify that \(\Rank(A^{\dag}) = \Rank(\Sigma^{\dag})\). So using here completes the proof.
Lemma

Let \(A\) be an \(m \times n\) matrix and let \(A^{\dag}\) be its \(n \times m\) pseudo inverse as per here. Let \(k = \min(m, n)\) denote the number of singular values of \(A\) and let \(r = \Rank(A)\) denote the number of non-zero singular values of \(A\). Let \(\sigma_1, \dots, \sigma_r\) be the non-zero singular values of \(A\). Then the number of singular values of \(A^{\dag}\) is the same as that of \(A\) and the non-zero singular values of \(A^{\dag}\) are

\[\frac{1}{\sigma_1} , \dots, \frac{1}{\sigma_r}\]

while all other \(k - r\) singular values of \(A^{\dag}\) are zero.

Proof

\(k= \min(m, n)\) denotes the number of singular values for both \(A\) and \(A^{\dag}\). Since the ranks of \(A\) and \(A^{\dag}\) are the same, the number of non-zero singular values is the same. Now look at

\[A^{\dag} = V \Sigma^{\dag} U^H\]

where

\[\begin{split}\Sigma^{\dag} = \left [ \begin{array}{c | c} \Sigma_r^{-1} & 0\\ \hline 0 & 0 \end{array} \right ].\end{split}\]

Clearly \(\Sigma_r^{-1} = \Diag(\frac{1}{\sigma_1} , \dots, \frac{1}{\sigma_r})\).

Thus expanding the R.H.S. we can get

\[A^{\dag} = \sum_{i=1}^r \frac{1}{\sigma_{i}} v_i u_i^H\]

where \(v_i\) and \(u_i\) are the first \(r\) columns of \(V\) and \(U\) respectively. If we reverse the order of the first \(r\) columns of \(U\) and \(V\) and reverse the order of the first \(r\) diagonal entries of \(\Sigma^{\dag}\), the R.H.S. remains the same while we are able to express \(A^{\dag}\) in the standard singular value decomposition form. Thus \(\frac{1}{\sigma_1} , \dots, \frac{1}{\sigma_r}\) are indeed the non-zero singular values of \(A^{\dag}\).
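
A quick numerical check of this statement (an illustrative MATLAB sketch on a full row rank example):

    A  = randn(4, 6);
    s  = svd(A);            % 4 non-zero singular values, in descending order
    sd = svd(pinv(A));      % singular values of the pseudo-inverse, descending
    % the non-zero singular values of pinv(A) are the reciprocals 1/sigma_i;
    % taking reciprocals reverses the ordering, hence the flip
    max(abs(sd - 1 ./ flipud(s)))    % close to 0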

Full column rank matrices

In this subsection we consider some specific results related to singular value decomposition of a full column rank matrix.

We will consider \(A\) to be an \(m \times n\) matrix in \(\FF^{m \times n}\) with \(m \geq n\) and \(\Rank(A) = n\). Let \(A = U \Sigma V^H\) be its singular value decomposition. From here we observe that there are \(n\) non-zero singular values of \(A\). We will denote these singular values by \(\sigma_1, \sigma_2, \dots, \sigma_n\). We will define

\[\Sigma_n = \Diag(\sigma_1, \sigma_2, \dots, \sigma_n).\]

Clearly \(\Sigma\) is a \(2\times 1\) block matrix given by

\[\begin{split}\Sigma = \left [ \begin{array}{c} \Sigma_n\\ \hline 0 \end{array} \right ]\end{split}\]

where the lower \(0\) is an \((m - n) \times n\) zero matrix. From here we obtain that \(\Sigma^H \Sigma\) is an \(n \times n\) matrix given by

\[\Sigma^H \Sigma = \Sigma_n^2\]

where

\[\Sigma_n^2 = \Diag(\sigma_1^2, \sigma_2^2, \dots, \sigma_n^2).\]
Lemma
Let \(A\) be a full column rank matrix with singular value decomposition \(A = U \Sigma V^H\). Then \(\Sigma^H \Sigma = \Sigma_n^2 = \Diag(\sigma_1^2, \sigma_2^2, \dots, \sigma_n^2)\) and \(\Sigma^H \Sigma\) is invertible.
Proof

Since all singular values are non-zero hence \(\Sigma_n^2\) is invertible. Thus

\[\left (\Sigma^H \Sigma \right )^{-1} = \left ( \Sigma_n^2 \right )^{-1} = \Diag\left(\frac{1}{\sigma_1^2}, \frac{1}{\sigma_2^2}, \dots, \frac{1}{\sigma_n^2} \right).\]
Lemma

Let \(A\) be a full column rank matrix with singular value decomposition \(A = U \Sigma V^H\). Let \(\sigma_1\) be its largest singular value and \(\sigma_n\) be its smallest singular value. Then

\[\sigma_n^2 \|x \|_2 \leq \| \Sigma^H \Sigma x \|_2 \leq \sigma_1^2 \|x \|_2 \Forall x \in \FF^n.\]
Proof

Let \(x \in \FF^n\). We have

\[\| \Sigma^H \Sigma x \|_2^2 = \| \Sigma_n^2 x \|_2^2 = \sum_{i=1}^n |\sigma_i^2 x_i|^2.\]

Now since

\[\sigma_n \leq \sigma_i \leq \sigma_1\]

hence

\[\sigma_n^4 \sum_{i=1}^n |x_i|^2 \leq \sum_{i=1}^n |\sigma_i^2 x_i|^2 \leq \sigma_1^4 \sum_{i=1}^n |x_i|^2\]

thus

\[\sigma_n^4 \| x \|_2^2 \leq \| \Sigma^H \Sigma x \|_2^2 \leq \sigma_1^4 \| x \|_2^2.\]

Applying square roots, we get

\[\sigma_n^2 \| x \|_2 \leq \| \Sigma^H \Sigma x \|_2 \leq \sigma_1^2 \| x \|_2 \Forall x \in \FF^n.\]

We recall from here that the Gram matrix of the columns of \(A\), \(G = A^H A\), is full rank and invertible.

Lemma

Let \(A\) be a full column rank matrix with singular value decomposition \(A = U \Sigma V^H\). Let \(\sigma_1\) be its largest singular value and \(\sigma_n\) be its smallest singular value. Then

\[\sigma_n^2 \|x \|_2 \leq \| A^H A x \|_2 \leq \sigma_1^2 \|x \|_2 \Forall x \in \FF^n.\]
Proof
\[A^H A = (U \Sigma V^H)^H (U \Sigma V^H) = V \Sigma^H \Sigma V^H.\]

Let \(x \in \FF^n\). Let

\[u = V^H x \implies \| u \|_2 = \|x \|_2.\]

Let

\[r = \Sigma^H \Sigma u.\]

Then from previous lemma we have

\[\sigma_n^2 \| u \|_2 \leq \| \Sigma^H \Sigma u \|_2 = \|r \|_2 \leq \sigma_1^2 \| u \|_2 .\]

Finally

\[A^ H A x = V \Sigma^H \Sigma V^H x = V r.\]

Thus

\[\| A^ H A x \|_2 = \|r \|_2.\]

Substituting we get

\[\sigma_n^2 \|x \|_2 \leq \| A^H A x \|_2 \leq \sigma_1^2 \|x \|_2 \Forall x \in \FF^n.\]

There are bounds for the inverse of Gram matrix also. First let us establish the inverse of Gram matrix.

Lemma

Let \(A\) be a full column rank matrix with singular value decomposition \(A = U \Sigma V^H\). Let the singular values of \(A\) be \(\sigma_1, \dots, \sigma_n\). Let the Gram matrix of columns of \(A\) be \(G = A^H A\). Then

\[G^{-1} = V \Psi V^H\]

where

\[\Psi = \Diag \left(\frac{1}{\sigma_1^2}, \frac{1}{\sigma_2^2}, \dots, \frac{1}{\sigma_n^2} \right).\]
Proof

We have

\[G = V \Sigma^H \Sigma V^H\]

Thus

\[G^{-1} = \left (V \Sigma^H \Sigma V^H \right )^{-1} = \left ( V^H \right )^{-1} \left ( \Sigma^H \Sigma \right )^{-1} V^{-1} = V \left ( \Sigma^H \Sigma \right )^{-1} V^H.\]

From here we have

\[\Psi = \left ( \Sigma^H \Sigma \right )^{-1} = \Diag \left (\frac{1}{\sigma_1^2}, \frac{1}{\sigma_2^2}, \dots, \frac{1}{\sigma_n^2} \right).\]

This completes the proof.
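
The relation \(G^{-1} = V \Psi V^H\) can be verified numerically for a full column rank example (a minimal MATLAB sketch):

    A = randn(6, 3);                 % full column rank with probability 1
    [U, S, V] = svd(A);
    sigma = diag(S);                 % the 3 non-zero singular values
    Psi = diag(1 ./ sigma.^2);
    G = A' * A;                      % Gram matrix of the columns of A
    norm(inv(G) - V * Psi * V')      % close to 0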

We can now state the bounds:

Lemma

Let \(A\) be a full column rank matrix with singular value decomposition \(A = U \Sigma V^H\). Let \(\sigma_1\) be its largest singular value and \(\sigma_n\) be its smallest singular value. Then

\[\frac{1}{\sigma_1^2} \|x \|_2 \leq \| \left(A^H A \right)^{-1} x \|_2 \leq \frac{1}{\sigma_n^2} \|x \|_2 \Forall x \in \FF^n.\]
Proof

From here we have

\[G^{-1} = \left ( A^H A \right)^{-1} = V \Psi V^H\]

where

\[\Psi = \Diag \left(\frac{1}{\sigma_1^2}, \frac{1}{\sigma_2^2}, \dots, \frac{1}{\sigma_n^2} \right).\]

Let \(x \in \FF^n\). Let

\[u = V^H x \implies \| u \|_2 = \|x \|_2.\]

Let

\[r = \Psi u.\]

Then

\[\| r \|_2^2 = \sum_{i=1}^n \left | \frac{1}{\sigma_i^2} u_i \right |^2.\]

Thus

\[\frac{1}{\sigma_1^2} \| u \|_2 \leq \| \Psi u \|_2 = \|r \|_2 \leq \frac{1}{\sigma_n^2} \| u \|_2 .\]

Finally

\[\left (A^ H A \right)^{-1} x = V \Psi V^H x = V r.\]

Thus

\[\| \left (A^ H A \right)^{-1} x \|_2 = \|r \|_2.\]

Substituting we get the result.

Low rank approximation of a matrix

Definition

An \(m \times n\) matrix \(A\) is called low rank if

\[\Rank(A) \ll \min (m, n).\]
Remark
A matrix is low rank if the number of non-zero singular values for the matrix is much smaller than its dimensions.

The following is a simple procedure for making a low rank approximation of a given matrix \(A\); a short MATLAB sketch is given after the list.

  1. Perform the singular value decomposition of \(A\) given by \(A = U \Sigma V^H\).
  2. Identify the singular values of \(A\) in \(\Sigma\).
  3. Keep the first \(r\) singular values (where \(r \ll \min(m, n)\) is the rank of the approximation) and set all other singular values to 0 to obtain \(\widehat{\Sigma}\).
  4. Compute \(\widehat{A} = U \widehat{\Sigma} V^H\).
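
A minimal MATLAB sketch of this truncation procedure (the 50 x 40 test matrix and the choice \(r = 2\) are arbitrary):

    A = randn(50, 40);
    r = 2;                            % target rank of the approximation
    [U, S, V] = svd(A);
    s = diag(S);                      % singular values in descending order
    Shat = zeros(size(S));
    Shat(1:r, 1:r) = diag(s(1:r));    % keep only the first r singular values
    Ahat = U * Shat * V';             % the rank-r approximation
    rank(Ahat)                        % equals r
    % the 2-norm of the error equals the first discarded singular value
    abs(norm(A - Ahat, 2) - s(r+1))   % close to 0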

Matrix norms

This section reviews various matrix norms on the vector space of complex matrices over the field of complex numbers \((\CC^{m \times n}, \CC)\).

We know \((\CC^{m \times n}, \CC)\) is a finite dimensional vector space with dimension \(m n\). We will usually refer to it as \(\CC^{m \times n}\).

Matrix norms will follow the usual definition of norms for a vector space.

Definition

A function \(\| \cdot \| : \CC^{m \times n} \to \RR\) is called a matrix norm on \(\CC^{m \times n}\) if for all \(A, B \in \CC^{m \times n}\) and all \(\alpha \in \CC\) it satisfies the following

  1. [Positivity]

    \[\| A \| \geq 0\]

    with \(\| A \| = 0 \iff A = 0\).

  2. [Homogeneity]

    \[\| \alpha A \| = | \alpha | \| A \|.\]
  3. [Triangle inequality]

    \[\| A + B \| \leq \| A \| + \| B \|.\]

We recall some of the standard results on normed vector spaces.

All matrix norms are equivalent. Let \(\| \cdot \|\) and \(\| \cdot \|'\) be two different matrix norms on \(\CC^{m \times n}\). Then there exist two positive constants \(a\) and \(b\) such that the following holds

\[a \| A \| \leq \| A \|' \leq b \|A \| \Forall A \in \CC^{m \times n}.\]

A matrix norm is a continuous function \(\| \cdot \| : \CC^{m \times n} \to \RR\).

Norms like \(\ell_p\) on complex vector space

The following norms are similar to the \(\ell_p\) norms on the finite dimensional complex vector space \(\CC^n\). They arise from the fact that the matrix vector space \(\CC^{m\times n}\) has a one to one correspondence with the complex vector space \(\CC^{m n}\).

Definition

Let \(A \in \CC^{m\times n}\) and \(A = [a_{ij}]\).

Matrix sum norm is defined as

\[\| A \|_S = \sum_{i=1}^{m} \sum_{j=1}^n | a_{ij} |\]
Definition

Let \(A \in \CC^{m\times n}\) and \(A = [a_{ij}]\).

Matrix Frobenius norm is defined as

\[\| A \|_F = \left ( \sum_{i=1}^{m} \sum_{j=1}^n | a_{ij} |^2 \right )^{\frac{1}{2}}.\]
Definition

Let \(A \in \CC^{m\times n}\) and \(A = [a_{ij}]\).

Matrix Max norm is defined as

\[\begin{split}\| A \|_M = \underset{\substack{ 1 \leq i \leq m \\ 1 \leq j \leq n}}{\max} | a_{ij} |.\end{split}\]

Properties of Frobenius norm

We now prove some elementary properties of Frobenius norm.

Lemma

The Frobenius norm of a matrix is equal to the Frobenius norm of its Hermitian transpose.

\[\| A^H \|_F = \| A \|_F.\]
Proof

Let

\[A = [a_{ij}].\]

Then

\[A^H = [\overline{a_{j i}}]\]
\[\| A^H \|_F^2 = \sum_{j=1}^n \sum_{i=1}^{m} | \overline{a_{ij}} |^2 = \sum_{i=1}^{m} \sum_{j=1}^n | a_{ij} |^2 = \| A \|_F^2.\]

Now

\[\| A^H \|_F^2 = \| A \|_F^2 \implies \| A^H \|_F = \| A \|_F\]
Lemma

Let \(A \in \CC^{m \times n}\) be written as a row of column vectors

\[A = \begin{bmatrix} a_1 & \dots & a_n \end{bmatrix}.\]

Then

\[\| A \|_F^2 = \sum_{j=1}^{n} \| a_j \|_2^2.\]
Proof

We note that

\[\| a_j \|_2^2 = \sum_{i=1}^m | a_{i j} |^2.\]

Now

\[\| A \|_F^2 = \left ( \sum_{i=1}^{m} \sum_{j=1}^n | a_{ij} |^2 \right ) = \left ( \sum_{j=1}^n \left ( \sum_{i=1}^{m} | a_{ij} |^2 \right ) \right ) = \left (\sum_{j=1}^n \| a_j \|_2^2 \right).\]

We thus showed that the square of the Frobenius norm of a matrix is nothing but the sum of squares of the \(\ell_2\) norms of its columns.

Lemma

Let \(A \in \CC^{m \times n}\) be written as a column of row vectors

\[\begin{split}A = \begin{bmatrix} \underline{a}^1 \\ \vdots \\ \underline{a}^m \end{bmatrix}.\end{split}\]

Then

\[\| A \|_F^2 = \sum_{i=1}^{m} \| \underline{a}^i \|_2^2.\]
Proof

We note that

\[\| \underline{a}^i \|_2^2 = \sum_{j=1}^n | a_{i j} |^2.\]

Now

\[\| A \|_F^2 = \left ( \sum_{i=1}^{m} \sum_{j=1}^n | a_{ij} |^2 \right ) = \sum_{i=1}^{m} \| \underline{a}^i \|_2^2.\]

We now consider how the Frobenius norm is affected with the action of unitary matrices.

Let \(A\) be an arbitrary matrix in \(\CC^{m \times n}\). Let \(U\) be a unitary matrix in \(\CC^{m \times m}\) and let \(V\) be a unitary matrix in \(\CC^{n \times n}\).

We present our first result that multiplication with unitary matrices doesn’t change Frobenius norm of a matrix.

Theorem

The Frobenius norm of a matrix is invariant to pre or post multiplication by a unitary matrix. i.e.

\[\| UA \|_F = \| A \|_F\]

and

\[\| AV \|_F = \| A \|_F.\]
Proof

We can write \(A\) as

\[A = \begin{bmatrix} a_1 & \dots & a_n \end{bmatrix}.\]

So

\[UA = \begin{bmatrix} Ua_1 & \dots & Ua_n \end{bmatrix}.\]

Then applying here clearly

\[\| UA \|_F^2 = \sum_{j=1}^{n} \|U a_j \|_2^2.\]

But we know that unitary matrices are norm preserving. Hence

\[\|U a_j \|_2^2 = \|a_j \|_2^2.\]

Thus

\[\| UA \|_F^2 = \sum_{j=1}^{n} \|a_j \|_2^2 = \| A \|_F^2\]

which implies

\[\| UA \|_F = \| A \|_F.\]

Similarly writing \(A\) as

\[\begin{split}A = \begin{bmatrix} r_1 \\ \vdots \\ r_m \end{bmatrix}.\end{split}\]

we have

\[\begin{split}AV = \begin{bmatrix} r_1 V\\ \vdots \\ r_m V \end{bmatrix}.\end{split}\]

Then applying here clearly

\[\| AV \|_F^2 = \sum_{i=1}^{m} \| r_i V \|_2^2.\]

But we know that unitary matrices are norm preserving. Hence

\[\|r_i V \|_2^2 = \|r_i \|_2^2.\]

Thus

\[\| AV \|_F^2 = \sum_{i=1}^{m} \| r_i \|_2^2 = \| A \|_F^2\]

which implies

\[\| AV \|_F = \| A \|_F.\]

An alternative proof of the second part, using the first part, is just one line:

\[\| AV \|_F = \| (AV)^H \|_F = \| V^H A^H \|_F = \| A^H \|_F = \| A \|_F.\]

In above we use here and the fact that \(V\) is a unitary matrix implies that \(V^H\) is also a unitary matrix. We have already shown that pre multiplication by a unitary matrix preserves Frobenius norm.

Theorem

Let \(A \in \CC^{m \times n}\) and \(B \in \CC^{n \times P}\) be two matrices. Then the Frobenius norm of their product is less than or equal to the product of Frobenius norms of the matrices themselves. i.e.

\[\| AB \|_F \leq \|A \|_F \| B \|_F.\]
Proof

We can write \(A\) as

\[\begin{split}A = \begin{bmatrix} a_1^T \\ \vdots \\ a_m^T \end{bmatrix}\end{split}\]

where \(a_i\) are \(m\) column vectors corresponding to rows of \(A\). Similarly we can write B as

\[B = \begin{bmatrix} b_1 & \dots & b_P \end{bmatrix}\]

where \(b_i\) are column vectors corresponding to columns of \(B\). Then

\[\begin{split}A B = \begin{bmatrix} a_1^T \\ \vdots \\ a_m^T \end{bmatrix} \begin{bmatrix} b_1 & \dots & b_P \end{bmatrix} = \begin{bmatrix} a_1^T b_1 & \dots & a_1^T b_P\\ \vdots & \ddots & \vdots \\ a_m^T b_1 & \dots & a_m^T b_P \end{bmatrix} = \begin{bmatrix} a_i^T b_j \end{bmatrix} .\end{split}\]

Now looking carefully

\[a_i^T b_j = \langle a_i, \overline{b_j} \rangle\]

Applying the Cauchy-Schwarz inequality we have

\[| \langle a_i, \overline{b_j} \rangle |^2 \leq \| a_i \|_2^2 \| \overline{b_j} \|_2^2 = \| a_i \|_2^2 \| b_j \|_2^2\]

Now

\[\begin{split}\| A B \|_F^2 &= \sum_{i=1}^{m} \sum_{j=1}^{P} | a_i^T b_j |^2\\ &\leq \sum_{i=1}^{m} \sum_{j=1}^{P} \| a_i \|_2^2 \| b_j \|_2^2\\ &= \left ( \sum_{i=1}^{m} \| a_i \|_2^2 \right ) \left ( \sum_{j=1}^{P} \| b_j \|_2^2\right )\\ &= \| A \|_F^2 \| B \|_F^2\end{split}\]

which implies

\[\| A B \|_F \leq \| A \|_F \| B \|_F\]

by taking square roots on both sides.

Corollary

Let \(A \in \CC^{m \times n}\) and let \(x \in \CC^n\). Then

\[\| A x \|_2 \leq \| A \|_F \| x \|_2.\]
Proof

We note that the Frobenius norm of a column matrix is the same as the \(\ell_2\) norm of the corresponding column vector, i.e.

\[\| x \|_F = \| x \|_2 \Forall x \in \CC^n.\]

Now applying here we have

\[\| A x \|_2 = \| A x \|_F \leq \| A \|_F \| x \|_F = \| A \|_F \| x \|_2 \Forall x \in \CC^n.\]

It turns out that Frobenius norm is intimately related to the singular value decomposition of a matrix.

Lemma

Let \(A \in \CC^{m \times n}\). Let the singular value decomposition of \(A\) be given by

\[A = U \Sigma V^H.\]

Let the singular values of \(A\) be \(\sigma_1, \dots, \sigma_n\). Then

\[\| A \|_F = \sqrt {\sum_{i=1}^n \sigma_i^2}.\]
Proof
\[A = U \Sigma V^H \implies \|A \|_F = \| U \Sigma V^H \|_F.\]

But

\[\| U \Sigma V^H \|_F = \| \Sigma V^H \|_F = \| \Sigma \|_F\]

since \(U\) and \(V\) are unitary matrices (see here ).

Now the only non-zero terms in \(\Sigma\) are the singular values. Hence

\[\| A \|_F = \| \Sigma \|_F = \sqrt {\sum_{i=1}^n \sigma_i^2}.\]
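
A one-line numerical check of this identity in MATLAB (sketch; the complex test matrix is arbitrary):

    A = randn(5, 3) + 1i * randn(5, 3);
    abs(norm(A, 'fro') - sqrt(sum(svd(A).^2)))    % close to 0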

Consistency of a matrix norm

Definition

A matrix norm \(\| \cdot \|\) is called consistent on \(\CC^{n \times n}\) if

(1)\[\| A B \| \leq \| A \| \| B \|\]

holds true for all \(A, B \in \CC^{n \times n}\). A matrix norm \(\| \cdot \|\) is called consistent if it is defined on \(\CC^{m \times n}\) for all \(m, n \in \Nat\) and eq (1) holds for all matrices \(A, B\) for which the product \(AB\) is defined.

A consistent matrix norm is also known as a sub-multiplicative norm.

With this definition and results in here we can see that Frobenius norm is consistent.

Subordinate matrix norm

A matrix operates on vectors from one space to generate vectors in another space. It is interesting to explore the connection between the norm of a matrix and norms of vectors in the domain and co-domain of a matrix.

Definition

Let \(m, n \in \Nat\) be given. Let \(\| \cdot \|_{\alpha}\) be some norm on \(\CC^m\) and \(\| \cdot \|_{\beta}\) be some norm on \(\CC^n\). Let \(\| \cdot \|\) be some norm on matrices in \(\CC^{m \times n}\). We say that \(\| \cdot \|\) is subordinate to the vector norms \(\| \cdot \|_{\alpha}\) and \(\| \cdot \|_{\beta}\) if

\[\| A x \|_{\alpha} \leq \| A \| \| x \|_{\beta}\]

for all \(A \in \CC^{m \times n}\) and for all \(x \in \CC^n\). In other words the length of the vector doesn’t increase by the operation of \(A\) beyond a factor given by the norm of the matrix itself.

If \(\| \cdot \|_{\alpha}\) and \(\| \cdot \|_{\beta}\) are same then we say that \(\| \cdot \|\) is subordinate to the vector norm \(\| \cdot \|_{\alpha}\).

We have shown earlier in here that Frobenius norm is subordinate to Euclidean norm.

Operator norm

We now consider the maximum factor by which a matrix \(A\) can increase the length of a vector.

Definition

Let \(m, n \in \Nat\) be given. Let \(\| \cdot \|_{\alpha}\) be some norm on \(\CC^n\) and \(\| \cdot \|_{\beta}\) be some norm on \(\CC^m\). For \(A \in \CC^{m \times n}\) we define

\[\| A \| \triangleq \| A \|_{\alpha \to \beta} \triangleq \underset{x \neq 0}{\max } \frac{\| A x \|_{\beta}}{\| x \|_{\alpha}}.\]

\(\frac{\| A x \|_{\beta}}{\| x \|_{\alpha}}\) represents the factor by which the length of \(x\) is increased by the operation of \(A\). We simply pick the maximum value of this scaling factor.

The norm as defined above is known as the \((\alpha \to \beta)\) operator norm, the \((\alpha \to \beta)\)-norm, or simply the \(\alpha\)-norm if \(\alpha = \beta\).

Of course we need to verify that this definition satisfies all properties of a norm.

Clearly if \(A= 0\) then \(A x = 0\) always, hence \(\| A \| = 0\).

Conversely, if \(\| A \| = 0\) then \(\| A x \|_{\beta} = 0 \Forall x \in \CC^n\). In particular this is true for the unit vectors \(e_i \in \CC^n\). The \(i\)-th column of \(A\) is given by \(A e_i\) which is 0. Thus each column in \(A\) is 0. Hence \(A = 0\).

Now consider \(c \in \CC\).

\[\| c A \| = \underset{x \neq 0}{\max } \frac{\| c A x \|_{\beta}}{\| x \|_{\alpha}} = | c | \underset{x \neq 0}{\max } \frac{\| A x \|_{\beta}}{\| x \|_{\alpha}} = | c | \|A \|.\]

We now present some useful observations on operator norm before we can prove triangle inequality for operator norm.

For any \(x \in \Kernel(A)\), \(A x = 0\) hence we only need to consider vectors which don’t belong to the kernel of \(A\).

Thus we can write

\[\| A \|_{\alpha \to \beta} = \underset{x \notin \Kernel(A)} {\max } \frac{\| A x \|_{\beta}}{\| x \|_{\alpha}}.\]

We also note that

\[\frac{\| A c x \|_{\beta}}{\| c x \|_{\alpha}} = \frac{| c | \| A x \|_{\beta}}{ | c | \| x \|_{\alpha}} = \frac{\| A x \|_{\beta}}{\| x \|_{\alpha}} \Forall c \neq 0, x \neq 0.\]

Thus, it is sufficient to find the maximum on unit norm vectors:

\[\| A \|_{\alpha \to \beta} = \underset{\| x \|_{\alpha} = 1} {\max } \| A x \|_{\beta}.\]

Note that since \(\|x \|_{\alpha} = 1\) hence the term in denominator goes away.

Lemma

The \((\alpha \to \beta)\)-operator norm is subordinate to vector norms \(\| \cdot \|_{\alpha}\) and \(\| \cdot \|_{\beta}\). i.e.

\[\| A x \|_{\beta} \leq \| A \|_{\alpha \to \beta } \| x \|_{\alpha}.\]
Proof

For \(x = 0\) the inequality is trivially satisfied. Now for \(x \neq 0\) by definition, we have

\[\| A \|_{\alpha \to \beta } \geq \frac{\| A x \|_{\beta}}{\| x \|_{\alpha}} \implies \| A \|_{\alpha \to \beta } \| x \|_{\alpha} \geq \| A x \|_{\beta}.\]
Remark

There exists a vector \(x^* \in \CC^{n}\) with unit norm (\(\| x^* \|_{\alpha} = 1\)) such that

\[\| A \|_{\alpha \to \beta} = \| A x^* \|_{\beta}.\]
Proof

Let \(x' \neq 0\) be some vector which maximizes the expression

\[\frac{\| A x \|_{\beta}}{\| x \|_{\alpha}}.\]

Then

\[\| A\|_{\alpha \to \beta} = \frac{\| A x' \|_{\beta}}{\| x' \|_{\alpha}}.\]

Now consider \(x^* = \frac{x'}{\| x' \|_{\alpha}}\). Thus \(\| x^* \|_{\alpha} = 1\). We know that

\[\frac{\| A x' \|_{\beta}}{\| x' \|_{\alpha}} = \| A x^* \|_{\beta}.\]

Hence

\[\| A\|_{\alpha \to \beta} = \| A x^* \|_{\beta}.\]

We are now ready to prove triangle inequality for operator norm.

Lemma
Operator norm as defined in here satisfies triangle inequality.
Proof

Let \(A\) and \(B\) be some matrices in \(\CC^{m \times n}\). Consider the operator norm of matrix \(A+B\). From previous remarks, there exists some vector \(x^* \in \CC^n\) with \(\| x^* \|_{\alpha} = 1\) such that

\[\| A + B \| = \| (A+B) x^* \|_{\beta}.\]

Now

\[\| (A+B) x^* \|_{\beta} = \| Ax^* + B x^* \|_{\beta} \leq \| Ax^*\|_{\beta} + \| Bx^*\|_{\beta}.\]

From another remark we have

\[\| Ax^*\|_{\beta} \leq \| A \| \|x^*\|_{\alpha} = \|A \|\]

and

\[\| Bx^*\|_{\beta} \leq \| B \| \|x^*\|_{\alpha} = \|B \|\]

since \(\| x^* \|_{\alpha} = 1\).

Hence we have

\[\| A + B \| \leq \| A \| + \| B \|.\]

It turns out that operator norm is also consistent under certain conditions.

Lemma

Let \(\| \cdot \|_{\alpha}\) be a vector norm defined on \(\CC^m\) for every \(m \in \Nat\). Let \(\| \cdot \|_{\beta} = \| \cdot \|_{\alpha}\). Then the operator norm

\[\| A \|_{\alpha} = \underset{x \neq 0}{\max } \frac{\| A x \|_{\alpha}}{\| x \|_{\alpha}}\]

is consistent.

Proof

We need to show that

\[\| A B \|_{\alpha} \leq \| A \|_{\alpha} \| B \|_{\alpha}.\]

Now

\[\| A B \|_{\alpha} = \underset{x \neq 0}{\max } \frac{\| A B x \|_{\alpha}}{\| x \|_{\alpha}}.\]

We note that if \(Bx = 0\), then \(A B x = 0\). Hence we can rewrite as

\[\| A B \|_{\alpha} = \underset{Bx \neq 0}{\max } \frac{\| A B x \|_{\alpha}}{\| x \|_{\alpha}}.\]

Now if \(Bx \neq 0\) then \(\| Bx \|_{\alpha} \neq 0\). Hence

\[\frac{\| A B x \|_{\alpha}}{\| x \|_{\alpha}} = \frac{\| A B x \|_{\alpha}}{\|B x \|_{\alpha}} \frac{\| B x \|_{\alpha}}{\| x \|_{\alpha}}\]

and

\[\underset{Bx \neq 0}{\max } \frac{\| A B x \|_{\alpha}}{\| x \|_{\alpha}} \leq \underset{Bx \neq 0}{\max } \frac{\| A B x \|_{\alpha}}{\|B x \|_{\alpha}} \underset{Bx \neq 0}{\max } \frac{\| B x \|_{\alpha}}{\| x \|_{\alpha}}.\]

Clearly

\[\| B \|_{\alpha} = \underset{Bx \neq 0}{\max } \frac{\| B x \|_{\alpha}}{\| x \|_{\alpha}}.\]

Furthermore

\[\underset{Bx \neq 0}{\max } \frac{\| A B x \|_{\alpha}}{\|B x \|_{\alpha}} \leq \underset{y \neq 0}{\max } \frac{\| A y \|_{\alpha}}{\|y \|_{\alpha}} = \|A \|_{\alpha}.\]

Thus we have

\[\| A B \|_{\alpha} \leq \| A \|_{\alpha} \| B \|_{\alpha}.\]

p-norm for matrices

We recall the definition of \(\ell_p\) norms for vectors \(x \in \CC^n\) from (2)

\[\begin{split}\| x \|_p = \begin{cases} \left ( \sum_{i=1}^{n} | x_i |^p \right ) ^ {\frac{1}{p}} & p \in [1, \infty)\\ \underset{1 \leq i \leq n}{\max} |x_i| & p = \infty \end{cases}.\end{split}\]

The operator norms \(\| \cdot \|_p\) defined from \(\ell_p\) vector norms are of specific interest.

Definition

The \(p\)-norm for a matrix \(A \in \CC^{m \times n}\) is defined as

\[\| A \|_p \triangleq \underset{x \neq 0}{\max } \frac{\| A x \|_p}{\| x \|_p} = \underset{\| x \|_p = 1}{\max } \| A x \|_p\]

where \(\| x \|_p\) is the standard \(\ell_p\) norm for vectors in \(\CC^m\) and \(\CC^n\).

Remark
As per here, \(p\)-norms for matrices are consistent norms. They are also subordinate to the \(\ell_p\) vector norms.

Special cases are considered for \(p=1,2\) and \(\infty\).

Theorem

Let \(A \in \CC^{m \times n}\).

For \(p=1\) we have

\[\| A \|_1 \triangleq \underset{1\leq j \leq n}{\max} \sum_{i=1}^m | a_{ij}|.\]

This is also known as max column sum norm.

For \(p=\infty\) we have

\[\| A \|_{\infty} \triangleq \underset{1\leq i \leq m}{\max} \sum_{j=1}^n | a_{ij}|.\]

This is also known as max row sum norm.

Finally for \(p=2\) we have

\[\| A \|_2 \triangleq \sigma_1\]

where \(\sigma_1\) is the largest singular value of \(A\). This is also known as spectral norm.

Proof

Let

\[A = \begin{bmatrix} a^1 & \dots & a^n \end{bmatrix}.\]

Then

\[\begin{split}\begin{aligned} \| A x \|_1 &= \left \| \sum_{j=1}^n x_j a^j \right \|_1 \\ &\leq \sum_{j=1}^n \left \| x_j a^j \right \|_1 \\ &= \sum_{j=1}^n |x_j| \left \| a^j \right \|_1 \\ &\leq \underset{1 \leq j \leq n}{\max}\| a^j \|_1 \sum_{j=1}^n |x_j| \\ &= \underset{1 \leq j \leq n}{\max}\| a^j \|_1 \| x \|_1. \end{aligned}\end{split}\]

Thus,

\[\| A \|_1 = \underset{x \neq 0}{\max } \frac{\| A x \|_1}{\| x \|_1} \leq \underset{1 \leq j \leq n}{\max}\| a^j \|_1\]

which is the maximum column sum. We need to show that this upper bound is indeed an equality.

Indeed for any \(x=e_j\) where \(e_j\) is a unit vector with \(1\) in \(j\)-th entry and 0 elsewhere,

\[\| A e_j \|_1 = \| a^j \|_1.\]

Thus

\[\| A \|_1 \geq \| a^j \|_1 \quad \Forall 1 \leq j \leq n.\]

Combining the two, we see that

\[\| A \|_1 = \underset{1 \leq j \leq n}{\max}\| a^j \|_1.\]

For \(p=\infty\), we proceed as follows:

\[\begin{split}\begin{aligned} \| A x \|_{\infty} &= \underset{1 \leq i \leq m}{\max} \left | \sum_{j=1}^n a_{ij } x_j \right | \\ & \leq \underset{1 \leq i \leq m}{\max} \sum_{j=1}^n | a_{ij } | | x_j |\\ & \leq \underset{1 \leq j \leq n}{\max} | x_j | \underset{1 \leq i \leq m}{\max} \sum_{j=1}^n | a_{ij } |\\ &= \| x \|_{\infty} \underset{1 \leq i \leq m}{\max}\| \underline{a}^i \|_1 \end{aligned}\end{split}\]

where \(\underline{a}^i\) are the rows of \(A\).

This shows that

\[\| A \|_{\infty} \leq \underset{1 \leq i \leq m}{\max}\| \underline{a}^i \|_1.\]

We need to show that this is indeed an equality.

Fix an \(i = k\) and choose \(x\) such that

\[x_j = \sgn (a_{k j}).\]

Clearly \(\| x \|_{\infty} = 1\).

Then

\[\begin{split}\begin{aligned} \| A x \|_{\infty} &= \underset{1 \leq i \leq m}{\max} \left | \sum_{j=1}^n a_{ij } x_j \right | \\ &\geq \left | \sum_{j=1}^n a_{k j } x_j \right | \\ &= \left | \sum_{j=1}^n | a_{k j } | \right | \\ &= \sum_{j=1}^n | a_{k j } |\\ &= \| \underline{a}^k \|_1. \end{aligned}\end{split}\]

Thus,

\[\| A \|_{\infty} \geq \underset{1 \leq i \leq m}{\max}\| \underline{a}^i \|_1\]

Combining the two inequalities we get:

\[\| A \|_{\infty} = \underset{1 \leq i \leq m}{\max}\| \underline{a}^i \|_1.\]

Remaining case is for \(p=2\).

For any vector \(x\) with \(\| x \|_2 = 1\),

\[\| A x \|_2 = \| U \Sigma V^H x \|_2 = \| U (\Sigma V^H x )\|_2 = \| \Sigma V^H x \|_2\]

since \(\ell_2\) norm is invariant to unitary transformations.

Let \(v = V^H x\). Then \(\|v\|_2 = \| V^H x \|_2 = \| x \|_2 = 1\).

Now

\[\begin{split}\begin{aligned} \| A x \|_2 &= \| \Sigma v \|_2\\ &= \left ( \sum_{j=1}^n | \sigma_j v_j |^2 \right )^{\frac{1}{2}}\\ &\leq \sigma_1 \left ( \sum_{j=1}^n | v_j |^2 \right )^{\frac{1}{2}}\\ &= \sigma_1 \| v \|_2 = \sigma_1. \end{aligned}\end{split}\]

This shows that

\[\| A \|_2 \leq \sigma_1.\]

Now choose \(x\) to be the first column of \(V\), so that \(v = V^H x = (1, 0, \dots, 0)\). Then

\[\| A x \|_2 = \| \Sigma v \|_2 = \sigma_1.\]

Thus

\[\| A \|_2 \geq \sigma_1.\]

Combining the two, we get that \(\| A \|_2 = \sigma_1\).
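
These three formulas are easy to check numerically; in MATLAB, norm(A, 1), norm(A, inf) and norm(A, 2) compute the corresponding matrix norms (a sketch on an arbitrary test matrix):

    A = randn(4, 6);
    abs(norm(A, 1) - max(sum(abs(A), 1)))     % p = 1: max absolute column sum
    abs(norm(A, inf) - max(sum(abs(A), 2)))   % p = inf: max absolute row sum
    s = svd(A);
    abs(norm(A, 2) - s(1))                    % p = 2: largest singular value

All three differences are close to 0.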

The 2-norm

Theorem

Let \(A\in \CC^{n \times n}\) have singular values \(\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_n\). Let the eigen values of \(A\) be \(\lambda_1, \lambda_2, \dots, \lambda_n\) with \(|\lambda_1| \geq |\lambda_2| \geq \dots \geq |\lambda_n|\). Then the following hold

\[\| A \|_2 = \sigma_1\]

and if \(A\) is non-singular

\[\| A^{-1} \|_2 = \frac{1}{\sigma_n}.\]

If \(A\) is symmetric and positive definite, then

\[\| A \|_2 = \lambda_1\]

and if \(A\) is non-singular

\[\| A^{-1} \|_2 = \frac{1}{\lambda_n}.\]

If \(A\) is normal then

\[\| A \|_2 = |\lambda_1|\]

and if \(A\) is non-singular

\[\| A^{-1} \|_2 = \frac{1}{|\lambda_n|}.\]

Unitary invariant norms

Definition
A matrix norm \(\| \cdot \|\) on \(\CC^{m \times n}\) is called unitary invariant if \(\| U A V \| = \|A \|\) for any \(A \in \CC^{m \times n}\) and any unitary matrices \(U \in \CC^{m \times m}\) and \(V \in \CC^{n \times n}\).

We have already seen in here that Frobenius norm is unitary invariant.

It turns out that spectral norm is also unitary invariant.

More properties of operator norms

In this section we will focus on operator norms connecting normed linear spaces \((\CC^n, \| \cdot \|_{p})\) and \((\CC^m, \| \cdot \|_{q})\). Typical values of \(p, q\) would be in \(\{1, 2, \infty\}\).

We recall that

\[\| A \|_{p \to q } = \underset{x \neq 0}{\max} \frac{\| A x \|_q}{\| x \|_p} = \underset{ \| x \|_p = 1}{\max} \| A x \|_q = \underset{\| x \|_p \leq 1}{\max} \| A x \|_q.\]

The following table (based on [TRO04]) shows how to compute different \((p, q)\) norms. Some can be computed easily while others are NP-hard to compute.

Typical \((p \to q)\) norms
p q \(\| A \|_{p \to q}\) Calculation
1 1 \(\| A \|_{1 }\) Maximum \(\ell_1\) norm of a column
1 2 \(\| A \|_{1 \to 2}\) Maximum \(\ell_2\) norm of a column
1 \(\infty\) \(\| A \|_{1 \to \infty}\) Maximum absolute entry of a matrix
2 1 \(\| A \|_{2 \to 1}\) NP hard
2 2 \(\| A \|_{2}\) Maximum singular value
2 \(\infty\) \(\| A \|_{2 \to \infty}\) Maximum \(\ell_2\) norm of a row
\(\infty\) 1 \(\| A \|_{\infty \to 1}\) NP hard
\(\infty\) 2 \(\| A \|_{\infty \to 2}\) NP hard
\(\infty\) \(\infty\) \(\| A \|_{\infty}\) Maximum \(\ell_1\)-norm of a row
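
For the entries of this table that are easy to compute, the corresponding MATLAB expressions are spelled out below (a sketch; the test matrix is arbitrary and the NP-hard entries are of course not computed):

    A = randn(4, 5);
    n_1_1     = norm(A, 1);                     % max l1 norm of a column
    n_1_2     = max(sqrt(sum(abs(A).^2, 1)));   % max l2 norm of a column
    n_1_inf   = max(abs(A(:)));                 % max absolute entry
    n_2_2     = norm(A, 2);                     % largest singular value
    n_2_inf   = max(sqrt(sum(abs(A).^2, 2)));   % max l2 norm of a row
    n_inf_inf = norm(A, inf);                   % max l1 norm of a row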

The topological dual of the finite dimensional normed linear space \((\CC^n, \| \cdot \|_{p})\) is the normed linear space \((\CC^n, \| \cdot \|_{p'})\) where

\[\frac{1}{p} + \frac{1}{p'} = 1.\]

The \(\ell_2\)-norm is the dual of the \(\ell_2\)-norm; it is self dual. The \(\ell_1\)-norm and the \(\ell_{\infty}\)-norm are duals of each other.

When a matrix \(A\) maps from the space \((\CC^n, \| \cdot \|_{p})\) to the space \((\CC^m, \| \cdot \|_{q})\), we can view its conjugate transpose \(A^H\) as a mapping from the space \((\CC^m, \| \cdot \|_{q'})\) to \((\CC^n, \| \cdot \|_{p'})\).

Theorem

Operator norm of a matrix always equals the operator norm of its conjugate transpose. i.e.

\[\| A \|_{p \to q} = \| A^H \|_{q' \to p'}\]

where

\[\frac{1}{p} + \frac{1}{p'} = 1, \frac{1}{q} + \frac{1}{q'} = 1.\]

Specific applications of this result are:

\[\| A \|_2 = \| A^H \|_2.\]

This is obvious since the maximum singular value of a matrix and that of its conjugate transpose are the same.

\[\| A \|_1 = \| A^H \|_{\infty}, \quad \| A \|_{\infty} = \| A^H \|_1.\]

This is also obvious since the max column sum of \(A\) is the same as the max row sum of \(A^H\) and vice versa.

\[\| A \|_{1 \to \infty} = \| A^H \|_{1 \to \infty}.\]
\[\| A \|_{1 \to 2} = \| A^H \|_{2 \to \infty}.\]
\[\| A \|_{\infty \to 2} = \| A^H \|_{2 \to 1}.\]

We now need to show the result for the general case (arbitrary \(1 \leq p, q \leq \infty\)).

Proof
TODO
Theorem
\[\| A \|_{1 \to p} = \underset{1 \leq j \leq n}{\max}\| a^j \|_p.\]

where

\[A = \begin{bmatrix} a^1 & \dots & a^n \end{bmatrix}.\]
Proof
\[\begin{split}\begin{aligned} \| A x \|_p &= \left \| \sum_{j=1}^n x_j a^j \right \|_p \\ &\leq \sum_{j=1}^n \left \| x_j a^j \right \|_p \\ &= \sum_{j=1}^n |x_j| \left \| a^j \right \|_p \\ &\leq \underset{1 \leq j \leq n}{\max}\| a^j \|_p \sum_{j=1}^n |x_j| \\ &= \underset{1 \leq j \leq n}{\max}\| a^j \|_p \| x \|_1. \end{aligned}\end{split}\]

Thus,

\[\| A \|_{1 \to p} = \underset{x \neq 0}{\max } \frac{\| A x \|_p}{\| x \|_1} \leq \underset{1 \leq j \leq n}{\max}\| a^j \|_p.\]

We need to show that this upper bound is indeed an equality.

Indeed for any \(x=e_j\) where \(e_j\) is a unit vector with \(1\) in \(j\)-th entry and 0 elsewhere,

\[\| A e_j \|_p = \| a^j \|_p.\]

Thus

\[\| A \|_{1 \to p} \geq \| a^j \|_p \quad \Forall 1 \leq j \leq n.\]

Combining the two, we see that

\[\| A \|_{1 \to p} = \underset{1 \leq j \leq n}{\max}\| a^j \|_p.\]
Theorem
\[\| A \|_{p \to \infty} = \underset{1 \leq i \leq m}{\max}\| \underline{a}^i \|_q\]

where

\[\frac{1}{p} + \frac{1}{q} = 1.\]
Proof

Using here, we get

\[\| A \|_{p \to \infty} = \| A^H \|_{1 \to q}.\]

Using here, we get

\[\| A^H \|_{1 \to q} = \underset{1 \leq i \leq m}{\max}\| \underline{a}^i \|_q.\]

This completes the proof.

Theorem

For two matrices \(A\) and \(B\) and \(p \geq 1\), we have

\[\| A B \|_{p \to q} \leq \| B \|_{p \to s} \| A \|_{s \to q}.\]
Proof

We start with

\[\| A B \|_{p \to q} = \underset{\| x \|_p = 1}{\max} \| A ( B x) \|_q.\]

From here, we obtain

\[\| A ( B x) \|_q \leq \| A \|_{s \to q} \| ( B x) \|_s.\]

Thus,

\[\| A B \|_{p \to q} \leq \| A \|_{s \to q} \underset{\| x \|_p = 1}{\max} \| ( B x) \|_s = \| A \|_{s \to q} \| B \|_{p \to s}.\]
Theorem

For two matrices \(A\) and \(B\) and \(p \geq 1\), we have

\[\| A B \|_{p \to \infty} \leq \| A \|_{\infty \to \infty} \| B \|_{p \to \infty}.\]
Proof

We start with

\[\| A B \|_{p \to \infty} = \underset{\| x \|_p = 1}{\max} \| A ( B x) \|_{\infty}.\]

From here, we obtain

\[\| A ( B x) \|_{\infty} \leq \| A \|_{\infty \to \infty} \| ( B x) \|_{\infty}.\]

Thus,

\[\| A B \|_{p \to \infty} \leq \| A \|_{\infty \to \infty} \underset{\| x \|_p = 1}{\max} \| ( B x) \|_{\infty} = \| A \|_{\infty \to \infty} \| B \|_{p \to \infty}.\]
Theorem
\[\| A \|_{p \to \infty} \leq \| A \|_{p \to p}.\]

In particular

\[\| A \|_{1 \to \infty} \leq \| A \|_{1}.\]
\[\| A \|_{2 \to \infty} \leq \| A \|_{2}.\]
Proof

Choosing \(q = \infty\) and \(s = p\) and applying here

\[\| I A \|_{p \to \infty} \leq \| A \|_{p \to p} \| I \|_{p \to \infty}.\]

But by here, \(\| I \|_{p \to \infty}\) is the maximum \(\ell_q\) norm of any row of \(I\) (with \(\frac{1}{p} + \frac{1}{q} = 1\)), which is \(1\). Thus

\[\| A \|_{p \to \infty} \leq \| A \|_{p \to p}.\]

Consider the expression

\[\begin{split}\underset{ \substack{z \in \ColSpace(A^H) \\ z \neq 0}}{\min} \frac{\| A z \|_{q}}{\| z \|_p}.\end{split}\]

\(z \in \ColSpace(A^H), z \neq 0\) means there exists some vector \(u \notin \Kernel(A^H)\) such that \(z = A^H u\).

This expression measures the factor by which the non-singular part of \(A\) can decrease the length of a vector.

Theorem

The following bound holds for every matrix \(A\):

\[\begin{split}\underset{\substack{z \in \ColSpace(A^H) \\ z \neq 0}}{\min} \frac{\| A z \|_{q}}{\| z \|_p} \geq \| A^{\dag}\|_{q \to p}^{-1}.\end{split}\]

If \(A\) is surjective (onto), then the equality holds. When \(A\) is bijective (one-one onto, square, invertible), then the result implies

\[\begin{split}\underset{\substack{z \in \ColSpace(A^H) \\ z \neq 0}}{\min} \frac{\| A z \|_{q}}{\| z \|_p} = \| A^{-1}\|_{q \to p}^{-1}.\end{split}\]
Proof

The spaces \(\ColSpace(A^H)\) and \(\ColSpace(A)\) have the same dimension, given by \(\Rank(A)\). We recall that \(A^{\dag} A\) is a projector onto \(\ColSpace(A^H)\).

\[w = A z \iff z = A^{\dag} w = A^{\dag} A z \Forall z \in \ColSpace (A^H).\]

As a result we can write

\[\frac{\| z \|_p}{ \| A z \|_q} = \frac{\| A^{\dag} w \|_p}{ \| w \|_q}\]

whenever \(z \in \ColSpace(A^H)\). Now

\[\begin{split} \left [ \underset{\substack{z \in \ColSpace(A^H)\\z \neq 0}}{\min} \frac{\| A z \|_q}{\| z \|_p}\right ]^{-1} = \underset{\substack{z \in \ColSpace(A^H)\\z \neq 0}}{\max} \frac{\| z \|_p}{ \| A z \|_q} = \underset{\substack{w \in \ColSpace(A) \\ w \neq 0}}{\max} \frac{\| A^{\dag} w \|_p}{ \| w \|_q} \leq \underset{w \neq 0}{\max} \frac{\| A^{\dag} w \|_p}{ \| w \|_q}.\end{split}\]

When \(A\) is surjective, then \(\ColSpace(A) = \CC^m\). Hence

\[\begin{split}\underset{\substack{w \in \ColSpace(A)\\w \neq 0}}{\max} \frac{\| A^{\dag} w \|_p}{ \| w \|_q} = \underset{w \neq 0}{\max} \frac{\| A^{\dag} w \|_p}{ \| w \|_q}.\end{split}\]

Thus, the inequality changes into equality. Finally

\[\underset{w \neq 0}{\max} \frac{\| A^{\dag} w \|_p}{ \| w \|_q} = \| A^{\dag} \|_{q \to p}\]

which completes the proof.

Row column norms

Definition

Let \(A\) be an \(m\times n\) matrix with rows \(\underline{a}^i\) as

\[\begin{split}A = \begin{bmatrix} \underline{a}^1\\ \vdots \\ \underline{a}^m \end{bmatrix}\end{split}\]

Then we define

\[\| A \|_{p, \infty} \triangleq \underset{1 \leq i \leq m}{\max} \| \underline{a}^i \|_p = \underset{1 \leq i \leq m}{\max} \left ( \sum_{j=1}^n |\underline{a}^i_j |^p \right )^{\frac{1}{p}}\]

where \(1 \leq p < \infty\). i.e. we take \(p\)-norms of all row vectors and then find the maximum.

We define

\[\| A \|_{\infty, \infty} = \underset{i, j}{\max} |a_{i j}|.\]

This is equivalent to taking \(\ell_{\infty}\) norm on each row and then taking the maximum of all the norms.

For \(1 \leq p , q < \infty\), we define the norm

\[\| A \|_{p, q} \triangleq \left [ \sum_{i=1}^m \left ( \| \underline{a}^i \|_p \right )^q \right ]^{\frac{1}{q}}.\]

i.e., we compute \(p\)-norm of all the row vectors to form another vector and then take \(q\)-norm of that vector.

Note that the norm \(\| A \|_{p, \infty}\) is different from the operator norm \(\| A \|_{p \to \infty}\). Similarly \(\| A \|_{p, q}\) is different from \(\| A \|_{p \to q}\).
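
The row column norms translate directly into MATLAB expressions (a sketch; the values \(p = 3\) and \(q = 2\) are arbitrary examples):

    A = randn(4, 6);
    p = 3; q = 2;
    row_p_norms = sum(abs(A).^p, 2).^(1/p);    % l_p norm of every row
    n_p_inf   = max(row_p_norms);              % || A ||_{p, infinity}
    n_p_q     = sum(row_p_norms.^q)^(1/q);     % || A ||_{p, q}
    n_inf_inf = max(abs(A(:)));                % || A ||_{infinity, infinity}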

Theorem
\[\| A \|_{p, \infty} = \| A \|_{q \to \infty}\]

where

\[\frac{1}{p} + \frac{1}{q} = 1.\]
Proof

From here we get

\[\| A \|_{q \to \infty} = \underset{1 \leq i \leq m}{\max}\| \underline{a}^i \|_p.\]

This is exactly the definition of \(\| A \|_{p, \infty}\).

Theorem
\[\| A \|_{1 \to p} = \| A^H \|_{p, \infty}.\]
Proof
Let \(q\) be such that \(\frac{1}{p} + \frac{1}{q} = 1\). From here,
\[\| A \|_{1 \to p} = \| A^H \|_{q \to \infty}.\]

From here

\[\| A^H \|_{q \to \infty} = \| A^H \|_{p, \infty}.\]
Theorem

For any two matrices \(A, B\), we have

\[\frac{\|A B \|_{p, \infty}}{\| B\|_{p, \infty}} \leq \| A \|_{\infty \to \infty}.\]
Proof

Let \(q\) be such that \(\frac{1}{p} + \frac{1}{q} = 1\). From here, we have

\[\| A B \|_{q \to \infty} \leq \| A \|_{\infty \to \infty} \| B \|_{q \to \infty}.\]

From here

\[\| A B \|_{q \to \infty} = \| A B\|_{p, \infty}\]

and

\[\| B \|_{q \to \infty} = \| B\|_{p, \infty}.\]

Thus

\[\| A B\|_{p, \infty} \leq \| A \|_{\infty \to \infty} \| B\|_{p, \infty}.\]
Theorem

Relations between \((p, q)\) norms and \((p \to q)\) norms

\[\| A \|_{1, \infty} = \| A \|_{\infty \to \infty}\]
\[\| A \|_{2, \infty} = \| A \|_{2 \to \infty}\]
\[\| A \|_{\infty, \infty} = \| A \|_{1 \to \infty}\]
\[\| A \|_{1 \to 1} = \| A^H \|_{1, \infty}\]
\[\| A \|_{1 \to 2} = \| A^H \|_{2, \infty}\]
Proof
The first three are straightforward applications of here. The next two are applications of here. See also here.

Block diagonally dominant matrices and generalized Gershgorin disc theorem

In [FV+62] the idea of diagonally dominant matrices (see here) has been generalized to block matrices using matrix norms. We consider the specific case with spectral norm.

Definition

Let \(A\) be a square matrix in \(\CC^{n \times n}\) which is partitioned in following manner

\[\begin{split}A = \begin{bmatrix} A_{11} & A_{12} & \dots & A_{1 k}\\ A_{21} & A_{22} & \dots & A_{2 k}\\ \vdots & \vdots & \ddots & \vdots\\ A_{k 1} & A_{k 2} & \dots & A_{k k}\\ \end{bmatrix}\end{split}\]

where each of the submatrices \(A_{i j}\) is a square matrix of size \(m \times m\). Thus \(n = k m\).

\(A\) is called block diagonally dominant if

\[\| A_{ii}\|_2 \geq \sum_{j \neq i } \|A_{ij} \|_2.\]

holds true for all \(1 \leq i \leq k\). If the inequality holds strictly for all \(i\), then \(A\) is called a block strictly diagonally dominant matrix.

Theorem
If the partitioned matrix \(A\) of here is a block strictly diagonally dominant matrix, then it is nonsingular.

For proof see [FV+62].

This leads to the generalized Gershgorin disc theorem.

Theorem

Let \(A\) be a square matrix in \(\CC^{n \times n}\) which is partitioned in following manner

\[\begin{split}A = \begin{bmatrix} A_{11} & A_{12} & \dots & A_{1 k}\\ A_{21} & A_{22} & \dots & A_{2 k}\\ \vdots & \vdots & \ddots & \vdots\\ A_{k 1} & A_{k 2} & \dots & A_{k k}\\ \end{bmatrix}\end{split}\]

where each of the submatrices \(A_{i j}\) is a square matrix of size \(m \times m\). Then each eigenvalue \(\lambda\) of \(A\) satisfies

\[\| \lambda I - A_{ii}\|_2 \leq \sum_{j\neq i} \|A_{ij} \|_2 \text{ for some } i \in \{1,2, \dots, k \}.\]

For proof see [FV+62].

Since the \(2\)-norm of a positive semidefinite matrix is nothing but its largest eigen value, the theorem directly applies.

Corollary

Let \(A\) be a Hermitian positive semidefinite matrix. Let \(A\) be partitioned as in here. Then its \(2\)-norm \(\| A \|_2\) satisfies

\[| \| A \|_2 - \|A_{ii}\|_2 | \leq \sum_{j\neq i} \|A_{ij} \|_2 \text{ for some } i \in \{1,2, \dots, k \}.\]

Real Analysis

Metric Spaces

Definition

A metric or a distance \(d\) on a nonempty set \(X\) is a function \(d : X \times X \to \RR\) which satisfies following properties

  1. \(d(x, y) \geq 0 \Forall x, y \in X\) non-negativity axiom [M1];
  2. \(d(x, y) = 0 \iff x = y\) coincidence axiom [M2];
  3. \(d(x, y ) = d(y, x) \Forall x, y \in X\) symmetry [M3];
  4. \(d(x, y) \leq d(x, z) + d(z, y) \Forall x, y, z \in X\) triangle inequality or sub-additivity [M4].

The pair \((X, d)\) is called a metric space.

Lemma

In a metric space \((X, d)\), the inequality

(1)\[| d(x, z) - d(y, z)| \leq d(x, y)\]

holds for all points \(x, y, z \in X\).

Proof
\[d(x, z) \leq d(x, y) + d(y, z) \implies d(x, z) - d(y, z) \leq d(x, y).\]

Interchanging \(x\) and \(y\) we get

\[d(y, z) - d(x, z) \leq d(y, x) = d(x, y).\]

Combining the two, we get the result.

Example: Real line as metric space

We show different metrics for the set of real numbers \(\RR\). Let \(x, y, z \in \RR\). Define:

\[d_1 (x, y) = | x - y |.\]

Since the absolute value of any real number is non-negative, M1 is satisfied.

\(| x- y | = 0 \iff x - y = 0 \iff x = y\). Thus, M2 is satisfied.

Now,

\[d_1 (x, y) = | x - y | = | - (x - y) | = | y - x | = d_1 (y, x).\]

Thus, M3 is satisfied.

Finally,

\[d_1(x, y) + d_1(y, z) = | x - y | + | y - z | \geq | x - y + y - z | = | x - z | = d_1 (x, z).\]

Thus, M4 is satisfied and \((\RR, d_1)\) is a metric space.

Example: General Euclidean metrics

We consider metrics defined on \(\RR^n\) (the set of n-tuples).

Let \(x = (x_1, \dots, x_n) \in \RR^n\) and \(y = (y_1, \dots, y_n) \in \RR^n\).

The taxicab metric is defined as

(2)\[d_1(x, y) = \sum_{i = 1}^n | x_i - y_i|.\]

The Euclidean metric is defined as

(3)\[d_2(x, y) = \left ( \sum_{i = 1}^n | x_i - y_i|^2 \right )^{\frac{1}{2}}.\]

The general Euclidean metric is defined as

(4)\[d_p(x, y) = \left ( \sum_{i = 1}^n | x_i - y_i|^p \right )^{\frac{1}{p}} \quad p = 2, 3, \dots.\]

For \(p = \infty\), metric is defined as

(5)\[d_{\infty}(x, y) = \underset{1 \leq i \leq n}{\max}{ | x_i - y_i|}.\]
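
In MATLAB these metrics on \(\RR^n\) are just norms of the difference vector (a small sketch; the points x and y and the exponent p = 3 are arbitrary):

    x = [1; -2; 3];
    y = [0; 4; 1];
    d1   = sum(abs(x - y));              % taxicab metric
    d2   = sqrt(sum((x - y).^2));        % Euclidean metric
    p    = 3;
    dp   = sum(abs(x - y).^p)^(1/p);     % general Euclidean metric for p = 3
    dinf = max(abs(x - y));              % the p = infinity metric
    % equivalently, using the built-in vector norm
    [norm(x - y, 1), norm(x - y, 2), norm(x - y, p), norm(x - y, inf)]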

We now prove that the metrics defined above are indeed metrics. We start with the taxicab metric. M1 is straightforward since

\[d_1(x, y) = \sum_{i = 1}^n | x_i - y_i| \geq 0.\]

M2 is also easy

\[\sum_{i = 1}^n | x_i - y_i| = 0 \iff | x_i - y_i| = 0 \; \Forall i \iff x_i = y_i \; \Forall i \iff x = y.\]

M3 is straightforward too

\[d_1(x, y) = \sum_{i = 1}^n | x_i - y_i| = \sum_{i = 1}^n | y_i - x_i| = d_1 (y, x).\]

We will prove M4 (triangle inequality) inductively. For \(n=1\)

\[d_1(x, z) + d_1(z, y) = | x_1 - z_1 | + | z_1 - y_1 | \geq | x_1 - z_1 + z_1 - y_1 | = | x_1 - y_1| = d_1(x, y).\]

Thus M4 is true for \(n=1\).

TODO finish it.

Open sets

Definition

Let \((X, d)\) be a metric space. An open ball at any \(x \in X\) with radius \(r > 0\) is the set

(6)\[B(x, r) \triangleq \{ y \in X : d(x, y) < r\}.\]
Lemma
\[B(x, r_1) \subseteq B(x, r) \text{ whenever } r_1 \leq r.\]
Proof
Let \(z \in B(x, r_1)\). Then \(d(x, z) < r_1 \leq r \implies z \in B(x, r)\).
Definition
A set \(A \subseteq X\) is called open if for every \(x \in A\) there exists some \(r > 0\) such that \(B(x, r) \subseteq A\).
Lemma
Every open ball \(B(x, r)\) is an open set.
Proof

Let \(A = B(x, r)\). We need to show that for every \(y \in A\) there exists an open ball \(B(y, r_1) \subseteq A\).

Let \(r_1 = r - d(x, y)\). Since \(d(x, y) < r\) for every \(y \in A\), we have \(r_1 > 0\). We can also write \(d(x, y) = r - r_1\). Consider \(C = B(y, r_1)\). For any \(z \in C\) we have \(d(y, z) < r_1\). Further, using the triangle inequality:

\[d(x, z) \leq d(x, y) + d(y, z) \leq r - r_1 + d(y, z) < r - r_1 + r_1 = r.\]

Thus \(z \in B(x, r) \Forall z \in C\), hence \(C \subseteq B(x, r)\). Hence \(B(x, r)\) is open.

Theorem

For a metric space \((X, d)\) following statements hold

  1. \(X\) and \(\EmptySet\) are open sets.
  2. Arbitrary unions of open sets are open sets.
  3. Finite intersections of open sets are open sets.
Proof

Since \(\EmptySet\) doesn’t contain any element, the condition for being open is vacuously satisfied by \(\EmptySet\). For any \(x \in X\) and any \(r > 0\), \(B(x, r) \subseteq X\) by definition. Hence \(X\) is open.

Let \(\{A_i\}_{i \in I}\) be an arbitrary family of open sets with \(A_i \subseteq X\). Let \(C = \bigcup A_i\). Let \(x \in C\). Then there exists some \(A_i\) such that \(x \in A_i\). Since \(A_i\) is open hence there exists an open ball \(B(x, r) \subseteq A_i \subseteq C\). Thus for every \(x \in C\) there exists an open ball \(B(x, r) \subseteq C\). Hence \(C\) is open.

Let \(\{A_1, \dots, A_n\}\) be a finite collection of open subsets of \(X\). Let \(C = \bigcap A_i\). Let \(x \in C\). Then \(x \in A_i \Forall 1 \leq i \leq n\). Thus there exists an open ball \(B(x, r_i) \subseteq A_i \Forall 1 \leq i \leq n\). Now let \(r = \min(r_1, \dots, r_n)\). Since \(r_i > 0\) and we are taking a minimum over finite set of numbers hence \(r > 0\). Thus \(B(x, r) \subseteq B(x, r_i) \subseteq A_i \Forall 1 \leq i \leq n\). Thus \(B(x, r) \subseteq C\). Thus C is open.

Definition
Let \((X, d)\) be a metric space. Let \(A \subseteq X\). A point \(a \in A\) is called an interior point of \(A\) if there exists an open ball \(B(a, r)\) such that \(B(a, r) \subseteq A\).
Definition
The set of all interior points of a set \(A\) is called its interior. It is denoted by \(\Interior{A}\).
Lemma
For any set \(A \subseteq X\), its interior \(\Interior{A}\) is an open set.
Proof

We need to show that for every \(x \in \Interior{A}\), there exists an open ball \(B(x, r) \subseteq \Interior{A}\).

Let \(x \in \Interior{A}\). Then there exists an open ball \(B(x, r) \subseteq A\). Since \(B( x, r)\) is open hence for every \(y \in B (x, r)\) there exists an open ball \(B (y , r_1) \subseteq B(x, r) \subseteq A\). Thus \(y\) is an interior point of \(A\). Hence \(B(x, r) \subseteq \Interior{A}\).

Lemma
For any set \(A \subseteq X\), its interior \(\Interior{A}\) is the largest open set included in \(A\).
Proof
Let \(C \subseteq A\) be open. Let \(x \in C\). Then there exists an open ball \(B(x, r) \subseteq C \subseteq A\). Thus \(x\) is an interior point of \(A\). Hence \(x \in \Interior{A}\). Thus \(C \subseteq \Interior{A}\). Thus every open subset of \(A\) is a subset of the interior of \(A\). We have already shown that \(\Interior{A}\) is open. Hence \(\Interior{A}\) is the largest open set contained in \(A\).
Lemma
\(A\) is open if and only if \(\Interior{A} = A\).
Proof

Let \(A\) be open. Hence for every \(x \in A\), there exists an open ball \(B(x, r) \subseteq A\). Thus \(x\) is an interior point of \(A\). Thus \(A \subseteq \Interior{A}\). But since \(\Interior{A} \subseteq A\), hence \(\Interior{A} = A\).

Now the converse. Let \(\Interior{A} = A\). Thus for every point \(x \in A\), there exists an open ball \(B(x, r) \subseteq A\) since \(x \in \Interior{A}\). Hence \(A\) is open.

Closed sets

Definition
A subset \(A\) of a metric space \((X, d)\) is called closed if its complement \(X \setminus A\) denoted as \(A^c\) is open.
Theorem

For a metric space \((X, d)\) the following statements hold:

  1. \(X\) and \(\EmptySet\) are closed sets.
  2. Arbitrary intersections of closed sets are closed sets.
  3. Finite unions of closed sets are closed sets.
Proof

Since \(\EmptySet\) is open hence \(X = X \setminus \EmptySet\) is closed. Since \(X\) is open hence \(\EmptySet = X \setminus X\) is closed.

Let \(\{A_i\}_{i \in I}\) be an arbitrary family of closed sets with \(A_i \subseteq X\). Then \(A_i^c\) are open. Thus \(\bigcup A_i^c\) is open. Thus \(\left ( \bigcup A_i^c \right )^c\) is closed. By De Morgan’s law, \(\bigcap A_i\) is closed.

Let \(\{A_1, \dots, A_n\}\) be a finite collection of closed subsets of \(X\). Then \(A_i^c\) are open. Hence their finite intersection \(\bigcap A_i^c\) is open. Hence \(\left ( \bigcap A_i^c \right )^c\) is closed. By De Morgan’s law, \(\bigcup A_i\) is closed.

Remark
A set \(A\) is open if and only if \(A^c\) is closed. Similarly a set \(A\) is closed if and only if \(A^c\) is open.
Definition
A point \(x \in X\) is called a closure point of a set \(A \subseteq X\) if every open ball at \(x\) contains at least one element of \(A\); i.e. \(B(x, r) \cap A \neq \EmptySet \Forall r > 0\).

Note that a closure point of \(A\) need not belong to \(A\). At the same time, every point in \(A\) is a closure point of \(A\).

Definition
The set of all closure points of a set \(A \subseteq X\) is called closure of \(A\) and is denoted by \(\Closure{A}\).

Clearly \(A \subseteq \Closure{A}\).

Lemma
The closure of a set \(A\) in a metric space \((X, d)\) is a closed set.
Proof

We will show that \(C = \Closure{A}^c\) is open.

Let \(x \in C\). Then \(x\) is not a closure point of \(A\). Hence, there exists an open ball \(B(x, r)\) such that \(B(x, r) \cap A = \EmptySet\). Now, consider \(z \in B (x, r)\). Since \(B(x, r)\) is open, there exists \(r_1 > 0\) such that \(B (z, r_1) \subseteq B(x, r)\). Thus, \(B (z, r_1) \cap A = \EmptySet\). Hence, \(z\) is not a closure point of \(A\). Hence, \(z \in C\). Thus, \(B( x, r) \subseteq C\). Thus, we have shown that for every \(x \in C\), there exists an open ball \(B(x, r) \subseteq C\). Thus, \(C\) is open. Consequently, \(\Closure{A} = C^c\) is closed.

Theorem
For every subset \(A\) of a metric space \((X, d)\) its closure \(\Closure{A}\) is the smallest closed set containing \(A\).
Proof
Let \(C\) be a closed set such that \(A \subseteq C\). Then, \(C^c\) is open. Hence, for every \(x \in C^c\), there exists an open ball \(B(x, r) \subseteq C^c\). Thus, \(B (x, r) \cap C = \EmptySet \implies B (x, r) \cap A = \EmptySet\). Thus, \(x\) is not a closure point of \(A\). Since every point in \(C^c\) is not a closure point of \(A\), hence every closure point of \(A\) belongs to \(C\). Thus, \(\Closure{A} \subseteq C\). Finally, since \(\Closure{A}\) is closed, hence it is the smallest closed set containing \(A\).
Lemma
A set \(A\) is closed if and only if \(A = \Closure{A}\).
Proof

Let \(A\) be closed. Since \(\Closure{A}\) is the smallest closed set containing \(A\), we have \(\Closure{A} \subseteq A\). But \(A \subseteq \Closure{A}\) always holds, hence \(A = \Closure{A}\).

Now assume \(A = \Closure{A}\). Since \(\Closure{A}\) is closed, \(A\) is closed.

Definition
A set of the form \(A = \{x \in X : d(x, a) \leq r \}\) is called the closed ball at \(a\) with radius \(r\).
Lemma
A closed ball at point \(a\) with radius \(r\) given by \(A = \{x \in X : d(x, a) \leq r \}\) is a closed set.
Proof

We show that \(A^c\) is open.

Let \(y \in A^c\). Then \(d(y, a) > r\). Now consider \(r_1 = d(y, a) - r > 0\) and an open ball \(B(y, r_1)\). For any \(z \in B(y, r_1)\)

\[d(z, a) \geq d(y, a) - d(z, y) > d(y, a) - r_1 = r.\]

Thus, \(z \in A^c\). Hence, \(B(y, r_1) \subseteq A^c\). Hence, \(A^c\) is open. Thus, \(A\) is closed.

Lemma
The closure of an open ball \(B (x, r) = \{ y \in X : d(x, y) < r\}\) is the closed ball \(A = \{ y \in X : d(x, y) \leq r\}\).
Proof

Any point \(y : d(x, y) < r\) is obviously a closure point of \(B (x, r)\).

We show that a point \(y : d(x, y) = r\) is a closure point of \(B (x, r)\) (the argument below uses a point on the line segment joining \(x\) and \(y\), and hence applies in \(\RR^N\) or any normed space). For contradiction, suppose \(y\) is not a closure point of \(B (x, r)\). Then, there exists an open ball \(B(y, r_1)\), with \(0 < r_1 < r\) (shrinking \(r_1\) if necessary), such that \(B (y, r_1) \cap B (x, r) = \EmptySet\). But the point \(z = y + \frac{r_1}{2 r}(x - y)\) satisfies \(d(y, z) = \frac{r_1}{2} < r_1\) and \(d(x, z) = r - \frac{r_1}{2} < r\), so \(z\) belongs to both balls, a contradiction.

We show that a point \(y : d(x, y) > r\) is not a closure point of \(B (x, r)\). Let \(r_1 = d(x, y) -r > 0\). Then, \(B ( y, r_1) \cap B (x, r) = \EmptySet\). Hence \(y\) is not a closure point of \(B (x, r)\).

Convex Analysis

Convex sets

We start off with reviewing some basic definitions.

Affine sets

Definition

Let \(x_1\) and \(x_2\) be two points in \(\RR^N\). Points of the form

\[y = \theta x_1 + (1 - \theta) x_2 \text{ where } \theta \in \RR\]

form a line passing through \(x_1\) and \(x_2\).

  • at \(\theta=0\) we have \(y=x_2\).
  • at \(\theta=1\) we have \(y=x_1\).
  • \(\theta \in [0,1]\) corresponds to the points belonging to the [closed] line segment between \(x_1\) and \(x_2\).

We can also rewrite \(y\) as

\[y = x_2 + \theta (x_1 - x_2)\]

In this definition:

  • \(x_2\) is called the base point for this line.
  • \(x_1 - x_2\) defines the direction of the line.
  • \(y\) is the sum of the base point and the direction scaled by the parameter \(\theta\).
  • As \(\theta\) increases from \(0\) to \(1\), \(y\) moves from \(x_2\) to \(x_1\).
Definition

A set \(C \subseteq \RR^N\) is affine if the line through any two distinct points in \(C\) lies in \(C\).

In other words, for any \(x_1, x_2 \in C\), we have \(\theta x_1 + (1 - \theta) x_2 \in C\) for all \(\theta \in \RR\).

If we denote \(\alpha = \theta\) and \(\beta = (1 - \theta)\) we see that \(\alpha x_1 + \beta x_2\) represents a linear combination of points in \(C\) such that \(\alpha + \beta = 1\).

The idea can be generalized in the following way.

Definition
A point of the form \(\theta_1 x_1 + \dots + \theta_k x_k\) where \(\theta_1 + \dots + \theta_k = 1\) with \(\theta_i \in \RR\) and \(x_i \in \RR^N\), is called an affine combination of the points \(x_1,\dots,x_k\).

It can be shown easily that an affine set \(C\) contains all affine combinations of its points.

Remark
If \(C\) is an affine set, \(x_1, \dots, x_k \in C\), and \(\theta_1 + \dots + \theta_k = 1\), then the point \(y = \theta_1 x_1 + \dots + \theta_k x_k\) also belongs to \(C\).
Lemma

Let \(C\) be an affine set and \(x_0\) be any element in \(C\). Then the set

\[V = C - x_0 = \{ x - x_0 | x \in C\}\]

is a subspace of \(\RR^N\).

Proof

Let \(v_1\) and \(v_2\) be two elements in \(V\). Then by definition, there exist \(x_1\) and \(x_2\) in \(C\) such that

\[v_1 = x_1 - x_0\]

and

\[v_2 = x_2 - x_0\]

Thus

\[a v_1 + v_2 = a (x_1 - x_0) + x_2 - x_0 = (a x_1 + x_2 - a x_0 ) - x_0 \Forall a \in \RR.\]

Since \(a + 1 - a = 1\), we have \(x_3 = (a x_1 + x_2 - a x_0 ) \in C\) (an affine combination of points in \(C\)).

Hence \(a v_1 + v_2 = x_3 - x_0 \in V\) [by definition of \(V\)].

Thus \(a v_1 + v_2 \in V\) for all \(v_1, v_2 \in V\) and \(a \in \RR\), which (together with \(0 = x_0 - x_0 \in V\)) shows that any linear combination of elements in \(V\) belongs to \(V\). Hence \(V\) is a subspace of \(\RR^N\).

With this, we can use the following notation:

\[C = V + x_0 = \{ v + x_0 | v \in V\}\]

i.e. an affine set is a subspace with an offset.

Remark
Let \(C\) be an affine set and let \(x_1\) and \(x_2\) be two distinct elements. Let \(V_1 = C - x_1\) and \(V_2 = C - x_2\), then the subspaces \(V_1\) and \(V_2\) are identical.

Thus the subspace \(V\) associated with an affine set \(C\) doesn’t depend upon the choice of offset \(x_0\) in \(C\).

Definition
We define the affine dimension of an affine set \(C\) as the dimension of the associated subspace \(V = C - x_0\) for some \(x_0 \in C\).
ExampleSolution set of linear equations

We now show that the solution set of linear equations forms an affine set.

Let \(C = \{ x | A x = b\}\) where \(A \in \RR^{M \times N}\) and \(b \in \RR^M\).

\(C\) is the set of all vectors \(x \in \RR^N\) which satisfy the system of linear equations given by \(A x = b\). Then \(C\) is an affine set.

Let \(x_1\) and \(x_2\) belong to \(C\). Then we have

\[A x_1 = b \text{ and } A x_2 = b.\]

Thus

\[\begin{split}&\theta A x_1 + ( 1 - \theta ) A x_2 = \theta b + (1 - \theta ) b\\ &\implies A (\theta x_1 + (1 - \theta) x_2) = b\\ &\implies (\theta x_1 + (1 - \theta) x_2) \in C\end{split}\]

Thus \(C\) is an affine set.

The subspace associated with \(C\) is nothing but the null space of \(A\) denoted as \(\NullSpace(A)\).
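
This structure is easy to verify numerically. The following MATLAB sketch (illustrative only; the system \(A x = b\) is arbitrary and not taken from the library) checks that an affine combination of two solutions is again a solution and that differences of solutions lie in \(\NullSpace(A)\):

    % Affine set C = {x : A x = b}: affine combinations of solutions stay in C.
    A = [1 2 3; 4 5 6];          % an arbitrary 2x3 system
    b = [1; 2];
    x1 = A \ b;                  % one particular solution
    x2 = x1 + 0.7 * null(A);     % another solution: add a null space vector
    theta = 2.5;                 % any real theta (not restricted to [0, 1])
    y = theta * x1 + (1 - theta) * x2;
    norm(A * y - b)              % ~0: y also solves A x = b
    norm(A * (x1 - x2))          % ~0: x1 - x2 lies in the null space of A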

Remark
Every affine set can be expressed as the solution set of a system of linear equations.
ExampleMore affine sets
  • The empty set \(\EmptySet\) is affine.
  • A singleton set containing a single point \(x_0\) is affine. Its corresponding subspace is \(\{0 \}\) of zero dimension.
  • The whole euclidean space \(\RR^N\) is affine.
  • Any line is affine. The associated subspace is the line parallel to it which passes through the origin.
  • Any plane is affine. If it passes through the origin, it is a subspace. The associated subspace is the plane parallel to it which passes through the origin.
Definition

The set of all affine combinations of points in some arbitrary set \(S \subseteq \RR^N\) is called the affine hull of \(S\) and denoted as \(\AffineHull(S)\):

\[\AffineHull(S) = \{\theta_1 x_1 + \dots + \theta_k x_k | x_1, \dots, x_k \in S \text{ and } \theta_1 + \dots + \theta_k = 1\}.\]
Remark
The affine hull is the smallest affine set containing \(S\). In other words, let \(C\) be any affine set with \(S \subseteq C\). Then \(\AffineHull(S) \subseteq C\).
Definition
A set of vectors \(v_0, v_1, \dots, v_K \in \RR^N\) is called affine independent, if the vectors \(v_1 - v_0, \dots, v_K - v_0\) are linearly independent.

Essentially the difference vectors \(v_k - v_0\) belong to the associated subspace.

If the associated subspace has dimension \(L\) then a maximum of \(L\) vectors can be linearly independent in it. Hence a maximum of \(L+1\) vectors can be affine independent for the affine set.

Convex sets

Definition

A set \(C\) is convex if the line segment between any two points in \(C\) lies in \(C\). i.e.

\[\theta x_1 + (1 - \theta) x_2 \in C \Forall x_1, x_2 \in C \text{ and } 0 \leq \theta \leq 1.\]
Definition

We call a point of the form \(\theta_1 x_1 + \dots + \theta_k x_k\), where \(\theta_1 + \dots + \theta_k = 1\) and \(\theta_i \geq 0, i=1,\dots,k\), a convex combination of the points \(x_1, \dots, x_k\).

It is like a weighted average of the points \(x_i\).

Remark
A set is convex if and only if it contains all convex combinations of its points.
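
This characterization can be tested numerically for a concrete convex set. The following MATLAB sketch (illustrative only) draws a few points in the closed unit disc of \(\RR^2\), forms a random convex combination, and confirms that it stays in the disc:

    % A convex combination of points in the closed unit disc stays in the disc.
    k = 5;
    X = randn(2, k);
    for i = 1:k                                        % push each point into the unit disc
        if norm(X(:, i)) > 1
            X(:, i) = X(:, i) / norm(X(:, i));
        end
    end
    theta = rand(k, 1);  theta = theta / sum(theta);   % nonnegative weights summing to 1
    y = X * theta;                                     % convex combination of the points
    norm(y) <= 1 + 1e-12                               % true: y lies in the disc
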
ExampleConvex sets
  • A line segment is convex.
  • A circle [including its interior] is convex.
  • A ray is defined as \(\{ x_0 + \theta v | \theta \geq 0 \}\) where \(v \neq 0\) indicates the direction of the ray and \(x_0\) is the base or origin of the ray. A ray is convex but not affine.
  • Any affine set is convex.
Definition

The convex hull of an arbitrary set \(S \subseteq \RR^N\), denoted as \(\ConvexHull(S)\), is the set of all convex combinations of points in \(S\).

\[\ConvexHull(S) = \{ \theta_1 x_1 + \dots + \theta_k x_k | x_i \in S, \theta_i \geq 0, i = 1,\dots, k, \theta_1 + \dots + \theta_k = 1\}.\]
Remark
The convex hull \(\ConvexHull(S)\) of a set \(S\) is always convex.
Remark
The convex hull of a set \(S\) is the smallest convex set containing it. In other words, let \(C\) be any convex set such that \(S \subseteq C\). Then \(\ConvexHull(S) \subseteq C\).

We can generalize convex combinations to include infinite sums.

Lemma

Let \(\theta_1, \theta_2, \dots\) satisfy

\[\theta_i \geq 0, i = 1,2,\dots, \quad \sum_{i=1}^{\infty} \theta_i = 1,\]

and let \(x_1, x_2, \dots \in C\), where \(C \subseteq \RR^N\) is convex. Then

\[\sum_{i=1}^{\infty} \theta_i x_i \in C,\]

if the series converges.

We can generalize it further to density functions.

Lemma

Let \(p : \RR^N \to \RR\) satisfy \(p(x) \geq 0\) for all \(x \in C\) and

\[\int_{C} p(x) d x = 1\]

Then

\[\int_{C} p(x) x d x \in C\]

provided the integral exists.

Note that \(p\) above can be treated as a probability density function if we define \(p(x) = 0 \Forall x \in \RR^N \setminus C\).

Cones

Definition
A set \(C\) is called a cone or nonnegative homogeneous, if for every \(x \in C\) and \(\theta \geq 0\), we have \(\theta x \in C\).

By definition (taking \(\theta = 0\)), we have \(0 \in C\) whenever \(C\) is nonempty.

Definition

A set \(C\) is called a convex cone if it is convex and a cone. In other words, for every \(x_1, x_2 \in C\) and \(\theta_1, \theta_2 \geq 0\), we have

\[\theta_1 x_1 + \theta_2 x_2 \in C\]
Definition
A point of the form \(\theta_1 x_1 + \dots + \theta_k x_k\) with \(\theta_1 , \dots, \theta_k \geq 0\) is called a conic combination (or a non-negative linear combination) of \(x_1,\dots, x_k\).
Remark

Let \(C\) be a convex cone. Then for every \(x_1, \dots, x_k \in C\), a conic combination \(\theta_1 x_1 + \dots + \theta_k x_k\) with \(\theta_i \geq 0\) belongs to \(C\).

Conversely, if a set \(C\) contains all conic combinations of its points, then it is a convex cone.

The idea of conic combinations can be generalized to infinite sums and integrals.

Definition

The conic hull of a set \(S\) is the set of all conic combinations of points in \(S\). i.e.

\[\{\theta_1 x_1 + \dots + \theta_k x_k | x_i \in S, \theta_i \geq 0, i = 1, \dots, k \}\]
Remark
Conic hull of a set is the smallest convex cone that contains the set.
ExampleConvex cones
  • A ray with its base at origin is a convex cone.
  • A line passing through zero is a convex cone.
  • A plane passing through zero is a convex cone.
  • Any subspace is a convex cone.

We now look at some more important convex sets one by one.

Hyperplanes and half spaces

Definition

A hyperplane is a set of the form

\[H = \{ x : a^T x = b \}\]

where \(a \in \RR^N, a \neq 0\) and \(b \in \RR\).

The vector \(a\) is called the normal vector to the hyperplane.

  • Analytically it is a solution set of a nontrivial linear equation. Thus it is an affine set.
  • Geometrically it is a set of points with a constant inner product to a given vector \(a\).

Now let \(x_0\) be an arbitrary element in \(H\). Then

\[\begin{split} &a^T x_0 = b\\ \implies &a^T x = a^T x_0 \Forall x \in H\\ \implies &a^T (x - x_0) = 0 \Forall x \in H\\ \implies &H = \{ x | a^T(x-x_0) = 0\}\end{split}\]

Now consider the orthogonal complement of \(a\) defined as

\[a^{\bot} = \{ v | a^T v = 0\}\]

i.e. the set of all vectors that are orthogonal to \(a\).

Now consider the set

\[S = x_0 + a^{\bot}\]

Clearly for every \(x \in S\), \(a^T x = a^T x_0 = b\).

Thus we can say that

\[H = \{ x | a^T(x-x_0) = 0\} = x_0 + a^{\bot}\]

Thus the hyperplane consists of an offset \(x_0\) plus all vectors orthogonal to the (normal) vector \(a\).

Definition

A hyperplane divides \(\RR^N\) into two halfspaces. The two (closed) halfspaces are given by

\[H_+ = \{ x : a^T x \geq b \}\]

and

\[H_- = \{ x : a^T x \leq b \}\]

The halfspace \(H_+\) extends in the direction of \(a\) while \(H_-\) extends in the direction of \(-a\).

  • A halfspace is the solution set of one (nontrivial) linear inequality.

  • A halfspace is convex but not affine.

  • The halfspace can be written alternatively as

    \[\begin{split}H_+ = \{ x | a^T (x - x_0) \geq 0\}\\ H_- = \{ x | a^T (x - x_0) \leq 0\}\end{split}\]

    where \(x_0\) is any point in the associated hyperplane \(H\).

  • Geometrically, for a point \(x \in H_+\) the displacement \(x - x_0\) makes a nonobtuse (acute or right) angle with \(a\), while for \(x \in H_-\) it makes a nonacute (obtuse or right) angle with \(a\).

Definition

The sets given by

\[\begin{split}\Interior{H_+} = \{ x | a^T x > b\}\\ \Interior{H_-} = \{ x | a^T x < b\}\end{split}\]

are called open halfspaces. They are the interiors of the corresponding closed halfspaces.
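
Working with hyperplanes and halfspaces numerically amounts to checking the sign of \(a^T x - b\). A minimal MATLAB sketch (the vector \(a\) and scalar \(b\) are arbitrary illustrations):

    % Classify points with respect to the hyperplane H = {x : a'x = b}.
    a = [1; -2; 3];  b = 4;
    x0 = b * a / (a' * a);             % a particular point on H
    X = randn(3, 6);                   % some test points (columns)
    s = a' * X - b;                    % the sign of s tells the side of H
    in_Hplus  = (s >= 0)               % membership in the closed halfspace a'x >= b
    in_Hminus = (s <= 0)               % membership in the closed halfspace a'x <= b
    % The offset form a'(x - x0) gives the same values:
    max(abs(a' * (X - x0 * ones(1, 6)) - s))   % ~0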

Euclidean balls and ellipsoids

Definition

A Euclidean closed ball (or just ball) in \(\RR^N\) has the form

\[B = \{ x | \| x - x_c\|_2 \leq r \} = \{x | (x - x_c)^T (x - x_c) \leq r^2 \},\]

where \(r > 0\) and \(\| \cdot \|_2\) denotes the Euclidean norm.

\(x_c\) is the center of the ball.

\(r\) is the radius of the ball.

An equivalent definition is given by

\[B = \{x_c + r u | \| u \|_2 \leq 1 \}.\]
Remark
A Euclidean ball is a convex set.
Proof

Let \(x_1, x_2\) be any two points in \(B\). We have

\[\| x_1 - x_c\|_2 \leq r\]

and

\[\| x_2 - x_c\|_2 \leq r\]

Let \(\theta \in [0,1]\) and consider the point \(x = \theta x_1 + (1 - \theta) x_2\). Then

\[\begin{split}\| x - x_c \|_2 &= \| \theta x_1 + (1 - \theta) x_2 - x_c\|_2\\ &= \| \theta (x_1 - x_c) + (1 - \theta) (x_2 - x_c) \|_2\\ &\leq \theta \| (x_1 - x_c)\|_2 + (1 - \theta)\| (x_2 - x_c)\|_2\\ &\leq \theta r + (1 - \theta) r\\ &= r\end{split}\]

Thus \(x \in B\), hence \(B\) is a convex set.

Definition

An ellipsoid is a set of the form

\[\xi = \{x | (x - x_c)^T P^{-1} (x - x_c) \leq 1\}\]

where \(P = P^T \succ 0\) i.e. \(P\) is symmetric and positive definite.

The vector \(x_c \in \RR^N\) is the center of the ellipsoid.

The eigenvalues of the matrix \(P\) (which are all positive) determine how far the ellipsoid extends in each direction from \(x_c\).

The lengths of the semi-axes of \(\xi\) are given by \(\sqrt{\lambda_i}\), where \(\lambda_i\) are the eigenvalues of \(P\).

Remark
A ball is an ellipsoid with \(P = r^2 I\).

An alternative representation of an ellipsoid is given by

\[\xi = \{x_c + A u | \| u\|_2 \leq 1 \}\]

where \(A\) is a square and nonsingular matrix.

To show the equivalence of the two definitions, we proceed as follows.

Let \(P = A A^T\). Let \(x\) be any arbitrary element in \(\xi\).

Then \(x - x_c = A u\) for some \(u\) such that \(\| u \|_2 \leq 1\).

Thus

\[\begin{split}&(x - x_c)^T P^{-1} (x - x_c) = (A u)^T (A A^T)^{-1} (A u)\\ &= u^T A^T (A^T)^{-1} A^{-1} A u = u^T u \\ &= \| u \|_2^2 \leq 1\end{split}\]

The two representations of an ellipsoid are therefore equivalent.
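
The equivalence can also be confirmed numerically. A minimal MATLAB sketch (random \(A\) and \(x_c\), illustrative only):

    % Ellipsoid: {x : (x - xc)' P^{-1} (x - xc) <= 1} versus {xc + A u : ||u||_2 <= 1}.
    n = 3;
    A = randn(n) + n * eye(n);          % a (generically) nonsingular matrix
    P = A * A';                         % P = A A' is symmetric positive definite
    xc = randn(n, 1);
    u = randn(n, 1);  u = rand * u / norm(u);   % a point with ||u||_2 <= 1
    x = xc + A * u;
    (x - xc)' * (P \ (x - xc))          % <= 1, consistent with the first description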

Remark
An ellipsoid is a convex set.

Norm balls and norm cones

Definition

Let \(\| \cdot \| : \RR^N \to \RR\) be any norm on \(\RR^N\). A norm ball with radius \(r\) and center \(x_c\) is given by

\[B = \{ x | \| x - x_c \| \leq r \}\]
Remark
A norm ball is convex.
Definition

Let \(\| \cdot \| : \RR^N \to \RR\) be any norm on \(\RR^N\). The norm cone associated with the norm \(\| \cdot \|\) is given by the set

\[C = \{ (x,t) | \| x \| \leq t \} \subseteq \RR^{N+1}\]
Remark
A norm cone is convex. Moreover it is a convex cone.
ExampleSecond order cone

The second order cone is the norm cone for the Euclidean norm, i.e.

\[C = \{(x,t) | \| x \|_2 \leq t \} \subseteq \RR^{N+1}\]

This can be rewritten as

\[\begin{split}C = \left \{ \begin{bmatrix} x \\ t \end{bmatrix} \middle | \begin{bmatrix} x \\ t \end{bmatrix}^T \begin{bmatrix} I & 0 \\ 0 & -1 \end{bmatrix} \begin{bmatrix} x \\ t \end{bmatrix} \leq 0 , t \geq 0 \right \}\end{split}\]
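
Membership in the second order cone can be checked either through the norm inequality or through the quadratic form above. A minimal MATLAB sketch (illustrative only):

    % Second order cone: ||x||_2 <= t  <=>  [x; t]' * blkdiag(I, -1) * [x; t] <= 0 and t >= 0.
    n = 4;
    x = randn(n, 1);
    t = norm(x) + rand;                 % choose t so that (x, t) lies in the cone
    z = [x; t];
    Q = blkdiag(eye(n), -1);
    (z' * Q * z <= 0) && (t >= 0)       % true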

Polyhedra

Definition

A polyhedron is defined as the solution set of a finite number of linear inequalities.

\[P = \{ x | a_j^T x \leq b_j, j = 1, \dots, M, c_k^T x = d_k, k = 1, \dots, P\}\]

A polyhedron thus is the intersection of a finite number of halfspaces (\(M\)) and hyperplanes (\(P\)).

ExamplePolyhedra
  • Affine sets (subspaces, hyperplanes, lines)
  • Rays
  • Line segments
  • Halfspaces
Remark
A polyhedron is a convex set.
Definition
A bounded polyhedron is known as a polytope.

We can combine the set of inequalities and equalities in the form of linear matrix inequalities and equalities.

\[P = \{ x | A x \preceq b, C x = d\}\]

where

\[\begin{split}&A = \begin{bmatrix} a_1^T \\ \vdots \\ a_M^T \end{bmatrix} , b = \begin{bmatrix} b_1 \\ \vdots \\ b_M \end{bmatrix}\\ &C = \begin{bmatrix} c_1^T \\ \vdots\\ c_P^T \end{bmatrix} , d = \begin{bmatrix} d_1 \\ \vdots \\ d_P \end{bmatrix}\end{split}\]

and the symbol \(\preceq\) means vector inequality or component wise inequality in \(\RR^M\) i.e. \(u \preceq v\) means \(u_i \leq v_i\) for \(i = 1, \dots, M\).

Note that \(b \in \RR^M\), \(A \in \RR^{M \times N}\), \(A x \in \RR^M\), \(d \in \RR^P\), \(C \in \RR^{P \times N}\) and \(C x \in \RR^P\).
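
Testing whether a given point belongs to a polyhedron reduces to these componentwise checks. A minimal MATLAB sketch with arbitrary illustrative data:

    % Membership test for P = {x : A x <= b (componentwise), C x = d}.
    A = [ 1  0;  0  1; -1 -1];  b = [1; 1; 0];   % three halfspaces in R^2
    C = [1 -1];                 d = 0;           % one hyperplane: x_1 = x_2
    x = [0.4; 0.4];
    in_P = all(A * x <= b) && all(abs(C * x - d) < 1e-12)   % true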

ExampleSet of nonnegative numbers
Let \(\RR_+ = \{ x \in \RR | x \geq 0\}\). \(\RR_+\) is a polyhedron (the solution set of a single linear inequality). Hence it is a convex set. Moreover, it is a ray and a convex cone.
ExampleNon-negative orthant

We can generalize \(\RR_+\) as follows. Define

\[\RR_+^N = \{ x \in \RR^N | x_i \geq 0 , i = 1, \dots , N\} = \{x \in \RR^N | x \succeq 0 \}.\]

\(\RR_+^N\) is called the nonnegative orthant. It is a polyhedron (the solution set of \(N\) linear inequalities). It is also a convex cone.

Definition

Let \(K+1\) points \(v_0, \dots, v_K \in \RR^N\) be affine independent (as defined above).

The simplex determined by them is given by

\[C = \ConvexHull \{ v_0, \dots, v_K\} = \{ \theta_0 v_0 + \dots + \theta_K v_K | \theta \succeq 0, 1^T \theta = 1\}\]

where \(\theta = [\theta_0, \theta_1, \dots, \theta_K]^T\) and \(1\) denotes a vector of appropriate size \((K + 1)\) with all entries one.

In other words, \(C\) is the convex hull of the set \(\{v_0, \dots, v_K\}\).

The positive semidefinite cone

Definition

We define the set of symmetric \(N\times N\) matrices as

\[S^N = \{X \in \RR^{N \times N} | X = X^T\}.\]
Lemma
\(S^N\) is a vector space with dimension \(\frac{N(N+1)}{2}\).
Definition

We define the set of symmetric positive semidefinite matrices as

\[S_+^N = \{X \in S^N | X \succeq 0 \}.\]

The notation \(X \succeq 0\) means \(v^T X v \geq 0 \Forall v \in \RR^N\).

Definition

We define the set of symmetric positive definite matrices as

\[S_{++}^N = \{X \in S^N | X \succ 0 \}.\]

The notation \(X \succ 0\) means \(v^T X v > 0 \Forall v \in \RR^N\).

Lemma
The set \(S_+^N\) is a convex cone.
Proof

Let \(A, B \in S_+^N\) and \(\theta_1, \theta_2 \geq 0\). We have to show that \(\theta_1 A + \theta_2 B \in S_+^N\).

\[A \in S_+^N \implies v^T A v \geq 0 \Forall v \in \RR^N.\]
\[B \in S_+^N \implies v^T B v \geq 0 \Forall v \in \RR^N.\]

Now

\[v^T (\theta_1 A + \theta_2 B) v = \theta_1 v^T A v + \theta_2 v^T B v \geq 0 \Forall v \in \RR^N.\]

Hence \(\theta_1 A + \theta_2 B \in S_+^N\).
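
The closure of \(S_+^N\) under conic combinations is easy to check numerically. A minimal MATLAB sketch (random matrices, illustrative only):

    % A conic combination of two PSD matrices is again PSD.
    n = 5;
    R1 = randn(n);  R2 = randn(n);
    A = R1' * R1;   B = R2' * R2;        % A, B are symmetric positive semidefinite
    theta1 = 2.0;   theta2 = 0.5;        % nonnegative weights
    M = theta1 * A + theta2 * B;
    min(eig((M + M') / 2)) >= -1e-10     % true: smallest eigenvalue is (numerically) nonnegative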

Operations that preserve convexity

In the following, we will discuss several operations which transform a convex set into another convex set, and thus preserve convexity.

Understanding these operations is useful for determining the convexity of a wide variety of sets.

Usually it is easier to prove that a set is convex by showing that it is obtained from convex sets via convexity preserving operations than by directly verifying the convexity property, i.e.

\[\theta x_1 + (1 - \theta) x_2 \in C \Forall x_1, x_2 \in C, \theta \in [0,1].\]

Intersection

Lemma
If \(S_1\) and \(S_2\) are convex sets then \(S_1 \cap S_2\) is convex.
Proof

Let \(x_1, x_2 \in S_1 \cap S_2\). We have to show that

\[\theta x_1 + (1 - \theta) x_2 \in S_1 \cap S_2, \Forall \theta \in [0,1].\]

Since \(S_1\) is convex and \(x_1, x_2 \in S_1\), hence

\[\theta x_1 + (1 - \theta) x_2 \in S_1, \Forall \theta \in [0,1].\]

Similarly

\[\theta x_1 + (1 - \theta) x_2 \in S_2, \Forall \theta \in [0,1].\]

Thus

\[\theta x_1 + (1 - \theta) x_2 \in S_1 \cap S_2, \Forall \theta \in [0,1].\]

which completes the proof.

We can generalize it further.

Lemma
Let \(\{ A_i\}_{i \in I}\) be a family of sets such that \(A_i\) is convex for all \(i \in I\). Then \(\cap_{i \in I} A_i\) is convex.
Proof

Let \(x_1, x_2\) be any two arbitrary elements in \(\cap_{i \in I} A_i\).

\[\begin{split}&x_1, x_2 \in \cap_{i \in I} A_i\\ \implies & x_1, x_2 \in A_i \Forall i \in I\\ \implies &\theta x_1 + (1 - \theta) x_2 \in A_i \Forall \theta \in [0,1] \Forall i \in I \text{ since $A_i$ is convex }\\ \implies &\theta x_1 + (1 - \theta) x_2 \in \cap_{i \in I} A_i\end{split}\]

Hence \(\cap_{i \in I} A_i\) is convex.

Affine functions

Definition

A function \(f : \RR^N \to \RR^M\) is affine if it is the sum of a linear function and a constant, i.e.

\[f(x) = A x + b\]

where \(A \in \RR^{M \times N}\) and \(b \in \RR^M\).

Lemma

Let \(S \subseteq \RR^N\) be convex and \(f : \RR^N \to \RR^M\) be an affine function. Then the image of \(S\) under \(f\) given by

\[f(S) = \{ f(x) | x \in S\}\]

is a convex set.

It applies in the reverse direction also.

Lemma

Let \(f : \RR^K \to \RR^N\) be affine and \(S \subseteq \RR^N\) be convex. Then the inverse image of \(S\) under \(f\) given by

\[f^{-1}(S) = \{ x \in \RR^K | f(x) \in S\}\]

is convex.

ExampleAffine functions preserving convexity

Let \(S \subseteq \RR^N\) be convex.

  1. For some \(\alpha \in \RR\) , \(\alpha S\) given by

    \[\alpha S = \{\alpha x | x \in S\}\]

    is convex. This is the scaling operation.

  2. For some \(a \in \RR^N\), \(S + a\) given by

    \[S + a = \{x + a | x \in S\}\]

    is convex. This is the translation operation.

  3. Let \(N = M + K\) where \(M, K \in \Nat\). Thus let \(\RR^N = \RR^M \times \RR^K\). A vector \(x \in S\) can be written as \(x = (x_1, x_2)\) where \(x_1 \in \RR^M\) and \(x_2 \in \RR^K\). Then

    \[T = \{ x_1 \in \RR^M | (x_1, x_2) \in S \text{ for some } x_2 \in \RR^K\}\]

    is convex. This is the projection operation.

Definition

Let \(S_1\) and \(S_2\) be two arbitrary subsets of \(\RR^N\). Then their sum is defined as

\[S_1 + S_2 = \{ x + y | x \in S_1 , y \in S_2\}.\]
Lemma
Let \(S_1\) and \(S_2\) be two convex subsets of \(\RR^N\). Then \(S_1 + S_2\) is convex.

Proper cones and generalized inequalities

Definition

A cone \(K \subseteq \RR^N\) is called a proper cone if it satisfies the following:

  • \(K\) is convex.
  • \(K\) is closed.
  • \(K\) is solid, i.e., it has a nonempty interior.
  • \(K\) is pointed i.e. it contains no line. In other words
\[x \in K, -x \in K \implies x = 0.\]

A proper cone \(K\) can be used to define a generalized inequality, which is a partial ordering on \(\RR^N\).

Definition

Let \(K \subseteq \RR^N\) be a proper cone. A partial ordering on \(\RR^N\) associated with the proper cone \(K\) is defined as

\[x \preceq_{K} y \iff y - x \in K.\]

We also write \(x \succeq_K y\) if \(y \preceq_K x\). This is also known as a generalized inequality.

A strict partial ordering on \(\RR^N\) associated with the proper cone \(K\) is defined as

\[x \prec_{K} y \iff y - x \in \Interior{K}.\]

where \(\Interior{K}\) is the interior of \(K\). We also write \(x \succ_K y\) if \(y \prec_K x\). This is also known as a strict generalized inequality.

When \(K = \RR_+\), \(\preceq_K\) is the same as the usual \(\leq\) and \(\prec_K\) is the same as the usual \(<\) on \(\RR\).

ExampleNonnegative orthant and component-wise inequality

The nonnegative orthant \(K=\RR_+^N\) is a proper cone. Then the associated generalized inequality \(\preceq_{K}\) means that

\[x \preceq_K y \implies (y-x) \in \RR_+^N \implies x_i \leq y_i \Forall i= 1,\dots,N.\]

This is usually known as component-wise inequality and usually denoted as \(x \preceq y\).

ExamplePositive semidefinite cone and matrix inequality

The positive semidefinite cone \(S_+^N \subseteq S^N\) is a proper cone in the vector space \(S^N\).

The associated generalized inequality means

\[X \preceq_{S_+^N} Y \implies Y - X \in S_+^N\]

i.e. \(Y - X\) is positive semidefinite. This is also usually denoted as \(X \preceq Y\).

Minimum and minimal elements

The generalized inequalities (\(\preceq_K, \prec_K\)) w.r.t. the proper cone \(K \subset \RR^N\) define a partial ordering over any arbitrary set \(S \subseteq \RR^N\).

But since they may not enforce a total ordering on \(S\), not every pair of elements \(x, y\in S\) may be related by \(\preceq_K\) or \(\prec_K\).

ExamplePartial ordering with nonnegative orthant cone

Let \(K = \RR^2_+ \subset \RR^2\). Let \(x_1 = (2,3), x_2 = (4, 5), x_3=(-3, 5)\). Then we have

  • \(x_1 \prec x_2\), \(x_2 \succ x_1\) and \(x_3 \preceq x_2\).
  • But neither \(x_1 \preceq x_3\) nor \(x_1 \succeq x_3\) holds.
  • In general, for any \(x , y \in \RR^2\), \(x \preceq y\) if and only if \(y\) lies to the right of and above \(x\) in the \(\RR^2\) plane.
  • If \(y\) is to the right but below or \(y\) is above but to the left of \(x\), then no ordering holds.
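
The orderings in this example can be verified numerically. A minimal MATLAB sketch (the function handles preceq and prec are illustrative helpers, not library functions):

    % Generalized inequality with respect to the nonnegative orthant K = R^2_+.
    preceq = @(x, y) all(y - x >= 0);    % x <=_K y  iff  y - x is in K
    prec   = @(x, y) all(y - x >  0);    % strict version (y - x in the interior of K)
    x1 = [2; 3];  x2 = [4; 5];  x3 = [-3; 5];
    [prec(x1, x2), preceq(x3, x2)]       % [1 1]
    [preceq(x1, x3), preceq(x3, x1)]     % [0 0]: x1 and x3 are not comparable
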
Definition
We say that \(x \in S \subseteq \RR^N\) is the minimum element of \(S\) w.r.t. the generalized inequality \(\preceq_K\) if for every \(y \in S\) we have \(x \preceq_K y\).
  • \(x\) must belong to \(S\).
  • A set may have no minimum element.
  • If a set \(S\) has a minimum element, then by definition it is unique (Prove it!).
Definition
We say that \(x \in S \subseteq \RR^N\) is the maximum element of \(S\) w.r.t. the generalized inequality \(\preceq_K\) if for every \(y \in S\) we have \(y \preceq_K x\).
  • \(x\) must belong to \(S\).
  • A set may have no maximum element.
  • If a set \(S\) has a maximum element, then by definition it is unique.
ExampleMinimum element
Consider \(K = \RR^N_+\) and \(S = \RR^N_+\). Then \(0 \in S\) is the minimum element since \(0 \preceq x \Forall x \in \RR^N_+\).
ExampleMaximum element
Consider \(K = \RR^N_+\) and \(S = \{x | x_i \leq 0 \Forall i=1,\dots,N\}\). Then \(0 \in S\) is the maximum element since \(x \preceq 0 \Forall x \in S\).

There are many sets for which no minimum element exists. In this context we can define a slightly weaker concept known as minimal element.

Definition
An element \(x\in S\) is called a minimal element of \(S\) w.r.t. the generalized inequality \(\preceq_K\) if there is no element \(y \in S\) distinct from \(x\) such that \(y \preceq_K x\). In other words \(y \preceq_K x \implies y = x\).
Definition
An element \(x\in S\) is called a maximal element of \(S\) w.r.t. the generalized inequality \(\preceq_K\) if there is no element \(y \in S\) distinct from \(x\) such that \(x \preceq_K y\). In other words \(x \preceq_K y \implies y = x\).
  • The minimal or maximal element \(x\) must belong to \(S\).
  • A set may have no minimal or maximal element.
  • Minimal and maximal elements need not be unique. A set may have many minimal or maximal elements.
Lemma

A point \(x \in S\) is the minimum element of \(S\) if and only if

\[S \subseteq x + K\]
Proof

Let \(x \in S\) be the minimum element. Then by definition \(x \preceq_K y \Forall y \in S\). Thus

\[\begin{split}& y - x \in K \Forall y \in S \\ \implies & \text{ for every } y \in S \text{ there exists some } k \in K \text{ such that } y = x + k\\ \implies & y \in x + K \Forall y \in S\\ \implies & S \subseteq x + K.\end{split}\]

Note that \(k \in K\) would be distinct for each \(y \in S\).

Now let us prove the converse.

Let \(S \subseteq x + K\) where \(x \in S\). Thus

\[\begin{split}& \exists k \in K \text{ such that } y = x + k \Forall y \in S\\ \implies & y - x = k \in K \Forall y \in S\\ \implies & x \preceq_K y \Forall y \in S.\end{split}\]

Thus \(x \preceq_K y\) for every \(y \in S\), i.e. \(x\) is the minimum element of \(S\).

\(x + K\) denotes all the points that are comparable to \(x\) and greater than or equal to \(x\) according to \(\preceq_K\).

Lemma

A point \(x \in S\) is a minimal point if and only if

\[(x - K) \cap S = \{ x \}.\]
Proof

Let \(x \in S\) be a minimal element of \(S\). Thus there is no element \(y \in S\) distinct from \(x\) such that \(y \preceq_K x\).

Consider the set \(R = x - K = \{x - k | k \in K \}\).

\[r \in R \iff r = x - k \text { for some } k \in K \iff x - r \in K \iff r \preceq_K x.\]

Thus \(x - K\) consists of all points \(r \in \RR^N\) which satisfy \(r \preceq_K x\). But there is only one such point in \(S\) namely \(x\) which satisfies this. Hence

\[(x - K) \cap S = \{ x \}.\]

Now let us assume that \((x - K) \cap S = \{ x \}\). Thus the only point \(y \in S\) which satisfies \(y \preceq_K x\) is \(x\) itself. Hence \(x\) is a minimal element of \(S\).

\(x - K\) represents the set of points that are comparable to \(x\) and are less than or equal to \(x\) according to \(\preceq_K\).

Probability and Random Variables

Random Variables

The step function and sign function relation:

\[u(t) = \frac{1}{2} [1 + \sgn (t)].\]

Discrete step function and Kronecker delta function:

\[u(n) = \sum_{k = -\infty}^n \delta(k).\]

For different random variables, we will characterize their distributions by several parameters. These are listed below

  • Probability density function (PDF)
  • Cumulative distribution function (CDF)
  • Probability mass function (PMF)
  • Mean (\(\mu\) or \(\EE(X)\))
  • Variance (\(\sigma^2\) or \(\Var(X)\))
  • Skew
  • Kurtosis
  • Characteristic function (CF)
  • Moment generating function (MGF)
  • Second characteristic function
  • Cumulant generating function (CGF)

Cumulative distribution function

The CDF is defined as

\[F_X (x) = \PP ( X \leq x).\]

Properties of CDF:

\[F_X(x) \geq 0, \quad F_X(-\infty) = 0, \quad F_X(\infty) = 1.\]

CDF is a monotonically non-decreasing function.

\[x_1 < x_2 \implies F_X(x_1) \leq F_X(x_2).\]

\(F_X(-\infty)\) is defined as

\[F_X(-\infty) = \lim_{x \to - \infty} F_X(x).\]

Similarly:

\[F_X(\infty) = \lim_{x \to \infty} F_X(x).\]

\(F_X(x)\) is right continuous.

\[\lim_{x \to t^+} F_X(x) = F_X(t).\]

Probability density function

Properties of PDF

\[f_X(x) \geq 0.\]
\[\int_{-\infty}^{\infty} f_X(x) d x = 1.\]

The CDF and PDF are related as

\[F_X(x) = \int_{-\infty}^x f_X(t ) d t.\]

Expectation

Expectation of a discrete random variable:

\[\EE (X) = \sum_{x} x p_X(x).\]

Expectation of a continuous random variable:

\[\EE (X) = \int_{- \infty}^{\infty} t f_X(t) d t.\]

Expectation of a function of a random variable:

\[\EE [g(X)] = \int_{- \infty}^{\infty} g(t) f_X(t) d t.\]

Mean square value:

\[\EE [X^2] = \int_{- \infty}^{\infty} t^2 f_X(t) d t.\]

Variance:

\[\Var(X) = \EE [X^2] - \EE [X]^2.\]

\(n\)-th moment:

\[\EE [X^n] = \int_{- \infty}^{\infty} t^n f_X(t) d t.\]

Characteristic function

The characteristic function is defined as

\[\Psi_X(j \omega) \triangleq \EE \left [ \exp (j \omega X) \right ].\]

The CF is the Fourier transform (up to sign convention) of the PDF, and the PDF can be recovered from the CF by the inverse transform:

\[\Psi_X(j\omega) = \int_{-\infty}^{\infty} e^{j \omega x} f_X(x) d x.\]
\[f_X(x) = \frac{1}{2 \pi} \int_{-\infty}^{\infty} e^{-j \omega x} \Psi_X(j\omega) d \omega\]
\[\Psi_X(j 0) = \EE (1) = 1.\]
\[\left. \frac{d}{ d \omega} \Psi_X(j\omega) \right |_{\omega = 0} = j \EE [X].\]
\[\left. \frac{d^2}{ d \omega^2} \Psi_X(j\omega) \right |_{\omega = 0} = j^2 \EE [X^2] = - \EE [X^2].\]
\[\EE [X^k] = \frac{1}{j^k} \left. \frac{d^k}{ d \omega^k} \Psi_X(j\omega) \right |_{\omega = 0}.\]

Let \(Y_1, \dots, Y_k\) be independent. Then

\[\Psi_{Y_1 + \dots + Y_k} (j \omega) = \prod_{i = 1}^{k} \EE [ \exp (j \omega Y_i)] = \prod_{i = 1}^{k} \Psi_{Y_i} (j \omega).\]

Moment generating function

The moment generating function is defined as

\[M_X(t) \triangleq \EE \left [ \exp (t X) \right ].\]

Second characteristic function

Cumulant generating function

Gaussian distribution

Standard normal distribution

This distribution has a mean of 0 and a variance of 1. It is denoted by

\[X \sim \NNN(0, 1).\]

The PDF is given by

\[f_X(x) = \frac{1}{\sqrt{2\pi}} \exp \left ( - \frac{x^2}{2} \right ).\]

The CDF is given by

\[F_X(x) = \int_{-\infty}^x f_X(t) d t = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} \exp \left ( - \frac{t^2}{2} \right ) d t.\]

Symmetry

\[f(-x) = f(x). \quad F(-x) + F(x) = 1.\]

Some specific values

\[F_X(-\infty) = 0, \quad F_X(0) = \frac{1}{2}, \quad F_X(\infty) = 1.\]

The Q-function is given as

\[Q(x) = \int_{x}^{\infty} f_X(t) d t = \frac{1}{\sqrt{2\pi}} \int_{x}^{\infty} \exp \left ( - \frac{t^2}{2} \right ) d t.\]

We have

\[F_X(x) + Q(x) = 1.\]

Alternatively

\[F_X(x) = 1 - Q(x).\]

Further

\[Q(x) + Q(-x) = 1.\]

This is due to the symmetry of normal distribution. Alternatively

\[Q(x) = 1 - Q(-x).\]

Probability of \(X\) falling in a range \([a,b]\)

\[\PP (a \leq X \leq b) = Q(a) - Q(b) = F(b) - F(a).\]

The characteristic function is

\[\Psi_X(j\omega) = \exp\left ( - \frac{\omega^2}{2}\right ).\]

Mean:

\[\mu = \EE (X) = 0.\]

Mean square value

\[\EE (X^2) = 1.\]

Variance:

\[\sigma^2 = \EE (X^2) - \EE(X)^2 = 1.\]

Standard deviation

\[\sigma = 1.\]

An upper bound on the Q-function (valid for \(x \geq 0\)):

\[Q(x) \leq \frac{1}{2} \exp \left ( - \frac{x^2}{2} \right ).\]

The moment generating function is

\[M_X(t) = \exp\left ( \frac{t^2}{2}\right ).\]

Error function and its properties

The error function is defined as

\[\erf(x) \triangleq \frac{2}{\sqrt{\pi}} \int_0^x \exp\left ( - t^2 \right) d t.\]

The complementary error function is defined as

\[\erfc(x) = 1 - \erf(x) = \frac{2}{\sqrt{\pi}} \int_x^{\infty} \exp\left ( - t^2 \right) d t.\]

Error function is an odd function.

\[\erf(-x) = - \erf(x).\]

Some specific values of error function.

\[\erf(0) = 0, \quad \erf(-\infty) = -1 , \quad \erf (\infty) = 1.\]

The relationship with normal CDF.

\[F_X(x) = \frac{1}{2} + \frac{1}{2} \erf \left ( \frac{x}{\sqrt{2}}\right) = \frac{1}{2} \erfc \left (- \frac{x}{\sqrt{2}}\right).\]

Relationship with Q function.

\[Q(x) = \frac{1}{2} \erfc\left (\frac{x}{\sqrt{2}} \right) = \frac{1}{2} - \frac{1}{2} \erf \left ( \frac{x}{\sqrt{2}} \right ).\]
\[\erfc(x) = 2 Q(\sqrt{2} x).\]

We also have some useful results:

\[\int_0^{\infty} \exp\left ( - \frac{t^2}{2}\right ) d t = \sqrt{\frac{\pi}{2}}.\]
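
These relationships are easy to confirm numerically using only built-in MATLAB functions; a minimal sketch:

    % Check Q(x) = 0.5*erfc(x/sqrt(2)) and F(x) + Q(x) = 1 for the standard normal.
    phi = @(t) exp(-t.^2 / 2) / sqrt(2 * pi);    % standard normal PDF
    x = 1.3;
    Q = integral(phi, x, Inf);                   % Q(x) by numerical integration
    F = integral(phi, -Inf, x);                  % F_X(x)
    [Q - 0.5 * erfc(x / sqrt(2)),  F + Q - 1]    % both entries are ~0
    Q <= 0.5 * exp(-x^2 / 2)                     % the upper bound on Q(x) holds for x >= 0
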
General normal distribution

The general Gaussian (or normal) random variable is denoted as

\[X \sim \NNN (\mu, \sigma^2).\]

Its PDF is

\[f_X(x) = \frac{1}{\sqrt{2 \pi} \sigma} \exp \left ( - \frac{(x -\mu)^2}{2 \sigma^2} \right).\]

A simple transformation

\[Y = \frac{X - \mu}{\sigma}\]

converts it into standard normal random variable.

The mean:

\[\EE (X) = \mu.\]

The mean square value:

\[\EE (X^2) = \sigma^2 + \mu^2.\]

The variance:

\[\EE (X^2) - \EE (X)^2 = \sigma^2.\]

The CDF:

\[F_X(x) = \frac{1}{2} + \frac{1}{2} \erf \left ( \frac{x - \mu}{\sigma\sqrt{2}}\right).\]

Notice the transformation from \(x\) to \((x - \mu) / \sigma\).

The characteristic function:

\[\Psi_X(j\omega) = \exp\left (j \omega \mu - \frac{\omega^2 \sigma^2}{2}\right ).\]

Naturally putting \(\mu = 0\) and \(\sigma = 1\), it reduces to the CF of the standard normal r.v.

The MGF:

\[M_X(t) = \exp\left (\mu t + \frac{\sigma^2 t^2}{2}\right ).\]

The skewness is zero and the excess kurtosis is zero (the kurtosis itself is 3).

One sided Gaussian distribution

Truncated normal distribution

Basic inequalities

Probability theory relies heavily on inequalities: many results are derived by applying a handful of basic ones. This section collects some basic inequalities.

A good reference is Wikipedia list of inequalities. In particular see the section on probability inequalities.

In this section we will cover the basic inequalities.

Markov’s inequality

http://en.wikipedia.org/wiki/Markov

Theorem

Let \(X\) be a non-negative random variable and \(a > 0\). Then

\[\PP (X \geq a) \leq \frac{\EE (X)}{a}.\]

Chebyshev’s inequality

http://en.wikipedia.org/wiki/Chebyshev

Theorem

Let \(X\) be a random variable with finite mean \(\mu\) and finite non-zero variance \(\sigma^2\). Then for any real number \(k > 0\), the following holds

\[\PP (| X - \mu | \geq k \sigma) \leq \frac{1}{k^2}.\]
Proof
TBD.

Choosing \(k = \sqrt{2}\), we see that at least half of the values lie in the interval \((\mu - \sqrt{2} \sigma, \mu + \sqrt{2} \sigma)\).
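
A quick Monte Carlo check of the bound; a minimal MATLAB sketch (the exponential distribution is an arbitrary illustrative choice):

    % Empirical check of Chebyshev's inequality for an exponential(1) sample.
    N = 1e6;  k = 2;
    X = -log(rand(N, 1));                    % exponential(1) samples: mu = 1, sigma = 1
    mu = 1;  sigma = 1;
    p_emp = mean(abs(X - mu) >= k * sigma);  % empirical tail probability
    [p_emp, 1 / k^2]                         % p_emp stays well below the bound 1/k^2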

Boole’s inequality

http://en.wikipedia.org/wiki/Boole This is also known as union bound.

Theorem

For a countable set of events \(A_1, A_2, \dots\), we have

\[\PP \left ( \bigcup_{i} A_i \right) \leq \sum_{i} \PP \left ( A_i \right).\]
Proof

We first prove it for a finite collection of events using induction. For \(n=1\), obviously

\[\PP (A_1) \leq \PP (A_1).\]

Assume the inequality is true for the set of \(n\) events. i.e.

\[\PP \left ( \bigcup_{i=1}^n A_i \right) \leq \sum_{i=1}^n \PP \left ( A_i \right).\]

Since

\[\PP (A \cup B ) = \PP (A) + \PP(B) - \PP (A \cap B),\]

hence

\[\PP \left ( \bigcup_{i=1}^{n + 1} A_i \right) = \PP \left ( \bigcup_{i=1}^n A_i \right) + \PP (A_{n + 1}) - \PP \left ( \left ( \bigcup_{i=1}^n A_i \right ) \cap A_{n +1} \right ).\]

Since

\[\PP \left ( \left ( \bigcup_{i=1}^n A_i \right ) \cap A_{n +1} \right ) \geq 0,\]

hence

\[\PP \left ( \bigcup_{i=1}^{n + 1} A_i \right) \leq \PP \left ( \bigcup_{i=1}^n A_i \right) + \PP (A_{n + 1}) \leq \sum_{i=1}^{n + 1} \PP \left ( A_i \right).\]

Fano’s inequality

Cramér–Rao inequality

Hoeffding’s inequality

http://en.wikipedia.org/wiki/Hoeffding

This inequality provides an upper bound on the probability that the sum of random variables deviates from its expected value.

We start with a version of the inequality for i.i.d Bernoulli random variables.

Theorem

Let \(X_1, \dots, X_n\) be i.i.d. Bernoulli random variables with probability of success \(p\). Then \(\EE \left [\sum_i X_i \right] = p n\). The probability of the sum deviating from this mean by \(\epsilon n\) for some \(\epsilon > 0\) is bounded by

\[\PP \left (\sum_i X_i \leq (p - \epsilon) n \right ) \leq \exp ( -2 \epsilon^2 n)\]

and

\[\PP \left (\sum_i X_i \geq (p + \epsilon) n \right ) \leq \exp ( -2 \epsilon^2 n).\]

The two inequalities can be summarized as

\[\PP \left [ (p - \epsilon) n \leq \sum_i X_i \leq (p + \epsilon) n \right ] \geq 1 - 2\exp ( -2 \epsilon^2 n).\]

The inequality states that the number of successes that we see is concentrated around its mean with exponentially small tail.
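
A small simulation illustrates how conservative the bound is; a minimal MATLAB sketch (the parameters are arbitrary illustrative choices):

    % Empirical check of the Hoeffding bound for i.i.d. Bernoulli(p) variables.
    p = 0.3;  n = 200;  epsilon = 0.05;  trials = 2e4;
    S = sum(rand(n, trials) < p, 1);              % each column sum is a Binomial(n, p) draw
    p_emp = mean(S >= (p + epsilon) * n);         % empirical upper tail probability
    bound = exp(-2 * epsilon^2 * n);
    [p_emp, bound]                                % p_emp <= bound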

We now state the inequality for the general case for any (almost surely) bounded random variable.

Theorem

Let \(X_1, \dots, X_n\) be independent r.v.s. Assume that \(X_i\) are almost surely bounded; i.e.:

\[\PP \left ( X_i \in [ a_i, b_i] \right ) = 1, \quad 1 \leq i \leq n.\]

Define the empirical mean of the variables as

\[\overline{X} \triangleq \frac{1}{n} \left ( X_1 + \dots + X_n \right).\]

Then the probability that \(\overline{X}\) deviates from its mean \(\EE(\overline{X})\) by an amount \(t > 0\) is bounded by following inequalities:

\[\PP \left ( \overline{X} - \EE(\overline{X}) \geq t \right ) \leq \exp \left ( - \frac{2 n^2 t^2}{\sum_{i = 1}^n (b_i - a_i)^2} \right)\]

and

\[\PP \left ( \overline{X} - \EE(\overline{X}) \leq -t \right ) \leq \exp \left ( - \frac{2 n^2 t^2}{\sum_{i = 1}^n (b_i - a_i)^2} \right).\]

Together, we have

\[\PP \left ( \left | \overline{X} - \EE(\overline{X}) \right | \geq t \right ) \leq 2\exp \left ( - \frac{2 n^2 t^2}{\sum_{i = 1}^n (b_i - a_i)^2} \right).\]

Note that we don’t require \(X_i\) to be identically distributed in this formulation. For the special case when \(X_i\) are i.i.d. uniform r.v.s over \([0, 1]\), then \(\EE(\overline{X}) = \EE(X_i) = \frac{1}{2}\) and

\[\PP \left ( \left | \overline{X} - \frac{1}{2}\right | \geq t \right ) \leq 2\exp \left ( - 2 n t^2 \right).\]

Clearly, \(\overline{X}\) starts concentrating around its mean as \(n\) increases and the tail falls exponentially.

The proof of this result depends on what is known as Hoeffding’s Lemma.

Lemma

Let \(X\) be a zero mean r.v. with \(\PP (X \in [a, b]) = 1\). Then

\[\EE \left [ \exp (t X) \right] \leq \exp \left ( \frac{1}{8} t^2 (b - a)^2 \right ).\]

Jensen’s inequality

http://en.wikipedia.org/wiki/Jensen Jensen’s inequality relates the value of a convex function of an integral to the integral of the convex function. In the context of probability theory, the inequality takes the following form.

Theorem

Let \(f : \RR \to \RR\) be a convex function. Then

\[f \left ( \EE [X] \right ) \leq \EE \left [ f ( X ) \right ].\]

The equality holds if and only if either \(X\) is a constant r.v. or \(f\) is linear.

Bernstein inequalities

Chernoff’s inequality

http://en.wikipedia.org/wiki/Chernoff This is also known as Chernoff bound.

Fréchet inequalities

Two variables

Let \(X\) and \(Y\) be two random variables and let \(F_{X, Y}(x, y)\) be their joint CDF.

\[\begin{split}\lim_{\substack{x \to -\infty\\ y \to -\infty}} F_{X, Y} (x, y) = 0.\end{split}\]
\[\begin{split}\lim_{\substack{x \to \infty\\ y \to \infty}} F_{X, Y} (x, y) = 1.\end{split}\]

Right continuity:

\[\lim_{x \to x_0^+} F_{X, Y} (x, y) = F_{X, Y} (x_0, y).\]
\[\lim_{y \to y_0^+} F_{X, Y} (x, y) = F_{X, Y} (x, y_0).\]

The joint probability density function is given by \(f_{X, Y} (x, y)\). It satisfies \(f_{X, Y} (x, y) \geq 0\) and

\[\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X, Y} (x, y) d y d x = 1.\]

The joint CDF and joint PDF are related by

\[F_{X, Y} (x, y) = \PP (X \leq x, Y \leq y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f_{X, Y} (u , v) d v d u.\]

Further

\[\PP (a \leq X \leq b, c \leq Y \leq d) = \int_{a}^{b} \int_{c}^{d} f_{X, Y} (u , v) d v d u.\]

The marginal probability is

\[\PP (a \leq X \leq b) = \PP (a \leq X \leq b, -\infty \leq Y \leq \infty) = \int_{a}^{b} \int_{-\infty}^{\infty} f_{X, Y} (u , v) d v d u.\]

We define the marginal density functions as

\[f_X(x) = \int_{-\infty}^{\infty} f_{X, Y} (x, y) d y\]

and

\[f_Y(y) = \int_{-\infty}^{\infty} f_{X, Y} (x, y) d x.\]

We can now write

\[\PP (a \leq X \leq b) = \int_{a}^{b} f_X(x) d x.\]

Similarly

\[\PP (c \leq Y \leq d) = \int_{c}^{d} f_Y(y) d y.\]

Conditional density

We define

\[\PP (a \leq X \leq b | Y = c) = \int_{a}^{b} f_{X | Y}(x | y = c) d x.\]

We have

\[f_{X | Y}(x | y = c) = \frac{f_{X, Y} (x, c)}{f_{Y} (c)}.\]

In other words

\[f_{X | Y}(x | y = c) f_{Y} (c) = f_{X, Y} (x, c).\]

In general we write

\[f_{X | Y}(x | y) f_Y(y) = f_{X, Y} (x, y).\]

Or even more loosely as

\[f(x | y) f(y) = f(x, y).\]

More identities

\[f(x | Y \leq d) = \frac{ \int_{-\infty}^d f(x, y) d y} {\PP (Y \leq d)}.\]

Independent variables

If \(X\) and \(Y\) are independent then

\[f_{X, Y}(x, y) = f_X(x) f_Y(y).\]
\[f(x | y) = \frac{f(x, y)}{f(y)} = \frac{f(x) f(y)}{f(y)} = f(x).\]

Similarly

\[f(y | x) = f(y).\]

The CDF also is separable

\[F_{X, Y}(x, y) = F_X(x) F_Y(y).\]

Expectation

This section contains several results on expectation operator.

Any function \(g(x)\) defines a new random variable \(g(X)\). If \(g(X)\) has a finite expectation, then

\[\EE [g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x) d x.\]

If several random variables \(X_1, \dots, X_n\) are defined on the same sample space, then their sum \(X_1 + \dots + X_n\) is a new random variable. If all of them have finite expectations, then the expectation of their sum exists and is given by

\[\EE [X_1 + \dots + X_n] = \EE [X_1] + \dots + \EE [X_n].\]

If \(X\) and \(Y\) are mutually independent random variables with finite expectations, then their product is a random variable with finite expectation and

\[\EE (X Y) = \EE (X) \EE (Y).\]

By induction, if \(X_1, \dots, X_n\) are mutually independent random variables with finite expectations, then

\[\EE \left [ \prod_{i=1}^n X_i \right ] = \prod_{i=1}^n \EE \left [ X_i \right ].\]

Let \(X\) and \(Y\) be two random variables with the joint density function \(f_{X, Y} (x, y)\). Let the conditional density function of \(Y\) given \(X\) be \(f(y | x)\). Then the conditional expectation is defined as follows:

\[\EE [Y | X] = \int_{-\infty}^{\infty} y f(y | x) d y.\]

\(\EE [Y | X ]\) is a new random variable.

\[\begin{split}\begin{aligned} \EE \left [ \EE [Y | X ] \right ] &= \int_{-\infty}^{\infty} \EE [Y | X] f (x) d x\\ &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} y f(y | x) f (x) d y d x\\ &= \int_{-\infty}^{\infty}y \left ( \int_{-\infty}^{\infty} f(x, y) d x \right ) d y \\ &= \int_{-\infty}^{\infty} y f(y) d y = \EE [Y]. \end{aligned}\end{split}\]

In short, we have

\[\EE \left [ \EE [Y | X ] \right ] = \EE [Y].\]

The covariance of \(X\) and \(Y\) is defined as

\[\Cov (X, Y) = \EE \left [ (X - \EE[X]) ( Y - \EE[Y]) \right ].\]

It is easy to see that

\[\Cov (X, Y) = \EE [X Y] - \EE [X] \EE [ Y].\]

The correlation coefficient is defined as

\[\rho \triangleq \frac{\Cov (X, Y)}{\sqrt{\Var (X) \Var (Y)}}.\]
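
Both quantities are straightforward to estimate from samples; a minimal MATLAB sketch (the linear model generating \(Y\) is an arbitrary illustration):

    % Sample check of Cov(X, Y) = E[XY] - E[X]E[Y] and the correlation coefficient.
    N = 1e5;
    X = randn(N, 1);
    Y = 0.8 * X + 0.6 * randn(N, 1);              % Y is correlated with X, rho = 0.8
    c1 = mean(X .* Y) - mean(X) * mean(Y);        % E[XY] - E[X]E[Y]
    C = cov(X, Y, 1);                             % sample covariance matrix (normalized by N)
    rho = c1 / sqrt(var(X, 1) * var(Y, 1));       % sample correlation coefficient
    [c1 - C(1, 2), rho]                           % first entry ~0; rho is close to 0.8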

Independent variables

If \(X\) and \(Y\) are independent, then

\[\EE [ g_1(x) g_2 (y)] = \EE [g_1(x)] \EE [g_2 (y)].\]

If \(X\) and \(Y\) are independent, then \(\Cov (X, Y) = 0\).

Uncorrelated variables

The two variables \(X\) and \(Y\) are called uncorrelated if \(\Cov (X, Y) = 0\). Note that zero covariance does not imply independence.

Complex random variable

For a complex random variable \(Z = X + j Y\), its PDF is the joint PDF of the r.v.s \(X\) and \(Y\).

\[f_Z(z) = f_{X, Y} (x, y).\]

The integral over the complex space is defined as

\[\int_{z \in \CC} f_Z(z) d z = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X, Y} (x, y) d x d y = 1.\]

Random vectors

We will continue to use the notation of capital letters to denote a random vector. We will specify the space over which the random vector is generated to clarify the dimensionality.

A real random vector \(X\) takes values in the vector space \(\RR^n\). A complex random vector \(Z\) takes values in the vector space \(\CC^n\). We write

\[\begin{split}X = \begin{bmatrix} X_1 \\ \vdots \\ X_n \end{bmatrix}.\end{split}\]

The expected value or mean of a random vector is \(\EE(X)\).

\[\begin{split}\EE(X) = \begin{bmatrix} \EE(X_1) \\ \vdots \\ \EE(X_n) \end{bmatrix}.\end{split}\]

Covariance matrix of a random vector:

\[\Cov (X) = \EE [(X - \EE(X)) (X - \EE(X))^T] = \EE [X X^T] - \EE[X] \EE[X]^T.\]

We will use the symbols \(\mu\) and \(\Sigma\) for the mean vector and covariance matrix of a random vector \(X\). Clearly

\[\EE [X X^T] = \Sigma + \mu \mu^T.\]

Cross-covariance matrix of two random vectors:

\[\Cov (X, Y) = \EE [(X - \EE(X)) (Y - \EE(Y))^T] = \EE [X Y^T] - \EE[X] \EE[Y]^T.\]

Note that

\[\Cov (X, Y) =\Cov (Y, X)^T.\]

The characteristic function is defined as

\[\Psi_X(j\omega) = \EE \left ( \exp (j \omega^T X) \right ), \quad \omega \in \RR^n.\]

The MGF is defined as

\[M_X(t) = \EE \left ( \exp (t^T X) \right ), \quad t \in \CC^n.\]
Theorem

The components \(X_1, \dots, X_n\) of a random vector \(X\) are independent if and only if

\[\Psi_X(j\omega) = \prod_{i=1}^n \Psi_{X_i}(j\omega_i), \quad \forall \omega \in \RR^n.\]

Gaussian random vector

Definition

A random vector \(X = [X_1, \dots, X_n]^T\) is called Gaussian random vector if

\[\langle t , X \rangle = X^T t = \sum_{i = 1}^n t_i X_i = t_1 X_1 + \dots + t_n X_n\]

follows a normal distribution for all \(t = [t_1, \dots, t_n ]^T \in \RR^n\). The components \(X_1, \dots, X_n\) are called jointly Gaussian. It is denoted by \(X \sim \NNN_n (\mu, \Sigma)\) where \(\mu\) is its mean vector and \(\Sigma\) is its covariance matrix.

Let \(X \sim \NNN_n (\mu, \Sigma)\) be a Gaussian random vector. The subscript \(n\) denotes that it takes values over the space \(\RR^n\). We assume that \(\Sigma\) is invertible. Its PDF is given by

\[f_X (x) = \frac{1}{(2\pi)^{n / 2} \det (\Sigma)^{1/2} } \exp \left \{- \frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}.\]

Moments:

\[\EE [X] = \mu \in \RR^n.\]
\[\EE[XX^T] = \Sigma + \mu \mu^T.\]
\[\Cov[X] = \EE[XX^T] - \EE[X]\EE[X]^T = \Sigma.\]

Let \(Y = A X + b\) where \(A \in \RR^{n \times n}\) is an invertible matrix and \(b \in \RR^n\). Then

\[Y \sim \NNN_n (A \mu + b , A \Sigma A^T).\]

\(Y\) is also a Gaussian random vector with mean vector \(A \mu + b\) and covariance matrix \(A \Sigma A^T\). This is essentially an affine change of coordinates in \(\RR^n\).
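
This property can be confirmed by simulation. A minimal MATLAB sketch (mvnrnd comes from the Statistics toolbox listed among the requirements; all parameters are arbitrary illustrations):

    % Y = A X + b with X ~ N(mu, Sigma) has mean A mu + b and covariance A Sigma A'.
    n = 3;  N = 1e5;
    mu = [1; -1; 2];
    R = randn(n);  Sigma = R * R' + eye(n);       % a random covariance matrix
    A = randn(n);  b = randn(n, 1);
    X = mvnrnd(mu', Sigma, N)';                   % n x N samples of X
    Y = A * X + b * ones(1, N);
    max(abs(mean(Y, 2) - (A * mu + b)))           % small: empirical mean matches A mu + b
    max(max(abs(cov(Y') - A * Sigma * A')))       % small: empirical covariance matches A Sigma A'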

The CF is given by

\[\Psi_X(j \omega) = \exp \left ( j \omega^T \mu - \frac{1}{2} \omega^T \Sigma \omega \right ), \quad \omega \in \RR^n.\]

Whitening

Usually we are interested in transforming \(X\) so that its components become uncorrelated with unit variance. This process is known as whitening. We are looking for a linear transformation \(Y = A X + b\) such that the components of \(Y\) are uncorrelated with unit variance, i.e. we start with

\[X \sim \NNN_n (\mu, \Sigma)\]

and transform \(Y = A X + b\) such that

\[Y \sim \NNN_n (0, I_n)\]

where \(I_n\) is the \(n\)-dimensional identity matrix.

Whitening by eigenvalue decomposition

Let

\[\Sigma = E \Lambda E^T\]

be the eigenvalue decomposition of \(\Sigma\), with \(\Lambda\) a diagonal matrix of eigenvalues and \(E\) an orthogonal matrix whose columns are the corresponding orthonormal eigenvectors.

Let

\[\Lambda^{\frac{1}{2}} = \Diag (\lambda_1^{\frac{1}{2}}, \dots, \lambda_n^{\frac{1}{2}}).\]

Choose \(B = E \Lambda^{\frac{1}{2}}\) and \(A = B^{-1} = \Lambda^{-\frac{1}{2}} E^T\). Then

\[\Cov (B^{-1} X) = \Cov (A X) = \Lambda^{-\frac{1}{2}} E^T \Sigma E \Lambda^{-\frac{1}{2}} = I.\]
\[\EE [B^{-1} X] = B^{-1} \mu \iff \EE [B^{-1} (X - \mu)] = 0.\]

Thus the random vector \(Y = B^{-1} (X - \mu)\) is a whitened vector of uncorrelated components.
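
A minimal MATLAB sketch of this whitening step (illustrative only; mvnrnd is from the Statistics toolbox):

    % Whitening via the eigenvalue decomposition Sigma = E Lambda E'.
    n = 4;  N = 1e5;
    mu = randn(n, 1);
    R = randn(n);  Sigma = R * R' + eye(n);       % a positive definite covariance matrix
    X = mvnrnd(mu', Sigma, N)';                   % n x N samples, X ~ N(mu, Sigma)
    [E, Lambda] = eig(Sigma);
    A = diag(1 ./ sqrt(diag(Lambda))) * E';       % A = Lambda^{-1/2} E'
    Y = A * (X - mu * ones(1, N));                % whitened samples
    max(max(abs(cov(Y') - eye(n))))               % small: cov(Y) is close to the identity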

Causal whitening

We want the transformation to be causal, i.e. \(A\) should be a lower triangular matrix. We start with the \(L D L^T\) decomposition of \(\Sigma\):

\[\Sigma = L D L^T = (L D^{\frac{1}{2}} ) (D^{\frac{1}{2}} L^T).\]

Choose \(B = L D^{\frac{1}{2}}\) and \(A = B^{-1} = D^{-\frac{1}{2}} L^{-1}\). Clearly, \(A\) is lower triangular.

The transformation is \(Y = B^{-1} (X - \mu)\).
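
A minimal MATLAB sketch of the causal case (illustrative only). For a positive definite \(\Sigma\), the lower triangular Cholesky factor returned by chol equals \(L D^{\frac{1}{2}}\), i.e. the matrix \(B\) above:

    % Causal whitening: B = L D^(1/2) is the lower triangular Cholesky factor of Sigma.
    n = 4;  N = 1e5;
    mu = randn(n, 1);
    R = randn(n);  Sigma = R * R' + eye(n);       % a positive definite covariance matrix
    X = mvnrnd(mu', Sigma, N)';                   % n x N samples, X ~ N(mu, Sigma)
    B = chol(Sigma, 'lower');                     % Sigma = B * B' with B lower triangular
    A = B \ eye(n);                               % A = B^{-1} is also lower triangular
    Y = A * (X - mu * ones(1, N));                % causally whitened samples
    norm(triu(A, 1))                              % ~0: A is lower triangular
    max(max(abs(cov(Y') - eye(n))))               % small: cov(Y) is close to the identity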

Geometry

Algebraic Geometry Review

This section covers essential notions and facts from algebraic geometry needed for this paper. For a systematic introduction to the subject, see [Har77][Har13][GH14]. Algebraic geometry is the study of geometries that come from algebra. The geometrical objects being studied are the solution sets of systems of multivariate polynomial equations. A data set being studied can be thought of as a collection of sample points from a geometrical object (e.g. a union of subspaces). The objective is to infer the said geometrical object from the given data set and decompose the object into simpler objects which help in better understanding of the data set.

Polynomial Rings

Let \(\FF^m\) be the \(m\)-dimensional vector space where \(\FF\) is either \(\RR\) or \(\CC\) (a field of characteristic 0). For \(x = [x_1, \dots, x_m]^T \in \FF^m\), let \(\FF[x] = \FF[x_1, \dots, x_m]\) be the set of all polynomials in the \(m\) variables \(x_1, \dots,x_m\). \(\FF[x]\) is a commutative ring [Art91]. A monomial is a product of the variables; its degree is the number of variables in the product, counted with repetition (i.e. the sum of the exponents). A monomial of degree \(n\) is of the form \(x^n = x_1^{n_1}\dots x_m^{n_m}\) with \(0 \leq n_j \leq n\) and \(n_1 + \dots + n_m = n\). There are a total of \(A_n(m) = \binom{m + n -1}{n} = \binom{m + n -1}{m - 1}\) different degree-n monomials.

We now construct an embedding of vectors in \(\FF^m\) to \(\FF^{A_n(m)}\). The Veronese map of degree \(n\), denoted as \(v_n : \FF^m \to \FF^{A_n(m)}\), is defined as

\[v_n : [x_1, \dots, x_m]^T \to [\dots, x^n, \dots]^T\]

where \(x^n\) are degree-n monomials chosen in the degree lexicographic order. For example, the Veronese map of degree 2 from \(\RR^3\) to \(\RR^6\) is defined as

\[v_2(x) = v_2([x_1, x_2, x_3]^T) = [x_1^2, x_1x_2, x_1x_3,x_2^2, x_2x_3, x_3^2 ]^T.\]
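
A small MATLAB sketch of this particular map and of the monomial count \(A_n(m)\) (the handle v2 below is an illustrative helper, not a library function):

    % Degree-2 Veronese map from R^3 to R^6 and the monomial count A_n(m).
    v2 = @(x) [x(1)^2; x(1)*x(2); x(1)*x(3); x(2)^2; x(2)*x(3); x(3)^2];
    x = [1; 2; 3];
    v2(x)'                          % [1 2 3 4 6 9]
    m = 3;  n = 2;
    nchoosek(m + n - 1, n)          % A_2(3) = 6 monomials of degree 2 in 3 variables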

A term is a scalar multiplying a monomial. A polynomial \(p(x)\) is said to be homogeneous if all its terms have the same degree. Homogeneous polynomials are also known as forms. A linear form is a homogeneous polynomial of degree 1. A quadratic form is a homogeneous polynomial of degree 2. A degree-n form \(p(x)\) can be written as

\[p(x) = c_n^T v_n(x) = \sum c_{n_1, \dots, n_m}x_1^{n_1}\dots x_m^{n_m},\]

where \(c_{n_1, \dots, n_m} \in \FF\) are the coefficients associated with the monomials \(x_1^{n_1}\dots x_m^{n_m}\).

A projective space corresponding to a vector space \(V\) is the set of lines passing through its origin (the one dimensional subspaces). Each such line can be represented by any non-zero point on the line.

For a degree-n form \(p(x)\) and a scalar \(b \in \FF\), we have:

\[p(b x_1, \dots, b x_m ) = b^n p (x_1, \dots, x_m).\]

Therefore, if \(p(x) = 0\), then \(p(\alpha x) = 0 \Forall \alpha \in \FF\) and the zero-set of \(p(x)\) includes the one dimensional subspace containing \(x\) (the line passing through \(x\) and \(0\)). Our interest is in the zero sets of homogeneous polynomials. Thus, it is useful to view \(\FF^m\) as a projective space. For a form \(p(x)\) of degree \(n \geq 1\), \(p(0)\) is always \(0\). If \(p(a) = 0\) for some \(a \neq 0\), then \(p(x) = 0 \Forall x = b a, b \in \FF\).

The ring \(\FF[x]\) can be viewed as a graded ring [Lan02] and decomposed as

(1)\[\FF[x] = \bigoplus_{i=0}^{\infty} \FF_i = \FF_0 \oplus \FF_1 \oplus \dots \oplus \FF_p \oplus \dots,\]

where \(\FF_i\) consists of all homogeneous polynomials of degree \(i\). \(\FF_0 = \FF\) is the set of scalars (polynomials of degree 0). \(\FF_1\) is the set of all 1-forms:

\[\FF_1 = \{b_1 x_1 + \dots + b_m x_m : [b_1, \dots, b_m]^T \in \FF^m\}.\]

Note that the polynomial \(0 = 0^T x\) is included in every \(\FF_i\). This enables us to treat \(\FF_i\) as a vector space of \(i\)-forms. \(\FF_1\) can also be viewed as the dual space of linear functionals on the vector space \(\FF^m\). We will also need the following sets in the sequel:

\[\begin{split}\begin{aligned} &\FF_{\leq p} = \bigoplus_{i=0}^p \FF_i = \FF_0 \oplus \dots \oplus \FF_p.\\ &\FF_{\geq p} = \bigoplus_{i=p}^{\infty}\FF_i = \FF_p \oplus \FF_{p+1}\oplus \dots. \end{aligned}\end{split}\]

An ideal in the ring \(\FF[x]\) is an additive subgroup \(I\) such that if \(p(x) \in I\) and \(q(x) \in \FF[x]\), then \(p(x) q(x) \in I\). \(\FF[x]\) is a trivial ideal. \(I\) is called a proper ideal if \(I \neq \FF[x]\). A proper ideal \(I\) is called maximal if no other proper ideal of \(\FF[x]\) contains \(I\). An ideal \(I\) is called a subideal of an ideal \(J\) if \(I \subset J\).

If \(I\) and \(J\) are two ideals in \(\FF[x]\), then \(I \cap J\) is also an ideal. An ideal \(I\) is said to be generated by a subset \(\GGG \subset I\), if every \(p(x) \in I\) can be written as

\[p(x) = \sum_{i=1}^k q_i(x) g_i (x), q_i(x) \in \FF[x],\, g_i(x) \in \GGG.\]

It is denoted by \((\GGG)\). If \(\GGG\) is finite, say \(\GGG = \{ g_1, \dots, g_k\}\), then the generated ideal is also denoted by \((g_1, \dots, g_k)\). An ideal generated by a single element \(p(x)\) is called a principal ideal and is denoted by \((p(x))\):

\[(p(x)) = \{f(x) p(x) : f(x) \in \FF[x] \}.\]

Given two ideals \(I\) and \(J\), the ideal that is generated by product of elements in \(I\) and \(J\) : \(\{ f(x)g(x) : f(x) \in I, g(x) \in J \}\) is called the product ideal \(IJ\).

A prime ideal is similar to prime numbers in the ring of integers. A proper ideal \(I\) is called prime if \(p(x) q(x) \in I\) implies that \(p(x) \in I\) or \(q(x) \in I\). A polynomial \(p(x)\) is said to be prime or irreducible if it generates a prime ideal. A homogeneous ideal of \(\FF[x]\) is an ideal generated by homogeneous polynomials.

Algebraic Sets

Given a set of homogeneous polynomials \(J \subset \FF[x]\), a corresponding projective algebraic set \(Z(J) \subset \FF^m\) is defined as

\[Z(J) = \{y \in \FF^m | p(y) = 0, \Forall p(x) \in J \}.\]

In other words, \(Z(J)\) is the zero set of polynomials in \(J\) (intersection of zero sets of each polynomial in \(J\)). Let \(I\) and \(K\) be sets of homogeneous polynomials and \(X = Z(I)\) and \(Y = Z(K)\) such that \(Y \subset X\). Then \(Y\) is called an algebraic subset of \(X\). A nonempty algebraic set is called irreducible if it is not the union of two nonempty smaller algebraic sets. An irreducible algebraic set is also known as algebraic variety. Any subspace of \(\FF^m\) is an algebraic variety.

Given any subset \(X \subset \FF^m\), we define the vanishing ideal of \(X\) as the set of all polynomials that vanish on \(X\):

\[I(X) = \{ f(x) \in \FF[x] | f(y) = 0, \Forall y \in X \}.\]

It is easy to see that if \(f(x) \in I(X)\) then \(f(x) g(x) \in I(X)\) for all \(g(x) \in \FF[x]\). Thus, \(I(X)\) is indeed an ideal.

Let \(J \subset \FF[x]\) be a set of homogeneous polynomials. \(Z(J)\) is the zero set of \(J\) (an algebraic set). \(I(Z(J))\) is the vanishing ideal of the zero set of \(J\). It can be shown that \(I(Z(J))\) is an ideal that contains \(J\).

Similarly, let \(X \subset \FF^m\) be an arbitrary set of vectors in \(\FF^m\). \(I(X)\) is the vanishing ideal of \(X\) and \(Z(I(X))\) is the zero set of the vanishing ideal of \(X\). Then, \(Z(I(X))\) is an algebraic set that contains \(X\).

It turns out that irreducible algebraic sets and prime ideals are connected. In fact, If \(X\) is an algebraic set and \(I(X)\) is the vanishing ideal of \(X\), then \(X\) is irreducible if and only if \(I(X)\) is a prime ideal.

The natural progression is to look for a one-to-one correspondence between ideals and algebraic sets. The concept of a radical ideal is useful in this context. Given a (homogeneous) ideal \(I\) of \(\FF[x]\), the (homogeneous) radical ideal of \(I\) is defined to be

\[\text{rad}(I) = \{ f(x) \in \FF[x] | f(x)^p \in I \,\text{for some } p \in \Nat\}.\]

Clearly, \(\text{rad}(I)\) is an ideal in itself and \(I \subset \text{rad}(I)\). \(\text{rad}(I)\) is a fixed point in the sense that \(\text{rad}(\text{rad}(I)) = \text{rad}(I)\). Also, if \(I\) is homogeneous, then so is \(\text{rad}(I)\). A theorem by Hilbert states the following: if \(\FF\) is an algebraically closed field (e.g. \(\FF = \CC\)) and \(I \subset \FF[x]\) is a (homogeneous) ideal, then

\[I(Z(I)) = \text{rad}(I).\]

Thus, the mappings \(I \to Z(I)\) and \(X \to I(X)\) induce a one-to-one correspondence between the collection of (projective) algebraic sets of \(\FF^m\) and (homogeneous) radical ideals of \(\FF[x]\). This result is known as Nullstellensatz.

Algebraic Sampling Theory

We will now explore the problem of identifying a (projective) algebraic set \(Z \subset \FF^m\) from a finite number of sample points in \(Z\). In general, the algebraic set \(Z\) may not be irreducible and the ideal \(I(Z)\) may not be prime. Let \(\{z_1, \dots, z_S\} \subset Z\) be a finite (but sufficiently large) set of sample points from \(Z\) for the following discussion. For an arbitrary point \(z \in Z\), we abuse notation and let \(z\) also denote the corresponding projective point (i.e. the line passing through \(0\) and \(z\)). Let \(\mathfrak{m} = I(z)\) be the vanishing ideal of (the line) \(z\). Then, \(\mathfrak{m}\) is a submaximal ideal (i.e. it is not a subideal of any other proper homogeneous ideal of \(\FF[x]\)). Let \(\mathfrak{m}_i\) be the vanishing ideal of \(z_i\). Then the vanishing ideal for the set of points is

\[\mathfrak{a}_S = \mathfrak{m}_1 \cap \dots \cap \mathfrak{m}_S.\]

This is a radical ideal and is in general much larger than \(I(Z)\). In order to ensure that we can infer \(I(Z)\) correctly from the set of samples \(\{ z_i \}\), we need some additional constraints. We require that \(I(Z)\) is generated by a set of (homogeneous) polynomials whose degrees are bounded by a relatively small \(n\):

\[I(Z) = (f_1, \dots, f_s) \text{ s.t. }\, \deg(f_j) \leq n.\]

Then, the zero set of \(I\) is given by

\[Z(I) = \{ z \in \FF^m | f_i(z) = 0, i = 1, 2, \dots, s\}.\]

In general, \(I(Z)\) is always a proper subideal of \(\mathfrak{a}_S\) regardless of how large \(S\) is. An algebraic sampling theorem comes to our rescue: if \(I(Z)\) is generated by polynomials in \(\FF_{\leq n}\), then there is a finite sequence of points \(Z_S = \{z_1, \dots, z_S \}\) such that the subspace \(I(Z_S) \cap \FF_{\leq n}\) generates \(I(Z)\). While the theorem doesn’t give a bound on \(S\), it turns out that, with probability one, the vanishing ideal of an algebraic set can be correctly determined from a randomly chosen sequence of samples. This theorem is analogous to the classical Nyquist-Shannon sampling theorem.

So far we have looked at modeling a data set as an algebraic set and obtaining its vanishing ideal. The next step is to extract the internal geometric or algebraic structure of the algebraic set. The idea is to find simpler (possibly irreducible) algebraic sets which can be composed to form the given algebraic set. For example, if an algebraic set is a union of subspaces, then we would like to find out the component subspaces. In other words, given an algebraic set \(X\) or its vanishing ideal \(I(X)\), the objective is to decompose it into a union of subsets each of which cannot be decomposed further.

An algebraic set can have only finitely many irreducible components. That is, there exists a finite \(n\) such that

\[X = X_1 \cup \dots \cup X_n,\]

where \(X_i\) are irreducible algebraic varieties. The vanishing ideal \(I(X_i)\) must be a prime ideal that is minimal over the radical ideal \(I(X)\) (i.e. there is no prime subideal of \(I(X_i)\) that includes \(I(X)\)). The ideal \(I(X)\) is given by

\[I(X) = I(X_1) \cap \dots \cap I(X_n).\]

This is known as the minimal primary decomposition of the radical ideal \(I(X)\).

Given a (projective) algebraic set \(Z\) and its vanishing ideal \(I(Z)\), we can grade the ideal by degree as:

\[I(Z) = I_0(Z) \oplus I_1(Z) \oplus \dots.\]

The Hilbert function of \(Z\) is defined to be

(2)\[h_I(i) \triangleq \text{dim} (I_i(Z)).\]

\(h_I(i)\) denotes the number of linearly independent polynomials of degree \(i\) that vanish on \(Z\). Hilbert series of an ideal \(I\) is defined as the power series:

\[\HHH(I, t)\triangleq \sum_{i=0}^{\infty} h_I(i) t^i.\]

Subspace Arrangements

We are interested in a special class of algebraic sets known as subspace arrangements in \(\RR^M\). A subspace arrangement \(\UUU = \{ \UUU_1, \dots, \UUU_K \}\) is a finite collection of linear or affine subspaces in \(\RR^M\). The set \(Z_{\UUU} = \UUU_1 \cup \dots \cup \UUU_K\) is the union of the subspaces; it is an algebraic set. We will explore the algebraic properties of \(Z_{\UUU}\) in the following. We say a subspace arrangement is central if every subspace passes through the origin. In the sequel, we will focus on central subspace arrangements only.

A \(D\)-dimensional subspace \(V\) can be defined by \(D' = M - D\) linearly independent linear forms \(\{b_1, b_2, \dots, b_{D'} \}\):

\[V = \{x \in \RR^M | b_i(x) = 0, 1 \leq i \leq D' \}.\]

Let \(V^*\) denote the vector space of all linear forms that vanish on \(V\). Then \(\dim(V^*) = D' = M - D\). \(V\) is the zero set of \(V^*\) (i.e. \(V = Z(V^*))\). The vanishing ideal of \(V\) is

\[I(V) = \{ p(x) \in \RR[x] : p(x) = 0, \Forall x \in V \}.\]

\(I(V)\) is an ideal generated by linear forms in \(V^*\). It contains polynomials of all degrees that vanish on \(V\). Every polynomial \(p(x) \in I(V)\) can be written as

\[p(x) = h_1 b_1 + \dots + h_{D'} b_{D'}\]

where \(h_i \in \RR[x]\). \(I(V)\) is a prime ideal.

The vanishing ideal of the subspace arrangement \(Z_{\UUU} = \UUU_1 \cup \dots \cup \UUU_K\) is

\[I(Z_{\UUU}) = I(\UUU_1) \cap \dots \cap I(\UUU_K).\]

The ideal can be graded by degree of the polynomial as:

(3)\[I(Z_{\UUU}) = I_m(Z_{\UUU}) \oplus I_{m+1}(Z_{\UUU}) \oplus \dots.\]

Each \(I_i(Z_{\UUU})\) is a vector space that contains forms of degree \(i\) in \(I(Z_{\UUU})\) and \(m\geq 1\) is the least degree of the polynomials in \(I(Z_{\UUU})\). The sequence of dimensions of \(I_i(Z_{\UUU})\) is the Hilbert function \(h_I(i)\) of \(Z_{\UUU}\).

Based on a result on the regularity of subspace arrangements [Der07], the subspace arrangement \(Z_{\UUU}\) is uniquely determined as the zero set of all polynomials of degree up to \(K\) in its vanishing ideal. i.e.

\[Z_{\UUU} = Z (I_0 \oplus I_1 \oplus \dots \oplus I_K).\]

Thus, we don’t really need to determine polynomials of higher degree.

We need to characterize \(I(Z_{\UUU})\) further. Recall that \(\UUU_k\) is a (linear) subspace and \(\UUU_k^*\) is the vector space of linear forms which vanish on \(\UUU_k\). We can construct a product of linear forms by choosing one linear form from each \(\UUU_k^*\). Let \(J(Z_{\UUU})\) be the ideal generated by the products of linear forms

\[\{ b_1 \cdot b_2 \cdot \dots \cdot b_K: \quad b_k \in \UUU_k^* \Forall 1 \leq k \leq K \}\]

Equivalently, we can say that:

\[J(Z_{\UUU}) \triangleq I(\UUU_1) I(\UUU_2) \dots I(\UUU_K)\]

is the product ideal of the vanishing ideals of each of the subspaces. Evidently, \(J(Z_{\UUU})\) is a subideal in \(I(Z_{\UUU})\). In fact, the two ideals share the same zero set:

\[Z_{\UUU} = Z(J(Z_{\UUU})) = Z(I(Z_{\UUU})).\]

Now, \(I(Z_{\UUU})\) is the largest ideal which vanishes on \(Z_{\UUU}\). In fact, \(I(Z_{\UUU})\) is the radical ideal of \(J(Z_{\UUU})\). Now, just like we graded \(I(Z_{\UUU})\), we can also grade \(J(Z_{\UUU})\) as:

\[J(Z_{\UUU}) = J_K(Z_{\UUU}) \oplus J_{K+1}(Z_{\UUU}) \oplus \dots.\]

Note that, the lowest degree of polynomials is always \(K\) which is the number of subspaces in \(\UUU\). Hilbert function of \(J\) is denoted as \(h_J(i) = \text{dim} (J_i(Z_{\UUU}))\). It turns out that Hilbert functions of the vanishing ideal \(I\) and the product ideal \(J\) have interesting and useful relationships.

Subspace Embeddings

Let \(Z_{\UUU'} = \UUU'_1 \cup \dots \cup \UUU'_{K'}\) be another (central) subspace arrangement such that \(Z_{\UUU} \subseteq Z_{\UUU'}\). Then, for each \(\UUU_k\), there must exist a \(\UUU'_{k'}\) such that \(\UUU_k \subseteq \UUU'_{k'}\). We call \(Z_{\UUU} \subseteq Z_{\UUU'}\) a subspace embedding. If \(Z_{\UUU'}\) happens to be a hyperplane arrangement, we call the embedding a hyperplane embedding. Let us consider how to create a hyperplane embedding for a given subspace arrangement.

In general, the zero set of each homogeneous component of \(I(Z_{\UUU})\) (i.e. \(I_i(Z_{\UUU})\)) need not be a subspace embedding of \(Z_{\UUU}\). In fact, it may not even be a subspace arrangement. However, the derivatives of the polynomials in \(I(Z_{\UUU})\) come to our rescue. We denote the derivative of \(p(x)\) w.r.t. \(x \in \RR^M\) by \(D p(x)\). Consider a polynomial \(p(x) \in I(Z_{\UUU})\). Pick a point \(x_k\) from each subspace \(\UUU_k\) (\(x_k \in \UUU_k\)). Compute the derivative of \(p(x)\) and evaluate it at \(x_k\) as \(D p(x_k)\). Now, construct the hyperplane \(H_k = \{ x : D p(x_k)^T x = 0 \}\). Recall that the derivative of a smooth function \(f(x)\) is orthogonal to (the tangent space of) its level set \(f(x) = c\). Thus, \(H_k\) contains \(\UUU_k\). It turns out that if the \(K\) points \(\{ x_1, \dots, x_K \}\) (one from each subspace) are in general position, then the union of hyperplanes \(\cup_{k=1}^K H_k\) is a hyperplane embedding of the subspace arrangement \(Z_{\UUU}\).
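The construction can be illustrated with two planes in \(\RR^3\) (a minimal MATLAB sketch; the vectors and points below are hypothetical examples, not library code):

% union of the xy-plane (z = 0) and the yz-plane (x = 0);
% its vanishing polynomial is p(x) = (b1'*x)*(b2'*x)
b1 = [0; 0; 1];
b2 = [1; 0; 0];
Dp = @(x) b1 * (b2' * x) + b2 * (b1' * x);   % gradient of p at x
x1 = [1; 2; 0];      % a point on the xy-plane
x2 = [0; 3; 4];      % a point on the yz-plane
n1 = Dp(x1)          % = [0; 0; 1]: the hyperplane x3 = 0 contains the xy-plane
n2 = Dp(x2)          % = [4; 0; 0]: the hyperplane x1 = 0 contains the yz-plane

Here the two hyperplanes coincide with the original planes, which is expected since 2-dimensional subspaces of \(\RR^3\) are themselves hyperplanes.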

For each polynomial in \(I(Z_{\UUU})\), we can construct a hyperplane embedding of the subspace arrangement \(Z_{\UUU}\). The intersection of hyperplane embeddings constructed from a collection of polynomials in \(I(Z_{\UUU})\) is a subspace embedding of \(Z_{\UUU}\). When this collection of polynomials contains all the generators of \(I(Z_{\UUU})\), the subspace embedding becomes tight. In fact, the resulting subspace arrangement coincides with the original one.

An ideal is said to be pl-generated if it is generated by products of linear forms. The ideal \(J(Z_{\UUU})\) defined above is pl-generated by construction. If the vanishing ideal \(I(Z_{\UUU})\) of a subspace arrangement is pl-generated, then the zero set of each of its generators gives a hyperplane embedding of \(Z_{\UUU}\).

If \(Z_{\UUU}\) is a hyperplane arrangement, then \(I(Z_{\UUU})\) is always pl-generated, as it is generated by a single polynomial of the form \(p(x) = (b_1^T x) \dots (b_K^T x)\), where \(b_k \in \RR^M\) are the normal vectors to the \(K\) hyperplanes in the arrangement. In fact, it is also a principal ideal.

The vanishing ideal of a single subspace is always pl-generated, and so is the vanishing ideal of an arrangement of two subspaces, but this is not true in general. Something can, however, be said when the \(K\) subspaces in the arrangement are in general position.

Hilbert Functions of Subspace Arrangements

If a subspace arrangement \(\UUU\) is in general position, then the values of the Hilbert function \(h_I(i)\) of its vanishing ideal \(I(Z_{\UUU})\) depend solely on the dimensions of the subspaces \(D_1, \dots, D_K\) and they are invariant under a continuous change of the position of the subspaces. When identifying a subspace arrangement from a set of samples, the first level parameters to be identified are number of subspaces and the dimensions of each subspace.

Digital Signal Processing

Run Length Encoding

Run length encoding is a common operation in compression applications. In this article, we discuss how to do this efficiently in MATLAB using vectorization techniques.

Let’s consider a simple sequence of integers:

x = [0 0 0 0 0 0 0 4  4 4 3 3 2 2 2 2 2 2 2 1 1 0 0 0 0 0 2 3 9 5 5 5 5 5 5]

The sequence has 35 elements.

First step is change detection:

>> diff_positions = find(diff(x) ~= 0)
diff_positions =

     7    10    12    19    21    26    27    28    29

Note that each of these positions marks the index at which a run ends: x(7) is the last symbol of the first run, and the first change occurs at x(8).

We can use this to compute the runs of each symbol:

>> runs = diff([0 diff_positions numel(x)])
runs =

     7     3     2     7     2     5     1     1     1     6

The start position for the first symbol of each run can also be easily obtained:

>> start_positions  = [1 (diff_positions + 1)]
start_positions =

     1     8    11    13    20    22    27    28    29    30

We can now pick up the symbols from x:

>> symbols = x(start_positions)
symbols =

     0     4     3     2     1     0     2     3     9     5

Combine the symbols and their runs:

>> encoding = [symbols; runs]
encoding =

     0     4     3     2     1     0     2     3     9     5
     7     3     2     7     2     5     1     1     1     6

Flatten the encoding:

>> encoding = encoding(:)';
>> fprintf('%d ', encoding)
0 7 4 3 3 2 2 7 1 2 0 5 2 1 3 1 9 1 5 6 >>

We can cross check that the length of the encoded sequence is correct:

>> total_symbols = sum(runs)
total_symbols =

    35

We can check the length of the encoded sequence:

>> numel(encoding)

ans =

    20

It is indeed less than 35. The gain is not much since there were many symbols with just one occurrence.

The decoding can be easily done using a for loop:

x_dec = [];
for i=1:numel(encoding) / 2
    % odd positions hold the symbols, even positions hold the run lengths
    symbol = encoding(i*2 -1);
    run_length = encoding(i*2);
    % append run_length copies of the symbol
    x_dec = [x_dec symbol * ones([1, run_length])];
end

Let’s print the decoded sequence:

>> fprintf('%d ', x_dec);
0 0 0 0 0 0 0 4 4 4 3 3 2 2 2 2 2 2 2 1 1 0 0 0 0 0 2 3 9 5 5 5 5 5 5

Verify that the decoded sequence is indeed same as original sequence:

>> sum(x_dec - x)
ans =

     0
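On recent MATLAB releases (R2015a onwards), the decoding can also be done without a loop using repelem; a minimal sketch reusing the symbols and runs computed above:

% repeat each symbol according to its run length
x_dec2 = repelem(symbols, runs);
isequal(x_dec2, x)   % should return logical 1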

The library provides useful methods for performing run length encoding and decoding.

Encoding:

>> x = [0 0 0 0 3 3 3 2 2];
>> encoding = spx.dsp.runlength.encode(x)

encoding =

     0     4     3     3     2     2

Decoding:

>> spx.dsp.runlength.decode(encoding)

ans =

     0     0     0     0     3     3     3     2     2

Discrete Cosine Transform

The discussion in this article is based on [Str99].

There are four types of DCT transforms DCT-1, DCT-2, DCT-3 and DCT-4.

Consider the second difference equation:

\[y(n) = - x (n -1) + 2 x(n) - x(n + 1)\]

For finite signals \(x \in \RR^N\), the equation can be implemented by a linear transformation:

\[y = A x\]

where \(A\) is a circulant matrix:

\[\begin{split}A = \begin{bmatrix} 2 & -1 & & & & -1\\ -1 & 2 & -1 & & & \\ & -1 & 2 & -1 & & \\ & & & \ddots & & \\ & & & -1 & 2 & -1\\ -1 & & & & -1 & 2 \end{bmatrix}\end{split}\]

The unspecified values are 0. We can write the individual linear equations as:

\[\begin{split}\begin{aligned} y_1 &= - x_{N} + 2 x_1 - x_2 \\ y_j &= - x_{j-1} + 2 x_j - x_{j +1} \quad \forall 1 < j < N \\ y_N &= - x_{N-1} + 2 x_N - x_1 \end{aligned}\end{split}\]

The first and last equations are boundary conditions while the middle one represents the ordinary second difference equation.

The rows 1 and N of \(A\) are the boundary rows while all other rows are interior rows.

The interior rows correspond to the computation \(- x_{j-1} + 2 x_j - x_{j +1}\), which is the discretization of the negated second order derivative \(-x''\). The negative sign on the derivative makes the matrix \(A\) positive semi-definite. This ensures that no eigen values of \(A\) are negative.

In the first and last rows, we need the values of \(x_0\) and \(x_{N + 1}\). In the periodic extension, we assume that \(x_0 = x_N\) and \(x_{N + 1} = x_1\). This gives the \(-1\) entries in the corners of \(A\) as shown above.

With \(\omega = \exp(2\pi i / N)\), it turns out that

\[v_k = (1, \omega^k, \omega^{2k}, \dots, \omega^{(N-1)k})\]

are eigen vectors for \(A\) for \(0 \leq k \leq N -1\). The corresponding eigen values are \(2 - 2 \cos(2\pi k / N)\).
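This claim is easy to verify numerically; a small sketch (the matrix \(A\) is built explicitly here only for checking):

N = 8;
% circulant second difference matrix: 2 on the diagonal,
% -1 on the super/sub diagonals and in the two corners
A = 2*eye(N) - circshift(eye(N), [0 1]) - circshift(eye(N), [0 -1]);
k = 3;
omega = exp(2i*pi/N);
v_k = omega.^(k*(0:N-1)).';        % the vector v_k defined above
lambda_k = 2 - 2*cos(2*pi*k/N);    % predicted eigen value
norm(A*v_k - lambda_k*v_k)         % should be close to zero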

The eigen vectors are nothing but the basis vectors for DFT basis. Note that the eigen values satisfy a relationship \(\lambda_k = \lambda_{N -k}\). So the linear combinations of the eigen vectors \(v_k\) and \(v_{N -k}\) are also eigen vectors.

It turns out that the real and imaginary parts of the vector \(v_k\) are also eigen vectors of \(A\). They can be easily constructed as linear combinations of \(v_k\) and \(v_{N -k}\).

We define:

\[c_k = \Re(v_k) = \left ( 1, \cos \frac{2\pi k}{ N}, \cos \frac{4\pi k}{ N}, \dots, \cos \frac{2 (N -1)\pi k}{ N} \right ).\]
\[s_k = \Im(v_k) = \left ( 0, \sin \frac{2\pi k}{ N}, \sin \frac{4\pi k}{ N}, \dots, \sin \frac{2 (N -1)\pi k}{ N} \right )\]

The exception to this rule is \(\lambda_0\) for which \(c_0 = (1, 1, \dots, 1)\) and \(s_0 = (0, 0, \dots, 0)\) where \(s_0\) is not an eigen vector while \(c_0\) is.

For even \(N\), there is another exception at \(\lambda_{N/2}\) with \(c_{N/2} = (1, -1, \dots, 1, -1)\) and \(s_{N/2} = (0, 0, \dots, 0)\), where again \(s_{N/2}\) is not an eigen vector while \(c_{N/2}\) is.

These two eigen vectors (\(c_0\) and \(c_{N/2}\)) have length \(\sqrt{N}\), while the other eigen vectors \(c_k\) and \(s_k\) have length \(\sqrt{N/2}\).

TBD

Detecting Dual Tone Multi Frequency Signals

Highlights

The following MATLAB functions are demonstrated in this article: envelope, pulsewidth, periodogram, findpeaks, meanfreq and spectrogram.

A Dual Tone Multi Frequency (DTMF) signal is the signal generated from the punch keys of an ordinary telephone.

Each signal consists of a low frequency and a high frequency.

The table below lists the frequencies used for various keys.

Key   Low frequency (Hz)   High frequency (Hz)
1     697                  1209
2     697                  1336
3     697                  1477
4     770                  1209
5     770                  1336
6     770                  1477
7     852                  1209
8     852                  1336
9     852                  1477
0     941                  1336
*     941                  1209
#     941                  1477

Let’s create a DTMF signal for the sequence of symbols 4, 5, 0, 7:

[signal, fs] = spx.dsp.dtmf({'4', '5', '0', '7'});

The corresponding time stamps:

time = (0:(numel(signal) - 1)) / fs;

Let’s plot it:

plot(1e3*time, signal);
xlabel('Time (ms)');
ylabel('Amplitude');
grid on;
_images/dtmf_4507.png

The pulses are 100 ms wide. The gap between pulses is also 100 ms wide and consists of Gaussian noise.

Our challenge would be to isolate the frequencies and identify the symbols transmitted.

Envelope

We can see the shape of the pulses where the symbols were punched by looking at the RMS envelope of the signal:

envelope_signal = envelope(signal, 80,'rms');
plot(1e3*time, envelope_signal);

We are computing the RMS envelope over a window of 80 samples.

_images/dtmf_4507_envelope.png

It is now easy to identify the pulses:

pulsewidth(envelope_signal,fs)
ans =

    0.1050
    0.1041
    0.1042
    0.1045
_images/dtmf_4507_pulses.png

The recognized pulses are pretty close in size to the actual pulse size of 100 ms each.

Periodogram

A periodogram can help us identify the dominant frequencies present in the signal.

The frequencies involved in the sequence 4507 are 4 (770, 1209), 5(770, 1336), 0 (941, 1336), 7 (852, 1209).

We note that 770 Hz, 1209 Hz and 1336 Hz each occur twice, hence we expect them to contribute more to the power spectrum. The other frequencies are 941 Hz and 852 Hz.

Computing the periodogram is straight-forward:

[pxx,f]=periodogram(signal,[],[],fs);

Here is the display of power spectrum in deciBels.

_images/dtmf_4507_periodogram.png

We wish to isolate the peak frequencies from this plot:

[peak_values, peak_freqs] = findpeaks(pxx, f, 'SortStr','descend', 'MinPeakHeight', max(pxx) / 10);
peak_freqs = round(peak_freqs');

>> sort(peak_freqs)

ans =

Columns 1 through 9

766 771 774 853 941 1203 1205 1208 1210

Columns 10 through 14

1212 1215 1331 1335 1340
  1. The frequencies 766, 771 and 774 are near 770 Hz.
  2. 853 is near 852 Hz.
  3. 941 matches 941 Hz.
  4. 1203, 1205, 1208, 1210, 1212 and 1215 are near 1209 Hz.
  5. 1331, 1335 and 1340 are near 1336 Hz.

Thus, the periodogram has been able to identify all the relevant frequencies in the signal, and their power contributions match what we would expect from how often each frequency appears in the sequence 4507.

However, the periodogram is unable to localize the frequencies in time and hence is unable to tell us exactly which symbols were transmitted.

It is instructive to compute the mean frequencies in different bands:

>> round(meanfreq(pxx, f, 700 + [0, 100]))

ans =

   769

>> round(meanfreq(pxx, f, 800 + [0, 100]))

ans =

   851

>> round(meanfreq(pxx, f, 900 + [0, 100]))

ans =

   941

>> round(meanfreq(pxx, f, 1200 + [0, 100]))

ans =

        1211

>> round(meanfreq(pxx, f, 1300 + [0, 100]))

ans =

        1336

The mean frequencies in these bands are mostly spot-on or very close to actual frequencies sent in the DTMF signal.

Spectrogram

While we have been able to identify the frequencies present in the signal, we haven’t been able to localize them in time. Thus, we are unable to identify exactly which symbols were sent.

The spectrogram provides us the time-frequency representation of the signal:

spectrogram(signal, [], [], [], fs, 'yaxis');
% restrict the y-axis between 500Hz to 1500 Hz.
ylim([0.5 1.5]);
_images/dtmf_4507_spectrogram.png

In this plot, it is clearly visible that at any point of time, two frequencies are active. There are four different symbols which seem to have been sent.

  1. In the first symbol, the frequencies active seem to be around 770Hz and 1200 Hz which maps to the symbol 4.
  2. In the second symbol, the frequencies active seem to be around 770Hz and 1330 Hz which maps to the symbol 5.
  3. Similarly, we can see that the symbols 0 and 7 are easily visible in the spectrogram.

This spectrogram is not able to localize the symbols accurately. We are unable to see the portions where no symbols are being sent and only noise is present.

By default, the spectrogram uses the following parameters:

  • The signal is divided into segments which are around 22% of the length of the signal.
  • The segments overlap each other by 50%.
  • Each segment is windowed with a Hamming window before its FFT is computed.

We should increase the time resolution of the spectrogram.

Let’s have a window length of 50 ms:

window_length = floor(fs * 50 / 1000);

Let’s continue to have overlap of 50%:

overlap_length = floor(window_length / 2);

The FFT length depends on the window length:

n_fft = 2^nextpow2(window_length);

We will compute the spectrogram with Hamming window:

spectrogram(signal,hamming(window_length),overlap_length,n_fft, fs, 'yaxis');
ylim([0.5 1.5]);

Let’s visualize the results:

_images/dtmf_4507_spectrogram_50ms.png

In this spectrogram, the pulses in the signal are clearly visible and their frequencies can easily be read off the diagram.

While we have improved the time localization of the pulses, the frequency localization has suffered a bit. Since our interest is only in knowing the mean frequencies, this loss of frequency localization is not that important in this case.

We can remove the frequencies which contribute very little to the spectrogram and thus enhance the prominent frequencies in the output. We can also increase the overlap between subsequent windows to produce more spectral estimates along the time axis and make the spectrogram look smoother:

overlap_length = floor(0.8 * window_length );
spectrogram(signal,hamming(window_length),overlap_length,n_fft, fs, 'yaxis', 'MinThreshold', -50);
ylim([0.5 1.5]);
_images/dtmf_4507_spectrogram_50ms_40ms_50db.png

By computing the center of energy for each spectral estimate in both time and frequency, we can do spectral reassignment. This gives us a much cleaner and crisper spectrogram.

_images/dtmf_4507_spectrogram_50ms_40ms_50db_reassigned.png
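Newer MATLAB releases can compute such a reassigned spectrogram directly via the 'reassigned' flag of spectrogram (a sketch reusing the parameters defined above; check your release's documentation for availability):

spectrogram(signal, hamming(window_length), overlap_length, n_fft, ...
    fs, 'reassigned', 'yaxis', 'MinThreshold', -50);
ylim([0.5 1.5]);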

Decoding the symbols

The complete process for decoding the DTMF sequence using the spectrogram has been implemented in the function spx.dsp.dtmf_detector.

The function does the following:

  • Compute the spectrogram
  • Identify the times where spectral content has high energy
  • Identify peak frequencies at these times
  • Match these frequencies to the nearest low and high frequencies of DTMF sequences.
  • Map the identified frequencies to actual symbols.
  • Identify the start and duration of each symbol in terms of time.

You are welcome to look at the implementation.

We show the example use:

>> [symbols, starts, durations] = spx.dsp.dtmf_detector(signal, fs)

symbols =

  1×4 cell array

    {'4'}    {'5'}    {'0'}    {'7'}


starts =

    0.1000    0.3000    0.5000    0.7000


durations =

    0.1000    0.1000    0.1000    0.1000

Another example:

>> [signal, fs] = spx.dsp.dtmf({'2', '3', '4', '6', '*'});
>> [symbols, starts, durations] = spx.dsp.dtmf_detector(signal, fs)

symbols =

  1×5 cell array

    {'2'}    {'3'}    {'4'}    {'6'}    {'*'}


starts =

    0.1000    0.3000    0.5000    0.7000    0.9000


durations =

    0.1000    0.1000    0.1000    0.1000    0.1000

Wavelets

Fundamentals

Essential Operations

Dyadic Structure

Here we are looking at the Haar wavelet decomposition of finite dimensional signals.

We assume that a signal \(x \in \RR^N\) where \(N = 2^J\) for some natural number \(J\).

A single level wavelet decomposition splits a signal into two parts, an approximation and a detail part. Both of these parts have \(N/2\) samples. With Haar wavelets, we can decompose the signal \(J\) times.
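For a concrete feel of a single analysis step, here is a hand-coded Haar split (a minimal sketch using the orthonormal \(1/\sqrt{2}\) normalization; sign conventions may differ from the library's implementation):

x = [4 6 10 12 8 6 5 5];                   % N = 8 = 2^3 samples
a = (x(1:2:end) + x(2:2:end)) / sqrt(2)    % approximation: 4 samples
d = (x(1:2:end) - x(2:2:end)) / sqrt(2)    % detail: 4 samples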

We will denote the approximations and detail components as \(a_j\) and \(d_j\).

  1. We start with \(a_J = x\) which has \(N = 2^{J}\) samples.
  2. The first decomposition splits \(a_{J}\) into two parts \(a_{J-1}\) and \(d_{J-1}\), both of which have \(2^{J-1}\) samples.
  3. The second decomposition splits \(a_{J-1}\) into two parts \(a_{J-2}\) and \(d_{J-2}\), both of which have \(2^{J-2}\) samples.
  4. The third decomposition splits \(a_{J-2}\) into two parts \(a_{J-3}\) and \(d_{J-3}\), both of which have \(2^{J-3}\) samples.
  5. The \(J\)-th decomposition splits \(a_{1}\) into two parts \(a_{0}\) and \(d_{0}\), both of which have \(2^{0} = 1\) sample.
  6. No further decomposition is possible.

Note

Depending upon a specific wavelet structure, \(J\) decompositions may not be possible.

The overall decomposition process can be written as

\[\begin{split}\begin{aligned} a_J &\to [a_{J-1}\quad d_{J-1}]\\ &\to [a_{J-2}\quad d_{J-2}\quad d_{J-1}]\\ &\to [a_{J-3}\quad d_{J-3}\quad d_{J-2}\quad d_{J-1}]\\ & \dots \\ &\to [a_{0}\quad d_{0}\quad d_{1}\quad \dots\quad d_{J-3}\quad d_{J-2}\quad d_{J-1}] \end{aligned}\end{split}\]

At every level of decomposition, the number of coefficients in the decomposition is exactly \(N = 2^J\).

The indices occupied by each level of decomposition are given by

\[\begin{bmatrix} [1] & [2] & [3,4] & [5,8] & \dots & [2^{J-1}+1, 2^{J}] \end{bmatrix}\]

This is the dyadic structure of the \(J\) levels of decompositions.

Example: J=4 decomposition

Consider the case with \(N=16\) where \(J=4\). 4 levels of decomposition are possible with Haar wavelet.

  1. \(a_4\) has 16 samples.
  2. \(a_3\) and \(d_3\) both have 8 samples each.
  3. \(a_2\) and \(d_2\) both have 4 samples each.
  4. \(a_1\) and \(d_1\) both have 2 samples each.
  5. \(a_0\) and \(d_0\) have 1 sample each.

No further decomposition is possible.

\(d_j\) has \(2^j\) samples and occupies the indices between \(2^{j} +1\) and \(2^{j+1}\).

Functions to work with dyadic structure

We provide a function to identify the indices occupied by the \(j\)-th dyad (the detail coefficients \(d_j\)):

>> spx.wavelet.dyad(1)

ans =

     3 4

>> spx.wavelet.dyad(2)

ans =

     5 6 7 8

>> spx.wavelet.dyad(3)

ans =

     9 10 11 12 13 14 15 16

>> spx.wavelet.dyad(4)

ans =

    17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32

A wavelet coefficient is indexed by two numbers \((j, k)\). Here, \(j\) denotes the resolution level of the wavelet and \(k\) denotes the translation. We have \(j \geq 0\) and \(0 \leq k < 2^j\).

The absolute index is given by \(2^j + k + 1\).

>> spx.wavelet.dyad(1)

ans =

     3  4

>> spx.wavelet.dyad_to_index(1,0)

ans =

     3

>> spx.wavelet.dyad_to_index(1,1)

ans =

     4

>> spx.wavelet.dyad_to_index(3,2)

ans =

    11

dyad_length lets us find the number of decompositions possible for a vector:

>> [N, J, c] = spx.wavelet.dyad_length(1:16)
N =
    16
J =
     4
c =
  logical
   1

Here N is the length of the vector, J is the possible number of decompositions, and c is a consistency flag indicating whether N is a power of 2 or not.

cut_dyadic truncates a signal to the largest power-of-2 length that does not exceed its length:

>> spx.wavelet.cut_dyadic(1:15)

ans =

     1  2  3  4  5  6  7  8
Periodic Convolution

Usual convolution of a signal \(x\) of length N with a filter \(h\) of length M results in a signal \(y\) of length N+M-1.

\[y[n] = \sum_{k=1}^M h[k] x[n-k + 1]\]

The assumption here is that \(x[n] = 0\) for \(n \leq 0\) and \(n > N\).

Here is an example:

>> conv([3 1 2], [1 2 2 1])

ans =

     3  7 10  9  5  2

This is not suitable for an orthogonal wavelet decomposition of a signal. We are interested in periodic or circular convolution which is defined by

\[y[n] = \sum_{k=1}^M h[k] x[((n-k) \mod N) + 1]\]
Periodic Extension

To construct the periodic extension of a vector, we provide following methods:

  • repeat_vector_at_start repeats values from the end of a vector to its beginning.
  • repeat_vector_at_end repeats values from the start of a vector to its end.
>> spx.vector.repeat_vector_at_start(1:10, 4)

ans =

     7  8  9 10  1  2  3  4  5  6  7  8  9 10

>> spx.vector.repeat_vector_at_end(1:10, 4)

ans =

     1  2  3  4  5  6  7  8  9 10  1  2  3  4
Computing the Periodic Convolution

We provide a method called iconv to compute the periodic convolution. Let’s go through the steps of periodic convolution one by one.

Example: Periodic convolution of a constant sequence with a difference filter

Let’s take an example signal:

>> x = [1 1 1 1 1 1]

x =

     1  1  1  1  1  1

And an example filter:

>> f = [1 -1]

f =

     1 -1

Let’s get the length of signal:

>> n = length(x)

n =

     6

And the length of filter:

>> p = length(f)

p =

     2

Extend the signal at the start by p values (from the end):

>> x_padded =  spx.vector.repeat_vector_at_start(x, p)

x_padded =

     1  1  1  1  1  1  1  1

Perform full convolution on the extended signal:

>> y_padded = filter(f, 1, x_padded)

y_padded =

     1  0  0  0  0  0  0  0

Drop the first p values from it to get the periodic convolution output:

>> y = y_padded((p+1):(n+p))

y =

     0  0  0  0  0  0

The same can be achieved by a single function call:

>> spx.wavelet.iconv(f,x)

ans =

     0  0  0  0  0  0
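As a cross-check, MATLAB's cconv function (Signal Processing Toolbox) computes the same circular convolution when its length argument is set to the signal length; a minimal sketch reusing f, x and n from above:

% circular convolution over the signal length; matches iconv(f, x) here
y2 = cconv(f, x, n)   % all zeros for this example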
Example: Same vs periodic convolution

MATLAB's conv function has a 'same' option, which returns the central part of the linear convolution. This is different from periodic convolution:

>> u = [-1 2 3 -2 0 1 2];
>> v = [2 -1];
>> conv(u,v,'same')

ans =

     5  4 -7  2  2  3 -2

>> spx.wavelet.iconv(v, u)

ans =

    -4  5  4 -7  2  2  3
Example: Periodic convolution with a time-reversed filter

There is another function for computing the convolution of a signal with the time reversed version of a filter.

>> spx.wavelet.aconv(v, u)

ans =

    -4  1  8 -4 -1  0  5

>> spx.wavelet.iconv(v(length(v):-1:1), u)

ans =

     5 -4  1  8 -4 -1  0

Notice the slight difference in the two outputs: the aconv output is circularly shifted by 1.

Upsampling

Upsampling introduces zeros between individual samples.

Upsampling by a factor of 2:

>> spx.wavelet.up_sample([-1 2 3 -2 0 1 2])

ans =

    -1  0  2  0  3  0 -2  0  0  0  1  0  2  0

Upsampling by a factor of 3:

>> spx.wavelet.up_sample([-1 2 3 -2 0 1 2], 3)

ans =

    -1  0  0  2  0  0  3  0  0 -2  0  0  0  0  0  1  0  0  2  0  0

The second argument is the upsampling factor.
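For comparison, MATLAB's own upsample function (Signal Processing Toolbox) performs the same zero insertion; a quick sketch:

% insert 1 zero after every sample (factor 2); compare with the output above
upsample([-1 2 3 -2 0 1 2], 2)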

MATLAB Wavelet Toolbox

Introduction

This section is a quick review of wavelet toolbox in MATLAB.

Wavelet families

The toolbox supports a number of wavelet families:

>> waveletfamilies('f')
===================================
Haar                    haar
Daubechies              db
Symlets                 sym
Coiflets                coif
BiorSplines             bior
ReverseBior             rbio
Meyer                   meyr
DMeyer                  dmey
Gaussian                gaus
Mexican_hat             mexh
Morlet                  morl
Complex Gaussian        cgau
Shannon                 shan
Frequency B-Spline      fbsp
Complex Morlet          cmor
Fejer-Korovkin          fk
===================================

The following command shows how to get the list of wavelets in each of the families:

>> waveletfamilies('n')
===================================
Haar                    haar
===================================
Daubechies              db
------------------------------
db1 db2 db3 db4
db5 db6 db7 db8
db9 db10    db**
===================================
Symlets                 sym
------------------------------
sym2    sym3    sym4    sym5
sym6    sym7    sym8    sym**
===================================
Coiflets                coif
------------------------------
coif1   coif2   coif3   coif4
coif5
===================================
BiorSplines             bior
------------------------------
bior1.1 bior1.3 bior1.5 bior2.2
bior2.4 bior2.6 bior2.8 bior3.1
bior3.3 bior3.5 bior3.7 bior3.9
bior4.4 bior5.5 bior6.8
===================================
ReverseBior             rbio
------------------------------
rbio1.1 rbio1.3 rbio1.5 rbio2.2
rbio2.4 rbio2.6 rbio2.8 rbio3.1
rbio3.3 rbio3.5 rbio3.7 rbio3.9
rbio4.4 rbio5.5 rbio6.8
===================================
Meyer                   meyr
===================================
DMeyer                  dmey
===================================
Gaussian                gaus
------------------------------
gaus1   gaus2   gaus3   gaus4
gaus5   gaus6   gaus7   gaus8
===================================
Mexican_hat             mexh
===================================
Morlet                  morl
===================================
Complex Gaussian        cgau
------------------------------
cgau1   cgau2   cgau3   cgau4
cgau5   cgau6   cgau7   cgau8
===================================
Shannon                 shan
------------------------------
shan1-1.5   shan1-1 shan1-0.5   shan1-0.1
shan2-3 shan**
===================================
Frequency B-Spline      fbsp
------------------------------
fbsp1-1-1.5 fbsp1-1-1   fbsp1-1-0.5 fbsp2-1-1
fbsp2-1-0.5 fbsp2-1-0.1 fbsp**
===================================
Complex Morlet          cmor
------------------------------
cmor1-1.5   cmor1-1 cmor1-0.5   cmor1-1
cmor1-0.5   cmor1-0.1   cmor**
===================================
Fejer-Korovkin          fk
------------------------------
fk4 fk6 fk8 fk14
fk18    fk22
===================================

Working with Daubechies Wavelets

The short name for this family of wavelets is db.

Information about the wavelet family:

>> waveinfo('db')
 Information on Daubechies wavelets.

    Daubechies Wavelets

    General characteristics: Compactly supported
    wavelets with extremal phase and highest
    number of vanishing moments for a given
    support width. Associated scaling filters are
    minimum-phase filters.

    Family                  Daubechies
    Short name              db
    Order N                 N a positive integer from 1 to 45.
    Examples                db1 or haar, db4, db15

    Orthogonal              yes
    Biorthogonal            yes
    Compact support         yes
    DWT                     possible
    CWT                     possible

    Support width           2N-1
    Filters length          2N
    Regularity              about 0.2 N for large N
    Symmetry                far from
    Number of vanishing
    moments for psi         N

    Reference: I. Daubechies,
    Ten lectures on wavelets,
    CBMS, SIAM, 61, 1994, 194-202.
Decomposition and Reconstruction filters

Let’s construct the filters for ‘db4’ wavelet:

>> [LoD,HiD,LoR,HiR] = wfilters('db4');

Let’s plot the filters:

subplot(221);
stem(LoD, '.'); title('Lowpass Decomposition');
subplot(222);
stem(LoR,'.'); title('Lowpass Reconstruction');
subplot(223);
stem(HiD,'.'); title('Highpass Decomposition');
subplot(224);
stem(HiR,'.'); title('Highpass Reconstruction');
_images/db4_filters.png
Single Level Decomposition and Reconstruction

The dwt and idwt functions can be used for single level decomposition and reconstruction.

Let’s load a signal on which we will perform the decomposition:

load noisdopp;
plot(noisdopp);
_images/noisdopp.png

Let’s perform 1-level decomposition:

[approximation, detail] = dwt(noisdopp,LoD,HiD);

Let’s plot the decomposed approximation and detail components:

subplot(211);
plot(approximation); title('Approximation');
subplot(212);
plot(detail); title('Detail');
_images/noisdopp_db4_decomposition.png

Reconstruct the original signal using idwt:

reconstructed = idwt(approximation, detail,LoR,HiR);

Let’s measure the reconstruction error:

>> max_abs_diff = max(abs(noisdopp-reconstructed))

max_abs_diff =

   6.3300e-12
Multi-level Wavelet Decomposition

We can use the wavedec function for multi-level wavelet decomposition:

[coefficients, levels] = wavedec(noisdopp, 4, LoD, HiD);

Let’s plot the decomposition coefficients:

plot(coefficients); title('Coefficients');
_images/noisdopp_db4_l4_decomposition.png

Reconstruction from multi-level decomposition:

reconstructed = waverec(coefficients, levels, LoR, HiR);

Let’s verify the reconstruction error:

max_abs_diff = max(abs(noisdopp-reconstructed))
max_abs_diff =

   2.0627e-11

It is possible to look at the approximation coefficients at all levels:

for level=0:4
    level_app_coeffs = appcoef(coefficients, levels, LoR, HiR, level);
    subplot(511+level);
    plot(level_app_coeffs);
    title(sprintf('Approximation coefficients @ level-%d', level));
end
_images/noisdopp_db4_l4_appcoeffs.png

The level-0 coefficients are nothing but the original signal. The higher level approximation coefficients are increasingly smoother.

It is important to know how many levels of decomposition are possible. wmaxlev can be used for finding it out:

>> wmaxlev(numel(noisdopp),'db4')

ans =

     7

The normal wavelet decomposition creates more coefficients than there are in the original signal.

Let’s see how the number of coefficients increase with the level of decomposition:

>> for i=1:7
    [coefficients, levels] = wavedec(noisdopp,i, LoD,HiD);
    fprintf('%d ', numel(coefficients));
end

1030 1037 1044 1050 1056 1062 1068

For every additional level, 6 or 7 extra coefficients are introduced. This is because a normal convolution of a length-M signal with a length-N filter produces a signal of length M + N - 1.

The behavior is controlled by the DWT MODE. It defines how the signals are extended to complete the convolution.

The default mode is:

>> dwtmode

*******************************************************
**  DWT Extension Mode: Symmetrization (half-point)  **
*******************************************************
Decomposition with Periodic Extension

If we want to have a non-redundant wavelet decomposition, we can use the periodic extension DWT mode.

Changing the mode:

old_dwt_mode = dwtmode('status','nodisp');
dwtmode('per');

*****************************************
**  DWT Extension Mode: Periodization  **
*****************************************

Performing level 4 decomposition:

[coefficients, levels] = wavedec(noisdopp,4, LoD,HiD);

Verify that the coefficients array is of same length as signal:

>> numel(coefficients)

ans =

        1024

Verify that number of elements at different levels is changing by a factor of 2 always:

>> levels

levels =

          64 64 128 256 512 1024

Plot the coefficients:

plot(coefficients); title('Coefficients');
_images/noisdopp_db4_l4_decomposition_per.png

Reconstruct the signal:

reconstructed = waverec(coefficients, levels, LoR, HiR);

Verify that the reconstruction is fine:

max_abs_diff = max(abs(noisdopp-reconstructed))

max_abs_diff =

   2.0357e-11

Plot the approximation coefficients at all levels:

for level=0:4
    level_app_coeffs = appcoef(coefficients, levels, LoR, HiR, level);
    subplot(511+level);
    plot(level_app_coeffs);
    fprintf('%d ', numel(level_app_coeffs));
    title(sprintf('Approximation coefficients @ level-%d', level));
end

1024 512 256 128 64
_images/noisdopp_db4_l4_appcoeffs_per.png

The number of approximation coefficients is decreasing exactly by a factor of 2 in each level.

Restoring the old DWT mode:

% restore the old DWT mode
dwtmode(old_dwt_mode);
Synthesis and Analysis Orthonormal Bases

Daubechies wavelets are orthogonal. For the specific case where the DWT is decomposing a signal \(x \in \RR^N\) to a representation \(\alpha \in \RR^N\) (in the periodic extension case), the transformation can be represented by an equation

\[x = \Psi \alpha\]

where \(\Psi\) is an Orthonormal basis (ONB) for \(\RR^N\) synthesizing the signal \(x\) from the representation \(\alpha\).

The decomposition process is represented by

\[\alpha = \Psi^T x.\]

We can easily construct the matrix \(\Psi^T\). Its \(i\)-th column can be obtained by computing \(\Psi^T e_i\), where \(e_i\) is the standard unit vector in the \(i\)-th direction of \(\RR^N\).

We will construct the decomposition matrix for the ‘db4’ wavelet and level 4 decomposition. The size of the signal would be \(N=1024\):

[LoD,HiD,LoR,HiR] = wfilters('db4');

N = 1024;
L = 4;

Let’s make sure that we are using per mode:

old_dwt_mode = dwtmode('status','nodisp');
dwtmode('per');

Let’s construct \(\Psi^T\):

PsiT = zeros(N, N);
for i=1:N
    unit_vec = zeros(N, 1);
    unit_vec(i) = 1;
    [coefficients, levels] = wavedec(unit_vec, L, LoD,HiD);
    PsiT(:, i) = coefficients;
end

Let’s verify that the rows of \(\Psi^T\) are unit norm:

>> norms = spx.norm.norms_l2_rw(PsiT);
fprintf('norms: min: %.4f, max: %.4f\n', min(norms), max(norms));

norms: min: 1.0000, max: 1.0000

Let’s get the corresponding synthesis matrix \(\Psi\)

Psi = PsiT';

Let’s verify that it is indeed an orthonormal basis:

>> max(max(abs(Psi * Psi' - eye(N))))

ans =

   1.8573e-12

We should also verify that applying the matrices \(\Psi^T\) and \(\Psi\) gives the same results as the wavedec and waverec functions.

Let’s load our sample signal:

load noisdopp;
%  make it a column vector
noisdopp = noisdopp';

Let’s construct its representation by wavedec:

[a1, levels] = wavedec(noisdopp, L, LoD, HiD);

Let’s construct its representation by \(\Psi^T\):

a2 = PsiT * noisdopp;

Let’s compare if they match:

>> fprintf('Decomposition diff: %e\n', max(a1 - a2));
Decomposition diff: 2.486900e-14

They indeed match. Now, let’s reconstruct the signal through both ways. First using waverec:

x1  = waverec(a1, levels, LoR, HiR);

Now using \(\Psi\)

x2 = Psi * a2;

Compare them:

fprintf('Synthesis diff: %e\n', max(x1 - x2));
Synthesis diff: 1.065814e-14

It’s working great.

Finally, don’t forget to restore the older DWT mode:

dwtmode(old_dwt_mode);

It is instructive to visualize the basis \(\Psi\):

colormap('gray');
imagesc(Psi);
colorbar;
_images/wavelet_basis_db4_level_4_N_1024.png

The matrix is sparse. In fact only 3% of its entries are non-zero:

>> nnz(Psi) / (N*N)

ans =

    0.0283

This is expected since wavelets have a very small support.

MATLAB provides a function for constructing a dictionary from one or more orthonormal or biorthogonal bases. Let's try to construct our ONB matrix using this function:

PsiMP = wmpdictionary(N, 'lstcpt', {{'db4', 4}});

Let’s verify that the two approaches are giving us same result:

>> max(max(abs(PsiMP - Psi)))

ans =

   7.9581e-13

A quick note, the wmpdictionary function returns a sparse matrix.

Complete example code can be downloaded here.

Stationary Wavelet Transform

DWT is not translation invariant. In some applications, translation invariance is important. The Stationary Wavelet Transform (SWT) overcomes this limitation by removing the upsampling and downsampling steps from the DWT. It is a highly redundant transform.

In MATLAB, it is implemented by the swt function.

swt is defined using periodic extension and does not involve any downsampling. The approximation and detail coefficients computed at each level have the same length as the original signal.

Let us construct a level 4 decomposition:

coefficients = swt(noisdopp, 4, LoD,HiD);

Let’s plot the approximation and detail coefficients:

for level=0:4
    subplot(511+level);
    plot(coefficients(level+1, :));
    title(sprintf('SWT Coefficients @level-%d', level));
end
_images/noisdoop_db4_l4_swt.png

Detection, Classification and Estimation

Binary Hypothesis Testing

Generate a sequence of bits:

% Number of bits being transmitted
B = 1000*100;
transmittedBits = randi(2, B , 1)  - 1;

Modulation:

% Number of samples per detection test.
N = 10;
% The signal shape
signal = ones(N, 1);
transmittedSequence = SPX_Modulator.modulate_bits_with_signals(transmittedBits, signal);

Adding noise:

sigma = 1;
noise = sigma * randn(size(transmittedSequence));
% We add noise to transmitted data to create received sequence
receivedSequence = transmittedSequence + noise;

Matched filtering:

matchedFilterOutput = SPX_MatchedFilter.filter(receivedSequence, signal);

Generating sufficient statistics:

signalNormSquared = signal' * signal;
sufficientStatistics = matchedFilterOutput / signalNormSquared;

Thresholding:

% We define optimal detection threshold
eta = 0.5;
% We create the received bits
receivedBits = sufficientStatistics >= eta;

Detection results:

result = SPX_BinaryHypothesisTest.performance(...
    transmittedBits, receivedBits)

% Number of False sent, False received
result.FF
% Number of False sent, True received
result.FT
% Number of True sent, False received
result.TF
% Number of True sent, True received
result.TT
% Number of times hypothesis 0 was sent.
result.H0
% Number of times hypothesis 1 was sent.
result.H1
% Number of times 0 was detected.
result.D0
% Number of times 1 was detected.
result.D1
% A priori probability of 0
result.P0
% A priori probability of 1
result.P1
% Detection probability
result.PD
% False alarm probability
result.PF
% Miss probability
result.PM
% Accuracy (probability of correct decisions)
result.Accuracy
% Probability of error
result.Pe
% Precision : Truth sent given that truth was detected
result.Precision
% Recall : Truth detected given that truth was sent.
result.Recall
% F1 score
result.F1

ECG

A Short Review of ECG Signals

https://upload.wikimedia.org/wikipedia/commons/9/9e/SinusRhythmLabels.svg

The structure of an ECG signal. Courtesy: Wikipedia.

General Features

P wave

  • P wave has a duration less than 120 msec with frequencies below 10-15 Hz.

QRS complex

  • QRS wave has a duration of about 70-110 msec with frequencies in 10-50 Hz.

T wave

  • It is similar in frequency content to P wave.

PQ segment

  • PQ segment lasts about 80 msec.

Computational Complexity

Introduction

This chapter provides a framework for the analysis of the computational complexity of sparse recovery algorithms. See [GVL12] for a detailed study of matrix computations.

The table below summarizes the flop counts for various basic operations. A detailed derivation of these flop counts is presented in Basic Operations.

Summary of flop counts for various basic operations
Operation Description Parameters Flop Counts
\(y = \text{abs}(x)\) Absolute values \(x \in \RR^n\) \(n\)
\(\langle x, y \rangle\) Inner product \(x, y \in \RR^n\) \(2n\)
\([v, i] = \text{max}(\text{abs}(x))\) Find maximum value by magnitude \(x \in \RR^n\) \(2n\)
\(y = A x\) Matrix vector multiplication \(A \in \RR^{m \times n}, x \in \RR^n\) \(2mn\)
\(C = AB\) Matrix multiplication \(A \in \RR^{m \times n}, B \in \RR^{n \times p}\) \(2mnp\)
\(y = A x\) \(A\) is diagonal \(A \in \RR^{n\times n}, x \in \RR^n\) \(n\)
\(y = A x\) \(A\) is lower triangular \(A \in \RR^{n\times n}, x \in \RR^n\) \(n(n+1)\)
\((I + u v^T)x\)   \(x, u, v \in \RR^n\) \(4n\)
\(G = A^TA\) Gram matrix (symmetric) \(A \in \RR^{m \times n}\) \(mn^2\)
\(F = AA^T\) Frame operator (symmetric) \(A \in \RR^{m \times n}\) \(nm^2\)
\(\| x \|_2^2\) Squared \(\ell_2\) norm \(x \in \RR^n\) \(2n - 1\)
\(\| x \|_2\) \(\ell_2\) norm \(x \in \RR^n\) \(2n\)
\(x(:) = c\) Set to a constant value \(x \in \RR^n\) \(n\)
Swap rows in \(A\) elementary row operation \(A \in \RR^{m \times n}\) \(3n\)
\(A(i, :) = \alpha A(i, :)\) Scale a row \(A \in \RR^{m \times n}\) \(2n\)
Solve \(L x = b\) Lower triangular system \(L \in \RR^{n \times n}\) \(n^2\)
Solve \(U x = b\) Upper triangular system \(U \in \RR^{n \times n}\) \(n^2\)
Solve \(Ax =b\) Gaussian elimination, \(A\) full rank \(A\in \RR^{n \times n}\) \(\frac{2\, n^3}{3} + \frac{n^2}{2} - \frac{7\, n}{6}\)
\(A = QR\) QR factorization \(A \in \RR^{m \times n}\) \(2mn^2\)
Solve \(\| A x - b \|_2^2\) Least squares through QR \(A \in \RR^{m \times n}\) \(2mn^2 + 2mn + n^2\)
\(A^TA x = A^T b\) Least squares through Cholesky \(A^T A = L L^T\) \(A \in \RR^{m \times n}\) \(mn^2 + \frac{1}{3} n^3\)

Basic Operations

Essential operations in the implementation of a numerical algorithm are addition, multiplication, comparison, load and store. Numbers are stored in floating point representation and a dedicated floating point unit performs the arithmetic; these operations are known as floating point operations (flops). A typical update operation \(b \leftarrow b + x y\) (a.k.a. multiply and add) involves two flops (one floating point multiplication and one floating point addition). Subtraction costs the same as addition. A division is usually counted as 4 flops in the HPC community, since a more sophisticated procedure is invoked in the floating point hardware; for our purposes, we will count a division as a single flop, as it is a rare operation and doesn't affect the overall flop count asymptotically. A square root can take about 6 flops on typical CPU architectures but, following [TBI97], we will also treat it as a single flop. We ignore the costs of load and store operations. We also usually ignore the costs of decision making operations and integer counters. Finally, we treat real and complex arithmetic as costing the same number of flops to keep the analysis simple.

Let \(x, y \in \RR^n\) be two vectors, then their inner product is computed as

\[\langle x, y \rangle = \sum_{i=1}^n x_i y_i.\]

This involves \(n\) multiplications and \(n-1\) additions. Total operation count is \(2n - 1\) flops. If we implement this as a sequence of multiply and add operation starting with \(0\), then this will take \(2n\) flops. We will use this simpler expression. Addition and subtraction of \(x\) and \(y\) takes \(n\) flops. Scalar multiplication takes \(n\) flops.

Multiplication

Let \(A \in \RR^{m \times n}\) be a real matrix and \(x \in \RR^n\) be a vector. Then \(y = A x \in \RR^m\) is their matrix-vector product. A straight-forward implementation consists of taking inner product of each row of \(A\) with \(x\). Each inner product costs \(2n\) flops. There are \(m\) such inner products computed. Total operation count is \(2mn\). When two matrices \(A \in \RR^{m \times n}\) and \(B \in \RR^{n \times p}\) are multiplied, the operation count is \(2mnp\).

There are specialized matrix-matrix multiplication algorithms which can reduce the flop count, but we would be content with this result. If \(A\) has a certain structure [e.g. Fourier Transform], then specialized algorithms may compute the product much faster. We will not be concerned with this at the moment. Also, partitioning of a matrix into blocks and using block versions of fundamental matrix operations helps a lot in improving the memory traffic and can significantly improve the performance of the algorithm on real computers, but this doesn’t affect the flop count and we won’t burden ourselves with these details.

If \(A\) is diagonal (with \(m=n\)), then \(Ax\) can be computed in \(n\) flops. If \(A\) is lower triangular (with \(m=n\)), then \(Ax\) can be computed in \(n(n+1)\) flops. Here is a quick way to compute \((I + uv^T)x\): Compute \(c = v^T x\) (\(2n\) flops), then compute \(w = c u\) (\(n\) flops), then compute \(w + x\) (\(n\) flops). The total is \(4n\) flops.

The Gram Matrix \(G = A^T A\) (for \(A \in \RR^{m \times n}\)) is symmetric of size \(n \times n\). We need to calculate only the upper triangular part and we can fill the lower triangular part easily. Each row vector of \(A^T\) and column vector of \(A\) belong to \(\RR^{m}\). Their inner product takes \(2m\) flops. We need to compute \(n(n+1)/2\) such inner products. The total flop count is \(mn(n+1) \approx mn^2\). Similarly, the frame operator \(AA^T\) is symmetric requiring \(nm(m+1) \approx nm^2\) flops.

Squared norm of a vector \(\| x \|_2^2 = \langle x, x \rangle\) can be computed in \(2n-1\) flops. Norm can be computed in \(2n\) flops.

Elementary row operations

There are a few memory operations for which we need to assign flop counts. Setting a vector \(x \in \RR^n\) to zero (or any constant value) will take \(n\) flops. Swapping two rows of a matrix \(A\) (with \(n\) columns) takes \(3n\) flops.

Scaling a row of \(A\) takes \(n\) flops. Scaling a row and adding to another row takes \(2n\) flops.

Back and forward substitution

Given an upper triangular matrix \(L \in \RR^{n \times n}\), solving the equation \(L x = b\) takes \(n^2\) flops. This can be easily proved by induction. The case for \(n=1\) is trivial (requiring 1 division). Assume that the flop count is valid for \(1\dots n-1\). For \(n \times n\) matrix \(L\), let the top most row equation be

\[l_{11} x_1 + \sum_{k=2}^n l_{1k} x_k = b_1\]

where \(x_2 \dots x_n\) have already been determined in \((n-1)^2\) flops. Solving for \(x_1\) requires \(2n - 3 + 1 + 1 = 2n - 1\) flops (\(2n-3\) flops for the sum, one subtraction from \(b_1\), and one division). The total is \((n-1)^2 + 2n - 1 = n^2\). The flop count for forward substitution is also \(n^2\).
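
As an illustration, here is a minimal MATLAB sketch of back substitution for an upper triangular system (the function and variable names are ours, not the library's):

function x = back_substitution(U, b)
% Solve U x = b where U is upper triangular and nonsingular (illustrative sketch).
n = length(b);
x = zeros(n, 1);
for i = n:-1:1
    % subtract the contribution of the already computed unknowns (2(n-i) flops)
    s = b(i) - U(i, i+1:n) * x(i+1:n);
    x(i) = s / U(i, i);   % one division per row
end
end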

Gaussian elimination

Let \(A \in \RR^{n \times n}\) be a full rank matrix and let us look at the Gaussian elimination process for solving the equation \(A x = y\) for a given \(y\) and unknown \(x\). As the pivot column shifts during Gaussian elimination, the number of columns involved keeps reducing. The first pivot is \(a_{11}\). For the \(i\)-th row beneath the first row, computing the ratio \(a_{11} / a_{i1}\) takes one flop (recall that we count a division as a single flop), scaling the row with this value takes \(n\) flops, and subtracting the first row from it takes \(n\) flops, for a total of \(2n+1\) flops per row. We repeat the same for the \(n-1\) rows below the pivot, giving \((2n+1)(n-1)\) flops. More generally, for the \(i\)-th pivot in the \(i\)-th row, the number of columns involved is \(n-i+1\) and the number of rows below it is \(n-i\). The flop count for zeroing out the entries below the pivot is therefore \((2(n-i+1)+1)(n-i)\). Summing over \(i\) from \(1\) to \(n\), we obtain:

\[\sum_{i=1}^n (2(n-i+1)+1)(n-i) = \frac{2\, n^3}{3} + \frac{n^2}{2} - \frac{7\, n}{6} .\]

For a \(2\times 2\) matrix, this is 5 flops. For a \(3\times 3\) matrix, this is \(19\) flops. Actually, substituting \(n-i+1\) by \(k\), we can rewrite the sum as:

\[\sum_{k=1}^n (2k+1)(k -1) = \frac{2\, n^3}{3} + \frac{n^2}{2} - \frac{7\, n}{6} .\]

An additional \(n^2\) flops are required for the back substitution part.
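
For concreteness, here is a bare-bones MATLAB sketch of the row-scaling elimination variant counted above, followed by back substitution. It is illustrative only: it assumes no pivoting is needed and the variable names are ours.

function x = gaussian_elimination(A, y)
% Solve A x = y by the row-scaling elimination variant counted above.
% Assumes A is square, full rank and needs no pivoting (illustrative only).
n = length(y);
Ab = [A, y];                    % augmented matrix
for i = 1:n-1
    for r = i+1:n
        c = Ab(i, i) / Ab(r, i);                          % multiplier a_ii / a_ri (one division)
        Ab(r, i:end) = c * Ab(r, i:end) - Ab(i, i:end);   % scale row r and subtract row i
    end
end
% back substitution on the resulting upper triangular system
x = zeros(n, 1);
for i = n:-1:1
    x(i) = (Ab(i, end) - Ab(i, i+1:n) * x(i+1:n)) / Ab(i, i);
end
end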

QR factorization

We factorize a full column rank matrix \(A \in \RR^{m \times n}\) as \(A = QR\), where \(Q \in \RR^{m \times n}\) has orthonormal columns (\(Q^TQ = I\)) and \(R \in \RR^{n \times n}\) is an upper triangular matrix. This can be computed in approximately \(2mn^2\) flops using the modified Gram-Schmidt algorithm presented below.

Modified Gram-Schmidt Algorithm:

for k = 1 to n:
    v_k ← a_k                      (initialize working vectors with the columns of A)
for k = 1 to n:
    r_{kk} ← ||v_k||_2             (compute norm)
    q_k ← v_k / r_{kk}             (normalize)
    for j = k+1 to n:
        r_{kj} ← q_k^T v_j         (compute projection)
        v_j ← v_j - r_{kj} q_k     (subtract projection)

Most of the time of the algorithm is spent in the inner loop on \(j\). Projection of \(v_j\) on \(q_k\) is computed in \(2m-1\) flops. It is subtracted from \(v_j\) in \(2m\) flops. Projection of \(q_k\) is subtracted from remaining \((n-k)\) vectors requiring \((n-k)(4m-1)\) flops. Summing over \(k\), we get:

\[\sum_{k=1}^n (n-k)(4m-1) = \frac{n}{2} - 2m n + 2mn^2 - \frac{n^2}{2}.\]

Computing norm \(r_{kk}\) requires \(2m\) flops. Computing \(q_k\) requires \(m+1\) flops (1 inverse and \(m\) multiplications). These contribute \((3m+1)n\) flops for \(n\) columns. Initialization of \(Q\) matrix can be absorbed into the normalization step requiring no additional flops. Thus, the total flop count is \(\frac{3n}{2} + m n + 2mn^2 - \frac{n^2}{2} \approx 2mn^2\).
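
As an illustration, here is a minimal MATLAB sketch of modified Gram-Schmidt QR; the function name mgs_qr and the variable names are ours (the library's own QR routines are listed later under Matrix factorization algorithms):

function [Q, R] = mgs_qr(A)
% Modified Gram-Schmidt QR factorization (illustrative sketch).
[~, n] = size(A);
Q = A;                 % the working vectors v_k, overwritten in place by q_k
R = zeros(n, n);
for k = 1:n
    R(k, k) = norm(Q(:, k));                    % r_kk = ||v_k||_2
    Q(:, k) = Q(:, k) / R(k, k);                % q_k = v_k / r_kk
    for j = k+1:n
        R(k, j) = Q(:, k)' * Q(:, j);           % projection coefficient r_kj
        Q(:, j) = Q(:, j) - R(k, j) * Q(:, k);  % subtract projection
    end
end
end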

A variation of this algorithm is presented below. In this version \(Q\) and \(R\) matrices are computed column by column from \(A\) matrix. This allows for incremental update of \(QR\) factorization of \(A\) as more columns in \(A\) are added. This variation is very useful in efficient implementation of algorithms like Orthogonal Matching Pursuit.

_images/alg_mgs.png

Again, the inner loop requires \(4m-1\) flops. This loop is run \(k-1\) times. We have \(\sum_{k=1}^n (k-1)= \sum_{k=1}^n (n - k)\). Thus, flop counts are identical.

Least Squares

The standard least squares problem of minimizing \(\| A x - b\|_2^2\), where \(A\) is a full column rank matrix, can be solved using various methods. The solution can be obtained by solving the normal equations \(A^T A x = A^T b\). Since the Gram matrix \(A^T A\) is symmetric positive definite, solutions faster than Gaussian elimination are applicable.

QR factorization

We write \(A = QR\). Then an equivalent formulation of the normal equations is \(R x = Q^T b\). The solution is obtained in 3 steps: a) Compute the \(QR\) factorization of \(A\). b) Form \(d = Q^T b\). c) Solve \(R x = d\) by back substitution. The total cost of the solution is \(2mn^2 + 2mn + n^2\) flops. We refrain from dropping the lower order terms, as we will be using an incremental-QR-update based series of least squares problems in the sequel.
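
A minimal MATLAB sketch of these three steps, using the built-in economy-size qr (not the library's solvers), looks as follows:

% Solve min ||A x - b||_2 for a full column rank A (illustrative sketch).
[Q, R] = qr(A, 0);   % step (a): economy-size QR; Q is m x n, R is n x n
d = Q' * b;          % step (b): 2mn flops
x = R \ d;           % step (c): back substitution, n^2 flops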

Cholesky factorization

We calculate \(G = A^T A\). We then perform the Cholesky factorization \(G = LL^T\). We compute \(d = A^T b\). We solve \(Lz = d\) by forward substitution and \(L^T x = z\) by back substitution. The total flop count is approximately \(mn^2 + (1/3) n^3 + 2mn + n^2 + n^2\) flops. For large \(m, n\), the cost is approximately \(mn^2 + (1/3) n^3\). QR factorization is numerically more stable, though Cholesky is faster. Cholesky factorization can be significantly faster if \(A\) is a sparse matrix; otherwise QR factorization is the preferred approach.
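 
The corresponding MATLAB sketch, using the built-in chol (again, not the library's solvers), is:

% Solve the normal equations via Cholesky factorization (illustrative sketch).
G = A' * A;            % Gram matrix, approximately m n^2 flops
L = chol(G, 'lower');  % G = L L', approximately n^3 / 3 flops
d = A' * b;            % 2 m n flops
z = L \ d;             % forward substitution, n^2 flops
x = L' \ z;            % back substitution, n^2 flops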

Incremental QR factorization

Let us spend some time looking at the QR based solution differently. Let \(A = \begin{bmatrix} a_1 & a_2 & \dots & a_n \end{bmatrix}\). Let \(A_k\) be the submatrix consisting of the first \(k\) columns of \(A\). Let the QR factorization of \(A_k\) be \(Q_k R_k\). Let \(x_k\) be the solution of the least squares problem of minimizing \(\| A_k x_k - b \|_2^2\). We form \(d_k = Q_k^T b\) and solve \(R_k x_k = d_k\) via back substitution.

Similarly, QR factorization of \(A_{k+1}\) is \(Q_{k+1} R_{k+1}\). We can write

\[\begin{split}A_{k+1} = \begin{bmatrix}A_k & a_{k+1}\end{bmatrix}, \quad Q_{k+1} = \begin{bmatrix}Q_k & q_{k+1}\end{bmatrix}, \quad R_{k+1} = \begin{bmatrix} R_k & r_{k+1}\\ 0 & r_{k+1, k+1} \end{bmatrix}\end{split}\]

The \(k\) entries in the vector \(r_{k+1}\) are computed as per the inner loop of the modified Gram-Schmidt algorithm above. Computing and subtracting the projection of \(a_{k+1}\) onto each normalized column in \(Q_k\) requires \(4m-1\) flops, and this loop is run \(k\) times. Computing the norm and performing the division requires \(3m+1\) flops. The whole QR update step therefore requires \(k(4m-1) + 3m + 1\) flops. Clearly, the first \(k\) entries in \(d_{k+1}\) are identical to \(d_k\); we just need to compute the last entry as \(q_{k+1}^T b\) (requiring \(2m\) flops). Back substitution requires all \((k+1)^2\) flops. The total number of flops required for solving the \((k+1)\)-th least squares problem is \(k(4m-1) + 3m + 1 + 2m + (k+1)^2\). Summing over \(k=0\) to \(n-1\), we get

\[\sum_{k=0}^{n-1} k(4m-1) + 3m + 1 + 2m + (k+1)^2 = \frac{5\, n}{3} + 3\, m\, n + 2\, m\, n^2 + \frac{n^3}{3}.\]

Compare this with the flop count of the QR factorization based least squares solution for the whole matrix \(A\): \(2mn^2 + 2mn + n^2\). Asymptotically (with \(n < m\)), the incremental approach stays close to \(2mn^2\), the operation count for solving the full least squares problem. This approach thus gives us a whole series of solutions without sacrificing much in computational complexity.
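
A minimal MATLAB sketch of appending one column to an existing economy-size factorization is shown below. It uses the classical Gram-Schmidt block form of the update (equivalent in cost to the column-wise loop discussed above); all variable names (Q, R, a_new, d, b) are ours:

% Append column a_new to an existing economy-size factorization A_k = Q * R,
% where Q has orthonormal columns and R is upper triangular (illustrative sketch).
r_new = Q' * a_new;                 % projections onto the existing q vectors
v     = a_new - Q * r_new;          % component of a_new orthogonal to range(Q)
rho   = norm(v);                    % new diagonal entry r_{k+1,k+1}
Q     = [Q, v / rho];               % append the normalized column q_{k+1}
R     = [R, r_new; zeros(1, size(R, 2)), rho];
% Only the last entry of d = Q^T b is new; back substitution gives the new solution.
d = [d; Q(:, end)' * b];
x = R \ d;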

Orthogonal Matching Pursuit

We are modeling a signal \(y \in \RR^M\) in a dictionary \(\Phi \in \RR^{M \times N}\) consisting of \(N\) atoms as \(y = \Phi x + r\) where \(r\) is the approximation error. Our objective is to construct a sparse model \(x \in \RR^N\). \(\Lambda = \supp(x)\) is the set of indices on which \(x_i\) is non-zero. \(K = \| x \|_0 = | \supp(x) |\) is the so called \(\ell_0\)-“norm” of \(x\) which is the number of non-zero entries in \(x\).

A sparse recovery or approximation algorithm need not provide the full vector \(x\). It can provide the positions of non-zero entries \(\Lambda\) and corresponding values \(x_{\Lambda}\) requiring \(2K\) units of storage where \(x_{\Lambda} \in \RR^{K}\) consists of entries from \(x\) indexed by \(\Lambda\). \(\Phi_{\Lambda}\) denotes the submatrix constructed by picking columns indexed by \(\Lambda\).

Orthogonal Matching Pursuit is presented below.

_images/algorithm_orthogonal_matching_pursuit.png

OMP builds the support incrementally. In each iteration, one more atom is added to the support set for \(y\). We terminate the algorithm either after a fixed number of iterations \(K\) or when the norm of the residual \(\| y - \Phi x \|_2\) falls below a specified threshold.

The following analysis assumes that the main loop of OMP runs for \(K\) iterations. The iteration counter \(k\) varies from \(1\) to \(K\) and is incremented at the beginning of each iteration. Note that \(K \leq M\).

The matching step requires the multiplication of \(\Phi^T \in \RR^{N \times M}\) with \(r^{k-1}\in \RR^{M}\) (the residual after \(k-1\) iterations). It requires at most \(2MN\) flops. OMP has the property that the residual after the \(k\)-th iteration is orthogonal to the space spanned by the atoms selected up to the \(k\)-th iteration, \(\{\phi_{\lambda_1}\dots \phi_{\lambda_k}\}\). Thus, the inner products of the previously selected atoms with \(r^{k-1}\) are 0 and we can safely skip these columns. This reduces the flop count to \(2M(N-k+1)\).

Identification step requires \(2N\) flops. This includes \(N\) flops for taking absolute values and \(N\) flops for finding the maximum.

\(\Lambda\) is easily implemented in the form of an array whose length is indicated by the iteration counter \(k\). A large array (of size \(M\)) can be allocated in advance for maintaining \(\Lambda\). Thus, support update operation requires a single flop and we will ignore it. \(\Lambda^{k}\) contains \(k\) indices.

While the algorithm shows the full sparse vector \(x\), in practice, we only need to reserve space for \(x_{\Lambda}\) which is an array with maximum size of \(M\) and can be preallocated. \(\Phi_{\Lambda^{k}}\) need not be stored separately. This can be obtained from \(\Phi\) by proper indexing operations. Its size is \(M \times k\).

Let’s skip the least squares step for updating representation for the moment.

Once \(x^{k}_{\Lambda^{k}}\) has been computed, computing the approximation \(y^{k}\) takes \(2Mk\) flops.

Updating the residual \(r^{k}\) takes \(M\) flops as both \(y\) and \(y^{k}\) belong to \(\RR^{M}\). Updating iteration counter takes 1 flop and can be ignored.

Least Squares through QR Update

Let’s come back to the least squares step. Assume that \(\Phi_{\Lambda^{k-1}}\) has a QR decomposition \(Q_{k-1} R_{k-1}\). Adding \(\phi_{\lambda^{k}}\) to form \(\Phi_{\Lambda^k}\) requires updating the QR decomposition to \(Q_{k} R_{k}\). Following the incremental QR factorization discussion above, computing and subtracting the projection of \(\phi_{\lambda^{k}}\) onto each normalized column in \(Q_{k-1}\) requires \(4M-1\) flops, and this loop is run \(k-1\) times. Computing the norm and performing the division requires \(3M+1\) flops. The whole QR update step requires \((k-1)(4M-1) + 3M + 1\) flops. We assume that enough space has been preallocated to maintain \(Q_k\) and \(R_k\). Solving the least squares problem requires the additional steps of updating the projection \(d = Q_k^T y\) (\(2M\) flops for the new entry in \(d\)) and solving \(R_k x = d\) by back substitution (\(k^2\) flops). Thus, the QR update based least squares solution requires \((k-1)(4M-1) + 3M + 1 + 2M + k^2\) flops.

Refer to the table below for a summary of all the steps.

Finally, we can put together the cost of all steps in the main loop of OMP as

\[2M(N-k+1) + 2N + 2Mk + M + (k-1)(4M-1) + 3M + 1 + 2M + k^2.\]

This simplifies to \(4\,M+2\,N-k+4\,M\,k+k^2+2\,M\,N+2\). Summing over \(k \in \{1,\dots, K\}\), we obtain

\[\frac{5\, K}{3} + 2\, K^2\, M + \frac{K^3}{3} + 6\, K\, M + 2\, K\, N + 2\, K\, M\, N.\]

For a specific setting of \(K = \sqrt{M} / 2\) and \(M = N/2\), we get

\[\frac{5\,\sqrt{2}\,\sqrt{N}}{12}+\frac{121\,\sqrt{2}\,N^{3/2}}{96}+\frac{\sqrt{2}\,N^{5/2}}{4}+\frac{N^2}{8} \approx \frac{\sqrt{2}\,N^{5/2}}{4}.\]

In terms of \(M\), it will simplify to:

\[\frac{M^2}{2}+\frac{5\,\sqrt{M}}{6}+\frac{121\,M^{3/2}}{24}+2\,M^{5/2} \approx 2\,M^{5/2}.\]

In a typical sparse approximation problem, we have \(K < M \ll N\). Thus, the flop count will be approximately \(2KMN\).

Total flop count of matching step over all iterations is \(K\, M - K^2\, M + 2\, K\, M\, N\). Total flop count of least squares step over all iterations is \(\frac{5\, K}{3} + 2\, K^2\, M + \frac{K^3}{3} + 3\, K\, M\). This suggests that the matching step is the dominant step for OMP.

Operations in OMP using QR update (flops in iteration \(k\)):

  • \(\Phi^T r\) : \(2M(N - k + 1)\)
  • Identification : \(2N\)
  • \(y^{k} = \Phi_{\Lambda^{k}} x_{\Lambda^{k}}^{k}\) : \(2Mk\)
  • \(r^k = y - y^k\) : \(M\)
  • QR update : \((k-1)(4M-1) + 3M + 1\)
  • Update \(d = Q_k^T y\) : \(2M\)
  • Solve \(R_k x = d\) : \(k^2\)

Least Squares through Cholesky Update

If the OMP least squares step is computed through Cholesky decomposition, then we maintain the Cholesky decomposition of \(G = \Phi_{\Lambda}^T \Phi_{\Lambda}\) as \(G = L L^T\). Then

\[\begin{split}\begin{aligned} &x = \Phi_{\Lambda}^{\dag} y\\ \iff & x = (\Phi_{\Lambda}^T \Phi_{\Lambda})^{-1} \Phi_{\Lambda}^T y\\ \iff & (\Phi_{\Lambda}^T \Phi_{\Lambda}) x = \Phi_{\Lambda}^T y\\ \iff & LL^T x = \Phi_{\Lambda}^T y = b \end{aligned}\end{split}\]

In each iteration, we need to update \(L_k\), compute \(b = \Phi_{\Lambda}^T y\), solve \(L u = b\) and then solve \(L^T x = u\). Now,

\[\begin{split}\Phi_{\Lambda^k}^T \Phi_{\Lambda^k} = \begin{bmatrix} \Phi_{\Lambda^{k-1}}^T \Phi_{\Lambda^{k-1}} & \Phi_{\Lambda^{k-1}}^T \phi_{\lambda^k}\\ \phi_{\lambda^k}^T \Phi_{\Lambda^{k-1}} & \phi_{\lambda^k}^T \phi_{\lambda^k} \end{bmatrix}.\end{split}\]

Define \(v = \Phi_{\Lambda^{k-1}}^T \phi_{\lambda^k}\). We have

\[\begin{split}G^k = \begin{bmatrix} G^{k - 1} & v \\ v^T & 1 \end{bmatrix}.\end{split}\]

The Cholesky update is given by:

\[\begin{split}L^k = \begin{bmatrix} L^{k - 1} & 0 \\ w^T & \sqrt{1 - w^T w} \end{bmatrix}\end{split}\]

where solving \(L^{k - 1} w = v\) gives us \(w\). For the first iteration, \(L^1 = 1\) since the atoms in \(\Phi\) are normalized.

Computing \(v\) would take \(2M(k-1)\) flops. Computing \(w\) would take \((k-1)^2\) flops. Computing \(\sqrt{1-w^T w}\) would take another \(2k\) flops. Thus, Cholesky update requires \(2M(k-1) + 2k + (k-1)^2\) flops. Then computing \(b = \Phi^T_{\Lambda} y\) requires only updating the last entry in \(b\) which requires \(2M\) flops. Solving \(LL^T x = b\) requires \(2k^2\) flops.
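
A minimal MATLAB sketch of this update for one OMP iteration with \(k \geq 2\) is given below (for the first iteration the factor is simply \(L = 1\)). The variable names are ours: PhiSub holds the previously selected atoms, phi_new the newly selected atom, L the current Cholesky factor, b the current \(\Phi_{\Lambda}^T y\), and y the signal.

% Rank-one extension of the Cholesky factor (assumes unit-norm atoms).
v = PhiSub' * phi_new;        % v = Phi_Lambda^T phi_new, 2M(k-1) flops
w = L \ v;                    % solve L w = v by forward substitution
L = [L, zeros(size(L, 1), 1); w', sqrt(1 - w' * w)];
% Update the right hand side b = Phi_Lambda^T y (only the last entry is new).
b = [b; phi_new' * y];
% Solve L L^T x = b by forward and back substitution.
x = L' \ (L \ b);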

Operations in OMP using Cholesky update (flops in iteration \(k\)):

  • \(\Phi^T r\) : \(2M(N - k + 1)\)
  • Identification : \(2N\)
  • \(y^{k} = \Phi_{\Lambda^{k}} x_{\Lambda^{k}}^{k}\) : \(2Mk\)
  • \(r^k = y - y^k\) : \(M\)
  • Cholesky update : \(2M(k-1) + 2k + (k-1)^2\)
  • Update \(b = \Phi^T_{\Lambda} y\) : \(2M\)
  • Solve \(LL^T x = b\) : \(2k^2\)

We can see that for \(k \ll M\), the QR update costs around \(4Mk\) flops while the Cholesky update costs around \(2Mk\) flops (asymptotically).

The flop count for the main loop of OMP using the Cholesky update is

\[3\,k^2+2\,M\,k+3\,M+2\,N+2\,M\,N+1.\]

Summing over \(k \in [K]\), we get the total flop count for OMP as

\[\frac{3\,K}{2}+K^2\,M+\frac{3\,K^2}{2}+K^3+4\,K\,M+2\,K\,N+2\,K\,M\,N.\]

For the specific setting of \(K = \sqrt{M} / 2\) and \(M = N/2\), in terms of \(M\) this simplifies to:

\[\frac{3\,M}{8}+\frac{M^2}{4}+\frac{3\,\sqrt{M}}{4}+\frac{33\,M^{3/2}}{8}+2\,M^{5/2} \approx 2\,M^{5/2}.\]

In a typical sparse approximation problem, we have \(K < M \ll N\). Thus, the flop count will be approximately \(2KMN\), i.e. dominated by the matching step.

The Cholesky update based solution is marginally faster than the QR update based solution for small values of \(M\).

Sorting

We sometimes need sorting and searching operations on arrays of numbers in numerical algorithms. This section summarizes results related to the number of operations needed to perform various sorting and searching tasks on arrays. These results are collected from, or based on the approach in, [SF13]. The fundamental operations in these algorithms are comparisons, loads, stores and exchanges of array elements.

Finding the maximum of an array of length \(n\) takes \(n-1 \approx n\) comparisons. We assume the first entry is the maximum, keep comparing it with the other entries of the array, and update the maximum whenever a larger entry is found. On average, half of these comparisons also require updating the maximum. Apart from finding the largest entry, we often need its location too; the location is updated whenever the maximum is updated. If we have to find the \(k\) largest entries in the array (along with their indices), we can work iteratively: find the maximum, set the corresponding entry in the array to a small enough value (0 for a positive valued array, \(-\infty\) for a real array), find the second largest entry, and so on. This requires approximately \(kn\) comparisons. Considering the additional book-keeping cost, the flop count can be put at \(2kn\). If the array is needed further, we can put the \(k\) largest entries back in the array.

Theorem 1.3 in [SF13] suggests that the quicksort algorithm on average uses \((n-1)/2\) partitioning stages, \(2n\ln{n} - 1.846n\) compares and \(0.333\, n \ln{n} - 0.865\, n\) exchanges to sort an array of \(n\) randomly ordered distinct elements.

Our use of sorting also requires us to keep track of the original indices of the entries in the sorted array. This is done by creating an index array and performing exchanges on the index array whenever exchanges are done in the original array. Keeping these extra operations in mind, we will use a conservative estimate of \(4n \ln{n}\) flops for sorting an array. Once the array is sorted, picking the \(k\) largest entries requires \(k\) iterations. Note that when \(n\) is small (say less than 1000), an efficient implementation of quicksort can actually beat the naive way of finding the \(k\) largest entries discussed above. This will be our preferred approach in this work.
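
As a simple illustration, the \(k\) largest-magnitude entries of a vector and their indices can be obtained with plain MATLAB built-ins as follows (the library also provides helpers such as spx.commons.signals.largest_indices for this purpose):

% Find the k entries of x with the largest magnitudes, along with their indices.
[~, idx] = sort(abs(x), 'descend');   % roughly 4 n ln n flops by our estimate
largest_indices = idx(1:k);
largest_values  = x(largest_indices);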

Library Classes

Contents:

Sparse recovery pursuit algorithms

Contents:

Matching pursuit

Constructing the solver with dictionary and expected sparsity level:

solver = spx.pursuit.single.MatchingPursuit(Dict, K)

Using the solver to obtain the sparse representation of one vector:

result = solver.solve(y)

Using the solver to obtain the sparse representations of all vectors in the signal matrix Y independently:

result = solver.solve_all(Y)

Orthogonal matching pursuit

Constructing the solver with dictionary and expected sparsity level:

solver  = spx.pursuit.single.OrthogonalMatchingPursuit(Dict, K)

Using the solver to obtain the sparse representation of one vector:

result = solver.solve(y)

There are several ways of solving the least squares problem which is an intermediate step in the orthogonal matching pursuit algorithm. Some of these are described below.

Using the solver to obtain the sparse representation of one vector with incremental QR decomposition of the subdictionary for the least squares step:

result = solver.solve_qr(y)

Using the solver to obtain the sparse representations of all vectors in the signal matrix Y independently:

result = solver.solve_all(Y)

Using the solver to obtain the sparse representations of all vectors in the signal matrix Y independently using the linsolve method for least squares:

result = solver.solve_all_linsolve(Y)

Basis pursuit and its variations

Basis pursuit is a way of solving the sparse recovery problem via \(\ell_1\) minimization. We provide multiple implementations for different variations of the problem.

Note

These algorithms depend on the CVX toolbox. Please make sure to install it before using them.

Constructing the solver with dictionary and set of signals to be solved arranged in a signal matrix:

solver = spx.pursuit.single.BasisPursuit(Dict, Y)

Solving using LASSO method:

result = solver.solve_lasso(lambda)
result = solver.solve_lasso()

Solving using \(\ell_1\) minimization assuming that signals are exact sparse:

result = solver.solve_l1_exact()

Solving using \(\ell_1\) minimization assuming that signals are noisy:

result = solver.solve_l1_noise()

Compressive sampling matching pursuit

Constructing the solver with dictionary and expected sparsity level:

solver = spx.pursuit.single.CoSaMP(Dict, K)

Using the solver to obtain the sparse representation of one vector:

result = solver.solve(y)

Using the solver to obtain the sparse representations of all vectors in the signal matrix Y independently:

result = solver.solve_all(Y)

Joint recovery algorithms

Cluster orthogonal matching pursuit

Warning

This is a new algorithm under research.

solver = spx.pursuit.joint.ClusterOMP(Dict, K)
result = solver.solve(Y)

Introduction

This section focuses on methods which solve the sparse recovery or sparse approximation problem for one vector at a time. A subsection on joint recovery algorithms focuses on problems where multiple vectors with largely common supports are recovered jointly.

For each algorithm, there is a solver. The solver should be constructed first with the dictionary / sensing matrix and some other parameters like sparsity level as needed by the algorithm.

The solver can then be used for solving one problem at a time.

Common utilities

Contents:

Signals

Our focus is usually on finite dimensional signals. Such signals are usually stored as column vectors in MATLAB. A set of signals with the same dimension can be stored together in the form of a matrix where each column of the matrix is one signal. Such a matrix of signals is called a signal matrix.

In this section we describe some helper utility functions which provide extra functionality on top of existing support in MATLAB.

General

Constructing unit (column) vector in a given co-ordinate:

>> N = 8; i = 2;
>> spx.vector.unit_vector(N, i)'
0     1     0     0     0     0     0     0
Sparsification

Finding the K-largest indices of a given signal:

>> x = [0 0 0  1 0 0 -1 0 0 -2 0 0 -3 0 0 7 0 0 4 0 0 -6];
>> K=4;
>> spx.commons.signals.largest_indices(x, K)'
16    22    19    13

Constructing the sparse approximation of x with K largest indices:

>> spx.commons.signals.sparseApproximation(x, K)'
0     0     0     0     0     0     0     0     0     0     0     0    -3     0     0     7     0     0     4     0     0    -6
Searching

spx.commons.signals.find_first_signal_with_energy_le finds the first signal in a signal matrix X with an energy less than or equal to a given threshold energy:

[x, i] = spx.commons.signals.find_first_signal_with_energy_le(X, threshold);

x is the first signal with energy less than the given threshold. i is the index of the column in X holding this signal.

Working with matrices

Simple checks on matrices

Let us create a simple matrix:

A = magic(3);

Checking whether the matrix is a square matrix:

spx.matrix.is_square(A)

Checking if it is symmetric:

spx.matrix.is_symmetric(A)

Checking if it is a Hermitian matrix:

spx.matrix.is_hermitian(A)

Checking if it is a positive definite matrix:

spx.matrix.is_positive_definite(A)
Matrix utilities

spx.matrix.off_diagonal_elements returns the off-diagonal elements of a given matrix in a column vector arranged in column major order.

A = magic(3);
spx.matrix.off_diagonal_elements(A)'
ans =
    3     4     1     9     6     7

spx.matrix.off_diagonal_matrix zeros out the diagonal entries of a matrix and returns the modified matrix:

spx.matrix.off_diagonal_matrix(A)
ans =

     0     1     6
     3     0     7
     4     9     0

spx.matrix.off_diag_upper_tri_matrix returns the off diagonal part of the upper triangular part of a given matrix and zeros out the remaining entries:

spx.matrix.off_diag_upper_tri_matrix(A)

ans =

     0     1     6
     0     0     7
     0     0     0

spx.matrix.off_diag_upper_tri_elements returns the elements in the off diagonal part of the upper triangular part of a matrix arranged in column major order:

spx.matrix.off_diag_upper_tri_elements(A)'

ans =

     1     6     7

spx.matrix.nonzero_density returns the ratio of total number of non-zero elements in a matrix with the size of the matrix:

spx.matrix.nonzero_density(A)
ans = 1
Diagonally dominant matrices

Checking whether a matrix is diagonally dominant:

spx.matrix.is_diagonally_dominant(A)

Making a matrix diagonally dominant:

A = spx.matrix.make_diagonally_dominant(A)

Both these functions have an extra parameter named strict. When set to true, strict diagonal dominance is considered / enforced.

Norms and distances

Distance measurement utilities

Let X be a matrix. Treat each column of X as a signal.

Euclidean distance between each signal pair can be computed by:

spx.commons.distance.pairwise_distances(X)

If X contains N signals, then the result is an N x N matrix whose (i, j)-th entry contains the distance between i-th and j-th signal. Naturally, the diagonal elements are all zero.

An additional second argument can be provided to specify the distance measure to be used. See the documentation of MATLAB pdist function for supported distance functions.

For example, for measuring city-block distance between each pair of signals, use:

spx.commons.distance.pairwise_distances(X, 'cityblock')

The following dedicated functions are faster.

Squared \(\ell_2\) distances between all pairs of columns of X:

spx.commons.distance.sqrd_l2_distances_cw(X)

Squared \(\ell_2\) distances between all pairs of rows of X:

spx.commons.distance.sqrd_l2_distances_rw(X)
Norm utilities

These functions help in computing norm or normalizing signals in a signal matrix.

Compute \(\ell_1\) norm of each column vector:

spx.norm.norms_l1_cw(X)

Compute \(\ell_2\) norm of each column vector:

spx.norm.norms_l2_cw(X)

Compute \(\ell_{\infty}\) norm of each column vector:

spx.norm.norms_linf_cw(X)

Normalize each column vector w.r.t. \(\ell_1\) norm:

spx.norm.normalize_l1(X)

Normalize each column vector w.r.t. \(\ell_2\) norm:

spx.norm.normalize_l2(X)

Normalize each row vector w.r.t. \(\ell_2\) norm:

spx.norm.normalize_l2_rw(X)

Normalize each column vector w.r.t. \(\ell_{\infty}\) norm:

spx.norm.normalize_linf(X)

Scale each column vector by a separate factor:

spx.norm.scale_columns(X, factors)

Scale each row vector by a separate factor:

spx.norm.scale_rows(X, factors)

Compute the inner product of each column vector in A with each column vector in B:

spx.norm.inner_product_cw(A, B)

Sparse signals

Working with signal support

Let’s create a sparse vector:

>> x = [0 0 0  1 0 0 -1 0 0 -2 0 0 -3 0 0 7 0 0 4 0 0 -6];

Sparse support for a vector:

>> spx.commons.sparse.support(x)
4     7    10    13    16    19    22

\(\ell_0\) “norm” of a vector:

>> spx.commons.sparse.l0norm(x)
7

Let us create one more signal:

>> y = [3 0 0  0 0 0 0 0 0 4 0 0 -6 0 0 -5 0 0 -4 0 8 0];
>> spx.commons.sparse.l0norm(y)
6
>> spx.commons.sparse.support(y)
1    10    13    16    19    21

Support intersection ratio:

>> spx.commons.sparse.support_intersection_ratio(x, y)
0.1364

It is the ratio of the number of indices common to the supports of x and y to the maximum of the support sizes of x and y.

Average support similarity of a reference signal with a set of signals X (each signal as a column vector):

spx.commons.sparse.support_similarity(X, reference)

Support similarities between two sets of signals (pairwise):

spx.commons.sparse.support_similarities(X, Y)

Support detection ratios:

spx.commons.sparse.support_detection_rate(X, trueSupport)

K largest indices over a set of vectors:

spx.commons.sparse.dominant_support_merged(data, K)

Sometimes it’s useful to identify and arrange the non-zero entries in a signal in descending order of their magnitude:

>> spx.commons.sparse.sorted_non_zero_elements(x)
16    22    19    13    10     4     7
 7    -6     4    -3    -2     1    -1

Given a signal x, the function spx.commons.sparse.sorted_non_zero_elements returns a two row matrix where the first row contains the locations of the non-zero elements sorted by decreasing magnitude and the second row contains their values. If two non-zero elements have the same magnitude, the original order is maintained; the sort is stable.

Comparing signals

Comparing sparse or approximately sparse signals

spx.commons.SparseSignalsComparison class provides a number of methods to compare two sets of sparse signals. It is typically used to compare a set of original sparse signals with corresponding recovered sparse signals.

Let us create two signals of size (N=256) with sparsity level (K=4) with the non-zero entries having magnitude chosen uniformly between [1,2]:

N = 256;
K = 4;
% Constructing a sparse vector
% Choosing the support randomly
Omega = randperm(N, K);
% Number of signals
S = 2;
% Original signals
X = zeros(N, S);
% Choosing non-zero values uniformly between (-b, -a) and (a, b)
a = 1;
b = 2;
% unsigned magnitudes of non-zero entries
XM = a + (b-a).*rand(K, S);
% Generate sign for non-zero entries randomly
sgn = sign(randn(K, S));
% Combine sign and magnitude
XMS = sgn .* XM;
% Place at the right non-zero locations
X(Omega, :) = XMS;

Let us create a noisy version of these signals with noise only in the non-zero entries at 15 dB of SNR:

% Creating noise using helper function
SNR = 15;
Noise = spx.data.noise.Basic.createNoise(XMS, SNR);
Y = X;
Y(Omega, :) = Y(Omega, :) + Noise;

Let us create an instance of sparse signal comparison class:

cs = spx.commons.SparseSignalsComparison(X, Y, K);

Norms of difference signals [X - Y]:

cs.difference_norms()

Norms of original signals [X]:

cs.reference_norms()

Norms of estimated signals [Y]:

cs.estimate_norms()

Ratios between signal error norms and original signal norms:

cs.error_to_signal_norms()

SNR for each signal:

cs.signal_to_noise_ratios()

In case the signals X and Y are not truly sparse, spx.commons.SparseSignalsComparison has the ability to sparsify them by choosing the K largest (magnitude) entries for each signal in the reference signal set and the estimated signal set. K is an input parameter taken by the class.

We can access the sparsified reference signals:

cs.sparse_references()

We can access the sparsified estimated signals:

cs.sparse_estimates()

We can also examine the support index set for each sparsified reference signal:

cs.reference_sparse_supports()

Ditto for the supports of sparsified estimated signals:

cs.estimate_sparse_supports()

We can measure the support similarity ratio for each signal:

cs.support_similarity_ratios()

We can find out which of the signals have a support similarity above a specified threshold:

cs.has_matching_supports(1.0)

Overall analysis can be easily summarized and printed for each signal:

cs.summarize()

Here is the output:

Signal dimension: 256
Number of signals: 2
Combined reference norm: 4.56207362
Combined estimate norm: 4.80070407
Combined difference norm: 0.81126416
Combined SNR: 15.0000 dB
Specified sparsity level: 4

Signal: 1
  Reference norm: 2.81008750
  Estimate norm: 2.91691022
  Error norm: 0.49971207
  SNR: 15.0000 dB
  Support similarity ratio: 1.00

Signal: 2
  Reference norm: 3.59387311
  Estimate norm: 3.81292464
  Error norm: 0.63909106
  SNR: 15.0000 dB
  Support similarity ratio: 1.00
Signal space comparison

For comparing signals which are not sparse, we have another helper utility class spx.commons.SignalsComparison.

Assuming X is a signal matrix (with each column treated as a signal) and Y is its noisy version, we create the signal comparison instance as:

cs = spx.commons.SignalsComparison(X, Y);

Most functions are similar to what we had for spx.commons.SparseSignalsComparison:

cs.difference_norms()
cs.reference_norms()
cs.estimate_norms()
cs.error_to_signal_norms()
cs.signal_to_noise_ratios()
cs.summarize()

Working with Numbers

Some algorithms from number theory are useful at times.

Finding integer factors closest to square root:

>> [a,b] = spx.discrete.number.integer_factors_close_to_sqr_root(120)
a = 10
b = 12

Printing utilities

Sparse signals

Printing a sparse signal as pairs of locations and values:

>> x = [0 0 0  1 0 0 -1 0 0 -2 0 0 -3 0 0 7 0 0 4 0 0 -6];
>> spx.io.print.sparse_signal(x)
(4,1) (7,-1) (10,-2) (13,-3) (16,7) (19,4) (22,-6)   N=22, K=7

Printing the non-zero entries in a signal in descending order of magnitude with location and value:

>> spx.io.print.sorted_sparse_signal(x)
Index:  Value
  16:   7.000000
  22:   -6.000000
  19:   4.000000
  13:   -3.000000
  10:   -2.000000
   4:   1.000000
   7:   -1.000000
Latex

Printing a vector in a format suitable for Latex:

>> spx.io.latex.printVector([1, 2, 3, 4])
\begin{pmatrix}
1 & 2 & 3 & 4
\end{pmatrix}

Printing a matrix in a format suitable for Latex:

>> spx.io.latex.printMatrix(randn(3, 4))
\begin{bmatrix}
-0.340285 & 1.13915 & 0.65748 & 0.0187744\\
-0.925848 & 0.427361 & 0.584246 & -0.425961\\
0.00532169 & 0.181032 & -1.61645 & -2.03403
\end{bmatrix}

Printing a vector as a set in Latex:

>> spx.io.latex.printSet([1, 2, 3, 4])
\{ 1 , 2 , 3 , 4 \}
SciRust

SciRust is a related scientific computing library developed by us. Some helper functions have been written to convert MATLAB data into SciRust compatible Rust source code.

Printing a matrix for consumption in SciRust source code:

>> spx.io.scirust.printMatrix(magic(3))
matrix_rw_f64(3, 3, [
        8.0, 1.0, 6.0,
        3.0, 5.0, 7.0,
        4.0, 9.0, 2.0
        ]);

Sparse recovery

Estimate of the number of measurements required for a sparse signal of dimension N and sparsity level K, based on the phase transition analysis in the paper by Donoho and Tanner:

M = spx.commons.sparse.phase_transition_estimate_m(N, K);

Example:

>> spx.commons.sparse.phase_transition_estimate_m(1000, 4)
60

Synthetic Signals

Some easy-to-set-up recovery problems

General approach:

m = 64;
n = 121;
k = 4;
dict = spx.dict.simple.gaussian_dict(m, n);
gen = spx.data.synthetic.SparseSignalGenerator(n, k);
% create a sparse vector
rep =  gen.biGaussian();
signal = dict*rep;
problem.dictionary = dict;
problem.representation_vector = rep;
problem.sparsity_level = k;
problem.signal_vector = signal;

The problems:

problem = spx.data.synthetic.recovery_problems.problem_small_1()
problem = spx.data.synthetic.recovery_problems.problem_large_1()
problem = spx.data.synthetic.recovery_problems.problem_barbara_blocks()

Sparse signal generation

Create generator:

N = 256; K = 4; S = 10;
gen  = spx.data.synthetic.SparseSignalGenerator(N, K, S);

Uniform signals:

result = gen.uniform();
result = gen.uniform(1, 2);
result = gen.uniform(-1, 1);

Bi-uniform signals:

result = gen.biUniform();
result = gen.biUniform(1, 2);

Gaussian signals:

result = gen.gaussian();

BiGaussian signals:

result = gen.biGaussian();
result = gen.biGaussian(2.0);
result = gen.biGaussian(10.0, 1.0);

Compressible signal generation

We can use the randcs function by V. Cevher for constructing compressible signals:

N = 100;
q = 1;
x = randcs(N, q);
plot(x);
plot(randcs(100, .9));
plot(randcs(100, .8));
plot(randcs(100, .7));
plot(randcs(100, .6));
plot(randcs(100, .5));
plot(randcs(100, .4));
lambda = 2;
x = randcs(N, q, lambda);
dist = 'logn';
x = randcs(N, q, lambda, dist);

Multi-subspace signal generation

Signals with disjoint supports:

% Dimension of representation space
N = 80;
% Number of subspaces
P = 8;
% Number of signals per subspace
SS = 10;
% Sparsity level of each signal (subspace dimension)
K = 4;
% Create signal generator
sg = spx.data.synthetic.MultiSubspaceSignalGenerator(N, K);
% Create disjoint supports
sg.createDisjointSupports(P);
sg.setNumSignalsPerSubspace(SS);
% Generate  signal representations
sg.biUniform(1, 4);
% Access  signal representations
X = sg.X;
% Corresponding supports
qs = sg.Supports;

Graphics and visualization

In this section we cover:

  • Some utility classes which help in specific visualization tasks
  • Some external open source libraries / functions which have been integrated in sparse-plex to make visualization tasks easier
  • Some general techniques for specific visualization tasks

Create a full screen figure:

spx.graphics.figure.full_screen;

Multiple figures:

mf = spx.graphics.Figures();
mf.new_figure('fig 1');
mf.new_figure('fig 2');
mf.new_figure('fig 3');

All these figures will be created with the same width and height. They will be placed one after another in a stacked manner.

Controlling size of multiple figures:

width = 1000;
height = 400;
mf = spx.graphics.Figures(width, height);

Display a Gram matrix for a given dictionary Phi:

spx.graphics.display.display_gram_matrix(Phi);

Canvas of a grid of images

Sometimes we wish to show a set of small images in the form of a grid. These images may be patches from a larger image or may be small independent images themselves.

spx.graphics.canvas helps in combining the images in the form of a grid on a canvas image.

We provide all the images to be displayed in the form of a matrix where each column consists of one image.

Creating a canvas of image patches:

% Let us create some random images of size 50x50
width = 50;
height = 50;
rows = 10;
cols = 10;
images = 255* rand(width*height, rows*cols);
% Let's create a canvas of these images formed into a
% 10 x 10 grid.
canvas = spx.graphics.canvas.create_image_grid(images, rows, cols, ...
    height, width);
% Let's convert the canvas to UINT8 image
canvas = uint8(canvas);
% Let's show the image
imshow(canvas);
% Let's set the proper colormap.
colormap(gray);
% Axis sizing etc.
axis image;
axis off;

Displaying a set of signals in the form of a matrix

While working on joint signal recovery problems, we need to visualize a set of signals together. They can be put together in a signal matrix where each column is one (finite dimensional) signal. It is straightforward to create a visualization for these signals:

num_signals = 100;
signal_size = 80;
signal_matrix = randn(signal_size, num_signals);
% Let's create a canvas and put all the signals on it.
canvas = spx.graphics.canvas.create_signal_matrix_canvas(signal_matrix);
% Let's show the image
imshow(canvas);
% Let's set the proper colormap.
colormap(gray);
% Axis sizing etc.
axis image;
axis off;

Some third party open source libraries

Put a title over all subplots:

spx.graphics.suptitle(title);

This function is by Drea Thomas.

RGB code for given colorname:

c = spx.graphics.rgb('DarkRed')
c = spx.graphics.rgb('Green')
plot(x,y,'color',spx.graphics.rgb('orange'))

This function is by Kristján Jónasson and is in public domain.

Supported colors:

%White colors
'FF','FF','FF', 'White'
'FF','FA','FA', 'Snow'
'F0','FF','F0', 'Honeydew'
'F5','FF','FA', 'MintCream'
'F0','FF','FF', 'Azure'
'F0','F8','FF', 'AliceBlue'
'F8','F8','FF', 'GhostWhite'
'F5','F5','F5', 'WhiteSmoke'
'FF','F5','EE', 'Seashell'
'F5','F5','DC', 'Beige'
'FD','F5','E6', 'OldLace'
'FF','FA','F0', 'FloralWhite'
'FF','FF','F0', 'Ivory'
'FA','EB','D7', 'AntiqueWhite'
'FA','F0','E6', 'Linen'
'FF','F0','F5', 'LavenderBlush'
'FF','E4','E1', 'MistyRose'
%Grey colors'
'80','80','80', 'Gray'
'DC','DC','DC', 'Gainsboro'
'D3','D3','D3', 'LightGray'
'C0','C0','C0', 'Silver'
'A9','A9','A9', 'DarkGray'
'69','69','69', 'DimGray'
'77','88','99', 'LightSlateGray'
'70','80','90', 'SlateGray'
'2F','4F','4F', 'DarkSlateGray'
'00','00','00', 'Black'
%Red colors
'FF','00','00', 'Red'
'FF','A0','7A', 'LightSalmon'
'FA','80','72', 'Salmon'
'E9','96','7A', 'DarkSalmon'
'F0','80','80', 'LightCoral'
'CD','5C','5C', 'IndianRed'
'DC','14','3C', 'Crimson'
'B2','22','22', 'FireBrick'
'8B','00','00', 'DarkRed'
%Pink colors
'FF','C0','CB', 'Pink'
'FF','B6','C1', 'LightPink'
'FF','69','B4', 'HotPink'
'FF','14','93', 'DeepPink'
'DB','70','93', 'PaleVioletRed'
'C7','15','85', 'MediumVioletRed'
%Orange colors
'FF','A5','00', 'Orange'
'FF','8C','00', 'DarkOrange'
'FF','7F','50', 'Coral'
'FF','63','47', 'Tomato'
'FF','45','00', 'OrangeRed'
%Yellow colors
'FF','FF','00', 'Yellow'
'FF','FF','E0', 'LightYellow'
'FF','FA','CD', 'LemonChiffon'
'FA','FA','D2', 'LightGoldenrodYellow'
'FF','EF','D5', 'PapayaWhip'
'FF','E4','B5', 'Moccasin'
'FF','DA','B9', 'PeachPuff'
'EE','E8','AA', 'PaleGoldenrod'
'F0','E6','8C', 'Khaki'
'BD','B7','6B', 'DarkKhaki'
'FF','D7','00', 'Gold'
%Brown colors
'A5','2A','2A', 'Brown'
'FF','F8','DC', 'Cornsilk'
'FF','EB','CD', 'BlanchedAlmond'
'FF','E4','C4', 'Bisque'
'FF','DE','AD', 'NavajoWhite'
'F5','DE','B3', 'Wheat'
'DE','B8','87', 'BurlyWood'
'D2','B4','8C', 'Tan'
'BC','8F','8F', 'RosyBrown'
'F4','A4','60', 'SandyBrown'
'DA','A5','20', 'Goldenrod'
'B8','86','0B', 'DarkGoldenrod'
'CD','85','3F', 'Peru'
'D2','69','1E', 'Chocolate'
'8B','45','13', 'SaddleBrown'
'A0','52','2D', 'Sienna'
'80','00','00', 'Maroon'
%Green colors
'00','80','00', 'Green'
'98','FB','98', 'PaleGreen'
'90','EE','90', 'LightGreen'
'9A','CD','32', 'YellowGreen'
'AD','FF','2F', 'GreenYellow'
'7F','FF','00', 'Chartreuse'
'7C','FC','00', 'LawnGreen'
'00','FF','00', 'Lime'
'32','CD','32', 'LimeGreen'
'00','FA','9A', 'MediumSpringGreen'
'00','FF','7F', 'SpringGreen'
'66','CD','AA', 'MediumAquamarine'
'7F','FF','D4', 'Aquamarine'
'20','B2','AA', 'LightSeaGreen'
'3C','B3','71', 'MediumSeaGreen'
'2E','8B','57', 'SeaGreen'
'8F','BC','8F', 'DarkSeaGreen'
'22','8B','22', 'ForestGreen'
'00','64','00', 'DarkGreen'
'6B','8E','23', 'OliveDrab'
'80','80','00', 'Olive'
'55','6B','2F', 'DarkOliveGreen'
'00','80','80', 'Teal'
%Blue colors
'00','00','FF', 'Blue'
'AD','D8','E6', 'LightBlue'
'B0','E0','E6', 'PowderBlue'
'AF','EE','EE', 'PaleTurquoise'
'40','E0','D0', 'Turquoise'
'48','D1','CC', 'MediumTurquoise'
'00','CE','D1', 'DarkTurquoise'
'E0','FF','FF', 'LightCyan'
'00','FF','FF', 'Cyan'
'00','FF','FF', 'Aqua'
'00','8B','8B', 'DarkCyan'
'5F','9E','A0', 'CadetBlue'
'B0','C4','DE', 'LightSteelBlue'
'46','82','B4', 'SteelBlue'
'87','CE','FA', 'LightSkyBlue'
'87','CE','EB', 'SkyBlue'
'00','BF','FF', 'DeepSkyBlue'
'1E','90','FF', 'DodgerBlue'
'64','95','ED', 'CornflowerBlue'
'41','69','E1', 'RoyalBlue'
'00','00','CD', 'MediumBlue'
'00','00','8B', 'DarkBlue'
'00','00','80', 'Navy'
'19','19','70', 'MidnightBlue'
%Purple colors
'80','00','80', 'Purple'
'E6','E6','FA', 'Lavender'
'D8','BF','D8', 'Thistle'
'DD','A0','DD', 'Plum'
'EE','82','EE', 'Violet'
'DA','70','D6', 'Orchid'
'FF','00','FF', 'Fuchsia'
'FF','00','FF', 'Magenta'
'BA','55','D3', 'MediumOrchid'
'93','70','DB', 'MediumPurple'
'99','66','CC', 'Amethyst'
'8A','2B','E2', 'BlueViolet'
'94','00','D3', 'DarkViolet'
'99','32','CC', 'DarkOrchid'
'8B','00','8B', 'DarkMagenta'
'6A','5A','CD', 'SlateBlue'
'48','3D','8B', 'DarkSlateBlue'
'7B','68','EE', 'MediumSlateBlue'
'4B','00','82', 'Indigo'
%Gray repeated with spelling grey
'80','80','80', 'Grey'
'D3','D3','D3', 'LightGrey'
'A9','A9','A9', 'DarkGrey'
'69','69','69', 'DimGrey'
'77','88','99', 'LightSlateGrey'
'70','80','90', 'SlateGrey'
'2F','4F','4F', 'DarkSlateGrey'

Dictionaries

Basic Dictionaries

Some simple dictionaries can be constructed using library functions.

The dictionaries are available in two flavors:

  1. As simple matrices
  2. As objects which implement the spx.dict.Operator abstraction defined below.

The functions returning the dictionary as a simple matrix have the suffix “mtx”. The functions returning the dictionary as an spx.dict.Operator have the suffix “dict”.

These functions can also be used to construct random sensing matrices which are essentially random dictionaries.

Dirac Fourier Dictionary:

spx.dict.simple.dirac_fourier_dict(N)

Dirac DCT Dictionary:

spx.dict.simple.dirac_dct_dict(N)

Gaussian Dictionary:

spx.dict.simple.gaussian_dict(N, D, normalized_columns)

Rademacher Dictionary:

Phi = spx.dict.simple.rademacher_dict(N, D);

Partial Fourier Dictionary:

Phi = spx.dict.simple.partial_fourier_dict(N, D);

Over complete 1-D DCT dictionary:

spx.dict.simple.overcomplete1DDCT(N, D)

Over complete 2-D DCT dictionary:

spx.dict.simple.overcomplete2DDCT(N, D)

Dictionaries from SPIE2011 paper:

spx.dict.simple.spie_2011(name) % ahoc, orth, rand, sine

Sensing matrices

Gaussian sensing matrix:

Phi = spx.dict.simple.gaussian_mtx(M, N);

Rademacher sensing matrix:

Phi = spx.dict.simple.rademacher_mtx(M, N);

Partial Fourier matrix:

Phi = spx.dict.simple.partial_fourier_mtx(M, N);

Operators

In simple terms, a (finite) dictionary is implemented as a matrix whose columns are the atoms of the dictionary. This approach is not powerful enough. A dictionary \(\Phi\) usually acts on a sparse representation \(\alpha\) to obtain a signal \(x = \Phi \alpha\). During sparse recovery, the Hermitian transpose of the dictionary acts on the signal [or residual] to compute \(\Phi^H x\) or \(\Phi^H r\). Thus, the fundamental operations are multiplication by \(\Phi\) and \(\Phi^H\). While these operations can be implemented directly using a matrix representation of the dictionary, doing so is slow and requires large storage. For random dictionaries, this is the only option. But for structured dictionaries and sensing matrices, the whole dictionary need not be held in memory; the multiplications by \(\Phi\) and \(\Phi^H\) can be implemented using fast functions.

Also multiple dictionaries can be combined to construct a composite dictionary, e.g. \(\Phi \Psi\).

In order to take care of these scenarios, we define the notion of a generic operator in an abstract class spx.dict.Operator. All operators support the following methods.

Constructing a matrix representation of the operator:

op.double()

Computing \(\Phi x\):

op.mtimes(x)

The transpose operator:

op.transpose()

By default it is constructed by computing the matrix representation of the transpose of the operator. But specialized dictionaries can implement it smartly.

The Hermitian transpose operator:

op.ctranspose()

By default it is constructed by computing the matrix representation of the Hermitian transpose of the operator. But specialized dictionaries can implement it smartly.

Obtaining specific columns from the operator:

op.columns(columns)

Note that this doesn’t require computing the complete matrix representation of the operator.

op.apply_columns(vectors, columns)

Constructing an operator which uses only the specified columns from this dictionary:

op.columns_operator(columns)

A specific column of the dictionary:

op.column(index)

Printing the contents of the dictionary:

disp(op)

Matrix operators

Matrix operators are constructed by wrapping a given matrix into spx.dict.MatrixOperator which is a subclass of spx.dict.Operator.

Constructing the matrix operator from a matrix A:

op = spx.dict.MatrixOperator(A)

The matrix operator holds references to the matrix as well as its Hermitian transpose:

op.A
op.AH

Composite Operators

A composite operator can be created by combining two or more operators:

co = spx.dict.CompositeOperator(f, g)

Unitary/Orthogonal matrices

spx.dict.unitary.uniform_normal_qr(n)
spx.dict.unitary.analyze_rr(O)
spx.dict.unitary.synthesize_rr(rotations, reflections)
spx.dict.unitary.givens_rot(a, b)

Dictionary Properties

dp = spx.dict.Properties(Dict)

dp.gram_matrix()
dp.abs_gram_matrix()
dp.frame_operator()
dp.singular_values()
dp.gram_eigen_values()
dp.lower_frame_bound()
dp.upper_frame_bound()
dp.coherence()

Coherence of a dictionary:

mu = spx.dict.coherence(dict)

Babel function of a dictionary:

mu = spx.dict.babel(dict)

Spark of a dictionary (for small sizes):

[ K, columns ] = spx.dict.spark( Phi )

Equiangular Tight Frames

spx.dict.etf.ss_to_etf(M)
spx.dict.etf.is_etf(F)
spx.dict.etf.ss_etf_structure(k, v)

Grassmannian Frames

spx.dict.grassmannian.minimum_coherence(m, n)
spx.dict.grassmannian.n_upper_bound(m)
spx.dict.grassmannian.min_coherence_max_n(ms)
spx.dict.grassmannian.max_n_for_coherence(m, mu)
spx.dict.grassmannian.alternate_projections(dict, options)

Vector Spaces

Our work is focused on finite dimensional vector spaces \(\mathbb{R}^N\) or \(\mathbb{C}^N\). We represent a vector space by a basis in the vector space. In this section, we describe several useful functions for working with one or more vector spaces (represented by one basis per vector space).

Basis for intersection of two subspaces:

result = spx.la.spaces.insersection_space(A, B)

Orthogonal complement of A in B:

result = spx.la.spaces.orth_complement(A, B)

Principal angles between subspaces spanned by A and B:

result = spx.la.spaces.principal_angles_cos(A, B);
result = spx.la.spaces.principal_angles_radian(A, B);
result = spx.la.spaces.principal_angles_degree(A, B);

Smallest principal angle between subspaces spanned by A and B:

result = spx.la.spaces.smallest_angle_cos(A, B);
result = spx.la.spaces.smallest_angle_rad(A, B);
result = spx.la.spaces.smallest_angle_deg(A, B);

Principal angle between two orthogonal bases:

result = spx.la.spaces.principal_angles_orth_cos(A, B)
result = spx.la.spaces.smallest_angle_orth_cos(A, B);

Smallest angles between subspaces:

result = spx.la.spaces.smallest_angles_cos(subspaces, d)
result = spx.la.spaces.smallest_angles_rad(subspaces, d)
result = spx.la.spaces.smallest_angles_deg(subspaces, d)

Distance between subspaces based on Grassmannian space:

result = spx.la.spaces.subspace_distance(A, B)

This is computed as the operator norm of the difference between projection matrices for two subspaces.

Check if v in range of unitary matrix U:

result = spx.la.spaces.is_in_range_orth(v, U)

Check if v in range of A:

result = spx.la.spaces.is_in_range(v, A)

A basis for matrix A:

result = spx.la.spaces.find_basis(A)

Elementary matrices product and row reduced echelon form:

[E, R] = spx.la.spaces.elim(A)

Basis for null space of A:

result = spx.la.spaces.null_basis(A)

Bases for four fundamental spaces:

[col_space, null_space, row_space, left_null_space]  = spx.la.spaces.four_bases(A)
[col_space, null_space, row_space, left_null_space]  = spx.la.spaces.four_orth_bases(A)

Utility for constructing specific examples

Two spaces at a given angle:

[A, B]  = spx.data.synthetic.subspaces.two_spaces_at_angle(N, theta)

Three spaces at a given angle:

[A, B, C] = spx.la.spaces.three_spaces_at_angle(N, theta)

Three disjoint spaces at a given angle:

[A, B, C] = spx.la.spaces.three_disjoint_spaces_at_angle(N, theta)

Map data from k dimensions to n dimensions:

result = spx.la.spaces.k_dim_to_n_dim(X, n, indices)

Describing relations between three spaces:

spx.la.spaces.describe_three_spaces(A, B, C);

Usage:

d = 4;
theta = 10;
n = 20;
[A, B, C] = spx.la.spaces.three_disjoint_spaces_at_angle(deg2rad(theta), d);
spx.la.spaces.describe_three_spaces(A, B, C);
% Put them together
X = [A B C];
% Put them to bigger dimension
X = spx.la.spaces.k_dim_to_n_dim(X, n);
% Perform a random orthonormal transformation
O = orth(randn(n));
X = O * X;

Combinatorics

Steiner Systems

Steiner system with block size 2:

v = 10;
m = spx.discrete.steiner_system.ss_2(v);

Steiner system with block size 3 (STS Steiner Triple System):

m = spx.discrete.steiner_system.ss_3(v);

Bose construction for an STS system for v = 6n + 3:

m = spx.discrete.steiner_system.ss_3_bose(v);

Verify if a given incidence matrix is a Steiner system:

spx.discrete.steiner_system.is_ss(M, k)

Latin square construction:

spx.discrete.steiner_system.commutative_idempotent_latin_square(n)

Verify if a table is a Latin square:

spx.discrete.steiner_system.is_latin_square(table)

Matrix factorization algorithms

Note

Better implementations of these algorithms may be available in the stock MATLAB distribution or in other third party libraries. These codes were developed for instructional purposes, as variations of these algorithms were needed in the development of other algorithms in this package.

Various versions of QR Factorization

Gram Schmidt:

[Q, R] =  spx.la.qr.gram_schmidt(A)

Householder UR:

[U, R] = spx.la.qr.householder_ur(A)

Householder QR:

[Q, R] =  spx.la.qr.householder_qr(A)

Householder matrix for a given vector:

[H, v] = spx.la.qr.householder_matrix(x)

External Code

almost equal:

isalmost(a,b,tol)

Timing

[t, measurement_overhead, measurement_details] = timeit(f, num_outputs)

Noise

Noise generation

Gaussian noise:

ng = spx.data.noise.Basic(N, S);
sigma = 1;
mean = 0;
ng.gaussian(sigma, mean);

Creating noise at a specific SNR:

% Sparse signal dimension
N = 100;
% Sparsity level
K = 20;
% Number of signals
S = 4;
% Create sparse signals
signals = spx.data.synthetic.SparseSignalGenerator(N, K, S).gaussian();
% Create noise at specific SNR level.
snrDb = 10;
noises = spx.data.noise.Basic.createNoise(signals, snrDb);
% add signal to noise
signals_with_noise = signals + noises;
% Verify SNR level
20 * log10 (spx.norm.norms_l2_cw(signals) ./ spx.norm.norms_l2_cw(noises))

Noise measurement

SNR in dB:

result = spx.commons.snr.SNR(signals, noises)

SNR in dB from signal and reconstruction:

reconstructions = signals_with_noise;
result = spx.commons.snr.recSNRdB(signals, reconstructions)

Signal energy in dB:

result = spx.commons.snr.energyDB(signals)

Reconstruction SNR as energy ratio:

result = spx.commons.snr.recSNR(signal, reconstruction)

Error energy normalized by signal energy:

result = spx.commons.snr.normalizedErrorEnergy(signal, reconstruction)

Reconstruction SNRs over multiple signals in dB:

result = spx.commons.snr.recSNRsdB(signals, reconstructions)

Reconstruction SNRs over multiple signals as energy ratios:

result = spx.commons.snr.recSNRs(signals, reconstructions)

Signal energies:

result = spx.commons.snr.energies(signals)

Signal energies in dB:

result = spx.commons.snr.energiesDB(signals)

Exercises

The best way to learn is by doing exercises yourself. In this section, we present a set of computer exercises which help you learn the fundamentals of sparse representations: algorithms and applications.

Most of these exercises are implemented in some form or other as part of the sparse-plex library. Once you have written your own implementations, you may hunt for the code in the library and compare your implementation with the reference implementation.

The exercises are described in terms of MATLAB programming environment. But they can be easily developed in other programming environments too.

Throughout these exercises, we will develop a set of functions which are reusable for performing various tasks related to sparse representation problems. We suggest collecting the functions you develop in one place so that you can implement the more sophisticated exercises easily later on.

Creating a sparse signal

The first aspect is deciding the support for the sparse signal.

  1. Decide on the length of signal N=1024.
  2. Decide on the sparsity level K=10.
  3. Choose K entries from 1..N randomly as your choice of sparse support. You can use randperm function.

Now, we need to consider the values of the non-zero entries in the sparse vector. Typically, they are chosen from a random distribution. A few of the common choices are:

  • Gaussian
  • Uniform
  • Bi-uniform

Gaussian

  1. Generate K Gaussian random numbers with zero mean and unit standard deviation. You can use randn function. You may choose to change the standard deviation, but mean should usually be zero.
  2. Create a column vector with N zeros.
  3. On the entries indexed by the sparse support set, place the K numbers generated above.

Plotting

  1. Use stem command to visualize the sparse signal.

Uniform

  • Most of the steps are similar to creating a Gaussian sparse vector.
  • The rand function generates a number uniformly between 0 and 1.
  • In order to generate a number uniformly between a and b, we can use the simple trick of a + (b -a) * rand
  1. Choose a and b (say -4 and 4).
  2. Generate K uniformly distributed numbers between a and b.
  3. Place them in the N length vector as described above.
  4. Plot them.

Bi-uniform

A problem with the Gaussian and uniform distributions as described above is that they are prone to generating some non-zero entries which are much smaller than the others.

The bi-uniform approach attempts to avoid this situation. It generates numbers uniformly from [-b, -a] and [a, b], where a and b are both positive numbers with a < b.

  1. Choose a and b (say 1 and 2).
  2. Generate K uniformly distributed random numbers between a and b (as discussed above). These are the magnitudes of the sparse non-zero entries.
  3. Generate K Gaussian numbers and apply sign function to them to map them to 1 and -1. Note that with equal probability, the signs would be 1 or -1.
  4. Multiply the signs and magnitudes to generate your sparse non-zero entries.
  5. Place them in the N length vector as described above.
  6. Plot them.

The following image shows an example of how such a sparse vector looks.

_images/k_sparse_biuniform_signal1.png
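
For reference, a bi-uniform sparse vector along these lines can be generated with plain MATLAB built-ins as follows (the library's spx.data.synthetic.SparseSignalGenerator, described earlier, offers similar functionality):

% Generate an N-length sparse vector with K non-zero entries whose magnitudes
% are uniform in [a, b] and whose signs are chosen with equal probability.
N = 1024; K = 10; a = 1; b = 2;
support = randperm(N, K);               % random sparse support
magnitudes = a + (b - a) * rand(K, 1);  % uniform in [a, b]
signs = sign(randn(K, 1));              % +1 or -1 with equal probability
x = zeros(N, 1);
x(support) = signs .* magnitudes;
stem(x);                                % visualize the sparse signal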

Creating a two ortho basis

The simplest example of an overcomplete dictionary is the Dirac-Fourier dictionary.

  • You can use eye(N) to generate the standard basis of \(\mathbb{C}^N\) which is also known as Dirac basis.
  • dftmtx(N) gives the matrix for the forward Fourier transform. The corresponding Fourier basis can be constructed by taking its transpose.
  • The columns / rows of dftmtx(N) are not normalized. Hence, in order to construct an orthonormal basis, we need to normalize the columns too. This can be easily done by multiplying with \(\frac{1}{\sqrt{N}}\).
  1. Choose the dimension of the ambient signal space (say N=1024).
  2. Construct the Dirac basis for \(\mathbb{C}^N\).
  3. Construct the orthonormal Fourier basis for \(\mathbb{C}^N\).
  4. Combine the two to form the two-ortho basis (Dirac on the left, Fourier on the right); a sketch is given below.
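
A minimal sketch of the construction (assuming N = 1024), which you can then check as described in the next subsection:

N = 1024;
Dirac = eye(N);                      % Dirac (standard) basis
Fourier = dftmtx(N).' / sqrt(N);     % orthonormal Fourier basis
Phi = [Dirac Fourier];               % two-ortho basis: Dirac on the left
G = Phi' * Phi;                      % Gram matrix for the verification steps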

Verification

We assume that the dictionary has been stored in a variable named Phi. We will use the mathematical symbol \(\Phi\) for the same.

  • Verify that each column has unit norm.
  • Verify that each row has a norm of \(\sqrt{2}\).
  • Compute the Gram matrix \(\Phi' * \Phi\).
  • Verify that the diagonal elements are all one.
  • Divide the Gram matrix into four quadrants.
  • Verify that the first and fourth quadrants are identity matrices.
  • Verify that the Gram matrix is symmetric.
  • What can you say about the values in 2nd and 3rd quadrant?

Creating a Dirac-DCT two-ortho basis

While the Dirac-DFT two-ortho basis has the lowest possible coherence amongst all pairs of orthonormal bases, it is a complex dictionary and does not live in \(\mathbb{R}^N\). For working with real signals, a good starting point is the Dirac-DCT two-ortho basis.

  1. Construct the Dirac-DCT two-ortho basis dictionary.
  • Replace dftmtx(N) by dctmtx(N).
  • Follow steps similar to previous exercise to construct a Dirac-DCT dictionary.
  • Notice the differences between the Gram matrix of the Dirac-DFT dictionary and that of the Dirac-DCT dictionary.
  • Construct the Dirac-DCT dictionary for different values of N=(8, 16, 32, 64, 128, 256).
  • Look at the changes in the Gram matrix as you vary N for constructing Dirac-DCT dictionary.

An example Dirac-DCT dictionary has been illustrated in the figure below.

_images/dirac_dct_2561.png

Note

While constructing the two-ortho bases explicitly is nice for illustration, it should be noted that using them directly for computing \(\Phi x\) is not efficient; it entails the full cost of a matrix-vector multiplication. An efficient implementation would consider the following ideas:

  • \(\Phi x = [I \Psi] x = I x_1 + \Psi x_2\) where \(x_1\) and \(x_2\) are upper and lower halves of the vector \(x\).
  • \(I x_1\) is nothing but \(x_1\).
  • \(\Psi x_2\) can be computed by using the efficient implementations of (Inverse) DFT or DCT transforms with appropriate scaling.
  • Such implementations would perform the multiplication with dictionary in \(O(N \log N)\) time.
  • In fact, if the second basis is a wavelet basis, then the multiplication can be carried out in linear time too.
  • We suggest that you take advantage of these ideas in the following exercises; a sketch for the Dirac-DCT case is given below.
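
For instance, for the Dirac-DCT dictionary, \(\Phi x\) can be computed without forming \(\Phi\) explicitly. A rough sketch using the orthonormal dct/idct functions from the Signal Processing Toolbox:

N = 1024;
x = randn(2 * N, 1);              % any representation vector of length 2N
x1 = x(1:N);                      % Dirac (impulse) part
x2 = x(N+1:end);                  % DCT part
y = x1 + idct(x2);                % I*x1 + Psi*x2, with Psi = dctmtx(N)'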

Creating a signal which is a mixture of sinusoids and impulses

If we split the sparse vector \(x\) into two halves \(x_1\) and \(x_2\), then:

  • The first half corresponds to impulses from the Dirac basis.
  • The second half corresponds to sinusoids from the DCT or DFT basis.

It is straightforward to construct a signal which is a mixture of impulses and sinusoids and has a sparse representation in Dirac-DFT or Dirac-DCT representation.

  1. Pick a suitable value of N (say 64).
  2. Construct the corresponding two ortho basis.
  3. Choose a sparsity pattern for the vector x (of size 2N) such that some of the non-zero entries fall in first half while some in second half.
  4. Choose appropriate non-zero coefficients for x.
  5. Compute \(y = \Phi x\) to obtain a signal which is a mixture of impulses and sinusoids.

Verification

  • It is obvious that the signal is non-sparse in time domain.
  • Plot the signal using stem function.
  • Compute the DCT or DFT representation of the signal (by taking inverse transform).
  • Plot the transform basis representation of the signal.
  • Verify that the transform basis representation does indeed have some large spikes (corresponding to the non-zero entries in the second half of \(x\)), but the rest of the representation is also filled with small non-zero terms (corresponding to the transform representation of the impulses). A sketch of this experiment follows.
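
A rough sketch of this experiment with the Dirac-DCT basis; N = 64 as suggested above, while the particular support and coefficient values here are just example choices:

N = 64;
Phi = [eye(N) dctmtx(N)'];        % Dirac-DCT two-ortho basis
x = zeros(2 * N, 1);
x([5 20]) = [1.5 -2];             % impulses in the Dirac half
x(N + [3 10]) = [2 1];            % sinusoids in the DCT half
y = Phi * x;                      % mixture signal
subplot(211); stem(y, '.'); title('Signal');
subplot(212); stem(dct(y), '.'); title('DCT basis representation');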

Creating a random dictionary

We consider constructing a Gaussian random matrix.

  1. Choose the number of measurements \(M\) say 128.
  2. Choose the signal space dimension \(N\) say 1024.
  3. Generate a Gaussian random matrix as \(\Phi = \text{randn(M, N)}\).

Normalization

There are two ways of normalizing the random matrix to a dictionary.

One view considers that all columns or atoms of a dictionary should be of unit norm.

  1. Measure the norm of each column. You may be tempted to write a for loop to do this. While that works, MATLAB is known for its vectorization capabilities; consider using a combination of element-wise multiplication with conj, sum and sqrt to write a function which measures the column-wise norms of a matrix. You may also explore bsxfun. A vectorized sketch follows this list.
  2. Divide each column by its norm to construct a normalized dictionary.
  3. Verify that the columns of this dictionary are indeed unit norm.
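
One possible vectorized sketch of this normalization (a fresh Gaussian matrix is generated here so that the snippet stands alone):

M = 128; N = 1024;
Phi = randn(M, N);                            % unnormalized Gaussian matrix
norms = sqrt(sum(Phi .* conj(Phi), 1));       % 1 x N vector of column norms
Phi = bsxfun(@rdivide, Phi, norms);           % divide each column by its norm
max(abs(sqrt(sum(abs(Phi).^2, 1)) - 1))       % should be numerically zero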

An alternative way considers a probabilistic view.

  • We require each entry in the Gaussian random matrix to have zero mean and variance \(\frac{1}{M}\).
  • This ensures that the expected squared norm of each column is 1, although the actual norm of each column may differ.
  • As the number of measurements increases, the norm of each column concentrates more tightly around 1.

We can apply these ideas as follows. Recall that randn generates Gaussian random variables with zero mean and unit variance.

  1. Divide the whole random matrix by \(\sqrt{M}\) (equivalently, multiply it by \(\frac{1}{\sqrt{M}}\)) to obtain the desired sensing matrix.
  2. Measure the norm of each column.
  3. Verify that the norms are indeed close to 1 (though not exactly).
  4. Vary M and N to see how norms vary.
  5. Use imagesc or imshow function to visualize the sensing matrix.

An example Gaussian sensing matrix is illustrated in figure below.

_images/gaussian_matrix1.png
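
A sketch of the probabilistic normalization together with a quick visualization:

M = 128; N = 1024;
Phi = randn(M, N) / sqrt(M);      % entries now have variance 1/M
norms = sqrt(sum(Phi .^ 2, 1));   % column norms: close to 1, but not exactly
imagesc(Phi); colormap gray; colorbar;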

Taking compressive measurements

  1. Choose a sparsity level (say K=10)
  2. Choose a sparse support over \(1 \dots N\) of size K randomly using randperm function.
  3. Construct a sparse vector with bi-uniform non-zero entries.
  4. Apply the Gaussian sensing matrix on to the sparse signal to compute compressive measurement vector \(y = \Phi x \in \mathbb{R}^M\).

An example compressive measurement vector is shown in the figure below.

_images/measurement_vector_biuniform1.png

In the sequel we will refer to the computation of the noiseless measurement vector by the equation \(y = \Phi x\).

When the measurements are noisy, the equation becomes \(y = \Phi x + e\).
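
Putting the measurement steps together, a hedged sketch; the noise level sigma here is an arbitrary choice for illustration:

M = 128; N = 1024; K = 10;
Phi = randn(M, N) / sqrt(M);               % Gaussian sensing matrix
Omega = randperm(N, K);                    % sparse support
x = zeros(N, 1);
x(Omega) = sign(randn(K, 1)) .* (1 + rand(K, 1));   % bi-uniform entries, magnitudes in [1, 2]
y = Phi * x;                               % noiseless measurements
sigma = 0.01;
e = sigma * randn(M, 1);                   % white Gaussian measurement noise
y_noisy = y + e;
stem(y, '.');                              % visualize the measurement vector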

Before we jump into sparse recovery, let us spend some time studying some simple properties of dictionaries.

Measuring dictionary properties

Gram matrix

You have already done this before. The straightforward calculation is \(G = \Phi' * \Phi\), where \(\Phi'\) denotes the conjugate transpose of the dictionary \(\Phi\).

  1. Write a function to measure the Gram matrix of any dictionary.
  2. Compute the Gram matrix for all the dictionaries discussed above.
  3. Verify that Gram matrix is symmetric.

For most of our purposes, the sign or phase of entries in the Gram matrix is not important. We may use the symbol G to refer to the Gram matrix in the sequel.

  1. Compute absolute value Gram matrix abs(G).

Coherence

Recall that the coherence of a dictionary is the largest absolute inner product between any pair of distinct atoms. It is quite easy to read the coherence off the absolute value Gram matrix.

  • We reject the diagonal elements since they correspond to the inner product of an atom with itself. For a properly normalized dictionary, they should be 1 anyway.
  • Since the matrix is symmetric we need to look at only the upper triangular half or the lower triangular half (excluding the diagonal) to read off the coherence.
  • Pick the largest value in the upper triangular half.
  1. Write a MATLAB function to compute the coherence.
  2. Compute coherence of a Dirac-DFT dictionary for different values of N. Plot the same to see how coherence decreases with N.
  3. Do the same for Dirac-DCT.
  4. Compute the coherence of Gaussian dictionary (with say N=1024) for different values of M and plot it.
  5. In the case of the Gaussian dictionary, it is better to average the coherence over several instances of the dictionary with the same M and N. A sketch of a coherence function is given below.
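
One possible sketch of such a function (the name coherence is just a suggestion), reading the value off the absolute Gram matrix:

function mu = coherence(Phi)
% Coherence as the largest off-diagonal entry of the absolute Gram matrix.
% Assumes the atoms (columns of Phi) are normalized.
G = abs(Phi' * Phi);
G = G - diag(diag(G));    % drop inner products of atoms with themselves
mu = max(G(:));
end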

Babel function

Babel function is quite interesting. While the definition looks pretty scary, it turns out that it can be computed very easily from the Gram matrix.

  1. Compute the (absolute value) Gram matrix for a dictionary.
  2. Sort the rows of the Gram matrix (each row separately) in descending order.
  3. Remove the first column (it consists of all ones for a normalized dictionary).
  4. Construct a new matrix by accumulating over the columns of the sorted Gram matrix above. In other words, in the new matrix
    • First column is as it is.
    • Second column consists of sum of first and second column of sorted matrix.
    • Third column consists of the sum of the first to third columns of the sorted matrix.
    • Continue accumulating like this.
  5. Compute the maximum for each column.
  6. Your Babel function is in front of you.
  7. Write a MATLAB function to carry out these steps for any dictionary; a sketch follows this list.
  8. Compute the Babel function for Dirac-DFT and Dirac-DCT dictionary with (N=256).
  9. Compute the Babel function for a Gaussian dictionary with N=256. In fact, compute the Babel functions for many instances of the Gaussian dictionary and then compute the average Babel function.
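
A possible sketch of the Babel function computation following the steps above (the function name babel is just a suggestion):

function mu1 = babel(Phi)
% Babel function computed from the absolute Gram matrix.
G = abs(Phi' * Phi);
G = sort(G, 2, 'descend');        % sort each row in descending order
G = G(:, 2:end);                  % drop the leading column of ones
mu1 = max(cumsum(G, 2), [], 1);   % accumulate columns, take column-wise maxima
end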

Getting started with sparse recovery

Our first objective will be to develop algorithms for sparse recovery in noiseless case.

The defining equation is \(y = \Phi x\) where \(x\) is the sparse representation vector, \(\Phi\) is the dictionary or sensing matrix and \(y\) is the signal or measurement vector. In any sparse recovery algorithm, the following quantities are of core interest:

  • \(x\) which is unknown to us.
  • \(\Phi\) which is known to us. Sometimes we may know \(\Phi\) only approximately.
  • \(y\) which is known to us.
  • Given \(\Phi\) and \(y\), we estimate an approximation of \(x\) which we will represent as \(\widehat{x}\).
  • \(\widehat{x}\) is (typically) sparse even if \(x\) may be only approximately sparse or compressible.
  • Given an estimate \(\widehat{x}\), we compute the residual \(r = y - \Phi \widehat{x}\). This quantity is computed during the sparse recovery process.
  • Measurement or signal error norm \(\| r \|_2\). We strive to reduce this as much as possible.
  • Sparsity level \(K\). We try to come up with an \(\widehat{x}\) which is K-sparse.
  • Representation error or recovery error \(f = x - \widehat{x}\). This is unknown to us. The recovery process tends to minimize its norm \(\| f \|_2\) (if it is working correctly !).

Some notes are in order

  • K may or may not be given to us. If K is given to us, we should use it in our recovery process. If it is not given, then we should work with \(\| r \|_2\).
  • While the recovery algorithm itself doesn't know about \(x\) and hence cannot calculate \(f\), a controlled testing environment can carefully choose an \(x\), compute \(y\), and pass \(\Phi\) and \(y\) to the recovery algorithm. Thus, the testing environment can easily compute \(f\) from the \(x\) known to it and the \(\widehat{x}\) returned by the recovery algorithm.

Usually the sparse recovery algorithms are iterative. In each iteration, we improve our approximation \(\widehat{x}\) and reduce \(\| r \|_2\).

  • We can denote the iteration counter by \(k\) starting from 0 onwards.
  • We denote k-th approximation by \(\widehat{x}^k\) and k-th residual by \(r^k\).
  • A typical initial estimate is given by \(\widehat{x}^0 = 0\) and thus, \(r^0 = y\).

Objectives of recovery algorithm

There are fundamentally two objectives of a sparse recovery algorithm

  • Identification of locations at which \(\widehat{x}\) has non-zero entries. This corresponds to the sparse support of \(x\).
  • Estimation of the values of non-zero entries in \(\widehat{x}\).

We will use the following notation.

  • The identified support will be denoted as \(\Lambda\). It is the responsibility of the sparse recovery algorithm to guess it.
  • If the support is identified gradually in each iteration, we can use the notation \(\Lambda^k\).
  • The actual support of \(x\) will be denoted by \(\Omega\). Since \(x\) is unknown to us hence \(\Omega\) is also unknown to us within the sparse recovery algorithm. However, the controlled testing environment would know about \(\Omega\).

If the support has been identified correctly, then the estimation part is quite easy. It is nothing but the application of least squares over the columns of \(\Phi\) selected by the support set.

Different recovery algorithms vary in how they approach the support identification and coefficient estimations.

  • Some algorithms try to identify whole support at once and then estimate the values of non-zero entries.
  • Some algorithms identify atoms in the support one at a time and iteratively estimate the non-zero values for the current support.

Simple support identification

  • Write a function which sorts a given vector by the decreasing order of magnitudes of its entries.
  • Identify the K largest (magnitude) entries in the sorted vector and their locations in the original vector.
  • Collect the locations of the K largest entries into a set.

Note

[sorted_x, index_vector] = sort(x) in MATLAB returns both the sorted entries and the index vector such that sorted_x = x(index_vector). Our interest is usually in index_vector, as we don't really want to change the order of entries in x while identifying the K largest entries.

In MATLAB a set can be represented using an array. You have to be careful to ensure that such a set never has any duplicate elements.

Sparse approximation of a given vector

Given a vector \(x\) which may not be sparse, its K-sparse approximation which is best in the \(l_p\) norm sense can be obtained by keeping the K largest (in magnitude) entries and setting the rest to zero.

  1. Write a MATLAB function to compute the K-sparse approximation of any vector; a sketch follows this list.
    • Identify the K largest entries and put their locations in the support set \(\Lambda\).
    • Compute \(\Lambda^c = \{1 \dots N \} \setminus \Lambda\).
    • Set the entries corresponding to \(\Lambda^c\) in \(x\) to zero.
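
A sketch of such a function (the name sparse_approximation is just a suggestion), combining the support identification with the zeroing step:

function x_k = sparse_approximation(x, K)
% Best K-term approximation of x: keep the K largest-magnitude entries.
[~, index_vector] = sort(abs(x), 'descend');
Lambda = index_vector(1:K);       % support of the approximation
x_k = zeros(size(x));
x_k(Lambda) = x(Lambda);          % entries outside Lambda remain zero
end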

The proxy vector

A very interesting quantity which appears in many sparse recovery algorithms is the proxy vector \(p = \Phi' r\).

The figure below shows a sparse vector, its measurements and the corresponding proxy vector \(p^0 = \Phi' r^0 = \Phi' y\).

_images/proxy_vector.png

While the proxy vector may look quite chaotic at first glance, it is very interesting to note that it tends to have large entries at exactly the same locations as the sparse vector \(x\) itself.

If we think about the proxy vector closely, we notice that each entry in the proxy is the inner product of an atom in \(\Phi\) with the residual \(r\). Thus, each entry in the proxy vector indicates how similar an atom in the dictionary is to the residual.

  1. Choose M, N and K and construct a sparse vector \(x\) with support \(\Omega\) and Gaussian dictionary \(\Phi\).
  2. For the measurement vector \(y = \Phi x\), compute \(p = \Phi' y\).
  3. Identify the K largest entries in \(p\) and use their locations to make a guess of support as \(\Lambda\).
  4. Compare the sets \(\Omega\) and \(\Lambda\). Measure the support identification ratio as \(\frac{|\Lambda \cap \Omega|}{|\Omega|}\) i.e. the ratio of the number of indices common in \(\Lambda\) and \(\Omega\) with the number of indices in \(\Omega\) (which is K).
  5. Keep M and N fixed and vary K to see how support identification ratio changes. For this, measure average support identification ratio for say 100 trials. You may increase the number of trials if you want.
  6. Keep K=4, N=1024 and vary M from 10 to 500 to see how the support identification ratio changes. Again use the average value. A single-trial sketch of this experiment is given below.
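
A rough single-trial sketch of this experiment (the particular M, N, K and the bi-uniform generator are example choices):

M = 128; N = 1024; K = 10;
Phi = randn(M, N) / sqrt(M);               % Gaussian dictionary
Omega = randperm(N, K);                    % true support
x = zeros(N, 1);
x(Omega) = sign(randn(K, 1)) .* (1 + rand(K, 1));   % bi-uniform entries
y = Phi * x;                               % measurements
p = Phi' * y;                              % proxy vector
[~, idx] = sort(abs(p), 'descend');
Lambda = idx(1:K);                         % guessed support
ratio = numel(intersect(Lambda, Omega)) / K   % support identification ratio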

Note

The support identification ratio is a critical tool for evaluating the quality of a sparse recovery algorithm. Recall that if the support has been identified correctly, then reconstructing a sparse vector is a simple least squares problem. If the support is identified partially, or some of the indices are incorrect, then it can lead to large recovery errors.

If the support identification ratio is 1, then we have correctly identified the support. Otherwise, we haven’t.

For noiseless recovery, if the support is identified correctly, then the representation will be recovered correctly (unless \(\Phi_{\Lambda}\) is ill conditioned). Thus, the support identification ratio is a good measure of the success or failure of recovery. We don't need to worry about SNR or the norm of the recovery error.

In the sequel, for noiseless recovery, we will say that recovery succeeds if support identification ratio is 1.

If we run multiple trials of a recovery algorithm (for a specific configuration of K, M, N etc.) with different data, then the recovery rate would be the number of trials in which successful recovery happened divided by the total number of trials.

The recovery rate (on reasonably high number of trials) would be our main tool for measuring the quality of a recovery algorithm. Note that the recovery rate depends on

  • The representation space dimension \(N\).
  • The number of measurements \(M\).
  • The sparsity level \(K\).
  • The choice of dictionary \(\Phi\).

It does not depend significantly on the choice of distribution for the non-zero entries in \(x\), provided the entries are i.i.d.

Developing the hard thresholding algorithm

Based on the idea of the proxy vector, we can easily compute a sparse approximation as follows.

  1. Identify the K largest entries in the proxy and their locations.
  2. Put the locations together in your guess for the support \(\Lambda\).
  3. Identify the columns of \(\Phi\) corresponding to \(\Lambda\) and construct a submatrix \(\Phi_{\Lambda}\).
  4. Compute \(x_{\Lambda} = \Phi_{\Lambda}^{\dagger} y\) as the least squares solution of the problem \(y = \Phi_{\Lambda} x_{\Lambda}\).
  5. Set the remaining entries of \(\widehat{x}\) (those corresponding to \(\Lambda^c\)) to zero.

Put together the algorithm described above in a MATLAB function like x_hat = hard_thresholding(Phi, y, K).
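
A possible sketch of such a function:

function x_hat = hard_thresholding(Phi, y, K)
% One-shot hard thresholding: guess the support from the proxy,
% then solve least squares on the selected columns.
[~, N] = size(Phi);
p = Phi' * y;                          % proxy vector
[~, idx] = sort(abs(p), 'descend');
Lambda = idx(1:K);                     % guessed support
x_hat = zeros(N, 1);
x_hat(Lambda) = Phi(:, Lambda) \ y;    % least squares over Phi_Lambda
end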

  1. Think and explain why hard thresholding will always succeed if \(K=1\).
  2. Say \(N=256\) and \(K=2\). What is the minimum number of measurements at which the recovery rate equals 1?

Phase transition diagram

A nice visualization of the performance of a recovery algorithm is via its phase transition diagram. The figure below shows the phase transition diagram for orthogonal matching pursuit algorithm with a Gaussian dictionary and Gaussian sparse vectors.

  • N is fixed at 64.
  • K is varied from 1 to 4.
  • M is varied from 2 to 32 (N/2) in steps of 2.
  • For each configuration of K and M, 1000 trials are conducted and recovery rate is measured.
  • In the phase transition diagram, a white cell indicates that for the corresponding K and M, the algorithm always recovers the signal successfully.
  • A black cell indicates that the algorithm never successfully recovers any signal for the corresponding K and M.
  • A gray cell indicates that the algorithm sometimes recovers successfully while sometimes it may fail.
  • Safe zone of operation is the white area in the diagram.
_images/OMP_gaussian_dict_gaussian_data_phase_transition.png

In the figure below, we capture the minimum required number of measurements for different values of K for the OMP algorithm running on a Gaussian sensing matrix.

_images/OMP_gaussian_dict_gaussian_data_k_vs_min_m.png

It is evident that as K increases, the minimum M required for successful recovery also increases.

  1. Generate the phase transition diagram for the hard thresholding algorithm with N = 256, K varying from 1 to 16, M varying from 2 to 128, and a minimum of 100 trials for each configuration; a skeleton of the trial loop is sketched after this list.
  2. Use the phase transition diagram data for estimating the minimum M for different values of K and plot it.
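
A skeleton for the trial loop; it assumes the hard_thresholding function sketched earlier and treats full support recovery as success:

N = 256; Ks = 1:16; Ms = 2:2:128; T = 100;
rates = zeros(numel(Ks), numel(Ms));
for i = 1:numel(Ks)
    for j = 1:numel(Ms)
        K = Ks(i); M = Ms(j); success = 0;
        for t = 1:T
            Phi = randn(M, N) / sqrt(M);           % fresh Gaussian dictionary
            Omega = randperm(N, K);                % true support
            x = zeros(N, 1); x(Omega) = randn(K, 1);
            y = Phi * x;
            x_hat = hard_thresholding(Phi, y, K);
            Lambda = find(x_hat);                  % recovered support
            success = success + (numel(intersect(Lambda, Omega)) == K);
        end
        rates(i, j) = success / T;                 % recovery rate for (K, M)
    end
end
imagesc(Ms, Ks, rates); axis xy; colormap gray;
xlabel('M'); ylabel('K');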

Developing the matching pursuit algorithm

You can read the description of matching pursuit algorithms on Wikipedia. This is a simpler algorithm than orthogonal matching pursuit. It doesn’t involve any least squares step.

  1. Implement the matching pursuit (MP) algorithm in MATLAB; a sketch follows this list.
  2. Generate the phase transition diagram for MP algorithm with N = 256, K varying from 1 to 16 and M varying from 2 to 128 and a minimum of 100 trials for each configuration.
  3. Use the phase transition diagram data for estimating the minimum M for different values of K and plot it.
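
A minimal sketch of MP, assuming unit-norm atoms; the stopping rule (residual tolerance plus iteration cap) is one simple choice among many:

function x_hat = matching_pursuit(Phi, y, max_iter, tol)
% Matching pursuit: repeatedly pick the atom best aligned with the residual
% and update only that coefficient (no least squares step).
[~, N] = size(Phi);
x_hat = zeros(N, 1);
r = y;                                  % initial residual
for iter = 1:max_iter
    p = Phi' * r;                       % correlations with the residual
    [~, idx] = max(abs(p));             % best matching atom
    x_hat(idx) = x_hat(idx) + p(idx);   % update its coefficient
    r = r - p(idx) * Phi(:, idx);       % update the residual
    if norm(r) < tol
        break;
    end
end
end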

Developing the orthogonal matching pursuit algorithm

The orthogonal matching pursuit algorithm is described in the figure below.

_images/omp_algorithm.png
  1. Implement the orthogonal matching pursuit (OMP) algorithm in MATLAB; a sketch follows this list.
  2. Generate the phase transition diagram for OMP algorithm with N = 256, K varying from 1 to 16 and M varying from 2 to 128 and a minimum of 100 trials for each configuration.
  3. Use the phase transition diagram data for estimating the minimum M for different values of K and plot it.
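
A minimal sketch of OMP for the case where the sparsity level K is known:

function x_hat = omp(Phi, y, K)
% Orthogonal matching pursuit: add one atom per iteration and re-estimate
% all selected coefficients by least squares.
[~, N] = size(Phi);
r = y;                                 % initial residual
Lambda = [];                           % estimated support
for k = 1:K
    p = Phi' * r;                      % proxy: correlations with residual
    [~, idx] = max(abs(p));            % best matching atom
    Lambda = union(Lambda, idx);       % grow the support
    z = Phi(:, Lambda) \ y;            % least squares over the current support
    r = y - Phi(:, Lambda) * z;        % new residual
end
x_hat = zeros(N, 1);
x_hat(Lambda) = z;
end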

Sparsifying an image

Scripts

Preamble

close all; clear all; clc;

Resetting random numbers:

rng('default');

Export management flag:

export = true;

Figures

Exporting figures:

if export
    export_fig images\figure_name.png -r120 -nocrop;
    export_fig images\figure_name.pdf;
end

Typical steps in figures:

xlabel('Principal angle (degrees)');
ylabel('Number of subspace pairs');
title('Distribution of principal angles over subspace pairs in signal space');
grid on;

