Sparse-Plex¶
A journey into sparse and redundant representations.
About Sparse-Plex¶
Sparse-plex is a MATLAB library for solving sparse representation problems.

This is an example of a union of subspaces model. While the ambient space is \(\RR^3\), the data points actually fall in one of three 2-dimensional planes. The black points lie in the \(xy\)-plane, the yellow points in the \(yz\)-plane and the red points in the \(zx\)-plane. Each of the 3 planes is a subspace of the ambient 3-dimensional space. Once an appropriate basis for each of the subspaces is chosen, the data points require only 2 coordinates to identify them within the subspace. In this case, it is easy to see that the standard basis for \(\RR^3\) also contains the basis vectors for the individual subspaces. Thus, in the standard basis, each data point has only 2 non-zero coordinates.
The library website is: http://indigits.github.io/sparse-plex/.
Online documentation is hosted at: http://sparse-plex.readthedocs.org/en/latest/.
The project is hosted on GitHub at: https://github.com/indigits/sparse-plex.
It contains implementations of many state-of-the-art algorithms. Some implementations are simple and straightforward, while others have received extra effort to optimize their speed.
In addition to these, the library provides implementations of many other algorithms which are building blocks for the sparse recovery algorithms.
The library aims to solve:
- Single vector sparse recovery or sparse approximation problems
- Multiple vector joint sparse recovery or sparse approximation problems
The library provides
- Various simple dictionaries and sensing matrices
- Implementations of pursuit algorithms
  - Matching pursuit
  - Orthogonal matching pursuit
  - Compressive sampling matching pursuit
  - Basis pursuit
- Some joint recovery algorithms
  - Cluster orthogonal matching pursuit
- Some clustering algorithms
  - Spectral clustering
  - Sparse subspace clustering using l_1 minimization
  - Sparse subspace clustering using orthogonal matching pursuit
- Various utilities for working with matrices, signals, norms, distances, signal comparison, vector spaces
- Some visualization utilities
- Some combinatoric systems
- Various constructions for synthetic sparse signals
- Some optimization algorithms
  - Steepest descent
  - Conjugate gradient descent
- Detection and estimation algorithms
  - Compressive binary detector
The documentation contains several how-to tutorials. They are meant to help beginners in the area ramp up quickly. The documentation is not really a user manual; it doesn't describe all the parameters and behavior of each function in detail. Rather, it provides various code examples to explain how things work. Users are requested to read through the source code and the relevant papers to get a deeper understanding of the methods.
Getting Started¶
Requirements¶
While much of the library can be used on a stock MATLAB distribution with standard toolboxes, some parts of the library depend on specific third-party libraries. These dependencies are explained below.
MATLAB toolboxes
- Signal processing toolbox
- Image processing toolbox
- Statistics toolbox
- Optimization toolbox
Third party library dependencies (optional)
- CVX http://cvxr.com/cvx/
- LRS library https://github.com/andrewssobral/lrslibrary
- Wavelab http://statweb.stanford.edu/~wavelab/
We repeat that only some parts of the library and examples depend on the third-party libraries. You can install them on an as-needed basis; you don't need to install them in advance.
Installation¶
- Download the sparse-plex library from http://indigits.github.io/sparse-plex/.
- Unzip it in a suitable folder.
- Add the following commands to your MATLAB startup script (see the sketch below):
  - Change directory to the root directory of sparse-plex.
  - Run the spx_setup function.
  - Change back to whatever directory you want to be in.
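As a concrete illustration, the startup additions might look like the following minimal sketch. The folder names here are hypothetical and must be adjusted to your installation; only spx_setup is an actual sparse-plex function.
% startup.m additions (hypothetical paths; adjust to your installation)
cd('C:\work\sparse-plex');   % root directory of the unzipped library
spx_setup;                   % run the sparse-plex setup function
cd('C:\work');               % change back to your preferred directory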
Note
Make sure that MATLAB has write permissions to the directory in which you install sparse-plex. Some functions in sparse-plex create some MAT files for caching of intermediate results. Moreover, the sparse-plex setup script also creates a local settings file. For creating these files, write access is needed.
Getting acquainted¶
The online library documentation includes a number of step-by-step demonstrations. Follow these tutorials to get familiar with the library.
Running examples¶
- Change directory to the root directory of sparse-plex.
- Go into the examples directory.
- Browse the examples.
- Run the example you want.
Checking the source code¶
- Change directory to the root directory of sparse-plex.
- Go into the library directory.
- Browse the source code.
- The source code for the spx library is maintained in the +spx directory.
- Unit tests for the library are maintained in the tests directory.
Verifying the installation¶
A number of unit tests have been included in the software to verify its integrity. The unit tests are based on MATLAB's built-in unit testing framework.
- Change directory to the root directory of sparse-plex.
- Move to the directory library/tests.
- Execute the runalltests.m script (see the sketch below).
- Verify that all unit tests pass.
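From the MATLAB prompt, the verification might look like this minimal sketch, assuming your current directory contains the unzipped sparse-plex root:
% run the complete unit test suite (paths are illustrative)
cd(fullfile('sparse-plex', 'library', 'tests'));
runalltests;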
Building MATLAB Extensions¶
Some of the fast implementations of various algorithms are written in C as MATLAB extensions. You will need to build them before using them.
This section assumes that you have the necessary build tools available in your MATLAB installation. See What You Need to Build MEX Files for details.
- Go to the sparse-plex\library\+spx\+fast\private directory inside MATLAB.
- Run the make.m script.
The script make.m contains the necessary commands to invoke the mex compiler on each of the source files in this private directory. The script takes care of building only those files which have been modified since the last build.
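In other words, a build session might look like the following sketch, assuming you start from the sparse-plex root directory:
% build the MEX extensions
cd(fullfile('library', '+spx', '+fast', 'private'));
make;    % invokes the mex compiler on the modified C sources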
Building documentation¶
Only if you really want to do it! Normally, you can read it online.
You will require Python Sphinx and other related packages, such as the Pygments library, to build the documentation from scratch.
- Change directory to the root directory of sparse-plex.
- Go into the docs directory.
- Build the documentation using the Sphinx tool chain.
Here is the command for building the documentation automatically as changes are made to the documentation sources:
sphinx-autobuild --port=9102 . _build\html
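For a one-shot build without the auto-reload behavior, the standard Sphinx builder can be invoked in the same way (this is a generic Sphinx command, not something specific to sparse-plex):
sphinx-build -b html . _build\html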
Configuring test data directories¶
Several examples in sparse-plex are developed on top of standard data sets. These include (but are not limited to):
- Standard test images
- Yale Extended B Faces database (cropped images)
In order to execute these examples, access to the data is needed. The data is not distributed along with this software. You can download the data and store it on your computer wherever you wish. In order to provide access to this data, you need to tell sparse-plex where the data resides. This can be done by editing the spx_local.ini file.
When you download and unzip the library, this file doesn't exist. When you run spx_setup, spx_defaults.ini is copied into spx_local.ini.
All you need to do is to point to the right directories which hold the test datasets.
Specific settings in spx_local.ini are:
- standard_test_images_dir
- yale_faces_db_dir
For more information, read the file.
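As an illustration only, the relevant entries might look like the lines below; the paths are hypothetical and the exact key syntax should be checked against spx_defaults.ini:
standard_test_images_dir=C:\datasets\standard_test_images
yale_faces_db_dir=C:\datasets\CroppedYale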
Demos¶
Dirac DCT Tutorial¶
This tutorial is based on examples\ex_dirac_dct_two_ortho_basis.m.
In this tutorial we will:
- Construct a DCT basis
- Construct a Dirac-DCT dictionary.
- Construct a signal which is a mixture of few impulses and a few sinusoids.
- Construct its representation in the DCT basis.
- Recover its representation in the Dirac-DCT dictionary using the following sparse recovery algorithms:
  - Matching Pursuit
  - Orthogonal Matching Pursuit
  - Basis Pursuit
- Measure the recovery error for different sparse recovery algorithms.
Signal space dimension:
N = 256;
Dirac basis:
I = eye(N);
DCT basis:
Psi = dctmtx(N)';
Visualizing the DCT basis:
imagesc(Psi) ;
colormap(gray);
colorbar;
axis image;
title('\Psi');

Combining the Dirac and DCT orthonormal bases to form a two-ortho dictionary:
Phi = [I Psi];
Visualizing the dictionary:
imagesc(Phi) ;
colormap(gray);
colorbar;
axis image;
title('\Phi');

Constructing a signal which is a combination of impulses and cosines:
alpha = zeros(2*N, 1);
alpha(20) = 1;
alpha(30) = -.4;
alpha(100) = .6;
alpha(N + 4) = 1.2;
alpha(N + 58) = -.8;
x = Phi * alpha;
K = 5;

Finding the representation in DCT basis:
x_dct = Psi' * x;

Sparse representation in the Dirac DCT dictionary

Obtaining the sparse representation using matching pursuit algorithm:
solver = spx.pursuit.single.MatchingPursuit(Phi, K);
result = solver.solve(x);
mp_solution = result.z;
mp_diff = alpha - mp_solution;
% Recovery error
mp_recovery_error = norm(mp_diff) / norm(x);

Matching pursuit recovery error: 0.0353.
Obtaining the sparse representation using orthogonal matching pursuit algorithm:
solver = spx.pursuit.single.OrthogonalMatchingPursuit(Phi, K);
result = solver.solve(x);
omp_solution = result.z;
omp_diff = alpha - omp_solution;
% Recovery error
omp_recovery_error = norm(omp_diff) / norm(x);

Orthogonal Matching pursuit recovery error: 0.0000.
Obtaining a sparse approximation via basis pursuit:
solver = spx.pursuit.single.BasisPursuit(Phi, x);
result = solver.solve_l1_noise();
l1_solution = result;
l1_diff = alpha - l1_solution;
% Recovery error
l1_recovery_error = norm(l1_diff) / norm(x);

l_1 recovery error: 0.0010.
Basic CS Tutorial¶
This tutorial is based on examples\ex_simple_compressed_sensing_demo.m.
In this tutorial we will:
- Create sparse signals (with Gaussian and bi-uniform distributed non-zero samples).
- Look at how to identify support of a signal.
- Construct a Gaussian sensing matrix.
- Visualize the sensing matrix.
- Compute random measurements on the sparse signal with the sensing matrix.
- Add measurement noise to the measurements.
- Recover the sparse vector using the following sparse recovery algorithms:
  - Matching Pursuit
  - Orthogonal Matching Pursuit
  - Basis Pursuit
- Measure the recovery error for different sparse recovery algorithms.
Basic setup:
% Signal space
N = 1000;
% Number of measurements
M = 200;
% Sparsity level
K = 8;
Choosing the support randomly:
Omega = randperm(N, K);
Constructing a sparse vector with Gaussian entries:
% Initializing a zero vector
x = zeros(N, 1);
% Filling it with non-zero Gaussian entries at specified support
x(Omega) = 4 * randn(K, 1);

Constructing a bi-uniform sparse vector:
a = 1;
b = 2;
% unsigned magnitudes of non-zero entries
xm = a + (b-a).*rand(K, 1);
% Generate sign for non-zero entries randomly
sgn = sign(randn(K, 1));
% Combine sign and magnitude
x(Omega) = sgn .* xm;

Identifying support:
find(x ~= 0)'
% 98 127 277 544 630 815 905 911
Constructing a Gaussian sensing matrix:
Phi = randn(M, N);
% Scale so that each entry has standard deviation 1/sqrt(M) (variance 1/M)
Phi = Phi ./ sqrt(M);
Computing norm of each column:
column_norms = sqrt(sum(Phi .* conj(Phi)));
Norm histogram

Constructing a Gaussian dictionary with normalized columns:
for i=1:N
v = column_norms(i);
% Scale it down
Phi(:, i) = Phi(:, i) / v;
end
Visualizing the sensing matrix:
imagesc(Phi) ;
colormap(gray);
colorbar;
axis image;

Making random measurements from sparse high dimensional vector:
y0 = Phi * x;

Adding some measurement noise:
SNR = 15;
snr = db2pow(SNR);
noise = randn(M, 1);
% we treat each column as a separate data vector
signalNorm = norm(y0);
noiseNorm = norm(noise);
actualNormRatio = signalNorm / noiseNorm;
requiredNormRatio = sqrt(snr);
gain_factor = actualNormRatio / requiredNormRatio;
noise = gain_factor .* noise;
Measurement vector with noise:
y = y0 + noise;

Sparse recovery using matching pursuit:
solver = spx.pursuit.single.MatchingPursuit(Phi, K);
result = solver.solve(y);
mp_solution = result.z;
Recovery error:
mp_diff = x - mp_solution;
mp_recovery_error = norm(mp_diff) / norm(x);

Matching pursuit recovery error: 0.1612.
Sparse recovery using orthogonal matching pursuit:
solver = spx.pursuit.single.OrthogonalMatchingPursuit(Phi, K);
result = solver.solve(y);
omp_solution = result.z;
omp_diff = x - omp_solution;
omp_recovery_error = norm(omp_diff) / norm(x);

Orthogonal Matching pursuit recovery error: 0.0301.
Sparse recovery using l_1 minimization:
solver = spx.pursuit.single.BasisPursuit(Phi, y);
result = solver.solve_l1_noise();
l1_solution = result;
l1_diff = x - l1_solution;
l1_recovery_error = norm(l1_diff) / norm(x);

l_1 recovery error: 0.1764.
Sparse Signal Models¶
Outline¶
In this chapter we develop initial concepts of sparse signal models.

A bird’s eye view of the sparse representations and compressive sensing framework. Signals (like speech, images, etc.) reside in a signal space \(\RR^N\). Analytical or trained dictionaries can be constructed such that the signals can have a sparse representation in such dictionaries. These sparse representations reside in a representation space \(\RR^D\). A sparse approximation algorithm \(\Delta_a\) can construct a representation \(\alpha\) for a signal \(x\) in the dictionary \(\DDD\). The approximation error is \(e\). A small number of \(M\) random measurements are sufficient to capture all the information in \(x\). The sensing process \(y = \Phi x + n\) constructs the measurement vector \(y \in \RR^M\) for a given signal where \(n\) is the measurement noise. In order to get \(x\) from \(y\), we first need to recover the sparse representation \(\alpha\) using the sparse recovery algorithm \(\Delta_r\). Then \(x \approx \DDD \alpha\).
We begin our study with a review of solutions of under-determined systems. We build a case for solutions which promote sparsity.
We show that although real-life signals may not be exactly sparse, they are often compressible and can be well approximated by sparse signals.
We then review orthonormal bases and explain the inadequacy of those bases in exploiting the sparsity in many signals of interest. We develop the example of the Dirac-Fourier two-ortho basis and demonstrate how it can exploit signal sparsity better than the Dirac basis and the Fourier basis individually.
We follow this with a general discussion of redundant signal dictionaries. We show how they can be used to create sparse and redundant signal representations.
We study various properties of signal dictionaries which are useful in characterizing the capabilities of a signal dictionary in exploiting signal sparsity.
In this chapter, our signals of interest will typically lie in the finite \(N\)-dimensional complex vector space \(\CC^N\). Sometimes we will restrict our attention to the \(N\)-dimensional Euclidean space \(\RR^N\) to simplify the discussion.
We will be concerned with different representations of our signals of interest in \(\CC^D\) where \(D \geq N\). This aspect will become clearer as we go along in this chapter.
Sparsity¶
We quickly define the notion of sparsity in a signal.
We recall the definition of the \(l_0\)-"norm" (don't forget the quotes) of \(x \in \CC^N\) given by
\[\| x \|_0 = | \supp(x) |\]
where \(\supp(x) = \{ i : x_i \neq 0\}\) denotes the support of \(x\).
Informally we say that a signal \(x \in \CC^N\) is sparse if \(\| x \|_0 \ll N\).
More generally if \(x =\DDD \alpha\) where \(\DDD \in \CC^{N \times D}\) with \(D > N\) is some signal dictionary (to be formally defined later), then \(x\) is sparse in dictionary \(\DDD\) if \(\| \alpha \|_0 \ll D\).
Sometimes we simply say that \(x\) is \(K\)-sparse if \(\| x \|_0 \leq K\) where \(K < N\). We do not specifically require that \(K \ll N\).
An even more general definition of sparsity is the degrees of freedom a signal may have.
As an example consider all points on the surface of the unit sphere in \(\RR^N\). For every point \(x\) on the surface, \(\| x \|_2 = 1\). Thus if we choose the values of \(N-1\) components of \(x\), then the value of the remaining component is automatically fixed (up to a sign). Thus the number of degrees of freedom \(x\) has on the surface of the unit sphere in \(\RR^N\) is actually \(N-1\). Such a surface represents a manifold in the ambient Euclidean space. Of special interest are low-dimensional manifolds where the number of degrees of freedom \(K \ll N\).
Sparse solutions for under-determined linear systems¶
The discussion in this section is largely based on chapter 1 of [Ela10].
Consider a matrix \(\Phi \in \CC^{M \times N}\) with \(M < N\).
Define an under-determined system of linear equations:
\[y = \Phi x\]
where \(y \in \CC^M\) is known and \(x \in \CC^N\) is unknown.
This system has \(N\) unknowns and \(M\) linear equations. There are more unknowns than equations.
Let the columns of \(\Phi\) be given by \(\phi_1, \phi_2, \dots, \phi_N\).
The column space of \(\Phi\) (the vector space spanned by all columns of \(\Phi\)) is denoted by \(\ColSpace(\Phi)\), i.e.
\[\ColSpace(\Phi) = \text{span} \{ \phi_1, \phi_2, \dots, \phi_N \}.\]
We know that \(\ColSpace(\Phi) \subset \CC^M\).
Clearly \(\Phi x \in \ColSpace(\Phi)\) for every \(x \in \CC^N\). Thus if \(y \notin \ColSpace(\Phi)\) then we have no solution. But, if \(y \in \ColSpace(\Phi)\) then we have infinite number of solutions.
Let \(\NullSpace(\Phi)\) represent the null space of \(\Phi\) given by
\[\NullSpace(\Phi) = \{ z \in \CC^N : \Phi z = 0 \}.\]
Let \(\widehat{x}\) be a solution of \(y = \Phi x\) and let \(z \in \NullSpace(\Phi)\). Then
\[\Phi (\widehat{x} + z) = \Phi \widehat{x} + \Phi z = y + 0 = y.\]
Thus the set \(\widehat{x} + \NullSpace(\Phi)\) forms the complete set of infinitely many solutions to the problem \(y = \Phi x\), where
\[\widehat{x} + \NullSpace(\Phi) = \{ \widehat{x} + z : z \in \NullSpace(\Phi) \}.\]
As a running example in this section, we will consider a simple under-determined system in \(\RR^2\). The system is specified by
\[\Phi = \begin{pmatrix} 3 & 4 \end{pmatrix}\]
and
\[y = \begin{pmatrix} 12 \end{pmatrix}\]
with
\[x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}\]
where \(x\) is unknown and \(y\) is known. Alternatively
\[\begin{pmatrix} 3 & 4 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = 12\]
or more simply
\[3 x_1 + 4 x_2 = 12.\]
The solution space of this system is a line in \(\RR^2\) which is shown in the figure below.

The specification of the under-determined system as above doesn't give us any reason to prefer one particular point on the line as the solution.
Two specific solutions are of interest:
- \((x_1, x_2) = (4,0)\) lies on the \(x_1\) axis.
- \((x_1, x_2) = (0,3)\) lies on the \(x_2\) axis.
In both of these solutions, one component is 0, making these solutions sparse.
It is easy to visualize sparsity in this simplified 2-dimensional setup, but the situation becomes more difficult when we are looking at high-dimensional signal spaces. We need well-defined criteria to promote sparse solutions.
Regularization¶
Are all these solutions equivalent, or can we say that one solution is better than the other in some sense? In order to suggest that some solution is better than the others, we need to define a criterion for comparing two solutions.
In optimization theory, this idea is known as regularization.
We define a cost function \(J(x) : \CC^N \to \RR\) which measures the desirability of a given solution \(x\) out of the infinitely many possible solutions. The higher the cost, the lower the desirability of the solution.
Thus the goal of the optimization problem is to find a desired \(x\) with minimum possible cost.
In optimization literature, the cost function is one type of objective function. While the objective of an optimization problem might be either minimized or maximized, cost is always minimized.
We can write this optimization problem as
\[\min_{x} J(x) \quad \text{subject to } y = \Phi x.\]
If \(J(x)\) is convex, then it is possible to find a global minimum cost solution over the solution set.
If \(J(x)\) is not convex, then it may not be possible to find a global minimum; we may have to settle for a local minimum.
A variety of such cost function based criteria can be considered.
\(l_2\) Regularization¶
One of the most common criteria is to choose a solution with the smallest \(l_2\) norm.
The problem can then be reformulated as an optimization problem
\[\min_{x} \| x \|_2 \quad \text{subject to } y = \Phi x.\]
In fact, minimizing \(\| x \|_2\) is the same as minimizing its square \(\| x \|_2^2 = x^H x\).
So an equivalent formulation is
\[\min_{x} \| x \|_2^2 \quad \text{subject to } y = \Phi x.\]
We continue with our running example.
We can write \(x_2\) as
\[x_2 = 3 - \frac{3}{4} x_1.\]
With this substitution, the squared \(l_2\) norm of \(x\) becomes
\[\| x \|_2^2 = x_1^2 + x_2^2 = x_1^2 + \left( 3 - \frac{3}{4} x_1 \right)^2.\]
Minimizing \(\| x \|_2^2\) over all \(x\) is the same as minimizing it over all \(x_1\).
Since \(\| x \|_2^2\) is a quadratic function of \(x_1\), we can simply differentiate it and equate the derivative to 0, giving us
\[2 x_1 - \frac{3}{2} \left( 3 - \frac{3}{4} x_1 \right) = 0 \implies \frac{25}{8} x_1 - \frac{9}{2} = 0.\]
This gives us
\[x_1 = \frac{36}{25} = 1.44 \quad \text{and} \quad x_2 = 3 - \frac{3}{4} (1.44) = 1.92.\]
Thus the optimal \(l_2\) norm solution is obtained at \((x_1, x_2) = (1.44, 1.92)\).
We note that the minimum \(l_2\) norm at this solution is
\[\| x \|_2 = \sqrt{1.44^2 + 1.92^2} = 2.4.\]
It is instructive to note that the \(l_2\) norm cost function prefers a non-sparse solution to the optimization problem.
We can view this solution graphically by drawing \(l_2\) norm balls of different radii in the figure below. The ball which just touches the solution space line (i.e. the line is tangent to the ball) gives us the optimal solution.

All other norm balls either don’t touch the solution line at all, or they cross it at exactly two points.
A formal solution to the \(l_2\) norm minimization problem can be easily obtained using Lagrange multipliers.
We define the Lagrangian
\[\mathcal{L}(x) = \| x \|_2^2 + \lambda^H (\Phi x - y)\]
with \(\lambda \in \CC^M\) being the Lagrange multipliers for the (equality) constraint set.
Differentiating \(\mathcal{L}(x)\) w.r.t. \(x\) we get
\[\frac{\partial \mathcal{L}(x)}{\partial x} = 2 x + \Phi^H \lambda.\]
By equating the derivative to \(0\) we obtain the optimal value of \(x\) as
\[x = - \frac{1}{2} \Phi^H \lambda.\]
Plugging this solution back into the constraint \(\Phi x = y\) gives us
\[\Phi x = - \frac{1}{2} \Phi \Phi^H \lambda = y \implies \lambda = -2 (\Phi \Phi^H)^{-1} y.\]
In the above we are implicitly assuming that \(\Phi\) is a full-rank matrix; thus \(\Phi \Phi^H\) is invertible and positive definite.
Putting \(\lambda\) back in the above, we obtain the well-known closed-form least squares solution in terms of the pseudo-inverse
\[x = \Phi^H (\Phi \Phi^H)^{-1} y = \Phi^{\dag} y.\]
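As a quick numerical check in MATLAB on the running example reconstructed above (\(3 x_1 + 4 x_2 = 12\)), the closed-form solution can be evaluated directly; the variable names are illustrative.
% minimum l2-norm solution of 3*x1 + 4*x2 = 12
Phi = [3 4];
y = 12;
x_min_l2 = Phi' * ((Phi * Phi') \ y)   % equivalently pinv(Phi) * y
% expected: x_min_l2 = [1.44; 1.92], with norm(x_min_l2) = 2.4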
We would like to mention that there are several iterative approaches to solve the \(l_2\) norm minimization problem (like gradient descent and conjugate gradient descent). For large systems, they are more effective than computing the pseudo-inverse.
The beauty of \(l_2\) norm minimization lies in its simplicity and the availability of closed-form analytical solutions. This has led to its prevalence in various fields of science and engineering. But the \(l_2\) norm is by no means the only suitable cost function. Rather, the simplicity of the \(l_2\) norm often keeps engineers from trying other possible cost functions. In the sequel, we will look at various other possible cost functions.
Convexity¶
Convex optimization problems have the attractive feature that it is possible to find a globally optimal solution whenever such a solution exists.
The solution space \(\Omega = \{x : \Phi x = y\}\) is convex. Thus the feasible set of solutions for the optimization problem is also convex. All that remains is to make sure that we choose a cost function \(J(x)\) which happens to be convex. This will ensure that a global minimum can be found through convex optimization techniques. Moreover, if \(J(x)\) is strictly convex, then it is guaranteed that the global minimum solution is unique. Thus, even though we may not have a nice-looking closed-form expression for the solution of a strictly convex cost function minimization problem, the guarantee of the existence and uniqueness of the solution, as well as well-developed algorithms for solving the problem, make it very appealing to choose cost functions which are convex.
We remind the reader that all \(l_p\) norms with \(p \geq 1\) are convex functions. In particular, the \(l_{\infty}\) and \(l_1\) norms are very interesting and popular, where
\[\| x \|_{\infty} = \max_{1 \leq i \leq N} | x_i |\]
and
\[\| x \|_1 = \sum_{i=1}^{N} | x_i |.\]
In the following section we will attempt to find a unique solution to our optimization problem using \(l_1\) norm.
\(l_1\) Regularization¶
In this section we will restrict our attention to the Euclidean space case where \(x \in \RR^N\), \(\Phi \in \RR^{M \times N}\) and \(y \in \RR^M\).
We choose our cost function as \(J(x) = \| x \|_1\).
The cost minimization problem can then be reformulated as
\[\min_{x} \| x \|_1 \quad \text{subject to } y = \Phi x.\]
We continue with our running example.
Again we can view this solution graphically by drawing \(l_1\) norm balls of different radii in the figure below. The ball which just touches the solution space line gives us the optimal solution.

As we can see from the figure the minimum \(l_1\) norm solution is given by \((x_1,x_2) = (0,3)\).
It is interesting to note that \(l_1\) norm solution promotes sparser solutions while \(l_2\) norm solution promotes solutions in which signal energy is distributed amongst all of its components.
It’s time to have a closer look at our cost function \(J(x) = \|x \|_1\). This function is convex yet not strictly convex.
Consider again \(x \in \RR^2\). For \(x \in \RR_+^2\) (the first quadrant),
\[\| x \|_1 = x_1 + x_2.\]
Hence for any \(c_1, c_2 \geq 0\) and \(x, y \in \RR_+^2\):
\[\| c_1 x + c_2 y \|_1 = c_1 \| x \|_1 + c_2 \| y \|_1.\]
Thus, \(l_1\)-norm is not strictly convex. Consequently, a unique solution may not exist for \(l_1\) norm minimization problem.
As an example, consider the under-determined system
\[x_1 + x_2 = 4.\]
We can easily visualize that the solution line will pass through the points \((0,4)\) and \((4,0)\). Moreover, in the first quadrant it coincides with the edge of the \(l_1\)-norm ball of radius \(4\). See again the figure above. This gives us infinitely many solutions to the minimization problem.
We can still observe that
- These solutions are gathered in a small line segment that is bounded (a bounded convex set) and
- There exist two solutions \((4,0)\) and \((0,4)\) amongst these solutions which have only 1 non-zero component.
For the \(l_1\) norm minimization problem, since \(J(x)\) is not strictly convex, a unique solution may not be guaranteed. In specific cases, there may be infinitely many solutions. Yet what we can claim is:
- these solutions are gathered in a set that is bounded and convex, and
- among these solutions, there exists at least one solution with at most \(M\) non-zeros (as many as the number of constraints in \(\Phi x = y\)).
Let \(S\) denote the set of optimal solutions to this \(l_1\) norm minimization problem. We have
- \(S\) is convex and bounded.
- \(\Phi x^* = y \, \Forall x^* \in S\).
- Since \(\Phi \in \RR^{M \times N}\) is full rank and \(M < N\), hence \(\text{rank}(\Phi) = M\).
Let \(x^* \in S\) be an optimal solution with \(\| x^* \|_0 = L > M\).
Consider the \(L\) columns of \(\Phi\) which correspond to \(\supp(x^*)\).
Since \(L > M\) and \(\text{rank}(\Phi) = M\), these columns are linearly dependent.
Thus there exists a vector \(h \in \RR^N\) with \(\supp(h) \subseteq \supp(x^*)\) such that
\[\Phi h = 0.\]
Note that since we are only considering those columns of \(\Phi\) which correspond to \(\supp(x)\), hence we require \(h_i = 0\) whenever \(x^*_i = 0\).
Consider a new vector
\[x = x^* + \epsilon h\]
where \(\epsilon\) is small enough such that every element in \(x\) has the same sign as \(x^*\).
As long as
\[| \epsilon | \leq \epsilon_0 = \min_{i : h_i \neq 0} \frac{| x^*_i |}{| h_i |},\]
such an \(x\) can be constructed.
Note that \(x_i = 0\) whenever \(x^*_i = 0\).
Clearly
\[\Phi x = \Phi (x^* + \epsilon h) = \Phi x^* + \epsilon \Phi h = y + 0 = y.\]
Thus \(x\) is a feasible solution to the problem (1) though it need not be an optimal solution.
But since \(x^*\) is optimal, the \(l_1\) norm of \(x\) must be greater than or equal to the \(l_1\) norm of \(x^*\):
\[\| x \|_1 = \| x^* + \epsilon h \|_1 \geq \| x^* \|_1.\]
Now look at \(\|x \|_1\) as a function of \(\epsilon\) in the region \(|\epsilon| \leq \epsilon_0\).
In this region, the \(l_1\) norm is a continuous and differentiable function of \(\epsilon\), since all vectors \(x^* + \epsilon h\) have the same sign pattern. If we define \(y^* = | x^* |\) (the vector of absolute values), then
\[\| x^* \|_1 = \sum_{i=1}^{N} y^*_i.\]
Since the sign patterns don't change, we have
\[| x^*_i + \epsilon h_i | = y^*_i + \epsilon h_i \sgn(x^*_i).\]
Thus
\[\| x \|_1 = \sum_{i=1}^{N} | x^*_i + \epsilon h_i | = \| x^* \|_1 + \epsilon h^T \sgn(x^*).\]
The quantity \(h^T \sgn(x^*)\) is a constant. The inequality \(\|x \|_1 \geq \| x^* \|_1\) applies to both positive and negative values of \(\epsilon\) in the region \(|\epsilon | \leq \epsilon_0\). This is possible only when the inequality is in fact an equality.
This implies that the addition / subtraction of \(\epsilon h\) under these conditions does not change the \(l_1\) length of the solution. Thus, \(x \in S\) is also an optimal solution.
This can happen only if
\[h^T \sgn(x^*) = 0.\]
We now wish to tune \(\epsilon\) such that one entry in \(x^*\) gets nulled while preserving the solution's \(l_1\) length.
We choose \(i\) corresponding to \(\epsilon_0\) (defined above) and pick
\[\epsilon = - \frac{x^*_i}{h_i}.\]
Clearly, for the corresponding
\[x = x^* + \epsilon h,\]
the \(i\)-th entry is nulled while the others keep their sign and the \(l_1\) norm is also preserved. Thus, we have obtained a new optimal solution with at most \(L-1\) non-zeros. It is possible that more than one entry gets nulled in this operation.
We can repeat this procedure till we are left with \(M\) non-zero elements.
Beyond this point we may not be able to proceed, since with \(\text{rank}(\Phi) = M\) we can no longer claim that the corresponding columns of \(\Phi\) are linearly dependent.
We thus note that \(l_1\) norm has a tendency to prefer sparse solutions. This is a well known and fundamental property of linear programming.
\(l_1\) norm minimization problem as a linear programming problem¶
We now show that \(l_1\) norm minimization problem in \(\RR^N\) is in fact a linear programming problem.
Recalling the problem:
\[\min_{x} \| x \|_1 \quad \text{subject to } y = \Phi x.\]
Let us write \(x\) as \(u - v\) where \(u, v \in \RR^N\) are both non-negative vectors such that \(u\) takes all positive entries in \(x\) while \(v\) takes all the negative entries in \(x\).
Let
x = (-1, 0 , 0 , 2, 0 , 0, 0, 4, 0, 0, -3, 0 , 0 , 0 , 0, 2 , 10).
Then
u = (0, 0 , 0 , 2, 0 , 0, 0, 4, 0, 0, 0, 0 , 0 , 0 , 0, 2 , 10).
And
v = (1, 0 , 0 , 0, 0 , 0, 0, 0, 0, 0, 3, 0 , 0 , 0 , 0, 0 , 0).
Clearly \(x = u - v\).
We note here that by definition
\[\supp(u) \cap \supp(v) = \emptyset,\]
i.e. the supports of \(u\) and \(v\) do not overlap.
We now construct a vector
\[z = \begin{bmatrix} u \\ v \end{bmatrix} \in \RR^{2N}.\]
We can now verify that
\[\| x \|_1 = \OneVec^T z.\]
And
\[y = \Phi x = \Phi (u - v) = \begin{bmatrix} \Phi & -\Phi \end{bmatrix} z\]
where \(z \succeq 0\).
Hence the optimization problem (1) can be recast as
\[\min_{z \in \RR^{2N}} \OneVec^T z \quad \text{subject to } \begin{bmatrix} \Phi & -\Phi \end{bmatrix} z = y, \; z \succeq 0.\]
This optimization problem has the classic linear programming structure since both the objective function and the constraints are affine.
Let \(z^* =\begin{bmatrix} u^* \\ v^* \end{bmatrix}\) be an optimal solution to the problem (2).
In order to show that the two optimization problems are equivalent, we need to verify that our assumption about the decomposition of \(x\) into positive entries in \(u\) and negative entries in \(v\) is indeed satisfied by the optimal solution \(u^*\) and \(v^*\). i.e. support of \(u^*\) and \(v^*\) do not overlap.
Since \(z \succeq 0\) hence \(\langle u^* , v^* \rangle \geq 0\). If support of \(u^*\) and \(v^*\) don’t overlap, then we have \(\langle u^* , v^* \rangle = 0\). And if they overlap then \(\langle u^* , v^* \rangle > 0\).
Now for the sake of contradiction, let us assume that support of \(u^*\) and \(v^*\) do overlap for the optimal solution \(z^*\).
Let \(k\) be one of the indices at which both \(u_k \neq 0\) and \(v_k \neq 0\). Since \(z \succeq 0\), hence \(u_k > 0\) and \(v_k > 0\).
Without loss of generality let us assume that \(u_k > v_k > 0\).
In the equality constraint
\[\begin{bmatrix} \Phi & -\Phi \end{bmatrix} z^* = y,\]
both of these coefficients multiply the same column \(\phi_k\) of \(\Phi\) with opposite signs, giving us a term
\[(u_k - v_k) \phi_k.\]
Now if we replace the two entries in \(z^*\) by
\[u'_k = u_k - v_k\]
and
\[v'_k = 0\]
to obtain a new vector \(z'\), we see that there is no impact on the equality constraint
\[\begin{bmatrix} \Phi & -\Phi \end{bmatrix} z' = y.\]
Also the positivity constraint
\[z' \succeq 0\]
is satisfied. This means that \(z'\) is a feasible solution.
On the other hand the objective function \(1^T z\) value reduces by \(2 v_k\) for \(z'\). This contradicts our assumption that \(z^*\) is the optimal solution.
Hence for the optimal solution of (2) we have
\[\langle u^*, v^* \rangle = 0;\]
thus
\[x^* = u^* - v^*\]
is indeed the desired solution for the optimization problem (1).
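Here is a minimal MATLAB sketch of this LP recast, using linprog from the Optimization Toolbox on the running example \(3 x_1 + 4 x_2 = 12\); the variable names are illustrative.
% l1 minimization via the equivalent linear program
Phi = [3 4];
y = 12;
N = size(Phi, 2);
f = ones(2*N, 1);                 % objective: 1^T z
Aeq = [Phi, -Phi];                % equality constraint [Phi -Phi] z = y
lb = zeros(2*N, 1);               % z >= 0
z = linprog(f, [], [], Aeq, y, lb, []);
x = z(1:N) - z(N+1:end)           % recover x = u - v; expected (0, 3)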
Dictionary based representations¶
Dictionaries¶
A dictionary for \(\CC^N\) is a finite collection \(\mathcal{D}\) of unit-norm vectors which span the whole space.
The elements of a dictionary are called atoms and they are denoted by \(\phi_{\omega}\) where \(\omega\) is drawn from an index set \(\Omega\).
The whole dictionary structure is written as
\[\mathcal{D} = \{ \phi_{\omega} : \omega \in \Omega \}\]
where
\[\| \phi_{\omega} \|_2 = 1 \quad \Forall \omega \in \Omega\]
and
\[\CC^N = \text{span} \{ \phi_{\omega} : \omega \in \Omega \}.\]
We use the letter \(D\) to denote the number of elements in the dictionary, i.e.
\[D = | \mathcal{D} | = | \Omega |.\]
This definition is adapted from [Tro04].
The indices may have an interpretation, such as the time-frequency or time-scale localization of an atom, or they may simply be labels without any underlying meaning.
Note
In most cases, the dictionary is a matrix of size \(N \times D\) where \(D\) is the number of columns or atoms in the dictionary. The index set in this situation is \([1:D]\) which is the set of integers from 1 to \(D\).
Let’s construct a simple Dirac-DCT dictionary of dimensions \(4 \times 8\).
>> A = spx.dict.simple.dirac_dct_mtx(4); A
A =
1.0000 0 0 0 0.5000 0.6533 0.5000 0.2706
0 1.0000 0 0 0.5000 0.2706 -0.5000 -0.6533
0 0 1.0000 0 0.5000 -0.2706 -0.5000 0.6533
0 0 0 1.0000 0.5000 -0.6533 0.5000 -0.2706
This dictionary consists of two parts. The left part is a \(4 \times 4\) identity matrix and the right part is a \(4 \times 4\) DCT matrix.
The rank of this dictionary is 4. Since the columns come from \(\RR^4\), any 5 columns are linearly dependent.
It is interesting to note that there exists a set of 4 columns in this dictionary which is linearly dependent.
>> B = A(:, [1, 4, 5, 7]); B
B =
1.0000 0 0.5000 0.5000
0 0 0.5000 -0.5000
0 0 0.5000 -0.5000
0 1.0000 0.5000 0.5000
>> rank(B)
ans =
3
This is a crucial difference between an orthogonal basis and an overcomplete dictionary. In an orthogonal basis for \(\RR^N\), all \(N\) vectors are linearly independent. As we create overcomplete dictionaries, it is possible that there exist some subsets of columns of size \(N\) or less which are linearly dependent.
Let’s quickly examine the null space of \(B\):
>> c = null(B)
c =
-0.5000
-0.5000
0.5000
0.5000
>> B * c
ans =
1.0e-16 *
0.5551
-0.2776
-0.8327
-0.2776
Note that the dictionary need not provide a unique representation for any vector \(x \in \CC^N\), but it provides at least one representation for each \(x \in \CC^N\).
We will construct a vector in the null space of \(A\):
>> n = zeros(8,1); n([1,4,5,7]) = c; n
n =
-0.5000
0
0
-0.5000
0.5000
0
0.5000
0
Consider the vector:
>> x = [4 ,2,2,5]';
Following calculation shows two different representations of \(x\) in \(A\):
>> alpha = [2, 0, 0, 3, 4, 0, 0, 0]'
>> A * alpha
ans =
4
2
2
5
>> A * (alpha + n)
ans =
4
2
2
5
>> beta = alpha + n
beta =
1.5000
0
0
2.5000
4.5000
0
0.5000
0
Both alpha and beta are valid representations of x in A. While alpha has 3 non-zero entries, beta has 4. In that sense, alpha is a sparser representation of x in A.
Constructing x from A requires only 3 columns if we choose the alpha representation, but it requires 4 columns if we choose the beta representation.
When \(D=N\) we have a set of unit norm vectors which span the whole of \(\CC^N\). Thus, we have a basis (not necessarily an orthonormal basis). A dictionary cannot have \(D < N\). The more interesting case is when \(D > N\).
Note
There are also applications of undercomplete dictionaries where the number of atoms \(D\) is less than the ambient space dimension \(N\). However, we will not be considering them unless specifically mentioned.
Redundant dictionaries and sparse signals¶
With \(D > N\), clearly there are more atoms than necessary to provide a representation of a signal \(x \in \CC^N\). Thus such a dictionary is able to provide multiple representations of the same vector \(x\). We call such dictionaries redundant dictionaries or over-complete dictionaries.
In contrast a basis with \(D=N\) is called a complete dictionary.
A special class of signals is those signals which have a sparse representation in a given dictionary \(\mathcal{D}\).
Let \(\Lambda \subset \Omega\) be a subset of indices with \(|\Lambda|=K\). It is usually expected that \(K \ll N\) also holds.
Let \(x\) be any signal in \(\CC^N\) such that \(x\) can be expressed as
\[x = \sum_{\lambda \in \Lambda} b_{\lambda} \phi_{\lambda}, \quad b_{\lambda} \in \CC.\]
Note that this is not the only possible representation of \(x\) in \(\mathcal{D}\). This is just one of the possible representations of \(x\). The special thing about this representation is that it is \(K\)-sparse i.e. only at most \(K\) atoms from the dictionary are being used.
Now there are \(\binom{D}{K}\) ways in which we can choose a set of \(K\) atoms from the dictionary \(\mathcal{D}\).
Thus the set of \((\mathcal{D},K)\)-sparse signals is given by
\[\Sigma_{(\mathcal{D},K)} = \left\{ x \in \CC^N : x = \sum_{\lambda \in \Lambda} b_{\lambda} \phi_{\lambda} \right\}\]
for some index set \(\Lambda \subset \Omega\) with \(|\Lambda|=K\).
This set \(\Sigma_{(\mathcal{D},K)}\) is dependent on the chosen dictionary \(\mathcal{D}\). In the sequel, we will simply refer to it as \(\Sigma_K\).
For the special case where \(\mathcal{D}\) is nothing but the standard basis of \(\CC^N\), we have
\[\Sigma_K = \{ x \in \CC^N : \| x \|_0 \leq K \},\]
i.e. the set of signals which have \(K\) or fewer non-zero elements.
In contrast, if we choose an orthonormal basis \(\Psi\) such that every \(x\in\CC^N\) can be expressed as
\[x = \Psi \alpha,\]
then with the dictionary \(\mathcal{D} = \Psi\), the set of \(K\)-sparse signals is given by
\[\Sigma_K = \{ x = \Psi \alpha : \| \alpha \|_0 \leq K \}.\]
We also note that the set of vectors \(\{ \alpha_{\lambda} : \lambda \in \Lambda \}\) with \(K < N\) forms a subspace of \(\CC^N\).
So we have \(\binom{D}{K}\) \(K\)-sparse subspaces contained in the dictionary \(\mathcal{D}\). And the \(K\)-sparse signals lie in the union of all these subspaces.
Sparse approximation problem¶
In the sparse approximation problem, we attempt to express a given signal \(x \in \CC^N\) using a linear combination of \(K\) atoms from the dictionary \(\mathcal{D}\), where \(K \ll N\) and typically \(N \ll D\), i.e. the number of atoms in the dictionary \(\mathcal{D}\) is typically much larger than the ambient signal space dimension \(N\).
Naturally, we wish to obtain the best possible sparse representation of \(x\) over the atoms \(\phi_{\omega} \in \mathcal{D}\), one which minimizes the approximation error.
Let \(\Lambda\) denote the index set of atoms which are used to create a \(K\)-sparse representation of \(x\) where \(\Lambda \subset \Omega\) with \(|\Lambda| = K\).
Let \(x_{\Lambda}\) represent an approximation of \(x\) over the set of atoms indexed by \(\Lambda\).
Then we can write \(x_{\Lambda}\) as
\[x_{\Lambda} = \sum_{\lambda \in \Lambda} b_{\lambda} \phi_{\lambda}.\]
We put all the complex-valued coefficients \(b_{\lambda}\) appearing in the sum into a list \(b\).
The approximation error is given by
\[e = \| x - x_{\Lambda} \|_2.\]
We would like to minimize the approximation error over all possible choices of \(K\) atoms and corresponding set of coefficients \(b_{\lambda}\).
Thus the sparse approximation problem can be cast as a minimization problem given by
\[\min_{\Lambda \subset \Omega, \; |\Lambda| = K} \; \min_{b} \; \left\| x - \sum_{\lambda \in \Lambda} b_{\lambda} \phi_{\lambda} \right\|_2.\]
If we choose a particular \(\Lambda\), then the inner minimization problem becomes a straightforward least squares problem (see the sketch below). But there are \(\binom{D}{K}\) possible choices of \(\Lambda\), and solving the inner least squares problem for each of them becomes prohibitively expensive.
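For a fixed \(\Lambda\), the inner least squares step is a one-liner in MATLAB; a minimal sketch, assuming Phi (the synthesis matrix), x and Lambda are already defined:
% least squares coefficients over the atoms indexed by Lambda
b_Lambda = Phi(:, Lambda) \ x;           % minimizes || x - Phi(:, Lambda) * b ||_2
x_Lambda = Phi(:, Lambda) * b_Lambda;    % the corresponding K-term approximation
approximation_error = norm(x - x_Lambda);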
We reemphasize here that in this formulation we are using a fixed dictionary \(\mathcal{D}\) while the vector \(x \in \CC^N\) is arbitrary.
This problem is known as \((\mathcal{D}, K)\)-sparse approximation problem.
A related problem is known as \((\mathcal{D}, K)\)-exact-sparse problem where it is known a-priori that \(x\) is a linear combination of at-most \(K\) atoms from the given dictionary \(\mathcal{D}\) i.e. \(x\) is a \(K\)-sparse signal as defined in previous section for the dictionary \(\mathcal{D}\).
This formulation simplifies the minimization problem (1) since it is known a priori that for \(K\)-sparse signals, a \(0\) approximation error can be achieved. The only problem is to find, from the \(\binom{D}{K}\) possible \(K\)-sparse subspaces, those which are able to provide a \(K\)-sparse representation of \(x\), and to choose one amongst them. It is imperative to note that even the \(K\)-sparse representation need not be unique.
Clearly the exact-sparse problem is simpler than the sparse approximation problem. Thus, if the exact-sparse problem is NP-hard, then so is the harder sparse approximation problem. It is expected that solving the exact-sparse problem will provide insights into solving the sparse approximation problem.
It would be useful to get some uniqueness conditions for general dictionaries which guarantee that the sparse representation of a vector is unique in the dictionary. Such conditions would help us guarantee the uniqueness of exact-sparse problem.
Synthesis and analysis¶
The atoms of a dictionary \(\mathcal{D}\) can be organized into an \(N \times D\) matrix as follows:
\[\Phi = \begin{bmatrix} \phi_{\omega_1} & \phi_{\omega_2} & \dots & \phi_{\omega_D} \end{bmatrix}\]
where \(\Omega = \{\omega_1, \omega_2, \dots, \omega_D\}\) is the index set for the atoms of \(\mathcal{D}\). We remind the reader that \(\phi_{\omega} \in \CC^N\); hence each atom has a column vector representation in the standard basis for \(\CC^N\).
The order of columns doesn’t matter as long as it remains fixed once chosen.
Thus, in matrix terminology, a representation of \(x \in \CC^N\) in the dictionary can be written as
\[x = \Phi b\]
where \(b \in \CC^D\) is a vector of coefficients which produces the superposition \(x\) from the atoms of the dictionary \(\mathcal{D}\). Clearly, with \(D > N\), \(b\) is not unique. Rather, for every vector \(z \in \NullSpace(\Phi)\), we have:
\[x = \Phi (b + z).\]
We can also view the synthesis matrix \(\Phi\) as a linear operator from \(\CC^D\) to \(\CC^N\).
There is another way to look at \(x\) through \(\Phi\).
The conjugate transpose \(\Phi^H\) of the synthesis matrix \(\Phi\) is called the analysis matrix. It maps a given vector \(x \in \CC^N\) to a list of inner products with the atoms of the dictionary:
\[c = \Phi^H x\]
where \(c \in \CC^D\).
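In MATLAB terms, synthesis and analysis with a dictionary matrix look like the following sketch, assuming Phi (of size N x D with unit-norm columns) and D are already defined; the coefficient vector chosen here is arbitrary.
% synthesis: map a coefficient vector in C^D to a signal in C^N
b = zeros(D, 1);
b([2 5]) = [1.0, -0.5];     % an arbitrary sparse coefficient vector
x = Phi * b;                % synthesized signal in C^N
% analysis: inner products of x with all the atoms
c = Phi' * x;               % c has D entries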
With the help of the synthesis matrix \(\Phi\), the \((\mathcal{D}, K)\)-exact-sparse problem can now be written as
\[\text{find } b \in \CC^D \text{ such that } x = \Phi b \text{ and } \| b \|_0 \leq K.\]
With the help of the synthesis matrix \(\Phi\), the \((\mathcal{D}, K)\)-sparse approximation problem can now be written as
\[\min_{b \in \CC^D} \| x - \Phi b \|_2 \quad \text{subject to } \| b \|_0 \leq K.\]
p-norms and sparse signals¶
l1 , l2 and max norms¶
There are some simple and useful results on relationships between different \(p\)-norms listed in this section. We also discuss some interesting properties of \(l_1\)-norm specifically.
Let \(v \in \CC^N\). Let the entries in \(v\) be represented as
\[v_i = r_i e^{j \theta_i}\]
where \(r_i = | v_i |\), with the convention that \(\theta_i = 0\) whenever \(r_i = 0\).
The sign vector for \(v\), denoted by \(\sgn(v)\), is defined as
\[\sgn(v) = (\sgn(v_1), \dots, \sgn(v_N))\]
where
\[\begin{split}\sgn(v_i) = \begin{cases} e^{j \theta_i} & \text{if } r_i \neq 0; \\ 0 & \text{if } r_i = 0. \end{cases}\end{split}\]
For any \(v \in \CC^N\):
\[\| v \|_1 = \sum_{i=1}^{N} r_i = \langle v, \sgn(v) \rangle.\]
Note that whenever \(v_i = 0\), the corresponding \(0\) entry in \(\sgn(v)\) has no effect on the sum.
Suppose \(v \in \CC^N\). Then
\[\| v \|_2 \leq \| v \|_1 \leq \sqrt{N} \| v \|_2.\]
For the lower bound, we go as follows:
\[\| v \|_1^2 = \left( \sum_{i=1}^{N} | v_i | \right)^2 \geq \sum_{i=1}^{N} | v_i |^2 = \| v \|_2^2.\]
This gives us
\[\| v \|_2 \leq \| v \|_1.\]
We can write the \(l_1\) norm as
\[\| v \|_1 = \langle v, \sgn(v) \rangle.\]
By the Cauchy-Schwartz inequality we have
\[\langle v, \sgn(v) \rangle \leq \| v \|_2 \| \sgn(v) \|_2.\]
Since \(\sgn(v)\) can have at most \(N\) non-zero values, each with magnitude 1,
\[\| \sgn(v) \|_2^2 \leq N \implies \| \sgn(v) \|_2 \leq \sqrt{N}.\]
Thus, we get
\[\| v \|_1 \leq \sqrt{N} \| v \|_2.\]
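These inequalities are easy to check numerically; a small illustrative MATLAB sketch:
% sanity check of ||v||_2 <= ||v||_1 <= sqrt(N) ||v||_2
N = 8;
v = randn(N, 1) + 1i * randn(N, 1);
l1 = norm(v, 1);
l2 = norm(v, 2);
tol = 1e-12;
assert(l2 <= l1 + tol && l1 <= sqrt(N) * l2 + tol);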
Let \(v \in \CC^N\). Then
Thus
Let \(v \in \CC^N\). Let \(1 \leq p, q \leq \infty\). Then
Let \(\OneVec \in \CC^N\) be the vector of all ones, i.e. \(\OneVec = (1, \dots, 1)\). Let \(v \in \CC^N\) be some arbitrary vector. Let \(| v |\) denote the vector of absolute values of entries in \(v\), i.e. \(|v|_i = |v_i| \; \Forall \; 1 \leq i \leq N\). Then
\[\OneVec^T | v | = \sum_{i=1}^{N} | v_i | = \| v \|_1.\]
Finally, since \(\OneVec\) consists only of real entries, its transpose and Hermitian transpose are the same.
Let \(\OneMat \in \CC^{N \times N}\) be a square matrix of all ones. Let \(v \in \CC^N\) be some arbitrary vector. Then
We know that
Thus,
We used the fact that \(\| v \|_1 = \OneVec^T | v |\).
The \(k\)-th largest (magnitude) entry in a vector \(x \in \CC^N\), denoted by \(x_{(k)}\), obeys
\[| x_{(k)} | \leq \frac{\| x \|_1}{k}.\]
Let \(n_1, n_2, \dots, n_N\) be a permutation of \(\{ 1, 2, \dots, N \}\) such that
\[| x_{n_1} | \geq | x_{n_2} | \geq \dots \geq | x_{n_N} |.\]
Thus, the \(k\)-th largest entry in \(x\) is \(x_{n_k}\). It is clear that
\[\| x \|_1 = \sum_{i=1}^{N} | x_{n_i} |.\]
Obviously
\[k | x_{n_k} | \leq \sum_{i=1}^{k} | x_{n_i} |.\]
Similarly
\[\sum_{i=1}^{k} | x_{n_i} | \leq \sum_{i=1}^{N} | x_{n_i} | = \| x \|_1.\]
Thus
\[| x_{(k)} | = | x_{n_k} | \leq \frac{\| x \|_1}{k}.\]
Sparse signals¶
In this section we explore some useful properties of \(\Sigma_K\), the set of \(K\)-sparse signals in standard basis for \(\CC^N\).
We recall that
\[\Sigma_K = \{ x \in \CC^N : \| x \|_0 \leq K \}.\]
We established before that this set is a union of \(\binom{N}{K}\) subspaces of \(\CC^N\), each of which is constructed by an index set \(\Lambda \subset \{1, \dots, N \}\) with \(| \Lambda | = K\) choosing \(K\) specific dimensions of \(\CC^N\).
We first present some lemmas which connect the \(l_1\), \(l_2\) and \(l_{\infty}\) norms of vectors in \(\Sigma_K\).
Suppose \(u \in \Sigma_K\). Then
\[\frac{\| u \|_1}{\sqrt{K}} \leq \| u \|_2 \leq \sqrt{K} \| u \|_{\infty}.\]
We can write \(l_1\) norm as
\[\| u \|_1 = \langle u, \sgn (u) \rangle.\]By Cauchy-Schwartz inequality we have
\[\langle u, \sgn (u) \rangle \leq \| u \|_2 \| \sgn (u) \|_2\]Since \(u \in \Sigma_K\), \(\sgn(u)\) can have at most \(K\) non-zero values each with magnitude 1. Thus, we have
\[\| \sgn (u) \|_2^2 \leq K \implies \| \sgn (u) \|_2 \leq \sqrt{K}\]Thus we get the lower bound
\[\| u \|_1 \leq \| u \|_2 \sqrt{K} \implies \frac{\| u \|_1}{\sqrt{K}} \leq \| u \|_2.\]Now \(| u_i | \leq \max(| u_i |) = \| u \|_{\infty}\). So we have
\[\| u \|_2^2 = \sum_{i= 1}^{N} | u_i |^2 \leq K \| u \|_{\infty}^2\]since there are only \(K\) non-zero terms in the expansion of \(\| u \|_2^2\).
This establishes the upper bound:
\[\| u \|_2 \leq \sqrt{K} \| u \|_{\infty}\]
Compressible signals¶
In this section, we first look at some general results and definitions related to \(K\)-term approximations of arbitrary signals \(x \in \CC^N\). We then define the notion of a compressible signal and study properties related to it.
K-term approximation of general signals¶
Let \(x \in \CC^N\). Let \(T \subset \{ 1, 2, \dots, N\}\) be any index set. Further let
\[T = \{ t_1, t_2, \dots, t_{|T|} \}\]
such that
\[t_1 < t_2 < \dots < t_{|T|}.\]
Let \(x_T \in \CC^{|T|}\) be defined as
\[(x_T)_i = x_{t_i} \quad \Forall \; 1 \leq i \leq |T|.\]
Then \(x_T\) is a restriction of the signal \(x\) on the index set \(T\).
Alternatively let \(x_T \in \CC^N\) be defined as
\[\begin{split}(x_T)_i = \begin{cases} x_i & \text{if } i \in T; \\ 0 & \text{otherwise}. \end{cases}\end{split}\]
In other words, \(x_T \in \CC^N\) keeps the entries in \(x\) indexed by \(T\) while sets all other entries to 0. Then we say that \(x_T\) is obtained by masking \(x\) with \(T\). As an abuse of notation, we will use any of the two definitions whenever we are referring to \(x_T\). The definition being used should be obvious from the context.
Let
Let
Then
Since \(|T| = 4\), sometimes we will also write
Let \(x \in \CC^N\) be an arbitrary signal. Consider any index set \(T \subset \{1, \dots, N \}\) with \(|T| = K\). Then \(x_T\) is a \(K\)-term approximation of \(x\).
Clearly for any \(x \in \CC^N\) there are \(\binom{N}{K}\) possible \(K\)-term approximations of \(x\).
Let
Let \(T= \{ 1, 6 \}\). Then
is a \(2\)-term approximation of \(x\).
If we choose \(T= \{7,8,9,10\}\), the corresponding \(4\)-term approximation of \(x\) is
Let \(x \in \CC^N\) be an arbitrary signal. Let \(\lambda_1, \dots, \lambda_N\) be indices of entries in \(x\) such that
\[| x_{\lambda_1} | \geq | x_{\lambda_2} | \geq \dots \geq | x_{\lambda_N} |.\]
In case of ties, the order is resolved lexicographically, i.e. if \(|x_i| = |x_j|\) and \(i < j\) then \(i\) will appear first in the sequence \(\lambda_k\).
Consider the index set \(\Lambda_K = \{ \lambda_1, \lambda_2, \dots, \lambda_K\}\). The restriction of \(x\) on \(\Lambda_K\), given by \(x_{\Lambda_K}\) (see above), contains the \(K\) largest entries of \(x\) while setting all other entries to 0. This is known as the \(K\) largest entries approximation of \(x\).
This signal is denoted henceforth as \(x|_K\), i.e.
\[x|_K = x_{\Lambda_K}\]
where \(\Lambda_K\) is the index set corresponding to the \(K\) largest entries in \(x\) (magnitude wise).
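In MATLAB, the \(K\) largest entries approximation can be computed with a sort on magnitudes; a minimal sketch, assuming x and K are already defined:
% K largest entries approximation x|_K
[~, idx] = sort(abs(x), 'descend');   % indices ordered by decreasing magnitude
x_K = zeros(size(x));
x_K(idx(1:K)) = x(idx(1:K));          % keep only the K largest magnitude entries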
Let
Then
All further \(K\) largest entries approximations are same as \(x\).
A pertinent question at this point is: which \(K\)-term approximation of \(x\) is the best \(K\)-term approximation? Certainly in order to compare two approximations we need some criterion. Let us choose \(l_p\) norm as the criterion. The next lemma gives an interesting result for best \(K\)-term approximations in \(l_p\) norm sense.
Let \(x \in \CC^N\). Let the best \(K\)-term approximation of \(x\) be obtained by the following optimization program:
\[\underset{T \subset \{1, \dots, N\}, \; |T| = K}{\text{maximize}} \;\; \| x_T \|_p\]
where \(p \in [1, \infty]\).
Let an optimal solution for this optimization problem be denoted by \(x_{T^*}\). Then
\[\| x_{T^*} \|_p = \| x|_K \|_p,\]
i.e. the \(K\)-largest entries approximation of \(x\) is an optimal solution to (3).
For \(p=\infty\), the result is obvious. In the following, we focus on \(p \in [1, \infty)\).
We note that maximizing \(\| x_T \|_p\) is equivalent to maximizing \(\| x_T \|^p_p\).
Let \(\lambda_1, \dots, \lambda_N\) be indices of entries in \(x\) such that
\[| x_{\lambda_1} | \geq | x_{\lambda_2} | \geq \dots \geq | x_{\lambda_N} |.\]
Further let \(\{ \omega_1, \dots, \omega_N\}\) be any permutation of \(\{1, \dots, N \}\).
Clearly
\[\| x_{\{\omega_1, \dots, \omega_K\}} \|_p^p = \sum_{i=1}^{K} | x_{\omega_i} |^p \leq \sum_{i=1}^{K} | x_{\lambda_i} |^p = \| x|_K \|_p^p.\]
Thus if \(T^*\) corresponds to an optimal solution of (3) then
\[\| x_{T^*} \|_p = \| x|_K \|_p.\]
Thus \(x|_K\) is an optimal solution to (3) .
This lemma helps us establish that whenever we are looking for a best \(K\)-term approximation of \(x\) under any \(l_p\) norm, all we have to do is to pickup the \(K\)-largest entries in \(x\).
Let \(\Phi \in \CC^{M \times N}\). Let \(T \subset \{ 1, 2, \dots, N\}\) be any index set. Further let
\[T = \{ t_1, t_2, \dots, t_{|T|} \}\]
such that
\[t_1 < t_2 < \dots < t_{|T|}.\]
Let \(\Phi_T \in \CC^{M \times |T|}\) be defined as
\[\Phi_T = \begin{bmatrix} \phi_{t_1} & \phi_{t_2} & \dots & \phi_{t_{|T|}} \end{bmatrix}.\]
Then \(\Phi_T\) is a restriction of the matrix \(\Phi\) on the index set \(T\).
Alternatively let \(\Phi_T \in \CC^{M \times N}\) be defined as
\[\begin{split}(\Phi_T)_i = \begin{cases} \phi_i & \text{if } i \in T; \\ 0 & \text{otherwise}, \end{cases}\end{split}\]
where \((\Phi_T)_i\) denotes the \(i\)-th column of \(\Phi_T\).
In other words, \(\Phi_T \in \CC^{M \times N}\) keeps the columns in \(\Phi\) indexed by \(T\) while sets all other columns to 0. Then we say that \(\Phi_T\) is obtained by masking \(\Phi\) with \(T\). As an abuse of notation, we will use any of the two definitions whenever we are referring to \(\Phi_T\). The definition being used should be obvious from the context.
Let \(\supp(x) = \Lambda\). Then
\[\Phi x = \Phi_{\Lambda} x_{\Lambda}.\]
Let \(S\) and \(T\) be two disjoint index sets such that for some \(x \in \CC^N\)
\[x = x_S + x_T,\]
using the mask version of the \(x_T\) notation. Then the following holds:
\[\Phi x = \Phi_S x_S + \Phi_T x_T.\]
Straightforward application of the previous result:
\[\Phi x = \Phi (x_S + x_T) = \Phi x_S + \Phi x_T = \Phi_S x_S + \Phi_T x_T.\]
Let \(T\) be any index set. Let \(\Phi \in \CC^{M \times N}\) and \(y \in \CC^M\). Then
Now let
Then
Compressible signals¶
We will now define the notion of a compressible signal in terms of the decay rate of magnitude of its entries when sorted in descending order.
Let \(x \in \CC^N\) be an arbitrary signal. Let \(\lambda_1, \dots, \lambda_N\) be indices of entries in \(x\) such that
\[| x_{\lambda_1} | \geq | x_{\lambda_2} | \geq \dots \geq | x_{\lambda_N} |.\]
In case of ties, the order is resolved lexicographically, i.e. if \(|x_i| = |x_j|\) and \(i < j\) then \(i\) will appear first in the sequence \(\lambda_k\). Define
\[\widehat{x} = (x_{\lambda_1}, x_{\lambda_2}, \dots, x_{\lambda_N}).\]
The signal \(x\) is called \(p\)-compressible with magnitude \(R\) if there exists \(p \in (0, 1]\) such that
\[| \widehat{x}_i | \leq R \cdot i^{-\frac{1}{p}} \quad \Forall \; 1 \leq i \leq N.\]
Let \(x\) be \(p\)-compressible with \(p=1\). Then
\[\| x \|_1 \leq R (1 + \ln N).\]
Recalling \(\widehat{x}\) from (6), it's straightforward to see that
\[\| x \|_1 = \| \widehat{x} \|_1 = \sum_{i=1}^{N} | \widehat{x}_i |\]
since the \(l_1\) norm doesn't depend on the ordering of entries in \(x\).
Now since \(x\) is \(1\)-compressible, hence from (7) we have
\[| \widehat{x}_i | \leq R \cdot \frac{1}{i}.\]
This gives us
\[\| x \|_1 \leq \sum_{i=1}^{N} \frac{R}{i} = R \sum_{i=1}^{N} \frac{1}{i}.\]
The sum on the R.H.S. is the \(N\)-th Harmonic number (sum of reciprocals of the first \(N\) natural numbers). A simple upper bound on Harmonic numbers is
\[H_N \leq 1 + \ln N.\]
This completes the proof.
We now demonstrate how a compressible signal is well approximated by a sparse signal.
Let \(x\) be a \(p\)-compressible signal and let \(x|_K\) be its best \(K\)-term approximation. Then the \(l_1\) norm of approximation error satisfies
with
Moreover the \(l_2\) norm of approximation error satisfies
with
We now approximate the R.H.S. sum with an integral.
Now
We can similarly show the result for \(l_2\) norm.
Tools for dictionary analysis¶
In this and following sections we review various properties associated with a dictionary \(\mathcal{D}\) which are useful in understanding the behavior and capabilities of a dictionary.
We recall that a dictionary \(\mathcal{D}\) consists of a finite number of unit-norm vectors in \(\CC^N\), called atoms, which span the signal space \(\CC^N\). The atoms of the dictionary are indexed by an index set \(\Omega\), i.e.
\[\mathcal{D} = \{ d_{\omega} : \omega \in \Omega \}\]
with \(|\Omega| = D\) and \(N \leq D\), and with \(\| d_{\omega} \|_2 = 1\) for all atoms.
The vectors \(x \in \CC^N\) can be represented via a synthesis matrix consisting of the atoms of \(\mathcal{D}\) and a vector \(\alpha \in \CC^D\) as
\[x = \DDD \alpha.\]
Note that we are using the same symbol \(\DDD\) to represent the dictionary as a set of atoms as well as the corresponding synthesis matrix.
We can write the matrix \(\DDD\) in terms of its columns as
\[\DDD = \begin{bmatrix} d_1 & d_2 & \dots & d_D \end{bmatrix}.\]
This shouldn’t be causing any confusion in the sequel. When we write the subscript as \(d_{\omega_i}\) where \(\omega_i \in \Omega\) we are referring to the atoms of the dictionary \(\mathcal{D}\) indexed by the set \(\Omega\), while when we write the subscript as \(d_i\) we are referring to a column of corresponding synthesis matrix. In this case, \(\Omega\) will simply mean the index set \(\{ 1, \dots, D \}\). Obviously \(|\Omega| = D\) holds still.
Often, we will be working with a subset of atoms in a dictionary. Usually such a subset of atoms will be indexed by an index set \(\Lambda \subseteq \Omega\). \(\Lambda\) will take the form of \(\Lambda \subseteq \{\omega_1, \dots, \omega_D\}\) or \(\Lambda \subseteq \{1, \dots, D\}\) depending upon whether we are talking about the subset of atoms in the dictionary or a subset of columns from the corresponding synthesis matrix.
We will need the notion of a sub-dictionary [Tro06] described below.
A sub-dictionary is a linearly independent collection of atoms. Let \(\Lambda \subset \{\omega_1, \dots, \omega_D\}\) be the index set for the atoms in the sub-dictionary. We denote the sub-dictionary as \(\DDD_{\Lambda}\). We also use \(\DDD_{\Lambda}\) to denote the corresponding matrix with \(\Lambda \subset \{1, \dots, D\}\).
In particular, \(\DDD_{\Lambda}\) has full column rank. This is obvious since it is a collection of linearly independent atoms.
For sub-dictionaries we will often write \(K = | \Lambda |\) and denote the Gram matrix by \(G = \DDD_{\Lambda}^H \DDD_{\Lambda}\). Sometimes, we will also be considering \(G^{-1}\). \(G^{-1}\) has a useful interpretation in terms of the dual vectors for the atoms in \(\DDD_{\Lambda}\) [Tro04].
Let \(\{ d_{\lambda} \}_{\lambda \in \Lambda}\) denote the atoms in \(\DDD_{\Lambda}\). Let \(\{ c_{\lambda} \}_{\lambda \in \Lambda}\) be chosen such that
\[\langle c_{\lambda}, d_{\lambda} \rangle = 1 \quad \Forall \lambda \in \Lambda\]
and
\[\langle c_{\lambda}, d_{\mu} \rangle = 0 \quad \Forall \lambda \neq \mu.\]
Each dual vector \(c_{\lambda}\) is orthogonal to the atoms in the sub-dictionary at different indices and is long enough so that its inner product with \(d_{\lambda}\) is one. The dual system, in a sense, inverts the sub-dictionary. In fact, the dual vectors are nothing but the columns of the matrix \(B = (\DDD_{\Lambda}^{\dag})^H\). Now, a simple calculation:
\[B^H B = \DDD_{\Lambda}^{\dag} (\DDD_{\Lambda}^{\dag})^H = (\DDD_{\Lambda}^H \DDD_{\Lambda})^{-1} \DDD_{\Lambda}^H \DDD_{\Lambda} (\DDD_{\Lambda}^H \DDD_{\Lambda})^{-1} = G^{-1}.\]
Therefore, the inverse Gram matrix lists the inner products between the dual vectors.
Sometimes we will be discussing tools which also apply for general matrices. We will use the symbol \(\Phi\) for representing general matrices. Whenever the dictionary is an orthonormal basis, we will use the symbol \(\Psi\).
Spark¶
The spark of a matrix \(\Phi\), denoted by \(\spark(\Phi)\), is the smallest number of columns of \(\Phi\) that are linearly dependent. Note that the definition of spark applies to all matrices (wide, tall or square); it is not restricted to the synthesis matrices for a dictionary.
Correspondingly, the spark of a dictionary is defined as the minimum number of atoms which are linearly dependent.
We recall that the rank of a matrix is defined as the maximum number of columns which are linearly independent. The definition of spark bears a remarkable resemblance, yet it is very hard to compute as it requires a combinatorial search over all possible subsets of columns of \(\Phi\).
Spark of the \(3 \times 3\) identity matrix
\[\begin{split}\begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}\end{split}\]is 4 since all columns are linearly independent.
Spark of the \(2 \times 4\) matrix
\[\begin{split}\begin{pmatrix} 1 & 0 & -1 & 0\\ 0 & 1 & 0 & -1 \end{pmatrix}\end{split}\]is 2 since column 1 and 3 are linearly dependent.
If a matrix has a column with all zero entries, then the spark of such a matrix is 1. This is a trivial case and we will not consider such matrices in the sequel.
In general for an \(N \times D\) synthesis matrix, \(\spark(\DDD) \in [2, N+1]\).
A naive combinatorial algorithm to calculate the spark of a matrix is given below.

A naive algorithm to compute the spark of a matrix
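The brute-force idea can be sketched in MATLAB as follows; this is only an illustrative version, not the library's spx.dict.spark implementation, and it is feasible only for small matrices.
function s = naive_spark(Phi)
% Brute-force spark computation: find the smallest number of
% linearly dependent columns by checking all column subsets.
[N, D] = size(Phi);
for k = 1:min(N, D)
    subsets = nchoosek(1:D, k);          % all subsets of k column indices
    for i = 1:size(subsets, 1)
        cols = subsets(i, :);
        if rank(Phi(:, cols)) < k        % this subset is linearly dependent
            s = k;
            return;
        end
    end
end
s = min(N, D) + 1;   % no dependent subset found among the feasible sizes
end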
Spark is useful in characterizing the uniqueness of the solution of a \((\DDD, K)\)- exact-sparse problem.
The \(l_0\)-"norm" of non-zero vectors belonging to the null space of a matrix \(\Phi\) is greater than or equal to \(\spark(\Phi)\):
\[\| x \|_0 \geq \spark(\Phi) \quad \Forall \; x \in \NullSpace(\Phi), \; x \neq 0.\]
If \(x \in \NullSpace(\Phi)\) then \(\Phi x = 0\). Thus non-zero entries in \(x\) pick a set of columns in \(\Phi\) which are linearly dependent. Clearly \(\| x \|_0\) indicates the number of columns in the set which are linearly dependent. By definition spark of \(\Phi\) indicates the minimum number of columns which are linearly dependent hence the result.
We now present a criterion based on spark which characterizes the uniqueness of a sparse solution to the problem \(y = \Phi x\).
Consider a solution \(x^*\) to the under-determined system \(y = \Phi x\). If \(x^*\) obeys
\[\| x^* \|_0 < \frac{\spark(\Phi)}{2}\]
then it is necessarily the sparsest solution.
Let \(x'\) be some other solution to the problem. Then
\[\Phi x' = \Phi x^* = y \implies \Phi (x' - x^*) = 0 \implies x' - x^* \in \NullSpace(\Phi).\]
Now, based on the previous remark, we have
\[\| x' - x^* \|_0 \geq \spark(\Phi).\]
Now
\[\| x' \|_0 + \| x^* \|_0 \geq \| x' - x^* \|_0 \geq \spark(\Phi).\]
Hence, if \(\| x^* \|_0 < \frac{\spark(\Phi)}{2}\), then we have
\[\| x' \|_0 \geq \spark(\Phi) - \| x^* \|_0 > \frac{\spark(\Phi)}{2} > \| x^* \|_0\]
for all other solutions \(x'\) to the equation \(y = \Phi x\).
Thus \(x^*\) is necessarily the sparsest possible solution.
This result is quite useful as it establishes a global optimality criterion for the \((\DDD, K)\)- exact-sparse problem.
As long as \(K < \frac{1}{2}\spark(\Phi)\), this theorem guarantees that the solution to the \((\DDD, K)\)-exact-sparse problem is unique. This is quite a surprising result for a non-convex combinatorial optimization problem. We are able to guarantee global uniqueness for the solution based on a simple check on the sparsity of the solution.
Note that we are only saying that if a sufficiently sparse solution is found then it is unique. We are not claiming that it is possible to find such a solution.
Obviously, the larger the spark, the higher the sparsity levels for which we can guarantee uniqueness. So a natural question is: how large can the spark of a dictionary be? We consider a few examples.
Consider a dictionary \(\DDD\) whose atoms \(d_{i}\) are random vectors independently drawn from the normal distribution. Since a dictionary requires all its atoms to be unit norm, we divide each of the random vectors by its norm.
We know that with probability \(1\), any set of \(N\) independent Gaussian random vectors is linearly independent. Also since \(d_i \in \CC^N\) hence a set of \(N+1\) atoms is always linearly dependent.
Thus \(\spark(\DDD) = N +1\).
Thus, if a solution to exact-sparse problem contains \(\frac{N}{2}\) or fewer non-zero entries then it is necessarily unique with probability 1.
For the two-ortho Dirac-Fourier dictionary
\[\begin{split}\DDD = \begin{pmatrix} I & F \end{pmatrix}\end{split}\]
it can be shown that
\[\spark(\DDD) = 2 \sqrt{N}\]
(when \(N\) is a perfect square). In this case, the sparsity level of a unique solution must be less than \(\sqrt{N}\).
Let’s construct a Hadamard matrix of size \(20 \times 20\):
PhiA = hadamard(20);
Let’s print it:
>> PhiA
PhiA =
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 -1 -1 1 1 -1 -1 -1 -1 1 -1 1 -1 1 1 1 1 -1 -1 1
1 -1 1 1 -1 -1 -1 -1 1 -1 1 -1 1 1 1 1 -1 -1 1 -1
1 1 1 -1 -1 -1 -1 1 -1 1 -1 1 1 1 1 -1 -1 1 -1 -1
1 1 -1 -1 -1 -1 1 -1 1 -1 1 1 1 1 -1 -1 1 -1 -1 1
1 -1 -1 -1 -1 1 -1 1 -1 1 1 1 1 -1 -1 1 -1 -1 1 1
1 -1 -1 -1 1 -1 1 -1 1 1 1 1 -1 -1 1 -1 -1 1 1 -1
1 -1 -1 1 -1 1 -1 1 1 1 1 -1 -1 1 -1 -1 1 1 -1 -1
1 -1 1 -1 1 -1 1 1 1 1 -1 -1 1 -1 -1 1 1 -1 -1 -1
1 1 -1 1 -1 1 1 1 1 -1 -1 1 -1 -1 1 1 -1 -1 -1 -1
1 -1 1 -1 1 1 1 1 -1 -1 1 -1 -1 1 1 -1 -1 -1 -1 1
1 1 -1 1 1 1 1 -1 -1 1 -1 -1 1 1 -1 -1 -1 -1 1 -1
1 -1 1 1 1 1 -1 -1 1 -1 -1 1 1 -1 -1 -1 -1 1 -1 1
1 1 1 1 1 -1 -1 1 -1 -1 1 1 -1 -1 -1 -1 1 -1 1 -1
1 1 1 1 -1 -1 1 -1 -1 1 1 -1 -1 -1 -1 1 -1 1 -1 1
1 1 1 -1 -1 1 -1 -1 1 1 -1 -1 -1 -1 1 -1 1 -1 1 1
1 1 -1 -1 1 -1 -1 1 1 -1 -1 -1 -1 1 -1 1 -1 1 1 1
1 -1 -1 1 -1 -1 1 1 -1 -1 -1 -1 1 -1 1 -1 1 1 1 1
1 -1 1 -1 -1 1 1 -1 -1 -1 -1 1 -1 1 -1 1 1 1 1 -1
1 1 -1 -1 1 1 -1 -1 -1 -1 1 -1 1 -1 1 1 1 1 -1 -1
We will now select 10 rows randomly from it:
>> rng default;
>> rows = randperm(20, 10)
rows =
6 18 7 16 12 13 3 4 19 20
>> Phi = PhiA(rows, :)
Phi =
1 -1 -1 -1 -1 1 -1 1 -1 1 1 1 1 -1 -1 1 -1 -1 1 1
1 -1 -1 1 -1 -1 1 1 -1 -1 -1 -1 1 -1 1 -1 1 1 1 1
1 -1 -1 -1 1 -1 1 -1 1 1 1 1 -1 -1 1 -1 -1 1 1 -1
1 1 1 -1 -1 1 -1 -1 1 1 -1 -1 -1 -1 1 -1 1 -1 1 1
1 1 -1 1 1 1 1 -1 -1 1 -1 -1 1 1 -1 -1 -1 -1 1 -1
1 -1 1 1 1 1 -1 -1 1 -1 -1 1 1 -1 -1 -1 -1 1 -1 1
1 -1 1 1 -1 -1 -1 -1 1 -1 1 -1 1 1 1 1 -1 -1 1 -1
1 1 1 -1 -1 -1 -1 1 -1 1 -1 1 1 1 1 -1 -1 1 -1 -1
1 -1 1 -1 -1 1 1 -1 -1 -1 -1 1 -1 1 -1 1 1 1 1 -1
1 1 -1 -1 1 1 -1 -1 -1 -1 1 -1 1 -1 1 1 1 1 -1 -1
Let’s measure its spark:
>> spx.dict.spark(Phi)
ans =
8
We can also find out the set of 8 columns which are linearly dependent:
>> [spark, columns] = spx.dict.spark(Phi)
spark =
8
columns =
1 2 3 7 11 14 19 20
Let’s find out this sub-matrix
>> PhiD = Phi(:, columns)
PhiD =
1 -1 -1 -1 1 -1 1 1
1 -1 -1 1 -1 -1 1 1
1 -1 -1 1 1 -1 1 -1
1 1 1 -1 -1 -1 1 1
1 1 -1 1 -1 1 1 -1
1 -1 1 -1 -1 -1 -1 1
1 -1 1 -1 1 1 1 -1
1 1 1 -1 -1 1 -1 -1
1 -1 1 1 -1 1 1 -1
1 1 -1 -1 1 -1 -1 -1
Let’s verify that this matrix is indeed singular:
>> rank(PhiD)
ans =
7
We can find out a vector in its null space:
>> z = null(PhiD)'
z =
0.4472 0.2236 0.2236 0.4472 0.4472 0.2236 -0.2236 0.4472
Verify that it is indeed a null space vector:
>> norm (PhiD * z')
ans =
1.1776e-15
The rank of \(\Phi\) itself is 10. If every set of 10 columns were linearly independent, then the spark would have been 11 and \(\Phi\) would be a full spark matrix. Unfortunately, it is not so. However, the spark is still quite large.
We can normalize the columns of this matrix to make it a proper dictionary:
>> Phi = spx.norm.normalize_l2(Phi);
Let’s verify the column-wise norms:
>> spx.norm.norms_l2_cw(Phi)
ans =
Columns 1 through 12
1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
Columns 13 through 20
1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
The coherence of this dictionary [to be discussed in the next section] is 0.6, which is moderate (but not low).
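Once the columns have been normalized as above, the stated value can be cross-checked with the coherence routine discussed in the next section (a sketch; we only rely on the value 0.6 quoted above):
% coherence of the normalized random-row Hadamard sub-dictionary
mu = spx.dict.coherence(Phi);
% the text above states that mu is 0.6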
Coherence¶
Finding the spark of a dictionary \(\DDD\) is NP-hard since it involves considering a combinatorially large number of selections of columns from \(\DDD\). In this section we consider the coherence of a dictionary, which is computationally tractable and quite useful in characterizing the solutions of sparse approximation problems.
The coherence of a dictionary \(\DDD\) is defined as the maximum absolute inner product between two distinct atoms in the dictionary:
\[\mu(\DDD) = \max_{j \neq k} | \langle d_{\omega_j}, d_{\omega_k} \rangle |.\]
If the dictionary consists of two orthonormal bases, then coherence is also known as mutual coherence or proximity. Since the atoms within each orthonormal basis are orthogonal to each other, the coherence is determined only by the inner products of atoms from one basis with another basis.
We note that \(d_{\omega_i}\) is the \(i\) -th column of synthesis matrix \(\DDD\) . Also \(\DDD^H \DDD\) is the Gram matrix for \(\DDD\) whose elements are nothing but the inner-products of columns of \(\DDD\) .
We note that by definition \(\| d_{\omega} \|_2 = 1\), hence \(\mu \leq 1\); and since absolute values are considered, \(\mu \geq 0\). Thus, \(0 \leq \mu \leq 1\).
For an orthonormal basis \(\Psi\) all atoms are orthogonal to each other, hence
Thus \(\mu = 0\) .
In the following, we will use the notation \(|A|\) to denote a matrix consisting of absolute values of entries in a matrix \(A\) . i.e.
The off-diagonal entries of the Gram matrix are captured by the matrix \(\DDD^H \DDD - I\) . Note that all diagonal entries in \(\DDD^H \DDD - I\) are zero since atoms of \(\DDD\) are unit norm. Moreover, each of the entries in \(| \DDD^H \DDD - I |\) is dominated by \(\mu(\DDD)\) .
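The definition translates directly into a small computation on the Gram matrix. The sketch below assumes a dictionary with unit-norm atoms (here a normalized Gaussian dictionary generated only for illustration); spx.dict.coherence is the library version of the same computation.
N = 8; D = 16;
Dict = randn(N, D);
Dict = Dict ./ repmat(sqrt(sum(Dict.^2)), N, 1);   % unit-norm atoms
G = Dict' * Dict;                                  % Gram matrix
offDiag = abs(G - diag(diag(G)));                  % discard the unit diagonal
mu = max(offDiag(:))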
The inner product between any two atoms \(| \langle d_{\omega_j}, d_{\omega_k} \rangle |\) is a measure of how much they look alike or how much they are correlated. Coherence just picks up the two vectors which are most alike and returns their correlation. In a way \(\mu\) is quite a blunt measure of the quality of a dictionary, yet it is quite useful.
If a dictionary is uniform in the sense that there is not much variation in \(| \langle d_{\omega_j}, d_{\omega_k} \rangle |\) , then \(\mu\) captures the behavior of the dictionary quite well.
We say that a dictionary is incoherent if the coherence of the dictionary is small.
We are looking for dictionaries which are incoherent. In the sequel we will see how incoherence plays a role in sparse approximation.
The coherence of two ortho-bases is bounded from below by
\[\mu \geq \frac{1}{\sqrt{N}}.\]
The coherence of the Dirac-Fourier basis is exactly \(\frac{1}{\sqrt{N}}\).
A lower bound on the coherence of a general dictionary with \(D\) atoms in \(\CC^N\) is given by the Welch bound
\[\mu \geq \sqrt{\frac{D - N}{N(D - 1)}}.\]
If each atomic inner product meets this bound, the dictionary is called an optimal Grassmannian frame .
The definition of coherence can be extended to arbitrary matrices \(\Phi \in \CC^{N \times D}\) .
The coherence of a matrix \(\Phi \in \CC^{N \times D}\) is defined as the maximum absolute normalized inner product between two distinct columns in the matrix. Let
\[\begin{split}\Phi = \begin{pmatrix} \phi_1 & \phi_2 & \dots & \phi_D \end{pmatrix}\end{split}\]
Then the coherence of \(\Phi\) is given by
\[\mu(\Phi) = \max_{j \neq k} \frac{| \langle \phi_j, \phi_k \rangle |}{\| \phi_j \|_2 \, \| \phi_k \|_2}.\]
It is assumed that none of the columns in \(\Phi\) is a zero vector.
Lower bounds for spark¶
Coherence of a matrix is easy to compute. More interestingly it also provides a lower bound on the spark of a matrix.
For any matrix \(\Phi \in \CC^{N \times D}\) (with non-zero columns) the following relationship holds:
\[\spark(\Phi) \geq 1 + \frac{1}{\mu(\Phi)}.\]
We note that scaling of a column of \(\Phi\) doesn’t change either the spark or coherence of \(\Phi\) . Therefore, we assume that the columns of \(\Phi\) are normalized.
We now construct the Gram matrix of \(\Phi\) given by \(G = \Phi^H \Phi\). We note that
\[G_{k k} = 1 \quad \forall \; 1 \leq k \leq D\]
since each column of \(\Phi\) is unit norm.
Also
\[| G_{j k} | \leq \mu(\Phi) \quad \forall \; j \neq k.\]
Consider any \(p\) columns from \(\Phi\) and construct its Gram matrix. This is nothing but a leading minor of size \(p \times p\) from the matrix \(G\) .
From the Gershgorin disk theorem, if this minor is diagonally dominant, i.e. if
\[| G_{i i} | > \sum_{j \neq i} | G_{i j} | \quad \text{for every row } i,\]
then this sub-matrix of \(G\) is positive definite and so corresponding \(p\) columns from \(\Phi\) are linearly independent.
But
\[G_{i i} = 1\]
and
\[\sum_{j \neq i} | G_{i j} | \leq (p - 1) \mu\]
for the minor under consideration. Hence for \(p\) columns to be linearly independent the following condition is sufficient:
\[1 > (p - 1) \mu.\]
Thus if
\[p < 1 + \frac{1}{\mu}\]
then every set of \(p\) columns from \(\Phi\) is linearly independent.
Hence, the smallest possible set of linearly dependent columns must have more than \(\frac{1}{\mu}\) elements. This establishes the lower bound
\[\spark(\Phi) \geq 1 + \frac{1}{\mu(\Phi)}.\]
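We can check this bound numerically, for instance on the normalized random-row Hadamard sub-matrix Phi constructed earlier in this section (a sketch using library routines that appear elsewhere in this documentation):
% spark versus the coherence lower bound 1 + 1/mu
mu = spx.dict.coherence(Phi);
lower_bound = 1 + 1/mu;
s = spx.dict.spark(Phi);
% the computed spark should be no smaller than the bound
fprintf('coherence: %.4f, bound: %.2f, spark: %d\n', mu, lower_bound, s);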
This bound on spark doesn’t make any assumptions on the structure of the dictionary. In fact, imposing additional structure on the dictionary can give better bounds. Let us look at an example for a two ortho-basis [DE03].
Let \(\DDD\) be a two ortho-basis. Then
It can be shown that for any vector \(v \in \NullSpace(\DDD)\)
But
Thus
For maximally incoherent two orthonormal bases, we know that \(\mu = \frac{1}{\sqrt{N}}\) . A perfect example is the pair of Dirac and Fourier bases. In this case \(\spark(\DDD) \geq 2 \sqrt{N}\) .
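As an illustration, we can construct the Dirac-Fourier pair in MATLAB and check that its coherence equals \(\frac{1}{\sqrt{N}}\). This sketch uses dftmtx from the Signal Processing Toolbox and assumes that spx.dict.coherence handles complex dictionaries; the scaling by \(1/\sqrt{N}\) makes the Fourier columns unit norm.
N = 16;
% Dirac basis and normalized Fourier basis
I = eye(N);
F = dftmtx(N) / sqrt(N);
Dict = [I F];
% expected to be 1/sqrt(N) = 0.25
mu = spx.dict.coherence(Dict)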
Uniqueness-Coherence¶
We can now establish a uniqueness condition for sparse solution of \(x = \Phi \alpha\) .
Consider a solution \(x^*\) to the under-determined system \(y = \Phi x\). If \(x^*\) obeys
\[\| x^* \|_0 < \frac{1}{2}\left( 1 + \frac{1}{\mu(\Phi)} \right)\]
then it is necessarily the sparsest solution.
It is interesting to compare the two uniqueness theorems: spark uniqueness theorem and coherence uniqueness theorem.
The first one is sharp and is far more powerful than the second one based on coherence.
Coherence can never be smaller than \(\frac{1}{\sqrt{N}}\), therefore the bound on \(\| x^* \|_0\) above can never be larger than \(\frac{\sqrt{N} + 1}{2}\).
However, spark can easily be as large as \(N\), and then the bound on \(\| x^* \|_0\) can be as large as \(\frac{N}{2}\).
Thus, we note that coherence gives a weaker bound than spark for supportable sparsity levels of unique solutions. The advantage that coherence has is that it is easily computable and doesn’t require any special structure on the dictionary (two ortho basis has a special structure).
Singular values of sub-dictionaries¶
Let \(\DDD\) be a dictionary and \(\DDD_{\Lambda}\) be a sub-dictionary. Let \(\mu\) be the coherence of \(\DDD\). Let \(K = | \Lambda |\). Then the eigen values of \(G = \DDD_{\Lambda}^H \DDD_{\Lambda}\) satisfy:
\[1 - (K - 1)\mu \leq \lambda \leq 1 + (K - 1)\mu.\]
Moreover, the singular values of the sub-dictionary \(\DDD_{\Lambda}\) satisfy
\[\sqrt{1 - (K - 1)\mu} \leq \sigma \leq \sqrt{1 + (K - 1)\mu}.\]
We recall from Gershgorin’s theorem that for any square matrix \(A \in \CC^{K \times K}\) , every eigen value \(\lambda\) of \(A\) satisfies
Now consider the matrix \(G = \DDD_{\Lambda}^H \DDD_{\Lambda}\) with diagonal elements equal to 1 and off diagonal elements bounded by a value \(\mu\) . Then
Thus,
This gives us a lower bound on the smallest eigen value.
Since \(G\) is positive definite ( \(\DDD_{\Lambda}\) is full-rank), hence its eigen values are positive. Thus, the above lower bound is useful only if
We also get an upper bound on the eigen values of \(G\) given by
The bounds on singular values of \(\DDD_{\Lambda}\) are obtained as a straight-forward extension by taking square roots on the expressions.
Embeddings using sub-dictionaries¶
Let \(\DDD\) be a real dictionary and \(\DDD_{\Lambda}\) be a sub-dictionary with \(K = |\Lambda|\) . Let \(\mu\) be the coherence of \(\DDD\) . Let \(v \in \RR^K\) be an arbitrary vector. Then
where \(\OneMat\) is a \(K\times K\) matrix of all ones. Moreover
We can easily write
The terms in the R.H.S. for \(i = j\) are given by
Summing over \(i = 1, \cdots, K\) , we get
We are now left with \(K^2 - K\) off diagonal terms. Each of these terms is bounded by
Summing over the \(K^2 - K\) off-diagonal terms we get:
Thus,
Thus,
We get the result by slight reordering of terms:
We recall that
Thus, the inequalities can be written as
Alternatively,
Finally
This gives us
We now present the above theorem for the complex case. The proof is based on singular values. This proof is simpler and more general than the one presented above.
Let \(\DDD\) be a dictionary and \(\DDD_{\Lambda}\) be a sub-dictionary with \(K = |\Lambda|\) . Let \(\mu\) be the coherence of \(\DDD\) . Let \(v \in \CC^K\) be an arbitrary vector. Then
Recall that
A previous result tells us:
Thus,
and
This gives us the result
Babel function¶
Recalling the definition of coherence, we note that it reflects only the extreme correlations between atoms of dictionary. If most of the inner products are small compared to one dominating inner product, then the value of coherence is highly misleading.
In [Tro04], Tropp introduced Babel function , which measures the maximum total coherence between a fixed atom and a collection of other atoms. The Babel function quantifies an idea as to how much the atoms of a dictionary are speaking the same language.
The Babel function for a dictionary \(\DDD\) is defined by
\[\mu_1(p) = \max_{| \Lambda | = p} \; \max_{\psi} \; \sum_{\lambda \in \Lambda} | \langle \psi, d_{\lambda} \rangle |,\]
where the vector \(\psi\) ranges over the atoms indexed by \(\Omega \setminus \Lambda\). We define
\[\mu_1(0) = 0\]
for sparsity level \(p=0\).
Let us understand what is going on here. For each value of \(p\) we consider all possible \(\binom{D}{p}\) subspaces by choosing \(p\) vectors from \(\mathcal{D}\).
Let the atoms spanning one such subspace be identified by an index set \(\Lambda \subset \Omega\).
All other atoms are indexed by the index set \(\Gamma = \Omega \setminus \Lambda\).
Let
denote the atoms indexed by \(\Gamma\).
We pickup a vector \(\psi \in \Psi\) and compute its inner product with all atoms indexed by \(\Lambda\). We compute the sum of absolute value of these inner products over all \(\{ d_{\lambda} : \lambda \in \Lambda\}\).
We run it for all \(\psi \in \Psi\) and then pickup the maximum value of above sum over all \(\psi\).
We finally compute the maximum over all possible \(p\)-subspaces.
This number is taken as the value of the Babel function for sparsity level \(p\).
We first make a few observations over the properties of Babel function.
Babel function is a generalization of coherence.
For \(p=1\) we observe that
\[\mu_1(1) = \mu(\mathcal{D}),\]
the coherence of \(\mathcal{D}\).
This is easy to see since the sum
cannot decrease as \(p = | \Lambda|\) increases.
In particular for some value of \(p\) let \(\Lambda^p\) and \(\psi^p\) denote the set and vector for which the maximum in (1) is achieved. Now pick some column which is not \(\psi^p\) and is not indexed by \(\Lambda^p\) and include it for \(\Lambda^{p + 1}\). Note that \(\Lambda^{p + 1}\) and \(\psi^p\) might not be the worst case for sparsity level \(p+1\) in (1). Clearly
\(\mu_1(p+1)\) cannot be less than \(\mu_1(p)\).
The Babel function is upper bounded by a multiple of the coherence:
\[\mu_1(p) \leq p \, \mu.\]
This leads to
Computation of Babel function¶
It might seem at first that the computation of the Babel function is combinatorial and hence prohibitively expensive. But this is not true.
We will demonstrate this through an example in this section. Our example synthesis matrix will be
From the synthesis matrix \(\DDD\) we first construct its Gram matrix given by
We then take absolute value of each entry in \(G\) to construct \(|G|\).
For the running example
We now sort every row in descending order to obtain a new matrix \(G'\).
The first entry in each row is now \(1\). This corresponds to \(\langle d_i, d_i \rangle\); it doesn't appear in the calculation of \(\mu_1(p)\), hence we disregard the whole of the first column.
Now look at column 2 in \(G'\). In the \(i\)-th row it is nothing but
Thus,
i.e. the coherence is given by the maximum in the 2nd column of \(G'\).
In the running example
Looking carefully we can note that for \(\psi = d_i\) the maximum value of sum
while \(| \Lambda| = p\) is given by the sum over elements from 2nd to \((p+1)\)-th columns in \(i\)-th row.
Thus
For the running example the Babel function values are given by
We see that Babel function stops increasing after \(p=4\). Actually \(\DDD\) is constructed by shuffling the columns of two orthonormal bases. Hence many of the inner products are 0 in \(G\).
Babel function and spark¶
We first note that Babel function tells something about linear independence of columns of \(\DDD\).
Let \(\mu_1\) be the Babel function for a dictionary \(\DDD\). If
\[\mu_1(p) < 1\]
then all selections of \(p+1\) columns from \(\DDD\) are linearly independent.
We recall from the proof of this result that if
then every set of \((p+1)\) columns from \(\DDD\) are linearly independent.
We also know from this result that
Thus if
then all selections of \(p+1\) columns from \(\DDD\) are linearly independent.
This leads us to a lower bound on spark from Babel function .
A lower bound on the spark of a dictionary \(\DDD\) is given by
\[\spark(\DDD) \geq \min \{ p \; : \; \mu_1(p - 1) \geq 1 \}.\]
For all \(j \leq p-2\) we are given that \(\mu_1(j) < 1\). Thus all sets of \(p-1\) columns from \(\DDD\) are linearly independent (using this result).
Finally \(\mu_1(p-1) \geq 1\), hence we cannot say definitively whether a set of \(p\) columns from \(\DDD\) is linearly dependent or not. This establishes the lower bound on spark.
An earlier version of this result also appeared in [DE03] theorem 6.
Babel function and singular values¶
Let \(\DDD\) be a dictionary and \(\Lambda\) be an index set with \(|\Lambda| = K\). The singular values of \(\DDD_{\Lambda}\) are bounded by
Consider the Gram matrix
\(G\) is a \(K\times K\) square matrix.
Also let
so that
The Gershgorin Disc Theorem states that every eigenvalue of \(G\) lies in one of the \(K\) discs
Since \(d_i\) are unit norm, hence \(G_{k k} = 1\).
Also we note that
since there are \(K-1\) terms in sum and \(\mu_1(K-1)\) is an upper bound on all such sums.
Thus if \(z\) is an eigen value of \(G\) then we have
This is OK since \(G\) is positive semi-definite, thus, the eigen values of \(G\) are real.
But the eigen values of \(G\) are nothing but the squared singular values of \(\DDD_{\Lambda}\). Thus we get
From previous theorem we have
Since the singular values are always non-negative, the lower bound is useful only when \(\mu_1(K-1) < 1\). When it holds we have
If \(\mu_1(K -1 ) < 1\), then the singular values of any sub-matrix of \(K\) atoms are non-zero. Thus, the minimum number of atoms required to form a linear dependent set is \(K + 1\). Let the number of atoms in any other exact representation of the signal be \(l\). Then
Babel function and gram matrix¶
Let \(\Lambda\) index a subdictionary and let \(G = \DDD_{\Lambda}^H \DDD_{\Lambda}\) denote the Gram matrix of the subdictionary \(\DDD_{\Lambda}\). Assume \(K = | \Lambda |\).
Since \(G\) is Hermitian, hence the two norms are equal:
Now each row consists of a diagonal entry \(1\) and \(K-1\) off diagonal entries. The absolute sum of all the off-diagonal entries in a row is upper bounded by \(\mu_1(K -1)\). Thus, the absolute sum of all the entries in a row is upper bounded by \(1 + \mu_1(K - 1)\). Since \(\| G \|_{\infty}\) is nothing but the maximum \(l_1\) norm of rows of \(G\), hence
Suppose that \(\mu_1(K - 1) < 1\). Then
Since \(G\) is Hermitian, hence the two operator norms are equal:
As usual we can write \(G\) as \(G = I + A\) where \(A\) consists of off-diagonal entries in \(A\) (recall that since atoms are unit norm, hence diagonal entries in \(G\) are 1).
Each row of \(A\) lists inner products between a fixed atom and \(K-1\) other atoms (leaving the 0 at the diagonal entry). Therefore
(since \(l_1\) norm of any row is upper bounded by the babel number \(\mu_1(K - 1)\) ). Now \(G^{-1}\) can be written as a Neumann series
Thus
Finally
Thus
Quasi incoherent dictionaries¶
When the Babel function of a dictionary grows slowly, we say that the dictionary is quasi-incoherent .
Implementing the Babel function¶
We will implement the babel function in Matlab. Here is the signature of the function:
function [ babel ] = babel( Phi )
Let’s compute the Gram matrix:
G = Phi' * Phi;
We now take the absolute values of all entries in the gram matrix:
absG = abs(G);
We sort the rows of absG in descending order:
GS = sort(absG, 2,'descend');
We compute the cumulative sums over each row of GS, leaving out the first column:
rowSums = cumsum(GS(:, 2:end), 2);
The babel function is now obtained by simply taking maximum over each column:
babel = max(rowSums);
This implementation is available in the sparse-plex library as spx.dict.babel.
Dirac-DCT dictionary¶
This dictionary is suitable for real signals since both Dirac and DCT are totally real bases \(\in \RR^{N \times N}\).
The dictionary is obtained by combining the \(N \times N\) identity matrix (Dirac basis) with the \(N \times N\) DCT matrix for signals in \(\RR^N\).
Let \(\Psi_{\text{DCT}, N}\) denote the DCT matrix for \(\RR^N\) and let \(I_N\) denote the identity matrix for \(\RR^N\). Then
\[\begin{split}\DDD = \begin{pmatrix} I_N & \Psi_{\text{DCT}, N} \end{pmatrix}\end{split}\]
Let \(\psi_k\) denote the \(k\)-th column of \(\Psi_{\text{DCT}, N}\). Its entries are given by
\[\psi_k(n) = \sqrt{\frac{2}{N}} \, \Omega_k \cos \left( \frac{\pi (2 n - 1)(k - 1)}{2 N} \right), \quad n = 1, \dots, N,\]
with \(\Omega_k = \frac{1}{\sqrt{2}}\) for \(k=1\) and \(\Omega_k = 1\) for \(2 \leq k \leq N\).
Note that for \(k=1\) the entries become
\[\psi_1(n) = \sqrt{\frac{2}{N}} \cdot \frac{1}{\sqrt{2}} = \frac{1}{\sqrt{N}}.\]
Thus, the \(l_2\) norm of \(\psi_1\) is 1. We can similarly verify that the \(l_2\) norms of the other columns are also one.
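The sketch below builds this basis directly from the formula and verifies the column norms. As a cross-check, the columns of dctmtx(N)' (Signal Processing Toolbox) should match this construction.
N = 8;
Psi = zeros(N, N);
for k = 1:N
    if k == 1
        Omega_k = 1 / sqrt(2);
    else
        Omega_k = 1;
    end
    n = (1:N)';
    Psi(:, k) = sqrt(2/N) * Omega_k * cos(pi * (2*n - 1) * (k - 1) / (2*N));
end
% all column norms should be 1
norms = sqrt(sum(Psi.^2))
% difference from MATLAB's DCT basis should be negligible
max(max(abs(Psi - dctmtx(N)')))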
The coherence of a two-ortho basis where one basis is the Dirac basis is given by the magnitude of the largest entry in the other basis. For \(\Psi_{\text{DCT}, N}\), the largest value is obtained when \(\Omega_k = 1\) and the \(\cos\) term evaluates to 1. Clearly,
\[\mu(\DDD) = \sqrt{\frac{2}{N}}.\]
The \(p\)-babel function for Dirac-DCT dictionary is given by
In particular, the standard babel function is given by
Hands-on with Dirac DCT dictionaries¶
We need to specify the dimension of the ambient space:
N = 256;
We are ready to construct the dictionary:
Phi = spx.dict.simple.dirac_dct_mtx(N);
Let’s visualize the dictionary:
imagesc(Phi);
colorbar;

Measuring the coherence of the dictionary:
>> spx.dict.coherence(Phi)
ans =
0.0884
We can cross-check with the theoretical estimate:
>> sqrt(2/N)
ans =
0.0884
Let’s construct the babel function for this dictionary:
mu1 = spx.dict.babel(Phi);
We can plot it:
plot(mu1);
grid on;

We note that the babel function increases linearly for the initial part and saturates to a value of 16 afterwards.
Dirac-Hadamard dictionary¶
This dictionary is suitable for real signals since both Dirac and Hadamard are totally real bases \(\in \RR^{N \times N}\).
For the Hadamard matrix to exist (as constructed by MATLAB's hadamard function), \(N\), \(N/12\), or \(N/20\) must be a power of 2.
Hadamard matrix is special in the sense that all the entries are either 1 or -1. Thus, multiplication with the matrix can be achieved by simple additions and subtractions:
>> A = hadamard(12)
A =
1 1 1 1 1 1 1 1 1 1 1 1
1 -1 1 -1 1 1 1 -1 -1 -1 1 -1
1 -1 -1 1 -1 1 1 1 -1 -1 -1 1
1 1 -1 -1 1 -1 1 1 1 -1 -1 -1
1 -1 1 -1 -1 1 -1 1 1 1 -1 -1
1 -1 -1 1 -1 -1 1 -1 1 1 1 -1
1 -1 -1 -1 1 -1 -1 1 -1 1 1 1
1 1 -1 -1 -1 1 -1 -1 1 -1 1 1
1 1 1 -1 -1 -1 1 -1 -1 1 -1 1
1 1 1 1 -1 -1 -1 1 -1 -1 1 -1
1 -1 1 1 1 -1 -1 -1 1 -1 -1 1
1 1 -1 1 1 1 -1 -1 -1 1 -1 -1
>> A' * A
ans =
12 0 0 0 0 0 0 0 0 0 0 0
0 12 0 0 0 0 0 0 0 0 0 0
0 0 12 0 0 0 0 0 0 0 0 0
0 0 0 12 0 0 0 0 0 0 0 0
0 0 0 0 12 0 0 0 0 0 0 0
0 0 0 0 0 12 0 0 0 0 0 0
0 0 0 0 0 0 12 0 0 0 0 0
0 0 0 0 0 0 0 12 0 0 0 0
0 0 0 0 0 0 0 0 12 0 0 0
0 0 0 0 0 0 0 0 0 12 0 0
0 0 0 0 0 0 0 0 0 0 12 0
0 0 0 0 0 0 0 0 0 0 0 12
While constructing the Dirac-Hadamard dictionary, we need to ensure that the columns of the dictionary are normalized.
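A minimal sketch of this construction (the library routine spx.dict.simple.dirac_hadamard_mtx, used below, takes care of it; here we only illustrate the normalization step):
N = 4;
H = hadamard(N);
% Hadamard columns have norm sqrt(N); scale them to unit norm
Phi = [eye(N) H / sqrt(N)];
% verify the column norms
sqrt(sum(Phi.^2))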
Hands-on with Dirac Hadamard dictionaries¶
We need to specify the dimension of the ambient space:
N = 256;
We are ready to construct the dictionary:
Phi = spx.dict.simple.dirac_hadamard_mtx(N);
Let’s visualize the dictionary:
imagesc(Phi);
colorbar;

Measuring the coherence of the dictionary:
>> spx.dict.coherence(Phi)
ans =
0.0625
Let’s construct the babel function for this dictionary:
mu1 = spx.dict.babel(Phi);
We can plot it:
plot(mu1);
grid on;

We note that the babel function increases linearly for the initial part and saturates to a value of 16 afterwards.
We can construct a Dirac Hadamard dictionary for a small size to see the effect of normalization:
>> spx.dict.simple.dirac_hadamard_mtx(4)
ans =
1.0000 0 0 0 0.5000 0.5000 0.5000 0.5000
0 1.0000 0 0 0.5000 -0.5000 0.5000 -0.5000
0 0 1.0000 0 0.5000 0.5000 -0.5000 -0.5000
0 0 0 1.0000 0.5000 -0.5000 -0.5000 0.5000
Dictionaries with Wavelet Toolbox¶
MATLAB Wavelet Toolbox provides good support for constructing multi-basis dictionaries (dictionaries that are constructed by concatenating one or more subdictionaries which are either orthogonal bases or wavelet packets).
Constructing Dictionaries¶
We need to specify the dimension of the signal space \(\RR^N\):
N = 32;
We can now construct the dictionary:
Phi = wmpdictionary(N, 'lstcpt', {'RnIdent', 'dct'});

The name-value pair argument lstcpt takes the list of constituent subdictionaries.
We wish to combine a symlet ONB with 4 vanishing moments and 5 level decomposition and a DCT basis
We can now construct the dictionary:
N = 256;
[Phi, nb_atoms] = wmpdictionary(N, 'lstcpt', { {'sym4', 5}, 'dct'});

The vector nb_atoms tells us the number of atoms in each subdictionary:
>> nb_atoms
nb_atoms =
256 256
Here we will combine symlets with the wavelet packet version of symlets and the DCT ONB:
- symlet with 4 vanishing moments and 5 level decomposition
- wavelet packet symlet with 4 vanishing moments and 5 level decomposition
- DCT basis
N = 256;
[Phi, nb_atoms] = wmpdictionary(N, 'lstcpt', { {'sym4', 5}, {'wpsym4', 5}, 'dct'});

We can visualize the atoms in this dictionary one by one.
sparse-plex provides a method to visualize the atoms one by one and save the visualizations in the form of an MP4 video file:
spx.graphics.multi_basis_dict_movie('sym4_wpsym4_dct.mp4', ...
Phi, nb_atoms, {'sym4', 'wpsym4', 'dct'})
We have specified the name of the output video file, the dictionary to be visualized, number of atoms in each subdictionary and names of subdictionaries.
Compressive Sensing¶
Introduction to compressive sensing¶
In this section we formally define the problem of compressed sensing.
Compressive sensing refers to the idea that for sparse or compressible signals, a small number of nonadaptive measurements carries sufficient information to approximate the signal well. In the literature it is also known as compressed sensing and compressive sampling . Different authors seem to prefer different names.
In this section we will represent a signal dictionary as well as its synthesis matrix as \(\DD\) .
We recall the definition of sparse signals. A signal \(x \in \CC^N\) is \(K\)-sparse in \(\DD\) if there exists a representation \(\alpha\) for \(x\) which has at most \(K\) non-zeros, i.e.
\[x = \DD \alpha\]
and
\[\| \alpha \|_0 \leq K.\]
The dictionary could be standard basis, Fourier basis, wavelet basis, a wavelet packet dictionary, a multi-ONB or even a randomly generated dictionary.
Real life signals are not sparse, yet they are compressible in the sense that entries in the signal decay rapidly when sorted by magnitude. As a result, compressible signals are well approximated by sparse signals. Note that we are talking about the sparsity or compressibility of the signal in a suitable dictionary. Thus, we mean that the signal \(x\) has a representation \(\alpha\) in \(\DD\) in which the coefficients decay rapidly when sorted by magnitude.
In compressed sensing, a measurement is a linear functional applied to a signal
The compressed sensor makes multiple such linear measurements. This can best be represented by the action of a sensing matrix \(\Phi\) on the signal \(x\) given by
where \(\Phi \in \CC^{M \times N}\) represents \(M\) different measurements made on the signal \(x\) by the sensing process. Each row of \(\Phi\) represents one linear measurement.
The vector \(y \in \CC^M\) is known as measurement vector .
\(\CC^N\) forms the signal space while \(\CC^M\) forms the measurement space .
We also note that above can be written as
It is assumed that the signal \(x\) is \(K\) -sparse or \(K\) -compressible in \(\DD\) and \(K \ll N\) .
The objective is to recover \(x\) from \(y\) given that \(\Phi\) and \(\DD\) are known.
We do this by first recovering the sparse representation \(\alpha\) from \(y\) and then computing \(x = \DD \alpha\) .
If \(M \geq N\) then the problem is a straightforward least squares problem. So we don't consider it here.
The more interesting case is when \(K < M \ll N\) i.e. the number of measurements is much less than the dimension of the ambient signal space while more than the sparsity level of signal namely \(K\) .
We note that given \(\alpha\) is found, finding \(x\) is straightforward. We therefore can remove the dictionary from our consideration and look at the simplified problem given as: recover \(x\) from \(y\) with
where \(x \in \CC^N\) itself is assumed to be \(K\) -sparse or \(K\) -compressible and \(\Phi \in \CC^{M \times N}\) is the sensing matrix.
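To make the setup concrete, here is a small simulation of the measurement process with a Gaussian sensing matrix and a synthetically generated sparse signal (a sketch in plain MATLAB; the choice of dimensions is arbitrary):
N = 256;   % ambient dimension
M = 64;    % number of measurements
K = 8;     % sparsity level
% K-sparse signal with random support and Gaussian non-zero entries
x = zeros(N, 1);
support = randperm(N, K);
x(support) = randn(K, 1);
% Gaussian sensing matrix
Phi = randn(M, N) / sqrt(M);
% compressive measurements
y = Phi * x;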
Note
The definition above doesn’t consider the noise introduced during taking the measurements. We will introduce noise later.
The sensing matrix¶
There are two ways to look at the sensing matrix. The first view is in terms of its columns:
\[\begin{split}\Phi = \begin{pmatrix} \phi_1 & \phi_2 & \dots & \phi_N \end{pmatrix}\end{split}\]
where \(\phi_i \in \CC^M\) are the columns of the sensing matrix. In this view we see that
\[y = \sum_{i=1}^{N} x_i \phi_i,\]
i.e. \(y\) belongs to the column span of \(\Phi\) and one representation of \(y\) in \(\Phi\) is given by \(x\) .
This view looks very similar to a dictionary and its atoms but there is a difference. In a dictionary, we require each atom to be unit norm. We don’t require columns of the sensing matrix \(\Phi\) to be unit norm.
The second view of the sensing matrix \(\Phi\) is in terms of its rows. We write
\[\begin{split}\Phi = \begin{pmatrix} \chi_1^H \\ \chi_2^H \\ \vdots \\ \chi_M^H \end{pmatrix}\end{split}\]
where \(\chi_i \in \CC^N\) are conjugate transposes of the rows of \(\Phi\). This view gives us the following result:
\[y_i = \chi_i^H x, \quad 1 \leq i \leq M.\]
In this view \(y_i\) is a measurement given by the inner product of \(x\) with \(\chi_i\) \(( \langle x , \chi_i \rangle = \chi_i^H x)\) .
We will call \(\chi_i\) as a sensing vector . There are \(M\) such sensing vectors in \(\CC^N\) comprising \(\Phi\) corresponding to \(M\) measurements in the measurement space \(\CC^M\) .
Note
Dictionary design focuses on creating sparsest possible representations of the signals in a particular domain. Sensing matrix design focuses on reducing the number of measurements as much as possible while still being able to recover the sparse representation from the measurements.
Number of measurements¶
A fundamental question of the compressed sensing framework is: how many measurements are necessary to acquire \(K\)-sparse signals? By necessary we mean that \(y\) carries enough information about \(x\) such that \(x\) can be recovered from \(y\).
If \(M < K\) then recovery is not possible.
We further note that the sensing matrix \(\Phi\) should not map two different \(K\) -sparse signals to the same measurement vector. Thus, we will need \(M \geq 2K\) and each collection of \(2K\) columns in \(\Phi\) must be non-singular.
If the \(K\)-column sub matrices of \(\Phi\) are badly conditioned, then it is possible that some sparse signals get mapped to very similar measurement vectors. Thus it is numerically unstable to recover the signal. Moreover, if noise is present, stability further degrades.
In [CT06] Candès and Tao showed that the geometry of sparse signals should be preserved under the action of a sensing matrix. In particular, the distance between two sparse signals shouldn't change much during sensing.
They quantified this idea in the form of the restricted isometry constant of a matrix \(\Phi\), defined as the smallest number \(\delta_K\) for which the following holds:
\[(1 - \delta_K) \| x \|_2^2 \leq \| \Phi x \|_2^2 \leq (1 + \delta_K) \| x \|_2^2 \quad \forall x \in \Sigma_K.\]
We will study more about this property known as restricted isometry property (RIP) later. Here we just sketch the implications of RIP for compressed sensing.
When \(\delta_K < 1\) then the inequalities imply that every collection of \(K\) columns from \(\Phi\) is non-singular. Since we need every collection of \(2K\) columns to be non-singular, we actually need \(\delta_{2K} < 1\) which is the minimum requirement for recovery of \(K\) sparse signals.
Further if \(\delta_{2K} \ll 1\), then we note that sensing operator very nearly maintains the \(l_2\) distance between any two \(K\) sparse signals. In consequence, it is possible to invert the sensing process stably.
It is now known that many randomly generated matrices have excellent RIP behavior. One can show that if \(\delta_{2K} \leq 0.1\), then with
\[M = O\left(K \ln \frac{N}{K}\right)\]
measurements, one can recover \(x\) with high probability.
Some of the typical random matrices which have suitable RIP properties are listed below; simple generators for two of them are sketched after the list.
- Gaussian sensing matrices
- Partial Fourier matrices
- Rademacher sensing matrices
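Minimal generators for two of these ensembles are shown below (the normalization by \(1/\sqrt{M}\) is one common convention; a partial Fourier matrix would additionally require selecting \(M\) rows of the DFT matrix at random):
M = 64; N = 256;
% Gaussian sensing matrix
PhiG = randn(M, N) / sqrt(M);
% Rademacher sensing matrix (entries +1/-1 with equal probability)
PhiR = sign(randn(M, N)) / sqrt(M);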
Signal recovery¶
The second fundamental problem in compressed sensing is: Given the compressed measurements \(y\) how do we recover the signal \(x\)? This problem is known as SPARSE-RECOVERY problem.
A simple formulation of the problem as: minimize \(\| x \|_0\) subject to \(y = \Phi x\) is hopeless since it entails a combinatorial explosion of the search space.
Over the years, people have developed a number of algorithms to tackle the sparse recovery problem.
The algorithms can be broadly classified into following categories
- [Greedy pursuits] These algorithms attempt to build the approximation of the signal iteratively by making locally optimal choices at each step. Examples of such algorithms include OMP (orthogonal matching pursuit), stage-wise OMP, regularized OMP, CoSaMP (compressive sampling matching pursuit) and IHT (iterative hard thresholding).
- [Convex relaxation] These techniques relax the \(l_0\) “norm” minimization problem into a suitable problem which is a convex optimization problem. This relaxation is valid for a large class of signals of interest. Once the problem has been formulated as a convex optimization problem, a number of solutions are available, e.g. interior point methods, projected gradient methods and iterative thresholding.
- [Combinatorial algorithms] These methods are based on research in group testing and are specifically suited for situations where highly structured measurements of the signal are taken. This class includes algorithms like Fourier sampling, chaining pursuit, and HHS pursuit.
A major emphasis of the following chapters will be the study of these sparse recovery algorithms.
In the following we present examples of real life problems which can be modeled as compressed sensing problems.
Error correction in linear codes¶
The classical error correction problem was discussed in one of the seminal founding papers on compressed sensing [CT05].
Let \(f \in \RR^N\) be a “plaintext” message being sent over a communication channel.
In order to make the message robust against errors in the communication channel, we encode it with an error correcting code.
We consider \(A \in \RR^{D \times N}\) with \(D > N\) as a linear code . \(A\) is essentially a collection of code words given by
where \(a_i \in \RR^D\) are the codewords.
We construct the “ciphertext”
\[x = A f\]
where \(x \in \RR^D\) is sent over the communication channel. \(x\) is a redundant representation of \(f\) which is expected to be robust against small errors during transmission.
\(A\) is assumed to be full column rank. Thus \(A^T A\) is invertible and we can easily see that
where
is the left pseudo inverse of \(A\) . The communication channel is going to add some error. What we actually receive is
where \(e \in \RR^D\) is the error being introduced by the channel.
The least squares solution by minimizing the error \(l_2\) norm is given by
Since \(A^{\dag} e\) is usually non-zero (we cannot assume that \(A^{\dag}\) will annihilate \(e\) ), hence \(f'\) is not an exact replica of \(f\).
What is needed is an exact reconstruction of \(f\). To achieve this, a common assumption in literature is that error vector \(e\) is in fact sparse. i.e.
To reconstruct \(f\) it is sufficient to reconstruct \(e\) since once \(e\) is known we can get
and from there \(f\) can be faithfully reconstructed.
The question is: for a given sparsity level \(K\) for the error vector \(e\) can one reconstruct \(e\) via practical algorithms? By practical we mean algorithms which are of polynomial time w.r.t. the length of “ciphertext” (D).
The approach in [CT05] is as follows.
We construct a matrix \(F \in \RR^{M \times D}\) which can annihilate \(A\), i.e.
\[F A = 0.\]
We then apply \(F\) to \(y\), giving us
\[\tilde{y} = F y = F (A f + e) = F e.\]
Therefore the decoding problem is reduced to that of reconstructing a sparse vector \(e \in \RR^D\) from the measurements \(Fe \in \RR^M\) where we would like to have \(M \ll D\) .
With this the problem of finding \(e\) can be cast as problem of finding a sparse solution for the under-determined system given by
This now becomes the compressed sensing problem. The natural questions are
- How many measurements \(M\) are necessary (in \(F\) ) to be able to recover \(e\) exactly?
- How should \(F\) be constructed?
- How do we recover \(e\) from \(\tilde{y}\) ?
These questions are discussed in upcoming sections.
Recovery of exactly sparse signals¶
The null space of a matrix \(\Phi\) is denoted as
\[\NullSpace(\Phi) = \{ v \in \RR^N : \Phi v = 0 \}.\]
The set of \(K\)-sparse signals is defined as
\[\Sigma_K = \{ x : \| x \|_0 \leq K \}.\]
Let \(N=10\) .
- \(x=(1,2, 1, -1, 2 , -3, 4, -2, 2, -2) \in \RR^{10}\) is not a sparse signal.
- \(x=(0,0,0,0,1,0,0,-1,0,0)\in \RR^{10}\) is a 2-sparse signal. It is also a 4-sparse signal.
Let N = 5.
- Let \(a = (0,1,-1,0, 0)\) and \(b = (0,2,0,-1, 0)\). Then \(a - b = (0,-1,-1,1, 0)\) is a 3 sparse hence 4 sparse signal.
- Let \(a = (0,1,-1,0, 0)\) and \(b = (0,2,-1,0, 0)\). Then \(a - b = (0,-1,-2,0, 0)\) is a 2 sparse hence 4 sparse signal.
- Let \(a = (0,1,-1,0, 0)\) and \(b = (0,0,0,1, -1)\). Then \(a - b = (0,1,-1,-1, 1)\) is a 4 sparse signal.
Let \(a\) and \(b\) be two \(K\) sparse signals. Then \(\Phi a\) and \(\Phi b\) are corresponding measurements. Now if \(\Phi\) allows recovery of all \(K\) sparse signals, then \(\Phi a \neq \Phi b\) . Thus \(\Phi (a - b) \neq 0\) . Thus \(a - b \notin \NullSpace(\Phi)\) .
Let \(x \in \NullSpace(\Phi) \cap \Sigma_{2K}\) . Thus \(\Phi x = 0\) and \(\#x \leq 2K\) . Then we can find \(y, z \in \Sigma_K\) such that \(x = z - y\) . Thus \(m = \Phi z = \Phi y\) . But then, \(\Phi\) doesn’t uniquely represent \(y, z \in \Sigma_K\) .
There are many equivalent ways of characterizing above condition.
The spark¶
We recall from definition of spark, that spark of a matrix \(\Phi\) is defined as the minimum number of columns which are linearly dependent.
We need to show
- If for every measurement, there is only one \(K\) -sparse explanation, then \(\spark(\Phi) > 2K\) .
- If \(\spark(\Phi) > 2K\) then for every measurement, there is only one \(K\) -sparse explanation.
Assume that for every \(y \in \RR^M\) there exists at most one \(K\) sparse signal \(x \in \RR^N\) such that \(y = \Phi x\) .
Now assume that \(\spark(\Phi) \leq 2K\) . Thus there exists a set of at most \(2K\) columns which are linearly dependent.
Thus there exists \(v \in \Sigma_{2K}\) such that \(\Phi v = 0\) . Thus \(v \in \NullSpace (\Phi)\) .
Thus there exists a non-zero vector in \(\Sigma_{2K} \cap \NullSpace (\Phi)\).
Hence \(\Phi\) doesn’t uniquely represent each signal \(x \in \Sigma_K\) . A contradiction.
Hence \(\spark(\Phi) > 2K\) .
Now suppose that \(\spark(\Phi) > 2K\) .
Assume that for some \(y\) there exist two different K-sparse explanations \(x, x'\) such that \(y = \Phi x =\Phi x'\) .
Thus \(\Phi (x - x') = 0\) . Thus \(x - x ' \in \NullSpace (\Phi)\) and \(x - x' \in \Sigma_{2K}\) .
Thus \(\spark(\Phi) \leq 2K\) . A contradiction.
Since \(\spark(\Phi) \in [2, M+1]\) and we require that \(\spark(\Phi) > 2K\) hence we require that \(M \geq 2K\) .
Recovery of approximately sparse signals¶
Spark is a useful criteria for characterization of sensing matrices for truly sparse signals. But this doesn’t work well for approximately sparse signals. We need to have more restrictive criteria on \(\Phi\) for ensuring recovery of approximately sparse signals from compressed measurements.
In this context we will deal with two types of errors:
- [Approximation error] Let us approximate a signal \(x\) using only \(K\) coefficients. Let us call the approximation as \(\widehat{x}\) . Thus \(e_a = (x - \widehat{x})\) is approximation error.
- [Recovery error] Let \(\Phi\) be a sensing matrix. Let \(\Delta\) be a recovery algorithm. Then \(x'= \Delta(\Phi x)\) is the recovered signal vector. The error \(e_r = (x - x')\) is recovery error.
In this section we will
- Formalize the notion of null space property (NSP) of a matrix \(\Phi\) .
- Describe a measure for performance of an arbitrary recovery algorithm \(\Delta\) .
- Establish the connection between NSP and performance guarantee for recovery algorithms.
Suppose we approximate \(x\) by a \(K\)-sparse signal \(\widehat{x} \in \Sigma_K\); then the minimum error under the \(l_p\) norm is given by
\[\sigma_K(x)_p = \min_{\widehat{x} \in \Sigma_K} \| x - \widehat{x} \|_p.\]
Specific \(\widehat{x} \in \Sigma_K\) for which this minimum is achieved is the best \(K\) -term approximation.
In the following, we will need some new notation.
Let \(I = \{1,2,\dots, N\}\) be the set of indices for signal \(x \in \RR^N\) .
Let \(\Lambda \subset I\) be a subset of indices.
Let \(\Lambda^c = I \setminus \Lambda\) .
\(x_{\Lambda}\) will denote a signal vector obtained by setting the entries of \(x\) indexed by \(\Lambda^c\) to zero.
Let N = 4. Then \(I = \{1,2,3,4\}\) . Let \(\Lambda = \{1,3\}\) . Then \(\Lambda^c = \{2, 4\}\) .
Now let \(x = (-1,1,2,-4)\) . Then \(x_{\Lambda} = (-1, 0, 2, 0)\) .
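This restriction is easy to express in MATLAB (a sketch of the notation used in the example above):
x = [-1 1 2 -4]';
Lambda = [1 3];
x_Lambda = zeros(size(x));
x_Lambda(Lambda) = x(Lambda);   % gives (-1, 0, 2, 0)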
\(\Phi_{\Lambda}\) will denote a \(M\times N\) matrix obtained by setting the columns of \(\Phi\) indexed by \(\Lambda^c\) to zero.
Let N = 4. Then \(I = \{1,2,3,4\}\) . Let \(\Lambda = \{1,3\}\) . Then \(\Lambda^c = \{2, 4\}\) .
Now let \(x = (-1,1,2,-4)\) . Then \(x_{\Lambda} = (-1, 0, 2, 0)\) .
Now let
Then
A matrix \(\Phi\) satisfies the null space property (NSP) of order \(K\) if there exists a constant \(C > 0\) such that
\[\| h_{\Lambda} \|_2 \leq C \frac{\| h_{\Lambda^c} \|_1}{\sqrt{K}}\]
holds \(\forall h \in \NullSpace (\Phi)\) and \(\forall \Lambda\) such that \(|\Lambda| \leq K\).
- Let \(h\) be \(K\)-sparse and non-zero. By choosing the indices on which \(h\) is non-zero, we can construct a \(\Lambda\) such that \(|\Lambda| \leq K\) and \(h_{{\Lambda}^c} = 0\). Thus \(\| h_{{\Lambda}^c}\|_1 = 0\) while \(\| h_{\Lambda} \|_2 > 0\), so the above condition is not satisfied. Hence such a vector \(h\) cannot belong to \(\NullSpace(\Phi)\) if \(\Phi\) satisfies the NSP.
- Essentially vectors in \(\NullSpace (\Phi)\) shouldn’t be concentrated in a small subset of indices.
- If \(\Phi\) satisfies NSP then the only \(K\) -sparse vector in \(\NullSpace(\Phi)\) is \(h = 0\) .
Measuring the performance of a recovery algorithm¶
Let \(\Delta : \RR^M \rightarrow \RR^N\) represent a recovery method to recover approximately sparse \(x\) from \(y\) .
\(l_2\) recovery error is given by
The \(l_1\) error for \(K\) -term approximation is given by \(\sigma_K(x)_1\) .
We will be interested in guarantees of the form
\[\| \Delta(\Phi x) - x \|_2 \leq C \frac{\sigma_K(x)_1}{\sqrt{K}}.\]
Why this recovery guarantee formulation?
- Exact recovery of K-sparse signals. \(\sigma_K (x)_1 = 0\) if \(x \in \Sigma_K\) .
- Robust recovery of non-sparse signals
- Recovery dependent on how well the signals are approximated by \(K\) -sparse vectors.
- Such guarantees are known as instance optimal guarantees.
- Also known as uniform guarantees.
Why the specific choice of norms?
- Different choices of \(l_p\) norms lead to different guarantees.
- \(l_2\) norm on the LHS is a typical least squares error.
- \(l_2\) norm on the RHS would require a prohibitively large number of measurements.
- \(l_1\) norm on the RHS helps us keep the number of measurements less.
If an algorithm \(\Delta\) provides instance optimal guarantees as defined above, what kind of requirements does it place on the sensing matrix \(\Phi\) ?
We show that NSP of order \(2K\) is a necessary condition for providing uniform guarantees.
We are given that
- \((\Phi, \Delta)\) form an encoder-decoder pair.
- Together, they satisfy the instance optimal guarantee stated above.
- Thus they are able to recover all sparse signals exactly.
- For non-sparse signals, they are able to recover their \(K\) -sparse approximation with bounded recovery error.
We need to show that if \(h \in \NullSpace(\Phi)\), then \(h\) satisfies
where \(\Lambda\) corresponds to \(2K\) largest magnitude entries in \(h\) .
Note that we have used \(2K\) in this expression, since we need to show that \(\Phi\) satisfies NSP of order \(2K\) .
Let \(h \in \NullSpace(\Phi)\) .
Let \(\Lambda\) be the indices corresponding to the \(2K\) largest entries of h. Thus
Split \(\Lambda\) into \(\Lambda_0\) and \(\Lambda_1\) such that \(|\Lambda_0| = |\Lambda_1| = K\) . Now
Let
Let
Then
By assumption \(h \in \NullSpace(\Phi)\)
Thus
But since \(x' \in \Sigma_K\) (recall that \(\Lambda_1\) indexes only \(K\) entries) and \(\Delta\) is able to recover all \(K\) -sparse signals exactly, hence
Thus
i.e. the recovery algorithm \(\Delta\) recovers \(x'\) for the signal \(x\) . Certainly \(x'\) is not \(K\) -sparse.
Finally we also have (since \(h\) contains some additional non-zero entries)
But as per instance optimal recovery guarantee (1) for \((\Phi, \Delta)\) pair, we have
Thus
But
Recall that \(x =h_{\Lambda_0} + h_{\Lambda^c}\) where \(\Lambda_0\) indexes \(K\) entries of \(h\) which are (magnitude wise) larger than all entries indexed by \(\Lambda^c\) . Thus the best \(l_1\) -norm \(K\) term approximation of \(x\) is given by \(h_{\Lambda_0}\) .
Hence
Thus we finally have
Thus \(\Phi\) satisfies the NSP of order \(2K\) .
It turns out that NSP of order \(2K\) is also sufficient to establish a guarantee of the form above for a practical recovery algorithm
Recovery in presence of measurement noise¶
Measurement vector in the presence of noise is given by
where \(e\) is the measurement noise or error. \(\| e \|_2\) is the \(l_2\) size of measurement error.
Recovery error as usual is given by
Stability of a recovery algorithm is characterized by comparing variation of recovery error w.r.t. measurement error.
NSP is both necessary and sufficient for establishing guarantees of the form:
These guarantees do not account for presence of noise during measurement. We need stronger conditions for handling noise. The restricted isometry property for sensing matrices comes to our rescue.
Restricted isometry property¶
A matrix \(\Phi\) satisfies the restricted isometry property (RIP) of order \(K\) if there exists \(\delta_K \in (0,1)\) such that
\[(1 - \delta_K) \| x \|_2^2 \leq \| \Phi x \|_2^2 \leq (1 + \delta_K) \| x \|_2^2\]
holds for all \(x \in \Sigma_K = \{ x : \| x\|_0 \leq K \}\).
- If a matrix satisfies RIP of order \(K\) , then we can see that it approximately preserves the size of a \(K\)-sparse vector.
- If a matrix satisfies RIP of order \(2K\) , then we can see that it approximately preserves the distance between any two \(K\)-sparse vectors since difference vectors would be \(2K\) sparse.
- We say that the matrix is nearly orthonormal for sparse vectors.
- If a matrix satisfies RIP of order \(K\) with a constant \(\delta_K\) , it automatically satisfies RIP of any order \(K' < K\) with a constant \(\delta_{K'} \leq \delta_{K}\) .
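Computing \(\delta_K\) exactly is combinatorial, but for small sizes we can estimate it by sampling random supports and recording the extreme singular values of the corresponding column sub-matrices. The sketch below gives only a lower estimate of the true constant (it is not a library routine; the dimensions are arbitrary):
M = 32; N = 128; K = 4;
Phi = randn(M, N) / sqrt(M);
trials = 2000;
delta = 0;
for t = 1:trials
    % random support of size K
    Lambda = randperm(N, K);
    s = svd(Phi(:, Lambda));
    % the RIP constant must cover both extremes of ||Phi x||_2^2 / ||x||_2^2
    delta = max([delta, max(s)^2 - 1, 1 - min(s)^2]);
end
fprintf('Estimated (lower bound on) delta_%d: %.3f\n', K, delta);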
Stability¶
Informally, a recovery algorithm is stable if recovery error is small in the presence of small measurement error.
Is RIP necessary and sufficient for sparse signal recovery from noisy measurements?
Let us look at the necessary part. We will define a notion of stability of the recovery algorithm.
Let \(\Phi : \RR^N \rightarrow \RR^M\) be a sensing matrix and \(\Delta : \RR^M \rightarrow \RR^N\) be a recovery algorithm. We say that the pair \((\Phi, \Delta)\) is \(C\)-stable if for any \(x \in \Sigma_K\) and any \(e \in \RR^M\) we have
\[\| \Delta(\Phi x + e) - x \|_2 \leq C \| e \|_2.\]
- Error is added to the measurements.
- LHS is \(l_2\) norm of recovery error.
- RHS consists of scaling of the \(l_2\) norm of measurement error.
- The definition says that recovery error is bounded by a multiple of the measurement error.
- Thus, adding a small amount of measurement noise shouldn’t be causing arbitrarily large recovery error.
It turns out that \(C\)-stability requires \(\Phi\) to satisfy RIP.
If a pair \((\Phi, \Delta)\) is \(C\)-stable then
\[\frac{1}{C} \| x \|_2 \leq \| \Phi x \|_2\]
for all \(x \in \Sigma_{2K}\).
Any \(x \in \Sigma_{2K}\) can be written in the form of \(x = y - z\) where \(y, z \in \Sigma_K\) .
So let \(x \in \Sigma_{2K}\) . Split it in the form of \(x = y -z\) with \(y, z \in \Sigma_{K}\) .
Define
Thus
We have
Also, we have
Let
Since \((\Phi, \Delta)\) is \(C\)-stable, hence we have
also
Using the triangle inequality
Thus we have \(\forall x \in \Sigma_{2K}\)
This theorem gives us the lower bound for RIP property of order \(2K\) in (1) with \(\delta_{2K} = 1 - \frac{1}{C^2}\) as a necessary condition for \(C\)-stable recovery algorithms.
Note that smaller the constant \(C\) , lower is the bound on recovery error (w.r.t. measurement error). But as \(C \to 1\) , \(\delta_{2K} \to 0\), thus reducing the impact of measurement noise requires sensing matrix \(\Phi\) to be designed with tighter RIP constraints.
\(C\)-stability doesn’t require an upper bound on the RIP property in (1).
It turns out that If \(\Phi\) satisfies RIP, then this is also sufficient for a variety of algorithms to be able to successfully recover a sparse signal from noisy measurements. We will discuss this later.
Measurement bounds¶
As stated in previous section, for a \((\Phi, \Delta)\) pair to be \(C\)-stable we require that \(\Phi\) satisfies RIP of order \(2K\) with a constant \(\delta_{2K}\).
Let us ignore \(\delta_{2K}\) for the time being and look at relationship between \(M\) , \(N\) and \(K\).
We have a sensing matrix \(\Phi\) of size \(M\times N\) and expect it to provide RIP of order \(2K\) .
How many measurements \(M\) are necessary?
We will assume that \(K < N / 2\). This assumption is valid for approximately sparse signals.
Before we start figuring out the bounds, let us develop a special subset of \(\Sigma_K\) sets.
Consider the set
\[U = \{ x \in A^N : \| x \|_0 = K \}, \quad A = \{0, +1, -1\}.\]
Some explanation: By \(A^N\) we mean \(A \times A \times \dots \times A\) i.e. \(N\) times Cartesian product of \(A\) .
When we say \(\| x\|_0 = K\) , we mean that only \(K\) terms in each member of \(U\) can be non-zero (i.e. \(-1\) or \(+1\) ).
So \(U\) is a set of signal vectors \(x\) of length \(N\) where each sample takes values from \(\{0, +1, -1\}\) and number of allowed non-zero samples is fixed at \(K\) .
An example below explains it further.
For example, let \(N = 6\) and \(K = 2\). Each vector in \(U\) will have 6 elements out of which \(2\) can be non-zero. There are \(\binom{6}{2}\) ways of choosing the non-zero elements; for instance, \((1, -1, 0, 0, 0, 0)\), \((0, +1, 0, 0, 0, +1)\) and \((0, 0, -1, +1, 0, 0)\) all belong to \(U\).
\(U\) is a grid in the union of subspaces \(\Sigma_K\).
Revisiting
It’s now obvious that
Since there are \(\binom{N}{K}\) ways of choosing \(K\) non-zero elements and each non-zero element can take either of the two values \(+1\) or \(-1\), the cardinality of the set \(U\) is given by:
\[| U | = \binom{N}{K} 2^K.\]
By definition
\[\| x \|_2 = \sqrt{K} \quad \forall x \in U.\]
Further Let \(x, y \in U\) .
Then \(x - y\) will have a maximum of \(2K\) non-zero elements. The non-zero elements would have values \(\in \{-2,-1,1,2\}\) .
Thus \(\| x - y \|_0 = R \leq 2K\).
Further, \(\| x - y \|_2^2 \geq R\).
Hence
We now state a lemma which will help us in getting to the bounds.
Let \(K\) and \(N\) satisfying \(K < \frac{N}{2}\) be given. There exists a set \(X \subset \Sigma_K\) such that for any \(x \in X\) we have \(\| x \|_2 \leq \sqrt{K}\) and for any \(x, y \in X\) with \(x \neq y\),
\[\| x - y \|_2 \geq \sqrt{\frac{K}{2}},\]
and
\[\ln | X | \geq \frac{K}{2} \ln \left( \frac{N}{K} \right).\]
The lemma establishes the existence of a set in the union of subspaces \(\Sigma_K\) within a sphere of radius \(\sqrt{K}\) whose points are sufficiently apart and whose size is sufficiently large.
We just need to find one set \(X\) which satisfies the requirements of this lemma. We have to construct a set \(X\) such that
- \(\| x \|_2 \leq \sqrt{K} \quad \forall x \in X.\)
- \(\| x - y \|_2 \geq \sqrt{\frac{K}{2}} \quad \forall x, y \in X.\)
- \(\ln | X | \geq \frac{K}{2} \ln \left( \frac{N}{K} \right)\) or equivalently \(|X| \geq \left( \frac{N}{K} \right)^{\frac{K}{2}}\) .
We will construct \(X\) by picking vectors from \(U\) . Thus \(X \subset U\) .
Since \(x \in X \subset U\) hence \(\| x \|_2 = \sqrt{K} \leq \sqrt{K} \quad \forall x \in X\) .
Consider any fixed \(x \in U\) .
How many elements \(y\) are there in \(U\) such that \(\|x - y\|_2^2 < \frac{K}{2}\) ?
Define
Clearly by requirements in the lemma, if \(x \in X\) then \(U_x^2 \cap X = \phi\) . i.e. no vector in \(U_x^2\) belongs to \(X\) .
How many elements are there in \(U_x^2\) ?
Let us find an upper bound. \(\forall x, y \in U\) we have \(\|x - y\|_0 \leq \|x - y\|_2^2\) .
If \(x\) and \(y\) differ in \(\frac{K}{2}\) or more places, then naturally \(\|x - y\|_2^2 \geq \frac{K}{2}\) .
Hence if \(\|x - y\|_2^2 < \frac{K}{2}\) then \(\|x - y\|_0 < \frac{K}{2}\) hence \(\|x - y\|_0 \leq \frac{K}{2}\) for any \(x, y \in U_x^2\) .
So define
We have
Thus we have an upper bound given by
Let us look at \(U_x^0\) carefully.
We can choose \(\frac{K}{2}\) indices where \(x\) and \(y\) may differ in \(\binom{N}{\frac{K}{2}}\) ways.
At each of these \(\frac{K}{2}\) indices, \(y_i\) can take value as one of \((0, +1, -1)\) .
Thus We have an upper bound
We now describe an iterative process for building \(X\) from vectors in \(U\) .
Say we have added \(j\) vectors to \(X\) namely \(x_1, x_2,\dots, x_j\) .
Then
Number of vectors in \(U^2_{x_1} \cup U^2_{x_2} \cup \dots \cup U^2_{x_j}\) is bounded by \(j \binom {N}{ \frac{K}{2}} 3^{\frac{K}{2}}\) .
Thus we have at least
vectors left in \(U\) to choose from for adding in \(X\) .
We can keep adding vectors to \(X\) till there are no more suitable vectors left.
So we can construct a set of size \(|X|\) provided
Now
Note that \(\frac{N - K + i}{ K/ 2 + i}\) is a decreasing function of \(i\) .
Its minimum value is achieved for \(i=\frac{K}{2}\) as \((\frac{N}{K} - \frac{1}{2})\) .
So we have
Rephrasing (2) we have
So if
then (2) will be satisfied.
Now it is given that \(K < \frac{N}{2}\) . So we have:
Thus we have
Choose
Clearly, this value of \(|X|\) satisfies (2). Hence \(X\) can have at least these many elements. Thus
which completes the proof.
We can now establish following bound on the required number of measurements to satisfy RIP.
At this moment, we won’t worry about exact value of \(\delta_{2K}\) . We will just assume that \(\delta_{2K}\) is small in range \((0, \frac{1}{2}]\) .
Let \(\Phi\) be an \(M \times N\) matrix that satisfies RIP of order \(2K\) with constant \(\delta_{2K} \in (0, \frac{1}{2}]\). Then
\[M \geq C K \ln \left( \frac{N}{K} \right)\]
where \(C = \frac{1}{2 \ln (\sqrt{24} + 1)} \approx 0.28173\).
Since \(\Phi\) satisfies RIP of order \(2K\) we have
Also
Consider the set \(X \subset U \subset \Sigma_K\) developed in above.
We have
Also
since \(\| x\|_2 \leq \sqrt{K} \quad \forall x \in X\) .
So we have a lower bound:
and an upper bound:
What do these bounds mean? Let us start with the lower bound. \(\Phi x\) and \(\Phi y\) are projections of \(x\) and \(y\) in \(\RR^M\) (measurement space).
Construct \(l_2\) balls of radius \(\sqrt{\frac{K}{4}} / 2= \sqrt{\frac{K}{16}}\) in \(\RR^M\) around \(\Phi x\) and \(\Phi y\) .
Lower bound says that these balls are disjoint. Since \(x, y\) are arbitrary, this applies to every \(x \in X\).
Upper bound tells us that all vectors \(\Phi x\) lie in a ball of radius \(\sqrt {\frac{3K}{2}}\) around origin in \(\RR^M\) .
Thus, the set of all balls lies within a larger ball of radius \(\sqrt {\frac{3K}{2}} + \sqrt{\frac{K}{16}}\) around origin in \(\RR^M\) .
So we require that the volume of the larger ball MUST be greater than the sum of volumes of \(|X|\) individual balls.
Since volume of an \(l_2\) ball of radius \(r\) is proportional to \(r^M\) , we have:
Again from above we have
Putting back we get
which establishes a lower bound on the number of measurements \(M\) .
- \(N=1000, K=100 \implies M \geq 65\) .
- \(N=1000, K=200 \implies M \geq 91\) .
- \(N=1000, K=400 \implies M \geq 104\) .
Some remarks are in order:
- The theorem only establishes a necessary lower bound on \(M\) . It doesn’t mean that if we choose an \(M\) larger than the lower bound then \(\Phi\) will have RIP of order \(2K\) with any constant \(\delta_{2K} \in (0, \frac{1}{2}]\) .
- The restriction \(\delta_{2K} \leq \frac{1}{2}\) is arbitrary and is made for convenience. In general, we can work with \(0 < \delta_{2K} \leq \delta_{\text{max}} < 1\) and develop the bounds accordingly.
- This result fails to capture the dependence of \(M\) on the RIP constant \(\delta_{2K}\) directly. The Johnson-Lindenstrauss lemma, which concerns embeddings of finite sets of points in low-dimensional spaces, helps us resolve this.
- We haven’t made significant efforts to optimize the constants. Still they are quite reasonable.
The RIP and the NSP¶
RIP and NSP are connected. If a matrix \(\Phi\) satisfies RIP then it also satisfies NSP (under certain conditions).
Thus RIP is strictly stronger than NSP (under certain conditions).
We will need following lemma which applies to any arbitrary \(h \in \RR^N\) . The lemma will be proved later.
Suppose that \(\Phi\) satisfies RIP of order \(2K\), and let \(h \in \RR^N, h \neq 0\) be arbitrary. Let \(\Lambda_0\) be any subset of \(\{1,2,\dots, N\}\) such that \(|\Lambda_0| \leq K\).
Define \(\Lambda_1\) as the index set corresponding to the \(K\) entries of \(h_{\Lambda_0^c}\) with largest magnitude, and set \(\Lambda = \Lambda_0 \cup \Lambda_1\). Then
where
Let us understand this lemma a bit. If \(h \in \NullSpace (\Phi)\), then the lemma simplifies to
- \(\Lambda_0\) maps to the initial few ( \(K\) or less) elements we chose.
- \(\Lambda_0^c\) maps to all other elements.
- \(\Lambda_1\) maps to largest (in magnitude) \(K\) elements of \(\Lambda_0^c\) .
- \(h_{\Lambda}\) contains a maximum of \(2K\) non-zero elements.
- \(\Phi\) satisfies RIP of order \(2K\) .
- Thus \((1 - \delta_{2K}) \| h_{\Lambda} \|_2 \leq \| \Phi h_{\Lambda} \|_2 \leq (1 + \delta_{2K}) \| h_{\Lambda} \|_2\) .
We now state the connection between RIP and NSP.
Suppose that \(\Phi\) satisfies RIP of order \(2K\) with \(\delta_{2K} < \sqrt{2} - 1\) . Then \(\Phi\) satisfies the NSP of order \(2K\) with constant
\[C = \frac{\sqrt{2} \, \delta_{2K}}{1 - (1 + \sqrt{2}) \delta_{2K}}.\]
We are given
holds for all \(x \in \Sigma_{2K}\) where \(\delta_{2K} < \sqrt{2} - 1\).
We have to show that:
holds \(\forall h \in \NullSpace (\Phi)\) and \(\forall \Lambda\) such that \(|\Lambda| \leq 2K\).
Let \(h \in \NullSpace(\Phi)\) . Then \(\Phi h = 0\) .
Let \(\Lambda_m\) denote the \(2K\) largest entries of \(h\). Then
Similarly
Thus if we show that \(\Phi\) satisfies NSP of order \(2K\) for \(\Lambda_m\) , i.e.
then we would have shown it for all \(\Lambda\) such that \(|\Lambda| \leq 2K\) . So let \(\Lambda = \Lambda_m\) .
We can divide \(\Lambda\) into two components \(\Lambda_0\) and \(\Lambda_1\) of size \(K\) each.
Since \(\Lambda\) maps to the largest \(2K\) entries in \(h\) hence whatever entries we choose in \(\Lambda_0\) , the largest \(K\) entries in \(\Lambda_0^c\) will be \(\Lambda_1\) .
Hence, as per the lemma above, we have
Also
Thus we have
We have to get rid of \(\Lambda_1\) .
Since \(h_{\Lambda_1} \in \Sigma_K\) , we can apply the norm inequalities for \(K\)-sparse vectors (established earlier) to get
Hence
But since \(\Lambda_1 \subset \Lambda\) , hence \(\| h_{\Lambda_1} \|_2 \leq \| h_{\Lambda} \|_2\) , hence
Note that the inequality is also satisfied for \(\alpha = 1\), in which case we don't need to bring \(1-\alpha\) to the denominator.
Now
Putting
we see that \(\Phi\) satisfies NSP of order \(2K\) whenever \(\Phi\) satisfies RIP of order \(2K\) with \(\delta_{2K} < \sqrt{2} -1\) .
Note that for \(\delta_{2K} = \sqrt{2} - 1\) , \(C=\infty\) .
Matrices satisfying RIP¶
The natural question at this moment is how to construct matrices which satisfy RIP.
There are two different approaches
- Deterministic approach
- Randomized approach
Known deterministic approaches so far tend to require \(M\) to be very large ( \(O(K^2 \ln N)\) or \(O(K N^{\alpha})\) ).
We can overcome this limitation by randomizing matrix construction.
Construction process:
- Input \(M\) and \(N\) .
- Generate \(\Phi\) by choosing \(\Phi_{ij}\) as independent realizations from some probability distribution.
Suppose that \(\Phi\) is drawn from a normal (Gaussian) distribution.
It can be shown that the rank of \(\Phi\) is \(M\) with probability 1.
We can verify this fact by doing a small computer simulation.
M = 6;
N = 20;
trials = 10000;
n_full_rank = 0;
for i=1:trials
    % Create a random Gaussian matrix of size M x N
    A = randn(M, N);
    % Obtain its rank
    R = rank(A);
    % Check whether the rank equals M or not
    if R == M
        n_full_rank = n_full_rank + 1;
    end
end
fprintf('Number of trials: %d\n',trials);
fprintf('Number of full rank matrices: %d\n',n_full_rank);
percentage = n_full_rank*100/trials;
fprintf('Percentage of full rank matrices: %.2f %%\n', percentage);
The program above generates a number of random matrices and measures their ranks, verifying whether each is full rank.
Here is a sample output:
Number of trials: 10000
Number of full rank matrices: 10000
Percentage of full rank matrices: 100.00 %
Thus, if we choose \(M=2K\) , then with probability 1 any subset of \(2K\) columns will be linearly independent. The matrix will satisfy RIP with some \(\delta_{2K} > 0\).
But this construction doesn't tell us the exact value of \(\delta_{2K}\) .
In order to find out \(\delta_{2K}\), we must consider all possible \(K\)-dimensional subspaces formed by columns of \(\Phi\).
This is computationally impossible for reasonably large \(N\) and \(K\).
What is the alternative?
We can start with a chosen value of \(\delta_{2K}\) and try to construct a matrix which matches it.
Before we proceed further, we should take a detour and review sub-Gaussian distributions (see the section on subgaussian distributions later in this document).
We now state the main theorem of this section.
Suppose that \(X = [X_1, X_2, \dots, X_M]\) where each \(X_i\) is i.i.d. with \(X_i \sim \Sub (c^2)\) and \(\EE (X_i^2) = \sigma^2\) . Then
Moreover, for any \(\alpha \in (0,1)\) and for any \(\beta \in [c^2/\sigma^2, \beta_{\text{max}}]\), there exists a constant \(\kappa^* \geq 4\) depending only on \(\beta_{\text{max}}\) and the ratio \(\sigma^2/c^2\) such that
and
The theorem states that the squared length of the random vector \(X\) is concentrated around its mean value. If we choose \(\sigma\) such that \(M \sigma^2 = 1\), then we have \(\alpha \leq \| X \|_2^2 \leq \beta\) with very high probability.
Conditions on random distribution for RIP¶
Let us get back to our business of constructing a matrix \(\Phi\) using random distributions which satisfies RIP with a given \(\delta\) .
We will impose some conditions on the random distribution.
- We require that the distribution will yield a matrix that is norm-preserving. This requires that
(1)¶\[ \EE (\Phi_{ij}^2) = \frac{1}{M}\]
Hence, the variance of the distribution should be \(\frac{1}{M}\).
- We require that the distribution is sub-Gaussian, i.e. there exists a constant \(c > 0\) such that
(2)¶\[ \EE(\exp(\Phi_{ij} t)) \leq \exp \left (\frac{c^2 t^2}{2} \right )\]
This says that the moment generating function of the distribution is dominated by the moment generating function of a Gaussian distribution.
In other words, tails of the distribution decay at least as fast as the tails of a Gaussian distribution.
We will further assume that entries of \(\Phi\) are strictly sub-Gaussian. i.e. they must satisfy (2) with
Under these conditions we have the following result.
Suppose that \(\Phi\) is an \(M\times N\) matrix whose entries \(\Phi_{ij}\) are i.i.d. with \(\Phi_{ij}\) drawn according to a strictly sub-Gaussian distribution with \(c^2 = \frac{1}{M}\).
Let \(Y = \Phi x\) for \(x \in \RR^N\). Then for any \(\epsilon > 0\) and any \(x \in \RR^N\) ,
and
where \(\kappa^* = \frac{2}{1 - \ln(2)} \approx 6.5178\) .
This means that the norm of a sub-Gaussian random vector strongly concentrates about its mean.
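As an illustration, here is a minimal MATLAB sketch (not part of the library) of two constructions meeting these conditions: Gaussian entries with variance \(\frac{1}{M}\) and Rademacher entries taking values \(\pm \frac{1}{\sqrt{M}}\). Both are strictly sub-Gaussian with \(c^2 = \sigma^2 = \frac{1}{M}\):
M = 100; N = 1000;
% Gaussian entries drawn from N(0, 1/M)
Phi_gauss = randn(M, N) / sqrt(M);
% Rademacher entries +-1/sqrt(M) with equal probability
Phi_rad = (2 * (rand(M, N) > 0.5) - 1) / sqrt(M);
% Empirical check of the variance condition E(Phi_ij^2) = 1/M
fprintf('Target variance: %.4f\n', 1/M);
fprintf('Gaussian entry variance: %.4f\n', var(Phi_gauss(:)));
fprintf('Rademacher entry variance: %.4f\n', var(Phi_rad(:)));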
Sub Gaussian random matrices satisfy the RIP¶
Using this result we now state that sub-Gaussian matrices satisfy the RIP.
Fix \(\delta \in (0,1)\) . Let \(\Phi\) be an \(M\times N\) random matrix whose entries \(\Phi_{ij}\) are i.i.d. with \(\Phi_{ij}\) drawn according to a strictly sub-Gaussian distribution with \(c^2 = \frac{1}{M}\) . If
then \(\Phi\) satisfies the RIP of order \(K\) with the prescribed \(\delta\) with probability exceeding \(1 - 2e^{-\kappa_2 M}\) , where \(\kappa_1\) is arbitrary and
We note that this theorem achieves an \(M\) of the same order as the lower bound derived earlier, up to a constant.
This is much better than deterministic approaches.
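To get a rough feel for the difference in order (ignoring the constants, which differ between the constructions), we can compare \(K^2 \ln N\) against \(K \ln (N/K)\) for some representative values; a back-of-the-envelope sketch:
N = 10^6; K = 100;
% Deterministic constructions: M on the order of K^2 * ln(N)
fprintf('K^2 * ln(N) = %.0f\n', K^2 * log(N));
% Randomized constructions: M on the order of K * ln(N/K)
fprintf('K * ln(N/K) = %.0f\n', K * log(N / K));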
Advantages of random construction¶
There are a number of advantages of the random sensing matrix construction approach:
- One can show that for random construction, the measurements are democratic. This means that all measurements are equal in importance and it is possible to recover the signal from any sufficiently large subset of the measurements. Thus by using random \(\Phi\) one can be robust to the loss or corruption of a small fraction of measurements.
- In general we are more interested in \(x\) which is sparse in some basis \(\Psi\) . In this setting, we require that \(\Phi \Psi\) satisfy the RIP. A deterministic construction would explicitly require taking \(\Psi\) into account. But if \(\Phi\) is random, we can avoid this issue. If \(\Phi\) is Gaussian and \(\Psi\) is an orthonormal basis, then one can easily show that \(\Phi \Psi\) will also have a Gaussian distribution. Thus if \(M\) is sufficiently large, \(\Phi \Psi\) will also satisfy the RIP with very high probability.
Similar results hold for other sub-Gaussian distributions as well.
Subgaussian distributions¶
In this section we review subgaussian distributions and matrices drawn from subgaussian distributions.
Examples of subgaussian distributions include
- Gaussian distribution
- Rademacher distribution taking values \(\pm \frac{1}{\sqrt{M}}\)
- Any zero mean distribution with a bounded support
A random variable \(X\) is called subgaussian if there exists a constant \(c > 0\) such that
holds for all \(t \in \RR\). We use the notation \(X \sim \Sub (c^2)\) to denote that \(X\) satisfies the constraint (1). We also say that \(X\) is \(c\)-subgaussian.
\(\EE [\exp(X t) ]\) is moment generating function of \(X\).
\(\exp \left (\frac{c^2 t^2}{2} \right )\) is moment generating function of a Gaussian random variable with variance \(c^2\).
The definition means that for a subgaussian variable \(X\), its M.G.F. is bounded by the M.G.F. of a Gaussian random variable \(\sim \mathcal{N}(0, c^2)\).
Consider zero-mean Gaussian random variable \(X \sim \mathcal{N}(0, \sigma^2)\) with variance \(\sigma^2\). Then
Putting \(c = \sigma\) we see that (1) is satisfied. Hence \(X \sim \Sub(\sigma^2)\) ; i.e., \(X\) is \(\sigma\)-subgaussian.
Consider \(X\) with
i.e. \(X\) takes a value \(1\) with probability \(0.5\) and value \(-1\) with probability \(0.5\).
Then
Thus \(X \sim \Sub(1)\) or \(X\) is 1-subgaussian.
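The moment generating function here is \(\cosh(t)\), and the claim rests on the inequality \(\cosh(t) \leq \exp(t^2/2)\). A quick numeric check (not a proof):
t = linspace(-5, 5, 1001);
% cosh(t) never exceeds exp(t^2/2); the maximum difference is 0 (attained at t = 0)
max(cosh(t) - exp(t.^2 / 2))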
Consider \(X\) as uniformly distributed over the interval \([-a, a]\) for some \(a > 0\). i.e.
Then
But \((2n+1)! \geq n! 2^n\). Hence we have
Thus
Hence \(X \sim \Sub(a^2)\) or \(X\) is \(a\)-subgaussian.
Consider \(X\) as a zero mean, bounded random variable i.e.
for some \(B \in \RR^+\) and
Then, the following upper bound holds:
This result can be proven with some advanced calculus. \(X \sim \Sub(B^2)\) or \(X\) is \(B\)-subgaussian.
There are some useful properties of subgaussian random variables.
If \(X \sim \Sub(c^2)\) then
and
Thus subgaussian random variables are always zero-mean.
Their variance is always bounded by the variance of the bounding Gaussian distribution.
But since \(X \sim \Sub(c^2)\) hence
Restating
Dividing throughout by \(t > 0\) and letting \(t \to 0\) we get \(\EE (X) \leq 0\).
Dividing throughout by \(t < 0\) and letting \(t \to 0\) we get \(\EE (X) \geq 0\).
Thus \(\EE (X) = 0\). So \(\Var(X) = \EE (X^2)\).
Now we are left with
Dividing throughout by \(t^2\) and letting \(t \to 0\) we get \(\Var(X) \leq c^2\)
Subgaussian variables have a linear structure.
If \(X \sim \Sub(c^2)\) i.e. \(X\) is \(c\)-subgaussian, then for any \(\alpha \in \RR\), the r.v. \(\alpha X\) is \(|\alpha| c\)-subgaussian.
If \(X_1, X_2\) are r.v.s such that \(X_i\) is \(c_i\)-subgaussian, then \(X_1 + X_2\) is \((c_1 + c_2)\)-subgaussian.
Let \(X\) be \(c\)-subgaussian. Then
Now for \(\alpha \neq 0\), we have
Hence \(\alpha X\) is \(|\alpha| c\)-subgaussian.
Now consider \(X_1\) as \(c_1\)-subgaussian and \(X_2\) as \(c_2\)-subgaussian.
Let \(p, q >1\) be two numbers s.t. \(\frac{1}{p} + \frac{1}{q} = 1\).
Using Hölder's inequality, we have
Since this is valid for any \(p > 1\), we can minimize the r.h.s. over \(p > 1\). It suffices to minimize the term
We have
Equating it to 0 gives us
Taking second derivative, we can verify that this is indeed a minimum value.
Thus
Hence we have the result
Thus \(X_1 + X_2\) is \((c_1 + c_2)\)-subgaussian.
If \(X_1\) and \(X_2\) are independent, then \(X_1 + X_2\) is \(\sqrt{c_1^2 + c_2^2}\)-subgaussian.
If \(X\) is \(c\)-subgaussian then naturally, \(X\) is \(d\)-subgaussian for any \(d \geq c\). A question arises as to what is the minimum value of \(c\) such that \(X\) is \(c\)-subgaussian.
For a centered random variable \(X\), the subgaussian moment of \(X\), denoted by \(\sigma(X)\), is defined as
\(X\) is subgaussian if and only if \(\sigma(X)\) is finite.
We can also show that \(\sigma(\cdot)\) is a norm on the space of subgaussian random variables. And this normed space is complete.
For centered Gaussian r.v. \(X \sim \mathcal{N}(0, \sigma^2)\), the subgaussian moment coincides with the standard deviation. \(\sigma(X) = \sigma\).
Sometimes it is useful to consider a more restrictive class of subgaussian random variables.
A random variable \(X\) is called strictly subgaussian if \(X \sim \Sub(\sigma^2)\) where \(\sigma^2 = \EE(X^2)\), i.e. the inequality
holds true for all \(t \in \RR\).
We will denote strictly subgaussian variables by \(X \sim \SSub (\sigma^2)\).
Characterization of subgaussian random variables¶
We quickly review Markov’s inequality which will help us establish the results in this section.
Let \(X\) be a non-negative random variable. And let \(t > 0\). Then
For a centered random variable \(X\), the following statements are equivalent:
- moment generating function condition:
- subgaussian tail estimate: There exists \(a > 0\) such that
- \(\psi_2\)-condition: There exists some \(b > 0\) such that
\((1) \implies (2)\) Using Markov’s inequality, for any \(t > 0\) we have
Since this is valid for all \(t > 0\), it is in particular valid for the value of \(t\) minimizing the r.h.s.
The minimum value is obtained for \(t = \frac{\lambda}{c^2}\).
Thus we get
Since \(X\) is \(c\)-subgaussian, hence \(-X\) is also \(c\)-subgaussian.
Hence
Thus
Thus we can choose \(a = \frac{1}{2 c^2}\) to complete the proof.
\((2)\implies (3)\)
TODO PROVE THIS
\((3)\implies (1)\)
TODO PROVE THIS
More properties¶
We also have the following result on the exponential moment of a subgaussian random variable.
Suppose \(X \sim \Sub(c^2)\). Then
for any \(\lambda \in [0,1)\).
We are given that
Multiplying on both sides with \(\exp \left ( -\frac{c^2 t^2}{2 \lambda} \right )\) :
Integrating on both sides w.r.t. \(t\) we get:
which reduces to:
which completes the proof.
Subgaussian random vectors¶
The linearity property of subgaussian r.v.s can be extended to random vectors also. This is stated more formally in the following result.
Norm of a subgaussian random vector
Let \(X\) be a random vector where each \(X_i\) is i.i.d. with \(X_i \sim \Sub (c^2)\).
Consider the \(l_2\) norm \(\| X \|_2\). It is a random variable in its own right.
It would be useful to understand the average behavior of the norm.
Suppose \(N=1\). Then \(\| X \|_2 = |X_1|\).
Also \(\| X \|^2_2 = X_1^2\). Thus \(\EE (\| X \|^2_2) = \sigma^2\).
- It looks like \(\EE (\| X \|^2_2)\) should be connected with \(\sigma^2\).
- Norm can increase or decrease compared to the average value.
- A ratio based measure between actual value and average value would be useful.
- What is the probability that the norm increases beyond a given factor?
- What is the probability that the norm reduces beyond a given factor?
These bounds are stated formally in the following theorem.
Suppose that \(X = [X_1, X_2,\dots, X_N]\), where each \(X_i\) is i.i.d. with \(X_i \sim \Sub(c^2)\).
Then
Moreover, for any \(\alpha \in (0,1)\) and for any \(\beta \in [\frac{c^2}{\sigma^2}, \beta_{\max}]\), there exists a constant \(\kappa^* \geq 4\) depending only on \(\beta_{\max}\) and the ratio \(\frac{\sigma^2}{c^2}\) such that
and
- First equation gives the average value of the square of the norm.
- Second inequality states the upper bound on the probability that norm could reduce beyond a factor given by \(\alpha < 1\).
- Third inequality states the upper bound on the probability that norm could increase beyond a factor given by \(\beta > 1\).
- Note that if \(X_i\) are strictly subgaussian, then \(c=\sigma\). Hence \(\beta \in (1, \beta_{\max})\).
Since \(X_i\) are independent hence
This proves the first part. That was easy enough.
Now let us look at the bound on the probability of the norm expanding beyond the factor \(\beta\) (the third inequality above).
By applying Markov’s inequality for any \(\lambda > 0\) we have:
Since \(X_i\) is \(c\)-subgaussian, from the bound on the exponential moment of a subgaussian random variable (established above) we have
Thus:
Putting it back we get:
Since above is valid for all \(\lambda > 0\), we can minimize the R.H.S. over \(\lambda\) by setting the derivative w.r.t. \(\lambda\) to \(0\).
Thus we get optimum \(\lambda\) as:
Plugging this back we get:
Proceeding similarly for the bound on the probability of the norm reducing below the factor \(\alpha\) (the second inequality above), we get
We need to simplify these equations. We will now do some algebraic manipulation.
Consider the function
By differentiating twice, we can show that this is a strictly increasing function.
Let us have \(\gamma \in (0, \gamma_{\max}]\).
Define
Clearly
This gives us:
Hence by exponentiating on both sides we get:
By slight manipulation we get:
We now choose
Substituting we get:
Finally
Thus we get
Similarly, choosing \(\gamma = \beta \frac{\sigma^2}{c^2}\) proves the other bound.
We can now map \(\gamma_{\max}\) to some \(\beta_{\max}\) by:
This result tells us that given a vector with entries drawn from a subgaussian distribution, we can expect the squared norm of the vector to concentrate around its expected value \(N\sigma^2\).
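As a quick illustrative check (not part of the library), we can estimate this concentration empirically for i.i.d. Gaussian entries, which are subgaussian with \(c = \sigma\):
N = 1000; trials = 10000; sigma = 1;
% Squared norms of 'trials' independent vectors with i.i.d. N(0, sigma^2) entries
norms_sq = sum((sigma * randn(N, trials)).^2, 1);
fprintf('Mean of ||X||_2^2: %.1f (expected %.1f)\n', mean(norms_sq), N * sigma^2);
% Fraction of vectors whose squared norm lies within 10% of N * sigma^2
fprintf('Fraction within 10%% of the mean: %.4f\n', ...
    mean(abs(norms_sq - N * sigma^2) < 0.1 * N * sigma^2));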
Rademacher sensing matrices¶
In this section we collect several results related to Rademacher sensing matrices.
A Rademacher sensing matrix \(\Phi \in \RR^{M \times N}\) with \(M < N\) is constructed by drawing each entry \(\phi_{ij}\) independently from a Rademacher distribution given by
Thus \(\phi_{ij}\) takes a value \(\pm \frac{1}{\sqrt{M}}\) with equal probability.
We can remove the scale factor \(\frac{1}{\sqrt{M}}\) out of the matrix \(\Phi\) writing
With that we can draw individual entries of \(\Chi\) from a simpler Rademacher distribution given by
Thus entries in \(\Chi\) take values of \(\pm 1\) with equal probability.
This construction is useful since it allows us to implement the multiplication with \(\Phi\) in terms of just additions and subtractions. The scaling can be implemented towards the end in the signal processing chain.
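A minimal sketch of this idea (not the library's implementation): generate \(\Chi\) with \(\pm 1\) entries, apply it using only additions and subtractions, and defer the scaling by \(\frac{1}{\sqrt{M}}\) to the end:
M = 100; N = 1000;
% Chi has +-1 entries with equal probability
Chi = 2 * (rand(M, N) > 0.5) - 1;
x = randn(N, 1);
% Multiplication with Phi: additions/subtractions first, scaling at the end
y = (Chi * x) / sqrt(M);
% Verify against the direct computation with Phi = Chi / sqrt(M)
Phi = Chi / sqrt(M);
fprintf('Difference: %g\n', norm(y - Phi * x));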
We note that
Actually we have a better result with
We can write
where \(\phi_j \in \RR^M\) is a Rademacher random vector with independent entries.
We note that
Actually in this case we also have
Thus the squared length of each of the columns in \(\Phi\) is \(1\) .
Let \(z \in \RR^M\) be a Rademacher random vector with i.i.d entries \(z_i\) that take a value \(\pm \frac{1}{\sqrt{M}}\) with equal probability. Let \(u \in \RR^M\) be an arbitrary unit norm vector. Then
Representative values of this bound are plotted below.

Tail bound for the probability of inner product of a Rademacher random vector with a unit norm vector
A particular application of this lemma is when \(u\) itself is another (independently chosen) unit norm Rademacher random vector.
The lemma establishes that the probability of the inner product of two independent unit norm Rademacher random vectors being large is very small. In other words, independently chosen unit norm Rademacher random vectors are incoherent with high probability. This is a very useful result, as we will see later when measuring the coherence of Rademacher sensing matrices.
Joint correlation¶
Columns of \(\Phi\) satisfy a joint correlation property ([TG07]) which is described in following lemma.
Let \(\{u_k\}\) be a sequence of \(K\) vectors (where \(u_k \in \RR^M\) ) whose \(l_2\) norms do not exceed one. Independently choose \(z \in \RR^M\) to be a random vector with i.i.d. entries \(z_i\) that take a value \(\pm \frac{1}{\sqrt{M}}\) with equal probability. Then
Let us call \(\gamma = \max_{k} | \langle z, u_k\rangle |\) .
We note that if for any \(u_k\) , \(\| u_k \|_2 <1\) and we increase the length of \(u_k\) by scaling it, then \(\gamma\) will not decrease and hence \(\PP(\gamma \leq \epsilon)\) will not increase. Thus if we prove the bound for vectors \(u_k\) with \(\| u_k\|_2 = 1 \Forall 1 \leq k \leq K\) , it will be applicable for all \(u_k\) whose \(l_2\) norms do not exceed one. Hence we will assume that \(\| u_k \|_2 = 1\) .
From previous lemma we have
Now the event
i.e. if any of the inner products (absolute value) is greater than \(\epsilon\) then the maximum is greater.
We recall Boole’s inequality which states that
Thus
This gives us
Coherence of Rademacher sensing matrix¶
We show that the coherence of a Rademacher sensing matrix is fairly small with high probability (adapted from [TG07]).
Fix \(\delta \in (0,1)\) . For an \(M \times N\) Rademacher sensing matrix \(\Phi\) as defined above, the coherence statistic
with probability exceeding \(1 - \delta\) .

Coherence bounds for Rademacher sensing matrices
We recall the definition of coherence as
Since \(\Phi\) is a Rademacher sensing matrix, each of its columns has unit norm. Consider some \(1 \leq j < k \leq N\) identifying columns \(\phi_j\) and \(\phi_k\) . We note that they are independent of each other. Thus from above we have
Now there are \(\frac{N(N-1)}{2}\) such pairs of \((j, k)\) . Hence by applying Boole’s inequality
Thus, we have
What we need to do now is to choose a suitable value of \(\epsilon\) so that the R.H.S. of this inequality is simplified.
We choose
This gives us
Putting back we get
This justifies why we need \(\delta \in (0,1)\) .
Finally
and
which completes the proof.
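As a sanity check on this result, we can estimate the coherence of a randomly drawn Rademacher sensing matrix empirically; a small sketch (not library code):
M = 256; N = 1024;
Phi = (2 * (rand(M, N) > 0.5) - 1) / sqrt(M);
% Gram matrix; off-diagonal entries are inner products of distinct unit norm columns
G = Phi' * Phi;
G(1:N+1:end) = 0;      % zero out the unit diagonal
mu = max(abs(G(:)));   % empirical coherence
fprintf('Empirical coherence: %.4f\n', mu);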
Gaussian sensing matrices¶
In this section we collect several results related to Gaussian sensing matrices.
We note that
We can write
where \(\phi_j \in \RR^M\) is a Gaussian random vector with independent entries.
We note that
Thus the expected value of squared length of each of the columns in \(\Phi\) is \(1\) .
Joint correlation¶
Columns of \(\Phi\) satisfy a joint correlation property ([TG07]) which is described in following lemma.
Let \(\{u_k\}\) be a sequence of \(K\) vectors (where \(u_k \in \RR^M\) ) whose \(l_2\) norms do not exceed one. Independently choose \(z \in \RR^M\) to be a random vector with i.i.d. \(\Gaussian(0, \frac{1}{M})\) entries. Then
Let us call \(\gamma = \max_{k} | \langle z, u_k\rangle |\) .
We note that if for any \(u_k\) , \(\| u_k \|_2 <1\) and we increase the length of \(u_k\) by scaling it, then \(\gamma\) will not decrease and hence \(\PP(\gamma \leq \epsilon)\) will not increase. Thus if we prove the bound for vectors \(u_k\) with \(\| u_k\|_2 = 1 \Forall 1 \leq k \leq K\) , it will be applicable for all \(u_k\) whose \(l_2\) norms do not exceed one. Hence we will assume that \(\| u_k \|_2 = 1\) .
Now consider \(\langle z, u_k \rangle\) . Since \(z\) is a Gaussian random vector, hence \(\langle z, u_k \rangle\) is a Gaussian random variable. Since \(\| u_k \| =1\) hence
We recall a well known tail bound for Gaussian random variables which states that
Now the event
i.e. if any of the inner products (absolute value) is greater than \(\epsilon\) then the maximum is greater.
We recall Boole’s inequality which states that
Thus
This gives us
Hands on with Gaussian sensing matrices¶
We will show several examples of working with Gaussian sensing matrices through the sparse-plex library.
Let’s specify the size of representation space:
N = 1000;
Let’s specify the number of measurements:
M = 100;
Let’s construct the sensing matrix:
Phi = spx.dict.simple.gaussian_mtx(M, N, false);
By default, the function gaussian_mtx constructs a matrix with normalized columns. When we set the third argument to false as above, it constructs a matrix with unnormalized columns.
We can visualize the matrix easily:
imagesc(Phi);
colorbar;

Let’s compute the norms of each of the columns:
column_norms = spx.norm.norms_l2_cw(Phi);
Let’s look at the mean value:
>> mean(column_norms)
ans =
0.9942
We can see that the mean value is very close to unity as expected.
Let’s compute the standard deviation:
>> std(column_norms)
ans =
0.0726
As expected, the column norms are concentrated around their mean.
We can examine the variation in norm values by looking at the quantile values:
>> quantile(column_norms, [0.1, 0.25, 0.5, 0.75, 0.9])
ans =
0.8995 0.9477 0.9952 1.0427 1.0871
The histogram of column norms can help us visualize it better:
hist(column_norms);

The singular values of the matrix help us get a deeper understanding of how well-behaved the matrix is:
singular_values = svd(Phi);
figure;
plot(singular_values);
ylim([0, 5]);
grid;

As we can see, singular values decrease quite slowly.
The condition number captures the variation in singular values:
>> max(singular_values)
ans =
4.1177
>> min(singular_values)
ans =
2.2293
>> cond(Phi)
ans =
1.8471
The source code can be downloaded here.
Examples¶
In this section we will look at several examples which can be modeled using sparse and redundant representations and measured using compressed sensing techniques.
Several examples in this section have been incorporated from Sparco [BFH+07] (a testing framework for sparse reconstruction).
Piecewise cubic polynomial signal¶
This example was discussed in [CR04]. Our signal of interest is a piecewise cubic polynomial signal as shown here.

A piecewise cubic polynomial signal
It has a sparse representation in a wavelet basis.

Sparse representation of signal in wavelet basis
We can sort the wavelet coefficients by magnitude and plot them in descending order to visualize how sparse the representation is.

Wavelet coefficients sorted by magnitude
The chosen basis is a Daubechies wavelet basis \(\Psi\).

Daubechies-8 wavelet basis
A Gaussian random sensing matrix \(\Phi\) is used to generate the measurement vector \(y\)

Gaussian sensing matrix \(\Phi\)
The measurements are shown here:

Measurement vector \(y = \Phi x + e\)
Finally the product of \(\Phi\) and \(\Psi\) given by \(\Phi \Psi\) will be used for actual recovery of sparse representation.

Recovery matrix \(\Phi \Psi\)
The fundamental equations are \(x = \Psi \alpha\) and \(y = \Phi x + e\),
with \(x \in \RR^N\). In this example \(N = 2048\). \(\Psi\) is a complete dictionary of size \(N \times N\). Thus we have \(D = N\) and \(\alpha \in \RR^N\). \(\Phi \in \RR^{M \times N}\). In this example, the number of measurements \(M=600\). The measurement vector \(y \in \RR^M\). For this problem we chose \(e = 0\).
Sparse signal recovery problem is denoted as
where \(\widehat{\alpha}\) is a \(K\)-sparse approximation of \(\alpha\).
Closely examining the coefficients in \(\alpha\) we can note that \(\max(\alpha_i) = 78.0546\). Further if we put different thresholds over magnitudes of entries in \(\alpha\) we can find the number of coefficients higher than the threshold as listed in the table below. A choice of \(M = 600\) looks quite reasonable given the decay of entries in \(\alpha\).
Threshold | Entries higher than threshold |
---|---|
1 | 129 |
1E-1 | 173 |
1E-2 | 186 |
1E-4 | 197 |
1E-8 | 199 |
1E-12 | 200 |
Data Analysis¶
Principal Component Analysis¶
Principal component analysis (PCA) [Jol02] is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors are an uncorrelated orthogonal basis set. If a multivariate dataset is visualized as a set of coordinates in a high-dimensional data space (1 axis per variable), PCA can supply the user with a lower-dimensional picture, a projection of this object when viewed from its most informative viewpoint. PCA can be thought of as fitting an n-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component.
Consider a data-matrix \(X \in \RR^{n \times p}\) \((n \geq p)\) with each column representing one feature (or random variable) and each row representing one feature vector (or observation vector). Assume that \(X\) has column wise zero sample mean. The principal components decomposition of \(X\) is given by \(T = X V\) where \(V\) is a \(p \times p\) matrix whose columns are eigen vectors of \(X^T X\). If each row of \(X\) ( resp. T) is given by a (column) vector \(x\) (resp. t), then they are related by \(t = V^T x\) or \(x = V t\). Each principal component \(t_i\) is obtained by taking the inner product of an eigen vector \(v^i\) in \(V\) with \(x\). \(T\) can be obtained straight-away from the SVD of \(X = U \Sigma V^T\) giving \(T = X V = U \Sigma\). Note that \(T^T T = \Sigma^T \Sigma\) implying that the columns of \(T\) are orthogonal to each other. In other words, the features (or random variables) corresponding to each column of \(T\) are uncorrelated. Recall that \(T^T T\) is proportional to the empirical covariance matrix of \(T\) and \(\sigma_1 \geq \dots \geq \sigma_p\) shows how variance of individual columns in \(T\) decreases. The form \(T = U \Sigma\) is also known as the polar decomposition of \(T\).
The dimensionality reduction of data-set in \(X\) is obtained by keeping just the first \(k\) columns of \(T\).
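A minimal sketch of this decomposition using MATLAB's svd on synthetic data (the variable names here are illustrative and not part of the library's API):
n = 500; p = 10; k = 3;
% Synthetic correlated data: rows are observations, columns are variables
X = randn(n, p) * randn(p, p);
% Remove the column-wise mean
X = X - repmat(mean(X), n, 1);
% X = U * Sigma * V'; the columns of V are eigen vectors of X' * X
[U, Sigma, V] = svd(X, 'econ');
% Principal components (scores): T = X * V = U * Sigma
T = U * Sigma;
% Dimensionality reduction: keep only the first k columns of T
T_k = T(:, 1:k);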
Data Clustering¶
Data Clustering Introduction¶
In this section, we summarize some of the traditional and general purpose data clustering algorithms. These algorithms get used as building blocks for various subspace clustering algorithms. The objective of data clustering is to group the data points into clusters such that points within each cluster are more related to each other than points across different clusters. The relationship can be measured in various ways: distance between points, similarity of points, etc. In distance based clustering, we group the points into \(K\) clusters such that the distance among points in the same group is significantly smaller than the distance between points in different clusters. In similarity based clustering, the points within the same cluster are more similar to each other than points from different clusters. A graph based clustering will treat each point as a node on a graph [with appropriate edges] and split the graph into connected components. Compare this with subspace clustering which assumes that points in each cluster are sampled from one subspace [even though they may be far apart within the subspace].
The simplest distance measure is the standard Euclidean distance. But it is susceptible to the choice of basis. This can be improved by adopting a statistical model for the data in each cluster. We assume that the data in the \(k\)-th cluster is sampled from a probability distribution with mean \(\mu_k\) and covariance \(\Sigma_k\). An appropriate distance measure from the mean of a distribution which is invariant to the choice of basis is the Mahalanobis distance:
For Gaussian distributions, this is proportional to the negative of the log-likelihood of a sample point. A simple way to measure similarity between two points is the absolute value of the inner product. Alternatively, one can look at the angle between two points or inner product of the normalized points. Another way to measure similarity is to consider the inverse of an appropriate distance measure.
Measurement of clustering performance¶
In general a clustering \(\CCC\) of a set \(Y\) constructed by a clustering algorithm is a set \(\{\CCC_1, \dots, \CCC_C\}\) of non-empty disjoint subsets of \(Y\) such that their union equals \(Y\). Clearly: \(|\CCC_c| > 0\).
The clustering process may identify an incorrect number of clusters, and \(C\) may not be equal to \(K\). Moreover, even if \(K = C\), the vectors may be placed in wrong clusters. Ideally, we want \(K = C\) and \(\CCC_c = Y_k\) with a bijective mapping between \(1 \leq c \leq C\) and \(1 \leq k \leq K\). In practice, a clustering algorithm estimates the number of clusters \(C\) and assigns a label \(l_s\), \(1 \leq s \leq S\), to each vector \(y_s\) where \(1\leq l_s \leq C\). All the labels can be put in a label vector \(L\) where \(L \in \{1, \dots, C\}^S\). The permutation matrix \(\Gamma\) can be easily obtained from \(L\).
Following [WW07], we will quickly establish the measures used in this work for clustering performance of synthetic experiments. We have a reference clustering of vectors in \(Y\) given by \(\BBB = \{Y_1, \dots, Y_K\}\) which is known to us in advance (either by construction in synthetic experiments or as ground truth with real life data-sets). The clustering obtained by the algorithm is given by \(\CCC= \{\CCC_1, \dots, \CCC_C\}\). For two arbitrary vectors \(y_i, y_j \in Y\), there are four possibilities: a) they belong to the same cluster in both \(\BBB\) and \(\CCC\) (true positive), b) they are in the same cluster in \(\BBB\) but different clusters in \(\CCC\) (false negative), c) they are in different clusters in \(\BBB\) but in the same cluster in \(\CCC\) (false positive), d) they are in different clusters in both \(\BBB\) and \(\CCC\) (true negative).
Consider some cluster \(Y_i \in \BBB\) and \(\CCC_j \in \CCC\). The elements common to \(Y_i\) and \(\CCC_j\) are given by \(Y_i \cap \CCC_j\). We define \(\text{precision}_{ij} \triangleq \frac{|Y_i \cap \CCC_j|}{|\CCC_j|}.\) We define the overall precision for \(\CCC_j\) as \(\text{precision}(\CCC_j) \triangleq \underset{i}{\max}(\text{precision}_{ij}).\) We define \(\text{recall}_{ij} \triangleq \frac{|Y_i \cap \CCC_j|}{|Y_i|}\). We define the overall recall for \(Y_i\) as \(\text{recall}(Y_i) \triangleq \underset{j}{\max}(\text{recall}_{ij})\). We define the \(F\) score as \(F_{ij} \triangleq \frac{2 \text{precision}_{ij} \text{recall}_{ij} }{\text{precision}_{ij} + \text{recall}_{ij}}.\) We define the overall \(F\)-score for \(Y_i\) as \(F(Y_i) \triangleq \underset{j}{\max}(F_{ij}).\) We note that the cluster \(\CCC_j\) for which the maximum is achieved is the best matching cluster for \(Y_i\). Finally, we define the overall \(F\)-score for the clustering as \(F(\mathcal{B}, \mathcal{C}) \triangleq \frac{1}{S}\sum_{i=1}^K |Y_i | F(Y_i)\) where \(S\) is the total number of vectors in \(Y\). We also define a clustering ratio given by the factor \(\eta \triangleq \frac{C}{K}\).
There are different ways to define clustering error. For the special case where the number of clusters is known in advance, and we ensure that the data-set is divided into exactly those many clusters, it is possible to define subspace clustering error as follows:
The definition is adopted from [EV13] for comparing the results in this work with their results. This definition can be used after a proper one-one mapping between original labels and cluster labels assigned by the clustering algorithms has been identified. We can compute this mapping by comparing \(F\)-scores.
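The measures above can be computed directly from two label vectors. Here is a minimal sketch with made-up example labels (the library's ClusterComparison class, used later in this document, provides the full implementation):
ref = [1 1 1 1 2 2 2 3 3 3];   % reference clustering labels (ground truth)
est = [1 1 2 2 2 2 3 3 3 3];   % labels estimated by a clustering algorithm
K = max(ref); C = max(est); S = numel(ref);
% counts(i, j) = number of points in reference cluster i and estimated cluster j
counts = zeros(K, C);
for s = 1:S
    counts(ref(s), est(s)) = counts(ref(s), est(s)) + 1;
end
precision_ij = counts ./ repmat(sum(counts, 1), K, 1);   % divide by |C_j|
recall_ij = counts ./ repmat(sum(counts, 2), 1, C);      % divide by |Y_i|
F_ij = 2 * precision_ij .* recall_ij ./ (precision_ij + recall_ij);
F_ij(isnan(F_ij)) = 0;          % empty intersections contribute zero
F_i = max(F_ij, [], 2);         % best matching estimated cluster for each Y_i
F_overall = sum(sum(counts, 2) .* F_i) / S;
fprintf('Overall F-score: %.4f\n', F_overall);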
K-means Clustering¶

K-means clustering algorithm
K-means clustering algorithm [M+67][DHS12][Har75] is an iterative clustering method. We start with an initial set of means and covariance matrices for each cluster. In each iteration, we segment the data points into individual clusters by choosing the nearest mean. Then, we estimate the new mean and covariance matrices. We return a label vector \(L\) which maps each point to its corresponding cluster. A within-cluster scatter can be defined as
This represents the average (squared) distance of each point to the respective cluster mean. The \(K\)-means algorithm reduces the scatter in each iteration. It is guaranteed to converge to a local minimum.
A simpler version of this algorithm is based on the Euclidean distance and doesn't compute or update the covariance matrices for each cluster. A small example follows.
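Here is a small example of this simpler version using MATLAB's built-in kmeans function (Statistics toolbox) on synthetic 2D data; the scatter computation follows the definition above:
rng(0);
% Two well separated 2D clusters
X = [randn(100, 2); randn(100, 2) + 5];
K = 2;
% Run K-means with several random restarts
[labels, means] = kmeans(X, K, 'Replicates', 10);
% Within-cluster scatter: average squared distance to the assigned cluster mean
scatter_w = mean(sum((X - means(labels, :)).^2, 2));
fprintf('Within-cluster scatter: %.4f\n', scatter_w);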
Spectral Clustering¶
Spectral clustering is a graph based clustering algorithm [VL07]. It operates on a similarity graph \(\GGG = \{T, W\}\) to obtain the clustering \(\CCC\) of \(X\). More specifically, the following steps are performed. The degree of a vertex \(t_s \in T\) is defined as \(d_s = \sum_{j = 1}^S w_{s j}\). The degree matrix \(D\) is defined as the diagonal matrix with the degrees \(\{ d_s \}_{s =1 }^S\). The unnormalized graph Laplacian is defined as \(\LLL = D - W\). The normalized graph Laplacian is defined as \(\LLL_{\text{rw}} \triangleq D^{-1} \LLL = I - D^{-1} W\) (we specifically use the random walk version of the normalized graph Laplacian as defined in [VL07]; there are other ways to define the normalized graph Laplacian). The subscript \(\text{rw}\) stands for random walk. We compute \(\LLL_{\text{rw}}\) and examine its eigen-structure to estimate the number of clusters \(C\) and the label vector \(L\). If \(C\) is known in advance, usually the first \(C\) eigen vectors of \(\LLL_{\text{rw}}\) corresponding to the smallest eigen values are taken and their row vectors are clustered using the K-means algorithm [SM00]. Since we don't make any assumption on the number of clusters, we need to estimate it. A simple way is to track the eigen-gap statistic. After arranging the eigen values in increasing order, we can choose the number \(C\) such that the eigen values \(\lambda_1, \dots, \lambda_C\) are very small and \(\lambda_{C + 1}\) is large. This is guided by the theoretical result that if a graph has \(C\) connected components then exactly \(C\) eigen values of \(\LLL_{\text{rw}}\) are 0. However, when the subspaces are not clearly separated and noise is introduced, this approach becomes tricky. We go for a more robust approach by analyzing the eigen vectors as described in [ZMP04]. The approach of [ZMP04], with a slightly different definition of the graph Laplacian \((D^{-1/2} W D^{-1/2})\) [NJW+02], has been adapted for working with the Laplacian \(\LLL_{\text{rw}}\) as defined above.
Robust estimation of number of clusters¶
In step 6, we estimate the number of clusters from the Graph Laplacian. It can be easily shown that \(0\) is an eigen value of \(\LLL_{\text{rw}}\) with an eigen vector \(\OneVec_S\) [VL07]. Further, the multiplicity of eigen value 0 equals the number of connected components in \(\GGG\). In fact the adjacency matrix can be factored as
where \(W_p \in \RR^{S_p \times S_p}\) is the adjacency matrix for the \(p\)-th connected component of \(\GGG\) corresponding to the subspace \(\UUU^p\) and \(\Gamma\) is the unknown permutation matrix. The graph Laplacian for each \(W_p\) has an eigen value \(0\) and the eigen-vector \(\OneVec_{S_p}\). Thus, if we look at the \(P\)-dimensional eigen-space of \(\LLL_{\text{rw}}\) corresponding to eigen value \(0\), then there exists a basis \(\widehat{V} \in \RR^{S \times P}\) such that each row of \(\widehat{V}\) is a unit vector in \(\RR^P\) and the columns contain \(S_1, \dots, S_P\) ones. Actual eigen vectors obtained through any numerical method will be a rotated version of \(\widehat{V}\) given by \(V = \widehat{V} R\). [ZMP04] suggests a cost function over the entries in \(V\) such that the cost is minimized when the rows of \(V\) are close to coordinate vectors. It then estimates a rotation matrix as a product of Givens rotations which can rotate \(V\) to minimize the cost. The parameters of the rotation matrix are the angles of the Givens rotations, which are estimated through a gradient descent process. Since \(P\) is unknown, the algorithm is run over multiple values of \(C\) and we choose the value which gives the minimum cost. Note that we reuse the rotated version of \(V\) obtained for a particular value of \(C\) when we go on to examine \(C+1\) eigen-vectors. This may appear to be ad-hoc, but is seen to help in faster convergence of the gradient descent algorithm for the next iteration.
When \(S\) is small, we can do a complete SVD of \(\LLL_{\text{rw}}\) to get the eigen vectors. However, this is time consuming when \(S\) is large (say 1000+). An important question is how many eigen vectors we really need to examine. As \(C\) increases, the number of Givens rotation parameters increases as \(C(C-1)/2\). Thus, if we examine too many eigen-vectors, we will lose out unnecessarily on speed. We can actually use the eigen-gap statistic described above to decide how many eigen vectors we should examine.
Finally, we assign labels to each data point to identify the cluster they belong to. As described above, we maintain the rotated version of \(V\) during the estimation of rotation matrix. Once, we have zeroed in on the right value of \(C\), then assigning labels to \(x^s\) is straight-forward. We simply perform non-maximum suppression on the rows of V, i.e. we keep the largest (magnitude) entry in each row of \(V\) and assign zero to the rest. The column number of the largest entry in the \(s\)-th row of \(V\) is the label \(l_s\) for \(x^s\). This completes the clustering process.
While eigen gap statistic based estimation of number of clusters is quick, it requires running an additional \(K\)-means algorithm step on the first \(C\) eigen vectors to assign the labels. In contrast, eigen vector based estimation of number of clusters is involved and slow but it allows us to pick the labels very quickly.
Expectation Maximization¶
Expectation-Maximization (EM) [DLR77] method is a maximum likelihood based estimation paradigm. It requires an explicit probabilistic model of the mixed data-set. The algorithm estimates model parameters and the segmentation of data in Maximum-Likelihood (ML) sense.
We assume that \(y_s\) are samples drawn from multiple “component” distributions and each component distribution is centered around a mean. Let there be \(K\) such component distributions. We introduce a latent (hidden) discrete random variable \(z \in \{1, \dots, K\}\) associated with the random variable \(y\) such that \(z_s = k\) if \(y_s\) is drawn from \(k\)-th component distribution. The random vector \((y, z) \in \RR^M \times \{1, \dots, K\}\) completely describes the event that a point \(y\) is drawn from a component indexed by the value of \(z\).
We assume that \(z\) is subject to a multinomial (marginal) distribution. i.e.:
Each component distribution can then be modeled as a conditional (continuous) distribution \(f(y | z)\). If each of the components is a multivariate normal distribution, then we have \(f(y | z = k) \sim \NNN(\mu_k, \Sigma_k)\) where \(\mu_k\) is the mean and \(\Sigma_k\) is the covariance matrix of the \(k\)-th component distribution. The parameter set for this model is then \(\theta = \{\pi_k, \mu_k, \Sigma_k \}_{k=1}^K\) which is unknown in general and needs to be estimated from the dataset \(Y\).
With \((y, z)\) being the complete random vector, the marginal PDF of \(y\) given \(\theta\) is given by
The log-likelihood function for the dataset \(Y = \{ y_s\}_{s=1}^N\) is given by
An ML estimate of the parameters, namely \(\hat{\theta}_{\ML}\), is obtained by maximizing \(l (Y; \theta)\) over the parameter space. The statistic \(l (Y; \theta)\) is called the incomplete log-likelihood function since it is marginalized over \(z\). It is very difficult to compute and maximize directly. The EM method provides an alternate means of maximizing \(l (Y; \theta)\) by utilizing the latent r.v. \(z\).
We start with noting that
Thus, \(l (Y; \theta)\) can be rewritten as
The first term is the expected complete log-likelihood function and the second term is the conditional entropy of \(z_s\) given \(y_s\) and \(\theta\).
Let us introduce auxiliary variables \(w_{sk} (\theta) = p(z_s = k | y_s , \theta)\). \(w_{sk}\) basically represents the expected membership of \(y_s\) in the \(k\)-th cluster. Put \(w_{sk}\) in a matrix \(W (\theta)\) and write:
Then, we have
where, we have written \(l\) as a function of both \(\theta\) and \(W\). An iterative maximization approach can be introduced as follows:
- Maximize \(l(Y; \theta, W)\) w.r.t. \(W\) keeping \(\theta\) as constant.
- Maximize \(l(Y; \theta, W)\) w.r.t. \(\theta\) keeping \(W\) as constant.
- Repeat the previous two steps till convergence.
This is essentially the EM algorithm. Step 1 is known as E-step and step 2 is known as the M-step. In the E-step, we are estimating the expected membership of each sample being drawn from each component distribution. In the M-step, we are maximizing the expected complete log-likelihood function as the conditional entropy term doesn’t depend on \(\theta\).
Using Lagrange multiplier, we can show that the optimal \(\hat{w}_{sk}\) in the E-step is given by
A closed form solution for the \(M\)-step depends on the particular choice of the component distributions. We provide a closed form solution for the special case when each of the components is an isotropic normal distribution (\(\NNN(\mu_k, \sigma_k^2 I)\)).
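A minimal sketch of one E-step and one M-step for this isotropic case, using the standard textbook updates on synthetic data (illustrative only, not the library's implementation; in practice the two steps are iterated until convergence):
% Synthetic mixture: columns of Y are samples; K isotropic Gaussian components assumed
M = 2; S = 500; K = 2;
Y = [randn(M, 250), randn(M, 250) + 4];
% Initial parameter guesses
pi_k = ones(1, K) / K;
mu = Y(:, randperm(S, K));
sigma2 = ones(1, K);
% E-step: w(s, k) = p(z_s = k | y_s, theta)
w = zeros(S, K);
for k = 1:K
    d2 = sum((Y - repmat(mu(:, k), 1, S)).^2, 1);
    w(:, k) = pi_k(k) * exp(-d2' / (2 * sigma2(k))) / (2 * pi * sigma2(k))^(M / 2);
end
w = w ./ repmat(sum(w, 2), 1, K);
% M-step: update the parameters using the soft assignments
for k = 1:K
    wk = w(:, k); Sk = sum(wk);
    pi_k(k) = Sk / S;
    mu(:, k) = Y * wk / Sk;
    d2 = sum((Y - repmat(mu(:, k), 1, S)).^2, 1);
    sigma2(k) = (d2 * wk) / (M * Sk);
end
% The E and M steps are repeated until the log-likelihood stops improving.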
In \(K\)-means, each \(y_s\) gets hard assigned to a specific cluster. In EM, we have a soft assignment given by \(w_{sk}\).
The EM method is a good method for a hybrid dataset consisting of a mixture of component distributions. Yet, its applicability is limited. We need to have a good idea of the number of components beforehand. Further, for a Gaussian Mixture Model (GMM), it fails to work if the variance in some of the directions is arbitrarily small [Vap13]. For example, a subspace like distribution is one where the data has large variance within a subspace but almost zero variance orthogonal to the subspace. The EM method tends to fail with subspace like distributions.
Hands-on spectral clustering¶
In this example, we will cluster 2D data which form three different rings in the plane.
Sample data is available in the data directory.
Let us load the data:
dataset_file = fullfile(spx.data_dir, 'clustering', ...
'self_tuning_paper_clustering_data');
data = load(dataset_file);
datasets = data.XX;
raw_data = datasets{1};
num_clusters = data.group_num(1);
The raw data is organized in a matrix where each row represents one 2D point. Number of data points is the number of rows in the dataset. Let’s plot the data to get a better understanding:
X = raw_data(:, 1);
Y = raw_data(:, 2);
figure;
axis equal;
plot(X, Y, '.', 'MarkerSize',16);

We can see that the data is organized in three different rings. This data set is unlikely to be clustered properly by the K-means algorithm.
It is good practice to scale the data before clustering it:
raw_data = raw_data - repmat(mean(raw_data),size(raw_data,1),1);
raw_data = raw_data/max(max(abs(raw_data)));
X = raw_data(:, 1);
Y = raw_data(:, 2);
figure;
axis equal;
plot(X, Y, '.', 'MarkerSize',16);

The next step is to compute pairwise distances between the points in the dataset:
sqrt_dist_mat = spx.commons.distance.sqrd_l2_distances_rw(raw_data);
We convert the distances into a Gaussian similarity. To compute the similarity, we will need to provide the scale value:
scale = 0.04;
% Compute the similarity matrix
sim_mat = spx.cluster.similarity.gauss_sim_from_sqrd_dist_mat(sqrt_dist_mat, scale);
We are now ready to perform spectral clustering on the data.
Create the spectral clustering algorithm instance:
clusterer = spx.cluster.spectral.Clustering(sim_mat);
Inform it about the expected number of clusters:
clusterer.NumClusters = num_clusters;
There are two different spectral clustering algorithms available. We will use the random walk version:
cluster_labels = clusterer.cluster_random_walk();
We can summarize the results of clustering:
>> tabulate(cluster_labels)
Value Count Percent
1 99 33.11%
2 139 46.49%
3 61 20.40%
Let’s plot the data points in different colors depending on which cluster they belong to:
figure;
colors = [1,0,0;0,1,0;0,0,1;1,1,0;1,0,1;0,1,1;0,0,0];
hold on;
axis equal;
for c=1:num_clusters
% Identify points in this cluster
points = raw_data(cluster_labels == c, :);
X = points(:, 1);
Y = points(:, 2);
plot(X, Y, '.','Color',colors(c,:), 'MarkerSize',16);
end

Complete example code can be downloaded here.
Inside Unnormalized Spectral Clustering¶
In this section, we will start with a similarity matrix and go through the steps of unnormalized spectral clustering.
We will consider a simple case of 8 data points which are known to fall into two clusters.
We construct an undirected graph \(G\) where the nodes in same cluster are connected to each other and nodes in different clusters are not connected to each other.
In this simple example, we will assume that the graph is unweighted.
The adjacency matrix for the graph is \(W\):
>> W = [ones(4) zeros(4); zeros(4) ones(4)]
W =
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
0 0 0 0 1 1 1 1
0 0 0 0 1 1 1 1
0 0 0 0 1 1 1 1
0 0 0 0 1 1 1 1
We have arranged the adjacency matrix in a manner so that the clusters are easily visible.
Let’s just get the number of nodes:
>> [num_nodes, ~] = size(W);
Let’s also assign the true labels to these nodes which will be used for verification later:
>> true_labels = [1 1 1 1 2 2 2 2];
We construct the degree matrix \(D\) for the graph:
>> Degree = diag(sum(W))
Degree =
4 0 0 0 0 0 0 0
0 4 0 0 0 0 0 0
0 0 4 0 0 0 0 0
0 0 0 4 0 0 0 0
0 0 0 0 4 0 0 0
0 0 0 0 0 4 0 0
0 0 0 0 0 0 4 0
0 0 0 0 0 0 0 4
The unnormalized Laplacian is given by \(L = D - W\):
>> Laplacian = Degree - W
Laplacian =
3 -1 -1 -1 0 0 0 0
-1 3 -1 -1 0 0 0 0
-1 -1 3 -1 0 0 0 0
-1 -1 -1 3 0 0 0 0
0 0 0 0 3 -1 -1 -1
0 0 0 0 -1 3 -1 -1
0 0 0 0 -1 -1 3 -1
0 0 0 0 -1 -1 -1 3
We now compute the singular value decomposition of the Laplacian \(U \Sigma V^T = L\):
>> [~, S, V] = svd(Laplacian);
>> singular_values = diag(S);
>> fprintf('Singular values: \n');
>> spx.io.print.vector(singular_values);
Singular values:
4.00 4.00 4.00 4.00 4.00 4.00 0.00 0.00
We know that the number of connected components in an undirected graph is equal to the number of singular values of the Laplacian which are zero. On inspection, we can see that there are indeed two such zeros.
For more general cases, the smaller singular values may not be exactly zero. We need to find the knee of the singular value curve.

A simple way to find it is to look at the changes between consecutive singular values and find the place where the change is largest:
>> sv_changes = diff( singular_values(1:end-1) );
>> spx.io.print.vector(sv_changes);
0.00 0.00 0.00 -0.00 0.00 -4.00
Recall that the Laplacian always has at least one singular value equal to 0. Thus, we need to look at the changes only in the remaining singular values.
Locate the largest change:
>> [min_val , ind_min ] = min(sv_changes)
min_val =
-4.0000
ind_min =
6
The number of clusters is now easy to determine:
>> num_clusters = num_nodes - ind_min
num_clusters =
2
We pick the right singular vectors corresponding to the 2 smallest singular values:
>> Kernel = V(:,num_nodes-num_clusters+1:num_nodes);
Each row of this matrix corresponds to one data point. At this point, standard k-means clustering can be invoked to group the points into the number of clusters determined above:
% Maximum iteration for KMeans Algorithm
>> max_iterations = 1000;
% Replication for KMeans Algorithm
>> replicates = 100;
>> labels = kmeans(Kernel, num_clusters, ...
'start','sample', ...
'maxiter',max_iterations,...
'replicates',replicates, ...
'EmptyAction','singleton'...
);
Print the labels given by k-means:
>> spx.io.print.vector(labels, 0);
1 1 1 1 2 2 2 2
As expected, the algorithm has been able to group the points into two clusters. The labels match the original true labels.
Complete example code can be downloaded here.
sparse-plex includes a function which implements the unnormalized spectral clustering algorithm. We can use it on the data above as follows:
>> result = spx.cluster.spectral.simple.unnormalized(W);
>> result.labels'
ans =
1 1 1 1 2 2 2 2
Inside Normalized (Random Walk) Spectral Clustering¶
In this section, we will look at a spectral clustering method using normalized Laplacians. The primary difference is the way the graph Laplacian is computed \(L = I - D^{-1} W\).
We will use the third example from the self-tuning spectral clustering paper [ZMP04] for demonstration here.

There are three clusters in the dataset. While two of the clusters have very clear convex shapes, the third one forms a half moon. It is this cluster which causes problems for a simple algorithm like k-means. The half moon is the first cluster, while the other two above it are clusters 2 and 3 (from left to right).
Loading the dataset:
dataset_file = fullfile(spx.data_dir, 'clustering', ...
'self_tuning_paper_clustering_data');
data = load(dataset_file);
datasets = data.XX;
raw_data = datasets{3};
% Scale the raw_data
raw_data = raw_data - repmat(mean(raw_data),size(raw_data,1),1);
raw_data = raw_data/max(max(abs(raw_data)));
num_clusters = data.group_num(1);
X = raw_data(:, 1);
Y = raw_data(:, 2);
% plot it
axis equal;
plot(X, Y, '.', 'MarkerSize',16);
Let’s compute the pairwise (squared) \(\ell_2\) distances between the points:
sqrt_dist_mat = spx.commons.distance.sqrd_l2_distances_rw(raw_data);
imagesc(sqrt_dist_mat);
title('Distance Matrix');

- In cluster 2 and 3, the distances are quite small between all pairs of points.
- In cluster 1, every point is near only some of the other points. The points form a kind of chain structure and gradually get farther apart. This is visible from the gradual color change from blue to yellow in the off-diagonal parts of the first diagonal block in the image above.
We map the (squared) distances to similarity values between 0 to 1:
scale = 0.04;
W = spx.cluster.similarity.gauss_sim_from_sqrd_dist_mat(sqrt_dist_mat, scale);
imagesc(W);
title('Similarity Matrix');
The transformation involved here is

The chain structure of similarity in cluster 1 is clearly visible here. In cluster 2 and 3 points are fairly similar to each other.
We now compute the graph Laplacian:
[num_nodes, ~] = size(W);
Degree = diag(sum(W));
DegreeInv = Degree^(-1);
Laplacian = speye(num_nodes) - DegreeInv * W;
imagesc(Laplacian);
title('Normalized Random Walk Laplacian');
Note how the Laplacian has been computed slightly differently.

Let’s look at the singular values:
[~, S, V] = svd(Laplacian);
singular_values = diag(S);
plot(singular_values, 'b.-');
grid on;
title('Singular values of the Laplacian');

This time no clear knee is visible in the singular value plot. We can verify this by looking at the differences:
sv_changes = diff( singular_values(1:end-1) );
plot(sv_changes, 'b.-');
grid on;
title('Changes in singular values');

Finding the largest change in singular values will not give us the correct number of clusters:
>> [min_val , ind_min ] = min(sv_changes);
>> num_clusters = num_nodes - ind_min;
>> num_clusters
num_clusters =
26
However, by data inspection, we can clearly see that there are only 3 clusters of interest.
In this case, since the data are well segregated, the number of singular values which are close to zero actually matches the number of clusters:
>> num_clusters = sum(singular_values < 1e-6)
num_clusters =
3
Let’s verify by printing 10 smallest singular values:
>> spx.io.print.vector(singular_values(end-10:end), 6)
0.027412 0.023242 0.014100 0.012101 0.006108 0.003988 0.001655 0.000471 0.000000 0.000000 0.000000
We will stick to this way of computing the number of clusters here.
Let’s pick up the right singular vectors corresponding to the last 3 singular values:
% Choose the last num_clusters eigen vectors
Kernel = V(:,num_nodes-num_clusters+1:num_nodes);
Time to perform k-means clustering on the row vectors of this kernel:
% Maximum iteration for KMeans Algorithm
max_iterations = 1000;
% Replication for KMeans Algorithm
replicates = 100;
cluster_labels = kmeans(Kernel, num_clusters, ...
'start','plus', ...
'maxiter',max_iterations,...
'replicates',replicates, ...
'EmptyAction','singleton'...
);
The labels are returned in the variable cluster_labels.
Let’s plot the original data by assigning different colors to points belonging to different labels:
hold on;
axis equal;
for c=1:num_clusters
% Identify points in this cluster
points = raw_data(cluster_labels == c, :);
X = points(:, 1);
Y = points(:, 2);
plot(X, Y, '.', 'MarkerSize',16);
end
hold off;

We can see that the clusters have been clearly identified.
Utility Functions for Clustering Experiments¶
We provide some utility functions which are quite useful in setting up clustering experiments.
Suppose you stack data vectors from different clusters together in a matrix column-wise. You wish to assign labels to each column of the matrix. We provide a function to automatically choose such labels.
Let’s choose some cluster sizes:
>> cluster_sizes = [ 4 3 3 2];
Let’s generate labels for these clusters:
>> labels = spx.cluster.labels_from_cluster_sizes(cluster_sizes)
labels =
1 1 1 1 2 2 2 3 3 3 4 4
Notice how the first 4 labels are 1, the next 3 labels are 2, the next 3 are 3, and the final 2 are 4.
Let's randomly reorder the labels. This is a typical step before feeding data to a clustering algorithm, so that any inherent order in the data is destroyed.
>> labels = labels(randperm(numel(labels)))
labels =
3 2 2 3 4 3 1 1 4 1 2 1
Another useful task is to find the cluster sizes from a label vector. We provide a function for that too:
>> spx.cluster.cluster_sizes_from_labels(labels)
ans =
4 3 3 2
Comparing Clusterings¶
In Measurement of clustering performance, we looked at the theoretical aspects of comparing two different clusterings.
In this section, we will learn the tools available in sparse-plex library for comparing clusterings.
In this example, we will consider a set of 14 objects which are clustered into 4 different clusters by two different algorithms, algorithm A and B. Algorithm A could be human annotations themselves, in which case the labels are the ground truth against which we will compare the results of B.
We assume that the number of clusters is known in advance to be 4 and the two algorithms are generating the labels 1, 2, 3, 4.
The algorithm A outputs following labels:
A = [2 1 3 2 4 2 1 1 1 1 4 3 3 3];
It puts 5 objects into cluster 1, 3 into cluster 2, 4 into cluster 3, and 2 in cluster 4.
The algorithm B outputs following labels:
B = [4 2 3 4 2 4 2 2 3 2 1 3 3 3];
It puts only 1 object in cluster 1, 5 in cluster 2, 5 in cluster 3 and 3 in cluster 4.
An easy way to determine this is the tabulate function:
>> tabulate(A)
Value Count Percent
1 5 35.71%
2 3 21.43%
3 4 28.57%
4 2 14.29%
By inspection, we can see that the two algorithms are assigning different labels in most cases.
We need to figure out the label mapping between two clusters. It describes how the labels between two clusterings are related to each other. e.g. when A assigns a label 1 to some object, what is the most likely label assigned by B.
sparse-plex provides a cluster comparison tool:
>> cc = spx.cluster.ClusterComparison(A, B);
In order to compare the two clusterings, the first tool is the confusion matrix:
>> cm = cc.confusionMatrix(); cm
cm =
0 4 1 0
0 0 0 3
0 0 4 0
1 1 0 0
In the confusion matrix, the rows represent the labels assigned by A and columns represent the labels assigned by B.
For example, of the 5 objects assigned to cluster 1 by A, 4 were assigned to cluster 2 by B and 1 was assigned to cluster 3 by B.
The confusion matrix is a very useful tool for identifying the label mapping. In this case, cluster 1 of algorithm A and cluster 2 of algorithm B are likely to be similar.
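For intuition, a confusion matrix of this kind can also be computed directly from the two label vectors. The following is just a sketch using stock MATLAB (not the library's implementation), assuming the labels are integers from 1 to num_clusters:
num_clusters = 4;
% cm_manual(i, j) counts the objects labeled i by A and j by B
cm_manual = accumarray([A(:) B(:)], 1, [num_clusters num_clusters]);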
From Measurement of clustering performance, we would like to get the precision, recall and f1-measure numbers between the two clusterings.
ClusterComparison provides a method to get all of these metrics:
>> fm = cc.fMeasure();
>> fm.precisionMatrix
ans =
0 0.8000 0.2000 0
0 0 0 1.0000
0 0 0.8000 0
1.0000 0.2000 0 0
>> fm.recallMatrix
ans =
0 0.8000 0.2000 0
0 0 0 1.0000
0 0 1.0000 0
0.5000 0.5000 0 0
>> fm.fMatrix
ans =
0 0.8000 0.2000 0
0 0 0 1.0000
0 0 0.8889 0
0.6667 0.2857 0 0
>> fm.precision
ans =
0.8571
>> fm.recall
ans =
0.8571
>> fm.fMeasure
ans =
0.8492
A label map is also computed using the f1 matrix:
>> fm.labelMap'
ans =
2 4 3 1
The map suggests a mapping from labels of A to labels of B as follows: 1->2, 2->4, 3->3, 4->1.
It also provides you the new B labels after remapping:
>> fm.remappedLabels'
ans =
2 1 3 2 1 2 1 1 3 1 4 3 3 3
We can look at the number of places the remapped labels of B differ from the original A labels:
>> fm.remappedLabels' ~= A
ans =
1×14 logical array
0 0 0 0 1 0 0 0 1 0 0 0 0 0
We see that after remapping of labels, A and B differ in only 2 places. The clustering done by B is actually very close to the clustering done by A.
The ClusterComparison class provides a helpful method for printing the results in the fm object:
>> spx.cluster.ClusterComparison.printF1MeasureResult(fm)
F1-measure: 0.85, Precision: 0.86, Recall: 0.86, Misclassification rate: 0.14, Clusters: A: 4, B: 4, Clustering ratio: 1.00
Label map:
1=>2, 2=>4, 3=>3, 4=>1,
Label mapping using Hungarian method¶
Label mapping is essentially an assignment problem. We want to match the labels assigned by the two algorithms in such a way that the clustering error is minimized.
The Hungarian algorithm is used in assignment problems when we want to minimize cost.
sparse-plex includes an implementation of Hungarian method based label mapping by Niclas Borlin.
We can perform the assignment as follows:
>> C = bestMap(A, B)'; C
ans =
2 1 3 2 1 2 1 1 3 1 4 3 3 3
>> C ~= A
ans =
1×14 logical array
0 0 0 0 1 0 0 0 1 0 0 0 0 0
In this case the mapping given by the Hungarian method is the same as the mapping generated by the \(f_1\)-measure method. This is not always the case.
The bestMap method is easy to use.
Clustering Error¶
If two clusterings have the same number of labels, then a simpler clustering error metric is quite useful.
We start with an example set of true labels A and estimated labels B:
A = [2 1 3 2 4 2 1 1 1 1 4 3 3 3];
B = [4 2 3 4 2 4 2 2 3 2 1 3 3 3];
Total number of labels:
num_labels = numel(A)
num_labels =
14
Let’s use the Hungarian mapping technique to find the mapping of labels between A and B:
mapped_B = bestMap(A, B)'
mapped_B =
2 1 3 2 1 2 1 1 3 1 4 3 3 3
After this mapping, the mapped B labels look pretty much like A. The positions where the mapped labels differ from A are where the algorithm has made mistakes:
mistakes = mapped_B ~= A
mistakes =
1×14 logical array
0 0 0 0 1 0 0 0 1 0 0 0 0 0
Total number of mistakes:
num_mistakes = sum(mistakes)
num_mistakes =
2
Clustering error is simply the ratio of the number of mistakes to the total number of data points:
clustering_error = num_mistakes / num_labels
clustering_error =
0.1429
In percentage:
clustering_error_perc = clustering_error * 100
clustering_error_perc =
14.2857
Accuracy can be computed from error:
clustering_acc_perc = 100 - clustering_error_perc
Sparse-Plex provides a function which does all of this together:
>> spx.cluster.clustering_error_hungarian_mapping(A, B)
ans =
struct with fields:
num_labels: 14
num_missed_points: 2
error: 0.1429
error_perc: 14.2857
mapped_labels: [2 1 3 2 1 2 1 1 3 1 4 3 3 3]
misses: [0 0 0 0 1 0 0 0 1 0 0 0 0 0]
Pursuit Algorithms¶
Prelude to greedy pursuit algorithms¶
In this chapter we will review some matching pursuit algorithms which can help us solve the sparse approximation and sparse recovery problems discussed here.
The presentation in this chapter is based on a number of sources including [BDDH11][BD09][Ela10][NT09][Tro04][TG07].
Let us recall the definitions of sparse approximation and recovery problems from previous chapters.
From here let \(\DDD\) be a signal dictionary with \(\Phi \in \CC^{N \times D}\) being its synthesis matrix. The \((\mathcal{D}, K)\)-sparse approximation can be written as
From here with the help of synthesis matrix \(\Phi\), the \((\mathcal{D}, K)\)-exact-sparse problem can be written as
From here we recall the sparse signal recovery from compressed measurements problem as following. Let \(\Phi \in \CC^{M \times N}\) be a sensing matrix. Let \(x \in \CC^N\) be an unknown signal which is assumed to be sparse or compressible. Let \(y = \Phi x\) be a measurement vector in \(\CC^M\).
Then the signal recovery problem is to recover \(x\) from \(y\) subject to
assuming \(x\) to be \(K\) sparse or at least \(K\) compressible.
We note that the sparse approximation and sparse recovery problems have pretty much the same structure. They are in fact dual to each other. Thus we will see that the same set of algorithms can be adapted to solve both problems.
In the sequel we will see many variations of above problems.
Our first problem¶
We will start by attacking a very simple version of the \((\mathcal{D}, K)\)-exact-sparse problem.
Setting up notation
- \(x \in \CC^N\) is our signal of interest and it is known.
- \(\DDD\) is the dictionary in which we are looking for a sparse representation of \(x\).
- \(\Phi \in \CC^{N \times D}\) is the synthesis matrix for \(\DDD\).
- The sparse representation of \(x\) in \(\DDD\) is given by
- It is assumed that \(\alpha \in \CC^D\) is sparse with \(\| \alpha \|_0 \leq K\).
- Also we assume that \(\alpha\) is the sparsest possible solution for \(x\) that we are looking for.
- We know \(x\), we know \(\Phi\), we don’t know \(\alpha\). We are looking for it.
Thus we need to solve the optimization problem given by
For the unknown vector \(\alpha\), we need to find
- the sparsest support for the solution i.e. \(\{ i | \alpha_i \neq 0 \}\)
- the non-zero values \(\alpha_i\) over this support.
If we are able to find the support for the solution \(\alpha\), then we may assume that the non-zero values of \(\alpha\) can be easily computed by least squares methods.
Note that the support is discrete in nature (An index \(i\) either belongs to the support or it does not). Hence algorithms which will seek the support will also be discrete in nature.
We now build up a case for greedy algorithms before jumping into specific algorithms later.
Let us begin with a much simplified version of the problem.
Let the columns of the matrix \(\Phi\) be represented as
Let \(\spark (\Phi) > 2\). Thus no two columns in \(\Phi\) are linearly dependent and, as per here, for any \(x\) there is at most one \(1\)-sparse explanation vector.
We now assume that such a representation exists and we would be looking for optimal solution vector \(\alpha^*\) that has only one non-zero value, i.e. \(\| \alpha^*\|_0 = 1\).
Let \(i\) be the index at which \(\alpha^*_i \neq 0\).
Thus \(x = \alpha^*_i \phi_i\), i.e. \(x\) is a scalar multiple of \(\phi_i\) (the \(i\)-th column of \(\Phi\)).
Of course we don’t know what is the value of index \(i\).
We can find this by comparing \(x\) with each column of \(\Phi\) and finding the column which best matches it.
Consider the least squares minimization problem:
where \(z_j \in \CC\) is a scalar.
From linear algebra, it attempts to find the projection of \(x\) over \(\phi_j\) and \(\epsilon(j)\) represents the magnitude of error between \(x\) and the projection of \(x\) over \(\phi_j\).
Optimal solution is given by
since columns of a dictionary are assumed to be unit norm.
Plugging it back into the expression of minimum squared error we get
Now since \(x\) is a scalar multiple of \(\phi_i\), hence \(\epsilon(i) = 0\), thus if we look at \(\epsilon(j)\) for \(j = 1, \dots, D\), the minimum value \(0\) will be obtained for \(j = i\).
And \(\epsilon(i) = 0\) means
This is a special case of Cauchy-Schwartz inequality when \(x\) and \(\phi_i\) are collinear.
The sparse representation is given by
Since \(x \in \CC^N\) and \(\phi_j \in \CC^N\), hence computation of \(\epsilon(j)\) requires \(\bigO{N}\) time.
Since we may need to do it for all \(D\) columns, hence finding the index \(i\) takes \(\bigO{ND}\) time.
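As a quick illustration, the search for the best matching column can be written in a few lines of MATLAB. This is only a sketch for the 1-sparse case, assuming the columns of Phi are unit norm:
% Inner products of x with all atoms: O(ND) work
inner_products = Phi' * x;
% Index of the best matching atom
[~, i] = max(abs(inner_products));
% The 1-sparse representation
alpha = zeros(size(Phi, 2), 1);
alpha(i) = inner_products(i);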
Now let us make our life more complex. We now suppose that \(\spark(\Phi) > 2 K\). Thus a sparse representation \(\alpha\) of \(x\) with up to \(K\) non-zero values is unique if it exists (see again here). We assume it exists. Since \(x=\Phi \alpha\), \(x\) is a linear combination of up to \(K\) columns of \(\Phi\).
One approach could be to check out all \(\binom{D}{K}\) possible subsets of \(K\) columns from \(\Phi\).
But \(D \choose K\) is \(\bigO{D^{K}}\) and for each subset of \(K\) columns solving the least squares problem will take \(\bigO{N K^2}\) time. Hence overall complexity of the recovery process would be \(\bigO{D^{K} N K^2}\). This is prohibitively expensive.
A way around is by adopting a greedy strategy in which we abandon the hopeless exhaustive search and attempt a series of single term updates in the solution vector \(\alpha\).
Since this is an iterative procedure, let us call the approximation at each iteration as \(\alpha^k\) where \(k\) is the iteration index.
We start with \(\alpha^0 = 0\).
At each iteration we choose one new column in \(\Phi\) and fill in a value at corresponding index in \(\alpha^k\).
The column and value are chosen such that it maximally reduces the \(l_2\) error between \(x\) and the approximation. i.e.
\[\| x -\Phi \alpha^{k + 1} \|_2 < \| x -\Phi \alpha^{k} \|_2\]and the error reduction is as high as possible.
We stop when the \(l_2\) error reduces below a specific threshold.
Many variations to this scheme are possible.
- We can choose more than one atom in each iteration.
- In fact, we can even choose \(K\) atoms in each iteration.
- We can drop some previously chosen atoms in an iteration too if they seem to be incorrect choices.
Not every chosen atom may be a correct one. Some algorithms have mechanisms to identify atoms which are more likely to be part of the support than others and thus drop the unlikely ones.
We are now ready to explore different greedy algorithms.
Matching Pursuit¶
Algorithm¶

Matching Pursuit
The matching pursuit algorithm is a very simple iterative approach to solve the sparse recovery problem. We are given the signal \(y\) and the dictionary \(\Phi\) and we are to recover the sparse representation \(x\) satisfying \(y = \Phi x\).
In each iteration of matching pursuit:
- A current estimate of the representation vector \(x\) is maintained in the variable \(z\).
- Current residual \(r = y - \Phi z\) is maintained.
- The inner product of the residual with all the atoms in \(\Phi\) is computed.
- We look for the atom which has the largest inner product in magnitude.
- Contribution from this atom is added to the representation.
- Residual is reduced accordingly.
Note that the norm of the residual is guaranteed to decrease monotonically in each iteration until it converges.
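To see why (this is a standard identity, not part of the original algorithm description), note that with unit norm atoms the update removes the projection of \(r\) on the selected atom \(\phi_k\), so that
\[\| r_{\text{new}} \|_2^2 = \| r \|_2^2 - |\langle r, \phi_k \rangle|^2 \leq \| r \|_2^2.\]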
The algorithm can be motivated as follows.
Let \(\Lambda\) be the support of the representation vector \(x\). Then
For some \(k \in \Lambda\)
If the atoms formed an orthonormal set, this would have reduced to \(x_{k} = \langle y, \phi_k \rangle\) and picking the largest inner product would give us the largest non-zero entry in \(x\).
In fact, if \(\Phi\) was an orthonormal basis, then matching pursuit recovers the representation of \(y\) in exactly \(K\) iterations where \(K = |\Lambda|\) by successively picking up nonzero coefficients in \(x\) in the order of descending magnitude. We hope that the algorithm is useful even when the atoms in \(\Phi_{\Lambda}\) are not orthogonal.
Now, let us look at the iterative structure. Assume that the current estimate \(z\) satisfies \(\supp(z) \subseteq \Lambda\). Then \(\Phi z \in \Range(\Phi_{\Lambda})\). Since \(y \in \Range(\Phi_{\Lambda})\), the residual also satisfies \(r \in \Range(\Phi_{\Lambda})\).
Finally, if the atoms in \(\Phi\) are nearly orthogonal to each other, then the largest inner product of \(r\) will be for one of the atoms in \(\Lambda\). This atom is indexed by the variable \(k\). Then \(h_k\) is the projection of the residual \(r\) on the atom \(\phi_k\).
We add this projection coefficient to \(z_k\) and remove the projection from the residual. The support of \(z\) continues to be within \(\Lambda\).
Since the atoms are not orthogonal, matching pursuit typically takes a much larger number of iterations than the sparsity level \(K\). However, under suitable conditions, it does converge to the correct solution.
Hands-on with Matching Pursuit¶
Matching pursuit on a 2-sparse vector¶
In this example, we will reconstruct a 2-sparse representation vector \(x\) from a signal \(y = \Phi x\). We will develop a basic implementation of matching pursuit along the way.
From this example, we know of a way to construct a dictionary with high spark:
rng default;
N = 20;
M = 10;
K = 2;
PhiA = hadamard(N);
rows = randperm(N, M);
PhiB = PhiA(rows, :);
Let’s print its contents:
>> PhiB
PhiB =
1 -1 -1 -1 -1 1 -1 1 -1 1 1 1 1 -1 -1 1 -1 -1 1 1
1 -1 -1 1 -1 -1 1 1 -1 -1 -1 -1 1 -1 1 -1 1 1 1 1
1 -1 -1 -1 1 -1 1 -1 1 1 1 1 -1 -1 1 -1 -1 1 1 -1
1 1 1 -1 -1 1 -1 -1 1 1 -1 -1 -1 -1 1 -1 1 -1 1 1
1 1 -1 1 1 1 1 -1 -1 1 -1 -1 1 1 -1 -1 -1 -1 1 -1
1 -1 1 1 1 1 -1 -1 1 -1 -1 1 1 -1 -1 -1 -1 1 -1 1
1 -1 1 1 -1 -1 -1 -1 1 -1 1 -1 1 1 1 1 -1 -1 1 -1
1 1 1 -1 -1 -1 -1 1 -1 1 -1 1 1 1 1 -1 -1 1 -1 -1
1 -1 1 -1 -1 1 1 -1 -1 -1 -1 1 -1 1 -1 1 1 1 1 -1
1 1 -1 -1 1 1 -1 -1 -1 -1 1 -1 1 -1 1 1 1 1 -1 -1
Let’s normalize its columns:
Phi = spx.norm.normalize_l2(PhiB);
Bi-Gaussian discusses ways to generate synthetic sparse vectors.
Let’s generate our 2-sparse representation vector:
rng(100);
gen = spx.data.synthetic.SparseSignalGenerator(N, K);
x = gen.biGaussian();
Let’s print \(x\):
>> spx.io.print.sparse_signal(x);
(6,1.6150) (11,-1.2390) N=20, K=2
This is a nice helper function to print sparse vectors. It prints a sequence of tuples where each tuple consists of the index of a non-zero value and corresponding value.
The support for this vector is:
>> spx.commons.sparse.support(x)'
ans =
6 11
Let’s construct our 10-dimensional signal from it:
y = Phi * x;
Let’s print it:
>> spx.io.print.vector(y)
0.12 -0.12 -0.90 0.90 0.90 0.90 -0.90 -0.12 0.90 0.12
Our problem is now setup. Our job now is to recover \(x\) from \(\Phi\) and \(y\).
Initialize the estimated representation and current residual:
z = zeros(N, 1);
r = y;
We will run the matching pursuit iterations up to 100 times:
for i=1:100
The following code samples are part of each matching pursuit iteration. We start by computing the inner products of the current residual with each atom:
inner_products = Phi' * r;
Find the index of the best matching atom \(k\):
[max_abs_inner_product, index] = max(abs(inner_products));
Corresponding signed inner product \(h_k\):
max_inner_product = inner_products(index);
Update the representation:
z(index) = z(index) + max_inner_product;
Remove the projection of the atom from the residual:
r = r - max_inner_product * Phi(:, index);
Compute the norm of residual:
norm_residual = norm(r);
If the norm is less than a threshold, we break out of loop:
if norm_residual < 1e-4
break;
end
It will be instructive to print the current residual norm, the selected atom index, the corresponding coefficient \(h_k\) and the estimated coefficients in the \(z\) variable in each iteration:
fprintf('[%d]: k: %d, r_norm: %.4f, h_k: %.4f, estimate: ', i, index, norm_residual, max_inner_product);
spx.io.print.sparse_signal(z);
Here is the output of running this algorithm for this problem:
[1]: k: 6, r_norm: 1.2140, h_k: 1.8628, estimate: (6,1.8628) N=20, K=1
[2]: k: 11, r_norm: 0.2428, h_k: -1.1894, estimate: (6,1.8628) (11,-1.1894) N=20, K=2
[3]: k: 6, r_norm: 0.0486, h_k: -0.2379, estimate: (6,1.6249) (11,-1.1894) N=20, K=2
[4]: k: 11, r_norm: 0.0097, h_k: -0.0476, estimate: (6,1.6249) (11,-1.2370) N=20, K=2
[5]: k: 6, r_norm: 0.0019, h_k: -0.0095, estimate: (6,1.6154) (11,-1.2370) N=20, K=2
[6]: k: 11, r_norm: 0.0004, h_k: -0.0019, estimate: (6,1.6154) (11,-1.2389) N=20, K=2
[7]: k: 6, r_norm: 0.0001, h_k: -0.0004, estimate: (6,1.6150) (11,-1.2389) N=20, K=2
It took us 7 iterations, but the residual norm reached close to 0. We can note that the non-zero values in \(z\) match closely with the corresponding values in \(x\). Matching pursuit has been successful. We can also notice that the reconstruction alternates between atom number 6 and 11 in each iteration. Also, the residual norm keeps on decreasing with each iteration.
The complete code can be downloaded here.
Although the spark of the dictionary in the previous example is \(8\), so that 3-sparse representations are still unique, matching pursuit can fail to recover signals which are 3-sparse.
Here is an example output of running matching pursuit on a 3-sparse vector for 20 iterations:
The representation: (6,-1.9014) (8,1.3481) (11,1.6150) N=20, K=3
[1]: k: 6, r_norm: 1.9189, h_k: -2.7636, estimate: (6,-2.7636) N=20, K=1
[2]: k: 11, r_norm: 1.2654, h_k: 1.4425, estimate: (6,-2.7636) (11,1.4425) N=20, K=2
[3]: k: 8, r_norm: 0.7712, h_k: 1.0032, estimate: (6,-2.7636) (8,1.0032) (11,1.4425) N=20, K=3
[4]: k: 6, r_norm: 0.3449, h_k: 0.6898, estimate: (6,-2.0738) (8,1.0032) (11,1.4425) N=20, K=3
[5]: k: 8, r_norm: 0.2069, h_k: 0.2759, estimate: (6,-2.0738) (8,1.2791) (11,1.4425) N=20, K=3
[6]: k: 11, r_norm: 0.1542, h_k: 0.1380, estimate: (6,-2.0738) (8,1.2791) (11,1.5805) N=20, K=3
[7]: k: 6, r_norm: 0.0690, h_k: 0.1380, estimate: (6,-1.9359) (8,1.2791) (11,1.5805) N=20, K=3
[8]: k: 8, r_norm: 0.0414, h_k: 0.0552, estimate: (6,-1.9359) (8,1.3343) (11,1.5805) N=20, K=3
[9]: k: 16, r_norm: 0.0308, h_k: 0.0276, estimate: (6,-1.9359) (8,1.3343) (11,1.5805) (16,0.0276) N=20, K=4
[10]: k: 14, r_norm: 0.0241, h_k: -0.0193, estimate: (6,-1.9359) (8,1.3343) (11,1.5805) (14,-0.0193) (16,0.0276)
N=20, K=5
[11]: k: 10, r_norm: 0.0197, h_k: 0.0138, estimate: (6,-1.9359) (8,1.3343) (10,0.0138) (11,1.5805) (14,-0.0193)
(16,0.0276) N=20, K=6
[12]: k: 6, r_norm: 0.0151, h_k: 0.0127, estimate: (6,-1.9232) (8,1.3343) (10,0.0138) (11,1.5805) (14,-0.0193)
(16,0.0276) N=20, K=6
[13]: k: 11, r_norm: 0.0115, h_k: 0.0097, estimate: (6,-1.9232) (8,1.3343) (10,0.0138) (11,1.5902) (14,-0.0193)
(16,0.0276) N=20, K=6
[14]: k: 15, r_norm: 0.0095, h_k: -0.0065, estimate: (6,-1.9232) (8,1.3343) (10,0.0138) (11,1.5902) (14,-0.0193)
(15,-0.0065) (16,0.0276) N=20, K=7
[15]: k: 13, r_norm: 0.0078, h_k: 0.0055, estimate: (6,-1.9232) (8,1.3343) (10,0.0138) (11,1.5902) (13,0.0055)
(14,-0.0193) (15,-0.0065) (16,0.0276) N=20, K=8
[16]: k: 1, r_norm: 0.0056, h_k: -0.0054, estimate: (1,-0.0054) (6,-1.9232) (8,1.3343) (10,0.0138) (11,1.5902)
(13,0.0055) (14,-0.0193) (15,-0.0065) (16,0.0276) N=20, K=9
[17]: k: 20, r_norm: 0.0044, h_k: -0.0035, estimate: (1,-0.0054) (6,-1.9232) (8,1.3343) (10,0.0138) (11,1.5902)
(13,0.0055) (14,-0.0193) (15,-0.0065) (16,0.0276) (20,-0.0035)
N=20, K=10
[18]: k: 2, r_norm: 0.0034, h_k: 0.0028, estimate: (1,-0.0054) (2,0.0028) (6,-1.9232) (8,1.3343) (10,0.0138)
(11,1.5902) (13,0.0055) (14,-0.0193) (15,-0.0065) (16,0.0276)
(20,-0.0035) N=20, K=11
[19]: k: 4, r_norm: 0.0025, h_k: 0.0023, estimate: (1,-0.0054) (2,0.0028) (4,0.0023) (6,-1.9232) (8,1.3343)
(10,0.0138) (11,1.5902) (13,0.0055) (14,-0.0193) (15,-0.0065)
(16,0.0276) (20,-0.0035) N=20, K=12
[20]: k: 17, r_norm: 0.0021, h_k: -0.0014, estimate: (1,-0.0054) (2,0.0028) (4,0.0023) (6,-1.9232) (8,1.3343)
(10,0.0138) (11,1.5902) (13,0.0055) (14,-0.0193) (15,-0.0065)
(16,0.0276) (17,-0.0014) (20,-0.0035) N=20, K=13
The sparse vector is supported on atoms 6, 8 and 11. If we order the atoms in terms of the magnitude of their coefficients, the order is 6,11 and 8.
- Atom 6 is discovered in first iteration.
- Atom 11 is discovered in second iteration.
- Atom 8 is discovered in the third iteration.
- The coefficients for atoms 6, 8 and 11 continue to be updated up to the 8th iteration.
- In the 9th iteration, it picks an incorrect atom, 16.
- In the following iterations, it keeps picking more incorrect atoms: 14, 10, 15, 13, 1, 20, etc.
- The algorithm is side-tracked after the 9th iteration. The residual doesn’t belong to the range \(\Range(\Phi_{\Lambda})\) anymore.
- After 20 iterations, as many as 13 atoms are involved in the representation.
- Yet, most of the energy is concentrated in atoms 6, 8, 11 only. In that sense, MP hasn’t failed completely.
- A simple thresholding step can remove the spurious contributions from the incorrect atoms; a sketch is given below.
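Continuing the 3-sparse example, a pruning step could look like the following sketch (not library code), assuming z holds the final MP estimate and Phi and y are as before: keep the \(K\) largest entries of z in magnitude and re-estimate their values by least squares on the retained support.
K = 3;
[~, order] = sort(abs(z), 'descend');
support = sort(order(1:K));
z_pruned = zeros(size(z));
% Re-estimate the retained coefficients by least squares
z_pruned(support) = Phi(:, support) \ y;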
Orthogonal Matching Pursuit¶
The OMP Algorithm¶
Orthogonal Matching Pursuit (OMP) addresses some of the limitations of Matching Pursuit. In particular, in each iteration:
- The current estimate is computed by performing a least squares estimation on the subdictionary formed by atoms selected so far.
- It ensures that the residual is totally orthogonal to already selected atoms.
- It also means that an atom is selected only once.
- Further, if all the atoms in the support are selected by OMP correctly, then the least squares estimate is able to achieve perfect recovery. The residual becomes 0.
- In other words, if OMP is recovering a K-sparse representation, then it can recover it in exactly K iterations (if in each iteration it recovers one atom correctly).
- OMP performs far better than MP in terms of the set of signals it can recover correctly.
- At the same time, OMP is a much more complex algorithm (due to the least squares step).

Orthogonal Matching Pursuit
The core OMP algorithm is presented above. The algorithm is iterative.
- We start with the initial estimate of solution as \(x=0\).
- We also maintain the support of \(x\) i.e. the set of indices for which \(x\) is non-zero in a variable \(\Lambda\). We start with an empty support.
- In each (\(k\)-th) iteration we attempt to reduce the difference between the actual signal \(y\) and the approximate signal based on current solution \(x^{k}\) given by \(r^{k} = y - \Phi x^{k}\).
- We do this by choosing a new index in \(x\) given by \(\lambda^{k+1}\) for the column \(\phi_{\lambda^{k+1}}\) which most closely matches our current residual.
- We include this to our support for \(x\), estimate new solution vector \(x^{k+1}\) and compute new residual.
- We stop when the residual magnitude is below a threshold \(\epsilon\) defined by us.
Each iteration of the algorithm consists of the following stages:
Match For each column \(\phi_j\) in our dictionary, we measure the projection of residual from previous iteration on the column
Identify We identify the atom with largest inner product and store its index in the variable \(\lambda^{k+1}\).
Update support We include \(\lambda^{k+1}\) in the support set \(\Lambda^{k}\).
Update representation In this step we find the solution of minimizing \(\| \Phi x - y \|^2\) over the support \(\Lambda^{k+1}\) as our next candidate solution vector.
By keeping \(x_i = 0\) for \(i \notin \Lambda^{k+1}\) we are essentially leaving out corresponding columns \(\phi_i\) from our calculations.
Thus we pick up only the columns specified by \(\Lambda^{k+1}\) from \(\Phi\). Let us call this matrix \(\Phi_{\Lambda^{k+1}}\). The size of this matrix is \(N \times | \Lambda^{k+1} |\). Let us call the corresponding sub-vector \(x_{\Lambda^{k+1}}\).
E.g. suppose \(D=4\), then \(\Phi = \begin{bmatrix} \phi_1 & \phi_2 & \phi_3 & \phi_4 \end{bmatrix}\). Let \(\Lambda^{k+1} = \{1, 4\}\). Then \(\Phi_{\Lambda^{k+1}} = \begin{bmatrix} \phi_1 & \phi_4 \end{bmatrix}\) and \(x_{\Lambda^{k+1}} = (x_1, x_4)\).
Our minimization problem then reduces to minimizing \(\|\Phi_{\Lambda^{k+1}} x_{\Lambda^{k+1}} - y \|_2\).
We use standard least squares estimate for getting the coefficients for \(x_{\Lambda^{k+1}}\) over these indices. We put back \(x_{\Lambda^{k+1}}\) to obtain our new solution estimate \(x^{k+1}\).
In the running example after obtaining the values \(x_1\) and \(x_4\), we will have \(x^{k+1} = (x_1, 0 , 0, x_4)\).
The solution to this minimization problem is given by
\[\Phi_{\Lambda^{k+1}}^H ( \Phi_{\Lambda^{k+1}}x_{\Lambda^{k+1}} - y ) = 0 \implies x_{\Lambda^{k+1}} = ( \Phi_{\Lambda^{k+1}}^H \Phi_{\Lambda^{k+1}} )^{-1} \Phi_ {\Lambda^{k+1}}^H y.\]Interestingly, we note that \(r^{k+1} = y - \Phi x^{k+1} = y - \Phi_{\Lambda^{k+1}} x_{\Lambda^{k+1}}\), thus
\[\Phi_{\Lambda^{k+1}}^H r^{k+1} = 0\]which means that the columns of \(\Phi\) which are part of the support \(\Lambda^{k+1}\) are necessarily orthogonal to the residual \(r^{k+1}\). This implies that these columns will not be considered in the coming iterations for extending the support. This orthogonality is the reason behind the name of the algorithm, OMP.
Update residual We finally update the residual vector to \(r^{k+1}\) based on new solution vector estimate.
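The stages above translate directly into code. The following is a minimal sketch of OMP in MATLAB (a hypothetical function name, not the library's optimized implementation), assuming Phi has unit norm columns, y is the signal, K is the target sparsity and epsilon is the residual threshold:
function x = omp_sketch(Phi, y, K, epsilon)
    D = size(Phi, 2);
    x = zeros(D, 1);
    Lambda = [];                            % current support
    r = y;                                  % current residual
    for k = 1:K
        h = Phi' * r;                       % match
        [~, lambda] = max(abs(h));          % identify
        Lambda = [Lambda lambda];           % update support
        x_Lambda = Phi(:, Lambda) \ y;      % least squares over the support
        x = zeros(D, 1);
        x(Lambda) = x_Lambda;               % update representation
        r = y - Phi(:, Lambda) * x_Lambda;  % update residual
        if norm(r) < epsilon
            break;
        end
    end
end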
Hands-on with Orthogonal Matching Pursuit¶
Let us consider a synthesis matrix of size \(10 \times 20\). Thus \(N=10\) and \(D=20\). In order to fit into the display, we will present the matrix in two 10 column parts.
You may verify that each column is unit norm.
It is known that \(\Rank(\Phi) = 10\) and \(\spark(\Phi)= 6\). Thus if a signal \(y\) has a \(2\) sparse representation in \(\Phi\) then the representation is necessarily unique.
We now consider a signal \(y\) given by
For saving space, we have written it as an n-tuple over two rows. You should treat it as a column vector of size \(10 \times 1\).
It is known that the vector has a two sparse representation in \(\Phi\). Let us go through the steps of OMP and see how it works.
In step 0, \(r^0= y\), \(x^0 = 0\), and \(\Lambda^0 = \EmptySet\).
We now compute absolute value of inner product of \(r^0\) with each of the columns. They are given by
We quickly note that the maximum occurs at index 7 with value 11.
We modify our support to \(\Lambda^1 = \{ 7 \}\).
We now solve the least squares problem
The solution gives us \(x_7 = 11.00\). Thus we get
Again note that to save space we have presented \(x\) over two rows. You should consider it as a \(20 \times 1\) column vector.
This leaves us the residual as
We can cross check that the residual is indeed orthogonal to the columns already selected, for
Next we compute inner product of \(r^1\) with all the columns in \(\Phi\) and take absolute values. They are given by
We quickly note that the maximum occurs at index 13 with value \(4.8\).
We modify our support to \(\Lambda^2 = \{ 7, 13 \}\).
We now solve the least squares problem
This gives us \(x_7 = 10\) and \(x_{13} = -5\).
Thus we get
Finally the residual we get at step 2 is
The magnitude of the residual is very small. We conclude that our OMP algorithm has converged and we have recovered the exact 2-sparse representation of \(y\) in \(\Phi\).
Exact recovery conditions¶
Recall the \((\mathcal{D}, K)\)-exact-sparse problem discussed in Sparse approximation problem. OMP is a good and fast algorithm for solving this problem.
In terms of theoretical understanding, it is quite useful to know of certain conditions under which a sparse representation can be exactly recovered from a given signal using OMP. Such guarantees are known as exact recovery guarantees.
In this section, following Tropp in [Tro04], we will closely look at some conditions under which OMP is guaranteed to recover the solution for \((\mathcal{D}, K)\)-exact-sparse problem.
We rephrase the OMP algorithm following the conventions in \((\mathcal{D}, K)\)-exact-sparse problem.

It is known that \(x = \Phi \alpha\) where \(\alpha\) contains at most \(K\) non-zero entries. Both the support and entries of \(\alpha\) are known. OMP is only given \(\Phi\), \(x\) and \(K\) and is estimating \(\alpha\). The estimate returned by OMP is denoted as \(\widehat{\alpha}\).
Let \(\Lambda_{\text{opt}} = \supp(\alpha)\) be the set of indices at which optimal representation \(\alpha\) has non-zero entries. Then we can write
From the synthesis matrix \(\Phi\) we can extract a \(N \times K\) matrix \(\Phi_{\text{opt}}\) whose columns are indexed by \(\Lambda_{\text{opt}}\).
where \(\lambda_i \in \Lambda_{\text{opt}}\). Thus, we can also write
where \(\alpha_{\text{opt}} \in \CC^K\) is a vector of \(K\) complex entries.
Now the columns of optimum \(\Phi_{\text{opt}}\) are linearly independent. Hence \(\Phi_{\text{opt}}\) has full column rank.
We define another matrix \(\Psi_{\text{opt}}\) whose columns are the remaining \(D - K\) columns of \(\Phi\). Thus \(\Psi_{\text{opt}}\) consists of atoms or columns which do not participate in the optimum representation of \(x\).
OMP starts with an empty support. In every step, it picks up one column from \(\Phi\) and adds to the support of approximation. If we can ensure that it never selects any column from \(\Psi_{\text{opt}}\) we will be guaranteed that correct \(K\) sparse representation is recovered.
We will use mathematical induction and assume that OMP has succeeded in its first \(k\) steps and has chosen \(k\) columns from \(\Phi_{\text{opt}}\) so far. At this point it is left with the residual \(r^k\).
In \((k+1)\)-th iteration, we compute inner product of \(r^k\) with all columns in \(\Phi\) and choose the column which has highest inner product.
We note that maximum value of inner product of \(r^k\) with any of the columns in \(\Psi_{\text{opt}}\) is given by
Correspondingly, maximum value of inner product of \(r^k\) with any of the columns in \(\Phi_{\text{opt}}\) is given by
Since we have already shown that \(r^k\) is orthogonal to the columns already chosen, those columns will not contribute to this maximum.
In order to make sure that none of the columns in \(\Psi_{\text{opt}}\) is selected, we need
We define a ratio
This ratio is known as greedy selection ratio.
We can see that as long as \(\rho(r^k) < 1\), OMP will make the right decision at the \((k+1)\)-th stage. If \(\rho(r^k) = 1\), then there is no guarantee that OMP will make the right decision. We will pessimistically assume that OMP makes the wrong decision in such situations.
We note that this definition of \(\rho(r^k)\) looks very similar to matrix \(p\)-norms defined in p-norm for matrices. It is suggested to review the properties of \(p\)-norms for matrices at this point.
We now present a condition which guarantees that \(\rho(r^k) < 1\) is always satisfied.
A sufficient condition for Orthogonal Matching Pursuit to resolve \(x\) completely in \(K\) steps is that
where \(\psi\) ranges over columns in \(\Psi_{\text{opt}}\).
Moreover, Orthogonal Matching Pursuit is a correct algorithm for \((\mathcal{D}, K)\)-exact-sparse problem whenever the condition holds for every superposition of \(K\) atoms from \(\DD\).
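For reference, the condition referred to as (2) above is the exact recovery condition from [Tro04]; in the present notation it reads
\[\max_{\psi} \| \Phi_{\text{opt}}^{\dag} \psi \|_1 < 1.\]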
In (2), \(\Phi_{\text{opt}}^{\dag}\) is the pseudo-inverse of \(\Phi_{\text{opt}}\).
What we need to show is that if (2) holds, then \(\rho(r^k)\) will always be less than 1.
We note that the projection operator for the column span of \(\Phi_{\text{opt}}\) is given by
We also note that by assumption since \(x \in \ColSpace(\Phi_{\text{opt}})\) and the approximation at the \(k\)-th step, \(x^k = \Phi \alpha^k \in \ColSpace(\Phi_{\text{opt}})\), hence \(r^k = x - x^k\) also belongs to \(\ColSpace(\Phi_{\text{opt}})\).
Thus
i.e. applying the projection operator for \(\Phi_{\text{opt}}\) on \(r^k\) doesn’t change it.
Using this we can rewrite \(\rho(r^k)\) as
We see \(\Phi_{\text{opt}}^H r^k\) appearing both in numerator and denominator.
Now consider the matrix \(\Psi_{\text{opt}}^H (\Phi_{\text{opt}}^{\dag})^H\) and recall the definition of matrix \(\infty\)-norm from here
Thus
which gives us
Finally we recall that \(\| A\|_{\infty}\) is max row sum norm while \(\| A\|_1\) is max column sum norm. Hence
which means
Thus
Now the columns of \(\Phi_{\text{opt}}^{\dag} \Psi_{\text{opt}}\) are nothing but \(\Phi_{\text{opt}}^{\dag} \psi\) where \(\psi\) ranges over columns of \(\Psi_{\text{opt}}\).
Thus in terms of max column sum norm
Thus, assuming that OMP has made \(k\) correct decisions and \(r^k\) lies in \(\ColSpace( \Phi_{\text{opt}})\), \(\rho(r^k) < 1\) whenever
Finally, the initial residual is \(r^0 = x\), which always lies in the column space of \(\Phi_{\text{opt}}\). By the above logic, OMP will always select an optimal column in each step. Since the residual is always orthogonal to the columns already selected, it will never select the same column twice. Thus in \(K\) steps it will retrieve all \(K\) atoms which comprise \(x\).
Babel function estimates¶
There is a small problem with this result. Since we don’t know the support a priori, it is not possible to verify that
holds. Of course, verifying this for all \(K\)-column sub-matrices is computationally prohibitive.
It turns out that the Babel function (recall Babel function) comes to the rescue. We show how the Babel function guarantees that the exact recovery condition for OMP holds.
Suppose that \(\mu_1\) is the Babel function for a dictionary \(\DD\) with synthesis matrix \(\Phi\). The exact recovery condition holds whenever
Thus, Orthogonal Matching Pursuit is a correct algorithm for \((\mathcal{D}, K)\)-exact-sparse problem whenever (3) holds.
In other words, for sufficiently small \(K\) for which (3) holds, OMP will recover any arbitrary superposition of \(K\) atoms from \(\DD\).
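For reference, the condition referred to as (3) above is the Babel function bound from [Tro04]; in the present notation it reads
\[\mu_1(K) + \mu_1(K - 1) < 1.\]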
We can write
We recall from here that operator-norm is subordinate i.e.
Thus with \(A = (\Phi_{\text{opt}}^H \Phi_{\text{opt}})^{-1}\) we have
With this we have
Now let us look at \(\| \Phi_{\text{opt}}^H \psi \|_1\) closely. There are \(K\) columns in \(\Phi_{\text{opt}}\). For each column we compute its inner product with \(\psi\), and then take the sum of the absolute values of these inner products.
Also recall the definition of Babel function:
Clearly
We also need to provide a bound on \(\| (\Phi_{\text{opt}}^H \Phi_{\text{opt}})^{-1} \|_1\) which requires more work.
First note that since all columns in \(\Phi\) are unit norm, hence the diagonal of \(\Phi_{\text{opt}}^H \Phi_{\text{opt}}\) contains unit entries. Thus we can write
where \(A\) contains the off diagonal terms in \(\Phi_{\text{opt}}^H \Phi_{\text{opt}}\).
Looking carefully, each column of \(A\) lists the inner products between one atom of \(\Phi_{\text{opt}}\) and the remaining \(K-1\) atoms. By the definition of the Babel function
Now, whenever \(\| A \|_1 < 1\), the Neumann series \(\sum (-A)^k\) converges to the inverse \((I_K + A)^{-1}\).
Thus we have
Thus putting things together we get
Thus whenever
we have
Sparse approximation conditions¶
We now remove the assumption that \(x\) is \(K\)-sparse in \(\Phi\). This is the typical situation for real-life signals, which are rarely exactly sparse.
In this section we will look at conditions under which OMP can successfully solve the \((\mathcal{D}, K)\)-sparse approximation problem.

Let \(x\) be an arbitrary signal and suppose that \(\alpha_{\text{opt}}\) is an optimum \(K\)-term approximation representation of \(x\). i.e. \(\alpha_{\text{opt}}\) is a solution to (3) and the optimal \(K\)-term approximation of \(x\) is given by
We note that \(\alpha_{\text{opt}}\) may not be unique.
Let \(\Lambda_{\text{opt}}\) be the support of \(\alpha_{\text{opt}}\) which identifies the atoms in \(\Phi\) that participate in the \(K\)-term approximation of \(x\).
Once again let \(\Phi_{\text{opt}}\) be the sub-matrix consisting of columns of \(\Phi\) indexed by \(\Lambda_{\text{opt}}\).
We assume that columns in \(\Phi_{\text{opt}}\) are linearly independent. This is easily established since if any atom in this set were linearly dependent on other atoms, we could always find a more optimal solution.
Again let \(\Psi_{\text{opt}}\) be the matrix of \((D - K)\) columns which are not indexed by \(\Lambda_{\text{opt}}\).
We note that if \(\Lambda_{\text{opt}}\) is identified then finding \(\alpha_{\text{opt}}\) is a straightforward least squares problem.
We now present a condition under which Orthogonal Matching Pursuit is able to recover the optimal atoms.
Assume that \(\mu_1(K) < \frac{1}{2}\), and suppose that at the \(k\)-th iteration, the support \(S^k\) for \(\alpha^k\) consists only of atoms from an optimal \(k\)-term approximation of the signal \(x\). At step \((k+1)\), Orthogonal Matching Pursuit will recover another atom indexed by \(\Lambda_{\text{opt}}\) whenever
A few remarks are in order.
- \(\| x - \Phi \alpha^k \|_2\) is the approximation error norm at \(k\)-th iteration.
- \(\| x - \Phi \alpha_{\text{opt}}\|_2\) is the optimum approximation error after \(K\) iterations.
- The theorem says that OMP makes progress (recovers another optimal atom) whenever the current approximation error is larger than the optimum error by a certain factor.
- As a result of this theorem, we note that every optimal \(K\)-term approximation of \(x\) contains the same kernel of atoms. The optimum error is always independent of choice of atoms in \(K\) term approximation (since it is optimum). Initial error is also independent of choice of atoms (since initial support is empty). OMP always selects the same set of atoms by design.
Let us assume that after \(k\) steps, OMP has recovered an approximation \(x^k\) given by
where \(S^k = \supp(\alpha^k)\) chooses \(k\) columns from \(\Phi\) all of which belong to \(\Phi_{\text{opt}}\).
Let the residual at \(k\)-th stage be
Recalling from previous section, a sufficient condition for recovering another optimal atom is
One difference from previous section is that \(r^k \notin \ColSpace(\Phi_{\text{opt}})\).
We can write
Note that \((x - x_{\text{opt}})\) is nothing but the residual left after \(K\) iterations.
We also note that since residual in OMP is always orthogonal to already selected columns, hence
We will now use these expressions to simplify \(\rho(r^k)\).
We now define two new terms
and
With these we have
Now \(x_{\text{opt}}\) has an exact \(K\)-term representation in \(\Phi\) given by \(\alpha_{\text{opt}}\). Hence \(\rho_{\text{opt}}(r^k)\) is nothing but \(\rho(r^k)\) for corresponding exact-sparse problem.
From the proof of here we recall
since
The remaining problem is \(\rho_{\text{err}}(r^k)\). Let us look at its numerator and denominator one by one.
\(\| \Psi_{\text{opt}}^H (x - x_{\text{opt}})\|_{\infty}\) essentially is the maximum (absolute) inner product between any column in \(\Psi_{\text{opt}}\) with \(x - x_{\text{opt}}\).
We can write
since all columns in \(\Phi\) are unit norm. In between we used Cauchy-Schwartz inequality.
Now look at denominator \(\| \Phi_{\text{opt}}^H (x_{\text{opt}} - x^k) \|_{\infty}\) where \((x_{\text{opt}} - x^k) \in \CC^N\) and \(\Phi_{\text{opt}} \in \CC^{N \times K}.\) Thus
Now for every \(v \in \CC^K\) we have
Hence
Since \(\Phi_{\text{opt}}\) has full column rank, hence its singular values are non-zero. Thus
From here we have
Combining these observations we get
Now from (2) \(\rho(r^k) <1\) whenever \(\rho_{\text{opt}}(r^k) + \rho_{\text{err}}(r^k) < 1\).
Thus a sufficient condition for \(\rho(r^k) <1\) can be written as
We need to simplify this expression a bit. Multiplying by \((1 - \mu_1(K))\) on both sides we get
We assumed \(\mu_1(K) < \frac{1}{2}\) thus \(1 - 2 \mu_1(K) > 0\) which validates the steps above.
Finally we remember that \((x - x_{\text{opt}}) \perp \ColSpace(\Phi_{\text{opt}})\) and \((x_{\text{opt}} - x^k) \in \ColSpace(\Phi_{\text{opt}})\) thus \((x - x_{\text{opt}})\) and \((x_{\text{opt}} - x^k)\) are orthogonal to each other. Thus by applying Pythagorean theorem we have
Thus we have
This gives us a sufficient condition
i.e. whenever (3) holds true, we have \(\rho(r^k) < 1\) which leads to OMP making a correct choice and choosing an atom from the optimal set.
Putting \(x^k = \Phi \alpha^k\) and \(x_{\text{opt}} = \Phi \alpha_{\text{opt}}\) we get back (1) which is the desired result.
This result establishes that as long as (1) holds for each of the steps from 1 to \(K\), OMP will recover a \(K\) term optimum approximation \(x_{\text{opt}}\). If \(x \in \CC^N\) is completely arbitrary, then it may not be possible that (1) holds for all the \(K\) iterations. In this situation, a question arises as to what is the worst \(K\)-term approximation error that OMP will incur if (1) doesn’t hold true all the way.
This is answered in the following corollary of the previous theorem.
Assume that \(\mu_1(K) < \frac{1}{2}\) and let \(x \in \CC^N\) be a completely arbitrary signal. Orthogonal Matching Pursuit produces a \(K\)-term approximation \(x^K\) which satisfies
where \(x_{\text{opt}}\) is the optimum \(K\)-term approximation of \(x\) in dictionary \(\DD\) (i.e. \(x_{\text{opt}} = \Phi \alpha_{\text{opt}}\) where \(\alpha_{\text{opt}}\) is an optimal solution of (3) ). \(C(\DD, K)\) is a constant depending upon the dictionary \(\DD\) and the desired sparsity level \(K\). An estimate of \(C(\DD, K)\) is given by
Suppose that OMP runs fine for first \(p\) steps where \(p < K\). Thus (1) keeps holding for first \(p\) steps. We now assume that (1) breaks at step \(p+1\) and OMP is no longer guaranteed to make an optimal choice of column from \(\Phi_{\text{opt}}\). Thus at step \(p+1\) we have
Any further iterations of OMP will only reduce the error further (although not in an optimal way). This gives us
Choosing
we can rewrite this as
This is a very useful result. It establishes that even if OMP is not able to recover the optimum \(K\)-term representation of \(x\), it always constructs an approximation whose error lies within a constant factor of optimum approximation error where the constant factor is given by \(\sqrt{1 + C(\DD, K)}\).
If the optimum approximation error \(\| x - x_{\text{opt}} \|_2\) is small, then \(\| x - x^K \|_2\) will also not be too large.
If \(\| x - x_{\text{opt}} \|_2\) is moderate, then OMP may inflate the approximation error to a higher value. But in this case, sparse approximation is probably not the right tool for signal representation over the dictionary.
Fast Implementation of OMP¶
As part of sparse-plex, we provide a fast CPU based implementation of OMP. It is up to 4 times faster than the OMP implementation in OMPBOX.
This is written in C and uses the BLAS and LAPACK features available in MATLAB. The implementation is available in the function spx.fast.omp. The corresponding C code is in omp.c.
For a \(100 \times 1000\) sensing matrix, the implementation can recover sparse representations for each signal in a few hundred microseconds (depending upon the number of non-zero coefficients in the sparse representation and hence the sparsity level) on an Intel i7 2.4 GHz laptop with 16 GB RAM.
Read Building MATLAB Extensions for how to build the mex files for fast OMP implementation.
A Simple Example¶
Let’s create a Gaussian sensing matrix:
M = 100;
N = 1000;
A = spx.dict.simple.gaussian_mtx(M, N);
See Hands on with Gaussian sensing matrices for details.
Let’s create 1000 sparse signals with sparsity 7:
S = 1000;
K = 7;
gen = spx.data.synthetic.SparseSignalGenerator(N, K, S);
X = gen.biGaussian();
See Generation of synthetic sparse representations for details.
Let’s compute their measurements using the Gaussian matrix:
Y = A*X;
Let’s recover the representations from the measurements:
start_time = tic;
result = spx.fast.omp(A, Y, K, 1e-12);
elapsed_time = toc(start_time);
fprintf('Time taken: %.2f seconds\n', elapsed_time);
fprintf('Per signal time: %.2f usec', elapsed_time * 1e6/ S);
Time taken: 0.17 seconds
Per signal time: 169.56 usec
See The OMP Algorithm for a review of OMP algorithm.
We are taking just 169 microseconds per signal.
Let’s verify that all the signals have been recovered correctly:
cmpare = spx.commons.SparseSignalsComparison(X, result, K);
cmpare.summarize();
Signal dimension: 1000
Number of signals: 1000
Combined reference norm: 159.67639347
Combined estimate norm: 159.67639347
Combined difference norm: 0.00000000
Combined SNR: 307.9221 dB
All signals have indeed been recovered correctly.
See Comparing sparse or approximately sparse signals for details about SparseSignalsComparison.
Example code can be downloaded here.
Benchmarks¶
OS: Windows 7 Professional 64 Bit
Processor: Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
Memory (RAM): 16.0 GB
Hard Disk: SATA 120GB
MATLAB: R2017b
The method for benchmarking has been adopted from the file ompspeedtest.m in the OMPBOX package by Ron Rubinstein.
We compare the following algorithms:
- The Cholesky decomposition based OMP implementation in OMPBOX.
- Our C version in sparse-plex.
The workload consists of a Gaussian dictionary of size \(512 \times 1000\). Enough signals are chosen so that the benchmarks run for a reasonable duration. An 8-sparse representation is constructed for each randomly generated signal in the given dictionary.
Speed summary for 6917 signals, dictionary size 512 x 1000:
Call syntax Algorithm Total time
--------------------------------------------------------
OMP(D,X,[],T) OMP-Cholesky 16.65 seconds
SPX-OMP(D, X, T) SPX-OMP-Cholesky 4.29 seconds
Our implementation is close to 4 times faster.
The benchmark generation code is in ex_fast_omp_speed_test.m.
Batch OMP¶
In this section, we develop an efficient version of OMP known as Batch OMP [RZE08].
In OMP, given a signal \(\bar{y}\) and a dictionary \(\Phi\), our goal is to iteratively construct a sparse representation \(x\) such that \(\bar{y} \approx \Phi x\) satisfying either a target sparsity \(K\) of \(x\) or a target error \(\| \bar{y} - \Phi x\|_2 \leq \epsilon\). The algorithm picks an atom from \(\Phi\) in each iteration and computes a least squares estimate \(y\) of \(\bar{y}\) on the selected atoms. The residual \(r = \bar{y} - y\) is used to select the next atom by choosing the atom which matches best with the residual. Let \(I\) be the set of atoms selected in OMP after some iterations.
Recalling the OMP steps in the next iteration:
- Matching atoms with residuals: \(h = \Phi^T r\)
- Finding the new atom (best match with the residual): \(i = \underset{j}{\text{arg max}} \, | h_j |\)
- Support update: \(I = I \cup \{ i \}\)
- Least squares: \(x_I = \Phi_I^{\dag} \bar{y}\)
- Approximation update: \(y = \Phi_I x_I = \Phi_I \Phi_I^{\dag} \bar{y}\)
- Residual update: \(r = \bar{y} - y = (I - \Phi_I \Phi_I^{\dag}) \bar{y}\)
Batch OMP is useful when we are trying to reconstruct representations of multiple signals at the same time.
Least Squares in OMP using Cholesky Update¶
Here we review how the least squares step can be implemented efficiently using Cholesky updates.
In the following, we will denote
- the matrix \(\Phi^T \Phi\) by the symbol \(G\)
- the matrix \(\Phi^T \Phi_I\) by the symbol \(G_I\)
- the matrix \((\Phi_I^T \Phi_I)\) by the symbol \(G_{I, I}\).
Note that \(G_I\) is formed by taking the columns indexed by \(I\) from \(G\). The matrix \(G_{I, I}\) is formed by taking the rows and columns both indexed by \(I\) from \(G\).
We have
and
We can rewrite:
If we perform a Cholesky decomposition of the Gram matrix \(G_{I, I} = \Phi_I^T \Phi_I\) as \(G_{I, I} = L L^T\), then we have:
Solving this equation involves
- Computing \(b = \Phi_I^T \bar{y}\)
- Solving the triangular system \(L u = b\)
- Solving the triangular system \(L^T x_I = u\)
We also need an efficient way of computing \(L\). It so happens that the Cholesky decomposition \(G_{I, I} = L L^T\) can be updated incrementally in each iteration of OMP. Let
- \(I^k\) denote the index set of chosen atoms after \(k\) iterations.
- \(\Phi_{I^k}\) denote the corresponding subdictionary of chosen atoms.
- \(G_{I^k, I^k}\) denote the Gram matrix \(\Phi_{I^k}^T \Phi_{I^k}\).
- \(L^k\) denote the Cholesky factor of \(G_{I^k, I^k}\).
- \(i^k\) be the index of the atom chosen in the \(k\)-th iteration.
The Cholesky update process aims to compute \(L^k\) given \(L^{k-1}\) and \(i^k\). Note that we can write
Define \(v = \Phi_{I^{k-1}}^T \phi_{i^k}\). Note that \(\phi_{i^k}^T \phi_{i^k} = 1\) for dictionaries with unit norm columns. This gives us:
This can be solved to give us an equation for update of Cholesky decomposition:
where \(w\) is the solution of the triangular system \(L^{k - 1} w = v\).
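The following sketch (not the library code) shows both the Cholesky update and the subsequent least squares solve in MATLAB. Here Phi_prev holds the previously chosen atoms, L_prev is the Cholesky factor of their Gram matrix, phi_new is the newly chosen (unit norm) atom and y_bar is the signal:
% Incremental Cholesky update
v = Phi_prev' * phi_new;                 % correlations with the existing atoms
w = L_prev \ v;                          % triangular solve L w = v
L_new = [L_prev, zeros(size(L_prev, 1), 1); w', sqrt(1 - w' * w)];
% Least squares over the extended support using the Cholesky factor
Phi_I = [Phi_prev phi_new];
b = Phi_I' * y_bar;
u = L_new \ b;                           % forward substitution
x_I = L_new' \ u;                        % back substitution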
Removing residuals from the computation¶
An interesting observation on OMP is that the real goal of OMP is to identify the index set of atoms participating in the sparse representation of \(\bar{y}\). The computation of residuals is just a way of achieving the same. If the index set has been identified, then the sparse representation is given by \(x_I = \Phi_I^{\dag} \bar{y}\) with all other entries in \(x\) set to zero, and the sparse approximation \(y\) of \(\bar{y}\) is given by \(\Phi_I x_I\).
The selection of atoms doesn’t really need the residual explicitly. All it needs is a way to update the inner products of atoms in \(\Phi\) with the current residual. In this section, we will rewrite the OMP steps in a way that doesn’t require explicit computation of residual.
We begin with pre-computation of \(\bar{h} = \Phi^T \bar{y}\). This is the initial value of \(h\) (the inner products of atoms in dictionary with the current residual). This computation is anyway needed for OMP. Now, let’s expand the calculation of \(h\):
But \(\Phi_I^T \bar{y}\) is nothing but \(\bar{h}_I\). Thus,
This means that if \(\bar{h} = \Phi^T \bar{y}\) and \(G = \Phi^T \Phi\) have been precomputed, then \(h\) can be computed for each iteration without explicitly computing the residual.
If we are reconstructing just one signal, then the computation of \(G\) is very expensive. But, if we are reconstructing thousands of signals together in batch, computation of \(G\) is actually a minuscule factor in overall computation. This is the essential trick in Batch OMP algorithm.
There is one more issue to address. A typical halting criterion in OMP is the error-based stopping criterion which compares the norm of the residual with a threshold. If the residual norm goes below the threshold, we stop OMP. If the residual is not computed explicitly, then it becomes challenging to apply this criterion. However, there is a way out. In the following, let
- \(x_{I^k} = \Phi_{I^k}^{\dag} \bar{y}\) be the non-zero entries in the k-th sparse representation
- \(x^k\) denote the k-th sparse representation
- \(y^k\) be the k-th sparse approximation \(y^k = \Phi x^k = \Phi_{I^k} x_{I^k}\)
- \(r^k\) be the residual \(\bar{y} - y^k\).
We start by writing a residual update equation. We have:
Combining the two, we get:
Due to the orthogonality of the residual, we have \(\langle r^k, y^k \rangle = 0\). Using this property and a long derivation (in eq 2.8 of [RZE08]), we obtain the relationship:
We introduce the symbols \(\epsilon^k = \| r^k \|_2^2\) and \(\delta^k = (x^k)^T G x^k\). The previous equation reduces to:
Thus, we just need to keep track of the quantity \(\delta^k\). Note that \(\delta^0 = 0\) since the initial estimate \(x^0 = 0\) for OMP.
Recall that
which has already been computed for updating \(h\) and can be reused. So
which is a simple inner product.
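Putting these pieces together, the residual-free iteration for a single signal can be sketched as follows (a sketch only, not the library implementation, and using a plain least squares solve instead of the Cholesky updates described earlier). It assumes G = Phi'*Phi, h_bar = Phi'*y_bar and norm_y_sq = y_bar'*y_bar have been precomputed, with target sparsity K and squared-norm threshold tol:
h = h_bar;                    % correlations with the current residual
epsilon = norm_y_sq;          % squared norm of the current residual
delta_prev = 0;
I = [];
for k = 1:K
    [~, i] = max(abs(h));     % best matching atom
    I = [I i];                % update support
    x_I = G(I, I) \ h_bar(I); % least squares via the Gram matrix
    GIxI = G(:, I) * x_I;     % reused for both updates below
    h = h_bar - GIxI;         % updated correlations without the residual
    delta = x_I' * GIxI(I);   % delta^k = (x^k)^T G x^k
    epsilon = epsilon - delta + delta_prev;   % squared residual norm update
    delta_prev = delta;
    if epsilon < tol
        break;
    end
end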
The Batch OMP Algorithm¶
The batch OMP algorithm is described in the figure below.
The inputs are
- The Gram matrix \(G = \Phi^T \Phi\).
- The initial correlation vector \(\bar{h} = \Phi^T \bar{y}\).
- The squared norm \(\epsilon^0\) of the signal \(\bar{y}\) whose sparse representation we are constructing.
- The upper bound on the desired sparsity level \(K\)
- Residual norm (squared) threshold \(\epsilon\).
It returns the sparse representation \(x\).
Note that the algorithm doesn’t need direct access to either the dictionary \(\Phi\) or the signal \(\bar{y}\).

Note
The sparse vector \(x\) is usually returned as a pair of vectors \(I\) and \(x_I\). This is more efficient in terms of space utilization.
Fast Batch OMP Implementation¶
As part of sparse-plex, we provide a fast CPU based implementation of Batch OMP. It is up to 3 times faster than the Batch OMP implementation in OMPBOX.
This is written in C and uses the BLAS and LAPACK features available in MATLAB. The implementation is available in the function spx.fast.batch_omp. The corresponding C code is in batch_omp.c.
A Simple Example
Let’s create a Gaussian matrix (with normalized columns):
M = 400;
N = 1000;
Phi = spx.dict.simple.gaussian_mtx(M, N);
See Hands on with Gaussian sensing matrices for details.
Let’s create a few thousand sparse signals:
K = 16;
S = 5000;
X = spx.data.synthetic.SparseSignalGenerator(N, K, S).biGaussian();
See Generation of synthetic sparse representations for details.
Let’s compute their measurements using the Gaussian matrix:
Y = Phi*X;
We wish to recover \(X\) from \(Y\) and \(\Phi\).
Let’s precompute the Gram matrix:
G = Phi' * Phi;
Let’s precompute the correlation vectors for each signal:
DtY = Phi' * Y;
Let’s perform sparse recovery using Batch OMP and time it:
start_time = tic;
result = spx.fast.batch_omp(Phi, [], G, DtY, K, 1e-12);
elapsed_time = toc(start_time);
fprintf('Time taken: %.2f seconds\n', elapsed_time);
fprintf('Per signal time: %.2f usec', elapsed_time * 1e6/ S);
Time taken: 0.52 seconds
Per signal time: 103.18 usec
We note that the reconstruction has happened very quickly, taking just about 100 microseconds per signal.
We can verify the correctness of the result:
cmpare = spx.commons.SparseSignalsComparison(X, result, K);
cmpare.summarize();
Signal dimension: 1000
Number of signals: 5000
Combined reference norm: 536.04604784
Combined estimate norm: 536.04604784
Combined difference norm: 0.00000000
Combined SNR: 302.5784 dB
All signals have indeed been recovered correctly.
See Comparing sparse or approximately sparse signals for details about SparseSignalsComparison.
For comparison, let’s see the time taken by Fast OMP implementation:
fprintf('Reconstruction with Fast OMP\n');
start_time = tic;
result = spx.fast.omp(Phi, Y, K, 1e-12);
elapsed_time = toc(start_time);
fprintf('Time taken: %.2f seconds\n', elapsed_time);
fprintf('Per signal time: %.2f usec', elapsed_time * 1e6/ S);
Reconstruction with Fast OMP
Time taken: 4.39 seconds
Per signal time: 878.88 usec
See Fast Implementation of OMP for details about our fast OMP implementation.
Fast Batch OMP implementation is more than 8 times faster than fast OMP implementation for this problem configuration (M, N, K, S).
Benchmarks
OS: Windows 7 Professional 64 Bit
Processor: Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
Memory (RAM): 16.0 GB
Hard Disk: SATA 120GB
MATLAB: R2017b
The method for benchmarking has been adopted from the file ompspeedtest.m in the OMPBOX package by Ron Rubinstein.
We compare the following algorithms:
- Batch OMP in OMPBOX.
- Our C version in sparse-plex.
The workload consists of a Gaussian dictionary of size \(512 \times 1000\). Enough signals are chosen so that the benchmarks run for a reasonable duration. An 8-sparse representation is constructed for each randomly generated signal in the given dictionary.
Speed summary for 178527 signals, dictionary size 512 x 1000:
Call syntax Algorithm Total time
--------------------------------------------------------
OMP(D,X,G,T) Batch-OMP 60.83 seconds
OMP(DtX,G,T) Batch-OMP with DTX 12.73 seconds
SPX-Batch-OMP(D, X, G, [], T) SPX-Batch-OMP 19.78 seconds
SPX-Batch-OMP([], [], G, Dtx, T) SPX-Batch-OMP DTX 7.25 seconds
Gain SPX/OMPBOX without DTX 3.08
Gain SPX/OMPBOX with DTX 1.76
Our implementation is up to 3 times faster on this large workload.
The benchmark generation code is in ex_fast_batch_omp_speed_test.m.
Orthogonal least squares¶
Compressive sampling matching pursuit¶
Iterative hard thresholding¶
Hard thresholding pursuit¶
Framework for study of performance of pursuit algorithms¶
Experimental study of pursuit algorithms for sparse recovery needs the following components:
- Generation of synthetic sparse representations
- Generation of synthetic compressible representations
- Addition of measurement error
- Measurement of recovery error
- Phase transition diagrams
The sparse-plex library provides a wide variety of functions to help with the study of pursuit algorithms.
Generation of synthetic sparse representations¶
A sparse representation is constructed in an appropriate representation space.
The class spx.data.synthetic.SparseSignalGenerator provides various methods for generating synthetic sparse representations from different distributions.
- Uniform
- Bi-uniform
- Gaussian
- Complex Gaussian
- Rademacher
- Bi-Gaussian
- Real spherical rows
- Complex spherical rows
It takes the following parameters:
- \(N\): the dimension of the representation space
- \(K\): the sparsity level of representations
- \(S\) (optional): the number of sparse representations to generate with a common support. Default value is 1.
The generator first uniformly selects a random support of \(K\) indices from the index set \([1, N]\).
After that it provides various ways to generate the non-zero values.
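As an illustration of this two-step process, here is a minimal sketch using only stock MATLAB functions; it is not the library implementation, just the idea behind it:

% A minimal sketch of the two-step generation process (not the library implementation).
N = 32; K = 4;
% Step 1: uniformly select a random support of K indices from [1, N]
Omega = randperm(N, K);
% Step 2: fill the chosen locations with non-zero values, e.g. uniform in [0, 1]
x = zeros(N, 1);
x(Omega) = rand(K, 1);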
Uniform¶
We create the sparse signal generator instance:
N = 32;
K = 4;
gen = spx.data.synthetic.SparseSignalGenerator(N, K);
We generate a sparse vector:
rep = gen.uniform();
Let’s plot it:
stem(rep, '.');

Note that all non-zero entries are positive and they are distributed uniformly over \([0, 1]\).
We can easily identify the support for the representation:
>> spx.commons.sparse.support(rep)'
ans =
4 27 29 32
The \(\ell_0\)-“norm” can be calculated easily too:
>> spx.commons.sparse.l0norm(rep)
ans =
4
Let’s cross-check with the support used by the generator:
>> gen.Omega
ans =
27 29 4 32
By default, the non-zero values are chosen from the range \([0, 1]\).
We can specify a custom range \([a, b]\) by calling:
rep = gen.uniform(a, b);
Bi-uniform¶
The problem with the previous example is that all non-zero entries are positive. We would also like the sign of the non-zero entries to change with equal probability. This can be achieved using the bi-uniform generator; a minimal sketch of the idea follows the steps below.
- The non-zero values are generated using uniform distribution.
- A sign for each non-zero entry is chosen with equal probability.
- The sign is multiplied to the non-zero value.
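Here is that sketch (not the library implementation); it simply multiplies uniform magnitudes with random signs:

% A minimal sketch of the bi-uniform idea (not the library implementation).
K = 4;
magnitudes = rand(K, 1);              % uniform magnitudes in [0, 1]
signs = 2 * (rand(K, 1) > 0.5) - 1;   % +1 or -1 with equal probability
values = signs .* magnitudes;         % signed non-zero values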
The setup steps are the same:
N = 32;
K = 4;
gen = spx.data.synthetic.SparseSignalGenerator(N, K);
The representation generation step changes:
rep = gen.biUniform();
Plotting:
stem(rep, '.');

We will generate the magnitudes between 2 and 4:
rep = gen.biUniform(2, 4);
Plotting:
stem(rep, '.');

Gaussian¶
Let’s increase the dimensions of our representation space and sparsity level:
N = 128;
K = 8;
gen = spx.data.synthetic.SparseSignalGenerator(N, K);
Let’s generate non-zero entries using Gaussian distribution:
rep = gen.gaussian();
Plot it:
stem(rep, '.');

Bi-Gaussian¶
While Gaussian-distributed non-zero values take both signs, we can see that some of the non-zero values are very small. Such values are problematic for sparse recovery algorithms which do not handle very small coefficients well, or which require that the dynamic range between the large and small non-zero values is not too high. Small non-zero values are also problematic in the presence of noise, since they are hard to distinguish from the noise.
To address these concerns, we have a bi-Gaussian distribution.
The way it works is as follows:
- Generate non-zero values using Gaussian distribution.
- Let a value be \(x\).
- Let an offset \(\alpha > 0\) be given.
- If \(x > 0\), then \(x = x + \alpha\).
- If \(x < 0\), then \(x = x - \alpha\).
The default value of the offset is 1. A minimal sketch of this construction is shown below.
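The following sketch (not the library implementation) generates \(K\) Gaussian values and pushes them away from zero by an offset \(\alpha\):

% A minimal sketch of the bi-Gaussian construction (not the library implementation).
K = 8;
alpha = 1;                                % offset keeping values away from zero
values = randn(K, 1);                     % Gaussian non-zero values
values = values + alpha * sign(values);   % push positive values up, negative values down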
Setup:
N = 128;
K = 8;
gen = spx.data.synthetic.SparseSignalGenerator(N, K);
Generating the representation vector:
rep = gen.biGaussian();
Plot it:
stem(rep, '.');

Let’s pick the non-zero values from this vector:
>> nz_rep = rep(rep~=0)'; nz_rep
nz_rep =
-1.0631 -2.3499 1.7147 -1.2050 1.7254 4.5784 4.0349 3.7694
Let’s estimate the dynamic range:
>> anz_rep = abs(nz_rep);
>> dr = max(anz_rep) / min(anz_rep)
dr =
4.3068
The bi-Gaussian distribution is quite flexible.
- The non-zero values are both positive and negative.
- Quite large non-zero values are possible (though rare).
- Very small values are excluded.
- The dynamic range between the largest and smallest non-zero values stays limited.
Rademacher¶
Sometimes, you want a sparse representation where the non-zero values are either \(+1\) or \(-1\). In this case, the non-zero values should be drawn from the Rademacher distribution.
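A minimal sketch of drawing such values (not the library implementation):

% Rademacher distributed values: +1 or -1 with equal probability.
K = 8;
values = 2 * (rand(K, 1) > 0.5) - 1;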
Setup:
N = 128;
K = 8;
gen = spx.data.synthetic.SparseSignalGenerator(N, K);
Generating Rademacher distributed non-zero values:
rep = gen.rademacher();
Plot it:
stem(rep, '.');

Generating compressible signals¶
[Cev09] describes a set of probability distributions, dubbed compressible priors, whose independent and identically distributed realizations result in p-compressible signals.
The authors provided a MATLAB function randcs.m for generating compressible signals. It is included in sparse-plex.
Subspace Clustering¶
Introduction¶
High dimensional data-sets are now pervasive in various signal processing applications. For example, high resolution surveillance cameras are now commonplace generating millions of images continually. A major factor in the success of current generation signal processing algorithms is the fact that, even though these data-sets are high dimensional, their intrinsic dimension is often much smaller than the dimension of the ambient space.
One resorts to inferring (or learning) a quantitative model \(\mathbb{M}\) of a given set of data points \(Y = \{ y_1, \dots, y_S\} \subset \RR^M\). Such a model enables us to obtain a low dimensional representation of a high dimensional data set. The low dimensional representations enable efficient implementation of acquisition, compression, storage, and various statistical inferencing tasks without losing significant precision. There is no such thing as a perfect model. Rather, we seek a model \(\mathbb{M}^*\) that is best amongst a restricted class of models \(\mathcal{M} = \{ \mathbb{M} \}\) which is rich enough to describe the data set to a desired accuracy yet restricted enough so that selecting the best model is tractable.
In the absence of training data, the problem of modeling falls into the category of unsupervised learning. There are two common viewpoints of data modeling. A statistical viewpoint assumes that data points are random samples from a probabilistic distribution. Statistical models try to learn the distribution from the dataset. In contrast, a geometrical viewpoint assumes that data points belong to a geometrical object (a smooth manifold or a topological space). A geometrical model attempts to learn the shape of the object to which the data points belong. Examples of statistical modeling include maximum likelihood, maximum a posteriori estimates, Bayesian models etc. An example of geometrical models is Principal Component Analysis (PCA) which assumes that data is drawn from a low dimensional subspace of the high dimensional ambient space. PCA is simple to implement and has found tremendous success in different fields e.g., pattern recognition, data compression, image processing, computer vision, etc. (PCA can also be viewed as a statistical model: when the data points are independent samples drawn from a Gaussian distribution, the geometric formulation of PCA coincides with its statistical formulation.)
The assumption that all the data points in a data set are drawn from a single model is, however, often too restrictive. In practice, it often occurs that if we group or segment the data set \(Y\) into multiple disjoint subsets: \(Y = Y_1 \cup \dots \cup Y_K\), then each subset can be modeled sufficiently well by a model \(\mathbb{M}_k^*\) (\(1 \leq k \leq K\)) chosen from a simple model class. Each model \(\mathbb{M}_k^*\) is called a primitive or component model. In this sense, the data set \(Y\) is called a mixed dataset and the collection of primitive models is called a hybrid model for the dataset. Let us look at some examples of mixed data sets.
Consider the problem of vanishing point detection in computer vision. Under perspective projection, a group of parallel lines pass through a common point in the image plane which is known as the vanishing point for the group. For a typical scene consisting of multiple sets of parallel lines, the problem of detecting all vanishing points in the image plane from the set of edge segments (identified in the image) can be transformed into clustering points (in edge segments) into multiple 2D subspaces in \(\RR^3\) (world coordinates of the scene).
In the motion segmentation problem, an image sequence consisting of multiple moving objects is segmented so that each segment consists of motion from only one object. This is a fundamental problem in applications such as motion capture, vision based navigation, target tracking and surveillance. We first track the trajectories of feature points (from all objects) over the image sequence. It has been shown (see here) that trajectories of feature points for rigid motion of a single object form a low dimensional subspace. Thus, the motion segmentation problem can be solved by segmenting the feature point trajectories for different objects separately and estimating the motion of each object from its corresponding trajectories.
In a face clustering problem, we have a collection of unlabeled images of different faces taken under varying illumination conditions. Our goal is to cluster the images so that the images of the same face fall into one group. For a Lambertian object, it has been shown that the set of images taken under different lighting conditions forms a cone in the image space. This cone can be well approximated by a low-dimensional subspace [BJ03][HYL+03]. The images of the face of each person form one low dimensional subspace and the face clustering problem reduces to clustering the collection of images into multiple subspaces.
As the examples above suggest, a typical hybrid model for a mixed data set consists of multiple primitive models where each primitive is a (low dimensional) subspace. The data set is modeled as being sampled from a collection or arrangement \(\UUU\) of linear (or affine) subspaces \(\UUU_k \subset \RR^M\) : \(\UUU = \{ \UUU_1 , \dots , \UUU_K \}\). The union of the subspaces (in the sequel, we use the terms arrangement and union interchangeably; for more discussion see here) is denoted as \(Z_{\UUU} = \UUU_1 \cup \dots \cup \UUU_K\). This is indeed a geometric model. In such modeling problems, the individual subspaces (dimension and orientation of each subspace and total number of subspaces) and the membership of a data point (a single image in the face clustering problem) to a particular subspace are unknown beforehand. This entails the need for algorithms which can simultaneously identify the subspaces involved and cluster/segment the data points from individual subspaces into separate groups. This problem is known as subspace clustering, which is the focus of this paper. An earlier detailed introduction to subspace clustering can be found in [Vid10].
An example of a statistical hybrid model is a Gaussian Mixture Model (GMM) where one assumes that the sample points are drawn independently from a mixture of Gaussian distributions. A way of estimating such a mixture model is the expectation maximization (EM) method.
The fundamental difficulty in the estimation of hybrid models is the “chicken-and-egg” relationship between data segmentation and model estimation. If the data segmentation was known, one could easily fit a primitive model to each subset. Alternatively, if the constituent primitive models were known, one could easily segment the data by choosing the best model for each data point. An iterative approach starts with an initial (hopefully good) guess of primitive models or data segments. It then alternates between estimating the models for each segment and segmenting the data based on current primitive models till the solution converges. On the contrary, a global algorithm can perform the segmentation and primitive modeling simultaneously. In the sequel, we will look at a variety of algorithms for solving the subspace clustering problem.
Notation and problem formulation¶
First some general notation for vectors and matrices. For a vector \(v \in \RR^n\), its support is denoted by \(\supp(v)\) and is defined as \(\supp(v) \triangleq \{i : v_i \neq 0, 1 \leq i \leq n \}\). \(|v|\) denotes a vector obtained by taking the absolute values of entries in \(v\). \(\OneVec_n \in \RR^n\) denotes a vector whose each entry is \(1\). \(\| v \|_p\) denotes the \(\ell_p\) norm of \(v\). \(\| v \|_0\) denotes the \(\ell_0\)-“norm” of \(v\). Let \(A\) be any \(m \times n\) real matrix (\(A \in \RR^{m \times n}\)). \(a_{i, j}\) is the element at the \(i\)-th row and \(j\)-th column of \(A\). \(a_j\) with \(1 \leq j \leq n\) denotes the \(j\)-th column vector of \(A\). \(\underline{a}_i\) with \(1 \leq i \leq m\) denotes the \(i\)-th row vector of \(A\). \(a_{j,k}\) is the \(k\)-th entry in \(a_j\). \(\underline{a}_{i,k}\) is the \(k\)-th entry in \(\underline{a}_i\). \(A_{\Lambda}\) denotes a submatrix of \(A\) consisting of columns indexed by \(\Lambda \subset \{1, \dots, n \}\). \(\underline{A}_{\Lambda}\) denotes a submatrix of \(A\) consisting of rows indexed by \(\Lambda \subset \{1, \dots, m \}\). \(|A|\) denotes matrix consisting of absolute values of entries in \(A\).
\(\supp(A)\) denotes the index set of non-zero rows of \(A\). Clearly, \(\supp(A) \subseteq \{1, \dots, m\}\). \(\| A \|_{0}\) denotes the number of non-zero rows of \(A\). Clearly, \(\| A \|_{0} = |\supp(A)|\). We note that while \(\| A \|_{0}\) is not a norm, its behavior is similar to the \(\ell_0\)-“norm” for vectors \(v \in \RR^n\) defined as \(\| v \|_0 \triangleq | \supp(v) |\).
We use \(f(x)\) and \(F(x)\) to denote the PDF and CDF of a continuous random variable. We use \(p(x)\) to denote the PMF of a discrete random variable. We use \(\PP(E)\) to denote the probability of an event.
Problem formulation¶
The data set can be modeled as a set of data points lying in a union of low dimensional linear or affine subspaces in a Euclidean space \(\RR^M\) where \(M\) denotes the dimension of the ambient space. Let the data set be \(\{ y_j \in \RR^M \}_{j=1}^S\) drawn from the union of subspaces under consideration. \(S\) is the total number of data points being analyzed simultaneously. We put the data points together in a data matrix as \(Y = \begin{bmatrix} y_1 & y_2 & \dots & y_S \end{bmatrix}\).
The data matrix \(Y\), of course, is known to us.
We will slightly abuse the notation and let \(Y\) denote the set of data points \(\{ y_j \in \RR^M \}_{j=1}^S\) also. We will use the terms data points and vectors interchangeably in the sequel. Let the vectors be drawn from a set of \(K\) (linear or affine) subspaces. The number of subspaces may not be known in advance. The subspaces are indexed by a variable \(k\) with \(1 \leq k \leq K\). The \(k\)-th subspace is denoted by \(\UUU_k\). Let the (linear or affine) dimension of the \(k\)-th subspace be \(\dim(\UUU_k) = D_k\) with \(D_k \leq D\). Here \(D\) is an upper bound on the dimension of individual subspaces. We may or may not know \(D\). We assume that none of the subspaces is contained in another. A pair of subspaces may not intersect (e.g. parallel lines or planes), may have a trivial intersection (lines passing through the origin), or a non-trivial intersection (two planes intersecting at a line). The collection of subspaces may also be independent or disjoint.
The vectors in \(Y\) can be grouped (or segmented or clustered) as submatrices \(Y_1, Y_2, \dots, Y_K\) such that all vectors in \(Y_k\) lie in subspace \(\UUU_k\). Thus, we can write \(Y \Gamma = \begin{bmatrix} Y_1 & Y_2 & \dots & Y_K \end{bmatrix}\),
where \(\Gamma\) is an \(S \times S\) unknown permutation matrix placing each vector to the right subspace. This segmentation is straight-forward if the (affine) subspaces do not intersect or the subspaces intersect trivially at one point (e.g. any pair of linear subspaces passes through origin). Let there be \(S_k\) vectors in \(Y_k\) with \(S = S_1 + \dots + S_K\). Naturally, we may not have any prior information about the number of points in individual subspaces. We do typically require that there are enough vectors drawn from each subspace so that they can span the corresponding subspace. This requirement may vary for individual subspace clustering algorithms. For example, for linear subspaces, sparse representation based algorithms require that whenever a vector is removed from \(Y_k\), the remaining set of vectors spans \(\UUU_k\). This guarantees that every vector in \(Y_k\) can be represented in terms of other vectors in \(Y_k\). The minimum required \(S_k\) for which this is possible is \(S_k = D_k + 1\) when the data points from each subspace are in general position (i.e. \(\spark(Y_k) = D_k + 1\)).
Let \(Q_k\) be an orthonormal basis for subspace \(\UUU_k\). Then, the subspaces can be described as \(\UUU_k = \{ y \in \RR^M \; | \; y = \mu_k + Q_k \alpha, \; \alpha \in \RR^{D_k} \}\) for \(1 \leq k \leq K\).
For linear subspaces, \(\mu_k = 0\). We will abuse \(Y_k\) to also denote the set of vectors from the \(k\)-th subspace.
The basic objective of subspace clustering algorithms is to obtain a clustering or segmentation of vectors in \(Y\) into \(Y_1, \dots, Y_K\). This involves finding out the number of subspaces/clusters \(K\), and placing each vector \(y_s\) in its cluster correctly. Alternatively, if we can identify \(\Gamma\) and the numbers \(S_1, \dots, S_K\) correctly, we have solved the clustering problem. Since the clusters fall into different subspaces, as part of subspace clustering, we may also identify the dimensions \(\{D_k\}_{k=1}^K\) of individual subspaces, the bases \(\{ Q_k \}_{k=1}^K\) and the offset vectors \(\{ \mu_k \}_{k=1}^K\) in case of affine subspaces. These quantities emerge due to modeling the clustering problem as a subspace clustering problem. However, they are not essential outputs of the subspace clustering algorithms. Some subspace clustering algorithms may not calculate them, yet they are useful in the analysis of the algorithm. See here for a quick review of data clustering terminology.
Noisy case¶
We also consider clustering of data points which are contaminated with noise. The data points do not perfectly lie in a subspace but can be approximated as a sum of a component which lies perfectly in a subspace and a noise component. Let \(y_s = \bar{y}_s + e_s\) be the \(s\)-th vector, obtained by corrupting an error free vector \(\bar{y}_s\) (which lies perfectly in a low dimensional subspace) with a noise vector \(e_s \in \RR^M\). The clustering problem remains the same. Our goal would be to characterize the behavior of the clustering algorithm in the presence of noise at different levels.
Algorithms¶
A number of algorithms have been developed to address the subspace clustering problem over the last three decades. They can be broadly classified into: algebraic methods, iterative methods, statistical methods, spectral clustering based methods and sparse representation based methods. Some algorithms combine ideas from different approaches. In the following, we review a set of representative algorithms from the literature.
Algebraic methods include: matrix factorization based algorithms, Generalized Principal Component Analysis (GPCA).
Iterative methods include: \(K\)-plane clustering, \(K\)-subspace clustering, Expectation-Maximization based subspace clustering.
Statistical methods include: Mixture of Probabilistic Principal Component Analysis (MPPCA), ALC, Random Sampling Consensus (RANSAC).
Spectral clustering based methods include: Spectral Curvature Clustering (SCC).
Sparse representations based methods in turn use spectral clustering as a post processing step. These methods include: Low Rank Representation (LRR), Sparse Subspace Clustering via \(\ell_1\) minimization (SSC-\(\ell_1\)), Sparse Subspace Clustering via Orthogonal Matching Pursuit (SSC-OMP).
Some algorithms assume that the subspaces are independent. Some algorithms are capable of handling subspaces which may not be independent but are disjoint. Some algorithms can allow for arbitrary intersection between subspaces too. The performance of an algorithm depends on a number of parameters: ambient space dimension, number of subspaces, dimension of each subspace, number of points in each subspace and their distribution within the subspace, the separation between subspaces (in terms of say subspace angles). We provide relevant commentary on the features and capabilities of each algorithm.
Some algorithms have explicit support for handling affine subspaces. Many of them are designed for linear subspaces only. This is not a handicap in general as a \(d\)-dimensional affine subspace in \(\RR^M\) can easily be mapped to a \(d+1\)-dimensional linear subspace in \(\RR^{M + 1}\) by using homogeneous coordinates. This representation is one-to-one. The only downside is that we have to add one more coordinate in the ambient space. This may not be an issue if \(M\) is large.
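As an illustration (a sketch, not tied to any particular algorithm in the library), data points from affine subspaces can be lifted to homogeneous coordinates by appending a constant coordinate:

% A minimal sketch of the homogeneous embedding of affine data points.
M = 5; S = 20;
Y = randn(M, S);                  % placeholder: columns lie in affine subspaces of R^M
Y_h = [Y; ones(1, S)];            % columns now lie in linear subspaces of R^(M+1)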
When \(M\) is very large (say for images), it may be useful to perform a dimensionality reduction in advance before applying a subspace clustering algorithm. With the union of subspaces being \(Z_{\UUU} = \UUU_1 \cup \dots \cup \UUU_K\), two situations are possible. The linear span of \(Z_{\UUU}\) is a proper low dimensional subspace of \(\RR^M\). In this case, a direct PCA on the dataset is quite effective in achieving the dimensionality reduction. Alternatively, the dimension of \(\text{span}(Z_{\UUU})\) may be very large even though the individual subspace dimensions \(D_k\) are small. Now, let \(D_{\max} = \max \{ D_k \}\). If \(D_{\max}\) is known and \(D_{\max} < M - 1\), then we can choose a \(D_{\max}+1\) dimensional subspace which can preserve the separation and dimension of all the subspaces \(\UUU_k\) and project all the points to it. Such a subspace may be chosen either randomly or using special purpose methods [BK00]. Note that such a projection may not preserve distances between points or angles between subspaces fully. An approximately distance preserving projection may require a larger dimension subspace [DG99].
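A minimal sketch of one simple option, projection onto a random \(D_{\max}+1\) dimensional subspace, is given below; the matrix sizes are illustrative and this is not the only possible choice:

% A minimal sketch of dimensionality reduction by projection onto a random
% low dimensional subspace (illustrative sizes only).
M = 2000; S = 500; D_max = 10;
Y = randn(M, S);                  % placeholder for the high dimensional data
d = D_max + 1;                    % target dimension
P = orth(randn(M, d));            % orthonormal basis of a random d-dimensional subspace
Y_low = P' * Y;                   % d x S projected data points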
Matrix Factorization based algorithms¶
Basic matrix factorization based algorithms were developed for solving the motion segmentation problem in [BB91][Gea98][CK98][Kan01]. These algorithms are primarily algebraic in nature. See here for the motivation from motion segmentation problem.
The following derivation is applicable if the subspaces are linear and independent.
We start with the equation:
Under the independence assumption, we have
Note that each \(Y_k \in \RR^{M \times S_k}\) can be factorized via SVD as \(Y_k = U_k \Sigma_k V_k^T\),
where \(U_k \in \RR^{M \times D_k}\), \(\Sigma_k = \text{diag}(\sigma_{k 1}, \dots, \sigma_{k D_k}) \in \RR^{D_k \times D_k}\) and \(V_k \in \RR^{S_k \times D_k}\). Columns of \(U_k\) form an orthonormal basis for the subspace \(\UUU_k\). Columns of \(\Sigma_k V_k^T\) give the coordinates of the points in \(Y_k\) in the orthonormal basis \(U_k\). The singular values are non-zero since \(Y_k\) spans \(\UUU_k\). Alternatively, \(D_k\) can be obtained by counting the non-zero singular values in the SVD of \(Y_k\). Denoting:
we can write
This is a valid SVD of \(Y^*\) if the subspaces \(\UUU_k\) are independent. This differs from the standard SVD of \(Y^*\) only in the permutation of singular values in \(\Sigma\) as the standard SVD of \(Y^*\) will require them to be ordered in decreasing order. Nevertheless,
It is clear that both \(Y\) and \(Y^*\) share the same singular values. Let the SVD of \(Y\) be \(Y = U \Sigma V^T\). Let \(\Sigma = \hat{\Sigma}\hat{\Gamma}\) where \(\hat{\Gamma}\) permutes the singular values in \(\hat{\Sigma}\) in decreasing order. Then \(\hat{\Sigma} = \Sigma \hat{\Gamma}^T\) and
Matching terms, we see that \(U = \hat{U}\) and \(V = \Gamma \hat{V} \hat{\Gamma}\). Thus \(\hat{V}\) is obtained by permuting the rows and columns of \(V\) where \(\Gamma\) and \(\hat{\Gamma}\) are unknown permutations.
Let \(W = VV^T\) and \(\hat{W} = \hat{V} \hat{V}^T\). Then
Alternatively
Thus, \(\hat{W}\) can be obtained by identical row and column permutations of \(W\) given by \(\Gamma\).
The matrix \(W\) is very useful. But first let’s check out \(\hat{W}\). Note that \(\hat{V}\) can be considered as a \(K \times K\) block matrix whose only non-zero blocks are the diagonal blocks \(V_k\). Thus
Simplifying, we obtain
\(V_k V_k^T\) is an \(S_k \times S_k\) non-zero matrix. \(\hat{W}\) is an \(S \times S\) matrix. Clearly, \(\hat{W}_{i j} = 0\) if the \(i\)-th and \(j\)-th columns in \(Y^*\) belong to different subspaces. Since \(W\) is obtained by permuting the rows and columns of \(\hat{W}\) by \(\Gamma\), \(W_{ij} = 0\) if the \(i\)-th and \(j\)-th columns in the unsorted data matrix \(Y\) come from different subspaces. A simple algorithm for data segmentation is thus obtained which puts the \(i\)-th and \(j\)-th columns of \(Y\) in the same cluster if the corresponding entry \(W_{ij}\) is non-zero.
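A minimal sketch of this factorization based segmentation is given below; it assumes a noiseless data matrix Y drawn from independent linear subspaces and uses only stock MATLAB functions (an illustration only, not the library implementation):

% A minimal sketch of factorization based segmentation for noiseless data
% drawn from independent linear subspaces. Y is the M x S data matrix.
r = rank(Y);                      % sum of subspace dimensions for noiseless data
[~, ~, V] = svd(Y, 'econ');
Vr = V(:, 1:r);                   % right singular vectors spanning the row space
W = Vr * Vr';                     % W(i, j) == 0 when points i and j come from different subspaces
A = abs(W) > 1e-8;                % adjacency matrix: connect points with non-zero W(i, j)
G = graph(A, 'omitselfloops');    % undirected graph on the data points
labels = conncomp(G);             % clusters are the connected components

In the presence of noise, the entries of \(W\) corresponding to different subspaces are only approximately zero, so a simple threshold like the one above is fragile and more robust post-processing (e.g. spectral clustering on an affinity matrix) is needed in practice.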
K-plane clustering¶
K-plane clustering [BM00] is a variation of the K-means algorithm [DHS12]. In \(K\)-means, we choose a point as the center of each cluster. In \(K\)-plane clustering, we instead choose a hyperplane as the center of each cluster. This algorithm can be used for solving the subspace clustering problem when each subspace \(\UUU_k\) is deemed to be a hyperplane of \(\RR^M\). See here for a quick review of affine subspaces. In our notation, we will be estimating \(K\) hyperplanes \(\mathcal{H}_k\) with \(1 \leq k \leq K\). We also assume that \(K\) is known in advance. Each of the planes is defined as \(\mathcal{H}_k = \{ y \in \RR^M \; | \; \langle w_k, y \rangle = d_k \}\).
The algorithm seeks to choose planes such that the sum of squares of distances of each point in \(Y\) to the nearest plane is minimized.
The algorithm alternates between cluster assignment step (where each point is assigned to the nearest plane) and cluster update step (where a new nearest plane is computed for each cluster).
We assume that the normal vector \(w_k\) is unit norm, i.e. \(\| w_k \|_2 = 1\). Thus, the distance of a point \(y_s\) from a plane \(\mathcal{H}_k\) is \(| \langle w_k , y_s \rangle - d_k |\).
In the cluster assignment step, the closest plane for the point \(y_s\) is chosen as \(k(s) = \underset{1 \leq k \leq K}{\arg\min} \; | \langle w_k , y_s \rangle - d_k |\),
where \(k(s)\) denotes the assignment of the \(s\)-th point to the \(k\)-th cluster. Next, we look at the problem of finding the nearest hyperplane to a given set of points. Let \(\{y_{k 1}, y_{k 2}, \dots, y_{k n_k} \}\) be the set of points assigned to the \(k\)-th cluster at a given iteration. We can stack the vectors \(y_{k n}\) in a matrix \(Y_k = \begin{bmatrix} y_{k 1} & \dots & y_{k n_k} \end{bmatrix}\). If \(\Rank(Y_k) < M\), then it is easy to find a hyperplane which contains all the points and the minimum distance is 0. In particular, if \(\Rank(Y_k) = M-1\), then this hyperplane is the range of the columns of \(Y_k\): \(\Range(Y_k)\). Otherwise, any hyperplane containing \(\Range(Y_k)\) would work fine.
In the general case, for an arbitrary hyperplane specified by \((w, d)\), the sum of squared distances from the plane is given by \(\sum_{n=1}^{n_k} ( \langle w, y_{k n} \rangle - d )^2\).
The cluster update step thus is equivalent to finding the solution to the optimization problem:
To solve this problem, we define a matrix
A global solution to this problem is obtained at an eigenvector \(w\) of \(B\) corresponding to the minimum eigenvalue of \(B\), with \(d = \frac{\OneVec^T Y_k^T w}{n_k}\) [BM00]. When \(Y_k\) is degenerate (\(\Rank(Y_k) < M\)), the minimum eigenvalue of \(B\) is 0 and the minimum distance is 0.
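Putting the assignment and update steps together, here is a minimal sketch of the \(K\)-plane iteration. The inputs Y (an \(M \times S\) data matrix) and K are assumed, the eigenproblem is set up via the centered scatter matrix of each cluster, and this is only an illustration of the idea, not the library implementation:

% A minimal sketch of K-plane clustering with assumed inputs Y (M x S) and K.
[M, S] = size(Y);
W = randn(M, K);
W = W ./ sqrt(sum(W.^2, 1));            % unit norm normal vectors for the K planes
d = zeros(1, K);                        % plane offsets
for iter = 1:50
    % cluster assignment: each point goes to the nearest plane |<w_k, y> - d_k|
    distances = abs(W' * Y - d');
    [~, labels] = min(distances, [], 1);
    % cluster update: best fitting hyperplane for each cluster
    for k = 1:K
        Yk = Y(:, labels == k);
        if isempty(Yk)
            continue;
        end
        C = Yk - mean(Yk, 2);           % center the points of the cluster
        B = C * C';                     % scatter matrix of the centered points
        [V, E] = eig(B);
        [~, idx] = min(diag(E));
        W(:, k) = V(:, idx);            % normal vector: eigenvector of the smallest eigenvalue
        d(k) = mean(W(:, k)' * Yk);     % offset minimizing the squared distances
    end
end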
Finally, it can also be shown that the \(K\)-plane clustering algorithm terminates in a finite number of steps at a cluster assignment that is locally optimal. This concludes our discussion of \(K\)-plane clustering.
K-subspace clustering¶
K-subspace clustering [HYL+03] is a generalization of K-means [see here] and K-plane clustering. In K-means, we cluster points around centroids, in K-plane clustering, we cluster points around hyperplanes, and in K-subspace clustering, we cluster points around subspaces. This algorithm requires the number of subspaces \(K\) and their dimensions \(\{ D_1, \dots, D_K \}\) to be known in advance. We present the version for linear subspaces with \(\mu_k = 0\). Fitting the dataset \(Y\) to \(K\) subspaces reduces to identifying an orthonormal basis \(Q_k \in \RR^{M \times D_k}\) for each subspace. If the data points fit perfectly, then for every \(s\) in \(\{ 1, \dots , S\}\) there exists a \(k\) in \(\{1, \dots, K\}\) such that \(y_s = Q_k \alpha_s\) (i.e. \(y_s\) belongs to the \(k\)-th subspace with basis \(Q_k\)). If a data point belongs to an intersection of two or more subspaces, then we can arbitrarily assign the data point to one of the subspaces.
In practice, data points may not lie perfectly in any subspace. The orthoprojector for each subspace is given by \(Q_k Q_k^T\). Thus, the projection of a point \(y_s\) on a subspace \(\UUU_k\) is \(Q_k Q_k^T y_s\) and the error is \((I - Q_k Q_k^T) y_s\). The (squared) distance from the subspace is then \(\|(I - Q_k Q_k^T) y_s\|_2^2\). The point can be assigned to the subspace closest to it.
Given that a set of points \(Y_k\) is assigned to the subspace \(\UUU_k\), the orthonormal basis \(Q_k\) of \(\UUU_k\) can be estimated by performing principal component analysis (see here).
This gives us a straightforward iterative method for fitting the subspaces.
- Start with initial subspace bases \(Q_1^{(0)}, \dots, Q_K^{(0)}\).
- Assign points to subspaces by using minimum distance criteria.
- Estimate the bases for each subspace.
- Repeat steps 2 and 3 till the clustering stops changing.
Initial subspaces can be chosen randomly. A minimal sketch of this iteration follows.
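Here is that sketch; it assumes linear subspaces of a common dimension D and inputs Y (an \(M \times S\) data matrix), K and D, and is only an illustration of the iteration, not the library implementation:

% A minimal sketch of K-subspace clustering with assumed inputs Y (M x S), K, D.
[M, S] = size(Y);
Q = cell(1, K);
for k = 1:K
    Q{k} = orth(randn(M, D));           % random initial bases
end
for iter = 1:50
    % assignment: squared distance to each subspace via the projection residual
    dists = zeros(K, S);
    for k = 1:K
        R = Y - Q{k} * (Q{k}' * Y);     % residuals (I - Q_k Q_k^T) y_s
        dists(k, :) = sum(R.^2, 1);
    end
    [~, labels] = min(dists, [], 1);
    % update: principal components of each cluster give its basis
    for k = 1:K
        Yk = Y(:, labels == k);
        if isempty(Yk)
            continue;
        end
        [U, ~, ~] = svd(Yk, 'econ');
        Q{k} = U(:, 1:min(D, size(U, 2)));
    end
end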
Expectation-Maximization for K-subspaces¶
The EM method can be adapted for fitting of subspaces also. We need to assume a statistical mixture model for the dataset.
We assume that the dataset \(Y\) is sampled from a mixture of \(K\) component distributions where each component is centered around a subspace. A latent (hidden) discrete random variable \(z \in \{1, \dots, K \}\) picks the component distribution from which a sample \(y\) is drawn. Let the \(k\)-th component be centered around the subspace \(\UUU_k\) which has an orthogonal basis \(Q_k\). Then, we can write
where \(B_k \in \RR^{M \times (M - D_k)}\) is an orthonormal basis for the subspace \(\UUU_k^{\perp}\), \(Q_k \alpha_k\) is the component of \(y_k\) lying perfectly in \(\UUU_k\) and \(B_k \beta_k\) is the component lying in \(\UUU_k^{\perp}\) representing the projection error (to the subspace). We will assume that both \(\alpha\) and \(\beta\) are sampled from multivariate isotropic normal distributions, i.e. \(\alpha \sim \NNN(0, \sigma'^2_{k} I)\) and \(\beta \sim \NNN(0, \sigma^2_{k} I)\). Assuming that \(\alpha\) and \(\beta\) are independent, the covariance matrix for \(y\) is given by
Since \(y\) is expected to be very close to the subspace \(\UUU_k\), we have \(\sigma^2_k \ll \sigma'^2_k\). In the limit \(\sigma'^2_k \to \infty\), we have \(\Sigma_k^{-1} \to \sigma^{-2}_k B_k B_k^T\). Basically, this means that \(y\) is uniformly distributed in the subspace and its location inside the subspace (given by \(Q_k \alpha\)) is not important to us. All we care about is that \(y\) should belong to one of the subspaces \(\UUU_k\), with \(B_k \beta\) capturing the projection error, which is small and normally distributed.
The component distributions therefore are:
\(z\) is multinomially distributed with \(p (z = k) = \pi_k\). The parameter set for this model is then \(\theta = \{\pi_k, B_k, \sigma_k \}_{k=1}^K\) which is unknown and needs to be estimated from the dataset \(Y\). The marginal distribution \(f(y| \theta)\) and the incomplete likelihood function \(l(Y | \theta)\) can be derived just like here. We again introduce auxiliary variables \(w_{sk}\) and convert the ML estimation problem into an iterative estimation problem.
Estimates for \(\hat{w}_{sk}\) in the E-step remain the same.
Estimates of parameters in \(\theta\) in M-step are computed as follows. We compute the weighted sample covariance matrix for the \(k\)-th cluster as
\(\hat{B}_k\) consists of the eigenvectors associated with the smallest \(M - D_k\) eigenvalues of \(\hat{\Sigma}_k\). \(\pi_k\) and \(\sigma_k\) are estimated as follows:
The primary conceptual difference between \(K\)-subspaces and EM algorithm is: At each iteration, \(K\)-subspaces gives a definite assignment of every point to one of the subspaces; while EM views the membership as a random variable and uses its expected value \(\sum_{s=1}^S w_{ks}\) to give a “probabilistic” assignment of a data point to a subspace.
Both of these algorithms require number of subspaces and the dimension of each subspace as input and depend on a good initialization of subspaces to converge to an optimal solution.
Generalized PCA¶
Generalized Principal Component Analysis (GPCA) is an algebraic subspace clustering technique based on polynomial fitting and differentiation [VMS03][VH04][HMV04][VMS05][VTH08]. The basic idea is that a union of subspaces can be represented as the zero set of a set of homogeneous polynomials. Once the set of polynomials has been fitted to the given dataset, individual component subspaces can be identified via polynomial differentiation and division. See here for a quick review of ideas from algebraic geometry which are used in the development of the GPCA algorithm.
We will assume that \(\UUU_k\) are linear subspaces. If they are affine, we simply take their homogeneous embeddings.
Representing the union of subspaces with a set of homogeneous polynomials¶
Consider the \(k\)-th subspace \(\UUU_k \subset \RR^M\) with dimension \(D_k\) and its orthogonal complement \(\UUU_k^{\perp}\) with dimension \(D'_k = M - D_k\). Choose a basis for \(\UUU_k^{\perp}\) as:
Recall that for each \(y \in \UUU_k\), \(b_{k_i}^T y = 0\) as vectors in \(\UUU_k^{\perp}\) are orthogonal to vectors in \(\UUU_k\). Note that each of the forms \(b_{k_i}^T y\) is a homogeneous polynomial of degree 1. The solutions of \(b_{k_i}^T y = 0\) are (linear) hyperplanes of dimension \(M-1\) and the subspace \(\UUU_k\) is the intersection of these hyperplanes. In other words:
Note that \(y \in Z_{\UUU}\) if and only if \((y \in \UUU_1) \vee \dots \vee (y \in \UUU_K)\). Alternatively:
where \(\sigma\) denotes an arbitrary choice of one normal vector \(b_{k_{\sigma(k)}}\) from each basis \(B_k\) and we are considering all such choices. If \(y\in Z_{\UUU}\), it belongs to some \(\UUU_k\), and \(b_{k_i}^T y = 0\) for each \(b_{k_i}\) in \(B_k\). Hence, for each choice \(\sigma\), \(b_{k_{\sigma(k)}}^T y = 0\) and the RHS is true. Conversely, assume the RHS is true. If \(y \notin Z_{\UUU}\), then from each \(B_k\) we could pick one normal vector \(b\) such that \(b^T y \neq 0\). This choice would make the RHS false, a contradiction; hence \(y \in Z_{\UUU}\). The total number of choices \(\sigma\) is \(\prod_{k=1}^K D'_k\). Interestingly:
where \(p^K_{\sigma}(y)\) is a homogeneous polynomial of degree \(K\) in \(M\) variables.
Therefore, a union of \(K\) subspaces can be represented as the zero set of a set of homogeneous polynomials of the form:
where \(b_k \in \RR^M\) is a normal vector to the \(k\)-th subspace and \(v_K(y)\) is the Veronese embedding (see here) of \(y \in \RR^M\) into \(\RR^{A_{K}(M)}\). The problem of fitting \(K\) subspaces to the given dataset is then equivalent to the problem of fitting homogeneous polynomials \(p^K(y)\) such that all the points in the dataset belong to the zero set of these polynomials. Fitting such polynomials doesn’t require iterative data segmentation and model estimation since they depend on all the points in the dataset. Once the polynomials have been identified, the remaining task is to split their zero set into individual subspaces identified by \(B_k\).
In the following, we assume that the number of subspaces \(K\) is known beforehand. We consider the task of estimating \(K\) later.
Fitting polynomials to data¶
Let \(I(Z_{\UUU})\) be the vanishing ideal of \(Z_{\UUU}\). Since, the number of subspaces \(K\) is known, we only need to consider the homogeneous component \(I_K\) of \(I(Z_{\UUU})\) (3).
The vanishing ideal \(I(\UUU_k)\) of \(\UUU_k\) is generated by the set of linear forms
If the subspace arrangement is transversal, \(I_K\) is generated by products of \(K\) linear forms that vanish on the \(K\) subspaces. Any polynomial \(p(y) \in I_K\) can be written as a summation of products of linear forms
where \(l_k(y)\) is a linear form in \(I(\UUU_k)\). Using the Veronese map, each polynomial in \(I_K\) can also be written as:
where \(k_1 + \dots + k_M = K\) and \(c_{k_1, \dots, k_M} \in \RR\) represents the coefficient of monomial \(y^{\underline{K}} = y_1^{k_1} \dots y_M^{k_M}\). Fitting the polynomial \(p(y)\) is equivalent to identifying its coefficient vector \(c_K\). Since \(p(y) = 0\) is satisfied by each data point \(y_s \in Y\), we have \(c_K^T v_K(y_s) = 0\) for all \(s = 1, \dots, S\). We define
as embedded data matrix. Then, we have
The coefficient vector \(c_K\) of every polynomial in \(I_K\) is in the null space of \(V_K(M)\). To ensure that every polynomial obtained from \(V_K(M)\) is in \(I_K\), we require that
where \(h_I\) is the Hilbert function of \(I(Z_{\UUU})\) (2). Equivalently, the rank of \(V_K(M)\) needs to satisfy:
This condition is typically satisfied with \(S \geq (A_K(M) - 1)\) points in general position. Assuming this, a basis for \(I_K\) can be constructed from the set of \(h_I(K)\) singular vectors of \(V_K(M)\) associated with its \(h_I(K)\) zero singular values. In the presence of moderate noise, we can still estimate the coefficients of the polynomials in the least squares sense from the singular vectors associated with the \(h_I(K)\) smallest singular values.
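As a small illustration of the fitting step (a sketch under simplifying assumptions, not the library implementation), consider \(K = 2\) hyperplanes in \(\RR^3\). The degree-2 Veronese embedding then has 6 monomials and we expect a single fitted polynomial \(p(y) = (b_1^T y)(b_2^T y)\):

% A minimal sketch of polynomial fitting for K = 2 hyperplanes in R^3.
rng default;
b1 = randn(3, 1); b2 = randn(3, 1);          % normal vectors of the two hyperplanes
Y1 = null(b1') * randn(2, 100);              % 100 points on the first plane
Y2 = null(b2') * randn(2, 100);              % 100 points on the second plane
Y = [Y1 Y2];
% degree-2 Veronese embedding of a point: the 6 monomials of degree 2 in 3 variables
v2 = @(y) [y(1)^2; y(1)*y(2); y(1)*y(3); y(2)^2; y(2)*y(3); y(3)^2];
S = size(Y, 2);
V2 = zeros(S, 6);                            % embedded data matrix
for s = 1:S
    V2(s, :) = v2(Y(:, s))';
end
% coefficient vectors of fitted polynomials lie in the null space of the embedded matrix
[~, ~, Q] = svd(V2);
c = Q(:, end);                               % singular vector of the smallest singular value
max(abs(V2 * c))                             % close to zero: the polynomial vanishes on the data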
Subspaces by polynomial differentiation¶
Now that we have obtained a basis for the polynomials in \(I_K\), the next step is to calculate the basis vectors \(B_k\) for each \(\UUU_k^{\perp}\).
Sparse Subspace Clustering (SSC)¶
Sparse representations using overcomplete dictionaries have become a popular approach to solve a number of signal and image processing problems in the last couple of decades [Ela10]. The dictionary [Tro04][RBE10] consists of a set of prototype signals called atoms which are representative of the particular class of signals of interest. Signals are then approximated by a sparse linear combination of these atoms (i.e. linear combinations of as few atoms as possible). A wide range of sparse recovery algorithms have been developed to decompose a given signal in terms of the atoms from the dictionary in order to obtain the sparsest possible representation [TW10]. Essentially, it is expected that the signals reside in low dimensional subspaces of the ambient signal space and a good dictionary contains well chosen elementary signals called atoms such that a small set of those atoms can span (or approximate) any of the low dimensional subspaces in the class of signals under consideration. Two typical approaches for computing the sparse representation (a.k.a. sparse coding or recovery) of a given signal in a given dictionary are convex relaxation (\(\ell_1\)-minimization) [CDS98][TRO04][CT05][DET06][Don06] and greedy pursuits [MZ93][PRK93][TG07][NT09].
Sparse Subspace Clustering (SSC), introduced in [EV09][EV13] is a method which utilizes the idea of sparse representations for solving the subspace clustering problem. It treats the dataset \(Y\) itself as an (unstructured) dictionary and suggests that a sparse representation of each point in a union of subspaces may be constructed from other data points in the dataset.
A dataset where each point can be expressed as a linear combination of other points in the dataset is said to satisfy self-expressiveness property. The self-expressive representation of a point \(y_s\) in \(Y\) is given by
where \(C = \begin{bmatrix}c_1, \dots, c_S \end{bmatrix} \in \RR^{S \times S}\) is the matrix of representation coefficients.
In general, the representation \(c_s\) for vector \(y_s\) need not be unique. Now, let \(y_s\) belong to \(k\)-th subspace \(\UUU_k\). Let \(Y^{-s}\) denote the dataset \(Y\) excluding the point \(y_s\) and \(Y_k^{-s}\) denote the set of points in \(Y_k\) excluding \(y_s\). If \(Y_k^{-s}\) spans the subspace \(\UUU_k\), then a representation of \(y_s\) can be constructed entirely from the points in \(Y_k^{-s}\). A representation is called subspace preserving if it consists of points within the same subspace. Now if \(c_i\) is a subspace preserving representation of \(y_i\) and \(y_j\) belongs to a different subspace, then \(c_{ij} = 0\). Thus, if \(C\) consists entirely of subspace preserving representations, then \(C_{ij} = 0\) whenever \(y_i\) and \(y_j\) belong to different subspaces.
Note that \(C\) may not be symmetric, i.e., even if \(y_j\) participates in the representation of \(y_i\), \(y_i\) may not participate in the representation of \(y_j\), or the representation coefficients \(C_{ij}\) and \(C_{ji}\) may be different. But we can construct a symmetric matrix \(W = | C | + |C|^T\), where \(|C|\) denotes taking the absolute value of each entry in \(C\). The matrix \(W\) can be used as an affinity matrix for the points from the union of subspaces such that the affinity of points from different subspaces is 0. \(W\) can be used to partition \(Y\) into \(Y_k\) via spectral clustering (see here for a review of spectral clustering) [VL07].
The remaining issue is constructing a subspace preserving representation \(C\) of \(Y\). This is where the sparse recovery methods developed in the sparse representations literature come to our rescue. [EV09][EV13] initially proposed the use of \(\ell_1\)-minimization by solving
They proved theoretically that, if the subspaces \(\{\UUU_k\}\) are independent, then \(\ell_1\) minimization can recover subspace preserving representations. They also showed that if the subspaces are disjoint, then under certain conditions, subspace preserving representations can be obtained.
Subsequently, [DSB13][YV15] showed that Orthogonal Matching Pursuit (OMP) [PRK93][TG07] can also be used for obtaining subspace preserving representations under appropriate conditions. We will refer to these two variants of SSC as SSC-\(\ell_1\) and SSC-OMP respectively. The essential SSC method is described below.

SSC by Basis Pursuit¶
Hands-on SSC-BP with Synthetic Data¶
In this example, we will select a set of random subspaces in an ambient space and pick random points within those subspaces. We will make the data noisy and then use sparse subspace clustering by basis pursuit to solve the clustering problem.
Configure the random number generator for repeatability of experiment:
rng default;
Let’s choose the ambient space dimension:
M = 50;
The number of subspaces to be drawn in this ambient space:
K = 10;
Dimension of each of the subspaces:
D = 20;
Choose random subspaces (by choosing bases for them):
bases = spx.data.synthetic.subspaces.random_subspaces(M, K, D);
See Random Subspaces for details.
Compute the smallest principal angles between them:
>> angles_matrix = spx.la.spaces.smallest_angles_deg(bases)
angles_matrix =
0 13.7806 21.2449 12.6763 18.2977 14.5865 19.0584 14.1622 20.4491 15.9609
13.7806 0 12.7650 14.3358 15.5764 12.5790 18.1699 14.8446 19.3907 13.2812
21.2449 12.7650 0 14.7511 13.2121 10.7509 16.1944 11.7819 15.3850 19.7930
12.6763 14.3358 14.7511 0 14.1313 15.6603 14.1016 13.4738 13.1950 19.8852
18.2977 15.5764 13.2121 14.1313 0 13.1154 18.3977 15.4241 12.2688 16.7764
14.5865 12.5790 10.7509 15.6603 13.1154 0 7.6558 13.6178 13.3462 10.5027
19.0584 18.1699 16.1944 14.1016 18.3977 7.6558 0 12.6955 13.8088 17.2580
14.1622 14.8446 11.7819 13.4738 15.4241 13.6178 12.6955 0 13.8851 17.1396
20.4491 19.3907 15.3850 13.1950 12.2688 13.3462 13.8088 13.8851 0 8.4910
15.9609 13.2812 19.7930 19.8852 16.7764 10.5027 17.2580 17.1396 8.4910 0
See Hands on with Principal Angles for details.
Let’s quickly look at the minimum angle between any of the pairs of subspaces:
>> angles = spx.matrix.off_diag_upper_tri_elements(angles_matrix)';
>> min(angles)
ans =
7.6558
Some of the subspaces are indeed very closely aligned.
Let’s choose the number of points we will draw for each subspace:
>> Sk = 4 * D
Sk =
80
Number of points that will be drawn in each subspace:
cluster_sizes = Sk * ones(1, K);
Total number of points to be drawn:
S = sum(cluster_sizes);
Let’s generate these points on the unit sphere in each subspace:
points_result = spx.data.synthetic.subspaces.uniform_points_on_subspaces(bases, cluster_sizes);
X0 = points_result.X;
See Uniformly Distributed Points in Subspaces for more details.
Let’s add some noise to the data points:
% noise level
sigma = 0.5;
% Generate noise
Noise = sigma * spx.data.synthetic.uniform(M, S);
% Add noise to signal
X = X0 + Noise;
See Uniformly Distributed Points in Space for
the spx.data.synthetic.uniform
function details.
Let’s normalize the noisy data points:
X = spx.norm.normalize_l2(X);
Let’s create true labels for each of the data points:
true_labels = spx.cluster.labels_from_cluster_sizes(cluster_sizes);
See Utility Functions for Clustering Experiments for
labels_from_cluster_sizes
function.
It is time to apply the sparse subspace clustering algorithm. The following steps are involved:
- Compute the sparse representations using basis pursuit.
- Convert the representations into a Graph adjacency matrix.
- Apply spectral clustering on the adjacency matrix.
Basis Pursuit based Representation Computation
Let’s allocate storage for storing the representation of each point in terms of other points:
Z = zeros(S, S);
Note that there are exactly S points and each has to have a representation in terms of others. The diagonal elements of Z must be zero since a data point cannot participate in its own representation.
We will use CVX to construct the sparse representation of each point in terms of other points using basis pursuit:
start_time = tic;
fprintf('Processing %d signals\n', S);
for s=1:S
fprintf('.');
if (mod(s, 50) == 0)
fprintf('\n');
end
x = X(:, s);
cvx_begin
% storage for l1 solver
variable z(S, 1);
minimize norm(z, 1)
subject to
x == X*z;
z(s) == 0;
cvx_end
Z(:, s) = z;
end
elapsed_time = toc(start_time);
fprintf('\n Time spent: %.2f seconds\n', elapsed_time);
The constraint x == X*z
is forcing each
data point to be represented in terms of other
data points.
The constraint z(s) == 0
ensures that a
data point cannot participate in its own
representation. In other words, the diagonal
elements of the matrix Z are forced to be zero.
The output of this loop looks like:
Processing 800 signals
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
Time spent: 313.70 seconds
CVX based basis pursuit is indeed a slow algorithm.
Graph adjacency matrix
The sparse representation matrix Z is not symmetric. Also, the sparse representation coefficients are not always positive.
We need to make it symmetric and positive so that it can be used as an adjacency matrix of a graph:
W = abs(Z) + abs(Z).';
Spectral Clustering
See Hands-on spectral clustering for a detailed introduction to spectral clustering.
We can now apply spectral clustering on this matrix. We will choose normalized symmetric spectral clustering:
clustering_result = spx.cluster.spectral.simple.normalized_symmetric(W);
The labels assigned by the clustering algorithms:
cluster_labels = clustering_result.labels;
Performance of the Algorithm
Time to compare the clusterings and measure clustering accuracy and error. We will use the Hungarian mapping trick to map between the original cluster labels and the cluster labels estimated by the clustering algorithm:
comparison_result = spx.cluster.clustering_error_hungarian_mapping(cluster_labels, true_labels, K);
See Clustering Error for Hungarian mapping based clustering error.
The clustering accuracy and error:
clustering_error_perc = comparison_result.error_perc;
clustering_acc_perc = 100 - comparison_result.error_perc;
Let’s print it:
>> fprintf('\nclustering error: %0.2f %%, clustering accuracy: %0.2f %% \n'...
, clustering_error_perc, clustering_acc_perc);
clustering error: 7.00 %, clustering accuracy: 93.00 %
We have achieved pretty good accuracy despite very closely aligned subspaces and a significant amount of noise.
Subspace Preserving Representations
Let’s also get the subspace preserving representation statistics:
spr_stats = spx.cluster.subspace.subspace_preservation_stats(Z, cluster_sizes);
spr_error = spr_stats.spr_error;
spr_flag = spr_stats.spr_flag;
spr_perc = spr_stats.spr_perc;
See Performance Metrics for Sparse Subspace Clustering for more details.
Print it:
>> fprintf('mean spr error: %0.2f, preserving : %0.2f %%\n', spr_stats.spr_error, spr_stats.spr_perc);
mean spr error: 0.68, preserving : 0.00 %
The complete example code can be downloaded here.
SSC by Orthogonal Matching Pursuit¶
Motion Segmentation¶
The theory of structure from motion and motion segmentation has evolved over a set of papers [TK91][TK92][BB91][PK97][Gea98][CK98][Kan01]. In this section, we review the essential ideas from this series of work.
A typical image sequence (from a single camera shot) may contain multiple objects moving independently of each other. In the simplest model, we can assume that images in a sequence are views of a single moving object observed by a stationary camera or a stationary object observed by a moving camera. Only rigid motions are considered. In either case, the object is moving with respect to the camera. The structure from motion problem focuses on recovering the (3D) shape and motion information of the moving object. In the general case, there are multiple objects moving independently. Thus, we also need to perform a motion segmentation such that motions of different objects can be separated and (either after or simultaneously) shape and motion of each object can be inferred.
This problem is typically solved in two stages. In the first stage, a frame to frame correspondence problem is solved which identifies a set of feature points whose coordinates can be tracked over the sequence as each point moves from one position to another in the sequence. We obtain a set of trajectories for these points over the frames in the video. If there is a single moving object, or the scene is static and the observer is moving, then all the feature points will belong to the same object. Otherwise, we need to cluster these feature points to different objects moving in different directions. In the second stage, these trajectories are analyzed to group the feature points into separate objects and recover the shape and motion for individual objects. In this section we assume that the feature trajectories have been obtained by an appropriate method. Our focus is to identify the moving objects and obtain the shape and motion information for each object from the trajectories.
Modeling structure from motion for single object¶
We start with the simple model of a static camera and a moving object. All feature point trajectories belong to the moving object. Our objective is to demonstrate that the subspace spanned by feature trajectories of a single moving object is a low dimensional subspace.
Let the image sequence consist of \(F\) frames denoted by \(1 \leq f \leq F\). Let us assume that \(S\) feature points of the moving object have been tracked over this image sequence. Let \((u_{fs}, v_{fs})\) be the image coordinates of the \(s\)-th point in \(f\)-th frame. We form the feature trajectory vector for the \(s\)-th point by stacking its coordinates for the \(F\) frames vertically as
Putting together the feature trajectory vectors of \(S\) points in a single feature trajectory matrix, we obtain
This is the data matrix under consideration from which the shape and motion of the object need to be inferred.
We need two coordinate systems. We use the camera coordinate system as the world coordinate system with the \(Z\)-axis along the optical axis. The coordinates of different points in the object are changing from frame to frame in the world coordinate system as the object is moving. We also establish a coordinate system within the object with origin at the centroid of the feature points such that the coordinates of individual points do not change from frame to frame in the object coordinate system. The (rigid) motion of the object is then modeled by the translation (of the centroid) and rotation of its coordinate system with respect to the world coordinate system. Let \((a_s, b_s, c_s)\) be the coordinate of the \(s\)-th point in the object coordinate system. Then, the matrix
represents the shape of the object (w.r.t. its centroid).
Let us choose an orthonormal basis in the object coordinate system. Let \(d_f\) be the position of the centroid and \((i_f, j_f, k_f)\) be the (orthonormal) basis vectors of the object coordinate system in the \(f\)-th frame. Then, the position of the \(s\)-th point in the world coordinate system in \(f\)-th frame is given by
Assuming orthographic projection and letting \(h_{fs} = (u_{fs}, v_{fs}, w_{fs})\), the image coordinates are obtained by chopping off the third component \(w_{fs}\). We define the rotation matrix for the \(f\)-th frame as
where \(\underline{i}_f\), \(\underline{j}_f\), \(\underline{k}_f\) are the row vectors of \(R_f\). Let \(x_s = (a_s, b_s, c_s, 1)\) be the homogeneous coordinates of the \(s\)-th point in object coordinate system. We can write the homogeneous coordinates in camera coordinate system as
If we write \(d_f = (d_{fi}, d_{fj}, d_{fk})\), then, the data matrix \(Y\) can be factorized as
We rewrite this as
where \(\mathbb{M}\) represents the motion information of the object and \(\mathbb{S}\) represents the shape information of the object (the last row of \(\mathbb{S}\) as formulated above consists of \(1\)s). This factorization is known as the Tomasi-Kanade factorization of shape and motion information of a moving object. Note that \(\mathbb{M} \in \RR^{2F \times 4}\) and \(\mathbb{S} \in \RR^{4 \times S}\). Thus the rank of \(Y\) is at most 4, and hence the feature trajectories of the rigid motion of an object span an up to 4-dimensional subspace of the trajectory space \(\RR^{2F}\).
Solving the structure from motion problem¶
We digress a bit to understand how to perform the factorization of \(Y\) into \(\mathbb{M}\) and \(\mathbb{S}\). Using SVD, \(Y\) can be decomposed as
Since \(Y\) is at most rank \(4\), we keep only the first 4 singular values as \(\Sigma = \text{diag}(\sigma_1, \sigma_2, \sigma_3, \sigma_4)\). Matrices \(U \in \RR^{2F \times 4}\) and \(V \in \RR^{S \times 4}\) are the left and right singular matrices respectively.
There is no unique factorization of \(Y\) in general. One simple factorization can be obtained by defining:
But for any \(4 \times 4\) invertible matrix \(A\),
is also a possible solution since \(\mathbb{M} \mathbb{S} = \widehat{\mathbb{M}} \widehat{\mathbb{S}} = Y\). Remember that \(\mathbb{M}\) is not an arbitrary matrix but represents the rigid motion of an object. There is considerable structure inside the motion matrix. These structural constraints can be used to compute an appropriate \(A\) and thus obtain \(\mathbb{M}\) from \(\widehat{\mathbb{M}}\). To proceed further, let us break \(A\) into two parts
where \(A_R \in \RR^{4 \times 3}\) is the rotational component and \(a_t \in \RR^4\) is related to translation. We can now write:
Rotational constraints Recall that \(R_f\) is a rotation matrix, hence its rows are unit norm and orthogonal to each other. Thus every row of \(\widehat{\mathbb{M}} A_R\) is unit norm and every pair of rows (for a given frame) is orthogonal. This yields the following constraints.
where \(\widehat{m}_k\) are rows of matrix \(\widehat{\mathbb{M}}\) for \(1 \leq f \leq F\). This over-constrained system can be solved for the entries of \(A_R\) using least squares techniques.
Translational constraints Recall that the image of a centroid of a set of points under an isometry (rigid motion) is the centroid of the images of the points under the same isometry. The homogeneous coordinates of the centroid in the object coordinate system are \((0, 0, 0, 1)\). The coordinates of the centroid in image are \((\frac{1}{S} \sum_s {u_{f s}}, \frac{1}{S} \sum_s {v_{f s}} )\). Putting back, we obtain
A least squares solution for \(a_t\) is straightforward.
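A minimal sketch of the SVD based factorization step is given below; it assumes the \(2F \times S\) trajectory matrix Y as input and stops short of solving the metric constraints for \(A\) (an illustration only, not the library implementation):

% A minimal sketch of the rank-4 factorization step for a single rigid object.
% Y is the 2F x S feature trajectory matrix (assumed to be given).
[U, Sigma, V] = svd(Y, 'econ');
U4 = U(:, 1:4);
S4 = Sigma(1:4, 1:4);               % keep the four largest singular values
V4 = V(:, 1:4);
M_hat = U4 * sqrt(S4);              % one possible motion factor
S_hat = sqrt(S4) * V4';             % one possible shape factor
% Y is (approximately) M_hat * S_hat. The rotational and translational
% constraints are then solved for a 4 x 4 matrix A so that M = M_hat * A
% and S = A \ S_hat have the structure of rigid motion.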
Modeling motion for multiple objects¶
The generalization of modeling the motion of one object to multiple objects is straightforward. Let there be \(K\) objects in the scene moving independently. (Our realization of an object is a set of feature points undergoing the same rotation and translation over a sequence of images. The notion of locality, color, connectivity etc. plays no role in this definition. It is possible that two visually distinct objects are undergoing the same rotation and translation within a given image sequence. For the purposes of inferring an object from its motion, these two visually distinct objects are treated as one.) Let \(S_1, S_2, \dots, S_K\) feature points be tracked for objects \(1,2, \dots, K\) respectively for \(F\) frames with \(S = \sum_k S_k\). Let these feature trajectories be put in a data matrix \(Y \in \RR^{2F \times S}\). In general, we don’t know which feature point belongs to which object and how many feature points there are for each object. Of course, there is at least one feature point for each object (otherwise the object isn’t being tracked at all). We could permute the columns of \(Y\) via an (unknown) permutation \(\Gamma\) so that the feature points of each object are placed contiguously, giving us
Clearly, each submatrix \(Y_k\) (\(1 \leq k \leq K\)) which consists of feature trajectories of one object spans an (up to) 4 dimensional subspace. Now, the problem of motion segmentation is essentially separating \(Y\) into \(Y_k\) which reduces to a standard subspace clustering problem.
Let us dig a bit deeper to see how the motion shape factorization identity changes for the multi-object formulation. Each data submatrix \(Y_k\) can be factorized as
\(Y^*\) now has the canonical factorization:
If we further denote :
then we obtain a factorization similar to the single object case given by
Thus, when the segmentation of \(Y\) in terms of the unknown permutation \(\Gamma\) has been obtained, (sorted) data matrix \(Y^*\) can be factorized into shape and motion components as appropriate.
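For reference, a hedged reconstruction of this block factorization (the exact symbols may differ from the original displays) is
\[Y^* = Y \Gamma = \begin{bmatrix} Y_1 & Y_2 & \cdots & Y_K \end{bmatrix} = \begin{bmatrix} \mathbb{M}_1 & \mathbb{M}_2 & \cdots & \mathbb{M}_K \end{bmatrix} \begin{bmatrix} \mathbb{S}_1 & & & \\ & \mathbb{S}_2 & & \\ & & \ddots & \\ & & & \mathbb{S}_K \end{bmatrix},\]
where \(\mathbb{M}_k \in \RR^{2F \times 4}\) and \(\mathbb{S}_k \in \RR^{4 \times S_k}\) are the motion and shape matrices of the \(k\)-th object.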
Limitations Our discussion so far has established that feature trajectories for each moving object span a 4-dimensional space. There are a number of reasons why this is only approximately valid: perspective distortion of camera, tracking errors, and pixel quantization. Thus, a subspace clustering algorithm should allow for the presence of noise or corruption of data in real life applications.
Synthetic Data Generation¶
Random Subspaces¶
Subspace clustering focuses on segmenting data points that fall in different subspaces, where the subspaces are either independent or disjoint and are sufficiently oriented away from each other.
For testing algorithms, it is useful to pick random subspaces of an ambient signal space and then draw data points within these subspaces.
A way to pick a random subspace is to pick a basis for it. All linear combinations of the basis elements then fall in the subspace, and every vector in the subspace can be written as such a combination.
Let’s pick a random plane in the 3-Dimensional space:
>> basis = orth(randn(3, 2))
basis =
-0.2634 0.6981
-0.5459 -0.6769
0.7954 -0.2334
We construct a 3x2 Gaussian random matrix and orthogonalize its columns. With probability 1, the Gaussian random matrix has full rank, so this is a safe way of choosing a basis for a random plane.
We can verify that the basis is indeed orthonormal:
>> basis'*basis
ans =
1.0000 0
0 1.0000
Visualizing Subspaces¶
It is possible to visualize 2D subspaces in 3D space.
Let’s pick one subspace:
rng(10);
A = orth(randn(3, 2))
Identify its basis vectors:
e1 = A(:, 1);
e2 = A(:, 2);
Identify the corner points of a square around its basis vectors:
corners = [e1+e2, e2-e1, -e1-e2, -e2+e1];
Visualize it:
fill3(corners(1,:),corners(2,:),corners(3,:),'r');
grid on;
hold on;
alpha(0.3);
Add the arrows of basis vectors from origin:
quiver3(0, 0, 0, e1(1), e1(2), e1(3), 'color', 'r');
quiver3(0, 0, 0, e2(1), e2(2), e2(3), 'color', 'r');

Let’s add one more basis:
B = orth(randn(3, 2));
e1 = B(:, 1);
e2 = B(:, 2);
corners = [e1+e2, e2-e1, -e1-e2, -e2+e1];
fill3(corners(1,:),corners(2,:),corners(3,:),'g');
alpha(0.3);
quiver3(0, 0, 0, e1(1), e1(2), e1(3), 'color', spx.graphics.rgb('DarkGreen'));
quiver3(0, 0, 0, e2(1), e2(2), e2(3), 'color', spx.graphics.rgb('DarkGreen'));

Multiple Subspaces¶
sparse-plex provides a way to draw multiple random subspaces of a given dimension from an ambient space.
Let’s pick the dimension of the ambient space:
M = 10;
Let’s pick the dimension of subspaces:
D = 4;
Let’s pick the number of subspaces to be drawn:
K = 2;
Let’s draw the bases for each random subspace:
import spx.data.synthetic.subspaces.random_subspaces;
bases = random_subspaces(M, K, D);
The result bases is a cell array containing the orthogonal basis for each subspace:
>> bases{1}
ans =
-0.1178 -0.1432 0.0438 -0.0100
0.1311 -0.0110 -0.4409 0.1758
0.5198 -0.6404 0.0422 -0.3980
0.5211 -0.0172 -0.2929 0.6334
-0.2253 -0.1194 -0.2797 0.0920
0.4695 0.1059 0.5408 0.1396
0.1919 0.0765 -0.1441 -0.3519
0.0940 0.0145 -0.4542 -0.4078
0.3209 0.6274 -0.2325 -0.2118
-0.0855 -0.3791 -0.2537 0.2153
>> bases{2}
ans =
0.4784 -0.0579 -0.4213 -0.0206
0.1213 -0.0591 0.3498 0.2351
0.3077 -0.2110 0.2573 0.0042
-0.5581 -0.5284 0.0988 -0.1403
0.1128 0.5914 0.2518 -0.1872
-0.1804 -0.0095 0.0707 -0.1351
-0.0728 0.2774 -0.2063 0.3801
-0.4417 0.3878 0.2071 0.4004
0.0695 -0.2496 -0.1836 0.7344
0.3158 -0.1732 0.6608 0.1647
Verify orthogonality:
>> Psi = bases{1}
>> Psi' * Psi
ans =
1.0000 -0.0000 -0.0000 -0.0000
-0.0000 1.0000 0.0000 0.0000
-0.0000 0.0000 1.0000 -0.0000
-0.0000 0.0000 -0.0000 1.0000
Principal Angles¶
If \(\UUU\) and \(\VVV\) are two linear subspaces of \(\RR^M\), then the smallest principal angle between them denoted by \(\theta\) is defined as [BjorckG73]
For the functions provided in sparse-plex for measuring principal angles, see Hands on with Principal Angles.
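As a quick plain-MATLAB illustration (a hedged sketch, not the sparse-plex API): if two subspaces are represented by matrices with orthonormal columns, the singular values of the product of their transposed bases are the cosines of the principal angles, and the smallest principal angle corresponds to the largest singular value.
rng(0);
U1 = orth(randn(10, 3));   % orthonormal basis of the first subspace
U2 = orth(randn(10, 3));   % orthonormal basis of the second subspace
s = svd(U1' * U2);         % cosines of the principal angles (descending)
theta_min = acos(min(s(1), 1))        % smallest principal angle in radians
theta_min_deg = rad2deg(theta_min)    % same angle in degrees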
Uniformly Distributed Points in Space¶
If we wish to generate points uniformly distributed on the unit sphere, we follow a two-step procedure:
- Generate independent standard Gaussian random vectors.
- Normalize their lengths.
Here is an example.
Let ambient space dimension be:
>> M = 4;
Let the number of points to be generated be:
>> S = 6;
Let’s generate the Gaussian random vectors:
>> X = randn(M, S)
X =
-0.6568 -0.2926 -0.4930 0.6113 1.8045 0.6001
-1.4814 -0.5408 -0.1807 0.1093 -0.7231 0.5939
0.1555 -0.3086 0.0458 1.8140 0.5265 -2.1860
0.8186 -1.0966 -0.0638 0.3120 -0.2603 -1.3270
Let’s normalize them:
>> X = spx.norm.normalize_l2(X)
X =
-0.3605 -0.2260 -0.9286 0.3147 0.8886 0.2228
-0.8130 -0.4177 -0.3404 0.0563 -0.3561 0.2205
0.0853 -0.2384 0.0863 0.9338 0.2593 -0.8117
0.4492 -0.8471 -0.1201 0.1606 -0.1282 -0.4928
Verify that they are indeed on the unit-sphere:
>> spx.norm.norms_l2_cw(X)
ans =
1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
We provide a reusable function to generate uniformly distributed points on unit sphere:
>> spx.data.synthetic.uniform(M, S)
ans =
-0.6788 0.5450 -0.3194 -0.1977 -0.6098 -0.4051
0.1893 0.3660 0.6441 0.2742 0.3803 0.1614
0.6926 -0.7056 -0.0138 -0.0292 -0.0341 -0.6422
-0.1540 0.2667 0.6949 -0.9407 -0.6946 0.6305
Uniformly Distributed Points in Subspaces¶
For subspace clustering purposes, individual vectors are usually normalized. They then fall onto the surface of the unit sphere of the subspace to which they belong.
For experimentation, it is useful to generate uniformly distributed points on the unit sphere of a random subspace.
It is actually very easy to do. Let’s start with a simple example of a random 2D plane inside 3D space.
Let’s choose a random plane:
basis = orth(randn(3, 2));
Let’s choose coordinates of some points in this basis where the coordinates are Gaussian distributed:
num_points = 100;
coefficients = randn(2, num_points);
Let’s normalize the coefficients:
coefficients = spx.norm.normalize_l2(coefficients);
The coordinates of these points in the 3D space can be easily calculated now:
uniform_points = basis * coefficients;
Verify that these points are indeed on unit sphere:
>> max(abs(spx.norm.norms_l2_cw(uniform_points) - 1))
ans =
4.4409e-16
Time to visualize everything. First the plane:
e1 = basis(:, 1);
e2 = basis(:, 2);
corners = [e1+e2, e2-e1, -e1-e2, -e2+e1];
spx.graphics.figure.full_screen;
fill3(corners(1,:),corners(2,:),corners(3,:),'r');
grid on;
hold on;
alpha(0.3);
Then the unit vectors:
quiver3(0, 0, 0, e1(1), e1(2), e1(3), 'color', 'blue');
quiver3(0, 0, 0, e2(1), e2(2), e2(3), 'color', 'blue');
Finally the points:
x = uniform_points(1, :);
y = uniform_points(2, :);
z = uniform_points(3, :);
plot3(x, y, z, '.', 'color', spx.graphics.rgb('Brown') );
We might as well draw the origin too:
plot3(0, 0, 0, '+k', 'MarkerSize', 10, 'color', spx.graphics.rgb('DarkRed'));

Complete example code can be downloaded here.
Uniformly distributed points in multiple subspaces
We provide a useful function which can generate uniformly distributed points on one or more subspaces.
We first need to choose the bases for the subspaces for which we will draw uniformly distributed points. Here we will choose those bases randomly.
Ambient space dimension:
M = 10;
Number of subspaces:
K = 4;
Dimension of each subspace:
D = 5;
Bases for each subspace:
bases = spx.data.synthetic.subspaces.random_subspaces(M, K, D);
Now, let's decide how many points we need in each subspace:
cluster_sizes = [10 4 4 8];
Let’s generate uniformly distributed points in each subspace:
data_points = spx.data.synthetic.subspaces.uniform_points_on_subspaces(bases, cluster_sizes);
The returned value contains the data matrix of points along with the start and end indices for each cluster of points (one cluster per subspace):
>> data_points
data_points =
struct with fields:
X: [10×26 double]
start_indices: [1 11 15 19]
end_indices: [10 14 18 26]
Let’s look at the start and end indices for each cluster:
>> data_points.start_indices
ans =
1 11 15 19
>> data_points.end_indices
ans =
10 14 18 26
Verify the size of the data matrix:
>> size(data_points.X)
ans =
10 26
Let’s look at the data points for 2nd cluster:
>> data_points.X(:, data_points.start_indices(2):data_points.end_indices(2))
ans =
0.0987 0.5278 -0.4014 0.2963
-0.1793 0.0614 0.1551 0.2283
0.4603 0.1510 -0.0926 0.0340
0.3573 -0.1289 0.1654 0.4519
0.1202 -0.0495 0.1382 -0.4503
-0.1857 -0.6572 -0.1129 -0.3851
0.4265 0.1540 -0.6315 -0.0117
-0.4420 0.4131 0.0530 0.1565
0.0262 0.1973 -0.2354 0.1153
-0.4377 -0.1022 0.5385 -0.5155
Complete example code can be downloaded here.
Performance Metrics for Sparse Subspace Clustering¶
Consider a sparse representation matrix \(C\) where each signal has been represented in terms of other signals. With \(S\) signals, the matrix is of size \(S \times S\) and the diagonal elements of the matrix are zero.
We use the following metrics for comparing algorithms.
Percentage of subspace preserving representations (p%) [YV15]
This is the percentage of points whose representations are subspace-preserving. Due to the imprecision of solvers, coefficients with absolute values less than \(10^{-3}\) are considered zero. A subspace preserving \(C\) gives \(p = 100\).
Subspace preserving representation error (e%) [EV13]
For each column \(c_s\) in \(C\), we compute the fraction of its \(\ell_1\) norm that comes from other subspaces and average over all \(1 \leq s \leq S\).
where \(w_{is} \in \{0, 1\}\) is its true affinity. A subspace-preserving \(C\) gives \(e=0\).
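A hedged reconstruction of this error measure (consistent with the hands-on computation later in this section) is
\[e = \frac{1}{S} \sum_{s=1}^{S} \left( 1 - \frac{\sum_{i} w_{is} |c_{is}|}{\| c_s \|_1} \right),\]
with \(w_{is}\) as defined above; the reported figure is the percentage \(e\% = 100 \, e\).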
Clustering accuracy (a %) [YV15]
This is the percentage of correctly labeled data points. It is computed by matching the estimated and true labels as
where \(\pi\) is a permutation of the \(K\) cluster labels, \(L_{ks} = 1\) if point \(s\) belongs to cluster \(k\), and 0 otherwise. This assumes that either the number of subspaces/clusters is known a priori to the clustering algorithm or the clustering algorithm has inferred it correctly.
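For a small number of clusters, the matching over permutations can be illustrated with a brute-force plain-MATLAB sketch (sparse-plex itself uses a Hungarian-assignment based routine, as seen in the MNIST example later; the labels below are hypothetical):
true_labels = [1 1 1 2 2 2 3 3];    % hypothetical ground truth labels
est_labels  = [2 2 2 3 3 1 1 1];    % hypothetical clustering output
K = 3;
best_matches = 0;
P = perms(1:K);                     % all K! relabelings of the clusters
for i = 1:size(P, 1)
    mapped = P(i, est_labels);      % relabel the estimated clusters
    best_matches = max(best_matches, sum(mapped == true_labels));
end
accuracy_percent = 100 * best_matches / numel(true_labels)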
Running time (t)
This is the time taken by each clustering task, as measured in MATLAB.
Hands-on with Subspace Preservation Metrics¶
Let’s consider a data set of 10 points:
X =
0.2813 -0.9343 0.2368 -0.7846 0.7908 0 0 0 0 0
0.9596 0.3566 -0.9716 0.6200 0.6120 -0.4064 0.9962 0.9613 -0.0830 0.7051
0 0 0 0 0 0.9137 -0.0866 -0.2757 0.9965 0.7091
The points are drawn from a 3-dimensional space. The first 5 points are drawn from the X-Y plane and the last 5 points from the Y-Z plane.
We constructed the sparse representations of these data points in terms of the other points using basis pursuit. The representations are:
C =
0.0000 -0.0000 -0.0000 0.0000 0.8565 -0.0000 0.3284 0.0000 0.0000 0.3615
0.0000 -0.0000 -0.0000 0.7476 -0.5885 -0.0000 0.0000 0.0000 0.0000 0.0000
-0.0000 -0.0000 -0.0000 -0.3638 -0.0000 0.0000 -0.3902 -0.0000 -0.0000 -0.4295
0.0000 0.8797 -0.3018 0.0000 -0.0000 -0.0000 0.0000 0.0000 0.0000 0.0000
0.3558 -0.3085 -0.0000 -0.0000 -0.0000 -0.0000 0.0000 0.0000 0.0000 0.0000
-0.0000 0.0000 0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.2187 0.8167 0.0000
0.6854 -0.0000 -0.7247 0.0000 0.0000 -0.0000 0.0000 0.8757 0.0000 0.0000
0.0000 -0.0000 -0.0000 0.0000 0.0000 -0.3520 0.3141 0.0000 -0.0000 0.0000
0.0000 -0.0000 -0.0000 0.0000 0.0000 0.8195 -0.0000 -0.0000 -0.0000 0.7116
0.0837 -0.0000 -0.0885 0.0000 0.0000 0.0000 0.0000 0.0000 0.3530 0.0000
For subspace preserving representations:
- In the first 5 columns, non-zero entries should appear in first 5 rows.
- In the last 5 columns, non-zero entries should appear in last 5 rows.
On inspection, we can see that column 1 is not subspace preserving while column 2 is.
Let’s go through the steps of computing the metrics. We will work on column 1.
Let’s assign cluster labels to each of the columns:
cluster_sizes = [5 5];
labels = spx.cluster.labels_from_cluster_sizes(cluster_sizes)
>> spx.io.print.vector(labels, 0)
1 1 1 1 1 2 2 2 2 2
Let’s compute absolute values of \(C\):
C = abs(C);
We will allocate some space for flags indicating whether each column contains a subspace preserving representation, and for the amount of \(\ell_1\) error in each column:
S = 10;  % number of data points
spr_flags = zeros(1, S);
spr_errors = zeros(1, S);
Let’s pick up the first column:
c1 = C(:, 1);
The label assigned to this column is:
k = labels(1);
which happens to be 1 (first cluster).
Identify the rows which contain non-zero values:
non_zero_indices = (c1 >= 1e-3);
Each non-zero value is a contribution from some other column. We wish to identify the cluster to which those columns belong:
non_zero_labels = labels(non_zero_indices)
non_zero_labels =
1 2 2
Notice how only one of the contributors is from the first cluster while the other two are from the second cluster. Cross-check this in the \(C\) matrix displayed above.
Verify whether all the contributors are from the same cluster and store the result in the spr_flags variable:
spr_flags(1) = all(non_zero_labels == k)
0
Next, let's identify the columns which come from the same cluster as the current column:
w = labels == k;
Coefficients from same cluster are:
c1k = c1(w);
Subspace preserving representation error is given by:
spr_errors(1) = 1 - sum(c1k) / sum (c1)
>> spr_errors(1)
ans =
0.6837
We provide a function which does this whole sequence of operations on all data points:
spr_stats = spx.cluster.subspace.subspace_preservation_stats(C, cluster_sizes);
The flags whether a representation is subspace preserving or not for each data point:
>> spr_stats.spr_flags
ans =
0 1 0 1 1 1 0 1 1 0
Indicator of whether all representations are subspace preserving:
>> spr_stats.spr_flag
0
Data point wise subspace preserving representation error:
>> spr_stats.spr_errors
ans =
0.6837 0.0000 0.7293 0.0000 0.0000 0.0000 0.6958 0.0000 0.0000 0.5264
Average representation error:
>> spr_stats.spr_error
ans =
0.2635
This is about 26% error.
Percentage of data points having subspace preserving representations:
>> spr_stats.spr_perc
ans =
60
Not too bad given that the number of data points was very small.
Complete example code can be downloaded here.
Sparse Subspace Clustering with MNIST Digits¶
In this section, we discuss using SSC algorithms on the MNIST dataset.
The MNIST dataset [LBBH98] contains grayscale images of handwritten digits 0-9. The dataset consists of \(60,000\) images. Following [YRV16], for each image we compute a feature vector using a scattering convolution network [BM13]. The feature vector is translation invariant and deformation stable, and has length \(3,472\). The feature vectors are available here.
MNIST Dataset¶
Please download the file MNIST_SC.mat and place it in the sparse-plex/data/mnist directory.
We provide a wrapper class to load the data from this dataset:
md = spx.data.image.ChongMNISTDigits;
Beware, the whole dataset is 1GB in size and can take 10-20 seconds to load depending upon your system capability.
Let's look at the structure md:
>> md
md =
ChongMNISTDigits with properties:
Y: [3472×60000 double]
labels: [1×60000 double]
digits: [0 1 2 3 4 5 6 7 8 9]
cluster_sizes: [5923 6742 5958 6131 5842 5421 5918 6265 5851 5949]
- The \(Y\) matrix contains one feature vector (as a column) per example digit.
- The labels array contains information about the digit represented in each column of \(Y\).
- cluster_sizes gives the number of examples of each digit in this dataset.
Let's look at some labels:
>> md.labels(1:10)
ans =
5 0 4 1 9 2 1 3 1 4
Number of examples of digit 5
>> sum(md.labels == 5)
ans =
5421
>> md.cluster_sizes(5+1)
ans =
5421
The object md provides a method to find out the column indices for a given digit in the labels array.
Let’s find all the indices for digit 4:
>> four_indices = md.digit_indices(4);
>> numel(four_indices)
ans =
5842
Let's check out some of these indices and verify them in the labels array:
>> four_indices(1:4)
ans =
3 10 21 27
>> md.labels(four_indices(1:4))
ans =
4 4 4 4
We can select a subset of samples from this dataset along with the labels as follows:
>> indices = [1 10 11 40];
>> [Y, labels] = md.selected_samples(indices);
>> labels
labels =
5 4 3 6
SSC-OMP on MNIST Dataset¶
In this section, we will go through the steps of applying the SSC-OMP algorithm on the MNIST dataset.
We will work on all the digits:
digit_set = 0:9;
Number of samples for each digit:
num_samples_per_digit = 400;
Number of clusters or corresponding low dimensional subspaces:
K = length(digit_set);
Sizes of each cluster:
cluster_sizes = num_samples_per_digit*ones(1, K);
Let's draw the chosen number of samples for each digit from the MNIST dataset described above:
sample_list = [];
for k=1:K
digit = digit_set(k);
digit_indices = md.digit_indices(digit);
num_digit_samples = length(digit_indices);
choices = randperm(num_digit_samples, cluster_sizes(k));
selected_indices = digit_indices(choices);
sample_list = [sample_list selected_indices];
end
We have picked the column numbers of the samples/examples for each digit and concatenated them into sample_list.
Time to pick up the samples from the dataset along with their labels:
[Y, true_labels] = md.selected_samples(sample_list);
The feature vectors are 3472-dimensional. We don't really need this much detail, so we will perform PCA to reduce the dimension to 500:
fprintf('Performing PCA\n');
tstart = tic;
Y = spx.la.pca.low_rank_approx(Y, 500);
elapsed_time = toc (tstart);
fprintf('Time taken in PCA %.2f seconds\n', elapsed_time);
Performing PCA
Time taken in PCA 17.69 seconds
The ambient space dimension M and the number of data vectors S:
[M, S] = size(Y);
Time to perform sparse subspace clustering with orthogonal matching pursuit:
tstart = tic;
fprintf('Performing SSC OMP\n');
import spx.cluster.ssc.OMP_REPR_METHOD;
solver = spx.cluster.ssc.SSC_OMP(Y, D, K, 1e-3, OMP_REPR_METHOD.FLIPPED_OMP_MATLAB);
solver.Quiet = true;
clustering_result = solver.solve();
elapsed_time = toc (tstart);
fprintf('Time taken in SSC-OMP %.2f seconds\n', elapsed_time);
Performing SSC OMP
Time taken in SSC-OMP 10.54 seconds
Let’s collect the statistics related to clustering error and subspace preserving representations error:
connectivity = clustering_result.connectivity;
% estimated number of clusters
estimated_num_subspaces = clustering_result.num_clusters;
% Time to compare the clustering
cluster_labels = clustering_result.labels;
fprintf('Measuring clustering error and accuracy\n');
comparsion_result = spx.cluster.clustering_error_hungarian_mapping(cluster_labels, true_labels, K);
clustering_error_perc = comparsion_result.error_perc;
clustering_acc_perc = 100 - comparsion_result.error_perc;
spr_stats = spx.cluster.subspace.subspace_preservation_stats(clustering_result.Z, cluster_sizes);
spr_error = spr_stats.spr_error;
spr_flag = spr_stats.spr_flag;
spr_perc = spr_stats.spr_perc;
fprintf('\nclustering error: %0.2f %% , clustering accuracy: %0.2f %%\n, mean spr error: %0.4f preserving : %0.2f %%\n, connectivity: %0.2f, elapsed time: %0.2f sec',...
clustering_error_perc, clustering_acc_perc,...
spr_stats.spr_error, spr_stats.spr_perc,...
connectivity, elapsed_time);
fprintf('\n\n');
Results
Measuring clustering error and accuracy
clustering error: 6.42 % , clustering accuracy: 93.58 %
, mean spr error: 0.3404 preserving : 0.00 %
, connectivity: -1.00, elapsed time: 10.54 sec
SSC-OMP on MNIST Benchmarks¶
The table below reports the performance of the SSC-OMP algorithm on the MNIST dataset. The data consist of a randomly chosen set of images for each of the 10 digits. Scattering network features are extracted from the images and projected to dimension 500 using PCA. The number of images per digit is varied from 50 to 400 across experiments.
Images per Digit | a% | e% | t |
---|---|---|---|
50 | 82.18 | 42.11 | 0.36 |
80 | 87.39 | 39.79 | 0.81 |
100 | 87.20 | 38.86 | 1.11 |
150 | 89.16 | 37.33 | 2.02 |
200 | 89.68 | 36.39 | 3.25 |
300 | 92.19 | 35.18 | 6.27 |
400 | 91.13 | 34.26 | 7.07 |
Benchmarks on SSC-MC-OMP¶
The section describing SSC-MC-OMP algorithm is under development.
Here we report the benchmarks using the SSC-MC-OMP algorithm.
Clustering Accuracy a%
Images per Digit | 1-4 | 2.1-4 | 42.1-4 | 2-4 |
---|---|---|---|---|
50 | 82.18 | 82.87 | 82.68 | 83.81 |
80 | 87.39 | 87.14 | 85.34 | 86.82 |
100 | 87.20 | 87.47 | 86.75 | 89.17 |
150 | 89.16 | 89.15 | 88.06 | 89.09 |
200 | 89.68 | 90.23 | 88.17 | 88.31 |
300 | 92.19 | 91.18 | 87.80 | 91.89 |
400 | 91.13 | 91.52 | 90.16 | 91.50 |
Subspace Preserving Representation Error e%
Images per Digit | 1-4 | 2.1-4 | 42.1-4 | 2-4 |
---|---|---|---|---|
50 | 42.11 | 41.63 | 41.46 | 41.00 |
80 | 39.79 | 39.10 | 38.85 | 38.19 |
100 | 38.86 | 38.12 | 37.80 | 37.06 |
150 | 37.33 | 36.56 | 36.11 | 35.19 |
200 | 36.39 | 35.50 | 34.99 | 34.00 |
300 | 35.18 | 34.15 | 33.59 | 32.60 |
400 | 34.26 | 33.26 | 32.70 | 31.57 |
Time t
Images per Digit | 1-4 | 2.1-4 | 42.1-4 | 2-4 |
---|---|---|---|---|
50 | 2.07 | 3.26 | 5.95 | 9.22 |
80 | 3.57 | 6.22 | 11.67 | 15.77 |
100 | 4.71 | 8.39 | 15.88 | 20.61 |
150 | 8.97 | 15.98 | 30.88 | 37.71 |
200 | 13.50 | 24.94 | 48.13 | 57.25 |
300 | 30.50 | 56.81 | 120.77 | 121.76 |
400 | 50.38 | 95.76 | 177.78 | 192.57 |
Yale Faces Dataset¶
Loading the faces:
yf = spx.data.image.YaleFaces();
yf.load();
Number of subjects:
ns = yf.num_subjects();
Images to load per subject:
ni = yf.ImagesToLoadPerSubject;
Images of a particular subject:
Y = yf.get_subject_images(i);
Resized images of a particular subject:
Y = yf.get_subject_images_resized(i)
Total images:
yf.num_total_images()
Size of image in pixels:
yf.image_size()
Image by global index across all subjects:
yf.get_image_by_glob_idx(index)
Resize all images in buffer:
yf.resize_all(width, height)
yf.resize_all(42, 48);
Describe the contents of the database:
yf.describe()
Create a canvas of images randomly chosen from all subjects:
canvas = yf.create_random_canvas();
imshow(canvas);
colormap(gray);
axis image;
axis off;
Creating a canvas for a particular subject:
yf.resize_all(42, 48);
canvas = yf.create_subject_canvas(1);
imshow(canvas);
colormap(gray);
axis image;
axis off;
Pick ten random images from each subject:
images = yf.training_set_a()
Set Theory¶
Introduction¶
This chapter provides background material on basic concepts of set theory and basic properties of real numbers.
We look at
- Basic properties of sets
- Concept of a function
- Cartesian products
- Relations
- Notion of order in sets
- Countable and uncountable sets
Concepts are developed sequentially, with each concept building upon previously defined ones. Examples are added wherever suitable for better understanding. In the examples we frequently use the sets of real numbers, natural numbers, and integers, even though some of their properties may not have been defined before they are used. This is done keeping in mind that the reader has an intuitive understanding of these numbers, which makes the examples easier to visualize.
The presentation in this chapter largely follows [AB98].
Sets¶
In this section we will review basic concepts of set theory.
Actually, it’s not a formal definition. It is just a working definition which we will use going forward.
- Sets are denoted by capital letters.
- Objects in a set are called members, elements or points.
- \(x \in A\) means that element \(x\) belongs to set \(A\).
- \(x \notin A\) means that \(x\) doesn’t belong to set \(A\).
- \(\{ a,b,c\}\) denotes a set with elements \(a\), \(b\), and \(c\). Their order is not relevant.
Clearly, \(A = B \iff (A \subseteq B \text{ and } B \subseteq A)\).
We define fundamental set operations below
- The union \(A \cup B\) of \(A\) and \(B\) is defined as \(A \cup B = \{ x : x \in A \text{ or } x \in B \}\).
- The intersection \(A \cap B\) of \(A\) and \(B\) is defined as \(A \cap B = \{ x : x \in A \text{ and } x \in B \}\).
- The difference \(A \setminus B\) of \(A\) and \(B\) is defined as \(A \setminus B = \{ x : x \in A \text{ and } x \notin B \}\).
Some useful identities:
- \((A \cup B) \cap C = (A \cap C) \cup (B \cap C)\).
- \((A \cap B) \cup C = (A \cup C) \cap (B \cup C)\).
- \((A \cup B) \setminus C = (A \setminus C) \cup (B \setminus C)\).
- \((A \cap B) \setminus C = (A \setminus C) \cap (B \setminus C)\).
The symmetric difference between \(A\) and \(B\) is defined as \(A \Delta B = (A \setminus B) \cup (B \setminus A)\), i.e. the elements which are in \(A\) but not in \(B\) together with the elements which are in \(B\) but not in \(A\).
Family of sets¶
The following are some examples of index sets:
- \(\{1,2,3,4\}\): the family consists of only 4 sets.
- \(\{0,1,2,3\}\): the family consists again of only 4 sets but indices are different.
- \(\Nat\): the sets in the family are indexed by natural numbers; there are countably infinitely many sets in the family.
- \(\ZZ\): the sets in the family are indexed by integers; there are countably infinitely many sets in the family.
- \(\QQ\): the sets in the family are indexed by rational numbers; there are countably infinitely many sets in the family.
- \(\RR\): There are uncountably infinite sets in the family.
If \(\mathcal{F}\) is a family of sets, then by letting \(I=\mathcal{F}\) and \(A_i = i \quad \forall i \in I\), we can express \(\mathcal{F}\) in the form of \(\{ A_i\}_{i \in I}\).
Let \(\{ A_i\}_{i \in I}\) be a family of sets.
The union of the family is defined to be
\[\bigcup_{i\in I} A_i = \{ x : \exists i \in I \text{ such that } x \in A_i\}.\]The intersection of the family is defined to be
\[\bigcap_{i \in I} A_i = \{ x : x \in A_i \quad \forall i \in I\}.\]
We will also use the simpler notation \(\bigcup A_i\), \(\bigcap A_i\) to denote the union and intersection of the family.
If \(I =\Nat = \{1,2,3,\dots\}\) (the set of natural numbers), then we will denote union and intersection by \(\bigcup_{i=1}^{\infty}A_i\) and \(\bigcap_{i=1}^{\infty}A_i\).
We now have the generalized distributive law:
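In one standard form, for any set \(B\) and any family \(\{A_i\}_{i \in I}\):
\[B \cap \left( \bigcup_{i \in I} A_i \right) = \bigcup_{i \in I} (B \cap A_i), \qquad B \cup \left( \bigcap_{i \in I} A_i \right) = \bigcap_{i \in I} (B \cup A_i).\]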
In the following \(X\) is a big fixed set (sort of a frame of reference) and we will be considering different subsets of it.
Let \(X\) be a fixed set. If \(P(x)\) is a property well defined for all \(x \in X\), then the set of all \(x\) for which \(P(x)\) is true is denoted by \(\{x \in X : P(x)\}\).
We have
- \((A^c)^c = A\).
- \(A \cap A^c = \EmptySet\).
- \(A \cup A^c = X\).
- \(A\setminus B = A \cap B^c\).
- \(A \subseteq B \iff B^c \subseteq A^c\).
- \((A \cup B)^c = A^c \cap B^c\).
- \((A \cap B)^c = A^c \cup B^c\).
Functions¶
A function from a set \(A\) to a set \(B\), in symbols \(f : A \to B\) (or \(A \xrightarrow{f} B\) or \(x \mapsto f(x)\)) is a specific rule that assigns to each element \(x \in A\) a unique element \(y \in B\).
We say that the element \(y\) is the value of the function \(f\) at \(x\) (or the image of \(x\) under \(f\)) and denote as \(f(x)\), that is, \(y = f(x)\).
We also sometimes say that \(y\) is the output of \(f\) when the input is \(x\).
The set \(A\) is called domain of \(f\). The set \(\{y \in B : \exists x \in A \text{ with } y = f(x)\}\) is called the range of \(f\).
This function is not continuous anywhere on the real line.
This function is continuous but not differentiable at \(x=0\).
Let \(f : X \to Y\) be a function. If \(A \subseteq X\), then image of \(A\) under \(f\) denoted as \(f(A)\) (a subset of \(Y\)) is defined by
If \(B\) is a subset of \(Y\) then the inverse image \(f^{-1}(B)\) of \(B\) under \(f\) is the subset of \(X\) defined by
Let \(\{A_i\}_{i \in I}\) be a family of subsets of \(X\).
Let \(\{B_i\}_{i \in I}\) be a family of subsets of \(Y\).
Then the following results hold:
Given two functions \(f : X \to Y\) and \(g : Y \to Z\), their composition \(g \circ f\) is the function \(g \circ f : X \to Z\) defined by
If a function \(f : X \to Y\) is one-one and onto, then for every \(y \in Y\) there exists a unique \(x \in X\) such that \(y = f(x)\). This unique element is denoted by \(f^{-1}(y)\). Thus a function \(f^{-1} : Y \to X\) can be defined by
The function \(f^{-1}\) is called the inverse of \(f\).
We can see that \((f \circ f^{-1})(y) = y\) for all \(y \in Y\).
Also \((f^{-1} \circ f) (x) = x\) for all \(x \in X\).
We define an identity function on a set \(X\) as
Thus we have:
If \(f : X \to Y\) is one-one, then we can define a function \(g : X \to f(Y)\) given by \(g(x) = f(x)\). This function is one-one and onto. Thus \(g^{-1}\) exists. We will use this idea to define an inverse function for a one-one function \(f\) as \(f^{-1} : f(X) \to X\) given by \(f^{-1}(y) = x \Forall y \in f(X)\). Clearly \(f^{-1}\) so defined is one-one and onto between \(X\) and \(f(X)\).
Clearly, we can define a one-one onto function \(f^{-1} : f(X) \to X\) and another one-one onto function \(g^{-1} : g(Y) \to Y\). Let the two-sided sequence \(C_x\) be defined as
Note that the elements in the sequence alternate between \(X\) and \(Y\). On the left side, the sequence stops whenever \(f^{-1}(y)\) or \(g^{-1}(x)\) is not defined. On the right side the sequence goes on infinitely.
We call the sequence an \(X\)-stopper if it stops at an element of \(X\), or a \(Y\)-stopper if it stops at an element of \(Y\). If any element on the left side repeats, then the sequence on the left will keep on repeating. We call the sequence doubly infinite if all the elements (on the left) are distinct, or cyclic if the elements repeat. Define \(Z = X \cup Y\). If an element \(z \in Z\) occurs in two sequences, then the two sequences must be identical by definition. Otherwise, the two sequences must be disjoint. Thus the sequences form a partition of \(Z\). All elements within one equivalence class of \(Z\) are reachable from each other through one such sequence. Elements from different sequences are not reachable from each other at all. Thus, we need to define bijections between the elements of \(X\) and \(Y\) belonging to the same sequence separately.
For an \(X\)-stopper sequence \(C\), every element \(y \in C \cap Y\) is reachable via \(f\). Hence \(f\) serves as the bijection between the elements of \(X\) and \(Y\) in \(C\). For a \(Y\)-stopper sequence \(C\), every element \(x \in C \cap X\) is reachable via \(g\). Hence \(g\) serves as the bijection. For a cyclic or doubly infinite sequence \(C\), every element \(y \in C \cap Y\) is reachable via \(f\) and every element \(x \in C \cap X\) is reachable via \(g\). Thus either of \(f\) and \(g\) can serve as the bijection.
Sequence¶
Any function \(x : \Nat \to X\), where \(\Nat = \{1,2,3,\dots\}\) is the set of natural numbers, is called a sequence of \(X\).
We say that \(x(n)\) denoted by \(x_n\) is the \(n^{\text{th}}\) term in the sequence.
We denote the sequence by \(\{ x_n \}\).
Note that sequence may have repeated elements and the order of elements in a sequence is important.
Cartesian product¶
Let \(\{ A_i \}_{i \in I}\) be a family of sets. Then the Cartesian product \(\prod_{i \in I} A_i\) or \(\prod A_i\) is defined to be the set consisting of all functions \(f : I \to \cup_{i \in I}A_i\) such that \(x_i = f(i) \in A_i\) for each \(i \in I\).
Such a function is called a choice function and often denoted by \((x_i)_{i \in I}\) or simply by \((x_i)\).
If a family consists of two sets, say \(A\) and \(B\), then the Cartesian product of the sets \(A\) and \(B\) is designated by \(A \times B\). The members of \(A \times B\) are denoted as ordered pairs.
Similarly the Cartesian product of a finite family of sets \(\{ A_1, \dots, A_n\}\) is written as \(A_1 \times \dots \times A_n\) and its members are denoted as \(n\)-tuples, i.e.:
Note that \((a_1,\dots, a_n) = (b_1,\dots,b_n)\) if and only if \(a_i = b_i \forall i = 1,\dots,n\).
If \(A_1 = A_2 = \dots = A_n = A\), then it is standard to write \(A_1 \times \dots \times A_n\) as \(A^n\).
Let \(A = \{ 0, +1, -1\}\).
Then \(A^2\) is
And \(A^3\) is given by
If the family of sets \(\{A_i\}_{i \in I}\) satisfies \(A_i = A \forall i \in I\), then \(\prod_{i \in I} A_i\) is written as \(A^I\).
i.e. \(A^I\) is the set of all functions from \(I\) to \(A\).
- Let \(A = \{0, 1\}\). \(A^{\RR}\) is a set of all functions on \(\RR\) which can take only one of the two values \(0\) or \(1\). \(A^{\Nat}\) is a set of all sequences of zeros and ones.
- \(\RR^\RR\) is a set of all functions from \(\RR\) to \(\RR\).
Axiom of choice¶
If a Cartesian product is non-empty, then each \(A_i\) must be non-empty.
We can therefore ask: if each \(A_i\) is non-empty, is the Cartesian product \(\prod A_i\) then non-empty?
An affirmative answer cannot be proven within the usual axioms of set theory.
This requires us to introduce the axiom of choice.
Another way to state the axiom of choice is:
Relations¶
A binary relation on a set \(X\) is defined as a subset \(\mathcal{R}\) of \(X \times X\).
If \((x,y) \in \mathcal{R}\) then \(x\) is said to be in relation \(\mathcal{R}\) with \(y\). This is denoted by \(x \mathcal{R} y\).
The most interesting relations are equivalence relations.
A relation \(\mathcal{R}\) on a set \(X\) is said to be an equivalence relation if it satisfies the following properties:
- \(x \mathcal{R} x\) for each \(x \in X\) (reflexivity).
- If \(x \mathcal{R} y\) then \(y \mathcal{R} x\) (symmetry).
- If \(x \mathcal{R} y\) and \(y \mathcal{R} z\) then \(x \mathcal{R} z\) (transitivity).
We can now introduce equivalence classes on a set.
Let \(\mathcal{R}\) be an equivalence relation on a set \(X\). Then the equivalence class determined by the element \(x \in X\) is denoted by \([x]\) and is defined as \([x] = \{ y \in X : y \mathcal{R} x \}\), i.e. all elements in \(X\) which are related to \(x\).
We can now look at some properties of equivalence classes and relations.
Let \(X\) be the set of integers \(\ZZ\). Let \(\mathcal{R}\) be defined as
i.e. \(x\) and \(y\) are related if the difference of \(x\) and \(y\) given by \(x-y\) is divisible by \(2\).
Clearly the set of odd integers and the set of even integers form two disjoint equivalence classes.
If a set \(X\) can be represented as a union of a family \(\{A_i\}_{i \in I}\) of pairwise disjoint sets i.e.
then we say that \(\{A_i\}_{i \in I}\) is a partition of \(X\).
A partition over a set \(X\) also defines an equivalence relation on it.
If there exists a family \(\{A_i\}_{i \in I}\) of pairwise disjoint sets which partitions a set \(X\) (i.e. \(X = \cup_{i \in I} A_i\)), then by letting
\[\mathcal{R} = \{ (x, y) \in X \times X : \exists i \in I \text{ such that } x, y \in A_i \},\]
an equivalence relation is defined on \(X\) whose equivalence classes are precisely the sets \(A_i\).
In words, the relation \(\mathcal{R}\) includes only those tuples \((x,y)\) from the Cartesian product \(X\times X\) for which there exists one set \(A_i\) in the family of sets \(\{A_i\}_{i \in I}\) such that both \(x\) and \(y\) belong to \(A_i\).
Order¶
Another important type of relation is an order relation.
A relation, denoted by \(\leq\), on a set \(X\) is said to be a partial order for \(X\) (or that \(X\) is partially ordered by \(\leq\)) if it satisfies the following properties:
- \(x \leq x\) holds for every \(x \in X\) (reflexivity).
- If \(x \leq y\) and \(y \leq x\), then \(x = y\) (antisymmetry).
- If \(x \leq y\) and \(y \leq z\), then \(x \leq z\) (transitivity).
An alternative notation for \(x \leq y\) is \(y \geq x\).
Consider a set \(A = \{1,2,3\}\). Consider the power set of \(A\) which is
Define a relation \(\mathcal{R}\) on \(X\) such that \(x \mathcal{R} y\) if \(x \subseteq y\).
Clearly
- \(x \subseteq x \quad \forall x \in X\).
- If \(x \subseteq y\) and \(y \subseteq x\) then \(x =y\).
- If \(x \subseteq y\) and \(y \subseteq z\) then \(x \subseteq z\).
Thus the relation \(\mathcal{R}\) defines a partial order on the power set \(X\).
We can look at how elements are ordered within a set a bit more closely.
A subset \(Y\) of a partially ordered set \(X\) is called a chain if for every \(x, y \in Y\) either \(x \leq y\) or \(y \leq x\) holds.
A chain is also known as a totally ordered set.
- In a partially ordered set \(X\), we don’t require that for every \(x,y \in X\), either \(x \leq y\) or \(y \leq x\) should hold. Thus there could be elements which are not connected by the order relation.
- In a totally ordered set \(Y\), for every \(x,y \in Y\) we require that either \(x \leq y\) or \(y \leq x\).
- If a set is totally ordered, then it is partially ordered also.
Continuing from previous example consider a subset \(Y\) of \(X\) defined by
Clearly for every \(x, y \in Y\), either \(x \subseteq y\) or \(y \subseteq x\) holds.
Hence \(Y\) is a chain or a totally ordered set within \(X\).
- The set of natural numbers \(\Nat\) is totally ordered.
- The set of integers \(\ZZ\) is totally ordered.
- The set of real numbers \(\RR\) is totally ordered.
- Suppose we define an order relation in the set of complex numbers as follows. Let \(x+jy\) and \(u+jv\) be two complex numbers. We say that
With this definition, the set of complex numbers \(\CC\) is partially ordered.
- \(\RR\) is a totally ordered subset of \(\CC\) since the imaginary component is 0 for all real numbers in the complex plane.
- In fact any line or a ray or a line segment in the complex plane represents a totally ordered set in the complex plane.
We can now define the notion of upper bounds in a partially ordered set.
Note that there can be more than one upper bound of \(Y\); an upper bound is not required to be unique.
This means that there is no other element in \(X\) which is greater than \(m\).
A maximal element need not be unique. A partially ordered set may contain more than one maximal element.
Consider the following set
The set is partially ordered w.r.t. the relation \(\subseteq\).
There are three maximal elements in this set namely \(\{1,2\} , \{2,3\} , \{1,3\}\).
- The set of natural numbers \(\Nat\) has no maximal element.
What are the conditions under which a maximal element is guaranteed in a partially ordered set \(X\)?
The following statement, known as Zorn's lemma, guarantees the existence of maximal elements in certain partially ordered sets.
The following is the corresponding notion of lower bounds.
As before, there can be more than one minimal element in a set.
Countable and uncountable sets¶
In this section, we deal with questions concerning the size of a set.
When do we say that two sets have same number of elements?
If we can find a one-to-one correspondence between two sets \(A\) and \(B\), then we say that the two sets \(A\) and \(B\) have the same number of elements.
In other words, if there exists a function \(f : A \to B\) that is one-to-one and onto (hence invertible), we say that \(A\) and \(B\) have the same number of elements.
Note that two sets may be equivalent yet not equal to each other.
The set of natural numbers \(\Nat\) is equivalent to the set of integers \(\ZZ\). Consider the function \(f : \Nat \to \ZZ\) given by
\[\begin{split}f (n) = \left\{ \begin{array}{ll} (n - 1) / 2 & \mbox{if $n$ is odd};\\ -n / 2 & \mbox{if $n$ is even}. \end{array} \right.\end{split}\]It is easy to show that this function is one-one and onto.
\(\Nat\) is equivalent to the set of even natural numbers \(E\). Consider the function \(f : \Nat \to E\) given by \(f(n) = 2n\). This is one-one and onto.
\(\Nat\) is equivalent to the set of rational numbers \(\QQ\).
The sets \(\{a, b, c\}\) and \(\{1,4, 9\}\) are equivalent but not equal.
Let \(A, B, C\) be sets. Then:
- \(A \sim A\).
- If \(A \sim B\), then \(B \sim A\).
- If \(A \sim B\), and \(B \sim C\), then \(A \sim C\).
Thus it is an equivalence relation.
(i). Construct a function \(f : A \to A\) given by \(f (a) = a \Forall a \in A\). This is a one-one and onto function. Hence \(A \sim A\).
(ii). It is given that \(A \sim B\). Thus, there exists a function \(f : A \to B\) which is one-one and onto. Thus, there exists an inverse function \(g : B \to A\) which is one-one and onto. Thus, \(B \sim A\).
(iii). It is given that \(A \sim B\) and \(B \sim C\). Thus there exist two one-one and onto functions \(f : A \to B\) and \(g : B \to C\). Define a function \(h : A \to C\) given by \(h = g \circ f\). Since composition of bijective functions is bijective , \(h\) is one-one and onto. Thus, \(A \sim C\).
We now look closely at the set of natural numbers \(\Nat = \{1,2,3,\dots\}\).
Clearly, two segments \(\{1,\dots,m\}\) and \(\{1,\dots,n\}\) are equivalent only if \(m= n\).
Thus a proper subset of a segment cannot be equivalent to the segment.
The number of elements of a set which is equivalent to a segment is equal to the number of elements in the segment.
The empty set is also considered to be finite with zero elements.
It should be noted that so far we have defined number of elements only for sets which are equivalent to a segment.
A countable set \(A\) is usually written as \(A = \{a_1, a_2, \dots\}\) which indicates the one-to-one correspondence of \(A\) with the set of natural numbers \(\Nat\).
This notation is also known as the enumeration of \(A\).
With the definitions in place, we are now ready to study the connections between countable, uncountable and finite sets.
If a subset \(S\) of \(\Nat\) satisfies the following properties:
- \(1 \in S\) and
- \(n \in S \implies n + 1 \in S\),
then \(S = \Nat\).
The principle of mathematical induction is applied as follows. We consider a set \(S = \{ n \in \Nat : n \mbox{ satisfies } P \}\) where \(P\) is some property that the members of this set satisfy. We show that \(1\) satisfies the property \(P\). Further, we show that if \(n\) satisfies property \(P\), then \(n + 1\) also has to satisfy \(P\). Then, applying the principle of mathematical induction, we claim that \(S = \Nat\), i.e. every number \(n \in \Nat\) satisfies the property \(P\).
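As a standard illustration, consider the property \(P(n) : 1 + 2 + \dots + n = \frac{n(n+1)}{2}\). For \(n = 1\), the claim reads \(1 = \frac{1 \cdot 2}{2}\), so \(1 \in S\). If \(n \in S\), then
\[1 + 2 + \dots + n + (n+1) = \frac{n(n+1)}{2} + (n + 1) = \frac{(n+1)(n+2)}{2},\]
so \(n + 1 \in S\). By the principle of mathematical induction, \(S = \Nat\), i.e. the formula holds for every natural number.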
We present different characterizations of a countable set.
Let \(A\) be an infinite set. The following are equivalent:
- \(A\) is countable.
- There exists a subset \(B\) of \(\Nat\) and a function \(f: B \to A\) that is onto.
- There exists a function \(g : A \to \Nat\) that is one-one.
(i) \(\implies\) (ii): Since \(A\) is countable, there exists a function \(f : \Nat \to A\) which is one-one and onto. Choosing \(B = \Nat\), we get the result.
(ii) \(\implies\) (iii): We are given that there exists a subset \(B\) of \(\Nat\) and a function \(f: B \to A\) that is onto. For some \(a \in A\), consider \(f^{-1}(a) = \{ b \in B : f(b) = a \}\). Since \(f\) is onto, \(f^{-1}(a)\) is non-empty. Since \(f^{-1}(a)\) is a set of natural numbers, it has a least element due to the well ordering principle. Further, if \(a_1, a_2 \in A\) are distinct, then \(f^{-1}(a_1)\) and \(f^{-1}(a_2)\) are disjoint and the corresponding least elements are distinct. Assign \(g(a) = \text{ least element of } f^{-1}(a) \Forall a \in A\). Such a function is well defined by construction. Clearly, the function is one-one.
(iii) \(\implies\) (i): We are given that there exists a function \(g : A \to \Nat\) that is one-one. Clearly, \(A \sim g(A)\) where \(g(A) \subseteq \Nat\). Since \(A\) is infinite, \(g(A)\) is also infinite. Since every infinite subset of \(\Nat\) is countable, \(g(A)\) is countable, implying \(g(A) \sim \Nat\). Thus, \(A \sim g(A) \sim \Nat\) and \(A\) is countable.
Let \(\{A_1, A_2, \dots \}\) be a countable family of sets where each \(A_i\) is a countable set. Then
is countable.
Let \(A_i = \{a_1^i, a_2^i, \dots\} \Forall 1 \leq i \leq n\). Choose \(n\) distinct prime numbers \(p_1, p_2, \dots, p_n\). Consider the set \(B = \{p_1^{k_1}p_2^{k_2} \dots p_n^{k_n} : k_1, k_2, \dots, k_n \in \Nat \}\). Clearly, \(B \subset \Nat\). Define a function \(f : A \to \Nat\) as
By the fundamental theorem of arithmetic, every natural number has a unique prime factorization. Thus, \(f\) is one-one. Invoking the preceding characterization of countable sets, \(A\) is countable.
Let \(F\) denote the set of finite subsets of \(\Nat\). Let \(f \in F\). Then we can write \(f = \{n_1, \dots, n_k\}\) where \(k\) is the number of elements in \(f\). Consider the sequence of prime numbers \(\{p_n\}\) where \(p_n\) denotes \(n\)-th prime number. Now, define a mapping \(g : F \to \Nat\) as
The mapping \(g\) is one-one, since the prime decomposition of a natural number is unique. Hence, invoking the preceding characterization of countable sets, \(F\) is countable.
In this sense, \(B\) has at least as many elements as \(A\).
The relation \(\preceq\) satisfies following properties
- \(A \preceq A\) for all sets \(A\).
- If \(A \preceq B\) and \(B \preceq C\), then \(A \preceq C\).
- If \(A \preceq B\) and \(B \preceq A\), then \(A \sim B\).
(i). We can use the identity function \(f (a ) = a \Forall a \in A\).
(ii). Straightforward application of the result that composition of injective functions is injective.
(iii). Straightforward application of Schröder-Bernstein theorem.
If \(A = \EmptySet\), then \(\Power(A) = \{ \EmptySet\}\) and the result is trivial. So, let's consider non-empty \(A\). We can choose \(f : A \to \Power(A)\) given by \(f (x) = \{ x\} \Forall x \in A\). This is clearly a one-one function leading to \(A \preceq \Power (A)\).
Now, for the sake of contradiction, let us assume that \(A \sim \Power (A)\). Then, there exists a bijective function \(g : A \to \Power(A)\). Consider the set \(B = \{ a \in A : a \notin g(a) \}\). Since \(B \subseteq A\) and \(g\) is bijective, there exists \(a \in A\) such that \(g (a) = B\).
Now if \(a \in B\) then \(a \notin g(a) = B\). And if \(a \notin B\), then \(a \in g(a) = B\). This is impossible, hence \(A \nsim \Power(A)\).
Note that the cardinal numbers are different from natural numbers, real numbers etc. If \(A\) is finite, with \(A = \{a_1, a_2, \dots, a_n \}\), then \(\Card{A} = n\). We use the symbol \(\aleph_0\) to denote the cardinality of \(\Nat\). By saying \(A\) has the cardinality of \(\aleph_0\), we simply mean that \(A \sim \Nat\).
If \(a\) and \(b\) are two cardinal numbers, then by \(a \leq b\), we mean that there exist two sets \(A\) and \(B\) such that \(\Card{A} = a\), \(\Card{B} = b\) and \(A \preceq B\). By \(a < b\), we mean that \(A \preceq B\) and \(A \nsim B\). \(a \leq b\) and \(b \leq a\) guarantees that \(a = b\).
It can be shown that \(\Power(\Nat) \sim \RR\). The cardinality of \(\RR\) is denoted by \(\mathfrak{c}\).
\(2^X\) is the set of all functions \(f : X \to 2\), i.e. functions from \(X\) to \(\{ 0, 1 \}\) which can take only one of the two values \(0\) and \(1\).
Define a function \(g : \Power (X) \to 2^X\) as follows. Let \(y \in \Power(X)\). Then \(g(y)\) is the function \(f : X \to \{ 0, 1 \}\) given by \(f(x) = 1\) if \(x \in y\) and \(f(x) = 0\) otherwise.
The function \(g\) is one-one and on-to. Thus \(2^X \sim \Power(X)\).
We denote the cardinal number of \(\Power(X)\) by \(2^{\Card{X}}\). Thus, \(\mathfrak{c} = 2^{\aleph_0}\).
The following inequalities of cardinal numbers hold:
Linear Algebra¶
Vector Spaces¶
Algebraic structures¶
In mathematics, the term algebraic structure refers to an arbitrary set with one or more operations defined on it. Simpler algebraic structures include groups, rings, and fields. More complex algebraic structures like vector spaces are built on top of the simpler structures. We will develop the notion of vector spaces as a progression of these algebraic structures.
Groups¶
A group is a set with a single binary operation. It is one of the simplest algebraic structures.
Let \(G\) be a set and let \(*\) be a binary operation defined on \(G\) as:
such that the binary operation \(*\) satisfies following requirements.
[Closure] The set \(G\) is closed under the binary operation \(*\). i.e.
\[\forall g_1, g_2 \in G, g_1 * g_2 \in G.\][Associativity] For every \(g_1, g_2, g_3 \in G\)
\[g_1 * (g_2 * g_3) = (g_1 * g_2) * g_3\][Identity element] There exists an element \(e \in G\) such that
\[g * e = e * g = g \quad \forall g \in G\][Inverse element] For every \(g \in G\) there exists an element \(g^{-1} \in G\) such that
\[g * g^{-1} = g^{-1} * g = e\]
Then the set \(G\) together with the operator \(*\) denoted as \((G, *)\) is known as a group.
The above requirements are known as the group axioms. Note that commutativity is not a requirement of a group.
In the sequel we will write \(g_1 * g_2\) as \(g_1 g_2\).
Commutative groups¶
A commutative group is a richer structure than a group. Its elements also satisfy commutativity property.
Let \((G, *)\) be a group such that it satisfies
- [Commutativity] For every \(g_1, g_2 \in G\)
Then \((G,*)\) is known as a commutative group or an Abelian group.
In the sequel we may simply write a group \((G, *)\) as \(G\) when the underlying operation \(*\) is clear from context.
Rings¶
A ring is a set with two binary operations defined over it with some requirements as described below.
Let \(R\) be a set with two binary operations \(+\) (addition) and \(\cdot\) (multiplication) defined over it as:
such that \((R, +, \cdot)\) satisfies following requirements:
\((R, +)\) is an Abelian group.
\(R\) is closed under multiplication.
\[r_1 \cdot r_2 \in R \quad \forall r_1, r_2 \in R\]Multiplication is associative.
\[r_1 \cdot (r_2 \cdot r_3) = (r_1 \cdot r_2) \cdot r_3 \quad \forall r_1, r_2, r_3 \in R\]Multiplication distributes over addition.
\[\begin{split}\begin{aligned} &r_1 \cdot (r_2 + r_3) = (r_1 \cdot r_2) + (r_1 \cdot r_3) \quad \forall r_1, r_2, r_3 \in R\\ &(r_1 + r_2) \cdot r_3 = (r_1 \cdot r_3) + (r_2 \cdot r_3) \quad \forall r_1, r_2, r_3 \in R \end{aligned}\end{split}\]
Then \((R, +, \cdot)\) is known as an associative ring.
We denote the identity element for \(+\) as \(0\) and call it additive identity.
In the sequel we will write \(r_1 \cdot r_2\) as \(r_1 r_2\).
We may simply write a ring \((R, +, \cdot)\) as \(R\) when the underlying operations \(+,\cdot\) are clear from context.
There is a hierarchy of ring like structures. In particular we mention:
- Associative ring with identity
- Field
Let \((R, +, \cdot)\) be an associative ring such that it satisfies following additional requirement:
There exists an element \(1 \in R\) (known as multiplicative identity) such that
\[1 \cdot r = r \cdot 1 = r \quad \forall r \in R\]
Then \((R, +, \cdot)\) is known as an associative ring with identity.
Fields¶
A field is the richest of the algebraic structures considered here on one set with two operations.
Let \(F\) be a set with two binary operations \(+\) (addition) and \(\cdot\) (multiplication) defined over it as:
such that \((F, +, \cdot)\) satisfies following requirements:
\((F, +)\) is an Abelian group (with additive identity as \(0 \in F\)).
\((F \setminus \{0\}, \cdot)\) is an Abelian group (with multiplicative identity as \(1 \in F\)).
Multiplication distributes over addition:
\[\alpha \cdot (\beta + \gamma) = (\alpha \cdot \beta) + (\alpha \cdot \gamma) \quad \forall \alpha, \beta, \gamma \in F\]
Then \((F, +, \cdot)\) is known as a field.
- The set of real numbers \(\RR\) is a field.
- The set of complex numbers \(\CC\) is a field.
- The Galois field GF(2) is the set \(\{ 0, 1 \}\) with modulo-2 addition and multiplication; its operation tables are shown below.
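For concreteness, the GF(2) operation tables are:
\[\begin{array}{c|cc} + & 0 & 1 \\ \hline 0 & 0 & 1 \\ 1 & 1 & 0 \end{array} \qquad\qquad \begin{array}{c|cc} \cdot & 0 & 1 \\ \hline 0 & 0 & 0 \\ 1 & 0 & 1 \end{array}\]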
Vector space¶
We are now ready to define a vector space. A vector space involves two sets. One set \(\VV\) contains the vectors. The other set \(\mathrm{F}\) (a field) contains scalars which are used to scale the vectors.
A set \(\VV\) is called a vector space over the field \(\mathrm{F}\) (or an \(\mathrm{F}\)-vector space) if there exist two mappings
which satisfy following requirements:
\((\VV, +)\) is an Abelian group.
Scalar multiplication \(\cdot\) distributes over vector addition \(+\):
\[\alpha (v_1 + v_2) = \alpha v_1 + \alpha v_2 \quad \forall \alpha \in \mathrm{F}; \forall v_1, v_2 \in \VV.\]Addition in \(\mathrm{F}\) distributes over scalar multiplication \(\cdot\):
\[( \alpha + \beta) v = (\alpha v) + (\beta v) \quad \forall \alpha, \beta \in \mathrm{F}; \forall v \in \VV.\]Multiplication in \(\mathrm{F}\) commutes over scalar multiplication:
\[(\alpha \beta) \cdot v = \alpha \cdot (\beta \cdot v) = \beta \cdot (\alpha \cdot v) = (\beta \alpha) \cdot v \quad \forall \alpha, \beta \in \mathrm{F}; \forall v \in \VV.\]Scalar multiplication from multiplicative identity \(1 \in \mathrm{F}\) satisfies the following:
\[1 v = v \quad \forall v \in \VV.\]
Some remarks are in order:
- \(\VV\) as defined above is also known as an \(\mathrm{F}\) vector space.
- Elements of \(\VV\) are known as vectors.
- Elements of \(\mathrm{F}\) are known as scalars.
- There are two \(0\) involved: \(0 \in \mathrm{F}\) and \(0 \in \VV\). It should be clear from context which \(0\) is being referred to.
- \(0 \in \VV\) is known as the zero vector.
- All vectors in \(\VV \setminus \{0\}\) are non-zero vectors.
- We will typically denote elements of \(\mathrm{F}\) by \(\alpha, \beta, \dots\).
- We will typically denote elements of \(\VV\) by \(v_1, v_2, \dots\).
We quickly look at some vector spaces which will appear again and again in our discussions.
Let \(\mathrm{F}\) be some field.
The set of all \(N\)-tuples \((a_1, a_2, \dots, a_N)\) with \(a_1, a_2, \dots, a_N \in \mathrm{F}\) is denoted as \(\mathrm{F}^N\). This is a vector space with the operations of coordinate-wise addition and scalar multiplication.
Let \(u, v \in \mathrm{F}^N\) with \(u = (u_1, u_2, \dots, u_N)\) and \(v = (v_1, v_2, \dots, v_N)\).
Addition is defined as \(u + v = (u_1 + v_1, \dots, u_N + v_N)\).
Let \(c \in \mathrm{F}\). Scalar multiplication is defined as \(c u = (c u_1, \dots, c u_N)\).
\(u, v\) are called equal if \(u_1 = v_1, \dots, u_N = v_N\).
In matrix notation, vectors in \(\mathrm{F}^N\) are also written as row vectors
or column vectors
Let \(\mathrm{F}\) be some field. A matrix is an array of the form
with \(M\) rows and \(N\) columns where \(a_{ij} \in \mathrm{F}\).
The set of these matrices is denoted as \(\mathrm{F}^{M \times N}\) which is a vector space with operations of matrix addition and scalar multiplication.
Let \(A, B \in \mathrm{F}^{M \times N}\). Matrix addition is defined by \((A + B)_{ij} = a_{ij} + b_{ij}\).
Let \(c \in \mathrm{F}\). Scalar multiplication is defined by \((c A)_{ij} = c \, a_{ij}\).
Let \(\mathrm{F}[x]\) denote the set of all polynomials with coefficients drawn from the field \(\mathrm{F}\), i.e. if \(f(x) \in \mathrm{F}[x]\), then it can be written as \(f(x) = a_n x^n + a_{n - 1} x^{n - 1} + \dots + a_1 x + a_0\)
where \(a_i \in \mathrm{F}\).
The set \(\mathrm{F}[x]\) is a vector space with usual operations of addition and scalar multiplication
Some useful results are presented without proof.
This is known as the cancellation law of vector spaces: if \(x + z = y + z\) for some \(x, y, z \in \VV\), then \(x = y\).
In a vector space \(\VV\) the following statements are true
- \(0x = 0 \Forall x \in \VV\).
- \((-a)x = - (ax) = a(-x) \Forall a \in \mathrm{F} \text{ and } x \in \VV\).
- \(a 0 = 0 \Forall a \in \mathrm{F}\).
Linear independence¶
A linear combination of two vectors \(v_1, v_2 \in \VV\) is defined as \(\alpha v_1 + \beta v_2\), where \(\alpha, \beta \in \mathrm{F}\).
A linear combination of \(p\) vectors \(v_1,\dots, v_p \in \VV\) is defined as \(\alpha_1 v_1 + \dots + \alpha_p v_p\), where \(\alpha_1, \dots, \alpha_p \in \mathrm{F}\).
Let \(\VV\) be a vector space and let \(S\) be a nonempty subset of \(\VV\). A vector \(v \in \VV\) is called a linear combination of vectors of \(S\) if there exist a finite number of vectors \(s_1, s_2, \dots, s_n \in S\) and scalars \(a_1, \dots, a_n\) in \(\mathrm{F}\) such that
We also say that \(v\) is a linear combination of \(s_1, s_2, \dots, s_n\) and \(a_1, a_2, \dots, a_n\) are the coefficients of linear combination.
Note that \(0\) is a trivial linear combination of any subset of \(\VV\).
Note that a linear combination may refer to the expression itself or to its value; e.g. two different linear combinations may have the same value.
Note that a linear combination always consists of a finite number of vectors.
A finite set of non-zero vectors \(\{v_1, \cdots, v_p\} \subset \VV\) is called linearly dependent if there exist \(\alpha_1,\dots,\alpha_p \in \mathrm{F}\), not all \(0\), such that \(\alpha_1 v_1 + \dots + \alpha_p v_p = 0\).
A set \(S \subseteq \VV\) is called linearly dependent if there exist a finite number of distinct vectors \(u_1, u_2, \dots, u_n \in S\) and scalars \(a_1, a_2, \dots, a_n \in \mathrm{F}\), not all zero, such that \(a_1 u_1 + a_2 u_2 + \dots + a_n u_n = 0\).
More specifically, a finite set of non-zero vectors \(\{v_1, \cdots, v_n\} \subset \VV\) is called linearly independent if \(\sum_{i=1}^n \alpha_i v_i = 0\) implies \(\alpha_i = 0\) for all \(1 \leq i \leq n\).
- The empty set is linearly independent.
- A set of a single non-zero vector \(\{v\}\) is always linearly independent. Prove!
- If two vectors are linearly dependent, we say that they are collinear.
- Alternatively if two vectors are linearly independent, we say that they are not collinear.
- If a set \(\{v_1, \cdots, v_p\}\) is linearly independent, then any subset of it will be linearly independent. Prove!
- Adding another vector \(v\) to the set may make it linearly dependent. When?
- It is possible to have an infinite set to be linearly independent. Consider the set of polynomials \(\{1, x, x^2, x^3, \dots\}\). This set is infinite, yet linearly independent.
Span¶
Vectors can be combined to form other vectors. It makes sense to consider the set of all vectors which can be created by combining a given set of vectors.
Let \(S \subset \VV\) be a subset of vectors. The span of \(S\) denoted as \(\langle S \rangle\) or \(\Span(S)\) is the set of all possible linear combinations of vectors belonging to \(S\).
For convenience we define \(\Span(\EmptySet) = \{ 0 \}\).
Span of a finite set of vectors \(\{v_1, \cdots, v_p\}\) is denoted by \(\langle v_1, \cdots, v_p \rangle\).
We say that a set of vectors \(S \subseteq \VV\) spans \(\VV\) if \(\langle S \rangle = \VV\).
Let \(S \subset \VV\). We say that \(S\) spans (or generates) \(\VV\) if
In this case we also say that vectors of \(S\) span (or generate) \(\VV\).
Basis¶
- Since \(\Span(\EmptySet) = \{ 0 \}\) and \(\EmptySet\) is linearly independent, \(\EmptySet\) is a basis for the zero vector space \(\{ 0 \}\).
- The basis \(\{ e_1, \dots, e_N\}\) with \(e_1 = (1, 0, \dots, 0)\), \(e_2 = (0, 1, \dots, 0)\), \(\dots\), \(e_N = (0, 0, \dots, 1)\), is called the standard basis for \(\mathrm{F}^N\).
- The set \(\{1, x, x^2, x^3, \dots\}\) is the standard basis for \(\mathrm{F}[x]\). Indeed, an infinite basis. Note that though the basis itself is infinite, every polynomial \(p \in \mathrm{F}[x]\) is a linear combination of a finite number of elements from the basis.
We review some properties of bases.
Let \(\VV\) be a vector space and \(\mathcal{B} = \{ v_1, v_2, \dots, v_n\}\) be a subset of \(\VV\). Then \(\mathcal{B}\) is a basis for \(\VV\) if and only if each \(v \in \VV\) can be uniquely expressed as a linear combination of vectors of \(\mathcal{B}\):
for unique scalars \(a_1, \dots, a_n\).
This theorem states that a basis \(\mathcal{B}\) provides a unique representation to each vector \(v \in \VV\) where the representation is defined as the \(n\)-tuple \((a_1, a_2, \dots, a_n)\).
If the basis is infinite, then the above theorem needs to be modified as follows:
Let \(\VV\) be a vector space and \(\mathcal{B}\) be a subset of \(\VV\). Then \(\mathcal{B}\) is a basis for \(\VV\) if and only if each \(v \in \VV\) can be uniquely expressed as a linear combination of vectors of \(\mathcal{B}\):
for unique scalars \(a_1, \dots, a_n\) and unique vectors \(v_1, v_2, \dots v_n \in \mathcal{B}\).
Let \(\VV\) be a vector space that is spanned by a set \(G\) containing exactly \(n\) vectors. Let \(L\) be a linearly independent subset of \(\VV\) containing exactly \(m\) vectors.
Then \(m \leq n\) and there exists a subset \(H\) of \(G\) containing exactly \(n-m\) vectors such that \(L \cup H\) spans \(\VV\).
A vector space \(\VV\) is called finite-dimensional if it has a basis consisting of a finite number of vectors. The number of vectors is the same in every basis \(\mathcal{B}\) of \(\VV\); this number is called the dimension or dimensionality of the vector space. It is denoted as \(\dim \VV\). We say:
If \(\VV\) is not finite-dimensional, then we say that \(\VV\) is infinite-dimensional.
- Dimension of \(\mathrm{F}^N\) is \(N\).
- Dimension of \(\mathrm{F}^{M \times N}\) is \(MN\).
- The vector space of polynomials \(\mathrm{F}[x]\) is infinite dimensional.
Let \(\VV\) be a vector space with dimension \(n\).
- Any finite spanning set for \(\VV\) contains at least \(n\) vectors, and a spanning set that contains exactly \(n\) vectors is a basis for \(\VV\).
- Any linearly independent subset of \(\VV\) that contains exactly \(n\) vectors is a basis for \(\VV\).
- Every linearly independent subset of \(\VV\) can be extended to a basis for \(\VV\).
Typically we will write an ordered basis as \(\BBB = \{ v_1, v_2, \dots, v_n\}\) and assume that the basis vectors are ordered in the order they appear.
With the help of an ordered basis, we can define a coordinate vector.
Let \(\BBB = \{ v_1, \dots, v_n\}\) be an ordered basis for \(\VV\), and for \(x \in \VV\), let \(\alpha_1, \dots, \alpha_n\) be unique scalars such that
The coordinate vector of \(x\) relative to \(\BBB\) is defined as
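Numerically, if the basis vectors are stacked as the columns of a matrix \(B\), the coordinate vector is just the solution of a linear system. A minimal MATLAB sketch with a hypothetical basis and vector:
% minimal sketch: coordinates of x relative to an ordered basis stored as
% the columns of B (hypothetical example in R^3)
B = [1 1 0; 0 1 1; 1 0 1];   % columns form an ordered basis of R^3
x = [2; 3; 4];
alpha = B \ x;               % coordinate vector of x relative to the basis
% B * alpha reconstructs x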
Subspace¶
Let \(W\) be a subset of \(\VV\). Then \(W\) is called a subspace if \(W\) is a vector space in its own right under the same vector addition \(+\) and scalar multiplication \(\cdot\) operations. i.e.
are defined by restricting \(+ : \VV \times \VV \to \VV\) and \(\cdot : \mathrm{F} \times \VV \to \VV\) to \(W\), and \(W\) is closed under these operations.
- \(\VV\) is a subspace of \(\VV\).
- \(\{0\}\) is a subspace of any \(\VV\).
A subset \(\WW \subseteq \VV\) is a subspace of \(\VV\) if and only if
- \(0 \in\WW\)
- \(x + y \in\WW\) whenever \(x, y \in\WW\)
- \(\alpha x \in\WW\) whenever \(\alpha \in \mathrm{F}\) and \(x \in\WW\).
A square matrix \(M \in \mathrm{F}^{N \times N}\) is symmetric if
The set of symmetric matrices forms a subspace of the set of all \(N \times N\) matrices.
A matrix \(M\) is called diagonal if \(M_{ij} = 0\) whenever \(i \neq j\).
The set of diagonal matrices is a subspace of \(\mathrm{F}^{M \times N}\).
We note that a union of subspaces is not necessarily a subspace, since it is not closed under addition.
The span of a set \(S \subset \VV\) given by \(\langle S \rangle\) is a subspace of \(\VV\).
Moreover any subspace of \(\VV\) that contains \(S\) must also contain the span of \(S\).
This theorem is quite useful. It allows us to construct subspaces from a given basis.
Let \(\mathcal{B}\) be a basis of an \(n\) dimensional space \(\VV\). There are \(n\) vectors in \(\mathcal{B}\). We can create \(2^n\) distinct subsets of \(\mathcal{B}\). Thus we can construct \(2^n\) distinct subspaces of \(\VV\).
Choosing some other basis lets us construct another set of subspaces.
An \(n\)-dimensional vector space has infinite number of bases. Correspondingly, there are infinite possible subspaces.
If \(W_1\) and \(W_2\) are two subspaces of \(\VV\) then we say that \(W_1\) is smaller than \(W_2\) if \(W_1 \subset W_2\).
Let \(\WW\) be the smallest subspace containing vectors \(\{ v_1, \dots, v_p \}\). Then
i.e. \(\WW\) is the same as the span of \(\{ v_1, \dots, v_p \}\).
Let \(\WW\) be a subspace of a finite-dimensional vector space \(\VV\). Then \(\WW\) is finite dimensional and
Moreover, if
then \(\WW = \VV\).
Let \(\VV\) be a finite dimensional vector space and \(\WW\) be a subspace of \(\VV\). The codimension of \(\WW\) is defined as
Linear transformations¶
In this section, we will be using symbols \(\VV\) and \(\WW\) to represent arbitrary vector spaces over a field \(\FF\). Unless specified the two vector spaces won’t be related in any way.
Following results can be restated for more general situations where \(\VV\) and \(\WW\) are defined over different fields, but we will assume that they are defined over the same field \(\FF\) for simplicity of discourse.
We call a map \(\TT : \VV \to \WW\) a linear transformation from \(\VV\) to \(\WW\) if for all \(x, y \in \VV\) and \(\alpha \in \FF\), we have
- \(\TT(x + y) = \TT(x) + \TT(y)\) and
- \(\TT(\alpha x) = \alpha \TT(x)\)
A linear transformation is also known as a linear map or a linear operator. Usually, when the domain (\(\VV\)) and co-domain (\(\WW\)) of a linear transformation are the same, the term linear operator is used.
This is straightforward since
Assuming \(\TT\) to be linear we have
Now for the converse, assume
Choosing both \(x\) and \(y\) to be 0 and \(\alpha=1\) we get
Choosing \(y=0\) we get
Choosing \(\alpha = 1\) we get
Thus \(\TT\) is a linear transformation.
\(\TT\) is linear \(\iff\) for \(x_1, \dots, x_n \in \VV\) and \(\alpha_1, \dots, \alpha_n \in \FF\),
We can use mathematical induction to prove this.
Some special linear transformations need mention.
The identity transformation \(\mathrm{I}_{\VV} : \VV \to \VV\) is defined as
The zero transformation \(\mathrm{0} : \VV \to \WW\) is defined as
In this definition the symbol \(0\) takes up multiple meanings: it denotes both the zero linear transformation from \(\VV\) to \(\WW\) and the \(0\) vector in \(\WW\) to which every vector in \(\VV\) is mapped.
From the context usually it should be obvious whether we are talking about \(0 \in \FF\) or \(0 \in \VV\) or \(0 \in \WW\) or \(0\) as a linear transformation from \(\VV\) to \(\WW\).
Null space and range¶
The null space or kernel of a linear transformation \(\TT : \VV \to \WW\) denoted by \(\NullSpace(\TT)\) or \(\Kernel(\TT)\) is defined as
Let \(v_1, v_2 \in \Kernel(\TT)\). Then
Thus \(\alpha v_1 + v_2 \in \Kernel(\TT)\). Thus \(\Kernel(\TT)\) is a subspace of \(\VV\).
The range or image of a linear transformation \(\TT : \VV \to \WW\) denoted by \(\Range(\TT)\) or \(\Image(\TT)\) is defined as
We note that \(\Image(\TT) \subseteq \WW\).
Let \(w_1, w_2 \in \Image(\TT)\). Then there exist \(v_1, v_2 \in \VV\) such that
Thus
Thus \(\alpha w_1 + w_2 \in \Image(\TT)\). Hence \(\Image(\TT)\) is a subspace of \(\WW\).
Let \(\TT : \VV \to \WW\) be a linear transformation. Let \(\mathcal{B} = \{v_1, v_2, \dots, v_n\}\) be some basis of \(\VV\). Then
i.e. The image of a basis of \(\VV\) under a linear transformation \(\TT\) spans the range of the transformation.
Let \(w\) be some arbitrary vector in \(\Image(\TT)\). Then there exists \(v \in \VV\) such that \(w = \TT(v)\). Now
since \(\mathcal{B}\) forms a basis for \(\VV\).
Thus
This means that \(w \in \langle \TT(\mathcal{B}) \rangle\).
For vector spaces \(\VV\) and \(\WW\) and linear \(\TT : \VV \to \WW\), if \(\Kernel(\TT)\) is finite dimensional then the nullity of \(\TT\) is defined as
i.e. the dimension of the null space or kernel of \(\TT\).
For vector spaces \(\VV\) and \(\WW\) and linear \(\TT : \VV \to \WW\), if \(\Image(\TT)\) is finite dimensional then the rank of \(\TT\) is defined as
i.e. the dimension of the range or image of \(\TT\).
For vector spaces \(\VV\) and \(\WW\) and linear \(\TT : \VV \to \WW\) if \(\VV\) is finite dimensional, then
This is known as the dimension theorem (also called the rank-nullity theorem).
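For a linear map represented by a matrix, the theorem can be checked numerically. A small sketch (the matrix A below is an arbitrary stand-in):
% quick numerical check of the dimension theorem for a map represented by a matrix
A = randn(4, 7);
n = size(A, 2);                 % dimension of the domain
nullity = size(null(A), 2);     % dimension of the null space
r = rank(A);                    % dimension of the range
assert(r + nullity == n);       % rank + nullity = dimension of the domain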
If \(\TT\) is one-one, then
Let \(v \neq 0\). Since \(\TT\) is one-one and \(\TT(0) = 0\), we must have \(\TT(v) \neq 0\). Thus \(\Kernel(\TT) = \{ 0\}\).
For the converse, let us assume that \(\Kernel(\TT) = \{ 0\}\). Let \(v_1, v_2 \in \VV\) be two vectors such that \(\TT(v_1) = \TT(v_2)\). Then \(\TT(v_1 - v_2) = 0\), so \(v_1 - v_2 \in \Kernel(\TT) = \{0\}\), which gives \(v_1 = v_2\).
Thus \(\TT\) is one-one.
For vector spaces \(\VV\) and \(\WW\) of equal finite dimensions and linear \(\TT : \VV \to \WW\), the following are equivalent.
- \(\TT\) is one-one.
- \(\TT\) is onto.
- \(\Rank(\TT) = \dim (\VV)\).
From (a) to (b)
Let \(\mathcal{B} = \{v_1, v_2, \dots v_n \}\) be some basis of \(\VV\) with \(\dim \VV = n\).
Let us assume that \(\TT(\mathcal{B})\) are linearly dependent. Thus there exists a linear relationship
where \(\alpha_i\) are not all 0.
Now
since \(\TT\) is one-one. This means that \(v_i\) are linearly dependent. This contradicts our assumption that \(\mathcal{B}\) is a basis for \(\VV\).
Thus \(\TT(\mathcal{B})\) are linearly independent.
Since \(\TT\) is one-one, all vectors in \(\TT(\mathcal{B})\) are distinct, hence
Since \(\TT(\mathcal{B})\) span \(\Image(\TT)\) and are linearly independent, hence they form a basis of \(\Image(\TT)\). But
and \(\TT(\mathcal{B})\) are a set of \(n\) linearly independent vectors in \(\WW\).
Hence \(\TT(\mathcal{B})\) form a basis of \(\WW\). Thus
Thus \(\TT\) is onto.
From (b) to (c) \(\TT\) being onto means \(\Image(\TT) = \WW\), thus
From (c) to (a) we know that
But it is given that \(\Rank(\TT) = \dim \VV\). Thus
Thus \(\TT\) is one-one.
Bracket operator¶
Recall the definition of coordinate vector from here. Conversion of a given vector to its coordinate vector representation can be shown to be a linear transformation.
Let \(\VV\) be a finite dimensional vector space over a field \(\FF\) where \(\dim \VV = n\). Let \(\BBB = \{ v_1, \dots, v_n\}\) be an ordered basis in \(\VV\). We define a bracket operator from \(\VV\) to \(\FF^n\) as
where
In other words, the bracket operator takes a vector \(v\) from a finite dimensional space \(\VV\) to its representation in \(\FF^n\) for a given basis \(\BBB\).
We now show that the bracket operator is linear.
Let \(\VV\) be a finite dimensional vector space over a field \(\FF\) where \(\dim \VV = n\). Let \(\BBB = \{ v_1, \dots, v_n\}\) be an ordered basis in \(\VV\). The bracket operator \(\Bracket_{\BBB} : \VV \to \FF^n\) as defined here is a linear operator.
Moreover \(\Bracket_{\BBB}\) is a one-one and onto mapping.
Let \(x, y \in \VV\) such that
and
Then
Thus
Thus \(\Bracket_{\BBB}\) is linear.
We can see that by definition \(\Bracket_{\BBB}\) is one-one. Now since \(\dim \VV = n = \dim \FF^n\), \(\Bracket_{\BBB}\) is onto due to here.
Matrix representations¶
It is much easier to work with a matrix representation of a linear transformation. In this section we describe how matrix representations of a linear transformation are developed.
In order to develop a representation for the map \(\TT : \VV \to \WW\) we first need to choose a representation for vectors in \(\VV\) and \(\WW\). This can be easily done by choosing a basis in \(\VV\) and another in \(\WW\). Once the bases are chosen, then we can represent vectors as coordinate vectors.
Let \(\VV\) and \(\WW\) be finite dimensional vector spaces with ordered bases \(\BBB = \{v_1, \dots, v_n\}\) and \(\Gamma = \{w_1, \dots,w_m\}\) respectively. Let \(\TT : \VV \to \WW\) be a linear transformation. For each \(v_j \in \BBB\) we can find a unique representation for \(\TT(v_j)\) in \(\Gamma\) given by
The \(m\times n\) matrix \(A\) defined by \(A_{ij} = a_{ij}\) is the matrix representation of \(\TT\) in the ordered bases \(\BBB\) and \(\Gamma\), denoted as
If \(\VV = \WW\) and \(\BBB = \Gamma\) then we write
The \(j\)-th column of \(A\) is the representation of \(\TT(v_j)\) in \(\Gamma\).
In order to justify the matrix representation of \(\TT\) we need to show that application of \(\TT\) is the same as multiplication by \(A\). This is stated formally below.
Let
Then
Now
Thus
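As a concrete illustration (a hypothetical map, not taken from the library), the matrix representation can be built column by column from the images of the basis vectors:
% hypothetical example: T : R^3 -> R^2, T(x) = [x1 + x2; x2 - x3], with the
% standard bases on both sides; the j-th column of A is the representation
% of T(e_j) in the basis of the co-domain
T = @(x) [x(1) + x(2); x(2) - x(3)];
E = eye(3);
A = zeros(2, 3);
for j = 1:3
    A(:, j) = T(E(:, j));
end
x = [1; 2; 3];
assert(isequal(T(x), A * x));   % applying T equals multiplying by A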
Vector space of linear transformations¶
If we consider the set of linear transformations from \(\VV\) to \(\WW\), we can impose some structure on it and take advantage of it.
First of all we will define basic operations like addition and scalar multiplication on the general set of functions from a vector space \(\VV\) to another vector space \(\WW\).
Let \(\TT\) and \(\UU\) be arbitrary functions from vector space \(\VV\) to vector space \(\WW\) over the field \(\FF\). Then addition of functions is defined as
Scalar multiplication on a function is defined as
With these definitions we have
We are now ready to show that with the addition and scalar multiplication as defined above, the set of linear transformations from \(\VV\) to \(\WW\) actually forms a vector space.
Let \(\VV\) and \(\WW\) be vector spaces over field \(\FF\). Let \(\TT\) and \(\UU\) be some linear transformations from \(\VV\) to \(\WW\). Let addition and scalar multiplication of linear transformations be defined as in here. Then \(\alpha \TT + \UU\) where \(\alpha \in \FF\) is a linear transformation.
Moreover the set of linear transformations from \(\VV\) to \(\WW\) forms a vector space.
We first show that \(\alpha \TT + \UU\) is linear.
Let \(x,y \in \VV\) and \(\beta \in \FF\). Then we need to show that
Starting with the first one:
Now the next one
We can now easily verify that the set of linear transformations from \(\VV\) to \(\WW\) satisfies all the requirements of a vector space. Hence it is a vector space (of linear transformations from \(\VV\) to \(\WW\)).
Let \(\VV\) and \(\WW\) be vector spaces over field \(\FF\). Then the vector space of linear transformations from \(\VV\) to \(\WW\) is denoted by \(\LinTSpace(\VV, \WW)\).
When \(\VV = \WW\) then it is simply denoted by \(\LinTSpace(\VV)\).
The addition and scalar multiplication as defined in here carries forward to matrix representations of linear transformations also.
Let \(\VV\) and \(\WW\) be finite dimensional vector spaces over field \(\FF\) with \(\BBB\) and \(\Gamma\) being their respective bases. Let \(\TT\) and \(\UU\) be some linear transformations from \(\VV\) to \(\WW\).
Then the following hold
- \([\TT + \UU]_{\BBB}^{\Gamma} = [\TT]_{\BBB}^{\Gamma} + [\UU]_{\BBB}^{\Gamma}\)
- \([\alpha \TT]_{\BBB}^{\Gamma} = \alpha [\TT]_{\BBB}^{\Gamma} \Forall \alpha \in \FF\)
Inner product spaces¶
Inner product¶
Inner product is a generalization of the notion of dot product.
An inner product over a \(K\)-vector space \(V\) is any map
satisfying following requirements:
Positive definiteness
(1) \[ \langle v, v \rangle \geq 0 \text{ and } \langle v, v \rangle = 0 \iff v = 0\]
Conjugate symmetry
(2) \[ \langle v_1, v_2 \rangle = \overline{\langle v_2, v_1 \rangle} \quad \forall v_1, v_2 \in V\]
Linearity in the first argument
(3) \[\begin{split} \begin{aligned} &\langle \alpha v, w \rangle = \alpha \langle v, w \rangle \quad \forall v, w \in V; \forall \alpha \in K\\ &\langle v_1 + v_2, w \rangle = \langle v_1, w \rangle + \langle v_2, w \rangle \quad \forall v_1, v_2,w \in V \end{aligned}\end{split}\]
Remarks
- Linearity in first argument extends to any arbitrary linear combination:
- Similarly we have conjugate linearity in second argument for any arbitrary linear combination:
Orthogonality¶
A set of non-zero vectors \(\{v_1, \dots, v_p\}\) is called orthogonal if
A set of non-zero vectors \(\{v_1, \dots, v_p\}\) is called orthonormal if
i.e. \(\langle v_i, v_j \rangle = \delta(i, j)\).
Remarks:
- A set of orthogonal vectors is linearly independent. Prove!
Norm¶
Norms are a generalization of the notion of length.
A norm over a \(K\)-vector space \(V\) is any map
satisfying following requirements:
Positive definiteness
(5) \[ \| v\| \geq 0 \quad \forall v \in V \text{ and } \| v\| = 0 \iff v = 0\]
Scalar multiplication
\[\| \alpha v \| = | \alpha | \| v \| \quad \forall \alpha \in K; \forall v \in V\]
Triangle inequality
\[\| v_1 + v_2 \| \leq \| v_1 \| + \| v_2 \| \quad \forall v_1, v_2 \in V\]
Projection¶
A projection is a linear transformation \(P\) from a vector space \(V\) to itself such that \(P^2=P\). i.e. if \(P v = \beta\), then \(P \beta = \beta\). Thus whenever \(P\) is applied twice to any vector, it gives the same result as if it was applied once.
Thus \(P\) is an idempotent operator.
Consider the operator \(P : \RR^3 \to \RR^3\) defined as
Then application of \(P\) on any arbitrary vector is given by
A second application doesn’t change it
Thus \(P\) is a projection operator.
Usually we can directly verify the property by computing \(P^2\) as
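For a concrete check, here is a minimal MATLAB sketch with a hypothetical projector that zeroes out the third coordinate:
% a quick numerical check of idempotency for a simple (hypothetical) projector
P = [1 0 0; 0 1 0; 0 0 0];   % projects onto the xy-plane along the z-axis
assert(isequal(P * P, P));   % P^2 = P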
Orthogonal projection¶
Consider a projection operator \(P : V \to V\) where \(V\) is an inner product space.
The range of \(P\) is given by
The null space of \(P\) is given by
A projection operator \(P : V \to V\) over an inner product space \(V\) is called orthogonal projection operator if its range \(\Range(P)\) and the null space \(\NullSpace(P)\) as defined above are orthogonal to each other. i.e.
Consider a unit norm vector \(u \in \RR^N\). Thus \(u^T u = 1\).
Consider
Now
Thus \(P_u\) is a projection operator.
Now
Thus \(P_u\) is self-adjoint. Hence \(P_u\) is an orthogonal projection operator.
Now
Thus \(P_u\) leaves \(u\) intact. i.e. Projection of \(u\) on to \(u\) is \(u\) itself.
Let \(v \in u^{\perp}\) i.e. \(\langle u, v \rangle = 0\).
Then
Thus \(P_u\) annihilates all vectors orthogonal to \(u\).
Now any vector \(x \in \RR^N\) can be broken down into two components
such that \(\langle u , x_{\perp} \rangle =0\) and \(x_{\parallel}\) is collinear with \(u\).
Then
Thus \(P_u\) retains the projection of \(x\) on \(u\) given by \(x_{\parallel}\).
Let \(A \in \RR^{M \times N}\) with \(N \leq M\) be a matrix given by
where \(a_i \in \RR^M\) are its columns which are linearly independent.
The column space of \(A\) is given by
It can be shown that \(A^T A\) is invertible.
Consider the operator
Now
Thus \(P_A\) is a projection operator.
Thus \(P_A\) is self-adjoint.
Hence \(P_A\) is an orthogonal projection operator on the column space of \(A\).
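A small numerical sketch of this construction (the matrix A below is arbitrary; we assume its columns are linearly independent):
% orthogonal projection onto the column space of A
A = randn(6, 3);
PA = A * ((A' * A) \ A');               % P_A = A (A^T A)^{-1} A^T
assert(norm(PA * PA - PA) < 1e-10);     % idempotent
assert(norm(PA - PA') < 1e-10);         % self-adjoint (symmetric)
assert(norm(PA * A - A) < 1e-10);       % leaves the columns of A unchanged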
Parallelogram identity¶
Thus
When the inner product is real valued, the following identity is quite useful.
Thus
since for real inner products
Polarization identity¶
When the inner product is complex valued, the polarization identity is quite useful.
Thus
The Euclidean space¶
In this book we will be generally concerned with the Euclidean space \(\RR^N\). This section summarizes important results for this space.
\(\RR^2\) (the 2-dimensional plane) and \(\RR^3\) (the 3-dimensional space) are the most familiar spaces to us.
\(\RR^N\) is a generalization in \(N\) dimensions.
An element \(x\) in \(\RR^N\) is written as
where each \(x_i\) is a real number.
Vector space operations on \(\RR^N\) are defined by:
\(\RR^N\) comes with the standard ordered basis \(B = \{e_1, e_2, \dots, e_N\}\):
An arbitrary vector \(x\in\RR^N\) can be written as
Inner product¶
Standard inner product (a.k.a. dot product) is defined as:
This makes \(\RR^N\) an inner product space.
The result is always a real number. Hence we have symmetry:
Norm¶
The length of the vector (a.k.a. Euclidean norm or \(\ell_2\) norm) is defined as:
This makes \(\RR^N\) a normed linear space.
The angle \(\theta\) between two vectors is given by:
Distance¶
Distance between two vectors is defined as:
This distance function is known as Euclidean metric.
This makes \(\RR^N\) a metric space.
\(\ell_p\) norms¶
In addition to the standard Euclidean norm, we define a family of norms indexed by \(p \in [1, \infty]\) known as \(\ell_p\) norms over \(\RR^N\).
\(\ell_p\) norm is defined as:
\(\ell_2\) norm¶
As we can see from the definition, the \(\ell_2\) norm is the same as the Euclidean norm. So we have:
\(\ell_1\) norm¶
From above definition we have
We use norms as a measure of strength of a signal or size of an error. Different norms signify different aspects of the signal.
Quasi-norms¶
In some cases it is useful to extend the notion of \(\ell_p\) norms to the case where \(0 < p < 1\).
In such cases, the norm as defined in (2) doesn’t satisfy the triangle inequality, hence it is not a proper norm function. We call such functions quasi-norms.
\(\ell_0\)-“norm”¶
Of specific mention is \(\ell_0\)-“norm”. It isn’t even a quasi-norm. Note the use of quotes around the word norm to distinguish \(\ell_0\)-“norm” from usual norms.
\(\ell_0\)-“norm” is defined as:
where \(\supp(x) = \{ i : x_i \neq 0\}\) denotes the support of \(x\).
Note that \(\| x \|_0\) defined above doesn’t follow the definition in (2).
Yet we can show that:
which justifies the notation.
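These norms are directly available in MATLAB; the \(\ell_0\)-“norm” can be computed by counting the non-zero entries. A short illustration with a hypothetical vector:
% computing the various norms of a vector in MATLAB
x = [3; 0; -4; 0; 12];
l1   = norm(x, 1);     % l_1 norm: sum of absolute values
l2   = norm(x, 2);     % l_2 (Euclidean) norm
linf = norm(x, inf);   % l_inf norm: largest absolute value
l0   = nnz(x);         % l_0 "norm": number of non-zero entries (size of support)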
N dimensional complex space¶
In this section we review important features of N dimensional complex vector space \(\CC^N\).
An element \(x\) in \(\CC^N\) is written as
where each \(x_i\) is a complex number.
Vector space operations on \(\CC^N\) are defined by:
\(\CC^N\) comes with the standard ordered basis \(B = \{e_1, e_2, \dots, e_N\}\):
We note that the basis is the same as the basis for the \(N\) dimensional real vector space (the Euclidean space).
An arbitrary vector \(x\in\CC^N\) can be written as
Inner product¶
Standard inner product is defined as:
where \(\overline{y_i}\) denotes the complex conjugate.
This makes \(\CC^N\) an inner product space.
This satisfies the inner product rule:
Norm¶
The length of the vector (a.k.a. \(\ell_2\) norm) is defined as:
This makes \(\CC^N\) a normed linear space.
Distance¶
Distance between two vectors is defined as:
This makes \(\CC^N\) a metric space.
\(\ell_p\) norms¶
In addition to standard Euclidean norm, we define a family of norms indexed by \(p \in [1, \infty]\) known as \(\ell_p\) norms over \(\CC^N\).
\(\ell_p\) norm is defined as:
So we have:
\(\ell_1\) norm¶
From above definition we have
We use norms as a measure of strength of a signal or size of an error. Different norms signify different aspects of the signal.
Quasi-norms¶
In some cases it is useful to extend the notion of \(\ell_p\) norms to the case where \(0 < p < 1\).
In such cases, the norm as defined in (2) doesn’t satisfy the triangle inequality, hence it is not a proper norm function. We call such functions quasi-norms.
\(\ell_0\) “norm”¶
Of specific mention is \(\ell_0\) “norm”. It isn’t even a quasi-norm. Note the use of quotes around the word norm to distinguish \(\ell_0\) “norm” from usual norms.
\(\ell_0\) “norm” is defined as:
where \(\supp(x) = \{ i : x_i \neq 0\}\) denotes the support of \(x\).
Note that \(\| x \|_0\) defined above doesn’t follow the definition in (2).
Yet we can show that:
which justifies the notation.
Affine Subspaces Review¶
For a detailed introduction to affine concepts, see [KW79]. For a vector \(v \in \RR^n\), the function \(f\) defined by \(f (x) = x + v, x \in \RR^n\) is a translation of \(\RR^n\) by \(v\). The image of any set \(\mathcal{S}\) under \(f\) is the \(v\)-translate of \(\mathcal{S}\). A translation of space is a one-to-one isometry of \(\RR^n\) onto \(\RR^n\).
A translate of a \(d\)-dimensional, linear subspace of \(\RR^n\) is a \(d\)-dimensional flat or simply \(d\)-flat in \(\RR^n\). Flats of dimension 1, 2, and \(n-1\) are also called lines, planes, and hyperplanes, respectively. Flats are also known as affine subspaces.
Every \(d\)-flat in \(\RR^n\) is congruent to the Euclidean space \(\RR^d\). Flats are closed sets.
An affine combination of the vectors \(v_1, \dots, v_m\) is a linear combination in which the sum of coefficients is 1. Thus, \(b\) is an affine combination of \(v_1, \dots, v_m\) if \(b = k_1 v_1 + \dots + k_m v_m\) and \(k_1 + \dots + k_m = 1\). The set of affine combinations of a set of vectors \(\{ v_1, \dots, v_m \}\) is their affine span. A finite set of vectors \(\{v_1, \dots, v_m\}\) is called affine independent if the only zero-sum linear combination of them representing the null vector is the null combination, i.e. \(k_1 v_1 + \dots + k_m v_m = 0\) and \(k_1 + \dots + k_m = 0\) imply \(k_1 = \dots = k_m = 0\). Otherwise, the set is affinely dependent. A finite set of two or more vectors is affine independent if and only if none of them is an affine combination of the others.
Vectors vs. Points An n-tuple \((x_1, \dots, x_n)\) is used to refer to a point \(X\) in \(\RR^n\) as well as to a vector from origin \(O\) to \(X\) in \(\RR^n\). In basic linear algebra, the terms vector and point are used interchangeably. While discussing geometrical concepts (affine or convex sets etc.), it is useful to distinguish between vectors and points. When the terms “dependent” and “independent” are used without qualification to points, they refer to affine dependence/independence. When used for vectors, they mean linear dependence/independence.
The span of \(k+1\) independent points is a \(k\)-flat and is the unique \(k\)-flat that contains all \(k+1\) points. Every \(k\)-flat contains \(k+1\) (affine) independent points. Each set of \(k+1\) independent points in the \(k\)-flat forms an affine basis for the flat. Each point of a \(k\)-flat is represented by one and only one affine combination of a given affine basis for the flat. The coefficients of the affine combination of a point are the affine coordinates of the point in the given affine basis of the \(k\)-flat. A \(d\)-flat is contained in a linear subspace of dimension \(d+1\). This can be easily obtained by choosing an affine basis for the flat and constructing its linear span.
A function \(f\) defined on a vector space \(V\) is an affine function or affine transformation or affine mapping if it maps every affine combination of vectors \(u, v\) in \(V\) onto the same affine combination of their images. If \(f\) is real valued, then \(f\) is an affine functional. A property which is invariant under an affine mapping is called affine invariant. The image of a flat under an affine function is a flat.
Every affine function differs from a linear function by a translation. A functional is an affine functional if and only if there exists a unique vector \(a \in \RR^n\) and a unique real number \(k\) such that \(f(x) = \langle a, x \rangle + k\). Affine functionals are continuous. If \(a \neq 0\), then the linear functional \(f(x) = \langle a, x \rangle\) and the affine functional \(g(x) = \langle a, x \rangle + k\) map bounded sets onto bounded sets, neighborhoods onto neighborhoods, balls onto balls and open sets onto open sets.
Hyperplanes and Half spaces¶
Corresponding to a hyperplane \(\mathcal{H}\) in \(\RR^n\) (an \(n-1\)-flat), there exists a non-null vector \(a\) and a real number \(k\) such that \(\mathcal{H}\) is the graph of \(\langle a , x \rangle = k\). The vector \(a\) is orthogonal to \(PQ\) for all \(P, Q \in \mathcal{H}\). All non-null vectors \(a\) having this property are said to be normal to the hyperplane. The directions of \(a\) and \(-a\) are called opposite normal directions of \(\mathcal{H}\). Conversely, the graph of \(\langle a , x \rangle = k\), \(a \neq 0\), is a hyperplane for which \(a\) is a normal vector. If \(\langle a, x \rangle = k\) and \(\langle b, x \rangle = h\), \(a \neq 0\), \(b \neq 0\) are both representations of a hyperplane \(\mathcal{H}\), then there exists a real non-zero number \(\lambda\) such that \(b = \lambda a\) and \(h = \lambda k\). Obviously, we can find a unit norm normal vector for \(\mathcal{H}\). Each point \(P\) in space has a unique foot (nearest point) \(P_0\) in a hyperplane \(\mathcal{H}\). The distance of the point \(P\) with vector \(p\) from a hyperplane \(\mathcal{H} : \langle a , x \rangle = k\) is given by
The coordinate \(p_0\) of the foot \(P_0\) is given by
Hyperplanes \(\mathcal{H}\) and \(\mathcal{K}\) are parallel if they don’t intersect. This occurs if and only if they have a common normal direction. They are different constant sets of the same linear functional. If \(\mathcal{H}_1 : \langle a , x \rangle = k_1\) and \(\mathcal{H}_2 : \langle a, x \rangle = k_2\) are parallel hyperplanes, then the distance between the two hyperplanes is given by
If \(\langle a, x \rangle = k\), \(a \neq 0\), is a hyperplane \(\mathcal{H}\), then the graphs of \(\langle a , x \rangle > k\) and \(\langle a , x \rangle < k\) are the opposite sides or opposite open half spaces of \(\mathcal{H}\). The graphs of \(\langle a , x \rangle \geq k\) and \(\langle a , x \rangle \leq k\) are the opposite closed half spaces of \(\mathcal{H}\). \(\mathcal{H}\) is the face of the four half-spaces. Corresponding to a hyperplane \(\mathcal{H}\), there exists a unique pair of sets \(\mathcal{S}_1\) and \(\mathcal{S}_2\) that are the opposite sides of \(\mathcal{H}\). Open half spaces are open sets and closed half spaces are closed sets. If \(A\) and \(B\) belong to the opposite sides of a hyperplane \(\mathcal{H}\), then there exists a unique point of \(\mathcal{H}\) that is between \(A\) and \(B\).
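The foot of the perpendicular and the point-to-hyperplane distance admit simple closed forms; the sketch below uses the standard formulas \(d = |\langle a, p \rangle - k| / \| a \|\) and \(p_0 = p - \frac{\langle a, p \rangle - k}{\langle a, a \rangle} a\) with hypothetical numbers:
% distance of a point p from the hyperplane <a, x> = k and the foot of the perpendicular
a = [1; 2; 2]; k = 3;                     % hyperplane a' * x = k in R^3
p = [4; -1; 5];
d  = abs(a' * p - k) / norm(a);           % distance of p from the hyperplane
p0 = p - ((a' * p - k) / (a' * a)) * a;   % foot of the perpendicular
assert(abs(a' * p0 - k) < 1e-12);         % p0 lies on the hyperplane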
General Position¶
A general position for a set of points or other geometric objects is a notion of genericity. It means the general case, as opposed to more special or coincidental cases. For example, generically, two lines in a plane intersect in a single point. The special cases are when the two lines are either parallel or coincident. Three points in a plane are in general not collinear; if they are, it is a degenerate case. A set of \(n+1\) or more points in \(\RR^n\) is said to be in general position if every subset of \(n\) points is linearly independent. In general, a set of \(k+1\) or more points in a \(k\)-flat is said to be in general linear position if no hyperplane contains more than \(k\) points.
Matrix Factorizations¶
Singular Value Decomposition¶
A non-negative real value \(\sigma\) is a singular value for a matrix \(A \in \RR^{m \times n}\) if and only if there exist unit length vectors \(u \in \RR^m\) and \(v \in \RR^n\) such that \(A v = \sigma u\) and \(A^T u = \sigma v\). The vectors \(u\) and \(v\) are called left singular and right singular vectors for \(\sigma\) respectively. For every \(A \in \RR^{m \times n}\) with \(k = \min(m, n)\), there exist two orthogonal matrices \(U \in \RR^{m \times m}\) and \(V \in \RR^{n \times n}\) and a sequence of real numbers \(\sigma_1 \geq \dots \geq \sigma_k \geq 0\) such that \(U^T A V = \Sigma\) where \(\Sigma = \text{diag}(\sigma_1, \dots, \sigma_k, 0, \dots, 0) \in \RR^{m \times n}\) (extra columns or rows are filled with zeros). The decomposition \(A = U \Sigma V^T\) is called the singular value decomposition of \(A\). The first \(k\) columns of \(U\) and \(V\) are the left and right singular vectors of \(A\) corresponding to the singular values \(\sigma_1, \dots, \sigma_k\). The rank of \(A\) is equal to the number of non-zero singular values, which equals the rank of \(\Sigma\). The eigen values of the positive semi-definite matrices \(A^T A\) and \(A A^T\) are given by \(\sigma_1^2, \dots, \sigma_k^2\) (the remaining eigen values being 0). Specifically, \(A^T A = V \Sigma^T \Sigma V^T\) and \(A A^T = U \Sigma \Sigma^T U^T\). We can rewrite \(A = \sum_{i=1}^k \sigma_i u_i v_i^T\); the term \(\sigma_1 u_1 v_1^T\) is the best rank-1 approximation of \(A\) in the Frobenius norm sense. The \(2\)-norm (spectral norm) of \(A\) is given by its largest singular value \(\sigma_1\). The Moore-Penrose pseudo-inverse of \(\Sigma\) is easily obtained by taking the transpose of \(\Sigma\) and inverting the non-zero singular values; further, \(A^{\dag} = V \Sigma^{\dag} U^T\). The non-zero singular values of \(A^{\dag}\) are just the reciprocals of the non-zero singular values of \(A\). Geometrically, the singular values of \(A\) are precisely the lengths of the semi-axes of the hyper-ellipsoid \(E\) defined by \(E = \{ A x | \| x \|_2 = 1 \}\) (i.e. the image of the unit sphere under \(A\)). Thus, if \(A\) is a data matrix, the SVD of \(A\) is strongly connected with the principal component analysis of \(A\).
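These facts are easy to verify numerically with MATLAB's built-in svd and pinv; the matrix below is an arbitrary stand-in:
% small illustration of the SVD facts stated above
A = randn(5, 3);
[U, S, V] = svd(A);                            % A = U * S * V' with orthogonal U and V
sigmas = diag(S);                              % singular values in decreasing order
assert(rank(A) == nnz(sigmas > 1e-10));        % rank = number of non-zero singular values
assert(abs(norm(A, 2) - sigmas(1)) < 1e-10);   % 2-norm = largest singular value
Adag = V * pinv(S) * U';                       % Moore-Penrose pseudo-inverse via the SVD
assert(norm(Adag - pinv(A)) < 1e-10);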
Principal Angles¶
If \(\UUU\) and \(\VVV\) are two linear subspaces of \(\RR^M\), then the smallest principal angle between them denoted by \(\theta\) is defined as [BjorckG73]
In other words, we try to find unit norm vectors in the two spaces which are maximally aligned with each other. The angle between them is the smallest principal angle. Note that \(\theta \in [0, \pi /2 ]\) (\(\cos \theta\) as defined above is always non-negative). If we have \(U\) and \(V\) as matrices whose column spans are the subspaces \(\UUU\) and \(\VVV\) respectively, then in order to find the principal angles, we construct orthonormal bases \(Q_U\) and \(Q_V\). We then compute the inner product matrix \(G = Q_U^T Q_V\). The singular values of \(G\) are the cosines of the principal angles. In particular, the smallest principal angle is given by \(\cos \theta = \sigma_1\), the largest singular value.
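A plain-MATLAB sketch of this procedure (hypothetical sizes; orth and svd are built-ins):
% smallest principal angle between the column spans of two matrices
U = randn(10, 4); V = randn(10, 4);   % spanning sets for two subspaces of R^10
Qu = orth(U); Qv = orth(V);           % orthonormal bases of the column spans
G = Qu' * Qv;                         % inner product matrix
sigmas = svd(G);                      % cosines of the principal angles
theta = acos(min(sigmas(1), 1));      % smallest principal angle in radians
The next section carries out the same computation using sparse-plex helpers.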
Hands on with Principal Angles¶
We will generate two random 4-D subspaces in an ambient spaces \(\RR^{10}\):
% subspace dimension
D = 4;
% ambient dimension
M = 10;
% Number of subspaces
K = 2;
import spx.data.synthetic.subspaces.random_subspaces;
bases = random_subspaces(M, K, D);
Finding the smallest principal angle in two subspaces is quite easy.
Let’s give some convenient names to the two bases:
>> A = bases{1};
>> B = bases{2};
Now let’s compute the inner products matrix between the basis vectors of the two bases:
>> G = A' * B
G =
-0.3416 -0.4993 0.1216 0.2732
-0.3780 0.0173 -0.5111 0.4413
0.1296 -0.1153 -0.4123 -0.5332
-0.2152 -0.4476 0.2282 -0.5022
Compute the singular values for G:
>> sigmas = svd(G)'
sigmas =
0.9676 0.8197 0.6738 0.1664
The largest inner product between unit vectors drawn from A and B is given by:
>> largest_product = sigmas(1)
largest_product =
0.9676
It is clear that this is very high. The corresponding smallest principal angle is:
>> smallest_angle_rad = acos(largest_product)
smallest_angle_rad =
0.2551
Or in degrees:
>> smallest_angle_deg = rad2deg(smallest_angle_rad)
smallest_angle_deg =
14.6143
sparse-plex provides a number of convenience functions for measuring principal angles.
We start with functions which can tell us about the smallest principal angle between a pair of subspaces.
The smallest principal angle in degrees:
>> spx.la.spaces.smallest_angle_deg(A, B)
ans =
14.6143
The smallest principal angle in radians:
>> spx.la.spaces.smallest_angle_rad(A, B)
ans =
0.2551
The smallest principal angle in cosine version:
>> spx.la.spaces.smallest_angle_cos(A, B)
ans =
0.9676
If we have more than two subspaces, then we have a way of computing principal angles between each of them.
Let’s draw 6 subspaces from \(\RR^{10}\):
>> K = 6;
>> bases = random_subspaces(M, K, D);
We now want pairwise smallest principal angles between them:
>> angles = spx.la.spaces.smallest_angles_deg(bases)
angles =
0 19.9756 32.3022 21.1835 47.2059 24.9171
19.9756 0 14.9874 17.8499 20.5399 42.5358
32.3022 14.9874 0 34.6420 21.9036 34.4935
21.1835 17.8499 34.6420 0 14.0794 26.5235
47.2059 20.5399 21.9036 14.0794 0 39.5866
24.9171 42.5358 34.4935 26.5235 39.5866 0
We can pull off the upper off-diagonal entries in the matrix to look at the distribution of angles:
>> angles = spx.matrix.off_diag_upper_tri_elements(angles)'
angles =
Columns 1 through 13
19.9756 32.3022 14.9874 21.1835 17.8499 34.6420 47.2059 20.5399 21.9036 14.0794 24.9171 42.5358 34.4935
Columns 14 through 15
26.5235 39.5866
For more information about off_diag_upper_tri_elements, see Working with matrices.
The statistics:
>> max(angles)
ans =
47.2059
>> min(angles)
ans =
14.0794
>> mean(angles)
ans =
27.5151
>> std(angles)
ans =
10.3412
There is quite a lot of variation in the distribution of angles. While some pairs of subspaces are so closely aligned that their smallest principal angle is as low as 14 degrees, there are some pairs for which the smallest principal angle is as high as 47 degrees.
While it is possible to select two subspaces which are arbitrarily close to each other, the distribution of principal angles gives us an idea of how close/aligned the subspaces are likely to be when chosen randomly.
Above, we computed the smallest principal angles in degrees. We can also compute them in radians:
>> angles = spx.la.spaces.smallest_angles_rad(bases)
angles =
0 0.3486 0.5638 0.3697 0.8239 0.4349
0.3486 0 0.2616 0.3115 0.3585 0.7424
0.5638 0.2616 0 0.6046 0.3823 0.6020
0.3697 0.3115 0.6046 0 0.2457 0.4629
0.8239 0.3585 0.3823 0.2457 0 0.6909
0.4349 0.7424 0.6020 0.4629 0.6909 0
Or directly the largest singular values for each pair of subspaces:
>> angles = spx.la.spaces.smallest_angles_cos(bases)
angles =
1.0000 0.9398 0.8452 0.9324 0.6794 0.9069
0.9398 1.0000 0.9660 0.9519 0.9364 0.7369
0.8452 0.9660 1.0000 0.8227 0.9278 0.8242
0.9324 0.9519 0.8227 1.0000 0.9700 0.8948
0.6794 0.9364 0.9278 0.9700 1.0000 0.7707
0.9069 0.7369 0.8242 0.8948 0.7707 1.0000
Matrix Algebra¶
Introduction¶
In this chapter we collect results related to matrix algebra which are relevant to this book. Some specific topics which are typically not found in standard books are also covered here.
Standard notation in this chapter is given here. Matrices are denoted by capital letters \(A\), \(B\) etc. They can be rectangular with \(m\) rows and \(n\) columns. Their elements or entries are referred to with small letters \(a_{i j}\), \(b_{i j}\) etc. where \(i\) denotes the \(i\)-th row of the matrix and \(j\) denotes the \(j\)-th column of the matrix. Thus
Mostly we consider complex matrices belonging to \(\CC^{m \times n}\). Sometimes we will restrict our attention to real matrices belonging to \(\RR^{m \times n}\).
A diagonal matrix is a matrix (usually a square matrix) whose entries outside the main diagonal are zero.
Whenever we refer to a diagonal matrix which is not square, we will use the term rectangular diagonal matrix.
A square diagonal matrix \(A\) is also represented by \(\Diag(a_{11}, a_{22}, \dots, a_{n n})\) which lists only the diagonal (non-zero) entries in \(A\).
The transpose of a matrix \(A\) is denoted by \(A^T\) while the Hermitian transpose is denoted by \(A^H\). For real matrices \(A^T = A^H\).
When matrices are square, we have the number of rows and columns both equal to \(n\) and they belong to \(\CC^{n \times n}\).
If not specified, the square matrices will be of size \(n \times n\) and rectangular matrices will be of size \(m \times n\). If not specified the vectors (column vectors) will be of size \(n \times 1\) and belong to either \(\RR^n\) or \(\CC^n\). Corresponding row vectors will be of size \(1 \times n\).
For statements which are valid both for real and complex matrices, sometimes we might say that matrices belong to \(\FF^{m \times n}\) while the scalars belong to \(\FF\) and vectors belong to \(\FF^n\) where \(\FF\) refers to either the field of real numbers or the field of complex numbers. Note that this is not consistently followed at the moment. Most results are written only for \(\CC^{m \times n}\) while still being applicable for \(\RR^{m \times n}\).
Identity matrix for \(\FF^{n \times n}\) is denoted as \(I_n\) or simply \(I\) whenever the size is clear from context.
Sometimes we will write a matrix in terms of its column vectors. We will use the notation
indicating \(n\) columns.
When we write a matrix in terms of its row vectors, we will use the notation
indicating \(m\) rows with \(a_i\) being column vectors whose transposes form the rows of \(A\).
The rank of a matrix \(A\) is written as \(\Rank(A)\), while the determinant as \(\det(A)\) or \(|A|\).
We say that an \(m \times n\) matrix \(A\) is left-invertible if there exists an \(n \times m\) matrix \(B\) such that \(B A = I\). We say that an \(m \times n\) matrix \(A\) is right-invertible if there exists an \(n \times m\) matrix \(B\) such that \(A B= I\).
We say that a square matrix \(A\) is invertible when there exists another square matrix \(B\) of the same size such that \(AB = BA = I\). A square matrix is invertible iff it is both left- and right-invertible. The inverse of a square invertible matrix is denoted by \(A^{-1}\).
A special left or right inverse is the pseudo inverse, which is denoted by \(A^{\dag}\).
Column space of a matrix is denoted by \(\ColSpace(A)\), the null space by \(\NullSpace(A)\), and the row space by \(\RowSpace(A)\).
We say that a matrix is symmetric when \(A = A^T\), conjugate symmetric or Hermitian when \(A^H =A\).
When a square matrix is not invertible, we say that it is singular. A non-singular matrix is invertible.
The eigen values of a square matrix are written as \(\lambda_1, \lambda_2, \dots\) while the singular values of a rectangular matrix are written as \(\sigma_1, \sigma_2, \dots\).
The inner product or dot product of two column / row vectors \(u\) and \(v\) belonging to \(\RR^n\) is defined as
The inner product or dot product of two column / row vectors \(u\) and \(v\) belonging to \(\CC^n\) is defined as
Block matrix¶
A block matrix is a matrix whose entries themselves are matrices with following constraints
- Entries in every row are matrices with same number of rows.
- Entries in every column are matrices with same number of columns.
Let \(A\) be an \(m \times n\) block matrix. Then
where \(A_{i j}\) is a matrix with \(r_i\) rows and \(c_j\) columns.
A block matrix is also known as a partitioned matrix.
Quite frequently we will be using \(2 \times 2\) block matrices.
An example
We have
- \(P_{11}\) and \(P_{12}\) have \(2\) rows.
- \(P_{21}\) and \(P_{22}\) have \(1\) row.
- \(P_{11}\) and \(P_{21}\) have \(2\) columns.
- \(P_{12}\) and \(P_{22}\) have \(1\) column.
Let \(A = [A_{ij}]\) be an \(m \times n\) block matrix with \(A_{ij}\) being an \(r_i \times c_j\) matrix. Then \(A\) is an \(r \times c\) matrix where
and
Let \(A = [A_{ij}]\) be an \(m \times n\) block matrix with \(A_{ij}\) being a \(p_i \times q_j\) matrix. Let \(B = [B_{jk}]\) be an \(n \times p\) block matrix with \(B_{jk}\) being a \(q_j \times r_k\) matrix. Then the two block matrices are compatible for multiplication and their multiplication is defined by \(C = AB = [C_{i k}]\) where
and \(C_{i k}\) is a \(p_i \times r_k\) matrix.
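A quick numerical sketch of this rule with hypothetical block sizes; multiplying block-wise and assembling gives the same result as multiplying the assembled matrices:
% block matrix multiplication illustrated numerically
A11 = randn(2, 3); A12 = randn(2, 1);
A21 = randn(1, 3); A22 = randn(1, 1);
B11 = randn(3, 2); B12 = randn(3, 2);
B21 = randn(1, 2); B22 = randn(1, 2);
A = [A11 A12; A21 A22];
B = [B11 B12; B21 B22];
C = [A11*B11 + A12*B21, A11*B12 + A12*B22; ...
     A21*B11 + A22*B21, A21*B12 + A22*B22];
assert(norm(C - A * B) < 1e-10);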
Linear independence, span, rank¶
Spaces associated with a matrix¶
The column space of a matrix is defined as the vector space spanned by columns of the matrix.
Let \(A\) be an \(m \times n\) matrix with
Then the column space is given by
The row space of a matrix is defined as the vector space spanned by rows of the matrix.
Let \(A\) be an \(m \times n\) matrix with
Then the row space is given by
Rank¶
For an \(m \times n\) matrix \(A\)
An \(m \times n\) matrix \(A\) is called full rank if
In other words it is either a full column rank matrix or a full row rank matrix or both.
Let \(A\) be an \(m \times n\) matrix and \(B\) be an \(n \times p\) matrix then
Let \(A\) be an \(m \times n\) matrix and \(B\) be an \(n \times p\) matrix. If \(B\) is of rank \(n\) then
Let \(A\) be an \(m \times n\) matrix and \(B\) be an \(n \times p\) matrix. If \(A\) is of rank \(n\) then
Invertible matrices¶
A square matrix \(A\) is called invertible if there exists another square matrix \(B\) of same size such that
The matrix \(B\) is called the inverse of \(A\) and is denoted as \(A^{-1}\).
Assume \(A\) is invertible, then there exists a matrix \(B\) such that
Assume that columns of \(A\) are linearly dependent. Then there exists \(u \neq 0\) such that
a contradiction. Hence columns of \(A\) are linearly independent.
Assume \(A\) is invertible, then there exists a matrix \(B\) such that
Now let \(x \in \FF^n\) be any arbitrary vector. We need to show that there exists \(\alpha \in \FF^n\) such that
But
Thus if we choose \(\alpha = Bx\), then
Thus columns of \(A\) span \(\FF^n\).
Assume \(A\) is invertible, then there exists a matrix \(B\) such that
Applying transpose on both sides we get
Thus \(B^T\) is inverse of \(A^T\) and \(A^T\) is invertible.
Assume \(A\) is invertible, then there exists a matrix \(B\) such that
Applying conjugate transpose on both sides we get
Thus \(B^H\) is inverse of \(A^H\) and \(A^H\) is invertible.
We note that
Similarly
Thus \(B^{-1}A^{-1}\) is the inverse of \(AB\).
We verify the properties of a group
- [Closure] If \(A\) and \(B\) are invertible then \(AB\) is invertible. Hence the set is closed.
- [Associativity] Matrix multiplication is associative.
- [Identity element] \(I\) is invertible and \(AI = IA = A\) for all invertible matrices.
- [Inverse element] If \(A\) is invertible then \(A^{-1}\) is also invertible.
Thus the set of invertible matrices is indeed a group under matrix multiplication.
An \(n \times n\) matrix \(A\) is invertible if and only if it is full rank i.e.
Similar matrices¶
An \(n \times n\) matrix \(B\) is similar to an \(n \times n\) matrix \(A\) if there exists an \(n \times n\) non-singular matrix \(C\) such that
Thus there exists a matrix \(D = C^{-1}\) such that
Thus \(A\) is similar to \(B\).
Let \(B\) be similar to \(A\). Thus there exists an invertible matrix \(C\) such that
Since \(C\) is invertible, we have \(\Rank (C) = \Rank(C^{-1}) = n\). Now using here, \(\Rank (AC) = \Rank (A)\), and using here again we have \(\Rank(C^{-1} (AC) ) = \Rank (AC) = \Rank(A)\). Thus
Gram matrices¶
Gram matrix of columns of \(A\) is given by
Gram matrix of rows of \(A\) is given by
This is also known as the frame operator of \(A\).
Following results apply equally well for the real case.
Let \(A\) be an \(m\times n\) matrix and \(G = A^H A\) be the Gram matrix of its columns.
If columns of \(A\) are linearly dependent, then there exists a vector \(u \neq 0\) such that
Thus
Hence the columns of \(G\) are also dependent and \(G\) is not invertible.
Conversely let us assume that \(G\) is not invertible, thus columns of \(G\) are dependent and there exists a vector \(v \neq 0\) such that
Now
From previous equation, we have
Since \(v \neq 0\) hence columns of \(A\) are also linearly dependent.
Columns of \(A\) can be dependent only if its Gram matrix is not invertible. Thus if the Gram matrix is invertible, then the columns of \(A\) are linearly independent.
The Gram matrix is not invertible only if columns of \(A\) are linearly dependent. Thus if columns of \(A\) are linearly independent then the Gram matrix is invertible.
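A small numerical illustration (the matrix A below is arbitrary):
% checking linear independence of the columns of A through its Gram matrix
A = randn(6, 3);                         % columns independent almost surely
G = A' * A;                              % Gram matrix of the columns
independent = (rank(G) == size(A, 2));   % G invertible iff columns independent
A(:, 3) = A(:, 1) + A(:, 2);             % force a linear dependency
G = A' * A;
dependent = (rank(G) < size(A, 2));      % G is now singular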
The null space of \(A\) and its Gram matrix \(A^HA\) coincide. i.e.
Let \(u \in \NullSpace(A)\). Then
Thus
Now let \(u \in \NullSpace(A^H A)\). Then
Thus we have
Rows of \(A\) are linearly dependent if and only if the columns of \(A^H\) are linearly dependent. Thus there exists a vector \(v \neq 0\) s.t.
Thus
Since \(v \neq 0\) hence \(G\) is not invertible.
Converse: assuming that \(G\) is not invertible, there exists a vector \(u \neq 0\) s.t.
Now
Since \(u \neq 0\) hence columns of \(A^H\) and consequently rows of \(A\) are linearly dependent.
Pseudo inverses¶
Let \(A\) be an \(m \times n\) matrix. An \(n\times m\) matrix \(A^{\dag}\) is called its Moore-Penrose pseudo-inverse if it satisfies all of the following criteria:
- \(A A^{\dag} A = A\).
- \(A^{\dag} A A^{\dag} = A^{\dag}\).
- \(\left(A A^{\dag} \right)^H = A A^{\dag}\) i.e. \(A A^{\dag}\) is Hermitian.
- \((A^{\dag} A)^H = A^{\dag} A\) i.e. \(A^{\dag} A\) is Hermitian.
We omit the proof for this. The pseudo-inverse can actually be obtained by the singular value decomposition of \(A\). This is shown here.
Let \(D = \Diag(d_1, d_2, \dots, d_n)\) be an \(n \times n\) diagonal matrix. Then its Moore-Penrose pseudo-inverse is \(D^{\dag} = \Diag(c_1, c_2, \dots, c_n)\) where
We note that \(D^{\dag} D = D D^{\dag} = F = \Diag(f_1, f_2, \dots, f_n)\) where
We now verify the requirements listed here.
\(D^{\dag} D = D D^{\dag} = F\) is a diagonal hence Hermitian matrix.
Let \(D = \Diag(d_1, d_2, \dots, d_p)\) be an \(m \times n\) rectangular diagonal matrix where \(p = \min(m, n)\). Then its Moore-Penrose pseudo-inverse is an \(n \times m\) rectangular diagonal matrix \(D^{\dag} = \Diag(c_1, c_2, \dots, c_p)\) where
\(F = D^{\dag} D = \Diag(f_1, f_2, \dots, f_n)\) is an \(n \times n\) matrix where
\(G = D D^{\dag} = \Diag(g_1, g_2, \dots, g_m)\) is an \(m \times m\) matrix where
We now verify the requirements listed here.
\(F = D^{\dag} D\) and \(G = D D^{\dag}\) are both diagonal hence Hermitian matrices.
If \(A\) is full column rank then its Moore-Penrose pseudo-inverse is given by
It is a left inverse of \(A\).
By here \(A^H A\) is invertible.
First of all we verify that it is a left inverse.
We now verify all the properties.
Hermitian properties:
If \(A\) is full row rank then its Moore-Penrose pseudo-inverse is given by
It is a right inverse of \(A\).
By here \(A A^H\) is invertible.
First of all we verify that it is a right inverse.
We now verify all the properties.
Hermitian properties:
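Both closed forms are easy to check against MATLAB's pinv; the matrices below are arbitrary stand-ins:
% comparing the closed-form pseudo-inverses with pinv
A = randn(6, 3);                              % full column rank (a.s.)
Adag = (A' * A) \ A';                         % A^dag = (A^H A)^{-1} A^H
assert(norm(Adag * A - eye(3)) < 1e-10);      % left inverse
assert(norm(Adag - pinv(A)) < 1e-10);
B = randn(3, 6);                              % full row rank (a.s.)
Bdag = B' / (B * B');                         % B^dag = B^H (B B^H)^{-1}
assert(norm(B * Bdag - eye(3)) < 1e-10);      % right inverse
assert(norm(Bdag - pinv(B)) < 1e-10);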
Trace and determinant¶
Trace¶
The trace of a square matrix is defined as the sum of the entries on its main diagonal. Let \(A\) be an \(n\times n\) matrix, then
where \(\Trace(A)\) denotes the trace of \(A\).
The trace of a square matrix and its transpose are equal.
Trace of sum of two square matrices is equal to the sum of their traces.
Let \(A\) be an \(m \times n\) matrix and \(B\) be an \(n \times m\) matrix. Then
Let \(AB = C = [c_{ij}]\). Then
Thus
Now
Let \(BA = D = [d_{ij}]\). Then
Thus
Hence
This completes the proof.
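A one-line numerical check of this result with arbitrary rectangular matrices:
% Trace(AB) = Trace(BA) even when A and B are rectangular
A = randn(4, 6); B = randn(6, 4);
assert(abs(trace(A * B) - trace(B * A)) < 1e-10);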
Let \(A \in \FF^{m \times n}\), \(B \in \FF^{n \times p}\), \(C \in \FF^{p \times m}\) be three matrices. Then
Let \(AB = D\). Then
Similarly the other result can be proved.
Let \(B\) be similar to \(A\). Thus
for some invertible matrix \(C\). Then
We used this.
Determinants¶
Following are some results on determinant of a square matrix \(A\).
Determinant of a square matrix and its transpose are equal.
Let \(A\) be a complex square matrix. Then
Let \(A\) and \(B\) be two \(n\times n\) matrices. Then
Let \(A\) be an invertible matrix. Then
Determinant of a triangular matrix is the product of its diagonal entries. i.e. if \(A\) is an upper or lower triangular matrix then
Determinant of a diagonal matrix is the product of its diagonal entries. i.e. if \(A\) is a diagonal matrix then
Let \(u\) and \(v\) be vectors in \(\FF^n\). Then
Let \(A\) be a square matrix and let \(\epsilon \approx 0\). Then
Unitary and orthogonal matrices¶
Orthogonal matrix¶
A real square matrix \(U\) is called orthogonal if the columns of \(U\) form an orthonormal set. In other words, let
with \(u_i \in \RR^n\). Then we have
Let
be orthogonal with
Then
Since columns of \(U\) are linearly independent and span \(\RR^n\), hence \(U\) is invertible. Thus
Let \(U\) be an orthogonal matrix. Then
Thus we have
Unitary matrix¶
A complex square matrix \(U\) is called unitary if the columns of \(U\) form an orthonormal set. In other words, let
with \(u_i \in \CC^n\). Then we have
Let
be unitary with
Then
Since columns of \(U\) are linearly independent and span \(\CC^n\), hence \(U\) is invertible. Thus
Let \(U\) be a unitary matrix. Then
Thus we have
F unitary matrix¶
We provide a common definition for unitary matrices over any field \(\FF\). This definition applies to both real and complex matrices.
A square matrix \(U \in \FF^{n \times n}\) is called \(\FF\)-unitary if the columns of \(U\) form an orthonormal set. In other words, let
with \(u_i \in \FF^n\). Then we have
We note that a suitable definition of inner product transports the definition appropriately into orthogonal matrices over \(\RR\) and unitary matrices over \(\CC\).
When we are talking about \(\FF\)-unitary matrices, we will use the symbol \(U^H\) to mean the inverse of \(U\). In the complex case, it will map to the conjugate transpose, while in the real case it will map to the simple transpose.
This definition helps us simplify some of the discussions in the sequel (like singular value decomposition).
Following results apply equally to orthogonal matrices for real case and unitary matrices for complex case.
\(\FF\)-unitary matrices preserve norm. i.e.
For the real case we have
\(\FF\)-unitary matrices preserve inner product. i.e.
For the real case we have
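A quick numerical check for the real case (a random orthogonal matrix obtained from a QR decomposition):
% orthogonal matrices preserve norms and inner products
[Q, ~] = qr(randn(5));
x = randn(5, 1); y = randn(5, 1);
assert(abs(norm(Q * x) - norm(x)) < 1e-10);          % norm preserved
assert(abs((Q * x)' * (Q * y) - x' * y) < 1e-10);    % inner product preserved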
Eigen values¶
Much of the discussion in this section will be equally applicable to real as well as complex matrices. We will use the complex notation mostly and make specific remarks for real matrices wherever needed.
A scalar \(\lambda\) is an eigen value of an \(n \times n\) matrix \(A = [ a_{ij} ]\) if there exists a non null vector \(x\) such that
A non null vector \(x\) which satisfies this equation is called an eigen vector of \(A\) for the eigen value \(\lambda\).
An eigen value is also known as a characteristic value, proper value or a latent value.
We note that (1) can be written as
Thus \(\lambda\) is an eigen value of \(A\) if and only if the matrix \(A - \lambda I\) is singular.
Assume that for \(x\) there are two eigen values \(\lambda_1\) and \(\lambda_2\), then
This can happen only when either \(x = 0\) or \(\lambda_1 = \lambda_2\). Since \(x\) is an eigen vector, it cannot be 0. Thus \(\lambda_1 = \lambda_2\).
If \(x\) is an eigen vector for \(A\), then the corresponding eigen value is given by
since \(x\) is non-zero.
An eigen vector \(x\) of \(A\) for eigen value \(\lambda\) belongs to the null space of \(A - \lambda I\), i.e.
In other words \(x\) is a nontrivial solution to the homogeneous system of linear equations given by
The set comprising all the eigen vectors of \(A\) for an eigen value \(\lambda\) is given by
since \(0\) cannot be an eigen vector.
Clearly
A scalar \(\lambda\) can be an eigen value of a square matrix \(A\) if and only if
\(\det (A - \lambda I)\) is a polynomial in \(\lambda\) of degree \(n\).
where \(\alpha_i\) depend on entries in \(A\).
In this sense, an eigen value of \(A\) is a root of the equation
It is easy to show that \(\alpha_n = (-1)^n\).
For any square matrix \(A\), the polynomial given by \(p(\lambda) = \det(A - \lambda I )\) is known as its characteristic polynomial. The equation given by
is known as its characteristic equation. The eigen values of \(A\) are the roots of its characteristic polynomial or solutions of its characteristic equation.
For real square matrices, if we restrict eigen values to real values, then the characteristic polynomial can be factored as
The polynomial has \(k\) distinct real roots. For each root \(\lambda_i\), \(r_i\) is a positive integer indicating how many times the root appears. \(q(\lambda)\) is a polynomial that has no real roots. The following is true
Clearly \(k \leq n\).
For complex square matrices where eigen values can be complex (including real square matrices), the characteristic polynomial can be factored as
The polynomial can be completely factorized into first degree polynomials. There are \(k\) distinct roots or eigen values. The following is true
Thus including the duplicates there are exactly \(n\) eigen values for a complex square matrix.
Let us consider the sum of the \(r_i\), which gives the total number of roots of \(p(\lambda)\).
With this there are \(m\) not-necessarily distinct roots of \(p(\lambda)\). Let us write \(p(\lambda)\) as
where \(c_1, c_2, \dots, c_m\) are \(m\) scalars (not necessarily distinct) of which \(r_1\) scalars are \(\lambda_1\), \(r_2\) are \(\lambda_2\) and so on. Obviously for the complex case \(q(\lambda)=1\).
We will refer to the set (allowing repetitions) \(\{c_1, c_2, \dots, c_m \}\) as the eigen values of the matrix \(A\) where \(c_i\) are not necessarily distinct. In contrast the spectrum of \(A\) refers to the set of distinct eigen values of \(A\). The symbol \(c\) has been chosen based on the other name for eigen values (the characteristic values).
We can put together the eigen vectors of a matrix into another matrix. This can be a very useful tool. We start with a simple idea.
Let \(A\) be an \(n \times n\) matrix. Let \(u_1, u_2, \dots, u_r\) be \(r\) non-zero vectors from \(\FF^n\). Let us construct an \(n \times r\) matrix
Then all the \(r\) vectors are eigen vectors of \(A\) if and only if there exists a diagonal matrix \(D = \Diag(d_1, \dots, d_r)\) such that \(A U = U D\).
Expanding this equation column by column, we can write it as the requirement that \(A u_i = d_i u_i\) for \(i = 1, \dots, r\),
where \(u_i\) are non-zero. This is possible only when \(d_i\) is an eigen value of \(A\) and \(u_i\) is an eigen vector for \(d_i\).
Converse: Assume that \(u_i\) are eigen vectors. Choose \(d_i\) to be corresponding eigen values. Then the equation holds.
Let \(0\) be an eigen value of \(A\). Then there exists \(u \neq 0\) such that
Thus \(u\) is a non-trivial solution of the homogeneous linear system. Thus \(A\) is singular.
Converse: Assuming that \(A\) is singular, there exists \(u \neq 0\) s.t.
Thus \(0\) is an eigen value of \(A\).
The eigen values of \(A^T\) are given by
But
Hence (using here)
Thus the characteristic polynomials of \(A\) and \(A^T\) are the same. Hence the eigen values are the same. In other words, the spectra of \(A\) and \(A^T\) are the same.
If \(x\) is an eigen vector with a non-zero eigen value \(\lambda\) for \(A\) then \(Ax\) and \(x\) are collinear.
In other words the angle between \(Ax\) and \(x\) is either \(0^{\circ}\) when \(\lambda\) is positive or \(180^{\circ}\) when \(\lambda\) is negative. Let us look at the inner product:
Meanwhile
Thus
The angle \(\theta\) between \(Ax\) and \(x\) is given by
For \(p=1\) the statement holds trivially since \(\lambda^1\) is an eigen value of \(A^1\). Assume that the statement holds for some value of \(p\). Thus let \(\lambda^p\) be an eigen value of \(A^{p}\) and let \(u\) be corresponding eigen vector. Now
Thus \(\lambda^{p + 1}\) is an eigen value for \(A^{p + 1}\) with the same eigen vector \(u\). With the principle of mathematical induction, the proof is complete.
Let \(u \neq 0\) be an eigen vector of \(A\) for the eigen value \(\lambda\). Then
Thus \(u\) is also an eigen vector of \(A^{-1}\) for the eigen value \(\frac{1}{\lambda}\).
Now let \(B = A^{-1}\). Then \(B^{-1} = A\). Thus if \(\mu\) is an eigen value of \(B\) then \(\frac{1}{\mu}\) is an eigen value of \(B^{-1} = A\).
Thus if \(A\) is invertible then eigen values of \(A\) and \(A^{-1}\) have one to one correspondence.
This result is very useful. If it can be shown that a matrix \(A\) is similar to a diagonal or a triangular matrix whose eigen values are easy to obtain, then the determination of the eigen values of \(A\) becomes straightforward.
Invariant subspaces¶
Let \(A\) be a square \(n\times n\) matrix and let \(\WW\) be a subspace of \(\FF^n\), i.e. \(\WW \leq \FF^n\). Then \(\WW\) is invariant relative to \(A\) if \(A w \in \WW\) for every \(w \in \WW\),
i.e. \(A (\WW) \subseteq \WW\): for every vector \(w \in \WW\), its image \(A w\) is also in \(\WW\). Thus the action of \(A\) on \(\WW\) doesn't take us outside of \(\WW\).
We also say that \(\WW\) is \(A\)-invariant.
Eigen vectors are generators of invariant subspaces.
Let \(A\) be an \(n \times n\) matrix. Let \(x_1, x_2, \dots, x_r\) be \(r\) eigen vectors of \(A\). Let us construct an \(n \times r\) matrix
Then the column space of \(X\) i.e. \(\ColSpace(X)\) is invariant relative to \(A\).
Let us assume that \(c_1, c_2, \dots, c_r\) are the eigen values corresponding to \(x_1, x_2, \dots, x_r\) (not necessarily distinct).
Let any vector \(x \in \ColSpace(X)\) be given by
Then
Clearly \(Ax\) is also a linear combination of the \(x_i\), hence it belongs to \(\ColSpace(X)\). Thus \(\ColSpace(X)\) is invariant relative to \(A\), i.e. \(\ColSpace(X)\) is \(A\)-invariant.
Triangular matrices¶
If \(A\) is triangular then \(A - \lambda I\) is also triangular with its diagonal entries being \((a_{i i} - \lambda)\). Using here, we have
Clearly the roots of characteristic polynomial are \(a_{i i}\).
Several small results follow from this lemma.
Let \(A = [a_{i j}]\) be an \(n \times n\) triangular matrix.
- The characteristic polynomial of \(A\) is \(p(\lambda) = (-1)^n \prod_{i=1}^n (\lambda - a_{i i})\).
- A scalar \(\lambda\) is an eigen value of \(A\) iff it is one of the diagonal entries of \(A\).
- The algebraic multiplicity of an eigen value \(\lambda\) is equal to the number of times it appears on the main diagonal of \(A\).
- The spectrum of \(A\) is given by the distinct entries on the main diagonal of \(A\).
A diagonal matrix is naturally both an upper triangular matrix as well as a lower triangular matrix. Similar results hold for the eigen values of a diagonal matrix also.
Let \(A = [a_{i j}]\) be an \(n \times n\) diagonal matrix.
- Its eigen values are the entries on its main diagonal.
- The characteristic polynomial of \(A\) is \(p(\lambda) = (-1)^n \prod_{i=1}^n (\lambda - a_{i i})\).
- A scalar \(\lambda\) is an eigen value of \(A\) iff it is one of the diagonal entries of \(A\).
- The algebraic multiplicity of an eigen value \(\lambda\) is equal to the number of times it appears on the main diagonal of \(A\).
- The spectrum of \(A\) is given by the distinct entries on the main diagonal of \(A\).
There is also a result for the geometric multiplicity of eigen values for a diagonal matrix.
The unit vectors \(e_i\) are eigen vectors for \(A\) since \(A e_i = a_{i i} e_i\).
They are independent. Thus if a particular eigen value appears \(r\) number of times, then there are \(r\) linearly independent eigen vectors for the eigen value. Thus its geometric multiplicity is equal to the algebraic multiplicity.
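A small MATLAB check (illustrative only) that the eigen values of a triangular matrix are exactly its diagonal entries:

    % Random upper triangular matrix
    A = triu(randn(5));
    % Eigen values of a triangular matrix are its diagonal entries
    err = norm(sort(eig(A)) - sort(diag(A)));
    disp(err);   % zero up to rounding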
Similar matrices¶
Some very useful results are available for similar matrices.
Let \(B\) be similar to \(A\). Thus there exists an invertible matrix \(C\) such that
Now
Thus \(B - \lambda I\) is similar to \(A - \lambda I\). Hence due to here, their determinant is equal i.e.
This means that the characteristic polynomials of \(A\) and \(B\) are the same. Since eigen values are nothing but the roots of the characteristic polynomial, they are the same too. This means that the spectrum (the set of distinct eigen values) is the same.
If \(A\) and \(B\) are similar to each other then
- An eigen value has the same algebraic and geometric multiplicity for both \(A\) and \(B\).
- The (not necessarily distinct) eigen values of \(A\) and \(B\) are the same.
Although the eigen values are the same, the eigen vectors are different.
Let \(A\) and \(B\) be similar with
for some invertible matrix \(C\). If \(u\) is an eigen vector of \(A\) for an eigen value \(\lambda\), then \(C^{-1} u\) is an eigen vector of \(B\) for the same eigen value.
\(u\) is an eigen vector of \(A\) for an eigen value \(\lambda\). Thus we have
Thus
Now \(u \neq 0\) and \(C^{-1}\) is non singular. Thus \(C^{-1} u \neq 0\). Thus \(C^{-1}u\) is an eigen vector of \(B\).
Let an \(n \times n\) matrix \(A\) have \(k\) distinct eigen values \(\lambda_1, \lambda_2, \dots, \lambda_k\) with algebraic multiplicities \(r_1, r_2, \dots, r_k\) and geometric multiplicities \(g_1, g_2, \dots, g_k\) respectively. Then
Moreover if
then
Linear independence of eigen vectors¶
We first prove the simpler case with 2 eigen vectors \(x_1\) and \(x_2\) and corresponding eigen values \(\lambda_1\) and \(\lambda_2\) respectively.
Let there be a linear relationship between \(x_1\) and \(x_2\) given by \(\alpha_1 x_1 + \alpha_2 x_2 = 0\).
Multiplying both sides with \((A - \lambda_1 I)\) we get \(\alpha_2 (\lambda_2 - \lambda_1) x_2 = 0\).
Since \(\lambda_1 \neq \lambda_2\) and \(x_2 \neq 0\) , hence \(\alpha_2 = 0\).
Similarly by multiplying with \((A - \lambda_2 I)\) on both sides, we can show that \(\alpha_1 = 0\). Thus \(x_1\) and \(x_2\) are linearly independent.
Now for the general case, consider a linear relationship between \(x_1, x_2, \dots , x_k\) given by \(\alpha_1 x_1 + \alpha_2 x_2 + \dots + \alpha_k x_k = 0\).
Multiplying by \(\prod_{i \neq j, i=1}^k (A - \lambda_i I)\) and using the fact that \(\lambda_i \neq \lambda_j\) if \(i \neq j\), we get \(\alpha_j = 0\). Thus the only linear relationship is the trivial relationship. This completes the proof.
For eigen values with geometric multiplicity greater than \(1\), there are multiple linearly independent eigen vectors corresponding to the eigen value. In this context, the above theorem can be generalized further.
This result puts an upper limit on the number of linearly independent eigen vectors of a square matrix.
Let \(\{ \lambda_1, \dots, \lambda_k \}\) represent the spectrum of an \(n \times n\) matrix \(A\). Let \(g_1, \dots, g_k\) be the geometric multiplicities of \(\lambda_1, \dots, \lambda_k\) respectively. Then the number of linearly independent eigen vectors for \(A\) is \(g_1 + g_2 + \dots + g_k\).
Moreover if \(g_1 + g_2 + \dots + g_k = n\),
then a set of \(n\) linearly independent eigen vectors of \(A\) can be found which forms a basis for \(\FF^n\).
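A minimal MATLAB check (with an arbitrarily chosen matrix) that eigen vectors belonging to distinct eigen values are linearly independent:

    % A matrix with 3 distinct eigen values (5, 2, -1)
    A = [5 0 0; 1 2 0; 0 1 -1];
    [V, D] = eig(A);
    % Distinct eigen values on the diagonal of D
    disp(diag(D).');
    % The eigen vector matrix has full rank, i.e. its columns are
    % linearly independent and form a basis for R^3
    disp(rank(V));   % expected: 3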
Diagonalization¶
Diagonalization is one of the fundamental operations in linear algebra. This section discusses diagonalization of square matrices in depth.
We note that if we restrict to real matrices, then \(P\) and \(D\) should also be real. If \(A \in \CC^{n \times n}\) (it may still be real) then \(P\) and \(D\) can be complex.
The next theorem is the culmination of a variety of results studied so far.
Let \(A\) be a diagonalizable matrix with \(D = P^{-1} A P\) being its diagonalization. Let \(D = \Diag(d_1, d_2, \dots, d_n)\). Then the following hold
\(\Rank(A) = \Rank(D)\) which equals the number of non-zero entries on the main diagonal of \(D\).
\(\det(A) = d_1 d_2 \dots d_n\).
\(\Trace(A) = d_1 + d_2 + \dots + d_n\).
The characteristic polynomial of \(A\) is
\[p(\lambda) = (-1)^n (\lambda - d_1) (\lambda -d_2) \dots (\lambda - d_n).\]
The spectrum of \(A\) comprises the distinct scalars on the diagonal of \(D\).
The (not necessarily distinct) eigenvalues of \(A\) are the diagonal elements of \(D\).
The columns of \(P\) are (linearly independent) eigenvectors of \(A\).
The algebraic and geometric multiplicities of an eigenvalue \(\lambda\) of \(A\) equal the number of diagonal elements of \(D\) that equal \(\lambda\).
From here we note that \(D\) and \(A\) are similar. Due to here
Due to here
Now due to here
Further due to here the characteristic polynomial and spectrum of \(A\) and \(D\) are same. Due to here the eigen values of \(D\) are nothing but its diagonal entries. Hence they are also the eigen values of \(A\).
Now writing
we have
Thus \(p_i\) are eigen vectors of \(A\).
Since the characteristic polynomials of \(A\) and \(D\) are same, hence the algebraic multiplicities of eigen values are same.
From here we get that there is a one to one correspondence between the eigen vectors of \(A\) and \(D\) through the change of basis given by \(P\). Thus the linear independence relationships between the eigen vectors remain the same. Hence the geometric multiplicities of individual eigenvalues are also the same.
This completes the proof.
So far we have verified various results which are available if a matrix \(A\) is diagonalizable. We haven’t yet identified the conditions under which \(A\) is diagonalizable. We note that not every matrix is diagonalizable. The following theorem gives necessary and sufficient conditions under which a matrix is diagonalizable.
We note that since \(P\) is non-singular hence columns of \(P\) have to be linearly independent.
The necessary condition part was proven in here. We now show that if \(P\) consists of \(n\) linearly independent eigen vectors of \(A\) then \(A\) is diagonalizable.
Let the columns of \(P\) be \(p_1, p_2, \dots, p_n\) and the corresponding (not necessarily distinct) eigen values be \(d_1, d_2, \dots , d_n\). Then \(A p_i = d_i p_i\) for every \(i\).
Thus by letting \(D = \Diag (d_1, d_2, \dots, d_n)\), we have \(A P = P D\).
Now since the columns of \(P\) are linearly independent, \(P\) is invertible. This gives us \(D = P^{-1} A P\).
Thus \(A\) is similar to a diagonal matrix \(D\). This validates the sufficient condition.
A corollary follows.
Now we know that geometric multiplicities of eigen values of \(A\) provide us information about linearly independent eigenvectors of \(A\).
Let \(A\) be an \(n \times n\) matrix. Let \(\lambda_1, \lambda_2, \dots, \lambda_k\) be its \(k\) distinct eigen values (comprising its spectrum). Let \(g_j\) be the geometric multiplicity of \(\lambda_j\). Then \(A\) is diagonalizable if and only if \(g_1 + g_2 + \dots + g_k = n\).
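Numerically, diagonalizability can be checked by inspecting the eigen vector matrix returned by MATLAB's eig; a rough illustrative sketch (not a library function) follows:

    A = [4 1; 0 3];          % distinct eigen values, hence diagonalizable
    [P, D] = eig(A);
    % P should be invertible (n linearly independent eigen vectors)
    assert(rank(P) == size(A, 1));
    % Verify the diagonalization A = P * D * inv(P)
    disp(norm(A - P * D / P));   % close to zero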
Symmetric matrices¶
This subsection is focused on real symmetric matrices.
Following is a fundamental property of real symmetric matrices.
The proof of this result is beyond the scope of this book.
By definition we have \(A x_1 = \lambda_1 x_1\) and \(A x_2 = \lambda_2 x_2\). Thus
Thus \(x_1\) and \(x_2\) are orthogonal. In between we took transpose on both sides, used the fact that \(A= A^T\) and \(\lambda_1 - \lambda_2 \neq 0\).
A real \(n \times n\) matrix \(A\) is said to be orthogonally diagonalizable if there exists an orthogonal matrix \(U\) which can diagonalize \(A\), i.e.
is a real diagonal matrix.
We have a diagonal matrix \(D\) such that
Taking transpose on both sides we get
Thus \(A\) is symmetric.
We skip the proof of this theorem.
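For real symmetric matrices, MATLAB's eig returns an orthogonal eigen vector matrix, so the orthogonal diagonalization can be verified numerically (a minimal sketch with a random symmetric matrix):

    A = randn(4); A = (A + A.') / 2;    % make it symmetric
    [U, D] = eig(A);
    disp(norm(U.' * U - eye(4)));       % U is orthogonal
    disp(norm(A - U * D * U.'));        % A = U D U^T
    disp(max(abs(imag(diag(D)))));      % eigen values are real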
Hermitian matrices¶
Following is a fundamental property of Hermitian matrices.
The proof of this result is beyond the scope of this book.
Let \(A\) be a Hermitian matrix and let \(\lambda\) be an eigen value of \(A\). Let \(u\) be a corresponding eigen vector. Then
thus \(\lambda\) is real. We used the facts that \(A = A^H\) and \(u \neq 0 \implies \|u\|_2 \neq 0\).
By definition we have \(A x_1 = \lambda_1 x_1\) and \(A x_2 = \lambda_2 x_2\). Thus
Thus \(x_1\) and \(x_2\) are orthogonal. In between we took conjugate transpose on both sides, used the fact that \(A= A^H\) and \(\lambda_1 - \lambda_2 \neq 0\).
A complex \(n \times n\) matrix \(A\) is said to be unitary diagonalizable if there exists a unitary matrix \(U\) which can diagonalize \(A\), i.e.
is a complex diagonal matrix.
We have a real diagonal matrix \(D\) such that
Taking conjugate transpose on both sides we get
Thus \(A\) is Hermitian. We used the fact that \(D^H = D\) since \(D\) is real.
We skip the proof of this theorem. The theorem means that if \(A\) is Hermitian, then \(A = U \Lambda U^H\), where \(U\) is unitary and \(\Lambda\) is a real diagonal matrix.
Let \(A\) be an \(n \times n\) Hermitian matrix. Let \(\lambda_1, \dots, \lambda_n\) be its eigen values such that \(|\lambda_1| \geq |\lambda_2| \geq \dots \geq |\lambda_n |\). Let \(\Lambda = \Diag(\lambda_1, \dots, \lambda_n)\).
Let \(U\) be a unitary matrix consisting of orthonormal eigen vectors corresponding to \(\lambda_1, \dots, \lambda_n\). Then the eigen value decomposition of \(A\) is defined as \(A = U \Lambda U^H\).
If the \(\lambda_i\) are distinct, then the decomposition is unique. If they are not distinct, then the choice of \(U\) (and hence the decomposition) is not unique.
Let \(\Lambda\) be a diagonal matrix as in here. Consider some vector \(x \in \CC^n\).
Now if \(\lambda_i \geq 0\) then
Also
Let \(A\) be a Hermitian matrix with non-negative eigen values. Let \(\lambda_1\) be its largest and \(\lambda_n\) be its smallest eigen values.
\(A\) has an eigen value decomposition given by
Let \(x \in \CC^n\) and let \(v = U^H x\). Clearly \(\| x \|_2 = \| v \|_2\). Then
From previous remark we have
Thus we get
Miscellaneous properties¶
This subsection lists some miscellaneous properties of eigen values of a square matrix.
Thus \(\lambda\) is an eigen value of \(A\) with an eigen vector \(x\) if and only if \(\lambda + k\) is an eigen value of \(A + kI\) with the same eigen vector \(x\).
Diagonally dominant matrices¶
Let \(A = [a_{ij}]\) be a square matrix in \(\CC^{n \times n}\). \(A\) is called diagonally dominant if \(|a_{i i}| \geq \sum_{j \neq i} |a_{i j}|\)
holds true for all \(1 \leq i \leq n\), i.e. the absolute value of the diagonal element is greater than or equal to the sum of absolute values of all the off-diagonal elements on that row.
Let \(A = [a_{ij}]\) be a square matrix in \(\CC^{n \times n}\). \(A\) is called strictly diagonally dominant if \(|a_{i i}| > \sum_{j \neq i} |a_{i j}|\)
holds true for all \(1 \leq i \leq n\), i.e. the absolute value of the diagonal element is strictly bigger than the sum of absolute values of all the off-diagonal elements on that row.
Let us consider
We can see that the strict diagonal dominance condition is satisfied for each row as follows:
Strictly diagonally dominant matrices have a very special property. They are always non-singular.
Suppose that \(A\) is strictly diagonally dominant and singular. Then there exists a vector \(u \in \CC^n\) with \(u\neq 0\) such that \(A u = 0. \qquad (2)\)
Let \(c = \max_{1 \leq j \leq n} | u_j |\).
We first show that all the entries in \(u\) cannot be equal in magnitude. Let us assume that this is so, i.e. \(| u_j | = c\) for all \(1 \leq j \leq n\).
Since \(u \neq 0\) hence \(c \neq 0\). Now for any row \(i\) in (2) , we have
but this contradicts our assumption that \(A\) is strictly diagonally dominant. Thus all entries in \(u\) are not equal in magnitude.
Let us now assume that the largest entry in \(u\) lies at index \(i\) with \(|u_i| = c\). Without loss of generality we can scale down \(u\) by \(c\) to get another vector in which all entries are less than or equal to 1 in magnitude while \(i\)-th entry is \(\pm 1\). i.e. \(u_i = \pm 1\) and \(|u_j| \leq 1\) for all other entries.
Now from (2) we get for the \(i\)-th row
which again contradicts our assumption that \(A\) is strictly diagonally dominant.
Hence strictly diagonally dominant matrices are non-singular.
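A tiny MATLAB check (illustrative, with an arbitrarily chosen matrix) of strict diagonal dominance and the resulting non-singularity:

    A = [4 1 1; 0 5 2; 1 1 3];
    d = abs(diag(A));
    r = sum(abs(A), 2) - d;     % off-diagonal row sums
    if all(d > r)
        disp('A is strictly diagonally dominant');
    end
    disp(det(A));               % non-zero, so A is non-singular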
Gershgorin’s theorem¶
We are now ready to examine Gershgorin's theorem which provides very useful bounds on the spectrum of a square matrix.
Every eigen value \(\lambda\) of a square matrix \(A \in \CC^{n\times n}\) satisfies \(| \lambda - a_{i i} | \leq \sum_{j \neq i} |a_{i j}|\) for at least one \(i \in \{1, \dots, n\}. \qquad (3)\)
The proof is a straightforward application of the non-singularity of strictly diagonally dominant matrices.
We know that for an eigen value \(\lambda\), \(\det(\lambda I - A) = 0\) i.e. the matrix \((\lambda I - A)\) is singular. Hence it cannot be strictly diagonally dominant due to here.
Thus looking at each row \(i\) of \((\lambda I - A)\) we can say that \(| \lambda - a_{i i} | > \sum_{j \neq i} |a_{i j}|\)
cannot be true for all rows simultaneously, i.e. it must fail for at least one row. This means that there exists at least one row \(i\) for which \(| \lambda - a_{i i} | \leq \sum_{j \neq i} |a_{i j}|\)
holds true.
What this theorem means is pretty simple. Consider a disc in the complex plane for the \(i\)-th row of \(A\) whose center is given by \(a_{ii}\) and whose radius is given by \(r = \sum_{j\neq i} |a_{ij}|\) i.e. the sum of magnitudes of all non-diagonal entries in \(i\)-th row.
There are \(n\) such discs corresponding to \(n\) rows in \(A\). (3) means that every eigen value must lie within the union of these discs. It cannot lie outside.
This idea is crystallized in following definition.
For the \(i\)-th row of matrix \(A\) we define the radius \(r_i = \sum_{j\neq i} |a_{ij}|\) and the center \(c_i = a_{ii}\). Then the set given by \(D_i = \{ z \in \CC : | z - c_i | \leq r_i \}\)
is called the \(i\)-th Gershgorin's disc of \(A\).
We note that the definition is equally valid for real as well as complex matrices. For real matrices, the centers of disks lie on the real line. For complex matrices, the centers may lie anywhere in the complex plane.
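A small MATLAB sketch (the test matrix is an arbitrary choice) that computes the disc centers and radii from the rows and checks that every eigen value falls inside at least one disc:

    A = [4 1 0; -1 6 2; 0 1 -2];
    centers = diag(A);
    radii = sum(abs(A), 2) - abs(centers);
    lam = eig(A);
    % every eigen value lies in at least one Gershgorin disc
    for k = 1:numel(lam)
        assert(any(abs(lam(k) - centers) <= radii + 1e-10));
    end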
Clearly there is nothing magical about the rows of \(A\). We can as well consider the columns of \(A\).
Every eigen value of a matrix \(A\) must lie in a Gershgorin disc corresponding to the columns of \(A\) where the Gershgorin disc for \(j\)-th column is given by
with
Singular values¶
In the previous section we saw the diagonalization of square matrices, which results in an eigen value decomposition of the matrix. This factorization is very useful, yet it is not applicable in all situations. In particular, the eigen value decomposition is of no use if the matrix is not diagonalizable or is not square at all. Moreover, the decomposition is particularly useful only for real symmetric or Hermitian matrices, for which the diagonalizing matrix is an \(\FF\)-unitary matrix (see here). Otherwise, one has to consider the inverse of the diagonalizing matrix also.
Fortunately there happens to be another decomposition which applies to all matrices and it involves just \(\FF\)-unitary matrices.
A non-negative real number \(\sigma\) is a singular value for a matrix \(A \in \FF^{m \times n}\) if and only if there exist unit-length vectors \(u \in \FF^m\) and \(v \in \FF^n\) such that \(A v = \sigma u\)
and \(A^H u = \sigma v\)
hold. The vectors \(u\) and \(v\) are called left-singular and right-singular vectors for \(\sigma\) respectively.
We first present the basic result of singular value decomposition. We will not prove this result completely although we will present proofs of some aspects.
For every \(A \in \FF^{m \times n}\) with \(k = \min(m , n)\), there exist two \(\FF\)-unitary matrices \(U \in \FF^{m \times m}\) and \(V \in \FF^{n \times n}\) and a sequence of real numbers \(\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_k \geq 0\)
such that \(U^H A V = \Sigma\),
where \(\Sigma \in \FF^{m \times n}\) has \(\Sigma_{i i} = \sigma_i\) for \(1 \leq i \leq k\) and all its other entries equal to zero.
The non-negative real numbers \(\sigma_i\) are the singular values of \(A\) as per here.
The sequence of real numbers \(\sigma_i\) doesn’t depend on the particular choice of \(U\) and \(V\).
\(\Sigma\) is rectangular with the same size as \(A\). The singular values of \(A\) lie on the principal diagonal of \(\Sigma\). All other entries in \(\Sigma\) are zero.
It is certainly possible that some of the singular values are 0 themselves.
Since \(U^H A V = \Sigma\), hence \(A = U \Sigma V^H\).
The decomposition of a matrix \(A \in \FF^{m \times n}\) given by
is known as its singular value decomposition.
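In stock MATLAB the decomposition is computed by svd; a brief illustrative check of the factorization and of the unitarity of \(U\) and \(V\):

    A = randn(4, 6);                 % a rectangular example
    [U, S, V] = svd(A);              % A = U * S * V'
    disp(norm(A - U * S * V'));      % close to zero
    disp(norm(U' * U - eye(4)));     % U is unitary (orthogonal here)
    disp(norm(V' * V - eye(6)));     % V is unitary (orthogonal here)
    disp(diag(S).');                 % singular values in descending order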
When \(\FF\) is \(\RR\) then the decomposition simplifies to
and
We can also write
Let us expand
Alternatively, let us expand
This gives us
Following lemma verifies that \(\Sigma\) indeed consists of singular values of \(A\) as per here.
We have
Let us expand R.H.S.
where \(0\) columns in the end appear \(n - k\) times.
Expanding the L.H.S. we get
Thus by comparing both sides we get
and
Now let us start with
Let us expand R.H.S.
where \(0\) columns appear \(m - k\) times.
Expanding the L.H.S. we get
Thus by comparing both sides we get
and
We now consider the three cases.
For \(m = n\), we have \(k = m =n\). And we get
Thus \(\sigma_i\) is a singular value of \(A\) and \(u_i\) is a left singular vector while \(v_i\) is a right singular vector.
For \(m < n\), we have \(k = m\). We get for first \(m\) vectors in \(V\)
Finally for remaining \(n-m\) vectors in \(V\), we can write
They belong to the null space of \(A\).
For \(m > n\), we have \(k = n\). We get for first \(n\) vectors in \(U\)
Finally for remaining \(m - n\) vectors in \(U\), we can write
\(\Sigma \Sigma^H\) is an \(m \times m\) matrix given by
where the number of \(0\)'s following \(\sigma_k^{2}\) is \(m - k\).
\(\Sigma^H \Sigma\) is an \(n \times n\) matrix given by
where the number of \(0\)'s following \(\sigma_k^{2}\) is \(n - k\).
Let \(A \in \FF^{m \times n}\) have a singular value decomposition given by
Then
In other words, the rank of \(A\) is the number of non-zero singular values of \(A\). Since the singular values are ordered in descending order in \(\Sigma\), the first \(r\) singular values \(\sigma_1, \dots, \sigma_r\) are the non-zero ones.
Let \(r = \Rank(A)\). Then \(\Sigma\) can be split as a block matrix
where \(\Sigma_r\) is an \(r \times r\) diagonal matrix of the non-zero singular values \(\Diag(\sigma_1, \sigma_2, \dots, \sigma_r)\). All other sub-matrices in \(\Sigma\) are 0.
We note that \(A^H A\) is Hermitian. Hence \(A^HA\) is diagonalized by \(V\) and the diagonalization of \(A^H A\) is \(\Sigma^H \Sigma\). Thus the eigen values of \(A^H A\) are \(\sigma_1^2, \sigma_2^2, \dots \sigma_k^{2}, 0, 0,\dots 0\) with \(n - k\) \(0\)‘s after \(\sigma_k^{2}\).
Clearly
thus columns of \(V\) are the eigen vectors of \(A^H A\).
We note that \(A A^H\) is Hermitian. Hence \(A A^H\) is diagonalized by \(U\) and the diagonalization of \(A A^H\) is \(\Sigma \Sigma^H\). Thus the eigen values of \(A A^H\) are \(\sigma_1^2, \sigma_2^2, \dots \sigma_k^{2}, 0, 0,\dots 0\) with \(m - k\) \(0\)'s after \(\sigma_k^{2}\).
Clearly
thus columns of \(U\) are the eigen vectors of \(A A^H\).
The largest singular value¶
For all \(u \in \FF^n\) the following holds
Moreover for all \(u \in \FF^m\) the following holds
Let us expand the term \(\Sigma u\).
Now since \(\sigma_1\) is the largest singular value, hence
Thus
or
The result follows.
A simpler representation of \(\Sigma u\) can be given using here. Let \(r = \Rank(A)\). Thus
We split entries in \(u\) as \(u = [(u_1, \dots, u_r )( u_{r + 1} \dots u_n)]^T\). Then
Thus
2nd result can also be proven similarly.
Let \(\sigma_1\) be the largest singular value of an \(m \times n\) matrix \(A\). Then
Moreover
since \(U\) is unitary. Now from previous lemma we have
since \(V^H\) is also unitary. Thus we get the result
Similarly
since \(V\) is unitary. Now from previous lemma we have
since \(U^H\) is also unitary. Thus we get the result
There is a direct connection between the largest singular value and \(2\)-norm of a matrix (see here).
The largest singular value of \(A\) is nothing but its \(2\)-norm. i.e.
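This connection is easy to observe numerically (an illustrative sketch with a random matrix):

    A = randn(5, 3);
    sigma = svd(A);
    disp(norm(A, 2));          % matrix 2-norm
    disp(sigma(1));            % largest singular value; the two agree
    % the bound ||A x||_2 <= sigma_1 ||x||_2 for a random vector
    x = randn(3, 1);
    disp(norm(A * x) <= sigma(1) * norm(x) + 1e-12);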
SVD and pseudo inverse¶
Let \(A = U \Sigma V^H\) and let \(r = \Rank (A)\). Let \(\sigma_1, \dots, \sigma_r\) be the \(r\) non-zero singular values of \(A\). Then the Moore-Penrose pseudo-inverse of \(\Sigma\) is an \(n \times m\) matrix \(\Sigma^{\dag}\) given by
where \(\Sigma_r = \Diag(\sigma_1, \dots, \sigma_r)\).
Essentially \(\Sigma^{\dag}\) is obtained by transposing \(\Sigma\) and inverting all its non-zero (positive real) values.
The rank of \(\Sigma\) and its pseudo-inverse \(\Sigma^{\dag}\) are same. i.e.
Let \(A\) be an \(m \times n\) matrix and let \(A = U \Sigma V^H\) be its singular value decomposition. Let \(\Sigma^{\dag}\) be the pseudo inverse of \(\Sigma\) as per here. Then the Moore-Penrose pseudo-inverse of \(A\) is given by
As usual we verify the requirements for a Moore-Penrose pseudo-inverse as per here. We note that since \(\Sigma^{\dag}\) is the pseudo-inverse of \(\Sigma\) it already satisfies necessary criteria.
First requirement:
Second requirement:
We now consider
Thus
since \(\Sigma \Sigma^{\dag}\) is Hermitian.
Finally we consider
Thus
since \(\Sigma^{\dag} \Sigma\) is also Hermitian.
This completes the proof.
Finally we can connect the singular values of \(A\) with the singular values of its pseudo-inverse.
The rank of any \(m \times n\) matrix \(A\) and its pseudo-inverse \(A^{\dag}\) are the same, i.e. \(\Rank(A) = \Rank(A^{\dag})\).
Let \(A\) be an \(m \times n\) matrix and let \(A^{\dag}\) be its \(n \times m\) pseudo-inverse as per here. Let \(k = \min(m, n)\) denote the number of singular values and let \(r = \Rank(A)\) denote the number of non-zero singular values of \(A\). Let \(\sigma_1, \dots, \sigma_r\) be the non-zero singular values of \(A\). Then the number of singular values of \(A^{\dag}\) is the same as that of \(A\), and the non-zero singular values of \(A^{\dag}\) are
while all other \(k - r\) singular values of \(A^{\dag}\) are zero.
\(k= \min(m, n)\) denotes the number of singular values for both \(A\) and \(A^{\dag}\). Since the ranks of \(A\) and \(A^{\dag}\) are the same, the number of non-zero singular values is the same. Now look at
where
Clearly \(\Sigma_r^{-1} = \Diag(\frac{1}{\sigma_1} , \dots, \frac{1}{\sigma_r})\).
Thus expanding the R.H.S. we can get
where \(v_i\) and \(u_i\) are first \(r\) columns of \(V\) and \(U\) respectively. If we reverse the order of first \(r\) columns of \(U\) and \(V\) and reverse the first \(r\) diagonal entries of \(\Sigma^{\dag}\) , the R.H.S. remains the same while we are able to express \(A^{\dag}\) in the standard singular value decomposition form. Thus \(\frac{1}{\sigma_1} , \dots, \frac{1}{\sigma_r}\) are indeed the non-zero singular values of \(A^{\dag}\).
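A short MATLAB sketch (illustrative only; the test matrix and the rank tolerance are arbitrary choices) that builds the pseudo-inverse from the SVD and compares it with the built-in pinv:

    A = randn(5, 3) * randn(3, 4);       % a rank-3 matrix of size 5 x 4
    [U, S, V] = svd(A);
    s = diag(S);
    r = nnz(s > 1e-10 * s(1));           % number of non-zero singular values
    Sdag = zeros(size(A, 2), size(A, 1));
    Sdag(1:r, 1:r) = diag(1 ./ s(1:r));  % transpose shape, invert the non-zeros
    Adag = V * Sdag * U';
    disp(norm(Adag - pinv(A)));          % close to zero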
Full column rank matrices¶
In this subsection we consider some specific results related to singular value decomposition of a full column rank matrix.
We will consider \(A\) to be an \(m \times n\) matrix in \(\FF^{m \times n}\) with \(m \geq n\) and \(\Rank(A) = n\). Let \(A = U \Sigma V^H\) be its singular value decomposition. From here we observe that there are \(n\) non-zero singular values of \(A\). We will call these singular values as \(\sigma_1, \sigma_2, \dots, \sigma_n\). We will define
Clearly \(\Sigma\) is a \(2\times 1\) block matrix given by
where the lower \(0\) is an \((m - n) \times n\) zero matrix. From here we obtain that \(\Sigma^H \Sigma\) is an \(n \times n\) matrix given by
where
Since all singular values are non-zero hence \(\Sigma_n^2\) is invertible. Thus
Let \(A\) be a full column rank matrix with singular value decomposition \(A = U \Sigma V^H\). Let \(\sigma_1\) be its largest singular value and \(\sigma_n\) be its smallest singular value. Then
Let \(x \in \FF^n\). We have
Now since
hence
thus
Applying square roots, we get
We recall from here that the Gram matrix of its column vectors \(G = A^HA\) is full rank and invertible.
Let \(A\) be a full column rank matrix with singular value decomposition \(A = U \Sigma V^H\). Let \(\sigma_1\) be its largest singular value and \(\sigma_n\) be its smallest singular value. Then
Let \(x \in \FF^n\). Let
Let
Then from previous lemma we have
Finally
Thus
Substituting we get
There are bounds for the inverse of Gram matrix also. First let us establish the inverse of Gram matrix.
Let \(A\) be a full column rank matrix with singular value decomposition \(A = U \Sigma V^H\). Let the singular values of \(A\) be \(\sigma_1, \dots, \sigma_n\). Let the Gram matrix of columns of \(A\) be \(G = A^H A\). Then
where
We have
Thus
From here we have
This completes the proof.
We can now state the bounds:
Let \(A\) be a full column rank matrix with singular value decomposition \(A = U \Sigma V^H\). Let \(\sigma_1\) be its largest singular value and \(\sigma_n\) be its smallest singular value. Then
From here we have
where
Let \(x \in \FF^n\). Let
Let
Then
Thus
Finally
Thus
Substituting we get the result.
Low rank approximation of a matrix¶
An \(m \times n\) matrix \(A\) is called low rank if \(\Rank(A) \ll \min(m, n)\).
Following is a simple procedure for making a low rank approximation of a given matrix \(A\); a short code sketch follows the steps.
- Perform the singular value decomposition of \(A\) given by \(A = U \Sigma V^H\).
- Identify the singular values of \(A\) in \(\Sigma\).
- Keep the first \(r\) singular values (where \(r \ll \min(m, n)\) is the rank of the approximation) and set all other singular values to 0 to obtain \(\widehat{\Sigma}\).
- Compute \(\widehat{A} = U \widehat{\Sigma} V^H\).
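A minimal MATLAB sketch of the above procedure (the test matrix and the target rank \(r\) are arbitrary choices):

    A = randn(8, 6);
    r = 2;                              % desired rank of the approximation
    [U, S, V] = svd(A);
    Shat = zeros(size(S));
    Shat(1:r, 1:r) = S(1:r, 1:r);       % keep only the first r singular values
    Ahat = U * Shat * V';
    disp(rank(Ahat));                   % r
    s = svd(A);
    disp(norm(A - Ahat));               % equals the (r+1)-th singular value
    disp(s(r + 1));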
Matrix norms¶
This section reviews various matrix norms on the vector space of complex matrices over the field of complex numbers \((\CC^{m \times n}, \CC)\).
We know \((\CC^{m \times n}, \CC)\) is a finite dimensional vector space with dimension \(m n\). We will usually refer to it as \(\CC^{m \times n}\).
Matrix norms will follow the usual definition of norms for a vector space.
A function \(\| \cdot \| : \CC^{m \times n} \to \RR\) is called a matrix norm on \(\CC^{m \times n}\) if for all \(A, B \in \CC^{m \times n}\) and all \(\alpha \in \CC\) it satisfies the following
[Positivity] \(\| A \| \geq 0\) with \(\| A \| = 0 \iff A = 0\).
[Homogeneity] \(\| \alpha A \| = | \alpha | \| A \|\).
[Triangle inequality] \(\| A + B \| \leq \| A \| + \| B \|\).
We recall some of the standard results on normed vector spaces.
All matrix norms are equivalent. Let \(\| \cdot \|\) and \(\| \cdot \|'\) be two different matrix norms on \(\CC^{m \times n}\). Then there exist two constants \(a > 0\) and \(b > 0\) such that \(a \| A \| \leq \| A \|' \leq b \| A \|\) holds for all \(A \in \CC^{m \times n}\).
A matrix norm is a continuous function \(\| \cdot \| : \CC^{m \times n} \to \RR\).
Norms like \(\ell_p\) on complex vector space¶
The following norms are quite like the \(\ell_p\) norms on the finite dimensional complex vector space \(\CC^n\). They arise from the fact that the matrix vector space \(\CC^{m\times n}\) has a one-to-one correspondence with the complex vector space \(\CC^{m n}\).
Let \(A \in \CC^{m\times n}\) and \(A = [a_{ij}]\).
The matrix sum norm is defined as the sum of the absolute values of all the entries, \(\sum_{i=1}^m \sum_{j=1}^n | a_{i j} |\).
Let \(A \in \CC^{m\times n}\) and \(A = [a_{ij}]\).
The matrix Frobenius norm is defined as \(\| A \|_F = \left ( \sum_{i=1}^m \sum_{j=1}^n | a_{i j} |^2 \right )^{1/2}\).
Let \(A \in \CC^{m\times n}\) and \(A = [a_{ij}]\).
The matrix max norm is defined as the largest absolute value of any entry, \(\max_{i, j} | a_{i j} |\).
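All three norms are easy to compute directly in MATLAB (a small illustrative sketch; only the Frobenius norm has a built-in form, norm(A, 'fro')):

    A = [1 -2; 3i 4];                       % a small complex example
    sum_norm  = sum(abs(A(:)));             % matrix sum norm
    frob_norm = sqrt(sum(abs(A(:)).^2));    % Frobenius norm
    max_norm  = max(abs(A(:)));             % matrix max norm
    disp([sum_norm, frob_norm, max_norm]);
    disp(norm(A, 'fro'));                   % matches frob_norm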
Properties of Frobenius norm¶
We now prove some elementary properties of Frobenius norm.
The Frobenius norm of a matrix is equal to the Frobenius norm of its Hermitian transpose.
Let
Then
Now
Let \(A \in \CC^{m \times n}\) be written as a row of column vectors
Then
We note that
Now
We have thus shown that the square of the Frobenius norm of a matrix is nothing but the sum of squares of the \(\ell_2\) norms of its columns.
Let \(A \in \CC^{m \times n}\) be written as a column of row vectors
Then
We note that
Now
We now consider how the Frobenius norm is affected with the action of unitary matrices.
Let \(A\) be an arbitrary matrix in \(\CC^{m \times n}\). Let \(U\) be a unitary matrix in \(\CC^{m \times m}\) and let \(V\) be a unitary matrix in \(\CC^{n \times n}\).
We present our first result: multiplication with unitary matrices doesn't change the Frobenius norm of a matrix.
The Frobenius norm of a matrix is invariant to pre or post multiplication by a unitary matrix. i.e.
and
We can write \(A\) as
So
Then applying here clearly
But we know that unitary matrices are norm preserving. Hence
Thus
which implies
Similarly writing \(A\) as
we have
Then applying here clearly
But we know that unitary matrices are norm preserving. Hence
Thus
which implies
An alternative approach for the 2nd part of the proof using the first part is just one line
In above we use here and the fact that \(V\) is a unitary matrix implies that \(V^H\) is also a unitary matrix. We have already shown that pre multiplication by a unitary matrix preserves Frobenius norm.
Let \(A \in \CC^{m \times n}\) and \(B \in \CC^{n \times p}\) be two matrices. Then the Frobenius norm of their product is less than or equal to the product of the Frobenius norms of the matrices themselves, i.e.
We can write \(A\) as
where \(a_i\) are \(m\) column vectors corresponding to rows of \(A\). Similarly we can write B as
where \(b_i\) are column vectors corresponding to columns of \(B\). Then
Now looking carefully
Applying the Cauchy-Schwarz inequality we have
Now
which implies
by taking square roots on both sides.
Let \(A \in \CC^{m \times n}\) and let \(x \in \CC^n\). Then
We note that the Frobenius norm of a column matrix is the same as the \(\ell_2\) norm of the corresponding column vector, i.e.
Now applying here we have
It turns out that Frobenius norm is intimately related to the singular value decomposition of a matrix.
Let \(A \in \CC^{m \times n}\). Let the singular value decomposition of \(A\) be given by
Let the singular values of \(A\) be \(\sigma_1, \dots, \sigma_n\). Then
But
since \(U\) and \(V\) are unitary matrices (see here ).
Now the only non-zero terms in \(\Sigma\) are the singular values. Hence
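This relationship can be confirmed numerically in a couple of lines (illustrative):

    A = randn(4, 7);
    disp(norm(A, 'fro'));
    disp(sqrt(sum(svd(A).^2)));   % identical up to rounding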
Consistency of a matrix norm¶
A matrix norm \(\| \cdot \|\) is called consistent on \(\CC^{n \times n}\) if \(\| A B \| \leq \| A \| \| B \| \qquad (1)\)
holds true for all \(A, B \in \CC^{n \times n}\). A matrix norm \(\| \cdot \|\) is called consistent if it is defined on \(\CC^{m \times n}\) for all \(m, n \in \Nat\) and eq (1) holds for all matrices \(A, B\) for which the product \(AB\) is defined.
A consistent matrix norm is also known as a sub-multiplicative norm.
With this definition and results in here we can see that Frobenius norm is consistent.
Subordinate matrix norm¶
A matrix operates on vectors from one space to generate vectors in another space. It is interesting to explore the connection between the norm of a matrix and norms of vectors in the domain and co-domain of a matrix.
Let \(m, n \in \Nat\) be given. Let \(\| \cdot \|_{\alpha}\) be some norm on \(\CC^m\) and \(\| \cdot \|_{\beta}\) be some norm on \(\CC^n\). Let \(\| \cdot \|\) be some norm on matrices in \(\CC^{m \times n}\). We say that \(\| \cdot \|\) is subordinate to the vector norms \(\| \cdot \|_{\alpha}\) and \(\| \cdot \|_{\beta}\) if \(\| A x \|_{\alpha} \leq \| A \| \, \| x \|_{\beta}\)
for all \(A \in \CC^{m \times n}\) and for all \(x \in \CC^n\). In other words the length of the vector doesn’t increase by the operation of \(A\) beyond a factor given by the norm of the matrix itself.
If \(\| \cdot \|_{\alpha}\) and \(\| \cdot \|_{\beta}\) are same then we say that \(\| \cdot \|\) is subordinate to the vector norm \(\| \cdot \|_{\alpha}\).
We have shown earlier in here that Frobenius norm is subordinate to Euclidean norm.
Operator norm¶
We now consider the maximum factor by which a matrix \(A\) can increase the length of a vector.
Let \(m, n \in \Nat\) be given. Let \(\| \cdot \|_{\alpha}\) be some norm on \(\CC^n\) and \(\| \cdot \|_{\beta}\) be some norm on \(\CC^m\). For \(A \in \CC^{m \times n}\) we define \(\| A \| = \| A \|_{\alpha \to \beta} = \max_{x \neq 0} \frac{\| A x \|_{\beta}}{\| x \|_{\alpha}}\).
\(\frac{\| A x \|_{\beta}}{\| x \|_{\alpha}}\) represents the factor with which the length of \(x\) increased by operation of \(A\). We simply pick up the maximum value of such scaling factor.
The norm as defined above is known as the \((\alpha \to \beta)\) operator norm, the \((\alpha \to \beta)\)-norm, or simply the \(\alpha\)-norm if \(\alpha = \beta\).
Of course we need to verify that this definition satisfies all properties of a norm.
Clearly if \(A= 0\) then \(A x = 0\) always, hence \(\| A \| = 0\).
Conversely, if \(\| A \| = 0\) then \(\| A x \|_{\beta} = 0 \Forall x \in \CC^n\). In particular this is true for the unit vectors \(e_i \in \CC^n\). The \(i\)-th column of \(A\) is given by \(A e_i\) which is 0. Thus each column in \(A\) is 0. Hence \(A = 0\).
Now consider \(c \in \CC\).
We now present some useful observations on operator norm before we can prove triangle inequality for operator norm.
For any \(x \in \Kernel(A)\), \(A x = 0\) hence we only need to consider vectors which don’t belong to the kernel of \(A\).
Thus we can write
We also note that
Thus, it is sufficient to find the maximum on unit norm vectors:
Note that since \(\|x \|_{\alpha} = 1\) hence the term in denominator goes away.
The \((\alpha \to \beta)\)-operator norm is subordinate to vector norms \(\| \cdot \|_{\alpha}\) and \(\| \cdot \|_{\beta}\). i.e.
For \(x = 0\) the inequality is trivially satisfied. Now for \(x \neq 0\) by definition, we have
There exists a vector \(x^* \in \CC^{n}\) with unit norm (\(\| x^* \|_{\alpha} = 1\)) such that
Let \(x' \neq 0\) be some vector which maximizes the expression
Then
Now consider \(x^* = \frac{x'}{\| x' \|_{\alpha}}\). Thus \(\| x^* \|_{\alpha} = 1\). We know that
Hence
We are now ready to prove triangle inequality for operator norm.
Let \(A\) and \(B\) be some matrices in \(\CC^{m \times n}\). Consider the operator norm of matrix \(A+B\). From previous remarks, there exists some vector \(x^* \in \CC^n\) with \(\| x^* \|_{\alpha} = 1\) such that
Now
From another remark we have
and
since \(\| x^* \|_{\alpha} = 1\).
Hence we have
It turns out that operator norm is also consistent under certain conditions.
Let \(\| \cdot \|_{\alpha}\) be defined over all \(m \in \Nat\). Let \(\| \cdot \|_{\beta} = \| \cdot \|_{\alpha}\). Then the operator norm
is consistent.
We need to show that
Now
We note that if \(Bx = 0\), then \(A B x = 0\). Hence we can rewrite as
Now if \(Bx \neq 0\) then \(\| Bx \|_{\alpha} \neq 0\). Hence
and
Clearly
Furthermore
Thus we have
p-norm for matrices¶
We recall the definition of \(\ell_p\) norms for vectors \(x \in \CC^n\) from (2)
The operator norms \(\| \cdot \|_p\) defined from \(\ell_p\) vector norms are of specific interest.
The \(p\)-norm for a matrix \(A \in \CC^{m \times n}\) is defined as
where \(\| x \|_p\) is the standard \(\ell_p\) norm for vectors in \(\CC^m\) and \(\CC^n\).
Special cases are considered for \(p=1,2\) and \(\infty\).
Let \(A \in \CC^{m \times n}\).
For \(p=1\) we have
This is also known as max column sum norm.
For \(p=\infty\) we have
This is also known as max row sum norm.
Finally for \(p=2\) we have
where \(\sigma_1\) is the largest singular value of \(A\). This is also known as spectral norm.
Let
Then
Thus,
which is the maximum column sum. We need to show that this upper bound is indeed an equality.
Indeed for any \(x=e_j\) where \(e_j\) is a unit vector with \(1\) in \(j\)-th entry and 0 elsewhere,
Thus
Combining the two, we see that
For \(p=\infty\), we proceed as follows:
where \(\underline{a}^i\) are the rows of \(A\).
This shows that
We need to show that this is indeed an equality.
Fix an \(i = k\) and choose \(x\) such that
Clearly \(\| x \|_{\infty} = 1\).
Then
Thus,
Combining the two inequalities we get:
Remaining case is for \(p=2\).
For any vector \(x\) with \(\| x \|_2 = 1\),
since \(\ell_2\) norm is invariant to unitary transformations.
Let \(v = V^H x\). Then \(\|v\|_2 = \| V^H x \|_2 = \| x \|_2 = 1\).
Now
This shows that
Now consider some vector \(x\) such that \(v = (1, 0, \dots, 0)\). Then
Thus
Combining the two, we get that \(\| A \|_2 = \sigma_1\).
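The three special cases can be cross-checked in MATLAB against the built-in norm function (an illustrative sketch):

    A = [1 -2 3; 4 5 -6];
    % p = 1: maximum column sum
    disp([norm(A, 1),   max(sum(abs(A), 1))]);
    % p = inf: maximum row sum
    disp([norm(A, inf), max(sum(abs(A), 2))]);
    % p = 2: largest singular value
    disp([norm(A, 2),   max(svd(A))]);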
The 2-norm¶
Let \(A\in \CC^{n \times n}\) have singular values \(\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_n\). Let the eigen values of \(A\) be \(\lambda_1, \lambda_2, \dots, \lambda_n\) with \(|\lambda_1| \geq |\lambda_2| \geq \dots \geq |\lambda_n|\). Then the following hold
and if \(A\) is non-singular
If \(A\) is symmetric and positive definite, then
and if \(A\) is non-singular
If \(A\) is normal then
and if \(A\) is non-singular
Unitary invariant norms¶
We have already seen in here that Frobenius norm is unitary invariant.
It turns out that spectral norm is also unitary invariant.
More properties of operator norms¶
In this section we will focus on operator norms connecting normed linear spaces \((\CC^n, \| \cdot \|_{p})\) and \((\CC^m, \| \cdot \|_{q})\). Typical values of \(p, q\) would be in \(\{1, 2, \infty\}\).
We recall that
The following table (based on [TRO04]) shows how to compute different \((p, q)\) norms. Some can be computed easily while others are NP-hard to compute.
p | q | \(\| A \|_{p \to q}\) | Calculation |
---|---|---|---|
1 | 1 | \(\| A \|_{1 }\) | Maximum \(\ell_1\) norm of a column |
1 | 2 | \(\| A \|_{1 \to 2}\) | Maximum \(\ell_2\) norm of a column |
1 | \(\infty\) | \(\| A \|_{1 \to \infty}\) | Maximum absolute entry of a matrix |
2 | 1 | \(\| A \|_{2 \to 1}\) | NP hard |
2 | 2 | \(\| A \|_{2}\) | Maximum singular value |
2 | \(\infty\) | \(\| A \|_{2 \to \infty}\) | Maximum \(\ell_2\) norm of a row |
\(\infty\) | 1 | \(\| A \|_{\infty \to 1}\) | NP hard |
\(\infty\) | 2 | \(\| A \|_{\infty \to 2}\) | NP hard |
\(\infty\) | \(\infty\) | \(\| A \|_{\infty}\) | Maximum \(\ell_1\)-norm of a row |
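The rows of this table that have closed forms can be computed directly; a small MATLAB sketch (with an arbitrary random matrix):

    A = randn(4, 5);
    n_1_to_2   = max(sqrt(sum(abs(A).^2, 1)));   % maximum l2 norm of a column
    n_1_to_inf = max(abs(A(:)));                 % maximum absolute entry
    n_2_to_2   = max(svd(A));                    % maximum singular value
    n_2_to_inf = max(sqrt(sum(abs(A).^2, 2)));   % maximum l2 norm of a row
    disp([n_1_to_2, n_1_to_inf, n_2_to_2, n_2_to_inf]);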
The topological dual of the finite dimensional normed linear space \((\CC^n, \| \cdot \|_{p})\) is the normed linear space \((\CC^n, \| \cdot \|_{p'})\) where \(\frac{1}{p} + \frac{1}{p'} = 1\).
\(\ell_2\)-norm is dual of \(\ell_2\)-norm. It is a self dual. \(\ell_1\) norm and \(\ell_{\infty}\)-norm are dual of each other.
When a matrix \(A\) maps from the space \((\CC^n, \| \cdot \|_{p})\) to the space \((\CC^m, \| \cdot \|_{q})\), we can view its conjugate transpose \(A^H\) as a mapping from the space \((\CC^m, \| \cdot \|_{q'})\) to \((\CC^n, \| \cdot \|_{p'})\).
The operator norm of a matrix always equals the corresponding operator norm of its conjugate transpose, i.e. \(\| A \|_{p \to q} = \| A^H \|_{q' \to p'},\)
where \(\frac{1}{p} + \frac{1}{p'} = 1\) and \(\frac{1}{q} + \frac{1}{q'} = 1\).
Specific applications of this result are:
This is obvious since the maximum singular value of a matrix and its conjugate transpose are same.
This is also obvious since max column sum of \(A\) is same as the max row sum norm of \(A^H\) and vice versa.
We now need to show the result for the general case (arbitrary \(1 \leq p, q \leq \infty\)).
where
Thus,
We need to show that this upper bound is indeed an equality.
Indeed for any \(x=e_j\) where \(e_j\) is a unit vector with \(1\) in \(j\)-th entry and 0 elsewhere,
Thus
Combining the two, we see that
where
For two matrices \(A\) and \(B\) and \(p \geq 1\), we have
We start with
From here, we obtain
Thus,
For two matrices \(A\) and \(B\) and \(p \geq 1\), we have
We start with
From here, we obtain
Thus,
In particular
Choosing \(q = \infty\) and \(s = p\) and applying here
But \(\| I \|_{p \to \infty}\) is the maximum \(\ell_p\) norm of any row of \(I\) which is \(1\). Thus
Consider the expression
\(z \in \ColSpace(A^H), z \neq 0\) means there exists some vector \(u \notin \Kernel(A^H)\) such that \(z = A^H u\).
This expression measures the factor by which the non-singular part of \(A\) can decrease the length of a vector.
The following bound holds for every matrix \(A\):
If \(A\) is surjective (onto), then the equality holds. When \(A\) is bijective (one-one onto, square, invertible), then the result implies
The spaces \(\ColSpace(A^H)\) and \(\ColSpace(A)\) have same dimensions given by \(\Rank(A)\). We recall that \(A^{\dag} A\) is a projector onto the column space of \(A\).
As a result we can write
whenever \(z \in \ColSpace(A^H)\). Now
When \(A\) is surjective, then \(\ColSpace(A) = \CC^m\). Hence
Thus, the inequality changes into equality. Finally
which completes the proof.
Row column norms¶
Let \(A\) be an \(m\times n\) matrix with rows \(\underline{a}^i\) as
Then we define
where \(1 \leq p < \infty\). i.e. we take \(p\)-norms of all row vectors and then find the maximum.
We define
This is equivalent to taking \(\ell_{\infty}\) norm on each row and then taking the maximum of all the norms.
For \(1 \leq p , q < \infty\), we define the norm
i.e., we compute \(p\)-norm of all the row vectors to form another vector and then take \(q\)-norm of that vector.
Note that the norm \(\| A \|_{p, \infty}\) is different from the operator norm \(\| A \|_{p \to \infty}\). Similarly \(\| A \|_{p, q}\) is different from \(\| A \|_{p \to q}\).
where
From here we get
This is exactly the definition of \(\| A \|_{p, \infty}\).
From here
For any two matrices \(A, B\), we have
Let \(q\) be such that \(\frac{1}{p} + \frac{1}{q} = 1\). From here, we have
From here
and
Thus
Relations between \((p, q)\) norms and \((p \to q)\) norms
Block diagonally dominant matrices and generalized Gershgorin disc theorem¶
In [FV+62] the idea of diagonally dominant matrices (see here) has been generalized to block matrices using matrix norms. We consider the specific case with spectral norm.
Let \(A\) be a square matrix in \(\CC^{n \times n}\) which is partitioned in following manner
where each of the submatrices \(A_{i j}\) is a square matrix of size \(m \times m\). Thus \(n = k m\).
\(A\) is called block diagonally dominant if
holds true for all \(1 \leq i \leq k\). If the inequality holds strictly for all \(i\), then \(A\) is called a block strictly diagonally dominant matrix.
For proof see [FV+62].
This leads to the generalized Gershgorin disc theorem.
Let \(A\) be a square matrix in \(\CC^{n \times n}\) which is partitioned in following manner
where each of the submatrices \(A_{i j}\) is a square matrix of size \(m \times m\). Then each eigenvalue \(\lambda\) of \(A\) satisfies
For proof see [FV+62].
Since the \(2\)-norm of a positive semidefinite matrix is nothing but its largest eigen value, the theorem directly applies.
Let \(A\) be a Hermitian positive semidefinite matrix. Let \(A\) be partitioned as in here. Then its \(2\)-norm \(\| A \|_2\) satisfies
Real Analysis¶
Metric Spaces¶
A metric or a distance \(d\) on a nonempty set \(X\) is a function \(d : X \times X \to \RR\) which satisfies following properties
- \(d(x, y) \geq 0 \Forall x, y \in X\) non-negativity axiom [M1];
- \(d(x, y) = 0 \iff x = y\) coincidence axiom [M2];
- \(d(x, y ) = d(y, x) \Forall x, y \in X\) symmetry [M3];
- \(d(x, y) \leq d(x, z) + d(z, y) \Forall x, y, z \in X\) triangle inequality or sub-additivity [M4].
The pair \((X, d)\) is called a metric space.
In a metric space \((X, d)\), the inequality \(| d(x, z) - d(y, z) | \leq d(x, y)\)
holds for all points \(x, y, z \in X\).
From the triangle inequality, \(d(x, z) \leq d(x, y) + d(y, z)\), i.e. \(d(x, z) - d(y, z) \leq d(x, y)\). Interchanging \(x\) and \(y\) we get \(d(y, z) - d(x, z) \leq d(x, y)\).
Combining the two, we get the result.
We show different metrics for the set of real numbers \(\RR\). Let \(x, y, z \in \RR\). Define \(d_1(x, y) = | x - y |\).
Since the absolute value of any real number is non-negative, M1 is satisfied.
\(| x- y | = 0 \iff x - y = 0 \iff x = y\). Thus, M2 is satisfied.
Now,
Thus, M3 is satisfied.
Finally,
Thus, M4 is satisfied and \((\RR, d_1)\) is a metric space.
We consider metrics defined on \(\RR^n\) (the set of n-tuples).
Let \(x = (x_1, \dots, x_n) \in \RR^n\) and \(y = (y_1, \dots, y_n) \in \RR^n\).
The taxicab metric is defined as \(d_1(x, y) = \sum_{i=1}^n | x_i - y_i |\).
The Euclidean metric is defined as \(d_2(x, y) = \left ( \sum_{i=1}^n | x_i - y_i |^2 \right )^{1/2}\).
The general Euclidean (\(\ell_p\)) metric, for \(1 \leq p < \infty\), is defined as \(d_p(x, y) = \left ( \sum_{i=1}^n | x_i - y_i |^p \right )^{1/p}\).
For \(p = \infty\), the metric is defined as \(d_{\infty}(x, y) = \max_{1 \leq i \leq n} | x_i - y_i |\).
We now prove that the above are indeed metrics. We start with the taxicab metric. M1 is straightforward since
M2 is also easy
M3 is straightforward too
We will prove M4 (triangle inequality) inductively. For \(n=1\)
Thus M4 is true for \(n=1\).
TODO finish it.
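Numerically these metrics are straightforward to evaluate; a small MATLAB sketch that also spot-checks the triangle inequality M4 on random points:

    x = randn(5, 1); y = randn(5, 1); z = randn(5, 1);
    d1   = @(a, b) sum(abs(a - b));        % taxicab metric
    d2   = @(a, b) sqrt(sum((a - b).^2));  % Euclidean metric
    dinf = @(a, b) max(abs(a - b));        % p = infinity metric
    % triangle inequality d(x, y) <= d(x, z) + d(z, y) for each metric
    disp(d1(x, y)   <= d1(x, z)   + d1(z, y));
    disp(d2(x, y)   <= d2(x, z)   + d2(z, y));
    disp(dinf(x, y) <= dinf(x, z) + dinf(z, y));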
Open sets¶
Let \((X, d)\) be a metric space. An open ball at any \(x \in X\) with radius \(r > 0\) is the set \(B(x, r) = \{ y \in X : d(x, y) < r \}\).
Let \(A = B(x, r)\). We need to show that for every \(y \in A\) there exists an open ball \(B(y, r_1) \subseteq A\).
Let \(r_1 = r - d(x, y)\). Since \(d(x, y) < r \forall y \in A\), hence \(r_1 > 0\). We can also write \(d(x, y) = r - r_1\). Consider \(C = B(y, r_1)\). For any \(z \in C\) we have \(d(y, z) < r_1\). Further using triangle inequality:
Thus \(z \in B(x, r) \Forall z \in C\), hence \(C \subseteq B(x, r)\). Hence \(B(x, r)\) is open.
For a metric space \((X, d)\) following statements hold
- \(X\) and \(\EmptySet\) are open sets.
- Arbitrary unions of open sets are open sets.
- Finite intersections of open sets are open sets.
Since \(\EmptySet\) doesn’t contain any element hence (i) is vacuously true for \(\EmptySet\). For any \(x \in X\) and any \(r > 0\), \(B(x, r) \subseteq X\) by definition. Hence \(X\) is open.
Let \(\{A_i\}_{i \in I}\) be an arbitrary family of open sets with \(A_i \subseteq X\). Let \(C = \bigcup A_i\). Let \(x \in C\). Then there exists some \(A_i\) such that \(x \in A_i\). Since \(A_i\) is open hence there exists an open ball \(B(x, r) \subseteq A_i \subseteq C\). Thus for every \(x \in C\) there exists an open ball \(B(x, r) \subseteq C\). Hence \(C\) is open.
Let \(\{A_1, \dots, A_n\}\) be a finite collection of open subsets of \(X\). Let \(C = \bigcap A_i\). Let \(x \in C\). Then \(x \in A_i \Forall 1 \leq i \leq n\). Thus there exists an open ball \(B(x, r_i) \subseteq A_i \Forall 1 \leq i \leq n\). Now let \(r = \min(r_1, \dots, r_n)\). Since \(r_i > 0\) and we are taking a minimum over finite set of numbers hence \(r > 0\). Thus \(B(x, r) \subseteq B(x, r_i) \subseteq A_i \Forall 1 \leq i \leq n\). Thus \(B(x, r) \subseteq C\). Thus C is open.
We need to show that for every \(x \in \Interior{A}\), there exists an open ball \(B(x, r) \subseteq \Interior{A}\).
Let \(x \in \Interior{A}\). Then there exists an open ball \(B(x, r) \subseteq A\). Since \(B( x, r)\) is open hence for every \(y \in B (x, r)\) there exists an open ball \(B (y , r_1) \subseteq B(x, r) \subseteq A\). Thus \(y\) is an interior point of \(A\). Hence \(B(x, r) \subseteq \Interior{A}\).
Let \(A\) be open. Hence for every \(x \in A\), there exists an open ball \(B(x, r) \subseteq A\). Thus \(x\) is an interior point of \(A\). Thus \(A \subseteq \Interior{A}\). But since \(\Interior{A} \subseteq A\), hence \(\Interior{A} = A\).
Now the converse. Let \(\Interior{A} = A\). Thus for every point \(x \in A\), there exists an open ball \(B(x, r) \subseteq A\) since \(x \in \Interior{A}\). Hence \(A\) is open.
Closed sets¶
For a metric space \((X, d)\) the following statements hold:
- \(X\) and \(\EmptySet\) are closed sets.
- Arbitrary intersections of closed sets are closed sets.
- Finite unions of closed sets are closed sets.
Since \(\EmptySet\) is open hence \(X = X \setminus \EmptySet\) is closed. Since \(X\) is open hence \(\EmptySet = X \setminus X\) is closed.
Let \(\{A_i\}_{i \in I}\) be an arbitrary family of closed sets with \(A_i \subseteq X\). Then \(A_i^c\) are open. Thus \(\bigcup A_i^c\) is open. Thus \(\left ( \bigcup A_i^c \right )^c\) is closed. By De Morgan’s law, \(\bigcap A_i\) is closed.
Let \(\{A_1, \dots, A_n\}\) be a finite collection of closed subsets of \(X\). Then \(A_i^c\) are open. Hence their finite intersection \(\bigcap A_i^c\) is open. Hence \(\left ( \bigcap A_i^c \right )^c\) is closed. By De Morgan’s law, \(\bigcup A_i\) is closed.
Note that a closure point of \(A\) need not belong to \(A\). At the same time, every point in \(A\) is a closure point of \(A\).
Clearly \(A \subseteq \Closure{A}\).
We will show that \(C = \Closure{A}^c\) is open.
Let \(x \in C\). Then \(x\) is not a closure point of \(A\). Hence, there exists an open ball \(B(x, r)\) such that \(B(x, r) \cap A = \EmptySet\). Now, consider \(z \in B (x, r)\). Since \(B(x, r)\) is open, there exists \(r_1 > 0\) such that \(B (z, r_1) \subseteq B(x, r)\). Thus, \(B (z, r_1) \cap A = \EmptySet\). Hence, \(z\) is not a closure point of \(A\). Hence, \(z \in C\). Thus, \(B( x, r) \subseteq C\). Thus, we have shown that for every \(x \in C\), there exists an open ball \(B(x, r) \subseteq C\). Thus, \(C\) is open. Consequently, \(\Closure{A} = C^c\) is closed.
Let \(A\) be closed. Then, \(\Closure{A} \subseteq A\) due to this. But since \(A \subseteq \Closure{A}\), hence, \(A = \Closure{A}\).
Now assume \(A = \Closure{A}\). Since \(\Closure{A}\) is closed, \(A\) is closed.
We show that \(A^c\) is open.
Let \(y \in A^c\). Then \(d(y, a) > r\). Now consider \(r_1 = d(y, a) - r > 0\) and an open ball \(B(y, r_1)\). For any \(z \in B(y, r_1)\)
Thus, \(z \in A^c\). Hence, \(B(y, r_1) \subseteq A^c\). Hence, \(A^c\) is open. Thus, \(A\) is closed.
Any point \(y : d(x, y) < r\) is obviously a closure point of \(B (x, r)\).
We show that a point \(y : d(x, y) = r\) is a closure point of \(B (x, r)\). For contradiction, suppose \(y\) is not a closure point of \(B (x, r)\). Then, there exists an open ball \(B(y, r_1)\) such that \(B (y, r_1) \cap B (x, r) = \EmptySet\).
We show that a point \(y : d(x, y) > r\) is not a closure point of \(B (x, r)\). Let \(r_1 = d(x, y) -r > 0\). Then, \(B ( y, r_1) \cap B (x, r) = \EmptySet\). Hence \(y\) is not a closure point of \(B (x, r)\).
Convex Analysis¶
Convex sets¶
We start off with reviewing some basic definitions.
Affine sets¶
Let \(x_1\) and \(x_2\) be two points in \(\RR^N\). Points of the form \(y = \theta x_1 + (1 - \theta) x_2\), where \(\theta \in \RR\),
form a line passing through \(x_1\) and \(x_2\).
- at \(\theta=0\) we have \(y=x_2\).
- at \(\theta=1\) we have \(y=x_1\).
- \(\theta \in [0,1]\) corresponds to the points belonging to the [closed] line segment between \(x_1\) and \(x_2\).
We can also rewrite \(y\) as \(y = x_2 + \theta (x_1 - x_2)\).
In this definition:
- \(x_2\) is called the base point for this line.
- \(x_1 - x_2\) defines the direction of the line.
- \(y\) is the sum of the base point and the direction scaled by the parameter \(\theta\).
- As \(\theta\) increases from \(0\) to \(1\), \(y\) moves from \(x_2\) to \(x_1\).
A set \(C \subseteq \RR^N\) is affine if the line through any two distinct points in \(C\) lies in \(C\).
In other words, for any \(x_1, x_2 \in C\), we have \(\theta x_1 + (1 - \theta) x_2 \in C\) for all \(\theta \in \RR\).
If we denote \(\alpha = \theta\) and \(\beta = (1 - \theta)\) we see that \(\alpha x_1 + \beta x_2\) represents a linear combination of points in \(C\) such that \(\alpha + \beta = 1\).
The idea can be generalized in following way.
It can be shown easily that an affine set \(C\) contains all affine combinations of its points.
Let \(C\) be an affine set and \(x_0\) be any element in \(C\). Then the set \(V = C - x_0 = \{ x - x_0 | x \in C \}\)
is a subspace of \(\RR^N\).
Let \(v_1\) and \(v_2\) be two elements in \(V\). Then by definition, there exist \(x_1\) and \(x_2\) in \(C\) such that \(v_1 = x_1 - x_0\)
and \(v_2 = x_2 - x_0\).
Thus, for any \(a \in \RR\), \(a v_1 + v_2 = a (x_1 - x_0) + (x_2 - x_0) = (a x_1 + x_2 - a x_0) - x_0\).
But since \(a + 1 - a = 1\), hence \(x_3 = (a x_1 + x_2 - a x_0 ) \in C\) (an affine combination).
Hence \(a v_1 + v_2 = x_3 - x_0 \in V\) [by definition of \(V\)].
Thus any linear combination of elements in \(V\) belongs to \(V\). Hence \(V\) is a subspace of \(\RR^N\).
With this, we can use the following notation: \(C = V + x_0 = \{ v + x_0 | v \in V \},\)
i.e. an affine set is a subspace with an offset.
Thus the subspace \(V\) associated with an affine set \(C\) doesn’t depend upon the choice of offset \(x_0\) in \(C\).
We now show that the solution set of linear equations forms an affine set.
Let \(C = \{ x | A x = b\}\) where \(A \in \RR^{M \times N}\) and \(b \in \RR^M\).
\(C\) is the set of all vectors \(x \in \RR^N\) which satisfy the system of linear equations given by \(A x = b\). Then \(C\) is an affine set.
Let \(x_1\) and \(x_2\) belong to \(C\). Then we have
Thus
Thus \(C\) is an affine set.
The subspace associated with \(C\) is nothing but the null space of \(A\) denoted as \(\NullSpace(A)\).
- The empty set \(\EmptySet\) is affine.
- A singleton set containing a single point \(x_0\) is affine. Its corresponding subspace is \(\{0 \}\) of zero dimension.
- The whole euclidean space \(\RR^N\) is affine.
- Any line is affine. The associated subspace is a line parallel to it which passes through origin.
- Any plane is affine. If it passes through the origin, it is a subspace. The associated subspace is the plane parallel to it which passes through the origin.
The set of all affine combinations of points in some arbitrary set \(S \subseteq \RR^N\) is called the affine hull of \(S\) and denoted as \(\AffineHull(S)\):
Essentially the difference vectors \(v_k - v_0\) belong to the associated subspace.
If the associated subspace has dimension \(L\) then a maximum of \(L\) vectors can be linearly independent in it. Hence a maximum of \(L+1\) vectors can be affine independent for the affine set.
Convex sets¶
A set \(C\) is convex if the line segment between any two points in \(C\) lies in \(C\), i.e. for any \(x_1, x_2 \in C\) and any \(\theta\) with \(0 \leq \theta \leq 1\), we have \(\theta x_1 + (1 - \theta) x_2 \in C\).
We call a point of the form \(\theta_1 x_1 + \dots + \theta_k x_k\), where \(\theta_1 + \dots + \theta_k = 1\) and \(\theta_i \geq 0, i=1,\dots,k\), a convex combination of the points \(x_1, \dots, x_k\).
It is like a weighted average of the points \(x_i\).
- A line segment is convex.
- A circle [including its interior] is convex.
- A ray is defined as \(\{ x_0 + \theta v | \theta \geq 0 \}\) where \(v \neq 0\) indicates the direction of ray and \(x_0\) is the base or origin of ray. A ray is convex but not affine.
- Any affine set is convex.
The convex hull of an arbitrary set \(S \subseteq \RR^n\) denoted as \(\ConvexHull(S)\), is the set of all convex combinations of points in \(S\).
We can generalize convex combinations to include infinite sums.
Let \(\theta_1, \theta_2, \dots\) satisfy
and let \(x_1, x_2, \dots \in C\), where \(C \subseteq \RR^N\) is convex. Then
if the series converges.
We can generalize it further to density functions.
Let \(p : \RR^N \to \RR\) satisfy \(p(x) \geq 0\) for all \(x \in C\) and
Then
provided the integral exists.
Note that \(p\) above can be treated as a probability density function if we define \(p(x) = 0 \Forall x \in \RR^N \setminus C\).
Cones¶
A set \(C\) is called a cone if for every \(x \in C\) and \(\theta \geq 0\), we have \(\theta x \in C\). By definition we have \(0 \in C\) (take \(\theta = 0\) for any \(x \in C\)).
A set \(C\) is called a convex cone if it is convex and a cone. In other words, for every \(x_1, x_2 \in C\) and \(\theta_1, \theta_2 \geq 0\), we have
Let \(C\) be a convex cone. Then for every \(x_1, \dots, x_k \in C\), a conic combination \(\theta_1 x_1 + \dots + \theta_k x_k\) with \(\theta_i \geq 0\) belongs to \(C\).
Conversely, if a set \(C\) contains all conic combinations of its points, then it is a convex cone.
The idea of conic combinations can be generalized to infinite sums and integrals.
The conic hull of a set \(S\) is the set of all conic combinations of points in \(S\). i.e.
- A ray with its base at origin is a convex cone.
- A line passing through zero is a convex cone.
- A plane passing through zero is a convex cone.
- Any subspace is a convex cone.
We now look at some more important convex sets one by one.
Hyperplanes and half spaces¶
A hyperplane is a set of the form
where \(a \in \RR^N, a \neq 0\) and \(b \in \RR\).
The vector \(a\) is called the normal vector to the hyperplane.
- Analytically it is a solution set of a nontrivial linear equation. Thus it is an affine set.
- Geometrically it is a set of points with a constant inner product to a given vector \(a\).
Now let \(x_0\) be an arbitrary element in \(H\). Then
Now consider the orthogonal complement of \(a\) defined as
i.e. the set of all vectors that are orthogonal to \(a\).
Now consider the set
Clearly for every \(x \in S\), \(a^T x = a^T x_0 = b\).
Thus we can say that
Thus the hyperplane consists of an offset \(x_0\) plus all vectors orthogonal to the (normal) vector \(a\).
A hyperplane divides \(\RR^N\) into two halfspaces. The two (closed) halfspaces are given by
and
The halfspace \(H_+\) extends in the direction of \(a\) while \(H_-\) extends in the direction of \(-a\).
A halfspace is the solution set of one (nontrivial) linear inequality.
A halfspace is convex but not affine.
The halfspace can be written alternatively as
\[\begin{split}H_+ = \{ x | a^T (x - x_0) \geq 0\}\\ H_- = \{ x | a^T (x - x_0) \leq 0\}\end{split}\]
where \(x_0\) is any point in the associated hyperplane \(H\).
Geometrically, points in \(H_+\) make an acute angle with \(a\) while points in \(H_-\) make an obtuse angle with \(a\).
The sets given by
are called open halfspaces. They are the interior of corresponding closed halfspaces.
Euclidean balls and ellipsoids¶
A Euclidean closed ball (or just ball) in \(\RR^N\) has the form
where \(r > 0\) and \(\| \|_2\) denotes the Euclidean norm.
\(x_c\) is the center of the ball.
\(r\) is the radius of the ball.
An equivalent definition is given by
Let \(x_1, x_2\) be any two points in \(B\). We have
and
Let \(\theta \in [0,1]\) and consider the point \(x = \theta x_1 + (1 - \theta) x_2\). Then
Thus \(x \in B\), hence \(B\) is a convex set.
An ellipsoid is a set of the form
where \(P = P^T \succ 0\) i.e. \(P\) is symmetric and positive definite.
The vector \(x_c \in \RR^N\) is the center of the ellipsoid.
Eigen values of the matrix \(P\) (which are all positive) determine how far the ellipsoid extends in every direction from \(x_c\).
The lengths of semi-axes of \(\xi\) are given by \(\sqrt{\lambda_i}\) where \(\lambda_i\) are the eigen values of \(P\).
An alternative representation of an ellipsoid is given by
where \(A\) is a square and nonsingular matrix.
To show the equivalence of the two definitions, we proceed as follows.
Let \(P = A A^T\). Let \(x\) be any arbitrary element in \(\xi\).
Then \(x - x_c = A u\) for some \(u\) such that \(\| u \|_2 \leq 1\).
Thus
The two representations of an ellipsoid are therefore equivalent.
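This equivalence is easy to check numerically. The sketch below (stock MATLAB; \(P\), \(x_c\) and the random point are arbitrary choices) factors \(P = A A^T\) via a Cholesky factorization and verifies that a point \(x = x_c + A u\) with \(\| u \|_2 \leq 1\) satisfies \((x - x_c)^T P^{-1} (x - x_c) \leq 1\):
% an arbitrary symmetric positive definite matrix and center
P = [4 1; 1 2];
xc = [1; -1];
% factor P = A * A' with A square and nonsingular
A = chol(P, 'lower');
% pick a point u with norm at most 1 and map it into the ellipsoid
u = randn(2, 1); u = u / max(1, norm(u));
x = xc + A * u;
(x - xc)' * (P \ (x - xc))   % does not exceed 1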
Norm balls and norm cones¶
Let \(\| \cdot \| : \RR^N \to \RR\) be any norm on \(\RR^N\). A norm ball with radius \(r\) and center \(x_c\) is given by
Let \(\| \cdot \| : \RR^N \to \RR\) be any norm on \(\RR^N\). The norm cone associated with the norm \(\| \cdot \|\) is given by the set
The second order cone is the norm cone for the Euclidean norm, i.e.
This can be rewritten as
Polyhedra¶
A polyhedron is defined as the solution set of a finite number of linear inequalities.
A polyhedron thus is the intersection of a finite number of halfspaces (\(M\)) and hyperplanes (\(P\)).
- Affine sets ( subspaces, hyperplanes, lines)
- Rays
- Line segments
- Halfspaces
We can combine the set of inequalities and equalities in the form of linear matrix inequalities and equalities as \(\{ x \;|\; A x \preceq b, \; C x = d \}\),
where the rows of \(A\) and the entries of \(b\) collect the \(M\) inequality constraints, the rows of \(C\) and the entries of \(d\) collect the \(P\) equality constraints,
and the symbol \(\preceq\) means vector inequality or component wise inequality in \(\RR^M\) i.e. \(u \preceq v\) means \(u_i \leq v_i\) for \(i = 1, \dots, M\).
Note that \(b \in \RR^M\), \(A \in \RR^{M \times N}\), \(A x \in \RR^M\), \(d \in \RR^P\), \(C \in \RR^{P \times N}\) and \(C x \in \RR^P\).
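As a small illustration (a sketch with stock MATLAB; the matrices \(A\), \(C\), the vectors \(b\), \(d\) and the test point are arbitrary choices), membership of a point in the polyhedron \(\{ x \;|\; A x \preceq b, \; C x = d \}\) is just a component-wise check:
% an arbitrary polyhedron in R^2 with M = 3 inequalities and P = 1 equality
A = [1 0; 0 1; -1 -1];  b = [1; 1; 0];
C = [1 -1];             d = 0;
% a candidate point
x = [0.4; 0.4];
% component-wise inequality A x <= b together with the equality C x = d
is_member = all(A * x <= b) && all(abs(C * x - d) < 1e-12)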
We can generalize \(\RR_+\) as follows. Define \(\RR_+^N = \{ x \in \RR^N \;|\; x_i \geq 0, \; i = 1, \dots, N\}\).
\(\RR_+^N\) is called nonnegative orthant. It is a polyhedron (solution set of \(N\) linear inequalities). It is also a convex cone.
Let \(K+1\) points \(v_0, \dots, v_K \in \RR^N\) be affine independent (see here).
The simplex determined by them is given by
where \(\theta = [\theta_1, \dots, \theta_K]^T\) and \(1\) denotes a vector of appropriate size \((K)\) with all entries one.
In other words, \(C\) is the convex hull of the set \(\{v_0, \dots, v_K\}\).
The positive semidefinite cone¶
We define the set of symmetric \(N\times N\) matrices as
We define the set of symmetric positive semidefinite matrices as
The notation \(X \succeq 0\) means \(v^T X v \geq 0 \Forall v \in \RR^N\).
We define the set of symmetric positive definite matrices as
The notation \(X \succ 0\) means \(v^T X v > 0 \Forall v \in \RR^N\).
Let \(A, B \in S_+^N\) and \(\theta_1, \theta_2 \geq 0\). We have to show that \(\theta_1 A + \theta_2 B \in S_+^N\).
Now
Hence \(\theta_1 A + \theta_2 B \in S_+^N\).
Operations that preserve convexity¶
In the following, we will discuss several operations which transform a convex set into another convex set, and thus preserve convexity.
Understanding these operations is useful for determining the convexity of a wide variety of sets.
Usually it is easier to prove that a set is convex by showing that it is obtained from a convex set by a convexity preserving operation than by directly verifying the convexity property, i.e.
Intersection¶
Let \(x_1, x_2 \in S_1 \cap S_2\). We have to show that
Since \(S_1\) is convex and \(x_1, x_2 \in S_1\), hence
Similarly
Thus
which completes the proof.
We can generalize it further.
Let \(x_1, x_2\) be any two arbitrary elements in \(\cap_{i \in I} A_i\).
Hence \(\cap_{i \in I} A_i\) is convex.
Affine functions¶
A function \(f : \RR^N \to \RR^M\) is affine if it is a sum of a linear function and a constant, i.e.
where \(A \in \RR^{M \times N}\) and \(b \in \RR^M\).
Let \(S \subseteq \RR^N\) be convex and \(f : \RR^N \to \RR^M\) be an affine function. Then the image of \(S\) under \(f\) given by
is a convex set.
It applies in the reverse direction also.
Let \(f : \RR^K \to \RR^N\) be affine and \(S \subseteq \RR^N\) be convex. Then the inverse image of \(S\) under \(f\) given by
is convex.
Let \(S \subseteq \RR^N\) be convex.
For some \(\alpha \in \RR\) , \(\alpha S\) given by
\[\alpha S = \{\alpha x | x \in S\}\]
is convex. This is the scaling operation.
For some \(a \in \RR^N\), \(S + a\) given by
\[S + a = \{x + a | x \in S\}\]
is convex. This is the translation operation.
Let \(N = M + K\) where \(M, K \in \Nat\). Thus let \(\RR^N = \RR^M \times \RR^K\). A vector \(x \in S\) can be written as \(x = (x_1, x_2)\) where \(x_1 \in \RR^M\) and \(x_2 \in \RR^K\). Then
\[T = \{ x_1 \in \RR^M | (x_1, x_2) \in S \text{ for some } x_2 \in \RR^K\}\]
is convex. This is the projection operation.
Let \(S_1\) and \(S_2\) be two arbitrary subsets of \(\RR^N\). Then their sum is defined as
Proper cones and generalized inequalities¶
A cone \(K \subseteq \RR^N\) is called a proper cone if it satisfies the following:
- \(K\) is convex.
- \(K\) is closed.
- \(K\) is solid i.e. it has a nonempty interior.
- \(K\) is pointed i.e. it contains no line. In other words
A proper cone \(K\) can be used to define a generalized inequality, which is a partial ordering on \(\RR^N\).
Let \(K \subseteq \RR^N\) be a proper cone. A partial ordering on \(\RR^N\) associated with the proper cone \(K\) is defined as
We also write \(x \succeq_K y\) if \(y \preceq_K x\). This is also known as a generalized inequality.
A strict partial ordering on \(\RR^N\) associated with the proper cone \(K\) is defined as
where \(\Interior{K}\) is the interior of \(K\). We also write \(x \succ_K y\) if \(y \prec_K x\). This is also known as a strict generalized inequality.
When \(K = \RR_+\), \(\preceq_K\) is the same as the usual \(\leq\) and \(\prec_K\) is the same as the usual \(<\) operator on \(\RR\).
The nonnegative orthant \(K=\RR_+^N\) is a proper cone. Then the associated generalized inequality \(\preceq_{K}\) means that
This is usually known as component-wise inequality and usually denoted as \(x \preceq y\).
The positive semidefinite cone \(S_+^N \subseteq S^N\) is a proper cone in the vector space \(S^N\).
The associated generalized inequality means
i.e. \(Y - X\) is positive semidefinite. This is also usually denoted as \(X \preceq Y\).
Minimum and minimal elements¶
The generalized inequalities (\(\preceq_K, \prec_K\)) w.r.t. the proper cone \(K \subset \RR^N\) define a partial ordering over any arbitrary set \(S \subseteq \RR^N\).
But since they may not enforce a total ordering on \(S\), not every pair of elements \(x, y\in S\) may be related by \(\preceq_K\) or \(\prec_K\).
Let \(K = \RR^2_+ \subset \RR^2\). Let \(x_1 = (2,3), x_2 = (4, 5), x_3=(-3, 5)\). Then we have
- \(x_1 \prec x_2\), \(x_2 \succ x_1\) and \(x_3 \preceq x_2\).
- But neither \(x_1 \preceq x_3\) nor \(x_1 \succeq x_3\) holds.
- In general, for any \(x, y \in \RR^2\), \(x \preceq y\) if and only if \(y\) is to the right of and above \(x\) in the \(\RR^2\) plane.
- If \(y\) is to the right but below or \(y\) is above but to the left of \(x\), then no ordering holds.
- \(x\) must belong to \(S\).
- It is quite possible that there is no minimum element in \(S\).
- If a set \(S\) has a minimum element, then by definition it is unique (Prove it!).
- \(x\) must belong to \(S\).
- It is quite possible that there is no maximum element in \(S\).
- If a set \(S\) has a maximum element, then by definition it is unique.
There are many sets for which no minimum element exists. In this context we can define a slightly weaker concept known as minimal element.
- The minimal or maximal element \(x\) must belong to \(S\).
- It is quite possible that there is no minimal or maximal element in \(S\).
- Minimal or maximal element need not be unique. A set may have many minimal or maximal elements.
A point \(x \in S\) is the minimum element of \(S\) if and only if \(S \subseteq x + K\).
Let \(x \in S\) be the minimum element. Then by definition \(x \preceq_K y \Forall y \in S\). Thus
Note that \(k \in K\) would be distinct for each \(y \in S\).
Now let us prove the converse.
Let \(S \subseteq x + K\) where \(x \in S\). Thus
Thus \(x\) is the minimum element of \(S\) since there can be only one minimum element of S.
\(x + K\) denotes all the points that are comparable to \(x\) and greater than or equal to \(x\) according to \(\preceq_K\).
A point \(x \in S\) is a minimal point if and only if \(\{x - K\} \cap S = \{ x \}\).
Let \(x \in S\) be a minimal element of \(S\). Thus there is no element \(y \in S\) distinct from \(x\) such that \(y \preceq_K x\).
Consider the set \(R = x - K = \{x - k | k \in K \}\).
Thus \(x - K\) consists of all points \(r \in \RR^N\) which satisfy \(r \preceq_K x\). But there is only one such point in \(S\) namely \(x\) which satisfies this. Hence
Now let us assume that \(\{ x - K \} \cap S = \{ x \}\). Thus the only point \(y \in S\) which satisfies \(y \preceq_K x\) is \(x\) itself. Hence \(x\) is a minimal element of \(S\).
\(x - K\) represents the set of points that are comparable to \(x\) and are less than or equal to \(x\) according to \(\preceq_K\).
Probability and Random Variables¶
Random Variables¶
The step function and sign function relation:
Discrete step function and Kronecker delta function:
For different random variables, we will characterize their distributions by several parameters. These are listed below
- Probability density function (PDF)
- Cumulative distribution function (CDF)
- Probability mass function (PMF)
- Mean (\(\mu\) or \(\EE(X)\))
- Variance (\(\sigma^2\) or \(\Var(X)\))
- Skew
- Kurtosis
- Characteristic function (CF)
- Moment generating function (MGF)
- Second characteristic function
- Cumulant generating function (CGF)
Cumulative distribution function¶
The CDF is defined as
Properties of CDF:
CDF is a monotonically non-decreasing function.
\(F_X(-\infty)\) is defined as
Similarly:
\(F_X(x)\) is right continuous.
Probability density function¶
Properties of PDF
The CDF and PDF are related as
Expectation¶
Expectation of a discrete random variable:
Expectation of a continuous random variable:
Expectation of a function of a random variable:
Mean square value:
Variance:
\(n\)-th moment:
Characteristic function¶
The characteristic function is defined as
PDF as Fourier transform of CF.
Let \(Y_1, \dots, Y_k\) be independent. Then
Moment generating function¶
The moment generating function is defined as
Second characteristic function¶
Cumulant generating function¶
Gaussian distribution¶
Standard normal distribution¶
This distribution has a mean of 0 and a variance of 1. It is denoted by
The PDF is given by
The CDF is given by
Symmetry
Some specific values
The Q-function is given as
We have
Alternatively
Further
This is due to the symmetry of normal distribution. Alternatively
Probability of \(X\) falling in a range \([a,b]\)
The characteristic function is
Mean:
Mean square value
Variance:
Standard deviation
An upper bound on Q-function
The moment generating function is
Error function and its properties¶
The error function is defined as
The complementary error function is defined as
Error function is an odd function.
Some specific values of error function.
The relationship with normal CDF.
Relationship with the Q function: \(Q(x) = \frac{1}{2} \text{erfc}\left( \frac{x}{\sqrt{2}} \right)\).
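A minimal numerical check of this relationship (using erfc from stock MATLAB and normcdf from the Statistics toolbox):
% verify Q(x) = 1 - Phi(x) = 0.5 * erfc(x / sqrt(2)) at a few points
x = [0 0.5 1 2 3];
q_from_cdf  = 1 - normcdf(x);
q_from_erfc = 0.5 * erfc(x / sqrt(2));
max(abs(q_from_cdf - q_from_erfc))   % numerically zero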
We also have some useful results:
General normal distribution¶
The general Gaussian (or normal) random variable is denoted as
Its PDF is
A simple transformation
converts it into standard normal random variable.
The mean:
The mean square value:
The variance:
The CDF:
Notice the transformation from \(x\) to \((x - \mu) / \sigma\).
The characteristic function:
Naturally putting \(\mu = 0\) and \(\sigma = 1\), it reduces to the CF of the standard normal r.v.
The MGF:
The skewness is zero and the excess kurtosis is zero.
One sided Gaussian distribution¶
Truncated normal distribution¶
Basic inequalities¶
Many results in probability theory are derived by applying a handful of basic inequalities. This section collects some of them.
A good reference is the Wikipedia list of inequalities; in particular, see the section on probability inequalities.
Markov’s inequality¶
http://en.wikipedia.org/wiki/Markov
Let \(X\) be a non-negative random variable and \(a > 0\). Then
Chebyshev’s inequality¶
http://en.wikipedia.org/wiki/Chebyshev
Let \(X\) be a random variable with finite mean \(\mu\) and finite non-zero variance \(\sigma^2\). Then for any real number \(k > 0\), the following holds
Choosing \(k = \sqrt{2}\), we see that at least half of the values lie in the interval \((\mu - \sqrt{2} \sigma, \mu + \sqrt{2} \sigma)\).
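As a quick empirical illustration (a minimal sketch with stock MATLAB; the exponential distribution, which has unit mean and unit variance, is an arbitrary choice), we can compare the empirical deviation probability with the Chebyshev bound \(1/k^2\):
% draw samples from an exponential distribution with mu = 1 and sigma = 1
n = 1e6;
X = -log(rand(n, 1));        % inverse-CDF sampling of Exp(1)
mu = 1; sigma = 1; k = 2;
% empirical probability of deviating from the mean by at least k * sigma
p_emp = mean(abs(X - mu) >= k * sigma);
p_bound = 1 / k^2;           % Chebyshev bound
[p_emp p_bound]              % p_emp stays below p_bound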
Boole’s inequality¶
http://en.wikipedia.org/wiki/Boole
This is also known as the union bound.
For a countable set of events \(A_1, A_2, \dots\), we have
We first prove it for a finite collection of events using induction. For \(n=1\), obviously
Assume the inequality is true for the set of \(n\) events. i.e.
Since
hence
Since
hence
Fano’s inequality¶
Cramér–Rao inequality¶
Hoeffding’s inequality¶
http://en.wikipedia.org/wiki/Hoeffding
This inequality provides an upper bound on the probability that the sum of random variables deviates from its expected value.
We start with a version of the inequality for i.i.d Bernoulli random variables.
Let \(X_1, \dots, X_n\) be i.i.d. Bernoulli random variables with probability of success as \(p\). \(\EE \left [\sum_i X_i \right] = p n\). The probability of the sum deviating from the mean by \(\epsilon n\) for some \(\epsilon > 0\) is bounded by
and
The two inequalities can be summarized as
The inequality states that the number of successes that we see is concentrated around its mean with exponentially small tail.
We now state the inequality for the general case for any (almost surely) bounded random variable.
Let \(X_1, \dots, X_n\) be independent r.v.s. Assume that \(X_i\) are almost surely bounded; i.e.:
Define the empirical mean of the variables as
Then the probability that \(\overline{X}\) deviates from its mean \(\EE(\overline{X})\) by an amount \(t > 0\) is bounded by following inequalities:
and
Together, we have
Note that we don’t require \(X_i\) to be identically distributed in this formulation. In the special case where the \(X_i\) are i.i.d. uniform r.v.s over \([0, 1]\), we have \(\EE(\overline{X}) = \EE(X_i) = \frac{1}{2}\) and
Clearly, \(\overline{X}\) starts concentrating around its mean as \(n\) increases and the tail falls exponentially.
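A minimal simulation of this special case (stock MATLAB; the values of n, t and the number of trials are arbitrary choices) compares the empirical deviation probability with the two-sided Hoeffding bound \(2 \exp(-2 n t^2)\) for \([0,1]\)-valued variables:
% empirical check of the Hoeffding bound for i.i.d. uniform [0, 1] variables
n = 100; t = 0.1; trials = 1e5;
Xbar = mean(rand(n, trials), 1);       % one empirical mean per trial
p_emp = mean(abs(Xbar - 0.5) >= t);    % empirical deviation probability
p_bound = 2 * exp(-2 * n * t^2);       % two-sided Hoeffding bound
[p_emp p_bound]                        % p_emp stays (far) below p_bound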
The proof of this result depends on what is known as Hoeffding’s Lemma.
Let \(X\) be a zero mean r.v. with \(\PP (X \in [a, b]) = 1\). Then
Jensen’s inequality¶
http://en.wikipedia.org/wiki/Jensen
Jensen’s inequality relates the value of a convex function of an integral to the integral of the convex function. In the context of probability theory, the inequality takes the following form.
Let \(f : \RR \to \RR\) be a convex function. Then
The equality holds if and only if either \(X\) is a constant r.v. or \(f\) is linear.
Bernstein inequalities¶
Chernoff’s inequality¶
http://en.wikipedia.org/wiki/Chernoff
This is also known as the Chernoff bound.
Fréchet inequalities¶
Two variables¶
Let \(X\) and \(Y\) be two random variables and let \(F_{X, Y}(x, y)\) be their joint CDF.
Right continuity:
The joint probability density function is given by \(f_{X, Y} (x, y)\). It satisfies \(f_{X, Y} (x, y) \geq 0\) and
The joint CDF and joint PDF are related by
Further
The marginal probability is
We define the marginal density functions as
and
We can now write
Similarly
Conditional density¶
We define
We have
In other words
In general we write
Or even more loosely as
More identities
Independent variables¶
If \(X\) and \(Y\) are independent then
Similarly
The CDF also is separable
Expectation¶
This section contains several results on expectation operator.
Any function \(g(x)\) defines a new random variable \(g(X)\). If \(g(X)\) has a finite expectation, then
If several random variables \(X_1, \dots, X_n\) are defined on the same sample space, then their sum \(X_1 + \dots + X_n\) is a new random variable. If all of them have finite expectations, then the expectation of their sum exists and is given by
If \(X\) and \(Y\) are mutually independent random variables with finite expectations, then their product is a random variable with finite expectation and
By induction, if \(X_1, \dots, X_n\) are mutually independent random variables with finite expectations, then
Let \(X\) and \(Y\) be two random variables with the joint density function \(f_{X, Y} (x, y)\). Let the conditional density function of \(Y\) given \(X\) be \(f(y | x)\). Then the conditional expectation is defined as follows:
\(\EE [Y | X ]\) is a new random variable.
In short, we have
The covariance of \(X\) and \(Y\) is defined as
It is easy to see that
The correlation coefficient is defined as
Independent variables¶
If \(X\) and \(Y\) are independent, then
If \(X\) and \(Y\) are independent, then \(\Cov (X, Y) = 0\).
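A small empirical illustration of these facts (a sketch using the stock MATLAB functions cov and corrcoef; the simulated variables are arbitrary choices):
% covariance and correlation for dependent and independent pairs
n = 1e5;
X = randn(n, 1);
Y = 0.5 * X + randn(n, 1);     % dependent on X
Z = randn(n, 1);               % independent of X
C_xy = cov(X, Y);  R_xy = corrcoef(X, Y);
C_xz = cov(X, Z);
[C_xy(1, 2)  R_xy(1, 2)]       % nonzero covariance and correlation
C_xz(1, 2)                     % close to zero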
Complex random variable¶
For a complex random variable \(Z = X + j Y\), its PDF is the joint PDF of the r.v.s \(X\) and \(Y\).
The integral over the complex space is defined as
Random vectors¶
We will continue to use the notation of capital letters to denote a random vector. We will specify the space over which the random vector is generated to clarify the dimensionality.
A real random vector \(X\) takes values in the vector space \(\RR^n\). A complex random vector \(Z\) takes values in the vector space \(\CC^n\). We write
The expected value or mean of a random vector is \(\EE(X)\).
Covariance-matrix of a random vector:
We will use the symbols \(\mu\) and \(\Sigma\) for the mean vector and covariance matrix of a random vector \(X\). Clearly
Cross-covariance matrix of two random vectors:
Note that
The characteristic function is defined as
The MGF is defined as
The components \(X_1, \dots, X_n\) of a random vector \(X\) are independent if and only if
Gaussian random vector¶
A random vector \(X = [X_1, \dots, X_n]^T\) is called a Gaussian random vector if the linear combination \(t^T X = t_1 X_1 + \dots + t_n X_n\)
follows a normal distribution for all \(t = [t_1, \dots, t_n ]^T \in \RR^n\). The components \(X_1, \dots, X_n\) are called jointly Gaussian. It is denoted by \(X \sim \NNN_n (\mu, \Sigma)\) where \(\mu\) is its mean vector and \(\Sigma\) is its covariance matrix.
Let \(X \sim \NNN_n (\mu, \Sigma)\) be a Gaussian random vector. The subscript \(n\) denotes that it takes values over the space \(\RR^n\). We assume that \(\Sigma\) is invertible. Its PDF is given by
Moments:
Let \(Y = A X + b\) where \(A \in \RR^{n \times n}\) is an invertible matrix and \(b \in \RR^n\). Then
\(Y\) is also a Gaussian random vector with the mean vector being \(A \mu + b\) and the covariance matrix being \(A \Sigma A^T\). This essentially is a change in basis in \(\RR^n\).
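A quick simulation illustrates this (a sketch using mvnrnd from the Statistics toolbox; the values of \(\mu\), \(\Sigma\), \(A\) and \(b\) are arbitrary choices):
% sample a Gaussian random vector and apply the affine map Y = A X + b
mu = [1; 2];
Sigma = [2 0.5; 0.5 1];
A = [1 1; 0 2];                          % invertible
b = [0; 1];
X = mvnrnd(mu', Sigma, 1e5)';            % 2 x N samples
Y = A * X + repmat(b, 1, size(X, 2));
% empirical mean and covariance of Y versus A*mu + b and A*Sigma*A'
mean(Y, 2),  A * mu + b
cov(Y'),     A * Sigma * A'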
The CF is given by
Whitening¶
Usually we are interested in making the components of \(X\) uncorrelated. This process is known as whitening. We are looking for a linear transformation \(Y = A X + b\) such that the components of \(Y\) are uncorrelated. i.e. we start with
and transform \(Y = A X + b\) such that
where \(I_n\) is the \(n\)-dimensional identity matrix.
Whitening by eigen value decomposition¶
Let \(\Sigma = E \Lambda E^T\)
be the eigen value decomposition of \(\Sigma\) with \(\Lambda\) being a diagonal matrix and \(E\) being an orthonormal basis.
Let
Choose \(B = E \Lambda^{\frac{1}{2}}\) and \(A = B^{-1} = \Lambda^{-\frac{1}{2}} E^T\). Then
Thus the random vector \(Y = B^{-1} (X - \mu)\) is a whitened vector of uncorrelated components.
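A minimal sketch of this procedure (using mvnrnd from the Statistics toolbox; \(\mu\), \(\Sigma\) and the sample size are arbitrary choices) checks that the whitened samples have approximately identity covariance:
% whitening by eigen value decomposition
mu = [1; -1];
Sigma = [3 1; 1 2];
X = mvnrnd(mu', Sigma, 1e5)';              % 2 x N samples
[E, Lambda] = eig(Sigma);                  % Sigma = E * Lambda * E'
A = diag(1 ./ sqrt(diag(Lambda))) * E';    % A = Lambda^(-1/2) * E'
Y = A * (X - repmat(mu, 1, size(X, 2)));
cov(Y')                                    % close to the identity matrix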
Causal whitening¶
We want the transformation to be causal, i.e. \(A\) should be a lower triangular matrix. We start with the factorization \(\Sigma = L D L^T\) where \(L\) is unit lower triangular and \(D\) is diagonal.
Choose \(B = L D^{\frac{1}{2}}\) and \(A = B^{-1} = D^{-\frac{1}{2}} L^{-1}\). Clearly, \(A\) is lower triangular.
The transformation is \(Y = B^{-1} (X - \mu)\).
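A corresponding sketch using MATLAB's ldl factorization (the same arbitrary \(\mu\) and \(\Sigma\) as in the previous sketch, redefined here so the snippet is self-contained; for this positive definite \(\Sigma\) we assume ldl performs no pivoting, so L is unit lower triangular):
% causal whitening via the factorization Sigma = L * D * L'
mu = [1; -1];
Sigma = [3 1; 1 2];
X = mvnrnd(mu', Sigma, 1e5)';      % 2 x N samples
[L, D] = ldl(Sigma);               % L unit lower triangular, D diagonal (assuming no pivoting)
A = diag(1 ./ sqrt(diag(D))) / L;  % A = D^(-1/2) * L^(-1) is lower triangular
Y = A * (X - repmat(mu, 1, size(X, 2)));
cov(Y')                            % again close to the identity matrix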
Geometry¶
Algebraic Geometry Review¶
This section covers essential notions and facts from algebraic geometry needed for the discussion here. For a systematic introduction to the subject, see [Har77][Har13][GH14]. Algebraic geometry is the study of geometries that come from algebra. The geometrical objects being studied are the solution sets of systems of multivariate polynomial equations. A data set being studied can be thought of as a collection of sample points from a geometrical object (e.g. a union of subspaces). The objective is to infer the said geometrical object from the given data set and decompose the object into simpler objects which help in a better understanding of the data set.
Polynomial Rings¶
Let \(\FF^m\) be an \(m\)-dimensional vector space where \(\FF\) is either \(\RR\) or \(\CC\) (a field of characteristic 0). For \(x = [x_1, \dots, x_m]^T \in \FF^m\), let \(\FF[x] = \FF[x_1, \dots, x_m]\) be the set of all polynomials in the \(m\) variables \(x_1, \dots, x_m\). \(\FF[x]\) is a commutative ring [Art91]. A monomial is a product of the variables; its degree is the number of variables in the product (counted with multiplicity). A monomial of degree \(n\) is of the form \(x^n = x_1^{n_1}\dots x_m^{n_m}\) with \(0 \leq n_j \leq n\) and \(n_1 + \dots + n_m = n\). There are a total of \(A_n(m) = \binom{m + n -1}{n} = \binom{m + n -1}{m - 1}\) different degree-n monomials.
We now construct an embedding of vectors in \(\FF^m\) to \(\FF^{A_n(m)}\). The Veronese map of degree \(n\), denoted as \(v_n : \FF^m \to \FF^{A_n(m)}\), is defined as
where \(x^n\) are degree-n monomials chosen in the degree lexicographic order. For example, the Veronese map of degree 2 from \(\RR^3\) to \(\RR^6\) is defined as \(v_2(x) = [x_1^2, \; x_1 x_2, \; x_1 x_3, \; x_2^2, \; x_2 x_3, \; x_3^2]^T\).
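A minimal sketch of this map in MATLAB (the inline helper below is only for illustration; it is not a function of the library):
% degree-2 Veronese map from R^3 to R^6, monomials in degree lexicographic order
veronese2 = @(x) [x(1)^2; x(1)*x(2); x(1)*x(3); x(2)^2; x(2)*x(3); x(3)^2];
x = [1; 2; 3];
veronese2(x)'
% ans = 1 2 3 4 6 9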
A term is a scalar multiplying a monomial. A polynomial \(p(x)\) is said to be homogeneous if all its terms have the same degree. Homogeneous polynomials are also known as forms. A linear form is a homogeneous polynomial of degree 1. A quadratic form is a homogeneous polynomial of degree 2. A degree-n form \(p(x)\) can be written as
where \(c_{n_1, \dots, n_m} \in \FF\) are the coefficients associated with the monomials \(x_1^{n_1}\dots x_m^{n_m}\).
A projective space corresponding to a vector space \(V\) is the set of lines passing through its origin (the one dimensional subspaces). Each such line can be represented by any non-zero point on the line.
For a degree-n form \(p(x)\) and a scalar \(b \in \FF\), we have:
Therefore, if \(p(x) = 0\), then \(p(\alpha x) = 0 \Forall \alpha \in \FF\) and the zero-set of \(p(x)\) includes the one dimensional subspace containing \(x\) (the line passing through \(x\) and \(0\)). Our interest is in the zero sets of homogeneous polynomials. Thus, it is useful to view \(\FF^m\) as a projective space. For a form \(p(x)\), \(p(0)\) is always \(0\). If \(p(a) = 0\) for some \(a \neq 0\), then \(p(x) = 0 \Forall x = b a, b \in \FF\).
The ring \(\FF[x]\) can be viewed as a graded ring[Lan02] and decomposed as
where \(\FF_i\) consists of all homogeneous polynomials of degree \(i\). \(\FF_0 = \FF\) is the set of scalars (polynomials of degree 0). \(\FF_1\) is the set of all 1-forms:
Note that the polynomial \(0 = 0^T x\) is included in every \(\FF_i\). This enables us to treat \(\FF_i\) as a vector space of \(i\)-forms. \(\FF_1\) can also be viewed as the dual space of linear functionals for the vector space \(\FF^m\). We will also need the following sets in the sequel:
An ideal in the ring \(\FF[x]\) is an additive subgroup \(I\) such that if \(p(x) \in I\) and \(q(x) \in \FF[x]\), then \(p(x) q(x) \in I\). \(\FF[x]\) is a trivial ideal. \(I\) is called a proper ideal if \(I \neq \FF[x]\). A proper ideal \(I\) is called maximal if no other proper ideal of \(\FF[x]\) contains \(I\). An ideal \(I\) is called a subideal of an ideal \(J\) if \(I \subset J\).
If \(I\) and \(J\) are two ideals in \(\FF[x]\), then \(I \cap J\) is also an ideal. An ideal \(I\) is said to be generated by a subset \(\GGG \subset I\), if every \(p(x) \in I\) can be written as
It is denoted by \((\GGG)\). If \(\GGG\) is finite, say \(\GGG = \{ g_1, \dots, g_k\}\), then the generated ideal is also denoted by \((g_1, \dots, g_k)\). An ideal generated by a single element \(p(x)\) is called a principal ideal, denoted by \((p(x))\).
Given two ideals \(I\) and \(J\), the ideal that is generated by product of elements in \(I\) and \(J\) : \(\{ f(x)g(x) : f(x) \in I, g(x) \in J \}\) is called the product ideal \(IJ\).
A prime ideal is similar to prime numbers in the ring of integers. A proper ideal \(I\) is called prime if \(p(x) q(x) \in I\) implies that \(p(x) \in I\) or \(q(x) \in I\). A polynomial \(p(x)\) is said to be prime or irreducible if it generates a prime ideal. A homogeneous ideal of \(\FF[x]\) is an ideal generated by homogeneous polynomials.
Algebraic Sets¶
Given a set of homogeneous polynomials \(J \subset \FF[x]\), a corresponding projective algebraic set \(Z(J) \subset \FF^m\) is defined as
In other words, \(Z(J)\) is the zero set of polynomials in \(J\) (intersection of zero sets of each polynomial in \(J\)). Let \(I\) and \(K\) be sets of homogeneous polynomials and \(X = Z(I)\) and \(Y = Z(K)\) such that \(Y \subset X\). Then \(Y\) is called an algebraic subset of \(X\). A nonempty algebraic set is called irreducible if it is not the union of two nonempty smaller algebraic sets. An irreducible algebraic set is also known as algebraic variety. Any subspace of \(\FF^m\) is an algebraic variety.
Given any subset \(X \subseteq \FF^m\), we define the vanishing ideal of \(X\) as the set of all polynomials that vanish on \(X\).
It is easy to see that if \(f(x) \in I(X)\) then \(f(x) g(x) \in I(X)\) for all \(g(x) \in \FF[x]\). Thus, \(I(X)\) is indeed an ideal.
Let \(J \subset \FF[x]\) be a set of homogeneous polynomials. \(Z(J)\) is the zero set of \(J\) (an algebraic set). \(I(Z(J))\) is the vanishing ideal of the zero set of \(J\). It can be shown that \(I(Z(J))\) is an ideal that contains \(J\).
Similarly, let \(X \subset \FF^m\) be an arbitrary set of vectors in \(\FF^m\). \(I(X)\) is the vanishing ideal of \(X\) and \(Z(I(X))\) is the zero set of the vanishing ideal of \(X\). Then, \(Z(I(X))\) is an algebraic set that contains \(X\).
It turns out that irreducible algebraic sets and prime ideals are connected. In fact, If \(X\) is an algebraic set and \(I(X)\) is the vanishing ideal of \(X\), then \(X\) is irreducible if and only if \(I(X)\) is a prime ideal.
The natural progression is to look for a one-to-one correspondence between ideals and algebraic sets. The concept of a radical ideal is useful in this context. Given a (homogeneous) ideal \(I\) of \(\FF[x]\), the (homogeneous) radical ideal of \(I\) is defined to be
Clearly, \(\text{rad}(I)\) is an ideal in itself and \(I \subset \text{rad}(I)\). \(\text{rad}(I)\) is a fixed-point in the sense that \(\text{rad}(\text{rad}(I)) = \text{rad}(I)\). Also, if \(I\) is homogeneous, then so is \(\text{rad}(I)\). A theorem by Hilbert suggests the following: If \(\FF\) is an algebraically closed field (e.g. \(\FF = \CC\)) and \(I \subset \FF[x]\) is a (homogeneous) ideal, then
Thus, the mappings \(I \to Z(I)\) and \(X \to I(X)\) induce a one-to-one correspondence between the collection of (projective) algebraic sets of \(\FF^m\) and (homogeneous) radical ideals of \(\FF[x]\). This result is known as Nullstellensatz.
Algebraic Sampling Theory¶
We will now explore the problem of identifying a (projective) algebraic set \(Z \subseteq \FF^m\) from a finite number of sample points in \(Z\). In general, the algebraic set \(Z\) may not be irreducible and the ideal \(I(Z)\) may not be prime. Let \(\{z_1, \dots, z_S\} \subset Z\) be the finite (but sufficiently large) set of sample points from \(Z\) for the following discussion. For an arbitrary point \(z \in Z\), we abuse notation and use \(z\) to mean the corresponding projective point (i.e. the line passing through \(0\) and \(z\)). Let \(\mathfrak{m} = I(z)\) be the vanishing ideal of (the line) \(z\). Then, \(\mathfrak{m}\) is a submaximal ideal (i.e. it cannot be a subideal of any other homogeneous ideal of \(\FF[x]\)). Let \(\mathfrak{m}_i\) be the vanishing ideal of \(z_i\). Then the vanishing ideal for the set of points is
This is a radical ideal and is in general much larger than \(I(Z)\). In order to ensure that we can infer \(I(Z)\) correctly from the set of samples \(\{ z_i \}\), we need some additional constraints. We require that \(I(Z)\) is generated by a set of (homogeneous) polynomials whose degrees are bounded by a relatively small \(n\).
Then, the zero set of \(I\) is given by
In general, \(I(Z)\) is always a proper subideal of \(I_S\) regardless of how large \(S\) is. We introduce an algebraic sampling theorem which comes to our rescue. It suggests that if \(I(Z)\) is generated by polynomials in \(\FF_{\leq n}\), then there is a finite sequence of points \(Z_S = \{z_1, \dots, z_S \}\) such that the subspace \(I(Z_S) \cap \FF_{\leq n}\) generates \(I(Z)\). While the theorem doesn’t suggest a bound on \(S\), it turns out that with probability one, the vanishing ideal of an algebraic set can be correctly determined from a randomly chosen sequence of samples. This theorem is analogous to the classical Nyquist-Shannon sampling theorem.
So far we have looked at modeling a data set as an algebraic set and obtaining its vanishing ideal. The next step is to extract the internal geometric or algebraic structure of the algebraic set. The idea is to find simpler (possibly irreducible) algebraic sets which can be composed to form the given algebraic set. For example, if an algebraic set is a union of subspaces, then we would like to find out the component subspaces. In other words, given an algebraic set \(X\) or its vanishing ideal \(I(X)\), the objective is to decompose it into a union of subsets each of which cannot be decomposed further.
An algebraic set can have only finitely many irreducible components. That is, there exists a finite \(n\) such that
where \(X_i\) are irreducible algebraic varieties. The vanishing ideal \(I(X_i)\) must be a prime ideal that is minimal over the radical ideal \(I(X)\) (i.e. there is no prime subideal of \(I(X_i)\) that includes \(I(X)\)). The ideal \(I(X)\) is given by
This is known as the minimal primary decomposition of the radical ideal \(I(X)\).
Given a (projective) algebraic set \(Z\) and its vanishing ideal \(I(Z)\), we can grade the ideal by degree as:
The Hilbert function of \(Z\) is defined to be
\(h_I(i)\) denotes the number of linearly independent polynomials of degree \(i\) that vanish on \(Z\). Hilbert series of an ideal \(I\) is defined as the power series:
Subspace Arrangements¶
We are interested in a special class of algebraic sets known as subspace arrangements in \(\RR^M\). A subspace arrangement is a finite collection of linear or affine subspaces in \(\RR^M\), \(\UUU = \{ \UUU_1, \dots, \UUU_K \}\). The set \(Z_{\UUU} = \UUU_1 \cup \dots \cup \UUU_K\) is the union of the subspaces. It is an algebraic set. We will explore the algebraic properties of \(Z_{\UUU}\) in the following. We say a subspace arrangement is central if every subspace passes through the origin. In the sequel, we will focus on central subspace arrangements only.
A \(D\)-dimensional subspace \(V\) can be defined by \(D' = M - D\) linearly independent linear forms \(\{b_1, b_2, \dots, b_{D'} \}\):
Let \(V^*\) denote the vector space of all linear forms that vanish on \(V\). Then \(\dim(V^*) = D' = M - D\). \(V\) is the zero set of \(V^*\) (i.e. \(V = Z(V^*))\). The vanishing ideal of \(V\) is
\(I(V)\) is an ideal generated by linear forms in \(V^*\). It contains polynomials of all degrees that vanish on \(V\). Every polynomial \(p(x) \in I(V)\) can be written as
where \(h_i \in \RR[x]\). \(I(V)\) is a prime ideal.
The vanishing ideal of the subspace arrangement \(Z_{\UUU} = \UUU_1 \cup \dots \cup \UUU_K\) is
The ideal can be graded by degree of the polynomial as:
Each \(I_i(Z_{\UUU})\) is a vector space that contains forms of degree \(i\) in \(I(Z_{\UUU})\) and \(m\geq 1\) is the least degree of the polynomials in \(I(Z_{\UUU})\). The sequence of dimensions of \(I_i(Z_{\UUU})\) is the Hilbert function \(h_I(i)\) of \(Z_{\UUU}\).
Based on a result on the regularity of subspace arrangements [Der07], the subspace arrangement \(Z_{\UUU}\) is uniquely determined as the zero set of all polynomials of degree up to \(K\) in its vanishing ideal. i.e.
Thus, we don’t really need to determine polynomials of higher degree.
We need to characterize \(I(Z_{\UUU})\) further. Recall that \(\UUU_k\) is a (linear) subspace and \(\UUU_k^*\) is the vector space of linear forms which vanish on \(\UUU_k\). We can construct a product of linear forms by choosing one linear form from each \(\UUU_k^*\). Let \(J(Z_{\UUU})\) be the ideal generated by the products of linear forms
Equivalently, we can say that :
is the product ideal of the vanishing ideals of each of the subspaces. Evidently, \(J(Z_{\UUU})\) is a subideal in \(I(Z_{\UUU})\). In fact, the two ideals share the same zero set:
Now, \(I(Z_{\UUU})\) is the largest ideal which vanishes on \(Z_{\UUU}\). In fact, \(I(Z_{\UUU})\) is the radical ideal of \(J(Z_{\UUU})\). Now, just like we graded \(I(Z_{\UUU})\), we can also grade \(J(Z_{\UUU})\) as:
Note that the lowest degree of the polynomials is always \(K\), the number of subspaces in \(\UUU\). The Hilbert function of \(J\) is denoted as \(h_J(i) = \text{dim} (J_i(Z_{\UUU}))\). It turns out that the Hilbert functions of the vanishing ideal \(I\) and the product ideal \(J\) have interesting and useful relationships.
Subspace Embeddings¶
Let \(Z_{\UUU'} = \UUU'_1 \cup \dots \cup \UUU'_{K'}\) be another (central) subspace arrangement such that \(Z_{\UUU} \subseteq Z_{\UUU'}\). Then it is necessary that for each \(\UUU_k\), there exists \(\UUU'_{k'}\) such that \(\UUU_k \subseteq \UUU'_{k'}\). We call \((Z_{\UUU} \subseteq Z_{\UUU'})\) a subspace embedding. If \(Z_{\UUU'}\) happens to be a hyperplane arrangement, we call the embedding a hyperplane embedding. Let us consider how to create a hyperplane embedding for a given subspace arrangement.
In general, the zero set of each homogeneous component of \(I(Z_{\UUU})\) (i.e. \(I_i(Z_{\UUU})\)), need not be a subspace embedding of \(Z_{\UUU}\). In fact, it may not even be a subspace arrangement. However, the derivatives of the polynomials in \(I(Z_{\UUU})\) come to our rescue. We denote the derivative of \(p(x)\) w.r.t. \(x \in \RR^M\) by \(D p(x)\). Consider a polynomial \(p(x) \in I(Z_{\UUU})\). Pick a point \(x_k\) from each subspace \(\UUU_k\) (\(x_k \in \UUU_k\)). Compute the derivative of \(p(x)\) and evaluate it at \(x_k\) as \(D p(x_k)\). Now, construct the hyperplane \(H_k = \{ x : D p(x_k)^T x = 0 \}\). Recall that the derivative of a smooth function \(f(x)\) is orthogonal to (the tangent space of) its level set \(f(x) = c\). Thus, \(H_k\) contains \(\UUU_k\). It turns out that if the \(K\) points \(\{ x_1, \dots, x_K \}\) (from each subspace) are in general position, then the union of hyperplanes \(\cup_{k=1}^K H_k\) is a hyperplane embedding of the subspace arrangement \(Z(\UUU)\).
For each polynomial in \(I(Z(\UUU))\), we can construct a hyperplane embedding of the subspace arrangement \(Z(\UUU)\). The intersection of hyperplane embeddings constructed from a collection of polynomials in \(I(Z(\UUU))\) is a subspace embedding of \(Z(\UUU)\). When this collection of polynomials contains all the generators of \(I(Z(\UUU))\), the subspace embedding becomes tight. In fact, the resulting subspace arrangement coincides with the original one.
An ideal is said to be pl-generated if it is generated by products of linear forms. The \(J(Z_{\UUU})\) defined above is a pl-generated ideal. If the vanishing ideal \(I(Z_{\UUU})\) of a subspace arrangement is pl-generated, then the zero set of every generator gives a hyperplane embedding of \(Z_{\UUU}\).
If \(Z_{\UUU}\) is a hyperplane arrangement, then \(I(Z_{\UUU})\) is always pl-generated as it is generated by a single polynomial of the form \(p(x) = (b_1^T x) \dots (b_K^T x)\) where \(b_k \in \RR^M\) are the normal vectors to the \(K\) hyperplanes in the arrangement. In fact, it is also a principal ideal.
The vanishing ideal of a single subspace is always pl-generated. The vanishing ideal of an arrangement of two subspaces is also pl-generated but this is not true in general. But something can be said if the \(K\) subspaces in the arrangement are in general position.
Hilbert Functions of Subspace Arrangements¶
If a subspace arrangement \(\UUU\) is in general position, then the values of the Hilbert function \(h_I(i)\) of its vanishing ideal \(I(Z_{\UUU})\) depend solely on the dimensions of the subspaces \(D_1, \dots, D_K\); they are invariant under a continuous change of the position of the subspaces. When identifying a subspace arrangement from a set of samples, the first-level parameters to be identified are the number of subspaces and the dimensions of the subspaces.
Digital Signal Processing¶
Run Length Encoding¶
Run length encoding is a common operation in compression applications. In this article, we discuss how to do this efficiently in MATLAB using vectorization techniques.
Let’s consider a simple sequence of integers:
x = [0 0 0 0 0 0 0 4 4 4 3 3 2 2 2 2 2 2 2 1 1 0 0 0 0 0 2 3 9 5 5 5 5 5 5]
The sequence has 35 elements.
First step is change detection:
>> diff_positions = find(diff(x) ~= 0)
diff_positions =
7 10 12 19 21 26 27 28 29
Note that each of these positions is the index of the last sample of a run; the first change occurs between x(7) and x(8).
We can use this to compute the runs of each symbol:
>> runs = diff([0 diff_positions numel(x)])
runs =
7 3 2 7 2 5 1 1 1 6
The start position for the first symbol of each run can also be easily obtained:
>> start_positions = [1 (diff_positions + 1)]
start_positions =
1 8 11 13 20 22 27 28 29 30
We can now pick up the symbols from x:
>> symbols = x(start_positions)
symbols =
0 4 3 2 1 0 2 3 9 5
Combine the symbols and their runs:
>> encoding = [symbols; runs]
encoding =
0 4 3 2 1 0 2 3 9 5
7 3 2 7 2 5 1 1 1 6
Flatten the encoding:
>> encoding = encoding(:)';
>> fprintf('%d ', encoding)
0 7 4 3 3 2 2 7 1 2 0 5 2 1 3 1 9 1 5 6
We can cross check that the length of the encoded sequence is correct:
>> total_symbols = sum(runs)
total_symbols =
35
We can check the length of the encoded sequence:
>> numel(encoding)
ans =
20
It is indeed less than 35. The gain is not much since there were many symbols with just one occurrence.
The decoding can be easily done using a for loop:
x_dec = [];
for i=1:numel(encoding) / 2
symbol = encoding(i*2 -1);
run_length = encoding(i*2);
x_dec = [x_dec symbol * ones([1, run_length])];
end
Let’s print the decoded sequence:
>> fprintf('%d ', x_dec);
0 0 0 0 0 0 0 4 4 4 3 3 2 2 2 2 2 2 2 1 1 0 0 0 0 0 2 3 9 5 5 5 5 5 5
Verify that the decoded sequence is indeed same as original sequence:
>> sum(x_dec - x)
ans =
0
The library provides useful methods for performing run length encoding and decoding.
Encoding:
>> x = [0 0 0 0 3 3 3 2 2];
>> encoding = spx.dsp.runlength.encode(x)
encoding =
0 4 3 3 2 2
Decoding:
>> spx.dsp.runlength.decode(encoding)
ans =
0 0 0 0 3 3 3 2 2
Discrete Cosine Transform¶
The discussion in this article is based on [Str99].
There are four types of DCT transforms DCT-1, DCT-2, DCT-3 and DCT-4.
Consider the second difference equation:
For finite signals \(x \in \RR^N\), the equation can be implemented by a linear transformation:
where \(A\) is a circulant matrix:
The unspecified values are 0. We can write the individual linear equations as:
The first and last equations are boundary conditions while the middle one represents the ordinary second difference equation.
The rows 1 and N of \(A\) are the boundary rows while all other rows are interior rows.
The interior rows correspond to the computation \(- x_{j-1} + 2 x_j - x_{j +1}\) which is the discretization of the second order derivative \(-x''\). The negative sign on the derivative makes the matrix \(A\) positive semi definite. This ensures that no eigen values of \(A\) are negative.
In the first and last rows, we need the values of \(x_0\) and \(x_{N + 1}\). In the periodic extension, we assume that \(x_0 = x_N\) and \(x_{N + 1} = x_1\). This gives the \(-1\) entries in the corners of \(A\) as shown above.
With \(\omega = \exp(2\pi i / N)\), it turns out that the vectors \(v_k = (1, \omega^k, \omega^{2k}, \dots, \omega^{(N-1)k})\)
are eigen vectors for \(A\) for \(0 \leq k \leq N -1\). The corresponding eigen values are \(2 - 2 \cos(2\pi k / N)\).
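A quick check of this claim (stock MATLAB; \(N = 8\) is an arbitrary choice) builds the circulant second difference matrix and compares its eigen values with \(2 - 2\cos(2\pi k / N)\):
% periodic second difference matrix: 2 on the diagonal, -1 on the
% sub/super diagonals and -1 in the corners
N = 8;
A = 2 * eye(N) - circshift(eye(N), 1) - circshift(eye(N), -1);
sort(eig(A))'                              % eigen values of A
sort(2 - 2 * cos(2 * pi * (0:N-1) / N))    % the claimed values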
The eigen vectors are nothing but the basis vectors for DFT basis. Note that the eigen values satisfy a relationship \(\lambda_k = \lambda_{N -k}\). So the linear combinations of the eigen vectors \(v_k\) and \(v_{N -k}\) are also eigen vectors.
It turns out that the real and imaginary parts of the vector \(v_k\) are also eigen vectors of \(A\). They can be easily constructed as linear combinations of \(v_k\) and \(v_{N -k}\).
We define:
The exception to this rule is \(\lambda_0\) for which \(c_0 = (1, 1, \dots, 1)\) and \(s_0 = (0, 0, \dots, 0)\) where \(s_0\) is not an eigen vector while \(c_0\) is.
For even \(N\), there is another exception at \(\lambda_{N/2}\) with \(c_{N/2} = (1, -1, \dots, 1, -1)\) and \(s_{N/2} = (0, 0, \dots, 0)\).
These two eigen vectors have length \(\sqrt{N}\) while the other eigen vectors \(c_k\) and \(s_k\) have length \(\sqrt{N/2}\).
TBD
Detecting Dual Tone Multi Frequency Signals¶
Highlights
The following MATLAB functions are demonstrated in this article: envelope, pulsewidth, periodogram, findpeaks, meanfreq.
A Dual Tone Multi Frequency (DTMF) signal is the signal generated from the punch keys of an ordinary telephone.
Each signal consists of a low frequency and a high frequency.
The table below lists the frequencies used for various keys.
Key | Low frequency (Hz) | High frequency (Hz) |
---|---|---|
1 | 697 | 1209 |
2 | 697 | 1336 |
3 | 697 | 1477 |
4 | 770 | 1209 |
5 | 770 | 1336 |
6 | 770 | 1477 |
7 | 852 | 1209 |
8 | 852 | 1336 |
9 | 852 | 1477 |
0 | 941 | 1336 |
* | 941 | 1209 |
# | 941 | 1477 |
Let’s create a DTMF signal for the sequence of symbols 4, 5, 0, 7:
[signal, fs] = spx.dsp.dtmf({'4', '5', '0', '7'});
The corresponding time stamps:
time = (0:(numel(signal) - 1)) / fs;
Let’s plot it:
plot(1e3*time, signal);
xlabel('Time (ms)');
ylabel('Amplitude');
grid on;

The pulses are 100 ms wide. The gap between pulses is also 100 ms wide and consists of Gaussian noise.
Our challenge would be to isolate the frequencies and identify the symbols transmitted.
Envelope¶
We can look at the shape of the pulses where the symbols were punched by looking at the RMS envelope of the signal:
envelope_signal = envelope(signal, 80,'rms');
plot(1e3*time, envelope_signal);
We are computing the RMS envelopes for window size of 80 samples.

It is now easy to identify the pulses:
pulsewidth(envelope_signal,fs)
ans =
0.1050
0.1041
0.1042
0.1045

The recognized pulses are pretty close in size to the actual pulse size of 100 ms each.
Periodogram¶
A periodogram can help us identify the dominant frequencies present in the signal.
The frequencies involved in the sequence 4507 are 4 (770, 1209), 5 (770, 1336), 0 (941, 1336), 7 (852, 1209).
We note that 770 Hz, 1209 Hz and 1336 Hz each appear twice, hence we expect them to contribute more to the power spectrum. The other frequencies are 941 Hz and 852 Hz.
Computing the periodogram is straight-forward:
[pxx,f]=periodogram(signal,[],[],fs);
Here is the display of power spectrum in deciBels.

We wish to isolate the peak frequencies from this plot:
[peak_values, peak_freqs] = findpeaks(pxx, f, 'SortStr','descend', 'MinPeakHeight', max(pxx) / 10);
peak_freqs = round(peak_freqs');
>> sort(peak_freqs)
ans =
Columns 1 through 9
766 771 774 853 941 1203 1205 1208 1210
Columns 10 through 14
1212 1215 1331 1335 1340
- The frequencies 766, 771 and 774 are near 770 Hz.
- 853 is near 852 Hz.
- 941 matches 941 Hz.
- 1203, 1205, 1208, 1210, 1212 and 1215 are near 1209 Hz.
- 1331, 1335 and 1340 are near 1336 Hz.
Thus, the periodogram has been able to identify all the relevant frequencies in the signal, and their power contributions match their relative occurrence in the sequence 4507.
However, the periodogram is unable to localize the frequencies in time and hence is unable to tell us exactly which symbols were transmitted.
It is instructive to compute the mean frequencies in different bands:
>> round(meanfreq(pxx, f, 700 + [0, 100]))
ans =
769
>> round(meanfreq(pxx, f, 800 + [0, 100]))
ans =
851
>> round(meanfreq(pxx, f, 900 + [0, 100]))
ans =
941
>> round(meanfreq(pxx, f, 1200 + [0, 100]))
ans =
1211
>> round(meanfreq(pxx, f, 1300 + [0, 100]))
ans =
1336
The mean frequencies in these bands are mostly spot-on or very close to actual frequencies sent in the DTMF signal.
Spectrogram¶
While we have been able to identify the frequencies present in the signal, we haven’t been able to localize them in time. Thus, we are unable to identify exactly which symbols were sent.
The spectrogram provides us the time-frequency representation of the signal:
spectrogram(signal, [], [], [], fs, 'yaxis');
% restrict the y-axis between 500Hz to 1500 Hz.
ylim([0.5 1.5]);

In this plot, it is clearly visible that at any point of time, two frequencies are active. There are four different symbols which seem to have been sent.
- In the first symbol, the frequencies active seem to be around 770Hz and 1200 Hz which maps to the symbol 4.
- In the second symbol, the frequencies active seem to be around 770Hz and 1330 Hz which maps to the symbol 5.
- Similarly, we can see that the symbols 0 and 7 are easily visible in the spectrogram.
This spectrogram is not able to localize the symbols accurately. We are unable to see the portions where no symbols are being sent and only noise is present.
By default, the spectrogram has the following parameters:
- Signal is divided into segments which are around 22% of the length of the signal.
- The segments overlap each other by 50%.
- No windowing is done for computing the FFT of each segment.
We should increase the time resolution of the spectrogram.
Let’s have a window length of 50 ms:
window_length = floor(fs * 50 / 1000);
Let’s continue to have overlap of 50%:
overlap_length = floor(window_length / 2);
The FFT length depends on the window length:
n_fft = 2^nextpow2(window_length);
We will compute the spectrogram with Hamming window:
spectrogram(signal,hamming(window_length),overlap_length,n_fft, fs, 'yaxis');
ylim([0.5 1.5]);
Let’s visualize the results:

In this spectrogram, it is easy to see how the pulses in the signal are clearly visible and their frequencies can be easily read off the diagram.
While we have improved the time localization of the pulses, the frequency localization has suffered a bit. Since our interest is only in knowing the mean frequencies, this loss of frequency localization is not that important in this case.
We can remove the frequencies which are contributing very small values to the spectrogram and enhance the prominent frequencies in the output. Also, we can increase the overlap between subsequent windows to introduce more spectrum lines and make the spectrogram look smoother:
overlap_length = floor(0.8 * window_length );
spectrogram(signal,hamming(window_length),overlap_length,n_fft, fs, 'yaxis', 'MinThreshold', -50);
ylim([0.5 1.5]);

By computing the center of energy for each spectral estimate in both time and frequency, we can do spectral reassignment. This gives us a much cleaner and crisper spectrogram.

Decoding the symbols¶
The complete process for decoding the DTMF sequence using the spectrogram has been implemented in the function spx.dsp.dtmf_decoder.
The function does the following:
- Compute the spectrogram
- Identify the times where spectral content has high energy
- Identify peak frequencies at these times
- Match these frequencies to the nearest low and high frequencies of DTMF sequences.
- Map the identified frequencies to actual symbols.
- Identify the start and duration of each symbol in terms of time.
You are welcome to look at the implementation.
We show the example use:
>> [symbols, starts, durations] = spx.dsp.dtmf_detector(signal, fs)
symbols =
1×4 cell array
{'4'} {'5'} {'0'} {'7'}
starts =
0.1000 0.3000 0.5000 0.7000
durations =
0.1000 0.1000 0.1000 0.1000
Another example:
>> [signal, fs] = spx.dsp.dtmf({'2', '3', '4', '6', '*'});
>> [symbols, starts, durations] = spx.dsp.dtmf_detector(signal, fs)
symbols =
1×5 cell array
{'2'} {'3'} {'4'} {'6'} {'*'}
starts =
0.1000 0.3000 0.5000 0.7000 0.9000
durations =
0.1000 0.1000 0.1000 0.1000 0.1000
Wavelets¶
Fundamentals¶
Essential Operations¶
Dyadic Structure¶
Here we are looking at the Haar wavelet decomposition of finite dimensional signals.
We assume that a signal \(x \in \RR^N\) where \(N = 2^J\) for some natural number \(J\).
A single level wavelet decomposition splits a signal into two parts, an approximation and a detail part. Both of these parts have \(N/2\) samples. With Haar wavelets, we can decompose the signal \(J\) times.
We will denote the approximations and detail components as \(a_j\) and \(d_j\).
- We start with \(a_J = x\) which has \(N = 2^{J}\) samples.
- First decomposition splits \(a_{J}\) into two parts \(a_{J-1}\) and \(d_{J-1}\) both of which have \(2^{J-1}\) samples.
- Second decomposition splits \(a_{J-1}\) into two parts \(a_{J-2}\) and \(d_{J-2}\) both of which have \(2^{J-2}\) samples.
- Third decomposition splits \(a_{J-2}\) into two parts \(a_{J-3}\) and \(d_{J-3}\) both of which have \(2^{J-3}\) samples.
- \(J\)-th decomposition splits \(a_{1}\) into two parts \(a_{0}\) and \(d_{0}\) both of which have \(2^{0} = 1\) samples.
- No further decomposition is possible.
Note
Depending upon a specific wavelet structure, \(J\) decompositions may not be possible.
The overall decomposition process can be written as
At every level of decomposition, the number of coefficients in the decomposition is exactly \(N = 2^J\).
The indices occupied by each level of decomposition are given by
This is the dyadic structure of the \(J\) levels of decompositions.
Consider the case with \(N=16\) where \(J=4\). 4 levels of decomposition are possible with Haar wavelet.
- \(a_4\) has 16 samples.
- \(a_3\) and \(d_3\) both have 8 samples each.
- \(a_2\) and \(d_2\) both have 4 samples each.
- \(a_1\) and \(d_1\) both have 2 samples each.
- \(a_0\) and \(d_0\) both have 1 sample each.
No further decomposition is possible.
\(d_j\) has \(2^j\) samples and occupies the indices between \(2^{j} +1\) and \(2^{j+1}\).
Functions to work with dyadic structure¶
We provide a function to identify the indices of \(j\)-th decomposition:
>> spx.wavelet.dyad(1)
ans =
3 4
>> spx.wavelet.dyad(2)
ans =
5 6 7 8
>> spx.wavelet.dyad(3)
ans =
9 10 11 12 13 14 15 16
>> spx.wavelet.dyad(4)
ans =
17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
A wavelet coefficient is indexed by two numbers \((j, k)\). Here, \(j\) denotes the resolution level of the wavelet and \(k\) denotes the translation. We have \(j \geq 0\) and \(0 \leq k < 2^j\).
The absolute index is given by \(2^j + k + 1\).
>> spx.wavelet.dyad(1)
ans =
3 4
>> spx.wavelet.dyad_to_index(1,0)
ans =
3
>> spx.wavelet.dyad_to_index(1,1)
ans =
4
>> spx.wavelet.dyad_to_index(3,2)
ans =
11
dyad_length tells us the number of decompositions possible for a vector:
>> [N, J, c] = spx.wavelet.dyad_length(1:16)
N =
16
J =
4
c =
logical
1
Here N is the length of the vector, J is the possible number of decompositions, and c is a consistency flag indicating whether N is a power of 2 or not.
cut_dyadic truncates a signal to the largest power-of-2 length that does not exceed its original length:
>> spx.wavelet.cut_dyadic(1:15)
ans =
1 2 3 4 5 6 7 8
Periodic Convolution¶
Usual convolution of a signal \(x\) of length \(N\) with a filter \(h\) of length \(M\) results in a signal \(y\) of length \(N + M - 1\).
The assumption here is that \(x[n] = 0\) for \(n \leq 0\) and \(n > N\).
Here is an example:
>> conv([3 1 2], [1 2 2 1])
ans =
3 7 10 9 5 2
This is not suitable for an orthogonal wavelet decomposition of a signal. We are interested in periodic or circular convolution, which is defined by \(y[n] = \sum_{k} h[k]\, x[(n - k) \bmod N]\).
Periodic Extension¶
To construct the periodic extension of a vector, we provide the following methods:
- repeat_vector_at_start repeats values from the end of a vector at its beginning.
- repeat_vector_at_end repeats values from the start of a vector at its end.
>> spx.vector.repeat_vector_at_start(1:10, 4)
ans =
7 8 9 10 1 2 3 4 5 6 7 8 9 10
>> spx.vector.repeat_vector_at_end(1:10, 4)
ans =
1 2 3 4 5 6 7 8 9 10 1 2 3 4
Computing the Periodic Convolution¶
We provide a method called iconv to compute the periodic convolution. Let’s go through the steps of periodic convolution one by one.
Let’s take an example signal:
>> x = [1 1 1 1 1 1]
x =
1 1 1 1 1 1
And an example filter:
>> f = [1 -1]
f =
1 -1
Let’s get the length of signal:
>> n = length(x)
n =
6
And the length of filter:
>> p = length(f)
p =
2
Extend the signal at the start by p values (from the end):
>> x_padded = spx.vector.repeat_vector_at_start(x, p)
x_padded =
1 1 1 1 1 1 1 1
Perform full convolution on the extended signal:
>> y_padded = filter(f, 1, x_padded)
y_padded =
1 0 0 0 0 0 0 0
Drop the first p values from it to get the periodic convolution output:
>> y = y_padded((p+1):(n+p))
y =
0 0 0 0 0 0
The same can be achieved by a single function call:
>> spx.wavelet.iconv(f,x)
ans =
0 0 0 0 0 0
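For comparison, circular convolution is also available through cconv in the Signal Processing Toolbox; with the same f, x and n as above, the result should agree:
% circular convolution with cconv from the Signal Processing Toolbox;
% this should match the periodic convolution computed above (up to rounding)
cconv(f, x, n)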
MATLAB’s conv function also supports a 'same' option, which returns the central part of the full convolution. This is different from periodic convolution:
>> u = [-1 2 3 -2 0 1 2];
>> v = [2 -1];
>> conv(u,v,'same')
ans =
5 4 -7 2 2 3 -2
>> spx.wavelet.iconv(v, u)
ans =
-4 5 4 -7 2 2 3
There is another function for computing the convolution of a signal with the time reversed version of a filter.
>> spx.wavelet.aconv(v, u)
ans =
-4 1 8 -4 -1 0 5
>> spx.wavelet.iconv(v(length(v):-1:1), u)
ans =
5 -4 1 8 -4 -1 0
Notice the slight difference in the two outputs: the aconv output is circularly shifted by 1.
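A minimal check of this relationship, assuming the shift is by one position to the left as the outputs above suggest:
u = [-1 2 3 -2 0 1 2];
v = [2 -1];
a = spx.wavelet.aconv(v, u);
b = spx.wavelet.iconv(v(end:-1:1), u);
% shift b left by one position and compare
isequal(a, [b(2:end), b(1)])   % expected: 1 (true)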
Upsampling¶
Upsampling introduces zeros between individual samples.
Upsampling by a factor of 2:
>> spx.wavelet.up_sample([-1 2 3 -2 0 1 2])
ans =
-1 0 2 0 3 0 -2 0 0 0 1 0 2 0
Upsampling by a factor of 3:
>> spx.wavelet.up_sample([-1 2 3 -2 0 1 2], 3)
ans =
-1 0 0 2 0 0 3 0 0 -2 0 0 0 0 0 1 0 0 2 0 0
The second argument is the upsampling factor.
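The same operation can be written directly with MATLAB indexing; a minimal sketch, assuming a row vector input and an integer upsampling factor:
x = [-1 2 3 -2 0 1 2];
factor = 2;
y = zeros(1, factor * length(x));
y(1:factor:end) = x;
y   % -1 0 2 0 3 0 -2 0 0 0 1 0 2 0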
MATLAB Wavelet Toolbox¶
Introduction¶
This section is a quick review of wavelet toolbox in MATLAB.
Wavelet families
The toolbox supports a number of wavelet families:
>> waveletfamilies('f')
===================================
Haar haar
Daubechies db
Symlets sym
Coiflets coif
BiorSplines bior
ReverseBior rbio
Meyer meyr
DMeyer dmey
Gaussian gaus
Mexican_hat mexh
Morlet morl
Complex Gaussian cgau
Shannon shan
Frequency B-Spline fbsp
Complex Morlet cmor
Fejer-Korovkin fk
===================================
The following command shows how to get the list of wavelets in each of the families:
>> waveletfamilies('n')
===================================
Haar haar
===================================
Daubechies db
------------------------------
db1 db2 db3 db4
db5 db6 db7 db8
db9 db10 db**
===================================
Symlets sym
------------------------------
sym2 sym3 sym4 sym5
sym6 sym7 sym8 sym**
===================================
Coiflets coif
------------------------------
coif1 coif2 coif3 coif4
coif5
===================================
BiorSplines bior
------------------------------
bior1.1 bior1.3 bior1.5 bior2.2
bior2.4 bior2.6 bior2.8 bior3.1
bior3.3 bior3.5 bior3.7 bior3.9
bior4.4 bior5.5 bior6.8
===================================
ReverseBior rbio
------------------------------
rbio1.1 rbio1.3 rbio1.5 rbio2.2
rbio2.4 rbio2.6 rbio2.8 rbio3.1
rbio3.3 rbio3.5 rbio3.7 rbio3.9
rbio4.4 rbio5.5 rbio6.8
===================================
Meyer meyr
===================================
DMeyer dmey
===================================
Gaussian gaus
------------------------------
gaus1 gaus2 gaus3 gaus4
gaus5 gaus6 gaus7 gaus8
===================================
Mexican_hat mexh
===================================
Morlet morl
===================================
Complex Gaussian cgau
------------------------------
cgau1 cgau2 cgau3 cgau4
cgau5 cgau6 cgau7 cgau8
===================================
Shannon shan
------------------------------
shan1-1.5 shan1-1 shan1-0.5 shan1-0.1
shan2-3 shan**
===================================
Frequency B-Spline fbsp
------------------------------
fbsp1-1-1.5 fbsp1-1-1 fbsp1-1-0.5 fbsp2-1-1
fbsp2-1-0.5 fbsp2-1-0.1 fbsp**
===================================
Complex Morlet cmor
------------------------------
cmor1-1.5 cmor1-1 cmor1-0.5 cmor1-1
cmor1-0.5 cmor1-0.1 cmor**
===================================
Fejer-Korovkin fk
------------------------------
fk4 fk6 fk8 fk14
fk18 fk22
===================================
Working with Daubechies Wavelets¶
The short name for this family of wavelets is db.
Information about the wavelet family:
>> waveinfo('db')
Information on Daubechies wavelets.
Daubechies Wavelets
General characteristics: Compactly supported
wavelets with extremal phase and highest
number of vanishing moments for a given
support width. Associated scaling filters are
minimum-phase filters.
Family Daubechies
Short name db
Order N N a positive integer from 1 to 45.
Examples db1 or haar, db4, db15
Orthogonal yes
Biorthogonal yes
Compact support yes
DWT possible
CWT possible
Support width 2N-1
Filters length 2N
Regularity about 0.2 N for large N
Symmetry far from
Number of vanishing
moments for psi N
Reference: I. Daubechies,
Ten lectures on wavelets,
CBMS, SIAM, 61, 1994, 194-202.
Decomposition and Reconstruction filters¶
Let’s construct the filters for ‘db4’ wavelet:
>> [LoD,HiD,LoR,HiR] = wfilters('db4');
Let’s plot the filters:
subplot(221);
stem(LoD, '.'); title('Lowpass Decomposition');
subplot(222);
stem(LoR,'.'); title('Lowpass Reconstruction');
subplot(223);
stem(HiD,'.'); title('Highpass Decomposition');
subplot(224);
stem(HiR,'.'); title('Highpass Reconstruction');
Single Level Decomposition and Reconstruction¶
The dwt
and idwt
functions can be used
for single level decomposition and reconstruction.
Let’s load a signal on which we will perform the decomposition:
load noisdopp;
plot(noisdopp);
Let’s perform 1-level decomposition:
[approximation, detail] = dwt(noisdopp,LoD,HiD);
Let’s plot the decomposed approximation and detail components:
subplot(211);
plot(approximation); title('Approximation');
subplot(212);
plot(detail); title('Detail');
Reconstruct the original signal using idwt
:
reconstructed = idwt(approximation, detail,LoR,HiR);
Let’s measure the reconstruction error:
>> max_abs_diff = max(abs(noisdopp-reconstructed))
max_abs_diff =
6.3300e-12
Multi-level Wavelet Decomposition¶
We can use the wavedec
function for
multi-level wavelet decomposition:
[coefficients, levels] = wavedec(noisdopp, 4, LoD, HiD);
Let’s plot the decomposition coefficients:
plot(coefficients); title('Coefficients');
Reconstruction from multi-level decomposition:
reconstructed = waverec(coefficients, levels, LoR, HiR);
Let’s verify the reconstruction error:
max_abs_diff = max(abs(noisdopp-reconstructed))
max_abs_diff =
2.0627e-11
It is possible to look at the approximation coefficients at all levels:
for level=0:4
level_app_coeffs = appcoef(coefficients, levels, LoR, HiR, level);
subplot(511+level);
plot(level_app_coeffs);
title(sprintf('Approximation coefficients @ level-%d', level));
end
The level-0 coefficients are nothing but the original signal. The higher level approximation coefficients are increasingly smoother.
It is important to know how many levels of decomposition
are possible. wmaxlev
can be used for finding it out:
>> wmaxlev(numel(noisdopp),'db4')
ans =
7
The normal wavelet decomposition creates more coefficients than there are in the original signal.
Let’s see how the number of coefficients increase with the level of decomposition:
>> for i=1:7
[coefficients, levels] = wavedec(noisdopp,i, LoD,HiD);
fprintf('%d ', numel(coefficients));
end
1030 1037 1044 1050 1056 1062 1068
For every level, 6 or 7 extra coefficients are introduced. This is because a normal convolution of a length-M signal with a length-N filter produces a signal of length M + N - 1.
The behavior is controlled by the DWT MODE. It defines how the signals are extended to complete the convolution.
The default mode is:
>> dwtmode
*******************************************************
** DWT Extension Mode: Symmetrization (half-point) **
*******************************************************
Decomposition with Periodic Extension¶
If we want to have a non-redundant wavelet decomposition, we can use the periodic extension DWT mode.
Changing the mode:
old_dwt_mode = dwtmode('status','nodisp');
dwtmode('per');
*****************************************
** DWT Extension Mode: Periodization **
*****************************************
Performing level 4 decomposition:
[coefficients, levels] = wavedec(noisdopp,4, LoD,HiD);
Verify that the coefficients array is of same length as signal:
>> numel(coefficients)
ans =
1024
Verify that the number of elements at different levels changes by a factor of 2:
>> levels
levels =
64 64 128 256 512 1024
Plot the coefficients:
plot(coefficients); title('Coefficients');
Reconstruct the signal:
reconstructed = waverec(coefficients, levels, LoR, HiR);
Verify that the reconstruction is fine:
max_abs_diff = max(abs(noisdopp-reconstructed))
max_abs_diff =
2.0357e-11
Plot the approximation coefficients at all levels:
for level=0:4
level_app_coeffs = appcoef(coefficients, levels, LoR, HiR, level);
subplot(511+level);
plot(level_app_coeffs);
fprintf('%d ', numel(level_app_coeffs));
title(sprintf('Approximation coefficients @ level-%d', level));
end
1024 512 256 128 64
The number of approximation coefficients decreases exactly by a factor of 2 at each level.
Restoring the old DWT mode:
% restore the old DWT mode
dwtmode(old_dwt_mode);
Synthesis and Analysis Orthonormal Bases¶
Daubechies wavelets are orthogonal. For the specific case where the DWT is decomposing a signal \(x \in \RR^N\) to a representation \(\alpha \in \RR^N\) (in the periodic extension case), the transformation can be represented by the equation \(x = \Psi \alpha,\)
where \(\Psi\) is an Orthonormal basis (ONB) for \(\RR^N\) synthesizing the signal \(x\) from the representation \(\alpha\).
The decomposition process is represented by \(\alpha = \Psi^T x.\)
We can easily construct the matrix \(\Psi^T\). Each column of \(\Psi^T\) can be obtained by computing \(\Psi^T e_i\) where \(e_i\) is the standard unit vector in i-th direction for \(\RR^N\).
We will construct the decomposition matrix for the ‘db4’ wavelet and level 4 decomposition. The size of the signal would be \(N=1024\):
[LoD,HiD,LoR,HiR] = wfilters('db4');
N = 1024;
L = 4;
Let’s make sure that we are using per
mode:
old_dwt_mode = dwtmode('status','nodisp');
dwtmode('per');
Let’s construct \(\Psi^T\):
PsiT = zeros(N, N);
for i=1:N
unit_vec = zeros(N, 1);
unit_vec(i) = 1;
[coefficients, levels] = wavedec(unit_vec, L, LoD,HiD);
PsiT(:, i) = coefficients;
end
Let’s verify that the rows of \(\Psi^T\) are unit norm:
>> norms = spx.norm.norms_l2_rw(PsiT);
fprintf('norms: min: %.4f, max: %.4f\n', min(norms), max(norms));
norms: min: 1.0000, max: 1.0000
Let’s get the corresponding synthesis matrix \(\Psi\)
Psi = PsiT';
Let’s verify that it is indeed an orthonormal basis:
>> max(max(abs(Psi * Psi' - eye(N))))
ans =
1.8573e-12
We should also verify that the matrix \(\Psi\) behaves the same as the application of the wavedec and waverec functions.
Let’s load our sample signal:
load noisdopp;
% make it a column vector
noisdopp = noisdopp';
Let’s construct its representation by wavedec
:
[a1, levels] = wavedec(noisdopp, L, LoD, HiD);
Let’s construct its representation by \(\Psi^T\):
a2 = PsiT * noisdopp;
Let’s compare if they match:
>> fprintf('Decomposition diff: %e\n', max(a1 - a2));
Decomposition diff: 2.486900e-14
They indeed match. Now, let’s reconstruct
the signal through both ways.
First using waverec
:
x1 = waverec(a1, levels, LoR, HiR);
Now using \(\Psi\)
x2 = Psi * a2;
Compare them:
fprintf('Synthesis diff: %e\n', max(x1 - x2));
Synthesis diff: 1.065814e-14
It’s working great.
Finally, don’t forget to restore the older DWT mode:
dwtmode(old_dwt_mode);
It is instructive to visualize the basis \(\Psi\):
colormap('gray');
imagesc(Psi);
colorbar;
The matrix is sparse. In fact, only about 3% of its entries are non-zero:
>> nnz(Psi) / (N*N)
ans =
0.0283
This is expected since wavelets have a very small support.
MATLAB provides a function for constructing a dictionary from one or more orthonormal or biorthogonal bases. Let's try to construct our ONB matrix using this function:
PsiMP = wmpdictionary(N, 'lstcpt', {{'db4', 4}});
Let’s verify that the two approaches are giving us same result:
>> max(max(abs(PsiMP - Psi)))
ans =
7.9581e-13
A quick note: the wmpdictionary function returns a sparse matrix.
Complete example code can be downloaded here.
Stationary Wavelet Transform¶
DWT is not translation invariant. In some applications, translation invariance is important. Stationary Wavelet Transform (SWT) overcomes this limitation. It removes all the upsamplers and downsamplers in DWT. It is a highly redundant transform.
In MATLAB, it is implemented using the swt function. swt doesn't involve any downsampling; all detail and approximation coefficients are of the same length as the original signal. swt is defined using periodic extension, and the length of the approximation and detail coefficients computed at each level equals the length of the signal.
Let us construct a level 4 decomposition:
coefficients = swt(noisdopp, 4, LoD,HiD);
Let’s plot the approximation and detail coefficients:
for level=0:4
subplot(511+level);
plot(coefficients(level+1, :));
title(sprintf('SWT Coefficients @level-%d', level));
end
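The transform is invertible. As a quick sanity check (assuming LoR and HiR are the db4 reconstruction filters constructed earlier with wfilters, and noisdopp is the row vector loaded earlier), the toolbox's iswt function should reconstruct the signal up to numerical precision:
reconstructed = iswt(coefficients, LoR, HiR);
max_abs_diff = max(abs(noisdopp - reconstructed))
% max_abs_diff should be negligible (on the order of 1e-10 or smaller)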
Detection, Classification and Estimation¶
Binary Hypothesis Testing¶
Generate a sequence of bits:
% Number of bits being transmitted
B = 1000*100;
transmittedBits = randi(2, B , 1) - 1;
Modulation:
% Number of samples per detection test.
N = 10;
% The signal shape
signal = ones(N, 1);
transmittedSequence = SPX_Modulator.modulate_bits_with_signals(transmittedBits, signal);
Adding noise:
sigma = 1;
noise = sigma * randn(size(transmittedSequence));
% We add noise to transmitted data to create received sequence
receivedSequence = transmittedSequence + noise;
Matched filtering:
matchedFilterOutput = SPX_MatchedFilter.filter(receivedSequence, signal);
Generating sufficient statistics:
signalNormSquared = signal' * signal;
sufficientStatistics = matchedFilterOutput / signalNormSquared;
Thresholding:
% We define optimal detection threshold
eta = 0.5;
% We create the received bits
receivedBits = sufficientStatistics >= eta;
Detection results:
result = SPX_BinaryHypothesisTest.performance(...
transmittedBits, receivedBits)
% Number of False sent, False received
result.FF
% Number of False sent, True received
result.FT
% Number of True sent, False received
result.TF
% Number of True sent, True received
result.TT
% Number of times hypothesis 0 was sent.
result.H0
% Number of times hypothesis 1 was sent.
result.H1
% Number of times 0 was detected.
result.D0
% Number of times 1 was detected.
result.D1
% A priori probability of 0
result.P0
% A priori probability of 1
result.P1
% Detection probability
result.PD
% False alarm probability
result.PF
% Miss probability
result.PM
% Accuracy (probability of correct decisions)
result.Accuracy
% Probability of error
result.Pe
% Precision : Truth sent given that truth was detected
result.Precision
% Recall : Truth detected given that truth was sent.
result.Recall
% F1 score
result.F1
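For this setup, the sufficient statistic is Gaussian with mean 0 or 1 and variance \(\sigma^2/N\), so the theoretical detection performance can be computed in closed form. Here is a minimal sketch (plain MATLAB, using erfc) for comparison with the empirical results above; it assumes on-off signaling where bit 1 transmits the signal and bit 0 transmits nothing, which is consistent with the 0.5 threshold:
N = 10; sigma = 1; eta = 0.5;
s = sigma / sqrt(N);               % standard deviation of the sufficient statistic
Q = @(x) 0.5 * erfc(x / sqrt(2));  % Gaussian tail function
PF_theory = Q(eta / s)             % false alarm probability, P(decide 1 | 0 sent)
PM_theory = Q((1 - eta) / s)       % miss probability, P(decide 0 | 1 sent)
Pe_theory = 0.5 * (PF_theory + PM_theory)  % assuming equiprobable bits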
ECG¶
A Short Review of ECG Signals¶
The structure of an ECG signal. Courtesy: Wikipedia.
General Features¶
P wave
- P wave has a duration less than 120 msec with frequencies below 10-15 Hz.
QRS complex
- QRS wave has a duration of about 70-110 msec with frequencies in 10-50 Hz.
T wave
- It is similar in frequency content to P wave.
PQ segment
- PQ segment lasts about 80 msec.
Computational Complexity¶
Introduction¶
This chapter provides a framework for the analysis of the computational complexity of sparse recovery algorithms. See [GVL12] for a detailed study of matrix computations.
The table below summarizes the flop counts for various basic operations. A detailed derivation of these flop counts is presented in Basic Operations.
Operation | Description | Parameters | Flop Counts |
---|---|---|---|
\(y = \text{abs}(x)\) | Absolute values | \(x \in \RR^n\) | \(n\) |
\(\langle x, y \rangle\) | Inner product | \(x, y \in \RR^n\) | \(2n\) |
\([v, i] = \text{max}(\text{abs}(x))\) | Find maximum value by magnitude | \(x \in \RR^n\) | \(2n\) |
\(y = A x\) | Matrix vector multiplication | \(A \in \RR^{m \times n}, x \in \RR^n\) | \(2mn\) |
\(C = AB\) | Matrix multiplication | \(A \in \RR^{m \times n}, B \in \RR^{n \times p}\) | \(2mnp\) |
\(y = A x\) | \(A\) is diagonal | \(A \in \RR^{n\times n}, x \in \RR^n\) | \(n\) |
\(y = A x\) | \(A\) is lower triangular | \(A \in \RR^{n\times n}, x \in \RR^n\) | \(n(n+1)\) |
\((I + u v^T)x\) | Rank-one update applied to a vector | \(x, u, v \in \RR^n\) | \(4n\) |
\(G = A^TA\) | Gram matrix (symmetric) | \(A \in \RR^{m \times n}\) | \(mn^2\) |
\(F = AA^T\) | Frame operator (symmetric) | \(A \in \RR^{m \times n}\) | \(nm^2\) |
\(\| x \|_2^2\) | Squared \(\ell_2\) norm | \(x \in \RR^n\) | \(2n - 1\) |
\(\| x \|_2\) | \(\ell_2\) norm | \(x \in \RR^n\) | \(2n\) |
\(x(:) = c\) | Set to a constant value | \(x \in \RR^n\) | \(n\) |
Swap rows in \(A\) | elementary row operation | \(A \in \RR^{m \times n}\) | \(3n\) |
\(A(i, :) = \alpha A(i, :)\) | Scale a row | \(A \in \RR^{m \times n}\) | \(2n\) |
Solve \(L x = b\) | Lower triangular system | \(L \in \RR^{n \times n}\) | \(n^2\) |
Solve \(U x = b\) | Upper triangular system | \(U \in \RR^{n \times n}\) | \(n^2\) |
Solve \(Ax =b\) | Gaussian elimination, \(A\) full rank | \(A\in \RR^{n \times n}\) | \(\frac{2\, n^3}{3} + \frac{n^2}{2} - \frac{7\, n}{6}\) |
\(A = QR\) | QR factorization | \(A \in \RR^{m \times n}\) | \(2mn^2\) |
Solve \(\| A x - b \|_2^2\) | Least squares through QR | \(A \in \RR^{m \times n}\) | \(2mn^2 + 2mn + n^2\) |
\(A^TA x = A^T b\) | Least squares through Cholesky \(A^T A = L L^T\) | \(A \in \RR^{m \times n}\) | \(mn^2 + \frac{1}{3} n^3\) |
Basic Operations¶
Essential operations in the implementation of a numerical algorithm are addition, multiplication, comparison, load and store. Numbers are stored in floating point representation and a dedicated floating point unit is available for performing arithmetic operations. These operations are known as floating point operations (flops). A typical update operation \(b \leftarrow b + x y\) (a.k.a. multiply and add) involves two flops (one floating point multiplication and one floating point addition). Subtraction costs the same as addition. A division is usually counted as 4 flops in the HPC community, as a more sophisticated procedure is invoked in the floating point hardware. For our purposes, we will count a division as a single flop since it is a rare operation and doesn't affect the overall flop count asymptotically. A square root operation can take about 6 flops on typical CPU architectures, but, following [TBI97], we will treat it as a single flop too. We ignore the costs of load and store operations. We usually also ignore the costs of decision making operations and integer counters. Finally, we treat real arithmetic and complex arithmetic as costing the same number of flops to keep the analysis simple.
Let \(x, y \in \RR^n\) be two vectors. Their inner product is computed as \(\langle x, y \rangle = \sum_{i=1}^{n} x_i y_i.\)
This involves \(n\) multiplications and \(n-1\) additions, for a total of \(2n - 1\) flops. If we implement it as a sequence of multiply and add operations starting with \(0\), then it takes \(2n\) flops. We will use this simpler expression. Addition and subtraction of \(x\) and \(y\) take \(n\) flops each. Scalar multiplication takes \(n\) flops.
Multiplication¶
Let \(A \in \RR^{m \times n}\) be a real matrix and \(x \in \RR^n\) be a vector. Then \(y = A x \in \RR^m\) is their matrix-vector product. A straight-forward implementation consists of taking inner product of each row of \(A\) with \(x\). Each inner product costs \(2n\) flops. There are \(m\) such inner products computed. Total operation count is \(2mn\). When two matrices \(A \in \RR^{m \times n}\) and \(B \in \RR^{n \times p}\) are multiplied, the operation count is \(2mnp\).
There are specialized matrix-matrix multiplication algorithms which can reduce the flop count, but we would be content with this result. If \(A\) has a certain structure [e.g. Fourier Transform], then specialized algorithms may compute the product much faster. We will not be concerned with this at the moment. Also, partitioning of a matrix into blocks and using block versions of fundamental matrix operations helps a lot in improving the memory traffic and can significantly improve the performance of the algorithm on real computers, but this doesn’t affect the flop count and we won’t burden ourselves with these details.
If \(A\) is diagonal (with \(m=n\)), then \(Ax\) can be computed in \(n\) flops. If \(A\) is lower triangular (with \(m=n\)), then \(Ax\) can be computed in \(n(n+1)\) flops. Here is a quick way to compute \((I + uv^T)x\): Compute \(c = v^T x\) (\(2n\) flops), then compute \(w = c u\) (\(n\) flops), then compute \(w + x\) (\(n\) flops). The total is \(4n\) flops.
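A minimal MATLAB sketch of this rank-one trick (the variable names are illustrative only); it avoids forming the \(n \times n\) matrix \(I + uv^T\) altogether:
n = 1000;
u = randn(n, 1); v = randn(n, 1); x = randn(n, 1);
% naive: build the matrix explicitly (O(n^2) flops and storage)
y_naive = (eye(n) + u * v') * x;
% trick: c = v'*x costs 2n flops, x + c*u costs another 2n flops
y_fast = x + (v' * x) * u;
max(abs(y_naive - y_fast))   % should be numerically negligible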
The Gram Matrix \(G = A^T A\) (for \(A \in \RR^{m \times n}\)) is symmetric of size \(n \times n\). We need to calculate only the upper triangular part and we can fill the lower triangular part easily. Each row vector of \(A^T\) and column vector of \(A\) belong to \(\RR^{m}\). Their inner product takes \(2m\) flops. We need to compute \(n(n+1)/2\) such inner products. The total flop count is \(mn(n+1) \approx mn^2\). Similarly, the frame operator \(AA^T\) is symmetric requiring \(nm(m+1) \approx nm^2\) flops.
Squared norm of a vector \(\| x \|_2^2 = \langle x, x \rangle\) can be computed in \(2n-1\) flops. Norm can be computed in \(2n\) flops.
Elementary row operations¶
There are a few memory operations for which we need to assign flop counts. Setting a vector \(x \in \RR^n\) to zero (or any constant value) will take \(n\) flops. Swapping two rows of a matrix \(A\) (with \(n\) columns) takes \(3n\) flops.
Scaling a row of \(A\) takes \(n\) flops. Scaling a row and adding to another row takes \(2n\) flops.
Back and forward substitution¶
Given an upper triangular matrix \(U \in \RR^{n \times n}\), solving the equation \(U x = b\) by back substitution takes \(n^2\) flops. This can be easily proved by induction. The case \(n=1\) is trivial (requiring one division). Assume that the flop count is valid for \(1 \dots n-1\). For an \(n \times n\) matrix \(U\), the top most row gives the equation
\(u_{11} x_1 + u_{12} x_2 + \dots + u_{1n} x_n = b_1,\)
where \(x_2, \dots, x_n\) have already been determined from the lower \((n-1)\times(n-1)\) subsystem in \((n-1)^2\) flops. Solving for \(x_1\) requires \(2n - 3 + 1 + 1 = 2n - 1\) flops. The total is \((n-1)^2 + 2n - 1 = n^2\). The flop count for forward substitution (lower triangular systems) is also \(n^2\).
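A minimal back substitution sketch in MATLAB (illustrative only, without pivoting or error checks) makes the \(n^2\) count visible: row \(i\) performs roughly \(2(n-i)\) multiply-add flops plus one subtraction and one division:
function x = back_substitute(U, b)
% Solve U x = b where U is upper triangular and nonsingular.
n = length(b);
x = zeros(n, 1);
for i = n:-1:1
    % subtract the contribution of the already computed unknowns, then divide
    x(i) = (b(i) - U(i, i+1:n) * x(i+1:n)) / U(i, i);
end
end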
Gaussian elimination¶
Let \(A \in \RR^{n \times n}\) be a full rank matrix and let us look at the Gaussian elimination process for solving the equation \(A x = y\) for a given \(y\) and an unknown \(x\). As the pivot column shifts during Gaussian elimination, the number of columns involved keeps reducing. The first pivot is \(a_{11}\). For the \(i\)-th row beneath the first row, computing \(a_{11} / a_{i1}\) takes one flop (recall that we count a division as a single flop), scaling the row with this value takes \(n\) flops, and subtracting the first row from it takes \(n\) flops, i.e. \((2n+1)\) flops per row. We repeat the same for the \((n-1)\) rows below the pivot, for a total of \((2n+1)(n-1)\) flops. For the \(i\)-th pivot in the \(i\)-th row, the number of columns involved is \(n-i+1\) and the number of rows below it is \(n-i\), so the flop count for zeroing out the entries below the pivot is \((2(n-i+1)+1)(n-i)\). Summing over \(1\) to \(n\), we obtain
\(\sum_{i=1}^{n} \big(2(n-i+1)+1\big)(n-i) = \frac{2\, n^3}{3} + \frac{n^2}{2} - \frac{7\, n}{6}.\)
For a \(2\times 2\) matrix, this is 5 flops. For a \(3\times 3\) matrix, this is \(19\) flops. Actually, substituting \(k = n-i+1\), we can rewrite the sum as \(\sum_{k=1}^{n} (2k+1)(k-1).\)
An additional \(n^2\) flops are required for the back substitution part.
QR factorization¶
We factorize a full column rank matrix \(A \in \RR^{m \times n}\) as \(A = QR\), where \(Q \in \RR^{m \times n}\) has orthonormal columns (\(Q^TQ = I\)) and \(R \in \RR^{n \times n}\) is an upper triangular matrix. This can be computed in approximately \(2mn^2\) flops using the Modified Gram-Schmidt algorithm presented below.
Modified Gram-Schmidt Algorithm:
for k = 1 to n
    v_k <- a_k                      % initialize Q matrix
end
for k = 1 to n
    r_kk <- || v_k ||_2             % compute norm
    q_k <- v_k / r_kk               % normalize
    for j = k+1 to n
        r_kj <- q_k^T v_j           % compute projection
        v_j <- v_j - r_kj * q_k     % subtract projection
    end
end
Most of the time of the algorithm is spent in the inner loop over \(j\). The projection of \(v_j\) on \(q_k\) is computed in \(2m-1\) flops and subtracted from \(v_j\) in \(2m\) flops. The projection on \(q_k\) is subtracted from the remaining \((n-k)\) vectors, requiring \((n-k)(4m-1)\) flops. Summing over \(k\), we get
\(\sum_{k=1}^{n} (n-k)(4m-1) = (4m-1)\frac{n(n-1)}{2} = 2mn^2 - 2mn - \frac{n^2}{2} + \frac{n}{2}.\)
Computing norm \(r_{kk}\) requires \(2m\) flops. Computing \(q_k\) requires \(m+1\) flops (1 inverse and \(m\) multiplications). These contribute \((3m+1)n\) flops for \(n\) columns. Initialization of \(Q\) matrix can be absorbed into the normalization step requiring no additional flops. Thus, the total flop count is \(\frac{3n}{2} + m n + 2mn^2 - \frac{n^2}{2} \approx 2mn^2\).
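A minimal MATLAB sketch of the modified Gram-Schmidt factorization described above (illustrative only; it assumes \(A\) has full column rank and performs no re-orthogonalization):
function [Q, R] = mgs_qr(A)
% Modified Gram-Schmidt QR factorization: A = Q*R with orthonormal columns in Q.
[~, n] = size(A);
Q = A;                                      % columns start as the columns of A
R = zeros(n, n);
for k = 1:n
    R(k, k) = norm(Q(:, k));                % compute norm
    Q(:, k) = Q(:, k) / R(k, k);            % normalize
    for j = k+1:n
        R(k, j) = Q(:, k)' * Q(:, j);       % compute projection
        Q(:, j) = Q(:, j) - R(k, j) * Q(:, k);  % subtract projection
    end
end
end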
A variation of this algorithm is presented below. In this version, the \(Q\) and \(R\) matrices are computed column by column from \(A\). This allows for incremental update of the \(QR\) factorization of \(A\) as more columns of \(A\) are added. This variation is very useful in efficient implementations of algorithms like Orthogonal Matching Pursuit.
[Algorithm figure: incremental (column-by-column) QR factorization]
Again, the inner loop requires \(4m-1\) flops. This loop is run \(k-1\) times. We have \(\sum_{k=1}^n (k-1)= \sum_{k=1}^n (n - k)\). Thus, flop counts are identical.
Least Squares¶
The standard least squares problem of minimizing the squared norm \(\| A x - b\|_2^2\), where \(A\) is a full column rank matrix, can be solved using various methods. The solution can be obtained by solving the normal equations \(A^T A x = A^T b\). Since the Gram matrix \(A^T A\) is symmetric positive definite, solutions faster than general Gaussian elimination are applicable.
QR factorization¶
We write \(A = QR\). Then, an equivalent formulation of the normal equations is \(R x = Q^T b\). The solution is obtained in 3 steps: (a) compute the \(QR\) factorization of \(A\); (b) form \(d = Q^T b\); (c) solve \(R x = d\) by back substitution. The total cost of the solution is \(2mn^2 + 2mn + n^2\) flops. We refrain from ignoring the lower order terms as we will be using an incremental QR update based series of least squares problems in the sequel.
Cholesky factorization¶
We calculate \(G = A^T A\). We then perform the Cholesky factorization of \(G = LL^T\). We compute \(d = A^T b\). We solve \(Lz = d\) by forward substitution. We solve \(L^T x = z\) by back substitution. Total flop count is approximately \(mn^2 + (1/3) n^3 + 2mn + n^2 + n^2\) flops. For large \(m, n\), the cost is approximately \(mn^2 + (1/3) n^3\). QR factorization is numerically more stable though Cholesky is faster. Cholesky factorization can be significantly faster if \(A\) is a sparse matrix. Otherwise QR factorization is the preferred approach.
Incremental QR factorization¶
Let us spend some time looking at the QR based solution differently. Let us say that \(A = \begin{bmatrix} a_1 & a_2 & \dots & a_n \end{bmatrix}\). Let \(A_k\) be the submatrix consisting of the first \(k\) columns of \(A\) and let the QR factorization of \(A_k\) be \(Q_k R_k\). Let \(x_k\) be the solution of the least squares problem of minimizing \(\| A_k x_k - b \|_2^2\). We form \(d_k = Q_k^T b\) and solve \(R_k x_k = d_k\) via back substitution.
Similarly, the QR factorization of \(A_{k+1}\) is \(Q_{k+1} R_{k+1}\). We can write
\(A_{k+1} = \begin{bmatrix} A_k & a_{k+1} \end{bmatrix} = \begin{bmatrix} Q_k & q_{k+1} \end{bmatrix} \begin{bmatrix} R_k & r_{k+1} \\ 0 & r_{k+1, k+1} \end{bmatrix}.\)
The \(k\) entries in the vector \(r_{k+1}\) are computed as per the inner loop above. Computing and subtracting the projection of \(a_{k+1}\) onto each normalized column in \(Q_k\) requires \(4m-1\) flops, and this loop runs \(k\) times. Computing the norm and the division requires \(3m+1\) flops. The whole QR update step therefore requires \(k(4m-1) + 3m + 1\) flops. It is clear that the first \(k\) entries in \(d_{k+1}\) are identical to \(d_k\); we just need to compute the last entry as \(q_{k+1}^T b\) (requiring \(2m\) flops). Back substitution will require all \((k+1)^2\) flops. The total number of flops required for solving the \((k+1)\)-th least squares problem is \(k(4m-1) + 3m + 1 + 2m + (k+1)^2\) flops. Summing over \(k=0\) to \(n-1\), we get
\(\sum_{k=0}^{n-1}\big[ k(4m-1) + 5m + 1 + (k+1)^2 \big] = 2mn^2 + 3mn + \frac{n^3}{3} + \frac{5n}{3}.\)
Compare this with the flop count for the QR factorization based least squares solution for the whole matrix \(A\): \(2mn^2 + 2mn + n^2\). Asymptotically (with \(n < m\)), the incremental approach is close to \(2mn^2\), the operation count for solving the full least squares problem once. This approach thus gives us a series of solutions without sacrificing much on computational complexity.
Orthogonal Matching Pursuit¶
We are modeling a signal \(y \in \RR^M\) in a dictionary \(\Phi \in \RR^{M \times N}\) consisting of \(N\) atoms as \(y = \Phi x + r\) where \(r\) is the approximation error. Our objective is to construct a sparse model \(x \in \RR^N\). \(\Lambda = \supp(x)\) is the set of indices on which \(x_i\) is non-zero. \(K = \| x \|_0 = | \supp(x) |\) is the so called \(\ell_0\)-“norm” of \(x\) which is the number of non-zero entries in \(x\).
A sparse recovery or approximation algorithm need not provide the full vector \(x\). It can provide the positions of non-zero entries \(\Lambda\) and corresponding values \(x_{\Lambda}\) requiring \(2K\) units of storage where \(x_{\Lambda} \in \RR^{K}\) consists of entries from \(x\) indexed by \(\Lambda\). \(\Phi_{\Lambda}\) denotes the submatrix constructed by picking columns indexed by \(\Lambda\).
Orthogonal Matching Pursuit is presented below.
[Algorithm figure: Orthogonal Matching Pursuit]
OMP builds the support incrementally. In each iteration, one more atom is added to the support set for \(y\). We terminate the algorithm either after a fixed number of iterations \(K\) or when the magnitude of residual \(\| y - \Phi x \|_2\) reaches a specified threshold.
The following analysis assumes that the main loop of OMP runs for \(K\) iterations. The iteration counter \(k\) varies from \(1\) to \(K\) and is increased at the beginning of each iteration. Note that \(K \leq M\).
Matching step requires the multiplication of \(\Phi^T \in \RR^{N \times M}\) with \(r^{k-1}\in \RR^{M}\) (the residual after \(k-1\) iterations). It requires \(2MN\) flops at maximum. OMP has a property that the residual after \(k\)-th iteration is orthogonal to the space spanned by the atoms selected till \(k\)-th iteration \(\{\phi_{\lambda_1}\dots \phi_{\lambda_k}\}\). Thus, the inner product of these atoms with \(r\) is 0 and we can safely ignore these columns. This reduces flop count to \(2M(N-k+1)\).
Identification step requires \(2N\) flops. This includes \(N\) flops for taking absolute values and \(N\) flops for finding the maximum.
\(\Lambda\) is easily implemented in the form of an array whose length is indicated by the iteration counter \(k\). A large array (of size \(M\)) can be allocated in advance for maintaining \(\Lambda\). Thus, support update operation requires a single flop and we will ignore it. \(\Lambda^{k}\) contains \(k\) indices.
While the algorithm shows the full sparse vector \(x\), in practice, we only need to reserve space for \(x_{\Lambda}\) which is an array with maximum size of \(M\) and can be preallocated. \(\Phi_{\Lambda^{k}}\) need not be stored separately. This can be obtained from \(\Phi\) by proper indexing operations. Its size is \(M \times k\).
Let’s skip the least squares step for updating representation for the moment.
Once \(x^{k}_{\Lambda^{k}}\) has been computed, computing the approximation \(y^{k}\) takes \(2Mk\) flops.
Updating the residual \(r^{k}\) takes \(M\) flops as both \(y\) and \(y^{k}\) belong to \(\RR^{M}\). Updating iteration counter takes 1 flop and can be ignored.
Least Squares through QR Update¶
Let's come back to the least squares step. Assume that \(\Phi_{\Lambda^{k-1}}\) has a QR decomposition \(Q_{k-1} R_{k-1}\). Addition of \(\phi_{\lambda^{k}}\) to form \(\Phi_{\Lambda^k}\) requires updating the QR decomposition to \(Q_{k} R_{k}\). Following the incremental QR update described above, computing and subtracting the projection of \(\phi_{\lambda^{k}}\) onto each normalized column of \(Q_{k-1}\) requires \(4M-1\) flops. This loop is run \(k-1\) times. Computing the norm and division requires \(3M+1\) flops. The whole QR update step requires \((k-1)(4M-1) + 3M + 1\) flops. We are assuming that enough space has been preallocated to maintain \(Q_k\) and \(R_k\). Solving the least squares problem requires the additional steps of computing the projection \(d = Q_k^T y\) (\(2M\) flops for the new entry in \(d\)) and solving \(R_k x = d\) by back substitution (\(k^2\) flops). Thus, the QR update based least squares solution requires \((k-1)(4M-1) + 3M + 1 + 2M + k^2\) flops.
The table below summarizes all the steps.
Finally, we can put together the cost of all steps in the main loop of OMP as
\(2M(N-k+1) + 2N + 2Mk + M + \big[(k-1)(4M-1) + 3M + 1\big] + 2M + k^2.\)
This simplifies to \(4\,M+2\,N-k+4\,M\,k+k^2+2\,M\,N+2\). Summing over \(k \in \{1,\dots, K\}\), we obtain
\(2\,K\,M\,N + 2\,K\,N + 2\,M\,K^2 + 6\,K\,M + \frac{K^3}{3} + \frac{5\,K}{3}.\)
For the specific setting \(K = \sqrt{M} / 2\) and \(M = N/2\), the count can be expressed purely in terms of \(M\); the dominant term is \(2KMN = 2M^{5/2}\).
In a typical sparse approximation problem, we have \(K < M \ll N\). Thus, the flop count will be approximately \(2KMN\).
Total flop count of matching step over all iterations is \(K\, M - K^2\, M + 2\, K\, M\, N\). Total flop count of least squares step over all iterations is \(\frac{5\, K}{3} + 2\, K^2\, M + \frac{K^3}{3} + 3\, K\, M\). This suggests that the matching step is the dominant step for OMP.
Operations in OMP using QR update:
Operation | Flops |
---|---|
\(\Phi^T r\) | \(2M(N - k +1)\) |
Identification | \(2N\) |
\(y^{k} = \Phi_{\Lambda^{k}} x_{\Lambda^{k}}^{k}\) | \(2Mk\) |
\(r^k = y - y^k\) | \(M\) |
QR update | \((k-1)(4M-1) + 3M + 1\) |
Update \(d = Q_k^T y\) | \(2M\) |
Solve \(R_k x = d\) | \(k^2\) |
Least Squares through Cholesky Update¶
If the OMP least squares step is computed through Cholesky decomposition, then we maintain the Cholesky decomposition of \(G = \Phi_{\Lambda}^T \Phi_{\Lambda}\) as \(G = L L^T\). The normal equations then read \(L L^T x_{\Lambda} = \Phi_{\Lambda}^T y\).
In each iteration, we need to update \(L^k\), compute \(b = \Phi_{\Lambda}^T y\), solve \(L u = b\) and then solve \(L^T x = u\). Now,
\(G^k = \Phi_{\Lambda^k}^T \Phi_{\Lambda^k} = \begin{bmatrix} \Phi_{\Lambda^{k-1}} & \phi_{\lambda^k} \end{bmatrix}^T \begin{bmatrix} \Phi_{\Lambda^{k-1}} & \phi_{\lambda^k} \end{bmatrix}.\)
Define \(v = \Phi_{\Lambda^{k-1}}^T \phi_{\lambda^k}\). We have
\(G^k = \begin{bmatrix} G^{k-1} & v \\ v^T & 1 \end{bmatrix},\)
since the atoms are normalized.
The Cholesky update is given by
\(L^k = \begin{bmatrix} L^{k-1} & 0 \\ w^T & \sqrt{1 - w^T w} \end{bmatrix},\)
where solving \(L^{k - 1} w = v\) gives us \(w\). For the first iteration, \(L^1 = 1\) since the atoms in \(\Phi\) are normalized.
Computing \(v\) would take \(2M(k-1)\) flops. Computing \(w\) would take \((k-1)^2\) flops. Computing \(\sqrt{1-w^T w}\) would take another \(2k\) flops. Thus, Cholesky update requires \(2M(k-1) + 2k + (k-1)^2\) flops. Then computing \(b = \Phi^T_{\Lambda} y\) requires only updating the last entry in \(b\) which requires \(2M\) flops. Solving \(LL^T x = b\) requires \(2k^2\) flops.
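A minimal MATLAB sketch of this Cholesky update step (the names PhiSub, new_atom and L_prev are illustrative, not part of the library API); it appends one row to the existing factor instead of refactorizing the Gram matrix:
% PhiSub  : M x (k-1) matrix of previously selected atoms (unit norm columns)
% new_atom: M x 1 newly selected atom, assumed to have unit norm
% L_prev  : (k-1) x (k-1) lower triangular Cholesky factor of PhiSub' * PhiSub
v = PhiSub' * new_atom;                 % 2M(k-1) flops
w = L_prev \ v;                         % forward substitution, about (k-1)^2 flops
L = [L_prev, zeros(size(L_prev, 1), 1); ...
     w', sqrt(1 - w' * w)];             % updated k x k Cholesky factor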
Operations in OMP using Cholesky update:
Operation | Flops |
---|---|
\(\Phi^T r\) | \(2M(N - k +1)\) |
Identification | \(2N\) |
\(y^{k} = \Phi_{\Lambda^{k}} x_{\Lambda^{k}}^{k}\) | \(2Mk\) |
\(r^k = y - y^k\) | \(M\) |
Cholesky update | \(2M(k-1) + 2k + (k-1)^2\) |
Update \(b = \Phi^T_{\Lambda} y\) | \(2M\) |
Solve \(LL^T x = b\) | \(2k^2\) |
We can see that for \(k \ll M\), the QR update costs around \(4Mk\) flops while the Cholesky update costs around \(2Mk\) flops (asymptotically).
The flop count for the main loop of OMP using the Cholesky update is
\(2M(N-k+1) + 2N + 2Mk + M + \big[2M(k-1) + 2k + (k-1)^2\big] + 2M + 2k^2,\)
which simplifies to \(2\,M\,N + 2\,N + 3\,M + 2\,M\,k + 3\,k^2 + 1\).
Summing over \(k \in [K]\), we get the total flop count for OMP as
\(2\,K\,M\,N + 2\,K\,N + 4\,K\,M + M\,K^2 + K^3 + \frac{3\,K^2}{2} + \frac{3\,K}{2}.\)
For the specific setting \(K = \sqrt{M} / 2\) and \(M = N/2\), the count can again be expressed purely in terms of \(M\), with \(2KMN = 2M^{5/2}\) as the dominant term.
In a typical sparse approximation problem, we have \(K < M \ll N\). Thus, the flop count will be approximately \(2KMN\) i.e. dominated by the matching step.
Cholesky update based solution is marginally faster than QR update based solution for small values of \(M\).
Sorting¶
We sometimes need sorting and searching operations on arrays of numbers in numerical algorithms. This section summarizes results related to the number of operations needed to perform various sorting and searching tasks on arrays. These results are collected from or based on the approach in [SF13]. The fundamental operations in these algorithms are comparisons, loads, stores and exchanges of array elements.
Finding the maximum of an array of length \(n\) takes \(n-1 \approx n\) comparisons. We assume the first entry is the maximum, keep comparing with the other entries of the array, and change the maximum whenever a larger entry is found. On average, half of these comparisons will also require updating the maximum entry. Apart from finding the largest entry, we are often required to find its location too; the location is updated whenever the maximum value is updated. If we have to find the \(k\) largest entries in the array (along with their indices), we can work iteratively: find the maximum, set the corresponding entry in the array to a small enough value (0 for a positive valued array, \(-\infty\) for a real array), find the second largest entry, and so on. This requires approximately \(kn\) comparisons. Considering the additional book-keeping cost, the flop count can be put at \(2kn\). If the array is needed further, we can put the \(k\) largest entries back into the array.
Theorem 1.3 in [SF13] suggests that quicksort algorithm on average uses \((n-1)/2\) partitioning stages, \(2n\ln{n} -1.846n\) compares and \(.333 n \ln{n} -.865 n\) exchanges to sort an array of n randomly ordered distinct elements.
Our needs for sorting also require us to store the indices of the entries of the sorted array in the original array. This is done by creating an index array and performing exchanges on the index array whenever exchanges are done in the original array. Keeping these extra operations in mind, we will use a conservative estimate of \(4n \ln{n}\) flops for sorting an array. Once the array is sorted, picking the \(k\) largest entries requires \(k\) iterations. It is noted here that when \(n\) is small (say less than 1000), an efficient implementation of quicksort can actually beat the naive way of finding the \(k\) largest entries discussed above. This will be our preferred approach in this work.
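A minimal MATLAB sketch contrasting the two approaches discussed above (illustrative only): the naive repeated-maximum scan for the \(k\) largest magnitudes, and a single sort that also records the original indices:
x = randn(1, 1000);
k = 8;
% naive approach: k passes of find-the-maximum over the magnitudes
vals = abs(x);
naive_idx = zeros(1, k);
for i = 1:k
    [~, naive_idx(i)] = max(vals);
    vals(naive_idx(i)) = -inf;    % exclude this entry from subsequent passes
end
% sort based approach: sort all magnitudes once, keep the index permutation
[~, order] = sort(abs(x), 'descend');
sorted_idx = order(1:k);
isequal(sort(naive_idx), sort(sorted_idx))   % expected: 1 (true)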
Library Classes¶
Contents:
Sparse recovery pursuit algorithms¶
Contents:
Matching pursuit¶
Constructing the solver with dictionary and expected sparsity level:
solver = spx.pursuit.single.MatchingPursuit(Dict, K)
Using the solver to obtain the sparse representation of one vector:
result = solver.solve(y)
Using the solver to obtain the sparse representations of all vectors in the signal matrix Y independently:
result = solver.solve_all(Y)
Orthogonal matching pursuit¶
Constructing the solver with dictionary and expected sparsity level:
solver = spx.pursuit.single.OrthogonalMatchingPursuit(Dict, K)
Using the solver to obtain the sparse representation of one vector:
result = solver.solve(y)
There are several ways of solving the least squares problem which is an intermediate step in the orthogonal matching pursuit algorithm. Some of these are described below.
Using the solver to obtain the sparse representation of one vector with incremental QR decomposition of the subdictionary for the least squares step:
result = solver.solve_qr(y)
Using the solver to obtain the sparse representations of all vectors in the signal matrix Y independently:
result = solver.solve_all(Y)
Using the solver to obtain the sparse representations of all vectors
in the signal matrix Y independently using the linsolve
method
for least squares:
result = solver.solve_all_linsolve(Y)
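Putting the pieces together, here is a small end-to-end sketch that builds a synthetic sparse recovery problem with the utilities described in the Synthetic Signals section below and solves it with the OMP solver; the structure of the returned result object is not detailed here, so inspect it in your own session:
% build a random Gaussian dictionary and a synthetic K-sparse representation
M = 64; N = 121; K = 4;
Dict = spx.dict.simple.gaussian_dict(M, N);
gen = spx.data.synthetic.SparseSignalGenerator(N, K);
rep = gen.biGaussian();          % sparse representation vector
y = Dict * rep;                  % signal to be approximated
% recover the representation with orthogonal matching pursuit
solver = spx.pursuit.single.OrthogonalMatchingPursuit(Dict, K);
result = solver.solve(y);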
Basis pursuit and its variations¶
Basis pursuit is a way of solving the sparse recovery problem via \(\ell_1\) minimization. We provide multiple implementations for different variations of the problem.
Note
These algorithms depend on the CVX toolbox. Please make sure to install it before using them.
Constructing the solver with dictionary and set of signals to be solved arranged in a signal matrix:
solver = spx.pursuit.single.BasisPursuit(Dict, Y)
Solving using LASSO method:
result = solver.solve_lasso(lambda)
result = solver.solve_lasso()
Solving using \(\ell_1\) minimization assuming that signals are exact sparse:
result = solver.solve_l1_exact()
Solving using \(\ell_1\) minimization assuming that signals are noisy:
result = solver.solve_l1_noise()
Compressive sampling matching pursuit¶
Constructing the solver with dictionary and expected sparsity level:
solver = spx.pursuit.single.CoSaMP(Dict, K)
Using the solver to obtain the sparse representation of one vector:
result = solver.solve(y)
Using the solver to obtain the sparse representations of all vectors in the signal matrix Y independently:
result = solver.solve_all(Y)
Introduction¶
This section focuses on methods which solve the sparse recovery or sparse approximation problems for one vector at a time. A subsection on joint recovery algorithms focuses on solving problems where multiple vectors which have largely common supports can be solved jointly.
For each algorithm, there is a solver. The solver should be constructed first with the dictionary / sensing matrix and some other parameters like sparsity level as needed by the algorithm.
The solver can then be used for solving one problem at a time.
Common utilities¶
Contents:
Signals¶
Our focus is usually on finite dimensional signals. Such signals are usually stored as column vectors in MATLAB. A set of signals with the same dimensions can be stored together in the form of a matrix where each column of the matrix is one signal. Such a matrix of signals is called a signal matrix.
In this section we describe some helper utility functions which provide extra functionality on top of existing support in MATLAB.
General¶
Constructing unit (column) vector in a given co-ordinate:
>> N = 8; i = 2;
>> spx.vector.unit_vector(N, i)'
0 1 0 0 0 0 0 0
Sparsification¶
Finding the K-largest indices of a given signal:
>> x = [0 0 0 1 0 0 -1 0 0 -2 0 0 -3 0 0 7 0 0 4 0 0 -6];
>> K=4;
>> spx.commons.signals.largest_indices(x, K)'
16 22 19 13
Constructing the sparse approximation of x
with K
largest indices:
>> spx.commons.signals.sparseApproximation(x, K)'
0 0 0 0 0 0 0 0 0 0 0 0 -3 0 0 7 0 0 4 0 0 -6
Searching¶
spx.commons.signals.find_first_signal_with_energy_le
finds the first signal in a signal matrix X
with an energy less than or equal to
a given threshold
energy:
[x, i] = spx.commons.signals.find_first_signal_with_energy_le(X, threshold);
x
is the first signal with energy less
than the given threshold.
i
is the index of the column in X
holding
this signal.
Working with matrices¶
Simple checks on matrices¶
Let us create a simple matrix:
A = magic(3);
Checking whether the matrix is a square matrix:
spx.matrix.is_square(A)
Checking if it is symmetric:
spx.matrix.is_symmetric(A)
Checking if it is a Hermitian matrix:
spx.matrix.is_hermitian(A)
Checking if it is a positive definite matrix:
spx.matrix.is_positive_definite(A)
Matrix utilities¶
spx.matrix.off_diagonal_elements
returns
the off-diagonal elements of a given matrix
in a column vector arranged in column major order.
A = magic(3);
spx.matrix.off_diagonal_elements(A)'
ans =
3 4 1 9 6 7
spx.matrix.off_diagonal_matrix
zeros out
the diagonal entries of a matrix and
returns the modified matrix:
spx.matrix.off_diagonal_matrix(A)
ans =
0 1 6
3 0 7
4 9 0
spx.matrix.off_diag_upper_tri_matrix
returns
the off diagonal part of the upper triangular part
of a given matrix and zeros out the remaining entries:
spx.matrix.off_diag_upper_tri_matrix(A)
ans =
0 1 6
0 0 7
0 0 0
spx.matrix.off_diag_upper_tri_elements
returns the
elements in the off diagonal part of the upper
triangular part of a matrix arranged in column major
order:
spx.matrix.off_diag_upper_tri_elements(A)'
ans =
1 6 7
spx.matrix.nonzero_density
returns the ratio
of total number of non-zero elements in a matrix
with the size of the matrix:
spx.matrix.nonzero_density(A)
ans = 1
Diagonally dominant matrices¶
Checking whether a matrix is diagonally dominant:
spx.matrix.is_diagonally_dominant(A)
Making a matrix diagonally dominant:
A = spx.matrix.make_diagonally_dominant(A)
Both these functions have an extra parameter
named strict
. When set to true, strict
diagonal dominance is considered / enforced.
Norms and distances¶
Distance measurement utilities¶
Let X
be a matrix. Treat each column of X
as a signal.
Euclidean distance between each signal pair can be computed by:
spx.commons.distance.pairwise_distances(X)
If X
contains N signals, then the result
is an N x N
matrix whose (i, j)-th entry
contains the distance between i-th and j-th
signal. Naturally, the diagonal elements are all
zero.
An additional second argument can be
provided to specify the distance measure
to be used. See the documentation of
MATLAB pdist
function for supported
distance functions.
For example, for measuring city-block distance between each pair of signals, use:
spx.commons.distance.pairwise_distances(X, 'cityblock')
The following dedicated functions are faster.
Squared \(\ell_2\) distances between all pairs of columns of X:
spx.commons.distance.sqrd_l2_distances_cw(X)
Squared \(\ell_2\) distances between all pairs of rows of X:
spx.commons.distance.sqrd_l2_distances_rw(X)
Norm utilities¶
These functions help in computing norm or normalizing signals in a signal matrix.
Compute \(\ell_1\) norm of each column vector:
spx.norm.norms_l1_cw(X)
Compute \(\ell_2\) norm of each column vector:
spx.norm.norms_l2_cw(X)
Compute \(\ell_{\infty}\) norm of each column vector:
spx.norm.norms_linf_cw(X)
Normalize each column vector w.r.t. \(\ell_1\) norm:
spx.norm.normalize_l1(X)
Normalize each column vector w.r.t. \(\ell_2\) norm:
spx.norm.normalize_l2(X)
Normalize each row vector w.r.t. \(\ell_2\) norm:
spx.norm.normalize_l2_rw(X)
Normalize each column vector w.r.t. \(\ell_{\infty}\) norm:
spx.norm.normalize_linf(X)
Scale each column vector by a separate factor:
spx.norm.scale_columns(X, factors)
Scale each row vector by a separate factor:
spx.norm.scale_rows(X, factors)
Compute the inner product of each column vector in A with each column vector in B:
spx.norm.inner_product_cw(A, B)
Sparse signals¶
Working with signal support¶
Let’s create a sparse vector:
>> x = [0 0 0 1 0 0 -1 0 0 -2 0 0 -3 0 0 7 0 0 4 0 0 -6];
Sparse support for a vector:
>> spx.commons.sparse.support(x)
4 7 10 13 16 19 22
\(\ell_0\) “norm” of a vector:
>> spx.commons.sparse.l0norm(x)
7
Let us create one more signal:
>> y = [3 0 0 0 0 0 0 0 0 4 0 0 -6 0 0 -5 0 0 -4 0 8 0];
>> spx.commons.sparse.l0norm(y)
6
>> spx.commons.sparse.support(y)
1 10 13 16 19 21
Support intersection ratio:
>> spx.commons.sparse.support_intersection_ratio(x, y)
0.1364
It is the ratio between the number of common indices in the supports of x and y and the maximum of the sizes of the supports of x and y.
Average support similarity of a reference signal with a set of signals X (each signal as a column vector):
spx.commons.sparse.support_similarity(X, reference)
Support similarities between two sets of signals (pairwise):
spx.commons.sparse.support_similarities(X, Y)
Support detection ratios
spx.commons.sparse.support_detection_rate(X, trueSupport)
K largest indices over a set of vectors:
spx.commons.sparse.dominant_support_merged(data, K)
Sometimes it’s useful to identify and arrange the non-zero entries in a signal in descending order of their magnitude:
>> spx.commons.sparse.sorted_non_zero_elements(x)
16 22 19 13 10 4 7
7 -6 4 -3 -2 1 -1
Given a signal x, the function spx.commons.sparse.sorted_non_zero_elements returns a two-row matrix where the first row contains the locations of the non-zero elements sorted by magnitude in descending order and the second row contains their values. If two non-zero elements have the same magnitude, the original order is maintained; the sorting is stable.
Comparing signals¶
Comparing sparse or approximately sparse signals¶
spx.commons.SparseSignalsComparison
class provides a number of
methods to compare two sets of sparse signals. It is
typically used to compare a set of original sparse signals
with corresponding recovered sparse signals.
Let us create two signals of size \(N=256\) with sparsity level \(K=4\), with the non-zero entries having magnitudes chosen uniformly from \([1, 2]\):
N = 256;
K = 4;
% Constructing a sparse vector
% Choosing the support randomly
Omega = randperm(N, K);
% Number of signals
S = 2;
% Original signals
X = zeros(N, S);
% Choosing non-zero values uniformly between (-b, -a) and (a, b)
a = 1;
b = 2;
% unsigned magnitudes of non-zero entries
XM = a + (b-a).*rand(K, S);
% Generate sign for non-zero entries randomly
sgn = sign(randn(K, S));
% Combine sign and magnitude
XMS = sgn .* XM;
% Place at the right non-zero locations
X(Omega, :) = XMS;
Let us create a noisy version of these signals with noise only in the non-zero entries at 15 dB of SNR:
% Creating noise using helper function
SNR = 15;
Noise = spx.data.noise.Basic.createNoise(XMS, SNR);
Y = X;
Y(Omega, :) = Y(Omega, :) + Noise;
Let us create an instance of sparse signal comparison class:
cs = spx.commons.SparseSignalsComparison(X, Y, K);
Norms of difference signals [X - Y]:
cs.difference_norms()
Norms of original signals [X]:
cs.reference_norms()
Norms of estimated signals [Y]:
cs.estimate_norms()
Ratios between signal error norms and original signal norms:
cs.error_to_signal_norms()
SNR for each signal:
cs.signal_to_noise_ratios()
In case the signals X and Y are not truly sparse, spx.commons.SparseSignalsComparison has the ability to sparsify them by choosing the K largest (magnitude) entries for each signal in the reference signal set and the estimated signal set. K is an input parameter taken by the class.
We can access the sparsified reference signals:
cs.sparse_references()
We can access the sparsified estimated signals:
cs.sparse_estimates()
We can also examine the support index set for each sparsified reference signal:
cs.reference_sparse_supports()
Ditto for the supports of sparsified estimated signals:
cs.estimate_sparse_supports()
We can measure the support similarity ratio for each signal:
cs.support_similarity_ratios()
We can find out which of the signals have a support similarity above a specified threshold:
cs.has_matching_supports(1.0)
Overall analysis can be easily summarized and printed for each signal:
cs.summarize()
Here is the output
Signal dimension: 256
Number of signals: 2
Combined reference norm: 4.56207362
Combined estimate norm: 4.80070407
Combined difference norm: 0.81126416
Combined SNR: 15.0000 dB
Specified sparsity level: 4
Signal: 1
Reference norm: 2.81008750
Estimate norm: 2.91691022
Error norm: 0.49971207
SNR: 15.0000 dB
Support similarity ratio: 1.00
Signal: 2
Reference norm: 3.59387311
Estimate norm: 3.81292464
Error norm: 0.63909106
SNR: 15.0000 dB
Support similarity ratio: 1.00
Signal space comparison¶
For comparing signals which are not sparse, we have another helper utility class spx.commons.SignalsComparison.
Assuming X is a signal matrix (with each column treated as a signal) and Y is its noisy version, we create the signal comparison instance as:
cs = spx.commons.SignalsComparison(X, Y);
Most functions are similar to what we had for
spx.commons.SparseSignalsComparison
:
cs.difference_norms()
cs.reference_norms()
cs.estimate_norms()
cs.error_to_signal_norms()
cs.signal_to_noise_ratios()
cs.summarize()
Working with Numbers¶
Some algorithms from number theory are useful at times.
Finding integer factors closest to square root:
>> [a,b] = spx.discrete.number.integer_factors_close_to_sqr_root(120)
a = 10
b = 12
Printing utilities¶
Sparse signals¶
Printing a sparse signal as pairs of locations and values:
>> x = [0 0 0 1 0 0 -1 0 0 -2 0 0 -3 0 0 7 0 0 4 0 0 -6];
>> spx.io.print.sparse_signal(x)
(4,1) (7,-1) (10,-2) (13,-3) (16,7) (19,4) (22,-6) N=22, K=7
Printing the non-zero entries in a signal in descending order of magnitude with location and value:
>> spx.io.print.sorted_sparse_signal(x)
Index: Value
16: 7.000000
22: -6.000000
19: 4.000000
13: -3.000000
10: -2.000000
4: 1.000000
7: -1.000000
Latex¶
Printing a vector in a format suitable for Latex:
>> spx.io.latex.printVector([1, 2, 3, 4])
\begin{pmatrix}
1 & 2 & 3 & 4
\end{pmatrix}
Printing a matrix in a format suitable for Latex:
>> spx.io.latex.printMatrix(randn(3, 4))
\begin{bmatrix}
-0.340285 & 1.13915 & 0.65748 & 0.0187744\\
-0.925848 & 0.427361 & 0.584246 & -0.425961\\
0.00532169 & 0.181032 & -1.61645 & -2.03403
\end{bmatrix}
Printing a vector as a set in Latex:
>> spx.io.latex.printSet([1, 2, 3, 4])
\{ 1 , 2 , 3 , 4 \}
SciRust¶
SciRust is a related scientific computing library developed by us. Some helper functions have been written to convert MATLAB data into SciRust compatible Rust source code.
Printing a matrix for consumption in SciRust source code:
>> spx.io.scirust.printMatrix(magic(3))
matrix_rw_f64(3, 3, [
8.0, 1.0, 6.0,
3.0, 5.0, 7.0,
4.0, 9.0, 2.0
]);
Sparse recovery¶
Estimate of the required number of measurements for sparse signals of dimension N and sparsity level K, based on the paper by Donoho and Tanner:
M = spx.commons.sparse.phase_transition_estimate_m(N, K);
Example:
>> spx.commons.sparse.phase_transition_estimate_m(1000, 4)
60
Synthetic Signals¶
Some easy to setup recovery problems¶
General approach:
m = 64;
n = 121;
k = 4;
dict = spx.dict.simple.gaussian_dict(m, n);
gen = spx.data.synthetic.SparseSignalGenerator(n, k);
% create a sparse vector
rep = gen.biGaussian();
signal = dict*rep;
problem.dictionary = dict;
problem.representation_vector = rep;
problem.sparsity_level = k;
problem.signal_vector = signal;
The problems:
problem = spx.data.synthetic.recovery_problems.problem_small_1()
problem = spx.data.synthetic.recovery_problems.problem_large_1()
problem = spx.data.synthetic.recovery_problems.problem_barbara_blocks()
Sparse signal generation¶
Create generator:
N = 256; K = 4; S = 10;
gen = spx.data.synthetic.SparseSignalGenerator(N, K, S);
Uniform signals:
result = gen.uniform();
result = gen.uniform(1, 2);
result = gen.uniform(-1, 1);
Bi-uniform signals:
result = gen.biUniform();
result = gen.biUniform(1, 2);
Gaussian signals:
result = gen.gaussian();
BiGaussian signals:
result = gen.biGaussian();
result = gen.biGaussian(2.0);
result = gen.biGaussian(10.0, 1.0);
Compressible signal generation¶
We can use the randcs function by V. Cevher for constructing compressible signals:
N = 100;
q = 1;
x = randcs(N, q);
plot(x);
plot(randcs(100, .9));
plot(randcs(100, .8));
plot(randcs(100, .7));
plot(randcs(100, .6));
plot(randcs(100, .5));
plot(randcs(100, .4));
lambda = 2;
x = randcs(N, q, lambda);
dist = 'logn';
x = randcs(N, q, lambda, dist);
Multi-subspace signal generation¶
Signals with disjoint supports:
% Dimension of representation space
N = 80;
% Number of subspaces
P = 8;
% Number of signals per subspace
SS = 10;
% Sparsity level of each signal (subspace dimension)
K = 4;
% Create signal generator
sg = spx.data.synthetic.MultiSubspaceSignalGenerator(N, K);
% Create disjoint supports
sg.createDisjointSupports(P);
sg.setNumSignalsPerSubspace(SS);
% Generate signal representations
sg.biUniform(1, 4);
% Access signal representations
X = sg.X;
% Corresponding supports
qs = sg.Supports;
Graphics and visualization¶
In this section we cover:
- Some utility classes which help in specific visualization tasks
- Some external open source libraries / functions which have been integrated in sparse-plex to make visualization tasks easier
- Some general techniques for specific visualization tasks
Create a full screen figure:
spx.graphics.figure.full_screen;
Multiple figures:
mf = spx.graphics.Figures();
mf.new_figure('fig 1');
mf.new_figure('fig 2');
mf.new_figure('fig 3');
All these figures will be created with same width and height. They will be placed one after another in a stacked manner.
Controlling size of multiple figures:
width = 1000;
height = 400;
mf = spx.graphics.Figures(width, height);
Display a Gram matrix for a given dictionary Phi
:
spx.graphics.display.display_gram_matrix(Phi);
Canvas of a grid of images¶
Sometimes we wish to show a set of small images in the form of a grid. These images may be patches from a larger image or may be small independent images themselves.
spx.graphics.canvas
helps in
combining the images in the form
of a grid on a canvas image.
We provide all the images to be displayed in the form of a matrix where each column consists of one image.
Creating a canvas of image patches:
% Let us create some random images of size 50x50
width = 50;
height = 50;
rows = 10;
cols = 10;
images = 255* rand(width*height, rows*cols);
% Let's create a canvas of these images formed into a
% 10 x 10 grid.
canvas = spx.graphics.canvas.create_image_grid(images, rows, cols, ...
height, width);
% Let's convert the canvas to UINT8 image
canvas = uint8(canvas);
% Let's show the image
imshow(canvas);
% Let's set the proper colormap.
colormap(gray);
% Axis sizing etc.
axis image;
axis off;
Displaying a set of signals in the form of a matrix¶
While working on joint signal recovery problems, we need to visualize a set of signals together. They can be put together in a signal matrix where each column is one (finite dimensional) signal. It is straightforward to create a visualization for these signals:
num_signals = 100;
signal_size = 80;
signal_matrix = randn(signal_size, num_signals);
% Let's create a canvas and put all the signals on it.
canvas = spx.graphics.canvas.create_signal_matrix_canvas(signal_matrix);
% Let's show the image
imshow(canvas);
% Let's set the proper colormap.
colormap(gray);
% Axis sizing etc.
axis image;
axis off;
Some third party open source libraries¶
Put a title over all subplots:
spx.graphics.suptitle(title);
This function is by Drea Thomas.
RGB code for given colorname:
c = spx.graphics.rgb('DarkRed')
c = spx.graphics.rgb('Green')
plot(x,y,'color',spx.graphics.rgb('orange'))
This function is by Kristján Jónasson and is in the public domain.
Supported colors:
%White colors
'FF','FF','FF', 'White'
'FF','FA','FA', 'Snow'
'F0','FF','F0', 'Honeydew'
'F5','FF','FA', 'MintCream'
'F0','FF','FF', 'Azure'
'F0','F8','FF', 'AliceBlue'
'F8','F8','FF', 'GhostWhite'
'F5','F5','F5', 'WhiteSmoke'
'FF','F5','EE', 'Seashell'
'F5','F5','DC', 'Beige'
'FD','F5','E6', 'OldLace'
'FF','FA','F0', 'FloralWhite'
'FF','FF','F0', 'Ivory'
'FA','EB','D7', 'AntiqueWhite'
'FA','F0','E6', 'Linen'
'FF','F0','F5', 'LavenderBlush'
'FF','E4','E1', 'MistyRose'
%Gray colors
'80','80','80', 'Gray'
'DC','DC','DC', 'Gainsboro'
'D3','D3','D3', 'LightGray'
'C0','C0','C0', 'Silver'
'A9','A9','A9', 'DarkGray'
'69','69','69', 'DimGray'
'77','88','99', 'LightSlateGray'
'70','80','90', 'SlateGray'
'2F','4F','4F', 'DarkSlateGray'
'00','00','00', 'Black'
%Red colors
'FF','00','00', 'Red'
'FF','A0','7A', 'LightSalmon'
'FA','80','72', 'Salmon'
'E9','96','7A', 'DarkSalmon'
'F0','80','80', 'LightCoral'
'CD','5C','5C', 'IndianRed'
'DC','14','3C', 'Crimson'
'B2','22','22', 'FireBrick'
'8B','00','00', 'DarkRed'
%Pink colors
'FF','C0','CB', 'Pink'
'FF','B6','C1', 'LightPink'
'FF','69','B4', 'HotPink'
'FF','14','93', 'DeepPink'
'DB','70','93', 'PaleVioletRed'
'C7','15','85', 'MediumVioletRed'
%Orange colors
'FF','A5','00', 'Orange'
'FF','8C','00', 'DarkOrange'
'FF','7F','50', 'Coral'
'FF','63','47', 'Tomato'
'FF','45','00', 'OrangeRed'
%Yellow colors
'FF','FF','00', 'Yellow'
'FF','FF','E0', 'LightYellow'
'FF','FA','CD', 'LemonChiffon'
'FA','FA','D2', 'LightGoldenrodYellow'
'FF','EF','D5', 'PapayaWhip'
'FF','E4','B5', 'Moccasin'
'FF','DA','B9', 'PeachPuff'
'EE','E8','AA', 'PaleGoldenrod'
'F0','E6','8C', 'Khaki'
'BD','B7','6B', 'DarkKhaki'
'FF','D7','00', 'Gold'
%Brown colors
'A5','2A','2A', 'Brown'
'FF','F8','DC', 'Cornsilk'
'FF','EB','CD', 'BlanchedAlmond'
'FF','E4','C4', 'Bisque'
'FF','DE','AD', 'NavajoWhite'
'F5','DE','B3', 'Wheat'
'DE','B8','87', 'BurlyWood'
'D2','B4','8C', 'Tan'
'BC','8F','8F', 'RosyBrown'
'F4','A4','60', 'SandyBrown'
'DA','A5','20', 'Goldenrod'
'B8','86','0B', 'DarkGoldenrod'
'CD','85','3F', 'Peru'
'D2','69','1E', 'Chocolate'
'8B','45','13', 'SaddleBrown'
'A0','52','2D', 'Sienna'
'80','00','00', 'Maroon'
%Green colors
'00','80','00', 'Green'
'98','FB','98', 'PaleGreen'
'90','EE','90', 'LightGreen'
'9A','CD','32', 'YellowGreen'
'AD','FF','2F', 'GreenYellow'
'7F','FF','00', 'Chartreuse'
'7C','FC','00', 'LawnGreen'
'00','FF','00', 'Lime'
'32','CD','32', 'LimeGreen'
'00','FA','9A', 'MediumSpringGreen'
'00','FF','7F', 'SpringGreen'
'66','CD','AA', 'MediumAquamarine'
'7F','FF','D4', 'Aquamarine'
'20','B2','AA', 'LightSeaGreen'
'3C','B3','71', 'MediumSeaGreen'
'2E','8B','57', 'SeaGreen'
'8F','BC','8F', 'DarkSeaGreen'
'22','8B','22', 'ForestGreen'
'00','64','00', 'DarkGreen'
'6B','8E','23', 'OliveDrab'
'80','80','00', 'Olive'
'55','6B','2F', 'DarkOliveGreen'
'00','80','80', 'Teal'
%Blue colors
'00','00','FF', 'Blue'
'AD','D8','E6', 'LightBlue'
'B0','E0','E6', 'PowderBlue'
'AF','EE','EE', 'PaleTurquoise'
'40','E0','D0', 'Turquoise'
'48','D1','CC', 'MediumTurquoise'
'00','CE','D1', 'DarkTurquoise'
'E0','FF','FF', 'LightCyan'
'00','FF','FF', 'Cyan'
'00','FF','FF', 'Aqua'
'00','8B','8B', 'DarkCyan'
'5F','9E','A0', 'CadetBlue'
'B0','C4','DE', 'LightSteelBlue'
'46','82','B4', 'SteelBlue'
'87','CE','FA', 'LightSkyBlue'
'87','CE','EB', 'SkyBlue'
'00','BF','FF', 'DeepSkyBlue'
'1E','90','FF', 'DodgerBlue'
'64','95','ED', 'CornflowerBlue'
'41','69','E1', 'RoyalBlue'
'00','00','CD', 'MediumBlue'
'00','00','8B', 'DarkBlue'
'00','00','80', 'Navy'
'19','19','70', 'MidnightBlue'
%Purple colors
'80','00','80', 'Purple'
'E6','E6','FA', 'Lavender'
'D8','BF','D8', 'Thistle'
'DD','A0','DD', 'Plum'
'EE','82','EE', 'Violet'
'DA','70','D6', 'Orchid'
'FF','00','FF', 'Fuchsia'
'FF','00','FF', 'Magenta'
'BA','55','D3', 'MediumOrchid'
'93','70','DB', 'MediumPurple'
'99','66','CC', 'Amethyst'
'8A','2B','E2', 'BlueViolet'
'94','00','D3', 'DarkViolet'
'99','32','CC', 'DarkOrchid'
'8B','00','8B', 'DarkMagenta'
'6A','5A','CD', 'SlateBlue'
'48','3D','8B', 'DarkSlateBlue'
'7B','68','EE', 'MediumSlateBlue'
'4B','00','82', 'Indigo'
%Gray repeated with spelling grey
'80','80','80', 'Grey'
'D3','D3','D3', 'LightGrey'
'A9','A9','A9', 'DarkGrey'
'69','69','69', 'DimGrey'
'77','88','99', 'LightSlateGrey'
'70','80','90', 'SlateGrey'
'2F','4F','4F', 'DarkSlateGrey'
Dictionaries¶
Basic Dictionaries¶
Some simple dictionaries can be constructed using library functions.
The dictionaries are available in two flavors:
- As simple matrices
- As objects which implement the spx.dict.Operator abstraction defined below.
The functions returning the dictionary as a simple matrix have the suffix “mtx”. The functions returning the dictionary as a spx.dict.Operator have the suffix “dict”.
These functions can also be used to construct random sensing matrices which are essentially random dictionaries.
Dirac Fourier Dictionary:
spx.dict.simple.dirac_fourier_dict(N)
Dirac DCT Dictionary:
spx.dict.simple.dirac_dct_dict(N)
Gaussian Dictionary:
spx.dict.simple.gaussian_dict(N, D, normalized_columns)
Rademacher Dictionary:
Phi = spx.dict.simple.rademacher_dict(N, D);
Partial Fourier Dictionary:
Phi = spx.dict.simple.partial_fourier_dict(N, D);
Overcomplete 1-D DCT dictionary:
spx.dict.simple.overcomplete1DDCT(N, D)
Overcomplete 2-D DCT dictionary:
spx.dict.simple.overcomplete2DDCT(N, D)
Dictionaries from SPIE2011 paper:
spx.dict.simple.spie_2011(name) % ahoc, orth, rand, sine
Sensing matrices¶
Gaussian sensing matrix:
Phi = spx.dict.simple.gaussian_mtx(M, N);
Rademacher sensing matrix:
Phi = spx.dict.simple.rademacher_mtx(M, N);
Partial Fourier matrix:
Phi = spx.dict.simple.partial_fourier_mtx(M, N);
Operators¶
In simple terms, a (finite) dictionary is implemented as a matrix whose columns are the atoms of the dictionary. This approach is not powerful enough. A dictionary \(\Phi\) usually acts on a sparse representation \(\alpha\) to obtain a signal \(x = \Phi \alpha\). During sparse recovery, the Hermitian transpose of the dictionary acts on the signal [or residual] to compute \(\Phi^H x\) or \(\Phi^H r\). Thus, the fundamental operations are multiplication by \(\Phi\) and \(\Phi^H\). While these operations can be implemented directly using a matrix representation of the dictionary, that approach is slow and requires large storage for the dictionary. For random dictionaries, this is the only option. But for structured dictionaries and sensing matrices, the whole dictionary need not be held in memory; multiplication by \(\Phi\) and \(\Phi^H\) can be implemented using fast functions.
Also multiple dictionaries can be combined to construct a composite dictionary, e.g. \(\Phi \Psi\).
In order to take care of these scenarios, we define the notion of a generic operator in the abstract class spx.dict.Operator.
All operators support the following methods.
Constructing a matrix representation of the operator:
op.double()
Computing \(\Phi x\):
op.mtimes(x)
The transpose operator:
op.transpose()
By default it is constructed by computing the matrix representation of the transpose of the operator. But specialized dictionaries can implement it smartly.
The Hermitian transpose operator:
op.ctranspose()
By default it is constructed by computing the matrix representation of the Hermitian transpose of the operator. But specialized dictionaries can implement it smartly.
Obtaining specific columns from the operator:
op.columns(columns)
Note that this doesn’t require computing the complete matrix representation of the operator.
Applying selected columns of the operator to a set of vectors:
op.apply_columns(vectors, columns)
Constructing an operator which uses only the specified columns from this dictionary:
op.columns_operator(columns)
A specific column of the dictionary:
op.column(index)
Printing the contents of the dictionary:
disp(op)
Matrix operators¶
Matrix operators are constructed by wrapping a given matrix into spx.dict.MatrixOperator which is a subclass of spx.dict.Operator.
Constructing the matrix operator from a matrix A:
op = spx.dict.MatrixOperator(A)
The matrix operator holds references to the matrix as well as its Hermitian transpose:
op.A
op.AH
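As an illustration, here is a hedged usage sketch which exercises only the methods documented above; the matrix A and the vector x are arbitrary examples:
A = randn(10, 20);               % an arbitrary example matrix
op = spx.dict.MatrixOperator(A); % wrap it as an operator
x = randn(20, 1);
y = op.mtimes(x);                % computes A * x
B = op.double();                 % full matrix representation of the operator
cols = op.columns([1 3 5]);      % specific columns without forming B first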
Composite Operators¶
A composite operator can be created by combining two or more operators:
co = spx.dict.CompositeOperator(f, g)
Unitary/Orthogonal matrices¶
spx.dict.unitary.uniform_normal_qr(n)
spx.dict.unitary.analyze_rr(O)
spx.dict.unitary.synthesize_rr(rotations, reflections)
spx.dict.unitary.givens_rot(a, b)
Dictionary Properties¶
dp = spx.dict.Properties(Dict)
dp.gram_matrix()
dp.abs_gram_matrix()
dp.frame_operator()
dp.singular_values()
dp.gram_eigen_values()
dp.lower_frame_bound()
dp.upper_frame_bound()
dp.coherence()
Coherence of a dictionary:
mu = spx.dict.coherence(dict)
Babel function of a dictionary:
mu = spx.dict.babel(dict)
Spark of a dictionary (for small sizes):
[ K, columns ] = spx.dict.spark( Phi )
Equiangular Tight Frames¶
spx.dict.etf.ss_to_etf(M)
spx.dict.etf.is_etf(F)
spx.dict.etf.ss_etf_structure(k, v)
Grassmannian Frames¶
spx.dict.grassmannian.minimum_coherence(m, n)
spx.dict.grassmannian.n_upper_bound(m)
spx.dict.grassmannian.min_coherence_max_n(ms)
spx.dict.grassmannian.max_n_for_coherence(m, mu)
spx.dict.grassmannian.alternate_projections(dict, options)
Vector Spaces¶
Our work is focused on finite dimensional vector spaces \(\mathbb{R}^N\) or \(\mathbb{C}^N\). We represent a vector space by a basis in the vector space. In this section, we describe several useful functions for working with one or more vector spaces (represented by one basis per vector space).
Basis for intersection of two subspaces:
result = spx.la.spaces.insersection_space(A, B)
Orthogonal complement of A in B:
result = spx.la.spaces.orth_complement(A, B)
Principal angles between subspaces spanned by A and B:
result = spx.la.spaces.principal_angles_cos(A, B);
result = spx.la.spaces.principal_angles_radian(A, B);
result = spx.la.spaces.principal_angles_degree(A, B);
Smallest principal angle between subspaces spanned by A and B:
result = spx.la.spaces.smallest_angle_cos(A, B);
result = spx.la.spaces.smallest_angle_rad(A, B);
result = spx.la.spaces.smallest_angle_deg(A, B);
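For illustration, a minimal sketch (assuming the API above) which measures the smallest principal angle between two random 4-dimensional subspaces of \(\mathbb{R}^{20}\):
A = orth(randn(20, 4));   % orthonormal basis for the first subspace
B = orth(randn(20, 4));   % orthonormal basis for the second subspace
theta = spx.la.spaces.smallest_angle_deg(A, B)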
Principal angle between two orthogonal bases:
result = spx.la.spaces.principal_angles_orth_cos(A, B)
result = spx.la.spaces.smallest_angle_orth_cos(A, B);
Smallest angles between subspaces:
result = spx.la.spaces.smallest_angles_cos(subspaces, d)
result = spx.la.spaces.smallest_angles_rad(subspaces, d)
result = spx.la.spaces.smallest_angles_deg(subspaces, d)
Distance between subspaces based on Grassmannian space:
result = spx.la.spaces.subspace_distance(A, B)
This is computed as the operator norm of the difference between projection matrices for two subspaces.
Check if v in range of unitary matrix U:
result = spx.la.spaces.is_in_range_orth(v, U)
Check if v in range of A:
result = spx.la.spaces.is_in_range(v, A)
A basis for matrix A:
result = spx.la.spaces.find_basis(A)
Elementary matrices product and row reduced echelon form:
[E, R] = spx.la.spaces.elim(A)
Basis for null space of A:
result = spx.la.spaces.null_basis(A)
Bases for four fundamental spaces:
[col_space, null_space, row_space, left_null_space] = spx.la.spaces.four_bases(A)
[col_space, null_space, row_space, left_null_space] = spx.la.spaces.four_orth_bases(A)
Utility for constructing specific examples¶
Two spaces at a given angle:
[A, B] = spx.data.synthetic.subspaces.two_spaces_at_angle(N, theta)
Three spaces at a given angle:
[A, B, C] = spx.la.spaces.three_spaces_at_angle(N, theta)
Three disjoint spaces at a given angle:
[A, B, C] = spx.la.spaces.three_disjoint_spaces_at_angle(N, theta)
Map data from k dimensions to n dimensions:
result = spx.la.spaces.k_dim_to_n_dim(X, n, indices)
Describing relations between three spaces:
spx.la.spaces.describe_three_spaces(A, B, C);
Usage:
d = 4;
theta = 10;
n = 20;
[A, B, C] = spx.la.spaces.three_disjoint_spaces_at_angle(deg2rad(theta), d);
spx.la.spaces.describe_three_spaces(A, B, C);
% Put them together
X = [A B C];
% Put them to bigger dimension
X = spx.la.spaces.k_dim_to_n_dim(X, n);
% Perform a random orthonormal transformation
O = orth(randn(n));
X = O * X;
Combinatorics¶
Steiner Systems¶
Steiner system with block size 2:
v = 10;
m = spx.discrete.steiner_system.ss_2(v);
Steiner system with block size 3 (STS Steiner Triple System):
m = spx.discrete.steiner_system.ss_3(v);
Bose construction for STS system for v = 6n + 3:
m = spx.discrete.steiner_system.ss_3_bose(v);
Verify if a given incidence matrix is a Steiner system:
spx.discrete.steiner_system.is_ss(M, k)
Latin square construction:
spx.discrete.steiner_system.commutative_idempotent_latin_square(n)
Verify if a table is a Latin square:
spx.discrete.steiner_system.is_latin_square(table)
Matrix factorization algorithms¶
Note
Better implementations of these algorithms may be available in the stock MATLAB distribution or other third party libraries. These codes were developed for instructional purposes and because variations of these algorithms were needed in the development of other algorithms in this package.
Various versions of QR Factorization¶
Gram Schmidt:
[Q, R] = spx.la.qr.gram_schmidt(A)
Householder UR:
[U, R] = spx.la.qr.householder_ur(A)
Householder QR:
[Q, R] = spx.la.qr.householder_qr(A)
Householder matrix for a given vector:
[H, v] = spx.la.qr.householder_matrix(x)
External Code¶
Checking whether two values are almost equal within a tolerance:
isalmost(a,b,tol)
Timing¶
[t, measurement_overhead, measurement_details] = timeit(f, num_outputs)
Noise¶
Noise generation¶
Gaussian noise:
ng = spx.data.noise.Basic(N, S);
sigma = 1;
mean = 0;
ng.gaussian(sigma, mean);
Creating noise at a specific SNR:
% Sparse signal dimension
N = 100;
% Sparsity level
K = 20;
% Number of signals
S = 4;
% Create sparse signals
signals = spx.data.synthetic.SparseSignalGenerator(N, K, S).gaussian();
% Create noise at specific SNR level.
snrDb = 10;
noises = spx.data.noise.Basic.createNoise(signals, snrDb);
% add signal to noise
signals_with_noise = signals + noises;
% Verify SNR level
20 * log10 (spx.norm.norms_l2_cw(signals) ./ spx.norm.norms_l2_cw(noises))
Noise measurement¶
SNR in dB:
result = spx.commons.snr.SNR(signals, noises)
SNR in dB from signal and reconstruction:
reconstructions = signals_with_noise;
result = spx.commons.snr.recSNRdB(signals, reconstructions)
Signal energy in dB:
result = spx.commons.snr.energyDB(signals)
Reconstruction SNR as energy ratio:
result = spx.commons.snr.recSNR(signal, reconstruction)
Error energy normalized by signal energy:
result = spx.commons.snr.normalizedErrorEnergy(signal, reconstruction)
Reconstruction SNRs over multiple signals in dB:
result = spx.commons.snr.recSNRsdB(signals, reconstructions)
Reconstruction SNRs over multiple signals as energy ratios:
result = spx.commons.snr.recSNRs(signals, reconstructions)
Signal energies:
result = spx.commons.snr.energies(signals)
Signal energies in dB:
result = spx.commons.snr.energiesDB(signals)
Exercises¶
The best way to learn is by doing exercises yourself. In this section, we present a set of computer exercises which help you learn the fundamentals of sparse representations: algorithms and applications.
Most of these exercises are implemented in some form or other as part of the sparse-plex library. Once you have written your own implementations, you may hunt for the code in the library and compare your implementation with the reference implementation.
The exercises are described in terms of MATLAB programming environment. But they can be easily developed in other programming environments too.
Throughout these exercises, we will develop a set of functions which are reusable for performing various tasks related to sparse representation problems. We suggest that you collect the functions you develop in one place so that you can implement the more sophisticated exercises easily later.
Creating a sparse signal¶
The first aspect is deciding the support for the sparse signal.
- Decide on the length of signal N=1024.
- Decide on the sparsity level K=10.
- Choose K entries from 1..N randomly as your choice of sparse support. You can use the randperm function.
Now, we need to consider the values of the non-zero entries in the sparse vector. Typically, they are chosen from a random distribution. A few common choices are:
- Gaussian
- Uniform
- Bi-uniform
Gaussian
- Generate K Gaussian random numbers with zero mean and unit standard deviation. You can use the randn function. You may choose to change the standard deviation, but the mean should usually be zero.
- Create a column vector with N zeros.
- On the entries indexed by the sparse support set, place the K numbers generated above.
Plotting
- Use the stem command to visualize the sparse signal.
Uniform
- Most of the steps are similar to creating a Gaussian sparse vector.
- The rand function generates a number uniformly between 0 and 1.
- In order to generate a number uniformly between a and b, we can use the simple trick a + (b - a) * rand.
- Choose a and b (say -4 and 4).
- Generate K uniformly distributed numbers between a and b.
- Place them in the N length vector as described above.
- Plot them.
Bi-uniform
A problem with the Gaussian and uniform distributions described above is that they are prone to generate some non-zero entries which are much smaller than the others.
The bi-uniform approach attempts to avoid this situation. It generates numbers uniformly over [-b, -a] and [a, b], where a and b are both positive numbers with a < b.
- Choose a and b (say 1 and 2).
- Generate K uniformly distributed random numbers between a and b (as discussed above). These are the magnitudes of the sparse non-zero entries.
- Generate K Gaussian numbers and apply the sign function to them to map them to 1 and -1. Note that the signs are 1 or -1 with equal probability.
- Multiply the signs and magnitudes to generate your sparse non-zero entries.
- Place them in the N length vector as described above.
- Plot them.
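A minimal sketch putting these steps together for the bi-uniform case (the parameter values are just examples):
N = 1024;                          % signal length
K = 10;                            % sparsity level
omega = randperm(N, K);            % random sparse support
a = 1; b = 2;
mags = a + (b - a) * rand(K, 1);   % magnitudes uniform in [a, b]
signs = sign(randn(K, 1));         % signs +1 / -1 with equal probability
x = zeros(N, 1);
x(omega) = signs .* mags;          % bi-uniform sparse vector
stem(x);                           % visualize the sparse signal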
The following image is an example of how a sparse vector looks.

Creating a two ortho basis¶
The simplest example of an overcomplete dictionary is the Dirac-Fourier dictionary.
- You can use eye(N) to generate the standard basis of \(\mathbb{C}^N\), which is also known as the Dirac basis.
- dftmtx(N) gives the matrix for the forward Fourier transform. The corresponding Fourier basis can be constructed by taking its transpose.
- The columns / rows of dftmtx(N) are not normalized. Hence, in order to construct an orthonormal basis, we need to normalize the columns too. This can be easily done by multiplying with \(\frac{1}{\sqrt{N}}\).
- Choose the dimension of the ambient signal space (say N=1024).
- Construct the Dirac basis for \(\mathbb{C}^N\).
- Construct the orthonormal Fourier basis for \(\mathbb{C}^N\).
- Combine the two to form the two ortho basis (Dirac in left, Fourier in right).
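A minimal sketch of the construction (N = 64 here is just an example):
N = 64;
Dirac = eye(N);                     % Dirac (standard) basis
Fourier = dftmtx(N).' / sqrt(N);    % orthonormal Fourier basis (transpose + scaling)
Phi = [Dirac Fourier];              % two-ortho basis of size N x 2N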
Verification
We assume that the dictionary has been stored in a variable named Phi. We will use the mathematical symbol \(\Phi\) for the same.
- Verify that each column has unit norm.
- Verify that each row has a norm of \(\sqrt{2}\).
- Compute the Gram matrix \(\Phi' * \Phi\).
- Verify that the diagonal elements are all one.
- Divide the Gram matrix into four quadrants.
- Verify that the first and fourth quadrants are identity matrices.
- Verify that the Gram matrix is symmetric.
- What can you say about the values in 2nd and 3rd quadrant?
Creating a Dirac-DCT two-ortho basis¶
While the Dirac-DFT two-ortho basis has the lowest possible coherence amongst all pairs of orthogonal bases, it is not restricted to \(\mathbb{R}^N\) (the Fourier basis is complex). A good starting point for a real dictionary is to consider constructing a Dirac-DCT two-ortho basis.
- Construct the Dirac-DCT two-ortho basis dictionary.
- Replace dftmtx(N) by dctmtx(N).
- Follow steps similar to the previous exercise to construct a Dirac-DCT dictionary.
- Notice the differences in the Gram matrix of Dirac-DFT dictionary with Dirac-DCT dictionary.
- Construct the Dirac-DCT dictionary for different values of N=(8, 16, 32, 64, 128, 256).
- Look at the changes in the Gram matrix as you vary N for constructing Dirac-DCT dictionary.
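A minimal sketch for one value of N (note that dctmtx(N) is already orthogonal, so only a transpose is needed):
N = 64;
Psi = dctmtx(N)';            % orthonormal DCT basis as columns
Phi = [eye(N) Psi];          % Dirac-DCT two-ortho basis
G = abs(Phi' * Phi);         % absolute Gram matrix
imagesc(G); colormap(gray); axis image;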
An example Dirac-DCT dictionary has been illustrated in the figure below.

Note
While constructing the two-ortho bases is nice for illustration, it should be noted that using them directly for computing \(\Phi x\) is not efficient. This entails the full cost of a matrix vector multiplication. An efficient implementation would consider the following ideas:
- \(\Phi x = [I \Psi] x = I x_1 + \Psi x_2\) where \(x_1\) and \(x_2\) are upper and lower halves of the vector \(x\).
- \(I x_1\) is nothing but \(x_1\).
- \(\Psi x_2\) can be computed by using the efficient implementations of (Inverse) DFT or DCT transforms with appropriate scaling.
- Such implementations would perform the multiplication with dictionary in \(O(N \log N)\) time.
- In fact, if the second basis is a wavelet basis, then the multiplication can be carried out in linear time too.
- You are suggested to take advantage of these ideas in the following exercises.
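A minimal sketch of the fast multiplication idea for the Dirac-DCT case, assuming the dictionary is \(\Phi = [I \; \Psi]\) with \(\Psi\) the orthonormal DCT basis (the support and values below are arbitrary examples):
N = 64;
x = zeros(2*N, 1);
x([3 40 N+5 N+20]) = [1 -2 1.5 0.75];  % an example sparse representation
x1 = x(1:N);                           % Dirac part
x2 = x(N+1:end);                       % DCT part
y_fast = x1 + idct(x2);                % Phi * x computed in O(N log N)
Phi = [eye(N) dctmtx(N)'];             % explicit dictionary for comparison
norm(y_fast - Phi * x)                 % should be close to zero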
Creating a signal which is a mixture of sinusoids and impulses
If we split the sparse vector \(x\) into two halves \(x_1\) and \(x_2\) then:
- The first half corresponds to impulses from the Dirac basis.
- The second half corresponds to sinusoids from the DCT or DFT basis.
It is straightforward to construct a signal which is a mixture of impulses and sinusoids and has a sparse representation in Dirac-DFT or Dirac-DCT representation.
- Pick a suitable value of N (say 64).
- Construct the corresponding two ortho basis.
- Choose a sparsity pattern for the vector x (of size 2N) such that some of the non-zero entries fall in first half while some in second half.
- Choose appropriate non-zero coefficients for x.
- Compute \(y = \Phi x\) to obtain a signal which is a mixture of impulses and sinusoids.
Verification
- It is obvious that the signal is non-sparse in time domain.
- Plot the signal using the stem function.
- Plot the transform basis representation of the signal.
- Verify that the transform basis representation does indeed have some large spikes (corresponding to the non-zero entries in the second half of \(x\)) but the rest of the representation is also full of (small) non-zero terms (corresponding to the transform representation of the impulses).
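A minimal sketch of this exercise using the Dirac-DCT basis (the chosen support and coefficient values are arbitrary examples):
N = 64;
Phi = [eye(N) dctmtx(N)'];        % Dirac-DCT two-ortho basis
x = zeros(2*N, 1);
x([5 20]) = [1.5 -1];             % impulses (Dirac part)
x(N + [3 9]) = [2 1];             % sinusoids (DCT part)
y = Phi * x;                      % mixture signal
figure; stem(y);                  % non-sparse in the time domain
alpha = dct(y);                   % DCT domain representation
figure; stem(alpha);              % a few large spikes plus many small terms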
Creating a random dictionary¶
We consider constructing a Gaussian random matrix.
- Choose the number of measurements \(M\) say 128.
- Choose the signal space dimension \(N\) say 1024.
- Generate a Gaussian random matrix as \(\Phi = \text{randn(M, N)}\).
Normalization
There are two ways of normalizing the random matrix to a dictionary.
One view considers that all columns or atoms of a dictionary should be of unit norm.
- Measure the norm of each column. You may be tempted to write a for loop to do the same. While this is alright, MATLAB is known for its vectorization capabilities. Consider using a combination of sum, conj, element-wise multiplication and sqrt to come up with a function which can measure the column-wise norms of a matrix. You may also explore bsxfun.
- Divide each column by its norm to construct a normalized dictionary.
- Verify that the columns of this dictionary are indeed unit norm.
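A minimal sketch of the vectorized column normalization (no for loop):
M = 128; N = 1024;
Phi = randn(M, N);
norms = sqrt(sum(Phi .* conj(Phi), 1));   % column-wise norms
Phi = bsxfun(@rdivide, Phi, norms);       % divide each column by its norm
max(abs(sqrt(sum(Phi .^ 2, 1)) - 1))      % should be close to zero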
An alternative way considers a probabilistic view.
- We say that each entry in the Gaussian random matrix should be zero mean and variance \(\frac{1}{M}\).
- This ensures that, on average, the norm of each column is indeed 1, though the actual norms of individual columns may differ.
- As the number of measurements increases, the likelihood of the norm being close to one increases further.
We can apply these ideas as follows.
Recall that randn generates Gaussian random variables with zero mean and unit variance.
- Divide the whole random matrix by \(\sqrt{M}\) to achieve the desired sensing matrix.
- Measure the norm of each column.
- Verify that the norms are indeed close to 1 (though not exactly).
- Vary M and N to see how norms vary.
- Use the imagesc or imshow function to visualize the sensing matrix.
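A minimal sketch of the probabilistic normalization:
M = 128; N = 1024;
Phi = randn(M, N) / sqrt(M);       % entries now have variance 1/M
norms = sqrt(sum(Phi .^ 2, 1));    % column norms are close to, but not exactly, 1
imagesc(Phi); colormap(gray);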
An example Gaussian sensing matrix is illustrated in the figure below.

Taking compressive measurements¶
- Choose a sparsity level (say K=10)
- Choose a sparse support over \(1 \dots N\) of size K randomly using the randperm function.
- Construct a sparse vector with bi-uniform non-zero entries.
- Apply the Gaussian sensing matrix on to the sparse signal to compute compressive measurement vector \(y = \Phi x \in \mathbb{R}^M\).
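A minimal sketch of the measurement process (parameter values are examples; the bi-uniform entries use a = 1 and b = 2):
M = 128; N = 1024; K = 10;
Phi = randn(M, N) / sqrt(M);                        % Gaussian sensing matrix
omega = randperm(N, K);                             % random sparse support
x = zeros(N, 1);
x(omega) = sign(randn(K, 1)) .* (1 + rand(K, 1));   % bi-uniform non-zero entries
y = Phi * x;                                        % compressive measurements
stem(y);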
An example of a compressive measurement vector is shown in the figure below.

In the sequel we will refer to the computation of the noiseless measurement vector by the equation \(y = \Phi x\).
When the measurements are noisy, the equation becomes \(y = \Phi x + e\).
Before we jump into sparse recovery, let us spend some time studying some simple properties of dictionaries.
Measuring dictionary properties¶
Gram matrix¶
You have already done this before. The straightforward calculation is \(G = \Phi' * \Phi\), where \(\Phi'\) is the conjugate transpose of the dictionary \(\Phi\).
- Write a function to measure the Gram matrix of any dictionary.
- Compute the Gram matrix for all the dictionaries discussed above.
- Verify that Gram matrix is symmetric.
For most of our purposes, the sign or phase of entries in the Gram matrix is not important. We may use the symbol G to refer to the Gram matrix in the sequel.
- Compute the absolute value Gram matrix abs(G).
Coherence¶
Recall that the coherence of a dictionary is the largest (absolute value) inner product between any pair of distinct atoms. Actually it’s quite easy to read the coherence from the absolute value Gram matrix.
- We reject the diagonal elements since they correspond to the inner product of an atom with itself. For a properly normalized dictionary, they should be 1 anyway.
- Since the matrix is symmetric we need to look at only the upper triangular half or the lower triangular half (excluding the diagonal) to read off the coherence.
- Pick the largest value in the upper triangular half.
- Write a MATLAB function to compute the coherence.
- Compute coherence of a Dirac-DFT dictionary for different values of N. Plot the same to see how coherence decreases with N.
- Do the same for Dirac-DCT.
- Compute the coherence of a Gaussian dictionary (with say N=1024) for different values of M and plot it.
- In the case of a Gaussian dictionary, it is better to take the average coherence for the same M and N over different instances of a Gaussian dictionary of the specified size.
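A minimal sketch of such a function (the name coherence is just a placeholder; it assumes a column-normalized dictionary):
function mu = coherence(Phi)
% Coherence read off the absolute Gram matrix of a normalized dictionary.
G = abs(Phi' * Phi);       % absolute Gram matrix
n = size(G, 1);
G(1:n+1:end) = 0;          % reject the diagonal (self inner products)
mu = max(G(:));            % largest off-diagonal entry
end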
Babel function¶
The Babel function is quite interesting. While the definition looks intimidating, it turns out that it can be computed very easily from the Gram matrix.
- Compute the (absolute value) Gram matrix for a dictionary.
- Sort the rows of the Gram matrix (each row separately) in descending order.
- Remove the first column (it consists of all ones for a normalized dictionary).
- Construct a new matrix by accumulating over the columns of the sorted Gram matrix above. In other words, in the new matrix:
- The first column is as it is.
- The second column is the sum of the first and second columns of the sorted matrix.
- The third column is the sum of the first through third columns of the sorted matrix.
- Continue accumulating like this.
- Compute the maximum for each column.
- Your Babel function is in front of you.
- Write a MATLAB function to carry out the same for any dictionary.
- Compute the Babel function for Dirac-DFT and Dirac-DCT dictionary with (N=256).
- Compute the Babel function for Gaussian dictionary with N=256. Actually compute Babel functions for many instances of Gaussian dictionary and then compute the average Babel function.
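A minimal sketch following the steps above (the name babel is a placeholder; it assumes a column-normalized dictionary):
function mu1 = babel(Phi)
% Babel function computed from the absolute Gram matrix.
G = abs(Phi' * Phi);
Gs = sort(G, 2, 'descend');   % sort each row in descending order
Gs = Gs(:, 2:end);            % drop the first column (all ones)
sums = cumsum(Gs, 2);         % accumulate over the columns
mu1 = max(sums, [], 1);       % maximum of each column; mu1(k) is the Babel function at k
end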
Getting started with sparse recovery¶
Our first objective will be to develop algorithms for sparse recovery in noiseless case.
The defining equation is \(y = \Phi x\) where \(x\) is the sparse representation vector, \(\Phi\) is the dictionary or sensing matrix and \(y\) is the signal or measurement vector. In any sparse recovery algorithm, the following quantities are of core interest:
- \(x\) which is unknown to us.
- \(\Phi\) which is known to us. Sometimes we may know \(\Phi\) only approximately.
- \(y\) which is known to us.
- Given \(\Phi\) and \(y\), we estimate an approximation of \(x\) which we will represent as \(\widehat{x}\).
- \(\widehat{x}\) is (typically) sparse even if \(x\) may be only approximately sparse or compressible.
- Given an estimate \(\widehat{x}\), we compute the residual \(r = y - \Phi \widehat{x}\). This quantity is computed during the sparse recovery process.
- Measurement or signal error norm \(\| r \|_2\). We strive to reduce this as much as possible.
- Sparsity level \(K\). We try to come up with an \(\widehat{x}\) which is K-sparse.
- Representation error or recovery error \(f = x - \widehat{x}\). This is unknown to us. The recovery process tends to minimize its norm \(\| f \|_2\) (if it is working correctly !).
Some notes are in order
- K may or may not be given to us. If K is given to us, we should use it in our recovery process. If it is not given, then we should work with \(\| r \|_2\).
- While the recovery algorithm itself doesn’t know about \(x\) and hence cannot calculate \(f\), a controlled testing environment can carefully choose an \(x\), compute \(y\) and pass \(\Phi\) and \(y\) to the recovery algorithm. Thus, the testing environment can easily compute \(f\) by using the \(x\) known to it and the \(\widehat{x}\) given by the recovery algorithm.
Usually the sparse recovery algorithms are iterative. In each iteration, we improve our approximation \(\widehat{x}\) and reduce \(\| r \|_2\).
- We can denote the iteration counter by \(k\) starting from 0 onwards.
- We denote k-th approximation by \(\widehat{x}^k\) and k-th residual by \(r^k\).
- A typical initial estimate is given by \(\widehat{x}^0 = 0\) and thus, \(r^0 = y\).
Objectives of recovery algorithm
There are fundamentally two objectives of a sparse recovery algorithm
- Identification of locations at which \(\widehat{x}\) has non-zero entries. This corresponds to the sparse support of \(x\).
- Estimation of the values of non-zero entries in \(\widehat{x}\).
We will use following notation.
- The identified support will be denoted as \(\Lambda\). It is the responsibility of the sparse recovery algorithm to guess it.
- If the support is identified gradually in each iteration, we can use the notation \(\Lambda^k\).
- The actual support of \(x\) will be denoted by \(\Omega\). Since \(x\) is unknown to us hence \(\Omega\) is also unknown to us within the sparse recovery algorithm. However, the controlled testing environment would know about \(\Omega\).
If the support has been identified correctly, then the estimation part is quite easy. It’s nothing but the application of least squares over the columns of \(\Phi\) selected by the support set.
Different recovery algorithms vary in how they approach the support identification and coefficient estimation steps.
- Some algorithms try to identify whole support at once and then estimate the values of non-zero entries.
- Some algorithms identify atoms in the support one at a time and iteratively estimate the non-zero values for the current support.
Simple support identification
- Write a function which sorts a given vector by the decreasing order of magnitudes of its entries.
- Identify the K largest (magnitude) entries in the sorted vector and their locations in the original vector.
- Collect the locations of the K largest entries into a set.
Note
[sorted_x, index_vector] = sort(x) in MATLAB returns both the sorted entries and the index vector such that sorted_x = x(index_vector). Our interest is usually in the index_vector as we don’t want to really change the order of entries in x while identifying the largest K entries.
In MATLAB a set can be represented using an array. You have to be careful to ensure that such a set never has any duplicate elements.
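A minimal sketch of such a helper (the name largest_indices is a placeholder):
function Lambda = largest_indices(x, K)
% Collect the locations of the K largest magnitude entries of x into a set.
[~, index_vector] = sort(abs(x), 'descend');
Lambda = sort(index_vector(1:K));   % sorted index array; no duplicates by construction
end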
Sparse approximation of a given vector
Given a vector \(x\) which may not be sparse, its K-sparse approximation, which is the best approximation in the \(l_p\) norm sense, can be obtained by choosing the K largest (in magnitude) entries.
- Write a MATLAB function to compute the K-sparse representation of any vector.
- Identify the K largest entries and put their locations in the support set \(\Lambda\).
- Compute \(\Lambda^c = \{1 \dots N \} \setminus \Lambda\).
- Set the entries corresponding to \(\Lambda^c\) in \(x\) to zero.
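A minimal sketch of such a function (the name sparse_approximation is a placeholder):
function x_hat = sparse_approximation(x, K)
% Keep the K largest magnitude entries of x and zero out the rest.
[~, index_vector] = sort(abs(x), 'descend');
kept = index_vector(1:K);      % the support Lambda
x_hat = zeros(size(x));
x_hat(kept) = x(kept);         % entries on the complement of Lambda are zero
end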
The proxy vector
A very interesting quantity which appears in many sparse recovery algorithms is the proxy vector \(p = \Phi' r\).
The figure below shows a sparse vector, its measurements and the corresponding proxy vector \(p^0 = \Phi' r^0 = \Phi' y\).

While the proxy vector may look quite chaotic at first glance, it is very interesting to note that it tends to have large entries at exactly the same locations as the sparse vector \(x\) itself.
If we think about the proxy vector closely, we can notice that each entry in the proxy is the inner product of an atom in \(\Phi\) with the residual \(r\). Thus, each entry in the proxy vector indicates how similar an atom in the dictionary is to the residual.
- Choose M, N and K and construct a sparse vector \(x\) with support \(\Omega\) and Gaussian dictionary \(\Phi\).
- For the measurement vector \(y = \Phi x\), compute \(p = \Phi' y\).
- Identify the K largest entries in \(p\) and use their locations to make a guess of support as \(\Lambda\).
- Compare the sets \(\Omega\) and \(\Lambda\). Measure the support identification ratio as \(\frac{|\Lambda \cap \Omega|}{|\Omega|}\) i.e. the ratio of the number of indices common in \(\Lambda\) and \(\Omega\) with the number of indices in \(\Omega\) (which is K).
- Keep M and N fixed and vary K to see how support identification ratio changes. For this, measure average support identification ratio for say 100 trials. You may increase the number of trials if you want.
- Keep K=4, N=1024 and vary M from 10 to 500 to see how support identification ratio changes. Again use the average value.
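A single trial of this experiment might look as follows (parameter values are examples):
M = 128; N = 1024; K = 10;
Phi = randn(M, N) / sqrt(M);
Omega = randperm(N, K);                          % true support
x = zeros(N, 1);
x(Omega) = randn(K, 1);
y = Phi * x;
p = Phi' * y;                                    % proxy vector
[~, idx] = sort(abs(p), 'descend');
Lambda = idx(1:K);                               % guessed support
ratio = numel(intersect(Lambda, Omega)) / K      % support identification ratio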
Note
The support identification ratio is a critical tool for evaluating the quality of a sparse recovery algorithm. Recall that if the support has been identified correctly, then reconstructing a sparse vector is a simple least squares problem. If the support is identified partially, or some of the indices are incorrect, then it can lead to large recovery errors.
If the support identification ratio is 1, then we have correctly identified the support. Otherwise, we haven’t.
For noiseless recovery, if support is identified correctly, then representation will be recovered correctly (unless \(\Phi\) is ill conditioned). Thus, support identification ratio is a good measure of success or failure of recovery. We don’t need to worry about SNR or norm of recovery error.
In the sequel, for noiseless recovery, we will say that recovery succeeds if support identification ratio is 1.
If we run multiple trials of a recovery algorithm (for a specific configuration of K, M, N etc.) with different data, then the recovery rate would be the number of trials in which successful recovery happened divided by the total number of trials.
The recovery rate (on reasonably high number of trials) would be our main tool for measuring the quality of a recovery algorithm. Note that the recovery rate depends on
- The representation space dimension \(N\).
- The number of measurements \(M\).
- The sparsity level \(K\).
- The choice of dictionary \(\Phi\).
It doesn’t really depend much on the choice of distribution for the non-zero entries in \(x\) if the entries are i.i.d.; the dependence, as such, is not very significant.
Developing the hard thresholding algorithm¶
Based on the idea of the proxy vector, we can easily compute a sparse approximation as follows.
- Identify the K largest entries in the proxy and their locations.
- Put the locations together in your guess for the support \(\Lambda\).
- Identify the columns of \(\Phi\) corresponding to \(\Lambda\) and construct a submatrix \(\Phi_{\Lambda}\).
- Compute \(x_{\Lambda} = \Phi_{\Lambda}^{\dagger} y\) as the least squares solution of the problem \(y = \Phi_{\Lambda} x_{\Lambda}\).
- Set the remaining entries in \(x\) corresponding to \(\Lambda^c\) as zeros.
Put together the algorithm described above in a MATLAB function like x_hat = hard_thresholding(Phi, y, K).
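A minimal sketch of such a function under the steps above:
function x_hat = hard_thresholding(Phi, y, K)
% Single-step thresholding: guess the support from the proxy,
% then solve least squares over the selected columns.
p = Phi' * y;                        % proxy vector
[~, idx] = sort(abs(p), 'descend');
Lambda = idx(1:K);                   % guessed support
x_hat = zeros(size(Phi, 2), 1);
x_hat(Lambda) = Phi(:, Lambda) \ y;  % least squares estimate on the support
end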
- Think and explain why hard thresholding will always succeed if \(K=1\).
- Say \(N=256\) and \(K=2\). What is the required number of measurements at which the recovery rate will be equal to 1?
Phase transition diagram
A nice visualization of the performance of a recovery algorithm is via its phase transition diagram. The figure below shows the phase transition diagram for orthogonal matching pursuit algorithm with a Gaussian dictionary and Gaussian sparse vectors.
- N is fixed at 64.
- K is varied from 1 to 4.
- M is varied from 2 to 32 (N/2) in steps of 2.
- For each configuration of K and M, 1000 trials are conducted and recovery rate is measured.
- In the phase transition diagram, a white cell indicates that for the corresponding K and M, the algorithm is able to recover successfully always.
- A black cell indicates that the algorithm never successfully recovers any signal for the corresponding K and M.
- A gray cell indicates that the algorithm sometimes recovers successfully while sometimes it may fail.
- Safe zone of operation is the white area in the diagram.

In the figure below, we capture the minimum required number of measurements for different values of K for the OMP algorithm running on a Gaussian sensing matrix.

It is evident that as K increases, the minimum M required for successful recovery also increases.
- Generate the phase transition diagram for thresholding algorithm with N = 256, K varying from 1 to 16 and M varying from 2 to 128 and a minimum of 100 trials for each configuration.
- Use the phase transition diagram data for estimating the minimum M for different values of K and plot it.
Developing the matching pursuit algorithm¶
You can read the description of matching pursuit algorithms on Wikipedia. This is a simpler algorithm than orthogonal matching pursuit. It doesn’t involve any least squares step.
- Implement the matching pursuit (MP) algorithm in MATLAB.
- Generate the phase transition diagram for MP algorithm with N = 256, K varying from 1 to 16 and M varying from 2 to 128 and a minimum of 100 trials for each configuration.
- Use the phase transition diagram data for estimating the minimum M for different values of K and plot it.
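A minimal matching pursuit sketch for reference, assuming unit-norm atoms and a fixed number of iterations (a stopping rule based on the residual norm is another option); this is only an illustration, not the library implementation:
function x_hat = matching_pursuit(Phi, y, K)
% Matching pursuit: pick the best matching atom, update its coefficient,
% update the residual; no least squares step.
N = size(Phi, 2);
x_hat = zeros(N, 1);
r = y;
for iter = 1:K
    p = Phi' * r;                  % correlations of atoms with the residual
    [~, j] = max(abs(p));          % best matching atom
    x_hat(j) = x_hat(j) + p(j);    % coefficient update (unit norm atoms assumed)
    r = r - p(j) * Phi(:, j);      % residual update
end
end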
Developing the orthogonal matching pursuit algorithm¶
The orthogonal matching pursuit algorithm is described in the figure below.

- Implement the orthogonal matching pursuit (OMP) algorithm in MATLAB.
- Generate the phase transition diagram for OMP algorithm with N = 256, K varying from 1 to 16 and M varying from 2 to 128 and a minimum of 100 trials for each configuration.
- Use the phase transition diagram data for estimating the minimum M for different values of K and plot it.
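A minimal OMP sketch for reference (an illustration only, not the library implementation; it assumes K is known and runs for exactly K iterations):
function x_hat = orthogonal_matching_pursuit(Phi, y, K)
% OMP: add one atom per iteration, then re-estimate all coefficients
% on the current support by least squares.
N = size(Phi, 2);
Lambda = [];
r = y;
for iter = 1:K
    p = Phi' * r;                     % proxy / correlations
    [~, j] = max(abs(p));             % atom most correlated with the residual
    Lambda = [Lambda j];              % grow the support
    z = Phi(:, Lambda) \ y;           % least squares over the current support
    r = y - Phi(:, Lambda) * z;       % update the residual
end
x_hat = zeros(N, 1);
x_hat(Lambda) = z;
end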
Sparsifying an image¶
Scripts¶
Preamble¶
close all; clear all; clc;
Resetting random numbers:
rng('default');
Export management flag:
export = true;
Figures¶
Exporting figures:
if export
export_fig images\figure_name.png -r120 -nocrop;
export_fig images\figure_name.pdf;
end
Typical steps in figures:
xlabel('Principal angle (degrees)');
ylabel('Number of subspace pairs');
title('Distribution of principal angles over subspace pairs in signal space');
grid on;