Soft Thresholding

Definition

Given a function \(f : \RR^n \to (-\infty, \infty]\), the proximal mapping of \(f\) is the operator given by

\[\text{prox}_f(x) = \text{argmin}_u \left \{ f(u) + \frac{1}{2} \| u - x \|_2^2 \right\} \quad \forall x \in \RR^n.\]

In general, the proximal mapping is a point-to-set mapping, i.e., it maps each point \(x \in \RR^n\) to the set of points in \(\RR^n\) which minimize the expression on the R.H.S.

When \(f\) is a proper closed convex function, the set \(\text{prox}_f(x)\) is a singleton for each \(x \in \RR^n\). In this case, the proximal mapping can be treated as a single-valued mapping, i.e. a function.
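As an aside, when a closed-form expression is not obvious, the proximal mapping of a scalar convex function can be evaluated numerically by minimizing the expression on the R.H.S. directly. The sketch below assumes NumPy and SciPy are available; the name `prox_numeric` is illustrative, not a library routine.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def prox_numeric(f, x):
    """Approximate prox_f(x) for a scalar convex f by direct minimization."""
    objective = lambda u: f(u) + 0.5 * (u - x) ** 2
    return minimize_scalar(objective).x

# Example: prox of f(u) = |u| at a few points
# (its closed form is the soft thresholding operator derived later in this section).
for x in [-2.0, 0.3, 1.5]:
    print(x, prox_numeric(abs, x))
```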

Proposition
Let \(f(x) = c\) be a constant function. Then \(\text{prox}_f(x) = x\).
Proof

Let

\[g_x(u) = f(u) + \frac{1}{2} \| u - x \|_2^2 = c + \frac{1}{2} \| u - x \|_2^2.\]

To minimize \(g_x\) over \(u\), we differentiate it w.r.t. \(u\):

\[\frac{\partial g_x}{\partial u} = u - x.\]

The minimum value is achieved at \(u = x\).

A simpler argument is that \(g_x\) is a constant plus a quadratic. Thus, the minimum of \(g_x\) is achieved when the quadratic term is 0, which happens when \(u = x\).

Proposition
Let \(f : \RR \to \RR\) be defined as \(f(x) = x\). Then \(\text{prox}_f(x) = x - 1\).
Proof

Let

\[g_x(u) = f(u) + \frac{1}{2} (u - x)^2 = u + \frac{1}{2} (u - x)^2.\]

To minimize \(g_x\) over \(u\), we differentiate it w.r.t. \(u\):

\[\frac{\partial g_x}{\partial u} = 1 + u - x.\]

The minimum value is achieved at \(u = x - 1\).

Proposition
Let \(f : \RR \to \RR\) be defined as \(f(x) = \lambda x\). Then \(\text{prox}_f(x) = x - \lambda\). The proof is the same differentiation argument as above.
Proposition

Let \(f\) be defined as:

\[\begin{split}f(x) = \begin{cases} \lambda x & \text{if $x \geq 0$}\\ \infty & \text{otherwise} \end{cases}\end{split}\]

Then \(\text{prox}_f(x)\) is \([x - \lambda]_+\).

This is one-sided soft-thresholding.

Proof

Note that \(f\) is differentiable for \(x > 0\). \(\text{prox}_f(x)\) is the minimizer of the function

\[\begin{split}g(u) = \begin{cases} g_1(u) & u \geq 0 \\ \infty & u < 0 \end{cases}\end{split}\]

where

\[g_1(u) = \lambda u + \frac{1}{2} (u - x)^2.\]

\(g\) is a proper, closed and convex function. It is differentiable for \(u > 0\), with \(g'(u) = g_1'(u)\) there. The only point of non-differentiability within the domain of \(g\) (which is \(u \geq 0\)) is \(u = 0\).

If \(g'(u) = 0\), then \(u\) is a minimizer of \(g\) (since it is a convex function).

\(g_1'(u) = 0\) iff \(u = x - \lambda\). If \(x > \lambda\), then \(g'(x-\lambda) = g_1'(x-\lambda) = 0\). Thus, \(\text{prox}_f(x) = x - \lambda\) for \(x > \lambda\).

Now if \(x \leq \lambda\), then \(g'(u)\) is never 0 wherever \(g\) is differentiable. Since a minimum of \(g\) exists, it must be attained at a point of non-differentiability. The only such point is \(u = 0\). Thus, \(\text{prox}_f(x) = 0\) for \(x \leq \lambda\).

../../_images/quadratic_with_restricted_domain.png

The quadratic \(g(u) = 2u + \frac{1}{2} (u - 1)^2\). If the \(2u\) term were not present, the minimum would have been at \(u = 1\). The term \(2u\) just shifts the minimum left by 2 to \(u = -1\). However, if the domain of the function is restricted to \(u \geq 0\) (the red part of the graph), then the minimum is achieved at \(u = 0\).

For general \(g(u) = \lambda u + \frac{1}{2}(u - x)^2\) with domain restricted to \(u \geq 0\), as long as \(x > \lambda\), the minimum is achieved at \(u = x - \lambda\). For all \(x \leq \lambda\), the minimum is achieved at \(u = 0\).
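A small numerical sketch of the one-sided case, assuming NumPy; the name `soft_threshold_one_sided` is illustrative. It compares \([x - \lambda]_+\) with a brute-force grid search over \(u \geq 0\).

```python
import numpy as np

def soft_threshold_one_sided(x, lam):
    """[x - lam]_+ : prox of f(u) = lam*u restricted to u >= 0."""
    return max(x - lam, 0.0)

# Brute-force check: minimize lam*u + 0.5*(u - x)^2 over a grid of u >= 0.
lam, x = 2.0, 1.0
u = np.linspace(0.0, 10.0, 100001)
g = lam * u + 0.5 * (u - x) ** 2
print(u[np.argmin(g)], soft_threshold_one_sided(x, lam))  # both are 0.0 since x <= lam
```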
Proposition

Let \(f : \RR \to \RR\) be defined as \(f(x) = \lambda |x|\), then

\[\text{prox}_f(x) = [|x| - \lambda]_+ \sgn(x).\]
Proof

Note that \(f\) is differentiable for \(x \neq 0\). \(\text{prox}_f(x)\) is the minimizer of the function

\[\begin{split}g(u) = \begin{cases} g_1(u) = \lambda u + \frac{1}{2} (u - x)^2 & u \geq 0 \\ g_2(u) = -\lambda u + \frac{1}{2} (u - x)^2 & u < 0 \end{cases}\end{split}\]

If the minimizer is attained at some \(u > 0\), then \(0 = g'(u) = g_1'(u) = \lambda + u - x\), giving us \(u = x - \lambda\). Since we assumed that \(u > 0\), it follows that \(x > \lambda\). In other words, if \(x > \lambda\), then \(u = x - \lambda\) is the minimizer. Similarly, for \(x < -\lambda\), \(u = x + \lambda\) is the minimizer. For \(-\lambda \leq x \leq \lambda\), the minimizer is at the only point of non-differentiability of \(g\), which is \(u = 0\). Thus

\[\begin{split}\text{prox}_f(x) = \begin{cases} x - \lambda & x > \lambda \\ 0 & -\lambda \leq x \leq \lambda \\ x + \lambda & x < -\lambda \end{cases}\end{split}\]

This is nothing but the shrinkage operator, also known as the soft thresholding operator.

Definition

The soft thresholding operator \(S_{\kappa}: \RR \to \RR\) is defined as:

\[\begin{split}S_{\kappa}(x) = \begin{cases} x - \kappa & x > \kappa \\ 0 & |x| \leq \kappa \\ x + \kappa & x < -\kappa \end{cases}.\end{split}\]

An equivalent definition is

\[S_{\kappa}(x) = (x - \kappa)_+ - (-x-\kappa)_+.\]
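The two forms of \(S_{\kappa}\) can be cross-checked numerically; the sketch below assumes NumPy and uses illustrative names.

```python
import numpy as np

def soft_threshold(x, kappa):
    """Piecewise definition of S_kappa."""
    if x > kappa:
        return x - kappa
    if x < -kappa:
        return x + kappa
    return 0.0

def soft_threshold_alt(x, kappa):
    """Equivalent form: (x - kappa)_+ - (-x - kappa)_+."""
    return max(x - kappa, 0.0) - max(-x - kappa, 0.0)

# The two definitions agree on a grid of test points.
for x in np.linspace(-3.0, 3.0, 13):
    assert np.isclose(soft_threshold(x, 1.0), soft_threshold_alt(x, 1.0))
```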
Proposition

The minimizer of

\[f(x) = \lambda |x| + \frac{1}{2} (x - u)^2\]

is \(S_{\lambda}(u)\).

This is a straightforward application of the proximal operator result above; note that the roles of \(x\) and \(u\) are swapped relative to the definition of the proximal mapping.

Proposition

The minimizer of

\[f(x) = \lambda |x| + \frac{\rho}{2} (x - u)^2\]

is \(S_{\lambda/\rho}(u)\).

Proof
\[\begin{split}f(x) = \begin{cases} f_1(x) = \lambda x + \frac{\rho}{2} (x - u)^2 & x \geq 0 \\ f_2(x) = -\lambda x + \frac{\rho}{2} (x - u)^2 & x < 0 \end{cases}.\end{split}\]

\(f\) is differentiable everywhere except at \(x = 0\).

If a minimizer is obtained at \(x > 0\), then

\[0 = f'(x) = f_1'(x) = \lambda + \rho (x - u)\]

giving us \(x = u - \frac{\lambda}{\rho}\). This holds true if \(u > \frac{\lambda}{\rho}\). The rest of the argument is similar.
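The scaled result can also be checked numerically; this sketch assumes NumPy and SciPy, with illustrative names.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def prox_scaled_abs(u, lam, rho):
    """Minimizer of lam*|x| + (rho/2)*(x - u)^2, i.e. S_{lam/rho}(u)."""
    kappa = lam / rho
    return np.sign(u) * max(abs(u) - kappa, 0.0)

# Compare the closed form against direct numerical minimization.
lam, rho, u = 1.5, 4.0, 0.9
objective = lambda x: lam * abs(x) + 0.5 * rho * (x - u) ** 2
print(prox_scaled_abs(u, lam, rho), minimize_scalar(objective).x)  # both ~0.525
```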

Definition

The element-wise soft thresholding operator \(S_{\lambda} : \RR^n \to \RR^n\) is defined as

\[\begin{split}[S_{\lambda}(x)]_i = \begin{cases} x_i - \lambda & x_i > \lambda \\ 0 & |x_i| \leq \lambda \\ x_i + \lambda & x_i < -\lambda \end{cases}.\end{split}\]
Proposition

The minimizer of \(f: \RR^n \to \RR\) defined as

\[f(x) = \lambda \|x\|_1 + \frac{\rho}{2} \|x - u\|_2^2\]

is \(S_{\lambda/\rho}(u)\) (applied element-wise).

Proof

The result follows from the fact that \(f(x)\) is separable as \(f(x) = \sum_i f_i(x_i)\) where

\[f_i(x_i) = \lambda |x_i| + \frac{\rho}{2} (x_i - u_i)^2.\]
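Because the objective is separable, the vector case is just the scalar soft thresholding applied coordinate-wise. A vectorized NumPy sketch (the name `soft_threshold_vec` is illustrative):

```python
import numpy as np

def soft_threshold_vec(x, kappa):
    """Element-wise S_kappa: sgn(x_i) * max(|x_i| - kappa, 0) for each coordinate."""
    return np.sign(x) * np.maximum(np.abs(x) - kappa, 0.0)

# Minimizer of lam*||x||_1 + (rho/2)*||x - u||_2^2 is S_{lam/rho}(u).
lam, rho = 1.0, 2.0
u = np.array([-3.0, -0.2, 0.0, 0.4, 2.5])
# First and last entries shrink by lam/rho = 0.5; the small entries are zeroed.
print(soft_threshold_vec(u, lam / rho))
```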
Definition

The element-wise soft thresholding operator over matrices \(S_{\lambda} : \RR^{m\times n} \to \RR^{m\times n}\) is defined as

\[\begin{split}[S_{\lambda}(X)]_{ij} = \begin{cases} x_{ij} - \lambda & x_{ij} > \lambda \\ 0 & |x_{ij}| \leq \lambda \\ x_{ij} + \lambda & x_{ij} < -\lambda \end{cases}\end{split}\]

where \(x_{ij}\) is the \((i, j)\)-th entry of \(X\).
Proposition

The minimizer of \(f: \RR^{m \times n} \to \RR\) defined as

\[f(X) = \lambda \|X\|_{1,1} + \frac{\rho}{2} \|X - U\|_F^2\]

is \(S_{\lambda/\rho}(U)\) (applied element-wise).

See Row column norms for the formal definition of the \(\| \cdot \|_{1,1}\) norm. It is essentially the sum of the absolute values of all elements of the matrix.

Proof
The result follows from the fact that \(f(X)\) is separable over all elements of \(X\), reducing the problem to the scalar case above.
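As a final sanity check (assuming NumPy; names are illustrative), one can verify numerically that element-wise soft thresholding with \(\kappa = \lambda/\rho\) attains a lower objective value than random perturbations of it:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, rho = 1.0, 2.0
U = rng.normal(size=(3, 4))

# Candidate minimizer: element-wise soft thresholding with kappa = lam / rho.
X_star = np.sign(U) * np.maximum(np.abs(U) - lam / rho, 0.0)

def f(X):
    return lam * np.abs(X).sum() + 0.5 * rho * np.linalg.norm(X - U, 'fro') ** 2

# The objective at the candidate never exceeds the objective at perturbed points.
checks = [f(X_star) <= f(X_star + 0.1 * rng.normal(size=U.shape)) for _ in range(100)]
print(all(checks))  # True
```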