A review of linear transformations

Outline

  • The following topics will be covered in this lecture:
    • Linear transformations and matrices
    • Linear transformations on subspaces
    • Matrix spectrum and positive definite matrices
    • The singular value decomposition
    • Inverses and pseudo-inverses

Linear transformations

  • Matrices and vectors give us tools to “represent” abstract linear transformations by giving coordinates to them.

  • Linear transformations refer to a special class of maps between vector spaces;

    • we will usually consider vector spaces such as \( \mathbb{R}^N \), though these tools extend in far greater generality.
Linear transformations
Let \( a,b\in \mathbb{R} \) be arbitrary scalars, \( \pmb{x},\pmb{y}\in V \) be arbitrary vectors, and let \( T \) be some transformation \[ \begin{align} T : V \rightarrow W. \end{align} \] \( T \) is said to be a linear transformation if it satisfies the property \[ \begin{align} T\left(a \pmb{x} + b \pmb{y} \right) = a T\left( \pmb{x}\right) + b T\left(\pmb{y}\right). \end{align} \]
  • Any map that does not satisfy the property above is said to be nonlinear;

    • nonlinearities arise in many different ways in practice, and developing general tools for nonlinear maps is challenging.
  • For reasons we shall see, we often use a linear approximation, when appropriate, to simplify the challenge posed by nonlinearities.
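  • As a quick numerical illustration of the linearity property above, the following sketch (with an arbitrarily chosen matrix map, used purely for demonstration) checks the defining identity:
import numpy as np

# An arbitrary matrix map T(x) = A x on R^3, chosen only for illustration.
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 0.0],
              [1.0, 0.0, 2.0]])

a, b = 2.0, -0.5
x = np.array([1.0, 2.0, 3.0])
y = np.array([-1.0, 0.0, 4.0])

# T(a x + b y) == a T(x) + b T(y) for the matrix map
print(np.allclose(A @ (a * x + b * y), a * (A @ x) + b * (A @ y)))  # True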

Linear transformations as matrices

  • Let's suppose, for simplicity, that a transformation

    \[ \begin{align} T:V \rightarrow V \end{align} \]

    is a map of the vector space \( V \) to itself.

  • Suppose, moreover, that we have chosen some basis for \( V \), \( \{\pmb{x}_i\}_{i=1}^N \);

    • with a choice of basis, for each \( i \) we can write

    \[ \begin{align} T(\pmb{x}_i) = \sum_{j=1}^N \alpha_{i,j} \pmb{x}_j \end{align} \]

    as \( T(\pmb{x}_i)\in V \) by the definition of \( T \), and by the definition of a basis.

  • It is by choosing a basis for the domain and the codomain of the map that we obtain a coordinatization of the map \( T \).

  • Notice that, by the linearity of the transformation \( T \), for any vector \( \pmb{x} = \sum_{i=1}^N \beta_i \pmb{x}_i \) its map can be written as

    \[ \begin{align} T\left(\pmb{x}\right) &= T\left( \sum_{i=1}^N \beta_i \pmb{x}_i \right)=\sum_{i=1}^N \beta_i T\left(\pmb{x}_i\right) \end{align} \]

    so that determining the output of the map on the basis determines the output for any vector.

Linear transformations as matrices

  • Recall the notation for the coordinatization of the map, \( T(\pmb{x}_i) = \sum_{j=1}^N \alpha_{i,j} \pmb{x}_j \).

  • In particular, suppose we order the coefficients \( \alpha_{i,j} \) so that \( i \) refers to the column index and \( j \) refers to the row index of the array

    \[ \begin{align} \mathbf{A} := \begin{pmatrix} \alpha_{1,1} & \cdots & \alpha_{N,1} \\ \vdots & \ddots & \vdots \\ \alpha_{1,N} & \cdots & \alpha_{N,N} \end{pmatrix}, \end{align} \]

    i.e., the \( i \)-th column of \( \mathbf{A} \) holds the coordinates of \( T(\pmb{x}_i) \) in the chosen basis.

  • Recall, if \( \pmb{x} = \sum_{i=1}^N \beta_i \pmb{x}_i \), its coordinates in this basis are given as

    \[ \begin{align} \pmb{x} := \begin{pmatrix} \beta_1 & \cdots & \beta_N \end{pmatrix}^\top \end{align} \]

  • Writing the transformation in these bases,

    \[ T(\pmb{x})= \sum_{i=1}^N \beta_i T(\pmb{x}_i) = \sum_{i=1}^N \beta_i \left( \sum_{j=1}^N \alpha_{i,j} \pmb{x}_j\right) = \sum_{j=1}^N \left( \sum_{i=1}^N \beta_i \alpha_{i,j} \right)\pmb{x}_j \]

    so that the image of \( \pmb{x} \) under \( T \) has the coordinatization

    \[ \begin{align} T(\pmb{x}) := \begin{pmatrix} \sum_{i=1}^N \beta_i \alpha_{i,1} & \cdots & \sum_{i=1}^N \beta_i \alpha_{i,N}\end{pmatrix}^\top \equiv \mathbf{A} \pmb{x} \end{align} \]

    so that \( \mathbf{A} \) completely describes the transformation \( T \).
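  • As a minimal numerical sketch of this coordinatization (the map T below is a hypothetical example, chosen only for illustration), we can build the matrix whose \( i \)-th column holds the coordinates of \( T(\pmb{x}_i) \) in the standard basis and check that it reproduces the map:
import numpy as np

# A hypothetical linear map on R^3, used only to illustrate the construction.
def T(x):
    return np.array([2.0 * x[2], x[1] + x[0], -x[0]])

# The i-th column of A holds the coordinates of T(e_i) in the standard basis.
E = np.eye(3)
A = np.column_stack([T(E[:, i]) for i in range(3)])

x = np.array([1.0, -2.0, 3.0])
print(np.allclose(A @ x, T(x)))  # True: A completely describes the map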

Linear transformations and subspaces

  • Note that a linear transformation need not map a vector space \( V \) to itself;

    • a very common situation is when we consider a map of \( V \) to one of its subspaces with a projection operation.
  • Generally, a map to a subspace also need not be orthogonal;

    • however, we will consider a property similar to idempotence in this situation, in which a subspace is said to be invariant.
Invariant subspaces of linear transformations
Let \( W\subset V \) be a subspace, and let \( T:V \rightarrow V \) be a linear map. The subspace \( W \) is said to be invariant under \( T \) if for every \( \pmb{y} \in W \) then \( T(\pmb{y})\in W \).
  • In the case that a map \( T \) has an invariant subspace, there is a well-defined restriction of \( T \) to the subspace

    \[ T|_W: W \rightarrow W. \]

  • In particular, this restriction can similarly be written as a matrix transformation;

    • if \( V \) has dimension \( N \) and \( W \) has dimension \( M \), the restriction takes the form of an \( M\times M \) matrix.
  • The above points bring us to the question: what basis should we choose for a linear transformation?

  • The standard Euclidean basis is the most common choice, but the Euclidean basis may not take advantage of the invariant properties of the transformation.

  • In particular, if \( M \ll N \) and we are only concerned with the action on the subspace, a well-chosen basis as above can dramatically reduce the cost of computing the map, as sketched below.
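  • As a rough sketch of this dimension reduction (the matrices below are constructed artificially so that the invariance holds exactly), we can form the \( M \times M \) restricted matrix from an orthonormal basis of the invariant subspace:
import numpy as np
from numpy.random import normal

np.random.seed(0)
# Build an orthogonal Q and a map A for which the span of the first two
# columns of Q is an invariant subspace (a contrived example for illustration).
Q, _ = np.linalg.qr(normal(size=[4, 4]))
A = Q @ np.diag([2.0, 3.0, 5.0, 7.0]) @ Q.T
B = Q[:, :2]                      # orthonormal basis of the invariant subspace W

A_W = B.T @ A @ B                 # the M x M restriction of A to W
c = np.array([1.0, -1.0])         # coordinates of a vector in W
print(np.allclose(A @ (B @ c), B @ (A_W @ c)))  # True: the restriction agrees with A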

Linear transformations and subspaces

  • As a motivating example, consider the case where \( \mathrm{span}\{\pmb{x}\} \subset \mathbb{R}^N \) is an invariant subspace of the transformation

    \[ T:\mathbb{R}^N \rightarrow \mathbb{R}^N. \]

  • By the definition,

    \[ T(\pmb{x}) \in \mathrm{span}\{\pmb{x}\} \Leftrightarrow T(\pmb{x}) =\alpha \pmb{x}. \]

  • Therefore, for any arbitrary vector \( \beta \pmb{x} \) in the subspace

    \[ \begin{align} \mathbf{A}\beta \pmb{x} = \alpha \beta\pmb{x} \end{align} \]

    so that the action of the linear transformation on the entire subspace is that of scalar multiplication.

  • If we choose \( \pmb{x} \) as a basis vector for \( \mathbb{R}^N \), the associated matrix \( \mathbf{A} \) for this choice of basis will have a diagonal element corresponding to this action.

  • Diagonal matrices are extremely simple for computation; to make efficient use of computations we will be concerned with matrix decompositions, corresponding to special choices of coordinates.

  • The most fundamental matrix decomposition is known as the eigen value decomposition, discussed as follows.

Eigenvalues and eigenvectors

Eigen values and vectors
If a nonzero vector \( \pmb{x} \) has the property that, \[ \mathbf{A}\pmb{x} =\lambda \pmb{x} \] then \( \pmb{x} \) is said to be an eigen vector of \( \mathbf{A} \) associated to the eigen value \( \lambda \).
  • Diagonal matrices are ones that have an entire coordinate system composed of eigen vectors;

    • each basis vector defines an invariant subspace and the action is equivalent to scalar multiplication on each subspace.
  • Certain non-diagonal matrices can be transformed into diagonal matrices by finding such a coordinate system and applying a “change of basis” transformation.

  • Suppose that \( \pmb{x} \) is an eigen vector for a matrix \( \mathbf{A} \) associated to a zero eigenvalue \( \lambda=0 \); then, writing \( \mathbf{A}^i \) for the \( i \)-th column of \( \mathbf{A} \), we have a linear dependence among the columns given as

    \[ \begin{align} \sum_{i=1}^N x_i \mathbf{A}^i = \pmb{0} . \end{align} \]

  • Notice now that,

    \[ \begin{align} \mathbf{A}\pmb{x} =\lambda \pmb{x} & & \Leftrightarrow & & \mathbf{A}\pmb{x} - \lambda \pmb{x} = 0 & & \Leftrightarrow & & \left(\mathbf{A} - \lambda \mathbf{I}_N\right) \pmb{x} = 0. \end{align} \]

  • This means that \( \pmb{x} \) is an eigen vector of the matrix \( \left(\mathbf{A} - \lambda \mathbf{I}_N\right) \) associated to the zero eigenvalue, and hence there is a linear dependence among the columns of \( \left(\mathbf{A} - \lambda \mathbf{I}_N\right) \).
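  • A small numerical sketch of this observation (with an arbitrary symmetric test matrix) checks that \( \mathbf{A} - \lambda \mathbf{I}_N \) is rank deficient for an eigen pair:
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
e_vals, e_vecs = np.linalg.eig(A)
lam, x = e_vals[0], e_vecs[:, 0]

M = A - lam * np.eye(2)
print(np.allclose(M @ x, 0.0))      # x lies in the null space of A - lam * I
print(np.linalg.matrix_rank(M))     # 1: the columns of A - lam * I are linearly dependent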

The polar decomposition

  • Not all matrices can be diagonalized with a change of basis;

    • however, there is a special class of matrices for which this is true, and these arise commonly in statistical analysis.
  • This result is a special case of the more general spectral theorem:

The polar decomposition (restricted spectral theorem)
Suppose that \( \mathbf{A} \) is a real valued symmetric matrix, i.e., \( \mathbf{A} = \mathbf{A}^\top \). Then there exists an orthogonal change of basis matrix \( \mathbf{U} \) and diagonal matrix \( \mathbf{D} \) for which \[ \begin{align} \mathbf{A} = \mathbf{U}\mathbf{D} \mathbf{U}^\top. \end{align} \]
  • The columns of \( \mathbf{U} \) are the (orthonormal) eigen vectors of the matrix \( \mathbf{A} \) and the diagonal elements of \( \mathbf{D} \) are the associated eigen values.

    • Orthogonal matrices represent rotations;
    • therefore, with respect to a rotation of the coordinate frame, the action of the map is scalar multiplication on the invariant subspaces.
  • If the eigenvalues are larger than one in absolute value, this action then corresponds to an expansion along these directions, while eigen values with absolute value less than one correspond to a contraction;

    • a negative eigenvalue corresponds to a reflection in the subspace, including a contraction or expansion as above.
  • The eigenvalues, with a correct rotation of perspective, thus completely describe the action of a symmetric matrix as above.
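  • A minimal sketch of this factorization (using numpy's eigh routine for symmetric arrays, and an arbitrary symmetric test matrix) reconstructs \( \mathbf{A} = \mathbf{U}\mathbf{D}\mathbf{U}^\top \):
import numpy as np
from numpy.random import normal

np.random.seed(1)
B = normal(size=[3, 3])
A = B + B.T                                   # an arbitrary real symmetric matrix

d, U = np.linalg.eigh(A)                      # eigh is specialized to symmetric arrays
print(np.allclose(U @ np.diag(d) @ U.T, A))   # True: A = U D U^T
print(np.allclose(U.T @ U, np.eye(3)))        # True: U is orthogonal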

Eigen values and vectors in Numpy

  • We will make a quick demonstration of the basic eig function in Numpy
import numpy as np
from numpy.linalg import eig
from numpy.random import normal
  • We begin by randomly generating an array with standard normal entries
np.random.seed(42)
my_array = normal(size=[3,3])
my_array
array([[ 0.49671415, -0.1382643 ,  0.64768854],
       [ 1.52302986, -0.23415337, -0.23413696],
       [ 1.57921282,  0.76743473, -0.46947439]])
  • To compute the eigen values and the eigen vectors, we can use the eig function imported above:
e_vals, e_vecs = eig(my_array)

Eigen values and vectors in Numpy

  • From the last slide, we have
e_vals
array([ 1.22795686+0.j        , -0.71743524+0.61245434j,
       -0.71743524-0.61245434j])
  • Notice that two of the eigen values are complex and come in a conjugate pair;

    • generally, real matrices can have complex eigen values, corresponding to rotations within the subspaces.
  • This is not the case, however, for real symmetric matrices as discussed in the polar decomposition.

  • Indeed,

my_symmetric_array = my_array.transpose() @ my_array
my_symmetric_array
array([[ 5.06025801,  0.78664234, -0.77628148],
       [ 0.78664234,  0.66290088, -0.39501919],
       [-0.77628148, -0.39501919,  0.69472676]])
e_vals, e_vecs = eig(my_symmetric_array)

Eigen values and vectors in Numpy

  • From the last slide, we see that the eigen values of the symmetric array are given as
e_vals
array([5.34613967, 0.78854214, 0.28320384])
  • Notice the action of the symmetric array on the eigen vectors
e_vecs[:,0]
array([ 0.96813371,  0.17751704, -0.17664888])
my_symmetric_array @ e_vecs[:,0] / e_vals[0]
array([ 0.96813371,  0.17751704, -0.17664888])
  • If the map needs to be applied many times, computing the change of basis into the eigen vectors can be more cost-efficient than applying the map directly.

  • Notice that the above eigen values are all positive – this is actually a general property of “squared” matrices constructed as

    \[ \begin{align} \mathbf{A}^\top \mathbf{A} \end{align} \]

    for arbitrary arrays \( \mathbf{A} \) with linearly independent columns.
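  • As a rough illustration of the cost-efficiency point above (assuming my_symmetric_array from the previous slides is still in memory), repeated application reduces to powering the diagonal:
# Assumes my_symmetric_array from the previous slides is still defined.
d, U = np.linalg.eigh(my_symmetric_array)

k = 10
# Applying the map k times is U diag(d**k) U^T -- only the diagonal is powered.
power_k = U @ np.diag(d**k) @ U.T
print(np.allclose(power_k, np.linalg.matrix_power(my_symmetric_array, k)))  # True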

Positive definite matrices

  • When all eigenvalues of a symmetric matrix \( \mathbf{A} \) are positive we can visualize the matrix as a hyper-ellipse.

  • The principal axes of the hyper-ellipse are given by the eigen vectors, while the lengths of these axes are given by the eigen values.

Positive (semi)-definite matrices
A real symmetric matrix \( \mathbf{A} \) is said to be positive definite if all of its eigenvalues are greater than zero. If all eigenvalues are non-negative, then \( \mathbf{A} \) is said to be positive semi-definite.
  • Positive definite matrices are special, in part, because they define an alternative, weighted version of the Euclidean norm by \( \sqrt{\pmb{x}^\top \mathbf{A}\pmb{x}} \).

  • For our purposes, we will more commonly use the norm defined by the inverse of a positive definite \( \mathbf{A} \);

    • we will return to this point shortly when we discuss matrix inverses.
  • Let \( \mathbf{D} := \mathrm{diag}\left(\lambda_i\right) \) be the diagonal matrix of the eigen values for the positive definite matrix \( \mathbf{A} \).

  • Positive definite matrices also have well-defined square roots – notice that for

    \[ \sqrt{\mathbf{A}} := \mathbf{A}^\frac{1}{2}:= \mathbf{U}\mathrm{diag}\left(\sqrt{\lambda_i}\right) \mathbf{U}^\top \]

    we can write \( \mathbf{A}= \left(\mathbf{A}^\frac{1}{2}\right)^2 \).
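  • A minimal sketch of the square root construction (again with an arbitrary "squared" test array, and numpy's eigh) verifies \( \mathbf{A} = \left(\mathbf{A}^\frac{1}{2}\right)^2 \):
import numpy as np
from numpy.random import normal

np.random.seed(2)
B = normal(size=[3, 3])
A = B.T @ B                                  # positive (semi-)definite by construction

lam, U = np.linalg.eigh(A)
sqrt_A = U @ np.diag(np.sqrt(lam)) @ U.T     # U diag(sqrt(lambda_i)) U^T
print(np.allclose(sqrt_A @ sqrt_A, A))       # True: the square root squares back to A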

The singular value decomposition

  • Consider now an arbitrary matrix \( \mathbf{M}\in \mathbb{R}^{N\times M} \) for \( N>M \); if \( \mathbf{M} \) is not square, it cannot be symmetric, but each of

    \[ \begin{align} \mathbf{M}^\top \mathbf{M} \in \mathbb{R}^{M\times M} & & \mathbf{M}\mathbf{M}^\top \in \mathbb{R}^{N\times N} \end{align} \] are symmetric.

  • Using the polar decomposition, we can thus find their diagonalizations

    \[ \begin{align} \mathbf{M}^\top \mathbf{M} & = \mathbf{V} \mathbf{D} \mathbf{V}^\top\\ \mathbf{M}\mathbf{M}^\top & = \mathbf{U} \hat{\mathbf{D}} \mathbf{U}^\top, \end{align} \]

    where

    1. \( \mathbf{V} \in \mathbb{R}^{M\times M},\mathbf{U} \in \mathbb{R}^{N \times N} \) are orthogonal matrices;
    2. \( \mathbf{D} \in \mathbb{R}^{M \times M}, \hat{\mathbf{D}} \in \mathbb{R}^{N \times N} \) are both diagonal matrices of the eigenvalues of the “square”.
  • Assume that the columns of \( \mathbf{M} \) are linearly independent – then both \( \mathbf{D} \) and \( \hat{\mathbf{D}} \) have exactly \( M \) non-zero eigenvalues.

  • If we order the non-zero eigenvalues of \( \hat{\mathbf{D}} \) such that they form a non-zero block in the top left, we may write

    \[ \begin{align} \hat{\mathbf{D}} = \begin{pmatrix} \mathbf{D} & 0_{M \times (N-M)} \\ 0_{(N-M) \times M} & 0_{(N-M) \times (N-M)} \end{pmatrix} \end{align} \]

The singular value decomposition

  • It can be demonstrated that \( \mathbf{M} \) has a decomposition of the form

    \[ \begin{align} \mathbf{M} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^\top, \end{align} \]

    where

    \[ \begin{align} \boldsymbol{\Sigma} = \begin{pmatrix} \sqrt{\mathbf{D}} \\ 0_{(N-M) \times M} \end{pmatrix}. \end{align} \]

  • This decomposition is known as the singular value decomposition. The singular values are the (positive) principal-diagonal entries of \( \boldsymbol{\Sigma} \), usually ordered in descending size.

  • The columns of \( \mathbf{U} \) are called the left singular vectors while the columns of \( \mathbf{V} \) are called the right singular vectors.

  • Thus we can write, e.g.,

    \[ \begin{align} \mathbf{M}\mathbf{M}^\top &= \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^\top \mathbf{V} \boldsymbol{\Sigma}^\top \mathbf{U}^\top \\ &= \mathbf{U} \begin{pmatrix} \sqrt{\mathbf{D}} \sqrt{\mathbf{D}} & 0_{M \times (N-M)} \\ 0_{(N-M) \times M} & 0_{(N-M) \times (N-M)} \end{pmatrix} \mathbf{U}^\top \\ & = \mathbf{U} \hat{\mathbf{D}} \mathbf{U}^\top \end{align} \]

  • We can recover the eigen decomposition for \( \mathbf{M}^\top \mathbf{M} \) in a similar fashion.
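  • A brief numerical sketch of this relationship (with an arbitrary rectangular test array) checks that the squared singular values match the non-zero eigenvalues of \( \mathbf{M}^\top\mathbf{M} \):
import numpy as np
from numpy.random import normal

np.random.seed(3)
M = normal(size=[5, 3])                      # N > M, full column rank almost surely

U, s, Vt = np.linalg.svd(M)
evals = np.linalg.eigvalsh(M.T @ M)[::-1]    # eigenvalues in descending order
print(np.allclose(s**2, evals))              # True: singular values are the square roots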

Visualizing the SVD

Image of singular value decomposition mapping diagram

Courtesy of Georg-Johann CC BY-SA 3.0

  • We view the orthogonal transformations as rotations of the standard Euclidean frame;
  • likewise, we view the diagonal matrix as a dilation of the points along the specified frame, stretching the unit circle into an ellipsoid.
  • Therefore, for a matrix transformation \( \mathbf{M} \), we can view its SVD \[ \begin{align} \mathbf{M} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top \end{align} \] as a composition of:
    1. a rotation into a new frame;
    2. a stretch, contraction or mapping to zero along the directions in this frame;
    3. a final rotation back into the Euclidean coordinates, but where the shape may no longer align with this frame.
  • It is important to recognize that this applies for non-square matrices so that the image space and the domain do not match.
  • The trailing zero rows in \[ \begin{align} \boldsymbol{\Sigma} = \begin{pmatrix} \sqrt{\mathbf{D}} \\ 0_{(N-M) \times M} \end{pmatrix} \end{align} \] correspond to directions in the codomain that the map can never reach; the image of \( \mathbf{M} \) lies in a subspace of dimension at most \( M < N \).
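  • A small sketch of this composition (with an arbitrary tall test array, chosen only for illustration) applies the three factors in sequence and compares with the direct application of \( \mathbf{M} \):
import numpy as np
from numpy.random import normal

np.random.seed(4)
M = normal(size=[4, 2])                       # N = 4, M = 2
U, s, Vt = np.linalg.svd(M)                   # full SVD: U is 4x4, Vt is 2x2

Sigma = np.zeros([4, 2])                      # sqrt(D) block over trailing zero rows
Sigma[:2, :2] = np.diag(s)

x = normal(size=2)
# rotate (V^T), then stretch / embed (Sigma), then rotate (U)
print(np.allclose(U @ (Sigma @ (Vt @ x)), M @ x))   # True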

SVD in numpy

  • The singular value decomposition is similarly a built-in function in Numpy, with the values and left / right vectors given as a standard output:
from numpy.linalg import svd
U, Sigma, Vh = svd(my_array)
  • Notice that, unlike the eigen values of the random matrix, the singular values are all non-negative and real-valued
Sigma
array([2.31217207, 0.88799895, 0.53216899])
  • For the real symmetric (and positive definite) array above, the SVD is identical to the polar decomposition:
U, Sigma, Vh = svd(my_symmetric_array)
Sigma
array([5.34613967, 0.78854214, 0.28320384])
e_vals
array([5.34613967, 0.78854214, 0.28320384])
  • SVD routines are more expensive than eigen value routines for symmetric arrays, but tend to be more accurate / stable numerically; therefore, they may still be used in place of an eigen value solver when this is the priority.

Solving a linear system of equations

  • For any eigen vector associated to a zero eigenvalue, the image under the corresponding linear transformation is zero, regardless of which particular vector in the subspace we choose.

    • This means that we cannot decipher what was the input from observing the output of the map on this subspace.
  • We can still attempt to reconstruct the information of the input from an observed output, but the methods will depend on how the relationships constrain the data, and if we can be satisfied with an approximate solution.

    Linear inverse problem
    Let \( \mathbf{A} \in \mathbb{R}^{N \times M} \), \( \pmb{x} \in \mathbb{R}^M \), \( \pmb{b}\in\mathbb{R}^N \). Suppose that \( \mathbf{A} \) and \( \pmb{b} \) are known, but \( \pmb{x} \) is unknown and satisfies the relationship: \[ \mathbf{A} \pmb{x} = \pmb{b}. \] Finding the collection of values \( \pmb{x} \) that satisfy this relationship is known as a linear inverse problem.
  • If \( N>M \) (more rows than columns) then the system is over-determined and there may be no value of \( \pmb{x} \) satisfying the relationship;

    • for \( M>N \) (more columns than rows), the system is under-determined, and there may be infinitely many choices for \( \pmb{x} \).
  • In the case that \( M=N \) and \( \mathbf{A} \) has no zero eigen values, \( \mathbf{A} \) is invertible:

Matrix inverse
Suppose \( \mathbf{A}\in \mathbb{R}^{N \times N} \) has no zero eigen values. There exists an inverse transformation denoted \( \mathbf{A}^{-1} \) that satisfies the property \[ \begin{align} \mathbf{A}^{-1}\mathbf{A} = \mathbf{A}\mathbf{A}^{-1} = \mathbf{I}_N:= \mathrm{diag}(\pmb{1}_N). \end{align} \]

Solving a linear system of equations

  • If we suppose that an inverse for \( \mathbf{A} \) exists, then notice that

    \[ \pmb{x} = \mathbf{A}^{-1} \pmb{b}. \]

    • Therefore, if \( \pmb{b} \) is observed and \( \mathbf{A} \) is a known relationship, we can uniquely identify the inputs from the observations.
  • The way to implement such a procedure in Numpy is with the linalg.solve function.

    • First, we create a simple example vector:
from numpy.linalg import solve, inv
observed_vec = np.arange(3)
observed_vec
array([0, 1, 2])
  • Now, applying the solve function, we compute the desired inputs and demonstrate that the action of my_array on the result recovers the observed vector:
input_vec = solve(my_array, observed_vec)
my_array @ input_vec
array([-2.77555756e-17,  1.00000000e+00,  2.00000000e+00])

Solving a linear system of equations

  • We can also compute the inverse of a matrix directly
my_inverse_array = inv(my_array)
my_inverse_array @ my_array
array([[ 1.00000000e+00,  0.00000000e+00,  4.16333634e-17],
       [-2.22044605e-16,  1.00000000e+00,  0.00000000e+00],
       [ 1.38777878e-16, -6.93889390e-17,  1.00000000e+00]])
  • and then apply its action to the observed vector to recover the input vector:
my_inverse_array @ observed_vec
array([ 0.7323549 ,  0.86895275, -0.37614669])
input_vec
array([ 0.7323549 ,  0.86895275, -0.37614669])
  • However, for purposes of the inverse problem, it is far more efficient to use the solve function as on the last slide.

Pseudo-inverses

  • Consider the linear inverse problem again,

    \[ \begin{align} \mathbf{A} \pmb{x}= \pmb{b} & & \mathbf{A}\in \mathbb{R}^{N \times M} \end{align} \] and suppose that the system is over-determined, \( N>M \), where \( \mathbf{A} \) has the SVD \( \mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top \).

  • We note, \( \mathbf{A} \) is not square and thus has no formal inverse;

    • however, we can still reconstruct a “best” estimate with a pseudo-inverse.
The (left) pseudo-inverse
Suppose that \( \mathbf{A}\in \mathbb{R}^{N\times M} \) has linearly independent columns with \( N>M \). In this case, \( \left(\mathbf{A}^\top\mathbf{A}\right)^{-1} \) exists, and the (left) pseudo-inverse is defined \[ \begin{align} \mathbf{A}^\dagger := \left(\mathbf{A}^\top \mathbf{A}\right)^{-1} \mathbf{A}^\top. \end{align} \]
  • If the pseudo-inverse exists as above it satisfies the property,

    \[ \begin{align} \mathbf{A}^\dagger \mathbf{A} = \left(\mathbf{A}^\top \mathbf{A}\right)^{-1} \mathbf{A}^\top\mathbf{A} = \mathbf{I}_M; \end{align} \]

    • this is what is meant by a “left pseudo-inverse”.
  • Note, right pseudo-inverses also exist under appropriate conditions;

    • these can be used to solve an under-determined linear system, though this will not be the focus of our discussions.
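  • A quick numerical sketch of the left-inverse property (with an arbitrary tall test array) also shows that the normal-equations formula above matches numpy's built-in pinv:
import numpy as np
from numpy.random import normal

np.random.seed(5)
A = normal(size=[5, 3])                          # N > M, independent columns almost surely

A_dag = np.linalg.inv(A.T @ A) @ A.T             # (A^T A)^{-1} A^T
print(np.allclose(A_dag @ A, np.eye(3)))         # True: the left inverse property
print(np.allclose(A_dag, np.linalg.pinv(A)))     # True: agrees with numpy's pinv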

Pseudo-inverses

  • Consider now the representation of the pseudo-inverse with respect to the SVD,

    \[ \begin{align} \mathbf{A}^\dagger &= \left(\mathbf{V}\boldsymbol{\Sigma}^\top\mathbf{U}^\top\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top\right)^{-1} \mathbf{V}\boldsymbol{\Sigma}^\top\mathbf{U}^\top \\ &=\mathbf{V}\left(\boldsymbol{\Sigma}^\top\boldsymbol{\Sigma}\right)^{-1} \boldsymbol{\Sigma}^\top\mathbf{U}^\top \end{align} \]

    where \( \boldsymbol{\Sigma}^\top\boldsymbol{\Sigma}\in \mathbb{R}^{M\times M} \) is the diagonal matrix of the singular values squared.

  • Multiplying on the left by \( \mathbf{A} \),

    \[ \begin{align} \mathbf{A}\mathbf{A}^\dagger &=\mathbf{U} \boldsymbol{\Sigma}\left(\boldsymbol{\Sigma}^\top\boldsymbol{\Sigma}\right)^{-1} \boldsymbol{\Sigma}^\top\mathbf{U}^\top\\ &= \mathbf{U}\begin{pmatrix} \mathbf{I}_M & 0_{M \times (N-M)} \\ 0_{(N-M) \times M} & 0_{(N-M) \times (N-M)} \end{pmatrix}\mathbf{U}^\top \end{align} \] which is the orthogonal projection into the column span of \( \mathbf{A} \); equivalently, \( \hat{\mathbf{U}}\hat{\mathbf{U}}^\top \) where \( \hat{\mathbf{U}} \) denotes the first \( M \) columns of \( \mathbf{U} \).

  • The orthogonal projection lemma thus tells us that for \( \pmb{y} := \mathbf{A}^\dagger \pmb{b} \),

    \[ \begin{align} \parallel \mathbf{A} \pmb{x} - \pmb{b} \parallel \geq \parallel \mathbf{A} \pmb{y} - \pmb{b} \parallel \end{align} \] for any choice of \( \pmb{x} \in \mathbb{R}^M \), as \( \mathbf{A}\pmb{y} \) is the orthogonal projection of \( \pmb{b} \) into the column span of \( \mathbf{A} \).

  • The pseudo-inverse thus gives a coordinatization of the projected vector with respect to the column span of \( \mathbf{A} \), identified with \( \mathbb{R}^M \), but without embedding these coordinates in the higher-dimensional space \( \mathbb{R}^N \).

  • Applying the action of \( \mathbf{A} \) once again raises the coordinates to the embedding in \( \mathbb{R}^N \).
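  • A rough sketch of this projection property (with arbitrary test data) checks that the pseudo-inverse solution attains the smallest residual among randomly sampled inputs:
import numpy as np
from numpy.random import normal

np.random.seed(6)
A = normal(size=[6, 2])
b = normal(size=6)

y = np.linalg.pinv(A) @ b                        # the pseudo-inverse estimate
residual = np.linalg.norm(A @ y - b)

# Any other choice of input gives a residual at least as large.
trials = normal(size=[1000, 2])
print(all(np.linalg.norm(A @ x - b) >= residual - 1e-12 for x in trials))  # True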

Pseudo-inverses

  • It is clear that one can efficiently compute the pseudo-inverse in Numpy by using the SVD formulation directly.

    • One can simply compute \( \mathbf{V} \boldsymbol{\Sigma}^\dagger\mathbf{U}^\top \), where \( \boldsymbol{\Sigma}^\dagger \in \mathbb{R}^{M \times N} \) is the array with the reciprocal singular values on the diagonal wherever the singular values are non-zero, and zeros otherwise.
  • However, Numpy also has a built-in function to handle the pseudo-inverse;

    • this averts instability when inverting singular values that are close to zero by automatically cutting them off at a small threshold.
  • We can use it as follows:

from numpy.linalg import pinv
my_rectangular_array = normal(size=[3,2])
my_pseudo_inverse = pinv(my_rectangular_array)
my_pseudo_inverse @ my_rectangular_array
array([[1.00000000e+00, 5.55111512e-17],
       [0.00000000e+00, 1.00000000e+00]])
  • In the above, the array is not even square, but we are able to find an “inverse” that acts as a true inverse on the column span.

    • This is part of what is meant by an approximate solution to the linear inverse problem, where we can re-write this as a “least squares problem” to be discussed further in the course.
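  • For comparison, a minimal sketch of the manual SVD construction described above (with an illustrative cutoff for small singular values) can be checked against the built-in pinv:
import numpy as np
from numpy.random import normal

np.random.seed(7)
A = normal(size=[4, 2])

U, s, Vt = np.linalg.svd(A)
cutoff = 1e-12 * s.max()                         # an illustrative small threshold
s_dag = np.where(s > cutoff, 1.0 / s, 0.0)       # reciprocal where non-zero, else zero

Sigma_dag = np.zeros([2, 4])                     # note the transposed (M x N) shape
Sigma_dag[:2, :2] = np.diag(s_dag)

A_dag = Vt.T @ Sigma_dag @ U.T                   # V Sigma^dagger U^T
print(np.allclose(A_dag, np.linalg.pinv(A)))     # True: matches the built-in pinv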

Weighted norms and pseudo-norms

  • We remarked earlier that when a matrix is positive definite, we can define a weighted norm from this matrix;

    • in this case, then the matrix inverse is also defined as all eigen values are greater than zero.
  • On the other hand, if a matrix \( \mathbf{A} \in \mathbb{R}^{N \times M} \) has linearly independent columns then \( \mathbf{M}:= \mathbf{A} \mathbf{A}^\top \) is positive semi-definite;

    • in this case, the pseudo-inverse \( \mathbf{A}^\dagger \) is also defined.
  • Often, we will want to take a norm or a pseudo-norm with respect to a positive (semi-)definite matrix, but weighted inversely proportionally to it.

Weighted norms and pseudo-norms
Suppose \( \mathbf{A} \) is positive definite; we denote the vector norm with respect to \( \mathbf{A} \) as \[ \begin{align} \parallel \pmb{x}\parallel_\mathbf{A} := \sqrt{\pmb{x}^\top \mathbf{A}^{-1}\pmb{x}}. \end{align} \] If \( \mathbf{A} \) has full column rank and \( \mathbf{M}:= \mathbf{A}\mathbf{A}^\top \), we denote the pseudo-norm with respect to \( \mathbf{M} \) as \[ \begin{align} \parallel \pmb{x}\parallel_\mathbf{M} := \sqrt{ \left(\mathbf{A}^\dagger\pmb{x}\right)^\top \left(\mathbf{A}^{\dagger}\pmb{x}\right)}. \end{align} \]
  • Note, the pseudo-norm is not a true norm, as it takes a zero value on any vector orthogonal to the column span of \( \mathbf{A} \).

    • In both cases above this represents a true norm on the column span of \( \mathbf{A} \), but where the distance from zero is measured inversely proportionally to the singular values of \( \mathbf{A} \).
    • Directions corresponding to singular values greater than one are penalized less than directions corresponding to singular values less than one.
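  • A brief sketch of the pseudo-norm (with an arbitrary full-column-rank test array, used only for illustration) shows the zero value off the column span:
import numpy as np
from numpy.random import normal

np.random.seed(8)
A = normal(size=[4, 2])                          # full column rank almost surely
A_dag = np.linalg.pinv(A)

def pseudo_norm(x):
    # pseudo-norm with respect to M = A A^T, as in the definition above
    y = A_dag @ x
    return np.sqrt(y @ y)

print(pseudo_norm(A @ np.array([1.0, -2.0])))    # positive: the argument lies in the column span
Q, _ = np.linalg.qr(A, mode="complete")          # Q[:, 2:] spans the orthogonal complement
print(pseudo_norm(Q[:, 2]))                      # ~ 0: orthogonal to the column span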