PRESS criterion

Definition

The Predicted Residual Error Sum of Squares (PRESS), called also the ordinary cross-validation, is based on the basic leave-one-out cross-validation, which is proposed by D. M. Allen, 1974 [1].
Let \(\mathbf{x}_\lambda^{(l)}\) be the solution in which the \(l\)-th observation is omitted. The PRESS criterion’s argument is that if \(\lambda\) is a good choice, then the \(l\)-th component \(\left(\mathbf{T}\mathbf{x}_\lambda^{(l)}\right)_l\) should be a good predictor of \(b_l\). Therefore, the PRESS criterion leads to choosing \(\lambda\) as the minimizer of the PRESS function \(\mathcal{P}(\lambda)\), defined by

\[ \mathcal{P}(\lambda) \equiv \sum_{l=1}^{M} \left[ \left( \mathbf{T}\mathbf{x}_\lambda^{(l)} \right)_l - b_l \right]^2. \]

Using the Generalized Tikhonov regularization solution, let

\[ \hat{\mathbf{T}} \equiv \mathbf{B}\mathbf{T},\quad \hat{\mathbf{b}} \equiv \mathbf{B}\mathbf{b},\quad \mathbf{Q} = \mathbf{B}^\mathsf{T}\mathbf{B}. \]

Then the PRESS criterion can be rewritten as

(1)\[ \mathcal{P}(\lambda) = \left\| \mathbf{D}_\lambda \left( \mathbf{I} - \mathbf{A}_\lambda \right) \hat{\mathbf{b}} \right\|_2^2, \]

where

\[\begin{split} \begin{align*}\mathbf{A}_\lambda&\equiv\hat{\mathbf{T}}\left(\hat{\mathbf{T}}^\mathsf{T}\hat{\mathbf{T}} + \lambda\mathbf{H}\right)^{-1}\hat{\mathbf{T}}^\mathsf{T} = \mathbf{B}\mathbf{T}\left(\mathbf{T}^\mathsf{T}\mathbf{Q}\mathbf{T} + \lambda\mathbf{H}\right)^{-1}\mathbf{T}^\mathsf{T}\mathbf{B}^\mathsf{T},\\ \mathbf{D}_\lambda &\equiv \mathrm{diag} \left( \cdots,\frac{1}{1 - a_{\lambda, ii}},\cdots \right),\\ a_{\lambda, ii} &\equiv \left(\mathbf{A}_\lambda\right)_{ii}. \end{align*} \end{split}\]

Using series-expansion form of the solution, \(\mathcal{P}(\lambda)\) can be written as

(2)\[ \mathcal{P}(\lambda) = \left\| \left[ \mathrm{Diag}\left( \mathbf{I} - \mathbf{U}\mathbf{F}_\lambda\mathbf{U}^\mathsf{T} \right) \right]^{-1} \left( \mathbf{I} - \mathbf{U}\mathbf{F}_\lambda\mathbf{U}^\mathsf{T} \right)\hat{\mathbf{b}} \right\|_2^2, \]

where \(\mathrm{Diag}(\mathbf{M})\) denotes the diagonal matrix formed from the diagonal entries of \(\mathbf{M}\).

Equivalently, expanding element-wise with \(u_{li}\equiv(\mathbf{U})_{li}\):

(3)\[ \mathcal{P}(\lambda) = \sum_{l=1}^{M} \left[ \frac{ \hat{b}_l - \sum_{i=1}^{r} f_{\lambda,i}\,u_{li}\,(\mathbf{u}_i^\mathsf{T}\hat{\mathbf{b}}) }{ 1 - \sum_{i=1}^{r} f_{\lambda,i}\,u_{li}^2 } \right]^2. \]

Derivation of (1)

Let \(\hat{\mathbf{t}}_l\in\mathbb{R}^N\) be the \(l\)-th row of \(\hat{\mathbf{T}}\) treated as a column vector, and define

\[ \mathbf{K}_\lambda \equiv \hat{\mathbf{T}}^\mathsf{T}\hat{\mathbf{T}} + \lambda\mathbf{H} = \mathbf{T}^\mathsf{T}\mathbf{Q}\mathbf{T} + \lambda\mathbf{H}. \]

Then the solution of the transformed generalized Tikhonov problem is

\[ \mathbf{x}_\lambda = \mathbf{K}_\lambda^{-1}\hat{\mathbf{T}}^\mathsf{T}\hat{\mathbf{b}}. \]

Removing the \(l\)-th observation from the transformed system changes the normal equations as

\[\begin{split} \begin{cases} \hat{\mathbf{T}}^{(l)\mathsf{T}}\hat{\mathbf{T}}^{(l)} = \hat{\mathbf{T}}^\mathsf{T}\hat{\mathbf{T}} - \hat{\mathbf{t}}_l\hat{\mathbf{t}}_l^\mathsf{T},\\[4pt] \hat{\mathbf{T}}^{(l)\mathsf{T}}\hat{\mathbf{b}}^{(l)} = \hat{\mathbf{T}}^\mathsf{T}\hat{\mathbf{b}} - \hat{b}_l\hat{\mathbf{t}}_l, \end{cases} \end{split}\]

so the \(l\)-th leave-one-out solution \(\mathbf{x}_\lambda^{(l)}\) is given by

\[ \mathbf{x}_\lambda^{(l)} = \bigl(\mathbf{K}_\lambda - \hat{\mathbf{t}}_l\hat{\mathbf{t}}_l^\mathsf{T}\bigr)^{-1} \bigl(\hat{\mathbf{T}}^\mathsf{T}\hat{\mathbf{b}} - \hat{b}_l\hat{\mathbf{t}}_l\bigr). \]

Applying the Sherman-Morrison formula to the rank-1 perturbation with \(\hat{\mathbf{c}}_l\equiv\mathbf{K}_\lambda^{-1}\hat{\mathbf{t}}_l\) gives

\[ \bigl(\mathbf{K}_\lambda - \hat{\mathbf{t}}_l\hat{\mathbf{t}}_l^\mathsf{T}\bigr)^{-1} = \mathbf{K}_\lambda^{-1} + \frac{\hat{\mathbf{c}}_l\hat{\mathbf{c}}_l^\mathsf{T}}{1 - a_{\lambda,ll}}, \quad a_{\lambda,ll} = \hat{\mathbf{t}}_l^\mathsf{T}\hat{\mathbf{c}}_l = \bigl(\hat{\mathbf{T}}\mathbf{K}_\lambda^{-1}\hat{\mathbf{T}}^\mathsf{T}\bigr)_{ll} = \bigl(\mathbf{A}_\lambda\bigr)_{ll}. \]

Substituting and using \(\hat{\mathbf{c}}_l^\mathsf{T}\hat{\mathbf{T}}^\mathsf{T}\hat{\mathbf{b}} =\hat{\mathbf{t}}_l^\mathsf{T}\mathbf{x}_\lambda =(\mathbf{A}_\lambda\hat{\mathbf{b}})_l\) and \(\hat{\mathbf{c}}_l^\mathsf{T}\hat{\mathbf{t}}_l=a_{\lambda,ll}\):

\[\begin{split} \begin{align*} \mathbf{x}_\lambda^{(l)} &= \mathbf{x}_\lambda - \hat{b}_l\hat{\mathbf{c}}_l + \frac{\hat{\mathbf{c}}_l\bigl[(\mathbf{A}_\lambda\hat{\mathbf{b}})_l - \hat{b}_l a_{\lambda,ll}\bigr]} {1 - a_{\lambda,ll}}\\ &= \mathbf{x}_\lambda + \hat{\mathbf{c}}_l\cdot \frac{(\mathbf{A}_\lambda\hat{\mathbf{b}})_l - \hat{b}_l}{1 - a_{\lambda,ll}}. \end{align*} \end{split}\]

Hence the \(l\)-th leave-one-out prediction error in the transformed space is

\[\begin{split} \begin{align*} \bigl(\hat{\mathbf{T}}\mathbf{x}_\lambda^{(l)}\bigr)_l - \hat{b}_l &= \hat{\mathbf{t}}_l^\mathsf{T}\mathbf{x}_\lambda^{(l)} - \hat{b}_l\\ &= \hat{\mathbf{t}}_l^\mathsf{T}\mathbf{x}_\lambda - \hat{b}_l + \hat{\mathbf{t}}_l^\mathsf{T}\hat{\mathbf{c}}_l\cdot \frac{(\mathbf{A}_\lambda\hat{\mathbf{b}})_l - \hat{b}_l}{1 - a_{\lambda,ll}}\\ &= (\mathbf{A}_\lambda\hat{\mathbf{b}})_l - \hat{b}_l + a_{\lambda,ll}\cdot \frac{(\mathbf{A}_\lambda\hat{\mathbf{b}})_l - \hat{b}_l}{1 - a_{\lambda,ll}}\\ &= \frac{(\mathbf{A}_\lambda\hat{\mathbf{b}})_l - \hat{b}_l}{1 - a_{\lambda,ll}}. \end{align*} \end{split}\]

Summing the squares over all \(l\) yields (1):

\[ \mathcal{P}(\lambda) = \sum_{l=1}^{M}\left[\frac{\bigl[(\mathbf{I}-\mathbf{A}_\lambda)\hat{\mathbf{b}}\bigr]_l}{1-a_{\lambda,ll}}\right]^2 = \bigl\|\mathbf{D}_\lambda(\mathbf{I}-\mathbf{A}_\lambda)\hat{\mathbf{b}}\bigr\|_2^2. \]

Derivation of (2)

Using the decomposed solution form, we first identify \(\mathbf{A}_\lambda\):

\[\begin{split} \begin{align*} \mathbf{A}_\lambda\hat{\mathbf{b}} &= \hat{\mathbf{T}}\mathbf{x}_\lambda\\ &= \hat{\mathbf{T}}\tilde{\mathbf{V}}\mathbf{F}_\lambda\mathbf{S}^{-1}\mathbf{U}^\mathsf{T}\hat{\mathbf{b}}\\ &= \mathbf{U}\mathbf{S}\mathbf{F}_\lambda\mathbf{S}^{-1}\mathbf{U}^\mathsf{T}\hat{\mathbf{b}}\quad(\because \hat{\mathbf{T}}\tilde{\mathbf{V}} = \mathbf{U}\mathbf{S})\\ &= \mathbf{U}\mathbf{F}_\lambda\mathbf{U}^\mathsf{T}\hat{\mathbf{b}},\\ \therefore\quad \mathbf{A}_\lambda &= \mathbf{U}\mathbf{F}_\lambda\mathbf{U}^\mathsf{T}. \end{align*} \end{split}\]

Substituting the above directly into (1) gives (2), since \(\mathbf{D}_\lambda = [\mathrm{Diag}(\mathbf{I}-\mathbf{A}_\lambda)]^{-1}\) by definition.

To obtain the component form (3), note that the \(l\)-th diagonal entry of \(\mathbf{A}_\lambda\) is

\[ a_{\lambda,ll} = \left(\mathbf{U}\mathbf{F}_\lambda\mathbf{U}^\mathsf{T}\right)_{ll} = \sum_{i=1}^{r} f_{\lambda,i}\,u_{li}^2, \]

and the \(l\)-th component of \((\mathbf{I}-\mathbf{A}_\lambda)\hat{\mathbf{b}}\) is

\[ \hat{b}_l - \left(\mathbf{U}\mathbf{F}_\lambda\mathbf{U}^\mathsf{T}\hat{\mathbf{b}}\right)_l = \hat{b}_l - \sum_{i=1}^{r} f_{\lambda,i}\,u_{li}\,(\mathbf{u}_i^\mathsf{T}\hat{\mathbf{b}}). \]

Dividing the \(l\)-th residual component by \((1-a_{\lambda,ll})\) and summing the squares over \(l\) yields (3).