PRESS criterion
Definition
The Predicted Residual Error Sum of Squares (PRESS), called also the ordinary cross-validation, is based on the basic leave-one-out cross-validation, which is proposed by D. M. Allen, 1974 .
Let \(\mathbf{x}_\lambda^{(l)}\) be the solution in which the \(l\)-th observation is omitted.
The PRESS criterion’s argument is that if \(\lambda\) is a good choice, then the \(l\)-th component \(\left(\mathbf{T}\mathbf{x}_\lambda^{(l)}\right)_l\) should be a good predictor of \(b_l\).
Therefore, the PRESS criterion leads to choosing \(\lambda\) as the minimizer of the PRESS function \(\mathcal{P}(\lambda)\), defined by
\[
\mathcal{P}(\lambda)
\equiv
\sum_{l=1}^{M}
\left[
\left(
\mathbf{T}\mathbf{x}_\lambda^{(l)}
\right)_l
-
b_l
\right]^2.
\]
Using the Generalized Tikhonov regularization solution, let
\[
\hat{\mathbf{T}} \equiv \mathbf{B}\mathbf{T},\quad
\hat{\mathbf{b}} \equiv \mathbf{B}\mathbf{b},\quad
\mathbf{Q} = \mathbf{B}^\mathsf{T}\mathbf{B}.
\]
Then the PRESS criterion can be rewritten as
(1)\[
\mathcal{P}(\lambda)
=
\left\|
\mathbf{D}_\lambda
\left(
\mathbf{I} - \mathbf{A}_\lambda
\right)
\hat{\mathbf{b}}
\right\|_2^2,
\]
where
\[\begin{split}
\begin{align*}\mathbf{A}_\lambda&\equiv\hat{\mathbf{T}}\left(\hat{\mathbf{T}}^\mathsf{T}\hat{\mathbf{T}} + \lambda\mathbf{H}\right)^{-1}\hat{\mathbf{T}}^\mathsf{T}
=
\mathbf{B}\mathbf{T}\left(\mathbf{T}^\mathsf{T}\mathbf{Q}\mathbf{T} + \lambda\mathbf{H}\right)^{-1}\mathbf{T}^\mathsf{T}\mathbf{B}^\mathsf{T},\\
\mathbf{D}_\lambda
&\equiv
\mathrm{diag}
\left(
\cdots,\frac{1}{1 - a_{\lambda, ii}},\cdots
\right),\\
a_{\lambda, ii}
&\equiv
\left(\mathbf{A}_\lambda\right)_{ii}.
\end{align*}
\end{split}\]
Using series-expansion form of the solution, \(\mathcal{P}(\lambda)\) can be written as
(2)\[
\mathcal{P}(\lambda)
=
\left\|
\left[
\mathrm{Diag}\left(
\mathbf{I} - \mathbf{U}\mathbf{F}_\lambda\mathbf{U}^\mathsf{T}
\right)
\right]^{-1}
\left(
\mathbf{I} - \mathbf{U}\mathbf{F}_\lambda\mathbf{U}^\mathsf{T}
\right)\hat{\mathbf{b}}
\right\|_2^2,
\]
where \(\mathrm{Diag}(\mathbf{M})\) denotes the diagonal matrix formed from the diagonal entries of \(\mathbf{M}\).
Equivalently, expanding element-wise with \(u_{li}\equiv(\mathbf{U})_{li}\):
(3)\[
\mathcal{P}(\lambda)
=
\sum_{l=1}^{M}
\left[
\frac{
\hat{b}_l -
\sum_{i=1}^{r}
f_{\lambda,i}\,u_{li}\,(\mathbf{u}_i^\mathsf{T}\hat{\mathbf{b}})
}{
1 -
\sum_{i=1}^{r} f_{\lambda,i}\,u_{li}^2
}
\right]^2.
\]
Derivation of (1)
Let \(\hat{\mathbf{t}}_l\in\mathbb{R}^N\) be the \(l\)-th row of \(\hat{\mathbf{T}}\) treated as a column vector, and define
\[
\mathbf{K}_\lambda \equiv \hat{\mathbf{T}}^\mathsf{T}\hat{\mathbf{T}} + \lambda\mathbf{H}
= \mathbf{T}^\mathsf{T}\mathbf{Q}\mathbf{T} + \lambda\mathbf{H}.
\]
Then the solution of the transformed generalized Tikhonov problem is
\[
\mathbf{x}_\lambda = \mathbf{K}_\lambda^{-1}\hat{\mathbf{T}}^\mathsf{T}\hat{\mathbf{b}}.
\]
Removing the \(l\)-th observation from the transformed system changes the normal equations as
\[\begin{split}
\begin{cases}
\hat{\mathbf{T}}^{(l)\mathsf{T}}\hat{\mathbf{T}}^{(l)}
= \hat{\mathbf{T}}^\mathsf{T}\hat{\mathbf{T}} - \hat{\mathbf{t}}_l\hat{\mathbf{t}}_l^\mathsf{T},\\[4pt]
\hat{\mathbf{T}}^{(l)\mathsf{T}}\hat{\mathbf{b}}^{(l)}
= \hat{\mathbf{T}}^\mathsf{T}\hat{\mathbf{b}} - \hat{b}_l\hat{\mathbf{t}}_l,
\end{cases}
\end{split}\]
so the \(l\)-th leave-one-out solution \(\mathbf{x}_\lambda^{(l)}\) is given by
\[
\mathbf{x}_\lambda^{(l)}
=
\bigl(\mathbf{K}_\lambda - \hat{\mathbf{t}}_l\hat{\mathbf{t}}_l^\mathsf{T}\bigr)^{-1}
\bigl(\hat{\mathbf{T}}^\mathsf{T}\hat{\mathbf{b}} - \hat{b}_l\hat{\mathbf{t}}_l\bigr).
\]
Applying the Sherman-Morrison formula to the rank-1 perturbation with \(\hat{\mathbf{c}}_l\equiv\mathbf{K}_\lambda^{-1}\hat{\mathbf{t}}_l\) gives
\[
\bigl(\mathbf{K}_\lambda - \hat{\mathbf{t}}_l\hat{\mathbf{t}}_l^\mathsf{T}\bigr)^{-1}
=
\mathbf{K}_\lambda^{-1}
+
\frac{\hat{\mathbf{c}}_l\hat{\mathbf{c}}_l^\mathsf{T}}{1 - a_{\lambda,ll}},
\quad
a_{\lambda,ll}
= \hat{\mathbf{t}}_l^\mathsf{T}\hat{\mathbf{c}}_l
= \bigl(\hat{\mathbf{T}}\mathbf{K}_\lambda^{-1}\hat{\mathbf{T}}^\mathsf{T}\bigr)_{ll}
= \bigl(\mathbf{A}_\lambda\bigr)_{ll}.
\]
Substituting and using
\(\hat{\mathbf{c}}_l^\mathsf{T}\hat{\mathbf{T}}^\mathsf{T}\hat{\mathbf{b}}
=\hat{\mathbf{t}}_l^\mathsf{T}\mathbf{x}_\lambda
=(\mathbf{A}_\lambda\hat{\mathbf{b}})_l\)
and
\(\hat{\mathbf{c}}_l^\mathsf{T}\hat{\mathbf{t}}_l=a_{\lambda,ll}\):
\[\begin{split}
\begin{align*}
\mathbf{x}_\lambda^{(l)}
&=
\mathbf{x}_\lambda - \hat{b}_l\hat{\mathbf{c}}_l
+
\frac{\hat{\mathbf{c}}_l\bigl[(\mathbf{A}_\lambda\hat{\mathbf{b}})_l - \hat{b}_l a_{\lambda,ll}\bigr]}
{1 - a_{\lambda,ll}}\\
&=
\mathbf{x}_\lambda
+
\hat{\mathbf{c}}_l\cdot
\frac{(\mathbf{A}_\lambda\hat{\mathbf{b}})_l - \hat{b}_l}{1 - a_{\lambda,ll}}.
\end{align*}
\end{split}\]
Hence the \(l\)-th leave-one-out prediction error in the transformed space is
\[\begin{split}
\begin{align*}
\bigl(\hat{\mathbf{T}}\mathbf{x}_\lambda^{(l)}\bigr)_l - \hat{b}_l
&=
\hat{\mathbf{t}}_l^\mathsf{T}\mathbf{x}_\lambda^{(l)} - \hat{b}_l\\
&=
\hat{\mathbf{t}}_l^\mathsf{T}\mathbf{x}_\lambda - \hat{b}_l
+
\hat{\mathbf{t}}_l^\mathsf{T}\hat{\mathbf{c}}_l\cdot
\frac{(\mathbf{A}_\lambda\hat{\mathbf{b}})_l - \hat{b}_l}{1 - a_{\lambda,ll}}\\
&=
(\mathbf{A}_\lambda\hat{\mathbf{b}})_l - \hat{b}_l
+
a_{\lambda,ll}\cdot
\frac{(\mathbf{A}_\lambda\hat{\mathbf{b}})_l - \hat{b}_l}{1 - a_{\lambda,ll}}\\
&=
\frac{(\mathbf{A}_\lambda\hat{\mathbf{b}})_l - \hat{b}_l}{1 - a_{\lambda,ll}}.
\end{align*}
\end{split}\]
Summing the squares over all \(l\) yields (1):
\[
\mathcal{P}(\lambda)
=
\sum_{l=1}^{M}\left[\frac{\bigl[(\mathbf{I}-\mathbf{A}_\lambda)\hat{\mathbf{b}}\bigr]_l}{1-a_{\lambda,ll}}\right]^2
=
\bigl\|\mathbf{D}_\lambda(\mathbf{I}-\mathbf{A}_\lambda)\hat{\mathbf{b}}\bigr\|_2^2.
\]
Derivation of (2)
Using the decomposed solution form, we first identify \(\mathbf{A}_\lambda\):
\[\begin{split}
\begin{align*}
\mathbf{A}_\lambda\hat{\mathbf{b}}
&=
\hat{\mathbf{T}}\mathbf{x}_\lambda\\
&=
\hat{\mathbf{T}}\tilde{\mathbf{V}}\mathbf{F}_\lambda\mathbf{S}^{-1}\mathbf{U}^\mathsf{T}\hat{\mathbf{b}}\\
&=
\mathbf{U}\mathbf{S}\mathbf{F}_\lambda\mathbf{S}^{-1}\mathbf{U}^\mathsf{T}\hat{\mathbf{b}}\quad(\because \hat{\mathbf{T}}\tilde{\mathbf{V}} = \mathbf{U}\mathbf{S})\\
&=
\mathbf{U}\mathbf{F}_\lambda\mathbf{U}^\mathsf{T}\hat{\mathbf{b}},\\
\therefore\quad
\mathbf{A}_\lambda &= \mathbf{U}\mathbf{F}_\lambda\mathbf{U}^\mathsf{T}.
\end{align*}
\end{split}\]
Substituting the above directly into (1) gives (2), since \(\mathbf{D}_\lambda = [\mathrm{Diag}(\mathbf{I}-\mathbf{A}_\lambda)]^{-1}\) by definition.
To obtain the component form (3), note that the \(l\)-th diagonal entry of \(\mathbf{A}_\lambda\) is
\[
a_{\lambda,ll}
=
\left(\mathbf{U}\mathbf{F}_\lambda\mathbf{U}^\mathsf{T}\right)_{ll}
= \sum_{i=1}^{r} f_{\lambda,i}\,u_{li}^2,
\]
and the \(l\)-th component of \((\mathbf{I}-\mathbf{A}_\lambda)\hat{\mathbf{b}}\) is
\[
\hat{b}_l - \left(\mathbf{U}\mathbf{F}_\lambda\mathbf{U}^\mathsf{T}\hat{\mathbf{b}}\right)_l
= \hat{b}_l - \sum_{i=1}^{r} f_{\lambda,i}\,u_{li}\,(\mathbf{u}_i^\mathsf{T}\hat{\mathbf{b}}).
\]
Dividing the \(l\)-th residual component by \((1-a_{\lambda,ll})\) and summing the squares over \(l\) yields (3).