*This post is the nineth (and probably last) one of our series on* *the history and foundations of econometric and machine learning models. The first fours were on econometrics techniques. Part 8 is online here.*

## Optimization and algorithmic aspects

In econometrics, (numerical) optimization became omnipresent as soon as we left the Gaussian model. We briefly mentioned it in the section on the exponential family, and the use of the Fisher score (gradient descent) to solve the first order condition \mathbf{X}^T W(\beta)^{-1})[y-\widehat{y}]=\mathbf{0}. In learning, optimization is the central tool. And it is necessary to have effective optimization algorithms, to solve problems (described previously) of the form: \widehat{\beta}\in\underset{\beta\in\mathbb{R}^p}{\text{argmin}}\left\lbrace\sum_{i=1}^n \ell(y_i,\beta_0+\mathbf{x}^T\beta)+\lambda\Vert\boldsymbol{\beta}\Vert\right\rbraceIn some cases, instead of global optimization, it is sufficient to consider optimization by coordinates (widely studied in Daubechies et al. (2004)). If f:\mathbb{R}^d\rightarrow\mathbf{R} is convex and differentiable, if \mathbf{x} satisfies f(\mathbf{x}+h\boldsymbol{e}_i)\geq f(\mathbf{x}) for any h>0 and i\in\{1,\cdots, d\}then f(\mathbf{x})=\min\{f\}, where \mathbf{e}=(\mathbf{e}_i) is the canonical basis of \mathbb{R}^d. However, this property is not true in the non-differentiable case. But if we assume that the non-differentiable part is separable (additively), it becomes true again. More specifically, iff(\mathbf{x})=g(\mathbf{x})+\sum_{i=1}^d h_i(x_i)with\left\lbrace\begin{array}{l}g: \mathbb{R}^d\rightarrow\mathbb{R}\text{ convex-differentiable}\\h_i: \mathbb{R}\rightarrow\mathbb{R}\text{ convex}\end{array}\right.This was the case for Lasso regression, \beta)\mapsto\| \mathbf{y}-\beta_0-\mathbf{X}\beta\|_{\ell_2 }+\lambda\|\beta\|_{\ell_1}, as shown by Tsen (2001). Getting back to our initial notations, we can use a coordinate descent algorithm: from an initial value \mathbf{x}^{(0)}, we consider (by iterating)x_j^{(k)}\in\text{argmin}\big\lbrace f(x_1^{(k)},\cdots,x_{k-1}^{(k)},x_k,x_{k+1}^{(k-1)},\cdots,x_n^{(k-1)})\big\rbrace for j=1,2,\cdots,nThese algorithmic problems and numerical issues may seem secondary to econometricians. However, they are essential in automatic learning: a technique is interesting if there is a stable and fast algorithm, which allows to obtain a solution. These optimization techniques can be transposed: for example, this coordinate descent technique can be used in the case of SVM methods (known as “vector support” methods) when the space is not linearly separable, and the classification error must be penalized (we will come back to this technique in the next section).

## In-sample, out-of-sample and cross-validation

These techniques seem intellectually interesting, but we have not yet discussed the choice of the penalty parameter \lambda. But this problem is actually more general, because comparing two parameters \widehat{\beta}_{\lambda_1} and \widehat{\beta}_{\lambda_2} is actually comparing two models. In particular, if we use a Lasso method, with different thresholds \lambda, we compare models that do not have the same dimension. Previously, we have addressed the problem of model comparison from an econometric perspective (by penalizing overly complex models). In the learning literature, judging the quality of a model on the data used to construct it does not make it possible to know how the model will behave on new data. This is the so-called “generalization” problem. The traditional approach then consists in separating the sample (size n) into two parts: a part that will be used to train the model (the training database, in-sample, size m) and a part that will be used to test the model (the testing database, out-of-sample, size n-m). The latter then makes it possible to measure a real predictive risk. Suppose that the data are generated by a linear model y_i=\mathbf{x}_i^T \beta_0+\varepsilon_i where \varepsilon_i are independent and centred law achievements. The empirical quadratic risk in-sample is here\frac{1}{m}\sum_{i=1}^m\mathbb{E}\big([\mathbf{x}_i^T \widehat{\beta}-\mathbf{x}_i^T \beta_0]^2\big)=\mathbb{E}\big([\mathbf{x}_i^T \widehat{\beta}-\mathbf{x}_i^T \beta_0]^2\big),for any observation i. Assuming the residuals \varepsilon Gaussian, then we can show that this risk is worth \sigma^2 \text{trace} (\Pi_X)/m is \sigma^2 p/m. On the other hand, the empirical out-of-sample quadratic risk is here \mathbb{E}\big([\mathbf{x}^T \widehat{\beta}-\mathbf{x}^T \beta_0]^2\big) where \mathbf{x} is a new observation, independent of the others. It can be noted that \mathbb{E}\big([\mathbf{x}^T \widehat{\beta}-\mathbf{x}^T \beta_0]^2\big\vert \mathbf{x}\big)=\text{Var}\big(\mathbf{x}^T \widehat{\beta}\big\vert \mathbf{x}\big)=\sigma^2\mathbf{x}^T(\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x},and by integrating with respect to \mathbf{x}, \mathbb{E}\big([\mathbf{x}^T \widehat{\beta}-\mathbf{x}^T\beta_0]^2\big)=\sigma^2\text{trace}\big(\mathbb{E}[\mathbf{x}\mathbf{x}^T]\mathbb{E}\big[(\mathbf{x}^T\mathbf{x})^{-1}\big]\big).The expression is then different from that obtained in-sample, and using the Groves & Rothenberg (1969) increase, we can show that \mathbb{E}\big([\mathbf{x}^T \widehat{\beta}-\mathbf{x}^T \beta_0]^2\big) \geq \sigma^2\frac{p}{m}which is pretty intuitive, when we start thinking about it. Except in some simple cases, there is no simple (explicit) formula. Note, however, that if \mathbf{X}\sim\mathcal{N}(0,\sigma^2 \mathbb{I}), then \mathbf{x}^T \mathbf{x} follows a Wishart law, and it can be shown that \mathbb{E}\big([\mathbf{x}^T \widehat{\beta}-\mathbf{x}^T \beta_0]^2\big)=\sigma^2\frac{p}{m-p-1}.If we now look at the empirical version: if \widehat{\beta} is estimated on the first m observations,\widehat{\mathcal{R}}^{~\text{ IS}}=\sum_{i=1}^m [y_i-\boldsymbol{x}_i^T\widehat{\boldsymbol{\beta}}]^2\text{ and }\widehat{\mathcal{R}}^{\text{ OS}}=\sum_{i=m+1}^{n} [y_i-\boldsymbol{x}_i^T\widehat{\boldsymbol{\beta}}]^2and as Leeb (2008) noted, \widehat{\mathcal{R}}^{\text{IS}}-\widehat{\mathcal{R}}^{\text{OS}}\approx 2\cdot\nu where \nu represents the number of degrees of freedom, which is not unlike the penalty used in the Akaike test.

Figure 4 shows the respective evolution of \widehat{\mathcal{R}}^{\text{IS}} and \widehat{\mathcal{R}}^{\text{OS}} according to the complexity of the model (number of degrees in a polynomial regression, number of nodes in splines, etc). The more complex the model, the more \widehat{\mathcal{R}}^{\text{IS}} will decrease (this is the red curve, below). But that’s not what we’re interested in here: we want a model that predicts well on new data (i. e. out-of-sample). As Figure 4 shows, if the model is too simple, it does not predict well (as it does with in-sample data). But what we can see is that if the model is too complex, we are in a situation of “overlearning”: the model will start to model the noise. Of course, this figure should remind us of the one we’ve seen in our second post of that series

**Figure 4** : Generalization, under- and over-fitting

Instead of splitting the database in two, with some of the data that will be used to calibrate the model and some to study its performance, it is also possible to use cross-validation. To present the general idea, we can go back to the “jackknife”, introduced by Quenouille (1949) (and formalized by Quenouille (1956) and Tukey (1958)) relatively used in statistics to reduce bias. Indeed, if we assume that \{y_1,\cdots,y_n\} is a sample drawn according to a law F_\theta, and that we have an estimator T_n (\mathbf{y})=T_n (y_1,\cdots,y_n), but that this estimator is biased, with \mathbf{E}[T_n (\mathbf{Y})]=\theta+O(n^{-1}), it is possible to reduce the bias by considering \widetilde{T}_n(\mathbf{y})=\frac{1}{n}\sum_{i=1}^n T_{n-1}(\mathbf{y}_{(i)})\text{ where }\mathbf{y}_{(i)}=(y_1,\cdots,y_{i-1},y_{i+1},\cdots,y_n)It can then be shown that \mathbb{E}[\tilde{T}_n(Y)]=\theta+O(n^{-2})The idea of cross-validation is based on the idea of building an estimator by removing an observation. Since we want to build a predictive model, we will compare the forecast obtained with the estimated model, and the missing observation\widehat{\mathcal{R}}^{\text{ CV}}=\frac{1}{n}\sum_{i=1}^n \ell(y_i,\widehat{m}_{(i)}(\mathbf{x}_i))We will speak here of the “leave-one-out” (loocv) method.

This technique reminds us of the traditional method used to find the optimal parameter in exponential smoothing methods for time series. In simple smoothing, we will construct a forecast from a time series as {}_t\widehat{y}_{t+1} =\alpha\cdot{}_{t-1}\widehat{y}_t +(1-\alpha)\cdot y_t, where \alpha\in[0,1], and we will consider as “optimal” \alpha^\star = \underset{\alpha\in[0,1]}{\text{argmin}}\left\lbrace \sum_{t=2}^T \ell({}_{t-1}\widehat{y}_{t},y_{t}) \right\rbraceas described by Hyndman et al (2009).

The main problem with the leave-one-out method is that it requires calibration of n models, which can be problematic in large dimensions. An alternative method is cross validation by k-blocks (called “k-fold cross validation”) which consists in using a partition of \{1,\cdots,n\} in k groups (or blocks) of the same size, \mathcal{I}_1,\cdots,\mathcal{I}_k, and let us note \mathcal{I}_{\bar j}=\{1,\cdots,n\}\setminus \mathcal{I}_j. By noting \widehat{m}_{(j)} built on the sample \mathcal{I}_{\bar j}, we then set:\widehat{\mathcal{R}}^{k-\text{ CV}}=\frac{1}{k}\sum_{j=1}^k \mathcal{R}_j\text{ where }\mathcal{R}_j=\frac{k}{n}\sum_{i\in\mathcal{I}_{{j}}} \ell(y_i,\widehat{m}_{(j)}(\mathbf{x}_i))Standard cross-validation, where only one observation is removed each time (loocv), is a special case, with k=n. Using k=5 or 10 has a double advantage over k=n: (1) the number of estimates to be made is much smaller, 5 or 10 rather than n; (2) the samples used for estimation are less similar and therefore less correlated to each other, which tends to avoid excess variance, as recalled by James et al. (2013).

Another alternative is to use boosted samples. Let \mathcal{I}_b be a sample of size n obtained by drawing with replacement in \{1,\cdots,n\} to know which observations (y_i,\mathbf{x}_i) will be kept in the learning population (at each draw). Note \mathcal{I}_{\bar b}=\{1,\cdots,n\}\setminus\mathcal{I}_b. By noting \widehat{m}_{(b)} built on sample \mathcal{I}_b, we then set :\widehat{\mathcal{R}}^{\text{ B}}=\frac{1}{B}\sum_{b=1}^B \mathcal{R}_b\text{ where }\mathcal{R}_b=\frac{n_{\overline{b}}}{n}\sum_{i\in\mathcal{I}_{\overline{b}}} \ell(y_i,\widehat{m}_{(b)}(\mathbf{x}_i))where n_{\bar b} is the number of observations that have not been kept in \mathcal{I}_b. It should be noted that with this technique, on average e^{-1}\sim36.7\% of the observations do not appear in the boosted sample, and we find an order of magnitude of the proportions used when creating a calibration sample, and a test sample. In fact, as Stone (1977) had shown, the minimization of AIC is to be compared to the cross-validation criterion, and Shao (1997) showed that the minimization of BIC corresponds to k-fold cross-validation, with k=n/\log n.

All those techniques here are mentioned in the “machine learning” section since they rely on automatic, computational techniques, and no probabilistic foundations are necessary. In many cases we did use the notation m^\star (at least in the first posts on “machine learning” techniques) to highlight the fact that we want some sort of “optimal” model – and to make a distinction with estimators \widehat{m} considered earlier, when we had some probabilistic framework. But of course, it is possible (and necessary) to build bridges between those two cultures…

*References are online here. As explained in the introduction, it is some sort of online version of an introduction to our joint paper with Emmanuel Flachaire and Antoine Ly, Econometrics and Machine Learning (initially writen in French), that will actually appear soon in the journal Economics and Statistics (in English and in French).
*

Thanks of the nice exposition and the bridges between the two topics. I was wondering if you could have second look at the equation numbers since they seem to be discontinuous among the different topics making harder for the reader to grasp why the skip from one numbering of the equation to the other, for instance there’s a skip between equation 3 and then it goes directly to 6, whereas 4 & 5 are missing.