This weekend, the Canadian Econometric Study Group will organise a conference in Montréal, on Machine Learning Econometrics. Since I was in the scientific committee, I’ve read some of the papers that will be presented, and it will be extremely interesting. There will be two invited speakers, Gregory Duncan (Amazon and University of Washington) and Dacheng Xiu (University of Chicago).I will be around at the poster session on Friday, and I should chair a session on Saturday ! See you there !
Tag Archives: econometrics
Foundations of Machine Learning, part 3
This post is the seventh one of our series on the history and foundations of econometric and machine learning models. The first fours were on econometrics techniques. Part 6 is online here.
Boosting and sequential learning
As we have seen before, modelling here is based on solving an optimization problem, and solving the problem described by equation (6) is all the more complex because the functional space \mathcal{M} is large. The idea of boosting, as introduced by Shapire & Freund (2012), is to learn, slowly, from the errors of the model, in an iterative way. In the first step, we estimate a model m_1 for y, from \mathbf{X}, which will give an error \varepsilon_1. In the second step, we estimate a model m_2 for \varepsilon_1, from X, which will give an error \varepsilon_2, etc. We will then retain as a model, after k iterations m^{(k)}(\cdot)=\underbrace{m_1(\cdot)}_{\sim y}+\underbrace{m_2(\cdot)}_{\sim \epsilon_1}+\underbrace{m_3(\cdot)}_{\sim \epsilon_2}+\cdots+\underbrace{m_k(\cdot)}_{\sim \epsilon_{k1}}=m^{(k1)}(\cdot)+m_k(\cdot)~~~(7)Here, the error \varepsilon is seen as the difference between y and the model m(\mathbf{x}), but it can also be seen as the gradient associated with the quadratic loss function. Formally, \varepsilon can be seen as \nabla\ell in a more general context (here we find an interpretation that reminds us of residuals in generalized linear models).
Equation (7) can be seen as a descent of the gradient, but written in a dual way. The problem will then be rewritten as an optimization problem: m^{(k)}=m^{(k1)}+\underset{h\in\mathcal{H}}{\text{argmin}}\left\lbrace \sum_{i=1}^n \ell(\underbrace{y_im^{(k1)}(\boldsymbol{x}_i)}_{\varepsilon_{k,i}},h(\boldsymbol{x}_i))\right\rbrace~~~(8)where the trick is to consider a relatively simple space \mathcal{H} (we will speak of “weak learner”). Classically, \mathcal{H} functions are stepfunctions (which will be found in classification and regression trees) called “stumps”. To ensure that learning is indeed slow, it is not uncommon to use a shrinkage parameter, and instead of setting, for example, \varepsilon_1=ym_1 (\mathbf{x}), we will set \varepsilon_1=y\alpha\cdot m_1 (\mathbf{x}) with \alpha\in[0.1]. It should be noted that it is because a nonlinear space is used for \mathcal{H}, and learning is slow, that this algorithm works well. In the case of the Gaussian linear model, remember that the residuals \varepsilon=y\mathbf{x}^T\beta are orthogonal to the explanatory variables, \mathbf{X}, and it is then impossible to learn from our errors. The main difficulty is to stop in time, because after too many iterations, it is no longer the m function that is approximated, but the noise. This problem is called overlearning.
This presentation has the advantage of having a heuristic reminiscent of an econometric model, by iteratively modelling the residuals by a (very) simple model. But this is often not the presentation used in the learning literature, which places more emphasis on an optimization algorithm heuristic (and gradient approximation). The function is learned iteratively, starting from a constant value, m^{(0)}=\underset{m\in\mathbb{R}}{\text{argmin}}\left\lbrace\sum_{i=1}^n \ell(y_i,m)\right\rbracethen we consider the following learning procedure{\displaystyle m^{(k)}=m^{(k1)}+{\underset{h\in {\mathcal {H}}}{\text{argmin}}}\sum _{i=1}^{n}\ell(y_{i},m^{(k1)}(\mathbf{x}_{i})+h(\mathbf{x}_{i}))}~~~(9)which can be written, if \mathcal{H} is a set of differentiable functions, {\displaystyle m^{(k)}=m^{(k1)}\gamma_{k}\sum _{i=1}^{n}\nabla _{m^{(k1)}}\ell(y_{i},m^{(k1)}(\mathbf{x}_{i})),} where {\displaystyle \gamma _{k}=\underset{\gamma }{\text{argmin }}\sum _{i=1}^{n}\ell\left(y_{i},m^{(k1)}( \mathbf{x}_{i})\gamma \nabla _{m^{(k1)}}\ell(y_{i},m^{(k1)}( \mathbf{x}_{i}))\right).} To better understand the relationship with the approach described above, at step k, pseudoresiduals are defined by setting r_{i,k}=\left.\frac{\partial \ell(y_i,m(\mathbf{x}_i))}{\partial m(\mathbf{x}_i)}\right\vert_{m(\mathbf{x})=m^{(k1)}( \mathbf{x})}\text{ where }i=1,\cdots,nA simple model is then sought to explain these pseudoresiduals according to the explanatory variables \mathbf{x}_i, i.e. r_{i,k}=h^\star(\mathbf{x}_i) , where h^\star\in\mathcal{H}. In a second step, we look for an optimal multiplier by solving\gamma_k = \underset{\gamma\in\mathbb{R}}{\text{argmin}}\left\lbrace\sum_{i=1}^n \ell(y_i,m^{(k1)}( \mathbf{x}_i)+\gamma h^\star(\mathbf{x}_i))\right\rbrace then update the model by setting m_k (\cdot)=m_(k1) (\cdot)+\gamma_k h^\star (\cdot) . More formally, we move from equation (8) – which clearly shows that we are building a model on residuals – to equation (9) – which will then be translated as a gradient calculation problem – noting that \ell(y,m+h)=\ell(ym,h) . Classically, class \mathcal{H} of functions consists in regression trees. It is also possible to use a form of penalty by setting m_k (\cdot)=m_(k1) (\cdot)+\nu\gamma_k h^\star (\cdot) , with \nu\in(0,1) . But let’s go back a little further – in our next post – on the importance of penalization before discussing the numerical aspects of optimization.
To be continued (keep in mind that references are online here)…
Probabilistic Foundations of Econometrics, part 4
This post is the fourth one of our series on the history and foundations of econometric and machine learning models. Part 3 is online here.
Goodness of Fit, and Model
In the Gaussian linear model, the determination coefficient – noted R^2 – is often used as a measure of fit quality. It is based on the variance decomposition formula \underbrace{\frac{1}{n}\sum_{i=1}^n (y_i\bar{y})^2}_{\text{total variance}}=\underbrace{\frac{1}{n}\sum_{i=1}^n (y_i\widehat{y}_i)^2}_{\text{residual variance}}+\underbrace{\frac{1}{n}\sum_{i=1}^n (\widehat{y}_i\bar{y})^2}_{\text{explained variance}} The R^2 is defined as the ratio of explained variance and total variance, another interpretation of the coefficient that we had introduced from the geometry of the least squares R^2= \frac{\sum_{i=1}^n (y_i\bar{y})^2\sum_{i=1}^n (y_i\widehat{y}_i)^2}{\sum_{i=1}^n (y_i\bar{y})^2}The sums of the error squares in this writing can be rewritten as a loglikelihood. However, it should be remembered that, up to one additive constant (obtained with a saturated model) in generalized linear models, deviance is defined by {Deviance}(\widehat{\beta}) = 2\log[\mathcal{L}] which can also be noted Deviance(\widehat{\mathbf{y}}). A null deviance can be defined as the one obtained without using the explanatory variables \mathbf{x}, so that \widehat{y}_i=\overline{y}. It is then possible to define, in a more general context (with a nonGaussian distribution for y)R^2=\frac{{Deviance}(\overline{y}){Deviance}(\widehat{\mathbf{y}})}{{Deviance}(\overline{y})}=1\frac{{Deviance}(\widehat{\mathbf{y}})}{{Deviance}(\overline{y})}However, this measure cannot be used to choose a model, if one wishes to have a relatively simple model in the end, because it increases artificially with the addition of explanatory variables without significant effect. We will then tend to prefer the adjusted R^2,\bar R^2 = {1(1R^{2})\cdot{n1 \over np}} = R^{2}\underbrace{(1R^{2})\cdot{p1 \over np}}_{\text{penalty}}where p is the number of parameters of the model. Measuring the quality of fit will penalize overly complex models.
This idea will be found in the Akaike criterion, where AIC=Deviance+2\cdot p or in the Schwarz criterion, BIC=Deviance+log(n)\cdot p. In large dimensions (typically p>\sqrt{n}), we will tend to use a corrected AIC, defined by AIC_c=Deviance+2⋅p⋅n/(np1) .
These criterias are used in socalled “stepwise” methods, introducing the set methods. In the “forward” method, we start by regressing to the constant, then we add one variable at a time, retaining the one that lowers the AIC criterion the most, until adding a variable increases the AIC criterion of the model. In the “backward” method, we start by regressing on all variables, then we remove one variable at a time, removing the one that lowers the AIC criterion the most, until removing a variable increases the AIC criterion from the model.
Another justification for this notion of penalty (we will come back to this idea in machine learning) can be the following. Let us consider an estimator in the class of linear predictors, \mathcal{M}=\big\lbrace m:~m(\mathbf{x})=s_h(\mathbf{x})^T\mathbf{y} \text{ where }S=(s(\mathbf{x}_1),\cdots,s(\mathbf{x}_n))^T\text{ is some smoothing matrix}\big\rbrace and assume that y=m_0 (x)+\varepsilon, with \mathbb{E}[\varepsilon]=0 and Var[\varepsilon]=\sigma^2\mathbb{I}, so that m_0 (x)=\mathbb{E}[YX=x] . From a theoretical point of view, the quadratic risk, associated with an estimated model \widehat{m}, \mathbb{E}\big[(Y\widehat{m}(\mathbf{X}))^2\big], is written\mathcal{R}(\widehat{m})=\underbrace{\mathbb{E}\big[(Ym_0(\mathbf{X}))^2\big]}_{\text{error}}+\underbrace{\mathbb{E}\big[(m_0(\mathbf {X})\mathbb{E}[\widehat{m}(\mathbf{X})])^2\big]}_{\text{bias}^2}+\underbrace{\mathbb{E}\big[(\mathbb{E}[\widehat{m}(\mathbf{X})]\widehat{m}(\mathbf{X}))^2\big]}_{\text{variance}} if m_0 is the true model. The first term is sometimes called “Bayes error”, and does not depend on the estimator selected, \widehat{m}.
The empirical quadratic risk, associated with a model m, is here: \widehat{\mathcal{R}}_n(m)=\frac{1}{n}\sum_{i=1}^n (y_im(\mathbf{x}_i))^2 (by convention). We recognize here the mean square error, “mse”, which will more generally give the “risk” of the model m when using another loss function (as we will discuss later on). It should be noted that:\displaystyle{\mathbb{E}[\widehat{\mathcal{R}}_n(m)]=\frac{1}{n}\m_0(\mathbf{x})m(\mathbf{x})\^2+\frac{1}{n}\mathbb{E}\big(\{Y}m_0(\mathbf{X})\^2\big)} We can show that:n\mathbb{E}\big[\widehat{\mathcal{R}}_n(\widehat{m})\big]=\mathbb{E}\big(\Y\widehat{m}(\mathbf{x})\^2\big)=\(\mathbb{I}\mathbf{S})m_0\^2+\sigma^2\\mathbb{I}\mathbf{S}\^2so that the (real) risk of \widehat{m} is: {\mathcal{R}}_n(\widehat{m})=\mathbb{E}\big[\widehat{\mathcal{R}}_n(\widehat{m})\big]+2\frac{\sigma^2}{n}\text{trace}(\boldsymbol{S})So, if \text{trace}(\boldsymbol{S})\geq0 (which is not a too strong assumption), the empirical risk underestimates the true risk of the estimator. Actually, we recognize here the number of degrees of freedom of the model, the righthand term corresponding to Mallow’s C_p, introduced in Mallows (1973) using not deviance but R^2.
Statistical Tests
The most traditional test in econometrics is probably the significance test, corresponding to the nullity of a coefficient in a linear regression model. Formally, it is the test of H_0:\beta_k=0 against H_1:\beta_k\neq 0. The socalled Student test, based on the statistics t_k=\widehat{\beta}_k/se_{\widehat{β}_k}, allows to decide between the two alternatives, using the test pvalue, defined by \mathbb{P}[T>t_k] avec T\overset{\mathcal{L}}{\sim} Std_\nu, where \nu is the number of degrees of freedom of the model (\nu=p+1 for the standard linear model). In large dimension, however, this statistic is of very limited interest, given a significant FDR (“False Discovery Ratio”). Classically, with a level of significance \alpha=0.05, 5% of the variables are falsely significant. Suppose that we have p=100 explanatory variables, but that 5 (only) are really significant. We can hope that these 5 variables will pass the Student test, but we can also expect that 5 additional variables (false positive test) will emerge. We will then have 10 variables perceived as significant, while only half are significant, i.e. an FDR ratio of 50%. In order to avoid this recurrent pitfall in multiple tests, it is natural to use the procedure of Benjamini & Hochberg (1995).
From a correlation to some causal effect
Econometric models are used to implement public policy evaluations. It is therefore essential to fully understand the underlying mechanisms in order to know which variables actually make it possible to act on a variable of interest. But then we move on to another important dimension of econometrics. Jerry Neyman was responsible for the first work on the identification of causal mechanisms, and then Rubin (1974) formalized the test, called the “Rubin causal model” in Holland (1986). The first approaches to the notion of causality in econometrics were based on the use of instrumental variables, models with discontinuity of regression, analysis of differences in differences, and natural or unnatural experiments. Causality is usually inferred by comparing the effect of a policy – or more generally of a treatment – with its counterfactual, ideally given by a random control group. The causal effect of the treatment is then defined as \Delta=y_1y_0, i.e. the difference between what the situation would be with treatment (noted t=1) and without treatment (noted t=0). The concern is that only y=t\cdot y_1+(1t)\cdot y_0 and t are observed. In other words, the causal effect of variable t on t is not observed (since only one of the two potential variables – y_0 or y_1 is observed for each individual), but it is also individual, and therefore a function of xcovariates. Generally, by making assumptions about the distribution of the triplet (Y_0,Y_1,T) , some parameters of the causal effect distribution become identifiable, based on the density of the observable variables (Y,T) . Classically, we will be interested in the moments of this distribution, in particular the average effect of treatment in the population, \mathbb{E}[\Delta] , or even just the average effect of treatment in the case of treatment \mathbb{E}[\DeltaT=1] . If the result (Y_0,Y_1) is independent of the processing access variable T, it can be shown that \mathbb{E}[\Delta]=\mathbb{E}[YT=1] \mathbb{E} [YT=0]. But if this independence hypothesis is not verified, there is a selection bias, often associated with \mathbb{E}[Y_0T=1] \mathbb{E} [Y_0T=0]. Rosenbaum & Rubin (1983) propose to use a propensity to be treated score, p(x)=\mathbb{P}[T=1X=x] , noting that if variable Y_0\ is independent of access to treatment T conditionally to the explanatory variables X, then it is independent of T conditionally to the score p(X) : it is sufficient to match them using their propensity score. Heckman et al (2003) thus proposes a kernel estimator on the propensity score, which simply provides an estimator of the effect of the treatment, provided that it is treated.
To be continued… next time, we’ll introduce “machine learning techniques” (references mentioned above are online here)
Probabilistic Foundations of Econometrics, part 3
This post is the third one of our series on the history and foundations of econometric and machine learning models. Part 2 is online here.
Exponential family and linear models
The Gaussian linear model is a special case of a large family of linear models, obtained when the conditional distribution of Y (given the covariates) belongs to the exponential family f(y_i\theta_i,\phi)=\exp\left(\frac{y_i\theta_ib(\theta_i)}{a(\phi)}+c(y_i,\phi)\right) with \theta_i=\psi(\mathbf{x}_i^T \beta). Functions a, b and c are specified according to the type of exponential law (studied extensively in statistics since Darmoix (1935), as Brown (1986) reminds us), and \psi is a onetoone mapping that the user must specify. Loglikelihood then has a simple expression \log\mathcal{L}(\mathbf{\theta},\phi\mathbf{y}) =\frac{\sum_{i=1}^ny_i\theta_i\sum_{i=1}^nb(\theta_i)}{a(\phi)}+\sum_{i=1}^n c(y_i,\phi) and the first order condition is then written \frac{\partial \log \mathcal{L}(\mathbf{\theta},\phi\mathbf{y})}{\partial \mathbf{\beta}} = \mathbf{X}^T\mathbf{W}^{1}[\mathbf{y}\widehat{\mathbf{y}}]=\mathbf{0} based on Müller’s (2011) notations, where \mathbf{W} is a weight matrix (which depends on \beta). Given the link between \theta and the expectation of Y, instead of specifying the function \psi(\cdot) , we will tend to specify the link function g(\cdot) defined by \widehat{y}=m(\mathbf{x})=\mathbb{E}[Y\mathbf{X}=\mathbf{x}]=g^{1} (\mathbf{x}^T \beta) For the Gaussian linear regression we consider an identity link, while for the Poisson regression, the natural link (called canonical) is the logarithmic link. Here, as \mathbf{W} depends on \beta (with \mathbf{W}=diag(\nabla g(\widehat{\mathbf{y}})Var[\mathbf{y}]) there is generally no explicit formula for the maximum likelihood estimator. But an iterative algorithm makes it possible to obtain a numerical approximation. By setting \mathbf{z}=g(\widehat{\mathbf{y}})+(\mathbf{y}\widehat{\mathbf{y}})\cdot\nabla g(\widehat{\mathbf{y}}) corresponding to the error term of a Taylor development in order 1 of g, we obtain an algorithm of the form\widehat{\beta}_{k+1}=[\mathbf{X}^T \mathbf{W}_k^{1} \mathbf{X}]^{1} \mathbf{X}^T \mathbf{W}_k^{1} \mathbf{z}_kBy iterating, we will define \widehat{\beta}=\widehat{\beta}_{\infty}, and we can show that – with some additional technical assumptions (detailed in Müller (2011)) – this estimator is asymptotically Gaussian, with \sqrt{n}(\widehat{\beta} \beta)\overset{\mathcal{L}}{\rightarrow} \mathcal{N}(\mathbf{0},I(β)^{1}) where numerically I(\beta)=\varphi\cdot[\mathbf{X}^T \mathbf{W}_\infty^{1} \mathbf{X}] .
From a numerical point of view, the computer will solve the firstorder condition, and actually, the law of Y does not really intervene. For example, one can estimate a “Poisson regression” even when observations are not integers (but they need to be positive). In other words, the law of Y is only an interpretation here, and the algorithm could be introduced in a different way (as we will see later on), without necessarily having an underlying probabilistic model.
Logistic Regression
Logistic regression is the generalized linear model obtained with a Bernoulli’s law, and a link function which is the quantile function of a logistic law (which corresponds to the canonical link in the sense of the exponential family). Taking into account the form of Bernoulli’s law, econometrics proposes a model for y_i\in\{0,1\}, in which the logarithm of the odds follows a linear model: \log\left(\frac{\mathbb{P}[Y=1\vert \mathbf{X}=\mathbf{x}]}{\mathbb{P}[Y\neq 1\vert \mathbf{X}=\mathbf{x}]}\right)=\beta_0+\mathbf{x}^T\beta or \mathbb{E}[Y\mathbf{X}=\mathbf{x}]=\mathbb{P}[Y=1\mathbf{X}=\mathbf{x}]=\frac{e^{\beta_0+\mathbf{x}^T\beta}}{1+ e^{\beta_0+\mathbf{x}^T\beta}}=H(\beta_0+\mathbf{x}^T\beta) where H(\cdot)=\exp(\cdot)/(1+exp(\cdot)) is the cumulative distribution function of the logistic law. The estimation of (\beta_0,\beta) is performed by maximizing the likelihood: \mathcal{L}=\prod_{i=1}^n \left(\frac{e^{\mathbf{x}_i^T\mathbf{\beta}}}{1+e^{\boldsymbol{x}_i^T\mathbf{\beta}}}\right)^{y_i}\left(\frac{1}{1+e^{\mathbf{x}_i^T\mathbf{\beta}}}\right)^{1y_i} It is said to be a linear models because isoprobability curves here are the parallel hyperplanes b+\mathbf{x}^T\beta . Rather than this model, popularized by Berkson (1944), some will prefer the probit model (see Berkson, 1951), introduced by Bliss (1934). In this model: \mathbb{E}[Y\mathbf{X}=\mathbf{x}]=\mathbb{P}[Y=1\mathbf{X}=\mathbf{x}]=\Phi (\beta_0+\mathbf{x}^T\beta)
where \Phi denotes the distribution function of the reduced centred normal distribution. This model has the advantage of having a direct link with the Gaussian linear model, since y_i=\mathbf{1}(y_i^\star>0) with y_i^\star=\beta_0+\mathbf{x}^T \beta+\varepsilon_i where the residuals are Gaussian, \mathcal{N}(0,\sigma^2). An alternative is to have centered residuals of unit variance, and to consider a latent modeling of the form y_i=\mathbf{1}(y_i^\star>\xi) (where \xi will be fixed). As we can see, these techniques are fundamentally linked to an underlying stochastic model. In the body of the article, we present several alternative techniques – from the learning literature – for this classification problem (with two classes, here 0 and 1).
Regression in high dimension
As we mentioned earlier, the first order condition \mathbf{X}^T (\mathbf{X}\widehat{\beta}\mathbf{y})=\mathbf{0} is solved numerically by performing a QR decomposition, at a cost which consists in O(np^2) operations (where p is the rank of \mathbf{X}^T \mathbf{X}). Numerically, this calculation can be long (either because p is large or because n is large), and a simpler strategy may be to subsample. Let n_s\ll n, and consider a subsample size n_s of \{1,\cdots,n\}. Then \widehat{\beta}_s=(\mathbf{X}_s^T \mathbf{X}_s )^{1} \mathbf{X}_s^T\mathbf{y}_s is a good approximation of \beta as shown by Dhillon et al. (2014). However, this algorithm is dangerous if some points have a high leverage (i.e. L_i=\mathbf{x}_i(\mathbf{X}^T\mathbf{X})^{1}\mathbf{x}_i^T). Tropp (2011) proposes to transform the data (in a linear way), but a more popular approach is to do nonuniform subsampling, with a probability related to the influence of observations (defined by I_i=\widehat{\varepsilon}_iL_i/(1L_i)^2 , and which unfortunately can only be calculated once the model is estimated).
In general, we will talk about massive data when the data table of size does not fit in the RAM memory of the computer. This situation is often encountered in statistical learning nowadays with very often p\ll n. This is why, in practice, many libraries of algorithms assimilated to machine learning use iterative methods to solve the firstorder condition. When the parametric model to be calibrated is indeed convex and semidifferentiable, it is possible to use, for example, the stochastic gradient descent method as suggested by Bottou (2010). This last one allows to free oneself at each iteration from the calculation of the gradient on each observation of our learning base. Rather than making an average descent at each iteration, we start by drawing (without replacement) an observation \mathbf{x}_i among the n available. The model parameters are then corrected so that the prediction made from \mathbf{x}_i is as close as possible to the true value y_i. The method is then repeated until all the data have been reviewed. In this algorithm there is therefore as much iteration as there are observations. Unlike the gradient descent algorithm (or Newton’s method) at each iteration, only one gradient vector is calculated (and no longer n). However, it is sometimes necessary to run this algorithm several times to increase the convergence of the model parameters. If the objective is, for example, to minimize a loss function \ell between the estimator m_\beta (\mathbf{x}) and y (like the quadratic loss function, as in the Gaussian linear regression) the algorithm can be summarized as follows:
 Step 0: Mix the data
 Iteration step: For t=1,\cdots, n, we pull i\in\{1,\cdots,n\} without replacement, and we set \beta^{t+1} = \beta^{t}  \gamma_t\frac{ \partial{\ell(y_i,m_{\beta^t}(X_i)) } }{ \partial{ \beta}}
This algorithm can be repeated several times as a whole depending on the user’s needs. The advantage of this method is that at each iteration, it is not necessary to calculate the gradient on all observations (more sum). It is therefore suitable for large databases. This algorithm is based on a convergence in probability towards a neighborhood of the optimum (and not the optimum itself).
(references will be given in the very last post of that series) To be continued…
Probabilistic Foundations of Econometrics, part 2
This post is the second one of our series on the history and foundations of econometric and machine learning models. Part 1 is online here.
Geometric Properties of this Linear Model
Let’s define the scalar product in \mathbb{R}^n, ⟨\mathbf{a},\mathbf{b}⟩=\mathbf{a}^T\mathbf{b}, and let’s note \\cdot\ the associated Euclidean standard, \\mathbf{a}\=\sqrt{\mathbf{a}^T\mathbf{a}} (denoted \\cdot\_{\ell_2} in the next post). Note \mathcal{E}_X the space generated by all linear combinations of the \mathbf{X} components (adding the constant). If the explanatory variables are linearly independent, \mathbf{X} is a full (column) rank matrix and \mathcal{E}_X is a space of dimension p+1. Let’s assume from now on that the variables \mathbf{x} and y are centered here. Note that no law hypothesis is made in this section, the geometric properties are derived from the properties of expectation and variance in the set of finite variance variables.
With this notation, it should be noted that the linear model is written m(\mathbf{x})=⟨\mathbf{x},\beta⟩. The space H_z=\{\mathbf{x}\in\mathbb{R}^{p+1}:m(\mathbf{x})=z\} is a hyperplane (affine) that separates the space in two. Let’s define the orthogonal projection operator on \mathcal{E}_X, \Pi_X =\mathbf{X}(\mathbf{X}^T\mathbf{X})^{1} \mathbf{X}^T. Thus, the forecast that can be made for it is: \widehat{\mathbf{y}}=\mathbf{X}(\mathbf{X}^T\mathbf{X})^{1} \mathbf{X}^T\mathbf{y}=\Pi_X\mathbf{y}. As, \widehat{\varepsilon}=\mathbf{y}\widehat{\mathbf{y}}=(\mathbb{I}\Pi_X)\mathbf{y}=\Pi_{X^\perp}\mathbf{y}, we note that \widehat{\varepsilon}\perp\mathbf{x}, which will be interpreted as meaning that residuals are a term of innovation, unpredictable in the sense that \Pi_{X }\widehat{\varepsilon}=\mathbf{0}. The Pythagorean theorem is written here: \Vert \mathbf{y} \Vert^2=\Vert \Pi_{ {X}}\mathbf{y} \Vert^2+\Vert \Pi_{ {X}^\perp}\mathbf{y} \Vert^2=\Vert \Pi_{ {X}}\mathbf{y}\Vert^2+\Vert \mathbf{y}\Pi_{ {X}}\mathbf{y}\Vert^2=\Vert\widehat{\mathbf{y}}\Vert^2+\Vert\widehat{\mathbf{\varepsilon}}\Vert^2which is classically translated in terms of the sum of squares: \underbrace{\sum_{i=1}^n y_i^2}_{n\times\text{total variance}}=\underbrace{\sum_{i=1}^n \widehat{y}_i^2}_{n\times\text{explained variance}}+\underbrace{\sum_{i=1}^n (y_i\widehat{y}_i)^2}_{n\times\text{residual variance}} The coefficient of determination, R^2, is then interpreted as the square of the cosine of the angle \theta between \mathbf{y} and \Pi_X \mathbf{y} : R^2=\frac{\Vert \Pi_{{X}} \mathbf{y}\Vert^2}{\Vert \mathbf{y}\Vert^2}=1\frac{\Vert \Pi_{ {X}^\perp} \mathbf{y}\Vert^2}{\Vert \mathbf {y}\Vert^2}=\cos^2(\theta)An important application was obtained by Frish & Waugh (1933), when the explanatory variables are divided into two groups, \mathbf{X}=[\mathbf{X}_1 \mathbf{X}_2], so that the regression becomes y=\beta_0+\mathbf{X}_1 β_1+\mathbf{X}_2 β_2+\varepsilon. Frish & Waugh (1933) showed that two successive projections could be considered. Indeed, if \mathbf{y}_2^\star=\Pi_{X_1^\perp} \mathbf{y} and X_2^\star=\Pi_{X_1^\perp}\mathbf{X}_2, we can show that \widehat{\beta} _2=[{\mathbf{X}_2^\star}^T \mathbf{X}_2^\star]^{1}{\mathbf{X}_2^\star}^T \mathbf{y}_2^\star In other words, the overall estimate is equivalent to the combination of independent estimates of the two models if \mathbf{X}_2^\star=\mathbf{X}_2, i.e. \mathbf{X}_2\in \mathcal{E}_{X_1}^\perp, which can be noted \mathbf{x}_1\perp\mathbf{x}_2 We obtain here the FrischWaugh theorem which guarantees that if the explanatory variables between the two groups are orthogonal, then the overall estimate is equivalent to two independent regressions, on each of the sets of explanatory variables. This is a theorem of double projection, on orthogonal spaces. Many results and interpretations are obtained through geometric interpretations (fundamentally related to the links between conditional expectation and the orthogonal projection in space of variables of finite variance).
This geometric interpretation might help to get a better understanding of the problem of underidentification, i.e. the case where the real model would be y_i=\beta_0+ \mathbf{x}_1^T \beta_1+\mathbf{x}_2^T \beta_2+\varepsilon_i, but the estimated model is y_i=b_0+\mathbf{x}_1^T \mathbf{b}_1+\eta_i. The maximum likelihood estimator of \mathbf{b}_1 is \widehat{\mathbf{b}}_1=\mathbf {\beta}_1 + \underbrace{ (\mathbf {X}_1^T\mathbf {X}_1)^{1} \mathbf {X}_1^T \mathbf {X}_{2} \mathbf{\beta}_2}_{\mathbf{\beta}_{12}}+\underbrace{(\mathbf{X}_1^{T}\mathbf{X}_1)^{1} \mathbf{X}_1^T\varepsilon}_{\nu}so that \mathbb{E}[\widehat{\mathbf{b}}_1]=\beta_1+\beta_{12}, the bias ( \beta_{12}) being null only in the case where \mathbf{X}_1^T \mathbf{X}_2=\mathbf{0} (i. e. \mathbf{X}_1\perp \mathbf{X}_2 ): we find here a consequence of the FrischWaugh theorem.
On the other hand, overidentification corresponds to the case where the real model would be y_i=\beta_0+\mathbf{x}_1^T \beta_1+\varepsilon_i, but the estimated model is y_i=b_0+ \mathbf{x}_1^T \mathbf{b} _1+\mathbf{x}_2^T \mathbf{b}_2+\eta_i. In this case, the estimate is unbiased, in the sense that \mathbb{E}[\widehat{\mathbf{b}}_1]=\beta_1 but the estimator is not efficient. Later on, we will discuss an effective method for selecting variables (and avoid overidentification).
From parametric to nonparametric
We can rewrite equation (4) in the form \widehat{\mathbf{y}}=\Pi_X\mathbf{y} which helps us to see the forecast directly as a linear transformation of the observations. More generally, a linear predictor can be obtained by considering m(\mathbf{x})=\mathbf{s}_{\mathbf{x}}^T \mathbf{y}, where \mathbf{s}_{\mathbf{x}} is a weight vector, which depends on \mathbf{x}, interpreted as a smoothing vector. Using the vectors \mathbf{s}_{\mathbf{x}_i}, calculated from the observations \mathbf{x}_i, we obtain a matrix \mathbf{S} of size n\times n, and \widehat{\mathbf{y}}=\mathbf{S}\mathbf{y}. In the case of the linear regression described above, \mathbf{s}_{\mathbf{x}}=\mathbf{X}[\mathbf{X}^T\mathbf{X}]^{1}\mathbf{x}, and in that case \text{trace}(\mathbf{S}) is the number of columns in the \mathbf{X} matrix (the number of explanatory variables). In this context of more general linear predictors, \text{trace}(\mathbf{S}) is often seen as equivalent to the number of parameters (or complexity, or dimension, of the model), and \nu=n\text{trace}(\mathbf{S}) is then the number of degrees of freedom (see Ruppert et al., 2003; Simonoff, 1996). The principle of parsimony says that we should minimize this dimension (the trace of the matrix \mathbf{S}) as much as possible. But in the general case, this dimension is more to obtain, explicitely.
The estimator introduced by Nadaraya (1964) and Watson (1964), in the case of a simple nonparametric regression, is also written in this form since\widehat{m}_h(x)=\mathbf{s}_{x}^T\mathbf{y}=\sum_{i=1}^n \mathbf{s}_{x,i}y_iwhere\mathbf{s}_{x,i}=\frac{K_h(xx_i)}{K_h(xx_1)+\cdots+K_h(xx_n)} where K(\cdot) is a kernel function, which assigns a value that is lower the closer x_i is to x, and h>0 is the bandwidth. The introduction of this metaparameter h is an important issue, as it should be chosen wisely. Using asymptotic developments, we can show that if X has density f, \text{biais}[\widehat{m}_h(x)]=\mathbb{E}[\widehat{m}_h(x)]m(x)\sim {h^2}\left(\frac{C_1 }{2}m''(x)+C_2 m'(x)\frac{f'(x)}{f(x)}\right)and \displaystyle{{\text{Var}[\widehat{m}_h(x)]\sim\frac{C_3}{{nh}}\frac{\sigma(x)}{f(x)}}}for some constants that can be estimated (see Simonoff (1996) for a discussion). These two functions evolve inversely with h, as shown in Figure 1 (where the metaparameter on the xaxis is here, actually, h^{1}). Keep in ming that we will see a similar graph in the context of machine learning models.
Figure 1. Choice of metaparameter and the Goldilocks problem: it must not be too large (otherwise there is too much variance), nor too small (otherwise there is too much bias).
The natural idea is then to try to minimize the mean square error, the MSE, defined as bias[\widehat{m}_h (x)]^2+Var[\widehat{m}_h (x)], and them integrate over x, which gives an optimal value for h of the form h^\star=O(n^{1/5}) , and reminds us of Silverman’s rule – see Silverman (1986). In larger dimensions, for continuous \mathbf{x} variables, a multivariate kernel with matrix bandwidth \mathbf{H} can be used, and \mathbb{E}[\widehat{m}_{\mathbf{H}}(\mathbf{x})]\sim m(\mathbf{x})+\frac{C_1}{2}\text{trace}\big(\mathbf{H}^Tm''(\mathbf{x})\mathbf{H}\big)+C_2\frac{m'(\boldsymbol{x})^T\mathbf{H}\mathbf{H}^T \nabla f(\mathbf{x})}{f(\mathbf{x})}while\text{Var}[\widehat{m}_{\mathbf{H}}(\mathbf{x})]\sim\frac{C_3}{n~\text{det}(\mathbf{H})}\frac{\sigma(\mathbf{x})}{f(\mathbf{x})}
If \mathbf{H} is a diagonal matrix, with the same term h on the diagonal, then h^\star=O(n^{1/(4+dim(\mathbf{x}))}. However, in practice, there will be more interest in the integrated version of the quadratic error, MISE(\widehat{m}_{h})=\mathbb{E}[MSE(\widehat{m}_{h}(X))]=\int MSE(\widehat{m}_{h}(x))dF(x)and we can prove that MISE[\widehat{m}_h]\sim \overbrace{\frac{h^4}{4}\left(\int x^2k(x)dx\right)^2\int\big[m''(x)+2m'(x)\frac{f'(x)}{f(x)}\big]^2dx}^{\text{bias}^2} +\overbrace{\frac{\sigma^2}{nh}\int k^2(x)dx \cdot\int\frac{dx}{f(x)}}^{\text{variance}}as n→∞ and nh→∞. Here we find an asymptotic relationship that again recalls Silverman’s (1986) order of magnitude, h^\star =n^{\frac{1}{5}}\left(\frac{C_1\int \frac{dx}{f(x)}}{C_2\int \big[m''(x)+2m'(x)\frac{f'(x)}{f(x)}\big]dx}\right)^{\frac{1}{5}}The main problem here, in practice, is that many of the terms in the expression above are unknown. Automatic learning offers computational techniques, when the econometrician used to searching for asymptotic (mathematical) properties.
To be continued (references mentioned above are online here)…
Probabilistic Foundations of Econometrics, part 1
In a series of posts, I wanted to get into details of the history and foundations of econometric and machine learning models. It will be some sort of online version of our joint paper with Emmanuel Flachaire and Antoine Ly, Econometrics and Machine Learning (initially writen in French), that will actually appear soon in the journal Economics and Statistics. This is the first one…
The importance of probabilistic models in economics is rooted in Working’s (1927) questions and the attempts to answer them in Tinbergen’s two volumes (1939). The latter have subsequently generated a great deal of work, as recalled by Duo (1993) in his book on the foundations of econometrics, and more particularly in the first chapter “The Probability Foundations of Econometrics”. It should be recalled that Trygve Haavelmo was awarded the Nobel Prize in Economics in 1989 for his “clarification of the foundations of the probabilistic theory of econometrics”. Because as Haavelmo (1944) (initiating a profound change in econometric theory in the 1930s, as recalled in Morgan’s Chapter 8 (1990)) showed, econometrics is fundamentally based on a probabilistic model, for two main reasons. First, the use of statistical quantities (or “measures”) such as means, standard errors and correlation coefficients for inferential purposes can only be justified if the process generating the data can be expressed in terms of a probabilistic model. Second, the probability approach is relatively general, and is particularly well suited to the analysis of “dependent” and “nonhomogeneous” observations, as they are often found on economic data.We will then assume that there is a probabilistic space (\Omega,\mathcal{F},\mathbb{P}) such that observations (y_i,\mathbf{x}_i) are seen as realizations of random variables (Y_i, \mathbf{X}_i) . In practice, however, we are not very interested in the joint law of the couple (Y, \mathbf{X}) : the law of \mathbf{X} is unknown, and it is the law of Y conditional on \mathbf{X} that will be interested in. In the following, we will note x a single observation, \mathbf{x} a vector of observations, X a random variable, and \mathbf{X} a random vector. Abusively, \mathbf{X} may also designate the matrix of individual observations (denoted \mathbf{x}_i), depending on the context.
Foundations of mathematical statistics
As recalled in Vapnik’s (1998) introduction, inference in parametric statistics is based on the following belief: the statistician knows the problem to be analyzed well, in particular, he knows the physical law that generates the stochastic properties of the data, and the function to be found is written via a finite number of parameters[1]. To find these parameters, the maximum likelihood method is used. The purpose of the theory is to justify this approach (by discovering and describing its favorable properties). We will see that in learning, philosophy is very different, since we do not have a priori reliable information on the statistical law underlying the problem, nor even on the function we would like to approach (we will then propose methods to construct an approximation from the data at our disposal, as in (1998)). A “golden age” of parametric inference, from 1930 to 1960, laid the foundations for mathematical statistics, which can be found in all statistical textbooks, including today. As Vapnik (1998) states, the classical parametric paradigm is based on the following three beliefs:
 To find a functional relationship from the data, the statistician is able to define a set of functions, linear in their parameters, that contain a good approximation of the desired function. The number of parameters describing this set is small.
 The statistical law underlying the stochastic component of most reallife problems is the normal law. This belief has been supported by reference to the central limit theorem, which stipulates that under large conditions the sum of a large number of random variables is approximated by the normal law.
 The maximum likelihood method is a good tool for estimating parameters.
In this section we will come back to the construction of the econometric paradigm, directly inspired by that of classical inferential statistics.
Conditional laws and likelihood
Linear econometrics has been constructed under the assumption of individual data, which amounts to assuming independent variables (Y_i, \mathbf{X}_i) (if it is possible to imagine temporal observations – then we would have a process (Y_t, \mathbf{X}_t) – but we will not discuss time series here). More precisely, we will assume that, conditionally to the explanatory variables \mathbf{X}_i, the variables Y_i are independent. We will also assume that these conditional laws remain in the same parametric family, but that the parameter is a function of \mathbf{x}. In the Gaussian linear model it is assumed that: (Y\vert \mathbf{X}=\mathbf{x})\overset{\mathcal{L}}{\sim}\mathcal{N}(\mu(\mathbf{x}),\sigma^2)~~~~ (1)where \mu(\mathbf{x})=\beta_0+\mathbf{x}^T\mathbf{\beta} and \mathbf{\beta}\in\mathbb{R}^{p}.
It is usually called a ‘linear’ model since \mathbb{E}[Y\vert \mathbf{X}=\mathbf{x}]=\beta_0+\mathbf{x}^T\mathbf{\beta} is a linear combination of covariates[2]. It is said to be a homoscedastic model if Var[Y\mathbf{X}=\mathbf{x}]=\sigma^2, where \sigma^2 is a positive constant. To estimate the parameters, the traditional approach is to use the Maximum Likelihood estimator, as initially suggested by Ronald Fisher. In the case of the Gaussian linear model, loglikelihood is written: \log\mathcal{L}(\beta_0, \mathbf{\beta},\sigma^2\vert \mathbf{y},\mathbf{x}) = \frac{n}{2}\log[2\pi\sigma^2]  \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i\beta_0\mathbf{x}_i^T\mathbf{\beta})^2Note that the term on the right, measuring a distance between the data and the model, will be interpreted as deviance in generalized linear models. Then we will set: (\widehat{\beta}_0,\widehat{\mathbf{\beta}},\widehat{\sigma}^2)=\text{argmax}\left\lbrace\log\mathcal{L}(\beta_0, \mathbf{\beta},\sigma^2\vert \mathbf{y},\mathbf{x})\right\rbraceThe maximum likelihood estimator is obtained by minimizing the sum of the error squares (the socalled “least squares” estimator) that we will find in the “machine learning” approach.
The first order conditions allow to find the normal equations, whose matrix writing is \mathbf{X}^T[\mathbf{y}\mathbf{X}\mathbf{\beta}]=\mathbf{0}, which can also be written (\mathbf{X}^T \mathbf{X})\mathbf{\beta}=\mathbf{X}^T \mathbf{y}. If \mathbf{X} is a full (column) rank matrix, then we find the classical estimator:\widehat{\mathbf{\beta}}=(\mathbf{X}^T\mathbf{X})^{1}\mathbf{X}^T\mathbf{y}=\mathbf{\beta}+(\mathbf{X}^T\mathbf{X})^{1}\mathbf{X}^{1}\mathbf{\varepsilon}~~~(2)using residualbased writing (as often in econometrics), y=\mathbf{x}^T\mathbf{\beta}+\varepsilon. Gauss Markov’s theorem ensures that this estimator is the unbiased linear estimator with minimum variance. It can then be shown that \widehat{\mathbf{\beta}}\sim\mathcal{N}(\mathbf{\beta},\sigma^2(\mathbf{X}^T\mathbf{X})^{1}), and in particular, if we simply need the first two moments : \mathbb{E}[\widehat{\mathbf{\beta}}]=\mathbf{\beta}~~~Var[\widehat{\mathbf{\beta}}]=\sigma^2 [\mathbf{X}^T\mathbf{X}]^{1}In fact, the normality hypothesis makes it possible to make a link with mathematical statistics, but it is possible to construct this estimator given by equation (2) without that Gaussian assumption. Hence, if we assume that Y\mathbf{X} has the same distribution as \mathbf{x}^T\mathbf{\beta}+\varepsilon, where \mathbb{E}[\varepsilon]=0, Var[\varepsilon]=\sigma^2 and Cov[X_j,\varepsilon]=0 for all j, then \widehat{\mathbf{\beta}} is an unbiased estimator of \mathbf{\beta} with smallest variance[3] among unbiased linear estimators. Furthermore, if we cannot get normality at finite distance, asymptotically this estimator is Gaussian, with \sqrt{n}(\widehat{\mathbf{\beta}}\mathbf{\beta})\overset{\mathcal{L}}{\rightarrow}\mathcal{N}(\mathbf{0},\mathbf{\Sigma})as n\rightarrow\infty, for some matrix \mathbf{\Sigma}.
The condition of having a full rank \mathbf{X} matrix can be (numerically) strong in large dimensions. If it is not satisfied, (\mathbf{X}^T \mathbf{X})^{1}\mathbf{X}^T does not exist. If \mathbb{I} denotes the identity matrix, however, it should be noted that (\mathbf{X}^T \mathbf{X}+\lambda\mathbb{I})^{1}\mathbf{X}^T still exists, whatever \lambda>0. This estimator is called the ridge estimator of level \lambda (introduced in the 1960s by Hoerl (1962), and associated with a regularization studied by Tikhonov (1963)). This estimator naturally appears in a Bayesian econometric context.
Residuals
It is not uncommon to introduce the linear model from the distribution of the residuals, as we mentioned earlier. Also, equation (1) is written as often: y_i=\beta_0+\mathbf{x}_i^T\mathbf{\beta}+\varepsilon_i~~~~(3)where \varepsilon_i’s are realizations of independent and identically distributed random variables (i.i.d.) from some \mathcal{N}(0,\sigma^2) distribution. With a vector notation, we will write \mathbf{\varepsilon}\overset{\mathcal{L}}{\sim}\mathcal{N}(\mathbf{0},\sigma^2\mathbb{I}) . The estimated residuals are defined as: \widehat{\varepsilon}_i =y_i[\widehat{\beta}_0+\mathbf{x}_i^T\widehat{\mathbf{\beta}}] Those (estimated) residuals are basic tools for diagnosing the relevance of the model.
An extension of the model described by equation (1) has been proposed to take into account a possible heteroscedastic character: (Y\vert \mathbf{X}=\mathbf{x})\overset{\mathcal{L}}{\sim}\mathcal{N}(\mu(\mathbf{x}),\sigma^2(\mathbf{x}))where \sigma^2(\mathbf{x}) is a positive function of the explanatory variables. This model can be rewritten as: y_i=\beta_0+\mathbf{x}_i^T\mathbf{\beta}+\sigma^2(\mathbf{x}_i)\cdot\varepsilon_iwhere residuals are always i.i.d., with unit variance, \varepsilon_i=\frac{y_i[\beta_0+\mathbf{x}_i^T\mathbf{\beta}]}{\sigma(\mathbf{x}_i)} While residuals based equations are popular in linear econometrics (when the dependent variable is continuous), it is no longer popular in counting models, or logistic regression.
However, writing using an error term (as in equation (3)) raises many questions about the representation of an economic relationship between two quantities. For example, it can be assumed that there is a relationship (linear to begin with) between the quantities of a traded good, q and its price p. This allows us to imagine a supply equationq_i=\beta_0+\beta_1 p_i+u_i(u_i being an error term) where the quantity sold depends on the price, but in an equally legitimate way, one can imagine that the price depends on the quantity produced (what one could call a demand equation), p_i=\alpha_0+\alpha_1 q_i+v_i(v_i denoting another error term). Historically, the error term in equation (3) could be interpreted as an idiosyncratic error on the variable y, the socalled explanatory variables being assumed to be fixed, but this interpretation often makes the link between an economic relationship and a complicated economic model difficult, the economic theory speaking abstractly about a relationship between a magnitude, the econometric model imposing a specific shape (what magnitude is y and what magnitude is x) as shown in more detail in Morgan (1990) Chapter 7.
(references mentioned above are online here). To be continued…
[1] This approach can be compared to structural econometrics, as presented for example in Kean (2010).
[2] Here, we will try to distinguish \beta_0, the intercept, and the other parameters \mathbf{\beta}, since they are considered differently in many extensions (e.g. regularization). Nevertheless, in many expressions \mathbf{\beta} will denote the joint vector (\beta_0, \mathbf{\beta}), for general formulas, to avoid too heavy notations.
[3] In the sense that the difference between variance matrices is a positive matrix.
Summer School, Big Data and Economics
This week I will be giving a lecture at the 2018 edition of the Summer School at the UB School of Economics, in Barcelona. It will be a four day crash course, starting on Tuesday (morning).
Lecture 1: Introduction : Why Big Data brings New Questions
Lecture 2: Simulation Based Techniques & Bootstrap
Lecture 3: Loss Functions : from OLS to Quantile Regression
Lecture 4: Nonlinearities and Discontinuities
Lecture 5: CrossValidation and OutofSample diagnosis
Lecture 6: Variable and model selection
Lecture 7: New Tools for Classification Problems
Lecture 8: New Tools for Time Series & Forecasting
Some slides are available on github, and probably more interesting, I will upload a R markdown with all the codes.
Seminar at Università degli studi dell’Insubria
Next Wednesday, I will be giving a seminar at Università degli studi dell’Insubria on Machine Learning and Econometrics. The presentation is based on Econométrie et Machine Learning (so far only in French) written with Emmanuel Flachaire and Antoine Ly. Slides are now available online…
Graduate Course at Università degli studi dell’Insubria
Tomorrow and Friday, I will give a crash course on advanced techniques in econometrics. Slides are now online (with animations). There will be breaks to discuss, and I will attend the doctoral seminar on Thursday.
Graduate Course on Advanced Tools for Econometrics (2)
This Tuesday, I will be giving the second part of the (crash) graduate course on advanced tools for econometrics. It will take place in Rennes, IMAPP room, and I have been told that there will be a visio with Nantes and Angers. Slides for the morning are online, as well as slides for the afternoon.
In the morning, we will talk about variable section and penalization, and in the afternoon, it will be on changing the loss function (quantile regression).
Graduate Course on Advanced Tools for Econometrics (1)
This Monday, I will be giving the first part of the (crash) graduate course on advanced tools for econometrics. It will take place in Rennes, IMAPP room, and I have been told that there will be a visio with Nantes and Angers. Slides for the morning are online, as well as slides for the afternoon.
In the morning, we will talk about smoothing techniques, and in the afternoon, it will be on simulations and bootstrap techniques.
Advanced Econometrics: Model Selection
On Thursday, March 23rd, I will give the third lecture of the PhD course on advanced tools for econometrics, on model selection and variable selection, where we will focus on ridge and lasso regressions . Slides are available online.
The first part was on on Nonlinearities in Econometric models, and the second one on Simulations.
Advanced Econometrics: Simulations
On Thursday, March 9nd, I will give the second lecture of the PhD course on advanced tools for econometrics, on simulation techniques (and bootstrap). Slides are available online.
The first part is this Thurdays, on Nonlinearities in Econometric models.
Graduate Course on Advanced Methods in Econometrics
I will give a short graduate course for PhD students, in Rennes, on Thurday mornings, in March (2nd, 9th, 23rd and 30th). The agenda will be

Nonlinear Regression Models and Smoothing Techniques

Bootstrapping and Regression

Penalized Regression Models and LASSO

Quantile Regression and Expectiles
There will be slides available by the end of February.
Econometrics and Machine Learning
I will be in London, UK, at the Centre for Central Banking Studies, invited as a keynote speaker for a major conference. For my talk, on Econometric Models and Statistical Learning Techniques, the agenda is the follownig
 introduction on High Dimensional Data and Modeling
 foundations of econometric models, and probabilistic aspects
 machine learning techniques, with a discussion on boosting, cross validation
 classification, from the logistic regression to trees and random forest
 machine learning tools that can be used in econometrics, such as bootstrap, principal component analysis / partial least squares, and instrumental variables and variable selection
Slides are avaible (as usual, the pdf version is more informative than the one on slideshare where animations are missing)