Andrew Pua
March 2022
One option is to invoke the law of iterated expectations and condition on ALL observations of \(X_{1}\). Therefore, \[\mathbb{E}\left(\left.\widehat{\beta}_{1}\right|X_{11},X_{12},\ldots,X_{1n}\right)=\sum_{t=1}^{n}\mathbb{E}\left(\left.W_{t}Y_{t}\right|X_{11},X_{12},\ldots,X_{1n}\right)={\displaystyle \sum_{t=1}^{n}}W_{t}\mathbb{E}\left(\left.Y_{t}\right|X_{11},X_{12},\ldots,X_{1n}\right).\]
Under the “modern” version of the linear regression model, we must have \(Y_{t}=\beta_{0}^{*}+\beta_{1}^{*}X_{1t}+u_{t}\). Thus, \[\mathbb{E}\left(\left.\widehat{\beta}_{1}\right|X_{11},X_{12},\ldots,X_{1n}\right)=\beta_{1}^{*}+{\displaystyle \sum_{t=1}^{n}}W_{t}\mathbb{E}\left(\left.u_{t}\right|X_{11},X_{12},\ldots,X_{1n}\right).\]
We finally encounter the most crucial hurdle to establishing unbiasedness. The only property we know about \(u_{t}\) is that it has zero covariance with \(X_{1t}\). This is insufficient to establish that \(\mathbb{E}\left(\left.u_{t}\right|X_{11},X_{12},\ldots,X_{1n}\right)=0\), even under IID conditions.
Therefore, stronger assumptions are required to establish finite-sample results such as unbiasedness. This motivates us to revisit conditional expectations.
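Here is a small simulation sketch of this hurdle. The data-generating process is my own illustrative assumption (not from the notes): \(u_t\) is constructed to have zero mean and zero covariance with \(X_{1t}\), yet \(\mathbb{E}\left(\left.u_t\right|X_{1t}\right)\neq 0\), and the OLS slope shows a visible small-sample bias relative to \(\beta_1^*\).

```r
# Illustrative simulation (assumed DGP): Cov(u, X1) = 0 but E(u | X1) != 0.
# X1 ~ Exp(1), u = X1^2 - 4*X1 + 2, so E(u) = 0 and Cov(u, X1) = 0,
# but E(u | X1) is a nonzero function of X1.
set.seed(42)
beta1_star <- 2                      # slope of the best linear predictor
n    <- 10                           # small n makes the bias visible
reps <- 50000
slope_hat <- replicate(reps, {
  x1 <- rexp(n)
  u  <- x1^2 - 4 * x1 + 2
  y  <- 1 + beta1_star * x1 + u
  cov(x1, y) / var(x1)               # OLS slope estimate
})
mean(slope_hat) - beta1_star         # clearly different from 0 at n = 10
```

The bias shrinks as \(n\) grows: zero covariance is enough for consistency, but not for unbiasedness.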
Return to our linear prediction setting.
Consider a prediction rule of the form \(g\left(X_1\right)\) instead of \(\beta_{0}+\beta_{1}X_{1}\).
In order to show that \(\widehat{\beta}_1\) is unbiased for \(\beta_1^*\), we need to somehow impose \[\mathbb{E}\left(\left.u_{t}\right|X_{11},X_{12},\ldots,X_{1n}\right)=0,\] or, under IID conditions, \[\mathbb{E}\left(\left.u_{t}\right|X_{1t}\right)=0.\]
But this means that \[\mathbb{E}\left(\left.Y_{t}\right|X_{11},X_{12},\ldots,X_{1n}\right)=\beta_0^*+\beta_1^* X_{1t},\] or, under IID conditions, \[\mathbb{E}\left(\left.Y_{t}\right|X_{1t}\right)=\beta_0^*+\beta_1^* X_{1t}.\]
To show unbiasedness, we have to assume that the CEF is actually equal to the best linear predictor under IID conditions. Without IID conditions, we need something even stronger, called strict exogeneity.
Therefore, \(\beta_0^*\) and \(\beta_1^*\) have more than the usual meaning as the coefficients of the best linear predictor.
In addition, the “modern” version of the linear regression model turns into something more classical.
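As a companion to the earlier simulation sketch, here is what happens once \(\mathbb{E}\left(\left.u_t\right|X_{1t}\right)=0\) is imposed (again with an assumed DGP of mine, here enforced by drawing \(u_t\) independently of \(X_{1t}\)): the same small-sample OLS slope is unbiased.

```r
# Companion sketch (assumed DGP): with E(u | X1) = 0 -- here u is drawn
# independently of X1 -- the OLS slope is unbiased even at n = 10.
set.seed(7)
beta1_star <- 2
n    <- 10
reps <- 50000
slope_hat <- replicate(reps, {
  x1 <- rexp(n)
  u  <- rnorm(n)                     # independent of X1, so E(u | X1) = 0
  y  <- 1 + beta1_star * x1 + u
  cov(x1, y) / var(x1)
})
mean(slope_hat) - beta1_star         # approximately 0
```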
Bivariate normality is a case that allows us to justify that the CEF coincides with the best linear predictor.
You can show that when \[\left(\begin{array}{c} X_{1}\\ Y \end{array}\right)\sim N\left(\left(\begin{array}{c} \mu_{1}\\ \mu_{Y} \end{array}\right),\left(\begin{array}{cc} \sigma_{1}^{2} & \rho\sigma_{1}\sigma_{Y}\\ \rho\sigma_{1}\sigma_{Y} & \sigma_{Y}^{2} \end{array}\right)\right),\] we must have \[\mathbb{E}\left(Y|X_1\right)=\mu_Y+\rho\frac{\sigma_Y}{\sigma_1}\left(X_1-\mu_1\right)=\underbrace{\mu_Y-\rho\frac{\sigma_Y}{\sigma_1}\mu_1}_{\beta_0^o}+\underbrace{\rho\frac{\sigma_Y}{\sigma_1}}_{\beta_1^o}X_1.\]
In addition, the conditional variance is actually a constant under bivariate normality: \[\mathsf{Var}\left(Y|X_1\right)=\left(1-\rho^2\right)\sigma_Y^2.\]
Finally, the conditional distribution is also normally distributed, so that: \[Y|X_1=x_1 \sim N\left(\beta_0^o+\beta_1^ox_1, \left(1-\rho^2\right)\sigma_Y^2\right).\]
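A quick numerical check of these formulas (the parameter values below are arbitrary choices of mine): simulate from the bivariate normal by drawing \(X_1\) and then \(Y\) from its conditional distribution, and compare the regression coefficients with \(\beta_0^o\) and \(\beta_1^o\).

```r
# Sketch with arbitrary parameter values: under bivariate normality the CEF
# is exactly linear, so a regression of Y on X1 recovers (beta_0^o, beta_1^o).
set.seed(123)
mu1 <- 1; muY <- 2; sig1 <- 1.5; sigY <- 2; rho <- 0.6
n  <- 1e5
x1 <- rnorm(n, mean = mu1, sd = sig1)
y  <- rnorm(n,
            mean = muY + rho * (sigY / sig1) * (x1 - mu1),   # E(Y | X1)
            sd   = sqrt(1 - rho^2) * sigY)                   # sqrt of Var(Y | X1)
coef(lm(y ~ x1))                          # close to (beta_0^o, beta_1^o)
c(muY - rho * (sigY / sig1) * mu1,        # beta_0^o
  rho * (sigY / sig1))                    # beta_1^o
```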
Predict scalar \(Y\) when you have information about a vector \(X=\left(1,X_{1},\ldots,X_{k}\right)^{\prime}\). What stays the same? What changes?
For the CEF: nothing essential changes. The CEF is now \(\mathbb{E}\left(\left.Y\right|X_{1},\ldots,X_{k}\right)\), and it remains the best predictor of \(Y\) in terms of mean squared error.
For the best linear predictor: the system of equations before \[\left(\begin{array}{cc} 1 & \mathbb{E}\left(X_1\right)\\ \mathbb{E}\left(X_1\right) & \mathbb{E}\left(X^{2}_1\right) \end{array}\right)\left(\begin{array}{c} \beta_{0}^{*}\\ \beta_{1}^{*} \end{array}\right) = \left(\begin{array}{c} \mathbb{E}\left(Y\right)\\ \mathbb{E}\left(X_1Y\right) \end{array}\right)\] extends to \[\underbrace{\left(\begin{array}{cccc} 1 & \mathbb{E}\left(X_{1}\right) & \cdots & \mathbb{E}\left(X_{k}\right)\\ \mathbb{E}\left(X_{1}\right) & \mathbb{E}\left(X_{1}^{2}\right) & \cdots & \mathbb{E}\left(X_{1}X_{k}\right)\\ \vdots & \vdots & \ddots & \vdots\\ \mathbb{E}\left(X_{k}\right) & \mathbb{E}\left(X_{k}X_{1}\right) & \cdots & \mathbb{E}\left(X_{k}^{2}\right) \end{array}\right)}_{Q=\mathbb{E}\left(XX^{\prime}\right)}\underbrace{\left(\begin{array}{c} \beta_{0}^{*}\\ \beta_{1}^{*}\\ \vdots\\ \beta_{k}^{*} \end{array}\right)}_{\beta^{*}} = \underbrace{\left(\begin{array}{c} \mathbb{E}\left(Y\right)\\ \mathbb{E}\left(X_{1}Y\right)\\ \vdots\\ \mathbb{E}\left(X_{k}Y\right) \end{array}\right)}_{\mathbb{E}\left(XY\right)}.\]
Thus, \(\beta^{*}=\left[\mathbb{E}\left(XX^{\prime}\right)\right]^{-1}\mathbb{E}\left(XY\right)\), provided \(\mathbb{E}\left(XX^{\prime}\right)\) is invertible.
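A short sketch of the sample analogue (the data below are simulated placeholders of mine): replace the population moments \(\mathbb{E}\left(XX^{\prime}\right)\) and \(\mathbb{E}\left(XY\right)\) with sample averages and solve the system; the result matches the least-squares coefficients.

```r
# Sketch (simulated placeholder data): the sample analogue of
# beta* = [E(XX')]^{-1} E(XY) coincides with the OLS coefficients.
set.seed(1)
n  <- 1000
x1 <- rnorm(n); x2 <- runif(n)
y  <- 0.5 + 1.0 * x1 - 2.0 * x2 + rnorm(n)
X  <- cbind(1, x1, x2)              # each row is (1, X_1, ..., X_k)'
Qhat <- crossprod(X) / n            # sample analogue of E(XX')
qhat <- crossprod(X, y) / n         # sample analogue of E(XY)
solve(Qhat, qhat)                   # sample version of beta*
coef(lm(y ~ x1 + x2))               # same numbers
```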
One way to interpret “linear” is as the statement that the CEF is a linear function of \(X\).
But the relevant meaning of “linear” here is clear from \(X^{\prime}\beta\): linear in \(\beta\). That means \(X\) can contain nonlinear transformations of the underlying variables.
For example: transformations (taking logarithms), interaction terms (products of variables in \(X\)), and many more, as long as the specification remains linear in the parameters, as in the sketch below.
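All of the following specifications are “linear” in the relevant sense (the variable names and data are placeholders of mine):

```r
# Sketch with placeholder data: each model is nonlinear in the variables
# but linear in the parameters, so it fits within the framework above.
set.seed(2)
x1 <- rexp(100); x2 <- rnorm(100)
y  <- 1 + log(x1) + 0.5 * x2 + rnorm(100)
lm(y ~ log(x1) + x2)                # log transformation of a regressor
lm(y ~ x1 * x2)                     # levels of x1, x2 plus their interaction
lm(y ~ x1 + I(x1^2))                # quadratic in x1
```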
We will talk about interpreting coefficients in the presence of transformations, interactions, and other such specifications next time. In fact, you have already encountered issues with interpretation in a previous exercise.
We now give an interpretation of the last component of \(\beta^{*}\); we focus on the last component for concreteness, but the result is much more general than it appears. The last coefficient has the following interpretation: \[\beta_{k}^{*}=\dfrac{\mathsf{Cov}\left(Y,v_{k}\right)}{\mathsf{Var}\left(v_{k}\right)},\] where \(v_{k}\) is the prediction error defined below.
Suppose you have found the optimal coefficients of the best linear predictor of \(X_{k}\) (a scalar random variable) using \(X_{-k}^{\prime}=\left(1,X_{1},X_{2},\ldots,X_{k-1}\right)\). We can always write \[X_{k}=X_{-k}^{\prime}\delta^{*}+v_{k},\] where \(v_{k}\) is the error from this best linear prediction. We must also have \[Y=X^{\prime}\beta^{*}+u,\] where \(u\) is another error from best linear prediction.
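A numerical sketch of this partialling-out interpretation, in its sample version and with simulated data of mine: compute \(v_k\) as the residual from regressing \(X_k\) on the remaining regressors, and compare \(\mathsf{Cov}\left(Y,v_k\right)/\mathsf{Var}\left(v_k\right)\) with the coefficient on \(X_k\) from the full regression; the two agree exactly.

```r
# Sketch (simulated data): beta_k^* equals Cov(Y, v_k) / Var(v_k), where v_k
# is the residual from the (sample) linear prediction of X_k using the others.
set.seed(3)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)           # plays the role of X_k
y  <- 1 + 2 * x1 - 1 * x2 + rnorm(n)
v  <- resid(lm(x2 ~ x1))            # v_k: residual of X_k on (1, X_1)
cov(y, v) / var(v)                  # Cov(Y, v_k) / Var(v_k)
coef(lm(y ~ x1 + x2))["x2"]         # identical, by the partialling-out result
```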