Andrew Pua

May 2022

- Let \(X_{1},X_{2},\ldots,X_{n}\) be IID exponential random variables with parameter \(\lambda>0\).
- Each \(X_{t}\) has moments\[\mathbb{E}\left[X_{t}^{k}\right]=\dfrac{k!}{\lambda^{k}},\qquad t=1,\ldots,n,\quad k=1,2,\ldots\]
- If you want to apply method of moments, you can choose to use the first moment\[\mathbb{E}\bigg[\underbrace{X_{t}-\dfrac{1}{\lambda}}_{m_{1t}\left(\lambda\right)}\bigg]=0.\]
- The resulting MME of \(\lambda\) is going to be the solution to the equation \[\frac{1}{n}\sum_{t=1}^{n}\left[X_{t}-\dfrac{1}{\lambda}\right]=0\quad\Rightarrow\quad\widehat{\lambda}=\left(\frac{1}{n}\sum_{t=1}^{n}X_{t}\right)^{-1}.\]

- You can alternatively choose to use the second moment\[\mathbb{E}\bigg[\underbrace{X_{t}^{2}-\dfrac{2}{\lambda^{2}}}_{m_{2t}\left(\lambda\right)}\bigg]=0.\]
- The resulting MME of \(\lambda\) is going to be the solution to the equation \[\frac{1}{n}\sum_{t=1}^{n}\left[X_{t}^{2}-\dfrac{2}{\lambda^{2}}\right]=0\quad\Rightarrow\quad\widehat{\lambda}=\sqrt{2\left(\frac{1}{n}\sum_{t=1}^{n}X_{t}^{2}\right)^{-1}}.\]
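Both MMEs can be checked numerically. A minimal Python sketch, assuming simulated data with an arbitrary \(\lambda=2\):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0
# NumPy's exponential() takes the scale 1/lambda, not the rate lambda
x = rng.exponential(scale=1 / lam, size=100_000)

lam_hat1 = 1 / x.mean()                  # MME from the first moment
lam_hat2 = np.sqrt(2 / (x ** 2).mean())  # MME from the second moment
```

In any finite sample the two estimates differ, so no single \(\widehat{\lambda}\) solves both sample equations at once.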

Can we solve the two equations simultaneously? NO! Why? We would have two sample equations in a single unknown \(\lambda\), and in a finite sample they generically have no common solution.

Consider the moment function \[m_{t}\left(\lambda\right)=m\left(X_{t},\lambda\right)=\left(\begin{array}{c} m_{1t}\left(\lambda\right)\\ m_{2t}\left(\lambda\right) \end{array}\right).\]

Which should we choose? Can’t we use all the moments? YES! But how do we combine them? Two options:

- Generalize the method of moments! How? Put weights on the moments! Discovered in 1982 by L. P. Hansen. But how do you put weights? Can you do this optimally in some sense? This is the subject matter of Chapters 7 and 8.
- Use maximum likelihood. But models in economics are rarely completely specified. This is the subject matter of Chapter 9.

Recall that you had \(K=k+1\) equations in \(K=k+1\) unknowns when we wanted to determine \(\beta^{*}\). Let me focus on the special case \(K=2\).

The two equations arise from: \(\mathbb{E}\left(u_{t}\right)=0\) and \(\mathbb{E}\left(X_{1t}u_{t}\right)=0\), where \(u_{t}=Y_{t}-\beta_{0}^{*}-\beta_{1}^{*}X_{1t}\).

A function of the r.v.’s and parameters \[m_{t}\left(\beta_{0},\beta_{1}\right)=m\left(X_{1t},Y_{t},\beta_{0},\beta_{1}\right)=\left(\begin{array}{c} Y_{t}-\beta_{0}-\beta_{1}X_{1t}\\ X_{1t}\left(Y_{t}-\beta_{0}-\beta_{1}X_{1t}\right) \end{array}\right)\]is called a moment function.

Note that when \(\mathsf{Var}\left(X_1\right)>0\), \(\beta_{0}^{*}\) and \(\beta_{1}^{*}\) uniquely solve \(\mathbb{E}\left(m_{t}\left(\beta_{0},\beta_{1}\right)\right)=0\).

If it happens that the CEF is truly linear, there are actually more moment conditions that can be used to determine \(\beta_{0}^{*}=\beta_{0}^{o}\) and \(\beta_{1}^{*}=\beta_{1}^{o}\).

Because the CEF is assumed to be linear, we must have \(\mathbb{E}\left(u_{t}|X_{1t}\right)=0\). This implies, for example, \(\mathbb{E}\left(X_{1t}^{2}u_{t}\right)=0\).

As a result, we can have a new set of moment functions \[m_{t}\left(\beta_{0},\beta_{1}\right)=\left(\begin{array}{c} Y_{t}-\beta_{0}-\beta_{1}X_{1t}\\ X_{1t}\left(Y_{t}-\beta_{0}-\beta_{1}X_{1t}\right)\\ X_{1t}^{2}\left(Y_{t}-\beta_{0}-\beta_{1}X_{1t}\right) \end{array}\right).\]

Note that we have moment conditions \(\mathbb{E}\left(m_{t}\left(\beta_{0},\beta_{1}\right)\right)=0\).

All of our examples featured orthogonality/moment conditions of the form \(\mathbb{E}\left(m_{t}\left(\beta\right)\right)=0\).

\(m_{t}\left(\beta\right)\) is called a moment function.

\(\beta^{o}\) is usually assumed to be the unique value of \(\beta\in\Theta\) which satisfies the orthogonality conditions. This is called point identification.

The expectation operator \(\mathbb{E}\) is with respect to the distribution of the data, which is indexed by the parameter \(\beta^{o}\). But we do not specify the distribution of the data!

The sample counterparts of the orthogonality conditions are: \[\widehat{m}\left(\beta\right)=\frac{1}{n}\sum_{t=1}^{n}m_{t}\left(\beta\right)\]

A generalized method of moments (GMM) estimator is defined as the solution to the following minimization problem: \[\widehat{\beta}_{GMM}=\arg\min_{\beta\in\Theta}\widehat{m}\left(\beta\right)^{\prime}\widehat{W}^{-1}\widehat{m}\left(\beta\right)\]

Accounting for the sizes:

\(\widehat{m}\left(\beta\right)\) is \(L\times1\) vector of moment functions, which may be linear (Chapter 7) or nonlinear (Chapter 8) in \(\beta\)

\(\widehat{W}\) is \(L\times L\) symmetric nonsingular weighting matrix

\(\beta\) is the \(K\times1\) parameter vector

\(\Theta\subseteq\mathbb{R}^{K}\) is the parameter space

\(\widehat{m}\left(\beta\right)^{\prime}\widehat{W}^{-1}\widehat{m}\left(\beta\right)\) is a scalar GMM criterion function

Clearly, \(L\) has to be greater than or equal to \(K\).

Suppose \(Y_{t}=X_{t}^{\prime}\beta^{o}+\varepsilon_{t}\) but now \(\mathbb{E}\left(X_{t}\varepsilon_{t}\right)=0\).

This implies that we do not necessarily have correct specification, meaning \(\mathbb{E}\left(Y_{t}|X_{t}\right)\) may differ from \(X_{t}^{\prime}\beta^{o}\). But \(X_{t}^{\prime}\beta^{o}\) is still the best linear predictor.

To estimate \(\beta^{o}\) by least squares, we simply minimize the sum of squared residuals, i.e. \[\widehat{\beta}_{OLS}=\arg\min_{\beta\in\Theta}\dfrac{1}{n}\left(Y-\mathbf{X}\beta\right)^{\prime}\left(Y-\mathbf{X}\beta\right) = \arg\min_{\beta\in\Theta}\dfrac{1}{n}\sum_{t=1}^{n}\left(Y_{t}-X_{t}^{\prime}\beta\right)^{2}\]

Recall the FOCs: \(\widehat{\beta}_{OLS}\) solves\[\dfrac{1}{n}\sum_{t=1}^{n}\left(X_{t}Y_{t}-X_{t}X_{t}^{\prime}\widehat{\beta}_{OLS}\right)=0\]

Here we have \(m_{t}\left(\beta\right)=X_{t}\left(Y_{t}-X_{t}^{\prime}\beta\right)\) and that \(\mathbb{E}\left(m_{t}\left(\beta\right)\right)=0\).

- So we actually have the following sample counterpart of \(\mathbb{E}\left(m_{t}\left(\beta\right)\right)=0\):\[\widehat{m}\left(\beta\right)=\dfrac{1}{n}\displaystyle\sum_{t=1}^{n}X_{t}\left(Y_{t}-X_{t}^{\prime}\beta\right)=\dfrac{1}{n}\mathbf{X}^{\prime}\left(Y-\mathbf{X}\beta\right)=0\]

As you recall, these are \(K\) equations, linear in \(\beta\), in \(K\) unknowns. Here is a case where \(L=K\).

Within a GMM framework, set \(\widehat{m}\left(\beta\right)\) as before and \(\widehat{W}=\mathbf{X}^{\prime}\mathbf{X}\). You can show that \[\widehat{\beta}_{OLS}=\widehat{\beta}_{GMM}=\arg\min_{\beta\in\Theta}\:\widehat{m}\left(\beta\right)^{\prime}\widehat{W}^{-1}\widehat{m}\left(\beta\right).\]

More importantly, if you substitute \(\widehat{\beta}_{OLS}\) into the GMM criterion function, we obtain \[\left[\mathbf{X}^{\prime}\left(Y-\mathbf{X}\widehat{\beta}_{OLS}\right)\right]^{\prime}\left(\mathbf{X}^{\prime}\mathbf{X}\right)^{-1}\mathbf{X}^{\prime}\left(Y-\mathbf{X}\widehat{\beta}_{OLS}\right)=\left(Y-\mathbf{X}\widehat{\beta}_{OLS}\right)^{\prime}P_{\mathbf{X}}\left(Y-\mathbf{X}\widehat{\beta}_{OLS}\right)=e^{\prime}P_{\mathbf{X}}e=0.\]
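This can be checked numerically. A minimal sketch with simulated data (arbitrary coefficients): the sample moment vanishes at \(\widehat{\beta}_{OLS}\), so the criterion is zero regardless of the weighting matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)    # OLS via the normal equations
e = y - X @ beta_ols
m_hat = X.T @ e / n                             # sample moment at beta_ols
crit = m_hat @ np.linalg.solve(X.T @ X, m_hat)  # GMM criterion with W = X'X
```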

In fact, even if you choose a different weighting matrix \(\widehat{W}\), substituting \(\widehat{\beta}_{GMM}\) into the GMM criterion function will always produce a zero value! Why?

As a result, GMM is not even needed here! So why bother spending three slides on this special case?

The key reason is to provide a contrast to models called linear structural equations.

Suppose \(Y_{t}=X_{t}^{\prime}\beta^{o}+\varepsilon_{t}\) but now \(\mathbb{E}\left(X_{t}\varepsilon_{t}\right)\neq0\).

This implies that we lose correct specification, meaning \(\mathbb{E}\left(Y_{t}|X_{t}\right)\neq X_{t}^{\prime}\beta^{o}\). This also implies that \(X_{t}^{\prime}\beta^{o}\) is not the BLP.

All this means is that \(Y_{t}=X_{t}^{\prime}\beta^{o}+\varepsilon_{t}\) is NOT a linear regression model.

From now on, we are going to call \(Y_{t}=X_{t}^{\prime}\beta^{o}+\varepsilon_{t}\) a structural equation or a response schedule.

- Linear regression models become a special case.
- In some cases, you will see \(X_t\) being called endogenous or subject to an endogeneity problem.
- It is also possible to cover the case where we do not have correct dynamic specification.

How do we estimate \(\beta^{o}\) in this situation? There are just so many ways. But, we study closely an approach called instrumental variables (which is a special case of GMM).

Assume that there exists a random \(L\times1\) vector \(Z_{t}\) such that \(\mathbb{E}\left(Z_{t}\varepsilon_{t}\right)=0\).

Given the linear structural equation earlier, we can specify \[m_{t}\left(\beta\right)=Z_{t}\left(Y_{t}-X_{t}^{\prime}\beta\right)\]and by assumption we have \(\mathbb{E}\left(m_{t}\left(\beta\right)\right)=0\). So we have a moment condition which determines \(\beta\).

The sample counterpart of this orthogonality condition is \[\widehat{m}\left(\beta\right)=\frac{1}{n}\sum_{t=1}^{n}Z_{t}\left(Y_{t}-X_{t}^{\prime}\beta\right).\]

So we have \(L\) equations, linear in \(\beta\), in \(K\) unknowns.

Setup: \[\underset{\left(n\times1\right)}{Y}=\left(\begin{array}{c} Y_{1}\\ \vdots\\ Y_{n} \end{array}\right),\underset{\left(n\times K\right)}{\mathbf{X}}=\left(\begin{array}{c} X_{1}^{\prime}\\ \vdots\\ X_{n}^{\prime} \end{array}\right),\underset{\left(n\times L\right)}{\mathbf{Z}}=\left(\begin{array}{c} Z_{1}^{\prime}\\ \vdots\\ Z_{n}^{\prime} \end{array}\right),\underset{\left(K\times1\right)}{\beta}=\left(\begin{array}{c} \beta_{0}\\ \vdots\\ \beta_{k} \end{array}\right)\]

The GMM estimator is then defined as \[\widehat{\beta}_{GMM}=\arg\min_{\beta\in\Theta}\widehat{m}\left(\beta\right)^{\prime}\widehat{W}^{-1}\widehat{m}\left(\beta\right)=\arg\min_{\beta\in\Theta}\:\dfrac{1}{n^{2}}\left(Y-\mathbf{X}\beta\right)^{\prime}\mathbf{Z}\widehat{W}^{-1}\mathbf{Z}^{\prime}\left(Y-\mathbf{X}\beta\right).\]

The FOCs are given by: \[\begin{eqnarray}\frac{\partial}{\partial\beta}\left.\left(Y-\mathbf{X}\beta\right)^{\prime}\mathbf{Z}\widehat{W}^{-1}\mathbf{Z}^{\prime}\left(Y-\mathbf{X}\beta\right)\right|_{\beta=\widehat{\beta}} &=& 0 \\ -2\mathbf{X}^{\prime}\mathbf{Z}\widehat{W}^{-1}\mathbf{Z}^{\prime}\left(Y-\mathbf{X}\widehat{\beta}\right) &=& 0.\end{eqnarray}\]

The value of \(\widehat{\beta}\) that solves the FOCs is given by: \[\widehat{\beta}_{GMM}\left(\widehat{W}\right)=\left(\mathbf{X}^{\prime}\mathbf{Z}\widehat{W}^{-1}\mathbf{Z}^{\prime}\mathbf{X}\right)^{-1}\mathbf{X}^{\prime}\mathbf{Z}\widehat{W}^{-1}\mathbf{Z}^{\prime}Y.\]

If \(L=K\), then \(\mathbf{Z}^{\prime}\mathbf{X}\) and \(\mathbf{X}^{\prime}\mathbf{Z}\) are both square matrices.

- If they are invertible or nonsingular, then \[\widehat{\beta}_{GMM}\left(\widehat{W}\right)=\left(\mathbf{Z}^{\prime}\mathbf{X}\right)^{-1}\widehat{W}\left(\mathbf{X}^{\prime}\mathbf{Z}\right)^{-1}\mathbf{X}^{\prime}\mathbf{Z}\widehat{W}^{-1}\mathbf{Z}^{\prime}Y=\left(\mathbf{Z}^{\prime}\mathbf{X}\right)^{-1}\mathbf{Z}^{\prime}Y,\]which is called a just-identified instrumental variables (IV) estimator.
- Notice that \(\widehat{W}\) does not even play a role! This is similar to the OLS case.

In fact, OLS can be justified as a just-identified IV estimator under certain orthogonality conditions! Why?

Consider \[\begin{eqnarray}C_t &=& \beta_0^o+\beta_1^o I_t+\varepsilon_t \\ I_t &=&C_t+D_t,\end{eqnarray}\] where \(C_t\) is consumption, \(I_t\) is income, and \(D_t\) is non-consumption.

Assume, for convenience, that \[\left(\begin{array}{c} D\\ \varepsilon \end{array}\right)\sim N\left(\left(\begin{array}{c} \mu_{D}\\ 0 \end{array}\right),\left(\begin{array}{cc} \sigma_{D}^{2} & 0\\ 0 & \sigma_{\varepsilon}^{2} \end{array}\right)\right).\] Assume we have IID draws from this distribution.

You can easily check that \(\mathbb{E}\left(I_t\varepsilon_t\right)= \sigma^2_{\varepsilon}/\left(1-\beta_1^o\right)\).

Therefore, \(\mathbb{E}\left(I_t\varepsilon_t\right)\neq 0\) whenever \(\sigma^2_{\varepsilon}>0\).

So, if you apply least squares to a regression of \(C_t\) on \(I_t\), you will not recover \(\beta_1^o\) no matter how large the sample size is.

- You can show that \(\mathbb{E}\left(C_t|I_t\right)=\gamma_0+\gamma_1 I_t\) where \[\begin{eqnarray} \gamma_1 &=& \frac{\beta_1^o\sigma^2_D+\sigma^2_{\varepsilon}}{\sigma^2_D+\sigma^2_{\varepsilon}}=\theta\beta_1^o+\left(1-\theta\right) \\ \gamma_0 &=& \frac{\beta_0^o+\beta_1^o\mu_D-\gamma_1\left(\beta_0^o+\mu_D\right)}{1-\beta_1^o}=\theta\beta_0^o-\left(1-\theta\right)\mu_D, \end{eqnarray}\]where \(\theta = \sigma^2_D/\left(\sigma^2_D+\sigma^2_{\varepsilon}\right)\).
- Thus, we have two representations of the relationship between \(C_t\) and \(I_t\):
- One which is always true: \(C_t=\mathbb{E}\left(C_t|I_t\right)+\eta_t\), where \(\eta_t\) is a CEF error.
- The other is assumed to be true: \(C_t=\beta_0^o+\beta_1^oI_t+\varepsilon_t\), where \(\varepsilon_t\) is a structural error and not a CEF error.

- What can you estimate?
- OLS estimates \(\gamma_0\) and \(\gamma_1\) consistently.
- But OLS cannot estimate \(\beta_0^o\) and \(\beta_1^o\) consistently. We need an alternative method.

- Some terminology:
- \(I_t\) is called an endogenous variable or endogenous regressor.
- \(\varepsilon_t\) is called a structural error.
- \(C_t=\beta_0^o+\beta_1^oI_t+\varepsilon_t\) is called a structural equation or structural model or response schedule.
- \(\beta_0^o\) and \(\beta_1^o\) are called structural parameters.

- To recover \(\beta_1^o\), the idea is to have an identification argument:
- Observe that \[\begin{eqnarray} \mathsf{Cov}\left(C_t, D_t\right) &=& \beta_1^o\mathsf{Cov}\left(I_t, D_t\right)+\mathsf{Cov}\left(D_t,\varepsilon_t\right) \\ \mathsf{Cov}\left(C_t, D_t\right) &=& \beta_1^o\mathsf{Cov}\left(I_t, D_t\right) \\ \beta_1^o &=&\frac{\mathsf{Cov}\left(C_t, D_t\right)}{\mathsf{Cov}\left(I_t, D_t\right)} \end{eqnarray}\]
- The argument works when \(\mathsf{Cov}\left(D_t,\varepsilon_t\right)=0\) and \(\mathsf{Cov}\left(I_t,D_t\right)\neq 0\).
- Thus, we can uniquely recover (or point-identify) \(\beta_1^o\).

- To recover \(\beta_0^o\), the idea is to have another identification argument:
- Observe that \[\begin{eqnarray} \mathbb{E}\left(C_t\right) &=& \beta_0^o+\beta_1^o \mathbb{E}\left(I_t\right)+ \mathbb{E}\left(\varepsilon_t\right) \\ \mathbb{E}\left(C_t\right) &=& \beta_0^o+\beta_1^o \mathbb{E}\left(I_t\right)\\ \beta_0^o &=& \mathbb{E}\left(C_t\right)-\beta_1^o \mathbb{E}\left(I_t\right)\end{eqnarray}\]
- Provided that \(\mathbb{E}\left(\varepsilon_t\right)=0\) and \(\beta_1^o\) is uniquely identified, then we can also uniquely identify \(\beta_0^o\).

- Consistent estimators for both \(\beta_0^o\) and \(\beta_1^o\) are now available. Simply apply the method of moments!
- In fact, you have actually constructed a special case of a GMM estimator which is a just-identified IV estimator.
- Take \(Z_t^\prime \leftarrow \left(1, D_t\right)\), \(X_t^\prime \leftarrow \left(1, I_t\right)\) and \(Y_t \leftarrow C_t\). The orthogonality condition is \(\mathbb{E}\left(Z_t\varepsilon_t\right)=0\).
- You can also write \[\beta^o=\left(\mathbb{E}\left(Z_tX_t^\prime\right)\right)^{-1}\mathbb{E}\left(Z_tY_t\right),\] provided that \(\mathbb{E}\left(Z_tX_t^\prime\right)\) has rank 2.
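A simulation sketch of this construction, assuming the illustrative values \(\beta_{0}^{o}=1\), \(\beta_{1}^{o}=0.5\), \(\mu_{D}=2\), \(\sigma_{D}^{2}=\sigma_{\varepsilon}^{2}=1\) (so \(\theta=0.5\) and \(\gamma_{1}=0.75\)):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
b0, b1, mu_D = 1.0, 0.5, 2.0
D = mu_D + rng.normal(size=n)      # sigma_D = 1
eps = rng.normal(size=n)           # sigma_eps = 1
I = (b0 + D + eps) / (1 - b1)      # reduced form for income
C = b0 + b1 * I + eps              # structural consumption equation

cov = np.cov(np.vstack([C, I, D]))
ols_slope = cov[0, 1] / cov[1, 1]  # converges to gamma_1 = 0.75, not b1
iv_slope = cov[0, 2] / cov[1, 2]   # converges to b1 = 0.5
```

OLS estimates \(\gamma_{1}=0.75\), while the ratio of covariances using \(D_t\) as instrument recovers \(\beta_{1}^{o}=0.5\).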

- A model inspired by the theory of demand and supply could be written as: \[\begin{eqnarray}q_t^d &=&\alpha_0+\alpha_1 p_t+u_t \\ q_t^s &=&\beta_0+\beta_1 p_t+v_t\\ q_t^d &=& q_t^s\end{eqnarray}\]
- In equilibrium, we have \[\begin{eqnarray} p_t &=& \frac{\beta_0-\alpha_0}{\alpha_1-\beta_1}+\frac{v_t-u_t}{\alpha_1-\beta_1} \\ q_t &=& \frac{\alpha_1\beta_0-\alpha_0\beta_1}{\alpha_1-\beta_1}+\frac{\alpha_1 v_t-\beta_1 u_t}{\alpha_1-\beta_1}.\end{eqnarray}\] These equations express the endogenous variables in terms of all other exogenous variables. These equations are sometimes called the reduced form.
- What is the reduced form in the consumption-income example?

- Assume IID sampling, \(\mathbb{E}\left(u_t\right)=0\), \(\mathbb{E}\left(v_t\right)=0\), \(\mathsf{Var}\left(u_t\right)=\sigma^2_u\), \(\mathsf{Var}\left(v_t\right)=\sigma^2_v\), and \(\mathsf{Cov}\left(u_t,v_t\right)=0\).
- Suppose you specify \(q_t=\gamma_0+\gamma_1 p_t+\varepsilon_t\).
- Using a similar analysis we have done in the consumption-income example, you can show that applying least squares to a regression of \(q_t\) on \(p_t\) will give \[\widehat{\gamma_1}\overset{p}{\to} \frac{\alpha_1\sigma^2_v+\beta_1\sigma^2_u}{\sigma^2_u+\sigma^2_v}.\]
- Therefore you will neither recover the slope of the demand curve nor the slope of the supply curve. This is a case of underidentification.
- Why can’t the approach in our consumption-income example work here? What are \(Z_t\), \(X_t\), \(L\), and \(K\)?
- But, we can gain identification by making assumptions about \(\sigma^2_u\) or \(\sigma^2_v\). What does OLS uniquely identify when \(\sigma^2_u\to 0\)?

The case where \(L=K\) is sometimes called the just-identified or exactly identified case. The case where \(L<K\) is called the underidentified case.

If \(L>K\), then we have more moment conditions than the dimension of the parameters to be estimated. Intuitively, we should be able to exploit all \(L\) moment conditions. In some sense, there should be gains in efficiency when exploiting all of them.

In the linear structural equation case, we reached a point where \[\underbrace{\mathbb{E}\left(Z_tX_t^\prime\right)}_{\left(L\times K\right)}\underbrace{\beta^o}_{\left(K\times 1\right)}=\underbrace{\mathbb{E}\left(Z_tY_t\right)}_{\left(L\times 1\right)}.\]

So, when \(L>K\), we can no longer invert \(\mathbb{E}\left(Z_tX_t^\prime\right)\) because it is no longer a square matrix.

Two strategies are available:

- Exploit the setup behind linear structural equations. We will be using the reduced form.
- Asymptotic theory for the GMM estimator could provide some guidance as to how to proceed.

- We will be augmenting the linear structural equation by a reduced form.
- Recall that a reduced form is a set of equations where you express the endogenous variables in terms of the exogenous variables. Essentially, reduced forms are linear regressions!
- Therefore, we have \[\begin{eqnarray} Y_t&=&X_t^\prime \beta^o+\varepsilon_t \\ X_t^\prime &=&Z_t^\prime \gamma+v_t^\prime. \end{eqnarray}\]
- The first equation is assumed to be true, while the second equation is always true.
- Note that \(\mathbb{E}\left(X_t\varepsilon_t\right)\neq 0\) and by construction, \(\mathbb{E}\left(Z_tv_t\right)=0\).

- A key difference from what we learned in Lectures 1-5 is that \(X_t^\prime\) is now being predicted, and it is a vector! I also dropped the \(*\)’s to reduce the notational burden.

- Substituting the reduced form into the structural equation, we will obtain \[\begin{eqnarray}Y_{t} &=& \left(Z_{t}^{\prime}\gamma+v_{t}^{\prime}\right)\beta^{o}+\varepsilon_{t} \\ Y_{t} &=& \left(Z_{t}^{\prime}\gamma\right)\beta^{o}+\left(v_{t}^{\prime}\beta^{o}+\varepsilon_{t}\right)\end{eqnarray}\]
- Observe that the last equation is a linear regression. Check that \(\mathbb{E}\left[\left(Z_{t}^{\prime}\gamma\right)\left(v_{t}^{\prime}\beta^{o}+\varepsilon_{t}\right)\right]=0\). Justify the steps.
- From Lectures 1-5, we can write an expression for \(\beta^o\):\[\begin{eqnarray} \beta^{o} &=& \left\{ \mathbb{E}\left[\left(Z_{t}^{\prime}\gamma\right)^{\prime}\left(Z_{t}^{\prime}\gamma\right)\right]\right\} ^{-1}\mathbb{E}\left[\left(Z_{t}^{\prime}\gamma\right)^{\prime}Y_{t}\right]\\ &=& \left(\gamma^{\prime}\mathbb{E}\left(Z_{t}Z_{t}^{\prime}\right)\gamma\right)^{-1}\gamma^{\prime}\mathbb{E}\left(Z_{t}Y_{t}\right)\end{eqnarray}\]
- But what is \(\gamma\)? It is a matrix of BLP coefficients! It has the form \[\gamma = \left(\mathbb{E}\left(Z_{t}Z_{t}^{\prime}\right)\right)^{-1}\mathbb{E}\left(Z_{t}X_{t}^{\prime}\right).\]

Putting them all together, we now have \[ \beta^{o} = \bigg(\underbrace{\mathbb{E}\left(X_{t}Z_{t}^{\prime}\right)}_{Q_{ZX}^{\prime}}(\underbrace{\mathbb{E}\left(Z_{t}Z_{t}^{\prime}\right)}_{Q_{ZZ}})^{-1}\underbrace{\mathbb{E}\left(Z_{t}X_{t}^{\prime}\right)}_{Q_{ZX}}\bigg)^{-1}\mathbb{E}\left(X_{t}Z_{t}^{\prime}\right)\left(\mathbb{E}\left(Z_{t}Z_{t}^{\prime}\right)\right)^{-1}\mathbb{E}\left(Z_{t}Y_{t}\right). \]

This is called the two-stage least squares (2SLS) estimand.

Provided that \(Q_{ZX}\) has rank equal to \(K\) and \(Q_{ZZ}\) is nonsingular, we have uniquely identified \(\beta^o\) and expressed it in terms of observable quantities.

We now have written an identification argument for \(\beta^o\).

- The 2SLS estimator can be computed directly using MM and it can be written as \[\begin{eqnarray}&&\widehat{\beta}_{2SLS}\\ &=&\left[\left(\dfrac{1}{n}\sum_{t=1}^{n}X_{t}Z_{t}^{\prime}\right)\left(\dfrac{1}{n}\sum_{t=1}^{n}Z_{t}Z_{t}^{\prime}\right)^{-1}\left(\dfrac{1}{n}\sum_{t=1}^{n}Z_{t}X_{t}^{\prime}\right)\right]^{-1}\left(\dfrac{1}{n}\sum_{t=1}^{n}X_{t}Z_{t}^{\prime}\right)\left(\dfrac{1}{n}\sum_{t=1}^{n}Z_{t}Z_{t}^{\prime}\right)^{-1}\left(\dfrac{1}{n}\sum_{t=1}^{n}Z_{t}Y_{t}\right) \\ &=&\left(\mathbf{X}^{\prime}\mathbf{Z}\left(\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1}\mathbf{Z}^{\prime}\mathbf{X}\right)^{-1}\mathbf{X}^{\prime}\mathbf{Z}\left(\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1}\mathbf{Z}^{\prime}Y\end{eqnarray}\]
- This is exactly a GMM estimator for a specific class of weighting matrices \(\widehat{W}\)! As long as you choose \(\widehat{W} \propto \mathbf{Z}^{\prime}\mathbf{Z}\), you obtain the 2SLS estimator.
- Where are the two stages? We will return to this question later. For the moment, we move on to the other strategy when \(L>K\).
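The matrix formula can be computed directly. A sketch with simulated data, \(L=3\) instruments and \(K=2\) parameters (the data-generating values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
z1, z2, u = rng.normal(size=(3, n))
x = z1 + 0.5 * z2 + u                      # endogenous: x shares u with eps
eps = u + rng.normal(size=n)
y = 1.0 + 0.5 * x + eps

X = np.column_stack([np.ones(n), x])       # K = 2
Z = np.column_stack([np.ones(n), z1, z2])  # L = 3 > K
ZZ_inv_ZX = np.linalg.solve(Z.T @ Z, Z.T @ X)
ZZ_inv_Zy = np.linalg.solve(Z.T @ Z, Z.T @ y)
beta_2sls = np.linalg.solve(X.T @ Z @ ZZ_inv_ZX, X.T @ Z @ ZZ_inv_Zy)
```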

You will quickly realize that choosing \(\widehat{W}\) is crucial. How do you choose \(\widehat{W}\)?

So we need some theory here. Start from \(\widehat{\beta}_{GMM}\left(\widehat{W}\right)\): \[\begin{eqnarray}\widehat{\beta}_{GMM} &=& \left[\left(\frac{\mathbf{X}^{\prime}\mathbf{Z}}{n}\right)\widehat{W}^{-1}\left(\frac{\mathbf{Z}^{\prime}\mathbf{X}}{n}\right)\right]^{-1}\left(\frac{\mathbf{X}^{\prime}\mathbf{Z}}{n}\right)\widehat{W}^{-1}\left(\frac{\mathbf{Z}^{\prime}Y}{n}\right)\\ &=& \left[\left(\dfrac{1}{n}\sum_{t=1}^{n}X_{t}Z_{t}^{\prime}\right)\widehat{W}^{-1}\left(\dfrac{1}{n}\sum_{t=1}^{n}Z_{t}X_{t}^{\prime}\right)\right]^{-1}\left(\dfrac{1}{n}\sum_{t=1}^{n}X_{t}Z_{t}^{\prime}\right)\widehat{W}^{-1}\left(\dfrac{1}{n}\sum_{t=1}^{n}Z_{t}Y_{t}\right)\end{eqnarray}\]

Substitute the structural equation \(Y_{t}=X_{t}^{\prime}\beta^{o}+\varepsilon_{t}\) into the previous expression and then determine what conditions are needed to obtain consistency.

So, we have\[\widehat{\beta}_{GMM}-\beta^{o}=\left[\left(\dfrac{1}{n}\sum_{t=1}^{n}X_{t}Z_{t}^{\prime}\right)\widehat{W}^{-1}\left(\dfrac{1}{n}\sum_{t=1}^{n}Z_{t}X_{t}^{\prime}\right)\right]^{-1}\left(\dfrac{1}{n}\sum_{t=1}^{n}X_{t}Z_{t}^{\prime}\right)\widehat{W}^{-1}\left(\dfrac{1}{n}\sum_{t=1}^{n}Z_{t}\varepsilon_{t}\right).\]

Provided that the LLNs below apply:\[\dfrac{1}{n}\sum_{t=1}^{n}Z_{t}X_{t}^{\prime}\overset{p}{\to}\mathbb{E}\left(Z_{t}X_{t}^{\prime}\right)=Q_{ZX},\quad\dfrac{1}{n}\sum_{t=1}^{n}Z_{t}\varepsilon_{t}\overset{p}{\to}\mathbb{E}\left(Z_{t}\varepsilon_{t}\right)=0.\]

To obtain consistency in the sense that \(\widehat{\beta}_{GMM}\left(\widehat{W}\right)\overset{p}{\rightarrow}\beta^{o}\), we also need

- \(\mathbb{E}\left(Z_{t}X_{t}^{\prime}\right)\) to have full rank equal to dimension of \(\beta^{o}\)
- \(\widehat{W}\overset{p}{\to}W\), where the limit \(W\) is symmetric and nonsingular

Revisit the consumption-income example. What do you notice?

Revisit the simultaneous equations example. What do you notice?

There is not a lot of guidance to narrow down how to choose \(\widehat{W}\) from the consistency result.

Let us take a look at the asymptotic distribution: \[\begin{eqnarray} &&\sqrt{n}\left(\widehat{\beta}_{GMM}-\beta^{o}\right) \\ &=& \left[\left(\dfrac{1}{n}\sum_{t=1}^{n}X_{t}Z_{t}^{\prime}\right)\widehat{W}^{-1}\left(\dfrac{1}{n}\sum_{t=1}^{n}Z_{t}X_{t}^{\prime}\right)\right]^{-1}\left(\dfrac{1}{n}\sum_{t=1}^{n}X_{t}Z_{t}^{\prime}\right)\widehat{W}^{-1}\left(\dfrac{1}{\sqrt{n}}\sum_{t=1}^{n}Z_{t}\varepsilon_{t}\right)\end{eqnarray}\]

We need the same conditions as for consistency, plus a suitable CLT which applies to \(\{Z_{t}\varepsilon_{t}\}\), i.e., \[\sqrt{n}\widehat{m}\left(\beta^{o}\right)=\sqrt{n}\left(\frac{1}{n}\mathbf{Z}^{\prime}\varepsilon\right)=\sqrt{n}\left(\frac{1}{n}\sum_{t=1}^{n}Z_{t}\varepsilon_{t}\right)\overset{d}{\rightarrow}N\left(0,V_{o}\right).\]

Thus, we have \[\sqrt{n}\left(\widehat{\beta}_{GMM}\left(\widehat{W}\right)-\beta^{o}\right)\overset{d}{\to}N\left(0,\Omega\right),\] where \[\begin{eqnarray}\Omega &=&\mathsf{Avar}\left(\sqrt{n}\left(\widehat{\beta}_{GMM}\left(\widehat{W}\right)-\beta^{o}\right)\right)\\ &=&\left(Q_{ZX}^{\prime}W^{-1}Q_{ZX}\right)^{-1}Q_{ZX}^{\prime}W^{-1}V_{o}W^{-1}Q_{ZX}\left(Q_{ZX}^{\prime}W^{-1}Q_{ZX}\right)^{-1}.\end{eqnarray}\]

By choosing \(\widehat{W}\) so that \(\widehat{W}\overset{p}{\rightarrow}W\propto V_{o}\), the form of the asymptotic variance is now \[\Omega_{o}=\left(Q_{ZX}^{\prime}V_{o}^{-1}Q_{ZX}\right)^{-1}.\]

Note this is the optimal choice of the weighting matrix in the sense that \(\Omega-\Omega_{o}\) is positive semi-definite.

The optimal choice of the weighting matrix is proportional to the asymptotic variance of \(\mathbf{Z}^{\prime}\varepsilon\), which is really \(\mathbf{Z}^{\prime}\left(Y-\mathbf{X}\beta^{o}\right)\). This has a very nice interpretation and opens the possibility for a similar approach to consistent covariance matrix estimation.

The optimal choice of the weighting matrix gives us the optimal/efficient GMM estimator.

From the main textbook: “The use of the optimal weighting matrix downweights the sample moments which have large sampling variations.”

Suppose that \[\begin{eqnarray} X_{1t} &=& \mu+\varepsilon_{1t}\\ X_{2t} &=& \mu+\varepsilon_{2t}\end{eqnarray}\]

Assume that \(\left\{ \left(\varepsilon_{1t},\varepsilon_{2t}\right)\right\}\) is IID. Further assume that \(\mathbb{E}\left(\varepsilon_{1t}\right)=\mathbb{E}\left(\varepsilon_{2t}\right)=0\), \(\mathsf{Var}\left(\varepsilon_{1t}\right)=\sigma_{1}^{2}<\infty\), \(\mathsf{Var}\left(\varepsilon_{2t}\right)=\sigma_{2}^{2}<\infty\), and \(\varepsilon_{1t}\), \(\varepsilon_{2t}\) are independent of each other.

Propose a consistent estimator for \(\mu\) using only an IID random sample from \(\left\{ X_{1t}\right\}\). Derive the asymptotic distribution of this estimator.

You are going to construct a GMM estimator that exploits an IID random sample from \(\left\{ \left(X_{1t},X_{2t}\right)\right\}\). Derive the optimal weighting matrix that exploits the following moment conditions:\[\begin{eqnarray}\mathbb{E}\left(X_{1t}-\mu\right) &=& 0 \\ \mathbb{E}\left(X_{2t}-\mu\right) &=& 0\end{eqnarray}\] Look at the structure of your optimal weighting matrix.

Derive the efficient GMM estimator and its asymptotic distribution. Compare the asymptotic variances of the efficient GMM estimator against the estimator which uses only an IID random sample from \(\left\{ X_{1t}\right\}\).
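A numerical sanity check for your derivations (not a substitute for them): apply the generic one-step/two-step recipe to these two moment conditions, with illustrative values \(\mu=3\), \(\sigma_{1}=1\), \(\sigma_{2}=2\):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
mu, s1, s2 = 3.0, 1.0, 2.0
x1 = mu + s1 * rng.normal(size=n)
x2 = mu + s2 * rng.normal(size=n)

xbar = np.array([x1.mean(), x2.mean()])
# Step 1: identity weighting minimizes (xbar1 - mu)^2 + (xbar2 - mu)^2
mu1 = xbar.mean()
# Step 2: weight by the estimated variance of the moment functions
m = np.column_stack([x1 - mu1, x2 - mu1])
V = m.T @ m / n
D = np.array([1.0, 1.0])  # minus the derivative of E(m_t) wrt mu
mu_gmm = (D @ np.linalg.solve(V, xbar)) / (D @ np.linalg.solve(V, D))
```

The estimated weighting matrix has a larger diagonal entry for the noisier second moment, so the efficient GMM estimate leans toward \(\bar{X}_1\), in line with the textbook quote about downweighting moments with large sampling variation.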

If you strengthen \(\mathbb{E}\left(Z_{t}\varepsilon_{t}\right)=0\) to \(\mathbb{E}\left(\varepsilon_{t}|Z_{t}\right)=0\) and assume conditional homoscedasticity \(\mathsf{Var}\left(\varepsilon_{t}|Z_{t}\right)=\mathbb{E}\left(\varepsilon_{t}^{2}|Z_{t}\right)=\sigma^{2}\), then \(V_{o}=\sigma^{2}Q_{ZZ}\).

We obtain the following special case, which recovers the classic results for the two-stage least squares (2SLS) estimator:\[\widehat{\beta}_{GMM}\left(\widehat{V}_{o}\right) = \left(\mathbf{X}^{\prime}\mathbf{Z}\left(\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1}\mathbf{Z}^{\prime}\mathbf{X}\right)^{-1}\mathbf{X}^{\prime}\mathbf{Z}\left(\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1}\mathbf{Z}^{\prime}Y=\widehat{\beta}_{2SLS}\] with asymptotic distribution equal to \[\sqrt{n}\left(\widehat{\beta}_{GMM}\left(\widehat{V}_{o}\right)-\beta^{o}\right) \overset{d}{\to} N\left(0,\left(Q_{ZX}^{\prime}Q_{ZZ}^{-1}Q_{ZX}\right)^{-1}\right).\]

So, under conditional homoscedasticity, 2SLS is efficient GMM!

Another special case is the homoscedastic MDS case. Try showing that you still have 2SLS being efficient GMM.

Outside of these cases, 2SLS ceases to be efficient GMM.

It turns out that when \(L>K\), whether or not we are in the linear structural equations case, we can use the idea behind efficient GMM estimation. This leads to the following algorithm:

Choose a \(\widehat{W}\) (Typically, \(\widehat{W}=I\)) and calculate \[\widehat{\beta}_{GMM}\left(\widehat{W}\right)=\arg\min_{\beta\in\Theta}\widehat{m}\left(\beta\right)^{\prime}\widehat{W}^{-1}\widehat{m}\left(\beta\right).\]

Find a consistent estimator of \[V_{o}=\mathsf{avar}\left(\sqrt{n}\widehat{m}\left(\beta^{o}\right)\right).\]Call this estimator \(\widetilde{V}\). This estimator is computed based on the form of \(V_{o}\) and \(\widehat{\beta}_{GMM}\left(\widehat{W}\right)\).

Use \(\widetilde{V}\) as a consistent estimator of the optimal weight matrix and recalculate the GMM estimator: \[\widehat{\beta}_{GMM}\left(\widetilde{V}\right)=\arg\min_{\beta\in\Theta}\widehat{m}\left(\beta\right)^{\prime}\widetilde{V}^{-1}\widehat{m}\left(\beta\right).\]

\(\widehat{\beta}_{GMM}\left(\widehat{W}\right)\) for any appropriate choice of \(\widehat{W}\) is called one-step GMM. \(\widehat{\beta}_{GMM}\left(\widetilde{V}\right)\) is sometimes referred to as two-step or two-stage GMM.
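The one-step/two-step procedure can be sketched for the linear moment function \(m_{t}\left(\beta\right)=Z_{t}\left(Y_{t}-X_{t}^{\prime}\beta\right)\) under IID sampling (simulated data with illustrative values):

```python
import numpy as np

def linear_gmm(y, X, Z, W):
    """GMM estimator for m_t = Z_t (y_t - X_t' beta) with weighting matrix W."""
    n = len(y)
    ZX, Zy = Z.T @ X / n, Z.T @ y / n
    A = ZX.T @ np.linalg.solve(W, ZX)
    b = ZX.T @ np.linalg.solve(W, Zy)
    return np.linalg.solve(A, b)

rng = np.random.default_rng(4)
n = 100_000
z1, z2, u = rng.normal(size=(3, n))
x = z1 + 0.5 * z2 + u
y = 1.0 + 0.5 * x + u + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z1, z2])

beta1 = linear_gmm(y, X, Z, np.eye(3))  # step 1: one-step GMM with W = I
e = y - X @ beta1
m = Z * e[:, None]                      # rows are m_t' evaluated at beta1
V_tilde = m.T @ m / n                   # consistent estimate of V_o
beta2 = linear_gmm(y, X, Z, V_tilde)    # step 2: efficient (two-step) GMM
```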

It is also possible to use an updated estimator of \(V_{o}\), called \(\widehat{V}\), where you use \(\widehat{\beta}_{GMM}\left(\widetilde{V}\right)\).

It is possible to iterate the procedure a few more times. This is called iterated GMM and has been the subject of B. E. Hansen and Lee (2021).

It is also possible to allow the weight matrix to depend on \(\beta\). This version is called continuously updated GMM by L. P. Hansen, Heaton, and Yaron (1996).

The 2SLS estimand already gives you a hint: that estimand uses best linear prediction twice.

The 2SLS estimator itself has some nice algebra to uncover. In particular, observe that \[\begin{eqnarray} && \left(\mathbf{X}^{\prime}\mathbf{Z}\left(\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1}\mathbf{Z}^{\prime}\mathbf{X}\right)^{-1}\mathbf{X}^{\prime}\mathbf{Z}\left(\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1}\mathbf{Z}^{\prime}Y\\ & = & \left(\mathbf{X}^{\prime}P_{\mathbf{Z}}\mathbf{X}\right)^{-1}\mathbf{X}^{\prime}P_{\mathbf{Z}}Y\\ & = &\left(\mathbf{X}^{\prime}P_{\mathbf{Z}}^{\prime}P_{\mathbf{Z}}\mathbf{X}\right)^{-1}\mathbf{X}^{\prime}P_{\mathbf{Z}}^{\prime}P_{\mathbf{Z}}Y\\ & = &\left[\left(P_{\mathbf{Z}}\mathbf{X}\right)^{\prime}P_{\mathbf{Z}}\mathbf{X}\right]^{-1}\left(P_{\mathbf{Z}}\mathbf{X}\right)^{\prime}P_{\mathbf{Z}}Y \end{eqnarray}\]

- Thus, you compute the fitted values from a least squares fit of \(\mathbf{X}\) on \(\mathbf{Z}\). This gives you \(P_{\mathbf{Z}}\mathbf{X}\). The next stage is to compute the coefficient vector from the least squares fit of \(P_{\mathbf{Z}}Y\) on \(P_{\mathbf{Z}}\mathbf{X}\).
- This two-stage interpretation is purely least squares algebra.
- In practice, literally applying OLS in two stages is useful mainly for computing the 2SLS point estimates.
- For other considerations, such as standard errors, be careful: the second-stage OLS residuals are not the 2SLS residuals.
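A numerical check of the two-stage algebra (illustrative simulated data). Regressing \(Y\) on \(P_{\mathbf{Z}}\mathbf{X}\) gives the same coefficients as regressing \(P_{\mathbf{Z}}Y\) on \(P_{\mathbf{Z}}\mathbf{X}\), because \(P_{\mathbf{Z}}\) is symmetric and idempotent:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50_000
z1, z2, u = rng.normal(size=(3, n))
x = z1 + 0.5 * z2 + u
y = 1.0 + 0.5 * x + u + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z1, z2])

# Stage 1: fitted values P_Z X from an OLS fit of X on Z
X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
# Stage 2: OLS of y on the stage-1 fitted values
beta_two_stage = np.linalg.lstsq(X_hat, y, rcond=None)[0]

# Direct 2SLS formula for comparison
A = X.T @ Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
b = X.T @ Z @ np.linalg.solve(Z.T @ Z, Z.T @ y)
beta_direct = np.linalg.solve(A, b)
```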

A better and practically more useful alternative is to look at the control function interpretation of 2SLS.

- Recall our linear structural equation setup augmented with a reduced form: \[\begin{eqnarray} Y_t&=&X_t^\prime \beta^o+\varepsilon_t \\ X_t^\prime &=&Z_t^\prime \gamma+v_t^\prime. \end{eqnarray}\]
- We can form another linear regression: \[ \varepsilon_t=v_t^\prime \rho + u_t\] where \(\mathbb{E}\left(v_tu_t\right)=0\) and \(\rho\) is the BLP coefficient vector.
- Note that \(\rho\) must be nonzero. Why? Because
- \(X_t\) is endogenous with respect to \(\varepsilon_t\): \(\mathbb{E}\left(X_t\varepsilon_t\right)\neq 0\)
- \(v_t^\prime\) is a BLP error: \(\mathbb{E}\left(Z_tv_t^\prime\right)=0\)
- \(Z_t\) is exogenous with respect to \(\varepsilon_t\): \(\mathbb{E}\left(Z_t\varepsilon_t\right)=0\)

- Substituting \(\varepsilon_t=v_t^\prime \rho + u_t\) into our linear structural equation gives us \[Y_t = X_t^\prime \beta^o+v_t^\prime \rho+ u_t. \]
- We now have a valid linear regression of \(Y_t\) on \(X_t\) and \(v_t\): both regressors are uncorrelated with \(u_t\). You can show this!
- Therefore, we can apply least squares and get consistent estimators of \(\beta^o\) and \(\rho\).
- \(v_t\) is sometimes called a control function, because once you include it as a regressor, it controls for the endogeneity of \(X_t\).
- The only problem is that \(v_t\) is unobservable.

- But you can easily estimate \(v_t\) using the residuals \(\widehat{v}_t\) from the reduced form.
- Using least squares algebra and FWL, you can show that when you obtain a least squares fit \[Y_t = X_t^\prime \widehat{\beta}+\widehat{v}_t^\prime \widehat{\rho}+ \widehat{u}_t,\] \(\widehat{\beta}=\widehat{\beta}_{2SLS}\).
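The FWL result above can be checked numerically. A hedged numpy sketch on our own simulated data (names are ours): the coefficient on \(X_t\) from the least squares fit including \(\widehat{v}_t\) coincides exactly with 2SLS.

```python
import numpy as np

# Hypothetical DGP: one endogenous regressor, three instruments
rng = np.random.default_rng(1)
n = 1500
Z = rng.normal(size=(n, 3))
v = rng.normal(size=n)
eps = 0.7 * v + rng.normal(size=n)
X = (Z @ np.array([1.0, 0.5, -1.0]) + v).reshape(-1, 1)
Y = 1.5 * X[:, 0] + eps

# Reduced-form residuals vhat (the estimated control function)
Pz = Z @ np.linalg.solve(Z.T @ Z, Z.T)
vhat = X - Pz @ X

# OLS of Y on [X, vhat]: first coefficient is beta-hat, second is rho-hat
R = np.column_stack([X, vhat])
coef = np.linalg.solve(R.T @ R, R.T @ Y)

# 2SLS for comparison
beta_2sls = np.linalg.solve(X.T @ Pz @ X, X.T @ Pz @ Y)

assert np.isclose(float(coef[0]), float(beta_2sls[0]))  # exact by FWL
```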

- Control functions enable you to solve more general endogeneity problems without resorting to 2SLS.
- See a version called two-stage residual inclusion or 2SRI, which is popular in health economics (work by Terza and co-authors). See also work by Wooldridge (2015) targeted toward labor economics.
- Our analysis can also shed light on what happens when \(L\to\infty\). Think about what happens to the reduced form.

The word "model" can be more general than the linear regression model or even linear structural equations.

The model is now described by moment conditions of the form \(\mathbb{E}\left(m_{t}\left(\beta^{o}\right)\right)=0\) for some \(\beta^{o}\in\Theta\).

Notice that a model, in this context, is a set of unconditional moment restrictions.

We are only able to test the model when \(L>K\). What happens when \(L<K\)? \(L=K\)?

One way to check whether the model is correctly specified (meaning the unconditional moment restrictions hold) is to check how far \(\widehat{m}\left(\widehat{\beta}\right)\) is from zero (Why zero?).

The null hypothesis is \(\mathbb{E}\left(m_{t}\left(\beta^{o}\right)\right)=0\) for some \(\beta^{o}\in\Theta\). We need to derive a test statistic and be able to derive its distribution under the null.

It turns out that the GMM criterion function evaluated at the efficient GMM estimator is a good starting point for a test statistic.

For the linear structural equation case with conditional homoscedasticity, \(\widehat{\beta}=\widehat{\beta}_{2SLS}\) and \[\begin{eqnarray} n\times\widehat{m}\left(\widehat{\beta}\right)^{\prime}\widetilde{V}^{-1}\widehat{m}\left(\widehat{\beta}\right) &=& \left(Y-\mathbf{X}\widehat{\beta}_{2SLS}\right)^{\prime}\mathbf{Z}\left(\widehat{\sigma}^{2}\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1}\mathbf{Z}^{\prime}\left(Y-\mathbf{X}\widehat{\beta}_{2SLS}\right)\\ &=& \widehat{e}_{2SLS}^{\prime}\mathbf{Z}\left(\widehat{\sigma}^{2}\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1/2}\left(\widehat{\sigma}^{2}\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1/2}\mathbf{Z}^{\prime}\widehat{e}_{2SLS}\\ &=& \left(\left(\widehat{\sigma}^{2}\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1/2}\mathbf{Z}^{\prime}\widehat{e}_{2SLS}\right)^{\prime}\left(\widehat{\sigma}^{2}\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1/2}\mathbf{Z}^{\prime}\widehat{e}_{2SLS} \end{eqnarray}\]

\(\widehat{e}_{2SLS}=Y-\mathbf{X}\widehat{\beta}_{2SLS}\) is the vector of 2SLS residuals and \(\widehat{\sigma}^2=\widehat{e}_{2SLS}^\prime \widehat{e}_{2SLS}/n\) is an estimator of the constant conditional variance \(\mathbb{E}\left(\varepsilon_t^2|Z_t\right)=\sigma^2\).

Note that using the usual algebra you have encountered before, we have

\[\begin{eqnarray} &&\left(\widehat{\sigma}^{2}\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1/2}\mathbf{Z}^{\prime}\widehat{e}_{2SLS} \\ &=& \left(\widehat{\sigma}^{2}\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1/2}\mathbf{Z}^{\prime}\left(\varepsilon-\mathbf{X}\left(\widehat{\beta}_{2SLS}-\beta^{o}\right)\right)\\ &=& \left(\widehat{\sigma}^{2}\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1/2}\mathbf{Z}^{\prime}\varepsilon-\left(\widehat{\sigma}^{2}\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1/2}\mathbf{Z}^{\prime}\mathbf{X}\left(\mathbf{X}^{\prime}P_{\mathbf{Z}}\mathbf{X}\right)^{-1}\mathbf{X}^{\prime}\mathbf{Z}\left(\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1}\mathbf{Z}^{\prime}\varepsilon\\ &=& \left(\widehat{\sigma}^{2}\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1/2}\mathbf{Z}^{\prime}\varepsilon-\left(\widehat{\sigma}^{2}\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1/2}\mathbf{Z}^{\prime}\mathbf{X}\left(\mathbf{X}^{\prime}P_{\mathbf{Z}}\mathbf{X}\right)^{-1}\mathbf{X}^{\prime}\mathbf{Z}\left(\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1/2}\left(\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1/2}\mathbf{Z}^{\prime}\varepsilon\\ &=& \left(I-\left(\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1/2}\mathbf{Z}^{\prime}\mathbf{X}\left(\mathbf{X}^{\prime}P_{\mathbf{Z}}\mathbf{X}\right)^{-1}\mathbf{X}^{\prime}\mathbf{Z}\left(\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1/2}\right)\left(\widehat{\sigma}^{2}\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1/2}\mathbf{Z}^{\prime}\varepsilon \end{eqnarray}\]Next, note that \[\widehat{\Pi}=I-\left(\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1/2}\mathbf{Z}^{\prime}\mathbf{X}\left(\mathbf{X}^{\prime}P_{\mathbf{Z}}\mathbf{X}\right)^{-1}\mathbf{X}^{\prime}\mathbf{Z}\left(\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1/2}\] is symmetric and idempotent.

Under conditional homoscedasticity (and other conditions which you should fill in), we have \[n\times\widehat{m}\left(\widehat{\beta}\right)^{\prime}\widetilde{V}^{-1}\widehat{m}\left(\widehat{\beta}\right)=\left[\widehat{\Pi}\left(\widehat{\sigma}^{2}\dfrac{\mathbf{Z}^{\prime}\mathbf{Z}}{n}\right)^{-1/2}\left(\dfrac{\mathbf{Z}^{\prime}\varepsilon}{\sqrt{n}}\right)\right]^{\prime}\underbrace{\widehat{\Pi}}_{\overset{p}{\to}\Pi}\underbrace{\left(\widehat{\sigma}^{2}\dfrac{\mathbf{Z}^{\prime}\mathbf{Z}}{n}\right)^{-1/2}}_{\overset{p}{\to}\sigma^{-1}Q_{ZZ}^{-1/2}}\underbrace{\left(\dfrac{\mathbf{Z}^{\prime}\varepsilon}{\sqrt{n}}\right)}_{\overset{d}{\to}N\left(0,\sigma^{2}Q_{ZZ}\right)}\]

Because \(\Pi\) is also symmetric and idempotent (so rank is equal to trace) and\[\widehat{\Pi}\left(\widehat{\sigma}^{2}\dfrac{\mathbf{Z}^{\prime}\mathbf{Z}}{n}\right)^{-1/2}\left(\dfrac{\mathbf{Z}^{\prime}\varepsilon}{\sqrt{n}}\right)\overset{d}{\to}N\left(0,\Pi\right),\]we must have \[n\times\widehat{m}\left(\widehat{\beta}\right)^{\prime}\widetilde{V}^{-1}\widehat{m}\left(\widehat{\beta}\right)=\left(Y-\mathbf{X}\widehat{\beta}_{2SLS}\right)^{\prime}\mathbf{Z}\left(\widehat{\sigma}^{2}\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1}\mathbf{Z}^{\prime}\left(Y-\mathbf{X}\widehat{\beta}_{2SLS}\right)\overset{d}{\to}\chi_{L-K}^{2}.\]

The test statistic can be rewritten in many ways in the special case of testing model specification in the linear structural equation case: \[\begin{eqnarray} n\times\widehat{m}\left(\widehat{\beta}\right)^{\prime}\widetilde{V}^{-1}\widehat{m}\left(\widehat{\beta}\right) &=& \left(Y-\mathbf{X}\widehat{\beta}_{2SLS}\right)^{\prime}\mathbf{Z}\left(\widehat{\sigma}^{2}\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1}\mathbf{Z}^{\prime}\left(Y-\mathbf{X}\widehat{\beta}_{2SLS}\right)\\ &=& \widehat{e}_{2SLS}^{\prime}\mathbf{Z}\left(\widehat{\sigma}^{2}\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1}\mathbf{Z}^{\prime}\widehat{e}_{2SLS}\\ &=& \dfrac{\left\Vert P_{\mathbf{Z}}\widehat{e}_{2SLS}\right\Vert ^{2}}{\left\Vert \widehat{e}_{2SLS}\right\Vert ^{2}/n}. \end{eqnarray}\]

At this point, you should notice something extremely familiar. Run an auxiliary regression of the 2SLS residuals \(\widehat{e}_{2SLS}\) on the instruments \(\mathbf{Z}\), without a constant. Does the statistic now look familiar?
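A minimal numpy sketch of the auxiliary-regression connection, on our own simulated data (names are ours): the Sargan statistic equals \(n\) times the uncentered \(R^2\) from regressing \(\widehat{e}_{2SLS}\) on \(\mathbf{Z}\) without a constant.

```python
import numpy as np

# Hypothetical overidentified DGP: K = 1 regressor, L = 3 valid instruments
rng = np.random.default_rng(2)
n = 1000
Z = rng.normal(size=(n, 3))
v = rng.normal(size=n)
eps = 0.6 * v + rng.normal(size=n)
X = (Z @ np.array([1.0, 0.8, -0.6]) + v).reshape(-1, 1)
Y = 1.0 * X[:, 0] + eps

Pz = Z @ np.linalg.solve(Z.T @ Z, Z.T)
beta = np.linalg.solve(X.T @ Pz @ X, X.T @ Pz @ Y)
e = Y - X @ beta                                  # 2SLS residuals

# Sargan statistic two ways
sargan = (e @ Pz @ e) / (e @ e / n)               # ||Pz e||^2 / (||e||^2 / n)
r2_unc = (e @ Pz @ e) / (e @ e)                   # uncentered R^2 of e on Z
assert np.isclose(sargan, n * r2_unc)
```

Under the null, `sargan` would be compared against \(\chi^2_{L-K}=\chi^2_2\) critical values.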

The test statistic we derived for the special case is sometimes called a Sargan test.

The simplified form of the test statistic is sometimes called a Sargan statistic.

In general, the test of model specification is also sometimes called a test of overidentifying restrictions or overidentifying restrictions test or \(J\)-test (Hansen 1982).

If you reject the null, the test does not tell you which moment condition is incompatible with the data.

If you do not reject the null, it does not necessarily mean that your model is correct.

We need to work out the general case, as we focused on the linear in \(\beta\) case.

- Nonlinear in \(\beta\) presents its own special technical problems, especially in the use of the LLN and CLT to justify GMM.

We need to understand where \(\mathbf{Z}\) comes from. Unfortunately, the theory presented here is the easy part. The genuinely hard part is finding such \(\mathbf{Z}\) in empirical applications.

- We will work out Angrist and Krueger (1991).
- If we have time, we will try to understand what IV and 2SLS recovers in a causal setting.

The 2SLS case actually has a very rich history. If we have time, we are going to examine the finite-sample properties and the bad things that could happen once some assumptions of the theory do not hold.

- We will do some more algebra to facilitate the many interpretations of 2SLS.
- We will try to cover something similar to Chapter 3. Is 2SLS unbiased?

In all consistency proofs you have written, you were able to find a closed-form expression for some estimator.

In addition, the estimator is usually written as the estimation target plus some error.

Now, we only have an objective function and FOCs.

You have seen examples where it is difficult to find a closed-form expression for the solution to the FOC. Recall the case of estimating \(\lambda\) in the exponential distribution.

It may also happen that the FOCs have multiple solutions. Therefore, the consistency argument avoids relying on FOCs.

LLNs have to be modified as well. Why?

Pay attention to what changes relative to linear GMM.

In a separate document, I go through the details of the proof.

In these slides, I present the key ideas only.

In all asymptotic normality proofs you have written for the OLS estimator, you started from a closed-form expression for \(\sqrt{n}\left(\widehat{\beta}-\beta^*\right)\).

Recall that the difference between the estimator and the estimation target is equal to some error. You also needed a CLT at some point.

Now, we only have an objective function and FOCs.

Just like in the consistency case, it is harder to write a closed-form expression for \(\sqrt{n}\left(\widehat{\beta}_{GMM}-\beta^o\right)\).

The key idea is to use a linear approximation, where the approximation error somehow disappears in large samples.

Pay attention to what changes relative to linear GMM.

Introduce the key object. We need a first-order asymptotic approximation or an asymptotically linear representation: \[\sqrt{n}\left(\widehat{\beta}-\beta^{o}\right)=-\left[\frac{\partial\widehat{m}\left(\widehat{\beta}\right)}{\partial\beta^{\prime}}\widehat{W}^{-1}\frac{\partial\widehat{m}\left(\bar{\beta}\right)}{\partial\beta}\right]^{-1}\frac{\partial\widehat{m}\left(\widehat{\beta}\right)}{\partial\beta^{\prime}}\widehat{W}^{-1}\sqrt{n}\widehat{m}\left(\beta^{o}\right)\]

This is very similar to the OLS estimator: \[\begin{eqnarray} \sqrt{n}\left(\widehat{\beta}-\beta^{*}\right) = \left(\dfrac{1}{n}\sum_{t=1}^{n}X_{t}X_{t}^{\prime}\right)^{-1}\left(\dfrac{1}{\sqrt{n}}{\displaystyle \sum_{t=1}^{n}}X_{t}u_{t}\right)\\ \overset{d}{\rightarrow} N\bigg(0,\underbrace{Q^{-1}\left[\mathsf{Avar}\left(\dfrac{1}{\sqrt{n}}{\displaystyle \sum_{t=1}^{n}}X_{t}u_{t}\right)\right]Q^{-1}}_{\mathsf{Avar}\left(\sqrt{n}\left(\widehat{\beta}-\beta^{*}\right)\right)}\bigg) \end{eqnarray}\]

So where are the differences?

Introduce a key tool called the mean value theorem. Let \(h:\mathbb{R}^{p}\to\mathbb{R}^{q}\) be continuously differentiable. Then, \[h\left(x\right)=h\left(x_{0}\right)+\frac{\partial h\left(\bar{x}\right)}{\partial x}\left(x-x_{0}\right)\] where \(\bar{x}\) is a mean value lying between \(x\) and \(x_{0}\), i.e., \(\bar{x}=\lambda x+(1-\lambda)x_{0}\), where \(\lambda\in\left(0,1\right)\).
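One caveat worth keeping in mind: for vector-valued \(h\) (\(q>1\)), the mean value theorem is applied row by row, with a possibly different mean value \(\bar{x}\) for each coordinate. The scalar case can be checked numerically; a minimal sketch with \(h(x)=x^3\) (our own illustrative choice):

```python
import numpy as np

# MVT: h(x) - h(x0) = h'(xbar) * (x - x0) for some xbar between x0 and x
h = lambda x: x**3
dh = lambda x: 3 * x**2

x0, x = 0.0, 1.0
# Here h(1) - h(0) = 1, so we need dh(xbar) = 3*xbar**2 = 1, i.e. xbar = 1/sqrt(3)
xbar = 1 / np.sqrt(3)

assert np.isclose(h(x) - h(x0), dh(xbar) * (x - x0))
assert x0 < xbar < x      # the mean value lies strictly between x0 and x
```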

Note that the layout used in Chapter 8 (which may be confusing given some notational conventions) is: \[\underset{\left(q\times p\right)}{\frac{\partial h}{\partial x}}=\left[\begin{array}{cccc} \frac{\partial h_{1}}{\partial x_{1}} & \frac{\partial h_{1}}{\partial x_{2}} & \cdots & \frac{\partial h_{1}}{\partial x_{p}}\\ \frac{\partial h_{2}}{\partial x_{1}} & \frac{\partial h_{2}}{\partial x_{2}} & \cdots & \frac{\partial h_{2}}{\partial x_{p}}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial h_{q}}{\partial x_{1}} & \frac{\partial h_{q}}{\partial x_{2}} & \cdots & \frac{\partial h_{q}}{\partial x_{p}} \end{array}\right]\]

Assumption 8.5 states that \(\beta^{o}\in\mathsf{int}\left(\Theta\right)\).

Assumptions 8.1 to 8.4 allow you to conclude that \(\widehat{\beta}_{GMM}\overset{p}{\rightarrow}\beta^{o}\) as \(n\rightarrow\infty\).

- Note again that \(W\) has to be positive definite, contrary to what is stated in the textbook.
- Actually, only positive semidefiniteness is really needed but then Assumption 8.3 as it is stated has to be modified.

Because of Assumptions 8.1 to 8.5, we have that \(\widehat{\beta}_{GMM}\) is an interior element of \(\Theta\) with probability approaching one as \(n\rightarrow\infty\).

The FOCs of the GMM objective function are \[\left.\frac{\partial\widehat{Q}\left(\beta\right)}{\partial\beta}\right|_{\beta=\widehat{\beta}}=0\Rightarrow-2\frac{\partial\widehat{m}\left(\widehat{\beta}\right)}{\partial\beta^{\prime}}\widehat{W}^{-1}\widehat{m}\left(\widehat{\beta}\right)=0.\]

A local linear approximation of \(\widehat{m}\left(\cdot\right)\) about \(\beta^{o}\) gives \[\underset{\left(L\times1\right)}{\widehat{m}\left(\widehat{\beta}\right)}=\underset{\left(L\times1\right)}{\widehat{m}\left(\beta^{o}\right)}+\underset{\left(L\times K\right)}{\frac{\partial\widehat{m}\left(\bar{\beta}\right)}{\partial\beta}}\underset{\left(K\times1\right)}{\left(\widehat{\beta}-\beta^{o}\right).}\]

Now, substitute the local linear approximation into the FOCs and solve for \(\widehat{\beta}-\beta^{o}\): \[\widehat{\beta}-\beta^{o}=-\left[\frac{\partial\widehat{m}\left(\widehat{\beta}\right)}{\partial\beta^{\prime}}\widehat{W}^{-1}\frac{\partial\widehat{m}\left(\bar{\beta}\right)}{\partial\beta}\right]^{-1}\frac{\partial\widehat{m}\left(\widehat{\beta}\right)}{\partial\beta^{\prime}}\widehat{W}^{-1}\widehat{m}\left(\beta^{o}\right).\]

- First: \[\frac{\partial\widehat{m}\left(\breve{\beta}\right)}{\partial\beta}\overset{p}{\rightarrow}\mathbb{E}\left[\dfrac{\partial m_{t}\left(\beta^{o}\right)}{\partial\beta}\right]=D\left(\beta^{o}\right)=D_{o}\] for \(\breve{\beta}\in\left\{ \widehat{\beta},\bar{\beta}\right\}\)
- This part requires a uniform LLN, together with the consistency of \(\widehat{\beta}\) (and hence of \(\bar{\beta}\)).

- Second: \[\sqrt{n}\left(\widehat{\beta}-\beta^{o}\right)=-\left[\frac{\partial\widehat{m}\left(\widehat{\beta}\right)}{\partial\beta^{\prime}}\widehat{W}^{-1}\frac{\partial\widehat{m}\left(\bar{\beta}\right)}{\partial\beta}\right]^{-1}\frac{\partial\widehat{m}\left(\widehat{\beta}\right)}{\partial\beta^{\prime}}\widehat{W}^{-1}\sqrt{n}\widehat{m}\left(\beta^{o}\right)\overset{d}{\to}N\left(0,\Omega\right)\]for some form for \(\Omega\).
- This part is familiar from the proofs of asymptotic normality of the OLS estimator.

At the end of the proof for asymptotic normality, you will be able to obtain a form for \(\Omega\): \[\Omega=\left(D_{o}^{\prime}W^{-1}D_{o}\right)^{-1} D_{o}^{\prime}W^{-1}V_o W^{-1}D_{o} \left(D_{o}^{\prime}W^{-1}D_{o}\right)^{-1}.\]

This is very similar to the linear case: \[\Omega = \left(Q_{ZX}^{\prime}W^{-1}Q_{ZX}\right)^{-1}Q_{ZX}^{\prime}W^{-1}V_{o}W^{-1}Q_{ZX}\left(Q_{ZX}^{\prime}W^{-1}Q_{ZX}\right)^{-1}.\]

Recall that by choosing \(\widehat{W}\) so that \(\widehat{W}\overset{p}{\rightarrow}W\propto V_{o}\), the form of the asymptotic variance in the linear case becomes \[\Omega_{o}=\left(Q_{ZX}^{\prime}V_{o}^{-1}Q_{ZX}\right)^{-1}.\]

In the nonlinear case, we can also choose \(\widehat{W}\) so that \(\widehat{W}\overset{p}{\rightarrow}W\propto V_{o}\), so that the form of the asymptotic variance for the nonlinear case becomes \[\Omega_{o}=\left(D_{o}^{\prime}V_o^{-1}D_{o}\right)^{-1}.\]
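Returning to the exponential example from earlier in the deck: a two-step GMM sketch combining both moment conditions \(\mathbb{E}[X_t-1/\lambda]=0\) and \(\mathbb{E}[X_t^2-2/\lambda^2]=0\). This is our own minimal illustration on simulated data, with a grid search standing in for a derivative-based solver.

```python
import numpy as np

# Simulated exponential data with rate lambda = 2 (mean 1/2)
rng = np.random.default_rng(3)
lam_true = 2.0
X = rng.exponential(scale=1 / lam_true, size=5000)

m1, m2 = X.mean(), (X**2).mean()

def mbar(lam):
    # Stacked sample moments: mean of [X - 1/lam, X^2 - 2/lam^2]
    return np.array([m1 - 1 / lam, m2 - 2 / lam**2])

def Q(lam, Winv):
    m = mbar(lam)
    return m @ Winv @ m

grid = np.linspace(0.5, 5.0, 20000)

# Step 1: identity weighting matrix
lam1 = grid[np.argmin([Q(l, np.eye(2)) for l in grid])]

# Step 2: efficient weighting, Vhat = (1/n) sum_t m_t m_t' evaluated at lam1
M = np.column_stack([X - 1 / lam1, X**2 - 2 / lam1**2])
Vhat = M.T @ M / len(X)
lam2 = grid[np.argmin([Q(l, np.linalg.inv(Vhat)) for l in grid])]
```

Both `lam1` and `lam2` are consistent for \(\lambda\); the second step changes the weighting, hence the asymptotic variance, not consistency.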

In the proof of consistency, the FOCs were never used. Therefore, no assumption about differentiability was ever needed.

In the proof of asymptotic normality:

- No derivatives of the objective function apart from the first played a role.
- Any initial consistent GMM estimator would have worked.
- What guarantees that \(D_{o}^{\prime}W^{-1}D_{o}\) is nonsingular?

Consistent covariance matrix estimation is just like before but the justification uses uniform LLNs.

First-order asymptotic efficiency may be achieved by choosing \(\widehat{W}\) so that \(\widehat{W}\overset{p}{\rightarrow}W\propto V_{o}\). The question is what \(V_o\) looks like.

The \(J\)-statistic is based on the value of the GMM criterion function evaluated at the optimal or efficient GMM estimator. This means that the starting point is the two-step GMM estimator.

So, we have \[\begin{eqnarray}n\times\widehat{m}\left(\widehat{\beta}\right)^{\prime}\widetilde{V}^{-1}\widehat{m}\left(\widehat{\beta}\right) &=& \sqrt{n}\widehat{m}\left(\widehat{\beta}\right)^{\prime}\widetilde{V}^{-1/2}\widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\widehat{\beta}\right)\\ &=& \left[\widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\widehat{\beta}\right)\right]^{\prime}\left[\widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\widehat{\beta}\right)\right] \\ & \overset{\mathsf{Step\:1}}{=}& \left[\widehat{\Pi}\widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\beta^{o}\right)\right]^{\prime}\left[\widehat{\Pi}\widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\beta^{o}\right)\right]\\ & \overset{\mathsf{Step\:2}}{=}& \left[\widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\beta^{o}\right)\right]^{\prime}\underbrace{\widehat{\Pi}}_{\overset{p}{\to}\Pi}\underbrace{\left[\widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\beta^{o}\right)\right]}_{\overset{d}{\to}N\left(0,I\right)} \overset{d}{\to} \chi_{L-K}^{2}\end{eqnarray}\]

\[\begin{eqnarray} && \widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\widehat{\beta}\right)\\ &=& \widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\beta^{o}\right)+\widetilde{V}^{-1/2}\frac{\partial\widehat{m}\left(\bar{\beta}\right)}{\partial\beta^{\prime}}\sqrt{n}\left(\widehat{\beta}-\beta^{o}\right)\\ &=& \widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\beta^{o}\right)-\widetilde{V}^{-1/2}\frac{\partial\widehat{m}\left(\bar{\beta}\right)}{\partial\beta}\left[\frac{\partial\widehat{m}\left(\widehat{\beta}\right)}{\partial\beta^{\prime}}\widetilde{V}^{-1}\frac{\partial\widehat{m}\left(\bar{\beta}\right)}{\partial\beta}\right]^{-1}\frac{\partial\widehat{m}\left(\widehat{\beta}\right)}{\partial\beta^{\prime}}\widetilde{V}^{-1}\sqrt{n}\widehat{m}\left(\beta^{o}\right)\\ &=& \underbrace{\left[I_{L}-\widetilde{V}^{-1/2}\frac{\partial\widehat{m}\left(\bar{\beta}\right)}{\partial\beta}\left[\frac{\partial\widehat{m}\left(\widehat{\beta}\right)}{\partial\beta^{\prime}}\widetilde{V}^{-1}\frac{\partial\widehat{m}\left(\bar{\beta}\right)}{\partial\beta}\right]^{-1}\frac{\partial\widehat{m}\left(\widehat{\beta}\right)}{\partial\beta^{\prime}}\widetilde{V}^{-1/2}\right]}_{\widehat{\Pi}}\times\widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\beta^{o}\right) \end{eqnarray}\]

Show that \(\widehat{\Pi}\overset{p}{\to}\Pi\) where the limit is symmetric and idempotent.

Apply the results related to the multivariate normal and show the chi-squared result.
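A small Monte Carlo sketch of the \(\chi^2_{L-K}\) limit for the linear IV case (our own simulated design, with valid instruments so the null holds): with \(L=3\), \(K=1\), the Sargan statistics should average about \(2\), the mean of a \(\chi^2_2\).

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps, L = 500, 500, 3          # K = 1, so the limiting df is L - K = 2
stats = []
for _ in range(reps):
    Z = rng.normal(size=(n, L))
    v = rng.normal(size=n)
    eps = 0.5 * v + rng.normal(size=n)       # Z independent of eps: valid instruments
    X = (Z @ np.ones(L) + v).reshape(-1, 1)
    Y = X[:, 0] + eps

    ZtZ = Z.T @ Z
    ZtX, ZtY = Z.T @ X, Z.T @ Y
    # 2SLS without forming the n x n projection matrix
    beta = np.linalg.solve(ZtX.T @ np.linalg.solve(ZtZ, ZtX),
                           ZtX.T @ np.linalg.solve(ZtZ, ZtY))
    e = Y - X @ beta
    Zte = Z.T @ e
    stats.append((Zte @ np.linalg.solve(ZtZ, Zte)) / (e @ e / n))

stats = np.array(stats)
# Mean of a chi-square with 2 degrees of freedom is 2
```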

Any consistent estimator of \(V_{o}\) would work in the derivation. However, the optimal/efficient GMM estimator must be used for the derivation to go through.

What will happen if you do not use the optimal GMM estimator?

Start with an inefficient GMM estimator \(\widetilde{\beta}\) obtained from using some weighting matrix \(\widehat{W}\).

Here is a sketch of what will happen:

\[\begin{eqnarray} && n\widehat{m}\left(\widetilde{\beta}\right)^{\prime}\widehat{W}^{-1}\widehat{m}\left(\widetilde{\beta}\right) \\ &=& \sqrt{n}\widehat{m}\left(\widetilde{\beta}\right)^{\prime}\widetilde{V}^{-1/2}\widetilde{V}^{1/2}\widehat{W}^{-1}\widetilde{V}^{1/2}\widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\widetilde{\beta}\right)\\ &=& \left[\widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\widetilde{\beta}\right)\right]^{\prime}\widetilde{V}^{1/2}\widehat{W}^{-1}\widetilde{V}^{1/2}\left[\widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\widetilde{\beta}\right)\right]\\ &=& \left[\widetilde{\Pi}\widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\beta^{o}\right)\right]^{\prime}\widetilde{V}^{1/2}\widehat{W}^{-1}\widetilde{V}^{1/2}\left[\widetilde{\Pi}\widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\beta^{o}\right)\right]\\ &=& \left[\widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\beta^{o}\right)\right]^{\prime}\underbrace{\widetilde{\Pi}^{\prime}\widetilde{V}^{1/2}\widehat{W}^{-1}\widetilde{V}^{1/2}\widetilde{\Pi}}_{\overset{p}{\to}?}\underbrace{\left[\widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\beta^{o}\right)\right]}_{\overset{d}{\to}N\left(0,I\right)} \end{eqnarray}\]

- What will be the effect on the overidentifying restrictions test?

- 7.1: conceptual exercises, constructing examples
- 7.2 to 7.5: specific cases of IV, practice with IV conditions and asymptotics
- 7.6 to 7.9, 7.11, 7.12: repeats textbook IV asymptotics
- 7.10: important exercise on 2SLS residuals, do not implement two stages literally!
- 7.13: compare OLS and 2SLS if \(X_t\) satisfies IV conditions in terms of efficiency
- 7.14, 7.15: control function approach covered in slides
- 7.16: 2SLS algebra
- 7.17: apply your knowledge of asymptotic theory to derive the Hausman specification test

- 8.1: repeats most of the theory in the slides but for a specific case
- 8.2: linear IV case, mostly discussed in the slides
- 8.3: relationship between 2SLS and GMM, under what conditions will 2SLS be efficient GMM?
- 8.4: consistent estimation of \(V_o\), what LLN should you use here?
- 8.5, 8.9, 8.10, 8.11: aspects explored in our slides
- 8.6: optional, but connected to our discussion of restricted least squares
- 8.12, 8.13: aspects explored in our slides