Andrew Pua
May 2022
Can we solve the two equations simultaneously? NO! Why?
Consider the moment function \[m_{t}\left(\lambda\right)=m\left(X_{t},\lambda\right)=\left(\begin{array}{c} m_{1t}\left(\lambda\right)\\ m_{2t}\left(\lambda\right) \end{array}\right).\]
Which should we choose? Can’t we use all the moments? YES! But how do we combine them? Two options:
Recall that you had \(K=k+1\) equations in \(K=k+1\) unknowns when we wanted to determine \(\beta^{*}\). Let me focus on the special case \(K=2\).
The two equations arise from: \(\mathbb{E}\left(u_{t}\right)=0\) and \(\mathbb{E}\left(X_{1t}u_{t}\right)=0\), where \(u_{t}=Y_{t}-\beta_{0}^{*}-\beta_{1}^{*}X_{1t}\).
A function of the r.v.’s and parameters \[m_{t}\left(\beta_{0},\beta_{1}\right)=m\left(X_{1t},Y_{t},\beta_{0},\beta_{1}\right)=\left(\begin{array}{c} Y_{t}-\beta_{0}-\beta_{1}X_{1t}\\ X_{1t}\left(Y_{t}-\beta_{0}-\beta_{1}X_{1t}\right) \end{array}\right)\]is called a moment function.
Note that when \(\mathsf{Var}\left(X_1\right)>0\), \(\beta_{0}^{*}\) and \(\beta_{1}^{*}\) uniquely solve \(\mathbb{E}\left(m_{t}\left(\beta_{0},\beta_{1}\right)\right)=0\).
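For concreteness, solving the two equations \(\mathbb{E}\left(Y_{t}-\beta_{0}-\beta_{1}X_{1t}\right)=0\) and \(\mathbb{E}\left(X_{1t}\left(Y_{t}-\beta_{0}-\beta_{1}X_{1t}\right)\right)=0\) gives the familiar best linear predictor coefficients: \[\beta_{1}^{*}=\frac{\mathsf{Cov}\left(X_{1t},Y_{t}\right)}{\mathsf{Var}\left(X_{1t}\right)},\qquad\beta_{0}^{*}=\mathbb{E}\left(Y_{t}\right)-\beta_{1}^{*}\mathbb{E}\left(X_{1t}\right).\]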
If it happens that the CEF is truly linear, there are actually more moment conditions that can be used to determine \(\beta_{0}^{*}=\beta_{0}^{o}\) and \(\beta_{1}^{*}=\beta_{1}^{o}\).
Because the CEF is assumed to be linear, we must have \(\mathbb{E}\left(u_{t}|X_{1t}\right)=0\). This implies, for example, \(\mathbb{E}\left(X_{1t}^{2}u_{t}\right)=0\).
As a result, we can have a new set of moment functions \[m_{t}\left(\beta_{0},\beta_{1}\right)=\left(\begin{array}{c} Y_{t}-\beta_{0}-\beta_{1}X_{1t}\\ X_{1t}\left(Y_{t}-\beta_{0}-\beta_{1}X_{1t}\right)\\ X_{1t}^{2}\left(Y_{t}-\beta_{0}-\beta_{1}X_{1t}\right) \end{array}\right).\]
Note that we have moment conditions \(\mathbb{E}\left(m_{t}\left(\beta_{0},\beta_{1}\right)\right)=0\).
All of our examples featured orthogonality/moment conditions of the form \(\mathbb{E}\left(m_{t}\left(\beta\right)\right)=0\).
\(m_{t}\left(\beta\right)\) is called a moment function.
\(\beta^{o}\) is usually assumed to be the unique value of \(\beta\) which satisfies the orthogonality conditions. This is called point identification.
The expectation operator \(\mathbb{E}\) is with respect to the distribution of the data, which is indexed by the parameter \(\beta^{o}\). But we do not specify the distribution of the data!
The sample counterparts of the orthogonality conditions are: \[\widehat{m}\left(\beta\right)=\frac{1}{n}\sum_{t=1}^{n}m_{t}\left(\beta\right)\]
A generalized method of moments (GMM) estimator is defined as the solution to the following minimization problem: \[\widehat{\beta}_{GMM}=\arg\min_{\beta\in\Theta}\widehat{m}\left(\beta\right)^{\prime}\widehat{W}^{-1}\widehat{m}\left(\beta\right)\]
Accounting for the sizes:
\(\widehat{m}\left(\beta\right)\) is the \(L\times1\) vector of moment functions, which may be linear (Chapter 7) or nonlinear (Chapter 8) in \(\beta\)
\(\widehat{W}\) is an \(L\times L\) symmetric nonsingular weighting matrix
\(\beta\) is the \(K\times1\) parameter vector
\(\Theta\subseteq\mathbb{R}^{K}\) is the parameter space
\(\widehat{m}\left(\beta\right)^{\prime}\widehat{W}^{-1}\widehat{m}\left(\beta\right)\) is the scalar GMM criterion function
Clearly, \(L\) has to be greater than or equal to \(K\).
Suppose \(Y_{t}=X_{t}^{\prime}\beta^{o}+\varepsilon_{t}\), where we only assume \(\mathbb{E}\left(X_{t}\varepsilon_{t}\right)=0\).
This implies that we do not necessarily have correct specification, meaning \(\mathbb{E}\left(Y_{t}|X_{t}\right)\neq X_{t}^{\prime}\beta^{o}\). But, \(X_{t}^{\prime}\beta^{o}\) is the best linear predictor.
To estimate \(\beta^{o}\) by least squares, we simply minimize the sum of squared residuals, i.e. \[\widehat{\beta}_{OLS}=\arg\min_{\beta\in\Theta}\dfrac{1}{n}\left(Y-\mathbf{X}\beta\right)^{\prime}\left(Y-\mathbf{X}\beta\right) = \arg\min_{\beta\in\Theta}\dfrac{1}{n}\sum_{t=1}^{n}\left(Y_{t}-X_{t}^{\prime}\beta\right)^{2}\]
Recall the FOCs: \(\widehat{\beta}_{OLS}\) solves\[\dfrac{1}{n}\sum_{t=1}^{n}\left(X_{t}Y_{t}-X_{t}X_{t}^{\prime}\widehat{\beta}_{OLS}\right)=0\]
Here we have \(m_{t}\left(\beta\right)=X_{t}\left(Y_{t}-X_{t}^{\prime}\beta\right)\), and the orthogonality conditions \(\mathbb{E}\left(m_{t}\left(\beta\right)\right)=0\) are satisfied at \(\beta=\beta^{o}\).
As you recall, these are \(K\) equations, linear in \(\beta\), in \(K\) unknowns. Here is a case where \(L=K\).
Within a GMM framework, set \(\widehat{m}\left(\beta\right)\) as before and \(\widehat{W}=\mathbf{X}^{\prime}\mathbf{X}\). You can show that \[\widehat{\beta}_{OLS}=\widehat{\beta}_{GMM}=\arg\min_{\beta\in\Theta}\:\widehat{m}\left(\beta\right)^{\prime}\widehat{W}^{-1}\widehat{m}\left(\beta\right).\]
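You can also check this claim numerically. A minimal Python sketch (the simulated design below is made up for illustration and is not from the slides):

```python
# A quick numerical check: minimizing the GMM criterion with
# m-hat(beta) = (1/n) X'(Y - X beta) and W-hat = X'X reproduces OLS.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, K = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)    # OLS via the normal equations

W = X.T @ X                                     # weighting matrix from the slides
mhat = lambda b: X.T @ (Y - X @ b) / n          # sample moment vector
crit = lambda b: n * mhat(b) @ np.linalg.solve(W, mhat(b))   # (scaled) GMM criterion
beta_gmm = minimize(crit, x0=np.zeros(K)).x

print(np.allclose(beta_ols, beta_gmm, atol=1e-4))  # True, up to optimizer tolerance
```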
More importantly, if you substitute \(\widehat{\beta}_{OLS}\) into the GMM criterion function, we obtain \[\left[\mathbf{X}^{\prime}\left(Y-\mathbf{X}\widehat{\beta}_{OLS}\right)\right]^{\prime}\left(\mathbf{X}^{\prime}\mathbf{X}\right)^{-1}\mathbf{X}^{\prime}\left(Y-\mathbf{X}\widehat{\beta}_{OLS}\right)=\left(Y-\mathbf{X}\widehat{\beta}_{OLS}\right)^{\prime}P_{\mathbf{X}}\left(Y-\mathbf{X}\widehat{\beta}_{OLS}\right)=e^{\prime}P_{\mathbf{X}}e=0.\]
In fact, even if you choose a different weighting matrix \(\widehat{W}\), substituting \(\widehat{\beta}_{GMM}\) into the GMM criterion function will always produce a zero value! Why?
As a result, GMM is not even needed here! So why bother spending three slides on this special case?
The key reason is to provide a contrast to models called linear structural equations.
Suppose \(Y_{t}=X_{t}^{\prime}\beta^{o}+\varepsilon_{t}\) but now \(\mathbb{E}\left(X_{t}\varepsilon_{t}\right)\neq0\).
This implies that we lose correct specification, meaning \(\mathbb{E}\left(Y_{t}|X_{t}\right)\neq X_{t}^{\prime}\beta^{o}\). This also implies that \(X_{t}^{\prime}\beta^{o}\) is not the BLP.
All this means is that \(Y_{t}=X_{t}^{\prime}\beta^{o}+\varepsilon_{t}\) is NOT a linear regression model.
From now on, we are going to call \(Y_{t}=X_{t}^{\prime}\beta^{o}+\varepsilon_{t}\) a structural equation or a response schedule.
How do we estimate \(\beta^{o}\) in this situation? There are many ways. But we will study closely an approach called instrumental variables (which is a special case of GMM).
Assume that there exists a random \(L\times1\) vector \(Z_{t}\) such that \(\mathbb{E}\left(Z_{t}\varepsilon_{t}\right)=0\).
Given the linear structural equation earlier, we can specify \[m_{t}\left(\beta\right)=Z_{t}\left(Y_{t}-X_{t}^{\prime}\beta\right)\]and, by assumption, the moment conditions \(\mathbb{E}\left(m_{t}\left(\beta\right)\right)=0\) are satisfied at \(\beta=\beta^{o}\). So we have moment conditions which can determine \(\beta^{o}\).
The sample counterpart of this orthogonality condition is \[\widehat{m}\left(\beta\right)=\frac{1}{n}\sum_{t=1}^{n}Z_{t}\left(Y_{t}-X_{t}^{\prime}\beta\right).\]
So we have \(L\) equations, linear in \(\beta\), in \(K\) unknowns.
Setup: \[\underset{\left(n\times1\right)}{Y}=\left(\begin{array}{c} Y_{1}\\ \vdots\\ Y_{n} \end{array}\right),\underset{\left(n\times K\right)}{\mathbf{X}}=\left(\begin{array}{c} X_{1}^{\prime}\\ \vdots\\ X_{n}^{\prime} \end{array}\right),\underset{\left(n\times L\right)}{\mathbf{Z}}=\left(\begin{array}{c} Z_{1}^{\prime}\\ \vdots\\ Z_{n}^{\prime} \end{array}\right),\underset{\left(K\times1\right)}{\beta}=\left(\begin{array}{c} \beta_{0}\\ \vdots\\ \beta_{k} \end{array}\right)\]
The GMM estimator is then defined as \[\widehat{\beta}_{GMM}=\arg\min_{\beta\in\Theta}\widehat{m}\left(\beta\right)^{\prime}\widehat{W}^{-1}\widehat{m}\left(\beta\right)=\arg\min_{\beta\in\Theta}\:\dfrac{1}{n^{2}}\left(Y-\mathbf{X}\beta\right)^{\prime}\mathbf{Z}\widehat{W}^{-1}\mathbf{Z}^{\prime}\left(Y-\mathbf{X}\beta\right).\]
The FOCs are given by: \[\begin{eqnarray}\frac{\partial}{\partial\beta}\left.\left(Y-\mathbf{X}\beta\right)^{\prime}\mathbf{Z}\widehat{W}^{-1}\mathbf{Z}^{\prime}\left(Y-\mathbf{X}\beta\right)\right|_{\beta=\widehat{\beta}} &=& 0 \\ -2\mathbf{X}^{\prime}\mathbf{Z}\widehat{W}^{-1}\mathbf{Z}^{\prime}\left(Y-\mathbf{X}\widehat{\beta}\right) &=& 0.\end{eqnarray}\]
The value of \(\widehat{\beta}\) that solves the FOCs is given by: \[\widehat{\beta}_{GMM}\left(\widehat{W}\right)=\left(\mathbf{X}^{\prime}\mathbf{Z}\widehat{W}^{-1}\mathbf{Z}^{\prime}\mathbf{X}\right)^{-1}\mathbf{X}^{\prime}\mathbf{Z}\widehat{W}^{-1}\mathbf{Z}^{\prime}Y.\]
If \(L=K\), then \(\mathbf{Z}^{\prime}\mathbf{X}\) and \(\mathbf{X}^{\prime}\mathbf{Z}\) are both square matrices. If they are also nonsingular, the formula collapses to \(\widehat{\beta}_{IV}=\left(\mathbf{Z}^{\prime}\mathbf{X}\right)^{-1}\mathbf{Z}^{\prime}Y\), no matter which \(\widehat{W}\) you choose.
In fact, OLS can be justified as a just-identified IV estimator under certain orthogonality conditions! Why?
Consider \[\begin{eqnarray}C_t &=& \beta_0^o+\beta_1^o I_t+\varepsilon_t \\ I_t &=&C_t+D_t,\end{eqnarray}\] where \(C_t\) is consumption, \(I_t\) is income, and \(D_t\) is the non-consumption component of income.
Assume, for convenience, that \[\left(\begin{array}{c} D\\ \varepsilon \end{array}\right)\sim N\left(\left(\begin{array}{c} \mu_{D}\\ 0 \end{array}\right),\left(\begin{array}{cc} \sigma_{D}^{2} & 0\\ 0 & \sigma_{\varepsilon}^{2} \end{array}\right)\right).\] Assume we have IID draws from this distribution.
You can easily check that \(\mathbb{E}\left(I_t\varepsilon_t\right)= \sigma^2_{\varepsilon}/\left(1-\beta_1^o\right)\).
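To see this, solve the two equations for \(I_{t}\): \[I_{t}=\frac{\beta_{0}^{o}+D_{t}+\varepsilon_{t}}{1-\beta_{1}^{o}}\quad\Longrightarrow\quad\mathbb{E}\left(I_{t}\varepsilon_{t}\right)=\frac{\mathbb{E}\left(D_{t}\varepsilon_{t}\right)+\mathbb{E}\left(\varepsilon_{t}^{2}\right)}{1-\beta_{1}^{o}}=\frac{\sigma_{\varepsilon}^{2}}{1-\beta_{1}^{o}},\] using \(\mathbb{E}\left(\varepsilon_{t}\right)=0\), \(\mathbb{E}\left(D_{t}\varepsilon_{t}\right)=0\) from the assumed joint distribution, and \(\beta_{1}^{o}\neq1\).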
Therefore, \(\mathbb{E}\left(I_t\varepsilon_t\right)\neq 0\) whenever \(\sigma^2_{\varepsilon}>0\).
So, if you apply least squares to a regression of \(C_t\) on \(I_t\), you will not recover \(\beta_1^o\) no matter how large the sample size is.
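To see this numerically, here is a simulation sketch in Python. The slides do not introduce an instrument at this point; the code uses \(Z_{t}=\left(1,D_{t}\right)^{\prime}\) purely for illustration, which works here because \(\mathbb{E}\left(D_{t}\varepsilon_{t}\right)=0\) under the assumed distribution and \(D_{t}\) is correlated with \(I_{t}\).

```python
# Simulation sketch of the consumption-income example: OLS of C_t on I_t does
# not recover beta_1 even in large samples, while IV using (1, D_t) does.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
beta0, beta1 = 1.0, 0.6
mu_D, sigma_D, sigma_eps = 2.0, 1.0, 1.0

D = rng.normal(mu_D, sigma_D, n)
eps = rng.normal(0.0, sigma_eps, n)
I = (beta0 + D + eps) / (1 - beta1)      # solve the two equations for income
C = beta0 + beta1 * I + eps

X = np.column_stack([np.ones(n), I])     # regressors in the structural equation
Z = np.column_stack([np.ones(n), D])     # instruments: E(Z_t eps_t) = 0

beta_ols = np.linalg.solve(X.T @ X, X.T @ C)
beta_iv = np.linalg.solve(Z.T @ X, Z.T @ C)   # just-identified IV: (Z'X)^{-1} Z'C

print("OLS slope:", beta_ols[1])   # stays away from 0.6, no matter how large n is
print("IV  slope:", beta_iv[1])    # close to 0.6
```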
The case where \(L=K\) is sometimes called the just-identified or exactly identified case. The case where \(L<K\) is called the underidentified case.
If \(L>K\), then we have more moment conditions than the dimension of the parameters to be estimated. Intuitively, we should be able to exploit all \(L\) moment conditions. In some sense, there should be gains in efficiency when exploiting all of them.
In the linear structural equation case, we reached a point where \[\underbrace{\mathbb{E}\left(Z_tX_t^\prime\right)}_{\left(L\times K\right)}\underbrace{\beta^o}_{\left(K\times 1\right)}=\underbrace{\mathbb{E}\left(Z_tY_t\right)}_{\left(L\times 1\right)}.\]
So, when \(L>K\), we can no longer invert \(\mathbb{E}\left(Z_tX_t^\prime\right)\) because it is no longer a square matrix.
Two strategies are available:
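One route, sketched here for concreteness, is to premultiply both sides by the \(K\times L\) matrix \(\mathbb{E}\left(X_{t}Z_{t}^{\prime}\right)\left(\mathbb{E}\left(Z_{t}Z_{t}^{\prime}\right)\right)^{-1}\), which turns the \(L\) equations into \(K\) equations in \(K\) unknowns: \[\mathbb{E}\left(X_{t}Z_{t}^{\prime}\right)\left(\mathbb{E}\left(Z_{t}Z_{t}^{\prime}\right)\right)^{-1}\mathbb{E}\left(Z_{t}X_{t}^{\prime}\right)\beta^{o}=\mathbb{E}\left(X_{t}Z_{t}^{\prime}\right)\left(\mathbb{E}\left(Z_{t}Z_{t}^{\prime}\right)\right)^{-1}\mathbb{E}\left(Z_{t}Y_{t}\right).\]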
Putting them all together, we now have \[ \beta^{o} = \bigg(\underbrace{\mathbb{E}\left(X_{t}Z_{t}^{\prime}\right)}_{Q_{ZX}^{\prime}}(\underbrace{\mathbb{E}\left(Z_{t}Z_{t}^{\prime}\right)}_{Q_{ZZ}})^{-1}\underbrace{\mathbb{E}\left(Z_{t}X_{t}^{\prime}\right)}_{Q_{ZX}}\bigg)^{-1}\mathbb{E}\left(X_{t}Z_{t}^{\prime}\right)\left(\mathbb{E}\left(Z_{t}Z_{t}^{\prime}\right)\right)^{-1}\mathbb{E}\left(Z_{t}Y_{t}\right). \]
This is called the two-stage least squares (2SLS) estimand.
Provided that \(Q_{ZX}\) has rank equal to \(K\) and \(Q_{ZZ}\) is nonsingular, we have uniquely identified \(\beta^o\) and expressed it in terms of observable quantities.
We now have written an identification argument for \(\beta^o\).
You will quickly realize that choosing \(\widehat{W}\) is crucial. How do you choose \(\widehat{W}\)?
So we need some theory here. Start from \(\widehat{\beta}_{GMM}\left(\widehat{W}\right)\): \[\begin{eqnarray}\widehat{\beta}_{GMM} &=& \left[\left(\frac{\mathbf{X}^{\prime}\mathbf{Z}}{n}\right)\widehat{W}^{-1}\left(\frac{\mathbf{Z}^{\prime}\mathbf{X}}{n}\right)\right]^{-1}\left(\frac{\mathbf{X}^{\prime}\mathbf{Z}}{n}\right)\widehat{W}^{-1}\left(\frac{\mathbf{Z}^{\prime}Y}{n}\right)\\ &=& \left[\left(\dfrac{1}{n}\sum_{t=1}^{n}X_{t}Z_{t}^{\prime}\right)\widehat{W}^{-1}\left(\dfrac{1}{n}\sum_{t=1}^{n}Z_{t}X_{t}^{\prime}\right)\right]^{-1}\left(\dfrac{1}{n}\sum_{t=1}^{n}X_{t}Z_{t}^{\prime}\right)\widehat{W}^{-1}\left(\dfrac{1}{n}\sum_{t=1}^{n}Z_{t}Y_{t}\right)\end{eqnarray}\]
Substitute the structural equation \(Y_{t}=X_{t}^{\prime}\beta^{o}+\varepsilon_{t}\) into the previous expression and then determine what conditions are needed to obtain consistency.
So, we have\[\widehat{\beta}_{GMM}-\beta^{o}=\left[\left(\dfrac{1}{n}\sum_{t=1}^{n}X_{t}Z_{t}^{\prime}\right)\widehat{W}^{-1}\left(\dfrac{1}{n}\sum_{t=1}^{n}Z_{t}X_{t}^{\prime}\right)\right]^{-1}\left(\dfrac{1}{n}\sum_{t=1}^{n}X_{t}Z_{t}^{\prime}\right)\widehat{W}^{-1}\left(\dfrac{1}{n}\sum_{t=1}^{n}Z_{t}\varepsilon_{t}\right).\]
Provided that the LLNs below apply:\[\dfrac{1}{n}\sum_{t=1}^{n}Z_{t}X_{t}^{\prime}\overset{p}{\to}\mathbb{E}\left(Z_{t}X_{t}^{\prime}\right)=Q_{ZX},\quad\dfrac{1}{n}\sum_{t=1}^{n}Z_{t}\varepsilon_{t}\overset{p}{\to}\mathbb{E}\left(Z_{t}\varepsilon_{t}\right)=0.\]
To obtain consistency in the sense that \(\widehat{\beta}_{GMM}\left(\widehat{W}\right)\overset{p}{\rightarrow}\beta^{o}\), we also need \(\widehat{W}\overset{p}{\rightarrow}W\), with \(W\) symmetric and nonsingular, and \(Q_{ZX}\) to have full column rank \(K\), so that \(Q_{ZX}^{\prime}W^{-1}Q_{ZX}\) is invertible.
Revisit the consumption-income example. What do you notice?
Revisit the simultaneous equations example. What do you notice?
The consistency result gives little guidance on how to narrow down the choice of \(\widehat{W}\).
Let us take a look at the asymptotic distribution: \[\begin{eqnarray} &&\sqrt{n}\left(\widehat{\beta}_{GMM}-\beta^{o}\right) \\ &=& \left[\left(\dfrac{1}{n}\sum_{t=1}^{n}X_{t}Z_{t}^{\prime}\right)\widehat{W}^{-1}\left(\dfrac{1}{n}\sum_{t=1}^{n}Z_{t}X_{t}^{\prime}\right)\right]^{-1}\left(\dfrac{1}{n}\sum_{t=1}^{n}X_{t}Z_{t}^{\prime}\right)\widehat{W}^{-1}\left(\dfrac{1}{\sqrt{n}}\sum_{t=1}^{n}Z_{t}\varepsilon_{t}\right)\end{eqnarray}\]
We need the same conditions to prove consistency. But we need a suitable CLT which applies to \(\{Z_{t}\varepsilon_{t}\}\), i.e., \[\sqrt{n}\widehat{m}\left(\beta^{o}\right)=\sqrt{n}\left(\frac{1}{n}\mathbf{Z}^{\prime}\varepsilon\right)=\sqrt{n}\left(\frac{1}{n}\sum_{t=1}^{n}Z_{t}\varepsilon_{t}\right)\overset{d}{\rightarrow}N\left(0,V_{o}\right).\]
Thus, we have \[\sqrt{n}\left(\widehat{\beta}_{GMM}\left(\widehat{W}\right)-\beta^{o}\right)\overset{d}{\to}N\left(0,\Omega\right),\] where \[\begin{eqnarray}\Omega &=&\mathsf{Avar}\left(\sqrt{n}\left(\widehat{\beta}_{GMM}\left(\widehat{W}\right)-\beta^{o}\right)\right)\\ &=&\left(Q_{ZX}^{\prime}W^{-1}Q_{ZX}\right)^{-1}Q_{ZX}^{\prime}W^{-1}V_{o}W^{-1}Q_{ZX}\left(Q_{ZX}^{\prime}W^{-1}Q_{ZX}\right)^{-1}.\end{eqnarray}\]
By choosing \(\widehat{W}\) so that \(\widehat{W}\overset{p}{\rightarrow}W\propto V_{o}\), the form of the asymptotic variance is now \[\Omega_{o}=\left(Q_{ZX}^{\prime}V_{o}^{-1}Q_{ZX}\right)^{-1}.\]
Note that this is the optimal choice of the weighting matrix in the sense that \(\Omega-\Omega_{o}\) is positive semi-definite for any admissible choice of \(W\).
The optimal choice of the weighting matrix is proportional to the asymptotic variance of \(n^{-1/2}\mathbf{Z}^{\prime}\varepsilon\), which is really \(n^{-1/2}\mathbf{Z}^{\prime}\left(Y-\mathbf{X}\beta^{o}\right)\). This has a very nice interpretation and opens the possibility of a similar approach to consistent covariance matrix estimation.
The optimal choice of the weighting matrix gives us the optimal/efficient GMM estimator.
From the main textbook: “The use of the optimal weighting matrix downweights the sample moments which have large sampling variations.”
Suppose that \[\begin{eqnarray} X_{1t} &=& \mu+\varepsilon_{1t}\\ X_{2t} &=& \mu+\varepsilon_{2t}\end{eqnarray}\]
Assume that \(\left\{ \left(\varepsilon_{1t},\varepsilon_{2t}\right)\right\}\) is IID. Further assume that \(\mathbb{E}\left(\varepsilon_{1t}\right)=\mathbb{E}\left(\varepsilon_{2t}\right)=0\), \(\mathsf{Var}\left(\varepsilon_{1t}\right)=\sigma_{1}^{2}<\infty\), \(\mathsf{Var}\left(\varepsilon_{2t}\right)=\sigma_{2}^{2}<\infty\), and \(\varepsilon_{1t}\), \(\varepsilon_{2t}\) are independent of each other.
Propose a consistent estimator for \(\mu\) using only an IID random sample from \(\left\{ X_{1t}\right\}\). Derive the asymptotic distribution of this estimator.
You are going to construct a GMM estimator that exploits an IID random sample from \(\left\{ \left(X_{1t},X_{2t}\right)\right\}\). Derive the optimal weighting matrix that exploits the following moment conditions:\[\begin{eqnarray}\mathbb{E}\left(X_{1t}-\mu\right) &=& 0 \\ \mathbb{E}\left(X_{2t}-\mu\right) &=& 0\end{eqnarray}\] Look at the structure of your optimal weighting matrix.
Derive the efficient GMM estimator and its asymptotic distribution. Compare the asymptotic variances of the efficient GMM estimator against the estimator which uses only an IID random sample from \(\left\{ X_{1t}\right\}\).
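If you want to check your derivations numerically, here is a simulation sketch in Python (the parameter values are arbitrary). It computes a two-step GMM estimator by numerical minimization, without using the closed form you are asked to derive, and compares its sampling variance with that of the estimator based on \(\left\{ X_{1t}\right\}\) alone.

```python
# Simulation sketch for this exercise: compare the estimator based on X_1 alone
# with a two-step GMM estimator that exploits both moment conditions.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
mu, s1, s2 = 1.0, 2.0, 1.0      # X_1 is the noisier measurement here
n, reps = 200, 2000

est_x1_only, est_gmm = [], []
for _ in range(reps):
    X1 = mu + rng.normal(0.0, s1, n)
    X2 = mu + rng.normal(0.0, s2, n)

    def mhat(m):
        # sample moment vector (X1-bar - m, X2-bar - m)
        return np.array([X1.mean() - m, X2.mean() - m])

    # Step 1: identity weighting matrix
    mu1 = minimize_scalar(lambda m: mhat(m) @ mhat(m)).x
    # Step 2: estimate V_o from the moment contributions at the step-1 estimate
    u = np.column_stack([X1 - mu1, X2 - mu1])
    V = u.T @ u / n
    mu2 = minimize_scalar(lambda m: mhat(m) @ np.linalg.solve(V, mhat(m))).x

    est_x1_only.append(X1.mean())
    est_gmm.append(mu2)

print("Monte Carlo variance, X1-bar       :", np.var(est_x1_only))
print("Monte Carlo variance, two-step GMM :", np.var(est_gmm))   # should be smaller
```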
If you strengthen \(\mathbb{E}\left(Z_{t}\varepsilon_{t}\right)=0\) to \(\mathbb{E}\left(\varepsilon_{t}|Z_{t}\right)=0\) and assume conditional homoscedasticity \(\mathsf{Var}\left(\varepsilon_{t}|Z_{t}\right)=\mathbb{E}\left(\varepsilon_{t}^{2}|Z_{t}\right)=\sigma^{2}\), then \(V_{o}=\sigma^{2}Q_{ZZ}\).
We obtain the following special case, which collects classic results for the two-stage least squares (2SLS) estimator:\[\widehat{\beta}_{GMM}\left(\widehat{V}_{o}\right) = \left(\mathbf{X}^{\prime}\mathbf{Z}\left(\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1}\mathbf{Z}^{\prime}\mathbf{X}\right)^{-1}\mathbf{X}^{\prime}\mathbf{Z}\left(\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1}\mathbf{Z}^{\prime}Y=\widehat{\beta}_{2SLS}\] with asymptotic distribution equal to \[\sqrt{n}\left(\widehat{\beta}_{GMM}\left(\widehat{V}_{o}\right)-\beta^{o}\right) \overset{d}{\to} N\left(0,\left(Q_{ZX}^{\prime}Q_{ZZ}^{-1}Q_{ZX}\right)^{-1}\right).\]
So, under conditional homoscedasticity, 2SLS is efficient GMM!
Another special case is the homoscedastic MDS case. Try showing that 2SLS is still efficient GMM there.
Outside of these cases, 2SLS ceases to be efficient GMM.
It turns out that when \(L>K\), whether or not we are in the linear structural equations case, we can use the idea behind efficient GMM estimation. This leads to the following algorithm:
Choose a \(\widehat{W}\) (Typically, \(\widehat{W}=I\)) and calculate \[\widehat{\beta}_{GMM}\left(\widehat{W}\right)=\arg\min_{\beta\in\Theta}\widehat{m}\left(\beta\right)^{\prime}\widehat{W}^{-1}\widehat{m}\left(\beta\right).\]
Find a consistent estimator of \[V_{o}=\mathsf{avar}\left(\sqrt{n}\widehat{m}\left(\beta^{o}\right)\right).\]Call this estimator \(\widetilde{V}\). This estimator is computed based on the form of \(V_{o}\) and \(\widehat{\beta}_{GMM}\left(\widehat{W}\right)\).
Use \(\widetilde{V}\) as a consistent estimator of the optimal weight matrix and recalculate the GMM estimator: \[\widehat{\beta}_{GMM}\left(\widetilde{V}\right)=\arg\min_{\beta\in\Theta}\widehat{m}\left(\beta\right)^{\prime}\widetilde{V}^{-1}\widehat{m}\left(\beta\right).\]
\(\widehat{\beta}_{GMM}\left(\widehat{W}\right)\) for any appropriate choice of \(\widehat{W}\) is called one-step GMM. \(\widehat{\beta}_{GMM}\left(\widetilde{V}\right)\) is sometimes referred to as two-step or two-stage GMM.
It is also possible to use an updated estimator of \(V_{o}\), called \(\widehat{V}\), computed using \(\widehat{\beta}_{GMM}\left(\widetilde{V}\right)\).
It is possible to iterate the procedure a few more times. This is called iterated GMM and has been the subject of B. E. Hansen and Lee (2021).
It is also possible to allow the weight matrix to depend on \(\beta\). This version is called continuously updated GMM by L. P. Hansen, Heaton, and Yaron (1996).
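For the linear IV case, here is a minimal Python sketch of the one-step/two-step recipe above, assuming IID data and using the heteroscedasticity-robust estimator of \(V_{o}\); the function names are made up.

```python
# Minimal sketch of one-step / two-step GMM for the linear IV moment function
# m_t(beta) = Z_t (Y_t - X_t' beta), assuming IID data.
import numpy as np

def linear_gmm(Y, X, Z, W):
    """Closed form beta_GMM(W) = (X'Z W^{-1} Z'X)^{-1} X'Z W^{-1} Z'Y."""
    XZ = X.T @ Z
    A = XZ @ np.linalg.solve(W, Z.T @ X)
    return np.linalg.solve(A, XZ @ np.linalg.solve(W, Z.T @ Y))

def two_step_gmm(Y, X, Z):
    n = len(Y)
    # Step 1: one-step GMM with W-hat = Z'Z / n (this is 2SLS)
    b1 = linear_gmm(Y, X, Z, Z.T @ Z / n)
    # Step 2: estimate V_o = avar(sqrt(n) m-hat(beta_o)) and use it to reweight
    e = Y - X @ b1
    V_tilde = (Z * e[:, None] ** 2).T @ Z / n    # (1/n) sum_t e_t^2 Z_t Z_t'
    b2 = linear_gmm(Y, X, Z, V_tilde)
    return b1, b2
```

Here b1 is the one-step estimator (2SLS, given the chosen \(\widehat{W}\)) and b2 is the two-step estimator \(\widehat{\beta}_{GMM}\left(\widetilde{V}\right)\).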
The 2SLS estimand already gives you a hint: that estimand uses best linear prediction twice.
The 2SLS estimator itself has some nice algebra to uncover. In particular, observe that \[\begin{eqnarray} && \left(\mathbf{X}^{\prime}\mathbf{Z}\left(\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1}\mathbf{Z}^{\prime}\mathbf{X}\right)^{-1}\mathbf{X}^{\prime}\mathbf{Z}\left(\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1}\mathbf{Z}^{\prime}Y\\ & = & \left(\mathbf{X}^{\prime}P_{\mathbf{Z}}\mathbf{X}\right)^{-1}\mathbf{X}^{\prime}P_{\mathbf{Z}}Y\\ & = &\left(\mathbf{X}^{\prime}P_{\mathbf{Z}}^{\prime}P_{\mathbf{Z}}\mathbf{X}\right)^{-1}\mathbf{X}^{\prime}P_{\mathbf{Z}}^{\prime}P_{\mathbf{Z}}Y\\ & = &\left[\left(P_{\mathbf{Z}}\mathbf{X}\right)^{\prime}P_{\mathbf{Z}}\mathbf{X}\right]^{-1}\left(P_{\mathbf{Z}}\mathbf{X}\right)^{\prime}P_{\mathbf{Z}}Y \end{eqnarray}\]
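A quick numerical check of this algebra (the simulated design below is made up and is only used to verify the algebra, not to illustrate endogeneity): computing \(P_{\mathbf{Z}}\mathbf{X}\) by a first-stage regression and then regressing \(Y\) on the fitted values reproduces the closed-form 2SLS estimator.

```python
# Sketch: 2SLS computed by "two regressions" equals the closed-form formula.
import numpy as np

rng = np.random.default_rng(3)
n, L = 1000, 3
Z = np.column_stack([np.ones(n), rng.normal(size=(n, L - 1))])
X = np.column_stack([np.ones(n), Z[:, 1] + Z[:, 2] + rng.normal(size=n)])
Y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)

# First stage: fitted values P_Z X (regress each column of X on Z)
Xhat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
# Second stage: regress Y on the fitted values
b_two_stage = np.linalg.solve(Xhat.T @ Xhat, Xhat.T @ Y)

# Closed-form 2SLS: (X'P_Z X)^{-1} X'P_Z Y
A = X.T @ Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
b_formula = np.linalg.solve(A, X.T @ Z @ np.linalg.solve(Z.T @ Z, Z.T @ Y))

print(np.allclose(b_two_stage, b_formula))  # True: the algebra above in action
```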
A better and practically more useful alternative is to look at the control function interpretation of 2SLS.
The word “model” can be more general than the linear regression model or even linear structural equations.
The model is now described by moment conditions of the form \(\mathbb{E}\left(m_{t}\left(\beta^{o}\right)\right)=0\) for some \(\beta^{o}\in\Theta\).
Notice that a model, in this context, is a set of unconditional moment restrictions.
We are only able to test the model when \(L>K\). What happens when \(L<K\)? \(L=K\)?
One way to check whether the model is correctly specified (meaning the unconditional moment restrictions hold) is to check how far \(\widehat{m}\left(\widehat{\beta}\right)\) is from zero (Why zero?).
The null hypothesis is \(\mathbb{E}\left(m_{t}\left(\beta^{o}\right)\right)=0\) for some \(\beta^{o}\in\Theta\). We need to derive a test statistic and be able to derive its distribution under the null.
It turns out that the GMM criterion function evaluated at efficient GMM estimator is a good starting point for a test statistic.
For the linear structural equation case with conditional homoscedasticity, \(\widehat{\beta}=\widehat{\beta}_{2SLS}\) and \[\begin{eqnarray} n\times\widehat{m}\left(\widehat{\beta}\right)^{\prime}\widetilde{V}^{-1}\widehat{m}\left(\widehat{\beta}\right) &=& \left(Y-\mathbf{X}\widehat{\beta}_{2SLS}\right)^{\prime}\mathbf{Z}\left(\widehat{\sigma}^{2}\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1}\mathbf{Z}^{\prime}\left(Y-\mathbf{X}\widehat{\beta}_{2SLS}\right)\\ &=& \widehat{e}_{2SLS}^{\prime}\mathbf{Z}\left(\widehat{\sigma}^{2}\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1/2}\left(\widehat{\sigma}^{2}\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1/2}\mathbf{Z}^{\prime}\widehat{e}_{2SLS}\\ &=& \left(\left(\widehat{\sigma}^{2}\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1/2}\mathbf{Z}^{\prime}\widehat{e}_{2SLS}\right)^{\prime}\left(\widehat{\sigma}^{2}\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1/2}\mathbf{Z}^{\prime}\widehat{e}_{2SLS} \end{eqnarray}\]
\(\widehat{e}_{2SLS}=Y-\mathbf{X}\widehat{\beta}_{2SLS}\) is the vector of 2SLS residuals and \(\widehat{\sigma}^2=\widehat{e}_{2SLS}^\prime \widehat{e}_{2SLS}/n\) is an estimator of the constant conditional variance \(\mathbb{E}\left(\varepsilon_t^2|Z_t\right)=\sigma^2\).
Note that using the usual algebra you have encountered before, we have
\[\begin{eqnarray}
&&\left(\widehat{\sigma}^{2}\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1/2}\mathbf{Z}^{\prime}\widehat{e}_{2SLS} \\
&=& \left(\widehat{\sigma}^{2}\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1/2}\mathbf{Z}^{\prime}\left(\varepsilon-\mathbf{X}\left(\widehat{\beta}_{2SLS}-\beta^{o}\right)\right)\\
&=& \left(\widehat{\sigma}^{2}\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1/2}\mathbf{Z}^{\prime}\varepsilon-\left(\widehat{\sigma}^{2}\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1/2}\mathbf{Z}^{\prime}\mathbf{X}\left(\mathbf{X}^{\prime}P_{\mathbf{Z}}\mathbf{X}\right)^{-1}\mathbf{X}^{\prime}\mathbf{Z}\left(\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1}\mathbf{Z}^{\prime}\varepsilon\\
&=& \left(\widehat{\sigma}^{2}\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1/2}\mathbf{Z}^{\prime}\varepsilon-\left(\widehat{\sigma}^{2}\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1/2}\mathbf{Z}^{\prime}\mathbf{X}\left(\mathbf{X}^{\prime}P_{\mathbf{Z}}\mathbf{X}\right)^{-1}\mathbf{X}^{\prime}\mathbf{Z}\left(\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1/2}\left(\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1/2}\mathbf{Z}^{\prime}\varepsilon\\
&=& \left(I-\left(\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1/2}\mathbf{Z}^{\prime}\mathbf{X}\left(\mathbf{X}^{\prime}P_{\mathbf{Z}}\mathbf{X}\right)^{-1}\mathbf{X}^{\prime}\mathbf{Z}\left(\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1/2}\right)\left(\widehat{\sigma}^{2}\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1/2}\mathbf{Z}^{\prime}\varepsilon
\end{eqnarray}\]
Next, note that \[\widehat{\Pi}=I-\left(\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1/2}\mathbf{Z}^{\prime}\mathbf{X}\left(\mathbf{X}^{\prime}P_{\mathbf{Z}}\mathbf{X}\right)^{-1}\mathbf{X}^{\prime}\mathbf{Z}\left(\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1/2}\] is symmetric and idempotent.
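Because \(\widehat{\Pi}\) is idempotent, its rank equals its trace, and the cyclic property of the trace gives \[\mathsf{tr}\left(\widehat{\Pi}\right)=L-\mathsf{tr}\left[\left(\mathbf{X}^{\prime}P_{\mathbf{Z}}\mathbf{X}\right)^{-1}\mathbf{X}^{\prime}\mathbf{Z}\left(\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1}\mathbf{Z}^{\prime}\mathbf{X}\right]=L-\mathsf{tr}\left(I_{K}\right)=L-K.\] This is where the degrees of freedom in the chi-squared limit below come from.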
Under conditional homoscedasticity (and other conditions which you should fill in), we have \[n\times\widehat{m}\left(\widehat{\beta}\right)^{\prime}\widetilde{V}^{-1}\widehat{m}\left(\widehat{\beta}\right)=\left[\widehat{\Pi}\left(\widehat{\sigma}^{2}\dfrac{\mathbf{Z}^{\prime}\mathbf{Z}}{n}\right)^{-1/2}\left(\dfrac{\mathbf{Z}^{\prime}\varepsilon}{\sqrt{n}}\right)\right]^{\prime}\underbrace{\widehat{\Pi}}_{\overset{p}{\to}\Pi}\underbrace{\left(\widehat{\sigma}^{2}\dfrac{\mathbf{Z}^{\prime}\mathbf{Z}}{n}\right)^{-1/2}}_{\overset{p}{\to}\sigma^{-1}Q_{ZZ}^{-1/2}}\underbrace{\left(\dfrac{\mathbf{Z}^{\prime}\varepsilon}{\sqrt{n}}\right)}_{\overset{d}{\to}N\left(0,\sigma^{2}Q_{ZZ}\right)}\]
Because \(\Pi\) is also symmetric and idempotent (so rank is equal to trace) and\[\widehat{\Pi}\left(\widehat{\sigma}^{2}\dfrac{\mathbf{Z}^{\prime}\mathbf{Z}}{n}\right)^{-1/2}\left(\dfrac{\mathbf{Z}^{\prime}\varepsilon}{\sqrt{n}}\right)\overset{d}{\to}N\left(0,\Pi\right),\]we must have \[n\times\widehat{m}\left(\widehat{\beta}\right)^{\prime}\widetilde{V}^{-1}\widehat{m}\left(\widehat{\beta}\right)=\left(Y-\mathbf{X}\widehat{\beta}_{2SLS}\right)^{\prime}\mathbf{Z}\left(\widehat{\sigma}^{2}\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1}\mathbf{Z}^{\prime}\left(Y-\mathbf{X}\widehat{\beta}_{2SLS}\right)\overset{d}{\to}\chi_{L-K}^{2}.\]
The test statistic can be rewritten in many ways in the special case of testing model specification in the linear structural equation case: \[\begin{eqnarray} n\times\widehat{m}\left(\widehat{\beta}\right)^{\prime}\widetilde{V}^{-1}\widehat{m}\left(\widehat{\beta}\right) &=& \left(Y-\mathbf{X}\widehat{\beta}_{2SLS}\right)^{\prime}\mathbf{Z}\left(\widehat{\sigma}^{2}\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1}\mathbf{Z}^{\prime}\left(Y-\mathbf{X}\widehat{\beta}_{2SLS}\right)\\ &=& \widehat{e}_{2SLS}^{\prime}\mathbf{Z}\left(\widehat{\sigma}^{2}\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1}\mathbf{Z}^{\prime}\widehat{e}_{2SLS}\\ &=& \dfrac{\left\Vert P_{\mathbf{Z}}\widehat{e}_{2SLS}\right\Vert ^{2}}{\left\Vert \widehat{e}_{2SLS}\right\Vert ^{2}/n}. \end{eqnarray}\]
At this point, you should notice something extremely familiar. Run an auxiliary regression of the 2SLS residuals \(\widehat{e}_{2SLS}\) on the instruments \(\mathbf{Z}\), without a constant. Does the statistic now look familiar?
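Here is a sketch of that computation in Python, assuming e holds the 2SLS residuals and Z the \(n\times L\) instrument matrix as numpy arrays (the names are placeholders): the statistic equals \(n\) times the uncentered \(R^{2}\) of the auxiliary regression.

```python
# Sketch of the Sargan/J statistic from 2SLS residuals e and instruments Z.
import numpy as np

def sargan_statistic(e, Z):
    n = len(e)
    Pz_e = Z @ np.linalg.solve(Z.T @ Z, Z.T @ e)   # fitted values from regressing e on Z
    return (Pz_e @ Pz_e) / (e @ e / n)             # = n * uncentered R^2 of that regression
```

Under the null (and the conditions above), compare this with \(\chi_{L-K}^{2}\) critical values.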
The test statistic we derived for the special case is sometimes called a Sargan test.
The simplified form of the test statistic is sometimes called a Sargan statistic.
In general, the test of model specification is also sometimes called a test of overidentifying restrictions or overidentifying restrictions test or \(J\)-test (Hansen 1982).
If you reject the null, the test does not tell you which moment condition is incompatible with the data.
If you do not reject the null, it does not necessarily mean that your model is correct.
We need to work out the general case, as we have so far focused on moment functions that are linear in \(\beta\).
We need to understand where \(\mathbf{Z}\) comes from. Unfortunately, the theory presented is actually the easy part. The real hard part is to really find these \(\mathbf{Z}\) in empirical applications.
The 2SLS case actually has a very rich history. If we have time, we are going to examine the finite-sample properties and the bad things that can happen when some assumptions of the theory do not hold.
In all consistency proofs you have written, you were able to find a closed-form expression for some estimator.
In addition, the estimator is usually written as the estimation target plus some error.
Now, we only have an objective function and FOCs.
You have seen examples where it is difficult to find a closed-form expression for the solution to the FOC. Recall the case of estimating \(\lambda\) in the exponential distribution.
It may also happen that there are multiple solutions to the FOCs. Therefore, the consistency argument avoids relying on the FOCs.
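To make this concrete, here is a Python sketch for the exponential example, assuming the rate parameterization so that \(\mathbb{E}\left(X_{t}\right)=1/\lambda\) and \(\mathbb{E}\left(X_{t}^{2}\right)=2/\lambda^{2}\): the GMM criterion is minimized numerically instead of solving the FOCs.

```python
# Sketch: GMM for the exponential rate parameter with two moment conditions,
# minimizing the criterion numerically instead of solving the FOCs.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
lam_true = 2.0
X = rng.exponential(scale=1.0 / lam_true, size=1000)

def mhat(lam):
    # sample moments of m_t(lambda) = (X_t - 1/lambda, X_t^2 - 2/lambda^2)
    return np.array([X.mean() - 1.0 / lam, np.mean(X**2) - 2.0 / lam**2])

crit = lambda lam: mhat(lam) @ mhat(lam)        # one-step GMM, identity weighting
lam_hat = minimize_scalar(crit, bounds=(0.01, 20.0), method="bounded").x
print(lam_hat)   # should be close to 2.0
```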
LLNs have to be modified as well. Why?
Pay attention to what changes relative to linear GMM.
In a separate document, I go through the details of the proof.
In these slides, I present the key ideas only.
In all asymptotic normality proofs you have written for the OLS estimator, you started from a closed-form expression for \(\sqrt{n}\left(\widehat{\beta}-\beta^*\right)\).
Recall that the difference between the estimator and the estimation target is equal to some error. You also needed a CLT at some point.
Now, we only have an objective function and FOCs.
Just like in the consistency case, it is harder to write a closed-form expression for \(\sqrt{n}\left(\widehat{\beta}_{GMM}-\beta^o\right)\).
The key idea is to use a linear approximation, where the approximation error somehow disappears in large samples.
Pay attention to what changes relative to linear GMM.
Introduce the key object. We need a first-order asymptotic approximation, or an asymptotically linear representation: \[\sqrt{n}\left(\widehat{\beta}-\beta^{o}\right)=-\left[\frac{\partial\widehat{m}\left(\widehat{\beta}\right)}{\partial\beta^{\prime}}\widehat{W}^{-1}\frac{\partial\widehat{m}\left(\bar{\beta}\right)}{\partial\beta}\right]^{-1}\frac{\partial\widehat{m}\left(\widehat{\beta}\right)}{\partial\beta^{\prime}}\widehat{W}^{-1}\sqrt{n}\widehat{m}\left(\beta^{o}\right)\]
This is very similar to the OLS estimator: \[\begin{eqnarray} \sqrt{n}\left(\widehat{\beta}-\beta^{*}\right) = \left(\dfrac{1}{n}\sum_{t=1}^{n}X_{t}X_{t}^{\prime}\right)^{-1}\left(\dfrac{1}{\sqrt{n}}{\displaystyle \sum_{t=1}^{n}}X_{t}u_{t}\right)\\ \overset{d}{\rightarrow} N\bigg(0,\underbrace{Q^{-1}\left[\mathsf{Avar}\left(\dfrac{1}{\sqrt{n}}{\displaystyle \sum_{t=1}^{n}}X_{t}u_{t}\right)\right]Q^{-1}}_{\mathsf{Avar}\left(\sqrt{n}\left(\widehat{\beta}-\beta^{*}\right)\right)}\bigg) \end{eqnarray}\]
So where are the differences?
Introduce a key tool called the mean value theorem. Let \(h:\mathbb{R}^{p}\to\mathbb{R}^{q}\) be continuously differentiable. Then, \[h\left(x\right)=h\left(x_{0}\right)+\frac{\partial h\left(\bar{x}\right)}{\partial x}\left(x-x_{0}\right)\] where \(\bar{x}\) is a mean value lying between \(x\) and \(x_{0}\), i.e., \(\bar{x}=\lambda x+(1-\lambda)x_{0}\), where \(\lambda\in\left(0,1\right)\). (When \(q>1\), the theorem is applied row by row, so each row of the Jacobian may be evaluated at its own mean value; this does not affect the argument below.)
Note that the layout used in Chapter 8 (which may be confusing given some notational conventions) is: \[\underset{\left(q\times p\right)}{\frac{\partial h}{\partial x}}=\left[\begin{array}{cccc} \frac{\partial h_{1}}{\partial x_{1}} & \frac{\partial h_{1}}{\partial x_{2}} & \cdots & \frac{\partial h_{1}}{\partial x_{p}}\\ \frac{\partial h_{2}}{\partial x_{1}} & \frac{\partial h_{2}}{\partial x_{2}} & \cdots & \frac{\partial h_{2}}{\partial x_{p}}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial h_{q}}{\partial x_{1}} & \frac{\partial h_{q}}{\partial x_{2}} & \cdots & \frac{\partial h_{q}}{\partial x_{p}} \end{array}\right]\]
Assumption 8.5 states that \(\beta^{o}\in\mathsf{int}\left(\Theta\right)\).
Assumptions 8.1 to 8.4 allow you to conclude that \(\widehat{\beta}_{GMM}\overset{p}{\rightarrow}\beta^{o}\) as \(n\rightarrow\infty\).
Because of Assumptions 8.1 to 8.5, we have that \(\widehat{\beta}_{GMM}\) is an interior element of \(\Theta\) with probability approaching one as \(n\rightarrow\infty\).
The FOCs of the GMM objective function are \[\left.\frac{\partial\widehat{Q}\left(\beta\right)}{\partial\beta}\right|_{\beta=\widehat{\beta}}=0\Rightarrow-2\frac{\partial\widehat{m}\left(\widehat{\beta}\right)}{\partial\beta^{\prime}}\widehat{W}^{-1}\widehat{m}\left(\widehat{\beta}\right)=0.\]
A local linear approximation of \(\widehat{m}\left(\cdot\right)\) about \(\beta^{o}\) gives \[\underset{\left(L\times1\right)}{\widehat{m}\left(\widehat{\beta}\right)}=\underset{\left(L\times1\right)}{\widehat{m}\left(\beta^{o}\right)}+\underset{\left(L\times K\right)}{\frac{\partial\widehat{m}\left(\bar{\beta}\right)}{\partial\beta}}\underset{\left(K\times1\right)}{\left(\widehat{\beta}-\beta^{o}\right).}\]
Now, substitute the local linear approximation into the FOCs and solve for \(\widehat{\beta}-\beta^{o}\): \[\widehat{\beta}-\beta^{o}=-\left[\frac{\partial\widehat{m}\left(\widehat{\beta}\right)}{\partial\beta^{\prime}}\widehat{W}^{-1}\frac{\partial\widehat{m}\left(\bar{\beta}\right)}{\partial\beta}\right]^{-1}\frac{\partial\widehat{m}\left(\widehat{\beta}\right)}{\partial\beta^{\prime}}\widehat{W}^{-1}\widehat{m}\left(\beta^{o}\right).\]
At the end of the proof for asymptotic normality, you will be able to obtain a form for \(\Omega\): \[\Omega=\left(D_{o}^{\prime}W^{-1}D_{o}\right)^{-1} D_{o}^{\prime}W^{-1}V_o W^{-1}D_{o} \left(D_{o}^{\prime}W^{-1}D_{o}\right)^{-1}.\]
This is very similar to the linear case: \[\Omega = \left(Q_{ZX}^{\prime}W^{-1}Q_{ZX}\right)^{-1}Q_{ZX}^{\prime}W^{-1}V_{o}W^{-1}Q_{ZX}\left(Q_{ZX}^{\prime}W^{-1}Q_{ZX}\right)^{-1}.\]
Recall that by choosing \(\widehat{W}\) so that \(\widehat{W}\overset{p}{\rightarrow}W\propto V_{o}\), the form of the asymptotic variance in the linear case becomes \[\Omega_{o}=\left(Q_{ZX}^{\prime}V_{o}^{-1}Q_{ZX}\right)^{-1}.\]
In the nonlinear case, we can also choose \(\widehat{W}\) so that \(\widehat{W}\overset{p}{\rightarrow}W\propto V_{o}\), so that the form of the asymptotic variance for the nonlinear case becomes \[\Omega_{o}=\left(D_{o}^{\prime}V_o^{-1}D_{o}\right)^{-1}.\]
In the proof of consistency, the FOCs were never used. Therefore, no assumption about differentiability was ever needed.
In the proof of asymptotic normality:
Consistent covariance matrix estimation is just like before but the justification uses uniform LLNs.
First-order asymptotic efficiency may be achieved by choosing \(\widehat{W}\) so that \(\widehat{W}\overset{p}{\rightarrow}W\propto V_{o}\). The question is what \(V_o\) looks like.
The \(J\)-statistic is based on the value of the GMM criterion function evaluated at the optimal or efficient GMM estimator. This means that the starting point is the two-step GMM estimator.
So, we have \[\begin{eqnarray}n\times\widehat{m}\left(\widehat{\beta}\right)^{\prime}\widetilde{V}^{-1}\widehat{m}\left(\widehat{\beta}\right) &=& \sqrt{n}\widehat{m}\left(\widehat{\beta}\right)^{\prime}\widetilde{V}^{-1/2}\widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\widehat{\beta}\right)\\ &=& \left[\widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\widehat{\beta}\right)\right]^{\prime}\left[\widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\widehat{\beta}\right)\right] \\ & \overset{\mathsf{Step\:1}}{=}& \left[\widehat{\Pi}\widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\beta^{o}\right)\right]^{\prime}\left[\widehat{\Pi}\widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\beta^{o}\right)\right]\\ & \overset{\mathsf{Step\:2}}{=}& \left[\widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\beta^{o}\right)\right]^{\prime}\underbrace{\widehat{\Pi}}_{\overset{p}{\to}\Pi}\underbrace{\left[\widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\beta^{o}\right)\right]}_{\overset{d}{\to}N\left(0,I\right)} \overset{d}{\to} \chi_{L-K}^{2}\end{eqnarray}\]
Step 1 comes from the mean value expansion of \(\widehat{m}\) around \(\beta^{o}\), combined with the expression for \(\sqrt{n}\left(\widehat{\beta}-\beta^{o}\right)\) implied by the FOCs: \[\begin{eqnarray} && \widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\widehat{\beta}\right)\\ &=& \widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\beta^{o}\right)+\widetilde{V}^{-1/2}\frac{\partial\widehat{m}\left(\bar{\beta}\right)}{\partial\beta}\sqrt{n}\left(\widehat{\beta}-\beta^{o}\right)\\ &=& \widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\beta^{o}\right)-\widetilde{V}^{-1/2}\frac{\partial\widehat{m}\left(\bar{\beta}\right)}{\partial\beta}\left[\frac{\partial\widehat{m}\left(\widehat{\beta}\right)}{\partial\beta^{\prime}}\widetilde{V}^{-1}\frac{\partial\widehat{m}\left(\bar{\beta}\right)}{\partial\beta}\right]^{-1}\frac{\partial\widehat{m}\left(\widehat{\beta}\right)}{\partial\beta^{\prime}}\widetilde{V}^{-1}\sqrt{n}\widehat{m}\left(\beta^{o}\right)\\ &=& \underbrace{\left[I_{L}-\widetilde{V}^{-1/2}\frac{\partial\widehat{m}\left(\bar{\beta}\right)}{\partial\beta}\left[\frac{\partial\widehat{m}\left(\widehat{\beta}\right)}{\partial\beta^{\prime}}\widetilde{V}^{-1}\frac{\partial\widehat{m}\left(\bar{\beta}\right)}{\partial\beta}\right]^{-1}\frac{\partial\widehat{m}\left(\widehat{\beta}\right)}{\partial\beta^{\prime}}\widetilde{V}^{-1/2}\right]}_{\widehat{\Pi}}\times\widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\beta^{o}\right) \end{eqnarray}\]
Show that \(\widehat{\Pi}\overset{p}{\to}\Pi\) where the limit is symmetric and idempotent.
Apply the results related to the multivariate normal and show the chi-squared result.
Any other consistent estimator of \(V_{o}\) would also work in the derivation. But the optimal/efficient GMM estimator has to be used for the derivation to work.
What will happen if you do not use the optimal GMM estimator?
Start with an inefficient GMM estimator \(\widetilde{\beta}\) obtained from using some weighting matrix \(\widehat{W}\).
Here is a sketch of what will happen:
\[\begin{eqnarray} && n\widehat{m}\left(\widetilde{\beta}\right)^{\prime}\widehat{W}^{-1}\widehat{m}\left(\widetilde{\beta}\right) \\ &=& \sqrt{n}\widehat{m}\left(\widetilde{\beta}\right)^{\prime}\widetilde{V}^{-1/2}\widetilde{V}^{1/2}\widehat{W}^{-1}\widetilde{V}^{1/2}\widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\widetilde{\beta}\right)\\ &=& \left[\widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\widetilde{\beta}\right)\right]^{\prime}\widetilde{V}^{1/2}\widehat{W}^{-1}\widetilde{V}^{1/2}\left[\widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\widetilde{\beta}\right)\right]\\ &=& \left[\widetilde{\Pi}\widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\beta^{o}\right)\right]^{\prime}\widetilde{V}^{1/2}\widehat{W}^{-1}\widetilde{V}^{1/2}\left[\widetilde{\Pi}\widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\beta^{o}\right)\right]\\ &=& \left[\widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\beta^{o}\right)\right]^{\prime}\underbrace{\widetilde{\Pi}^{\prime}\widetilde{V}^{1/2}\widehat{W}^{-1}\widetilde{V}^{1/2}\widetilde{\Pi}}_{\overset{p}{\to}?}\underbrace{\left[\widetilde{V}^{-1/2}\sqrt{n}\widehat{m}\left(\beta^{o}\right)\right]}_{\overset{d}{\to}N\left(0,I\right)} \end{eqnarray}\]