Andrew Pua
February 2022
The course mainly serves two purposes:
Research: Not just for the upcoming thesis-writing phase; the aim is for you to be a good consumer and producer of research.
Certification: The course is part of your core courses, so your degree carries some useful information about your skill level.
Can it be good for business decision-making?
Keep an open mind. The course is not just mathematics.
Pay attention or things will pass you by. You should not multi-task.
Ask questions immediately and participate in class.
Memorization can help, but doing bits of memorization over an entire semester is better than cramming it all in a day (or less!) before the exams.
Do the exercises immediately, even if you are not told to and even if there are no solutions.
As much as possible, I follow the main textbook in terms of its overall structure, including the notation. But I may jump from one place to another.
I jump from one place to another with a purpose in mind. I put references to the main textbook in the slides.
I want to give you the context and the connections to past knowledge rather than just the methods/computer commands. I want you to be able to rebuild this knowledge if you lose it.
Almost no homework to submit, but there are activities that are graded for completion.
Write down the expression needed to calculate \(\mathbb{E}\left(X^{2}\right)\).
One of the four words in the sentence “I SEE THE MOUSE” will be selected at random.
The task is to predict the number of letters in the selected word; denote this number by \(Y\).
What would be your prediction rule in order to make your expected loss as small as possible?
How much is the smallest expected loss?
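For concreteness, here is a minimal R sketch (not part of the lecture code) for this example, assuming squared-error loss as in the problems below and that each of the four words is equally likely, so that mean() over the four values gives population moments.
y <- c(1, 3, 3, 5) # number of letters in I, SEE, THE, MOUSE
mean(y) # optimal constant prediction E(Y) = 3
mean((y - mean(y))^2) # smallest expected loss Var(Y) = 2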
What if you have other information in the form of another random variable \(X_{1}\)?
Consider a prediction rule of the form \(\beta_{0}+\beta_{1}X_{1}\).
As a result, we have \[\beta_{0}^{*} = \mathbb{E}\left(Y\right)-\beta_{1}^{*}\mathbb{E}\left(X_{1}\right),\qquad\beta_{1}^{*}=\dfrac{\mathsf{Cov}\left(X_{1},Y\right)}{\mathsf{Var}\left(X_{1}\right)}.\]
The condition \(\mathsf{Var}\left(X_{1}\right)>0\) rules out point-mass (degenerate) distributions for \(X_{1}\).
What is the minimized value of the objective function?
What happens when \(\beta_1^*=0\) is known in advance? In that case, the problem reduces to the first one and \(\beta_0^*=\mathbb{E}\left(Y\right)\).
Return to I SEE THE MOUSE. The next task is to predict the number of letters in the selected word if you have information about the number of E’s in the word (call this \(X_1\)).
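As a sketch (not part of the lecture code), the coefficients of the best linear predictor for this example follow directly from the formulas above, again assuming each word is equally likely:
y <- c(1, 3, 3, 5) # number of letters in I, SEE, THE, MOUSE
x1 <- c(0, 2, 1, 1) # number of E's in each word
beta1 <- (mean(x1 * y) - mean(x1) * mean(y)) / (mean(x1^2) - mean(x1)^2) # Cov(X1, Y) / Var(X1)
beta0 <- mean(y) - beta1 * mean(x1) # E(Y) - beta1 * E(X1)
c(beta0, beta1) # 2 and 1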
There is a common structure to the optimization problems you have seen so far.
Most optimization problems in econometrics share this common structure, which is connected to orthogonal projection, a linear algebra concept you may have encountered before.
Define the inner product to be \[\left\langle Y,X_{1}\right\rangle =\mathbb{E}\left(X_{1}Y\right).\]
Therefore, we could rewrite the problems as \[\min_{\beta_{0}}\left\Vert Y-\beta_{0}\right\Vert ^{2},\qquad\min_{\beta_{0},\beta_{1}}\left\Vert Y-\beta_{0}-\beta_{1}X_{1}\right\Vert ^{2}.\]
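In this notation, the first-order conditions for the second problem say that the prediction error is orthogonal to the constant and to \(X_{1}\): \[\left\langle Y-\beta_{0}^{*}-\beta_{1}^{*}X_{1},1\right\rangle =0,\qquad\left\langle Y-\beta_{0}^{*}-\beta_{1}^{*}X_{1},X_{1}\right\rangle =0,\] that is, \(\mathbb{E}\left(Y-\beta_{0}^{*}-\beta_{1}^{*}X_{1}\right)=0\) and \(\mathbb{E}\left[X_{1}\left(Y-\beta_{0}^{*}-\beta_{1}^{*}X_{1}\right)\right]=0\).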
Yet another way to interpret the first-order conditions is by looking at them as systems of linear equations. \[\underbrace{\left(\begin{array}{cc} 1 & \mathbb{E}\left(X_1\right)\\ \mathbb{E}\left(X_1\right) & \mathbb{E}\left(X^{2}_1\right) \end{array}\right)}_Q\left(\begin{array}{c} \beta_{0}^{*}\\ \beta_{1}^{*} \end{array}\right) = \left(\begin{array}{c} \mathbb{E}\left(Y\right)\\ \mathbb{E}\left(X_1Y\right) \end{array}\right)\]
The matrix \(Q\) is important, and you will see it frequently.
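To make this concrete (a sketch, not from the slides), the system can be set up and solved numerically for the I SEE THE MOUSE example, assuming each word is equally likely; it reproduces the coefficients found earlier.
y <- c(1, 3, 3, 5) # number of letters in I, SEE, THE, MOUSE
x1 <- c(0, 2, 1, 1) # number of E's in each word
Q <- matrix(c(1, mean(x1), mean(x1), mean(x1^2)), nrow = 2) # the moment matrix Q
rhs <- c(mean(y), mean(x1 * y)) # right-hand side: E(Y) and E(X1 Y)
solve(Q, rhs) # beta0* = 2, beta1* = 1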
set.seed(20220221) # Change this to generate different results
coefs <- matrix(NA, nrow=10^4, ncol=2) # Storage
for(i in 1:10^4)
{
source <- matrix(c(1,3,3,5,0,2,1,1), ncol = 2) # joint distribution
data <- source[sample(nrow(source), size=40, replace = TRUE),] # IID sampling
temp <- lm(data[, 1] ~ data[, 2]) # least squares
coefs[i, ] <- summary(temp)[[4]][, 1] # store coefficients
}
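As a quick check (not part of the original code), averaging the stored estimates across the \(10^{4}\) replications should give values close to the population coefficients of this joint distribution, \(\beta_{0}^{*}=2\) and \(\beta_{1}^{*}=1\):
colMeans(coefs) # Monte Carlo averages of the intercept and slope, roughly 2 and 1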
Plugging in estimated versions of \(\mu_{1}\) and \(\mu_{Y}\) has no effect asymptotically: \[\begin{eqnarray}\widehat{\mu}_{11}={\displaystyle \dfrac{1}{n}\sum_{t=1}^{n}}\left(X_{1t}-\overline{X}_{1}\right)\left(Y_{t}-\overline{Y}\right) \overset{p}{\rightarrow} \mathsf{Cov}\left(X_{1},Y\right)=\mu_{11}\\ \widehat{\mu}_{20}={\displaystyle \dfrac{1}{n}\sum_{t=1}^{n}}\left(X_{1t}-\overline{X}_{1}\right)^{2} \overset{p}{\rightarrow} \mathsf{Var}\left(X_{1}\right)=\mu_{20} \end{eqnarray}\]
You can show that \(\widehat{\beta}_{1}\overset{p}{\rightarrow}\beta_{1}^{*}\) and \(\widehat{\beta}_{0}\overset{p}{\rightarrow}\beta_{0}^{*}\), as \(n\to\infty\).
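The key step is that the least-squares estimators are continuous functions of the sample moments, so the continuous mapping theorem (together with \(\mu_{20}=\mathsf{Var}\left(X_{1}\right)>0\)) gives \[\widehat{\beta}_{1}=\dfrac{\widehat{\mu}_{11}}{\widehat{\mu}_{20}}\overset{p}{\rightarrow}\dfrac{\mu_{11}}{\mu_{20}}=\beta_{1}^{*},\qquad\widehat{\beta}_{0}=\overline{Y}-\widehat{\beta}_{1}\overline{X}_{1}\overset{p}{\rightarrow}\mathbb{E}\left(Y\right)-\beta_{1}^{*}\mathbb{E}\left(X_{1}\right)=\beta_{0}^{*}.\]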
This is a perfect opportunity to apply the asymptotic tools you have learned before, specifically Lemmas 4.2, 4.6 to 4.9 of the main textbook.
The argument requires IID sampling. This is essentially what you will see in Chapter 4 of the main textbook. Moving beyond IID sampling is the subject of Chapters 5 and 6.
Note that \(\phi^2\) depends on unknown quantities and that the theoretical standard error of \(\widehat{\beta}_{1}\) based on asymptotic theory is given by \[\mathsf{se}\left(\widehat{\beta}_{1}\right)=\dfrac{1}{\sqrt{n}}\sqrt{\dfrac{\mathsf{Var}\left[\left(X_{1t}-\mu_{1}\right)u_{t}\right]}{\left[\mathsf{Var}\left(X_{1t}\right)\right]^{2}}}.\]
Contributions by Eicker and Huber in the 1960s and by White in 1980 showed that it is possible to consistently estimate the standard error of \(\widehat{\beta}_{1}\) as \[\widehat{\mathsf{se}}\left(\widehat{\beta}_{1}\right)=\sqrt{\dfrac{\displaystyle\sum_{t=1}^{n}\left(X_{1t}-\overline{X}_{1}\right)^{2}\widehat{u}_{t}^{2}}{\left(\displaystyle\sum_{t=1}^{n}\left(X_{1t}-\overline{X}_{1}\right)^{2}\right)^{2}}}.\]This estimate of the standard error is valid for large samples.
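As an illustration (a minimal sketch, not part of the lecture code), this standard error can be computed by hand from one simulated sample, using the same joint distribution as in the simulation below. Note that summary() for lm objects reports the classical standard errors, so the robust version has to be computed separately.
set.seed(20220221)
source <- matrix(c(1,3,3,5,0,2,1,1), ncol = 2) # joint distribution
data <- source[sample(nrow(source), size = 40, replace = TRUE), ] # IID sampling
fit <- lm(data[, 1] ~ data[, 2]) # least squares
uhat <- resid(fit) # residuals
x1dev <- data[, 2] - mean(data[, 2]) # X1 in deviation-from-mean form
sqrt(sum(x1dev^2 * uhat^2) / sum(x1dev^2)^2) # Eicker-Huber-White standard error of the slope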
set.seed(20220221) # Change this to generate different results
coefs <- matrix(NA, nrow=10^4, ncol=2) # Storage for coefficients
ses <- matrix(NA, nrow=10^4, ncol=2) # Storage for standard errors
for(i in 1:10^4)
{
source <- matrix(c(1,3,3,5,0,2,1,1), ncol = 2) # joint distribution
data <- source[sample(nrow(source), size=40, replace = TRUE),] # IID sampling
temp <- lm(data[, 1] ~ data[, 2]) # least squares
coefs[i, ] <- summary(temp)[[4]][, 1] # store coefficients
ses[i, ] <- summary(temp)[[4]][,2] # store standard errors
}
c(mean(ses[, 1])/sd(coefs[, 1]), mean(ses[, 2])/sd(coefs[, 2])) # SE/SD ratio
## [1] 1.048279 1.093472
## [1] 1.114528 1.205512
## [1] 1.141887 1.225075