This tutorial reviews *regression* analysis, a cornerstone method that supports many of the other methods we will consider. We begin by translating the basic equations into matrix form, in order to connect with basic knowledge of matrix algebra (covered in class). Then we work through examples of how to fit regression models for continuous, count, and binary outcomes.

In this session we cover …

A. Mathematical Model Basics

B. A Set of Examples

B1. \(y_i\) is a continuous, Gaussian (normal) distributed, variable = *Multiple Regression*

B2. \(y_i\) is an ordinal (count), Poisson distributed, variable = *Poisson Regression*

B3. \(y_i\) is a binary, Binomial (Bernoulli) distributed, variable = *Logistic Regression*

```
library(psych)
library(ggplot2)
```


In basic form, a regression model describes a single *outcome variable* as a function of *a set of predictor variables*. \[ outcome_i = f(inputs_i) = f(\cdot) \] Generally, the regression model is written as

\[ y_{i} = \beta_0(1_i) + \beta_1(x_{1i}) + \beta_2(x_{2i}) + ... + \beta_q(x_{qi}) + \epsilon_{i} \]

where \(y_{i}\) is the value of the outcome variable for individual \(i\); \(\beta_0\) is an *intercept parameter* that indicates the expected value of \(y_i\) when the predictor variables, \(\left\{ x_{1i}, ..., x_{qi}\right\}\), are all \(= 0\); \(\left\{\beta_1,..., \beta_q\right\}\) are *regression parameters* that each indicate the unique relation between a specific predictor variable, \(x_{qi}\), and the outcome variable, \(y_i\); and \(\epsilon_{i}\) is the *residual*, the random portion of \(y_i\) that is unaccounted for by the deterministic (i.e., fixed) part of the model, \(\beta_0 + \beta_1(x_{1i}) + \beta_2(x_{2i}) + ... + \beta_q(x_{qi})\).

Given our long-term goals, we make connections between this representation of the model and the matrix algebra representation.

We can translate the regression model into matrix form. First, let’s take a simpler form of the model, an intercept-only model where

\[ y_i = 1_i \cdot \beta_0 + \epsilon_i\] Note that we have made the “silent” 1 explicit. This will become important later (e.g., when fitting growth models).

Translating into matrix form, \(y_i\) can be written as an \(N\) x 1 matrix (a column vector). More specifically, for \(i = 1\) to \(N\) individuals, \[ y_i = \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_N \end{array} \right] = \boldsymbol{Y}\]

(matrices are often designated as bold capital letters).

Doing the same for all the other parts of the model, we get

\[ \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_N \end{array} \right] = \left[ \begin{array}{c} 1 \\ 1 \\ \vdots \\ 1 \end{array} \right] [\beta_0] + \left[ \begin{array}{c} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_N \end{array} \right]\]

Note that we have taken care that each matrix is of an order that will allow for matrix multiplication.

What is the interpretation of \([\beta_0]\)?

What is the interpretation of \[\left[ \begin{array}{c} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_N \end{array} \right]\]?

Now, let’s expand our regression model by adding a predictor \(x_{1i}\). Our model becomes

\[ y_i = (1_i)\beta_0 + (x_{1i})\beta_1 + \epsilon_i \]

Written out explicitly in matrix form, the model is

\[ \left[ \begin{array}{c}
y_1 \\
y_2 \\
\vdots \\
y_N \end{array} \right] =
\left[ \begin{array}{cc}
1 & x_{11}\\
1 & x_{12}\\
\vdots & \vdots \\
1 & x_{1N}\end{array} \right]
\left[ \begin{array}{c}\beta_0\\ \beta_1\end{array}\right] +
\left[ \begin{array}{c}
\epsilon_1 \\
\epsilon_2 \\
\vdots \\
\epsilon_N \end{array} \right]\]

Finally, extending the model to the general case with \(q\) predictor variables, we have \[ y_i = \beta_0 + \beta_1(x_{1i}) + \beta_2(x_{2i}) + ... + \beta_q(x_{qi}) + \epsilon_i \]

which is written out in matrix form as

\[ \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_N \end{array} \right] = \left[ \begin{array}{cccc} 1 & x_{11} & \ldots & x_{q1}\\ 1 & x_{12} & \ldots & x_{q2}\\ \vdots & \vdots & & \vdots \\ 1 & x_{1N} & \ldots & x_{qN}\end{array} \right] \left[ \begin{array}{c}\beta_0\\ \beta_1\\ \vdots\\ \beta_q\end{array}\right] + \left[ \begin{array}{c} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_N \end{array} \right]\]

When we implement this model in R, it will be important to know the portions of the model that are in our data frame, \(y_i\) and \({x_{1}, ..., x_{q}}\), and to have them structured properly. This will become clear in the examples below.
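In R, the design matrix \(\boldsymbol{X}\) implied by a model formula can be inspected directly with `model.matrix()`. As a minimal sketch (using a few made-up rows of data, for illustration only), note how the "silent" column of 1s appears alongside the predictor:

```
#a small made-up data frame (illustration only)
df <- data.frame(y = c(4, 7, 6), x1 = c(1, 3, 2))

#model.matrix() builds the X matrix implied by a formula,
#including the intercept column of 1s
model.matrix(y ~ 1 + x1, data = df)
```

The returned matrix has one row per individual and one column per parameter, exactly matching the \(\boldsymbol{X}\) written out above.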

Now that we have the model written out explicitly as matrices, we can easily simplify the notation.

\[ \boldsymbol{Y} = \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_N \end{array} \right]; \boldsymbol{X} = \left[ \begin{array}{cccc} 1 & x_{11} & \ldots & x_{q1}\\ 1 & x_{12} & \ldots & x_{q2}\\ \vdots & \vdots & & \vdots \\ 1 & x_{1N} & \ldots & x_{qN}\end{array} \right]; \boldsymbol{\beta} = \left[ \begin{array}{c}\beta_0\\ \beta_1\\ \vdots\\ \beta_q\end{array}\right]; \boldsymbol{\epsilon} = \left[ \begin{array}{c} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_N \end{array} \right] \]

In compact matrix notation, the regression model then can be written as

\[ \boldsymbol{Y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \]

In practice, we would like to know the contents of (i.e., solve for) \(\boldsymbol{\beta}\).

Assuming the model is correct, the expected value of \(\boldsymbol{\epsilon}\) is \(\boldsymbol{0}\), so, in expectation, \[ \boldsymbol{Y} = \boldsymbol{X}\boldsymbol{\beta}\] We then just need to solve for \(\boldsymbol{\beta}\).

Premultiplying both sides of the equation by \(\boldsymbol{X'}\), we obtain

\[ \boldsymbol{X'}\boldsymbol{Y} = \boldsymbol{X'}\boldsymbol{X}\boldsymbol{\beta} \] This gets us a quantity, \(\left(\boldsymbol{X'}\boldsymbol{X}\right)\), a square matrix that contains information about the relations among the \(x_q\) variables. When this matrix has an inverse, we can premultiply both sides by \(\left(\boldsymbol{X'}\boldsymbol{X}\right)^{-1}\), to obtain

\[ \left(\boldsymbol{X'}\boldsymbol{X}\right)^{-1} \left( \boldsymbol{X'}\boldsymbol{Y}\right) = \left(\boldsymbol{X'}\boldsymbol{X}\right)^{-1} (\boldsymbol{X'}\boldsymbol{X}) \boldsymbol{\beta} \]

Because \[\left(\boldsymbol{X'}\boldsymbol{X}\right)^{-1}\left(\boldsymbol{X'}\boldsymbol{X}\right) = \boldsymbol{I}\] the equation simplifies to \[ \left(\boldsymbol{X'}\boldsymbol{X}\right)^{-1}\left(\boldsymbol{X'}\boldsymbol{Y}\right) = \boldsymbol{I}\boldsymbol{\beta} \] and \[ \left(\boldsymbol{X'}\boldsymbol{X}\right)^{-1}\left(\boldsymbol{X'}\boldsymbol{Y}\right) = \boldsymbol{\beta} \] Yay! We’ve isolated the unknowns, \(\boldsymbol{\beta}\), onto one side of the equation.

Quite literally, this algebra is what allows for estimation of the parameters when fitting a regression model to data.
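To see this algebra in action, here is a minimal sketch (with simulated data, so the particular numbers are arbitrary) that computes \(\left(\boldsymbol{X'}\boldsymbol{X}\right)^{-1}\left(\boldsymbol{X'}\boldsymbol{Y}\right)\) directly and checks it against `lm()`:

```
#simulate a small data set (illustration only)
set.seed(1234)
x1 <- rnorm(100, mean = 10, sd = 2)
y  <- 2 + 0.5*x1 + rnorm(100, mean = 0, sd = 1)

#build X with the explicit column of 1s
X <- cbind(1, x1)

#beta = (X'X)^{-1} (X'Y)
beta_hat <- solve(t(X) %*% X) %*% (t(X) %*% y)
beta_hat

#the same estimates from lm()
coef(lm(y ~ 1 + x1))
```

The two sets of estimates agree (up to numerical precision); `lm()` uses a numerically more stable decomposition internally, but the underlying solution is the same.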

We will now work through some practical examples - staying aware that this kind of matrix algebra is being done in the background.

Examples make use of a new WISC data file. For this tutorial the data have, given our didactic purposes, been supplemented with some simulated data. In particular, we have added a `female` variable that indicates each participant’s gender.

Reading in these example data.

```
#set filepath for data file
filepath <- "https://quantdev.ssri.psu.edu/sites/qdev/files/wisc3raw_gender.csv"
#read in the .csv file using the url() function
wisc3raw <- read.csv(file=url(filepath),header=TRUE)
```

Subsetting to variables of interest.

`wiscsub <- wisc3raw[ , c("id","verb1","verb2","verb4","verb6","momed","grad", "female")]`

Sample-level descriptives for the chosen variables.

`describe(wiscsub)`

| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | 1 | 204 | 102.5000000 | 59.0338886 | 102.500 | 102.5000000 | 75.612600 | 1.00 | 204.00 | 203.00 | 0.0000000 | -1.2176609 | 4.1331989 |
| verb1 | 2 | 204 | 19.5850490 | 5.8076950 | 19.335 | 19.4976829 | 5.411490 | 3.33 | 35.15 | 31.82 | 0.1301071 | -0.0528376 | 0.4066200 |
| verb2 | 3 | 204 | 25.4153431 | 6.1063772 | 25.980 | 25.4026220 | 6.567918 | 5.95 | 39.85 | 33.90 | -0.0639307 | -0.3445494 | 0.4275319 |
| verb4 | 4 | 204 | 32.6077451 | 7.3198841 | 32.820 | 32.4240244 | 7.183197 | 12.60 | 52.84 | 40.24 | 0.2317998 | -0.0776524 | 0.5124944 |
| verb6 | 5 | 204 | 43.7499020 | 10.6650511 | 42.545 | 43.4560976 | 11.297412 | 17.35 | 72.59 | 55.24 | 0.2356459 | -0.3605210 | 0.7467029 |
| momed | 6 | 204 | 10.8112745 | 2.6982790 | 11.500 | 10.9969512 | 2.965200 | 5.50 | 18.00 | 12.50 | -0.3586143 | 0.0056095 | 0.1889173 |
| grad | 7 | 204 | 0.2254902 | 0.4189328 | 0.000 | 0.1585366 | 0.000000 | 0.00 | 1.00 | 1.00 | 1.3040954 | -0.3007374 | 0.0293312 |
| female | 8 | 204 | 0.5000000 | 0.5012300 | 0.500 | 0.5000000 | 0.741300 | 0.00 | 1.00 | 1.00 | 0.0000000 | -2.0097799 | 0.0350931 |

For the first example, we focus on verbal ability at Grade 2 as an outcome (`verb2` in the data frame `wiscsub`).

Examining the distribution for `verb2`.

`describe(wiscsub$verb2)`

| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| X1 | 1 | 204 | 25.41534 | 6.106377 | 25.98 | 25.40262 | 6.567918 | 5.95 | 39.85 | 33.9 | -0.0639307 | -0.3445494 | 0.4275319 |

```
ggplot(data=wiscsub, aes(x=verb2)) +
  geom_histogram(binwidth=2.5, fill="white", color="black", boundary=0) +
  xlab("Verbal Ability Grade 2") + ylab("Count") +
  xlim(0,50) +
  theme_classic()
```

The simplest model is an intercept-only model. In this case, we would fit the model \[ verb_{2i} = \beta_0 + \epsilon_{i}\]

Written out explicitly with the “silent” 1 in it, we get \[ verb_{2i} = \beta_0(1_i) + \epsilon_{i}\] This is helpful for explicit translation into the R code, specifically the `formula` within the `lm()` function.

We fit the model using the following code. Note that the code states the ‘1’ predictor variable explicitly.

```
model1 <- lm(formula = verb2 ~ 1,
             data = wiscsub,
             na.action = na.exclude)
summary(model1)
```

```
##
## Call:
## lm(formula = verb2 ~ 1, data = wiscsub, na.action = na.exclude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.4653 -4.6403 0.5647 4.2822 14.4347
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.4153 0.4275 59.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.106 on 203 degrees of freedom
```

```
#After noting that there is no R-squared given, we ask for it explicitly
summary(model1)$r.squared
```

`## [1] 0`

The output indicates that \(\beta_0\) = 25.4153, and its standard error = 0.4275.

The intercept reflects the expected value of the outcome variable when all of the predictor variables (i.e. \(\left\{ x_{1i}, ..., x_{qi}\right\}\)) = 0. So, in the absence of any additional information other than the descriptive statistics of \(verb_{2i}\), what is our best guess for a person’s \(verb_{2i}\) score? It is the mean of \(verb_{2i}\). The regression above confirms this notion; regressing the outcome on a vector of 1s allows us to ‘recover’ the mean.

`mean(wiscsub$verb2)`

`## [1] 25.41534`
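The same result can be obtained with the matrix algebra from earlier: when \(\boldsymbol{X}\) is just a column of 1s, \(\left(\boldsymbol{X'}\boldsymbol{X}\right)^{-1}\left(\boldsymbol{X'}\boldsymbol{Y}\right)\) reduces to \(\frac{1}{N}\sum y_i\), the mean. A quick sketch (assuming `wiscsub` is loaded as above):

```
#X is an N x 1 column of 1s
X <- matrix(1, nrow = nrow(wiscsub), ncol = 1)
Y <- wiscsub$verb2

#(X'X)^{-1}(X'Y) = sum(Y)/N = mean(Y)
solve(t(X) %*% X) %*% (t(X) %*% Y)
```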

Yes, we recovered the *mean*, but we did not attempt to explain any of the *variance*. It thus makes sense that the R-squared = 0.

Let’s build up the model further.

For example, we could attempt to explain some of the *between-person variance* in the Grade 2 verbal score from the Grade 1 verbal scores. But, before we do, let’s examine the distribution of the *between-person differences* in the Grade 1 verbal scores.

```
ggplot(wiscsub, aes(x=verb1)) +
  geom_histogram(binwidth=2.5, fill="white", color="black", boundary=0) +
  xlab("Verbal Ability Grade 1") +
  ylab("Count") +
  xlim(0,50) +
  theme_classic()
```

And the relation between the Grade 2 and Grade 1 verbal ability scores.

```
ggplot(wiscsub, aes(x=verb1, y=verb2)) +
  geom_point() +
  stat_ellipse(color="blue", alpha=.7) +
  xlab("Verbal Ability Grade 1") +
  ylab("Verbal Ability Grade 2") +
  ylim(0,45) +
  xlim(0,45) +
  theme_classic()
```

Our regression model becomes

\[ verb_{2i} = \beta_0(1_i) + \beta_1(verb_{1i}) + \epsilon_{i}\]

```
model2 <- lm(verb2 ~ 1 + verb1,
             data = wiscsub,
             na.action = na.exclude)
summary(model2)
```

```
##
## Call:
## lm(formula = verb2 ~ 1 + verb1, data = wiscsub, na.action = na.exclude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.5305 -3.0362 0.2526 2.7147 12.5020
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.62965 1.05164 10.11 <2e-16 ***
## verb1 0.75495 0.05149 14.66 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.261 on 202 degrees of freedom
## Multiple R-squared: 0.5156, Adjusted R-squared: 0.5132
## F-statistic: 215 on 1 and 202 DF, p-value: < 2.2e-16
```

How do we interpret the parameters here?

The intercept, \(\beta_0\), is the expected value of the outcome variable when all of the predictor variables equal zero. So, we would expect a child to have a Grade 2 verbal score of 10.62965 *if* they had a Grade 1 verbal score of 0.

The slope, \(\beta_1\), is the expected difference in the outcome variable for each 1-unit difference in the predictor variable. So, *across children*, for each 1-point difference in Grade 1 verbal score, we would expect a 0.75495-point difference in the Grade 2 verbal score.
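For instance, the expected Grade 2 score for a child with a Grade 1 score of 20 can be computed by hand from the fitted coefficients, or with `predict()`. A sketch, assuming `model2` has been fit as above (the value 20 is just an illustrative score):

```
#expected verb2 for verb1 = 20, computed by hand as b0 + b1*x
b <- coef(model2)
b[1] + b[2]*20

#equivalently, using predict() with new data
predict(model2, newdata = data.frame(verb1 = 20))
```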

We can plot the relation between `verb1` and `verb2`, and include the predicted line from the analysis.

```
ggplot(data=wiscsub, aes(x=verb1, y=verb2)) +
  geom_point(size = 2, shape=19) +
  geom_smooth(method=lm, se=TRUE, fullrange=TRUE, colour="red", size=2) +
  labs(x = "Verbal Ability Grade 1", y = "Verbal Ability Grade 2") +
  xlim(0,50) +
  ylim(0,50) +
  #theme with white background
  theme_bw() +
  #eliminate background, gridlines, and chart border
  theme(plot.background = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank()) +
  #draw x and y axis lines
  theme(axis.line = element_line(color = 'black')) +
  #set size of axis labels and titles
  theme(axis.text = element_text(size=12),
        axis.title = element_text(size=14))
```