0.1 Overview

This tutorial reviews regression analysis, a cornerstone method that supports many of the other methods we will consider. We begin by translating the basic equations into matrix form in order to connect with the basic matrix algebra covered in class. Then, we work through examples of how to fit regression models for continuous, count, and binary outcomes.

0.2 Outline

In this session we cover …

A. Mathematical Model Basics
B. A Set of Examples
B1. \(y_i\) is a continuous, Gaussian (normal) distributed variable = Multiple Regression
B2. \(y_i\) is an ordinal (count), Poisson distributed variable = Poisson Regression
B3. \(y_i\) is a binary, Binomial (Bernoulli) distributed variable = Logistic Regression

0.2.0.1 Prelim - Loading libraries used in this script.

library(psych)
library(ggplot2)

0.3 Mathematical Model Basics

In basic form, a regression model describes a single outcome variable as a function of a set of predictor variables. \[ outcome_i = f(inputs_i) = f(\cdot) \] Generally, the regression model is written as

\[ y_{i} = \beta_0(1_i) + \beta_1(x_{1i}) + \beta_2(x_{2i}) + ... + \beta_q(x_{qi}) + \epsilon_{i} \]

where \(y_{i}\) is the value of the outcome variable for individual \(i\); \(\beta_0\) is an intercept parameter that indicates the expected value of \(y_i\) when the predictor variables, \(\left\{ x_{1i}, ..., x_{qi}\right\}\), are all \(= 0\); \(\left\{\beta_1,..., \beta_q\right\}\) are regression parameters that each indicate the unique relation between a specific predictor variable, \(x_{qi}\), and the outcome variable, \(y_i\); and \(\epsilon_{i}\) is the residual (i.e., random) portion of \(y_i\) that is unaccounted for by the deterministic (i.e., fixed) part of the model, \(\beta_0 + \beta_1(x_{1i}) + \beta_2(x_{2i}) + ... + \beta_q(x_{qi})\).

Given our long-term goals, we make connections between this representation of the model and the matrix algebra representation.

0.3.1 Regression and Matrix Notation

We can translate the regression model into matrix form. First, let’s take a simpler form of the model, an intercept-only model where

\[ y_i = 1_i \cdot \beta_0 + \epsilon_i\] Note that we have made the “silent” 1 explicit. This will become important later (e.g., when fitting growth models).

Translating into matrix form, \(y_i\) can be written as an \(N\) x 1 matrix (a column vector). More specifically, for \(i = 1\) to \(N\) individuals, \[ y_i = \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_N \end{array} \right] = \boldsymbol{Y}\]

(matrices are often designated as bold capital letters).

Doing the same for all the other parts of the model, we get

\[ \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_N \end{array} \right] = \left[ \begin{array}{c} 1 \\ 1 \\ \vdots \\ 1 \end{array} \right] [\beta_0] + \left[ \begin{array}{c} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_N \end{array} \right]\]

Note that we have taken care that each matrix is of an order that will allow for matrix multiplication.

What is the interpretation of \([\beta_0]\)?

What is the interpretation of \[\left[ \begin{array}{c} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_N \end{array} \right]\]?

Now, let’s expand our regression model by adding a predictor \(x_{1i}\). Our model becomes

\[ y_i = (1_i)\beta_0 + (x_{1i})\beta_1 + \epsilon_i \]

Written out explicitly in matrix form, the model is
\[ \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_N \end{array} \right] = \left[ \begin{array}{cc} 1 & x_{11}\\ 1 & x_{12}\\ \vdots & \vdots \\ 1 & x_{1N}\end{array} \right] \left[ \begin{array}{c}\beta_0\\ \beta_1\end{array}\right] + \left[ \begin{array}{c} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_N \end{array} \right]\]

Finally, extending the model to the general case with \(q\) predictor variables, we have \[ y_i = \beta_0 + \beta_1(x_{1i}) + \beta_2(x_{2i}) + ... + \beta_q(x_{qi}) + \epsilon_i \]

which is written out in matrix form as

\[ \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_N \end{array} \right] = \left[ \begin{array}{cccc} 1 & x_{11} & \ldots & x_{q1}\\ 1 & x_{12} & \ldots & x_{q2}\\ \vdots & \vdots & & \vdots \\ 1 & x_{1N} & \ldots & x_{qN}\end{array} \right] \left[ \begin{array}{c}\beta_0\\ \beta_1\\ \vdots\\ \beta_q\end{array}\right] + \left[ \begin{array}{c} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_N \end{array} \right]\]

When we implement this model in R, it will be important to know which portions of the model are in our data frame, \(y_i\) and \(\left\{x_{1i}, ..., x_{qi}\right\}\), and to have them structured properly. This will become clear in the examples below.
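As a preview of how these pieces map onto R, here is a minimal sketch (using a small, hypothetical toy data frame, not the tutorial data) of how model.matrix() builds the design matrix, including the “silent” column of 1s, from a model formula.

#hypothetical toy data frame, for illustration only
toy <- data.frame(y  = c(4, 7, 6, 9),
                  x1 = c(1, 2, 3, 4),
                  x2 = c(0, 1, 0, 1))

#model.matrix() constructs the design matrix from the right-hand side
#of the formula; the first column is the "silent" 1 for the intercept
model.matrix(y ~ 1 + x1 + x2, data = toy)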

Now that we have the model written out explicitly as matrices, we can easily simplify the notation.

\[ \boldsymbol{Y} = \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_N \end{array} \right]; \boldsymbol{X} = \left[ \begin{array}{cccc} 1 & x_{11} & \ldots & x_{q1}\\ 1 & x_{12} & \ldots & x_{q2}\\ \vdots & \vdots & & \vdots \\ 1 & x_{1N} & \ldots & x_{qN}\end{array} \right]; \boldsymbol{\beta} = \left[ \begin{array}{c}\beta_0\\ \beta_1\\ \vdots\\ \beta_q\end{array}\right]; \boldsymbol{\epsilon} = \left[ \begin{array}{c} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_N \end{array} \right] \]

In compact matrix notation, the regression model then can be written as

\[ \boldsymbol{Y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \]

In practice, we would like to know the contents of (i.e., solve for) \(\boldsymbol{\beta}\).

Assuming the model is correct, the expected value of \(\boldsymbol{\epsilon}\) is \(\boldsymbol{0}\), so in expectation \[ \boldsymbol{Y} = \boldsymbol{X}\boldsymbol{\beta}\] and we just need to solve for \(\boldsymbol{\beta}\).

Premultiplying both sides of the equation by \(\boldsymbol{X'}\), we obtain

\[ \boldsymbol{X'}\boldsymbol{Y} = \boldsymbol{X'}\boldsymbol{X}\boldsymbol{\beta} \] This gives us a quantity, \(\left(\boldsymbol{X'}\boldsymbol{X}\right)\), a square matrix that contains information about the relations among the \(x_q\) variables. When this matrix has an inverse, we can premultiply both sides by \(\left(\boldsymbol{X'}\boldsymbol{X}\right)^{-1}\) to obtain

\[ \left(\boldsymbol{X'}\boldsymbol{X}\right)^{-1} \left( \boldsymbol{X'}\boldsymbol{Y}\right) = \left(\boldsymbol{X'}\boldsymbol{X}\right)^{-1} (\boldsymbol{X'}\boldsymbol{X}) \boldsymbol{\beta} \]

Because \[\left(\boldsymbol{X'}\boldsymbol{X}\right)^{-1}\left(\boldsymbol{X'}\boldsymbol{X}\right) = \boldsymbol{I}\] the equation simplifies to \[ \left(\boldsymbol{X'}\boldsymbol{X}\right)^{-1}\left(\boldsymbol{X'}\boldsymbol{Y}\right) = \boldsymbol{I}\boldsymbol{\beta} \] and \[ \left(\boldsymbol{X'}\boldsymbol{X}\right)^{-1}\left(\boldsymbol{X'}\boldsymbol{Y}\right) = \boldsymbol{\beta} \] Yay! We’ve isolated the unknowns, \(\boldsymbol{\beta}\), on one side of the equation.

Quite literally, this algebra is what allows for estimation of the parameters when fitting a regression model to data.
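To make that concrete, here is a minimal sketch in R (using simulated data, purely for illustration) that computes \(\left(\boldsymbol{X'}\boldsymbol{X}\right)^{-1}\left(\boldsymbol{X'}\boldsymbol{Y}\right)\) directly and compares the result with the coefficients returned by lm().

#simulate a small data set (illustration only)
set.seed(1234)
n  <- 100
x1 <- rnorm(n, mean = 10, sd = 2)
y  <- 5 + 0.5*x1 + rnorm(n, mean = 0, sd = 1)

#design matrix X with the "silent" column of 1s, and outcome vector Y
X <- cbind(1, x1)
Y <- matrix(y, ncol = 1)

#solve for beta using the matrix formula
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% Y
beta_hat

#same estimates from lm()
coef(lm(y ~ 1 + x1))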

We will now work through some practical examples, staying aware that this kind of matrix algebra is being done in the background.

0.3.1.1 Prelim - Reading in Data File

Examples make use of a new WISC data file. For this tutorial, given our didactic purposes, the data have been supplemented with some simulated data. In particular, we have added a female variable that indicates each participant’s gender.

Reading in these example data.

#set filepath for data file
filepath <- "https://quantdev.ssri.psu.edu/sites/qdev/files/wisc3raw_gender.csv"

#read in the .csv file using the url() function
wisc3raw <- read.csv(file=url(filepath),header=TRUE)

Subsetting to variables of interest.

wiscsub <- wisc3raw[ , c("id","verb1","verb2","verb4","verb6","momed","grad", "female")]

Sample-level descriptives for the chosen variables.

describe(wiscsub)
vars n mean sd median trimmed mad min max range skew kurtosis se
id 1 204 102.5000000 59.0338886 102.500 102.5000000 75.612600 1.00 204.00 203.00 0.0000000 -1.2176609 4.1331989
verb1 2 204 19.5850490 5.8076950 19.335 19.4976829 5.411490 3.33 35.15 31.82 0.1301071 -0.0528376 0.4066200
verb2 3 204 25.4153431 6.1063772 25.980 25.4026220 6.567918 5.95 39.85 33.90 -0.0639307 -0.3445494 0.4275319
verb4 4 204 32.6077451 7.3198841 32.820 32.4240244 7.183197 12.60 52.84 40.24 0.2317998 -0.0776524 0.5124944
verb6 5 204 43.7499020 10.6650511 42.545 43.4560976 11.297412 17.35 72.59 55.24 0.2356459 -0.3605210 0.7467029
momed 6 204 10.8112745 2.6982790 11.500 10.9969512 2.965200 5.50 18.00 12.50 -0.3586143 0.0056095 0.1889173
grad 7 204 0.2254902 0.4189328 0.000 0.1585366 0.000000 0.00 1.00 1.00 1.3040954 -0.3007374 0.0293312
female 8 204 0.5000000 0.5012300 0.500 0.5000000 0.741300 0.00 1.00 1.00 0.0000000 -2.0097799 0.0350931

1 Multiple Regression

1.1 Regression with a Normally Distributed (Gaussian) Outcome

For the first example, we focus on verbal ability at Grade 2 as an outcome (verb2 in the data frame wiscsub).

Examining the distribution for ‘verb2’.

describe(wiscsub$verb2)
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 204 25.41534 6.106377 25.98 25.40262 6.567918 5.95 39.85 33.9 -0.0639307 -0.3445494 0.4275319
ggplot(data=wiscsub, aes(x=verb2)) + 
  geom_histogram(binwidth=2.5, fill="white", color="black", boundary=0) +
  xlab("Verbal Ability Grade 2") + ylab("Count") +
  xlim(0,50) +
  theme_classic()

The simplest model is an intercept-only model. In this case, we would fit the model \[ verb_{2i} = b_0 + \epsilon_{i}\]

Written out explicitly with the “silent” 1 in it, we get \[ verb_{2i} = b_0(1_i) + \epsilon_{i}\] This is helpful for explicit translation into the R code, specifically the formula within the lm() function.

We fit the model using the following code … note that the code has the ‘1’ predictor variable stated explicitly.

model1 <- lm(formula = verb2 ~ 1,
              data = wiscsub,
              na.action = na.exclude)
summary(model1)
## 
## Call:
## lm(formula = verb2 ~ 1, data = wiscsub, na.action = na.exclude)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.4653  -4.6403   0.5647   4.2822  14.4347 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  25.4153     0.4275   59.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.106 on 203 degrees of freedom
#After noting that there is no R-squared given, we ask for it explicitly 
summary(model1)$r.squared
## [1] 0

The output indicates that \(b_0\) = 25.4153, and its standard error = 0.4275.

The intercept reflects the expected value of the outcome variable when all of the predictor variables (i.e. \(\left\{ x_{1i}, ..., x_{qi}\right\}\)) = 0. So, in the absence of any additional information other than the descriptive statistics of \(verb_{2i}\), what is our best guess for a person’s \(verb_{2i}\) score? It is the mean of \(verb_{2i}\). The regression above confirms this notion; regressing the outcome on a vector of 1s allows us to ‘recover’ the mean.

mean(wiscsub$verb2)
## [1] 25.41534

Yes, we recovered the mean, but we did not attempt to explain any of the variance. It thus makes sense that the R-squared = 0.

Let’s build up the model further.

For example, we could attempt to explain some of the between-person variance in the Grade 2 verbal score from the Grade 1 verbal scores. But, before we do, let’s examine the distribution of the between-person differences in the Grade 1 verbal scores.

ggplot(wiscsub, aes(x=verb1)) + 
  geom_histogram(binwidth=2.5, fill="white", color="black", boundary=0) +
  xlab("Verbal Ability Grade 1") + 
  ylab("Count") +
  xlim(0,50) +
  theme_classic()

And the relation between the Grade 2 and Grade 1 verbal ability scores.

ggplot(wiscsub, aes(x=verb1, y = verb2)) + 
  geom_point() +
  stat_ellipse(color="blue", alpha=.7) +
  xlab("Verbal Ability Grade 1") + 
  ylab("Verbal Ability Grade 2") +
  ylim(0,45) + 
  xlim(0,45) +
  theme_classic()

Our regression model becomes

\[ verb_{2i} = b_0(1_i) + b_1(verb_{1i}) + \epsilon_{i}\]

model2 <- lm(verb2 ~ 1 + verb1,
              data = wiscsub,
              na.action = na.exclude)
summary(model2)
## 
## Call:
## lm(formula = verb2 ~ 1 + verb1, data = wiscsub, na.action = na.exclude)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.5305  -3.0362   0.2526   2.7147  12.5020 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.62965    1.05164   10.11   <2e-16 ***
## verb1        0.75495    0.05149   14.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.261 on 202 degrees of freedom
## Multiple R-squared:  0.5156, Adjusted R-squared:  0.5132 
## F-statistic:   215 on 1 and 202 DF,  p-value: < 2.2e-16

How do we interpret the parameters here?

The intercept, \(b_0\), is the expected value for the outcome variable when all of the predictor variables equal zero. So, we would expect a child to have a Grade 2 verbal score of 10.62965 if they have a Grade 1 verbal score of 0.

The slope, \(b_1\), is the expected difference in the outcome variable for each 1-unit difference in the predictor variable. So, across children, for each 1-point difference in Grade 1 verbal score, we would expect a 0.75495-point difference in Grade 2 verbal score.
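As a quick check on these interpretations, we can obtain model-implied scores from model2 (fit above) for a few Grade 1 values; the prediction at verb1 = 0 matches the intercept, and predictions for verb1 scores that differ by 1 point differ by exactly the slope.

#model-implied Grade 2 scores at selected Grade 1 scores
predict(model2, newdata = data.frame(verb1 = c(0, 20, 21)))

#the difference between the last two predictions equals the slope
coef(model2)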

We can plot the relation between ‘verb1’ and ‘verb2’, and include the predicted line from the analysis.

ggplot(data=wiscsub, aes(x=verb1,y=verb2)) +
  geom_point(size = 2, shape=19) +
  geom_smooth(method=lm,se=TRUE,fullrange=TRUE,colour="red", size=2) +
  labs(x= "Verbal Ability Grade 1", y= "Verbal Ability Grade 2") +
  xlim(0,50) +
  ylim(0,50) +
  #theme with white background
  theme_bw() +
  #eliminate background, gridlines, and chart border
  theme(
    plot.background = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank()
  ) +
  #draws x and y axis line
  theme(axis.line = element_line(color = 'black')) +
  #set size of axis labels and titles
  theme(axis.text = element_text(size=12),
        axis.title = element_text(size=14))