Overview

In this tutorial we walk through the very basics of conducting confirmatory factor analysis (CFA) in R. This is not comprehensive coverage, just enough to get started. It was prepared as a one-off for a guest lecture.

Outline

In this session we cover …

A. Introduction to the Factor Analysis Model
B. Example 1: Basic CFA orientation & interpretation
C. Example 2: A Different Set of Variables

Prelim - Loading libraries used in this script.

library(psych)
library(ggplot2)
library(corrplot) #plotting correlation matrices
library(lavaan)  #for fitting structural equation models
library(semPlot)  #for automatically making diagrams 

A. Introduction to the Factor Analysis Model

Recall from last time that the basic factor analysis model is written as a series of equations of the form …

\[y_{pi} = \lambda_{pq} f_{qi} + u_{pi}\] where \(y_{pi}\) is individual i’s score on the pth observed variable, \(f_{qi}\) is individual i’s score on the qth latent common factor, \(u_{pi}\) is individual i’s score on the pth latent unique factor, and \(\lambda_{pq}\) is the factor loading that indicates the relation between the pth observed variable and the qth latent common factor.

Typically, we have multiple observed variables and one or more common factors. For instance in the 4 variable, 2 factor case we would have …

\[y_{1i} = \lambda_{11} f_{1i} + \lambda_{12} f_{2i} + u_{1i}\] \[y_{2i} = \lambda_{21} f_{1i} + \lambda_{22} f_{2i} + u_{2i}\] \[y_{3i} = \lambda_{31} f_{1i} + \lambda_{32} f_{2i} + u_{3i}\] \[y_{4i} = \lambda_{41} f_{1i} + \lambda_{42} f_{2i} + u_{4i}\]

which can be written in a compact matrix form as

\[ \boldsymbol{Y_{i}} = \boldsymbol{\Lambda}\boldsymbol{F_{i}} + \boldsymbol{U_{i}} \] where \(\boldsymbol{Y_{i}}\) is a \(p \times 1\) vector of observed variable scores, \(\boldsymbol{\Lambda}\) is a \(p \times q\) matrix of factor loadings, \(\boldsymbol{F_{i}}\) is a \(q \times 1\) vector of common factor scores, and \(\boldsymbol{U_{i}}\) is a \(p \times 1\) vector of unique factor scores.
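As a concrete illustration (a minimal sketch with made-up loading values, not part of the tutorial data), the matrix form can be used to simulate scores for the 4-variable, 2-factor case:

```r
# Simulate N persons' scores from Y_i = Lambda F_i + U_i
# (hypothetical loadings; row i of Y holds person i's 4 observed scores)
set.seed(1234)
N <- 204
Lambda <- matrix(c(.8, .7, 0, 0,    # loadings on factor 1
                   0, 0, .6, .9),   # loadings on factor 2
                 nrow = 4, ncol = 2)
F <- matrix(rnorm(N * 2), nrow = N, ncol = 2)           # common factor scores
U <- matrix(rnorm(N * 4, sd = .5), nrow = N, ncol = 4)  # unique factor scores
Y <- F %*% t(Lambda) + U   # N x 4 matrix of observed variable scores
dim(Y)                     # 204 rows (persons) by 4 columns (variables)
```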

Extension to multiple persons provides for mapping to the observed covariance (or correlation) matrix, \(\boldsymbol{\Sigma} = \frac{1}{N}\boldsymbol{Y}'\boldsymbol{Y}\) (with \(\boldsymbol{Y}\) in deviation-score form), and the common factor model becomes

\[ \boldsymbol{\Sigma} = \boldsymbol{\Lambda}\boldsymbol{\Psi}\boldsymbol{\Lambda}' + \boldsymbol{\Theta} \] where \(\boldsymbol{\Sigma}\) is a \(p \times p\) covariance (or correlation) matrix of the observed variables, \(\boldsymbol{\Lambda}\) is a \(p \times q\) matrix of factor loadings, \(\boldsymbol{\Psi}\) is a \(q \times q\) covariance matrix of the latent common factors, and \(\boldsymbol{\Theta}\) is a diagonal matrix of unique factor variances.
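To see the covariance expectations in action, here is a minimal sketch (with hypothetical parameter values, chosen so the implied matrix has unit diagonal) that builds the model-implied matrix for the 4-variable, 2-factor case:

```r
# Model-implied covariance: Sigma = Lambda Psi Lambda' + Theta
Lambda <- matrix(c(.8, .7, 0, 0,
                   0, 0, .6, .9), nrow = 4, ncol = 2)  # hypothetical loadings
Psi   <- matrix(c(1, .3,
                  .3, 1), nrow = 2)     # factor variances and covariance
Theta <- diag(c(.36, .51, .64, .19))    # unique variances on the diagonal
Sigma <- Lambda %*% Psi %*% t(Lambda) + Theta
round(Sigma, 3)   # symmetric 4 x 4 matrix with 1s on the diagonal
```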

Confirmatory Factor Analysis is usually conducted within a Structural Equation Modeling (SEM) framework. There are a number of SEM packages in R; we currently use the lavaan package.
Lots of good information and instruction can be found on the package website … http://lavaan.ugent.be , including one of the examples we use here.

Prelim - Reading in Repeated Measures Data

For this example, we use classic data from Osborne & Suddick (1972) that have been the basis for many examples in Jack McArdle and colleagues’ papers, courses, and workshops.
The data were obtained from 204 participants (rows) who completed a variety of cognitive ability tests that are part of the WISC.

#set filepath for data file
filepath <- "https://quantdev.ssri.psu.edu/sites/qdev/files/WISC_CFAexample.csv"
#read in the .csv file using the url() function
dat <- read.csv(file=url(filepath),header=TRUE)

#subsetting to the variables of interest 
dat <- dat[,c("id","info_06","comp_06","simi_06","voca_06","picc_06","pica_06","bloc_06","obje_06")]

Let’s have a quick look at the data file and the descriptives.

#data structure
head(dat,10)
##    id  info_06   comp_06   simi_06  voca_06   picc_06  pica_06    bloc_06
## 1   1 31.28655 25.626566 22.932331 22.21491 25.877193 8.310249  9.4098884
## 2   2 13.80117 14.786967  7.581454 15.37281 10.526316 7.417667  3.8915470
## 3   3 34.96970 34.675325 28.051948 26.84091 42.818182 6.443381  4.9586777
## 4   4 24.79532 31.390977  8.208020 20.19737 50.000000 2.708526  8.8676236
## 5   5 25.26316 30.263158 15.977444 35.41667 44.298246 9.295168  8.7400319
## 6   6 15.40230 23.399015 11.453202 20.56034 18.103448 8.832426  1.5047022
## 7   7 15.38012 -1.253133  2.318296 13.00439  8.596491 1.662050 -0.3827751
## 8   8 19.88304  6.704261 14.160401 14.86842 39.912281 6.463527  7.0813397
## 9   9 12.63158 13.847118 10.275689 23.22368 19.736842 6.401970 -0.3189793
## 10 10 23.69048 25.446429 11.415816 21.78571 35.178571 7.675439  7.8571429
##      obje_06
## 1  39.164087
## 2   4.179567
## 3  41.764706
## 4  63.054696
## 5  43.085655
## 6  13.793103
## 7   4.437564
## 8  19.762642
## 9  25.902993
## 10 12.710084
#descriptives
describe(dat)
##         vars   n   mean    sd median trimmed   mad   min    max  range
## id         1 204 102.50 59.03 102.50  102.50 75.61  1.00 204.00 203.00
## info_06    2 204  19.78  6.12  19.13   19.77  5.53  1.04  34.97  33.93
## comp_06    3 204  21.80  9.74  21.90   21.98 10.32 -1.25  46.49  47.74
## simi_06    4 204  14.90  7.56  14.54   14.57  7.83 -4.52  37.76  42.28
## voca_06    5 204  20.40  6.29  19.84   20.11  6.52  2.77  39.66  36.89
## picc_06    6 204  28.25 12.15  29.20   28.60 12.68 -3.42  55.18  58.60
## pica_06    7 204   9.31  8.41   7.40    7.97  5.77 -1.85  54.30  56.14
## bloc_06    8 204   6.94  6.60   5.81    6.12  4.62 -5.58  42.19  47.78
## obje_06    9 204  25.26 16.39  22.34   23.89 15.60 -4.54  71.10  75.64
##          skew kurtosis   se
## id       0.00    -1.22 4.13
## info_06 -0.06     0.17 0.43
## comp_06 -0.13    -0.49 0.68
## simi_06  0.39    -0.14 0.53
## voca_06  0.37     0.02 0.44
## picc_06 -0.24    -0.40 0.85
## pica_06  1.97     5.36 0.59
## bloc_06  2.05     6.77 0.46
## obje_06  0.66    -0.24 1.15

B. Example 1: Basic CFA orientation & interpretation

Our goal is to code a model that matches an a priori hypothesis about the structure of the data, and evaluate the match between that model, specifically the mean and variance-covariance expectations, and the observed data (i.e., the observed means and variance-covariance matrix).

We go through a series of models here that match those examined in the lecture slides.

Of particular interest for what we are doing is the correlation matrix of the first 4 substantive variables in the data set. We can look at both the numeric and visual versions of the correlation matrix.

round(cor(dat[,c("info_06","voca_06","comp_06","simi_06")]),2)
##         info_06 voca_06 comp_06 simi_06
## info_06    1.00    0.56    0.51    0.54
## voca_06    0.56    1.00    0.58    0.44
## comp_06    0.51    0.58    1.00    0.45
## simi_06    0.54    0.44    0.45    1.00
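One quick, informal check (not part of the original tutorial) is to inspect the eigenvalues of this correlation matrix; a single dominant eigenvalue is consistent with one common factor underlying the four tests:

```r
# Correlation matrix of the 4 verbal subtests, as printed above
R <- matrix(c(1.00, 0.56, 0.51, 0.54,
              0.56, 1.00, 0.58, 0.44,
              0.51, 0.58, 1.00, 0.45,
              0.54, 0.44, 0.45, 1.00), nrow = 4, byrow = TRUE)
round(eigen(R)$values, 2)  # first eigenvalue is much larger than the rest
```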

And a compact visual version from the corrplot package.

#visual correlation matrix
corrplot(cor(dat[,c("info_06","voca_06","comp_06","simi_06")]), order = "original", tl.col='black', tl.cex=.75) 

We will consider a set of 3 models …
* 0-Factor model
* 1-Factor model
* 2-Factor model

Each time we engage with … Model Specification, Model Estimation, and Model Evaluation.

0-Factor Model

We start with a 0-factor model - basically a null model. But it still embodies a set of assumptions - namely, that scores on one test are NOT related to scores on any of the other tests, i.e., that there are no covariances among the 4 tests.

Model Specification

First, code the model into a lavaan model object. Note that there are a variety of short-hand ways to program these models (i.e., defaults). We take a very explicit and comprehensive approach - where we basically ‘draw’ every single arrow into the model.

Basically, there are four formula types that correspond to different kinds of arrows.

* latent variable definition: =~ (“is measured by”)
* regression: ~ (“is regressed on”)
* (residual) (co)variance: ~~ (“is correlated with”)
* intercept: ~ 1 (“intercept”)

A complete lavaan model syntax is simply a combination of these formula types, enclosed between single quotes.

We set up a simple zero factor model …

#Model with 0 common factors 
verbal_0factor <- ' #start of model

# latent variable definitions (common factors)
# latent variable variances
# latent variable covariances
# latent variable means

# manifest variable variances (uniquenesses)
  info_06 ~~ info_06
  voca_06 ~~ voca_06
  comp_06 ~~ comp_06
  simi_06 ~~ simi_06
            
# manifest variable covariances (uniquenesses)

#manifest variable means 
  info_06 ~ 1
  voca_06 ~ 1
  comp_06 ~ 1
  simi_06 ~ 1
' #end of model
Model Fitting

Second, we fit the specified model to the data to obtain parameter estimates.

fit0 <- lavaan(verbal_0factor, data=dat, mimic = "mplus")
summary(fit0, standardized=TRUE, fit.measures=TRUE)
## lavaan (0.5-23.1097) converged normally after   8 iterations
## 
##   Number of observations                           204
## 
##   Number of missing patterns                         1
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic              261.106
##   Degrees of freedom                                 6
##   P-value (Chi-square)                           0.000
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic              261.106
##   Degrees of freedom                                 6
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.000
##   Tucker-Lewis Index (TLI)                       0.000
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -2777.649
##   Loglikelihood unrestricted model (H1)      -2647.095
## 
##   Number of free parameters                          8
##   Akaike (AIC)                                5571.297
##   Bayesian (BIC)                              5597.842
##   Sample-size adjusted Bayesian (BIC)         5572.496
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.457
##   90 Percent Confidence Interval          0.410  0.505
##   P-value RMSEA <= 0.05                          0.000
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.337
## 
## Parameter Estimates:
## 
##   Information                                 Observed
##   Standard Errors                             Standard
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##     info_06          19.776    0.427   46.273    0.000   19.776    3.240
##     voca_06          20.396    0.439   46.416    0.000   20.396    3.250
##     comp_06          21.797    0.680   32.036    0.000   21.797    2.243
##     simi_06          14.903    0.528   28.224    0.000   14.903    1.976
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##     info_06          37.261    3.689   10.100    0.000   37.261    1.000
##     voca_06          39.390    3.900   10.100    0.000   39.390    1.000
##     comp_06          94.434    9.350   10.100    0.000   94.434    1.000
##     simi_06          56.881    5.632   10.100    0.000   56.881    1.000
Model Evaluation

Some basic questions to ask after fitting a model -
Did it work?
Does it make sense? Are the parameters viable?

Let’s look at a diagram of the model to check that it matches what we intended (this should also be done with the summary information).

semPaths(fit0, what="std", 
         sizeLat = 7, sizeMan = 7, edge.label.cex = .75)

Note that we asked for standardized estimates in the diagram.

We can also look more closely at how well the model fit, and interpret the model more generally.

* Note that there are 6 degrees of freedom in this model.
* Note how bad the fit of the model is.
* There is room for improvement. Do we have another theory?
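Where do the 6 degrees of freedom come from? A quick sketch of the counting (with mimic = "mplus" the means are modeled, so the observed moments include the 4 means as well as the 10 unique elements of the covariance matrix):

```r
# Degrees of freedom = observed moments minus free parameters
p <- 4
moments   <- p * (p + 1) / 2 + p   # 10 (co)variances + 4 means = 14
free_parm <- 4 + 4                 # 4 unique variances + 4 intercepts
moments - free_parm                # 6 degrees of freedom
```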

1-Factor Model

We have another theory - that the 4 tests are indicators of a single latent factor we’ll call Verbal ability.

Model Specification

We specify a 1-factor model by defining a latent variable and adding in some additional arrows. The factor must be indicated by manifest variables, and must have a variance (and possibly covariances and a mean).

We set up a 1-factor model …

#Model with 1 common factor 
verbal_1factor <- ' #start of model

# latent variable definitions (common factors)
  verbal =~ NA*info_06 + 
            NA*voca_06 +
            NA*comp_06 +
            NA*simi_06
  
# latent variable variances
  verbal ~~ 1*verbal

# latent variable covariances
# latent variable means

# manifest variable variances (uniquenesses)
  info_06 ~~ info_06
  voca_06 ~~ voca_06
  comp_06 ~~ comp_06
  simi_06 ~~ simi_06
            
# manifest variable covariances (uniquenesses)

#manifest variable means 
  info_06 ~ 1
  voca_06 ~ 1
  comp_06 ~ 1
  simi_06 ~ 1
' #end of model
Model Fitting

Second, we fit the specified model to the data to obtain parameter estimates.

fit1 <- lavaan(verbal_1factor, data=dat, mimic = "mplus")
summary(fit1, standardized=TRUE, fit.measures=TRUE)
## lavaan (0.5-23.1097) converged normally after  40 iterations
## 
##   Number of observations                           204
## 
##   Number of missing patterns                         1
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic                7.147
##   Degrees of freedom                                 2
##   P-value (Chi-square)                           0.028
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic              261.106
##   Degrees of freedom                                 6
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.980
##   Tucker-Lewis Index (TLI)                       0.939
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -2650.669
##   Loglikelihood unrestricted model (H1)      -2647.095
## 
##   Number of free parameters                         12
##   Akaike (AIC)                                5325.338
##   Bayesian (BIC)                              5365.155
##   Sample-size adjusted Bayesian (BIC)         5327.136
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.112
##   90 Percent Confidence Interval          0.032  0.206
##   P-value RMSEA <= 0.05                          0.088
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.025
## 
## Parameter Estimates:
## 
##   Information                                 Observed
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   verbal =~                                                             
##     info_06           4.582    0.412   11.112    0.000    4.582    0.751
##     voca_06           4.714    0.421   11.192    0.000    4.714    0.751
##     comp_06           7.013    0.659   10.642    0.000    7.013    0.722
##     simi_06           4.843    0.527    9.190    0.000    4.843    0.642
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .info_06          19.776    0.427   46.273    0.000   19.776    3.240
##    .voca_06          20.396    0.439   46.416    0.000   20.396    3.250
##    .comp_06          21.797    0.680   32.036    0.000   21.797    2.243
##    .simi_06          14.903    0.528   28.224    0.000   14.903    1.976
##     verbal            0.000                               0.000    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##     verbal            1.000                               1.000    1.000
##    .info_06          16.271    2.419    6.725    0.000   16.271    0.437
##    .voca_06          17.167    2.517    6.820    0.000   17.167    0.436
##    .comp_06          45.258    6.176    7.328    0.000   45.258    0.479
##    .simi_06          33.426    4.031    8.293    0.000   33.426    0.588
Model Evaluation

Some basic questions to ask after fitting a model -
Did it work?
Does it make sense? Are the parameters viable?
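One way to check that the parameters make sense: in the standardized (Std.all) solution, each indicator’s variance decomposes into its communality (squared loading) plus its standardized uniqueness. For example, for info_06:

```r
# Standardized variance decomposition for one indicator
lambda_std <- 0.751   # Std.all loading of info_06 from the output above
lambda_std^2          # communality, ~0.56
1 - lambda_std^2      # ~0.44, matching the .info_06 Std.all uniqueness (0.437, up to rounding)
```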

Let’s look at a diagram of the model to check that it matches what we intended (this should also be done with the summary information).

semPaths(fit1, what="std", 
         sizeLat = 7, sizeMan = 7, edge.label.cex = .75)
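Finally, because the 0-factor model is nested within the 1-factor model, the two can be compared with a likelihood-ratio (chi-square difference) test. In lavaan, anova(fit0, fit1) does this directly; the sketch below reproduces the arithmetic from the test statistics reported above:

```r
# Chi-square difference test from the two model summaries
chisq_diff <- 261.106 - 7.147   # difference in test statistics
df_diff    <- 6 - 2             # difference in degrees of freedom
chisq_diff                      # 253.959
pchisq(chisq_diff, df = df_diff, lower.tail = FALSE)  # p-value, essentially 0
```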