In this tutorial we walk through the very basics of conducting confirmatory factor analysis (CFA) in R. This is not a comprehensive coverage, just something to get started. This is a one-off done as part of a guest lecture.
In this session we cover …
A. Introduction to the Factor Analysis Model
B. Example 1: Basic CFA orientation & interpretation
C. Example 2: A Different Set of Variables
library(psych)
library(ggplot2)
library(corrplot) #plotting correlation matrices
library(lavaan) #for fitting structural equation models
library(semPlot) #for automatically making diagrams
Recall from last time that the basic factor analysis model is written as series of equations of the form …
\[y_{pi} = \lambda_{pq} f_{qi} + u_{pi}\] where \(y_{pi}\) is individual i’s score on the pth observed variable, \(f_{qi}\) is individual i’s score on the qth latent common factor, \(u_{pi}\) is individual i’s score on the pth latent unique factor, and \(\lambda_{pq}\) is the factor loading that indicates the relation between the pth observed variable and the qth latent common factor.
Typically, we have multiple observed variables and one or more common factors. For instance in the 4 variable, 2 factor case we would have …
\[y_{1i} = \lambda_{11} f_{1i} + \lambda_{12} f_{2i} + u_{1i}\] \[y_{2i} = \lambda_{21} f_{1i} + \lambda_{22} f_{2i} + u_{2i}\] \[y_{3i} = \lambda_{31} f_{1i} + \lambda_{32} f_{2i} + u_{3i}\] \[y_{4i} = \lambda_{41} f_{1i} + \lambda_{42} f_{2i} + u_{4i}\]
which can be written in a compact matrix form as
\[ \boldsymbol{Y_{i}} = \boldsymbol{\Lambda}\boldsymbol{F_{i}} + \boldsymbol{U_{i}} \] where \(\boldsymbol{Y_{i}}\) is a \(p\) x 1 vector of observed variable scores, \(\boldsymbol{\Lambda}\) is a p x q matrix of factor loadings, \(\boldsymbol{F_{i}}\) is a \(q\) x 1 vector of common factor scores, and \(\boldsymbol{U_{i}}\) is a p x 1 vector of unique factor scores.
Extension to multiple persons provided for mapping to the observed correlation matrix, \(\boldsymbol{\Sigma} = \boldsymbol{Y}'\boldsymbol{Y}\) and the common factor model becomes
\[ \boldsymbol{\Sigma} = \boldsymbol{\Lambda}\boldsymbol{\Psi}\boldsymbol{\Lambda}' + \boldsymbol{\Theta} \] where \(\boldsymbol{\Sigma}\) is a p x p covariance (or correlation) matrix of the observed variables, \(\boldsymbol{\Lambda}\) is a p x q matrix of factor loadings, \(\boldsymbol{\Psi}\) is a q x q covariance matrix of the latent factor variables, and \(\boldsymbol{\Theta}\) is a diagonal matrix of unique factor variances.
Confirmatory Factor Analysis is usually conduced within a Structural Equation Modeling (SEM) framework. There are a number of SEM packages in R. We currently tend to be using the lavaan
package.
Lots of good information and instruction can be found on the package website … http://lavaan.ugent.be , including one of the examples we use here.
For this example, we use classic data from Osborne & Sudick (1972) that has been the basis for many examples in Jack McArdles and colleagues papers, courses, and workshops.
The data were obtained from 204 participants (rows) who completed a vareity of cognitive ability tests that are part of the WISC.
#set filepath for data file
filepath <- "https://quantdev.ssri.psu.edu/sites/qdev/files/WISC_CFAexample.csv"
#read in the .csv file using the url() function
dat <- read.csv(file=url(filepath),header=TRUE)
#subsetting to the variables of interest
dat <- dat[,c("id","info_06","comp_06","simi_06","voca_06","picc_06","pica_06","bloc_06","obje_06")]
Lets have a quick look at the data file and the descriptives.
#data structure
head(dat,10)
## id info_06 comp_06 simi_06 voca_06 picc_06 pica_06 bloc_06
## 1 1 31.28655 25.626566 22.932331 22.21491 25.877193 8.310249 9.4098884
## 2 2 13.80117 14.786967 7.581454 15.37281 10.526316 7.417667 3.8915470
## 3 3 34.96970 34.675325 28.051948 26.84091 42.818182 6.443381 4.9586777
## 4 4 24.79532 31.390977 8.208020 20.19737 50.000000 2.708526 8.8676236
## 5 5 25.26316 30.263158 15.977444 35.41667 44.298246 9.295168 8.7400319
## 6 6 15.40230 23.399015 11.453202 20.56034 18.103448 8.832426 1.5047022
## 7 7 15.38012 -1.253133 2.318296 13.00439 8.596491 1.662050 -0.3827751
## 8 8 19.88304 6.704261 14.160401 14.86842 39.912281 6.463527 7.0813397
## 9 9 12.63158 13.847118 10.275689 23.22368 19.736842 6.401970 -0.3189793
## 10 10 23.69048 25.446429 11.415816 21.78571 35.178571 7.675439 7.8571429
## obje_06
## 1 39.164087
## 2 4.179567
## 3 41.764706
## 4 63.054696
## 5 43.085655
## 6 13.793103
## 7 4.437564
## 8 19.762642
## 9 25.902993
## 10 12.710084
#descriptives
describe(dat)
## vars n mean sd median trimmed mad min max range
## id 1 204 102.50 59.03 102.50 102.50 75.61 1.00 204.00 203.00
## info_06 2 204 19.78 6.12 19.13 19.77 5.53 1.04 34.97 33.93
## comp_06 3 204 21.80 9.74 21.90 21.98 10.32 -1.25 46.49 47.74
## simi_06 4 204 14.90 7.56 14.54 14.57 7.83 -4.52 37.76 42.28
## voca_06 5 204 20.40 6.29 19.84 20.11 6.52 2.77 39.66 36.89
## picc_06 6 204 28.25 12.15 29.20 28.60 12.68 -3.42 55.18 58.60
## pica_06 7 204 9.31 8.41 7.40 7.97 5.77 -1.85 54.30 56.14
## bloc_06 8 204 6.94 6.60 5.81 6.12 4.62 -5.58 42.19 47.78
## obje_06 9 204 25.26 16.39 22.34 23.89 15.60 -4.54 71.10 75.64
## skew kurtosis se
## id 0.00 -1.22 4.13
## info_06 -0.06 0.17 0.43
## comp_06 -0.13 -0.49 0.68
## simi_06 0.39 -0.14 0.53
## voca_06 0.37 0.02 0.44
## picc_06 -0.24 -0.40 0.85
## pica_06 1.97 5.36 0.59
## bloc_06 2.05 6.77 0.46
## obje_06 0.66 -0.24 1.15
Our goal is to code a model that matches an a priori hypothesis about the structure of the data, and evaluate the match between that model, specifically the mean and variance-covariance expectations, and the observed data (i.e., the observed means and variance-covaraince matrix).
We go through a series of models here that match those examined in the lecture slides.
Of particular interest for what we are doing is the correlation matrix of the first 4 substantive variables in the data set. We can look at both the numeric and visual versions of the correlation matrix.
round(cor(dat[,c("info_06","voca_06","comp_06","simi_06")]),2)
## info_06 voca_06 comp_06 simi_06
## info_06 1.00 0.56 0.51 0.54
## voca_06 0.56 1.00 0.58 0.44
## comp_06 0.51 0.58 1.00 0.45
## simi_06 0.54 0.44 0.45 1.00
And a compact visual version from the corrplot
package.
#visusal correlation matrix
corrplot(cor(dat[,c("info_06","voca_06","comp_06","simi_06")]), order = "original", tl.col='black', tl.cex=.75)
We will consider a set of 3 models …
* 0-Factor model * 1-Factor model * 2-Factor model
Each time we engage with … Model Specification, Model Estimation, and Model Evaluation.
We start with a 0-factor model - basically a null model. But it is still a set of assumptions. Namely, that scores on one test are NOT related to scores on any of the other tests - i.e., that there are no covariances among the 4 tests.
First, code the model into a lavaan model object. Note that there are a variety of short-hand ways to program these models (i.e, defualts). We take a very explicit and comprehensive approach - where we basically ‘draw’ every single arrow into the model.
Basically, there are four formula types that correspond to different kinds of arrows.
latent variable definition =~ “is measured by” regression ~ “is regressed on” (residual) (co)variance ~~ “is correlated with” intercept ~ 1 “intercept” A complete lavaan model syntax is simply a combination of these formula types, enclosed between single quotes.
We set up a simple zero factor model …
#Model with 0 common factors
verbal_0factor <- ' #start of model
# latent variable definitions (common factors)
# latent variable variances
# latent variable covariances
# latent variable means
# manifest variable variances (uniquenesses)
info_06 ~~ info_06
voca_06 ~~ voca_06
comp_06 ~~ comp_06
simi_06 ~~ simi_06
# manifest variable covariances (uniquenesses)
#manifest variable means
info_06 ~ 1
voca_06 ~ 1
comp_06 ~ 1
simi_06 ~ 1
' #end of model
Second, we fit the specified model to the data to obrain parameter estimates.
fit0 <- lavaan(verbal_0factor, data=dat, mimic = "mplus")
summary(fit0, standardized=TRUE, fit.measures=TRUE)
## lavaan (0.5-23.1097) converged normally after 8 iterations
##
## Number of observations 204
##
## Number of missing patterns 1
##
## Estimator ML
## Minimum Function Test Statistic 261.106
## Degrees of freedom 6
## P-value (Chi-square) 0.000
##
## Model test baseline model:
##
## Minimum Function Test Statistic 261.106
## Degrees of freedom 6
## P-value 0.000
##
## User model versus baseline model:
##
## Comparative Fit Index (CFI) 0.000
## Tucker-Lewis Index (TLI) 0.000
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -2777.649
## Loglikelihood unrestricted model (H1) -2647.095
##
## Number of free parameters 8
## Akaike (AIC) 5571.297
## Bayesian (BIC) 5597.842
## Sample-size adjusted Bayesian (BIC) 5572.496
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.457
## 90 Percent Confidence Interval 0.410 0.505
## P-value RMSEA <= 0.05 0.000
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.337
##
## Parameter Estimates:
##
## Information Observed
## Standard Errors Standard
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## info_06 19.776 0.427 46.273 0.000 19.776 3.240
## voca_06 20.396 0.439 46.416 0.000 20.396 3.250
## comp_06 21.797 0.680 32.036 0.000 21.797 2.243
## simi_06 14.903 0.528 28.224 0.000 14.903 1.976
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## info_06 37.261 3.689 10.100 0.000 37.261 1.000
## voca_06 39.390 3.900 10.100 0.000 39.390 1.000
## comp_06 94.434 9.350 10.100 0.000 94.434 1.000
## simi_06 56.881 5.632 10.100 0.000 56.881 1.000
Some basic questions to ask after fitting a model -
Did it work?
Does it make sense? Are the parameters viable?
Lets look at a diagram of the model to check if it matches what we intended (this should also be done with the summary information.)
semPaths(fit0, what="std",
sizeLat = 7, sizeMan = 7, edge.label.cex = .75)
Note that we asked for standardized estimates in the diagram.
We can also look more closely at how well the model fit, and interpret the model more generally. * Note that there are 6 degrees of freedom in this model. * Note how bad the fit of the model is. * There is room for improvement. Do we have another theory?
We have another theory - that the 4 tests are indicators of a single latent factor we’ll call Verbal ability.
We specify a 1-factor model by defining a latent variable and adding in some additional arrows. The factor must be indicated by manifest variables, and must have a variance (maybe covariance and mean).
Basically, there are four formula types that correspond to different kinds of arrows.
latent variable definition =~ “is measured by” regression ~ “is regressed on” (residual) (co)variance ~~ “is correlated with” intercept ~ 1 “intercept” A complete lavaan model syntax is simply a combination of these formula types, enclosed between single quotes.
We set up a 1-factor model …
#Model with 1 common factor
verbal_1factor <- ' #start of model
# latent variable definitions (common factors)
verbal =~ NA*info_06 +
NA*voca_06 +
NA*comp_06 +
NA*simi_06
# latent variable variances
verbal ~~ 1*verbal
# latent variable covariances
# latent variable means
# manifest variable variances (uniquenesses)
info_06 ~~ info_06
voca_06 ~~ voca_06
comp_06 ~~ comp_06
simi_06 ~~ simi_06
# manifest variable covariances (uniquenesses)
#manifest variable means
info_06 ~ 1
voca_06 ~ 1
comp_06 ~ 1
simi_06 ~ 1
' #end of model
Second, we fit the specified model to the data to obrain parameter estimates.
fit1 <- lavaan(verbal_1factor, data=dat, mimic = "mplus")
summary(fit1, standardized=TRUE, fit.measures=TRUE)
## lavaan (0.5-23.1097) converged normally after 40 iterations
##
## Number of observations 204
##
## Number of missing patterns 1
##
## Estimator ML
## Minimum Function Test Statistic 7.147
## Degrees of freedom 2
## P-value (Chi-square) 0.028
##
## Model test baseline model:
##
## Minimum Function Test Statistic 261.106
## Degrees of freedom 6
## P-value 0.000
##
## User model versus baseline model:
##
## Comparative Fit Index (CFI) 0.980
## Tucker-Lewis Index (TLI) 0.939
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -2650.669
## Loglikelihood unrestricted model (H1) -2647.095
##
## Number of free parameters 12
## Akaike (AIC) 5325.338
## Bayesian (BIC) 5365.155
## Sample-size adjusted Bayesian (BIC) 5327.136
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.112
## 90 Percent Confidence Interval 0.032 0.206
## P-value RMSEA <= 0.05 0.088
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.025
##
## Parameter Estimates:
##
## Information Observed
## Standard Errors Standard
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## verbal =~
## info_06 4.582 0.412 11.112 0.000 4.582 0.751
## voca_06 4.714 0.421 11.192 0.000 4.714 0.751
## comp_06 7.013 0.659 10.642 0.000 7.013 0.722
## simi_06 4.843 0.527 9.190 0.000 4.843 0.642
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .info_06 19.776 0.427 46.273 0.000 19.776 3.240
## .voca_06 20.396 0.439 46.416 0.000 20.396 3.250
## .comp_06 21.797 0.680 32.036 0.000 21.797 2.243
## .simi_06 14.903 0.528 28.224 0.000 14.903 1.976
## verbal 0.000 0.000 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## verbal 1.000 1.000 1.000
## .info_06 16.271 2.419 6.725 0.000 16.271 0.437
## .voca_06 17.167 2.517 6.820 0.000 17.167 0.436
## .comp_06 45.258 6.176 7.328 0.000 45.258 0.479
## .simi_06 33.426 4.031 8.293 0.000 33.426 0.588
Some basic questions to ask after fitting a model - are Did it work?
Does it make sense? Are the parameters viable?
Lets look at a diagram of the model to check if it matches what we intended (this should also be done with the summary information.)
semPaths(fit1, what="std",
sizeLat = 7, sizeMan = 7, edge.label.cex = .75)
Then, we can look more closely at how well the model fit, and interpret the model more generally.
We do model comparison…
Looking at fit statistics
#Compare model fit statistics
round(cbind(m1=inspect(fit0, 'fit.measures'), m2=inspect(fit1, 'fit.measures')),3)
## m1 m2
## npar 8.000 12.000
## fmin 0.640 0.018
## chisq 261.106 7.147
## df 6.000 2.000
## pvalue 0.000 0.028
## baseline.chisq 261.106 261.106
## baseline.df 6.000 6.000
## baseline.pvalue 0.000 0.000
## cfi 0.000 0.980
## tli 0.000 0.939
## nnfi 0.000 0.939
## rfi 0.000 0.918
## nfi 0.000 0.973
## pnfi 0.000 0.324
## ifi 0.000 0.980
## rni 0.000 0.980
## logl -2777.649 -2650.669
## unrestricted.logl -2647.095 -2647.095
## aic 5571.297 5325.338
## bic 5597.842 5365.155
## ntotal 204.000 204.000
## bic2 5572.496 5327.136
## rmsea 0.457 0.112
## rmsea.ci.lower 0.410 0.032
## rmsea.ci.upper 0.505 0.206
## rmsea.pvalue 0.000 0.088
## rmr 21.820 1.531
## rmr_nomean 21.820 1.531
## srmr 0.337 0.025
## srmr_bentler 0.337 0.025
## srmr_bentler_nomean 0.399 0.029
## srmr_bollen 0.337 0.025
## srmr_bollen_nomean 0.399 0.029
## srmr_mplus 0.337 0.025
## srmr_mplus_nomean 0.399 0.029
## cn_05 10.838 172.016
## cn_01 14.135 263.893
## gfi 0.953 0.998
## agfi 0.889 0.984
## pgfi 0.408 0.143
## mfi 0.535 0.987
## ecvi NA NA
Formal model test
#Chi-square difference test for nested models
anova(fit0, fit1)
## Chi Square Difference Test
##
## Df AIC BIC Chisq Chisq diff Df diff Pr(>Chisq)
## fit1 2 5325.3 5365.2 7.147
## fit0 6 5571.3 5597.8 261.106 253.96 4 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We reject the null hypothesis that the models fit the same.
We have yet another theory - that the 4 tests are indicators of two correlated latent factor we’ll call F1 and F2.
We specify a 2-factor model by defining the latent variables and adding in some additional arrows. Each factor must be indicated by manifest variables, and must have a variance (maybe covariance and mean).
We won’t go into it here, but identification constraints are needed and important.
We set up a 2-factor model …
#Model with 1 common factor
verbal_2factor <- ' #start of model
# latent variable definitions (common factors)
f1 =~ NA*info_06 +
NA*voca_06
f2 =~ NA*comp_06 +
NA*simi_06
# latent variable variances
f1 ~~ 1*f1
f2 ~~ 1*f2
# latent variable covariances
f1 ~~ f2
# latent variable means
# manifest variable variances (uniquenesses)
info_06 ~~ info_06
voca_06 ~~ voca_06
comp_06 ~~ comp_06
simi_06 ~~ simi_06
# manifest variable covariances (uniquenesses)
#manifest variable means
info_06 ~ 1
voca_06 ~ 1
comp_06 ~ 1
simi_06 ~ 1
' #end of model
Second, we fit the specified model to the data to obrain parameter estimates.
fit2 <- lavaan(verbal_2factor, data=dat, mimic = "mplus")
## Warning in lav_object_post_check(object): lavaan WARNING: covariance matrix of latent variables
## is not positive definite;
## use inspect(fit,"cov.lv") to investigate.
summary(fit2, standardized=TRUE, fit.measures=TRUE)
## lavaan (0.5-23.1097) converged normally after 55 iterations
##
## Number of observations 204
##
## Number of missing patterns 1
##
## Estimator ML
## Minimum Function Test Statistic 6.805
## Degrees of freedom 1
## P-value (Chi-square) 0.009
##
## Model test baseline model:
##
## Minimum Function Test Statistic 261.106
## Degrees of freedom 6
## P-value 0.000
##
## User model versus baseline model:
##
## Comparative Fit Index (CFI) 0.977
## Tucker-Lewis Index (TLI) 0.863
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -2650.498
## Loglikelihood unrestricted model (H1) -2647.095
##
## Number of free parameters 13
## Akaike (AIC) 5326.996
## Bayesian (BIC) 5370.132
## Sample-size adjusted Bayesian (BIC) 5328.944
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.169
## 90 Percent Confidence Interval 0.067 0.298
## P-value RMSEA <= 0.05 0.030
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.024
##
## Parameter Estimates:
##
## Information Observed
## Standard Errors Standard
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## f1 =~
## info_06 4.553 0.413 11.032 0.000 4.553 0.746
## voca_06 4.675 0.424 11.015 0.000 4.675 0.745
## f2 =~
## comp_06 6.882 0.696 9.889 0.000 6.882 0.708
## simi_06 4.778 0.538 8.886 0.000 4.778 0.634
##
## Covariances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## f1 ~~
## f2 1.035 0.061 17.067 0.000 1.035 1.035
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .info_06 19.776 0.427 46.273 0.000 19.776 3.240
## .voca_06 20.396 0.439 46.416 0.000 20.396 3.250
## .comp_06 21.797 0.680 32.036 0.000 21.797 2.243
## .simi_06 14.903 0.528 28.224 0.000 14.903 1.976
## f1 0.000 0.000 0.000
## f2 0.000 0.000 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## f1 1.000 1.000 1.000
## f2 1.000 1.000 1.000
## .info_06 16.529 2.423 6.821 0.000 16.529 0.444
## .voca_06 17.537 2.562 6.846 0.000 17.537 0.445
## .comp_06 47.070 6.912 6.810 0.000 47.070 0.498
## .simi_06 34.047 4.174 8.158 0.000 34.047 0.599
Some basic questions to ask after fitting a model -
Did it work?
Does it make sense? Are the parameters viable?
NOTICE THERE WAS A WARNING Something is wrong with this model. Look at the correlation between the factors, r = 1.03. That is not viable.
Lets look at a diagram of the model to check if it matches what we intended (this should also be done with the summary information.)
semPaths(fit2, what="std",
sizeLat = 7, sizeMan = 7, edge.label.cex = .75)
Then, we can look more closely at how well the model fit, and interpret the model more generally.
We do model comparison…
Looking at fit statistics
#Compare model fit statistics
round(cbind(m1=inspect(fit0, 'fit.measures'), m2=inspect(fit1, 'fit.measures'),m3=inspect(fit2, 'fit.measures')),3)
## m1 m2 m3
## npar 8.000 12.000 13.000
## fmin 0.640 0.018 0.017
## chisq 261.106 7.147 6.805
## df 6.000 2.000 1.000
## pvalue 0.000 0.028 0.009
## baseline.chisq 261.106 261.106 261.106
## baseline.df 6.000 6.000 6.000
## baseline.pvalue 0.000 0.000 0.000
## cfi 0.000 0.980 0.977
## tli 0.000 0.939 0.863
## nnfi 0.000 0.939 0.863
## rfi 0.000 0.918 0.844
## nfi 0.000 0.973 0.974
## pnfi 0.000 0.324 0.162
## ifi 0.000 0.980 0.978
## rni 0.000 0.980 0.977
## logl -2777.649 -2650.669 -2650.498
## unrestricted.logl -2647.095 -2647.095 -2647.095
## aic 5571.297 5325.338 5326.996
## bic 5597.842 5365.155 5370.132
## ntotal 204.000 204.000 204.000
## bic2 5572.496 5327.136 5328.944
## rmsea 0.457 0.112 0.169
## rmsea.ci.lower 0.410 0.032 0.067
## rmsea.ci.upper 0.505 0.206 0.298
## rmsea.pvalue 0.000 0.088 0.030
## rmr 21.820 1.531 1.481
## rmr_nomean 21.820 1.531 1.481
## srmr 0.337 0.025 0.024
## srmr_bentler 0.337 0.025 0.024
## srmr_bentler_nomean 0.399 0.029 0.029
## srmr_bollen 0.337 0.025 0.024
## srmr_bollen_nomean 0.399 0.029 0.029
## srmr_mplus 0.337 0.025 0.024
## srmr_mplus_nomean 0.399 0.029 0.029
## cn_05 10.838 172.016 116.154
## cn_01 14.135 263.893 199.891
## gfi 0.953 0.998 0.998
## agfi 0.889 0.984 0.971
## pgfi 0.408 0.143 0.071
## mfi 0.535 0.987 0.986
## ecvi NA NA NA
Formal model test
#Chi-square difference test for nested models
anova(fit1, fit2)
## Chi Square Difference Test
##
## Df AIC BIC Chisq Chisq diff Df diff Pr(>Chisq)
## fit2 1 5327.0 5370.1 6.8053
## fit1 2 5325.3 5365.2 7.1470 0.34172 1 0.5588
We cannot reject the null hypothesis that the models fit the same. And - the 2-factor model is bad anyway.
We conclude that the 1-factor model is the optimal model, of those evaluated.
Again, our goal is to code a model that matches an a priori hypothesis about the structure of the data, and evaluate the match between that model, specifically the mean and variance-covariance expectations, and the observed data (i.e., the observed means and variance-covaraince matrix).
We go through a series of models here that match those examined in the lecture slides.
Of particular interest for what we are doing is the correlation matrix of 4 substantive variables in the data set (CO, VO, BL, OB). We can look at both the numeric and visual versions of the correlation matrix.
round(cor(dat[,c("comp_06","voca_06","bloc_06","obje_06")]),2)
## comp_06 voca_06 bloc_06 obje_06
## comp_06 1.00 0.58 0.32 0.31
## voca_06 0.58 1.00 0.39 0.45
## bloc_06 0.32 0.39 1.00 0.55
## obje_06 0.31 0.45 0.55 1.00
And a compact visual version from the corrplot
package.
#visusal correlation matrix
corrplot(cor(dat[,c("comp_06","voca_06","bloc_06","obje_06")]), order = "original", tl.col='black', tl.cex=.75)
We will consider a set of 3 models …
* 0-Factor model * 1-Factor model * 2-Factor model
Each time we engage with … Model Specification, Model Estimation, and Model Evaluation.
We start with a 0-factor model - basically a null model. But it is still a set of assumptions. Namely, that scores on one test are NOT related to scores on any of the other tests - i.e., that there are no covariances among the 4 tests.
First, code the model into a lavaan model object. Note that there are a variety of short-hand ways to program these models (i.e, defualts). We take a very explicit and comprehensive approach - where we basically ‘draw’ every single arrow into the model.
Basically, there are four formula types that correspond to different kinds of arrows.
latent variable definition =~ “is measured by” regression ~ “is regressed on” (residual) (co)variance ~~ “is correlated with” intercept ~ 1 “intercept” A complete lavaan model syntax is simply a combination of these formula types, enclosed between single quotes.
We set up a simple zero factor model …
#Model with 0 common factors
General_0factor <- ' #start of model
# latent variable definitions (common factors)
# latent variable variances
# latent variable covariances
# latent variable means
# manifest variable variances (uniquenesses)
comp_06 ~~ comp_06
voca_06 ~~ voca_06
bloc_06 ~~ bloc_06
obje_06 ~~ obje_06
# manifest variable covariances (uniquenesses)
#manifest variable means
comp_06 ~ 1
voca_06 ~ 1
bloc_06 ~ 1
obje_06 ~ 1
' #end of model
Second, we fit the specified model to the data to obrain parameter estimates.
fit0 <- lavaan(General_0factor, data=dat, mimic = "mplus")
summary(fit0, standardized=TRUE, fit.measures=TRUE)
## lavaan (0.5-23.1097) converged normally after 8 iterations
##
## Number of observations 204
##
## Number of missing patterns 1
##
## Estimator ML
## Minimum Function Test Statistic 214.189
## Degrees of freedom 6
## P-value (Chi-square) 0.000
##
## Model test baseline model:
##
## Minimum Function Test Statistic 214.189
## Degrees of freedom 6
## P-value 0.000
##
## User model versus baseline model:
##
## Comparative Fit Index (CFI) 0.000
## Tucker-Lewis Index (TLI) 0.000
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -2950.942
## Loglikelihood unrestricted model (H1) -2843.848
##
## Number of free parameters 8
## Akaike (AIC) 5917.885
## Bayesian (BIC) 5944.430
## Sample-size adjusted Bayesian (BIC) 5919.084
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.412
## 90 Percent Confidence Interval 0.366 0.461
## P-value RMSEA <= 0.05 0.000
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.292
##
## Parameter Estimates:
##
## Information Observed
## Standard Errors Standard
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## comp_06 21.797 0.680 32.036 0.000 21.797 2.243
## voca_06 20.396 0.439 46.416 0.000 20.396 3.250
## bloc_06 6.942 0.461 15.063 0.000 6.942 1.055
## obje_06 25.258 1.145 22.059 0.000 25.258 1.544
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## comp_06 94.434 9.350 10.100 0.000 94.434 1.000
## voca_06 39.390 3.900 10.100 0.000 39.390 1.000
## bloc_06 43.333 4.291 10.100 0.000 43.333 1.000
## obje_06 267.462 26.483 10.100 0.000 267.462 1.000
Some basic questions to ask after fitting a model -
Did it work?
Does it make sense? Are the parameters viable?
Lets look at a diagram of the model to check if it matches what we intended (this should also be done with the summary information.)
semPaths(fit0, what="std",
sizeLat = 7, sizeMan = 7, edge.label.cex = .75)
Note that we asked for standardized estimates in the diagram.
We can also look more closely at how well the model fit, and interpret the model more generally. * Note that there are 6 degrees of freedom in this model. * Note how bad the fit of the model is. * There is room for improvement. Do we have another theory?
We have another theory - that the 4 tests are indicators of a single latent factor we’ll call G (general ability).
We specify a 1-factor model by defining a latent variable and adding in some additional arrows. The factor must be indicated by manifest variables, and must have a variance (maybe covariance and mean).
Basically, there are four formula types that correspond to different kinds of arrows.
latent variable definition =~ “is measured by” regression ~ “is regressed on” (residual) (co)variance ~~ “is correlated with” intercept ~ 1 “intercept” A complete lavaan model syntax is simply a combination of these formula types, enclosed between single quotes.
We set up a 1-factor model …
#Model with 1 common factor
General_1factor <- ' #start of model
# latent variable definitions (common factors)
G =~ NA*comp_06 +
NA*voca_06 +
NA*bloc_06 +
NA*obje_06
# latent variable variances
G ~~ 1*G
# latent variable covariances
# latent variable means
# manifest variable variances (uniquenesses)
comp_06 ~~ comp_06
voca_06 ~~ voca_06
bloc_06 ~~ bloc_06
obje_06 ~~ obje_06
# manifest variable covariances (uniquenesses)
#manifest variable means
comp_06 ~ 1
voca_06 ~ 1
bloc_06 ~ 1
obje_06 ~ 1
' #end of model
Second, we fit the specified model to the data to obrain parameter estimates.
fit1 <- lavaan(General_1factor, data=dat, mimic = "mplus")
summary(fit1, standardized=TRUE, fit.measures=TRUE)
## lavaan (0.5-23.1097) converged normally after 51 iterations
##
## Number of observations 204
##
## Number of missing patterns 1
##
## Estimator ML
## Minimum Function Test Statistic 31.453
## Degrees of freedom 2
## P-value (Chi-square) 0.000
##
## Model test baseline model:
##
## Minimum Function Test Statistic 214.189
## Degrees of freedom 6
## P-value 0.000
##
## User model versus baseline model:
##
## Comparative Fit Index (CFI) 0.859
## Tucker-Lewis Index (TLI) 0.576
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -2859.574
## Loglikelihood unrestricted model (H1) -2843.848
##
## Number of free parameters 12
## Akaike (AIC) 5743.148
## Bayesian (BIC) 5782.966
## Sample-size adjusted Bayesian (BIC) 5744.946
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.269
## 90 Percent Confidence Interval 0.191 0.355
## P-value RMSEA <= 0.05 0.000
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.064
##
## Parameter Estimates:
##
## Information Observed
## Standard Errors Standard
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## G =~
## comp_06 6.409 0.721 8.891 0.000 6.409 0.660
## voca_06 4.888 0.482 10.151 0.000 4.888 0.779
## bloc_06 3.788 0.544 6.961 0.000 3.788 0.575
## obje_06 10.053 1.322 7.606 0.000 10.053 0.615
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .comp_06 21.797 0.680 32.036 0.000 21.797 2.243
## .voca_06 20.396 0.439 46.416 0.000 20.396 3.250
## .bloc_06 6.942 0.461 15.063 0.000 6.942 1.055
## .obje_06 25.258 1.145 22.059 0.000 25.258 1.544
## G 0.000 0.000 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## G 1.000 1.000 1.000
## .comp_06 53.359 7.333 7.276 0.000 53.359 0.565
## .voca_06 15.493 3.415 4.536 0.000 15.493 0.393
## .bloc_06 28.985 3.880 7.470 0.000 28.985 0.669
## .obje_06 166.402 23.404 7.110 0.000 166.402 0.622
Some basic questions to ask after fitting a model - are Did it work?
Does it make sense? Are the parameters viable?
Lets look at a diagram of the model to check if it matches what we intended (this should also be done with the summary information.)
semPaths(fit1, what="std",
sizeLat = 7, sizeMan = 7, edge.label.cex = .75)
Then, we can look more closely at how well the model fit, and interpret the model more generally.
We do model comparison…
Looking at fit statistics
#Compare model fit statistics
round(cbind(m1=inspect(fit0, 'fit.measures'), m2=inspect(fit1, 'fit.measures')),3)
## m1 m2
## npar 8.000 12.000
## fmin 0.525 0.077
## chisq 214.189 31.453
## df 6.000 2.000
## pvalue 0.000 0.000
## baseline.chisq 214.189 214.189
## baseline.df 6.000 6.000
## baseline.pvalue 0.000 0.000
## cfi 0.000 0.859
## tli 0.000 0.576
## nnfi 0.000 0.576
## rfi 0.000 0.559
## nfi 0.000 0.853
## pnfi 0.000 0.284
## ifi 0.000 0.861
## rni 0.000 0.859
## logl -2950.942 -2859.574
## unrestricted.logl -2843.848 -2843.848
## aic 5917.885 5743.148
## bic 5944.430 5782.966
## ntotal 204.000 204.000
## bic2 5919.084 5744.946
## rmsea 0.412 0.269
## rmsea.ci.lower 0.366 0.191
## rmsea.ci.upper 0.461 0.355
## rmsea.pvalue 0.000 0.000
## rmr 31.629 8.300
## rmr_nomean 31.629 8.300
## srmr 0.292 0.064
## srmr_bentler 0.292 0.064
## srmr_bentler_nomean 0.345 0.076
## srmr_bollen 0.292 0.064
## srmr_bollen_nomean 0.345 0.076
## srmr_mplus 0.292 0.064
## srmr_mplus_nomean 0.345 0.076
## cn_05 12.993 39.860
## cn_01 17.012 60.738
## gfi 0.946 0.988
## agfi 0.875 0.916
## pgfi 0.406 0.141
## mfi 0.600 0.930
## ecvi NA NA
Formal model test
#Chi-square difference test for nested models
anova(fit0, fit1)
## Chi Square Difference Test
##
## Df AIC BIC Chisq Chisq diff Df diff Pr(>Chisq)
## fit1 2 5743.1 5783.0 31.453
## fit0 6 5917.9 5944.4 214.189 182.74 4 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We reject the null hypothesis that the models fit the same.
We have yet another theory - that the 4 tests are indicators of two correlated latent factor we’ll call Gf and Gc.
We specify a 2-factor model by defining the latent variables and adding in some additional arrows. Each factor must be indicated by manifest variables, and must have a variance (maybe covariance and mean).
We won’t go into it here, but identification constraints are needed and important.
We set up a 2-factor model …
#Model with 1 common factor
General_2factor <- ' #start of model
# latent variable definitions (common factors)
Gc =~ NA*comp_06 +
NA*voca_06
Gf =~ NA*bloc_06 +
NA*obje_06
# latent variable variances
Gc ~~ 1*Gc
Gf ~~ 1*Gf
# latent variable covariances
Gc ~~ Gf
# latent variable means
# manifest variable variances (uniquenesses)
comp_06 ~~ comp_06
voca_06 ~~ voca_06
bloc_06 ~~ bloc_06
obje_06 ~~ obje_06
# manifest variable covariances (uniquenesses)
#manifest variable means
comp_06 ~ 1
voca_06 ~ 1
bloc_06 ~ 1
obje_06 ~ 1
' #end of model
Second, we fit the specified model to the data to obrain parameter estimates.
fit2 <- lavaan(General_2factor, data=dat, mimic = "mplus")
summary(fit2, standardized=TRUE, fit.measures=TRUE)
## lavaan (0.5-23.1097) converged normally after 71 iterations
##
## Number of observations 204
##
## Number of missing patterns 1
##
## Estimator ML
## Minimum Function Test Statistic 1.167
## Degrees of freedom 1
## P-value (Chi-square) 0.280
##
## Model test baseline model:
##
## Minimum Function Test Statistic 214.189
## Degrees of freedom 6
## P-value 0.000
##
## User model versus baseline model:
##
## Comparative Fit Index (CFI) 0.999
## Tucker-Lewis Index (TLI) 0.995
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -2844.431
## Loglikelihood unrestricted model (H1) -2843.848
##
## Number of free parameters 13
## Akaike (AIC) 5714.862
## Bayesian (BIC) 5757.998
## Sample-size adjusted Bayesian (BIC) 5716.810
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.029
## 90 Percent Confidence Interval 0.000 0.191
## P-value RMSEA <= 0.05 0.394
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.010
##
## Parameter Estimates:
##
## Information Observed
## Standard Errors Standard
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## Gc =~
## comp_06 6.420 0.758 8.466 0.000 6.420 0.661
## voca_06 5.550 0.526 10.547 0.000 5.550 0.884
## Gf =~
## bloc_06 4.576 0.517 8.851 0.000 4.576 0.695
## obje_06 12.829 1.320 9.718 0.000 12.829 0.784
##
## Covariances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## Gc ~~
## Gf 0.641 0.074 8.642 0.000 0.641 0.641
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .comp_06 21.797 0.680 32.036 0.000 21.797 2.243
## .voca_06 20.396 0.439 46.416 0.000 20.396 3.250
## .bloc_06 6.942 0.461 15.063 0.000 6.942 1.055
## .obje_06 25.258 1.145 22.059 0.000 25.258 1.544
## Gc 0.000 0.000 0.000
## Gf 0.000 0.000 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## Gc 1.000 1.000 1.000
## Gf 1.000 1.000 1.000
## .comp_06 53.220 7.931 6.710 0.000 53.220 0.564
## .voca_06 8.592 4.511 1.905 0.057 8.592 0.218
## .bloc_06 22.391 3.717 6.025 0.000 22.391 0.517
## .obje_06 102.876 25.562 4.025 0.000 102.876 0.385
Some basic questions to ask after fitting a model -
Did it work?
Does it make sense? Are the parameters viable?
Lets look at a diagram of the model to check if it matches what we intended (this should also be done with the summary information.)
semPaths(fit2, what="std",
sizeLat = 7, sizeMan = 7, edge.label.cex = .75)
Then, we can look more closely at how well the model fit, and interpret the model more generally.
We do model comparison…
Looking at fit statistics
#Compare model fit statistics
round(cbind(m1=inspect(fit0, 'fit.measures'), m2=inspect(fit1, 'fit.measures'),m3=inspect(fit2, 'fit.measures')),3)
## m1 m2 m3
## npar 8.000 12.000 13.000
## fmin 0.525 0.077 0.003
## chisq 214.189 31.453 1.167
## df 6.000 2.000 1.000
## pvalue 0.000 0.000 0.280
## baseline.chisq 214.189 214.189 214.189
## baseline.df 6.000 6.000 6.000
## baseline.pvalue 0.000 0.000 0.000
## cfi 0.000 0.859 0.999
## tli 0.000 0.576 0.995
## nnfi 0.000 0.576 0.995
## rfi 0.000 0.559 0.967
## nfi 0.000 0.853 0.995
## pnfi 0.000 0.284 0.166
## ifi 0.000 0.861 0.999
## rni 0.000 0.859 0.999
## logl -2950.942 -2859.574 -2844.431
## unrestricted.logl -2843.848 -2843.848 -2843.848
## aic 5917.885 5743.148 5714.862
## bic 5944.430 5782.966 5757.998
## ntotal 204.000 204.000 204.000
## bic2 5919.084 5744.946 5716.810
## rmsea 0.412 0.269 0.029
## rmsea.ci.lower 0.366 0.191 0.000
## rmsea.ci.upper 0.461 0.355 0.191
## rmsea.pvalue 0.000 0.000 0.394
## rmr 31.629 8.300 1.160
## rmr_nomean 31.629 8.300 1.160
## srmr 0.292 0.064 0.010
## srmr_bentler 0.292 0.064 0.010
## srmr_bentler_nomean 0.345 0.076 0.012
## srmr_bollen 0.292 0.064 0.010
## srmr_bollen_nomean 0.345 0.076 0.012
## srmr_mplus 0.292 0.064 0.010
## srmr_mplus_nomean 0.345 0.076 0.012
## cn_05 12.993 39.860 672.610
## cn_01 17.012 60.738 1160.992
## gfi 0.946 0.988 1.000
## agfi 0.875 0.916 0.994
## pgfi 0.406 0.141 0.071
## mfi 0.600 0.930 1.000
## ecvi NA NA NA
Formal model test
#Chi-square difference test for nested models
anova(fit1, fit2)
## Chi Square Difference Test
##
## Df AIC BIC Chisq Chisq diff Df diff Pr(>Chisq)
## fit2 1 5714.9 5758 1.1668
## fit1 2 5743.1 5783 31.4527 30.286 1 3.728e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We reject the null hypothesis that the models fit the same.
We conclude that the 2-factor model is the optimal model, of those evaluated.
Thanks for playing!