We have forwarded a proposal for Computational Developmental Science (Ram et al., 2015). The basic idea is that theory can be used to generate some formal propositions (equations). Those propositions can then be used in to simualate/generate data. The implications of the theory can then be explored in the simulated world through in silica experiments (e.g., changing parameter values). Theory can be rejected and revised appropriately.
This script is a very basic set-up for simualting growth model data. We specifically generate formal propositions for (1) intraindividual change and (2) interindividual differences in specific features of intraindividual change.
We set up our example using a (1) linear model for intraindividual change and (2) interindividual differences in two features of intraindividual change, intercept and linear slope. The script is meant to be adjusted for examination of other theories (i.e., nonlinear intraindividual change).
Libraries used in this script
Our first task is to write out the theory of interest - a theory of interindividual differences in intraindividual change.
The equation for intraindividual change is … (will need to adjust this section with new equation)
\[y_{it} = (b_{0i}) + (b_{1i})x_{it} + e_{it}\] where \[x_{it} = time_{it}\] and \[b_{0i} = g_{00} + u_{0i}\] \[b_{1i} = g_{10} + u_{1i}\] with \[e_{it} \tilde{} N(0,\sigma^{2}_{e})\] and \[[u_{0i},u_{1i}, ...] \tilde{} N(0,\Psi)\]
After algebraic substitution we get … \[y_{it} = (g_{00} + u_{0i}) + (g_{10} + u_{1i})x_{it} + e_{it}\]
This equation can be rearranged to more closely match R code layout of whatever functions will use to estimate. For our purposes here, this form translates in a sufficiently straightforward manner with our coding below.
Setting up function in R to hold the mathemtical growth model … (will need to adjust this section to match new equation) (For convenience we have created “holders” for 5 features of change.)
Set prototypical values for g00, g10, g20, g30, ….(will need to adjust these values with theoretically informed values)
#setting fixed effects parameters
g_00 <- 0
g_10 <- 0.2
g_20 <- NA
g_30 <- NA
g_40 <- NA
#...add more as needed ...
Set number of persons and occasions (here it is for 1 person, 10 occasions).
Generate empty data:
proto_data <- data.frame(
id = sort(rep(c(1:num_persons),num_occasions)),
time = rep(c(0:(num_occasions-1)),num_persons),
outcome = NA)
head(proto_data)
id | time | outcome |
---|---|---|
1 | 0 | NA |
1 | 1 | NA |
1 | 2 | NA |
1 | 3 | NA |
1 | 4 | NA |
1 | 5 | NA |
Fill in ‘outcome’ with prototype trajectory from function (plugging in x = time variable)
id | time | outcome |
---|---|---|
1 | 0 | 0.0 |
1 | 1 | 0.2 |
1 | 2 | 0.4 |
1 | 3 | 0.6 |
1 | 4 | 0.8 |
1 | 5 | 1.0 |
1 | 6 | 1.2 |
1 | 7 | 1.4 |
1 | 8 | 1.6 |
1 | 9 | 1.8 |
Plot the prototype …(may need to adjust for aesthetics)
#plot prototype trajectory
ggplot(data = proto_data, aes(x = time, y = outcome, group = id)) +
ggtitle("Simulated Linear Growth") +
#geom_point() +
geom_line(color="red", size = 2) +
xlab("Time") +
ylab("Outcome") + #ylim(0,100) +
scale_x_continuous(breaks=seq(0,9,by=1))
We now have the intraindividual change part in place.
This section is setting the information for the between-person random effects part of the model. \[[u_{0i},u_{1i}, ...] \tilde{} N(0,\Psi)\].
Set the number of persons to simulate: (will need to adjust this to size of desired sample)
Create names for the growth coefficients (gcoeffs):
Note - am creating 5 individual difference coefficients even though our example only uses 2.
Create a vector of means for the individual differences, u variables (typically these will be zero):
Set values for standard deviations for growth coefficients (will need to adjust this with chosen values)
sig_u_0 <- 1.0
sig_u_1 <- 0.3
sig_u_2 <- 1.0
sig_u_3 <- 1.0
sig_u_4 <- 1.0
#...add more as needed ...
Create a symmetric matrix of correlations among the growth coefficients (will need to adjust this with chosen values):
corr_gcoeffs <- matrix(c(1.0,0.2,0.0,0.0,0.0,
0.2,1.0,0.0,0.0,0.0,
0.0,0.0,1.0,0.0,0.0,
0.0,0.0,0.0,1.0,0.0,
0.0,0.0,0.0,0.0,1.0),nrow=5,ncol=5,byrow=TRUE)
corr_gcoeffs
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.0 0.2 0 0 0
## [2,] 0.2 1.0 0 0 0
## [3,] 0.0 0.0 1 0 0
## [4,] 0.0 0.0 0 1 0
## [5,] 0.0 0.0 0 0 1
Create a vector of means. Create a vector of standard deviations and make it into a diagonal matrix:
#making mean vector
means_gcoeffs <- c(u_0,u_1,u_2,u_3,u_4)
#making sds vector and diagonal matrix
sds_gcoeffs_vector <- c(sig_u_0,sig_u_1,sig_u_2,sig_u_3,sig_u_4)
sds_gcoeffs <- diag(sds_gcoeffs_vector)
sds_gcoeffs
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 0.0 0 0 0
## [2,] 0 0.3 0 0 0
## [3,] 0 0.0 1 0 0
## [4,] 0 0.0 0 1 0
## [5,] 0 0.0 0 0 1
Do the matrix multiplication to get a covariance matrix:
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.00 0.06 0 0 0
## [2,] 0.06 0.09 0 0 0
## [3,] 0.00 0.00 1 0 0
## [4,] 0.00 0.00 0 1 0
## [5,] 0.00 0.00 0 0 1
We use the mvrnorm() function in the MASS package) …
Use the mean vector and variance-covariance matrix to create some multivariate normal data:
#setting the random seed (in order to be able to repeat with same generation)
set.seed(1234) #can change this value if want different data run
#simulate
coeffs_simulate <- mvrnorm(n = num_persons, mu = means_gcoeffs, Sigma = cov_gcoeffs, empirical = TRUE)
Note that makes the data exact (if N is large enough)
Some cleanup into a data-like object.
#making and binding an id variable
id <- c(1:nrow(coeffs_simulate))
indiv_data <- cbind(id,coeffs_simulate)
#assigning the variable names
colnames(indiv_data) <- c("id",names_gcoeffs)
head(indiv_data)
## id u_0 u_1 u_2 u_3 u_4
## [1,] 1 1.0883565 0.35493517 0.06978623 1.03101628 1.5681671
## [2,] 2 1.1514923 0.02592345 1.35546492 -0.10400543 -1.1427319
## [3,] 3 1.2717192 0.08035000 -1.10056618 0.60137539 -0.6211519
## [4,] 4 0.4712630 -0.19889661 2.31188787 0.07321659 1.6843328
## [5,] 5 0.6817137 -0.20026783 -0.59006198 -0.70953285 -0.2979141
## [6,] 6 -0.1140907 -0.37390083 1.24890947 0.42056273 -1.5163611
vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | 1 | 25 | 13 | 7.359801 | 13.0000000 | 13.0000000 | 8.8956000 | 1.0000000 | 25.0000000 | 24.000000 | 0.0000000 | -1.3446646 | 1.47196 |
u_0 | 2 | 25 | 0 | 1.000000 | 0.4533346 | 0.0649800 | 0.9414834 | -2.3494809 | 1.2717192 | 3.621200 | -0.5562613 | -0.8521228 | 0.20000 |
u_1 | 3 | 25 | 0 | 0.300000 | -0.0559674 | -0.0046921 | 0.2577785 | -0.5169619 | 0.6649941 | 1.181956 | 0.3168838 | -0.8436492 | 0.06000 |
u_2 | 4 | 25 | 0 | 1.000000 | -0.0992872 | -0.0440353 | 0.8244843 | -2.2144909 | 2.3118879 | 4.526379 | 0.3678988 | 0.0888570 | 0.20000 |
u_3 | 5 | 25 | 0 | 1.000000 | -0.0062015 | 0.0257368 | 0.8268599 | -2.3353412 | 2.2650956 | 4.600437 | -0.1814367 | 0.1138464 | 0.20000 |
u_4 | 6 | 25 | 0 | 1.000000 | 0.0631530 | 0.0133786 | 0.8978654 | -1.9403057 | 1.6843328 | 3.624639 | -0.1622044 | -0.8209951 | 0.20000 |
We now have the individual differences part in place.
Now lets generate the simulated intraindivdual change trajectories for all those individuals.
Set number of occasions: (will need to adjust this)
Typically is the same as was used earlier for intraindividual change.
Setting the (time-varying) residual error standard deviation: (will need to adjust this)
\[e_{it} \tilde{} N(0,\sigma^{2}_{e})\].
Calculating total number of observations
## [1] 250
Generating an empty data frame of appropraite size:
empty_df <- data.frame(
id = sort(rep(c(1:num_persons),num_occasions)),
time = rep(c(0:(num_occasions-1)),num_persons),
outcome = NA)
head(empty_df)
id | time | outcome |
---|---|---|
1 | 0 | NA |
1 | 1 | NA |
1 | 2 | NA |
1 | 3 | NA |
1 | 4 | NA |
1 | 5 | NA |
Merging in the individual differences coefficients:
Creating a (time-varying) residual error score for each observation:
id | time | outcome | u_0 | u_1 | u_2 | u_3 | u_4 | error |
---|---|---|---|---|---|---|---|---|
1 | 0 | NA | 1.088356 | 0.3549352 | 0.0697862 | 1.031016 | 1.568167 | 0.0080060 |
1 | 1 | NA | 1.088356 | 0.3549352 | 0.0697862 | 1.031016 | 1.568167 | -0.0631409 |
1 | 2 | NA | 1.088356 | 0.3549352 | 0.0697862 | 1.031016 | 1.568167 | -0.1513288 |
1 | 3 | NA | 1.088356 | 0.3549352 | 0.0697862 | 1.031016 | 1.568167 | -0.0636100 |
1 | 4 | NA | 1.088356 | 0.3549352 | 0.0697862 | 1.031016 | 1.568167 | 0.0226302 |
1 | 5 | NA | 1.088356 | 0.3549352 | 0.0697862 | 1.031016 | 1.568167 | 0.1013690 |
Notice which variables are time-invariant, and which are time-varying.
Use the originally specified growth function to generate outcome scores in the long data set:
sim_data$outcome <- growth.fun(x=sim_data$time,
u_0=sim_data$u_0,
u_1=sim_data$u_1,
u_2=sim_data$u_2,
u_3=sim_data$u_3,
u_4=sim_data$u_4,
error=sim_data$error)
head(sim_data)
id | time | outcome | u_0 | u_1 | u_2 | u_3 | u_4 | error |
---|---|---|---|---|---|---|---|---|
1 | 0 | 1.096362 | 1.088356 | 0.3549352 | 0.0697862 | 1.031016 | 1.568167 | 0.0080060 |
1 | 1 | 1.580151 | 1.088356 | 0.3549352 | 0.0697862 | 1.031016 | 1.568167 | -0.0631409 |
1 | 2 | 2.046898 | 1.088356 | 0.3549352 | 0.0697862 | 1.031016 | 1.568167 | -0.1513288 |
1 | 3 | 2.689552 | 1.088356 | 0.3549352 | 0.0697862 | 1.031016 | 1.568167 | -0.0636100 |
1 | 4 | 3.330727 | 1.088356 | 0.3549352 | 0.0697862 | 1.031016 | 1.568167 | 0.0226302 |
1 | 5 | 3.964401 | 1.088356 | 0.3549352 | 0.0697862 | 1.031016 | 1.568167 | 0.1013690 |
Plotting the simulated data (with the prototype as an additional layer) (may need to adjust title and for aesthetics)
ggplot(data = sim_data, aes(x = time, y = outcome, group = id)) +
ggtitle("Simulated Linear Growth") +
geom_point() +
geom_line() +
xlab("Time") +
ylab("Outcome") + #ylim(0,100) +
scale_x_continuous(breaks=seq(0,9,by=1)) +
stat_function(fun=growth.fun, color="red", size = 2)
That’s cool!
A simulation engine that can be used to play with theory and generate hypotheses!