Chapter 2 - Data Structuring and Plotting Descriptives

Overview

This tutorial walks through a few helpful initial steps to take before conducting growth curve analyses (or any analyses, for that matter). Specifically, it demonstrates how to manipulate data structures and how to obtain initial descriptive statistics and plots of the data, which are useful when making decisions about analyses later on. Throughout, we use a data set that examines weight over time.

The code and example provided in this tutorial are from Chapter 2 of Grimm, Ram, and Estabrook (2016), with a few additions in code and commentary; however, the chapter should be referred to for further interpretations and insights about the analyses.

Outline

This tutorial provides line-by-line code to
1. re-structure data (long to wide, and wide to long),
2. create initial longitudinal plots,
3. examine descriptive statistics, and
4. create plots of bivariate relationships.

Step 0: Read in the data and call needed libraries.

#set filepath
filepath <- "https://quantdev.ssri.psu.edu/sites/qdev/files/wght_data.csv"
#read in the .csv file using the url() function
wght_long <- read.csv(file=url(filepath),header=TRUE)

#add names to the columns of the data set
names(wght_long) <- c('id','occ','occ_begin','year','time_in_study','grade','age','gyn_age', 'wght')

#view the first few observations in the data set
head(wght_long)

##   id occ occ_begin year time_in_study grade       age    gyn_age wght
## 1  4   2         2 1994      2.083333     5 10.916667  0.9166667  100
## 2  4   3         3 1996      3.750000     6 12.583333  2.5833333  108
## 3  5   1         1 1992      0.000000     0  6.333333 -6.6666667   49
## 4  5   2         2 1994      2.000000     1  8.333333 -4.6666667   52
## 5  5   3         3 1996      3.750000     2 10.083333 -2.9166667   72
## 6  5   4         4 1998      5.750000     5 12.083333 -0.9166667  100

#calling the libraries we will need throughout this tutorial
#(note: the reshape() function used below is part of base R's stats package)
library(ggplot2)
library(psych)
library(reshape)

Step 1: Re-structure data.

Often, different analyses call for different data structures. Two frequently used structures are referred to as “long” data files and “wide” data files. Long data files contain a row for each measurement occasion of each person and a column for each repeated measure, resulting in a data file that is N (number of persons) x O (number of occasions) rows long. In contrast, wide data files contain a row for each person, with each measurement occasion stored in a separate column.
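
As a minimal sketch of the two layouts, the toy example below builds a small long file and converts it to wide. The data frame toy_long and its values are invented for illustration and are not part of the weight data.

```r
#hypothetical toy data: 2 persons, 3 occasions each (values invented)
toy_long <- data.frame(id   = c(1, 1, 1, 2, 2, 2),
                       occ  = c(1, 2, 3, 1, 2, 3),
                       wght = c(50, 55, 61, 48, 52, 60))

#long format: one row per person per occasion (N x O = 2 x 3 rows)
nrow(toy_long)

#wide format: one row per person, one wght column per occasion
toy_wide <- reshape(toy_long,
                    v.names='wght',     #repeated measures variable
                    idvar='id',         #id variable
                    timevar='occ',      #occasion variable
                    direction='wide')   #direction of re-structuring
toy_wide
```

The wide file has one row per id and columns named wght.1, wght.2, and wght.3, following the default naming of reshape().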

It is useful to have both long and wide files of your data before beginning analyses. Below, we begin with re-structuring a long data file into a wide data file, and then reverse this change.

Long to wide.

#rounding the variable age and creating a new variable with this information
wght_long$age_r <- round(wght_long$age)

#restructuring the data from long to wide
wght_wide <- reshape(wght_long,          #data set
                    v.names='wght',     #repeated measures variable
                    idvar='id',         #id variable
                    timevar='age_r',    #time metric/occasion variable
                    direction='wide')   #direction of re-structuring

## Warning in reshapeWide(data, idvar = idvar, timevar =
## timevar, varying = varying, : some constant variables
## (occ,occ_begin,year,time_in_study,grade,age,gyn_age) are really varying
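
The warning above arises because columns other than wght (e.g., occ, year, age) also change across a person's rows, while the wide file has room for only one value of each per person. The sketch below shows one possible way to avoid the warning, assuming only id, age_r, and wght are needed in the wide file. The data frame demo_long is a stand-in built from the first six rows printed earlier, not the full data set.

```r
#small stand-in for wght_long, using the first six rows shown above
demo_long <- data.frame(id   = c(4, 4, 5, 5, 5, 5),
                        age  = c(10.916667, 12.583333, 6.333333,
                                 8.333333, 10.083333, 12.083333),
                        wght = c(100, 108, 49, 52, 72, 100))
demo_long$age_r <- round(demo_long$age)

#keep only the id, time, and repeated measures columns before reshaping,
#so that no other within-person varying columns are left over
demo_wide <- reshape(demo_long[ , c('id','age_r','wght')],
                     v.names='wght',
                     idvar='id',
                     timevar='age_r',
                     direction='wide')
demo_wide
```

Dropping the other columns first is a design choice: it trades away information such as year and grade in exchange for a wide file whose every column is unambiguous.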

#view the first few observations in the data set
head(wght_wide)

##    id occ occ_begin year time_in_study grade       age    gyn_age wght.11
## 1   4   2         2 1994      2.083333     5 10.916667  0.9166667     100
## 3   5   1         1 1992      0.000000     0  6.333333 -6.6666667      NA
## 8   8   1         1 1986      0.000000     4 10.000000 -2.0000000      NA
## 12 10   1         1 1990      0.000000     3  7.916667 -5.0833333      NA
## 16 11   1         1 1990      0.000000     1  7.583333 -6.4166667      NA
## 19 19   1         1 1996      0.000000     0  5.916667 -5.0833333      NA
##    wght.13 wght.6 wght.8 wght.10 wght.12 wght.14 wght.16 wght.7 wght.5
## 1      108     NA     NA      NA      NA      NA      NA     NA     NA
## 3       NA     49     52      72     100     124      NA     NA     NA
## 8       NA     NA     NA      85     133     180     160     NA     NA
## 12      NA     NA     61      75     105     147      NA     NA     NA
## 16      NA     NA     49      65      70      NA      NA     NA     NA
## 19      NA     40     52      91     120      NA      NA     NA     NA
##    wght.9 wght.15 wght.17 wght.19 wght.18
## 1      NA      NA      NA      NA      NA
## 3      NA      NA      NA      NA      NA
## 8      NA      NA      NA      NA      NA
## 12     NA      NA      NA      NA      NA
## 16     NA      NA      NA      NA      NA
## 19     NA      NA      NA      NA      NA

#creating new data set with only the id and weight variables
wght_wide1 <- wght_wide[ , c('id','wght.5','wght.6','wght.7','wght.8','wght.9','wght.10',
                         'wght.11','wght.12','wght.13','wght.14','wght.15','wght.16',
                         'wght.17','wght.18','wght.19')]

#view the first few observations in the data set
head(wght_wide1)

##    id wght.5 wght.6 wght.7 wght.8 wght.9 wght.10 wght.11 wght.12 wght.13
## 1   4     NA     NA     NA     NA     NA      NA     100      NA     108
## 3   5     NA     49     NA     52     NA      72      NA     100      NA
## 8   8     NA     NA     NA     NA     NA      85      NA     133      NA
## 12 10     NA     NA     NA     61     NA      75      NA     105      NA
## 16 11     NA     NA     NA     49     NA      65      NA      70      NA
## 19 19     NA     40     NA     52     NA      91      NA     120      NA
##    wght.14 wght.15 wght.16 wght.17 wght.18 wght.19
## 1       NA      NA      NA      NA      NA      NA
## 3      124      NA      NA      NA      NA      NA
## 8      180      NA     160      NA      NA      NA
## 12     147      NA      NA      NA      NA      NA
## 16      NA      NA      NA      NA      NA      NA
## 19      NA      NA      NA      NA      NA      NA

#add names to the columns of the data set
names(wght_wide1) <- c('id','wght5','wght6','wght7','wght8','wght9','wght10',
                          'wght11','wght12','wght13','wght14','wght15','wght16',
                          'wght17','wght18','wght19')
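
Typing fifteen column names by hand is error-prone, so the selection and renaming steps above could also be written with paste0(). The following is a sketch on a hypothetical three-column example (demo_wide and ages are illustrative names); for the real data, ages would presumably be 5:19.

```r
#hypothetical wide data with weight columns out of age order
demo_wide <- data.frame(id      = c(4, 5),
                        wght.11 = c(100, NA),
                        wght.6  = c(NA, 49),
                        wght.8  = c(NA, 52))

#ages for which a weight column exists, in the desired order
ages <- c(6, 8, 11)

#select the id and weight columns in age order
demo_wide1 <- demo_wide[ , c('id', paste0('wght.', ages))]

#rename the weight columns without the "." separator
names(demo_wide1) <- c('id', paste0('wght', ages))
names(demo_wide1)
```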

Wide to long.

#restructuring the data from wide to long 
wght_long_new <- reshape(data = wght_wide1,                                          #data set
                        idvar='id',                                                 #id variable
                        varying=c('wght5','wght6','wght7','wght8','wght9','wght10', #repeated measures variables
                        'wght11','wght12','wght13','wght14','wght15','wght16',
                        'wght17','wght18','wght19'),
                        times=c(5,6,7,8,9,10,11,12,13,14,15,16,17,18,19),           #time metric/occasion variable
                        v.names='wght',                                             #name of repeated measures variable (i.e., new column name)
                        direction='long')                                           #direction of restructuring

#re-order rows by id and time
wght_long_new <- wght_long_new[order(wght_long_new$id, wght_long_new$time),]


#view the first few observations in the data set
head(wght_long_new)

##      id time wght
## 4.5   4    5   NA
## 4.6   4    6   NA
## 4.7   4    7   NA
## 4.8   4    8   NA
## 4.9   4    9   NA
## 4.10  4   10   NA

#creating new data set with no missing weight values
wght_long_new1 <- wght_long_new[which(!is.na(wght_long_new$wght)), ]

#view the first few observations in the data set
head(wght_long_new1)

##      id time wght
## 4.11  4   11  100
## 4.13  4   13  108
## 5.6   5    6   49
## 5.8   5    8   52
## 5.10  5   10   72
## 5.12  5   12  100
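
One detail worth noting in the wide-to-long step is that the entries of varying and times are matched by position, so the two vectors need to be listed in the same order. The sketch below runs the same sequence of steps on a hypothetical three-column wide file (demo_wide is an invented stand-in) so the pairing is easy to check by eye.

```r
#hypothetical wide data: one row per person, one column per age
demo_wide <- data.frame(id     = c(4, 5),
                        wght6  = c(NA, 49),
                        wght8  = c(NA, 52),
                        wght11 = c(100, NA))

#the i-th column in 'varying' is paired with the i-th value in 'times'
demo_long <- reshape(data = demo_wide,
                     idvar='id',
                     varying=c('wght6','wght8','wght11'),
                     times=c(6, 8, 11),
                     v.names='wght',
                     direction='long')

#re-order rows by id and time, then drop rows with missing weight
demo_long <- demo_long[order(demo_long$id, demo_long$time), ]
demo_long <- demo_long[!is.na(demo_long$wght), ]
demo_long
```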

Step 2: Create initial longitudinal plots.

Before beginning analyses, it is often helpful to examine plots of your data. In this case, we want to make sure that growth curve models are appropriate for our data (i.e., do we see any growth/change in our data?).

#creating a new data set with a subset of our data so that the plots are clearer/less cluttered
wght_long1 <- wght_long[which(wght_long$id > 1300 & wght_long$id < 1600), ]

#creating a plot and assigning it to an object 
plot_obs <- ggplot(data=wght_long1,                                                                  #data set
                   aes(x=age, y=wght, group=id)) +                                                   #calling variables
                   geom_line() +                                                                     #adding lines to plot
                   theme_bw() +                                                                      #changing style/background
                   scale_x_continuous(breaks = c(5,7,9,11,13,15,17), name = "Chronological Age") +   #creating breaks in the x-axis and labeling the x-axis
                   scale_y_continuous(breaks = c(25,50,75,100,125,150,175,200,225), name = "Weight") #creating breaks in the y-axis and labeling the y-axis

#printing the object (plot)
print(plot_obs)

#creating a second plot that adds points and uses a different theme
plot_obs <- ggplot(data=wght_long1,                                                                  #data set
                   aes(x=age, y=wght, group=id)) +                                                   #calling variables
                   geom_line() +                                                                     #adding lines to plot
                   geom_point(size=2) +                                                              #adding and adjusting size of points on plot
                   theme_classic() +                                                                 #changing style/background
                   scale_x_continuous(breaks = c(5,7,9,11,13,15,17), name = "Chronological Age") +   #creating breaks in the x-axis and labeling the x-axis
                   scale_y_continuous(breaks = c(25,50,75,100,125,150,175,200,225), name = "Weight") #creating breaks in the y-axis and labeling the y-axis

#print the plot
print(plot_obs)

Step 3: Examine descriptive statistics.

Now that we have visually examined our data, we will get a better feel for the data by examining descriptive statistics. We use our wide data file to conduct these analyses.

#creating new data set with only the weight variables
wght_vars <- wght_wide1[ , c('wght5','wght6','wght7','wght8','wght9','wght10','wght11','wght12',
                            'wght13','wght14','wght15','wght16','wght17','wght18','wght19')]

#view the first few observations in the data set
head(wght_vars)

##    wght5 wght6 wght7 wght8 wght9 wght10 wght11 wght12 wght13 wght14 wght15
## 1     NA    NA    NA    NA    NA     NA    100     NA    108     NA     NA
## 3     NA    49    NA    52    NA     72     NA    100     NA    124     NA
## 8     NA    NA    NA    NA    NA     85     NA    133     NA    180     NA
## 12    NA    NA    NA    61    NA     75     NA    105     NA    147     NA
## 16    NA    NA    NA    49    NA     65     NA     70     NA     NA     NA
## 19    NA    40    NA    52    NA     91     NA    120     NA     NA     NA
##    wght16 wght17 wght18 wght19
## 1      NA     NA     NA     NA
## 3      NA     NA     NA     NA
## 8     160     NA     NA     NA
## 12     NA     NA     NA     NA
## 16     NA     NA     NA     NA
## 19     NA     NA     NA     NA

#univariate descriptives
describe(wght_vars)

##        vars    n   mean    sd median trimmed   mad min max range skew
## wght5     1  171  43.47  9.26   41.0   42.45  5.93  30  90    60 1.61
## wght6     2  837  47.85  9.84   46.0   46.86  8.90  27 110    83 1.48
## wght7     3  856  54.21 12.46   51.0   52.69  8.90   8 127   119 1.35
## wght8     4 1157  62.77 17.57   60.0   60.72 14.83   7 280   273 2.59
## wght9     5 1001  72.78 19.70   69.0   70.50 16.31  37 220   183 1.40
## wght10    6 1320  83.08 23.03   79.0   80.51 20.76  20 200   180 1.16
## wght11    7 1145  96.72 27.31   91.0   93.90 23.72  44 265   221 1.16
## wght12    8 1288 108.90 28.75  104.0  106.12 23.72   1 249   248 1.05
## wght13    9 1014 122.82 33.15  116.0  119.06 25.20  42 313   271 1.26
## wght14   10 1054 130.42 34.12  122.0  125.78 25.20  62 324   262 1.50
## wght15   11  143 130.41 26.88  128.0  127.90 23.72  87 235   148 0.93
## wght16   12   72 135.03 32.76  126.5  130.91 25.95  90 240   150 1.09
## wght17   13   48 137.31 33.42  129.0  132.62 20.76  96 255   159 1.56
## wght18   14   15 136.13 49.94  119.0  127.77 14.83 101 280   179 1.88
## wght19   15    7 153.43 30.55  145.0  153.43 22.24 118 200    82 0.39
##        kurtosis    se
## wght5      4.29  0.71
## wght6      4.46  0.34
## wght7      3.20  0.43
## wght8     21.02  0.52
## wght9      3.80  0.62
## wght10     1.88  0.63
## wght11     2.27  0.81
## wght12     1.91  0.80
## wght13     2.41  1.04
## wght14     2.98  1.05
## wght15     1.10  2.25
## wght16     0.70  3.86
## wght17     2.28  4.82
## wght18     2.31 12.89
## wght19    -1.67 11.55

#bivariate descriptives
cor(wght_vars, use='pairwise.complete.obs') #correlation matrix

##            wght5     wght6     wght7     wght8     wght9    wght10
## wght5  1.0000000        NA 0.7763339 1.0000000 0.8038661 0.7278684
## wght6         NA 1.0000000 0.8557850 0.7369325 0.8083459 0.7618904
## wght7  0.7763339 0.8557850 1.0000000 0.8631670 0.8488304 0.6646086
## wght8  1.0000000 0.7369325 0.8631670 1.0000000 0.9571767 0.7791118
## wght9  0.8038661 0.8083459 0.8488304 0.9571767 1.0000000 0.8940827
## wght10 0.7278684 0.7618904 0.6646086 0.7791118 0.8940827 1.0000000
## wght11 0.6527799 0.7457341 0.7802957 0.8010815 0.8688773 0.9059677
## wght12 0.4330780 0.6539078 0.6907005 0.7840391 0.8057898 0.8675311
## wght13 0.6656333 0.7186277 0.7426071 0.8057780 0.8224555 0.7865057
## wght14 0.7254958 0.5842574 0.6680452 0.6488491 0.8505020 0.7829868
## wght15        NA 0.2979179 0.9981373 0.7044674 0.6291826 0.6701769
## wght16        NA        NA        NA        NA 0.9155911 0.6347579
## wght17        NA        NA        NA        NA        NA 0.7905679
## wght18        NA        NA        NA        NA        NA        NA
## wght19        NA        NA        NA        NA        NA        NA
##           wght11    wght12    wght13    wght14    wght15    wght16
## wght5  0.6527799 0.4330780 0.6656333 0.7254958        NA        NA
## wght6  0.7457341 0.6539078 0.7186277 0.5842574 0.2979179        NA
## wght7  0.7802957 0.6907005 0.7426071 0.6680452 0.9981373        NA
## wght8  0.8010815 0.7840391 0.8057780 0.6488491 0.7044674        NA
## wght9  0.8688773 0.8057898 0.8224555 0.8505020 0.6291826 0.9155911
## wght10 0.9059677 0.8675311 0.7865057 0.7829868 0.6701769 0.6347579
## wght11 1.0000000 0.8403308 0.8601377 0.8776408 0.7062868 0.6974460
## wght12 0.8403308 1.0000000 0.9104735 0.8457076 0.7411690 0.8621695
## wght13 0.8601377 0.9104735 1.0000000 0.9643801 0.8059892 0.9258816
## wght14 0.8776408 0.8457076 0.9643801 1.0000000 0.9900719 0.9264146
## wght15 0.7062868 0.7411690 0.8059892 0.9900719 1.0000000 0.9995866
## wght16 0.6974460 0.8621695 0.9258816 0.9264146 0.9995866 1.0000000
## wght17 0.4637109 0.9621373 0.8964683 0.9794416 0.8783625 0.9431854
## wght18 0.9331130 0.3258318 1.0000000 0.9422574        NA 0.9032972
## wght19        NA 1.0000000 1.0000000        NA 0.8464803        NA
##           wght17    wght18    wght19
## wght5         NA        NA        NA
## wght6         NA        NA        NA
## wght7         NA        NA        NA
## wght8         NA        NA        NA
## wght9         NA        NA        NA
## wght10 0.7905679        NA        NA
## wght11 0.4637109 0.9331130        NA
## wght12 0.9621373 0.3258318 1.0000000
## wght13 0.8964683 1.0000000 1.0000000
## wght14 0.9794416 0.9422574        NA
## wght15 0.8783625        NA 0.8464803
## wght16 0.9431854 0.9032972        NA
## wght17 1.0000000        NA 0.9499461
## wght18        NA 1.0000000        NA
## wght19 0.9499461        NA 1.0000000
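
With use='pairwise.complete.obs', each correlation is computed from only the persons observed on both variables, so different cells can rest on very different sample sizes. Entries of exactly 1 or NA in the matrix above are consistent with pairs of ages that share very few (or no) persons. A sketch for checking this, using base R and a hypothetical data frame demo_vars (values invented), is given below; applying the same crossprod() step to wght_vars would give the pairwise counts for the real data.

```r
#hypothetical wide data with scattered missing values
demo_vars <- data.frame(wght5 = c(40, 45, NA, NA, NA, 38),
                        wght6 = c(NA, NA, 50, 47, 52, NA),
                        wght8 = c(55, 62, 60, 58, 66, NA))

#indicator matrix: 1 = observed, 0 = missing
obs <- 1 - is.na(as.matrix(demo_vars))

#number of persons observed on both variables, for every pair
n_pairs <- crossprod(obs)
n_pairs

#pairwise correlations: wght5 and wght8 share only two persons,
#and wght5 and wght6 share none
r <- cor(demo_vars, use='pairwise.complete.obs')
r
```

A correlation based on two points is always exactly 1 or -1, and a pair of variables with no overlapping cases yields NA, so the counts in n_pairs indicate how much weight each correlation can bear.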

cov(wght_vars, use='pairwise.complete.obs') #covariance matrix

##            wght5     wght6    wght7    wght8    wght9    wght10   wght11
## wght5   85.79195        NA 100.6235 120.0000 160.0060  87.30714 172.7811
## wght6         NA  96.84823 107.9002 142.6230 268.3675 173.69643 270.3874
## wght7  100.62355 107.90018 155.3054 168.1466 214.8442 164.22347 262.8675
## wght8  120.00000 142.62304 168.1466 308.6154 506.4894 321.34312 444.2223
## wght9  160.00597 268.36755 214.8442 506.4894 388.2059 357.30508 475.6419
## wght10  87.30714 173.69643 164.2235 321.3431 357.3051 530.49248 547.4219
## wght11 172.78107 270.38739 262.8675 444.2223 475.6419 547.42195 745.7104
## wght12  71.31579 172.50264 247.2632 357.4025 364.9774 556.08329 502.3402
## wght13 179.46414 272.85654 323.3059 509.5192 573.6170 487.37626 793.5887
## wght14 176.21053 187.14631 207.1805 414.5335 452.4881 617.12588 786.0611
## wght15        NA  32.13636 334.5000 310.2857 237.2240 267.58559 443.1825
## wght16        NA        NA       NA       NA 497.3053 352.76882 295.8667
## wght17        NA        NA       NA       NA       NA 261.48918 130.6970
## wght18        NA        NA       NA       NA       NA        NA 258.6667
## wght19        NA        NA       NA       NA       NA        NA       NA
##            wght12    wght13    wght14     wght15    wght16    wght17
## wght5    71.31579  179.4641  176.2105         NA        NA        NA
## wght6   172.50264  272.8565  187.1463   32.13636        NA        NA
## wght7   247.26321  323.3059  207.1805  334.50000        NA        NA
## wght8   357.40250  509.5192  414.5335  310.28571        NA        NA
## wght9   364.97741  573.6170  452.4881  237.22402  497.3053        NA
## wght10  556.08329  487.3763  617.1259  267.58559  352.7688  261.4892
## wght11  502.34023  793.5887  786.0611  443.18248  295.8667  130.6970
## wght12  826.75810  777.6939  797.8288  390.89655  710.5937  836.5500
## wght13  777.69388 1099.2347 1613.8075  641.77025 1047.9167  970.7571
## wght14  797.82875 1613.8075 1164.3862   61.00000  950.0605  218.6000
## wght15  390.89655  641.7703   61.0000  722.63715 1405.0000  988.0465
## wght16  710.59373 1047.9167  950.0605 1405.00000 1073.4358  118.0000
## wght17  836.55000  970.7571  218.6000  988.04655  118.0000 1116.9428
## wght18   36.08333 1584.0000 1850.5321         NA 2236.9727        NA
## wght19 1435.00000   42.0000        NA  888.00000        NA  809.6000
##            wght18    wght19
## wght5          NA        NA
## wght6          NA        NA
## wght7          NA        NA
## wght8          NA        NA
## wght9          NA        NA
## wght10         NA        NA
## wght11  258.66667        NA
## wght12   36.08333 1435.0000
## wght13 1584.00000   42.0000
## wght14 1850.53205        NA
## wght15         NA  888.0000
## wght16 2236.97273        NA
## wght17         NA  809.6000
## wght18 2493.69524        NA
## wght19         NA  933.2857

Step 4: Create plots of bivariate relationships.

Finally, we plot the bivariate relationships underlying the descriptive statistics we just examined.

#creating a function that draws a histogram of each variable (used on the diagonal of the scatterplot matrix)
panel.hist <- function(x, ...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(usr[1:2], 0, 1.5) )
    h <- hist(x, plot = FALSE)
    breaks <- h$breaks; nB <- length(breaks)
    y <- h$counts; y <- y/max(y)
    rect(breaks[-nB], 0, breaks[-1], y, col="cyan", ...)
}

#using pairs() to create a scatterplot matrix of the bivariate relationships between all of our weight measures, with our panel.hist function drawing histograms on the diagonal
pairs(~wght5+wght6+wght7+wght8+wght9+
       wght10+wght11+wght12+wght13+wght14+
       wght15+wght16+wght17+wght18+wght19,
      data=wght_wide1, diag.panel=panel.hist)

#using pairs() to create the same kind of scatterplot matrix for a subset of our weight measures
pairs(~wght5+wght6+wght7+wght8+wght9+wght10, data=wght_wide1, diag.panel=panel.hist)

Conclusion

This tutorial has presented several key steps to take when beginning data analyses: setting up data in several formats (long and wide), plotting the data, and obtaining descriptive statistics of the data.