This session we shall work through some basics. In particular we shall cover …

A. Reading in Repeated Measures Data
B. Manipulating those data into two different formats (long and wide)
C. Describing the data (descriptives and plots)

0.0.0.1 Prelim - Loading libraries used in this script.

library(psych)
library(ggplot2)
library(car)
library(GGally)
library(lattice)

0.0.1 A. Reading in Repeated Measures Data

The first step, and often the most frustrating part of data analysis is … getting the data into the program!

Recently, the trend is that everyone exchanges files in .csv format. This usually works well and is a good practice to follow.

Our examples makes use of data from McArdle and colleagues and has been used in many papers. These are classic data from Osborne & Suddick (1972) as described in McArdle & Epstein (1987), McArdle (1988), McArdle & Aber (1990), McArdle & Nesselroade (1994), McArdle (1990, 2001) …. They are the basis for the Longitudinal Research Institute Workshops, and are used here with permission.

0.0.1.1 1. Getting the data into the program.

To get the data, we set the working directory and read in a data file. (Note that in R markdown documents, these must be done in the same code chunk).

#Setting the working directory
setwd("~/Desktop/")  #Collaborator 1's Computer
#setwd("~/Desktop/")  #Collaborator 2's Computer

############################
####### Reading in the Data
############################
#set filepath for data file
filepath <- "https://quantdev.ssri.psu.edu/sites/qdev/files/wisc3raw.csv"
#read in the .csv file using the url() function
wisc3raw <- read.csv(file=url(filepath),header=TRUE)

Note that the working directory has been set to a specific location on Collaborator 1’s computer. For convenience, an additional setwd() line is included that is commented out (using # - or can comment blocks of text using Cmd-Shift-c within RStudio). This set-up allows for easy passing back and forth among multiple collaborators (or across home and work computers) who have the data housed in an idiosyncratic location. If seeking help with your scripts, it may be useful to send the script and data file, along with this kind of set-up. It makes it much easier to give help when the data read-in process is easy.

For reading in other files types see … http://www.statmethods.net/input/importingdata.html.

0.0.1.2 2. Clean-up of file (“data hygiene”)

Once the data are in, there are often a few bits of clean-up that need to be done to make a file easy to work with. Driven by necessity, individual R developers have made a wide variety of convenience functions. The trick is finding them. If you think that someone else may have ever had to do something similar, use a search engine to find it.

For example, R is case-sensitive. Sometimes variables are named with a mix of lower-case and upper-case letters. Depending on the fluidity of these names, it is sometimes convenient to rename the variables to all lower case. The consistency of naming often reduces typographical coding errors (although the front-end engines, like R-Studio, are quite good at helping now). Lets have a look at the variable names in our example data.

colnames(wisc3raw)
##  [1] "id"       "verb1"    "verb2"    "verb4"    "verb6"    "perfo1"  
##  [7] "perfo2"   "perfo4"   "perfo6"   "info1"    "comp1"    "simu1"   
## [13] "voca1"    "info6"    "comp6"    "simu6"    "voca6"    "momed"   
## [19] "grad"     "constant"

Sidebar - a little trick to get the variable names into a format that is easy to cut and paste back into R code. This is super helpful when working with lots of names.

dput(colnames(wisc3raw))
## c("id", "verb1", "verb2", "verb4", "verb6", "perfo1", "perfo2", 
## "perfo4", "perfo6", "info1", "comp1", "simu1", "voca1", "info6", 
## "comp6", "simu6", "voca6", "momed", "grad", "constant")

Sometimes data have a few variables that have upper-case names. Everything can be converted to lower-case …

# Set all variable names to lowercase 
var_names_wisc <- tolower(colnames(wisc3raw))
colnames(wisc3raw)<-var_names_wisc

A few other idiosyncratic tendencies of mine are (a) the naming of the analysis unit (e.g., person, family), because my collection of scritpts is written in a particular way, and (b) the ordering of the variables so that id and time are always in the left-most columns, becasue I find this easier when trying to locate specific observations while scrolling through the data file. Thus, I often change the identification variable to id and reorder the columns. In these example data, the id variable is already named as such (e.g., http://www.statmethods.net/management/variables.html), and the column ordering is ok (e.g., google search for “reordering columns in R”). Make sure you know what is going on with missing data values - and recode as necessary. (e.g., http://www.statmethods.net/input/missingdata.html)

0.0.1.3 3. Outputting a clean file for future use.

Depending on work-flow, it may be useful to create a “working file” - both for use with R, and for passing to other programs.

To output a comma delimited data file …

#Output a comma delimited data file;
write.csv(wisc3raw, file="NewData.csv",row.names=FALSE, na="")

Note that by default the write.csv() function will include an extra column of row numbers and will notate missing data with an NA. Both are inconvenient for other programs, so we have included therow.names=FALSE and na="" options to keep the .csv file “clean”.

To output some other files types see … http://www.statmethods.net/input/exportingdata.html

0.0.2 B. Manipulating those data into two different formats (long and wide)

Behavioral science tends to use relational data structures - in basic form, spreadsheets. Typically the data are stored/configured as a data frame (a “fancy” matrix) with multiple rows and multiple columns. Two main schema are used to accommodate repeated measures data - “Wide Format” and “Long Format”. Different analysis and plotting functions require different kinds of data input. Thus, it is imperative that one can convert the data back and forth between wide and long formats.

There are lots of ways to do this. We illustrate one way.

First, lets subset down to the variables we need.

#making a vector of variable names (id, time-varying, time-invariant)
var_names_sub <- c("id", 
                   "verb1","verb2","verb4","verb6",
                   "perfo1","perfo2","perfo4","perfo6",
                   "momed","grad")
#subsetting
wiscraw <- wisc3raw[,var_names_sub]

#looking at the data
head(wiscraw)
id verb1 verb2 verb4 verb6 perfo1 perfo2 perfo4 perfo6 momed grad
1 24.42 26.98 39.61 55.64 19.84 22.97 43.90 44.19 9.5 0
2 12.44 14.38 21.92 37.81 5.90 13.44 18.29 40.38 5.5 0
3 32.43 33.51 34.30 50.18 27.64 45.02 46.99 77.72 14.0 1
4 22.69 28.39 42.16 44.72 33.16 29.68 45.97 61.66 14.0 1
5 28.23 37.81 41.06 70.95 27.64 44.42 65.48 64.22 11.5 0
6 16.06 20.12 38.02 39.94 8.45 15.78 26.99 39.08 14.0 1

0.0.2.1 1. Reshape Wide to Long

One way to go from wide to long is using the reshape() function (not to be confused with the reshape2 package).

#reshaping long to wide
wisclong <- reshape(data=wiscraw,
                    varying = c("verb1","verb2","verb4","verb6",
                                "perfo1","perfo2","perfo4","perfo6"),
                    timevar=c("grade"), 
                    idvar=c("id"),
                    direction="long", sep="")
#sorting for easy viewing
#reorder by id and day
wisclong <- wisclong[order(wisclong$id,wisclong$grade), ]

#looking at the data
head(wisclong, 8)
id momed grad grade verb perfo
1.1 1 9.5 0 1 24.42 19.84
1.2 1 9.5 0 2 26.98 22.97
1.4 1 9.5 0 4 39.61 43.90
1.6 1 9.5 0 6 55.64 44.19
2.1 2 5.5 0 1 12.44 5.90
2.2 2 5.5 0 2 14.38 13.44
2.4 2 5.5 0 4 21.92 18.29
2.6 2 5.5 0 6 37.81 40.38

BE CAREFUL WITH RESHAPE ##### THE ARGUMENTS NEEDED ARE NOT ALWAYS STRAIGHTFORWARD Make sure that the missing data has been treated properly.

0.0.2.2 2. Reshape Long to Wide

One way to go from long to wide is using the reshape() function (not to be confused with the reshape2 package).

#reshaping long to wide
wiscwide <- reshape(data=wisclong, 
                    timevar=c("grade"), 
                    idvar=c("id"),
                    v.names=c("verb","perfo"),
                    direction="wide", sep="_")

#reordering columns for easy viewing
wiscwide <- wiscwide[,c("id",
                        "verb_1","verb_2","verb_4","verb_6",
                        "perfo_1","perfo_2","perfo_4","perfo_6",
                        "momed","grad")]

#looking at the data
head(wiscwide)
id verb_1 verb_2 verb_4 verb_6 perfo_1 perfo_2 perfo_4 perfo_6 momed grad
1.1 1 24.42 26.98 39.61 55.64 19.84 22.97 43.90 44.19 9.5 0
2.1 2 12.44 14.38 21.92 37.81 5.90 13.44 18.29 40.38 5.5 0
3.1 3 32.43 33.51 34.30 50.18 27.64 45.02 46.99 77.72 14.0 1
4.1 4 22.69 28.39 42.16 44.72 33.16 29.68 45.97 61.66 14.0 1
5.1 5 28.23 37.81 41.06 70.95 27.64 44.42 65.48 64.22 11.5 0
6.1 6 16.06 20.12 38.02 39.94 8.45 15.78 26.99 39.08 14.0 1

BE CAREFUL WITH RESHAPE ##### THE ARGUMENTS NEEDED ARE NOT ALWAYS STRAIGHTFORWARD
Make sure that the missing data has been treated properly.

Note: A more general reshaping solution is provided by Hadley Wickham’s reshape2 package through melt and cast functions. It is very useful when one has complicated data structures. Always keep track of the length of your data, so that no observations get lost or merged improperly.

Ok - now we have both wide and long format data sets and can start describing the data. Yay!

0.0.3 C. Describing the Data

Once the wide and long data sets are in place, we can begin describing and plotting the data. I cannot emphasize enough the importance of these steps for understanding one’s data!

Some descriptives and plots are produced from wide data, some from long data. Having both in place facilitates learning about the data. Continually keep in mind what portions of the data-box are being desribed (e.g., persons, variables, occasions).

0.0.3.0.1 Basic Descriptives:

Most often we are using the functions in the psych and ggplot2 packages.

#In the wide file
describe(wiscwide)
vars n mean sd median trimmed mad min max range skew kurtosis se
id 1 204 102.5000000 59.0338886 102.500 102.5000000 75.612600 1.00 204.00 203.00 0.0000000 -1.2176609 4.1331989
verb_1 2 204 19.5850490 5.8076950 19.335 19.4976829 5.411490 3.33 35.15 31.82 0.1301071 -0.0528376 0.4066200
verb_2 3 204 25.4153431 6.1063772 25.980 25.4026220 6.567918 5.95 39.85 33.90 -0.0639307 -0.3445494 0.4275319
verb_4 4 204 32.6077451 7.3198841 32.820 32.4240244 7.183197 12.60 52.84 40.24 0.2317998 -0.0776524 0.5124944
verb_6 5 204 43.7499020 10.6650511 42.545 43.4560976 11.297412 17.35 72.59 55.24 0.2356459 -0.3605210 0.7467029
perfo_1 6 204 17.9770588 8.3497820 17.660 17.6911585 8.295147 0.00 46.58 46.58 0.3510232 -0.1089726 0.5846017
perfo_2 7 204 27.6899510 9.9911175 26.570 27.3409756 10.511634 7.83 59.58 51.75 0.3925937 -0.2086413 0.6995181
perfo_4 8 204 39.3575000 10.2689265 39.090 39.2792683 10.037202 7.81 75.61 67.80 0.1509714 0.5924345 0.7189687
perfo_6 9 204 50.9316176 12.4798742 51.765 51.0707317 13.269270 10.26 89.01 78.75 -0.0637532 0.1836101 0.8737660
momed 10 204 10.8112745 2.6982790 11.500 10.9969512 2.965200 5.50 18.00 12.50 -0.3586143 0.0056095 0.1889173
grad 11 204 0.2254902 0.4189328 0.000 0.1585366 0.000000 0.00 1.00 1.00 1.3040954 -0.3007374 0.0293312
#In the long file 
describe(wisclong)
vars n mean sd median trimmed mad min max range skew kurtosis se
id 1 816 102.5000000 58.925137 102.500 102.5000000 75.61260 1.00 204.00 203.00 0.0000000 -1.2044666 2.0627924
momed 2 816 10.8112745 2.693308 11.500 11.0007645 2.96520 5.50 18.00 12.50 -0.3606036 0.0278594 0.0942846
grad 3 816 0.2254902 0.418161 0.000 0.1574924 0.00000 0.00 1.00 1.00 1.3113292 -0.2807552 0.0146386
grade 4 816 3.2500000 1.921464 3.000 3.1880734 2.22390 1.00 6.00 5.00 0.2775196 -1.4304556 0.0672647
verb 5 816 30.3395098 11.861196 28.465 29.3911468 11.33448 3.33 72.59 69.26 0.7091064 0.3311120 0.4152249
perfo 6 816 33.9890319 16.138758 33.140 33.3422783 18.13961 0.00 89.01 89.01 0.3363985 -0.4307565 0.5649695

Note the n. What slices of the box are being used to make the calculations?

0.0.3.1 All Data = all persons and occasions (Verbal Ability)

#sample descriptives
describe(wisclong$verb)
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 816 30.33951 11.8612 28.465 29.39115 11.33448 3.33 72.59 69.26 0.7091064 0.331112 0.4152249
#histogram
ggplot(data=wisclong, aes(x=verb)) +
  geom_histogram(binwidth=2.5, pad=TRUE, fill="white", color="black") + 
  xlab("Verbal Ability (Grade 1 to 6)")
## Warning: Duplicated aesthetics after name standardisation: pad

0.0.3.2 Sample-level descriptives across Time (Verbal Ability)

Note that our variable is actually “multivariate” because we have repeated measures. We need to consider the time-splits when we are doing descriptives.

0.0.3.3 Verbal Ability across Time = all persons faceted by grade.

Note that we start with MEANS, VARIANCES

#sample descriptives by occasion 
#in the wide file
describe(wiscwide[,c("verb_1","verb_2","verb_4","verb_6")])
vars n mean sd median trimmed mad min max range skew kurtosis se
verb_1 1 204 19.58505 5.807695 19.335 19.49768 5.411490 3.33 35.15 31.82 0.1301071 -0.0528376 0.4066200
verb_2 2 204 25.41534 6.106377 25.980 25.40262 6.567918 5.95 39.85 33.90 -0.0639307 -0.3445494 0.4275319
verb_4 3 204 32.60775 7.319884 32.820 32.42402 7.183197 12.60 52.84 40.24 0.2317998 -0.0776524 0.5124944
verb_6 4 204 43.74990 10.665051 42.545 43.45610 11.297412 17.35 72.59 55.24 0.2356459 -0.3605210 0.7467029
#or in the long file
describeBy(wisclong[,c("verb")],group=wisclong$grade)
## 
##  Descriptive statistics by group 
## group: 1
##    vars   n  mean   sd median trimmed  mad  min   max range skew kurtosis
## X1    1 204 19.59 5.81  19.34    19.5 5.41 3.33 35.15 31.82 0.13    -0.05
##      se
## X1 0.41
## -------------------------------------------------------- 
## group: 2
##    vars   n  mean   sd median trimmed  mad  min   max range  skew kurtosis
## X1    1 204 25.42 6.11  25.98    25.4 6.57 5.95 39.85  33.9 -0.06    -0.34
##      se
## X1 0.43
## -------------------------------------------------------- 
## group: 4
##    vars   n  mean   sd median trimmed  mad  min   max range skew kurtosis
## X1    1 204 32.61 7.32  32.82   32.42 7.18 12.6 52.84 40.24 0.23    -0.08
##      se
## X1 0.51
## -------------------------------------------------------- 
## group: 6
##    vars   n  mean    sd median trimmed  mad   min   max range skew
## X1    1 204 43.75 10.67  42.55   43.46 11.3 17.35 72.59 55.24 0.24
##    kurtosis   se
## X1    -0.36 0.75
#histogram faceted by grade
ggplot(data=wisclong, aes(x=verb)) +
  geom_histogram(binwidth=5, pad = TRUE, fill="white", color="black") + 
  xlab("Verbal Ability") +
  facet_grid(grade ~ .)
## Warning: Duplicated aesthetics after name standardisation: pad

#boxplot by grade
qplot(x=factor(grade), y=verb, data=wisclong, geom="boxplot", ylab="Verbal Ability", xlab="Grade")

#use factor() to convert "time" from numeric to categorical

#boxplot by grade
ggplot(data=wisclong, aes(x=factor(grade), y=verb)) + 
  geom_boxplot(notch = TRUE) +
  stat_summary(fun.y="mean", geom="point", shape=23, size=3, fill="white") +
  labs(x = "Grade", y = "Verbal Ability")

#Density distribution by grade
ggplot(data=wisclong, aes(x=verb)) + 
  geom_density(aes(group=factor(grade), colour=factor(grade), fill=factor(grade)), alpha=0.3) +
  guides(colour=FALSE,
         fill=guide_legend(title="Grade")) +
  labs(x="Verbal Ability", y="Density")