Overview

This tutorial walks through a few helpful initial steps before analysis of experience sampling and EMA data (or any analyses for that matter). Specifically, this tutorial demonstrates how to manipulate data structures and how to obtain initial descriptive statistics and plots of the data, which will be useful when making decisions about analyses later down the line.

Outline

This session we shall work through some basics. In particular we shall cover …

A. Reading in Repeated Measures Data from an Experience Sampling Study
B. Manipulating those data into two different structures (long and wide)
C. Assessing the overall characteristics of the data/protocol
D. Describing the data (descriptives and plots)

A. Reading in the Data and calling needed libraries.

The first step, and often the most frustrating part of data analysis, is … getting the data into the program!

Recently, the trend is that everyone exchanges files in .csv format. This usually works well and is a good practice to follow.

These example data are based on the PSU AMIB study, an experience sampling and daily diary study. Comprehensive description of the study design can be found in … Ram, N., Conroy, D. E., Pincus, A. L., Hyde, A. L., & Molloy, L. E. (2012). Tethering theory to method: Using measures of intraindividual variability to operationalize individuals’ dynamic characteristics. In G. Hancock & J. Harring (Eds.), Advances in longitudinal methods in the social and behavioral sciences (pp. 81-110). New York: Information Age. The data here are for demonstration purposes only, and should not be used for publication without explicit permission. For additional use please contact Nilam.

0. Loading libraries used in this script.

We make use of some specialized packages in R. We try to follow the good practice of loading these at the beginning of a work session.

library(psych)
library(ggplot2)

1. Getting the data into the program.

To get the data, we set the working directory and read in a data file.

#Setting the working directory
setwd("~/Desktop/Fall_2017")  #Person 1 Computer
#setwd("~/Desktop/Fall_2017")  #Person 2 Computer

#set filepath for data file
filepath <- "https://quantdev.ssri.psu.edu/sites/qdev/files/AMIBbrief_raw_daily1.csv"
#read in the .csv file using the url() function
daily <- read.csv(file=url(filepath),header=TRUE)

Note that the working directory has been set to a specific location on Person 1’s computer. For convenience, an additional setwd() line is included that is commented out (using #; within RStudio, blocks of lines can be commented with Cmd-Shift-C). This set-up allows a script to pass easily back and forth across machines or collaborators, where the data may be housed in idiosyncratic locations. If seeking help with your scripts, it is useful to send the script and data file together with this kind of set-up; it is much easier for someone to give help when the data read-in process is easy.
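
An alternative to commenting and uncommenting is to let the script pick whichever location exists on the current machine. A minimal sketch (the second path is hypothetical):

#choose whichever collaborator's directory exists on this machine
candidate.dirs <- c("~/Desktop/Fall_2017",       #Person 1 Computer
                    "~/Documents/Fall_2017")     #Person 2 Computer (hypothetical)
setwd(candidate.dirs[dir.exists(candidate.dirs)][1])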

For reading in other file types see … http://www.statmethods.net/input/importingdata.html.
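
For instance, a sketch of reading SPSS or Excel versions of a file (the packages must be installed, and the file names here are hypothetical):

#read an SPSS file with the haven package (hypothetical file name)
#library(haven)
#daily <- read_sav("AMIBbrief_raw_daily1.sav")
#read an Excel file with the readxl package (hypothetical file name)
#library(readxl)
#daily <- read_excel("AMIBbrief_raw_daily1.xlsx")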

2. Clean-up of file (“data hygiene”).

Once the data are in, there are often a few little bits of clean-up that need to be done to make a file easy to work with. Driven by necessity, individual R developers have made a wide variety of convenience functions. The trick is finding them. If you think that someone else may ever have had to do something similar, use a search engine to find their solution.

For example, R is case-sensitive. Sometimes variables are named with a mix of lower-case and upper-case letters. Depending on how fluid these names are, it is sometimes convenient to rename the variables to all lower case. Consistent naming often reduces typographical coding errors (although front-end engines, like RStudio, are getting better at helping). Let’s have a look at the variable names in our example data.

colnames(daily)
##  [1] "id"      "day"     "date"    "slphrs"  "weath"   "lteq"    "pss"    
##  [8] "se"      "swls"    "evalday" "posaff"  "negaff"  "TEMP"    "HUM"    
## [15] "WIND"    "BAR"     "PREC"

Note that there are a few variables that have upper-case names. Everything can be converted to lower-case …

# Set all variable names to lowercase 
var.names.daily <- tolower(colnames(daily))
colnames(daily)<-var.names.daily

A few other idiosyncratic tendencies of mine are …

  1. naming the analysis unit (e.g., person, family) as id, because my collection of scripts is written in a particular way;

  2. ordering the variables so that id and time are always in the left-most columns, because I find this easier when trying to locate specific observations while scrolling through the data file. Thus, I often change the identification variable to id and reorder the columns. In these example data, the id variable is already named as such (e.g., http://www.statmethods.net/management/variables.html), and the column ordering is ok (e.g., google search for “reordering columns in r”); and

  3. making sure I know what is going on with missing data values, and recoding as necessary (e.g., http://www.statmethods.net/input/missingdata.html). A sketch of all three appears just below.
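
In this sketch, the original variable name subjnum and the -99 missing-data code are hypothetical; these example data need neither the renaming nor the recoding.

#1. rename the unit identifier to id (hypothetical original name)
#names(daily)[names(daily)=="subjnum"] <- "id"
#2. reorder columns so that id and day are left-most
daily <- daily[,c("id","day",setdiff(names(daily),c("id","day")))]
#3. recode a numeric missing-data flag to NA (hypothetical code)
#daily[daily==-99] <- NA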

Let’s look at the top few rows of the data.

head(daily,12)
##     id day      date slphrs weath lteq  pss se swls evalday posaff negaff
## 1  101   0 19-Feb-09    6.0     1   10 2.50  2  3.8       1    3.9    3.0
## 2  101   1 20-Feb-09    2.0     2   10 2.75  3  4.2       0    3.8    2.3
## 3  101   2 21-Feb-09    9.0     3   10 3.50  4  5.0       1    5.1    1.0
## 4  101   3 22-Feb-09    7.5     2    9 3.00  4  5.0       1    5.6    1.3
## 5  101   4 23-Feb-09    8.0     1   18 2.75  3  4.0       1    4.3    1.1
## 6  101   5 24-Feb-09    8.0     2   19 2.75  3  4.2       1    3.9    1.0
## 7  101   6 25-Feb-09    8.0     3   21 3.50  4  4.6       1    5.1    1.2
## 8  101   7 26-Feb-09    7.0    NA   14 2.75  3  4.6       1    4.8    1.1
## 9  102   0 19-Feb-09    7.0     0   12 3.50  5  5.6       0    6.3    1.4
## 10 102   1 20-Feb-09    6.0     0   20 4.00  5  6.6       0    7.0    1.6
## 11 102   2 21-Feb-09    3.0     2   11 4.00  5  5.6       1    6.1    2.6
## 12 102   3 22-Feb-09     NA     1    6 1.75  3  6.8       1    6.2    2.8
##    temp  hum wind   bar prec
## 1  28.0 0.79 11.0 29.40 0.20
## 2  20.8 0.62  3.6 30.17 0.00
## 3  29.1 0.51  1.9 30.35 0.02
## 4  30.2 0.58  2.7 30.23 0.00
## 5  22.7 0.55  2.4 30.46 0.00
## 6  21.4 0.54  0.7 30.54 0.00
## 7  31.4 0.49  1.0 30.51 0.00
## 8  45.3 0.52  1.1 30.30 0.00
## 9  28.0 0.79 11.0 29.40 0.20
## 10 20.8 0.62  3.6 30.17 0.00
## 11 29.1 0.51  1.9 30.35 0.02
## 12 30.2 0.58  2.7 30.23 0.00

3. Outputting a clean file for future use.

Third, depending on work-flow, it may be useful to create a “working file” - both for use with R, and for passing to other programs.

To output a comma delimited data file …

#Output a comma delimited data file;
write.csv(daily, file="NewData.csv",row.names=FALSE, na="")

Note that by default the write.csv() function will include an extra column of row numbers and will mark missing data with NA. Both are inconvenient for other programs, so we have included the row.names=FALSE and na="" options to keep the .csv file “clean” and portable.
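
For an R-only work-flow, a native format that preserves variable types can also be handy (a minimal sketch):

#Output native R files for later sessions
save(daily, file="NewData.RData")      #reload with load("NewData.RData")
saveRDS(daily, file="NewData.rds")     #reload with daily <- readRDS("NewData.rds")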

To output some other file types see … http://www.statmethods.net/input/exportingdata.html

B. Manipulating those data into two different structures (long and wide)

Behavioral science tends to use relational data structures - in basic form, spreadsheets. Typically the data are stored/configured as a data frame (a “fancy” matrix) with multiple rows and multiple columns. Two main schemas are used to accommodate repeated measures data - “Wide Format” and “Long Format”. Different functions work with different kinds of data input. Thus, it is imperative that one can convert the data back and forth between wide and long formats.

There are lots of ways to do this.
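
Before reshaping the real data, a toy example (a hypothetical mini data set) may make the two schemas concrete.

#toy long data: one row per person-day
toy.long <- data.frame(id=c(1,1,2,2), day=c(0,1,0,1), se=c(3,4,5,4))
#reshape to wide: one row per person, one column per variable-day combination
toy.wide <- reshape(toy.long, timevar="day", idvar="id", direction="wide", sep="_")
toy.wide   #columns: id, se_0, se_1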

Reshape Long to Wide

One way to go from long to wide is using the reshape() function (not to be confused with the reshape2 package).

#reshaping long to wide
dailywide <- reshape(data=daily, 
                    timevar=c("day"), 
                    idvar="id",
                    v.names=c("slphrs", "weath","lteq","pss","se", "swls", "evalday","posaff","negaff",
                              "temp","hum","wind","bar","prec"),
                    direction="wide", sep="_",
                    drop=c("date"))

BE CAREFUL WITH reshape(). THE ARGUMENTS NEEDED ARE NOT ALWAYS STRAIGHTFORWARD.

Let’s look at the data structure (first few columns only).

head(dailywide[,1:9],5)
##     id slphrs_0 weath_0 lteq_0 pss_0 se_0 swls_0 evalday_0 posaff_0
## 1  101        6       1     10  2.50    2    3.8         1      3.9
## 9  102        7       0     12  3.50    5    5.6         0      6.3
## 17 103        7       0     17  2.75    5    5.6         0      5.0
## 23 104        7       2     21  2.50    4    3.8         1      6.5
## 31 105        6       1      9  2.50    2    5.2         0      4.7

Note there is only one row per unit. For these data that means … one row per person.
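
A couple of quick sanity checks on the reshape, given the 14 time-varying variables named in v.names and the 8 days of the protocol (both lines should return TRUE):

#sanity checks on the reshape
nrow(dailywide)==length(unique(daily$id))   #one row per person
ncol(dailywide)==1 + 14*8                   #id + 14 variables x 8 days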

Reshape Wide to Long

And, one way to go from wide to long is using the reshape() function (not to be confused with the reshape2 package).

#reshaping wide to long
dailylong <- reshape(data=dailywide, 
                    timevar=c("day"), 
                    idvar="id",
                    direction="long", sep="_")
#sorting for easy viewing
# order by id and day
dailylong <- dailylong[order(dailylong$id,dailylong$day),]

Let’s look at the data structure (first few columns only).

head(dailylong[,1:9],12)
##        id day slphrs weath lteq  pss se swls evalday
## 101.0 101   0    6.0     1   10 2.50  2  3.8       1
## 101.1 101   1    2.0     2   10 2.75  3  4.2       0
## 101.2 101   2    9.0     3   10 3.50  4  5.0       1
## 101.3 101   3    7.5     2    9 3.00  4  5.0       1
## 101.4 101   4    8.0     1   18 2.75  3  4.0       1
## 101.5 101   5    8.0     2   19 2.75  3  4.2       1
## 101.6 101   6    8.0     3   21 3.50  4  4.6       1
## 101.7 101   7    7.0    NA   14 2.75  3  4.6       1
## 102.0 102   0    7.0     0   12 3.50  5  5.6       0
## 102.1 102   1    6.0     0   20 4.00  5  6.6       0
## 102.2 102   2    3.0     2   11 4.00  5  5.6       1
## 102.3 102   3     NA     1    6 1.75  3  6.8       1

Note the differences in the size (#row x #column) between our reshaped long data and the original long data.

#dimension of reshaped file
dim(dailylong)
## [1] 1520   16
#dimension of original file
dim(daily)
## [1] 1458   17

Why are the dimensions of the two files different? The reshaped long file is fully crossed - every person has a row for every day, with NAs filling in the person-days that had no record in the original file - and it has one fewer column because the date variable was dropped during the long-to-wide reshape. Whenever reshaping …

  1. Make sure that the missing data have been treated properly.
  2. Know where all the variables and records have gone.
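
A quick check confirms the fully crossed structure:

#persons x days in the fully crossed long file
length(unique(daily$id))*length(unique(daily$day))   #should match nrow(dailylong)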

Note: A more general reshaping solution is provided by Hadley Wickham’s reshape2 package through melt and cast functions.
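
A hedged sketch of that approach (assumes the reshape2 package is installed; the melted variables here are illustrative):

#library(reshape2)
#melt to "one row per id-day-variable" form, then cast back to wide
#dailymolten <- melt(daily, id.vars=c("id","day"), measure.vars=c("se","posaff"))
#dailywide2 <- dcast(dailymolten, id ~ variable + day)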

Ok - now we have both wide and long data sets and can start describing the data. Yay!

C. Assessing the overall characteristics of the data/protocol

For basic description of the protocol we often need a few basic pieces of information:

  1. The number of participants (persons).

  2. The number of occasions (noting that this may differ across persons).

Let’s see …

  1. How many unique persons are there in these data?
#Make a vector of the unique ids
idlist <- unique(dailylong$id)
#length of the id vector
length(idlist)
## [1] 190

Ok so, N = 190.

  2. How many occasions are there in these data?
#Make a vector of the unique entries in the "day" variable
daylist <- unique(dailylong$day)
#length of the day vector
length(daylist)
## [1] 8

We also need to check on how many people completed each measure on each day.

This is easy to do with the long file.

#we describe by day (just for the first 6 variables)
describeBy(dailylong[,1:6],group=dailylong$day)
## 
##  Descriptive statistics by group 
## group: 0
##        vars   n   mean     sd median trimmed    mad    min    max range
## id        1 190 318.29 130.44  321.5  318.99 151.23 101.00 532.00   431
## day       2 190   0.00   0.00    0.0    0.00   0.00   0.00   0.00     0
## slphrs    3 188   7.26   1.80    7.0    7.19   1.48   3.00  14.00    11
## weath     4 190   2.07   1.55    2.0    2.09   2.97   0.00   4.00     4
## lteq      5 190  11.94   8.85    9.0   11.09   8.15   0.00  40.00    40
## pss       6 190   2.56   0.65    2.5    2.60   0.74   0.75   3.75     3
##         skew kurtosis   se
## id     -0.04    -1.09 9.46
## day      NaN      NaN 0.00
## slphrs  0.52     1.51 0.13
## weath  -0.12    -1.50 0.11
## lteq    0.83     0.04 0.64
## pss    -0.58     0.07 0.05
## -------------------------------------------------------- 
## group: 1
##        vars   n   mean     sd median trimmed    mad min max range  skew
## id        1 190 318.29 130.44 321.50  318.99 151.23 101 532   431 -0.04
## day       2 190   1.00   0.00   1.00    1.00   0.00   1   1     0   NaN
## slphrs    3 175   6.81   1.56   7.00    6.86   1.48   2  10     8 -0.35
## weath     4 176   1.98   1.30   2.00    1.97   1.48   0   4     4 -0.08
## lteq      5 175  12.88  10.28  12.00   11.83  10.38   0  52    52  1.04
## pss       6 176   2.63   0.70   2.75    2.67   0.74   0   4     4 -0.70
##        kurtosis   se
## id        -1.09 9.46
## day         NaN 0.00
## slphrs     0.29 0.12
## weath     -1.06 0.10
## lteq       1.22 0.78
## pss        1.15 0.05
## -------------------------------------------------------- 
## group: 2
##        vars   n   mean     sd median trimmed    mad min max range  skew
## id        1 190 318.29 130.44 321.50  318.99 151.23 101 532   431 -0.04
## day       2 190   2.00   0.00   2.00    2.00   0.00   2   2     0   NaN
## slphrs    3 178   7.54   1.96   7.75    7.52   1.85   3  13    10  0.08
## weath     4 180   2.10   1.04   2.00    2.09   1.48   0   4     4 -0.05
## lteq      5 180  11.48   9.99   9.00   10.14   8.90   0  53    53  1.24
## pss       6 180   2.66   0.63   2.75    2.66   0.74   1   4     3 -0.03
##        kurtosis   se
## id        -1.09 9.46
## day         NaN 0.00
## slphrs     0.21 0.15
## weath     -0.54 0.08
## lteq       1.70 0.74
## pss       -0.37 0.05
## -------------------------------------------------------- 
## group: 3
##        vars   n   mean     sd median trimmed    mad    min max  range
## id        1 190 318.29 130.44 321.50  318.99 151.23 101.00 532 431.00
## day       2 190   3.00   0.00   3.00    3.00   0.00   3.00   3   0.00
## slphrs    3 179   7.65   1.90   8.00    7.73   1.48   0.00  12  12.00
## weath     4 179   1.99   1.33   2.00    1.99   1.48   0.00   4   4.00
## lteq      5 181  10.45   9.89   8.00    9.06   7.41   0.00  51  51.00
## pss       6 181   2.66   0.69   2.75    2.68   0.74   0.25   4   3.75
##         skew kurtosis   se
## id     -0.04    -1.09 9.46
## day      NaN      NaN 0.00
## slphrs -0.52     1.10 0.14
## weath  -0.11    -1.17 0.10
## lteq    1.17     1.12 0.74
## pss    -0.33     0.11 0.05
## -------------------------------------------------------- 
## group: 4
##        vars   n   mean     sd median trimmed    mad    min max  range
## id        1 190 318.29 130.44 321.50  318.99 151.23 101.00 532 431.00
## day       2 190   4.00   0.00   4.00    4.00   0.00   4.00   4   0.00
## slphrs    3 180   7.06   1.87   7.00    7.08   1.48   0.00  18  18.00
## weath     4 181   1.83   1.28   2.00    1.79   1.48   0.00   4   4.00
## lteq      5 184  13.60  11.32  12.00   12.26  10.38   0.00  58  58.00
## pss       6 183   2.64   0.70   2.75    2.66   0.74   0.25   4   3.75
##         skew kurtosis   se
## id     -0.04    -1.09 9.46
## day      NaN      NaN 0.00
## slphrs  0.77     7.24 0.14
## weath   0.18    -1.08 0.10
## lteq    1.13     1.19 0.83
## pss    -0.32     0.04 0.05
## -------------------------------------------------------- 
## group: 5
##        vars   n   mean     sd median trimmed    mad    min max  range
## id        1 190 318.29 130.44  321.5  318.99 151.23 101.00 532 431.00
## day       2 190   5.00   0.00    5.0    5.00   0.00   5.00   5   0.00
## slphrs    3 173   7.07   1.89    7.0    7.14   1.48   0.00  15  15.00
## weath     4 175   1.98   1.18    2.0    1.97   1.48   0.00   4   4.00
## lteq      5 177  12.54  10.61    9.0   11.35   8.90   0.00  45  45.00
## pss       6 178   2.60   0.70    2.5    2.61   0.74   0.75   4   3.25
##         skew kurtosis   se
## id     -0.04    -1.09 9.46
## day      NaN      NaN 0.00
## slphrs -0.09     2.06 0.14
## weath  -0.08    -0.83 0.09
## lteq    0.87     0.06 0.80
## pss    -0.07    -0.67 0.05
## -------------------------------------------------------- 
## group: 6
##        vars   n   mean     sd median trimmed    mad    min max  range
## id        1 190 318.29 130.44  321.5  318.99 151.23 101.00 532 431.00
## day       2 190   6.00   0.00    6.0    6.00   0.00   6.00   6   0.00
## slphrs    3 179   6.92   1.73    7.0    6.90   1.48   2.00  12  10.00
## weath     4 176   1.68   1.08    2.0    1.68   1.48   0.00   4   4.00
## lteq      5 180  11.91  10.22    9.0   10.60   8.90   0.00  47  47.00
## pss       6 179   2.53   0.69    2.5    2.55   0.74   0.75   4   3.25
##         skew kurtosis   se
## id     -0.04    -1.09 9.46
## day      NaN      NaN 0.00
## slphrs  0.07     0.02 0.13
## weath   0.02    -0.80 0.08
## lteq    1.10     0.88 0.76
## pss    -0.28    -0.31 0.05
## -------------------------------------------------------- 
## group: 7
##        vars   n   mean     sd median trimmed    mad    min max  range
## id        1 190 318.29 130.44 321.50  318.99 151.23 101.00 532 431.00
## day       2 190   7.00   0.00   7.00    7.00   0.00   7.00   7   0.00
## slphrs    3 176   7.03   1.61   7.00    7.08   1.48   2.00  12  10.00
## weath     4 176   2.34   1.41   3.00    2.42   1.48   0.00   4   4.00
## lteq      5 179  13.84  11.59  12.00   12.55  10.38   0.00  50  50.00
## pss       6 178   2.64   0.71   2.75    2.66   0.74   0.25   4   3.75
##         skew kurtosis   se
## id     -0.04    -1.09 9.46
## day      NaN      NaN 0.00
## slphrs -0.30     0.69 0.12
## weath  -0.38    -1.18 0.11
## lteq    0.89     0.11 0.87
## pss    -0.46     0.37 0.05

We note that there are differing numbers of reports for the different days (Day 0 to Day 7), as well as differing numbers of records for the different variables.

Ah - the joys of working with intensive repeated measures data; it requires making hard decisions!

Which participants should we analyze or drop from analysis?
Which variables should we analyze or drop from analysis?
What set of occasions should we analyze or drop from analysis?
PLUS … We need to be able to justify those decisions.
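
To help inform those decisions, it is useful to have a quick overview of how much is missing on each variable. A minimal sketch in base R:

#proportion of missing records for each variable
round(colMeans(is.na(dailylong)),3)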

Let’s outline a strategy for how this might be done for a specific paper on, say, Daily Self-esteem (se).
Who provided how much Daily Self-esteem data?

# 1. Calculate the number of rows for each id, after removing rows where se is missing 
numobs.se <- table(dailylong[!is.na(dailylong$se),]$id)
numobs.se
## 
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 
##   8   8   6   8   8   8   8   8   6   8   8   8   8   6   1   4   8   7 
## 119 120 121 122 123 124 125 126 127 128 129 130 201 202 203 204 205 206 
##   8   7   7   7   8   5   8   7   6   2   7   8   8   8   8   8   8   8 
## 207 208 209 210 211 212 213 214 215 218 219 220 221 222 223 224 225 226 
##   7   8   8   7   8   8   8   8   8   8   2   8   8   6   8   8   8   8 
## 227 228 229 230 231 233 234 237 238 239 240 241 242 243 244 245 246 247 
##   7   8   8   8   8   7   8   8   8   8   6   8   8   8   8   8   3   8 
## 248 249 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 
##   8   8   8   8   8   8   8   8   8   8   8   8   8   7   8   8   8   8 
## 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 
##   7   8   1   8   8   7   8   8   8   7   8   8   8   8   8   8   8   8 
## 335 336 337 338 339 340 341 342 343 344 345 346 401 402 403 404 405 406 
##   8   8   8   8   8   8   8   8   7   8   8   1   8   8   8   8   8   8 
## 407 408 409 410 411 412 413 414 415 417 418 419 420 421 422 423 424 425 
##   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8 
## 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 501 
##   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8 
## 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 
##   8   8   8   8   8   8   8   6   8   8   8   8   8   8   8   8   8   8 
## 520 521 524 525 526 527 528 530 531 532 
##   8   8   8   8   8   8   8   7   8   8
#Alternative way
    #data.se <- dailylong[,c("id","day","se")] 
    #numobs <- aggregate(day ~ id, data = data.se[!is.na(data.se$se),], FUN = function(x) length(unique(x)))
    #names(numobs) <- c("id","daycount.se")

# 2. Making a frequency table of the observation counts
tbl <- table(numobs.se)
tbl
## numobs.se
##   1   2   3   4   5   6   7   8 
##   3   2   1   1   1   7  16 159
# 3. Using that table to calculate the number of persons and the proportion of the sample providing each number of observations
tbl.pct <- cbind(numobs=as.numeric(rownames(tbl)), freq=tbl, cumul=cumsum(tbl), 
                 relative.pct=prop.table(tbl), cumulative.pct=cumsum(tbl)/length(numobs.se),
                 invcumulative.pct=1-(cumsum(tbl)/length(numobs.se)))
round(tbl.pct,3)
##   numobs freq cumul relative.pct cumulative.pct invcumulative.pct
## 1      1    3     3        0.016          0.016             0.984
## 2      2    2     5        0.011          0.026             0.974
## 3      3    1     6        0.005          0.032             0.968
## 4      4    1     7        0.005          0.037             0.963
## 5      5    1     8        0.005          0.042             0.958
## 6      6    7    15        0.037          0.079             0.921
## 7      7   16    31        0.084          0.163             0.837
## 8      8  159   190        0.837          1.000             0.000

From this description we can make decisions about which individuals “completed the protocol” to a sufficient level.

Then, the data may be “cleaned” appropriately (e.g., remove the non-compliers). In our example, we shall simply retain everyone (and pray that incomplete data are missing completely at random, MCAR).
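
If we did want to drop non-compliers, one possible rule is sketched below (the 5-report threshold is hypothetical, purely for illustration):

#keep only persons with at least 5 self-esteem reports (hypothetical threshold)
#ids.keep <- names(numobs.se[numobs.se >= 5])
#dailylong <- dailylong[dailylong$id %in% ids.keep,]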

Now we are ready to look at some descriptions of the content data.

D. Describing the data (descriptives and plots)

There are a variety of ways to describe intensive repeated measures data - by aggregating in different ways. We illustrate some of the options we use, noting that for any given research question, we usually look at the data from many perspectives (so as to be sure that we really understand the data well).

Daily Self-esteem at the sample level = all persons and occasions.

#sample descriptives
describe(dailylong$se)
##    vars    n mean   sd median trimmed  mad min max range  skew kurtosis
## X1    1 1445 3.43 0.99      3    3.47 1.48   1   5     4 -0.41    -0.11
##      se
## X1 0.03
#histogram
ggplot(data=dailylong, aes(x=se)) +
  geom_histogram(binwidth=1, boundary=.5, fill="white", color="black") + 
  labs(x = "Daily Self-esteem")
## Warning: Removed 75 rows containing non-finite values (stat_bin).

Daily Self-esteem at the daily level = all persons faceted by day.

#sample descriptives by occasion
describeBy(dailylong[,c("se")],group=dailylong$day)
## 
##  Descriptive statistics by group 
## group: 0
##    vars   n mean   sd median trimmed  mad min max range  skew kurtosis
## X1    1 190  3.4 0.98      3    3.41 1.48   1   5     4 -0.26    -0.34
##      se
## X1 0.07
## -------------------------------------------------------- 
## group: 1
##    vars   n mean   sd median trimmed  mad min max range  skew kurtosis
## X1    1 176 3.49 1.03      4    3.57 1.48   1   5     4 -0.64     0.26
##      se
## X1 0.08
## -------------------------------------------------------- 
## group: 2
##    vars   n mean   sd median trimmed  mad min max range  skew kurtosis
## X1    1 180 3.44 0.98      3    3.46 1.48   1   5     4 -0.24    -0.35
##      se
## X1 0.07
## -------------------------------------------------------- 
## group: 3
##    vars   n mean   sd median trimmed  mad min max range  skew kurtosis
## X1    1 181 3.49 0.96      4    3.54 1.48   1   5     4 -0.48     0.08
##      se
## X1 0.07
## -------------------------------------------------------- 
## group: 4
##    vars   n mean   sd median trimmed  mad min max range  skew kurtosis
## X1    1 183  3.5 0.97      4    3.54 1.48   1   5     4 -0.45    -0.11
##      se
## X1 0.07
## -------------------------------------------------------- 
## group: 5
##    vars   n mean   sd median trimmed  mad min max range  skew kurtosis
## X1    1 178 3.47 0.98      4    3.51 1.48   1   5     4 -0.39    -0.16
##      se
## X1 0.07
## -------------------------------------------------------- 
## group: 6
##    vars   n mean sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 179 3.34  1      3    3.36 1.48   1   5     4 -0.35    -0.32 0.07
## -------------------------------------------------------- 
## group: 7
##    vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 178 3.32 1.04      3    3.36 1.48   1   5     4 -0.4    -0.21 0.08
#histogram faceted by day
ggplot(data=dailylong, aes(x=se)) +
  geom_histogram(binwidth=1, boundary=.5, fill="white", color="black") + 
  labs(x = "Daily Self-esteem") +
  facet_grid(day ~ .)
## Warning: Removed 75 rows containing non-finite values (stat_bin).

#boxplot by day
qplot(x=factor(day), y=se, data=dailylong, geom="boxplot", ylab="Daily Self-esteem")
## Warning: Removed 75 rows containing non-finite values (stat_boxplot).

#use factor() to convert "time" from numeric to categorical

#boxplot by day
ggplot(data=dailylong, aes(x=factor(day), y=se)) + 
  geom_boxplot(notch = TRUE) +
  stat_summary(fun.y="mean", geom="point", shape=23, size=3, fill="white") +
  labs(x = "Day", y = "Daily Self-esteem")
## Warning: Removed 75 rows containing non-finite values (stat_boxplot).
## Warning: Removed 75 rows containing non-finite values (stat_summary).
## notch went outside hinges. Try setting notch=FALSE.
## notch went outside hinges. Try setting notch=FALSE.
## notch went outside hinges. Try setting notch=FALSE.
## notch went outside hinges. Try setting notch=FALSE.
## notch went outside hinges. Try setting notch=FALSE.
## notch went outside hinges. Try setting notch=FALSE.
## notch went outside hinges. Try setting notch=FALSE.
## notch went outside hinges. Try setting notch=FALSE.

#Density distribution by day
ggplot(data=dailylong, aes(x=se)) + 
  geom_density(aes(group=factor(day), colour=factor(day), fill=factor(day)), alpha=0.3)
## Warning: Removed 75 rows containing non-finite values (stat_density).

#adjust bandwidth 
ggplot(data=dailylong, aes(x=se)) + 
  geom_density(aes(group=factor(day), colour=factor(day), fill=factor(day)), alpha=0.3, adjust=2)
## Warning: Removed 75 rows containing non-finite values (stat_density).

Notice in these plots how little “change” there is at the sample level across days. Is that expected?

Finally, we also always want to look at the individual-level data.

Self-esteem at the individual level = each individual described as a separate entity.

To illustrate - we just work with a few “randomly” selected individuals with low id numbers.

#sample descriptives by individual 
dailylongfew <- dailylong[which(dailylong$id<=105),c("id","day","se")]  #selecting only a few individuals
describeBy(dailylongfew[,c("se")],group=dailylongfew$id, na.rm=TRUE)
## 
##  Descriptive statistics by group 
## group: 101
##    vars n mean   sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 8 3.25 0.71      3    3.25 0.74   2   4     2 -0.27     -1.3 0.25
## -------------------------------------------------------- 
## group: 102
##    vars n mean   sd median trimmed  mad min max range  skew kurtosis  se
## X1    1 8 4.12 1.13    4.5    4.12 0.74   2   5     3 -0.73    -1.11 0.4
## -------------------------------------------------------- 
## group: 103
##    vars n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 6 3.67 0.82    3.5    3.67 0.74   3   5     2 0.48    -1.58 0.33
## -------------------------------------------------------- 
## group: 104
##    vars n mean   sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 8 3.25 0.71      3    3.25 0.74   2   4     2 -0.27     -1.3 0.25
## -------------------------------------------------------- 
## group: 105
##    vars n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 8 2.75 0.71      3    2.75 0.74   2   4     2 0.27     -1.3 0.25
#histogram faceted by individual
ggplot(data=dailylong[which(dailylong$id <=105),], aes(x=se)) +
  geom_histogram(binwidth=1, boundary=.5, fill="white", color="black") + 
  labs(x = "Daily Self-esteem") + xlim(1,5) +
  facet_grid(id ~ .)
## Warning: Removed 2 rows containing non-finite values (stat_bin).

#boxplot by individual
ggplot(data=dailylong[which(dailylong$id <=105),], aes(x=factor(id), y=se)) + 
  geom_boxplot() +
  stat_summary(fun.y="mean", geom="point", shape=23, size=3, fill="white") +
  labs(x = "ID", y = "Daily Self-esteem") + ylim(1,5)
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
## Warning: Removed 2 rows containing non-finite values (stat_summary).

#Density distribution by individual
ggplot(data=dailylong[which(dailylong$id <=105),], aes(x=se)) + 
  geom_density(aes(group=factor(id), colour=factor(id), fill=factor(id)), alpha=0.3) +
  xlim(1,5) + xlab("Daily Self-esteem")
## Warning: Removed 2 rows containing non-finite values (stat_density).

And moving to individual-level trajectories.

#ggplot version .. see also http://ggplot.yhathq.com/docs/index.html
ggplot(data = dailylong[which(dailylong$id <=105),], aes(x = day, y = se, group = id, color=factor(id))) +
  geom_point() + 
  geom_line(data=dailylong[which(dailylong$id <= 105 & !is.na(dailylong$se)),]) +
  xlab("Day") + 
  ylab("Daily Self-esteem") + ylim(1,5) +
  scale_x_continuous(breaks=seq(0,7,by=1)) 
## Warning: Removed 2 rows containing missing values (geom_point).

#Trying to look at the density of trajectories  
ggplot(data = dailylong, aes(x = day, y = se, group = id)) +
  geom_line(data=dailylong[!is.na(dailylong$se),], aes(colour=factor(id)), position=position_jitter(width = .1, height=0)) +
  stat_sum(aes(size = ..n.., group = 1), alpha=.5) +
  scale_size_area(max_size=10) +
  xlab("Day") + 
  ylab("Daily Self-esteem") + ylim(1,5) +
  scale_x_continuous(breaks=seq(0,7,by=1)) +
  theme(axis.title=element_text(size=12),legend.position = "none",
        axis.text=element_text(size=10, colour="black"))
## Warning: Removed 75 rows containing non-finite values (stat_sum).

We see …

  1. differences across persons,
  2. change across time, and
  3. the impacts of the measurement scale (5-point Likert Scale)!

In designing your own studies, consider the sensitivity of the measurement devices.

Extra tip: For making a presentation or for inclusion in a manuscript we will need to save our plots as high-density images. Here is one way of outputting the last plot created in ggplot.

#Saving the plot file. See also ... http://www.cookbook-r.com/Graphs/Output_to_a_file/
#ggsave(filename = default_name(plot), plot = last_plot(), device = default_device(filename), 
#       path = NULL, scale = 1, width = par("din")[1], height = par("din")[2], 
#       units = c("in", "cm", "mm"), dpi = 300, limitsize = TRUE, ...)
ggsave(filename = "Daily_Self-esteem_raw.png", width = 6, height = 4, dpi=600)
## Warning: Removed 75 rows containing non-finite values (stat_sum).

Conclusion

This session we covered some basics: A. Reading in Repeated Measures Data from an Experience Sampling Study; B. Manipulating those data into two different structures (long and wide); C. Assessing the overall characteristics of the data/protocol; and D. Describing the data (descriptives and plots).

Now we have a good set of strategies that can be applied whenever looking at new longitudinal data.
With comprehensive descriptions in place, subsequent analyses will be fun!