Overview

In this tutorial we walk through basic Data Mining ideas using regression. We introduce the exploratory perspective - prediction - and the use of K-fold Cross-Validation.

Outline

In this session we cover …

  1. Introduction to Motivating Problem and Data
  2. Regression as a Prediction Model
  3. Expanding the Prediction Model
  4. K-fold Cross-Validation
  5. Application to New Data

Prelim - Loading libraries used in this script.

library(psych)  #for general functions
library(ggplot2)  #for data visualization
library(car)  # for regression diagnostics

1A. The Regression (Prediction) Model

Although not usually considered as such in the Social Science community, regression is part of the data mining toolbox. Generally, it might be labeled as a form of supervised learning. For clarity, we write out the general (univariate) polynomial model we use here …

\[ {y_{i}} = {\beta_0} + {\beta_1}{x_{i}} + {\beta_2}{x_{i}^2} + ... + {\beta_k}{x_{i}^k} + {e_{i}} \]

We will use this model to optimize prediction of a specific outcome.
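
To make the connection to R concrete, a model of this form can be fit with lm(). Here is a minimal sketch on simulated data (the names x, y, sim, and lm.sketch are hypothetical and used only for illustration); the fits to the actual data follow in Sections 2 and 3.

#sketch: a second-order instance of the general model, fit with lm() on simulated data
set.seed(123)
x <- runif(100, 0, 10)                 #hypothetical predictor
y <- 2 + 0.5*x - 0.1*x^2 + rnorm(100)  #hypothetical outcome
sim <- data.frame(x, y)
lm.sketch <- lm(y ~ x + I(x^2), data=sim)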

The motivating problem is derived from the following:
+Phenylketonuria (PKU) is an inborn error of metabolism that can have devastating effects on child intelligence
+Children of mothers with PKU can suffer damage in utero, even if they do not have the full PKU
+The critical question is the association between prenatal phenylalanine (PHE) exposure of the fetus in utero and childhood intelligence

Our goal is to develop an accurate prediction equation.

1B. Data: Training Set and Test Set

In this example there are two data sets: one that we can use for data mining (an exploration data set), and one that we cannot touch until the very end (a confirmation data set).
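
As an aside, if we only had a single data set, we could construct this kind of exploration/confirmation split ourselves. A minimal sketch (illustrated with the built-in mtcars data and an 80/20 split, both chosen arbitrarily here):

#sketch: splitting a single data set into exploration and confirmation portions
set.seed(1234)                                            #for reproducibility
explore.rows <- sample(nrow(mtcars), size=round(0.8*nrow(mtcars)))
dat.explore <- mtcars[explore.rows, ]                     #data we may mine freely
dat.confirm <- mtcars[-explore.rows, ]                    #data held out until the very end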

Reading in the exploration data set.

#set filepath for data file
filepath <- "https://quantdev.ssri.psu.edu/sites/qdev/files/apexpos.csv"
#read in the .csv file using the url() function
dat <- read.csv(file=url(filepath),header=TRUE)

Prelim - Descriptives

Let's have a quick look at the data file and the descriptives.

#data structure
head(dat,10)
##    id apexpos fsiq7
## 1   1     2.2   121
## 2   2     5.5    98
## 3   3     5.2    93
## 4   4     3.6    98
## 5   5     7.2    89
## 6   6     7.3   115
## 7   7     9.9    72
## 8   8     8.9    90
## 9   9     0.0    89
## 10 10     7.1   111
names(dat)
## [1] "id"      "apexpos" "fsiq7"

There are three variables in the data set:
+“id” is an identification variable
+“apexpos” is prenatal phenylalanine (PHE) exposure
+“fsiq7” is childhood intelligence (Age 7 years)

#descriptives (means, sds)
describe(dat[,-1])  #the -1 ignores the first column, "id"
##         vars   n  mean    sd median trimmed   mad min   max range skew
## apexpos    1 314 11.28  6.94   10.7   11.14  8.45   0  24.9  24.9 0.13
## fsiq7      2 314 74.65 24.12   76.0   74.53 29.65  31 126.0  95.0 0.00
##         kurtosis   se
## apexpos     -1.1 0.39
## fsiq7       -1.1 1.36
#correlation matrix
round(cor(dat[,-1]),2)
##         apexpos fsiq7
## apexpos    1.00 -0.84
## fsiq7     -0.84  1.00

Plotting the empirical data (in our exploration set)

## Plotting empirical data ##
ggplot(dat, aes(x=apexpos, y=fsiq7)) + 
  geom_point() +
  xlab('PHE Exposure') + ylab('Age 7 Full Scale IQ')

2. Regression as a Prediction Model

We formulate the initial prediction model.

\[ {y_{i}} = {\beta_0}{1} + {\beta_1}{x_{i}} + {e_{i}} \]

Model Fitting

#Specifying linear regression model
lm.1 = lm(fsiq7 ~ 1 + apexpos,
          data=dat,
          na.action=na.exclude)

summary(lm.1)
## 
## Call:
## lm(formula = fsiq7 ~ 1 + apexpos, data = dat, na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.108  -8.790  -0.232   8.004  39.749 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 107.4337     1.4289   75.19   <2e-16 ***
## apexpos      -2.9068     0.1079  -26.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.25 on 312 degrees of freedom
## Multiple R-squared:  0.6992, Adjusted R-squared:  0.6982 
## F-statistic: 725.1 on 1 and 312 DF,  p-value: < 2.2e-16

Interpreting the output, we get the prediction model

\[ {y_{i}} = {107.4337} + {-2.9068}{x_{i}} + {e_{i}} \]
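
As a quick check on the interpretation, we can pull the coefficients from the model object and compute a prediction by hand, here at a hypothetical exposure value of 10:

#extracting the estimated coefficients
b <- coef(lm.1)
#predicted IQ at a hypothetical exposure of 10, by hand and via predict()
b[1] + b[2]*10
predict(lm.1, newdata=data.frame(apexpos=10))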

And a plot of the model …

plot.lm1 = ggplot(dat, aes(x=apexpos, y=fsiq7)) + 
              geom_point() +
              stat_smooth(method='lm', formula = y ~ x, color="blue", size = 1) +
              xlab('PHE Exposure') + ylab('Age 7 Full Scale IQ')
print(plot.lm1)

We can do some diagnostics on the Linear Regression Model

#Outlier Test
outlierTest(lm.1) # Bonferroni p-value for most extreme obs
## 
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##     rstudent unadjusted p-value Bonferonni p
## 307 3.060391          0.0024031      0.75457
#QQ Plot
qqPlot(lm.1, main="QQ Plot") #qq plot for studentized resid 

#Cook's D
cook = cooks.distance(lm.1)
plot(cook,ylab="Cooks distances")

#Raw scores against Residuals 
plot(dat$apexpos, lm.1$res)

#Fitted scores against Residuals
plot(lm.1$fitted, lm.1$res)

#Leverage Plot
leverage = hat(model.matrix(lm.1))
plot(leverage)

This all looks pretty good - but perhaps we can do better.

3. Expanding the Prediction Model

We formulate an expanded quadratic model.

\[ {y_{i}} = {\beta_0}{1} + {\beta_1}{x_{i}} + {\beta_2}{x_{i}^2} + {e_{i}} \]

Model Setup and Fitting

# making quadratic term 
dat$apexpos2 = dat$apexpos^2

lm.2 = lm(fsiq7 ~ 1 + apexpos + apexpos2,
          data=dat,
          na.action=na.exclude)

summary(lm.2)
## 
## Call:
## lm(formula = fsiq7 ~ 1 + apexpos + apexpos2, data = dat, na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.939  -8.731  -0.351   7.826  38.618 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 108.47508    2.05289  52.840  < 2e-16 ***
## apexpos      -3.17979    0.40087  -7.932 3.91e-14 ***
## apexpos2      0.01163    0.01644   0.707     0.48    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.26 on 311 degrees of freedom
## Multiple R-squared:  0.6997, Adjusted R-squared:  0.6977 
## F-statistic: 362.2 on 2 and 311 DF,  p-value: < 2.2e-16

Interpreting the output, we get the prediction model

\[ {y_{i}} = {108.47508} + {-3.17979}{x_{i}} + {0.01163}{x_{i}^2} + {e_{i}} \]

And a plot of the model …

plot.lm2 = ggplot(dat, aes(x=apexpos, y=fsiq7)) + 
                geom_point() +
                stat_smooth(method='lm', formula = y ~ x + I(x^2), size = 1) +
                xlab('PHE Exposure') + ylab('Age 7 Full Scale IQ')
print(plot.lm2)

#Alternative plotting approach 
# plot(dat$apexpos, fitted(lm.2))
# 
# plot(dat$apexpos, dat$fsiq7, ylab = 'Full Scale IQ', xlab = 'PHE Exposure')
# 
# Save model estimates for plotting curve
# b0 = lm.2$coefficients[1]
# b1 = lm.2$coefficients[2]
# b2 = lm.2$coefficients[3]
# 
# curve(b0 + b1*x + b2*x^2, from = min(dat$apexpos), to = max(dat$apexpos), n = 100, 
#       col = "red", lwd = 2, ann = F, add = T)

We can go even further - adding more polynomial terms. Let’s keep increasing the order up to 20.

Note: We do this using the built-in R function poly(), which creates orthogonal polynomials.
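
To see what poly() is doing, we can compare the correlations among its orthogonal polynomial terms to those among the raw powers; a small sketch:

#orthogonal polynomial terms are uncorrelated with one another
round(cor(poly(dat$apexpos, 3)), 2)
#raw powers, by contrast, are highly correlated
round(cor(poly(dat$apexpos, 3, raw=TRUE)), 2)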

#setting order of polynomial
order = 20

#creating some empty object to hold model fit information
r.square = matrix(data=NA,nrow=order,ncol=1)
adj.r.square = matrix(data=NA,nrow=order,ncol=1)
mse = matrix(data=NA,nrow=order,ncol=1)

#Looping through all the candidate regression models
for (i in 1:order) {
  #i <- 2  #for testing loop
     lm.poly <- lm(fsiq7 ~ poly(apexpos,i),
                   data=dat,
                   na.action=na.exclude)
#     print(summary(lm.poly))

     #making and storing plots
     mypath = file.path(getwd(),paste("pred.poly.",i,".png", sep=""))
     png(file = mypath)  #turning on plot device
     myplot = ggplot(dat, aes(x=apexpos, y=fsiq7)) + 
                    geom_point() +
                    stat_smooth(method='lm', formula = y ~ poly(x,i), size = 1) + 
                    xlab('PHE Exposure') + ylab('Age 7 Full Scale IQ')
     print(myplot)
     dev.off() #turning off plot device
     
     r.square[i] = summary(lm.poly)$r.squared
     adj.r.square[i] = summary(lm.poly)$adj.r.squared
     mse[i] = mean(residuals(lm.poly)^2)

     }

#Merging the model fits
fits <- as.data.frame(cbind(1:order,r.square,adj.r.square,mse))
colnames(fits) <- c("order","r.square","adj.r.square","mse")
fits
##    order  r.square adj.r.square      mse
## 1      1 0.6991699    0.6982057 174.4870
## 2      2 0.6996527    0.6977212 174.2070
## 3      3 0.7595095    0.7571822 139.4889
## 4      4 0.7641612    0.7611083 136.7908
## 5      5 0.7715048    0.7677954 132.5314
## 6      6 0.7747508    0.7703486 130.6486
## 7      7 0.7754067    0.7702690 130.2682
## 8      8 0.7755293    0.7696415 130.1971
## 9      9 0.7766750    0.7700634 129.5326
## 10    10 0.7780076    0.7706811 128.7597
## 11    11 0.7780537    0.7699696 128.7329
## 12    12 0.7780565    0.7692082 128.7313
## 13    13 0.7781012    0.7684856 128.7054
## 14    14 0.7814518    0.7712188 126.7619
## 15    15 0.7834292    0.7725279 125.6151
## 16    16 0.7834627    0.7717974 125.5956
## 17    17 0.7841315    0.7717337 125.2077
## 18    18 0.7847870    0.7716553 124.8275
## 19    19 0.7855862    0.7717295 124.3640
## 20    20 0.7868239    0.7722727 123.6460
#Plotting model fits#
ggplot(fits, aes(x=order)) +
       geom_line(aes(y=r.square), color="red", size=2) + 
       geom_line(aes(y=adj.r.square), color="blue", size=2) + 
       xlab('Order of the Polynomial') +
       ylab('R^2 (red) or Adjusted R^2 (blue)') + 
       xlim(1,20) + 
       ylim(0.5,1) 

We see improvement in prediction up through polynomial models of order 6 or so, and then less improvement.
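
One way to see this numerically is to ask which order maximizes the in-sample fit indices; because these are computed on the same data used for fitting, they keep rewarding added complexity, which is why we turn to cross-validation next.

#orders that maximize in-sample R^2 and adjusted R^2 (both land on high-order models)
which.max(fits$r.square)
which.max(fits$adj.r.square)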

4. K-fold Cross-Validation

To tune the model - that is, to choose the best model (i.e., the order of the polynomial) - we make use of the K-fold Cross-Validation procedure, wherein we:
+a. Randomly shuffle the data
+b. Set K, the number of folds
+c. Create K equally sized folds
+d. Perform model fitting and evaluation within each fold

a. Randomly shuffle the data

dat.shuffled <- dat[sample(nrow(dat)),]

b. Set K, the number of folds

K <- 10 

c. Create K equally sized folds

folds <- cut(seq(1,nrow(dat.shuffled)),breaks=K,labels=FALSE)
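
We can check that the folds are of (approximately) equal size:

#number of observations assigned to each fold
table(folds)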

d. Model fitting and evaluation

#Creating empty object to hold fit information
r.square = matrix(data=NA,nrow=K,ncol=order)

#Perform K-fold cross validation
for(i in 1:K){
    #Segment data by fold using the which() function 
    testIndexes <- which(folds==i,arr.ind=TRUE)
    testData <- dat.shuffled[testIndexes, ]
    trainData <- dat.shuffled[-testIndexes, ]
    #Use the test and train data partitions
    #Model fitting and evaluation
    for (j in 1:order){
      #training regression model on training folds
        fit.train = lm(fsiq7 ~ poly(apexpos,j), data = trainData)
        #evaluating fit on the test fold
            fit.test = predict(fit.train, newdata=testData)
        r.square[i,j] = cor(fit.test, testData$fsiq7, use='complete')^2
    }
}

#Averaging fit at each order 
fits.kfold <- colMeans(r.square)
#plotting cross-validated prediction accuracy 
plot(colMeans(r.square), type='l')
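
In addition to the plot, we can summarize both the average and the spread of the cross-validated \(R^2\) across the 10 folds at each order; a minimal sketch:

#mean and standard deviation of cross-validated R^2 across folds, by polynomial order
cv.summary <- data.frame(order   = 1:order,
                         mean.r2 = colMeans(r.square),
                         sd.r2   = apply(r.square, 2, sd))
cv.summary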

The 10-fold Cross-Validation suggests that the most consistent accuracy is found with a fifth-order polynomial.

We can then obtain the final prediction model using the complete training (exploration) data.

Model Fitting

#Fitting final model 
lm.final = lm(fsiq7~poly(apexpos,5), 
              data=dat,
              na.action=na.exclude)
summary(lm.final)
## 
## Call:
## lm(formula = fsiq7 ~ poly(apexpos, 5), data = dat, na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.243  -7.356  -1.273   7.433  31.509 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         74.647      0.656 113.796  < 2e-16 ***
## poly(apexpos, 5)1 -356.843     11.624 -30.699  < 2e-16 ***
## poly(apexpos, 5)2    9.377     11.624   0.807  0.42047    
## poly(apexpos, 5)3  104.410     11.624   8.982  < 2e-16 ***
## poly(apexpos, 5)4  -29.107     11.624  -2.504  0.01280 *  
## poly(apexpos, 5)5  -36.571     11.624  -3.146  0.00182 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.62 on 308 degrees of freedom
## Multiple R-squared:  0.7715, Adjusted R-squared:  0.7678 
## F-statistic:   208 on 5 and 308 DF,  p-value: < 2.2e-16

And, visually …

plot.lm.final = ggplot(dat, aes(x=apexpos, y=fsiq7)) + 
          geom_point() +
          stat_smooth(method='lm', formula = y ~ poly(x,5), size = 1) + 
          xlab('PHE Exposure') + ylab('Age 7 Full Scale IQ')

print(plot.lm.final)

Note that our final prediction equation is based on orthogonal polynomials … We are not really interested in the details of the model, only in its prediction accuracy, so we do not write out the prediction equation here. We simply retain the model object.
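
One simple way to retain the model object for later use is to serialize it to disk (the file name below is just a placeholder):

#saving the fitted model object (placeholder file name)
saveRDS(lm.final, file="lm_final.rds")
#it can be restored later with: lm.final <- readRDS("lm_final.rds")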

5. Application to New Data

In a final step, we apply the learned prediction model to a completely new data set - and evaluate the quality of the prediction (e.g. \(R^2\)).

Reading in the confirmation set - The Official Test Data

#set filepath for data file
filepath <- "https://quantdev.ssri.psu.edu/sites/qdev/files/apexpos_test.csv"
#read in the .csv file using the url() function
dat.test <- read.csv(file=url(filepath),header=TRUE)

Applying the Prediction model to the new data

newdata.predict <- predict(lm.final, newdata=dat.test)

Evaluating Quality of Prediction

#calculating squared correlation of actual and predicted scores
cor(dat.test$fsiq7, newdata.predict)^2
## [1] 0.7556371
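
Besides the squared correlation, we can also summarize prediction error on the raw IQ scale; a minimal sketch using the root mean squared error:

#root mean squared error of prediction in the confirmation data
sqrt(mean((dat.test$fsiq7 - newdata.predict)^2, na.rm=TRUE))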

Accuracy of prediction is as expected from the model fitting and tuning.

Concluding thought

Regression models can be used as machine learning tools. The focus shifts toward prediction and the explicit use of training data, k-fold cross-validation, and test data in ways that acknowledge and make use of exploratory modeling.

Thank you for playing! (and be careful =:-)