In this tutorial we walk through basic Data Mining ideas using regression. We introduce the exploratory perspective - prediction - and the use of K-fold Cross Validation.
In this session we cover …
library(psych) #for general functions
library(ggplot2) #for data visualization
library(car) # for regression diagnostics
Although not usually considered as such in the Social Science community, regressions are considered as part of the data mining toolbox. Generally they might be labeled as a form of supervised learning. For clarity we write out the general (univariate) model we use here …
\[ {y_{i}} = {\beta_0} + {\beta_1}{x_{i}} + {\beta_2}{x_{i}^2} + ... + {\beta_k}{x_{i}^k} + {e_{i}} \]
We will use this model to optimize prediction of a specific outcome.
The motivating problem is derived from the following:
+Phenylketonuria (PKU) is an inborn error of metabolism that can have devastating effects on child intelligence +Children of mothers with PKU can suffer damage in utero, even if they do not have the full PKU +Critical question is the association between prenatal phenylaline (PHE) exposure by the fetus in utero and childhood intelligence
Our goal is to develop an accurate prediction equation.
In this example there are two data sets. One data set that we can use for data mining (an exploration data set), and one data set that we cannot touch until the very end (a confirmation data set).
#set filepath for data file
filepath <- "https://quantdev.ssri.psu.edu/sites/qdev/files/apexpos.csv"
#read in the .csv file using the url() function
dat <- read.csv(file=url(filepath),header=TRUE)
Lets have a quick look at the data file and the descriptives.
#data structure
head(dat,10)
## id apexpos fsiq7
## 1 1 2.2 121
## 2 2 5.5 98
## 3 3 5.2 93
## 4 4 3.6 98
## 5 5 7.2 89
## 6 6 7.3 115
## 7 7 9.9 72
## 8 8 8.9 90
## 9 9 0.0 89
## 10 10 7.1 111
names(dat)
## [1] "id" "apexpos" "fsiq7"
There are three variables in the data set.
+“id” is a identification variable
+“apexpos” is prenatal phenylaline (PHE) exposure +“fsiq7” is childhood intelligence (Age 7 years)
#descriptives (means, sds)
describe(dat[,-1]) #the -1 ignores the first column, "id"
## vars n mean sd median trimmed mad min max range skew
## apexpos 1 314 11.28 6.94 10.7 11.14 8.45 0 24.9 24.9 0.13
## fsiq7 2 314 74.65 24.12 76.0 74.53 29.65 31 126.0 95.0 0.00
## kurtosis se
## apexpos -1.1 0.39
## fsiq7 -1.1 1.36
#correlation matrix
round(cor(dat[,-1]),2)
## apexpos fsiq7
## apexpos 1.00 -0.84
## fsiq7 -0.84 1.00
Plotting the empirical data (in our exploration set)
## Plotting empirical data ##
ggplot(dat, aes(x=apexpos, y=fsiq7)) +
geom_point() +
xlab('PHE Exposure') + ylab('Age 7 Full Scale IQ')
We formulate the initial prediction model.
\[ {y_{i}} = {\beta_0}{1} + {\beta_1}{x_{i}} + {e_{i}} \] Model Fitting
#Specifying linear regression model
lm.1 = lm(fsiq7 ~ 1 + apexpos,
data=dat,
na.action=na.exclude)
summary(lm.1)
##
## Call:
## lm(formula = fsiq7 ~ 1 + apexpos, data = dat, na.action = na.exclude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.108 -8.790 -0.232 8.004 39.749
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 107.4337 1.4289 75.19 <2e-16 ***
## apexpos -2.9068 0.1079 -26.93 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.25 on 312 degrees of freedom
## Multiple R-squared: 0.6992, Adjusted R-squared: 0.6982
## F-statistic: 725.1 on 1 and 312 DF, p-value: < 2.2e-16
Interpreting the output, we get the prediction model
\[ {y_{i}} = {107.4337} + {-2.9068}{x_{i}} + {e_{i}} \]
And a plot of the model …
plot.lm1 = ggplot(dat, aes(x=apexpos, y=fsiq7)) +
geom_point() +
stat_smooth(method='lm', formula = y ~ x, color="blue", size = 1) +
xlab('PHE Exposure') + ylab('Age 7 Full Scale IQ')
print(plot.lm1)
We can do some diagnostics on the Linear Regression Model
#Outlier Test
outlierTest(lm.1) # Bonferonni p-value for most extreme obs
##
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## 307 3.060391 0.0024031 0.75457
#QQ Plot
qqPlot(lm.1, main="QQ Plot") #qq plot for studentized resid
#Cook's D
cook = cooks.distance(lm.1)
plot(cook,ylab="Cooks distances")
#Raw scores against Residuals
plot(dat$apexpos, lm.1$res)
#Fitted scores against Residuals
plot(lm.1$fitted, lm.1$res)
#Leverage Plot
leverage = hat(model.matrix(lm.1))
plot(leverage)
This all looks pretty good - but perhaps we can do better.
We formulate an expanded quadratic model.
\[ {y_{i}} = {\beta_0}{1} + {\beta_1}{x_{i}} + {\beta_2}{x_{i}^2} + {e_{i}} \] Model Setup and Fitting
# making quadratic term
dat$apexpos2 = dat$apexpos^2
lm.2 = lm(fsiq7 ~ 1 + apexpos + apexpos2,
data=dat,
na.action=na.exclude)
summary(lm.2)
##
## Call:
## lm(formula = fsiq7 ~ 1 + apexpos + apexpos2, data = dat, na.action = na.exclude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.939 -8.731 -0.351 7.826 38.618
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 108.47508 2.05289 52.840 < 2e-16 ***
## apexpos -3.17979 0.40087 -7.932 3.91e-14 ***
## apexpos2 0.01163 0.01644 0.707 0.48
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.26 on 311 degrees of freedom
## Multiple R-squared: 0.6997, Adjusted R-squared: 0.6977
## F-statistic: 362.2 on 2 and 311 DF, p-value: < 2.2e-16
Interpreting the output, we get the prediction model
\[ {y_{i}} = {108.47508} + {-3.17979}{x_{i}} + {0.01163}{x_{i}^2} + {e_{i}} \] And a plot of the model …
plot.lm2 = ggplot(dat, aes(x=apexpos, y=fsiq7)) +
geom_point() +
stat_smooth(method='lm', formula = y ~ x + I(x^2), size = 1) +
xlab('PHE Exposure') + ylab('Age 7 Full Scale IQ')
print(plot.lm2)
#Alternative plotting approach
# plot(dat$apexpos, fitted(lm.2))
#
# plot(dat$apexpos, fsiq7, ylab = 'Full Scale IQ', xlab = 'PHE Exposure')
#
# Save model estimates for plotting curve
# b0 = lm.2$coefficients[1]
# b1 = lm.2$coefficients[2]
# b2 = lm.2$coefficients[3]
#
# curve(b0 + b1*x + b2*x^2, from = min(apexpos), to = max(apexpos), n = 100,
# col = "red", lwd = 2, ann = F, add = T)
We can go even further - adding more polynomial terms. Let’s keep increasing the order up to 20.
Note: We do this using embedded function in R poly()
which create orthogonal polynomials
#setting order of polynomial
order = 20
#creating some empty object to hold model fit information
r.square = matrix(data=NA,nrow=order,ncol=1)
adj.r.square = matrix(data=NA,nrow=order,ncol=1)
mse = matrix(data=NA,nrow=order,ncol=1)
#Looping through all the candidate regression models
for (i in 1:order) {
#i <- 2 #for testing loop
lm.poly <- lm(fsiq7 ~ poly(apexpos,i),
data=dat,
na.action=na.exclude)
# print(summary(lm.poly))
#making and storing plots
mypath = file.path(getwd(),paste("pred.poly.",i,".png", sep=""))
png(file = mypath) #turning on plot device
myplot = ggplot(dat, aes(x=apexpos, y=fsiq7)) +
geom_point() +
stat_smooth(method='lm', formula = y ~ poly(x,i), size = 1) +
xlab('PHE Exposure') + ylab('Age 7 Full Scale IQ')
print(myplot)
dev.off() #turning off plot device
r.square[i] = summary(lm.poly)$r.squared
adj.r.square[i] = summary(lm.poly)$adj.r.squared
mse[i] = mean(residuals(lm.poly)^2)
}
#Merging and model fits
fits <- as.data.frame(cbind(1:order,r.square,adj.r.square,mse))
colnames(fits) <- c("order","r.square","adj.r.square","mse")
fits
## order r.square adj.r.square mse
## 1 1 0.6991699 0.6982057 174.4870
## 2 2 0.6996527 0.6977212 174.2070
## 3 3 0.7595095 0.7571822 139.4889
## 4 4 0.7641612 0.7611083 136.7908
## 5 5 0.7715048 0.7677954 132.5314
## 6 6 0.7747508 0.7703486 130.6486
## 7 7 0.7754067 0.7702690 130.2682
## 8 8 0.7755293 0.7696415 130.1971
## 9 9 0.7766750 0.7700634 129.5326
## 10 10 0.7780076 0.7706811 128.7597
## 11 11 0.7780537 0.7699696 128.7329
## 12 12 0.7780565 0.7692082 128.7313
## 13 13 0.7781012 0.7684856 128.7054
## 14 14 0.7814518 0.7712188 126.7619
## 15 15 0.7834292 0.7725279 125.6151
## 16 16 0.7834627 0.7717974 125.5956
## 17 17 0.7841315 0.7717337 125.2077
## 18 18 0.7847870 0.7716553 124.8275
## 19 19 0.7855862 0.7717295 124.3640
## 20 20 0.7868239 0.7722727 123.6460
#Plotting model fits#
ggplot(fits, aes(x=order)) +
geom_line(aes(y=r.square), color="red", size=2) +
geom_line(aes(y=adj.r.square), color="blue", size=2) +
xlab('Order of the Polynomial') +
ylab('R^2 (red) or Adjusted R^2 (blue)') +
xlim(1,20) +
ylim(0.5,1)
We see improvement of prediction up through polynomial modelsof order 6 or so, and then less improvement.
To tune the model - that is, to choose the best model (i.e, order of plynomial) we make use of K-fold Cross Validation procedure wherein we, +a. Randomly shuffle the data +b. Set K, the number of folds
dat.shuffled <- dat[sample(nrow(dat)),]
K <- 10
folds <- cut(seq(1,nrow(dat.shuffled)),breaks=K,labels=FALSE)
#Creating empty object to hold fit information
r.square = matrix(data=NA,nrow=K,ncol=order)
#Perform K-fold cross validation
for(i in 1:K){
#Segement data by fold using the which() function
testIndexes <- which(folds==i,arr.ind=TRUE)
testData <- dat.shuffled[testIndexes, ]
trainData <- dat.shuffled[-testIndexes, ]
#Use the test and train data partitions
#Model fitting and evaluation
for (j in 1:order){
#training regression model on training folds
fit.train = lm(fsiq7 ~ poly(apexpos,j), data = trainData)
#evaluating fit on the test fold
fit.test = predict(fit.train, newdata=testData)
r.square[i,j] = cor(fit.test, testData$fsiq7, use='complete')^2
}
}
#Averaging fit at each order
fits.kfold <- colMeans(r.square)
#plotting cross-validated prediction accuracy
plot(colMeans(r.square), type='l')
The 10-fold Cross Validation suggests that the most consistent accuracy is found with a fifth order polynomial.
We can then obtain the prediction model with the complete training data
Model Fitting
#Fitting final model
lm.final = lm(fsiq7~poly(apexpos,5),
data=dat,
na.action=na.exclude)
summary(lm.final)
##
## Call:
## lm(formula = fsiq7 ~ poly(apexpos, 5), data = dat, na.action = na.exclude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.243 -7.356 -1.273 7.433 31.509
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.647 0.656 113.796 < 2e-16 ***
## poly(apexpos, 5)1 -356.843 11.624 -30.699 < 2e-16 ***
## poly(apexpos, 5)2 9.377 11.624 0.807 0.42047
## poly(apexpos, 5)3 104.410 11.624 8.982 < 2e-16 ***
## poly(apexpos, 5)4 -29.107 11.624 -2.504 0.01280 *
## poly(apexpos, 5)5 -36.571 11.624 -3.146 0.00182 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.62 on 308 degrees of freedom
## Multiple R-squared: 0.7715, Adjusted R-squared: 0.7678
## F-statistic: 208 on 5 and 308 DF, p-value: < 2.2e-16
And, visually …
plot.lm.final = ggplot(dat, aes(x=apexpos, y=fsiq7)) +
geom_point() +
stat_smooth(method='lm', formula = y ~ poly(x,5), size = 1) +
xlab('PHE Exposure') + ylab('Age 7 Full Scale IQ')
print(plot.lm.final)
Note that our final prediction equation is based on orthonormal polynomials … We don’t really have interest in the details of the model, only in its prediction accuracy. Thus, we are not writing out the prediction equation here. We simply retain the model object.
In a final step, we apply the learned prediction model to a completely new data set - and evaluate the quality of the prediction (e.g. \(R^2\)).
#set filepath for data file
filepath <- "https://quantdev.ssri.psu.edu/sites/qdev/files/apexpos_test.csv"
#read in the .csv file using the url() function
dat.test <- read.csv(file=url(filepath),header=TRUE)
newdata.predict <- predict(lm.final, newdata=dat.test)
#calculating sqaured correlation of actual and predicted scores
cor(dat.test$fsiq7, newdata.predict)^2
## [1] 0.7556371
Accuracy of prediction is as expected from the model fitting and tuning.
Regression models can be used as machine learning tools. Focus shifts toward prediction and explcit use of training data, k-fold cross validation, and test data in ways that acknowledge and make use of exploratory modeling.
Thank you for playing! (and be careful =:-)