Overview

In this tutorial we walk through basic Data Mining ideas using regression. We introduce the predictive perspective and the use of K-fold Cross-Validation.

Outline

In this session we cover …

  1. Introduction to Motivating Problem and Data
  2. Regression as a Prediction Model
  3. Expanding the Prediction Model
  4. K-fold Cross-Validation
  5. Application to New Data

Prelim - Loading libraries used in this script.

library(psych)  #for general functions
library(ggplot2)  #for data visualization
library(car)  # for regression diagnostics

1A. The Regression (Prediction) Model

Although not usually labeled as such in the Social Science community, regression is part of the data mining toolbox. Generally, it is a form of supervised learning. For clarity we write out the general (univariate polynomial) model we use here …

\[ {y_{i}} = {\beta_0} + {\beta_1}{x_{i}} + {\beta_2}{x_{i}^2} + ... + {\beta_k}{x_{i}^k} + {e_{i}} \]

We will use this model to optimize prediction of a specific outcome.
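As a minimal sketch of what fitting this polynomial model looks like in R (using simulated data, since the real data are loaded below; the variable names and true coefficients here are illustrative assumptions), the `poly()` function supplies the \(x, x^2, \ldots, x^k\) terms:

```r
# Sketch: fitting a k-th order polynomial regression (k = 2 here)
set.seed(123)
x <- runif(100, 0, 25)                            # hypothetical predictor
y <- 100 - 3 * x + 0.05 * x^2 + rnorm(100, 0, 10) # hypothetical outcome

# poly(x, 2, raw = TRUE) creates the x and x^2 columns of the design matrix
lm.poly <- lm(y ~ poly(x, 2, raw = TRUE))
coef(lm.poly)  # estimates of beta_0, beta_1, beta_2
```

Raising the degree passed to `poly()` expands the model with higher-order terms, which is exactly the "expanding the prediction model" step in the outline above.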

The motivating problem is derived from the following:
+Phenylketonuria (PKU) is an inborn error of metabolism that can have devastating effects on child intelligence
+Children of mothers with PKU can suffer damage in utero, even if they do not themselves have full PKU
+A critical question is the association between prenatal phenylalanine (PHE) exposure of the fetus in utero and childhood intelligence

Our goal is to develop an accurate prediction equation.

1B. Data: Training Set and Test Set

In this example there are two data sets: one that we can use for data mining (an exploration data set), and one that we cannot touch until the very end (a confirmation data set).
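Here the exploration and confirmation sets come as separate files. If instead a single file had to be divided, the split could be sketched as follows (the data frame `dat_full` and the 50/50 proportion are illustrative assumptions, not part of this tutorial's data):

```r
# Hypothetical single data set to be split (illustrative only)
set.seed(1234)
dat_full <- data.frame(id = 1:628, x = rnorm(628), y = rnorm(628))

# draw half of the rows at random for exploration
train_rows <- sample(nrow(dat_full), size = nrow(dat_full) / 2)
dat_explore <- dat_full[train_rows, ]   # data we may mine freely
dat_confirm <- dat_full[-train_rows, ]  # data held out until the very end
```

Setting the seed makes the split reproducible, so the held-out confirmation rows stay the same across runs.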

Reading in the exploration data set.

#set filepath for data file
filepath <- "https://quantdev.ssri.psu.edu/sites/qdev/files/apexpos.csv"
#read in the .csv file using the url() function
dat <- read.csv(file=url(filepath),header=TRUE)

Prelim - Descriptives

Let's have a quick look at the data file and the descriptives.

#data structure
head(dat,10)
##    id apexpos fsiq7
## 1   1     2.2   121
## 2   2     5.5    98
## 3   3     5.2    93
## 4   4     3.6    98
## 5   5     7.2    89
## 6   6     7.3   115
## 7   7     9.9    72
## 8   8     8.9    90
## 9   9     0.0    89
## 10 10     7.1   111
names(dat)
## [1] "id"      "apexpos" "fsiq7"

There are three variables in the data set.
+“id” is an identification variable
+“apexpos” is prenatal phenylalanine (PHE) exposure
+“fsiq7” is childhood intelligence (full scale IQ at age 7 years)

#descriptives (means, sds)
describe(dat[,-1])  #the -1 ignores the first column, "id"
##         vars   n  mean    sd median trimmed   mad min   max range skew
## apexpos    1 314 11.28  6.94   10.7   11.14  8.45   0  24.9  24.9 0.13
## fsiq7      2 314 74.65 24.12   76.0   74.53 29.65  31 126.0  95.0 0.00
##         kurtosis   se
## apexpos     -1.1 0.39
## fsiq7       -1.1 1.36
#correlation matrix
round(cor(dat[,-1]),2)
##         apexpos fsiq7
## apexpos    1.00 -0.84
## fsiq7     -0.84  1.00

Plotting the empirical data (in our exploration set)

## Plotting empirical data ##
ggplot(dat, aes(x=apexpos, y=fsiq7)) + 
  geom_point() +
  xlab('PHE Exposure') + ylab('Age 7 Full Scale IQ')

2. Regression as a Prediction Model

We formulate the initial prediction model.

\[ {y_{i}} = {\beta_0}{1} + {\beta_1}{x_{i}} + {e_{i}} \]

Model Fitting

#Specifying linear regression model
lm.1 = lm(fsiq7 ~ 1 + apexpos,
          data=dat,
          na.action=na.exclude)

summary(lm.1)
## 
## Call:
## lm(formula = fsiq7 ~ 1 + apexpos, data = dat, na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.108  -8.790  -0.232   8.004  39.749 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 107.4337     1.4289   75.19   <2e-16 ***
## apexpos      -2.9068     0.1079  -26.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.25 on 312 degrees of freedom
## Multiple R-squared:  0.6992, Adjusted R-squared:  0.6982 
## F-statistic: 725.1 on 1 and 312 DF,  p-value: < 2.2e-16
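A quick consistency check (not part of the original output): with a single predictor, the Multiple R-squared is just the square of the predictor-outcome correlation reported in the correlation matrix earlier.

```r
# squared correlation between apexpos and fsiq7 (rounded value from above)
(-0.84)^2  # 0.7056, matching Multiple R-squared of 0.6992 up to rounding
```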

Interpreting the output, we get the prediction model

\[ {y_{i}} = {107.4337} - {2.9068}{x_{i}} + {e_{i}} \]
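To make the equation concrete, the estimates can be used to generate predicted IQ scores by hand (the exposure values below are arbitrary choices for illustration; `predict(lm.1, newdata = ...)` would produce the same values directly from the fitted object):

```r
# estimated coefficients copied from the summary above
b0 <- 107.4337
b1 <- -2.9068

phe <- c(0, 5, 10)       # hypothetical PHE exposure values
yhat <- b0 + b1 * phe    # predicted age-7 full scale IQ
round(yhat, 2)           # 107.43  92.90  78.37
```

Each additional unit of PHE exposure lowers the predicted IQ by about 2.9 points.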

And a plot of the model …

plot.lm1 = ggplot(dat, aes(x=apexpos, y=fsiq7)) + 
              geom_point() +
              stat_smooth(method='lm', formula = y ~ x, color="blue", size = 1) +
              xlab('PHE Exposure') + ylab('Age 7 Full Scale IQ')
print(plot.lm1)