Overview

Tree-based models consist of one or more nested if-then statements on the predictors that partition the data. Within each partition, a specific model is used to predict the outcome. This recursive partitioning technique supports exploration of the structure of a data set (outcome and predictors) and identification of easy-to-visualize decision rules for predicting a categorical (classification tree) or continuous (regression tree) outcome. In this tutorial we briefly describe the process of growing, examining, and pruning regression trees.
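To make the "nested if-then statements" idea concrete, here is a minimal hand-built sketch using the Boston data from the MASS package. The cut points (rm at 7, lstat at 15) are hypothetical values picked for illustration, not splits estimated by any tree algorithm, and the within-partition "model" is assumed to be the partition mean of the outcome.

```r
library(MASS)  # provides the Boston data

# Hypothetical cut points, hand-picked for illustration (not estimated by any algorithm)
leaf <- with(Boston,
             ifelse(rm >= 7, "rm >= 7",
                    ifelse(lstat >= 15, "rm < 7, lstat >= 15",
                           "rm < 7, lstat < 15")))

# Simplest possible within-partition model: the mean outcome of each partition
leaf_means <- tapply(Boston$medv, leaf, mean)
pred <- ave(Boston$medv, leaf)  # each row receives the mean of its own partition

leaf_means
sqrt(mean((Boston$medv - pred)^2))  # in-sample RMSE of the hand-built rules
```

A tree algorithm automates the part done by hand here: it searches for the variables and cut points that define the partitions.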

Outline

In this session we cover …

  1. Introduction to Data (Boston Data)
  2. Multivariate Regression Baseline
  3. Regression Tree (CART method): rpart (rpart package)
  4. Regression Tree (Conditional Inference method): ctree (partykit package)
  5. Conclusion

Prelim - Loading libraries used in this script.

library(MASS)  #for the Boston Data

library(psych)  #for general functions
library(ggplot2)  #for data visualization

# library(devtools)
# devtools::install_github('topepo/caret/pkg/caret') #May need the github version to correct a bug with parallelizing
library(caret)  #for training and cross validation (also calls other model libraries)
library(rpart)  #for trees
# library(rattle)  #for fancy tree plots; this library can be difficult to install (https://gist.github.com/zhiyzuo/a489ffdcc5da87f28f8589a55aa206dd)
library(rpart.plot)  #for enhanced tree plots
library(RColorBrewer)  #for color selection in fancy tree plots
library(party)  #for an alternative decision tree algorithm
library(partykit)  #for updated party functions

1. Introduction to Data

For this example we use the Boston data that accompany the MASS package. These data were selected for no special reason other than that they were used in some other examples we were working on. They can be considered “typical” social science data, with a mix of nominal, count, and continuous variables. Of note, there are no missing data.

Reading in the Boston data set.

#loading the data
data("Boston")
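The description of the data given earlier (a mix of variable types, no missing values) can be checked directly once the data are loaded. The following is a small sketch of such a check using only base R functions; the choice of `chas` as the variable to tabulate is ours, for illustration.

```r
library(MASS)
data("Boston")

dim(Boston)             # number of rows and columns
colSums(is.na(Boston))  # missing values per variable
sapply(Boston, class)   # storage type of each variable
table(Boston$chas)      # counts per level of the chas indicator
```

Note that nominal variables in this data set are stored as numeric codes, so `class()` alone will not distinguish them from continuous variables.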

Prelim - Descriptives

Let's have a quick look at the data file and the descriptives.

#data structure
head(Boston,10)
##       crim   zn indus chas   nox    rm   age    dis rad tax ptratio  black
## 1  0.00632 18.0  2.31    0 0.538 6.575  65.2 4.0900   1 296    15.3 396.90
## 2  0.02731  0.0  7.07    0 0.469 6.421  78.9 4.9671   2 242    17.8 396.90
## 3  0.02729  0.0  7.07    0 0.469 7.185  61.1 4.9671   2 242    17.8 392.83
## 4  0.03237  0.0  2.18    0 0.458 6.998  45.8 6.0622   3 222    18.7 394.63
## 5  0.06905  0.0  2.18    0 0.458 7.147  54.2 6.0622   3 222    18.7 396.90
## 6  0.02985  0.0  2.18    0 0.458 6.430  58.7 6.0622   3 222    18.7 394.12
## 7  0.08829 12.5  7.87    0 0.524 6.012  66.6 5.5605   5 311    15.2 395.60
## 8  0.14455 12.5  7.87    0 0.524 6.172  96.1 5.9505   5 311    15.2 396.90
## 9  0.21124 12.5  7.87    0 0.524 5.631 100.0 6.0821   5 311    15.2 386.63
## 10 0.17004 12.5  7.87    0 0.524 6.004  85.9 6.5921   5 311    15.2 386.71
##    lstat medv
## 1   4.98 24.0
## 2   9.14 21.6
## 3   4.03 34.7
## 4   2.94 33.4
## 5   5.33 36.2
## 6   5.21 28.7
## 7  12.43 22.9
## 8  19.15 27.1
## 9  29.93 16.5
## 10 17.10 18.9

Our outcome of interest is medv: median value of owner-occupied homes in $1000s.

(Note that there is no id variable. This is convenient for some tasks.)

Descriptives

#sample descriptives
describe(Boston)
##         vars   n   mean     sd median trimmed    mad    min    max  range
## crim       1 506   3.61   8.60   0.26    1.68   0.33   0.01  88.98  88.97
## zn         2 506  11.36  23.32   0.00    5.08   0.00   0.00 100.00 100.00
## indus      3 506  11.14   6.86   9.69   10.93   9.37   0.46  27.74  27.28
## chas       4 506   0.07   0.25   0.00    0.00   0.00   0.00   1.00   1.00
## nox        5 506   0.55   0.12   0.54    0.55   0.13   0.38   0.87   0.49
## rm         6 506   6.28   0.70   6.21    6.25   0.51   3.56   8.78   5.22
## age        7 506  68.57  28.15  77.50   71.20  28.98   2.90 100.00  97.10
## dis        8 506   3.80   2.11   3.21    3.54   1.91   1.13  12.13  11.00
## rad        9 506   9.55   8.71   5.00    8.73   2.97   1.00  24.00  23.00
## tax       10 506 408.24 168.54 330.00  400.04 108.23 187.00 711.00 524.00
## ptratio   11 506  18.46   2.16  19.05   18.66   1.70  12.60  22.00   9.40
## black     12 506 356.67  91.29 391.44  383.17   8.09   0.32 396.90 396.58
## lstat     13 506  12.65   7.14  11.36   11.90   7.11   1.73  37.97  36.24
## medv      14 506  22.53   9.20  21.20   21.56   5.93   5.00  50.00  45.00
##          skew kurtosis   se
## crim     5.19    36.60 0.38
## zn       2.21     3.95 1.04
## indus    0.29    -1.24 0.30
## chas     3.39     9.48 0.01
## nox      0.72    -0.09 0.01
## rm       0.40     1.84 0.03
## age     -0.60    -0.98 1.25
## dis      1.01     0.46 0.09
## rad      1.00    -0.88 0.39
## tax      0.67    -1.15 7.49
## ptratio -0.80    -0.30 0.10
## black   -2.87     7.10 4.06
## lstat    0.90     0.46 0.32
## medv     1.10     1.45 0.41
#plots
pairs.panels(Boston)

#histogram of outcome
ggplot(data=Boston, aes(x=medv)) +
  geom_histogram(binwidth=1, boundary=.5, fill="white", color="black") + 
  labs(x = "Median Home Value")