Overview

To try this tutorial out yoruself, you will need a copy of R installed on your computer, with the latest version available at http://cran.r-project.org/. We also recommend using RStudio, available at http://www.rstudio.com

Here are some other helpful resources we came across while putting together this tutorial:

The goals of this tutorial are to provide introductory skills on how to read-in, manipulate/edit, and plot geospatial data in R. Geospatial data are used in diverse scientific fields such as history, biology, busienss, tech, public health, sociology, psychology, and many more.

Important considerations for using geospatial data include defining:

We will present examples of geospatial data at multiple levels of analysis including county-level flu data within Pennsylvania, as well as individual-level data for a single person on a single day.


Outline


Section 0: Load Packages & Set Working Directory

Check whether all packages installed (install them if not), and then load all packages:

list.of.packages <- c("maps","mapdata","maptools","rgdal","ggmap","leaflet","tigris",
                      "sp","ggplot2","plyr","animation","gridExtra","psych","rstudioapi",
                      "data.table")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)

Note: We don’t use all of these packages in the tutorial, but these were the most common packages we came across while writing the tutorial, and therefore may be helpful to dig into in your own work :)

library(maps) 
library(mapdata) 
library(maptools)
library(rgdal)
library(ggmap)
library(leaflet)
library(tigris)
library(sp)
library(ggplot2)
library(plyr)
library(animation)
library(gridExtra)
library(psych)
library(rstudioapi)
library(data.table)

Set working directory to source file location first, or choose different directory:

current_path <- getSourceEditorContext()$path
setwd(dirname(current_path))

Section 1: Reading in Geospatial Data

Below are data from a single individual, collected during the Women’s March in the Washington Mall using a passive data collection app called “Moves” (https://moves-app.com/).

The varialbes in the dataframe are:

These geospatial data are formatted as a standard .csv. Other spatial data formats include: GPX, ESRI, Shapefile, GeoJSON, and many more. Using the rgdal package, we can take a look at all the different drivers (i.e., spatial data formats) that can be utilized:

ogrDrivers()$name
##  [1] "AeronavFAA"     "ARCGEN"         "AVCBin"         "AVCE00"        
##  [5] "BNA"            "CSV"            "DGN"            "DXF"           
##  [9] "EDIGEO"         "ESRI Shapefile" "Geoconcept"     "GeoJSON"       
## [13] "Geomedia"       "GeoRSS"         "GML"            "GPKG"          
## [17] "GPSBabel"       "GPSTrackMaker"  "GPX"            "HTF"           
## [21] "Idrisi"         "JML"            "KML"            "MapInfo File"  
## [25] "Memory"         "MSSQLSpatial"   "ODBC"           "ODS"           
## [29] "OGR_GMT"        "OGR_PDS"        "OGR_SDTS"       "OGR_VRT"       
## [33] "OpenAir"        "OpenFileGDB"    "OSM"            "PCIDSK"        
## [37] "PDF"            "PGDUMP"         "PGeo"           "REC"           
## [41] "S57"            "SEGUKOOA"       "SEGY"           "Selafin"       
## [45] "SQLite"         "SUA"            "SVG"            "SXF"           
## [49] "TIGER"          "UK .NTF"        "VFK"            "Walk"          
## [53] "WAsP"           "XLSX"           "XPlane"

Further, there are many places on the internet as well as other R packages where geospatial data can be obtained. For example, the cdcfluview package allows you to retrieve US flu season data from the CDC. The US Census Bureau also has a wealth of data availabel to be downloaded (https://factfinder.census.gov/faces/nav/jsf/pages/download_center.xhtml).

For now, we will read-in the moves data:

moves <- read.csv("MovesData_WM.csv",header=TRUE)

We will also look at another dataset pertaining to number of flu cases by county in Pennsylvania, retrieved from: http://www.health.pa.gov/My%20Health/Diseases%20and%20Conditions/I-L/Pages/20162017-Influenza-Season.aspx#.WbiFPMiGPIU

fludat <- read.csv("PA_fludata_2017.csv", header=TRUE)
fludat$County<-tolower(fludat$County)
head(fludat)
##      County FluCases
## 1     adams     1623
## 2 allegheny     5001
## 3 armstrong      312
## 4    beaver     1234
## 5   bedford      234
## 6     berks     3031
str(fludat)
## 'data.frame':    67 obs. of  2 variables:
##  $ County  : chr  "adams" "allegheny" "armstrong" "beaver" ...
##  $ FluCases: int  1623 5001 312 1234 234 3031 1036 815 2227 921 ...

Section 2: Basic Geospatial Data Formatting

For some research questions, the geospatial data may have an indicator of duration. In the moves data, there is a duration variable indicating the number of seconds that were spent in that location.

head(moves)
##        Date                                  Name
## 1 1/21/2017     Place in Penn Quarter, Washington
## 2 1/21/2017 Place in Federal Triangle, Washington
## 3 1/21/2017         Place in The Mall, Washington
## 4 1/21/2017         Place in The Mall, Washington
## 5 1/21/2017         Place in The Mall, Washington
## 6 1/21/2017         Place in The Mall, Washington
##                       Start                       End Duration Latitude
## 1 2017-01-21T08:49:27-05:00 2017-01-21T08:55:14-05:00      347 38.89821
## 2 2017-01-21T09:09:41-05:00 2017-01-21T09:11:41-05:00      120 38.89372
## 3 2017-01-21T09:22:10-05:00 2017-01-21T09:29:11-05:00      421 38.89126
## 4 2017-01-21T09:34:49-05:00 2017-01-21T09:50:34-05:00      945 38.88987
## 5 2017-01-21T09:50:35-05:00 2017-01-21T10:03:02-05:00      747 38.89070
## 6 2017-01-21T10:03:03-05:00 2017-01-21T11:28:17-05:00     5114 38.88939
##   Longitude
## 1 -77.02807
## 2 -77.02371
## 3 -77.01739
## 4 -77.01638
## 5 -77.01578
## 6 -77.01737

In the event that there is not a duration variable, first we would need to format the start and stop dates into date objects, and then we could calcualte a duration variable ourselves:

# First remove the T between the date and time
  moves$Start <- gsub("T", " ", moves$Start)
  moves$End <- gsub("T", " ", moves$End)
  
# Also remove the -5:00 from the date and time
  moves$Start <- gsub("-05:00", "", moves$Start)
  moves$End <- gsub("-05:00", "", moves$End)
  
# Format as date object
  moves$Start <- as.POSIXct(x = moves$Start, format = "%Y-%m-%d %H:%M:%S", tz = "EST")
  moves$End <- as.POSIXct(x = moves$End, format = "%Y-%m-%d %H:%M:%S", tz = "EST")
  
# Create duration variable
  moves$duration_sec <- difftime(time1 = moves$End, time2 = moves$Start, units = "secs")
  moves$duration_min <- round(difftime(time1 = moves$End, time2 = moves$Start, units = "mins"),0)
  moves$duration_hours <- round(difftime(time1 = moves$End, time2 = moves$Start, units = "hours"),2)
  
# Look at first and last few rows
  head(moves)
##        Date                                  Name               Start
## 1 1/21/2017     Place in Penn Quarter, Washington 2017-01-21 08:49:27
## 2 1/21/2017 Place in Federal Triangle, Washington 2017-01-21 09:09:41
## 3 1/21/2017         Place in The Mall, Washington 2017-01-21 09:22:10
## 4 1/21/2017         Place in The Mall, Washington 2017-01-21 09:34:49
## 5 1/21/2017         Place in The Mall, Washington 2017-01-21 09:50:35
## 6 1/21/2017         Place in The Mall, Washington 2017-01-21 10:03:03
##                   End Duration Latitude Longitude duration_sec
## 1 2017-01-21 08:55:14      347 38.89821 -77.02807     347 secs
## 2 2017-01-21 09:11:41      120 38.89372 -77.02371     120 secs
## 3 2017-01-21 09:29:11      421 38.89126 -77.01739     421 secs
## 4 2017-01-21 09:50:34      945 38.88987 -77.01638     945 secs
## 5 2017-01-21 10:03:02      747 38.89070 -77.01578     747 secs
## 6 2017-01-21 11:28:17     5114 38.88939 -77.01737    5114 secs
##   duration_min duration_hours
## 1       6 mins     0.10 hours
## 2       2 mins     0.03 hours
## 3       7 mins     0.12 hours
## 4      16 mins     0.26 hours
## 5      12 mins     0.21 hours
## 6      85 mins     1.42 hours
  tail(moves)
##         Date                              Name               Start
## 17 1/21/2017     Place in The Mall, Washington 2017-01-21 15:50:11
## 18 1/21/2017     Place in The Mall, Washington 2017-01-21 16:11:43
## 19 1/21/2017     Place in The Mall, Washington 2017-01-21 16:37:16
## 20 1/21/2017     Place in The Mall, Washington 2017-01-21 17:06:04
## 21 1/21/2017 Place in Penn Quarter, Washington 2017-01-21 17:37:22
## 22 1/21/2017     Place in The Mall, Washington 2017-01-21 18:38:26
##                    End Duration Latitude Longitude duration_sec
## 17 2017-01-21 16:11:42     1291 38.88991 -77.03206    1291 secs
## 18 2017-01-21 16:26:47      904 38.89191 -77.03274     904 secs
## 19 2017-01-21 16:57:56     1240 38.89378 -77.03630    1240 secs
## 20 2017-01-21 17:31:01     1497 38.89530 -77.03533    1497 secs
## 21 2017-01-21 18:21:55     2673 38.89640 -77.03189    2673 secs
## 22 2017-01-21 18:50:17      711 38.88868 -77.03406     711 secs
##    duration_min duration_hours
## 17      22 mins     0.36 hours
## 18      15 mins     0.25 hours
## 19      21 mins     0.34 hours
## 20      25 mins     0.42 hours
## 21      45 mins     0.74 hours
## 22      12 mins     0.20 hours

Another possibility, is to format the dataset so that the rows are on a more uniform time metric such as seconds, minutes, or hours. Below we demonstrate how to transform the moves data given above into a dataset that can be indexed by minutes or hours:

# Create sequential observations variable, and recode duration into minutes and hours
  moves$obs <- seq(1:length(moves$Date))
    
# Create expanded data frame:
  moves2 <- data.frame(rep(moves$obs, moves$duration_min), 
                       rep(moves$Latitude, moves$duration_min), 
                       rep(moves$Longitude, moves$duration_min))
  names(moves2) <- c("obs","latitude","longitude")
    
# Re-insert and expand the minutes variable into a sequential indicator for each "obs"
  moves3 <- ddply(moves2,"obs",mutate, minutes = seq(1:length(obs)))
    
# Divide the sequential minutes variable by 20 (for plot scaling purposes) 
  moves3$minutes_scaled <- moves3$minutes/10

# Insert a new row 1 - this is used for plotting purposes to help with size scaling
  describe(moves3$minutes_scaled) # range = 0.1:8.5
##    vars   n mean   sd median trimmed  mad min max range skew kurtosis  se
## X1    1 487 2.21 2.14    1.3    1.86 1.33 0.1 8.5   8.4 1.24     0.52 0.1
# For later plotting purposes, add in a ghost row for scaling
  moves3 <- rbind(c(0,0,0,85,8.5),moves3)
    
# Create a numbered sequence variable for each move
  moves3$seq_num <- seq(1:length(moves3$obs))
  
# Look at first and last few rows
  head(moves3)
##   obs latitude longitude minutes minutes_scaled seq_num
## 1   0  0.00000   0.00000      85            8.5       1
## 2   1 38.89821 -77.02807       1            0.1       2
## 3   1 38.89821 -77.02807       2            0.2       3
## 4   1 38.89821 -77.02807       3            0.3       4
## 5   1 38.89821 -77.02807       4            0.4       5
## 6   1 38.89821 -77.02807       5            0.5       6
  tail(moves3)
##     obs latitude longitude minutes minutes_scaled seq_num
## 483  22 38.88868 -77.03406       7            0.7     483
## 484  22 38.88868 -77.03406       8            0.8     484
## 485  22 38.88868 -77.03406       9            0.9     485
## 486  22 38.88868 -77.03406      10            1.0     486
## 487  22 38.88868 -77.03406      11            1.1     487
## 488  22 38.88868 -77.03406      12            1.2     488

Section 3: Fetching a Map

ggplot2/maps Package

We can import data from the maps package into a data frame easily usable with ggplot

Here, we use map_data to pull county information for Pennsylvania:

PAcounties <- map_data("county", region = "pennsylvania")
head(PAcounties)
##        long      lat group order       region subregion
## 1 -77.44670 39.96954     1     1 pennsylvania     adams
## 2 -77.42952 39.98672     1     2 pennsylvania     adams
## 3 -77.37222 40.00391     1     3 pennsylvania     adams
## 4 -77.32065 40.01537     1     4 pennsylvania     adams
## 5 -77.23471 40.02683     1     5 pennsylvania     adams
## 6 -77.18887 40.03256     1     6 pennsylvania     adams

PAcounties contains the following information: * long: longitude * lat: latitude * group: a grouping variable for different regions, that is extra useful for plotting adjacent points (within a group). * order: each element within a group is assigned an order number, which will tell ggplot the order to connect the dots within groups. * region: describes what each group is (e.g., state name) * subregion: describes what each sub-group is (e.g., counties in PA)

Next, we need to use the information we pulled from map_data to tie county information in fludat geographic signifiers (latitude and longitude)

#Set up data: flu cases, county, state names
flu_map <- data.frame(state_names=unique(PAcounties$region), 
                          county_names=unique(PAcounties$subregion), 
                          flucases= fludat$FluCases)
#use a key to match flu map to county latitude and longitude
library(data.table) 
PAcounties <- data.table(PAcounties)
setkey(PAcounties,region,subregion)
flu_map <- data.table(flu_map)
setkey(flu_map,state_names,county_names)
map.df <- PAcounties[flu_map]
head(map.df)
##         long      lat group order       region subregion flucases
## 1: -77.44670 39.96954     1     1 pennsylvania     adams     1623
## 2: -77.42952 39.98672     1     2 pennsylvania     adams     1623
## 3: -77.37222 40.00391     1     3 pennsylvania     adams     1623
## 4: -77.32065 40.01537     1     4 pennsylvania     adams     1623
## 5: -77.23471 40.02683     1     5 pennsylvania     adams     1623
## 6: -77.18887 40.03256     1     6 pennsylvania     adams     1623

Now, we have latitude, longitude, county names, region (state), and flu cases. We’re ready to map! Use ggplot, grouping at the county level, to fill in our flu case data

ggplot(map.df, aes(x = long, y = lat, group = subregion, fill = flucases)) + 
  geom_polygon(color="white") + 
  coord_map()

ggmap Package

The get_map() in the ggmap package is a wrapper that can query Google Maps, OpenStreetMap, Stamen Maps or Naver Map servers.

  • For the location argument, you can either input an address, longitude and latitude, or left/bottom/right/top boudning box.
  • For the zoom argument, you input an integer where 3 is at the continent level and 21 is at the building level.
  • For the maptype argument, you can specify a character string corresponding to, terrain, terrain-background, satellite, roadmap, hybrid, toner, watercolor, terrain-labels, and so forth (see ?get_map).
?ggmap
map1 <- get_map(location = "State College",
               maptype = "terrain",
               source = "google",
               crop = FALSE,
               zoom = 10)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=State+College&zoom=10&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=State%20College&sensor=false
plot_map1 <- ggmap(map1) + 
    xlab("Longitude") + ylab("Latitude") +
    theme(legend.position="none")
plot_map1

map2 <- get_map(location = "Pennsylvania",
               maptype = "watercolor",
               source = "google",
               crop = FALSE,
               zoom = 5)

plot_map2 <- ggmap(map2) + 
    xlab("Longitude") + ylab("Latitude") +
    theme(legend.position="none")
plot_map2


Section 4: Plotting on a Static Map

First, lets plot the Women’s March data using a map from the previous section displaying the Washington Mall.

summary(moves3[,c("longitude","latitude")])
##    longitude         latitude    
##  Min.   :-77.04   Min.   : 0.00  
##  1st Qu.:-77.03   1st Qu.:38.89  
##  Median :-77.02   Median :38.89  
##  Mean   :-76.87   Mean   :38.81  
##  3rd Qu.:-77.02   3rd Qu.:38.89  
##  Max.   :  0.00   Max.   :38.90
map3 <- get_map(location = c(-77.025,38.895),
               maptype = "terrain",
               source = "google",
               crop = FALSE,
               zoom = 15)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=38.895,-77.025&zoom=15&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
plot_map3 <- ggmap(map3) + 
  geom_point(data = moves, aes(x = Longitude, y = Latitude)) +
  geom_path(data = moves,aes(x = Longitude, y = Latitude), size = 1, lineend = "round") +
  xlab("Longitude") + ylab("Latitude") +
  ggtitle("Plot of Women's March Time Series Data") +
  theme(legend.position="none",
        plot.title = element_text(hjust = 0.5)) 

plot_map3


Section 5: Plotting Dynamic Maps

To create the dynamic map, we use the saveGIF() within the animation package. The GIF displayed bleow plots each stop in the march with a point. To visualize the amount of time spent at each location, the dots at each location grow proportional to the duration spent at that location compared to other locations. The movement is depicted by black lines connecting each location. The “growing dots” depicting duration are achieved by subseting the data iteratively - the first plot in the GIF uses only the first row of data, the second plot uses the first two rows…all the way up to the 488th plot in the GIF which utilizes all 488 rows. Do to knit time (this particular GIF takes ~10 minutes to render), the code chunk below is not evaluated, but you are welcome to download the data and run this code on your own machine to see the resulting GIF!

# Create color palette
  set.seed(1989)
  palette = sample(rainbow(length(unique(moves3$obs)))) 
  
# Create gif
  saveGIF({
    seq_list <- unique(moves3$seq_num) # this sets how many plots will be tied together (equal to number of rows)
    #looping through plots
      for(i in 1:length(seq_list)){
        print(paste(c("Now at", seq_list[i])))
        dat <- subset(moves3, seq_num <= seq_list[i])
        print(
        ggmap(map) + 
            scale_size(range=c(0,8.5)) +
            scale_colour_manual(values=palette) +
            geom_point(data = dat, aes(x = longitude, y = latitude, size=minutes_scaled,colour=factor(obs))) +
            geom_path(data = dat, aes(x = longitude, y = latitude), size = .5, lineend = "round") +
            xlab("Longitude") + ylab("Latitude") +
            theme(legend.position="none")
        )
  }}, movie.name="moves2.gif", 
      interval = .001, 
      ani.width = 600,
      ani.height = 600)

Questions, Suggestions, Sharing

If you have questions or suggestions on material to add to this tutorial, please contact Libby at leb237@psu.edu. We are happy for this tutorial to be shared, but rather than sharing directly, please make sure to link to it’s page on the QuantDev website: https://quantdev.ssri.psu.edu/tutorials/intro-geospatial-data-and-maps-r

There are also many other cool tutorials given by the QuantDev group at: https://quantdev.ssri.psu.edu/tutorials/