To try this tutorial out yoruself, you will need a copy of R installed on your computer, with the latest version available at http://cran.r-project.org/. We also recommend using RStudio, available at http://www.rstudio.com
Here are some other helpful resources we came across while putting together this tutorial:
The goals of this tutorial are to provide introductory skills on how to read-in, manipulate/edit, and plot geospatial data in R. Geospatial data are used in diverse scientific fields such as history, biology, busienss, tech, public health, sociology, psychology, and many more.
Important considerations for using geospatial data include defining:
We will present examples of geospatial data at multiple levels of analysis including county-level flu data within Pennsylvania, as well as individual-level data for a single person on a single day.
Check whether all packages installed (install them if not), and then load all packages:
list.of.packages <- c("maps","mapdata","maptools","rgdal","ggmap","leaflet","tigris",
"sp","ggplot2","plyr","animation","gridExtra","psych","rstudioapi",
"data.table")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
Note: We don’t use all of these packages in the tutorial, but these were the most common packages we came across while writing the tutorial, and therefore may be helpful to dig into in your own work :)
library(maps)
library(mapdata)
library(maptools)
library(rgdal)
library(ggmap)
library(leaflet)
library(tigris)
library(sp)
library(ggplot2)
library(plyr)
library(animation)
library(gridExtra)
library(psych)
library(rstudioapi)
library(data.table)
Set working directory to source file location first, or choose different directory:
current_path <- getSourceEditorContext()$path
setwd(dirname(current_path))
Below are data from a single individual, collected during the Women’s March in the Washington Mall using a passive data collection app called “Moves” (https://moves-app.com/).
The varialbes in the dataframe are:
These geospatial data are formatted as a standard .csv. Other spatial data formats include: GPX, ESRI, Shapefile, GeoJSON, and many more. Using the rgdal
package, we can take a look at all the different drivers (i.e., spatial data formats) that can be utilized:
ogrDrivers()$name
## [1] "AeronavFAA" "ARCGEN" "AVCBin" "AVCE00"
## [5] "BNA" "CSV" "DGN" "DXF"
## [9] "EDIGEO" "ESRI Shapefile" "Geoconcept" "GeoJSON"
## [13] "Geomedia" "GeoRSS" "GML" "GPKG"
## [17] "GPSBabel" "GPSTrackMaker" "GPX" "HTF"
## [21] "Idrisi" "JML" "KML" "MapInfo File"
## [25] "Memory" "MSSQLSpatial" "ODBC" "ODS"
## [29] "OGR_GMT" "OGR_PDS" "OGR_SDTS" "OGR_VRT"
## [33] "OpenAir" "OpenFileGDB" "OSM" "PCIDSK"
## [37] "PDF" "PGDUMP" "PGeo" "REC"
## [41] "S57" "SEGUKOOA" "SEGY" "Selafin"
## [45] "SQLite" "SUA" "SVG" "SXF"
## [49] "TIGER" "UK .NTF" "VFK" "Walk"
## [53] "WAsP" "XLSX" "XPlane"
Further, there are many places on the internet as well as other R packages where geospatial data can be obtained. For example, the cdcfluview
package allows you to retrieve US flu season data from the CDC. The US Census Bureau also has a wealth of data availabel to be downloaded (https://factfinder.census.gov/faces/nav/jsf/pages/download_center.xhtml).
For now, we will read-in the moves data:
moves <- read.csv("MovesData_WM.csv",header=TRUE)
We will also look at another dataset pertaining to number of flu cases by county in Pennsylvania, retrieved from: http://www.health.pa.gov/My%20Health/Diseases%20and%20Conditions/I-L/Pages/20162017-Influenza-Season.aspx#.WbiFPMiGPIU
fludat <- read.csv("PA_fludata_2017.csv", header=TRUE)
fludat$County<-tolower(fludat$County)
head(fludat)
## County FluCases
## 1 adams 1623
## 2 allegheny 5001
## 3 armstrong 312
## 4 beaver 1234
## 5 bedford 234
## 6 berks 3031
str(fludat)
## 'data.frame': 67 obs. of 2 variables:
## $ County : chr "adams" "allegheny" "armstrong" "beaver" ...
## $ FluCases: int 1623 5001 312 1234 234 3031 1036 815 2227 921 ...
For some research questions, the geospatial data may have an indicator of duration. In the moves
data, there is a duration variable indicating the number of seconds that were spent in that location.
head(moves)
## Date Name
## 1 1/21/2017 Place in Penn Quarter, Washington
## 2 1/21/2017 Place in Federal Triangle, Washington
## 3 1/21/2017 Place in The Mall, Washington
## 4 1/21/2017 Place in The Mall, Washington
## 5 1/21/2017 Place in The Mall, Washington
## 6 1/21/2017 Place in The Mall, Washington
## Start End Duration Latitude
## 1 2017-01-21T08:49:27-05:00 2017-01-21T08:55:14-05:00 347 38.89821
## 2 2017-01-21T09:09:41-05:00 2017-01-21T09:11:41-05:00 120 38.89372
## 3 2017-01-21T09:22:10-05:00 2017-01-21T09:29:11-05:00 421 38.89126
## 4 2017-01-21T09:34:49-05:00 2017-01-21T09:50:34-05:00 945 38.88987
## 5 2017-01-21T09:50:35-05:00 2017-01-21T10:03:02-05:00 747 38.89070
## 6 2017-01-21T10:03:03-05:00 2017-01-21T11:28:17-05:00 5114 38.88939
## Longitude
## 1 -77.02807
## 2 -77.02371
## 3 -77.01739
## 4 -77.01638
## 5 -77.01578
## 6 -77.01737
In the event that there is not a duration variable, first we would need to format the start and stop dates into date objects, and then we could calcualte a duration variable ourselves:
# First remove the T between the date and time
moves$Start <- gsub("T", " ", moves$Start)
moves$End <- gsub("T", " ", moves$End)
# Also remove the -5:00 from the date and time
moves$Start <- gsub("-05:00", "", moves$Start)
moves$End <- gsub("-05:00", "", moves$End)
# Format as date object
moves$Start <- as.POSIXct(x = moves$Start, format = "%Y-%m-%d %H:%M:%S", tz = "EST")
moves$End <- as.POSIXct(x = moves$End, format = "%Y-%m-%d %H:%M:%S", tz = "EST")
# Create duration variable
moves$duration_sec <- difftime(time1 = moves$End, time2 = moves$Start, units = "secs")
moves$duration_min <- round(difftime(time1 = moves$End, time2 = moves$Start, units = "mins"),0)
moves$duration_hours <- round(difftime(time1 = moves$End, time2 = moves$Start, units = "hours"),2)
# Look at first and last few rows
head(moves)
## Date Name Start
## 1 1/21/2017 Place in Penn Quarter, Washington 2017-01-21 08:49:27
## 2 1/21/2017 Place in Federal Triangle, Washington 2017-01-21 09:09:41
## 3 1/21/2017 Place in The Mall, Washington 2017-01-21 09:22:10
## 4 1/21/2017 Place in The Mall, Washington 2017-01-21 09:34:49
## 5 1/21/2017 Place in The Mall, Washington 2017-01-21 09:50:35
## 6 1/21/2017 Place in The Mall, Washington 2017-01-21 10:03:03
## End Duration Latitude Longitude duration_sec
## 1 2017-01-21 08:55:14 347 38.89821 -77.02807 347 secs
## 2 2017-01-21 09:11:41 120 38.89372 -77.02371 120 secs
## 3 2017-01-21 09:29:11 421 38.89126 -77.01739 421 secs
## 4 2017-01-21 09:50:34 945 38.88987 -77.01638 945 secs
## 5 2017-01-21 10:03:02 747 38.89070 -77.01578 747 secs
## 6 2017-01-21 11:28:17 5114 38.88939 -77.01737 5114 secs
## duration_min duration_hours
## 1 6 mins 0.10 hours
## 2 2 mins 0.03 hours
## 3 7 mins 0.12 hours
## 4 16 mins 0.26 hours
## 5 12 mins 0.21 hours
## 6 85 mins 1.42 hours
tail(moves)
## Date Name Start
## 17 1/21/2017 Place in The Mall, Washington 2017-01-21 15:50:11
## 18 1/21/2017 Place in The Mall, Washington 2017-01-21 16:11:43
## 19 1/21/2017 Place in The Mall, Washington 2017-01-21 16:37:16
## 20 1/21/2017 Place in The Mall, Washington 2017-01-21 17:06:04
## 21 1/21/2017 Place in Penn Quarter, Washington 2017-01-21 17:37:22
## 22 1/21/2017 Place in The Mall, Washington 2017-01-21 18:38:26
## End Duration Latitude Longitude duration_sec
## 17 2017-01-21 16:11:42 1291 38.88991 -77.03206 1291 secs
## 18 2017-01-21 16:26:47 904 38.89191 -77.03274 904 secs
## 19 2017-01-21 16:57:56 1240 38.89378 -77.03630 1240 secs
## 20 2017-01-21 17:31:01 1497 38.89530 -77.03533 1497 secs
## 21 2017-01-21 18:21:55 2673 38.89640 -77.03189 2673 secs
## 22 2017-01-21 18:50:17 711 38.88868 -77.03406 711 secs
## duration_min duration_hours
## 17 22 mins 0.36 hours
## 18 15 mins 0.25 hours
## 19 21 mins 0.34 hours
## 20 25 mins 0.42 hours
## 21 45 mins 0.74 hours
## 22 12 mins 0.20 hours
Another possibility, is to format the dataset so that the rows are on a more uniform time metric such as seconds, minutes, or hours. Below we demonstrate how to transform the moves data given above into a dataset that can be indexed by minutes or hours:
# Create sequential observations variable, and recode duration into minutes and hours
moves$obs <- seq(1:length(moves$Date))
# Create expanded data frame:
moves2 <- data.frame(rep(moves$obs, moves$duration_min),
rep(moves$Latitude, moves$duration_min),
rep(moves$Longitude, moves$duration_min))
names(moves2) <- c("obs","latitude","longitude")
# Re-insert and expand the minutes variable into a sequential indicator for each "obs"
moves3 <- ddply(moves2,"obs",mutate, minutes = seq(1:length(obs)))
# Divide the sequential minutes variable by 20 (for plot scaling purposes)
moves3$minutes_scaled <- moves3$minutes/10
# Insert a new row 1 - this is used for plotting purposes to help with size scaling
describe(moves3$minutes_scaled) # range = 0.1:8.5
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 487 2.21 2.14 1.3 1.86 1.33 0.1 8.5 8.4 1.24 0.52 0.1
# For later plotting purposes, add in a ghost row for scaling
moves3 <- rbind(c(0,0,0,85,8.5),moves3)
# Create a numbered sequence variable for each move
moves3$seq_num <- seq(1:length(moves3$obs))
# Look at first and last few rows
head(moves3)
## obs latitude longitude minutes minutes_scaled seq_num
## 1 0 0.00000 0.00000 85 8.5 1
## 2 1 38.89821 -77.02807 1 0.1 2
## 3 1 38.89821 -77.02807 2 0.2 3
## 4 1 38.89821 -77.02807 3 0.3 4
## 5 1 38.89821 -77.02807 4 0.4 5
## 6 1 38.89821 -77.02807 5 0.5 6
tail(moves3)
## obs latitude longitude minutes minutes_scaled seq_num
## 483 22 38.88868 -77.03406 7 0.7 483
## 484 22 38.88868 -77.03406 8 0.8 484
## 485 22 38.88868 -77.03406 9 0.9 485
## 486 22 38.88868 -77.03406 10 1.0 486
## 487 22 38.88868 -77.03406 11 1.1 487
## 488 22 38.88868 -77.03406 12 1.2 488
We can import data from the maps
package into a data frame easily usable with ggplot
Here, we use map_data
to pull county information for Pennsylvania:
PAcounties <- map_data("county", region = "pennsylvania")
head(PAcounties)
## long lat group order region subregion
## 1 -77.44670 39.96954 1 1 pennsylvania adams
## 2 -77.42952 39.98672 1 2 pennsylvania adams
## 3 -77.37222 40.00391 1 3 pennsylvania adams
## 4 -77.32065 40.01537 1 4 pennsylvania adams
## 5 -77.23471 40.02683 1 5 pennsylvania adams
## 6 -77.18887 40.03256 1 6 pennsylvania adams
PAcounties
contains the following information: * long: longitude * lat: latitude * group: a grouping variable for different regions, that is extra useful for plotting adjacent points (within a group). * order: each element within a group is assigned an order number, which will tell ggplot
the order to connect the dots within groups. * region: describes what each group is (e.g., state name) * subregion: describes what each sub-group is (e.g., counties in PA)
Next, we need to use the information we pulled from map_data
to tie county information in fludat
geographic signifiers (latitude and longitude)
#Set up data: flu cases, county, state names
flu_map <- data.frame(state_names=unique(PAcounties$region),
county_names=unique(PAcounties$subregion),
flucases= fludat$FluCases)
#use a key to match flu map to county latitude and longitude
library(data.table)
PAcounties <- data.table(PAcounties)
setkey(PAcounties,region,subregion)
flu_map <- data.table(flu_map)
setkey(flu_map,state_names,county_names)
map.df <- PAcounties[flu_map]
head(map.df)
## long lat group order region subregion flucases
## 1: -77.44670 39.96954 1 1 pennsylvania adams 1623
## 2: -77.42952 39.98672 1 2 pennsylvania adams 1623
## 3: -77.37222 40.00391 1 3 pennsylvania adams 1623
## 4: -77.32065 40.01537 1 4 pennsylvania adams 1623
## 5: -77.23471 40.02683 1 5 pennsylvania adams 1623
## 6: -77.18887 40.03256 1 6 pennsylvania adams 1623
Now, we have latitude, longitude, county names, region (state), and flu cases. We’re ready to map! Use ggplot, grouping at the county level, to fill in our flu case data
ggplot(map.df, aes(x = long, y = lat, group = subregion, fill = flucases)) +
geom_polygon(color="white") +
coord_map()
The get_map()
in the ggmap
package is a wrapper that can query Google Maps, OpenStreetMap, Stamen Maps or Naver Map servers.
location
argument, you can either input an address, longitude and latitude, or left/bottom/right/top boudning box.zoom
argument, you input an integer where 3 is at the continent level and 21 is at the building level.maptype
argument, you can specify a character string corresponding to, terrain, terrain-background, satellite, roadmap, hybrid, toner, watercolor, terrain-labels, and so forth (see ?get_map).?ggmap
map1 <- get_map(location = "State College",
maptype = "terrain",
source = "google",
crop = FALSE,
zoom = 10)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=State+College&zoom=10&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=State%20College&sensor=false
plot_map1 <- ggmap(map1) +
xlab("Longitude") + ylab("Latitude") +
theme(legend.position="none")
plot_map1
map2 <- get_map(location = "Pennsylvania",
maptype = "watercolor",
source = "google",
crop = FALSE,
zoom = 5)
plot_map2 <- ggmap(map2) +
xlab("Longitude") + ylab("Latitude") +
theme(legend.position="none")
plot_map2
First, lets plot the Women’s March data using a map from the previous section displaying the Washington Mall.
summary(moves3[,c("longitude","latitude")])
## longitude latitude
## Min. :-77.04 Min. : 0.00
## 1st Qu.:-77.03 1st Qu.:38.89
## Median :-77.02 Median :38.89
## Mean :-76.87 Mean :38.81
## 3rd Qu.:-77.02 3rd Qu.:38.89
## Max. : 0.00 Max. :38.90
map3 <- get_map(location = c(-77.025,38.895),
maptype = "terrain",
source = "google",
crop = FALSE,
zoom = 15)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=38.895,-77.025&zoom=15&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
plot_map3 <- ggmap(map3) +
geom_point(data = moves, aes(x = Longitude, y = Latitude)) +
geom_path(data = moves,aes(x = Longitude, y = Latitude), size = 1, lineend = "round") +
xlab("Longitude") + ylab("Latitude") +
ggtitle("Plot of Women's March Time Series Data") +
theme(legend.position="none",
plot.title = element_text(hjust = 0.5))
plot_map3
To create the dynamic map, we use the saveGIF()
within the animation
package. The GIF displayed bleow plots each stop in the march with a point. To visualize the amount of time spent at each location, the dots at each location grow proportional to the duration spent at that location compared to other locations. The movement is depicted by black lines connecting each location. The “growing dots” depicting duration are achieved by subseting the data iteratively - the first plot in the GIF uses only the first row of data, the second plot uses the first two rows…all the way up to the 488th plot in the GIF which utilizes all 488 rows. Do to knit time (this particular GIF takes ~10 minutes to render), the code chunk below is not evaluated, but you are welcome to download the data and run this code on your own machine to see the resulting GIF!
# Create color palette
set.seed(1989)
palette = sample(rainbow(length(unique(moves3$obs))))
# Create gif
saveGIF({
seq_list <- unique(moves3$seq_num) # this sets how many plots will be tied together (equal to number of rows)
#looping through plots
for(i in 1:length(seq_list)){
print(paste(c("Now at", seq_list[i])))
dat <- subset(moves3, seq_num <= seq_list[i])
print(
ggmap(map) +
scale_size(range=c(0,8.5)) +
scale_colour_manual(values=palette) +
geom_point(data = dat, aes(x = longitude, y = latitude, size=minutes_scaled,colour=factor(obs))) +
geom_path(data = dat, aes(x = longitude, y = latitude), size = .5, lineend = "round") +
xlab("Longitude") + ylab("Latitude") +
theme(legend.position="none")
)
}}, movie.name="moves2.gif",
interval = .001,
ani.width = 600,
ani.height = 600)
If you have questions or suggestions on material to add to this tutorial, please contact Libby at leb237@psu.edu. We are happy for this tutorial to be shared, but rather than sharing directly, please make sure to link to it’s page on the QuantDev website: https://quantdev.ssri.psu.edu/tutorials/intro-geospatial-data-and-maps-r
There are also many other cool tutorials given by the QuantDev group at: https://quantdev.ssri.psu.edu/tutorials/