Parallel Raster Processing

Drastically reduce processing time for large rasters

Recently, I needed to extract raster values over polygons. The raster was huge (22.2 million cells) and the number of polygons was also huge (83 thousand). I calculated that the extraction would take roughly 24 hours on a single processor. I googled extensively for methods to perform the extraction in parallel or otherwise speed up the process. I found solutions that were close, but nothing that addressed extracting data over a set of polygons. In this post, I demonstrate the parallel method I ultimately implemented. The parallel method finished in 2.5 minutes on a 6-core (12 virtual processors) machine, an amazing speed-up over the single-processor time.
Example Data
I will demonstrate the parallel method on a tiny raster, so the times reported here will not reflect the utility of the parallel method. My real raster was stored in a TIFF file with 4,760 rows and 4,664 columns. Its pixels were 30 meters by 30 meters, and the raster's extent covered the coastline of Alaska from Glacier Bay National Park to Seward.
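For a real analysis like that one, the raster would be read from disk rather than constructed in code. A minimal sketch, assuming a hypothetical filename:

# Sketch only: read a large raster from disk (hypothetical filename)
bigRaster <- raster("alaska_coastline_30m.tif")

The examples below use a small constructed raster instead.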
First, you will need the following packages.

library(raster)
library(sf)
library(sfheaders)
library(units)
library(terra)
library(dplyr)
library(parallel)

This code constructs an example raster.
bigRasterSize <- 100 # columns and rows
bigRaster <- raster(nrow = bigRasterSize
                  , ncol = bigRasterSize
                  , xmn = 1
                  , xmx = bigRasterSize
                  , ymn = 1
                  , ymx = bigRasterSize
                  , crs = 3338)

# Fill the raster row-wise so each cell's value equals its column number
values(bigRaster) <- rep(1:bigRasterSize, bigRasterSize)
I wish to extract values of bigRaster
over a large number of buffered points. This code creates 50 points and buffers them by 3 meters.
# Create sf POINTS object
numPts <- 50
pts <- data.frame(x = runif(numPts, 1, bigRasterSize)
                , y = runif(numPts, 1, bigRasterSize))
pts <- sf_point(pts)
st_crs(pts) <- 3338

# Buffer POINTS by 3 meters
ptPolys <- st_buffer(pts, set_units(3, "m"))
This plots the banded raster, the buffered polygons, and the points for inspection.
plot(bigRaster)
plot(ptPolys$geometry, add = TRUE, col = "red")
plot(pts$geometry, add = TRUE, pch = 16, cex = 0.5)
Single Core Extract
This example is small enough to extract using a single core. Extraction on a single core is performed by the terra::extract function. I am interested in both the extracted values and the execution time.
system.time({
  focalData <- extract(bigRaster, ptPolys)
})

   user  system elapsed 
   0.22    0.01    0.33 
# Compute 5%, 50%, and 95% quantiles of the raster values in each polygon
focalQ_Single <- lapply(focalData, quantile, p = c(0.05, 0.50, 0.95))
focalQ_Single <- do.call(rbind, focalQ_Single)

# Clean up column names (e.g., "X5." -> "q5") and attach point coordinates
focalQ_Single <- data.frame(focalQ_Single) %>%
  rename_with(function(x){
    x <- sub("X", "q", x)
    sub("\\.", "", x)
  }) %>%
  bind_cols(data.frame(st_coordinates(pts)))
head(focalQ_Single)
q5 q50 q95 X Y
1 23.0 25 27.6 25.397450 27.09893
2 41.4 44 46.0 43.783067 46.46760
3 1.0 2 3.2 1.557295 43.00141
4 50.0 53 55.0 52.702330 14.29329
5 50.0 52 55.0 52.183879 37.09981
6 30.0 33 35.0 32.878268 71.00383
Parallel Extract
For the parallel method, I called the parallel::parLapply function to execute terra::extract one polygon at a time. parLapply takes care of allocating polygons to processors and collecting results. The key portion of the following code is making the processor cluster, attaching libraries on each worker, and exporting the necessary objects to each worker. These tasks are handled by calls to parallel::makeCluster, parallel::clusterEvalQ, and parallel::clusterExport. During export, the exported object must be visible inside clusterExport, or you may receive a confusing "object not found" error. Here, all objects are in the .GlobalEnv environment and the export works. If you export from inside another function, you will probably need to use the envir= argument of clusterExport. Making the cluster and exporting objects is a bit slow, but the actual computation is very quick.
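As an aside, here is a minimal sketch of that last point, using a hypothetical wrapper function. Because localPolys lives in the function's environment rather than .GlobalEnv, clusterExport needs envir = environment() to find it:

# Hypothetical wrapper: exporting a locally defined object.
# Without envir = environment(), clusterExport() searches .GlobalEnv
# and fails with "object 'localPolys' not found".
extractLocally <- function(cl){
  localPolys <- ptPolys[1:10, ]  # local to this function
  clusterExport(cl, "localPolys", envir = environment())
  parLapply(cl, 1:nrow(localPolys), function(i){
    extract(bigRaster, localPolys[i, ])
  })
}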
cl <- makeCluster(detectCores() - 1)

# Attach the required libraries on each worker
res <- clusterEvalQ(cl, {
  library(terra)
  library(raster)
  library(sf)
})

# Export the raster and polygons to each worker
res <- clusterExport(cl, list('bigRaster', 'ptPolys'))

system.time({
  focalData <- parLapply(cl, 1:nrow(ptPolys), function(i){
    dt <- extract(bigRaster, ptPolys[i, ])
    dt
  })
})

   user  system elapsed 
   0.02    0.00    0.59 
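On that overhead point: parLapply already splits the polygon indices into one block per worker, but each worker still calls extract once per polygon. A variant worth sketching (my hedged alternative, not the timing shown above) makes a single extract call per worker over a whole block of polygons:

# Sketch: one extract() call per worker over a block of polygons
chunks <- split(1:nrow(ptPolys),
                cut(1:nrow(ptPolys), length(cl), labels = FALSE))
focalChunks <- parLapply(cl, chunks, function(idx){
  extract(bigRaster, ptPolys[idx, ])
})
focalDataChunked <- do.call(c, focalChunks)  # same list structure as focalData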
focalQ_Parallel <- lapply(focalData, FUN = function(x){
  quantile(unlist(x), p = c(0.05, 0.50, 0.95))
})
focalQ_Parallel <- do.call(rbind, focalQ_Parallel)

# Clean up column names and attach point coordinates, as before
focalQ_Parallel <- data.frame(focalQ_Parallel) %>%
  rename_with(function(x){
    x <- sub("X", "q", x)
    sub("\\.", "", x)
  }) %>%
  bind_cols(data.frame(st_coordinates(pts)))

stopCluster(cl)
head(focalQ_Parallel)
q5 q50 q95 X Y
1 23.0 25 27.6 25.397450 27.09893
2 41.4 44 46.0 43.783067 46.46760
3 1.0 2 3.2 1.557295 43.00141
4 50.0 53 55.0 52.702330 14.29329
5 50.0 52 55.0 52.183879 37.09981
6 30.0 33 35.0 32.878268 71.00383
Final Comments
To compare results, I rename the quantile columns of each data frame and merge by X and Y. The all statements confirm that the quantiles computed by the two methods are identical.
focalQ_Single <- focalQ_Single %>%
  rename_with(.fn = function(x){paste0("single_", x)}
            , .cols = starts_with("q"))
focalQ_Parallel <- focalQ_Parallel %>%
  rename_with(.fn = function(x){paste0("parallel_", x)}
            , .cols = starts_with("q"))
focalQ <- focalQ_Single %>%
  left_join(focalQ_Parallel, by = c("X", "Y"))
all(focalQ$single_q5 == focalQ$parallel_q5)
[1] TRUE
all(focalQ$single_q50 == focalQ$parallel_q50)
[1] TRUE
all(focalQ$single_q95 == focalQ$parallel_q95)
[1] TRUE
Moreover, the results are correct. Because I filled the raster row-wise with 1:bigRasterSize, each cell's value equals its column number, so the median value in each polygon should be very close to the polygon's X coordinate. Data for the first 10 points appears below.
xCoordTest <- focalQ %>%
  mutate(rounded_q50 = round(parallel_q50)
       , rounded_X = round(X)) %>%
  select(X, parallel_q50, rounded_X, rounded_q50)
head(xCoordTest, 10)
X parallel_q50 rounded_X rounded_q50
1 25.397450 25.0 25 25
2 43.783067 44.0 44 44
3 1.557295 2.0 2 2
4 52.702330 53.0 53 53
5 52.183879 52.0 52 52
6 32.878268 33.0 33 33
7 36.477458 36.0 36 36
8 3.436670 3.0 3 3
9 47.551025 47.5 48 48
10 40.107390 40.0 40 40
Finally, a big thank you to the maintainers of the R spatial packages. R has proven an amazing processing platform for spatial data. The sf, raster, terra, ggplot2, and mapview packages make programmatic processing of spatial data easy and accessible, especially to those who do not 'Arc'. In the old days (the 1990s), R's current level of spatial processing could only be accomplished by trained analysts in ArcInfo.