Parallel Raster Processing

Drastically reduce processing of large rasters

R
Spatial
Author

Trent McDonald

Published

February 24, 2023

Recently, I needed to extract raster values over polygons. The raster was huge (22.2 million cells) and the number of polygons was huge (83 thousand). I calculated the the extraction would 24hrs on a single processor. I googled extensively for methods to perform the extraction in parallel or otherwise speed up the process. I found solutions that were close, but nothing that addressed extracting data over a set of polygons. In this post, I demonstrate the parallel method I ultimately implemented. The parallel method finished in 2.5 minutes on an 6 core (12 virtual processors) processor. An amazing a speed-up over the single processor time.

Example Data

I will demonstrate the parallel method on a tiny raster, so the times reported here will not reflect the utility of the parallel method. My real raster was in a TIFF file with 4,760 rows and 4,664 columns. Its pixels were 30 meters by 30 meters and the extent of the raster covered the coastline of Alaska from Glacier Bay National park to Seward.

First, you will need the following packages.

library(raster)
library(sf)
library(sfheaders)
library(units)
library(terra)
library(dplyr)
library(parallel)

This constructs a example raster.

bigRasterSize <- 100 # columns and rows
bigRaster <- raster(nrow = bigRasterSize
                  , ncol = bigRasterSize
                  , xmn = 1
                  , xmx = bigRasterSize
                  , ymn = 1
                  , ymx = bigRasterSize
                  , crs = 3338)
values(bigRaster) <- rep(1:bigRasterSize, bigRasterSize)

I wish to extract values of bigRaster over a large number of buffered points. This code creates 50 points and buffers them by 3 meters.

# Create sf POINTS object
numPts <- 50
pts <- data.frame(x = runif(numPts, 1, bigRasterSize)
                , y = runif(numPts, 1, bigRasterSize))
pts <- sf_point(pts)
st_crs(pts) <- 3338

# Buffer POINTS
ptPolys <- st_buffer(pts, set_units(3, "m"))

This plots the banded raster, the points, and polygons for inspection.

plot( bigRaster)
plot( ptPolys$geometry, add=T, col="red")
plot( pts$geometry, add=T, pch=16, cex = .5)

Single Core Extract

This example is small enough to extract using a single core. Extraction on a single core is performed by the terra::extract function. I am interested in the values and execution time.

system.time({
  focalData <- extract(bigRaster, ptPolys)
})
   user  system elapsed 
   1.23    0.00    1.39 
focalQ_Single <- lapply(focalData, quantile, p = c(0.05, 0.50, 0.95))  
focalQ_Single <- do.call(rbind, focalQ_Single)
focalQ_Single <- data.frame(focalQ_Single) %>% 
  rename_with(function(x){x <- sub("X","q",x)
                          sub("\\.","",x)}) %>% 
  bind_cols(data.frame(st_coordinates(pts)))

head(focalQ_Single)
    q5  q50  q95         X         Y
1 40.0 42.5 45.0 42.541693 94.593148
2 88.0 90.0 92.6 89.887690 92.077198
3  1.0  3.0  5.0  3.216999 30.553269
4 86.4 89.0 91.0 88.394036 12.035271
5 13.0 15.5 18.0 15.797879 31.022325
6 53.0 55.0 58.0 55.348541  8.981547

Parallel Extract

For the parallel method, I called the parallel::parApply function to execute terra::extract one polygon at a time. parApply takes care of allocating polygons to processors and collecting results. The key portion of the following code is making the processor cluster, attaching libraries on each, and exporting the necessary objects to each processor. These tasks are facilitated by calls to parallel::makeCluster, parallel::clusterEvalQ, and parallel::clusterExport. During export, the exported object must be visible inside clusterExport or you may receive a confusing “object not found” error. Here, all objects are in the .GlobalEnv environment and the export works. If you export from inside another function, you will probably need to use the env= argument of clusterExport. Making the cluster and exporting objects is a bit slow, but the actual computation is very quick.

cl <- makeCluster(detectCores()-1)
res <- clusterEvalQ(cl, {
  library(terra)
  library(raster)
  library(sf)
  })
res <- clusterExport(cl, list('bigRaster', 'ptPolys'))

system.time({
focalData <- parLapply(cl, 1:nrow(ptPolys), function(i){
    dt <- extract(bigRaster, ptPolys[i,])
    dt
  })
})
   user  system elapsed 
   0.00    0.00    1.25 
focalQ_Parallel <- lapply(focalData, FUN = function(x){
  quantile(unlist(x), p = c(0.05, 0.50, 0.95))
})
focalQ_Parallel <- do.call(rbind, focalQ_Parallel)
focalQ_Parallel <- data.frame(focalQ_Parallel) %>% 
  rename_with(function(x){x <- sub("X","q",x)
                          sub("\\.","",x)}) %>% 
  bind_cols(data.frame(st_coordinates(pts)))

stopCluster(cl)

head(focalQ_Parallel)
    q5  q50  q95         X         Y
1 40.0 42.5 45.0 42.541693 94.593148
2 88.0 90.0 92.6 89.887690 92.077198
3  1.0  3.0  5.0  3.216999 30.553269
4 86.4 89.0 91.0 88.394036 12.035271
5 13.0 15.5 18.0 15.797879 31.022325
6 53.0 55.0 58.0 55.348541  8.981547

Final Comments

To compare results, I rename columns and merge by X and Y. The all statements confirm that the quantiles computed by both methods are identical.

focalQ_Single <- focalQ_Single %>% 
  rename_with(.fn = function(x){paste0("single_",x)}
              , .cols = starts_with("q"))
focalQ_Parallel <- focalQ_Parallel %>% 
  rename_with(.fn = function(x){paste0("parallel_",x)}
              , .cols = starts_with("q"))
focalQ <- focalQ_Single %>% 
  left_join( focalQ_Parallel, by = c("X", "Y"))

all(focalQ$single_q5 == focalQ$parallel_q5)
[1] TRUE
all(focalQ$single_q50 == focalQ$parallel_q50)
[1] TRUE
all(focalQ$single_q95 == focalQ$parallel_q95)
[1] TRUE

Moreover, results are correct because the median value in each polygon should be very close to the polygon’s X coordinate due to the way I constructed the raster. Data for the first 10 points appears below.

xCoordTest <- focalQ %>% 
  mutate(rounded_q50 = round(parallel_q50)
         , rounded_X = round(X)) %>% 
  select(X, parallel_q50, rounded_X, rounded_q50)
head(xCoordTest, 10)
           X parallel_q50 rounded_X rounded_q50
1  42.541693         42.5        43          42
2  89.887690         90.0        90          90
3   3.216999          3.0         3           3
4  88.394036         89.0        88          89
5  15.797879         15.5        16          16
6  55.348541         55.0        55          55
7  78.180909         78.0        78          78
8  17.375969         17.0        17          17
9  74.599965         75.0        75          75
10 75.715748         76.0        76          76

Finally, a big thank you to the maintainers of the R spatial packages. R has proven an amazing processing platform for spatial data. The sf, raster, terra, ggplot and mapview packages make programmatic processing of spatial data easy and accessible, especially to those who do not ‘Arc’. In the old days (the 1990’s), R’s current level of spatial processing could only be accomplished by trained analysts in ArcInfo.