Processing Large Rasters Using Tiling and Parallelization: An R + SAGA GIS + GRASS GIS Tutorial


Research, Innovation and Consultancy

tom.hengl@enviometrix.net
@tom_hengl
thengl
http://envirometrix.net
Overview

1. General concepts
   a. A systematic approach,
   b. Parallelization: basic principles,
2. Example 1
   a. Processing global land cover data,
3. Example 2
   a. Processing elevation data and DEM analysis,
4. Conclusions
   a. What to expect and what not to expect,
   b. Some work-arounds
3V's of big data

1. Volume (input / output data size)
2. Velocity (read / write / compute speed)
3. Variety (e.g. millions of variables)

Big Data management vs Big Data analytics
First important realization

Programming with "big data" is really a different type of game: expect exponentially more complexity, and get ready to boost your programming skills.
SoilGrids250m
AfSIS
Mapping climatic data in spacetime
Mapping global mangroves
PNV mapping
Three options:

1. Use software that handles large data (internal parallelization)
2. Make your own functions
3. Use infrastructure optimized for large data
Learning parallelization is rewarding!

It is of course better if you can learn how to solve things by making your own functions.
What can go wrong

1. You optimize almost everything, but then one bottleneck leaves you hanging.
2. You do not plan carefully or examine read/write times, memory handling, computing loads etc., which results in a serious loss of time.
3. You leave the computer running for an excessive amount of time without really knowing the progress.
RStudio, GDAL, SAGA GIS and GRASS GIS

Once you get a bit better at it, it starts looking something like this... (screenshot: HPC with 64 cores)
Some lessons I've learned:
(a) Plan carefully,
(b) Invest in hardware / storage,
(c) Use appropriate tiling (linear, hierarchical),
(d) Implement full parallelization (optimize code so it uses the hardware to the maximum),
(e) Make "scalable" code (it should work with N elements running on P cores).
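Point (c) can be made concrete with a few lines of base R: a linear tiling simply enumerates block offsets and sizes over the raster grid. The helper below is a hypothetical sketch (for real rasters, GSIF::getSpatialTiles, shown later, derives the same thing from a GDAL header); the dimensions match the 300 m land cover image from Example 1.

```r
## Hypothetical helper: split an nrow x ncol raster into blocks of at
## most `block` x `block` pixels; offsets/sizes follow the convention
## of rgdal::readGDAL (offset, region.dim).
make_tiles <- function(nrow, ncol, block) {
  tiles <- expand.grid(offset.y = seq(0, nrow - 1, by = block),
                       offset.x = seq(0, ncol - 1, by = block))
  tiles$region.dim.y <- pmin(block, nrow - tiles$offset.y)
  tiles$region.dim.x <- pmin(block, ncol - tiles$offset.x)
  tiles
}
tiles <- make_tiles(nrow = 64800, ncol = 129600, block = 5000)
nrow(tiles)  ## 13 * 26 = 338 candidate tiles
```

Each row of `tiles` can then be handed to one worker, which is all a tiling system really is.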
Google Earth Engine
Software: fast reading of data

> ## CSV files (data.table):
> df <- data.table::fread("test.csv")
> ## shapefiles (sf):
> f <- system.file("shape/nc.shp", package="sf")
> pnt <- sf::st_read(f)
> ## rasters (raster):
> r <- raster::stack(list.files(pattern="\\.tif$", full.names=TRUE))
GDAL

> ## gdalwarp in parallel:
> system(paste0('gdalwarp AW3D30_30m.vrt AW3D30_dem_100m.tif ',
    '-co "BIGTIFF=YES" -wm 2000 -co "COMPRESS=DEFLATE" -multi ',
    '-wo "NUM_THREADS=ALL_CPUS"'))
Reading a large GeoTIFF tile by tile

> library(raster); library(sp)
> fn = system.file("pictures/SP27GTIF.TIF", package="rgdal")
> obj <- rgdal::GDALinfo(fn)
> tiles <- GSIF::getSpatialTiles(obj, block.x=5000,
    return.SpatialPolygons=FALSE)
> tiles.pol <- GSIF::getSpatialTiles(obj, block.x=5000,
    return.SpatialPolygons=TRUE)
> tile.pol <- SpatialPolygonsDataFrame(tiles.pol, tiles)
> plot(raster(fn), col=bpy.colors(20))
> lines(tile.pol, lwd=2)
Reading and writing RDS files in parallel

> ## requires the system utility `pigz` (parallel gzip):
> saveRDS.gz <- function(object, file, threads=parallel::detectCores()) {
    con <- pipe(paste0("pigz -p", threads, " > ", file), "wb")
    saveRDS(object, file = con)
    close(con)
  }

> readRDS.gz <- function(file, threads=parallel::detectCores()) {
    con <- pipe(paste0("pigz -d -c -p", threads, " ", file))
    object <- readRDS(file = con)
    close(con)
    return(object)
  }
SAGA GIS

SAGA GIS automatically uses all cores!

saga_cmd [-h, --help]
saga_cmd [-v, --version]
saga_cmd [-b, --batch]
saga_cmd [-d, --docs]
saga_cmd [-f, --flags][=qrsilpxo][-s, --story][=#][-c, --cores][=#] <LIBRARY> <MODULE> <OPTIONS>
saga_cmd [-f, --flags][=qrsilpxo][-s, --story][=#][-c, --cores][=#] <SCRIPT>

[-h], [--help]   : help on usage
[-v], [--version]: print version information
[-b], [--batch]  : create a batch file example
[-d], [--docs]   : create tool documentation in the current working directory
[-s], [--story]  : maximum data history depth (default is unlimited)
[-c], [--cores]  : number of physical processors to use for computation
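The -c flag is what matters on shared servers: it caps the number of cores a SAGA run will grab. A hypothetical invocation (the tool library/module and grid file names are illustrative; run saga_cmd -d to see what your version provides) might look like:

```shell
# compute slope/aspect with the "Slope, Aspect, Curvature" tool
# (ta_morphometry 0), capped at 32 cores;
# dem.sgrd / slope.sgrd / aspect.sgrd are placeholder file names
saga_cmd -c=32 ta_morphometry 0 \
  -ELEVATION dem.sgrd -SLOPE slope.sgrd -ASPECT aspect.sgrd
```

(Untestable without a SAGA installation and an input grid, hence shown as a command fragment only.)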
GRASS GIS

M. Neteler: "You can regard GRASS GIS not as an application, but rather as an environment that provides applications (commands/modules). The idea of running GRASS in parallel is to run several GRASS commands in parallel. This kind of parallelization becomes interesting when you want to apply the same processing (say, r.fill.stats) to many different input data, like a time series or a tile-based approach (r.tile). You will need some job scheduler, e.g. SLURM or equivalent (see https://slurm.schedmd.com/rosetta.html). A single server with multiple cores can be regarded as a simple HPC system. Of course, OpenStack/Docker-based job schedulers, which are more suitable for cloud processing, might also be used."
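On a single multi-core server, Neteler's pattern reduces to launching the same command once per tile, throttled to the number of cores. A runnable sketch of just that scheduling pattern, with echo standing in for a real grass ... --exec r.fill.stats ... call (all tile names and paths here are placeholders):

```shell
# one job per tile, at most 4 running at a time (-P4);
# replace `echo processing` with the actual GRASS invocation, e.g.
#   grass /data/grassdb/location/$tile --exec r.fill.stats ...
printf 'tile_%02d\n' $(seq 1 8) | xargs -n1 -P4 echo processing
```

Each output line corresponds to one completed job; on a cluster, the same list of tiles would instead be submitted as SLURM array jobs.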
Examples

Example 1: derivation of land cover change using global 300 m resolution images
Example with the global land cover

Some ideas about the data size

➔ at 300 m resolution the land mask contains about 1.4 billion pixels,
➔ each image is about 6 GB in memory,
➔ about 80 CPU-hours to compute.
Some ideas about the data size

> r = raster("ESACCI-LC-L4-LCCS-Map-300m-P5Y-2010-v1.6.1.tif")
> r
class : RasterLayer
dimensions : 64800, 129600, 8398080000 (nrow, ncol, ncell)
resolution : 0.002777778, 0.002777778 (x, y)
extent : -180, 180, -90, 90 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
data source : /data/LDN/ESACCI-LC-L4-LCCS-Map-300m-P5Y-2010-v1.6.1.tif
names : ESACCI.LC.L4.LCCS.Map.300m.P5Y.2010.v1.6.1
values : 0, 255 (min, max)
Objective: derive land cover change per pixel

Steps:
1. Prepare a tiling system,
2. Prepare functions for processing tiles,
3. Run in parallel (snowfall),
4. Generate a mosaic and build pyramids (GDAL).
Server

1. 2 × Intel Xeon Gold 6142 @ 2.60GHz
2. Core count: 32 / thread count: 64
3. RAM: 386048 MB
4. Disk: 1920 GB Intel SSDSC2KG01 + 10001 GB Western Digital WD101KRYZ-01
5. File system: ext4 / mount options: data=ordered errors=remount-ro relatime rw
6. Operating system: Ubuntu 16.04
Optional: NAS Synology (50 TB)
Prepare a function

make_LC_tiles <- function(...){
  out.tif = ...
  ...
  m$i = plyr::join(data.frame(NAME=m$v), comb.leg, type="left")$Value
  writeGDAL(m["i"], out.tif, type="Int16",
            options="COMPRESS=DEFLATE", mvFlag=-32768)
}
Run in parallel

> sfInit(parallel=TRUE, cpus=64)
snowfall 1.84-6.1 initialized (using snow 0.4-1): parallel execution on 64 CPUs.
> sfExport("make_LC_tiles", "tile.tbl", "comb.leg", "cl.leg", "t.sel")
> sfLibrary(rgdal)
Library rgdal loaded.
Library rgdal loaded in cluster.
> sum.lst <- snowfall::sfClusterApplyLB(as.numeric(t.sel),
    function(x){ make_LC_tiles(x, tile.tbl=tile.tbl) })
> sfStop()  ## stop the cluster when done

You can relax and follow the generation of files...
Final step: generate a mosaic

tmp.lst <- list.files(path="/data/LDN/tiled",
    pattern=glob2rx("LandCover_CL_*.tif"), full.names=TRUE,
    recursive=TRUE)
## only 3178 tiles with values
out.tmp <- tempfile(fileext=".txt")
vrt.tmp <- tempfile(fileext=".vrt")
cat(tmp.lst, sep="\n", file=out.tmp)
system(paste0('gdalbuildvrt -input_file_list ', out.tmp, ' ', vrt.tmp))
system(paste0('gdalwarp ', vrt.tmp, ' LandCover_CL_300m.tif ',
    '-ot "Int16" -dstnodata "-32767" -co "BIGTIFF=YES" -multi -wm 2000 ',
    '-co "COMPRESS=DEFLATE" -r "near" -wo "NUM_THREADS=ALL_CPUS"'))
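Step 4 above also calls for building pyramids. A hedged sketch of that last step, assuming the mosaic from the previous command exists, is a single gdaladdo call (the overview levels are illustrative):

```shell
# add overview levels (pyramids) so the 300 m mosaic renders quickly
# at coarse zooms; nearest-neighbour resampling preserves the
# categorical class codes, matching the -r "near" choice in the warp
gdaladdo -r nearest LandCover_CL_300m.tif 2 4 8 16 32
```

(Not testable without GDAL and the multi-GB mosaic; shown as a command fragment only.)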
Tiles
Examples

Example 2: derivation of DEM surface parameters
Digital Land Surface Model

Two sources of DEM data:

● 1/3rd arc-second (10 m) National Elevation Dataset (NED)
● 1 arc-second (30 m) global ALOS AW3D30 Digital Surface Model

Which one do we use?

Modeling DLSM

Use training points to build a model that merges various elevation data sources:

mDLSM = f (NED, AW3D30, NDVI, HH/HV)

i.e. two elevation sources, an NDVI image and radar bands (HH/HV).
DLSM scheme
Results
Conclusions I

● Yes, you can use R (in combination with GDAL, SAGA GIS and similar) to process large rasters,
● Do not stop optimizing until you have reached the maximum usage of hardware: RAM + cores (at least 90%?),
● Start with smaller subsets, then increase data volumes.
Conclusions II

● SAGA GIS is already optimized for large data (automatic parallelization),
● GRASS GIS is quite efficient at processing large rasters, but parallelization needs to be implemented through scripting,
● Investing in hardware / storage will eventually be unavoidable.
