Processing Large Rasters Using Tiling and Parallelization: An R + SAGA GIS + GRASS GIS Tutorial


Research, Innovation and Consultancy

tom.hengl@enviometrix.net
@tom_hengl
thengl
http://envirometrix.net
Overview

1. General concepts
   a. A systematic approach,
   b. Parallelization: basic principles,
2. Example 1
   a. Processing global land cover data,
3. Example 2
   a. Processing elevation data and DEM analysis,
4. Conclusions
   a. What to expect and what not to expect,
   b. Some work-arounds
3V's of big data

1. Volume (input / output data size)
2. Velocity (read / write / compute speed)
3. Variety (e.g. millions of variables)

Big Data management vs Big Data analytics
First important realization

Programming with "big data" is really a different type of game: expect exponentially more complexity, and get ready to boost your programming skills.
SoilGrids250m
AfSIS
Mapping climatic data in spacetime
Mapping global mangroves
PNV mapping
Three options:

1. Use software that handles large data (internal parallelization)
2. Make your own functions
3. Use infrastructure optimized for large data
Learning parallelization is rewarding!

It is of course better if you can learn how to solve things by making your own functions.
What can go wrong

1. You optimize almost everything, but then one bottleneck leaves you hanging.
2. You do not plan carefully or examine read/write times, memory handling, computing loads etc., which results in a serious loss of time.
3. You leave the computer running for an excessive amount of time without really knowing the progress.
RStudio, GDAL, SAGA GIS and GRASS GIS

Once you get a bit better at it, it starts looking something like this... (screenshot: HPC with 64 cores)
Some lessons I've learned:
(a) Plan carefully,
(b) Invest in hardware / storage,
(c) Use appropriate tiling (linear, hierarchical),
(d) Implement full parallelization (optimize code so it uses the hardware to the maximum),
(e) Make "scalable" code (it should work with N elements running on P cores).
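Point (c) can be made concrete with a few lines of base R: a linear tiling simply enumerates block offsets and sizes over the raster grid. The helper below is a hypothetical sketch (for real rasters, GSIF::getSpatialTiles, shown later, derives the same thing from a GDAL header); the dimensions match the 300 m land cover image from Example 1.

```r
## Hypothetical helper: split an nrow x ncol raster into blocks of at
## most `block` x `block` pixels; offsets/sizes follow the convention
## of rgdal::readGDAL (offset, region.dim).
make_tiles <- function(nrow, ncol, block) {
  tiles <- expand.grid(offset.y = seq(0, nrow - 1, by = block),
                       offset.x = seq(0, ncol - 1, by = block))
  tiles$region.dim.y <- pmin(block, nrow - tiles$offset.y)
  tiles$region.dim.x <- pmin(block, ncol - tiles$offset.x)
  tiles
}
tiles <- make_tiles(nrow = 64800, ncol = 129600, block = 5000)
nrow(tiles)  ## 13 * 26 = 338 candidate tiles
```

Each row of `tiles` can then be handed to one worker, which is all a tiling system really is.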
Google Earth Engine
Software: fast reading of data

> ## CSV files (data.table):
> df <- data.table::fread("test.csv")
> ## shapefiles (sf):
> f <- system.file("shape/nc.shp", package="sf")
> pnt <- sf::st_read(f)
> ## rasters (raster):
> r <- raster::stack(list.files(pattern="\\.tif$", full.names=TRUE))
GDAL

> ## gdalwarp in parallel:
> system(paste0('gdalwarp AW3D30_30m.vrt AW3D30_dem_100m.tif ',
    '-co "BIGTIFF=YES" -wm 2000 -co "COMPRESS=DEFLATE" -multi ',
    '-wo "NUM_THREADS=ALL_CPUS"'))
Reading a large GeoTIFF tile by tile

> library(raster); library(sp)
> fn = system.file("pictures/SP27GTIF.TIF", package="rgdal")
> obj <- rgdal::GDALinfo(fn)
> tiles <- GSIF::getSpatialTiles(obj, block.x=5000,
    return.SpatialPolygons=FALSE)
> tiles.pol <- GSIF::getSpatialTiles(obj, block.x=5000,
    return.SpatialPolygons=TRUE)
> tile.pol <- SpatialPolygonsDataFrame(tiles.pol, tiles)
> plot(raster(fn), col=bpy.colors(20))
> lines(tile.pol, lwd=2)
Reading and writing RDS files in parallel

> ## requires the system utility `pigz` (parallel gzip):
> saveRDS.gz <- function(object, file, threads=parallel::detectCores()) {
    con <- pipe(paste0("pigz -p", threads, " > ", file), "wb")
    saveRDS(object, file = con)
    close(con)
  }

> readRDS.gz <- function(file, threads=parallel::detectCores()) {
    con <- pipe(paste0("pigz -d -c -p", threads, " ", file))
    object <- readRDS(file = con)
    close(con)
    return(object)
  }
SAGA GIS

SAGA GIS automatically uses all cores!

saga_cmd [-h, --help]
saga_cmd [-v, --version]
saga_cmd [-b, --batch]
saga_cmd [-d, --docs]
saga_cmd [-f, --flags][=qrsilpxo][-s, --story][=#][-c, --cores][=#] <LIBRARY> <MODULE> <OPTIONS>
saga_cmd [-f, --flags][=qrsilpxo][-s, --story][=#][-c, --cores][=#] <SCRIPT>

[-h], [--help]   : help on usage
[-v], [--version]: print version information
[-b], [--batch]  : create a batch file example
[-d], [--docs]   : create tool documentation in the current working directory
[-s], [--story]  : maximum data history depth (default is unlimited)
[-c], [--cores]  : number of physical processors to use for computation
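The -c flag is what matters on shared servers: it caps the number of cores a SAGA run will grab. A hypothetical invocation (the tool library/module and grid file names are illustrative; run saga_cmd -d to see what your version provides) might look like:

```shell
# compute slope/aspect with the "Slope, Aspect, Curvature" tool
# (ta_morphometry 0), capped at 32 cores;
# dem.sgrd / slope.sgrd / aspect.sgrd are placeholder file names
saga_cmd -c=32 ta_morphometry 0 \
  -ELEVATION dem.sgrd -SLOPE slope.sgrd -ASPECT aspect.sgrd
```

(Untestable without a SAGA installation and an input grid, hence shown as a command fragment only.)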
GRASS GIS

M. Neteler: "You can regard GRASS GIS not as an application, but rather as an environment that provides applications (commands/modules). The idea of running GRASS in parallel is to run several GRASS commands in parallel. This kind of parallelization becomes interesting when you want to apply the same processing (say, r.fill.stats) to many different input data, like a time series or a tile-based approach (r.tile). You will need some job scheduler, e.g. SLURM or equivalent (see https://slurm.schedmd.com/rosetta.html). A single server with multiple cores can be regarded as a simple HPC system. Of course, OpenStack/Docker-based job schedulers, which are more suitable for cloud processing, might also be used."
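On a single multi-core server, Neteler's pattern reduces to launching the same command once per tile, throttled to the number of cores. A runnable sketch of just that scheduling pattern, with echo standing in for a real grass ... --exec r.fill.stats ... call (all tile names and paths here are placeholders):

```shell
# one job per tile, at most 4 running at a time (-P4);
# replace `echo processing` with the actual GRASS invocation, e.g.
#   grass /data/grassdb/location/$tile --exec r.fill.stats ...
printf 'tile_%02d\n' $(seq 1 8) | xargs -n1 -P4 echo processing
```

Each output line corresponds to one completed job; on a cluster, the same list of tiles would instead be submitted as SLURM array jobs.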
Examples

Example 1: derivation of land cover change using global 300 m resolution images
Example with the global land cover

Some ideas about the data size

➔ at 300 m resolution the land mask contains about 1.4 billion pixels,
➔ each image is about 6 GB in memory,
➔ about 80 CPU-hours to compute.
Some ideas about the data size

> r = raster("ESACCI-LC-L4-LCCS-Map-300m-P5Y-2010-v1.6.1.tif")
> r
class : RasterLayer
dimensions : 64800, 129600, 8398080000 (nrow, ncol, ncell)
resolution : 0.002777778, 0.002777778 (x, y)
extent : -180, 180, -90, 90 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
data source : /data/LDN/ESACCI-LC-L4-LCCS-Map-300m-P5Y-2010-v1.6.1.tif
names : ESACCI.LC.L4.LCCS.Map.300m.P5Y.2010.v1.6.1
values : 0, 255 (min, max)
Objective: derive land cover change per pixel

Steps:
1. Prepare a tiling system,
2. Prepare functions for processing tiles,
3. Run in parallel (snowfall),
4. Generate a mosaic and build pyramids (GDAL).
Server

1. 2 × Intel Xeon Gold 6142 @ 2.60GHz
2. Core count: 32 / thread count: 64
3. RAM: 386048 MB
4. Disk: 1920 GB Intel SSDSC2KG01 + 10001 GB Western Digital WD101KRYZ-01
5. File system: ext4 / mount options: data=ordered errors=remount-ro relatime rw
6. Operating system: Ubuntu 16.04
Optional: NAS Synology (50 TB)
Prepare a function

make_LC_tiles <- function(...){
  out.tif = ...
  ...
  m$i = plyr::join(data.frame(NAME=m$v), comb.leg, type="left")$Value
  writeGDAL(m["i"], out.tif, type="Int16",
            options="COMPRESS=DEFLATE", mvFlag=-32768)
}
Run in parallel

> sfInit(parallel=TRUE, cpus=64)
snowfall 1.84-6.1 initialized (using snow 0.4-1): parallel execution on 64 CPUs.
> sfExport("make_LC_tiles", "tile.tbl", "comb.leg", "cl.leg", "t.sel")
> sfLibrary(rgdal)
Library rgdal loaded.
Library rgdal loaded in cluster.
> sum.lst <- snowfall::sfClusterApplyLB(as.numeric(t.sel),
    function(x){ make_LC_tiles(x, tile.tbl=tile.tbl) })
> sfStop()  ## stop the cluster when done

You can relax and follow the generation of files...
Final step: generate a mosaic

tmp.lst <- list.files(path="/data/LDN/tiled",
    pattern=glob2rx("LandCover_CL_*.tif"), full.names=TRUE,
    recursive=TRUE)
## only 3178 tiles with values
out.tmp <- tempfile(fileext=".txt")
vrt.tmp <- tempfile(fileext=".vrt")
cat(tmp.lst, sep="\n", file=out.tmp)
system(paste0('gdalbuildvrt -input_file_list ', out.tmp, ' ', vrt.tmp))
system(paste0('gdalwarp ', vrt.tmp, ' LandCover_CL_300m.tif ',
    '-ot "Int16" -dstnodata "-32767" -co "BIGTIFF=YES" -multi -wm 2000 ',
    '-co "COMPRESS=DEFLATE" -r "near" -wo "NUM_THREADS=ALL_CPUS"'))
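Step 4 above also calls for building pyramids. A hedged sketch of that last step, assuming the mosaic from the previous command exists, is a single gdaladdo call (the overview levels are illustrative):

```shell
# add overview levels (pyramids) so the 300 m mosaic renders quickly
# at coarse zooms; nearest-neighbour resampling preserves the
# categorical class codes, matching the -r "near" choice in the warp
gdaladdo -r nearest LandCover_CL_300m.tif 2 4 8 16 32
```

(Not testable without GDAL and the multi-GB mosaic; shown as a command fragment only.)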
Tiles
Examples

Example 2: derivation of DEM surface parameters
Digital Land Surface Model

Two sources of DEM data:

● 1/3rd arc-second (10 m) National Elevation Dataset (NED)
● 1 arc-second (30 m) global ALOS AW3D30 Digital Surface Model

Which one do we use?

Modeling DLSM

Use training points to build a model that merges various elevation data sources:

mDLSM = f (NED, AW3D30, NDVI, HH/HV)

i.e. two elevation sources, an NDVI image and radar bands (HH/HV).
DLSM scheme
Results
Conclusions I

● Yes, you can use R (in combination with GDAL, SAGA GIS and similar) to process large rasters,
● Do not stop optimizing until you have reached the maximum usage of hardware: RAM + cores (at least 90%?),
● Start with smaller subsets, then increase data volumes.
Conclusions II

● SAGA GIS is already optimized for large data (automatic parallelization),
● GRASS GIS is quite efficient at processing large rasters, but parallelization needs to be implemented through scripting,
● Investing in hardware / storage will eventually be unavoidable.
