Package 'mlr3spatiotempcv'

Title: Spatiotemporal Resampling Methods for 'mlr3'
Description: Extends the mlr3 ML framework with spatio-temporal resampling methods to account for the presence of spatiotemporal autocorrelation (STAC) in predictor variables. STAC may cause highly biased performance estimates in cross-validation if ignored.
Authors: Patrick Schratz [aut, cre] , Marc Becker [aut] , Jannes Muenchow [ctb] , Michel Lang [ctb]
Maintainer: Patrick Schratz <[email protected]>
License: LGPL-3
Version: 2.3.1
Built: 2024-10-28 05:38:10 UTC
Source: https://github.com/mlr-org/mlr3spatiotempcv

Help Index


mlr3spatiotempcv: Spatiotemporal Resampling Methods for 'mlr3'

Description

Extends the mlr3 ML framework with spatio-temporal resampling methods to account for the presence of spatiotemporal autocorrelation (STAC) in predictor variables. STAC may cause highly biased performance estimates in cross-validation if ignored.

Author(s)

Maintainer: Patrick Schratz [email protected]

Authors: Marc Becker

Other contributors: Jannes Muenchow [ctb], Michel Lang [ctb]

References

Schratz P, Muenchow J, Iturritxa E, Richter J, Brenning A (2019). “Hyperparameter tuning and performance assessment of statistical and machine-learning algorithms using spatial data.” Ecological Modelling, 406, 109–120. doi:10.1016/j.ecolmodel.2019.06.002.

Valavi R, Elith J, Lahoz-Monfort JJ, Guillera-Arroita G (2018). “blockCV: an R package for generating spatially or environmentally separated folds for k-fold cross-validation of species distribution models.” bioRxiv. doi:10.1101/357798.

Meyer H, Reudenbach C, Hengl T, Katurji M, Nauss T (2018). “Improving performance of spatio-temporal machine learning models using forward feature selection and target-oriented validation.” Environmental Modelling & Software, 101, 1–9. doi:10.1016/j.envsoft.2017.12.001.

Zhao Y, Karypis G (2002). “Evaluation of Hierarchical Clustering Algorithms for Document Datasets.” 11th Conference of Information and Knowledge Management (CIKM), 515–524. doi:10.1145/584792.584877.


Convert to a Spatiotemporal Classification Task

Description

Convert an object to a TaskClassifST. This is an S3 generic for the following objects:

  1. TaskClassifST: Ensure the identity.

  2. data.frame() and DataBackend: Provides an alternative to the constructor of TaskClassifST.

  3. sf::sf: Extracts spatial meta data before construction.

  4. TaskRegr: Calls convert_task().

Usage

as_task_classif_st(x, ...)

## S3 method for class 'TaskClassifST'
as_task_classif_st(x, clone = FALSE, ...)

## S3 method for class 'data.frame'
as_task_classif_st(
  x,
  target,
  id = deparse(substitute(x)),
  positive = NULL,
  coordinate_names,
  crs = NA_character_,
  coords_as_features = FALSE,
  label = NA_character_,
  ...
)

## S3 method for class 'DataBackend'
as_task_classif_st(
  x,
  target,
  id = deparse(substitute(x)),
  positive = NULL,
  coordinate_names,
  crs,
  coords_as_features = FALSE,
  label = NA_character_,
  ...
)

## S3 method for class 'sf'
as_task_classif_st(
  x,
  target = NULL,
  id = deparse(substitute(x)),
  positive = NULL,
  coords_as_features = FALSE,
  label = NA_character_,
  ...
)

Arguments

x

(any)
Object to convert.

...

(any)
Additional arguments.

clone

(logical(1))
If TRUE, ensures that the returned object is not the same as the input x.

target

(character(1))
Name of the target column.

id

(character(1))
Id for the new task. Defaults to the (deparsed and substituted) name of the data argument.

positive

(character(1))
Only for binary classification: Name of the positive class. The levels of the target column are reordered accordingly, so that the first element of $class_names is the positive class and the second element is the negative class.

coordinate_names

(character(1))
The column names of the coordinates in the data.

crs

(character(1))
Coordinate reference system. WKT2 or EPSG string.

coords_as_features

(logical(1))
If TRUE, coordinates are used as features. This is a shortcut for task$set_col_roles(c("x", "y"), role = "feature") with the assumption that the coordinates in the data are named "x" and "y".

label

(character(1))
Label for the new instance. Shown in as.data.table(mlr_tasks).

Value

TaskClassifST.

Examples

if (mlr3misc::require_namespaces(c("sf"), quietly = TRUE)) {
  library("mlr3")
  data("ecuador", package = "mlr3spatiotempcv")

  # data.frame
  as_task_classif_st(ecuador, target = "slides", positive = "TRUE",
    coords_as_features = FALSE,
    crs = "+proj=utm +zone=17 +south +datum=WGS84 +units=m +no_defs",
    coordinate_names = c("x", "y"))

  # sf
  ecuador_sf = sf::st_as_sf(ecuador, coords = c("x", "y"), crs = 32717)
  as_task_classif_st(ecuador_sf, target = "slides", positive = "TRUE")
}
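
The level reordering described for the positive argument can be illustrated without the package. This is a minimal base-R sketch; the target vector is hypothetical and the code only mimics the documented outcome (positive class first), not the package's internal implementation.

```r
# Hypothetical binary target; values are illustrative, not package data.
target = factor(c("FALSE", "TRUE", "TRUE", "FALSE"))
positive = "TRUE"

# Move the positive class to the front and keep the other level after it.
new_levels = c(positive, setdiff(levels(target), positive))
reordered = factor(target, levels = new_levels)

levels(target)     # default (alphabetical) level order
levels(reordered)  # positive class first
```

Only the level order changes; the observed values stay the same.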

Convert to a Spatiotemporal Regression Task

Description

Convert an object to a TaskRegrST.

This is an S3 generic, specialized for at least the following objects:

  1. TaskRegrST: Ensure the identity.

  2. data.frame() and DataBackend: Provides an alternative to the constructor of TaskRegrST.

  3. sf::sf: Extracts spatial meta data before construction.

  4. TaskClassif: Calls convert_task().

Usage

as_task_regr_st(x, ...)

## S3 method for class 'TaskRegrST'
as_task_regr_st(x, clone = FALSE, ...)

## S3 method for class 'data.frame'
as_task_regr_st(
  x,
  target,
  id = deparse(substitute(x)),
  coordinate_names,
  crs = NA_character_,
  coords_as_features = FALSE,
  label = NA_character_,
  ...
)

## S3 method for class 'DataBackend'
as_task_regr_st(
  x,
  target,
  id = deparse(substitute(x)),
  positive = NULL,
  coordinate_names,
  crs,
  coords_as_features = FALSE,
  label = NA_character_,
  ...
)

## S3 method for class 'sf'
as_task_regr_st(
  x,
  target = NULL,
  id = deparse(substitute(x)),
  coords_as_features = FALSE,
  label = NA_character_,
  ...
)

## S3 method for class 'TaskClassifST'
as_task_regr_st(
  x,
  target = NULL,
  drop_original_target = FALSE,
  drop_levels = TRUE,
  ...
)

Arguments

x

(any)
Object to convert.

target

(character(1))
Name of the target column.

drop_original_target

(logical(1))
If FALSE (default), the original target is added as a feature. Otherwise the original target is dropped.

drop_levels

(logical(1))
If TRUE (default), unused levels of the new target variable are dropped.

...

(any)
Additional arguments.

clone

(logical(1))
If TRUE, ensures that the returned object is not the same as the input x.

id

(character(1))
Id for the new task. Defaults to the (deparsed and substituted) name of the data argument.

coordinate_names

(character(1))
The column names of the coordinates in the data.

crs

(character(1))
Coordinate reference system. WKT2 or EPSG string.

coords_as_features

(logical(1))
If TRUE, coordinates are used as features. This is a shortcut for task$set_col_roles(c("x", "y"), role = "feature") with the assumption that the coordinates in the data are named "x" and "y".

label

(character(1))
Label for the new instance. Shown in as.data.table(mlr_tasks).

positive

(character(1))
Only for binary classification: Name of the positive class. The levels of the target column are reordered accordingly, so that the first element of $class_names is the positive class and the second element is the negative class.

Value

TaskRegrST

Examples

if (mlr3misc::require_namespaces(c("sf"), quietly = TRUE)) {
  library("mlr3")
  data("cookfarm_mlr3", package = "mlr3spatiotempcv")

  # data.frame
  as_task_regr_st(cookfarm_mlr3, target = "PHIHOX",
    coords_as_features = FALSE, crs = 26911,
    coordinate_names = c("x", "y"))

  # sf
  cookfarm_sf = sf::st_as_sf(cookfarm_mlr3, coords = c("x", "y"), crs = 26911)
  as_task_regr_st(cookfarm_sf, target = "PHIHOX")
}
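
The default id = deparse(substitute(x)) derives the task id from the name of the object passed as x. The sketch below reproduces only that mechanism in base R; default_id() and the landslides data frame are hypothetical and not part of the package.

```r
# Hypothetical helper that mimics only the default-id mechanism.
default_id = function(x, id = deparse(substitute(x))) {
  id
}

# Hypothetical data; any object name works.
landslides = data.frame(slides = c(TRUE, FALSE))

default_id(landslides)                  # id taken from the argument name
default_id(landslides, id = "my_task")  # explicit id overrides the default
```

Passing an explicit id is the safer choice when x is an expression rather than a plain variable name, since the deparsed expression would otherwise become the id.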

Visualization Functions for Non-Spatial CV Methods.

Description

Generic S3 plot() and autoplot() (ggplot2) methods.

Usage

## S3 method for class 'ResamplingCustomCV'
autoplot(
  object,
  task,
  fold_id = NULL,
  plot_as_grid = TRUE,
  train_color = "#0072B5",
  test_color = "#E18727",
  sample_fold_n = NULL,
  ...
)

## S3 method for class 'ResamplingCustomCV'
plot(x, ...)

Arguments

object

⁠[Resampling]⁠
mlr3 spatial resampling object of class ResamplingCustomCV.

task

⁠[TaskClassifST]/[TaskRegrST]⁠
mlr3 task object.

fold_id

⁠[numeric]⁠
Fold IDs to plot.

plot_as_grid

⁠[logical(1)]⁠
Should a gridded plot be created via patchwork? If FALSE, a list of ggplot2 objects is returned. Only applies if a numeric vector is passed to argument fold_id.

train_color

⁠[character(1)]⁠
The color to use for the training set observations.

test_color

⁠[character(1)]⁠
The color to use for the test set observations.

sample_fold_n

⁠[integer]⁠
Number of points in a random sample stratified over partitions. This argument aims to keep file sizes of resulting plots reasonable and reduce overplotting in dense datasets.

...

Passed to geom_sf(). Helpful for adjusting point sizes and shapes.

x

⁠[Resampling]⁠
mlr3 spatial resampling object of class ResamplingCustomCV.

Examples

if (mlr3misc::require_namespaces(c("sf", "patchwork"), quietly = TRUE)) {
  library(mlr3)
  library(mlr3spatiotempcv)
  task = tsk("ecuador")
  breaks = quantile(task$data()$dem, seq(0, 1, length = 6))
  zclass = cut(task$data()$dem, breaks, include.lowest = TRUE)

  resampling = rsmp("custom_cv")
  resampling$instantiate(task, f = zclass)

  autoplot(resampling, task) +
    ggplot2::scale_x_continuous(breaks = seq(-79.085, -79.055, 0.01))
  autoplot(resampling, task, fold_id = 1)
  autoplot(resampling, task, fold_id = c(1, 2)) *
    ggplot2::scale_x_continuous(breaks = seq(-79.085, -79.055, 0.01))
}
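
The example above builds the fold-defining factor with quantile() and cut(). The following base-R sketch shows what such a factor implies for the splits, assuming each factor level forms the test set of one fold. The elevation values are simulated stand-ins, not the ecuador data.

```r
set.seed(1)
n = 50
elevation = runif(n, min = 1700, max = 3100)  # hypothetical values

# Five elevation classes with roughly equal numbers of observations
breaks = quantile(elevation, seq(0, 1, length.out = 6))
zclass = cut(elevation, breaks, include.lowest = TRUE)

# Each factor level defines the test set of one fold;
# the training set is everything else.
test_sets = split(seq_len(n), zclass)
train_sets = lapply(test_sets, function(idx) setdiff(seq_len(n), idx))

lengths(test_sets)
```

include.lowest = TRUE matters here: without it the observation with the minimum value would fall outside all classes and receive NA.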

Visualization Functions for Non-Spatial CV Methods.

Description

Generic S3 plot() and autoplot() (ggplot2) methods.

Usage

## S3 method for class 'ResamplingCV'
autoplot(
  object,
  task,
  fold_id = NULL,
  plot_as_grid = TRUE,
  train_color = "#0072B5",
  test_color = "#E18727",
  sample_fold_n = NULL,
  ...
)

## S3 method for class 'ResamplingRepeatedCV'
autoplot(
  object,
  task,
  fold_id = NULL,
  repeats_id = 1,
  plot_as_grid = TRUE,
  train_color = "#0072B5",
  test_color = "#E18727",
  sample_fold_n = NULL,
  ...
)

## S3 method for class 'ResamplingCV'
plot(x, ...)

## S3 method for class 'ResamplingRepeatedCV'
plot(x, ...)

Arguments

object

⁠[Resampling]⁠
mlr3 spatial resampling object of class ResamplingCV or ResamplingRepeatedCV.

task

⁠[TaskClassifST]/[TaskRegrST]⁠
mlr3 task object.

fold_id

⁠[numeric]⁠
Fold IDs to plot.

plot_as_grid

⁠[logical(1)]⁠
Should a gridded plot be created via patchwork? If FALSE, a list of ggplot2 objects is returned. Only applies if a numeric vector is passed to argument fold_id.

train_color

⁠[character(1)]⁠
The color to use for the training set observations.

test_color

⁠[character(1)]⁠
The color to use for the test set observations.

sample_fold_n

⁠[integer]⁠
Number of points in a random sample stratified over partitions. This argument aims to keep file sizes of resulting plots reasonable and reduce overplotting in dense datasets.

...

Passed to geom_sf(). Helpful for adjusting point sizes and shapes.

repeats_id

⁠[numeric]⁠
Repetition ID to plot.

x

⁠[Resampling]⁠
mlr3 spatial resampling object of class ResamplingCV or ResamplingRepeatedCV.

Examples

if (mlr3misc::require_namespaces(c("sf", "patchwork", "ggtext", "ggsci"), quietly = TRUE)) {
  library(mlr3)
  library(mlr3spatiotempcv)
  task = tsk("ecuador")
  resampling = rsmp("cv")
  resampling$instantiate(task)

  autoplot(resampling, task) +
    ggplot2::scale_x_continuous(breaks = seq(-79.085, -79.055, 0.01))
  autoplot(resampling, task, fold_id = 1)
  autoplot(resampling, task, fold_id = c(1, 2)) *
    ggplot2::scale_x_continuous(breaks = seq(-79.085, -79.055, 0.01))
}
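
The sample_fold_n argument is described as a random sample stratified over partitions. A rough base-R sketch of that idea follows. It assumes the number applies per fold, which this page does not state, and it uses a simulated fold assignment. The package's actual sampling code may differ.

```r
set.seed(42)

# Hypothetical fold assignment for 200 observations in 5 folds
fold = sample(rep(1:5, each = 40))
sample_fold_n = 10L  # assumed here to mean "points per fold"

# Draw at most sample_fold_n row indices from every fold
keep = unlist(lapply(split(seq_along(fold), fold), function(idx) {
  idx[sample.int(length(idx), min(sample_fold_n, length(idx)))]
}), use.names = FALSE)

table(fold[keep])
```

Sampling within each fold keeps every partition visible in the thinned plot, which a simple random sample over all rows would not guarantee.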

Visualization Functions for SpCV Block Methods.

Description

Generic S3 plot() and autoplot() (ggplot2) methods to visualize mlr3 spatiotemporal resampling objects.

Usage

## S3 method for class 'ResamplingSpCVBlock'
autoplot(
  object,
  task,
  fold_id = NULL,
  plot_as_grid = TRUE,
  train_color = "#0072B5",
  test_color = "#E18727",
  show_blocks = FALSE,
  show_labels = FALSE,
  sample_fold_n = NULL,
  label_size = 2,
  ...
)

## S3 method for class 'ResamplingRepeatedSpCVBlock'
autoplot(
  object,
  task,
  fold_id = NULL,
  repeats_id = 1,
  plot_as_grid = TRUE,
  train_color = "#0072B5",
  test_color = "#E18727",
  show_blocks = FALSE,
  show_labels = FALSE,
  sample_fold_n = NULL,
  label_size = 2,
  ...
)

## S3 method for class 'ResamplingSpCVBlock'
plot(x, ...)

## S3 method for class 'ResamplingRepeatedSpCVBlock'
plot(x, ...)

Arguments

object

⁠[Resampling]⁠
mlr3 spatial resampling object of class ResamplingSpCVBlock or ResamplingRepeatedSpCVBlock.

task

⁠[TaskClassifST]/[TaskRegrST]⁠
mlr3 task object.

fold_id

⁠[numeric]⁠
Fold IDs to plot.

plot_as_grid

⁠[logical(1)]⁠
Should a gridded plot be created via patchwork? If FALSE, a list of ggplot2 objects is returned. Only applies if a numeric vector is passed to argument fold_id.

train_color

⁠[character(1)]⁠
The color to use for the training set observations.

test_color

⁠[character(1)]⁠
The color to use for the test set observations.

show_blocks

⁠[logical(1)]⁠
Whether to show an overlay of the spatial blocks polygons.

show_labels

⁠[logical(1)]⁠
Whether to show an overlay of the spatial block IDs.

sample_fold_n

⁠[integer]⁠
Number of points in a random sample stratified over partitions. This argument aims to keep file sizes of resulting plots reasonable and reduce overplotting in dense datasets.

label_size

⁠[numeric(1)]⁠
Label size of block labels. Only applies for show_labels = TRUE.

...

Passed to geom_sf(). Helpful for adjusting point sizes and shapes.

repeats_id

⁠[numeric]⁠
Repetition ID to plot.

x

⁠[Resampling]⁠
mlr3 spatial resampling object of class ResamplingSpCVBlock or ResamplingRepeatedSpCVBlock.

Details

By default a plot is returned; if fold_id is set, a gridded plot is created. If plot_as_grid = FALSE, a list of plot objects is returned. This can be used to align the plots individually.

When no single fold is selected, the ggsci::scale_color_ucscgb() palette is used to display all partitions. If you want to change the colors, call <plot> + <color-palette>().

Value

ggplot() or list of ggplot2 objects.

Examples

if (mlr3misc::require_namespaces(c("sf", "blockCV"), quietly = TRUE)) {
  library(mlr3)
  library(mlr3spatiotempcv)
  task = tsk("ecuador")
  resampling = rsmp("spcv_block", range = 1000L)
  resampling$instantiate(task)

  ## list of ggplot2 objects, one per selected fold
  plot_list = autoplot(resampling, task,
    crs = 4326,
    fold_id = c(1, 2), plot_as_grid = FALSE)

  ## Visualize all partitions
  autoplot(resampling, task) +
    ggplot2::scale_x_continuous(breaks = seq(-79.085, -79.055, 0.01))

  ## Visualize the train/test split of a single fold
  autoplot(resampling, task, fold_id = 1) +
    ggplot2::scale_x_continuous(breaks = seq(-79.085, -79.055, 0.01))

  ## Visualize train/test splits of multiple folds
  autoplot(resampling, task,
    fold_id = c(1, 2),
    show_blocks = TRUE) *
    ggplot2::scale_x_continuous(breaks = seq(-79.085, -79.055, 0.01))
}
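
The blocks shown by show_blocks = TRUE come from the resampling itself. As a conceptual illustration only, the sketch below assigns simulated points to square blocks and then assigns whole blocks to folds at random. The names block_size and block_fold are hypothetical, and blockCV's real algorithm (block placement, selection strategy) is more involved.

```r
set.seed(1)

# Hypothetical projected coordinates in metres
pts = data.frame(x = runif(100, 0, 4000), y = runif(100, 0, 3000))
block_size = 1000  # plays the role of `range` in the example above
folds = 5

# Assign every point to a square block of side length block_size
block = interaction(floor(pts$x / block_size), floor(pts$y / block_size),
  drop = TRUE)

# Assign whole blocks (not single points) to folds at random
block_fold = sample(rep_len(seq_len(folds), nlevels(block)))
pts$fold = block_fold[as.integer(block)]

table(pts$fold)
```

Because folds are assigned per block, nearby points end up in the same fold, which is the property the block overlay in the plot makes visible.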

Visualization Functions for SpCV Buffer Methods.

Description

Generic S3 plot() and autoplot() (ggplot2) methods to visualize mlr3 spatiotemporal resampling objects.

Usage

## S3 method for class 'ResamplingSpCVBuffer'
autoplot(
  object,
  task,
  fold_id = NULL,
  plot_as_grid = TRUE,
  train_color = "#0072B5",
  test_color = "#E18727",
  show_omitted = FALSE,
  ...
)

## S3 method for class 'ResamplingSpCVBuffer'
plot(x, ...)

Arguments

object

⁠[Resampling]⁠
mlr3 spatial resampling object of class ResamplingSpCVBuffer.

task

⁠[TaskClassifST]/[TaskRegrST]⁠
mlr3 task object.

fold_id

⁠[numeric]⁠
Fold IDs to plot.

plot_as_grid

⁠[logical(1)]⁠
Should a gridded plot be created via patchwork? If FALSE, a list of ggplot2 objects is returned. Only applies if a numeric vector is passed to argument fold_id.

train_color

⁠[character(1)]⁠
The color to use for the training set observations.

test_color

⁠[character(1)]⁠
The color to use for the test set observations.

show_omitted

⁠[logical]⁠
Whether to show points not used in train or test set for the current fold.

...

Passed to geom_sf(). Helpful for adjusting point sizes and shapes.

x

⁠[Resampling]⁠
mlr3 spatial resampling object of class ResamplingSpCVBuffer.

Examples

if (mlr3misc::require_namespaces(c("sf", "blockCV"), quietly = TRUE)) {
  library(mlr3)
  library(mlr3spatiotempcv)
  task = tsk("ecuador")
  resampling = rsmp("spcv_buffer", theRange = 1000)
  resampling$instantiate(task)

  ## single fold
  autoplot(resampling, task, fold_id = 1) +
    ggplot2::scale_x_continuous(breaks = seq(-79.085, -79.055, 0.01))

  ## multiple folds
  autoplot(resampling, task, fold_id = c(1, 2)) *
    ggplot2::scale_x_continuous(breaks = seq(-79.085, -79.055, 0.01))
}
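
The show_omitted argument exists because buffering leaves some points in neither set. The base-R sketch below illustrates this under the assumption that each fold tests a single point and drops every other point within the buffer distance. The coordinates are simulated and the_range is a hypothetical stand-in for theRange.

```r
set.seed(1)

# Hypothetical projected coordinates in metres
pts = cbind(x = runif(30, 0, 5000), y = runif(30, 0, 5000))
the_range = 1000

# Pairwise Euclidean distances between all points
d = as.matrix(dist(pts))

i = 1                                              # test point of this fold
omitted = setdiff(which(d[i, ] <= the_range), i)   # inside the buffer: unused
train = which(d[i, ] > the_range)                  # outside the buffer

c(test = 1, omitted = length(omitted), train = length(train))
```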

Visualization Functions for SpCV Coords Methods.

Description

Generic S3 plot() and autoplot() (ggplot2) methods.

Usage

## S3 method for class 'ResamplingSpCVCoords'
autoplot(
  object,
  task,
  fold_id = NULL,
  plot_as_grid = TRUE,
  train_color = "#0072B5",
  test_color = "#E18727",
  sample_fold_n = NULL,
  ...
)

## S3 method for class 'ResamplingRepeatedSpCVCoords'
autoplot(
  object,
  task,
  fold_id = NULL,
  repeats_id = 1,
  plot_as_grid = TRUE,
  train_color = "#0072B5",
  test_color = "#E18727",
  sample_fold_n = NULL,
  ...
)

## S3 method for class 'ResamplingSpCVCoords'
plot(x, ...)

## S3 method for class 'ResamplingRepeatedSpCVCoords'
plot(x, ...)

Arguments

object

⁠[Resampling]⁠
mlr3 spatial resampling object of class ResamplingSpCVCoords or ResamplingRepeatedSpCVCoords.

task

⁠[TaskClassifST]/[TaskRegrST]⁠
mlr3 task object.

fold_id

⁠[numeric]⁠
Fold IDs to plot.

plot_as_grid

⁠[logical(1)]⁠
Should a gridded plot be created via patchwork? If FALSE, a list of ggplot2 objects is returned. Only applies if a numeric vector is passed to argument fold_id.

train_color

⁠[character(1)]⁠
The color to use for the training set observations.

test_color

⁠[character(1)]⁠
The color to use for the test set observations.

sample_fold_n

⁠[integer]⁠
Number of points in a random sample stratified over partitions. This argument aims to keep file sizes of resulting plots reasonable and reduce overplotting in dense datasets.

...

Passed to geom_sf(). Helpful for adjusting point sizes and shapes.

repeats_id

⁠[numeric]⁠
Repetition ID to plot.

x

⁠[Resampling]⁠
mlr3 spatial resampling object of class ResamplingSpCVCoords or ResamplingRepeatedSpCVCoords.

Examples

if (mlr3misc::require_namespaces(c("sf"), quietly = TRUE)) {
  library(mlr3)
  library(mlr3spatiotempcv)
  task = tsk("ecuador")
  resampling = rsmp("spcv_coords")
  resampling$instantiate(task)

  autoplot(resampling, task) +
    ggplot2::scale_x_continuous(breaks = seq(-79.085, -79.055, 0.01))
  autoplot(resampling, task, fold_id = 1)
  autoplot(resampling, task, fold_id = c(1, 2)) *
    ggplot2::scale_x_continuous(breaks = seq(-79.085, -79.055, 0.01))
}
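
The partitions drawn by these plots are groups of neighbouring points. A minimal sketch of how such groups can arise, assuming the coordinate-based method clusters observations by k-means on their coordinates (an assumption; this page does not state the algorithm). The coordinates are simulated.

```r
set.seed(1)

# Hypothetical coordinates
coords = cbind(x = runif(100), y = runif(100))
folds = 5

# Cluster the coordinates; each cluster becomes the test set of one fold
cl = kmeans(coords, centers = folds)
fold = cl$cluster

test_sets = split(seq_len(nrow(coords)), fold)
lengths(test_sets)
```

Unlike random cross-validation, fold sizes are generally unequal here because cluster sizes depend on the spatial distribution of the points.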

Visualization Functions for SpCV Disc Method.

Description

Generic S3 plot() and autoplot() (ggplot2) methods to visualize mlr3 spatiotemporal resampling objects.

Usage

## S3 method for class 'ResamplingSpCVDisc'
autoplot(
  object,
  task,
  fold_id = NULL,
  plot_as_grid = TRUE,
  train_color = "#0072B5",
  test_color = "#E18727",
  repeats_id = NULL,
  show_omitted = FALSE,
  sample_fold_n = NULL,
  ...
)

## S3 method for class 'ResamplingRepeatedSpCVDisc'
autoplot(
  object,
  task,
  fold_id = NULL,
  repeats_id = 1,
  plot_as_grid = TRUE,
  train_color = "#0072B5",
  test_color = "#E18727",
  show_omitted = FALSE,
  sample_fold_n = NULL,
  ...
)

## S3 method for class 'ResamplingSpCVDisc'
plot(x, ...)

## S3 method for class 'ResamplingRepeatedSpCVDisc'
plot(x, ...)

Arguments

object

⁠[Resampling]⁠
mlr3 spatial resampling object of class ResamplingSpCVDisc or ResamplingRepeatedSpCVDisc.

task

⁠[TaskClassifST]/[TaskRegrST]⁠
mlr3 task object.

fold_id

⁠[numeric]⁠
Fold IDs to plot.

plot_as_grid

⁠[logical(1)]⁠
Should a gridded plot be created via patchwork? If FALSE, a list of ggplot2 objects is returned. Only applies if a numeric vector is passed to argument fold_id.

train_color

⁠[character(1)]⁠
The color to use for the training set observations.

test_color

⁠[character(1)]⁠
The color to use for the test set observations.

repeats_id

⁠[numeric]⁠
Repetition ID to plot.

show_omitted

⁠[logical]⁠
Whether to show points not used in train or test set for the current fold.

sample_fold_n

⁠[integer]⁠
Number of points in a random sample stratified over partitions. This argument aims to keep file sizes of resulting plots reasonable and reduce overplotting in dense datasets.

...

Passed to geom_sf(). Helpful for adjusting point sizes and shapes.

x

⁠[Resampling]⁠
mlr3 spatial resampling object of class ResamplingSpCVDisc or ResamplingRepeatedSpCVDisc.

Details

This method requires the fold_id argument; a single plot containing all partitions cannot be created. The method uses only a subset of the observations (many are left out), so the train and test sets of one fold are not reused in other folds as they are in other methods, and plotting them without a train/test indicator would not make sense.

2D vs 3D plotting

This method has both a 2D and a 3D plotting method. The 2D method returns a ggplot with x and y axes representing the spatial coordinates. The 3D method uses plotly to create an interactive 3D plot. Set plot3D = TRUE to use the 3D method.

Note that spatiotemporal datasets usually suffer from overplotting in 2D mode.

Examples

if (mlr3misc::require_namespaces("sf", quietly = TRUE)) {
  library(mlr3)
  library(mlr3spatiotempcv)
  task = tsk("ecuador")
  resampling = rsmp("spcv_disc",
    folds = 5, radius = 200L, buffer = 200L)
  resampling$instantiate(task)

  autoplot(resampling, task,
    fold_id = 1,
    show_omitted = TRUE, size = 0.7) *
    ggplot2::scale_x_continuous(breaks = seq(-79.085, -79.055, 0.01))
}
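
The radius and buffer parameters used above determine which points are tested, trained on, or omitted. The base-R sketch below shows one plausible reading: points within radius of a disc center form the test set, points farther than radius + buffer form the training set, and the ring in between is omitted. The coordinates and the choice of center are hypothetical, and the package's exact rule may differ.

```r
set.seed(1)

# Hypothetical projected coordinates in metres
pts = cbind(x = runif(200, 0, 2000), y = runif(200, 0, 2000))
radius = 200
buffer = 200

center = 1  # index of a hypothetical disc center
d = sqrt((pts[, "x"] - pts[center, "x"])^2 + (pts[, "y"] - pts[center, "y"])^2)

test = which(d <= radius)                           # inside the disc
omitted = which(d > radius & d <= radius + buffer)  # buffer ring, unused
train = which(d > radius + buffer)                  # everything farther away

c(test = length(test), omitted = length(omitted), train = length(train))
```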

Visualization Functions for SpCV Env Methods.

Description

Generic S3 plot() and autoplot() (ggplot2) methods.

Usage

## S3 method for class 'ResamplingSpCVEnv'
autoplot(
  object,
  task,
  fold_id = NULL,
  plot_as_grid = TRUE,
  train_color = "#0072B5",
  test_color = "#E18727",
  sample_fold_n = NULL,
  ...
)

## S3 method for class 'ResamplingRepeatedSpCVEnv'
autoplot(
  object,
  task,
  fold_id = NULL,
  repeats_id = 1,
  plot_as_grid = TRUE,
  train_color = "#0072B5",
  test_color = "#E18727",
  sample_fold_n = NULL,
  ...
)

## S3 method for class 'ResamplingSpCVEnv'
plot(x, ...)

## S3 method for class 'ResamplingRepeatedSpCVEnv'
plot(x, ...)

Arguments

object

⁠[Resampling]⁠
mlr3 spatial resampling object of class ResamplingSpCVEnv or ResamplingRepeatedSpCVEnv.

task

⁠[TaskClassifST]/[TaskRegrST]⁠
mlr3 task object.

fold_id

⁠[numeric]⁠
Fold IDs to plot.

plot_as_grid

⁠[logical(1)]⁠
Should a gridded plot be created via patchwork? If FALSE, a list of ggplot2 objects is returned. Only applies if a numeric vector is passed to argument fold_id.

train_color

⁠[character(1)]⁠
The color to use for the training set observations.

test_color

⁠[character(1)]⁠
The color to use for the test set observations.

sample_fold_n

⁠[integer]⁠
Number of points in a random sample stratified over partitions. This argument aims to keep file sizes of resulting plots reasonable and reduce overplotting in dense datasets.

...

Passed to geom_sf(). Helpful for adjusting point sizes and shapes.

repeats_id

⁠[numeric]⁠
Repetition ID to plot.

x

⁠[Resampling]⁠
mlr3 spatial resampling object of class ResamplingSpCVEnv or ResamplingRepeatedSpCVEnv.

Examples

if (mlr3misc::require_namespaces(c("sf", "blockCV"), quietly = TRUE)) {
  library(mlr3)
  library(mlr3spatiotempcv)
  task = tsk("ecuador")
  resampling = rsmp("spcv_env", folds = 4, features = "dem")
  resampling$instantiate(task)

  autoplot(resampling, task) +
    ggplot2::scale_x_continuous(breaks = seq(-79.085, -79.055, 0.01))
  autoplot(resampling, task, fold_id = 1)
  autoplot(resampling, task, fold_id = c(1, 2)) *
    ggplot2::scale_x_continuous(breaks = seq(-79.085, -79.055, 0.01))
}

Visualization Functions for SpCV knndm Method.

Description

Generic S3 plot() and autoplot() (ggplot2) methods to visualize mlr3 spatiotemporal resampling objects.

Usage

## S3 method for class 'ResamplingSpCVKnndm'
autoplot(
  object,
  task,
  fold_id = NULL,
  plot_as_grid = TRUE,
  train_color = "#0072B5",
  test_color = "#E18727",
  repeats_id = NULL,
  sample_fold_n = NULL,
  ...
)

## S3 method for class 'ResamplingRepeatedSpCVKnndm'
autoplot(
  object,
  task,
  fold_id = NULL,
  repeats_id = 1,
  plot_as_grid = TRUE,
  train_color = "#0072B5",
  test_color = "#E18727",
  sample_fold_n = NULL,
  ...
)

## S3 method for class 'ResamplingSpCVKnndm'
plot(x, ...)

## S3 method for class 'ResamplingRepeatedSpCVKnndm'
plot(x, ...)

Arguments

object

⁠[Resampling]⁠
mlr3 spatial resampling object of class ResamplingSpCVKnndm or ResamplingRepeatedSpCVKnndm.

task

⁠[TaskClassifST]/[TaskRegrST]⁠
mlr3 task object.

fold_id

⁠[numeric]⁠
Fold IDs to plot.

plot_as_grid

⁠[logical(1)]⁠
Should a gridded plot be created via patchwork? If FALSE, a list of ggplot2 objects is returned. Only applies if a numeric vector is passed to argument fold_id.

train_color

⁠[character(1)]⁠
The color to use for the training set observations.

test_color

⁠[character(1)]⁠
The color to use for the test set observations.

repeats_id

⁠[numeric]⁠
Repetition ID to plot.

sample_fold_n

⁠[integer]⁠
Number of points in a random sample stratified over partitions. This argument aims to keep file sizes of resulting plots reasonable and reduce overplotting in dense datasets.

...

Passed to geom_sf(). Helpful for adjusting point sizes and shapes.

x

⁠[Resampling]⁠
mlr3 spatial resampling object of class ResamplingSpCVKnndm or ResamplingRepeatedSpCVKnndm.

Details

This method requires the fold_id argument; a single plot containing all partitions cannot be created. The method uses only a subset of the observations (many are left out), so the train and test sets of one fold are not reused in other folds as they are in other methods, and plotting them without a train/test indicator would not make sense.

2D vs 3D plotting

This method has both a 2D and a 3D plotting method. The 2D method returns a ggplot with x and y axes representing the spatial coordinates. The 3D method uses plotly to create an interactive 3D plot. Set plot3D = TRUE to use the 3D method.

Note that spatiotemporal datasets usually suffer from overplotting in 2D mode.

Examples

if (mlr3misc::require_namespaces(c("CAST", "sf"), quietly = TRUE)) {
  library(mlr3)
  library(mlr3spatiotempcv)
  task = tsk("ecuador")
  points = sf::st_as_sf(task$coordinates(), crs = task$crs, coords = c("x", "y"))
  modeldomain = sf::st_as_sfc(sf::st_bbox(points))

  resampling = rsmp("spcv_knndm",
    folds = 5, modeldomain = modeldomain)
  resampling$instantiate(task)

  autoplot(resampling, task,
    fold_id = 1, size = 0.7) *
    ggplot2::scale_x_continuous(breaks = seq(-79.085, -79.055, 0.01))
}

Visualization Functions for SpCV Tiles Method.

Description

Generic S3 plot() and autoplot() (ggplot2) methods to visualize mlr3 spatiotemporal resampling objects.

Usage

## S3 method for class 'ResamplingSpCVTiles'
autoplot(
  object,
  task,
  fold_id = NULL,
  plot_as_grid = TRUE,
  train_color = "#0072B5",
  test_color = "#E18727",
  repeats_id = NULL,
  show_omitted = FALSE,
  sample_fold_n = NULL,
  ...
)

## S3 method for class 'ResamplingRepeatedSpCVTiles'
autoplot(
  object,
  task,
  fold_id = NULL,
  repeats_id = 1,
  plot_as_grid = TRUE,
  train_color = "#0072B5",
  test_color = "#E18727",
  show_omitted = FALSE,
  sample_fold_n = NULL,
  ...
)

## S3 method for class 'ResamplingSpCVTiles'
plot(x, ...)

## S3 method for class 'ResamplingRepeatedSpCVTiles'
plot(x, ...)

Arguments

object

⁠[Resampling]⁠
mlr3 spatial resampling object of class ResamplingSpCVTiles or ResamplingRepeatedSpCVTiles.

task

⁠[TaskClassifST]/[TaskRegrST]⁠
mlr3 task object.

fold_id

⁠[numeric]⁠
Fold IDs to plot.

plot_as_grid

⁠[logical(1)]⁠
Should a gridded plot be created via patchwork? If FALSE, a list of ggplot2 objects is returned. Only applies if a numeric vector is passed to argument fold_id.

train_color

⁠[character(1)]⁠
The color to use for the training set observations.

test_color

⁠[character(1)]⁠
The color to use for the test set observations.

repeats_id

⁠[numeric]⁠
Repetition ID to plot.

show_omitted

⁠[logical]⁠
Whether to show points not used in train or test set for the current fold.

sample_fold_n

⁠[integer]⁠
Number of points in a random sample stratified over partitions. This argument aims to keep file sizes of resulting plots reasonable and reduce overplotting in dense datasets.

...

Passed to geom_sf(). Helpful for adjusting point sizes and shapes.

x

⁠[Resampling]⁠
mlr3 spatial resampling object of class ResamplingSpCVTiles or ResamplingRepeatedSpCVTiles.

Details

Specific combinations of arguments of "spcv_tiles" remove some observations, hence show_omitted has an effect in some cases.

Examples

if (mlr3misc::require_namespaces(c("sf", "sperrorest"), quietly = TRUE)) {
  library(mlr3)
  library(mlr3spatiotempcv)
  task = tsk("ecuador")
  resampling = rsmp("spcv_tiles",
    nsplit = c(4L, 3L), reassign = FALSE)
  resampling$instantiate(task)

  autoplot(resampling, task,
    fold_id = 1,
    show_omitted = TRUE, size = 0.7) *
    ggplot2::scale_x_continuous(breaks = seq(-79.085, -79.055, 0.01))
}
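
The nsplit = c(4L, 3L) setting above requests a grid of tiles. The sketch below illustrates the idea in base R by cutting simulated coordinates into 4 intervals along x and 3 along y. That the first element refers to x and the second to y is an assumption here, and the package applies further rules (such as reassigning or removing small tiles) that are not reproduced.

```r
set.seed(1)

# Hypothetical coordinates
pts = data.frame(x = runif(200), y = runif(200))
nsplit = c(4L, 3L)  # assumed order: splits along x, then along y

# Equal-width intervals along each axis, returned as integer codes
tile_x = cut(pts$x, nsplit[1], labels = FALSE)
tile_y = cut(pts$y, nsplit[2], labels = FALSE)

# One tile per (x interval, y interval) combination that contains points
tile = interaction(tile_x, tile_y, drop = TRUE)
table(tile)
```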

Visualization Functions for SptCV Cstf Methods.

Description

Generic S3 plot() and autoplot() (ggplot2) methods to visualize mlr3 spatiotemporal resampling objects.

Usage

## S3 method for class 'ResamplingSptCVCstf'
autoplot(
  object,
  task,
  fold_id = NULL,
  plot_as_grid = TRUE,
  train_color = "#0072B5",
  test_color = "#E18727",
  repeats_id = NULL,
  tickformat_date = "%Y-%m",
  nticks_x = 3,
  nticks_y = 3,
  point_size = 3,
  axis_label_fontsize = 11,
  static_image = FALSE,
  show_omitted = FALSE,
  plot3D = NULL,
  plot_time_var = NULL,
  sample_fold_n = NULL,
  ...
)

## S3 method for class 'ResamplingRepeatedSptCVCstf'
autoplot(
  object,
  task,
  fold_id = NULL,
  repeats_id = 1,
  plot_as_grid = TRUE,
  train_color = "#0072B5",
  test_color = "#E18727",
  tickformat_date = "%Y-%m",
  nticks_x = 3,
  nticks_y = 3,
  point_size = 3,
  axis_label_fontsize = 11,
  plot3D = NULL,
  plot_time_var = NULL,
  ...
)

## S3 method for class 'ResamplingSptCVCstf'
plot(x, ...)

## S3 method for class 'ResamplingRepeatedSptCVCstf'
plot(x, ...)

Arguments

object

⁠[Resampling]⁠
mlr3 spatial resampling object of class ResamplingSptCVCstf or ResamplingRepeatedSptCVCstf.

task

⁠[TaskClassifST]/[TaskRegrST]⁠
mlr3 task object.

fold_id

⁠[numeric]⁠
Fold IDs to plot.

plot_as_grid

⁠[logical(1)]⁠
Should a gridded plot be created via patchwork? If FALSE, a list of ggplot2 objects is returned. Only applies if a numeric vector is passed to argument fold_id.

train_color

⁠[character(1)]⁠
The color to use for the training set observations.

test_color

⁠[character(1)]⁠
The color to use for the test set observations.

repeats_id

⁠[numeric]⁠
Repetition ID to plot.

tickformat_date

⁠[character]⁠
Date format for z-axis.

nticks_x

⁠[integer]⁠
Number of x axis breaks.

nticks_y

⁠[integer]⁠
Number of y axis breaks.

point_size

⁠[numeric]⁠
Point size of markers.

axis_label_fontsize

⁠[integer]⁠
Font size of axis labels.

static_image

⁠[logical]⁠
Whether to create a static image from the plotly plot via plotly::orca(). This requires the orca utility to be available. See https://github.com/plotly/orca for more information. When used, by default a file named plot.png is created in the current working directory.

show_omitted

⁠[logical]⁠
Whether to show points not used in train or test set for the current fold.

plot3D

⁠[logical]⁠
Whether to create a 2D image via ggplot2 or a 3D plot via plotly.

plot_time_var

⁠[character]⁠
The variable to use for the z-axis (time). Remove the column role feature for this variable to only use it for plotting.

sample_fold_n

⁠[integer]⁠
Number of points in a random sample stratified over partitions. This argument aims to keep file sizes of resulting plots reasonable and reduce overplotting in dense datasets.

...

Passed down to plotly::orca(). Only effective when static_image = TRUE.

x

⁠[Resampling]⁠
mlr3 spatial resampling object of class ResamplingSptCVCstf or ResamplingRepeatedSptCVCstf.

Details

This method requires the fold_id argument to be set. A single plot showing all folds cannot be created because the LLTO method uses only a subset of the observations in each fold (many observations are omitted). Hence, the train and test sets of one fold are not reused in other folds as in other methods, and plotting them without a train/test indicator would be misleading.

2D vs 3D plotting

This method has both a 2D and a 3D plotting method. The 2D method returns a ggplot with x and y axes representing the spatial coordinates. The 3D method uses plotly to create an interactive 3D plot. Set plot3D = TRUE to use the 3D method.

Note that spatiotemporal datasets usually suffer from overplotting in 2D mode.

Examples

if (mlr3misc::require_namespaces(c("sf", "plotly"), quietly = TRUE)) {
  library(mlr3)
  library(mlr3spatiotempcv)
  task_st = tsk("cookfarm_mlr3")
  task_st$set_col_roles("SOURCEID", "space")
  task_st$set_col_roles("Date", "time")
  resampling = rsmp("sptcv_cstf", folds = 5)
  resampling$instantiate(task_st)

  # with both `"space"` and `"time"` column roles set (LLTO), the omitted
  # observations per fold can be shown by setting `show_omitted = TRUE`
  autoplot(resampling, task_st, fold_id = 1, show_omitted = TRUE)
}

(blockCV) Repeated spatial block resampling

Description

This function creates spatially separated folds based on a distance or a number of rows and/or columns. It assigns blocks to the training and testing folds randomly, systematically or in a checkerboard pattern. The distance (size) should be in metres, regardless of the unit of the reference system of the input data (for more information see the details section). By default, the function creates blocks according to the extent and shape of the spatial sample data (x, e.g. the species occurrence). Alternatively, blocks can be created based on r, assuming that the user has considered the landscape for the given species and case study. Blocks can also be offset so the origin is not at the outer corner of the rasters. Instead of providing a distance, the blocks can also be created by specifying a number of rows and/or columns that divide the study area into vertical or horizontal bins, as presented in Wenger & Olden (2012) and Bahn & McGill (2012). Finally, the blocks can be specified by a user-defined spatial polygon layer.

Details

To maintain consistency, all functions in this package use meters as their unit of measurement. However, when the input map has a geographic coordinate system (in decimal degrees), the block size is calculated by dividing the size parameter by deg_to_metre (which defaults to 111325 meters, the standard distance of one degree of latitude on the Equator). In reality, this value varies by a factor of the cosine of the latitude. So, an alternative sensible value could be cos(mean(sf::st_bbox(x)[c(2,4)]) * pi/180) * 111325.
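As a rough illustration of the conversion described above, the following base R sketch compares the block size in degrees under the default deg_to_metre with the latitude-adjusted alternative. The block size and the bounding-box latitudes are made-up values, not taken from any dataset in this package.

```r
# Hypothetical inputs, for illustration only
size_m = 5000                 # desired block size in metres
deg_to_metre = 111325         # default value stated above
lat_min = -4.1                # assumed ymin of the bounding box (degrees)
lat_max = -3.9                # assumed ymax of the bounding box (degrees)

# Block size in degrees using the default conversion factor
size_deg_default = size_m / deg_to_metre

# Latitude-adjusted conversion factor, following the formula above
deg_to_metre_lat = cos(mean(c(lat_min, lat_max)) * pi / 180) * 111325
size_deg_lat = size_m / deg_to_metre_lat

c(default = size_deg_default, adjusted = size_deg_lat)
```

Near the Equator the two values are almost identical; the difference grows with increasing absolute latitude.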

The offset can be used to change the spatial position of the blocks. It can also be used to assess the sensitivity of analysis results to shifting in the blocking arrangements. These options are available when size is defined. By default the region is located in the middle of the blocks and by setting the offsets, the blocks will shift.

Roberts et al. (2017) suggest that blocks should be substantially bigger than the range of spatial autocorrelation (in the model residuals) to obtain realistic error estimates, while a buffer with the size of the spatial autocorrelation range would result in a good estimation of error. This is because of the so-called edge effect (O'Sullivan & Unwin, 2014), whereby points located on the edges of the blocks of opposite sets are not separated spatially. Blocking with a buffering strategy overcomes this issue (see cv_buffer).

mlr3spatiotempcv notes

By default blockCV::cv_spatial() does not allow the creation of multiple repetitions. mlr3spatiotempcv adds support for this when using the size argument for fold creation. When supplying a vector of length(repeats) for argument size, these different settings will be used to create folds which differ among the repetitions.

Multiple repetitions are not possible when using the "row & cols" approach because the created folds will always be the same.

The 'Description' and 'Details' fields are inherited from the respective upstream function.

For a list of available arguments, please see blockCV::cv_spatial.

blockCV >= 3.0.0 changed the argument names of the implementation. For backward compatibility, mlr3spatiotempcv is still using the old ones. Here's a list which shows the mapping between blockCV < 3.0.0 and blockCV >= 3.0.0:

  • range -> size

  • rasterLayer -> r

  • speciesData -> points

  • showBlocks -> plot

  • cols and rows -> rows_cols

The default of argument hexagon is different in mlr3spatiotempcv (FALSE instead of TRUE) to create square blocks instead of hexagonal blocks by default.

Parameters

  • repeats (integer(1))
    Number of repeats.

Super class

mlr3::Resampling -> ResamplingRepeatedSpCVBlock

Public fields

blocks

⁠sf | list of sf objects⁠
Polygons (sf objects) as returned by blockCV which grouped observations into partitions.

Active bindings

iters

integer(1)
Returns the number of resampling iterations, depending on the values stored in the param_set.

Methods

Public methods

Inherited methods

Method new()

Create a "spatial block" repeated resampling instance.

For a list of available arguments, please see blockCV::cv_spatial.

Usage
ResamplingRepeatedSpCVBlock$new(id = "repeated_spcv_block")
Arguments
id

character(1)
Identifier for the resampling strategy.


Method folds()

Translates iteration numbers to fold number.

Usage
ResamplingRepeatedSpCVBlock$folds(iters)
Arguments
iters

integer()
Iteration number.


Method repeats()

Translates iteration numbers to repetition number.

Usage
ResamplingRepeatedSpCVBlock$repeats(iters)
Arguments
iters

integer()
Iteration number.
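
The translation performed by folds() and repeats() can be sketched in base R. The sketch assumes that iterations are numbered fold by fold within each repetition, which is the convention used by repeated cross-validation in mlr3; that this class follows it exactly is an assumption here, and the variable names are illustrative only.

```r
# Assumed settings: 3 folds, 2 repetitions -> 6 iterations
n_folds = 3L
n_repeats = 2L
iters = seq_len(n_folds * n_repeats)

# Fold number cycles 1..n_folds within each repetition
fold_of = ((iters - 1L) %% n_folds) + 1L

# Repetition number increases after every n_folds iterations
repeat_of = ((iters - 1L) %/% n_folds) + 1L

data.frame(iter = iters, fold = fold_of, rep = repeat_of)
```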


Method instantiate()

Materializes fixed training and test splits for a given task.

Usage
ResamplingRepeatedSpCVBlock$instantiate(task)
Arguments
task

Task
A task to instantiate.


Method clone()

The objects of this class are cloneable with this method.

Usage
ResamplingRepeatedSpCVBlock$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

References

Valavi R, Elith J, Lahoz-Monfort JJ, Guillera-Arroita G (2018). “blockCV: an R package for generating spatially or environmentally separated folds for k-fold cross-validation of species distribution models.” bioRxiv. doi:10.1101/357798.

Examples

## Not run: 
if (mlr3misc::require_namespaces(c("sf", "blockCV"), quietly = TRUE)) {
  library(mlr3)
  task = tsk("diplodia")

  # Instantiate Resampling
  rrcv = rsmp("repeated_spcv_block",
    folds = 3, repeats = 2,
    range = c(5000L, 10000L))
  rrcv$instantiate(task)

  # Iterations and their fold/repetition mapping:
  rrcv$iters
  rrcv$folds(1:6)
  rrcv$repeats(1:6)

  # Individual sets:
  rrcv$train_set(1)
  rrcv$test_set(1)
  intersect(rrcv$train_set(1), rrcv$test_set(1))

  # Internal storage:
  rrcv$instance # table
}

## End(Not run)

(sperrorest) Repeated coordinate-based k-means clustering

Description

Splits data by clustering in the coordinate space. See the upstream implementation at sperrorest::partition_kmeans() and Brenning (2012) for further information.

Details

Universal partitioning method that splits the data in the coordinate space. Useful for spatially homogeneous datasets that cannot be split well with rectangular approaches like ResamplingSpCVBlock.
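
The idea of partitioning in the coordinate space can be sketched with stats::kmeans() on simulated coordinates. This is a simplified illustration, not the sperrorest implementation, which may differ in details; the data and object names are made up.

```r
set.seed(1)

# Simulated point coordinates (hypothetical data)
coords = data.frame(x = runif(60), y = runif(60))

# Cluster the coordinates; each cluster acts as one fold
k = 3
km = stats::kmeans(coords, centers = k, nstart = 5)
fold = km$cluster

# Fold 1 as test set, all remaining points as training set
test_1 = which(fold == 1)
train_1 = which(fold != 1)

table(fold)
```

Because every point belongs to exactly one cluster, train and test sets of a fold never overlap, and points in a test set are spatially contiguous.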

Parameters

  • folds (integer(1))
    Number of folds.

  • repeats (integer(1))
    Number of repeats.

Super class

mlr3::Resampling -> ResamplingRepeatedSpCVCoords

Active bindings

iters

integer(1)
Returns the number of resampling iterations, depending on the values stored in the param_set.

Methods

Public methods

Inherited methods

Method new()

Create a "coordinate-based" repeated resampling instance.

For a list of available arguments, please see sperrorest::partition_kmeans.

Usage
ResamplingRepeatedSpCVCoords$new(id = "repeated_spcv_coords")
Arguments
id

character(1)
Identifier for the resampling strategy.


Method folds()

Translates iteration numbers to fold number.

Usage
ResamplingRepeatedSpCVCoords$folds(iters)
Arguments
iters

integer()
Iteration number.


Method repeats()

Translates iteration numbers to repetition number.

Usage
ResamplingRepeatedSpCVCoords$repeats(iters)
Arguments
iters

integer()
Iteration number.


Method instantiate()

Materializes fixed training and test splits for a given task.

Usage
ResamplingRepeatedSpCVCoords$instantiate(task)
Arguments
task

Task
A task to instantiate.


Method clone()

The objects of this class are cloneable with this method.

Usage
ResamplingRepeatedSpCVCoords$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

References

Brenning A (2012). “Spatial cross-validation and bootstrap for the assessment of prediction rules in remote sensing: The R package sperrorest.” In 2012 IEEE International Geoscience and Remote Sensing Symposium. doi:10.1109/igarss.2012.6352393.

Examples

library(mlr3)
task = tsk("diplodia")

# Instantiate Resampling
rrcv = rsmp("repeated_spcv_coords", folds = 3, repeats = 5)
rrcv$instantiate(task)

# Iterations and their fold/repetition mapping:
rrcv$iters
rrcv$folds(1:6)
rrcv$repeats(1:6)

# Individual sets:
rrcv$train_set(1)
rrcv$test_set(1)
intersect(rrcv$train_set(1), rrcv$test_set(1))

# Internal storage:
rrcv$instance # table

(sperrorest) Repeated spatial "disc" resampling

Description

(sperrorest) Repeated spatial "disc" resampling

Parameters

  • repeats (integer(1))
    Number of repeats.

Super class

mlr3::Resampling -> ResamplingRepeatedSpCVDisc

Active bindings

iters

integer(1)
Returns the number of resampling iterations, depending on the values stored in the param_set.

Methods

Public methods

Inherited methods

Method new()

Create a "Spatial 'Disc' resampling" resampling instance.

For a list of available arguments, please see sperrorest::partition_disc.

Usage
ResamplingRepeatedSpCVDisc$new(id = "repeated_spcv_disc")
Arguments
id

character(1)
Identifier for the resampling strategy.


Method folds()

Translates iteration numbers to fold number.

Usage
ResamplingRepeatedSpCVDisc$folds(iters)
Arguments
iters

integer()
Iteration number.


Method repeats()

Translates iteration numbers to repetition number.

Usage
ResamplingRepeatedSpCVDisc$repeats(iters)
Arguments
iters

integer()
Iteration number.


Method instantiate()

Materializes fixed training and test splits for a given task.

Usage
ResamplingRepeatedSpCVDisc$instantiate(task)
Arguments
task

Task
A task to instantiate.


Method clone()

The objects of this class are cloneable with this method.

Usage
ResamplingRepeatedSpCVDisc$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

References

Brenning A (2012). “Spatial cross-validation and bootstrap for the assessment of prediction rules in remote sensing: The R package sperrorest.” In 2012 IEEE International Geoscience and Remote Sensing Symposium. doi:10.1109/igarss.2012.6352393.

Examples

library(mlr3)
task = tsk("ecuador")

# Instantiate Resampling
rrcv = rsmp("repeated_spcv_disc",
  folds = 3L, repeats = 2,
  radius = 200L, buffer = 200L)
rrcv$instantiate(task)

# Iterations and their fold/repetition mapping:
rrcv$iters
rrcv$folds(1:6)
rrcv$repeats(1:6)

# Individual sets:
rrcv$train_set(1)
rrcv$test_set(1)
intersect(rrcv$train_set(1), rrcv$test_set(1))

# Internal storage:
rrcv$instance # table

(blockCV) Repeated "environmental blocking" resampling

Description

Splits data by clustering in the feature space. See the upstream implementation at blockCV::cv_cluster() and Valavi et al. (2018) for further information.

Details

Useful when the dataset is supposed to be split on environmental information which is present in features. The method allows for a combination of multiple features for clustering.

The input of raster images directly as in blockCV::cv_cluster() is not supported. See mlr3spatial and its raster DataBackends for such support in mlr3.
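
Clustering in the feature space can be sketched in base R as follows. This is a simplified stand-in for the upstream method, not its implementation: the feature names dem and slope and their values are hypothetical, and standardising the features before clustering is an assumption made here so that both features contribute equally.

```r
set.seed(1)

# Hypothetical environmental features for 80 observations
feat = data.frame(
  dem = rnorm(80, mean = 2000, sd = 300),
  slope = runif(80, min = 0, max = 60)
)

# Standardise so that features on different scales are comparable
z = scale(feat)

# Each k-means cluster in feature space becomes one fold
fold = stats::kmeans(z, centers = 4, nstart = 5)$cluster

table(fold)
```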

Parameters

  • folds (integer(1))
    Number of folds.

  • features (character())
    The features to use for clustering.

  • repeats (integer(1))
    Number of repeats.

Super class

mlr3::Resampling -> ResamplingRepeatedSpCVEnv

Active bindings

iters

integer(1)
Returns the number of resampling iterations, depending on the values stored in the param_set.

Methods

Public methods

Inherited methods

Method new()

Create an "Environmental Block" repeated resampling instance.

For a list of available arguments, please see blockCV::cv_cluster.

Usage
ResamplingRepeatedSpCVEnv$new(id = "repeated_spcv_env")
Arguments
id

character(1)
Identifier for the resampling strategy.


Method folds()

Translates iteration numbers to fold number.

Usage
ResamplingRepeatedSpCVEnv$folds(iters)
Arguments
iters

integer()
Iteration number.


Method repeats()

Translates iteration numbers to repetition number.

Usage
ResamplingRepeatedSpCVEnv$repeats(iters)
Arguments
iters

integer()
Iteration number.


Method instantiate()

Materializes fixed training and test splits for a given task.

Usage
ResamplingRepeatedSpCVEnv$instantiate(task)
Arguments
task

Task
A task to instantiate.


Method clone()

The objects of this class are cloneable with this method.

Usage
ResamplingRepeatedSpCVEnv$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

References

Valavi R, Elith J, Lahoz-Monfort JJ, Guillera-Arroita G (2018). “blockCV: an R package for generating spatially or environmentally separated folds for k-fold cross-validation of species distribution models.” bioRxiv. doi:10.1101/357798.

Examples

if (mlr3misc::require_namespaces(c("sf", "blockCV"), quietly = TRUE)) {
  library(mlr3)
  task = tsk("ecuador")

  # Instantiate Resampling
  rrcv = rsmp("repeated_spcv_env", folds = 4, repeats = 2)
  rrcv$instantiate(task)

  # Individual sets:
  rrcv$train_set(1)
  rrcv$test_set(1)
  intersect(rrcv$train_set(1), rrcv$test_set(1))

  # Internal storage:
  rrcv$instance
}

(CAST) Repeated K-fold Nearest Neighbour Distance Matching

Description

This function implements the kNNDM algorithm and returns the necessary indices to perform a k-fold NNDM CV for map validation.

Details

knndm is a k-fold version of NNDM LOO CV for medium and large datasets. Briefly, the algorithm tries to find a k-fold configuration such that the integral of the absolute differences (Wasserstein W statistic) between the empirical nearest neighbour distance distribution function between the test and training data during CV (Gj*), and the empirical nearest neighbour distance distribution function between the prediction and training points (Gij), is minimised. It does so by performing clustering of the training points' coordinates for different numbers of clusters that range from k to N (number of observations), merging them into k final folds, and selecting the configuration with the lowest W.
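
The W statistic described above can be approximated in base R for a given fold assignment. The sketch below is an illustration under simplifying assumptions (planar coordinates, a random fold assignment, made-up points) and is not the CAST implementation; the helper w_stat() is a hypothetical name.

```r
set.seed(42)

# Hypothetical training points, prediction points and a random 5-fold split
train = cbind(runif(100, 0, 100), runif(100, 0, 100))
pred = as.matrix(expand.grid(x = seq(5, 95, 10), y = seq(5, 95, 10)))
fold = sample(rep(1:5, length.out = nrow(train)))

# Gj*: distance from each training point to the nearest training point
# that belongs to a different fold
d_tt = as.matrix(dist(train))
gjstar = vapply(
  seq_len(nrow(train)),
  function(i) min(d_tt[i, fold != fold[i]]),
  numeric(1)
)

# Gij: distance from each prediction point to the nearest training point
d_pt = sqrt(
  outer(pred[, 1], train[, 1], "-")^2 + outer(pred[, 2], train[, 2], "-")^2
)
gij = apply(d_pt, 1, min)

# Integral of the absolute difference between two empirical distribution
# functions, computed exactly for step functions
w_stat = function(a, b) {
  r = sort(unique(c(a, b)))
  diff_abs = abs(ecdf(a)(r) - ecdf(b)(r))
  sum(diff_abs[-length(r)] * diff(r))
}

W = w_stat(gjstar, gij)
W
```

A smaller W indicates that the distances faced during cross-validation resemble those faced during prediction more closely.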

Using a projected CRS in 'knndm' has large computational advantages since fast nearest neighbour search can be done via the 'FNN' package, while working with geographic coordinates requires computing the full spherical distance matrices. As a clustering algorithm, 'kmeans' can only be used for projected CRS while 'hierarchical' can work with both projected and geographical coordinates, though it requires calculating the full distance matrix of the training points even for a projected CRS.

To choose between clustering algorithms and the number of folds 'k', different 'knndm' configurations can be run and compared; the configuration with the lower W statistic offers the better match. W statistics between 'knndm' runs are comparable as long as 'tpoints' and 'predpoints' or 'modeldomain' stay the same.

Map validation using knndm should be performed with 'CAST::global_validation', i.e. by stacking all out-of-sample predictions and evaluating them all at once. The reasons behind this are that 1) the resulting folds can be unbalanced and 2) nearest neighbour functions are constructed and matched using all CV folds simultaneously.

If training data points are very clustered with respect to the prediction area and the presented knndm configuration still shows signs of Gj* > Gij, there are several things that can be tried. First, increase the 'maxp' parameter; this may help to control for strong clustering (at the cost of having unbalanced folds). Second, decrease the number of final folds 'k', which may help to have larger clusters.

The 'modeldomain' is an sf polygon that defines the prediction area. The function takes a regular point sample (amount defined by 'samplesize') from the spatial extent. As an alternative, use 'predpoints' instead of 'modeldomain' if you have already defined the prediction locations (e.g. raster pixel centroids). When using either 'modeldomain' or 'predpoints', we advise plotting the study area polygon and the training/prediction points beforehand to ensure they are aligned.

Parameters

  • folds (integer(1))
    Number of folds.

  • stratify
    If TRUE, stratify on the target column.

  • repeats (integer(1))
    Number of repeats.

Super class

mlr3::Resampling -> ResamplingRepeatedSpCVKnndm

Active bindings

iters

integer(1)
Returns the number of resampling iterations, depending on the values stored in the param_set.

Methods

Public methods

Inherited methods

Method new()

Create a "K-fold Nearest Neighbour Distance Matching" resampling instance.

Usage
ResamplingRepeatedSpCVKnndm$new(id = "repeated_spcv_knndm")
Arguments
id

character(1)
Identifier for the resampling strategy.


Method folds()

Translates iteration numbers to fold number.

Usage
ResamplingRepeatedSpCVKnndm$folds(iters)
Arguments
iters

integer()
Iteration number.


Method repeats()

Translates iteration numbers to repetition number.

Usage
ResamplingRepeatedSpCVKnndm$repeats(iters)
Arguments
iters

integer()
Iteration number.


Method instantiate()

Materializes fixed training and test splits for a given task.

Usage
ResamplingRepeatedSpCVKnndm$instantiate(task)
Arguments
task

Task
A task to instantiate.


Method clone()

The objects of this class are cloneable with this method.

Usage
ResamplingRepeatedSpCVKnndm$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

References

Linnenbrink, J., Mila, C., Ludwig, M., Meyer, H. (2023). “kNNDM: k-fold Nearest Neighbour Distance Matching Cross-Validation for map accuracy estimation.” EGUsphere, 2023, 1–16. doi:10.5194/egusphere-2023-1308, https://egusphere.copernicus.org/preprints/2023/egusphere-2023-1308/.

Examples

library(mlr3)
library(mlr3spatial)
set.seed(42)
simarea = list(matrix(c(0, 0, 0, 100, 100, 100, 100, 0, 0, 0), ncol = 2, byrow = TRUE))
simarea = sf::st_polygon(simarea)
train_points = sf::st_sample(simarea, 1000, type = "random")
train_points = sf::st_as_sf(train_points)
train_points$target = as.factor(sample(c("TRUE", "FALSE"), 1000, replace = TRUE))
pred_points = sf::st_sample(simarea, 1000, type = "regular")

task = mlr3spatial::as_task_classif_st(sf::st_as_sf(train_points), "target", positive = "TRUE")

cv_knndm = rsmp("repeated_spcv_knndm", predpoints = pred_points, repeats = 2)
cv_knndm$instantiate(task)
# Individual sets:
# cv_knndm$train_set(1)
# cv_knndm$test_set(1)
# check that no obs are in both sets
intersect(cv_knndm$train_set(1), cv_knndm$test_set(1)) # good!

# Internal storage:
# cv_knndm$instance # table

(sperrorest) Repeated spatial "tiles" resampling

Description

Spatial partitioning using rectangular tiles. Small partitions can optionally be merged into adjacent ones to avoid partitions with too few observations. This method is similar to ResamplingSpCVBlock in that it uses rectangular zones in the coordinate space. See the upstream implementation at sperrorest::partition_tiles() and Brenning (2012) for further information.

Parameters

  • dsplit (integer(2))
    Equidistance of splits in (possibly rotated) x direction (dsplit[1]) and y direction (dsplit[2]) used to define tiles. If dsplit is of length 1, its value is recycled. Either dsplit or nsplit must be specified.

  • nsplit (integer(2))
    Number of splits in (possibly rotated) x direction (nsplit[1]) and y direction (nsplit[2]) used to define tiles. If nsplit is of length 1, its value is recycled.

  • rotation (character(1))
    Whether and how the rectangular grid should be rotated; random rotation is only possible between -45 and +45 degrees. Accepted values: One of c("none", "random", "user").

  • user_rotation (numeric())
    Only used when rotation = "user". Angle(s) (in degrees) by which the rectangular grid is to be rotated in each repetition. Either a vector of same length as repeats, or a single number that will be replicated length(repeats) times.

  • offset (character(1))
    Whether and how the rectangular grid should be shifted by an offset. Accepted values: One of c("none", "random", "user").

  • user_offset (list() | numeric(2))
    Only used when offset = "user". A list (or vector) of two components specifying a shift of the rectangular grid in (possibly rotated) x and y direction. The offset values are relative values, a value of 0.5 resulting in a one-half tile shift towards the left, or upward. If this is a list, its first (second) component refers to the rotated x (y) direction, and both components must have same length as repeats (or length 1). If a vector of length 2 (or list components have length 1), the two values will be interpreted as relative shifts in (rotated) x and y direction, respectively, and will therefore be recycled as needed (length(repeats) times each).

  • reassign (logical(1))
    If TRUE, 'small' tiles (as per min_frac and min_n) are merged with (smallest) adjacent tiles. If FALSE, small tiles are 'eliminated', i.e., set to NA.

  • min_frac (numeric(1))
    Value must be >=0, <1. Minimum relative size of a partition as a fraction of the sample.

  • min_n (integer(1))
    Minimum number of samples per partition.

  • iterate (integer(1))
    Passed down to sperrorest::tile_neighbors().

  • repeats (integer(1))
    Number of repeats.
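
How nsplit, min_n and reassign = FALSE interact can be sketched in base R. The sketch assumes that nsplit gives the number of tiles along each axis and that tiles are equally wide over the data extent; it ignores rotation and offset, uses simulated coordinates, and is not the sperrorest implementation.

```r
set.seed(1)

# Simulated coordinates (hypothetical data)
x = runif(200)
y = runif(200)
nsplit = c(4L, 3L)
min_n = 5L

# Equal-width bins along x and y; each (ix, iy) pair is one tile
ix = cut(x, breaks = seq(min(x), max(x), length.out = nsplit[1] + 1),
  include.lowest = TRUE, labels = FALSE)
iy = cut(y, breaks = seq(min(y), max(y), length.out = nsplit[2] + 1),
  include.lowest = TRUE, labels = FALSE)
tile = interaction(ix, iy, drop = TRUE)

# Analogue of reassign = FALSE: tiles with fewer than min_n points are
# dropped, i.e. their observations are set to NA
counts = table(tile)
small = names(counts)[counts < min_n]
tile_kept = ifelse(as.character(tile) %in% small, NA_character_,
  as.character(tile))

counts
```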

Super class

mlr3::Resampling -> ResamplingRepeatedSpCVTiles

Active bindings

iters

integer(1)
Returns the number of resampling iterations, depending on the values stored in the param_set.

Methods

Public methods

Inherited methods

Method new()

Create a "Spatial 'Tiles' resampling" resampling instance.

For a list of available arguments, please see sperrorest::partition_tiles.

Usage
ResamplingRepeatedSpCVTiles$new(id = "repeated_spcv_tiles")
Arguments
id

character(1)
Identifier for the resampling strategy.


Method folds()

Translates iteration numbers to fold number.

Usage
ResamplingRepeatedSpCVTiles$folds(iters)
Arguments
iters

integer()
Iteration number.


Method repeats()

Translates iteration numbers to repetition number.

Usage
ResamplingRepeatedSpCVTiles$repeats(iters)
Arguments
iters

integer()
Iteration number.


Method instantiate()

Materializes fixed training and test splits for a given task.

Usage
ResamplingRepeatedSpCVTiles$instantiate(task)
Arguments
task

Task
A task to instantiate.


Method clone()

The objects of this class are cloneable with this method.

Usage
ResamplingRepeatedSpCVTiles$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

References

Brenning A (2012). “Spatial cross-validation and bootstrap for the assessment of prediction rules in remote sensing: The R package sperrorest.” In 2012 IEEE International Geoscience and Remote Sensing Symposium. doi:10.1109/igarss.2012.6352393.

See Also

ResamplingSpCVBlock

Examples

if (mlr3misc::require_namespaces("sperrorest", quietly = TRUE)) {
  library(mlr3)
  task = tsk("ecuador")

  # Instantiate Resampling
  rrcv = rsmp("repeated_spcv_tiles",
    repeats = 2,
    nsplit = c(4L, 3L), reassign = FALSE)
  rrcv$instantiate(task)

  # Iterations and their fold/repetition mapping:
  rrcv$iters
  rrcv$folds(10:12)
  rrcv$repeats(10:12)

  # Individual sets:
  rrcv$train_set(1)
  rrcv$test_set(1)
  intersect(rrcv$train_set(1), rrcv$test_set(1))

  # Internal storage:
  rrcv$instance # table
}

(CAST) Repeated spatiotemporal "leave-location-and-time-out" resampling

Description

Splits data using Leave-Location-Out (LLO), Leave-Time-Out (LTO) and Leave-Location-and-Time-Out (LLTO) partitioning. See the upstream implementation at CreateSpacetimeFolds() (package CAST) and Meyer et al. (2018) for further information.

Details

LLO predicts on unknown locations i.e. complete locations are left out in the training sets. The "space" role in Task$col_roles identifies spatial units. If stratify is TRUE, the target distribution is similar in each fold. This is useful for land cover classification when the observations are polygons. In this case, LLO with stratification should be used to hold back complete polygons and have a similar target distribution in each fold. LTO leaves out complete temporal units which are identified by the "time" role in Task$col_roles. LLTO leaves out spatial and temporal units. See the examples.
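
The difference between LLO and LLTO can be sketched in base R. The data, the column names loc and time, and the fold assignment are made up for illustration; the sketch mirrors the description above (complete spatial and temporal units are held out) and is not the CAST implementation.

```r
set.seed(1)

# Hypothetical data: 10 locations observed at 6 time points each
dat = data.frame(
  loc = rep(paste0("site", 1:10), each = 6),
  time = rep(1:6, times = 10),
  stringsAsFactors = FALSE
)
k = 3

# Assign whole locations and whole time points to folds
sites = unique(dat$loc)
times = unique(dat$time)
loc_fold = setNames(sample(rep(seq_len(k), length.out = length(sites))), sites)
time_fold = setNames(sample(rep(seq_len(k), length.out = length(times))), times)
row_loc_fold = unname(loc_fold[dat$loc])
row_time_fold = unname(time_fold[as.character(dat$time)])

i = 1  # fold to inspect

# LLO: all rows of the held-out locations form the test set
llo_test = which(row_loc_fold == i)
llo_train = which(row_loc_fold != i)

# LLTO: test rows are held out in space AND time; training rows share
# neither location nor time with them; everything else is omitted
llto_test = which(row_loc_fold == i & row_time_fold == i)
llto_train = which(row_loc_fold != i & row_time_fold != i)
omitted = setdiff(seq_len(nrow(dat)), c(llto_test, llto_train))

c(test = length(llto_test), train = length(llto_train),
  omitted = length(omitted))
```

The non-empty omitted set is the reason the visualization methods offer the show_omitted argument for LLTO.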

Parameters

  • folds (integer(1))
    Number of folds.

  • stratify
    If TRUE, stratify on the target column.

  • repeats (integer(1))
    Number of repeats.

Super class

mlr3::Resampling -> ResamplingRepeatedSptCVCstf

Active bindings

iters

integer(1)
Returns the number of resampling iterations, depending on the values stored in the param_set.

Methods

Public methods

Inherited methods

Method new()

Create a "Spacetime Folds" resampling instance.

Usage
ResamplingRepeatedSptCVCstf$new(id = "repeated_sptcv_cstf")
Arguments
id

character(1)
Identifier for the resampling strategy.


Method folds()

Translates iteration numbers to fold number.

Usage
ResamplingRepeatedSptCVCstf$folds(iters)
Arguments
iters

integer()
Iteration number.


Method repeats()

Translates iteration numbers to repetition number.

Usage
ResamplingRepeatedSptCVCstf$repeats(iters)
Arguments
iters

integer()
Iteration number.


Method instantiate()

Materializes fixed training and test splits for a given task.

Usage
ResamplingRepeatedSptCVCstf$instantiate(task)
Arguments
task

Task
A task to instantiate.


Method clone()

The objects of this class are cloneable with this method.

Usage
ResamplingRepeatedSptCVCstf$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

References

Meyer H, Reudenbach C, Hengl T, Katurji M, Nauss T (2018). “Improving performance of spatio-temporal machine learning models using forward feature selection and target-oriented validation.” Environmental Modelling & Software, 101, 1–9. doi:10.1016/j.envsoft.2017.12.001.

Examples

library(mlr3)
task = tsk("cookfarm_mlr3")
task$set_col_roles("SOURCEID", roles = "space")
task$set_col_roles("Date", roles = "time")

# Instantiate Resampling
rcv = rsmp("repeated_sptcv_cstf", folds = 5, repeats = 2)
rcv$instantiate(task)

# Individual sets:
# rcv$train_set(1)
# rcv$test_set(1)
# check that no obs are in both sets
intersect(rcv$train_set(1), rcv$test_set(1)) # good!

# Internal storage:
# rcv$instance # table

(blockCV) Spatial block resampling

Description

This function creates spatially separated folds based on a distance or a number of rows and/or columns. It assigns blocks to the training and testing folds randomly, systematically or in a checkerboard pattern. The distance (size) should be in metres, regardless of the unit of the reference system of the input data (for more information see the details section). By default, the function creates blocks according to the extent and shape of the spatial sample data (x, e.g. the species occurrence). Alternatively, blocks can be created based on r, assuming that the user has considered the landscape for the given species and case study. Blocks can also be offset so the origin is not at the outer corner of the rasters. Instead of providing a distance, the blocks can also be created by specifying a number of rows and/or columns that divide the study area into vertical or horizontal bins, as presented in Wenger & Olden (2012) and Bahn & McGill (2012). Finally, the blocks can be specified by a user-defined spatial polygon layer.

Details

To maintain consistency, all functions in this package use meters as their unit of measurement. However, when the input map has a geographic coordinate system (in decimal degrees), the block size is calculated by dividing the size parameter by deg_to_metre (which defaults to 111325 meters, the standard distance of one degree of latitude on the Equator). In reality, this value varies by a factor of the cosine of the latitude. So, an alternative sensible value could be cos(mean(sf::st_bbox(x)[c(2,4)]) * pi/180) * 111325.

The offset can be used to change the spatial position of the blocks. It can also be used to assess the sensitivity of analysis results to shifting in the blocking arrangements. These options are available when size is defined. By default the region is located in the middle of the blocks and by setting the offsets, the blocks will shift.

Roberts et al. (2017) suggest that blocks should be substantially bigger than the range of spatial autocorrelation (in the model residuals) to obtain realistic error estimates, while a buffer with the size of the spatial autocorrelation range would result in a good estimation of error. This is because of the so-called edge effect (O'Sullivan & Unwin, 2014), whereby points located on the edges of the blocks of opposite sets are not separated spatially. Blocking with a buffering strategy overcomes this issue (see cv_buffer).

mlr3spatiotempcv notes

By default blockCV::cv_spatial() does not allow the creation of multiple repetitions. mlr3spatiotempcv adds support for this when using the size argument for fold creation. When supplying a vector of length(repeats) for argument size, these different settings will be used to create folds which differ among the repetitions.

Multiple repetitions are not possible when using the "row & cols" approach because the created folds will always be the same.

The 'Description' and 'Details' fields are inherited from the respective upstream function.

For a list of available arguments, please see blockCV::cv_spatial.

blockCV >= 3.0.0 changed the argument names of the implementation. For backward compatibility, mlr3spatiotempcv is still using the old ones. Here's a list which shows the mapping between blockCV < 3.0.0 and blockCV >= 3.0.0:

  • range -> size

  • rasterLayer -> r

  • speciesData -> points

  • showBlocks -> plot

  • cols and rows -> rows_cols

The default of argument hexagon is different in mlr3spatiotempcv (FALSE instead of TRUE) to create square blocks instead of hexagonal blocks by default.

Super class

mlr3::Resampling -> ResamplingSpCVBlock

Public fields

blocks

⁠sf | list of sf objects⁠
Polygons (sf objects) as returned by blockCV which grouped observations into partitions.

Active bindings

iters

integer(1)
Returns the number of resampling iterations, depending on the values stored in the param_set.

Methods

Public methods

Inherited methods

Method new()

Create a "spatial block" resampling instance.

For a list of available arguments, please see blockCV::cv_spatial().

Usage
ResamplingSpCVBlock$new(id = "spcv_block")
Arguments
id

character(1)
Identifier for the resampling strategy.


Method instantiate()

Materializes fixed training and test splits for a given task.

Usage
ResamplingSpCVBlock$instantiate(task)
Arguments
task

Task
A task to instantiate.


Method clone()

The objects of this class are cloneable with this method.

Usage
ResamplingSpCVBlock$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

References

Valavi R, Elith J, Lahoz-Monfort JJ, Guillera-Arroita G (2018). “blockCV: an R package for generating spatially or environmentally separated folds for k-fold cross-validation of species distribution models.” bioRxiv. doi:10.1101/357798.

Examples

if (mlr3misc::require_namespaces(c("sf", "blockCV"), quietly = TRUE)) {
  library(mlr3)
  task = tsk("ecuador")

  # Instantiate Resampling
  rcv = rsmp("spcv_block", range = 3000L, folds = 3)
  rcv$instantiate(task)

  # Individual sets:
  rcv$train_set(1)
  rcv$test_set(1)
  intersect(rcv$train_set(1), rcv$test_set(1))

  # Internal storage:
  rcv$instance
}

(blockCV) Spatial buffering resampling

Description

This function generates spatially separated train and test folds by considering buffers of the specified distance (size parameter) around each observation point. This approach is a form of leave-one-out cross-validation. Each fold is generated by excluding nearby observations around each testing point within the specified distance (ideally the range of spatial autocorrelation, see cv_spatial_autocor). In this method, the testing set never directly abuts a training sample (e.g. presence or absence; 0s and 1s). For more information see the details section.

Details

When working with presence-background (presence and pseudo-absence) species distribution data (should be specified by presence_bg = TRUE argument), only presence records are used for specifying the folds (recommended). Consider a target presence point. The buffer is defined around this target point, using the specified range (size). By default, the testing fold comprises only the target presence point (all background points within the buffer are also added when add_bg = TRUE). Any non-target presence points inside the buffer are excluded. All points (presence and background) outside the buffer are used for the training set. The method cycles through all the presence data, so the number of folds is equal to the number of presence points in the dataset.

For presence-absence data (and all other types of data), folds are created based on all records, both presences and absences. As above, a target observation (presence or absence) forms a test point, all presence and absence points other than the target point within the buffer are ignored, and the training set comprises all presences and absences outside the buffer. Apart from the folds, the number of training-presence, training-absence, testing-presence and testing-absence records is stored and returned in the records table. If column = NULL and presence_bg = FALSE, the procedure is like presence-absence data. All other data types (continuous, count or multi-class responses) should be done by presence_bg = FALSE.
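
The fold construction for presence-absence data described above can be sketched in base R. This is a simplified illustration under the assumption of planar coordinates and Euclidean distances, not the blockCV code; buffer_fold() is a hypothetical helper.

```r
# One buffered leave-one-out fold: the target observation is the test set,
# other observations inside the buffer are dropped, the rest is training data.
buffer_fold = function(coords, target, size) {
  d = sqrt((coords[, 1] - coords[target, 1])^2 +
    (coords[, 2] - coords[target, 2])^2)
  list(
    test = target,
    excluded = setdiff(which(d <= size), target), # inside the buffer, ignored
    train = which(d > size) # outside the buffer
  )
}

set.seed(1)
coords = cbind(x = runif(50, 0, 1000), y = runif(50, 0, 1000))

# One fold per observation, so the number of folds equals the number of points
folds = lapply(seq_len(nrow(coords)), buffer_fold, coords = coords, size = 200)
fold = folds[[1]]
lengths(fold)
```

Because every observation becomes a test point once, the number of model fits grows with the size of the dataset.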

mlr3spatiotempcv notes

The 'Description' and 'Details' fields are inherited from the respective upstream function. For a list of available arguments, please see blockCV::cv_buffer.

blockCV >= 3.0.0 changed the argument names of the implementation. For backward compatibility, mlr3spatiotempcv is still using the old ones. Here's a list which shows the mapping between blockCV < 3.0.0 and blockCV >= 3.0.0:

  • theRange -> size

  • addBG -> add_bg

  • spDataType (character vector) -> presence_bg (boolean)

Super class

mlr3::Resampling -> ResamplingSpCVBuffer

Active bindings

iters

integer(1)
Returns the number of resampling iterations, depending on the values stored in the param_set.

Methods

Public methods

Inherited methods

Method new()

Create a "Spatial buffering" resampling instance.

For a list of available arguments, please see blockCV::cv_buffer().

Usage
ResamplingSpCVBuffer$new(id = "spcv_buffer")
Arguments
id

character(1)
Identifier for the resampling strategy.


Method instantiate()

Materializes fixed training and test splits for a given task.

Usage
ResamplingSpCVBuffer$instantiate(task)
Arguments
task

Task
A task to instantiate.


Method clone()

The objects of this class are cloneable with this method.

Usage
ResamplingSpCVBuffer$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

References

Valavi R, Elith J, Lahoz-Monfort JJ, Guillera-Arroita G (2018). “blockCV: an R package for generating spatially or environmentally separated folds for k-fold cross-validation of species distribution models.” bioRxiv. doi:10.1101/357798.

See Also

ResamplingSpCVDisc

Examples

if (mlr3misc::require_namespaces(c("sf", "blockCV"), quietly = TRUE)) {
  library(mlr3)
  task = tsk("ecuador")

  # Instantiate Resampling
  rcv = rsmp("spcv_buffer", theRange = 10000)
  rcv$instantiate(task)

  # Individual sets:
  rcv$train_set(1)
  rcv$test_set(1)
  intersect(rcv$train_set(1), rcv$test_set(1))

  # Internal storage:
  # rcv$instance
}

(sperrorest) Coordinate-based k-means clustering

Description

Splits data by clustering in the coordinate space. See the upstream implementation at sperrorest::partition_kmeans() and Brenning (2012) for further information.

Details

Universal partitioning method that splits the data in the coordinate space. Useful for spatially homogeneous datasets that cannot be split well with rectangular approaches like ResamplingSpCVBlock.
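
As a rough illustration of the idea, the following base R sketch clusters synthetic coordinates with stats::kmeans() and uses the cluster labels as fold identifiers. It is an assumption-laden sketch, not the sperrorest implementation; the helpers train_set() and test_set() are defined here only for the example.

```r
set.seed(42)
coords = data.frame(x = runif(100), y = runif(100))
folds = 5

# Cluster observations in coordinate space; each cluster becomes one fold
km = stats::kmeans(coords, centers = folds)
fold_id = km$cluster

test_set = function(i) which(fold_id == i)
train_set = function(i) which(fold_id != i)

# No observation is in both sets of the same fold
intersect(train_set(1), test_set(1))
table(fold_id)
```

Fold sizes are generally unequal because they follow the spatial clusters rather than a fixed count.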

Parameters

  • folds (integer(1))
    Number of folds.

Super class

mlr3::Resampling -> ResamplingSpCVCoords

Active bindings

iters

integer(1)
Returns the number of resampling iterations, depending on the values stored in the param_set.

Methods

Public methods

Inherited methods

Method new()

Create a "coordinate-based" resampling instance.

For a list of available arguments, please see sperrorest::partition_kmeans.

Usage
ResamplingSpCVCoords$new(id = "spcv_coords")
Arguments
id

character(1)
Identifier for the resampling strategy.


Method instantiate()

Materializes fixed training and test splits for a given task.

Usage
ResamplingSpCVCoords$instantiate(task)
Arguments
task

Task
A task to instantiate.


Method clone()

The objects of this class are cloneable with this method.

Usage
ResamplingSpCVCoords$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

References

Brenning A (2012). “Spatial cross-validation and bootstrap for the assessment of prediction rules in remote sensing: The R package sperrorest.” In 2012 IEEE International Geoscience and Remote Sensing Symposium. doi:10.1109/igarss.2012.6352393.

Examples

library(mlr3)
task = tsk("ecuador")

# Instantiate Resampling
rcv = rsmp("spcv_coords", folds = 5)
rcv$instantiate(task)

# Individual sets:
rcv$train_set(1)
rcv$test_set(1)
# check that no obs are in both sets
intersect(rcv$train_set(1), rcv$test_set(1)) # good!

# Internal storage:
rcv$instance # table

(sperrorest) Spatial "disc" resampling

Description

Spatial partitioning using circular test areas of one or more observations. Optionally, a buffer around the test area can be used to exclude observations. See the upstream implementation at sperrorest::partition_disc() and Brenning (2012) for further information.

Parameters

  • folds (integer(1))
    Number of folds.

  • radius (numeric(1))
    Radius of test area disc.

  • buffer (integer(1))
    Radius around test area disc which is excluded from training or test set.

  • prob (integer(1))
    Optional argument passed down to sample().

  • replace (logical(1))
    Optional argument passed down to sample(). Sample with or without replacement.
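
How radius and buffer interact can be sketched in base R. The helper disc_fold() is hypothetical and assumes planar coordinates with Euclidean distances; it is meant as an illustration rather than a copy of the sperrorest code.

```r
# Test set: observations within `radius` of the disc center.
# Training set: observations farther away than `radius + buffer`.
disc_fold = function(coords, center, radius, buffer = 0) {
  d = sqrt((coords[, 1] - coords[center, 1])^2 +
    (coords[, 2] - coords[center, 2])^2)
  list(
    test = which(d <= radius),
    train = which(d > radius + buffer)
  )
}

set.seed(1)
coords = cbind(x = runif(300, 0, 1000), y = runif(300, 0, 1000))

# One randomly sampled observation serves as disc center per fold
centers = sample(nrow(coords), size = 3)
folds = lapply(centers, function(ctr) {
  disc_fold(coords, ctr, radius = 200, buffer = 200)
})
lengths(folds[[1]])
```

Observations in the ring between radius and radius + buffer belong to neither set in that fold.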

Super class

mlr3::Resampling -> ResamplingSpCVDisc

Active bindings

iters

integer(1)
Returns the number of resampling iterations, depending on the values stored in the param_set.

Methods

Public methods

Inherited methods

Method new()

Create a "Spatial 'Disc'" resampling instance.

For a list of available arguments, please see sperrorest::partition_disc.

Usage
ResamplingSpCVDisc$new(id = "spcv_disc")
Arguments
id

character(1)
Identifier for the resampling strategy.


Method instantiate()

Materializes fixed training and test splits for a given task.

Usage
ResamplingSpCVDisc$instantiate(task)
Arguments
task

Task
A task to instantiate.


Method clone()

The objects of this class are cloneable with this method.

Usage
ResamplingSpCVDisc$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

References

Brenning A (2012). “Spatial cross-validation and bootstrap for the assessment of prediction rules in remote sensing: The R package sperrorest.” In 2012 IEEE International Geoscience and Remote Sensing Symposium. doi:10.1109/igarss.2012.6352393.

Examples

library(mlr3)
task = tsk("ecuador")

# Instantiate Resampling
rcv = rsmp("spcv_disc", folds = 3L, radius = 200L, buffer = 200L)
rcv$instantiate(task)

# Individual sets:
rcv$train_set(1)
rcv$test_set(1)
# check that no obs are in both sets
intersect(rcv$train_set(1), rcv$test_set(1)) # good!

# Internal storage:
rcv$instance # table

(blockCV) "Environmental blocking" resampling

Description

Splits data by clustering in the feature space. See the upstream implementation at blockCV::cv_cluster() and Valavi et al. (2018) for further information.

Details

Useful when the dataset is supposed to be split on environmental information which is present in features. The method allows for a combination of multiple features for clustering.

The input of raster images directly as in blockCV::cv_cluster() is not supported. See mlr3spatial and its raster DataBackends for such support in mlr3.
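
A minimal base R sketch of clustering in feature space follows. The data and column names are synthetic, and standardising the features before clustering is an assumption of this sketch rather than a statement about the upstream implementation.

```r
set.seed(42)
env = data.frame(
  temperature = rnorm(120, mean = 15, sd = 5),
  precipitation = rnorm(120, mean = 800, sd = 200),
  elevation = runif(120, 0, 2000)
)

# Hypothetical subset of features to cluster on
features = c("temperature", "precipitation")

# Standardise so that no feature dominates the distance computation,
# then use the k-means cluster labels as fold identifiers
km = stats::kmeans(scale(env[, features]), centers = 4)
fold_id = km$cluster

test_set = function(i) which(fold_id == i)
train_set = function(i) which(fold_id != i)
table(fold_id)
```

Each test fold then covers a distinct region of the environmental space, so the model is evaluated on conditions it has seen less of during training.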

Parameters

  • folds (integer(1))
    Number of folds.

  • features (character())
    The features to use for clustering.

Super class

mlr3::Resampling -> ResamplingSpCVEnv

Active bindings

iters

integer(1)
Returns the number of resampling iterations, depending on the values stored in the param_set.

Methods

Public methods

Inherited methods

Method new()

Create an "Environmental Block" resampling instance.

For a list of available arguments, please see blockCV::cv_cluster.

Usage
ResamplingSpCVEnv$new(id = "spcv_env")
Arguments
id

character(1)
Identifier for the resampling strategy.


Method instantiate()

Materializes fixed training and test splits for a given task.

Usage
ResamplingSpCVEnv$instantiate(task)
Arguments
task

Task
A task to instantiate.


Method clone()

The objects of this class are cloneable with this method.

Usage
ResamplingSpCVEnv$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

References

Valavi R, Elith J, Lahoz-Monfort JJ, Guillera-Arroita G (2018). “blockCV: an R package for generating spatially or environmentally separated folds for k-fold cross-validation of species distribution models.” bioRxiv. doi:10.1101/357798.

Examples

if (mlr3misc::require_namespaces(c("sf", "blockCV"), quietly = TRUE)) {
  library(mlr3)
  task = tsk("ecuador")

  # Instantiate Resampling
  rcv = rsmp("spcv_env", folds = 4)
  rcv$instantiate(task)

  # Individual sets:
  rcv$train_set(1)
  rcv$test_set(1)
  intersect(rcv$train_set(1), rcv$test_set(1))

  # Internal storage:
  rcv$instance
}

(CAST) K-fold Nearest Neighbour Distance Matching

Description

This function implements the kNNDM algorithm and returns the necessary indices to perform a k-fold NNDM CV for map validation.

Details

knndm is a k-fold version of NNDM LOO CV for medium and large datasets. Briefly, the algorithm tries to find a k-fold configuration such that the integral of the absolute differences (Wasserstein W statistic) between the empirical nearest neighbour distance distribution function between the test and training data during CV (Gj*), and the empirical nearest neighbour distance distribution function between the prediction and training points (Gij), is minimised. It does so by performing clustering of the training points' coordinates for different numbers of clusters that range from k to N (number of observations), merging them into k final folds, and selecting the configuration with the lowest W.
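
The W statistic described above can be approximated with a few lines of base R. The sketch below is an illustration under simplifying assumptions (planar coordinates, Euclidean distances, a random 5-fold split); nn_dist() and w_stat() are hypothetical helpers and not part of CAST.

```r
# Distance from each row of `from` to its nearest row in `to`
nn_dist = function(from, to) {
  apply(from, 1, function(p) {
    min(sqrt((to[, 1] - p[1])^2 + (to[, 2] - p[2])^2))
  })
}

# Integral of |F1 - F2| for two empirical distribution functions.
# Both are step functions that are constant between consecutive knots.
w_stat = function(d1, d2) {
  f1 = stats::ecdf(d1)
  f2 = stats::ecdf(d2)
  knots = sort(unique(c(d1, d2)))
  left = knots[-length(knots)]
  sum(abs(f1(left) - f2(left)) * diff(knots))
}

set.seed(1)
train = cbind(runif(100, 0, 0.3), runif(100, 0, 0.3)) # clustered training points
pred = cbind(runif(500), runif(500)) # prediction locations
fold_id = sample(rep(1:5, length.out = nrow(train))) # random 5-fold split

# Gij: prediction-to-training nearest neighbour distances
g_ij = nn_dist(pred, train)
# Gj*: test-to-training nearest neighbour distances during CV
g_jstar = unlist(lapply(1:5, function(k) {
  nn_dist(train[fold_id == k, , drop = FALSE], train[fold_id != k, , drop = FALSE])
}))

w_stat(g_jstar, g_ij)
```

With clustered training points, a random split yields test-to-training distances that are much shorter than the prediction-to-training distances, which shows up as a large W.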

Using a projected CRS in 'knndm' has large computational advantages since fast nearest neighbour search can be done via the 'FNN' package, while working with geographic coordinates requires computing the full spherical distance matrices. As a clustering algorithm, 'kmeans' can only be used for projected CRS while 'hierarchical' can work with both projected and geographical coordinates, though it requires calculating the full distance matrix of the training points even for a projected CRS.

To choose between clustering algorithms and the number of folds 'k', different 'knndm' configurations can be run and compared; the configuration with the lower W statistic offers the better match. W statistics between 'knndm' runs are comparable as long as 'tpoints' and 'predpoints' or 'modeldomain' stay the same.

Map validation using knndm should be performed with 'CAST::global_validation', i.e. by stacking all out-of-sample predictions and evaluating them all at once. The reasons behind this are 1) the resulting folds can be unbalanced and 2) nearest neighbour functions are constructed and matched using all CV folds simultaneously.

If training data points are very clustered with respect to the prediction area and the presented knndm configuration still shows signs of Gj* > Gij, there are several things that can be tried. First, increase the 'maxp' parameter; this may help to control for strong clustering (at the cost of having unbalanced folds). Second, decrease the number of final folds 'k', which may help to have larger clusters.

The 'modeldomain' is an sf polygon that defines the prediction area. The function takes a regular point sample (amount defined by 'samplesize') from the spatial extent. As an alternative, use 'predpoints' instead of 'modeldomain' if you have already defined the prediction locations (e.g. raster pixel centroids). When using either 'modeldomain' or 'predpoints', we advise plotting the study area polygon and the training/prediction points beforehand to ensure they are aligned.

Parameters

  • folds (integer(1))
    Number of folds.

  • stratify
    If TRUE, stratify on the target column.

Super class

mlr3::Resampling -> ResamplingSpCVKnndm

Active bindings

iters

integer(1)
Returns the number of resampling iterations, depending on the values stored in the param_set.

Methods

Public methods

Inherited methods

Method new()

Create a "K-fold Nearest Neighbour Distance Matching" resampling instance.

Usage
ResamplingSpCVKnndm$new(id = "spcv_knndm")
Arguments
id

character(1)
Identifier for the resampling strategy.


Method instantiate()

Materializes fixed training and test splits for a given task.

Usage
ResamplingSpCVKnndm$instantiate(task)
Arguments
task

Task
A task to instantiate.


Method clone()

The objects of this class are cloneable with this method.

Usage
ResamplingSpCVKnndm$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

References

Linnenbrink, J., Mila, C., Ludwig, M., Meyer, H. (2023). “kNNDM: k-fold Nearest Neighbour Distance Matching Cross-Validation for map accuracy estimation.” EGUsphere, 2023, 1–16. doi:10.5194/egusphere-2023-1308, https://egusphere.copernicus.org/preprints/2023/egusphere-2023-1308/.

Examples

if (mlr3misc::require_namespaces(c("sf", "CAST"), quietly = TRUE)) {
  library(mlr3)
  library(sf)

  set.seed(42)
  task = tsk("ecuador")
  points = sf::st_as_sf(task$coordinates(), crs = task$crs, coords = c("x", "y"))
  modeldomain = sf::st_as_sfc(sf::st_bbox(points))

  set.seed(42)
  cv_knndm = rsmp("spcv_knndm", modeldomain = modeldomain)
  cv_knndm$instantiate(task)

  # Individual sets:
  # cv_knndm$train_set(1)
  # cv_knndm$test_set(1)
  # check that no obs are in both sets
  intersect(cv_knndm$train_set(1), cv_knndm$test_set(1)) # good!

  # Internal storage:
  # cv_knndm$instance # table
}

(sperrorest) Spatial "Tiles" resampling

Description

Spatial partitioning using rectangular tiles. Small partitions can optionally be merged into adjacent ones to avoid partitions with too few observations. This method is similar to ResamplingSpCVBlock by making use of rectangular zones in the coordinate space. See the upstream implementation at sperrorest::partition_tiles() and Brenning (2012) for further information.

Parameters

  • dsplit (integer(2))
    Equidistance of splits in (possibly rotated) x direction (dsplit[1]) and y direction (dsplit[2]) used to define tiles. If dsplit is of length 1, its value is recycled. Either dsplit or nsplit must be specified.

  • nsplit (integer(2))
    Number of splits in (possibly rotated) x direction (nsplit[1]) and y direction (nsplit[2]) used to define tiles. If nsplit is of length 1, its value is recycled.

  • rotation (character(1))
    Whether and how the rectangular grid should be rotated; random rotation is only possible between -45 and +45 degrees. Accepted values: One of c("none", "random", "user").

  • user_rotation (numeric())
    Only used when rotation = "user". Angle(s) (in degrees) by which the rectangular grid is to be rotated in each repetition. Either a vector of same length as repeats, or a single number that will be replicated length(repeats) times.

  • offset (character(1))
    Whether and how the rectangular grid should be shifted by an offset. Accepted values: One of c("none", "random", "user").

  • user_offset (numeric() | list())
    Only used when offset = "user". A list (or vector) of two components specifying a shift of the rectangular grid in (possibly rotated) x and y direction. The offset values are relative values, a value of 0.5 resulting in a one-half tile shift towards the left, or upward. If this is a list, its first (second) component refers to the rotated x (y) direction, and both components must have same length as repeats (or length 1). If a vector of length 2 (or list components have length 1), the two values will be interpreted as relative shifts in (rotated) x and y direction, respectively, and will therefore be recycled as needed (length(repeats) times each).

  • reassign (logical(1))
    If TRUE, 'small' tiles (as per min_frac and min_n) are merged with (smallest) adjacent tiles. If FALSE, small tiles are 'eliminated', i.e., set to NA.

  • min_frac (numeric(1))
    Value must be >=0, <1. Minimum relative size of partition as percentage of sample.

  • min_n (integer(1))
    Minimum number of samples per partition.

  • iterate (integer(1))
    Passed down to sperrorest::tile_neighbors().
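
The interplay of nsplit, min_n and reassign = FALSE can be sketched in base R. This is a simplified illustration, not the sperrorest code: tile_ids() is a hypothetical helper, rotation and offset are ignored, and nsplit is interpreted here as the number of tiles per direction, which is an assumption.

```r
# Label each observation with the rectangular tile it falls into
tile_ids = function(x, y, nsplit) {
  nsplit = rep_len(nsplit, 2) # a single value is recycled
  col = cut(x, breaks = nsplit[1], labels = FALSE, include.lowest = TRUE)
  row = cut(y, breaks = nsplit[2], labels = FALSE, include.lowest = TRUE)
  factor(paste0("X", col, ":Y", row))
}

set.seed(1)
x = runif(200)
y = runif(200)
tiles = tile_ids(x, y, nsplit = c(4, 3))

# Tiles with fewer than min_n = 10 observations count as 'small'
counts = table(tiles)
small = names(counts)[counts < 10]

# reassign = FALSE: observations in small tiles are set to NA
tiles_kept = tiles
tiles_kept[tiles %in% small] = NA
table(tiles_kept, useNA = "ifany")
```

With reassign = TRUE, the observations of a small tile would instead be merged into an adjacent tile, which this sketch does not implement.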

Super class

mlr3::Resampling -> ResamplingSpCVTiles

Active bindings

iters

integer(1)
Returns the number of resampling iterations, depending on the values stored in the param_set.

Methods

Public methods

Inherited methods

Method new()

Create a "Spatial 'Tiles'" resampling instance.

Usage
ResamplingSpCVTiles$new(id = "spcv_tiles")
Arguments
id

character(1)
Identifier for the resampling strategy. For a list of available arguments, please see sperrorest::partition_tiles.


Method instantiate()

Materializes fixed training and test splits for a given task.

Usage
ResamplingSpCVTiles$instantiate(task)
Arguments
task

Task
A task to instantiate.


Method clone()

The objects of this class are cloneable with this method.

Usage
ResamplingSpCVTiles$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

References

Brenning A (2012). “Spatial cross-validation and bootstrap for the assessment of prediction rules in remote sensing: The R package sperrorest.” In 2012 IEEE International Geoscience and Remote Sensing Symposium. doi:10.1109/igarss.2012.6352393.

See Also

ResamplingSpCVBlock

Examples

if (mlr3misc::require_namespaces("sperrorest", quietly = TRUE)) {
  library(mlr3)
  task = tsk("ecuador")

  # Instantiate Resampling
  rcv = rsmp("spcv_tiles", nsplit = c(4L, 3L), reassign = FALSE)
  rcv$instantiate(task)

  # Individual sets:
  rcv$train_set(1)
  rcv$test_set(1)
  # check that no obs are in both sets
  intersect(rcv$train_set(1), rcv$test_set(1)) # good!

  # Internal storage:
  rcv$instance # table
}

(CAST) Spatiotemporal "Leave-location-and-time-out" resampling

Description

Splits data using Leave-Location-Out (LLO), Leave-Time-Out (LTO) and Leave-Location-and-Time-Out (LLTO) partitioning. See the upstream implementation at CreateSpacetimeFolds() (package CAST) and Meyer et al. (2018) for further information.

Details

LLO predicts on unknown locations i.e. complete locations are left out in the training sets. The "space" role in Task$col_roles identifies spatial units. If stratify is TRUE, the target distribution is similar in each fold. This is useful for land cover classification when the observations are polygons. In this case, LLO with stratification should be used to hold back complete polygons and have a similar target distribution in each fold. LTO leaves out complete temporal units which are identified by the "time" role in Task$col_roles. LLTO leaves out spatial and temporal units. See the examples.
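
The three variants can be contrasted with a base R sketch on a synthetic space-time grid. It is an illustration under stated assumptions (random assignment of whole locations and whole time steps to folds, no stratification), not the CAST implementation; llo(), lto() and llto() are hypothetical helpers.

```r
set.seed(1)
data = expand.grid(location = paste0("site", 1:10), time = 1:12)
k = 3

# Assign complete locations and complete time steps to folds
loc_fold = setNames(sample(rep_len(1:k, 10)), paste0("site", 1:10))
time_fold = setNames(sample(rep_len(1:k, 12)), 1:12)
sp = unname(loc_fold[as.character(data$location)])
tm = unname(time_fold[as.character(data$time)])

# LLO: hold out locations; LTO: hold out time steps
llo = function(i) list(test = which(sp == i), train = which(sp != i))
lto = function(i) list(test = which(tm == i), train = which(tm != i))
# LLTO: test on held-out locations at held-out times,
# train only on data that shares neither location nor time with the test set
llto = function(i) {
  list(test = which(sp == i & tm == i), train = which(sp != i & tm != i))
}

sapply(list(llo = llo(1), lto = lto(1), llto = llto(1)), lengths)
```

In the LLTO variant, a part of the data is used neither for training nor for testing in a given fold, which is the price of separating both dimensions.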

Parameters

  • folds (integer(1))
    Number of folds.

  • stratify
    If TRUE, stratify on the target column.

Super class

mlr3::Resampling -> ResamplingSptCVCstf

Active bindings

iters

integer(1)
Returns the number of resampling iterations, depending on the values stored in the param_set.

Methods

Public methods

Inherited methods

Method new()

Create a "Spacetime Folds" resampling instance.

Usage
ResamplingSptCVCstf$new(id = "sptcv_cstf")
Arguments
id

character(1)
Identifier for the resampling strategy.


Method instantiate()

Materializes fixed training and test splits for a given task.

Usage
ResamplingSptCVCstf$instantiate(task)
Arguments
task

Task
A task to instantiate.


Method clone()

The objects of this class are cloneable with this method.

Usage
ResamplingSptCVCstf$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

References

Meyer H, Reudenbach C, Hengl T, Katurji M, Nauss T (2018). “Improving performance of spatio-temporal machine learning models using forward feature selection and target-oriented validation.” Environmental Modelling & Software, 101, 1–9. doi:10.1016/j.envsoft.2017.12.001.

Examples

library(mlr3)
task = tsk("cookfarm_mlr3")
task$set_col_roles("SOURCEID", roles = "space")
task$set_col_roles("Date", roles = "time")

# Instantiate Resampling
rcv = rsmp("sptcv_cstf", folds = 5)
rcv$instantiate(task)

# Individual sets:
# rcv$train_set(1)
# rcv$test_set(1)
# check that no obs are in both sets
intersect(rcv$train_set(1), rcv$test_set(1)) # good!

# Internal storage:
# rcv$instance # table

Cookfarm Profiles Regression Task

Description

The R.J. Cook Agronomy Farm (cookfarm) is a Long-Term Agroecosystem Research Site operated by Washington State University, located near Pullman, Washington, USA. The dataset contains spatio-temporal (3D+T) measurements of three soil properties and a number of spatial and temporal regression covariates.

Here, only the "Profiles" dataset is used from the collection. The Date column was appended from the readings dataset. In addition coordinates were appended to the task as variables "x" and "y".

The dataset was borrowed and adapted from package GSIF, which was archived on CRAN in 2021-03.

Usage

data(cookfarm_mlr3)

Format

R6::R6Class inheriting from TaskRegr.

Usage

mlr_tasks$get("cookfarm_mlr3")
tsk("cookfarm_mlr3")

Column roles

The task has set column roles "space" and "time" for variables "SOURCEID" and "Date", respectively. These are used by certain methods during partitioning, e.g., mlr_resamplings_sptcv_cstf with variant "Leave-location-and-time-out". If only one of space or time should be left out, the column roles must be adjusted by the user!

References

Gasch, C.K., Hengl, T., Gräler, B., Meyer, H., Magney, T., Brown, D.J., 2015. Spatio-temporal interpolation of soil water, temperature, and electrical conductivity in 3D+T: the Cook Agronomy Farm data set. Spatial Statistics, 14, pp.70–90.

Gasch, C.K., D.J. Brown, E.S. Brooks, M. Yourek, M. Poggio, D.R. Cobos, C.S. Campbell, 2016? Retroactive calibration of soil moisture sensors using a two-step, soil-specific correction. Submitted to Vadose Zone Journal.

Gasch, C.K., D.J. Brown, C.S. Campbell, D.R. Cobos, E.S. Brooks, M. Chahal, M. Poggio, 2016? A field-scale sensor network data set for monitoring and modeling the spatial and temporal variation of soil moisture in a dryland agricultural field. Submitted to Water Resources Research.

See Also

Dictionary of Tasks: mlr_tasks

as.data.table(mlr_tasks) for a complete table of all (also dynamically created) Tasks.

Other Task: TaskClassifST, TaskRegrST, mlr_tasks_diplodia, mlr_tasks_ecuador


Diplodia Classification Task

Description

Data set created by Patrick Schratz, University of Jena (Germany) and Eugenia Iturritxa, NEIKER, Vitoria-Gasteiz (Spain). This dataset should be cited as Schratz et al. (2019) (see reference below). The publication also contains additional information on data collection. The data set provided here shows infections of trees by the pathogen Diplodia sapinea in the Basque Country in Spain. Predictors are environmental variables like temperature, precipitation, soil and more.

Usage

data(diplodia)

Format

R6::R6Class inheriting from TaskClassif.

Usage

mlr_tasks$get("diplodia")
tsk("diplodia")

References

Schratz P, Muenchow J, Iturritxa E, Richter J, Brenning A (2019). “Hyperparameter tuning and performance assessment of statistical and machine-learning algorithms using spatial data.” Ecological Modelling, 406, 109–120. doi:10.1016/j.ecolmodel.2019.06.002.

See Also

Dictionary of Tasks: mlr_tasks

as.data.table(mlr_tasks) for a complete table of all (also dynamically created) Tasks.

Other Task: TaskClassifST, TaskRegrST, mlr_tasks_cookfarm_mlr3, mlr_tasks_ecuador


Ecuador Classification Task

Description

Data set created by Jannes Muenchow, University of Erlangen-Nuernberg, Germany. This dataset should be cited as Muenchow et al. (2012) (see reference below). The publication also contains additional information on data collection and the geomorphology of the area. The data set provided here is (a subset of) the one from the 'natural' part of the RBSF area and corresponds to landslide distribution in the year 2000.

Usage

data(ecuador)

Format

R6::R6Class inheriting from TaskClassif.

Usage

mlr_tasks$get("ecuador")
tsk("ecuador")

References

Muenchow, J., Brenning, A., Richter, M., 2012. Geomorphic process rates of landslides along a humidity gradient in the tropical Andes. Geomorphology, 139-140: 271-284.

See Also

Dictionary of Tasks: mlr_tasks

as.data.table(mlr_tasks) for a complete table of all (also dynamically created) Tasks.

Other Task: TaskClassifST, TaskRegrST, mlr_tasks_cookfarm_mlr3, mlr_tasks_diplodia


Create a Spatiotemporal Classification Task

Description

This task specializes Task and TaskSupervised for spatiotemporal classification problems. The target column is assumed to be a factor. The task_type is set to "classif" and "spatiotemporal".

A spatial example task is available via tsk("ecuador"), a spatiotemporal one via tsk("cookfarm_mlr3").

The coordinate reference system passed during initialization must match the one which was used during data creation, otherwise offsets of multiple meters may occur. By default, coordinates are not used as features. This can be changed by setting coords_as_features = TRUE.

Super classes

mlr3::Task -> mlr3::TaskSupervised -> mlr3::TaskClassif -> TaskClassifST

Active bindings

crs

(character(1))
Returns coordinate reference system of task.

coordinate_names

(character())
Coordinate names.

coords_as_features

(logical(1))
If TRUE, coordinates are used as features. This is a shortcut for task$set_col_roles(c("x", "y"), role = "feature") with the assumption that the coordinates in the data are named "x" and "y".

Methods

Public methods

Inherited methods

Method new()

Create a new spatiotemporal resampling Task

Usage
TaskClassifST$new(
  id,
  backend,
  target,
  positive = NULL,
  label = NA_character_,
  coordinate_names,
  crs = NA_character_,
  coords_as_features = FALSE,
  extra_args = list()
)
Arguments
id

(character(1))
Identifier for the new instance.

backend

(DataBackend)
Either a DataBackend, or any object which is convertible to a DataBackend with as_data_backend(). E.g., an sf object will be converted to a DataBackendDataTable.

target

(character(1))
Name of the target column.

positive

(character(1))
Only for binary classification: Name of the positive class. The levels of the target columns are reordered accordingly, so that the first element of ⁠$class_names⁠ is the positive class, and the second element is the negative class.

label

(character(1))
Label for the new instance. Shown in as.data.table(mlr_tasks).

coordinate_names

(character())
The column names of the coordinates in the data.

crs

(character(1))
Coordinate reference system. WKT2 or EPSG string.

coords_as_features

(logical(1))
If TRUE, coordinates are used as features. This is a shortcut for task$set_col_roles(c("x", "y"), role = "feature") with the assumption that the coordinates in the data are named "x" and "y".

extra_args

(named list())
Named list of constructor arguments, required for converting task types via convert_task().


Method coordinates()

Returns coordinates of observations.

Usage
TaskClassifST$coordinates(row_ids = NULL)
Arguments
row_ids

(integer())
Vector of row indices as subset of task$row_ids.

Returns

data.table::data.table()


Method print()

Print the task.

Usage
TaskClassifST$print(...)
Arguments
...

Arguments passed to the ⁠$print()⁠ method of the superclass.


Method clone()

The objects of this class are cloneable with this method.

Usage
TaskClassifST$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

See Also

Other Task: TaskRegrST, mlr_tasks_cookfarm_mlr3, mlr_tasks_diplodia, mlr_tasks_ecuador

Examples

if (mlr3misc::require_namespaces(c("sf", "blockCV"), quietly = TRUE)) {
  task = as_task_classif_st(ecuador,
    target = "slides",
    positive = "TRUE", coordinate_names = c("x", "y")
  )

  # passing objects of class 'sf' is also supported
  data_sf = sf::st_as_sf(ecuador, coords = c("x", "y"))
  task = as_task_classif_st(data_sf, target = "slides", positive = "TRUE")

  task$task_type
  task$formula()
  task$class_names
  task$positive
  task$negative
  task$coordinates()
  task$coordinate_names
}

Create a Spatiotemporal Regression Task

Description

This task specializes Task and TaskSupervised for spatiotemporal regression problems.

A spatial example task is available via tsk("ecuador"), a spatiotemporal one via tsk("cookfarm_mlr3").

The coordinate reference system passed during initialization must match the one which was used during data creation, otherwise offsets of multiple meters may occur. By default, coordinates are not used as features. This can be changed by setting coords_as_features = TRUE.

Super classes

mlr3::Task -> mlr3::TaskSupervised -> mlr3::TaskRegr -> TaskRegrST

Active bindings

crs

(character(1))
Returns coordinate reference system of task.

coordinate_names

(character())
Coordinate names.

coords_as_features

(logical(1))
If TRUE, coordinates are used as features. This is a shortcut for task$set_col_roles(c("x", "y"), role = "feature") with the assumption that the coordinates in the data are named "x" and "y".

Methods

Public methods

Inherited methods

Method new()

Create a new spatiotemporal resampling Task

Usage
TaskRegrST$new(
  id,
  backend,
  target,
  label = NA_character_,
  coordinate_names,
  crs = NA_character_,
  coords_as_features = FALSE,
  extra_args = list()
)
Arguments
id

(character(1))
Identifier for the new instance.

backend

(DataBackend)
Either a DataBackend, or any object which is convertible to a DataBackend with as_data_backend(). E.g., an sf object will be converted to a DataBackendDataTable.

target

(character(1))
Name of the target column.

label

(character(1))
Label for the new instance. Shown in as.data.table(mlr_tasks).

coordinate_names

(character())
The column names of the coordinates in the data.

crs

(character(1))
Coordinate reference system. WKT2 or EPSG string.

coords_as_features

(logical(1))
If TRUE, coordinates are used as features. This is a shortcut for task$set_col_roles(c("x", "y"), role = "feature") with the assumption that the coordinates in the data are named "x" and "y".

extra_args

(named list())
Named list of constructor arguments, required for converting task types via convert_task().


Method coordinates()

Returns coordinates of observations.

Usage
TaskRegrST$coordinates(row_ids = NULL)
Arguments
row_ids

(integer())
Vector of row indices as subset of task$row_ids.

Returns

data.table::data.table()


Method print()

Print the task.

Usage
TaskRegrST$print(...)
Arguments
...

Arguments passed to the ⁠$print()⁠ method of the superclass.


Method clone()

The objects of this class are cloneable with this method.

Usage
TaskRegrST$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

See Also

Other Task: TaskClassifST, mlr_tasks_cookfarm_mlr3, mlr_tasks_diplodia, mlr_tasks_ecuador