Package 'mlr3benchmark' reference manual

Title:	Analysis and Visualisation of Benchmark Experiments
Description:	Implements methods for post-hoc analysis and visualisation of benchmark experiments, for 'mlr3' and beyond.
Authors:	Sonabend Raphael [aut] , Florian Pfisterer [aut] , Michel Lang [ctb] , Bernd Bischl [ctb] , Sebastian Fischer [cre, ctb]
Maintainer:	Sebastian Fischer <[email protected]>
License:	LGPL-3
Version:	0.1.6
Built:	2025-03-27 03:06:57 UTC
Source:	https://github.com/mlr-org/mlr3benchmark

mlr3benchmark: Analysis and Visualisation of Benchmark Experiments

Description

logo

Implements methods for post-hoc analysis and visualisation of benchmark experiments, for 'mlr3' and beyond.

Author(s)

Maintainer: Sebastian Fischer [email protected] (ORCID) [contributor]

Authors:

Sonabend Raphael [email protected] (ORCID)
Florian Pfisterer [email protected] (ORCID)

Other contributors:

Michel Lang [email protected] (ORCID) [contributor]
Bernd Bischl [email protected] (ORCID) [contributor]

Coercions to BenchmarkAggr

Description

Coercion methods to BenchmarkAggr. For mlr3::BenchmarkResult this is a simple wrapper around the BenchmarkAggr constructor called with mlr3::BenchmarkResult⁠$aggregate()⁠.

Usage

as_benchmark_aggr(
  obj,
  task_id = "task_id",
  learner_id = "learner_id",
  independent = TRUE,
  strip_prefix = TRUE,
  ...
)
as_benchmark_aggr(
  obj,
  task_id = "task_id",
  learner_id = "learner_id",
  independent = TRUE,
  strip_prefix = TRUE,
  ...
)

Arguments

`obj`	(mlr3::BenchmarkResult\|`matrix(1)`) Passed to BenchmarkAggr`⁠$new()⁠`.
`task_id`, `learner_id`, `independent`, `strip_prefix`	See BenchmarkAggr`⁠$initialize()⁠`.
`...`	`ANY` Passed to mlr3::BenchmarkResult`⁠$aggregate()⁠`.

Examples

df = data.frame(tasks = factor(rep(c("A", "B"), each = 5),
                               levels = c("A", "B")),
                learners = factor(paste0("L", 1:5)),
                RMSE = runif(10), MAE = runif(10))

as_benchmark_aggr(df, task_id = "tasks", learner_id = "learners")


if (requireNamespaces(c("mlr3", "rpart"))) {
  library(mlr3)
  task = tsks(c("boston_housing", "mtcars"))
  learns = lrns(c("regr.featureless", "regr.rpart"))
  bm = benchmark(benchmark_grid(task, learns, rsmp("cv", folds = 2)))

  # default measure
  as_benchmark_aggr(bm)

  # change measure
  as_benchmark_aggr(bm, measures = msr("regr.rmse"))
}

df = data.frame(tasks = factor(rep(c("A", "B"), each = 5),
                               levels = c("A", "B")),
                learners = factor(paste0("L", 1:5)),
                RMSE = runif(10), MAE = runif(10))

as_benchmark_aggr(df, task_id = "tasks", learner_id = "learners")


if (requireNamespaces(c("mlr3", "rpart"))) {
  library(mlr3)
  task = tsks(c("boston_housing", "mtcars"))
  learns = lrns(c("regr.featureless", "regr.rpart"))
  bm = benchmark(benchmark_grid(task, learns, rsmp("cv", folds = 2)))

  # default measure
  as_benchmark_aggr(bm)

  # change measure
  as_benchmark_aggr(bm, measures = msr("regr.rmse"))
}

Coercions to BenchmarkAggr

Description

This function is deprecated, use as_benchmark_aggr() instead.

Coercion methods to BenchmarkAggr. For mlr3::BenchmarkResult this is a simple wrapper around the BenchmarkAggr constructor called with mlr3::BenchmarkResult⁠$aggregate()⁠.

Usage

as.BenchmarkAggr(
  obj,
  task_id = "task_id",
  learner_id = "learner_id",
  independent = TRUE,
  strip_prefix = TRUE,
  ...
)
as.BenchmarkAggr(
  obj,
  task_id = "task_id",
  learner_id = "learner_id",
  independent = TRUE,
  strip_prefix = TRUE,
  ...
)

Arguments

`obj`	(mlr3::BenchmarkResult\|`matrix(1)`) Passed to BenchmarkAggr`⁠$new()⁠`.
`task_id`, `learner_id`, `independent`, `strip_prefix`	See BenchmarkAggr`⁠$initialize()⁠`.
`...`	`ANY` Passed to mlr3::BenchmarkResult`⁠$aggregate()⁠`.

Examples

df = data.frame(tasks = factor(rep(c("A", "B"), each = 5),
                               levels = c("A", "B")),
                learners = factor(paste0("L", 1:5)),
                RMSE = runif(10), MAE = runif(10))

as_benchmark_aggr(df, task_id = "tasks", learner_id = "learners")


if (requireNamespaces(c("mlr3", "rpart"))) {
  library(mlr3)
  task = tsks(c("boston_housing", "mtcars"))
  learns = lrns(c("regr.featureless", "regr.rpart"))
  bm = benchmark(benchmark_grid(task, learns, rsmp("cv", folds = 2)))

  # default measure
  as_benchmark_aggr(bm)

  # change measure
  as_benchmark_aggr(bm, measures = msr("regr.rmse"))
}

df = data.frame(tasks = factor(rep(c("A", "B"), each = 5),
                               levels = c("A", "B")),
                learners = factor(paste0("L", 1:5)),
                RMSE = runif(10), MAE = runif(10))

as_benchmark_aggr(df, task_id = "tasks", learner_id = "learners")


if (requireNamespaces(c("mlr3", "rpart"))) {
  library(mlr3)
  task = tsks(c("boston_housing", "mtcars"))
  learns = lrns(c("regr.featureless", "regr.rpart"))
  bm = benchmark(benchmark_grid(task, learns, rsmp("cv", folds = 2)))

  # default measure
  as_benchmark_aggr(bm)

  # change measure
  as_benchmark_aggr(bm, measures = msr("regr.rmse"))
}

Plots for BenchmarkAggr

Description

Generates plots for BenchmarkAggr, all assume that there are multiple, independent, tasks. Choices depending on the argument type:

"mean" (default): Assumes there are at least two independent tasks. Plots the sample mean of the measure for all learners with error bars computed with the standard error of the mean.
"box": Boxplots for each learner calculated over all tasks for a given measure.
"fn": Plots post-hoc Friedman-Nemenyi by first calling BenchmarkAggr⁠$friedman_posthoc⁠ and plotting significant pairs in coloured squares and leaving non-significant pairs blank, useful for simply visualising pair-wise comparisons.
"cd": Critical difference plots (Demsar, 2006). Learners are drawn on the x-axis according to their average rank with the best performing on the left and decreasing performance going right. Any learners not connected by a horizontal bar are significantly different in performance. Critical differences are calculated as:

$CD = q_{\alpha} \sqrt{\left(\frac{k(k+1)}{6N}\right)}$

Where $q_\alpha$ is based on the studentized range statistic. See references for further details. It's recommended to crop white space using external tools, or function image_trim() from package magick.

Usage

## S3 method for class 'BenchmarkAggr'
autoplot(
  object,
  type = c("mean", "box", "fn", "cd"),
  meas = NULL,
  level = 0.95,
  p.value = 0.05,
  minimize = TRUE,
  test = "nem",
  baseline = NULL,
  style = 1L,
  ratio = 1/7,
  col = "red",
  friedman_global = TRUE,
  ...
)
## S3 method for class 'BenchmarkAggr'
autoplot(
  object,
  type = c("mean", "box", "fn", "cd"),
  meas = NULL,
  level = 0.95,
  p.value = 0.05,
  minimize = TRUE,
  test = "nem",
  baseline = NULL,
  style = 1L,
  ratio = 1/7,
  col = "red",
  friedman_global = TRUE,
  ...
)

Arguments

`object`	(BenchmarkAggr) The benchmark aggregation object.
`type`	`(character(1))` Type of plot, see description.
`meas`	`(character(1))` Measure to plot, should be in `obj$measures`, can be `NULL` if only one measure is in `obj`.
`level`	`(numeric(1))` Confidence level for error bars for `type = "mean"`
`p.value`	`(numeric(1))` What value should be considered significant for `type = "cd"` and `type = "fn"`.
`minimize`	`(logical(1))` For `type = "cd"`, indicates if the measure is optimally minimized. Default is `TRUE`.
`test`	(`⁠character(1))⁠`) For `type = "cd"`, critical differences are either computed between all learners (`test = "nemenyi"`), or to a baseline (`test = "bd"`). Bonferroni-Dunn usually yields higher power than Nemenyi as it only compares algorithms to one baseline. Default is `"nemenyi"`.
`baseline`	`(character(1))` For `type = "cd"` and `test = "bd"` a baseline learner to compare the other learners to, should be in `⁠$learners⁠`, if `NULL` then differences are compared to the best performing learner.
`style`	`(integer(1))` For `type = "cd"` two ggplot styles are shipped with the package (`style = 1` or `style = 2`), otherwise the data can be accessed via the returned ggplot.
`ratio`	(`numeric(1)`) For `type = "cd"` and `style = 1`, passed to `ggplot2::coord_fixed()`, useful for quickly specifying the aspect ratio of the plot, best used with `ggsave()`.
`col`	(`character(1)`) For `type = "fn"`, specifies color to fill significant tiles, default is `"red"`.
`friedman_global`	(`logical(1)`) Should a friedman global test be performed for`type = "cd"` and `type = "fn"`? If `FALSE`, a warning is issued in case the corresponding friedman posthoc test fails instead of an error. Default is `TRUE` (raises an error if global test fails).
`...`	`ANY` Additional arguments, currently unused.

Value

The generated plot.

References

Demšar J (2006). “Statistical Comparisons of Classifiers over Multiple Data Sets.” Journal of Machine Learning Research, 7(1), 1-30. https://jmlr.org/papers/v7/demsar06a.html.

Examples

if (requireNamespaces(c("mlr3learners", "mlr3", "rpart", "xgboost"))) {
library(mlr3)
library(mlr3learners)
library(ggplot2)

set.seed(1)
task = tsks(c("iris", "sonar", "wine", "zoo"))
learns = lrns(c("classif.featureless", "classif.rpart", "classif.xgboost"))
bm = benchmark(benchmark_grid(task, learns, rsmp("cv", folds = 3)))
obj = as_benchmark_aggr(bm)

# mean and error bars
autoplot(obj, type = "mean", level = 0.95)

if (requireNamespace("PMCMRplus", quietly = TRUE)) {
  # critical differences
  autoplot(obj, type = "cd",style = 1)
  autoplot(obj, type = "cd",style = 2)

  # post-hoc friedman-nemenyi
  autoplot(obj, type = "fn")
}

}

if (requireNamespaces(c("mlr3learners", "mlr3", "rpart", "xgboost"))) {
library(mlr3)
library(mlr3learners)
library(ggplot2)

set.seed(1)
task = tsks(c("iris", "sonar", "wine", "zoo"))
learns = lrns(c("classif.featureless", "classif.rpart", "classif.xgboost"))
bm = benchmark(benchmark_grid(task, learns, rsmp("cv", folds = 3)))
obj = as_benchmark_aggr(bm)

# mean and error bars
autoplot(obj, type = "mean", level = 0.95)

if (requireNamespace("PMCMRplus", quietly = TRUE)) {
  # critical differences
  autoplot(obj, type = "cd",style = 1)
  autoplot(obj, type = "cd",style = 2)

  # post-hoc friedman-nemenyi
  autoplot(obj, type = "fn")
}

}

Aggregated Benchmark Result Object

Description

An R6 class for aggregated benchmark results.

Details

This class is used to easily carry out and guide analysis of models after aggregating the results after resampling. This can either be constructed using mlr3 objects, for example the result of mlr3::BenchmarkResult⁠$aggregate⁠ or via as_benchmark_aggr, or by passing in a custom dataset of results. Custom datasets must include at the very least, a character column for learner ids, a character column for task ids, and numeric columns for one or more measures.

Currently supported for multiple independent datasets only.

Active bindings

data: (data.table::data.table)
Aggregated data.
learners: (character())
Unique learner names.
tasks: (character())
Unique task names.
measures: (character())
Unique measure names.
nlrns: (integer())
Number of learners.
ntasks: (integer())
Number of tasks.
nmeas: (integer())
Number of measures.
nrow: (integer())
Number of rows.
col_roles: (character())
Column roles, currently cannot be changed after construction.

Methods

Method `new()`

Creates a new instance of this R6 class.

Usage

BenchmarkAggr$new(
  dt,
  task_id = "task_id",
  learner_id = "learner_id",
  independent = TRUE,
  strip_prefix = TRUE,
  ...
)

Arguments

dt: (matrix(1))
' matrix like object coercable to data.table::data.table, should include column names "task_id" and "learner_id", and at least one measure (numeric). If ids are not already factors then coerced internally.
task_id: (character(1))
String specifying name of task id column.
learner_id: (character(1))
String specifying name of learner id column.
independent: (logical(1))
Are tasks independent of one another? Affects which tests can be used for analysis.
strip_prefix: (logical(1))
If TRUE (default) then mlr prefixes, e.g. regr., classif., are automatically stripped from the learner_id.
...: ANY
Additional arguments, currently unused.

Method `print()`

Prints the internal data via data.table::print.data.table.

Usage

BenchmarkAggr$print(...)

Arguments

...: ANY
Passed to data.table::print.data.table.

Method `summary()`

Prints the internal data via data.table::print.data.table.

Usage

BenchmarkAggr$summary(...)

Arguments

...: ANY
Passed to data.table::print.data.table.

Method `rank_data()`

Ranks the aggregated data given some measure.

Usage

BenchmarkAggr$rank_data(meas = NULL, minimize = TRUE, task = NULL, ...)

Arguments

meas: (character(1))
Measure to rank the data against, should be in ⁠$measures⁠. Can be NULL if only one measure in data.
minimize: (logical(1))
Should the measure be minimized? Default is TRUE.
task: (character(1))
If NULL then returns a matrix of ranks where columns are tasks and rows are learners, otherwise returns a one-column matrix of a specified task, should be in ⁠$tasks⁠.
...: ANY ANY
Passed to data.table::frank().

Method `friedman_test()`

Computes Friedman test over all tasks, assumes datasets are independent.

Usage

BenchmarkAggr$friedman_test(meas = NULL, p.adjust.method = NULL)

Arguments

meas: (character(1))
Measure to rank the data against, should be in ⁠$measures⁠. If no measure is provided then returns a matrix of tests for all measures.
p.adjust.method: (character(1))
Passed to p.adjust if meas = NULL for multiple testing correction. If NULL then no correction applied.

Method `friedman_posthoc()`

Posthoc Friedman Nemenyi tests. Computed with PMCMRplus::frdAllPairsNemenyiTest. If global ⁠$friedman_test⁠ is non-significant then this is returned and no post-hocs computed. Also returns critical difference

Usage

BenchmarkAggr$friedman_posthoc(
  meas = NULL,
  p.value = 0.05,
  friedman_global = TRUE
)

Arguments

meas: (character(1))
Measure to rank the data against, should be in ⁠$measures⁠. Can be NULL if only one measure in data.
p.value: (numeric(1))
p.value for which the global test will be considered significant.
friedman_global: (logical(1))
Should a friedman global test be performed before conducting the posthoc test? If FALSE, a warning is issued in case the corresponding friedman global test fails instead of an error. Default is TRUE (raises an error if global test fails).

Method `subset()`

Subsets the data by given tasks or learners. Returns data as data.table::data.table.

Usage

BenchmarkAggr$subset(task = NULL, learner = NULL)

Arguments

task: (character())
Task(s) to subset the data by.
learner: (character())
Learner(s) to subset the data by.

Method `clone()`

The objects of this class are cloneable with this method.

Usage

BenchmarkAggr$clone(deep = FALSE)

Arguments

deep: Whether to make a deep clone.

References

'r format_bib("demsar_2006")

Examples

# Not restricted to mlr3 objects
df = data.frame(tasks = factor(rep(c("A", "B"), each = 5),
                               levels = c("A", "B")),
                learners = factor(paste0("L", 1:5)),
                RMSE = runif(10), MAE = runif(10))
as_benchmark_aggr(df, task_id = "tasks", learner_id = "learners")

if (requireNamespaces(c("mlr3", "rpart"))) {
  library(mlr3)
  task = tsks(c("boston_housing", "mtcars"))
  learns = lrns(c("regr.featureless", "regr.rpart"))
  bm = benchmark(benchmark_grid(task, learns, rsmp("cv", folds = 2)))

  # coercion
  as_benchmark_aggr(bm)
}
# Not restricted to mlr3 objects
df = data.frame(tasks = factor(rep(c("A", "B"), each = 5),
                               levels = c("A", "B")),
                learners = factor(paste0("L", 1:5)),
                RMSE = runif(10), MAE = runif(10))
as_benchmark_aggr(df, task_id = "tasks", learner_id = "learners")

if (requireNamespaces(c("mlr3", "rpart"))) {
  library(mlr3)
  task = tsks(c("boston_housing", "mtcars"))
  learns = lrns(c("regr.featureless", "regr.rpart"))
  bm = benchmark(benchmark_grid(task, learns, rsmp("cv", folds = 2)))

  # coercion
  as_benchmark_aggr(bm)
}

Helper Vectorizing requireNamespace

Description

Internal helper function for documentation.

Usage

requireNamespaces(x)
requireNamespaces(x)

Arguments

`x`	Packages to check.

Value

A logical(1), indicating wether all required packages are available.

Package 'mlr3benchmark'

Help Index

mlr3benchmark: Analysis and Visualisation of Benchmark Experiments

Description

Author(s)

See Also

Coercions to BenchmarkAggr

Description

Usage

Arguments

Examples

Coercions to BenchmarkAggr

Description

Usage

Arguments

Examples

Plots for BenchmarkAggr

Description

Usage

Arguments

Value

References

Examples

Aggregated Benchmark Result Object

Description

Details

Active bindings

Methods

Public methods

Method new()

Usage

Arguments

Method print()

Usage

Arguments

Method summary()

Usage

Arguments

Method rank_data()

Usage

Arguments

Method friedman_test()

Usage

Arguments

Method friedman_posthoc()

Usage

Arguments

Method subset()

Usage

Arguments

Method clone()

Usage

Arguments

References

Examples

Helper Vectorizing requireNamespace

Description

Usage

Arguments

Value

Method `new()`

Method `print()`

Method `summary()`

Method `rank_data()`

Method `friedman_test()`

Method `friedman_posthoc()`

Method `subset()`

Method `clone()`