Title: | Analysis and Visualisation of Benchmark Experiments |
---|---|
Description: | Implements methods for post-hoc analysis and visualisation of benchmark experiments, for 'mlr3' and beyond. |
Authors: | Sonabend Raphael [aut], Florian Pfisterer [aut], Michel Lang [ctb], Bernd Bischl [ctb], Sebastian Fischer [cre, ctb] |
Maintainer: | Sebastian Fischer <[email protected]> |
License: | LGPL-3 |
Version: | 0.1.6 |
Built: | 2024-12-27 02:57:19 UTC |
Source: | https://github.com/mlr-org/mlr3benchmark |
Implements methods for post-hoc analysis and visualisation of benchmark experiments, for 'mlr3' and beyond.
Maintainer: Sebastian Fischer [email protected] (ORCID) [contributor]
Authors:
Sonabend Raphael [email protected] (ORCID)
Florian Pfisterer [email protected] (ORCID)
Other contributors:
Michel Lang [email protected] (ORCID) [contributor]
Bernd Bischl [email protected] (ORCID) [contributor]
Useful links:
Report bugs at https://github.com/mlr-org/mlr3benchmark/issues
Coercion methods to BenchmarkAggr. For mlr3::BenchmarkResult this is a simple
wrapper around the BenchmarkAggr constructor called with mlr3::BenchmarkResult$aggregate().
as_benchmark_aggr(
  obj,
  task_id = "task_id",
  learner_id = "learner_id",
  independent = TRUE,
  strip_prefix = TRUE,
  ...
)
obj |
(mlr3::BenchmarkResult| |
task_id, learner_id, independent, strip_prefix |
See BenchmarkAggr |
... |
|
df = data.frame(
  tasks = factor(rep(c("A", "B"), each = 5), levels = c("A", "B")),
  learners = factor(paste0("L", 1:5)),
  RMSE = runif(10),
  MAE = runif(10)
)
as_benchmark_aggr(df, task_id = "tasks", learner_id = "learners")

if (requireNamespaces(c("mlr3", "rpart"))) {
  library(mlr3)
  task = tsks(c("boston_housing", "mtcars"))
  learns = lrns(c("regr.featureless", "regr.rpart"))
  bm = benchmark(benchmark_grid(task, learns, rsmp("cv", folds = 2)))

  # default measure
  as_benchmark_aggr(bm)

  # change measure
  as_benchmark_aggr(bm, measures = msr("regr.rmse"))
}
This function is deprecated; use as_benchmark_aggr() instead.
Coercion methods to BenchmarkAggr. For mlr3::BenchmarkResult this is a simple
wrapper around the BenchmarkAggr constructor called with mlr3::BenchmarkResult$aggregate().
as.BenchmarkAggr(
  obj,
  task_id = "task_id",
  learner_id = "learner_id",
  independent = TRUE,
  strip_prefix = TRUE,
  ...
)
obj |
(mlr3::BenchmarkResult| |
task_id, learner_id, independent, strip_prefix |
See BenchmarkAggr |
... |
|
df = data.frame(
  tasks = factor(rep(c("A", "B"), each = 5), levels = c("A", "B")),
  learners = factor(paste0("L", 1:5)),
  RMSE = runif(10),
  MAE = runif(10)
)
as_benchmark_aggr(df, task_id = "tasks", learner_id = "learners")

if (requireNamespaces(c("mlr3", "rpart"))) {
  library(mlr3)
  task = tsks(c("boston_housing", "mtcars"))
  learns = lrns(c("regr.featureless", "regr.rpart"))
  bm = benchmark(benchmark_grid(task, learns, rsmp("cv", folds = 2)))

  # default measure
  as_benchmark_aggr(bm)

  # change measure
  as_benchmark_aggr(bm, measures = msr("regr.rmse"))
}
Generates plots for BenchmarkAggr; all plot types assume that there are multiple independent tasks.
The available choices depend on the argument type:
"mean" (default): Assumes there are at least two independent tasks. Plots the sample mean of the measure for all learners, with error bars computed from the standard error of the mean.
"box": Boxplots for each learner, calculated over all tasks for a given measure.
"fn": Plots post-hoc Friedman-Nemenyi results by first calling BenchmarkAggr$friedman_posthoc, then drawing significant pairs as coloured squares and leaving non-significant pairs blank; useful for simply visualising pairwise comparisons.
"cd": Critical difference plots (Demšar, 2006). Learners are drawn on the x-axis according to their average rank, with the best performing on the left and performance decreasing to the right. Any learners not connected by a horizontal bar are significantly different in performance.
Critical differences are calculated as:
$$CD = q_{\alpha} \sqrt{\frac{k(k+1)}{6N}}$$
where $q_{\alpha}$ is based on the studentized range statistic, $k$ is the number of learners, and $N$ is the number of tasks.
See references for further details.
It is recommended to crop white space using external tools, or the function image_trim() from the package magick.
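The critical difference used by the "cd" plot can be reproduced by hand. The following is a minimal base R sketch of the formula in Demšar (2006), CD = q_alpha * sqrt(k(k+1) / (6N)), with k learners and N tasks. The function name critical_difference and the example average ranks are hypothetical and chosen for illustration only; this is not necessarily how the package computes the value internally.

```r
# Critical difference for the Nemenyi test (Demsar, 2006).
# k: number of learners, n_tasks: number of independent tasks.
# q_alpha is the studentized range quantile (infinite degrees of
# freedom) divided by sqrt(2).
critical_difference = function(k, n_tasks, alpha = 0.05) {
  q_alpha = qtukey(1 - alpha, nmeans = k, df = Inf) / sqrt(2)
  q_alpha * sqrt(k * (k + 1) / (6 * n_tasks))
}

# Hypothetical setting: 3 learners compared on 4 tasks.
cd = critical_difference(k = 3, n_tasks = 4)

# Hypothetical average ranks; two learners differ significantly
# when their average ranks differ by more than the critical difference.
avg_rank = c(featureless = 3.0, rpart = 1.75, xgboost = 1.25)
significant = abs(outer(avg_rank, avg_rank, "-")) > cd
print(cd)
print(significant)
```

With few tasks the critical difference is large, so only learners whose average ranks are far apart would be separated in the plot.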
## S3 method for class 'BenchmarkAggr'
autoplot(
  object,
  type = c("mean", "box", "fn", "cd"),
  meas = NULL,
  level = 0.95,
  p.value = 0.05,
  minimize = TRUE,
  test = "nem",
  baseline = NULL,
  style = 1L,
  ratio = 1/7,
  col = "red",
  friedman_global = TRUE,
  ...
)
object |
(BenchmarkAggr) |
type |
|
meas |
|
level |
|
p.value |
|
minimize |
|
test |
( |
baseline |
|
style |
|
ratio |
( |
col |
( |
friedman_global |
( |
... |
|
The generated plot.
Demšar J (2006). “Statistical Comparisons of Classifiers over Multiple Data Sets.” Journal of Machine Learning Research, 7(1), 1-30. https://jmlr.org/papers/v7/demsar06a.html.
if (requireNamespaces(c("mlr3learners", "mlr3", "rpart", "xgboost"))) {
  library(mlr3)
  library(mlr3learners)
  library(ggplot2)

  set.seed(1)
  task = tsks(c("iris", "sonar", "wine", "zoo"))
  learns = lrns(c("classif.featureless", "classif.rpart", "classif.xgboost"))
  bm = benchmark(benchmark_grid(task, learns, rsmp("cv", folds = 3)))
  obj = as_benchmark_aggr(bm)

  # mean and error bars
  autoplot(obj, type = "mean", level = 0.95)

  if (requireNamespace("PMCMRplus", quietly = TRUE)) {
    # critical differences
    autoplot(obj, type = "cd", style = 1)
    autoplot(obj, type = "cd", style = 2)

    # post-hoc friedman-nemenyi
    autoplot(obj, type = "fn")
  }
}
An R6 class for aggregated benchmark results.
This class is used to carry out and guide the analysis of models once their
resampling results have been aggregated. It can be constructed from mlr3 objects,
for example the result of mlr3::BenchmarkResult$aggregate or via as_benchmark_aggr,
or by passing in a custom dataset of results. Custom datasets must include, at the very least,
a character column for learner ids, a character column for task ids, and numeric columns for
one or more measures.
Currently supported for multiple independent datasets only.
data
(data.table::data.table)
Aggregated data.
learners
(character())
Unique learner names.
tasks
(character())
Unique task names.
measures
(character())
Unique measure names.
nlrns
(integer())
Number of learners.
ntasks
(integer())
Number of tasks.
nmeas
(integer())
Number of measures.
nrow
(integer())
Number of rows.
col_roles
(character())
Column roles, currently cannot be changed after construction.
new()
Creates a new instance of this R6 class.
BenchmarkAggr$new( dt, task_id = "task_id", learner_id = "learner_id", independent = TRUE, strip_prefix = TRUE, ... )
dt
(matrix(1))
matrix-like object coercible to data.table::data.table. Should include the task and
learner id columns (named "task_id" and "learner_id" by default) and at least one measure (numeric).
If the ids are not already factors then they are coerced internally.
task_id
(character(1))
String specifying name of task id column.
learner_id
(character(1))
String specifying name of learner id column.
independent
(logical(1))
Are tasks independent of one another? Affects which tests can be used for analysis.
strip_prefix
(logical(1))
If TRUE (default) then mlr prefixes, e.g. regr., classif., are automatically
stripped from the learner_id.
...
ANY
Additional arguments, currently unused.
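To illustrate what strip_prefix = TRUE does to learner ids, here is a minimal base R sketch. It assumes only the two prefixes named above (regr. and classif.); the helper strip_mlr_prefix is hypothetical and the package may handle further prefixes or use a different implementation.

```r
# Hypothetical helper: drop a leading "regr." or "classif." from learner ids.
# Ids without such a prefix are returned unchanged.
strip_mlr_prefix = function(ids) {
  sub("^(regr|classif)\\.", "", ids)
}

ids = c("regr.rpart", "classif.featureless", "my_learner")
stripped = strip_mlr_prefix(ids)
print(stripped)
```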
print()
Prints the internal data via data.table::print.data.table.
BenchmarkAggr$print(...)
...
ANY
Passed to data.table::print.data.table.
summary()
Prints the internal data via data.table::print.data.table.
BenchmarkAggr$summary(...)
...
ANY
Passed to data.table::print.data.table.
rank_data()
Ranks the aggregated data given some measure.
BenchmarkAggr$rank_data(meas = NULL, minimize = TRUE, task = NULL, ...)
meas
(character(1))
Measure to rank the data against, should be in $measures. Can be NULL if only one measure
in data.
minimize
(logical(1))
Should the measure be minimized? Default is TRUE.
task
(character(1))
If NULL then returns a matrix of ranks where columns are tasks and rows are
learners, otherwise returns a one-column matrix of a specified task, should
be in $tasks.
...
ANY
Passed to data.table::frank().
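The ranking described for rank_data() can be sketched in base R: learners are ranked within each task, and the result is arranged as a matrix with learners in rows and tasks in columns. The data, the column names, and the function rank_sketch below are hypothetical. The method itself passes arguments to data.table::frank(), whereas this sketch uses base rank() and is only meant to show the idea.

```r
# Hypothetical aggregated results: 3 learners on 3 tasks, one measure.
scores = data.frame(
  task_id = rep(c("A", "B", "C"), each = 3),
  learner_id = rep(c("L1", "L2", "L3"), times = 3),
  rmse = c(0.2, 0.5, 0.9, 0.3, 0.1, 0.8, 0.4, 0.6, 0.7)
)

rank_sketch = function(df, meas, minimize = TRUE) {
  # Flip the sign when larger values are better, so rank 1 is always best.
  values = if (minimize) df[[meas]] else -df[[meas]]
  # Rank learners within each task.
  ranks = ave(values, df$task_id, FUN = rank)
  # One value per (learner, task) cell, so mean() just returns that value.
  tapply(ranks, list(learner = df$learner_id, task = df$task_id), FUN = mean)
}

rank_matrix = rank_sketch(scores, "rmse")
print(rank_matrix)
# Average rank per learner across tasks, as used in critical difference plots.
print(rowMeans(rank_matrix))
```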
friedman_test()
Computes Friedman test over all tasks, assumes datasets are independent.
BenchmarkAggr$friedman_test(meas = NULL, p.adjust.method = NULL)
meas
(character(1))
Measure to rank the data against, should be in $measures. If no measure is provided
then returns a matrix of tests for all measures.
p.adjust.method
(character(1))
Passed to p.adjust if meas = NULL for multiple testing correction. If NULL
then no correction applied.
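As an illustration of the test behind friedman_test(), the sketch below runs stats::friedman.test() on a matrix with tasks as blocks (rows) and learners as groups (columns), then corrects the p-values of two measures with p.adjust(). The data are made up and the choice of "holm" is arbitrary; how the method arranges its own output is not shown here.

```r
# Hypothetical scores: 5 tasks (rows, blocks) by 3 learners (columns, groups).
tasks = paste0("task", 1:5)
learners = c("L1", "L2", "L3")

rmse = matrix(
  c(0.10, 0.20, 0.30,
    0.15, 0.25, 0.35,
    0.12, 0.22, 0.40,
    0.11, 0.30, 0.50,
    0.20, 0.21, 0.22),
  nrow = 5, byrow = TRUE, dimnames = list(tasks, learners)
)
mae = matrix(
  c(0.30, 0.10, 0.20,
    0.10, 0.20, 0.30,
    0.25, 0.15, 0.35,
    0.12, 0.22, 0.18,
    0.20, 0.30, 0.10),
  nrow = 5, byrow = TRUE, dimnames = list(tasks, learners)
)

# For a matrix input, friedman.test() treats rows as blocks and columns as groups.
ft_rmse = friedman.test(rmse)
ft_mae = friedman.test(mae)

# One test per measure, followed by a multiple testing correction.
p_values = c(rmse = ft_rmse$p.value, mae = ft_mae$p.value)
p_adjusted = p.adjust(p_values, method = "holm")
print(p_values)
print(p_adjusted)
```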
friedman_posthoc()
Post-hoc Friedman-Nemenyi tests, computed with
PMCMRplus::frdAllPairsNemenyiTest. If the global $friedman_test is non-significant then
this is returned and no post-hoc tests are computed. Also returns the critical difference.
BenchmarkAggr$friedman_posthoc( meas = NULL, p.value = 0.05, friedman_global = TRUE )
meas
(character(1))
Measure to rank the data against, should be in $measures. Can be NULL if only one measure
in data.
p.value
(numeric(1))
p.value for which the global test will be considered significant.
friedman_global
(logical(1))
Should a friedman global test be performed before conducting the posthoc
test? If FALSE, a warning is issued in case the corresponding friedman
global test fails instead of an error. Default is TRUE (raises an
error if global test fails).
subset()
Subsets the data by given tasks or learners. Returns data as data.table::data.table.
BenchmarkAggr$subset(task = NULL, learner = NULL)
task
(character())
Task(s) to subset the data by.
learner
(character())
Learner(s) to subset the data by.
clone()
The objects of this class are cloneable with this method.
BenchmarkAggr$clone(deep = FALSE)
deep
Whether to make a deep clone.
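Demšar J (2006). “Statistical Comparisons of Classifiers over Multiple Data Sets.” Journal of Machine Learning Research, 7(1), 1-30. https://jmlr.org/papers/v7/demsar06a.html.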
# Not restricted to mlr3 objects
df = data.frame(
  tasks = factor(rep(c("A", "B"), each = 5), levels = c("A", "B")),
  learners = factor(paste0("L", 1:5)),
  RMSE = runif(10),
  MAE = runif(10)
)
as_benchmark_aggr(df, task_id = "tasks", learner_id = "learners")

if (requireNamespaces(c("mlr3", "rpart"))) {
  library(mlr3)
  task = tsks(c("boston_housing", "mtcars"))
  learns = lrns(c("regr.featureless", "regr.rpart"))
  bm = benchmark(benchmark_grid(task, learns, rsmp("cv", folds = 2)))

  # coercion
  as_benchmark_aggr(bm)
}
Internal helper function for documentation.
requireNamespaces(x)
x |
Packages to check. |
A logical(1), indicating whether all required packages are available.
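A helper with this behaviour can be written in a few lines of base R. The sketch below is a hypothetical reimplementation (named require_namespaces_sketch to avoid confusion with the package function) and is not taken from the package source; it returns TRUE only if every listed package can be loaded.

```r
# Hypothetical reimplementation: TRUE only if all packages in x are available.
require_namespaces_sketch = function(x) {
  all(vapply(x, requireNamespace, logical(1), quietly = TRUE))
}

# "stats" and "utils" ship with R, so this is expected to be TRUE.
has_base = require_namespaces_sketch(c("stats", "utils"))
# A deliberately non-existent package name makes the result FALSE.
has_fake = require_namespaces_sketch(c("stats", "notARealPackage12345"))
print(c(has_base, has_fake))
```

Guarding examples this way lets them be skipped quietly when suggested packages are not installed.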