Title: | Feature Selection for 'mlr3' |
---|---|
Description: | Feature selection package of the 'mlr3' ecosystem. It selects the optimal feature set for any 'mlr3' learner. The package works with several optimization algorithms e.g. Random Search, Recursive Feature Elimination, and Genetic Search. Moreover, it can automatically optimize learners and estimate the performance of optimized feature sets with nested resampling. |
Authors: | Marc Becker [aut, cre] , Patrick Schratz [aut] , Michel Lang [aut] , Bernd Bischl [aut] , John Zobolas [aut] |
Maintainer: | Marc Becker <[email protected]> |
License: | LGPL-3 |
Version: | 1.3.0 |
Built: | 2025-01-16 10:23:04 UTC |
Source: | https://github.com/mlr-org/mlr3fselect |
Useful links:
Report bugs at https://github.com/mlr-org/mlr3fselect/issues
The ArchiveBatchFSelect stores all evaluated feature sets and performance scores.
The ArchiveBatchFSelect is a container around a data.table::data.table()
.
Each row corresponds to a single evaluation of a feature set.
See the section on Data Structure for more information.
The archive additionally stores a mlr3::BenchmarkResult ($benchmark_result) that records the resampling experiments.
Each experiment corresponds to a single evaluation of a feature set.
The table ($data
) and the benchmark result ($benchmark_result
) are linked by the uhash
column.
If the archive is passed to as.data.table()
, both are joined automatically.
The table ($data
) has the following columns:
One column for each feature of the task ($search_space
).
One column for each performance measure ($codomain
).
runtime_learners
(numeric(1)
)
Sum of training and predict times logged in learners per mlr3::ResampleResult / evaluation.
This does not include potential overhead time.
timestamp
(POSIXct
)
Time stamp when the evaluation was logged into the archive.
batch_nr
(integer(1)
)
Feature sets are evaluated in batches. Each batch has a unique batch number.
uhash
(character(1)
)
Connects each feature set to the resampling experiment stored in the mlr3::BenchmarkResult.
For analyzing the feature selection results, it is recommended to pass the archive to as.data.table()
.
The returned data table is joined with the benchmark result which adds the mlr3::ResampleResult for each feature set.
The archive provides various getters (e.g. $learners()
) to ease the access.
All getters extract by position (i
) or unique hash (uhash
).
For a complete list of all getters see the methods section.
The benchmark result ($benchmark_result) allows the feature sets to be scored again on a different measure.
Alternatively, measures can be supplied to as.data.table()
.
as.data.table.ArchiveBatchFSelect(x, exclude_columns = "uhash", measures = NULL)
Returns a tabular view of all evaluated feature sets.
ArchiveBatchFSelect -> data.table::data.table()
exclude_columns
(character()
)
Exclude columns from table. Set to NULL
if no column should be excluded.
measures
(list of mlr3::Measure)
Score feature sets on additional measures.
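The following is a minimal sketch of how the archive might be inspected after a small feature selection run. It assumes the fselect() helper of mlr3fselect and the "penguins" example task; the variable names (instance, archive, tab, tab_acc, rr) are chosen for illustration only.

```r
library(mlr3)
library(mlr3fselect)
library(data.table)

# Run a small random search so that the archive contains a few evaluations.
instance = fselect(
  fselector = fs("random_search"),
  task = tsk("penguins"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),
  measures = msr("classif.ce"),
  term_evals = 4
)
archive = instance$archive

# Tabular view: one row per evaluated feature set, joined with the benchmark result.
tab = as.data.table(archive)

# Score the stored feature sets on an additional measure.
tab_acc = as.data.table(archive, measures = msrs("classif.acc"))

# Getters extract by position (i) or by unique hash (uhash).
rr = archive$resample_result(i = 1)

# Best feature set over all batches, using the ties method of the archive.
archive$best()
```

Passing measures to as.data.table() avoids re-scoring the benchmark result by hand, at the cost of recomputing the scores on each call.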
bbotk::Archive
-> bbotk::ArchiveBatch
-> ArchiveBatchFSelect
benchmark_result
(mlr3::BenchmarkResult)
Benchmark result.
ties_method
(character(1)
)
Method to handle ties.
new()
Creates a new instance of this R6 class.
ArchiveBatchFSelect$new( search_space, codomain, check_values = TRUE, ties_method = "least_features" )
search_space
(paradox::ParamSet)
Search space.
Internally created from provided mlr3::Task by instance.
codomain
(bbotk::Codomain)
Specifies codomain of objective function i.e. a set of performance measures.
Internally created from provided mlr3::Measures by instance.
check_values
(logical(1)
)
If TRUE (default), hyperparameter configurations are checked for validity.
ties_method
(character(1)
)
The method to break ties when selecting sets while optimizing and when selecting the best set.
Can be "least_features"
or "random"
.
The option "least_features"
(default) selects the feature set with the least features.
If there are multiple best feature sets with the same number of features, one is selected randomly.
The random
method returns a random feature set from the best feature sets.
Ignored if multiple measures are used.
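The tie-breaking rule described for ties_method can be sketched in base R. This is an illustration of the documented logic under simplifying assumptions, not the package's internal code; break_ties() and its arguments are hypothetical names.

```r
# Hypothetical helper: pick one feature set among those with the best score.
break_ties = function(feature_sets, scores, ties_method = "least_features", minimize = TRUE) {
  best_score = if (minimize) min(scores) else max(scores)
  candidates = which(scores == best_score)
  if (ties_method == "least_features") {
    # Keep only the tied sets with the smallest number of features.
    n_features = lengths(feature_sets[candidates])
    candidates = candidates[n_features == min(n_features)]
  }
  # Any remaining ties are broken at random.
  if (length(candidates) > 1) candidates = sample(candidates, 1)
  feature_sets[[candidates]]
}

sets = list(
  c("bill_length", "bill_depth", "island"),
  c("bill_length", "island"),
  c("sex")
)
scores = c(0.10, 0.10, 0.25)  # classification error, lower is better

break_ties(sets, scores)
```

Here the first two sets tie on the score, and "least_features" picks the second one because it uses two features instead of three.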
add_evals()
Adds function evaluations to the archive table.
ArchiveBatchFSelect$add_evals(xdt, xss_trafoed = NULL, ydt)
xdt
(data.table::data.table()
)
x values as data.table
. Each row is one point. Contains the value in
the search space of the FSelectInstanceBatchMultiCrit object. Can contain
additional columns for extra information.
xss_trafoed
(list()
)
Ignored in feature selection.
ydt
(data.table::data.table()
)
Optimal outcome.
learner()
Retrieve mlr3::Learner of the i-th evaluation, by position or by unique hash uhash
.
i
and uhash
are mutually exclusive.
Learner does not contain a model. Use $learners()
to get learners with models.
ArchiveBatchFSelect$learner(i = NULL, uhash = NULL)
i
(integer(1)
)
The iteration value to filter for.
uhash
(character(1)
)
The uhash
value to filter for.
learners()
Retrieve list of trained mlr3::Learner objects of the i-th evaluation,
by position or by unique hash uhash
. i
and uhash
are mutually
exclusive.
ArchiveBatchFSelect$learners(i = NULL, uhash = NULL)
i
(integer(1)
)
The iteration value to filter for.
uhash
(character(1)
)
The uhash
value to filter for.
predictions()
Retrieve list of mlr3::Prediction objects of the i-th evaluation, by
position or by unique hash uhash
. i
and uhash
are mutually
exclusive.
ArchiveBatchFSelect$predictions(i = NULL, uhash = NULL)
i
(integer(1)
)
The iteration value to filter for.
uhash
(character(1)
)
The uhash
value to filter for.
resample_result()
Retrieve mlr3::ResampleResult of the i-th evaluation, by position
or by unique hash uhash
. i
and uhash
are mutually exclusive.
ArchiveBatchFSelect$resample_result(i = NULL, uhash = NULL)
i
(integer(1)
)
The iteration value to filter for.
uhash
(character(1)
)
The uhash
value to filter for.
print()
Printer.
ArchiveBatchFSelect$print()
...
(ignored).
best()
Returns the best scoring feature sets.
ArchiveBatchFSelect$best(batch = NULL, ties_method = NULL)
batch
(integer()
)
The batch number(s) to limit the best results to.
Default is all batches.
ties_method
(character(1)
)
Method to handle ties.
If NULL
(default), the global ties method set during initialization is used.
The default global ties method is least_features
which selects the feature set with the least features.
If there are multiple best feature sets with the same number of features, one is selected randomly.
The random
method returns a random feature set from the best feature sets.
clone()
The objects of this class are cloneable with this method.
ArchiveBatchFSelect$clone(deep = FALSE)
deep
Whether to make a deep clone.
The AutoFSelector wraps a mlr3::Learner and augments it with an automatic feature selection.
The auto_fselector()
function creates an AutoFSelector object.
auto_fselector( fselector, learner, resampling, measure = NULL, term_evals = NULL, term_time = NULL, terminator = NULL, store_fselect_instance = TRUE, store_benchmark_result = TRUE, store_models = FALSE, check_values = FALSE, callbacks = NULL, ties_method = "least_features", id = NULL )
fselector | (FSelector) |
learner | (mlr3::Learner) |
resampling | (mlr3::Resampling) |
measure | (mlr3::Measure) |
term_evals | ( |
term_time | ( |
terminator | (bbotk::Terminator) |
store_fselect_instance | (logical(1)) |
store_benchmark_result | (logical(1)) |
store_models | (logical(1)) |
check_values | (logical(1)) |
callbacks | (list of CallbackBatchFSelect) |
ties_method | (character(1)) |
id | (character(1)) |
The AutoFSelector is a mlr3::Learner which wraps another mlr3::Learner and performs the following steps during $train()
:
The wrapped (inner) learner is trained on the feature subsets via resampling. The feature selection can be specified by providing a FSelector, a bbotk::Terminator, a mlr3::Resampling and a mlr3::Measure.
A final model is fit on the complete training data with the best-found feature subset.
During $predict()
the AutoFSelector just calls the predict method of the wrapped (inner) learner.
There are several sections about feature selection in the mlr3book.
Estimate Model Performance with nested resampling.
The gallery features a collection of case studies and demos about optimization.
Nested resampling can be performed by passing an AutoFSelector object to mlr3::resample()
or mlr3::benchmark()
.
To access the inner resampling results, set store_fselect_instance = TRUE
and execute mlr3::resample()
or mlr3::benchmark()
with store_models = TRUE
(see examples).
The mlr3::Resampling passed to the AutoFSelector is meant to be the inner resampling, operating on the training set of an arbitrary outer resampling.
For this reason it is not feasible to pass an instantiated mlr3::Resampling here.
# Automatic Feature Selection

# split to train and external set
task = tsk("penguins")
split = partition(task, ratio = 0.8)

# create auto fselector
afs = auto_fselector(
  fselector = fs("random_search"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),
  measure = msr("classif.ce"),
  term_evals = 4)

# optimize feature subset and fit final model
afs$train(task, row_ids = split$train)

# predict with final model
afs$predict(task, row_ids = split$test)

# show result
afs$fselect_result

# model slot contains trained learner and fselect instance
afs$model

# shortcut trained learner
afs$learner

# shortcut fselect instance
afs$fselect_instance

# Nested Resampling
afs = auto_fselector(
  fselector = fs("random_search"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),
  measure = msr("classif.ce"),
  term_evals = 4)

resampling_outer = rsmp("cv", folds = 3)
rr = resample(task, afs, resampling_outer, store_models = TRUE)

# retrieve inner feature selection results.
extract_inner_fselect_results(rr)

# performance scores estimated on the outer resampling
rr$score()

# unbiased performance of the final model trained on the full data set
rr$aggregate()
mlr3::Learner
-> AutoFSelector
instance_args
(list()
)
All arguments from construction to create the FSelectInstanceBatchSingleCrit.
fselector
(FSelector)
Optimization algorithm.
archive
(ArchiveBatchFSelect)
Returns FSelectInstanceBatchSingleCrit archive.
learner
(mlr3::Learner)
Trained learner.
fselect_instance
(FSelectInstanceBatchSingleCrit)
Internally created feature selection instance with all intermediate results.
fselect_result
(data.table::data.table)
Short-cut to $result
from FSelectInstanceBatchSingleCrit.
predict_type
(character(1)
)
Stores the currently active predict type, e.g. "response"
.
Must be an element of $predict_types
.
hash
(character(1)
)
Hash (unique identifier) for this object.
phash
(character(1)
)
Hash (unique identifier) for this partial object, excluding some components which are varied systematically during tuning (parameter values) or feature selection (feature names).
new()
Creates a new instance of this R6 class.
AutoFSelector$new( fselector, learner, resampling, measure = NULL, terminator, store_fselect_instance = TRUE, store_benchmark_result = TRUE, store_models = FALSE, check_values = FALSE, callbacks = NULL, ties_method = "least_features", id = NULL )
fselector
(FSelector)
Optimization algorithm.
learner
(mlr3::Learner)
Learner to optimize the feature subset for.
resampling
(mlr3::Resampling)
Resampling that is used to evaluate the performance of the feature subsets.
Uninstantiated resamplings are instantiated during construction so that all feature subsets are evaluated on the same data splits.
Already instantiated resamplings are kept unchanged.
measure
(mlr3::Measure)
Measure to optimize. If NULL
, default measure is used.
terminator
(bbotk::Terminator)
Stop criterion of the feature selection.
store_fselect_instance
(logical(1)
)
If TRUE
(default), stores the internally created FSelectInstanceBatchSingleCrit with all intermediate results in slot $fselect_instance
.
Is set to TRUE if store_models = TRUE.
store_benchmark_result
(logical(1)
)
Store benchmark result in archive?
store_models
(logical(1)
).
Store models in benchmark result?
check_values
(logical(1)
)
Check the parameters before the evaluation and the results for
validity?
callbacks
(list of CallbackBatchFSelect)
List of callbacks.
ties_method
(character(1)
)
The method to break ties when selecting sets while optimizing and when selecting the best set.
Can be "least_features"
or "random"
.
The option "least_features"
(default) selects the feature set with the least features.
If there are multiple best feature sets with the same number of features, one is selected randomly.
The random
method returns a random feature set from the best feature sets.
Ignored if multiple measures are used.
id
(character(1)
)
Identifier for the new instance.
base_learner()
Extracts the base learner from nested learner objects like GraphLearner
in mlr3pipelines.
If recursive = 0
, the (tuned) learner is returned.
AutoFSelector$base_learner(recursive = Inf)
recursive
(integer(1)
)
Depth of recursion for multiple nested objects.
importance()
The importance scores of the final model.
AutoFSelector$importance()
Named numeric()
.
selected_features()
The selected features of the final model. These features are selected internally by the learner.
AutoFSelector$selected_features()
character()
.
oob_error()
The out-of-bag error of the final model.
AutoFSelector$oob_error()
numeric(1)
.
loglik()
The log-likelihood of the final model.
AutoFSelector$loglik()
logLik
.
print()
Printer.
AutoFSelector$print()
...
(ignored).
clone()
The objects of this class are cloneable with this method.
AutoFSelector$clone(deep = FALSE)
deep
Whether to make a deep clone.
Function to create a CallbackBatchFSelect.
Predefined callbacks are stored in the dictionary mlr_callbacks and can be retrieved with clbk()
.
Feature selection callbacks can be called from different stages of feature selection.
The stages are prefixed with on_*
.
The on_auto_fselector_*
stages are only available when the callback is used in an AutoFSelector.
Start Automatic Feature Selection
  Start Feature Selection
    - on_optimization_begin
    Start FSelect Batch
      - on_optimizer_before_eval
      Start Evaluation
        - on_eval_after_design
        - on_eval_after_benchmark
        - on_eval_before_archive
      End Evaluation
      - on_optimizer_after_eval
    End FSelect Batch
    - on_result
    - on_optimization_end
  End Feature Selection
  - on_auto_fselector_before_final_model
  - on_auto_fselector_after_final_model
End Automatic Feature Selection
See also the section on parameters for more information on the stages. A feature selection callback works with bbotk::ContextBatch and ContextBatchFSelect.
callback_batch_fselect( id, label = NA_character_, man = NA_character_, on_optimization_begin = NULL, on_optimizer_before_eval = NULL, on_eval_after_design = NULL, on_eval_after_benchmark = NULL, on_eval_before_archive = NULL, on_optimizer_after_eval = NULL, on_result = NULL, on_optimization_end = NULL, on_auto_fselector_before_final_model = NULL, on_auto_fselector_after_final_model = NULL )
id | ( |
label | ( |
man | ( |
on_optimization_begin | (function()) |
on_optimizer_before_eval | (function()) |
on_eval_after_design | (function()) |
on_eval_after_benchmark | (function()) |
on_eval_before_archive | (function()) |
on_optimizer_after_eval | (function()) |
on_result | (function()) |
on_optimization_end | (function()) |
on_auto_fselector_before_final_model | (function()) |
on_auto_fselector_after_final_model | (function()) |
When implementing a callback, each function must have two arguments named callback
and context
.
A callback can write data to the state ($state
), e.g. settings that affect the callback itself.
Avoid writing large data to the state.
# Write archive to disk
callback_batch_fselect("mlr3fselect.backup",
  on_optimization_end = function(callback, context) {
    saveRDS(context$instance$archive, "archive.rds")
  }
)
Specialized bbotk::CallbackBatch for feature selection.
Callbacks allow customizing the behavior of processes in mlr3fselect.
The callback_batch_fselect()
function creates a CallbackBatchFSelect.
Predefined callbacks are stored in the dictionary mlr_callbacks and can be retrieved with clbk()
.
For more information on callbacks see callback_batch_fselect()
.
mlr3misc::Callback
-> bbotk::CallbackBatch
-> CallbackBatchFSelect
on_eval_after_design
(function()
)
Stage called after design is created.
Called in ObjectiveFSelectBatch$eval_many()
.
on_eval_after_benchmark
(function()
)
Stage called after feature sets are evaluated.
Called in ObjectiveFSelectBatch$eval_many()
.
on_eval_before_archive
(function()
)
Stage called before performance values are written to the archive.
Called in ObjectiveFSelectBatch$eval_many()
.
on_auto_fselector_before_final_model
(function()
)
Stage called before the final model is trained.
Called in AutoFSelector$train()
.
This stage is called after the optimization has finished and before the final model is trained with the best feature set found.
on_auto_fselector_after_final_model
(function()
)
Stage called after the final model is trained.
Called in AutoFSelector$train()
.
This stage is called after the final model is trained with the best feature set found.
clone()
The objects of this class are cloneable with this method.
CallbackBatchFSelect$clone(deep = FALSE)
deep
Whether to make a deep clone.
The ContextBatchFSelect allows CallbackBatchFSelects to access and modify data while a batch of feature sets is evaluated.
See the section on active bindings for a list of modifiable objects.
See callback_batch_fselect()
for a list of stages that access ContextBatchFSelect.
This context is re-created each time a new batch of feature sets is evaluated.
Changes to $objective_fselect, $design and $benchmark_result are discarded after the function is finished.
Modifications to the data table in $aggregated_performance are written to the archive.
Any number of columns can be added.
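A minimal sketch of a callback that adds a column to $aggregated_performance could look as follows. The callback id "example.logged_by" and the column name logged_by are hypothetical, and the run itself assumes the fselect() helper with the "penguins" example task.

```r
library(mlr3)
library(mlr3fselect)
library(data.table)

# Hypothetical callback id and column name, chosen for illustration only.
callback = callback_batch_fselect("example.logged_by",
  on_eval_before_archive = function(callback, context) {
    # Add a column by reference; it is written to the archive with the scores.
    set(context$aggregated_performance, j = "logged_by", value = "example.logged_by")
  }
)

instance = fselect(
  fselector = fs("random_search"),
  task = tsk("penguins"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),
  measures = msr("classif.ce"),
  term_evals = 4,
  callbacks = list(callback)
)

# The extra column appears next to the performance scores.
tab = as.data.table(instance$archive)
tab[, c("classif.ce", "logged_by")]
```

The on_eval_before_archive stage is used here because it runs after the scores are aggregated and before they are written to the archive.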
mlr3misc::Context
-> bbotk::ContextBatch
-> ContextBatchFSelect
auto_fselector
(AutoFSelector)
The AutoFSelector instance.
xss
(list())
The feature sets of the latest batch.
design
(data.table::data.table)
The benchmark design of the latest batch.
benchmark_result
(mlr3::BenchmarkResult)
The benchmark result of the latest batch.
aggregated_performance
(data.table::data.table)
Aggregated performance scores and training time of the latest batch.
This data table is passed to the archive.
A callback can add additional columns which are also written to the archive.
clone()
The objects of this class are cloneable with this method.
ContextBatchFSelect$clone(deep = FALSE)
deep
Whether to make a deep clone.
Ensemble feature selection using multiple learners. The ensemble feature selection method is designed to identify the most predictive features from a given dataset by leveraging multiple machine learning models and resampling techniques. Returns an EnsembleFSResult.
embedded_ensemble_fselect( task, learners, init_resampling, measure, store_benchmark_result = TRUE )
task | (mlr3::Task) |
learners | (list of mlr3::Learner) |
init_resampling | (mlr3::Resampling) |
measure | (mlr3::Measure) |
store_benchmark_result | (logical(1)) |
The method begins by applying an initial resampling technique specified by the user, to create multiple subsamples from the original dataset (train/test splits). This resampling process helps in generating diverse subsets of data for robust feature selection.
For each subsample (train set) generated in the previous step, the method applies learners that support embedded feature selection. These learners are then scored on their ability to predict on the resampled test sets, storing the selected features during training, for each combination of subsample and learner.
Results are stored in an EnsembleFSResult.
An EnsembleFSResult object.
Meinshausen, Nicolai, Buhlmann, Peter (2010). “Stability Selection.” Journal of the Royal Statistical Society Series B: Statistical Methodology, 72(4), 417–473. ISSN 1369-7412, doi:10.1111/J.1467-9868.2010.00740.X, 0809.2932.
Hedou, Julien, Maric, Ivana, Bellan, Gregoire, Einhaus, Jakob, Gaudilliere, K. D, Ladant, Xavier F, Verdonk, Franck, Stelzer, A. I, Feyaerts, Dorien, Tsai, S. A, Ganio, A. E, Sabayev, Maximilian, Gillard, Joshua, Amar, Jonas, Cambriel, Amelie, Oskotsky, T. T, Roldan, Alennie, Golob, L. J, Sirota, Marina, Bonham, A. T, Sato, Masaki, Diop, Maigane, Durand, Xavier, Angst, S. M, Stevenson, K. D, Aghaeepour, Nima, Montanari, Andrea, Gaudilliere, Brice (2024). “Discovery of sparse, reliable omic biomarkers with Stabl.” Nature Biotechnology 2024, 1–13. ISSN 1546-1696, doi:10.1038/s41587-023-02033-x, https://www.nature.com/articles/s41587-023-02033-x.
eefsr = embedded_ensemble_fselect(
  task = tsk("sonar"),
  learners = lrns(c("classif.rpart", "classif.featureless")),
  init_resampling = rsmp("subsampling", repeats = 5),
  measure = msr("classif.ce")
)
eefsr
The EnsembleFSResult
stores the results of ensemble feature selection.
It includes methods for evaluating the stability of the feature selection process and for ranking the selected features among others.
Both functions ensemble_fselect()
and embedded_ensemble_fselect()
return an object of this class.
as.data.table.EnsembleFSResult(x, benchmark_result = TRUE)
Returns a tabular view of the ensemble feature selection.
EnsembleFSResult -> data.table::data.table()
x
(EnsembleFSResult)
benchmark_result
(logical(1)
)
Whether to add the learner, task and resampling information from the benchmark result.
c(...)
(EnsembleFSResult, ...) -> EnsembleFSResult
Combines multiple EnsembleFSResult objects into a new EnsembleFSResult.
benchmark_result
(mlr3::BenchmarkResult)
The benchmark result.
man
(character(1)
)
Manual page for this object.
result
(data.table::data.table)
Returns the result of the ensemble feature selection.
n_learners
(numeric(1)
)
Returns the number of learners used in the ensemble feature selection.
measure
(mlr3::Measure)
Returns the 'active' measure that is used in methods of this object.
active_measure
(character(1)
)
Indicates the type of the active performance measure.
During the ensemble feature selection process, the dataset is split into multiple subsamples (train/test splits) using an initial resampling scheme. So, performance can be evaluated using one of two measures:
"outer"
: measure used to evaluate the performance on the test sets.
"inner"
: measure used for optimization and to compute performance during inner resampling on the training sets.
n_resamples
(numeric(1)
)
Returns the number of times the task was initially resampled in the ensemble feature selection process.
new()
Creates a new instance of this R6 class.
EnsembleFSResult$new( result, features, benchmark_result = NULL, measure, inner_measure = NULL )
result
(data.table::data.table)
The result of the ensemble feature selection.
Mandatory column names should include "resampling_iteration"
, "learner_id"
,
"features"
and "n_features"
.
A column named {measure$id}
(scores on the test sets) must also always be present.
The column with the performance scores on the inner resampling of the train sets is not mandatory,
but note that it should be named as {inner_measure$id}_inner
to distinguish from
the {measure$id}
.
features
(character()
)
The vector of features of the task that was used in the ensemble feature
selection.
benchmark_result
(mlr3::BenchmarkResult)
The benchmark result object.
measure
(mlr3::Measure)
The performance measure used to evaluate the learners on the test sets generated
during the ensemble feature selection process.
By default, this serves as the 'active' measure for the methods of this object.
The active measure can be updated using the $set_active_measure()
method.
inner_measure
(mlr3::Measure)
The performance measure used to optimize and evaluate the learners during the inner resampling process of the training sets, generated as part of the ensemble feature selection procedure.
format()
Helper for print outputs.
EnsembleFSResult$format(...)
...
(ignored).
print()
Printer.
EnsembleFSResult$print(...)
...
(ignored).
help()
Opens the corresponding help page referenced by field $man
.
EnsembleFSResult$help()
set_active_measure()
Use this function to change the active measure.
EnsembleFSResult$set_active_measure(which = "outer")
which
(character(1)
)
Which measure from the ensemble feature selection result
to use in methods of this object.
Should be either "inner"
(optimization measure used in training sets)
or "outer"
(measure used in test sets, default value).
combine()
Combines a second EnsembleFSResult into the current object, modifying it in-place.
If the second EnsembleFSResult (efsr) is NULL, the method returns the object unmodified.
Both objects must have the same task features and measure.
If the inner_measure differs between the objects or is NULL in either, it will be set to NULL in the combined object.
Additionally, the importance column will be removed if it is missing in either object.
If both objects contain a benchmark_result, these will be combined.
Otherwise, the combined object will have a NULL value for benchmark_result.
This method modifies the object by reference.
To preserve the original state, explicitly $clone() the object beforehand.
Alternatively, you can use the c() function, which internally calls this method.
EnsembleFSResult$combine(efsr)
efsr
(EnsembleFSResult)
A second EnsembleFSResult object to combine with the current object.
Returns the object itself, but modified by reference.
feature_ranking()
Calculates the feature ranking via fastVoteR::rank_candidates()
.
EnsembleFSResult$feature_ranking( method = "av", use_weights = TRUE, committee_size = NULL, shuffle_features = TRUE )
method
(character(1)
)
The method to calculate the feature ranking. See fastVoteR::rank_candidates()
for a complete list of available methods.
Approval voting ("av"
) is the default method.
use_weights
(logical(1)
)
The default value (TRUE
) uses weights equal to the performance scores
of each voter/model (or the inverse scores if the measure is minimized).
If FALSE
, we treat all voters as equal and assign them all a weight equal to 1.
committee_size
(integer(1)
)
Number of top selected features in the output ranking.
This parameter can be used to speed up methods that build a committee sequentially ("seq_pav") by requesting only the top N selected candidates/features instead of the complete feature ranking.
shuffle_features
(logical(1)
)
Whether to shuffle the task features randomly before computing the ranking.
Shuffling ensures consistent random tie-breaking across methods and prevents
deterministic biases when features with equal scores are encountered.
Default is TRUE
and it's advised to set a seed before running this function.
Set to FALSE
if deterministic ordering of features is preferred (same as
during initialization).
The feature ranking process is built on the following framework: models act as voters, features act as candidates, and voters select certain candidates (features). The primary objective is to compile these selections into a consensus ranked list of features, effectively forming a committee.
For every feature a score is calculated, which depends on the "method"
argument.
The higher the score, the higher the ranking of the feature.
Note that some methods output a feature ranking instead of a score per feature, so we always include Borda's score, which is method-agnostic, i.e. it can be used to compare the feature rankings across different methods.
We shuffle the input candidates/features so that we enforce random tie-breaking.
Users should set the same seed
for consistent comparison between the different feature ranking methods and for reproducibility.
A data.table::data.table listing all the features, ordered by decreasing scores (depends on the "method"
). Columns are as follows:
"feature"
: Feature names.
"score"
: Scores assigned to each feature based on the selected method (if applicable).
"norm_score"
: Normalized scores (if applicable), scaled to the range [0, 1], which can be loosely interpreted as selection probabilities (Meinshausen et al. (2010)).
"borda_score"
: Borda scores for method-agnostic comparison, ranging in [0, 1], where the top feature receives a score of 1 and the lowest-ranked feature receives a score of 0.
This column is always included so that feature ranking methods that output only rankings have also a feature-wise score.
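The voting framework described above can be illustrated with a minimal base R sketch. It assumes weighted approval voting, where a feature's score is the summed weight of the voters that selected it, normalization by the total weight, and a Borda score of (p - i) / (p - 1) for the feature at rank i out of p. The toy votes and weights are invented for illustration, shuffling is omitted, and this is not the fastVoteR implementation.

```r
# Toy input: each voter (model) approves a set of candidate features.
features <- c("f1", "f2", "f3", "f4")
votes <- list(
  m1 = c("f1", "f2"),
  m2 = c("f1", "f3"),
  m3 = c("f1", "f2", "f4")
)
# Voter weights, e.g. the performance score of each model (higher is better).
weights <- c(m1 = 0.9, m2 = 0.6, m3 = 0.5)

# Weighted approval voting: sum the weights of the voters that selected a feature.
score <- vapply(features, function(f) {
  approved <- vapply(votes, function(v) f %in% v, logical(1))
  sum(weights[approved])
}, numeric(1))

# Assumed normalization: divide by the total weight so scores lie in [0, 1].
norm_score <- score / sum(weights)

# Order by decreasing score; Borda scores depend only on the rank position.
ord <- order(score, decreasing = TRUE)
p <- length(features)
ranking <- data.frame(
  feature = features[ord],
  score = unname(score[ord]),
  norm_score = unname(norm_score[ord]),
  borda_score = (p - seq_len(p)) / (p - 1)
)
print(ranking)
```

Because the Borda score is derived from rank positions alone, it stays comparable across methods whose raw scores are on different scales.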
stability()
Calculates the stability of the selected features with the stabm package. The results are cached. When the same stability measure is requested again with different arguments, the cache must be reset.
EnsembleFSResult$stability( stability_measure = "jaccard", stability_args = NULL, global = TRUE, reset_cache = FALSE )
stability_measure
(character(1)
)
The stability measure to be used.
One of the measures returned by stabm::listStabilityMeasures()
in lower case.
Default is "jaccard"
.
stability_args
(list
)
Additional arguments passed to the stability measure function.
global
(logical(1)
)
Whether to calculate the stability globally or for each learner.
reset_cache
(logical(1)
)
If TRUE
, the cached results are ignored.
A numeric()
value representing the stability of the selected features.
Or a numeric()
vector with the stability of the selected features for each learner.
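As a rough illustration of what a stability value expresses, the following base R sketch computes the mean pairwise Jaccard similarity between selected feature sets. The helper names jaccard() and stability_jaccard() are hypothetical and the feature sets are invented; the stabm package may handle edge cases and corrections differently.

```r
# Jaccard similarity of two feature sets: |intersection| / |union|.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Mean Jaccard similarity over all pairs of feature sets.
stability_jaccard <- function(sets) {
  pairs <- combn(length(sets), 2)
  mean(apply(pairs, 2, function(ij) jaccard(sets[[ij[1]]], sets[[ij[2]]])))
}

# Toy feature sets selected in three iterations.
selected <- list(
  c("f1", "f2"),
  c("f1", "f3"),
  c("f1", "f2", "f4")
)
stability_jaccard(selected)
```

A value of 1 means every iteration selected the same features, while values near 0 indicate that the selections barely overlap.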
pareto_front()
This function identifies the Pareto front of the ensemble feature selection process, i.e., the set of points that represent the trade-off between the number of features and performance (e.g. classification error).
EnsembleFSResult$pareto_front(type = "empirical")
type
(character(1)
)
Specifies the type of Pareto front to return. See details.
Two options are available for the Pareto front:
"empirical"
(default): returns the empirical Pareto front.
"estimated"
: the Pareto front points are estimated by fitting a linear model with the inverse of the number of features (1/x) as input and the associated performance scores as output.
This method is useful when the Pareto points are sparse and the front assumes a convex shape if better performance corresponds to lower measure values (e.g. classification error), or a concave shape otherwise (e.g. classification accuracy).
The estimated Pareto front will include points for a number of features ranging from 1 up to the maximum number found in the empirical Pareto front.
A data.table::data.table with the number of features and the performance as columns, which together form the Pareto front.
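The two Pareto front types can be sketched in base R as follows. The data are invented and the measure is assumed to be minimized. Fitting the linear model on the empirical front points is also an assumption of this sketch, not a statement about the package internals.

```r
# Toy results: number of selected features and classification error (minimized).
res <- data.frame(
  n_features = c(1, 2, 2, 4, 5, 8),
  error      = c(0.40, 0.30, 0.35, 0.20, 0.25, 0.18)
)

# Empirical front: walk through the points by increasing number of features
# and keep a point only if it strictly improves on the best error seen so far.
res <- res[order(res$n_features, res$error), ]
best_so_far <- cummin(res$error)
front <- res[res$error == best_so_far & !duplicated(best_so_far), ]

# Estimated front: linear model with 1 / n_features as the only input,
# predicted for 1 up to the largest number of features on the empirical front.
fit <- lm(error ~ I(1 / n_features), data = front)
estimated <- data.frame(n_features = seq_len(max(front$n_features)))
estimated$error <- predict(fit, newdata = estimated)

print(front)
print(estimated)
```

With these numbers the empirical front keeps the points with 1, 2, 4 and 8 features, while the estimated front fills in every feature count from 1 to 8.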
knee_points()
This function implements various knee point identification (KPI) methods, which select points in the Pareto front, such that an optimal trade-off between performance and number of features is achieved. In most cases, only one such point is returned.
EnsembleFSResult$knee_points(method = "NBI", type = "empirical")
method
(character(1)
)
Type of method to use to identify the knee point. See details.
type
(character(1)
)
Specifies the type of Pareto front to use for the identification of the knee point.
See pareto_front()
method for more details.
The available KPI methods are:
"NBI"
(default): The Normal-Boundary Intersection method is a geometry-based method which calculates the perpendicular distance of each point from the line connecting the first and last points of the Pareto front.
The knee point is determined as the Pareto point with the maximum distance from this line, see Das (1999).
A data.table::data.table with the knee point(s) of the Pareto front.
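The geometric idea behind the NBI method can be sketched in a few lines of base R. The Pareto front values are invented, and rescaling both axes to [0, 1] before measuring distances is an assumption of this sketch.

```r
# Toy Pareto front: number of features vs. classification error (minimized).
front <- data.frame(
  n_features = c(1, 2, 4, 8),
  error      = c(0.40, 0.30, 0.20, 0.18)
)

# Rescale both axes to [0, 1] so that neither dominates the distance.
rescale <- function(x) (x - min(x)) / (max(x) - min(x))
x <- rescale(front$n_features)
y <- rescale(front$error)

# Perpendicular distance of every point from the line through the first
# and the last point of the front.
n <- nrow(front)
dx <- x[n] - x[1]
dy <- y[n] - y[1]
dist_to_line <- abs(dy * (x - x[1]) - dx * (y - y[1])) / sqrt(dx^2 + dy^2)

# The knee point is the Pareto point farthest from that line.
knee <- front[which.max(dist_to_line), ]
print(knee)
```

Here the knee is the point with 4 features: adding more features beyond it yields only a marginal drop in error.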
clone()
The objects of this class are cloneable with this method.
EnsembleFSResult$clone(deep = FALSE)
deep
Whether to make a deep clone.
Das, I (1999). “On characterizing the 'knee' of the Pareto curve based on normal-boundary intersection.” Structural Optimization, 18(1-2), 107–115. ISSN 09344373.
Meinshausen, Nicolai, Buhlmann, Peter (2010). “Stability Selection.” Journal of the Royal Statistical Society Series B: Statistical Methodology, 72(4), 417–473. ISSN 1369-7412, doi:10.1111/J.1467-9868.2010.00740.X, 0809.2932.
efsr = ensemble_fselect(
  fselector = fs("rfe", n_features = 2, feature_fraction = 0.8),
  task = tsk("sonar"),
  learners = lrns(c("classif.rpart", "classif.featureless")),
  init_resampling = rsmp("subsampling", repeats = 2),
  inner_resampling = rsmp("cv", folds = 3),
  inner_measure = msr("classif.ce"),
  measure = msr("classif.acc"),
  terminator = trm("none")
)

# contains the benchmark result
efsr$benchmark_result

# contains the selected features for each iteration
efsr$result

# returns the stability of the selected features
efsr$stability(stability_measure = "jaccard")

# returns a ranking of all features
head(efsr$feature_ranking())

# returns the empirical pareto front, i.e. n_features vs measure (error)
efsr$pareto_front()

# returns the knee points (optimal trade-off between n_features and performance)
efsr$knee_points()

# change to use the inner optimization measure
efsr$set_active_measure(which = "inner")

# Pareto front is calculated on the inner measure
efsr$pareto_front()
Ensemble feature selection using multiple learners. The ensemble feature selection method is designed to identify the most predictive features from a given dataset by leveraging multiple machine learning models and resampling techniques. Returns an EnsembleFSResult.
ensemble_fselect(
  fselector,
  task,
  learners,
  init_resampling,
  inner_resampling,
  inner_measure,
  measure,
  terminator,
  callbacks = NULL,
  store_benchmark_result = TRUE,
  store_models = FALSE
)
fselector |
(FSelector) |
task |
(mlr3::Task) |
learners |
(list of mlr3::Learner) |
init_resampling |
(mlr3::Resampling) |
inner_resampling |
(mlr3::Resampling) |
inner_measure |
(mlr3::Measure) |
measure |
(mlr3::Measure) |
terminator |
(bbotk::Terminator) |
callbacks |
(Named list of lists of CallbackBatchFSelect) |
store_benchmark_result |
( |
store_models |
( |
The method begins by applying an initial resampling technique specified by the user, to create multiple subsamples from the original dataset (train/test splits). This resampling process helps in generating diverse subsets of data for robust feature selection.
For each subsample (train set) generated in the previous step, the method performs wrapper-based feature selection (auto_fselector) using each provided learner, the given inner resampling method, inner performance measure and optimization algorithm. This process generates 1) the best feature subset and 2) a final trained model using these best features, for each combination of subsample and learner. The final models are then scored on their ability to predict on the resampled test sets.
Results are stored in an EnsembleFSResult.
The result object also includes the performance scores calculated during the inner resampling of the training sets, using models with the best feature subsets.
These scores are stored in a column named {measure_id}_inner.
An EnsembleFSResult object.
The active measure of performance is the one applied to the test sets.
This is preferred, as inner resampling scores on the training sets are likely to be overestimated when using the final models.
Users can change the active measure by using the set_active_measure()
method of the EnsembleFSResult.
Saeys, Yvan, Abeel, Thomas, Van De Peer, Yves (2008). “Robust feature selection using ensemble feature selection techniques.” Machine Learning and Knowledge Discovery in Databases, 5212 LNAI, 313–325. doi:10.1007/978-3-540-87481-2_21.
Abeel, Thomas, Helleputte, Thibault, Van de Peer, Yves, Dupont, Pierre, Saeys, Yvan (2010). “Robust biomarker identification for cancer diagnosis with ensemble feature selection methods.” Bioinformatics, 26, 392–398. ISSN 1367-4803, doi:10.1093/BIOINFORMATICS/BTP630.
Pes, Barbara (2020). “Ensemble feature selection for high-dimensional data: a stability analysis across multiple domains.” Neural Computing and Applications, 32(10), 5951–5973. ISSN 14333058, doi:10.1007/s00521-019-04082-3.
efsr = ensemble_fselect(
  fselector = fs("random_search"),
  task = tsk("sonar"),
  learners = lrns(c("classif.rpart", "classif.featureless")),
  init_resampling = rsmp("subsampling", repeats = 2),
  inner_resampling = rsmp("cv", folds = 3),
  inner_measure = msr("classif.ce"),
  measure = msr("classif.acc"),
  terminator = trm("evals", n_evals = 10)
)
efsr
Extract inner feature selection archives of nested resampling.
Implemented for mlr3::ResampleResult and mlr3::BenchmarkResult.
The function iterates over the AutoFSelector objects and binds the archives to a data.table::data.table()
.
AutoFSelector must be initialized with store_fselect_instance = TRUE
and resample()
or benchmark()
must be called with store_models = TRUE
.
extract_inner_fselect_archives(x, exclude_columns = "uhash")
x |
|
exclude_columns |
( |
The returned data table has the following columns:
experiment
(integer(1))
Index giving the corresponding row number in the original benchmark grid.
iteration
(integer(1))
Iteration of the outer resampling.
One column for each feature of the task.
One column for each performance measure.
runtime_learners
(numeric(1)
)
Sum of training and predict times logged in learners per
mlr3::ResampleResult / evaluation. This does not include potential
overhead time.
timestamp
(POSIXct
)
Time stamp when the evaluation was logged into the archive.
batch_nr
(integer(1)
)
Feature sets are evaluated in batches. Each batch has a unique batch
number.
resample_result
(mlr3::ResampleResult)
Resample result of the inner resampling.
task_id
(character(1)
).
learner_id
(character(1)
).
resampling_id
(character(1)
).
# Nested Resampling on Palmer Penguins Data Set

# create auto fselector
at = auto_fselector(
  fselector = fs("random_search"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),
  measure = msr("classif.ce"),
  term_evals = 4)

resampling_outer = rsmp("cv", folds = 2)
rr = resample(tsk("penguins"), at, resampling_outer, store_models = TRUE)

# extract inner archives
extract_inner_fselect_archives(rr)
Extract inner feature selection results of nested resampling. Implemented for mlr3::ResampleResult and mlr3::BenchmarkResult.
extract_inner_fselect_results(x, fselect_instance, ...)
x |
|
fselect_instance |
( |
... |
(any) |
The function iterates over the AutoFSelector objects and binds the feature selection results to a data.table::data.table()
.
AutoFSelector must be initialized with store_fselect_instance = TRUE
and resample()
or benchmark()
must be called with store_models = TRUE
.
Optionally, the instance can be added for each iteration.
The returned data table has the following columns:
experiment
(integer(1))
Index giving the corresponding row number in the original benchmark grid.
iteration
(integer(1))
Iteration of the outer resampling.
One column for each feature of the task.
One column for each performance measure.
features
(character())
Vector of selected feature set.
task_id
(character(1)
).
learner_id
(character(1)
).
resampling_id
(character(1)
).
# Nested Resampling on the Iris Data Set

# create auto fselector
at = auto_fselector(
  fselector = fs("random_search"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),
  measure = msr("classif.ce"),
  term_evals = 4)

resampling_outer = rsmp("cv", folds = 2)
rr = resample(tsk("iris"), at, resampling_outer, store_models = TRUE)

# extract inner results
extract_inner_fselect_results(rr)
Functions to retrieve objects, set parameters and assign to fields in one go.
Relies on mlr3misc::dictionary_sugar_get()
to extract objects from the respective mlr3misc::Dictionary:
fs()
for a FSelector from mlr_fselectors.
fss()
for a list of FSelectors from mlr_fselectors.
trm()
for a bbotk::Terminator from mlr_terminators.
trms()
for a list of Terminators from mlr_terminators.
fs(.key, ...)

fss(.keys, ...)
.key |
( |
... |
(any) |
.keys |
( |
R6::R6Class object of the respective type, or a list of R6::R6Class objects for the plural versions.
# random search with batch size of 5
fs("random_search", batch_size = 5)

# run time terminator with 20 seconds
trm("run_time", secs = 20)
Function to optimize the features of a mlr3::Learner.
The function internally creates a FSelectInstanceBatchSingleCrit or FSelectInstanceBatchMultiCrit which describes the feature selection problem.
It executes the feature selection with the FSelector (fselector) and returns the result with the feature selection instance ($result).
The ArchiveBatchFSelect ($archive) stores all evaluated feature sets and performance scores.
fselect(
  fselector,
  task,
  learner,
  resampling,
  measures = NULL,
  term_evals = NULL,
  term_time = NULL,
  terminator = NULL,
  store_benchmark_result = TRUE,
  store_models = FALSE,
  check_values = FALSE,
  callbacks = NULL,
  ties_method = "least_features"
)
fselector |
(FSelector) |
task |
(mlr3::Task) |
learner |
(mlr3::Learner) |
resampling |
(mlr3::Resampling) |
measures |
(mlr3::Measure or list of mlr3::Measure) |
term_evals |
( |
term_time |
( |
terminator |
(bbotk::Terminator) |
store_benchmark_result |
( |
store_models |
( |
check_values |
( |
callbacks |
(list of CallbackBatchFSelect) |
ties_method |
( |
The mlr3::Task, mlr3::Learner, mlr3::Resampling, mlr3::Measure and bbotk::Terminator are used to construct a FSelectInstanceBatchSingleCrit.
If multiple performance Measures are supplied, a FSelectInstanceBatchMultiCrit is created.
The parameters term_evals and term_time are shortcuts to create a bbotk::Terminator.
If both parameters are passed, a bbotk::TerminatorCombo is constructed.
For other Terminators, pass one with terminator.
If no termination criterion is needed, set term_evals, term_time and terminator to NULL.
FSelectInstanceBatchSingleCrit | FSelectInstanceBatchMultiCrit
There are several sections about feature selection in the mlr3book.
Getting started with wrapper feature selection.
Do a sequential forward selection on the Palmer Penguins data set.
The gallery features a collection of case studies and demos about optimization.
Utilize the built-in feature importance of models with Recursive Feature Elimination.
Run a feature selection with Shadow Variable Search.
For analyzing the feature selection results, it is recommended to pass the archive to as.data.table()
.
The returned data table is joined with the benchmark result which adds the mlr3::ResampleResult for each feature set.
The archive provides various getters (e.g. $learners()
) to ease the access.
All getters extract by position (i
) or unique hash (uhash
).
For a complete list of all getters see the methods section.
The benchmark result ($benchmark_result) allows scoring the feature sets again on a different measure.
Alternatively, measures can be supplied to as.data.table()
.
# Feature selection on the Pima Indians Diabetes data set
task = tsk("pima")
learner = lrn("classif.rpart")

# Run feature selection
instance = fselect(
  fselector = fs("random_search"),
  task = task,
  learner = learner,
  resampling = rsmp("holdout"),
  measures = msr("classif.ce"),
  term_evals = 4)

# Subset task to optimized feature set
task$select(instance$result_feature_set)

# Train the learner with optimal feature set on the full data set
learner$train(task)

# Inspect all evaluated configurations
as.data.table(instance$archive)
Function to conduct nested resampling.
fselect_nested(
  fselector,
  task,
  learner,
  inner_resampling,
  outer_resampling,
  measure = NULL,
  term_evals = NULL,
  term_time = NULL,
  terminator = NULL,
  store_fselect_instance = TRUE,
  store_benchmark_result = TRUE,
  store_models = FALSE,
  check_values = FALSE,
  callbacks = NULL,
  ties_method = "least_features"
)
fselector |
(FSelector) |
task |
(mlr3::Task) |
learner |
(mlr3::Learner) |
inner_resampling |
(mlr3::Resampling) |
outer_resampling |
(mlr3::Resampling) |
measure |
(mlr3::Measure) |
term_evals |
( |
term_time |
( |
terminator |
(bbotk::Terminator) |
store_fselect_instance |
( |
store_benchmark_result |
( |
store_models |
( |
check_values |
( |
callbacks |
(list of CallbackBatchFSelect) |
ties_method |
( |
# Nested resampling on Palmer Penguins data set
rr = fselect_nested(
  fselector = fs("random_search"),
  task = tsk("penguins"),
  learner = lrn("classif.rpart"),
  inner_resampling = rsmp("holdout"),
  outer_resampling = rsmp("cv", folds = 2),
  measure = msr("classif.ce"),
  term_evals = 4)

# Performance scores estimated on the outer resampling
rr$score()

# Unbiased performance of the final model trained on the full data set
rr$aggregate()
The FSelectInstanceBatchMultiCrit specifies a feature selection problem for a FSelector.
The function fsi()
creates a FSelectInstanceBatchMultiCrit and the function fselect()
creates an instance internally.
There are several sections about feature selection in the mlr3book.
Learn about multi-objective optimization.
The gallery features a collection of case studies and demos about optimization.
For analyzing the feature selection results, it is recommended to pass the archive to as.data.table()
.
The returned data table is joined with the benchmark result which adds the mlr3::ResampleResult for each feature set.
The archive provides various getters (e.g. $learners()
) to ease the access.
All getters extract by position (i
) or unique hash (uhash
).
For a complete list of all getters see the methods section.
The benchmark result ($benchmark_result) allows scoring the feature sets again on a different measure.
Alternatively, measures can be supplied to as.data.table()
.
bbotk::OptimInstance
-> bbotk::OptimInstanceBatch
-> bbotk::OptimInstanceBatchMultiCrit
-> FSelectInstanceBatchMultiCrit
result_feature_set
(list of character()
)
Feature sets for task subsetting.
new()
Creates a new instance of this R6 class.
FSelectInstanceBatchMultiCrit$new( task, learner, resampling, measures, terminator, store_benchmark_result = TRUE, store_models = FALSE, check_values = FALSE, callbacks = NULL )
task
(mlr3::Task)
Task to operate on.
learner
(mlr3::Learner)
Learner to optimize the feature subset for.
resampling
(mlr3::Resampling)
Resampling that is used to evaluate the performance of the feature subsets.
Uninstantiated resamplings are instantiated during construction so that all feature subsets are evaluated on the same data splits.
Already instantiated resamplings are kept unchanged.
measures
(list of mlr3::Measure)
Measures to optimize.
If NULL
, mlr3's default measure is used.
terminator
(bbotk::Terminator)
Stop criterion of the feature selection.
store_benchmark_result
(logical(1)
)
Store benchmark result in archive?
store_models
(logical(1)
).
Store models in benchmark result?
check_values
(logical(1)
)
Check the parameters before the evaluation and the results for
validity?
callbacks
(list of CallbackBatchFSelect)
List of callbacks.
assign_result()
The FSelector object writes the best found feature subsets and estimated performance values here. For internal use.
FSelectInstanceBatchMultiCrit$assign_result(xdt, ydt, extra = NULL, ...)
xdt
(data.table::data.table()
)
x values as data.table
. Each row is one point. Contains the value in
the search space of the FSelectInstanceBatchMultiCrit object. Can contain
additional columns for extra information.
ydt
(data.table::data.table()
)
Optimal outcomes, e.g. the Pareto front.
extra
(data.table::data.table()
)
Additional information.
...
(any
)
ignored.
print()
Printer.
FSelectInstanceBatchMultiCrit$print(...)
...
(ignored).
clone()
The objects of this class are cloneable with this method.
FSelectInstanceBatchMultiCrit$clone(deep = FALSE)
deep
Whether to make a deep clone.
# Feature selection on Palmer Penguins data set
task = tsk("penguins")

# Construct feature selection instance
instance = fsi(
  task = task,
  learner = lrn("classif.rpart"),
  resampling = rsmp("cv", folds = 3),
  measures = msrs(c("classif.ce", "time_train")),
  terminator = trm("evals", n_evals = 4)
)

# Choose optimization algorithm
fselector = fs("random_search", batch_size = 2)

# Run feature selection
fselector$optimize(instance)

# Optimal feature sets
instance$result_feature_set

# Inspect all evaluated sets
as.data.table(instance$archive)
The FSelectInstanceBatchSingleCrit specifies a feature selection problem for a FSelector.
The function fsi()
creates a FSelectInstanceBatchSingleCrit and the function fselect()
creates an instance internally.
The instance contains an ObjectiveFSelectBatch object that encodes the black box objective function a FSelector has to optimize.
The instance allows the basic operations of querying the objective at design points ($eval_batch()
).
This operation is usually done by the FSelector.
Evaluations of feature subsets are performed in batches by calling mlr3::benchmark()
internally.
The evaluated feature subsets are stored in the Archive ($archive
).
Before a batch is evaluated, the bbotk::Terminator is queried for the remaining budget.
If the available budget is exhausted, an exception is raised, and no further evaluations can be performed from this point on.
The FSelector is also supposed to store its final result, consisting of a selected feature subset and associated estimated performance values, by calling the method instance$assign_result()
.
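The interplay between batch evaluation, budget check and result assignment can be mimicked with a small base R sketch. All names below (make_instance(), eval_batch(), the terminated_error condition class, the toy objective) are hypothetical stand-ins and not part of mlr3fselect or bbotk. The sketch only mirrors the control flow described above under the assumption of a simple evaluation-count budget.

```r
# Hypothetical stand-ins for illustration; none of these names belong to
# mlr3fselect or bbotk.
make_instance <- function(n_evals) {
  instance <- new.env()
  instance$n_evals <- n_evals
  instance$archive <- data.frame(
    features = character(), score = numeric(), stringsAsFactors = FALSE
  )
  instance$result <- NULL
  instance
}

eval_batch <- function(instance, feature_sets, objective) {
  # Query the remaining budget before the batch is evaluated.
  if (nrow(instance$archive) >= instance$n_evals) {
    stop(structure(
      list(message = "Budget exhausted", call = NULL),
      class = c("terminated_error", "error", "condition")
    ))
  }
  # Evaluate the whole batch and log it in the archive.
  new_rows <- data.frame(
    features = vapply(feature_sets, paste, character(1), collapse = ","),
    score = vapply(feature_sets, objective, numeric(1)),
    stringsAsFactors = FALSE
  )
  instance$archive <- rbind(instance$archive, new_rows)
  invisible(new_rows)
}

# Toy objective (minimized): prefer feature sets of size 2.
objective <- function(feature_set) abs(length(feature_set) - 2)
all_features <- c("f1", "f2", "f3")

set.seed(42)
instance <- make_instance(n_evals = 4)
tryCatch(
  repeat {
    # Propose a batch of two random feature subsets.
    batch <- replicate(
      2,
      all_features[sample(c(TRUE, FALSE), length(all_features), replace = TRUE)],
      simplify = FALSE
    )
    eval_batch(instance, batch, objective)
  },
  terminated_error = function(e) NULL
)

# Store the final result: best feature set and its estimated performance.
instance$result <- instance$archive[which.min(instance$archive$score), ]
print(instance$archive)
```

The optimizer loop never counts evaluations itself. It keeps proposing batches until the instance signals that the budget is used up, which keeps the termination logic in one place.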
If no measure is passed, the default measure is used. The default measure depends on the task type.
Task | Default Measure | Package |
---|---|---|
"classif" | "classif.ce" | mlr3 |
"regr" | "regr.mse" | mlr3 |
"surv" | "surv.cindex" | mlr3proba |
"dens" | "dens.logloss" | mlr3proba |
"classif_st" | "classif.ce" | mlr3spatial |
"regr_st" | "regr.mse" | mlr3spatial |
"clust" | "clust.dunn" | mlr3cluster |
There are several sections about feature selection in the mlr3book.
Getting started with wrapper feature selection.
Do a sequential forward selection on the Palmer Penguins data set.
The gallery features a collection of case studies and demos about optimization.
Utilize the built-in feature importance of models with Recursive Feature Elimination.
Run a feature selection with Shadow Variable Search.
For analyzing the feature selection results, it is recommended to pass the archive to as.data.table()
.
The returned data table is joined with the benchmark result which adds the mlr3::ResampleResult for each feature set.
The archive provides various getters (e.g. $learners()
) to ease the access.
All getters extract by position (i
) or unique hash (uhash
).
For a complete list of all getters see the methods section.
The benchmark result ($benchmark_result) allows scoring the feature sets again on a different measure.
Alternatively, measures can be supplied to as.data.table()
.
bbotk::OptimInstance
-> bbotk::OptimInstanceBatch
-> bbotk::OptimInstanceBatchSingleCrit
-> FSelectInstanceBatchSingleCrit
result_feature_set
(character()
)
Feature set for task subsetting.
new()
Creates a new instance of this R6 class.
FSelectInstanceBatchSingleCrit$new( task, learner, resampling, measure, terminator, store_benchmark_result = TRUE, store_models = FALSE, check_values = FALSE, callbacks = NULL, ties_method = "least_features" )
task
(mlr3::Task)
Task to operate on.
learner
(mlr3::Learner)
Learner to optimize the feature subset for.
resampling
(mlr3::Resampling)
Resampling that is used to evaluate the performance of the feature subsets.
Uninstantiated resamplings are instantiated during construction so that all feature subsets are evaluated on the same data splits.
Already instantiated resamplings are kept unchanged.
measure
(mlr3::Measure)
Measure to optimize. If NULL
, default measure is used.
terminator
(bbotk::Terminator)
Stop criterion of the feature selection.
store_benchmark_result
(logical(1)
)
Store benchmark result in archive?
store_models
(logical(1)
).
Store models in benchmark result?
check_values
(logical(1)
)
Check the parameters before the evaluation and the results for
validity?
callbacks
(list of CallbackBatchFSelect)
List of callbacks.
ties_method
(character(1)
)
The method to break ties when selecting sets while optimizing and when selecting the best set.
Can be "least_features" or "random".
The option "least_features" (default) selects the feature set with the fewest features.
If there are multiple best feature sets with the same number of features, one is selected randomly.
The "random" method returns a random feature set from the best feature sets.
Ignored if multiple measures are used.
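The two tie-breaking rules can be sketched in base R as follows. The function break_ties() and its arguments are hypothetical and only illustrate the behavior described for ties_method; they are not the package's internal code.

```r
# Hypothetical helper: pick one feature set among those with the best score.
break_ties <- function(sets, scores,
                       ties_method = c("least_features", "random"),
                       minimize = TRUE) {
  ties_method <- match.arg(ties_method)
  best <- if (minimize) min(scores) else max(scores)
  candidates <- which(scores == best)
  if (ties_method == "least_features") {
    # Keep only the tied sets with the smallest number of features.
    sizes <- lengths(sets[candidates])
    candidates <- candidates[sizes == min(sizes)]
  }
  # Remaining ties are broken randomly. sample.int() on the length avoids
  # the sample(x) pitfall when x is a single integer.
  sets[[candidates[sample.int(length(candidates), 1)]]]
}

# Three feature sets share the best classification error of 0.1.
sets <- list(c("a", "b", "c"), "a", c("b", "c"), "d")
scores <- c(0.1, 0.1, 0.1, 0.3)

break_ties(sets, scores, ties_method = "least_features")
break_ties(sets, scores, ties_method = "random")
```

With "least_features" the single-feature set is always returned here, whereas "random" may return any of the three tied sets.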
assign_result()
The FSelector writes the best found feature subset and estimated performance value here. For internal use.
FSelectInstanceBatchSingleCrit$assign_result(xdt, y, extra = NULL, ...)
xdt
(data.table::data.table()
)
x values as data.table. Each row is one point. Contains the value in the search space of the FSelectInstanceBatchSingleCrit object. Can contain additional columns for extra information.
y
(numeric(1)
)
Optimal outcome.
extra
(data.table::data.table()
)
Additional information.
...
(any
)
ignored.
print()
Printer.
FSelectInstanceBatchSingleCrit$print(...)
...
(ignored).
clone()
The objects of this class are cloneable with this method.
FSelectInstanceBatchSingleCrit$clone(deep = FALSE)
deep
Whether to make a deep clone.
# Feature selection on Palmer Penguins data set
task = tsk("penguins")
learner = lrn("classif.rpart")

# Construct feature selection instance
instance = fsi(
  task = task,
  learner = learner,
  resampling = rsmp("cv", folds = 3),
  measures = msr("classif.ce"),
  terminator = trm("evals", n_evals = 4)
)

# Choose optimization algorithm
fselector = fs("random_search", batch_size = 2)

# Run feature selection
fselector$optimize(instance)

# Subset task to optimal feature set
task$select(instance$result_feature_set)

# Train the learner with optimal feature set on the full data set
learner$train(task)

# Inspect all evaluated sets
as.data.table(instance$archive)
The FSelector implements the optimization algorithm.
FSelector is an abstract base class that implements the base functionality each fselector must provide.
There are several sections about feature selection in the mlr3book.
Learn more about fselectors.
The gallery features a collection of case studies and demos about optimization.
Utilize the built-in feature importance of models with Recursive Feature Elimination.
Run a feature selection with Shadow Variable Search.
id
(character(1)
)
Identifier of the object.
Used in tables, plot and text output.
param_set
paradox::ParamSet
Set of control parameters.
properties
(character()
)
Set of properties of the fselector.
Must be a subset of mlr_reflections$fselect_properties
.
packages
(character()
)
Set of required packages.
Note that these packages will be loaded via requireNamespace()
, and are not attached.
label
(character(1)
)
Label for this object.
Can be used in tables, plot and text output instead of the ID.
man
(character(1)
)
String in the format [pkg]::[topic]
pointing to a manual page for this object.
The referenced help package can be opened via method $help()
.
new()
Creates a new instance of this R6 class.
FSelector$new( id = "fselector", param_set, properties, packages = character(), label = NA_character_, man = NA_character_ )
id
(character(1)
)
Identifier for the new instance.
param_set
paradox::ParamSet
Set of control parameters.
properties
(character()
)
Set of properties of the fselector.
Must be a subset of mlr_reflections$fselect_properties
.
packages
(character()
)
Set of required packages.
Note that these packages will be loaded via requireNamespace()
, and are not attached.
label
(character(1)
)
Label for this object.
Can be used in tables, plot and text output instead of the ID.
man
(character(1)
)
String in the format [pkg]::[topic]
pointing to a manual page for this object.
The referenced help package can be opened via method $help()
.
format()
Helper for print outputs.
FSelector$format(...)
...
(ignored).
(character()
).
print()
Print method.
FSelector$print()
(character()
).
help()
Opens the corresponding help page referenced by field $man
.
FSelector$help()
clone()
The objects of this class are cloneable with this method.
FSelector$clone(deep = FALSE)
deep
Whether to make a deep clone.
Other FSelector: mlr_fselectors, mlr_fselectors_design_points, mlr_fselectors_exhaustive_search, mlr_fselectors_genetic_search, mlr_fselectors_random_search, mlr_fselectors_rfe, mlr_fselectors_rfecv, mlr_fselectors_sequential, mlr_fselectors_shadow_variable_search
The FSelectorBatch implements the optimization algorithm.
FSelectorBatch is an abstract base class that implements the base functionality each fselector must provide. A subclass is implemented in the following way:
Inherit from FSelectorBatch.
Specify the private abstract method $.optimize() and use it to call into your optimizer.
You need to call instance$eval_batch() to evaluate design points.
The batch evaluation is requested at the FSelectInstanceBatchSingleCrit/FSelectInstanceBatchMultiCrit object instance, so each batch is possibly executed in parallel via mlr3::benchmark(), and all evaluations are stored inside of instance$archive.
Before the batch evaluation, the bbotk::Terminator is checked, and if it is positive, an exception of class "terminated_error" is generated.
In the latter case the current batch of evaluations is still stored in instance, but the numeric scores are not sent back to the handling optimizer as it has lost execution control.
After such an exception was caught we select the best set from instance$archive and return it.
Note that therefore more points than specified by the bbotk::Terminator may be evaluated, as the Terminator is only checked before a batch evaluation, and not between evaluations within a batch. How many more depends on the setting of the batch size.
Overwrite the private super-method .assign_result() if you want to decide how to estimate the final set in the instance and its estimated performance.
The default behavior is: We pick the best resample experiment, regarding the given measure, then assign its set and aggregated performance to the instance.
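The note above about evaluating more points than the terminator allows can be illustrated with a toy loop in base R. This is a sketch of the described behavior, not bbotk code; run_batches() is a hypothetical helper and the terminator is assumed to be a simple evaluation budget.

```r
# Toy loop, not bbotk code: the budget is checked only before each batch.
run_batches = function(n_evals, batch_size) {
  evaluated = 0
  repeat {
    if (evaluated >= n_evals) break     # terminator check before the batch
    evaluated = evaluated + batch_size  # the whole batch is evaluated
  }
  evaluated
}

run_batches(n_evals = 4, batch_size = 2)  # 4: the budget is hit exactly
run_batches(n_evals = 4, batch_size = 3)  # 6: the second batch overshoots the budget
```

A smaller batch size keeps the overshoot small, at the cost of less parallelism per batch.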
.optimize(instance)
-> NULL
Abstract base method. Implement to specify feature selection of your subclass.
See technical details sections.
.assign_result(instance)
-> NULL
Abstract base method. Implement to specify how the final feature subset is selected.
See technical details sections.
There are several sections about feature selection in the mlr3book.
Learn more about fselectors.
The gallery features a collection of case studies and demos about optimization.
Utilize the built-in feature importance of models with Recursive Feature Elimination.
Run a feature selection with Shadow Variable Search.
mlr3fselect::FSelector
-> FSelectorBatch
new()
Creates a new instance of this R6 class.
FSelectorBatch$new( id = "fselector_batch", param_set, properties, packages = character(), label = NA_character_, man = NA_character_ )
id
(character(1)
)
Identifier for the new instance.
param_set
paradox::ParamSet
Set of control parameters.
properties
(character()
)
Set of properties of the fselector.
Must be a subset of mlr_reflections$fselect_properties
.
packages
(character()
)
Set of required packages.
Note that these packages will be loaded via requireNamespace()
, and are not attached.
label
(character(1)
)
Label for this object.
Can be used in tables, plot and text output instead of the ID.
man
(character(1)
)
String in the format [pkg]::[topic]
pointing to a manual page for this object.
The referenced help package can be opened via method $help()
.
optimize()
Performs the feature selection on a FSelectInstanceBatchSingleCrit or FSelectInstanceBatchMultiCrit until termination. The single evaluations will be written into the ArchiveBatchFSelect that resides in the FSelectInstanceBatchSingleCrit / FSelectInstanceBatchMultiCrit. The result will be written into the instance object.
FSelectorBatch$optimize(inst)
clone()
The objects of this class are cloneable with this method.
FSelectorBatch$clone(deep = FALSE)
deep
Whether to make a deep clone.
Function to construct a FSelectInstanceBatchSingleCrit or FSelectInstanceBatchMultiCrit.
fsi( task, learner, resampling, measures = NULL, terminator, store_benchmark_result = TRUE, store_models = FALSE, check_values = FALSE, callbacks = NULL, ties_method = "least_features" )
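task | (mlr3::Task) Task to operate on. |
learner | (mlr3::Learner) Learner to optimize the feature subset for. |
resampling | (mlr3::Resampling) Resampling that is used to evaluate the performance of the feature subsets. Uninstantiated resamplings are instantiated during construction so that all feature subsets are evaluated on the same data splits. Already instantiated resamplings are kept unchanged. |
measures | (mlr3::Measure or list of mlr3::Measure) Measure(s) to optimize. If NULL, the default measure is used. |
terminator | (bbotk::Terminator) Stop criterion of the feature selection. |
store_benchmark_result | (logical(1)) Store benchmark result in archive? |
store_models | (logical(1)) Store models in benchmark result? |
check_values | (logical(1)) Check the parameters before the evaluation and the results for validity? |
callbacks | (list of CallbackBatchFSelect) List of callbacks. |
ties_method | (character(1)) The method to break ties when selecting sets while optimizing and when selecting the best set. Can be "least_features" (default) or "random". |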
There are several sections about feature selection in the mlr3book.
Getting started with wrapper feature selection.
Do a sequential forward selection on the Palmer Penguins data set.
The gallery features a collection of case studies and demos about optimization.
Utilize the built-in feature importance of models with Recursive Feature Elimination.
Run a feature selection with Shadow Variable Search.
If no measure is passed, the default measure is used. The default measure depends on the task type.
Task | Default Measure | Package |
"classif" | "classif.ce" | mlr3 |
"regr" | "regr.mse" | mlr3 |
"surv" | "surv.cindex" | mlr3proba |
"dens" | "dens.logloss" | mlr3proba |
"classif_st" | "classif.ce" | mlr3spatial |
"regr_st" | "regr.mse" | mlr3spatial |
"clust" | "clust.dunn" | mlr3cluster |
# Feature selection on Palmer Penguins data set
task = tsk("penguins")
learner = lrn("classif.rpart")

# Construct feature selection instance
instance = fsi(
  task = task,
  learner = learner,
  resampling = rsmp("cv", folds = 3),
  measures = msr("classif.ce"),
  terminator = trm("evals", n_evals = 4)
)

# Choose optimization algorithm
fselector = fs("random_search", batch_size = 2)

# Run feature selection
fselector$optimize(instance)

# Subset task to optimal feature set
task$select(instance$result_feature_set)

# Train the learner with optimal feature set on the full data set
learner$train(task)

# Inspect all evaluated sets
as.data.table(instance$archive)
A mlr3misc::Dictionary storing objects of class FSelector.
Each fselector has an associated help page, see mlr_fselectors_[id].
For a more convenient way to retrieve and construct fselectors, see fs()/fss().
R6::R6Class object inheriting from mlr3misc::Dictionary.
See mlr3misc::Dictionary.
as.data.table(dict, ..., objects = FALSE)
mlr3misc::Dictionary -> data.table::data.table()
Returns a data.table::data.table() with fields "key", "label", "properties" and "packages" as columns.
If objects is set to TRUE, the constructed objects are returned in the list column named object.
Other FSelector: FSelector, mlr_fselectors_design_points, mlr_fselectors_exhaustive_search, mlr_fselectors_genetic_search, mlr_fselectors_random_search, mlr_fselectors_rfe, mlr_fselectors_rfecv, mlr_fselectors_sequential, mlr_fselectors_shadow_variable_search
as.data.table(mlr_fselectors)
mlr_fselectors$get("random_search")
fs("random_search")
Feature selection using user-defined feature sets.
The feature sets are evaluated in order as given.
The feature selection terminates itself when all feature sets are evaluated. It is not necessary to set a termination criterion.
This FSelector can be instantiated with the associated sugar function fs()
:
fs("design_points")
batch_size
integer(1)
Maximum number of configurations to try in a batch.
design
data.table::data.table
Design points to try in search, one per row.
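How a design is processed in order and in batches of at most batch_size rows can be sketched in base R. The design below and the batching via split() are illustrative assumptions, not the package's implementation.

```r
# Hypothetical design: one feature set per row, TRUE = feature included.
design = data.frame(
  age     = c(TRUE, TRUE, FALSE, TRUE, TRUE),
  glucose = c(FALSE, TRUE, TRUE, TRUE, FALSE)
)
batch_size = 2

# Keep the row order and cut it into consecutive batches.
batch_id = ceiling(seq_len(nrow(design)) / batch_size)
batches = split(design, batch_id)

vapply(batches, nrow, integer(1))  # batch sizes: 2, 2, 1
```

The last batch is smaller because batch_size is a maximum, not a fixed size.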
mlr3fselect::FSelector
-> mlr3fselect::FSelectorBatch
-> mlr3fselect::FSelectorBatchFromOptimizerBatch
-> FSelectorBatchDesignPoints
new()
Creates a new instance of this R6 class.
FSelectorBatchDesignPoints$new()
clone()
The objects of this class are cloneable with this method.
FSelectorBatchDesignPoints$clone(deep = FALSE)
deep
Whether to make a deep clone.
Other FSelector: FSelector, mlr_fselectors, mlr_fselectors_exhaustive_search, mlr_fselectors_genetic_search, mlr_fselectors_random_search, mlr_fselectors_rfe, mlr_fselectors_rfecv, mlr_fselectors_sequential, mlr_fselectors_shadow_variable_search
# Feature Selection

# retrieve task and load learner
task = tsk("pima")
learner = lrn("classif.rpart")

# create design
design = mlr3misc::rowwise_table(
  ~age, ~glucose, ~insulin, ~mass, ~pedigree, ~pregnant, ~pressure, ~triceps,
  TRUE, FALSE,    TRUE,     TRUE,  FALSE,     TRUE,      FALSE,     TRUE,
  TRUE, TRUE,     FALSE,    TRUE,  FALSE,     TRUE,      FALSE,     FALSE,
  TRUE, FALSE,    TRUE,     TRUE,  FALSE,     TRUE,      FALSE,     FALSE,
  TRUE, FALSE,    TRUE,     TRUE,  FALSE,     TRUE,      TRUE,      TRUE
)

# run feature selection on the Pima Indians diabetes data set
instance = fselect(
  fselector = fs("design_points", design = design),
  task = task,
  learner = learner,
  resampling = rsmp("holdout"),
  measure = msr("classif.ce")
)

# best performing feature set
instance$result

# all evaluated feature sets
as.data.table(instance$archive)

# subset the task and fit the final model
task$select(instance$result_feature_set)
learner$train(task)
Feature Selection using the Exhaustive Search Algorithm. Exhaustive Search generates all possible feature sets.
The feature selection terminates itself when all feature sets are evaluated. It is not necessary to set a termination criterion.
This FSelector can be instantiated with the associated sugar function fs()
:
fs("exhaustive_search")
max_features
integer(1)
Maximum number of features.
By default, number of features in mlr3::Task.
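The effect of max_features on the size of the search can be sketched in base R. The helper all_feature_sets() and the feature names are hypothetical, and the sketch assumes that only non-empty feature sets are generated.

```r
# Enumerate every non-empty feature subset with at most max_features features.
all_feature_sets = function(features, max_features = length(features)) {
  sets = lapply(seq_len(max_features), function(k) {
    combn(features, k, simplify = FALSE)
  })
  unlist(sets, recursive = FALSE)
}

features = c("f1", "f2", "f3", "f4")  # hypothetical feature names

length(all_feature_sets(features))                    # 2^4 - 1 = 15 subsets
length(all_feature_sets(features, max_features = 2))  # 4 + 6 = 10 subsets
```

The number of sets grows exponentially with the number of features, so capping max_features is the main lever for keeping an exhaustive search tractable.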
mlr3fselect::FSelector
-> mlr3fselect::FSelectorBatch
-> FSelectorBatchExhaustiveSearch
new()
Creates a new instance of this R6 class.
FSelectorBatchExhaustiveSearch$new()
clone()
The objects of this class are cloneable with this method.
FSelectorBatchExhaustiveSearch$clone(deep = FALSE)
deep
Whether to make a deep clone.
Other FSelector: FSelector, mlr_fselectors, mlr_fselectors_design_points, mlr_fselectors_genetic_search, mlr_fselectors_random_search, mlr_fselectors_rfe, mlr_fselectors_rfecv, mlr_fselectors_sequential, mlr_fselectors_shadow_variable_search
# Feature Selection

# retrieve task and load learner
task = tsk("penguins")
learner = lrn("classif.rpart")

# run feature selection on the Palmer Penguins data set
instance = fselect(
  fselector = fs("exhaustive_search"),
  task = task,
  learner = learner,
  resampling = rsmp("holdout"),
  measure = msr("classif.ce"),
  term_evals = 10
)

# best performing feature set
instance$result

# all evaluated feature sets
as.data.table(instance$archive)

# subset the task and fit the final model
task$select(instance$result_feature_set)
learner$train(task)
Feature selection using the Genetic Algorithm from the package genalg.
This FSelector can be instantiated with the associated sugar function fs()
:
fs("genetic_search")
For the meaning of the control parameters, see genalg::rbga.bin()
.
genalg::rbga.bin() internally terminates after iters iterations.
We set iters = 100000 to allow the termination via our terminators.
If more iterations are needed, set iters to a higher value in the parameter set.
mlr3fselect::FSelector
-> mlr3fselect::FSelectorBatch
-> FSelectorBatchGeneticSearch
new()
Creates a new instance of this R6 class.
FSelectorBatchGeneticSearch$new()
clone()
The objects of this class are cloneable with this method.
FSelectorBatchGeneticSearch$clone(deep = FALSE)
deep
Whether to make a deep clone.
Other FSelector: FSelector, mlr_fselectors, mlr_fselectors_design_points, mlr_fselectors_exhaustive_search, mlr_fselectors_random_search, mlr_fselectors_rfe, mlr_fselectors_rfecv, mlr_fselectors_sequential, mlr_fselectors_shadow_variable_search
# Feature Selection

# retrieve task and load learner
task = tsk("penguins")
learner = lrn("classif.rpart")

# run feature selection on the Palmer Penguins data set
instance = fselect(
  fselector = fs("genetic_search"),
  task = task,
  learner = learner,
  resampling = rsmp("holdout"),
  measure = msr("classif.ce"),
  term_evals = 10
)

# best performing feature set
instance$result

# all evaluated feature sets
as.data.table(instance$archive)

# subset the task and fit the final model
task$select(instance$result_feature_set)
learner$train(task)
Feature selection using the Random Search algorithm.
The feature sets are randomly drawn.
The sets are evaluated in batches of size batch_size.
Larger batches mean we can parallelize more, smaller batches imply a more fine-grained checking of termination criteria.
This FSelector can be instantiated with the associated sugar function fs()
:
fs("random_search")
max_features
integer(1)
Maximum number of features.
By default, number of features in mlr3::Task.
batch_size
integer(1)
Maximum number of feature sets to try in a batch.
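How max_features and batch_size interact when drawing random feature sets can be sketched in base R. The helper draw_batch() is hypothetical, and the sampling scheme (first draw the set size, then the members) is an assumption made for illustration rather than a statement about the package's internals.

```r
# Draw one batch of random feature sets as a logical matrix (rows = feature sets).
draw_batch = function(features, batch_size, max_features = length(features)) {
  sets = t(replicate(batch_size, {
    n = sample.int(max_features, 1)        # how many features to include
    idx = sample.int(length(features), n)  # which features to include
    replace(logical(length(features)), idx, TRUE)
  }))
  colnames(sets) = features
  sets
}

set.seed(1)
batch = draw_batch(c("f1", "f2", "f3", "f4"), batch_size = 3, max_features = 2)
rowSums(batch)  # every set has between 1 and max_features features
```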
mlr3fselect::FSelector
-> mlr3fselect::FSelectorBatch
-> FSelectorBatchRandomSearch
new()
Creates a new instance of this R6 class.
FSelectorBatchRandomSearch$new()
clone()
The objects of this class are cloneable with this method.
FSelectorBatchRandomSearch$clone(deep = FALSE)
deep
Whether to make a deep clone.
Bergstra J, Bengio Y (2012). “Random Search for Hyper-Parameter Optimization.” Journal of Machine Learning Research, 13(10), 281–305. https://jmlr.csail.mit.edu/papers/v13/bergstra12a.html.
Other FSelector: FSelector, mlr_fselectors, mlr_fselectors_design_points, mlr_fselectors_exhaustive_search, mlr_fselectors_genetic_search, mlr_fselectors_rfe, mlr_fselectors_rfecv, mlr_fselectors_sequential, mlr_fselectors_shadow_variable_search
# Feature Selection

# retrieve task and load learner
task = tsk("penguins")
learner = lrn("classif.rpart")

# run feature selection on the Palmer Penguins data set
instance = fselect(
  fselector = fs("random_search"),
  task = task,
  learner = learner,
  resampling = rsmp("holdout"),
  measure = msr("classif.ce"),
  term_evals = 10
)

# best performing feature subset
instance$result

# all evaluated feature subsets
as.data.table(instance$archive)

# subset the task and fit the final model
task$select(instance$result_feature_set)
learner$train(task)
Feature selection using the Recursive Feature Elimination (RFE) algorithm. Recursive feature elimination iteratively removes features with a low importance score. Only works with mlr3::Learners that can calculate importance scores (see the section on optional extractors in mlr3::Learner).
The learner is trained on all features at the start and importance scores are calculated for each feature.
Then the least important feature is removed and the learner is trained on the reduced feature set.
The importance scores are calculated again and the procedure is repeated until the desired number of features is reached.
The non-recursive option (recursive = FALSE
) only uses the importance scores calculated in the first iteration.
The feature selection terminates itself when n_features
is reached.
It is not necessary to set a termination criterion.
When using a cross-validation resampling strategy, the importance scores of the resampling iterations are aggregated.
The parameter aggregation determines how the importance scores are aggregated.
By default ("rank"), the importance score vector of each fold is ranked and the feature with the lowest average rank is removed.
The option "mean" averages the score of each feature across the resampling iterations and removes the feature with the lowest average score.
Averaging the scores is not appropriate for most importance measures.
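The difference between the two aggregation options can be sketched in base R with made-up importance scores. The numbers are hypothetical, and the sketch assumes that rank 1 denotes the least important feature within a fold.

```r
# Hypothetical importance scores: rows = resampling iterations, columns = features.
# The first fold reports scores on a much larger scale than the other two.
importance = rbind(
  fold1 = c(a = 10.0, b = 1.0, c = 2.0),
  fold2 = c(a = 0.1,  b = 0.3, c = 0.2),
  fold3 = c(a = 0.1,  b = 0.3, c = 0.2)
)

# "rank": rank the features within each fold (1 = least important), then average.
ranks = t(apply(importance, 1, rank))
avg_rank = colMeans(ranks)

# "mean": average the raw scores across folds.
avg_score = colMeans(importance)

names(which.min(avg_rank))   # "a" is removed under "rank"
names(which.min(avg_score))  # "b" is removed under "mean"
```

In this constructed case a single fold with large raw scores dominates the mean, while ranking makes the folds comparable, which is one way to read the caution about averaging scores.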
The ArchiveBatchFSelect holds the following additional columns:
"importance"
(numeric()
)
The importance score vector of the feature subset.
The gallery features a collection of case studies and demos about optimization.
Utilize the built-in feature importance of models with Recursive Feature Elimination.
This FSelector can be instantiated with the associated sugar function fs()
:
fs("rfe")
n_features
integer(1)
The minimum number of features to select, by default half of the features.
feature_fraction
double(1)
Fraction of features to retain in each iteration.
The default of 0.5 retains half of the features.
feature_number
integer(1)
Number of features to remove in each iteration.
subset_sizes
integer()
Vector of the number of features to retain in each iteration.
Must be sorted in decreasing order.
recursive
logical(1)
If TRUE
(default), the feature importance is calculated in each iteration.
aggregation
character(1)
The aggregation method for the importance scores of the resampling iterations.
See details.
The parameters feature_fraction, feature_number and subset_sizes are mutually exclusive.
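The three ways of defining the elimination schedule can be compared with a small base R sketch. The helper rfe_sizes() is hypothetical; in particular, rounding down with floor() and clamping at n_features are assumptions of this sketch, not documented behavior.

```r
# Sequence of feature set sizes visited, starting from all n features.
rfe_sizes = function(n, n_features, feature_fraction = NULL,
                     feature_number = NULL, subset_sizes = NULL) {
  given = c(!is.null(feature_fraction), !is.null(feature_number), !is.null(subset_sizes))
  stopifnot(sum(given) == 1)  # the three options are mutually exclusive
  stopifnot(is.null(feature_fraction) || feature_fraction < 1)
  stopifnot(is.null(feature_number) || feature_number >= 1)
  if (!is.null(subset_sizes)) {
    stopifnot(!is.unsorted(rev(subset_sizes)))  # must be sorted in decreasing order
    return(c(n, subset_sizes))
  }
  sizes = n
  while (tail(sizes, 1) > n_features) {
    current = tail(sizes, 1)
    nxt = if (!is.null(feature_fraction)) {
      floor(current * feature_fraction)  # retain a fraction of the features
    } else {
      current - feature_number           # remove a fixed number of features
    }
    sizes = c(sizes, max(nxt, n_features))
  }
  sizes
}

rfe_sizes(16, n_features = 2, feature_fraction = 0.5)  # 16 8 4 2
rfe_sizes(7, n_features = 2, feature_number = 2)       # 7 5 3 2
rfe_sizes(10, subset_sizes = c(6, 3))                  # 10 6 3
```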
mlr3fselect::FSelector
-> mlr3fselect::FSelectorBatch
-> FSelectorBatchRFE
new()
Creates a new instance of this R6 class.
FSelectorBatchRFE$new()
clone()
The objects of this class are cloneable with this method.
FSelectorBatchRFE$clone(deep = FALSE)
deep
Whether to make a deep clone.
Guyon I, Weston J, Barnhill S, Vapnik V (2002). “Gene Selection for Cancer Classification using Support Vector Machines.” Machine Learning, 46(1), 389–422. ISSN 1573-0565, doi:10.1023/A:1012487302797.
Other FSelector: FSelector, mlr_fselectors, mlr_fselectors_design_points, mlr_fselectors_exhaustive_search, mlr_fselectors_genetic_search, mlr_fselectors_random_search, mlr_fselectors_rfecv, mlr_fselectors_sequential, mlr_fselectors_shadow_variable_search
# Feature Selection

# retrieve task and load learner
task = tsk("penguins")
learner = lrn("classif.rpart")

# run feature selection on the Palmer Penguins data set
instance = fselect(
  fselector = fs("rfe"),
  task = task,
  learner = learner,
  resampling = rsmp("holdout"),
  measure = msr("classif.ce"),
  store_models = TRUE
)

# best performing feature subset
instance$result

# all evaluated feature subsets
as.data.table(instance$archive)

# subset the task and fit the final model
task$select(instance$result_feature_set)
learner$train(task)
Feature selection using the Recursive Feature Elimination with Cross-Validation (RFE-CV) algorithm. See FSelectorBatchRFE for a description of the base algorithm. RFE-CV runs a recursive feature elimination in each iteration of a cross-validation to determine the optimal number of features. Then a recursive feature elimination is run again on the complete dataset with the optimal number of features as the final feature set size. The performance of the optimal feature set is calculated on the complete data set and should not be reported as the performance of the final model. Only works with mlr3::Learners that can calculate importance scores (see the section on optional extractors in mlr3::Learner).
The resampling strategy is changed during the feature selection.
The resampling strategy passed to the instance (resampling
) is used to determine the optimal number of features.
Usually, a cross-validation strategy is used and a recursive feature elimination is run in each iteration of the cross-validation.
Internally, mlr3::ResamplingCustom is used to emulate this part of the algorithm.
In the final recursive feature elimination run the resampling strategy is changed to mlr3::ResamplingInsample i.e. the complete data set is used for training and testing.
The feature selection terminates itself when the optimal number of features is reached. It is not necessary to set a termination criterion.
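The step that determines the optimal number of features can be sketched in base R. The error values are made up, and averaging the per-fold performance with colMeans() is an assumption about the aggregation used for illustration.

```r
# Hypothetical classification error per fold (rows) and feature set size (columns),
# as it would result from running one recursive feature elimination per fold.
cv_error = rbind(
  fold1 = c("4" = 0.10, "2" = 0.08, "1" = 0.20),
  fold2 = c("4" = 0.12, "2" = 0.09, "1" = 0.18),
  fold3 = c("4" = 0.11, "2" = 0.13, "1" = 0.25)
)

# Average over the folds and pick the size with the lowest mean error.
mean_error = colMeans(cv_error)
optimal_size = as.integer(names(which.min(mean_error)))
optimal_size  # 2
```

The final recursive feature elimination on the complete data set would then be run down to optimal_size features.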
The ArchiveBatchFSelect holds the following additional columns:
"iteration"
(integer(1)
)
The resampling iteration in which the feature subset was evaluated.
"importance"
(numeric()
)
The importance score vector of the feature subset.
The gallery features a collection of case studies and demos about optimization.
Utilize the built-in feature importance of models with Recursive Feature Elimination.
This FSelector can be instantiated with the associated sugar function fs()
:
fs("rfecv")
n_features
integer(1)
The number of features to select.
By default half of the features are selected.
feature_fraction
double(1)
Fraction of features to retain in each iteration.
The default of 0.5 retains half of the features.
feature_number
integer(1)
Number of features to remove in each iteration.
subset_sizes
integer()
Vector of number of features to retain in each iteration.
Must be sorted in decreasing order.
recursive
logical(1)
If TRUE
(default), the feature importance is calculated in each iteration.
The parameters feature_fraction, feature_number and subset_sizes are mutually exclusive.
mlr3fselect::FSelector
-> mlr3fselect::FSelectorBatch
-> FSelectorBatchRFECV
new()
Creates a new instance of this R6 class.
FSelectorBatchRFECV$new()
clone()
The objects of this class are cloneable with this method.
FSelectorBatchRFECV$clone(deep = FALSE)
deep
Whether to make a deep clone.
Other FSelector: FSelector, mlr_fselectors, mlr_fselectors_design_points, mlr_fselectors_exhaustive_search, mlr_fselectors_genetic_search, mlr_fselectors_random_search, mlr_fselectors_rfe, mlr_fselectors_sequential, mlr_fselectors_shadow_variable_search
# Feature Selection

# retrieve task and load learner
task = tsk("penguins")
learner = lrn("classif.rpart")

# run feature selection on the Palmer Penguins data set
instance = fselect(
  fselector = fs("rfecv"),
  task = task,
  learner = learner,
  resampling = rsmp("cv", folds = 3),
  measure = msr("classif.ce"),
  store_models = TRUE
)

# best performing feature subset
instance$result

# all evaluated feature subsets
as.data.table(instance$archive)

# subset the task and fit the final model
task$select(instance$result_feature_set)
learner$train(task)
Feature selection using the Sequential Search algorithm.
Sequential forward selection (strategy = sfs) extends the feature set in each iteration with the feature that increases the model's performance the most.
Sequential backward selection (strategy = sbs) follows the same idea but starts with all features and removes features from the set.
The feature selection terminates itself when min_features or max_features is reached.
It is not necessary to set a termination criterion.
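The forward strategy described above can be sketched in base R as a greedy loop. The helper forward_selection() and the toy score (in-sample R^2 of a linear model on the built-in mtcars data) are illustrative assumptions; the package itself scores feature sets by resampling the learner.

```r
# Greedy forward selection; `score` maps a character vector of features to a
# number where higher is better.
forward_selection = function(features, score, max_features = length(features)) {
  selected = character()
  path = list()
  while (length(selected) < max_features) {
    candidates = setdiff(features, selected)
    scores = vapply(candidates, function(f) score(c(selected, f)), numeric(1))
    selected = c(selected, names(which.max(scores)))  # add the best candidate
    path[[length(path) + 1]] = selected
  }
  path
}

# Toy score: in-sample R^2 of a linear model for mpg.
score = function(fs) {
  summary(lm(reformulate(fs, response = "mpg"), data = mtcars))$r.squared
}

path = forward_selection(c("wt", "hp", "drat"), score, max_features = 2)
path  # one entry per iteration; each entry extends the previous set by one feature
```

Backward selection would follow the same pattern, starting from all features and dropping the feature whose removal hurts the score the least.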
This FSelector can be instantiated with the associated sugar function fs()
:
fs("sequential")
min_features
integer(1)
Minimum number of features. By default, 1.
max_features
integer(1)
Maximum number of features. By default, number of features in mlr3::Task.
strategy
character(1)
Search method sfs (forward search) or sbs (backward search).
mlr3fselect::FSelector
-> mlr3fselect::FSelectorBatch
-> FSelectorBatchSequential
new()
Creates a new instance of this R6 class.
FSelectorBatchSequential$new()
optimization_path()
Returns the optimization path.
FSelectorBatchSequential$optimization_path(inst, include_uhash = FALSE)
inst
(FSelectInstanceBatchSingleCrit)
Instance optimized with FSelectorBatchSequential.
include_uhash
(logical(1)
)
Include uhash
column?
clone()
The objects of this class are cloneable with this method.
FSelectorBatchSequential$clone(deep = FALSE)
deep
Whether to make a deep clone.
Other FSelector: FSelector, mlr_fselectors, mlr_fselectors_design_points, mlr_fselectors_exhaustive_search, mlr_fselectors_genetic_search, mlr_fselectors_random_search, mlr_fselectors_rfe, mlr_fselectors_rfecv, mlr_fselectors_shadow_variable_search
# Feature Selection

# retrieve task and load learner
task = tsk("penguins")
learner = lrn("classif.rpart")

# run feature selection on the Palmer Penguins data set
instance = fselect(
  fselector = fs("sequential"),
  task = task,
  learner = learner,
  resampling = rsmp("holdout"),
  measure = msr("classif.ce"),
  term_evals = 10
)

# best performing feature set
instance$result

# all evaluated feature sets
as.data.table(instance$archive)

# subset the task and fit the final model
task$select(instance$result_feature_set)
learner$train(task)
Feature selection using the Shadow Variable Search Algorithm. Shadow variable search creates a permuted copy of each feature and stops when one of these copies is selected.
The feature selection terminates itself when the first shadow variable is selected. It is not necessary to set a termination criterion.
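The stopping rule can be sketched in base R. This is an illustration under assumptions, not the package code: the data are simulated, the `shadow_` prefix and the `score()` helper are hypothetical, and candidates are compared by the R-squared of a linear model rather than by a resampled mlr3 measure.

```r
set.seed(1)
n = 200
x = data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))
y = 2 * x$a - x$b + rnorm(n, sd = 0.5)

# One permuted ("shadow") copy per feature; permuting breaks any link to y
shadows = as.data.frame(lapply(x, sample))
names(shadows) = paste0("shadow_", names(x))
pool = cbind(x, shadows)

# Hypothetical score: R-squared of a linear model on the candidate features
score = function(features) {
  d = cbind(y = y, pool[, features, drop = FALSE])
  summary(lm(y ~ ., data = d))$r.squared
}

# Greedy forward search that stops once a shadow variable would be added
selected = character(0)
repeat {
  candidates = setdiff(names(pool), selected)
  scores = vapply(candidates, function(f) score(c(selected, f)), numeric(1))
  best = candidates[which.max(scores)]
  if (startsWith(best, "shadow_")) break
  selected = c(selected, best)
}
selected
```

Because the loop ends as soon as a shadow variable is the best candidate, no evaluation budget or other termination criterion is needed in this sketch.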
The gallery features a collection of case studies and demos about optimization.
Run a feature selection with Shadow Variable Search.
This FSelector can be instantiated with the associated sugar function fs():
fs("shadow_variable_search")
mlr3fselect::FSelector
-> mlr3fselect::FSelectorBatch
-> FSelectorBatchShadowVariableSearch
new()
Creates a new instance of this R6 class.
FSelectorBatchShadowVariableSearch$new()
optimization_path()
Returns the optimization path.
FSelectorBatchShadowVariableSearch$optimization_path(inst)
inst
(FSelectInstanceBatchSingleCrit)
Instance optimized with FSelectorBatchShadowVariableSearch.
clone()
The objects of this class are cloneable with this method.
FSelectorBatchShadowVariableSearch$clone(deep = FALSE)
deep
Whether to make a deep clone.
Thomas J, Hepp T, Mayr A, Bischl B (2017). “Probing for Sparse and Fast Variable Selection with Model-Based Boosting.” Computational and Mathematical Methods in Medicine, 2017, 1–8. doi:10.1155/2017/1421409.
Wu Y, Boos DD, Stefanski LA (2007). “Controlling Variable Selection by the Addition of Pseudovariables.” Journal of the American Statistical Association, 102(477), 235–243. doi:10.1198/016214506000000843.
Other FSelector:
FSelector, mlr_fselectors, mlr_fselectors_design_points, mlr_fselectors_exhaustive_search, mlr_fselectors_genetic_search, mlr_fselectors_random_search, mlr_fselectors_rfe, mlr_fselectors_rfecv, mlr_fselectors_sequential
# Feature Selection

# retrieve task and load learner
task = tsk("penguins")
learner = lrn("classif.rpart")

# run feature selection on the Palmer Penguins data set
# (no termination criterion is needed, the search stops by itself)
instance = fselect(
  fselector = fs("shadow_variable_search"),
  task = task,
  learner = learner,
  resampling = rsmp("holdout"),
  measures = msr("classif.ce")
)

# best performing feature subset
instance$result

# all evaluated feature subsets
as.data.table(instance$archive)

# subset the task and fit the final model
task$select(instance$result_feature_set)
learner$train(task)
This CallbackBatchFSelect writes the mlr3::BenchmarkResult after each batch to disk.
clbk("mlr3fselect.backup", path = "backup.rds")

# Run feature selection on the pima data set with the callback
instance = fselect(
  fselector = fs("random_search"),
  task = tsk("pima"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),
  measures = msr("classif.ce"),
  term_evals = 4,
  callbacks = clbk("mlr3fselect.backup", path = tempfile(fileext = ".rds")))
This callback runs internal tuning alongside the feature selection. The internal tuning values are aggregated and stored in the results. The final model is trained with the best feature set and the tuned value.
clbk("mlr3fselect.internal_tuning")
Selects the smallest feature set within one standard error of the best as the result. If there are multiple such feature sets with the same number of features, the first one is selected. If the sets have exactly the same performance but different numbers of features, the one with the fewest features is selected.
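The selection rule can be sketched in a few lines of base R. The helper `one_se_select()` is hypothetical and not part of the package; it assumes a measure that is minimized (such as a classification error) and takes the standard error as an input, since how the callback estimates it is not described here.

```r
# Hypothetical helper: index of the smallest feature set whose error is
# within one standard error of the best (lowest) error.
one_se_select = function(ce, n_features, se) {
  best = min(ce)
  eligible = which(ce <= best + se)
  # which.min() returns the first minimum, so ties go to the earliest set
  eligible[which.min(n_features[eligible])]
}

ce = c(0.20, 0.21, 0.25, 0.205)  # error of each evaluated feature set
n_features = c(6, 3, 2, 3)       # size of each feature set

one_se_select(ce, n_features, se = 0.015)  # selects index 2
```

Sets 1, 2 and 4 fall within the threshold of 0.215. Sets 2 and 4 both have three features, so the first of them is returned.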
Kuhn, Max, Johnson, Kjell (2013). “Applied Predictive Modeling.” In chapter Over-Fitting and Model Tuning, 61–92. Springer New York, New York, NY. ISBN 978-1-4614-6849-3.
clbk("mlr3fselect.one_se_rule")

# Run feature selection on the pima data set with the callback
instance = fselect(
  fselector = fs("random_search"),
  task = tsk("pima"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("cv", folds = 3),
  measures = msr("classif.ce"),
  term_evals = 10,
  callbacks = clbk("mlr3fselect.one_se_rule"))

# Smallest feature set within one standard error of the best
instance$result
Runs a recursive feature elimination with a mlr3learners::LearnerClassifSVM.
The SVM must be configured with type = "C-classification" and kernel = "linear".
Guyon I, Weston J, Barnhill S, Vapnik V (2002). “Gene Selection for Cancer Classification using Support Vector Machines.” Machine Learning, 46(1), 389–422. ISSN 1573-0565, doi:10.1023/A:1012487302797.
clbk("mlr3fselect.svm_rfe")

library(mlr3learners)

# Create instance with classification svm with linear kernel
instance = fsi(
  task = tsk("sonar"),
  learner = lrn("classif.svm", type = "C-classification", kernel = "linear"),
  resampling = rsmp("cv", folds = 3),
  measures = msr("classif.ce"),
  terminator = trm("none"),
  callbacks = clbk("mlr3fselect.svm_rfe"),
  store_models = TRUE
)

fselector = fs("rfe", feature_number = 5, n_features = 10)

# Run recursive feature elimination on the Sonar data set
fselector$optimize(instance)
Stores the objective function that estimates the performance of feature subsets. This class is usually constructed internally by the FSelectInstanceBatchSingleCrit / FSelectInstanceBatchMultiCrit.
bbotk::Objective
-> ObjectiveFSelect
task
(mlr3::Task).
learner
resampling
measures
(list of mlr3::Measure).
store_models
(logical(1)).
store_benchmark_result
(logical(1)).
callbacks
(List of CallbackBatchFSelects).
new()
Creates a new instance of this R6 class.
ObjectiveFSelect$new( task, learner, resampling, measures, check_values = TRUE, store_benchmark_result = TRUE, store_models = FALSE, callbacks = NULL )
task
(mlr3::Task)
Task to operate on.
learner
(mlr3::Learner)
Learner to optimize the feature subset for.
resampling
(mlr3::Resampling)
Resampling that is used to evaluate the performance of the feature subsets.
Uninstantiated resamplings are instantiated during construction so that all feature subsets are evaluated on the same data splits.
Already instantiated resamplings are kept unchanged.
measures
(list of mlr3::Measure)
Measures to optimize.
If NULL, mlr3's default measure is used.
check_values
(logical(1))
Check the parameters before the evaluation and the results for validity?
store_benchmark_result
(logical(1))
Store benchmark result in archive?
store_models
(logical(1))
Store models in benchmark result?
callbacks
(list of CallbackBatchFSelect)
List of callbacks.
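The reason for fixing the resampling splits once, as described for the resampling argument, can be illustrated with a small base R sketch. It is only an analogy under assumptions: it uses the built-in mtcars data, a single holdout split, a linear model, and a hypothetical `holdout_rmse()` helper instead of mlr3 objects.

```r
set.seed(42)
n = nrow(mtcars)

# The split is drawn once ("instantiated") and reused for every feature subset
train_idx = sample(n, size = floor(2 / 3 * n))
test_idx = setdiff(seq_len(n), train_idx)

# Hypothetical helper: holdout RMSE of a linear model on the given features
holdout_rmse = function(features) {
  form = reformulate(features, response = "mpg")
  fit = lm(form, data = mtcars[train_idx, ])
  pred = predict(fit, newdata = mtcars[test_idx, ])
  sqrt(mean((mtcars$mpg[test_idx] - pred)^2))
}

# Both subsets are scored on identical train and test rows, so differences
# in the score reflect the features and not the random split
rmse = sapply(list("wt", c("wt", "hp")), holdout_rmse)
rmse
```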
clone()
The objects of this class are cloneable with this method.
ObjectiveFSelect$clone(deep = FALSE)
deep
Whether to make a deep clone.
Stores the objective function that estimates the performance of feature subsets. This class is usually constructed internally by the FSelectInstanceBatchSingleCrit / FSelectInstanceBatchMultiCrit.
bbotk::Objective
-> mlr3fselect::ObjectiveFSelect
-> ObjectiveFSelectBatch
archive
new()
Creates a new instance of this R6 class.
ObjectiveFSelectBatch$new( task, learner, resampling, measures, check_values = TRUE, store_benchmark_result = TRUE, store_models = FALSE, archive = NULL, callbacks = NULL )
task
(mlr3::Task)
Task to operate on.
learner
(mlr3::Learner)
Learner to optimize the feature subset for.
resampling
(mlr3::Resampling)
Resampling that is used to evaluate the performance of the feature subsets.
Uninstantiated resamplings are instantiated during construction so that all feature subsets are evaluated on the same data splits.
Already instantiated resamplings are kept unchanged.
measures
(list of mlr3::Measure)
Measures to optimize.
If NULL, mlr3's default measure is used.
check_values
(logical(1))
Check the parameters before the evaluation and the results for validity?
store_benchmark_result
(logical(1))
Store benchmark result in archive?
store_models
(logical(1))
Store models in benchmark result?
archive
(ArchiveBatchFSelect)
Reference to the archive of FSelectInstanceBatchSingleCrit | FSelectInstanceBatchMultiCrit.
If NULL (default), benchmark result and models cannot be stored.
callbacks
(list of CallbackBatchFSelect)
List of callbacks.
clone()
The objects of this class are cloneable with this method.
ObjectiveFSelectBatch$clone(deep = FALSE)
deep
Whether to make a deep clone.