Title: Gradient Boosting on Decision Trees
Description: Open-source gradient boosting on decision trees with categorical features support out of the box.
Authors: CatBoost DevTeam [aut, cre]
Maintainer: Stanislav Kirillov <[email protected]>
License: Apache License (== 2.0)
Version: 1.2.7
Built: 2024-11-22 19:25:43 UTC
Source: https://github.com/catboost/catboost
Support for the caret interface.
catboost.caret
An object of class list of length 10.
Estimate model performance using cross-validation.
catboost.cv( pool, params = list(), fold_count = 3, type = "Classical", partition_random_seed = 0, shuffle = TRUE, stratified = FALSE, early_stopping_rounds = NULL )
pool |
Data to cross-validate on |
params |
Parameters for catboost.train |
fold_count |
The number of folds. |
type |
The type of cross-validation. |
partition_random_seed |
The random seed used for splitting pool into folds. |
shuffle |
Shuffle the dataset objects before splitting into folds. |
stratified |
Perform stratified sampling. |
early_stopping_rounds |
Activates Iter overfitting detector with od_wait set to early_stopping_rounds. |
A data.frame of evaluation results from cross-validation.
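A minimal cross-validation sketch (not part of the original reference); it assumes a pool built from the bundled adult dataset as in the catboost.load_pool example below, and the parameter values are illustrative only:
## Not run:
pool_path <- system.file("extdata", "adult_train.1000", package = "catboost")
cd_path <- system.file("extdata", "adult.cd", package = "catboost")
pool <- catboost.load_pool(pool_path, column_description = cd_path)
params <- list(iterations = 100, loss_function = 'Logloss')
cv_result <- catboost.cv(pool, params = params, fold_count = 3, stratified = TRUE)
head(cv_result)
## End(Not run)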
Drop unused feature information from the model.
catboost.drop_unused_features(model, ntree_end, ntree_start = 0)
model |
The model obtained as the result of training. |
ntree_end |
Leave the trees with indices from the interval [ntree_start, ntree_end) (zero-based indexing). |
ntree_start |
Leave the trees with indices from the interval [ntree_start, ntree_end) (zero-based indexing). |
Status, the result of dropping unused features. TRUE if the operation succeeded, FALSE otherwise.
Calculate the specified metrics for the specified dataset.
catboost.eval_metrics( model, pool, metrics, ntree_start = 0L, ntree_end = 0L, eval_period = 1, thread_count = -1, tmp_dir = NULL )
model |
The model obtained as the result of training. Default value: Required argument |
pool |
The pool for which you want to evaluate the metrics. Default value: Required argument |
metrics |
The list of metrics to be calculated. (Supported metrics https://catboost.ai/docs/references/custom-metric__supported-metrics.html) Default value: Required argument |
ntree_start |
Model is applied on the interval [ntree_start, ntree_end) with the step eval_period (zero-based indexing). Default value: 0 |
ntree_end |
Model is applied on the interval [ntree_start, ntree_end) with the step eval_period (zero-based indexing). Default value: 0 (if the value is 0, this parameter is ignored and ntree_end is set to tree_count) |
eval_period |
Model is applied on the interval [ntree_start, ntree_end) with the step eval_period (zero-based indexing). Default value: 1 |
thread_count |
The number of threads to use when applying the model. If -1, then the number of threads is set to the number of CPU cores. Allows you to optimize the speed of execution. This parameter doesn't affect results. Default value: -1 |
tmp_dir |
The name of the temporary directory for intermediate results. If NULL, then the name will be generated. Default value: NULL |
A named list: metric name -> vector of length (ntree_end - ntree_start) / eval_period.
https://catboost.ai/docs/concepts/python-reference_catboost_eval-metrics.html
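A hypothetical usage sketch (not from the original reference), assuming `model` and `test_pool` were produced by catboost.train and catboost.load_pool as in the training example later in this manual:
## Not run:
metrics <- catboost.eval_metrics(model, test_pool,
                                 metrics = c('Logloss', 'AUC'),
                                 eval_period = 10)
str(metrics)  # one numeric vector per requested metric
## End(Not run)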
Calculate the feature importances (see https://catboost.ai/docs/concepts/fstr.html#fstr) (Regular feature importance, ShapValues, and Feature interaction strength).
catboost.get_feature_importance( model, pool = NULL, type = "FeatureImportance", thread_count = -1, fstr_type = NULL )
model |
The model obtained as the result of training. Default value: Required argument |
pool |
The input dataset. The feature importance for the training dataset is calculated if this argument is not specified. Models with ranking metrics require pool argument to calculate feature importance. Default value: NULL |
type |
The feature importance type (regular feature importance, ShapValues, or feature interaction strength; see the reference below). Default value: 'FeatureImportance' |
thread_count |
The number of threads to use when applying the model. If -1, then the number of threads is set to the number of CPU cores. Allows you to optimize the speed of execution. This parameter doesn't affect results. Default value: -1 |
fstr_type |
Deprecated parameter, use 'type' instead. |
Feature importances
https://catboost.ai/docs/features/feature-importances-calculation.html
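A short illustrative sketch (not part of the original reference), assuming `model` and `train_pool` exist as in the catboost.train example later in this manual:
## Not run:
fi <- catboost.get_feature_importance(model, pool = train_pool, type = 'FeatureImportance')
print(fi)
## End(Not run)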
Return the model parameters.
catboost.get_model_params(model)
model |
The model obtained as the result of training. |
A list object with model parameters.
https://catboost.ai/docs/concepts/r-reference_catboost-get_model_params.html
Calculate the object importances (see https://catboost.ai/docs/concepts/ostr.html). This is the implementation of the LeafInfluence algorithm from the following paper: https://arxiv.org/pdf/1802.06640.pdf
catboost.get_object_importance( model, pool, train_pool, top_size = -1, type = "Average", update_method = "SinglePoint", thread_count = -1, ostr_type = NULL )
model |
The model obtained as the result of training. Default value: Required argument |
pool |
The pool for which you want to evaluate the object importances. Default value: Required argument |
train_pool |
The pool on which the model has been trained. Default value: Required argument |
top_size |
Method returns the result of the top_size most important train objects. If -1, then the top size is not limited. Default value: -1 |
type |
Possible values are described in the reference below. Default value: 'Average' |
update_method |
The update set method. Descriptions of the possible methods are given in Section 3.1.3 of the paper linked above. Default value: 'SinglePoint' |
thread_count |
The number of threads to use when applying the model. If -1, then the number of threads is set to the number of CPU cores. Allows you to optimize the speed of execution. This parameter doesn't affect results. Default value: -1 |
ostr_type |
Deprecated parameter, use 'type' instead. |
A list with elements "indices" and "scores".
https://catboost.ai/docs/concepts/r-reference_catboost-get_object_importance.html
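A hypothetical sketch (not from the original reference), assuming `model`, `train_pool`, and `test_pool` exist as in the catboost.train example later in this manual:
## Not run:
oi <- catboost.get_object_importance(model, test_pool, train_pool, top_size = 10)
oi$indices  # indices of the most influential training objects
oi$scores   # the corresponding influence scores
## End(Not run)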
Return the plain model parameters.
catboost.get_plain_params(model)
model |
The model obtained as the result of training. |
A list object with model parameters.
Load the model from a file.
Note: Feature importance (see https://catboost.ai/docs/concepts/fstr.html#fstr) is not saved when using this function.
catboost.load_model(model_path, file_format = "cbm")
model_path |
The path to the model. Default value: Required argument |
file_format |
Format of the model file. Default value: 'cbm' |
A model object.
https://catboost.ai/docs/concepts/r-reference_catboost-load_model.html
Create a dataset from the given file, matrix or data.frame.
catboost.load_pool( data, label = NULL, cat_features = NULL, column_description = NULL, pairs = NULL, delimiter = "\t", has_header = FALSE, weight = NULL, group_id = NULL, group_weight = NULL, subgroup_id = NULL, pairs_weight = NULL, baseline = NULL, feature_names = NULL, thread_count = -1, graph = NULL )
data |
A file path, matrix, or data.frame with features. Default value: Required argument |
label |
The label vector or label matrix |
cat_features |
A vector of categorical feature indices. The indices are zero-based and can differ from those given in the column descriptions file. If the data parameter is a data.frame, do not use cat_features; categorical features are determined automatically from the data.frame column types. |
column_description |
The path to the input file that contains the column descriptions. |
pairs |
A file path, matrix, or data.frame that contains the pairs descriptions. The shape should be Nx2, where N is the number of pairs. The first element of each pair is the index of the winner document in the training set; the second element is the index of the loser document. |
delimiter |
Delimiter character to use to separate features in a file. |
has_header |
Read column names from the first line if this parameter is set to TRUE. |
weight |
The weights of the objects. |
group_id |
The group ids of the objects. |
group_weight |
The group weight of the objects. |
subgroup_id |
The subgroup ids of the objects. |
pairs_weight |
The weights of the pairs. |
baseline |
Vector of initial (raw) values of the objective function. Used in the calculation of final values of trees. |
feature_names |
A list of names for each feature in the dataset. |
thread_count |
The number of threads to use while reading the data. If -1, then the number of threads is set to the number of CPU cores. Optimizes reading time. This parameter doesn't affect results. |
graph |
A file path, matrix, or data.frame that contains pairs of object indices for graph features. The shape should be Nx2, where N is the number of index pairs. |
A catboost.Pool object.
## Not run:
# From file
pool_path <- system.file("extdata", "adult_train.1000", package = "catboost")
cd_path <- system.file("extdata", "adult.cd", package = "catboost")
pool <- catboost.load_pool(pool_path, column_description = cd_path)
print(pool)

# From matrix
target <- 1
data_matrix <- matrix(runif(18), 6, 3)
pool <- catboost.load_pool(data_matrix[, -target], label = data_matrix[, target])
print(pool)

# From data.frame
nonsense <- factor(c('A', 'B', 'C'))
data_frame <- data.frame(value = runif(10), category = nonsense[(1:10) %% 3 + 1])
label <- (1:10) %% 2
pool <- catboost.load_pool(data_frame, label = label)
print(pool)
## End(Not run)
Apply the model to the given dataset.
Peculiarities: in the case of multiclass classification the prediction is returned as a matrix. Each row of this matrix contains the predictions for one object of the input dataset.
catboost.predict( model, pool, verbose = FALSE, prediction_type = "RawFormulaVal", ntree_start = 0, ntree_end = 0, thread_count = -1 )
model |
The model obtained as the result of training. Default value: Required argument |
pool |
The input dataset. Default value: Required argument |
verbose |
Verbose output to stdout. Default value: FALSE (not used) |
prediction_type |
The format for displaying approximated values in output data (see https://catboost.ai/docs/concepts/output-data.html). Possible values: 'RawFormulaVal', 'Probability', 'Class'. Default value: 'RawFormulaVal' |
ntree_start |
Model is applied on the interval [ntree_start, ntree_end) (zero-based indexing). Default value: 0 |
ntree_end |
Model is applied on the interval [ntree_start, ntree_end) (zero-based indexing). Default value: 0 (if the value is 0, this parameter is ignored and ntree_end is set to tree_count) |
thread_count |
The number of threads to use when applying the model. If -1, then the number of threads is set to the number of CPU cores. Allows you to optimize the speed of execution. This parameter doesn't affect results. Default value: -1 |
Vector of predictions (matrix for multi-class classification).
https://catboost.ai/docs/concepts/r-reference_catboost-predict.html
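An illustrative sketch (not part of the original reference), assuming `model` and `test_pool` exist as in the catboost.train example later in this manual:
## Not run:
raw    <- catboost.predict(model, test_pool)                                   # raw formula values
prob   <- catboost.predict(model, test_pool, prediction_type = 'Probability')  # class probabilities
labels <- catboost.predict(model, test_pool, prediction_type = 'Class')        # predicted classes
## End(Not run)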
After de-serializing a model object through base R functions ('readRDS', 'load'), its underlying object will no longer exist in the computer's memory and needs to be restored from the raw bytes that the model stores.
This is done automatically and internally when calling functions such as catboost.predict, but the process is repeated at each call, which makes those calls slower than with a fresh model object and increases memory usage in between calls to the garbage collector. This function restores the internal object beforehand so that it does not have to be restored multiple times.
Note that the model object needs to be re-assigned as the output of this function, as the modifications are not done in-place.
catboost.restore_handle(model)
model |
The model obtained as the result of training which has been serialized and is now de-serialized. |
The model object with its handle pointing to a valid object in memory.
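A minimal sketch of the intended workflow (not from the original reference); the file name is a placeholder and `model`/`test_pool` are assumed to exist as in the catboost.train example later in this manual:
## Not run:
saveRDS(model, 'model.rds')                # serialize with base R
model <- readRDS('model.rds')              # de-serialize; the internal handle is now invalid
model <- catboost.restore_handle(model)    # re-assign: the restore is not done in-place
pred <- catboost.predict(model, test_pool)
## End(Not run)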
Save the model to a file.
Note: Feature importance (see https://catboost.ai/docs/concepts/fstr.html#fstr) is not saved when using this function.
catboost.save_model( model, model_path, file_format = "cbm", export_parameters = NULL, pool = NULL )
model |
The model to be saved. Default value: Required argument |
model_path |
The path to the resulting binary file with the model description. Used for solving other machine learning problems (for instance, applying a model). Default value: Required argument |
file_format |
The format of the model file. Default value: 'cbm' |
export_parameters |
Parameters for CoreML or PMML export. |
pool |
The training pool. |
Status, the result of saving the model. TRUE if saving succeeded, FALSE otherwise.
https://catboost.ai/docs/features/export-model-to-core-ml.html
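A hypothetical save/load round trip (not part of the original reference); the file path is a placeholder and `model` is assumed to exist as in the catboost.train example later in this manual:
## Not run:
catboost.save_model(model, 'model.cbm', file_format = 'cbm')
restored <- catboost.load_model('model.cbm', file_format = 'cbm')
## End(Not run)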
Save the dataset to the CatBoost format. Files with the following data are created:
Dataset description
Column descriptions
Use the catboost.load_pool function to read the resulting files. These files can also be used in the Command-line version and the Python library.
catboost.save_pool( data, label = NULL, weight = NULL, baseline = NULL, pool_path = "data.pool", cd_path = "cd.pool" )
data |
A data.frame with features. Default value: Required argument |
label |
The label vector. |
weight |
The weights of the label vector. |
baseline |
Vector of initial (raw) values of the label function for the object. Used in the calculation of final values of trees. |
pool_path |
The path to the output file that contains the dataset description. |
cd_path |
The path to the output file that contains the column descriptions. |
Nothing. This method writes a dataset to disk.
Shrink the model
catboost.shrink(model, ntree_end, ntree_start = 0)
model |
The model obtained as the result of training. |
ntree_end |
Leave the trees with indices from the interval [ntree_start, ntree_end) (zero-based indexing). |
ntree_start |
Leave the trees with indices from the interval [ntree_start, ntree_end) (zero-based indexing). |
Status, the result of model shrinking. TRUE if shrinking succeeded, FALSE otherwise.
https://catboost.ai/docs/concepts/r-reference_catboost-shrink.html
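A minimal sketch (not from the original reference), assuming `model` was trained with at least 50 trees; it keeps the trees with indices in [0, 50):
## Not run:
ok <- catboost.shrink(model, ntree_end = 50, ntree_start = 0)
## End(Not run)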
Apply the model to the given dataset and calculate the results for each i-th tree of the model taking into consideration only the trees in the range [1;i].
Peculiarities: in the case of multiclass classification the prediction is returned as a matrix. Each row of this matrix contains the predictions for one object of the input dataset.
catboost.staged_predict( model, pool, verbose = FALSE, prediction_type = "RawFormulaVal", ntree_start = 0L, ntree_end = 0L, eval_period = 1, thread_count = -1 )
model |
The model obtained as the result of training. Default value: Required argument |
pool |
The input dataset. Default value: Required argument |
verbose |
Verbose output to stdout. Default value: FALSE (not used) |
prediction_type |
The format for displaying approximated values in output data (see https://catboost.ai/docs/concepts/output-data.html). Possible values: 'RawFormulaVal', 'Probability', 'Class'. Default value: 'RawFormulaVal' |
ntree_start |
Model is applied on the interval [ntree_start, ntree_end) with the step eval_period (zero-based indexing). Default value: 0 |
ntree_end |
Model is applied on the interval [ntree_start, ntree_end) with the step eval_period (zero-based indexing). Default value: 0 (if the value is 0, this parameter is ignored and ntree_end is set to tree_count) |
eval_period |
Model is applied on the interval [ntree_start, ntree_end) with the step eval_period (zero-based indexing). Default value: 1 |
thread_count |
The number of threads to use when applying the model. If -1, then the number of threads is set to the number of CPU cores. Allows you to optimize the speed of execution. This parameter doesn't affect results. Default value: -1 |
List object with predictions from one iteration.
https://catboost.ai/docs/concepts/r-reference_catboost-staged_predict.html
Blend trees and counters of two or more trained CatBoost models into a new model. Leaf values can be individually weighted for each input model. For example, it may be useful to blend models trained on different validation datasets.
catboost.sum_models( models, weights = NULL, ctr_merge_policy = "IntersectingCountersAverage" )
models |
Models for the summation. Default value: Required argument |
weights |
The weights of the models. Default value: NULL (use weight 1 for every model) |
ctr_merge_policy |
The counters merging policy. Default value: 'IntersectingCountersAverage' |
Model object.
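A hypothetical blending sketch (not part of the original reference); `model1` and `model2` are assumed to be CatBoost models trained on comparable data, and the weights are illustrative:
## Not run:
blended <- catboost.sum_models(list(model1, model2), weights = c(0.5, 0.5))
## End(Not run)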
Train the model using a CatBoost dataset.
The list of parameters
Common parameters
fold_permutation_block
Objects in the dataset are grouped in blocks before the random permutations. This parameter defines the size of the blocks. The smaller the value, the slower the training. Large values may result in quality degradation.
Default value:
Default value differs depending on the dataset size and ranges from 1 to 256 inclusively
ignored_features
Identifiers of features to exclude from training. The non-negative indices that do not match any features are successfully ignored. For example, if five features are defined for the objects in the dataset and this parameter is set to "42", the corresponding non-existing feature is successfully ignored.
The identifier corresponds to the feature's index.
Feature indices used in train and feature importance are numbered from 0 to featureCount-1.
If a file is used as input data then any non-feature column types are ignored when calculating these indices. For example, each row in the input file contains data in the following order: "categorical feature<\t>label<\t>numerical feature". So for the row "rock<\t>0<\t>42", the identifier for the "rock" feature is 0, and for the "42" feature it is 1.
The identifiers of features to exclude should be given as a vector.
For example, if training should exclude features with the identifiers 1, 2, 7, 42, 43, 44, 45, the value of this parameter should be set to c(1,2,7,42,43,44,45).
Default value:
None (use all features)
use_best_model
If this parameter is set to TRUE, the number of trees that are saved in the resulting model is defined as follows:
Build the number of trees defined by the training parameters.
Identify the iteration with the optimal loss function value.
No trees are saved after this iteration.
This option requires a test dataset to be provided.
Default value:
FALSE (not used)
loss_function
The loss function (see https://catboost.ai/docs/concepts/loss-functions.html#loss-functions) to use in training. The specified value also determines the machine learning problem to solve.
Format:
<Loss function 1>[:<parameter 1>=<value>:..<parameter N>=<value>:]
Supported loss functions:
'Logloss'
'CrossEntropy'
'MultiClass'
'MultiClassOneVsAll'
'RMSE'
'MAE'
'Quantile'
'LogLinQuantile'
'MAPE'
'Poisson'
'Lq'
'PairLogit'
'PairLogitPairwise'
'YetiRank'
'YetiRankPairwise'
'QueryCrossEntropy'
'QueryRMSE'
'QuerySoftMax'
Supported parameters:
alpha - The coefficient used in quantile-based losses ('Quantile' and 'LogLinQuantile'). The default value is 0.5.
For example, to calculate the value of Quantile with the coefficient alpha = 0.1, use the following construction:
'Quantile:alpha=0.1'
Default value:
'RMSE'
custom_loss
Loss function (see https://catboost.ai/docs/concepts/loss-functions.html#loss-functions) values to output during training. These functions are not used for optimization and are displayed for informational purposes only.
Format:
c(<Loss function 1>[:<parameter>=<value>],<Loss function 2>[:<parameter>=<value>],...,<Loss function N>[:<parameter>=<value>])
Supported loss functions:
'Logloss'
'CrossEntropy'
'Precision'
'Recall'
'F1'
'F'
'BalancedAccuracy'
'BalancedErrorRate'
'MCC'
'Accuracy'
'CtrFactor'
'AUC'
'BrierScore'
'HingeLoss'
'HammingLoss'
'ZeroOneLoss'
'Kappa'
'WKappa'
'LogLikelihoodOfPrediction'
'MultiClass'
'MultiClassOneVsAll'
'TotalF1'
'MAE'
'MAPE'
'Poisson'
'Quantile'
'RMSE'
'LogLinQuantile'
'Lq'
'NumErrors'
'SMAPE'
'R2'
'MSLE'
'MedianAbsoluteError'
'PairLogit'
'PairLogitPairwise'
'PairAccuracy'
'QueryCrossEntropy'
'QueryRMSE'
'QuerySoftMax'
'PFound'
'NDCG'
'AverageGain'
'PrecisionAt'
'RecallAt'
'MAP'
'MRR'
'ERR'
Supported parameters:
alpha - The coefficient used in quantile-based losses ('Quantile' and 'LogLinQuantile'). The default value is 0.5.
For example, to calculate the values of CrossEntropy and Quantile with the coefficient alpha = 0.1, use the following construction:
c('CrossEntropy', 'Quantile:alpha=0.1'). A single metric may be passed as a plain string, e.g. 'CrossEntropy'.
Values of all custom loss functions for the learning and test datasets are saved to the loss function output files (learn_error.tsv and test_error.tsv respectively; see https://catboost.ai/docs/concepts/output-data_loss-function.html#output-data_loss-function). The directory for these files is specified in the train_dir parameter.
Default value:
None (use one of the loss functions supported by the library)
eval_metric
The loss function used for overfitting detection (if enabled) and best model selection (if enabled).
Supported loss functions:
'Logloss'
'CrossEntropy'
'Precision'
'Recall'
'F1'
'F'
'BalancedAccuracy'
'BalancedErrorRate'
'MCC'
'Accuracy'
'CtrFactor'
'AUC'
'BrierScore'
'HingeLoss'
'HammingLoss'
'ZeroOneLoss'
'Kappa'
'WKappa'
'LogLikelihoodOfPrediction'
'MultiClass'
'MultiClassOneVsAll'
'TotalF1'
'MAE'
'MAPE'
'Poisson'
'Quantile'
'RMSE'
'LogLinQuantile'
'Lq'
'NumErrors'
'SMAPE'
'R2'
'MSLE'
'MedianAbsoluteError'
'PairLogit'
'PairLogitPairwise'
'PairAccuracy'
'QueryCrossEntropy'
'QueryRMSE'
'QuerySoftMax'
'PFound'
'NDCG'
'AverageGain'
'PrecisionAt'
'RecallAt'
'MAP'
'MRR'
'ERR'
Format:
metric_name:param=Value
Examples:
'R2'
'Quantile:alpha=0.3'
Default value:
Optimized objective is used
iterations
The maximum number of trees that can be built when solving machine learning problems.
When using other parameters that limit the number of iterations, the final number of trees may be less than the number specified in this parameter.
Default value:
1000
border
The target border. If the value is strictly greater than this threshold, it is considered a positive class. Otherwise it is considered a negative class.
The parameter is obligatory if the Logloss function is used, since it uses borders to transform any given target to a binary target.
Used in binary classification.
Default value:
0.5
leaf_estimation_iterations
The number of gradient steps when calculating the values in leaves.
Default value:
1
depth
Depth of the trees.
The value can be any integer up to 16. It is recommended to use values in the range [1; 10].
Default value:
6
learning_rate
The learning rate.
Used for reducing the gradient step.
Default value:
0.03
rsm
Random subspace method. The percentage of features to use at each iteration of building trees. At each iteration, features are selected anew at random.
The value must be in the range [0;1].
Default value:
1
random_seed
The random seed used for training.
Default value:
0
nan_mode
Way to process missing values.
Possible values:
'Min'
'Max'
'Forbidden'
Default value:
'Min'
od_pval
Use the Overfitting detector (see https://catboost.ai/docs/concepts/overfitting-detector.html#overfitting-detector) to stop training when the threshold is reached. Requires that a test dataset was input.
For best results, it is recommended to set a value in the range [10^-10; 10^-2].
The larger the value, the earlier overfitting is detected.
Default value:
The overfitting detection is turned off
od_type
The overfitting detector type.
Possible values:
IncToDec
Iter
Restriction. Do not specify the overfitting detector threshold when using the Iter type.
Default value:
'IncToDec'
od_wait
The number of iterations to continue the training after the iteration with the optimal loss function value. The purpose of this parameter differs depending on the selected overfitting detector type:
IncToDec - Ignore the overfitting detector when the threshold is reached and continue learning for the specified number of iterations after the iteration with the optimal loss function value.
Iter - Consider the model overfitted and stop training after the specified number of iterations since the iteration with the optimal loss function value.
Default value:
20
leaf_estimation_method
The method used to calculate the values in leaves.
Possible values:
Newton
Gradient
Default value:
Default value depends on the selected loss function
grow_policy
GPU only. The tree growing policy. It describes how to perform greedy tree construction.
Possible values:
SymmetricTree
Lossguide
Depthwise
Default value:
SymmetricTree
min_data_in_leaf
GPU only. The minimum number of training samples in a leaf. CatBoost will not search for new splits in leaves with a sample count less than min_data_in_leaf. This parameter is used only for the Depthwise and Lossguide growing policies.
Default value:
1
max_leaves
GPU only. The maximum leaf count in the resulting tree. This parameter is used only for the Lossguide growing policy.
Default value:
31
score_function
GPU only. The score used during tree construction to select the next tree split.
Possible values:
L2
Cosine
NewtonL2
NewtonCosine
Default value:
Cosine
For the Lossguide growing policy the default is NewtonL2.
l2_leaf_reg
L2 regularization coefficient. Used for leaf value calculation.
Any positive values are allowed.
Default value:
3
model_size_reg
Model size regularization coefficient. The influence coefficient of the model size when choosing the tree structure. Increase this coefficient to get a smaller model.
Any positive values are allowed.
Default value:
0.5
has_time
Use the order of objects in the input data (do not perform a random permutation of the dataset at the preprocessing stage).
Default value:
FALSE (not used; permute input dataset)
allow_const_label
Allow a constant label value in the dataset.
Default value:
FALSE
name
The experiment name to display in visualization tools (see https://catboost.ai/docs/features/visualization.html#visualization).
Default value:
experiment
prediction_type
The format for displaying approximated values in output data.
Possible values:
'Probability'
'Class'
'RawFormulaVal'
Default value:
'RawFormulaVal'
fold_len_multiplier
Coefficient for changing the length of folds.
The value must be greater than 1. The best validation result is achieved with minimum values. With values close to 1, each iteration requires memory and time that are quadratic in the number of objects in the iteration. Thus, low values are possible only when there is a small number of objects.
Default value:
2
class_weights
Class weights. The values are used as multipliers for the object weights.
For example, for 3 class classification you could use:
c(0.85, 1.2, 1)
Default value:
None (the weight for all classes is set to 1)
classes_count
The upper limit for the numeric class label. Defines the number of classes for multiclass classification.
Only non-negative integers can be specified. The given integer should be greater than any of the target values.
If this parameter is specified, the labels for all classes in the input dataset should be smaller than the given value.
Default value:
maximum class label + 1
one_hot_max_size
Convert the feature to float if the number of different values that it takes exceeds the specified value. Ctrs are not calculated for such features.
The one-vs.-all delimiter is used for the resulting float features.
Default value:
FALSE (do not convert features to float based on the number of different values)
random_strength
Score standard deviation multiplier.
Default value:
1
bootstrap_type
Bootstrap type. Defines the method for sampling the weights of documents.
Possible values:
'Bayesian'
'Bernoulli'
'Poisson'
'MVS'
'No'
Poisson bootstrap is supported only on GPU.
Default value:
'Bayesian'
bagging_temperature
Controls the intensity of Bayesian bagging. The higher the temperature, the more aggressive the bagging.
Typical values are in the range [0, 1] (0 means no bagging).
Possible values are in the range [0, +inf).
Default value:
1
subsample
Sample rate for bagging. This parameter can be used if one of the following bootstrap types is defined:
'Bernoulli'
Default value:
0.66
sampling_unit
This parameter specifies the sampling scheme: sample weights for each object individually or for an entire group of objects together.
Possible values:
'Object'
'Group'
Default value:
'Object'
sampling_frequency
Frequency to sample weights and objects when building trees.
Possible values:
'PerTree'
'PerTreeLevel'
Default value:
'PerTreeLevel'
model_shrink_rate
For i > 0, at the start of the i-th iteration the model is multiplied by (1 - model_shrink_rate / i).
Possible values: [0, 1).
Default value: 0
CTR settings
simple_ctr
Binarization settings for categorical features (see https://catboost.ai/docs/concepts/algorithm-main-stages_cat-to-numberic.html).
Format:
c(CtrType[:TargetBorderCount=BorderCount][:TargetBorderType=BorderType][:CtrBorderCount=Count][:CtrBorderType=Type][:Prior=num_1/denum_1]..[:Prior=num_N/denum_N])
Components:
CTR types for training on CPU:
'Borders'
'Buckets'
'BinarizedTargetMeanValue'
'Counter'
CTR types for training on GPU:
'Borders'
'Buckets'
'FeatureFreq'
'FloatTargetMeanValue'
TargetBorderCount
The number of borders for label value binarization (see https://catboost.ai/docs/concepts/quantization.html). Only used for regression problems. Allowed values are integers from 1 to 255 inclusively. The default value is 1. This option is available for training on CPU only.
TargetBorderType
The binarization type (see https://catboost.ai/docs/concepts/quantization.html) for the label value. Only used for regression problems.
Possible values:
'Median'
'Uniform'
'UniformAndQuantiles'
'MaxLogSum'
'MinEntropy'
'GreedyLogSum'
By default, 'MinEntropy'
This option is available for training on CPU only.
CtrBorderCount
The number of splits for categorical features. Allowed values are integers from 1 to 255 inclusively.
CtrBorderType
The binarization type for categorical features. Supported values for training on CPU:
'Uniform'
Supported values for training on GPU:
'Median'
'Uniform'
'UniformAndQuantiles'
'MaxLogSum'
'MinEntropy'
'GreedyLogSum'
Prior
Priors to use during training (several values can be specified). Possible formats:
'One number - Adds the value to the numerator.'
'Two slash-delimited numbers (for GPU only) - Use this format to set a fraction. The first number is added to the numerator and the second is added to the denominator.'
combinations_ctr
Binarization settings for combinations of categorical features (see https://catboost.ai/docs/concepts/algorithm-main-stages_cat-to-numberic.html).
Format:
c(CtrType[:TargetBorderCount=BorderCount][:TargetBorderType=BorderType][:CtrBorderCount=Count][:CtrBorderType=Type][:Prior=num_1/denum_1]..[:Prior=num_N/denum_N])
Components:
CTR types for training on CPU:
'Borders'
'Buckets'
'BinarizedTargetMeanValue'
'Counter'
CTR types for training on GPU:
'Borders'
'Buckets'
'FeatureFreq'
'FloatTargetMeanValue'
TargetBorderCount
The number of borders for target binarization (see https://catboost.ai/docs/concepts/quantization.html). Only used for regression problems. Allowed values are integers from 1 to 255 inclusively. The default value is 1. This option is available for training on CPU only.
TargetBorderType
The binarization type (see https://catboost.ai/docs/concepts/quantization.html) for the target. Only used for regression problems.
Possible values:
'Median'
'Uniform'
'UniformAndQuantiles'
'MaxLogSum'
'MinEntropy'
'GreedyLogSum'
By default, 'MinEntropy'
This option is available for training on CPU only.
CtrBorderCount
The number of splits for categorical features. Allowed values are integers from 1 to 255 inclusively.
CtrBorderType
The binarization type for categorical features. Supported values for training on CPU:
'Uniform'
Supported values for training on GPU:
'Median'
'Uniform'
'UniformAndQuantiles'
'MaxLogSum'
'MinEntropy'
'GreedyLogSum'
Prior
Priors to use during training (several values can be specified). Possible formats:
'One number - Adds the value to the numerator.'
'Two slash-delimited numbers (for GPU only) - Use this format to set a fraction. The first number is added to the numerator and the second is added to the denominator.'
ctr_target_border_count
Maximum number of borders used in target binarization for categorical features that need it. If TargetBorderCount is specified in 'simple_ctr', 'combinations_ctr' or 'per_feature_ctr' option it overrides this value.
Default value:
1
counter_calc_method
The method for calculating the Counter CTR type for the test dataset.
Possible values:
'Full'
'FullTest'
'PrefixTest'
'SkipTest'
Default value: 'PrefixTest'
max_ctr_complexity
The maximum number of categorical features that can be combined.
Default value:
4
ctr_leaf_count_limit
The maximum number of leaves with categorical features.
If the number of leaves exceeds the specified limit, some leaves are discarded.
The value must be positive (for a zero limit use the ignored_features parameter).
The leaves to be discarded are selected as follows:
The leaves are sorted by the frequency of the values.
The top N leaves are selected, where N is the value specified in the parameter.
All leaves starting from N+1 are discarded.
This option reduces the resulting model size and the amount of memory required for training. Note that the resulting quality of the model can be affected.
Default value:
None (The number of leaves with categorical features is not limited)
store_all_simple_ctr
Ignore categorical features, which are not used in feature combinations, when choosing candidates for exclusion.
Use this parameter together with ctr_leaf_count_limit only.
Default value:
FALSE (Both simple features and feature combinations are taken in account when limiting the number of leaves with categorical features)
Binarization settings
border_count
The number of splits for numerical features. Allowed values are integers from 1 to 255 inclusively.
Default value:
254 for training on CPU or 128 for training on GPU
feature_border_type
The binarization mode (see https://catboost.ai/docs/concepts/quantization.html) for numerical features.
Possible values:
'Median'
'Uniform'
'UniformAndQuantiles'
'MaxLogSum'
'MinEntropy'
'GreedyLogSum'
Default value:
'MinEntropy'
Performance settings
thread_count
The number of threads to use during training.
Allows you to optimize the speed of execution. This parameter doesn't affect results.
Default value:
The number of CPU cores.
Output settings
logging_level
Possible values:
'Silent'
'Verbose'
'Info'
'Debug'
Default value:
'Silent'
metric_period
The frequency of iterations to print the information to stdout. The value should be a positive integer.
Default value:
1
train_dir
The directory for storing the files generated during training.
Default value:
None (current directory)
save_snapshot
Enable snapshotting for restoring the training progress after an interruption.
Default value:
None
snapshot_file
Settings for recovering training after an interruption (see https://catboost.ai/docs/features/snapshots.html).
Depending on whether the file specified exists in the file system:
Missing - write information about training progress to the specified file.
Exists - load data from the specified file and continue training from where it left off.
Default value:
File can't be generated or read. If the value is omitted, the file name is experiment.cbsnapshot.
snapshot_interval
Interval between saving snapshots (seconds)
Default value:
600
allow_writing_files
If this flag is set to FALSE, no files with diagnostic information are created during training. With this flag set to FALSE, snapshotting is not possible and visualization will not work, because visualization relies on files that are created and updated during training.
Default value:
TRUE
approx_on_full_history
If this flag is set to TRUE, each approximated value is calculated using all the preceding rows in the fold (slower, more accurate). If this flag is set to FALSE, each approximated value is calculated using only the beginning 1/fold_len_multiplier fraction of the fold (faster, slightly less accurate).
Default value:
FALSE
boosting_type
Boosting scheme. Possible values:
'Ordered' - Gives better quality, but may slow down the training.
'Plain' - The classic gradient boosting scheme. May result in quality degradation, but does not slow down the training.
Default value:
Depends on the number of objects and features in the training dataset and on the learning mode.
dev_score_calc_obj_block_size
CPU only. The size of a block of samples in score calculation. Should be > 0. Used only for learning speed tuning. Changing this parameter can affect results in pairwise scoring mode due to numerical accuracy differences.
Default value:
5000000
dev_efb_max_buckets
CPU only. Maximum bucket count in an exclusive features bundle. Should be an integer between 0 and 65536. Used only for learning speed tuning.
Default value:
1024
sparse_features_conflict_fraction
CPU only. Maximum allowed fraction of conflicting non-default values for features in an exclusive features bundle. Should be a real value in the [0, 1) interval.
Default value:
0.0
leaf_estimation_backtracking
Type of backtracking during gradient descent. Possible values:
'No' - never backtrack; supported on CPU and GPU.
'AnyImprovement' - reduce the descent step until the value of the loss function is less than before the step; supported on CPU and GPU.
'Armijo' - reduce the descent step until the Armijo condition is satisfied; supported on GPU only.
Default value:
'AnyImprovement'
catboost.train(learn_pool, test_pool = NULL, params = list())
learn_pool |
The dataset used for training the model. Default value: Required argument |
test_pool |
The dataset used for testing the quality of the model. Default value: NULL (not used) |
params |
The list of parameters to start training with. If omitted, default values are used (see The list of parameters). If set, the passed list of parameters overrides the default values. Default value: list() (empty list; use the default values) |
Model object.
https://catboost.ai/docs/concepts/r-reference_catboost-train.html
## Not run:
train_pool_path <- system.file("extdata", "adult_train.1000", package = "catboost")
test_pool_path <- system.file("extdata", "adult_test.1000", package = "catboost")
cd_path <- system.file("extdata", "adult.cd", package = "catboost")
train_pool <- catboost.load_pool(train_pool_path, column_description = cd_path)
test_pool <- catboost.load_pool(test_pool_path, column_description = cd_path)
fit_params <- list(
  iterations = 100,
  loss_function = 'Logloss',
  ignored_features = c(4, 9),
  border_count = 32,
  depth = 5,
  learning_rate = 0.03,
  l2_leaf_reg = 3.5,
  train_dir = 'train_dir')
model <- catboost.train(train_pool, test_pool, fit_params)
## End(Not run)
Apply the model to the given dataset using several independent truncated models (virtual ensembles). Each tree in the ensemble predicts its own value for each document of the pool.
Peculiarities: the return value depends on prediction_type: an array for 'VirtEnsembles' and a matrix for 'TotalUncertainty'.
catboost.virtual_ensembles_predict( model, pool, verbose = FALSE, prediction_type = "VirtEnsembles", ntree_end = 0L, virtual_ensembles_count = 10, thread_count = -1 )
model |
The model obtained as the result of training. Default value: Required argument |
pool |
The input dataset. Default value: Required argument |
verbose |
Verbose output to stdout. Default value: FALSE (not used) |
prediction_type |
The format for displaying approximated values in output data (see https://catboost.ai/docs/concepts/python-reference_virtual_ensembles_predict.html#python-reference_catboostclassifier_predict__output-format). Possible values: 'VirtEnsembles', 'TotalUncertainty'. Default value: 'VirtEnsembles' |
ntree_end |
Index of the first tree not to be used when applying the model or calculating the metrics (zero-based indexing). Default value: 0 (the index of the last tree to use equals the number of trees in the model minus one) |
virtual_ensembles_count |
Number of tree ensembles to use. Each virtual ensemble can be considered as a truncated model. Default value: 10 |
thread_count |
The number of threads to use when applying the model. If -1, then the number of threads is set to the number of CPU cores. Allows you to optimize the speed of execution. This parameter doesn't affect results. Default value: -1 |
A matrix or array of predictions (for the 'TotalUncertainty' and 'VirtEnsembles' prediction types, respectively).
https://catboost.ai/docs/concepts/python-reference_virtual_ensembles_predict.html?lang=en
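An illustrative sketch (not from the original reference), assuming `model` and `test_pool` exist as in the catboost.train example above:
## Not run:
unc <- catboost.virtual_ensembles_predict(model, test_pool,
                                           prediction_type = 'TotalUncertainty',
                                           virtual_ensembles_count = 10)
## End(Not run)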
Get dimensions of a Pool.
## S3 method for class 'catboost.Pool' dim(x)
x |
The input dataset. Default value: Required argument |
Returns a vector with the numbers of rows and columns of a catboost.Pool.
Get dimension names of a Pool.
## S3 method for class 'catboost.Pool' dimnames(x)
x |
The input dataset. Default value: Required argument |
A list with two elements. The second element contains the column names.
Return a list with the first n objects of the dataset.
Each row contains the following information for one object:
The label value.
The weight value.
The feature values.
## S3 method for class 'catboost.Pool' head(x, n = 10, ...)
x |
The input dataset. Default value: Required argument |
n |
The quantity of the first objects in the dataset to be returned. Default value: 10 |
... |
not currently used |
A matrix containing the first n objects of the dataset.
Displays the most general characteristics of a CatBoost model.
## S3 method for class 'catboost.Model' print(x, ...)
x |
The model obtained as the result of training. |
... |
Not used |
The same model that was passed as input.
Print dimensions of catboost.Pool.
## S3 method for class 'catboost.Pool' print(x, ...)
x |
a catboost.Pool object Default value: Required argument |
... |
not currently used |
Nothing. This method prints pool dimensions.
Displays the most general characteristics of a CatBoost model (same as 'print').
## S3 method for class 'catboost.Model' summary(object, ...)
object |
The model obtained as the result of training. |
... |
Not used |
The same model that was passed as input.
Return a list with the last n objects of the dataset.
Each row contains the following information for one object:
The target value.
The weight value.
The feature values.
## S3 method for class 'catboost.Pool' tail(x, n = 10, ...)
x |
The input dataset. Default value: Required argument |
n |
The quantity of the last objects in the dataset to be returned. Default value: 10 |
... |
not currently used |
A matrix containing the last n objects of the dataset.