Title: Gradient Boosting on Decision Trees
Description: Open-source gradient boosting on decision trees with categorical features support out of the box.
Authors: CatBoost DevTeam [aut, cre]
Maintainer: Stanislav Kirillov <[email protected]>
License: Apache License (== 2.0)
Version: 1.2.7
Built: 2024-11-22 19:25:43 UTC
Source: https://github.com/catboost/catboost
Support for the caret interface.
catboost.caret
An object of class list of length 10.
Estimate model performance using cross-validation.
catboost.cv( pool, params = list(), fold_count = 3, type = "Classical", partition_random_seed = 0, shuffle = TRUE, stratified = FALSE, early_stopping_rounds = NULL )
pool |
Data to cross-validate on |
params |
Parameters for catboost.train |
fold_count |
The number of folds. |
type |
The type of cross-validation. |
partition_random_seed |
The random seed used for splitting pool into folds. |
shuffle |
Shuffle the dataset objects before splitting into folds. |
stratified |
Perform stratified sampling. |
early_stopping_rounds |
Activates Iter overfitting detector with od_wait set to early_stopping_rounds. |
A data.frame of evaluation results from cross-validation.
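A minimal cross-validation sketch (not part of the original reference); it assumes a pool built from the bundled adult dataset as in the catboost.load_pool example below, and the parameter values are illustrative only:
## Not run:
pool_path <- system.file("extdata", "adult_train.1000", package = "catboost")
cd_path <- system.file("extdata", "adult.cd", package = "catboost")
pool <- catboost.load_pool(pool_path, column_description = cd_path)
params <- list(iterations = 100, loss_function = 'Logloss')
cv_result <- catboost.cv(pool, params = params, fold_count = 3, stratified = TRUE)
head(cv_result)
## End(Not run)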
Drop unused feature information from the model.
catboost.drop_unused_features(model, ntree_end, ntree_start = 0)
model |
The model obtained as the result of training. |
ntree_end |
Leave the trees with indices from the interval [ntree_start, ntree_end) (zero-based indexing). |
ntree_start |
Leave the trees with indices from the interval [ntree_start, ntree_end) (zero-based indexing). |
Status, the result of dropping unused features. TRUE if the operation succeeded, FALSE otherwise.
Calculate the specified metrics for the specified dataset.
catboost.eval_metrics( model, pool, metrics, ntree_start = 0L, ntree_end = 0L, eval_period = 1, thread_count = -1, tmp_dir = NULL )
model |
The model obtained as the result of training. Default value: Required argument |
pool |
The pool for which you want to evaluate the metrics. Default value: Required argument |
metrics |
The list of metrics to be calculated. (Supported metrics https://catboost.ai/docs/references/custom-metric__supported-metrics.html) Default value: Required argument |
ntree_start |
Model is applied on the interval [ntree_start, ntree_end) with the step eval_period (zero-based indexing). Default value: 0 |
ntree_end |
Model is applied on the interval [ntree_start, ntree_end) with the step eval_period (zero-based indexing). Default value: 0 (if the value is 0, this parameter is ignored and ntree_end is set to tree_count) |
eval_period |
Model is applied on the interval [ntree_start, ntree_end) with the step eval_period (zero-based indexing). Default value: 1 |
thread_count |
The number of threads to use when applying the model. If -1, then the number of threads is set to the number of CPU cores. Allows you to optimize the speed of execution. This parameter doesn't affect results. Default value: -1 |
tmp_dir |
The name of the temporary directory for intermediate results. If NULL, then the name will be generated. Default value: NULL |
A named list: metric name -> vector of length (ntree_end - ntree_start) / eval_period.
https://catboost.ai/docs/concepts/python-reference_catboost_eval-metrics.html
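A hypothetical usage sketch (not from the original reference), assuming `model` and `test_pool` were produced by catboost.train and catboost.load_pool as in the training example later in this manual:
## Not run:
metrics <- catboost.eval_metrics(model, test_pool,
                                 metrics = c('Logloss', 'AUC'),
                                 eval_period = 10)
str(metrics)  # one numeric vector per requested metric
## End(Not run)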
Calculate the feature importances (see https://catboost.ai/docs/concepts/fstr.html#fstr) (Regular feature importance, ShapValues, and Feature interaction strength).
catboost.get_feature_importance( model, pool = NULL, type = "FeatureImportance", thread_count = -1, fstr_type = NULL )
model |
The model obtained as the result of training. Default value: Required argument |
pool |
The input dataset. The feature importance for the training dataset is calculated if this argument is not specified. Models with ranking metrics require pool argument to calculate feature importance. Default value: NULL |
type |
The feature importance type (regular feature importance, ShapValues, or feature interaction strength; see the reference below). Default value: 'FeatureImportance' |
thread_count |
The number of threads to use when applying the model. If -1, then the number of threads is set to the number of CPU cores. Allows you to optimize the speed of execution. This parameter doesn't affect results. Default value: -1 |
fstr_type |
Deprecated parameter, use 'type' instead. |
Feature importances
https://catboost.ai/docs/features/feature-importances-calculation.html
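A short illustrative sketch (not part of the original reference), assuming `model` and `train_pool` exist as in the catboost.train example later in this manual:
## Not run:
fi <- catboost.get_feature_importance(model, pool = train_pool, type = 'FeatureImportance')
print(fi)
## End(Not run)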
Return the model parameters.
catboost.get_model_params(model)
model |
The model obtained as the result of training. |
A list object with model parameters.
https://catboost.ai/docs/concepts/r-reference_catboost-get_model_params.html
Calculate the object importances (see https://catboost.ai/docs/concepts/ostr.html). This is the implementation of the LeafInfluence algorithm from the following paper: https://arxiv.org/pdf/1802.06640.pdf
catboost.get_object_importance( model, pool, train_pool, top_size = -1, type = "Average", update_method = "SinglePoint", thread_count = -1, ostr_type = NULL )
model |
The model obtained as the result of training. Default value: Required argument |
pool |
The pool for which you want to evaluate the object importances. Default value: Required argument |
train_pool |
The pool on which the model has been trained. Default value: Required argument |
top_size |
Method returns the result of the top_size most important train objects. If -1, then the top size is not limited. Default value: -1 |
type |
Possible values are described in the reference below. Default value: 'Average' |
update_method |
The update set method. Descriptions of the possible methods are given in Section 3.1.3 of the paper linked above. Default value: 'SinglePoint' |
thread_count |
The number of threads to use when applying the model. If -1, then the number of threads is set to the number of CPU cores. Allows you to optimize the speed of execution. This parameter doesn't affect results. Default value: -1 |
ostr_type |
Deprecated parameter, use 'type' instead. |
A list with elements "indices" and "scores".
https://catboost.ai/docs/concepts/r-reference_catboost-get_object_importance.html
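A hypothetical sketch (not from the original reference), assuming `model`, `train_pool`, and `test_pool` exist as in the catboost.train example later in this manual:
## Not run:
oi <- catboost.get_object_importance(model, test_pool, train_pool, top_size = 10)
oi$indices  # indices of the most influential training objects
oi$scores   # the corresponding influence scores
## End(Not run)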
Return the plain model parameters.
catboost.get_plain_params(model)
model |
The model obtained as the result of training. |
A list object with model parameters.
Load the model from a file.
Note: Feature importance (see https://catboost.ai/docs/concepts/fstr.html#fstr) is not saved when using this function.
catboost.load_model(model_path, file_format = "cbm")
model_path |
The path to the model. Default value: Required argument |
file_format |
Format of the model file. Default value: 'cbm' |
A model object.
https://catboost.ai/docs/concepts/r-reference_catboost-load_model.html
Create a dataset from the given file, matrix or data.frame.
catboost.load_pool( data, label = NULL, cat_features = NULL, column_description = NULL, pairs = NULL, delimiter = "\t", has_header = FALSE, weight = NULL, group_id = NULL, group_weight = NULL, subgroup_id = NULL, pairs_weight = NULL, baseline = NULL, feature_names = NULL, thread_count = -1, graph = NULL )
data |
A file path, matrix, or data.frame with features. Default value: Required argument |
label |
The label vector or label matrix |
cat_features |
A vector of categorical feature indices. The indices are zero-based and can differ from those given in the column descriptions file. If the data parameter is a data.frame, do not use cat_features; categorical features are determined automatically from the data.frame column types. |
column_description |
The path to the input file that contains the column descriptions. |
pairs |
A file path, matrix, or data.frame that contains the pairs descriptions. The shape should be Nx2, where N is the number of pairs. The first element of each pair is the index of the winner document in the training set; the second element is the index of the loser document. |
delimiter |
Delimiter character to use to separate features in a file. |
has_header |
Read column names from the first line if this parameter is set to TRUE. |
weight |
The weights of the objects. |
group_id |
The group ids of the objects. |
group_weight |
The group weight of the objects. |
subgroup_id |
The subgroup ids of the objects. |
pairs_weight |
The weights of the pairs. |
baseline |
Vector of initial (raw) values of the objective function. Used in the calculation of final values of trees. |
feature_names |
A list of names for each feature in the dataset. |
thread_count |
The number of threads to use while reading the data. If -1, then the number of threads is set to the number of CPU cores. Optimizes reading time. This parameter doesn't affect results. |
graph |
A file path, matrix, or data.frame that contains pairs of object indices for graph features. The shape should be Nx2, where N is the number of index pairs. |
A catboost.Pool object.
## Not run:
# From file
pool_path <- system.file("extdata", "adult_train.1000", package = "catboost")
cd_path <- system.file("extdata", "adult.cd", package = "catboost")
pool <- catboost.load_pool(pool_path, column_description = cd_path)
print(pool)

# From matrix
target <- 1
data_matrix <- matrix(runif(18), 6, 3)
pool <- catboost.load_pool(data_matrix[, -target], label = data_matrix[, target])
print(pool)

# From data.frame
nonsense <- factor(c('A', 'B', 'C'))
data_frame <- data.frame(value = runif(10), category = nonsense[(1:10) %% 3 + 1])
label <- (1:10) %% 2
pool <- catboost.load_pool(data_frame, label = label)
print(pool)
## End(Not run)
Apply the model to the given dataset.
Peculiarities: in the case of multiclass classification the prediction is returned as a matrix. Each row of this matrix contains the predictions for one object of the input dataset.
catboost.predict( model, pool, verbose = FALSE, prediction_type = "RawFormulaVal", ntree_start = 0, ntree_end = 0, thread_count = -1 )
model |
The model obtained as the result of training. Default value: Required argument |
pool |
The input dataset. Default value: Required argument |
verbose |
Verbose output to stdout. Default value: FALSE (not used) |
prediction_type |
The format for displaying approximated values in output data (see https://catboost.ai/docs/concepts/output-data.html). Possible values: 'RawFormulaVal', 'Probability', 'Class'. Default value: 'RawFormulaVal' |
ntree_start |
Model is applied on the interval [ntree_start, ntree_end) (zero-based indexing). Default value: 0 |
ntree_end |
Model is applied on the interval [ntree_start, ntree_end) (zero-based indexing). Default value: 0 (if the value is 0, this parameter is ignored and ntree_end is set to tree_count) |
thread_count |
The number of threads to use when applying the model. If -1, then the number of threads is set to the number of CPU cores. Allows you to optimize the speed of execution. This parameter doesn't affect results. Default value: -1 |
Vector of predictions (matrix for multi-class classification).
https://catboost.ai/docs/concepts/r-reference_catboost-predict.html
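An illustrative sketch (not part of the original reference), assuming `model` and `test_pool` exist as in the catboost.train example later in this manual:
## Not run:
raw    <- catboost.predict(model, test_pool)                                   # raw formula values
prob   <- catboost.predict(model, test_pool, prediction_type = 'Probability')  # class probabilities
labels <- catboost.predict(model, test_pool, prediction_type = 'Class')        # predicted classes
## End(Not run)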
After de-serializing a model object through base R functions ('readRDS', 'load'), its underlying object will no longer exist in the computer's memory and needs to be restored from the raw bytes that the model stores.
This is done automatically and internally when calling functions such as catboost.predict, but the process is repeated at each call, which makes those calls slower than with a fresh model object and increases memory usage in between calls to the garbage collector. This function restores the internal object beforehand so that it does not have to be restored multiple times.
Note that the model object needs to be re-assigned as the output of this function, as the modifications are not done in-place.
catboost.restore_handle(model)
model |
The model obtained as the result of training which has been serialized and is now de-serialized. |
The model object with its handle pointing to a valid object in memory.
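A minimal sketch of the intended workflow (not from the original reference); the file name is a placeholder and `model`/`test_pool` are assumed to exist as in the catboost.train example later in this manual:
## Not run:
saveRDS(model, 'model.rds')                # serialize with base R
model <- readRDS('model.rds')              # de-serialize; the internal handle is now invalid
model <- catboost.restore_handle(model)    # re-assign: the restore is not done in-place
pred <- catboost.predict(model, test_pool)
## End(Not run)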
Save the model to a file.
Note: Feature importance (see https://catboost.ai/docs/concepts/fstr.html#fstr) is not saved when using this function.
catboost.save_model( model, model_path, file_format = "cbm", export_parameters = NULL, pool = NULL )
model |
The model to be saved. Default value: Required argument |
model_path |
The path to the resulting binary file with the model description. Used for solving other machine learning problems (for instance, applying a model). Default value: Required argument |
file_format |
The format of the model file. Default value: 'cbm' |
export_parameters |
Parameters for CoreML or PMML export. |
pool |
The training pool. |
Status, the result of saving the model. TRUE if saving succeeded, FALSE otherwise.
https://catboost.ai/docs/features/export-model-to-core-ml.html
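A hypothetical save/load round trip (not part of the original reference); the file path is a placeholder and `model` is assumed to exist as in the catboost.train example later in this manual:
## Not run:
catboost.save_model(model, 'model.cbm', file_format = 'cbm')
restored <- catboost.load_model('model.cbm', file_format = 'cbm')
## End(Not run)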
Save the dataset to the CatBoost format. Files with the following data are created:
Dataset description
Column descriptions
Use the catboost.load_pool function to read the resulting files. These files can also be used in the Command-line version and the Python library.
catboost.save_pool( data, label = NULL, weight = NULL, baseline = NULL, pool_path = "data.pool", cd_path = "cd.pool" )
data |
A data.frame with features. Default value: Required argument |
label |
The label vector. |
weight |
The weights of the label vector. |
baseline |
Vector of initial (raw) values of the label function for the object. Used in the calculation of final values of trees. |
pool_path |
The path to the output file that contains the dataset description. |
cd_path |
The path to the output file that contains the column descriptions. |
Nothing. This method writes a dataset to disk.
Shrink the model
catboost.shrink(model, ntree_end, ntree_start = 0)
model |
The model obtained as the result of training. |
ntree_end |
Leave the trees with indices from the interval [ntree_start, ntree_end) (zero-based indexing). |
ntree_start |
Leave the trees with indices from the interval [ntree_start, ntree_end) (zero-based indexing). |
Status, the result of model shrinking. TRUE if shrinking succeeded, FALSE otherwise.
https://catboost.ai/docs/concepts/r-reference_catboost-shrink.html
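A minimal sketch (not from the original reference), assuming `model` was trained with at least 50 trees; it keeps the trees with indices in [0, 50):
## Not run:
ok <- catboost.shrink(model, ntree_end = 50, ntree_start = 0)
## End(Not run)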
Apply the model to the given dataset and calculate the results for each i-th tree of the model taking into consideration only the trees in the range [1;i].
Peculiarities: in the case of multiclass classification the prediction is returned as a matrix. Each row of this matrix contains the predictions for one object of the input dataset.
catboost.staged_predict( model, pool, verbose = FALSE, prediction_type = "RawFormulaVal", ntree_start = 0L, ntree_end = 0L, eval_period = 1, thread_count = -1 )
model |
The model obtained as the result of training. Default value: Required argument |
pool |
The input dataset. Default value: Required argument |
verbose |
Verbose output to stdout. Default value: FALSE (not used) |
prediction_type |
The format for displaying approximated values in output data (see https://catboost.ai/docs/concepts/output-data.html). Possible values: 'RawFormulaVal', 'Probability', 'Class'. Default value: 'RawFormulaVal' |
ntree_start |
Model is applied on the interval [ntree_start, ntree_end) with the step eval_period (zero-based indexing). Default value: 0 |
ntree_end |
Model is applied on the interval [ntree_start, ntree_end) with the step eval_period (zero-based indexing). Default value: 0 (if the value is 0, this parameter is ignored and ntree_end is set to tree_count) |
eval_period |
Model is applied on the interval [ntree_start, ntree_end) with the step eval_period (zero-based indexing). Default value: 1 |
thread_count |
The number of threads to use when applying the model. If -1, then the number of threads is set to the number of CPU cores. Allows you to optimize the speed of execution. This parameter doesn't affect results. Default value: -1 |
List object with predictions from one iteration.
https://catboost.ai/docs/concepts/r-reference_catboost-staged_predict.html
Blend trees and counters of two or more trained CatBoost models into a new model. Leaf values can be individually weighted for each input model. For example, it may be useful to blend models trained on different validation datasets.
catboost.sum_models( models, weights = NULL, ctr_merge_policy = "IntersectingCountersAverage" )
models |
Models for the summation. Default value: Required argument |
weights |
The weights of the models. Default value: NULL (use weight 1 for every model) |
ctr_merge_policy |
The counters merging policy. Default value: 'IntersectingCountersAverage' |
Model object.
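A hypothetical blending sketch (not part of the original reference); `model1` and `model2` are assumed to be CatBoost models trained on comparable data, and the weights are illustrative:
## Not run:
blended <- catboost.sum_models(list(model1, model2), weights = c(0.5, 0.5))
## End(Not run)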
Train the model using a CatBoost dataset.
The list of parameters
Common parameters
fold_permutation_block
Objects in the dataset are grouped in blocks before the random permutations. This parameter defines the size of the blocks. The smaller the value, the slower the training. Large values may result in quality degradation.
Default value:
Default value differs depending on the dataset size and ranges from 1 to 256 inclusively
ignored_features
Identifiers of features to exclude from training. The non-negative indices that do not match any features are successfully ignored. For example, if five features are defined for the objects in the dataset and this parameter is set to "42", the corresponding non-existing feature is successfully ignored.
The identifier corresponds to the feature's index.
Feature indices used in train and feature importance are numbered from 0 to featureCount-1.
If a file is used as input data then any non-feature column types are ignored when calculating these indices. For example, each row in the input file contains data in the following order: "categorical feature<\t>label<\t>numerical feature". So for the row "rock<\t>0<\t>42", the identifier for the "rock" feature is 0, and for the "42" feature it is 1.
The identifiers of features to exclude should be given as a vector.
For example, if training should exclude features with the identifiers 1, 2, 7, 42, 43, 44, 45, the value of this parameter should be set to c(1,2,7,42,43,44,45).
Default value:
None (use all features)
use_best_model
If this parameter is set to TRUE, the number of trees that are saved in the resulting model is defined as follows:
Build the number of trees defined by the training parameters.
Identify the iteration with the optimal loss function value.
No trees are saved after this iteration.
This option requires a test dataset to be provided.
Default value:
FALSE (not used)
loss_function
The loss function (see https://catboost.ai/docs/concepts/loss-functions.html#loss-functions) to use in training. The specified value also determines the machine learning problem to solve.
Format:
<Loss function 1>[:<parameter 1>=<value>:..<parameter N>=<value>:]
Supported loss functions:
'Logloss'
'CrossEntropy'
'MultiClass'
'MultiClassOneVsAll'
'RMSE'
'MAE'
'Quantile'
'LogLinQuantile'
'MAPE'
'Poisson'
'Lq'
'PairLogit'
'PairLogitPairwise'
'YetiRank'
'YetiRankPairwise'
'QueryCrossEntropy'
'QueryRMSE'
'QuerySoftMax'
Supported parameters:
alpha - The coefficient used in quantile-based losses ('Quantile' and 'LogLinQuantile'). The default value is 0.5.
For example, to calculate the value of Quantile with the coefficient alpha = 0.1, use the following construction:
'Quantile:alpha=0.1'
Default value:
'RMSE'
custom_loss
Loss function (see https://catboost.ai/docs/concepts/loss-functions.html#loss-functions) values to output during training. These functions are not used for optimization and are displayed for informational purposes only.
Format:
c(<Loss function 1>[:<parameter>=<value>],<Loss function 2>[:<parameter>=<value>],...,<Loss function N>[:<parameter>=<value>])
Supported loss functions:
'Logloss'
'CrossEntropy'
'Precision'
'Recall'
'F1'
'F'
'BalancedAccuracy'
'BalancedErrorRate'
'MCC'
'Accuracy'
'CtrFactor'
'AUC'
'BrierScore'
'HingeLoss'
'HammingLoss'
'ZeroOneLoss'
'Kappa'
'WKappa'
'LogLikelihoodOfPrediction'
'MultiClass'
'MultiClassOneVsAll'
'TotalF1'
'MAE'
'MAPE'
'Poisson'
'Quantile'
'RMSE'
'LogLinQuantile'
'Lq'
'NumErrors'
'SMAPE'
'R2'
'MSLE'
'MedianAbsoluteError'
'PairLogit'
'PairLogitPairwise'
'PairAccuracy'
'QueryCrossEntropy'
'QueryRMSE'
'QuerySoftMax'
'PFound'
'NDCG'
'AverageGain'
'PrecisionAt'
'RecallAt'
'MAP'
'MRR'
'ERR'
Supported parameters:
alpha - The coefficient used in quantile-based losses ('Quantile' and 'LogLinQuantile'). The default value is 0.5.
For example, to calculate the values of CrossEntropy and Quantile with the coefficient alpha = 0.1, use the following construction:
c('CrossEntropy', 'Quantile:alpha=0.1'). A single metric may be passed as a plain string, e.g. 'CrossEntropy'.
Values of all custom loss functions for the learning and test datasets are saved to the loss function output files (learn_error.tsv and test_error.tsv respectively; see https://catboost.ai/docs/concepts/output-data_loss-function.html#output-data_loss-function). The directory for these files is specified in the train_dir parameter.
Default value:
None (use one of the loss functions supported by the library)
eval_metric
The loss function used for overfitting detection (if enabled) and best model selection (if enabled).
Supported loss functions:
'Logloss'
'CrossEntropy'
'Precision'
'Recall'
'F1'
'F'
'BalancedAccuracy'
'BalancedErrorRate'
'MCC'
'Accuracy'
'CtrFactor'
'AUC'
'BrierScore'
'HingeLoss'
'HammingLoss'
'ZeroOneLoss'
'Kappa'
'WKappa'
'LogLikelihoodOfPrediction'
'MultiClass'
'MultiClassOneVsAll'
'TotalF1'
'MAE'
'MAPE'
'Poisson'
'Quantile'
'RMSE'
'LogLinQuantile'
'Lq'
'NumErrors'
'SMAPE'
'R2'
'MSLE'
'MedianAbsoluteError'
'PairLogit'
'PairLogitPairwise'
'PairAccuracy'
'QueryCrossEntropy'
'QueryRMSE'
'QuerySoftMax'
'PFound'
'NDCG'
'AverageGain'
'PrecisionAt'
'RecallAt'
'MAP'
'MRR'
'ERR'
Format:
metric_name:param=Value
Examples:
'R2'
'Quantile:alpha=0.3'
Default value:
Optimized objective is used
iterations
The maximum number of trees that can be built when solving machine learning problems.
When using other parameters that limit the number of iterations, the final number of trees may be less than the number specified in this parameter.
Default value:
1000
border
The target border. If the value is strictly greater than this threshold, it is considered a positive class. Otherwise it is considered a negative class.
The parameter is obligatory if the Logloss function is used, since it uses borders to transform any given target to a binary target.
Used in binary classification.
Default value:
0.5
leaf_estimation_iterations
The number of gradient steps when calculating the values in leaves.
Default value:
1
depth
Depth of the trees.
The value can be any integer up to 16. It is recommended to use values in the range [1; 10].
Default value:
6
learning_rate
The learning rate.
Used for reducing the gradient step.
Default value:
0.03
rsm
Random subspace method. The percentage of features to use at each iteration of building trees. At each iteration, features are selected anew at random.
The value must be in the range [0;1].
Default value:
1
random_seed
The random seed used for training.
Default value:
0
nan_mode
Way to process missing values.
Possible values:
'Min'
'Max'
'Forbidden'
Default value:
'Min'
od_pval
Use the Overfitting detector (see https://catboost.ai/docs/concepts/overfitting-detector.html#overfitting-detector) to stop training when the threshold is reached. Requires that a test dataset was input.
For best results, it is recommended to set a value in the range [10^-10; 10^-2].
The larger the value, the earlier overfitting is detected.
Default value:
The overfitting detection is turned off
od_type
The overfitting detector type.
Possible values:
IncToDec
Iter
Restriction. Do not specify the overfitting detector threshold when using the Iter type.
Default value:
'IncToDec'
od_wait
The number of iterations to continue the training after the iteration with the optimal loss function value. The purpose of this parameter differs depending on the selected overfitting detector type:
IncToDec - Ignore the overfitting detector when the threshold is reached and continue learning for the specified number of iterations after the iteration with the optimal loss function value.
Iter - Consider the model overfitted and stop training after the specified number of iterations since the iteration with the optimal loss function value.
Default value:
20
leaf_estimation_method
The method used to calculate the values in leaves.
Possible values:
Newton
Gradient
Default value:
Default value depends on the selected loss function
grow_policy
GPU only. The tree growing policy. It describes how to perform greedy tree construction.
Possible values:
SymmetricTree
Lossguide
Depthwise
Default value:
SymmetricTree
min_data_in_leaf
GPU only. The minimum number of training samples in a leaf. CatBoost will not search for new splits in leaves with a sample count less than min_data_in_leaf. This parameter is used only for the Depthwise and Lossguide growing policies.
Default value:
1
max_leaves
GPU only. The maximum leaf count in the resulting tree. This parameter is used only for the Lossguide growing policy.
Default value:
31
score_function
GPU only. The score used during tree construction to select the next tree split.
Possible values:
L2
Cosine
NewtonL2
NewtonCosine
Default value:
Cosine
For the Lossguide growing policy the default is NewtonL2.
l2_leaf_reg
L2 regularization coefficient. Used for leaf value calculation.
Any positive values are allowed.
Default value:
3
model_size_reg
Model size regularization coefficient. The influence coefficient of the model size when choosing the tree structure. Increase this coefficient to get a smaller model.
Any positive values are allowed.
Default value:
0.5
has_time
Use the order of objects in the input data (do not perform a random permutation of the dataset at the preprocessing stage).
Default value:
FALSE (not used; permute input dataset)
allow_const_label
Allow a constant label value in the dataset.
Default value:
FALSE
name
The experiment name to display in visualization tools (see https://catboost.ai/docs/features/visualization.html#visualization).
Default value:
experiment
prediction_type
The format for displaying approximated values in output data.
Possible values:
'Probability'
'Class'
'RawFormulaVal'
Default value:
'RawFormulaVal'
fold_len_multiplier
Coefficient for changing the length of folds.
The value must be greater than 1. The best validation result is achieved with minimum values. With values close to 1, each iteration requires memory and time that are quadratic in the number of objects in the iteration. Thus, low values are possible only when there is a small number of objects.
Default value:
2
class_weights
Class weights. The values are used as multipliers for the object weights.
For example, for 3 class classification you could use:
c(0.85, 1.2, 1)
Default value:
None (the weight for all classes is set to 1)
classes_count
The upper limit for the numeric class label. Defines the number of classes for multiclass classification.
Only non-negative integers can be specified. The given integer should be greater than any of the target values.
If this parameter is specified, the labels for all classes in the input dataset should be smaller than the given value.
Default value:
maximum class label + 1
one_hot_max_size
Convert the feature to float if the number of different values that it takes exceeds the specified value. Ctrs are not calculated for such features.
The one-vs.-all delimiter is used for the resulting float features.
Default value:
FALSE (do not convert features to float based on the number of different values)
random_strength
Score standard deviation multiplier.
Default value:
1
bootstrap_type
Bootstrap type. Defines the method for sampling the weights of documents.
Possible values:
'Bayesian'
'Bernoulli'
'Poisson'
'MVS'
'No'
Poisson bootstrap is supported only on GPU.
Default value:
'Bayesian'
bagging_temperature
Controls the intensity of Bayesian bagging. The higher the temperature, the more aggressive the bagging.
Typical values are in the range [0, 1] (0 means no bagging).
Possible values are in the range [0, +inf).
Default value:
1
subsample
Sample rate for bagging. This parameter can be used if one of the following bootstrap types is defined:
'Bernoulli'
Default value:
0.66
sampling_unit
This parameter specifies the sampling scheme: sample weights for each object individually or for an entire group of objects together.
Possible values:
'Object'
'Group'
Default value:
'Object'
sampling_frequency
Frequency to sample weights and objects when building trees.
Possible values:
'PerTree'
'PerTreeLevel'
Default value:
'PerTreeLevel'
model_shrink_rate
For i > 0, at the start of the i-th iteration the model is multiplied by (1 - model_shrink_rate / i).
Possible values: [0, 1).
Default value: 0
CTR settings
simple_ctr
Binarization settings for categorical features (see https://catboost.ai/docs/concepts/algorithm-main-stages_cat-to-numberic.html).
Format:
c(CtrType[:TargetBorderCount=BorderCount][:TargetBorderType=BorderType][:CtrBorderCount=Count][:CtrBorderType=Type][:Prior=num_1/denum_1]..[:Prior=num_N/denum_N])
Components:
CTR types for training on CPU:
'Borders'
'Buckets'
'BinarizedTargetMeanValue'
'Counter'
CTR types for training on GPU:
'Borders'
'Buckets'
'FeatureFreq'
'FloatTargetMeanValue'
TargetBorderCount
The number of borders for label value binarization (see https://catboost.ai/docs/concepts/quantization.html). Only used for regression problems. Allowed values are integers from 1 to 255 inclusively. The default value is 1. This option is available for training on CPU only.
TargetBorderType
The binarization type (see https://catboost.ai/docs/concepts/quantization.html) for the label value. Only used for regression problems.
Possible values:
'Median'
'Uniform'
'UniformAndQuantiles'
'MaxLogSum'
'MinEntropy'
'GreedyLogSum'
By default, 'MinEntropy'
This option is available for training on CPU only.
CtrBorderCount
The number of splits for categorical features. Allowed values are integers from 1 to 255 inclusively.
CtrBorderType
The binarization type for categorical features. Supported values for training on CPU:
'Uniform'
Supported values for training on GPU:
'Median'
'Uniform'
'UniformAndQuantiles'
'MaxLogSum'
'MinEntropy'
'GreedyLogSum'
Prior
Priors to use during training (several values can be specified). Possible formats:
'One number - Adds the value to the numerator.'
'Two slash-delimited numbers (for GPU only) - Use this format to set a fraction. The first number is added to the numerator and the second is added to the denominator.'
combinations_ctr
Binarization settings for combinations of categorical features (see https://catboost.ai/docs/concepts/algorithm-main-stages_cat-to-numberic.html).
Format:
c(CtrType[:TargetBorderCount=BorderCount][:TargetBorderType=BorderType][:CtrBorderCount=Count][:CtrBorderType=Type][:Prior=num_1/denum_1]..[:Prior=num_N/denum_N])
Components:
CTR types for training on CPU:
'Borders'
'Buckets'
'BinarizedTargetMeanValue'
'Counter'
CTR types for training on GPU:
'Borders'
'Buckets'
'FeatureFreq'
'FloatTargetMeanValue'
TargetBorderCount
The number of borders for target binarization (see https://catboost.ai/docs/concepts/quantization.html). Only used for regression problems. Allowed values are integers from 1 to 255 inclusively. The default value is 1. This option is available for training on CPU only.
TargetBorderType
The binarization type (see https://catboost.ai/docs/concepts/quantization.html) for the target. Only used for regression problems.
Possible values:
'Median'
'Uniform'
'UniformAndQuantiles'
'MaxLogSum'
'MinEntropy'
'GreedyLogSum'
By default, 'MinEntropy'
This option is available for training on CPU only.
CtrBorderCount
The number of splits for categorical features. Allowed values are integers from 1 to 255 inclusively.
CtrBorderType
The binarization type for categorical features. Supported values for training on CPU:
'Uniform'
Supported values for training on GPU:
'Median'
'Uniform'
'UniformAndQuantiles'
'MaxLogSum'
'MinEntropy'
'GreedyLogSum'
Prior
Priors to use during training (several values can be specified). Possible formats:
'One number - Adds the value to the numerator.'
'Two slash-delimited numbers (for GPU only) - Use this format to set a fraction. The first number is added to the numerator and the second is added to the denominator.'
ctr_target_border_count
Maximum number of borders used in target binarization for categorical features that need it. If TargetBorderCount is specified in 'simple_ctr', 'combinations_ctr' or 'per_feature_ctr' option it overrides this value.
Default value:
1
counter_calc_method
The method for calculating the Counter CTR type for the test dataset.
Possible values:
'Full'
'FullTest'
'PrefixTest'
'SkipTest'
Default value: 'PrefixTest'
max_ctr_complexity
The maximum number of categorical features that can be combined.
Default value:
4
ctr_leaf_count_limit
The maximum number of leaves with categorical features.
If the number of leaves exceeds the specified limit, some leaves are discarded.
The value must be positive (for a zero limit use the ignored_features parameter).
The leaves to be discarded are selected as follows:
The leaves are sorted by the frequency of the values.
The top N leaves are selected, where N is the value specified in the parameter.
All leaves starting from N+1 are discarded.
This option reduces the resulting model size and the amount of memory required for training. Note that the resulting quality of the model can be affected.
Default value:
None (The number of leaves with categorical features is not limited)
store_all_simple_ctr
Ignore categorical features, which are not used in feature combinations, when choosing candidates for exclusion.
Use this parameter together with ctr_leaf_count_limit only.
Default value:
FALSE (Both simple features and feature combinations are taken in account when limiting the number of leaves with categorical features)
Binarization settings
border_count
The number of splits for numerical features. Allowed values are integers from 1 to 255 inclusively.
Default value:
254 for training on CPU or 128 for training on GPU
feature_border_type
The binarization mode (see https://catboost.ai/docs/concepts/quantization.html) for numerical features.
Possible values:
'Median'
'Uniform'
'UniformAndQuantiles'
'MaxLogSum'
'MinEntropy'
'GreedyLogSum'
Default value:
'MinEntropy'
Performance settings
thread_count
The number of threads to use during training.
Allows you to optimize the speed of execution. This parameter doesn't affect results.
Default value:
The number of CPU cores.
Output settings
logging_level
Possible values:
'Silent'
'Verbose'
'Info'
'Debug'
Default value:
'Silent'
metric_period
The frequency of iterations to print the information to stdout. The value should be a positive integer.
Default value:
1
train_dir
The directory for storing the files generated during training.
Default value:
None (current directory)
save_snapshot
Enable snapshotting for restoring the training progress after an interruption.
Default value:
None
snapshot_file
Settings for recovering training after an interruption (see https://catboost.ai/docs/features/snapshots.html).
Depending on whether the file specified exists in the file system:
Missing - write information about training progress to the specified file.
Exists - load data from the specified file and continue training from where it left off.
Default value:
File can't be generated or read. If the value is omitted, the file name is experiment.cbsnapshot.
snapshot_interval
Interval between saving snapshots (seconds)
Default value:
600
allow_writing_files
If this flag is set to FALSE, no files with diagnostic information are created during training. With this flag set to FALSE, snapshotting is not possible and visualization will not work, because visualization relies on files that are created and updated during training.
Default value:
TRUE
approx_on_full_history
If this flag is set to TRUE, each approximated value is calculated using all the preceding rows in the fold (slower, more accurate). If this flag is set to FALSE, each approximated value is calculated using only the beginning 1/fold_len_multiplier fraction of the fold (faster, slightly less accurate).
Default value:
FALSE
boosting_type
Boosting scheme. Possible values:
'Ordered' - Gives better quality, but may slow down the training.
'Plain' - The classic gradient boosting scheme. May result in quality degradation, but does not slow down the training.
Default value:
Depends on the number of objects and features in the training dataset and on the learning mode.
dev_score_calc_obj_block_size
CPU only. The size of a block of samples in score calculation. Should be > 0. Used only for learning speed tuning. Changing this parameter can affect results in pairwise scoring mode due to numerical accuracy differences.
Default value:
5000000
dev_efb_max_buckets
CPU only. Maximum bucket count in an exclusive features bundle. Should be an integer between 0 and 65536. Used only for learning speed tuning.
Default value:
1024
sparse_features_conflict_fraction
CPU only. Maximum allowed fraction of conflicting non-default values for features in an exclusive features bundle. Should be a real value in the [0, 1) interval.
Default value:
0.0
leaf_estimation_backtracking
Type of backtracking during gradient descent. Possible values:
'No' - never backtrack; supported on CPU and GPU.
'AnyImprovement' - reduce the descent step until the value of the loss function is less than before the step; supported on CPU and GPU.
'Armijo' - reduce the descent step until the Armijo condition is satisfied; supported on GPU only.
Default value:
'AnyImprovement'
catboost.train(learn_pool, test_pool = NULL, params = list())
learn_pool |
The dataset used for training the model. Default value: Required argument |
test_pool |
The dataset used for testing the quality of the model. Default value: NULL (not used) |
params |
The list of parameters to start training with. If omitted, default values are used (see The list of parameters). If set, the passed list of parameters overrides the default values. Default value: list() (empty list; use the default values) |
Model object.
https://catboost.ai/docs/concepts/r-reference_catboost-train.html
## Not run:
train_pool_path <- system.file("extdata", "adult_train.1000", package = "catboost")
test_pool_path <- system.file("extdata", "adult_test.1000", package = "catboost")
cd_path <- system.file("extdata", "adult.cd", package = "catboost")
train_pool <- catboost.load_pool(train_pool_path, column_description = cd_path)
test_pool <- catboost.load_pool(test_pool_path, column_description = cd_path)
fit_params <- list(
  iterations = 100,
  loss_function = 'Logloss',
  ignored_features = c(4, 9),
  border_count = 32,
  depth = 5,
  learning_rate = 0.03,
  l2_leaf_reg = 3.5,
  train_dir = 'train_dir')
model <- catboost.train(train_pool, test_pool, fit_params)
## End(Not run)
Apply the model to the given dataset using several independent truncated models (virtual ensembles). Each tree in the ensemble predicts its own value for each document of the pool.
Peculiarities: the return value depends on prediction_type: an array for 'VirtEnsembles' and a matrix for 'TotalUncertainty'.
catboost.virtual_ensembles_predict( model, pool, verbose = FALSE, prediction_type = "VirtEnsembles", ntree_end = 0L, virtual_ensembles_count = 10, thread_count = -1 )
model |
The model obtained as the result of training. Default value: Required argument |
pool |
The input dataset. Default value: Required argument |
verbose |
Verbose output to stdout. Default value: FALSE (not used) |
prediction_type |
The format for displaying approximated values in output data (see https://catboost.ai/docs/concepts/python-reference_virtual_ensembles_predict.html#python-reference_catboostclassifier_predict__output-format). Possible values: 'VirtEnsembles', 'TotalUncertainty'. Default value: 'VirtEnsembles' |
ntree_end |
Index of the first tree not to be used when applying the model or calculating the metrics (zero-based indexing). Default value: 0 (the index of the last tree to use equals the number of trees in the model minus one) |
virtual_ensembles_count |
Number of tree ensembles to use. Each virtual ensemble can be considered as a truncated model. Default value: 10 |
thread_count |
The number of threads to use when applying the model. If -1, then the number of threads is set to the number of CPU cores. Allows you to optimize the speed of execution. This parameter doesn't affect results. Default value: -1 |
A matrix or array of predictions (for the 'TotalUncertainty' and 'VirtEnsembles' prediction types, respectively).
https://catboost.ai/docs/concepts/python-reference_virtual_ensembles_predict.html?lang=en
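An illustrative sketch (not from the original reference), assuming `model` and `test_pool` exist as in the catboost.train example above:
## Not run:
unc <- catboost.virtual_ensembles_predict(model, test_pool,
                                           prediction_type = 'TotalUncertainty',
                                           virtual_ensembles_count = 10)
## End(Not run)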
Get dimensions of a Pool.
## S3 method for class 'catboost.Pool' dim(x)
x |
The input dataset. Default value: Required argument |
Returns a vector with the numbers of rows and columns of a catboost.Pool.
Get dimension names of a Pool.
## S3 method for class 'catboost.Pool' dimnames(x)
x |
The input dataset. Default value: Required argument |
A list with two elements. The second element contains the column names.
Return a list with the first n objects of the dataset.
Each row contains the following information for one object:
The label value.
The weight value.
The feature values.
## S3 method for class 'catboost.Pool' head(x, n = 10, ...)
x |
The input dataset. Default value: Required argument |
n |
The quantity of the first objects in the dataset to be returned. Default value: 10 |
... |
not currently used |
A matrix containing the first n objects of the dataset.
Displays the most general characteristics of a CatBoost model.
## S3 method for class 'catboost.Model' print(x, ...)
x |
The model obtained as the result of training. |
... |
Not used |
The same model that was passed as input.
Print dimensions of catboost.Pool.
## S3 method for class 'catboost.Pool' print(x, ...)
x |
a catboost.Pool object Default value: Required argument |
... |
not currently used |
Nothing. This method prints pool dimensions.
Displays the most general characteristics of a CatBoost model (same as 'print').
## S3 method for class 'catboost.Model' summary(object, ...)
object |
The model obtained as the result of training. |
... |
Not used |
The same model that was passed as input.
Return a list with the last n objects of the dataset.
Each row contains the following information for one object:
The target value.
The weight value.
The feature values.
## S3 method for class 'catboost.Pool' tail(x, n = 10, ...)
x |
The input dataset. Default value: Required argument |
n |
The quantity of the last objects in the dataset to be returned. Default value: 10 |
... |
not currently used |
A matrix containing the last n objects of the dataset.