Package 'catboost'

Title: Gradient Boosting on Decision Trees
Description: Open-source gradient boosting on decision trees with categorical features support out of the box.
Authors: CatBoost DevTeam [aut, cre]
Maintainer: Stanislav Kirillov <[email protected]>
License: Apache License (== 2.0)
Version: 1.2.7
Built: 2024-09-18 09:22:56 UTC
Source: https://github.com/catboost/catboost


Support caret interface

Description

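A list object that describes CatBoost as a custom model for the caret package. Pass it as the method argument of caret::train.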

Usage

catboost.caret

Format

An object of class list of length 10.


Cross-validate model.

Description

Estimate model performance using cross-validation.

Usage

catboost.cv(
  pool,
  params = list(),
  fold_count = 3,
  type = "Classical",
  partition_random_seed = 0,
  shuffle = TRUE,
  stratified = FALSE,
  early_stopping_rounds = NULL
)

Arguments

pool

Data to cross-validate on

params

Parameters for catboost.train

fold_count

The number of folds.

type

The type of cross-validation.

partition_random_seed

The random seed used for splitting pool into folds.

shuffle

Shuffle the dataset objects before splitting into folds.

stratified

Perform stratified sampling.

early_stopping_rounds

Activates the Iter overfitting detector with od_wait set to early_stopping_rounds.

Value

A data.frame of evaluation results from cross-validation.
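
Examples

The following is a minimal sketch of a cross-validation run, not a reference recipe. The synthetic data, the object names and the parameter values are assumptions made for illustration; logging_level and train_dir are CatBoost training parameters that are not described in this entry. The code is skipped when the catboost package is not installed.

```r
# Hedged sketch: 3-fold cross-validation of a small regression model.
if (requireNamespace("catboost", quietly = TRUE)) {
  library(catboost)

  # Synthetic regression data (illustrative only).
  set.seed(42)
  features <- data.frame(x1 = runif(90), x2 = runif(90))
  label <- 2 * features$x1 - features$x2 + rnorm(90, sd = 0.1)
  pool <- catboost.load_pool(features, label = label)

  cv_result <- catboost.cv(
    pool,
    params = list(
      loss_function = "RMSE",
      iterations = 20,
      logging_level = "Silent",
      train_dir = tempdir()
    ),
    fold_count = 3,
    partition_random_seed = 0
  )

  # Inspect the returned data.frame of evaluation results.
  print(head(cv_result))
}
```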


Drop unused features information from model

Description

Drop unused features information from model

Usage

catboost.drop_unused_features(model, ntree_end, ntree_start = 0)

Arguments

model

The model obtained as the result of training.

ntree_end

Leave the trees with indices from the interval [ntree_start, ntree_end) (zero-based indexing).

ntree_start

Leave the trees with indices from the interval [ntree_start, ntree_end) (zero-based indexing).

Value

Status of the operation: TRUE if dropping the unused feature information succeeded, FALSE otherwise.


Calculate metrics.

Description

Calculate the specified metrics for the specified dataset.

Usage

catboost.eval_metrics(
  model,
  pool,
  metrics,
  ntree_start = 0L,
  ntree_end = 0L,
  eval_period = 1,
  thread_count = -1,
  tmp_dir = NULL
)

Arguments

model

The model obtained as the result of training.

Default value: Required argument

pool

The pool for which you want to evaluate the metrics.

Default value: Required argument

metrics

The list of metrics to be calculated. (Supported metrics https://catboost.ai/docs/references/custom-metric__supported-metrics.html)

Default value: Required argument

ntree_start

Model is applied on the interval [ntree_start, ntree_end) with the step eval_period (zero-based indexing).

Default value: 0

ntree_end

Model is applied on the interval [ntree_start, ntree_end) with the step eval_period (zero-based indexing).

Default value: 0 (if the value is 0, this parameter is ignored and ntree_end is set to tree_count)

eval_period

Model is applied on the interval [ntree_start, ntree_end) with the step eval_period (zero-based indexing).

Default value: 1

thread_count

The number of threads to use when applying the model. If -1, then the number of threads is set to the number of CPU cores.

Allows you to optimize the speed of execution. This parameter doesn't affect results.

Default value: -1

tmp_dir

The name of the temporary directory for intermediate results. If NULL, then the name will be generated.

Default value: NULL

Value

For each metric, a vector of values of length (ntree_end - ntree_start) / eval_period.

See Also

https://catboost.ai/docs/concepts/python-reference_catboost_eval-metrics.html


Calculate the feature importances

Description

Calculate the feature importances (see https://catboost.ai/docs/concepts/fstr.html#fstr) (Regular feature importance, ShapValues, and Feature interaction strength).

Usage

catboost.get_feature_importance(
  model,
  pool = NULL,
  type = "FeatureImportance",
  thread_count = -1,
  fstr_type = NULL
)

Arguments

model

The model obtained as the result of training.

Default value: Required argument

pool

The input dataset.

The feature importance for the training dataset is calculated if this argument is not specified. Models with ranking metrics require pool argument to calculate feature importance.

Default value: NULL

type

The feature importance type.

Possible values:

  • 'PredictionValuesChange'

    Calculate score for every feature.

  • 'LossFunctionChange'

    Calculate score for every feature for groupwise model.

  • 'FeatureImportance'

    'LossFunctionChange' in case of groupwise model and 'PredictionValuesChange' otherwise.

  • 'Interaction'

    Calculate pairwise score between every feature.

  • 'ShapValues'

    Calculate SHAP Values for every object.

Default value: 'FeatureImportance'

thread_count

The number of threads to use when applying the model. If -1, then the number of threads is set to the number of CPU cores.

Allows you to optimize the speed of execution. This parameter doesn't affect results.

Default value: -1

fstr_type

Deprecated parameter, use 'type' instead.

Value

Feature importances

See Also

https://catboost.ai/docs/features/feature-importances-calculation.html
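
Examples

A minimal sketch of requesting the default importance type for a small model. The synthetic data, object names and parameter values are assumptions made for illustration; logging_level and train_dir are CatBoost training parameters that are not described in this entry. The code is skipped when the catboost package is not installed.

```r
# Hedged sketch: feature importances for a two-feature binary classifier.
if (requireNamespace("catboost", quietly = TRUE)) {
  library(catboost)

  set.seed(3)
  features <- data.frame(x1 = runif(100), x2 = runif(100))
  label <- as.numeric(features$x1 > 0.5)   # only x1 carries signal
  pool <- catboost.load_pool(features, label = label)

  model <- catboost.train(
    pool,
    params = list(
      loss_function = "Logloss",
      iterations = 20,
      logging_level = "Silent",
      train_dir = tempdir()
    )
  )

  # 'FeatureImportance' is the documented default type.
  importance <- catboost.get_feature_importance(
    model,
    pool = pool,
    type = "FeatureImportance"
  )
  print(importance)
}
```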


Model parameters

Description

Return the model parameters.

Usage

catboost.get_model_params(model)

Arguments

model

The model obtained as the result of training.

Value

A list object with model parameters.

See Also

https://catboost.ai/docs/concepts/r-reference_catboost-get_model_params.html


Calculate the object importances

Description

Calculate the object importances (see https://catboost.ai/docs/concepts/ostr.html). This is the implementation of the LeafInfluence algorithm from the following paper: https://arxiv.org/pdf/1802.06640.pdf

Usage

catboost.get_object_importance(
  model,
  pool,
  train_pool,
  top_size = -1,
  type = "Average",
  update_method = "SinglePoint",
  thread_count = -1,
  ostr_type = NULL
)

Arguments

model

The model obtained as the result of training.

Default value: Required argument

pool

The pool for which you want to evaluate the object importances.

Default value: Required argument

train_pool

The pool on which the model has been trained.

Default value: Required argument

top_size

Method returns the result of the top_size most important train objects. If -1, then the top size is not limited.

Default value: -1

type

Possible values:

  • 'Average'

    Method returns the mean train objects scores for all input objects.

  • 'PerObject'

    Method returns the train objects scores for every input object.

Default value: 'Average'

update_method

Descriptions of the update set methods are given in section 3.1.3 of the paper.

Possible values:

  • 'SinglePoint'

  • 'TopKLeaves' It is possible to set the top size, for example TopKLeaves:top=2.

  • 'AllPoints'

Default value: 'SinglePoint'

thread_count

The number of threads to use when applying the model. If -1, then the number of threads is set to the number of CPU cores.

Allows you to optimize the speed of execution. This parameter doesn't affect results.

Default value: -1

ostr_type

Deprecated parameter, use 'type' instead.

Value

List with elements "indices" and "scores".

See Also

https://catboost.ai/docs/concepts/r-reference_catboost-get_object_importance.html


Plain Model parameters

Description

Return the plain model parameters.

Usage

catboost.get_plain_params(model)

Arguments

model

The model obtained as the result of training.

Value

A list object with model parameters.


Load the model

Description

Load the model from a file.

Note: Feature importance (see https://catboost.ai/docs/concepts/fstr.html#fstr) is not saved when using this function.

Usage

catboost.load_model(model_path, file_format = "cbm")

Arguments

model_path

The path to the model.

Default value: Required argument

file_format

Format of the model file.

Default value: 'cbm'

Value

A model object.

See Also

https://catboost.ai/docs/concepts/r-reference_catboost-load_model.html


Create a dataset

Description

Create a dataset from the given file, matrix or data.frame.

Usage

catboost.load_pool(
  data,
  label = NULL,
  cat_features = NULL,
  column_description = NULL,
  pairs = NULL,
  delimiter = "\t",
  has_header = FALSE,
  weight = NULL,
  group_id = NULL,
  group_weight = NULL,
  subgroup_id = NULL,
  pairs_weight = NULL,
  baseline = NULL,
  feature_names = NULL,
  thread_count = -1,
  graph = NULL
)

Arguments

data

A file path, matrix or data.frame with features. The following column types are supported:

  • double

  • factor. It is assumed that categorical features are given in this type of columns. A standard CatBoost processing procedure is applied to this type of columns:

    1. The values are converted to strings.

    2. The ConvertCatFeatureToFloat function is applied to the resulting string.

Default value: Required argument

label

The label vector or label matrix

cat_features

A vector of categorical feature indices. The indices are zero-based and can differ from those given in the column descriptions file. If data is a data.frame, do not use cat_features: categorical features are determined automatically from the data.frame column types.

column_description

The path to the input file that contains the column descriptions.

pairs

A file path, matrix or data.frame that contains the pairs descriptions. The shape should be Nx2, where N is the pairs' count. The first element of pair is the index of winner document in training set. The second element of pair is the index of loser document in training set.

delimiter

Delimiter character to use to separate features in a file.

has_header

Read column names from the first line if this parameter is set to TRUE.

weight

The weights of the objects.

group_id

The group ids of the objects.

group_weight

The group weight of the objects.

subgroup_id

The subgroup ids of the objects.

pairs_weight

The weights of the pairs.

baseline

Vector of initial (raw) values of the objective function. Used in the calculation of final values of trees.

feature_names

A list of names for each feature in the dataset.

thread_count

The number of threads to use while reading the data. If -1, then the number of threads is set to the number of CPU cores. Optimizes reading time. This parameter doesn't affect results.

graph

A file path, matrix or data.frame that contains the pairs of indices of objects for graph features. The shape should be Nx2, where N is the number of index pairs.

Value

catboost.Pool

Examples

## Not run: 
# From file
pool_path <- system.file("extdata", "adult_train.1000", package = "catboost")
cd_path <- system.file("extdata", "adult.cd", package = "catboost")
pool <- catboost.load_pool(pool_path, column_description = cd_path)
print(pool)

# From matrix
target <- 1
data_matrix <- matrix(runif(18), 6, 3)
pool <- catboost.load_pool(data_matrix[, -target], label = data_matrix[, target])
print(pool)

# From data.frame
nonsense <- factor(c('A', 'B', 'C'))
data_frame <- data.frame(value = runif(10), category = nonsense[(1:10) %% 3 + 1])
label <- (1:10) %% 2
pool <- catboost.load_pool(data_frame, label = label)
print(pool)

## End(Not run)

Apply the model

Description

Apply the model to the given dataset.

Note: For multiclass classification, the prediction is returned as a matrix. Each row of this matrix contains the predictions for one object of the input dataset.

Usage

catboost.predict(
  model,
  pool,
  verbose = FALSE,
  prediction_type = "RawFormulaVal",
  ntree_start = 0,
  ntree_end = 0,
  thread_count = -1
)

Arguments

model

The model obtained as the result of training.

Default value: Required argument

pool

The input dataset.

Default value: Required argument

verbose

Verbose output to stdout.

Default value: FALSE (not used)

prediction_type

The format for displaying approximated values in output data (see https://catboost.ai/docs/concepts/output-data.html).

Possible values:

  • 'Probability'

  • 'LogProbability'

  • 'Class'

  • 'RawFormulaVal'

  • 'Exponent'

  • 'RMSEWithUncertainty'

Default value: 'RawFormulaVal'

ntree_start

Model is applied on the interval [ntree_start, ntree_end) (zero-based indexing).

Default value: 0

ntree_end

Model is applied on the interval [ntree_start, ntree_end) (zero-based indexing).

Default value: 0 (if the value is 0, this parameter is ignored and ntree_end is set to tree_count)

thread_count

The number of threads to use when applying the model. If -1, then the number of threads is set to the number of CPU cores.

Allows you to optimize the speed of execution. This parameter doesn't affect results.

Default value: -1

Value

Vector of predictions (matrix for multi-class classification).

See Also

https://catboost.ai/docs/concepts/r-reference_catboost-predict.html
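
Examples

A minimal sketch of applying a model with three of the documented prediction types. The synthetic data, object names and parameter values are assumptions made for illustration; logging_level and train_dir are CatBoost training parameters that are not described in this entry. The code is skipped when the catboost package is not installed.

```r
# Hedged sketch: train a tiny binary classifier and predict in three formats.
if (requireNamespace("catboost", quietly = TRUE)) {
  library(catboost)

  set.seed(1)
  features <- data.frame(x1 = runif(100), x2 = runif(100))
  label <- as.numeric(features$x1 + features$x2 > 1)
  pool <- catboost.load_pool(features, label = label)

  model <- catboost.train(
    pool,
    params = list(
      loss_function = "Logloss",
      iterations = 20,
      logging_level = "Silent",
      train_dir = tempdir()
    )
  )

  # Default prediction_type is 'RawFormulaVal'.
  raw <- catboost.predict(model, pool)
  prob <- catboost.predict(model, pool, prediction_type = "Probability")
  cls <- catboost.predict(model, pool, prediction_type = "Class")

  print(head(cbind(raw = raw, prob = prob, cls = cls)))
}
```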


Restore or complete model handle after de-serializing

Description

After de-serializing a model object through R base's functions ('readRDS', 'load'), its underlying object will not exist in the computer's memory anymore, and needs to be restored from the raw bytes that the model stores.

This is done automatically when calling functions such as catboost.predict, but the process is repeated at each call, which makes such calls slower than with a fresh model object and increases memory usage between garbage collections. This function restores the internal object beforehand, so that it is not restored multiple times.

Note that the model object needs to be re-assigned as the output of this function, as the modifications are not done in-place.

Usage

catboost.restore_handle(model)

Arguments

model

The model obtained as the result of training which has been serialized and is now de-serialized.

Value

The model object with its handle pointing to a valid object in memory.
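
Examples

A minimal sketch of the round trip described above: serialize with saveRDS, read back with readRDS, then restore the handle once and re-assign the result. The synthetic data, object names and parameter values are assumptions made for illustration; logging_level and train_dir are CatBoost training parameters that are not described in this entry. The code is skipped when the catboost package is not installed.

```r
# Hedged sketch: restore the model handle once after de-serializing.
if (requireNamespace("catboost", quietly = TRUE)) {
  library(catboost)

  set.seed(1)
  features <- data.frame(x1 = runif(50), x2 = runif(50))
  pool <- catboost.load_pool(features, label = features$x1 + features$x2)

  model <- catboost.train(
    pool,
    params = list(
      loss_function = "RMSE",
      iterations = 10,
      logging_level = "Silent",
      train_dir = tempdir()
    )
  )

  rds_path <- tempfile(fileext = ".rds")
  saveRDS(model, rds_path)
  restored <- readRDS(rds_path)

  # Re-assign the result: the handle is not restored in place.
  restored <- catboost.restore_handle(restored)

  preds_original <- catboost.predict(model, pool)
  preds_restored <- catboost.predict(restored, pool)
}
```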


Save the model

Description

Save the model to a file.

Note: Feature importance (see https://catboost.ai/docs/concepts/fstr.html#fstr) is not saved when using this function.

Usage

catboost.save_model(
  model,
  model_path,
  file_format = "cbm",
  export_parameters = NULL,
  pool = NULL
)

Arguments

model

The model to be saved.

Default value: Required argument

model_path

The path to the resulting binary file with the model description. Used for solving other machine learning problems (for instance, applying a model).

Default value: Required argument

file_format

The format of the model file to write. Possible values:

  • 'cbm' For catboost binary format

  • 'coreml' To export into Apple CoreML format

  • 'onnx' To export into ONNX-ML format

  • 'pmml' To export into PMML format

  • 'cpp' To export as C++ code

  • 'python' To export as Python code.

Default value: 'cbm'

export_parameters

Parameters for CoreML or PMML export.

pool

The training pool.

Value

Status: TRUE if the model was saved successfully, FALSE otherwise.

See Also

https://catboost.ai/docs/features/export-model-to-core-ml.html
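
Examples

A minimal sketch of saving a model in the default 'cbm' format and reading it back with catboost.load_model. The synthetic data, object names and parameter values are assumptions made for illustration; logging_level and train_dir are CatBoost training parameters that are not described in this entry. The code is skipped when the catboost package is not installed.

```r
# Hedged sketch: save a model to a temporary file and load it again.
if (requireNamespace("catboost", quietly = TRUE)) {
  library(catboost)

  set.seed(7)
  features <- data.frame(x1 = runif(60), x2 = runif(60))
  pool <- catboost.load_pool(features, label = features$x1 - features$x2)

  model <- catboost.train(
    pool,
    params = list(
      loss_function = "RMSE",
      iterations = 10,
      logging_level = "Silent",
      train_dir = tempdir()
    )
  )

  model_path <- tempfile(fileext = ".cbm")
  catboost.save_model(model, model_path)   # file_format defaults to "cbm"
  loaded <- catboost.load_model(model_path)

  preds_before <- catboost.predict(model, pool)
  preds_after <- catboost.predict(loaded, pool)
}
```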


Save the dataset

Description

Save the dataset to the CatBoost format. Files with the following data are created:

  • Dataset description

  • Column descriptions

Use the catboost.load_pool function to read the resulting files. These files can also be used in the Command-line version and the Python library.

Usage

catboost.save_pool(
  data,
  label = NULL,
  weight = NULL,
  baseline = NULL,
  pool_path = "data.pool",
  cd_path = "cd.pool"
)

Arguments

data

A data.frame with features. The following column types are supported:

  • double

  • factor. It is assumed that categorical features are given in this type of columns. A standard CatBoost processing procedure is applied to this type of columns:

    1. The values are converted to strings.

    2. The ConvertCatFeatureToFloat function is applied to the resulting string.

Default value: Required argument

label

The label vector.

weight

The weights of the label vector.

baseline

Vector of initial (raw) values of the label function for the object. Used in the calculation of final values of trees.

pool_path

The path to the output file that contains the dataset description.

cd_path

The path to the output file that contains the column descriptions.

Value

Nothing. This method writes a dataset to disk.


Shrink the model

Description

Shrink the model

Usage

catboost.shrink(model, ntree_end, ntree_start = 0)

Arguments

model

The model obtained as the result of training.

ntree_end

Leave the trees with indices from the interval [ntree_start, ntree_end) (zero-based indexing).

ntree_start

Leave the trees with indices from the interval [ntree_start, ntree_end) (zero-based indexing).

Value

Status, the result of model shrinking. TRUE if shrinking succeeded, FALSE otherwise.

See Also

https://catboost.ai/docs/concepts/r-reference_catboost-shrink.html


Apply the model for each tree

Description

Apply the model to the given dataset and calculate the results for each i-th tree of the model taking into consideration only the trees in the range [1;i].

Note: For multiclass classification, the prediction is returned as a matrix. Each row of this matrix contains the predictions for one object of the input dataset.

Usage

catboost.staged_predict(
  model,
  pool,
  verbose = FALSE,
  prediction_type = "RawFormulaVal",
  ntree_start = 0L,
  ntree_end = 0L,
  eval_period = 1,
  thread_count = -1
)

Arguments

model

The model obtained as the result of training.

Default value: Required argument

pool

The input dataset.

Default value: Required argument

verbose

Verbose output to stdout.

Default value: FALSE (not used)

prediction_type

The format for displaying approximated values in output data (see https://catboost.ai/docs/concepts/output-data.html).

Possible values:

  • 'Probability'

  • 'Class'

  • 'RawFormulaVal'

Default value: 'RawFormulaVal'

ntree_start

Model is applied on the interval [ntree_start, ntree_end) with the step eval_period (zero-based indexing).

Default value: 0

ntree_end

Model is applied on the interval [ntree_start, ntree_end) with the step eval_period (zero-based indexing).

Default value: 0 (if the value is 0, this parameter is ignored and ntree_end is set to tree_count)

eval_period

Model is applied on the interval [ntree_start, ntree_end) with the step eval_period (zero-based indexing).

Default value: 1

thread_count

The number of threads to use when applying the model. If -1, then the number of threads is set to the number of CPU cores.

Allows you to optimize the speed of execution. This parameter doesn't affect results.

Default value: -1

Value

List object with predictions from one iteration.

See Also

https://catboost.ai/docs/concepts/r-reference_catboost-staged_predict.html
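
Examples

The interplay of ntree_start, ntree_end and eval_period can be sketched in base R without calling the library. The helper staged_upper_bounds is hypothetical and not part of catboost. It assumes that results are reported for the growing tree ranges [ntree_start, ntree_start + eval_period), [ntree_start, ntree_start + 2 * eval_period), and so on, with the last range capped at ntree_end; the library's exact stepping may differ.

```r
# Hypothetical helper (not part of catboost): list the exclusive upper bounds
# of the tree ranges [ntree_start, upper) under the assumption stated above.
staged_upper_bounds <- function(ntree_start, ntree_end, eval_period, tree_count) {
  # ntree_end = 0 means "use all trees", as documented for this argument.
  if (ntree_end == 0) ntree_end <- tree_count
  n_steps <- ceiling((ntree_end - ntree_start) / eval_period)
  # Cap the final step at ntree_end when the interval is not a multiple
  # of eval_period.
  pmin(ntree_start + seq_len(n_steps) * eval_period, ntree_end)
}

bounds <- staged_upper_bounds(0, 0, 2, tree_count = 7)
print(bounds)  # 2 4 6 7
```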


Sum models.

Description

Blend trees and counters of two or more trained CatBoost models into a new model. Leaf values can be individually weighted for each input model. For example, it may be useful to blend models trained on different validation datasets.

Usage

catboost.sum_models(
  models,
  weights = NULL,
  ctr_merge_policy = "IntersectingCountersAverage"
)

Arguments

models

Models for the summation.

Default value: Required argument

weights

The weights of the models.

Default value: NULL (use weight 1 for every model)

ctr_merge_policy

The counters merging policy. Possible values:

  • 'FailIfCtrIntersects' Ensure that the models have zero intersecting counters

  • 'LeaveMostDiversifiedTable' Use the most diversified counters by the count of unique hash values

  • 'IntersectingCountersAverage' Use the average ctr counter values in the intersecting bins

Default value: 'IntersectingCountersAverage'

Value

Model object.


Train the model

Description

Train the model using a CatBoost dataset.

The list of parameters

  • Common parameters

    • fold_permutation_block

      Objects in the dataset are grouped in blocks before the random permutations. This parameter defines the size of the blocks. The smaller the value, the slower the training. Large values may result in quality degradation.

      Default value:

      Default value differs depending on the dataset size and ranges from 1 to 256 inclusively

    • ignored_features

      Identifiers of features to exclude from training. The non-negative indices that do not match any features are successfully ignored. For example, if five features are defined for the objects in the dataset and this parameter is set to "42", the corresponding non-existing feature is successfully ignored.

      The identifier corresponds to the feature's index. Feature indices used in training and in feature importance are numbered from 0 to featureCount-1. If a file is used as input data, then any non-feature column types are ignored when calculating these indices. For example, each row in the input file contains data in the following order: "categorical feature<\t>label<\t>numerical feature". So for the row "rock<\t>0<\t>42", the identifier for the "rock" feature is 0, and for the "42" feature it is 1.

      The identifiers of the features to exclude should be listed in a vector.

      For example, if training should exclude features with the identifiers 1, 2, 7, 42, 43, 44, 45, the value of this parameter should be set to c(1,2,7,42,43,44,45).

      Default value:

      None (use all features)
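
      The index rule above can be sketched in base R. The column role labels and variable names are illustrative assumptions; the sketch only reproduces the "rock / 0 / 42" example, where the label column is skipped when numbering features.

      ```r
      # Column roles in file order (labels are illustrative, not a catboost format).
      column_roles <- c("categorical feature", "label", "numerical feature")

      # Non-feature columns do not consume a feature index.
      is_feature <- column_roles != "label"

      # Zero-based feature indices; non-feature columns get NA.
      feature_index <- ifelse(is_feature, cumsum(is_feature) - 1L, NA_integer_)
      names(feature_index) <- column_roles
      print(feature_index)

      # Exclude the numerical feature (index 1) from training.
      params <- list(ignored_features = c(1))
      ```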

    • use_best_model

      If this parameter is set, the number of trees that are saved in the resulting model is defined as follows:

      • Build the number of trees defined by the training parameters.

      • Identify the iteration with the optimal loss function value.

      • No trees are saved after this iteration.

      This option requires a test dataset to be provided.

      Default value:

      FALSE (not used)

    • loss_function

      The loss function (see https://catboost.ai/docs/concepts/loss-functions.html#loss-functions) to use in training. The specified value also determines the machine learning problem to solve.

      Format:

      <Loss function 1>[:<parameter 1>=<value>:..<parameter N>=<value>:]

      Supported loss functions:

      • 'Logloss'

      • 'CrossEntropy'

      • 'MultiClass'

      • 'MultiClassOneVsAll'

      • 'RMSE'

      • 'MAE'

      • 'Quantile'

      • 'LogLinQuantile'

      • 'MAPE'

      • 'Poisson'

      • 'Lq'

      • 'PairLogit'

      • 'PairLogitPairwise'

      • 'YetiRank'

      • 'YetiRankPairwise'

      • 'QueryCrossEntropy'

      • 'QueryRMSE'

      • 'QuerySoftMax'

      Supported parameters:

      • alpha - The coefficient used in quantile-based losses ('Quantile' and 'LogLinQuantile'). The default value is 0.5.

        For example, if you need to calculate the value of Quantile with the coefficient α = 0.1, use the following construction:

        'Quantile:alpha=0.1'

      Default value:

      'RMSE'
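
      The loss function string format can be assembled programmatically. The helper make_loss below is hypothetical and not part of catboost; it is a small sketch that joins a loss name and named parameters with the ':' and '=' separators shown in the format above.

      ```r
      # Hypothetical helper (not part of catboost): build "<name>:<param>=<value>".
      make_loss <- function(name, ...) {
        args <- list(...)
        if (length(args) == 0) {
          return(name)  # no parameters: the bare loss name is already valid
        }
        paste0(name, ":", paste0(names(args), "=", unlist(args), collapse = ":"))
      }

      quantile_loss <- make_loss("Quantile", alpha = 0.1)
      print(quantile_loss)  # "Quantile:alpha=0.1"

      # The string is then passed as the loss_function training parameter.
      params <- list(loss_function = quantile_loss)
      ```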

    • custom_loss

      Loss function (see https://catboost.ai/docs/concepts/loss-functions.html#loss-functions) values to output during training. These functions are not used for optimization and are displayed for informational purposes only.

      Format:

      c(<Loss function 1>[:<parameter>=<value>],<Loss function 2>[:<parameter>=<value>],...,<Loss function N>[:<parameter>=<value>])

      Supported loss functions:

      • 'Logloss'

      • 'CrossEntropy'

      • 'Precision'

      • 'Recall'

      • 'F1'

      • 'F'

      • 'BalancedAccuracy'

      • 'BalancedErrorRate'

      • 'MCC'

      • 'Accuracy'

      • 'CtrFactor'

      • 'AUC'

      • 'BrierScore'

      • 'HingeLoss'

      • 'HammingLoss'

      • 'ZeroOneLoss'

      • 'Kappa'

      • 'WKappa'

      • 'LogLikelihoodOfPrediction'

      • 'MultiClass'

      • 'MultiClassOneVsAll'

      • 'TotalF1'

      • 'MAE'

      • 'MAPE'

      • 'Poisson'

      • 'Quantile'

      • 'RMSE'

      • 'LogLinQuantile'

      • 'Lq'

      • 'NumErrors'

      • 'SMAPE'

      • 'R2'

      • 'MSLE'

      • 'MedianAbsoluteError'

      • 'PairLogit'

      • 'PairLogitPairwise'

      • 'PairAccuracy'

      • 'QueryCrossEntropy'

      • 'QueryRMSE'

      • 'QuerySoftMax'

      • 'PFound'

      • 'NDCG'

      • 'AverageGain'

      • 'PrecisionAt'

      • 'RecallAt'

      • 'MAP'

      • 'MRR'

      • 'ERR'

      Supported parameters:

      • alpha - The coefficient used in quantile-based losses ('Quantile' and 'LogLinQuantile'). The default value is 0.5.

      For example, if you need to calculate the values of CrossEntropy and of Quantile with the coefficient α = 0.1, use the following construction:

      c('CrossEntropy', 'Quantile:alpha=0.1')

      For a single function, use c('CrossEntropy') or simply 'CrossEntropy'.

      Values of all custom loss functions for learning and test datasets are saved to the Loss function (see https://catboost.ai/docs/concepts/output-data_loss-function.html#output-data_loss-function) output files (learn_error.tsv and test_error.tsv respectively). The directory for these files is specified in the train-dir (train_dir) parameter.

      Default value:

      None (use one of the loss functions supported by the library)

    • eval_metric

      The loss function used for overfitting detection (if enabled) and best model selection (if enabled).

      Supported loss functions:

      • 'Logloss'

      • 'CrossEntropy'

      • 'Precision'

      • 'Recall'

      • 'F1'

      • 'F'

      • 'BalancedAccuracy'

      • 'BalancedErrorRate'

      • 'MCC'

      • 'Accuracy'

      • 'CtrFactor'

      • 'AUC'

      • 'BrierScore'

      • 'HingeLoss'

      • 'HammingLoss'

      • 'ZeroOneLoss'

      • 'Kappa'

      • 'WKappa'

      • 'LogLikelihoodOfPrediction'

      • 'MultiClass'

      • 'MultiClassOneVsAll'

      • 'TotalF1'

      • 'MAE'

      • 'MAPE'

      • 'Poisson'

      • 'Quantile'

      • 'RMSE'

      • 'LogLinQuantile'

      • 'Lq'

      • 'NumErrors'

      • 'SMAPE'

      • 'R2'

      • 'MSLE'

      • 'MedianAbsoluteError'

      • 'PairLogit'

      • 'PairLogitPairwise'

      • 'PairAccuracy'

      • 'QueryCrossEntropy'

      • 'QueryRMSE'

      • 'QuerySoftMax'

      • 'PFound'

      • 'NDCG'

      • 'AverageGain'

      • 'PrecisionAt'

      • 'RecallAt'

      • 'MAP'

      • 'MRR'

      • 'ERR'

      Format:

      metric_name:param=Value

      Examples:

      'R2'

      'Quantile:alpha=0.3'

      Default value:

      Optimized objective is used

    • iterations

      The maximum number of trees that can be built when solving machine learning problems.

      When using other parameters that limit the number of iterations, the final number of trees may be less than the number specified in this parameter.

      Default value:

      1000

    • border

      The target border. If the value is strictly greater than this threshold, it is considered a positive class. Otherwise it is considered a negative class.

      The parameter is obligatory if the Logloss function is used, since it uses borders to transform any given target to a binary target.

      Used in binary classification.

      Default value:

      0.5
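
      The thresholding rule can be sketched in base R. The target values below are made up for illustration; the comparison is strict, so a target equal to the border falls into the negative class.

      ```r
      # Illustrative real-valued targets and the default border.
      target <- c(0.1, 0.5, 0.7, 0.9)
      border <- 0.5

      # Strictly greater than the border means positive class (1), otherwise 0.
      binary_target <- as.integer(target > border)
      print(binary_target)  # 0 0 1 1
      ```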

    • leaf_estimation_iterations

      The number of gradient steps when calculating the values in leaves.

      Default value:

      1

    • depth

      Depth of the trees.

      The value can be any integer up to 16. It is recommended to use values in the range [1; 10].

      Default value:

      6

    • learning_rate

      The learning rate.

      Used for reducing the gradient step.

      Default value:

      0.03

    • rsm

      Random subspace method. The percentage of features to use at each iteration of building trees. At each iteration, features are selected over again at random.

      The value must be in the range [0;1].

      Default value:

      1

    • random_seed

      The random seed used for training.

      Default value:

      0

    • nan_mode

      Way to process missing values.

      Possible values:

      • 'Min'

      • 'Max'

      • 'Forbidden'

      Default value:

      'Min'

    • od_pval

      Use the Overfitting detector (see https://catboost.ai/docs/concepts/overfitting-detector.html#overfitting-detector) to stop training when the threshold is reached. Requires a test dataset.

      For best results, it is recommended to set a value in the range [10^-10; 10^-2].

      The larger the value, the earlier overfitting is detected.

      Default value:

      The overfitting detection is turned off

    • od_type

      The type of the overfitting detector to use.

      Possible values:

      • IncToDec

      • Iter

      Restriction. Do not specify the overfitting detector threshold when using the Iter type.

      Default value:

      'IncToDec'

    • od_wait

      The number of iterations to continue the training after the iteration with the optimal loss function value. The purpose of this parameter differs depending on the selected overfitting detector type:

      • IncToDec - Ignore the overfitting detector when the threshold is reached and continue learning for the specified number of iterations after the iteration with the optimal loss function value.

      • Iter - Consider the model overfitted and stop training after the specified number of iterations since the iteration with the optimal loss function value.

      Default value:

      20
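
      The Iter rule can be sketched in base R. The helper iter_stop is hypothetical and not part of catboost; it assumes that lower loss values are better and uses one-based iteration numbers, and the library's exact off-by-one behavior may differ.

      ```r
      # Hypothetical helper (not part of catboost): return the iteration at which
      # an Iter-style detector would stop, given per-iteration test losses.
      iter_stop <- function(test_loss, od_wait) {
        best_iter <- 1
        for (i in seq_along(test_loss)) {
          if (test_loss[i] < test_loss[best_iter]) {
            best_iter <- i  # new optimum: the waiting counter restarts
          }
          if (i - best_iter >= od_wait) {
            return(i)       # od_wait iterations passed without improvement
          }
        }
        length(test_loss)   # detector never triggered: train to the end
      }

      losses <- c(0.9, 0.7, 0.6, 0.65, 0.64, 0.66, 0.7)
      stop_at <- iter_stop(losses, od_wait = 3)
      print(stop_at)  # 6: the best value is at iteration 3, then 3 more iterations
      ```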

    • leaf_estimation_method

      The method used to calculate the values in leaves.

      Possible values:

      • Newton

      • Gradient

      Default value:

      Default value depends on the selected loss function

    • grow_policy

      GPU only. The tree growing policy. It describes how to perform greedy tree construction.

      Possible values:

      • SymmetricTree

      • Lossguide

      • Depthwise

      Default value:

      SymmetricTree

    • min_data_in_leaf

      GPU only. The minimum training samples count in leaf. CatBoost will not search for new splits in leaves with samples count less than min_data_in_leaf. This parameter is used only for Depthwise and Lossguide growing policies.

      Default value:

      1

    • max_leaves

      GPU only. The maximum leaf count in the resulting tree. This parameter is used only for the Lossguide growing policy.

      Default value:

      31

    • score_function

      GPU only. Score that is used during tree construction to select the next tree split.

      Possible values:

      • L2

      • Cosine

      • NewtonL2

      • NewtonCosine

      Default value:

      Cosine

      For growing policy Lossguide default is NewtonL2.

    • l2_leaf_reg

      L2 regularization coefficient. Used for leaf value calculation.

      Any positive values are allowed.

      Default value:

      3

    • model_size_reg

      Model size regularization coefficient. The influence coefficient of the model size for choosing tree structure. Increase this coefficient to get a smaller model.

      Any positive values are allowed.

      Default value:

      0.5

    • has_time

      Use the order of objects in the input data (do not perform a random permutation of the dataset at the preprocessing stage)

      Default value:

      FALSE (not used; permute input dataset)

    • allow_const_label

      Allow a constant label value in the dataset.

      Default value:

      FALSE

    • name

      The experiment name to display in visualization tools (see https://catboost.ai/docs/features/visualization.html#visualization).

      Default value:

      experiment

    • prediction_type

      The format for displaying approximated values in output data.

      Possible values:

      • 'Probability'

      • 'Class'

      • 'RawFormulaVal'

      Default value:

      'RawFormulaVal'

    • fold_len_multiplier

      Coefficient for changing the length of folds.

      The value must be greater than 1. The best validation result is achieved with minimum values.

      With values close to 1 (for example, 1 + ϵ), each iteration takes a quadratic amount of memory and time for the number of objects in the iteration. Thus, low values are possible only when there is a small number of objects.

      Default value:

      2

    • class_weights

      Classes weights. The values are used as multipliers for the object weights.

      For example, for 3 class classification you could use:

      c(0.85, 1.2, 1)

      Default value:

      None (the weight for all classes is set to 1)
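
As a minimal sketch of how such a vector might be built, the base R snippet below derives one weight per class, inversely proportional to class frequency. The label vector and the balancing heuristic are assumptions made for illustration; CatBoost does not prescribe this scheme.

```r
# Hypothetical integer class labels for a 3-class problem.
labels <- c(0, 0, 0, 0, 1, 1, 2, 2, 2, 2)

# Count objects per class (classes are sorted by label value).
class_counts <- table(labels)

# Assumed heuristic: weight each class inversely to its frequency,
# scaled so that a perfectly balanced dataset would get weights of 1.
class_weights <- as.numeric(length(labels) / (length(class_counts) * class_counts))

# Pass the weights in class order through the params list.
fit_params <- list(loss_function = 'MultiClass', class_weights = class_weights)
```

Here the rare class 1 receives a larger multiplier than classes 0 and 2.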

    • classes_count

      The upper limit for the numeric class label. Defines the number of classes for multiclassification.

      Only non-negative integers can be specified. The given integer should be greater than any of the target values.

      If this parameter is specified the labels for all classes in the input dataset should be smaller than the given value.

      Default value:

      maximum class label + 1
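
The default and the constraint described above can be sketched in base R as follows. The label vector is hypothetical, and the explicit check is only an illustration of the stated rule, not a reproduction of CatBoost's own validation.

```r
# Hypothetical multiclass labels encoded as non-negative integers.
labels <- c(0, 2, 1, 3, 2, 0)

# Default described above: maximum class label + 1.
default_classes_count <- max(labels) + 1

# An explicitly chosen value must be greater than every label in the dataset.
classes_count <- 5
stopifnot(classes_count > max(labels))

fit_params <- list(loss_function = 'MultiClass', classes_count = classes_count)
```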

    • one_hot_max_size

      Convert a categorical feature to float features with one-hot encoding if the number of different values that it takes does not exceed the specified value. Ctrs are not calculated for such features.

      The one-vs.-all delimiter is used for the resulting float features.

      Default value:

      FALSE (do not convert features to float based on the number of different values)

    • random_strength

      Score standard deviation multiplier.

      Default value:

      1

    • bootstrap_type

      Bootstrap type. Defines the method for sampling the weights of documents.

      Possible values:

      • 'Bayesian'

      • 'Bernoulli'

      • 'Poisson'

      • 'MVS'

      • 'No'

      Poisson bootstrap is supported only on GPU.

      Default value:

      'Bayesian'

    • bagging_temperature

      Controls intensity of Bayesian bagging. The higher the temperature the more aggressive bagging is.

      Typical values are in the range [0, 1] (0 is for no bagging).

      Possible values are in the range [0, +∞).

      Default value:

      1
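
To make the effect of the temperature concrete, here is a small base R sketch. It assumes the weight formula w = (-log(u))^t, with u uniform on (0, 1) and t equal to bagging_temperature; this formula is an assumption about Bayesian bootstrap that is not stated in this manual, and the helper name is hypothetical.

```r
set.seed(42)  # make the sketch reproducible

# Assumed weight formula for Bayesian bootstrap: w = (-log(u))^t,
# where u is uniform on (0, 1) and t is bagging_temperature.
bayesian_weights <- function(n, bagging_temperature) {
  u <- runif(n)
  (-log(u))^bagging_temperature
}

w_none <- bayesian_weights(5, 0)  # temperature 0: every weight equals 1 (no bagging)
w_exp  <- bayesian_weights(5, 1)  # temperature 1: exponentially distributed weights
w_high <- bayesian_weights(5, 3)  # higher temperature: more widely spread weights
```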

    • subsample

      Sample rate for bagging. This parameter can be used if one of the following bootstrap types is defined:

      • 'Bernoulli'

      Default value:

      0.66

    • sampling_unit

      Specifies the sampling scheme: sample weights for each object individually or for an entire group of objects together.

      Possible values:

      • 'Object'

      • 'Group'

      Default value:

      'Object'

    • sampling_frequency

      Frequency to sample weights and objects when building trees.

      Possible values:

      • 'PerTree'

      • 'PerTreeLevel'

      Default value:

      'PerTreeLevel'

    • model_shrink_rate

      For i > 0, the model is multiplied by (1 - model_shrink_rate / i) at the start of the i-th iteration.

      Possible values: [0, 1).

      Default value: 0
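
As a worked example of the shrink formula, the base R lines below compute the multiplier for the first five iterations with i > 0. The rate of 0.1 is an arbitrary illustrative value.

```r
model_shrink_rate <- 0.1
iterations <- 1:5

# Multiplier applied to the model at the start of iteration i (i > 0).
shrink_factor <- 1 - model_shrink_rate / iterations
# 0.9, 0.95, 0.9667, 0.975, 0.98 (third value rounded)
```

The multiplier approaches 1 as i grows, so shrinkage is strongest in the early iterations.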

  • CTR settings

    • simple_ctr

      Binarization settings for categorical features (see https://catboost.ai/docs/concepts/algorithm-main-stages_cat-to-numberic.html).

      Format:

      c(CtrType[:TargetBorderCount=BorderCount][:TargetBorderType=BorderType][:CtrBorderCount=Count][:CtrBorderType=Type][:Prior=num_1/denum_1]..[:Prior=num_N/denum_N])

      Components:

      • CTR types for training on CPU:

        • 'Borders'

        • 'Buckets'

        • 'BinarizedTargetMeanValue'

        • 'Counter'

      • CTR types for training on GPU:

        • 'Borders'

        • 'Buckets'

        • 'FeatureFreq'

        • 'FloatTargetMeanValue'

      • TargetBorderCount: the number of borders for label value binarization (see https://catboost.ai/docs/concepts/quantization.html). Only used for regression problems. Allowed values are integers from 1 to 255 inclusively. The default value is 1. This option is available for training on CPU only.

      • TargetBorderType: the binarization type (see https://catboost.ai/docs/concepts/quantization.html) for the label value. Only used for regression problems.

        Possible values:

        • 'Median'

        • 'Uniform'

        • 'UniformAndQuantiles'

        • 'MaxLogSum'

        • 'MinEntropy'

        • 'GreedyLogSum'

        Default value: 'MinEntropy'. This option is available for training on CPU only.

      • CtrBorderCount: the number of splits for categorical features. Allowed values are integers from 1 to 255 inclusively.

      • CtrBorderType: the binarization type for categorical features. Supported values for training on CPU:

        • 'Uniform'

        Supported values for training on GPU:

        • 'Median'

        • 'Uniform'

        • 'UniformAndQuantiles'

        • 'MaxLogSum'

        • 'MinEntropy'

        • 'GreedyLogSum'

      • Prior: priors to use during training (several values can be specified). Possible formats:

        • One number: adds the value to the numerator.

        • Two slash-delimited numbers (for GPU only): use this format to set a fraction. The first number is added to the numerator and the second is added to the denominator.
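
The format above can be assembled programmatically. The sketch below uses a hypothetical helper, build_ctr, that joins the components with colons in the order given by the format line; the helper and the chosen values are illustrative assumptions and are not part of the catboost package.

```r
# Hypothetical helper: assemble one CTR description string from its components,
# following CtrType[:TargetBorderCount=..][:TargetBorderType=..]
# [:CtrBorderCount=..][:CtrBorderType=..][:Prior=..]..[:Prior=..].
build_ctr <- function(ctr_type,
                      target_border_count = NULL,
                      target_border_type = NULL,
                      ctr_border_count = NULL,
                      ctr_border_type = NULL,
                      priors = character(0)) {
  parts <- c(
    ctr_type,
    if (!is.null(target_border_count)) paste0("TargetBorderCount=", target_border_count),
    if (!is.null(target_border_type)) paste0("TargetBorderType=", target_border_type),
    if (!is.null(ctr_border_count)) paste0("CtrBorderCount=", ctr_border_count),
    if (!is.null(ctr_border_type)) paste0("CtrBorderType=", ctr_border_type),
    if (length(priors) > 0) paste0("Prior=", priors)
  )
  paste(parts, collapse = ":")
}

ctr <- build_ctr("Borders", target_border_count = 1, ctr_border_count = 15,
                 ctr_border_type = "Uniform", priors = c("0.5", "1"))
# "Borders:TargetBorderCount=1:CtrBorderCount=15:CtrBorderType=Uniform:Prior=0.5:Prior=1"

fit_params <- list(simple_ctr = c(ctr))
```

Omitted components are simply left out of the string, which matches the optional brackets in the format.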

    • combinations_ctr

      Binarization settings for combinations of categorical features (see https://catboost.ai/docs/concepts/algorithm-main-stages_cat-to-numberic.html).

      Format:

      c(CtrType[:TargetBorderCount=BorderCount][:TargetBorderType=BorderType][:CtrBorderCount=Count][:CtrBorderType=Type][:Prior=num_1/denum_1]..[:Prior=num_N/denum_N])

      Components:

      • CTR types for training on CPU:

        • 'Borders'

        • 'Buckets'

        • 'BinarizedTargetMeanValue'

        • 'Counter'

      • CTR types for training on GPU:

        • 'Borders'

        • 'Buckets'

        • 'FeatureFreq'

        • 'FloatTargetMeanValue'

      • TargetBorderCount: the number of borders for target binarization (see https://catboost.ai/docs/concepts/quantization.html). Only used for regression problems. Allowed values are integers from 1 to 255 inclusively. The default value is 1. This option is available for training on CPU only.

      • TargetBorderType: the binarization type (see https://catboost.ai/docs/concepts/quantization.html) for the target. Only used for regression problems.

        Possible values:

        • 'Median'

        • 'Uniform'

        • 'UniformAndQuantiles'

        • 'MaxLogSum'

        • 'MinEntropy'

        • 'GreedyLogSum'

        Default value: 'MinEntropy'. This option is available for training on CPU only.

      • CtrBorderCount: the number of splits for categorical features. Allowed values are integers from 1 to 255 inclusively.

      • CtrBorderType: the binarization type for categorical features. Supported values for training on CPU:

        • 'Uniform'

        Supported values for training on GPU:

        • 'Median'

        • 'Uniform'

        • 'UniformAndQuantiles'

        • 'MaxLogSum'

        • 'MinEntropy'

        • 'GreedyLogSum'

      • Prior: priors to use during training (several values can be specified). Possible formats:

        • One number: adds the value to the numerator.

        • Two slash-delimited numbers (for GPU only): use this format to set a fraction. The first number is added to the numerator and the second is added to the denominator.

    • ctr_target_border_count

      Maximum number of borders used in target binarization for categorical features that need it. If TargetBorderCount is specified in 'simple_ctr', 'combinations_ctr' or 'per_feature_ctr' option it overrides this value.

      Default value:

      1

    • counter_calc_method

      The method for calculating the Counter CTR type for the test dataset.

      Possible values:

      • 'Full'

      • 'FullTest'

      • 'PrefixTest'

      • 'SkipTest'

      Default value: 'PrefixTest'

    • max_ctr_complexity

      The maximum number of categorical features that can be combined.

      Default value:

      4

    • ctr_leaf_count_limit

      The maximum number of leaves with categorical features. If the number of leaves exceeds the specified limit, some leaves are discarded. The value must be positive (for zero limit use ignored_features parameter).

      The leaves to be discarded are selected as follows:

      1. The leaves are sorted by the frequency of the values.

      2. The top N leaves are selected, where N is the value specified in the parameter.

      3. All leaves starting from N+1 are discarded.

      This option reduces the resulting model size and the amount of memory required for training. Note that the resulting quality of the model can be affected.

      Default value:

      None (The number of leaves with categorical features is not limited)
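
The three selection steps can be sketched in base R as follows. The frequency table is hypothetical, and sorting from most to least frequent is an assumption about the sort direction; the snippet only mirrors the described procedure and does not call CatBoost.

```r
# Hypothetical leaves keyed by categorical value, with how often each value occurs.
leaf_frequency <- c(red = 120, blue = 45, green = 300, teal = 3, pink = 17)
ctr_leaf_count_limit <- 3

# 1. Sort the leaves by frequency, most frequent first (assumed direction).
sorted_leaves <- sort(leaf_frequency, decreasing = TRUE)

# 2. Keep the top N leaves.
kept <- head(sorted_leaves, ctr_leaf_count_limit)

# 3. Discard everything from position N + 1 onward.
discarded <- tail(sorted_leaves, -ctr_leaf_count_limit)
```

With a limit of 3, the leaves for green, red, and blue are kept, while pink and teal are discarded.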

    • store_all_simple_ctr

      Ignore categorical features, which are not used in feature combinations, when choosing candidates for exclusion.

      Use this parameter with ctr_leaf_count_limit only.

      Default value:

      FALSE (Both simple features and feature combinations are taken into account when limiting the number of leaves with categorical features)

  • Binarization settings

    • border_count

      The number of splits for numerical features. Allowed values are integers from 1 to 255 inclusively.

      Default value:

      254 for training on CPU or 128 for training on GPU

    • feature_border_type

      The binarization mode (see https://catboost.ai/docs/concepts/quantization.html) for numerical features.

      Possible values:

      • 'Median'

      • 'Uniform'

      • 'UniformAndQuantiles'

      • 'MaxLogSum'

      • 'MinEntropy'

      • 'GreedyLogSum'

      Default value:

      'MinEntropy'
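
To give a feel for how the binarization mode changes the borders, the base R sketch below contrasts two simplified schemes: equally spaced borders (in the spirit of 'Uniform') and quantile borders (in the spirit of 'Median'). These are illustrative approximations under that assumption; the exact algorithms CatBoost uses may differ in detail.

```r
set.seed(1)
x <- rexp(1000)     # hypothetical skewed numerical feature
border_count <- 4   # number of splits to produce

# 'Uniform'-style borders: equally spaced between the minimum and maximum.
uniform_borders <- seq(min(x), max(x), length.out = border_count + 2)
uniform_borders <- uniform_borders[-c(1, border_count + 2)]

# 'Median'-style borders: quantiles, so each bucket holds roughly the same number of objects.
median_borders <- unname(quantile(x, probs = seq_len(border_count) / (border_count + 1)))

# Bucket index of each value under the two schemes.
uniform_bucket <- findInterval(x, uniform_borders)
median_bucket <- findInterval(x, median_borders)
```

On skewed data the equally spaced borders leave most objects in the first bucket, whereas the quantile borders spread objects evenly across buckets.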

  • Performance settings

    • thread_count

      The number of threads to use when applying the model.

      Allows you to optimize the speed of execution. This parameter doesn't affect results.

      Default value:

      The number of CPU cores.

  • Output settings

    • logging_level

      Possible values:

      • 'Silent'

      • 'Verbose'

      • 'Info'

      • 'Debug'

      Default value:

      'Silent'

    • metric_period

      The frequency of iterations to print the information to stdout. The value should be a positive integer.

      Default value:

      1

    • train_dir

      The directory for storing the files generated during training.

      Default value:

      None (current directory)

    • save_snapshot

      Enable snapshotting for restoring the training progress after an interruption.

      Default value:

      None

    • snapshot_file

      Settings for recovering training after an interruption (see https://catboost.ai/docs/features/snapshots.html).

      Depending on whether the file specified exists in the file system:

      • Missing - write information about training progress to the specified file.

      • Exists - load data from the specified file and continue training from where it left off.

      Default value:

      File can't be generated or read. If the value is omitted, the file name is experiment.cbsnapshot.

    • snapshot_interval

      Interval between saving snapshots (seconds)

      Default value:

      600

    • allow_writing_files

      If this flag is set to FALSE, no files with diagnostic information are created during training. Snapshotting is then impossible, and visualization does not work because it relies on files that are created and updated during training.

      Default value:

      TRUE
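
The snapshot options and allow_writing_files interact, so it can help to set them together. The following params list is a sketch with illustrative values (the iteration count, file name, and interval are assumptions), followed by a simple consistency check written in base R.

```r
# Snapshot-related options collected in one params list.
fit_params <- list(
  iterations = 1000,
  save_snapshot = TRUE,
  snapshot_file = 'experiment.cbsnapshot',
  snapshot_interval = 300,     # seconds between snapshots
  allow_writing_files = TRUE   # snapshots need file output enabled
)

# Guard against requesting snapshots while file output is disabled.
snapshot_conflict <- isTRUE(fit_params$save_snapshot) &&
  !isTRUE(fit_params$allow_writing_files)
stopifnot(!snapshot_conflict)
```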

    • approx_on_full_history

      If this flag is set to TRUE, each approximated value is calculated using all the preceding rows in the fold (slower, more accurate). If this flag is set to FALSE, each approximated value is calculated using only the beginning 1/fold_len_multiplier fraction of the fold (faster, slightly less accurate).

      Default value:

      FALSE

    • boosting_type

      Boosting scheme.

      Possible values:

      • 'Ordered': gives better quality, but may slow down the training.

      • 'Plain': the classic gradient boosting scheme. May result in quality degradation, but does not slow down the training.

      Default value:

      Depends on object count and feature count in train dataset and on learning mode.

    • dev_score_calc_obj_block_size

      CPU only. Size of the block of samples in score calculation. Should be > 0. Used only for learning speed tuning. Changing this parameter can affect results in pairwise scoring mode due to numerical accuracy differences.

      Default value:

      5000000

    • dev_efb_max_buckets

      CPU only. Maximum bucket count in an exclusive features bundle. Should be an integer between 0 and 65536. Used only for learning speed tuning.

      Default value:

      1024

    • sparse_features_conflict_fraction

      CPU only. Maximum allowed fraction of conflicting non-default values for features in exclusive features bundle. Should be a real value in [0, 1) interval.

      Default value:

      0.0

    • leaf_estimation_backtracking

      Type of backtracking during gradient descent.

      Possible values:

      • 'No': never backtrack; supported on CPU and GPU.

      • 'AnyImprovement': reduce the descent step until the value of the loss function is less than before the step; supported on CPU and GPU.

      • 'Armijo': reduce the descent step until the Armijo condition is satisfied; supported on GPU only.

      Default value:

      'AnyImprovement'

Usage

catboost.train(learn_pool, test_pool = NULL, params = list())

Arguments

learn_pool

The dataset used for training the model.

Default value: Required argument

test_pool

The dataset used for testing the quality of the model.

Default value: NULL (not used)

params

The list of parameters to start training with.

If omitted, default values are used (see The list of parameters).

If set, the passed list of parameters overrides the default values.

Default value: list() (default values are used for all parameters)
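
The override semantics can be illustrated with base R alone. The sketch below merges a user list into a small list of defaults quoted from the parameter list; it only demonstrates the described behavior and is not how catboost.train is implemented internally.

```r
# A few defaults quoted from the parameter list (not the complete set).
default_params <- list(l2_leaf_reg = 3, random_strength = 1, bagging_temperature = 1)

# User-supplied parameters, as they would be passed in `params`.
user_params <- list(l2_leaf_reg = 3.5, depth = 5)

# Passed values win; defaults that were not mentioned are kept.
effective_params <- utils::modifyList(default_params, user_params)
```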

Value

Model object.

See Also

https://catboost.ai/docs/concepts/r-reference_catboost-train.html

Examples

## Not run: 
train_pool_path <- system.file("extdata", "adult_train.1000", package = "catboost")
test_pool_path <- system.file("extdata", "adult_test.1000", package = "catboost")
cd_path <- system.file("extdata", "adult.cd", package = "catboost")
train_pool <- catboost.load_pool(train_pool_path, column_description = cd_path)
test_pool <- catboost.load_pool(test_pool_path, column_description = cd_path)
fit_params <- list(
    iterations = 100,
    loss_function = 'Logloss',
    ignored_features = c(4, 9),
    border_count = 32,
    depth = 5,
    learning_rate = 0.03,
    l2_leaf_reg = 3.5,
    train_dir = 'train_dir')
model <- catboost.train(train_pool, test_pool, fit_params)

## End(Not run)

Apply the model with several virtual ensembles

Description

Apply the model to the given dataset using several independent truncated models (virtual ensembles). Each tree in the ensemble predicts its own value for each document from the pool.

The return value depends on prediction_type: an array for 'VirtEnsembles' and a matrix for 'TotalUncertainty'.

Usage

catboost.virtual_ensembles_predict(
  model,
  pool,
  verbose = FALSE,
  prediction_type = "VirtEnsembles",
  ntree_end = 0L,
  virtual_ensembles_count = 10,
  thread_count = -1
)

Arguments

model

The model obtained as the result of training.

Default value: Required argument

pool

The input dataset.

Default value: Required argument

verbose

Verbose output to stdout.

Default value: FALSE (not used)

prediction_type

The format for displaying approximated values in output data (see https://catboost.ai/docs/concepts/python-reference_virtual_ensembles_predict.html#python-reference_catboostclassifier_predict__output-format).

Possible values:

  • 'VirtEnsembles'

  • 'TotalUncertainty'

Default value: 'VirtEnsembles'

ntree_end

Index of the first tree not to be used when applying the model or calculating the metrics (zero-based indexing).

Default value: 0 (the index of the last tree to use equals the number of trees in the model minus one)

virtual_ensembles_count

Number of tree ensembles to use. Each virtual ensemble can be considered a truncated model.

Default value: 10

thread_count

The number of threads to use when applying the model. If -1, then the number of threads is set to the number of CPU cores.

Allows you to optimize the speed of execution. This parameter doesn't affect results.

Default value: -1

Value

Matrix or array of predictions (for the 'TotalUncertainty' and 'VirtEnsembles' prediction_type, respectively).

See Also

https://catboost.ai/docs/concepts/python-reference_virtual_ensembles_predict.html?lang=en
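
Examples

A minimal usage sketch, assuming the catboost package and its bundled adult sample files are installed (the same files as in the catboost.train example). The iteration count and the number of virtual ensembles are arbitrary illustrative values, and the block is skipped when the package is unavailable.

```r
result <- NULL
if (requireNamespace("catboost", quietly = TRUE)) {
  library(catboost)
  # Sample data shipped with the package, as in the catboost.train example.
  pool_path <- system.file("extdata", "adult_train.1000", package = "catboost")
  cd_path <- system.file("extdata", "adult.cd", package = "catboost")
  pool <- catboost.load_pool(pool_path, column_description = cd_path)

  model <- catboost.train(pool, NULL, list(iterations = 100,
                                           loss_function = 'Logloss',
                                           logging_level = 'Silent',
                                           allow_writing_files = FALSE))

  # Per the Value section, 'TotalUncertainty' yields a matrix.
  result <- catboost.virtual_ensembles_predict(
    model, pool,
    prediction_type = 'TotalUncertainty',
    virtual_ensembles_count = 5
  )
}
```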


Dimensions of catboost.Pool

Description

Get dimensions of a Pool.

Usage

## S3 method for class 'catboost.Pool'
dim(x)

Arguments

x

The input dataset.

Default value: Required argument

Value

A vector containing the number of rows and the number of columns in a catboost.Pool.


Dimension names of catboost.Pool

Description

Get dimension names of a Pool.

Usage

## S3 method for class 'catboost.Pool'
dimnames(x)

Arguments

x

The input dataset.

Default value: Required argument

Value

A list with two elements. The second element contains the column names.
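
Examples

A short sketch of inspecting a pool with dim and dimnames, assuming the catboost package is installed. The data frame and labels are hypothetical, and the block is skipped when the package is unavailable.

```r
pool_dim <- NULL
if (requireNamespace("catboost", quietly = TRUE)) {
  library(catboost)
  # Small hypothetical dataset: two numerical features and a binary label.
  features <- data.frame(x1 = c(0.1, 0.4, 0.7, 0.9), x2 = c(1, 0, 1, 0))
  pool <- catboost.load_pool(data = features, label = c(0, 1, 1, 0))

  pool_dim <- dim(pool)         # number of rows, number of columns
  pool_names <- dimnames(pool)  # list; the second element holds the column names
}
```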


Head of catboost.Pool

Description

Return a list with the first n objects of the dataset.

Each line of this list contains the following information for each object:

  • The label value.

  • The weight value.

  • The feature values.

Usage

## S3 method for class 'catboost.Pool'
head(x, n = 10, ...)

Arguments

x

The input dataset.

Default value: Required argument

n

The quantity of the first objects in the dataset to be returned.

Default value: 10

...

not currently used

Value

A matrix containing the first n objects of the dataset.


Print basic information about model

Description

Displays the most general characteristics of a CatBoost model.

Usage

## S3 method for class 'catboost.Model'
print(x, ...)

Arguments

x

The model obtained as the result of training.

...

Not used

Value

The same model that was passed as input.


Print catboost.Pool

Description

Print dimensions of catboost.Pool.

Usage

## S3 method for class 'catboost.Pool'
print(x, ...)

Arguments

x

a catboost.Pool object

Default value: Required argument

...

not currently used

Value

Nothing. This method prints pool dimensions.


Print basic information about model

Description

Displays the most general characteristics of a CatBoost model (same as 'print').

Usage

## S3 method for class 'catboost.Model'
summary(object, ...)

Arguments

object

The model obtained as the result of training.

...

Not used

Value

The same model that was passed as input.


Tail of catboost.Pool

Description

Return a list with the last n objects of the dataset.

Each line of this list contains the following information for each object:

  • The target value.

  • The weight value.

  • The feature values.

Usage

## S3 method for class 'catboost.Pool'
tail(x, n = 10, ...)

Arguments

x

The input dataset.

Default value: Required argument

n

The quantity of the last objects in the dataset to be returned.

Default value: 10

...

not currently used

Value

A matrix containing the last n objects of the dataset.