CatBoost in Python and R

Blog: https://dataxujing.github.io/

https://github.com/DataXujing

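CatBoost (Categorical Boosting) is a gradient boosting algorithm in the same family as XGBoost and LightGBM. Its main algorithmic innovations are twofold: first, ordered target statistics (ordered TS) for handling categorical feature values; second, two training modes, Ordered and Plain. The original post gives the pseudocode in a figure, which is not reproduced here.

Ordered boosting addresses the prediction shift that commonly arises in gradient boosting.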
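To make the idea of ordered target statistics concrete, here is a minimal pure-Python sketch. It is not CatBoost's implementation: the function name, the smoothing weight `a`, and the toy data are assumptions made for illustration. Each row is encoded from the labels of the rows that precede it in a random permutation, so a row's own label never enters its encoding.

```python
import random


def ordered_target_statistics(categories, targets, prior, a=1.0,
                              permutation=None, seed=0):
    """Encode one categorical column with ordered target statistics.

    Row k is encoded as (sum of earlier targets with the same category
    + a * prior) / (count of earlier rows with the same category + a),
    where "earlier" refers to the order given by the permutation.
    """
    n_rows = len(categories)
    if permutation is None:
        permutation = list(range(n_rows))
        random.Random(seed).shuffle(permutation)

    sums = {}    # running sum of targets per category
    counts = {}  # running number of rows seen per category
    encoded = [0.0] * n_rows
    for idx in permutation:
        cat = categories[idx]
        s = sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        encoded[idx] = (s + a * prior) / (n + a)
        # Only now does this row's label become visible to later rows.
        sums[cat] = s + targets[idx]
        counts[cat] = n + 1
    return encoded


# Hypothetical toy column and binary labels.
cities = ["A", "B", "A", "A", "B"]
labels = [1, 0, 1, 0, 1]
prior = sum(labels) / len(labels)
print(ordered_target_statistics(cities, labels, prior))
```

The first row of each category in the permutation receives the prior itself, which is the price paid for avoiding target leakage.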

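CatBoost can currently be called and trained from Python, R, and the command line. It supports GPU training and offers rich visualization of the training process through Jupyter Notebook, CatBoost Viewer, and TensorBoard. The documentation is extensive and the library is easy to use.

This article uses the public Titanic dataset from Kaggle to walk through training CatBoost models in Python and R.

Implementing CatBoost in Python

1. Load the data: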

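```python
from catboost.datasets import titanic
import numpy as np
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier, Pool, cv
from sklearn.metrics import accuracy_score

train_df, test_df = titanic()

# Added: CatBoost cannot handle NaN in categorical columns, so fill missing values first
train_df.fillna(-999, inplace=True)
test_df.fillna(-999, inplace=True)

X = train_df.drop('Survived', axis=1)
y = train_df.Survived

# Added: the later steps use this variable; assumes every non-float column is categorical
categorical_features_indices = np.where(X.dtypes != float)[0]

# Split the data
X_train, X_validation, y_train, y_validation = train_test_split(
    X, y, train_size=0.75, random_state=42)

X_test = test_df
```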

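Here we use the DataFrame structure directly. CatBoost accepts NumPy arrays and pandas DataFrames, and it also provides its own Pool data structure. When speed and memory usage matter, the official recommendation is to use Pool. This article uses DataFrames as the example.

2. Tune parameters with hyperopt: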

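```python
import hyperopt
from numpy.random import RandomState

# hyperopt minimizes the objective, so return 1 - accuracy
def hyperopt_objective(params):
    model = CatBoostClassifier(
        l2_leaf_reg=int(params['l2_leaf_reg']),
        learning_rate=params['learning_rate'],
        iterations=500,
        loss_function='Logloss',  # added: recent CatBoost versions may require this for cv()
        eval_metric='Accuracy',
        random_seed=42,
        logging_level='Silent',
    )
    cv_data = cv(
        Pool(X, y, cat_features=categorical_features_indices),
        model.get_params(),
    )
    best_accuracy = np.max(cv_data['test-Accuracy-mean'])
    return 1 - best_accuracy

# Parameters to optimize.
# The learning_rate bounds are truncated in the source after "1e";
# the range 1e-3 to 5e-1 is assumed here.
params_space = {
    'l2_leaf_reg': hyperopt.hp.qloguniform('l2_leaf_reg', 0, 2, 1),
    'learning_rate': hyperopt.hp.uniform('learning_rate', 1e-3, 5e-1),
}

trials = hyperopt.Trials()

# Parameter search.
# Some newer hyperopt releases expect np.random.default_rng(123) for rstate.
best = hyperopt.fmin(
    hyperopt_objective,
    space=params_space,
    algo=hyperopt.tpe.suggest,
    max_evals=50,
    trials=trials,
    rstate=RandomState(123),
)

# Print the best parameter combination, which is then used for cross-validation
print(best)
```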
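The search space above uses `hp.qloguniform('l2_leaf_reg', 0, 2, 1)`, which hyperopt documents as `round(exp(uniform(low, high)) / q) * q`. The following standard-library sketch mimics that formula to show which values the search can propose; the helper name `sample_qloguniform` is hypothetical and hyperopt itself is not used.

```python
import math
import random


def sample_qloguniform(rng, low, high, q):
    """Draw round(exp(uniform(low, high)) / q) * q."""
    return round(math.exp(rng.uniform(low, high)) / q) * q


rng = random.Random(123)
draws = [sample_qloguniform(rng, 0, 2, 1) for _ in range(1000)]

# exp(0) = 1 and exp(2) is about 7.39, so every draw is an integer from 1 to 7,
# with small values more likely than large ones.
print(sorted(set(draws)))
print({value: draws.count(value) for value in sorted(set(draws))})
```

hyperopt returns such quantized values as floats, which is presumably why the objective casts `params['l2_leaf_reg']` with `int()`.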

Cross-validation with the best parameters:

```python
model = CatBoostClassifier(
    l2_leaf_reg=int(best['l2_leaf_reg']),
    learning_rate=best['learning_rate'],
    iterations=500,
    loss_function='Logloss',  # added: recent CatBoost versions may require this for cv()
    eval_metric='Accuracy',
    random_seed=42,
    logging_level='Silent',
)
cv_data = cv(Pool(X, y, cat_features=categorical_features_indices), model.get_params())

model.fit(X, y, cat_features=categorical_features_indices)
```

Grid search or random search can also be used. The GridSearchCV workflow is given below for reference:

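```python
from catboost import CatBoostClassifier
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

# Labels are passed explicitly; the source read y_train / y_test as globals
def auc(m, X_train, y_train, X_test, y_test):
    return (metrics.roc_auc_score(y_train, m.predict_proba(X_train)[:, 1]),
            metrics.roc_auc_score(y_test, m.predict_proba(X_test)[:, 1]))

params = {'depth': [4, 7, 10],
          'learning_rate': [0.03, 0.1, 0.15],
          'l2_leaf_reg': [1, 4, 9],
          'iterations': [300]}
cb = CatBoostClassifier()
# Named grid_search here so that the fitted `model` from the previous step is not overwritten
grid_search = GridSearchCV(cb, params, scoring="roc_auc", cv=3)
grid_search.fit(X_train, y_train, cat_features=categorical_features_indices)
```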
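Before launching a grid search it helps to know how many models will be trained. The sketch below expands the same parameter grid with the standard library only; `expand_grid` is a hypothetical helper and not part of scikit-learn or CatBoost.

```python
from itertools import product

param_grid = {
    "depth": [4, 7, 10],
    "learning_rate": [0.03, 0.1, 0.15],
    "l2_leaf_reg": [1, 4, 9],
    "iterations": [300],
}


def expand_grid(grid):
    """Return every combination of the grid as a list of dicts."""
    names = sorted(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


candidates = expand_grid(param_grid)
n_folds = 3
print(len(candidates), "candidates,",
      len(candidates) * n_folds, "cross-validation fits")
```

With 300 iterations per fit and trees up to depth 10, the cost grows quickly. GridSearchCV also refits the best candidate on the full training set by default, so this is one reason to consider random search or hyperopt for larger grids.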

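Many other training parameters can be tuned, mainly those that trade off training speed against accuracy. There are also parameters for visualization and for GPU training. See the official CatBoost documentation for details.

3. Feature importance and prediction: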

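```python
# Print feature importances (get_feature_importance expects a Pool)
train_pool = Pool(X_train, y_train, cat_features=categorical_features_indices)
feature_importances = model.get_feature_importance(train_pool)
feature_names = X_train.columns
for score, name in sorted(zip(feature_importances, feature_names), reverse=True):
    print('{}: {}'.format(name, score))

# Prediction: three forms of model output
print(model.predict_proba(X_validation))
print(model.predict(X_validation))
raw_pred = model.predict(X_validation, prediction_type='RawFormulaVal')

import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Applying the sigmoid to the raw scores gives the positive-class probabilities
probabilities = [sigmoid(x) for x in raw_pred]
print(np.array(probabilities))
```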
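The last step above turns raw scores into probabilities with a sigmoid. The plain form `1 / (1 + math.exp(-x))` raises `OverflowError` for strongly negative scores, so the sketch below uses a numerically stable variant. The raw scores are made-up values, not CatBoost output, and `to_class` with its 0.5 threshold is an assumption for illustration.

```python
import math


def sigmoid(raw):
    """Logistic function that avoids overflow for large negative inputs."""
    if raw >= 0:
        return 1.0 / (1.0 + math.exp(-raw))
    z = math.exp(raw)  # safe: raw < 0, so exp(raw) is at most 1
    return z / (1.0 + z)


def to_class(raw, threshold=0.5):
    """Map a raw score to a 0/1 class label via its probability."""
    return int(sigmoid(raw) > threshold)


raw_scores = [-800.0, -1.2, 0.0, 2.5]  # hypothetical raw scores
for raw in raw_scores:
    print(raw, sigmoid(raw), to_class(raw))
```

A raw score of 0 corresponds to a probability of exactly 0.5, so thresholding the probability at 0.5 is the same as checking the sign of the raw score.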

4. Model persistence:

```python
# Save the model (the file extension can be changed)
model.save_model('catboost_model.bin')

# Load the model into a fresh classifier object
my_best_model = CatBoostClassifier()
my_best_model.load_model('catboost_model.bin')
print(my_best_model.get_params())
print(my_best_model.random_seed_)
print(my_best_model.learning_rate_)
```

Implementing CatBoost in R

1. Build the Pool data

A. Read from files:

```R
library(catboost)
library(caret)
library(titanic)

pool_path <- system.file("extdata", "adult_train.1000", package = "catboost")
column_description_path <- system.file("extdata", "adult.cd", package = "catboost")
pool <- catboost.load_pool(pool_path, column_description = column_description_path)

head(pool, 1)
```

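Two files are involved: the feature data file (pool_path) and the column description file (column_description_path), which describes the role of each column. Three column types are mainly used:

Target (the label), Categ (a categorical feature), and Num (a numeric feature, the default type). Note that the column indices here start from 0, as in Python.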
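A column description file is a small tab-separated text file with one `index<TAB>type` line per described column. The sketch below writes one with the Python standard library. The helper name and the column indices are hypothetical, and the type names follow the ones listed above.

```python
import csv
import io


def write_column_description(stream, target_index, categorical_indices):
    """Write zero-based `index<TAB>type` lines; unlisted columns stay numeric."""
    writer = csv.writer(stream, delimiter="\t", lineterminator="\n")
    writer.writerow([target_index, "Target"])
    for index in sorted(categorical_indices):
        writer.writerow([index, "Categ"])


# Hypothetical layout: label in column 0, categorical features in columns 2, 4 and 6.
buffer = io.StringIO()
write_column_description(buffer, target_index=0, categorical_indices=[2, 4, 6])
print(buffer.getvalue())
```

Because Num is the default type, only the label and the categorical columns need to be listed.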

B. Build from a matrix:

```R
pool_path = system.file("extdata", "adult_train.1000", package="catboost")

column_description_vector = rep('numeric', 15)
cat_features <- c(3, 5, 7, 8, 9, 10, 11, 15)
for (i in cat_features)
  column_description_vector[i] <- 'factor'

data <- read.table(pool_path, head = F, sep = "\t",
                   colClasses = column_description_vector, na.strings = 'NAN')

# Convert categorical features to numeric codes
for (i in cat_features)
  data[, i] <- as.numeric(factor(data[, i]))

target <- c(1)
data_matrix <- as.matrix(data)
# cat_features - 2: one position for the dropped target column, one for zero-based indexing
# (the source passes cat_features unchanged)
pool <- catboost.load_pool(as.matrix(data[, -target]),
                           label = as.matrix(data[, target]),
                           cat_features = cat_features - 2)
head(pool, 1)
```

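Note that a matrix can hold only numeric data, so the data must be converted to numbers before it goes into the matrix, and the indices of the categorical feature columns must then be specified when building the Pool.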
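The conversion step (`as.numeric(factor(...))` in the R code) replaces each distinct category with an integer code. A minimal Python sketch of the same idea follows. The helper name and the sample values are hypothetical, and plain string sorting is assumed for the level order, which may differ from R's locale-dependent ordering.

```python
def encode_as_codes(values):
    """Map each distinct value to a 1-based integer code, in sorted order."""
    levels = sorted(set(values))
    code_of = {level: code for code, level in enumerate(levels, start=1)}
    return [code_of[value] for value in values], levels


# Hypothetical categorical column.
workclass = ["Private", "State-gov", "Private", "Self-emp", "State-gov"]
codes, levels = encode_as_codes(workclass)
print(levels)
print(codes)
```

The codes carry no numeric meaning of their own, which is why the categorical column indices still have to be declared when the Pool is built. Otherwise the model would treat them as ordinary numbers.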

C. Build from a data frame:

```R
train_path = system.file("extdata", "adult_train.1000", package="catboost")
test_path = system.file("extdata", "adult_test.1000", package="catboost")

column_description_vector = rep('numeric', 15)
cat_features <- c(3, 5, 7, 8, 9, 10, 11, 15)
for (i in cat_features)
  column_description_vector[i] <- 'factor'

train <- read.table(train_path, head = F, sep = "\t",
                    colClasses = column_description_vector, na.strings = 'NAN')
test <- read.table(test_path, head = F, sep = "\t",
                   colClasses = column_description_vector, na.strings = 'NAN')

target <- c(1)
train_pool <- catboost.load_pool(data = train[, -target], label = train[, target])
test_pool <- catboost.load_pool(data = test[, -target], label = test[, target])
head(train_pool, 1)
head(test_pool, 1)
```

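Note that categorical variables must be converted to factors, numeric variables must be stored as numeric, and the label should also be numeric.

2. Train the model and predict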

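```R
fit_params <- list(iterations = 100,
                   thread_count = 10,
                   loss_function = 'Logloss',
                   ignored_features = c(4, 9),
                   border_count = 32,
                   depth = 5,
                   learning_rate = 0.03,
                   l2_leaf_reg = 3.5,
                   train_dir = 'train_dir',
                   logging_level = 'Silent')
model <- catboost.train(train_pool, test_pool, fit_params)

# More parameters are available; see the help for catboost.train

# Accuracy helper. The body is missing in the source; this reconstruction
# assumes the larger label value marks the positive class.
calc_accuracy <- function(prediction, expected) {
  predicted_positive <- prediction > 0.5
  expected_positive <- expected == max(expected)
  sum(predicted_positive == expected_positive) / length(expected)
}

# Probability prediction (call reconstructed; the source kept only the comment)
prediction <- catboost.predict(model, test_pool, prediction_type = 'Probability')

# Class prediction (call reconstructed; the source kept only the comment)
labels <- catboost.predict(model, test_pool, prediction_type = 'Class')

# Only meaningful for Logloss, where predictions are probabilities
accuracy <- calc_accuracy(prediction, test[, target])
cat("\nAccuracy: ", accuracy, "\n")

# Feature importance
cat("\nFeature importances", "\n")
catboost.get_feature_importance(model, train_pool)

cat("\nTree count: ", model$tree_count, "\n")
```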
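The accuracy step thresholds the predicted probabilities at 0.5 and compares the result with the true labels, which is why it only makes sense for Logloss. A small Python sketch of that calculation follows. The function signature, the `positive_label` argument, and the sample values are assumptions made for illustration.

```python
def calc_accuracy(probabilities, expected, positive_label=1, threshold=0.5):
    """Share of rows whose thresholded probability matches the true class."""
    predicted_positive = [p > threshold for p in probabilities]
    expected_positive = [label == positive_label for label in expected]
    hits = sum(p == e for p, e in zip(predicted_positive, expected_positive))
    return hits / len(expected)


# Hypothetical predicted probabilities and true labels coded as 1 / -1.
probs = [0.91, 0.35, 0.62, 0.08]
truth = [1, -1, -1, -1]
print(calc_accuracy(probs, truth))
```

Passing `positive_label` explicitly keeps the helper independent of whether the negative class is coded as 0 or -1.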

3. Use the caret package

A. Load the data:

```r
set.seed(12345)

data <- as.data.frame(as.matrix(titanic_train), stringsAsFactors = TRUE)

# Fill missing ages with the most frequent age level
age_levels <- levels(data$Age)
most_frequent_age <- which.max(table(data$Age))
data$Age[is.na(data$Age)] <- age_levels[most_frequent_age]

drop_columns = c("PassengerId", "Survived", "Name", "Ticket", "Cabin")
x <- data[, !(names(data) %in% drop_columns)]
y <- data[, c("Survived")]
```
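The data preparation above fills missing ages with the most frequent age level. The same mode imputation can be sketched in a few lines of Python. The helper name and the sample values are hypothetical, and ages are kept as strings to mirror the factor levels in the R code.

```python
from collections import Counter


def fill_with_mode(values, missing=None):
    """Replace missing entries with the most frequent observed value."""
    observed = [v for v in values if v is not missing]
    mode, _count = Counter(observed).most_common(1)[0]
    return [mode if v is missing else v for v in values]


# Hypothetical age column with two missing entries.
ages = ["22", None, "24", "22", None, "35"]
print(fill_with_mode(ages))
```

Mode imputation is simple but concentrates mass on a single value. For a numeric feature such as age, a median or a model-based fill may be a better choice.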

B. Train the model with caret:

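```r
fit_control <- trainControl(method = "cv",
                            number = 5,
                            classProbs = TRUE)

# Grid for cross-validated tuning
grid <- expand.grid(depth = c(4, 6, 8),
                    learning_rate = 0.1,
                    iterations = 100,
                    l2_leaf_reg = 0.1,
                    rsm = 0.95,
                    border_count = 64)

# Train with the catboost.caret method.
# The call itself is missing in the source; this is a reconstruction.
model <- train(x, as.factor(make.names(y)),
               method = catboost.caret,
               logging_level = 'Silent', preProc = NULL,
               tuneGrid = grid, trControl = fit_control)
```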

C. Print the model and feature importance:

```r
print(model)

importance <- varImp(model, scale = FALSE)
print(importance)
```

D. Predict:

```r
pre_prob <- predict(model, type = 'prob')
print(pre_prob)
```

