Applying the XGBoost algorithm to the data set from the “Otto Group Product Classification Challenge” on Kaggle.

- Install the XGBoost package in R
- Understanding the data:

a. Id: unique ID of a product

b. feat_1 to feat_93: 93 features given for a product

c. target: class of a product, labelled as “class_1”, “class_2”, …, “class_9” (response variable)

- Check for missing values and impute if required. This data set has no missing values (a quick check is sketched after this list).
- We will build a model on the train data, with the target column as the response variable and the 93 feature columns as explanatory variables, and use this model to predict the target column in the test data set.
- We will be using the XGBoost algorithm. One point to note: XGBoost can only deal with numeric matrices, so we need to convert the given data frames into numeric matrices.
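As a quick illustration of the missing-value check mentioned above, here is a minimal sketch (an illustrative snippet, to be run after the read.csv calls in the next section):

#count missing values in both data sets; both sums are 0 for this data set

sum(is.na(train))

sum(is.na(test))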

__Pre-processing of the data sets:__

#load the below 4 libraries

require(xgboost)

require(methods)

require(reshape2)

require(caret)

#reading data in required format

featureclass=rep('numeric',93)

colclasstrain=c('integer',featureclass,'character')

colclasstest=c('integer',featureclass)

train=read.csv('C:\\Users\\roma.agrawal\\Downloads\\train.csv',colClasses=colclasstrain)

train1=train

test=read.csv('C:\\Users\\roma.agrawal\\Downloads\\test.csv',colClasses=colclasstest)

test1=test

#segregating the id column from the test data set

id=test[,1]

test=test[,-1]

#removing the id column from the train data set, as we do not need it

train=train[,-1]

#converting the values in the target column of the train data set into integers from 0 to 8

target=train$target

classnames=unique(target)

target=as.integer(colsplit(target,'_',names=c('x1','x2'))[,2])-1

noOfClasses=max(target)+1
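As a quick sanity check on the encoding (an illustrative snippet; note that classnames=unique(target) captures the labels in order of appearance, which for the Otto train file is already class_1 to class_9):

table(target) #row counts per encoded class 0 to 8

classnames #must line up with the encoded classes 0 to 8, as these become the submission column names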

#removing the target column from train data set

train=train[,-ncol(train)]

Our train and test data sets are now ready; to pass them to XGBoost we need to convert them to matrix format:

trainMat=data.matrix(train)

testMat=data.matrix(test)

**Cross Validation and Model Building:**

After preparing the training and test data sets, it is time to run cross-validation to choose parameters. We will tune only one parameter, the **number of trees**. Cross-validation can be done in two ways:

**Using xgb.cv()**

#creating parameter list for the model

param=list("objective"="multi:softprob","eval_metric"="mlogloss","num_class"=noOfClasses)

As this is a multi-class classification problem, "multi:softprob" is passed as the objective (it outputs a probability for every class), and "eval_metric" indicates how the error of the model is measured, here the multi-class log loss.
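For reference, the multi-class log loss behind "mlogloss" can be sketched in a few lines of R (an illustrative implementation, assuming probMat is a matrix with one row of class probabilities per observation and y holds the 0-based labels):

mlogloss=function(probMat,y){

eps=1e-15

p=pmax(pmin(probMat,1-eps),eps) #clip probabilities away from 0 and 1

-mean(log(p[cbind(seq_along(y),y+1)])) #mean negative log probability of the true class

}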

cv.round=500 #number of trees to build (we will be tuning this parameter)

cv.nfold=5 #number of folds to divide the train data into for cross-validation

#Running the cross validation

bst.cv=xgb.cv(params=param,data=trainMat,label=target,nfold=cv.nfold,nrounds=cv.round)

plot(bst.cv$test.mlogloss.mean,type="l")

nround=which(bst.cv$test.mlogloss.mean==min(bst.cv$test.mlogloss.mean)) #nround comes out to be 173

#training the model

bst=xgboost(data=trainMat,label=target,params=param,nrounds=nround)

#predicting the test dataset

ypred=predict(bst,testMat)

**Creation of submission file:**
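With the "multi:softprob" objective, predict() returns one long vector containing the 9 class probabilities for each observation stored consecutively, which is why we reshape it row-wise below. A quick shape check (an illustrative one-liner):

length(ypred)/9 #equals nrow(testMat)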

predMat=data.frame(matrix(ypred,ncol=9,byrow=TRUE))

colnames(predMat)=classnames

res=data.frame(id,predMat)

write.csv(res,'C:\\Users\\roma.agrawal\\Downloads\\result.csv',quote=F,row.names=F)

__Using xgb.train()__

To use this method, we need to divide the training set into a train set and a validation set. As this is a multi-class problem, we will use the **stratified random sampling** method (which preserves the class proportions in each split) to divide the training set.

trainIndex=createDataPartition(train1$target, p = .8,list = FALSE,times = 1)

trainStarta=train1[trainIndex,]

testStarta=train1[-trainIndex,]

#splitting the already-encoded target vector and dropping the id and target columns from each subset

trainStarta_target=target[trainIndex]

trainStarta=trainStarta[,-1]

trainStarta=trainStarta[,-ncol(trainStarta)]

testStarta_target=target[-trainIndex]

testStarta=testStarta[,-1]

testStarta=testStarta[,-ncol(testStarta)]

#xgb.train() requires inputs as xgb.DMatrix objects, so converting both data sets using xgb.DMatrix()

dtrain=xgb.DMatrix(data.matrix(trainStarta), label = data.matrix(trainStarta_target))

dtest=xgb.DMatrix(data.matrix(testStarta), label = data.matrix(testStarta_target))

#creating watchlist with train and validation dataset in matrix format

watchlist=list(eval = dtest, train = dtrain)

#creating parameter list for the model

param=list("objective"="multi:softprob","eval_metric"="mlogloss","num_class"=noOfClasses)

#training with early stopping on the validation set

bst=xgb.train(param,dtrain,nrounds=cv.round,watchlist=watchlist,early.stop.round=10)

Training will stop if the validation mlogloss has not improved for 10 consecutive iterations, and it reports the iteration with the minimum mlogloss value.

Iteration 174 has the minimum score; the next 10 iterations all have a higher logloss than this.
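In case of early stopping, the returned bst object carries two extra attributes, "bestScore" and "bestInd" (see the full code of xgb.train in the package source: https://github.com/dmlc/xgboost/blob/master/R-package/R/xgb.train.R):

bst$bestInd #iteration with the minimum validation mlogloss

bst$bestScore #the corresponding mlogloss value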

#training the final model on the full train data using the best iteration count

bst1=xgboost(data=trainMat,label=target,params=param,nrounds=bst$bestInd)

#predicting the test dataset

ypred=predict(bst1,testMat)

**Creation of submission file:**

predMat=data.frame(matrix(ypred,ncol=9,byrow=TRUE))

colnames(predMat)=classnames

res=data.frame(id,predMat)

write.csv(res,'C:\\Users\\roma.agrawal\\Downloads\\result1.csv',quote=F,row.names=F)
