XGBoost Algorithm

Applying the XGBoost algorithm to the data set from the “Otto Group Product Classification Challenge” on Kaggle.

  1. Install XGBoost package in R
  2. Understanding of Data:
    a. Id: unique ID of a product
    b. feat_1 to feat_93: 93 features given for a product
    c. target: class of a product labelled as “class_1”, “class_2”, …, “class_9” (response variable)
  3. Check for missing values and impute if required. This data set has no missing values (a quick check is sketched right after this list).
  4. Build a model on the train data with target as the response variable and the 93 feature columns as explanatory variables, then use this model to predict the target column in the test data set.
  5. Note that XGBoost can only deal with numeric matrices, so the given data frames need to be converted into numeric matrices before modelling.
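A quick way to verify points 1 and 3 above, once the CSV files have been read into the train and test data frames in the pre-processing step below (a minimal sketch):

#install the package from CRAN if it is not already available
#install.packages("xgboost")

#count missing values in the train and test data frames; both come out to 0 for this data set
sum(is.na(train))
sum(is.na(test))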

Pre-processing of the data sets:

#load the below 4 libraries
require(xgboost)
require(methods)
require(reshape2)
require(caret)

#reading data in required format
featureclass=rep('numeric',93)
colclasstrain=c('integer',featureclass,'character')
colclasstest=c('integer',featureclass)

train=read.csv('C:\\Users\\roma.agrawal\\Downloads\\train.csv',colClasses=colclasstrain)
train1=train
test=read.csv('C:\\Users\\roma.agrawal\\Downloads\\test.csv',colClasses=colclasstest)
test1=test

#segregating the id column from test data set
id=test[,1]
test=test[,-1]

#removing the id column from train data set, as we do not need this
train=train[,-1]

#converting the values in target column of train data set into integer so that it will hold values from 0 to 8
target=train$target
classnames=unique(target)
target=as.integer(colsplit(target,'_',names=c('x1','x2'))[,2])-1
noOfClasses=max(target)+1

#removing the target column from train data set
train=train[,-ncol(train)]

Our train and test data sets are now ready; to pass them to XGBoost we need to convert them to matrix format:

trainMat=data.matrix(train)
testMat=data.matrix(test)

Cross Validation and Model Building:

After preparing the training and test data sets, it’s time to run cross-validation to choose parameters. We will tune only one parameter, i.e. the number of trees (rounds). Cross-validation can be done in two ways:

  • Using xgb.cv()

#creating parameter list for the model
param=list("objective"="multi:softprob","eval_metric"="mlogloss","num_class"=noOfClasses)

As this is a multi-class classification problem, “multi:softprob” is passed as the objective (it returns a predicted probability for each class) and “eval_metric” is set to “mlogloss”, the error measure used to evaluate the model.
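For reference, multi-class logloss is the mean of the negative log of the probability the model assigns to each row’s true class. A minimal sketch of the calculation (the names pred, an n x 9 matrix of class probabilities, and actual, integer labels 0 to 8, are hypothetical):

#multi-class logloss: mean of -log(probability assigned to the true class)
#pred: n x 9 matrix of class probabilities, actual: integer labels 0 to 8 (hypothetical names)
mlogloss=function(pred,actual){
  eps=1e-15
  p=pmin(pmax(pred[cbind(1:nrow(pred),actual+1)],eps),1-eps)
  -mean(log(p))
}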

cv.round=500 #no of trees to build (we will be tuning this parameter)
cv.nfold=5 #how many parts we want to divide the train data into for the cross-validation

#Running the cross validation
bst.cv=xgb.cv(param=param,data=trainMat,label=target,nfold=cv.nfold,nrounds=cv.round)


plot(bst.cv$test.mlogloss.mean,type="l")


nround=which(bst.cv$test.mlogloss.mean==min(bst.cv$test.mlogloss.mean)) #nround comes out to be 173
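An equivalent and slightly safer way to pick the same round is which.min(), which returns only the first index if the minimum is ever tied:

#round with the lowest mean cross-validated mlogloss
nround=which.min(bst.cv$test.mlogloss.mean)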

#training the model
bst=xgboost(data=trainMat,label=target,param=param,nrounds=nround)

#predicting the test dataset
ypred=predict(bst,testMat)

Creation of submission file:

predMat=data.frame(matrix(ypred,ncol=9,byrow=TRUE))
colnames(predMat)=classnames
res=data.frame(id,predMat)
write.csv(res,'C:\\Users\\roma.agrawal\\Downloads\\result.csv',quote=F,row.names=F)
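Before submitting, it is worth a quick sanity check that the reshape lined the probabilities up correctly; with “multi:softprob” each row of probabilities should sum to roughly 1 (a minimal sketch):

#each row of predMat holds the 9 class probabilities and should sum to (approximately) 1
summary(rowSums(predMat))
stopifnot(nrow(predMat)==length(id))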

  • Using xgb.train()

To use this method we need to divide the training set into train and validation sets. As it is a multi-class problem, we will use stratified random sampling (caret’s createDataPartition()) to do the split.

trainIndex=createDataPartition(train1$target, p = .8,list = FALSE,times = 1)
trainStarta=train1[trainIndex,]
testStarta=train1[-trainIndex,]

#the target vector already holds the integer class values 0 to 8; subset it for the train and validation splits
trainStarta_target=target[trainIndex]
trainStarta=trainStarta[,-1]
trainStarta=trainStarta[,-ncol(trainStarta)]

testStarta_target=target[-trainIndex]
testStarta=testStarta[,-1]
testStarta=testStarta[,-ncol(testStarta)]

#xgb.train() requires inputs as xgb.DMatrix objects, so converting both data sets using xgb.DMatrix()
dtrain=xgb.DMatrix(data.matrix(trainStarta), label = data.matrix(trainStarta_target))
dtest=xgb.DMatrix(data.matrix(testStarta), label = data.matrix(testStarta_target))

#creating watchlist with train and validation dataset in matrix format
watchlist=list(eval = dtest, train = dtrain)

#creating parameter list for the model
param=list("objective"="multi:softprob","eval_metric"="mlogloss","num_class"=noOfClasses)

#training the model with early stopping on the validation set
bst=xgb.train(param,dtrain,nrounds=cv.round,watchlist=watchlist,early.stop.round=10)

Training will stop early if the validation (eval) mlogloss has not improved for 10 consecutive iterations, and it reports the iteration number with the minimum mlogloss value.
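In the version of the xgboost R package used here, the early-stopped model stores the best iteration on the returned object (we reuse it below when retraining); bestScore is assumed to be set alongside it in this package version:

bst$bestInd #best iteration number found by early stopping (used as nrounds below)
bst$bestScore #validation mlogloss at that iteration (assuming this field is set by this package version)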

Iteration 174 has the minimum score; the next 10 iterations all have a higher logloss value.

#training the model
bst1=xgboost(data=trainMat,label=target,param=param,nrounds=bst$bestInd)

#predicting the test dataset
ypred=predict(bst1,testMat)

Creation of submission file:
predMat=data.frame(matrix(ypred,ncol=9,byrow=TRUE))
colnames(predMat)=classnames
res=data.frame(id,predMat)
write.csv(res,'C:\\Users\\roma.agrawal\\Downloads\\result1.csv',quote=F,row.names=F)
