What is XGBoost - and how can it be used in Precision Medicine?


Author: Eoghan Conlon, Data Scientist

Updated: 12/02/2024

Eoghan is a seasoned data scientist who embarked on their career journey at Sonrai Analytics immediately after graduating with a degree in Computer Science. Over the past three years, they have honed their expertise in data visualisation, data analysis, and machine learning web applications, making significant contributions to numerous projects aimed at leveraging data for actionable insights. Eoghan's keen interest in using data to drive insights and innovation led them to explore the application of XGBoost in precision medicine, resulting in their debut article on the subject.

Introduction

XGBoost, which stands for extreme gradient boosting, is a powerful and widely used machine learning algorithm with potential across a spectrum of applications in precision medicine. This article is aimed at medical researchers, bioinformaticians and data scientists. It offers a glimpse into the algorithm's efficacy, particularly in the Precision Medicine arena, along with an in-depth technical overview of how the algorithm works.

Utility of XGBoost in Precision Medicine

In 2022, the mention of XGBoost skyrocketed in peer-reviewed articles on PubMed, with over 1,200 references, marking a significant uptick from the mere 134 references in 2019. Below, we've outlined some pivotal reasons for employing XGBoost.

Improved Predictive Performance: XGBoost has demonstrated superior predictive performance compared to other models like logistic regression in medical studies, which is crucial for making accurate diagnoses or risk assessments in precision medicine [1].

Identifying Important Features: It has the capability to identify important genes and features, which can be instrumental in understanding the underlying aetiology of disease or the mechanism of action of therapeutic compounds.

Interpretability: XGBoost provides a level of interpretability which is essential for clinicians to understand the model's decisions. This feature supports the integration of machine learning into medical decision-making processes.

What is XGBoost?

XGBoost is a decision tree-based ensemble learning algorithm. It is a supervised, classical machine learning model as opposed to a deep learning model. It was introduced by Tianqi Chen and Carlos Guestrin in a 2016 paper [1] and has since become a leading algorithm for building classification and regression models on tabular and structured data. XGBoost is available as open source software across a wide range of programming languages, including Python, R, Java, Scala and Julia. It builds on earlier tree-based algorithms such as decision trees, random forests and gradient boosting.

Regression vs Classification

XGBoost can build models for regression, binary classification and multi-class classification problems. The type of values in the target feature determines which to use. If the values being predicted are continuous (e.g. the number of mutations), XGBoost regression should be used; if they are discrete classes (e.g. responder vs non-responder), a classification model should be used.

When Should You Use XGBoost?

XGBoost has many benefits over other regression and classification algorithms. These benefits can be seen in machine learning challenges and Kaggle competitions [2], where XGBoost models regularly dominate.

The XGBoost algorithm can handle sparse data, where values are missing in some features but not in the target feature to be predicted, although it generally doesn’t perform as well on sparse data as on non-sparse data. No feature normalisation is needed before using XGBoost.

XGBoost shouldn’t be used with small datasets, as it can overfit to the training data. It performs well when data has a mixture of numerical and categorical features, or just numeric features.

Consider using XGBoost for any supervised machine learning task when the following criteria are met:

When you have a large number of observations in the training data

When the number of features is less than the number of observations in the training data. That said, there is evidence to suggest that XGBoost performs better than other classical machine learning algorithms when the number of features is significantly greater than the number of observations, which makes it well suited to Precision Medicine data modalities.

When you have large, structured, tabular data

When model interpretability is a consideration

XGBoost is a member of the tree-based algorithms family and is built up using simpler methods and techniques also found in the tree-based algorithms family. Understanding these simpler algorithms and techniques is important for understanding how XGBoost works under the hood.

Decision Tree

A decision tree is an algorithm that uses a tree-like structure to split training data based on feature values. Each internal node in the tree represents a decision based on a particular feature value, and the leaf node a sample ends up in determines the predicted class.
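As a rough illustration of how a prediction walks down a tree, here is a minimal sketch. The tree, its feature names (`mutation_count`, `age`), thresholds and class labels are all hypothetical and chosen only for illustration; a real tree would be learned from training data.

```python
# Hypothetical, hand-written tree: features, thresholds and classes are illustrative only.
tree = {
    "feature": "mutation_count",
    "threshold": 10,
    "left": {"leaf": "non-responder"},      # taken when mutation_count <= 10
    "right": {                              # taken when mutation_count > 10
        "feature": "age",
        "threshold": 60,
        "left": {"leaf": "responder"},      # age <= 60
        "right": {"leaf": "non-responder"}, # age > 60
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, making one feature-based decision per node."""
    while "leaf" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]


print(predict(tree, {"mutation_count": 25, "age": 48}))
```

Each pass through the loop corresponds to one decision node; the loop ends when a leaf is reached and its class is returned.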

Bagging

Bootstrap aggregating (bagging) is an ensemble learning technique that trains multiple decision trees, each on a random sample of the training data drawn with replacement, and combines their predictions through a voting mechanism. Bagging decreases the variance of the model and stops any individual tree from ruining the whole model.
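The idea can be sketched in a few lines of plain Python. This is a toy under stated assumptions: the six-point dataset is hypothetical, and `fit_stump` is a deliberately crude one-split base model standing in for a full decision tree.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bootstrap sample)."""
    return [rng.choice(rows) for _ in rows]


def majority(labels, default):
    """Most common label, or a default when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def fit_stump(rows):
    """Toy base model: split at the mean of x, predict the majority label on each side."""
    threshold = sum(x for x, _ in rows) / len(rows)
    overall = majority([y for _, y in rows], None)
    left = majority([y for x, y in rows if x <= threshold], overall)
    right = majority([y for x, y in rows if x > threshold], overall)
    return lambda x: left if x <= threshold else right


def bagging_predict(models, x):
    """Combine the individual predictions by majority vote."""
    return Counter(model(x) for model in models).most_common(1)[0][0]


rng = random.Random(0)
data = [(1, "A"), (2, "A"), (3, "A"), (7, "B"), (8, "B"), (9, "B")]  # hypothetical
models = [fit_stump(bootstrap_sample(data, rng)) for _ in range(25)]
print(bagging_predict(models, 2), bagging_predict(models, 8))
```

Some individual stumps see an unlucky sample and get a point wrong, but the vote across 25 of them smooths those mistakes out.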

Random Forest

Random forest is a bagging-based algorithm which creates multiple decision trees (a forest), each using a randomly selected subset of features from the training set. The model makes a final prediction based on the predictions of all the trees. For classification, this is a majority vote; for regression, the final prediction is the mean of all the individual tree predictions.

Random forest uses the fact that many relatively uncorrelated models (trees) operating as a committee will outperform any of the respective constituent models. This means that while some trees may be wrong, the majority will be correct so that the final model will perform very well.
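The committee effect can be made concrete with a small calculation. The sketch below assumes an idealised case that real forests only approximate: every tree is correct with the same probability and the trees' errors are fully independent. The function name and the 0.6 accuracy figure are illustrative.

```python
from math import comb


def majority_accuracy(n_trees, p_correct):
    """Probability that more than half of n independent voters are correct (odd n_trees)."""
    needed = n_trees // 2 + 1
    return sum(
        comb(n_trees, k) * p_correct**k * (1 - p_correct) ** (n_trees - k)
        for k in range(needed, n_trees + 1)
    )


# Each tree alone is right 60% of the time; the committee does better as it grows.
for n in (1, 5, 25, 101):
    print(n, round(majority_accuracy(n, 0.6), 3))
```

In practice trees trained on the same data are correlated, so the gain is smaller than this idealised figure, which is why random forests randomise both the samples and the features each tree sees.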

Boosting

Boosting aims to build a robust classifier from several weak classifiers. Unlike bagging algorithms, where many trees are built in parallel with each other, boosting algorithms build trees serially. Subsequent models are built using the output from the previous model, with the aim of reducing errors and increasing the influence of models that perform well.

Gradient Boosting

In gradient boosting, each model is a gradient boosted decision tree, meaning it is trained using the residual errors of the previous model. The gradient descent algorithm is used to minimise the errors in subsequent models.
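A minimal sketch of this idea for squared-error regression, where the residual is the negative gradient of the loss. It uses one-split trees ("stumps") on a single feature; the data, function names and settings are hypothetical, and this is plain gradient boosting rather than XGBoost's regularised version.

```python
def fit_stump(xs, residuals):
    """Find the single threshold on x that best fits the residuals (least squares)."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    trees = []
    predictions = [base] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        predictions = [p + learning_rate * tree(x) for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(tree(x) for tree in trees)


def mse(model, xs, ys):
    """Mean squared error of a model on the given data."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


xs = [1, 2, 3, 4, 5, 6]               # hypothetical feature values
ys = [1.0, 1.5, 3.0, 3.5, 6.0, 6.5]   # hypothetical targets
for rounds in (0, 1, 5, 20):
    print(rounds, round(mse(gradient_boost(xs, ys, n_rounds=rounds), xs, ys), 4))
```

The training error shrinks as rounds are added, because each stump corrects part of what the earlier ones got wrong.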

XGBoost

As the name suggests, XGBoost is an advanced implementation of gradient boosting.

XGBoost builds small trees called ‘weak learners’ as each tree has a poor performance individually. XGBoost starts by building a single, simple tree. This tree performs poorly, but XGBoost uses the errors from it to build another tree. This second tree can predict what the first tree could not.

This process continues until a stopping condition is reached, at which point training is complete. The condition could be a maximum number of training rounds, or the loss function no longer improving.

Consider the following example dataset with 6 training samples [3]:

1. Make an initial prediction using the mean of the feature to be predicted; in this example, the average salary is 6 (100K)

2. Calculate the error residuals: actual value - initial prediction

3. Build a regression tree from the residual errors, splitting the data based on experience <= 2, and calculate the information gain

4. Calculate the similarity score for the root, left and right nodes

Similarity Score = (Sum of Residuals)^2 / (Number of Residuals + lambda)

Here lambda is the L2 regularisation term described under Parameters.

Gain = Similarity Score Left + Similarity Score Right - Similarity Score Root

Candidate splits are tried at the average of each pair of adjacent values, and the split with the highest information gain is selected as the first branch in the tree

Repeat step 3, splitting nodes which have more than one value and creating a new branch, until all nodes have only one value or the max depth has been reached

5. The tree can then be pruned using the minimum loss reduction parameter gamma. For every branch, Gain - gamma is calculated, and if the result is negative the branch is discarded. This is why the left branch isn’t split any further

6. Calculate output values for each branch

Output Value = Sum of Residuals / (Number of Residuals + lambda)

7. Predicted values can then be calculated using

Predicted Value = Initial Prediction + Learning Rate * Output Value of the matching leaf in the tree
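The steps above can be traced in code for a single round and a single split. This is a sketch under stated assumptions: the six-row dataset is hypothetical (it is not the article's original table), and the values of `lam`, `gamma` and `learning_rate` are illustrative settings rather than recommendations.

```python
# Hypothetical data (not the article's table): years of experience -> salary in 100K.
experience = [1, 2, 3, 4, 5, 6]
salary = [3, 4, 5, 7, 8, 9]
lam, gamma, learning_rate = 1.0, 10.0, 0.3  # illustrative settings


def similarity(residuals):
    """Similarity Score = (sum of residuals)^2 / (number of residuals + lambda)."""
    return sum(residuals) ** 2 / (len(residuals) + lam)


def output_value(residuals):
    """Output Value = sum of residuals / (number of residuals + lambda)."""
    return sum(residuals) / (len(residuals) + lam)


# Steps 1-2: initial prediction (the mean) and residuals.
initial = sum(salary) / len(salary)
residuals = [s - initial for s in salary]

# Steps 3-4: try a split between each pair of adjacent values, keep the highest gain.
best_gain, best_threshold = float("-inf"), None
for lo, hi in zip(experience, experience[1:]):
    threshold = (lo + hi) / 2
    left = [r for x, r in zip(experience, residuals) if x <= threshold]
    right = [r for x, r in zip(experience, residuals) if x > threshold]
    gain = similarity(left) + similarity(right) - similarity(residuals)
    if gain > best_gain:
        best_gain, best_threshold = gain, threshold

# Step 5: the split survives pruning only if Gain - gamma is not negative.
keep_split = best_gain - gamma >= 0

# Steps 6-7: leaf output values, then updated predictions.
left = [r for x, r in zip(experience, residuals) if x <= best_threshold]
right = [r for x, r in zip(experience, residuals) if x > best_threshold]
left_out, right_out = output_value(left), output_value(right)
predictions = [
    initial + learning_rate * (left_out if x <= best_threshold else right_out)
    for x in experience
]
print(best_threshold, best_gain, keep_split, predictions)
```

Only the root split is built here; a full implementation would repeat steps 3 to 5 on each child node, and then repeat the whole process on the new residuals for the next tree.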

Parameters

To get the best out of XGBoost there are a number of tuneable hyperparameters, the optimum values of which are data-dependent. Understanding these at a high level will help you make informed choices when experimenting with values.

Split

The split is the percentage of data used for training the model versus testing it. A typical train/test split would be 60/40 or 70/30, with most of the data used for training. Holding data back allows the model to be evaluated on data it has never seen before, which gives a truer reflection of real-world performance.
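A minimal sketch of a 70/30 split using only the standard library. The helper name `train_test_split` and the seed are assumptions made for this example; in practice a library routine would normally be used, often with stratification by class.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows, then hold back a fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-in for 10 observations
train, test = train_test_split(rows, test_fraction=0.3)
print(len(train), len(test))
```

Shuffling before splitting matters when the rows are ordered (for example by date or by class), since otherwise the test set may not resemble the training set.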

Objective

The learning objective, also called the cost or loss function, penalises errors in the predictions the model makes during the learning process. It quantifies the difference between the value predicted by the model and the ground truth value provided.

The objective should match the type of problem. Regression typically uses reg:squarederror, while classification will primarily use binary:logistic or multi:softmax depending on whether there are more than two classes or not.

Eval_metric

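The eval_metric parameter does not drive how the trees are fitted; that is the role of the objective. Instead, it is used to score the model on evaluation data, both to monitor training (for example, for early stopping) and to assess the model after training has completed.

For regression tasks, typical choices are root mean squared error (rmse) or mean absolute error (mae). For classification tasks, logloss is used for two classes and mlogloss for more than two classes.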
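For reference, the metrics named above can be written out directly. This is a plain-Python sketch of the standard definitions, not XGBoost's internal code; the function names and the clipping constant `eps` are choices made for this example.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error for regression."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error for regression."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def logloss(y_true, p_pred, eps=1e-15):
    """Binary log loss; y_true holds 0/1 labels, p_pred the predicted P(class = 1)."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(y_true)


print(rmse([1, 2, 3], [1, 2, 5]), mae([1, 2, 3], [1, 2, 5]))
print(logloss([1, 0], [0.9, 0.2]), logloss([1, 0], [0.5, 0.5]))
```

RMSE punishes large errors more heavily than MAE, and log loss rewards confident correct probabilities while penalising confident wrong ones.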

Learning Rate

When new trees are created to correct the residual errors in the previous tree, the model can overfit to the training data. Learning rate (also known as eta) can be used to control the weighting of a new tree added to the model.

With a low learning rate, each new tree added has a smaller impact on the overall model than it would with a high learning rate. A high learning rate can overfit the training data.
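The effect can be seen in a deliberately idealised sketch: a single training value, and a "tree" that predicts the current residual exactly. Under those assumptions (which are for illustration only), each round shrinks the remaining error by a factor of (1 - learning rate).

```python
def residual_after(rounds, learning_rate, target=10.0, start=0.0):
    """Remaining training error after boosting towards one target value."""
    prediction = start
    for _ in range(rounds):
        tree_output = target - prediction           # idealised tree: predicts the residual exactly
        prediction += learning_rate * tree_output   # only a fraction of it is added
    return target - prediction


for eta in (0.1, 0.3, 1.0):
    print(eta, [round(residual_after(n, eta), 3) for n in (1, 5, 20)])
```

With a learning rate of 1.0 the training value is matched after a single tree, noise included, whereas a small learning rate approaches it gradually and needs more rounds to get there.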

Tree Depth

Tree depth sets the complexity of each decision tree, determined by the number of layers, or the maximum depth.

Shallow trees are expected to have poor performance because they capture few details of the problem. They are referred to as weak learners. Deeper trees generally capture too many details of the problem and overfit the training dataset, limiting the ability to make good predictions on new data.

XGBoost generally is configured with weak learners.

Num Training Rounds

The number of trees that the model will create.

Minimum Loss Reduction (Gamma)

Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm will be.

Alpha

L1 regularisation term on weights. Increasing this value will make the model more conservative and reduce the model's sensitivity to individual values. L1 regularisation penalises the sum of absolute values of the weights.

Lambda

L2 regularisation term on weights. Increasing this value will make the model more conservative and reduce the model's sensitivity to individual values. L2 regularisation penalises the sum of squares of the weights.
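The two penalties can be computed by hand to see how they differ. The sketch below assumes the "weights" are the output values of a tree's leaves and uses the conventional 1/2 factor on the L2 term; the function name and example values are hypothetical.

```python
def regularisation_penalty(leaf_weights, alpha=0.0, lam=1.0):
    """alpha * sum(|w|) + 0.5 * lambda * sum(w^2) over a tree's leaf weights."""
    l1 = alpha * sum(abs(w) for w in leaf_weights)
    l2 = 0.5 * lam * sum(w * w for w in leaf_weights)
    return l1 + l2


weights = [1.0, -2.0]  # hypothetical leaf output values
print(regularisation_penalty(weights, alpha=1.0, lam=0.0))  # L1 only
print(regularisation_penalty(weights, alpha=0.0, lam=1.0))  # L2 only
```

Because L2 squares each weight, it penalises a few large leaf values much more than many small ones, while L1 grows linearly and tends to push small weights towards zero.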

Early Stopping

The early stopping value is the number of rounds within which the model must improve on the selected evaluation metric; if it does not, training is stopped and the latest iteration of the model is used.
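The stopping rule itself is simple to sketch. The function below is a hypothetical stand-in that takes a ready-made list of per-round validation errors (made-up numbers) instead of training a model, and reports when training would halt.

```python
def run_with_early_stopping(validation_errors, early_stopping_rounds):
    """Return (rounds_run, best_round) for a sequence of per-round validation errors."""
    best_error = float("inf")
    best_round = -1
    for round_index, error in enumerate(validation_errors):
        if error < best_error:
            best_error, best_round = error, round_index
        elif round_index - best_round >= early_stopping_rounds:
            # No improvement for `early_stopping_rounds` rounds in a row: stop.
            return round_index + 1, best_round
    return len(validation_errors), best_round


errors = [0.50, 0.40, 0.35, 0.33, 0.34, 0.36, 0.37, 0.39, 0.41]  # hypothetical
print(run_with_early_stopping(errors, early_stopping_rounds=3))
```

Here the best error occurs at round index 3, and training halts three rounds later without running the remaining rounds.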

Early stopping can prevent overfitting and overtraining, and helps ensure the model generalises to new data. It also allows a very good model for the given parameters and data to be produced from a single training run, instead of training the model multiple times with different numbers of training rounds, which can be both time consuming and processor intensive.
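As a summary, the parameters in this section map onto the names used by the open source XGBoost library roughly as follows. This is a configuration sketch only: the values are illustrative and not tuned recommendations, and the comments describe how the dictionaries would typically be passed to the Python package's `xgboost.train` function.

```python
# Model parameters, keyed by XGBoost's native parameter names. Values are illustrative.
params = {
    "objective": "binary:logistic",  # Objective: loss optimised during training
    "eval_metric": "logloss",        # Eval_metric: scored on evaluation data
    "eta": 0.1,                      # Learning Rate
    "max_depth": 3,                  # Tree Depth
    "gamma": 1.0,                    # Minimum Loss Reduction
    "alpha": 0.0,                    # L1 regularisation term on weights
    "lambda": 1.0,                   # L2 regularisation term on weights
}

# Settings that belong to the training call rather than the model parameters.
train_settings = {
    "num_boost_round": 500,          # Num Training Rounds (upper limit)
    "early_stopping_rounds": 20,     # Early Stopping
}

# Typical use with the Python package, assuming DMatrix objects dtrain and dtest exist:
#   booster = xgboost.train(params, dtrain, evals=[(dtest, "test")], **train_settings)
for name, value in {**params, **train_settings}.items():
    print(f"{name} = {value}")
```

The train/test split is not an XGBoost parameter; it is applied to the data before training.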

Interpret Results

XGBoost returns a number of very useful metrics. One of the most useful is ‘feature importance’, which shows which features the model depended on the most to make its predictions. Feature importance can help reduce the number of features in your dataset, which can improve the model's performance.

When using XGBoost for classification problems, a confusion matrix can also be produced, showing the predicted classes against the true classes in a visually easy-to-interpret way.

Alongside the confusion matrix, there are metrics which can be derived from it, such as specificity, precision, recall, f1-score and support.
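These metrics follow directly from the four cells of a binary confusion matrix. The sketch below uses made-up labels ("R" for responder, "N" for non-responder) and a hypothetical helper name, and treats one class as the positive class.

```python
def binary_metrics(y_true, y_pred, positive):
    """Derive common metrics from confusion-matrix counts for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "precision": precision,                # of predicted positives, how many were right
        "recall": recall,                      # of true positives, how many were found
        "specificity": ratio(tn, tn + fp),     # of true negatives, how many were found
        "f1": ratio(2 * precision * recall, precision + recall),
        "support": tp + fn,                    # number of true instances of the class
    }


y_true = ["R", "R", "R", "N", "N", "N", "N", "N"]  # hypothetical labels
y_pred = ["R", "R", "N", "R", "N", "N", "N", "N"]
print(binary_metrics(y_true, y_pred, positive="R"))
```

For multi-class problems the same calculation is usually repeated with each class in turn treated as the positive class.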

When used for regression problems, a graph of predicted values against real values can be produced, which can give insight into the model's performance.

The headline metric is the accuracy of the model, the percentage of predictions the model got correct. Accuracy is not the be-all and end-all, however: a high accuracy can sometimes be misleading and actually be the result of a poor model. This can be due to overfitting, where the model performs very well on the data it was trained on but very poorly on previously unseen data.

A graph of error vs training rounds can be a useful way to understand if the model is overfitting. The error for both test and train data should decrease with each iteration before levelling off; if the test error starts to rise while the train error keeps falling, the model is overfitting.

Summary

XGBoost represents a cutting-edge tool in the arsenal of precision medicine. The algorithm excels in handling large, structured datasets typical of precision medicine, offering insights into disease aetiology and aiding in the development of patient stratification strategies. Its superior predictive performance, ability to identify important features, and interpretability make it a preferred choice among medical researchers, bioinformaticians, and data scientists. Its proven track record in peer-reviewed studies and our client case studies underscores its potential to revolutionise medical research and patient care.
