Introduction
Tree-based machine learning methods are among the most commonly used supervised learning methods. A tree is constructed from two components: branches and nodes. The model is built by recursively splitting a training sample, selecting at each node the feature that splits the data most effectively, using simple decision rules inferred from the training data.
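The recursive splitting described above can be sketched with scikit-learn; the Iris dataset and the depth limit here are illustrative choices, not part of any particular application:

```python
# A minimal sketch of recursive splitting with scikit-learn's
# DecisionTreeClassifier; the Iris dataset is an illustrative choice.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data, iris.target

# Limit the depth so the learned decision rules stay readable.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

# Each indented line below is one learned if-else rule at a node.
print(export_text(clf, feature_names=iris.feature_names))
```

The printed rules make the "simple decision rules" concrete: every internal node is a threshold test on a single feature, and each path from root to leaf is one chain of such tests.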
Common Terminology

i) Root node — this represents the entire population or sample, which gets divided into two or more homogeneous subsets.
ii) Splitting — subdividing a node into two or more sub-nodes.
iii) Decision node — a sub-node that is itself divided into further sub-nodes.
iv) Leaf/Terminal node — this is the final/last node that we consider for our model output. It cannot be split further.
v) Pruning — removing unnecessary sub-nodes of a decision node to combat overfitting.
vi) Branch/Sub-tree — the sub-section of the entire tree.
vii) Parent and Child node — a node that’s subdivided into a sub-node is a parent, while the sub-node is the child node.
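Pruning (term v above) can be demonstrated with scikit-learn's cost-complexity pruning; this is a hedged sketch, and the `ccp_alpha` value and dataset are illustrative assumptions:

```python
# A sketch of pruning via scikit-learn's minimal cost-complexity pruning;
# the ccp_alpha value (0.01) is an illustrative choice.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# An unpruned tree grows until its leaves are pure (prone to overfitting).
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# A nonzero ccp_alpha removes sub-nodes whose contribution does not
# justify their complexity, leaving a smaller tree.
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X, y)

print("nodes before pruning:", full.tree_.node_count)
print("nodes after pruning: ", pruned.tree_.node_count)
```

Larger `ccp_alpha` values prune more aggressively; the trade-off between tree size and training accuracy is exactly the overfitting control the terminology list describes.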
Types of tree-based models
Tree-based machine learning models are a category of algorithms that make decisions by recursively partitioning the input space into regions. Some common types of tree-based models include:
- Decision Trees:
  - Overview: Decision trees are a fundamental type of tree-based model that makes decisions based on a series of if-else conditions. Each internal node represents a decision based on a feature, and each leaf node represents the output.
  - Applications: Decision trees are versatile and can be used for both classification and regression tasks.
- Random Forest:
  - Overview: Random Forest is an ensemble learning method that constructs a multitude of decision trees during training and outputs the average prediction (for regression tasks) or the majority vote (for classification tasks) of the individual trees.
  - Applications: Random Forest is effective in reducing overfitting and improving accuracy.
- Gradient Boosting Machines (GBM):
  - Overview: GBM is another ensemble method that builds trees sequentially, with each tree compensating for the errors of the previous ones. It combines weak learners to create a strong predictive model.
  - Applications: GBM is widely used for both regression and classification tasks and is known for its high predictive power.
- XGBoost (Extreme Gradient Boosting):
  - Overview: XGBoost is an optimized and efficient implementation of gradient boosting. It incorporates regularization techniques, parallel processing, and tree pruning to enhance performance.
  - Applications: XGBoost is commonly used in various machine learning competitions and real-world applications due to its speed and accuracy.
- LightGBM:
  - Overview: LightGBM is a gradient boosting framework that uses a tree-based learning algorithm. It is designed for distributed and efficient training and can handle large datasets.
  - Applications: LightGBM is suitable for large-scale machine learning tasks and is particularly efficient in scenarios with high dimensionality.
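Two of the ensembles above can be compared side by side with scikit-learn; this is a hedged sketch in which the dataset, split, and hyperparameters are illustrative assumptions:

```python
# A sketch comparing a bagging ensemble (Random Forest) with a boosting
# ensemble (GBM) on one dataset; dataset and settings are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    # Bagging: many trees trained independently, majority vote at predict time.
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # Boosting: trees built sequentially, each correcting its predecessors.
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: {scores[name]:.3f}")
```

XGBoost and LightGBM ship scikit-learn-compatible estimators (`xgboost.XGBClassifier`, `lightgbm.LGBMClassifier`), so either can be dropped into the same loop once the corresponding package is installed.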


