Histogram-based Gradient Boosting in scikit-learn 0.21

Olivier Grisel

Cython, Data, Machine-Learning, Multi-Threading, Scientific Libraries (Numpy/Pandas/SciKit/...)

scikit-learn 0.21 was recently released. This presentation will give an overview of its main new features and present the new implementation of Gradient Boosted Trees.

Gradient Boosted Trees (also known as Gradient Boosting Machines) are very competitive supervised machine learning models, especially on tabular data.

Scikit-learn has offered a traditional implementation of this family of methods for many years. However, its computational performance was no longer competitive: it was far outperformed by specialized state-of-the-art libraries such as XGBoost and LightGBM. The new implementation in version 0.21 uses histograms of binned features to evaluate the tree node split candidates. This implementation can efficiently leverage multi-core CPUs and is competitive with XGBoost and LightGBM.
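To make the histogram idea concrete, here is a minimal pure-Python sketch of split finding on a single binned feature. It is an illustration under stated assumptions, not the scikit-learn code: the function names `bin_feature` and `best_split_from_histogram` are hypothetical, the binning is equal-width for simplicity, and the gain formula assumes a squared-error loss with a constant hessian of 1 and no regularization.

```python
def bin_feature(values, n_bins):
    """Map raw feature values to integer bin indices.

    Assumption: equal-width bins, chosen here only to keep the sketch short.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # constant feature: every sample falls into bin 0
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def best_split_from_histogram(binned, gradients, n_bins):
    """Return (best_bin, best_gain) for the split "bin index <= best_bin".

    Assumption: squared-error loss, so each sample has hessian 1 and the
    score of a node is (sum of gradients)**2 / (number of samples).
    """
    # One pass over the samples builds the per-bin histograms.
    grad_hist = [0.0] * n_bins
    count_hist = [0] * n_bins
    for b, g in zip(binned, gradients):
        grad_hist[b] += g
        count_hist[b] += 1

    total_grad = sum(grad_hist)
    total_count = len(binned)
    parent_score = total_grad ** 2 / total_count

    # Scanning the bins (not the samples) evaluates every candidate split.
    best_bin, best_gain = None, 0.0
    left_grad, left_count = 0.0, 0
    for b in range(n_bins - 1):
        left_grad += grad_hist[b]
        left_count += count_hist[b]
        right_count = total_count - left_count
        if left_count == 0 or right_count == 0:
            continue  # a split must leave samples on both sides
        right_grad = total_grad - left_grad
        gain = (left_grad ** 2 / left_count
                + right_grad ** 2 / right_count
                - parent_score)
        if gain > best_gain:
            best_bin, best_gain = b, gain
    return best_bin, best_gain


values = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0]
gradients = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
binned = bin_feature(values, n_bins=4)
split = best_split_from_histogram(binned, gradients, n_bins=4)
```

The point of the design is that once the histogram is built, the number of candidate thresholds depends on the number of bins rather than on the number of samples, and building the histogram is a simple accumulation loop that lends itself to parallel execution.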

We will also introduce pygbm, a Numba-based implementation of gradient boosted trees that was used as a prototype for the scikit-learn implementation, and compare the developer experience of Numba with that of Cython.

Type: Talk (45 mins); Python level: Intermediate; Domain level: Intermediate

Olivier Grisel

Inria

Olivier Grisel is a Machine Learning engineer at Inria Saclay, near Paris in France. He is a core developer of the scikit-learn machine learning library and contributes to related projects in the scientific and data Python ecosystem.