
Simple sklearn ensemble machine learning

This post is a write-up on an sklearn ensemble pipeline for multiple target columns, built with traditional, well-established libraries such as numpy, pandas, scipy and sklearn. It further extends our article on an sklearn pipeline for time-series data.

Some of the ideas for this post came from researching the machine learning competition Sound the Alarm 2 on the Australia-based platform Unearthed, where the Dspyt team took 10th place on the private leaderboard. The company is very strict about the data and code it provided, so we cannot share them here. You can try the platform in the Evergreen Challenges.


Loading Libraries

First we need to load the libraries. Our choice is limited to well-established Python libraries such as pandas, numpy and sklearn.
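As a minimal sketch, the imports below cover every step in this post; the exact set depends on the models and metrics you choose.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
```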

Loading data and hard-coding the tags

Next, in this example we have a single set of input and target dataframes. In addition, we explicitly hard-code the names of the input and target columns in the variables input_tags and target_tags, respectively.
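A sketch of this step; since the competition data cannot be shared, the file names, timestamp column and tag names below are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical file and column names -- substitute your own data.
inputs = pd.read_csv("inputs.csv", parse_dates=["timestamp"])
targets = pd.read_csv("targets.csv", parse_dates=["timestamp"])

# Explicitly hard-code the input and target column names.
input_tags = ["sensor_1", "sensor_2", "sensor_3"]
target_tags = ["target_1", "target_2", "target_3"]
```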

Preprocessing and splitting the data into train and test

The preprocessing function aggregates unstructured data into time periods and tags.
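A minimal sketch of such a function, assuming long-format records with timestamp, tag and value columns (the actual competition schema differs):

```python
import pandas as pd

def preprocess(raw: pd.DataFrame, tags: list, freq: str = "1H") -> pd.DataFrame:
    """Aggregate raw records into fixed time periods, one column per tag."""
    # Keep only the tags of interest.
    raw = raw[raw["tag"].isin(tags)]
    # Pivot to one column per tag, averaging values within each time period.
    wide = raw.pivot_table(
        index=pd.Grouper(key="timestamp", freq=freq),
        columns="tag",
        values="value",
        aggfunc="mean",
    )
    # Forward-fill short gaps and drop periods that remain empty.
    return wide.ffill().dropna()
```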

The data in this article relies heavily on the temporal dimension, so it is not best practice to split it randomly into train and test. Instead, we put the first 80% of observations into the training split and the latest 20% into the test split.
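Assuming X and y are the preprocessed input and target frames, already sorted by time, the split is a simple slice:

```python
# Time-ordered split: the first 80% of rows train, the latest 20% test.
split = int(len(X) * 0.8)
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]
```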

sklearn ensemble model

To predict the labels for the three target columns, we create an ensemble class that fits a separate decision tree model for each target and combines the individual predictions.
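A sketch of such a class; the name and exact interface are illustrative, not the original competition code.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.tree import DecisionTreeRegressor

class MultiTargetEnsemble(BaseEstimator, RegressorMixin):
    """Fit one decision tree per target column and stack the predictions."""

    def __init__(self, max_depth=None):
        self.max_depth = max_depth

    def fit(self, X, y):
        # y is a DataFrame with one column per target.
        self.models_ = {
            col: DecisionTreeRegressor(max_depth=self.max_depth).fit(X, y[col])
            for col in y.columns
        }
        return self

    def predict(self, X):
        # Stack per-target predictions into an (n_samples, n_targets) array.
        return np.column_stack([m.predict(X) for m in self.models_.values()])
```

Note that sklearn ships this per-target strategy out of the box as sklearn.multioutput.MultiOutputRegressor, which wraps any single-output estimator.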

Predict and score the predictions

In this section we initialize the ensemble model and fit the individual models to produce predictions.
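With the sketch class above, this amounts to three lines; the max_depth value is an illustrative hyperparameter, not a tuned one.

```python
model = MultiTargetEnsemble(max_depth=8)
model.fit(X_train, y_train)
preds = model.predict(X_test)  # shape: (n_test_samples, 3)
```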

We also build a scoring function that penalizes inaccurate predictions.
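The competition metric is not public, so as a stand-in, here is an RMSE-based score averaged across the targets; lower is better.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def score(y_true, y_pred):
    """Mean root-mean-squared error across target columns."""
    rmses = [
        np.sqrt(mean_squared_error(y_true.iloc[:, i], y_pred[:, i]))
        for i in range(y_pred.shape[1])
    ]
    return float(np.mean(rmses))

print(f"Mean RMSE: {score(y_test, preds):.4f}")
```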

Summary

In this example we demonstrated how to combine machine learning functions into a pipeline. We used a preprocessing function that cleans and aggregates the data, built an sklearn ensemble model that fits an individual model for each of the three targets and predicts all the labels, and finally scored the predictions with a custom scoring function.

The areas for improvement are adding more features and improving the prediction models. Furthermore, we could use GridSearchCV and a K-Fold cross-validator to select the necessary features and tune the hyperparameters.

Kaggle notebook