This post is a write-up on building an sklearn ensemble pipeline for multiple target columns using traditional, established libraries such as numpy, pandas, scipy and sklearn. It extends the earlier article on an sklearn pipeline for time-series data.
Some of the ideas for this post came from research for the machine learning competition Sound the Alarm 2 on the Australian-based platform Unearthed, where the Dspyt team took 10th place on the private leaderboard. The company is very strict about the data and code that it provided. You can try the platform in its Evergreen Challenges.
First we load the libraries. Our choice is limited to well-established Python libraries such as pandas, numpy and sklearn.
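As a minimal sketch, the imports for such a pipeline might look like the block below. The exact set of sklearn imports is an assumption; the post only names the libraries, not the individual classes.

```python
# Core numerical and tabular libraries named in the post.
import numpy as np
import pandas as pd

# sklearn building blocks (assumed selection): a base class for custom
# estimators, a helper to copy unfitted models, and a decision tree.
from sklearn.base import BaseEstimator, clone
from sklearn.tree import DecisionTreeClassifier
```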
Next, in this example we have a single set of input and target dataframes. We also explicitly list the names of the input and target columns in the variables input_tags and target_tags, respectively.
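Because the competition data cannot be shared, here is a sketch of the same layout built from synthetic values. The tag names (sensor_a, alarm_1 and so on), the hourly index and the binary targets are all assumptions made for illustration; only the variable names input_tags and target_tags come from the post.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical column names; the real tags are not public.
input_tags = ["sensor_a", "sensor_b", "sensor_c"]
target_tags = ["alarm_1", "alarm_2", "alarm_3"]

# One shared time index so that inputs and targets stay aligned row by row.
index = pd.date_range("2021-01-01", periods=100, freq="60min")

# Synthetic stand-ins for the input and target dataframes.
inputs = pd.DataFrame(rng.normal(size=(100, 3)), index=index, columns=input_tags)
targets = pd.DataFrame(rng.integers(0, 2, size=(100, 3)), index=index, columns=target_tags)
```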
The preprocessing function cleans the unstructured data and aggregates it by time period and tag.
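One way such a function could look is sketched below. It assumes the raw data arrives in long format with hypothetical columns timestamp, tag and value, and that averaging readings within each period is acceptable; the original function may differ on all of these points.

```python
import numpy as np
import pandas as pd

def preprocess(raw, period="60min"):
    """Aggregate long-format readings into one row per period and one column per tag."""
    # Drop readings without a value before aggregating.
    frame = raw.dropna(subset=["value"]).copy()
    # Assign every reading to the start of its time period.
    frame["period"] = frame["timestamp"].dt.floor(period)
    # Average readings that share a period and a tag.
    wide = frame.pivot_table(index="period", columns="tag", values="value", aggfunc="mean")
    wide.columns.name = None
    return wide.sort_index()

# Tiny hypothetical example: two tags observed at irregular times.
raw = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2021-01-01 00:05", "2021-01-01 00:35", "2021-01-01 00:10",
        "2021-01-01 01:20", "2021-01-01 01:40",
    ]),
    "tag": ["a", "a", "b", "a", "b"],
    "value": [1.0, 3.0, 10.0, 5.0, np.nan],
})
aggregated = preprocess(raw)
```

Periods in which a tag has no valid reading come out as NaN, so a real pipeline would still need an imputation or forward-fill step before modelling.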
The data in this article depends heavily on the temporal dimension, so splitting it randomly into train and test sets is not good practice. Instead, we put the earliest 80% of observations into the train split and the latest 20% into the test split.
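The chronological split described above can be sketched as follows. The helper name chronological_split and the toy dataframes are assumptions for illustration; the sketch also assumes both dataframes share a unique, sortable time index.

```python
import numpy as np
import pandas as pd

def chronological_split(inputs, targets, train_fraction=0.8):
    """Split by position in time: earliest rows for training, latest for testing."""
    # Sort by time so that "first" and "latest" are well defined.
    inputs = inputs.sort_index()
    targets = targets.loc[inputs.index]
    cut = int(len(inputs) * train_fraction)
    return inputs.iloc[:cut], inputs.iloc[cut:], targets.iloc[:cut], targets.iloc[cut:]

# Hypothetical daily data with ten observations.
index = pd.date_range("2021-01-01", periods=10, freq="D")
inputs = pd.DataFrame({"sensor_a": np.arange(10.0)}, index=index)
targets = pd.DataFrame({"alarm_1": np.arange(10) % 2}, index=index)

X_train, X_test, y_train, y_test = chronological_split(inputs, targets)
```

Unlike a shuffled split, every training row here precedes every test row, so the model is never evaluated on observations older than the ones it was trained on.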
To predict the labels for the three target columns, we create an ensemble class that fits a separate decision tree model for each target and combines the individual predictions.
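A minimal sketch of such a class is shown below. The class name PerTargetEnsemble, the choice of DecisionTreeClassifier (rather than a regressor) and the dataframe-in, dataframe-out interface are assumptions; the original implementation is not reproduced here.

```python
import pandas as pd
from sklearn.base import BaseEstimator, clone
from sklearn.tree import DecisionTreeClassifier

class PerTargetEnsemble(BaseEstimator):
    """Fit one independent model per target column (hypothetical sketch)."""

    def __init__(self, base_estimator=None):
        self.base_estimator = base_estimator

    def fit(self, X, Y):
        # Fall back to a decision tree when no base model is supplied.
        base = self.base_estimator
        if base is None:
            base = DecisionTreeClassifier(random_state=0)
        self.target_tags_ = list(Y.columns)
        # clone() gives every target its own unfitted copy of the base model.
        self.models_ = {tag: clone(base).fit(X, Y[tag]) for tag in self.target_tags_}
        return self

    def predict(self, X):
        # Combine the per-target predictions into one dataframe.
        predictions = {tag: model.predict(X) for tag, model in self.models_.items()}
        return pd.DataFrame(predictions, index=X.index, columns=self.target_tags_)
```

scikit-learn also ships MultiOutputClassifier in sklearn.multioutput, which follows the same one-model-per-target idea; a hand-written class mainly helps when each target needs its own features or model type.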
In this section we initialize the ensemble model, fit each individual model and generate the predictions.
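To keep the following sketch runnable on its own, it swaps in scikit-learn's built-in MultiOutputClassifier as a stand-in for the custom ensemble class; like that class, it fits one decision tree per target column. The synthetic data, the tag names and the rules that generate the targets are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
input_tags = ["sensor_a", "sensor_b", "sensor_c"]   # hypothetical names
target_tags = ["alarm_1", "alarm_2", "alarm_3"]     # hypothetical names

# Synthetic inputs and three binary targets derived from them.
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=input_tags)
Y = pd.DataFrame({
    "alarm_1": (X["sensor_a"] > 0).astype(int),
    "alarm_2": (X["sensor_b"] > 0).astype(int),
    "alarm_3": (X["sensor_a"] + X["sensor_c"] > 0).astype(int),
})

# Chronological 80/20 split: earliest rows train, latest rows test.
cut = int(len(X) * 0.8)
X_train, X_test = X.iloc[:cut], X.iloc[cut:]
Y_train, Y_test = Y.iloc[:cut], Y.iloc[cut:]

# Initialize the ensemble and fit one decision tree per target column.
ensemble = MultiOutputClassifier(DecisionTreeClassifier(random_state=0))
ensemble.fit(X_train, Y_train)

# Collect the predictions for all targets in a single dataframe.
predictions = pd.DataFrame(
    ensemble.predict(X_test), index=X_test.index, columns=target_tags
)
```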
We also build a scoring function that penalizes inaccurate predictions.
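The exact scoring rule is not given here, so the sketch below uses a purely hypothetical one: each correct label earns one point, each wrong label costs penalty points, and the result is averaged over all targets and rows. The function name penalized_score and the penalty parameter are assumptions.

```python
import numpy as np

def penalized_score(y_true, y_pred, penalty=1.0):
    """Average of +1 per correct label and -penalty per wrong label (hypothetical rule)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    correct = y_true == y_pred
    # Reward correct cells, subtract the penalty for incorrect ones.
    return float(np.where(correct, 1.0, -penalty).mean())

# Two rows, three targets: four labels are right and two are wrong.
y_true = np.array([[1, 0, 1], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 1]])
score = penalized_score(y_true, y_pred)
harsher_score = penalized_score(y_true, y_pred, penalty=2.0)
```

Raising penalty makes wrong predictions more expensive, which pushes model selection toward conservative models.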
In this example we demonstrated how to combine machine learning functions into a pipeline. We used a preprocessing function that cleans and aggregates the data. We then built an sklearn ensemble model that fits an individual model for each of the three targets and predicts all the labels. Finally, we scored the predictions with a custom scoring function.
The main areas for improvement are adding more features and improving the prediction models. Furthermore, we could use sklearn's GridSearchCV together with the KFold cross-validator to select the necessary features and hyperparameters.
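As a rough sketch of that last idea, the block below tunes a single decision tree with GridSearchCV and KFold on synthetic data. The parameter grid, the data and the accuracy metric are assumptions, and the sketch covers only hyperparameters for one target, not feature selection.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic inputs and a single binary target (illustrative only).
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Hypothetical grid of decision tree hyperparameters.
param_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]}

# KFold without shuffling keeps every fold a contiguous block of rows.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid=param_grid,
    cv=KFold(n_splits=5),
    scoring="accuracy",
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
```

For strongly time-dependent data, sklearn's TimeSeriesSplit is an alternative cross-validator that always validates on rows later than the ones used for training.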