Machine Learning simple Sklearn Ensemble

This post is a write up on sklearn pipeline with multiple regression models for multiple target columns using traditional and established libraries such as numpy, pandas, scipy and sklearn. This post explores further the topics of the previous article on sklearn pipeline for time-series data, which you can find here:

Some of the ideas for this post came from researching for machine learning competition Sound the Alarm 2 for Australian based platform Unearthed, where I took 10th place on a private leader-board. The company is very strict about the data and code that they provided. You can try the platform in Evergreen Challenges: Evergreen: Exploration Data Science and Evergreen: HYDROSAVER.

Load Libraries

First we need to load libraries, but our choice is limited to only well-established python libraries such as pandas, numpy and sklearn.

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

Load data and hard-code the tags

Next, in this example we have a single set of targets and input dataframes. In addition, we explicitly include the names of inputs and target columns under the variables input_tags and target_tags, respectively.

df = pd.read_csv("/kaggle/input/sound-the-alarm2/public.csv")#inputs
df_ = pd.read_csv("/kaggle/input/sound-the-alarm2/public.targets.csv")#targets

Preprocessing and separating data into train and test

The preprocessing function aggregates unstructured data into time periods and tags.

def rolling(df, window, step):
    count = 0
    df_length = len(df)
def rolling(df, window, step):
    count = 0
    df_length = len(df)
    while count < (df_length - window):
        yield count, df[count:window+count]
        count += step

def preprocess(alarms, labels=None):

    alarms = alarms[alarms.major_down_time==False].drop(columns=['major_down_time'])
    #count frequencies
    t = alarms.groupby(['day', 'tag']).count().rename(columns={'date':'freq'}).reset_index()
    # add an empty row for all columns so we always get the same shape output
    t = t.append(pd.DataFrame({'day':[pd.to_datetime([0]).date()]*len(input_tags),'freq':[0]*len(input_tags), 'tag':input_tags}))
    X = pd.pivot_table(t, values='freq', columns='tag', index='day', aggfunc=np.sum).reset_index()
    # ensure the columns are in the same order
    X=X[['day'] + input_tags]

    # For model input we will take 30 days of history for every row IE 3 dimensions
    # (sample_day, history date, column)
    # then flatten to 2 dimensions using date
    # (sample day, datecolumn)
    for offset, window in rolling(X, 30, 1): # prepare the X input
        d = window.tail(1).day.values[0]
        if d in labels[labels.window=='7 day'].date.values : # make sure we have a label for the date

    inputs = [x[y] for y in x]
    inp = np.array(inputs)
    X = inp.reshape((inp.shape[0],inp.shape[1]*inp.shape[2])) # flatten to one row per day

    target1 = labels[(labels.window=='7 day') &].fillna(0)
    target2 = labels[(labels.window=='8-14 day') &].fillna(0)
    target3 = labels[(labels.window=='15-21 day') &].fillna(0)
    return X, x.keys(), target1[target_tags], target2[target_tags], target3[target_tags]

In this article the data has a significant reliance on the temporal dimension, therefore it is not the best practice to split data randomly into test and train. Hence, we put 80% of the first observations into train and 20% of the latest observations into a test data split.

X,dates,y1,y2,y3 = preprocess(inputs,targets)
X_train, X_test = X[:int(X.shape[0]*0.8)], X[int(X.shape[0]*0.8):]
dates_train, dates_test = list(dates)[:int(len(dates)*0.8)], list(dates)[int(len(dates)*0.8):]

Ensemble model

To predict the labels for three columns we create an ensemble class that fits a separate decision tree model and combines individual predictions.

class  EnsembleModel:

    def __init__(self):
        self.models = dict()
        self.models['model1'] = DecisionTreeRegressor(random_state=1)
        self.models['model2'] = DecisionTreeRegressor(random_state=1)
        self.models['model3'] = DecisionTreeRegressor(random_state=1)

    def fit1(self, X, y):
        self.models['model1'].fit(X, y)

    def fit2(self, X, y):
        self.models['model2'].fit(X, y)

    def fit3(self, X, y):
        self.models['model3'].fit(X, y)

    def _predict(self,model_name, inp_X, dates):
        preds = self.models[model_name].predict(inp_X)
        preds = pd.DataFrame(dict(zip(target_tags,preds.T)))
        return preds

    def _predict_all(self,inp_X, dates):
        p1 = self._predict('model1',inp_X,dates)
        p2 = self._predict('model2',inp_X,dates)
        p3 = self._predict('model3',inp_X,dates)
        p1['window']=['7 day']*len(p1)
        p2['window']=['8-14 day']*len(p2)
        p3['window']=['15-21 day']*len(p3)
        return pd.concat([p1,p2,p3]).reset_index(drop=True)

    def predict(self, inp_X, dates):
        return self._predict_all(inp_X, dates)

Predict and score the predictions

In this section we initialize the ensemble model and fit each individual model to estimate predictions.

models = EnsembleModel()
models.fit1(X_train, y1[:167])
models.fit2(X_train, y2[:167])
models.fit3(X_train, y3[:167])
pred = models.predict(X_test, dates_test)

Moreover, we also build a scoring function that penalizes inaccurate predictions.

def scoring(gt, pred):

    gt = gt.set_index(['window','date'])

    pred = pred.set_index(['window','date'])
    m = gt.join(pred, how='inner', rsuffix='_pred')

    p_cols  = [c+"_pred" for c in target_tags]
    gt = m[target_tags].values
    pred = m[p_cols].values
    correct = np.bitwise_and(gt>0,pred>0).sum()
    incorrect = np.bitwise_and(gt==0,pred>0).sum()
    return correct - (incorrect/2)

scoring(targets, pred)


In this example we demonstrated how to combine machine learning functions into a pipeline. We used preprocessing function that cleans and aggregates data. Besides, we built an ensemble model that fits individual models for three targets and predicts all the labels. Finally, we scored the predictions with the help of custom scoring function.

The areas for improvement are adding more features and improving the prediction models. Furthermore, we could use GridSearch and K-Folds cross-validator to estimate the necessary features and hyper parameters.

Kaggle notebook at:

You can contact me for professional inquires via my social media:

Leave a comment