Machine Learning time-series simple pipeline SkLearn

This post is a write-up on building an sklearn pipeline with multiple regression models, using traditional and established libraries: numpy, pandas, scipy and sklearn. We build a model for the time-series data which we introduced in this post:

Some of the ideas for this post came from my work on the Pressure Predictor machine learning competition on the Australian platform Unearthed, where I took 11th place out of 55 competitors on the private leaderboard. The company is very strict about the data and code it provides. You can try the platform in the Evergreen challenges: Evergreen: Exploration Data Science and Evergreen: HYDROSAVER.


Load Libraries

First we need to load the libraries. Our choice is limited to established ones such as pandas, numpy, scipy and sklearn, so we have to go in depth with the options they offer.

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
from scipy.stats import skew
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, PowerTransformer, PolynomialFeatures
#from sklearn.feature_selection import SelectKBest

Load and separate data into train and test

With the libraries loaded, we can read the data and split it into train and test sets.

df = pd.read_csv('/kaggle/input/pressure/public.csv', parse_dates=True, index_col=0)
target_columns = ["target1", "target2", "target3"]

y = df[target_columns]
X = df.drop(columns=target_columns)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)
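
A random train/test split shuffles observations in time. For time-series data an alternative worth considering is a chronological split, training on the past and testing on the future. A minimal sketch on a synthetic hourly frame (the column names here are hypothetical, not from the competition data):

```python
import numpy as np
import pandas as pd

# Synthetic hourly frame standing in for the real data (hypothetical columns).
rng = pd.date_range("2020-01-01", periods=100, freq="h")
df = pd.DataFrame({"feat": np.arange(100.0),
                   "target1": np.arange(100.0) * 2}, index=rng)

# Chronological 80/20 split: train on the past, test on the future.
cut = int(len(df) * 0.8)
train, test = df.iloc[:cut], df.iloc[cut:]
print(len(train), len(test))  # 80 20
```

For cross-validation, sklearn's TimeSeriesSplit implements the same forward-chaining idea.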


Furthermore, we create features and transform the data which we will feed into the regression models. For each numeric column we create an expanding mean and rolling means for periods ranging from a day to six weeks (the data is hourly, so a day is 24 observations, a week 168, and so on). We also difference the values for stationarity and square the differenced values to look for a signal.

numeric = df.dtypes[df.dtypes != 'object'].index
for x in numeric:
    df[f'{x}_mean'] = df[x].expanding().mean()       # expanding (cumulative) mean
    df[f'{x}_diff'] = df[x].diff()                   # first difference
    df[f'{x}_diff_pow2'] = df[x].diff() ** 2         # squared difference
    df[f'{x}_ma_day'] = df[x].rolling(24, min_periods=1).mean()    # 1-day rolling mean
    df[f'{x}_ma_w'] = df[x].rolling(168, min_periods=1).mean()     # 1-week rolling mean
    df[f'{x}_ma_2w'] = df[x].rolling(336, min_periods=1).mean()    # 2-week rolling mean
    df[f'{x}_ma_6w'] = df[x].rolling(1008, min_periods=1).mean()   # 6-week rolling mean
    df[f'{x}_ewm'] = df[x].ewm(alpha=0.1, adjust=False).mean()     # exponentially weighted mean
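
To see what each transform does, here is a toy series run through the same pandas calls:

```python
import pandas as pd

# Toy series to illustrate the feature transforms above.
s = pd.Series([1.0, 2.0, 3.0, 4.0])

print(s.expanding().mean().tolist())                # [1.0, 1.5, 2.0, 2.5]
print(s.diff().tolist())                            # first value is NaN
print(s.rolling(2, min_periods=1).mean().tolist())  # [1.0, 1.5, 2.5, 3.5]
print(s.ewm(alpha=0.5, adjust=False).mean().tolist())  # [1.0, 1.5, 2.25, 3.125]
```

Note that `min_periods=1` makes the rolling mean start producing values immediately instead of NaN until the window fills.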

Besides, we can convert a date column to a numeric one in pandas by using pd.to_datetime() and astype('int64').

df['time'] = pd.to_datetime(df.index).astype('int64')
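
The cast produces nanoseconds since the Unix epoch, which a quick check confirms:

```python
import pandas as pd

# astype('int64') on a DatetimeIndex yields nanoseconds since 1970-01-01.
idx = pd.to_datetime(["1970-01-01 00:00:00", "1970-01-01 00:00:01"])
print(idx.astype("int64").tolist())  # [0, 1000000000]
```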

Finally, we reduce excessively skewed numeric features with a log transformation. We could also try the Box-Cox transformation available in the scipy library, but for this project we use only the log(1+x) transformation.

skewed_feats = df[numeric].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
high_skew = skewed_feats[abs(skewed_feats) > 0.5]
for feature in high_skew.index:
    df[feature] = np.log1p(np.abs(df[feature]))
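
If you do want to try Box-Cox, sklearn's PowerTransformer (already imported above) wraps it with a fitted lambda per column. A minimal sketch on a synthetic right-skewed feature (the data here is made up for illustration):

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=(500, 1))  # strongly right-skewed feature

# Box-Cox requires strictly positive input; use method="yeo-johnson"
# if the feature contains zeros or negative values.
pt = PowerTransformer(method="box-cox")
x_t = pt.fit_transform(x)

print(round(float(skew(x.ravel())), 2))    # skew before
print(round(float(skew(x_t.ravel())), 2))  # skew after, much closer to 0
```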


Now we can feed the preprocessed data into an sklearn pipeline that imputes the missing values, scales the columns, adds interaction terms between features and fits the regression models.

estimators = [
    ('RandomForest', RandomForestRegressor(n_estimators=400)),
    ('Boosting', GradientBoostingRegressor(n_estimators=3000)),
    ('ridge', RidgeCV()),
]
# Wrap the stacked ensemble so each of the three targets gets its own copy.
stack_reg = MultiOutputRegressor(
    StackingRegressor(estimators=estimators,
                      final_estimator=RandomForestRegressor(n_estimators=300),
                      n_jobs=-1))

imp = SimpleImputer(strategy='median')  # impute missing values before scaling

clf = Pipeline(steps=[
    ('imp', imp),
    ('poly', PolynomialFeatures(interaction_only=True)),
    ('scaler', StandardScaler()),
    ('a', stack_reg),
])
clf.fit(X_train, y_train)
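
The stacking-plus-multioutput combination is the core of the model, so here is a stripped-down, runnable sketch of just that part on synthetic multi-target data (small estimator counts, purely for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import RidgeCV
from sklearn.multioutput import MultiOutputRegressor

# Tiny synthetic 3-target problem standing in for the real data.
X, y = make_regression(n_samples=100, n_features=5, n_targets=3, random_state=0)

stack = StackingRegressor(
    estimators=[('rf', RandomForestRegressor(n_estimators=10, random_state=0)),
                ('ridge', RidgeCV())],
    final_estimator=RidgeCV(),
)
model = MultiOutputRegressor(stack)  # one stacked regressor per target
model.fit(X, y)
print(model.predict(X[:2]).shape)  # (2, 3): two rows, three targets
```

StackingRegressor itself handles only a single target, which is why MultiOutputRegressor wraps it rather than the other way around.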

You can estimate the best hyper-parameters for each model using GridSearchCV.

param_grid = [{'a__n_estimators': [330, 340, 350, 360, 370],
               'a__max_features': [1, 2, 3, 4, 5],
               'a__max_depth': [65, 70, 75, 80, 85, 90, 95]}]

clf = Pipeline(steps=[('scaler', StandardScaler()),
                      ('a', RandomForestRegressor())])

grid_search_RFR = GridSearchCV(clf, param_grid, cv=10, scoring='neg_mean_squared_error')
grid_search_RFR.fit(X_train, y_train)

print("Best hyperparameters:\n{}".format(grid_search_RFR.best_params_))

Moreover, you can vary the scalers for different columns and try different imputers and models for your pipelines, objectives and datasets. The results will differ substantially from one case to another.

numerical = X.columns[:-1]
num_transformer = RobustScaler()
date_feature = X.columns[-1:]
date_transformer = MinMaxScaler(feature_range=(-1, 1))

preprocessor = ColumnTransformer(transformers=[
    ('num', num_transformer, numerical),
    ('dat', date_transformer, date_feature),
])
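
A ColumnTransformer drops straight into a pipeline as the first step. A self-contained sketch on a synthetic frame (the column names `f1`, `f2`, `time` are made up for the example):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Synthetic frame: two numeric features plus an integer 'time' column.
n = 200
rng = np.random.default_rng(1)
df = pd.DataFrame({"f1": rng.normal(size=n),
                   "f2": rng.normal(size=n),
                   "time": np.arange(n, dtype="int64")})
y = df["f1"] * 2 + df["f2"]

# Different scaler per column group, selected by column name.
preprocessor = ColumnTransformer(transformers=[
    ("num", RobustScaler(), ["f1", "f2"]),
    ("dat", MinMaxScaler(feature_range=(-1, 1)), ["time"]),
])

model = Pipeline(steps=[("prep", preprocessor), ("reg", RidgeCV())])
model.fit(df, y)
print(model.predict(df.head(2)).shape)  # (2,)
```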

Predict and score the predictions

We have reached the final stage: prediction and scoring. We use the mean absolute error for each of the three targets and weigh the third one with the highest coefficient.

y_pred = pd.DataFrame(model.predict(X_test))
start = y.index.min()
end = start + pd.DateOffset(days=30)
start_idx = y.loc[start:end].shape[0]  # skip the first 30 days of observations
mae = mean_absolute_error(y_test.iloc[start_idx:], y_pred[start_idx:], multioutput="raw_values")
weights = [0.1, 0.3, 0.6]
score = np.average(mae, weights=weights)  # weighted overall score
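
A tiny worked example of the weighted scoring, with numbers chosen so the per-target errors are easy to check by hand:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]])
y_pred = np.array([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]])

# multioutput="raw_values" returns one MAE per target column.
mae = mean_absolute_error(y_true, y_pred, multioutput="raw_values")
print(mae.tolist())  # [0.0, 1.0, 2.0]

# Weighted average: the third target dominates the score.
score = np.average(mae, weights=[0.1, 0.3, 0.6])
print(score)  # 0.0*0.1 + 1.0*0.3 + 2.0*0.6 = 1.5
```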

Notebook on Kaggle:

You can contact me for professional inquiries via my social media:
