Tabular Playground Overfitting | Solvers Club

On the first day of Autumn in 2021, I finished my first Kaggle competition as a part of Solvers Club. Solvers Club has a great community of data scientists, data engineers, programmers and great people in general, who motivated me to try new ideas and blending techniques throughout the month.

In particular, I want to congratulate all the participants and especially the winner: Ivan Kontic.

Private Leaderboard: Top 10

Blending

During the competition I focused on a few simple blending techniques that improved my ranking on the public leaderboard. This is my final blending notebook: https://www.kaggle.com/pavfedotov/blending-tool-tps-aug-2021?scriptVersionId=73469190. The notebook ranked 23rd on the public leaderboard and 139th on the private leaderboard.

In this competition I explored Dask as a more efficient way of loading the CSV files. Through trial and error I also noticed that adding normally distributed noise to a submission improved its ranking on the public leaderboard tremendously. However, although such blended notebooks perform well on the public leaderboard, they tend to overfit badly and suffer a brutal shakedown on the private leaderboard. I would therefore advise caution before choosing such a notebook as a final submission.

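import dask.dataframe as dd
import numpy

# Seed NumPy's global generator before drawing the noise
numpy.random.seed(2021)


# Two earlier submissions, each with the columns `id` and `loss`
file1 = dd.read_csv("../input/tps08-public-notebook/7.84996.csv",
                    dtype={'loss': float, 'id': int})
file2 = dd.read_csv("../input/tps08-temp/7.85000 b version 17.csv",
                    dtype={'loss': float, 'id': int})

# Weighted blend: 80% of file1 and 20% of file2
# (this assumes both files list the ids in the same order)
file1['loss'] = file1.loss * 0.8 + file2.loss * 0.2

# Add zero-mean Gaussian noise (standard deviation 0.03) to every prediction;
# `meta` tells Dask the name and dtype of the resulting column
file1['loss'] = file1.loss.apply(lambda x: x + numpy.random.normal(0, 0.03),
                                 meta=('loss', 'float64'))


# single_file=True asks Dask for one CSV file rather than one file per partition
file1.to_csv('blend.csv', index=False, single_file=True)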
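The shakedown risk described above can be illustrated with a small simulation. This is only a sketch on synthetic data, not the competition data: the targets, the base model, the roughly 20/80 public/private split, the number of trials and the use of RMSE as the metric are all assumptions made for the example, and `best_noisy_copy` is a hypothetical helper name. It tries many noisy copies of one set of predictions, keeps the copy with the best public score, and prints the public and private scores side by side.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def best_noisy_copy(y_true, base_pred, public_mask,
                    n_trials=200, sigma=0.03, seed=2021):
    """Score n_trials noisy copies of base_pred and keep the best public one."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_trials):
        candidate = base_pred + rng.normal(0.0, sigma, size=base_pred.shape)
        public = rmse(y_true[public_mask], candidate[public_mask])
        if best is None or public < best["public"]:
            private = rmse(y_true[~public_mask], candidate[~public_mask])
            best = {"public": public, "private": private}
    return best


# Synthetic data only: made-up targets and a made-up base model.
data_rng = np.random.default_rng(0)
n_rows = 5000
y_true = data_rng.gamma(shape=2.0, scale=3.0, size=n_rows)
base_pred = y_true + data_rng.normal(0.0, 2.0, size=n_rows)
public_mask = data_rng.random(n_rows) < 0.2  # roughly 20% public rows

base_public = rmse(y_true[public_mask], base_pred[public_mask])
base_private = rmse(y_true[~public_mask], base_pred[~public_mask])
best = best_noisy_copy(y_true, base_pred, public_mask)

print(f"base : public {base_public:.5f}  private {base_private:.5f}")
print(f"noisy: public {best['public']:.5f}  private {best['private']:.5f}")
```

The selected copy is chosen purely because its noise happened to line up with the public rows. That noise is independent of the private rows, so there is no reason to expect the public gain to carry over to the private score.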

Conclusion

It was a lot of fun exploring and researching noise generation in Python. The main idea that I had trouble implementing in this competition, but would really like to bring to life in future Tabular Playground Series competitions as part of Solvers Club, is adding Cauchy-distributed noise. I would also like to go through the Kaggle Ensembling Guide in more detail.
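One possible way to generate Cauchy-distributed noise with NumPy is sketched below. It is not the implementation attempted in the competition: the function name `add_cauchy_noise`, the `scale` and `clip` parameters and the sample predictions are hypothetical choices for illustration. Because Cauchy draws have very heavy tails, the sketch optionally clips the noise so that a single extreme draw cannot wreck a prediction.

```python
import numpy as np


def add_cauchy_noise(predictions, scale=0.03, clip=None, seed=2021):
    """Return predictions plus zero-centred Cauchy noise with the given scale."""
    rng = np.random.default_rng(seed)
    # standard_cauchy draws from a Cauchy distribution with location 0, scale 1
    noise = scale * rng.standard_cauchy(size=len(predictions))
    if clip is not None:
        # Bound the largest perturbation to +/- clip
        noise = np.clip(noise, -clip, clip)
    return np.asarray(predictions, dtype=float) + noise


preds = np.array([7.1, 6.8, 7.4, 6.9])  # made-up predictions
noisy = add_cauchy_noise(preds, scale=0.03, clip=0.3)
print(noisy)
```

Without `clip`, occasional draws can be orders of magnitude larger than `scale`, which is the main practical difference from the normally distributed noise used in the blend.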

Finally, in Solvers Club we also take part in data science competitions on other platforms such as Zindi.Africa and Facebook AI. The club sponsors all members with cloud computing resources when we compete as a group. Hope to see you soon as part of our large team.


You can contact me for professional inquiries via my social media.
