My first participation in a Kaggle Competition as a part of my learning journey in MLZoomcamp at DataTalks.Club

When I first heard about Kaggle competitions, I was both excited and nervous. As a participant in the MLZoomcamp organized by DataTalks.Club, I knew this was a unique opportunity to learn something beyond the course material, combine it with the knowledge the course had already given me, and apply it in a real-world, competitive environment. This article shares my journey, from initial hesitation to the thrill of submission, detailing my experiences, technical challenges, and key takeaways.

Setting the Stage: MLZoomcamp and DataTalks.Club

MLZoomcamp at DataTalks.Club has been an invaluable resource for budding machine learning enthusiasts like me. Through structured lessons, hands-on projects, and peer interactions, I’ve been able to build a solid foundation in machine learning. Kaggle competitions became the next logical step in my journey—an arena where theoretical knowledge meets practical problem solving.

The competition I participated in focused on retail demand forecasting. With historical sales data provided from multiple sources (both in-store and online), along with complementary datasets like price history, markdowns, and catalog details, the task was to predict future sales. 

Diving into the Challenge


Data Loading and Merging

My first step was to load the provided datasets using the Pandas library. I had CSV files for sales, online transactions, price history, markdowns, catalog information, and store details. Merging these diverse datasets was no small feat—each file had its quirks (like different column names or formats) that needed harmonizing.
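
For illustration, here is a rough sketch of what that loading step looked like with Pandas. The file names, column names, and the renaming step below are placeholders rather than the competition's exact schema.

    import pandas as pd

    # Load each source; parsing date columns up front makes later feature work easier
    sales = pd.read_csv("sales.csv", parse_dates=["date"])
    online = pd.read_csv("online.csv", parse_dates=["date"])
    price_history = pd.read_csv("price_history.csv", parse_dates=["date"])
    markdowns = pd.read_csv("markdowns.csv", parse_dates=["date"])
    catalog = pd.read_csv("catalog.csv")
    stores = pd.read_csv("stores.csv")

    # Harmonize differing column names before merging (hypothetical example)
    online = online.rename(columns={"shop_id": "store_id"})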

I learned the importance of using left joins when merging the test set with additional features to preserve all the rows. This practice ensured that no potential test data was lost during preprocessing, which is critical for maintaining the integrity of the submission file.
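
As a small sketch of that idea (the key columns here are assumptions), merging with how="left" keeps the test set's row count intact, and a quick assertion catches accidental row loss or duplication:

    test = pd.read_csv("test.csv", parse_dates=["date"])
    n_test_rows = len(test)

    # Left joins keep every test row even when a lookup table has no match
    test = test.merge(price_history, on=["item_id", "date"], how="left")
    test = test.merge(catalog, on="item_id", how="left")
    test = test.merge(stores, on="store_id", how="left")

    # Duplicate keys on the right-hand side would inflate this count, so check it
    assert len(test) == n_test_rows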

Data Cleaning and Feature Engineering

Data cleaning was essential. I had to remove duplicates, handle missing values, and convert date strings into proper datetime objects. Once my data was clean, I began feature engineering. Creating date features (year, month, day, day of week) and lag features (previous day’s sales and 7-day rolling averages) significantly boosted my model’s performance. These engineered features allowed my model to capture trends and seasonality patterns in the retail sales data.
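
A simplified version of those features might look like this, assuming a merged DataFrame df with date, store_id, item_id, and sales columns (the names are mine, not the competition's):

    df = df.sort_values(["store_id", "item_id", "date"])

    # Calendar features
    df["year"] = df["date"].dt.year
    df["month"] = df["date"].dt.month
    df["day"] = df["date"].dt.day
    df["day_of_week"] = df["date"].dt.dayofweek

    # Lag and rolling features, computed within each store/item series
    group = df.groupby(["store_id", "item_id"])["sales"]
    df["sales_lag_1"] = group.shift(1)
    df["sales_roll_7"] = group.transform(lambda s: s.shift(1).rolling(7).mean())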

Model Training and Validation

For modeling, I chose XGBoost, a gradient-boosted tree algorithm known for strong performance on tabular data. Using scikit-learn’s GridSearchCV with a time series split (TimeSeriesSplit), I tuned hyperparameters such as the number of estimators, max depth, and learning rate. The validation process, which involved reserving the last month of data as a test period, simulated real-world forecasting scenarios where the model predicts future data based on past trends.
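
Here is a condensed sketch of that setup, continuing the earlier example; the feature list and parameter grid are illustrative rather than the exact values I used:

    from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
    from xgboost import XGBRegressor

    features = ["year", "month", "day", "day_of_week", "sales_lag_1", "sales_roll_7"]
    train = df.dropna(subset=["sales_lag_1", "sales_roll_7"])

    param_grid = {
        "n_estimators": [200, 500],
        "max_depth": [4, 6, 8],
        "learning_rate": [0.05, 0.1],
    }

    search = GridSearchCV(
        XGBRegressor(objective="reg:squarederror", random_state=42),
        param_grid,
        cv=TimeSeriesSplit(n_splits=5),  # folds never train on the future
        scoring="neg_root_mean_squared_error",
    )
    search.fit(train[features], train["sales"])
    print(search.best_params_)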

There were technical challenges along the way, including compatibility issues between newer versions of scikit-learn and XGBoost. After some troubleshooting, which meant downgrading scikit-learn to a compatible version in Google Colab, I was able to proceed smoothly. I later found out that I could also have resolved the problem by upgrading XGBoost to version 2.0 or later.
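
In Colab, either workaround is a one-line install; the scikit-learn version below is only an example of an older release, not a specific recommendation:

    !pip install "scikit-learn==1.5.2"   # option 1: pin scikit-learn to an older release (example version)
    !pip install "xgboost>=2.0"          # option 2: upgrade XGBoost instead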

Final Predictions and Submission

Once the model was trained and validated, the next step was to process the test set. The test data had around 884,000 rows, and I had to ensure that all rows were preserved during merging and feature engineering. I carefully applied lag feature calculations and imputed missing values rather than dropping rows, which guaranteed that my final test DataFrame matched the expected shape.
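
Continuing the earlier sketch, the key was to fill the gaps created by the lag features instead of dropping rows; the fill strategy below is a simple illustration:

    # Lag and rolling features are undefined at the start of each series,
    # so impute them rather than dropping the affected test rows
    for col in ["sales_lag_1", "sales_roll_7"]:
        test[col] = test[col].fillna(0)

    # The row count must still match the original test set (~884,000 rows)
    assert len(test) == n_test_rows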

Finally, I generated predictions, formatted them according to the sample submission file, and submitted my results to Kaggle. The feeling of having my work evaluated on a public leaderboard for the first time was exciting, even though I finished only just above the sample_submission cut-off score :D I promised myself to learn the finer nuances of time series forecasting and to do better in future competitions!
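
For completeness, here is a minimal sketch of that last predict-and-submit step; the column names follow my earlier placeholders and the sample submission format, so treat them as assumptions:

    preds = search.best_estimator_.predict(test[features])

    submission = pd.read_csv("sample_submission.csv")
    submission["sales"] = preds.clip(min=0)  # demand should not go negative
    submission.to_csv("submission.csv", index=False)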

Lessons Learned

  1. Data Preparation is Crucial:
    Merging diverse datasets and engineering features like lags and rolling means can make or break your model. Attention to detail during these early steps pays off later.

  2. Respect the Time Dimension:
    Time series data requires careful handling. I learned that preserving the order of data by using time-based splits is essential for realistic model validation.

  3. Troubleshooting is Part of the Process:
    Dealing with library version conflicts (such as Scikit-learn vs. XGBoost issues) is common in real-world projects. Patience and resourcefulness—reading release notes, adjusting versions, or applying workarounds—are vital skills.

  4. Continuous Learning:
    Participating in this Kaggle competition was not just about the final score. The process taught me valuable lessons about data cleaning, feature engineering, hyperparameter tuning, and problem solving under pressure.

Conclusion

My first Kaggle competition as part of MLZoomcamp at DataTalks.Club was a steep but rewarding learning curve. The hands-on experience helped me consolidate my theoretical knowledge, and the challenges I encountered have only motivated me to dive deeper into the world of demand forecasting with time series.

I encourage anyone who is curious about real-world ML applications to participate in a Kaggle competition—even if it means facing a few technical hurdles along the way. Every error is an opportunity to learn, and every submission is a step closer to becoming a more skilled data scientist.

Feel free to leave your thoughts and questions in the comments below—I'd love to hear about your experiences or any tips you might have for first-time Kagglers!


