Data leakage in applied ML: reproducing examples of irreproducibility

Exploring Data Leakage in Machine Learning Education

Hello,

I am Kyrillos Ishak, and I am happy to be part of SOR 2024. My proposal was accepted, and I will be working on the project "Data leakage in applied ML: reproducing examples of irreproducibility".

I am excited to work with Fraida Fund and Mohamed Saeed as my mentors. The objective of the project is to develop educational resources that professors and instructors can adapt to explain specific data leakage problems. This involves reproducing selected research papers that contain data preprocessing issues, then fixing those issues to demonstrate how they affect the results.

Data leakage is a problem that occurs when information from outside the training dataset is used to create the model. It leads to overly optimistic performance estimates and, ultimately, models that do not perform well on new, unseen data.
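To make this concrete, here is a minimal sketch of one common leakage pattern: performing feature selection on the entire dataset before splitting it into train and test sets. The dataset here is pure random noise, so honest accuracy should be near chance (0.5); the dataset sizes and the SelectKBest setup are illustrative choices of mine, not taken from any specific paper.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2000))   # random features: no real signal
y = rng.integers(0, 2, size=200)   # random binary labels

# Leaky: select the 20 "best" features using ALL rows, including
# the rows that will later form the test set.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
X_tr, X_te, y_tr, y_te = train_test_split(X_leaky, y, random_state=0)
leaky_acc = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)

# Correct: split first, then fit the selector on training rows only
# (the Pipeline keeps the selector inside the training data).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression())
correct_acc = model.fit(X_tr, y_tr).score(X_te, y_te)

print(f"leaky accuracy:   {leaky_acc:.2f}")   # optimistically high
print(f"correct accuracy: {correct_acc:.2f}") # near chance, ~0.5
```

Even though the data contains no real signal, the leaky pipeline reports high test accuracy because the test rows already influenced which features were kept. This is exactly the kind of subtle preprocessing mistake the project aims to reproduce and explain.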

Despite the importance of addressing data leakage, many practitioners from fields outside computer science are unfamiliar with it, even when they know general best practices for data preprocessing. Educational materials on this topic will greatly benefit them.

I am excited to dive into the topic of data leakage in machine learning. Throughout the summer, I will share regular updates and blog posts on this subject. Stay tuned for more!

Kyrillos Ishak
Computer Engineering student interested in ML