
Machine Learning Must Know — Preparing the Modeling Data

7 min read · Jan 20, 2025

At first glance, this topic may seem basic, but I’ve noticed that most machine learning books overlook it. Many books explain how to split data randomly into training, validation, and test sets, then move straight on to more advanced topics such as k-fold cross-validation. But here’s the question: how do we prepare the modeling data in the first place?

For example, a credit card company may have billions of transactions, but that is raw data, not data ready for modeling. Similarly, a lender may have millions of mortgage loan applicants, but the raw records alone are not enough to build a model. So how do we transform raw data into modeling data? In many cases, the success of a data science project depends on how well the training data is defined. That’s why I decided to write this article, which explores the topic in depth using the mortgage default case as a real-world example.

This article is focused on “Feature Engineering for Credit Card Fraud Detection” and “Feature Engineering for Healthcare Fraud Detection.” Fraud detection models help catch fraud early and prevent losses. Raw transaction records, often numbering in the billions, cannot be used directly to build a model. Instead, they must be carefully prepared and transformed into a structured format suitable for model building.
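To make the idea of a "structured format" concrete, here is a minimal sketch of collapsing raw transactions into one row per account. It is an illustration under stated assumptions, not the method used later in this article: the field names (`account_id`, `amount`, `timestamp`), the sample values, and the helper `build_modeling_rows` are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw transactions, one record per card swipe.
# All values are invented for illustration.
raw_transactions = [
    {"account_id": "A", "amount": 20.0, "timestamp": datetime(2025, 1, 1, 9, 0)},
    {"account_id": "A", "amount": 35.0, "timestamp": datetime(2025, 1, 2, 13, 30)},
    {"account_id": "A", "amount": 500.0, "timestamp": datetime(2025, 1, 3, 2, 15)},
    {"account_id": "B", "amount": 12.0, "timestamp": datetime(2025, 1, 1, 10, 0)},
    {"account_id": "B", "amount": 18.0, "timestamp": datetime(2025, 1, 4, 17, 45)},
]


def build_modeling_rows(transactions):
    """Collapse raw transactions into one feature row per account."""
    # Group the raw records by the unit we want to model (here: the account).
    grouped = defaultdict(list)
    for txn in transactions:
        grouped[txn["account_id"]].append(txn)

    # Summarize each group into a fixed set of columns.
    rows = []
    for account_id, txns in sorted(grouped.items()):
        amounts = [t["amount"] for t in txns]
        rows.append({
            "account_id": account_id,
            "txn_count": len(amounts),
            "total_amount": sum(amounts),
            "max_amount": max(amounts),
            "last_seen": max(t["timestamp"] for t in txns),
        })
    return rows


modeling_rows = build_modeling_rows(raw_transactions)
for row in modeling_rows:
    print(row)
```

The point of the sketch is the change in grain: the input has one row per transaction, while the output has one row per account with a fixed set of columns, which is the shape most modeling libraries expect. A real modeling table would also need a target label and a clearly defined observation window, which this sketch leaves out.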


Written by Chris Kuo/Dr. Dataman
