When building new models or algorithms in machine learning, the expectation is that they will perform well on test data, not solely on the data they were trained on. Unfortunately, this doesn't always happen, and our models don’t come out the way we would like. This can stem from data quantity, data quality, over- or under-training, algorithm choice, tuning, and bias.
What is Data Preparation for Machine Learning
Data preparation, or data preprocessing, is the process of transforming collected raw data into cleaner, more effective data that machine learning algorithms can be trained on to extract insights and make predictions.
Data preparation addresses issues such as:
- Too much data: Data irrelevant to the machine learning algorithm can negatively affect performance. It is smart to identify the necessary data needed to train your machine-learning model and omit everything else. We don’t want our model to train on a variable we didn’t intend it to (unless that is something you choose to uncover).
- Missing data: Data in a CSV or spreadsheet might be left blank, perhaps because there is no value attributed to it. For example, Flight Delay Reasons can span multiple columns, but a delay is attributed to only a single reason, leaving the other columns blank. These will appear as NULL or NaN, and it is up to the preparation step to change these values to a numerical 0.
- Formatting Errors: Our computers love numbers (integers) and work much harder with words (strings). So feeding our machine learning models with numbers is much more efficient. Making categorical data into binary numbers can greatly increase efficiency.
- Inconsistency: This goes back to the previous need: sometimes strings get garbled while still representing the same data point. Standardizing these data points cleans your data.
- Feature Engineering: This should be the last thing done to your dataset: tailoring the data to match the inputs and outputs of your machine learning model. Think about all the data you will normally obtain and feed your model, and decide what you want your model to predict.
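As a rough illustration of the inconsistency issue in the list above, here is a minimal sketch in Pandas. The airline names are hypothetical, and stripping whitespace plus normalizing case is only one possible way to standardize strings.

```python
import pandas as pd

# Hypothetical airline names with inconsistent casing and spacing
airlines = pd.Series(["Delta", "delta ", " DELTA", "United", "united"])

# Strip surrounding whitespace and normalize case so variants collapse to one label
cleaned = airlines.str.strip().str.title()

print(cleaned.unique())
```

Real data may need more than this (for example, a manual mapping for abbreviations), so treat it as a starting point.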
Why is Data Preparation Essential
Data preparation techniques are essential not only for improving model performance but also for enhancing model interpretability and robustness. Handling missing values, removing outliers, and scaling data can help reduce overfitting, resulting in models that generalize better to new data. Understanding your variables and how they represent data is highly important in deciding the purpose of the model.
It's important to note that the specific techniques required for a given dataset depend on the nature of the data and the algorithm's requirements. Don’t get overzealous, though; sometimes too much preprocessing can negatively impact the performance of your machine learning model.
Exploring and editing the data is a critical step in the machine learning workflow, ensuring that the data is error-free and in a format that the algorithm can comprehend. In this article, we'll discuss the need to view and clean your data and provide an example of how to preprocess data using Pandas.
By cleaning and formatting the data, we can ensure that the algorithm is considering only relevant information, leading to a more accurate and robust model. It can also help to prevent overfitting, where the model performs well on the training data but poorly on new, unseen data. This happens when the model fits the training data so closely that it learns its quirks rather than general patterns. When the model is deployed, the data is different, and the overfitted model can make inaccurate assumptions based on bias in the data it was trained on.
Also, by exploring the data, you can gain a better understanding of what the data represents. This is key to developing a model that does exactly what you need.
Many machine learning examples use CSV files for data science. To keep things simple, CSV files are the most basic to learn before delving deeper into other file types like images.
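A first look at a CSV file in Pandas might be sketched like this. The column names and values are hypothetical, and the data is read from an in-memory string so the example runs on its own; with a real file you would pass its path to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Hypothetical flight data standing in for a real CSV file
csv_text = """Airline,State,DepHour,DelayMinutes
Delta,GA,9,15
United,IL,14,
Delta,NY,18,0
American,TX,7,120
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)    # number of rows and columns
print(df.dtypes)   # column types as Pandas inferred them
print(df.head())   # first few rows
```

The blank `DelayMinutes` cell in the second row is read as NaN, which is the kind of thing this first look is meant to surface.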
Our preprocessing data exploration includes the need to address:
- Trimming Data
- Missing Values
- Numerical and Categorical Values
- Binary Values
- Column Types (ints vs strings vs objects)
First, we have to decide on our approach and the question we want this data to answer. Let’s say we want to evaluate whether a flight will be delayed.
1. Trimming Down Data
When we train our machine learning model, the model will only have access to specific data points. Values that seem irrelevant to the question can be removed to keep the file clean, tidy, and purposeful.
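Trimming can be as simple as selecting the columns you want or dropping the ones you don't. The sketch below assumes a hypothetical dataset in which `FlightNumber` and `TailNumber` are judged irrelevant to predicting a delay; which columns are actually irrelevant depends on your question.

```python
import pandas as pd

# Hypothetical flight data
df = pd.DataFrame({
    "FlightNumber": [101, 202, 303],
    "TailNumber": ["N1", "N2", "N3"],
    "Airline": ["Delta", "United", "Delta"],
    "DepHour": [9, 14, 18],
    "Delayed": [1, 0, 1],
})

# Option 1: keep only the columns relevant to the question
keep = ["Airline", "DepHour", "Delayed"]
trimmed = df[keep]

# Option 2: drop the columns you do not need (same result here)
trimmed_alt = df.drop(columns=["FlightNumber", "TailNumber"])

print(trimmed.columns.tolist())
```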
2. Check for Missing Values
Missing values can be attributed to errors made when the data was entered. Check to see if the errors make a significant dent in your entire dataset. If there are only a few errors, it is smart to feed your machine learning model only data that did not have a typo, so you can remove these rows from the dataset.
Sometimes missing values show up as 'NaN' when reading CSV files in Pandas, which means the field in your spreadsheet had nothing in it. These missing values may stand for a number such as zero. For example, “Delay Time” cells may be left blank to indicate no delay. These cells can be rewritten to hold a 0.
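The two cases above (drop rows with entry errors, fill blanks that really mean zero) could look roughly like this in Pandas. The column names are hypothetical, and treating a blank `DelayMinutes` as "no delay" is an assumption you would need to confirm for your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing airline name, two blank delay values
df = pd.DataFrame({
    "Airline": ["Delta", None, "United", "Delta"],
    "DelayMinutes": [15.0, 30.0, np.nan, np.nan],
})

# Count missing values per column to judge how much of the data is affected
missing_counts = df.isna().sum()
print(missing_counts)

# Drop rows where a required field is missing (assumed here to be an entry error)
df = df.dropna(subset=["Airline"])

# A blank delay is assumed to mean "no delay", so fill it with 0
df = df.fillna({"DelayMinutes": 0})
print(df)
```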
You might be answering a specific question, so filtering your data to match the situation you are testing or predicting is key to feeding your model the correct data. For example, if you are predicting delay times for flights based on reason, remove all flights that were on time and keep the flight data that actually had a delay.
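Filtering rows like this is usually done with a boolean mask. A minimal sketch, assuming a hypothetical `DelayMinutes` column where 0 means the flight was on time:

```python
import pandas as pd

# Hypothetical flight data
df = pd.DataFrame({
    "Airline": ["Delta", "United", "Delta", "American"],
    "DelayMinutes": [0, 45, 12, 0],
})

# Keep only the flights that actually had a delay
delayed = df[df["DelayMinutes"] > 0]
print(delayed)
```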
Verify whether outliers are detrimental to your system or representative of your dataset. Visualization can help in identifying outliers. Histograms are a good way to map out the range of values in your CSV file. Take this histogram:
Looking at all the different Delay reasons, the histogram shows not only the distribution but also the range. These values are predominantly below 300, but the histogram shows the maximum at large numbers like 2000. This means that there are times in which the delay was 2000.
Check these outliers to see if they are plausible. Sometimes flight delays push departures and arrivals to the day after the scheduled flight, so these outliers are, in fact, representative. Whether to eliminate them is up to your discretion: they can skew your model’s predictions, but they may also carry real information about how long a delay can be.
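If you want the numbers behind a histogram without plotting, something like the following can help. The delay values, the bin edges, and the cutoff of 300 are all made up for illustration (loosely echoing the figures above); choosing a cutoff is a judgment call, not a rule.

```python
import numpy as np
import pandas as pd

# Hypothetical delay values with one extreme entry
delays = pd.Series([5, 12, 20, 45, 60, 90, 150, 280, 2000], name="DelayMinutes")

# Summary statistics show the range at a glance (min, max, quartiles)
print(delays.describe())

# Bin counts are the numbers a histogram would plot
counts, edges = np.histogram(delays, bins=[0, 100, 200, 300, 2100])
print(counts)

# Flag values above an assumed cutoff for manual review rather than deleting them outright
cutoff = 300
outliers = delays[delays > cutoff]
print(outliers)
```

Flagging first and deciding later keeps representative outliers available if you conclude they belong in the dataset.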
4. Numerical and Categorical Data
Verify that the values in your columns match what they are supposed to be. Sometimes an entire column is classified as an object (or string) instead of a number (int64) because a single cell is incorrect. For example, a column of all zeros with one cell containing a dash is cast to an object. Change these cells to their appropriate value using code, by changing the parent file, or by specifying what values count as numeric.
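One way to repair such a column in code is `pd.to_numeric` with `errors="coerce"`, which turns unparseable cells into NaN. The column name below is hypothetical, and replacing the dash with 0 assumes the dash meant "no delay".

```python
import pandas as pd

# A column that should be numeric, but one cell contains a dash
df = pd.DataFrame({"WeatherDelay": ["0", "0", "-", "15"]})
print(df["WeatherDelay"].dtype)  # not a numeric type yet

# Convert to numbers; anything unparseable (the dash) becomes NaN
df["WeatherDelay"] = pd.to_numeric(df["WeatherDelay"], errors="coerce")

# Assumption: the dash meant "no delay", so fill with 0 and store as integers
df["WeatherDelay"] = df["WeatherDelay"].fillna(0).astype("int64")
print(df["WeatherDelay"].dtype)
```

Alternatively, `pd.read_csv` accepts an `na_values` argument (for example `na_values=["-"]`) so that such cells are read as missing from the start.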
Categorical data are labeled as objects, something that machine learning algorithms aren’t very fond of. Converting these values to numbers can greatly improve model training and is super easy with the pd.get_dummies() function. For example, Airline names will be objects or strings. With get_dummies, you can make a column for every unique string, which applies one-hot encoding to them.
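A small sketch of `pd.get_dummies` on a hypothetical `Airline` column follows. Passing `dtype=int` asks for 0/1 integers instead of booleans; that is a choice made for this example, not a requirement.

```python
import pandas as pd

# Hypothetical flight data with one categorical column
df = pd.DataFrame({
    "Airline": ["Delta", "United", "Delta", "American"],
    "DelayMinutes": [15, 0, 30, 120],
})

# One new 0/1 column per unique airline; the original Airline column is replaced
encoded = pd.get_dummies(df, columns=["Airline"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```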
5. Prepping for Model Feature Engineering
Designing your data to match your machine learning model is the most conceptual step, but it is fairly simple too. You’ve already decided what you want to solve and gathered the data to solve it; now all you have to do is format your data in such a way that your machine learning algorithms can do their best job.
Assign the variables that will be fed to the model such as state, airline, time of day, and day of the year (X-values) to predict an output: whether a flight will be delayed (Y-value).
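Splitting a prepared table into X-values and a Y-value might be sketched as follows. All column names and values are hypothetical, and defining "delayed" as any delay greater than zero is an assumption made for this example.

```python
import pandas as pd

# Hypothetical cleaned flight data
df = pd.DataFrame({
    "State": ["GA", "IL", "NY", "TX"],
    "Airline": ["Delta", "United", "Delta", "American"],
    "DepHour": [9, 14, 18, 7],
    "DayOfYear": [12, 45, 200, 310],
    "DelayMinutes": [15, 0, 30, 0],
})

# Y-value: 1 if the flight was delayed at all, else 0 (assumed definition)
y = (df["DelayMinutes"] > 0).astype(int)

# X-values: drop the column the target was derived from, then one-hot encode the strings
X = pd.get_dummies(
    df.drop(columns=["DelayMinutes"]),
    columns=["State", "Airline"],
    dtype=int,
)

print(X.shape, y.tolist())
```

Dropping `DelayMinutes` from X matters here: leaving it in would hand the model the answer it is supposed to predict.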
In conclusion, pre-processing data is a crucial step in the machine learning workflow, as it can improve the accuracy of the model, reduce training time and resources, and improve the model's interpretability. While it's important to keep in mind that the specific pre-processing techniques and extent of pre-processing will depend on the data's nature and algorithm requirements, pre-processing data is generally an essential step toward building a better machine learning model.