Data Pre-processing in Machine Learning

Tolulade Ademisoye
6 min read · Jan 20, 2024

A prequel to my recommender systems series

According to Technology Magazine and other sources, the global big data market is projected to be worth almost US$400bn by 2030. In a Capgemini survey of 210 executives based in Europe and the USA, a quarter reported that their big data and analytics initiatives were profitable, while 12% reported losses.


Data scientists and analysts spend more than 60% of their data-preparation time on pre-processing, particularly with raw data. Despite the critical nature of this phase, company executives or management might not provide the necessary support to turn that data into business value. So, what do you do? You strike a balance between the business and technical goals.


Data pre-processing is crucial in machine learning and determines the robustness and quality of your model. In simple terms, machine learning takes in raw data (text, audio, video, etc.) and learns its properties and behaviour in order to make predictions from that learned history. Before this process occurs, however, you need to ensure that the right, accurate data are used; otherwise, errors will carry through to your results and your model.

Let’s delve into the fundamental process of data pre-processing in machine learning. This write-up serves as a prequel to my series on look-alike modelling and recommender systems.

Pre-processing in Machine Learning & Its Importance

A reference highlights: “It’s also worth mentioning that preprocessing techniques are not only important for improving the performance of the model but also for making the model more interpretable and robust.” This statement holds true.

Pre-processing takes place during the data preparation phase, before analytics can commence. Picture inconsistent headers or data types in your dataset: a column labelled customer location that actually holds revenue figures, or vice versa. That’s a disaster.


Initial Data Analysis

(Image: information on a Parquet/Spark dataset)

Before commencing your data pre-processing, gain insight into the nature, shape, and details of your dataset. This step can be termed initial data analysis.

(Image: info on a CSV dataset)

As illustrated in the images above, a few basic Python commands can be executed to obtain the vital information that guides your chosen route or method for data pre-processing.

What information should you be looking out for here?

  1. Are there empty cells in some columns? Null values?
  2. What is the total number of columns and rows for this data?
  3. Based on the dataset’s information, do the column names/headers match? Will I need to change some header names?
  4. What are the column data types? Is a string placed in a numeric column?
  5. And so on.

This process and checklist help to improve the accuracy of your model.
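As a minimal sketch, suppose the data lives in a CSV file and is loaded into a pandas DataFrame named dataset (the name used later in this article); the file name here is hypothetical. These commands answer most of the questions on the checklist:

import pandas as pd

# hypothetical file name - point this at your own data
dataset = pd.read_csv("your_data.csv")

dataset.info()          # column names, data types, and non-null counts
print(dataset.shape)    # (number of rows, number of columns)
print(dataset.head())   # preview the first five rows
print(dataset.dtypes)   # data type of each column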

To discover missing values in your dataset, several approaches can be employed. The diagram above uses a common one:

# isna() flags the missing values
# sum() then counts them up per column
dataset.isna().sum()

# we could also write just the boolean mask
dataset.isna()

After careful review and correction, you can then proceed to the next phase in your machine-learning workflow.


Approaches to Data Pre-processing

Having gone through the rigorous process of identifying outliers and spotting unformatted, inconsistent data, the next step is to clean and format the dataset. This is a crucial step, particularly when dealing with tabular or text datasets in machine learning.

Data Cleaning

Data cleaning involves correcting the errors you have spotted, dropping rows or columns, handling missing values, and supplying headers/column names if they are not available, as sketched below. I previously shared a similar process on GitHub.
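As a hedged sketch of those operations in pandas (the column names here are hypothetical, not from the project described next):

# Drop rows where every value is missing, then drop an unneeded column
dataset = dataset.dropna(how="all")
dataset = dataset.drop(columns=["unused_column"])

# Fill remaining gaps in a numeric column with its median
dataset["revenue"] = dataset["revenue"].fillna(dataset["revenue"].median())

# Supply headers if the file was read without them
dataset.columns = ["customer_id", "location", "revenue"]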

In a project I worked on in 2023, the dataset comprised tabular data containing punctuation marks that necessitated cleaning and removal.

Removing punctuation marks

# data cleaning - removing punctuation marks

import string

# Lambda to strip punctuation marks, guarding against missing (None) values
remove_punct = lambda x: x.translate(str.maketrans('', '', string.punctuation)) if x is not None else None

# column_to_split is a DataFrame of string columns;
# remove punctuation marks from each column
for col in column_to_split.columns:
    column_to_split[col] = column_to_split[col].apply(remove_punct)

# Print the updated DataFrame
print(column_to_split)

The extent and style of your data cleaning may vary depending on the irregularities identified in your dataset.

Data Formatting

Imagine your dataset has a column named “Gender,” and the entries in that column appear in various formats such as “Male,” “male,” “maLe,” “Female,” “FEmale,” etc. The inconsistent format poses a challenge during the data exploratory phase: a filter for the ‘male’ fields won’t capture the other representations. The same problem arises when dates appear in different formats across your data.

The above is an example of an invalid data format.
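A minimal sketch of fixing both inconsistencies in pandas, assuming columns named Gender and signup_date (hypothetical names):

# Normalise the mixed-case gender entries to one representation:
# "male", "maLe", and "MALE" all become "Male"
dataset["Gender"] = dataset["Gender"].str.strip().str.capitalize()

# Parse mixed date strings into a single datetime type
dataset["signup_date"] = pd.to_datetime(dataset["signup_date"])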


In practice, data formatting commences once you have cleaned your data. “It helps convert the data into a more usable format by machine learning algorithms.” Data formatting can take various forms, though, and may mean different things to different analysts and engineers. Some data scientists or engineers address it during the cleaning process, which is acceptable, while others handle it at this stage.

What’s crucial at this point is ensuring your dataset is in the right format (CSV, Excel, Parquet, etc.) before proceeding to the modelling phase.
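For instance, once cleaned, the DataFrame can be written out in whichever of those formats your modelling pipeline expects (the file names here are illustrative):

dataset.to_csv("clean_dataset.csv", index=False)
dataset.to_parquet("clean_dataset.parquet")  # needs pyarrow or fastparquet installed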

Data Sampling & Feature Scaling

To avoid bias in your model, it is essential to ensure a balanced distribution of the population in your dataset. Consider a scenario where you’re constructing a model with one of the key features being Gender, and your dataset exhibits an imbalance, with only 20% representing one gender compared to the other.

As elucidated in Express Analytics, “One of the most important things you can do when working with data is to ensure you’re sampling it properly. This means that you’re taking a representative sample of the data rather than just grabbing whatever data is available…”
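One common way to take a representative sample when splitting data is a stratified split; here is a sketch with scikit-learn, reusing the hypothetical Gender column from above:

from sklearn.model_selection import train_test_split

# stratify keeps the Gender proportions identical in the train and test sets
train_set, test_set = train_test_split(
    dataset,
    test_size=0.2,
    stratify=dataset["Gender"],
    random_state=42,
)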

Feature Scaling or Standardisation

import pandas as pd
from sklearn.preprocessing import StandardScaler

# numerical_features is the list of numeric column names in `dataset`
scaler_new = StandardScaler()

# Fit the scaler to the data and transform it (returns a NumPy array)
scaled_features = scaler_new.fit_transform(dataset[numerical_features])

# Wrap the array back into a DataFrame with the original column names
scaled_features = pd.DataFrame(scaled_features, columns=numerical_features)

This process is primarily applied to the numerical data or independent variables in your dataset. Strictly speaking, normalisation rescales values into a fixed range, such as 0–1 or 0–100, while standardisation (what StandardScaler above performs) recentres each feature to zero mean and unit variance. Either way, putting features on a comparable scale decreases variability between them, makes comparison and analysis more manageable, and ensures the features your model receives share similar properties.
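If you specifically want values inside a fixed range such as 0–1, scikit-learn’s MinMaxScaler is the usual counterpart to the StandardScaler shown above; a brief sketch:

from sklearn.preprocessing import MinMaxScaler

# Rescale each numerical feature into the 0-1 range
# (use feature_range=(0, 100) for a 0-100 range instead)
minmax = MinMaxScaler()
scaled_0_1 = minmax.fit_transform(dataset[numerical_features])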


Now that you’ve completed your data pre-processing, you can proceed to Feature Engineering!

If this has been helpful, let me know. Please reach out for machine learning consultancy; you may also buy me a coffee to support my work.

Tolulade.

Reference reading

https://technologymagazine.com/articles/big-data-market-be-worth-400bn-by-2030-driven-by-ai-and-ml
