DATA CLEANING

Why is Data Cleaning important?

Because useful AI models require lots of training data, the quality of that data matters greatly. If the data contains mistakes, is incomplete, or is repetitive, for example, it can skew the AI model. To help an AI model learn correctly, the data needs to be checked and corrected before use. This process is known as data cleaning.

Imagine trying to build a car using broken pieces and parts that don't fit together. The project is destined to fail. Training an AI model without cleaning the data is like building that car with the broken parts. Bad data can confuse an AI model and cause it to make mistakes. For example, if we are teaching an AI model to recognise animals in pictures but some pictures are blurry or incorrectly labelled, the AI model might learn the wrong thing. That’s why cleaning the data is vital for building robust and useful AI models.

Types of Data Cleaning for AI

Some common methods of data cleaning:

  1. Noise Reduction: When data has random or unwanted information that doesn’t help the AI model learn, this is called "noise." For example, if you’re training an AI model to recognise faces, but some of the pictures you use have lots of background clutter (e.g. cars, trees, or signs), that extra information can confuse the AI model. Noise reduction helps get rid of these distractions so the AI model can focus on what matters most—the face in this case.

  2. Duplication Detection: When the same piece of data shows up more than once, it can waste time and slow down the AI model's learning. For example, if you have two copies of the same photo of a dog in your dataset, the AI model learns the same thing twice, making the training process inefficient.

  3. Fixing Missing Data: If some parts of the data are blank, the AI model might fail to learn everything it needs. For example, if a picture of a dog is missing its label ("dog"), the AI model will not know what it’s looking at. In data cleaning, we either fill in the missing information or remove the incomplete data.

  4. Correcting Errors: Data can contain mistakes, like spelling errors, wrong numbers, or mislabelled items, and these need to be found and fixed. For example, if a picture of a cat is labelled as a dog, the AI model could get confused and learn the wrong thing.

  5. Standardising Data: Data might come in different formats, and an AI model can get confused if the same thing is written in different ways. For instance, some records might use "1" to mean "yes" while others use the word "yes" itself. When the dataset is standardised to a single format, the AI model understands it better.
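
To make some of these methods concrete, here is a minimal Python sketch of duplication detection, fixing missing data (by removing incomplete records), and standardising data. The dataset, the file names, and the function names `standardise_flag` and `clean` are all made up for illustration. Noise reduction and correcting mislabelled items are not shown, since they usually need a person or a separate tool to inspect the data itself.

```python
# Hypothetical toy dataset: each record is (image_file, label, is_pet).
raw_records = [
    ("dog_001.jpg", "dog", "yes"),
    ("dog_001.jpg", "dog", "yes"),   # exact duplicate of the record above
    ("cat_001.jpg", "Cat ", 1),      # untidy label, and 1 used to mean "yes"
    ("dog_002.jpg", None, "no"),     # missing label
    ("cat_002.jpg", "cat", 0),       # 0 used to mean "no"
]

# Assumed spellings of yes/no that may appear in the data.
YES_VALUES = {"yes", "1", "true"}
NO_VALUES = {"no", "0", "false"}


def standardise_flag(value):
    """Map the different ways of writing yes/no onto a single True/False."""
    text = str(value).strip().lower()
    if text in YES_VALUES:
        return True
    if text in NO_VALUES:
        return False
    raise ValueError(f"unrecognised flag: {value!r}")


def clean(records):
    """Return the records with gaps removed, formats unified, duplicates dropped."""
    seen = set()
    cleaned = []
    for image_file, label, is_pet in records:
        # Fixing missing data: this sketch removes records with no label.
        if label is None or not str(label).strip():
            continue
        # Standardising data: tidy the label text and unify the yes/no format.
        row = (image_file, str(label).strip().lower(), standardise_flag(is_pet))
        # Duplication detection: skip any record we have already kept.
        if row in seen:
            continue
        seen.add(row)
        cleaned.append(row)
    return cleaned


cleaned_records = clean(raw_records)
for record in cleaned_records:
    print(record)
```

In this sketch, five raw records become three clean ones: the duplicate and the unlabelled record are dropped, and the remaining labels and yes/no values share one format. Removing the unlabelled record is only one option; as noted above, the missing label could instead be filled in by hand.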