Curate High Quality Datasets
  • 15 Oct 2024
  • 5 Minutes to read


This article applies to these versions of LandingLens:

LandingLens, LandingLens on Snowflake

It's important to consider the images you’re feeding your model. In this section, you’ll learn best practices and tips to consider when selecting images for your computer vision project.

Use High-Quality Images and Dump the Rest

A model will only be as good as the dataset you train it on. Think of the phrase, "Garbage in, garbage out." If you feed your model low-quality images, it will likely perform poorly. On the other hand, if you feed it high-quality images, it will likely perform well.

LandingLens is built on a data-centric approach. This means it’s better to have fewer high-quality images than many poor-quality ones.

What Are High-Quality Images?

When we talk about high-quality images, we don’t just mean resolution. Several factors determine image quality, including:

  • Variability: Your images should represent different real-life variations of the objects of interest.
  • Consistency: The lighting, perspective, and background should be consistent.
  • Clarity: Ensure your images are free from blurriness, noise (meaningless data), and similar artifacts.
  • Background: Try to keep the background as uncluttered as possible so the model can focus better on the object of interest.
  • Resolution: Images should have a sufficient resolution to capture intricate details. We recommend having at least 10 pixels across the smallest feature you want to detect.
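
The 10-pixel recommendation above can be checked with some quick arithmetic before you collect images. The following is a minimal sketch, not a LandingLens feature: it assumes the camera views a flat scene at a roughly uniform scale, and the function names and the millimeter inputs are hypothetical.

```python
MIN_PIXELS_ACROSS_FEATURE = 10  # recommendation from this article


def pixels_across_feature(feature_size_mm: float,
                          field_of_view_mm: float,
                          image_width_px: int) -> float:
    """Estimate how many pixels span a feature.

    Assumes a uniform scale across the image (hypothetical simplification):
    the field of view is mapped evenly onto the image width.
    """
    if field_of_view_mm <= 0 or image_width_px <= 0:
        raise ValueError("field of view and image width must be positive")
    return feature_size_mm * image_width_px / field_of_view_mm


def resolution_is_sufficient(feature_size_mm: float,
                             field_of_view_mm: float,
                             image_width_px: int) -> bool:
    """Return True if the smallest feature spans at least 10 pixels."""
    span = pixels_across_feature(feature_size_mm, field_of_view_mm, image_width_px)
    return span >= MIN_PIXELS_ACROSS_FEATURE


# A 0.5 mm scratch in a 200 mm wide view:
# about 4.8 px at 1920 px wide, about 10.2 px at 4096 px wide.
print(resolution_is_sufficient(0.5, 200, 1920))  # False
print(resolution_is_sufficient(0.5, 200, 4096))  # True
```

If the check fails, you can either use a higher-resolution camera or narrow the field of view so the feature fills more of the frame.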

Variability: Real-World Scenarios Should Outweigh Other Factors

The most important factor to consider when curating a dataset is variability: the images should represent the different real-life variations of the objects of interest. The images in the dataset should be similar to the images the model will see when it's used, because this helps the model recognize the patterns and features that appear in real-world situations.

Depending on the use case, it might seem like variability is at odds with consistency. For example, if your goal is to detect motorcycles in traffic cam footage, then you should include images from different intersections, at different times of day, of different motorcycles, and so on. But you might still be able to control consistency. For example, you could train only on images from traffic cams to mimic the real-world angle, perspective, and resolution that the model will encounter.

In other use cases, you might be able to control the variables for both the training dataset and the real-world use case. For example, if your goal is to detect scratches on a motorcycle on a production line, then you can control camera placement, lighting, the type of motorcycle, etc.

Ask These Questions to Help Determine Quality

Take a moment to look at the images you want to train your models on. Then consider these questions:

  • Are the images dark or well-lit?
  • Are the images clear or blurry?
  • Can you clearly identify the object of interest or region of interest, or do you need a magnifying glass?
  • Are your images similar to what you’ll be running inference on when you deploy your model?
  • Do your images represent all use cases relevant to the object of interest?
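
The first two questions (dark or well-lit, clear or blurry) can be roughly pre-screened in code before a manual review. The sketch below is an illustration only and is not how LandingLens evaluates images. It assumes each image is already loaded as a grayscale grid of 0 to 255 values, uses the variance of a simple Laplacian as a common blur heuristic, and its threshold values are hypothetical and need tuning for your own data.

```python
from statistics import mean, pvariance


def mean_brightness(gray):
    """Average pixel value of a grayscale image given as a list of rows."""
    return mean(p for row in gray for p in row)


def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian over the interior pixels.

    Sharp images have strong edges and a high variance; low values
    suggest blur. The image must be at least 3x3 pixels.
    """
    h, w = len(gray), len(gray[0])
    if h < 3 or w < 3:
        raise ValueError("image must be at least 3x3 pixels")
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)


def screen_image(gray, dark_below=50, bright_above=205, blur_below=100.0):
    """Return a list of possible issues; thresholds are assumed, not official."""
    issues = []
    brightness = mean_brightness(gray)
    if brightness < dark_below:
        issues.append("too dark")
    elif brightness > bright_above:
        issues.append("too bright")
    if laplacian_variance(gray) < blur_below:
        issues.append("possibly blurry")
    return issues


# A flat, dark 5x5 image has no edges, so it is flagged on both counts.
dark_image = [[10] * 5 for _ in range(5)]
print(screen_image(dark_image))  # ['too dark', 'possibly blurry']
```

A screen like this only narrows down the candidates. The remaining questions, such as whether the images match what the model will see in deployment, still need a human to answer.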

Example: Identify High-Quality Images

Take a look at the set of images below. Which would you choose to keep? Which ones would you throw out and why?

Select the High-Quality Images

If you chose to keep 1, 2, and 4, and throw out 3, 5, and 6, then you’re correct! Number 3 is too dark, number 5 is blurry, and number 6 is too bright.

Don't Include Images that Are Too Dark, Too Light, or Too Blurry

How Many Images Do I Need?

If you've ever had this question, you’re not alone. It's the question we get the most, and the answer is: it depends.

LandingLens requires at least 10 labeled images before you can train a model. Is 10 images enough to train with? That also depends on your use case.

The more image examples you provide to a model, the more it will “learn” the relevant features of the object of interest. 

  • If you have a simple use case, like detecting whether a bolt is present or missing on a component, 10 images might be enough.
  • If you have a complex use case, like detecting the morphology of tissues and cells to determine whether they’re cancerous, you’ll likely need more than 10 images.

Ask These Questions to Help Determine How Many Images You Need

When you’re selecting your images for your project, consider these questions:

  • How elaborate is the object of interest?
  • How many types of objects of interest are there?
  • Are there many variants of the object of interest?
  • Do the objects of interest in your images represent real-life use cases?
  • Do your images represent a balanced number of classes?

Balance the Number of Images for Each Object of Interest

When you want a model to detect a certain object of interest, you create a class for that object. A class is a name for an object you want to detect. For example, if you want a model to detect apples and bananas, create an Apple class and a Banana class.

If your project has multiple classes, upload about the same number of images per class. Having a balanced number of images per class helps the model accurately detect objects for each class. So if you're detecting Apples and Bananas, about 50% of the images should show apples, and about 50% should show bananas.
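
One way to check balance before uploading is to compute each class's share of the dataset. This is a minimal sketch that assumes you have one class label per image in a plain list. The function names and the 10-percentage-point tolerance are hypothetical choices, not LandingLens settings.

```python
from collections import Counter


def class_shares(labels):
    """Return each class's fraction of the dataset."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


def is_balanced(labels, tolerance=0.10):
    """Return True if every class is within `tolerance` of an even split.

    The tolerance is an assumed value; pick one that suits your project.
    """
    shares = class_shares(labels)
    target = 1 / len(shares)
    return all(abs(share - target) <= tolerance for share in shares.values())


labels = ["Apple"] * 90 + ["Banana"] * 10
print(class_shares(labels))  # {'Apple': 0.9, 'Banana': 0.1}
print(is_balanced(labels))   # False
```

If a class falls short, adding more images of that class is usually preferable to discarding good images of the others.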

Tips for Object Detection Projects

Models in Object Detection projects learn based on the contents of the bounding boxes (the areas you identify as the Ground Truth). Therefore, it's important to get a well-represented spread of bounding boxes for each class.

For complex projects (like circuit boards), we recommend you include a minimum of 50 bounding boxes per class. If one image has five bounding boxes, then this counts as five, not one.
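
Because the count is per bounding box rather than per image, a quick tally can show which classes are still short. The sketch below assumes a hypothetical annotation structure: a dictionary that maps each image name to the list of class names of its bounding boxes. It does not read any LandingLens export format.

```python
from collections import Counter

MIN_BOXES_PER_CLASS = 50  # recommendation for complex projects in this article


def boxes_per_class(annotations):
    """Count bounding boxes per class across all images.

    `annotations` maps an image name to a list with one class name
    per bounding box in that image (hypothetical structure).
    """
    counts = Counter()
    for box_classes in annotations.values():
        counts.update(box_classes)
    return counts


def classes_below_minimum(annotations, minimum=MIN_BOXES_PER_CLASS):
    """Return the classes that have fewer boxes than the minimum, sorted."""
    counts = boxes_per_class(annotations)
    return sorted(cls for cls, n in counts.items() if n < minimum)


annotations = {
    "board_01.png": ["Resistor"] * 5,
    "board_02.png": ["Resistor"] * 45 + ["Capacitor"] * 3,
}
print(boxes_per_class(annotations)["Resistor"])  # 50
print(classes_below_minimum(annotations))        # ['Capacitor']
```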

Tips for Classification Projects

Each image can only have one class applied to it. Therefore, if an image meets the criteria for more than one class, we recommend that you don't include it in training because it could confuse the model.

For complex projects (like circuit boards), we recommend you include a minimum of 50 images per class.
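
The one-class-per-image rule can be applied as a filtering step while you assemble the dataset. The following sketch assumes a hypothetical mapping from image name to the set of classes that image could belong to. It keeps images with exactly one candidate class and sets the rest aside.

```python
def split_by_class_count(candidates):
    """Separate images that fit exactly one class from those that do not.

    `candidates` maps an image name to the set of class names the image
    meets the criteria for (hypothetical structure). Returns a dict of
    usable images with their single class, and a sorted list of excluded
    image names.
    """
    usable = {}
    excluded = []
    for name, classes in candidates.items():
        if len(classes) == 1:
            usable[name] = next(iter(classes))
        else:
            excluded.append(name)
    return usable, sorted(excluded)


candidates = {
    "img_1.png": {"OK"},
    "img_2.png": {"Scratch", "Dent"},  # fits two classes, so it is set aside
    "img_3.png": {"Scratch"},
}
usable, excluded = split_by_class_count(candidates)
print(usable)    # {'img_1.png': 'OK', 'img_3.png': 'Scratch'}
print(excluded)  # ['img_2.png']
```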

Tips for Segmentation Projects

Models in Segmentation projects learn based on the specific pixels marked by the lines or shapes you add to the images (the areas you identify as the Ground Truth). Therefore, it's important to get a well-represented spread of marked areas for each class.

Typically, a Segmentation project requires fewer images than an Object Detection project. This is because a Segmentation project allows you to label areas more precisely.

For complex projects (like circuit boards), we recommend you include about 20 to 30 labeled areas per class. If one image has five labeled areas, then this counts as five, not one.

Tips for Visual Prompting Projects

You only need to label a few small areas of images to train a Visual Prompting model. For more information, go to Visual Prompting.

