12 overlooked data labeling rules to unlock the power of AI

In our previous post, we established the importance of high-quality labeled data in ensuring your machine learning (ML) model is real-world ready. But going from raw data to labeled data is not a magic black box. No matter how you slice it, real people need to label data by hand to ensure accuracy. That means they need to understand how to label your data, which makes clear instructions essential. Let’s explore why this is the case and how you can write instructions that any human annotator anywhere in the world can understand.

Why have instructions?

Every machine learning model requires training, and you train an ML model using training data. Training data is raw data that has been processed, which might mean applying labels to it or structuring it in some way. This training data needs to be accurately labeled, and it needs to address the specifics of your use case (e.g., a speech recognition model for the Indian market can’t be trained solely on American English speakers).

This is the apparent catch-22 you face at the start of any machine learning project: you want your model to accurately predict the labels on your raw data, but these predictions won’t be accurate until your model has been trained on accurately labeled data [pulls out hair].

The solution is not complicated, but it is difficult to implement effectively: you get humans to label raw data, and you train your model on this data.

[Figure: creating training data for ML] Before any prediction (2) can take place, you need to create training data and use it to train your model (1).
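
The order of operations described here (train on human-labeled data first, predict second) can be sketched in a few lines of Python. This is a toy illustration only, not how Canotic or any production system works: the `train` and `predict` functions, the word-counting "model", and the sample sentences are all assumptions made up for this example.

```python
from collections import Counter, defaultdict

def train(labeled_examples):
    """Step 1: count how often each word appears under each human-provided label."""
    word_counts = defaultdict(Counter)
    for text, label in labeled_examples:
        word_counts[label].update(text.lower().split())
    return word_counts

def predict(model, text):
    """Step 2: pick the label whose training words overlap most with the text."""
    words = text.lower().split()
    scores = {label: sum(counts[w] for w in words) for label, counts in model.items()}
    return max(scores, key=scores.get)

# Hypothetical training data: raw text paired with labels applied by people
training_data = [
    ("the dog barked at the mailman", "dog"),
    ("a dog fetched the ball", "dog"),
    ("the cat purred on the sofa", "cat"),
    ("a cat chased the mouse", "cat"),
]

model = train(training_data)                           # (1) train on labeled data
prediction = predict(model, "my dog barked all night")  # (2) predict on raw data
print(prediction)
```

If the human labels in `training_data` were wrong, the predictions would inherit those mistakes, which is why the quality of the labeling step matters so much.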

At Canotic, we use AI, but we also rely on human labelers. They need to clearly understand how to label any raw data they are given so that they can produce a training dataset good enough to train your ML model on.

All you need to do is provide raw data, along with instructions on how you want it labeled, and we return labeled training data.

Ultimately, a set of well-written instructions is the biggest factor under your control in determining the success of the data labeling process.

Time invested in good instructions will save you a lot of time in the long run. And, if you’ve done it right, you won’t ever know it!

How to write instructions: 12 rules to get started

  1. Get writing and rework your instructions as you go, based on where labelers make mistakes. That means fine-tuning small details. You are unlikely to get things right the first time. Every ML project is different, so be patient when trying to find an instruction set that provides the results you’re after.
    • Example: If you see images of wolves coming back labeled as dogs, expand your instructions to specify that you only want domestic dogs labeled as dogs.
  2. Remember that you are writing for another person
    • This may seem eye-rollingly obvious, but keeping this in mind at all times will help focus your writing. It’s like speaking to an interviewer rather than rambling to yourself in the shower.
  3. Remember that labelers lack your domain knowledge
    • Do not assume that a labeler knows the difference between a Gibson ES-335 and an ES-355TD (if you happen to be working for Gibson). Be detailed, be specific, and provide everything necessary for a total noob to get the job done.
    • Example: The ES-355TD model features a split-diamond inlay in the headstock and mother-of-pearl fingerboard inlays.
[Image: Gibson guitars] Which is which? Go!
  4. Keep it brief
    • While your instructions need to be detailed and specific, they also need to be understood in under 15 minutes. If a labeler can’t do that, there’s too much information for them to bear in mind throughout the task.
  5. Describe both the concept and the details
    • Provide an introductory overview, then zoom in on the fine details. This will help the labeler place the information you provide in the right context and make the task more meaningful.
    • Example:
      • Context: Extract the text from these images to help our library organize its archive of manuscripts.
      • Detail: Enter the words or values for each field exactly as they appear in the image, including capitalization and spelling mistakes.
  6. Format content for clarity
    • Use bullet points, numbered lists, bolding, and other formatting to structure your content in the most digestible fashion.
    • Example: This list that you’re reading right now.
  7. Provide descriptions for any labels that are not obvious
    • Example: The `chair` label should not be applied to a sofa.
  8. Clarify the differences between any labels that are similar. Your definition does not have to be scientific; it just needs to be consistent.
    • Example: A `city` is any settlement with over 100,000 inhabitants.
  9. Highlight any possible edge cases. This is one area where iteration is essential, as there will be things you cannot predict.
    • Example: Label any bicycle attached to a car as part of the car.
    • Example: If a speed sign shows two values, enter the lower value.
  10. Explain how to add the labels
    • Example: The bounding box should touch the edge of the object.
    • Example: Include the country as part of an organization, e.g., `United States government` is all one organization.
    • Example: Do not include multiple objects in one cuboid.
  11. If possible, provide examples of good and bad labels based on the instructions you’ve provided
  12. Include links to any tools the labeler might need to complete the task
    • Example: Use a URL shortener for any links (https://bitly.com/).
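
Several of the examples in this list are precise enough to be written down as simple checks, which is a handy test of whether an instruction is unambiguous. The sketch below is purely illustrative: the function names, the `town` fallback label, and the `(left, top, right, bottom)` box format are assumptions made for this example, not part of any Canotic tool.

```python
def settlement_label(inhabitants):
    """`city` means over 100,000 inhabitants; `town` is a hypothetical fallback label."""
    return "city" if inhabitants > 100_000 else "town"

def speed_sign_value(values):
    """If a speed sign shows more than one value, enter the lowest one."""
    return min(values)

def tight_box(object_pixels):
    """Smallest (left, top, right, bottom) box that touches the object's edges."""
    xs = [x for x, _ in object_pixels]
    ys = [y for _, y in object_pixels]
    return (min(xs), min(ys), max(xs), max(ys))

def box_touches_object(box, object_pixels, tolerance=0):
    """True if every side of the labeled box is within `tolerance` pixels of the object."""
    expected = tight_box(object_pixels)
    return all(abs(a - b) <= tolerance for a, b in zip(box, expected))

# Hypothetical object given as the (x, y) coordinates of its pixels
pixels = [(4, 2), (5, 2), (4, 3), (6, 5)]

print(settlement_label(250_000))                   # over the threshold
print(speed_sign_value([50, 30]))                  # sign showing two values
print(box_touches_object((4, 2, 6, 5), pixels))    # box drawn tight to the object
print(box_touches_object((3, 2, 6, 5), pixels))    # box drawn one pixel too wide
```

If an instruction cannot be expressed this plainly (for example, it is unclear whether exactly 100,000 inhabitants counts as a `city`), that is a hint the wording needs another pass.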

This may all seem a little overwhelming at first, but at Canotic we’ve gained a lot of experience working with instructions, and we’re always available to help you through the process.

In the end, if your instructions let a domain outsider waltz in and correctly understand how to label your data in under 15 minutes, your work is done.
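
To see where instructions still need work, one option is to compare labelers' answers against a small set of items you have labeled yourself and count the most common mix-ups. The following is a minimal sketch under that assumption; the `confusion_counts` function and the sample labels are hypothetical.

```python
from collections import Counter

def confusion_counts(expected, submitted):
    """Count (expected, submitted) pairs where the labeler's answer differs."""
    if len(expected) != len(submitted):
        raise ValueError("both label lists must have the same length")
    return Counter((e, s) for e, s in zip(expected, submitted) if e != s)

# Hypothetical spot check: your own labels vs. what labelers returned
expected = ["wolf", "dog", "wolf", "cat", "wolf", "dog"]
submitted = ["dog", "dog", "dog", "cat", "wolf", "cat"]

mistakes = confusion_counts(expected, submitted)
for (want, got), count in mistakes.most_common():
    print(f"{count}x expected {want!r} but got {got!r}")
```

In this made-up sample the most frequent mix-up is wolves labeled as dogs, which is the cue to expand the instructions along the lines of the domestic dog example in the first rule.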

Did we miss any essential rules? Have you seen any examples of bad instructions, or labeling mistakes that resulted from them? Send them to us at hi@canotic.com.

 
