How to Label Data for Machine Learning
Learning, in its most basic form, is knowing the output you will get from an input. Even before they can speak, toddlers learn that if they point to a little cardboard box with a straw, they can get juice, or that if they get too close to an oven door, it will feel hot. Patterned after this human capability, data scientists teach machine learning models with a set of inputs in the form of labeled data, so the models can produce the desired outputs.
It’s easy to see that, depending on the application, different types of inputs (data) may be required. Machine learning may drive chatbots or personal assistants, enable navigation services that help you find the best route when traveling, recognize who’s in a picture on social media, identify email spam, or make e-commerce product recommendations. Therefore, the data sets for machine learning may need to include spoken words, images, video, text, patterns, behaviors, or a combination of them.
Training, Validation & Testing Data Sets
Data scientists also need to prepare different data sets to use during a machine learning project.
- A training data set, usually the most extensive, is necessary to train a machine learning model to perform the actions you want it to do.
- A machine learning project also requires a validation data set, which lets data scientists evaluate how well the model fits while they fine-tune its hyperparameters for optimal performance.
- The project also needs a separate testing data set. It would be a waste of time to test the model with training data, because the model has already seen those examples and the result you expect for each one. The testing data set needs to contain new data, so the outputs can verify that the model works the way you want it to.
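The three data sets above can be sketched as a simple shuffled split. This is only a minimal illustration: the function name `split_dataset`, the 70/15/15 proportions, and the file names are assumptions made for the example, not figures from this article, and real projects often choose different ratios or split by source (for example, by route or by recording session).

```python
import random


def split_dataset(items, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle items and split them into disjoint train/validation/test lists.

    The fractions are illustrative assumptions; whatever is left after the
    training and validation portions becomes the testing set.
    """
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a testing set")

    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable

    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # never shown to the model during training
    return train, val, test


# Hypothetical file names standing in for collected images.
examples = [f"image_{i:03d}.jpg" for i in range(100)]
train, val, test = split_dataset(examples)
print(len(train), len(val), len(test))
```

Because the three lists never overlap, a good score on the testing set reflects how the model handles data it has not seen before.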
The Data Labeling Process
If your goal, for example, is to have an autonomous warehouse vehicle automatically stop when it “sees” a stop sign, you’d collect images of stop signs and video of routes that include stop signs. But the work of developing the data sets you need for machine learning is far from over at that point. The data must be labeled in a way that the machine learning model can use.
Labels are metadata, a layer of data that provides structured information, and they don’t change the original data. Their purpose is to communicate the meaning of the input to the machine learning model.
Because labeling data is a labor-intensive, time-consuming process, data labelers often use a tool that enables them to work faster and more accurately. Teaching a model to recognize a stop sign would require pulling images or video into the tool and drawing “bounding boxes” around the stop signs. Each bounding box is usually drawn tightly around the stop sign in the image, but big enough to include all of the sign, so the machine learning model can accurately identify it.
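To make the idea of a label as metadata concrete, here is a rough sketch of how one bounding box might be stored alongside an image. The `BoundingBox` class, the field names, the file name, and the pixel values are all hypothetical; labeling tools each define their own export formats, and this is not the format of any particular tool.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BoundingBox:
    """A labeled rectangle in pixel coordinates (hypothetical schema)."""
    label: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    def fits_within(self, width, height):
        """Return True if the box is non-empty and lies inside the image."""
        return (0 <= self.x_min < self.x_max <= width
                and 0 <= self.y_min < self.y_max <= height)


# The annotation lives next to the image; the image file itself is untouched.
annotation = {
    "image": "route_017_frame_0042.jpg",  # hypothetical file name
    "width": 1280,
    "height": 720,
    "boxes": [asdict(BoundingBox("stop_sign", 610, 200, 700, 290))],
}

# A basic sanity check a labeling workflow might run before accepting a label.
box = BoundingBox(**annotation["boxes"][0])
if not box.fits_within(annotation["width"], annotation["height"]):
    raise ValueError("bounding box falls outside the image")

serialized = json.dumps(annotation, indent=2)
print(serialized)
```

Keeping the label in a separate record like this is what lets the same image be relabeled or corrected later without altering the original data.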
Recognizing a stop sign is an extremely simple example of what you’d teach a machine learning model to do. Obviously, autonomous vehicles need to do much more — and the more the machine learning application needs to do, the more data you need for training, validation, and testing.
Furthermore, data labelers need to be skilled and conscientious. They need a thorough understanding of the context in which the machine learning model will be used and, especially in the case of labeling audio or text data, must be skilled in language and communication to detect meaning and sentiment.
Regardless of the types of data used, all data sets for machine learning require:
- Accuracy: Data scientists measure accuracy against “ground truth,” that is, how closely the labels, and in turn the model’s outputs, align with real-world conditions.
- Quality: Quality refers to how consistently the data has been labeled, which, in turn, can affect the quality of the model’s outputs.
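Both requirements can be checked with simple counts. The sketch below, using made-up labels, compares each labeler against a ground-truth list (accuracy) and compares two labelers against each other (consistency). The helper name `match_rate` and all of the sample data are assumptions for illustration only.

```python
def match_rate(labels_a, labels_b):
    """Return the fraction of positions where two label lists agree."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Made-up labels for five images; "none" means no stop sign is present.
ground_truth = ["stop_sign", "none", "stop_sign", "none", "stop_sign"]
labeler_a = ["stop_sign", "none", "stop_sign", "stop_sign", "stop_sign"]
labeler_b = ["stop_sign", "none", "none", "none", "stop_sign"]

# Accuracy: how often each labeler matches the ground truth.
accuracy_a = match_rate(labeler_a, ground_truth)
accuracy_b = match_rate(labeler_b, ground_truth)

# Consistency: how often the two labelers match each other.
agreement = match_rate(labeler_a, labeler_b)

print(f"accuracy A: {accuracy_a:.2f}")  # 0.80
print(f"accuracy B: {accuracy_b:.2f}")  # 0.80
print(f"agreement:  {agreement:.2f}")   # 0.60
```

In this made-up case both labelers are equally accurate, yet they agree with each other on only three of five images, which is why consistency is worth tracking separately. A raw agreement rate like this does not correct for agreement that happens by chance; statistics such as Cohen’s kappa are commonly used when that matters.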
More Data Labeling Considerations
Although machine learning models can mimic some of the patterns of human thinking, they rely solely on the data used to train them. The quality of the output you get from a machine learning model will reflect the quality of the input. For this reason, labeling data correctly is essential. Be aware that this means more than drawing the right-sized bounding box around an object in an image and using the right code. Data labelers also need to avoid biases toward, for example, a specific gender or race, which could influence the outputs the model produces. Additionally, data labelers can sometimes impose their own beliefs about what the model is expected to do, labeling data in a way that skews results toward what they want rather than toward an unbiased result.
Producing an output based on an input may seem simple, but it depends on accurate, meaningful data to produce the results you need. Ask any teacher, whether their pupils are human or otherwise: it takes planning, preparation, time, and attention to teach. If you’re looking to expand into or scale your data labeling capabilities, contact us to see how Daivergent can help your organization grow.
Opinions expressed by Daivergent contributors are their own.