MIT Technology Review
Sponsored

Computer vision in AI: The data needed to succeed

There is nothing more important for artificial intelligence initiatives than acquiring contextually labeled, high-quality training data and developing a reliable data pipeline. Here’s how to get there.

Developing the capacity to annotate massive volumes of data while maintaining quality is a function of the model development lifecycle that enterprises often underestimate. It’s resource intensive and requires specialized expertise.

At the heart of any successful machine learning/artificial intelligence (ML/AI) initiative is a commitment to high-quality training data and a proven, well-defined pathway for producing it. Without this quality data pipeline, the initiative is doomed to fail.

Computer vision or data science teams often turn to external partners to develop their data training pipeline, and these partnerships drive model performance.

There is no one definition of quality: “quality data” is completely contingent on the specific computer vision or machine learning project. However, there is a general process all teams can follow when working with an external partner, and this path to quality data can be broken down into four prioritized phases.

Annotation criteria and quality requirements

Training data quality is an evaluation of a data set’s fitness to serve its purpose in a given ML/AI use case.

The computer vision team needs to establish an unambiguous set of rules that describe what quality means in the context of their project. Annotation criteria are the collection of rules that define which objects to annotate, how to annotate them correctly, and what the quality targets are.

Accuracy or quality targets define the lowest acceptable result for evaluation metrics such as accuracy, recall, precision, and F1 score. Typically, a computer vision team will have quality targets for how accurately objects of interest were classified, how accurately objects were localized, and how accurately relationships between objects were identified.
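To make the idea of quality targets concrete, here is a minimal sketch in Python. It assumes classification quality is measured per class against trusted labels and that localization is scored with intersection over union (IoU), a common choice the article does not itself name. The function names, the sample labels, and the target values are all illustrative assumptions, not part of any particular platform.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    precision: float
    recall: float
    f1: float


def classification_metrics(gold, predicted, positive):
    """Precision, recall, and F1 for one class, comparing labels item by item."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return Metrics(precision, recall, f1)


def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def meets_targets(metrics, targets):
    """True when every named metric is at or above its lowest acceptable value."""
    return all(getattr(metrics, name) >= floor for name, floor in targets.items())


# Hypothetical sample: trusted labels versus labels produced by annotators.
gold = ["car", "car", "truck", "car", "truck"]
labels = ["car", "truck", "truck", "car", "car"]
m = classification_metrics(gold, labels, positive="car")

# Hypothetical targets; real values depend on the project's risk tolerance.
passed = meets_targets(m, {"precision": 0.6, "recall": 0.6})
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

Expressing targets as a plain mapping from metric name to minimum value keeps the criteria easy to review alongside the written annotation rules.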

Workforce training and platform configuration

Platform configuration. Task design and workflow setup require time and expertise, and accurate annotation requires task-specific tools. At this stage, data science teams need a partner with expertise to help them determine how best to configure labeling tools, classification taxonomies, and annotation interfaces for accuracy and throughput.

Worker testing and scoring. To accurately label data, annotators need a well-designed training curriculum so they fully understand the annotation criteria and domain context. The annotation platform or external partner should ensure accuracy by actively tracking annotator proficiency, both against gold data tasks and by recording when a judgment is modified by a higher-skilled worker or admin.

Ground truth or gold data. Ground truth data is crucial at this stage of the process as the baseline to score workers and measure output quality. Many computer vision teams are already working with a ground truth data set.
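Scoring workers against gold data can be sketched as follows. This is a simplified illustration, assuming each judgment is a tuple of annotator, task, and label, and that gold tasks are mixed in with regular work. The names `score_annotators` and `below_threshold`, the sample data, and the 0.8 cutoff are hypothetical.

```python
from collections import defaultdict


def score_annotators(judgments, gold):
    """Return each annotator's share of gold tasks answered correctly.

    judgments: iterable of (annotator_id, task_id, label) tuples.
    gold: dict mapping task_id to the trusted label; other tasks are ignored.
    """
    correct = defaultdict(int)
    seen = defaultdict(int)
    for annotator, task, label in judgments:
        if task not in gold:
            continue  # only gold tasks count toward the proficiency score
        seen[annotator] += 1
        if label == gold[task]:
            correct[annotator] += 1
    return {annotator: correct[annotator] / seen[annotator] for annotator in seen}


def below_threshold(scores, minimum):
    """Annotators whose gold accuracy falls under the required minimum."""
    return sorted(a for a, score in scores.items() if score < minimum)


# Hypothetical data: t1 and t2 are gold tasks, t3 is ordinary work.
gold = {"t1": "cat", "t2": "dog"}
judgments = [
    ("ann", "t1", "cat"),
    ("ann", "t2", "dog"),
    ("bob", "t1", "cat"),
    ("bob", "t2", "cat"),
    ("bob", "t3", "dog"),
]
scores = score_annotators(judgments, gold)
needs_retraining = below_threshold(scores, minimum=0.8)
```

An annotator who falls below the cutoff could be routed back to training, though how a given platform acts on such scores is outside what the article describes.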

Sources of authority and quality assurance

There is no one-size-fits-all quality assurance (QA) approach that will meet the quality standards of all ML use cases. Specific business objectives, as well as the risk associated with an under-performing model, will drive quality requirements. Some projects reach target quality using multiple annotators. Others require complex reviews against ground truth data or escalation workflows with verification from a subject matter expert.
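One way to combine multiple annotators with an escalation workflow is a majority vote that hands low-agreement items to an expert. The sketch below assumes simple categorical labels. The function name `resolve` and the 0.6 agreement threshold are illustrative assumptions rather than a recommended setting.

```python
from collections import Counter


def resolve(labels, min_agreement=0.6):
    """Majority vote over the labels several annotators gave one item.

    Returns (label, agreement) when the most common label's share of votes
    reaches min_agreement. Otherwise returns (None, agreement), signalling
    that the item should be escalated to a subject matter expert.
    """
    if not labels:
        return None, 0.0
    label, votes = Counter(labels).most_common(1)[0]
    agreement = votes / len(labels)
    if agreement >= min_agreement:
        return label, agreement
    return None, agreement


# Two of three annotators agree, so the item is accepted.
accepted, accepted_share = resolve(["car", "car", "truck"])

# No label has a clear majority, so the item is escalated.
escalated, escalated_share = resolve(["car", "truck", "bus"])
```

Raising the threshold trades throughput for confidence: more items go to expert review, which suits use cases where an under-performing model carries higher risk.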

There are two primary sources of authority that can be used to measure the quality of annotations and that are used to score workers: gold data and expert review.

Iterating on data success

Once a computer vision team has successfully launched a high-quality training data pipeline, it can accelerate progress to a production-ready model, with an external partner providing ongoing support, optimization, and quality control.

Without high-quality training data, even the best-funded, most ambitious ML/AI projects cannot succeed. Computer vision teams need partners and platforms they can trust to deliver the data quality they need and to power life-changing ML/AI models for the world.

Alegion is the proven partner to build the training data pipeline that will fuel your model throughout its lifecycle. Contact Alegion at solutions@alegion.com.

This content was produced by Alegion. It was not written by MIT Technology Review’s editorial staff.
