Half a decade ago, the conversation around data focused mainly on how AI could solve business problems. Customers were still learning about the potential of AI for their companies. Today, many businesses understand how they want to use AI and the benefits they expect for themselves and their customers. Many have also put in place the data science teams they need to do AI right.
In this article, Sujatha Sagiraju, Senior Vice President of Product, Appen, discusses ways to succeed in 2022 with the rise of a new discipline: Data for AI Lifecycle.
The AI Data Challenge
AI budgets increased dramatically over the last year, signaling how critical AI has become for companies across multiple industries. According to Appen’s 2021 State of AI report, AI budgets increased 55% year-over-year. Another key development over the last year is that businesses have matured their data science practices and their ability to develop the machine learning (ML) models they need. However, many of them are realizing that the real challenge lies in the data throughout the AI lifecycle.
The first big challenge for a company building an AI solution is acquiring or collecting the data needed to train internal models. This is no easy task, especially if you want to do it in a responsible, ethical and scalable way while meeting high quality requirements. The State of AI report also found that an overwhelming majority of organizations have partnered with external training data providers to deploy and update AI projects at scale, a reflection of the fact that data acquisition and preparation are top challenges that AI practitioners face.
However, simply acquiring training data isn’t enough. Whether working internally or with an external training data provider, the AI team must ensure a continuous flow of updated data. This is critical for production models that deal with changing inputs: the work is never done, and without fresh data the models start to drift. In addition, project teams want more data faster, so they can’t rely on one-time projects and ad hoc collection and annotation efforts. To expand current AI projects and implement new ones, businesses must find new ways to streamline project creation, from pilot to production, and create consistent and accurate processes that can scale. This can be achieved only through automation, especially around data sourcing and preparation.
Finally, businesses must confront the challenge of making their machine learning models inclusive and eliminating bias from their data. Biased data can kill an AI project. First, biased data can lead to unfair practices, for example, making decisions on a loan application based on race or gender instead of actual analysis of credit or risk. This can result in the failure of the project, a negative impact on a protected group, or a public backlash that can damage the brand. Second, even if the results are not immediately damaging or publicly embarrassing, biased data can mean that the model prediction results are simply sub-optimal, so the business never achieves the intended ROI from the AI application.
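As a rough illustration of how a team might check for the kind of bias described above, the following Python sketch compares approval rates across groups in a set of loan decisions. The field names (`group`, `approved`), the toy records, and the helper functions are hypothetical and chosen only for this example. The ratio of the lowest to the highest group rate is one simple signal among many and does not amount to a full fairness audit.

```python
from collections import defaultdict


def approval_rates(records, group_key="group", outcome_key="approved"):
    """Return the share of positive outcomes for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in records:
        totals[row[group_key]] += 1
        if row[outcome_key]:
            positives[row[group_key]] += 1
    return {g: positives[g] / totals[g] for g in totals}


def disparate_impact_ratio(rates):
    """Ratio of the lowest group rate to the highest (1.0 means parity)."""
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # no group received a positive outcome
    return min(rates.values()) / highest


# Hypothetical toy decisions, for illustration only.
decisions = [
    {"group": "A", "approved": True},
    {"group": "A", "approved": True},
    {"group": "A", "approved": True},
    {"group": "A", "approved": False},
    {"group": "B", "approved": True},
    {"group": "B", "approved": False},
    {"group": "B", "approved": False},
    {"group": "B", "approved": False},
]

rates = approval_rates(decisions)
ratio = disparate_impact_ratio(rates)
print(rates, round(ratio, 3))
```

A ratio well below 1.0, as in this toy data, would be a prompt to look more closely at how the training data was sourced and labeled, not a verdict in itself.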
Data for AI Lifecycle
We predict the emergence of a new discipline called Data for AI Lifecycle. Its focus will be to develop the tools and best practices that enable businesses to reduce complexity and manage data through a data-centric lens across the entire AI development process, which spans four stages:
- Data sourcing: This is the first step in ensuring AI success, no matter which stage of AI maturity the business is in. Pre-labeled datasets can accelerate AI projects through licensable ready-made data for model training. For complex AI projects, data sourcing can be outsourced to external vendors. Take the time to find the right vendor who can meet the quality requirements and ensure that the datasets are ethically sourced. Lastly, look into synthetic data and how to leverage it for hard-to-find data to enhance model training.
- Data preparation: This step is critical for AI project success. Teams typically work with a globally diverse group of people to annotate, rate, judge and label data, creating high-quality inputs for the models. This is also where you might tap into knowledge graphs and ontology management tools to turn data into intelligence.
- Model training and deployment: Whether you work with your in-house team of engineers and data scientists or with a top consulting or technology provider, it’s important to make sure your data providers connect well with your model infrastructure. This matters for the initial development iterations and for the next stage as well.
- Model evaluation and continuous improvement: This is where AI projects turn into programs, and teams watch for model drift and for continuous improvement opportunities. There is huge value in sourcing real-world model performance validation and tuning across a range of use cases and demographics, especially for organizations that have a global audience. It’s also good practice to compare your model performance to competitors and peers to ensure best-in-class results.
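To make the idea of watching for model drift more concrete, here is a minimal Python sketch that compares the distribution of one numeric input feature at training time against the same feature in production, using the Population Stability Index (PSI). PSI is one common choice swapped in here for illustration; the article does not prescribe a specific metric. The function name, bin count, and sample values are assumptions, and the 0.25 alert level is only a commonly cited rule of thumb.

```python
import math


def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """PSI between a baseline sample and a newer sample of one numeric feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values to edge bins
            counts[idx] += 1
        # Floor each share at eps so the logarithm is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical samples: the production data is shifted upward by 0.5.
baseline = [i / 100 for i in range(100)]
production = [v + 0.5 for v in baseline]

same_psi = population_stability_index(baseline, baseline)
drift_psi = population_stability_index(baseline, production)
print(same_psi, drift_psi)
```

In practice a check like this would run on a schedule for each important feature, and a value above the chosen alert level would trigger a new round of data collection and annotation.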
This new discipline responds to two realities of the AI industry that are taking shape for 2022. First, no single company can provide solutions across the entire Data for AI Lifecycle. We should expect to see a variety of new partnerships to make this discipline a reality. Second, AI practitioners don’t need a tech stack made up of multiple, siloed applications that increase complexity and require separate, complex administration. Thus, we expect winning solutions to combine partners behind a streamlined interface that simplifies and accelerates the Data for AI Lifecycle processes.
As AI use cases continue to increase, AI will evolve, and Data for AI Lifecycle will expand and adapt with it. We are only beginning to understand the possibilities that high-quality data practices open up within the AI space. The rise of Data for AI Lifecycle as a discipline in 2022, along with the automated processes and reduced complexity built around it, will help businesses keep up with the latest best practices and get the maximum ROI and optimal customer benefits from their current and future AI projects.