How To Clean Your Data for AI Agents Without Breaking the Bank

Bad data costs companies good money — especially if that data’s fueling AI. 

How much are we talking about? A 2024 survey by data integration company Fivetran found that artificial intelligence (AI) trained on inaccurate, incomplete, or low-quality data can cost large businesses 6% of their revenue, or an average of $406 million annually. 

“If you don’t have good data, you’re probably not going to be making the best business decisions,” said Karim Habbal, vice president, data management solutions, at Salesforce. “There’s a real impact on both the day-to-day tactical decisions and the multiyear decisions being made.” 

With that big a bite out of revenue, you’d think leaders would rush to clean up their data. But time, labor, and tech tools are expensive, and some companies don’t want to make the investment. That’s shortsighted. Investing even a small amount can pay off later. Our list of five cost-effective ways to clean data can get you started. 

The cost of bad data

Many big brands have learned that bad data hurts reputations and bottom lines. One major airline ended up in court after its chatbot explained its bereavement travel policy incorrectly, telling a customer he was eligible for a refund when he was not. Elsewhere, a data error in an automated air traffic control system canceled 2,000 flights in the U.K. and Ireland, leaving thousands of travelers stranded and airlines suffering as much as $135 million in losses.

The costs can also be more subtle. A minor typo in a customer’s address could lead to missed communications, missed deliveries, and lost sales. And then there’s customer trust, which is too valuable to put a price tag on. If an AI agent hallucinates or answers questions incorrectly, customers might take their business elsewhere. They don’t care if the AI screwed up. They’ll only remember it was your company. 

How to clean your data in a cost-effective way

Your AI agent will only be as good as the data you feed it. But it may be easier — and less expensive — to get your data ready than you think. Here’s how:

1. Prioritize which data needs to be cleaned

Start by cleaning only the data your agent needs. 

Salesforce does this with its own agents, which are powered by Agentforce, the company’s platform for building and deploying AI agents. When the product team builds an agent, they focus on the task (or tasks) they want the agent to perform. “Those jobs are called ‘topics,’ and the topics are a way of routing a user query to a specific thing the agent can do,” said Daniel Zielaski, vice president, data science, at Salesforce. Once the product team has identified a topic, they build a “corpus,” which is the knowledge base an agent needs to carry out its task. 

Zielaski pointed to Salesforce’s new sales development representative (SDR) agent as an example. The SDR agent needs clean and updated account, lead, and contact information to write outreach emails to prospects. But it doesn’t need information on how to solve a tech problem. “We identify the data that will be consumed by a specific topic, and then we focus on improving its overall quality, versus boiling the ocean and trying to clean all our data,” he said. 
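
To make the idea concrete, here’s a minimal sketch of topic-scoped cleaning in Python. The topic names, field lists, and functions are hypothetical, invented for illustration; they aren’t Agentforce APIs.

```python
# Map each agent topic to the fields its corpus actually consumes,
# then clean only those fields. All names here are hypothetical.
TOPIC_FIELDS = {
    "sales_outreach": ["account_name", "lead_email", "contact_phone"],
    "tech_support": ["product_id", "error_code", "kb_article_id"],
}

def fields_to_clean(active_topics):
    """Return the union of fields consumed by the agent's active topics."""
    fields = set()
    for topic in active_topics:
        fields.update(TOPIC_FIELDS.get(topic, []))
    return fields

# An SDR-style agent triggers cleanup of outreach data only,
# leaving the support knowledge base alone.
print(sorted(fields_to_clean(["sales_outreach"])))
# ['account_name', 'contact_phone', 'lead_email']
```

Scoping the cleanup this way keeps the effort proportional to what the agent will actually read.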

2. Manage your labor costs

For many companies, the largest data-related cost is labor. A data engineer in San Francisco, for example, earns a median salary of $178,000 per year. And when you build an entire in-house data team, the cost of salaries, training, and benefits can add up. 

Internal teams are crucial for handling sensitive data like health or financial information. They also offer continuity and institutional knowledge. But for less sensitive data, you could use an outside provider or freelancers and pay only for the services you need. Or you could take a hybrid approach and use both. 

You can also use Salesforce’s Data Cloud, which solves one of the biggest problems companies face: pulling data from different software systems into one place for an AI agent to read. “The product has been designed so that you don’t have to pay for a large data engineering team,” said Zielaski. “You don’t have to pay for an architecture team. You don’t have to pay a group of people to go in and use code to move data from one place to another.”

3. Automate as much data cleaning as possible

The Fivetran survey found that data scientists spend most of their time (67%) preparing data rather than building and refining AI models. But there’s a way to lighten their load: Automate your data quality processes. 

Automation, whether through custom code or dedicated data quality tools, can drastically reduce the time you need to monitor and clean data. Yes, it requires an upfront investment. But a Forrester report found that data quality tools catch issues sooner, improving resolution time by 90% and saving 5,184 data engineer hours. 

They do this partly by detecting anomalies. Habbal’s team, for example, uses various data quality tools to automatically profile data sets, including those used to calculate annual contract value (ACV), a critical financial metric. He shared a hypothetical example of a data set in which the typical ACV range is $10 million to $50 million per customer. If the data quality tool discovers an ACV of $30, Habbal said, “we’re then alerted, and can investigate it.” 
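
In code, that kind of anomaly check can be as simple as a range rule. Here’s a rough sketch: the $10 million to $50 million band comes from Habbal’s hypothetical, but the record layout is invented, and real tools profile data statistically rather than against hard-coded bounds.

```python
# Flag any ACV that falls outside the expected band.
EXPECTED_ACV_RANGE = (10_000_000, 50_000_000)

def find_acv_anomalies(records):
    """Return records whose ACV is outside the expected range."""
    low, high = EXPECTED_ACV_RANGE
    return [r for r in records if not low <= r["acv"] <= high]

records = [
    {"customer": "Acme Corp", "acv": 32_000_000},
    {"customer": "Globex", "acv": 30},  # the suspicious $30 entry
]
for anomaly in find_acv_anomalies(records):
    print(f"Alert: {anomaly['customer']} has out-of-range ACV ${anomaly['acv']:,}")
```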

Habbal’s team also uses these tools to monitor data for completeness, timeliness, accuracy, and conformity. “Basically, what that means is, I can create a rule that says, ‘Trigger an alert when the completeness of the data falls below 99%’,” he said. 

Why is this important? “If we were going to report the quarterly ACV to [Salesforce CEO] Marc Benioff, we don’t want to give him a data set that’s 90% complete,” Habbal said. “For that situation, we’d have a very high threshold with our data quality tool, that the data needs to be 99% complete or greater.” 
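
A completeness rule like the one Habbal describes might look like the following sketch. The 99% threshold comes from his example; the records and function names are invented.

```python
def completeness(records, field):
    """Fraction of records where `field` is present and non-empty."""
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)

def check_completeness(records, field, threshold=0.99):
    """Trigger an alert when completeness falls below the threshold."""
    score = completeness(records, field)
    if score < threshold:
        print(f"Alert: '{field}' is {score:.1%} complete; need {threshold:.0%}")
    return score

# 95 populated records and 5 empty ones trips the 99% rule.
records = [{"acv": 12_000_000}] * 95 + [{"acv": None}] * 5
check_completeness(records, "acv")
# Alert: 'acv' is 95.0% complete; need 99%
```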

4. Put a data governance policy in place

Another way to contain costs is to create clear governance that includes data stewardship. In other words, spell out who’s responsible for a specific set of data. 

Consider the hypothetical example of data created in a business application. As the data moves downstream for analytical or reporting use cases, it might be replicated four times. When someone discovers there’s an issue with the data, “we don’t want four different teams to remediate their copies of the data,” said Habbal. If you have clear ownership of the data, only one team is responsible, which means lower labor costs. 
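
One lightweight way to encode that ownership is a stewardship registry that names exactly one owning team per data set, so every issue has a single destination. This is a sketch with invented team and data set names, not a prescribed Salesforce pattern.

```python
# Each data set has exactly one steward; remediation is routed there.
DATA_STEWARDS = {
    "crm.accounts": "sales-data-team",
    "finance.acv": "revenue-ops-team",
}

def route_data_issue(dataset, issue):
    """File a data quality issue with the one team that owns the data set."""
    owner = DATA_STEWARDS.get(dataset)
    if owner is None:
        raise ValueError(f"No steward registered for {dataset}")
    print(f"Filed with {owner}: {dataset} - {issue}")

route_data_issue("finance.acv", "ACV of $30 outside expected range")
# Filed with revenue-ops-team: finance.acv - ACV of $30 outside expected range
```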

A governance policy that outlines your stance on access, security, and compliance also protects you against risk. Errors in financial reporting or the improper handling of personal data can lead to costly fines and legal battles. And compliance issues drain resources, too. Clear governance lessens these risks. 

5. Use AI to prevent bad data in the first place

In 1992, George Labovitz and Yu Sang Chang, then both professors at the Boston University School of Management, introduced the 1:10:100 rule of data quality. Their rule asserts:

  • The cost of preventing poor data quality at the source is $1 per record. 
  • The cost of remediation after a data quality issue has been identified is $10.
  • The cost of doing nothing is $100. 
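
At scale, the spread is dramatic. Here’s the back-of-the-envelope math for a hypothetical 50,000 flawed records, using the rule’s illustrative per-record costs (real costs will vary).

```python
# Per-record costs from the 1:10:100 rule; the record count is invented.
COST_PER_RECORD = {"prevent": 1, "remediate": 10, "do nothing": 100}
BAD_RECORDS = 50_000

for strategy, unit_cost in COST_PER_RECORD.items():
    print(f"{strategy}: ${unit_cost * BAD_RECORDS:,}")
# prevent: $50,000
# remediate: $500,000
# do nothing: $5,000,000
```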

Those numbers have likely changed over the years, but the idea is the same: One of the best ways to save money is to prevent bad data from entering your system in the first place. AI can help. 

Zielaski again pointed to the SDR agent. When a potential customer visits Salesforce’s website, they’re asked to fill out a form, which generates a lead. But the form has to be filled out in a specific way to create standardized, well-formatted data: if a prospect enters a phone number with an extra digit, they’re asked to re-enter it, and if a required field is left blank, the Submit button won’t work. 
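
Server-side, that kind of entry validation might look like the sketch below. The field rules and the phone-digit check are invented for illustration, not Salesforce’s actual form logic.

```python
import re

REQUIRED_FIELDS = ["name", "company", "email", "phone"]

def validate_lead(form):
    """Return a list of problems; an empty list means the lead can be saved."""
    errors = [f"'{f}' is required" for f in REQUIRED_FIELDS if not form.get(f)]
    digits = re.sub(r"\D", "", form.get("phone", ""))
    if form.get("phone") and not 10 <= len(digits) <= 11:
        errors.append("phone must have 10-11 digits; please re-enter")
    return errors

lead = {"name": "Ava Chen", "company": "Acme", "email": "ava@acme.example",
        "phone": "415-555-012345"}  # one digit too many
print(validate_lead(lead))
# ['phone must have 10-11 digits; please re-enter']
```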

Preventing bad data gets even more challenging when a company goes by different names. Japan’s All Nippon Airways, for example, is often called ANA. If the airline’s employees fill out Salesforce website forms using different company names at different times, duplicate accounts will be created — and Salesforce might send redundant outreach emails. To avoid this, a Salesforce team builds AI algorithms that de-duplicate entries and scrub the data to keep it pristine. “Think of the algorithms like vacuum cleaners that are constantly fixing up all that data,” said Zielaski. 
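
A much-simplified version of that de-duplication idea: normalize company names and fold known aliases into one canonical name before creating an account. Salesforce’s production algorithms are far more sophisticated; the alias table and suffix list here are invented to show the principle.

```python
import re

ALIASES = {"ana": "all nippon airways"}  # known alternate names
SUFFIXES = r"\b(inc|corp|co|ltd|llc)\b\.?"

def canonical_name(raw):
    """Lowercase, strip corporate suffixes, and resolve known aliases."""
    name = re.sub(SUFFIXES, "", raw.lower())
    name = re.sub(r"\s+", " ", name).strip(" .,")
    return ALIASES.get(name, name)

entries = ["All Nippon Airways Co., Ltd.", "ANA", "all nippon airways"]
print({canonical_name(e) for e in entries})
# {'all nippon airways'} -- all three collapse to a single account
```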

The revenue generated by the SDR agent offsets the cost of this team. “If you’re building an autonomous agent that can generate a pipeline of hundreds of millions of dollars, and the only thing you’ve got to do is build a five- to 10-person team to manage data quality,” Zielaski said, “find a CEO that isn’t willing to make that investment.” 

Data cleaning is worth every penny

Prepping your data for AI can feel daunting and expensive. But if you break the job down, clean only the data you need, and allocate resources mindfully, you can make the CEO and CFO happy. It’s an investment you won’t regret. 
