In my first role as a machine learning (ML) product manager, a simple question sparked intense debates across departments and leadership: How do we know whether this product actually works? The product I oversaw served both internal and external customers. It enabled internal teams to identify the top issues faced by customers so that they could prioritize the right set of solutions. With such a complex web of relationships between internal and external customers, selecting the right metrics to measure the product’s impact was crucial for steering it toward success.
Failing to monitor the effectiveness of your product is like trying to land a plane without guidance from air traffic control. Without knowing what is going well or poorly, you cannot make informed decisions for your customers. Moreover, if you do not define the metrics proactively, your team will come up with their own alternatives. The danger of leaving an ‘accuracy’ or ‘quality’ metric open to interpretation is that each person will develop their own version of it, and the team may no longer be working toward the same goal.
For instance, when I discussed my annual goal and the underlying metric with our engineering team, their immediate response was: “But this is a business metric, we already track precision and recall.”
First and foremost, determine what you want to learn about your AI product. Defining metrics grows more complex when the product serves multiple customers, as an ML product often does. How do you measure the effectiveness of a model? Measuring whether internal teams prioritized releases based on our model’s findings may not be a fast enough signal, and measuring whether customers adopted the solutions our model recommended could lead to conclusions drawn from an overly broad adoption metric (what if the customer did not adopt the solution simply because they wanted to speak with a support agent?).
Moving into the age of large language models (LLMs), where outputs include text answers, images, and music, the dimensions requiring metrics multiply rapidly — formats, customers, types, and more.
When developing metrics for any of my products, my first step is to distill the impact on customers into a few key questions. Identifying the right questions makes it easier to choose the appropriate metrics. Here are some examples, followed by a sketch of how such metrics might be computed:
1. Did the customer receive an output? → metric for coverage
2. How long did it take for the product to provide an output? → metric for latency
3. Did the user like the output? → metrics for customer feedback, customer adoption, and retention
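To make these questions concrete, here is a minimal sketch of how the corresponding metrics could be computed from instrumented session events. The event schema and field names (session_id, output_shown, response_ms, feedback) are illustrative assumptions, not part of the framework itself; the point is simply that each question reduces to an aggregation over logged sessions.

```python
from dataclasses import dataclass
from statistics import quantiles
from typing import Optional

@dataclass
class SessionEvent:
    """One product session, as captured by (hypothetical) instrumentation."""
    session_id: str
    output_shown: bool          # did the customer receive an output?
    response_ms: Optional[int]  # time taken to provide the output, if one was shown
    feedback: Optional[str]     # "thumbs_up", "thumbs_down", or None

def coverage_pct(events: list[SessionEvent]) -> float:
    """Q1: % of sessions in which the customer received an output."""
    return 100 * sum(e.output_shown for e in events) / len(events)

def latency_p95_ms(events: list[SessionEvent]) -> float:
    """Q2: 95th-percentile time taken to provide an output."""
    times = [e.response_ms for e in events if e.response_ms is not None]
    return quantiles(times, n=20)[-1]  # last of 19 cut points = 95th percentile

def thumbs_up_pct(events: list[SessionEvent]) -> float:
    """Q3: % of rated sessions where the customer gave positive feedback."""
    rated = [e for e in events if e.feedback is not None]
    return 100 * sum(e.feedback == "thumbs_up" for e in rated) / len(rated)
```

In practice these aggregations would run in a data pipeline over event tables rather than over in-memory lists, but the mapping from question to metric is the same.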
After identifying the key questions, the next step is to break them down into sub-questions for ‘input’ and ‘output’ signals. Output metrics are lagging indicators: they measure events that have already occurred. Input metrics, on the other hand, are leading indicators that can identify trends or predict outcomes. In the tables below, for instance, customer feedback and adoption are output (lagging) signals, while quality-rubric scores are input (leading) signals. Not every question needs both a leading and a lagging indicator.
The final step is to establish how the metrics will be collected. Most metrics are gathered at scale through new instrumentation built by data engineering. For ML-based products, however, especially for questions like the third example above, you also have the option of manual or automated evaluations that assess model outputs. While automated evaluations are preferred at scale, starting with manual evaluations of output quality lays the groundwork for a robust, well-tested automated evaluation process.
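To illustrate that last point, here is a hedged sketch of one way to bootstrap an automated evaluation from manual ones: human reviewers grade a sample of outputs against the quality rubric, a candidate automated grader (rules, a model, or an LLM judge) grades the same sample, and the agreement rate tells you whether the automated grader is trustworthy enough to run at scale. The sample data, toy grader and threshold below are illustrative assumptions rather than a prescribed implementation.

```python
from typing import Callable

# Hypothetical sample: output_id -> model output, plus the human rubric grade for each.
outputs = {
    "out-1": "A long, detailed answer " * 20,
    "out-2": "A short partial answer.",
    "out-3": "??",
}
manual_grades = {"out-1": "good", "out-2": "fair", "out-3": "bad"}

def toy_auto_grader(text: str) -> str:
    """Stand-in for a real automated grader (rules, a model, or an LLM judge)."""
    if len(text) > 200:
        return "good"
    return "fair" if len(text) > 10 else "bad"

def agreement_pct(auto_grader: Callable[[str], str]) -> float:
    """% of the manually graded sample where the automated grade matches the human grade."""
    matches = sum(auto_grader(outputs[oid]) == grade for oid, grade in manual_grades.items())
    return 100 * matches / len(manual_grades)

# If agreement is high enough (the threshold is a judgment call), the automated grader
# can take over routine evaluation while humans spot-check a small sample.
print(agreement_pct(toy_auto_grader))
```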
Example use cases: AI search, listing descriptions
The framework outlined above can be applied to any ML-based product to identify the primary metrics for the product. Let’s consider search as an example.
| Question | Metrics | Nature of metric |
| --- | --- | --- |
| Did the customer receive an output? → Coverage | % of search sessions with search results displayed to the customer | Output |
| How long did it take for the product to provide an output? → Latency | Time taken to display search results to the user | Output |
| Did the user like the output? → Customer feedback, customer adoption, and retention | | |
| Did the user indicate that the output is correct/incorrect? | % of search sessions with ‘thumbs up’ feedback on search results from the customer, or % of search sessions with clicks from the customer | Output |
| Was the output good/fair? | % of search results marked as ‘good/fair’ for each search term, according to a quality rubric | Input |
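As a rough sketch, the two feedback metrics in the table above could be aggregated as follows. The log schema (search_term, clicked, grade per result shown) is an assumption for illustration, and the click metric is simplified to per-result rows rather than per search session.

```python
from collections import defaultdict

# Hypothetical evaluation log: one row per search result shown to a customer.
rows = [
    {"search_term": "running shoes", "clicked": True,  "grade": "good"},
    {"search_term": "running shoes", "clicked": False, "grade": "fair"},
    {"search_term": "umbrella",      "clicked": False, "grade": "bad"},
]

# Output metric: % of rows with a click from the customer.
click_pct = 100 * sum(r["clicked"] for r in rows) / len(rows)

# Input metric: % of results marked 'good' or 'fair' for each search term, per the rubric.
grades_by_term: dict[str, list[str]] = defaultdict(list)
for r in rows:
    grades_by_term[r["search_term"]].append(r["grade"])

good_fair_pct = {
    term: 100 * sum(g in ("good", "fair") for g in grades) / len(grades)
    for term, grades in grades_by_term.items()
}

print(click_pct, good_fair_pct)  # e.g. 33.3, {'running shoes': 100.0, 'umbrella': 0.0}
```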
Next, consider a product that generates descriptions for listings (e.g., menu items on DoorDash or product listings on Amazon).
| Question | Metrics | Nature of metric |
| --- | --- | --- |
| Did the customer receive an output? → Coverage | % of listings with generated descriptions | Output |
| How long did it take for the product to provide an output? → Latency | Time taken to generate descriptions for the user | Output |
| Did the user like the output? → Customer feedback, customer adoption, and retention | | |
| Did the user indicate that the output is correct/incorrect? | % of listings with generated descriptions requiring edits from the technical content team/seller/customer | Output |
| Was the output good/fair? | % of listing descriptions marked as ‘good/fair’, according to a quality rubric | Input |
This approach can be extended to various ML-based products. I trust this framework will assist you in defining the appropriate metrics for your ML model.
Sharanya Rao is a group product manager at Intuit.