The Maihem Evaluation Ontology: Transforming Metrics into Actionable Insights

AI workflows have become increasingly easy to build, yet exceedingly complex and difficult to evaluate. As organizations implement systems from Retrieval-Augmented Generation (RAG) based AI assistants to AI agents, they face a fundamental challenge: How do they determine if these AI systems are performing sufficiently well?
Gabriele Morello
Apr 3, 2025
Architecture diagram of the Maihem Ontology with examples

The problem goes beyond simple measurement. Many teams find themselves drowning in metrics without clarity on what actions to take. They collect scores, percentages, and distributions, but struggle to translate these numbers into meaningful improvements. How do you actually measure the success of your AI product?

At Maihem, we've developed a structured framework for AI workflow evaluation that bridges the gap between raw measurement and actionable insight. Our approach transforms data into decisions through three essential components: the Evaluator, the Metric, and the Criteria. This framework enables you to quickly pinpoint exactly what is and isn't working against your company's quality standards, confidently prioritize meaningful improvements, and ultimately reduce operational risks, ensuring your organization can trust and maximize the impact of its AI systems.

The Insight Problem in AI Evaluation

Modern AI workflows often involve multiple interconnected components, each with its own potential failure modes. A RAG system, for instance, combines retrieval, reranking, generation, and specialized post-processing steps. Traditional evaluation approaches typically generate dozens of metrics across these components, creating what we call "metrics overload".

This overload leads to several problems: teams spend more time collecting measurements than interpreting them, struggle to prioritize issues, and miss the connection between the numbers they're seeing and the actions they should take.

A deeper challenge exists in building confidence in systems whose outputs aren't fully deterministic. Organizations need assurance that AI components will perform reliably, but the inherent variability of AI outputs makes traditional verification approaches inadequate. Without structured evaluation, teams lack the certainty needed to trust and deploy AI solutions.

The non-deterministic nature of AI also creates obstacles to building consensus. Technical and business teams may interpret the same performance data differently, leading to disagreements about system readiness or improvement priorities. Without objective metrics that can be consistently applied to probabilistic outputs, organizations struggle to align on whether their AI workflows meet requirements or where to direct resources.

The fundamental challenge is that metrics alone don't tell a story. They provide data points, but not direction. What's needed is a systematic approach to transform measurements into actionable insights while establishing the objectivity necessary to build both confidence and consensus around AI-powered systems.

The Conceptual Framework: Evaluation as a Three-Part System

Our framework approaches evaluation as a cohesive system rather than a collection of isolated measurements. Think of it as similar to a quality control process in manufacturing: you need instruments to collect measurements from the production line, standards to quantify those measurements, and decision rules that determine when intervention is needed.

In our framework, these roles are filled by three complementary components:

  1. The Evaluator: Connects to a component of your AI workflow and determines what can be observed
  2. The Metric: Transforms observations into quantifiable measurements
  3. The Criteria: Converts measurements into actionable insights

These components form a pipeline that takes raw data from your workflows and progressively refines it into clear, actionable knowledge. Each component has a distinct purpose but gains its full power when used as part of this integrated system.
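The three-part pipeline can be sketched in code. This is a minimal, hypothetical illustration, not Maihem's actual API: each class holds the role described above, and `run_evaluation` chains them from raw record to insight.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class Evaluator:
    """Attaches to one workflow component and extracts what can be observed."""
    component: str
    observe: Callable[[Dict[str, Any]], Dict[str, Any]]


@dataclass
class Metric:
    """Transforms an observation into a quantifiable measurement."""
    name: str
    score: Callable[[Dict[str, Any]], float]


@dataclass
class Criterion:
    """A falsifiable statement: check() returns True when it holds."""
    statement: str
    check: Callable[[float], bool]


def run_evaluation(record, evaluator, metric, criterion) -> Dict[str, Any]:
    """Pipeline: raw record -> observation -> measurement -> insight."""
    observation = evaluator.observe(record)
    value = metric.score(observation)
    passed = criterion.check(value)
    insight: Optional[str] = None if passed else criterion.statement
    return {"component": evaluator.component, "metric": metric.name,
            "value": value, "passed": passed, "insight": insight}


# Toy usage: a generation-step evaluator with a length metric and criterion.
evaluator = Evaluator("generation", lambda r: {"response": r["response"]})
metric = Metric("response_length", lambda obs: float(len(obs["response"])))
criterion = Criterion("Responses stay under 500 characters", lambda v: v < 500)
result = run_evaluation({"response": "Hello"}, evaluator, metric, criterion)
```

When a criterion fails, the insight carries the criterion's own statement, which is what makes the final output actionable rather than just numeric.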

The Evaluator: Connecting and Observing

The Evaluator interfaces with your AI workflows by connecting to specific points in your system and monitoring their inputs and outputs for measurement. Like quality control sensors in manufacturing, different Evaluators monitor different aspects of the workflow: query-document pairs in retrieval, result ordering in reranking, and so on. Its key strength is component isolation: by examining individual steps rather than treating the workflow as a black box, it enables precise issue identification.

The Evaluator specifies:

  • What inputs are needed for evaluation
  • What outputs will be analyzed
  • How these connect to specific points in your workflow

By establishing these connection points, the Evaluator creates the foundation for meaningful measurement.
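The three specifications above can be made concrete as a small declarative spec. The field names and `observe` helper here are hypothetical, chosen for illustration, assuming the workflow emits a per-request trace dictionary:

```python
from typing import Any, Dict, List

# Hypothetical spec for a retrieval Evaluator: where it connects in the
# workflow, what inputs it needs, and what outputs it will analyze.
RETRIEVAL_EVALUATOR_SPEC = {
    "connects_to": "retrieval",          # connection point in the workflow
    "inputs": ["query"],                 # inputs needed for evaluation
    "outputs": ["retrieved_documents"],  # outputs to be analyzed
}


def observe(trace: Dict[str, Any], spec: Dict[str, Any]) -> Dict[str, Any]:
    """Extract only the fields the Evaluator declares from a workflow trace,
    isolating one component instead of treating the system as a black box."""
    fields: List[str] = spec["inputs"] + spec["outputs"]
    return {field: trace[field] for field in fields}


observation = observe(
    {"query": "refund policy", "retrieved_documents": ["doc_7"], "latency": 42},
    RETRIEVAL_EVALUATOR_SPEC,
)
```

Everything outside the declared fields (here, `latency`) is dropped, so downstream metrics see exactly the observation the Evaluator promises.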

The Metric: From Observation to Measurement

If the Evaluator determines what can be observed, the Metric determines how it should be measured. Metrics are the quantifiable measures that transform observations into numerical values, enabling you to benchmark and compare the performance of AI workflows.

Metrics in our framework can be organized along several dimensions:

  • Component-specific vs. multi-step: Component-specific metrics measure individual steps in a workflow, while multi-step metrics assess the performance across several connected components (which may include the entire workflow or just a subset of steps).
  • Intrinsic vs. reference-based: Intrinsic metrics evaluate outputs based on their inherent properties (like coherence or toxicity), while reference-based metrics compare outputs against ground truth or labeled data.
  • Single-score vs. distribution: Single-score metrics provide a single value for each data point, while distribution metrics examine patterns across aggregated data.

The challenge with metrics is understanding how to use them to ensure AI agents behave as they should, especially when you face a large number of possibly conflicting metrics.
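The intrinsic vs. reference-based distinction is easiest to see side by side. Both functions below are toy sketches of my own, not Maihem metrics: the first scores an output from its inherent properties alone, the second requires labeled ground truth.

```python
def coherence_proxy(response: str) -> float:
    """Intrinsic, single-score metric: judged from the output's own
    properties. (Toy proxy counting non-empty sentences; a real system
    would use a trained evaluator model.)"""
    sentences = [s for s in response.split(".") if s.strip()]
    return min(len(sentences) / 3.0, 1.0)


def recall_at_k(retrieved: list, relevant: set, k: int = 10) -> float:
    """Reference-based metric: compares retrieved documents against a
    labeled set of relevant documents (the ground truth)."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)
```

Note that `recall_at_k` is meaningless without labeled data, while `coherence_proxy` can run on any output, which is exactly why the two families of metrics suit different stages of a workflow.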

The Criteria: Transforming Measurements into Insights

While metrics tell you what is happening, criteria tell you what it means. The Criteria component transforms raw measurements into clear, actionable insights by establishing boundaries between acceptable and unacceptable performance.

A criterion is a falsifiable statement about your AI workflow—something that can be definitively proven true or false.

Examples include:

  • "No hallucinations are detected in generated responses"
  • "The cosine similarity between original and rephrased queries is below 0.8"

What makes criteria so powerful is their simplicity and consistency. When a criterion is violated, it highlights exactly what aspect of performance needs attention. This transforms the evaluation process from passive measurement to active insight generation. They translate the language of metrics into the language of improvements, answering not just "How is it performing?" but "What should we fix?"
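The second example criterion above can be written as a small falsifiable check. This is a self-contained sketch; the function names and the plain-Python cosine implementation are mine, assuming query embeddings are already available as vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rephrasing_criterion(original_emb, rephrased_emb, threshold=0.8):
    """Falsifiable statement: 'the cosine similarity between original and
    rephrased queries is below 0.8'. True when the criterion holds,
    False when it is violated."""
    return cosine_similarity(original_emb, rephrased_emb) < threshold
```

Because the check returns only True or False, it can be applied identically across every run, which is what gives criteria their consistency.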

Nevertheless, criteria must be defined carefully to ensure they reflect your actual preferences.

Balancing Maihem’s auto-evals and customization

One of the core principles of the Maihem framework is providing you with auto-evals: our expert suggestions based on industry best practices, so you don't have to worry about choosing metrics and criteria yourself. At the same time, Maihem enables you to add and customize metrics and criteria, so you always stay in control and have the final word.

This approach gives you the benefits of structured evaluation without sacrificing adaptability. You get a head start with our recommended settings while maintaining the freedom to tailor the evaluation to your specific needs.

A practical example with a RAG workflow

To understand how these components work together, consider a simplified example of evaluating a RAG workflow:

  1. The Evaluator connects to both the retrieval and generation components, observing queries, retrieved documents, and generated responses.
  2. The Metrics measure various aspects of performance such as:
    • Retrieval metrics calculate relevance between queries and documents
    • Generation metrics assess factors like factual accuracy and coherence
    • End-to-end metrics evaluate the overall user experience
  3. The Criteria transform these measurements into insights:
    • "Retrieval recall@10 is below threshold for technical queries"
    • "10% of responses contain information not supported by retrieved documents"
    • "Response time exceeds target for queries longer than 100 characters"

Each insight directly suggests an action: improve retrieval for technical queries, address hallucination in the generation step, optimize response time for longer queries. The evaluation has progressed from raw data to clear direction.
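The last step of the walkthrough, criteria producing action-oriented insights, can be sketched as a single function. The field names and thresholds below are illustrative assumptions, not Maihem defaults:

```python
def rag_insights(record: dict) -> list:
    """Apply the three example criteria to one evaluated RAG record and
    return the insights for whichever criteria are violated.
    Thresholds (0.7 recall, 2000 ms) are illustrative only."""
    insights = []
    if record["recall_at_10"] < 0.7:                      # retrieval criterion
        insights.append("Improve retrieval: recall@10 below threshold")
    if record["unsupported_claims"] > 0:                  # generation criterion
        insights.append("Address hallucination in the generation step")
    if record["query_chars"] > 100 and record["latency_ms"] > 2000:
        insights.append("Optimize response time for long queries")
    return insights


report = rag_insights({"recall_at_10": 0.5, "unsupported_claims": 0,
                       "query_chars": 50, "latency_ms": 120})
```

Each returned string is phrased as the action to take, so the output of evaluation reads as a to-do list rather than a scoreboard.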

This flow—from connection to measurement to insight—is what makes the framework powerful. It's not just about collecting data; it's about creating meaning from that data.

Conclusion

At the core of Maihem's framework is a simple belief: evaluation should drive action, not just produce metrics. Instead of overwhelming teams with data, we start by asking, "What decisions need to be made?" and work backward to identify the most meaningful observations and measurements. This decision-oriented approach ensures you're not just tracking performance; you're continuously improving it.

Maihem's structured, three-part system of Evaluator, Metric, and Criteria turns raw data into actionable insights. Evaluators connect to your workflow's key components, Metrics quantify performance, and Criteria define what success looks like. Together, they enable teams to quickly identify issues, prioritize fixes, and align on clear next steps. This not only accelerates improvements but also builds confidence and consensus across technical and non-technical collaborators.

Ultimately, Maihem helps you move beyond passive measurement into a cycle of active optimization. By transforming complex AI evaluations into clear, actionable knowledge, it empowers your organization to enhance workflow quality, reduce operational risks, and maximize the impact of your AI systems.
