OMB’s memo M-24-10, in Section 5(c) (“Minimum Practices for Safety-Impacting and Rights-Impacting Artificial Intelligence”), is prescriptive and timebound:
No later than December 1, 2024 and on an ongoing basis while using new or existing covered safety-impacting or rights-impacting AI, agencies must ensure these practices are followed for the AI:
D. Conduct ongoing monitoring. In addition to pre-deployment testing, agencies must institute ongoing procedures to monitor degradation of the AI’s functionality and to detect changes in the AI’s impact on rights and safety. Agencies should also scale up the use of new or updated AI features incrementally where possible to provide adequate time to monitor for adverse performance or outcomes. Agencies should monitor and defend the AI from AI-specific exploits, particularly those that would adversely impact rights and safety.
E. Regularly evaluate risks from the use of AI. The monitoring process in paragraph (D) must include periodic human reviews to determine whether the deployment context, risks, benefits, and agency needs have evolved. Agencies must also determine whether the current implementation of the memorandum’s minimum practices adequately mitigates new and existing risks, or whether updated risk response options are required. At a minimum, human review is required at least on an annual basis and after significant modifications to the AI or to the conditions or context in which the AI is used, and the review must include renewed testing for performance of the AI in a real-world context. Reviews must also include oversight and consideration by an appropriate internal agency authority not directly involved in the system’s development or operation.
On the face of it, the OMB memo refers to specific steps that the implementing organization must undertake. Taking one AI system or model at a time, an implementation specialist (e.g., an AI engineer, product manager, program manager, or scientist) could incorporate these steps into their process for building an AI service or product. For example, one could incorporate a library such as TrustyAI or Arize Phoenix into an MLOps pipeline for evaluation and monitoring; a library-agnostic sketch of that kind of ongoing check is shown below.
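As a concrete illustration, here is a minimal, library-agnostic sketch of the kind of ongoing degradation check (per practice D above) that such tools automate. The function names, baseline, and threshold are hypothetical; the placeholders would be backed by whatever monitoring store your pipeline actually uses (Phoenix traces, TrustyAI metrics, or a metrics database).

```python
# Library-agnostic sketch of an ongoing-monitoring hook (hypothetical names).
from statistics import mean

BASELINE_GROUNDEDNESS = 0.85   # established during pre-deployment testing
ALERT_THRESHOLD = 0.05         # tolerated drop before humans are alerted


def fetch_recent_scores(window: int = 500) -> list[float]:
    """Placeholder: pull the last `window` groundedness scores from your
    monitoring store (e.g., Phoenix traces, TrustyAI metrics, a metrics DB)."""
    return []


def notify_oncall(message: str) -> None:
    """Placeholder: route to email, a chat channel, or an incident tracker."""
    print(f"[ALERT] {message}")


def check_for_degradation() -> None:
    """Compare a rolling window of production scores against the baseline."""
    scores = fetch_recent_scores()
    if not scores:
        return  # nothing logged yet
    current = mean(scores)
    if BASELINE_GROUNDEDNESS - current > ALERT_THRESHOLD:
        notify_oncall(
            f"Groundedness dropped to {current:.2f} "
            f"(baseline {BASELINE_GROUNDEDNESS:.2f}); human review required."
        )


if __name__ == "__main__":
    check_for_degradation()  # run on a schedule (cron job, pipeline step, etc.)
```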
But what about at an organizational level, and from the perspective of an accountable executive like the Chief AI Officer (CAIO)? How might one set up processes and functions to ensure that any team working on AI products meets or exceeds minimum standards of monitoring and evaluation?
To implement a durable and defensible testing, evaluation, and monitoring solution, it is important to begin with a recognized standard, such as the NIST AI RMF, and build practices and procedures around it.
Even if your organization isn’t under the purview of the OMB memo (it applies to federal agencies as defined in 44 U.S.C. § 3502(1)), it is reasonable to expect that upcoming regulations on AI at both the federal and state levels would be informed by (or even replicate) the approach taken by NIST and the OMB.
Using NIST AI RMF to Baseline Testing & Evaluation
Artificial intelligence (AI) advancements have ignited discussions about associated risks, potential biases within training data, and the characteristics of reliable, trustworthy AI. The NIST AI Risk Management Framework (AI RMF) addresses these concerns by providing both a conceptual roadmap for pinpointing AI-related risks and a set of processes tailored to assess and manage those risks. The framework is organized around seven characteristics of trustworthy AI (valid and reliable; safe; secure and resilient; accountable and transparent; explainable and interpretable; privacy-enhanced; and fair, with harmful bias managed) and links the socio-technical aspects of AI to its lifecycle and the relevant actors. A critical component of these processes is “test, evaluation, verification, and validation (TEVV).” The AI RMF outlines four core functions (govern, map, measure, manage), with subcategories detailing ways to implement them.
Incorporating TEVV throughout the AI lifecycle is essential; Appendix A of the RMF describes the relevant tasks, namely:
- TEVV tasks for design, planning, and data may center on internal and external validation of assumptions for system design, data collection, and measurements relative to the intended context of deployment or application.
- TEVV tasks for development (i.e., model building) include model validation and assessment.
- TEVV tasks for deployment include system validation and integration in production, with testing, and recalibration for systems and process integration, user experience, and compliance with existing legal, regulatory, and ethical specifications.
- TEVV tasks for operations involve ongoing monitoring for periodic updates, testing, and subject matter expert (SME) recalibration of models, the tracking of incidents or errors reported and their management, the detection of emergent properties and related impacts, and processes for redress and response.
We’ll go through all of these in this series of blog posts. In our previous article, we covered how LLMs might be tested with a mix of approaches to provide robust testing within a reasonable budget (of both time and money).
In today’s article, we’re going to cover monitoring. Evaluation of LLM performance happens here too, via the metrics that we gather while monitoring. This evaluation (which we’ll refer to as online evaluation), when done as part of an organizational process, can feed dashboards and be distilled into management reporting that supports effective decision making by accountable executives.
Understanding LLM-Based Systems and Applications
At this point, organizations are racing ahead to implement applications that use LLMs at their core. To develop an effective testing and evaluation strategy, it is important to understand current deployment and usage models. A common pattern is a retrieval-augmented generation (RAG) application, which consists of one or more LLMs connected to external data sources and inputs (e.g., user input at a website and the user’s record at the agency). The system retrieves from a corpus of knowledge that is organization- and task-specific (e.g., information about the agency’s processes and procedures) so that the model can perform the desired inference on its inputs (e.g., responding to the user’s query accurately); the model may also be fine-tuned and evaluated on those tasks. Typically, this system is deployed to a production environment accessible to end users. A minimal sketch of this pattern is shown below.
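To make the pattern concrete, here is a minimal, self-contained sketch. The corpus, the keyword-overlap retriever, and the `call_llm` placeholder are illustrative stand-ins for an agency corpus, a real vector store, and a deployed model endpoint.

```python
# Minimal sketch of the RAG pattern described above (illustrative data and names).
AGENCY_CORPUS = [
    "Form 100 renewals must be filed within 30 days of expiry.",
    "Benefit appeals are reviewed by a case officer within 15 business days.",
]


def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank documents by naive keyword overlap with the query.
    A production system would use an embedding model and a vector store."""
    q_terms = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(q_terms & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def call_llm(prompt: str) -> str:
    """Placeholder: swap in the deployed LLM endpoint (e.g., Azure OpenAI)."""
    return "(model response would appear here)"


def answer(query: str) -> str:
    """Assemble retrieved context into a grounded prompt and call the model."""
    context = "\n".join(retrieve(query, AGENCY_CORPUS))
    prompt = (
        "Answer the user's question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return call_llm(prompt)


print(answer("How long do I have to renew Form 100?"))
```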
Given this typical deployment pattern, the responsible organizational group should develop a systematic approach that includes the following (a sketch of a monitoring record supporting these steps appears after the list):
- Identifying Key Metrics
- Defining Data Collection and Integration Needs
- Establishing Performance Monitoring and Analysis Reports
- Developing a Baseline using Benchmarking and Comparison
- Monitoring Resource Utilization and Optimizing Cost
- Ensuring Continuous Improvement and Adaptation
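As a sketch of what the data-collection and reporting steps might capture, consider a per-interaction monitoring record like the one below. The field names and the cost rate are illustrative assumptions, not a prescribed schema.

```python
# Sketch of a per-interaction monitoring record and a simple daily rollup.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from statistics import mean


@dataclass
class InteractionRecord:
    query: str
    response: str
    retrieved_context: list[str]
    groundedness: float          # AI-assisted or human-scored, 0-1
    relevance: float
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def estimated_cost(self, usd_per_1k_tokens: float = 0.01) -> float:
        """Rough figure for resource-utilization reporting (rate is illustrative)."""
        return (self.prompt_tokens + self.completion_tokens) / 1000 * usd_per_1k_tokens


def daily_report(records: list[InteractionRecord]) -> dict:
    """Aggregate records into the figures a management dashboard would show."""
    return {
        "interactions": len(records),
        "avg_groundedness": mean(r.groundedness for r in records),
        "avg_relevance": mean(r.relevance for r in records),
        "p50_latency_ms": sorted(r.latency_ms for r in records)[len(records) // 2],
        "estimated_cost_usd": sum(r.estimated_cost() for r in records),
    }
```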
What are the key metrics we would look at during the evaluation and monitoring process? For an LLM-based application, they would be the following (a minimal scoring sketch follows the list):
- Groundedness: The extent to which generated responses are factually verifiable based on authoritative sources (e.g., company databases, product manuals, regulatory guidelines). This metric is crucial in scenarios where accuracy and compliance are non-negotiable.
- Relevance: The model’s ability to tailor responses specifically to the user’s query or input. High relevance scores signify strong comprehension of business context and promote efficiency in problem-solving.
- Coherence: The clarity, logical structure, and overall readability of generated text. Coherent responses are indispensable for seamless internal and external communications, upholding a professional brand image.
- Fluency: The model’s linguistic proficiency, focusing on grammatical correctness and vocabulary usage. Responses demonstrating strong fluency enhance the organization’s credibility and optimize information exchange.
- Similarity: The alignment between model-generated responses and the desired content or messaging (as established in ground truth sources). High similarity ensures consistency with the company’s knowledge base and brand voice.
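AI-assisted versions of these metrics are typically computed with an evaluator (“judge”) model. The following is a minimal sketch of that technique for groundedness; the prompt template is illustrative, and `call_judge_model` is a placeholder for a real evaluator model call. The same pattern extends to relevance, coherence, fluency, and similarity.

```python
# Sketch of AI-assisted ("LLM-as-judge") scoring for groundedness.
JUDGE_TEMPLATE = """You are grading an AI assistant's answer.
Context:
{context}

Answer:
{response}

On a scale of 1 (not grounded) to 5 (fully grounded in the context),
how well is the answer supported by the context? Reply with a single integer."""


def call_judge_model(prompt: str) -> str:
    """Placeholder: invoke a strong evaluator model (e.g., a GPT-4-class deployment)."""
    return "5"


def groundedness_score(context: str, response: str) -> int:
    """Ask the judge model for a 1-5 rating and clamp/parse its reply."""
    raw = call_judge_model(JUDGE_TEMPLATE.format(context=context, response=response))
    try:
        return max(1, min(5, int(raw.strip())))
    except ValueError:
        return 1  # treat unparseable judgments as failures for human review


print(groundedness_score(
    "Form 100 renewals are due within 30 days.",
    "You must renew Form 100 within 30 days.",
))
```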
Evaluation & Monitoring
Now that we’ve established the high-level concepts and processes that go into monitoring and evaluating AI applications, we’re going to look at specifics and implementation.
Microsoft Azure OpenAI
For the examples in this blog post, we’re going to use Microsoft Azure OpenAI. Why? Microsoft recently made the Azure OpenAI Service available in its Azure Government cloud, which is FedRAMP authorized at the High baseline, and our analysis shows that a number of agencies are currently building and deploying solutions on this platform.
However, the same general principles and approach can be applied to other LLMs and platforms as they become available within a FedRAMP authorization boundary in the future.
Evaluation of the LLM-based application can then be automated in the Azure environment, with metrics collection, calculation, and dashboarding. Azure AI Studio (currently in preview) provides several flexible options that an AI application developer can take to evaluate generative AI applications. The figure below provides an overview of the AI development lifecycle, beginning with sample data.
Figure: AI development lifecycle using Microsoft Azure (image taken from Microsoft website and reproduced here for ease of reading the blog)
Playground: Offers a sandbox for experimentation. The AI practitioner chooses the grounding data, a base model, and guiding instructions, then evaluates the application’s output manually. This path includes an evaluation wizard for further assessment using traditional or AI-assisted metrics.
Flows: An AI practitioner would use the open-source tool promptflow. Azure AI Studio has a dedicated development environment for promptflow that visualizes how the various LLMs, prompts, and application-specific code are connected (called a “flow”). The environment allows for streamlined debugging, easy sharing, and performance testing of prompt variations at scale. A practitioner could also use LangChain, which offers additional flexibility but is not incorporated into Azure AI Studio in the way promptflow is. More information about using LangChain with Azure and OpenAI is available here.
Direct Dataset Evaluation: If you have interaction data from your application, you can evaluate it directly for AI-assisted insights. Results are visualized within Azure AI Studio. You can also use the SDK or CLI for this process (see the sketch below).
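For the SDK route, a sketch along the following lines is possible, assuming the azure-ai-evaluation package’s built-in evaluators; package, class, and parameter names have shifted across preview releases, so treat the exact imports and configuration keys as assumptions to verify against the current documentation.

```python
# Sketch of direct dataset evaluation via the SDK (assumes azure-ai-evaluation;
# verify names against the current docs).
import os

from azure.ai.evaluation import (
    evaluate,
    GroundednessEvaluator,
    RelevanceEvaluator,
    CoherenceEvaluator,
    FluencyEvaluator,
)

# Connection details for the AI-assisted evaluators (values are placeholders).
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "gpt-4o",
}

# interaction_data.jsonl holds logged query/context/response records.
result = evaluate(
    data="interaction_data.jsonl",
    evaluators={
        "groundedness": GroundednessEvaluator(model_config),
        "relevance": RelevanceEvaluator(model_config),
        "coherence": CoherenceEvaluator(model_config),
        "fluency": FluencyEvaluator(model_config),
    },
)
print(result["metrics"])  # aggregate scores; row-level results are also returned
```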
An overview of the platform can be seen here: AI Show: On Demand | LLM Evaluations in Azure AI Studio
AI practitioners can evaluate applications in Azure AI Studio using both traditional machine learning measurements and AI-assisted measurements. The above is a summary of the available options; AI practitioners will want to explore these features and their documentation further.
Alternative AI Evaluation Platforms
Alternatives to Azure AI Studio exist, and organizations should evaluate whether these alternatives better meet their needs.
Red Hat OpenShift AI (and its community variant, Open Data Hub) is one such alternative. It offers a flexible and useful hybrid and multi-cloud solution architecture. For US Government agencies, the ready availability of FedRAMP-authorized versions of OpenShift on AWS and Azure provides a solid platform for working with large language models. OpenShift AI relies on the open-source Kogito project’s TrustyAI service for model evaluation and testing.
Amazon Bedrock is another alternative, which might be the favored approach for an organization that is already Amazon Web Services (AWS) centric. It is a fully managed service for the development and deployment of generative AI applications. It offers a choice of foundation models from leading AI providers such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon, along with the ability to ground them in organizational data using a retrieval-augmented generation (RAG) construct. Information about the evaluation capabilities in Bedrock is available here.
Summary
In this blog post, we have described our approach to evaluating large language models (LLMs) and their applications. Federal agencies (and indeed any organization deploying AI applications) need a comprehensive framework for testing and evaluating LLMs, covering tasks and considerations across the model lifecycle, in order to meet their obligations under OMB’s M-24-10. The focus in this blog post was on monitoring LLM systems during operations. Effective evaluation of performance and user experience can only take place in production settings (i.e., pre-deployment testing alone is not sufficient), especially as we need to measure key metrics such as groundedness, relevance, coherence, fluency, and similarity. We shall continue to explore better ways to ensure safe and responsible AI application development, deployment, and operation in the context of regulated organizations.