Operationalizing AI in Production Systems
It's 2025, and Large Language Models (LLMs) are all the rage. We've seen a massive build-out of infrastructure, from the GPU stack in datacenters to the software product stack, and it's easier than ever to build AI-enabled applications. One recent innovation is integration directly into desktop and server environments via things like the Model Context Protocol (MCP), introduced by Anthropic in late 2024. Finally, enterprise offerings of AI systems are becoming available to the front-line workers of every company, along with tooling that enables non-software professionals to build custom solutions on top of high-quality, general-purpose AI systems.
With all this innovation, there is an apparent industry-wide push to start generating real economic value beyond the domain of Software Engineering, a profession that first received AI-augmentation in June of 2021 with the release of GitHub Copilot. This blog aims to provide some useful background, then define a tool-agnostic approach for assessing and operationalizing AI use-cases.
AI Introduction
Importantly, the AI domain is much larger than just LLMs; there are many different techniques and architectures for building systems that learn from data. I'm not an expert, but there's a rich trove of literature on the subject, and LLMs themselves are great for learning about LLMs and other techniques, particularly when they have access to web search. I highly recommend using them to grow your knowledge if that sounds appealing. From this point forward, when you encounter the term AI, I'll be referring to LLMs.
Before attempting to answer the question "can AI do that?", it's prudent to spend some time contemplating how use-cases are conceptually built with modern AI systems. Most people will be familiar with the product versions of these models, such as OpenAI's ChatGPT, Google's Gemini, xAI's Grok, and Anthropic's Claude. These aren't the only offerings, but they are all quite popular.
An important thing to be aware of up front is that these models are trained to perform "text prediction": given a sequence of words, complete the _____. Most people would say the word "sentence" belongs in that placeholder, and I'd agree. Importantly, this is a fundamentally probabilistic task. Because this is what models are doing at their core, they are data-hungry, and more information is generally better for any given target task: be as specific as you can when handing a task to these systems, as it helps them "understand" what you're asking for. The Prompt engineering overview page from Anthropic is a great starting point for learning more techniques and considerations in this regard.
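To make the "be specific" point concrete, here's a minimal sketch using the Anthropic Python SDK (any provider works the same way). The model ID is a placeholder to swap for whatever is current, and `report_text` stands in for your actual source material:

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in your environment

report_text = "..."  # whatever document you actually want summarized

# A vague ask leaves the model guessing about audience, length, and focus.
vague_prompt = f"Summarize this report.\n\n{report_text}"

# A specific ask describes exactly what a good answer looks like.
specific_prompt = (
    "Summarize the following quarterly report in three bullet points for a "
    "non-technical executive. Highlight revenue changes and any decisions "
    f"needed this month.\n\n{report_text}"
)

for name, prompt in [("vague", vague_prompt), ("specific", specific_prompt)]:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model ID; use whichever is current
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {name} ---\n{response.content[0].text}\n")
```

Running both prompts side by side is a quick way to see how much the extra context changes the output.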
One important detail to be aware of with AI is the so-called "training cutoff date" problem. During training, models learn a compressed representation of everything they observe. By the end, they have memory of an incredible range of subjects, including STEM, Computer Science, code bases, journals, articles, etc., but their knowledge is "fixed" to what was learned in training. This poses two problems:
- Models have no memory of recent events
- Models have no knowledge about the inner-workings of your business (beyond what’s publicly available)
The first step in overcoming this was the Retrieval Augmented Generation (RAG) architecture, which addresses the knowledge cutoff by injecting data into the prompt to imbue the model with new information. The second was the introduction of web search, effectively granting access to any information available on the public internet. But web search is really a specialization of a more general pattern often referred to as "tool use": a feature that enables AI systems to interact with the rest of the world. Claude Desktop and its ability to explore the files and folders on your local storage is a good example of tool use. Another example is artifact creation and code interpreters in the major AI products. Tool use, and the ability to interact with an environment, is the cornerstone of "Agentic AI".
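As a rough sketch of the RAG idea: `DOCUMENTS` below is a made-up in-memory knowledge base, the retriever is a naive keyword match standing in for a real search index, and `call_model` is a hypothetical wrapper around whichever LLM API you use.

```python
# Toy in-memory "knowledge base" standing in for a real document store.
DOCUMENTS = [
    "The G117-5WS stand holds up to 5 electric guitars and retails for $59.99.",
    "Replacement rubber caps (G117-R-RC) fit all G117-series stands, 4 per pack.",
    "Orders placed before 2pm ship the same business day.",
]

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Naive keyword retrieval; a real system would use a vector or keyword index."""
    words = query.lower().split()
    ranked = sorted(DOCUMENTS, key=lambda doc: sum(w in doc.lower() for w in words), reverse=True)
    return ranked[:top_k]

def call_model(prompt: str) -> str:
    """Hypothetical wrapper around whichever LLM API you're using."""
    raise NotImplementedError("plug in your provider's client here")

def answer_with_rag(question: str) -> str:
    # Inject retrieved documents into the prompt so the model can ground its
    # answer in information it never saw during training.
    context = "\n\n".join(retrieve(question))
    prompt = (
        "Answer the question using only the context below. "
        "If the context doesn't contain the answer, say so.\n\n"
        f"<context>\n{context}\n</context>\n\n"
        f"Question: {question}"
    )
    return call_model(prompt)
```

The key move is the same regardless of how sophisticated the retriever is: fetch data the model couldn't have seen in training, and place it in the prompt.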
Agentic AI
So how do agents "work"? It's relatively straightforward:
- Model capability has advanced far enough that models can use tools very reliably
- System information is injected into a prompt so that the AI knows what tools are available to it
- Models “contemplate” which tools at their disposal would be useful for answering a user query
At the heart of this is basic prompt engineering. A template tells the model what tools are available and provides guidance on which tasks it should or should not perform (both are important!). Up front, the model needs to establish a "plan" for how it's going to accomplish what the user requested. From there, it iterates with its available tools to execute the plan, then reports the final outcome to the user. Planning is typically done through Chain-of-Thought (CoT) prompting, and models now undergo special training runs as part of the overall training pipeline to make them good at tool use.
Some interesting papers on CoT were the original CoT paper, as well as Tree of Thoughts (ToT) from Google. If a literature review sounds fun, I highly recommend Tree of Thoughts!
There are a lot of tool-use mechanisms available today. The most popular seems to be the Model Context Protocol (MCP), introduced by Anthropic in late 2024. Google also published its Agent2Agent (A2A) protocol, which I believe shares many objectives with MCP. There are many others too, though MCP is receiving tremendous adoption across the ecosystem. Importantly, a primary motivation for having a simple, well-defined protocol is that it makes it easy for labs to conduct post-training that enables models to excel at general tool use.
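To get a feel for what MCP looks like in practice, here's a minimal tool-server sketch based on the official `mcp` Python SDK's FastMCP helper. Treat the exact API surface as something to verify against the current SDK docs, and note that the catalog below is made up:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("product-catalog")

# A tiny made-up catalog, standing in for a real inventory system.
CATALOG = {
    "G117-5WS": {"name": "5 guitar stand", "price": "$59.99"},
    "G117-3WS": {"name": "3 guitar stand", "price": "$39.99"},
}

@mcp.tool()
def lookup_sku(sku: str) -> str:
    """Return name and price for a SKU, so an agent can answer product questions."""
    item = CATALOG.get(sku)
    return f"{item['name']}, {item['price']}" if item else f"Unknown SKU: {sku}"

if __name__ == "__main__":
    # Runs the server over stdio so an MCP client (e.g. Claude Desktop) can connect to it.
    mcp.run()
```

Once a model has been post-trained to speak a protocol like this, any tool exposed this way becomes usable without custom glue code per model.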
The workflow for tool use also happens to be a simple one:
- At the start, the agent software system sends a request to a model to construct a plan
- The agent software sends a request to a model including the generated plan and its current step
- The model determines whether it should use a tool, and if so, crafts a special response for the software to do so
- The software uses the tool, then sends the result back to the model
- This continues until one of three states is achieved:
- The model achieves its goal, and writes out a response indicating this has occurred
- The software “times out” by reaching a prompt and/or token limit
- The model gives up on the task
Conceptually, this sort of architecture is fairly simple, and the rubric above is intended only for explanatory purposes; the implementation details can get quite complicated in practice.
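In Python-flavored pseudocode, that loop might look something like the following; `call_model` and `parse_tool_call` are hypothetical helpers for illustration, not a real framework API:

```python
def call_model(history: list[dict]) -> str:
    """Hypothetical: send the conversation to your LLM provider and return its reply."""
    raise NotImplementedError

def parse_tool_call(reply: str) -> dict | None:
    """Hypothetical: return {"name": ..., "args": ...} if the reply requests a tool, else None."""
    raise NotImplementedError

def run_agent(user_request: str, tools: dict, max_turns: int = 10) -> str:
    """Minimal agent loop: plan, call tools, stop on an answer or a turn limit."""
    history = [
        {"role": "system", "content": f"You can use these tools: {list(tools)}. "
                                      "Write a plan first, then work step by step."},
        {"role": "user", "content": user_request},
    ]

    for _ in range(max_turns):                      # "times out" via a turn limit
        reply = call_model(history)
        history.append({"role": "assistant", "content": reply})

        tool_call = parse_tool_call(reply)
        if tool_call is None:
            return reply                            # the model answered (or gave up)

        # Execute the requested tool and feed the result back to the model.
        result = tools[tool_call["name"]](**tool_call["args"])
        history.append({"role": "user", "content": f"<tool_result>{result}</tool_result>"})

    return "Stopped: turn limit reached before the task was completed."
```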
With all that context, let's now consider an example of what primitive agentic behavior might look like. This does not leverage any tools or planning (it's a one-shot attempt at the task), but I believe it illustrates the basic idea.
<instructions>
You are an agent supporting a product search inbox. Users will reach out to our brand requesting details about ordering assembled products and replacement parts. The SKUs we offer are described in the following table:
"Product Name","Item Kind","SKU","Product Description","RetailPrice"
"5 guitar stand","Package","G117-5WS","This is our most popular seller! It holds up to 5 electric guitars, or 2 acoustic and 3 electric.","$59.99"
"3 guitar stand","Package","G117-3WS","This stand is great for holding up to 3 electric guitars or 1 acoustic and 2 electric.","$39.99"
"3 guitar tree","Package","T117-3WS","This guitar-tree is a hanging stand with a strong, stable base, and is great for holding up to 3 guitars. The spatial dimensions are 3 feet in circumference, and an adjustable 3.5-4 feet in height.","$29.99"
"Rubber cap","Replacement Part","G117-R-RC","This is a replacement cap compatible with all G117- series stands. 4 per pack.","$1.99"
Your job is to accurately answer customer inquiries based on the information above. When you feel we can answer the question directly, include the text "ANSWERED" in the first line, followed by a new line, and the remainder of the response. If the inquiry has nothing to do with our business, write the text label "UNKNOWN_PRODUCT_SUPPORT" in the first line of the response, along with an assessment of what may be needed from our customer support team. If the inquiry can be answered based on our catalog, but more information is required, write the label "INFORMATION_NEEDED" in the first line, followed by a response to our customer requesting the additional information required to service their request. If we can service their need directly by linking to a product page, then replace the SKU in the following URL: https://mybrand.com/purchase/<sku>. When answering inquiries for prospective customers, write an encouraging message thanking them for their interest in our brand, and encourage them to reach out if they'd like to speak to a sales associate. When inquiries for sales associates are received, please write "ROUTE_TO_SALES" in the first line of the response.
If the customer inquires about where they can buy one of our products, or about our product catalog, answer their inquiry about our product offerings first, and then advise them that they can buy either directly from us, or from an outlet such as Guitar Center or Sweetwater. We also supply to a vast network of smaller outlets through a distribution network, and they may be able to obtain one of our products at a local store if they prefer to shop there.
If more information would help you serve the user in navigating our product list, please ask them for information. Please try not to be redundant in follow-up conversations: they can always re-read the message history if they want, and being too verbose may be off-putting, even when trying to be helpful.
Always maintain a professional and courteous tone, regardless of how the user is behaving. Many customers contact our support agents very frustrated, and it's our job to turn those experiences around and make sure our customers have a great experience with us. We believe this approach will lead to lifelong customers of our brand, and make them brand ambassadors in the process. We need you fully aligned with this mission.
</instructions>
Here are some examples:
<example>
<user_query>
I'm interested in replacement parts for one of your stands. I lost a leg cap at a recent gig. Do you sell replacements?
</user_query>
<response>
ANSWERED
// Write reply here
</response>
</example>
<example>
<user_query>
I'm looking for a guitar stand. What do you offer
</user_query>
<response>
INFORMATION_NEEDED
// reply goes here
</response>
</example>
Always, always, always ensure to write the label in the first line. As a reminder, here are the valid labels and when to use them:
- ANSWERED: use this one when we've fully answered the inquiry
- INFORMATION_NEEDED: use this one to signal that we need to inquire for more information from the customer
- ROUTE_TO_SALES: use this when you believe that there's a possible lead, but are unsure of how to answer the question based on information available, and believe we should assign a Sales associate to the user
- UNKNOWN_PRODUCT_SUPPORT: use this one when there's no reasonable answer to the inquiry. We have a special mechanism we use for this case that has been approved to ensure we protect our brand integrity.
<user_query>
I have a Gibson Les Paul, a Fender Telecaster, and an acoustic guitar. I'm looking for something that can hold these at band practices and gigs. Our rehearsal space is a small living room, and we typically play in wineries and craft brew houses.
</user_query>
I plugged this into Claude and Gemini, and below are some sample outputs from each. First, Claude:
ANSWERED
Thank you for your interest in our guitar stands! Based on your setup with a Gibson Les Paul, Fender Telecaster, and an acoustic guitar, I'd recommend our **5 guitar stand (SKU: G117-5WS)** - it's actually our most popular seller and would be perfect for your needs.
This stand can hold up to 5 electric guitars, or 2 acoustic and 3 electric guitars, which means it will easily accommodate all three of your instruments with room to spare. At $59.99, it's designed to be stable and reliable for both rehearsal spaces and gigs at venues like wineries and craft brew houses.
If you're looking for something more compact for smaller spaces, our **3 guitar stand (SKU: G117-3WS)** at $39.99 can hold up to 3 electric guitars or 1 acoustic and 2 electric, which would also work perfectly for your three-guitar setup.
You can purchase directly from us at https://mybrand.com/purchase/G117-5WS (for the 5-guitar stand) or https://mybrand.com/purchase/G117-3WS (for the 3-guitar stand). You can also find our products at Guitar Center, Sweetwater, or through our network of smaller local music stores if you prefer to shop locally.
Thanks for considering our brand for your guitar storage needs! Please don't hesitate to reach out if you'd like to speak with one of our sales associates or if you have any other questions.
Google Gemini:
INFORMATION_NEEDED
Thanks for your interest in our guitar stands! To find the perfect stand for you, could you please let us know how many guitars you'd like to store? If you're looking for a stand for three guitars, we have two different options: a stand that holds up to three guitars, and a guitar tree that also holds up to three guitars. Please let us know if either of these sound like something you'd be interested in.
Clearly, the two models replied very differently. That's partly an artifact of me using the consumer applications for this experiment (you can control behavior more tightly via their APIs), but it illustrates the real-world tradeoffs you might encounter between models, and why evaluation is so important.
Admittedly, this is a simple example, but it's essentially what's happening under the hood in agent systems. If you're willing to read some code (or ask an LLM to explain it to you), there are a number of open-source agentic frameworks you can study to learn more about how these sorts of systems are built.
These frameworks are also great starting points for building a custom agent if you're comfortable working in Python or any of the other languages they support. It's also 2025, and I'd be remiss if I didn't suggest you try vibe coding an example agent using Python and a system like Claude Code.
The example in this section is a toy I came up with from a real product support experience. I emailed a brand with an inquiry about replacement parts that I eventually purchased from them, but the experience was slow; they could have earned my business in a single response if they had a great support agent in place for product inquiries. At the very least, they could have gotten me there in a single exchange, and avoided some unnecessary email back-and-forth caused by me not providing enough information up front. If that brand wanted to build such a tool, how would they know if their idea was even viable?
Manual Capability Testing
When assessing use-cases for AI, I think the most important first step is to conduct a capability test. The objective in this phase is to come up with a concept for what your system would even do (goals and tasks), along with what it shouldn’t do, then craft a prompt that seeks to demonstrate whether it could succeed at all. This involves:
- Establishing your “labeling” instructions. This is basically a set of instructions for how to perform a task or tasks.
- Bringing data to the prompt. This can be made-up, so long as it’s representative of what you intend to provide for production uses.
- Prompting one or more models with a query that reflects what you expect users to ask, along with all the other information captured in the previous steps.
Hopefully, my simple example illustrated what defining and sending a task to a potential agent might look like. If you're able to show that a model might be capable of a task, then you're off to the races! This leads us into the next phase of operationalizing such systems: evaluations.
Evaluation Criteria and Metrics
Once you've got a sample prompt (or prompts) showcasing tasks you expect your system to perform, there's an important question to answer: did the system do what we wanted it to do? The standard practice for answering this is to create an evaluation test suite. The objective is to vary the way questions are asked of your system and to permute them to reflect the mistakes people are likely to make: re-wordings, spelling errors, missing punctuation, etc. Most models aren't sensitive to these things, but it's a good idea to check.
The good news in this phase is that you can get pretty far with automated prompting, and you can also have models generate some of the input queries for you. Continuing from our original example, here's a revised version of the same prompt:
<task>
We're working to generate test data for an AI System. The objective is to evaluate the `instructions` section below and generate a set of candidate questions. Let's aim for 5 sample inquiries that this system might receive. We should also generate the data with a range of discrepancies, such as missing or mis-spelled words, absence of punctuation, etc.
</task>
<instructions>
You are an agent supporting a product search inbox. Users will reach out to our brand requesting details about ordering assembled products and replacement parts. The SKUs we offer are described in the following table:
"Product Name","Item Kind","SKU","Product Description","RetailPrice"
"5 guitar stand","Package","G117-5WS","This is our most popular seller! It holds up to 5 electric guitars, or 2 acoustic and 3 electric.","$59.99"
"3 guitar stand","Package","G117-3WS","This stand is great for holding up to 3 electric guitars or 1 acoustic and 2 electric.","$39.99"
"3 guitar tree","Package","T117-3WS","This guitar-tree is a hanging stand with a strong, stable base, and is great for holding up to 3 guitars. The spatial dimensions are 3 feet in circumference, and an adjustable 3.5-4 feet in height.","$29.99"
"Rubber cap","Replacement Part","G117-R-RC","This is a replacement cap compatible with all G117- series stands. 4 per pack.","$1.99"
Your job is to accurately answer customer inquiries based on the information above. When you feel we can answer the question directly, include the text "ANSWERED" in the first line, followed by a new line, and the remainder of the response. If the inquiry has nothing to do with our business, write the text label "UNKNOWN_PRODUCT_SUPPORT" in the first line of the response, along with an assessment of what may be needed from our customer support team. If the inquiry can be answered based on our catalog, but more information is required, write the label "INFORMATION_NEEDED" in the first line, followed by a response to our customer requesting the additional information required to service their request. If we can service their need directly by linking to a product page, then replace the SKU in the following URL: https://mybrand.com/purchase/<sku>. When answering inquiries for prospective customers, write an encouraging message thanking them for their interest in our brand, and encourage them to reach out if they'd like to speak to a sales associate. When inquiries for sales associates are received, please write "ROUTE_TO_SALES" in the first line of the response.
If the customer inquires about where they can buy one of our products, or about our product catalog, answer their inquiry about our product offerings first, and then advise them that they can buy either directly from us, or from an outlet such as Guitar Center or Sweetwater. We also supply to a vast network of smaller outlets through a distribution network, and they may be able to obtain one of our products at a local store if they prefer to shop there.
If more information would help you serve the user in navigating our product list, please ask them for information. Please try not to be redundant in follow-up conversations: they can always re-read the message history if they want, and being too verbose may be off-putting, even when trying to be helpful.
Always maintain a professional and courteous tone, regardless of how the user is behaving. Many customers contact our support agents very frustrated, and it's our job to turn those experiences around and make sure our customers have a great experience with us. We believe this approach will lead to lifelong customers of our brand, and make them brand ambassadors in the process. We need you fully aligned with this mission.
Always, always, always ensure to write the label in the first line. As a reminder, here are the valid labels and when to use them:
- ANSWERED: use this one when we've fully answered the inquiry
- INFORMATION_NEEDED: use this one to signal that we need to inquire for more information from the customer
- ROUTE_TO_SALES: use this when you believe that there's a possible lead, but are unsure of how to answer the question based on information available, and believe we should assign a Sales associate to the user
- UNKNOWN_PRODUCT_SUPPORT: use this one when there's no reasonable answer to the inquiry. We have a special mechanism we use for this case that has been approved to ensure we protect our brand integrity.
</instructions>
Now, please generate these as a set of test questions. Write them out in JSON format with the structure:
<JSON>
[
"message 1",
"message 2",
// etc.
]
</JSON>
I need this so I can pipe it into some automation.
Here's the result I obtained from Gemini:
[
"hey can you tell me the price for a 5 guitar stand",
"I'm looking for a replacement rubber cap for my stand, can I get a link to that?",
"my kid just took up guitar, and I'd like to get them one of your stands... which one would you recommend?",
"I have one of your G117-3WS stands and I need some spare parts but I lost the sku can you help me with that",
"do you guys make stands for keyboards"
]
It did a decent job at crafting queries, though this would still benefit from generating some mangled versions of the same things. We also need to get some queries that should be patently refused. One of my favorites is “how many pizza boxes would I need to stack to get to the moon?” It’s clearly absurd, and most systems being built for business use-cases should probably reject this one.
Armed with your test prompt and queries, it’s time to measure. First, establish a task label for each query. These should group queries that are similar and should be answered the same way. Next, send each query to the model(s) you’re evaluating, and for each response, capture the following in a CSV file or spreadsheet:
- First, assign the task label to a “label” column.
- Next, assign a positive or negative value to a "predicted_direction" column. Positive cases are ones where the model attempted to serve the request; negative cases are ones where it refused.
- Finally, determine whether the model's behavior was correct, and record this as true or false in a "correct" column.
Here’s a sample of what it should look like:
"label","predicted_direction","correct"
"task-1","pos","true"
"task-1","pos","false"
"task-1","pos","true"
"task-2","neg","true"
"task-2","neg","false"
With this, you can summarize the data into a confusion matrix. Summarizing the sample above yields:
| | Predicted Positive | Predicted Negative | Total |
|---|---|---|---|
| Actual Positive | 2 | 1 | 3 |
| Actual Negative | 1 | 1 | 2 |
| Total | 3 | 2 | 5 |
Using this, you can calculate precision and accuracy, where TP, FP, TN, and FN are true positives, false positives, true negatives, and false negatives. Here are the definitions of these metrics:
Precision = TP / (TP + FP)
Or, in English, “precision equals true positives divided by all positives”.
Accuracy = (TP + TN) / (TP + FP + TN + FN)
Or, in English, “accuracy equals total correct predictions divided by total predictions”.
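Plugging the sample confusion matrix above into these formulas: precision = 2 / (2 + 1) ≈ 0.67, and accuracy = (2 + 1) / (2 + 1 + 1 + 1) = 0.60.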
These evaluation metrics are really important to understand and track through the lifecycle of your system. It’s also advisable to have targets for each of them that serve as a gate before going to production. The nice thing about these is that you can always re-benchmark the system after any change.
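If you'd rather not tally these by hand, a short script can recompute them from the results file every time you re-benchmark. This is a minimal sketch assuming the CSV layout shown above (and the hypothetical `eval_results.csv` filename from the earlier harness):

```python
import csv

# Tally the confusion matrix from the evaluation CSV described above.
counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
with open("eval_results.csv") as f:
    for row in csv.DictReader(f):
        predicted_positive = row["predicted_direction"] == "pos"
        correct = row["correct"].strip().lower() == "true"
        if predicted_positive:
            counts["TP" if correct else "FP"] += 1
        else:
            counts["TN" if correct else "FN"] += 1

# max(..., 1) avoids a divide-by-zero when a run has no positive predictions.
precision = counts["TP"] / max(counts["TP"] + counts["FP"], 1)
accuracy = (counts["TP"] + counts["TN"]) / max(sum(counts.values()), 1)
print(f"precision={precision:.2f} accuracy={accuracy:.2f} counts={counts}")
```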
There's another metric that frontier labs use in AI benchmarks, pass@1, which tracks whether a system achieved the desired outcome on its first try. If your system is going to support turn-based dialogue, this family of metrics is a good one for identifying how many attempts it took a user to reach their desired outcome.
Further Learning
If you would like to learn more about language models specifically, there are some excellent videos on YouTube:
- Andrej Karpathy: [1hr Talk] Intro to Large Language Models
- DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters | Lex Fridman Podcast #459
There's also the wonderful Ahead of AI publication from Sebastian Raschka, which has a number of excellent posts.
All of these resources go into significant detail, likely much more than most people need. I don't think engaging with them is mandatory, but you may find them helpful.
Concluding
In this article, I did a very high-level survey of the current state of the AI ecosystem. I also attempted to provide a conceptual model for how agentic AI works, along with a simplified example of what agents look like internally, and mentioned some tools commonly used for building agents. I then argued that most AI use-cases should start with a capability test to determine whether any of these models seems capable of performing a given target task, with emphasis on negative cases as well. Finally, I introduced a simple method for evaluating a system, with further recommendations on getting to production.
This post is the first in what I intend to be a series. In the future, I’ll address things like how to operationalize a use-case, including some common themes I’m seeing. I also really want to touch on things like cost modeling, logging, capturing feedback from users (and taking automated action from it!), as well as a deeper dive into metrics. Please stay tuned!
Now, a brief outlook on where I think these systems are today and where they'll be in the near future. The standardization occurring across the frontier labs to ensure models behave properly in agentic workflows suggests they should perform well today, and only improve moving forward. The labs have many sophisticated measures for both training and preventing degradation when candidate models are promoted in updates. Additionally, standardization means that many existing products have a direct roadmap for becoming integrated into AI workflows. This is profound: it's going to enable brands of all shapes and sizes to become automatically integrated, and reduce the choice of integration to consumer preference.
Some challenges on the frontier that still require thoughtful expertise:
- How should security be handled by design to ensure consumer privacy protection in agentic workflows?
- What are the leading candidate architectures for creating agentic workflows across enterprise and consumer use-cases?
- Agentic workflows, while powerful, pose risks: how do we ensure agents don’t delete databases, for example, causing harmful side-effects in the systems they interact with?
- Ideally, this is solved in training by ensuring AI systems prompt for consent before taking certain actions, but how is this behavior going to be guaranteed?
The horizon looks beautiful, but the waves are a little choppy.
Best of luck, and get prompting!