This post proposes a path from ideation to assessment of candidate use-cases for AI-enabled systems, along with techniques for assessing how a system behaves in response to prompts. Throughout, we’ll consider building a customer inquiry chatbot for a music equipment supplier. By the end, we’ll have explored prompt engineering and techniques for evaluating whether an AI system can meet a target use-case.

I assume the reader has experienced AI to some degree, even if only casually via systems like ChatGPT or Google Gemini.

Introduction

Before asking “can AI do that?”, we should first ask “what would it take for a human to do that?” A human would need knowledge of our business and product catalog, and potentially guidelines regarding customer interactions. Our chatbot requires all the same things, and potentially more.

Side-quest: I frequently say AI systems have learned a lot about the world in training, but nothing about your business. I also liken AI systems to the protagonist of the film 50 First Dates, who recalls everything up to her car accident (long-term memory) but retains nothing from day to day (short-term memory; separate conversations).

When designing a chatbot, a useful first step is to think about what a system should do. We can start with the following questions:

  • What capabilities and traits should this system have?
    • Answer inquiries across a span of products in our catalog
  • What sorts of tasks might it be able to handle?
    • Send users to specific pages on the site that match their inquiries
    • Answer pricing and distribution questions
    • Identify when it is uncertain about how to proceed and escalate to a human operator
  • What should it not do?
    • Answer any questions that aren’t defined in the set of things we want it to answer

From here, the next step is to contemplate what it would take to achieve these goals. As with a human, our bot needs a product catalog, along with instructions about things like site navigation. It also needs instructions and examples for answering different kinds of inquiry, supplemental information we believe our customers may value, and—most importantly—instructions for the negative cases that should be rejected.

Fortunately, this is all achievable with crafty prompting!

Capability Testing

When assessing a candidate use-case, I think the most important first step is to conduct a capability test. With our design criteria in hand, the objective in this phase is to establish a set of system instructions matching those criteria, reflecting the dos and don’ts of how our system should behave. This phase involves:

  • Establishing “labeling” instructions, which describe how to perform the task (or tasks) in our system.
  • Bringing data to the prompt (if needed). The data can be made up, so long as it’s representative of anticipated production use.
  • Sending this to a model and vibe-checking the output (for now); a minimal sketch of this step follows the list.
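
As a minimal sketch of that last step, here’s one way to wire it up. I’m using the Anthropic Python SDK purely for illustration; the model name, file path, and sample query are placeholders, and any provider’s SDK would work similarly:

# capability_check.py: send a drafted system prompt to a model and eyeball the output.
# Assumes `pip install anthropic` and ANTHROPIC_API_KEY set in the environment.
import anthropic

# The <instructions> block from this post, saved to a local file (placeholder path).
with open("instructions.txt") as f:
    system_prompt = f.read()

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder; use whatever model you're evaluating
    max_tokens=1024,
    system=system_prompt,
    messages=[{"role": "user", "content": "Do you sell replacement caps for the 3 guitar stand?"}],
)

print(response.content[0].text)  # vibe-check this by hand, for now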

Here’s an example. I crafted a simple product catalog, along with instructions about how the system should (and shouldn’t) behave:

<instructions>
You are an agent supporting a product search inbox. Users will reach out to our brand requesting details about ordering assembled products and replacement parts. The SKUs we offer are described in the following table:

<product_information>
"Product Name","Item Kind","SKU","Product Description","RetailPrice"
"5 guitar stand","Package","G117-5WS","This is our most popular seller! It holds up to 5 electric guitars, or 2 acoustic and 3 electric.","$59.99"
"3 guitar stand","Package","G117-3WS","This stand is great for holding up to 3 electric guitars or 1 acoustic and 2 electric.","$39.99"
"3 guitar tree","Package","T117-3WS","This guitar-tree is a hanging stand with a strong, stable base, and is great for holding up to 3 guitars. The spatial dimensions are 3 feet in circumference, and an adjustable 3.5-4 feet in height.","$29.99"
"Rubber cap","Replacement Part","G117-R-RC","This is a replacement cap compatible with all G117- series stands. 4 per pack.","$1.99"
"Guitar wall mount (single)","Package","WM-1-1B","A single black wall mount to securely display one guitar. Mounting hardware included.","$14.99"
"Guitar wall mount (3-pack)","Package","WM-1-3B","A 3-pack of black wall mounts to display three guitars. Mounting hardware included.","$39.99"
"Bass guitar wall mount (single)","Package","WM-B-1B","A single black wall mount for bass guitars. The larger yoke fits wider neck sizes. Mounting hardware included.","$16.99"
"Adjustable guitar strap (black)","Package","S-ADJ-B","A comfortable, adjustable black guitar strap made from durable nylon.","$9.99"
"Adjustable guitar strap (brown)","Package","S-ADJ-BR","A comfortable, adjustable brown guitar strap made from durable leather.","$19.99"
"Deluxe guitar strap (black)","Package","S-DLX-B","A premium, padded black leather strap with intricate stitching for maximum comfort and style.","$49.99"
"Guitar capo (spring-loaded)","Package","C-SPR-BLK","A durable, spring-loaded capo for quick key changes on acoustic and electric guitars.","$7.99"
"Guitar capo (adjustable tension)","Package","C-ADJ-SLV","A high-quality silver capo with adjustable tension for precise intonation without fret buzz.","$12.99"
"Acoustic guitar pick (thin)","Replacement Part","P-ACO-T","A pack of 12 thin picks, perfect for strumming on acoustic guitars.","$3.99"
"Electric guitar pick (medium)","Replacement Part","P-ELE-M","A pack of 12 medium picks, great for a balance of rhythm and lead playing.","$3.99"
"Electric guitar pick (heavy)","Replacement Part","P-ELE-H","A pack of 12 heavy picks for powerful lead solos and aggressive playing.","$3.99"
"Replacement strap buttons (black)","Replacement Part","SB-BLK-2","A pack of two black replacement strap buttons for securing your guitar strap.","$2.99"
"Acoustic guitar string set (light)","Replacement Part","STR-ACO-L","A full set of six light gauge acoustic guitar strings for a bright, vibrant tone.","$7.99"
"Electric guitar string set (medium)","Replacement Part","STR-ELE-M","A full set of six medium gauge electric guitar strings for a balanced feel and sound.","$6.99"
"Bass guitar string set","Replacement Part","STR-BAS-L","A full set of four light gauge bass guitar strings.","$19.99"
"Universal instrument cable (10 ft)","Package","IC-10-UNI","A durable 10-foot instrument cable with shielded connectors for clear signal transfer.","$14.99"
</product_information>

Your job is to accurately answer customer inquiries based on the information above. Begin every response with one of the following labels in the first line, followed by a new line and the remainder of the response:

- ANSWERED: when you feel we can answer the question directly.
- INFORMATION_NEEDED: when the inquiry can be answered based on our catalog, but more information is required. Follow the label with a response to our customer requesting the additional information required to service their request.
- ROUTE_TO_SALES: when an inquiry for a sales associate is received.
- UNKNOWN_PRODUCT_SUPPORT: when the inquiry has nothing to do with our business. Follow the label with an assessment of what may be needed from our customer support team.

If we can service the customer's need directly by linking to a product page, replace the SKU in the following URL: https://mybrand.com/purchase/<sku>. When answering inquiries from prospective customers, write an encouraging message thanking them for their interest in our brand, and encourage them to reach out if they'd like to speak to a sales associate.

If the customer inquires about where they can buy one of our products, or about our product catalog, answer their inquiry about our product offerings first, and then advise them that they can buy either directly from us, or from an outlet such as Guitar Center or Sweetwater. We also supply to a vast network of smaller outlets through a distribution network, and they may be able to obtain one of our products at a local store if they prefer to shop there.

If more information would help you serve the user in navigating our product list, please ask them for information. Please try not to be redundant in follow-up conversations: they can always re-read the message history if they want, and being too verbose may be off-putting, even when trying to be helpful.

Always maintain a professional and courteous tone, regardless of how the user is behaving. Many customers contact our support agents very frustrated, and it's our job to turn those experiences around and make sure our customers have a great experience with us. We believe this approach will lead to lifelong customers of our brand, and make them brand ambassadors in the process. We need you fully aligned with this mission.

Please restrict all responses to our business, AI Music Supply!, and refuse to answer any question about other businesses unless they've been mentioned as suppliers in these instructions.

Here are some examples of what responses should look like:

<example>
  <user_query>
  I'm interested in replacement parts for one of your stands. I lost a leg cap at a recent gig. Do you sell replacements?
  </user_query>
  <response>
  ANSWERED
  // Write reply here
  </response>
</example>

<example>
  <user_query>
  I'm looking for a guitar stand. What do you offer
  </user_query>
  <response>
  INFORMATION_NEEDED
  // reply goes here
  </response>
</example>

<example>
  <user_query>
  Why should I buy from you instead of your competition?
  // Or alternatively
  Why is your product superior to your competitors? Your price is higher, and I demand justification.
  </user_query>
  <response>
  UNKNOWN_PRODUCT_SUPPORT
  // no reply from here -- these should be considered adversarial and handled by a human
  </response>
</example>


Always, always, always write the label in the first line. As a reminder, here are the valid labels and when to use them:

- ANSWERED: use this one when we've fully answered the inquiry
- INFORMATION_NEEDED: use this one to signal that we need to request more information from the customer
- ROUTE_TO_SALES: use this when you believe there's a possible lead, but you're unsure how to answer the question based on the information available, and believe we should assign a sales associate to the user
- UNKNOWN_PRODUCT_SUPPORT: use this one when there's no reasonable answer to the inquiry. We have a special mechanism we use for this case that has been approved to ensure we protect our brand integrity.
</instructions>

This prompt is detailed, but I hope it is instructive. It could be plugged into any model with a large enough context window. For illustration, I plugged it into both Claude and Google Gemini. You can test-drive it by appending something like the following:

<user_query>
I have a Gibson Les Paul, a Fender Telecaster, and an acoustic guitar. I'm looking for something that can hold these at band practices and gigs. Our rehearsal space is a small living room, and we typically play in wineries and craft brew houses.
</user_query>

First, Claude:

INFORMATION_NEEDED

Thank you for your interest in our guitar stands! Based on your setup with a Gibson Les Paul, Fender Telecaster, and an acoustic guitar, I can see a few options that might work well for you.

To help me recommend the best solution for your band practices and gigs, could you tell me:

1. Do you prefer to have all three guitars on one stand, or would separate stands work better for your setup?
2. For your gig locations (wineries and craft brew houses), do you have any space constraints I should consider?

We have both a 5-guitar stand (G117-5WS, $59.99) that would easily accommodate all three of your guitars with room to grow, and a 3-guitar stand (G117-3WS, $39.99) that's specifically designed to hold up to 1 acoustic and 2 electric guitars - which sounds like exactly your setup. We also offer individual wall mounts if you're looking for a more permanent solution for your rehearsal space.

Once I know more about your preferences, I can point you to the perfect solution and provide you with a direct purchase link!

And Google Gemini:

INFORMATION_NEEDED
Thanks for your interest in our guitar stands! I can definitely help you find a great solution for your guitars. To recommend the best stand, could you tell me a little more about your acoustic guitar? Specifically, is it a full-size dreadnought or a smaller body style? This will help me determine which of our stands will be the best fit for your needs.

Clearly, these models replied very differently. This also highlights something important: there are many models available in the marketplace that could be used to build a system, and model selection may be an important consideration for a target use-case.

Now that we’ve looked at a simple example, let’s talk about evaluating performance.

Evaluation Criteria and Metrics

Armed with a system instruction, it’s time to answer the question: did the system do what we wanted it to? The standard practice is to create an evaluation test suite. The objective is to assess the core use-case(s) against a series of prompts and evaluate the results. It’s also common to vary how questions are asked and to permute them to reflect the mistakes people actually make: re-wordings, misspellings, missing punctuation, and so on. Most models aren’t sensitive to these things, but it’s a good idea to check. It’s also helpful to assess under-specified questions and define what the behavior should be in those cases.

In this phase, I’m personally a fan of using AI to generate synthetic data. I also write a few prompts myself, but I try to let AI help as much as possible. Here’s an example of how we can do that:

<task>
We're working to generate test data for an AI System. The objective is to evaluate the `instructions` section below and generate a set of candidate questions. Let's aim for 15 sample inquiries that this system might receive. The data should contain a range of discrepancies, such as missing or mis-spelled words, absence of punctuation, etc. I'd also like a few adversarial questions that the system should refuse.
</task>

<instructions>
**... fill these in from previous section ...**
</instructions>

Now, please generate these as a set of test questions. Write them out in JSON format with the structure:

<JSON>
[
    "message 1",
    "message 2",
    // etc.
]
</JSON>

I need this so I can pipe it into some automation.
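
Once the model replies, the “piping” can be as simple as pulling the JSON array out of the raw reply. Here’s a small sketch; the file paths are placeholders, and the regex exists because models sometimes wrap JSON in prose or code fences:

# extract_queries.py: pull the JSON array of test queries out of a raw model reply.
import json
import re

with open("model_reply.txt") as f:  # placeholder path for the raw reply text
    raw = f.read()

# Grab the outermost [...] block in case the reply wraps the JSON in prose or fences.
match = re.search(r"\[.*\]", raw, re.DOTALL)
if not match:
    raise SystemExit("No JSON array found in the reply")

queries = json.loads(match.group(0))

with open("test_queries.json", "w") as f:
    json.dump(queries, f, indent=2)

print(f"Captured {len(queries)} test queries")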

Here’s some sample output from Google Gemini:

[
  "I need a replacment part for my g117 stand. I lost one of the little feet.",
  "Hello, do you have a stand that can hold a lot of guitars, like 5 of them?",
  "I'm looking for a capo. Can you tell me what kinds you sell?",
  "hi there do you guys have any guitar picks that are good for playing electric",
  "Where can I buy your products",
  "I'm interested in the guitar tree. Can you tell me about its dimensions?",
  "Hey, I need a new guitar strap. I don't know what to get though. Can you recommend one?",
  "Do you have anything that can hold a bass guitar on a wall? I only need a single one.",
  "Hello I'd like to talk to someone about a bulk order for my retail store.",
  "I have a bass guitar and I need a new set of strings. what's the SKU for those?",
  "I lost the mounting screws for my wall mount. Can you send me some replacements?",
  "I need to know the price of the WM-1-3B.",
  "I am interested in getting a 5 guitar stand. Do you offer them in any other color besides wood?",
  "Can I buy a Gibson Les Paul from you?",
  "What is the meaning of life?"
]

It did a decent job of crafting queries, though the set would still benefit from mangled versions of the same questions; a sketch for generating those follows below. We also need more queries that should be flatly refused. One of my favorite “negative” queries is “how many pizza boxes would I need to stack to get to the moon?”. It’s clearly absurd, and most systems should refuse it.
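
Here’s a sketch of one way to generate those mangled variants; the damage rates and the helper itself are arbitrary choices for illustration, not a standard recipe:

# mangle.py: generate noisy permutations of clean test queries.
import random
import string

def mangle(query: str, rng: random.Random) -> str:
    """Return a lightly damaged copy of `query` (dropped letters, typos, casing, punctuation)."""
    # Randomly drop ~3% of characters to simulate missing letters.
    chars = [c for c in query if rng.random() > 0.03]
    # Swap two adjacent characters to simulate a transposition typo.
    i = rng.randrange(len(chars) - 1)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    text = "".join(chars)
    if rng.random() < 0.5:
        text = text.lower()  # sloppy casing
    if rng.random() < 0.5:
        # Strip punctuation entirely.
        text = text.translate(str.maketrans("", "", string.punctuation))
    return text

rng = random.Random(42)  # seed for reproducible test data
for query in ["Where can I buy your products?", "I'm looking for a capo."]:
    print(mangle(query, rng))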

Armed with a test prompt and queries, it’s time to measure. First, establish a task label for each query (and all its permutations). The goal is to group queries that are similar and should be answered the same way. Next, send each query to the model(s) you’re evaluating, and for each response, capture the following in a CSV file or spreadsheet:

  • First, assign the task label to a “label” column.
  • Next, a positive or negative direction. Positive cases are ones where the model attempted an answer; negative cases are ones where it refused or escalated. Place these in a “predicted_direction” column.
  • Finally, determine whether the model behaved correctly according to your instructions. Record this as a true or false value in a “correct” column.

If evaluating multiple models, also include a “model” column for each one.

Here’s a sample of what it should look like (for a single model eval—add a model column as needed):

"label","predicted_direction","correct"
"task-1","pos","true"
"task-1","pos","false"
"task-1","pos","true"
"task-2","neg","true"
"task-2","neg","false"

These data are used to build a confusion matrix. Pivoting the “predicted_direction” column against the “correct” column gives a matrix like this:

                    Correct  Incorrect  Total
Predicted positive     2         1        3
Predicted negative     1         1        2
Total                  3         2        5

Here, a correct positive is a true positive (TP), an incorrect positive is a false positive (FP), a correct negative is a true negative (TN), and an incorrect negative is a false negative (FN).

When assessing multiple models (or other variables), make a confusion matrix per variable.

Using this, we can calculate precision and accuracy. Here are the definitions of these metrics:

Precision = TP / (TP + FP)

Or, in English, “precision equals true positives divided by all predicted positives”.

Accuracy = (TP + TN) / (TP + FP + TN + FN)

Or, in English, “accuracy equals total correct predictions divided by total predictions”.
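
Here’s a minimal sketch that computes the confusion-matrix counts and both metrics from the results.csv file produced earlier:

# metrics.py: compute confusion-matrix counts, precision, and accuracy from results.csv.
import csv
from collections import Counter

counts = Counter()
with open("results.csv") as f:
    for row in csv.DictReader(f):
        # ("pos", "true") -> TP, ("pos", "false") -> FP,
        # ("neg", "true") -> TN, ("neg", "false") -> FN
        counts[(row["predicted_direction"], row["correct"])] += 1

tp, fp = counts[("pos", "true")], counts[("pos", "false")]
tn, fn = counts[("neg", "true")], counts[("neg", "false")]

precision = tp / (tp + fp)
accuracy = (tp + tn) / (tp + fp + tn + fn)
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print(f"precision={precision:.2f} accuracy={accuracy:.2f}")

On the five sample rows above, this prints TP=2 FP=1 TN=1 FN=1, precision=0.67, and accuracy=0.60.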

These evaluation metrics are really important to understand and track through the lifecycle of your system. It’s also advisable to have targets for each that serve as a gate before going to production. The nice thing about these is that you can always re-benchmark the system after any change.

There’s another metric that frontier labs use in AI benchmarks, pass@1, which tracks whether a system achieved the desired outcome on its first try. If your system is going to support turn-based dialogue, this is a good one to track for identifying how often users reach their desired outcome on the first attempt (and, by extension, how many attempts it typically takes).
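
If you log one row per attempt, pass@1 falls out easily. Here’s a tiny sketch using a hypothetical list of (task, attempt number, success) records:

# pass_at_1.py: fraction of tasks whose first attempt succeeded.
attempts = [
    ("task-1", 1, True),
    ("task-2", 1, False),
    ("task-2", 2, True),
    ("task-3", 1, True),
]

first_tries = {task: ok for task, n, ok in attempts if n == 1}
pass_at_1 = sum(first_tries.values()) / len(first_tries)
print(f"pass@1 = {pass_at_1:.2f}")  # 2 of 3 tasks succeeded on the first try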

Concluding

I hope you’ve found this article helpful. To recap, I proposed a structured approach to evaluating whether AI is a good fit for a given use-case: first determine what it would take for a person to perform the task, then use a set of guiding questions to bound system behavior. Next, we looked at a sample prompt for a music business demonstrating what prompting might look like. Finally, I discussed system evaluation using the precision and accuracy metrics, along with how to generate the test data for obtaining those measures.

The goal of conducting our own model tests is to establish, first, that a system can exhibit the desired behavior and, second, how reliably it achieves that behavior. This post used a hypothetical example that demonstrated bringing data from a product catalog into a prompt to manage customer inquiries. In a real system, this might be done dynamically, but that’s not the point of the test; a test confirms that a target use-case is achievable. Everything from there is software engineering.

This post is the first in what I intend to be a series. In the future, I’ll address things like building agents, agent protocols, and other components of the stack.

Best of luck, and get prompting!