Most companies ship their first agent with a 1-line system prompt. "You are a helpful assistant for Acme Corp." It works in the demo. It falls apart in production within 48 hours. The difference between a demo agent and a production agent is not the model, the framework, or the hosting. It is the prompt.
Demo Prompts vs Production
A demo prompt fits in a tweet. A production system prompt runs 500 to 2,000 lines. That is not bloat. That is the accumulated knowledge of every edge case, brand violation, and user complaint your team has encountered.
The 1-line demo works because the person testing it already knows the product. They ask reasonable questions. They do not try to jailbreak it, ask about competitors, or paste in 10,000 tokens of garbage. Real users do all of that on day one.
Production prompts handle the long tail. They specify what to do when the user asks about pricing you do not offer, when the conversation goes off-topic, when the user is frustrated, when the tool call fails. Every line in a production prompt exists because something went wrong without it.
The gap between demo and production is 100% prompt engineering. Same model. Same tools. Same infrastructure. Entirely different outcomes.
Anatomy of a Production System Prompt
A production system prompt has six sections. Each one does a specific job.
Identity comes first. This is who the agent is, who it works for, and what it does. Keep it to 2 to 3 sentences. The model references this section constantly, so precision matters. "You are Petra, an agent built by Point11 that helps enterprise brands optimize their websites for both humans and agents." That is enough.
Voice and tone defines how the agent communicates. Not vague guidelines like "be professional." Specific rules: "Use short sentences. Never use exclamation marks. Address the user by first name after the first exchange. Do not use filler phrases like 'Great question!' or 'I'd be happy to help.'" Models follow concrete rules. They ignore vibes.
Tool definitions tell the agent what it can do. Each tool needs a name, description, parameter schema, and behavioral instructions. More on this in the tool section below.
Guardrails define what the agent must never do. Never reveal internal pricing logic. Never provide medical, legal, or financial advice. Never compare your product unfavorably to competitors. Never output raw JSON to the user.
These are hard rules, not suggestions.
Escalation paths define when the agent should hand off to a human. After 3 failed attempts to resolve an issue. When the user explicitly asks for a person. When the conversation involves billing disputes over $500. Without explicit escalation logic, agents will keep trying forever.
Memory context is injected dynamically at runtime. Prior conversation summaries, user preferences, account details, recent interactions. This section changes per request, but the prompt must instruct the agent on how to use it. "If the user's memory block indicates they have already discussed pricing, do not repeat the overview. Reference the prior conversation and ask if anything has changed."
Few-Shot Examples
Few-shot examples are the single highest-impact technique in prompt engineering. They beat elaborate instructions every time. Models learn patterns from examples faster than they learn rules from descriptions.
Include 3 to 5 example exchanges that demonstrate the exact behavior you want. Cover the common case, an edge case, and a failure case. Format them as user/assistant pairs so the model can pattern-match directly.
Placement matters more than most teams realize. Examples placed at the end of the system prompt outperform examples placed at the beginning[1]. The model attends more strongly to the most recent context. If your prompt is long, put your examples after the rules, not before.
A well-chosen set of 5 examples can replace 50 lines of instructions. When you find yourself writing elaborate rules about formatting, tone, or response structure, ask whether an example would communicate the same thing in fewer tokens and with higher compliance.
Chain-of-Thought in Production
Chain-of-thought prompting tells the model to reason through a problem step by step before producing the final answer. In production, this is not optional for complex tasks. Without it, models skip steps, miss edge cases, and produce confident but wrong outputs[3].
The technique works best with structured thinking blocks. Wrap a numbered reasoning section in XML tags. The model walks through the problem step by step: what did the user ask, what is their account context, which response strategy fits, and are there prior conversations to reference. Then it commits to a final answer.
You strip these tags from the final response so the user never sees the internal reasoning. The result is a cleaner, more accurate answer.
Cost: structured thinking adds 50 to 200 tokens per response. For most production agents, that is a rounding error compared to the cost of a wrong answer that requires human intervention. Only skip it for simple, low-stakes responses like greetings or FAQ lookups.
Tool Definitions That Work
Tool descriptions are prompts. Most teams treat them as API documentation. They write dry, technical descriptions and wonder why the model calls the wrong tool half the time.
A good tool description answers three questions: What does this tool do? When should the agent use it? When should the agent NOT use it?
Consider a product search tool. The description should say: search the catalog by keyword, category, or attribute. Use it when the user asks about specific products, wants recommendations, or mentions a product name. Do NOT use it for general questions about pricing tiers, company info, or shipping policies. Those are answered from system knowledge.
The "when not to use" instruction is just as important as the "when to use" instruction. Without it, models over-call tools. A user asks "what do you sell?" and the agent fires off 5 product searches instead of giving a 2-sentence overview from its system knowledge.
Test tool definitions adversarially. Feed the agent ambiguous queries that could trigger multiple tools and verify it picks the right one. "How much does that cost?" could mean product pricing (tool call) or service tier pricing (system knowledge). Your descriptions need to disambiguate.
Parameter descriptions matter too. A parameter called query with no description will get inconsistent inputs. A parameter described as "2 to 5 keyword search string, no full sentences, lowercase" will get clean inputs every time.
Iterating on Prompts
Prompt engineering is not a one-time task. It is an ongoing practice, closer to product management than software engineering.
Version control your prompts the same way you version control code. Every production prompt should live in a file, tracked in git, with a clear commit history. When something breaks, you need to know what changed and when. Storing prompts in a database or admin panel without version history is a common mistake that makes debugging impossible.
A/B test with real traffic. Split 10% of traffic to a prompt variant and measure completion rate, escalation rate, and user satisfaction. Small wording changes can produce 15 to 25% swings in task completion[2]. You will not discover this in development.
Review the worst 10% weekly. Pull the conversations where users abandoned, escalated, or expressed frustration. These are your prompt's failure cases. Each one is a specific scenario your prompt does not handle. Add a rule or example to cover it. This weekly review loop is the single most effective practice for improving agent quality over time.
Build a regression test suite. Collect 50 to 100 input/output pairs that represent critical behaviors. Run them against every prompt change before deploying. A prompt fix for one edge case should not break 3 existing behaviors. Automated regression testing catches this before your users do.
The best production prompts are not written. They are grown, one edge case at a time, through hundreds of iterations informed by real user behavior.
How Site Scanner Helps
Site Scanner evaluates how well your site's content is structured for agent consumption. Clean, parseable content means your own agents and third-party agents can extract accurate information about your products, policies, and brand. The Discoverability and Performance dimensions of your Site Score reflect whether your site is ready to serve as the knowledge foundation that production prompts depend on.