Building Enterprise-Grade AI Agents: Eight Practices That Separate Production Systems from Prototypes

The gap between an AI agent prototype and a production system is engineering discipline. Eight concrete practices — from two-phase architecture to token budget enforcement — illustrated with real production code.


Most AI agent demos look compelling. Most AI agents in production look like liabilities. The gap between a prototype and a system that can run inside an established business is a matter of engineering discipline.

This article walks through eight concrete practices that separate enterprise-grade agents from prototypes, illustrated throughout with TechFingerprintAgent — a passive technology fingerprinting agent we built at QualitaX. Every code example is real production code.

TechFingerprintAgent performs passive reconnaissance on a target domain: DNS analysis, WHOIS lookup, HTTP header fingerprinting, technology stack detection, SSL certificate inspection, security.txt and robots.txt parsing — all orchestrated into a structured security architecture report. It is a real workload with real failure modes, and it makes every tradeoff visible.

Practice 1: Design for Two Phases, Not One Loop

The most common architecture mistake in AI agent design is the uncontrolled tool loop: the agent selects a tool, calls it, receives the result, selects another tool, and continues until it decides it is done. This approach is easy to prototype and painful to operate.

The uncontrolled loop has several compounding problems: latency is sequential, token cost grows with every round-trip, context windows inflate with accumulated tool results, and the agent's tool selection behaviour is non-deterministic — meaning the same input can produce different execution sequences on different runs.

TechFingerprintAgent uses a two-phase architecture that eliminates all of these problems:

Phase 1 — Parallel Collection (pure Python, no AI)

All 7 data-collection tools run concurrently via asyncio.gather(). Each tool failure produces a structured error dict. No AI is involved. Tool execution order and error handling are deterministic Python code.

Phase 2 — Single AI-driven Synthesis (one call, forced structured output)

Pre-collected data is sent to the agent in a single API call with tool_choice forced to complete_tech_analysis. The agent interprets the data using expert heuristics. One call, one response.

The measurable impact: ~60% smaller context window usage per task. Collection tools run in parallel (~1 wall-clock round) instead of sequentially across 7+ round-trips. Token cost is reduced because the AI model sees one user message rather than accumulating 7 tool_use/tool_result pairs.

The key insight is that the choice of tool and the interpretation of data are two different cognitive tasks. Python is better at the first; AI is better at the second. A well-designed agent assigns each task to the right executor.

# Phase 1: All 7 tools run concurrently — wall-clock time ≈ slowest single tool
results = await asyncio.gather(
    _run_tool("analyze_dns",          analyze_dns(domain)),
    _run_tool("check_whois",          check_whois(domain)),
    _run_tool("fetch_http_headers",   fetch_http_headers(website, http_client)),
    _run_tool("detect_technologies",  detect_technologies(website, http_client)),
    _run_tool("check_ssl_certificate", check_ssl_certificate(domain)),
    _run_tool("check_security_txt",   check_security_txt(domain, http_client)),
    _run_tool("check_robots_txt",     check_robots_txt(website, http_client)),
)

# Phase 2: Single Claude call with forced tool_choice
response = await self._call_claude(
    messages,
    tool_choice={"type": "tool", "name": "complete_tech_analysis"},
)
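The _run_tool wrapper that the gather() call assumes is not shown above. A minimal sketch of what it might look like — the name comes from the source, but this body and the dict shape are my illustration:

```python
import asyncio
from typing import Any


async def _run_tool(name: str, coro) -> dict[str, Any]:
    # Wrap one collection tool: success and failure both come back as plain
    # dicts, so asyncio.gather() never raises and Phase 2 sees a uniform shape.
    try:
        return {"tool": name, "ok": True, "data": await coro}
    except Exception as exc:
        return {"tool": name, "ok": False, "error": f"{type(exc).__name__}: {exc}"}


async def _boom():
    raise TimeoutError("DNS lookup timed out")


failure = asyncio.run(_run_tool("analyze_dns", _boom()))
# failure == {"tool": "analyze_dns", "ok": False,
#             "error": "TimeoutError: DNS lookup timed out"}
```

Because every exception is converted to data, a single slow or broken tool degrades the report instead of aborting the whole task.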

Practice 2: Force Structured Output — Never Parse Free Text

An agent that returns unstructured text is not an agent — it is a chatbot. Enterprise systems require predictable output shapes that downstream consumers can depend on without parsing heuristics or prompt-based post-processing.

The model's tool use API, combined with tool_choice: {"type": "tool", "name": "..."}, forces the model to return a structured JSON object matching a defined schema. This is the correct pattern. Never rely on the model voluntarily structuring its output correctly.

TechFingerprintAgent defines a complete output contract in the synthesis tool schema, with every field typed and described. The required fields are explicitly declared. The agent then verifies the response by extracting the tool_use block by name rather than by position.

# Force Claude to return complete_tech_analysis — no free text fallback
tool_choice = {"type": "tool", "name": "complete_tech_analysis"}

# Extract by name, not by index
for block in response.content:
    if block.type == "tool_use" and block.name == "complete_tech_analysis":
        final_data = block.input
        break

# If block is absent, fall through to structured fallback — not a crash
if final_data is None:
    final_data = {
        "subject": subject,
        "website": website,
        "architecture_summary": "Analysis incomplete...",
        "confidence": 0.05,
        "_incomplete": True,
    }

Notice the fallback: if the model does not return the expected tool_use block, the agent produces a documented incomplete result rather than crashing or returning unstructured text. Every output path is defined and typed, so downstream consumers can always call dict.get() safely. Forcing tool_choice is a battle-tested way to guarantee structured JSON output, but note that several frontier models now support schema adherence at the decoding level — for example, OpenAI's Structured Outputs via the response_format parameter — though availability and mechanism vary across providers.

The output contract is part of the interface. Document it explicitly — which fields are guaranteed present on all paths, which are conditional on the happy path, and which are absent on error paths. TechFingerprintAgent does this as a module-level comment, making it part of the public API.

Practice 3: Build Security In From the Start

Security is not a layer you add to an AI agent — it is a structural property. An agent that makes outbound HTTP requests has an attack surface. An agent that interpolates user input into prompts has an injection surface. Both must be addressed at the architecture level, not as afterthoughts.

TechFingerprintAgent addresses three distinct security concerns with deliberate, documented engineering.

SSRF Protection (Server-Side Request Forgery)

The agent accepts a user-provided URL and makes HTTP requests to it. Without protection, this is an SSRF vulnerability — an attacker could target an AWS metadata endpoint or internal services. The agent defends using CIDR-aware IP checking via the ipaddress module:

# String prefix matching is insufficient — bypass vectors include:
#   decimal IPs (2130706433), IPv4-mapped IPv6 (::ffff:127.0.0.1),
#   DNS rebinding attacks
_SSRF_BLOCKED_NETWORKS = (
    ipaddress.ip_network("127.0.0.0/8"),         # loopback
    ipaddress.ip_network("10.0.0.0/8"),          # RFC 1918
    ipaddress.ip_network("172.16.0.0/12"),       # RFC 1918
    ipaddress.ip_network("192.168.0.0/16"),      # RFC 1918
    ipaddress.ip_network("169.254.0.0/16"),      # link-local / cloud metadata
    ipaddress.ip_network("::ffff:127.0.0.0/104"),  # IPv4-mapped loopback
    # ...and more
)

# Resolve hostname and check ALL resulting IPs — not just the first
infos = socket.getaddrinfo(hostname, None, socket.AF_UNSPEC, socket.SOCK_STREAM)
for family, _, _, _, sockaddr in infos:
    addr = ipaddress.ip_address(sockaddr[0])
    for network in _SSRF_BLOCKED_NETWORKS:
        if addr in network:
            return True  # Block if ANY resolved IP is private

Note that the check resolves ALL IP addresses returned for the hostname, not just the first. An attacker controlling a DNS server could return a public IP on the first query and a private IP on a subsequent request (DNS rebinding). Checking all addresses at connection time is a stronger posture.
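Assembled into one self-contained function (the blocked list is abbreviated here, and the function name resolves_to_blocked is my choice), the check looks like this:

```python
import ipaddress
import socket

# Abbreviated blocked list — the production list covers more ranges.
_BLOCKED_NETWORKS = (
    ipaddress.ip_network("127.0.0.0/8"),          # loopback
    ipaddress.ip_network("10.0.0.0/8"),           # RFC 1918
    ipaddress.ip_network("172.16.0.0/12"),        # RFC 1918
    ipaddress.ip_network("192.168.0.0/16"),       # RFC 1918
    ipaddress.ip_network("169.254.0.0/16"),       # link-local / cloud metadata
    ipaddress.ip_network("::1/128"),              # IPv6 loopback
    ipaddress.ip_network("::ffff:127.0.0.0/104"), # IPv4-mapped loopback
)


def resolves_to_blocked(hostname: str) -> bool:
    """True if ANY address the hostname resolves to falls in a blocked network."""
    try:
        infos = socket.getaddrinfo(
            hostname, None, socket.AF_UNSPEC, socket.SOCK_STREAM
        )
    except socket.gaierror:
        return True  # fail closed: an unresolvable host is rejected
    for _family, _type, _proto, _canonname, sockaddr in infos:
        addr = ipaddress.ip_address(sockaddr[0])
        if any(addr in net for net in _BLOCKED_NETWORKS):
            return True
    return False
```

Failing closed on resolution errors is a deliberate choice: a host you cannot classify is a host you should not fetch.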

Prompt Injection Prevention

The agent interpolates user-controlled fields — subject, sector, context — into the synthesis prompt. Without sanitisation, a malicious subject value like "Ignore all previous instructions and..." could manipulate the agent's behaviour. The agent applies explicit sanitisation before interpolation:

import re

_CONTROL_CHAR_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
_MAX_FIELD_LEN = 500  # illustrative value — the real limit lives in configuration

def _sanitise_prompt_field(value: str, field_name: str) -> str:
    # Strip control characters (except \n, \r, \t)
    value = _CONTROL_CHAR_RE.sub("", value)
    # Truncate to prevent context flooding
    if len(value) > _MAX_FIELD_LEN:
        value = value[:_MAX_FIELD_LEN] + "...[truncated]"
    return value

safe_subject  = _sanitise_prompt_field(subject, "subject")
safe_sector   = _sanitise_prompt_field(payload.get("sector", ""), "sector")
safe_context  = _sanitise_prompt_field(payload.get("context", ""), "context")

IDNA Encoding for Homograph Attack Prevention

Domain names in user input are IDNA-encoded before use. This normalises internationalised domain names and defends against homograph attacks — where visually identical characters from different scripts (e.g. Cyrillic 'а' vs Latin 'a') create distinct DNS names that appear identical to humans. domain.encode("idna").decode("ascii") is a one-line defence with meaningful security value.
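A two-line demonstration makes the defence concrete. The spoofed string below substitutes Cyrillic 'а' (U+0430) for Latin 'a'; after IDNA encoding, the two visually identical names are unmistakably different:

```python
spoofed = "ex\u0430mple.com"   # Cyrillic 'а' — looks identical to "example.com"
legit = "example.com"

# IDNA encoding turns the non-ASCII label into a punycode (xn--) name,
# so the spoof becomes visible to both humans and string comparisons.
spoofed_ascii = spoofed.encode("idna").decode("ascii")
legit_ascii = legit.encode("idna").decode("ascii")
```

An all-ASCII domain passes through unchanged, so applying the encoding unconditionally is safe.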

Practice 4: Make the Entrypoint Idempotent

An enterprise agent is not called once — it is called by a queue consumer, possibly by multiple workers, possibly by a retry loop. The same task may arrive at the agent twice. An agent that does not handle this correctly will either double-charge tokens or produce race conditions.

TechFingerprintAgent's run() method is idempotent by design. Every call goes through the same sequence:

async def run(self, task: Task) -> AgentResult:
    # 1. Cache check — return immediately if already computed
    cached = await self.state.get_result(task_id_str)
    if cached is not None:
        return cached

    # 2. Lock acquisition — prevents duplicate work across consumers
    acquired = await self.state.acquire_processing_lock(task_id_str)
    if not acquired:
        await asyncio.sleep(wait_s)  # give the other worker time to finish
        cached = await self.state.get_result(task_id_str)
        if cached:
            return cached
        raise RetryableError("Task locked by another consumer")

    # 3. Wall-clock timeout — prevents worker stalls
    async with asyncio.timeout(task_timeout_s):
        return await self._run_inner(task, task_id_str, bound_logger)

Cache check happens before lock acquisition. If the result exists, no lock is ever acquired — the common fast path is maximally cheap. Lock acquisition prevents two workers from processing the same task simultaneously. The post-lock cache check handles the race window between lock acquisition and task start.

The lock release contract is made explicit in a comment: write_result() sets the result AND releases the lock atomically. The comment warns against adding an explicit release_processing_lock() call, which would be a double-release. This kind of documented invariant is what makes complex distributed code maintainable.
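The contract is easiest to see in a toy state store. This in-memory stand-in is my illustration (the production store is presumably an external, shared service such as Redis), but it encodes the same invariant: write_result both stores the result and releases the lock.

```python
import asyncio


class InMemoryState:
    """Toy stand-in for the agent's state store, illustrating the lock contract."""

    def __init__(self) -> None:
        self._results: dict = {}
        self._locks: set = set()

    async def get_result(self, task_id: str):
        return self._results.get(task_id)

    async def acquire_processing_lock(self, task_id: str) -> bool:
        if task_id in self._locks:
            return False  # another consumer holds the lock
        self._locks.add(task_id)
        return True

    async def write_result(self, task_id: str, result: dict) -> None:
        # Invariant from the article: writing the result also releases the
        # lock. Callers must NOT release explicitly — that would double-release.
        self._results[task_id] = result
        self._locks.discard(task_id)


async def _demo() -> list:
    state = InMemoryState()
    first = await state.acquire_processing_lock("t-1")   # acquired
    second = await state.acquire_processing_lock("t-1")  # contended
    await state.write_result("t-1", {"ok": True})
    cached = await state.get_result("t-1")
    return [first, second, cached]


outcome = asyncio.run(_demo())
```

In a real deployment the acquire step must itself be atomic (e.g. a single compare-and-set against the shared store), which an in-process set can only approximate.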

Practice 5: Build a Complete Error Taxonomy

Not all failures are equal. An agent that treats every error the same — retrying everything, or retrying nothing — will either waste tokens on hopeless retries or fail silently on transient issues that would have resolved themselves.

TechFingerprintAgent uses a three-level error taxonomy:

RetryableError — Transient failures: lock contention, temporary unavailability. Queue consumer should retry. Lock released, exception re-raised.

NonRetryableAPIError — Permanent failures: invalid input, bad URLs, SSRF violations. Dead-lettered immediately. Retrying will not help.

RateLimitError — Infrastructure pressure: 429, 529 from Claude API. Retried with exponential backoff + jitter before exhaustion.
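The taxonomy maps naturally onto a small exception hierarchy. The three class names come from the article; the base class and the inheritance choices are my sketch:

```python
class AgentError(Exception):
    """Base for all agent failures."""


class RetryableError(AgentError):
    """Transient — the queue consumer should redeliver the task."""


class NonRetryableAPIError(AgentError):
    """Permanent — dead-letter immediately; a retry cannot succeed."""


class RateLimitError(RetryableError):
    """Infrastructure pressure — retry, but with exponential backoff and jitter."""
```

Making RateLimitError a subclass of RetryableError lets the queue consumer catch one type for its retry decision while backoff logic still distinguishes the rate-limit case.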

Input validation happens before any resource acquisition — before acquiring a lock, before creating an HTTP client, before spending any tokens. This is the fail-fast principle: identify non-retryable failures at the earliest possible point.

# Validate before any resource acquisition
if not website:
    raise NonRetryableAPIError(
        "Task payload is missing required field 'website'. "
        "Provide a full URL including scheme (e.g. https://example.com)."
    )

parsed = urlparse(website)
if parsed.scheme not in ("http", "https"):
    raise NonRetryableAPIError(
        f"Invalid URL scheme {parsed.scheme!r}. Only http/https supported."
    )

Error messages are written for operators, not for code. They explain what was wrong and what the correct input looks like. An error that says "Invalid input" requires investigation; an error that says "Invalid URL scheme 'ftp' — only http/https supported" requires nothing.
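The backoff-with-jitter retry mentioned for RateLimitError is commonly implemented as "full jitter": the delay is drawn uniformly from zero up to an exponentially growing cap. A sketch with illustrative parameters:

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    # Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)].
    # The jitter decorrelates retries from many workers that all hit the
    # same 429/529 at the same moment.
    return random.uniform(0.0, min(cap, base * 2 ** attempt))
```

Attempt 0 waits at most 1 s, attempt 3 at most 8 s, and the cap prevents unbounded sleeps after many failures.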

Practice 6: Instrument Everything — Token Cost Is a First-Class Concern

Token cost is not an afterthought in production AI agents — it is a budget line. An agent without token visibility is like a service without cost allocation: you discover the problem in the invoice, not in the monitoring dashboard.

TechFingerprintAgent tracks tokens at every stage and enforces hard limits.

Per-call token accounting

Every call accumulates into total_input_tokens and total_output_tokens. These are logged after the synthesis call alongside elapsed time and stop reason, and recorded to the agent's metrics store for cross-task aggregation.

Pre-flight budget check

Before calling the model, the agent checks whether accumulated token usage exceeds 85% of the budget ceiling (_BUDGET_PREFLIGHT_RATIO = 0.85). If it does, the synthesis call is skipped. This prevents a task from exceeding its budget during the expensive synthesis call.

Post-call budget enforcement

After the synthesis call returns, the agent re-checks whether the response overshot the budget. If it did, the result is discarded and the task falls through to the incomplete path. This enforces the token budget contract rather than silently accepting an overrun.

# Pre-flight check at 85% of ceiling
if cumulative_tokens > preflight_ceiling:
    await bound_logger.awarning("Token budget pre-flight exceeded — skipping synthesis")
    # fall through to _incomplete result
else:
    response = await self._call_claude(...)
    total_input_tokens  += response.usage.input_tokens
    total_output_tokens += response.usage.output_tokens

    # Post-call enforcement
    if total_input_tokens + total_output_tokens > token_budget:
        await bound_logger.awarning("Token budget exceeded — discarding synthesis")
        # fall through to _incomplete — do NOT return an overrun result
    else:
        for block in response.content:
            if block.type == "tool_use" and block.name == "complete_tech_analysis":
                final_data = block.input

Prompt caching matters. The synthesis prompt is sent with cache_control: {"type": "ephemeral"}, making it eligible for prompt caching. On repeated calls with the same system prompt, input token costs drop substantially. For an agent running thousands of tasks per day, this is a meaningful cost reduction.
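On the Anthropic Messages API, the cache marker is attached to the system prompt content block. The shape looks roughly like this (SYNTHESIS_PROMPT is a placeholder; consult the provider docs for current cache semantics):

```python
SYNTHESIS_PROMPT = "...full synthesis prompt text..."  # placeholder

# System prompt expressed as a content block carrying the ephemeral cache
# marker; repeated calls sharing this identical prefix can hit the prompt cache.
system = [
    {
        "type": "text",
        "text": SYNTHESIS_PROMPT,
        "cache_control": {"type": "ephemeral"},
    }
]
```

The savings depend on the prompt prefix being byte-identical across calls, which is another reason to keep user-controlled fields out of the system prompt.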

Tool result size caps

Each tool result is capped at MAX_TOOL_RESULT_CHARS = 8,000 characters before being included in the synthesis prompt. This prevents a single verbose tool response (a large HTML page, a very long WHOIS record) from consuming the entire context window. At ~4 chars/token, this caps each tool's contribution at ~2,000 tokens — acceptable per-tool overhead.
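The cap itself is one small function. A sketch (the truncation marker is my choice):

```python
MAX_TOOL_RESULT_CHARS = 8000


def cap_tool_result(text: str) -> str:
    # Bound any single tool's contribution to the synthesis prompt, so one
    # verbose response cannot consume the whole context window.
    if len(text) <= MAX_TOOL_RESULT_CHARS:
        return text
    return text[:MAX_TOOL_RESULT_CHARS] + "...[truncated]"
```

Appending an explicit marker matters: the model is told the data was cut, rather than being handed a silently amputated document.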

Practice 7: Structured Logging with Distributed Trace IDs

Logs in production AI agents serve a different purpose than development logs. They need to be machine-queryable, correlated across services, and carry enough context to reconstruct exactly what happened when a task fails. printf-style logging fails all three requirements.

TechFingerprintAgent uses structlog with a trace_id bound to every log call within a task:

# Generate trace_id at task entry — propagated through all log calls
trace_id = str(uuid.uuid4())
bound_logger = logger.bind(task_id=task_id_str, trace_id=trace_id)

# Every log call in _run_inner() automatically carries task_id + trace_id
await bound_logger.ainfo(
    "Claude synthesis complete",
    elapsed_ms=round((time.monotonic() - claude_start) * 1000),
    input_tokens=response.usage.input_tokens,
    output_tokens=response.usage.output_tokens,
    cumulative_tokens=total_input_tokens + total_output_tokens,
    stop_reason=response.stop_reason,
)

Structured log events carry named fields, not interpolated strings. This means every synthesis call produces a log entry you can query: "show me all tasks where cumulative_tokens > 60000" or "show me all tasks with stop_reason = 'max_tokens'". These are operational questions that free-text logs cannot answer.

Per-tool timing is logged at both info and debug levels, giving operators a performance breakdown: which tool is slow today? Is the WHOIS resolver the bottleneck or is SSL certificate fetching?

What to log in an AI agent:

  • Task entry: task_id, trace_id, subject, target domain, sector
  • Phase 1 completion: elapsed_ms, successful_tools list, failed_tools list
  • Per-tool: name, elapsed_ms, error (if any)
  • Phase 2 entry: cumulative token budget status
  • Claude call completion: input_tokens, output_tokens, elapsed_ms, stop_reason
  • Task completion: success bool, total latency_ms, tokens_used, confidence score
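The bound-logger pattern behind this list does not require structlog. A stdlib approximation shows the core idea: every event is a JSON object that automatically carries the bound context.

```python
import json
import uuid


def bind(**context):
    # Return a logger function whose every event carries the bound fields,
    # in the spirit of structlog's logger.bind().
    def log(event: str, **fields) -> str:
        return json.dumps({"event": event, **context, **fields})
    return log


bound_log = bind(task_id="t-123", trace_id=str(uuid.uuid4()))
line = bound_log("claude_synthesis_complete", input_tokens=1200, output_tokens=800)
```

Because the output is JSON with named fields, queries like "all events where input_tokens > 60000" become trivial in any log aggregator.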

Practice 8: Write Synthesis Prompts Like a Senior Engineer

The synthesis prompt is not a chat message. It is a specification — a document that defines how the model should interpret ambiguous evidence, apply domain heuristics, handle missing data, and produce outputs that meet a defined quality bar. Most agent prompts are chatbot prompts. Enterprise synthesis prompts look different.

TechFingerprintAgent's SYNTHESIS_PROMPT contains concrete analytical heuristics for each data source — not generic instructions like "analyse the data carefully".

Domain-specific heuristics

The prompt instructs the model on what specific evidence combinations mean. For instance, if a Via header contains 'varnish', a dedicated caching layer exists behind the CDN — set varnish_detected = true. If Cloudflare is the CDN proxy but Route 53 is the DNS nameserver, flag origin_bypass_risk = 'HIGH'. These are not instructions the model would infer on its own — they require domain expertise encoded in the prompt.

Scoring rubrics with explicit weights

Security header scoring is defined with precise point values: 10 points each for HSTS, CSP, X-Content-Type-Options; -5 for HSTS without includeSubDomains; -10 for dual-missing CSP. Grade thresholds are explicit. This prevents the model from applying inconsistent grading across tasks.
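The same rubric can be checked deterministically downstream, which is useful for auditing the model's grading. A partial sketch covering only the three +10 headers and the includeSubDomains penalty (header keys assumed lowercase):

```python
def score_security_headers(headers: dict) -> int:
    # Partial mirror of the rubric's additive weights from the prompt.
    score = 0
    hsts = headers.get("strict-transport-security")
    if hsts:
        score += 10
        if "includesubdomains" not in hsts.lower():
            score -= 5  # HSTS present but not covering subdomains
    if "content-security-policy" in headers:
        score += 10
    if "x-content-type-options" in headers:
        score += 10
    return score


full = score_security_headers({
    "strict-transport-security": "max-age=63072000; includeSubDomains",
    "content-security-policy": "default-src 'self'",
    "x-content-type-options": "nosniff",
})
```

Encoding the weights in both prompt and code means a disagreement between the model's grade and the computed score is itself a useful signal.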

Evidence-first rules

Cite specific evidence for every claim (exact header value, DNS record, HTML pattern, certificate field). Never assert without evidence.

This rule transforms the output from general commentary to verifiable analysis. A report that says "uses Cloudflare CDN" is useful; one that cites the CF-Ray header value and the edge PoP city is actionable.

Graceful degradation instructions

If a tool failed, record the failure in the relevant field as {"error": "tool_name failed", "data_available": false}. Then proceed — partial data with honest gaps is always preferable to a fabricated complete result.

This instruction prevents hallucination under data scarcity — a common failure mode when agents receive partial inputs.

# Synthesis prompt excerpt — CDN/proxy chain heuristics
SYNTHESIS_PROMPT = """
CDN / PROXY CHAIN ANALYSIS:
- If `Via` header contains 'varnish': a dedicated caching layer exists behind
  the CDN. Note it in http_headers_analysis.varnish_detected = true.
- If Cloudflare CDN (CF-Ray present) BUT DNS nameservers are on Route 53:
  this is a Partial CNAME setup. Flag origin_bypass_risk = 'HIGH' —
  the origin IP is discoverable via historical DNS or certificate transparency.
- CF-Ray suffix reveals the Cloudflare edge PoP city (-DUB = Dublin).
  Record as cf_edge_pop in http_headers_analysis.

RULES:
- Cite specific evidence for every claim. Never assert without evidence.
- Never invent data. If no evidence for a field, leave it null.
"""

The Pattern That Runs Through All Eight Practices

Looking across these eight practices, a single principle connects them: an enterprise AI agent must behave predictably under every possible condition — not just the happy path.

Predictable behaviour under failure means: structured error types that drive correct retry decisions. Predictable cost means: explicit token budgets with pre-flight and post-call enforcement. Predictable output means: forced tool_choice with documented output contracts. Predictable security means: CIDR-aware SSRF blocking, prompt sanitisation, IDNA encoding — each addressing a specific bypass vector, not a vague category of 'security concern'.

Most agent prototypes work when everything goes right. Production agents are defined by what they do when things go wrong — when a tool times out, when the token budget is tight, when user input contains a homograph attack, when two workers pick up the same task simultaneously.

TechFingerprintAgent handles all of these. Not because they are clever abstractions, but because each failure mode was anticipated, named, and given a specific, documented response. That is what enterprise-grade looks like.

QualitaX builds production-grade agentic systems for B2B businesses. If your AI agent costs are higher than they should be — or you want to build something that won't surprise you with the bill — get in touch.