"AI agent" is the most over-used and under-defined term in marketing technology right now. Every legacy tool has rebranded its features as agents. Every agency has a slide deck claiming to operate them. The phrase has become close to meaningless — which is a problem for buyers trying to evaluate genuine capability against marketing language.

This piece describes what's actually inside VIMC01, our creative agent. Not the pitch. The architecture.

The agent loop

An agent is, at its core, a system that runs the following loop autonomously:

  1. Observe — read its current state and the relevant context.
  2. Plan — decide what to do next based on a goal.
  3. Act — call a tool, generate output, or trigger another system.
  4. Evaluate — check whether the action moved closer to the goal.
  5. Refine — update its plan based on the evaluation, or terminate if done.

That loop is the thing that distinguishes an agent from an LLM call. An LLM call is one step. An agent is the loop that runs the call, evaluates the output, and decides what to do next. It's the difference between "answer this prompt" and "produce a campaign that meets these specifications, and don't stop until the evaluation passes."
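The five steps above can be sketched as a small control loop. This is a minimal illustration, not VIMC01's code: `run_agent_loop`, `LoopResult`, and the callables passed in are hypothetical names, and the iteration cap is an assumption added so the sketch always terminates.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class LoopResult:
    output: Any
    passed: bool
    iterations: int
    history: List[Any] = field(default_factory=list)


def run_agent_loop(
    goal: str,
    observe: Callable[[], dict],
    plan: Callable[[str, dict], Any],
    act: Callable[[Any], Any],
    evaluate: Callable[[Any], bool],
    refine: Callable[[Any, Any], Any],
    max_iterations: int = 5,  # assumed safety cap, not from the article
) -> LoopResult:
    context = observe()                       # 1. Observe
    current_plan = plan(goal, context)        # 2. Plan
    history: List[Any] = []
    output = None
    for i in range(1, max_iterations + 1):
        output = act(current_plan)            # 3. Act
        history.append(output)
        if evaluate(output):                  # 4. Evaluate
            return LoopResult(output, True, i, history)
        current_plan = refine(current_plan, output)  # 5. Refine
    # Budget exhausted: report failure instead of looping forever.
    return LoopResult(output, False, max_iterations, history)


# Toy usage: keep shortening a headline until it fits a length limit.
result = run_agent_loop(
    goal="buy our very fine shoes today",
    observe=lambda: {"max_chars": 12},
    plan=lambda goal, ctx: goal,
    act=lambda p: p,
    evaluate=lambda out: len(out) <= 12,
    refine=lambda p, out: " ".join(p.split()[:-1]),
)
print(result.passed, result.iterations, result.output)
```

A single LLM call corresponds to the `act` line on its own. Everything around it is what makes the system an agent.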

Most "AI features" in legacy marketing software are LLM calls dressed up as agents. They lack the evaluate and refine steps. They produce output and stop. This is why they tend to fail in the long tail — they have no way to notice their own mistakes.

What VIMC01 does in each step

Concretely, when VIMC01 is asked to produce a creative variant pack for a campaign, the loop runs as follows:

Observe

VIMC01 reads:

  • The campaign brief (audience, channel, objective, KPI).
  • The brand context (voice classifier weights, brand guidelines, restricted-claims list, palette, typography, locked assets).
  • The historical creative-performance database for this client and similar clients (anonymised).
  • The most recent VIMP01 signals — what's currently winning in the live auction, what audiences are saturating, what creative archetypes are decaying.

This context is retrieved from a vector database keyed to the client and a structured database of campaign metadata. The retrieval is not "throw it all in the prompt" — it's selective, prioritised, and budgeted against the LLM's effective context window.
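One way to picture "selective, prioritised, and budgeted" retrieval is a greedy selection against a token budget. The sketch below is illustrative only: `ContextItem`, `select_context`, the priority numbers, and the token estimates are all assumptions, and the real retrieval logic may rank and trim context quite differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ContextItem:
    name: str
    priority: int  # lower number = more important (assumed convention)
    tokens: int    # estimated token cost of including this item


def select_context(items: List[ContextItem], budget_tokens: int) -> List[ContextItem]:
    """Greedily keep the highest-priority items that still fit the budget."""
    chosen: List[ContextItem] = []
    remaining = budget_tokens
    # Within a priority tier, cheaper items go first so more of them fit.
    for item in sorted(items, key=lambda i: (i.priority, i.tokens)):
        if item.tokens <= remaining:
            chosen.append(item)
            remaining -= item.tokens
    return chosen


candidates = [
    ContextItem("brief", priority=0, tokens=500),
    ContextItem("restricted_claims", priority=0, tokens=300),
    ContextItem("brand_guidelines", priority=1, tokens=1200),
    ContextItem("vimp01_signals", priority=1, tokens=800),
    ContextItem("performance_history", priority=2, tokens=3000),
]
selected = select_context(candidates, budget_tokens=3000)
print([c.name for c in selected])
```

In this toy run the full performance history does not fit, so it is left out rather than truncating the brief or the restricted-claims list.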

Plan

VIMC01 decomposes the brief into a tree of creative archetypes (clusters of related variants sharing common DNA) and, within each archetype, a set of variant axes (headline pattern, image type, CTA framing, length variant, colour treatment). The tree is the plan. It is generated by the LLM but constrained by hard rules: minimum diversity across archetypes, a cap on overlap with tested-and-failed patterns, and mandatory coverage of the audience segments in the brief.

The plan is reviewed by a senior strategist before execution on initial briefs. After roughly the first ten campaigns for a client, plan generation is reliable enough that the review becomes a spot check.
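The hard rules on the plan can be expressed as a deterministic check that runs after the LLM proposes a tree. The following is a rough sketch under assumptions: the plan is flattened to a dict of archetype name to variants, and `validate_plan`, the thresholds, and the violation codes are hypothetical.

```python
from typing import Dict, List, Set

Variant = Dict[str, str]  # assumed shape: {"segment": ..., "pattern": ...}


def validate_plan(
    plan: Dict[str, List[Variant]],
    required_segments: Set[str],
    failed_patterns: Set[str],
    min_archetypes: int = 3,         # assumed diversity floor
    max_failed_share: float = 0.10,  # assumed cap on known-bad patterns
) -> List[str]:
    """Return a list of violated rules; an empty list means the plan is acceptable."""
    violations: List[str] = []
    leaves = [v for variants in plan.values() for v in variants]

    # Rule 1: minimum diversity across archetypes.
    non_empty = [name for name, variants in plan.items() if variants]
    if len(non_empty) < min_archetypes:
        violations.append("diversity")

    # Rule 2: limited overlap with tested-and-failed patterns.
    if leaves:
        failed = sum(1 for v in leaves if v["pattern"] in failed_patterns)
        if failed / len(leaves) > max_failed_share:
            violations.append("failed_overlap")

    # Rule 3: every audience segment in the brief must be covered.
    covered = {v["segment"] for v in leaves}
    for segment in sorted(required_segments - covered):
        violations.append(f"coverage:{segment}")
    return violations


draft = {
    "social_proof": [
        {"segment": "new", "pattern": "testimonial"},
        {"segment": "new", "pattern": "discount_shout"},
    ],
    "how_to": [{"segment": "lapsed", "pattern": "tutorial"}],
}
print(validate_plan(draft, {"new", "lapsed", "loyal"}, {"discount_shout"}))
```

Keeping these checks outside the model means a plan that breaks a rule is rejected regardless of how plausible it looks.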

Act

For each leaf in the tree, VIMC01:

  • Generates the copy through a language model fine-tuned on the client's approved corpus.
  • Generates or composites the image through the appropriate visual model (different models for product, lifestyle, abstract).
  • Assembles the full ad in the platform-required format.
  • Writes structured metadata that VIMP01 and VIMD01 can read downstream.

This is the step most people imagine when they hear "AI creative." It's also the smallest part of the system by code volume, and the part where model swaps happen most frequently — every six to eight weeks we evaluate whether a new model is meaningfully better for a given creative subtask.
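Frequent model swaps are easier when each creative subtask is looked up in a registry rather than hard-coded. A minimal sketch of the act step for one leaf, assuming stub models: `build_variant`, the leaf fields, and the metadata keys are illustrative names, not the actual VIMC01 schema.

```python
from typing import Callable, Dict

CopyModel = Callable[[str], str]
ImageModel = Callable[[str], str]


def build_variant(leaf: dict, copy_model: CopyModel,
                  image_models: Dict[str, ImageModel]) -> dict:
    """Generate copy and image for one plan leaf and attach structured metadata."""
    image_type = leaf["image_type"]
    if image_type not in image_models:
        raise KeyError(f"no image model registered for {image_type!r}")
    return {
        "copy": copy_model(leaf["headline_pattern"]),
        "image": image_models[image_type](leaf["archetype"]),
        # Metadata for downstream readers (hypothetical field names).
        "metadata": {
            "archetype": leaf["archetype"],
            "segment": leaf["segment"],
            "image_type": image_type,
        },
    }


# Stub models stand in for the real ones; swapping a model is one dict entry.
image_models = {
    "product": lambda archetype: f"product-image:{archetype}",
    "lifestyle": lambda archetype: f"lifestyle-image:{archetype}",
}
leaf = {"archetype": "social_proof", "segment": "new",
        "headline_pattern": "question", "image_type": "product"}
variant = build_variant(leaf, lambda pattern: f"copy[{pattern}]", image_models)
print(variant["image"], variant["metadata"])
```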

Evaluate

Each variant runs through the five-layer review pipeline (covered separately in "Brand-safe AI creative"). At the per-variant level the output of evaluation is binary (pass/reject). In aggregate it yields structured failure-mode data: which constraints failed, which audiences are under-served, which archetypes are producing too much off-brand drift.
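The step from binary per-variant results to aggregate failure-mode data is essentially a counting exercise. A hedged sketch, assuming each review result records its archetype, its segment, and the names of any failed checks (an empty list means pass); `aggregate_reviews` and the field names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def aggregate_reviews(results: List[dict]) -> Dict[str, dict]:
    """Summarise pass/reject results into failure-mode aggregates."""
    failures: Counter = Counter()        # which checks failed, and how often
    totals: Counter = Counter()          # variants reviewed per archetype
    rejects: Counter = Counter()         # variants rejected per archetype
    passes_by_segment: Counter = Counter()

    for r in results:
        totals[r["archetype"]] += 1
        if r["failed"]:
            rejects[r["archetype"]] += 1
            failures.update(r["failed"])
        else:
            passes_by_segment[r["segment"]] += 1

    return {
        "failures": dict(failures),
        "reject_rate": {a: rejects[a] / totals[a] for a in totals},
        "passes_by_segment": dict(passes_by_segment),
    }


reviews = [
    {"archetype": "social_proof", "segment": "new", "failed": []},
    {"archetype": "social_proof", "segment": "lapsed", "failed": ["brand_voice"]},
    {"archetype": "urgency", "segment": "lapsed", "failed": ["brand_voice", "claims"]},
]
summary = aggregate_reviews(reviews)
print(summary)
```

In this toy data the "lapsed" segment has no passing variants, which is the kind of signal the refine step can act on.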

Refine

Based on the evaluation aggregates, VIMC01 adjusts the plan: re-weights archetype generation toward those that are passing review, retires archetypes that are systematically failing brand-voice classification, and triggers a regeneration cycle for under-covered audiences. The refine step is what allows volume to scale without quality collapsing.

If the evaluation indicates a structural problem (too many variants failing claim verification, brand-voice classifier confidence dropping), VIMC01 pauses and escalates to a human strategist rather than pushing through.
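A simplified picture of the refine step: scale each archetype's weight by its pass rate, retire the ones that fail systematically, and check escalation thresholds before continuing. All names and threshold values here (`refine_weights`, `should_escalate`, 0.2, 0.3, 0.7) are assumptions chosen for illustration.

```python
from typing import Dict


def refine_weights(weights: Dict[str, float], pass_rates: Dict[str, float],
                   retire_below: float = 0.2) -> Dict[str, float]:
    """Re-weight archetypes toward those passing review; drop systematic failures."""
    kept = {
        name: weight * pass_rates.get(name, 0.0)
        for name, weight in weights.items()
        if pass_rates.get(name, 0.0) >= retire_below
    }
    total = sum(kept.values())
    if total == 0:
        return {}  # nothing viable left; the caller should escalate
    return {name: value / total for name, value in kept.items()}


def should_escalate(claim_failure_rate: float, classifier_confidence: float,
                    max_claim_failure: float = 0.3,
                    min_confidence: float = 0.7) -> bool:
    """Pause for a human when the evaluation points to a structural problem."""
    return (claim_failure_rate > max_claim_failure
            or classifier_confidence < min_confidence)


new_weights = refine_weights(
    {"social_proof": 0.5, "how_to": 0.3, "urgency": 0.2},
    {"social_proof": 0.8, "how_to": 0.4, "urgency": 0.1},
)
print(new_weights)
print(should_escalate(claim_failure_rate=0.35, classifier_confidence=0.9))
```

A pure pass-rate weighting like this one will chase recent winners, so in practice it would need to be combined with an exploration floor.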

Memory: working, episodic, and semantic

Alongside the loop, the most important architectural distinction between an agent and a bare LLM call is memory. VIMC01 has three kinds:

Working memory. The current campaign's state — what's been generated, what's pending, what's been rejected and why. Cleared at campaign end. Lives in the agent's own state machine, not in the LLM context.

Episodic memory. A log of past campaigns and their outcomes, queryable by the agent. "What worked for this audience six months ago" is a retrieval against episodic memory. This is what allows VIMC01 to improve over time — without it, every campaign would start from zero.

Semantic memory. The brand guidelines, voice classifier, restricted-claims list, palette and typography rules. Stable, structured, treated as ground truth. Updated only when the client explicitly changes something.

The three layers are kept separate because they decay differently. Working memory should be cleared frequently. Episodic memory is precious and should compound. Semantic memory should change rarely and only via human approval. Conflating them — which most "AI features" in legacy tools do — is why those features feel inconsistent. They have no architectural separation between what should be stable and what should change.
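The separation can be made concrete by giving each layer a different write path. The class below is a sketch under assumptions, not the production design: `AgentMemory` and its method names are hypothetical, and in-memory Python structures stand in for the real stores.

```python
from types import MappingProxyType
from typing import Any, Callable, Dict, List


class AgentMemory:
    def __init__(self, brand_rules: Dict[str, Any]):
        self.working: Dict[str, Any] = {}        # cleared at campaign end
        self._episodes: List[dict] = []          # append-only log of outcomes
        self._semantic = dict(brand_rules)       # changes only with approval

    @property
    def semantic(self):
        # Read-only view, so the agent cannot edit ground truth by accident.
        return MappingProxyType(self._semantic)

    def end_campaign(self, campaign_id: str, outcome: dict) -> None:
        """Archive the campaign into episodic memory, then wipe working memory."""
        self._episodes.append(
            {"campaign": campaign_id, "outcome": outcome, "state": dict(self.working)}
        )
        self.working.clear()

    def recall(self, predicate: Callable[[dict], bool]) -> List[dict]:
        """Query past campaigns, e.g. what worked for a given audience."""
        return [e for e in self._episodes if predicate(e)]

    def update_semantic(self, key: str, value: Any, approved_by: str) -> None:
        if not approved_by:
            raise PermissionError("semantic memory changes need a named human approver")
        self._semantic[key] = value


mem = AgentMemory({"palette": "navy"})
mem.working["generated"] = 12
mem.end_campaign("spring-launch", {"ctr": 0.021})
print(mem.working, len(mem.recall(lambda e: e["campaign"] == "spring-launch")))
```

The point of the design is that each layer has exactly one way to change: working memory resets, episodic memory only appends, and semantic memory only changes with a named approver.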

Tool use

VIMC01 doesn't just generate text and images. It uses tools: specific, structured APIs that perform deterministic actions. The major tools:

  • The brand-asset library — fetches approved logos, photography, typography in the right formats.
  • The product database — reads pricing, availability, and feature data so claims are grounded in current reality, not the model's training data.
  • The platform APIs — Meta, Google, TikTok, Snap, DV360 — for previewing how a variant will render in each placement and ensuring policy compliance before submission.
  • The voice classifier — invoked at the evaluate step, returns a tone-of-voice score.
  • VIMP01's signal feed — the latest auction and creative-performance signals, used to skew planning toward what's currently winning.

Each tool is a distinct API contract. VIMC01 knows which tool to call for which task because the tool definitions are part of its system prompt and its training. This is what tool-using LLMs do; the engineering is in making the tool surface clean, fast, and predictable.
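A "clean, fast, and predictable" tool surface usually starts with a registry that holds each contract in one place and rejects malformed calls before they run. This is a generic sketch, not VIMC01's interface: `Tool`, `ToolRegistry`, and the example tool are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    required_args: Tuple[str, ...]
    handler: Callable[..., Any]


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        if tool.name in self._tools:
            raise ValueError(f"duplicate tool: {tool.name}")
        self._tools[tool.name] = tool

    def describe(self) -> str:
        """Render tool definitions as text suitable for a system prompt."""
        return "\n".join(
            f"{t.name}({', '.join(t.required_args)}): {t.description}"
            for t in self._tools.values()
        )

    def call(self, name: str, **kwargs: Any) -> Any:
        tool = self._tools.get(name)
        if tool is None:
            raise KeyError(f"unknown tool: {name}")
        missing = [a for a in tool.required_args if a not in kwargs]
        if missing:
            raise TypeError(f"{name} missing arguments: {missing}")
        return tool.handler(**kwargs)


registry = ToolRegistry()
registry.register(Tool(
    name="product_price",
    description="Look up the current price for a SKU.",
    required_args=("sku",),
    handler=lambda sku: {"sku": sku, "price": 49.0},  # stub data for the sketch
))
print(registry.describe())
print(registry.call("product_price", sku="A-100"))
```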

Where VIMC01 fails, and what we do about it

Three failure modes account for nearly all production issues:

Drift in the voice classifier. The classifier was trained on a snapshot of the client's brand voice. Brands evolve. The classifier becomes stale. We retrain quarterly on a fresh sample and have senior creatives audit the cases where the classifier and human judgement disagree.

Over-fitting to recent winners. The refine step can over-weight what's working right now and under-explore what might work next. We enforce a minimum exploration budget — typically 15–20% of the variant volume must be in archetypes that haven't been validated yet. Without this, performance plateaus.
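The exploration budget is simple arithmetic on the variant volume. A small sketch, assuming the share is given as a whole percentage and rounded up so the floor is never undershot; `split_volume` is a hypothetical helper.

```python
from typing import Dict


def split_volume(total_variants: int, exploration_pct: int = 15) -> Dict[str, int]:
    """Reserve at least exploration_pct percent of variants for unvalidated archetypes."""
    if total_variants < 0:
        raise ValueError("total_variants must be non-negative")
    if not 0 <= exploration_pct <= 100:
        raise ValueError("exploration_pct must be between 0 and 100")
    # Integer ceiling division avoids floating-point rounding surprises.
    explore = -(-total_variants * exploration_pct // 100)
    return {"explore": explore, "exploit": total_variants - explore}


print(split_volume(200, 15))  # {'explore': 30, 'exploit': 170}
print(split_volume(50, 15))   # {'explore': 8, 'exploit': 42}
```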

Tool-use errors. Sometimes the wrong tool gets called, or a tool returns unexpected data. The evaluate step catches most of these but not all. We have a deterministic fallback: any variant where the tool-use trace is malformed is dropped before it reaches the human reviewer, regardless of how good the output looks.
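The deterministic fallback amounts to a structural check on each variant's tool-use trace, applied without looking at the creative itself. The trace format below (a list of steps with `tool`, `args`, and `status`) and the rule that an empty trace counts as malformed are assumptions made for this sketch.

```python
from typing import Any, List, Set


def trace_is_well_formed(trace: Any, known_tools: Set[str]) -> bool:
    """Check the shape of a tool-use trace; says nothing about output quality."""
    if not isinstance(trace, list) or not trace:
        return False  # assumption: a missing or empty trace is malformed
    for step in trace:
        if not isinstance(step, dict):
            return False
        tool = step.get("tool")
        if not isinstance(tool, str) or tool not in known_tools:
            return False
        if not isinstance(step.get("args"), dict):
            return False
        if step.get("status") != "ok":
            return False
    return True


def drop_malformed(variants: List[dict], known_tools: Set[str]) -> List[dict]:
    """Keep only variants whose trace passes, regardless of how good they look."""
    return [v for v in variants if trace_is_well_formed(v.get("trace"), known_tools)]


tools = {"brand_assets", "product_db"}
batch = [
    {"id": "v1", "trace": [{"tool": "product_db", "args": {"sku": "A-100"}, "status": "ok"}]},
    {"id": "v2", "trace": [{"tool": "unknown_tool", "args": {}, "status": "ok"}]},
    {"id": "v3", "trace": [{"tool": "brand_assets", "args": {}, "status": "error"}]},
    {"id": "v4"},
]
print([v["id"] for v in drop_malformed(batch, tools)])
```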

How this compares to "AI features"

If you take any of the major legacy creative platforms that have added "AI features" in the last 18 months and inspect what's actually running, you'll typically find: a single LLM call on an unstructured prompt, no evaluation step, no episodic memory, ad-hoc tool use if any, and a UI wrapper that hides the limitations. The output looks impressive in a demo and degrades quickly in production.

The architectural distinction is not academic. It's the difference between a system that can scale and improve and a system that produces a polished demo and then plateaus.

The marketing question — "is this real AI or AI-washing?" — has a precise technical answer. Look for the loop, the memory layers, the tool surface, and the evaluation. If they're not there, you have a feature, not an agent.