Why benchmark on LLMs
Because your prospects now ask a model their questions before they ever open Google. If ChatGPT recommends three competitors and never your brand, you lose the sale before you were even in the race. The competitive LLM benchmark measures exactly that gap.
The logic differs from classic SEO. In traditional search, you track your position on a query. In generative search, there is no results page: the model synthesizes an answer and cites a few brands. The right metric is therefore no longer rank, but share of citations: across all the prompts in your market, how many answers mention you, and how many mention each competitor.
This distinction is both measurable and strategic. The signals that trigger an AI citation are not the ones that determine your Google ranking. Ahrefs' analysis of 200,000 domains (Dec. 2025) shows that off-site brand mentions (YouTube 0.737, Reddit, Wikipedia) correlate more strongly with AI citations than Domain Rating does (0.266). You can dominate SEO and remain invisible in ChatGPT. The overlap is small: only 11% of domains are cited by both ChatGPT and AI Overviews.
Before measuring, clarify the scope: your market, your three to five direct competitors, and the questions a buyer asks. This is the foundation of a structured GEO agency approach, where every figure feeds a decision.
Building the prompt list
Everything depends on the quality of your prompts. A benchmark is only as representative as the questions it tests: they must reproduce what a real prospect asks, not what you wish they would ask.
Structure the list around the buying journey, in three families. This breakdown avoids the classic bias of testing only branded queries, where you always win.
The three prompt families
| Family | Intent | Example prompt |
|---|---|---|
| Discovery | The prospect is exploring a problem, with no solution in mind | 'How do I improve my site's visibility in ChatGPT?' |
| Comparison | The prospect is comparing approaches or providers | 'Best GEO agencies in France in 2026' |
| Decision | The prospect is looking to validate a specific choice | 'Which agency should I use to optimize my AI visibility in Albi?' |
Aim for 20 to 40 prompts in total, spread across these families. Below 20, a single atypical answer skews your percentages. Phrase them in natural language, the way you talk to an assistant, not in telegraphic keywords. Vary the angles: 'best', 'how to choose', 'alternatives to', 'for [industry]'.
Systematically include prompts where you expect to see your competitors. That is where the benchmark becomes a competitive tool rather than a mere ego test. Also document each prompt in close variants: models are sensitive to phrasing, and a reworded question can make a brand appear or disappear.
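The rules above (three families, 20 to 40 prompts, documented variants) can be checked mechanically before a wave starts. The following is a minimal sketch, not a prescribed tool: the `Prompt` class, the family labels and the `check_prompt_list` helper are hypothetical names introduced for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field

# Family labels are an assumption; they mirror the three families in the table.
FAMILIES = ("discovery", "comparison", "decision")


@dataclass
class Prompt:
    text: str                                           # main phrasing, in natural language
    family: str                                         # one of FAMILIES
    variants: list[str] = field(default_factory=list)   # close rewordings


def check_prompt_list(prompts, min_total=20, max_total=40):
    """Return a list of warnings; an empty list means the prompt list looks balanced."""
    warnings = []
    counts = Counter(p.family for p in prompts)
    unknown = set(counts) - set(FAMILIES)
    if unknown:
        warnings.append(f"unknown families: {sorted(unknown)}")
    if not min_total <= len(prompts) <= max_total:
        warnings.append(f"{len(prompts)} prompts, expected {min_total} to {max_total}")
    for family in FAMILIES:
        if counts[family] == 0:
            warnings.append(f"no prompt in family '{family}'")
    for p in prompts:
        if not p.variants:
            warnings.append(f"no variant documented for: {p.text!r}")
    return warnings
```

Running `check_prompt_list` on a draft list shows at a glance whether a family is missing or whether the list is too short for stable percentages.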
Running the test on each model
Run each prompt on each model under standardized conditions, otherwise the results are not comparable from one wave to the next. The protocol matters as much as the prompts.
Test in private browsing or on a dedicated account, with no memory or personalization enabled. The history of a personal account biases answers toward your own past searches.
Run each prompt on ChatGPT, Perplexity and Gemini at a minimum. They rely on distinct citation mechanisms and do not return the same brands.
For each answer, record every brand or domain mentioned, its position in the answer, and whether it is cited as a source or recommended in the text.
Keep a screenshot or the raw text of every answer. Models evolve; without an archive, you will be unable to verify or compare the following month.
LLMs are not deterministic. Run each prompt twice and count a brand as present if it appears at least once.
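The recording steps above can be sketched as a small data model. This is an illustration under assumptions: the `Mention` and `Run` classes and the `presence_by_cell` helper are hypothetical, and extracting brands from the raw answer text is left to manual review or a separate step.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    brand: str
    position: int           # 1 = first brand named in the answer
    cited_as_source: bool   # True if listed as a source, False if only recommended in the text


@dataclass
class Run:
    prompt: str
    model: str              # e.g. "chatgpt", "perplexity", "gemini"
    raw_text: str           # archived answer, kept for later verification
    mentions: list


def presence_by_cell(runs):
    """Merge repeated runs: a brand is present in a (prompt, model) cell
    if it appears in at least one run of that prompt on that model."""
    cells = defaultdict(set)
    for run in runs:
        cells[(run.prompt, run.model)].update(m.brand for m in run.mentions)
    return dict(cells)
```

Merging by union implements the "at least once over two runs" rule directly; a stricter rule (present in every run) would use an intersection instead.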
One technical detail weighs heavily on the results: most of the crawlers that feed LLMs do not execute JavaScript. If your pages' content loads client-side, such a crawler sees a largely empty page. Server-side rendering (SSR) or static HTML is therefore essential to exist in the index that feeds these answers. A competitor missing from your benchmark despite strong brand awareness often has this exact problem.
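A quick way to approximate what a non-rendering crawler receives is to look at the raw HTML response and check whether a key passage is present outside of scripts. The sketch below uses only the Python standard library; `static_text` and `visible_without_js` are hypothetical helper names, and the HTML is assumed to come from a plain HTTP request with no browser involved. Real crawlers differ in their details, so treat the result as an indication only.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def static_text(html):
    """Text available in the raw HTML, with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def visible_without_js(html, passage):
    """True if `passage` appears in the raw HTML text, ignoring case and spacing."""
    needle = " ".join(passage.split()).lower()
    return needle in static_text(html).lower()
```

If `visible_without_js` returns `False` for the main answer paragraph of a target page, the content is probably injected client-side and worth moving to SSR or static HTML.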
With more than 900 million weekly users, ChatGPT alone represents an audience that justifies including it in every benchmark. Ignoring this model means ignoring the leading generative search interface.
To automate collection at scale and cross-reference this data with your real traffic, see our method to track LLM traffic in GA4. The benchmark provides the snapshot; GA4 confirms the business impact.
Reading the results matrix
The matrix is a table with prompts in rows, models in columns, and each cell listing the brands cited. It is the centerpiece of the benchmark: it turns dozens of answers into a readable map of your AI visibility.
First, calculate your share of citations: the number of answers where you appear, divided by the total number of answers tested. Do the same calculation for each competitor. You get an AI visibility ranking that often bears no resemblance to the Google ranking in your market. This share-of-citations metric is covered in detail in our guide on AI share of voice.
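The share-of-citations calculation is simple enough to sketch in a few lines. The matrix representation here is an assumption: a dictionary keyed by `(prompt, model)` whose values are the sets of brands cited in that answer. The `citation_share` name is hypothetical.

```python
from collections import Counter


def citation_share(matrix):
    """Share of citations per brand: answers mentioning the brand / total answers.

    `matrix` maps (prompt, model) -> set of brands cited in that answer.
    Returns a dict ordered from the most cited brand to the least cited.
    """
    total = len(matrix)
    if total == 0:
        return {}
    counts = Counter()
    for brands in matrix.values():
        counts.update(set(brands))  # each brand counts once per answer
    return {brand: n / total for brand, n in counts.most_common()}
```

For example, a brand cited in 18 of 90 answers (30 prompts on 3 models) has a share of 0.2, whatever its position inside those answers.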
Then read the matrix along three diagnostic axes.
The three zones to spot
| Cell configuration | What it means | Action priority |
|---|---|---|
| You + competitors cited | You exist on this question, the market is shared | Consolidate: strengthen your relative position |
| Competitor cited alone | They own the space, you are invisible | Attack: create the missing citable content |
| No relevant party cited | The model improvises or cites off-topic | First mover: an open window to seize quickly |
The third zone is the most profitable. When no player in your market is cited on a high-intent prompt, the first brand to publish factual, structured and citable content takes it all. It is the opposite of a head-on battle: you occupy empty ground.
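The three zones in the table can be assigned automatically once each cell holds a set of cited brands. This is a minimal sketch: the zone labels and the `classify_cell` function are hypothetical, and the fourth label (`own_only`, for cells where you are cited without any competitor) is an addition not covered by the table.

```python
def classify_cell(brands, own_brand, competitors):
    """Map one matrix cell to a zone label.

    brands:      set of brands cited in the answer
    own_brand:   your brand name
    competitors: set of direct competitor names
    """
    brands = set(brands)
    rivals = brands & set(competitors)
    if own_brand in brands and rivals:
        return "consolidate"   # you + competitors cited
    if own_brand in brands:
        return "own_only"      # assumption: case not listed in the table
    if rivals:
        return "attack"        # competitor cited alone
    return "first_mover"       # no relevant party cited


zone = classify_cell({"Wikipedia"}, "Us", {"A", "B"})
```

Here `zone` is `"first_mover"`: the answer cites something, but no player from the market, which is the open window described above.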
Also spot the gaps between models. A brand cited on Perplexity but absent from ChatGPT reveals a specific citation signal to work on: web sourcing for one, off-site brand awareness for the other. The matrix tells you not only where you lose, but why.
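Gaps between models can be listed directly from the matrix. The sketch below assumes the matrix is a dictionary keyed by `(prompt, model)` with sets of cited brands as values; `model_gaps` is a hypothetical helper name.

```python
from collections import defaultdict


def model_gaps(matrix, brand):
    """Prompts where `brand` is cited on some models but missing on others."""
    by_prompt = defaultdict(dict)
    for (prompt, model), brands in matrix.items():
        by_prompt[prompt][model] = brand in brands
    gaps = {}
    for prompt, seen in by_prompt.items():
        cited = sorted(m for m, hit in seen.items() if hit)
        missing = sorted(m for m, hit in seen.items() if not hit)
        if cited and missing:  # keep only prompts with a real discrepancy
            gaps[prompt] = {"cited_on": cited, "missing_on": missing}
    return gaps
```

Prompts where the brand is cited everywhere, or nowhere, are left out on purpose: only the discrepancies point to a model-specific signal.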
Turning it into an action plan
A benchmark with no action plan is a dead report. Each zone of the matrix translates into a concrete GEO project, prioritized by the gap between the business value of the prompt and your current absence.
Prioritize using a simple rule: start with decision prompts where a competitor is cited alone, then discovery prompts where the window is open. The former recover sales; the latter build foundational authority.
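This ordering rule can be expressed as a sort key. The sketch is illustrative: the family labels (`"decision"`, `"discovery"`, `"comparison"`) and zone labels (`"attack"` for a competitor cited alone, `"first_mover"` for an open window) are hypothetical names chosen here, and everything outside the two priority cases is simply placed last.

```python
def priority(family, zone):
    """Lower number = handle first."""
    if family == "decision" and zone == "attack":
        return 0   # a competitor is cited alone on a decision prompt
    if family == "discovery" and zone == "first_mover":
        return 1   # nobody relevant is cited on a discovery prompt
    return 2       # everything else comes afterwards


def prioritize(items):
    """items: iterable of (prompt, family, zone) tuples.

    sorted() is stable, so items with the same priority keep their input order.
    """
    return sorted(items, key=lambda item: priority(item[1], item[2]))
```

A finer ranking (for example by estimated business value of each prompt) can be added as a second element of the sort key.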
For every decision prompt where you are missing, create or enrich a page that answers the question directly, with a self-contained citable passage of 134 to 167 words placed up top.
Add FAQPage schema to these pages: it is a strong signal for AI Overviews, making it easier for models to extract your question-answer pairs.
Where a competitor dominates without an obvious SEO advantage, strengthen your mentions on YouTube, Reddit and the sources models favor. This is what correlates most with citations.
Audit each target page: if the content depends on JavaScript, switch to SSR or static HTML so the model's crawler actually sees your text.
Re-run the same prompt list a month later to measure the movement. Without a second wave, you will never know whether your actions shifted your share of citations.
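For the FAQPage step in the list above, the markup follows the schema.org `FAQPage`, `Question` and `Answer` types. The sketch below generates the JSON-LD from question-answer pairs; the `faq_jsonld` function name is hypothetical, and the sample pair is taken from this page's own FAQ.

```python
import json


def faq_jsonld(pairs):
    """Build a schema.org FAQPage JSON-LD document from (question, answer) pairs."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    return json.dumps(data, ensure_ascii=False, indent=2)


markup = faq_jsonld([
    ("Does an LLM benchmark replace SEO rank tracking?", "No, it complements it."),
])
```

The resulting string goes inside a `<script type="application/ld+json">` element in the page's HTML, and it should be present in the server-rendered response rather than injected client-side.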
The citable passage deserves particular attention. A paragraph of 134 to 167 words, factual and self-contained, that fully answers the sub-question, is the unit models extract. Too short, it lacks substance; too long, it dilutes the answer and loses its citability. This is the optimal format observed in content that is actually picked up.
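Checking the length of a draft passage against the 134 to 167 word range is easy to automate. This is a minimal sketch that counts whitespace-separated words; the function names are hypothetical, and other word-counting conventions (hyphenated terms, numbers) may give slightly different totals.

```python
def passage_word_count(text):
    """Count whitespace-separated words in a passage."""
    return len(text.split())


def is_citable_length(text, low=134, high=167):
    """True if the passage falls within the target word range (inclusive)."""
    return low <= passage_word_count(text) <= high
```

The check covers length only; whether the passage is factual and self-contained still needs an editorial read.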
Document each wave in the same table to track the trajectory. A share of citations that rises wave after wave is proof that your GEO strategy is working, well before revenue confirms it. Progression figures from any single client case are only illustrative: what matters is your own curve, measured with a stable protocol.
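Comparing two waves comes down to subtracting shares brand by brand. The sketch below assumes each wave is summarized as a dictionary of brand to share of citations (a value between 0 and 1); `share_deltas` is a hypothetical helper, and a brand absent from a wave is treated as having a share of 0.

```python
def share_deltas(previous, current):
    """Change in share of citations per brand between two waves.

    previous, current: dicts mapping brand -> share (0.0 to 1.0).
    Positive values mean the brand gained visibility.
    """
    brands = sorted(set(previous) | set(current))
    return {
        brand: round(current.get(brand, 0.0) - previous.get(brand, 0.0), 4)
        for brand in brands
    }
```

The comparison is only meaningful if both waves used the same prompt list and the same protocol.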
To get a first quantified diagnostic without setting up the whole protocol yourself, use our AI Visibility Score. It gives you an immediate snapshot of your share of citations against your direct competitors.
Our free GEO audit benchmarks your visibility against your competitors on ChatGPT, Perplexity and Gemini, and hands you the prioritized action plan.
Frequently asked questions
How many prompts do you need to test for a reliable benchmark?
Count on 20 to 40 prompts per market for a first representative snapshot. Below 20, statistical noise distorts your conclusions; above 40, the cost of collection explodes with no major gain in information. Spread them across discovery, comparison and decision questions to cover the entire buying journey.
Which models should you run the benchmark on?
At a minimum ChatGPT, Perplexity and Gemini, because they cover most use cases and rely on different citation mechanisms. ChatGPT has more than 900 million weekly users. Add Claude and Google AI Overviews if your audience uses them. Test each model in a fresh session, with no history, to avoid personalization.
How often should you repeat a competitive LLM benchmark?
Once a month is enough to track a trend, because model answers evolve as updates and the web index change. Keep the same prompt list and the same protocol from one wave to the next, otherwise the variations are no longer comparable. A quarterly cadence remains acceptable for a stable, low-competition market.
Does an LLM benchmark replace SEO rank tracking?
No, it complements it. Google ranking and AI citation share overlap only partially: only 11% of domains are cited by both ChatGPT and AI Overviews. Tracking both gives you a complete view of your visibility, from the blue link to the generative answer.



