004 Field Note

6 min read·June 2026

The AI Crawler Access Matrix: Separate Search Inclusion, Training Use, and Citation Measurement for GEO

A multi-engine GEO operating model for separating search-style AI inclusion, training preferences, preview controls, protected assets, and citation measurement across Google, OpenAI, Perplexity, and Bing.

AI visibility is no longer governed by one crawl rule. A GEO team now needs a crawler access matrix: one column for whether a page can appear in search-style AI answers, one for whether it can be used for training or other model-improvement systems, one for preview controls, and one for whether the team can measure citations after the page is referenced.

The short version: do not treat robots.txt as a blunt on/off switch. Google Search AI features, OpenAI search retrieval, Perplexity retrieval, and Bing AI citation reporting all expose different controls and reporting surfaces. The winning operating model is to keep public evidence pages accessible to answer systems while separately managing training preferences, snippets, protected assets, and citation measurement.

What the evidence says

Google's guidance for AI features says the best practices for SEO remain relevant for AI Overviews and AI Mode. It also says there are no additional requirements to appear in those experiences and no special optimizations necessary beyond fundamental SEO practices. That matters because Google is positioning AI features as part of Search, not as a separate channel with a separate submission process.

Google's same AI-features page draws a useful control boundary. It says robots.txt directives for Googlebot are the control site owners use to manage how their sites are crawled for Search. To limit what is shown from pages in Search, Google points to preview controls such as nosnippet, data-nosnippet, max-snippet, and noindex. For some other Google generative systems, Google points publishers to Google-Extended.
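
As a rough illustration of the preview controls named above, the sketch below builds a robots meta tag and a data-nosnippet wrapper in Python. It assumes a hypothetical templating step that emits page markup; the function names `robots_meta_tag` and `wrap_nosnippet` are illustrative and not part of any Google tooling.

```python
from typing import Optional


def robots_meta_tag(noindex: bool = False,
                    nosnippet: bool = False,
                    max_snippet: Optional[int] = None) -> str:
    """Build a <meta name="robots"> tag from the chosen preview controls."""
    directives = []
    if noindex:
        directives.append("noindex")
    if nosnippet:
        directives.append("nosnippet")
    elif max_snippet is not None:
        # max-snippet takes a character limit; it is skipped when nosnippet
        # is already set, since no snippet would be shown anyway.
        directives.append(f"max-snippet:{max_snippet}")
    if not directives:
        return ""  # no preview restriction requested
    content = ", ".join(directives)
    return f'<meta name="robots" content="{content}">'


def wrap_nosnippet(fragment_html: str) -> str:
    """Mark one HTML fragment as excluded from snippets."""
    return f"<span data-nosnippet>{fragment_html}</span>"


print(robots_meta_tag(max_snippet=120))
print(wrap_nosnippet("Partner-only discount terms"))
```

The page stays crawlable in both cases; only what can be shown from it changes, which is why preview limits sit in their own column of the matrix.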

Google's crawler documentation makes the Google-Extended boundary even clearer: Google-Extended does not have a separate HTTP request user-agent string. Crawling is done with existing Google user-agent strings; the robots.txt product token is used in a control capacity. The documentation says the token lets publishers manage whether content Google crawls may be used to improve Gemini Apps and Vertex AI generative APIs. That is not the same thing as blocking Google Search.

OpenAI's crawler documentation uses a different split. It says OpenAI uses OAI-SearchBot and GPTBot robots.txt tags, and that each setting is independent. OpenAI gives the example that a webmaster can allow OAI-SearchBot to appear in search results while disallowing GPTBot to indicate that crawled content should not be used for training OpenAI's generative AI foundation models. That is the access matrix in one sentence: search inclusion and training use are separate decisions.
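
The independence of the two settings can be checked locally before a robots.txt change ships. The sketch below uses Python's standard `urllib.robotparser` against a sample file; the sample rules, the `example.com` URL, and the helper name `crawl_permissions` are assumptions for illustration. The standard-library parser only approximates how each crawler reads the file, so treat the result as a sanity check, not a guarantee.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: allow search inclusion, decline training use.
SAMPLE_ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /
"""


def crawl_permissions(robots_txt: str, url: str, agents: list[str]) -> dict[str, bool]:
    """Return, per user-agent token, whether the rules allow fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = crawl_permissions(
    SAMPLE_ROBOTS_TXT,
    "https://example.com/docs/pricing",
    ["OAI-SearchBot", "GPTBot"],
)
print(result)
```

With this sample file the search token is allowed and the training token is not, which is the split the OpenAI documentation describes.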

Perplexity's crawler documentation adds another retrieval layer. PerplexityBot is designed to surface and link websites in Perplexity search results and is not used to crawl content for AI foundation models. Perplexity-User supports user actions inside Perplexity: when a user asks a question, Perplexity may visit a page to provide an accurate answer and include a link. Perplexity says each robots.txt setting works independently and may take up to 24 hours to reflect changes.

Bing's AI Performance announcement shows the measurement side of the matrix. The public-preview dashboard is described as showing how publisher content appears across Microsoft Copilot, AI-generated summaries in Bing, and select partner integrations. Bing says it can show total citations, average cited pages, cited URLs, citation activity over time, and grounding query phrases. It also states that Bing respects content owner preferences expressed through robots.txt and other supported controls.

The mistake: one robots.txt policy for every AI outcome

Many teams still ask a single question: should we block AI bots or allow them? That is too coarse for GEO.

A pricing page, methodology page, public documentation page, comparison guide, or customer-proof page may be exactly the evidence an answer engine needs. Blocking retrieval can protect the page from some uses, but it can also remove the page from the pool of sources that might support a buyer-facing answer.

The error is assuming every platform interprets the same signal the same way. Google's Search AI controls are tied to Googlebot and preview controls. Google-Extended is a product token for certain non-Search generative uses. OpenAI separates search inclusion from GPTBot training preference. Perplexity separates PerplexityBot search surfacing from Perplexity-User user-triggered access. Bing is adding reporting that helps teams see which URLs are cited in AI answer surfaces.

Build the crawler access matrix

Start with a simple spreadsheet or policy table. Each row is a URL pattern or page group. Each column answers a different question.

Should this page be eligible for search-style AI answers? Public evidence pages usually should be accessible. These include answer-first category pages, product docs, comparison pages, methodology notes, support articles, pricing explainers, location pages, and source-backed research summaries.
Should this page be excluded from training or model-improvement use where a platform offers a separate control? This is where Google-Extended, GPTBot, and other platform-specific tokens matter. The decision can differ from search inclusion.
Should the page preview be limited? Google points to `nosnippet`, `data-nosnippet`, `max-snippet`, and `noindex` for limiting what is shown from pages in Search. A page can be crawlable but still need preview boundaries.
Is the page safe for user-triggered retrieval? Perplexity-User and similar user-action agents raise a different question than bulk crawling: should a page be reachable when a user asks an AI system to inspect it?
Can the team measure whether the page is being cited? Bing's AI Performance concepts create a useful reporting target: total citations, cited pages, citation trends, cited URLs, and grounding query phrases. Even if other engines provide less reporting, the same fields can shape manual audits.
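
The five questions above can also live in code rather than a spreadsheet. The following is a minimal sketch, assuming one row per URL path pattern; the `AccessPolicy` fields, the example patterns, and the `policy_for` helper are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional


@dataclass(frozen=True)
class AccessPolicy:
    url_pattern: str              # glob over the URL path, e.g. "/docs/*"
    search_ai_eligible: bool      # may appear in search-style AI answers
    training_allowed: bool        # may be used for training where a control exists
    preview_limit: Optional[str]  # e.g. "max-snippet:160", or None for no limit
    user_triggered_ok: bool       # reachable by user-action agents
    citation_tracked: bool        # included in citation audits


# Hypothetical rows; a real matrix would come from the team's own inventory.
MATRIX = [
    AccessPolicy("/docs/*", True, False, None, True, True),
    AccessPolicy("/pricing*", True, True, "max-snippet:160", True, True),
    AccessPolicy("/account/*", False, False, "noindex", False, False),
]


def policy_for(path: str, matrix: list[AccessPolicy] = MATRIX) -> Optional[AccessPolicy]:
    """Return the first row whose pattern matches the path, or None."""
    for row in matrix:
        if fnmatchcase(path, row.url_pattern):
            return row
    return None


print(policy_for("/docs/api/auth"))
```

First-match lookup keeps the sketch simple; a team with overlapping patterns would need to order rows from most to least specific.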

A practical page classification model

Use four page classes.

Open evidence pages are built to be cited. Keep them crawlable for search-style AI systems, make them answer-first, and keep facts current.

Controlled preview pages can be discovered but should not expose unlimited snippets. Use supported preview controls when the page contains useful public information but needs display boundaries.

Training-restricted pages stay open to search retrieval but are excluded from model-improvement use where the platform provides an independent training control.

Protected pages should not be used as AI evidence at all. Use robots rules, authentication, noindex, or access controls as appropriate. Do not rely on voluntary robots preferences for material that truly must stay private.
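
One way to turn the four classes into robots.txt groups is sketched below. It assumes each class is assigned per path prefix, and it assumes a grouping of the tokens named in this article into "search" and "training" lists; that grouping, the example paths, and the function names are illustrative choices, not platform guidance. Controlled preview pages are handled in page markup, so they produce no robots.txt rule here, and protected content still needs real access controls on top of these voluntary rules.

```python
# Assumed grouping of the tokens discussed in this article.
SEARCH_TOKENS = ["Googlebot", "OAI-SearchBot", "PerplexityBot"]
TRAINING_TOKENS = ["GPTBot", "Google-Extended"]

# Hypothetical path prefixes mapped to the four page classes.
PAGE_CLASSES = {
    "/research/": "open_evidence",
    "/pricing/": "controlled_preview",
    "/docs/": "training_restricted",
    "/internal/": "protected",
}


def _group(token: str, blocked_paths: list[str]) -> str:
    """Render one robots.txt group; allow everything if nothing is blocked."""
    rules = [f"Disallow: {path}" for path in blocked_paths] or ["Allow: /"]
    return "\n".join([f"User-agent: {token}", *rules])


def build_robots_txt(page_classes: dict[str, str]) -> str:
    """Build robots.txt text from a path-prefix to page-class mapping."""
    protected = [p for p, c in page_classes.items() if c == "protected"]
    restricted = [p for p, c in page_classes.items() if c == "training_restricted"]
    # Search tokens are blocked only from protected paths.
    groups = [_group(token, protected) for token in SEARCH_TOKENS]
    # Training tokens are blocked from protected and training-restricted paths.
    groups += [_group(token, protected + restricted) for token in TRAINING_TOKENS]
    return "\n\n".join(groups) + "\n"


ROBOTS_TXT = build_robots_txt(PAGE_CLASSES)
print(ROBOTS_TXT)
```

Generating the file from the classification keeps business intent as the source of truth, so a page that changes class changes its rules in one place.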

Leading indicators to watch

The first indicator is source substitution. If AI answers cite listicles, Reddit threads, marketplaces, or competitors for facts your public evidence pages should answer, your access matrix may be too restrictive or your owned page may be less useful than the cited source.

The second indicator is preview mismatch. If a page appears in AI-assisted Search with an excerpt that creates risk or omits context, inspect whether supported preview controls match the page's purpose.

The third indicator is training-policy drift. If the company decides certain public pages should be searchable but not used for training where separate controls exist, verify platform-specific tokens instead of assuming a generic AI-bot rule covers every use.

The fourth indicator is measurement asymmetry. Bing's AI Performance direction shows that citation reporting can include cited URLs and grounding query phrases. GEO teams should build internal dashboards around similar concepts even when an engine does not provide a native dashboard: which URLs are cited, for which prompts, in which engine, and with what answer role.
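
The first and fourth indicators can be tracked from the same observation log. The sketch below assumes a hand-collected list of observations with hypothetical field names (`engine`, `prompt`, `cited_url`) and a hypothetical set of owned hostnames; it counts, per engine, how many citations went to owned pages versus other sources.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical hostnames the team owns.
OWNED_HOSTS = {"example.com", "www.example.com"}

# Hypothetical observations gathered during a manual audit.
observations = [
    {"engine": "perplexity", "prompt": "crm pricing tiers",
     "cited_url": "https://example.com/pricing"},
    {"engine": "perplexity", "prompt": "crm pricing tiers",
     "cited_url": "https://www.reddit.com/r/crm/comments/abc"},
    {"engine": "bing_copilot", "prompt": "crm setup guide",
     "cited_url": "https://www.example.com/docs/setup"},
]


def citation_summary(rows: list[dict], owned_hosts: set[str] = OWNED_HOSTS) -> dict:
    """Count total, owned, and substituted citations for each engine."""
    total = Counter()
    owned = Counter()
    for row in rows:
        host = urlparse(row["cited_url"]).hostname or ""
        total[row["engine"]] += 1
        if host in owned_hosts:
            owned[row["engine"]] += 1
    return {
        engine: {"citations": n, "owned": owned[engine], "substituted": n - owned[engine]}
        for engine, n in total.items()
    }


summary = citation_summary(observations)
print(summary)
```

A rising "substituted" count for prompts an owned page should answer is the source-substitution signal described above.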

The 30-day implementation plan

Week one: inventory public URL groups. Label open evidence pages, controlled-preview pages, training-restricted pages, and protected pages. Do not start with robots syntax. Start with business intent.

Week two: map platform controls. For Google, separate Search crawl access, Search preview controls, and Google-Extended. For OpenAI, separate OAI-SearchBot from GPTBot. For Perplexity, separate PerplexityBot from Perplexity-User. For Bing, document both crawl preferences and AI Performance reporting fields.

Week three: implement the safest high-value changes. Keep open evidence pages accessible to search-style AI systems. Add preview controls where needed. Apply training restrictions only where the platform supports a separate control and the business policy calls for it. Move truly protected content behind stronger access boundaries instead of relying on voluntary crawlers.

Week four: measure citations and gaps. Use Bing AI Performance where available. Run a manual prompt audit across Google, ChatGPT search, Perplexity, and Bing/Copilot. Record cited URL, source type, answer role, query phrase, and whether the cited page is the page you intended the engine to use.
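
The week-four audit record can be captured in a small structure and exported as CSV. This is a sketch under assumptions: the `AuditRow` field names mirror the fields listed above, the `intended_match` comparison simply ignores a trailing slash, and the sample row is invented.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class AuditRow:
    engine: str
    query_phrase: str
    cited_url: str
    source_type: str
    answer_role: str
    intended_url: str  # the page the team wanted the engine to use

    @property
    def intended_match(self) -> bool:
        """True when the cited page is the intended page (trailing slash ignored)."""
        return self.cited_url.rstrip("/") == self.intended_url.rstrip("/")


def to_csv(rows: list[AuditRow]) -> str:
    """Serialize audit rows, plus the derived intended_match flag, as CSV text."""
    buffer = io.StringIO()
    names = [f.name for f in fields(AuditRow)] + ["intended_match"]
    writer = csv.DictWriter(buffer, fieldnames=names, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({**asdict(row), "intended_match": row.intended_match})
    return buffer.getvalue()


audit = [
    AuditRow("perplexity", "crm pricing", "https://example.com/pricing/",
             "owned page", "primary source", "https://example.com/pricing"),
]
csv_text = to_csv(audit)
print(csv_text)
```

Keeping the same columns for every engine makes the manual audit comparable with whatever native reporting an engine later provides.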

The bottom line

GEO is now an access-control and measurement discipline, not just an editorial discipline. The page that deserves to be cited must be crawlable by the right search-style systems, constrained by the right preview rules, separated from training use where supported, and monitored after it enters AI answers.

The brands that win will not ask "allow AI bots or block AI bots?" They will ask a better set of questions: which pages are public evidence, which systems should retrieve them, which uses should be restricted, what snippets are safe, and how will we know when the answer layer starts citing the right source?


Sources

AI features and your website - Google Search Central

Google's common crawlers: Google-Extended - Google for Developers

Overview of OpenAI Crawlers - OpenAI Platform Docs

PerplexityBot - Perplexity

Introducing AI Performance in Bing Webmaster Tools Public Preview - Microsoft Bing Webmaster Blog
