Who? What? Where? When did this happen? Last night an SEO friend hit me with the questions: What is an llms.txt file? Is it useful? Do I need it? Well, I got you! This is everything you need to know about llms.txt files and how you can use them to give yourself that edge over the competition.
The "llms.txt" file is a relatively new concept in the AI and SEO space — it's essentially like a "robots.txt" file, but instead of telling web crawlers (like Googlebot) how to index your site, it tells AI crawlers and large language model (LLM) scrapers what they can and cannot use from your site for training or generation purposes.
It's not yet a formal standard, but several companies (including OpenAI, Anthropic, and Perplexity) are starting to check for it. You'll find some WordPress plugins adding support for it, and it's been built into Brilliance CRM since we learned about it, to help us help our customers.
The Origin Story & Timeline
- Who created it: The llms.txt concept was proposed by Jeremy Howard, co-founder of Answer.AI.
- When: The idea was publicly introduced in September 2024, with early adoption by platforms like Mintlify in November 2024.
- Why: To help LLMs (large language models) access curated, structured content from websites—especially for inference and generation, not just training.
- Early recognition: OpenAI's GPTBot announcement in August 2023 was the big moment that put AI crawler controls on site owners' radar and set the stage for proposals like llms.txt.
- Adoption today: support among major AI crawlers and web platforms is still emerging, so treat any list of compliant bots as a snapshot rather than a guarantee.
- Scope: llms.txt isn't just about blocking training. It also guides inference, helping LLMs answer questions using your site's most relevant content (a minimal example follows this list).
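In Jeremy Howard's original proposal, llms.txt isn't a blocklist at all: it's a markdown file that points LLMs at your best, most citable pages. A minimal sketch of that format, with placeholder names and URLs:

```markdown
# YourSite

> One-sentence summary of what this site is about.

## Docs

- [Getting Started](https://yoursite.com/docs/getting-started): setup and first steps
- [API Reference](https://yoursite.com/docs/api): endpoints and parameters

## Optional

- [Blog](https://yoursite.com/blog): background reading for deeper dives
```

The rest of this article focuses on the blocking-style usage, since that's what most site owners ask about first.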
Purpose
- Gives website owners control over whether AI systems can scrape and use their content.
- Provides AI companies with machine-readable permissions about what data they can use.
- Works alongside (or in place of) "robots.txt" for AI-specific rules.
Comparison Table
Here's how llms.txt stacks up against robots.txt and sitemap.xml:
| Feature | robots.txt | sitemap.xml | llms.txt |
|---|---|---|---|
| Purpose | Search engine crawling | Page discovery | AI model guidance |
| Format | Plain text | XML | Markdown or plain text |
| Target Audience | Search bots | Search bots | AI crawlers / LLMs |
| Enforcement | Voluntary | Voluntary | Voluntary |
| Blocking Capability | Yes | No | Yes |
Current Known AI Crawlers That Respect It
As of this writing, these are the main crawlers to know (revisit this list regularly; it changes fast):
- GPTBot (OpenAI)
- CCBot (Common Crawl)
- ClaudeBot (Anthropic)
- PerplexityBot (Perplexity.ai)
- Google-Extended (Google's AI training control); technically a robots.txt token rather than a separate crawler, but it can be listed in llms.txt for clarity
- Applebot-Extended (Apple)
For precise blocking, match each bot's User-Agent token ("GPTBot", "CCBot", "ClaudeBot", "PerplexityBot"); the full User-Agent strings vary, so check each vendor's documentation for the current one. And if you'd rather enforce the block at the server than trust the bots, see the sketch below.
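Here's a minimal server-side sketch in Python, assuming a WSGI stack; the bot list reuses the names above and should be kept current:

```python
from wsgiref.simple_server import make_server

# Bot name tokens to match inside the User-Agent header. Real User-Agent
# strings are longer, but they contain these tokens.
AI_BOTS = ("GPTBot", "CCBot", "ClaudeBot", "PerplexityBot")

def block_ai_bots(app):
    """Wrap a WSGI app and return 403 Forbidden to known AI crawlers."""
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if any(bot.lower() in ua for bot in AI_BOTS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"AI crawling is not permitted on this site.\n"]
        return app(environ, start_response)
    return middleware

def site(environ, start_response):
    """Stand-in for your real application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello, human visitors!\n"]

if __name__ == "__main__":
    make_server("", 8000, block_ai_bots(site)).serve_forever()
```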
Typical Content
A basic "llms.txt" might look like:
# Block AI bots site-wide, except GPTBot on the blog
User-Agent: GPTBot
Allow: /blog/
Disallow: /
User-Agent: ClaudeBot
Disallow: /
User-Agent: PerplexityBot
Disallow: /
Or, to allow certain bots:
# Allow OpenAI's bot but block others
User-Agent: GPTBot
Allow: /
User-Agent: ClaudeBot
Disallow: /
And to block everything except a single public directory:
# Block all bots but allow one research directory
User-Agent: *
Disallow: /
Allow: /public-dataset/
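Because the blocking-style examples above borrow robots.txt syntax, you can sanity-check them with Python's standard-library robots.txt parser before publishing. A quick sketch (the rules and URLs are illustrative):

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-Agent: GPTBot
Allow: /blog/
Disallow: /

User-Agent: ClaudeBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# GPTBot may read the blog but nothing else; ClaudeBot is blocked entirely.
print(parser.can_fetch("GPTBot", "https://yoursite.com/blog/post"))     # True
print(parser.can_fetch("GPTBot", "https://yoursite.com/pricing"))       # False
print(parser.can_fetch("ClaudeBot", "https://yoursite.com/blog/post"))  # False
```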
Where to Place It
- Location: Always in your site's root directory (https://yoursite.com/llms.txt), same as robots.txt.
- Hosting Notes: Must be plain text, UTF-8 encoded, accessible without authentication.
- Quick test: curl https://yoursite.com/llms.txt should return the file's contents (a fuller check is sketched below).
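If you want to verify all three hosting notes at once, here's a small Python sketch; yoursite.com is a placeholder:

```python
import urllib.request

def check_llms_txt(domain: str) -> None:
    """Fetch /llms.txt and verify status, content type, and UTF-8 encoding."""
    url = f"https://{domain}/llms.txt"
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = resp.read()
        print("Status:", resp.status)                              # expect 200
        print("Content-Type:", resp.headers.get("Content-Type"))   # expect text/plain
        body.decode("utf-8")  # raises UnicodeDecodeError if not UTF-8
        print("UTF-8 OK:", len(body), "bytes")

check_llms_txt("yoursite.com")  # placeholder domain
```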
How to Test if It's Working
- Check your server logs for requests from AI bot User-Agents (a log-scanning sketch follows this list).
- Use online header/bot check tools.
- Look for AI bot User-Agent strings in your analytics.
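A minimal log-scanning sketch in Python; the log path and format are assumptions (a standard nginx access log), so adjust for your server:

```python
from collections import Counter

# Bot name tokens from the crawler list earlier in this article.
AI_BOTS = ("GPTBot", "CCBot", "ClaudeBot", "PerplexityBot", "Applebot-Extended")

hits = Counter()
with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        for bot in AI_BOTS:
            if bot in line:
                hits[bot] += 1

for bot, count in hits.most_common():
    print(f"{bot}: {count} requests")
```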
Integration with Other Protection Measures
- Pair with robots.txt for search crawlers and for robots.txt-only tokens like Google-Extended (example after this list).
- Use meta tags like noai or noimageai for image/content protection.
- Consider legal measures (Terms of Service updates to prohibit unauthorized AI training).
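For instance, since Google-Extended is read from robots.txt rather than llms.txt, opting out of Google's AI training means adding a block like this to your robots.txt:

```
User-agent: Google-Extended
Disallow: /
```

The noai tags work at the page level instead: a `<meta name="robots" content="noai, noimageai">` tag in a page's head signals the same opt-out to scrapers that honor that convention, which, like llms.txt itself, is voluntary.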
The Limitations
- Voluntary compliance — bad actors can still scrape.
- No legal enforcement by itself (must be paired with contracts or ToS).
- Not recognized by all AI companies yet.
Future Outlook
- Could become an official standard if adopted by W3C or other governing bodies.
- Possible browser-level AI content protection in future.
- AI companies may create central opt-out registries referencing llms.txt.
Value It Provides
- Content Protection: Prevents unauthorized use of your writing, images, and code for AI training.
- Transparency: Lets you clearly state your content-sharing preferences.
- Negotiation Tool: AI companies that want access to your content may need to request permission, opening possible licensing opportunities.
- Public Stance: Signals to visitors and AI companies your position on content usage.
- Compliance Layer: Some AI bots now respect "llms.txt" and will avoid scraping blocked content.
Important
Unlike "robots.txt", which most major search engines follow, "llms.txt" is voluntary — bad actors and rogue scrapers can still ignore it. It's mostly useful for good-faith AI companies who are trying to follow ethical scraping guidelines.