急増するAIクローラー対策として「llms.txt」を導入してみた

AIクローラーのスパイクアクセス対策にllms.txtを導入。レート制限、正規URL、メタデータなどを記述し、適切なクロールを促す設定を試みました

#AI

#Amazon CloudFront

#DevelopersIO

suzuki.ryo

2025.02.25

AIクローラーによる過剰アクセスの発生をうけ、対策としてサイト構造化データファイル「llms.txt」(Large Language Model Specifications) を公開しました。

LLMに適切なクロール方法を指示し、サイトリソースの効率的な利用を意図して反映した指示内容について紹介させていただきます。

LLMに適切なクロールを促すため、llms.txtに反映した指示内容について紹介します。

設置

マークダウン形式のテキストファイルを作成し、robots.txtや、エラーページを格納するS3バケットに保存。

以下のURLで公開しました。
https://dev.classmethod.jp/llms.txt

llms.txt 内容

user-agent

特定のLLMに限定せず、すべてのAIクローラーに適用されるよう設定しました。

user-agent: *

License

コンテンツの利用条件として、サイトバナーに準じた記載を反映。
AIの学習利用は認める指定としました。

x-content-license: "(c) Classmethod, Inc. All rights reserved."
x-ai-training-policy: "allowed"

Rate Limits

過剰なサーバ負荷を招く、過剰なクロールを制限する指示を記載しました。

crawl-delay: 1
x-rate-limit: 60
x-rate-limit-window: 60
x-rate-limit-policy: "strict"
x-rate-limit-retry: "no-retry"
x-rate-limit-description: "Maximum 60 requests per 60 seconds. If rate limit is exceeded, do not retry and move on to next request."

# Concurrency Limits
x-concurrency-limit: 3
x-concurrency-limit-description: "Please limit concurrent requests to a maximum of 3.  This helps us manage server load."

Error Handling and Retry Policy

リトライに関し、負荷回避のためのバックオフ処理と、リトライ禁止する応答コードの指定を行いました。

x-error-retry-policy: "exponential-backoff"
x-error-retry-policy-description: "For transient errors (5xx except 429), implement exponential backoff with initial wait of 2 seconds, doubling on each retry, with maximum 5 retries."
x-rate-limit-exceeded-policy: "wait-and-retry"
x-rate-limit-exceeded-policy-description: "When receiving HTTP 429 (Too Many Requests), do not retry immediately. Wait at least 60 seconds before attempting the request again."
x-max-retries: 5
x-retry-status-codes:
  - 500
  - 502
  - 503
  - 504
x-no-immediate-retry-status-codes:
  - 429
x-no-retry-status-codes:
  - 403
  - 404

Canonical URL

サイトの正式な公開URLを明示し、IPアドレスの直指定や、意図せぬFQDNでのアクセスを回避する指示を記載しました。

x-canonical-url-policy: "strict"
x-canonical-url: "https://dev.classmethod.jp/"
x-canonical-url-description: "Access via other FQDNs or IP addresses is invalid.  Use only 'https://dev.classmethod.jp/' as the base URL.  Within the HTML content, links should also start with 'https://dev.classmethod.jp/'"

Article Pages

ブログ記事ページのパス、含まれるメタ情報、更新頻度などを記載。
サイト内のコンテンツ構造を明確に伝える事で、効率的なクロールを促す情報を反映しました。

x-article-pattern: /articles/{slug}/
x-article-type: "technical-blog"
x-article-description: "Technical blog posts and tutorials by Classmethod Inc., an AWS Premier Tier Services Partner.  Articles cover cloud computing (especially AWS), development, and IT, with a focus on practical experiences and reproducible guides."
x-article-example: https://dev.classmethod.jp/articles/browser-use-start/
x-article-metadata:
  - author
  - title
  - description
x-article-update-frequency: "daily"
x-article-new-articles-per-day: "10-50"
x-article-new-articles-per-day-description: "Approximately 10 to 50 new articles are added each day."

Author Pages

執筆者ページと、RSSなどの情報を記述しました。

allow: /author/*
x-author-pattern: /author/{username}/
x-author-description: "Blog author profile pages"
x-author-example: https://dev.classmethod.jp/author/akari7/
x-author-metadata:
  - author
  - description
  - url
x-author-content-update-policy: "dynamic"
x-author-rss-pattern: /author/{username}/feed/
x-author-rss-pattern-example: https://dev.classmethod.jp/author/akari7/feed/

# Author Update Frequency
x-author-update-frequency: "monthly"
x-author-new-authors-per-month: "5-20"
x-author-new-authors-per-month-description: "Approximately 5 to 20 new authors are added each month."

RSS、Sitemap

新規に公開された30記事が反映されるRSSと、全公開記事が月単位で反映されているサイトマップの情報を記載しました。

# RSS Feeds
allow: /feed/
x-feed: https://dev.classmethod.jp/feed/
x-feed-format: rss
x-feed-item-count: 30
x-feed-content: "latest"
x-feed-description: "The RSS feed contains the 30 most recent articles published on the site. It is updated whenever new content is published."
x-feed-update-frequency: "real-time"

# Sitemap
allow: /sitemap.xml
allow: /sitemaps/*
x-sitemap: https://dev.classmethod.jp/sitemap.xml
x-sitemap-format: xml
x-sitemap-structure: "monthly"
x-sitemap-pattern: "/sitemaps/sitemap-{year}-{month}.xml"
x-sitemap-pattern-example: "https://dev.classmethod.jp/sitemaps/sitemap-2024-07.xml"
x-sitemap-description: "Articles are organized in monthly sitemaps following the pattern sitemap-YYYY-MM.xml. The main sitemap.xml is an index that references these monthly sitemaps."
x-sitemap-retention: "long-term"
x-sitemap-retention-description: "Monthly sitemaps are retained for multiple years, allowing access to historical content organization."

アクセス抑制

現在未使用、リダイレクトとなるパスについては、リクエストを抑制しました。

# disallow
disallow: /pages/*
disallow: /tag/*
disallow: /category/*

Publisher情報

サイト運営者の信頼性を示す情報として、DevelopersIOとはを英訳、構造化した内容を反映しました。

# Publisher Information
x-publisher-name: "Classmethod, Inc."
x-publisher-description: "Classmethod, Inc. is a technology company with 300+ engineers, specializing in cloud services and digital transformation. It operates a technical blog (dev.classmethod.jp) with over 30,000 articles based on practical engineering experiences, primarily focused on AWS."
x-publisher-credentials:
  - "AWS Premier Tier Services Partner (2014-present)"
  - "AWS Services Partner of the Year - Japan"
  - "LINE Technology Partner (OMO/Engagement certified)"
  - "3,000+ client companies, 15,000+ AWS accounts supported"
x-publisher-certifications:
  - "AWS Technical Certifications: 2,000+"
  - "AWS Competencies: Migration, Mobile, Big Data, DevOps"
  - "AWS Service Delivery Programs: 12"
x-publisher-website: "https://classmethod.jp/"

言語

現在、当ブログで執筆実績のある言語を記載しました。

x-supported-languages:
  - ja
  - en
  - ko
  - th
  - de
  - vi
  - id

Crawler Hints

末尾の「/」が不足、Nextアプリでリダイレクトとして扱われるリクエスト数が減る効果を期待し、LLMのクローラにURI末尾の「/」が存在しない場合、クローラー側で事前に補う措置を促しました。

x-crawler-hints:
  - "Before accessing each URL, check if the URL ends with '/'."
  - "If the URL does not end with '/', add '/' to the end of the URL before accessing it."
  - "Extract the HTML content of each page."

効果

llms.txt 設置後3日間のアクセスログに、llms.txt を参照したリクエスト (User-Agentに"bot"を含む) が記録されている事を確認できました。

User-Agent
Mozilla/5.0 (compatible; SBIntuitionsBot/0.1; +https://www.sbintuitions.co.jp/bot/)
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.6943.53 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)

まとめ

今回紹介した llms.txt (Large Language Model Specifications) 、2025年時点では標準化されていない状態ですが、今後AIクローラーとの共存を図る上で有効なアプローチと考えています。

今回の設定の効果測定や、今後の標準化動向などを見極めながら最適化を試みていく予定です。

AIアクセスを受け入れる公開ウェブサイトを運営されている場合、LLM向けにサイトの構造、クロールポリシー、コンテンツ利用に関する情報を llms.txt としてクローラーに提供することで、非効率なクロールの発生を抑制することをおすすめします。

また、ローカルLLMやRAG施策などで自組織外のWebサービスを情報源として利用したり、コンテンツを直接利用したりする場合、サイトポリシーの確認と併せて「llms.txt」の設置状況を確認頂き、llms.txt内にクロールやコンテンツ利用に関する具体的な指示が明記されている場合は、その指示に従うことを強く推奨します。