AI Crawler Whitelist: Crawlers You Must Allow
A robots.txt configuration that lets the important AI crawlers access your content while blocking the problematic ones. The list is current as of 2026.
Blocking every AI crawler also blocks your visibility in ChatGPT, Gemini, and Perplexity. Allowing every crawler, on the other hand, can leak content into training datasets you may not want to feed. The table below lists the current crawlers with a recommended policy for B2B companies.
AI crawler table, 2026
| Crawler | Operator | Purpose | Recommendation |
|---|---|---|---|
| GPTBot | OpenAI | Training data for OpenAI models | Allow (for visibility); block to opt out of training |
| ChatGPT-User | OpenAI | Real-time fetch when a user asks a question | Allow (critical for citation) |
| OAI-SearchBot | OpenAI | Search product | Allow |
| ClaudeBot | Anthropic | Training + search | Allow |
| Claude-Web | Anthropic | Real-time fetch | Allow |
| anthropic-ai | Anthropic | Legacy training | Allow or block |
| Google-Extended | Google | Control token for Gemini training use | Allow (for Gemini) |
| Googlebot | Google | Regular search + AI Overviews | NEVER block |
| PerplexityBot | Perplexity | Search + citation | Allow |
| cohere-ai | Cohere | Training | Optional |
| YouBot | You.com | Search | Allow |
| CCBot | Common Crawl | Open dataset (used for training by many models) | Allow |
| Meta-ExternalAgent | Meta | Training | Block if you are concerned |
| Bytespider | ByteDance | TikTok AI training | Block if not relevant |
| Amazonbot | Amazon | Alexa + shopping | Allow |
| Applebot-Extended | Apple | Training-use control for Apple's AI models | Allow |
robots.txt template
# Allow all major AI crawlers
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Applebot-Extended
Allow: /

# Block crawlers you do not want
User-agent: Bytespider
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# Standard rules
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /private/

Sitemap: https://example.com/sitemap.xml
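If the allow and block lists change often, it can be easier to generate the file from a small policy table than to edit it by hand. The Python sketch below is one way to do that, not a definitive tool. The helper name `build_robots_txt`, the two list constants, and the default paths are hypothetical and simply mirror the template above.

```python
from typing import Iterable

# Assumed policy lists, copied from the template above; adjust to your own decisions.
ALLOWED_AI_CRAWLERS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
    "Claude-Web", "Google-Extended", "PerplexityBot", "Applebot-Extended",
]
BLOCKED_AI_CRAWLERS = ["Bytespider", "Meta-ExternalAgent"]


def build_robots_txt(
    allowed: Iterable[str],
    blocked: Iterable[str],
    private_paths: Iterable[str] = ("/admin/", "/private/"),
    sitemap_url: str = "https://example.com/sitemap.xml",
) -> str:
    """Render a robots.txt string from allow/block lists (hypothetical helper)."""
    lines = ["# Allow major AI crawlers"]
    for agent in allowed:
        lines += [f"User-agent: {agent}", "Allow: /", ""]

    lines.append("# Block unwanted crawlers")
    for agent in blocked:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]

    lines += ["# Standard rules", "User-agent: *"]
    # Specific Disallow rules are written before the broad Allow so that
    # parsers using first-match and parsers using longest-match agree.
    lines += [f"Disallow: {path}" for path in private_paths]
    lines += ["Allow: /", "", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(ALLOWED_AI_CRAWLERS, BLOCKED_AI_CRAWLERS)
print(robots_txt)
```

Moving a crawler from one list to the other and regenerating the file keeps the policy reviewable in version control, which helps with the quarterly review.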
Monitoring crawlers
- Log access by AI crawler User-Agent. Confirm that they are actually crawling. Zero hits over 30 days suggests the crawler is not reaching the site at all, for example because a firewall or CDN rule blocks it; robots.txt only grants permission, it does not trigger a crawl.
- Track citations in LLMs after allowing. Run test prompts monthly and measure how often your content appears.
- Review new AI crawlers every quarter. The landscape changes: new crawlers appear and old ones are deprecated.
- Coordinate with legal/compliance. If content is confidential or contains client data, a partial allow may be more appropriate.
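The first monitoring step can be approximated with a short script that counts hits per crawler in an access log. This is a minimal sketch, assuming the log uses the common "combined" format where the User-Agent is the last quoted field. The function name, the token list, and the sample log lines (including the simplified User-Agent strings) are illustrative, not real traffic.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed subset of crawler tokens from the table; extend as needed.
AI_CRAWLER_TOKENS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot",
    "ClaudeBot", "PerplexityBot", "Bytespider",
]

# In the combined log format the User-Agent is the last double-quoted field.
USER_AGENT_PATTERN = re.compile(r'"([^"]*)"\s*$')


def count_ai_crawler_hits(log_lines: Iterable[str]) -> Counter:
    """Count requests per AI crawler token (case-insensitive substring match)."""
    hits = Counter({token: 0 for token in AI_CRAWLER_TOKENS})
    for line in log_lines:
        match = USER_AGENT_PATTERN.search(line)
        if not match:
            continue  # skip lines that do not end with a quoted User-Agent
        user_agent = match.group(1).lower()
        for token in AI_CRAWLER_TOKENS:
            if token.lower() in user_agent:
                hits[token] += 1
    return hits


# Hypothetical sample lines, for illustration only.
sample_log = [
    '203.0.113.5 - - [01/Feb/2026:10:00:00 +0000] "GET /blog/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '203.0.113.5 - - [01/Feb/2026:10:00:05 +0000] "GET /pricing HTTP/1.1" 200 734 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '198.51.100.7 - - [01/Feb/2026:11:30:00 +0000] "GET /blog/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '192.0.2.9 - - [01/Feb/2026:12:00:00 +0000] "GET / HTTP/1.1" 200 900 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

hits = count_ai_crawler_hits(sample_log)
silent_crawlers = [token for token, count in hits.items() if count == 0]
print(dict(hits))
print("No hits:", silent_crawlers)
```

Run over a 30-day log window, the `silent_crawlers` list shows which allowed crawlers never showed up. User-Agent strings can be spoofed, so treat these counts as a rough signal rather than proof of identity.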
Allowing GPTBot means your content may become training data. If your content is strategic and you do not want it used for training, block GPTBot but allow ChatGPT-User and OAI-SearchBot (real-time fetch and search, without training).
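Before deploying this training opt-out variant, you can sanity-check it locally with Python's standard `urllib.robotparser`. The sketch below assumes the policy text shown in `TRAINING_OPT_OUT`, which is one possible rendering of the rule above. The helper name `load_policy` and the test URL are hypothetical. Note that `urllib.robotparser` applies the first matching rule in a group, so real crawlers may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser

# Assumed policy text: block training (GPTBot), keep real-time fetch and search.
TRAINING_OPT_OUT = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
User-agent: OAI-SearchBot
Allow: /

User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /
"""


def load_policy(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text in memory, without any network request."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = load_policy(TRAINING_OPT_OUT)
page = "https://example.com/blog/post"  # hypothetical URL

# Ask the parser what each OpenAI agent may fetch under this policy.
decisions = {
    agent: parser.can_fetch(agent, page)
    for agent in ("GPTBot", "ChatGPT-User", "OAI-SearchBot")
}
print(decisions)
```

If the printed decisions do not show GPTBot as blocked and the other two agents as allowed, the file has a grouping or ordering problem worth fixing before it goes live.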