
Bot Detection Guide

How to identify search engine crawlers, scrapers, and automated bots through user agent analysis, behavioral patterns, and verification techniques.

Types of Bots

Not all bots are malicious. Understanding the different categories helps you decide which to allow and which to block:

Legitimate Crawlers

Search engine crawlers index your site for search results. Blocking them means your site disappears from search engines. Major crawlers include:

  • Googlebot: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
  • Bingbot: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
  • Yandex: Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
  • Baidu: Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)

Monitoring and Uptime Bots

Services like Pingdom, UptimeRobot, and Datadog regularly ping your site to check availability. These typically identify themselves clearly in their UA strings.

Social Media Crawlers

When someone shares a link on social media, the platform fetches the page to generate a preview card. Common crawlers include facebookexternalhit, Twitterbot, and LinkedInBot.

Malicious Bots

Malicious bots include scrapers, vulnerability scanners, spam bots, and credential-stuffing tools. These may spoof legitimate-looking UA strings to avoid detection, or they may use generic library defaults like python-requests/2.28 or curl/7.68.
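The generic library defaults mentioned above can be flagged directly. A minimal sketch (the pattern list is an illustrative assumption, not an exhaustive or authoritative one):

```typescript
// Flag user agents that look like default HTTP-library identifiers.
// This list is a small illustrative sample, not a complete catalog.
const genericClientPatterns: RegExp[] = [
  /^python-requests\//i, // Python "requests" library default UA
  /^curl\//i,            // curl command-line tool
  /^wget\//i,            // wget
  /^go-http-client\//i,  // Go net/http default UA
  /^java\//i,            // Java HttpURLConnection default UA
];

function isGenericHttpClient(ua: string): boolean {
  return genericClientPatterns.some((p) => p.test(ua));
}
```

A generic-client match is a signal, not proof of abuse: internal scripts and health checks often use these same defaults, so pair this with rate limiting rather than blocking outright.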

UA-Based Detection

The simplest form of bot detection examines the user agent string for known patterns. This catches bots that honestly identify themselves:

// Common bot indicators in UA strings
const botPatterns = [
  /bot/i,           // Googlebot, bingbot, etc.
  /crawl/i,         // crawler, webcrawler
  /spider/i,        // Baiduspider, etc.
  /slurp/i,         // Yahoo Slurp
  /mediapartners/i, // Google AdSense
  /lighthouse/i,    // Google Lighthouse
  /pagespeed/i,     // Google PageSpeed
  /headless/i,      // Headless browsers
  /phantom/i,       // PhantomJS
  /selenium/i,      // Selenium WebDriver
  /puppeteer/i,     // Puppeteer
];

function isBot(ua: string): boolean {
  return botPatterns.some(pattern => pattern.test(ua));
}

Limitations

UA-based detection is easily bypassed. A scraper can set any user agent string it wants. Conversely, some legitimate tools use generic UA strings that match bot patterns. UA analysis is a useful first signal but should not be your only defense.

Verifying Legitimate Crawlers

Just because a request claims to be Googlebot does not mean it actually is. Verify legitimate crawlers with reverse DNS:

# Step 1: Reverse DNS lookup on the IP
$ host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

# Step 2: Forward DNS lookup to verify
$ host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1

# If the forward lookup matches the original IP, it's legitimate Googlebot

Google, Bing, and other major search engines publish their crawler IP ranges and verification procedures. Always verify before blocking traffic that claims to be a search engine.
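The two-step check above can be automated with Node's built-in dns module. A sketch under assumptions: the trusted suffix list below is a simplified sample, so consult each engine's published verification documentation for the authoritative hostname patterns.

```typescript
import { promises as dns } from "node:dns";

// Simplified sample of hostname suffixes used by major crawlers.
// Real deployments should use the suffixes each vendor documents.
const trustedSuffixes = [".googlebot.com", ".google.com", ".search.msn.com"];

function hasTrustedSuffix(hostname: string): boolean {
  return trustedSuffixes.some((suffix) => hostname.endsWith(suffix));
}

// Reverse-resolve the IP, check the hostname suffix, then
// forward-resolve the hostname and confirm the original IP is listed.
async function verifyCrawlerIp(ip: string): Promise<boolean> {
  try {
    const hostnames = await dns.reverse(ip);
    for (const hostname of hostnames) {
      if (!hasTrustedSuffix(hostname)) continue;
      const addresses = await dns.resolve4(hostname);
      if (addresses.includes(ip)) return true;
    }
  } catch {
    // NXDOMAIN or lookup failure: treat as unverified
  }
  return false;
}
```

Because each verification costs two DNS round trips, cache results per IP rather than resolving on every request.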

Beyond User Agents: Behavioral Detection

Sophisticated bots spoof legitimate user agents. To detect them, look at behavior:

  • Request rate: Bots often make requests much faster than humans. Rate limiting per IP or session can catch automated traffic.
  • Navigation patterns: Humans browse non-linearly, click links, and spend variable time on pages. Bots tend to access pages sequentially or target specific endpoints.
  • JavaScript execution: Many bots do not execute JavaScript. Challenge responses that require JS can filter them.
  • Cookie handling: Bots that do not store cookies between requests are easy to identify.
  • TLS fingerprinting: The TLS handshake parameters (JA3 fingerprint) differ between browsers and HTTP libraries, even when the UA string is spoofed.
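The first signal above, request rate, can be sketched as a sliding-window counter per IP. The window size and threshold here are arbitrary illustrative values, not recommendations:

```typescript
// Sliding-window rate limiter keyed by IP (in-memory sketch;
// production systems typically use a shared store like Redis).
class RateLimiter {
  private hits = new Map<string, number[]>();

  constructor(
    private maxRequests: number,
    private windowMs: number,
  ) {}

  // Returns true if this request is within the allowed rate.
  allow(ip: string, now: number = Date.now()): boolean {
    const cutoff = now - this.windowMs;
    const recent = (this.hits.get(ip) ?? []).filter((t) => t > cutoff);
    recent.push(now);
    this.hits.set(ip, recent);
    return recent.length <= this.maxRequests;
  }
}

// Usage: at most 3 requests per second per IP
const limiter = new RateLimiter(3, 1000);
```

Rate limits catch only the crudest automation; bots that throttle themselves to human-like speeds require the other signals listed above.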

robots.txt and Ethical Crawling

The robots.txt file tells well-behaved bots which paths they should and should not crawl. It is a gentleman's agreement, not a security mechanism - malicious bots simply ignore it.

# robots.txt example
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/

# Note: Google ignores Crawl-delay; Bing and Yandex honor it
User-agent: Bingbot
Allow: /
Crawl-delay: 1

User-agent: BadBot
Disallow: /

For protecting sensitive endpoints, use authentication, rate limiting, and CAPTCHAs rather than relying on robots.txt.

AI Crawlers

A newer category of crawlers scrapes content for training large language models. Common AI crawlers include GPTBot (OpenAI), ClaudeBot (Anthropic), CCBot (Common Crawl), and Google-Extended. Many sites now explicitly block these in their robots.txt.
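A robots.txt opt-out for the AI crawlers named above might look like the following. The exact user-agent tokens each vendor honors are documented by that vendor, so verify them before deploying:

```
# robots.txt: opt out of AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

As with all robots.txt rules, this relies on the crawler choosing to comply; it is a request, not an enforcement mechanism.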

Try It Yourself

Use our User Agent Parser to analyze any UA string and determine if it belongs to a known bot. The Bot Detect tab provides detailed classification and verification guidance.

Further Reading