
Bot Detection Using User Agent Strings — Googlebot, Bingbot, and Crawlers

Identify search engine crawlers, social media bots, and malicious scrapers from user agent strings. Includes patterns for Googlebot, Bingbot, and other common bots, plus how to verify real crawlers with reverse DNS.

Mian Ali Khalid · 5 min read

Bots use user agent strings to identify themselves — but anyone can fake a UA. Distinguishing real Googlebot from a scraper spoofing it requires reverse DNS verification. Here’s how to detect, classify, and verify bots.

Use the User Agent Parser to classify any user agent string.

Common search engine bot UAs

Googlebot (web):
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Googlebot (mobile):
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 ... (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Bingbot:
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

DuckDuckBot:
DuckDuckBot/1.1; (+http://duckduckgo.com/duckduckbot.html)

Yandex:
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)

Baidu:
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)

AhrefsBot:
Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)

SemrushBot:
Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)

Bot detection patterns

const BOT_PATTERNS = {
  // Search engines (good bots):
  googlebot: /Googlebot/i,
  bingbot: /bingbot/i,
  duckduckbot: /DuckDuckBot/i,
  yandex: /YandexBot/i,
  baidu: /Baiduspider/i,
  
  // SEO crawlers:
  ahrefs: /AhrefsBot/i,
  semrush: /SemrushBot/i,
  majestic: /MJ12bot/i,
  
  // Social media crawlers:
  facebook: /facebookexternalhit/i,
  twitter: /Twitterbot/i,
  linkedin: /LinkedInBot/i,
  slack: /Slackbot/i,
  telegram: /TelegramBot/i,
  
  // Generic patterns:
  generic: /bot|crawler|spider|scraper|fetch|curl|wget|python-requests/i,
};

// The navigator default only works in the browser; on the server, pass the UA in explicitly.
function classifyUserAgent(ua = navigator.userAgent) {
  for (const [name, pattern] of Object.entries(BOT_PATTERNS)) {
    if (pattern.test(ua)) return { isBot: true, type: name };
  }
  return { isBot: false, type: 'human' };
}

Node.js / Express bot detection

app.use((req, res, next) => {
  const ua = req.headers['user-agent'] || '';
  const { isBot, type } = classifyUserAgent(ua);
  
  req.isBot = isBot;
  req.botType = type;
  
  // Log bot activity:
  if (isBot) {
    console.log(`Bot visit: ${type} from ${req.ip} → ${req.path}`);
  }
  
  next();
});

// Serve different content to bots (keep it equivalent to the user-facing
// page — mismatched content risks being treated as cloaking):
app.get('/page', (req, res) => {
  if (req.isBot && req.botType === 'googlebot') {
    // Pre-rendered static HTML for SEO; sendFile needs an absolute path or a root
    return res.sendFile('static-rendered.html', { root: process.cwd() });
  }
  res.sendFile('app.html', { root: process.cwd() });
});

Verify Googlebot with reverse DNS

Anyone can fake a Googlebot UA. Verify it’s really Google with a two-step DNS check: reverse-resolve the IP and confirm the hostname ends in googlebot.com or google.com, then forward-resolve that hostname and confirm it returns the original IP:

import dns from 'dns/promises';

async function verifyGooglebot(ip) {
  try {
    // Step 1: Reverse DNS lookup (IP → hostname)
    const [hostname] = await dns.reverse(ip);
    
    // Step 2: Must end in googlebot.com or google.com
    if (!hostname.endsWith('.googlebot.com') && !hostname.endsWith('.google.com')) {
      return false;
    }
    
    // Step 3: Forward lookup (hostname must resolve back to the original IP)
    const { address } = await dns.lookup(hostname);
    return address === ip;
    
  } catch {
    return false;
  }
}

// Usage in Express:
app.use(async (req, res, next) => {
  const ua = req.headers['user-agent'] || '';
  if (/Googlebot/i.test(ua)) {
    const isReal = await verifyGooglebot(req.ip);
    req.isVerifiedGooglebot = isReal;
    if (!isReal) {
      console.warn(`Fake Googlebot from ${req.ip}`);
    }
  }
  next();
});

Python: detect bots from UA

import re

BOT_PATTERNS = {
    'googlebot': re.compile(r'Googlebot', re.I),
    'bingbot': re.compile(r'bingbot', re.I),
    'social': re.compile(r'facebookexternalhit|Twitterbot|LinkedInBot|Slackbot', re.I),
    'seo': re.compile(r'AhrefsBot|SemrushBot|MJ12bot', re.I),
    'generic': re.compile(r'bot|crawler|spider|scraper|fetch|curl|wget|python-requests', re.I),
}

def classify_bot(ua: str) -> dict:
    for name, pattern in BOT_PATTERNS.items():
        if pattern.search(ua):
            return {'is_bot': True, 'type': name}
    return {'is_bot': False, 'type': 'human'}

Block bad bots in Nginx

# Block scrapers and bad bots (not search engines):
map $http_user_agent $blocked_bot {
    default                 0;
    ~*SemrushBot            0;  # Allow SEO bots
    ~*AhrefsBot             0;
    ~*python-requests       1;  # Block common scrapers
    ~*scrapy                1;
    ~*wget                  1;
    ~*curl                  1;
    ~*Go-http-client        1;
    ""                      1;  # Block empty UA
}

server {
    if ($blocked_bot) {
        return 403;
    }
}


Written by Mian Ali Khalid. Part of the Dev Productivity pillar.