Web Crawlers: The 12 Bots You Need to Know (And How They Actually Work)

Written by Bits Lovers

Last week I spent 4 hours debugging why a client’s site wasn’t showing up in Bing. The answer turned out to be embarrassingly simple: the site was blocking Bingbot in robots.txt. Nobody on the current team knew because someone set it up 6 years ago “to reduce server load” and nobody ever revisited it.

This is a practical breakdown of the web crawlers you’re most likely to encounter, how they work, and what you actually need to know as a developer or site operator.

How Web Crawlers Actually Work

The theory is simple: a crawler visits a page, extracts the links, visits those links, extracts more links, and so on. In practice, there are three things that determine how a crawler behaves:

The crawl budget. Every crawler has a limit on how many pages it will crawl per day or per session. Big crawlers like Googlebot have enormous budgets; smaller ones have much tighter limits. If your site has 500,000 pages and a crawler has a budget of 10,000 pages per day, a single full pass takes at least 50 days, so you won’t get fully indexed quickly.

The politeness settings. Well-behaved crawlers respect delays between requests, honor robots.txt, and don’t hammer your server. Aggressive crawlers (especially commercial ones) push these boundaries.

The purpose. A search engine crawler indexes for search. An SEO crawler indexes for link analysis. An ad network crawler reads page content to match ads. Each has different behavior because they’re optimizing for different outcomes.

The Major Crawlers

Googlebot

This is the one that matters most. Google operates separate crawlers for different content types:

  • Googlebot (desktop): Crawls with a desktop user agent
  • Googlebot (smartphone): Crawls with a mobile user agent; with mobile-first indexing, this is the primary crawler for most sites
  • Googlebot-Image: Indexes images
  • AdsBot: Evaluates landing page quality for ads

You can verify that traffic claiming to be Googlebot is genuine, and see which pages it’s hitting:

# Reverse DNS lookup to verify Googlebot
host 66.249.66.1
# Should return a hostname ending in googlebot.com or google.com,
# e.g. crawl-66-249-66-1.googlebot.com
# Then forward-resolve that hostname to confirm it maps back:
host crawl-66-249-66-1.googlebot.com
# Should return the original IP

# Or check in nginx logs:
awk '$1 ~ /66\.249\./ {print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head

Googlebot respects robots.txt but also has its own rules. If you block Googlebot in robots.txt, your pages won’t be crawled and will largely drop out of Google Search (a blocked URL can still appear as a bare link if other sites point to it; noindex is the tool for actual removal). Usually that kind of block is deliberate, but I’ve seen sites accidentally block themselves and then wonder why traffic dropped.
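
If you need to do the same check programmatically, say while processing a batch of log entries, here’s a minimal Python sketch of the reverse-then-forward DNS verification (the sample IP is just an illustration):

# Minimal sketch: verify an IP claiming to be Googlebot via
# reverse DNS, then confirm the hostname resolves back to the same IP.
import socket

def is_googlebot(ip: str) -> bool:
    # Reverse DNS: the PTR record should be under googlebot.com or google.com
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        return False
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    # Forward-confirm: the hostname must resolve back to the original IP
    try:
        _, _, addresses = socket.gethostbyname_ex(hostname)
    except OSError:
        return False
    return ip in addresses

print(is_googlebot("66.249.66.1"))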

AhrefsBot

Ahrefs maintains a database of over 12 trillion links. Their crawler is the second busiest on the internet after Google. If you’re doing SEO work, you’ve probably used Ahrefs. The bot is aggressive — I’ve seen it crawl sites at 100+ requests per second even with polite robots.txt settings.

# Check if AhrefsBot is respecting your delay settings
# Look for user-agent string "AhrefsBot" in logs
grep -i "AhrefsBot" /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head

Fun fact: AhrefsBot respects the Crawl-delay directive in robots.txt, but only up to a point. If you set it to 10 seconds, they’ll probably honor it. If you set it to 60 seconds, they might not.

Bingbot

Microsoft’s crawler for Bing. Less aggressive than AhrefsBot, but still significant. Bingbot follows robots.txt and respects Crawl-delay.

# Verify Bingbot the same way as Googlebot
# Bingbot IPs reverse-resolve to hostnames under search.msn.com
host 40.77.167.123
# Should return a *.search.msn.com hostname; forward-resolve it
# to confirm it maps back to the original IP

YandexBot

Yandex is Russia’s dominant search engine, with roughly 60% domestic market share. If you’re targeting Russian-speaking users or doing international SEO, Yandex matters.

Yandex has a reputation for being more aggressive than Google about crawling. They also have separate bots for different content types (YandexImages, YandexVideo, etc.).

DuckDuckGo Bot (DuckDuckBot)

DuckDuckGo doesn’t track users or personalize search results, and their crawler is similarly low-key. DuckDuckBot supplements results that largely come from partners such as Bing, so it crawls far less aggressively than the major engines’ bots.

If you’re seeing DuckDuckBot in your logs, it’s probably because someone searched for your site on DDG — not because you’re being specifically targeted.

Applebot

Applebot gathers content for Siri Suggestions and Spotlight search. If you want your content to appear in Siri answers or Spotlight search results, Applebot needs to be able to crawl it.

Baiduspider

Baidu is China’s dominant search engine with a huge market share domestically. If you’re targeting the Chinese market, Baidu SEO matters. Baidu has a separate image crawler (Baiduspider-image) and a video crawler.

Note: Baidu’s bot behavior and policies differ from Google in ways that can surprise Western developers. They have different spam detection criteria, different indexing rules, and different webmaster guidelines.

Sogou Spider

Another Chinese search engine, owned by Tencent. Smaller than Baidu but still significant in China. Sogou has about 5% of the Chinese search market.

Exabot

France-based Exalead’s crawler, now part of Dassault Systèmes. Exabot crawls for Exalead’s web index, which is used by some enterprise search applications.

CriteoBot

This one is different. Criteo is an ad network. CriteoBot crawls pages on sites running Criteo ads, reading the content to match relevant advertisements. It doesn’t help your SEO — it’s optimizing ad targeting.

If you’re running ads on your site and you see CriteoBot, that’s normal and expected. They respect robots.txt, but blocking them means Criteo can’t serve targeted ads on your site.

PetalBot

Petal Search is another Chinese search engine, backed by Huawei. It’s grown significantly as Huawei has pushed its own ecosystem.

Facebook External Hit (facebookexternalhit)

You see this when someone shares a link on Facebook, Messenger, or Instagram. Facebook’s crawler fetches the page’s title, description, and preview image so the post displays nicely.

<!-- To control what Facebook shows when your link is shared -->
<!-- Use Open Graph meta tags -->
<meta property="og:title" content="Your Page Title">
<meta property="og:description" content="Description that appears in the post">
<meta property="og:image" content="https://example.com/share-image.jpg">
<meta property="og:url" content="https://example.com/page-url">

<!-- Test your sharing setup -->
<!-- Use Facebook's Sharing Debugger: -->
<!-- https://developers.facebook.com/tools/debug/ -->

Without these tags, Facebook uses whatever it can scrape from your page, which is often wrong or ugly.
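
Before reaching for the Sharing Debugger, you can sanity-check locally what the crawler will find. A rough sketch; the URL is a placeholder:

# Rough check: fetch a page and print its Open Graph tags,
# roughly what facebookexternalhit reads to build the preview.
import requests
from bs4 import BeautifulSoup

def og_tags(url: str) -> dict:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return {
        meta["property"]: meta.get("content", "")
        for meta in soup.find_all("meta", property=True)
        if meta["property"].startswith("og:")
    }

print(og_tags("https://example.com/page-url"))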

robots.txt: What It Actually Controls

robots.txt sits at the root of your domain (yoursite.com/robots.txt) and tells crawlers what they’re allowed to access.

# Example robots.txt
User-agent: *
Allow: /public/
Allow: /api/public/
Disallow: /admin/
Disallow: /tmp/
Disallow: /private-files/

# Block a specific crawler entirely
User-agent: AhrefsBot
Disallow: /

# Set crawl delay for polite crawlers
User-agent: *
Crawl-delay: 5

# Control AI training bots (2024+): GPTBot allowed except /private/, CCBot blocked
User-agent: GPTBot
Allow: /
Disallow: /private/
User-agent: CCBot
Disallow: /

# Sitemap location
Sitemap: https://yoursite.com/sitemap.xml

Key things to know:

robots.txt is advisory, not enforced. A well-behaved crawler respects it. A malicious scraper ignores it. If you’re blocking sensitive content, robots.txt is not security — use authentication.

Disallow rules are prefix matches. Disallow: /private/ blocks everything under /private/. But be careful: Disallow: /private (without the trailing slash) matches any path that starts with /private, so it blocks /private.html and /privateers/ as well as everything under /private/.

Multiple rules per crawler. You can specify different rules for different crawlers. If you want to block AhrefsBot but allow Googlebot, you can do that.

The Crawl-delay directive is not universal. Google ignores it. Many smaller crawlers honor it but don’t guarantee compliance.
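
You can test how these rules will be interpreted before deploying them. A minimal sketch using Python’s built-in parser; the rules string below is an example, not a recommendation:

# Sketch: test robots.txt rules locally with Python's built-in parser.
import urllib.robotparser

rules = """
User-agent: *
Disallow: /private

User-agent: AhrefsBot
Disallow: /
Crawl-delay: 10
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("Googlebot", "/private/report.html"))  # False
print(rp.can_fetch("Googlebot", "/privateers/"))           # False: /private is a prefix match
print(rp.can_fetch("AhrefsBot", "/blog/post"))             # False: blocked entirely
print(rp.crawl_delay("AhrefsBot"))                         # 10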

What Goes Wrong in Practice

Blocking the Wrong Bot

I can’t count how many times I’ve seen this:

# Someone copied a "security" robots.txt from the internet
User-agent: *
Disallow: /

# Oops, you blocked everything including Googlebot

Or more subtle:

# This blocks /api/ for ALL crawlers, including Googlebot
User-agent: *
Disallow: /api/

# Fine if that's intentional, but if anything under /api/ was meant
# to be crawlable, you just hid it from Google as well

The Crawl Budget Mismatch

Your site has 50,000 pages. You launch a new product section with 10,000 pages. If Googlebot is crawling at 5,000 pages per day, it takes at least two days just to crawl the new section, and realistically longer, because that budget also covers recrawling your existing 50,000 pages. Linking the new pages prominently from your homepage helps Googlebot find and prioritize them faster.

If you’re launching something time-sensitive (a flash sale, an event registration), you need to understand crawl budget dynamics and sometimes request indexing for the key URLs through the URL Inspection tool in Google Search Console.
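
A back-of-envelope sketch of that arithmetic; the share of budget spent on new URLs is an assumption, not a published Google number:

# Rough estimate of how long a new section takes to get a first crawl.
# share_for_new is a guess at how much of the daily budget goes to new URLs.
def days_to_crawl(new_pages: int, daily_budget: int, share_for_new: float = 0.5) -> float:
    return new_pages / (daily_budget * share_for_new)

print(days_to_crawl(10_000, 5_000))        # 4.0 days if half the budget goes to new URLs
print(days_to_crawl(10_000, 5_000, 1.0))   # 2.0 days in the best case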

Ignoring Log Analysis

Most teams focus on analytics (page views, conversions) and ignore crawler logs. But your logs tell you exactly which crawlers are hitting which pages, when, and how often.

# Quick analysis of which bots are crawling your site
# (assumes the default nginx "combined" log format: user agent is the last quoted field)
awk -F'"' '{print $6}' /var/log/nginx/access.log | \
  grep -oE "Googlebot|bingbot|AhrefsBot|YandexBot|Baiduspider|SemrushBot|DuckDuckBot|PetalBot" | \
  sort | uniq -c | sort -rn | head -20

Look for:

  • Bots you didn’t know were crawling you
  • Crawlers hitting pages that should be noindex’d
  • Sudden changes in crawl patterns (might indicate a problem or an opportunity)
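
To act on the first two points, you can cross-reference bot hits against your robots.txt rules. A rough sketch, assuming the default nginx combined log format and local copies of the log and robots.txt (both filenames are placeholders):

# Rough sketch: count bot hits per path from an nginx "combined" access log
# and flag requests to paths that robots.txt disallows. Filenames are placeholders.
import re
import urllib.robotparser
from collections import Counter

BOTS = ["Googlebot", "bingbot", "AhrefsBot", "YandexBot", "Baiduspider", "SemrushBot"]
line_re = re.compile(r'"(?:GET|POST|HEAD) (?P<path>\S+) HTTP/[^"]*".*"(?P<ua>[^"]*)"$')

rp = urllib.robotparser.RobotFileParser()
with open("robots.txt") as f:
    rp.parse(f.read().splitlines())

hits = Counter()
with open("access.log") as log:
    for line in log:
        m = line_re.search(line)
        if not m:
            continue
        bot = next((b for b in BOTS if b.lower() in m.group("ua").lower()), None)
        if bot:
            hits[(bot, m.group("path"))] += 1

for (bot, path), count in hits.most_common(20):
    note = "" if rp.can_fetch(bot, path) else "  <-- disallowed in robots.txt"
    print(f"{count:6d}  {bot:<12} {path}{note}")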

What Changed Recently (2024-2026)

The crawler landscape shifted significantly with AI’s impact on search and stricter data privacy rules:

Google AI Overviews changed what crawlers look for. Google’s AI Overviews (launched in 2024) generate summarized answers directly in search results. JSON-LD structured data and Schema.org markup became far more important, because structured data is one of the main signals these summary systems lean on when deciding how to represent your content.
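
If you want to confirm a page actually exposes structured data in the form crawlers extract, a quick check (the URL is a placeholder):

# Quick check: list the JSON-LD @type values a page exposes.
import json
import requests
from bs4 import BeautifulSoup

def jsonld_types(url: str) -> list:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    blocks = soup.find_all("script", type="application/ld+json")
    data = [json.loads(b.string) for b in blocks if b.string]
    return [d.get("@type", "unknown") for d in data if isinstance(d, dict)]

print(jsonld_types("https://example.com/page-url"))  # e.g. ['Article', 'BreadcrumbList']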

Cloudflare Bot Management and Turnstile became standard. Cloudflare’s bot management now blocks or challenges a significant portion of automated traffic. Ethical crawlers need to respect these signals — evading bot protection is both legally risky and technically fragile.

AI training bots proliferated. OpenAI’s GPTBot, Common Crawl’s CCBot, and similar crawlers began aggressively indexing the web for AI model training. By 2024, blocking AI bots in robots.txt became a common practice for sites concerned about their content being used to train commercial AI models:

# robots.txt additions to block AI training crawlers (2024+)
User-agent: GPTBot
Allow: /
Disallow: /private/
User-agent: ChatGPT-User
Allow: /
Disallow: /members/
User-agent: CCBot
Disallow: /

GDPR and CCPA enforcement matured. Cookie consent banners became ubiquitous, and crawlers increasingly need to handle them gracefully. Scraping personal data without a proper legal basis was already risky under GDPR, and the EU AI Act (2024) added further restrictions on data collected for AI training.

Signed Exchanges (SXG) gained traction. The Google-backed SXG format lets your pages be served from a third-party cache (such as Google’s) while remaining cryptographically attributable to your origin, which speeds up delivery from search results without giving up control of the content.

robots.txt Best Practices in 2026

Beyond the basics, here are the patterns that matter now:

# Comprehensive robots.txt for most sites (2026)
User-agent: *
Allow: /public/
Allow: /blog/
Allow: /products/
Disallow: /admin/
Disallow: /api/
Disallow: /checkout/
Disallow: /account/
Disallow: /search?q=
Disallow: /tag/
Disallow: /filter/

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Rate limit aggressive SEO crawlers
User-agent: AhrefsBot
Crawl-delay: 10

User-agent: SemrushBot
Crawl-delay: 10

# Keep sitemap
Sitemap: https://yoursite.com/sitemap.xml

# After deploying, check the robots.txt report in Google Search Console
# (under Settings) to confirm Google can fetch and parse this file

How to Build a Crawler (Responsibly)

If you’re building a crawler for your own site or a project, here’s the responsible approach:

import urllib.robotparser
from urllib.parse import urlparse
import requests
import time
from bs4 import BeautifulSoup
import json

class RespectfulCrawler:
    def __init__(self, user_agent, politeness_delay=1.0):
        self.user_agent = user_agent
        self.delay = politeness_delay
        self.last_request_time = 0

    def can_fetch(self, url, robots_txt_url):
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(robots_txt_url)
        rp.read()
        return rp.can_fetch(self.user_agent, url)

    def fetch_with_delay(self, url):
        robots_url = urlparse(url)._replace(path="/robots.txt", query="").geturl()
        if not self.can_fetch(url, robots_url):
            print(f"Blocked by robots.txt: {url}")
            return None

        elapsed = time.time() - self.last_request_time
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)

        headers = {"User-Agent": self.user_agent}
        response = requests.get(url, headers=headers, timeout=10)
        self.last_request_time = time.time()
        return response.text

    def extract_jsonld(self, html):
        """Extract JSON-LD structured data — preferred over text scraping"""
        soup = BeautifulSoup(html, "html.parser")
        scripts = soup.find_all("script", type="application/ld+json")
        return [json.loads(s.string) for s in scripts if s.string]

Rules to follow:

  1. Always check robots.txt before crawling
  2. Respect Crawl-delay if specified
  3. Set a reasonable request rate (don’t spin up 100 threads)
  4. Identify your crawler in the User-Agent header with contact info
  5. If you get blocked, stop and back off
  6. Use residential proxies if you’re crawling at scale — cloud provider IPs get blocked
  7. Extract JSON-LD structured data instead of scraping raw text when possible
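
Putting the class and the rules together, a usage sketch; the user-agent string, contact details, and URL are placeholders:

# Example usage of RespectfulCrawler above. UA and URL are placeholders;
# note the contact info in the user agent, per rule 4.
crawler = RespectfulCrawler(
    user_agent="MySiteAuditBot/1.0 (+https://example.com/bot; bot@example.com)",
    politeness_delay=2.0,
)

html = crawler.fetch_with_delay("https://example.com/products/widget")
if html:
    for block in crawler.extract_jsonld(html):
        print(block.get("@type") if isinstance(block, dict) else block)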

What This Means for You

If you’re running a site:

  • Check your robots.txt. Make sure you’re not accidentally blocking important crawlers.
  • Monitor your logs for crawler activity. Know who’s hitting your site.
  • If you need to block a specific crawler for load reasons, use rate limiting in your web server rather than blocking entirely.
  • Consider adding AI bot blocks if you don’t want your content used for AI training.

If you’re building something that crawls:

  • Be a good citizen. Check robots.txt, respect crawl delays, identify yourself in the User-Agent.
  • Your crawler will get blocked if you hammer servers. Slow down.
  • Use Playwright or Puppeteer for JavaScript-heavy sites, but check if the site has an API first.

The crawlers aren’t your enemy. They make search work. But understanding how they operate helps you make better decisions about your site’s visibility and security.


For more on making your site discoverable, the posts on sitemap configuration and noindex directives cover the other side of crawler interaction. For the broader SEO picture, the best practices for effective communication post covers how to structure technical content for both search engines and human readers.
