{
  "type": "module",
  "source": "doc/api/best-practices-crawling.md",
  "modules": [
    {
      "textRaw": "Crawling",
      "name": "crawling",
      "type": "module",
      "desc": "<p><a href=\"https://datatracker.ietf.org/doc/html/rfc9309\">RFC 9309</a> defines crawlers as automated clients.</p>\n<p>Some web servers may reject requests that omit the <code>User-Agent</code> header or that use common defaults such as <code>'curl/7.79.1'</code>.</p>\n<p>In <strong>undici</strong>, the default user agent is <code>'undici'</code>. Since undici is integrated into Node.js core as the implementation of <code>fetch()</code>, requests made via <code>fetch()</code> use <code>'node'</code> as the default user agent.</p>\n<p>When implementing a crawler, specify a <strong>custom <code>User-Agent</code> header</strong>. A descriptive user agent lets servers correctly identify the client and reduces the likelihood of requests being denied.</p>\n<p>A user agent string should include sufficient detail to identify the crawler and provide contact information. For example:</p>\n<pre><code>AcmeCo Crawler - acme.co - contact@acme.co\n</code></pre>\n<p>When adding contact details, avoid personal identifiers such as your own name or a private email address, especially in a professional or employment context. Instead, use a role-based or organizational contact (e.g., <a href=\"mailto:crawler-team@company.com\">crawler-team@company.com</a>) to protect individual privacy while still enabling communication.</p>\n<p>If a crawler behaves unexpectedly (for example, due to misconfiguration or implementation errors), server administrators can use the information in the user agent to contact the operator and coordinate an appropriate resolution.</p>\n<p>The <code>User-Agent</code> header can be set on individual requests or applied globally by configuring a custom dispatcher.</p>\n<p><strong>Example: setting a <code>User-Agent</code> per request</strong></p>\n<pre><code class=\"language-js\">import { fetch } from 'undici'\n\n// Identify the crawler and provide a contact address\nconst headers = {\n  'User-Agent': 'AcmeCo Crawler - acme.co - contact@acme.co'\n}\n\nconst res = await fetch('https://example.com', { headers })\n</code></pre>",
      "modules": [
        {
          "textRaw": "Best Practices for Crawlers",
          "name": "best_practices_for_crawlers",
          "type": "module",
          "desc": "<p>When developing a crawler, the following practices are recommended in addition to setting a descriptive <code>User-Agent</code> header:</p>\n<ul>\n<li>\n<p><strong>Respect <code>robots.txt</code></strong>\nFollow the directives defined in the target site’s <code>robots.txt</code> file, including disallowed paths and optional crawl-delay settings (see <a href=\"https://www.w3.org/wiki/Write_Web_Crawler\">W3C guidelines</a>).</p>\n</li>\n<li>\n<p><strong>Rate limiting</strong>\nRegulate request frequency to avoid imposing excessive load on servers. Introduce delays between requests or limit the number of concurrent requests. The W3C suggests at least one second between requests.</p>\n</li>\n<li>\n<p><strong>Error handling</strong>\nImplement retry logic with exponential backoff for transient failures, and stop sending requests when errors persist or when the server refuses or throttles the crawler (e.g., HTTP 403 or 429).</p>\n</li>\n<li>\n<p><strong>Monitoring and logging</strong>\nTrack request volume, response codes, and error rates to detect misbehavior and address issues proactively.</p>\n</li>\n<li>\n<p><strong>Contact information</strong>\nAlways include valid and current contact details in the <code>User-Agent</code> string so that administrators can reach the crawler operator if necessary.</p>\n</li>\n</ul>",
          "displayName": "Best Practices for Crawlers"
        },
        {
          "textRaw": "References and Further Reading",
          "name": "references_and_further_reading",
          "type": "module",
          "desc": "<ul>\n<li><a href=\"https://datatracker.ietf.org/doc/html/rfc9309\">RFC 9309: The Robots Exclusion Protocol</a></li>\n<li><a href=\"https://www.w3.org/wiki/Write_Web_Crawler\">W3C Wiki: Write Web Crawler</a></li>\n<li><a href=\"https://archives.iw3c2.org/www2010/proceedings/www/p1101.pdf\">Ethical Web Crawling (WWW 2010 Conference Paper)</a></li>\n</ul>",
          "displayName": "References and Further Reading"
        }
      ],
      "displayName": "Crawling"
    }
  ]
}