Web Crawl Refusals: Insights From Common Crawl

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

Abstract

Web crawlers are an indispensable tool for collecting research data. However, they may be blocked by servers for various reasons. This can reduce their coverage. In this early-stage work, we investigate server-side blocks encountered by Common Crawl (CC). We analyze page contents to cover a broader range of refusals than previous work. We construct fine-grained regular expressions to identify refusal pages with precision, finding that at least 1.68% of sites in a CC snapshot exhibit a form of explicit refusal. Significant contributors include large hosters. Our analysis categorizes the forms of refusal messages, from straight blocks to challenges and rate-limiting responses. We are able to extract the reasons for nearly half of the refusals we identify. We find an inconsistent and even incorrect use of HTTP status codes to indicate refusals. Examining the temporal dynamics of refusals, we find that most blocks resolve within one hour, but also that 80% of refusing domains block every request by CC. Our results show that website blocks deserve more attention as they have a relevant impact on crawling projects. We also conclude that standardization to signal refusals would be beneficial for both site operators and web crawlers.
Original language: English
Title of host publication: Passive and Active Measurement
Subtitle of host publication: 26th International Conference, PAM 2025, Virtual Event, March 10–12, 2025, Proceedings
Editors: Cecilia Testart, Roland van Rijswijk-Deij, Burkhard Stiller
Place of Publication: Cham
Publisher: Springer
Pages: 197–214
Number of pages: 18
ISBN (Electronic): 978-3-031-85960-1
ISBN (Print): 978-3-031-85959-5
Publication status: Published - 7 Mar 2025
Event: 26th International Conference on Passive and Active Measurement, PAM 2025 - Virtual, Virtual, Online
Duration: 10 Mar 2025 to 12 Mar 2025
Conference number: 26
https://udesa.edu.ar/pam25

Publication series

Name: Lecture Notes in Computer Science
Publisher: Springer
Volume: 15567
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 26th International Conference on Passive and Active Measurement, PAM 2025
Abbreviated title: PAM
City: Virtual, Online
Period: 10/03/25 to 12/03/25

Keywords

  • 2025 OA procedure
  • Web crawling
  • Common Crawl
  • Server-side blocking
