Abstract
Web crawlers are an indispensable tool for collecting research data. However, they may be blocked by servers for various reasons. This can reduce their coverage. In this early-stage work, we investigate server-side blocks encountered by Common Crawl (CC). We analyze page contents to cover a broader range of refusals than previous work. We construct fine-grained regular expressions to identify refusal pages with precision, finding that at least 1.68% of sites in a CC snapshot exhibit a form of explicit refusal. Significant contributors include large hosters. Our analysis categorizes the forms of refusal messages, from straight blocks to challenges and rate-limiting responses. We are able to extract the reasons for nearly half of the refusals we identify. We find an inconsistent and even incorrect use of HTTP status codes to indicate refusals. Examining the temporal dynamics of refusals, we find that most blocks resolve within one hour, but also that 80% of refusing domains block every request by CC. Our results show that website blocks deserve more attention as they have a relevant impact on crawling projects. We also conclude that standardization to signal refusals would be beneficial for both site operators and web crawlers.
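As an illustration of the content-based approach the abstract describes, the following is a minimal sketch of how page bodies might be sorted into refusal categories (straight blocks, challenges, rate-limiting responses) with regular expressions. The patterns, the category names, and the function `classify_refusal` are hypothetical and chosen for illustration only; they are not the expressions used in the paper.

```python
import re
from typing import Optional

# Hypothetical patterns for illustration only; the paper's actual
# expressions are finer-grained and are not reproduced here.
REFUSAL_PATTERNS = {
    "block": re.compile(
        r"\b(access denied|you have been blocked|403 forbidden)\b", re.I
    ),
    "challenge": re.compile(
        r"\b(verify you are (a )?human|complete the captcha|checking your browser)\b",
        re.I,
    ),
    "rate_limit": re.compile(
        r"\b(too many requests|rate limit exceeded)\b", re.I
    ),
}


def classify_refusal(body: str) -> Optional[str]:
    """Return the first refusal category whose pattern matches, else None."""
    for category, pattern in REFUSAL_PATTERNS.items():
        if pattern.search(body):
            return category
    return None
```

Matching on page content rather than on the HTTP status code alone is what lets such a classifier catch refusals that are served with a misleading status, which the abstract reports as a common occurrence.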
| Original language | English |
|---|---|
| Title of host publication | Passive and Active Measurement |
| Subtitle of host publication | 26th International Conference, PAM 2025, Virtual Event, March 10–12, 2025, Proceedings |
| Editors | Cecilia Testart, Roland van Rijswijk-Deij, Burkhard Stiller |
| Place of Publication | Cham |
| Publisher | Springer |
| Pages | 197–214 |
| Number of pages | 18 |
| ISBN (Electronic) | 978-3-031-85960-1 |
| ISBN (Print) | 978-3-031-85959-5 |
| Publication status | Published - 7 Mar 2025 |
| Event | 26th International Conference on Passive and Active Measurement, PAM 2025 (Virtual, Online; 10 Mar 2025 → 12 Mar 2025; conference number 26; https://udesa.edu.ar/pam25) |
Publication series
| Name | Lecture Notes in Computer Science |
|---|---|
| Publisher | Springer |
| Volume | 15567 |
| ISSN (Print) | 0302-9743 |
| ISSN (Electronic) | 1611-3349 |
Conference
| Conference | 26th International Conference on Passive and Active Measurement, PAM 2025 |
|---|---|
| Abbreviated title | PAM |
| City | Virtual, Online |
| Period | 10/03/25 → 12/03/25 |
Keywords
- 2025 OA procedure
- Web crawling
- Common Crawl
- Server-side blocking