Operational Domain Name Classification: From Automatic Ground Truth Generation to Adaptation to Missing Values

Jan Bayer*, Ben Chukwuemeka Benjamin, Sourena Maroofi, Thymen Wabeke, Cristian Hesselman, Andrzej Duda, Maciej Korczyński

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > Academic > peer-review


Abstract

With more than 350 million active domain names and at least 200,000 newly registered domains per day, it is technically and economically challenging for Internet intermediaries involved in domain registration and hosting to monitor them and accurately assess whether they are benign, likely registered with malicious intent, or have been compromised. This observation motivates the design and deployment of automated approaches to support investigators in preventing or effectively mitigating security threats. However, building a domain name classification system suitable for deployment in an operational environment requires meticulous design: from feature engineering and acquiring the underlying data to handling missing values resulting from, for example, data collection errors. The design flaws in some of the existing systems make them unsuitable for such usage despite their high theoretical accuracy. Even worse, they may lead to erroneous decisions, for example, by registrars, such as suspending a benign domain name that has been compromised at the website level, causing collateral damage to the legitimate registrant and website visitors. In this paper, we propose novel approaches to designing domain name classifiers that overcome the shortcomings of some existing systems. We validate these approaches with a prototype based on the COMAR (COmpromised versus MAliciously Registered domains) system focusing on its careful design, automated and reliable ground truth generation, feature selection, and the analysis of the extent of missing values. First, our classifier takes advantage of automatically generated ground truth based on publicly available domain name registration data. We then generate a large number of machine-learning models, each dedicated to handling a set of missing features: if we need to classify a domain name with a given set of missing values, we use the model without the missing feature set, thus allowing classification based on all other features. 
We estimate the importance of features using scatter plots and analyze the extent of missing values due to measurement errors. Finally, we apply the COMAR classifier to unlabeled phishing URLs and find, among other things, that 73% of corresponding domain names are maliciously registered. In comparison, only 27% are benign domains hosting malicious websites. The proposed system has been deployed at two ccTLD registry operators to support their anti-fraud practices.
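The per-pattern model selection described in the abstract can be illustrated with a minimal sketch. It is not the COMAR implementation: the nearest-centroid classifier is a simple stand-in for whatever learner the authors use, and the feature names (`domain_age_days`, `indexed_pages`), the toy data, and the `max_missing` parameter are hypothetical. The idea shown is only this: train one model per subset of features, then classify a sample with the model trained on exactly the features that the sample has.

```python
from itertools import combinations


def train_centroids(rows, labels, features):
    """Stand-in learner: per-class mean of each feature (nearest centroid)."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, {f: 0.0 for f in features})
        for f in features:
            acc[f] += row[f]
        counts[label] = counts.get(label, 0) + 1
    return {
        label: {f: acc[f] / counts[label] for f in features}
        for label, acc in sums.items()
    }


def train_model_family(rows, labels, all_features, max_missing=1):
    """Train one model for every way of dropping up to max_missing features."""
    family = {}
    for k in range(max_missing + 1):
        for dropped in combinations(all_features, k):
            kept = frozenset(all_features) - frozenset(dropped)
            if kept:  # a model needs at least one feature
                family[kept] = train_centroids(rows, labels, sorted(kept))
    return family


def classify(family, sample):
    """Use the model trained without the features missing from this sample."""
    available = frozenset(f for f, v in sample.items() if v is not None)
    if available not in family:
        raise KeyError(f"no model trained for features {sorted(available)}")
    model = family[available]

    def squared_distance(centroid):
        return sum((sample[f] - centroid[f]) ** 2 for f in available)

    return min(model, key=lambda label: squared_distance(model[label]))


# Hypothetical toy ground truth (complete rows only).
rows = [
    {"domain_age_days": 2, "indexed_pages": 1},
    {"domain_age_days": 5, "indexed_pages": 3},
    {"domain_age_days": 900, "indexed_pages": 120},
    {"domain_age_days": 1500, "indexed_pages": 80},
]
labels = ["malicious", "malicious", "compromised", "compromised"]

family = train_model_family(rows, labels, ["domain_age_days", "indexed_pages"])

# The first sample lacks the domain age, the second lacks the page count.
no_age = classify(family, {"domain_age_days": None, "indexed_pages": 100})
no_pages = classify(family, {"domain_age_days": 3, "indexed_pages": None})
```

The number of models grows combinatorially with the number of features that may be missing, so a bound such as the hypothetical `max_missing` is one way to keep the family small. Unlike imputation, no value is invented for a missing feature: the decision rests only on features that were actually measured.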

Original language: English
Title of host publication: Passive and Active Measurement - 24th International Conference, PAM 2023, Proceedings
Editors: Anna Brunstrom, Marcel Flores, Marco Fiore
Publisher: Springer
Pages: 564-591
Number of pages: 28
ISBN (Print): 9783031284854
Publication status: Published - 10 Mar 2023
Event: 24th International Conference on Passive and Active Measurement, PAM 2023 - Virtual, Online
Duration: 21 Mar 2023 → 23 Mar 2023
Conference number: 24

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13882 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 24th International Conference on Passive and Active Measurement, PAM 2023
Abbreviated title: PAM 2023
City: Virtual, Online
Period: 21/03/23 → 23/03/23

Keywords

  • 2024 OA procedure
  • Compromised websites
  • DNS
  • Domain name abuse
  • Malicious domain registration
  • Phishing
  • Classification
