BERT-Based Approaches to Identifying Malicious URLs

Abstract:

Malicious uniform resource locators (URLs) are prevalent in cyberattacks, particularly in phishing attempts aimed at stealing sensitive information or distributing malware, so detecting them accurately is of paramount importance. Prior research has explored deep-learning models for this task: URL strings are segmented into character-level or word-level tokens, the tokens are embedded, and a trained model then differentiates malicious URLs from legitimate ones. In this study, a model based on Bidirectional Encoder Representations from Transformers (BERT) was devised to tokenize URL strings, employing its self-attention mechanism to capture correlations among tokens. A classifier was then used to determine whether a given URL was malicious. The proposed methods were evaluated on three different types of public datasets: a dataset consisting solely of URL strings from Kaggle, a dataset containing only URL features from GitHub, and a dataset including both types of data from the University of New Brunswick, namely ISCX 2016. The proposed system achieved accuracy rates of 98.78%, 96.71%, and 99.98% on the three datasets, respectively. Additionally, experiments were conducted on two datasets from different domains, the Internet of Things (IoT) and Domain Name System over HTTPS (DoH), to demonstrate the versatility of the proposed model.
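The abstract does not describe the tokenizer's vocabulary or preprocessing. As a rough illustration of how a BERT-style tokenizer might break a URL into subword tokens, the sketch below assumes the WordPiece scheme that BERT uses (greedy longest-match-first, with `##` marking continuation pieces). The vocabulary `VOCAB` and the helper names `split_url`, `wordpiece`, and `tokenize_url` are hypothetical and chosen only for this example; they are not taken from the study.

```python
import re

# Hypothetical toy vocabulary for illustration only; a real BERT
# vocabulary contains tens of thousands of entries.
VOCAB = {"[UNK]", "[CLS]", "[SEP]", "http", "https", "www", "com", "login",
         "pay", "##pal", "secure", "##ly", ":", "/", ".", "-"}


def split_url(url):
    """Lowercase the URL and split it into alphanumeric runs and single punctuation marks."""
    return re.findall(r"[a-z0-9]+|[^a-z0-9\s]", url.lower())


def wordpiece(word, vocab=VOCAB):
    """Greedy longest-match-first subword split; non-initial pieces carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no matching piece: treat the whole word as unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize_url(url):
    """Wrap the subword tokens in the [CLS] and [SEP] markers that BERT expects."""
    tokens = ["[CLS]"]
    for word in split_url(url):
        tokens.extend(wordpiece(word))
    tokens.append("[SEP]")
    return tokens


print(tokenize_url("http://paypal.com/login"))
```

In a full pipeline, the resulting tokens would be mapped to integer IDs and passed through the BERT encoder, whose output feeds the classifier; those stages are omitted here because the abstract gives no details about them.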
