The development of generative AI models has brought numerous benefits, but it has also raised concerns about copyright, ethics, and the spread of misinformation. Against this backdrop, researchers have introduced something quite different: DarkBERT, a large-scale language model trained exclusively on data obtained from the dark web.
The objective of training DarkBERT on dark web data is to give the model a better grasp of the language and context specific to that environment. With this domain knowledge, it is intended to strengthen security research and help law enforcement agencies combat cybercrime more effectively.
The dark web is an intentionally hidden part of the internet that is not indexed by regular search engines like Google and requires specialized software, such as Tor, to access. Often associated with urban legends and tales of gruesome crimes, it is in practice dominated by fraud and data theft rather than the extreme violence often portrayed. Nonetheless, it has become a primary focus for law enforcement, as cybercriminal networks exploit its anonymity for covert communication.
DarkBERT appears to be built upon the RoBERTa architecture, a language model introduced by Facebook researchers in 2019, as per Tom's Hardware. RoBERTa was undertrained at its initial release, meaning its performance could still be improved with additional pre-training. The developers of DarkBERT therefore continued pre-training it on a vast corpus obtained by crawling the Tor network, adapting the model to the specific language used on the dark web. Pre-training also involved data filtering, deduplication, and ethical safeguards to address concerns about sensitive information contained in dark web texts.
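To illustrate the general idea, the sketch below shows what domain-adaptive pre-training of RoBERTa on a crawled text corpus could look like using the Hugging Face libraries. The corpus file name, hyperparameters, and training setup are illustrative assumptions, not the researchers' actual pipeline.

```python
# Minimal sketch: continued masked-language-model pre-training of RoBERTa
# on a domain-specific corpus. "darkweb_corpus.txt" is a hypothetical file
# of already filtered and deduplicated pages, one document per line.
from transformers import (
    RobertaTokenizerFast,
    RobertaForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# Load the plain-text corpus (placeholder path).
dataset = load_dataset("text", data_files={"train": "darkweb_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard masked-language-modeling objective: randomly mask 15% of tokens.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="darkbert-sketch",        # illustrative values only
    per_device_train_batch_size=8,
    num_train_epochs=1,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```

Continued pre-training like this keeps the original architecture and vocabulary but shifts the model's internal statistics toward the jargon, slang, and structure of the new domain.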
The researchers envision DarkBERT as a potent tool for scanning the dark web to identify cybersecurity threats and monitor forums for malicious activities. Although DarkBERT is not intended for public release, academic requests for access may be considered.
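As a purely hypothetical illustration of such downstream use, a classifier fine-tuned from a dark-web language model could score forum posts for signs of malicious activity. The checkpoint name below is a placeholder, since DarkBERT itself is not publicly available.

```python
# Hypothetical downstream use: flagging forum posts with a fine-tuned
# sequence classifier. The model identifier is a placeholder, not a real release.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="your-org/darkweb-threat-classifier",  # placeholder checkpoint
)

posts = [
    "Selling fresh database dump, 2M records, escrow accepted.",
    "Looking for recommendations on privacy-focused email providers.",
]

# Print each post alongside its predicted label and confidence score.
for post, result in zip(posts, classifier(posts)):
    print(f"{result['label']:>10} ({result['score']:.2f})  {post}")
```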
Over the coming months, the researchers plan to improve the performance of dark web domain-specific pre-trained language models by adopting more advanced architectures. They also aim to gather additional data in order to build multilingual versions of the model. Expanding the scope of DarkBERT's crawling will further broaden its usefulness across applications.