RedPajama-Data-v2: Transforming Language Model Training with 30 Trillion Open-Source Tokens



RedPajama has set a new benchmark with the release of RedPajama-Data-v2, a colossal dataset that is poised to revolutionize the training of large language models (LLMs).

This monumental dataset, which is now publicly available, contains an astonishing 30 trillion filtered and deduplicated tokens, sourced from 84 CommonCrawl dumps covering five languages: English, French, Spanish, German, and Italian. The dataset offers a solid foundation for advancing open LLMs such as Llama, Mistral, Falcon, and MPT.

RedPajama-Data-v2 is the largest public dataset released specifically for LLM training, according to Together.AI's blog. With over 40 pre-computed data quality annotations, RedPajama-Data-v2 gives the community tools for further filtering and weighting, allowing highly refined datasets to be built for LLM training.

The journey to RedPajama-Data-v2 began with the release of RedPajama-1T in March, which consisted of 1 trillion high-quality English tokens. This dataset was downloaded more than 190,000 times, sparking the creation of numerous new language models. RedPajama-Data-v2 builds on this foundation, offering a more comprehensive and diverse dataset that includes over 100 billion text documents with 100+ trillion raw tokens.

The dataset is designed to ease the burden of processing and filtering raw data from CommonCrawl, a task that is often laborious, time-consuming, energy-intensive, and expensive. RedPajama-Data-v2 provides a base from which high-quality datasets for LLM training can be extracted, and it facilitates thorough research on LLM training data.

One of the key features of RedPajama-Data-v2 is the inclusion of 40+ quality annotations. These come from machine learning classifiers that score data quality, MinHash results for fuzzy deduplication, and heuristics such as "the fraction of words that contain no alphabetical character." The annotations let LLM developers easily slice and filter the data, combining them into their own data quality pipeline to create a custom pre-training dataset.
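To make the heuristic annotations concrete, here is a minimal sketch of the "fraction of words with no alphabetical character" signal and a filter built on it. The function names and the 20% threshold are illustrative assumptions, not the actual RedPajama pipeline:

```python
import re

def frac_non_alpha_words(text: str) -> float:
    """Fraction of whitespace-delimited words containing no alphabetical character."""
    words = text.split()
    if not words:
        return 0.0
    non_alpha = sum(1 for w in words if not re.search(r"[a-zA-Z]", w))
    return non_alpha / len(words)

def passes_quality_filter(text: str, max_frac: float = 0.2) -> bool:
    # Hypothetical threshold: reject documents where more than 20% of the
    # words contain no letters (e.g. bare numbers, prices, punctuation runs).
    return frac_non_alpha_words(text) <= max_frac

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "123 456 789 $$$ %%% 000 111 222 333 !!!",
]
kept = [d for d in docs if passes_quality_filter(d)]
```

Because signals like this are pre-computed and shipped alongside the documents, a developer can filter on them directly instead of re-running the heuristics over 100+ trillion raw tokens.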

For those interested in utilizing RedPajama-Data-v2, all data processing scripts are open source and available on GitHub, and all data are available on HuggingFace. The dataset includes over 100 billion text documents sourced from 84 CommonCrawl snapshots and processed with the CCNet pipeline. Of these, 30 billion documents additionally come with quality signals, and 20 billion documents are deduplicated.
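The fuzzy deduplication behind those 20 billion deduplicated documents is based on MinHash. The toy sketch below shows the core idea, estimating Jaccard similarity between documents from compact signatures; it is illustrative only, not the RedPajama implementation, which runs at a vastly larger scale:

```python
import hashlib

def shingles(text: str, k: int = 3) -> set:
    """Set of k-word shingles (overlapping word n-grams) from a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def minhash_signature(text: str, num_hashes: int = 64) -> list:
    """For each of num_hashes seeded hash functions, keep the minimum
    hash value over the document's shingles."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text)
        ))
    return sig

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature positions approximates the
    Jaccard similarity of the two shingle sets."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

near_dupe_a = "the cat sat on the mat near the door"
near_dupe_b = "the cat sat on the mat near the window"
unrelated = "completely different text about language models"
```

Near-duplicate documents share most shingles, so their signatures agree in most positions, letting a pipeline flag them for removal without comparing full texts pairwise.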

Vishak is Editor-in-chief at Code and Hack, with a passion for AI and coding. He has a deep understanding of the latest trends and advancements in both fields, and creates engaging, informative content on topics including machine learning, natural language processing, and coding. He stays up to date with the latest news and breakthroughs in these areas and delivers insightful articles that keep his readers informed and engaged.
