RedPajama has set a new benchmark with the release of RedPajama-Data-v2, a colossal dataset that is poised to revolutionize the training of large language models (LLMs).
This monumental dataset, which is now publicly available, contains an astonishing 30 trillion filtered and deduplicated tokens, sourced from 84 CommonCrawl dumps covering five languages: English, French, Spanish, German, and Italian. The dataset offers a solid foundation for advancing open LLMs such as Llama, Mistral, Falcon, and MPT.
RedPajama-Data-v2 represents the largest public dataset released specifically for LLM training, according to the information available on Together.AI’s blog. With over 40 pre-computed data quality annotations, RedPajama-Data-v2 provides the community with tools for further filtering and weighting, allowing for the creation of highly refined datasets for LLM training.
The journey to RedPajama-Data-v2 began with the release of RedPajama-1T in March, which consisted of 1 trillion high-quality English tokens. This dataset was downloaded more than 190,000 times, sparking the creation of numerous new language models. RedPajama-Data-v2 builds on this foundation, offering a more comprehensive and diverse dataset that includes over 100 billion text documents with 100+ trillion raw tokens.
The dataset is designed to ease the burden of processing and filtering raw data from CommonCrawl, a task that is often laborious, time-consuming, energy-intensive, and expensive. RedPajama-Data-v2 provides a base from which high-quality datasets for LLM training can be extracted, and it facilitates thorough research on LLM training data.
One of the key features of RedPajama-Data-v2 is the inclusion of 40+ quality annotations. They comprise the outputs of several machine learning classifiers that score data quality, minhash results for fuzzy deduplication, and heuristics such as “the fraction of words that contain no alphabetical character.” With these annotations, LLM developers can easily slice and filter the data, combining the signals into their own data quality pipeline to create a custom pre-training dataset.
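To make the quoted heuristic concrete, here is a minimal Python sketch of how such a signal could be computed and used as a filter. It is an illustration only, not the RedPajama implementation: the function names `frac_no_alpha_words` and `keep_document`, the whitespace tokenization, and the 0.3 threshold are all assumptions chosen for this example.

```python
def frac_no_alpha_words(text: str) -> float:
    """Fraction of whitespace-separated words with no alphabetical character.

    Hypothetical helper: the real pipeline may tokenize differently.
    """
    words = text.split()
    if not words:
        return 0.0
    no_alpha = sum(1 for w in words if not any(ch.isalpha() for ch in w))
    return no_alpha / len(words)


def keep_document(text: str, max_frac: float = 0.3) -> bool:
    """Keep a document if the heuristic stays at or below the threshold.

    The 0.3 default is an arbitrary value chosen for illustration.
    """
    return frac_no_alpha_words(text) <= max_frac


docs = [
    "The quick brown fox jumps over the lazy dog.",
    "12 34 56 -- ## 78 ok",
]

# Keep only documents that pass the heuristic filter.
kept = [d for d in docs if keep_document(d)]
```

In practice a pipeline would likely combine several such signals, each with its own threshold or weight, rather than rely on a single rule.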
For those interested in using RedPajama-Data-v2, all data processing scripts are open source and available on GitHub, and all data are available on HuggingFace. The dataset includes over 100 billion text documents drawn from 84 CommonCrawl snapshots and processed using the CCNet pipeline. Of these, 30 billion documents additionally come with quality signals, and 20 billion documents are deduplicated.
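For readers unfamiliar with the minhash technique behind fuzzy deduplication, the following Python sketch shows the general idea: each document is reduced to a short signature, and the fraction of matching signature positions estimates the Jaccard similarity of the documents' word shingles. This is a generic textbook-style sketch, not the code used to build the dataset. The helper names, the 3-word shingles, the 64 hash seeds, and the use of BLAKE2b as the hash function are all assumptions made for illustration.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set:
    """Set of lowercase n-word shingles (short texts yield one shingle)."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _seeded_hash(seed: int, shingle: str) -> int:
    """64-bit hash of a shingle; each seed acts as a distinct hash function."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text: str, num_hashes: int = 64) -> list:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    sh = shingles(text)
    return [min(_seeded_hash(seed, s) for s in sh) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


doc_a = "the dataset is sourced from many web crawl snapshots in five languages"
doc_b = "the dataset is sourced from many web crawl snapshots in several languages"
doc_c = "completely unrelated sentence about cooking pasta with fresh tomato sauce"

sig_a = minhash_signature(doc_a)
sig_b = minhash_signature(doc_b)
sig_c = minhash_signature(doc_c)

near_dup_score = estimated_jaccard(sig_a, sig_b)   # high for near-duplicates
unrelated_score = estimated_jaccard(sig_a, sig_c)  # low for unrelated text
```

A deduplication pass would then treat pairs whose estimated similarity exceeds a chosen threshold as near-duplicates. At the scale of billions of documents, signatures are typically bucketed with locality-sensitive hashing so that not every pair has to be compared.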