PG-19 is a dataset curated by DeepMind, consisting of books from the Project Gutenberg archive that were published before 1919 and are therefore out of copyright. It was designed to support the development and evaluation of long-range language models, with texts substantially longer than those found in typical benchmark datasets, making it well suited for training models to understand and generate coherent long passages of text.
## Key Features
- Contains 28,752 books suitable for long-context language modeling
- Books average roughly 69,000 words each, far longer than the documents found in typical benchmark datasets
- Offers a challenging benchmark for models to learn long-range dependencies
- Can be used for pretraining, evaluating the long-range coherence of language models, and fine-tuning (see the loading sketch after this list)
- Cleaned and filtered to remove modern works or those with incomplete metadata
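
Because each document is a full book, it is usually more practical to stream the corpus than to download it in one shot. The following is a minimal sketch using the Hugging Face `datasets` library; the repository id `deepmind/pg19` and the column names `text` and `short_book_title` are assumptions about the Hub mirror rather than guarantees from the official DeepMind release, so adjust them to match whatever mirror you use.

```python
# Minimal sketch: stream PG-19 via the Hugging Face `datasets` library.
# The repo id "deepmind/pg19" and the column names "text" and
# "short_book_title" are assumptions about the Hub mirror, not part of
# the official DeepMind release.
from datasets import load_dataset

# Stream so the ~28K books are fetched lazily instead of downloaded up front.
books = load_dataset("deepmind/pg19", split="train", streaming=True)

# Inspect the first few books and check they are long enough for
# long-context experiments.
for book in books.take(3):
    n_words = len(book["text"].split())
    print(f'{book["short_book_title"]}: ~{n_words:,} words')
```

Streaming matters here because book-length documents make PG-19 far larger than most language modeling benchmarks, and the first examples become available before any full download completes.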