PG-19: A Dataset for Long-Form Language Modeling

License: MIT
Model Type: Other
PG-19 is a dataset curated by DeepMind, consisting of books from the Project Gutenberg archive that were published before 1919 and are therefore out of copyright. It was designed to support the development and evaluation of long-form language models, with documents far longer than those in typical benchmark datasets, making it well suited to training and evaluating models on long, coherent passages of text.

Key Features

  • Contains over 28,000 books suitable for long-context language modeling
  • Books average roughly 69,000 words each, far longer than the documents in typical language-modeling datasets
  • Offers a challenging benchmark for models to learn long-range dependencies
  • Can be used for pretraining, evaluating language model coherence, and fine-tuning
  • Cleaned and filtered to remove modern works or those with incomplete metadata
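To use books this long for pretraining, a common preprocessing step is to split each document into fixed-length training windows. The sketch below is illustrative only: the whitespace "tokenizer", window size, and stride are assumptions for demonstration, not part of the PG-19 release, which provides raw text and leaves tokenization to the user.

```python
def chunk_book(text: str, window: int = 512, stride: int = 512) -> list[list[str]]:
    """Split one book's text into fixed-length word windows.

    A stride equal to the window gives contiguous, non-overlapping
    chunks; a smaller stride gives overlapping chunks. The whitespace
    split is a stand-in for a real subword tokenizer.
    """
    tokens = text.split()
    return [tokens[i:i + window] for i in range(0, len(tokens), stride)]

# Stand-in for a single PG-19 book (real books average ~69,000 words).
book = "word " * 2000
chunks = chunk_book(book, window=512, stride=512)
print(len(chunks), len(chunks[0]))  # 4 windows; the first holds 512 words
```

With a real subword tokenizer the same pattern applies; the final, shorter window is typically padded or dropped depending on the training setup.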