PG-19: A Dataset for Long-Form Language Modeling

License: MIT
Model Type: Other
PG-19 is a dataset curated by DeepMind, consisting of books from the Project Gutenberg archive that were published before 1919 and are therefore out of copyright. It was designed to support the development and evaluation of long-form language models, with documents far longer than those in typical benchmark datasets, making it well suited to training and evaluating models on long, coherent passages of text.

Key Features

  • Contains over 28,000 books suitable for long-context language modeling
  • Books average roughly 69,000 words each, far longer than the documents in typical language-modeling datasets
  • Offers a challenging benchmark for models to learn long-range dependencies
  • Can be used for pretraining, evaluating language model coherence, and fine-tuning
  • Cleaned and filtered to remove modern works or those with incomplete metadata
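To use books this long for pretraining, a common preprocessing step is to split each document into fixed-length training windows. The sketch below is illustrative only: the whitespace "tokenizer", window size, and stride are assumptions for demonstration, not part of the PG-19 release, which provides raw text and leaves tokenization to the user.

```python
def chunk_book(text: str, window: int = 512, stride: int = 512) -> list[list[str]]:
    """Split one book's text into fixed-length word windows.

    A stride equal to the window gives contiguous, non-overlapping
    chunks; a smaller stride gives overlapping chunks. The whitespace
    split is a stand-in for a real subword tokenizer.
    """
    tokens = text.split()
    return [tokens[i:i + window] for i in range(0, len(tokens), stride)]

# Stand-in for a single PG-19 book (real books average ~69,000 words).
book = "word " * 2000
chunks = chunk_book(book, window=512, stride=512)
print(len(chunks), len(chunks[0]))  # 4 windows; the first holds 512 words
```

With a real subword tokenizer the same pattern applies; the final, shorter window is typically padded or dropped depending on the training setup.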