Datasets
- WebText (GPT-2 training set; pages scraped from outbound Reddit links with at least 3 karma; not publicly released)
- OpenWebText (open-source reproduction of WebText)
- CommonCrawl (broad public crawl of the web, not a curated or uniformly random sample; extremely noisy)
- SlimPajama (cleaned, deduplicated version of RedPajama; mostly CommonCrawl text plus other sources such as C4, GitHub, and Wikipedia)
- Hugging Face FineWeb (derived from CommonCrawl; filtered for quality and deduplicated)
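
Toy sketch of what "karma threshold" and "filtered for quality" can mean in practice. Everything here is an assumption for illustration: the function names (`keep_link`, `looks_reasonable`, `filter_and_dedup`) and the thresholds are hypothetical, and the real WebText, SlimPajama, and FineWeb pipelines are considerably more elaborate than this.

```python
import hashlib


def keep_link(karma, min_karma=3):
    """WebText-style link filter: keep links with at least `min_karma` karma."""
    return karma >= min_karma


def looks_reasonable(text, min_words=5, min_alpha_ratio=0.6):
    """Hypothetical quality heuristic: enough words, mostly letters and spaces."""
    words = text.split()
    if len(words) < min_words:
        return False
    # Share of characters that are letters or whitespace.
    alpha = sum(ch.isalpha() or ch.isspace() for ch in text)
    return alpha / len(text) >= min_alpha_ratio


def filter_and_dedup(docs):
    """Drop low-quality documents, then drop exact duplicates (case-insensitive)."""
    seen = set()
    kept = []
    for doc in docs:
        if not looks_reasonable(doc):
            continue
        # Hash a normalized copy so trivially different duplicates collide.
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick brown fox jumps over the lazy dog.",  # duplicate after lowercasing
    "$$$ ### 1234 !!! %%% 5678",                     # mostly symbols and digits
    "too short",
]
cleaned = filter_and_dedup(docs)
```

With the sample `docs` above, only the first sentence survives: the second is a duplicate after normalization, the third fails the letter-ratio check, and the fourth is too short.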