Datasets

  • WebText (GPT-2 training set; pages crawled from outbound Reddit links with at least 3 karma)
  • OpenWebText (open-source reproduction of WebText)
  • CommonCrawl (broad, largely unfiltered crawl of the public web; extremely noisy)
  • SlimPajama (cleaned, deduplicated version of RedPajama, which is drawn largely from CommonCrawl)
  • HuggingFace FineWeb (subset of CommonCrawl, filtered for quality)
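
The WebText selection rule (keep pages linked from Reddit submissions that earned enough karma) can be sketched as a small filter. This is only an illustration, not the actual GPT-2 pipeline: the `Submission` record, the `select_outbound_links` function, and the `min_karma` default of 3 are assumptions made for the example.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Submission:
    """Hypothetical record for one Reddit link submission."""
    url: str
    karma: int


def select_outbound_links(submissions, min_karma=3):
    """Return unique outbound URLs whose submission karma is >= min_karma.

    The threshold is an assumed default; adjust it to match the rule you
    want to reproduce.
    """
    seen = set()
    selected = []
    for sub in submissions:
        host = urlparse(sub.url).netloc.lower()
        # Skip entries without a host (e.g. text-only posts).
        if not host:
            continue
        # Skip links that stay on Reddit; only outbound pages are wanted.
        if host == "reddit.com" or host.endswith(".reddit.com"):
            continue
        # Karma acts as a cheap, human-provided quality signal.
        if sub.karma < min_karma:
            continue
        # Keep the first occurrence of each URL, preserving input order.
        if sub.url not in seen:
            seen.add(sub.url)
            selected.append(sub.url)
    return selected
```

The quality-filtered CommonCrawl derivatives in the list follow the same general idea of discarding low-signal pages, although they rely on content-based heuristics rather than karma.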