The data that large language models were built on
Why can Large Language Models (LLMs) answer questions, write book reports, draft notes, or summarize a document? An important reason is the data they were trained on or fine-tuned with. This post helps you understand the widely used datasets in the LLM community. As you research LLMs, you may be curious about the data behind them. And if you are thinking about fine-tuning a pre-trained LLM for your specific purpose, it is important to know the data sources that model was trained on.
There are many LLMs and datasets. In this post, I focus on a set of LLMs that provide broad coverage of the datasets they were developed and evaluated against. They are:
- GPT-2 (2019), GPT-3 (2020), ChatGPT (2022), GPT-4 (2023),
- T5 (2019), Flan-T5 (2022),
- BERT (2018), RoBERTa (2019), DistilBERT (2019), DeBERTa (2020),
- MPT-7B-StoryWriter-65k+ (2023).
In the list above, MPT-7B-StoryWriter-65k+ is a model designed to read and write stories with very long contexts (a context length of 65k tokens). The BERT variants are listed as representatives of the larger BERT family.
For these LLMs, I will highlight the datasets used to train and evaluate them, including Common Crawl, WebText, C4, SQuAD 1.0 and SQuAD 2.0, SWAG, and…