The data that those large language models were built on

Chris Kuo / Dr. Dataman
May 9, 2023

Why can Large Language Models (LLMs) answer questions, write book reports, draft notes, or summarize a document? An important reason is the data that they were trained on or fine-tuned with. This post walks through the widely used datasets behind well-known LLMs. If you are researching LLMs, or thinking about fine-tuning a pre-trained LLM for your specific purpose, it is important to know which data sources the model was trained on.
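As a practical starting point, many pre-trained models published on the Hugging Face Hub describe their training data in their model cards. The sketch below uses the `huggingface_hub` library to pull a card and print whatever dataset information it declares; the repo ID "gpt2" is only an illustrative choice, and not every card fills in structured dataset metadata.

```python
from huggingface_hub import ModelCard

# Pull the model card for a pre-trained checkpoint and look for the
# training-data information it declares. "gpt2" is an illustrative
# repo ID; swap in the model you plan to fine-tune.
card = ModelCard.load("gpt2")

# Structured metadata (may be None if the author did not fill it in).
print("Declared datasets:", card.data.datasets)

# The free-text card body often has a section describing the training data.
for line in card.text.splitlines():
    if "training data" in line.lower():
        print(line)
```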

There are many LLMs and datasets. In this post, I focus on a set of LLMs that, taken together, cover a broad range of the datasets they were trained and evaluated on. They are:

  • GPT-2 (2019), GPT-3 (2020), ChatGPT (2022), GPT-4 (2023),
  • T5 (2019), Flan-T5 (2022),
  • BERT (2018), RoBERTa (2019), DeBERTa (2020), DistilBERT (2019)
  • MPT-7B-StoryWriter-65k+.

In the list above, MPT-7B-StoryWriter-65k+ is a model designed to read and write stories with a very long context window of 65k tokens. The BERT models are listed as representatives of the larger BERT family.

For these LLMs, I will highlight the datasets they were built and evaluated on, including Common Crawl, WebText, C4, SQuAD 1.0 and SQuAD 2.0, SWAG, and…
