GenAI model evaluation metric — ROUGE
In supervised learning, we use R-squared, ROC, precision-recall, or the F-score to evaluate performance during model training. How is a Large Language Model evaluated? Large Language Models are Transformer-based models built on deep neural networks, and they fundamentally follow the supervised-learning framework. They still use the typical train-test-validation data split, and the language datasets they are trained on still have fields that serve as the input and output of the neural network.
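As a quick refresher, here is a minimal sketch of those classic metrics, assuming scikit-learn is available; the labels and scores below are made-up placeholders, not data from any real model:

```python
# Minimal sketch of familiar supervised-learning metrics (assumes scikit-learn).
# y_true, y_pred, and y_score are illustrative placeholders.
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true = [0, 1, 1, 0, 1, 1, 0, 0]                      # ground-truth labels
y_pred = [0, 1, 0, 0, 1, 1, 1, 0]                      # hard class predictions
y_score = [0.2, 0.9, 0.4, 0.1, 0.8, 0.7, 0.6, 0.3]     # predicted probabilities

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F-score:  ", f1_score(y_true, y_pred))
print("ROC AUC:  ", roc_auc_score(y_true, y_score))
```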
LLMs can perform text summarization: they paraphrase or rewrite a long article into a short summary. This sounds very “unsupervised”, right? In fact, the training process still follows the supervised-learning framework. The models were trained on records with the ‘article’ as the input field and the ‘summary’ as the output field. A well-known language dataset is “CNN/DailyMail”, which many LLMs were trained or fine-tuned on. It contains 300k unique news articles written by journalists at CNN and the Daily Mail, with an ‘article’ field and a summary field called ‘highlights’. Once an LLM is trained and ready to be tested on the validation data, it loads the ‘article’ field as the input and produces a summarized version as the ‘prediction’. The prediction is then compared with the ground truth, the ‘highlights’ field, to evaluate model performance.
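The sketch below illustrates that evaluation loop, assuming the Hugging Face `datasets` and `evaluate` packages; `summarize` is a hypothetical stand-in for whatever LLM is being tested, not a real model call:

```python
# Sketch of the summarization evaluation loop (assumes the Hugging Face
# `datasets` and `evaluate` packages). `summarize` is a hypothetical placeholder
# for the LLM under test.
from datasets import load_dataset
import evaluate

def summarize(article: str) -> str:
    # Placeholder: call your LLM here. A trivial "first three sentences"
    # baseline is used so the sketch runs end to end.
    return ". ".join(article.split(". ")[:3])

# Load a small slice of the CNN/DailyMail validation split.
dataset = load_dataset("cnn_dailymail", "3.0.0", split="validation[:100]")

# The 'article' field is the input; the model output is the prediction.
predictions = [summarize(row["article"]) for row in dataset]
# The 'highlights' field is the ground-truth summary.
references = [row["highlights"] for row in dataset]

# Compare predictions against the references with ROUGE.
rouge = evaluate.load("rouge")
scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # e.g. {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}
```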