Teaching to the Test: Evaluating the Performance of Generative AI Models for Economic Analysis is Harder than You Think

Authors

Wendy Dunn, Ellen Meade, Nitish Sinha, Raakin Kabir

Posted to EERN: December 30, 2025

FEDERAL RESERVE RESEARCH: Board of Governors

This study uses the unique information dissemination structure of Federal Open Market Committee (FOMC) meeting minutes to investigate the data leakage hypothesis in economic text analysis. Using hand-labeled data as the ground-truth benchmark, we assess whether the past performance of large language models (LLMs), measured in a setting with potential leakage, holds up when the models are instead tasked with categorizing unseen data. We find evidence that some models perform worse when evaluated on truly novel data, but the effect is not uniform across models: some models perform better on novel data than on potentially leaked training data. These counterintuitive findings suggest that the relationship between training data exposure and model performance is more complex than previously understood.
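The comparison described above can be illustrated with a minimal sketch. It is not the authors' code: it assumes each text carries a release date, a model-predicted category, and a hand label, and that a model's training cutoff date is known. The function names, the cutoff, the category labels, and the toy records are all hypothetical and chosen only for illustration.

```python
from datetime import date


def accuracy(pairs):
    """Share of (predicted, hand_label) pairs that agree."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no examples to score")
    return sum(pred == gold for pred, gold in pairs) / len(pairs)


def leakage_gap(records, cutoff):
    """Accuracy on pre-cutoff text minus accuracy on post-cutoff text.

    records: iterable of (release_date, predicted_label, hand_label).
    cutoff:  assumed training-data cutoff for the model (hypothetical).
    A positive gap is consistent with leakage helping the model;
    a negative gap means the model did better on unseen text.
    """
    records = list(records)
    seen = [(p, g) for d, p, g in records if d <= cutoff]
    unseen = [(p, g) for d, p, g in records if d > cutoff]
    return accuracy(seen) - accuracy(unseen)


# Toy records invented for illustration only; they are not data from the paper.
toy = [
    (date(2023, 1, 1), "A", "A"),
    (date(2023, 2, 1), "B", "B"),
    (date(2025, 1, 1), "A", "B"),
    (date(2025, 2, 1), "C", "C"),
]
gap = leakage_gap(toy, cutoff=date(2024, 1, 1))
print(gap)
```

Computing this gap separately for each model would show whether degradation on novel data is uniform or, as the abstract reports, varies in sign across models.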
