Philadelphia

On the Testability of the Anchor-Words Assumption in Topic Models

Simon Freyaldenhoven, Shikun Ke, Dingyi Li, Jose Luis Montiel Olea

August 5, 2025

Topic models are a simple and popular tool for the statistical analysis of textual data. Their identification and estimation are typically enabled by assuming the existence of anchor words; that is, words that are exclusive to specific topics. In this paper we show that the existence of anchor words is statistically testable: There exists a […]

Precision Without Labels: Detecting Cross-Applicants in Mortgage Data Using Unsupervised Learning

Hadi Elzayn, Simon Freyaldenhoven, Minchul Shin

August 5, 2025

We develop a clustering-based algorithm to detect loan applicants who submit multiple applications (“cross-applicants”) in a loan-level dataset without personal identifiers. A key innovation of our approach is a novel evaluation method that does not require labeled training data, allowing us to optimize the tuning parameters of our machine learning algorithm. By applying this methodology […]

Debt Dictionaries

Jawad M. Addoum, Vitaly Meursault, Justin Murfin

July 8, 2025

Using the debt and equity response to the release of textual information from earnings calls, we demonstrate that stock and bond investors for the same firm interpret different aspects of firm information as value-relevant. These differences in interpretation, captured by distinct functions mapping text to returns, cannot be explained by differences in security payoffs. If anything, the […]

Can LLMs Credibly Transform the Creation of Panel Data from Diverse Historical Tables?

Verónica Bäcker-Peral, Vitaly Meursault, Christopher Severen

June 24, 2025

Multimodal LLMs offer a watershed change for the digitization of historical tables, enabling low-cost processing centered on domain expertise rather than technical skills. We rigorously validate an LLM-based pipeline on a new panel of historical county-level vehicle registrations. This pipeline is 100 times less expensive than outsourcing, reduces critical parsing errors from 40% to 0.3%, […]

Patent Text and Long-Run Innovation Dynamics: The Critical Role of Model Selection

Ina Ganguli, Jeffrey Lin, Vitaly Meursault, Nicholas Reynolds

May 27, 2025

As distorted maps may mislead, Natural Language Processing (NLP) models may misrepresent. How do we know which NLP model to trust? We provide comprehensive guidance for selecting and applying NLP representations of patent text. We develop novel validation tasks to evaluate several leading NLP models. These tasks assess how well candidate models align with both […]

Generative AI: A Turning Point for Labor’s Share?

Lukasz Drozd, Marina M. Tavares

March 25, 2025

After years of slow and steady development, generative artificial intelligence (AI) technologies have exploded in popularity, and many experts believe that we are entering a new, AI-driven phase of the Industrial Revolution. The advent of AI as the new engine of growth raises questions about the future of labor. Some have expressed concerns that, in […]

PEAD.txt: Post-Earnings-Announcement Drift Using Text

Vitaly Meursault, Pierre Jinghong Liang, Bryan R. Routledge, and Madeline Marco Scanlon

Research conducted using AI/ML tools

May 29, 2024

We construct a new numerical measure of earnings announcement surprises, standardized unexpected earnings call text (SUE.txt), that does not explicitly incorporate the reported earnings value. SUE.txt generates a text-based post-earnings-announcement drift (PEAD.txt) larger than the classic PEAD. The magnitude of PEAD.txt is considerable even in recent years when the classic PEAD is close to 0. […]

Advancing Fairness in Lending Through Machine Learning

Vitaly Meursault, Daniel Moulton, Larry Santucci, and Nathan Schor, with web adaptation by Kali Aloisi

Research conducted using AI/ML tools

April 2, 2024

Advances in machine learning (ML) provide the opportunity to improve predictions that may expand credit access to more applicants. However, there is concern that gains from advanced models could accrue unequally between demographic groups or do little to reduce existing disparities in credit access. This research explores an approach using ML — paired with setting […]