Hyo Jin Do, Werner Geyer
AIES 2025
Synthetic data has emerged as an alternative or supplement to human-generated data, driven by several underlying assumptions that motivate its growing adoption among practitioners. These include the promise of increased efficiency through reduced cost, time, and human labor in data collection and labeling, which is expected to overcome data scarcity. As synthetic data is increasingly adopted to meet the data needs of Large Language Model development, it is critical to better understand the discourses and practices surrounding its creation. We conducted a Critical Discourse Analysis of a corpus of 52 research articles from Artificial Intelligence and Computational Linguistics conferences. From this analysis, we identify three recurring disciplinary practices that establish and reinforce Cultural Scarcity, and we propose a set of recommendations to counteract it.
Ioana Baldini Soares, Chhavi Yadav, et al.
ACL 2023
George Kour, Marcel Zalmanovici, et al.
EMNLP 2023
Lev Tankelevitch, Elena Glassman, et al.
CHI 2025