Training large language models (LLMs) for programming tasks requires diverse and syntactically valid input data. While data augmentation can enhance generalization, uncontrolled complexity may lead to overfitting or invalid examples. In this work, we introduce a grammar-based augmentation method that systematically generates program-like data with controlled complexity. By leveraging formal grammars, our approach ensures syntactic correctness while promoting semantic diversity. Preliminary experiments demonstrate that our method produces well-distributed training datasets, improving model robustness without compromising generalization. This grammar-aware strategy offers a scalable and principled solution for augmenting structured data in LLM training.
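The core idea of grammar-based generation with controlled complexity can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy arithmetic grammar, the `GRAMMAR` table, and the `generate` function are all hypothetical, and the depth bound stands in for whatever complexity control the method actually uses.

```python
import random

# Hypothetical toy grammar: each nonterminal maps to a list of productions;
# symbols absent from the table are treated as terminals.
GRAMMAR = {
    "expr": [("term",), ("expr", "+", "term"), ("expr", "-", "term")],
    "term": [("factor",), ("term", "*", "factor")],
    "factor": [("num",), ("(", "expr", ")")],
    "num": [("0",), ("1",), ("x",)],
}

def generate(symbol="expr", max_depth=5, rng=random):
    """Sample a string from GRAMMAR, bounding recursion depth to control
    complexity. Once the depth budget is spent, only the first (simplest)
    production is used, which guarantees termination while every output
    remains syntactically valid by construction."""
    productions = GRAMMAR.get(symbol)
    if productions is None:      # terminal symbol: emit it as-is
        return symbol
    if max_depth <= 0:
        choice = productions[0]  # fall back to the shortest derivation
    else:
        choice = rng.choice(productions)
    return "".join(generate(s, max_depth - 1, rng) for s in choice)
```

Varying `max_depth` across samples yields a dataset whose structural complexity is distributed by construction rather than left to chance, while every sample parses under the grammar.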