Workshop paper

Grammar- and Coverage-based Augmentation of Programs for Training LLMs

Abstract

Training large language models (LLMs) for programming tasks requires diverse and syntactically valid input data. While data augmentation can enhance generalization, uncontrolled complexity may lead to overfitting or invalid examples. In this work, we introduce a grammar- and coverage-based augmentation method that systematically generates program-like data with controlled complexity. By leveraging formal grammars, our approach ensures syntactic correctness while promoting semantic diversity. Preliminary experiments indicate that our method produces training datasets that cover program structures evenly, improving model robustness without compromising generalization. This grammar-aware strategy offers a scalable and principled solution for augmenting structured data in LLM training.
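To make the idea concrete, the following is a minimal sketch of grammar-based generation with a complexity control and a simple coverage counter. The toy expression grammar, the depth bound, and the rule-coverage bookkeeping are illustrative assumptions for exposition only, not the paper's actual grammars or implementation.

```python
import random
from collections import Counter

# Toy context-free grammar (illustrative assumption): nonterminals map to
# lists of productions; symbols not in the grammar are treated as terminals.
GRAMMAR = {
    "expr": [["term"], ["expr", "+", "term"], ["expr", "-", "term"]],
    "term": [["factor"], ["term", "*", "factor"]],
    "factor": [["NUM"], ["(", "expr", ")"]],
}

def sample(symbol, max_depth, coverage, depth=0):
    """Expand `symbol` into a token list, bounding recursion depth
    to control the complexity of the generated fragment."""
    if symbol not in GRAMMAR:  # terminal symbol
        return [str(random.randint(0, 9)) if symbol == "NUM" else symbol]
    rules = GRAMMAR[symbol]
    # Past the depth limit, take the shortest production to force termination;
    # otherwise sample uniformly to promote structural diversity.
    rule = min(rules, key=len) if depth >= max_depth else random.choice(rules)
    coverage[(symbol, tuple(rule))] += 1  # record which grammar rules were exercised
    tokens = []
    for sym in rule:
        tokens.extend(sample(sym, max_depth, coverage, depth + 1))
    return tokens

coverage = Counter()
dataset = [" ".join(sample("expr", max_depth=4, coverage=coverage)) for _ in range(5)]
print(dataset)  # e.g. ['3 + 7 * 2', ...] -- syntactically valid by construction
print(len(coverage), "distinct grammar rules covered")
```

In this sketch, the depth bound stands in for the controlled-complexity mechanism described above, and the rule counter gives a crude notion of grammar coverage that could guide further sampling toward under-represented productions.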