IBM Research India - AI4Code
After witnessing the phenomenal success and impact of Artificial Intelligence (AI) in our day-to-day lives, the next question that comes to the mind of almost every practitioner, researcher, and even everyday user is: “What’s the next big thing in AI?” While there are many important areas where AI is being nurtured aggressively, one area that clearly stands out in our mind is AI for Code (AI4Code) or AI for Software Engineering (AI4SE). We are at a unique moment in the history of software technology, where two important streams of technology evolution have come together for a better future: Software Engineering and AI.
- First, modern software has evolved all the way from being just a means of programming computers to being a structured articulation of human and business intent, which is then executed by computers. Today's software engineering starts with understanding the processes and requirements of the business, be it for internal processes or for customer engagement, and then manually translating that understanding into programs written in programming languages.
- Second, AI technology has evolved to a point where, in many data modalities, AI systems are on par with or better than humans in terms of accuracy and speed. This creates a tremendous opportunity to develop new AI technologies that i) understand IT artefacts such as source code, data and metadata, and operational data from IT systems, ii) understand human and business intent, and iii) bring significant improvements to the entire software lifecycle. We believe this will be one of the next big waves in the field of AI in the coming years, impacting the full spectrum of professions, businesses, industries, and society. It has the potential to transform the entire software industry because it may completely change the way we code and program computers today.
To dwell further on this confluence of AI and Software Engineering, let us recall some basics about programming languages via the following simple analogies:
“What Natural Language is to Humans, Programming Language is to Computers.”
“What a Text Document is to Humans, Source Code is to Computers.”
These analogies suggest that any AI technique that can be applied to natural languages should also be applicable to programming languages, at least in theory, and this is a key ongoing trend in the emerging field of AI4Code. For example, consider the parallel between some Natural Language Processing (NLP) tasks and Software Engineering (SE) tasks: document embedding versus code embedding, document search versus code search, natural language translation versus programming language translation, etc. This strong parallel between NLP and SE tasks may suggest that we can simply reapply recent AI-based NLP techniques (e.g. Transformer, BERT, BART, GPT-3, etc.) to solve the majority of tasks in the field of SE. However, it is not that straightforward. The catch lies in a fundamental difference between how natural languages and programming languages are processed by their respective engines, namely the human mind and the computer. For instance, a mild syntactic variation in a natural language text hardly disturbs its understanding and processing by a human mind. The same is not true of programming languages: a slight syntactic variation in a program statement may lead to a catastrophic change in the outcome. Therefore, the need of the hour is to invent a new breed of techniques for the tasks in AI4Code. While one may (and should) start by taking cues from recent advancements in deep-learning-based NLP techniques, solving these problems requires out-of-the-box thinking and new innovations.
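To make the point about syntactic sensitivity concrete, here is a minimal Python sketch. The two function names are made up for illustration; the only difference between them is a single character, yet they compute different results.

```python
def average(values):
    # Intended behaviour: arithmetic mean of the list.
    return sum(values) / len(values)


def average_variant(values):
    # One-character change: "/" became "//" (floor division),
    # so the fractional part of the result is silently dropped.
    return sum(values) // len(values)


data = [2, 3]
print(average(data))          # 2.5
print(average_variant(data))  # 2
```

A human reader skimming the two definitions would likely treat them as the same text, while the interpreter treats them as different programs.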
In the figure below, we capture a high-level sketch of our vision of an enterprise-strength solution stack for AI4Code. In some ways, this parallels popular NLP stacks that leverage tools such as spaCy, NLTK, and Hugging Face, among others.
The starting point of any AI4Code stack is its data sources. As shown in the figure above, typical data inputs to an AI4Code stack include a wide variety of application/code artefacts. Source code repositories certainly play a central role when it comes to data input. However, there are several other artefacts, such as test suites, logs, configuration scripts, etc., which play a major role in building different end applications. We would like to emphasize the last input artefact, namely data models/schemas. This is something unique, and the success of many of the end applications depends heavily on parsing and reasoning over these data models and schemas.
Data Pre-Processing and Enrichment
Often, these input data may not be directly consumable, and therefore one needs to perform different kinds of pre-processing and cleaning operations on them. These pre-processing operations may include (but are not limited to) masking data to hide sensitive information, filtering data to respect license and other compliance terms, randomly splitting data into train/dev/test sets while maintaining data distributions, etc.
Other pre-processing operations include semantic enrichment of the artefacts, such as human or automated labelling of the code/data so that it becomes useful for running many subsequent AI algorithms. At other times, we may also need to generate additional code/data to supplement or complement the data for the task. Lastly, data storage and transfer need careful planning throughout this pre-processing cycle because the data may contain highly sensitive information.
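The masking, license filtering, and distribution-preserving split mentioned in this section can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the regular expression for secret-looking assignments, the license allow-list, the sample record layout (`language`, `license`, `code`), and the 80/10/10 ratios. A production pipeline would rely on vetted secret scanners and proper license detection.

```python
import random
import re

# Hypothetical pattern for secret-looking assignments such as password = "...".
SECRET_PATTERN = re.compile(
    r"(password|api_key|token)\s*=\s*(['\"]).*?\2", re.IGNORECASE
)
# Assumed allow-list of licenses that are acceptable for training data.
ALLOWED_LICENSES = {"MIT", "Apache-2.0", "BSD-3-Clause"}


def mask_secrets(source: str) -> str:
    """Replace the value of secret-looking assignments with a placeholder."""
    return SECRET_PATTERN.sub(lambda m: f"{m.group(1)} = '<MASKED>'", source)


def filter_by_license(samples):
    """Keep only samples whose license is on the allow-list."""
    return [s for s in samples if s["license"] in ALLOWED_LICENSES]


def stratified_split(samples, key, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split group by group so each split keeps roughly the same mix of `key`."""
    rng = random.Random(seed)
    groups = {}
    for sample in samples:
        groups.setdefault(sample[key], []).append(sample)
    train, dev, test = [], [], []
    for members in groups.values():
        members = members[:]          # do not mutate the caller's list
        rng.shuffle(members)
        n_train = int(len(members) * ratios[0])
        n_dev = int(len(members) * ratios[1])
        train += members[:n_train]
        dev += members[n_train:n_train + n_dev]
        test += members[n_train + n_dev:]
    return train, dev, test


# Toy corpus: 10 Java and 10 Python samples, plus one with a disallowed license.
corpus = [
    {"language": lang, "license": "MIT", "code": 'password = "hunter2"'}
    for lang in ("java", "python")
    for _ in range(10)
]
corpus.append({"language": "java", "license": "Proprietary", "code": "x = 1"})

clean = [dict(s, code=mask_secrets(s["code"])) for s in filter_by_license(corpus)]
train, dev, test = stratified_split(clean, key="language")
print(len(train), len(dev), len(test))  # 16 2 2
```

Splitting per language rather than over the whole corpus is one simple way to keep the language mix similar across the train, dev, and test sets.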
Note that most of the input data are either source code written in programming languages or allied data (e.g. logs, natural language documents, data warehouses). There are several classical tools available from the Programming Language & Software Engineering (PLSE) as well as Natural Language Processing (NLP) communities that exploit the underlying grammar and offer rich insights about the input in the form of symbolic structures, typically called Intermediate Representations (IR). For example, the Abstract Syntax Tree (AST), Control Flow Graph (CFG), etc. are well-known IRs in the PLSE community. On the other hand, POS taggers, parse trees, Abstract Meaning Representation (AMR) trees, etc. are well-known syntactic and semantic parsing tools in the NLP community. The symbolic structures produced by any of these tools help immensely in improving our understanding of the input data, be it code/application topology, infrastructure details, or the CRUD operations involved.
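As a small illustration of such an intermediate representation, the sketch below uses Python's standard `ast` module to parse a snippet into an Abstract Syntax Tree and list the kinds of nodes it contains. The snippet being parsed is invented for this example; tools for other languages expose similar structures through their own APIs.

```python
import ast
from collections import Counter

# Hypothetical input program, used only to have something to parse.
source = """
def total(prices):
    result = 0
    for p in prices:
        result += p
    return result
"""

tree = ast.parse(source)

# Walk every node of the tree and record its type name.
node_types = [type(node).__name__ for node in ast.walk(tree)]
node_counts = Counter(node_types)

# Pull out one piece of structure: the names of all defined functions.
function_names = [
    node.name for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)
]

print(function_names)  # ['total']
print(node_counts.most_common(3))
```

The same tree also records that the function contains a loop (`For`) and an in-place update (`AugAssign`), which is the kind of structural insight the paragraph above refers to.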
As we all know, the major AI/NLP breakthroughs in recent times have emerged because of deep-network-based vector (aka distributed) representations of data. We believe that once clean symbolic representations of the input data are available, a channel opens for leveraging all kinds of distributed representation techniques developed in the fields of AI/NLP. For example, one can consider developing Graph Neural Network (GNN) based models to learn vector representations of the nodes/edges in the AST of a piece of code, or one can invent new attention-based techniques for representing an entire piece of code via a single vector embedding. Depending on the nature of the end task, one may either build a custom-purpose representation technique afresh or repurpose some of the existing representation models with a little fine-tuning.
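To show the step from a symbolic structure to a vector, here is a deliberately simple stand-in: each snippet is represented by the counts of its AST node types, and two snippets are compared by cosine similarity. This is not a GNN or an attention-based model, and nothing is learned; it is a toy sketch whose function names and example snippets are assumptions made for illustration.

```python
import ast
import math
from collections import Counter


def ast_vector(source: str) -> Counter:
    """Represent code as a sparse vector of AST node-type counts."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


loop_sum = "def f(xs):\n    t = 0\n    for x in xs:\n        t += x\n    return t\n"
renamed = "def g(items):\n    acc = 0\n    for i in items:\n        acc += i\n    return acc\n"
lookup = "def h(d, k):\n    return d.get(k)\n"

same_shape = cosine(ast_vector(loop_sum), ast_vector(renamed))
different_shape = cosine(ast_vector(loop_sum), ast_vector(lookup))
print(round(same_shape, 3), same_shape > different_shape)
```

Because the vector only counts node types, renaming variables leaves it unchanged, but so would many changes that alter behaviour. That gap is one reason learned representations over the tree or graph structure are of interest.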
Foundation Models for AI4Code
It is no secret that recent advances in foundation models (e.g. Transformer, BERT, BART, GPT-3, etc.) have shaken up the entire field of NLP and brought a paradigm shift in the way NLP tasks are performed today. While these foundation models have their own limitations, they still dominate the field when it comes to any abstract task involving NL generation or NL2NL translation, be it question answering, summarization, or story generation. The fundamental question is whether one can also leverage these models for solving tasks in the field of AI4Code. For example,
- Can one repurpose a Transformer model to automatically translate a COBOL program into a functionally equivalent Java program?
- Can one use a BERT model to embed a piece of code into a single vector for the purpose of retrieving functionally similar code?
While there have been several attempts in the AI4Code community to demonstrate the possibilities of repurposing existing foundation models, it is still unclear whether one can simply lift and shift these foundation models to code-related tasks with little customization and minimal fine-tuning. Remember, a slight perturbation in a natural language sentence rarely alters its meaning, but the same is not true for code. Therefore, we believe there are plenty of opportunities to inspect the whole field of AI4Code through the lens of AI foundation models. This may trigger some out-of-the-box ideas to rewire the foundation models in fundamentally different ways to suit the symbolic data coming from code and application artefacts. We strongly believe that this particular layer in the AI4Code stack above will attract major action and research activity in the coming years.
As shown in the figure above, there is an entire spectrum of applications waiting to benefit immensely from the use of AI/NLP-based techniques and models. For example, take the use case of Application Modernization (i.e. migrating a legacy mainframe application to a modern cloud platform). The client owning the legacy application may decide to adopt a microservices-based architecture in its journey to the cloud. This might also lead to moving the dependencies to their latest versions. For this migration, we need the "where" and the "how": where in the application code do we need to make changes, and how can we make them? Both "where" and "how" require program analysis, but "how" also requires NLP techniques to parse Javadocs and release notes and discover how to make the changes. Similarly, for the other applications shown in the figure above, you will find that there is a strong need to combine AI and program analysis. Consider another case where AI4Code can help: testing. As with source-to-source translation, AI4Code can help encode the exact or use-def semantics of program statements, which is essential for automated directed functional and unit testing of applications written in COBOL or SAP ABAP because of their large number of non-primitive statements. The knowledge distillation power of AI4Code will also be helpful in predicting correlations between variables and long-term control dependencies, which will in turn be useful for combinatorial testing and better test coverage of programs.
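The use-def information mentioned above can be illustrated with a minimal program-analysis sketch. It is written against Python source using the standard `ast` module purely for convenience; the article's target languages (COBOL, SAP ABAP) would need their own parsers, and the helper name `def_use_lines` and the sample program are hypothetical.

```python
import ast
from collections import defaultdict


def def_use_lines(source: str):
    """Return (defs, uses): variable name -> sorted line numbers."""
    defs, uses = defaultdict(list), defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defs[node.id].append(node.lineno)
            elif isinstance(node.ctx, ast.Load):
                uses[node.id].append(node.lineno)
        elif isinstance(node, ast.AugAssign) and isinstance(node.target, ast.Name):
            # "x += 1" reads and writes x; the Name node only records the write.
            uses[node.target.id].append(node.lineno)
    return (
        {name: sorted(lines) for name, lines in defs.items()},
        {name: sorted(lines) for name, lines in uses.items()},
    )


program = (
    "rate = 0.2\n"
    "price = 100\n"
    "tax = price * rate\n"
    "total = price + tax\n"
)

defs, uses = def_use_lines(program)
unused = sorted(name for name in defs if name not in uses)
print(uses["price"])  # [3, 4]
print(unused)         # ['total']
```

A test generator could use such a map to decide which inputs influence which outputs, for example that changing `rate` can only affect `tax` and `total`.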