Publications
2025
- Generating Failure-based Oracles to Support Testing of Reported Bugs in Mobile Apps. Jack Johnson, Junayed Mahmud, Oscar Chaparro, Kevin Moran, and 1 more author. In Proceedings of the 40th International Conference on Automated Software Engineering (ASE’25), 2025
In the context of mobile apps, bug report management tasks have been shown to be among the most time-consuming and intellectually intensive software maintenance activities. As such, researchers have developed tools to automate the reproduction, validation, and localization of reported bugs. However, one complex, time-consuming, and important task that lacks automated support is the creation of test oracles for reported functional failures that manifest through the GUI. This is a challenging task, requiring nuanced, multi-modal reasoning about bug descriptions, affected GUI components, and the characteristics of the related erroneous program state(s). To explore the feasibility of automating this task, we conduct an empirical investigation into how the multi-modal (i.e., text and GUI-related code) reasoning capabilities of Large Language Models (LLMs) can be used to automatically generate assertion-based test oracles for non-crashing, functional failures described in Android app bug reports. Building upon the findings of this study, we construct and evaluate AndroB2O, an automated, LLM-based approach that, given a bug report and the GUI screen associated with the reported failure as inputs, generates failure-based oracles (FBOs) in the form of test assertions. The approach first identifies the GUI elements related to the failure and then defines assertions that aim to confirm the absence of the failure based on the elements’ properties. To evaluate AndroB2O, we create the first dataset of Android bug reports containing test cases with GUI interactions and test oracles that reveal reported failures. The results of our evaluation on 152 failures show that AndroB2O is able to generate FBOs that successfully identify the failure (and hence can confirm its absence) in 61.2% of the cases. We integrated AndroB2O with ReBL, a failure reproduction tool, to evaluate its effectiveness in the automated generation of test cases complete with oracles for reported failures, and obtained promising results.
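To make the idea of a failure-based oracle concrete, the sketch below shows what an assertion over the properties of a failure-related GUI element could look like. This is a minimal, hypothetical illustration, not AndroB2O’s actual output format; the screen-snapshot structure, the `find_element` helper, and the cart example are all assumed for the sake of the example.

```python
# Hypothetical sketch of a failure-based oracle (FBO): an assertion over the
# properties of the GUI element implicated in a reported failure. The screen
# snapshot format and helper below are illustrative assumptions, not
# AndroB2O's actual output.

def find_element(screen, resource_id):
    """Locate a GUI element in a flattened screen snapshot by resource id."""
    for element in screen["elements"]:
        if element["resource_id"] == resource_id:
            return element
    raise AssertionError(f"element {resource_id} not found on screen")

def fbo_cart_total_updated(screen):
    # Hypothetical bug report: "cart total still shows $0.00 after adding an item".
    # The oracle passes only when the reported failure is absent.
    total = find_element(screen, "com.example.app:id/cart_total")
    assert total["visible"], "cart total should be displayed"
    assert total["text"] != "$0.00", "cart total was not updated (reported failure)"

# Toy usage with a snapshot in which the failure is absent:
screen = {"elements": [{"resource_id": "com.example.app:id/cart_total",
                        "visible": True, "text": "$4.99"}]}
fbo_cart_total_updated(screen)  # passes: the failure is not present
```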
- LadyBug: A GitHub Bot for UI-Enhanced Bug Localization in Mobile Apps. Junayed Mahmud, James Chen, Terry Achille, Camilo Alvarez-Velez, and 7 more authors. In Proceedings of the 41st International Conference on Software Maintenance and Evolution, Auckland, New Zealand, 2025
This paper introduces LadyBug, a GitHub bot that automatically localizes bugs for Android apps by combining UI interaction information with text retrieval. LadyBug connects to an Android app’s GitHub repository and is triggered when a bug is reported in the corresponding issue tracker. Developers can then record a reproduction trace for the bug on a device or emulator and upload the trace to LadyBug via the GitHub issue tracker. This enables LadyBug to utilize both the text from the original bug description and UI information from the reproduction trace to accurately retrieve a ranked list of files from the project that most likely contain the reported bug. We empirically evaluated LadyBug using an automated testing pipeline and benchmark called RedWing, which contains 80 fully-localized and reproducible bug reports from 39 Android apps. Our results illustrate that LadyBug outperforms text-retrieval-based baselines and that the utilization of UI information leads to a substantial increase in localization accuracy. LadyBug is an open-source tool, available at https://github.com/LadyBugML/ladybug. A video showing the capabilities of LadyBug can be viewed here: https://youtu.be/hI3tzbRK0Cw
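As a rough illustration of the underlying idea, combining bug-report text with tokens from a recorded UI reproduction trace before ranking source files, here is a minimal TF-IDF sketch. The trace tokens, file contents, and simple query concatenation are assumptions for illustration and do not reflect LadyBug’s actual implementation.

```python
# Minimal sketch of UI-augmented text retrieval for bug localization.
# Assumption: identifiers observed in a recorded reproduction trace are
# appended to the bug-report query before ranking candidate files.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

source_files = {  # hypothetical preprocessed file contents
    "CartActivity.java": "cart total price checkout update item adapter",
    "LoginActivity.java": "login password username auth token session",
}

bug_report = "cart total shows wrong price after adding an item"
ui_trace_tokens = "CartActivity cart_total add_item_button"  # from the trace

query = bug_report + " " + ui_trace_tokens
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(list(source_files.values()) + [query])
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()

ranking = sorted(zip(source_files, scores), key=lambda pair: -pair[1])
print(ranking)  # CartActivity.java should rank first
```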
- Combining Language and App UI Analysis for the Automated Assessment of Bug Reproduction Steps. Junayed Mahmud, Antu Saha, Oscar Chaparro, Kevin Moran, and 1 more author. In Proceedings of the 33rd IEEE/ACM International Conference on Program Comprehension (ICPC’25), Ottawa, Canada, 2025
Bug reports are essential for developers to confirm software problems, investigate their causes, and validate fixes. Unfortunately, reports often miss important information or are written unclearly, which can cause delays, increased issue resolution effort, or even the inability to solve issues. One of the most problematic components of reports is the steps to reproduce the bug(s) (S2Rs), which are essential to replicate the described program failures and reason about fixes. Given the proclivity for deficiencies in reported S2Rs, prior work has proposed techniques that assist reporters in writing or assessing the quality of S2Rs. However, automated understanding of S2Rs is challenging, as it requires linking nuanced natural language phrases with specific, semantically related program information. Prior techniques often struggle to form such language-to-program connections due to language variability and the limitations of information gleaned from program analyses. To more effectively tackle the problem of S2R quality annotation, we propose a new technique called AstroBR, which leverages the language understanding capabilities of LLMs to identify and extract the S2Rs from bug reports and map them to GUI interactions in a program state model derived via dynamic analysis. We compared AstroBR to a related state-of-the-art approach and found that AstroBR annotates S2Rs 25.2% better (in terms of F1 score) than the baseline. Additionally, AstroBR suggests more accurate missing S2Rs than the baseline (by 71.4% in terms of F1 score).
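To make the S2R-to-GUI-interaction mapping problem concrete, here is a deliberately simplified stand-in that matches an S2R phrase against candidate interactions from a state model using token overlap. AstroBR performs this mapping with an LLM; the overlap heuristic, interaction labels, and example S2R below are illustrative assumptions only.

```python
# Simplified stand-in for mapping an S2R phrase to a GUI interaction in a
# program state model. AstroBR uses an LLM for this step; plain token
# overlap is shown here only to make the matching problem concrete.
import re

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap(a, b):
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

s2r = "Tap the save button on the settings screen"
interactions = [  # hypothetical actions from the dynamic state model
    "click button id/save_button on SettingsActivity",
    "click button id/delete_button on SettingsActivity",
    "swipe list id/history_list on HistoryActivity",
]

best = max(interactions, key=lambda action: overlap(s2r, action))
print(best)  # -> the save-button click on SettingsActivity
```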
2024
- Toward the Automated Localization of Buggy Mobile App UIs from Bug Descriptions. Antu Saha, Yang Song, Junayed Mahmud, Ying Zhou, and 2 more authors. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA’24), Vienna, Austria, 2024
Bug report management is a costly software maintenance process comprising several challenging tasks. Given the UI-driven nature of mobile apps, bugs typically manifest through the UI; hence, the identification of buggy UI screens and UI components (Buggy UI Localization) is important for localizing the buggy behavior and eventually fixing it. However, this task is challenging, as developers must reason about bug descriptions (which are often low-quality) and the visual or code-based representations of UI screens. This paper is the first to investigate the feasibility of automating the task of Buggy UI Localization through a comprehensive study that evaluates the capabilities of one textual and two multi-modal deep learning (DL) techniques and one textual unsupervised technique. We evaluate these techniques at two levels of granularity: Buggy UI Screen and UI Component localization. Our results illustrate the individual strengths of models that make use of different representations, wherein models that incorporate visual information perform better for UI screen localization and models that operate on textual screen information perform better for UI component localization, highlighting the need for a localization approach that blends the benefits of both types of techniques. Furthermore, we study whether Buggy UI Localization can improve traditional buggy code localization, and find that incorporating localized buggy UIs leads to improvements of 9%-12% in Hits@10.
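For readers unfamiliar with the metric, Hits@10 is the fraction of bug reports for which at least one correct target appears among the top ten ranked candidates. A minimal reference implementation (the toy data is illustrative):

```python
# Hits@K: fraction of queries whose ranked candidates contain at least one
# ground-truth item within the top K positions. Toy data is illustrative.
def hits_at_k(rankings, ground_truth, k=10):
    hits = sum(
        1 for ranked, truth in zip(rankings, ground_truth)
        if any(candidate in truth for candidate in ranked[:k])
    )
    return hits / len(rankings)

# Three bug reports; a buggy screen is in the top 2 for two of them.
rankings = [["s1", "s2"], ["s3", "s4"], ["s5", "s6"]]
truth = [{"s2"}, {"s9"}, {"s5"}]
print(hits_at_k(rankings, truth, k=2))  # 0.666...
```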
- On Using GUI Interaction Data to Improve Text Retrieval-based Bug Localization. Junayed Mahmud, Nadeeshan De Silva, Safwat Ali Khan, Seyed Hooman Mostafavi, and 4 more authors. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering (ICSE’24), Lisbon, Portugal, 2024
One of the most important tasks related to managing bug reports is localizing the fault so that a fix can be applied. As such, prior work has aimed to automate this task of bug localization by formulating it as an information retrieval problem, where potentially buggy files are retrieved and ranked according to their textual similarity with a given bug report. However, there is often a notable semantic gap between the information contained in bug reports and the identifiers or natural language contained within source code files. For user-facing software, there is currently a key source of information that could aid in bug localization but has not been thoroughly investigated: information from the GUI. We investigate the hypothesis that, for end-user-facing applications, connecting information in a bug report with information from the GUI, and using this to aid in retrieving potentially buggy files, can improve upon existing techniques for bug localization. To examine this phenomenon, we conduct a comprehensive empirical study that augments four baseline techniques for bug localization with GUI interaction information from a reproduction scenario to (i) filter out potentially irrelevant files, (ii) boost potentially relevant files, and (iii) reformulate text-retrieval queries. To carry out our study, we source the current largest dataset of fully-localized and reproducible real bugs for Android apps, with corresponding bug reports, consisting of 80 bug reports from 39 popular open-source apps. Our results illustrate that augmenting traditional techniques with GUI information leads to a marked increase in effectiveness across multiple metrics, including a relative increase in Hits@10 of 13-18%. Additionally, through further analysis, we find that our studied augmentations largely complement existing techniques.
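A minimal sketch of the three augmentations, assuming a set of GUI-related files and terms extracted from the reproduction scenario; the boost weight and all inputs are illustrative assumptions, not the paper’s exact formulations:

```python
# Sketch of the three GUI-based augmentations studied in the paper:
# (i) filter out files unrelated to GUI components seen in the trace,
# (ii) boost the retrieval scores of GUI-related files, and
# (iii) reformulate the query with GUI component terms.
# The 0.5 boost weight and all inputs are illustrative assumptions.

def filter_files(scores, gui_files):
    return {f: s for f, s in scores.items() if f in gui_files}

def boost_files(scores, gui_files, weight=0.5):
    return {f: (s + weight if f in gui_files else s) for f, s in scores.items()}

def reformulate_query(query, gui_terms):
    return query + " " + " ".join(gui_terms)

baseline = {"CartActivity.java": 0.62, "Util.java": 0.55, "CartView.java": 0.40}
gui_files = {"CartActivity.java", "CartView.java"}  # from the GUI trace

print(filter_files(baseline, gui_files))
print(boost_files(baseline, gui_files))
print(reformulate_query("wrong cart total", ["cart_total", "CartActivity"]))
```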
- Toward Rapid Bug Resolution for Android Apps. Junayed Mahmud. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering (ICSE’24), Doctoral Symposium Track, Lisbon, Portugal, 2024
Bug reports document unexpected behaviors in software, enabling developers to understand, validate, and fix bugs. Unfortunately, a significant portion of bug reports is of low quality, which poses challenges for developers in terms of addressing these issues. Prior research has delved into the information needed for documenting high-quality bug reports and expediting bug report management. Furthermore, researchers have explored the challenges associated with bug report management and proposed various automated techniques. Nevertheless, these techniques exhibit several limitations, including the lexical gap between developers and reporters, difficulties in reproducing bugs, and difficulties in identifying bug locations. Therefore, there is a pressing need for additional efforts to effectively manage bug reports and enhance the quality of both desktop and mobile applications. In this paper, we describe the existing limitations of bug reports and identify potential strategies for addressing them. Our vision encompasses a future where the alleviation of these limitations and the successful execution of our proposed new research directions can benefit both reporters and developers, ultimately making the entire software maintenance process faster.
- Automating GUI-based Test Oracles for Mobile Apps. Kesina Baral, John Johnson, Junayed Mahmud, Sabiha Salma, and 4 more authors. In Proceedings of the 21st International Conference on Mining Software Repositories (MSR’24), Lisbon, Portugal, 2024
In automated testing, test oracles are used to determine whether software behaves correctly on individual tests by comparing expected behavior with actual behavior, thereby revealing incorrect behavior. Automatically creating test oracles is a challenging task, especially in domains where software behavior is difficult to model. Mobile apps are one such domain, primarily due to their event-driven, GUI-based nature, coupled with significant ecosystem fragmentation. This paper takes a step toward automating the construction of GUI-based test oracles for mobile apps, first by characterizing common behaviors associated with failures into a behavioral taxonomy, and second by using this taxonomy to create automated oracles. Our taxonomy identifies and categorizes common GUI element behaviors, expected app responses, and failures from 124 reproducible bug reports, allowing us to better understand oracle characteristics. We use the taxonomy to create app-independent oracles and report on their generalizability by analyzing an additional dataset of 603 bug reports. We also use this taxonomy to define an app-independent process for creating automated test oracles, which leverages computer vision and natural language processing, and apply our process to automate five types of app-independent oracles. We perform a case study to assess the effectiveness of our automated oracles by exposing them to 15 real-world failures. The oracles reveal 11 of the 15 failures and report only one false positive. Additionally, we combine our oracles with a recent automated test input generation tool for Android, revealing two bugs with a low false positive rate. Our results can help developers create stronger automated tests that reveal more problems in mobile apps, and can help researchers use the understanding from the taxonomy to make further advances in test automation.
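As one concrete (and deliberately simplified) example of an app-independent oracle in this spirit, an unresponsive tap can be flagged when the screen is pixel-identical before and after the interaction. The Pillow-based sketch below is an illustrative stand-in, not one of the paper’s five automated oracles; the screenshot paths are assumed.

```python
# Illustrative stand-in for an app-independent, screenshot-based oracle:
# flag a tap as unresponsive when the screen does not change at all after
# the interaction. The paper's oracles combine computer vision and NLP;
# this sketches only the before/after comparison idea.
from PIL import Image, ImageChops

def screen_changed(before_path, after_path):
    before = Image.open(before_path).convert("RGB")
    after = Image.open(after_path).convert("RGB")
    diff = ImageChops.difference(before, after)
    return diff.getbbox() is not None  # None means the images are identical

# Hypothetical screenshots captured around a tap event:
if not screen_changed("before_tap.png", "after_tap.png"):
    print("oracle violation: tap produced no visible response")
```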
2023
- BURT: A Chatbot for Interactive Bug Reporting. Yang Song, Junayed Mahmud, Nadeeshan De Silva, Ying Zhou, and 4 more authors. In Proceedings of the 45th IEEE/ACM International Conference on Software Engineering (ICSE’23), Formal Tool Demonstrations Track, Melbourne, Australia, 2023
This paper introduces BURT, a web-based chatbot for interactive reporting of Android app bugs. BURT is designed to assist Android app end-users in reporting high-quality defect information using an interactive interface. BURT guides the users in reporting essential bug report elements, i.e., the observed behavior, expected behavior, and the steps to reproduce the bug. It verifies the quality of the text written by the user and provides instant feedback. In addition, BURT provides graphical suggestions that the users can choose as alternatives to textual descriptions. We empirically evaluated BURT, asking end-users to report bugs from six Android apps. The reporters found that BURT’s guidance, automated suggestions, and clarifications are useful, and that BURT is easy to use.
2022
- Toward interactive bug reporting for (Android app) end-users. Yang Song, Junayed Mahmud, Ying Zhou, Oscar Chaparro, and 3 more authors. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE’22), Singapore, 2022
Many software bugs are reported manually, particularly bugs that manifest themselves visually in the user interface. End-users typically report these bugs via app reviewing websites, issue trackers, or built-in in-app bug reporting tools, if available. While these systems have various features that facilitate bug reporting (e.g., textual templates or forms), they often provide limited guidance, concrete feedback, or quality verification to end-users, who are often inexperienced at reporting bugs and submit low-quality bug reports that lead to excessive developer effort in bug report management tasks. We propose an interactive bug reporting system for end-users (Burt), implemented as a task-oriented chatbot. Unlike existing bug reporting systems, Burt provides guided reporting of essential bug report elements (i.e., the observed behavior, expected behavior, and steps to reproduce the bug), instant quality verification, and graphical suggestions for these elements. We implemented a version of Burt for Android and conducted an empirical evaluation study with end-users, who reported 12 bugs from six Android apps studied in prior work. The reporters found that Burt’s guidance and automated suggestions/clarifications are useful and that Burt is easy to use. We found that Burt reports contain higher-quality information than reports collected via a template-based bug reporting system. Improvements to Burt, informed by the reporters, include support for various wordings to describe bug report elements and improved quality verification. Our work marks an important paradigm shift from static to interactive bug reporting for end-users.
- An Empirical Investigation into the Reproduction of Bug Reports for Android Apps. Jack Johnson, Junayed Mahmud, Tyler Wendland, Kevin Moran, and 2 more authors. In Proceedings of the 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER’22), Honolulu, Hawaii, 2022
One of the key tasks related to ensuring mobile app quality is the reporting, management, and resolution of bug reports. As such, researchers have committed considerable resources toward automating various tasks of the bug management process for mobile apps, such as reproduction and triaging. However, the success of these automated approaches is largely dictated by the characteristics and properties of the bug reports they operate upon. As such, understanding mobile app bug reports is imperative to drive the continued advancement of report management techniques. While prior studies have examined high-level statistics of large sets of reports, we currently lack an in-depth investigation of how the information typically reported in mobile app issue trackers relates to the specific details generally required to reproduce the underlying failures. In this paper, we perform an in-depth analysis of 180 reproducible bug reports systematically mined from Android apps on GitHub and investigate how the information contained in the reports relates to the task of reproducing the described bugs. In our analysis, we focus on three pieces of information: the environment needed to reproduce the bug report, the steps to reproduce (S2Rs), and the observed behavior. Focusing on this information, we characterize failure types, identify the modality used to report the information, and characterize the quality of the information within the reports. We find that bugs are reported in a multi-modal fashion, that the environment is not always provided, and that S2Rs often have missing or insufficiently specific information. These findings carry important implications for automated bug reproduction techniques as well as automated bug report management approaches more generally.
- An Empirical Investigation into the Use of Image Captioning for Automated Software Documentation. Kevin Moran, Ali Yachnes, George Purnell, Junayed Mahmud, and 4 more authors. In Proceedings of the 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER’22), Honolulu, Hawaii, 2022
Existing automated techniques for software documentation typically attempt to reason between two main sources of information: code and natural language. However, this reasoning process is often complicated by the lexical gap between more abstract natural language and more structured programming languages. One potential bridge for this gap is the Graphical User Interface (GUI), as GUIs inherently encode salient information about underlying program functionality into rich, pixel-based data representations. This paper offers one of the first comprehensive empirical investigations into the connection between GUIs and functional, natural language descriptions of software. First, we collect, analyze, and open-source a large dataset of functional GUI descriptions consisting of 45,998 descriptions for 10,204 screenshots from popular Android applications. The descriptions were obtained from human labelers and underwent several quality control mechanisms. To gain insight into the representational potential of GUIs, we investigate the ability of four Neural Image Captioning models to predict natural language descriptions of varying granularity when provided a screenshot as input. We evaluate these models quantitatively, using common machine translation metrics, and qualitatively through a large-scale user study. Finally, we offer learned lessons and a discussion of the potential shown by multimodal models to enhance future techniques for automated software documentation.
2021
- Code to Comment Translation: A Comparative Study on Model Effectiveness & Errors. Junayed Mahmud, Fahim Faisal, Raihan Islam Arnob, Antonios Anastasopoulos, and 1 more author. In Proceedings of the 1st Workshop on Natural Language Processing for Programming (NLP4Prog’21), Bangkok, Thailand, Aug 2021
Automated source code summarization is a popular software engineering research topic wherein machine translation models are employed to “translate” code snippets into relevant natural language descriptions. Most evaluations of such models are conducted using automatic reference-based metrics. However, given the relatively large semantic gap between programming languages and natural language, we argue that this line of research would benefit from a qualitative investigation into the various error modes of current state-of-the-art models. Therefore, in this work, we perform both a quantitative and qualitative comparison of three recently proposed source code summarization models. In our quantitative evaluation, we compare the models based on the smoothed BLEU-4, METEOR, and ROUGE-L machine translation metrics, and in our qualitative evaluation, we perform a manual open-coding of the most common errors committed by the models when compared to ground truth captions. Our investigation reveals new insights into the relationship between metric-based performance and model prediction errors grounded in an error taxonomy that can be used to drive future research efforts.
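For context on the quantitative setup, smoothed sentence-level BLEU-4 can be computed with NLTK as shown below (METEOR and ROUGE-L are computed analogously with their own tooling); the comment strings and the choice of smoothing method are illustrative.

```python
# Smoothed sentence-level BLEU-4, one of the metrics used in the
# quantitative comparison. The example strings and the particular
# smoothing method are illustrative choices.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "returns the index of the first matching element".split()
candidate = "return index of first matching element in list".split()

score = sentence_bleu(
    [reference], candidate,
    weights=(0.25, 0.25, 0.25, 0.25),  # uniform weights up to 4-grams
    smoothing_function=SmoothingFunction().method4,
)
print(f"smoothed BLEU-4: {score:.3f}")
```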
- AndroR2: A Dataset of Manually-Reproduced Bug Reports for Android Apps. Tyler Wendland, Jingyang Sun, Junayed Mahmud, SM Hasan Mansur, and 4 more authors. In Proceedings of the 18th IEEE/ACM International Conference on Mining Software Repositories (MSR’21), Data Showcase Track, Madrid, Spain, Aug 2021
Software maintenance constitutes a large portion of the software development lifecycle. To carry out maintenance tasks, developers often need to understand and reproduce bug reports. As such, there has been increasing research activity coalescing around the notion of automating various activities related to bug reporting. A sizable portion of this research interest has focused on the domain of mobile apps. However, as research around mobile app bug reporting progresses, there is a clear need for a manually vetted and reproducible set of real-world bug reports that can serve as a benchmark for future work. This paper presents AndroR2: a dataset of 90 manually reproduced bug reports for Android apps listed on Google Play and hosted on GitHub, systematically collected via an in-depth analysis of 459 reports extracted from the GitHub issue tracker. For each reproduced report, AndroR2 includes the original bug report, an apk file for the buggy version of the app, an executable reproduction script, and metadata regarding the quality of the reproduction steps associated with the original report. We believe that the AndroR2 dataset can be used to facilitate research in automatically analyzing, understanding, reproducing, localizing, and fixing bugs for mobile applications as well as other software maintenance activities more broadly.
2018
- MAES: Modified advanced encryption standard for resource constraint environments. Arnab Rahman Chowdhury, Junayed Mahmud, Abu Raihan Mostofa Kamal, and Md. Abdul Hamid. In 2018 IEEE Sensors Applications Symposium (SAS), 2018