journals
2020
-
Investigating the Criticality of User Reported Issues through their Relations with App Rating
Di Sorbo, Andrea,
Grano, Giovanni,
Visaggio, Corrado Aron,
and Panichella, Sebastiano
Journal of Software: Evolution and Process
2020
[Abstract]
App quality impacts user experience and satisfaction. As a consequence, both app ratings and the user feedback reported in app reviews are directly influenced by the users' perceived app quality. We conjecture that, to perform effective maintenance and evolution of mobile applications, it is crucial to find ways to detect the user-reported issues that most impact the app rating (i.e., the app success). In this paper, we experiment with the combined usage of app rating and user review analysis (i) to investigate the most important factors influencing the perceived app quality, (ii) focusing on the topics discussed in user reviews that most relate to the app rating. In addition, we investigate whether specific code quality metrics could be monitored in order to prevent the rise of reviews with low ratings. Through an empirical study involving 210,517 reviews related to 317 Android apps, we first investigate the particular types of user feedback (e.g., bug- or feature-related issues) that are associated with reviews with high/low ratings. Then, we observe the extent to which (issue) metrics based on app rating and user review analysis correlate with specific code/mobile quality indicators. Our study demonstrates that user feedback reporting bugs is negatively correlated with the rating, while user reviews reporting feature requests are not. Interestingly, depending on the app category, we observed that different kinds of issues have rather different relationships with the rating and the user-perceived quality of the app. In addition, we observe that for specific app categories (e.g., Communication) some code quality factors (e.g., the Android-specific ones) have significant relationships with the emergence of certain types of feedback that, in turn, are negatively connected with app ratings. Our work complements state-of-the-art approaches that leverage the app rating to measure user satisfaction and the (perceived) app software quality. Moreover, it demonstrates how an analysis based on the app rating can be enriched by a context-based analysis that takes into account the specific nature of the different apps, together with a context-based user review analysis, to guide developers in better understanding user needs and achieving higher app success.
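The central measurement in this study, relating the prevalence of a feedback type to the app rating, can be pictured with a small, hedged sketch. This is not the paper's pipeline: the input file, the column names, and the pre-classified feedback_type labels are assumptions made purely for illustration.

```python
# Sketch: correlate the share of bug-reporting reviews with the app rating.
# Assumes a per-review table with columns: app_id, rating, feedback_type.
import pandas as pd
from scipy.stats import spearmanr

reviews = pd.read_csv("reviews.csv")  # hypothetical input file

per_app = reviews.groupby("app_id").agg(
    avg_rating=("rating", "mean"),
    bug_share=("feedback_type", lambda t: (t == "bug").mean()),
)

rho, p = spearmanr(per_app["bug_share"], per_app["avg_rating"])
print(f"Spearman rho={rho:.2f}, p={p:.3f}")
```

A negative rho for the bug share would mirror the paper's finding that bug-reporting feedback is negatively associated with the rating, while other feedback types can be checked the same way.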
2019
-
Testing with Fewer Resources: An Adaptive Approach to Performance-Aware Test Case Generation
Grano, Giovanni,
Laaber, Christoph,
Panichella, Annibale,
and Panichella, Sebastiano
IEEE Transactions on Software Engineering
2019
[Abstract]
[PDF]
[Slides]
Automated test case generation is an effective technique to yield high-coverage test suites. While the majority of research effort has been devoted to satisfying coverage criteria, a recent trend has emerged towards optimizing other, non-coverage aspects. In this regard, runtime and memory usage are two essential dimensions: less expensive tests reduce the resource demands for the generation process and later regression testing phases. This study shows that performance-aware test case generation requires solving two main challenges: providing a good approximation of resource usage with minimal overhead and avoiding detrimental effects on both final coverage and fault detection effectiveness. To tackle these challenges, we conceived a set of performance proxies, inspired by previous work on performance testing, that provide a reasonable estimation of the test execution costs (i.e., runtime and memory usage). We then propose an adaptive strategy, called aDynaMOSA, which leverages these proxies by extending DynaMOSA, a state-of-the-art evolutionary algorithm in unit testing. Our empirical study, involving 110 non-trivial Java classes, reveals that our adaptive approach generates test suites with statistically significant improvements in runtime (-25%) and heap memory consumption (-15%) compared to DynaMOSA. Additionally, aDynaMOSA has comparable results to DynaMOSA over seven different coverage criteria and similar fault detection effectiveness. Our empirical investigation also highlights that the usage of performance proxies (i.e., without the adaptiveness) is not sufficient to generate more performant test cases without compromising the overall coverage.
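The idea of a performance proxy, estimating a test's runtime and memory cost from its structure rather than by profiling it, can be sketched as follows. This is only an illustrative sketch with made-up weights and a toy test representation; it is not aDynaMOSA's actual proxy set nor its integration into the evolutionary search.

```python
# Sketch: rank two candidate tests that reach the same coverage by a
# static performance proxy (the cheaper test wins). Weights are illustrative.
from dataclasses import dataclass

@dataclass
class CandidateTest:
    statements: int   # number of statements in the test
    object_news: int  # objects instantiated (memory proxy)
    loop_calls: int   # calls to methods known to contain loops (runtime proxy)

def proxy_cost(t: CandidateTest) -> float:
    return 1.0 * t.statements + 2.0 * t.object_news + 3.0 * t.loop_calls

def prefer(a: CandidateTest, b: CandidateTest) -> CandidateTest:
    """Secondary objective: among equally covering tests, keep the cheaper one."""
    return a if proxy_cost(a) <= proxy_cost(b) else b

t1 = CandidateTest(statements=8, object_news=3, loop_calls=1)
t2 = CandidateTest(statements=12, object_news=5, loop_calls=0)
print(prefer(t1, t2))
```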
-
Scented Since the Beginning: On the Diffuseness of Test Smells in Automatically Generated Test Code
Grano, Giovanni,
Palomba, Fabio,
Di Nucci, Dario,
De Lucia, Andrea,
and Gall, Harald
Journal of Systems and Software
2019
[Abstract]
[PDF]
Software testing represents a key software engineering practice to ensure source code quality and reliability. To support developers in this activity and reduce testing effort, several automated unit test generation tools have been proposed. Most of these approaches have the main goal of covering as many branches as possible. While these approaches have good performance, little is still known about the maintainability of the test code they produce, i.e., whether the generated tests have good code quality and whether they introduce issues threatening their effectiveness. To bridge this gap, in this paper we study to what extent existing automated test case generation tools produce potentially problematic test code. We consider seven test smells, i.e., suboptimal design choices applied by programmers during the development of test cases, as a measure of the code quality of the generated tests, and evaluate their diffuseness in the unit test classes automatically generated by three state-of-the-art tools, namely Randoop, JTExpert, and EvoSuite. Moreover, we investigate whether there are characteristics of test and production code influencing the generation of smelly tests. Our study shows that all the considered tools tend to generate a high quantity of two specific test smell types, i.e., Assertion Roulette and Eager Test, which are those that previous studies showed to negatively impact the reliability of production code. We also discover that test size is correlated with the generation of smelly tests. Based on our findings, we argue that more effective automated generation algorithms that explicitly take test code quality into account should be further investigated and devised.
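Both of the highlighted smells are usually detected with lightweight structural rules over the test body. A hedged sketch of such heuristics is shown below; the representation of a test method and the Eager Test threshold are illustrative assumptions, not the detection rules used in the paper.

```python
# Sketch: heuristic detection of two test smells on a pre-parsed test method.
# A test method is represented by its assertion calls and the production
# methods it invokes; real detectors work on the test's AST.

def has_assertion_roulette(assertions: list[dict]) -> bool:
    """More than one assertion and at least one without an explanation message."""
    return len(assertions) > 1 and any(not a.get("message") for a in assertions)

def is_eager_test(production_methods_called: set[str], threshold: int = 1) -> bool:
    """The test exercises more than `threshold` distinct production methods."""
    return len(production_methods_called) > threshold

assertions = [{"kind": "assertEquals", "message": None},
              {"kind": "assertTrue", "message": None}]
print(has_assertion_roulette(assertions))      # True
print(is_eager_test({"push", "pop", "peek"}))  # True
```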
-
A Large-Scale Empirical Exploration on Refactoring Activities in Open Source Software Projects
Vassallo, Carmine,
Grano, Giovanni,
Palomba, Fabio,
Gall, Harald,
and Bacchelli, Alberto
Science of Computer Programming
2019
[Abstract]
[PDF]
Refactoring is a well-established practice that aims at improving the internal structure of a software system without changing its external behavior. Existing literature provides evidence of how and why developers perform refactoring in practice. In this paper, we continue this line of research by performing a large-scale empirical analysis of refactoring practices in 200 open source systems. Specifically, we analyze the change history of these systems at the commit level to investigate: (i) whether developers perform refactoring operations and, if so, which ones are the most diffused, (ii) when refactoring operations are applied, and (iii) which are the main developer-oriented factors leading to refactoring.
Based on our results, future research can focus on enabling automatic support for less frequent refactorings and on recommending refactorings based on the developer’s workload, project’s maturity and developer’s commitment to the project.
-
Lightweight Assessment of Test-Case Effectiveness using Source-Code-Quality Indicators
Grano, Giovanni,
Palomba, Fabio,
and Gall, Harald
IEEE Transactions on Software Engineering
2019
[Abstract]
[PDF]
[Slides]
Test cases are crucial to help developers prevent the introduction of software faults. Unfortunately, not all tests are properly designed or can effectively capture faults in production code. Some measures have been defined to assess test-case effectiveness: the most relevant one is the mutation score, which highlights the quality of a test by generating so-called mutants, i.e., variations of the production code that make it faulty and that the test is supposed to identify. However, previous studies revealed that mutation analysis is extremely costly and hard to use in practice. The approaches proposed by researchers so far have not been able to provide practical gains in terms of mutation testing efficiency. This leaves the problem of efficiently assessing test-case effectiveness still open. In this paper, we investigate a novel, orthogonal, and lightweight methodology to assess test-case effectiveness: in particular, we study the feasibility of exploiting production- and test-code-quality indicators to estimate the mutation score of a test case. We first select a set of 67 factors and study their relation with test-case effectiveness. Then, we devise a mutation score estimation model exploiting such factors and investigate its performance as well as its most relevant features. The key results of the study reveal that our estimation model, based only on static features, achieves 86% for both F-Measure and AUC-ROC. This means that we can estimate test-case effectiveness, using source-code-quality indicators, with high accuracy and without executing the tests. As a consequence, we can provide a practical approach that goes beyond the typical limitations of current mutation testing techniques.
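The estimation model is, in essence, a classifier trained on static production- and test-code factors and evaluated with F-Measure and AUC-ROC. A minimal sketch of that setup follows; the learner, the input file, and the column names are placeholders, not the study's 67 factors or its exact model.

```python
# Sketch: estimate test-case effectiveness (effective / not effective) from
# static code-quality indicators, evaluated with F-Measure and AUC-ROC.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score, roc_auc_score

data = pd.read_csv("test_factors.csv")  # hypothetical: one row per test case
X = data.drop(columns=["effective"])    # static production/test-code factors
y = data["effective"]                   # binary label derived from the mutation score

clf = RandomForestClassifier(n_estimators=100, random_state=42)
pred = cross_val_predict(clf, X, y, cv=10)
prob = cross_val_predict(clf, X, y, cv=10, method="predict_proba")[:, 1]

print("F-Measure:", f1_score(y, pred))
print("AUC-ROC:  ", roc_auc_score(y, prob))
```

Because only static features are used, such a model can be applied without running the tests, which is the practical gain the paper argues for.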
-
Branch Coverage Prediction in Automated Testing
Grano, Giovanni,
Titov, Timofey V.,
Panichella, Sebastiano,
and Gall, Harald C.
Journal of Software: Evolution and Process
2019
[Abstract]
[PDF]
Software testing is crucial in continuous integration (CI).
Ideally, at every commit, all the test cases should be executed and, moreover, new test cases should be generated for the new source code.
This is especially true in a Continuous Test Generation (CTG) environment, where the automatic generation of test cases is integrated into the continuous integration pipeline.
In this context, developers want to achieve a certain minimum level of coverage for every software build.
However, executing all the test cases and, moreover, generating new ones for all the classes at every commit is not feasible.
As a consequence, developers have to select which subset of classes has to be tested and/or targeted by test-case generation.
We argue that knowing a priori the branch coverage that can be achieved with test-data generation tools can help developers take informed decisions about those issues.
In this paper, we investigate the possibility of using source-code metrics to predict the coverage achieved by test-data generation tools.
We use four different categories of source-code features and assess the prediction on a large dataset involving more than 3,000 Java classes.
We compare different machine learning algorithms and conduct a fine-grained feature analysis aimed at investigating the factors that most impact the prediction accuracy.
Moreover, we extend our investigation to four different search-budgets.
Our evaluation shows that the best model achieves an average MAE of 0.15 and 0.21 on nested cross-validation over the different budgets, respectively on EvoSuite and Randoop. Finally, the discussion of the results demonstrates the relevance of coupling-related features for the prediction accuracy.
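The prediction task reduces to a regression from static class-level features to the branch coverage a tool reaches under a given search budget, scored with MAE under nested cross-validation. The following sketch illustrates that setup; the learner, hyper-parameter grid, and input columns are illustrative assumptions, not the models compared in the paper.

```python
# Sketch: predict the branch coverage achieved by a test-generation tool
# from source-code metrics; report MAE under nested cross-validation.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

data = pd.read_csv("class_metrics.csv")     # hypothetical: one row per Java class
X = data.drop(columns=["branch_coverage"])  # static source-code features
y = data["branch_coverage"]                 # coverage reached for a given budget

inner = GridSearchCV(GradientBoostingRegressor(),
                     {"n_estimators": [100, 300], "max_depth": [2, 3]},
                     cv=KFold(5, shuffle=True, random_state=0),
                     scoring="neg_mean_absolute_error")
outer = KFold(10, shuffle=True, random_state=0)
scores = cross_val_score(inner, X, y, cv=outer, scoring="neg_mean_absolute_error")
print("nested-CV MAE:", -scores.mean())
```

Repeating this per tool (EvoSuite, Randoop) and per search budget reproduces the shape of the evaluation described in the abstract.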
conference and workshop papers
2020
-
Pizza versus Pinsa: On the Perception and Measurability of Unit Test Code Quality
Grano, Giovanni,
De Iaco, Cristian,
Palomba, Fabio,
and Gall, Harald C.
In 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME)
2020
[Abstract]
[PDF]
[Slides]
Test cases are an essential asset to evaluate software quality. The research community has provided various alternatives to help developers assess the quality of tests, like code or mutation coverage. Despite the effort spent so far, however, little is known on how practitioners perceive unit test code quality and whether the existing metrics reflect their perception. This paper aims at addressing this knowledge gap. We first conduct semi-structured interviews and surveys with practitioners to establish a taxonomy of relevant factors for unit test quality and collect a dataset of tests rated by developers based on their perceived quality. Then, we devise a statistical model to measure how well the metrics available in the literature reflect the perceived quality of test cases. The findings of our study show that readability and maintainability are the key aspects for developers to diagnose the outcome of test cases and drive debugging activities. On the contrary, code coverage metrics are necessary but not sufficient to evaluate the capability of tests. Finally, we discover that the available metrics are effective in characterizing poor-quality tests, while limited in distinguishing high-quality ones.
2019
-
A New Dimension of Test Quality: Assessing and Generating Higher Quality Unit Test Cases
Grano, Giovanni
In Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis
2019
[Abstract]
[PDF]
Unit tests form the first line of defense against the introduction of bugs in software systems.
Therefore, their quality is of paramount importance to produce robust and reliable software.
To assess test quality, many organizations rely on metrics like code and mutation coverage.
However, such metrics are not always optimal to fulfill this purpose.
In my research, I want to make mutation testing scalable by devising a lightweight approach to estimate test effectiveness.
Moreover, I plan to introduce a new metric measuring test focus, as a proxy for the effort needed by developers to understand and maintain a test, that both complements code coverage to assess test quality and can be used to drive the automated generation of higher-quality tests.
-
On the Effectiveness of Manual and Automatic Unit Test Generation: Ten Years Later
Serra, Domenico,
Grano, Giovanni,
Palomba, Fabio,
Ferrucci, Filomena,
Gall, Harald,
and Bacchelli, Alberto
In Proceedings of the 16th International Conference on Mining Software Repositories
2019
[Abstract]
[PDF]
Good unit tests play a paramount role when it comes to fostering and evaluating software quality. However, writing effective tests is an extremely costly and time-consuming practice. To reduce such a burden for developers, researchers have devised ingenious techniques to automatically generate test suites for existing code bases. Nevertheless, it is still not clear how automatically generated test cases fare against manually written ones. In 2008, Bacchelli et al. conducted an initial case study comparing automatically and manually generated test suites. However, during the last ten years we have witnessed a huge amount of work on novel approaches and tools for automatic test generation. For this reason, in this paper we revisit their study using current tools, as well as complementing their research method by evaluating these tools' ability to find regressions.
2018
-
OCELOT: A Search-Based Test Data Generation Tool for C
Scalabrino, Simone,
Grano, Giovanni,
Di Nucci, Dario,
Guerra, Michele,
De Lucia, Andrea,
Gall, Harald C,
and Oliveto, Rocco
In Proceedings of the 33rd IEEE/ACM International Conference on Automated Software Engineering
2018
[Abstract]
[PDF]
[Slides]
Automatically generating test cases plays an important role in reducing the time spent by developers during the testing phase. In recent years, several approaches have been proposed to tackle this problem: amongst others, search-based techniques have been shown to be particularly promising. In this paper we describe Ocelot, a search-based tool for the automatic generation of test cases in C. Ocelot allows practitioners to write skeletons of test cases for their programs and researchers to easily implement and experiment with new approaches for automatic test-data generation. We show that Ocelot achieves higher coverage than a competitive tool in 81% of the cases. Ocelot is publicly available to support both researchers and practitioners.
-
An Empirical Investigation on the Readability of Manual and Generated Test Cases
Grano, Giovanni,
Scalabrino, Simone,
Oliveto, Rocco,
and Gall, Harald
In Proceedings of the 26th International Conference on Program Comprehension, ICPC
2018
[Abstract]
[PDF]
[Slides]
Software testing is one of the most crucial tasks in the typical development process. Developers are usually required to write unit test cases for the code they implement. Since this is a time-consuming task, in recent years many approaches and tools for automatic test case generation, such as EvoSuite, have been introduced. Nevertheless, developers have to maintain and evolve tests to sustain the changes in the source code; therefore, having readable test cases is important to ease such a process.
However, it is still not clear whether developers make an effort in writing readable unit tests. Therefore, in this paper, we conduct an exploratory study comparing the readability of manually written test cases with that of the classes they test. Moreover, we deepen this analysis by looking at the readability of automatically generated test cases. Our results suggest that developers tend to neglect the readability of test cases and that automatically generated test cases are generally even less readable than manually written ones.
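At its core, the comparison is a paired analysis of readability scores for the same classes. A hedged sketch of such a comparison is given below; the readability scores are assumed to come from an external readability model, and the file and column names are made up for illustration.

```python
# Sketch: compare the readability of manually written vs. generated tests for
# the same classes with a paired, non-parametric test. One row per class, with
# readability scores produced by an external readability model.
import pandas as pd
from scipy.stats import wilcoxon

scores = pd.read_csv("readability_scores.csv")  # hypothetical input
manual, generated = scores["manual_test"], scores["generated_test"]

stat, p = wilcoxon(manual, generated)
print(f"Wilcoxon p={p:.4f}; "
      f"median manual={manual.median():.2f}, generated={generated.median():.2f}")
```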
-
BECLoMA: Augmenting Stack Traces with User Review Information
Pelloni, Lucas,
Grano, Giovanni,
Ciurumelea, Adelina,
Palomba, Fabio,
Panichella, Sebastiano,
and Gall, Harald
In Software Analysis, Evolution and Reengineering (SANER), 2018 IEEE 25th International Conference on
2018
[Abstract]
[PDF]
[Slides]
Mobile devices such as smartphones, tablets and wearables are changing the way we do things, radically modifying our approach to technology. To withstand the fierce competition in the mobile market, developers need to deliver high-quality applications in a short release cycle.
Therefore, to maximize their market success, they aim to reveal and fix bugs as soon as possible.
For this reason, researchers and practitioners have proposed testing tools to automate the process of bug discovery and fixing.
In the mobile development context, the content of user reviews represents an unmatched source for developers seeking defects in their applications. However, no prior work has explored the adoption of the information available in user reviews for testing purposes.
In this demo we present BECLoMA, a tool that enables the integration of user feedback in the testing process of mobile apps.
BECLoMA links information from testing tools and user reviews, presenting developers with an augmented testing report that combines stack traces with user review information referring to the same crash.
We show that BECLoMA not only facilitates the diagnosis and fixing of app bugs, but also presents additional benefits: it eases the usage of testing tools and automates the analysis of user reviews from the Google Play Store.
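The linking step can be pictured as matching crash-related terms extracted from a stack trace against the text of user reviews. The sketch below is a deliberately simplified keyword-overlap version of that idea; BECLoMA's actual text-mining pipeline is richer, and all names and data here are invented.

```python
# Sketch: link a stack trace to user reviews that mention the same crash.
# Keyword extraction and matching are simplified for illustration.
import re

def crash_terms(stack_trace: str) -> set[str]:
    """Exception name and message words from the first line of a stack trace."""
    first = stack_trace.strip().splitlines()[0]
    return {w.lower() for w in re.findall(r"[A-Za-z]+", first) if len(w) > 3}

def matching_reviews(stack_trace: str, reviews: list[str], min_overlap: int = 2):
    terms = crash_terms(stack_trace)
    for review in reviews:
        words = {w.lower() for w in re.findall(r"[A-Za-z]+", review)}
        if len(terms & words) >= min_overlap:
            yield review

trace = ("java.lang.NullPointerException: camera preview was null\n"
         "  at com.app.Camera.start(Camera.java:42)")
reviews = ["App crashes when I open the camera preview", "Love the new design!"]
print(list(matching_reviews(trace, reviews)))
```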
-
How High Will It Be? Using Machine Learning Models to Predict Branch Coverage in Automated Testing
Grano, Giovanni,
Timov, Timofey,
Panichella, Sebastiano,
and Gall, Harald
In Machine Learning Techniques for Software Quality Evaluation (MaLTeSQuE), IEEE Workshop on
2018
[Abstract]
[PDF]
[Slides]
Software testing is a crucial component in modern continuous integration development environments.
Ideally, at every commit, all the system's test cases should be executed and, moreover, new test cases should be generated for the new code.
This is especially true in a Continuous Test Generation (CTG) environment, where the automatic generation of test cases is integrated into the continuous integration pipeline.
Furthermore, developers want to achieve a minimum level of coverage for every build of their systems.
Since executing all the test cases and generating new ones for all the classes at every commit are both infeasible, developers have to select which subset of classes has to be tested.
In this context, knowing a priori the branch coverage that can be achieved with test data generation tools might give useful indications for answering such a question.
In this paper, we take the first steps towards the definition of machine learning models to predict the branch coverage achieved by test data generation tools.
We conduct a preliminary study considering well-known code metrics as features.
Despite the simplicity of these features, our results show that using machine learning to predict branch coverage in automated testing is a viable and feasible option.
-
Exploring the Integration of User Feedback in Automated Testing of Android Applications
Grano, Giovanni,
Ciurumelea, Adelina,
Palomba, Fabio,
Panichella, Sebastiano,
and Gall, Harald
In Software Analysis, Evolution and Reengineering (SANER), 2018 IEEE 25th International Conference on
2018
[Abstract]
[PDF]
[Slides]
The intense competition characterizing mobile application marketplaces forces developers to create and maintain high-quality mobile apps in order to ensure their commercial success and acquire new users. This has motivated the research community to propose solutions that automate the testing process of mobile apps. However, the main problem of current testing tools is that they generate redundant and random inputs that are insufficient to properly simulate human behavior, thus leaving feature and crash bugs undetected until they are encountered by users. To cope with this problem, we conjecture that the information available in user reviews, which previous work showed to be effective for maintenance and evolution problems, can be successfully exploited to identify the main issues users experience while using mobile applications, e.g., GUI problems and crashes.
In this paper we provide initial insights in this direction, investigating (i) what types of user feedback can actually be exploited for testing purposes, (ii) how complementary user feedback and automated testing tools are when detecting crash bugs or errors, and (iii) whether an automated system able to monitor crash-related information reported in user feedback is sufficiently accurate. The results of our study, involving 11,296 reviews of 8 mobile applications, show that user feedback can be exploited to provide contextual details about errors or exceptions detected by automated testing tools. Moreover, user reviews also help detect bugs that would remain uncovered when relying on testing tools only. Finally, the accuracy of the proposed automated monitoring system demonstrates the feasibility of our vision, i.e., integrating user feedback into the testing process.
2017
-
Android apps and user feedback: a dataset for software evolution and quality improvement
Grano, Giovanni,
Di Sorbo, Andrea,
Mercaldo, Francesco,
Visaggio, Corrado Aaron,
Canfora, Gerardo,
and Panichella, Sebastiano
In WAMA@ESEC/SIGSOFT FSE
2017
[Abstract]
[PDF]
[Slides]
Nowadays, Android represents the most popular mobile platform, with a market share of around 80%. Previous research showed that the data contained in user reviews and in the code change history of mobile apps represent a rich source of information for reducing software maintenance and development effort and increasing customers' satisfaction. Stemming from this observation, we present in this paper a large dataset of Android applications belonging to 23 different app categories, which provides an overview of the types of feedback users report on the apps and documents the evolution of the related code metrics. The dataset contains about 395 applications from the F-Droid repository, including around 600 versions, 280,000 user reviews and more than 450,000 user feedback entries (extracted with specific text mining approaches). Furthermore, for each app version in our dataset, we employed the Paprika tool and developed several Python scripts to detect 8 different code smells and compute 22 code quality indicators. The paper discusses the potential usefulness of the dataset for future research in the field.
2016
-
Search-Based Testing of Procedural Programs: Iterative Single-Target or Multi-target Approach?
Scalabrino, Simone,
Grano, Giovanni,
Di Nucci, Dario,
Oliveto, Rocco,
and De Lucia, Andrea
In International Symposium on Search Based Software Engineering
2016
[Abstract]
[PDF]
[Slides]
In the context of testing of Object-Oriented (OO) software systems, researchers have recently proposed search-based approaches to automatically generate whole test suites by considering simultaneously all targets (e.g., branches) defined by the coverage criterion (multi-target approach). The goal of whole suite approaches is to overcome the problem of wasting search budget that iterative single-target approaches (which iteratively generate test cases for each target) can encounter in case of infeasible targets. However, whole suite approaches have not been implemented and experimented with in the context of procedural programs. In this paper we present OCELOT (Optimal Coverage sEarch-based tooL for sOftware Testing), a test data generation tool for C programs which implements both a state-of-the-art whole suite approach and an iterative single-target approach designed for a parsimonious use of the search budget. We also present an empirical study conducted on 35 open-source C programs to compare the two approaches implemented in OCELOT. The results indicate that the iterative single-target approach provides higher efficiency while achieving the same or an even higher level of coverage than the whole suite approach.
book chapters
2018
-
Data-Driven Decisions and Actions in Today’s Software Development
Gall, Harald,
Alexandru, Carol,
Ciurumelea, Adelina,
Grano, Giovanni,
Laaber, Christoph,
Panichella, Sebastiano,
Proksch, Sebastian,
Schermann, Gerald,
Vassallo, Carmine,
and Zhao, Jitong
In The Essence of Software Engineering
2018
[Abstract]
Today’s software development is all about data: data about the software product itself, about the process and its different stages, about the customers and markets, about the development, the testing, the integration, the deployment, or the runtime aspects in the cloud. We use static and dynamic data of various kinds and quantities to analyze market feedback, feature impact, code quality, architectural design alternatives, or effects of performance optimizations. Development environments are no longer limited to IDEs in a desktop application or the like but span the Internet using live programming environments such as Cloud9 or large-volume repositories such as BitBucket, GitHub, GitLab, or StackOverflow. Software development has become “live” in the cloud, be it the coding, the testing, or the experimentation with different product options on the Internet. The inherent complexity puts a further burden on developers, since they need to stay alert when constantly switching between tasks in different phases. Research has been analyzing the development process, its data and stakeholders, for decades and is working on various tools that can help developers in their daily tasks to improve the quality of their work and their productivity. In this chapter, we critically reflect on the challenges faced by developers in a typical release cycle, identify inherent problems of the individual phases, and present the current state of the research that can help overcome these issues.
thesis
2015
-
Grano, G. (2015). Implementation and comparison of novel techniques for automated search based test data generation [Thesis]. University of Salerno.
[PDF]
These documents are made available as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each copyright holder. These works may not be reposted without the explicit permission of the copyright holder.