A Delta-Debugging Approach to Assessing the Resilience of Actor Programs through Run-time Test Perturbations
The actor model is increasingly prevalent among distributed applications. This programming model organises applications into fully isolated processes that communicate through asynchronous messaging. Supported by frameworks such as Akka and Orleans, it is believed to facilitate the realisation of responsive, elastic and resilient distributed applications.
While these frameworks do provide abstractions for implementing resilience, it remains up to developers to use them correctly and to test that their implementation recovers from anticipated failures. As manually exploring the reaction to every possible failure scenario is infeasible, there is a need for automated means of testing the resilience of a distributed application.
We present the first automated approach to testing the resilience of actor programs. Our approach perturbs the execution of existing test cases and leverages delta debugging to explore all failure scenarios more efficiently. Moreover, we present a further optimisation that uses causality to prune away redundant perturbations and speed up the exploration. However, the effectiveness of this optimisation is sensitive to the program's organisation and the actual location of the fault. Our experimental evaluation shows that our approach can speed up resilience testing by a factor of four compared to random exploration.
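To illustrate how delta debugging can narrow down a set of test perturbations, the following is a minimal sketch of the generic ddmin algorithm, not the implementation evaluated in this work. It assumes a perturbation is an opaque value and that a hypothetical oracle `fails` reruns a test case under the given perturbations and reports whether the test fails. The names `ddmin`, `fails` and the labels `p0` to `p7` are illustrative only.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def ddmin(items: Sequence[T], fails: Callable[[List[T]], bool]) -> List[T]:
    """Return a 1-minimal sublist of `items` for which `fails` still holds."""
    current = list(items)
    assert fails(current), "the full perturbation set must make the test fail"
    n = 2  # granularity: number of chunks the current set is split into
    while len(current) >= 2:
        chunk = max(1, len(current) // n)
        subsets = [current[i:i + chunk] for i in range(0, len(current), chunk)]
        reduced = False
        for i, subset in enumerate(subsets):
            # Try the chunk on its own first.
            if fails(subset):
                current, n, reduced = subset, 2, True
                break
            # Otherwise try everything except this chunk. With only two
            # chunks the complement is the other chunk, which is tested anyway.
            complement = [x for j, s in enumerate(subsets) if j != i for x in s]
            if len(subsets) > 2 and fails(complement):
                current, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(current):
                break  # already at single-element granularity: 1-minimal
            n = min(len(current), n * 2)
    return current


# Hypothetical oracle: the test only fails when both p2 and p5 are applied.
perturbations = [f"p{i}" for i in range(8)]


def fails(applied: List[str]) -> bool:
    return {"p2", "p5"} <= set(applied)


minimal = ddmin(perturbations, fails)
```

In this toy setting the search isolates the two perturbations that jointly trigger the failure without enumerating all 256 subsets. In a real setting each call to the oracle corresponds to one perturbed test execution, so reducing the number of calls is what makes the exploration cheaper.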