Marcelo Amaral
OSSEU 2023
Cloud-based microservice architecture has become a powerful mechanism for helping organizations scale operations by accelerating the pace of change at minimal cost. With cloud-based applications being accessed from diverse geographies, there is a need for round-the-clock fault monitoring to prevent outages or limit their impact. Pinpointing the source(s) of faults in cloud applications is a challenging problem because of complex interdependencies among applications, middleware, and hardware infrastructure, all of which may be subject to frequent and dynamic updates. In this paper, we propose a lightweight fault localization technique that can reduce human effort and dependency on domain knowledge when localizing observable operational faults. We model multivariate error-rate time series using minimal runtime logs to infer causal relationships between the golden signal errors (error rates) and microservice errors, and thereby discover a ranked list of possible faulty components. Our experimental results show that our system can localize operational faults with high accuracy (F1 = 88.4%), underscoring the effectiveness of using golden signal error rates in fault localization.
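As a rough illustration of ranking components from error-rate time series, the sketch below scores each microservice by how strongly its error counts precede or accompany the golden signal error rate. It is not the method of the paper: lagged Pearson correlation is swapped in as a simple stand-in for causal inference, and the function names (`pearson`, `lagged_score`, `rank_suspects`), the service names, and the sample data are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def lagged_score(cause, effect, max_lag):
    """Best non-negative correlation between cause[t - lag] and effect[t], lag = 0..max_lag."""
    best = 0.0
    for lag in range(max_lag + 1):
        shifted_cause = cause[: len(cause) - lag]
        shifted_effect = effect[lag:]
        if len(shifted_cause) < 3:  # too few points to correlate
            break
        best = max(best, pearson(shifted_cause, shifted_effect))
    return best


def rank_suspects(golden_errors, service_errors, max_lag=2):
    """Return (service, score) pairs sorted from most to least suspicious."""
    scores = {
        name: lagged_score(series, golden_errors, max_lag)
        for name, series in service_errors.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical per-interval error counts (illustrative data, not from the paper).
golden = [0, 0, 1, 0, 0, 5, 9, 8, 1, 0]
services = {
    "checkout": [0, 0, 0, 0, 4, 8, 7, 1, 0, 0],  # spikes one interval before golden
    "catalog": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],   # background noise
    "auth": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],      # no errors at all
}
ranking = rank_suspects(golden, services)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In this toy data the hypothetical `checkout` service ranks first because its error spike leads the golden signal spike by one interval. Correlation alone cannot separate cause from common effect, which is why a real system would need a proper causal model over the multivariate series.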
Max Bloomfield, Amogh Wasti, et al.
ITherm 2025
Shubhi Asthana, Aly Megahed, et al.
ICSOC 2020
Nikoleta Iliakopoulou, Jovan Stojkovic, et al.
MICRO 2025