Understanding the latency distribution of cloud object storage systems
Published in Journal of Parallel and Distributed Computing, Volume 128, 2019, Pages 71-83, ISSN 0743-7315, 2019
Recommended Citation: Yi Su, Dan Feng, Yu Hua, Zhan Shi, Understanding the latency distribution of cloud object storage systems, Journal of Parallel and Distributed Computing, Volume 128, 2019, Pages 71-83, ISSN 0743-7315, https://doi.org/10.1016/j.jpdc.2019.01.008.
Abstract: As a fundamental cloud service, the cloud object storage system stores and retrieves millions or even billions of read-heavy data objects. Serving for a massive amount of requests each day makes the response latency be a vital component of user experiences. Timeout is also a key issue as it has a great impact on the response latency. Due to the lack of suitable understanding on the distribution of the response latency and the occurrence of timeouts, current practice is to use overprovision resources to meet a Service Level Agreement (SLA) on response latency. Hence, firstly, we build a performance model for the cloud object storage system, which assumes no timeout occurring. Our model predicts the percentage of requests meeting an SLA, in the context of complicated disk operations, event-driven programming model and requests waiting for being accept()-ed. Secondly, we propose a method that determines whether or not our model is applicable by predicting the occurrence of timeouts. We evaluate our model with a production system using a real-world trace. In a variety of scenarios, our model reduces the prediction errors by up to 90% compared with baseline models, and its overall average error is 2.63%. Moreover, we could also accurately predict the applicability of our model.