JZUS - Journal of Zhejiang University SCIENCE

Frontiers of Information Technology & Electronic Engineering

Accepted manuscript available online (unedited version)

ShortTail: taming tail latency for erasure-code-based in-memory systems

Author(s): Yun TENG, Zhiyue LI, Jing HUANG, Guangyan ZHANG
Affiliation(s): College of Computer Science and Technology, Jilin University, Changchun 130012, China
Corresponding email(s): gyzh@tsinghua.edu.cn
Key Words: Erasure code; In-memory system; Node fail-slow; Small write; Tail latency


Yun TENG, Zhiyue LI, Jing HUANG, Guangyan ZHANG. ShortTail: taming tail latency for erasure-code-based in-memory systems[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2100566

@article{FITEE.2100566,
title="ShortTail: taming tail latency for erasure-code-based in-memory systems",
author="Yun TENG and Zhiyue LI and Jing HUANG and Guangyan ZHANG",
journal="Frontiers of Information Technology \& Electronic Engineering",
year="in press",
publisher="Zhejiang University Press \& Springer",
doi="10.1631/FITEE.2100566"
}

%0 Journal Article
%T ShortTail: taming tail latency for erasure-code-based in-memory systems
%A Yun TENG
%A Zhiyue LI
%A Jing HUANG
%A Guangyan ZHANG
%J Frontiers of Information Technology & Electronic Engineering
%P 1646-1657
%@ 2095-9184
%D in press
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.2100566

TY  - JOUR
T1  - ShortTail: taming tail latency for erasure-code-based in-memory systems
A1  - Yun TENG
A1  - Zhiyue LI
A1  - Jing HUANG
A1  - Guangyan ZHANG
JO  - Frontiers of Information Technology & Electronic Engineering
SP  - 1646
EP  - 1657
SN  - 2095-9184
Y1  - in press
PB  - Zhejiang University Press & Springer
DO  - 10.1631/FITEE.2100566
ER  -


Abstract: In-memory systems with erasure coding (EC) enabled are widely used to achieve high performance and data availability. However, as clusters grow in scale, server-level fail-slow problems are becoming increasingly frequent, and they can cause long tail latency. The impact of long tail latency is further amplified in EC-based systems, because a single EC operation may have to wait for several sub-operations to complete synchronously. In this paper, we propose an EC-enabled in-memory storage system called ShortTail, which achieves consistent performance and low latency for both reads and writes. First, ShortTail uses a lightweight request monitor to track the performance of each memory node and identify any fail-slow node. Second, ShortTail selectively performs degraded reads and redirected writes to avoid accessing fail-slow nodes. Finally, ShortTail adopts an adaptive write strategy to reduce the write amplification of small writes. We implement ShortTail on top of Memcached and compare it with two baseline systems. The experimental results show that ShortTail reduces the P99 tail latency by up to 63.77%; it also significantly improves the median latency and average latency.
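
As a rough illustration of the first two mechanisms named in the abstract, the Python sketch below shows one way a request monitor could flag fail-slow nodes and how a read could then be steered away from them. It is not the authors' implementation. The names `RequestMonitor`, `plan_read`, and `slow_op_probability`, the sliding-window median rule, and the threshold `slow_factor` are all hypothetical. The sketch also assumes an MDS code (for example Reed-Solomon) in which any k of the k + m chunks of a stripe suffice to decode, and it assumes nodes become slow independently of one another.

```python
from collections import defaultdict, deque
from statistics import median


def slow_op_probability(p_slow: float, fanout: int) -> float:
    """Probability that an operation waiting on `fanout` sub-operations hits
    at least one slow node, assuming each node is slow independently with
    probability `p_slow`. Shows why fan-out amplifies tail latency."""
    return 1.0 - (1.0 - p_slow) ** fanout


class RequestMonitor:
    """Hypothetical monitor: keeps a sliding window of latencies per node."""

    def __init__(self, window: int = 64, slow_factor: float = 3.0):
        self.slow_factor = slow_factor
        self.samples = defaultdict(lambda: deque(maxlen=window))

    def record(self, node: str, latency_us: float) -> None:
        """Store the latency observed for one completed request."""
        self.samples[node].append(latency_us)

    def fail_slow_nodes(self) -> set:
        """Flag nodes whose median latency exceeds slow_factor times the
        median of all per-node medians (an assumed detection rule)."""
        medians = {n: median(s) for n, s in self.samples.items() if s}
        if not medians:
            return set()
        cluster_median = median(medians.values())
        return {n for n, m in medians.items()
                if m > self.slow_factor * cluster_median}


def plan_read(data_nodes, parity_nodes, slow):
    """Choose k nodes to read a stripe from. Returns (nodes, degraded).

    If every data node is healthy, read the data chunks directly.
    Otherwise substitute healthy parity nodes and decode (degraded read).
    If fewer than k healthy nodes exist, fall back to the normal read."""
    k = len(data_nodes)
    healthy_data = [n for n in data_nodes if n not in slow]
    if len(healthy_data) == k:
        return healthy_data, False
    healthy_parity = [n for n in parity_nodes if n not in slow]
    chosen = (healthy_data + healthy_parity)[:k]
    if len(chosen) < k:
        return list(data_nodes), False
    return chosen, True


if __name__ == "__main__":
    monitor = RequestMonitor()
    for _ in range(10):
        for node in ("a", "b", "d", "e"):
            monitor.record(node, 100.0)
        monitor.record("c", 1000.0)  # node "c" is ten times slower
    slow = monitor.fail_slow_nodes()
    print(slow, plan_read(["a", "b", "c"], ["d", "e"], slow))
```

The trade-off in this sketch is that a degraded read spends extra CPU time on decoding in exchange for not waiting on the slow node. Redirected writes and the adaptive small-write strategy are not modeled here, because the abstract does not describe how they work.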

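ShortTail: reducing tail latency in erasure-coded in-memory storage systems

Yun TENG^1,3, Zhiyue LI^2,4, Jing HUANG^1,3, Guangyan ZHANG^2,4
¹College of Computer Science and Technology, Jilin University, Changchun 130012, China
²Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
³Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
⁴Beijing National Research Center for Information Science and Technology (Tsinghua University), Beijing 100084, China
Summary: Erasure-coded in-memory storage systems are widely used to obtain high performance and high data availability. However, as clusters keep growing, server-level fail-slow problems occur more and more often and lead to long tail latency. In erasure-coded systems the impact of long tail latency is amplified further, because a single erasure-coding operation may depend on the synchronous completion of multiple sub-operations. This paper proposes ShortTail, an erasure-coded in-memory storage system that delivers stable performance and low read and write latency. First, ShortTail uses a lightweight request monitor to track the performance of each memory node, so that fail-slow nodes are detected promptly. Second, ShortTail selectively performs degraded reads and redirected writes to avoid accessing fail-slow nodes. Finally, ShortTail adopts an adaptive write strategy to reduce the write amplification of small writes. ShortTail is implemented on top of Memcached and compared with two other systems. The experimental results show that ShortTail reduces the 99th-percentile latency by up to 63.77% and significantly improves the median and average latency.

Key words: Erasure code; In-memory storage system; Node fail-slow; Small write; Tail latency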


