
CLC number: TP303

On-line Access: 2018-02-06

Received: 2017-09-25

Revision Accepted: 2017-12-25

Crosschecked: 2017-12-27




Frontiers of Information Technology & Electronic Engineering  2017 Vol.18 No.12 P.1940-1971


ONFS: a hierarchical hybrid file system based on memory, SSD, and HDD for high performance computers

Author(s):  Xin Liu, Yu-tong Lu, Jie Yu, Peng-fei Wang, Jie-ting Wu, Ying Lu

Affiliation(s):  School of Computer, National University of Defense Technology, Changsha 410073, China

Corresponding email(s):   xliu@cse.unl.edu, ytlu@nudt.edu.cn, yujie@nscc-tj.gov.cn, wangpf@nscc-tj.gov.cn, jwu@cse.unl.edu, ylu@cse.unl.edu

Key Words:  High performance computing, Hierarchical hybrid storage system, Distributed metadata management, Data migration

Xin Liu, Yu-tong Lu, Jie Yu, Peng-fei Wang, Jie-ting Wu, Ying Lu. ONFS: a hierarchical hybrid file system based on memory, SSD, and HDD for high performance computers[J]. Frontiers of Information Technology & Electronic Engineering, 2017, 18(12): 1940-1971.

Publisher: Zhejiang University Press & Springer

ISSN: 2095-9184

DOI: 10.1631/FITEE.1700626

As supercomputers develop towards exascale, the number of compute cores is increasing dramatically, making larger and more complex applications possible. Large-scale applications, workflow applications, and their checkpointing require substantial input/output (I/O) bandwidth and extremely low latency, posing a serious challenge to high performance computing (HPC) storage systems. Current underlying storage systems based on hard disk drives (HDDs) are increasingly unable to meet the requirements of next-generation exascale supercomputers. To address this challenge, we propose a hierarchical hybrid storage system, the on-line and near-line file system (ONFS). It uses dynamic random access memory (DRAM) and solid state drives (SSDs) in compute nodes, together with HDDs in storage servers, to build a three-level storage system in a unified namespace. It supports portable operating system interface (POSIX) semantics and provides high bandwidth, low latency, and huge storage capacity. In this paper, we present the technical details of distributed metadata management, the strategy for borrowing and returning memory, data consistency, parallel access control, and the mechanisms guiding downward and upward migration in ONFS. We implement an ONFS prototype on the TH-1A supercomputer and conduct experiments to test its I/O performance and scalability. The results show that the single-thread and multi-thread 'read'/'write' bandwidths are 6-fold and 5-fold better than those of HDD-based Lustre, respectively. The I/O bandwidth of data-intensive applications in ONFS can be 6.35 times that in Lustre.


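Summary: As supercomputers advance rapidly towards the Eflops scale and the number of compute cores grows sharply, larger and more complex applications become possible. Large-scale scientific computing, new workflow applications, and checkpoint operations all require storage systems with very high bandwidth and low latency, which poses a severe technical challenge to high performance storage systems. Current disk-based underlying storage systems struggle to meet the requirements of the new generation of Eflops supercomputers and applications. To this end, this paper proposes ONFS (on-line and near-line file system), a hierarchical hybrid storage system based on compute-node memory, solid state drives, and disks. It has three storage levels and a unified namespace, supports the portable operating system interface (POSIX) protocol, and provides high bandwidth, low latency, and very large storage capacity. The paper analyzes in detail distributed metadata management, the memory borrowing and returning strategy, data consistency, parallel access control, and the mechanisms for downward migration and proactive upward pre-migration. An ONFS prototype was implemented on the Tianhe-1A supercomputer, and its input/output (I/O) performance and scalability were tested. The results show that single-thread and multi-thread read/write performance is 6 times and 5 times higher, respectively, than that of disk-based Lustre. Compared with Lustre, typical data-intensive applications running on ONFS achieve a 6.35-fold I/O speedup.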





