Internet of Things and Cloud Computing


Survey the Storage Systems Used in HPC and BDA Ecosystems

Advances in the high-performance computing (HPC) and big data analytics (BDA) ecosystems demand a better understanding of storage systems in order to plan effective solutions. The amount of data generated by an ever-growing number of devices has increased tremendously over the years. To let applications access data more efficiently for computation, HPC and BDA ecosystems adopt different storage systems, each with its own pros and cons. It is therefore worthwhile to explore the storage systems used in HPC and BDA respectively, and to understand how such systems handle data consistency and fault tolerance at massive scale. In this paper, we survey four storage systems: Lustre, Ceph, HDFS, and CockroachDB. Lustre and HDFS are among the most prominent file systems in the HPC and BDA ecosystems. Ceph is an emerging file system that is being used by supercomputers. CockroachDB is a NewSQL system, an approach that is being used in industry for BDA applications. The study helps us understand the underlying architecture of these storage systems and the building blocks used to create them. It also gives an overview of the protocols and mechanisms used for data storage, data access, data consistency, fault tolerance, and recovery after failover. The comparative study will help system designers understand the key features and architectural goals of these storage systems so that they can select better storage solutions.

HPC, BDA, Storage Systems, CockroachDB, HDFS, Ceph, Lustre

APA Style

Priyam Shah, Jie Ye, Xian-He Sun. (2022). Survey the Storage Systems Used in HPC and BDA Ecosystems. Internet of Things and Cloud Computing, 10(1), 12-28. https://doi.org/10.11648/j.iotcc.20221001.12

ACS Style

Priyam Shah; Jie Ye; Xian-He Sun. Survey the Storage Systems Used in HPC and BDA Ecosystems. Internet Things Cloud Comput. 2022, 10(1), 12-28. doi: 10.11648/j.iotcc.20221001.12

AMA Style

Priyam Shah, Jie Ye, Xian-He Sun. Survey the Storage Systems Used in HPC and BDA Ecosystems. Internet Things Cloud Comput. 2022;10(1):12-28. doi: 10.11648/j.iotcc.20221001.12

Copyright © 2022 Authors retain the copyright of this article.
This article is an open access article distributed under the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
