
CS298 Proposal

High performance document store in Rust

Ishaan Aggarwal (ishaan.aggarwal@sjsu.edu)

Advisor: Dr. Chris Pollett

Committee Members: Dr. Robert Chun, Akshay Kajale

Abstract:

In recent times, the volume of data produced and consumed daily has grown dramatically. According to a recent survey by Visual Capitalist, in one internet minute roughly 4 million Google searches are performed, around 700,000 hours of video are watched on Netflix, around 2 million snaps are created on Snapchat, and 4.5 million videos are uploaded to and viewed on YouTube. Beyond these, many other essential services, such as banking, social media, and government record keeping, require error-free storage and processing of data. What all of these examples have in common is the need for proper storage and retrieval of data, and that is what databases provide. In this project, we aim to create a high-performance database system in Rust. Yioop! is a GPLv3, open-source PHP search engine. It provides many of the features of larger search portals, such as search results, media services, social groups, blogs, wikis, web site development, and monetization via ads. This project aims to meet the needs of such a search engine, enabling it to search the web pages it crawls and return results as quickly as possible, by upgrading the existing storage implementation to one written in Rust. The improvements and additional features include an on-disk Linear Hash table implementation, a parser/writer for WARC web crawl files, a Database Connector, and a Query Processing Engine.

CS297 Results

  • Created a simple single-node key-value store in Rust that can receive a request for a document by key and return the corresponding document
  • Implemented linear hashing to store data on disk in the Rust-based key-value document store (a sketch of the addressing and split rule follows this list)
  • Developed a WebARChive (WARC) file reader/writer in Rust that allows parsing, filtering, and writing of WARC files
  • Implemented a consistent hashing mechanism in the key-value store to balance data over multiple instances
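
To make the linear hashing mechanism above concrete, the following is a minimal in-memory sketch of the addressing and split rule an on-disk implementation can follow. The struct, field names, and overflow threshold are illustrative assumptions rather than the deliverable's actual API, and the real store keeps its buckets on disk rather than in vectors.

    // Minimal in-memory sketch of linear hashing (illustrative only).
    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    struct LinearHashTable {
        buckets: Vec<Vec<(String, String)>>, // each bucket holds (key, value) pairs
        initial_buckets: usize,              // N: bucket count at level 0
        level: usize,                        // current round of splitting
        next: usize,                         // index of the next bucket to split
        max_bucket_len: usize,               // split trigger (stands in for page overflow)
    }

    impl LinearHashTable {
        fn new(initial_buckets: usize) -> Self {
            LinearHashTable {
                buckets: vec![Vec::new(); initial_buckets],
                initial_buckets,
                level: 0,
                next: 0,
                max_bucket_len: 4,
            }
        }

        fn hash(key: &str) -> u64 {
            let mut h = DefaultHasher::new();
            key.hash(&mut h);
            h.finish()
        }

        // Core addressing rule: use h mod (N * 2^level); if that bucket has
        // already been split this round, rehash with the next level's modulus.
        fn bucket_index(&self, key: &str) -> usize {
            let h = Self::hash(key) as usize;
            let mut idx = h % (self.initial_buckets << self.level);
            if idx < self.next {
                idx = h % (self.initial_buckets << (self.level + 1));
            }
            idx
        }

        fn insert(&mut self, key: &str, value: &str) {
            let idx = self.bucket_index(key);
            self.buckets[idx].push((key.to_string(), value.to_string()));
            if self.buckets[idx].len() > self.max_bucket_len {
                self.split();
            }
        }

        fn get(&self, key: &str) -> Option<&String> {
            self.buckets[self.bucket_index(key)]
                .iter()
                .find(|(k, _)| k == key)
                .map(|(_, v)| v)
        }

        // Split only the bucket at `next`, redistributing its records between
        // `next` and the newly appended bucket.
        fn split(&mut self) {
            self.buckets.push(Vec::new());
            let old = std::mem::take(&mut self.buckets[self.next]);
            let new_modulus = self.initial_buckets << (self.level + 1);
            for (k, v) in old {
                let idx = Self::hash(&k) as usize % new_modulus;
                self.buckets[idx].push((k, v));
            }
            self.next += 1;
            if self.next == self.initial_buckets << self.level {
                self.level += 1;
                self.next = 0;
            }
        }
    }

    fn main() {
        let mut table = LinearHashTable::new(4);
        for i in 0..50 {
            table.insert(&format!("doc{}", i), &format!("payload{}", i));
        }
        assert_eq!(table.get("doc7"), Some(&"payload7".to_string()));
        println!("buckets = {}, level = {}, next = {}",
                 table.buckets.len(), table.level, table.next);
    }

Because each overflow splits only the single bucket at the split pointer, the table grows one bucket at a time and never has to rehash the whole file in one step.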

Proposed Schedule

Week 1 (08-24-2021 - 08-31-2021): Finalize the deliverables; draft the proposal for CS298
Week 2-3 (09-01-2021 - 09-14-2021): Augment the data store's Linear Hash table implementation to improve performance with LRU caching, indexing, etc. (see the cache sketch after this schedule)
Week 4-6 (09-15-2021 - 10-05-2021): Create a driver for establishing database connectivity with other programming languages
Week 7-9 (10-06-2021 - 10-26-2021): Create a Query Processing Engine for this data store
Week 10-12 (10-27-2021 - 11-16-2021): Port over parts of the Yioop! data storage to Rust and run comparison tests against other current storage implementations
Week 13-16 (11-17-2021 - 12-06-2021): Finish the CS298 report and prepare slides for the presentation
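
As one illustration of the Week 2-3 caching work, the sketch below shows the kind of LRU page cache that could sit in front of the on-disk buckets, keeping recently used pages in memory and evicting the least recently used one when full. This is a hypothetical design for exposition only; the key type, capacity policy, and the linear-time recency update are simplifications, not the project's actual code.

    // Hypothetical LRU cache sketch; keys stand in for bucket/page ids.
    use std::collections::{HashMap, VecDeque};

    struct LruCache<V> {
        capacity: usize,
        map: HashMap<u64, V>,
        order: VecDeque<u64>, // front = most recently used
    }

    impl<V> LruCache<V> {
        fn new(capacity: usize) -> Self {
            LruCache { capacity, map: HashMap::new(), order: VecDeque::new() }
        }

        // Move a key to the front of the recency queue.
        fn touch(&mut self, key: u64) {
            if let Some(pos) = self.order.iter().position(|&k| k == key) {
                self.order.remove(pos);
            }
            self.order.push_front(key);
        }

        fn get(&mut self, key: u64) -> Option<&V> {
            if self.map.contains_key(&key) {
                self.touch(key);
                self.map.get(&key)
            } else {
                None
            }
        }

        fn put(&mut self, key: u64, value: V) {
            if self.map.len() >= self.capacity && !self.map.contains_key(&key) {
                // Evict the least recently used entry (back of the queue).
                if let Some(lru) = self.order.pop_back() {
                    self.map.remove(&lru);
                }
            }
            self.map.insert(key, value);
            self.touch(key);
        }
    }

    fn main() {
        let mut cache: LruCache<String> = LruCache::new(2);
        cache.put(1, "page one".into());
        cache.put(2, "page two".into());
        cache.get(1);                       // page 1 becomes most recent
        cache.put(3, "page three".into());  // evicts page 2
        assert!(cache.get(2).is_none());
        assert!(cache.get(1).is_some());
        println!("cache holds {} pages", cache.map.len());
    }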

Key Deliverables:

  • Software
    • An on-disk Data Storage system implemented in Rust with an underlying Linear Hashing mechanism
    • A WARC parser/writer tool with parsing, filtering, indexing, and writing capabilities
    • A Database Connectivity driver so that other programming languages and applications can interact with this Data Storage system
    • A Query Processing Engine for the translation, optimization, and evaluation of queries against this Data Storage system
  • Report
    • CS298_Report
    • CS298_Presentation

Innovations and Challenges

  • Our system will be unique because currently only one other DBMS, Noria, is implemented in Rust. Noria is a relational database system that focuses on building the caching mechanism into the database itself to speed up performance. We aim to create a NoSQL document store in Rust. We will also leverage caching within the database, but in a different manner than Noria does.
  • Current WARC parsers implemented in Rust support only basic reading and writing of WARC files and lack useful features such as filtering WARC records based on CDX index files (a CDX-driven filtering sketch follows this list).
  • Database Connectivity drivers are difficult to implement because a query submitted from any programming language that interfaces with the driver must be converted from the calling language into a form the database can understand. Our solution will provide a common interface in the form of APIs that can be used from any language (a minimal connector sketch follows this list).
  • Query Processing Engines translate high-level queries into their actual low-level execution, and their design and implementation differ per DBMS. No Query Processing Engine has previously been created for a Linear Hash table based document store in Rust, so the one built for our Data Storage system will be the first of its kind (a toy query pipeline sketch follows this list).
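
For the CDX-based filtering mentioned above, the sketch below assumes a common CDX layout in which the last three fields of each line are the record length, the byte offset, and the WARC file name; real CDX files declare their field order in a header line, and the file names used here are hypothetical. The idea is to seek straight to a matching record instead of scanning the whole archive.

    // Sketch of CDX-driven filtering of WARC records (assumed field layout).
    use std::fs::File;
    use std::io::{self, BufRead, BufReader, Read, Seek, SeekFrom};

    /// One CDX entry: where a captured record lives inside a WARC file.
    struct CdxEntry {
        url_key: String,
        length: u64,
        offset: u64,
        warc_file: String,
    }

    fn parse_cdx_line(line: &str) -> Option<CdxEntry> {
        if line.starts_with(' ') {
            return None; // skip the " CDX ..." header line
        }
        let fields: Vec<&str> = line.split_whitespace().collect();
        if fields.len() < 4 {
            return None; // malformed line
        }
        let n = fields.len();
        Some(CdxEntry {
            url_key: fields[0].to_string(),
            length: fields[n - 3].parse().ok()?,
            offset: fields[n - 2].parse().ok()?,
            warc_file: fields[n - 1].to_string(),
        })
    }

    /// Read only the raw bytes of one record by seeking to its CDX offset,
    /// instead of scanning the whole (possibly multi-gigabyte) WARC file.
    fn read_record(entry: &CdxEntry) -> io::Result<Vec<u8>> {
        let mut f = File::open(&entry.warc_file)?;
        f.seek(SeekFrom::Start(entry.offset))?;
        let mut buf = vec![0u8; entry.length as usize];
        f.read_exact(&mut buf)?;
        Ok(buf)
    }

    fn main() -> io::Result<()> {
        // Hypothetical file name for illustration.
        let cdx = BufReader::new(File::open("crawl.cdx")?);
        for line in cdx.lines() {
            let line = line?;
            if let Some(entry) = parse_cdx_line(&line) {
                // Filter: only pull records whose URL key mentions "example".
                if entry.url_key.contains("example") {
                    let raw = read_record(&entry)?;
                    println!("{}: {} bytes at offset {}", entry.url_key, raw.len(), entry.offset);
                }
            }
        }
        Ok(())
    }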
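
One possible shape for the language-agnostic connector described above is a small line-based protocol that any language with a socket library can speak. The wire format ("GET <key>" / "PUT <key> <value>"), the port, and the in-memory map standing in for the store are all assumptions for illustration, not the driver's actual interface.

    // Sketch of a tiny line-based connector protocol in front of a stand-in store.
    use std::collections::HashMap;
    use std::io::{BufRead, BufReader, Write};
    use std::net::{TcpListener, TcpStream};

    fn handle(stream: TcpStream, store: &mut HashMap<String, String>) -> std::io::Result<()> {
        let mut reader = BufReader::new(stream.try_clone()?);
        let mut writer = stream;
        let mut line = String::new();
        while reader.read_line(&mut line)? > 0 {
            let parts: Vec<&str> = line.trim_end().splitn(3, ' ').collect();
            let reply = match parts.as_slice() {
                ["GET", key] => store.get(*key).cloned().unwrap_or_else(|| "NOT_FOUND".into()),
                ["PUT", key, value] => {
                    store.insert(key.to_string(), value.to_string());
                    "OK".to_string()
                }
                _ => "ERR unknown command".to_string(),
            };
            writer.write_all(reply.as_bytes())?;
            writer.write_all(b"\n")?;
            line.clear();
        }
        Ok(())
    }

    fn main() -> std::io::Result<()> {
        let mut store: HashMap<String, String> = HashMap::new();
        let listener = TcpListener::bind("127.0.0.1:7878")?; // port is arbitrary
        for conn in listener.incoming() {
            // Single-threaded for brevity; a real driver would multiplex clients.
            handle(conn?, &mut store)?;
        }
        Ok(())
    }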
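
Finally, the query processing pipeline can be pictured as translate, plan, then evaluate. The toy query syntax and plan enum below are invented for illustration, and a HashMap stands in for the Linear Hash table; the point is that a key-equality query becomes a single bucket probe while anything else falls back to a scan.

    // Sketch of a translate -> plan -> evaluate pipeline over a key-addressed store.
    use std::collections::HashMap;

    /// A physical plan: a direct point lookup (the cheap path a hash table
    /// makes possible) or a full scan with a predicate.
    enum Plan {
        PointLookup(String),
        ScanContains(String),
    }

    /// "Translate" a tiny query string into a plan.
    /// Supported forms: `GET <key>` and `SCAN CONTAINS <substring>`.
    fn plan_query(query: &str) -> Option<Plan> {
        let parts: Vec<&str> = query.split_whitespace().collect();
        match parts.as_slice() {
            ["GET", key] => Some(Plan::PointLookup(key.to_string())),
            ["SCAN", "CONTAINS", needle] => Some(Plan::ScanContains(needle.to_string())),
            _ => None,
        }
    }

    /// Evaluate the plan against the store.
    fn execute(plan: &Plan, store: &HashMap<String, String>) -> Vec<(String, String)> {
        match plan {
            // O(1) expected: exactly one bucket probe in the real hash table.
            Plan::PointLookup(key) => store
                .get(key)
                .map(|v| vec![(key.clone(), v.clone())])
                .unwrap_or_default(),
            // Fallback path: visit every record when no key is given.
            Plan::ScanContains(needle) => store
                .iter()
                .filter(|(_, v)| v.contains(needle.as_str()))
                .map(|(k, v)| (k.clone(), v.clone()))
                .collect(),
        }
    }

    fn main() {
        let mut store = HashMap::new();
        store.insert("doc1".to_string(), "rust document store".to_string());
        store.insert("doc2".to_string(), "php search engine".to_string());

        for q in ["GET doc1", "SCAN CONTAINS search"] {
            if let Some(plan) = plan_query(q) {
                println!("{:?} -> {:?}", q, execute(&plan, &store));
            }
        }
    }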
