Chris Pollett >
Students > [Bio] [Blog] [Code] |
CS298 ProposalHigh performance document store in RustIshaan Aggarwal (ishaan.aggarwal@sjsu.edu) Advisor: Dr. Chris Pollett Committee Members: Dr. Robert Chun, Akshay Kajale Abstract:In recent times, the volume of data that is being produced and consumed daily, has increased aggressively. According to a recent survey by Visual Capitalist, in one internet minute, roughly 4 million Google searches happen, around 700,000 hours' worth of videos are watched on Netflix, around 2 million snaps are created on Snapchat, and 4.5 million videos are uploaded and viewed on YouTube. Apart from these, there are many other essential services which require error-free storage and processing of data like banking, social media, government records, etc. The above examples have one thing in common, proper storage and retrieval of data, and this is what databases are required for. In this project, we aim for creating a high-performance database system in Rust. Yioop! is a GPLv3, open source, PHP search engine software. It provides many features as done by larger search portals like, search results, media services, social groups, blogs, wikis, web site development, and monetization via ads. This project aims at fulfilling the needs of such a search engine to enable searching inconsistent web pages and show the search results as quickly as possible by upgrading the existing storage implementation to one in Rust. Improvements and additional features included are an on disk Linear Hash table implementation, WARC web crawled file parser/writer, a Database Connector, and a Query Processing Engine. CS297 Results
Proposed Schedule
Key Deliverables:
Innovations and Challenges
References:[1] Enbody, R. J., Du, H. C. (1988). Dynamic hashing schemes. ACM Computing Surveys, 20(2), 850-113 [2] Strodl Stephan, Beran Peter, and Rauber Andreas Migrating content in WARC files 2009 The 9th International Web Archiving Workshop (IWAW 2009) Proceedings, (2009), 43 - 49 [3] Alam, S., Nelson, M. L., Van de Sompel, H., Balakireva, L. L., Shankar, H. and Rosenthal, D. S. (2015). Web archive profiling through cdx summarization. Research and Advanced Technology for Digital Libraries, 3-14. doi:10.1007/978-3-319-24592-8_1 [4] Karger, David and Lehman, Eric and Leighton, Tom and Levine, Matthew and Lewin, Daniel and Panigrahy, Rina. (2001). Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web. Proceedings of the twenty-ninth annual ACM symposium on Theory of computing. ACM STOC. 10.1145/258533.258660. [5] Markoff, J. (2005, September 27). How many pages in Google? Take a guess. The New York Times. https://www.nytimes.com/2005/09/27/technology/how-many-pages-in-google-take-a-guess.html. [6] The size of the world Wide web (the internet). WorldWideWebSize.com | The size of the World Wide Web (The Internet). (n.d.). https://www.worldwidewebsize.com/. [7] [2012] Corbett, J., Dean, J., Epstein, M. et al. Spanner: Google's globally distributed database. In Proceedings of OSDI'12: Tenth Symposium on Operating System Design and Implementation, Hollywood, CA, October 2012. [8] [2019] Khan, S., Liu, X., Ali, S. A., and Alam, M. (2019). Storage solutions for big data systems: A qualitative study and comparison. arXiv preprint arXiv:1904.11498. [9] [2020] Okazaki, S. (2020). An experimental study of memory management in Rust programming for big data processing (Doctoral dissertation, Boston University). |