CS298 Proposal

Node.js based Document Store for Web Crawling

David Bui (david.bui01@sjsu.edu)

Advisor: Dr. Chris Pollett

Committee Members: Dr. Katerina Potika, Dr. Ben Reed

Abstract:

In 2005, Google indexed over 5 billion web pages to speed up search queries [5]. In 2021, Google indexes over 36 billion web pages [6]. The volume of data on the internet, along with the number of users relying on search engines, increases every year. This necessitates better and faster tools so that search engine curators can keep pace with the growth of the internet. Yioop! is an open source PHP search engine that is constantly evolving to meet the challenge of searching an ever-expanding internet. The aim of this project is to upgrade Yioop's search engine data storage implementation to a JavaScript implementation in Node.js. Improvements and additional features include an on-disk Linear Hash Table implementation, a parser/writer for WARC web crawl files, a Database Connector, and a Query Processing Engine.

CS297 Results

Created a simple key-value store in Node.js that a client webpage can interact with.
Implemented a Node.js key-value store that makes use of an underlying Linear Hash Table to store data on disk.
Developed a WARC file parser that allows for the parsing, filtering, and creation of WARC files. Also includes a feature to generate graph datasets for community detection from Common Crawl's web crawl datasets.
Developed a Consistent Hashing system that utilizes the Linear Hash Table implementation to balance data across multiple server instances.
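
As a rough illustration of the Linear Hash Table idea behind these results, the following is a minimal in-memory sketch and not the project's on-disk implementation. The class name LinearHashTable, the FNV-1a hash function, and the maxLoad split threshold are all assumptions chosen for the example. It shows the two defining pieces of linear hashing: the address calculation that picks between two moduli, and the split of exactly one bucket at a time.

```javascript
// Illustrative in-memory linear hash table (hypothetical names throughout).
class LinearHashTable {
  constructor(initialBuckets = 2, maxLoad = 2) {
    this.n0 = initialBuckets; // bucket count at level 0
    this.level = 0;           // how many times the table has fully doubled
    this.next = 0;            // index of the next bucket to split
    this.maxLoad = maxLoad;   // average records per bucket that triggers a split
    this.size = 0;
    this.buckets = Array.from({ length: initialBuckets }, () => []);
  }

  // 32-bit FNV-1a string hash; an assumption, any uniform hash would do.
  static hash(key) {
    let h = 0x811c9dc5;
    for (let i = 0; i < key.length; i++) {
      h ^= key.charCodeAt(i);
      h = Math.imul(h, 0x01000193) >>> 0;
    }
    return h;
  }

  // Use h mod (n0 * 2^level); if that bucket was already split in this
  // round, use the next level's modulus instead.
  address(key) {
    const h = LinearHashTable.hash(key);
    let i = h % (this.n0 * 2 ** this.level);
    if (i < this.next) i = h % (this.n0 * 2 ** (this.level + 1));
    return i;
  }

  get(key) {
    const entry = this.buckets[this.address(key)].find(([k]) => k === key);
    return entry ? entry[1] : undefined;
  }

  put(key, value) {
    const bucket = this.buckets[this.address(key)];
    const entry = bucket.find(([k]) => k === key);
    if (entry) { entry[1] = value; return; }
    bucket.push([key, value]);
    this.size++;
    if (this.size / this.buckets.length > this.maxLoad) this.split();
  }

  // Split only the bucket at `next`, so the table grows by one bucket.
  split() {
    const old = this.buckets[this.next];
    this.buckets[this.next] = [];
    this.buckets.push([]);
    this.next++;
    if (this.next === this.n0 * 2 ** this.level) { this.level++; this.next = 0; }
    for (const [k, v] of old) this.buckets[this.address(k)].push([k, v]);
  }
}

const table = new LinearHashTable();
table.put("https://example.com/", "page record");
console.log(table.get("https://example.com/")); // prints "page record"
```

Because a split touches a single bucket, growth cost is spread across inserts rather than paid in one full rehash, which is the property that makes the structure attractive for on-disk storage.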

Proposed Schedule

Week 1: Aug 24 - Aug 31	First Week Meeting and Reviewing CS298 Proposal
Week 2 - 3: Sep 1 - Sep 14	Augment the data store's Linear Hash table implementation to improve performance with LRU caching, indexing, etc.
Week 4 - 6: Sep 15 - Oct 5	Create a driver for establishing Database Connectivity with other programming languages.
Week 7 - 9: Oct 6 - Oct 26	Create a Query Processing Engine for this data store.
Week 10 - 12: Oct 27 - Nov 16	Port over parts of the Yioop! data storage to Node.js and run comparison tests with other current storage implementations.
Week 13 - 16: Nov 17 - Dec 6	Finish CS298 Report and prepare slides for the presentation.
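
The LRU caching planned for Weeks 2 - 3 could take roughly the following shape. This is a minimal sketch under assumptions: the class name LRUCache and its get/set interface are hypothetical, and it relies only on the fact that a JavaScript Map iterates its keys in insertion order. In the data store, such a cache would presumably sit in front of bucket reads from disk.

```javascript
// Minimal LRU cache built on Map's insertion-order iteration (hypothetical API).
class LRUCache {
  constructor(capacity) {
    this.capacity = capacity;
    this.map = new Map();
  }

  // Return the cached value and mark the key as most recently used.
  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    this.map.delete(key);
    this.map.set(key, value); // re-insert so the key moves to the newest position
    return value;
  }

  // Insert or update a key, evicting the least recently used key when full.
  set(key, value) {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.capacity) {
      const oldest = this.map.keys().next().value; // first key is the oldest
      this.map.delete(oldest);
    }
  }
}

const cache = new LRUCache(2);
cache.set("a", 1);
cache.set("b", 2);
cache.get("a");    // "a" becomes most recently used
cache.set("c", 3); // evicts "b"
console.log(cache.get("b")); // prints undefined
```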

Key Deliverables:

Software
- An on-disk Data Storage system implemented fully in Node.js with an underlying Linear Hash Table structure.
- A WARC Parser/Writer tool with parsing, filtering, indexing, writing, and graph dataset generation features.
- A Database Connectivity driver so that other programming languages/applications can interact with this Data Storage system.
- A Query Processing Engine for the translation, optimization and evaluation of queries for this Data Storage system.
Report
- CS298 Report
- CS298 Presentation

Innovations and Challenges

Our system will be unique because the only other DBMS built on Node.js, HarperDB, is only partially implemented in Node.js. HarperDB currently relies on external C libraries for parts of its storage implementation, while we aim to create our system fully in Node.js.
Current WARC parsers implemented in JavaScript only support basic parsing and do not support useful features such as the creation of new WARC files or the filtering of WARC file records.
Database Connectivity drivers are difficult to implement because queries submitted from any programming language that interfaces with the driver must be converted from the calling language into a form the database can understand. This requires handling each programming language that interfaces with our Data Storage system as a unique case.
Query Processing Engines translate high-level queries into the actual low-level execution of the query. The design and implementation of a Query Processing Engine differs per DBMS. No Query Processing Engine has been created before for a Linear Hash Table based data store in Node.js. Thus, the one created for our Data Storage system will be the first of its kind.
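
To make the translate-then-evaluate flow concrete, here is a toy sketch under stated assumptions. The GET/PUT/DELETE mini-language, the plan object shape, and the function names translate, evaluate, and runQuery are all hypothetical and are not the proposed engine's design. A built-in Map stands in for the data store, and the optimization step is omitted entirely.

```javascript
// Translate a textual query into a plan object (hypothetical mini-language).
function translate(query) {
  const [op, key, ...rest] = query.trim().split(/\s+/);
  if (key === undefined) throw new Error(`Missing key in query: "${query}"`);
  switch (op.toUpperCase()) {
    case "GET":
      return { op: "get", key };
    case "PUT":
      return { op: "put", key, value: rest.join(" ") };
    case "DELETE":
      return { op: "delete", key };
    default:
      throw new Error(`Unknown operation: ${op}`);
  }
}

// Evaluate a plan against any store exposing get/set/delete (a Map here).
function evaluate(plan, store) {
  switch (plan.op) {
    case "get":
      return store.get(plan.key);
    case "put":
      store.set(plan.key, plan.value);
      return true;
    case "delete":
      return store.delete(plan.key);
    default:
      throw new Error(`Unknown plan operation: ${plan.op}`);
  }
}

// Convenience wrapper: high-level query text in, low-level store result out.
function runQuery(query, store) {
  return evaluate(translate(query), store);
}

const store = new Map();
runQuery("PUT url1 crawled page body", store);
console.log(runQuery("GET url1", store)); // prints "crawled page body"
```

Keeping the plan as plain data between the two functions leaves room for a later optimization pass that rewrites plans before they are evaluated.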

References:

[1] Enbody, R. J., & Du, H. C. (1988). Dynamic hashing schemes. ACM Computing Surveys, 20(2), 85-113.

[2] Strodl, S., Beran, P., & Rauber, A. (2009). Migrating content in WARC files. The 9th International Web Archiving Workshop (IWAW 2009) Proceedings, 43-49.

[3] Alam, S., Nelson, M. L., Van de Sompel, H., Balakireva, L. L., Shankar, H., & Rosenthal, D. S. (2015). Web archive profiling through CDX summarization. Research and Advanced Technology for Digital Libraries, 3-14. doi:10.1007/978-3-319-24592-8_1

[4] Karger, D., Lehman, E., Leighton, T., Levine, M., Lewin, D., & Panigrahy, R. (2001). Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web. Proceedings of the twenty-ninth annual ACM symposium on Theory of computing. ACM STOC. doi:10.1145/258533.258660

[5] Markoff, J. (2005, September 27). How many pages in Google? Take a guess. The New York Times. https://www.nytimes.com/2005/09/27/technology/how-many-pages-in-google-take-a-guess.html.

[6] The size of the World Wide Web (The Internet). WorldWideWebSize.com. (n.d.). https://www.worldwidewebsize.com/.