CS297 Proposal
Node.js based Document Store for Web Crawling
David Bui (david.bui01@sjsu.edu)
Advisor: Dr. Chris Pollett
Description:
Node.js is an open-source runtime environment that allows JavaScript code to run outside of a web browser, in a server environment. The goal of this project is to create an efficient Node.js document key store that supports indexing of Web ARChive (WARC) files and custom JARC (JSON and WARC combination) files generated by web crawling. To achieve this, a portion of the Yioop! search engine's PHP data storage implementation will be migrated to Node.js and adapted to take advantage of Node.js features.
Schedule:
Week 1 | 02-02-2021 - 02-09-2021 | Finalize topic and draft proposal
Week 2 | 02-09-2021 - 02-16-2021 | Initial system setup and discuss project deliverables
Week 3 | 02-16-2021 - 02-23-2021 | Work on Del 1 and work on paper [1] presentation
Week 4 | 02-23-2021 - 03-02-2021 | Finish work on Del 1
Week 5 | 03-02-2021 - 03-09-2021 | Begin work on Del 2
Week 6 | 03-09-2021 - 03-16-2021 | Continue work on Del 2 and work on paper [2] presentation
Week 7 | 03-16-2021 - 03-23-2021 | Finish work on Del 2
Week 8 | 03-23-2021 - 03-30-2021 | Begin work on Del 3
Week 9 | 03-30-2021 - 04-06-2021 | Spring Break
Week 10 | 04-06-2021 - 04-13-2021 | Continue work on Del 3 and work on paper [3] presentation
Week 11 | 04-13-2021 - 04-20-2021 | Finish work on Del 3
Week 12 | 04-20-2021 - 04-27-2021 | Begin work on Del 4 and work on paper [4] presentation
Week 13 | 04-27-2021 - 05-04-2021 | Continue work on Del 4
Week 14 | 05-04-2021 - 05-11-2021 | Finish work on Del 4 and work on Final Report
Week 15 | 05-11-2021 - 05-18-2021 | Wrap up any remaining work
Deliverables:
The full project will be completed by the end of CS298. The following will
be completed by the end of CS297:
1. A Node.js server for a simple document key store.
2. A modified document key store that implements a linear hashing scheme.
3. A WARC and JARC file reader/writer.
4. A consistent hashing scheme for the document key store.
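As an illustration of deliverable 4, the following is a minimal sketch of a consistent hashing ring in Node.js, in the spirit of [4]. It is not the planned implementation and is not taken from Yioop!. The names ConsistentHashRing and ringPosition, the store names, the use of MD5, and the number of virtual nodes per store are all assumptions made only for this example.

```javascript
const crypto = require('crypto');

// Hypothetical helper: map a string to a 32-bit position on the ring.
// MD5 is used here only as a convenient, evenly spreading hash (an assumption).
function ringPosition(key) {
  const digest = crypto.createHash('md5').update(key).digest();
  return digest.readUInt32BE(0);
}

// Hypothetical class: a ring of virtual nodes kept sorted by position.
class ConsistentHashRing {
  constructor(replicas = 100) {
    this.replicas = replicas; // virtual nodes per physical store
    this.ring = [];           // sorted array of { pos, node }
  }

  // Place `replicas` virtual nodes for this store on the ring.
  addNode(node) {
    for (let i = 0; i < this.replicas; i++) {
      this.ring.push({ pos: ringPosition(`${node}#${i}`), node });
    }
    this.ring.sort((a, b) => a.pos - b.pos);
  }

  // Drop every virtual node that belongs to this store.
  removeNode(node) {
    this.ring = this.ring.filter((entry) => entry.node !== node);
  }

  // Return the store owning `key`: the first virtual node at or after
  // the key's position, wrapping around to the start of the ring.
  getNode(key) {
    if (this.ring.length === 0) return null;
    const pos = ringPosition(key);
    let lo = 0;
    let hi = this.ring.length;
    while (lo < hi) { // binary search for the first entry with entry.pos >= pos
      const mid = (lo + hi) >>> 1;
      if (this.ring[mid].pos < pos) lo = mid + 1;
      else hi = mid;
    }
    return this.ring[lo % this.ring.length].node;
  }
}

// Usage: route a document key to one of three hypothetical stores.
const demoRing = new ConsistentHashRing();
['store-a', 'store-b', 'store-c'].forEach((name) => demoRing.addNode(name));
console.log(demoRing.getNode('https://www.example.com/index.html'));
```

In this sketch, adding a store only reassigns the keys that now fall to the new store's virtual nodes. All other keys stay where they were, which is the property that makes the scheme attractive for spreading a document key store across machines.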
References:
[1] Enbody, R. J., & Du, H. C. (1988). Dynamic hashing schemes. ACM Computing Surveys, 20(2), 85-113.
[2] Strodl, S., Beran, P., & Rauber, A. (2009). Migrating content in WARC files. The 9th International Web Archiving Workshop (IWAW 2009) Proceedings, 43-49.
[3] Alam, S., Nelson, M. L., Van de Sompel, H., Balakireva, L. L., Shankar, H., & Rosenthal, D. S. (2015). Web archive profiling through CDX summarization. Research and Advanced Technology for Digital Libraries, 3-14. doi:10.1007/978-3-319-24592-8_1
[4] Karger, D., Lehman, E., Leighton, T., Levine, M., Lewin, D., & Panigrahy, R. (1997). Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing (STOC). doi:10.1145/258533.258660