CS297 Proposal

Node.js based Document Store for Web Crawling

David Bui (david.bui01@sjsu.edu)

Advisor: Dr. Chris Pollett

Description:

Node.js is an open-source runtime environment that allows JavaScript code to run outside of a web browser, for example on a server. The goal of this project is to create an efficient Node.js document key store that supports indexing of Web ARChive (WARC) files and custom JARC (a combination of JSON and WARC) files generated by web crawling. To achieve this, a portion of the Yioop! search engine's PHP data storage implementation will be migrated to Node.js, taking advantage of Node's features.
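To make the idea of indexing WARC files concrete, here is a minimal sketch in Node.js. It assumes uncompressed WARC/1.0 records (a version line, named header fields, a blank line, a block of Content-Length bytes, then two CRLFs) and ignores folded header lines. The function names parseWarcRecord, indexWarc, and makeRecord are hypothetical and are not taken from Yioop! or from the project code. makeRecord also omits fields that a conforming record requires, such as WARC-Record-ID and WARC-Date. The JARC format is not sketched because the proposal does not define it.

```javascript
// Parse one uncompressed WARC record from the start of a Buffer.
// Assumption: headers are not folded across lines.
function parseWarcRecord(buf) {
  const headerEnd = buf.indexOf('\r\n\r\n');
  if (headerEnd === -1) throw new Error('incomplete WARC header');
  const lines = buf.toString('utf8', 0, headerEnd).split('\r\n');
  const version = lines[0]; // e.g. "WARC/1.0"
  const headers = {};
  for (const line of lines.slice(1)) {
    const sep = line.indexOf(':');
    if (sep === -1) continue;
    const name = line.slice(0, sep).trim().toLowerCase();
    headers[name] = line.slice(sep + 1).trim();
  }
  const length = Number.parseInt(headers['content-length'], 10);
  if (Number.isNaN(length)) throw new Error('missing Content-Length');
  const start = headerEnd + 4; // skip the blank line after the headers
  const block = buf.subarray(start, start + length);
  // A record ends with two CRLFs after its block.
  return { version, headers, block, next: start + length + 4 };
}

// Build a Map from WARC-Target-URI to the byte offset of its record.
function indexWarc(buf) {
  const index = new Map();
  let offset = 0;
  while (offset < buf.length) {
    const rec = parseWarcRecord(buf.subarray(offset));
    const uri = rec.headers['warc-target-uri'];
    if (uri) index.set(uri, offset);
    offset += rec.next;
  }
  return index;
}

// Simplified writer used only for illustration (hypothetical helper).
function makeRecord(uri, body) {
  const block = Buffer.from(body, 'utf8');
  const header =
    'WARC/1.0\r\n' +
    'WARC-Type: resource\r\n' +
    `WARC-Target-URI: ${uri}\r\n` +
    `Content-Length: ${block.length}\r\n\r\n`;
  return Buffer.concat([Buffer.from(header), block, Buffer.from('\r\n\r\n')]);
}
```

With an index like this, a key store could look up a URL, seek to the stored offset, and parse only that record instead of scanning the whole archive.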

Schedule:

Week 1: 02-02-2021 - 02-09-2021 Finalize topic and draft proposal
Week 2: 02-09-2021 - 02-16-2021 Initial system setup and discuss project deliverables
Week 3: 02-16-2021 - 02-23-2021 Work on Del 1 and work on paper [1] presentation
Week 4: 02-23-2021 - 03-02-2021 Finish work on Del 1
Week 5: 03-02-2021 - 03-09-2021 Begin work on Del 2
Week 6: 03-09-2021 - 03-16-2021 Continue Del 2 work and work on paper [2] presentation
Week 7: 03-16-2021 - 03-23-2021 Finish work on Del 2
Week 8: 03-23-2021 - 03-30-2021 Begin work on Del 3
Week 9: 03-30-2021 - 04-06-2021 Spring Break
Week 10: 04-06-2021 - 04-13-2021 Continue work on Del 3 and work on paper [3] presentation
Week 11: 04-13-2021 - 04-20-2021 Finish Del 3
Week 12: 04-20-2021 - 04-27-2021 Begin work on Del 4 and work on paper [4] presentation
Week 13: 04-27-2021 - 05-04-2021 Continue work on Del 4
Week 14: 05-04-2021 - 05-11-2021 Finish Del 4 and work on Final Report
Week 15: 05-11-2021 - 05-18-2021 Wrap up any work

Deliverables:

The full project will be done when CS298 is completed. The following will be done by the end of CS297:

1. A server in Node.js for a simple document key store.

2. Modify the document key store by implementing a linear hashing scheme.

3. A WARC file and JARC file reader/writer.

4. Implement a consistent hashing scheme for the document key store.
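As a rough illustration of the linear hashing scheme named in deliverable 2, the sketch below shows an in-memory key store that grows one bucket at a time. It is not the project's implementation: the class name LinearHashStore, the FNV-1a string hash, the initial bucket count, and the load-factor threshold are all assumptions chosen for the example. Buckets are plain arrays rather than disk pages.

```javascript
// 32-bit FNV-1a string hash (an illustrative choice of hash function).
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // keep the value unsigned
  }
  return h;
}

class LinearHashStore {
  constructor(initialBuckets = 2, maxLoad = 2) {
    this.n0 = initialBuckets; // bucket count at level 0
    this.level = 0;           // number of completed doubling rounds
    this.split = 0;           // next bucket to be split
    this.maxLoad = maxLoad;   // average entries per bucket before a split
    this.size = 0;
    this.buckets = Array.from({ length: initialBuckets }, () => []);
  }

  // Buckets before the split pointer were already split in this round,
  // so they are addressed with the next level's modulus.
  bucketIndex(key) {
    const h = fnv1a(key);
    let idx = h % (this.n0 * 2 ** this.level);
    if (idx < this.split) idx = h % (this.n0 * 2 ** (this.level + 1));
    return idx;
  }

  get(key) {
    const entry = this.buckets[this.bucketIndex(key)].find((e) => e[0] === key);
    return entry ? entry[1] : undefined;
  }

  put(key, value) {
    const bucket = this.buckets[this.bucketIndex(key)];
    const entry = bucket.find((e) => e[0] === key);
    if (entry) {
      entry[1] = value; // overwrite an existing key
      return;
    }
    bucket.push([key, value]);
    this.size++;
    if (this.size / this.buckets.length > this.maxLoad) this.splitNext();
  }

  // Split exactly one bucket: the one at the split pointer, which is not
  // necessarily the bucket that just received the new entry.
  splitNext() {
    const old = this.buckets[this.split];
    this.buckets[this.split] = [];
    this.buckets.push([]);
    this.split++;
    if (this.split === this.n0 * 2 ** this.level) {
      this.level++; // every bucket of this round has been split
      this.split = 0;
    }
    for (const [k, v] of old) this.buckets[this.bucketIndex(k)].push([k, v]);
  }
}
```

The point of the scheme is that the table never rehashes everything at once: each split moves only the entries of a single bucket, so the cost of growth is spread across many insertions.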

References:

[1] Enbody, R. J., & Du, H. C. (1988). Dynamic hashing schemes. ACM Computing Surveys, 20(2), 85-113.

[2] Strodl, S., Beran, P., & Rauber, A. (2009). Migrating content in WARC files. Proceedings of the 9th International Web Archiving Workshop (IWAW 2009), 43-49.

[3] Alam, S., Nelson, M. L., Van de Sompel, H., Balakireva, L. L., Shankar, H., & Rosenthal, D. S. (2015). Web archive profiling through CDX summarization. Research and Advanced Technology for Digital Libraries, 3-14. doi:10.1007/978-3-319-24592-8_1

[4] Karger, D., Lehman, E., Leighton, T., Levine, M., Lewin, D., & Panigrahy, R. (1997). Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing (STOC). doi:10.1145/258533.258660