CS297 Proposal
Node.js based Document Store for Web Crawling
David Bui (david.bui01@sjsu.edu)
Advisor: Dr. Chris Pollett

Description: Node.js is an open source runtime environment that allows JavaScript code to run outside of a web browser, in a server environment. The goal of this project is to create an efficient Node.js document key store that allows for the indexing of Web ARChive (WARC) and custom JARC (JSON and WARC combination) files generated from web crawling. To achieve this, a portion of the Yioop! search engine's PHP data storage implementation will be migrated to Node.js, where it can take advantage of Node's features.

Schedule:
Deliverables: The full project will be done when CS298 is completed. The following will be done by the end of CS297:
1. A server in Node.js for a simple document key store.
2. A modified document key store that implements a linear hashing scheme.
3. A WARC file and JARC file reader/writer.
4. A consistent hashing scheme for the document key store.

References:
[1] Enbody, R. J., & Du, H. C. (1988). Dynamic hashing schemes. ACM Computing Surveys, 20(2), 85-113.
[2] Strodl, S., Beran, P., & Rauber, A. (2009). Migrating content in WARC files. Proceedings of the 9th International Web Archiving Workshop (IWAW 2009), 43-49.
[3] Alam, S., Nelson, M. L., Van de Sompel, H., Balakireva, L. L., Shankar, H., & Rosenthal, D. S. (2015). Web archive profiling through CDX summarization. Research and Advanced Technology for Digital Libraries, 3-14. doi:10.1007/978-3-319-24592-8_1
[4] Karger, D., Lehman, E., Leighton, T., Levine, M., Lewin, D., & Panigrahy, R. (1997). Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing (STOC). doi:10.1145/258533.258660