    [CS297 Proposal]





    [CS297_Report - PDF]

    [CS298 Proposal]


    [CS298_Report - PDF]

    [CS298_Presentation - PDF]

Project Blog

CS 298

Week 12 (11-09-2021 - 11-16-2021): This week I started drafting the report. I went through a few reports by previous students and shortlisted three that I liked. Following their structure, and building on my CS297 report, I started work on my own report. I also corrected the offset counter so that it reads the offset by combining three WARC records, since those three together form one relevant CDX record. Last time, the professor suggested that I write my code so that it is flexible enough to handle different WARC file formats. I am not sure how the offset handling would work in that case, unless it is hardcoded.

Week 11 (11-02-2021 - 11-09-2021): As discussed with the professor last week, I worked around the offset problem by keeping a running offset counter variable and using it to fill in the CDX records as I read the WARC records. This week I got stuck on figuring out how the contents of the WARC files are arranged. It looks as if each record inside the WARC file comes in a different format as well.
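The running offset counter idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual code: it assumes an uncompressed WARC buffer in which every record is a header block, a blank line, a body of Content-Length bytes, and a trailing blank-line pair. The names IndexEntry and index_records are hypothetical, and a real CDX record carries more fields than an offset and a length. For a WARC file that is gzipped record by record, the offsets would have to be counted in compressed bytes instead.

```rust
/// Hypothetical, simplified index entry (a real CDX record has more fields).
#[derive(Debug, PartialEq)]
pub struct IndexEntry {
    pub offset: usize, // byte position where the record starts
    pub length: usize, // total number of bytes the record occupies
}

/// Walks an uncompressed WARC buffer and notes where each record starts,
/// using a running offset counter that grows by each record's length.
pub fn index_records(data: &[u8]) -> Vec<IndexEntry> {
    let mut entries = Vec::new();
    let mut offset = 0; // running offset counter
    while offset < data.len() {
        let rest = &data[offset..];
        // The header block ends at the first blank line.
        let header_end = match find(rest, b"\r\n\r\n") {
            Some(pos) => pos + 4,
            None => break,
        };
        let headers = String::from_utf8_lossy(&rest[..header_end]);
        let body_len = match content_length(&headers) {
            Some(n) => n,
            None => break,
        };
        // Header block + body + the "\r\n\r\n" that closes the record.
        let length = (header_end + body_len + 4).min(rest.len());
        entries.push(IndexEntry { offset, length });
        offset += length;
    }
    entries
}

/// Returns the position of the first occurrence of `needle` in `haystack`.
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Extracts the Content-Length value from a header block, if present.
fn content_length(headers: &str) -> Option<usize> {
    headers.lines().find_map(|line| {
        let (name, value) = line.split_once(':')?;
        if name.trim().eq_ignore_ascii_case("Content-Length") {
            value.trim().parse().ok()
        } else {
            None
        }
    })
}
```

Calling index_records on the bytes of a file returns one entry per record, and each entry's offset is where a reader could later seek to fetch that record.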

Week 10 (10-26-2021 - 11-02-2021): This week I worked on reading the contents of a WARC file and compressing them into CDX records so that those can be pushed into the linear hash table. The main problem I faced was how to get the offset of a particular record in the WARC file so that the CDX record can be populated.

Week 9 (10-19-2021 - 10-26-2021): This week I spent time integrating the linear hash table with the GraphQL server. For now it works as a test sample in which hardcoded values are pushed into the table and retrieved. The APIs of the GraphQL server are just a means to call the hardcoded function. Once the WARC file functionality and the PackedTableTools file are integrated, this will be near completion.

Week 8 (10-12-2021 - 10-19-2021): I have almost finished porting the PackedTableTools file to Rust. The main focus was on the pack, unpack, and constructor methods of the file. I faced problems in porting some of the methods because Rust, being a statically typed language, doesn't allow mixed/any-typed data. It also doesn't allow default values for arguments or method overloading. A compression library for Rust similar to the one used in the PHP code still needs to be found.
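One common way around these three limitations is sketched below. It is a hedged illustration only: Value, Row, and the byte layout are hypothetical and are not the PackedTableTools format. An enum stands in for mixed-type data, From implementations with a generic push stand in for overloaded methods, and an Option parameter stands in for a default argument.

```rust
/// Hypothetical mixed-type value: an enum replaces PHP's untyped values.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Int(i64),
    Text(String),
}

// Conversions let one generic `push` accept several input types,
// which stands in for method overloading.
impl From<i64> for Value {
    fn from(n: i64) -> Self {
        Value::Int(n)
    }
}

impl From<&str> for Value {
    fn from(s: &str) -> Self {
        Value::Text(s.to_string())
    }
}

/// Hypothetical row of values to be packed into bytes.
#[derive(Debug, Default)]
pub struct Row {
    values: Vec<Value>,
}

impl Row {
    pub fn new() -> Self {
        Row::default()
    }

    /// Accepts anything convertible into a `Value`.
    pub fn push<V: Into<Value>>(mut self, value: V) -> Self {
        self.values.push(value.into());
        self
    }

    /// `tag_types` emulates a default argument: `None` means "use true".
    /// Assumed layout: an optional 1-byte type tag, then an 8-byte
    /// big-endian integer, or a 4-byte big-endian length plus the text bytes.
    pub fn pack(&self, tag_types: Option<bool>) -> Vec<u8> {
        let tag_types = tag_types.unwrap_or(true);
        let mut out = Vec::new();
        for value in &self.values {
            match value {
                Value::Int(n) => {
                    if tag_types {
                        out.push(0);
                    }
                    out.extend_from_slice(&n.to_be_bytes());
                }
                Value::Text(s) => {
                    if tag_types {
                        out.push(1);
                    }
                    out.extend_from_slice(&(s.len() as u32).to_be_bytes());
                    out.extend_from_slice(s.as_bytes());
                }
            }
        }
        out
    }
}
```

For example, Row::new().push(7i64).push("ab").pack(None) packs both values with type tags, while passing Some(false) omits the tags.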

Week 7 (10-05-2021 - 10-12-2021): This week I spent time figuring out how to integrate the GraphQL server so that it connects to our linear hash table and allows query execution. I also started porting the PackedTableTools file to Rust.

Week 6 (09-28-2021 - 10-05-2021): This week I implemented a Rust-based GraphQL server which connects to Postgres and allows query execution. The same approach will be used for our document store.

Week 5 (09-21-2021 - 09-28-2021): I spent some time looking for material on how to use/implement custom ODBC drivers. Material exists for databases like MySQL, Postgres, etc., but I have not yet found any on how to write a custom driver.

Week 4 (09-14-2021 - 09-21-2021): This week I integrated an LRU caching mechanism into the linear hash implementation. I ran tests for values present in the cache and for values not in the cache, and both returned the correct results. More tests are required to measure the performance gain when the cache is used. On a cache miss, it's important that a value found in the hash table is put into the cache as well. This still needs to be implemented. I spent some time looking for material on how to use/implement custom ODBC drivers. Material exists for databases like MySQL, Postgres, etc., but I have not yet found any on how to write a custom driver.
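The cache-miss behaviour described above (look in the cache first, fall back to the hash table, then put the found value into the cache) can be sketched as follows. This is a minimal sketch under assumptions: LruCache and lookup are hypothetical names, a std HashMap stands in for the linear hash table, and recency is tracked with a VecDeque, which is simple but makes every access linear in the cache size.

```rust
use std::collections::{HashMap, VecDeque};

/// Minimal LRU cache sketch: the map holds values, the deque holds recency.
pub struct LruCache {
    capacity: usize,
    map: HashMap<String, String>,
    order: VecDeque<String>, // front = least recently used
}

impl LruCache {
    pub fn new(capacity: usize) -> Self {
        LruCache { capacity, map: HashMap::new(), order: VecDeque::new() }
    }

    /// Moves `key` to the most-recently-used end of the deque.
    fn touch(&mut self, key: &str) {
        if let Some(pos) = self.order.iter().position(|k| k.as_str() == key) {
            self.order.remove(pos);
        }
        self.order.push_back(key.to_string());
    }

    /// Returns the cached value and marks the key as recently used.
    pub fn get(&mut self, key: &str) -> Option<String> {
        let value = self.map.get(key).cloned()?;
        self.touch(key);
        Some(value)
    }

    /// Inserts a value, evicting the least recently used key when full.
    pub fn put(&mut self, key: &str, value: String) {
        if self.capacity == 0 {
            return;
        }
        if !self.map.contains_key(key) && self.map.len() == self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.map.remove(&oldest);
            }
        }
        self.map.insert(key.to_string(), value);
        self.touch(key);
    }

    pub fn contains(&self, key: &str) -> bool {
        self.map.contains_key(key)
    }
}

/// Read-through lookup: on a cache miss, consult the backing table and,
/// if the value is found there, put it into the cache as well.
pub fn lookup(
    cache: &mut LruCache,
    table: &HashMap<String, String>, // stand-in for the linear hash table
    key: &str,
) -> Option<String> {
    if let Some(hit) = cache.get(key) {
        return Some(hit);
    }
    let value = table.get(key).cloned()?;
    cache.put(key, value.clone());
    Some(value)
}
```

With a capacity of 2, looking up "a", "b", "a", and then "c" evicts "b", because "a" was used more recently.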

Week 3 (09-07-2021 - 09-14-2021): This week I spent time implementing the LRU cache in Rust. The aim was to finish integrating the LRU cache with the linear hash implementation, but it took more time than expected. I also had a look at the PackedTableTools.php file to get an idea of the implementation for indexing.

Week 2 (08-31-2021 - 09-07-2021): The aim of this week is to read about different types of data compression such as Rice coding, Golomb coding, and gamma coding. This week also aims at figuring out how I would implement the LRU caching mechanism in the linear hashing implementation done in CS 297. I looked at a few of the library files in Yioop's codebase and started reading about ODBC drivers and their functionality.
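As an illustration of one of these schemes, here is a small sketch of Elias gamma coding, which writes a positive integer n as floor(log2 n) zero bits followed by the binary form of n. The function names are hypothetical, and the bits are kept in a string of '0' and '1' characters for readability rather than packed into bytes as a real compressor would do.

```rust
/// Elias gamma code of a positive integer as a string of '0'/'1' characters.
/// Returns None for 0, which has no gamma code.
pub fn gamma_encode(n: u64) -> Option<String> {
    if n == 0 {
        return None;
    }
    let binary = format!("{:b}", n); // binary form, always starts with '1'
    let mut out = "0".repeat(binary.len() - 1); // length prefix in unary
    out.push_str(&binary);
    Some(out)
}

/// Decodes one gamma code from the front of `bits`.
/// Returns the value and the number of characters consumed.
pub fn gamma_decode(bits: &str) -> Option<(u64, usize)> {
    // Count the leading zeros; the value then occupies zeros + 1 bits.
    let zeros = bits.bytes().take_while(|&b| b == b'0').count();
    let end = 2 * zeros + 1;
    let body = bits.get(zeros..end)?;
    let value = u64::from_str_radix(body, 2).ok()?;
    Some((value, end))
}
```

Because every code states its own length, codes can be concatenated without separators: decoding "0100101" yields 2 after consuming three characters, and the remaining "00101" decodes to 5.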

CS 297

Week 15 (05-11-2021 - 05-18-2021): This week marks the end of the project for this semester. I validated my pages on the portal, added the report as a downloadable PDF, and checked the accessibility of all the web pages. None of them is showing any error!

Week 14 (05-04-2021 - 05-11-2021): Following last week's meeting, I made a few corrections in my code as suggested by the professor. I also completed the report and will make changes according to the feedback I get.

Week 13 (04-27-2021 - 05-04-2021): This week, I almost completed the implementation of consistent hashing and will be able to show it running. I also started writing the report for this semester. I have written the outline of the report and will proceed with it after I get approval.
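For readers unfamiliar with consistent hashing, the idea can be sketched as a hash ring. This is a generic illustration and not the project's implementation: Ring, the replica count, and the use of the standard library's DefaultHasher are all assumptions made for the example. Each node is placed on the ring at several points, and a key belongs to the first node point at or after the key's hash, wrapping around at the end.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Hashes any hashable item to a position on the ring.
fn hash_of<T: Hash>(item: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    item.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical hash ring: ring position -> owning node name.
pub struct Ring {
    points: BTreeMap<u64, String>,
    replicas: usize, // virtual points per node (assumed tuning knob)
}

impl Ring {
    pub fn new(replicas: usize) -> Self {
        Ring { points: BTreeMap::new(), replicas }
    }

    /// Places `replicas` virtual points for the node on the ring.
    pub fn add_node(&mut self, node: &str) {
        for i in 0..self.replicas {
            self.points.insert(hash_of(&(node, i)), node.to_string());
        }
    }

    /// Removes every point owned by the node.
    pub fn remove_node(&mut self, node: &str) {
        self.points.retain(|_, owner| owner.as_str() != node);
    }

    /// Finds the first point at or after the key's hash, wrapping around
    /// to the smallest point when the key hashes past the last one.
    pub fn node_for(&self, key: &str) -> Option<&str> {
        let h = hash_of(&key);
        self.points
            .range(h..)
            .next()
            .or_else(|| self.points.iter().next())
            .map(|(_, node)| node.as_str())
    }
}
```

The property that makes this useful is that removing a node only moves the keys that node owned; every other key keeps its owner.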

Week 12 (04-20-2021 - 04-27-2021): I started work on the implementation of consistent hashing. I also implemented CDX reading, continuing Deliverable 3 from where I left off last week. Looks like I am on track for this semester.

Week 11 (04-13-2021 - 04-20-2021): This week I finished the WARC reader/writer deliverable. My code is able to read gzip files as well. A new addition was the implementation of a CDX file reader. I faced some issues in getting .cdx files but thankfully, they were resolved!

Week 10 (04-06-2021 - 04-13-2021): As of the previous meeting (weeks 8-9), I had reached a point where WARC reading was almost done. After that, I searched further online and learned that there is a Rust crate (Rust libraries are called crates) for WARC reading/writing. I found the crate and experimented with it. Using this crate, I can now read and write WARC files as well as gzip files.

Week 8-9 (03-23-2021 - 04-06-2021): These two weeks share one blog entry because they include spring break. I started work on Deliverable 3, where I need to read/write documents from/to Web ARChive (WARC) files. I first started with simple file I/O in Rust, experimenting with plain .txt files by creating, writing to, and reading from them. Currently, I'm working on reading WARC files, which are large. I plan to create a separate library for this as well, just as I did for Deliverable 2 (linear hashing), so that it can be reused later.

Week 7 (03-16-2021 - 03-23-2021): This week, I worked on finishing the linear hash implementation. In the last meeting, my logic for splitting a bucket turned out to be flawed: I was splitting based on how full an individual bucket was. I corrected it to depend on the total space occupied in the hash table. Though the implementation of linear hashing is complete, a few tweaks are required so that it suits our requirements. I will mention those on the Deliverable-2 page. In the last meeting, we also discussed things that could be done while storing the keys instead of at query time, such as page ranks, de-duplication, etc. These are food for thought.
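The corrected split rule, where a split is triggered by the overall load of the table and not by one bucket filling up, can be sketched as follows. This is an in-memory illustration under assumptions, not the deliverable's code: LinearHash, BUCKET_CAPACITY, and MAX_LOAD are hypothetical names and values, and buckets are plain vectors instead of pages on disk.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const BUCKET_CAPACITY: usize = 4; // assumed nominal slots per bucket
const MAX_LOAD: f64 = 0.75; // assumed threshold on overall occupancy

pub struct LinearHash {
    buckets: Vec<Vec<(String, String)>>,
    initial: usize, // number of buckets at level 0
    level: u32,     // completed doubling rounds
    next: usize,    // next bucket to split in the current round
    len: usize,     // total number of stored entries
}

impl LinearHash {
    pub fn new(initial: usize) -> Self {
        let initial = initial.max(1);
        LinearHash { buckets: vec![Vec::new(); initial], initial, level: 0, next: 0, len: 0 }
    }

    fn hash(key: &str) -> u64 {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        hasher.finish()
    }

    /// Buckets before the split pointer were already split this round,
    /// so they are addressed with the next round's modulus.
    fn address(&self, key: &str) -> usize {
        let h = Self::hash(key);
        let round = (self.initial << self.level) as u64;
        let addr = (h % round) as usize;
        if addr < self.next { (h % (round * 2)) as usize } else { addr }
    }

    /// Occupancy of the whole table, not of any single bucket.
    pub fn load_factor(&self) -> f64 {
        self.len as f64 / (self.buckets.len() * BUCKET_CAPACITY) as f64
    }

    pub fn bucket_count(&self) -> usize {
        self.buckets.len()
    }

    pub fn get(&self, key: &str) -> Option<&str> {
        self.buckets[self.address(key)]
            .iter()
            .find(|(k, _)| k.as_str() == key)
            .map(|(_, v)| v.as_str())
    }

    pub fn insert(&mut self, key: &str, value: &str) {
        let addr = self.address(key);
        let bucket = &mut self.buckets[addr];
        if let Some(entry) = bucket.iter_mut().find(|(k, _)| k.as_str() == key) {
            entry.1 = value.to_string(); // overwrite an existing key
            return;
        }
        bucket.push((key.to_string(), value.to_string()));
        self.len += 1;
        // Split on total occupancy, regardless of which bucket just grew.
        if self.load_factor() > MAX_LOAD {
            self.split();
        }
    }

    /// Splits the bucket at the split pointer, which is not necessarily
    /// the bucket that received the latest insert.
    fn split(&mut self) {
        let round = self.initial << self.level;
        let old = std::mem::take(&mut self.buckets[self.next]);
        self.buckets.push(Vec::new());
        self.next += 1;
        if self.next == round {
            self.level += 1; // round finished: the table has doubled
            self.next = 0;
        }
        for (k, v) in old {
            let addr = self.address(&k);
            self.buckets[addr].push((k, v));
        }
    }
}
```

A bucket that overflows before the split pointer reaches it simply holds more than BUCKET_CAPACITY entries here; an on-disk version would typically chain overflow pages for that case.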

Week 6 (03-09-2021 - 03-16-2021): In the last meeting, the main point of discussion was how to look up a particular key across a large number of files. I worked on finishing the implementation of linear hashing. For now, I fetch the pages of the file in sequence and search each page for the key; the page in which the key is found is kept in memory for later use. Eviction is currently FIFO, but I realize that LRU eviction would be an improvement.

Week 5 (03-02-2021 - 03-09-2021): This week I started work on the implementation of linear hashing in Rust. I found some references for good programming practices (references updated in the proposal). I also tried writing a unit test case. The code should reach a running state by next week.

Week 4 (02-23-2021 - 03-02-2021): This week I finished Deliverable 1 in the form of a key-value store written in Rust. I leveraged the unqlite crate to implement this. It can be used to store the documents, which is the actual plan. I also started looking into Deliverable 2, where we need to implement linear hashing to store files on disk. The deliverable after that requires reading/writing WARC files, the work on which I'll be picking up in week 5.

Week 3 (02-16-2021 - 02-23-2021): This week I started on the project more extensively: researching Noria to see whether we can leverage its approach for faster reads, implementing a TCP client in Rust, and making the server and client connect and exchange data (the server will be an echo server).
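A client and echo server of this kind can be sketched with the Rust standard library alone. The function names below are hypothetical, and the sketch makes its own simplifying choices: the server binds to a free local port, handles each connection on its own thread, and the client signals the end of its message by shutting down the write half of the connection.

```rust
use std::io::{Read, Write};
use std::net::{Shutdown, SocketAddr, TcpListener, TcpStream};
use std::thread;

/// Starts an echo server on a free local port and returns its address.
pub fn spawn_echo_server() -> std::io::Result<SocketAddr> {
    let listener = TcpListener::bind("127.0.0.1:0")?; // port 0 = let the OS pick
    let addr = listener.local_addr()?;
    thread::spawn(move || {
        for stream in listener.incoming() {
            if let Ok(stream) = stream {
                // One thread per connection keeps the sketch simple.
                thread::spawn(move || {
                    let _ = echo(stream);
                });
            }
        }
    });
    Ok(addr)
}

/// Writes back whatever the peer sends until the peer stops sending.
fn echo(mut stream: TcpStream) -> std::io::Result<()> {
    let mut buf = [0u8; 1024];
    loop {
        let n = stream.read(&mut buf)?;
        if n == 0 {
            return Ok(()); // peer closed its write side
        }
        stream.write_all(&buf[..n])?;
    }
}

/// Client side: sends one message and collects the echoed reply.
pub fn send_and_receive(addr: SocketAddr, message: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut stream = TcpStream::connect(addr)?;
    stream.write_all(message)?;
    stream.shutdown(Shutdown::Write)?; // tell the server the message is complete
    let mut reply = Vec::new();
    stream.read_to_end(&mut reply)?;
    Ok(reply)
}
```

Calling spawn_echo_server and then send_and_receive with the returned address should return the same bytes that were sent.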

Week 2 (02-09-2021 - 02-16-2021): The aim of this week is to set up the environment for hands-on Rust work and to study relevant research papers.