
CS298 Proposal

Benchmarking Survival Analysis Data with MongoDB and Time Series Databases

Aarsh Patel (aarsh.patel@sjsu.edu)

Advisor: Dr. Chris Pollett

Committee Members: Dr. Robert Chun, Dr. William Andreopoulos

Abstract:

Survival analysis is a statistical technique used to examine how long it will take for a given event to occur. The "event" can be anything that can be quantified in terms of time, such as the onset of a disease or the death of a person, which makes the technique especially helpful for analyzing time-to-event data. In CS297, I used the Time Series Benchmark Suite (TSBS) to benchmark synthetic time series data against four databases and evaluated their read and write performance. While many benchmarking suites are available for time series databases, none are designed specifically for time-to-event data. Survival analysis data contains censored records and often involves time-dependent computations, which creates a need for a benchmarking tool tuned to such use cases, allowing fair and accurate comparisons between different database systems. The goal of this project is to create a benchmarking suite in which datasets containing such time-to-event data can be benchmarked. The project extends CS297: here I analyze a particular sub-case of time series data, namely survival analysis data, to research how the four databases used for time series benchmarking handle it, and to compare the databases on both time series and survival analysis data, thus exploring their features. The suite is designed so that more databases and use cases can be added later, along with features such as tailored query workloads and optimizations. The dataset for the experiments will be selected from available open-source datasets, and the databases to be benchmarked will be MongoDB, TimeScaleDB, InfluxDB, and QuestDB. The metrics will be write performance (time to load the data) and read performance (execution time of queries).
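
To make the notion of censored time-to-event data concrete, here is a minimal pandas sketch of two hypothetical shipment records, one with an observed delivery and one still in transit (right-censored). The field names mirror the shipment dataset described under Key Deliverables; the timestamps and observation cutoff are made up purely for illustration.

    import pandas as pd

    # Two hypothetical shipment records: one delivered (event observed) and one
    # still in transit at the end of the observation window (right-censored).
    df = pd.DataFrame([
        {"order_number": 1, "order_placed": "2023-01-02 10:00",
         "shipment_delivered": "2023-01-05 16:30", "isDelivered": True},
        {"order_number": 2, "order_placed": "2023-01-03 09:15",
         "shipment_delivered": None, "isDelivered": False},
    ])
    df["order_placed"] = pd.to_datetime(df["order_placed"])
    df["shipment_delivered"] = pd.to_datetime(df["shipment_delivered"])

    # For censored rows the true delivery time is unknown; we only know it
    # exceeds the time observed up to an assumed cutoff.
    cutoff = pd.Timestamp("2023-01-07 00:00")
    end = df["shipment_delivered"].fillna(cutoff)
    df["duration_hours"] = (end - df["order_placed"]).dt.total_seconds() / 3600
    df["event_observed"] = df["isDelivered"]
    print(df[["order_number", "duration_hours", "event_observed"]])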

CS297 Results

  • Researched time series data and databases. Explored various benchmarking suites that can benchmark time series data and support NoSQL databases.
  • Chose the Time Series Benchmark Suite (TSBS) as the benchmarking suite, with InfluxDB, TimeScaleDB, and QuestDB as the three time-series databases and MongoDB as the NoSQL database to benchmark against synthetic time series data.
  • Conducted benchmarking to evaluate the databases on read/write performance and data storage footprint. Concluded that MongoDB, and NoSQL databases in general, are not ideal for storing time series data, and that TimeScaleDB is the best choice: it is fast, and its SQL support makes it convenient for storing time series data.

Proposed Schedule

Week 1 (Aug 22-Aug 29): Meet with advisor and finish the proposal.
Week 2-3 (Aug 30-Sep 12): Create a dataset with the parameters defined in Deliverable 1 using Python (Pandas, NumPy, etc.) and clean it. Research different data formats (JSON, XML, or CSV) and choose the most suitable one. MongoDB stores data in a different format (BSON), so I will also need to understand that format and how it differs from JSON. (A generation sketch follows this schedule.)
Week 4-7 (Sep 13-Oct 10): Write scripts that automatically perform and record write-performance measurements, and explore the parameters supported by each database to check whether they make a significant difference in write performance. (Modify the data if a particular database has issues with it, to make it compatible.)
Week 8-12 (Oct 11-Nov 14): Create queries for each database to measure read performance and record query execution times.
Week 13-16 (Nov 15-Dec 2): Complete the report and get it reviewed by Prof. Pollett and the committee. Create the presentation, make final changes, and practice the presentation.
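
The following sketch illustrates the Week 2-3 dataset-generation step referenced above: synthetic shipment records are produced with NumPy/pandas and written in two candidate formats. The row count, distributions, censoring rate, and file names are placeholder assumptions rather than final design choices.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)
    n = 1_000  # the real dataset would be scaled up toward the ~5 GB target

    # Random order-placement times over a 90-day window and exponentially
    # distributed transit times; roughly 10% of shipments are deliberately
    # left undelivered (censored).
    placed = pd.Timestamp("2023-01-01") + pd.to_timedelta(
        rng.integers(0, 90 * 24 * 3600, n), unit="s")
    transit_hours = rng.exponential(scale=72, size=n)
    delivered = rng.random(n) > 0.1

    df = pd.DataFrame({
        "order_number": np.arange(1, n + 1),
        "order_placed": placed,
        "shipment_delivered": placed + pd.to_timedelta(transit_hours, unit="h"),
        "isDelivered": delivered,
    })
    df.loc[~delivered, "shipment_delivered"] = pd.NaT  # censored rows

    # Write the same data in two candidate formats for the format comparison.
    df.to_csv("shipments.csv", index=False)
    df.to_json("shipments.json", orient="records", date_format="iso")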

Key Deliverables:

  • Software
    • Create a dataset of survival analysis data for shipment-tracking information, roughly 5 GB in size. Example parameters are "order_number", "order_placed", "order_processed", "shipment_created", "shipment_delivered", and "isDelivered". Create multiple copies of the dataset in different formats, since MongoDB stores data as BSON. This is the data that will be imported into the databases and on which the queries will be run.
    • Create automated loading scripts for MongoDB, InfluxDB, TimeScaleDB, and QuestDB that take parameters such as file_path (path to the data file) and name (name of the target database), tune the dataset for a particular database if loading issues occur, perform the data loading, and record the write time (a MongoDB sketch follows this list).
    • Write multiple queries, including aggregations and joins, for MongoDB, InfluxDB, TimeScaleDB, and QuestDB. (Each database has a different syntax and structure, so the queries will differ per database.) Write scripts that run the queries against the target database and record their execution times. Record the results and compare the performance of the databases to determine which one is better suited for storing and querying time-to-event data.
  • Report
    • CS 298 Report
    • CS 298 Presentation
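
As a rough illustration of the Deliverable 2 loading scripts, the sketch below shows one plausible shape for the MongoDB variant: it accepts file_path and name parameters, bulk-inserts the dataset with pymongo, and records the wall-clock write time. The connection URI, collection name, and batch size are assumptions, and the other databases would need analogous scripts built on their own client libraries.

    import argparse
    import time

    import pandas as pd
    from pymongo import MongoClient

    def load(file_path: str, name: str, batch_size: int = 10_000) -> float:
        """Load the CSV dataset into MongoDB and return the elapsed write time."""
        client = MongoClient("mongodb://localhost:27017")  # assumed local server
        collection = client[name]["shipments"]

        df = pd.read_csv(file_path,
                         parse_dates=["order_placed", "shipment_delivered"])
        # Replace NaT (censored deliveries) with None so the BSON encoder accepts them.
        df = df.astype(object).where(df.notna(), None)
        records = df.to_dict(orient="records")

        start = time.perf_counter()
        for i in range(0, len(records), batch_size):
            collection.insert_many(records[i:i + batch_size])
        elapsed = time.perf_counter() - start
        print(f"Loaded {len(records)} documents in {elapsed:.2f}s")
        return elapsed

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("--file_path", required=True, help="path to the data file")
        parser.add_argument("--name", required=True, help="name of the target database")
        args = parser.parse_args()
        load(args.file_path, args.name)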

Innovations and Challenges

  • Research on benchmarking databases for survival analysis data is very limited, especially on how well time-series databases like InfluxDB and NoSQL databases like MongoDB store such data. Since InfluxDB and TimeScaleDB are among the most popular time-series databases, it will be interesting to see how well they handle survival analysis data, so this project may serve as a starting point for research on storing it. The suite can be explored further, and new optimizations and additions can be made. Survival analysis is an evolving field with ongoing research into new algorithms and techniques, and this benchmarking suite can provide a platform to test the implementation of these new methods within database systems and compare their performance.
  • Every database differs in data format and structure, so the dataset may need to be restructured before a target database can load it. Deliverable 2 will therefore be challenging: it may require research into the format and structure each database expects, with the dataset adjusted accordingly to make it work.
  • In Deliverable 3, the main challenge is writing queries for each database to measure read performance. The queries will differ per database, which will require research into, and an understanding of, how to write working queries for each system. (A sketch of such a query pair follows this list.)
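
To give a flavor of this per-database query work, the sketch below expresses the same question, average delivery time per day, as a MongoDB aggregation pipeline and as TimeScaleDB SQL, and times the MongoDB version. The collection, table, and column names carry over from the earlier sketches and are assumptions rather than a fixed schema; the actual benchmark queries will be designed in Deliverable 3.

    import time
    from pymongo import MongoClient

    # MongoDB version: average delivery time per day as an aggregation pipeline
    # ($dateTrunc requires MongoDB 5.0+; $subtract on two dates yields milliseconds).
    mongo_pipeline = [
        {"$match": {"isDelivered": True}},
        {"$group": {
            "_id": {"$dateTrunc": {"date": "$order_placed", "unit": "day"}},
            "avg_delivery_secs": {"$avg": {"$divide": [
                {"$subtract": ["$shipment_delivered", "$order_placed"]}, 1000]}},
        }},
    ]

    # TimeScaleDB version of the same question, using its time_bucket() function;
    # it would be run and timed through a PostgreSQL client such as psycopg2.
    timescale_sql = """
    SELECT time_bucket('1 day', order_placed) AS day,
           AVG(EXTRACT(EPOCH FROM shipment_delivered - order_placed)) AS avg_delivery_secs
    FROM shipments
    WHERE "isDelivered"
    GROUP BY day;
    """

    collection = MongoClient("mongodb://localhost:27017")["benchmark"]["shipments"]
    start = time.perf_counter()
    rows = list(collection.aggregate(mongo_pipeline))
    print(f"MongoDB returned {len(rows)} groups in {time.perf_counter() - start:.3f}s")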
