Rohan

    [Bio]

    [Blog]

    [CS297 Proposal]

    [LoRA: Low-Rank Adaptation of Large Language Models (.pdf)]

    [DoRA: Weight-Decomposed Low-Rank Adaptation (.pdf)]

    [Deliverable 1 - MATH Dataset]

    [Deliverable 1 - GSM8k Dataset]

    [Deliverable 2 - Integrate Mathics tool]

    [Deliverable 3 - Prove the infinitude of primes theorem using LEAN]

    [Chain-of-Thought Prompting in LLMs (.pdf)]

    [LeanDojo - Theorem Proving with RAG (.pdf)]

    [Deliverable 4 - Solving word problems using LEAN and Mathics]

    [CS297 Report (.pdf)]

    [CS298 Proposal]

Description:

LLMs struggle with math because they are trained primarily on natural-language data and lack the formal reasoning and symbolic-manipulation skills needed for precise calculation. They tend to prioritize linguistic patterns over strict correctness, which leads to plausible but inaccurate answers, and their limited context window and weak multi-step reasoning further hinder accuracy. This project aims to enhance LLMs' ability to handle mathematical problems and logical reasoning tasks by using curated datasets, hybrid models that incorporate symbolic reasoning components, and process supervision. The goal is to improve the model's performance on complex math and logic problems.
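
As a concrete starting point for the fine-tuning in deliverable 1, the sketch below uses LoRA (the paper linked above), so only small adapter matrices are trained. This is a minimal sketch rather than the project's final pipeline: it assumes the Hugging Face transformers, peft, and datasets libraries, and the GPT-2 base model and all hyperparameters are placeholders for whichever LLM is chosen in week 2.

    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              Trainer, TrainingArguments)

    model_name = "gpt2"  # placeholder; swap in the model selected in week 2
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Wrap the base model with low-rank adapters so only a small fraction
    # of the parameters are updated during fine-tuning.
    lora = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"],
                      task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)

    # GSM8k pairs a word problem ("question") with a worked solution ("answer").
    train = load_dataset("gsm8k", "main", split="train")

    def tokenize(example):
        text = example["question"] + "\n" + example["answer"]
        enc = tokenizer(text, truncation=True, max_length=512,
                        padding="max_length")
        enc["labels"] = enc["input_ids"].copy()  # causal LM objective
        return enc

    train = train.map(tokenize, remove_columns=train.column_names)

    trainer = Trainer(model=model,
                      args=TrainingArguments(output_dir="gsm8k-lora",
                                             per_device_train_batch_size=2,
                                             num_train_epochs=1),
                      train_dataset=train)
    trainer.train()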

Schedule:

Week Activities
Week 1: September 2, 2024 Review current limitations of LLMs in math and logic
Week 2: September 9, 2024 Identify a set of LLMs. Read about how to train and deploy them
Week 3: September 16, 2024 Start fine-tuning pre-trained models on math and logic datasets (MATH and GSM8k). Read [Hendrycks2021]
Week 4: September 23, 2024 Prepare evaluation results on initial fine-tuned models
Week 5: September 30, 2024 Start deliverable 2: Understand how Mathics works
Week 6: October 7, 2024 Continue deliverable 2: Read [Gao2022]. Invoke Mathics from the LLM's output (see the first sketch after the schedule)
Week 7: October 14, 2024 Complete deliverable 2: Document results of basic math calculations
Week 8: October 21, 2024 Start deliverable 3: Read the Mathics documentation. Read [deMoura2015], [Polu2020]
Week 9: October 28, 2024 Continue on deliverable 3: Integrate LLM with LEAN
Week 10: November 4, 2024 Complete deliverable 3: Prove basic theorems using LLM+LEAN (see the Lean sketch after the schedule)
Week 11: November 11, 2024 Start deliverable 4: Read about process supervision and CoT prompting
Week 12: November 18, 2024 Continue deliverable 4: Find word-problem datasets or create a synthetic one
Week 13: November 25, 2024 Continue deliverable 4: Read [Wei2022]. Continue evaluating our model on the dataset
Week 14: December 2, 2024 Complete deliverable 4: Record accuracy metrics
Week 15: December 9, 2024 Start deliverable 5: Compile all the deliverables
Week 16: December 16, 2024 Complete CS297 Report
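
For the Mathics work in weeks 5-7, one plausible wiring is to prompt the LLM to wrap any computation it needs in explicit tags and to hand the tagged expressions to Mathics, rather than letting the model do the arithmetic itself. A minimal sketch, assuming the mathics-core package and its MathicsSession helper; the tag convention and the hard-coded llm_output string are illustrative assumptions, not a fixed API.

    import re
    from mathics.session import MathicsSession

    session = MathicsSession()

    # Suppose the LLM is instructed to wrap computations in
    # <<mathics>> ... <</mathics>> tags instead of computing them itself.
    llm_output = ("The antiderivative is "
                  "<<mathics>>Integrate[Sin[x], x]<</mathics>>.")

    for expr in re.findall(r"<<mathics>>(.*?)<</mathics>>", llm_output):
        result = session.evaluate(expr)  # exact symbolic evaluation
        print(expr, "=>", result)        # e.g. Integrate[Sin[x], x] => -Cos[x]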
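
The week 10 target, deliverable 3's infinitude-of-primes theorem, already exists in Lean's Mathlib library as Nat.exists_infinite_primes, so restating it makes a good end-to-end smoke test for the LLM+LEAN loop. A minimal Lean 4 sketch (the theorem name bigger_prime is our own), assuming a project with Mathlib available:

    import Mathlib

    -- For every natural number n there is a prime p with n ≤ p,
    -- i.e. there is no largest prime. Mathlib's Nat.exists_infinite_primes
    -- proves exactly this statement.
    theorem bigger_prime (n : ℕ) : ∃ p, n ≤ p ∧ p.Prime :=
      Nat.exists_infinite_primes n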

Deliverables:

  1. Fine-tune LLMs on the MATH and GSM8k datasets
  2. Integrate ChatGPT with Mathics to compute the integral of sin(x)
  3. Look into LEAN and prove that for every prime there exists a larger prime.
  4. Find a word-problems dataset and fine-tune an LLM to answer math word problems (a chain-of-thought prompting sketch follows this list)
  5. Submit CS297 Report
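
Deliverable 4 relies on chain-of-thought prompting [Wei2022]: prepend a worked example so the model produces step-by-step reasoning before its final answer. Below is a minimal sketch; the one-shot exemplar, the "####" answer marker (borrowed from GSM8k's answer format), and the helper names are assumptions, not the project's settled interface.

    import re

    EXEMPLAR = ("Q: Ravi has 3 boxes with 4 pens each. He gives away 5 pens. "
                "How many pens are left?\n"
                "A: Ravi starts with 3 * 4 = 12 pens. After giving away 5, "
                "he has 12 - 5 = 7 pens. #### 7\n\n")

    def cot_prompt(question: str) -> str:
        # Prepend a worked example so the model imitates step-by-step reasoning.
        return EXEMPLAR + "Q: " + question + "\nA:"

    def extract_answer(completion: str):
        # GSM8k-style solutions end with '#### <number>'.
        m = re.search(r"####\s*(-?\d+)", completion)
        return int(m.group(1)) if m else None

Scoring extract_answer against the dataset's gold answers would give the accuracy metric recorded in week 14.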

References:

  • [Hendrycks2021] Measuring Mathematical Problem Solving with the MATH Dataset. Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt. arXiv preprint arXiv:2103.03874. 2021.
  • [Gao2022] PAL: Program-Aided Language Models. Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, Graham Neubig. arXiv preprint arXiv:2211.10435. 2022.
  • [deMoura2015] The Lean Theorem Prover (System Description). Leonardo de Moura, Soonho Kong, Jeremy Avigad, Floris van Doorn, Jakob von Raumer. In International Conference on Automated Deduction (CADE), 2015.
  • [Polu2020] Generative Language Modeling for Automated Theorem Proving. Stanislas Polu, Ilya Sutskever. arXiv preprint arXiv:2009.03393. 2020.
  • [Wei2022] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed H. Chi, Quoc V. Le, Denny Zhou. arXiv preprint arXiv:2201.11903. 2022.