This project investigates whether large language models (LLMs) can generate high-quality reviews for academic papers on platforms like OpenReview and arXiv. We’ll compare closed-source models (e.g., Gemini, Claude) with open-source models (e.g., Qwen), fine-tune local models on existing reviews, and use interpretability tools to understand what drives effective feedback.
Goal 1: Assess Review Quality
Use OpenReview papers and their human-written reviews as a benchmark.
Generate reviews with LLMs and compare them to the human reviews using automatic overlap metrics (BLEU, ROUGE) plus human evaluation of insightfulness.
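A minimal sketch of the automatic part of this comparison, assuming generated and human reviews are available as plain strings and using the Hugging Face evaluate package (pip install evaluate rouge_score nltk); overlap metrics are only a rough proxy for quality and complement the human evaluation:

```python
import evaluate

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

def score_reviews(generated_reviews, human_reviews):
    """Compare LLM-generated reviews against human references for the same papers."""
    rouge_scores = rouge.compute(predictions=generated_reviews, references=human_reviews)
    bleu_scores = bleu.compute(
        predictions=generated_reviews,
        references=[[ref] for ref in human_reviews],  # one reference list per paper
    )
    return {**rouge_scores, "bleu": bleu_scores["bleu"]}

# Example with placeholder data:
print(score_reviews(
    ["The paper proposes a novel method but lacks ablations."],
    ["Interesting method; the evaluation would benefit from ablation studies."],
))
```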
Goal 2: Compare Model Types
Evaluate reviews from closed-source (Gemini, Claude) and open-source (Qwen, Llama) models on identical papers.
Identify trade-offs in performance, cost, and accessibility.
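An illustrative sketch of how the same paper could be sent to several models through one OpenAI-compatible gateway; the gateway (OpenRouter is assumed here) and the model IDs are placeholders, and each provider's own SDK would work equally well:

```python
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

MODELS = [
    "google/gemini-pro-1.5",               # closed-source
    "anthropic/claude-3.5-sonnet",         # closed-source
    "qwen/qwen-2.5-72b-instruct",          # open-source
    "meta-llama/llama-3.1-70b-instruct",   # open-source
]

REVIEW_PROMPT = (
    "You are a peer reviewer. Write a review of the paper below, covering "
    "summary, strengths, weaknesses, and questions for the authors.\n\n{paper}"
)

def review_with_all_models(paper_text: str) -> dict:
    """Return one generated review per model for the same paper."""
    reviews = {}
    for model in MODELS:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": REVIEW_PROMPT.format(paper=paper_text)}],
        )
        reviews[model] = response.choices[0].message.content
    return reviews
```

Holding the prompt and papers fixed across models keeps the comparison focused on the models themselves rather than on prompt differences.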
Goal 3: Enhance Reviews via Fine-Tuning
Fine-tune an open-source model (e.g., Qwen) on OpenReview review data.
Measure improvements in review quality post-fine-tuning.
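A hedged sketch of the fine-tuning setup, assuming OpenReview data has been flattened into (paper excerpt, human review) pairs stored in a JSONL file with a single "text" field; it uses transformers + peft (LoRA) + trl, and exact arguments will depend on library versions and available GPU memory:

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="openreview_reviews.jsonl", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",  # base open-source model (placeholder choice)
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
    args=SFTConfig(
        output_dir="qwen-reviewer-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        learning_rate=2e-5,
        bf16=True,
    ),
)
trainer.train()
```

Post-fine-tuning quality would then be measured with the same metrics and human evaluation as in Goal 1.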
Goal 4: Interpret Review Drivers
Apply sparse autoencoders to the model's internal activations to pinpoint which paper elements (e.g., abstract, methods) most influence LLM-generated reviews.
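A minimal sparse autoencoder sketch in PyTorch, assuming residual-stream activations from the review-generating model have already been cached as a tensor of shape (n_tokens, d_model); learned features can then be attributed back to paper sections (abstract, methods, ...) via their token positions:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))  # sparse, non-negative codes
        reconstruction = self.decoder(features)
        return reconstruction, features

def train_sae(activations: torch.Tensor, d_features: int = 16384,
              l1_coeff: float = 1e-3, steps: int = 1000, lr: float = 1e-4):
    """Train an SAE on cached activations with an L1 sparsity penalty."""
    sae = SparseAutoencoder(activations.shape[-1], d_features)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(steps):
        batch = activations[torch.randint(0, len(activations), (1024,))]
        recon, feats = sae(batch)
        loss = ((recon - batch) ** 2).mean() + l1_coeff * feats.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae
```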
Computational Resources: Rent GPU instances for training and inference ($3,000-$5,000)
API Usage: Calls to LLM providers ($1,000-$2,000)
Dissemination: Conference fees or publication costs ($500-$1,000)
Cost of living in a medium-cost-of-living (MCOL) area: $3,000-$6,000 per month
Dmitrii Magas, ML SWE @ Lyft, https://eamag.me/
Similar projects: https://openreview-copilot.eamag.me/, https://eamag.me/2024/Automated-Paper-Classification
LLMs fail to match human review depth.
Fine-tuning yields minimal gains.
Interpretability tools lack actionable insights.