
Research on AI-Powered Peer Review: Evaluating LLMs for Academic Feedback


Dmitrii

Grant · Not funded · $0 raised

Project summary

This project investigates whether large language models (LLMs) can generate high-quality reviews for academic papers on platforms like OpenReview and arXiv. We'll compare closed-source models (e.g., Gemini, Claude) with open-source models (e.g., Qwen), fine-tune local models on existing reviews, and use interpretability tools to understand what drives effective feedback.

What are this project's goals and how will you achieve them?

  • Goal 1: Assess Review Quality

    • Use OpenReview papers and their human-written reviews as a benchmark.

    • Generate reviews with LLMs and compare them to human reviews using automatic metrics (BLEU, ROUGE) plus human evaluation of insightfulness.

  • Goal 2: Compare Model Types

    • Evaluate reviews from closed-source (Gemini, Claude) and open-source (Qwen, Llama) models on identical papers.

    • Identify trade-offs in performance, cost, and accessibility.

  • Goal 3: Enhance Reviews via Fine-Tuning

    • Fine-tune an open-source model (e.g., Qwen) on OpenReview review data.

    • Measure improvements in review quality post-fine-tuning.

  • Goal 4: Interpret Review Drivers

    • Apply sparse autoencoders to identify which paper elements (e.g., abstract, methods) most influence LLM-generated reviews.
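For Goal 1, the comparison step can be sketched with a minimal ROUGE-1 F1 implemented from scratch (the example reviews below are hypothetical; a real evaluation would use an established metric library and OpenReview data):

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap ROUGE-1 F1 between a generated and a human review."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # shared unigrams, clipped by count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical review pair for illustration
human = "The method is novel but the evaluation lacks baselines"
model = "The proposed method is novel although evaluation lacks strong baselines"
score = rouge1_f1(model, human)
print(f"ROUGE-1 F1: {score:.2f}")  # → ROUGE-1 F1: 0.74
```

N-gram overlap rewards surface similarity, not insight, which is why the plan pairs it with human evaluation.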
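For Goal 3, the fine-tuning data would need to be formatted as (paper, review) pairs. A minimal sketch of that preparation step, assuming the common chat-style JSONL schema (field names and the example pair are assumptions to adapt to the chosen trainer):

```python
import json

def to_chat_example(paper_abstract: str, review_text: str) -> dict:
    """Format one (paper, review) pair as a chat-style fine-tuning example."""
    return {
        "messages": [
            {"role": "system", "content": "You are an expert academic reviewer."},
            {"role": "user", "content": f"Write a peer review of this paper:\n\n{paper_abstract}"},
            {"role": "assistant", "content": review_text},
        ]
    }

# Hypothetical pair; real data would come from OpenReview.
example = to_chat_example(
    "We propose a new attention mechanism...",
    "Strengths: clear motivation. Weaknesses: limited baselines.",
)
with open("train.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")
```

Measuring review quality before and after fine-tuning on this data gives the Goal 3 comparison.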
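For Goal 4, a full SAE training pipeline is out of scope for a sketch, but the core architecture (an overcomplete ReLU encoder over model activations, plus a linear decoder) can be illustrated; dimensions and initialization here are arbitrary, and the L1-penalized training loop is omitted:

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

class TinySAE:
    """Minimal sparse autoencoder: forward pass only. In practice it would be
    trained to reconstruct LLM activations under an L1 sparsity penalty, so
    each hidden feature tends to fire on one interpretable input pattern."""
    def __init__(self, d_in, d_hidden, seed=0):
        rng = random.Random(seed)
        self.W_enc = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(d_hidden)]
        self.W_dec = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_in)]

    def forward(self, x):
        features = relu(matvec(self.W_enc, x))  # sparse feature activations
        recon = matvec(self.W_dec, features)    # reconstruction of x
        return features, recon

sae = TinySAE(d_in=4, d_hidden=16)
features, recon = sae.forward([0.5, -1.0, 0.2, 0.8])
```

Features that activate strongly when, say, the methods section is present would point to the paper elements driving the generated review.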

How will this funding be used?

  • Computational Resources: Rent GPU instances for training and inference ($3,000-$5,000)

  • API Access: calls to closed-source LLM providers ($1,000-$2,000)

  • Dissemination: Conference fees or publication costs ($500-$1,000)

  • Cost of living (MCOL area): $3,000-$6,000 per month

Who is on your team and what's your track record on similar projects?

Dmitrii Magas, ML SWE @ Lyft, https://eamag.me/

Similar projects: https://openreview-copilot.eamag.me/, https://eamag.me/2024/Automated-Paper-Classification

What are the most likely causes and outcomes if this project fails? (premortem)

  • LLMs fail to match human review depth.

  • Fine-tuning yields minimal gains.

  • Interpretability tools lack actionable insights.
