
Research Paper Judge

Multi-agent AI system that evaluates arXiv research papers and generates peer-review-style reports with PASS/FAIL verdicts and scored dimensions.

  • 6 specialized evaluation agents
  • Wave-based concurrent execution
  • Weighted scoring: Consistency 30%, Novelty 20%
  • FastAPI + Streamlit + PostgreSQL

Problem

Manual peer review is slow, inconsistent, and doesn't scale — reviewers apply different rubrics, miss key dimensions, and create bottlenecks.

Approach

Built a multi-agent pipeline with two sequential waves. Wave 1 runs the Grammar, Novelty (with live Google Search), and Fact-Check agents concurrently; Wave 2 then runs the Consistency and Authenticity agents in parallel. A final Evaluator agent applies weighted scoring to produce a PASS/FAIL verdict.
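The two-wave orchestration can be sketched with `asyncio`. This is a minimal illustration, not the project's actual code: the agent names come from the description above, but `run_agent` is a hypothetical stub standing in for the real LLM calls.

```python
import asyncio

# Hypothetical stub agent -- the real agents would call an LLM
# (e.g. via OpenRouter or Gemini) and return a scored review.
async def run_agent(name: str, paper_text: str) -> dict:
    await asyncio.sleep(0)  # placeholder for the LLM call
    return {"agent": name, "score": 0.0, "rationale": f"{name} review"}

async def evaluate(paper_text: str) -> list[dict]:
    # Wave 1: Grammar, Novelty, and Fact-Check run concurrently.
    wave1 = await asyncio.gather(
        run_agent("grammar", paper_text),
        run_agent("novelty", paper_text),
        run_agent("fact_check", paper_text),
    )
    # Wave 2 starts only after all of Wave 1 completes;
    # its two agents also run in parallel with each other.
    wave2 = await asyncio.gather(
        run_agent("consistency", paper_text),
        run_agent("authenticity", paper_text),
    )
    return list(wave1) + list(wave2)

results = asyncio.run(evaluate("...paper text..."))
```

Gating Wave 2 on Wave 1 keeps agents that depend on earlier findings sequential, while everything independent runs concurrently.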

Value

Structured, explainable peer-review reports in seconds — traceable per-dimension scores with rationale, not just a verdict.

Snapshot

Accepts an arXiv URL, retrieves paper metadata, extracts PDF content page by page, runs six specialized AI agents across two waves, and outputs a scored review report with a verdict.
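The first step, accepting an arXiv URL, reduces to extracting the paper ID before querying for metadata. A minimal sketch (the helper name and regex are illustrative, not the project's code):

```python
import re

def arxiv_id_from_url(url: str) -> str:
    # Handles both /abs/ and /pdf/ links, e.g.
    # https://arxiv.org/abs/2401.12345 or https://arxiv.org/pdf/2401.12345v2
    m = re.search(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})", url)
    if not m:
        raise ValueError(f"Not a recognized arXiv URL: {url}")
    return m.group(1)
```

The returned ID can then be passed to the arXiv metadata API to fetch the title, authors, and abstract, and to locate the PDF for extraction.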

Stack

  • Python
  • FastAPI
  • PostgreSQL (NeonDB)
  • Streamlit
  • OpenRouter
  • Gemini
  • pymupdf4llm

Role

  • Multi-agent system design
  • LLM orchestration
  • Evaluation pipeline
  • Full-stack build

Outcomes

  • PASS/FAIL verdicts with per-dimension scores
  • Six specialized agents covering grammar, novelty, fact-checking, consistency, authenticity
  • Wave-based concurrency for faster evaluation
  • Traceable scoring rationale per dimension

Build notes

  • Wave 1 agents run concurrently — Grammar, Novelty (Gemini + Google Search), Fact-Check.
  • Wave 2 agents run after Wave 1 completes — Consistency and Authenticity in parallel.
  • Final Evaluator agent applies weighted scoring: Consistency 30%, Authenticity 25%, Novelty 20%, Fact-Check 15%, Grammar 10%.
  • pymupdf4llm for page-level PDF extraction with structured metadata.
  • NeonDB (PostgreSQL) for storing paper metadata and evaluation results.
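The weighted scoring in the Evaluator step reduces to a dot product of per-dimension scores and the weights listed above (which sum to 100%). A minimal sketch; the pass threshold is an assumption, since the notes don't state the actual cutoff:

```python
# Weights from the build notes above (sum to 1.0).
WEIGHTS = {
    "consistency": 0.30,
    "authenticity": 0.25,
    "novelty": 0.20,
    "fact_check": 0.15,
    "grammar": 0.10,
}

# Assumption: illustrative cutoff -- the real threshold isn't documented here.
PASS_THRESHOLD = 0.70

def final_verdict(scores: dict[str, float]) -> tuple[float, str]:
    """Combine per-dimension scores (each in [0, 1]) into a weighted total."""
    total = sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)
    return total, ("PASS" if total >= PASS_THRESHOLD else "FAIL")
```

Keeping the per-dimension scores alongside the total is what makes the verdict traceable: each dimension's contribution is just `weight * score`.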

Roadmap

  • Multi-paper batch processing and comparison.
  • Fine-tuned judge model on domain-specific data.
  • Export to structured review formats.

Want something similar built for your product?

I'll scope the path, then ship with reliability in mind.