
Research Paper Judge

Multi-agent AI system that evaluates arXiv research papers and generates peer-review-style reports with PASS/FAIL verdicts and scored dimensions.

  • 6 specialized evaluation agents
  • Wave-based concurrent execution
  • Weighted scoring: Consistency 30%, Novelty 20%
  • FastAPI + Streamlit + PostgreSQL

Problem

Manual peer review is slow, inconsistent, and doesn't scale — reviewers apply different rubrics, miss key dimensions, and create bottlenecks.

Approach

Built a multi-agent pipeline with two sequential waves. Wave 1 runs the Grammar, Novelty (with live Google Search), and Fact-Check agents concurrently; Wave 2 then runs the Consistency and Authenticity agents in parallel. A final Evaluator agent applies weighted scoring to produce a PASS/FAIL verdict.
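The two-wave orchestration can be sketched with `asyncio`. This is a minimal illustration, not the project's actual code: the agent names come from the description above, but `run_agent` is a hypothetical stub standing in for the real LLM calls.

```python
import asyncio

# Hypothetical stub agent -- the real agents would call an LLM
# (e.g. via OpenRouter or Gemini) and return a scored review.
async def run_agent(name: str, paper_text: str) -> dict:
    await asyncio.sleep(0)  # placeholder for the LLM call
    return {"agent": name, "score": 0.0, "rationale": f"{name} review"}

async def evaluate(paper_text: str) -> list[dict]:
    # Wave 1: Grammar, Novelty, and Fact-Check run concurrently.
    wave1 = await asyncio.gather(
        run_agent("grammar", paper_text),
        run_agent("novelty", paper_text),
        run_agent("fact_check", paper_text),
    )
    # Wave 2 starts only after all of Wave 1 completes;
    # its two agents also run in parallel with each other.
    wave2 = await asyncio.gather(
        run_agent("consistency", paper_text),
        run_agent("authenticity", paper_text),
    )
    return list(wave1) + list(wave2)

results = asyncio.run(evaluate("...paper text..."))
```

Gating Wave 2 on Wave 1 keeps agents that depend on earlier findings sequential, while everything independent runs concurrently.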

Value

Structured, explainable peer-review reports in seconds — traceable per-dimension scores with rationale, not just a verdict.

Snapshot

Accepts an arXiv URL, retrieves paper metadata, extracts PDF content page by page, runs six specialized AI agents across two waves, and outputs a scored review report with a verdict.
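The first step, accepting an arXiv URL, reduces to extracting the paper ID before querying for metadata. A minimal sketch (the helper name and regex are illustrative, not the project's code):

```python
import re

def arxiv_id_from_url(url: str) -> str:
    # Handles both /abs/ and /pdf/ links, e.g.
    # https://arxiv.org/abs/2401.12345 or https://arxiv.org/pdf/2401.12345v2
    m = re.search(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})", url)
    if not m:
        raise ValueError(f"Not a recognized arXiv URL: {url}")
    return m.group(1)
```

The returned ID can then be passed to the arXiv metadata API to fetch the title, authors, and abstract, and to locate the PDF for extraction.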

Stack

  • Python
  • FastAPI
  • PostgreSQL (NeonDB)
  • Streamlit
  • OpenRouter
  • Gemini
  • pymupdf4llm

Role

  • Multi-agent system design
  • LLM orchestration
  • Evaluation pipeline
  • Full-stack build

Outcomes

  • PASS/FAIL verdicts with per-dimension scores
  • Six specialized agents covering grammar, novelty, fact-checking, consistency, authenticity
  • Wave-based concurrency for faster evaluation
  • Traceable scoring rationale per dimension

Build notes

  • Wave 1 agents run concurrently — Grammar, Novelty (Gemini + Google Search), Fact-Check.
  • Wave 2 agents run after Wave 1 completes — Consistency and Authenticity in parallel.
  • Final Evaluator agent applies weighted scoring: Consistency 30%, Authenticity 25%, Novelty 20%, Fact-Check 15%, Grammar 10%.
  • pymupdf4llm for page-level PDF extraction with structured metadata.
  • NeonDB (PostgreSQL) for storing paper metadata and evaluation results.
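The weighted scoring in the Evaluator step reduces to a dot product of per-dimension scores and the weights listed above (which sum to 100%). A minimal sketch; the pass threshold is an assumption, since the notes don't state the actual cutoff:

```python
# Weights from the build notes above (sum to 1.0).
WEIGHTS = {
    "consistency": 0.30,
    "authenticity": 0.25,
    "novelty": 0.20,
    "fact_check": 0.15,
    "grammar": 0.10,
}

# Assumption: illustrative cutoff -- the real threshold isn't documented here.
PASS_THRESHOLD = 0.70

def final_verdict(scores: dict[str, float]) -> tuple[float, str]:
    """Combine per-dimension scores (each in [0, 1]) into a weighted total."""
    total = sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)
    return total, ("PASS" if total >= PASS_THRESHOLD else "FAIL")
```

Keeping the per-dimension scores alongside the total is what makes the verdict traceable: each dimension's contribution is just `weight * score`.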

Roadmap

  • Multi-paper batch processing and comparison.
  • Fine-tuned judge model on domain-specific data.
  • Export to structured review formats.

Want something similar built for your product?

I'll scope the path, then ship with reliability in mind.