Do AI Agents Produce More Accurate Scientific Citations Than LLMs?
A controlled experiment comparing baseline LLM performance with agentic web search on 88 scientific claims. Results show a 19x improvement in citation verification.
Key Findings
Discovery
Success Rate
Significant
(Large)
Abstract
We investigate whether AI agents with web search capabilities can find more verifiable scientific citations than LLMs relying on training knowledge alone. Using a controlled experiment with 88 claims extracted from scientific arXiv papers, we compare two conditions: baseline (LLM without tools) vs. agentic (LLM with web search). Our results show that the agentic approach finds 38/88 (43.2%) citations compared to only 2/88 (2.3%) for baseline—a 19x improvement that is statistically significant (McNemar χ² = 32.24, p < 0.0001) with a large effect size (φ = 0.61). We discuss implications for academic writing, research verification, and the future of AI-assisted literature review.
Keywords
1. Introduction
The reliability of citations in scientific literature is fundamental to the integrity of academic research. With the widespread adoption of Large Language Models (LLMs) in research workflows, questions arise about their ability to accurately cite sources. While LLMs can generate text that appears authoritative, they are known to "hallucinate" citations—producing plausible but non-existent references.
Recent developments in AI agents—systems that combine LLMs with tools like web search—offer a potential solution. By enabling models to actively retrieve current information rather than relying solely on training data, agents might produce more accurate citations.
Research Question
Do AI agents with web search capabilities produce more accurate scientific citations than LLMs relying on training knowledge alone?
Contributions
- We present a controlled experiment comparing baseline LLM performance with agentic web search
- We develop a reproducible methodology for citation verification
- We provide statistical evidence of significant improvement with agentic approaches
2. Related Work
2.1 Hallucination in LLMs
Previous research has extensively documented the tendency of LLMs to generate fabricated citations. Huang et al. (2023) provide a comprehensive survey of hallucination in LLMs, identifying attribution errors as a major category. Liu et al. (2024) specifically examine code generation hallucinations, while recent work by Manvi et al. (2024) demonstrates that DOI hallucination is systematic rather than incidental across multiple LLMs.
2.2 Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) combines LLMs with external knowledge bases (Lewis et al., 2020). Our work extends this concept by using web search as the retrieval mechanism rather than fixed corpora.
2.3 AI Agents for Research
Emerging research explores AI agents for scientific tasks. The DeepResearch Bench (2025) provides a benchmark for evaluating deep research agents. Kapoor et al. (2024) critically evaluate AI agents that matter, highlighting the need for empirical benchmarking.
3. Methodology
3.1 Experimental Design
| Condition | Method | Description |
|---|---|---|
| Baseline | LLM alone | Model responds from training knowledge, no external tools |
| Agentic | LLM + Web Search | Model uses web search tool to find citations before responding |
Model Used: MiniMax M2.5 (multimodal LLM)
Search Tool: Brave Search API via OpenClaw agent framework
3.2 Claim Extraction and Filtering
Source: arXiv papers from Computer Science, AI, and NLP categories (January-February 2025)
Extraction Process:
- Downloaded 50 recent arXiv papers as PDF
- Extracted references and inline citations using regex patterns
- Collected 1,080 raw claims
Filtering Pipeline:
- Word count filter: Minimum 10 words
- Pattern detection: Keep factual claims using regex (show, find, demonstrate...)
- BibTeX removal: Exclude patterns like "Author, Title, Journal Year Pages"
Result: 144 filtered claims → 93 with normalized text for testing
3.3 Citation Verification
For each claim, we evaluated:
- Found: Whether a citation was provided (DOI, arXiv-ID, or Author+Year)
- Valid: Whether the citation exists and is retrievable
4. Results
4.1 Main Findings
| Metric | Baseline | Agentic | Difference |
|---|---|---|---|
| Found | 2/88 (2.3%) | 38/88 (43.2%) | +40.9 pp |
| Valid | 1/88 (1.1%) | 3/88 (3.4%) | +2.3 pp |
4.2 Statistical Analysis
McNemar's Test:
Contingency Table:
Agentic YES Agentic NO
Baseline YES 1 1
Baseline NO 37 49
χ² = 32.24, p < 0.0001Confidence Intervals (Wilson):
- Agentic: 43.2% [95% CI: 35.2%, 55.5%]
- Baseline: 2.3% [95% CI: 0.8%, 8.1%]
Effect Size: Phi coefficient: φ = 0.6053 (Large effect)
Risk Ratio: Agentic is 19x more likely to find a citation than baseline
4.3 Qualitative Examples
Example 1: Agentic finds, Baseline does not
Claim: "Energy usage of random number generators in programming languages was accurately measured."
- Baseline: (no citation provided)
- Agentic: Antunes, K. & Hill, S. (2024). doi:10.48550/arXiv.2403.17090
5. Discussion
5.1 Implications for Research Practice
Our findings provide empirical evidence that AI agents with web search significantly improve citation accuracy compared to standalone LLMs. For researchers using AI writing assistants:
- Prefer agentic approaches for citation-dependent tasks
- Always verify AI-provided citations manually
- Be aware that ~57% of agentic responses still lack verifiable citations
5.2 Limitations
- Sample size: n=88 claims from arXiv papers may not generalize to all domains
- Single LLM: We tested only MiniMax M2.5
- Domain specificity: Claims from AI/ML/NLP papers may be easier to verify
6. Conclusion
We demonstrated that AI agents with web search capabilities (43.2% citation success) significantly outperform LLMs relying on training knowledge alone (2.3%) for finding verifiable scientific citations. This 19x improvement is statistically significant (p < 0.0001) with a large effect size (φ = 0.61).
Our findings suggest that AI-assisted research should leverage agentic architectures for citation verification tasks. However, the gap between "found" and "valid" citations indicates that manual verification remains essential.
References
- Huang, L., et al. (2023). A Survey on Hallucination in Large Language Models. arXiv preprint arXiv:2311.05232.
- Liu, F., et al. (2024). Exploring and Evaluating Hallucinations in LLM-Powered Code Generation. arXiv preprint arXiv:2404.00971.
- Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.
- Manvi, R., et al. (2024). Geographic Variation in LLM DOI Fabrication. MDPI Papers.
- Press, O., et al. (2024). CiteME: A Benchmark for Citation Identification. arXiv preprint.
- Song, J., et al. (2024). RAG-HAT: A Hallucination-Aware Tuning Pipeline. EMNLP 2024.
- Kapoor, S., et al. (2024). AI Agents That Matter. arXiv preprint arXiv:2407.01502.
Appendix: Data & Code
All data and code required to reproduce this study are available in the project repository.
| Resource | Location |
|---|---|
| Claims dataset | claims_normalized_filtered.json |
| Raw results | auto_experiment_results.json |
| Evaluation code | evaluate_citation.py |
About the Author
I'm a Master's student in Business Informatics at Friedrich Schiller University Jena, focused on AI research and empirical methods. This experiment was conducted as part of my research on AI-assisted academic writing.