
Do AI Agents Produce More Accurate Scientific Citations Than LLMs?

A controlled experiment comparing baseline LLM performance with agentic web search on 88 scientific claims. Results show a 19× improvement in citation discovery.

Nils Kriz
Eugen Rutherford
15 min read

Key Findings

  • 19× — Better Citation Discovery
  • 43.2% — Agentic Success Rate
  • p < 0.0001 — Statistically Significant
  • 0.61 — Effect Size (Large)

Abstract

We investigate whether AI agents with web search capabilities can find more verifiable scientific citations than LLMs relying on training knowledge alone. Using a controlled experiment with 88 claims extracted from scientific arXiv papers, we compare two conditions: baseline (LLM without tools) vs. agentic (LLM with web search). Our results show that the agentic approach finds 38/88 (43.2%) citations compared to only 2/88 (2.3%) for baseline—a 19x improvement that is statistically significant (McNemar χ² = 32.24, p < 0.0001) with a large effect size (φ = 0.61). We discuss implications for academic writing, research verification, and the future of AI-assisted literature review.

Keywords

LLM · AI Agent · Citation Verification · Web Search · Scientific Writing · Hallucination

1. Introduction

The reliability of citations in scientific literature is fundamental to the integrity of academic research. With the widespread adoption of Large Language Models (LLMs) in research workflows, questions arise about their ability to accurately cite sources. While LLMs can generate text that appears authoritative, they are known to "hallucinate" citations—producing plausible but non-existent references.

Recent developments in AI agents—systems that combine LLMs with tools like web search—offer a potential solution. By enabling models to actively retrieve current information rather than relying solely on training data, agents might produce more accurate citations.

Research Question

Do AI agents with web search capabilities produce more accurate scientific citations than LLMs relying on training knowledge alone?

Contributions

  1. We present a controlled experiment comparing baseline LLM performance with agentic web search
  2. We develop a reproducible methodology for citation verification
  3. We provide statistical evidence of significant improvement with agentic approaches

2. Related Work

2.1 Hallucination in LLMs

Previous research has extensively documented the tendency of LLMs to generate fabricated citations. Huang et al. (2023) provide a comprehensive survey of hallucination in LLMs, identifying attribution errors as a major category. Liu et al. (2024) specifically examine code generation hallucinations, while recent work by Manvi et al. (2024) demonstrates that DOI hallucination is systematic rather than incidental across multiple LLMs.

2.2 Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) combines LLMs with external knowledge bases (Lewis et al., 2020). Our work extends this concept by using web search as the retrieval mechanism rather than fixed corpora.

2.3 AI Agents for Research

Emerging research explores AI agents for scientific tasks. The DeepResearch Bench (2025) provides a benchmark for evaluating deep research agents. Kapoor et al. (2024) critically evaluate AI agents that matter, highlighting the need for empirical benchmarking.

3. Methodology

3.1 Experimental Design

| Condition | Method           | Description                                                    |
|-----------|------------------|----------------------------------------------------------------|
| Baseline  | LLM alone        | Model responds from training knowledge, no external tools      |
| Agentic   | LLM + Web Search | Model uses web search tool to find citations before responding |

Model Used: MiniMax M2.5 (multimodal LLM)
Search Tool: Brave Search API via OpenClaw agent framework

3.2 Claim Extraction and Filtering

Source: arXiv papers from Computer Science, AI, and NLP categories (January-February 2025)

Extraction Process:

  1. Downloaded 50 recent arXiv papers as PDF
  2. Extracted references and inline citations using regex patterns
  3. Collected 1,080 raw claims

Filtering Pipeline:

  1. Word count filter: Minimum 10 words
  2. Pattern detection: Keep factual claims using regex (show, find, demonstrate...)
  3. BibTeX removal: Exclude patterns like "Author, Title, Journal Year Pages"

Result: 144 filtered claims → 93 with normalized text; 88 of these were evaluated under both conditions
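The three filtering steps above can be sketched as a small Python pipeline. The specific regex patterns and the sample input are illustrative assumptions, not the authors' exact implementation:

```python
import re

# Hypothetical patterns mirroring the three pipeline steps described above.
FACTUAL_VERBS = re.compile(r"\b(show|find|demonstrate|prove|reveal)s?\b", re.I)
BIBTEX_LIKE = re.compile(r"[A-Z][a-z]+,\s+.+,\s+.+\s+\d{4}")  # "Author, Title, Journal Year"

def filter_claims(raw_claims):
    """Keep claims that are long enough, factual-sounding, and not BibTeX residue."""
    kept = []
    for claim in raw_claims:
        if len(claim.split()) < 10:           # 1. word-count filter (min 10 words)
            continue
        if not FACTUAL_VERBS.search(claim):   # 2. factual-pattern detection
            continue
        if BIBTEX_LIKE.search(claim):         # 3. BibTeX-entry removal
            continue
        kept.append(claim)
    return kept

claims = [
    "We demonstrate that agentic retrieval improves citation accuracy on held-out scientific claims.",
    "Too short to keep.",
]
print(filter_claims(claims))  # keeps only the first claim
```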

3.3 Citation Verification

For each claim, we evaluated:

  • Found: Whether a citation was provided (DOI, arXiv-ID, or Author+Year)
  • Valid: Whether the citation exists and is retrievable
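A minimal sketch of the two checks, assuming simple DOI/arXiv-ID/Author+Year regexes for "Found" and an HTTP HEAD request against doi.org as the retrievability probe; the actual `evaluate_citation.py` may implement both differently:

```python
import re
import urllib.error
import urllib.request

# Illustrative citation patterns (assumptions, not the study's exact regexes)
DOI_RE = re.compile(r"\b10\.\d{4,9}/\S+\b")
ARXIV_RE = re.compile(r"\barXiv:\d{4}\.\d{4,5}\b", re.I)
AUTHOR_YEAR_RE = re.compile(r"\b[A-Z][a-z]+(?: et al\.)? \(?(19|20)\d{2}\)?")

def citation_found(response: str) -> bool:
    """'Found': the response contains a DOI, an arXiv ID, or an Author+Year pattern."""
    return bool(DOI_RE.search(response) or ARXIV_RE.search(response)
                or AUTHOR_YEAR_RE.search(response))

def citation_valid(doi: str, timeout: float = 10.0) -> bool:
    """'Valid': the DOI actually resolves (network check)."""
    req = urllib.request.Request(f"https://doi.org/{doi}", method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status < 400
    except urllib.error.HTTPError:
        return False
```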

4. Results

4.1 Main Findings

| Metric | Baseline    | Agentic      | Difference |
|--------|-------------|--------------|------------|
| Found  | 2/88 (2.3%) | 38/88 (43.2%)| +40.9 pp   |
| Valid  | 1/88 (1.1%) | 3/88 (3.4%)  | +2.3 pp    |

4.2 Statistical Analysis

McNemar's Test:

Contingency Table:
                    Agentic YES  Agentic NO
Baseline YES              1          1
Baseline NO              37         49

χ² = 32.24, p < 0.0001

Confidence Intervals (Wilson):

  • Agentic: 43.2% [95% CI: 35.2%, 55.5%]
  • Baseline: 2.3% [95% CI: 0.8%, 8.1%]

Effect Size: Phi coefficient: φ = 0.6053 (Large effect)

Risk Ratio: Agentic is 19x more likely to find a citation than baseline
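The McNemar statistic and the risk ratio can be reproduced directly from the contingency table above; this standard-library sketch uses the continuity-corrected form (|b − c| − 1)² / (b + c), which matches the reported χ²:

```python
# Discordant pairs from the contingency table above:
# b = Baseline YES / Agentic NO, c = Baseline NO / Agentic YES
b, c = 1, 37

# McNemar's test with continuity correction
chi2 = (abs(b - c) - 1) ** 2 / (b + c)
print(f"chi2 = {chi2:.2f}")  # chi2 = 32.24

# Risk ratio of finding a citation: agentic (38/88) vs. baseline (2/88)
n = 88
risk_ratio = (38 / n) / (2 / n)
print(f"risk ratio = {risk_ratio:.0f}x")  # risk ratio = 19x
```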

4.3 Qualitative Examples

Example 1: Agentic finds, Baseline does not

Claim: "Energy usage of random number generators in programming languages was accurately measured."

  • Baseline: (no citation provided)
  • Agentic: Antunes, K. & Hill, S. (2024). doi:10.48550/arXiv.2403.17090

5. Discussion

5.1 Implications for Research Practice

Our findings provide empirical evidence that AI agents with web search significantly improve citation accuracy compared to standalone LLMs. For researchers using AI writing assistants:

  • Prefer agentic approaches for citation-dependent tasks
  • Always verify AI-provided citations manually
  • Be aware that ~57% of agentic responses provided no citation at all, and only 3.4% yielded a valid, retrievable one

5.2 Limitations

  1. Sample size: n=88 claims from arXiv papers may not generalize to all domains
  2. Single LLM: We tested only MiniMax M2.5
  3. Domain specificity: Claims from AI/ML/NLP papers may be easier to verify

6. Conclusion

We demonstrated that AI agents with web search capabilities (43.2% citation success) significantly outperform LLMs relying on training knowledge alone (2.3%) for finding verifiable scientific citations. This 19x improvement is statistically significant (p < 0.0001) with a large effect size (φ = 0.61).

Our findings suggest that AI-assisted research should leverage agentic architectures for citation verification tasks. However, the gap between "found" and "valid" citations indicates that manual verification remains essential.

References

  1. Huang, L., et al. (2023). A Survey on Hallucination in Large Language Models. arXiv preprint arXiv:2311.05232.
  2. Liu, F., et al. (2024). Exploring and Evaluating Hallucinations in LLM-Powered Code Generation. arXiv preprint arXiv:2404.00971.
  3. Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.
  4. Manvi, R., et al. (2024). Geographic Variation in LLM DOI Fabrication. MDPI Papers.
  5. Press, O., et al. (2024). CiteME: A Benchmark for Citation Identification. arXiv preprint.
  6. Song, J., et al. (2024). RAG-HAT: A Hallucination-Aware Tuning Pipeline. EMNLP 2024.
  7. Kapoor, S., et al. (2024). AI Agents That Matter. arXiv preprint arXiv:2407.01502.

Appendix: Data & Code

All data and code required to reproduce this study are available in the project repository.

| Resource        | Location                          |
|-----------------|-----------------------------------|
| Claims dataset  | claims_normalized_filtered.json   |
| Raw results     | auto_experiment_results.json      |
| Evaluation code | evaluate_citation.py              |

About the Author

I'm a Master's student in Business Informatics at Friedrich Schiller University Jena, focused on AI research and empirical methods. This experiment was conducted as part of my research on AI-assisted academic writing.

nils.kriz@uni-jena.de