By Leslie Ellis, Ph.D., The Caissa Group & Mia Breitenstein, George Washington University Forensic Psychology M.A. Student

For legal professionals and trial consultants, AI has become a primary research tool—often without us realizing it. Google, Siri, and Alexa all use AI. For complex tasks, we turn to large language models (LLMs) like ChatGPT and Claude AI to summarize materials, analyze data, and automate processes. (Full disclosure – the first author used Claude AI to tighten up a too-wordy first draft of the introductory section of this article.) We’ve become so reliant on AI that the Pew Research Center found that when Google search results included an AI summary, users were much less likely to click on the actual search results than when no AI summary was included.[1] In legal research, AI has proven helpful but also problematic, with “hallucinations” appearing in court briefs and judicial orders.
Trial consultants rely on empirical research about legal decision-making to interpret mock trials and focus groups and to predict juror behavior. However, consultants outside academia have limited access to the peer-reviewed journals that publish this research.
How do we stay current on relevant research? AI could fill this gap. Literature reviews are time-consuming but straightforward—seemingly ideal for AI. Among major platforms, Claude AI is a natural choice, having been built for educators using academic sources.
The first author was invited to speak on “the relationship between race and damage awards.” Based on her knowledge, she suspected limited research existed. First, Leslie searched her own files. Second, she asked a graduate student intern with university library access to search; the intern found a few additional articles. Third, to ensure thoroughness, Leslie asked Claude AI to “summarize the empirical research on the relationship between race and damage awards in civil litigation, focusing on the race of the litigant, attorney, and juror.”
Claude initially returned about 5 pages of citations with minimal detail. Leslie requested more detailed summaries for each category and an APA reference list. Claude returned a 15-page document citing 63 articles—far more than expected, raising immediate suspicion. However, the summaries were coherent, aligned with cited researchers’ work, and appeared credible to anyone unfamiliar with the topic.
When reviewing the reference list, Leslie found a paper she’d co-authored with her graduate-school advisor that addressed race but not damages. This heightened her suspicion. She shelved Claude’s results and relied on traditional research methods to prepare for the panel. Months later, she asked intern Mia Breitenstein to verify Claude’s findings. That process and results follow.
Before analyzing and fact checking Claude AI’s citations, we created a spreadsheet to keep track of everything. We recorded each article title and author(s) and, for each article:
- Whether it existed
- If it existed, whether it reported what Claude AI summarized
- If it reported what Claude AI summarized, whether it was the original source or was citing another article
- What the article actually found
- Other miscellaneous notes/discrepancies (wrong publication year, authors, or title)
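The fields above amount to one record per citation. A minimal sketch of that tracking structure in Python (hypothetical field names; the actual tracking was done in a spreadsheet, not code):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CitationCheck:
    """One row of the verification spreadsheet (hypothetical schema)."""
    title: str
    authors: str
    exists: Optional[bool] = None                  # None until checked
    matches_claude_summary: Optional[bool] = None  # did it report what Claude said?
    is_original_source: Optional[bool] = None      # original vs. citing another article
    actual_finding: str = ""                       # what the article actually found
    notes: list = field(default_factory=list)      # e.g., wrong year/author/title

# Example row: an article that exists but was misreported.
row = CitationCheck(
    title="(example) Race and juror decision-making",
    authors="Doe & Roe",
    exists=True,
    matches_claude_summary=False,
    actual_finding="Addressed race but not damage awards.",
)
print(row.exists, row.matches_claude_summary)  # True False
```

A structure like this makes the later tallies (accurate, misreported, nonexistent, miscited) a simple filter over the rows.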
After collecting as many of the cited articles as possible, Mia did two rounds of searching, reading, and analyzing. The first round was a rough read-through and/or Google search (if the article had not already been found) to start filling in the spreadsheet. The purpose was to answer two questions: 1) did the article exist? and 2) did it find what Claude AI reported? She typically began each article by searching keywords such as “awards,” “race,” and “damages” and reviewing the search results, if any. During this first round, Mia also skimmed the articles’ abstracts, discussions, and conclusions to form an initial assessment of Claude AI’s accuracy. She then took brief notes and included direct quotes from the articles for easy reference and double-checking. The second round of searching and reading was more focused. To test her initial assessment of each article, she read each one with the goal of identifying its actual conclusions and citing specific supporting quotes, tracking quotes and page numbers from the original articles. This second round also allowed Mia to confirm that the articles or books she was unable to find truly did not exist.
Soon after Mia completed the second round, we had a call to discuss emerging patterns, notes of interest, and general findings. We talked about a few articles and books that did not seem to exist. In one example of an elusive article, both Leslie and Mia copied a citation straight from Claude AI and plugged it into Google. The results were telling: Google’s own AI-generated overview summarized what the article in question supposedly found, even though the article does not exist, and referenced other related sources. Not only did Google’s AI attempt to report the nonexistent article’s findings, but when we ran the same search at the same time using the same citation, we received different answers from Google, each citing different related articles. This was in June of 2025. Interestingly, a current Google search for the same fake citation yields yet another set of results; Google’s AI overview now reports that there is no known academic work by those authors or with that title, but offers potentially related articles that do exist. The discrepancies among these three Google searches (Leslie’s and Mia’s in June 2025, Mia’s in September 2025) illustrate the unreliability of Google’s AI function for this type of task, in addition to that of Claude AI.
At the beginning of this exercise, we had some suspicions about Claude AI’s results, for the reasons previously mentioned. We expected to see mostly circular citations – multiple later articles citing back to a few original sources. However, this was not the case. The majority of articles fell into one of two categories: they did not exist – that is, they were completely made-up articles – or they did exist, but Claude AI’s summary was inaccurate.
Out of the 63 citations Claude AI produced:
- Only six articles were accurately reported by Claude AI.
- Even for those six, Claude AI’s reports were often oversimplified.
- Twenty-eight of the cited articles existed but were inaccurately summarized.
- Eighteen did not exist at all.
If you’re keeping up with the math, that leaves 11 that fell into an odd half-state category. These 11 were articles that existed, sort of. When retrieving original articles, Mia found that several did exist, but Claude AI’s citation was incorrect; it would have the publication year, an author, or even part of the title wrong.
In other words, Claude AI’s results were accurate for fewer than 10% of its cited references.
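The arithmetic behind these figures can be checked directly. A minimal sketch, using only the category counts reported above:

```python
# Category counts from the verification exercise (out of 63 citations).
counts = {
    "accurate": 6,                     # correctly reported by Claude AI
    "existed_but_misreported": 28,     # real articles, inaccurate summaries
    "nonexistent": 18,                 # completely made-up articles
    "existed_with_wrong_citation": 11, # real articles, wrong year/author/title
}

total = sum(counts.values())
accuracy = counts["accurate"] / total

print(total)              # 63
print(f"{accuracy:.1%}")  # 9.5% — i.e., under 10%
```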
Leslie’s co-authored paper that Claude AI cited was a perfect example of the risk of using an LLM – it can be both close enough to be plausible and wrong enough to get you into trouble. While the paper that Claude AI cited was about race but not damages, Leslie had done other research on damage awards. Including her paper in a summary of research on race and damages was not out of the range of possibilities, just not accurate. When it comes to using AI for literature reviews and online research, the principle is clear: “don’t trust, and still verify”.
But if you don’t have access to the original research, how do you verify AI’s results? That’s why we use AI for literature reviews in the first place. Most articles are available for purchase, but the costs add up quickly if you’re doing a broad search. There are a few ways to gain access to original research that won’t break the bank.
- Get an intern! This is a slightly cheeky but also serious recommendation. Many graduate programs require their students to complete a specific number of internship hours, and some students in trial consulting-relevant fields (e.g., forensic psychology) are very interested in trial consulting. Enrolled students also have full access to their library’s databases and publications. Helping with literature reviews is one of many tasks interns can do to support consultants, while also learning about the industry and research that is relevant to the field.
- Consultants who are ASTC members can post a request on the ASTC members-only listserv and see if anyone has access to articles that they can share privately (without violating copyrights, of course).
- Subscribe to JSTOR (a digital library of scholarly publications) or a professional organization like the American Psychological Association, whose memberships often include access to digital libraries or subscriptions to publications, as well as discounted subscriptions to other publications.
- Take advantage of your public library. Library cards are free and grant you access to all of the databases the library subscribes to, most of which you can use from your computer at home (though some may require library computers). Our local libraries subscribe to JSTOR, with thousands of full-text PDFs available for download!
- There are several free online sites to which researchers can upload unpublished versions (e.g., university-branded preprints) of their own published papers, such as SSRN, Academia.edu, ResearchGate, or Google Scholar. Registration is often required, but all have free subscription options.
- Finally, many authors are happy to email PDFs of their articles, when asked. Most researchers hope their research reaches beyond the journal in which it’s published. This is also a great way to open communication channels between empirical researchers and consultants.
Staying current with empirical research relevant to client issues is essential for consultants, yet access to original research remains limited. AI LLMs offer tempting shortcuts but carry inherent risks. Our experience serves as a cautionary reminder of AI’s limitations for literature reviews, which should be supplemented with more reliable alternatives for accessing research.
[1] https://www.pewresearch.org/short-reads/2025/07/22/google-users-are-less-likely-to-click-on-links-when-an-ai-summary-appears-in-the-results/