{Reference Type}: Journal Article {Title}: ChatGPT-4 Performs Clinical Information Retrieval Tasks Using Consistently More Trustworthy Resources Than Does Google Search for Queries Concerning the Latarjet Procedure. {Author}: Oeding JF;Lu AZ;Mazzucco M;Fu MC;Taylor SA;Dines DM;Warren RF;Gulotta LV;Dines JS;Kunze KN; {Journal}: Arthroscopy {Volume}: 0 {Issue}: 0 {Year}: 2024 Jun 25 {Factor}: 5.973 {DOI}: 10.1016/j.arthro.2024.05.025 {Abstract}: OBJECTIVE: To assess the ability of ChatGPT-4, an automated chatbot powered by artificial intelligence, to answer common patient questions concerning the Latarjet procedure for patients with anterior shoulder instability, and to compare its performance with that of Google Search Engine.
METHODS: Using previously validated methods, a Google search was first performed using the query "Latarjet." Subsequently, the top 10 frequently asked questions (FAQs) and associated sources were extracted. ChatGPT-4 was then prompted to provide the top 10 FAQs and answers concerning the procedure. This process was repeated to identify additional FAQs requiring discrete numeric answers, allowing a direct comparison between ChatGPT-4 and Google. The discrete numeric answers were subsequently assessed for accuracy on the basis of the clinical judgment of 2 fellowship-trained sports medicine surgeons who were blinded to the search platform.
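The abstract does not specify the exact prompt or interface used to elicit FAQs from ChatGPT-4. The following is a minimal illustrative sketch of this prompting step, assuming programmatic access through the OpenAI Python SDK and the "gpt-4" model identifier; the study may instead have used the consumer ChatGPT interface, and the prompt wording here is an assumption.

    # Illustrative sketch only; prompt wording, SDK usage, and model name are assumptions.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    prompt = (
        "List the 10 most frequently asked patient questions about the Latarjet "
        "procedure for anterior shoulder instability, answer each question, and "
        "cite the source used for each answer."
    )

    response = client.chat.completions.create(
        model="gpt-4",  # assumed model identifier
        messages=[{"role": "user", "content": prompt}],
    )

    print(response.choices[0].message.content)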
RESULTS: Mean (± standard deviation) accuracy of numeric-based answers was 2.9 ± 0.9 for ChatGPT-4 versus 2.5 ± 1.4 for Google (P = .65). ChatGPT-4 derived its answers exclusively from academic sources, whereas Google Search Engine drew on academic sources for only 30% of answers, with the remainder coming from individual surgeons' websites (50%) and larger medical practices' websites (20%); this difference was significant (P = .003). For general FAQs, 40% were identical between ChatGPT-4 and Google Search Engine. Regarding the sources used to answer these questions, ChatGPT-4 again used 100% academic resources, whereas Google Search Engine used 60% academic resources, 20% surgeon personal websites, and 20% medical practice websites (P = .087).
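The abstract reports P values for these source-distribution comparisons without naming the statistical tests used. The sketch below is illustrative only, assuming a chi-square test of independence on the source-category counts implied by the reported percentages (10 sources per platform); the study may have used a different test (an exact test would be more appropriate for counts this small), so the computed P value need not match the reported one.

    # Illustrative only: source-category counts implied by the reported percentages,
    # assuming 10 sources per platform. The choice of test is an assumption.
    from scipy.stats import chi2_contingency

    # Rows: ChatGPT-4, Google Search Engine
    # Columns: academic, individual-surgeon website, medical-practice website
    counts = [
        [10, 0, 0],  # ChatGPT-4: 100% academic
        [3, 5, 2],   # Google: 30% academic, 50% surgeon, 20% practice
    ]

    chi2, p_value, dof, expected = chi2_contingency(counts)
    print(f"chi-square = {chi2:.2f}, P = {p_value:.3f}")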
CONCLUSIONS: ChatGPT-4 demonstrated the ability to provide accurate and reliable information about the Latarjet procedure in response to patient queries, using multiple academic sources in all cases. This was in contrast to Google Search Engine, which more frequently used single-surgeon and large medical practice websites. Despite differences in the resources accessed to perform information retrieval tasks, the clinical relevance and accuracy of information provided did not significantly differ between ChatGPT-4 and Google Search Engine.
CLINICAL RELEVANCE: Commercially available large language models (LLMs), such as ChatGPT-4, can perform diverse information retrieval tasks on demand. An important medical information retrieval application of LLMs is providing comprehensive, relevant, and accurate information for use cases such as learning about a recently diagnosed medical condition or a planned procedure. Understanding how well LLMs perform in these use cases has important implications for their deployment within health care settings.