Chain-of-Thought reasoning versus linguistic optimization for artificial intelligence models on the prosthodontics section of a dental licensing examination

Authors

Nan Hsu Myat Mon Hlaing, Koungjin Park, Seoyoun Hahn, Su Young Lee, In-Sung Luke Yeo, Jae-Hyun Lee

Journal

J Prosthet Dent

Year

2025

Hlaing NHMM, Park K, Hahn S, Lee SY, Yeo IL, Lee JH* (Corresponding author). Chain-of-Thought reasoning versus linguistic optimization for artificial intelligence models on the prosthodontics section of a dental licensing examination. J Prosthet Dent. 2025 Oct 13:S0022-3913(25)00779-6.


Abstract

Statement of problem: A critical gap remains in understanding how large language models (LLMs) built on different architectural philosophies perform in specialized clinical domains examined in non-Indo-European languages, particularly whether Chain-of-Thought (CoT) reasoning models are more effective than models optimized for local languages.

Purpose: The purpose of this study was to compare the performance of 6 LLMs with distinct architectural philosophies (CoT-reasoning, general-purpose, and Korean-optimized) on the prosthodontics section of the Korean Dental Licensing Examination (KDLE) and to contextualize their performance against published human averages.

Material and methods: A total of 161 Korean-language, text-only multiple-choice questions from the prosthodontics section of the KDLE (2020-2024) were presented to 6 LLMs (ChatGPT-o1, ChatGPT-4o, DeepSeek-R1, DeepSeek-V3, Gemini 1.5 Flash, and CLOVA X). Each test set was administered 6 times. The questions were classified into 5 domains: diagnosis and treatment planning, mandibular movements and occlusal relationships, removable complete denture, removable partial denture, and fixed prosthodontics. Performance was measured as percentage accuracy and analyzed using the Cochran Q and post hoc McNemar tests (α=.05). LLM scores were contextually benchmarked against the average performance of human examinees.
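The statistical analysis described above can be reproduced in outline with standard libraries. The following is a minimal sketch, not the authors' code: the correctness matrix is hypothetical placeholder data, and the sketch assumes item-level binary scores (161 questions × 6 models) analyzed with the Cochran Q omnibus test followed by Bonferroni-corrected pairwise McNemar tests via statsmodels.

```python
import numpy as np
from itertools import combinations
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar

# Hypothetical correctness matrix: 161 questions x 6 models,
# 1 = correct, 0 = incorrect (random placeholder data, illustration only).
rng = np.random.default_rng(0)
models = ["ChatGPT-o1", "ChatGPT-4o", "DeepSeek-R1",
          "DeepSeek-V3", "Gemini 1.5 Flash", "CLOVA X"]
results = rng.integers(0, 2, size=(161, len(models)))

# Cochran Q: omnibus test for differences among k related binary outcomes.
q = cochrans_q(results)
print(f"Cochran Q = {q.statistic:.2f}, P = {q.pvalue:.4f}")

# Post hoc pairwise McNemar tests with a Bonferroni-adjusted alpha.
pairs = list(combinations(range(len(models)), 2))
alpha = 0.05 / len(pairs)
for i, j in pairs:
    a, b = results[:, i], results[:, j]
    # 2x2 table of agreement/disagreement between the two models
    table = [[np.sum((a == 1) & (b == 1)), np.sum((a == 1) & (b == 0))],
             [np.sum((a == 0) & (b == 1)), np.sum((a == 0) & (b == 0))]]
    m = mcnemar(table, exact=True)
    verdict = "significant" if m.pvalue < alpha else "n.s."
    print(f"{models[i]} vs {models[j]}: P = {m.pvalue:.4f} ({verdict})")
```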

Results: Significant performance differences were observed among the models (P<.001). The CoT-based model, ChatGPT-o1, achieved the highest overall accuracy (80.54%); the total human average (79.51%) fell within this LLM’s 95% confidence interval. ChatGPT-4o (71.84%) and DeepSeek-R1 (70.19%) also demonstrated consistent passing-level performance. The Korean-language-optimized model, CLOVA X, obtained the lowest score (34.37%). The performance ranking of the models within individual domains generally mirrored the overall ranking.
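The statement that the human average falls within ChatGPT-o1's 95% confidence interval can be illustrated with a standard proportion interval. A minimal sketch, assuming the interval is computed over pooled item-level trials (161 questions × 6 administrations; the paper's exact interval method is not given here) and using the Wilson method from statsmodels:

```python
from statsmodels.stats.proportion import proportion_confint

# Assumed pooling: 161 questions x 6 administrations = 966 binary trials.
n_trials = 161 * 6
n_correct = round(0.8054 * n_trials)  # ChatGPT-o1 overall accuracy

low, high = proportion_confint(n_correct, n_trials, alpha=0.05, method="wilson")
print(f"95% CI: [{low:.4f}, {high:.4f}]")
# Under these assumptions the interval spans roughly 0.78 to 0.83,
# which contains the reported human average of 0.7951.
```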

Conclusions: LLMs with CoT-reasoning architectures can achieve passing-level accuracy on non-English dental licensing examinations at a level contextually comparable to the human average, but performance varied significantly by architecture, and localized language optimization did not ensure domain expertise.