arxiv:2505.16000

Leveraging Online Data to Enhance Medical Knowledge in a Small Persian Language Model

Published on May 21, 2025


Authors:

Mehrdad Ghassabi,

Pedram Rostami,

Abstract

Curated datasets of medical text and QA pairs improve the medical knowledge and question-answering performance of small language models in low-resource languages such as Persian.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

The rapid advancement of language models has demonstrated the potential of artificial intelligence in healthcare. However, small language models struggle with specialized domains in low-resource languages such as Persian. Although numerous Persian-language medical websites exist, no curated dataset or corpus has been available, making ours the first of its kind. This study explores enhancing the medical knowledge of a small language model by leveraging accessible online data, including a corpus crawled from medical magazines and a dataset of real doctor-patient QA pairs. We fine-tuned a baseline model on this curated data to improve its medical knowledge. Benchmark evaluations show that the fine-tuned model answers medical questions more accurately and provides better responses than its baseline. This work highlights the potential of open-access online data to enrich small language models in medical fields, offering a novel solution for Persian medical AI applications suited to resource-constrained environments.