The 10th International Congress on Information and Communication Technology, held concurrently with the ICT Excellence Awards (ICICT 2025), will take place in London, United Kingdom | February 18-21, 2025.
Authors - Pouyan Nahed, Sepideh Farivar, Kazem Taghva

Abstract - This paper presents a large-scale biomedical Named Entity Recognition (NER) dataset automatically annotated using a Large Language Model (LLM) applied to the eligibility criteria from ClinicalTrials.gov. The dataset comprises over 4.6 million named entities, covering categories such as diseases, interventions, outcomes, and participants. A pseudo-labeling approach was employed to generate annotations with soft labels, providing confidence scores for each entity. We address challenges related to entity ambiguity and label inconsistency by introducing a structured mapping strategy to ensure uniformity across the dataset. The resulting dataset is a valuable resource for advancing tasks such as NER, information extraction, and text classification in biomedical research. By making this dataset publicly available, we aim to support the development of AI-driven healthcare applications.
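The abstract does not give implementation details, so the following is only an illustrative sketch of how soft-labeled entities and a label mapping could fit together. The `LABEL_MAP` table, the `Entity` record, the `normalize` function, and the `min_confidence` threshold are all hypothetical names introduced here, not the authors' method. Only the four canonical categories come from the abstract; the variant label spellings are invented for the example.

```python
from dataclasses import dataclass

# Hypothetical mapping from free-form LLM labels to canonical categories.
# The canonical categories follow the abstract; the variants are assumptions.
LABEL_MAP = {
    "disease": "Disease",
    "condition": "Disease",
    "disorder": "Disease",
    "intervention": "Intervention",
    "drug": "Intervention",
    "treatment": "Intervention",
    "outcome": "Outcome",
    "participant": "Participant",
    "population": "Participant",
}


@dataclass
class Entity:
    text: str          # entity span as it appears in the eligibility criteria
    label: str         # canonical category after mapping
    confidence: float  # soft label: a confidence score in [0, 1]


def normalize(raw_entities, min_confidence=0.0):
    """Map raw (text, label, confidence) triples to canonical entities.

    Entities whose label has no mapping, or whose confidence falls below
    min_confidence, are dropped.
    """
    result = []
    for text, raw_label, confidence in raw_entities:
        canonical = LABEL_MAP.get(raw_label.strip().lower())
        if canonical is None or confidence < min_confidence:
            continue
        result.append(Entity(text, canonical, confidence))
    return result


raw = [
    ("type 2 diabetes", " Condition ", 0.93),
    ("metformin", "Drug", 0.88),
    ("adults", "population", 0.20),
]
entities = normalize(raw, min_confidence=0.5)
```

Keeping the confidence score on each record, rather than discarding it after filtering, would let downstream users pick their own threshold or weight training examples by confidence.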