BACKGROUND: Named entity recognition (NER) is a fundamental task in natural language processing. However, it is typically preceded by named entity annotation, which poses several challenges, especially in the clinical domain. For instance, determining entity boundaries is one of the most common sources of disagreement between annotators, due to questions such as whether modifiers or peripheral words should be annotated. If unresolved, these disagreements can induce inconsistency in the produced corpora; on the other hand, strict guidelines or adjudication sessions can further prolong an already slow and convoluted process.
OBJECTIVE: The aim of this study is to address these challenges by evaluating 2 novel annotation methodologies, lenient span and point annotation, designed to mitigate the difficulty of precisely determining entity boundaries.
METHODS: We evaluate their effects through an annotation case study on a Japanese medical case report data set. We compare annotation time, annotator agreement, and the quality of the produced labeling and assess the impact on the performance of an NER system trained on the annotated corpus.
RESULTS: We observed significant improvements in labeling process efficiency, with up to a 25% reduction in overall annotation time and a 10% improvement in annotator agreement compared to the traditional boundary-strict approach. However, even the best-performing NER model showed some drop in performance compared to the traditional annotation methodology.
CONCLUSIONS: Our findings demonstrate a trade-off between annotation speed and model performance. Although disregarding boundary information affects model performance to some extent, this is counterbalanced by significant reductions in annotators' workload and notable improvements in the speed of the annotation process. These benefits may prove valuable in various applications, offering an attractive compromise for developers and researchers.