Automatic summarization model based on clustering algorithm

Abstract:

Extractive document summarization is usually treated as a sequence labeling task in which the summary is composed of sentences taken from the original document. However, the selected sentences are often highly redundant in semantic space, so the resulting summary is also semantically redundant. To alleviate this problem, we propose a model that reduces the semantic redundancy of the summary by introducing a clustering algorithm to select sentences that differ in semantic space, and we improve the base BERT model to score sentences. We evaluate our model with ROUGE on the CNN/DailyMail dataset, perform significance testing, and compare it with six baselines: two traditional methods and four state-of-the-art deep learning models. The results validate the effectiveness of our approach, which leverages the K-means algorithm to produce summaries whose sentences are more accurate and less semantically repetitive.
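The abstract does not give implementation details, so the following is only a minimal sketch of the general idea, under stated assumptions: sentence embeddings and per-sentence scores are taken as already computed (for example by a BERT-based scorer), and the selection rule, keeping the highest-scoring sentence from each K-means cluster, is an illustrative guess rather than the authors' confirmed procedure. The function names `kmeans` and `select_summary` are hypothetical.

```python
import numpy as np


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct rows of X.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = X[labels == c]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(axis=0)
    return labels


def select_summary(embeddings, scores, k):
    """Pick the best-scoring sentence from each of k clusters.

    Assumption: `embeddings` (n x d) and `scores` (n) are precomputed;
    this rule is an illustration, not the paper's confirmed method.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    scores = np.asarray(scores, dtype=float)
    labels = kmeans(embeddings, k)
    chosen = []
    for c in range(k):
        idx = np.where(labels == c)[0]
        if len(idx):
            chosen.append(int(idx[scores[idx].argmax()]))
    return sorted(chosen)  # restore document order


# Toy data: sentences 0-1 are near-duplicates, as are sentences 2-3.
emb = [[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]]
scr = [0.9, 0.8, 0.3, 0.7]
print(select_summary(emb, scr, k=2))  # → [0, 3]
```

On this toy input, picking the top two sentences by score alone would return sentences 0 and 1, which are near-duplicates; selecting one sentence per cluster returns 0 and 3 instead, which is the kind of redundancy reduction the abstract describes.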
