Keywords: distance metric; gene clustering; gene expression; hierarchical clustering; linkage method; pleiotropy

Source: DOI:10.2196/30890

Abstract:
BACKGROUND: Large amounts of biological data have been generated over the last few decades, encouraging scientists to look for connections between the genes that cause various diseases. Clustering can reveal such relationships among numerous species and genes. Finding an appropriate combination of distance metric and linkage method to construct clusters from diverse biological data sets has thus become critical. Pleiotropy also matters, because it lets the expression of a single gene vary and produce different effects in living organisms. Identifying the pleiotropy of genes responsible for various diseases has become a major research challenge.
OBJECTIVE: Our goal was to establish the optimal distance-linkage strategy for creating reliable clusters from diverse data sets, and to identify the common genes that cause various tumors in order to observe genes with pleiotropic effects.
METHODS: We considered 4 linkage methods (single, complete, average, and Ward) and 3 distance metrics (Euclidean, maximum, and Manhattan). To assess the quality of different sets of clusters, we used a fitness function that combines silhouette width and within-cluster distance.
RESULTS: According to our findings, the maximum distance metric produces the highest-quality clusters. Moreover, the average linkage method works best for medium data sets, and the Ward linkage method works best for large data sets. Ensemble clustering does not improve the outcome. We also discovered genes that cause 3 different cancers and used gene enrichment to confirm our findings.
CONCLUSIONS: Accuracy is crucial in clustering, and we investigated the accuracy of numerous clustering techniques in this study. Other studies may aid related work if the data set is similar to ours.
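
The comparison described in METHODS (every pairing of 3 distance metrics with 4 linkage methods, scored on silhouette width and within-cluster distance) might be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation. The abstract does not give the fitness formula, so the combination used here (mean silhouette divided by one plus the mean within-cluster distance) is a hypothetical stand-in. The function names, the use of SciPy, and the synthetic data are likewise assumptions.

```python
import itertools

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

# SciPy names for the three distances listed in the abstract.
METRICS = {"euclidean": "euclidean", "maximum": "chebyshev", "manhattan": "cityblock"}
LINKAGES = ["single", "complete", "average", "ward"]


def mean_silhouette(dist, labels):
    """Average silhouette width computed from a square distance matrix."""
    clusters = np.unique(labels)
    if len(clusters) < 2:
        return 0.0
    widths = np.zeros(len(labels))
    for i, own in enumerate(labels):
        same = labels == own
        if same.sum() == 1:
            continue  # singleton clusters keep width 0 by convention
        a = dist[i, same].sum() / (same.sum() - 1)  # mean distance to own cluster
        b = min(dist[i, labels == c].mean() for c in clusters if c != own)
        denom = max(a, b)
        widths[i] = (b - a) / denom if denom > 0 else 0.0
    return float(widths.mean())


def mean_within_distance(dist, labels):
    """Mean pairwise distance between members of the same cluster."""
    same_pair = np.triu(labels[:, None] == labels[None, :], k=1)
    return float(dist[same_pair].mean()) if same_pair.any() else 0.0


def score_combinations(data, n_clusters):
    """Score each (distance, linkage) pair with a hypothetical fitness value."""
    scores = {}
    for (name, metric), method in itertools.product(METRICS.items(), LINKAGES):
        condensed = pdist(data, metric=metric)
        # Ward linkage is formally defined for Euclidean distances only;
        # SciPy still runs it on other precomputed distances.
        tree = linkage(condensed, method=method)
        labels = fcluster(tree, t=n_clusters, criterion="maxclust")
        dist = squareform(condensed)
        sil = mean_silhouette(dist, labels)
        within = mean_within_distance(dist, labels)
        # Assumed combination: reward high silhouette, penalize spread-out clusters.
        scores[(name, method)] = sil / (1.0 + within)
    return scores


# Synthetic stand-in for an expression matrix (rows = genes, columns = samples).
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.3, (20, 5)), rng.normal(3, 0.3, (20, 5))])
scores = score_combinations(data, n_clusters=2)
best = max(scores, key=scores.get)
print("best (distance, linkage):", best)
```

Because within-cluster distance is measured in the units of each metric, this assumed score is not strictly comparable across metrics without normalization. A real comparison would need the fitness function as defined in the paper.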