Keywords: cluster analysis gene expression software toolkit unsupervised learning

MeSH: Cluster Analysis; Transcriptome; Software; Gene Expression Profiling / methods; Humans; Computational Biology / methods; Machine Learning; High-Throughput Nucleotide Sequencing / methods; Sequence Analysis, RNA / methods; Algorithms

Source: DOI:10.1093/gigascience/giae039

Abstract:
Cohort studies increasingly collect biosamples for molecular profiling and are observing molecular heterogeneity. High-throughput RNA sequencing is providing large datasets capable of reflecting disease mechanisms. Clustering approaches have produced a number of tools to help dissect complex heterogeneous datasets, but selecting the appropriate method and parameters to perform exploratory clustering analysis of transcriptomic data requires deep understanding of machine learning and extensive computational experimentation. Tools that assist with such decisions without prior field knowledge are nonexistent. To address this, we have developed Omada, a suite of tools aiming to automate these processes and make robust unsupervised clustering of transcriptomic data more accessible through automated machine learning-based functions.
The efficiency of each tool was tested with 7 datasets characterized by different expression signal strengths to capture a wide spectrum of RNA expression datasets. Our toolkit's decisions reflected the real number of stable partitions in datasets where the subgroups are discernible. Within datasets with less clear biological distinctions, our tools either formed stable subgroups with different expression profiles and robust clinical associations or revealed signs of problematic data such as biased measurements.
In conclusion, Omada successfully automates the robust unsupervised clustering of transcriptomic data, making advanced analysis accessible and reliable even for those without extensive machine learning expertise. Implementation of Omada is available at http://bioconductor.org/packages/omada/.