MeSH: RNA Splicing; Single-Cell Analysis / methods; Sequence Analysis, RNA / methods; Humans; Software; RNA-Seq / methods; Algorithms; Single-Cell Gene Expression Analysis

Source: DOI:10.1093/bioinformatics/btae207

Abstract:
BACKGROUND: Short-read single-cell RNA-sequencing (scRNA-seq) has been used to study cellular heterogeneity, cellular fate, and transcriptional dynamics. Modeling splicing dynamics in scRNA-seq data is challenging, with inherent difficulty in even the seemingly straightforward task of elucidating the splicing status of the molecules from which sequenced fragments are drawn. This difficulty arises, in part, from the limited read length and positional biases, which substantially reduce the specificity of the sequenced fragments. As a result, the splicing status of many reads in scRNA-seq is ambiguous because of a lack of definitive evidence. We are therefore in need of methods that can recover the splicing status of ambiguous reads which, in turn, can lead to more accuracy and confidence in downstream analyses.
RESULTS: We develop Forseti, a predictive model to probabilistically assign a splicing status to scRNA-seq reads. Our model has two key components. First, we train a binding affinity model to assign a probability that a given transcriptomic site is used in fragment generation. Second, we fit a robust fragment length distribution model that generalizes well across datasets deriving from different species and tissue types. Forseti combines these two trained models to predict the splicing status of the molecule of origin of reads by scoring putative fragments that associate each alignment of sequenced reads with proximate potential priming sites. Using both simulated and experimental data, we show that our model can precisely predict the splicing status of many reads and identify the true gene origin of multi-gene mapped reads.
AVAILABILITY AND IMPLEMENTATION: Forseti and the code used for producing the results are available at https://github.com/COMBINE-lab/forseti under a BSD 3-clause license.
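
To make the scoring idea in RESULTS concrete, here is a minimal sketch of how a priming-site probability and a fragment-length probability might be combined to weigh a spliced against an unspliced origin for one read. It is not Forseti's implementation. The function names, the Gaussian fragment-length placeholder (with its `mean` and `sd` values), the choice of the best-scoring site, and the normalization into a single probability are all assumptions made for illustration. In the paper both components are trained models.

```python
import math


def fragment_length_prob(length, mean=300.0, sd=80.0):
    """Hypothetical fragment-length model: a Gaussian density.

    Stands in for the fitted fragment length distribution; the
    mean and sd values are illustrative assumptions.
    """
    if length <= 0:
        return 0.0
    z = (length - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def best_fragment_score(read_start, priming_sites):
    """Score the putative fragments implied by one alignment.

    priming_sites is a list of (position, binding_prob) pairs, where
    binding_prob stands in for the output of a binding affinity model.
    Each site implies a fragment running from read_start to the site.
    Taking the maximum over sites is an assumption of this sketch.
    """
    best = 0.0
    for site_pos, binding_prob in priming_sites:
        length = site_pos - read_start
        score = binding_prob * fragment_length_prob(length)
        best = max(best, score)
    return best


def spliced_probability(spliced_start, spliced_sites,
                        unspliced_start, unspliced_sites):
    """Normalize the two scores into a probability of spliced origin."""
    s = best_fragment_score(spliced_start, spliced_sites)
    u = best_fragment_score(unspliced_start, unspliced_sites)
    total = s + u
    if total == 0.0:
        return 0.5  # no evidence either way: the read stays ambiguous
    return s / total


# Toy example: on the spliced transcript the nearest priming site implies
# a 300 nt fragment; on the unspliced one it implies a 1900 nt fragment.
p = spliced_probability(100, [(400, 0.9)], 100, [(2000, 0.9)])
print(f"P(spliced) = {p:.3f}")
```

In this toy case the spliced alignment wins because its implied fragment length is plausible under the placeholder distribution, while the unspliced alignment would require an implausibly long fragment. When neither alignment has a nearby priming site, the sketch returns 0.5 and the read remains ambiguous.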