small samples

  • Article type: Journal Article
    Deep-learning tools that extract prognostic factors derived from multi-omics data have recently contributed to individualized predictions of survival outcomes. However, the limited size of integrated omics-imaging-clinical datasets poses challenges. Here, we propose two biologically interpretable and robust deep-learning architectures for survival prediction of non-small cell lung cancer (NSCLC) patients, learning simultaneously from computed tomography (CT) scan images, gene expression data, and clinical information. The proposed models integrate patient-specific clinical, transcriptomic, and imaging data and incorporate Kyoto Encyclopedia of Genes and Genomes (KEGG) and Reactome pathway information, adding biological knowledge within the learning process to extract prognostic gene biomarkers and molecular pathways. While both models accurately stratify patients into high- and low-risk groups when trained on a dataset of only 130 patients, introducing a cross-attention mechanism in a sparse autoencoder significantly improves the performance, highlighting tumor regions and NSCLC-related genes as potential biomarkers and thus offering a significant methodological advancement when learning from small imaging-omics-clinical samples.
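    The cross-attention step described above can be sketched generically: imaging tokens query gene/pathway tokens so that each image region attends over the transcriptomic features. This is a minimal single-head NumPy sketch; the shapes, random projections, and token counts are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(img_feats, gene_feats, d_k=16, seed=0):
    """Single-head cross-attention: imaging tokens query gene tokens.

    img_feats:  (n_img_tokens, d_img)   e.g. CT patch embeddings
    gene_feats: (n_gene_tokens, d_gene) e.g. pathway-level gene embeddings
    """
    rng = np.random.default_rng(seed)
    W_q = rng.normal(size=(img_feats.shape[1], d_k)) / np.sqrt(img_feats.shape[1])
    W_k = rng.normal(size=(gene_feats.shape[1], d_k)) / np.sqrt(gene_feats.shape[1])
    W_v = rng.normal(size=(gene_feats.shape[1], d_k)) / np.sqrt(gene_feats.shape[1])
    Q, K, V = img_feats @ W_q, gene_feats @ W_k, gene_feats @ W_v
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (n_img, n_gene) attention map
    return weights @ V, weights                # fused features, attention weights

img = np.random.default_rng(1).normal(size=(8, 32))     # 8 CT patch tokens
genes = np.random.default_rng(2).normal(size=(20, 64))  # 20 pathway tokens
fused, attn = cross_attention(img, genes)
```

    The attention map is what makes such a model inspectable: rows with concentrated weight indicate which gene features a given tumor region relies on.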

  • Article type: Journal Article
    Semiparametric probabilistic index models allow for the comparison of two groups of observations, whilst adjusting for covariates, thereby fitting nicely within the framework of generalized pairwise comparisons (GPC). As with most regression approaches in this setting, the limited amount of data results in invalid inference as the asymptotic normality assumption is not met. In addition, separation issues might arise when considering small samples. In this article, we show that the parameters of the probabilistic index model can be estimated using generalized estimating equations, for which adjustments exist that lead to estimators of the sandwich variance-covariance matrix with improved finite sample properties and that can deal with bias due to separation. In this way, appropriate inference can be performed, as is shown through extensive simulation studies. The known relationships between the probabilistic index and other GPC statistics also allow valid inference to be provided for, for example, the net treatment benefit or the success odds.
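    The GPC building block behind this abstract is simple to state: the probabilistic index is the proportion of treatment-control pairs the treatment "wins", with ties counted half, and the net treatment benefit follows algebraically. A minimal NumPy sketch of the unadjusted versions (the article's actual estimation uses a semiparametric regression model fitted via GEE, which this does not reproduce):

```python
import numpy as np

def probabilistic_index(y_treat, y_ctrl):
    """Unadjusted probabilistic index P(Y_t > Y_c) + 0.5 * P(Y_t = Y_c),
    computed from all pairwise treatment-control comparisons."""
    diff = y_treat[:, None] - y_ctrl[None, :]
    wins = (diff > 0).mean()
    ties = (diff == 0).mean()
    return wins + 0.5 * ties

def net_treatment_benefit(y_treat, y_ctrl):
    """NTB = P(Y_t > Y_c) - P(Y_t < Y_c), which equals 2 * PI - 1."""
    return 2.0 * probabilistic_index(y_treat, y_ctrl) - 1.0

y_t = np.array([3.0, 5.0, 7.0])
y_c = np.array([2.0, 4.0, 6.0])
pi = probabilistic_index(y_t, y_c)   # 6 of the 9 pairs favor treatment
ntb = net_treatment_benefit(y_t, y_c)
```

    The identity NTB = 2·PI − 1 is exact: NTB = P(win) − P(loss) = 2P(win) + P(tie) − 1.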

  • Article type: Journal Article
    This article proposes a performance measure to evaluate the detection performance of a control chart with a given sampling strategy for finite or small sample sequences, and proves that the CUSUM control chart with a dynamic non-random control limit and a given sampling strategy can be optimal under this measure. Numerical simulations and real earthquake data are provided to illustrate that, for different sampling strategies, the CUSUM chart will have different monitoring performance in change-point detection. Among the six sampling strategies that take only a part of the samples, the numerical comparison results illustrate that the uniform sampling strategy (uniformly dispersed sampling strategy) has the best monitoring effect.
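    For orientation, the standard upper one-sided CUSUM recursion that the chart above builds on can be sketched in a few lines. This is a generic textbook version with the usual reference value k and fixed decision interval h; the article's dynamic non-random control limits and partial-sampling strategies are not reproduced here.

```python
import numpy as np

def cusum(x, target=0.0, k=0.5, h=4.0):
    """Upper one-sided CUSUM: S_t = max(0, S_{t-1} + (x_t - target) - k).
    Signals a change at the first t with S_t > h.
    Returns the statistic path and the alarm index (or None)."""
    s = np.zeros(len(x))
    prev = 0.0
    alarm = None
    for t, xt in enumerate(x):
        prev = max(0.0, prev + (xt - target) - k)
        s[t] = prev
        if alarm is None and prev > h:
            alarm = t
    return s, alarm

rng = np.random.default_rng(0)
# In-control N(0,1) for 50 points, then a mean shift of 1.5
x = np.concatenate([rng.normal(0, 1, 50), rng.normal(1.5, 1, 50)])
stats, alarm = cusum(x)
```

    With k set to half the shift one wants to detect, the statistic drifts back to zero in control and accumulates quickly after the change point.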

  • Article type: Journal Article
    Multilevel modeling (MLM) is commonly used in psychological research to model clustered data. However, data in applied research usually violate one of the essential assumptions of MLM: homogeneity of variance. While the fixed-effect estimates produced by the maximum likelihood method remain unbiased, the standard errors for the fixed effects are misestimated, resulting in inaccurate inferences and inflated or deflated type I error rates. To correct the bias in fixed-effect standard errors and provide valid inferences, small-sample corrections such as the Kenward-Roger (KR) adjustment and the adjusted cluster-robust standard errors (CR-SEs) with the Satterthwaite approximation for t tests have been used. The current study compares KR with random slope (RS) models and the adjusted CR-SEs with ordinary least squares (OLS), random intercept (RI), and RS models to analyze small, heteroscedastic, clustered data using a Monte Carlo simulation. Results show that the KR procedure with RS models has large biases and inflated type I error rates for between-cluster effects in the presence of level 2 heteroscedasticity. In contrast, the adjusted CR-SEs generally yield results with acceptable biases and maintain type I error rates close to the nominal level for all examined models. Thus, when the interest is only in the within-cluster effect, any model with the adjusted CR-SEs could be used. However, when the interest is to make accurate inferences about the between-cluster effect, researchers should use the adjusted CR-SEs with RS to have higher power and guard against unmodeled heterogeneity. We reanalyzed an example in Snijders & Bosker (2012) to demonstrate the use of the adjusted CR-SEs with different models.
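    For readers unfamiliar with the sandwich estimator being compared above, here is a minimal CR0 cluster-robust covariance for OLS in NumPy. It is illustrative only: the study's adjusted CR-SEs add small-sample refinements (such as CR2 and Satterthwaite degrees of freedom) that are omitted here, and the simulated data below are a made-up example.

```python
import numpy as np

def ols_cluster_robust_se(X, y, cluster):
    """OLS coefficients with a CR0 cluster-robust (sandwich) covariance:
    V = (X'X)^{-1} [ sum_g X_g' u_g u_g' X_g ] (X'X)^{-1}."""
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(cluster):
        Xg, ug = X[cluster == g], resid[cluster == g]
        score = Xg.T @ ug          # cluster-level score contribution
        meat += np.outer(score, score)
    V = XtX_inv @ meat @ XtX_inv
    return beta, np.sqrt(np.diag(V))

rng = np.random.default_rng(0)
g = np.repeat(np.arange(10), 5)                  # 10 clusters of 5 units
u = rng.normal(size=10)[g]                       # shared cluster effect
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([1.0, 0.5]) + u + rng.normal(size=50)
beta, se = ols_cluster_robust_se(X, y, g)
```

    The "meat" sums score contributions per cluster rather than per observation, which is what makes the estimator robust to within-cluster correlation and heteroscedasticity.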

  • Article type: Journal Article
    The part of responses that is absent in the nonequivalent groups with anchor test (NEAT) design can be managed as a planned missing scenario. In the context of small sample sizes, we present a machine learning (ML)-based imputation technique called chaining random forests (CRF) to perform equating tasks within the NEAT design. Specifically, seven CRF-based imputation equating methods are proposed based on different data augmentation methods. The equating performance of the proposed methods is examined through a simulation study. Five factors are considered: (a) test length (20, 30, 40, 50), (b) sample size per test form (50 versus 100), (c) ratio of common/anchor items (0.2 versus 0.3), (d) equivalent versus nonequivalent groups taking the two forms (no mean difference versus a mean difference of 0.5), and (e) three different types of anchors (random, easy, and hard), resulting in 96 conditions. In addition, five traditional equating methods, (1) the Tucker method; (2) the Levine observed score method; (3) the equipercentile equating method; (4) the circle-arc method; and (5) concurrent calibration based on the Rasch model, were also considered, for a total of 12 methods in this study together with the seven CRF-based imputation equating methods. The findings suggest that, benefiting from the advantages of ML techniques, CRF-based methods that incorporate the equating result of the Tucker method, such as the IMP_total_Tucker, IMP_pair_Tucker, and IMP_Tucker_cirlce methods, can yield more robust and trustworthy estimates for the "missingness" in an equating task and therefore result in more accurate equated scores than other counterparts in short-length tests with small samples.

  • Article type: Journal Article
    Hyperspectral remote sensing images (HRSI) have the characteristic that different objects can share the same spectrum. As it is difficult to label samples manually, hyperspectral remote sensing images are understood to be typical "small sample" datasets. Deep neural networks can effectively extract deep features from the HRSI, but the classification accuracy mainly depends on the labeled training samples. Therefore, this paper employs a stacked convolutional autoencoder network and a transfer learning strategy to design a new stacked convolutional autoencoder network model transfer (SCAE-MT) for classifying the HRSI. In the proposed classification method, the stacked convolutional auto-encoding network is employed to effectively extract deep features from the HRSI. Then, the transfer learning strategy is applied to design a stacked convolutional autoencoder network model transfer under small and limited training samples. The SCAE-MT model is then used to build a new HRSI classification method that addresses the small-sample problem of the HRSI. In this study, to prove the effectiveness of the proposed classification method, two HRSI datasets were chosen. To verify the effectiveness of the methods, the overall classification accuracy (OA) of the convolutional self-coding network classification method (CAE), the stacked convolutional self-coding network classification method (SCAE), and the SCAE-MT method under 5%, 10%, and 15% training sets was calculated. When compared with the CAE and SCAE models on the 5%, 10%, and 15% training datasets, the overall accuracy (OA) of the SCAE-MT method was improved by 2.71%, 3.33%, and 3.07% (on average), respectively. The SCAE-MT method is thus clearly superior to the other methods and also shows good classification performance.

  • Article type: Journal Article
    It is well known that crop classification is essential for genetic resources and phenotype development. Compared with traditional methods, convolutional neural networks can be utilized to identify features automatically. Nevertheless, crops and scenarios are quite complex, which makes it challenging to develop a universal classification method. Furthermore, manual design demands professional knowledge and is time-consuming and labor-intensive. In contrast, auto-search can create network architectures when faced with new species. Using rapeseed images for experiments, we collected eight types to build a dataset (the rapeseed dataset (RSDS)). In addition, we proposed a novel target-dependent search method based on VGGNet (target-dependent neural architecture search (TD-NAS)). The results show that test accuracy does not differ significantly between small and large samples. Therefore, the influence of dataset size on generalization is limited. Moreover, we used two additional open datasets (Pl@ntNet and ICL-Leaf) to test and prove the effectiveness of our method, which shows three notable features: (a) small sample sizes, (b) stable generalization, and (c) freedom from unpromising detections.

  • Article type: Journal Article
    Researchers frequently use Mokken scale analysis (MSA), which is a nonparametric approach to item response theory, when they have relatively small samples of examinees. Researchers have provided some guidance regarding the minimum sample size for applications of MSA under various conditions. However, these studies have not focused on item-level measurement problems, such as violations of monotonicity or invariant item ordering (IIO). Moreover, these studies have focused on problems that occur for a complete sample of examinees. The current study uses a simulation study to consider the sensitivity of MSA item analysis procedures to problematic item characteristics that occur within limited ranges of the latent variable. Results generally support the use of MSA with small samples (N around 100 examinees) as long as multiple indicators of item quality are considered.

  • Article type: Journal Article
    It has been reported that about half of biological discoveries are irreproducible. These irreproducible discoveries have been partially attributed to poor statistical power, which is largely due to small sample sizes. However, in molecular biology and medicine, because of limited biological resources and budgets, most molecular biology experiments have been conducted with small samples. The two-sample t-test controls bias by using its degrees of freedom. However, this also implies that the t-test has low power in small samples. A discovery found with low statistical power suggests that it has poor reproducibility. So, raising statistical power is not a feasible way to enhance reproducibility in small-sample experiments. An alternative way is to reduce the type I error rate. To do so, a so-called tα-test was developed. Both theoretical analysis and a simulation study demonstrate that the tα-test substantially outperforms the t-test. However, the tα-test reduces to the t-test when sample sizes are over 15. Large-scale simulation studies and real experimental data show that the tα-test significantly reduced the type I error rate compared to the t-test and the Wilcoxon test in small-sample experiments. The tα-test had almost the same empirical power as the t-test. The null p-value density distribution explains why the tα-test had a much lower type I error rate than the t-test. One real experimental dataset provides a typical example showing that the tα-test outperforms the t-test, and a microarray dataset showed that the tα-test had the best performance among five statistical methods. In addition, the density distribution and probability cumulative function of the tα-statistic are given mathematically, and the theoretical and observed distributions match well.
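    The small-sample type I error behaviour discussed above is easy to probe by simulation. The sketch below estimates the empirical type I error of the ordinary pooled t-test at n = 4 per group under the null; it does not implement the tα-test itself. The constant 2.447 is the two-sided 5% critical value of the t distribution with df = 6.

```python
import numpy as np

def ttest_stat(a, b):
    """Pooled (equal-variance) two-sample Student t statistic."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(sp2 * (1 / na + 1 / nb))

def type1_error_rate(n=4, n_sim=20000, t_crit=2.447, seed=0):
    """Empirical type I error of the pooled t-test with n per group
    when both groups are drawn from the same N(0, 1) distribution.
    t_crit is t_{0.975, 2n-2}; 2.447 is the df = 6 value for n = 4."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sim):
        a = rng.normal(0, 1, n)
        b = rng.normal(0, 1, n)
        if abs(ttest_stat(a, b)) > t_crit:
            rejections += 1
    return rejections / n_sim

rate = type1_error_rate()
```

    Under correct model assumptions the rate sits near the nominal 5%; replacing the normal draws with heavy-tailed or contaminated noise in the loop is a one-line change that exposes the inflation the abstract is concerned with.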

  • Article type: Journal Article
    OBJECTIVE: Computer-aided MRI analysis is helpful for early detection of Alzheimer's disease (AD). Recently, 3D convolutional neural networks (CNN) have been widely used to analyse MRI images. However, 3D CNN incurs a huge memory cost. In this paper, we introduce cascaded CNN and long short-term memory (LSTM) networks. We also use knowledge distillation to improve the accuracy of the model when using a small medical image dataset.
    METHODS: We propose a cascade structure, CNN-LSTM. The CNN is used for feature extraction, and the LSTM is used as the classifier. In this way, the correlation between different slices can be considered and the computational cost caused by 3D data can be reduced. To overcome the problem of limited image training data, transfer learning is a more reasonable way of performing feature extraction. We use a knowledge distillation algorithm to improve the diagnostic performance of the student model by having a powerful teacher model guide its training.
    RESULTS: The accuracy of the proposed model is improved using knowledge distillation. The results show that the accuracy of the student model reached 85.96% after the guidance of the teacher model, an increase of 3.83%.
    CONCLUSIONS: We propose a cascaded CNN-LSTM to classify 3D ADNI data, and use knowledge distillation to improve the model accuracy when training with a small dataset. It can process 3D data efficiently as well as reduce the computational cost.
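    The knowledge-distillation objective used in work like this is, in its common Hinton-style form, a temperature-softened KL term between teacher and student plus a hard-label cross-entropy. A NumPy sketch follows; the temperature T and mixing weight alpha are illustrative defaults, and this does not reproduce the paper's CNN-LSTM models.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(z, float) / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Hinton-style distillation:
    alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE(hard labels)."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1).mean()
    hard = softmax(student_logits, 1.0)
    ce = -np.log(hard[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * (T ** 2) * kl + (1 - alpha) * ce

student = np.array([[2.0, 0.5, -1.0], [0.1, 1.2, 0.3]])
teacher = np.array([[1.8, 0.4, -0.9], [0.0, 1.5, 0.2]])
labels = np.array([0, 1])
loss = distillation_loss(student, teacher, labels)
```

    The T^2 factor keeps the gradient magnitudes of the soft term comparable across temperatures, which is why alpha can be tuned independently of T.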
