Data distribution

  • Article type: Journal Article
    This paper highlights the critical role of pH, or proton activity, measurements in environmental studies and emphasises the importance of applying proper statistical approaches when handling pH data. This allows for more informed decisions to effectively manage environmental data, such as data from mining-influenced water. The pH and {H+} of the same system display different distributions, with pH mostly displaying a normal or bimodal distribution and {H+} showing a lognormal distribution. It is therefore a challenge to decide whether to use pH or {H+} to compute the mean or other measures of central tendency for further environmental statistical analyses. In this study, different statistical techniques were applied to understand the distribution of pH and {H+} from four different mine sites: Metsämonttu in Finland, Felsendome Rabenstein in Germany, and the Eastrand and Westrand mine water treatment plants in South Africa. Based on the statistical results, the geometric mean can be used to calculate the average of pH if the distribution is unimodal. For a multimodal pH data distribution, peak-identifying methods can be applied to extract the mean of each data population and use these means for further statistical analyses.
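    Since pH = -log10{H+}, the arithmetic mean of pH values corresponds to the geometric mean of the proton activities. A minimal Python sketch (with hypothetical mine-water readings, not data from the paper) contrasts this with the naive arithmetic mean of {H+}:

```python
import math

def mean_pH_from_activity(activities):
    """Average pH via the geometric mean of proton activities {H+}.

    Because pH = -log10{H+}, the geometric mean of {H+} equals the
    arithmetic mean of the pH values, avoiding the bias that averaging
    lognormal {H+} data arithmetically would introduce.
    """
    n = len(activities)
    log_sum = sum(math.log10(a) for a in activities)
    geo_mean = 10 ** (log_sum / n)   # geometric mean of {H+}
    return -math.log10(geo_mean)     # = arithmetic mean of the pH values

pH_values = [3.2, 3.5, 3.8, 4.1]            # hypothetical pH readings
activities = [10 ** -p for p in pH_values]  # corresponding {H+}

# Naive approach: arithmetic mean of {H+}, converted back to pH.
naive = -math.log10(sum(activities) / len(activities))
proper = mean_pH_from_activity(activities)
```

The naive mean is pulled toward the largest activities (the lowest pH values), which is why it understates the average pH for lognormal {H+} data.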

  • Article type: Journal Article
    Graphs in research articles can increase the comprehension of statistical data but may mislead readers if poorly designed. We propose a new plot type, the sea stack plot, which combines vertical histograms and summary statistics to represent large univariate datasets accurately, usefully, and efficiently. We compare five commonly used plot types (dot and whisker plots, boxplots, density plots, univariate scatter plots, and dot plots) to assess their relative strengths and weaknesses when representing distributions of data commonly observed in biological studies. We find the assessed plot types are either difficult to read at large sample sizes or have the potential to misrepresent certain distributions of data, showing the need for an improved method of data visualisation. We present an analysis of the plot types used in four ecology and conservation journals covering multiple areas of these research fields, finding widespread use of uninformative bar charts and dot and whisker plots (60% of all panels showing univariate data from multiple groups for the purpose of comparison). Some articles presented more informative figures by combining plot types (16% of panels), generally boxplots and a second layer such as a flat density plot, to better display the data. This shows an appetite for more effective plot types within conservation and ecology, which may further increase if accurate and user-friendly plot types are made available. Finally, we describe sea stack plots and explain how they overcome the weaknesses associated with other alternatives to uninformative plots when used for large and/or unevenly distributed data. We provide a tool to create sea stack plots with our R package 'seastackplot', available through GitHub.
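    The actual implementation is the authors' R package 'seastackplot' on GitHub. As a rough illustration of the underlying idea, binned counts paired with the summary statistics usually drawn alongside, here is a Python sketch (the function name and the returned fields are ours, not the package's API):

```python
import statistics

def sea_stack_summary(data, bins=5):
    """Minimal sketch of the sea stack idea: a binned histogram (the
    'stacks') paired with the summary statistics shown next to it."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for x in data:
        i = min(int((x - lo) / width), bins - 1)  # clamp the max into the last bin
        counts[i] += 1
    q1, median, q3 = statistics.quantiles(data, n=4)
    return {"counts": counts, "mean": statistics.fmean(data),
            "q1": q1, "median": median, "q3": q3}

summary = sea_stack_summary([1, 2, 2, 3, 3, 3, 4, 4, 9], bins=4)
```

Unlike a boxplot alone, the binned counts still reveal multimodality and uneven spread at any sample size, which is the weakness of summary-only plots the paper targets.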

  • Article type: Journal Article
    The inference of gene regulatory networks (GRNs) is a widely addressed problem in Systems Biology. GRNs can be modeled as Boolean networks, the simplest approach for this task. However, Boolean models need binarized data, and several approaches have been developed for the discretization of gene expression data (GED). In addition, advances in data extraction technologies, such as single-cell RNA-Sequencing (scRNA-Seq), provide a new view of gene expression and bring new challenges for dealing with its specificities, such as a high occurrence of zeros. This work proposes a new discretization approach for scRNA-Seq time-series data, named Distribution and Successive Spline Points Discretization (DSSPD), which considers the data distribution and a proper preprocessing step. Here, Cartesian Genetic Programming (CGP) is used to infer GRNs using the results of DSSPD. The proposal is compared with CGP using standard data handling and with five state-of-the-art algorithms on curated models and experimental data. The results show that the proposal improves the results of CGP in all tested cases and outperforms the state-of-the-art algorithms in most cases.
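    The DSSPD algorithm itself involves spline points and is specified in the paper; purely to illustrate why zero inflation makes binarization non-trivial, here is a toy thresholding scheme (not DSSPD) that treats zeros separately from the measured values:

```python
def binarize_expression(series):
    """Toy binarization for a zero-inflated scRNA-Seq-style series (NOT
    the DSSPD method): zeros stay 0, and the remaining values are
    thresholded at the mean of the nonzero measurements only, so the
    many dropout zeros do not drag the threshold down."""
    nonzero = [x for x in series if x > 0]
    if not nonzero:
        return [0] * len(series)
    threshold = sum(nonzero) / len(nonzero)
    return [1 if x >= threshold else 0 for x in series]

bits = binarize_expression([0.0, 0.2, 1.8, 0.0, 2.5, 0.1])
```

Had the zeros been included in the threshold, the low expression values 0.2 and 0.1 could flip to 1, which is the kind of artifact a distribution-aware discretization avoids.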

  • Article type: Journal Article
    In the manufacturing process, equipment failure is directly related to productivity, so predictive maintenance plays a very important role. Industrial parks are distributed, and data heterogeneity exists among heterogeneous equipment, which makes predictive maintenance of equipment challenging. In this paper, we propose two main techniques to enable effective predictive maintenance in this environment. First, we propose a 1DCNN-BiLSTM model for time-series anomaly detection and predictive maintenance of manufacturing processes. The model combines a 1D convolutional neural network (1DCNN) and a bidirectional LSTM (BiLSTM), which is effective in extracting features from time-series data and detecting anomalies. Second, we combine a federated learning framework with these models to account for the distributional shifts of time-series data and to perform anomaly detection and predictive maintenance based on them. We use a pump dataset to evaluate the performance of combinations of several federated learning frameworks and time-series anomaly detection models. Experimental results show that the proposed framework achieves a test accuracy of 97.2%, which shows its potential for real-world predictive maintenance in the future.
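    The abstract does not say which aggregation rule the evaluated frameworks use; the aggregation step of generic FedAvg, one common choice for combining site-local models trained on heterogeneous data, can be sketched as:

```python
def fed_avg(client_weights, client_sizes):
    """FedAvg-style aggregation (a generic sketch, not necessarily the
    paper's exact framework): average each model parameter across
    clients, weighted by the number of local samples, so sites with
    more pump data contribute proportionally more."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [sum(w[i] * s for w, s in zip(client_weights, client_sizes)) / total
            for i in range(n_params)]

# Two hypothetical sites with different amounts of local pump data.
global_w = fed_avg([[1.0, 2.0], [3.0, 4.0]], client_sizes=[100, 300])
```

Each round, clients train locally, send their parameters, and receive this weighted average back; the raw time-series data never leaves the site, which is what makes the approach attractive for distributed industrial parks.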

  • Article type: Journal Article
    The landing gear structure is subjected to large loads during aircraft takeoff and landing, and accurate prediction of landing gear performance is beneficial to ensuring flight safety. Nevertheless, machine-learning-based landing gear performance prediction methods rely strongly on the dataset, in which the feature dimension and data distribution have a great impact on prediction accuracy. To address these issues, a novel MCA-MLPSA is developed. First, an MCA (multiple correlation analysis) method is proposed to select key features. Second, a heterogeneous multilearner integration framework is proposed, which makes use of different base learners. Third, an MLPSA (multilayer perceptron with self-attention) model is proposed to adaptively capture the data distribution and adjust the weights of each base learner. Finally, the excellent prediction performance of the proposed MCA-MLPSA is validated by a series of experiments on the landing gear data.
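    The paper's MCA step has its own formulation; to convey the flavor of correlation-driven key-feature selection, here is a much simpler Pearson-correlation filter in Python (all feature names and values are hypothetical):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def select_key_features(features, target, k=2):
    """Simplified stand-in for an MCA-style feature filter: keep the k
    features whose absolute correlation with the target is largest."""
    ranked = sorted(features.items(),
                    key=lambda kv: abs(pearson(kv[1], target)),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

target = [1.0, 2.0, 3.0, 4.0]            # hypothetical performance metric
features = {
    "load":  [1.1, 2.0, 2.9, 4.2],       # strong positive correlation
    "noise": [0.3, -1.0, 0.2, 0.1],      # weak correlation
    "temp":  [4.0, 3.0, 2.0, 1.2],       # strong negative correlation
}
keys = select_key_features(features, target, k=2)
```

Taking the absolute value keeps strongly negatively correlated features, which are just as predictive as positively correlated ones.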

  • Article type: Journal Article
    To investigate the correlation between differences in data distributions and federated deep learning (Fed-DL) algorithm performance in tumor segmentation on CT and MR images.
    Two Fed-DL datasets were retrospectively collected (from November 2020 to December 2021): one dataset of liver tumor CT images (Federated Imaging in Liver Tumor Segmentation [or, FILTS]; three sites, 692 scans) and one publicly available dataset of brain tumor MR images (Federated Tumor Segmentation [or, FeTS]; 23 sites, 1251 scans). Scans from both datasets were grouped according to site, tumor type, tumor size, dataset size, and tumor intensity. To quantify differences in data distributions, the following four distance metrics were calculated: earth mover's distance (EMD), Bhattacharyya distance (BD), χ2 distance (CSD), and Kolmogorov-Smirnov distance (KSD). Both federated and centralized nnU-Net models were trained by using the same grouped datasets. Fed-DL model performance was evaluated by using the ratio of Dice coefficients, θ, between federated and centralized models trained and tested on the same 80:20 split datasets.
    The Dice coefficient ratio (θ) between federated and centralized models was strongly negatively correlated with the distances between data distributions, with correlation coefficients of -0.920 for EMD, -0.893 for BD, and -0.899 for CSD. However, KSD was weakly correlated with θ, with a correlation coefficient of -0.479.
    Performance of Fed-DL models in tumor segmentation on CT and MRI datasets was strongly negatively correlated with the distances between data distributions. Keywords: CT, Abdomen/GI, Liver, Comparative Studies, MR Imaging, Brain/Brain Stem, Convolutional Neural Network (CNN), Federated Deep Learning, Tumor Segmentation, Data Distribution. Supplemental material is available for this article. © RSNA, 2023. See also the commentary by Kwak and Bai in this issue.
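    For one-dimensional normalized histograms, all four distance metrics have short closed forms (EMD only in the 1-D case, where it reduces to the summed gap between CDFs). A Python sketch with hypothetical per-site intensity histograms:

```python
import math

def emd_1d(p, q):
    """Earth mover's distance for 1-D histograms with unit bin spacing:
    the summed absolute difference of the two cumulative distributions."""
    cp = cq = total = 0.0
    for pi, qi in zip(p, q):
        cp, cq = cp + pi, cq + qi
        total += abs(cp - cq)
    return total

def bhattacharyya(p, q):
    """Bhattacharyya distance between two normalized histograms."""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return -math.log(bc)

def chi_square(p, q):
    """Chi-square distance between two normalized histograms."""
    return 0.5 * sum((pi - qi) ** 2 / (pi + qi)
                     for pi, qi in zip(p, q) if pi + qi > 0)

def ks_distance(p, q):
    """Kolmogorov-Smirnov distance: maximum gap between the two CDFs."""
    cp = cq = gap = 0.0
    for pi, qi in zip(p, q):
        cp, cq = cp + pi, cq + qi
        gap = max(gap, abs(cp - cq))
    return gap

# Two hypothetical per-site histograms (already normalized to sum to 1).
p = [0.1, 0.4, 0.3, 0.2]
q = [0.3, 0.3, 0.2, 0.2]
d_emd, d_bd = emd_1d(p, q), bhattacharyya(p, q)
d_cs, d_ks = chi_square(p, q), ks_distance(p, q)
```

Note that KSD keeps only the single largest CDF gap while EMD accumulates all of them, one plausible reason KSD tracked the performance ratio θ more weakly than the other three metrics.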

  • Article type: Journal Article
    BACKGROUND: Data archiving and distribution are essential to scientific rigor and reproducibility of research. The National Center for Biotechnology Information's Database of Genotypes and Phenotypes (dbGaP) is a public repository for scientific data sharing. To support curation of thousands of complex data sets, dbGaP has detailed submission instructions that investigators must follow when archiving their data.
    RESULTS: We developed dbGaPCheckup, an R package which implements a series of check, awareness, reporting, and utility functions to support data integrity and proper formatting of the subject phenotype data set and data dictionary prior to dbGaP submission. For example, dbGaPCheckup ensures that the data dictionary contains all fields required by dbGaP, plus additional fields required by dbGaPCheckup; that the number and names of variables match between the data set and data dictionary; that there are no duplicated variable names or descriptions; that observed data values are not more extreme than the logical minimum and maximum values stated in the data dictionary; and more. The package also includes functions that implement a series of minor/scalable fixes when errors are detected (e.g., a function to reorder the variables in the data dictionary to match the order listed in the data set). Finally, we also include reporting functions that produce graphical and textual descriptions of the data to further reduce the likelihood of data integrity issues. The dbGaPCheckup R package is available on CRAN (https://CRAN.R-project.org/package=dbGaPCheckup) and developed on GitHub (https://github.com/lwheinsberg/dbGaPCheckup).
    CONCLUSIONS: dbGaPCheckup is an innovative assistive and timesaving tool that fills an important gap for researchers by making dbGaP submission of large and complex data sets less error prone.
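    A toy Python version of two of the checks described above, name agreement and MIN/MAX bounds, conveys the idea; the field names and message strings here are illustrative, not dbGaPCheckup's actual R API:

```python
def check_dictionary(data, dictionary):
    """Toy consistency checks in the spirit of dbGaPCheckup (not its
    API): variable names/order must match between the data set and the
    data dictionary, and observed values must stay within the
    dictionary's stated MIN/MAX bounds."""
    errors = []
    if list(data) != [entry["VARNAME"] for entry in dictionary]:
        errors.append("variable names/order differ between data and dictionary")
    for entry in dictionary:
        values = data.get(entry["VARNAME"], [])
        if values and (min(values) < entry["MIN"] or max(values) > entry["MAX"]):
            errors.append(f"{entry['VARNAME']}: values outside stated MIN/MAX")
    return errors

# Hypothetical phenotype data set and matching data dictionary.
data = {"AGE": [34, 59, 151], "BMI": [22.1, 30.4, 27.8]}
dictionary = [
    {"VARNAME": "AGE", "MIN": 0, "MAX": 120},
    {"VARNAME": "BMI", "MIN": 10, "MAX": 80},
]
problems = check_dictionary(data, dictionary)
```

Running every check up front and returning all problems at once, rather than failing on the first, is what makes such tooling a time-saver for large submissions.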

  • Article type: Journal Article
    Early detection of Alzheimer's disease (AD) during the Mild Cognitive Impairment (MCI) stage could enable effective intervention to slow down disease progression. Computer-aided diagnosis of AD relies on a sufficient amount of biomarker data. When this requirement is not fulfilled, transfer learning can be used to transfer knowledge from a source domain with more labeled data than is available in the desired target domain. In this study, an instance-based transfer learning framework is presented based on the gradient boosting machine (GBM). In GBM, a sequence of base learners is built, and each learner focuses on the errors (residuals) of the previous learner. In our transfer learning version of GBM (TrGB), a weighting mechanism based on the residuals of the base learners is defined for the source instances. Consequently, instances with a different distribution than the target data have a lower impact on the target learner. The proposed weighting scheme aims to transfer as much information as possible from the source domain while avoiding negative transfer. The target data in this study were obtained from the Mount Sinai dataset, which was collected and processed in a collaborative 5-year project at the Mount Sinai Medical Center. The Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset was used as the source domain. The experimental results showed that the proposed TrGB algorithm improved the classification accuracy by 1.5% and 4.5% for CN vs. MCI and multiclass classification, respectively, as compared to conventional methods. Also, using the TrGB model and knowledge transferred from the CN vs. AD classification of the source domain, the average score of early MCI vs. late MCI classification improved by 5%.
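    The residual-based down-weighting of source instances can be sketched as follows; the exponential scheme and its scale parameter are illustrative, not the paper's exact formula:

```python
import math

def source_weights(residuals, scale=1.0):
    """Illustrative residual-based instance weighting in the spirit of
    TrGB (not the paper's exact scheme): source instances with large
    residuals under the current base learner, i.e. those that look
    least like the target distribution, receive exponentially smaller
    normalized weights, limiting negative transfer."""
    raw = [math.exp(-abs(r) / scale) for r in residuals]
    total = sum(raw)
    return [w / total for w in raw]

# Hypothetical residuals: the third source instance fits the target poorly.
weights = source_weights([0.1, 0.1, 2.5])
```

Between boosting rounds the residuals change, so the weights adapt: source instances only keep influence while the ensemble can still predict them well.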

  • Article type: Journal Article
    Processing full-length cystoscopy videos is challenging for documentation and research purposes. We therefore designed a surgeon-guided framework to extract short video clips of bladder lesions for more efficient content navigation and extraction. Screenshots of bladder lesions were captured during transurethral resection of bladder tumor, then manually labeled according to case identification, date, lesion location, imaging modality, and pathology. The framework used each screenshot to search for and extract a corresponding 10-second video clip. Each video clip included a one-second space holder with a QR barcode describing the video content. The success of the framework was measured by the secondary use of these short clips and the reduction of storage volume required for video materials. From 86 cases, the framework successfully generated 249 video clips from 230 screenshots, with 14 erroneous video clips from 8 screenshots excluded. The HIPAA-compliant barcodes provided information on video contents with 100% data completeness. A web-based educational gallery was curated with various diagnostic categories and annotated frame sequences. Compared with the unedited videos, the informative short video clips reduced the storage volume by 99.5%. In conclusion, our framework expedites the generation of visual content under the surgeon's instruction for cystoscopy and supports the potential incorporation of video data into applications including clinical documentation, education, and research.

  • Article type: Journal Article
    Data distribution is a cornerstone of efficient automation for intelligent machines in Industry 4.0. Although the recent literature offers several comparisons of relevant methods, most of those comparisons are either theoretical or based on abstract simulation tools, unable to uncover the specific, detailed impact of the methods on the underlying networking infrastructure. In this respect, as a first contribution of this paper, we develop more detailed and fine-tuned solutions for robust data distribution in smart factories, in both stationary and mobile scenarios of wireless industrial networking. Using the technological enablers of WirelessHART and RPL and the methodological enabler of proxy selection as building blocks, we compose the protocol stacks of four different methods (both centralized and decentralized) for data distribution in wireless industrial networks over the IEEE 802.15.4 physical layer. We implement the presented methods in the highly detailed OMNeT++ simulation environment and evaluate their performance via an extensive simulation analysis. Interestingly, we demonstrate that the careful selection of a limited set of proxies for data caching in the network can lead to an increased data delivery success rate and low data access latency. Next, we describe two test cases demonstrated in an industrial smart factory environment. First, we show the collaboration between robotic elements and wireless data services. Second, we show the integration with an industrial fog node which controls the shop-floor devices. We report selected results at much larger scales, obtained via simulations.
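    Proxy selection for in-network data caching can be illustrated with a greedy heuristic over hop counts; this is one generic strategy, not the specific centralized or decentralized methods compared in the paper:

```python
from collections import deque

def hop_distances(adj, src):
    """BFS hop counts from src to every node of an adjacency-list graph."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def pick_proxies(adj, k):
    """Greedy proxy selection (an illustrative heuristic): repeatedly add
    the node that most reduces the total hop distance from every node to
    its nearest proxy, i.e. to its nearest cached copy of the data."""
    dists = {u: hop_distances(adj, u) for u in adj}
    best = {u: float("inf") for u in adj}   # hop count to nearest chosen proxy
    proxies = []
    for _ in range(k):
        candidate = min(adj, key=lambda p: sum(min(best[u], dists[p][u])
                                               for u in adj))
        proxies.append(candidate)
        best = {u: min(best[u], dists[candidate][u]) for u in adj}
    return proxies

# A small hypothetical line topology: A - B - C - D - E
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"],
       "D": ["C", "E"], "E": ["D"]}
proxies = pick_proxies(adj, k=1)
```

Fewer hops to the nearest cache is exactly what drives the higher delivery success rate and lower access latency observed in the simulations, since each hop on an IEEE 802.15.4 link adds loss probability and delay.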