Data distribution

  • Article type: Journal Article
    This paper highlights the critical role of pH, or proton activity, measurements in environmental studies and emphasises the importance of applying proper statistical approaches when handling pH data. Doing so supports better-informed decisions when managing environmental data, such as data from mining-influenced water. The pH and {H+} of the same system display different distributions: pH is mostly normally or bimodally distributed, whereas {H+} is lognormally distributed. It is therefore unclear whether pH or {H+} should be used to compute the mean or other measures of central tendency for further environmental statistical analyses. In this study, different statistical techniques were applied to understand the distribution of pH and {H+} at four mine sites: Metsämonttu in Finland, Felsendome Rabenstein in Germany, and the Eastrand and Westrand mine water treatment plants in South Africa. Based on the statistical results, the geometric mean can be used to calculate the average of pH if the distribution is unimodal. For a multimodal pH distribution, peak-identifying methods can be applied to extract the mean of each data population for use in further statistical analyses.
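
The difference between averaging on the pH scale and averaging on the {H+} scale can be illustrated with a small numerical sketch. The readings below are made-up values chosen for illustration, not data from the study, and the only relationship assumed is the definition pH = -log10({H+}). Under that definition, the arithmetic mean of pH values corresponds to the geometric mean of {H+}, while the arithmetic mean of {H+} is pulled toward the most acidic samples.

```python
import math
from statistics import mean, geometric_mean

# Hypothetical pH readings (illustrative values, not data from the study).
ph_values = [3.0, 4.0, 5.0, 6.0]

# Convert each reading to proton activity using pH = -log10({H+}).
h_activity = [10.0 ** (-ph) for ph in ph_values]

# Arithmetic mean taken directly on the pH scale.
mean_ph = mean(ph_values)

# Arithmetic mean taken on the {H+} scale, then converted back to pH.
ph_of_mean_h = -math.log10(mean(h_activity))

# Geometric mean taken on the {H+} scale, then converted back to pH.
ph_of_geomean_h = -math.log10(geometric_mean(h_activity))

print(f"mean of pH values:               {mean_ph:.3f}")
print(f"pH of arithmetic mean of {{H+}}:   {ph_of_mean_h:.3f}")
print(f"pH of geometric mean of {{H+}}:    {ph_of_geomean_h:.3f}")
```

The first and third printed values agree, while the second is lower, which is why the choice of scale matters before any further statistics are computed.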






  • Article type: Journal Article
    Graphs in research articles can increase the comprehension of statistical data but may mislead readers if poorly designed. We propose a new plot type, the sea stack plot, which combines vertical histograms and summary statistics to represent large univariate datasets accurately, usefully, and efficiently. We compare five commonly used plot types (dot and whisker plots, boxplots, density plots, univariate scatter plots, and dot plots) to assess their relative strengths and weaknesses when representing distributions of data commonly observed in biological studies. We find that the assessed plot types are either difficult to read at large sample sizes or have the potential to misrepresent certain distributions of data, showing the need for an improved method of data visualisation. We present an analysis of the plot types used in four ecology and conservation journals covering multiple areas of these research fields, finding widespread use of uninformative bar charts and dot and whisker plots (60% of all panels showing univariate data from multiple groups for the purpose of comparison). Some articles presented more informative figures by combining plot types (16% of panels), generally boxplots and a second layer such as a flat density plot, to better display the data. This shows an appetite for more effective plot types within conservation and ecology, which may increase further if accurate and user-friendly plot types are made available. Finally, we describe sea stack plots and explain how they overcome the weaknesses associated with other alternatives to uninformative plots when used for large and/or unevenly distributed data. We provide a tool to create sea stack plots with our R package 'seastackplot', available through GitHub.






  • Article type: Journal Article
    The inference of gene regulatory networks (GRNs) is a widely addressed problem in Systems Biology. GRNs can be modeled as Boolean networks, which is the simplest approach for this task. However, Boolean models need binarized data, and several approaches have been developed for the discretization of gene expression data (GED). In addition, advances in data extraction technologies, such as single-cell RNA-Sequencing (scRNA-Seq), provide a new view of gene expression and bring new challenges in dealing with their specificities, such as a high proportion of zero values. This work proposes a new discretization approach for scRNA-Seq time-series data, named Distribution and Successive Spline Points Discretization (DSSPD), which considers the data distribution and a proper preprocessing step. Here, Cartesian Genetic Programming (CGP) is used to infer GRNs using the results of DSSPD. The proposal is compared with CGP with the standard data handling and with five state-of-the-art algorithms on curated models and experimental data. The results show that the proposal improves the results of CGP in all tested cases and outperforms the state-of-the-art algorithms in most cases.






  • Article type: Journal Article
    In the manufacturing process, equipment failure is directly related to productivity, so predictive maintenance plays a very important role. Industrial parks are distributed, and the data produced by heterogeneous equipment are themselves heterogeneous, which makes predictive maintenance challenging. In this paper, we propose two main techniques to enable effective predictive maintenance in this environment. First, we propose a 1DCNN-Bilstm model for time series anomaly detection and predictive maintenance of manufacturing processes. The model combines a 1D convolutional neural network (1DCNN) and a bidirectional LSTM (Bilstm), which is effective in extracting features from time series data and detecting anomalies. Second, we combine a federated learning framework with these models to account for the distributional shifts of time series data and perform anomaly detection and predictive maintenance based on them. We use the pump dataset to evaluate the performance of combinations of several federated learning frameworks and time series anomaly detection models. Experimental results show that the proposed framework achieves a test accuracy of 97.2%, which shows its potential for real-world predictive maintenance in the future.






  • Article type: Journal Article
    The landing gear structure is subjected to large loads during aircraft takeoff and landing, and accurate prediction of landing gear performance helps ensure flight safety. However, machine-learning-based prediction of landing gear performance relies strongly on the dataset, whose feature dimension and data distribution have a great impact on prediction accuracy. To address these issues, a novel MCA-MLPSA method is developed. First, an MCA (multiple correlation analysis) method is proposed to select key features. Second, a heterogeneous multilearner integration framework is proposed, which makes use of different base learners. Third, an MLPSA (multilayer perceptron with self-attention) model is proposed to adaptively capture the data distribution and adjust the weights of each base learner. Finally, the prediction performance of the proposed MCA-MLPSA is validated by a series of experiments on landing gear data.






  • Article type: Journal Article
    PURPOSE: To investigate the correlation between differences in data distributions and federated deep learning (Fed-DL) algorithm performance in tumor segmentation on CT and MR images.
    MATERIALS AND METHODS: Two Fed-DL datasets were retrospectively collected (from November 2020 to December 2021): one dataset of liver tumor CT images (Federated Imaging in Liver Tumor Segmentation [or, FILTS]; three sites, 692 scans) and one publicly available dataset of brain tumor MR images (Federated Tumor Segmentation [or, FeTS]; 23 sites, 1251 scans). Scans from both datasets were grouped according to site, tumor type, tumor size, dataset size, and tumor intensity. To quantify differences in data distributions, the following four distance metrics were calculated: earth mover's distance (EMD), Bhattacharyya distance (BD), χ² distance (CSD), and Kolmogorov-Smirnov distance (KSD). Both federated and centralized nnU-Net models were trained by using the same grouped datasets. Fed-DL model performance was evaluated by using the ratio of Dice coefficients, θ, between federated and centralized models trained and tested on the same 80:20 split datasets.
    RESULTS: The Dice coefficient ratio (θ) between federated and centralized models was strongly negatively correlated with the distances between data distributions, with correlation coefficients of -0.920 for EMD, -0.893 for BD, and -0.899 for CSD. However, KSD was weakly correlated with θ, with a correlation coefficient of -0.479.
    CONCLUSION: Performance of Fed-DL models in tumor segmentation on CT and MRI datasets was strongly negatively correlated with the distances between data distributions.
    Keywords: CT, Abdomen/GI, Liver, Comparative Studies, MR Imaging, Brain/Brain Stem, Convolutional Neural Network (CNN), Federated Deep Learning, Tumor Segmentation, Data Distribution.
    Supplemental material is available for this article. © RSNA, 2023. See also the commentary by Kwak and Bai in this issue.
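
The four distance metrics named above can be sketched for a pair of one-dimensional histograms. This is only an illustration under stated assumptions: both histograms share the same equally spaced bins, the χ² distance uses one common symmetric form, and the function names and example counts are hypothetical. The abstract does not specify how the study binned its data or which variants of the formulas it used.

```python
import math

def normalize(hist):
    """Scale raw bin counts so that they sum to 1."""
    total = sum(hist)
    return [h / total for h in hist]

def cumulative(p):
    """Running sum of a normalized histogram (its discrete CDF)."""
    out, running = [], 0.0
    for v in p:
        running += v
        out.append(running)
    return out

def earth_movers_distance(p, q):
    # For 1-D histograms with unit bin spacing, EMD is the summed
    # absolute difference between the two CDFs.
    return sum(abs(a - b) for a, b in zip(cumulative(p), cumulative(q)))

def bhattacharyya_distance(p, q):
    # Negative log of the Bhattacharyya coefficient; infinite when the
    # histograms do not overlap at all.
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return math.inf if bc == 0 else -math.log(bc)

def chi_square_distance(p, q):
    # Symmetric chi-square form (an assumed variant); empty bins are skipped.
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)

def kolmogorov_smirnov_distance(p, q):
    # Largest absolute gap between the two CDFs.
    return max(abs(a - b) for a, b in zip(cumulative(p), cumulative(q)))

# Hypothetical intensity histograms for two sites (made-up counts).
site_a = normalize([10, 20, 40, 20, 10])
site_b = normalize([5, 10, 20, 40, 25])

for name, fn in [("EMD", earth_movers_distance),
                 ("BD", bhattacharyya_distance),
                 ("CSD", chi_square_distance),
                 ("KSD", kolmogorov_smirnov_distance)]:
    print(f"{name}: {fn(site_a, site_b):.4f}")
```

All four functions return 0 (up to rounding) for identical histograms and grow as the two distributions diverge, which is the property the reported correlations with θ rely on.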






  • Article type: Journal Article
    BACKGROUND: Data archiving and distribution are essential to scientific rigor and reproducibility of research. The National Center for Biotechnology Information's Database of Genotypes and Phenotypes (dbGaP) is a public repository for scientific data sharing. To support curation of thousands of complex data sets, dbGaP has detailed submission instructions that investigators must follow when archiving their data.
    RESULTS: We developed dbGaPCheckup, an R package which implements a series of check, awareness, reporting, and utility functions to support data integrity and proper formatting of the subject phenotype data set and data dictionary prior to dbGaP submission. For example, dbGaPCheckup ensures that the data dictionary contains all fields required by dbGaP, and additional fields required by dbGaPCheckup; that the number and names of variables match between the data set and data dictionary; that there are no duplicated variable names or descriptions; that observed data values are not more extreme than the logical minimum and maximum values stated in the data dictionary; and more. The package also includes functions that implement a series of minor/scalable fixes when errors are detected (e.g., a function to reorder the variables in the data dictionary to match the order listed in the data set). Finally, we also include reporting functions that produce graphical and textual descriptives of the data to further reduce the likelihood of data integrity issues. The dbGaPCheckup R package is available on CRAN and developed on GitHub.
    CONCLUSIONS: dbGaPCheckup is an innovative assistive and timesaving tool that fills an important gap for researchers by making dbGaP submission of large and complex data sets less error prone.
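
The kinds of checks listed in the RESULTS paragraph can be illustrated with a short Python sketch. This is not the dbGaPCheckup API (the package is written in R and its functions are not described here); the function name `check_dictionary`, the use of `VARNAME`, `MIN`, and `MAX` as dictionary keys, and the example records are assumptions made for illustration only.

```python
def check_dictionary(dataset, dictionary):
    """Return a list of problems found; an empty list means all checks passed.

    dataset:    list of dict rows, one per subject (hypothetical layout)
    dictionary: list of dicts with assumed keys "VARNAME", "MIN", "MAX"
    """
    problems = []
    dict_names = [entry["VARNAME"] for entry in dictionary]
    data_names = list(dataset[0].keys()) if dataset else []

    # Check 1: no duplicated variable names in the data dictionary.
    duplicates = sorted({n for n in dict_names if dict_names.count(n) > 1})
    if duplicates:
        problems.append(f"duplicated variable names: {duplicates}")

    # Check 2: variable names (and then their order) match the data set.
    if sorted(set(dict_names)) != sorted(set(data_names)):
        problems.append("variable names differ between data set and dictionary")
    elif dict_names != data_names and not duplicates:
        problems.append("variable order differs between data set and dictionary")

    # Check 3: observed numeric values stay within the stated MIN and MAX.
    for entry in dictionary:
        name, low, high = entry["VARNAME"], entry.get("MIN"), entry.get("MAX")
        values = [row[name] for row in dataset
                  if isinstance(row.get(name), (int, float))]
        if low is not None and any(v < low for v in values):
            problems.append(f"{name}: value below stated MIN {low}")
        if high is not None and any(v > high for v in values):
            problems.append(f"{name}: value above stated MAX {high}")
    return problems

# Made-up example: the second subject's AGE exceeds the stated maximum.
dictionary = [{"VARNAME": "SUBJECT_ID"},
              {"VARNAME": "AGE", "MIN": 0, "MAX": 120}]
dataset = [{"SUBJECT_ID": 1, "AGE": 34},
           {"SUBJECT_ID": 2, "AGE": 150}]

problems = check_dictionary(dataset, dictionary)
print(problems)
```

Returning a list of messages rather than raising on the first failure mirrors the reporting style the abstract describes, where several issues are surfaced before submission.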






  • Article type: Journal Article
    Early detection of Alzheimer's disease (AD) during the Mild Cognitive Impairment (MCI) stage could enable effective intervention to slow down disease progression. Computer-aided diagnosis of AD relies on a sufficient amount of biomarker data. When this requirement is not fulfilled, transfer learning can be used to transfer knowledge from a source domain with more labeled data than is available in the desired target domain. In this study, an instance-based transfer learning framework is presented based on the gradient boosting machine (GBM). In GBM, a sequence of base learners is built, and each learner focuses on the errors (residuals) of the previous learner. In our transfer learning version of GBM (TrGB), a weighting mechanism based on the residuals of the base learners is defined for the source instances. Consequently, source instances whose distribution differs from that of the target data have a lower impact on the target learner. The proposed weighting scheme aims to transfer as much information as possible from the source domain while avoiding negative transfer. The target data in this study were obtained from the Mount Sinai dataset, which was collected and processed in a collaborative 5-year project at the Mount Sinai Medical Center. The Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset was used as the source domain. The experimental results showed that the proposed TrGB algorithm could improve the classification accuracy by 1.5% and 4.5% for CN vs. MCI and multiclass classification, respectively, compared with conventional methods. Also, using the TrGB model and knowledge transferred from the CN vs. AD classification of the source domain, the average score of early MCI vs. late MCI classification improved by 5%.
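
The general idea of down-weighting source instances that a learner fits poorly can be illustrated with a toy rule. The exponential form, the `scale` parameter, and the function name below are assumptions chosen for illustration; the abstract does not state the actual weighting formula used in TrGB, so this should not be read as the authors' method.

```python
import math

def residual_based_weights(residuals, scale=1.0):
    """Map per-instance residuals to normalized weights (hypothetical rule).

    Larger absolute residuals give smaller weights, so source instances
    that look unlike the target data contribute less to the next learner.
    """
    raw = [math.exp(-abs(r) / scale) for r in residuals]
    total = sum(raw)
    return [w / total for w in raw]

# Made-up residuals for three source instances.
weights = residual_based_weights([0.1, 0.5, 2.0])
print([round(w, 3) for w in weights])
```

In a boosting loop, weights of this kind would be recomputed after each base learner, so an instance's influence can change as the ensemble grows.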






  • Article type: Journal Article
    Processing full-length cystoscopy videos is challenging for documentation and research purposes. We therefore designed a surgeon-guided framework to extract short video clips with bladder lesions for more efficient content navigation and extraction. Screenshots of bladder lesions were captured during transurethral resection of bladder tumor, then manually labeled according to case identification, date, lesion location, imaging modality, and pathology. The framework used each screenshot to search for and extract a corresponding 10-second video clip. Each video clip included a one-second placeholder with a QR barcode describing the video content. The success of the framework was measured by the secondary use of these short clips and the reduction of storage volume required for video materials. From 86 cases, the framework successfully generated 249 video clips from 230 screenshots, with 14 erroneous video clips from 8 screenshots excluded. The HIPAA-compliant barcodes provided information on video contents with 100% data completeness. A web-based educational gallery was curated with various diagnostic categories and annotated frame sequences. Compared with the unedited videos, the informative short video clips reduced the storage volume by 99.5%. In conclusion, our framework expedites the generation of visual content under the surgeon's instruction for cystoscopy and the potential incorporation of video data into applications including clinical documentation, education, and research.






  • Article type: Journal Article
    Data distribution is a cornerstone of efficient automation for intelligent machines in Industry 4.0. Although the recent literature contains several comparisons of relevant methods, most of those comparisons are either theoretical or based on abstract simulation tools, and are unable to uncover the specific, detailed impacts of the methods on the underlying networking infrastructure. In this respect, as a first contribution of this paper, we develop more detailed and fine-tuned solutions for robust data distribution in smart factories in stationary and mobile scenarios of wireless industrial networking. Using the technological enablers WirelessHART and RPL and the methodological enabler of proxy selection as building blocks, we compose the protocol stacks of four different methods (both centralized and decentralized) for data distribution in wireless industrial networks over the IEEE 802.15.4 physical layer. We implement the presented methods in the highly detailed OMNeT++ simulation environment and evaluate their performance via an extensive simulation analysis. We demonstrate that the careful selection of a limited set of proxies for data caching in the network can lead to an increased data delivery success rate and low data access latency. Next, we describe two test cases demonstrated in an industrial smart factory environment. First, we show the collaboration between robotic elements and wireless data services. Second, we show the integration with an industrial fog node which controls the shop-floor devices. We also report selected results at much larger scales, obtained via simulations.





