Exploratory data analysis

  • Article type: Journal Article
    Correspondence analysis (CA) is a multivariate statistical and visualization technique. CA is particularly useful for analyzing two-way or multi-way contingency tables, where it represents the degree of correspondence between rows and columns. CA results are visualized in easy-to-interpret "bi-plots," in which the proximity of items (values of categorical variables) represents the degree of association between them. In other words, items positioned near each other are more strongly associated than those located farther apart. Each bi-plot has two dimensions, which are named during the analysis. Naming the dimensions adds a qualitative aspect to the analysis. Correspondence analysis may help medical professionals answer many important questions related to health, wellbeing, quality of life, and similar topics in a simpler but more informal way than more complex statistical or machine learning approaches. In this way, it can be used for dimension reduction and data simplification, clustering, classification, feature selection, knowledge extraction, visualization of adverse effects, or pattern detection.
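
    As a rough illustration of what CA works on, the sketch below computes the standardized residuals of a two-way contingency table and their sum of squares (the total inertia, equal to the chi-square statistic divided by the table total). A full CA would take the singular value decomposition of this residual matrix to obtain the bi-plot coordinates; that step is omitted here. The function name `ca_inertia` and the example tables are illustrative assumptions, not taken from the article.

```python
def ca_inertia(table):
    """Return (standardized residuals, total inertia) for a two-way table.

    `table` is a list of rows of non-negative counts; every row and column
    is assumed to have a non-zero total.
    """
    n = sum(sum(row) for row in table)
    row_mass = [sum(row) / n for row in table]
    col_mass = [sum(row[j] for row in table) / n for j in range(len(table[0]))]

    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, count in enumerate(row):
            # Expected proportion under independence of rows and columns.
            expected = row_mass[i] * col_mass[j]
            res_row.append((count / n - expected) / expected ** 0.5)
        residuals.append(res_row)

    # Total inertia: sum of squared standardized residuals (chi-square / n).
    inertia = sum(r * r for res_row in residuals for r in res_row)
    return residuals, inertia


# Hypothetical tables: perfectly associated vs. independent categories.
_, strong = ca_inertia([[10, 0], [0, 10]])
_, none = ca_inertia([[10, 20], [20, 40]])
print(strong, none)
```

    A table with no association has an inertia of (numerically) zero, so there is nothing for the bi-plot to display; larger inertia means more structure to spread across the dimensions.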

  • Article type: Journal Article
    BACKGROUND: The intergovernmental Organisation for Economic Co-operation and Development (OECD) and the Interagency Coordinating Committee on the Validation of Alternative Methods (ICCVAM) have developed guidelines for the use of in vitro models in the toxicological evaluation of chemicals. However, the manual steps involved and the need for multiple data analysis tools are costly and time-consuming and can lead researchers to introduce errors inadvertently.
    OBJECTIVE: We have developed the SAEDC platform (Technological Solution for Exploratory Data Analysis and Statistics for Cytotoxicity, in Portuguese), which enables analysis of cytotoxicity data from assays following OECD Guideline No. 129.
    METHODS: In vitro experimental data were used to compare the platform with the analysis methodology suggested in the Guideline. We analyzed 117 data sets covering chemicals from Category I to Unclassified according to the GHS classification.
    RESULTS: The four parameters of the non-linear regression model (4PL) calculated by the SAEDC platform showed no significant differences from the standard methodology in any of the data sets (p > 0.05). The coefficient of determination (R-squared) demonstrated not only a good fit of the 4PL model to the data but also close similarity to the values obtained by the conventional methodology. Finally, the SAEDC platform predicted LD50 values for the chemicals from IC50, using the Registry of Cytotoxicity (RC) regression models.
    CONCLUSIONS: The comparison with the standard data analysis methodology showed that the SAEDC platform fulfills the requirements for cytotoxicity data analysis, generating reliable and accurate results with fewer steps performed by researchers. Using the SAEDC platform to obtain toxicity values can reduce analysis time compared with the standard methodology proposed by regulatory agencies. Automating the analysis with the SAEDC platform therefore has the potential to save time and resources for cytotoxicity researchers and laboratories while generating reliable results.
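
    For readers unfamiliar with the 4PL model mentioned above, the following is a minimal sketch of one common parameterization of the four-parameter logistic curve and its inverse. It does not reproduce the SAEDC platform or the Guideline procedure: the parameter names (`top`, `bottom`, `ic50`, `hill`) and the example values are assumptions chosen for illustration, and no curve fitting or LD50 prediction is performed.

```python
def four_pl(x, top, bottom, ic50, hill):
    """Four-parameter logistic response at concentration x (x >= 0).

    top    : response as the concentration approaches zero
    bottom : response at very high concentration
    ic50   : concentration giving the response halfway between top and bottom
    hill   : slope factor controlling the steepness of the curve
    """
    return bottom + (top - bottom) / (1.0 + (x / ic50) ** hill)


def conc_at_response(y, top, bottom, ic50, hill):
    """Invert the 4PL curve: concentration that yields response y.

    Only valid for responses strictly between bottom and top.
    """
    return ic50 * ((top - bottom) / (y - bottom) - 1.0) ** (1.0 / hill)


# Hypothetical curve: viability falls from 100% to 0% with a midpoint at 5 units.
params = dict(top=100.0, bottom=0.0, ic50=5.0, hill=2.0)
print(four_pl(5.0, **params))            # response at the midpoint concentration
print(conc_at_response(50.0, **params))  # concentration for a 50% response
```

    In this parameterization the `ic50` parameter is, by construction, the concentration at which the response is midway between the two plateaus, which is why it can be read directly from the fitted parameters.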

  • Article type: Journal Article
    Demand charges are widely applied to commercial and industrial consumers. These costs are often poorly understood, let alone the effects that photovoltaics (PV) can have on them. This work proposes a methodology, based on techniques from exploratory data analysis, to assess the effect of PV on reducing these charges and to optimise the power to be contracted. The methodology is applied to five case studies of industrial consumers from different sectors in Spain, finding demand-charge savings of between 5% and 11% in industries with continuous operation and up to 28% in cases of discontinuous operation. These savings can be even greater if the maximum power that can be contracted is lower than the optimum. Demand charges in Spain consist of a fixed part proportional to the contracted power and a variable part that depends on the power peaks exceeding it. Since coincident and non-coincident models coexist for the variable part, the two models are compared, and in the general case PV users are found to achieve higher savings with the coincident model.
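
    The structure described above (a fixed part proportional to contracted power plus a variable part driven by peaks above it) can be sketched as follows. This is a deliberately simplified model, not the Spanish tariff formula or the article's method: the linear excess penalty, the prices, the monthly peaks, and the function names are all hypothetical, and the coincident versus non-coincident distinction is not modelled.

```python
def demand_charge(monthly_peaks_kw, contracted_kw, fixed_price, excess_price):
    """Total demand charge over the billing periods in `monthly_peaks_kw`.

    Assumed (simplified) structure: a fixed charge per contracted kW per
    period, plus a penalty per kW by which each period's peak exceeds
    the contracted power.
    """
    fixed = fixed_price * contracted_kw * len(monthly_peaks_kw)
    variable = sum(
        excess_price * max(0.0, peak - contracted_kw) for peak in monthly_peaks_kw
    )
    return fixed + variable


def best_contracted_power(monthly_peaks_kw, fixed_price, excess_price, candidates):
    """Pick the candidate contracted power with the lowest total charge."""
    return min(
        candidates,
        key=lambda c: demand_charge(monthly_peaks_kw, c, fixed_price, excess_price),
    )


# Hypothetical monthly peaks (kW) without PV and with PV shaving each peak.
peaks_no_pv = [100, 120, 90]
peaks_pv = [80, 100, 70]
candidates = range(60, 131, 10)

for label, peaks in (("no PV", peaks_no_pv), ("with PV", peaks_pv)):
    best = best_contracted_power(peaks, 2, 5, candidates)
    print(label, best, demand_charge(peaks, best, 2, 5))
```

    Re-optimising the contracted power after adding PV is what captures the saving in this toy model: lower peaks reduce the excess penalty and also allow a lower contracted power.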

  • Article type: Journal Article
    PMart is a web-based tool for reproducible quality control, exploratory data analysis, statistical analysis, and interactive visualization of 'omics data, based on the functionality of the pmartR R package. The newly improved user interface supports more 'omics data types, additional statistical capabilities, and enhanced options for creating downloadable graphics. PMart supports the analysis of label-free and isobaric-labeled (e.g., TMT, iTRAQ) proteomics, nuclear magnetic resonance (NMR) and mass spectrometry (MS)-based metabolomics, MS-based lipidomics, and ribonucleic acid sequencing (RNA-seq) transcriptomics data. At the end of a PMart session, a report is available that summarizes the processing steps performed and lists the pmartR R package functions used to execute the data processing. In addition, built-in safeguards in the backend code prevent users from applying methods that are inappropriate for the 'omics data type. PMart is a user-friendly interface for conducting exploratory data analysis and statistical comparisons of 'omics data without programming.

  • Article type: Journal Article
    Generalized structured component analysis (GSCA) is a structural equation modeling (SEM) procedure that constructs components as weighted sums of observed variables and examines their regression relationships in a confirmatory manner. This research proposes an exploratory version of GSCA, called exploratory GSCA (EGSCA). EGSCA is analogous to exploratory SEM (ESEM), which was developed as an exploratory factor-based SEM procedure, and seeks the relationships between the observed variables and the components by orthogonal rotation of the parameter matrices. The indeterminacy of orthogonal rotation in GSCA is first shown as theoretical support for the proposed method. The whole EGSCA procedure is then presented, together with a new rotation algorithm specialized for EGSCA that aims at simultaneous simplification of all parameter matrices. Two numerical simulation studies showed that EGSCA with the subsequent rotation successfully recovered the true values of the parameter matrices and was superior to the existing GSCA procedure. EGSCA was applied to two real datasets, and the model suggested by the EGSCA results was shown to be better than the models proposed in previous research, which demonstrates the effectiveness of EGSCA in model exploration.

  • Article type: Journal Article
    Plastic consumption and its end-of-life management leave a significant environmental footprint and are energy intensive. Waste-to-resources and prevention strategies have been promoted widely in Europe as countermeasures; however, their effectiveness remains uncertain. This study aims to uncover the environmental footprint patterns of the plastics value chain in the European Union Member States (EU-27) through exploratory data analysis with dimension reduction and grouping. Nine variables are assessed, ranging from socioeconomic and demographic characteristics to environmental impacts. Three clusters are formed according to similarity across these nine characteristics, with environmental impacts identified as the primary variable determining the clusters. Most countries belong to Cluster 0, which consists of 17 countries in 2014 and 18 countries in 2019. This cluster has a relatively low global warming potential (GWP), with an average value of 2.64 t CO2eq/cap in 2014 and 4.01 t CO2eq/cap in 2019. Among all the assessed countries, Denmark showed a significant change when assessed within the traits of the EU-27, moving from Cluster 1 (high GWP) in 2014 to Cluster 0 (low GWP) in 2019. The analysis of plastic packaging waste statistics for 2019 (data released in 2022) shows that, despite an increase in the recovery rate within the EU-27, the GWP has not decreased, suggesting a rebound effect. The GWP tends to increase with the amount of plastic waste. In contrast, other environmental impacts, such as eutrophication, abiotic, and acidification potential, are found to be mitigated effectively via recovery, suppressing the adverse effects of increased plastic waste generation. The five-year interval data analysis identified distinct clusters within a set of patterns, categorising them based on their similarities. The categorisation and managerial insights serve as a foundation for devising a focused mitigation strategy.

  • Article type: Journal Article
    In this paper, we propose a data classification and analysis method for estimating fire risk from the facility data of thermal power plants. To estimate fire risk based on facility data, we divided facilities into three states (Steady, Transient, and Anomaly), categorized by their purposes and operational conditions. The method is designed to satisfy three requirements of fire protection systems for thermal power plants. For example, areas with fire risk must be identified, and fire risks should be classified and integrated into existing systems. We classified thermal power plants into turbine, boiler, and indoor coal shed zones. Each zone was subdivided into smaller pieces of equipment. The turbine, generator, oil-related equipment, hydrogen (H2), and boiler feed pump (BFP) were selected for the turbine zone, while the pulverizer and ignition oil were chosen for the boiler zone. Based on an inspection of fire and explosion scenarios in thermal power plants over many years, we selected fire-related tags from Supervisory Control and Data Acquisition (SCADA) data and acquired sample data over a specific period for two thermal power plants. We focused on crucial fire cases such as pool fires, 3D fires, and jet fires, and organized three fire hazard levels for each zone. Experimental analysis was conducted on these data sets using the proposed method for 500 MW and 100 MW thermal power plants. The data classification and analysis methods presented in this paper can provide indirect experience for data analysts who lack domain knowledge about power plant fires and can also offer useful guidance for data analysts who need to understand power plant facilities.

  • Article type: Journal Article
    Cardiovascular diseases (CVDs) account for a significant portion of global mortality, emphasizing the need for effective strategies. This study focuses on myocardial infarction, pulmonary thromboembolism, and aortic stenosis, aiming to give medical practitioners tools for informed decision making and timely interventions. Drawing on data from Hospital Santa Maria, our approach combines exploratory data analysis (EDA) and predictive machine learning (ML) models, guided by the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology. EDA reveals intricate patterns and relationships specific to cardiovascular diseases. The ML models achieve accuracies above 80%, providing a 13-minute window in which to predict myocardial ischemia incidents and intervene proactively. This paper presents a proof of concept for using real-time data and predictive capabilities to enhance medical strategies.

  • Article type: Journal Article
    We report the major highlights of the School of Cheminformatics in Latin America, Mexico City, November 24-25, 2022. Six lectures, one workshop, and one roundtable with four editors were presented during an online public event with speakers from academia, big pharma, and public research institutions. One thousand one hundred eighty-one students and academics from seventy-nine countries registered for the meeting. As part of the meeting, advances in the enumeration and visualization of chemical space, applications in natural product-based drug discovery, drug discovery for neglected diseases, toxicity prediction, and general guidelines for data analysis were discussed. Experts from ChEMBL presented a workshop on how to use the resources of this major compound database used in cheminformatics. The school also included a roundtable with editors of cheminformatics journals. The full program of the meeting and the recordings of the sessions are publicly available at https://www.youtube.com/@SchoolChemInfLA/featured .

  • Article type: Journal Article
    Electronic health records (EHRs) play a crucial role in healthcare decision-making by giving physicians insights into disease progression and suitable treatment options. Within EHRs, laboratory test results are frequently used for predicting disease progression. However, processing laboratory test results often poses challenges because of variations in units and formats. In addition, leveraging the temporal information in EHRs can improve outcome, prognosis, and diagnosis prediction. Nevertheless, the irregular frequency of the data in these records necessitates data preprocessing, which can add complexity to time-series analyses.
    To address these challenges, we developed an open-source R package that facilitates the extraction of temporal information from laboratory records. The proposed lab package generates analysis-ready time-series data by segmenting the data into time-series windows and imputing missing values. Moreover, users can map local laboratory codes to the Logical Observation Identifier Names and Codes (LOINC), an international standard. This mapping allows users to incorporate additional information, such as reference ranges and related diseases. The reference ranges provided by LOINC also make it possible to categorize results as normal or abnormal. Finally, the analysis-ready time-series data can be further summarized using descriptive statistics and used to develop models with machine learning techniques.
    Using the lab package, we analyzed data from MIMIC-III, focusing on newborns with patent ductus arteriosus (PDA). We extracted time-series laboratory records and compared the differences in test results between patients with and without 30-day in-hospital mortality. We then identified significant variations in several laboratory test results 7 days after PDA diagnosis. Leveraging the analysis-ready time-series data, we trained a prediction model with the long short-term memory algorithm, achieving an area under the receiver operating characteristic curve of 0.83 for predicting 30-day in-hospital mortality in model training. These findings demonstrate the lab package's effectiveness in analyzing disease progression.
    The proposed lab package simplifies and expedites the workflow involved in extracting laboratory records. This tool is particularly valuable in helping clinical data analysts overcome the obstacles associated with heterogeneous and sparse laboratory records.
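
    The windowing-and-imputation idea described above can be sketched in a few lines. This is not the lab package's API (the package is written in R, and its function names are not given here): the function `to_windows`, the choice of the window mean as the aggregate, and last-observation-carried-forward imputation are all assumptions made for illustration.

```python
from statistics import mean


def to_windows(records, window_days, n_windows):
    """Turn irregular (day_offset, value) lab records into a regular series.

    Each record is placed in the window containing its day offset (days since
    an index date such as diagnosis), values within a window are averaged, and
    empty windows are filled with the last observed window value. Windows
    before the first observation stay None.
    """
    buckets = [[] for _ in range(n_windows)]
    for day, value in records:
        idx = int(day // window_days)
        if 0 <= idx < n_windows:  # ignore records outside the study period
            buckets[idx].append(value)

    series = []
    last = None
    for bucket in buckets:
        if bucket:
            last = mean(bucket)
        series.append(last)  # carries the previous value forward when empty
    return series


# Hypothetical records: three measurements on days 0, 1, and 8.
print(to_windows([(0, 1.0), (1, 3.0), (8, 5.0)], window_days=3, n_windows=4))
```

    The resulting fixed-length series is the kind of regular input that sequence models expect, which is why a preprocessing step of this sort usually precedes time-series modelling of laboratory data.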