Exploratory data analysis

  • Article type: Journal Article
    Correspondence analysis (CA) is a multivariate statistical and visualization technique. CA is extremely useful in analyzing either two- or multi-way contingency tables, representing some degree of correspondence between columns and rows. The CA results are visualized in easy-to-interpret "bi-plots," where the proximity of items (values of categorical variables) represents the degree of association between presented items. In other words, items positioned near each other are more associated than those located farther away. Each bi-plot has two dimensions, named during the analysis. The naming of dimensions adds a qualitative aspect to the analysis. Correspondence analysis may support medical professionals in finding answers to many important questions related to health, wellbeing, quality of life, and similar topics in a simpler but more informal way than by using more complex statistical or machine learning approaches. In that way, it can be used for dimension reduction and data simplification, clustering, classification, feature selection, knowledge extraction, visualization of adverse effects, or pattern detection.
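As a rough illustration of how the two bi-plot dimensions can be obtained, the sketch below runs a textbook correspondence analysis on a two-way contingency table through a singular value decomposition of the standardized residuals. It assumes NumPy is available; the function name `correspondence_analysis` and the example counts are hypothetical and do not come from the article.

```python
import numpy as np

def correspondence_analysis(table, n_dims=2):
    """Return row/column principal coordinates and the inertia per axis."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                     # correspondence matrix (relative frequencies)
    r = P.sum(axis=1)                   # row masses
    c = P.sum(axis=0)                   # column masses
    expected = np.outer(r, c)           # frequencies expected under independence
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    # Principal coordinates: scale the singular vectors by the singular
    # values and divide by the square roots of the masses.
    row_coords = (U[:, :n_dims] * sv[:n_dims]) / np.sqrt(r)[:, None]
    col_coords = (Vt.T[:, :n_dims] * sv[:n_dims]) / np.sqrt(c)[:, None]
    return row_coords, col_coords, sv ** 2

# Hypothetical 3x3 contingency table (rows and columns are categories).
table = [[30, 10, 5],
         [10, 25, 10],
         [5, 15, 40]]
rows, cols, inertia = correspondence_analysis(table)
```

Plotting `rows` and `cols` on the same two axes gives the kind of bi-plot described above. The sum of `inertia` equals the chi-square statistic of the table divided by the total count, which is one way to check how much association the two plotted dimensions capture.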

  • Article type: Journal Article
    BACKGROUND: The intergovernmental organizations Organisation for Economic Co-operation and Development (OECD) and Interagency Coordinating Committee on the Validation of Alternative Methods (ICCVAM) have developed guidelines for the use of in vitro models for toxicological evaluation of chemicals. However, the manual steps involved and the need for multiple data analysis tools are costly and time-consuming, and can lead researchers to introduce errors inadvertently.
    OBJECTIVE: We have developed the SAEDC platform (Technological Solution for Exploratory Data Analysis and Statistics for Cytotoxicity, in Portuguese), which enables analysis of cytotoxicity data from assays following OECD Guideline No. 129.
    METHODS: In vitro experimental data were used to compare the platform's output with the analysis methodology suggested in the Guideline. We analyzed 117 data sets covering chemicals from Category I to Unclassified according to GHS classification.
    RESULTS: The four parameters of the non-linear regression (4PL) calculated by the SAEDC platform showed no significant differences compared to the standard methodology in any of the data sets (p > 0.05). The coefficient of determination (R-squared) demonstrated not only a good fit of the 4PL model to the data but also significant similarity to the values obtained by the conventional methodology. Finally, the SAEDC platform predicted LD50 values for the chemicals from IC50, using the Registry of Cytotoxicity (RC) regression models.
    CONCLUSIONS: The comparison with the standard data analysis methodology revealed that the SAEDC platform fulfills the requirements for cytotoxicity data analysis, generating reliable and accurate results with fewer steps performed by researchers. The use of the SAEDC platform for obtaining toxicity values can reduce analysis time compared to the standard methodology proposed by regulatory agencies. Thus, automation of the analysis using the SAEDC platform has the potential to save time and resources for cytotoxicity researchers and laboratories while generating reliable results.
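For readers unfamiliar with the 4PL model and R-squared mentioned in the results, a minimal sketch follows. It assumes one common parameterization of the four-parameter logistic curve (bottom, top, IC50, Hill slope); the abstract does not state which parameterization SAEDC uses, and the function names, concentrations, and "observed" values are all hypothetical. The sketch evaluates a curve and scores it against data; it does not perform the curve fitting itself.

```python
def four_pl(x, bottom, top, ic50, hill):
    """Four-parameter logistic response at concentration x (x > 0)."""
    return bottom + (top - bottom) / (1.0 + (x / ic50) ** hill)

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical curve parameters and test concentrations.
params = dict(bottom=0.0, top=100.0, ic50=10.0, hill=1.0)
concentrations = [0.1, 1.0, 10.0, 100.0, 1000.0]
predicted = [four_pl(x, **params) for x in concentrations]

# Hypothetical measured viabilities (percent), slightly off the curve.
observed = [98.0, 92.0, 48.0, 10.0, 2.0]
fit_quality = r_squared(observed, predicted)
```

With this parameterization the response at `x == ic50` lies exactly halfway between `bottom` and `top`, which is why the third parameter can be read directly as the IC50.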

  • Article type: Journal Article
    Demand charges are widely used for commercial and industrial consumers. These costs are often not well known, let alone the effects that PV can have on them. This work proposes a methodology to assess the effect of PV on reducing these charges and to optimise the power to be contracted, using techniques taken from exploratory data analysis. This methodology is applied to five case studies of industrial consumers from different sectors in Spain, finding savings between 5 % and 11 % of demand charges in industries with continuous operation and up to 28 % in cases of discontinuous operation. These savings can be even greater if the maximum power that can be contracted is lower than the optimum. The demand charges in Spain consist of a fixed part proportional to the contracted power and a variable part depending on the power peaks exceeding it. Since for the variable part the coincident and non-coincident models coexist, a comparison is made between the two models, finding that in the general case PV users can achieve higher savings with the coincident model.
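The structure described above (a fixed part proportional to contracted power plus a variable part driven by peaks above it) can be sketched as a small optimisation. This is a deliberately simplified model, not the actual Spanish tariff formula: the linear excess penalty, the rates, the monthly peaks, and the function names are all hypothetical assumptions for illustration.

```python
def demand_charge(contracted_kw, monthly_peaks_kw, fixed_rate, excess_rate):
    """Fixed part (per kW contracted, per month) plus a penalty on excess peaks."""
    fixed = fixed_rate * contracted_kw * len(monthly_peaks_kw)
    variable = excess_rate * sum(
        max(0.0, peak - contracted_kw) for peak in monthly_peaks_kw
    )
    return fixed + variable

def best_contracted_power(monthly_peaks_kw, fixed_rate, excess_rate, candidates):
    """Grid search for the contracted power with the lowest total charge."""
    return min(
        candidates,
        key=lambda c: demand_charge(c, monthly_peaks_kw, fixed_rate, excess_rate),
    )

# Hypothetical monthly peak demand (kW) without and with PV generation.
peaks_without_pv = [100.0, 120.0, 90.0, 150.0]
peaks_with_pv = [80.0, 100.0, 85.0, 120.0]
candidates = range(50, 201, 10)           # candidate contracted powers in kW
fixed_rate, excess_rate = 2.0, 5.0        # hypothetical currency units per kW

opt_without = best_contracted_power(peaks_without_pv, fixed_rate, excess_rate, candidates)
opt_with = best_contracted_power(peaks_with_pv, fixed_rate, excess_rate, candidates)
cost_without = demand_charge(opt_without, peaks_without_pv, fixed_rate, excess_rate)
cost_with = demand_charge(opt_with, peaks_with_pv, fixed_rate, excess_rate)
saving_fraction = (cost_without - cost_with) / cost_without
```

In this toy case PV lowers the peaks, which lowers both the optimal contracted power and the total charge. A real study would replace the hypothetical peaks with metered load and PV profiles and the penalty with the applicable tariff rule.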

  • Article type: Journal Article
    In this paper, we propose a data classification and analysis method to estimate fire risk using facility data of thermal power plants. To estimate fire risk based on facility data, we divided facilities into three states (Steady, Transient, and Anomaly), categorized by their purposes and operational conditions. This method is designed to satisfy three requirements of fire protection systems for thermal power plants. For example, areas with fire risk must be identified, and fire risks should be classified and integrated into existing systems. We classified thermal power plants into turbine, boiler, and indoor coal shed zones. Each zone was subdivided into small pieces of equipment. The turbine, generator, oil-related equipment, hydrogen (H2), and boiler feed pump (BFP) were selected for the turbine zone, while the pulverizer and ignition oil were chosen for the boiler zone. We selected fire-related tags from Supervisory Control and Data Acquisition (SCADA) data and acquired sample data during a specific period for two thermal power plants based on inspection of fire and explosion scenarios in thermal power plants over many years. We focused on crucial fire cases such as pool fires, 3D fires, and jet fires and organized three fire hazard levels for each zone. Experimental analysis was conducted on these data sets using the proposed method for 500 MW and 100 MW thermal power plants. The data classification and analysis methods presented in this paper can provide indirect experience for data analysts who do not have domain knowledge about power plant fires and can also offer good inspiration for data analysts who need to understand power plant facilities.

  • Article type: Journal Article
    Cardiovascular diseases (CVDs) account for a significant portion of global mortality, emphasizing the need for effective strategies. This study focuses on myocardial infarction, pulmonary thromboembolism, and aortic stenosis, aiming to empower medical practitioners with tools for informed decision making and timely interventions. Drawing from data at Hospital Santa Maria, our approach combines exploratory data analysis (EDA) and predictive machine learning (ML) models, guided by the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology. EDA reveals intricate patterns and relationships specific to cardiovascular diseases. ML models achieve accuracies above 80%, providing a 13 min window to predict myocardial ischemia incidents and intervene proactively. This paper presents a Proof of Concept for real-time data and predictive capabilities in enhancing medical strategies.

  • Article type: Journal Article
    We report the major highlights of the School of Cheminformatics in Latin America, Mexico City, November 24-25, 2022. Six lectures, one workshop, and one roundtable with four editors were presented during an online public event with speakers from academia, big pharma, and public research institutions. One thousand one hundred eighty-one students and academics from seventy-nine countries registered for the meeting. As part of the meeting, advances in enumeration and visualization of chemical space, applications in natural product-based drug discovery, drug discovery for neglected diseases, toxicity prediction, and general guidelines for data analysis were discussed. Experts from ChEMBL presented a workshop on how to use the resources of this major compound database used in cheminformatics. The school also included a roundtable with editors of cheminformatics journals. The full program of the meeting and the recordings of the sessions are publicly available at https://www.youtube.com/@SchoolChemInfLA/featured .

  • Article type: Journal Article
    Electronic health records (EHRs) play a crucial role in healthcare decision-making by giving physicians insights into disease progression and suitable treatment options. Within EHRs, laboratory test results are frequently utilized for predicting disease progression. However, processing laboratory test results often poses challenges due to variations in units and formats. In addition, leveraging the temporal information in EHRs can improve outcome, prognosis, and diagnosis prediction. Nevertheless, the irregular frequency of the data in these records necessitates data preprocessing, which can add complexity to time-series analyses.
    To address these challenges, we developed an open-source R package that facilitates the extraction of temporal information from laboratory records. The proposed lab package generates analysis-ready time-series data by segmenting the data into time-series windows and imputing missing values. Moreover, users can map local laboratory codes to the Logical Observation Identifier Names and Codes (LOINC), an international standard. This mapping allows users to incorporate additional information, such as reference ranges and related diseases. In addition, the reference ranges provided by LOINC enable us to categorize results into normal or abnormal. Finally, the analysis-ready time-series data can be further summarized using descriptive statistics and utilized to develop models using machine learning technologies.
    Using the lab package, we analyzed data from MIMIC-III, focusing on newborns with patent ductus arteriosus (PDA). We extracted time-series laboratory records and compared the differences in test results between patients with and without 30-day in-hospital mortality. We then identified significant variations in several laboratory test results 7 days after PDA diagnosis. Leveraging the analysis-ready time-series data, we trained a prediction model with the long short-term memory algorithm, achieving an area under the receiver operating characteristic curve of 0.83 for predicting 30-day in-hospital mortality in model training. These findings demonstrate the lab package's effectiveness in analyzing disease progression.
    The proposed lab package simplifies and expedites the workflow involved in laboratory records extraction. This tool is particularly valuable in assisting clinical data analysts in overcoming the obstacles associated with heterogeneous and sparse laboratory records.
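The two preprocessing steps named above, segmenting irregular records into time windows and imputing missing values, can be sketched in a few lines. This is not the lab package's API (the package is written in R and its function names are not given here); the Python functions, the window width, the choice of last-observation-carried-forward imputation, and the sample records are hypothetical.

```python
def window_means(records, window_days, n_windows):
    """Average irregular (day, value) records inside fixed-width windows.

    Windows with no measurement are returned as None.
    """
    buckets = [[] for _ in range(n_windows)]
    for day, value in records:
        idx = int(day // window_days)
        if 0 <= idx < n_windows:
            buckets[idx].append(value)
    return [sum(b) / len(b) if b else None for b in buckets]

def carry_forward(series):
    """Impute gaps with the most recent earlier value (leading gaps stay None)."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled

# Hypothetical lab results: (days since diagnosis, measured value).
records = [(0, 1.0), (1, 3.0), (5, 4.0), (13, 2.0)]
windows = window_means(records, window_days=3, n_windows=5)
filled = carry_forward(windows)
```

The resulting `filled` list has one value per window at a regular spacing, which is the shape a sequence model such as an LSTM expects. Other imputation rules (window interpolation, population means) would slot into the same place as `carry_forward`.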

  • Article type: Journal Article
    Machine learning (ML) models have become capable of making critical decisions on our behalf. Nevertheless, due to the complexity of these models, interpreting their decisions can be challenging, and humans cannot always control them. This paper provides explanations of decisions made by ML models in diagnosing four types of posterior fossa tumors: medulloblastoma, ependymoma, pilocytic astrocytoma, and brainstem glioma. The proposed methodology involves data analysis using kernel density estimations with Gaussian distributions to examine individual MRI features, conducting an analysis on the relationships between these features, and performing a comprehensive analysis of ML model behavior. This approach offers a simple yet informative and reliable means of identifying and validating distinguishable MRI features for the diagnosis of pediatric brain tumors. By presenting a comprehensive analysis of the responses of the four pediatric tumor types to each other and to ML models in a single source, this study aims to bridge the knowledge gap in the existing literature concerning the relationship between ML and medical outcomes. The results highlight that employing a simplistic approach in the absence of very large datasets leads to significantly more pronounced and explainable outcomes, as expected. Additionally, the study demonstrates that the pre-analysis results consistently align with the outputs of the ML models and the clinical findings reported in the existing literature.
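To make the kernel density step concrete, here is a minimal Gaussian kernel density estimator in plain Python. The bandwidth, the feature values, and the two group names are hypothetical; the abstract does not specify the study's bandwidth selection or which MRI features were used.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a density function built from Gaussian kernels on the samples."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

# Hypothetical values of one feature for two tumor groups.
group_a = [1.0, 1.2, 0.9, 1.1]
group_b = [2.0, 2.1, 1.9, 2.2]
density_a = gaussian_kde(group_a, bandwidth=0.2)
density_b = gaussian_kde(group_b, bandwidth=0.2)

# A feature value near 1.0 is far more plausible under group A than group B.
ratio_at_1 = density_a(1.0) / density_b(1.0)
```

Comparing the two estimated densities over the range of a feature shows where the groups overlap and where they separate, which is the kind of per-feature check that can be run before any ML model is trained.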

  • Article type: Journal Article
    Utilization of multivariate data analysis in catalysis research has extraordinary importance. The aim of the MIRA21 (MIskolc RAnking 21) model is to characterize heterogeneous catalysts with bias-free quantifiable data from 15 different variables to standardize catalyst characterization and provide an easy tool to compare, rank, and classify catalysts. The present work introduces and mathematically validates the MIRA21 model by identifying fundamentals affecting catalyst comparison and provides support for catalyst design. Literature data of 2,4-dinitrotoluene hydrogenation catalysts for toluene diamine synthesis were analyzed by using the descriptor system of MIRA21. In this study, exploratory data analysis (EDA) has been used to understand the relationships between individual variables such as catalyst performance, reaction conditions, catalyst compositions, and sustainable parameters. The results will be applicable in catalyst design, and using machine learning tools will also be possible.

  • Article type: Journal Article
    Chemical flow analysis (CFA) can be used for collecting life-cycle inventory (LCI), estimating environmental releases, and identifying potential exposure scenarios for chemicals of concern at the end-of-life (EoL) stage. Nonetheless, the demand for comprehensive data and the epistemic uncertainties about the pathway taken by the chemical flows make CFA, LCI, and exposure assessment time-consuming and challenging tasks. Due to the continuous growth of computer power and the appearance of more robust algorithms, data-driven modelling represents an attractive tool for streamlining these tasks. However, a data ingestion pipeline is required to deploy and serve data-driven models in the real world. Hence, this work moves forward by contributing a chemical-centric and data-centric approach to extract, transform, and load comprehensive data for CFA at the EoL, integrating cross-year and country data and its provenance as part of the data lifecycle. The framework is scalable and adaptable to production-level machine learning operations. The framework can supply data at an annual rate, making it possible to deal with changes in the statistical distributions of model predictors like transferred amount and target variables (e.g., EoL activity identification) to avoid potential data-driven model performance decay over time. For instance, it can detect that recycling transfers of 643 chemicals over the reporting years (1988 to 2020) are 29.87%, 17.79%, and 20.56% for Canada, Australia, and the U.S., respectively. Moreover, the developed approach enables research advancements on data-driven modelling to easily connect with other data sources for economic information on industry sectors, the economic value of chemicals, and the environmental regulatory implications that may affect the occurrence of an EoL transfer class or activity like recycling of a chemical over years and countries. Finally, stakeholders gain more context about environmental regulation stringency and economic affairs that could affect environmental decision-making and EoL chemical exposure predictions.