Exploratory data analysis

  • Article type: Journal Article
    Exploratory data analysis (EDA) is a critical step in scientific projects, aiming to uncover valuable insights and patterns within data. Traditionally, EDA involves manual inspection, visualization, and various statistical methods. The advent of artificial intelligence (AI) and machine learning (ML) has the potential to improve EDA by offering more sophisticated approaches that enhance its efficacy. This review explores how AI and ML algorithms can improve feature engineering and selection during EDA, leading to more robust predictive models and data-driven decisions. Tree-based models, regularized regression, and clustering algorithms were identified as key techniques. These methods automate feature importance ranking, handle complex interactions, perform feature selection, reveal hidden groupings, and detect anomalies. Real-world applications include risk prediction in total hip arthroplasty and subgroup identification in scoliosis patients. Recent advances in explainable AI and EDA automation show potential for further improvement. Integrating AI and ML into EDA accelerates tasks and uncovers sophisticated insights. However, effective use requires a deep understanding of the algorithms, their assumptions, and their limitations, along with domain knowledge for proper interpretation. As data continues to grow, AI will play an increasingly pivotal role in EDA when combined with human expertise, driving more informed, data-driven decision-making across scientific domains. Level of Evidence: Level V, expert opinion.

  • Article type: Journal Article
    Correspondence analysis (CA) is a multivariate statistical and visualization technique. CA is extremely useful for analyzing two-way or multi-way contingency tables, representing the degree of correspondence between columns and rows. CA results are visualized in easy-to-interpret "bi-plots," where the proximity of items (values of categorical variables) represents the degree of association between them. In other words, items positioned near each other are more strongly associated than those located farther apart. Each bi-plot has two dimensions, which are named during the analysis. Naming the dimensions adds a qualitative aspect to the analysis. Correspondence analysis may help medical professionals answer many important questions related to health, wellbeing, quality of life, and similar topics in a simpler but more informal way than more complex statistical or machine learning approaches. In that way, it can be used for dimension reduction and data simplification, clustering, classification, feature selection, knowledge extraction, visualization of adverse effects, or pattern detection.
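    The first step of CA on a two-way contingency table can be sketched as follows. This is a minimal illustration rather than the method of any specific paper: it computes the standardized residuals whose singular value decomposition yields the bi-plot coordinates, and checks the standard identity that total inertia equals the chi-square statistic divided by the grand total. The function names and the example table are hypothetical, and the table is assumed to have no all-zero row or column.

```python
from math import sqrt


def standardized_residuals(table):
    """Return the matrix S with S[i][j] = (p_ij - r_i * c_j) / sqrt(r_i * c_j).

    p_ij are cell proportions, r_i row masses, c_j column masses.
    CA obtains its dimensions from the singular value decomposition of S.
    """
    n = sum(sum(row) for row in table)
    p = [[x / n for x in row] for row in table]
    r = [sum(row) for row in p]        # row masses
    c = [sum(col) for col in zip(*p)]  # column masses
    return [
        [(p[i][j] - r[i] * c[j]) / sqrt(r[i] * c[j]) for j in range(len(c))]
        for i in range(len(r))
    ]


def total_inertia(table):
    """Sum of squared standardized residuals (the total inertia of the table)."""
    return sum(s * s for row in standardized_residuals(table) for s in row)


def chi_square(table):
    """Pearson chi-square statistic for independence of rows and columns."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical 3x3 table: rows and columns are categories of two variables.
example = [[20, 5, 5], [6, 18, 6], [4, 7, 19]]
print(round(total_inertia(example), 4))
```

    A table whose rows are exact multiples of one another has zero inertia, which corresponds to all items collapsing onto the origin of the bi-plot.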

  • Article type: Journal Article
    BACKGROUND: The intergovernmental organizations Organisation for Economic Co-operation and Development (OECD) and Interagency Coordinating Committee on the Validation of Alternative Methods (ICCVAM) have developed guidelines for the use of in vitro models in the toxicological evaluation of chemicals. However, manual steps and the need for multiple data analysis tools are costly and time-consuming, and can lead researchers to introduce errors inadvertently.
    OBJECTIVE: We have developed the SAEDC platform (Technological Solution for Exploratory Data Analysis and Statistics for Cytotoxicity, in Portuguese), which enables analysis of cytotoxicity data from assays following OECD Guideline No. 129.
    METHODS: In vitro experimental data were used to compare the platform with the analysis methodology suggested in the Guideline. We analyzed 117 data sets covering chemicals from Category I to Unclassified according to the GHS classification.
    RESULTS: The four parameters of the non-linear regression (4PL) calculated by the SAEDC platform showed no significant differences from the standard methodology in any of the data sets (p > 0.05). The coefficient of determination (R-squared) demonstrated both a good fit of the 4PL model to the data and close similarity to the values obtained by the conventional methodology. Finally, the SAEDC platform predicted LD50 values for the chemicals from IC50, using the Registry of Cytotoxicity (RC) regression models.
    CONCLUSIONS: The comparison with the standard data analysis methodology showed that the SAEDC platform fulfills the requirements for cytotoxicity data analysis, generating reliable and accurate results with fewer steps performed by researchers. Using the SAEDC platform to obtain toxicity values can reduce analysis time compared with the standard methodology proposed by regulatory agencies. Thus, automating the analysis with the SAEDC platform has the potential to save time and resources for cytotoxicity researchers and laboratories while generating reliable results.
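    The four-parameter logistic (4PL) curve and the R-squared statistic mentioned in the results can be sketched as below. This is an illustration only: the abstract does not state which 4PL parameterization the SAEDC platform uses, so the form here (bottom, top, IC50, Hill slope) is one common convention and is an assumption, as are the function names and the example numbers. Fitting the parameters to data and the RC regression from IC50 to LD50 are not shown.

```python
def four_pl(x, bottom, top, ic50, hill):
    """Four-parameter logistic response at concentration x (x > 0).

    Assumed parameterization: the response falls from `top` at low
    concentration to `bottom` at high concentration, passing through
    the midpoint at x == ic50. `hill` controls the steepness.
    """
    return bottom + (top - bottom) / (1.0 + (x / ic50) ** hill)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical viability readings (%) at five concentrations.
concentrations = [0.1, 1.0, 3.0, 10.0, 100.0]
viability = [99.0, 85.0, 52.0, 14.0, 1.0]
fitted = [four_pl(x, 0.0, 100.0, 3.0, 1.5) for x in concentrations]
print(round(r_squared(viability, fitted), 3))
```

    With this parameterization the IC50 is read directly from the fitted curve as the concentration at which the response is halfway between `top` and `bottom`.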

  • Article type: Journal Article
    Demand charges are widely used for commercial and industrial consumers. These costs are often not well understood, let alone the effects that photovoltaics (PV) can have on them. This work proposes a methodology to assess the effect of PV on reducing these charges and to optimise the power to be contracted, using techniques taken from exploratory data analysis. The methodology is applied to five case studies of industrial consumers from different sectors in Spain, finding savings of between 5% and 11% of demand charges in industries with continuous operation and up to 28% in cases of discontinuous operation. These savings can be even greater if the maximum power that can be contracted is lower than the optimum. Demand charges in Spain consist of a fixed part proportional to the contracted power and a variable part that depends on the power peaks exceeding it. Since the coincident and non-coincident models coexist for the variable part, the two models are compared, finding that in the general case PV users can achieve higher savings with the coincident model.
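    The trade-off between the fixed and variable parts of a demand charge can be sketched as below. This is a simplified illustration, not the Spanish tariff formula or the paper's methodology: it assumes a fixed charge per contracted kW in each billing period plus a penalty per kW of peak demand above the contracted power, and it finds the cheapest contracted power by scanning candidates. All prices, peak values, and function names are hypothetical.

```python
def demand_charge(contracted_kw, period_peaks_kw, fixed_price, excess_price):
    """Total demand charge over all billing periods (assumed tariff form).

    fixed part:    fixed_price per contracted kW, charged every period
    variable part: excess_price per kW of peak above the contracted power
    """
    fixed = fixed_price * contracted_kw * len(period_peaks_kw)
    variable = excess_price * sum(
        max(0.0, peak - contracted_kw) for peak in period_peaks_kw
    )
    return fixed + variable


def best_contracted_power(period_peaks_kw, fixed_price, excess_price, step_kw=1.0):
    """Scan candidate contracted powers from 0 up to the largest peak."""
    steps = int(max(period_peaks_kw) / step_kw) + 1
    candidates = [i * step_kw for i in range(steps + 1)]
    return min(
        candidates,
        key=lambda kw: demand_charge(kw, period_peaks_kw, fixed_price, excess_price),
    )


# Hypothetical peaks (kW) for three billing periods, without and with PV.
peaks_without_pv = [100.0, 120.0, 80.0]
peaks_with_pv = [90.0, 100.0, 80.0]

for label, peaks in (("without PV", peaks_without_pv), ("with PV", peaks_with_pv)):
    kw = best_contracted_power(peaks, fixed_price=1.0, excess_price=2.0)
    cost = demand_charge(kw, peaks, fixed_price=1.0, excess_price=2.0)
    print(label, kw, cost)
```

    Under these assumed prices, PV lowers both the optimal contracted power and the resulting charge, which is the kind of effect the study quantifies with real load data.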

  • Article type: Journal Article
    PMart is a web-based tool for reproducible quality control, exploratory data analysis, statistical analysis, and interactive visualization of 'omics data, based on the functionality of the pmartR R package. The newly improved user interface supports more 'omics data types, additional statistical capabilities, and enhanced options for creating downloadable graphics. PMart supports the analysis of label-free and isobaric-labeled (e.g., TMT, iTRAQ) proteomics, nuclear magnetic resonance (NMR) and mass spectrometry (MS)-based metabolomics, MS-based lipidomics, and ribonucleic acid sequencing (RNA-seq) transcriptomics data. At the end of a PMart session, a report is available that summarizes the processing steps performed and lists the pmartR R package functions used to execute the data processing. In addition, built-in safeguards in the backend code prevent users from applying methods that are inappropriate for the 'omics data type. PMart is a user-friendly interface for conducting exploratory data analysis and statistical comparisons of 'omics data without programming.

  • Article type: Journal Article
    Generalized structured component analysis (GSCA) is a structural equation modeling (SEM) procedure that constructs components as weighted sums of observed variables and examines their regression relationships in a confirmatory manner. This research proposes an exploratory version of GSCA, called exploratory GSCA (EGSCA). EGSCA is analogous to exploratory SEM (ESEM), which was developed as an exploratory factor-based SEM procedure, and seeks the relationships between the observed variables and the components by orthogonal rotation of the parameter matrices. The indeterminacy of orthogonal rotation in GSCA is first shown as theoretical support for the proposed method. The whole EGSCA procedure is then presented, together with a new rotational algorithm specialized to EGSCA, which aims at simultaneous simplification of all parameter matrices. Two numerical simulation studies revealed that EGSCA with the following rotation successfully recovered the true values of the parameter matrices and was superior to the existing GSCA procedure. EGSCA was applied to two real datasets, and the model suggested by EGSCA's result was shown to be better than the model proposed by previous research, which demonstrates the effectiveness of EGSCA in model exploration.

  • Article type: Journal Article
    Plastic consumption and its end-of-life management leave a significant environmental footprint and are energy intensive. Waste-to-resources and prevention strategies have been promoted widely in Europe as countermeasures; however, their effectiveness remains uncertain. This study aims to uncover the environmental footprint patterns of the plastics value chain in the European Union Member States (EU-27) through exploratory data analysis with dimension reduction and grouping. Nine variables are assessed, ranging from socioeconomic and demographic to environmental impacts. Three clusters are formed according to the similarity of these nine characteristics, with environmental impacts identified as the primary variable determining the clusters. Most countries belong to Cluster 0, which consists of 17 countries in 2014 and 18 countries in 2019. This cluster has a relatively low global warming potential (GWP), with an average value of 2.64 t CO2eq/cap in 2014 and 4.01 t CO2eq/cap in 2019. Among all the assessed countries, Denmark showed a significant change when assessed within the traits of the EU-27, moving from Cluster 1 (high GWP) in 2014 to Cluster 0 (low GWP) in 2019. The analysis of plastic packaging waste statistics for 2019 (data released in 2022) shows that, despite an increase in the recovery rate within the EU-27, the GWP has not fallen, suggesting a rebound effect. The GWP tends to increase with the amount of plastic waste. In contrast, other environmental impacts, such as eutrophication, abiotic and acidification potential, are found to be mitigated effectively via recovery, suppressing the adverse effects of increased plastic waste generation. The five-year interval data analysis identified distinct clusters within a set of patterns, categorising them based on their similarities. The categorisation and managerial insights serve as a foundation for devising a focused mitigation strategy.
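    Grouping countries by the similarity of several indicators can be sketched as below. The abstract does not name the clustering algorithm used, so plain k-means on z-scored variables is an assumed stand-in here, and the data points, starting centroids, and function names are hypothetical. Standardizing first keeps variables measured on large scales from dominating the distance.

```python
from math import dist          # Euclidean distance, Python 3.8+
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column so that every variable has mean 0 and std 1."""
    cols = list(zip(*rows))
    mus = [mean(col) for col in cols]
    sds = [pstdev(col) for col in cols]
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]


def kmeans(points, initial_centroids, max_iter=100):
    """Plain k-means (Lloyd's algorithm) from caller-supplied start centroids."""
    centroids = [list(c) for c in initial_centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Hypothetical two-indicator data for six countries, three obvious groups.
data = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.0), (5.1, 4.8), (9.0, 1.0), (9.2, 1.1)]
scaled = standardize(data)
labels, _ = kmeans(scaled, [scaled[0], scaled[2], scaled[4]])
print(labels)
```

    Running the same procedure on data from two different years and comparing the labels is one simple way to spot a country that changes cluster, as reported for Denmark.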

  • Article type: Journal Article
    In this paper, we propose a data classification and analysis method to estimate fire risk using facility data from thermal power plants. To estimate fire risk based on facility data, we divided facilities into three states (Steady, Transient, and Anomaly), categorized by their purposes and operational conditions. This method is designed to satisfy three requirements of fire protection systems for thermal power plants. For example, areas with fire risk must be identified, and fire risks should be classified and integrated into existing systems. We classified thermal power plants into turbine, boiler, and indoor coal shed zones. Each zone was subdivided into small pieces of equipment. The turbine, generator, oil-related equipment, hydrogen (H2), and boiler feed pump (BFP) were selected for the turbine zone, while the pulverizer and ignition oil were chosen for the boiler zone. Based on inspection of fire and explosion scenarios in thermal power plants over many years, we selected fire-related tags from Supervisory Control and Data Acquisition (SCADA) data and acquired sample data during a specific period for two thermal power plants. We focused on crucial fire cases such as pool fires, 3D fires, and jet fires, and organized three fire hazard levels for each zone. Experimental analysis was conducted on these data sets using the proposed method for 500 MW and 100 MW thermal power plants. The data classification and analysis methods presented in this paper can provide indirect experience for data analysts who lack domain knowledge about power plant fires and can also offer useful guidance for data analysts who need to understand power plant facilities.

  • Article type: Journal Article
    Cardiovascular diseases (CVDs) account for a significant portion of global mortality, emphasizing the need for effective strategies. This study focuses on myocardial infarction, pulmonary thromboembolism, and aortic stenosis, aiming to equip medical practitioners with tools for informed decision-making and timely interventions. Drawing on data from Hospital Santa Maria, our approach combines exploratory data analysis (EDA) and predictive machine learning (ML) models, guided by the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology. EDA reveals intricate patterns and relationships specific to cardiovascular diseases. The ML models achieve accuracies above 80%, providing a 13-minute window to predict myocardial ischemia incidents and intervene proactively. This paper presents a proof of concept for using real-time data and predictive capabilities to enhance medical strategies.

  • Article type: Journal Article
    We report the major highlights of the School of Cheminformatics in Latin America, Mexico City, November 24-25, 2022. Six lectures, one workshop, and one roundtable with four editors were presented during an online public event with speakers from academia, big pharma, and public research institutions. A total of 1,181 students and academics from 79 countries registered for the meeting. As part of the meeting, advances in enumeration and visualization of chemical space, applications in natural product-based drug discovery, drug discovery for neglected diseases, toxicity prediction, and general guidelines for data analysis were discussed. Experts from ChEMBL presented a workshop on how to use the resources of this major compound database used in cheminformatics. The school also included a roundtable with editors of cheminformatics journals. The full program of the meeting and the recordings of the sessions are publicly available at https://www.youtube.com/@SchoolChemInfLA/featured.