data science

  • Article type: Journal Article
    Sickle cell disease (SCD) is a severe hereditary form of anemia that contributes to between 50% and 80% of under-five mortality in Africa. Eleven thousand babies are born with SCD annually in Tanzania, which ranks fourth after Nigeria, the Democratic Republic of Congo, and India. The absence of well-described SCD cohorts is a major barrier to health research on SCD in Africa.
    This paper describes the Sickle Pan African Consortium (SPARCO) database in Tanzania, covering its development, the design of the study instruments, data collection, data analysis, and the management of data quality issues.
    The SPARCO registry used existing Muhimbili Sickle Cell Cohort (MSC) study case report forms (CRFs) and later harmonized data elements from the SickleInAfrica consortium to develop Research Electronic Data Capture (REDCap) instruments. Patients were enrolled through various strategies, including mass screening following media sensitization and health education events during World Sickle Cell Day each June and SCD awareness month each September. Additional patients were identified through active surveillance of patients who had previously participated in the MSC.
    Three thousand eight hundred patients were enrolled between October 2017 and May 2021. Of these, 1,946 (51.21%) were males and 1,864 (48.79%) were females. The hemoglobin phenotype distribution was 3,762 (99%) HbSS, 3 (0.08%) HbSC, and 35 (0.92%) HbSβ+ thalassemia. Hemoglobin levels, admission history, blood transfusions, and painful events were recorded from December 2017 to May 2021.
    The Tanzania SPARCO registry will improve healthcare for SCD in Africa by facilitating collaborative, data-driven research on SCD.

  • Article type: Journal Article
    The healthcare industry is challenged by the need to manage enormous amounts of heterogeneous data supplied by many different sources. These data are collected and analyzed with different data analytics (DA) and machine learning approaches, and researchers, scientists, and industry practitioners must select the best DA approach for healthcare. This study is based on a decision analysis of DA factors and alternatives. The information involved affects the whole system and is essential in the healthcare sector for appropriate prediction and analysis. The evaluation discusses its benefits and presents an analytic framework. The Fuzzy Analytic Hierarchy Process (Fuzzy AHP) is used to determine the weights of the factors, and the Fuzzy Technique for Order Preference by Similarity to Ideal Solution (Fuzzy TOPSIS) is used to rank the data analytics alternatives used in the healthcare sector. The models used in the article briefly discuss the challenges of DA and approaches to address them. The DA factors considered are capture, cleaning, storage, security, stewardship, reporting, visualization, updating, sharing, and querying. The DA alternatives include descriptive, diagnostic, predictive, prescriptive, discovery, regression, cohort, and inferential analyses. The most influential DA factors and the most suitable DA approach are evaluated. Under the Fuzzy AHP approach, the 'cleaning' factor has the highest weight and 'updating' the lowest. The regression approach to data analysis had the highest rank, and diagnostic analysis had the lowest. Decision analyses are necessary for data scientists and medical providers to predict diseases appropriately in the healthcare domain. This analysis also revealed the cost benefits to hospitals.
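To make the ranking step above more concrete, here is a minimal sketch of classical (crisp) TOPSIS. It is not the Fuzzy TOPSIS variant the abstract describes, which works on fuzzy numbers; the alternative names, scores, and weights below are made up purely for illustration, and all criteria are assumed to be benefit criteria (higher is better).

```python
import math

def topsis(matrix, weights):
    """Closeness of each alternative (row) to the ideal solution, in [0, 1].

    Assumes every criterion (column) is a benefit criterion.
    """
    n_crit = len(weights)
    # Vector-normalize each column, then apply the criterion weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_crit)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n_crit)]
                for row in matrix]
    # Ideal best and ideal worst value per criterion.
    best = [max(col) for col in zip(*weighted)]
    worst = [min(col) for col in zip(*weighted)]
    scores = []
    for row in weighted:
        d_best = math.dist(row, best)
        d_worst = math.dist(row, worst)
        scores.append(d_worst / (d_best + d_worst))
    return scores

# Hypothetical alternatives scored on three hypothetical criteria.
alternatives = ["alt_a", "alt_b", "alt_c"]
matrix = [
    [9, 8, 7],
    [7, 7, 6],
    [3, 4, 2],
]
weights = [0.5, 0.3, 0.2]  # illustrative weights; in the study these come from Fuzzy AHP

scores = topsis(matrix, weights)
ranking = sorted(zip(alternatives, scores), key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

An alternative that is best on every criterion gets a closeness of 1, and one that is worst on every criterion gets 0; everything else falls in between.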

  • Article type: Journal Article
    No abstract available.

  • Article type: Journal Article
    "Data scientists" quickly became ubiquitous, often infamously so, but they have struggled with the ambiguity of their novel role. This article studies the collective definition of data science on Twitter.
    The analysis responds to the challenges of studying an emergent case with unclear boundaries and substance through a cultural perspective and complementary datasets ranging from 1,025 to 752,815 tweets. It brings together the relations between accounts that tweeted about data science, the hashtags they used (which indicate purposes), and the topics they discussed.
    The first results reproduce familiar commercial and technical motives. Additional results reveal concerns with new practical and ethical standards as a distinctive motive for constructing data science.
    The article provides a sensibility for local meaning in usually abstract datasets and a heuristic for navigating increasingly abundant datasets toward surprising insights. For data scientists, it offers a guide for positioning themselves vis-à-vis others to navigate their professional future.

  • Article type: Journal Article
    Background: Fold change is a common metric in biomedical research for quantifying group differences in omics variables. However, inconsistent calculation methods and inadequate reporting lead to discrepancies in results. This study evaluated various fold-change calculation methods with the aim of recommending a preferred approach.
    Methods: The primary distinction among fold-change calculations lies in how the group expected values used in the log-ratio computation are defined. To challenge the interchangeability of methods in a "stress test" scenario, we generated diverse artificial data sets with varying distributions (identity, uniform, normal, log-normal, and a mixture of these) and compared the calculated fold changes to known values. Additionally, we analyzed a multi-omics biomedical data set to estimate to what extent the findings apply to real-world data.
    Results: Using arithmetic means as expected values for the treatment and reference groups yielded inaccurate fold-change values more frequently than other methods, particularly when subgroup distributions and/or standard deviations differed significantly.
    Conclusions: The arithmetic mean method, often perceived as standard or picked without considering alternatives, is inferior to other definitions of the group expected value. Methods using the median, the geometric mean, or paired fold-change combinations are more robust against violations of equal variances or dissimilar group distributions. Adhering to methods that are less sensitive to the data distribution, which involves no trade-offs, and accurately reporting the calculation method in scientific reports is a reasonable practice to ensure correct interpretation and reproducibility.
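The dependence of the fold change on the chosen group expected value can be illustrated with a small sketch. The measurements below are hypothetical, and the function name `log2_fold_change` is introduced here only for illustration; it is not taken from the paper. The paired fold-change combinations mentioned in the abstract are not covered.

```python
import math
import statistics

def log2_fold_change(treatment, reference, expected=statistics.mean):
    """log2 ratio of the group expected values.

    `expected` selects the estimator of the group expected value
    (arithmetic mean, median, geometric mean, ...). Values must be positive.
    """
    return math.log2(expected(treatment) / expected(reference))

# Hypothetical positive-valued measurements; the treatment group
# contains one extreme value (64.0).
reference = [1.0, 2.0, 4.0]
treatment = [2.0, 4.0, 8.0, 64.0]

fc_mean = log2_fold_change(treatment, reference, statistics.mean)
fc_median = log2_fold_change(treatment, reference, statistics.median)
fc_geo = log2_fold_change(treatment, reference, statistics.geometric_mean)

print(f"arithmetic mean: {fc_mean:.3f}")
print(f"median:          {fc_median:.3f}")
print(f"geometric mean:  {fc_geo:.3f}")
```

With these toy numbers the arithmetic mean is pulled upward by the single extreme value, so it reports a larger fold change than the median or the geometric mean. This only illustrates the sensitivity the abstract describes; it does not reproduce the study's simulations.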

  • Article type: Journal Article
    Current research in nephrology is increasingly focused on elucidating the complexity inherent in tightly interwoven molecular systems and their correlation with pathology and related therapeutics, including dialysis and renal transplantation. Rapid advances in the omics sciences, medical device sensorization, and networked digital medical devices have made such research increasingly data centered. Data-centric science requires the support of computationally powerful and sophisticated tools able to handle the overflow of novel biomarkers and therapeutic targets. This is a context in which artificial intelligence (AI) and, more specifically, machine learning (ML) can provide a clear analytical advantage, given the rapid advances in their ability to harness multimodal data, from genomic information to signals, images, and even heterogeneous electronic health records (EHRs). Paradoxically, however, only a small fraction of ML-based medical decision support systems undergo validation and demonstrate clinical usefulness. To effectively translate all this new knowledge into clinical practice, the development of clinically compliant support systems based on interpretable and explainable ML methods and clear analytical strategies for personalized medicine is imperative. Intelligent nephrology, that is, the design and development of AI-based strategies for a data-centric approach to nephrology, is just taking its first steps and is by no means close to its coming of age. These first steps are not even being taken evenly, as a digital divide in access to technology has become evident between developed and developing countries, also affecting underrepresented minorities. With all this in mind, this editorial aims to provide a selective overview of the current use of AI technologies in nephrology and heralds the "Artificial Intelligence in Nephrology" special issue launched by BMC Nephrology.

  • Article type: Journal Article
    Increasing and substantial reliance on electronic health records (EHRs) and data types (ie, diagnosis, medication, and laboratory data) demands assessment of their data quality as a fundamental approach, especially since there is a need to identify appropriate denominator populations with chronic conditions, such as type 2 diabetes (T2D), using commonly available computable phenotype definitions (ie, phenotypes).
    To bridge this gap, our study aims to assess how issues of EHR data quality, together with variations and robustness (or lack thereof) in phenotypes, may affect the identification of denominator populations.
    Approximately 208,000 patients with T2D were included in our study, which used retrospective EHR data from the Johns Hopkins Medical Institution (JHMI) during 2017-2019. Our assessment included 4 published phenotypes and 1 definition from a panel of experts at Hopkins. We conducted descriptive analyses of demographics (ie, age, sex, race, and ethnicity), use of health care (inpatient and emergency room visits), and the average Charlson Comorbidity Index score of each phenotype. We then used different methods to induce or simulate data quality issues of completeness, accuracy, and timeliness separately across each phenotype. For induced data incompleteness, our model randomly dropped diagnosis, medication, and laboratory codes independently at increments of 10%; for induced data inaccuracy, our model randomly replaced a diagnosis or medication code with another code of the same data type and induced 2% incremental change from -100% to +10% in laboratory result values; and lastly, for timeliness, data were modeled for induced incremental shift of date records by 30 days to 365 days.
    Less than a quarter (n=47,326, 23%) of the population overlapped across all phenotypes using EHRs. The population identified by each phenotype varied across all combinations of data types. Induced incompleteness identified fewer patients with each increment; for example, at 100% diagnostic incompleteness, the Chronic Conditions Data Warehouse phenotype identified zero patients, as its phenotypic characteristics included only diagnosis codes. Induced inaccuracy and timeliness similarly demonstrated variations in the performance of each phenotype, resulting in fewer patients being identified with each incremental change.
    We used EHR data with diagnosis, medication, and laboratory data types from a large tertiary hospital system to understand T2D phenotypic differences and performance. We used induced data quality methods to learn how data quality issues may impact identification of the denominator populations upon which clinical (eg, clinical research and trials, population health evaluations) and financial or operational decisions are made. The novel results from our study may inform future approaches to shaping a common T2D computable phenotype definition that can be applied to clinical informatics, managing chronic conditions, and additional industry-wide efforts in health care.
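The induced-incompleteness procedure described in the methods (randomly dropping codes in 10% increments and re-running a phenotype) can be sketched as follows. This is a toy illustration under stated assumptions, not the study's model: the records are invented, the phenotype is a hypothetical diagnosis-only rule (any ICD-10 code beginning with E11, the type 2 diabetes category), and the helper names are made up for this example.

```python
import random

def drop_records(records, fraction, rng):
    """Return `records` with round(fraction * len(records)) entries removed at random."""
    n_drop = round(fraction * len(records))
    dropped = set(rng.sample(range(len(records)), n_drop))
    return [rec for i, rec in enumerate(records) if i not in dropped]

def identify_t2d(records):
    """Hypothetical diagnosis-only phenotype: patients with any E11.* code."""
    return {patient for patient, code in records if code.startswith("E11")}

# Invented (patient_id, diagnosis_code) records.
records = [
    ("p1", "E11.9"), ("p1", "I10"),
    ("p2", "E11.65"), ("p2", "E11.9"), ("p2", "N18.3"),
    ("p3", "I10"), ("p3", "J45.909"),
]

rng = random.Random(0)  # fixed seed so the run is repeatable
identified = {}
for pct in range(0, 101, 10):
    degraded = drop_records(records, pct / 100, rng)
    identified[pct] = len(identify_t2d(degraded))

for pct, count in identified.items():
    print(f"{pct:3d}% of diagnosis codes dropped -> {count} patients identified")
```

At 0% the phenotype finds both patients with an E11 code, and at 100% it finds none, mirroring the abstract's observation that a diagnosis-only phenotype identifies zero patients once all diagnosis codes are missing.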

  • Article type: Journal Article
    The rapidly growing scale and variety of biomedical data repositories raise important privacy concerns. Conventional frameworks for collecting and sharing human subject data offer limited privacy protection, often necessitating the creation of data silos. Privacy-enhancing technologies (PETs) promise to safeguard these data and broaden their usage by providing means to share and analyze sensitive data while protecting privacy. Here, we review prominent PETs and illustrate their role in advancing biomedicine. We describe key use cases of PETs and their latest technical advances and highlight recent applications of PETs in a range of biomedical domains. We conclude by discussing outstanding challenges and social considerations that need to be addressed to facilitate a broader adoption of PETs in biomedical data science.

  • Article type: Journal Article
    Collaborative quantitative scientists, including biostatisticians, epidemiologists, bio-informaticists, and data-related professionals, play vital roles in research, from study design to data analysis and dissemination. It is imperative that academic health care centers (AHCs) establish an environment that provides opportunities for the quantitative scientists who are hired as staff to develop and advance their careers. With the rapid growth of clinical and translational research, AHCs are charged with establishing organizational methods, training tools, best practices, and guidelines to accelerate and support hiring, training, and retaining this staff workforce. This paper describes three essential elements for building and maintaining a successful unit of collaborative staff quantitative scientists in academic health care centers: (1) organizational infrastructure and management, (2) recruitment, and (3) career development and retention. Specific strategies are provided as examples of how AHCs can excel in these areas.

  • Article type: Journal Article
    OBJECTIVE: This project aims to determine the feasibility of predicting future critical care bed availability using data-driven computational forecast modelling and routinely collected hospital bed management data.
    METHODS: In this proof-of-concept, single-centre data informatics feasibility study, regression-based and classification data science techniques were applied retrospectively to prospectively collected routine hospital-wide bed management data to forecast critical care bed capacity. The availability of at least one critical care bed was forecast using forecast horizons of 1, 7 and 14 days in advance.
    RESULTS: We demonstrated for the first time the feasibility of forecasting critical care bed capacity without requiring detailed patient-level data, using only routinely collected hospital bed management data and interpretable models. Predictive performance for bed availability 1 day in the future was better than for 14 days (mean absolute error 1.33 vs 1.61 and area under the curve 0.78 vs 0.73, respectively). By analysing feature importance, we demonstrated that the models relied mainly on critical care and temporal data rather than data from other wards in the hospital.
    CONCLUSIONS: Our data-driven forecasting tool only required hospital bed management data to forecast critical care bed availability. This novel approach means no patient-sensitive data are required in the modelling and warrants further work to refine this approach for future bed availability forecasts in other hospital wards.
    CONCLUSIONS: Data-driven critical care bed availability prediction was possible. Further investigations into its utility in multicentre critical care settings or in other clinical settings are warranted.
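The two metrics reported in the results, mean absolute error (MAE) and area under the ROC curve (AUC), can be computed as in the sketch below. The bed counts and forecast probabilities are invented, and pairing MAE with a forecast of the number of free beds and AUC with the "at least one bed available" classification is an assumption about how the two metrics map onto the regression and classification models.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between observed and forecast values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Probability that a randomly chosen positive case receives a higher
    score than a randomly chosen negative case; ties count as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Invented daily observations: free critical care beds, observed vs forecast.
free_beds_actual = [2, 0, 1, 3]
free_beds_forecast = [1, 1, 1, 4]
mae = mean_absolute_error(free_beds_actual, free_beds_forecast)

# Binary target: was at least one bed available that day?
bed_available = [1 if beds >= 1 else 0 for beds in free_beds_actual]
forecast_probability = [0.9, 0.6, 0.4, 0.8]  # invented classifier outputs
area = auc(bed_available, forecast_probability)

print(f"MAE: {mae:.2f} beds")
print(f"AUC: {area:.2f}")
```

A lower MAE and a higher AUC both indicate better forecasts, which is why the 1-day horizon (1.33 and 0.78) is described as outperforming the 14-day horizon (1.61 and 0.73).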