unsupervised machine learning

  • Article type: Journal Article
    BACKGROUND: Colorectal cancer (CRC) is a global public health problem. There is strong indication that nutrition could be an important component of primary prevention. Dietary patterns are a powerful technique for understanding the relationship between diet and cancer varying across populations.
    OBJECTIVE: We used an unsupervised machine learning approach to cluster Moroccan dietary patterns associated with CRC.
    METHODS: The study was conducted based on the reported nutrition of CRC matched cases and controls including 1483 pairs. Baseline dietary intake was measured using a validated food-frequency questionnaire adapted to the Moroccan context. Food items were consolidated into 30 food groups reduced on 6 dimensions by principal component analysis (PCA).
    RESULTS: The k-means method, applied in the PCA subspace, identified two patterns: a 'prudent pattern' (moderate consumption of almost all foods, with a slight increase in fruits and vegetables) and a 'dangerous pattern' (vegetable oil, cake, chocolate, cheese, red meat, sugar and butter), with small variation between components and clusters. Student's t-test showed a significant relationship between clusters and consumption of all foods except poultry. A simple logistic regression test showed that people belonging to the 'dangerous pattern' have a higher risk of developing CRC, with an OR of 1.59, 95% CI (1.37 to 1.38).
    CONCLUSIONS: The proposed algorithm, applied to the CRC nutrition database, identified two dietary profiles associated with CRC: the 'dangerous pattern' and the 'prudent pattern'. The results of this study could contribute to recommendations for a CRC-preventive diet in the Moroccan population.
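    The PCA-plus-k-means pipeline described in the methods can be sketched as below. This is a minimal sketch under stated assumptions: the intake matrix is synthetic stand-in data, not the study's food-frequency questionnaire, and only the dimensions (30 food groups, 6 components, 2 clusters, 1483 pairs = 2966 subjects) come from the abstract.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: 2966 subjects (1483 case-control pairs) x 30 food groups
intakes = rng.gamma(shape=2.0, scale=1.0, size=(2966, 30))

# Standardize, reduce to 6 principal components, then cluster in that subspace
scores = PCA(n_components=6).fit_transform(StandardScaler().fit_transform(intakes))
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)
```

Cluster means in the original food-group space would then be inspected to label the two patterns as 'prudent' or 'dangerous'.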

  • Article type: Journal Article
    Social media discourse has become a key data source for understanding the public's perception of, and sentiment during, a public health crisis. However, given the different niches that platforms occupy in terms of information exchange, reliance on a single platform would provide an incomplete picture of public opinion. Based on schema theory, this study proposes a 'social media platform schema' to denote users' differing expectations based on previous platform usage, and argues that a platform's distinct characteristics foster a distinct platform schema and, in turn, a distinct nature of information. We analyzed COVID-19 vaccine side-effect-related discussions from Twitter, Reddit, and YouTube, each of which represents a different type of platform, and found thematic and emotional differences across platforms. Thematic analysis using a k-means clustering algorithm identified seven clusters in each platform. To computationally group and contrast thematic clusters across platforms, we employed modularity analysis using the Louvain algorithm to determine a semantic network structure based on themes. We also observed differences in emotional contexts across platforms. Theoretical and public health implications are then discussed.
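    A minimal sketch of the thematic-clustering step. The abstract does not specify a text representation, so TF-IDF is assumed here; the toy posts are invented and k is reduced from the study's seven per platform to two for this tiny corpus.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented example posts about vaccine side effects
posts = [
    "sore arm after the vaccine",
    "arm pain for two days after the shot",
    "mild fever the next morning",
    "slight fever and chills overnight",
]

# TF-IDF representation, then k-means (k = 2 for this toy corpus)
X = TfidfVectorizer().fit_transform(posts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

The per-platform cluster labels would then feed the cross-platform modularity analysis.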

  • Article type: Journal Article
    The stability of biopharmaceutical therapeutics over the storage period/shelf life has been a challenging concern for manufacturers. A novel strategy for mapping the best and most suitable storage conditions for recombinant human serum albumin (rHSA) in a laboratory mixture was optimized using chromatographic data via principal component analysis (PCA); similarity was defined using hierarchical cluster analysis, while separability was defined using linear discriminant analysis (LDA) models. Quantitation was performed for the rHSA peak (the analyte of interest) and its degradation products, i.e., dimer, trimer, agglomerates and other degradation products. The chromatographic variables were calculated using a validated stability-indicating assay method. Chromatographic data mapping was done for the above-mentioned peaks over three months at different temperatures, i.e., 20°C, 5-8°C and room temperature (25°C). PCA identified the ungrouped variables, whereas supervised mapping was done using LDA. As an outcome of the LDA, about 60% of the data were correctly classified, with the highest sensitivity for 25°C (Aq), 25°C and 5-8°C (Aq with 5% glucose as a stabilizer), whereas the highest specificity was observed for samples stored at 5-8°C (Aq with 5% glucose as a stabilizer).
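    The supervised LDA step, in which separability of storage conditions is assessed, might look like the following sketch. The chromatographic variables and condition labels are synthetic, and scikit-learn stands in for whatever software the authors used.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
# Three storage conditions x 20 runs each, five chromatographic variables,
# with condition-specific mean shifts so the classes are partly separable
X = np.vstack([rng.normal(loc=mu, size=(20, 5)) for mu in (0.0, 1.0, 2.0)])
y = np.repeat([0, 1, 2], 20)

lda = LinearDiscriminantAnalysis().fit(X, y)
accuracy = lda.score(X, y)  # fraction of runs assigned to the right condition
```

Per-class sensitivity and specificity, as reported in the abstract, would come from the confusion matrix of `lda.predict(X)` against `y`.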

  • Article type: Journal Article
    The massive amount of diffraction images collected in a raster scan of Laue microdiffraction calls for fast treatment with little if any human intervention. The conventional method, which has to index diffraction patterns one by one, is laborious and can hardly give real-time feedback. In this work, a data mining protocol based on an unsupervised machine learning algorithm was proposed to rapidly segment the scanning grid from the diffraction patterns without indexation. The sole parameter that had to be set was the so-called "distance threshold", which determined the number of segments. A statistics-oriented criterion was proposed for setting the "distance threshold". The protocol was applied to the scanning images of a fatigued polycrystalline sample and identified several regions that deserve further study with, for instance, differential-aperture X-ray microscopy. The proposed data mining protocol promises to help economize the limited beamtime.
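    The "distance threshold" idea maps naturally onto agglomerative clustering, where a distance cutoff, rather than a preset k, determines the number of segments. A sketch under assumptions: synthetic descriptor vectors stand in for real diffraction-image features, and scikit-learn's implementation stands in for the paper's protocol.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(2)
# Two well-separated groups of synthetic diffraction-pattern descriptors
patterns = np.vstack([
    rng.normal(0.0, 0.1, size=(50, 8)),
    rng.normal(5.0, 0.1, size=(50, 8)),
])

# n_clusters is left unset; the distance threshold decides the segment count
seg = AgglomerativeClustering(n_clusters=None, distance_threshold=3.0)
labels = seg.fit_predict(patterns)
n_segments = seg.n_clusters_
```

Raising or lowering `distance_threshold` coarsens or refines the segmentation, which is what the paper's statistics-oriented criterion tunes.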

  • Article type: Journal Article
    Large and densely sampled sensor datasets can contain a range of complex stochastic structures that are difficult to accommodate in conventional linear models. This can confound attempts to build a more complete picture of an animal's behavior by aggregating information across multiple asynchronous sensor platforms. The Livestock Informatics Toolkit (LIT) has been developed in R to better facilitate knowledge discovery of complex behavioral patterns across Precision Livestock Farming (PLF) data streams using novel unsupervised machine learning and information-theoretic approaches. The utility of this analytical pipeline is demonstrated using data from a 6-month feed trial conducted on a closed herd of 185 mixed-parity organic dairy cows. Insights into the tradeoffs between behaviors in time budgets acquired from ear tag accelerometer records were improved by augmenting conventional hierarchical clustering techniques with a novel simulation-based approach designed to mimic the complex error structures of sensor data. These simulations were then repurposed to compress the information in this data stream into robust, empirically determined encodings using a novel pruning algorithm. Nonparametric and semiparametric tests using mutual and pointwise information subsequently revealed complex nonlinear associations between encodings of overall time budgets and the order in which cows entered the parlor to be milked.
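    The information-theoretic tests can be illustrated with the empirical mutual information between a discrete behavioral encoding and a binned parlor-entry position. Both sequences below are synthetic, and scikit-learn's estimator is an assumed stand-in for LIT's R implementation, not the toolkit itself.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(3)
# Synthetic discrete behavioral encoding for 185 cows, and an entry-position
# bin constructed so that it depends on the encoding
encoding = rng.integers(0, 3, size=185)
entry_bin = (encoding + rng.integers(0, 2, size=185)) % 3

mi = mutual_info_score(encoding, entry_bin)  # empirical MI in nats, >= 0
```

A permutation test, as LIT uses, would shuffle one sequence repeatedly and compare the observed MI against the null distribution of shuffled MIs.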

  • Article type: Journal Article
    Sensor technologies allow ethologists to continuously monitor the behaviors of large numbers of animals over extended periods of time. This creates new opportunities to study livestock behavior in commercial settings, but also new methodological challenges. Densely sampled behavioral data from large heterogeneous groups can contain a range of complex patterns and stochastic structures that may be difficult to visualize using conventional exploratory data analysis techniques. The goal of this research was to assess the efficacy of unsupervised machine learning tools in recovering complex behavioral patterns from such datasets to better inform subsequent statistical modeling. This methodological case study was carried out using records on milking order, or the sequence in which cows arrange themselves as they enter the milking parlor. Data were collected over a 6-month period from a closed group of 200 mixed-parity Holstein cattle on an organic dairy. Cows at the front and rear of the queue proved more consistent in their entry position than animals at the center of the queue, a systematic pattern of heterogeneity more clearly visualized using entropy estimates, a scale- and distribution-free alternative to variance that is robust to outliers. Dimension reduction techniques were then used to visualize relationships between cows. No evidence of social cohesion was recovered, but Diffusion Map embeddings proved more adept than PCA at revealing the underlying linear geometry of this data. Median parlor entry positions from the pre- and post-pasture subperiods were highly correlated (R = 0.91), suggesting a surprising degree of temporal stationarity. Data Mechanics visualizations, however, revealed heterogeneous non-stationarity among subgroups of animals in the center of the group, as well as herd-level temporal outliers. A repeated measures model recovered inconsistent evidence of a relationship between entry position and cow attributes. Mutual conditional entropy tests, a permutation-based approach to assessing bivariate correlations robust to non-independence, confirmed a significant but non-linear association with peak milk yield, but revealed the age effect to be potentially confounded by health status. Finally, queueing records were related back to behaviors recorded via ear tag accelerometers using linear models and mutual conditional entropy tests. Both approaches recovered consistent evidence of differences in home-pen behaviors across subsections of the queue.
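    The entropy comparison of queue positions can be sketched as below: a cow with a consistent entry position has a lower-entropy position distribution than an erratic one. The queue records here are simulated, not the study's data.

```python
import numpy as np
from scipy.stats import entropy

rng = np.random.default_rng(4)
# Simulated queue positions over 100 milkings: one cow strongly prefers the
# front of the queue, the other enters at a uniformly random position
consistent = rng.choice(5, size=100, p=[0.8, 0.1, 0.05, 0.03, 0.02])
erratic = rng.choice(5, size=100)

def position_entropy(positions, n_positions=5):
    """Shannon entropy (nats) of the empirical position distribution."""
    counts = np.bincount(positions, minlength=n_positions)
    return entropy(counts / counts.sum())

h_consistent = position_entropy(consistent)
h_erratic = position_entropy(erratic)
```

Unlike variance, this measure needs no assumption about the scale or shape of the position distribution, which is why it suits ordinal queue data.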

  • Article type: Case Reports
    Methylation profiling has become a mainstay in brain tumor diagnostics since the introduction of the first publicly available classification tool by the German Cancer Research Center in 2017. We demonstrate the capability of this system through an example of a rare case of IDH wildtype glioblastoma diagnosed in a patient previously treated for T-cell acute lymphoblastic leukemia. Our novel in-house diagnostic tool EpiDiP provided hints arguing against a radiation-induced tumor, identified a novel recurrent genetic aberration, and thus informed about a potential therapeutic target.

  • Article type: Journal Article
    There is a lack of reliable biomarkers for major depressive disorder (MDD) in clinical practice. However, several studies have shown an association between alterations in microRNA levels and MDD, although none of them has taken advantage of machine learning (ML).
    Supervised and unsupervised ML were applied to blood microRNA expression profiles from an MDD case-control dataset (n = 168) to distinguish between (1) case vs control status, (2) MDD severity levels defined based on the Montgomery-Asberg Depression Rating Scale, and (3) antidepressant responders vs nonresponders.
    MDD cases were distinguishable from healthy controls with an area under the receiver-operating characteristic curve (AUC) of 0.97 on testing data. High- vs low-severity cases were distinguishable with an AUC of 0.63. Unsupervised clustering of patients, before supervised ML analysis of each cluster for MDD severity, improved the performance of the classifiers (AUC of 0.70 for cluster 1 and 0.76 for cluster 2). Antidepressant responders could not be successfully separated from nonresponders, even after patient stratification by unsupervised clustering. However, permutation testing of the top microRNA, identified by the ML model trained to distinguish responders vs nonresponders in each of the 2 clusters, showed an association with antidepressant response. Each of these microRNA markers was only significant when comparing responders vs nonresponders of the corresponding cluster, but not when using the heterogeneous unclustered patient set.
    Supervised and unsupervised ML analysis of microRNA may lead to robust biomarkers for monitoring clinical evolution and for more timely assessment of treatment in MDD patients.
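    The cluster-then-classify strategy, in which unsupervised stratification precedes a supervised model per cluster scored by AUC, can be sketched as follows. The "expression" matrix and labels are synthetic, and logistic regression is an assumed stand-in for the unspecified classifier.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(5)
# 168 subjects x 20 synthetic "microRNA" features; two latent subgroups
# shifted along feature 0, with the outcome driven by feature 1
X = rng.normal(size=(168, 20))
X[:84, 0] -= 3.0
X[84:, 0] += 3.0
y = (X[:, 1] + 0.5 * rng.normal(size=168) > 0).astype(int)

# Stratify by unsupervised clustering, then fit one classifier per cluster
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
aucs = {}
for c in (0, 1):
    mask = clusters == c
    clf = LogisticRegression().fit(X[mask], y[mask])
    aucs[c] = roc_auc_score(y[mask], clf.predict_proba(X[mask])[:, 1])
```

In practice the AUCs would be computed on held-out data, as the study does, rather than on the training samples used in this toy sketch.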

  • Article type: Journal Article
    Discovering subphenotypes of complex diseases can help characterize disease cohorts for investigative studies aimed at developing better diagnoses and treatments. Recent advances in unsupervised machine learning on electronic health record (EHR) data have enabled researchers to discover phenotypes without input from domain experts. However, most existing studies have ignored time and modeled diseases as discrete events. Uncovering the evolution of phenotypes - how they emerge, evolve and contribute to health outcomes - is essential to define more precise phenotypes and refine the understanding of disease progression. Our objective was to assess the benefits of an unsupervised approach that incorporates time to model diseases as dynamic processes in phenotype discovery.
    In this study, we applied a constrained non-negative tensor factorization approach to characterize the complexity of a cardiovascular disease (CVD) patient cohort based on longitudinal EHR data. Through tensor factorization, we identified a set of phenotypic topics (i.e., subphenotypes) that these patients established over the 10 years prior to the diagnosis of CVD, and showed their progression patterns. For each identified subphenotype, we examined its association with the risk of adverse cardiovascular outcomes estimated by the American College of Cardiology/American Heart Association Pooled Cohort Risk Equations, a conventional CVD-risk assessment tool frequently used in clinical practice. Furthermore, we compared subsequent myocardial infarction (MI) rates among the six most prevalent subphenotypes using survival analysis.
    From a cohort of 12,380 adult CVD individuals with 1068 unique PheCodes, we successfully identified 14 subphenotypes. Through the association analysis with estimated CVD risk for each subphenotype, we found that some phenotypic topics, such as vitamin D deficiency, depression, and urinary infections, cannot be explained by the conventional risk factors. Through a survival analysis, we found markedly different risks of subsequent MI following the diagnosis of CVD among the six most prevalent topics (p < 0.0001), indicating these topics may capture clinically meaningful subphenotypes of CVD.
    This study demonstrates the potential benefits of using tensor decomposition to model diseases as dynamic processes from longitudinal EHR data. Our results suggest that this data-driven approach may help researchers identify complex and chronic disease subphenotypes in precision medicine research.
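    The paper's constrained non-negative tensor factorization generalizes non-negative matrix factorization (NMF) from a patient × code matrix to a patient × code × time tensor. As a 2D sketch only, scikit-learn's NMF is used below in place of the tensor method; the counts are synthetic, and only the rank of 14 mirrors the 14 subphenotypes reported.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(6)
# Synthetic patient x PheCode count matrix (200 patients, 60 codes)
counts = rng.poisson(1.0, size=(200, 60)).astype(float)

# Factorize into 14 non-negative "phenotypic topics"
model = NMF(n_components=14, init="nndsvda", max_iter=500, random_state=0)
patient_topics = model.fit_transform(counts)  # per-patient topic loadings
topic_codes = model.components_               # per-topic code weights
```

The non-negativity constraint is what makes each topic readable as a bundle of co-occurring codes; the tensor version adds a time mode so that topics carry progression patterns as well.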

  • Article type: Journal Article
    Retrospective analysis of fall incident reports can uncover hidden information, identify potential risk factors, and improve healthcare quality. This study explores potential fall incident clusters using word embeddings and hierarchical clustering. Fall incident reports from 7 local hospitals in Hong Kong were catalogued into 5 potential clusters with significantly different fall severity, gender, reporting department, and keywords. This study demonstrates the feasibility of using text clustering methods for real-world fall incident report mining.
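    The embedding-plus-hierarchical-clustering step might be sketched as below, with random vectors standing in for the report embeddings (the abstract does not name the embedding model) and the dendrogram cut into the study's 5 clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(7)
# Random vectors standing in for averaged word embeddings of 150 reports,
# drawn from five separated groups
report_vecs = np.vstack([
    rng.normal(loc=mu, scale=0.2, size=(30, 50)) for mu in range(5)
])

tree = linkage(report_vecs, method="ward")
labels = fcluster(tree, t=5, criterion="maxclust")  # cut into 5 clusters
```

Keywords per cluster, and contrasts in severity or reporting department, would then be examined group by group as the study describes.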