unsupervised machine learning

  • Article type: Journal Article
    BACKGROUND: Colorectal cancer (CRC) is a global public health problem. There are strong indications that nutrition could be an important component of primary prevention. Dietary pattern analysis is a powerful technique for understanding the relationship between diet and cancer, which varies across populations.
    OBJECTIVE: We used an unsupervised machine learning approach to cluster Moroccan dietary patterns associated with CRC.
    METHODS: The study was based on the reported dietary intake of 1483 matched pairs of CRC cases and controls. Baseline dietary intake was measured using a validated food-frequency questionnaire adapted to the Moroccan context. Food items were consolidated into 30 food groups, which were reduced to 6 dimensions by principal component analysis (PCA).
    RESULTS: The k-means method, applied in the PCA subspace, identified two patterns: a 'prudent pattern' (moderate consumption of almost all foods with a slight increase in fruits and vegetables) and a 'dangerous pattern' (vegetable oil, cake, chocolate, cheese, red meat, sugar and butter), with small variation between components and clusters. Student's t-test showed a significant relationship between cluster membership and consumption of every food group except poultry. Simple logistic regression showed that people who belong to the 'dangerous pattern' have a higher risk of developing CRC, with an OR of 1.59, 95% CI (1.37 to 1.38).
    CONCLUSIONS: The proposed algorithm applied to the CCR Nutrition database identified two dietary profiles associated with CRC: the 'dangerous pattern' and the 'prudent pattern'. The results of this study could contribute to recommendations for a CRC-preventive diet in the Moroccan population.
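
    As a rough illustration of how an odds ratio and its confidence interval relate to cluster membership, the sketch below computes the cross-product odds ratio and a Wald 95% interval from a 2x2 table. For a logistic regression with a single binary predictor, the fitted odds ratio equals this cross-product ratio. The function name and the counts are hypothetical and are not taken from the study; the study's own model may differ (for example, by accounting for matching).

```python
import math


def odds_ratio_wald_ci(a, b, c, d, z=1.959964):
    """Odds ratio and Wald confidence interval from a 2x2 table.

    a: cases in the exposed group      b: controls in the exposed group
    c: cases in the unexposed group    d: controls in the unexposed group
    z: normal quantile (1.959964 gives a 95% interval)
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) (Woolf's formula).
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, for illustration only.
or_, lo, hi = odds_ratio_wald_ci(a=60, b=40, c=40, d=60)
print(f"OR = {or_:.2f}, 95% CI ({lo:.2f} to {hi:.2f})")
```

    Because the interval is built symmetrically around log(OR), an interval computed this way always contains the point estimate.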






  • Article type: Journal Article
    Social media discourse has become a key data source for understanding the public's perceptions and sentiments during a public health crisis. However, because platforms occupy different niches in information exchange, relying on a single platform gives an incomplete picture of public opinion. Drawing on schema theory, this study proposes a 'social media platform schema' to describe users' differing expectations based on their previous use of a platform, and argues that a platform's distinct characteristics foster a distinct platform schema and, in turn, a distinct nature of information. We analyzed COVID-19 vaccine side effect-related discussions from Twitter, Reddit, and YouTube, each of which represents a different type of platform, and found thematic and emotional differences across platforms. Thematic analysis using the k-means clustering algorithm identified seven clusters on each platform. To computationally group and contrast thematic clusters across platforms, we employed modularity analysis using the Louvain algorithm to determine a semantic network structure based on themes. We also observed differences in emotional contexts across platforms. Theoretical and public health implications are then discussed.
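
    The Louvain algorithm searches for a partition of a network that maximizes modularity. The sketch below only evaluates that objective for a given partition of an undirected, unweighted graph; it is not the Louvain search itself, and the node names and edges are hypothetical stand-ins for thematic clusters linked by semantic similarity.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once
    community: dict mapping each node to its community label
    """
    m = len(edges)
    intra_edges = defaultdict(int)  # edges with both ends in the community
    degree_sum = defaultdict(int)   # total degree of nodes in the community
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            intra_edges[community[u]] += 1
    # Q = sum over communities of (L_c / m) - (d_c / 2m)^2
    return sum(
        intra_edges[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Hypothetical network: two triangles of themes joined by one bridging edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
print(round(modularity(edges, two_groups), 4))  # 0.3571
```

    Putting every node into a single community gives Q = 0, which is why a positive value indicates more within-group edges than expected by chance.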






  • Article type: Journal Article
    The stability of biopharmaceutical therapeutics over the storage period/shelf life has been a challenging concern for manufacturers. A novel strategy for mapping the most suitable storage conditions for recombinant human serum albumin (rHSA) in a laboratory mixture was optimized using chromatographic data analyzed by principal component analysis (PCA). Similarity was defined using hierarchical cluster analysis, whereas separability was defined using linear discriminant analysis (LDA) models. Quantitation was performed for the rHSA peak (analyte of interest) and its degradation products, i.e., dimer, trimer, agglomerates and other degradation products. The chromatographic variables were calculated using a validated stability-indicating assay method. The chromatographic data were mapped for the above-mentioned peaks over three months at different temperatures, i.e., 20°C, 5-8°C and room temperature (25°C). PCA identified the ungrouped variable, whereas supervised mapping was done using LDA. With LDA, about 60% of the data were correctly classified, with the highest sensitivity for 25°C (Aq), 25°C and 5-8°C (Aq with 5% glucose as a stabilizer), whereas the highest specificity was observed for samples stored at 5-8°C (Aq with 5% glucose as a stabilizer).






  • Article type: Journal Article
    The massive number of diffraction images collected in a raster scan of Laue microdiffraction calls for fast treatment with little if any human intervention. The conventional method, which has to index diffraction patterns one by one, is laborious and can hardly give real-time feedback. In this work, a data mining protocol based on an unsupervised machine learning algorithm was proposed to segment the scanning grid quickly from the diffraction patterns without indexation. The sole parameter that had to be set was the so-called "distance threshold", which determined the number of segments. A statistics-oriented criterion was proposed to set the "distance threshold". The protocol was applied to the scanning images of a fatigued polycrystalline sample and identified several regions that deserved further study with, for instance, differential aperture X-ray microscopy. The proposed data mining protocol promises to help economize limited beamtime.
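
    The abstract does not name the clustering algorithm, so the following is only a sketch of how a single distance threshold can determine the number of segments. It assumes single-linkage grouping: two scan points fall in the same segment whenever a chain of points connects them with each step no longer than the threshold. The feature vectors are hypothetical placeholders for whatever descriptor is extracted from each diffraction image.

```python
import math


def segment_by_threshold(features, distance_threshold):
    """Single-linkage segmentation: label points by connected component.

    features: list of equal-length numeric tuples, one per scan point
    distance_threshold: maximum Euclidean distance for a direct link
    Returns a list of integer segment labels, numbered in order of appearance.
    """
    n = len(features)
    parent = list(range(n))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(features[i], features[j]) <= distance_threshold:
                parent[find(i)] = find(j)

    labels_by_root = {}
    return [labels_by_root.setdefault(find(i), len(labels_by_root))
            for i in range(n)]


# Hypothetical 2-D descriptors for five scan points.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
print(segment_by_threshold(points, 0.5))  # [0, 0, 1, 1, 2]
```

    Raising the threshold merges segments and lowering it splits them, which is the behavior the abstract attributes to its one tunable parameter.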






  • Article type: Journal Article
    Large and densely sampled sensor datasets can contain a range of complex stochastic structures that are difficult to accommodate in conventional linear models. This can confound attempts to build a more complete picture of an animal's behavior by aggregating information across multiple asynchronous sensor platforms. The Livestock Informatics Toolkit (LIT) has been developed in R to better facilitate knowledge discovery of complex behavioral patterns across Precision Livestock Farming (PLF) data streams using novel unsupervised machine learning and information theoretic approaches. The utility of this analytical pipeline is demonstrated using data from a 6-month feed trial conducted on a closed herd of 185 mixed-parity organic dairy cows. Insights into the tradeoffs between behaviors in time budgets acquired from ear tag accelerometer records were improved by augmenting conventional hierarchical clustering techniques with a novel simulation-based approach designed to mimic the complex error structures of sensor data. These simulations were then repurposed to compress the information in this data stream into robust, empirically determined encodings using a novel pruning algorithm. Nonparametric and semiparametric tests using mutual and pointwise information subsequently revealed complex nonlinear associations between encodings of overall time budgets and the order in which cows entered the parlor to be milked.
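
    To make the information-theoretic quantities concrete, the sketch below computes plug-in estimates of pointwise mutual information and mutual information between two discrete encodings. It is written in Python for illustration and does not reproduce the LIT package's R interface or its testing procedure; the function names and the toy encodings are assumptions.

```python
import math
from collections import Counter


def pointwise_mutual_information(xs, ys):
    """PMI in bits for every observed (x, y) pair, from empirical frequencies."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    # PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) )
    return {
        (x, y): math.log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in joint.items()
    }


def mutual_information(xs, ys):
    """Mutual information in bits: the p(x, y)-weighted average of PMI."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    pmi = pointwise_mutual_information(xs, ys)
    return sum((c / n) * pmi[pair] for pair, c in joint.items())


# Toy encodings: time-budget cluster vs. section of the milking queue.
budget = ["rest", "rest", "active", "active"]
queue = ["front", "front", "rear", "rear"]
print(mutual_information(budget, queue))  # 1.0
```

    Mutual information is zero when the two encodings are independent and makes no assumption that the association is linear, which is why it suits the nonlinear patterns described above.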






  • Article type: Journal Article
    Sensor technologies allow ethologists to continuously monitor the behaviors of large numbers of animals over extended periods of time. This creates new opportunities to study livestock behavior in commercial settings, but also new methodological challenges. Densely sampled behavioral data from large heterogeneous groups can contain a range of complex patterns and stochastic structures that may be difficult to visualize using conventional exploratory data analysis techniques. The goal of this research was to assess the efficacy of unsupervised machine learning tools in recovering complex behavioral patterns from such datasets to better inform subsequent statistical modeling. This methodological case study was carried out using records on milking order, or the sequence in which cows arrange themselves as they enter the milking parlor. Data were collected over a 6-month period from a closed group of 200 mixed-parity Holstein cattle on an organic dairy. Cows at the front and rear of the queue proved more consistent in their entry position than animals at the center of the queue, a systematic pattern of heterogeneity more clearly visualized using entropy estimates, a scale- and distribution-free alternative to variance that is robust to outliers. Dimension reduction techniques were then used to visualize relationships between cows. No evidence of social cohesion was recovered, but Diffusion Map embeddings proved more adept than PCA at revealing the underlying linear geometry of this data. Median parlor entry positions from the pre- and post-pasture subperiods were highly correlated (R = 0.91), suggesting a surprising degree of temporal stationarity. Data Mechanics visualizations, however, revealed heterogeneous non-stationarity among subgroups of animals in the center of the group and herd-level temporal outliers. A repeated measures model recovered inconsistent evidence of a relationship between entry position and cow attributes. Mutual conditional entropy tests, a permutation-based approach to assessing bivariate correlations that is robust to non-independence, confirmed a significant but non-linear association with peak milk yield, but revealed the age effect to be potentially confounded by health status. Finally, queueing records were related back to behaviors recorded via ear tag accelerometers using linear models and mutual conditional entropy tests. Both approaches recovered consistent evidence of differences in home pen behaviors across subsections of the queue.
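
    The idea of using entropy as a measure of how consistently a cow enters the parlor can be sketched as follows. This is a plug-in Shannon entropy over equal-width bins of the queue; the binning scheme, the number of bins, and the function name are assumptions made for illustration, not the estimator used in the study.

```python
import math
from collections import Counter


def entry_position_entropy(positions, herd_size, n_bins=10):
    """Shannon entropy (bits) of one cow's binned parlor entry positions.

    positions: 1-based entry positions observed across milkings
    herd_size: number of cows in the queue
    n_bins: number of equal-width sections the queue is divided into
    """
    bins = Counter(
        min((p - 1) * n_bins // herd_size, n_bins - 1) for p in positions
    )
    n = len(positions)
    return -sum((c / n) * math.log2(c / n) for c in bins.values())


# A cow that always enters first has zero entropy; one spread evenly
# across all ten sections of a 200-cow queue reaches log2(10) bits.
consistent = [1, 2, 1, 3, 2, 1]
scattered = [10, 30, 50, 70, 90, 110, 130, 150, 170, 190]
print(entry_position_entropy(consistent, herd_size=200))
print(entry_position_entropy(scattered, herd_size=200))
```

    Because the estimate depends only on how often each bin is visited, a single extreme observation shifts it far less than it would shift a variance.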







  • Article type: Case Reports
    Methylation profiling has become a mainstay in brain tumor diagnostics since the introduction of the first publicly available classification tool by the German Cancer Research Center in 2017. We demonstrate the capability of this system through an example of a rare case of IDH wildtype glioblastoma diagnosed in a patient previously treated for T-cell acute lymphoblastic leukemia. Our novel in-house diagnostic tool EpiDiP provided hints arguing against a radiation-induced tumor, identified a novel recurrent genetic aberration, and thus informed about a potential therapeutic target.






  • Article type: Journal Article
    There is a lack of reliable biomarkers for major depressive disorder (MDD) in clinical practice. However, several studies have shown an association between alterations in microRNA levels and MDD, albeit none of them has taken advantage of machine learning (ML).
    Supervised and unsupervised ML were applied to blood microRNA expression profiles from an MDD case-control dataset (n = 168) to distinguish between (1) case vs control status, (2) MDD severity levels defined based on the Montgomery-Asberg Depression Rating Scale, and (3) antidepressant responders vs nonresponders.
    MDD cases were distinguishable from healthy controls with an area under the receiver operating characteristic curve (AUC) of 0.97 on testing data. High- vs low-severity cases were distinguishable with an AUC of 0.63. Unsupervised clustering of patients, before supervised ML analysis of each cluster for MDD severity, improved the performance of the classifiers (AUC of 0.70 for cluster 1 and 0.76 for cluster 2). Antidepressant responders could not be successfully separated from nonresponders, even after patient stratification by unsupervised clustering. However, permutation testing of the top microRNA, identified by the ML model trained to distinguish responders vs nonresponders in each of the 2 clusters, showed an association with antidepressant response. Each of these microRNA markers was significant only when comparing responders vs nonresponders within the corresponding cluster, not in the heterogeneous unclustered patient set.
    Supervised and unsupervised ML analysis of microRNA may lead to robust biomarkers for monitoring clinical evolution and for more timely assessment of treatment in MDD patients.
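
    For readers unfamiliar with the AUC values quoted above, the sketch below computes AUC directly from its probabilistic definition: the fraction of (case, control) pairs in which the case receives the higher classifier score, with ties counted as one half. The labels and scores are made up for illustration and are not data from the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    labels: 1 for cases (positives), 0 for controls (negatives)
    scores: classifier scores, higher meaning more likely a case
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one case and one control")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up example: one of the four case/control pairs is ranked wrongly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

    An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which puts the reported 0.63 and 0.97 in context.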







  • Article type: Journal Article
    Discovering subphenotypes of complex diseases can help characterize disease cohorts for investigative studies aimed at developing better diagnoses and treatments. Recent advances in unsupervised machine learning on electronic health record (EHR) data have enabled researchers to discover phenotypes without input from domain experts. However, most existing studies have ignored time and modeled diseases as discrete events. Uncovering the evolution of phenotypes - how they emerge, evolve and contribute to health outcomes - is essential to define more precise phenotypes and refine the understanding of disease progression. Our objective was to assess the benefits of an unsupervised approach that incorporates time to model diseases as dynamic processes in phenotype discovery.
    In this study, we applied a constrained non-negative tensor-factorization approach to characterize the complexity of a cardiovascular disease (CVD) patient cohort based on longitudinal EHR data. Through tensor-factorization, we identified a set of phenotypic topics (i.e., subphenotypes) that these patients established over the 10 years prior to the diagnosis of CVD, and showed their progression patterns. For each identified subphenotype, we examined its association with the risk for adverse cardiovascular outcomes estimated by the American College of Cardiology/American Heart Association Pooled Cohort Risk Equations, a conventional CVD-risk assessment tool frequently used in clinical practice. Furthermore, we compared the subsequent myocardial infarction (MI) rates among the six most prevalent subphenotypes using survival analysis.
    From a cohort of 12,380 adult individuals with CVD and 1068 unique PheCodes, we successfully identified 14 subphenotypes. Through the association analysis with estimated CVD risk for each subtype, we found that some phenotypic topics, such as 'Vitamin D deficiency and depression' and 'Urinary infections', cannot be explained by the conventional risk factors. Through a survival analysis, we found markedly different risks of subsequent MI following the diagnosis of CVD among the six most prevalent topics (p < 0.0001), indicating these topics may capture clinically meaningful subphenotypes of CVD.
    This study demonstrates the potential benefits of using tensor-decomposition to model diseases as dynamic processes from longitudinal EHR data. Our results suggest that this data-driven approach may potentially help researchers identify complex and chronic disease subphenotypes in precision medicine research.







  • Article type: Journal Article
    Retrospective analysis of fall incident reports can uncover hidden information, identify potential risk factors, and improve healthcare quality. This study explores potential fall incident clusters using word embeddings and hierarchical clustering. Fall incident reports from 7 local hospitals in Hong Kong were catalogued into 5 potential clusters with significantly different fall severity, gender, reporting department, and keywords. This study demonstrates the feasibility of using text clustering methods to mine real-world fall incident reports.





