Agglomerative clustering

  • Article type: Journal Article
    Objective. The single-isocenter-multiple-target technique for stereotactic radiosurgery (SRS) can reduce treatment duration but risks compromised dose coverage due to potential rotational errors. Clustering targets into two groups can reduce isocenter-target distances, mitigating the impact of rotational uncertainty. However, a comprehensive evaluation of clustering algorithms for SRS is absent. This study addresses this gap by introducing the SRS Target Clustering Framework (Framework), a tool that uses common clustering algorithms to generate efficient cluster configurations.
    Approach. The Framework incorporates four distinct optimization objectives based on two key metrics: the isocenter-target distance and the ratio of this distance to the target radius. Agglomerative and weighted agglomerative clustering are employed for the minimax and weighted minimax objectives, respectively. K-means and weighted k-means are used for the sum-of-squares and weighted sum-of-squares objectives, respectively. We applied the Framework to 126 SRS plans, comparing results to ground truth solutions obtained through a brute force algorithm.
    Main results. For the minimax objective, the average maximum isocenter-target distance from agglomerative clustering (4.8 cm) was slightly higher than the ground truth (4.6 cm). Similarly, weighted agglomerative clustering achieved an average maximum ratio of 15.1 compared to the ground truth of 14.6. Notably, both k-means and weighted k-means clustering showed close agreement (within a precision of 0.1) with the ground truth for average root-mean-square target-isocenter distance and ratio (3.6 cm and 11.1, respectively).
    Significance. These results demonstrate the Framework's effectiveness in generating clusters for SRS targets. The proposed approach has the potential to become a valuable tool in SRS treatment planning. Furthermore, this study is the first to investigate clustering algorithms for minimizing both maximum and sum-of-squares uncertainty in SRS.
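    The comparison between a clustering heuristic and a brute-force ground truth can be illustrated with a minimal sketch. It covers only the unweighted sum-of-squares objective and assumes that each group's isocenter sits at the centroid of its targets; the function names, the farthest-pair initialization, and the coordinates in the usage note are illustrative assumptions, not the Framework's actual implementation.

```python
from itertools import combinations
import math


def centroid(points):
    """Mean position of a non-empty list of 3D points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def sum_sq(points):
    """Sum of squared distances from each point to the group centroid
    (the centroid is the ASSUMED isocenter position)."""
    c = centroid(points)
    return sum(math.dist(p, c) ** 2 for p in points)


def brute_force_two_groups(targets):
    """Ground truth: try every split into two non-empty groups.

    Target 0 is pinned to group A so mirrored splits are not repeated.
    Returns (cost, indices_of_group_a, indices_of_group_b).
    """
    n = len(targets)
    best = None
    for r in range(n - 1):  # number of extra members joining target 0
        for extra in combinations(range(1, n), r):
            a = {0, *extra}
            b = set(range(n)) - a
            cost = (sum_sq([targets[i] for i in a])
                    + sum_sq([targets[i] for i in b]))
            if best is None or cost < best[0]:
                best = (cost, sorted(a), sorted(b))
    return best


def two_means(targets, iters=100):
    """Lloyd's k-means with k=2, seeded with the two farthest targets.

    Needs at least two targets. Returns (labels, centers).
    """
    i0, j0 = max(combinations(range(len(targets)), 2),
                 key=lambda ij: math.dist(targets[ij[0]], targets[ij[1]]))
    centers = [targets[i0], targets[j0]]
    labels = None
    for _ in range(iters):
        new_labels = [
            0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
            for p in targets
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for k in (0, 1):
            members = [p for p, lab in zip(targets, labels) if lab == k]
            if members:
                centers[k] = centroid(members)
    return labels, centers
```

    For a handful of hypothetical target positions such as `[(0, 0, 0), (1, 0, 0), (0, 1, 0), (8, 8, 0), (9, 8, 0)]`, both functions can be called on the same list and their groupings compared. Brute force enumerates 2^(n-1) - 1 splits, so it is only practical for small target counts.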






  • Article type: Journal Article
    BACKGROUND: The utilization of artificial intelligence (AI) technologies in the biomedical field has attracted increasing attention in recent decades. Studying how past AI technologies have found their way into medicine over time can help to predict which current (and future) AI technologies have the potential to be utilized in medicine in the coming years, thereby providing a helpful reference for future research directions.
    OBJECTIVE: The aim of this study was to predict the future trend of AI technologies used in different biomedical domains based on past trends of related technologies and biomedical domains.
    METHODS: We collected a large corpus of articles from the PubMed database pertaining to the intersection of AI and biomedicine. Initially, we attempted to use regression on the extracted keywords alone; however, we found that this approach did not provide sufficient information. Therefore, we propose a method called "background-enhanced prediction" to expand the knowledge utilized by the regression algorithm by incorporating both the keywords and their surrounding context. This method of data construction resulted in improved performance across the six regression models evaluated. Our findings were confirmed through experiments on recurrent prediction and forecasting.
    RESULTS: In our analysis using background information for prediction, we found that a window size of 3 yielded the best results, outperforming the use of keywords alone. Furthermore, utilizing only data prior to 2017, our regression projections for the period 2017-2021 exhibited a high coefficient of determination (R²), which reached up to 0.78, demonstrating the effectiveness of our method in predicting long-term trends. Based on the prediction, studies related to proteins and tumors will be pushed out of the top 20 and replaced by early diagnostics, tomography, and other detection technologies. These areas are well suited to incorporating AI technology. Deep learning, machine learning, and neural networks continue to be the dominant AI technologies in biomedical applications. Generative adversarial networks represent an emerging technology with a strong growth trend.
    CONCLUSIONS: In this study, we explored AI trends in the biomedical field and developed a predictive model to forecast future trends. Our findings were confirmed through experiments on current trends.
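    The coefficient of determination reported above is the standard R² = 1 - SS_res / SS_tot. The sketch below shows that formula only; the function name is an assumption, and it does not reproduce the study's regression models or data.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    y_true must not be constant, otherwise SS_tot is zero.
    """
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```

    A value of 1 means the projections match the observed trend exactly, while a value near 0 means they explain no more variance than predicting the mean.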






  • Article type: Journal Article
    Environmental DNA (eDNA) technology has revolutionized biomonitoring, but challenges remain regarding water sample processing. The passive eDNA sampler (PEDS) represents a viable alternative to active, water filtration-based eDNA enrichment methods, but the effectiveness of PEDS for surveying biodiverse and complex natural water bodies is unknown. Here, we collected eDNA using filtration and glass fiber filter-based PEDS (submerged in water for 1 d) from 27 sites along the final reach of the Yangtze River and the coast of the Yellow Sea, followed by eDNA metabarcoding analysis of fish biodiversity and quantitative PCR (qPCR) for a critically endangered aquatic mammal, the Yangtze finless porpoise. We ultimately detected 98 fish species via eDNA metabarcoding. Both eDNA sampling methods captured comparable local species richness and revealed largely similar spatial variation in fish assemblages and community partitions between the river and sea sites. Notably, the Yangtze finless porpoise was detected only in the metabarcoding of eDNA collected by PEDS at five sites. Also, species-specific qPCR revealed that the PEDS captured porpoise eDNA at more sites (7 vs. 2), in greater quantities, and with a higher detection probability (0.803 vs. 0.407) than did filtration. Our results demonstrate the capacity of PEDS for surveying fish biodiversity, and support that continuous eDNA collection by PEDS can be more effective than instantaneous water sampling at capturing low abundance and ephemeral species in natural waters. Thus, the PEDS approach can facilitate more efficient and convenient eDNA-based biodiversity surveillance and rare species detection.






  • Article type: Journal Article
    The worldwide spread of the novel coronavirus originating from Wuhan, China led to the ongoing COVID-19 pandemic. Being contagious, the disease spread rapidly in India through people with travel histories to the affected countries and through their contacts who tested positive. Millions of people across all states and union territories (UTs) were affected, leading to serious respiratory illness and deaths. In the present study, two unsupervised clustering algorithms, namely k-means clustering and hierarchical agglomerative clustering, are applied to a COVID-19 dataset in order to group the Indian states/UTs based on the effect of the pandemic and the vaccination program over the period from March 2020 to early June 2021. The aim of the study is to observe the plight of each state and UT of India in combating the novel coronavirus infection and to monitor their vaccination status. The study will be helpful to the government and to the frontline workers striving to restrict the transmission of the virus in India. The results will also provide a source of information for future research regarding the COVID-19 pandemic in India.






  • Article type: Journal Article
    Next-generation sequencing technologies have transformed our understanding of immunoglobulin (Ig) profiles in various immune states. Clonotyping, which groups Ig sequences into B cell clones, is crucial for investigating the diversity of repertoires and changes in antigen exposure. Despite its importance, there is no widely accepted method for clonotyping, and existing methods are computationally intensive for large sequencing datasets.
    To address this challenge, we introduce YClon, a fast and efficient approach for clonotyping Ig repertoire data. YClon uses a hierarchical clustering approach, similar to other methods, to group Ig sequences into B cell clones in a highly sensitive and specific manner. Notably, our approach outperforms other methods by being more than 30 to 5000 times faster in processing the repertoires analyzed. Astonishingly, YClon can effortlessly handle up to 2 million Ig sequences on a standard laptop computer. This enables in-depth analysis of large and numerous antibody repertoires.
    YClon was implemented in Python3 and is freely available on GitHub.






  • Article type: Journal Article
    There is growing interest in how data-driven approaches can help understand individual differences in face identity processing (FIP). However, researchers employ various FIP tests interchangeably, and it is unclear whether these tests 1) measure the same underlying ability/ies and processes (e.g., confirmation of identity match or elimination of identity match), 2) are reliable, and 3) provide consistent performance for individuals across tests online and in the laboratory. Together these factors would influence the outcomes of data-driven analyses. Here, we asked 211 participants to perform eight tests frequently reported in the literature. We used Principal Component Analysis and Agglomerative Clustering to determine factors underpinning performance. Importantly, we examined the reliability of these tests and the relationships between them, and quantified participant consistency across tests. Our findings show that participants' performance can be split into two factors (called here confirmation and elimination of an identity match) and that participants cluster according to whether they are strong on one of the factors or equally strong on both. We found that the reliability of these tests is at best moderate, the correlations between them are weak, and the consistency in participant performance across tests is low. Developing reliable and valid measures of FIP and consistently scrutinising existing ones will be key for drawing meaningful conclusions from data-driven studies.






  • Article type: Journal Article
    BACKGROUND: The ability to compare RNA secondary structures is important for understanding their biological function and for grouping similar organisms into families by looking at evolutionarily conserved sequences such as 16S rRNA. Most comparison methods and benchmarks in the literature focus on pseudoknot-free structures due to the difficulty of mapping pseudoknots in classical tree representations. Some approaches permit clustering of pseudoknotted RNAs, but there is no general framework for evaluating their performance.
    RESULTS: We introduce an evaluation framework based on a similarity/dissimilarity measure obtained by a comparison method and agglomerative clustering. Their combination automatically partitions a set of molecules into groups. To illustrate the framework, we define and make available a benchmark of pseudoknotted (16S and 23S) and pseudoknot-free (5S) rRNA secondary structures belonging to Archaea, Bacteria and Eukaryota. We also consider five different comparison methods from the literature that are able to manage pseudoknots. For each method, we cluster the molecules in the benchmark to obtain the taxa at the phylum rank according to the European Nucleotide Archive curated taxonomy. We compute appropriate metrics for each method and compare their suitability for reconstructing the taxa.
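    The idea of partitioning molecules from a precomputed dissimilarity matrix and then scoring the partition against a reference taxonomy can be sketched as follows. This is an illustration under stated assumptions: the abstract names neither the linkage criterion nor the metrics, so average linkage and cluster purity are stand-ins chosen here, and the function names are hypothetical.

```python
from collections import Counter


def agglomerate(dist, n_clusters):
    """Agglomerative clustering on a symmetric dissimilarity matrix.

    Uses average linkage (an ASSUMPTION). Returns a list of clusters,
    each a sorted list of item indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Mean pairwise dissimilarity between the two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        _, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


def purity(clusters, labels):
    """Fraction of items whose cluster's majority label matches their own
    (a stand-in metric; the source does not specify which metrics it uses)."""
    hits = sum(
        Counter(labels[i] for i in c).most_common(1)[0][1] for c in clusters
    )
    return hits / len(labels)
```

    Any comparison method that yields a pairwise dissimilarity can be plugged in by filling `dist`, which is what makes this kind of evaluation method-agnostic.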






  • Article type: Journal Article
    The aim of the present study was to investigate whether the presence of anterior cruciate ligament (ACL) injury risk factors depicted in the laboratory would reflect at-risk patterns in football-specific field data. Twenty-four female footballers (14.9 ± 0.9 years) performed unanticipated cutting maneuvers in a laboratory setting and on the football pitch during football-specific exercises (F-EX) and games (F-GAME). Knee joint moments were collected in the laboratory and grouped using hierarchical agglomerative clustering. The clusters were used to investigate the kinematics collected on the field through wearable sensors. Three clusters emerged: Cluster 1 presented the lowest knee moments; Cluster 2 presented high knee extension but low knee abduction and rotation moments; Cluster 3 presented the highest knee abduction, extension, and external rotation moments. In F-EX, greater knee abduction angles were found in Clusters 2 and 3 compared to Cluster 1 (p = 0.007). Cluster 2 showed the lowest knee and hip flexion angles (p < 0.013). Cluster 3 showed the greatest hip external rotation angles (p = 0.006). In F-GAME, Cluster 3 presented the greatest knee external rotation and lowest knee flexion angles (p = 0.003). Clinically relevant differences towards ACL injury identified in the laboratory reflected at-risk patterns only in part when cutting on the field: in the field, low-risk players exhibited kinematic patterns similar to those of the high-risk players. Therefore, in-lab injury risk screening may lack ecological validity.






  • Article type: Journal Article
    This work proposes a machine learning-based phylogenetic tree generation model based on agglomerative clustering (PTGAC) that compares protein sequences considering all known chemical properties of amino acids. The proposed model can serve as a suitable alternative to the Unweighted Pair Group Method with Arithmetic Mean (UPGMA), which is inherently time-consuming. Initially, principal component analysis (PCA) is used in the proposed scheme to reduce the dimensions of 20 amino acids using seven known chemical characteristics, yielding 20 TP (Total Points) values for each amino acid. Cumulative summing is then used to give a non-degenerate numeric representation of the sequences based on these 20 TP values. A special kind of three-component vector is proposed as a descriptor, which consists of a new type of non-central moment of orders one, two, and three. Subsequently, the proposed model uses Euclidean distance measures among the descriptors to create a distance matrix. Finally, a phylogenetic tree is constructed using hierarchical agglomerative clustering based on the distance matrix. The results are compared with UPGMA and other existing methods in terms of the quality of the phylogenetic tree and the time taken to construct it. Both qualitative and quantitative analyses are performed as key assessment criteria for analyzing the performance of the proposed model. The qualitative analysis of the phylogenetic tree is performed by considering rationalized perception, while the quantitative analysis is performed based on symmetric distance (SD). On both criteria, the results obtained by the proposed model are more satisfactory than those produced earlier on the same species by other methods. Notably, this method is found to be efficient in terms of both time and space requirements and is capable of dealing with protein sequences of varying lengths.
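    The pipeline from sequence to distance matrix can be sketched roughly as below. Several parts are assumptions: the `TP` values are placeholders rather than the PCA-derived values from the paper, only four residues are included, and ordinary raw moments stand in for the paper's "new type of non-central moment", whose definition the abstract does not give.

```python
from itertools import accumulate
import math

# PLACEHOLDER per-residue scores; NOT the TP values derived in the paper.
TP = {"A": 0.1, "C": 0.5, "G": -0.2, "K": 0.9}


def descriptor(seq):
    """Three-component descriptor of a protein sequence.

    The sequence is mapped to a cumulative sum of residue scores, then
    summarised by raw moments of orders 1, 2 and 3 (an ASSUMED stand-in
    for the paper's moment definition). Works for any sequence length >= 1.
    """
    walk = list(accumulate(TP[r] for r in seq))
    n = len(walk)
    return tuple(sum(v ** k for v in walk) / n for k in (1, 2, 3))


def distance_matrix(seqs):
    """Pairwise Euclidean distances between sequence descriptors."""
    descs = [descriptor(s) for s in seqs]
    return [[math.dist(a, b) for b in descs] for a in descs]
```

    Because every sequence is reduced to a fixed-length vector, sequences of different lengths can be compared without alignment, and the resulting matrix can be passed to any hierarchical agglomerative clustering routine to build the tree.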






  • Article type: Journal Article
    The quadratic minimum spanning tree problem (QMSTP) is a spanning tree optimization problem that considers the interaction cost between pairs of edges and arises in a number of practical scenarios. This problem is NP-hard, and therefore no polynomial-time algorithm is known for solving it. To find a close-to-optimal solution to the problem in a reasonable time, we present for the first time a clustering-enhanced memetic algorithm (CMA) that combines four components: (i) population initialization with a clustering mechanism, (ii) a tabu-based nearby exploration phase to search nearby local optima in a restricted area, (iii) a three-parent combination operator to generate promising offspring solutions, and (iv) a mutation operator using the Lévy distribution to prevent the population from premature convergence. Computational experiments are carried out on 36 benchmark instances from 3 standard sets, and the results show that the proposed algorithm is competitive with the state-of-the-art approaches. In particular, it reports improved upper bounds for the 25 most challenging instances with unproven optimal solutions, while matching the best-known results for all but 2 of the remaining instances. Additional analysis highlights the contribution of the clustering mechanism and combination operator to the performance of the algorithm.
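    The objective that such a search evaluates for each candidate tree can be sketched as below. The data layout is an assumption made for illustration: edges are `(u, v)` tuples, `linear` maps an edge to its own cost, and `quadratic` maps an unordered pair of edges (a `frozenset`) to their interaction cost, counted once per pair. The memetic algorithm itself is not reproduced.

```python
from itertools import combinations


def is_spanning_tree(n, edges):
    """True if `edges` forms a spanning tree on vertices 0..n-1.

    Uses union-find: n-1 edges with no cycle imply a spanning tree.
    """
    if len(edges) != n - 1:
        return False
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:  # edge would close a cycle
            return False
        parent[ru] = rv
    return True


def qmst_cost(edges, linear, quadratic):
    """Linear edge costs plus interaction costs over unordered edge pairs.

    Pairs missing from `quadratic` are treated as zero interaction.
    """
    total = sum(linear[e] for e in edges)
    total += sum(
        quadratic.get(frozenset((e, f)), 0) for e, f in combinations(edges, 2)
    )
    return total
```

    The pairwise term is what separates this problem from the ordinary minimum spanning tree: a tree made of individually cheap edges can still be expensive if those edges interact badly, so greedy edge selection no longer guarantees an optimum.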





