k-means

  • Article type: Journal Article
    Mudstones and shales serve as natural barrier rocks in various geoenergy applications. Although many studies have investigated their mechanical properties, characterizing these parameters at the microscale remains challenging because of their fine-grained nature and their susceptibility to microstructural damage introduced during sample preparation. This study investigates the micromechanical properties of the clay matrix composite in mudstones by combining high-speed nanoindentation mapping with machine learning data analysis. The nanoindentation approach effectively captured the heterogeneity in high-resolution mechanical property maps. Using k-means clustering, the mechanical characteristics of matrix clay and brittle minerals, as well as measurements on grain boundaries and structural discontinuities (e.g., cracks), were successfully distinguished. The classification results were validated through correlation with broad ion beam-scanning electron microscopy images. The resulting average reduced elastic modulus (E_r) and hardness (H) values for the clay matrix were 16.2 ± 6.2 and 0.5 ± 0.5 GPa, respectively, showing consistency across different test settings and indenter tips. Furthermore, the sensitivity of indentation measurements to various factors was investigated, revealing limited sensitivity to indentation depth and tip geometry (when comparing cube corner and Berkovich tips over a small range of indentation depths), but decreased stability at lower loading rates. Box counting and bootstrapping methods were applied to assess the representativeness of the parameters determined for the clay matrix. A relatively small dataset (60 indentations) is needed to achieve representativeness, while the main challenge is to cover a representative mapping area for clay matrix characterization. Overall, this study demonstrates the feasibility of high-speed nanoindentation mapping combined with data analysis for micromechanical characterization of the clay matrix in mudstones, paving the way for efficient analysis of similar fine-grained sedimentary rocks.
    The online version contains supplementary material available at 10.1007/s40948-024-00864-9.
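The clustering step named in the abstract above can be sketched with plain Lloyd's-algorithm k-means. This is only a minimal illustration: the function name `kmeans` is hypothetical, the (E_r, H) pairs are invented values rather than data from the study, and the abstract does not state which features, preprocessing, or number of clusters the authors used.

```python
import random


def kmeans(points, k, max_iters=100, seed=0):
    """Lloyd's algorithm for a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Invented (E_r, H) pairs in GPa: four clay-like indents, three mineral-like ones.
points = [(14.0, 0.4), (16.5, 0.5), (18.0, 0.6), (15.2, 0.5),
          (78.0, 9.5), (85.0, 11.0), (90.0, 12.0)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

In practice the features would usually be scaled before clustering, since the modulus spans a much larger numeric range than the hardness and would otherwise dominate the distance.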

  • Article type: Journal Article
    Every work environment contains different types of risks and interactions between those risks, so the method used for risk assessment is very important. Many factors determine which risk assessment method (RAM) to use, such as the types of risks in the work environment, the interactions of these risks with each other, and their distance from the employees. Although many RAMs are available, no single RAM suits all workplaces, and choosing a method is the biggest question. There is no internationally accepted scale or trend on this subject. In the study, 26 sectors, 10 different RAMs and 10 criteria were determined. A hybrid approach was designed to determine the most suitable RAMs for sectors by using two machine learning (ML) algorithms: k-means clustering and support vector machine (SVM) classification. First, the data set was divided into subsets with the k-means algorithm. Then, the SVM algorithm was run on each subset, each of which has different characteristics. Finally, the results of all subsets were combined to obtain the result for the entire dataset. Thus, instead of a single threshold value determined for one large cluster and imposed on all of its members, a flexible structure was created by determining a separate threshold value for each sub-cluster according to its characteristics. In this way, machine support was provided by selecting the most suitable RAMs for the sectors and removing the administrative and software problems of the selection phase from human staff. In the first comparison, the proposed hybrid method achieved 96.63%, compared with 90.63% for k-means and 94.68% for SVM. In the second comparison, made with five different ML algorithms, the results were 87.44% for artificial neural networks (ANN), 91.29% for naive Bayes (NB), 89.25% for decision trees (DT), 81.23% for random forest (RF) and 85.43% for k-nearest neighbours (KNN).

  • Article type: Journal Article
    BACKGROUND: Characterizing the condition of patients suffering from knee osteoarthritis is complex due to multiple associations between clinical, functional, and structural parameters. While significant variability exists within this population, especially in candidates for total knee arthroplasty, there is increasing interest in knee kinematics among orthopedic surgeons aiming for more personalized approaches to achieve better outcomes and satisfaction. The primary objective of this study was to identify distinct kinematic phenotypes in total knee arthroplasty candidates and to compare different methods for the identification of these phenotypes.
    METHODS: Three-dimensional kinematic data obtained from a Knee Kinesiography exam during treadmill walking in the clinic were used. Various aspects of the clustering process were evaluated and compared to achieve optimal clustering, including data preparation, transformation, and representation methods.
    RESULTS: A K-Means clustering algorithm, performed using Euclidean distance, combined with principal component analysis applied on data transformed by standardization, was the optimal approach. Two unique kinematic phenotypes were identified among 80 total knee arthroplasty candidates. The two distinct phenotypes divided patients who significantly differed both in terms of knee kinematic representation and clinical outcomes, including a notable variation in 63.3% of frontal plane features and 81.8% of transverse plane features across 77.33% of the gait cycle, as well as differences in the Pain Catastrophizing Scale, highlighting the impact of these kinematic variations on patient pain and function.
    CONCLUSIONS: Results from this study provide valuable insights for clinicians to develop personalized treatment approaches based on patients' phenotype affiliation, ultimately helping to improve total knee arthroplasty outcomes.

  • Article type: Journal Article
    With the increasingly widespread application of large-scale energy storage battery systems, the demand for battery safety is rising. Research on how to detect battery anomalies early and reduce the occurrence of thermal runaway (TR) accidents has become particularly important. Existing research on battery TR warning algorithms can be mainly divided into two categories: model-driven and data-driven methods. However, the common model-driven methods are often of high complexity, with poor versatility and low early warning capability; and the common data-driven methods are mostly based on neural networks, requiring substantial training costs, with better early warning capabilities but higher false alarm probabilities. To address the limitations of existing works, this paper proposes a combined data-driven and model-based algorithm for accurate battery TR warnings. Specifically, the K-Means algorithm serves as the data-driven module, capturing outliers in battery data, and the Bernardi equation serves as the model-driven module used to evaluate battery temperature. Ultimately, the outputs of the weighted model-driven module and data-driven module are combined to comprehensively assess whether the battery is abnormal. The proposed algorithm combines the advantages of model-driven and data-driven approaches, achieving a 25 min advance warning for thermal runaway, with a significantly reduced probability of false alarms.

  • Article type: Journal Article
    Chronic exposure to the hypobaric hypoxic environment of the plateau could influence human cognitive behaviours, which are supported by dynamic brain connectivity states. How the functional connectivity (FC) of the brain network changes with altitude is still unclear. In this article, we used EEG data recorded during the Go/NoGo paradigm in Weinan (347 m) and Nyingchi (2950 m). A combination of dynamic FC (dFC) and K-means clustering was employed to extract dynamic FC states, which were then distinguished by graph metrics. In addition, temporal properties of the networks, such as fractional windows (FW), transition numbers (TN) and mean dwell time (MDT), were calculated. We extracted two different states from the dFC matrices, of which State 1 was verified to have higher functional integration and segregation. The dFC states switched dynamically during the Go/NoGo tasks, and the FW of State 1 was higher in the high-altitude participants. In the regional analysis, we also found higher state deviation in the fronto-parietal cortices and enhanced FC strength in the occipital lobe. These results demonstrate that long-term exposure to the high-altitude environment could lead brain networks to reorganize into networks with higher inter- and intra-network information transfer efficiency, which could be attributed to a mechanism compensating for the brain function compromised by the plateau environment. This study provides a new perspective on how the plateau environment affects cognitive impairment.
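The three temporal properties named above can be computed directly from the sequence of state labels that clustering assigns to successive dFC windows. The sketch below uses definitions that are common in the dFC literature (fraction of windows per state, number of state changes, mean run length per state); the abstract does not spell out the authors' exact definitions, and the function name `temporal_metrics` and the label sequence are made up for illustration.

```python
def temporal_metrics(states):
    """Return (fractional windows, transition number, mean dwell time).

    `states` is the per-window cluster label sequence, in time order.
    Dwell time is measured in windows, not seconds.
    """
    n = len(states)
    # Fractional windows: share of all windows assigned to each state.
    fw = {s: states.count(s) / n for s in set(states)}
    # Transition number: how often consecutive windows differ in state.
    tn = sum(1 for a, b in zip(states, states[1:]) if a != b)
    # Mean dwell time: average length of uninterrupted runs of each state.
    runs = {}
    run_state, run_len = states[0], 1
    for s in states[1:]:
        if s == run_state:
            run_len += 1
        else:
            runs.setdefault(run_state, []).append(run_len)
            run_state, run_len = s, 1
    runs.setdefault(run_state, []).append(run_len)
    mdt = {s: sum(r) / len(r) for s, r in runs.items()}
    return fw, tn, mdt


# Hypothetical label sequence for ten sliding windows and two states.
states = [1, 1, 1, 2, 2, 1, 1, 2, 2, 2]
fw, tn, mdt = temporal_metrics(states)
print(fw, tn, mdt)
```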

  • Article type: Journal Article
    BACKGROUND: Angiogenesis plays an important role in colon cancer (CC) progression.
    OBJECTIVE: To investigate the tumor microenvironment (TME) and intratumor microbes of angiogenesis subtypes (AGSs) and explore potential targets for antiangiogenic therapy in CC.
    METHODS: The data were obtained from The Cancer Genome Atlas database and Gene Expression Omnibus database. K-means clustering was used to construct the AGSs. The prognostic model was constructed based on the differential genes between two subtypes. Single-cell analysis was used to analyze the expression level of SLC2A3 on different cells in CC, which was validated by immunofluorescence. Its biological functions were further explored in HUVECs.
    RESULTS: CC samples were grouped into two AGSs (AGS-A and AGS-B), and patients in the AGS-B group had a poor prognosis. Further analysis revealed that the AGS-B group had high infiltration of TME immune cells but also exhibited high immune escape. The intratumor microbes also differed between the two subtypes. A convenient 6-gene angiogenesis-related signature (ARS) was established to identify AGSs and predict the prognosis of CC patients. SLC2A3 was selected as the representative gene of the ARS; it was more highly expressed in endothelial cells and promoted the migration of HUVECs.
    CONCLUSIONS: Our study identified two AGSs with distinct prognoses, TME, and intratumor microbial compositions, which could provide potential explanations for the impact on the prognosis of CC. A reliable ARS model was further constructed, which could guide personalized treatment. SLC2A3 might be a potential target for antiangiogenic therapy.

  • Article type: Journal Article
    With the rapid growth of online property rental and sale platforms, the prevalence of fake real estate listings has become a significant concern. These deceptive listings waste the time and effort of buyers and sellers and pose potential risks, so developing effective methods to distinguish genuine from fake listings is crucial. Accurately identifying fake real estate listings is a critical challenge, and clustering analysis can significantly improve this process. While clustering has been widely used to detect fraud in various fields, its application in the real estate domain has been somewhat limited, focusing primarily on auctions and property appraisals. This study aims to fill this gap by using clustering to classify properties into fake and genuine listings based on datasets curated by industry experts. The study developed a K-means model to group properties into clusters, clearly distinguishing between fake and genuine listings. To assure the quality of the training data, pre-processing procedures were performed on the raw dataset. Several techniques were used to determine the optimal value for each parameter of the K-means model. The number of clusters was determined using the Silhouette coefficient, the Calinski-Harabasz index, and the Davies-Bouldin index. Two clusters gave the best result, and the Canberra distance performed best when compared with overlap similarity and Jaccard distance. The clustering results were assessed using two machine learning algorithms: Random Forest and Decision Tree. The results show that the optimized K-means significantly improves the accuracy of the Random Forest classification model, boosting it by an impressive 96%. Furthermore, this research demonstrates that clustering helps create a balanced dataset containing fake and genuine clusters. This balanced dataset holds promise for future investigations, particularly for deep learning models that require balanced data to perform optimally. This study presents a practical and effective way to identify fake real estate listings by harnessing the power of clustering analysis, ultimately contributing to a more trustworthy and secure real estate market.
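Of the three cluster-validity indices mentioned above, the Silhouette coefficient is the simplest to write out by hand. The following is a small sketch under stated assumptions: it uses Euclidean distance rather than the Canberra distance the abstract favours, the function name `silhouette` is hypothetical, and the six points are toy values, not listing data. For each point it compares the mean distance to its own cluster (a) with the mean distance to the nearest other cluster (b) and averages (b - a) / max(a, b).

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    scores = []
    for i, p in enumerate(points):
        # Collect distances from point i to every other point, grouped by cluster.
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if own is None or not by_cluster:
            # Singleton cluster, or only one cluster overall: score 0 by convention.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                 # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())   # separation
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = silhouette(points, [0, 0, 0, 1, 1, 1])   # labels match the groups
bad = silhouette(points, [0, 1, 0, 1, 0, 1])    # labels cut across the groups
print(good, bad)
```

To choose a cluster count this way, one would run the clustering for several candidate values of k and keep the k with the highest mean silhouette.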

  • Article type: Journal Article
    Understanding the impact of different lifestyle trajectories on health preservation and disease risk is crucial for effective interventions.
    This study analyzed lifestyle engagement over five years in 3,013 healthy adults aged 40-70 from the Barcelona Brain Health Initiative using K-means clustering. Nine modifiable risk factors were considered: cognitive, physical, and social activity, vital plan, diet, obesity, smoking, alcohol consumption, and sleep. Self-reported diagnoses of new diseases at different time points after baseline allowed exploration of the association between the resulting five profiles and health outcomes.
    The data-driven analysis classified subjects into five lifestyle profiles, revealing associations with health behaviors and risk factors. Those with high scores in health-promoting behaviors and low scores in risk behaviors demonstrated a reduced likelihood of developing diseases (p < 0.001). In contrast, profiles with risky habits showed distinct risks for psychiatric, neurological, and cardiovascular diseases. Participants' lifestyle trajectories remained relatively stable over time.
    Our findings identified risks for distinct diseases associated with specific lifestyle patterns. These results could help personalize interventions based on data-driven observation of behavioral patterns, inform policies that promote a healthy lifestyle, and lead to better health outcomes for people in an aging society.

  • Article type: Journal Article
    Feature selection for unsupervised machine learning (ML) is far less developed than for supervised ML. To address this issue, the current research proposes a stepwise feature selection approach for clustering methods, specified for the Gaussian mixture model (GMM) and k-means. Whereas the existing GMM and k-means methods are carried out on all the features, the proposed method selects a subset of features on which to run each of the two methods. The research also finds that better results can be obtained if the existing GMM and k-means methods are given good initializations. Experiments based on Monte Carlo simulations show that the proposed method is more computationally efficient and more accurate than the existing GMM and k-means methods based on all the features. An experiment based on a real-world dataset confirms this finding.

  • Article type: Journal Article
    Gene expression data is typically high dimensional with a limited number of samples and contains many features that are unrelated to the disease of interest. Existing unsupervised feature selection algorithms primarily focus on the significance of features in maintaining the data structure and do not take into account the redundancy among features. Determining the appropriate number of significant features is another challenge.
    In this paper, we propose a clustering-guided unsupervised feature selection (CGUFS) algorithm for gene expression data that addresses these problems. Our proposed algorithm introduces three improvements over existing algorithms. Because existing clustering algorithms require the number of clusters to be specified manually, we propose an adaptive k-value strategy that assigns appropriate pseudo-labels to each sample by iteratively updating a change function. Because existing algorithms fail to consider the redundancy among features, we propose a feature grouping strategy that groups highly redundant features. Because existing algorithms cannot filter out the redundant features, we propose an adaptive filtering strategy that determines the feature combinations to retain by calculating the potentially effective and potentially redundant features of each feature group.
    Experimental results show that the average accuracy (ACC) and Matthews correlation coefficient (MCC) of the C4.5 classifier on the optimal features selected by the CGUFS algorithm reach 74.37% and 63.84%, respectively, significantly superior to the existing algorithms.
    Similarly, the average ACC and MCC of the AdaBoost classifier on the optimal features selected by the CGUFS algorithm are significantly superior to those of the existing algorithms. In addition, statistical experiments show significant differences between the CGUFS algorithm and the existing algorithms.
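For readers unfamiliar with the two reported metrics, the sketch below computes accuracy and the Matthews correlation coefficient from a binary confusion matrix. It assumes the binary case only (the abstract does not say whether the datasets are binary or multiclass), and the function names and the counts are illustrative, not taken from the paper.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    Ranges from -1 (total disagreement) through 0 (chance level) to +1
    (perfect prediction). Returns 0.0 when any marginal total is zero,
    which is a common convention for the otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up counts for 100 test samples.
acc_value = accuracy(tp=40, tn=45, fp=5, fn=10)
mcc_value = mcc(tp=40, tn=45, fp=5, fn=10)
print(acc_value, mcc_value)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the classes are imbalanced, which is one reason the two are often reported together.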