k-means

  • Article type: Journal Article
    With the rapid growth of online property rental and sale platforms, the prevalence of fake real estate listings has become a significant concern. These deceptive listings waste the time and effort of buyers and sellers and pose potential risks, so effective methods for distinguishing genuine from fake listings are needed. Accurately identifying fake real estate listings is a critical challenge, and clustering analysis can significantly improve this process. While clustering has been widely used to detect fraud in various fields, its application in the real estate domain has been limited, focusing primarily on auctions and property appraisals. This study aims to fill this gap by using clustering to classify properties into fake and genuine listings based on datasets curated by industry experts. A K-means model was developed to group properties into clusters that clearly distinguish fake from genuine listings. To assure the quality of the training data, pre-processing procedures were performed on the raw dataset. Several techniques were used to determine the optimal value for each parameter of the K-means model. The number of clusters was determined using the Silhouette coefficient, the Calinski-Harabasz index, and the Davies-Bouldin index. Two clusters gave the best result, and the Canberra distance performed best when compared with overlap similarity and Jaccard distance. The clustering results were assessed using two machine learning algorithms: Random Forest and Decision Tree. The results show that the optimized K-means significantly improves the accuracy of the Random Forest classification model, boosting it by 96%. Furthermore, this research demonstrates that clustering helps create a balanced dataset containing fake and genuine clusters. This balanced dataset holds promise for future investigations, particularly for deep learning models that require balanced data to perform optimally. This study presents a practical and effective way to identify fake real estate listings by harnessing the power of clustering analysis, ultimately contributing to a more trustworthy and secure real estate market.
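The cluster-count selection described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes scikit-learn, uses synthetic two-group data in place of the listing dataset, and relies on scikit-learn's Euclidean K-means rather than the Canberra distance the study reports. The helper name `score_cluster_counts` is hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)


def score_cluster_counts(X, candidate_ks, seed=0):
    """Fit K-means for each candidate k and return
    {k: (silhouette, calinski_harabasz, davies_bouldin)}."""
    scores = {}
    for k in candidate_ks:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = (
            silhouette_score(X, labels),         # higher is better
            calinski_harabasz_score(X, labels),  # higher is better
            davies_bouldin_score(X, labels),     # lower is better
        )
    return scores


# Synthetic stand-in data (assumption): two well-separated groups.
X, _ = make_blobs(
    n_samples=300, centers=[[-5, -5], [5, 5]], cluster_std=1.0, random_state=0
)
scores = score_cluster_counts(X, range(2, 6))

best_by_silhouette = max(scores, key=lambda k: scores[k][0])
best_by_davies_bouldin = min(scores, key=lambda k: scores[k][2])
print(best_by_silhouette, best_by_davies_bouldin)
```

The three indices do not always agree on real data, so in practice they are usually read together rather than reduced to a single winner.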

  • Article type: Journal Article
    BACKGROUND: Prior typing methods fail to provide predictive insights into surgical complexities for extrahepatic choledochal cyst (ECC). This study aims to establish a new classification system for ECC through clustering of imaging results. Additionally, it seeks to compare the differences among the identified ECC types and assess the levels of surgical difficulty.
    METHODS: The imaging data of 124 patients were automatically grouped through a K-means clustering analysis. According to the characteristics of the new grouping, corrections and interventions were carried out to establish a new classification. Demographic data, clinical presentations, surgical parameters, complications, reoperation, and prognostic indicators were analyzed according to different types. Factors contributing to prolonged surgical time were also evaluated.
    RESULTS: The new classification system for ECC comprises Type A (upper segment), Type B (middle segment), Type C (lower segment), and Type D (entire bile duct). The incidences of comorbidities (calculus or infection) were significantly different (P = 0.000, P = 0.002). Additionally, variations in the incidence of postoperative biliary stricture were statistically significant (P = 0.046). The operative time was significantly different between groups (P = 0.001). Age, BMI > 30, classification, and the presence of combined stones were significantly associated with prolonged operative time (P = 0.002, P = 0.000, P = 0.011, P = 0.011).
    CONCLUSIONS: Machine learning-driven cluster analysis enabled the creation of a novel typology of extrahepatic biliary dilatation. This classification, in conjunction with factors such as age, combined stone occurrence, and obesity, significantly influences the complexity of laparoscopic choledochal cyst surgery, offering valuable insights for improved surgical treatment.

  • Article type: Journal Article
    Understanding the impact of different lifestyle trajectories on health preservation and disease risk is crucial for effective interventions.
    This study analyzed lifestyle engagement over five years in 3,013 healthy adults aged 40-70 from the Barcelona Brain Health Initiative using K-means clustering. Nine modifiable risk factors were considered, including cognitive, physical, and social activity, vital plan, diet, obesity, smoking, alcohol consumption, and sleep. Self-reported diagnoses of new diseases at different time points after baseline allowed us to explore the association between the resulting five profiles and health outcomes.
    The data-driven analysis classified subjects into five lifestyle profiles, revealing associations with health behaviors and risk factors. Those with high scores in health-promoting behaviors and low scores in risk behaviors showed a reduced likelihood of developing diseases (p < 0.001). In contrast, profiles with risky habits showed distinct risks for psychiatric, neurological, and cardiovascular diseases. Participants' lifestyle trajectories remained relatively stable over time.
    Our findings identify risks for distinct diseases associated with specific lifestyle patterns. These results could help personalize interventions based on data-driven observation of behavioral patterns, inform policies that promote a healthy lifestyle, and lead to better health outcomes for people in an aging society.

  • Article type: Journal Article
    This study investigates the impact of Free-to-Publish (F2P) versus Pay-to-Publish (P2P) models in dermatology journals, focusing on their differences in terms of journal metrics, Article Processing Charges (APCs), and Open Access (OA) status. Utilizing k-means clustering, the research evaluates dermatology journals based on SCImago Journal Rank (SJR), H-Index, and Impact Factor (IF), and examines the correlation between these metrics, APCs, and OA status (Full or Hybrid). Data from the SCImago Journal Rank and Journal Citation Reports databases were used, and metrics from 106 journals were normalized and grouped into three tiers. The study reveals a higher proportion of F2P journals, especially in higher-tier journals, indicating a preference for quality-driven research acceptance. Conversely, a rising proportion of P2P journals in lower tiers suggests potential bias towards the ability to pay. This disparity poses challenges for researchers from less-funded institutions or those early in their careers. The study also finds significant differences in APCs between F2P and P2P journals, with hybrid OA being more common in F2P. In conclusion, the study highlights the disparities in dermatology journals between F2P and P2P models and underscores the need for further research into authorship demographics and institutional affiliations in these journals. It also establishes the effectiveness of k-means clustering as a standardized method for assessing journal quality, which can reduce reliance on potentially biased individual metrics.
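The "normalize, then group into three tiers" step can be sketched as below. This is an illustration under stated assumptions, not the study's code: the six (SJR, H-index, IF) rows are invented, z-score scaling is one possible normalization, and ranking clusters by their mean standardized centroid is a hypothetical rule for naming the tiers. The function name `assign_tiers` is likewise hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def assign_tiers(metrics, n_tiers=3, seed=0):
    """Standardize each metric, cluster with k-means, and relabel the
    clusters so that tier 1 has the highest average standardized score."""
    z = StandardScaler().fit_transform(metrics)
    km = KMeans(n_clusters=n_tiers, n_init=10, random_state=seed).fit(z)
    # Cluster ids are arbitrary, so rank clusters by centroid mean (descending).
    order = np.argsort(-km.cluster_centers_.mean(axis=1))
    tier_of_cluster = np.empty(n_tiers, dtype=int)
    tier_of_cluster[order] = np.arange(1, n_tiers + 1)
    return tier_of_cluster[km.labels_]


# Invented example rows: (SJR, H-index, impact factor) per journal.
journals = np.array([
    [3.0, 200, 10.0],
    [2.8, 190, 9.5],
    [1.2, 90, 4.0],
    [1.1, 85, 3.8],
    [0.3, 20, 1.0],
    [0.2, 15, 0.8],
])
tiers = assign_tiers(journals)
print(tiers)
```

Scaling matters here because the H-index spans a much wider numeric range than SJR or IF and would otherwise dominate the Euclidean distances.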

  • Article type: Journal Article
    The work presented in this paper is focused on the largest marine disaster to have occurred in the Indian Ocean, caused by the breakup of the container tanker ship X-Press Pearl. In order to identify the oil spill and its temporal evolution, a recently proposed damping ratio (DR) index is employed. To derive the DR, a data-driven GMM-EM clustering method is proposed, optimized by stochastic ordering of the resulting classes in Sentinel-1 SAR time series imagery. A ship-borne oil spill site is essentially considered to consist of three subsites: oil, open sea, and ship. The initial site probability densities were determined by using k-means clustering. In addition to the clustering method, two histogram-based approaches, namely contextual peak thresholding (CPT) and contextual peak ordering (CPO), were also formulated and presented. The improved histogram peak detection methods take into account spatial and contextual dependencies. The similarity of the marginal probability densities of the open sea and the oil classes makes it difficult to quantify the DR values that show the level of dampening. In the study, we show that class separability sufficient to correctly determine $\sigma^{0}_{VV,\mathrm{sea}}(\theta)$ is possible by using GMM clustering. The resulting class separabilities are also reported using JM and ML distances. The tested methods yield derived DR values that fall within similar ranges. The outcomes were tested against the ground-based surveys conducted during the disaster for oil spill sites and other chemical compounds. The proposed methods are simple to execute, robust, and fully automated. Further, they do not require manual masking of the oil or manual selection of high-confidence water pixels.
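The idea of seeding a Gaussian mixture with k-means and then ordering the fitted classes can be sketched as follows. This is a simplified illustration, not the paper's method: it assumes scikit-learn, uses one-dimensional synthetic backscatter-like values in dB with invented class means, and orders the components simply by their fitted mean. The function name `fit_three_class_gmm` is hypothetical, and the synthetic classes are far better separated than real oil and open-sea pixels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture


def fit_three_class_gmm(values_db, seed=0):
    """Initialize a 3-component GMM from k-means centroids, fit it with EM,
    and relabel components in ascending order of their mean."""
    x = np.asarray(values_db, dtype=float).reshape(-1, 1)
    km = KMeans(n_clusters=3, n_init=10, random_state=seed).fit(x)
    gmm = GaussianMixture(
        n_components=3, means_init=km.cluster_centers_, random_state=seed
    ).fit(x)
    # Component indices are arbitrary; map them to ranks 0 (lowest mean) to 2.
    order = np.argsort(gmm.means_.ravel())
    rank_of_component = np.empty(3, dtype=int)
    rank_of_component[order] = np.arange(3)
    labels = rank_of_component[gmm.predict(x)]
    return np.sort(gmm.means_.ravel()), labels


# Invented sample values (assumption): dark oil-like, open-sea-like,
# and bright ship-like pixels.
rng = np.random.default_rng(0)
samples = np.concatenate([
    rng.normal(-25.0, 1.0, 500),
    rng.normal(-15.0, 1.0, 500),
    rng.normal(0.0, 1.0, 100),
])
means, labels = fit_three_class_gmm(samples)
print(means)
```

Ordering by mean gives each class a stable identity across images, which is needed before any per-class statistic can be compared over a time series.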

  • Article type: Journal Article
    BACKGROUND: Hypospadias phenotype assessment determines whether the anatomy is favorable for reconstruction. The Glans-Urethral Meatus-Shaft (GMS) score has been adopted in an effort to standardize hypospadias classification. While extremely subjective, GMS has been widely used to classify the severity of the phenotype to predict surgical outcomes. The use of digital image analysis has proven to be feasible, and prior efforts by our team have demonstrated that machine learning algorithms can emulate an expert's assessment of the phenotype. Nonetheless, the creation of these image recognition algorithms is highly subjective. In order to reduce subjective input in the evaluation of the phenotype, we propose a novel approach that analyzes the anatomy using digital image pixel analysis and compares the results with the GMS score. Our hypothesis is that pixel cluster segmentation can discriminate between favorable and unfavorable anatomy.
    OBJECTIVE: To evaluate whether image segmentation and digital pixel analysis are able to analyze favorable vs unfavorable hypospadias anatomy in a less subjective manner than the GMS score.
    METHODS: A total of 148 patients with different types of hypospadias were classified by 1 of 5 independent experts following the GMS score into "favorable" (GG), "moderately favorable" (GM) and "unfavorable" (GP) glans. From there, 592 images were generated using digital image segmentation. Of these, 584 were included in the final analysis after images with poor quality or inadequate capture of the target anatomy were excluded. For each image, the region of interest was segmented separately by two evaluators into "glans," "urethral plate," "foreskin" and "periurethral plate". The values obtained for each segmented region using machine-learning statistical pixel k-means cluster analysis were compared with the GMS score given to that image using ANOVA.
    RESULTS: Analysis of image segmentation demonstrated that k-means pixel cluster analysis discriminated "favorable" vs "unfavorable" urethral plates. There was a significant difference between scores when comparing the GG and GM groups (p = 0.03) and the GG and GP groups (p = 0.05). Pixel cluster analysis could not discriminate between "moderately favorable" and "unfavorable" urethral plates.
    CONCLUSIONS: Through our analysis, we found significant pairwise differences for different tissue qualities. Digital image segmentation and statistical k-means cluster analysis can discriminate anatomical features in a similar way to the GMS score. Future research can target discerning between different tissue qualities in an effort to predict surgical outcomes for hypospadias repair.

  • Article type: Journal Article
    Compared to supervised machine learning (ML), the development of feature selection for unsupervised ML lags far behind. To address this issue, the current research proposes a stepwise feature selection approach for clustering methods, specified for the Gaussian mixture model (GMM) and k-means. Whereas the existing GMM and k-means methods are carried out on all the features, the proposed method selects a subset of features on which to run each of the two methods. The research finds that a better result can be obtained if the existing GMM and k-means methods are modified with good initializations. Experiments based on Monte Carlo simulations show that the proposed method is more computationally efficient and more accurate than the existing GMM and k-means methods based on all the features. An experiment based on a real-world dataset confirms this finding.

  • Article type: Journal Article
    The water reuse facilities of industrial parks face the challenge of managing a growing variety of wastewater sources as their inlet water. Typically, the grouping of these sources is designed by engineers with extensive expertise. This paper presents an innovative application of unsupervised learning methods to classify inlet water in Chinese water reuse stations, aiming to reduce reliance on engineers' experience. The concept of 'water quality distance' was incorporated into three unsupervised learning clustering algorithms (K-means, DBSCAN, and AGNES), which were validated through six case studies. Of the six cases, three were employed to illustrate the feasibility of the unsupervised learning clustering algorithms. The results indicated that the clustering algorithms were more stable and performed better than both manual clustering and ChatGPT-based clustering. The remaining three cases were used to showcase the reliability of the three clustering algorithms. The findings revealed that the AGNES algorithm showed the greatest potential for application. The average purity values of K-means, DBSCAN, and AGNES across the six cases were 0.947, 0.852, and 0.955, respectively.
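Purity, the score reported above, can be computed as in the sketch below. It uses the common definition (each cluster is credited with its most frequent reference label, and the credited counts are divided by the number of samples); whether the paper uses exactly this variant is an assumption, and the labels in the example are invented.

```python
from collections import Counter


def purity(true_labels, cluster_labels):
    """Fraction of samples that carry the majority reference label
    of the cluster they were assigned to."""
    if len(true_labels) != len(cluster_labels):
        raise ValueError("label sequences must have the same length")
    members_by_cluster = {}
    for truth, cluster in zip(true_labels, cluster_labels):
        members_by_cluster.setdefault(cluster, []).append(truth)
    majority_total = sum(
        Counter(members).most_common(1)[0][1]
        for members in members_by_cluster.values()
    )
    return majority_total / len(true_labels)


# Invented example: six inlet streams, three reference groups, three clusters.
reference = ["a", "a", "a", "b", "b", "c"]
assigned = [0, 0, 1, 1, 1, 2]
print(purity(reference, assigned))  # 5 of 6 samples match their cluster majority
```

Purity rises trivially as the number of clusters grows (one cluster per sample gives 1.0), so it is only comparable between algorithms that produce a similar number of clusters.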

  • Article type: Journal Article
    Informed by the biopsychosocial framework, our study uses the Chinese Longitudinal Healthy Longevity Survey (CLHLS) dataset to examine cognitive function trajectories among the oldest-old (80+). Employing K-means clustering, we identified two latent groups: High Stability (HS) and Low Stability (LS). The HS group maintained satisfactory cognitive function, while the LS group exhibited consistently low function. Lasso regression revealed predictive factors, including socioeconomic status, biological conditions, mental health, lifestyle, psychological, and behavioral factors. This data-driven approach sheds light on cognitive aging patterns and informs policies for healthy aging. Our study pioneers non-parametric machine learning methods in this context.

  • Article type: Journal Article
    With the transformation and upgrading of industries, the environmental problems caused by residual contaminated industrial sites are becoming increasingly prominent. Based on an actual investigation case, this study analyzed the soil pollution status of a residual site of the copper and zinc rolling industry, and found that the pollutants exceeding the screening values included Cu, Ni, Zn, Pb, total petroleum hydrocarbons (TPH) and 6 polycyclic aromatic hydrocarbon (PAH) monomers. Based on traditional analysis methods such as the correlation coefficient and spatial distribution, combined with machine learning methods such as SOM + K-means, it is inferred that the heavy metals Zn and Pb may be mainly related to the production history of zinc rolling, while Cu and Ni may originate mainly from the production history of copper rolling. PAHs are mainly due to the incomplete combustion of fossil fuels in the melting equipment. TPH pollution is speculated to be related to oil leakage during the period of industrial use and the later period of vehicle parking. The results showed that traditional analysis methods can quickly identify the correlation between site pollutants, while SOM + K-means machine learning methods can further extract complex hidden relationships in the data and achieve in-depth mining of site monitoring data.