Keywords: Calinski-Harabasz index; Canberra; Clustering analysis; Davies-Bouldin index; Fake real estate listings; Jaccard distance; K-means; Random forest; Real estates; Silhouette coefficient

Source: DOI:10.7717/peerj-cs.2019

Abstract:
With the rapid growth of online property rental and sale platforms, fake real estate listings have become a significant concern. These deceptive listings waste the time and effort of buyers and sellers and expose them to potential risks, so effective methods for distinguishing genuine listings from fake ones are needed. Clustering analysis can help with this task. Although clustering has been widely used to detect fraud in various fields, its application in real estate has been limited and has focused primarily on auctions and property appraisals. This study aims to fill that gap by using clustering to separate properties into fake and genuine listings, based on datasets curated by industry experts. A K-means model was developed to group the properties into clusters that distinguish fake from genuine listings. To assure the quality of the training data, pre-processing was performed on the raw dataset. Several techniques were used to determine the optimal value of each K-means parameter. The number of clusters was selected using the Silhouette coefficient, the Calinski-Harabasz index, and the Davies-Bouldin index. Two clusters gave the best result, and the Canberra distance performed better than overlap similarity and Jaccard distance. The clustering results were assessed using two machine learning algorithms, Random Forest and Decision Tree. The results show that the optimized K-means significantly improves the accuracy of the Random Forest classification model, boosting it by 96%. Furthermore, the clustering helps create a balanced dataset containing fake and genuine clusters. This balanced dataset holds promise for future investigations, particularly for deep learning models that require balanced data to perform optimally. The study thus presents a practical way to identify fake real estate listings through clustering analysis, contributing to a more trustworthy and secure real estate market.
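
The cluster-count selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' code. It assumes scikit-learn, uses synthetic data from `make_blobs` in place of the expert-curated listings, and relies on scikit-learn's Euclidean K-means, so the Canberra-based variant reported in the abstract is not reproduced. The helper name `score_cluster_counts` is hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)


def score_cluster_counts(X, k_values, seed=0):
    """Fit K-means for each candidate k and return three validity indices.

    Hypothetical helper for illustration; not taken from the paper.
    """
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(X)
        scores[k] = {
            # Higher is better, range [-1, 1].
            "silhouette": silhouette_score(X, labels),
            # Higher is better.
            "calinski_harabasz": calinski_harabasz_score(X, labels),
            # Lower is better.
            "davies_bouldin": davies_bouldin_score(X, labels),
        }
    return scores


# Synthetic stand-in data (an assumption): three well-separated 2-D blobs.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

scores = score_cluster_counts(X, range(2, 7))

# Each index votes for its preferred number of clusters.
best_by_silhouette = max(scores, key=lambda k: scores[k]["silhouette"])
best_by_ch = max(scores, key=lambda k: scores[k]["calinski_harabasz"])
best_by_db = min(scores, key=lambda k: scores[k]["davies_bouldin"])

for k, s in scores.items():
    print(k, {name: round(value, 3) for name, value in s.items()})
```

On real listing data the three indices may disagree, which is one reason to report all of them, as the abstract does, instead of relying on a single score.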