Isolation forest

  • Article type: Journal Article
    Type II diabetes mellitus (T2DM) is a rising global health burden due to its rapidly increasing prevalence worldwide and can result in serious complications. Therefore, it is of utmost importance to identify individuals at risk as early as possible to avoid long-term T2DM complications. In this study, we developed an interpretable machine learning model leveraging baseline levels of biomarkers of oxidative stress (OS), inflammation, and mitochondrial dysfunction (MD) for identifying individuals at risk of developing T2DM. In particular, Isolation Forest (iForest) was applied as an anomaly detection algorithm to address class imbalance: iForest was trained on the control group data to detect cases at high risk of T2DM development as outliers. Two iForest models were trained and evaluated through ten-fold cross-validation, the first on traditional biomarkers (BMI, blood glucose levels (BGL), and triglycerides) alone and the second including the aforementioned additional biomarkers. The second model outperformed the first across all evaluation metrics, particularly F1 score and recall, which increased from 0.61 ± 0.05 to 0.81 ± 0.05 and from 0.57 ± 0.06 to 0.81 ± 0.08, respectively. The feature importance scores identified a novel combination of biomarkers, including interleukin-10 (IL-10), 8-isoprostane, humanin (HN), and oxidized glutathione (GSSG), which were more influential than the traditional biomarkers in the outcome prediction. These results reveal a promising method for simultaneously predicting and understanding the risk of T2DM development and suggest possible pharmacological interventions to address inflammation and OS early in disease progression.
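The core trick described above — training iForest on controls only, then treating low-scoring new samples as high-risk outliers — can be sketched as follows. This is a minimal illustration on synthetic data: the seven-column biomarker matrix, group sizes, and effect sizes are all assumptions, not the study's dataset.

```python
# Sketch of a control-trained Isolation Forest as a one-class risk detector.
# All data below are synthetic; the biomarker layout is purely illustrative.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical biomarker matrix: columns standing in for BMI, BGL,
# triglycerides, IL-10, 8-isoprostane, humanin, GSSG (7 synthetic features).
controls = rng.normal(loc=0.0, scale=1.0, size=(200, 7))   # training: controls only
new_cases = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 7)),                    # control-like profiles
    rng.normal(3.0, 1.0, size=(50, 7)),                    # shifted "high-risk" profiles
])

iforest = IsolationForest(n_estimators=200, contamination="auto", random_state=0)
iforest.fit(controls)

# predict(): +1 = inlier (control-like), -1 = outlier (flagged as high risk)
labels = iforest.predict(new_cases)
flagged = np.where(labels == -1)[0]
print(f"{len(flagged)} of {len(new_cases)} samples flagged as high-risk outliers")
```

Because the model never sees cases during training, class imbalance in the case group does not affect the decision boundary, which is the property the abstract exploits.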

  • Article type: Journal Article
    Soft sensors have been extensively utilized for real-time power prediction in wind power generation, where power output is challenging to measure instantaneously. The short-term forecast of wind power aims at providing a reference for the dispatch of the intraday power grid. This study proposes a soft sensor model based on the Long Short-Term Memory (LSTM) network, combining data preprocessing with Variational Modal Decomposition (VMD) to improve wind power prediction accuracy. It does so by adopting the isolation forest algorithm for anomaly detection in the original wind power series and processing the missing data by multiple imputation. Based on the processed data samples, VMD is used to achieve power data decomposition and noise reduction. The LSTM network is introduced to predict each modal component separately, and the component predictions are then summed to reconstruct the complete wind power forecast. The experimental results show that the LSTM network using the Adam optimization algorithm has better convergence accuracy. The VMD method exhibited superior decomposition outcomes due to its inherent Wiener filtering capability, which effectively mitigates noise and forestalls modal aliasing. The Mean Absolute Percentage Error (MAPE) was reduced by 9.3508%, indicating that the LSTM network combined with the VMD method achieves better prediction accuracy.

  • Article type: Journal Article
    Proteomic analysis of extracellular vesicles presents several challenges due to the unique nature of these small membrane-bound structures. Alternative analyses could reveal outcomes hidden from standard statistics to explore and develop potential new biological hypotheses that may have been overlooked during the initial evaluation of the data. An analysis sequence focusing on deviating protein expressions from donors' primary cells was performed, leveraging machine-learning techniques to analyze small datasets, and it has been applied to evaluate extracellular vesicles' protein content gathered from mesenchymal stem cells cultured on bioactive glass discs doped or not with metal ions. The goal was to provide additional opportunities for detecting details between experimental conditions that are not entirely revealed with classic statistical inference, offering further insights regarding the experimental design and assisting the researchers in interpreting the outcomes. The methodology extracted a set of EV-related proteins whose differences between conditions could be partially explainable with statistics, suggesting the presence of other factors involved in the bioactive glasses' interactions with tissues. Outlier identification of extracellular vesicles' protein expression levels related to biomaterial preparation was instrumental in improving the interpretation of the experimental outcomes.

  • Article type: Journal Article
    Hyperspectral anomaly detection is used to recognize unusual patterns or anomalies in hyperspectral data. Currently, many spectral-spatial detection methods have been proposed in a cascaded manner; however, they often neglect the complementary characteristics between the spectral and spatial dimensions, which easily leads to a high false alarm rate. To alleviate this issue, a spectral-spatial information fusion (SSIF) method is designed for hyperspectral anomaly detection. First, an isolation forest is exploited to obtain the spectral anomaly map, in which the object-level feature is constructed with an entropy rate segmentation algorithm. Then, a local spatial saliency detection scheme is proposed to produce the spatial anomaly result. Finally, the spectral and spatial anomaly scores are integrated, followed by domain transform recursive filtering, to generate the final detection result. Experiments on five hyperspectral datasets covering ocean and airport scenes prove that the proposed SSIF produces superior detection results over other state-of-the-art detection techniques.
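The spectral branch of this pipeline reduces to scoring each pixel's spectrum with an isolation forest. A minimal sketch on a synthetic cube is below; the entropy-rate segmentation, spatial-saliency, and fusion stages of SSIF are not reproduced, and the cube dimensions are arbitrary assumptions.

```python
# Sketch of the spectral branch only: per-pixel Isolation Forest scores over a
# synthetic hyperspectral cube yield a spectral anomaly map.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(2)
H, W, B = 32, 32, 50                       # height, width, spectral bands
cube = rng.normal(0.5, 0.05, size=(H, W, B))
cube[10:12, 20:22, :] += 0.6               # a small anomalous target

pixels = cube.reshape(-1, B)               # (H*W, B): one spectrum per row
iforest = IsolationForest(n_estimators=100, random_state=0).fit(pixels)

# score_samples is higher for inliers, so negate: higher value = more anomalous.
anomaly_map = -iforest.score_samples(pixels).reshape(H, W)
top = np.unravel_index(np.argmax(anomaly_map), anomaly_map.shape)
print("most anomalous pixel:", top)
```

In the full method, this map would be fused with a spatial anomaly result before the recursive-filtering step.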

  • Article type: Journal Article
    In this study, we present a novel machine learning framework for web server anomaly detection that uniquely combines the Isolation Forest algorithm with expert evaluation, focusing on individual user activities within NGINX server logs. Our approach addresses the limitations of traditional methods by effectively isolating and analyzing subtle anomalies in vast datasets. Initially, the Isolation Forest algorithm was applied to extensive NGINX server logs, successfully identifying outlier user behaviors that conventional methods often overlook. We then employed DBSCAN for detailed clustering of these anomalies, categorizing them based on user request times and types. A key innovation of our methodology is the incorporation of post-clustering expert analysis. Cybersecurity professionals evaluated the identified clusters, adding a crucial layer of qualitative assessment. This enabled the accurate distinction between benign and potentially harmful activities, leading to targeted responses such as access restrictions or web server configuration adjustments. Our approach demonstrates a significant advancement in network security, offering a more refined understanding of user behavior. By integrating algorithmic precision with expert insights, we provide a comprehensive and nuanced strategy for enhancing cybersecurity measures. This study not only advances anomaly detection techniques but also emphasizes the critical need for a multifaceted approach in protecting web server infrastructures.
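The two automated stages of this framework — Isolation Forest to flag outlier users, then DBSCAN to group the flagged users into clusters for expert review — can be sketched as below. The per-user features, thresholds, and traffic shapes are invented for illustration and are not real NGINX log fields.

```python
# Sketch of the two-stage pipeline: Isolation Forest flags outlier users from
# per-user log features; DBSCAN then clusters the flagged users so that experts
# can review each cluster. All data and feature choices are hypothetical.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# Per-user features: [requests per hour, share of 4xx responses]
normal_users = np.column_stack([rng.normal(60, 10, 300), rng.beta(1, 20, 300)])
scanners = np.column_stack([rng.normal(600, 30, 10), rng.beta(40, 8, 10)])  # noisy, error-heavy
users = np.vstack([normal_users, scanners])

X = StandardScaler().fit_transform(users)
outliers = IsolationForest(contamination=0.05, random_state=0).fit_predict(X) == -1

# Cluster only the flagged users; experts would then label each cluster as
# benign or harmful and decide on responses (e.g., access restrictions).
clusters = DBSCAN(eps=1.0, min_samples=3).fit_predict(X[outliers])
print(f"{outliers.sum()} flagged users in {len(set(clusters) - {-1})} cluster(s)")
```

The qualitative expert step has no code analogue; the clustering simply reduces how many individual anomalies a human must inspect.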

  • Article type: Journal Article
    The use of machine learning in biomedical research has surged in recent years thanks to advances in devices and artificial intelligence. Our aim is to expand this body of knowledge by applying machine learning to pulmonary auscultation signals. Despite improvements in digital stethoscopes and attempts to find synergy between them and artificial intelligence, solutions for their use in clinical settings remain scarce. Physicians continue to infer initial diagnoses with less sophisticated means, resulting in low accuracy, leading to suboptimal patient care. To arrive at a correct preliminary diagnosis, the auscultation diagnostics need to be of high accuracy. Due to the large number of auscultations performed, data availability opens up opportunities for more effective sound analysis. In this study, digital 6-channel auscultations of 45 patients were used in various machine learning scenarios, with the aim of distinguishing between normal and abnormal pulmonary sounds. Audio features (such as fundamental frequencies F0-4, loudness, HNR, DFA, as well as descriptive statistics of log energy, RMS and MFCC) were extracted using the Python library Surfboard. Windowing, feature aggregation, and concatenation strategies were used to prepare data for machine learning algorithms in unsupervised (fair-cut forest, outlier forest) and supervised (random forest, regularized logistic regression) settings. The evaluation was carried out using 9-fold stratified cross-validation repeated 30 times. Decision fusion by averaging the outputs for a subject was also tested and found to be helpful. Supervised models showed a consistent advantage over unsupervised ones, with random forest achieving a mean AUC ROC of 0.691 (accuracy 71.11%, Kappa 0.416, F1-score 0.675) in side-based detection and a mean AUC ROC of 0.721 (accuracy 68.89%, Kappa 0.371, F1-score 0.650) in patient-based detection.
    导出

    更多引用

    收藏

    翻译标题摘要

    我要上传

    求助全文

  • Article type: Journal Article
    Long-term electroencephalogram (Long-Term EEG) has the capacity to monitor over a long period, making it a valuable tool in medical institutions. However, due to the large volume of patient data, selecting clean data segments from raw Long-Term EEG for further analysis is an extremely time-consuming and labor-intensive task. Furthermore, the various actions of patients during recording make it difficult to use algorithms to denoise parts of the EEG data, which thus leads to the rejection of these data. Therefore, tools for the quick rejection of heavily corrupted epochs in Long-Term EEG records are highly beneficial. In this paper, a new reliable and fast automatic artifact rejection method for Long-Term EEG based on Isolation Forest (IF) is proposed. Specifically, the IF algorithm is repetitively applied to detect outliers in the EEG data, and the boundary of inliers is promptly adjusted by using a statistical indicator so that the algorithm proceeds in an iterative manner. The iteration terminates when the distance metric between clean epochs and artifact-corrupted epochs remains unchanged. Six statistical indicators (i.e., min, max, median, mean, kurtosis, and skewness) are evaluated by setting them as the centroid to adjust the boundary during iteration, and the proposed method is compared with several state-of-the-art methods on a retrospectively collected dataset. The experimental results indicate that using the min value of the data as the centroid yields the best performance and that the proposed method is highly efficacious and reliable in the automatic artifact rejection of Long-Term EEG, as it significantly improves the overall data quality. Furthermore, the proposed method surpasses the compared methods on most data segments with poor data quality, demonstrating its superior capacity to enhance the quality of heavily corrupted data. Moreover, owing to the linear time complexity of IF, the proposed method is much faster than other methods, providing an advantage when dealing with extensive datasets.

  • Article type: Journal Article
    Dynamic data (including environmental, traffic, and sensor data) were recently recognized as an important part of Open Government Data (OGD). Although these data are of vital importance in the development of data intelligence applications, such as business applications that exploit traffic data to predict traffic demand, they are prone to data quality errors produced by, e.g., failures of sensors and network faults. This paper explores the quality of Dynamic Open Government Data. To that end, a single case is studied using traffic data from the official Greek OGD portal. The portal uses an Application Programming Interface (API), which is essential for effective dynamic data dissemination. Our research approach includes assessing data quality using statistical and machine learning methods to detect missing values and anomalies. Traffic flow-speed correlation analysis, seasonal-trend decomposition, and unsupervised Isolation Forest (iForest) are used to detect anomalies. iForest anomalies are classified as sensor faults and unusual traffic conditions. The iForest algorithm is also trained on additional features, and the model is explained using explainable artificial intelligence. There are 20.16% missing traffic observations, and 50% of the sensors have 15.5% to 33.43% missing values. The average percent of anomalies per sensor is 71.1%, with only a few sensors having less than 10% anomalies. Seasonal-trend decomposition detected 12.6% anomalies in the data of these sensors, and iForest 11.6%, with very few overlaps. To the authors' knowledge, this is the first time a study has explored the quality of dynamic OGD.
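The iForest anomaly-detection step on traffic sensor records can be sketched as below, using the flow-speed relationship the study analyzes. The data are simulated (not from the Greek OGD portal), and the seasonal-trend decomposition, fault-vs-congestion classification, and XAI stages are omitted.

```python
# Rough sketch: unsupervised Isolation Forest over per-interval (flow, speed)
# observations from one simulated traffic sensor, with a stuck-sensor window
# standing in for a sensor fault. All values are synthetic assumptions.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(5)
n = 1440                                        # one day of minute-level records
flow = 40 + 30 * np.sin(2 * np.pi * np.arange(n) / n) + rng.normal(0, 3, n)
speed = 90 - 0.8 * flow + rng.normal(0, 3, n)   # flow-speed correlation
flow[500:510] = 0.0                             # a stuck (faulty) sensor window:
speed[500:510] = 0.0                            # zero flow AND zero speed

X = np.column_stack([flow, speed])
scores = IsolationForest(random_state=0).fit(X).score_samples(X)
worst = np.argsort(scores)[:10]                 # ten most anomalous records
print("most anomalous minutes:", np.sort(worst))
```

Scoring on (flow, speed) jointly is what separates faults from plausible traffic: zero flow with zero speed violates the flow-speed relationship, whereas genuine congestion moves along it.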

  • Article type: Journal Article
    This paper introduces an unsupervised deep learning-driven scheme for mental tasks' recognition using EEG signals. To this end, the Multichannel Wiener filter was first applied to EEG signals as an artifact removal algorithm to achieve robust recognition. Then, a quadratic time-frequency distribution (QTFD) was applied to extract effective time-frequency signal representation of the EEG signals and catch the EEG signals' spectral variations over time to improve the recognition of mental tasks. The QTFD time-frequency features are employed as input for the proposed deep belief network (DBN)-driven Isolation Forest (iF) scheme to classify the EEG signals. Indeed, a single DBN-based iF detector is constructed based on each class's training data, with the class's samples as inliers and all other samples as anomalies (i.e., one-vs.-rest). The DBN is considered to learn pertinent information without assumptions on the data distribution, and the iF scheme is used for data discrimination. This approach is assessed using experimental data comprising five mental tasks from a publicly available database from the Graz University of Technology. Compared to the DBN-based Elliptical Envelope, Local Outlier Factor, and state-of-the-art EEG-based classification methods, the proposed DBN-based iF detector offers superior discrimination performance of mental tasks.
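The one-vs.-rest idea — one Isolation Forest per class, trained only on that class's samples, with classification by the most inlier-like score — can be sketched as follows. The DBN feature learning and QTFD stages are not reproduced; the three "tasks" and their features are synthetic stand-ins.

```python
# Minimal sketch of one-vs.-rest classification with per-class Isolation
# Forests. Class names, centers, and features are hypothetical.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(6)
centers = {"task_a": 0.0, "task_b": 5.0, "task_c": -5.0}   # 3 invented mental tasks
train = {c: rng.normal(mu, 1.0, size=(100, 4)) for c, mu in centers.items()}

# One iForest per class, trained on that class's samples only (its inliers).
detectors = {c: IsolationForest(random_state=0).fit(X) for c, X in train.items()}

def classify(x):
    # score_samples is higher (less negative) for inliers, so take the argmax
    # over the per-class detectors.
    return max(detectors, key=lambda c: detectors[c].score_samples(x.reshape(1, -1))[0])

test_point = rng.normal(5.0, 1.0, size=4)       # drawn near task_b's center
print(classify(test_point))
```

In the paper, the input to each detector would be DBN-learned representations of QTFD features rather than raw synthetic vectors.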

  • Article type: Journal Article
    This study addressed the problem of localization in an ultrawide-band (UWB) network, where the positions of both the access points and the tags needed to be estimated. We considered a fully wireless UWB localization system, comprising both software and hardware, featuring easy plug-and-play usability for the consumer, primarily targeting sport and leisure applications. Anchor self-localization was addressed by two-way ranging, also embedding a Gauss-Newton algorithm for the estimation and compensation of antenna delays, and a modified isolation forest algorithm working with a low-dimensional set of measurements for outlier identification and removal. This approach avoids time-consuming calibration procedures, and it enables accurate tag localization by the multilateration of time difference of arrival measurements. For the assessment of performance and the comparison of different algorithms, we considered an experimental campaign with data gathered by a proprietary UWB localization system.
