statistical significance

  • Article Type: Journal Article
    BACKGROUND: The proper interpretation of a study's results requires both excellent understanding of good methodological practices and deep knowledge of prior results, aided by the availability of effect sizes.
    METHODS: This review takes the form of an expository essay exploring the complex and nuanced relationships among statistical significance, clinical importance, and effect sizes.
    RESULTS: Careful attention to study design and methodology will increase the likelihood of obtaining statistical significance and may enhance the ability of investigators/readers to accurately interpret results. Measures of effect size show how well the variables used in a study account for/explain the variability in the data. Studies reporting strong effects may have greater practical value/utility than studies reporting weak effects. Effect sizes need to be interpreted in context. Verbal summary characterizations of effect sizes (e.g., "weak", "strong") are fundamentally flawed and can lead to inappropriate characterization of results. Common language effect size (CLES) indicators are a relatively new approach to effect sizes that may offer a more accessible interpretation of results that can benefit providers, patients, and the public at large.
    CONCLUSIONS: It is important to convey research findings in ways that are clear to both the research community and the public. At a minimum, this requires inclusion of standard effect size data in research reports. Proper selection of measures and careful design of studies are foundational to the interpretation of a study's results. The ability to draw useful conclusions from a study is increased when investigators enhance the methodological quality of their work.
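    The common language effect size mentioned in the results above can be computed directly from raw data. Below is a minimal sketch (not taken from the article) of the nonparametric CLES: the probability that a randomly chosen observation from one group exceeds a randomly chosen observation from the other, with ties counted as half. The two groups are simulated for illustration.

```python
# Minimal sketch of the common language effect size (CLES): the probability that
# a randomly drawn observation from one group exceeds a randomly drawn
# observation from the other. The two groups below are simulated for illustration.
import numpy as np

def cles(treatment, control):
    """Nonparametric CLES: fraction of all (treatment, control) pairs in which
    the treatment value is larger; ties count as one half."""
    t = np.asarray(treatment, dtype=float)
    c = np.asarray(control, dtype=float)
    greater = (t[:, None] > c[None, :]).sum()
    ties = (t[:, None] == c[None, :]).sum()
    return (greater + 0.5 * ties) / (t.size * c.size)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    treated = rng.normal(loc=0.5, scale=1.0, size=80)  # hypothetical outcome scores
    control = rng.normal(loc=0.0, scale=1.0, size=80)
    # Expected CLES is about 0.64 for a true 0.5-SD shift between normal groups.
    print(f"CLES = {cles(treated, control):.2f}")
```

    A CLES of about 0.64 reads as "a randomly chosen treated patient does better than a randomly chosen control roughly 64% of the time", which is the kind of plain-language statement the abstract argues is more accessible than verbal labels such as "weak" or "strong".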

  • Article Type: Journal Article
    This research analyzed the real-world NOx and particle number (PN) emissions of 21 China VI heavy-duty diesel trucks (HDDTs). On-road emission conformity was first evaluated with a portable emission measurement system (PEMS). Only 76.19%, 71.43%, and 61.90% of the vehicles passed the NOx test, the PN test, and both tests, respectively. The impacts of vehicle features including exhaust gas recirculation (EGR) equipment, mileage, and tractive tonnage were then assessed. Results demonstrated that EGR helped reduce NOx emission factors (EFs) while increasing PN EFs. Larger mileages and tractive tonnages corresponded to higher NOx and PN EFs, respectively. In-depth analyses of the influences of operating conditions on emissions were conducted with both numerical comparisons and statistical tests. Results showed that HDDTs generally generated higher NOx EFs under low speeds or large vehicle specific powers (VSPs), and higher PN EFs under high speeds or small VSPs. In addition, unqualified vehicles generated significantly higher NOx EFs than qualified vehicles on freeways or at speeds ≥40 km/h, while significantly higher PN EFs were generated by unqualified vehicles on suburban roads, on freeways, or under operating modes with positive VSPs. The reliability and accuracy of on-board diagnostic (OBD) NOx data were finally investigated. Results revealed that 43% of the test vehicles did not report reliable OBD data. Correlation analyses between OBD NOx and PEMS measurements further demonstrated that the consistency of instantaneous concentrations was generally low. However, sliding-window-averaged concentrations showed better correlations; for example, the Pearson correlation coefficients of 20 s window-averaged concentrations exceeded 0.85 for most vehicles. The research results provide valuable insights into emission regulation, e.g., focusing more on medium- to high-speed operations to identify unqualified vehicles, setting higher standards to improve the quality of OBD data, and adopting window-averaged OBD NOx concentrations in evaluating vehicle emission performance.
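    To illustrate the sliding-window comparison described above, the hedged sketch below averages OBD and PEMS NOx traces over 20 s windows and computes Pearson correlation coefficients. The CSV file and the column names 'obd_nox' and 'pems_nox' are hypothetical, assuming 1 Hz trip data; this is not the study's code.

```python
# Hedged sketch, not the study's code: compare instantaneous and 20 s
# window-averaged OBD NOx against PEMS NOx with Pearson's r. The CSV file and
# the column names 'obd_nox' and 'pems_nox' are hypothetical, assuming 1 Hz data.
import pandas as pd

WINDOW_S = 20  # window length in samples; equals seconds for 1 Hz data

def window_correlation(df: pd.DataFrame, window: int = WINDOW_S) -> float:
    """Pearson correlation between window-averaged OBD and PEMS NOx series."""
    smoothed = (df[["obd_nox", "pems_nox"]]
                .rolling(window, min_periods=window)
                .mean()
                .dropna())
    return smoothed["obd_nox"].corr(smoothed["pems_nox"])

if __name__ == "__main__":
    trip = pd.read_csv("hddt_trip.csv")  # hypothetical second-by-second trip record
    print(f"instantaneous r  = {trip['obd_nox'].corr(trip['pems_nox']):.2f}")
    print(f"{WINDOW_S} s-window r = {window_correlation(trip):.2f}")
```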

  • Article Type: Journal Article
    Deep learning (DL) has demonstrated its innate capacity to independently learn hierarchical features from complex and multi-dimensional data. A common understanding is that its performance scales up with the amount of training data. However, the data must also exhibit variety to enable improved learning. In medical imaging data, semantic redundancy, which is the presence of similar or repetitive information, can occur due to the presence of multiple images that have highly similar presentations for the disease of interest. Also, the common use of augmentation methods to generate variety in DL training could limit performance when indiscriminately applied to such data. We therefore hypothesize that semantic redundancy tends to lower performance and limit generalizability to unseen data, and we question its impact on classifier performance even with large datasets. We propose an entropy-based sample scoring approach to identify and remove semantically redundant training data and demonstrate, using the publicly available NIH chest X-ray dataset, that the model trained on the resulting informative subset of training data significantly outperforms the model trained on the full training set during both internal (recall: 0.7164 vs 0.6597, p < 0.05) and external testing (recall: 0.3185 vs 0.2589, p < 0.05). Our findings emphasize the importance of information-oriented training sample selection as opposed to the conventional practice of using all available training data.
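    The abstract does not spell out the scoring function, so the sketch below shows one plausible form of entropy-based sample scoring rather than the authors' exact method: each chest X-ray is scored by the Shannon entropy of its grayscale-intensity histogram, and only the most informative fraction is retained for training. File names are placeholders.

```python
# One plausible form of entropy-based sample scoring (the paper's exact scoring
# function is not reproduced here): score each chest X-ray by the Shannon entropy
# of its grayscale-intensity histogram and keep only the most informative
# fraction for training. File names below are placeholders.
import numpy as np
from PIL import Image

def intensity_entropy(path: str, bins: int = 256) -> float:
    """Shannon entropy (bits) of the image's normalized intensity histogram."""
    pixels = np.asarray(Image.open(path).convert("L"), dtype=np.float64)
    hist, _ = np.histogram(pixels, bins=bins, range=(0, 255))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def select_informative(paths, keep_fraction=0.7):
    """Return the keep_fraction of images with the highest entropy scores."""
    ranked = sorted(paths, key=intensity_entropy, reverse=True)
    return ranked[: int(len(ranked) * keep_fraction)]

if __name__ == "__main__":
    files = ["cxr_0001.png", "cxr_0002.png", "cxr_0003.png"]  # placeholder paths
    print(select_informative(files, keep_fraction=0.66))
```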

  • Article Type: Journal Article
    BACKGROUND: Fragility analysis is a method of further characterizing outcomes in terms of the stability of statistical findings. This study assesses the statistical fragility of recent randomized controlled trials (RCTs) evaluating robotic-assisted versus conventional total knee arthroplasty (RA-TKA versus C-TKA).
    METHODS: We queried PubMed for RCTs comparing alignment, function, and outcomes between RA-TKA and C-TKA. Fragility index (FI) and reverse fragility index (RFI) (collectively, "FI") were calculated for dichotomous outcomes as the number of outcome reversals needed to change statistical significance. Fragility quotient (FQ) was calculated by dividing the FI by the sample size for that outcome event. Median FI and FQ were calculated for all outcomes collectively as well as for each individual outcome. Subanalyses were performed to assess FI and FQ based on outcome event type and statistical significance, as well as study loss to follow-up and year of publication.
    RESULTS: The overall median FI was 3.0 (interquartile range [IQR] 1.0 to 6.3) and the median RFI was 3.0 (IQR 2.0 to 4.0). The overall median FQ was 0.027 (IQR 0.012 to 0.050). Loss to follow-up was greater than FI for 23 of the 38 outcomes assessed.
    CONCLUSIONS: A small number of alternative outcomes is often enough to reverse the statistical significance of findings in RCTs evaluating dichotomous outcomes in RA-TKA versus C-TKA. We recommend reporting FI and FQ alongside P values to improve the interpretability of RCT results.
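    A minimal sketch of how a fragility index and fragility quotient can be computed for a dichotomous outcome is shown below, assuming the conventional definition (flip outcomes one at a time in the arm with fewer events until a two-sided Fisher's exact test crosses p = 0.05). The trial counts are invented for illustration and are not taken from the reviewed RCTs.

```python
# Illustrative sketch of a fragility index (FI) for a significant dichotomous
# outcome, assuming the conventional definition: flip outcomes one at a time in
# the arm with fewer events until a two-sided Fisher's exact test reaches
# p >= 0.05. The fragility quotient (FQ) divides FI by the total sample size.
# The example counts are invented, not taken from the reviewed RCTs.
from scipy.stats import fisher_exact

def fragility_index(e1: int, n1: int, e2: int, n2: int, alpha: float = 0.05) -> int:
    """FI for events/total of e1/n1 versus e2/n2 (result must be significant)."""
    if fisher_exact([[e1, n1 - e1], [e2, n2 - e2]])[1] >= alpha:
        raise ValueError("Result is not significant; a reverse FI would apply.")
    flip_arm_1 = e1 <= e2  # add events to the arm with fewer events
    max_flips = (n1 - e1) if flip_arm_1 else (n2 - e2)
    for flips in range(1, max_flips + 1):
        if flip_arm_1:
            table = [[e1 + flips, n1 - e1 - flips], [e2, n2 - e2]]
        else:
            table = [[e1, n1 - e1], [e2 + flips, n2 - e2 - flips]]
        if fisher_exact(table)[1] >= alpha:
            return flips
    raise RuntimeError("Significance never lost; check the inputs.")

if __name__ == "__main__":
    e1, n1, e2, n2 = 4, 100, 16, 100  # hypothetical event counts per arm
    fi = fragility_index(e1, n1, e2, n2)
    print(f"FI = {fi}, FQ = {fi / (n1 + n2):.3f}")
```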

  • DOI:
    Article Type: Preprint
    Deep learning (DL) has demonstrated its innate capacity to independently learn hierarchical features from complex and multi-dimensional data. A common understanding is that its performance scales up with the amount of training data. Another data attribute is the inherent variety. It follows, therefore, that semantic redundancy, which is the presence of similar or repetitive information, would tend to lower performance and limit generalizability to unseen data. In medical imaging data, semantic redundancy can occur due to the presence of multiple images that have highly similar presentations for the disease of interest. Further, the common use of augmentation methods to generate variety in DL training may be limiting performance when applied to semantically redundant data. We propose an entropy-based sample scoring approach to identify and remove semantically redundant training data. We demonstrate using the publicly available NIH chest X-ray dataset that the model trained on the resulting informative subset of training data significantly outperforms the model trained on the full training set, during both internal (recall: 0.7164 vs 0.6597, p<0.05) and external testing (recall: 0.3185 vs 0.2589, p<0.05). Our findings emphasize the importance of information-oriented training sample selection as opposed to the conventional practice of using all available training data.

  • Article Type: Journal Article
    OBJECTIVE: To assess the language used by systematic review authors to emphasize that statistically nonsignificant results show meaningful differences, and to determine whether the magnitude of these treatment effects was distinct from nonsignificant results that authors interpreted as not different.
    METHODS: We screened Cochrane reviews published between 2017 and 2022 for statistically nonsignificant effect estimates that authors presented as meaningful differences. We classified interpretations qualitatively and assessed them quantitatively by calculating the areas under the curve of the portions of confidence intervals exceeding the null or a minimal important difference, indicating one intervention's greater effect.
    RESULTS: In 2,337 reviews, we detected 139 cases where authors emphasized meaningful differences in nonsignificant results. Authors commonly used qualifying words to express uncertainty (66.9%). Sometimes (26.6%), they made absolute claims about one intervention's greater benefit or harm without acknowledging statistical uncertainty. The area-under-the-curve analyses indicated that some authors may overstate the importance of nonsignificant differences, whereas others may overlook meaningful differences in nonsignificant effect estimates.
    CONCLUSIONS: Nuanced interpretations of statistically nonsignificant results were rare in Cochrane reviews. Our study highlights the need for a more nuanced approach by systematic review authors when interpreting statistically nonsignificant effect estimates.
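    The area-under-the-curve idea described in the methods above can be illustrated under a normal approximation: the fraction of the confidence density lying beyond the null (or beyond a minimal important difference) quantifies how much of a nonsignificant interval still favours one intervention. The sketch below is an illustration of that general idea, not the authors' exact computation; the example estimate, interval, and MID are hypothetical.

```python
# Illustration of the general idea (not the authors' exact computation): under a
# normal approximation, the fraction of the confidence density lying beyond the
# null value, or beyond a minimal important difference (MID), measures how much
# of a nonsignificant interval favours one intervention. The estimate, interval,
# and MID below are hypothetical.
from scipy.stats import norm

def area_beyond(estimate: float, lower: float, upper: float, threshold: float = 0.0) -> float:
    """Fraction of the (normal) confidence density above `threshold`, given a
    point estimate and its 95% confidence limits."""
    se = (upper - lower) / (2 * norm.ppf(0.975))
    return float(norm.sf((threshold - estimate) / se))

if __name__ == "__main__":
    # Hypothetical mean difference of 1.2 with 95% CI -0.3 to 2.7 (nonsignificant).
    print(f"fraction beyond the null = {area_beyond(1.2, -0.3, 2.7, 0.0):.2f}")  # ~0.94
    print(f"fraction beyond an MID=1 = {area_beyond(1.2, -0.3, 2.7, 1.0):.2f}")  # ~0.60
```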

  • Article Type: Journal Article
    The term "statistical significance," ubiquitous in the medical literature, is often misinterpreted, as is the "p-value" from which it stems. This article explores the implications of results that are numerically positive (e.g., those in the treatment arm do better on average) but not statistically significant. This lack of statistical significance is sometimes interpreted as strong, even decisive, evidence against an effect, without due consideration of other factors. Three influential articles on hydroxychloroquine (HCQ) as a treatment for COVID-19 are illustrative. They all involve numerically positive results that were not statistically significant and that were misinterpreted as strong evidence against HCQ's efficacy. These and related considerations raise concerns about the reliability of academic/medical reasoning around COVID-19 treatments, and more generally, as well as about the potential for bias stemming from conflicts of interest.

  • Article Type: Journal Article
    Clinical trials routinely have patients lost to follow up. We propose a methodology to understand their possible effect on the results of statistical tests by altering the concept of the fragility index to treat the outcomes of observed patients as fixed but incorporate the potential outcomes of patients lost to follow up as random and subject to modification.
    We reanalyse the statistical results of three clinical trials on coronary artery bypass grafting (CABG) to study the possible effect of patients lost to follow up on the statistical significance of the treatment effect. To do so, we introduce the LTFU-aware fragility indices as a measure of the robustness of a clinical trial's statistical results with respect to patients lost to follow up.
    The analyses illustrate that clinical trials can either be completely robust to the outcomes of patients lost to follow up, extremely sensitive to the outcomes of patients lost to follow up, or in an intermediate state. When a clinical trial is in an intermediate state, the LTFU-aware fragility indices provide an interpretable measure to quantify the degree of fragility or robustness.
    The LTFU-aware fragility indices allow researchers to rigorously explore the outcomes of patients who are lost to follow up, when their data are of the appropriate kind. The LTFU-aware fragility indices are sensitivity measures in a way that the original fragility index is not.
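    The sketch below gives a simplified illustration of the underlying idea, not the authors' exact LTFU-aware fragility index: observed outcomes are held fixed while every possible outcome pattern for the patients lost to follow-up is enumerated, showing whether the significance conclusion of Fisher's exact test can change. All counts are invented.

```python
# Simplified illustration of the underlying idea, not the authors' exact
# LTFU-aware fragility index: hold observed outcomes fixed, enumerate every
# possible outcome pattern for the patients lost to follow-up, and check whether
# the significance conclusion of a two-sided Fisher's exact test can change.
# All counts are invented.
from itertools import product
from scipy.stats import fisher_exact

def ltfu_sensitivity(e1, n1, m1, e2, n2, m2, alpha=0.05):
    """e = observed events, n = observed patients, m = lost to follow-up, per arm.
    Returns the set of reachable conclusions over all possible LTFU outcomes."""
    conclusions = set()
    for x1, x2 in product(range(m1 + 1), range(m2 + 1)):
        table = [[e1 + x1, (n1 + m1) - (e1 + x1)],
                 [e2 + x2, (n2 + m2) - (e2 + x2)]]
        p = fisher_exact(table)[1]
        conclusions.add("significant" if p < alpha else "nonsignificant")
    return conclusions

if __name__ == "__main__":
    reachable = ltfu_sensitivity(e1=12, n1=150, m1=6, e2=25, n2=148, m2=8)
    if len(reachable) == 1:
        print(f"robust to loss to follow-up: always {reachable.pop()}")
    else:
        print("conclusion depends on the outcomes of patients lost to follow-up")
```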

  • Article Type: Journal Article
    BACKGROUND: The Fragility Index (FI) and Reverse Fragility Index are powerful tools to supplement the P value in evaluation of randomized clinical trial (RCT) outcomes. These metrics are defined as the number of patients needed to change the significance level of an outcome. The purpose of this study was to calculate these metrics for published RCTs in total joint arthroplasty (TJA).
    METHODS: We performed a systematic review of RCTs in TJA over the last decade. For each study, we calculated the FI (for statistically significant outcomes) or Reverse Fragility Index (for nonstatistically significant outcomes) for all dichotomous, categorical outcomes. We also used the Pearson correlation coefficient to evaluate publication-level variables.
    RESULTS: We included 104 studies with 473 outcomes; 92 were significant, and 381 were nonstatistically significant. The median FI was 6 overall and 4 and 7 for significant and nonsignificant outcomes, respectively. There was a positive correlation between FI and sample size (R = 0.14, P = .002) and between FI and P values (R = 0.197, P = .000012).
    CONCLUSIONS: This study is the largest evaluation of FI in orthopedics literature to date. We found a median FI that was comparable to or higher than FIs calculated in other orthopedic subspecialties. Although the mean and median FIs were greater than the 2 recommended by the American Academy of Orthopaedic Surgeons Clinical Practice Guidelines to demonstrate strong evidence, a large percentage of studies have an FI < 2. This suggests that the TJA literature is on par with or slightly better than other subspecialties, but improvements must be made.
    LEVEL OF EVIDENCE: Level I; Systematic Review.

  • Article Type: Journal Article
    OBJECTIVE: Language that implies a conclusion not supported by the evidence is common in the medical literature. The hypothesis of the present study was that medical journal publications are more likely to use misleading language for the interpretation of a demonstrated null (i.e. chance or not statistically significant) effect than a demonstrated real (i.e. statistically significant) effect.
    METHODS: This was an observational study of the medical literature with a systematic sampling method. Articles published in The Journal of the American Medical Association, The Lancet and The New England Journal of Medicine over the last two decades were eligible. The language used around the P-value was assessed for misleadingness (i.e. either suggesting an effect existed when a real effect did not exist or vice versa).
    RESULTS: There were 228 unique manuscripts examined, containing 400 statements interpreting a P-value proximate to 0.05. The P-value was between 0.036 and 0.050 for 303 (75.8%) statements and between 0.050 and 0.064 for 97 (24.3%) statements. Forty-four (11%) of the statements were misleading. There were 40 (41.2%) false-positive sentences, implying statistical significance when the P-value was >0.05, and four (1.3%) false-negative sentences, implying no statistical significance when the P-value was <0.05 (relative risk 31.2; 95% confidence interval 11.5-85.1; P < 0.0001). The proportion of included manuscripts containing at least one misleading sentence was 16.2% (95% confidence interval 12.0-21.6).
    CONCLUSIONS: Among a random selection of sentences in prestigious journals describing P-values close to 0.05, about 1 in 10 was misleading (n = 44, 11%), and misleading interpretation was more prevalent when the P-value was above 0.05 than when it was below 0.05. Researchers, clinicians, and editors are advised to interpret P-values in keeping with their context and purpose.
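    The quoted relative risk can be reproduced from the reported counts (40 of 97 misleading statements when the P-value was just above 0.05 versus 4 of 303 when it was just below) using the standard large-sample log-normal confidence interval; the short check below is illustrative and is not the authors' code.

```python
# Quick arithmetic check, using the standard large-sample log-normal interval
# (not the authors' code), of the relative risk quoted above: 40/97 misleading
# statements when the P-value was just above 0.05 versus 4/303 when just below.
from math import exp, log, sqrt

def relative_risk(a: int, n1: int, b: int, n2: int, z: float = 1.96):
    """Relative risk of a/n1 versus b/n2 with a 95% confidence interval."""
    rr = (a / n1) / (b / n2)
    se_log = sqrt(1 / a - 1 / n1 + 1 / b - 1 / n2)
    return rr, exp(log(rr) - z * se_log), exp(log(rr) + z * se_log)

if __name__ == "__main__":
    rr, lo, hi = relative_risk(40, 97, 4, 303)
    print(f"RR = {rr:.1f} (95% CI {lo:.1f} to {hi:.1f})")  # about 31.2 (11.5 to 85)
```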
