sample

  • Article type: Journal Article
    BACKGROUND: Large language models (LLMs) have the potential to support promising new applications in health informatics. However, practical data on sample size considerations for fine-tuning LLMs to perform specific tasks in biomedical and health policy contexts are lacking.
    OBJECTIVE: This study aims to evaluate sample size and sample selection techniques for fine-tuning LLMs to support improved named entity recognition (NER) for a custom data set of conflicts of interest disclosure statements.
    METHODS: A random sample of 200 disclosure statements was prepared for annotation. All "PERSON" and "ORG" entities were identified by each of the 2 raters, and once appropriate agreement was established, the annotators independently annotated an additional 290 disclosure statements. From the 490 annotated documents, 2500 stratified random samples in different size ranges were drawn. The 2500 training set subsamples were used to fine-tune a selection of language models across 2 model architectures (Bidirectional Encoder Representations from Transformers [BERT] and Generative Pre-trained Transformer [GPT]) for improved NER, and multiple regression was used to assess the relationship between sample size (sentences), entity density (entities per sentence [EPS]), and trained model performance (F1-score). Additionally, single-predictor threshold regression models were used to evaluate the possibility of diminishing marginal returns from increased sample size or entity density.
    RESULTS: Fine-tuned models ranged in topline NER performance from F1-score=0.79 to F1-score=0.96 across architectures. Two-predictor multiple linear regression models were statistically significant with multiple R2 ranging from 0.6057 to 0.7896 (all P<.001). EPS and the number of sentences were significant predictors of F1-scores in all cases (P<.001), except for the GPT-2_large model, where EPS was not a significant predictor (P=.184). Model thresholds indicate points of diminishing marginal return from increased training data set sample size measured by the number of sentences, with point estimates ranging from 439 sentences for RoBERTa_large to 527 sentences for GPT-2_large. Likewise, the threshold regression models indicate a diminishing marginal return for EPS with point estimates between 1.36 and 1.38.
    CONCLUSIONS: Relatively modest sample sizes can be used to fine-tune LLMs for NER tasks applied to biomedical text, and training data entity density should representatively approximate entity density in production data. Training data quality and a model architecture's intended use (text generation vs text processing or classification) may be as, or more, important than training data volume and model parameter size.
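    The single-predictor threshold regression used to locate the point of diminishing returns can be sketched as a grid search over candidate breakpoints. This is an illustrative reconstruction only; the abstract does not specify the exact estimator, and `fit_threshold_regression` is a hypothetical helper:

    ```python
    import numpy as np

    def fit_threshold_regression(x, y):
        """Locate the breakpoint of a two-segment linear model by grid
        search over observed predictor values, minimizing total squared
        error. Illustrative sketch only: the abstract does not state
        the exact threshold-regression estimator used."""
        best_tau, best_sse = None, np.inf
        for tau in np.unique(x)[1:-1]:            # candidate breakpoints
            sse = 0.0
            for mask in (x <= tau, x > tau):
                # ordinary least squares on each segment
                A = np.column_stack([x[mask], np.ones(mask.sum())])
                coef, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
                resid = y[mask] - A @ coef
                sse += float(resid @ resid)
            if sse < best_sse:
                best_tau, best_sse = float(tau), sse
        return best_tau
    ```

    On synthetic F1-versus-sentence-count data that plateaus, the returned breakpoint approximates the sample size beyond which additional sentences stop paying off.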

  • Article type: Journal Article
    No abstract available.

  • Article type: Journal Article
    Equine strongylid parasites are ubiquitous around the world and are main targets of parasite control programmes. In recent years, automated fecal egg counting systems based on image analysis have become available, allowing for collection and analysis of large-scale egg count data. This study aimed to evaluate equine strongylid fecal egg count (FEC) data generated with an automated system over three years in the US, with specific attention to seasonal and regional trends in egg count magnitude and sampling activity. Five US regions were defined: North East, South East, North Central, South Central and West. The data set included state, region and zip code for each FEC. The number of FECs falling in each of the following categories was recorded: (1) 0 eggs per gram (EPG), (2) 1-200 EPG, (3) 201-500 EPG and (4) >500 EPG. The data included 58 329 FECs. A fixed effects model was constructed fitting the number of samples analysed per month, year and region, and a mixed effects model was constructed to fit the number of FECs falling in each of the 4 egg count categories defined above. The overall proportion of horses responsible for 80% of the total FEC output was 18.1%, and this was consistent across years, months and all regions except West, where the proportion was closer to 12%. Statistical analyses showed significant seasonal trends and regional differences in sampling frequency and FEC category. The data demonstrated that veterinarians tended to follow a biphasic pattern when monitoring strongylid FECs in horses, regardless of location.
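    The headline statistic that 18.1% of horses shed 80% of the total egg output is a cumulative-share calculation over sorted FECs. The sketch below shows one way to compute it; `share_responsible_for` is a hypothetical helper, not code from the study:

    ```python
    import numpy as np

    def share_responsible_for(egg_counts, target=0.80):
        """Return the fraction of animals that together account for
        `target` of the total egg output. Hypothetical helper mirroring
        the abstract's statistic that 18.1% of horses shed 80% of eggs."""
        counts = np.sort(np.asarray(egg_counts, dtype=float))[::-1]
        cum_share = np.cumsum(counts) / counts.sum()
        # first index at which the cumulative share reaches the target
        n_needed = int(np.searchsorted(cum_share, target)) + 1
        return n_needed / counts.size
    ```

    For example, with two heavy shedders among ten horses, the function returns 0.2: the top 20% of animals cross the 80% output threshold.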

  • Article type: Journal Article
    The United States (US) Medicare claims files are valuable sources of national healthcare utilization data with over 45 million beneficiaries each year. Due to their massive sizes and costs involved in obtaining the data, a method of randomly drawing a representative sample for retrospective cohort studies with multi-year follow-up is not well-documented.
    To present a method to construct longitudinal patient samples from Medicare claims files that are representative of Medicare populations each year.
    Retrospective cohort and cross-sectional designs.
    US Medicare beneficiaries with diabetes over a 10-year period.
    Medicare Master Beneficiary Summary Files were used to identify eligible patients for each year over a 10-year period. We targeted a sample of ~900,000 patients per year. The first year's sample was stratified by county and race/ethnicity (white vs. minority) and targeted at least 250 patients in each stratum, with the remaining sample allocated proportionally to county population size and with oversampling of minorities. Patients who were alive, did not move between counties, and stayed enrolled in Medicare fee-for-service (FFS) were retained in the sample for subsequent years. Non-retained patients (those who died or were dropped from Medicare) were replaced with a sample of patients in their first year of Medicare FFS eligibility or patients who moved into a sampled county during the previous year.
    The resulting sample contains an average of 899,266 ± 408 patients each year over the 10-year study period and closely matches population demographics and chronic conditions. For all years in the sample, the weighted average sample age and the population average age differ by <0.01 years; the proportion white is within 0.01%; and the proportion female is within 0.08%. Rates of 21 comorbidities estimated from the samples for all 10 years were within 0.12% of the population rates. Longitudinal cohorts based on samples also closely resembled the cohorts based on populations remaining after 5- and 10-year follow-up.
    This sampling strategy can be easily adapted to other projects that require random samples of Medicare beneficiaries or other national claims files for longitudinal follow-up with possible oversampling of sub-populations.
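    The allocation rule the methods describe (a per-stratum floor of 250 patients, with the remainder distributed proportionally to population) can be sketched as follows. The function name and the simplification (minority oversampling omitted) are assumptions, not the study's code:

    ```python
    import math

    def allocate_sample(stratum_pops, total_target=900_000, floor=250):
        """Per-stratum sample allocation: guarantee `floor` patients in
        every stratum, then split the remaining target proportionally to
        stratum population. Hypothetical sketch of the rule described in
        the abstract; minority oversampling is omitted for brevity."""
        alloc = {k: min(floor, pop) for k, pop in stratum_pops.items()}
        remaining = total_target - sum(alloc.values())
        pool = sum(stratum_pops.values())
        for k, pop in stratum_pops.items():
            extra = math.floor(remaining * pop / pool)
            # never allocate more patients than the stratum contains
            alloc[k] = min(pop, alloc[k] + extra)
        return alloc
    ```

    With two strata of 1,000 and 9,000 people and a target of 1,000, each stratum first receives its floor of 250, and the remaining 500 splits 50/450 by population share.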

  • Article type: Journal Article
    A honeybee colony, as a super-organism, is regulated through age-polyethism. A honeybee worker's age is considered by means of a chronological and biological approach. The biological age is estimated with physiologically related biological markers, e.g., total hemolymph protein content (THP) and hypopharyngeal gland size (HGs), which also vary seasonally. Contemporary insights into the age-related spatial distribution of workers within the hive nest space regarding biological age are insufficiently clarified. This study aimed to monitor changes in selected physiological markers during the entire season in relation to worker age and their spatial position in the hive nest. THP content and HG size analysis was performed in nine colonies for the entire season to compare the physiological markers within and among the groups of the workers whose ages were known and who were sampled in different hive parts. Seasonal impact on the biomarkers' development was confirmed in known-age workers. In the case of HGs, this impact was the most apparent in 4- and 5-week-old workers. For THP, the seasonal impact was the most obvious in 2-week-old workers. The highest THP was found in 1- and 2-week-old workers during the entire season. Biologically younger workers of the same age were located predominantly in upper hive parts consistently throughout the year and vice versa. These workers showed significantly higher THP in comparison with those sampled below. Regarding the chronological age, the downwards, spatially shifting mechanism of workers within the hive nest while they aged was characterized. We recommend storage of diluted hemolymph samples up to one month before performing an assay if necessary. The physiological context, relation to division of labor and benefits for beekeeping practices are discussed.

  • Article type: Journal Article
    BACKGROUND: Medical image segmentation is an important processing step in most medical image analyses, so high accuracy and robustness are required. Current deep neural network-based medical segmentation methods perform well on images with balanced foreground and background, but after multiple convolutions they lose the features of small targets in images with imbalanced foreground and background.
    METHODS: To retain the features of small targets in a deep network, we proposed a new medical image segmentation model based on the U-Net with squeeze-and-excitation and attention modules that form a spiral closed path, called the Spiral Squeeze-and-Excitation and Attention NET (SEA-NET) in this paper. The segmentation model uses squeeze-and-excitation modules to adjust channel information to enhance useful information, and uses attention modules to adjust the spatial information of the feature map to highlight the target area for small-target segmentation during up-sampling. Deep semantic information is integrated into the shallow feature map by the attention model, so the deep semantic information cannot be scattered by continuous up-sampling. We used a cross-entropy loss + Tversky loss function for fast convergence and good handling of imbalanced data sets. Our proposed SEA-NET was tested on the brain MRI dataset LPBA40 and on peripheral blood smear images.
    RESULTS: On brain MRI data, the average value of the Dice coefficient we obtained reached 98.1[Formula: see text]. On the peripheral blood smear dataset, our proposed model has a good segmentation effect on adhesion cells.
    CONCLUSIONS: The experimental results showed that the proposed SEA-NET performed better than U-Net, U-Net++, etc. in medical image segmentation.
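    The Tversky term of the compound loss mentioned in the methods can be illustrated with a short NumPy version. The alpha/beta weights shown are conventional defaults, not values reported for SEA-NET:

    ```python
    import numpy as np

    def tversky_loss(pred, target, alpha=0.3, beta=0.7, eps=1e-7):
        """Tversky loss on soft predictions in [0, 1]. alpha weighs false
        positives and beta false negatives; alpha = beta = 0.5 recovers
        the Dice loss. The alpha/beta values here are conventional
        defaults, not parameters reported for SEA-NET."""
        pred = np.ravel(pred).astype(float)
        target = np.ravel(target).astype(float)
        tp = float((pred * target).sum())
        fp = float((pred * (1.0 - target)).sum())
        fn = float(((1.0 - pred) * target).sum())
        return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    ```

    In training, this term would be summed with a per-pixel cross-entropy loss, matching the compound objective the abstract describes; weighting false negatives more heavily (beta > alpha) is what helps with small, rare foreground targets.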

  • Article type: Journal Article
    With regard to the use and transfer of research participants' personal information, samples and other data nationally and internationally, it is necessary to construct a data management plan. One of the key objectives of a data management plan is to explain the governance of clinical, biochemical, laboratory, molecular and other sources of data according to the regulations and policies of all relevant stakeholders. It also seeks to describe the processes involved in protecting the personal information of research participants, especially those from vulnerable populations. In most data management plans, the framework therefore consists of describing the collection, organization, use, storage, contextualization, preservation, sharing and access of/to research data and/or samples. It may also include a description of data management resources, including those associated with analyzed samples, and identifies responsible parties for the establishment, implementation and overall management of the data management strategy. Importantly, the data management plan serves to highlight potential problems with the collection, sharing, and preservation of research data. However, there are different forms of data management plans and requirements may vary due to funder guidelines and the nature of the study under consideration. This paper leverages the detailed data management plans constructed for the 'NESHIE study' and is a first attempt at providing a comprehensive template applicable to research focused on vulnerable populations, particularly those within LMICs, that includes a multi-omics approach to achieve the study aims. More particularly, this template, available for download as a supplementary document, provides a modifiable outline for future projects that involve similar sensitivities, whether in clinical research or clinical trials. It includes a description of the management not only of the data generated through standard clinical practice, but also of that generated through the analysis of a variety of samples collected from research participants and analyzed using multi-omics approaches.

  • Article type: Journal Article
    The role of community conservation areas for large mammals is rarely evaluated. We investigated the species richness and frequency of sightings of large mammals in the Dodola Community Conservation Area. The study area was stratified into three habitat types, and 49 line transects were laid (27 in dry evergreen Afromontane forest, 20 in sub-afroalpine habitat, and 2 in plantation forest) based on the topography, land use, and vegetation cover of the study area. A total of 24 species of large mammals were identified and recorded in the study area. Though the community conservation area is home to diverse species of mammals, including some endemic and endangered ones such as mountain nyala and Bale monkey, human encroachment, agriculture, and overgrazing are prominent in the area, putting huge pressure on flora and fauna. Therefore, we recommend that the participatory approach be strengthened to ensure sustainable coexistence between people and wildlife.

  • Article type: Journal Article
    Surface-enhanced Raman spectroscopy (SERS) is a highly sensitive technique that can assist in trace analysis for biomedical, diagnostic, and environmental applications. However, a major limitation of SERS is surface contamination of the substrates used, which can complicate spectral reproducibility, limits of detection, and detection of unknown analytes. This is especially prevalent with commercially available substrates, as shipping under a controlled and clean environment is difficult. Here we report a method that uses dilute bleach solutions to remove surface contamination from commercially available substrates consisting of gold-coated nanopillar arrays while maintaining their functionality. The results show that this method can be used to remove background signals associated with typical surface contamination in commercially available substrates as well as to remove thiolated self-assembled monolayers (SAMs). Results indicate the bleach oxidizes the surface contaminants, which can then be easily washed away. Although the metallic surface also becomes oxidized in this process, the surface can be reduced without loss of SERS activity. The SERS intensity of SAMs improved following bleach treatment across all concentrations studied.

  • Article type: Letter
    No abstract available.