Datasets as Topic

  • Article type: Journal Article
    BACKGROUND: The prevalence of type 2 diabetes mellitus (DM) and pre-diabetes mellitus (pre-DM) has been increasing among youth in recent decades in the United States, prompting an urgent need for understanding and identifying their associated risk factors. Such efforts, however, have been hindered by the lack of easily accessible youth pre-DM/DM data.
    OBJECTIVE: We aimed to first build a high-quality, comprehensive epidemiological data set focused on youth pre-DM/DM. Subsequently, we aimed to make these data accessible by creating a user-friendly web portal to share them and the corresponding codes. Through this, we hope to address this significant gap and facilitate youth pre-DM/DM research.
    METHODS: Building on data from the National Health and Nutrition Examination Survey (NHANES) from 1999 to 2018, we cleaned and harmonized hundreds of variables relevant to pre-DM/DM (fasting plasma glucose level ≥100 mg/dL or glycated hemoglobin ≥5.7%) for youth aged 12-19 years (N=15,149). We identified individual factors associated with pre-DM/DM risk using bivariate statistical analyses and predicted pre-DM/DM status using our Ensemble Integration (EI) framework for multidomain machine learning. We then developed a user-friendly web portal named Prediabetes/diabetes in youth Online Dashboard (POND) to share the data and codes.
    RESULTS: We extracted 95 variables potentially relevant to pre-DM/DM risk organized into 4 domains (sociodemographic, health status, diet, and other lifestyle behaviors). The bivariate analyses identified 27 significant correlates of pre-DM/DM (P<.001, Bonferroni adjusted), including race or ethnicity, health insurance, BMI, added sugar intake, and screen time. Among these factors, 16 were also identified based on the EI methodology (Fisher P of overlap=7.06×10^-6). In addition to those, the EI approach identified 11 additional predictive variables, including some known (eg, meat and fruit intake and family income) and less recognized factors (eg, number of rooms in homes). The factors identified in both analyses spanned all 4 of the domains mentioned. These data and results, as well as other exploratory tools, can be accessed on POND (https://rstudio-connect.hpc.mssm.edu/POND/).
    CONCLUSIONS: Using NHANES data, we built one of the largest public epidemiological data sets for studying youth pre-DM/DM and identified potential risk factors using complementary analytical approaches. Our results align with the multifactorial nature of pre-DM/DM with correlates across several domains. Also, our data-sharing platform, POND, facilitates a wide range of applications to inform future youth pre-DM/DM studies.
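    The two statistical quantities named in the RESULTS paragraph (a Bonferroni-adjusted threshold and a Fisher P for the overlap between two variable sets) can be illustrated with a minimal sketch. The family-wise alpha of 0.05, the universe of 95 variables, and the one-sided hypergeometric formulation are assumptions made here for illustration; the abstract does not state how the authors configured either calculation. The function names are hypothetical.

```python
from math import comb


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def overlap_p_value(universe: int, set_a: int, set_b: int, overlap: int) -> float:
    """One-sided hypergeometric (Fisher) P of observing at least `overlap`
    shared items when `set_b` items are drawn from `universe` items,
    `set_a` of which are marked."""
    total = comb(universe, set_b)
    upper = min(set_a, set_b)
    tail = sum(
        comb(set_a, k) * comb(universe - set_a, set_b - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Assumed inputs: 95 tested variables, alpha = 0.05 (not stated in the abstract).
    print(bonferroni_threshold(0.05, 95))
    # 27 bivariate correlates, 16 + 11 = 27 EI variables, 16 shared (from the abstract).
    print(overlap_p_value(95, 27, 27, 16))
```

    The exact tail sum uses integer binomial coefficients, so it avoids floating-point underflow for small P values; the result need not match the published figure if the authors used a different universe or test direction.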

  • Article type: Journal Article
    Systematic reviews and meta-analyses typically require significant time and effort. Machine learning models have the potential to enhance screening efficiency in these processes. To effectively evaluate such models, fully labeled datasets (detailing all records screened by humans and their labeling decisions) are imperative. This paper presents the creation of a comprehensive dataset for a systematic review of treatments for Borderline Personality Disorder, as reported by Oud et al. (2018), for running a simulation study. The authors adhered to the PRISMA guidelines and published both the search query and the list of included records, but the complete dataset with all labels was not disclosed. We replicated their search and, facing the absence of initial screening data, introduced a Noisy Label Filter (NLF) procedure using active learning to validate noisy labels. Following the NLF application, no further relevant records were found. A simulation study employing the reconstructed dataset demonstrated that active learning could reduce screening time by 82.30% compared to random reading. The paper discusses potential causes for discrepancies, provides recommendations, and introduces a decision tree to assist in reconstructing datasets for the purpose of running simulation studies.

  • Article type: Journal Article
    Plant phenotyping experiments are conducted under a variety of experimental parameters and settings for diverse purposes. The data they produce is heterogeneous, complicated, often poorly documented and, as a result, difficult to reuse. Meeting societal needs (nutrition, crop adaptation and stability) requires more efficient methods toward data integration and reuse. In this work, we examine what "making data FAIR" entails, and investigate the benefits and the struggles not only of reusing FAIR data, but also making data FAIR using genotype by environment and QTL by environment interactions for developmental traits in potato as a case study. We assume the role of a scientist discovering a phenotypic dataset on a FAIR data point, verifying the existence of related datasets with environmental data, acquiring both and integrating them. We report and discuss the challenges and the potential for reusability and reproducibility of FAIRifying existing datasets, using metadata standards such as MIAPPE, that were encountered in this process.

  • Article type: Historical Article
    This paper presents a study on the dynamics of sentiment polarisation in the active online discussion communities formed around a controversial topic: immigration. Using a collection of tweets in the Swedish language from 2012 to 2019, we track the development of the communities and their sentiment polarisation trajectories over time and in the context of an exogenous shock represented by the European refugee crisis in 2015. To achieve the goal of the study, we apply methods of network and sentiment analysis to map users' interactions in the network communities and quantify users' sentiment polarities. The results of the analysis give little evidence for users' polarisation in the network and its communities, as well as suggest that the crisis had a limited effect on the polarisation dynamics on this social media platform. Yet, we notice a shift towards more negative tonality of users' sentiments after the crisis and discuss possible explanations for the above-mentioned observations.

  • Article type: Evaluation Study
    Increasing domestic rapeseed production is an important national goal in China. Researchers often use tools such as crop models to determine optimum management practices for new varieties to increase production. The CROPGRO-Canola model has not been used to simulate rapeseed in China. The overall goal of this work was to identify key inputs to the CROPGRO-Canola model for calibration with limited datasets in the Yangtze River basin. First, we conducted a global sensitivity analysis to identify key genetic and soil inputs that have a large effect on simulated days to flowering, days to maturity, yield, above-ground biomass, and maximum leaf area index. The extended Fourier amplitude test method (EFAST) sensitivity analysis was performed for a single year at 8 locations in the Yangtze River basin (spatial analysis) and for seven years at a location in Wuhan, China (temporal analysis). The EFAST software was run for 4520 combinations of input parameters for each site and year, resulting in a sensitivity index for each input parameter. Parameters were ranked using the top-down concordance method to determine relative sensitivity. Results indicated that the model outputs of days to flowering, days to maturity, yield, above-ground biomass, and maximum leaf area index were most sensitive to parameters that affect the duration of critical growth periods, such as emergence to flowering, and temperature response to these stages, as well as parameters that affect total biomass at harvest. The five model outputs were also sensitive to several soil parameters, including drained upper and lower limit (SDUL and SLLL) and drainage rate (SLDR). The sensitivity of parameters was generally spatially and temporally stable. The results of the sensitivity analysis were used to calibrate and evaluate the model for a single rapeseed experiment in Wuhan, China. The model was calibrated using two seasons and evaluated using three seasons of data. Excellent nRMSE values were obtained for days to flowering (≤1.71%), days to maturity (≤1.48%), yield (≤9.96%), and above-ground biomass (≤9.63%). The results of this work can be used to guide researchers on model calibration and evaluation across the Yangtze River basin in China.
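    The nRMSE figures reported above can be read against a short sketch of how a normalized root mean square error is typically computed. Normalizing by the mean of the observed values is a common convention and is an assumption here; the abstract does not say which normalization the authors used. The sample values are hypothetical.

```python
from math import sqrt


def nrmse_percent(observed, simulated):
    """RMSE between simulated and observed values, expressed as a
    percentage of the observed mean (assumed normalization)."""
    if len(observed) != len(simulated) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((s - o) ** 2 for o, s in zip(observed, simulated)) / len(observed)
    observed_mean = sum(observed) / len(observed)
    return 100.0 * sqrt(mse) / observed_mean


if __name__ == "__main__":
    # Hypothetical days-to-flowering values, for illustration only.
    observed_days = [150.0, 155.0, 160.0]
    simulated_days = [152.0, 154.0, 161.0]
    print(f"nRMSE = {nrmse_percent(observed_days, simulated_days):.2f}%")
```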

  • Article type: Journal Article
    Automated language analysis of radiology reports using natural language processing (NLP) can provide valuable information on patients' health and disease. With its rapid development, NLP studies should have transparent methodology to allow comparison of approaches and reproducibility. This systematic review aims to summarise the characteristics and reporting quality of studies applying NLP to radiology reports.
    We searched Google Scholar for studies published in English that applied NLP to radiology reports of any imaging modality between January 2015 and October 2019. At least two reviewers independently performed screening and completed data extraction. We specified 15 criteria relating to data source, datasets, ground truth, outcomes, and reproducibility for quality assessment. The primary NLP performance measures were precision, recall and F1 score.
    Of the 4,836 records retrieved, we included 164 studies that used NLP on radiology reports. The commonest clinical applications of NLP were disease information or classification (28%) and diagnostic surveillance (27.4%). Most studies used English radiology reports (86%). Reports from mixed imaging modalities were used in 28% of the studies. Oncology (24%) was the most frequent disease area. Most studies had a dataset size >200 (85.4%), but the proportions of studies that described their annotated, training, validation, and test sets were 67.1%, 63.4%, 45.7%, and 67.7%, respectively. About half of the studies reported precision (48.8%) and recall (53.7%). Few studies reported performing external validation (10.8%), data availability (8.5%) and code availability (9.1%). There was no pattern of performance associated with the overall reporting quality.
    There is a range of potential clinical applications for NLP of radiology reports in health services and research. However, we found suboptimal reporting quality that precludes comparison, reproducibility, and replication. Our results support the need for development of reporting standards specific to clinical NLP studies.
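    For readers less familiar with the primary performance measures named in this review (precision, recall and F1 score), a minimal sketch of their standard definitions follows. The function name and the example counts are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not something taken from the review.

```python
def precision_recall_f1(true_pos: int, false_pos: int, false_neg: int):
    """Standard definitions computed from confusion-matrix counts."""
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical counts for an NLP classifier applied to radiology reports.
    p, r, f = precision_recall_f1(true_pos=8, false_pos=2, false_neg=8)
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```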

  • Article type: Comparative Study
    Machine learning (ML) algorithms are now increasingly used in infectious disease epidemiology. Epidemiologists should understand how ML algorithms behave within the context of outbreak data where missingness of data is almost ubiquitous.
    Using simulated data, we use an ML algorithmic framework to evaluate data imputation performance and the resulting case fatality ratio (CFR) estimates, focusing on the scale and type of data missingness (i.e., missing completely at random [MCAR], missing at random [MAR], or missing not at random [MNAR]).
    Across ML methods, dataset sizes and proportions of training data used, the area under the receiver operating characteristic curve decreased by 7% (median, range: 1%-16%) when missingness was increased from 10% to 40%. Overall reduction in CFR bias for MAR across methods, proportion of missingness, outbreak size and proportion of training data was 0.5% (median, range: 0%-11%).
    ML methods could reduce bias and increase the precision in CFR estimates at low levels of missingness. However, no method is robust to high percentages of missingness. Thus, a data-centric approach is recommended in outbreak settings: patient survival outcome data should be prioritised for collection, and random-sample follow-ups should be implemented to ascertain missing outcomes.
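    The three missingness types discussed in this abstract can be made concrete with a small simulation sketch. Everything below is an illustrative assumption rather than the authors' framework: the line-list fields, the age cut-off of 50, and the 1.5x and 0.5x multipliers used to make missingness depend on age (MAR) or on the outcome itself (MNAR) are all hypothetical.

```python
import random


def simulate_outbreak(n_cases, cfr, rng):
    """Hypothetical line list: each case has an age and a died flag.
    Death is drawn independently of age in this toy setup."""
    return [
        {"age": rng.randint(1, 90), "died": rng.random() < cfr}
        for _ in range(n_cases)
    ]


def mask_outcomes(cases, mechanism, rate, rng):
    """Return outcomes with some replaced by None under a given mechanism."""
    masked = []
    for case in cases:
        if mechanism == "MCAR":    # same probability for every case
            p_missing = rate
        elif mechanism == "MAR":   # depends on an observed covariate (age)
            p_missing = min(1.0, rate * 1.5) if case["age"] >= 50 else rate * 0.5
        elif mechanism == "MNAR":  # depends on the unobserved outcome itself
            p_missing = min(1.0, rate * 1.5) if case["died"] else rate * 0.5
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        masked.append(None if rng.random() < p_missing else case["died"])
    return masked


def complete_case_cfr(outcomes):
    """Naive CFR computed only from cases whose outcome is known."""
    known = [o for o in outcomes if o is not None]
    return sum(known) / len(known) if known else float("nan")


if __name__ == "__main__":
    rng = random.Random(0)
    cases = simulate_outbreak(5000, cfr=0.3, rng=rng)
    for mechanism in ("MCAR", "MAR", "MNAR"):
        outcomes = mask_outcomes(cases, mechanism, rate=0.4, rng=rng)
        print(mechanism, round(complete_case_cfr(outcomes), 3))
```

    In this toy setup only the MNAR mechanism is built to distort the complete-case CFR, because deaths are more likely to go unrecorded; it omits the imputation step that the study evaluates.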

  • Article type: Journal Article
    To determine the risk of hospital admission with covid-19 and severe covid-19 among teachers and their household members, overall and compared with healthcare workers and adults of working age in the general population.
    Population based nested case-control study.
    Scotland, March 2020 to July 2021, during defined periods of school closures and full openings in response to covid-19.
    All cases of covid-19 in adults aged 21 to 65 (n=132 420) and a random sample of controls matched on age, sex, and general practice (n=1 306 566). Adults were identified as actively teaching in a Scottish school by the General Teaching Council for Scotland, and their household members were identified through the unique property reference number. The comparator groups were adults identified as healthcare workers in Scotland, their household members, and the remaining general population of working age.
    The primary outcome was hospital admission with covid-19, defined as having a positive test result for SARS-CoV-2 during hospital admission, being admitted to hospital within 28 days of a positive test result, or receiving a diagnosis of covid-19 on discharge from hospital. Severe covid-19 was defined as being admitted to intensive care or dying within 28 days of a positive test result or assigned covid-19 as a cause of death.
    Most teachers were young (mean age 42), were women (80%), and had no comorbidities (84%). The risk (cumulative incidence) of hospital admission with covid-19 was <1% for all adults of working age in the general population. Over the study period, in conditional logistic regression models adjusted for age, sex, general practice, race/ethnicity, deprivation, number of comorbidities, and number of adults in the household, teachers showed a lower risk of hospital admission with covid-19 (rate ratio 0.77, 95% confidence interval 0.64 to 0.92) and of severe covid-19 (0.56, 0.33 to 0.97) than the general population. In the first period when schools in Scotland reopened, in autumn 2020, the rate ratio for hospital admission in teachers was 1.20 (0.89 to 1.61) and for severe covid-19 was 0.45 (0.13 to 1.55). The corresponding findings for household members of teachers were 0.91 (0.67 to 1.23) and 0.73 (0.37 to 1.44), and for patient facing healthcare workers were 2.08 (1.73 to 2.50) and 2.26 (1.43 to 3.59). Similar risks were seen for teachers in the second period, when schools reopened in summer 2021. These values were higher than those seen in spring/summer 2020, when schools were mostly closed.
    Compared with adults of working age who are otherwise similar, teachers and their household members were not found to be at increased risk of hospital admission with covid-19 and were found to be at lower risk of severe covid-19. These findings should reassure those who are engaged in face-to-face teaching.

  • Article type: Journal Article
    The field of transcriptional regulation generally assumes that changes in transcript levels reflect changes in the transcriptional status of the corresponding gene. While this assumption might hold true for a large population of transcripts, a considerable and still unrecognized fraction of the variation might involve other steps of the RNA lifecycle, that is, the processing of the premature RNA and the degradation of the mature RNA. Discrimination between these layers requires complementary experimental techniques, such as RNA metabolic labeling or block of transcription experiments. Nonetheless, the analysis of the premature and mature RNA, derived from intronic and exonic read counts in RNA-seq data, allows distinguishing between transcriptionally and post-transcriptionally regulated genes, although not recognizing the specific step involved in the post-transcriptional response, that is, processing, degradation, or a combination of the two. We illustrate how the INSPEcT R/Bioconductor package could be used to infer post-transcriptional regulation in TCGA RNA-seq samples for Hepatocellular Carcinoma.

  • Article type: Journal Article
    Scientific knowledge cannot be seen as a set of isolated fields, but as a highly connected network. Understanding how research areas are connected is of paramount importance for adequately allocating funding and human resources (e.g., assembling teams to tackle multidisciplinary problems). The relationship between disciplines can be drawn from data on the trajectory of individual scientists, as researchers often make contributions in a small set of interrelated areas. Two recent works propose methods for creating research maps from scientists' publication records: by using a frequentist approach to create a transition probability matrix; and by learning embeddings (vector representations). Surprisingly, these models were evaluated on different datasets and have never been compared in the literature. In this work, we compare both models in a systematic way, using a large dataset of publication records from Brazilian researchers. We evaluate these models' ability to predict whether a given entity (scientist, institution or region) will enter a new field w.r.t. the area under the ROC curve. Moreover, we analyze how sensitive each method is to the number of publications and the number of fields associated to one entity. Last, we conduct a case study to showcase how these models can be used to characterize science dynamics in the context of Brazil.
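    The frequentist transition probability matrix mentioned in this abstract can be sketched in a few lines. The sketch assumes each scientist's record is an ordered list of field labels and that a transition is counted between consecutive fields; the abstract does not specify how the cited work counts transitions, so this is one plausible reading, and the field names are hypothetical.

```python
from collections import Counter, defaultdict


def transition_matrix(trajectories):
    """Estimate P(next field | current field) by row-normalizing
    counts of consecutive field pairs across all trajectories."""
    counts = defaultdict(Counter)
    for fields in trajectories:
        for src, dst in zip(fields, fields[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: n / sum(dsts.values()) for dst, n in dsts.items()}
        for src, dsts in counts.items()
    }


if __name__ == "__main__":
    # Hypothetical publication trajectories, one list per scientist.
    records = [
        ["physics", "computer science"],
        ["physics", "mathematics"],
        ["physics", "computer science", "biology"],
    ]
    for src, row in transition_matrix(records).items():
        print(src, row)
```

    A nested dictionary keeps the matrix sparse, which suits field taxonomies where most pairs of areas never co-occur in a single career.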