MeSH: Computational Biology/methods; Software; Benchmarking/methods; Programming Languages; Algorithms

Source: DOI:10.1093/bioinformatics/btae230

Abstract:
MOTIVATION: Pretrained large language models (LLMs) have significantly improved code generation. As these models scale up, there is an increasing need for the output to handle more intricate tasks and to be appropriately specialized to particular domains. Here, we target bioinformatics due to the amount of domain knowledge, algorithms, and data operations this discipline requires. We present BioCoder, a benchmark developed to evaluate LLMs in generating bioinformatics-specific code. BioCoder spans much of the field, covering cross-file dependencies, class declarations, and global variables. It incorporates 1026 Python functions and 1243 Java methods extracted from GitHub, along with 253 examples from the Rosalind Project, all pertaining to bioinformatics.
RESULTS: Using topic modeling, we show that the overall coverage of the included code is representative of the full spectrum of bioinformatics calculations. BioCoder incorporates a fuzz-testing framework for evaluation. We have applied it to evaluate various models, including InCoder, CodeGen, CodeGen2, SantaCoder, StarCoder, StarCoder+, InstructCodeT5+, GPT-3.5, and GPT-4. Furthermore, we fine-tuned one model (StarCoder), demonstrating that our training dataset can enhance performance on our testing benchmark (by >15% in Pass@K under certain prompt configurations, and always by >3%). The results highlight two key aspects of successful models: (i) they accommodate a long prompt (>2600 tokens) with full context, including functional dependencies; and (ii) they contain domain-specific bioinformatics knowledge beyond general coding capability. This is evident from the performance gain of GPT-3.5/4 over the smaller models on our benchmark (50% versus up to 25%).
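The abstract reports Pass@K but does not restate its definition. Assuming the standard unbiased estimator popularized by the Codex paper (Chen et al., 2021), a common convention for code-generation benchmarks rather than something this record confirms, a minimal sketch in Python:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: candidate programs sampled for one problem
    c: how many of them pass all tests
    k: evaluation budget
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    # Numerically stable product form of 1 - C(n-c, k) / C(n, k)
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Hypothetical per-problem counts (n samples, c passing); the benchmark
# score is the mean over problems.
samples = [(20, 5), (20, 0), (20, 12)]
print(np.mean([pass_at_k(n, c, k=1) for n, c in samples]))
```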
AVAILABILITY: All datasets, the benchmark, Docker images, and scripts required for testing are available at https://github.com/gersteinlab/biocoder and https://biocoder-benchmark.github.io/.
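The record mentions a fuzz-testing framework and testing scripts but gives no interface details. The sketch below only illustrates the general idea behind fuzz-based functional evaluation: run a candidate and a reference implementation on the same randomized inputs and require matching outputs. Every name here (fuzz_equivalent, gc_content, gen_dna) is hypothetical, and the real harness presumably also handles exceptions, timeouts, and Docker isolation.

```python
import random

def fuzz_equivalent(reference_fn, candidate_fn, gen_input, trials=1000, seed=0):
    """Feed both implementations identical randomized inputs and
    report whether their outputs agree on every trial."""
    rng = random.Random(seed)
    for _ in range(trials):
        args = gen_input(rng)
        if reference_fn(*args) != candidate_fn(*args):
            return False
    return True

# Toy bioinformatics example: GC content of a DNA string.
def gc_content(seq: str) -> float:
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def gen_dna(rng: random.Random) -> tuple:
    n = rng.randint(1, 200)
    return ("".join(rng.choice("ACGT") for _ in range(n)),)

print(fuzz_equivalent(gc_content, gc_content, gen_dna))  # True
```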