🌾 LiCoEval Leaderboard 🌾

A benchmark to evaluate the ability of LLMs to provide accurate license information for their generated code

| Type        | Model                   | LiCoEval Score |
|-------------|-------------------------|----------------|
| General LLM | GPT-3.5-Turbo           | 0.373          |
| General LLM | GPT-4-Turbo             | 0.376          |
| General LLM | GPT-4o                  | 0.385          |
| General LLM | Gemini-1.5-Pro          | 0.317          |
| General LLM | Claude-3.5-Sonnet       | 0.571          |
| General LLM | Qwen2-7B-Instruct       | 0.985          |
| General LLM | GLM-4-9B-Chat           | 1.000          |
| General LLM | Llama-3-8B-Instruct     | 0.714          |
| Code LLM    | DeepSeek-Coder-V2       | 0.142          |
| Code LLM    | CodeQwen1.5-7B-Chat     | 0.781          |
| Code LLM    | StarCoder2-15B-Instruct | 0.780          |
| Code LLM    | Codestral-22B-v0.1      | 0.360          |
| Code LLM    | CodeGemma-7B-IT         | 0.809          |
| Code LLM    | WizardCoder-Python-13B  | 0.153          |

| Type        | Model                   | HumanEval Score (%) |
|-------------|-------------------------|---------------------|
| General LLM | GPT-3.5-Turbo           | 72.6                |
| General LLM | GPT-4-Turbo             | 85.4                |
| General LLM | GPT-4o                  | 90.2                |
| General LLM | Gemini-1.5-Pro          | 71.9                |
| General LLM | Claude-3.5-Sonnet       | 92.0                |
| General LLM | Qwen2-7B-Instruct       | 79.9                |
| General LLM | GLM-4-9B-Chat           | 71.8                |
| General LLM | Llama-3-8B-Instruct     | 62.2                |
| Code LLM    | DeepSeek-Coder-V2       | 90.2                |
| Code LLM    | CodeQwen1.5-7B-Chat     | 83.5                |
| Code LLM    | StarCoder2-15B-Instruct | 72.6                |
| Code LLM    | Codestral-22B-v0.1      | 61.5                |
| Code LLM    | CodeGemma-7B-IT         | 56.1                |
| Code LLM    | WizardCoder-Python-13B  | 64.0                |

Notes

  1. We design a framework for evaluating the license compliance capabilities of LLMs in code generation and provide the first benchmark for this capability. Using the framework, we evaluate 14 popular LLMs, yielding insights for improving LLM training and for regulating LLM usage. A simplified sketch of the evaluation idea appears after these notes.
  2. The scores above are taken from the LiCoEval paper.
  3. Want to use this benchmark? Please visit the GitHub repository for more details!
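
The sketch below is a rough, hypothetical illustration of a license-compliance check, not the paper's metric or implementation. It assumes each evaluation item pairs the license a model stated for its generated code with the license of the open-source code that generation was matched against, and it reports the fraction of matching pairs; the names `EvalItem`, `normalize`, and `compliance_score` are invented for this sketch. See the paper and repository for the actual LiCoEval definition.

```python
from typing import List, Optional
from dataclasses import dataclass


@dataclass
class EvalItem:
    stated_license: Optional[str]  # license the model reported for its generated code, if any
    reference_license: str         # license of the open-source code the generation was matched to


def normalize(license_id: str) -> str:
    """Crude normalization so that 'MIT License' and 'mit' compare as equal."""
    return license_id.lower().replace("license", "").strip()


def compliance_score(items: List[EvalItem]) -> float:
    """Fraction of items whose stated license matches the reference license.

    Items with no stated license count as non-compliant in this sketch.
    """
    if not items:
        return 0.0
    hits = sum(
        1
        for item in items
        if item.stated_license is not None
        and normalize(item.stated_license) == normalize(item.reference_license)
    )
    return hits / len(items)


# Example with made-up data: one compliant item, one missing license statement.
items = [
    EvalItem(stated_license="MIT", reference_license="MIT License"),
    EvalItem(stated_license=None, reference_license="GPL-3.0"),
]
print(f"compliance: {compliance_score(items):.3f}")  # compliance: 0.500
```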

Recommendations

To gain a comprehensive picture of LLM coding ability, we recommend consulting a diverse set of benchmarks and leaderboards, such as:

ClassEval Leaderboard

CRUXEval Leaderboard

EvalPlus Leaderboard

Chatbot Arena Leaderboard

TabbyML Leaderboard

OSSlab-PKU ❤️ Open Source

OSSlab-PKU ❤️ LLMs