A benchmark to evaluate the ability of LLMs to provide accurate license information for their generated code
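To give a rough intuition for what a LiCoEval score measures: when a model's generated code reproduces licensed open-source code, the model should also report that code's license accurately. The sketch below is a minimal, hypothetical illustration of that idea (exact-match scoring over made-up items); the names and data format are stand-ins, not this repo's actual metric or dataset, so refer to the evaluation code here for the real implementation.

```python
# Hypothetical sketch of the scoring idea, NOT this repo's actual implementation:
# when generated code reproduces licensed open-source code, the model should
# also report that code's license correctly. All names and data are made up.

def license_accuracy(items):
    """Fraction of items whose reported license exactly matches the ground truth."""
    if not items:
        return 0.0
    correct = sum(1 for reported, truth in items if reported == truth)
    return correct / len(items)

# each pair: (license the model reported, ground-truth SPDX identifier)
items = [
    ("MIT", "MIT"),          # correct report
    ("Apache-2.0", "MIT"),   # wrong license reported
    (None, "GPL-3.0-only"),  # no license information given
]

print(f"score: {license_accuracy(items):.3f}")  # -> score: 0.333
```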
| Type | Model | LiCoEval Score |
| --- | --- | --- |
| General LLM | GPT-3.5-Turbo | 0.373 |
| General LLM | GPT-4-Turbo | 0.376 |
| General LLM | GPT-4o | 0.385 |
| General LLM | Gemini-1.5-Pro | 0.317 |
| General LLM | Claude-3.5-Sonnet | 0.571 |
| General LLM | Qwen2-7B-Instruct | 0.985 |
| General LLM | GLM-4-9B-Chat | 1.000 |
| General LLM | Llama-3-8B-Instruct | 0.714 |
| Code LLM | DeepSeek-Coder-V2 | 0.142 |
| Code LLM | CodeQwen1.5-7B-Chat | 0.781 |
| Code LLM | StarCoder2-15B-Instruct | 0.780 |
| Code LLM | Codestral-22B-v0.1 | 0.360 |
| Code LLM | CodeGemma-7B-IT | 0.809 |
| Code LLM | WizardCoder-Python-13B | 0.153 |
| Type | Model | HumanEval Score (pass@1, %) |
| --- | --- | --- |
| General LLM | GPT-3.5-Turbo | 72.6 |
| General LLM | GPT-4-Turbo | 85.4 |
| General LLM | GPT-4o | 90.2 |
| General LLM | Gemini-1.5-Pro | 71.9 |
| General LLM | Claude-3.5-Sonnet | 92.0 |
| General LLM | Qwen2-7B-Instruct | 79.9 |
| General LLM | GLM-4-9B-Chat | 71.8 |
| General LLM | Llama-3-8B-Instruct | 62.2 |
| Code LLM | DeepSeek-Coder-V2 | 90.2 |
| Code LLM | CodeQwen1.5-7B-Chat | 83.5 |
| Code LLM | StarCoder2-15B-Instruct | 72.6 |
| Code LLM | Codestral-22B-v0.1 | 61.5 |
| Code LLM | CodeGemma-7B-IT | 56.1 |
| Code LLM | WizardCoder-Python-13B | 64.0 |
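Note that the two tables need not track each other: for example, DeepSeek-Coder-V2 posts the joint-highest HumanEval score above (90.2) but the lowest LiCoEval score (0.142). A quick way to check the overall relationship is to rank-correlate the two columns; the sketch below does this with scipy (assumed installed), using the scores copied verbatim from the tables above.

```python
# Rank-correlate the LiCoEval and HumanEval columns from the tables above.
# Requires scipy (pip install scipy); scores copied verbatim from this README.
from scipy.stats import spearmanr

scores = {  # model: (LiCoEval, HumanEval)
    "GPT-3.5-Turbo": (0.373, 72.6),
    "GPT-4-Turbo": (0.376, 85.4),
    "GPT-4o": (0.385, 90.2),
    "Gemini-1.5-Pro": (0.317, 71.9),
    "Claude-3.5-Sonnet": (0.571, 92.0),
    "Qwen2-7B-Instruct": (0.985, 79.9),
    "GLM-4-9B-Chat": (1.000, 71.8),
    "Llama-3-8B-Instruct": (0.714, 62.2),
    "DeepSeek-Coder-V2": (0.142, 90.2),
    "CodeQwen1.5-7B-Chat": (0.781, 83.5),
    "StarCoder2-15B-Instruct": (0.780, 72.6),
    "Codestral-22B-v0.1": (0.360, 61.5),
    "CodeGemma-7B-IT": (0.809, 56.1),
    "WizardCoder-Python-13B": (0.153, 64.0),
}

lico, humaneval = zip(*scores.values())
rho, p = spearmanr(lico, humaneval)
print(f"Spearman rho = {rho:.2f} (p = {p:.2f})")
```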
We recommend forming a comprehensive picture of LLM coding ability from a diverse set of benchmarks and leaderboards rather than from any single score.
OSSlab-PKU ❤️ Open Source
OSSlab-PKU ❤️ LLMs