A benchmark for evaluating the ability of LLMs to provide accurate license information for the code they generate
| Type | Model | LiCoEval Score |
|---|---|---|
| General LLM | GPT-3.5-Turbo | 0.373 |
| General LLM | GPT-4-Turbo | 0.376 |
| General LLM | GPT-4o | 0.385 |
| General LLM | Gemini-1.5-Pro | 0.317 |
| General LLM | Claude-3.5-Sonnet | 0.571 |
| General LLM | Qwen2-7B-Instruct | 0.985 |
| General LLM | GLM-4-9B-Chat | 1.000 |
| General LLM | Llama-3-8B-Instruct | 0.714 |
| Code LLM | DeepSeek-Coder-V2 | 0.142 |
| Code LLM | CodeQwen1.5-7B-Chat | 0.781 |
| Code LLM | StarCoder2-15B-Instruct | 0.780 |
| Code LLM | Codestral-22B-v0.1 | 0.360 |
| Code LLM | CodeGemma-7B-IT | 0.809 |
| Code LLM | WizardCoder-Python-13B | 0.153 |
| Type | Model | HumanEval Score (pass@1) |
|---|---|---|
| General LLM | GPT-3.5-Turbo | 72.6 |
| General LLM | GPT-4-Turbo | 85.4 |
| General LLM | GPT-4o | 90.2 |
| General LLM | Gemini-1.5-Pro | 71.9 |
| General LLM | Claude-3.5-Sonnet | 92.0 |
| General LLM | Qwen2-7B-Instruct | 79.9 |
| General LLM | GLM-4-9B-Chat | 71.8 |
| General LLM | Llama-3-8B-Instruct | 62.2 |
| Code LLM | DeepSeek-Coder-V2 | 90.2 |
| Code LLM | CodeQwen1.5-7B-Chat | 83.5 |
| Code LLM | StarCoder2-15B-Instruct | 72.6 |
| Code LLM | Codestral-22B-v0.1 | 61.5 |
| Code LLM | CodeGemma-7B-IT | 56.1 |
| Code LLM | WizardCoder-Python-13B | 64.0 |
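For intuition, the LiCoEval score rewards a model for attaching correct license information when its output reproduces licensed open-source code. The sketch below is a minimal illustration of that scoring idea, not the official evaluation harness: the `Sample` type, the `license_accuracy` helper, and the case-insensitive SPDX-identifier match are all assumptions made for this example.

```python
# Minimal illustrative sketch of a LiCoEval-style scoring step.
# NOT the official harness: Sample, license_accuracy, and the exact
# SPDX-identifier comparison are assumptions made for this example.
from dataclasses import dataclass


@dataclass
class Sample:
    task_id: str
    reported_license: str  # SPDX id the model attached to its code, e.g. "MIT"
    true_license: str      # SPDX id of the matched reference code


def license_accuracy(samples: list[Sample]) -> float:
    """Fraction of samples whose reported license matches the ground truth."""
    if not samples:
        return 0.0
    correct = sum(
        s.reported_license.strip().lower() == s.true_license.strip().lower()
        for s in samples
    )
    return correct / len(samples)


if __name__ == "__main__":
    demo = [
        Sample("t1", "MIT", "MIT"),
        Sample("t2", "Apache-2.0", "GPL-3.0-only"),
    ]
    print(f"license accuracy: {license_accuracy(demo):.3f}")  # prints 0.500
```

Case-insensitive exact matching on SPDX identifiers is used here purely for simplicity; a real harness would also have to handle license aliases, multi-licensed code, and generations that match no licensed reference at all.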
We recommend building a comprehensive picture of LLM coding ability from a diverse set of benchmarks and leaderboards rather than from any single score.
OSSlab-PKU ❤️ Open Source
OSSlab-PKU ❤️ LLMs