This repository contains the code and data pipeline for the paper:
**Beyond Accuracy: Characterizing Code Comprehension Capabilities in (Large) Language Models**
Felix Machtle, Jan-Niclas Serr, Nils Loose, Thomas Eisenbarth
University of Luebeck, Germany
We investigate whether LLM code comprehension aligns with human-centric software metrics or reflects distinct, non-human-interpretable patterns. The task is binary I/O consistency: given a program p, input x, and candidate output y, judge whether y is the correct output of running p on x.
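A toy instance of the task (our own illustration, not a dataset sample): for a program that reads two integers and prints their sum, the candidate output "8" is consistent with the input "3 5", while "7" is not.

```python
# Toy (p, x, y) triple for the I/O-consistency task; illustrative only.
p = "a, b = map(int, input().split()); print(a + b)"
x = "3 5"
y_consistent = "8"    # running p on x prints 8  -> label: correct
y_inconsistent = "7"  # does not match p's output -> label: incorrect
```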
Key findings:
- Traditional software metrics are only weak predictors of LLM success (AUROC 0.63)
- A learned shadow model (fine-tuned UniXcoder) achieves AUROC 0.86
- Aggregate benchmarks obscure important model specialization differences
| Model | Accuracy | F1 |
|---|---|---|
| GPT-OSS 120B | 0.960 | 0.959 |
| Llama 3.3 70B Instruct | 0.738 | 0.662 |
| Mistral Small 24B Instruct | 0.744 | 0.685 |
| Phi-4 | 0.733 | 0.674 |
| CodeLlama 13B Instruct | 0.506 | 0.062 |
The full dataset is available on Hugging Face: Felix6326727/beyond-accuracy-code-comprehension
It contains ~80GB of samples with:
- Source code from the Python subset of Project CodeNet
- Generated I/O pairs produced by our type-aware fuzzer
- 200+ static code metrics (cyclomatic complexity, AST structure, opcode statistics, etc.)
- LLM evaluation results for 5 models (success/failure on I/O consistency task)
```python
from datasets import load_dataset

ds = load_dataset("Felix6326727/beyond-accuracy-code-comprehension")
print(ds["test"][0])
```
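Since the dataset is ~80GB, streaming may be preferable to a full download. The snippet below uses the standard `datasets` streaming mode; it is a usage sketch, not repository code:

```python
from datasets import load_dataset

# Stream samples lazily instead of materializing ~80GB locally.
ds = load_dataset(
    "Felix6326727/beyond-accuracy-code-comprehension",
    split="test",
    streaming=True,
)
for sample in ds.take(3):  # inspect the first few samples
    print(sample)
```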
```
├── data_generation/            # Fuzzing pipeline for I/O pair generation
│   ├── main.py                 # Entry point — orchestrates fuzzing across CodeNet
│   ├── hill_climb.py           # Infers input type signatures via hill climbing
│   ├── fuzzer.py               # Generates & shrinks minimal I/O pairs
│   ├── fuzzer_proxy.py         # Coverage-guided fitness wrapper
│   ├── random_values_lib.py    # Random value generators (int/float/string)
│   └── export_io.py            # Aggregates results into labeled dataset
│
├── analysis/
│   ├── eval_performance/
│   │   └── evaluate_performance_of_all_models.py    # Accuracy/F1 for all LLMs
│   └── feature_importance/
│       ├── step1_create_export.py    # Joins LLM results with code metrics
│       ├── step2_find_features.py    # XGBoost + SHAP feature analysis
│       └── SAGE.py                   # SAGE feature importance ranking
└── README.md
```
The data generation pipeline works as follows:
```
Python files (CodeNet)
        |
        v
main.py ────────────────── orchestrates 60 parallel workers
        |
        ├──> hill_climb.py ─── infers input type signature ("genome")
        |        uses fuzzer_proxy.py (coverage-guided fitness)
        |
        ├──> fuzzer.py ──────── generates & shrinks I/O pairs
        |        uses random_values_lib.py
        v
.io2.json files (per source file)
        |
        v
export_io.py ─────────────── aggregates into labeled dataset
        |        creates negative examples via in-program shuffling
        v
training_data.json / test_data.json
```
For each Python program, we infer the expected input types using a sequential hill-climbing algorithm. The search space is a "genome" string where each character represents an input type: i (integer), s (string), f (float), b (boolean). Coverage is used as the fitness function.
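A minimal sketch of this search, assuming a `fitness(genome)` helper that runs the program on random inputs drawn from the genome and returns the achieved coverage (the repository's actual logic lives in hill_climb.py and fuzzer_proxy.py):

```python
# Sequential hill climb over genome strings (sketch; fitness() is assumed).
TYPES = "isfb"  # i: integer, s: string, f: float, b: boolean

def hill_climb(fitness, max_len=10):
    best, best_fit = "", fitness("")
    for _ in range(max_len):
        # Try every one-character extension and keep the best one.
        fit, genome = max((fitness(best + t), best + t) for t in TYPES)
        if fit <= best_fit:
            break  # no extension improves coverage: signature is complete
        best, best_fit = genome, fit
    return best
```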
Using the inferred genome, we generate random inputs, execute the program, and collect (input, output) pairs. Inputs are then shrunk to minimal forms that still produce successful executions.
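The generation step can be pictured as follows; `run_program` and `random_value` are illustrative stand-ins for the repository's fuzzer.py and random_values_lib.py, and the shrinking step is omitted:

```python
import random
import string
import subprocess

def run_program(path, stdin_text, timeout=5):
    """Execute a CodeNet submission; return stdout, or None on failure."""
    try:
        res = subprocess.run(["python", path], input=stdin_text, text=True,
                             capture_output=True, timeout=timeout)
        return res.stdout if res.returncode == 0 else None
    except subprocess.TimeoutExpired:
        return None

def random_value(kind):
    """One random input token per genome character (i/s/f/b)."""
    if kind == "i":
        return str(random.randint(-100, 100))
    if kind == "f":
        return f"{random.uniform(-100, 100):.3f}"
    if kind == "b":
        return random.choice(["True", "False"])
    return "".join(random.choices(string.ascii_lowercase, k=random.randint(1, 8)))

def sample_io_pair(path, genome):
    """Draw inputs according to the genome and record the (x, y) pair."""
    x = "\n".join(random_value(c) for c in genome) + "\n"
    y = run_program(path, x)
    return (x, y) if y is not None else None
```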
Negative (incorrect) I/O pairs are created via in-program shuffling: pairing an input with the output of a different input to the same program. This preserves lexical and stylistic characteristics while creating semantically incorrect pairs.
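In code, the shuffling amounts to mismatching outputs within one program's pair list (a sketch; the names are ours, not export_io.py's API):

```python
import random

def make_negatives(pairs):
    """pairs: list of (x, y) for ONE program; returns mismatched (x, y') pairs."""
    if len(pairs) < 2:
        return []
    negatives = []
    for i, (x, y) in enumerate(pairs):
        # Pair this input with the output of a different input.
        j = random.choice([k for k in range(len(pairs)) if k != i])
        y_wrong = pairs[j][1]
        if y_wrong != y:  # identical outputs would still be a correct pair
            negatives.append((x, y_wrong))
    return negatives
```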
```bash
# Using Docker (as in the paper)
docker run -it \
    -v $(pwd)/data_generation:/mnt \
    -v /path/to/Project_CodeNet_Py_Subset:/Project_CodeNet_Py_Subset \
    python:3.14.0b3-alpine3.21 sh

# Inside the container:
pip install tqdm
cd /mnt
python main.py
```

The analysis scripts require additional dependencies:

```bash
pip install scikit-learn xgboost shap sage-importance pandas numpy
```
```bash
# Evaluate LLM performance
cd analysis/eval_performance
python evaluate_performance_of_all_models.py

# Feature importance analysis
cd analysis/feature_importance
python step1_create_export.py    # Prepare feature matrices
python SAGE.py                   # SAGE feature importance
python step2_find_features.py    # SHAP + token features
```

```bibtex
@article{machtle2025beyond,
title={Beyond Accuracy: Characterizing Code Comprehension Capabilities in (Large) Language Models},
author={Machtle, Felix and Serr, Jan-Niclas and Loose, Nils and Eisenbarth, Thomas},
journal={arXiv preprint arXiv:2601.12951},
year={2025}
}
```

This project is for academic research purposes. The dataset is derived from Project CodeNet (Apache 2.0).