
# Beyond Accuracy: Characterizing Code Comprehension Capabilities in (Large) Language Models

This repository contains the code and data pipeline for the paper:

**Beyond Accuracy: Characterizing Code Comprehension Capabilities in (Large) Language Models**
Felix Machtle, Jan-Niclas Serr, Nils Loose, Thomas Eisenbarth
University of Luebeck, Germany

[Paper (arXiv)] [Dataset (Hugging Face)]

## Overview

We investigate whether LLM code comprehension aligns with human-centric software metrics or reflects distinct, non-human-interpretable patterns. The task is binary I/O consistency: given a program p, input x, and candidate output y, judge whether y is the correct output of running p on x.
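The ground-truth label for this task is obtained by actually executing the program; the sketch below (function and variable names are illustrative, not from the repository) shows the check that the model must replicate *without* running the code:

```python
import subprocess
import sys

def io_consistent(program: str, stdin_text: str, candidate_output: str,
                  timeout: float = 5.0) -> bool:
    """Run `program` on `stdin_text` and check whether it prints
    `candidate_output`. The LLM must make this judgment without
    executing the code."""
    result = subprocess.run(
        [sys.executable, "-c", program],
        input=stdin_text, capture_output=True, text=True, timeout=timeout,
    )
    return (result.returncode == 0
            and result.stdout.strip() == candidate_output.strip())

program = "x = int(input())\nprint(x * 2)"
print(io_consistent(program, "21", "42"))  # -> True  (consistent pair)
print(io_consistent(program, "21", "43"))  # -> False (inconsistent pair)
```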

Key findings:

  • Traditional software metrics predict LLM success weakly (AUROC 0.63)
  • A learned shadow model (fine-tuned UniXcoder) achieves AUROC 0.86
  • Aggregate benchmarks obscure important model specialization differences

## Models Evaluated

| Model | Accuracy | F1 |
|---|---|---|
| GPT-OSS 120B | 0.960 | 0.959 |
| Llama 3.3 70B Instruct | 0.738 | 0.662 |
| Mistral Small 24B Instruct | 0.744 | 0.685 |
| Phi-4 | 0.733 | 0.674 |
| CodeLlama 13B Instruct | 0.506 | 0.062 |
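Reporting both metrics matters because they diverge for degenerate predictors: a model that almost always answers "inconsistent" keeps roughly chance accuracy on a balanced set while its F1 collapses, the pattern visible in the CodeLlama row. A toy illustration (the labels below are invented):

```python
from sklearn.metrics import accuracy_score, f1_score

# Balanced toy test set: 1 = consistent, 0 = inconsistent.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
# A near-constant model that always predicts "inconsistent":
y_pred = [0] * 10

print(accuracy_score(y_true, y_pred))             # -> 0.5
print(f1_score(y_true, y_pred, zero_division=0))  # -> 0.0
```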

## Dataset

The full dataset is available on Hugging Face: `Felix6326727/beyond-accuracy-code-comprehension`

It contains ~80GB of samples with:

  • Source code from the Python subset of Project CodeNet
  • Generated I/O pairs produced by our type-aware fuzzer
  • 200+ static code metrics (cyclomatic complexity, AST structure, opcode statistics, etc.)
  • LLM evaluation results for 5 models (success/failure on I/O consistency task)
Load it with the Hugging Face `datasets` library:

```python
from datasets import load_dataset

ds = load_dataset("Felix6326727/beyond-accuracy-code-comprehension")
print(ds["test"][0])
```

## Repository Structure

```text
.
├── data_generation/             # Fuzzing pipeline for I/O pair generation
│   ├── main.py                  # Entry point: orchestrates fuzzing across CodeNet
│   ├── hill_climb.py            # Infers input type signatures via hill climbing
│   ├── fuzzer.py                # Generates & shrinks minimal I/O pairs
│   ├── fuzzer_proxy.py          # Coverage-guided fitness wrapper
│   ├── random_values_lib.py     # Random value generators (int/float/string)
│   └── export_io.py             # Aggregates results into labeled dataset
│
├── analysis/
│   ├── eval_performance/
│   │   └── evaluate_performance_of_all_models.py   # Accuracy/F1 for all LLMs
│   └── feature_importance/
│       ├── step1_create_export.py   # Joins LLM results with code metrics
│       ├── step2_find_features.py   # XGBoost + SHAP feature analysis
│       └── SAGE.py                  # SAGE feature importance ranking
└── README.md
```

## Pipeline

The data generation pipeline works as follows:

```text
Python files (CodeNet)
        |
        v
    main.py ─────────────────── orchestrates 60 parallel workers
        |
        ├──> hill_climb.py ──── infers input type signature ("genome")
        |        uses fuzzer_proxy.py (coverage-guided fitness)
        |
        ├──> fuzzer.py ──────── generates & shrinks I/O pairs
        |        uses random_values_lib.py
        v
    .io2.json files (per source file)
        |
        v
    export_io.py ────────────── aggregates into labeled dataset
        |                        creates negative examples via in-program shuffling
        v
    training_data.json / test_data.json
```

### 1. Type Inference (Hill Climbing)

For each Python program, we infer the expected input types using a sequential hill-climbing algorithm. The search space is a "genome" string where each character represents an input type: `i` (integer), `s` (string), `f` (float), `b` (boolean). Coverage is used as the fitness function.
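A minimal sketch of the idea, with a simulated fitness standing in for real branch coverage (the target genome `iis`, the penalty weights, and all names are invented for illustration; this is not the repository's implementation):

```python
import random

TYPES = "isfb"  # i = int, s = string, f = float, b = bool

def fitness(genome: str) -> float:
    """Stand-in for coverage: here we pretend the program reads an
    int, an int, then a string, so the genome 'iis' maximizes it.
    Extra or missing inputs are penalized."""
    target = "iis"
    matches = sum(a == b for a, b in zip(genome, target))
    return matches / len(target) - 0.1 * abs(len(genome) - len(target))

def hill_climb(iters: int = 5000, max_len: int = 6, seed: int = 0) -> str:
    """Sequentially grow/mutate the genome, keeping strict improvements."""
    rng = random.Random(seed)
    best, best_fit = "", fitness("")
    for _ in range(iters):
        cand = list(best)
        move = rng.choice(("append", "mutate", "drop"))
        if move == "append" and len(cand) < max_len:
            cand.append(rng.choice(TYPES))
        elif move == "mutate" and cand:
            cand[rng.randrange(len(cand))] = rng.choice(TYPES)
        elif move == "drop" and cand:
            cand.pop()
        fit = fitness("".join(cand))
        if fit > best_fit:              # keep only strict improvements
            best, best_fit = "".join(cand), fit
    return best
```

For this toy fitness the search converges to the target genome; the real pipeline scores candidates by executing the program under `fuzzer_proxy.py` and measuring coverage.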

### 2. I/O Pair Generation (Fuzzing)

Using the inferred genome, we generate random inputs, execute the program, and collect (input, output) pairs. Inputs are then shrunk to minimal forms that still produce successful executions.
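The shrinking step can be sketched as a greedy loop over a single (positive) integer input; names are illustrative, and the repository's fuzzer of course handles full input genomes, not just one integer:

```python
import subprocess
import sys

def runs_ok(program: str, stdin_text: str) -> bool:
    """Does the program execute successfully on this input?"""
    r = subprocess.run([sys.executable, "-c", program], input=stdin_text,
                       capture_output=True, text=True, timeout=5)
    return r.returncode == 0

def shrink_int(program: str, value: int) -> int:
    """Greedily move a fuzzed integer toward 0 while the program
    still executes successfully."""
    best = value
    while True:
        for cand in (0, best // 2, best - 1):
            if abs(cand) < abs(best) and runs_ok(program, str(cand)):
                best = cand
                break
        else:
            return best  # no smaller candidate still succeeds

# A program that only accepts inputs greater than 10:
prog = "x = int(input())\nassert x > 10\nprint(x)"
print(shrink_int(prog, 1000))  # -> 11, the minimal passing input
```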

### 3. Negative Example Construction

Negative (incorrect) I/O pairs are created via in-program shuffling: pairing an input with the output of a different input to the same program. This preserves lexical and stylistic characteristics while creating semantically incorrect pairs.
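A sketch of in-program shuffling (names invented; assumes each program has at least two I/O pairs with pairwise-distinct outputs, so a derangement exists):

```python
import random

def make_negatives(io_pairs, seed=0):
    """Pair each input with the output of a *different* input to the
    same program, producing label-0 (inconsistent) examples."""
    rng = random.Random(seed)
    outputs = [out for _, out in io_pairs]
    while True:  # rejection-sample a derangement of the outputs
        shuffled = outputs[:]
        rng.shuffle(shuffled)
        if all(s != o for s, o in zip(shuffled, outputs)):
            break
    return [(inp, out, 0) for (inp, _), out in zip(io_pairs, shuffled)]

pairs = [("1", "2"), ("2", "4"), ("3", "6")]  # e.g. a doubling program
print(make_negatives(pairs))  # every input now carries a wrong output
```

Because the swapped output came from the same program, it still looks plausible (same format, same value range), which is what makes these negatives hard.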

## Running the Pipeline

### Data Generation

```shell
# Using Docker (as in the paper)
docker run -it \
  -v $(pwd)/data_generation:/mnt \
  -v /path/to/Project_CodeNet_Py_Subset:/Project_CodeNet_Py_Subset \
  python:3.14.0b3-alpine3.21 sh

# Then, inside the container:
pip install tqdm
cd /mnt
python main.py
```

### Analysis

```shell
pip install scikit-learn xgboost shap sage-importance pandas numpy

# Evaluate LLM performance
cd analysis/eval_performance
python evaluate_performance_of_all_models.py

# Feature importance analysis
cd ../feature_importance
python step1_create_export.py   # Prepare feature matrices
python SAGE.py                  # SAGE feature importance
python step2_find_features.py   # SHAP + token features
```

## Citation

```bibtex
@article{machtle2025beyond,
  title={Beyond Accuracy: Characterizing Code Comprehension Capabilities in (Large) Language Models},
  author={Machtle, Felix and Serr, Jan-Niclas and Loose, Nils and Eisenbarth, Thomas},
  journal={arXiv preprint arXiv:2601.12951},
  year={2025}
}
```

## License

This project is for academic research purposes. The dataset is derived from Project CodeNet (Apache 2.0).