[ICCV 2025] 3DGraphLLM is a model that uses a 3D scene graph and an LLM to perform 3D vision-language tasks.

CognitiveAISystems/3DGraphLLM


3DGraphLLM

arXiv | Hugging Face

In this work, we propose 3DGraphLLM, a method for constructing a learnable representation of a 3D scene graph, which serves as input for LLMs to perform 3D vision-language tasks.

News

[2025.6] Our paper has been accepted for poster presentation at ICCV 2025! 🎉

[2024.12] We release 3DGraphLLM pre-training on GT instance segmentation scene graphs.

[2024.12] We release the code for the 3DGraphLLM paper.

🔥 Semantic relations boost LLM performance on 3D Referred Object Grounding and Dense Scene Captioning tasks

| Model | ScanRefer Acc@0.25 | ScanRefer Acc@0.5 | Multi3dRefer F1@0.25 | Multi3dRefer F1@0.5 | Scan2Cap CIDEr@0.5 | Scan2Cap B-4@0.5 | ScanQA CIDEr | ScanQA B-4 | SQA3D EM |
|---|---|---|---|---|---|---|---|---|---|
| Chat-Scene | 55.5 | 50.2 | 57.1 | 52.3 | 77.1 | 36.3 | 87.7 | 14.3 | 55.1 |
| 3DGraphLLM (LLAMA3-8B) | 62.4 | 56.6 | 64.7 | 59.9 | 81.0 | 36.5 | 88.8 | 13.7 | 55.9 |

🔨 Preparation

  • Prepare the environment:

    ```shell
    conda create -n 3dgraphllm python=3.9.17
    conda activate 3dgraphllm
    conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 pytorch-cuda=11.8 -c pytorch -c nvidia
    pip install -r requirements.txt
    ```
  • If you don't have root permissions to install Java (needed by the pycocoeval scripts that compute metrics such as BLEU and CIDEr), install it with conda:

    ```shell
    conda install -c conda-forge openjdk
    ```
  • Download LLM backbone:

    • We use LLAMA3-8B-Instruct in our experiments, which can be downloaded from Hugging Face.

    • Change the llama_model_path in config.py to the path of LLAMA3-8B-Instruct.

  • Annotations and extracted features:

    Please follow the instructions in preprocess.
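With the steps above done, a quick sanity check (my own sketch, not part of the repo) can confirm that the packages pinned by the conda command are importable before launching any scripts:

```python
import importlib.util

def has_package(name: str) -> bool:
    """Return True if `name` can be imported in the current environment."""
    return importlib.util.find_spec(name) is not None

# Packages pinned by the conda install command above.
for pkg in ("torch", "torchvision", "torchaudio"):
    print(f"{pkg}: {'ok' if has_package(pkg) else 'MISSING'}")
```

`find_spec` locates a package without actually importing it, so the check stays fast even for heavyweight packages like torch.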

🤖 Training and Inference

  • Pre-training on GT instance segmentation scene graphs.

    • Modify run_gt_pretrain.sh:

      ```shell
      train_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"
      val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
      evaluate=False
      ```
      Explanation of "train_tag" and "val_tag":
      • Use # to separate different datasets

      • Datasets:

        • scanrefer: ScanRefer Dataset
        • scan2cap: Scan2Cap Dataset
        • scanqa: ScanQA Dataset
        • sqa3d: SQA3D Dataset
        • multi3dref: Multi3dRefer Dataset
        • nr3d_caption: A captioning dataset derived from Nr3D.
        • obj_align: A dataset derived from ScanRefer that aligns object identifiers with object tokens.
    • Run: bash scripts/run_gt_pretrain.sh
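The `#`-separated tag strings above decompose into plain dataset names; a minimal Python illustration (independent of the repo's actual parsing code):

```python
# train_tag exactly as set in run_gt_pretrain.sh above.
train_tag = "scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"

# Split on '#', dropping any empty entries from stray separators.
datasets = [t for t in train_tag.split("#") if t]
print(datasets)
# ['scanrefer', 'scan2cap', 'scanqa', 'sqa3d', 'multi3dref', 'nr3d_caption', 'obj_align']
```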

  • Training

    • Modify run.sh:

      ```shell
      train_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"
      val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
      evaluate=False
      pretrained_path="outputs/llama3-8b-gt-pretrain-2/ckpt_00_28927.pth"
      ```
    • Run: bash scripts/run.sh
  • Inference

    • Modify run.sh:

      ```shell
      val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
      evaluate=True
      pretrained_path="/path/to/pretrained_model.pth"
      ```
    • Run: bash scripts/run.sh
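A mistyped `pretrained_path` only fails once the job is already running, so a pre-flight check can save a wasted launch. This is a hypothetical helper of my own, not part of the repo; the `.pth` extension follows the checkpoint names above:

```python
from pathlib import Path

def check_pretrained_path(path: str) -> list[str]:
    """Return a list of problems with a checkpoint path; empty list means it looks usable."""
    p = Path(path)
    problems = []
    if p.suffix != ".pth":
        problems.append(f"unexpected extension {p.suffix!r} (expected .pth)")
    if not p.is_file():
        problems.append(f"no file found at {p}")
    return problems

print(check_pretrained_path("/path/to/pretrained_model.pth"))
```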

🚀 Demo

  • Run: bash demo/run_demo.sh. You will be prompted to ask different queries about Scene 435 of ScanNet.

📪 Contact

If you have any questions about the project, please open an issue in this repository or send an email to Tatiana Zemskova.

📑 Citation

If you find this work helpful, please consider citing our work as:

```
@misc{zemskova20243dgraphllm,
      title={3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding},
      author={Tatiana Zemskova and Dmitry Yudin},
      year={2024},
      eprint={2412.18450},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.18450},
}
```

😊 Acknowledgement

Thanks to the following open-source projects:

Chat-Scene
