This document shows how to build and run a model with the Language-Adapter plugin in TensorRT-LLM on NVIDIA GPUs.
The concept of swapping a language adapter at inference time was introduced in MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer:
we can simply replace a language-specific adapter trained for English with a language-specific adapter trained for Quechua at inference time.
The implementation uses the MoE plugin with static expert selection: the expert to use is passed at runtime as a parameter of the request.
For instance, an encoder-decoder model can use language adapters for language-specific translation tasks, where each language adapter is trained for a specific language. The language adapter plugin switches languages within a single session simply by passing the language_task_uid to the plugin.
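Static expert selection can be illustrated with a minimal, framework-free sketch. This is not the TensorRT-LLM implementation: the names `make_toy_adapter` and `apply_static_routing`, the toy adapters, and the uid values are all assumptions made for illustration. The only point it makes is that no learned gate is involved, because the expert index of each token is supplied by the request.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Adapter = Callable[[Vector], Vector]


def make_toy_adapter(scale: float) -> Adapter:
    """Hypothetical stand-in for one trained language adapter (one MoE expert)."""
    def adapter(hidden: Vector) -> Vector:
        # Residual connection plus a language-specific transformation.
        return [h + scale * h for h in hidden]
    return adapter


def apply_static_routing(
    hidden_states: Sequence[Vector],
    routing: Sequence[int],
    experts: Dict[int, Adapter],
) -> List[Vector]:
    """Send each token to the expert named by its routing entry.

    Unlike a regular MoE layer, no gating network scores the experts:
    the expert index comes directly from the request.
    """
    if len(hidden_states) != len(routing):
        raise ValueError("need exactly one routing entry per token")
    return [experts[uid](token) for token, uid in zip(hidden_states, routing)]


# Toy experts keyed by an arbitrary language_task_uid (assumed values).
experts = {2: make_toy_adapter(0.5), 3: make_toy_adapter(-0.5)}
tokens = [[1.0, 2.0], [4.0, 8.0]]

# Same tokens, different target language: only the routing changes.
as_uid_3 = apply_static_routing(tokens, [3, 3], experts)
as_uid_2 = apply_static_routing(tokens, [2, 2], experts)
print(as_uid_3)
print(as_uid_2)
```

Because the routing is just data, two requests in the same batch can target different languages without rebuilding the engine or opening a new session.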
The model checkpoint used here is not publicly available. To use the plugin in your own model, leverage layers/language_adapter.py.
MODEL_DIR="dummy_model" # model not publicly available
INFERENCE_PRECISION="float16"
TP_SIZE=1
PP_SIZE=1
WORLD_SIZE=1
MODEL_TYPE=language_adapter
MODEL_NAME=$MODEL_TYPE
CKPT_DIR=/scratch/tmp/trt_models/${MODEL_NAME}/${WORLD_SIZE}-gpu/${INFERENCE_PRECISION}
ENGINE_DIR=/scratch/tmp/trt_engines/${MODEL_NAME}/${WORLD_SIZE}-gpu/${INFERENCE_PRECISION}
max_beam=5
max_batch=32
max_input_len=1024
max_output_len=1024
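# Convert the checkpoint to TensorRT-LLM format (writes encoder and decoder subdirectories under $CKPT_DIR).
python ../enc_dec/convert_checkpoint.py --model_type ${MODEL_TYPE} \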
--model_dir ${MODEL_DIR} \
--output_dir $CKPT_DIR \
--tp_size ${TP_SIZE} \
--pp_size ${PP_SIZE} \
--dtype ${INFERENCE_PRECISION} \
--workers 1
# Build the encoder engine.
trtllm-build --checkpoint_dir $CKPT_DIR/encoder \
--output_dir $ENGINE_DIR/encoder \
--paged_kv_cache disable \
--moe_plugin auto \
--bert_attention_plugin ${INFERENCE_PRECISION} \
--gpt_attention_plugin ${INFERENCE_PRECISION} \
--gemm_plugin ${INFERENCE_PRECISION} \
--remove_input_padding enable \
--max_input_len ${max_input_len} \
--max_beam_width ${max_beam} \
--max_batch_size ${max_batch}
# Build the decoder engine.
trtllm-build --checkpoint_dir $CKPT_DIR/decoder \
--output_dir $ENGINE_DIR/decoder \
--paged_kv_cache enable \
--moe_plugin auto \
--bert_attention_plugin ${INFERENCE_PRECISION} \
--gpt_attention_plugin ${INFERENCE_PRECISION} \
--gemm_plugin ${INFERENCE_PRECISION} \
--remove_input_padding enable \
--max_input_len 1 \
--max_beam_width ${max_beam} \
--max_batch_size ${max_batch} \
--max_seq_len ${max_output_len}
A list, language_task_uids, containing the language_task_uid for each input prompt is required:
# Translate 2 sentences: 1 to French (language_task_uid=3) and 1 to Spanish (language_task_uid=2).
# language_task_uids = [3, 2]
TEXT="Where is the nearest restaurant? Wikipedia is a free online encyclopedia written and maintained by a community of volunteers (called Wikis) through open collaboration and the use of MediaWiki, a wiki-based editing system."
python3 ../run.py --engine_dir $ENGINE_DIR --tokenizer_type "language_adapter" --max_input_length 512 --max_output_len 512 --num_beams 1 --input_file input_ids.npy --tokenizer_dir $MODEL_DIR --language_task_uids 3 2
# Input [Text 0]: ""
# Output [Text 0 Beam 0]: "Où se trouve le restaurant le plus proche ? Wikipédia est une encyclopédie en ligne gratuite écrite et maintenue par une communauté de bénévoles (appelés Wikis) grâce à une collaboration ouverte et à l'utilisation de MediaWiki, un système d'édition basé sur wiki."
# Input [Text 1]: ""
# Output [Text 1 Beam 0]: "¿Dónde está el restaurante más cercano? Wikipedia es una enciclopedia en línea gratuita escrita y mantenida por una comunidad de voluntarios (llamada Wikis) a través de la colaboración abierta y el uso de MediaWiki, un sistema de edición basado en wiki."
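If requests are assembled programmatically, the per-prompt uid list can be derived from the target languages. The sketch below is only an illustration: the `LANGUAGE_TASK_UIDS` mapping and the helper `language_task_uid_args` are hypothetical, and the uid values (3 for French, 2 for Spanish) are taken from the example above and depend entirely on how the adapters of a given model were trained.

```python
import shlex
from typing import Dict, List, Sequence

# Hypothetical mapping; real uids depend on how the adapters were trained.
# The values 3 (French) and 2 (Spanish) come from the example above.
LANGUAGE_TASK_UIDS: Dict[str, int] = {"fr": 3, "es": 2}


def language_task_uid_args(target_languages: Sequence[str]) -> List[str]:
    """Return the --language_task_uids arguments, one uid per input prompt."""
    unknown = [lang for lang in target_languages if lang not in LANGUAGE_TASK_UIDS]
    if unknown:
        raise KeyError(f"no language adapter registered for: {unknown}")
    uids = [str(LANGUAGE_TASK_UIDS[lang]) for lang in target_languages]
    return ["--language_task_uids"] + uids


# One target language per input prompt, in prompt order.
args = language_task_uid_args(["fr", "es"])
print(shlex.join(args))
```

The order of the uids must match the order of the input prompts, since each uid applies to the prompt at the same position.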
Currently, the Python runtime does not support beam_width > 1.
For the Python runtime, full routing information of shape [num_tokens, 1] is required for both the encoder and the decoder; it stacks the routing information of every token in a batch of requests.
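The [num_tokens, 1] layout can be sketched as follows. This is a hedged illustration, not the routine used by run.py: the helper name `build_language_adapter_routing` and its arguments are assumptions. It assumes packed inputs (no padding), so each request contributes exactly as many rows as it has tokens, and it assumes that with beam width 1 the decoder handles one token per request per generation step.

```python
from typing import List, Sequence


def build_language_adapter_routing(
    language_task_uids: Sequence[int],
    num_tokens_per_request: Sequence[int],
) -> List[List[int]]:
    """Repeat each request's uid once per token and stack the rows.

    The result has shape [num_tokens, 1], where num_tokens is the total
    number of tokens across all requests in the batch.
    """
    if len(language_task_uids) != len(num_tokens_per_request):
        raise ValueError("need exactly one language_task_uid per request")
    routing: List[List[int]] = []
    for uid, num_tokens in zip(language_task_uids, num_tokens_per_request):
        routing.extend([uid] for _ in range(num_tokens))
    return routing


# Encoder side: two requests with 4 and 2 input tokens.
encoder_routing = build_language_adapter_routing([3, 2], [4, 2])
print(encoder_routing)

# Decoder side (assumption): one token per request per step at beam width 1.
decoder_routing = build_language_adapter_routing([3, 2], [1, 1])
print(decoder_routing)
```

In a real pipeline the nested list would be converted to an integer tensor on the device before being passed to the runtime.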
# language_adapter_routing = get_language_adapter_routings(language_task_uid, input_ids)
TEXT="Where is the nearest restaurant? Wikipedia is a free online encyclopedia written and maintained by a community of volunteers (called Wikis) through open collaboration and the use of MediaWiki, a wiki-based editing system."
python3 ../enc_dec/run.py --engine_dir $ENGINE_DIR --engine_name ${MODEL_NAME} --model_name $MODEL_DIR --max_new_token=64 --num_beams=1
# In run.py, 2 input prompts and 2 language task uids are provided. Each task uid specifies the language that the corresponding input prompt is translated to.
# TRT-LLM output text: ['¿Dónde está el restaurante más cercano? Wikipedia es una enciclopedia en línea gratuita escrita y mantenida por una comunidad de voluntarios (llamada Wikis) a través de la colaboración abierta y el uso de MediaWiki, un sistema de edición basado en wiki.', "Où se trouve le restaurant le plus proche ? Wikipédia est une encyclopédie en ligne gratuite écrite et maintenue par une communauté de bénévoles (appelés Wikis) grâce à une collaboration ouverte et à l'utilisation de MediaWiki, un système d'édition basé sur wiki."]