View on TensorFlow.org | Run in Google Colab | View on GitHub | Download notebook | See TF Hub models
This is a demo for using the Universal Sentence Encoder Multilingual Q&A model for question-answer retrieval of text, illustrating the use of the model's question_encoder and response_encoder. We use sentences from SQuAD paragraphs as the demo dataset: each sentence and its context (the text surrounding the sentence) is encoded into a high-dimensional embedding with the response_encoder, and these embeddings are stored in an index built with the simpleneighbors library for question-answer retrieval.
At retrieval time, a random question is selected from the SQuAD dataset and encoded into a high-dimensional embedding with the question_encoder; this embedding is then used to query the simpleneighbors index, which returns a list of approximate nearest neighbors in the semantic space.
More models
You can find all currently hosted text embedding models here, as well as all models that have been trained on SQuAD here.
Setup
Setup Environment
Setup common imports and functions
2024-02-02 12:42:03.366166: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2024-02-02 12:42:04.103707: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2024-02-02 12:42:04.103807: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2024-02-02 12:42:04.103818: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
[nltk_data] Downloading package punkt to /home/kbuilder/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
Run the following code block to download and extract the SQuAD dataset into:
- sentences: a list of (text, context) tuples. Each paragraph from the SQuAD dataset is split into sentences using the nltk library, and the sentence together with its paragraph text forms a (text, context) tuple.
- questions: a list of (question, answer) tuples.
Download and extract SQuAD data
10455 sentences, 10552 questions extracted from SQuAD https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
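The download cell above is hidden; the sketch below shows how the SQuAD JSON might be parsed into the two lists described above. The inline JSON is a tiny hypothetical sample standing in for dev-v1.1.json, and a naive split on ". " stands in for the notebook's nltk sentence tokenizer to keep the sketch dependency-free:

```python
import json

# Tiny inline sample mimicking the SQuAD v1.1 JSON layout (hypothetical data,
# standing in for the downloaded dev-v1.1.json).
squad_json = json.loads("""{
  "data": [{"paragraphs": [{
      "context": "Oxygen is a chemical element. It has symbol O.",
      "qas": [{"question": "What is the symbol for oxygen?",
               "answers": [{"text": "O"}]}]}]}]
}""")

sentences, questions = [], []
for article in squad_json["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        # The notebook uses nltk.sent_tokenize here; a naive split on ". "
        # keeps this sketch self-contained.
        for sent in [s.strip() for s in context.split(". ") if s.strip()]:
            sentences.append((sent, context))
        for qa in paragraph["qas"]:
            questions.append((qa["question"], qa["answers"][0]["text"]))

print(len(sentences), "sentences,", len(questions), "questions")
```

Every sentence in a paragraph shares the same context string, which is what lets the response_encoder condition each sentence on its surrounding text.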
Example sentence and context:
sentence:
('Oxygen gas is increasingly obtained by these non-cryogenic technologies (see '
'also the related vacuum swing adsorption).')
context:
('The other major method of producing O\n'
'2 gas involves passing a stream of clean, dry air through one bed of a pair '
'of identical zeolite molecular sieves, which absorbs the nitrogen and '
'delivers a gas stream that is 90% to 93% O\n'
'2. Simultaneously, nitrogen gas is released from the other '
'nitrogen-saturated zeolite bed, by reducing the chamber operating pressure '
'and diverting part of the oxygen gas from the producer bed through it, in '
'the reverse direction of flow. After a set cycle time the operation of the '
'two beds is interchanged, thereby allowing for a continuous supply of '
'gaseous oxygen to be pumped through a pipeline. This is known as pressure '
'swing adsorption. Oxygen gas is increasingly obtained by these non-cryogenic '
'technologies (see also the related vacuum swing adsorption).')
The following code block sets up the TensorFlow graph g and session with the Universal Sentence Encoder Multilingual Q&A model's question_encoder and response_encoder signatures.
Load model from tensorflow hub
2024-02-02 12:42:11.161871: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
The following code block computes the embeddings for all the (text, context) tuples using the response_encoder and stores them in a simpleneighbors index.
Compute embeddings and build simpleneighbors index
Computing embeddings for 10455 sentences
simpleneighbors index for 10455 sentences built.
On retrieval, the question is encoded using the question_encoder and the question embedding is used to query the simpleneighbors index.
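To make the retrieval step concrete, here is a self-contained sketch of the index-and-query idea. simpleneighbors wraps an approximate nearest-neighbor index (Annoy) under angular distance; since that library may not be installed, this sketch does exact cosine-similarity search with numpy over random stand-in vectors, which play the role of the response_encoder and question_encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for response_encoder output: one 512-dim embedding per sentence.
# In the notebook these come from the model, not from a random generator.
sentences = ["sentence one", "sentence two", "sentence three"]
response_embeddings = rng.normal(size=(len(sentences), 512))

# simpleneighbors uses angular distance; after normalizing, a dot product
# equals cosine similarity.
index = response_embeddings / np.linalg.norm(
    response_embeddings, axis=1, keepdims=True)

def nearest(question_embedding, n=2):
    """Return the n sentences whose embeddings are closest to the question's."""
    q = question_embedding / np.linalg.norm(question_embedding)
    scores = index @ q                # cosine similarity to every response
    top = np.argsort(-scores)[:n]     # highest-similarity first
    return [sentences[i] for i in top]

query = rng.normal(size=512)          # stand-in for question_encoder output
print(nearest(query))
```

With simpleneighbors itself, the equivalent calls would be feeding (sentence, embedding) pairs into the index, calling build(), and then querying with nearest(); the brute-force version above returns exact rather than approximate neighbors.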