This README describes the Question Answering Embedding demo application, which uses a SQuAD-tuned BERT model to calculate embedding vectors for the contexts and for the question, and uses them to find the right context for the question. The primary difference from the bert_question_answering_demo is that this demo shows how inference can be accelerated by pre-computing the embeddings for the contexts.
Upon start-up, the demo application reads command-line parameters and loads the network(s) to the Inference Engine. It also fetches data from the user-provided URLs to populate the list of "contexts" with text. Before any user questions are answered, the embedding vectors are pre-calculated (via inference) for each context in the list. This is done using the first ("embeddings-only") BERT model.
After that, when the user types a question, the "embeddings" network is used to calculate an embedding vector for that question. Using the L2 distance between the embedding vector of the question and the embedding vectors of the contexts, the best (closest) contexts are selected as candidates in which to search for the final answer. At this point, the contexts are displayed to the user.
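The selection step described above can be sketched with NumPy. This is a minimal illustration, not the demo's actual code: the function name `closest_contexts`, the `top_k` parameter, and the toy 3-dimensional vectors are assumptions made for the example, and real BERT embeddings have far more dimensions.

```python
import numpy as np

def closest_contexts(question_emb, context_embs, top_k=2):
    """Return (index, distance) pairs for the top_k contexts nearest to the question."""
    context_embs = np.asarray(context_embs, dtype=np.float32)
    question_emb = np.asarray(question_emb, dtype=np.float32)
    # L2 distance between the question vector and every context vector.
    distances = np.linalg.norm(context_embs - question_emb, axis=1)
    # Indices of the smallest distances, closest first.
    order = np.argsort(distances)[:top_k]
    return [(int(i), float(distances[i])) for i in order]

# Toy embeddings standing in for pre-calculated context vectors (hypothetical data).
context_embs = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 5.0, 5.0]]
question_emb = [0.9, 0.0, 0.0]
best = closest_contexts(question_emb, context_embs, top_k=2)
```

Here `best` lists context 1 first and context 0 second, since those two vectors lie closest to the question vector.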
Notice that the question is usually much shorter than the contexts, so calculating its embedding is fast. Calculating the L2 distance between a context and the question is also almost free compared to the actual inference. Together, this substantially reduces the inference work during question answering: inference is needed ONLY for the question (the context embeddings are pre-calculated), whereas the conventional approach has to concatenate each context with the question and run inference on that large input (once per context).
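A rough back-of-the-envelope comparison may help here. The sketch below only counts the tokens passed through the network per question, which is a crude proxy for inference cost. The numbers (100 contexts, 350 tokens each, a 30-token question) are illustrative assumptions, not measurements from the demo.

```python
def tokens_per_question_conventional(num_contexts, context_len, question_len):
    """Tokens inferred per question when each context is concatenated with the question."""
    return num_contexts * (context_len + question_len)

def tokens_per_question_embedding(question_len):
    """Tokens inferred per question when context embeddings are already pre-calculated."""
    return question_len

def one_time_precompute_tokens(num_contexts, context_len):
    """Tokens inferred once, at start-up, to pre-calculate all context embeddings."""
    return num_contexts * context_len

# Illustrative (hypothetical) workload.
conventional = tokens_per_question_conventional(100, 350, 30)
embedding = tokens_per_question_embedding(30)
precompute = one_time_precompute_tokens(100, 350)
```

Under these assumptions the conventional approach processes 38000 tokens for every question, while the embedding approach processes 30 tokens per question after a one-time cost of 35000 tokens.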
If the second (conventional SQuAD-tuned) BERT model is provided as well, it is used to search for the exact answer in the best contexts found in the first step, and the result is then also displayed to the user.
Running the application with the -h option yields the following usage message:
NOTE: Before running the demo with a trained model, make sure to convert the model to the Inference Engine's Intermediate Representation format (*.xml + *.bin) using the Model Optimizer tool. When using the pre-trained BERT from the model zoo (please see Model Downloader), the model is already converted to the IR.
The application reads text from the HTML pages at the given URLs and then answers questions typed in the console. The models and their parameters (inputs and outputs) are also important demo arguments. Notice that since the order of inputs for the model does matter, the demo script checks that the inputs specified on the command line match the actual network inputs. The embedding model is reshaped by the demo to infer embedding vectors for long contexts and short questions. Make sure that the original model was converted by Model Optimizer with the reshape option. Please see the general reshape intro and limitations.
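The input check mentioned above could look roughly like the following. This is a hypothetical sketch, not the demo's implementation: the helper name `check_input_names`, the comma-separated argument format, and the example input names are all assumptions, and the demo itself may compare the names differently.

```python
def check_input_names(requested, actual):
    """Raise ValueError unless the requested names equal the network's names, in order."""
    # Parse a comma-separated string such as one given on the command line.
    requested_names = [name.strip() for name in requested.split(",")]
    if requested_names != list(actual):
        raise ValueError(
            "Expected inputs {} but the network has {}".format(requested_names, list(actual))
        )
    return requested_names

# Hypothetical network input names, used for illustration only.
network_inputs = ["input_ids", "attention_mask", "token_type_ids"]
names = check_input_names("input_ids, attention_mask, token_type_ids", network_inputs)
```

Comparing ordered lists rather than sets means that a permutation of the correct names is rejected too, which matters when inputs are fed to the model by position.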
The application outputs contexts with answers to the same console.
The Open Model Zoo Models include an example BERT-large model tuned on SQuAD* for embedding calculation. It has "embedding" in its name. For the second stage, which finds the exact answer in the filtered contexts, the same models as for the bert_question_answering_demo can be used.
You can use the following command to try the demo (assuming the model from the Open Model Zoo, downloaded with the Model Downloader executed with "--name bert*"):
The demo will use the Wikipedia articles about the Bert character and the speed of light to answer your questions like "what is the speed of light", "how to measure the speed of light", "who is Bert", "how old is Bert", etc.
Notice that when the original "context" (paragraph text from the URL), alone or together with the question, does not fit the model input (usually 384 tokens for BERT-Large, or 128 for BERT-Base), the demo splits the paragraph into overlapping segments. Thus, for long paragraph texts, the network is called multiple times, once per segment, as if for separate contexts.
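The idea of overlapping segments can be sketched as a sliding window over the token list. This is an assumed, simplified scheme: the function name `split_into_windows` and the `window_size` and `stride` parameters are hypothetical, and the demo's own splitting logic (including how it accounts for the question and special tokens) may differ.

```python
def split_into_windows(tokens, window_size, stride):
    """Split a token list into overlapping windows of at most window_size tokens."""
    if window_size <= 0 or not 0 < stride <= window_size:
        raise ValueError("need window_size > 0 and 0 < stride <= window_size")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window_size])
        # Stop once the current window reaches the end of the text.
        if start + window_size >= len(tokens):
            break
        # A stride smaller than the window makes neighbouring windows overlap.
        start += stride
    return windows

# Ten stand-in "tokens", windows of 4 tokens, each shifted by 2 tokens.
tokens = list(range(10))
windows = split_into_windows(tokens, window_size=4, stride=2)
```

With these values each window shares two tokens with its neighbour, so an answer that straddles a segment boundary still appears whole in at least one segment as long as it is no longer than the overlap.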
Even though the demo reports inference performance (by measuring wall-clock time for individual inference calls), this is only baseline performance, since techniques such as batching and throughput mode can be applied to improve it. Please use the full-blown Benchmark C++ Sample for any actual performance measurements.