Formula Recognition Python* Demo

This demo shows how to run Im2LaTeX models. Im2LaTeX models allow to get a LaTeX formula markup from the image.

NOTE: Only batch size of 1 is supported.

## How It Works

The demo application expects an im2latex model that is split into two parts. Every model part must be in the Intermediate Representation (IR) format.

The First model is Encoder which extracts features from an image and prepares first steps of the decoder.

• One input is imgs for input image
• Four outputs are:
• Row encoding out (row_enc_out) extracts features from the image
• hidden and
• context are intermediate states of the LSTM
• init_0 - first state of the encoder

Second model is Decoder that takes as input:

• row_enc_out - extracted images features from the encoder
• Decoding state context (dec_st_c) and
• Decoding state hidden (dec_st_h) - current states of the LSTM
• output_prev - previous output of the Decode Step (for the first time it is init_0 of the encoder)
• Target (tgt) - previous token (for the first time it is START_TOKEN ) Second model is being executed until current decoded token is END_TOKEN or length of the formula is less then --max_formula_len producing one token per each decode step.

The demo application takes an input with the help of the -i argument. This could be:

• Path to a single image
• Path to a folder with images

In this Case non-interactive mode would be triggered. This means that demo will run the model over the input image(s) and will try to predict the formula. The output would be stored in the console or in the output file (if specified)

* Integer identifier of the device (e.g. web-camera), typically 0.
• Path to a video (.avi, .mp4, etc)

This will trigger interactive mode, which would be explained in detail later.

Vocabulary files are provided under corresponding model configuration directory.

### Non-interactive mode

Non-interactive mode assumes that demo processes inputs sequentially. The demo workflow in non-interactive mode is the following:

1. The demo application reads a single image or iterates over all images in the given folder, then crops or resizes and inputs to fit into the input image blob of the network (imgs). Crop and pad is used to keep size of the font.
2. For each image, encoder extracts features from the image
3. While length of the current formula is less then --max_formula_len or current token is not END_TOKEN Decode Step produces new tokens.
4. The demo prints the decoded text to a file if -o parameter specified or into the console and (optionally) renders predicted formula into image.

#### Rendering of the LaTeX formula into image

User has an option to render the LaTeX formula predicted by the demo application into an image. Regardless of what mode is selected (interactive or non-interactive) the process of the rendering of the formula is the same.

##### Requirements for rendering

Sympy python package is used for rendering. To install it, please, run: pip install -r requirements.txt Sympy package needs LaTeX system installed in the operating system. For Windows you can use MiKTeX (just download and install it), for Ubuntu/MacOS you can use TeX Live: Ubuntu: apt-get update && apt-get install texlive MacOS: brew install texlive

Note: Other LaTeX systems should also work.

### Interactive mode

The example of the interface:

When User runs demo application with the -i option and passes video or number of the web-camera device as an argument (typically 0), window with the image simillar to above should pop up.

Example of usage of the interactive mode:

python formula_recognition_demo.py <required args> -i 0

or

python formula_recognition_demo.py <required args> -i input_video.mp4

The window has four main sections:

1. A red rectangle is placed on the center of this window. This is input "target", with the help of which User, moving the camera, can capture formula.
2. Image from the input target will be binarized, preprocessed and fed to the network. Preprocessed and binarized image is placed on the top of the window (near Model input label)
3. If the formula will be predicted with sufficient confidence score, it will be placed right under preprocessed image (near Predicted label)
4. If rendering is available (see the previous Paragraph for details) and predicted formula does not contain LaTeX grammar errors, it will be rendered and placed near Rendered label.

• Use q button to quit from program
• Use o to decrease the size of the input (red) window
• Use p to increase the size of the input window

The overall process is simillar to the Non-interactive mode with the exception that it runs asynchronously. This means model inference and rendering of the formula do not block main thread, so the image from the web camera can move smoothly.

NOTE: By default, Open Model Zoo demos expect input with BGR channels order. If you trained your model to work with RGB order, you need to manually rearrange the default channels order in the demo application or reconvert your model using the Model Optimizer tool with --reverse_input_channels argument specified. For more information about the argument, refer to When to Reverse Input Channels section of Converting a Model Using General Conversion Parameters.

The demo has two preprocessing types: Crop and Pad to target shape and Resize and pad to target shape. Two preprocessing types are used for two different datasets as model trained with concrete font size, so if one wants to run the model on inputs with bigger font size (e.g. if input is photographed in 12Mpx, while model trained to imitate scans in ~3Mpx) they should first resize the input to make font size like in train set. Example of the target font size:

## Running

Run the application with the -h option to see the following usage message:

usage: formula_recognition_demo.py [-h] -m_encoder M_ENCODER -m_decoder
M_DECODER -i INPUT [-no_show]
[-o OUTPUT_FILE] -v VOCAB_PATH
[--max_formula_len MAX_FORMULA_LEN]
[-t CONF_THRESH] [-d DEVICE]
[--resolution RESOLUTION RESOLUTION]
[--preprocessing_type {crop,resize}] [-pc]
[--imgs_layer IMGS_LAYER]
[--row_enc_out_layer ROW_ENC_OUT_LAYER]
[--hidden_layer HIDDEN_LAYER]
[--context_layer CONTEXT_LAYER]
[--init_0_layer INIT_0_LAYER]
[--dec_st_c_layer DEC_ST_C_LAYER]
[--dec_st_h_layer DEC_ST_H_LAYER]
[--dec_st_c_t_layer DEC_ST_C_T_LAYER]
[--dec_st_h_t_layer DEC_ST_H_T_LAYER]
[--output_layer OUTPUT_LAYER]
[--output_prev_layer OUTPUT_PREV_LAYER]
[--logit_layer LOGIT_LAYER]
[--tgt_layer TGT_LAYER]
Options:
-h, --help Show this help message and exit.
-m_encoder M_ENCODER Required. Path to an .xml file with a trained encoder
part of the model
-m_decoder M_DECODER Required. Path to an .xml file with a trained decoder
part of the model
-i INPUT, --input INPUT
Required. Path to a folder with images, path to an
image files, integer identificator of the camera or
path to the video. See README.md for details.
-no_show, --no_show Optional. Suppress pop-up window with rendered
formula.
-o OUTPUT_FILE, --output_file OUTPUT_FILE
Optional. Path to file where to store output. If not
mentioned, result will be stored in the console.
-v VOCAB_PATH, --vocab_path VOCAB_PATH
Required. Path to vocab file to construct meaningful
phrase
--max_formula_len MAX_FORMULA_LEN
Optional. Defines maximum length of the formula
(number of tokens to decode)
-t CONF_THRESH, --conf_thresh CONF_THRESH
Optional. Probability threshold to treat model
prediction as meaningful
-d DEVICE, --device DEVICE
Optional. Specify the target device to infer on; CPU,
GPU, FPGA, HDDL or MYRIAD is acceptable. Sample will
look for a suitable plugin for device specified.
Default value is CPU
--resolution RESOLUTION RESOLUTION
Optional. Resolution of the demo application window.
Default: 1280 720
--preprocessing_type {crop,resize}
Optional. Type of the preprocessing
-pc, --perf_counts
--imgs_layer IMGS_LAYER
Optional. Encoder input name for images. See README
for details.
--row_enc_out_layer ROW_ENC_OUT_LAYER
Optional. Encoder output key for row_enc_out. See
--hidden_layer HIDDEN_LAYER
Optional. Encoder output key for hidden. See README
for details.
--context_layer CONTEXT_LAYER
Optional. Encoder output key for context. See README
for details.
--init_0_layer INIT_0_LAYER
Optional. Encoder output key for init_0. See README
for details.
--dec_st_c_layer DEC_ST_C_LAYER
Optional. Decoder input key for dec_st_c. See README
for details.
--dec_st_h_layer DEC_ST_H_LAYER
Optional. Decoder input key for dec_st_h. See README
for details.
--dec_st_c_t_layer DEC_ST_C_T_LAYER
Optional. Decoder output key for dec_st_c_t. See
--dec_st_h_t_layer DEC_ST_H_T_LAYER
Optional. Decoder output key for dec_st_h_t. See
--output_layer OUTPUT_LAYER
Optional. Decoder output key for output. See README
for details.
--output_prev_layer OUTPUT_PREV_LAYER
Optional. Decoder input key for output_prev. See
--logit_layer LOGIT_LAYER
Optional. Decoder output key for logit. See README for
details.
--tgt_layer TGT_LAYER
Optional. Decoder input key for tgt. See README for
details.

Running the application with an empty list of options yields the short version of the usage message and an error message.

To run the demo, you can use public or pre-trained models. To download the pre-trained models, use the OpenVINO Model Downloader. The list of models supported by the demo is in the models.lst file in the demo's directory.

NOTE: Before running the demo with a trained model, make sure the model is converted to the Inference Engine format (*.xml + *.bin) using the Model Optimizer tool.

To run the demo, please provide paths to the model in the IR format and to an input video or folder with images:

python formula_recognition_demo.py \
-m_encoder <path_to_models>/encoder.xml \
-m_decoder <path_to_models>/decoder.xml \
--vocab_path <path_to_vocab> \
--preprocessing <preproc type> \
-i input_image.png

## Demo Output

The application outputs recognized formula into the console or into the file.