This is a text spotting composite model that simultaneously detects and recognizes text. The model detects symbol sequences separated by spaces and performs recognition without a dictionary. The model is built on top of the Mask-RCNN framework with an additional attention-based text recognition head.
Alphabet is alphanumeric: 26 lowercase Latin letters and 10 digits (36 symbols in total).
| Metric                                               | Value  |
| ---------------------------------------------------- | ------ |
| Word spotting hmean ICDAR2015, without a dictionary  | 71.29% |
The word spotting hmean metric is defined and measured according to the Incidental Scene Text (ICDAR2015) challenge.
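As a reminder of how the metric combines precision and recall, here is a minimal sketch of the harmonic-mean (F-score) computation; the numbers in the example are illustrative, not the model's actual precision and recall:

```python
def hmean(precision: float, recall: float) -> float:
    """Harmonic mean (F-score) of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative values only, not measured for this model.
score = hmean(0.75, 0.68)
print(round(score, 4))  # 0.7133
```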
The text-spotting-0005-detector model is a Mask-RCNN-based text detector with a ResNet50 backbone and an additional text features output.
Inputs:

1. Image, shape: `1, 3, 768, 1280` in the `1, C, H, W` format, where:

   - `C` - number of channels
   - `H` - image height
   - `W` - image width

   The expected channel order is `BGR`.

2. Image information, shape: `1, 3`: processed image height, processed image width, and processed image scale with respect to the original image resolution.
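Both inputs can be prepared from an arbitrary image. Below is a minimal numpy sketch assuming an aspect-preserving resize with zero padding; the exact resize policy of the original demo pipeline is an assumption, and `preprocess` is a hypothetical helper:

```python
import numpy as np

NET_H, NET_W = 768, 1280  # detector input resolution

def preprocess(image: np.ndarray):
    """Resize (nearest-neighbor, aspect-preserving) and zero-pad an HxWx3
    image to the detector input, returning the blob and the image info."""
    h, w = image.shape[:2]
    scale = min(NET_H / h, NET_W / w)
    new_h, new_w = int(h * scale), int(w * scale)
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = image[rows][:, cols]
    padded = np.zeros((NET_H, NET_W, 3), dtype=image.dtype)
    padded[:new_h, :new_w] = resized
    blob = padded.transpose(2, 0, 1)[None]  # 1, C, H, W
    im_info = np.array([[new_h, new_w, scale]], dtype=np.float32)  # 1, 3
    return blob, im_info

blob, im_info = preprocess(np.zeros((480, 640, 3), dtype=np.uint8))
print(blob.shape, im_info.shape)  # (1, 3, 768, 1280) (1, 3)
```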
Outputs:

1. Labels, shape: `100`. Contiguous integer class ID for every detected object; `0` is the text class.
2. Boxes, shape: `100, 5`. Bounding boxes around every detected object in the (top_left_x, top_left_y, bottom_right_x, bottom_right_y, confidence) format.
3. Masks, shape: `100, 28, 28`. Text segmentation masks for every output bounding box.
4. Text features, shape: `100, 64, 28, 28`. Text features that are fed to the text recognition head.
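A detection is typically kept only if it belongs to the text class and its confidence is high enough; the matching per-box recognition features are then forwarded to the recognition head. A minimal numpy sketch, where `filter_detections` and the 0.5 threshold are illustrative assumptions:

```python
import numpy as np

def filter_detections(labels, boxes, text_features, conf_threshold=0.5):
    """Keep text detections (class 0) above a confidence threshold and
    gather the matching per-box recognition features."""
    keep = (labels == 0) & (boxes[:, 4] > conf_threshold)
    return boxes[keep], text_features[keep]

# Toy example with 3 of the 100 detection slots filled.
labels = np.array([0, 0, 1])
boxes = np.array([[0, 0, 10, 10, 0.9],
                  [5, 5, 20, 20, 0.2],
                  [1, 1, 8, 8, 0.8]], dtype=np.float32)
feats = np.zeros((3, 64, 28, 28), dtype=np.float32)

kept_boxes, kept_feats = filter_detections(labels, boxes, feats)
print(kept_boxes.shape, kept_feats.shape)  # (1, 5) (1, 64, 28, 28)
```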
The text-spotting-0005-recognizer-encoder model is a fully-convolutional encoder of the text recognition head.
Inputs:

1. Shape: `1, 64, 28, 28`. Text recognition features obtained from the detection part.

Outputs:

1. Shape: `1, 256, 28, 28`. Encoded text recognition features.
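The decoder consumes these features with shape `1, (28*28), 256`, so the encoder output has to be flattened and transposed between the two models. A minimal numpy sketch of that layout change (whether it happens inside or outside the models is an assumption here):

```python
import numpy as np

# Encoder output: 1, 256, 28, 28  ->  decoder input: 1, 28*28, 256
encoded = np.random.rand(1, 256, 28, 28).astype(np.float32)
decoder_in = encoded.reshape(1, 256, 28 * 28).transpose(0, 2, 1)
print(decoder_in.shape)  # (1, 784, 256)
```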
The text-spotting-0005-recognizer-decoder model is a GRU-based decoder of the text recognition head.

Inputs:

1. Shape: `1, (28*28), 256`. Encoded text recognition features.
2. Shape: `1, 1`. Index in the alphabet of the previously generated symbol.
3. Shape: `1, 1, 256`. Previous hidden state of the GRU.

Outputs:

1. Shape: `1, 38`. Classification scores for every symbol. Indices starting from 2 correspond to symbols from the alphabet; indices 0 and 1 are the special Start of Sequence and End of Sequence symbols, respectively.
2. Shape: `1, 1, 256`. Current hidden state of the GRU.
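Putting the decoder interface together, text is produced autoregressively: feed the previous symbol and hidden state, take the argmax of the scores, and stop at End of Sequence. A minimal sketch with a stubbed inference step; the alphabet string, its ordering, and the maximum length are assumptions:

```python
import numpy as np

SOS, EOS = 0, 1
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"  # assumed symbol order

def greedy_decode(step, hidden, max_len=28):
    """Autoregressive greedy decoding. `step(prev_symbol, hidden)` stands in
    for one decoder inference and returns (scores[1, 38], new_hidden)."""
    symbol, text = SOS, []
    for _ in range(max_len):  # max_len is an assumed cap
        scores, hidden = step(np.array([[symbol]]), hidden)
        symbol = int(scores.argmax())
        if symbol == EOS:
            break
        text.append(ALPHABET[symbol - 2])  # indices 2+ map to the alphabet
    return "".join(text)

# Stub decoder that emits "hi" and then End of Sequence.
seq = iter([ALPHABET.index("h") + 2, ALPHABET.index("i") + 2, EOS])
def fake_step(prev_symbol, hidden):
    scores = np.zeros((1, 38), dtype=np.float32)
    scores[0, next(seq)] = 1.0
    return scores, hidden

result = greedy_decode(fake_step, np.zeros((1, 1, 256), dtype=np.float32))
print(result)  # hi
```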
[*] Other names and brands may be claimed as the property of others.