OpenVINO™ Model Server (OVMS) is a scalable, high-performance solution for serving machine learning models optimized for Intel® architectures. The server provides an inference service via gRPC or REST API - making it easy to deploy new algorithms and AI experiments using the same architecture as TensorFlow* Serving for any models trained in a framework that is supported by OpenVINO.
The server implements gRPC and REST API framework with data serialization and deserialization using TensorFlow Serving API, and OpenVINO™ as the inference execution provider. Model repositories may reside on a locally accessible file system (for example, NFS), Google Cloud Storage* (GCS), Amazon S3*, MinIO*, or Azure Blob Storage*.
OVMS is now implemented in C++ and provides much higher scalability compared to its predecessor in the Python version. You can take full advantage of Xeon® CPU capabilities or AI accelerators and expose them over the network interface. Read the release notes to find out what's new in the C++ version.
Review the Architecture Concept document for more details.
A few key features:
NOTE: OVMS has been tested on CentOS* and Ubuntu*. Publicly released Docker images are based on CentOS.
The command generates:
- `openvino/model_server:latest` with CPU, NCS, and HDDL support
- `openvino/model_server:latest-gpu` with CPU, NCS, HDDL, and iGPU support
- a `.tar.gz` release package with the OVMS binary and necessary libraries

The release package is compatible with Linux machines on which the glibc version is greater than or equal to that of the build image. For debugging, the command also generates an image with an additional suffix.
NOTE: Images include OpenVINO 2021.1 release.
Find a detailed description of how to use the OpenVINO Model Server in the OVMS Quick Start Guide.
For more detailed guides on using the Model Server in various scenarios, visit the links below:
OpenVINO™ Model Server gRPC API is documented in the protocol buffer files in tensorflow_serving_api.
NOTE: The implementations for the `Predict`, `GetModelMetadata`, and `GetModelStatus` function calls are currently available. These are the most generic function calls and should address most of the usage scenarios.
Predict proto defines two message specifications: `PredictRequest` and `PredictResponse`, used while calling the Prediction endpoint.

- `PredictRequest` specifies information about the model spec, that is, name and version, and a map of input data serialized via TensorProto to a string format.
- `PredictResponse` includes a map of outputs serialized by TensorProto and information about the used model spec.
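For illustration, a minimal gRPC client sketch in Python using the `tensorflow-serving-api` package might look as follows; the address, model name (`resnet`), input name (`data`), and output name (`prob`) are placeholders that depend on your deployment.

```python
import grpc
import numpy as np
from tensorflow import make_tensor_proto, make_ndarray
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# Connect to the model server gRPC endpoint (address and port depend on your deployment).
channel = grpc.insecure_channel("localhost:9000")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# Build a PredictRequest: model spec (name, optionally version) plus a map of input tensors.
request = predict_pb2.PredictRequest()
request.model_spec.name = "resnet"                      # placeholder model name
data = np.zeros((1, 3, 224, 224), dtype=np.float32)     # placeholder input data
request.inputs["data"].CopyFrom(make_tensor_proto(data, shape=data.shape))

# PredictResponse carries a map of output tensors keyed by output name.
response = stub.Predict(request, 10.0)                  # 10 s timeout
output = make_ndarray(response.outputs["prob"])
print(output.shape)
```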
Get Model Metadata proto defines three message definitions used while calling the Metadata endpoint: `SignatureDefMap`, `GetModelMetadataRequest`, and `GetModelMetadataResponse`.
A function call `GetModelMetadata` accepts model spec information as input and returns Signature Definition content in a format similar to TensorFlow Serving.
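A similar sketch for a metadata query, with the same placeholder address and model name:

```python
import grpc
from tensorflow_serving.apis import get_model_metadata_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel("localhost:9000")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# Request the signature_def metadata field for the given model spec.
request = get_model_metadata_pb2.GetModelMetadataRequest()
request.model_spec.name = "resnet"                      # placeholder model name
request.metadata_field.append("signature_def")

response = stub.GetModelMetadata(request, 10.0)

# The response packs a SignatureDefMap describing the model inputs and outputs.
signature_map = get_model_metadata_pb2.SignatureDefMap()
response.metadata["signature_def"].Unpack(signature_map)
for name, signature in signature_map.signature_def.items():
    print(name, signature.inputs, signature.outputs)
```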
Get Model Status proto defines three message definitions used while calling the Status endpoint: `GetModelStatusRequest`, `ModelVersionStatus`, and `GetModelStatusResponse`, which report all exposed versions including their state in their lifecycle.
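And a status query sketch against the model service, again with a placeholder address and model name:

```python
import grpc
from tensorflow_serving.apis import get_model_status_pb2, model_service_pb2_grpc

channel = grpc.insecure_channel("localhost:9000")
stub = model_service_pb2_grpc.ModelServiceStub(channel)

# Ask for the status of all versions of the model (set model_spec.version to query one).
request = get_model_status_pb2.GetModelStatusRequest()
request.model_spec.name = "resnet"                      # placeholder model name

response = stub.GetModelStatus(request, 10.0)

# GetModelStatusResponse lists every exposed version with its lifecycle state.
for version_status in response.model_version_status:
    print(version_status.version, version_status.state, version_status.status.error_message)
```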
Refer to the example client code to learn how to use this API and submit the requests using the gRPC interface.
Using the gRPC interface is recommended for optimal performance due to its faster implementation of input data deserialization. It enables you to achieve lower latency, especially with larger input messages like images.
OpenVINO™ Model Server RESTful API follows the documentation from the TensorFlow Serving REST API.
Both the row and column formats of requests are implemented.
NOTE: Just like with gRPC, only the implementations for the `Predict`, `GetModelMetadata`, and `GetModelStatus` function calls are currently available.
Only numerical data types are supported.
Review the example clients below to find out more about how to connect and run inference requests.
The REST API is recommended when the primary goal is to reduce the number of client-side Python dependencies and simplify application code.
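As a sketch of such a client, the calls below use the `requests` package; the REST port, model name, and input name are placeholders, and the payloads follow the TensorFlow Serving REST conventions referenced above:

```python
import requests

base_url = "http://localhost:8081/v1/models/resnet"     # placeholder REST port and model name

# Row format: a list of instances, one entry per input example.
row_payload = {"instances": [[0.0, 1.0, 2.0, 3.0]]}
print(requests.post(f"{base_url}:predict", json=row_payload).json())

# Column format: a map from input name to a batched tensor.
column_payload = {"inputs": {"data": [[0.0, 1.0, 2.0, 3.0]]}}
print(requests.post(f"{base_url}:predict", json=column_payload).json())

# Model status and metadata follow the same TensorFlow Serving REST paths.
print(requests.get(base_url).json())
print(requests.get(f"{base_url}/metadata").json())
```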
- Currently, `Predict`, `GetModelMetadata`, and `GetModelStatus` calls are implemented using the TensorFlow Serving API.
- `Classify`, `Regress`, and `MultiInference` are not included.
- `Output_filter` is not effective in the `Predict` call; all outputs defined in the model are returned to the clients.