What’s New in the OpenVINO™ Model Server




Performance Results

OpenVINO™ model server 2021.1 is implemented in C++ to achieve high-performance inference. We kept the following principles in mind when designing the architecture:

  • maximum throughput on a single instance
  • minimal overhead added on top of inference execution in the backend
  • minimal impact on latency

In Figures 2 and 3, throughput and latency are compared as functions of concurrency (the number of parallel clients) for both OpenVINO™ model server versions: 2020.4 (implemented in Python) and the new 2021.1 (implemented in C++). Figure 4 plots throughput against latency, with the intensity of the workload controlled by changing the number of parallel clients. All results were obtained using the best-known configurations of the OpenVINO™ toolkit and OpenVINO™ model server (read more about this in the documentation), in particular by setting the following parameters:

  • Size of the request queue for inference execution — NIREQ
  • Plugin config — CPU_THROUGHPUT_STREAMS
  • gRPC workers
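
As a rough sketch of how these parameters are passed at startup, the snippet below launches the model server Docker image with explicit values. The model name, paths, port, and parameter values are placeholders rather than tuned recommendations; consult the documentation for settings that match your hardware.

```python
# Sketch: start OpenVINO model server (Docker image) with the throughput-related
# parameters discussed above. All values below are illustrative placeholders.
import subprocess

subprocess.run([
    "docker", "run", "--rm", "-p", "9000:9000",
    "-v", "/opt/models/resnet50:/models/resnet50",          # assumed local model repository
    "openvino/model_server:latest",
    "--model_name", "resnet50",
    "--model_path", "/models/resnet50",
    "--port", "9000",
    "--nireq", "4",                                          # size of the inference request queue
    "--plugin_config", '{"CPU_THROUGHPUT_STREAMS": "4"}',    # CPU plugin execution streams
    "--grpc_workers", "8",                                   # number of gRPC server threads
], check=True)
```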
Figure 2. Throughput results (higher is better) for both OpenVINO™ model server versions, versus concurrency measured by the number of parallel streams of requests. Collected for ResNet50 FP32 and ResNet50 INT8 models with batch size 1. Full configuration details are in the Materials and Methods section above.
Figure 3. Latency results (lower is better) for both OpenVINO™ model server versions, versus concurrency measured by the number of parallel streams of requests. Collected for ResNet50 FP32 and ResNet50 INT8 models with batch size 1. Full configuration details are in the Materials and Methods section above.
Figure 4. Throughput results (higher is better) for both OpenVINO™ model server versions, versus latency (lower is better), with the workload varied by the number of parallel streams of requests. Collected for ResNet50 FP32 and ResNet50 INT8 models with batch size 1. Full configuration details are in the Materials and Methods section above.

While the Python version is performant at lower concurrency, the biggest advantage of the C++ implementation is scalability. With the C++ version, it is possible to achieve a throughput of 1,600 fps without any increase in latency, a 3x improvement over the Python version.
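
The concurrency sweep behind these curves can be approximated on the client side with a script along the lines of the sketch below. It uses the TensorFlow Serving gRPC API that the model server exposes; the input tensor name ("0"), input shape, server address, and model name "resnet50" are assumptions, not values taken from the article.

```python
# Sketch: drive the server with N parallel clients and report aggregate
# throughput, roughly mirroring the concurrency sweep behind Figures 2-4.
import time
from concurrent.futures import ThreadPoolExecutor

import grpc
import numpy as np
from tensorflow import make_tensor_proto
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

ADDRESS = "localhost:9000"      # assumed server address
CONCURRENCY = 16                # number of parallel client streams
REQUESTS_PER_CLIENT = 100
IMAGE = np.random.rand(1, 3, 224, 224).astype(np.float32)  # batch size 1, assumed layout

def client_worker(_):
    # Each client stream gets its own channel and stub.
    channel = grpc.insecure_channel(ADDRESS)
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
    request = predict_pb2.PredictRequest()
    request.model_spec.name = "resnet50"
    request.inputs["0"].CopyFrom(make_tensor_proto(IMAGE, shape=IMAGE.shape))
    for _ in range(REQUESTS_PER_CLIENT):
        stub.Predict(request, 10.0)  # 10 s timeout per request

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    list(pool.map(client_worker, range(CONCURRENCY)))
elapsed = time.perf_counter() - start

total_requests = CONCURRENCY * REQUESTS_PER_CLIENT
print(f"throughput: {total_requests / elapsed:.1f} fps with {CONCURRENCY} parallel clients")
```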

OpenVINO™ model server can also be tuned for a single stream of requests, allocating all available resources to a single inference request. The table in Figure 5 shows response latency from a remote client, and the chart visualizes the latency of each processing step for a ResNet50 model quantized to 8-bit precision.

Figure 5. Factors contributing to latency across the key processing steps for a ResNet50 INT8 model served as a single stream of inference to a remote client. Measurements are an average of 10 runs.
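
For reference, the end-to-end portion of such a measurement can be reproduced with a simple timed client like the one below. Again, the server address, model name, and input tensor name ("0") are assumptions, not values from the article.

```python
# Sketch: time a single remote Predict() call end to end, similar in spirit
# to the client-side portion of the Figure 5 breakdown.
import time

import grpc
import numpy as np
from tensorflow import make_tensor_proto
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel("remote-host:9000")     # assumed remote server address
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "resnet50"
image = np.random.rand(1, 3, 224, 224).astype(np.float32)
request.inputs["0"].CopyFrom(make_tensor_proto(image, shape=image.shape))

latencies_ms = []
for _ in range(10):                                     # Figure 5 averages 10 runs
    start = time.perf_counter()
    stub.Predict(request, 10.0)
    latencies_ms.append((time.perf_counter() - start) * 1000)

print(f"mean end-to-end latency: {sum(latencies_ms) / len(latencies_ms):.1f} ms")
```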

With the OpenVINO™ model server C++ implementation, the service frontend has minimal impact on latency. Data serialization and deserialization are reduced to a negligible amount thanks to the no-copy design. Network communication from a remote host adds only 1.7 ms of latency[1], even though the request message is about 0.6 MB for the ResNet50 model.

All in all, even for very fast AI models, the primary contributor to inference latency is backend inference processing. OpenVINO™ model server simplifies deployment and application design without degrading efficiency.

In addition to Intel® CPUs, OpenVINO™ model server supports a range of AI accelerators: HDDL (for Intel® Vision Accelerator Design with Intel® Movidius™ VPUs and Intel® Arria™ 10 FPGAs), Intel® NCS (for the Intel® Neural Compute Stick), and iGPU (for integrated GPUs). The latest Intel® Xeon® processors support the BFloat16 data type to achieve the best performance.
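
Target hardware is selected at startup through the --target_device parameter. The sketch below shows the same kind of launch command pointed at an integrated GPU; the image tag, device mapping, and supported device names are assumptions that depend on the host, drivers, and model server version.

```python
# Sketch: launch the model server with inference offloaded to an integrated GPU.
# All values are illustrative; check the documentation for your accelerator.
import subprocess

subprocess.run([
    "docker", "run", "--rm", "-p", "9000:9000",
    "--device=/dev/dri",                             # expose the iGPU to the container
    "-v", "/opt/models/resnet50:/models/resnet50",
    "openvino/model_server:latest-gpu",              # assumed GPU-enabled image variant
    "--model_name", "resnet50",
    "--model_path", "/models/resnet50",
    "--port", "9000",
    "--target_device", "GPU",                        # e.g. CPU, GPU, MYRIAD, HDDL
], check=True)
```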
