Performance Benchmark between an Embedded GPU and a FPGA

Nowadays is impossible to think of a device without considering it “intelligent” to some extent. If ten years ago “intelligent” systems where carefully designed and could only be used in specific cases (i. e. industry or defense or research), today smart sensors which are big as a button are everywhere, from cellphones to refrigerators, from vacuum cleaner to industrial machines.
If you think that’s it, you’re wrong: the future is tiny.

Embedded systems are more and more needed even in industries, because they’re capable to perform complex elaborations on board, sometimes comparable to the ones carried out by standard PCs. Small, portable, flexible and smart: it is not hard to understand why they’re more and more used!

A plethora of embedded systems is available on the market, according to the needs of the client. One important characteristic to check out is the capability of the embedded platform of choice to be “smart”, which nowadays means if it’s able to run a Deep Learning model with the same performances obtained while running it on a PC. The reason is that Deep Learning models require a lot of resources to perform well, and running them on CPUs usually means to lose accuracy and speed compared to running them on GPUs.

To solve this issue, some companies started to produce embedded platforms with GPUs on board. While the architecture of these systems is still different to the architecture of PCs, they are a quite good improvement on the matter. Another type of embedded systems is the FPGA: these platforms lead the market for a while before embedded GPUs became common, are low-level programmed and because of that are usually high performing.

In this thesis work, conducted in collaboration with Tattile, we performed a benchmark between the Nvidia Jetson TX2 embedded platform and the Xilinx FPGA Evaluation Board ZCU104.

STEP 1: Determine the BASELINE model

To perform our benchmark we selected an example model. We kept it simple by choosing the well-known VGG Net model, which we trained on a host machine equipped with a GPU on the standard dataset CIFAR-10. This dataset is composed by 10 classes of different objects (dogs, cats, airplanes, cars…) with standard image dimensions of 32×32 px.

The network was trained in Caffe for 40000 iterations, reaching an average accuracy of 86.1%. Note that the model obtained after this step is represented in floating point 32bit (FP32), which allows a refined and accurate representation of weights and activations.

Fig. 3 - CIFAR-10 dataset examples for each class.

STEP 2: Performances on Nvidia Jetson TX2

This board is equipped with a GPU, thus allowing a representation in floating point. Even if the FP32 representation is supported, we choose to perform a quantization procedure to reduce the representation complexity of the trained model to a FP16 representation. This choice was driven by the fact that the performances obtained with this representation are considered the best ones by the literature.

We used TensorRT, which is natively installed on the board, to perform the quantization procedure. After this process the model obtained an average accuracy of 85.8%.

STEP 3: Performances on Xilinx FPGA Zynq ZCU104

The FPGA do not support floating point representations. The toolbox used to modify the original model and adapt it for the board is a proprietary one, called DNNDK.

Two configurations were tested: a configuration where the original model was quantized from FP32 to INT8, thus losing the floating point and critically reduce the dimension of the network to few MB. The average accuracy obtained in this case is 86.6%, slightly better than the baseline probably because of the big representation gap, which in some cases lead to correct predictions the ones that were borderline between correct and incorrect.

The second configuration applies a pruning process after the quantization procedure, thus deleting useless layers. In this case the average accuracy reached is of 84.2%, as expected after the combination of both processes.

STEP 4: Performance benchmark of the boards

Finally we compared the performances obtained by the two boards when running inference in real-time. The results can be found in the presentation below: if you’re interested, feel free to download it!

The thesis document is also available on request.