
FPGA-based hardware acceleration for CNNs developed using high-level synthesis

Optics and Precision Engineering, 2020, Issue 5
關(guān)鍵詞:深圳廣東卷積

WEI Chu-liang1, CHEN Ru-lin1,*, GAO Qian2,3, SUN Zheng-long2,3

(1. Department of Electronic Engineering, Shantou University, Shantou 515063, China; 2. Shenzhen Institute of Artificial Intelligence and Robotics for Society, Shenzhen 518054, China; 3. School of Science and Engineering, The Chinese University of Hong Kong (Shenzhen), Shenzhen 518172, China)

*Corresponding author, E-mail: 16rlchen@stu.edu.cn

Abstract: To accelerate the forward-propagation process of deep-learning networks, a Field-Programmable Gate Array (FPGA) hardware-acceleration system for AlexNet was developed using Vivado High-Level Synthesis (HLS), which can greatly reduce the FPGA development cost. Using Vivado HLS, developers can design hardware architectures on an FPGA platform with C/C++ code instead of a hardware-description language. We implemented AlexNet on an FPGA platform using the HLS tool and then applied the PIPELINE and ARRAY_PARTITION directives to optimize the proposed system. An evaluation shows that the performance of the proposed system is more than three times better than that of a traditional Graphics-Processing-Unit (GPU) computing platform. Owing to its high-level encapsulation, the developed system can easily be adapted to other convolutional neural networks for practical operation, which demonstrates its portability and practical value.

Key words: deep learning; Field-Programmable Gate Array (FPGA); high-level synthesis; hardware-acceleration circuits

FPGA hardware acceleration of convolutional neural networks based on high-level synthesis

WEI Chu-liang1, CHEN Ru-lin1,*, GAO Qian2,3, SUN Zheng-long2,3

(1. Department of Electronic Engineering, Shantou University, Shantou 515063, Guangdong, China; 2. Shenzhen Institute of Artificial Intelligence and Robotics for Society, Shenzhen 518054, Guangdong, China; 3. School of Science and Engineering, The Chinese University of Hong Kong (Shenzhen), Shenzhen 518172, Guangdong, China)

Abstract: To address hardware acceleration of the forward-propagation process of neural networks, an AlexNet forward-propagation hardware-acceleration system was designed with the FPGA programming tool Vivado HLS. While meeting the requirements of the target applications, the system effectively saves development time and reduces development cost. The FPGA circuits were simulated and developed in the high-level language C++, and the convenient and reliable PIPELINE and ARRAY_PARTITION directives of Vivado HLS were applied flexibly to optimize the system. Experimental results show that AlexNet runs in 21.95 ms on the proposed FPGA acceleration system, compared with 70 ms on a traditional GPU platform, i.e., more than three times faster. In addition, each network layer is encapsulated separately, so the system can easily be ported to other mature convolutional neural networks, accelerating the application of deep learning in various artificial-intelligence systems and offering broad application value in the intelligent industry.

Key words: deep learning; field-programmable gate array; high-level synthesis; hardware-acceleration circuits

1 Introduction

In recent years, Convolutional Neural Networks (CNNs) have become an important tool in informatics and engineering fields that require complex artificial intelligence, e.g., computer vision[1-3], signal processing[4-5], and robotics[6-7]. Other complex interdisciplinary applications[8-9], including stock-price prediction, gas exploration, and medical imaging, also rely on CNNs.

Graphics Processing Units (GPUs) have been widely used as accelerators for CNNs. Potluri et al.[10] proposed a real-time discrete-time CNN system on a GPU developed with the Open Computing Language (OpenCL); it showed better computing performance than a Central Processing Unit (CPU). In addition, Strigl et al.[11] presented a GPU-based CNN acceleration framework for complex problems, e.g., Optical Character Recognition (OCR) and face detection. Other GPU-based works include car-plate recognition[12] and a denoising prior for image restoration[13]. GPUs have been shown to perform two to 24 times faster than CPUs.

The Field-Programmable Gate Array (FPGA), a more powerful hardware-acceleration circuit, requires fewer clock cycles than a GPU for the same tasks[14] because of its richer embedded resources, e.g., Digital Signal-Processing (DSP) blocks, registers, and First-In-First-Out queues (FIFOs)[15]. Zhang et al.[16] presented an FPGA-based accelerator for a CNN that achieved a peak performance of 61.62 GFLOPS (billion floating-point operations per second) at a 100-MHz working frequency and significantly outperformed other implementations. Nevertheless, the GPU remains the dominant deep-learning computing platform because of its efficient development process, while few developers choose FPGAs. According to Ref. [14], it took one person (postdoctoral level) two months to develop a GPU-based real-time phase-based optical-processing system, whereas it took two people (postdoctoral level) 15 months to implement the same system on an FPGA.

With the development of High-Level Synthesis (HLS), Xilinx introduced a novel tool, Vivado HLS[17], for designing large-scale, complex FPGA systems using high-level computer languages[18]. Traditionally, developers have had to use inefficient, high-cost, low-level Hardware Description Languages (HDLs) for FPGA design. With Vivado HLS, developers use C/C++ instead of an HDL to design the FPGA architecture; the C/C++ code is then automatically converted into a Register-Transfer Level (RTL) model and HDL. Furthermore, Vivado HLS provides various directives for optimizing the FPGA design to reduce the system latency and initiation interval, and it generates reports for evaluating the design.

In this paper, we developed an FPGA-based hardware-acceleration system for a CNN, which can be used in a real-time processing system. The rest of the paper is organized as follows. Section 2 introduces the AlexNet architecture. Section 3 illustrates in detail how to develop AlexNet on an FPGA using the HLS tool and how to optimize the original model with optimization directives. A computing-performance comparison between the proposed FPGA system and a GPU platform is detailed in Section 4. Section 5 gives a forward-propagation test based on the proposed FPGA system. Finally, Section 6 presents a brief conclusion and plans for future work.

2 CNN architecture

Here, we chose AlexNet as the deep-learning model to test. AlexNet is widely used in computer-vision tasks[19-21] because of its reasonable trade-off between speed and accuracy. The complete network comprises eight layers with trainable weights: the first five are convolution layers and the last three are fully connected. A Rectified Linear Unit (ReLU) non-linearity follows every convolutional and fully-connected layer. Moreover, AlexNet has two normalization layers and three max-pooling layers. A softmax function at the end of the network produces the distribution over the different class labels. If ImageNet, in which every input image has 227×227×3 pixels, is used to train the network, the output is a 1000-way one-dimensional vector, because this dataset contains 1000 different classes. The overall AlexNet architecture and detailed information on each layer are shown in Tab. 1.

Tab.1 AlexNet architecture
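For reference, the following is a minimal sketch of the layer dimensions of the standard AlexNet model assumed in this description. The values follow the original AlexNet publication, and the two-group convolutions of the original model are omitted for simplicity; the struct and names are purely illustrative.

```cpp
// Illustrative listing of the standard AlexNet layer dimensions
// (227x227x3 input, five convolution layers, three fully-connected
// layers, 1000-way output). Grouped convolutions are ignored.
#include <cstdio>

struct ConvLayer {
    const char *name;
    int in_channels;   // input feature-map depth
    int out_channels;  // number of filters
    int kernel;        // filter size (kernel x kernel)
    int stride;
    int pad;
};

static const ConvLayer kConvLayers[] = {
    {"conv1",   3,  96, 11, 4, 0},
    {"conv2",  96, 256,  5, 1, 2},
    {"conv3", 256, 384,  3, 1, 1},
    {"conv4", 384, 384,  3, 1, 1},
    {"conv5", 384, 256,  3, 1, 1},
};

static const int kFcOutputs[] = {4096, 4096, 1000};  // fc6, fc7, fc8

int main() {
    for (const ConvLayer &l : kConvLayers)
        std::printf("%s: %d -> %d channels, %dx%d kernel, stride %d, pad %d\n",
                    l.name, l.in_channels, l.out_channels,
                    l.kernel, l.kernel, l.stride, l.pad);
    for (int i = 0; i < 3; ++i)
        std::printf("fc%d: %d outputs\n", 6 + i, kFcOutputs[i]);
    return 0;
}
```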

3 HLS-based development process

Traditionally, an FPGA is developed at either the Gate Level (GL) or the Register-Transfer Level (RTL). Designing an FPGA in the traditional manner requires the developer to arrange logic-gate circuits to realize the desired function. Many details must be considered, e.g., bit widths and timing, which requires extensive development time even for an experienced developer. According to Ref. [14], which compared the development costs of a GPU and a traditionally developed FPGA, FPGA development was much more complex than GPU development.

To reduce FPGA development costs and meet the requirements of more complicated computing tasks, the hardware should be designed at the algorithmic level, so that developers need only focus on the high-level specification of the problem. For this reason, Xilinx produced Vivado, a new FPGA development kit, for synthesizing and analyzing HDL architectures. One of Vivado's most important tools is HLS, which accepts synthesizable subsets of ANSI C, C++, and SystemC. The code is analyzed and automatically converted into an RTL model and HDL, which traditionally had to be produced by hand and processed with gate-level logic-synthesis software.

Figure 1 shows the workflow for developing AlexNet on an FPGA using Vivado HLS. In this system, we used C/C++ as the development language and set all computations to use the single-precision floating-point data type. First, we designed AlexNet in a high-level language (C/C++) and conducted simulation experiments. Once the experimental results met our requirements, the C/C++ code was converted to HDL and the RTL model was generated automatically through HLS. Furthermore, Vivado HLS provides C/RTL co-simulation to simulate different FPGA on-chip environments and to evaluate the logic-gate resource usage of the proposed system.

Fig.1 Development workflow for AlexNet on an FPGA
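As a concrete illustration of the C-simulation step in this workflow, the following is a minimal testbench sketch of the kind Vivado HLS runs on the host CPU before any RTL is generated. The top-level function name alexnet_top, the constant values, and the golden-reference handling are hypothetical and not taken from the actual project sources.

```cpp
// Minimal C-simulation testbench sketch (hypothetical names).
// alexnet_top is assumed to be defined in the synthesizable source
// files added to the HLS project; HLS compiles this testbench together
// with them and checks the return code of main().
#include <cmath>
#include <cstdio>

const int IMG_PIXELS  = 227 * 227 * 3;  // AlexNet input size
const int NUM_CLASSES = 1000;           // ImageNet classes

// Top-level function that will later be synthesized to RTL.
void alexnet_top(const float image[IMG_PIXELS], float scores[NUM_CLASSES]);

int main() {
    static float image[IMG_PIXELS];
    static float scores[NUM_CLASSES];
    static float golden[NUM_CLASSES];

    // In a real testbench the image and golden scores would be read from
    // files produced by a software reference run; constants are used here
    // only to keep the sketch self-contained.
    for (int i = 0; i < IMG_PIXELS; ++i)  image[i]  = 0.5f;
    for (int i = 0; i < NUM_CLASSES; ++i) golden[i] = 0.0f;

    alexnet_top(image, scores);

    // Compare against the golden reference within a small tolerance.
    int errors = 0;
    for (int i = 0; i < NUM_CLASSES; ++i)
        if (std::fabs(scores[i] - golden[i]) > 1e-3f) ++errors;

    std::printf("%s (%d mismatches)\n", errors ? "FAIL" : "PASS", errors);
    return errors ? 1 : 0;  // non-zero return marks the C simulation as failed
}
```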

To optimize the FPGA design, HLS provides various directives that reduce the latency and initiation interval. Optimization directives are another powerful HLS tool that helps developers design an FPGA at the algorithmic level; they can produce a micro-architecture that meets the desired performance and area goals. We applied the PIPELINE and ARRAY_PARTITION directives here. With the PIPELINE directive, the next execution can start before the current execution has finished, which greatly reduces the initiation interval. The ARRAY_PARTITION directive partitions large arrays into multiple smaller arrays or into individual registers, improving access to the data and removing block-RAM bottlenecks, which helps to reduce the latency. Figure 2 shows an example of using the optimization directives in Vivado HLS.
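The sketch below shows, under invented loop dimensions and names (it is not reproduced from Fig. 2), how these two directives might be applied to one small convolution loop nest.

```cpp
// Sketch of PIPELINE and ARRAY_PARTITION directives in a small
// convolution loop nest; sizes and names are illustrative only.
#define K      3    // kernel size
#define IN_CH  4    // input channels
#define OUT_W 16    // output width
#define OUT_H 16    // output height

void conv_kernel(const float in[IN_CH][OUT_H + K - 1][OUT_W + K - 1],
                 const float w[IN_CH][K][K],
                 float out[OUT_H][OUT_W]) {
// Splitting the weight array across registers/BRAM banks removes the
// single-port memory bottleneck so all K*K*IN_CH reads can be scheduled
// in parallel.
#pragma HLS ARRAY_PARTITION variable=w complete dim=0

    for (int r = 0; r < OUT_H; ++r) {
        for (int c = 0; c < OUT_W; ++c) {
// Pipelining the output-pixel loop lets a new output start every
// initiation interval instead of waiting for the previous one to finish;
// the inner accumulation loops are unrolled by the PIPELINE directive.
#pragma HLS PIPELINE II=1
            float acc = 0.0f;
            for (int ch = 0; ch < IN_CH; ++ch)
                for (int i = 0; i < K; ++i)
                    for (int j = 0; j < K; ++j)
                        acc += in[ch][r + i][c + j] * w[ch][i][j];
            out[r][c] = acc;
        }
    }
}
```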

After optimization, the proposed system can be encapsulated into an intellectual property (IP) core. We can directly call the IP core from the FPGA development platform to complete the process of developing an FPGA through HLS, from the C/C++ program to the FPGA on-chip system.
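As a hedged sketch of what this packaging step could look like, the following shows how a top-level function might expose AXI interfaces through HLS INTERFACE directives so that the exported IP core can be configured and launched from a host processor. The function name, port names, and bundle names are hypothetical, not the actual project interfaces.

```cpp
// Hypothetical top-level function exposing AXI interfaces for IP export.
#define IMG_PIXELS  (227 * 227 * 3)
#define NUM_CLASSES 1000

void alexnet_top(const float image[IMG_PIXELS], float scores[NUM_CLASSES]) {
// Image and score buffers live in external memory and are accessed over AXI4.
#pragma HLS INTERFACE m_axi     port=image  offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi     port=scores offset=slave bundle=gmem
// Buffer offsets and the block-level control signals (start/done/idle) are
// mapped to an AXI4-Lite register file for the host CPU.
#pragma HLS INTERFACE s_axilite port=image  bundle=control
#pragma HLS INTERFACE s_axilite port=scores bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control

    // ... the convolution, pooling, and fully-connected layers would be
    // called here; a placeholder body keeps the sketch compilable.
    for (int i = 0; i < NUM_CLASSES; ++i) scores[i] = 0.0f;
}
```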

Fig.2 Using optimization directives in one convolution layer

4 System-performance comparison

The proposed system implemented a pre-trained AlexNet model with approximately 60.5 M parameters on a Xilinx xcvu9p-flgb2104-2-i FPGA device, and the development environment was Vivado 2017.4. The operating frequency was set to 100 MHz. For comparison, we implemented the same model with the same parameter bit width on an NVIDIA 960M GPU in a 12-GB-memory working environment, developed using Matlab 2018b.

The performance comparison between the FPGA and GPU platforms is shown in Figure 3. The proposed FPGA system took 21.95 ms to complete the forward-propagation procedure for a 227×227×3-pixel image, whereas the traditional GPU platform took 70 ms. Thus, the computing speed on the FPGA platform is more than three times (70/21.95 ≈ 3.2) that of the GPU platform.

Fig.3 Performance comparison between the FPGA and GPU platforms

Fig.4 Running time of each layer in AlexNet

Moreover, the detailed running time of each layer is shown in Figure 4. The execution time decreased from the first to the last convolution layer because the number of parameters is reduced after every convolution layer. Although there are only three fully-connected layers, they accounted for 63.93% of the entire execution time, as shown in Figure 5. Table 2 shows the resource utilization of the proposed system, which is within the limits of the chosen FPGA device.

Fig.5 Performance comparison between the convolution layers and fully-connected layers

Tab.2 Resource utilization of Xilinx xcvu9p-flgb2104-2-i

Resource   Units utilized   Units available   Utilization
BRAM       1124             4320              26.01%
DSP        6686             6840              97.74%
FF         1404357          2364480           59.39%
LUT        1075078          1182240           90.93%

5 Forward-propagation test

To put the proposed FPGA system into practice, we used a tabby-cat image as one of our test inputs. It was obtained from the ImageNet database, which contains 1000 different classes and was created by the Stanford Vision Lab at Stanford University. Figure 6 shows the input test image and the feature maps of each convolution layer, which indicates a successful forward-propagation process on the proposed FPGA system. As forward propagation proceeds, the feature maps become less visually readable for humans but more mathematically meaningful for the AlexNet model, as shown in Figures 6(b) to 6(f). Figure 7 shows the prediction results for the input image after the three fully-connected layers and a softmax function. The successful forward-propagation test proves that the system can be used in other related tasks.

Fig.6 (a) Input test image; (b) to (f) output feature maps of each convolution layer

Fig.7 Prediction probability results of the cat-image test
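The probabilities reported in Fig. 7 come from a softmax over the 1000 outputs of the last fully-connected layer. A minimal, numerically stable sketch of this step, independent of the actual FPGA implementation, is shown below; subtracting the maximum score keeps the exponential from overflowing.

```cpp
// Minimal softmax sketch: converts 1000 raw fc8 scores into class
// probabilities of the kind plotted in Fig. 7.
#include <cmath>

const int NUM_CLASSES = 1000;

void softmax(const float scores[NUM_CLASSES], float probs[NUM_CLASSES]) {
    // Find the maximum score for numerical stability.
    float max_score = scores[0];
    for (int i = 1; i < NUM_CLASSES; ++i)
        if (scores[i] > max_score) max_score = scores[i];

    // Exponentiate shifted scores and accumulate the normalizer.
    float sum = 0.0f;
    for (int i = 0; i < NUM_CLASSES; ++i) {
        probs[i] = std::exp(scores[i] - max_score);
        sum += probs[i];
    }

    // Normalize so the probabilities sum to one.
    for (int i = 0; i < NUM_CLASSES; ++i)
        probs[i] /= sum;
}
```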

As future work, the proposed FPGA-based AlexNet system will be used in further studies. For example, a human-robot interaction system consisting of a UR5 robot arm, a Kinect camera, a force sensor, and an infrared sensor will be built in our laboratory. The system's image-processing speed should be as fast as possible to make it stable and responsive. Owing to its limited resources and fixed circuit design, a GPU is less suitable for this specific task than an FPGA.

6 Conclusion

This paper proposed an FPGA-based hardware-acceleration system for a deep-learning network. Vivado HLS was used as the development tool instead of a traditional HDL, enabling design at the algorithmic level and reducing the development cost. AlexNet was selected as the deep-learning model to test the proposed system. In the evaluation, the system showed better performance than a GPU. The proposed system can be employed in various practical projects, e.g., human-robot interaction systems, self-driving cars, and optical-signal processing, to accelerate the processing of large-scale, complex input data. Because the system is divided into separately encapsulated layers, it can be simply and flexibly adapted to other similar convolutional neural networks and used in different application scenarios.

Acknowledgments

This work was supported by the Characteristic Innovation Project of Universities in Guangdong Province under Grant No.2018KTSCX061, the Projects of the Jieyang Science and Technology Plan under Grant No.2019007 and Grant No.2019065, the Key Project of Guangdong Province Science and Technology Plan under Grant No.2015B020233018, and Project No.2019-INT010 from the Shenzhen Institute of Artificial Intelligence and Robotics.
