# UNIVERSITAT POLITÉCNICA DE VALÉNCIA

## **Final Project of Master**

## Implementation of resilient algorithm Backpropagation on ARM-FPGA through OpenCL

## Contents

#### Introduction

**Chapter 1. Introduction to OpenCL Chapter 2. OpenCL with Altera SDK Chapter 3. Matrix-multiplication Tiling Chapter 4. Backpropagation Algorithm Chapter 5. integrate IPs in OpenCL and Results** Conclusion

## Introduction

#### OpenCL

- The open standard for parallel programming of heterogeneous systems
- Found in personal computers, servers, mobile devices and embedded platforms
- improves the speed and responsiveness of a wide spectrum of applications in numerous market categories such gaming and medical software



#### **OpenCL Platform Model**

- A host is connected to one or more OpenCL devices
- OpenCL device is collection of one or more compute units
- A compute unit is composed of one or more processing elements



#### **OpenCL Platform Model - Application**



... to execute device code

#### **Execution Model**

kernels executes on one or more OpenCL devices and a host program executes on the host

#### The host defines

- the collection of OpenCL devices
- the OpenCL functions
- Applications queue kernels and data
  transfers
  - Performed in-order or out-of-order

#### NDRange index Space



#### The BIG Idea behind OpenCL

- Define N-dimensional computation domain
- Execute a kernel at each point in computation domain



OpenCL C kernel

Traditional loop as a function in C

#### **Memory Model**

 Private Memory Per work-item implemented with registers Local Memory Per workgroup implemented using on-chip memory •Global/Constant Memory Per read/write work-items is resides in off-chip DRRx memory •Host Memory On the host CPU



#### **Overview of OpenCL**



#### **FPGA OpenCL Architecture**

The SoC platform resembles the traditional OpenCL model with a shared global memory that is used to pass data between the ARM host and the FPGA accelerator.



#### **Kernel Compiler**

- C-language front end: parses a kernel description and creates an LLVM Intermediate Representation (IR)
- Live-value analysis: identifies variables consumed and produced by each basic block.
- CDFG Generation: create a Control-Data Flow Graph (CDFG) to represent the operations inside basic block,
- Scheduling: to determine the clock
  cycles in each operation is performed.
  Hardware generation: To generate a hardware circuit for a kernel



#### **OpenCL Host Program**

 Query host for OpenCL devices
 Create a context to associate
 OpenCL devices
 Create programs for execution on one or more associated devices
 Select kernels to execute from the programs

5. Create memory objects accessible from the host and/or the device6. Copy memory data to the device as needed

7. Provide kernels to command queue for execution8. Copy results from the device to the host



#### **OpenCL** Kernels

- Data-parallel function
- Defines many parallel threads of execution
- Each thread has an identifier specified by "get\_global\_id"
- Contains keyword extensions to specify parallelism and memory hierarchy
- Executed by OpenCL device
- CPU
- GPU
  - Accelerator

kernel void sum(\_\_global const float \*a, \_\_global const float \*b, \_global float \*answer) int xid = get\_global\_id(0); answer[xid] = a[xid] + b[xid]; 2 3 5 float \*a = 4 float \*b = 6 5 3 2 4 0 kernel void sum( … float \*answer =

## **Chapter 3. Matrix-multiplication Tiling**

- C = A \* B
- sub-C=sub-A\*sub-B
- sub-C=(sub-A1\*sub-B1)+(sub-A2\*sub-B2)
- implementation doesn't perform
  So Well because the accessing of
  kernel Device to off-chip memory.

Tiling saves time accessing of kernel device to off-chip memory.



## **Chapter 4. Backpropagation Algorithm**

main candidates for the kernels are the lines 1, 2, 3, 4, 5, 6, 7 and 8 of algorithm.

Each kernel would have a similar structure based on a matrix matrix-multiplication

the better solution exist is
 implement a unique kernel and reuse
 7 times in each iteration

matrix-multiplications of forward phase(1,2 and 3) backward1 phase(5 and 7) backward2 phase(4,6 and 8) input : A training set (Inputs\_train,Targets\_train) and a test set (Inputs\_tets,Targets\_test) output: Number of errors in test set

Initweights  $({}^{0}w^{I}, {}^{0}w^{J}, {}^{0}w^{L}) \leftarrow 0;$ for  $i \leftarrow 1$  to E do Initcweights  $({}^{0}cw^{I}, {}^{0}cw^{J}, {}^{0}cw^{L}) \leftarrow 0;$  $u^{I} \leftarrow \texttt{Forward\_layer1}(^{i-1}w^{I}, Inputs\_train);$  $y^{I} \leftarrow \text{Funct_nolineal}(u^{I});$  $y'' \leftarrow \text{der}_Funct_nolineal}(u');$  $u^{J} \leftarrow \text{Forward\_layer2}(^{i-1}w^{J}, y^{I});$  $y^J \leftarrow \text{Funct_nolineal}(u^J);$  $y'^J \leftarrow \text{der}_Funct_nolineal}(u^J);$  $u^L \leftarrow \text{Forward\_layer3}(^{i-1}w^L, y^J);$  $y^L \leftarrow \text{Funct_nolineal}(u^L);$  $y'^{L} \leftarrow \text{der}\_\text{Funct\_nolineal}(u^{L});$  $\delta^L \leftarrow \text{Backward1\_layer3}(Targets\_train, y^L, y'^L);$  $cw^{L} \leftarrow \text{Backward2\_layer3}(\delta^{L}, y^{J}, cw^{L});$ 4  ${}^{j}\theta^{J} \leftarrow \text{Backward1\_layer2}({}^{i-1}w^{L}, \delta^{L});$  $\delta^J \leftarrow {}^j \theta^J \cdot u'^J$ :  $cw^{J} \leftarrow \text{Backward2\_layer2}(\delta^{J}, y^{I}, cw^{J});$ 6  ${}^{j}\theta^{I} \leftarrow \text{Backward1\_layer1}({}^{i-1}w^{J}, \delta^{J});$  $\delta^I \leftarrow {}^j \theta^I \cdot u'^I$ :  $cw^{I} \leftarrow \text{Backward2\_layer1}(\delta^{I}, Inputs\_train, cw^{I});$  ${}^{i}w^{L} \leftarrow \text{Update_layer3}({}^{i-1}w^{L}, cw^{L});$  ${}^{i}w^{J} \leftarrow \text{Update_layer2}({}^{i-1}w^{J}, cw^{J});$  ${}^{i}w^{I} \leftarrow \text{Update_layer1}({}^{i-1}w^{I}, cw^{I});$ 

Fitness  $\leftarrow$  TestNeuralNetwork(Inputs\_test, Targets\_test,  $^Ew^I$ ,  $^Ew^J$ ,  $^Ew^L$ );

#### To create an OpenCL library we need

- RTL Components
  - **RTL source file:** Verilog, VHDL.
  - eXtensible Markup Language File (.xml): The Altera Offline Compiler uses it to integrate RTL components into OpenCL pipeline.
  - Header file (.h):declares the signatures of function(s) that are implement by the RTL component

#### OpenCL emulation model file (.cl): Provides C model for the RTL component that is used only for emulation.

#### OpenCL Functions

- OpenCL source files (.cl): Contains definitions of the OpenCL functions.
- Header file (.h): declares the signatures of function(s) that are defined in the OpenCL source files.

#### Integration of an RTL Module into the AOCL Pipeline



Parallel Execution Model of AOCL Pipeline Stages Integration of an RTL Module into an AOCL Pipeline

#### **Function tangent hyperbolic 'tanh'**



## Study of Backprobagation algorithm

#### 1. ARM-CPU

2. Kernel and Kernel with 'Tanh'

- 1. Block Size = 4 and Work Items = 4
- 2. Block Size = 8 and Work Items = 8
- 3. Block Size = 16 and Work Items = 4
- numHidden1\_MAX ={4, 8, 16, 32, 48, 64}
- numHidden2\_MAX ={4, 8, 16, 32, 48, 64}
- numPatterns\_MAX ={256, 1024, 4096, 16384, 65536}

numInputs\_max and numOutputs\_max are fixed

In seconds (s)

Results

| numPattern<br>s_MAX | 256  | 1024  | 4096   | 16384   | 65536   |         |
|---------------------|------|-------|--------|---------|---------|---------|
| numHidden<br>1      | 16   | 8     | 48     | 32      | 64      |         |
| numHidden<br>2      | 64   | 64    | 64     | 64      | 64      | ARM-CPU |
| CPU time            | 9,72 | 51,6  | 411,84 | 2288,62 | 9257,73 |         |
| Forward<br>time     | 5,04 | 15,78 | 63,02  | 256,38  | 1005,28 |         |
| Backward<br>time    | 5,52 | 35,57 | 348,63 | 2031,91 | 8251,92 |         |

| ) | numPattern<br>s_MAX | 256  | 1024  | 4096  | 16384  | 65536   |  |
|---|---------------------|------|-------|-------|--------|---------|--|
| ) | numHidden<br>1      | 64   | 64    | 64 64 |        | 64      |  |
|   | numHidden<br>2 64   |      | 64    | 64    | 64     | 64      |  |
|   | CPU time            | 4,95 | 16,86 | 64,8  | 263,59 | 1048,06 |  |
|   | Forward<br>time     | 3,2  | 11,69 | 45,86 | 179,74 | 717,26  |  |
|   | Backward<br>time    | 1,51 | 4,89  | 18,68 | 83,56  | 330,18  |  |

| numPattern<br>s_MAX | 256  | 1024 | 4096  | 16384  | 65536  |
|---------------------|------|------|-------|--------|--------|
| numHidden<br>1      | 16   | 8    | 48    | 32     | 64     |
| numHidden<br>2      | 64   | 64   | 64    | 64     | 64     |
| CPU time            | 2,78 | 8,22 | 30,16 | 125,72 | 485,97 |
| Forward<br>time     | 0,97 | 2,96 | 11    | 41,82  | 160,41 |
| Backward<br>time    | 1,63 | 5,04 | 18,95 | 83,61  | 324,99 |

Kernel, BZ=4 and WI=4

Kernel with 'tanh', BZ=4 and WI=4

## Comparison

#### Hardware

|                                     | ARM-CPU    | BZ=4, WI=4                      | BZ=8, WI=8                      | BZ=16,WI=4                        |                                     | ARM-CPU    | BZ=4, WI=4                      | BZ=8, WI=8                      | BZ=16,WI=4                           |
|-------------------------------------|------------|---------------------------------|---------------------------------|-----------------------------------|-------------------------------------|------------|---------------------------------|---------------------------------|--------------------------------------|
| Logic<br>utilization<br>(in ALMs)   |            | 14,328 /<br>32,070<br>( 45 % )  | 26,978 /<br>32,070<br>(84% )    | 25,536 /<br>32,070<br>( 80 % )    | Logic<br>utilization<br>(in ALMs)   |            | 15,047 /<br>32,070<br>( 47 % )  | 28,327 /<br>32,070<br>(88%)     | 26,241/<br>32,070<br>( 82 % )        |
| Total<br>registers                  |            | 27659                           | 56719                           | 49801                             | Total<br>registers                  |            | 28690                           | 58763                           | 50914                                |
| Total block<br>memory bits          |            | 783,992 /<br>4,065,280<br>(19%) | 961,720 /<br>4,065,280<br>(24%) | 1,327,024 /<br>4,065,280<br>(33%) | Total block<br>memory bits          |            | 783,992 /<br>4,065,280<br>(19%) | 961,720 /<br>4,065,280<br>(24%) | 1,327,024 /<br>4,065,280<br>( 33 % ) |
| Total DSP<br>Blocks                 |            | 26 / 87<br>( 30 % )             | 74 / 87<br>( 85 % )             | 74 / 87<br>( 85 % )               | Total DSP<br>Blocks                 |            | 26/87<br>(30 %)                 | 74 / 87<br>( 85 % )             | 74 / 87<br>( 85 % )                  |
| HPS Dynamic<br>(Dual core)<br>Power | 1392.92 mW | 1392.92 mW                      | 1392.92 mW                      | 1392.92 mW                        | HPS Dynamic<br>(Dual core)<br>Power | 1392.92 mW | 1392.92 mW                      | 1392.92 mW                      | 1392.92 mW                           |
| Total FPGA<br>and HPS Power         |            | 2524.70 m₩                      | 3030.51 m₩                      | 2980.95 m₩                        | Total FPGA and<br>HPS Power         |            | 2539.87 m₩                      | 3053.78 m₩                      | 2995.10 mW                           |

Hardware Comparison, Just a Kernel

Hardware Comparison, Kernel with 'tanh'



### Comparison

#### Kernel with 'tanh' vs ARM-CPU





#### numHidden1=numHidden2=64 Block size= Work items=4 - cpu time Speed up forward time numPatterns max





## Conclusion

#### For the best case,

kernel acceleration Matrix-multiplication with 'tanh'

- $\blacktriangleright$  Work Size = 4
- $\succ$  Work Items = 4
- numPatterns\_max =65536
- numHidden1 =64
- numHidden2 =64

#### Speed up

- 9257.73/485.97= 19.05 times faster for CPU time
- 1005.28/160.41= 6.27 times faster for Forward time
- 8251.92/324.99=25.39 times faster for Backward time

# Thank you for your attention

# Any question is welcome