







## On the Resilience of RTL NN Accelerators: Fault Characterization & Mitigation

Behzad Salami, Osman S. Unsal, and Adrian Cristal Kestelman

High Performance Machine Learning Workshop (HPML), 24-Sept-2018, Lyon, France.



## Outline



- Motivation
  - Why accelerators for NNs? Why Register-Transfer Level (RTL) model?
  - Why study resilience in NN accelerators?
- Fault Characterization of RTL NN
  - Empirical vulnerability analysis of different components of RTL NN
- Fault Mitigation of RTL NN
  - An efficient technique to mitigate faults
- Summary and Future Works

#### Motivation

Barcelona Supercomputing Center, Centro Nacional de Supercomputación

Why Hardware Accelerators for NNs?

- NNs are inherently compute- and power-intensive applications.
- Hardware accelerators, i.e., FPGAs and ASICs, are commonly used. On these accelerators, NN computations (matrix multiplications) can be performed *in parallel* and in *streaming mode*.
- Register-Transfer Level (RTL) is a hardware design level that can be used for both ASICs and FPGAs. Thanks to High-Level Synthesis (HLS) tools, it is <u>accurate enough</u>, like hardware, and <u>straightforward enough</u>, like software.

Why Resilience in NNs?

- The fault rate is continually increasing due to *aggressive undervolting*, manufacturing defects, aging, etc., especially in nanoscale technology nodes.
- As a result, the accuracy of the NN can be significantly affected.



Underscaling the supply voltage *below the nominal level*:

- **Power/Energy Efficiency**: Reduces dynamic power quadratically and static power linearly.
- **Reliability**: Increases the circuit delay, which in turn causes timing faults.
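The quadratic and linear trends above can be sketched numerically. This is only a rough illustration: it assumes dynamic power scales with V² and static power with V while everything else stays fixed, and the function name `relative_power` and the `dynamic_share` parameter are hypothetical, not taken from the talk.

```python
def relative_power(v, v_nominal, dynamic_share=0.5):
    """Power at voltage v relative to the power at v_nominal.

    Assumes dynamic power scales with V^2 and static power with V.
    dynamic_share is the fraction of total power that is dynamic at the
    nominal voltage; 0.5 is an illustrative value, not a measurement.
    """
    ratio = v / v_nominal
    dynamic = dynamic_share * ratio ** 2
    static = (1.0 - dynamic_share) * ratio
    return dynamic + static


# Illustrative sweep below a hypothetical 1.0 V nominal level.
for voltage in (1.0, 0.9, 0.8, 0.7):
    print(f"V = {voltage:.1f} V -> {relative_power(voltage, 1.0):.3f} of nominal power")
```

The sketch covers only the power side of the trade-off; the timing faults caused by the increased circuit delay are not modeled.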

1. Behzad Salami, Osman S. Unsal, and Adrian Cristal Kestelman, "A Comprehensive Evaluation of Supply Voltage Underscaling in FPGA on-chip Memories", in *MICRO-51*, 2018.
2. Behzad Salami, Osman S. Unsal, and Adrian Cristal Kestelman, "Fault Characterization Through FPGA Undervolting", in *FPL*, 2018.
3. Behzad Salami, Osman S. Unsal, and Adrian Cristal Kestelman, "A Demo of FPGA Aggressive Voltage Downscaling: Power and Reliability Tradeoffs", in *FPL*, 2018.



Aggressive Undervolting below the voltage guardband


## Outline


- Motivation
  - Why accelerators for NNs? Why Register-Transfer Level (RTL) model?
  - Why study resilience in NN accelerators?
- Fault Characterization of RTL NN
  - Empirical vulnerability analysis of different components of RTL NN
- Fault Mitigation of RTL NN
  - An efficient technique to mitigate faults
- Summary and Future Works

## **Overall Methodology**



- Register-Transfer Level (RTL) is a hardware design model.
- Advantages of the RTL design:
  - Accurate-enough (similar to the on-silicon design)
  - Straightforward-enough (similar to the software code).
- With the rise of High-Level Synthesis (HLS) tools, RTL models are becoming increasingly common.



Register-Transfer Level (RTL) model of the Typical NN

To build the RTL model of the NN, we use **Bluespec** (a cycle-accurate HLS tool).

## Details of the Methodology



| Parameter | Value |
|---|---|
| **Neural Network (NN)** | |
| Type | Fully-Connected Classifier |
| Topology (number of layers) | 6L (1L input, 4L hidden, 1L output) |
| Per Layer Size (number of neurons) | (784, 1024, 512, 256, 128, 10) = 2714 |
| Total Number of Weights | ~1.5 million |
| Activation Function | Logarithmic Sigmoid (logsig) |
| **Major Benchmark** | |
| Name-Type | MNIST [12] - Handwritten Digits |
| Number of Images | Training: 60000, Inference: 10000 |
| Number of Pixels per Image | 28*28 = 784 |
| Number of Output Classes | 10 |
| **Additional Benchmarks** | |
| 1. Forest | [13] |
| 2. Reuters | [14] |
| **Data Representation Model** | |
| Type | 16-bit Fixed-Point (FP) |
| Precision | Min sign and digit per layer (Fig. 2) |
| **An Example Synthesis** | |
| FPGA Platform-Chip | VC707-Virtex7 |
| Operating Frequency | 100 MHz |
| BRAM Usage (Total: 2060) | 70.8% |
| DSP Usage (Total: 2800) | 8.6% |
| FF Usage (Total: 303,600) | 3.8% |
| LUT Usage (Total: 607,200) | 4.9% |
| Number of PEs | 64 |

NN layer equations (from the accompanying diagram):

- Input to first hidden layer: $H_{1,j} = \sigma(b_j + \sum_i I_i \times w_{ij})$
- Hidden layers: $H_{n,j} = \sigma(b_j + \sum_i H_{n-1,i} \times w_{ij})$
- Output (softmax) layer: $O_j = smax(\sigma(b_j + \sum_i H_{K-1,i} \times w_{ij}))$

Figure: Sign, Digit, and Fraction bit widths per NN layer and register type.
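The neuron and weight totals in the table can be checked with a few lines of arithmetic. A minimal sketch, assuming a plain fully-connected topology where every neuron of one layer connects to every neuron of the next, and counting weights only (biases excluded); the variable names are illustrative.

```python
# Layer sizes as listed in the table: input, four hidden layers, output.
layer_sizes = [784, 1024, 512, 256, 128, 10]

# Total number of neurons across all layers.
neurons = sum(layer_sizes)

# One weight per connection between consecutive fully-connected layers.
weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

print(neurons)   # 2714
print(weights)   # 1492224, i.e. roughly 1.5 million
```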

#### Fault Model



- Where to inject faults?
  - A set of bits is selected fully at random among all available NN data.
- Supported types of faults:
  - Permanent (stuck-at-0 or stuck-at-1): the bit is stuck at 0 or 1 for the entire execution.
  - Transient: the bit is flipped for a single cycle.
- Statistically significant results:
  - Because the number of possible fault-injection sites is very large, it is more practical to randomly select a subset of them. But how many?
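The fault model above can be sketched in software as bit-level operations on 16-bit words. This is an illustrative sketch, not the RTL fault-injection framework used in the study: all function names are hypothetical, and the difference between permanent and transient faults (how many cycles the corruption lasts) is left to the caller.

```python
import random

WORD_BITS = 16  # matches the 16-bit fixed-point words used in the study
WORD_MASK = (1 << WORD_BITS) - 1


def stuck_at_0(word, bit):
    """Force the given bit to 0 (permanent fault if applied on every cycle)."""
    return word & ~(1 << bit) & WORD_MASK


def stuck_at_1(word, bit):
    """Force the given bit to 1 (permanent fault if applied on every cycle)."""
    return (word | (1 << bit)) & WORD_MASK


def bit_flip(word, bit):
    """Invert the given bit (transient fault if applied for a single cycle)."""
    return (word ^ (1 << bit)) & WORD_MASK


def pick_fault_sites(num_words, num_faults, rng):
    """Choose distinct (word index, bit index) pairs uniformly at random."""
    sites = rng.sample(range(num_words * WORD_BITS), num_faults)
    return [divmod(site, WORD_BITS) for site in sites]


def inject(words, sites, fault):
    """Return a copy of words with the fault applied at every chosen site."""
    faulty = list(words)
    for word_index, bit in sites:
        faulty[word_index] = fault(faulty[word_index], bit)
    return faulty


rng = random.Random(0)  # fixed seed so a campaign can be repeated
data = [0x0000, 0x0135, 0x7FFF, 0x8001]
sites = pick_fault_sites(len(data), 3, rng)
print([hex(w) for w in inject(data, sites, stuck_at_1)])
```

How many sites to sample for statistically significant results is the question the slide raises; the sketch does not answer it.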



#### **Illustration of Methodology**






## Vulnerability of Data Types of NN

- Three main data types of a typical NN:
  - Weights or WRs (parameters of the NN, uploaded from the offline training stage)
  - Inputs or IRs (images in MNIST, ...)
  - **InterMediate or IMRs** (the internal NN data, result of multiply-add computations)
- Methodology: Injecting faults in individual data types
  - Select *random bits to inject faults* among individual data types
- Results: Inputs are the least vulnerable and Intermediates are the most vulnerable.
  - Intermediates have the longest digit component.
  - They are in the adder part (not the multiplier).




## Vulnerability of Layers of NN


- There is an activation function between consecutive NN layers.





Typical Neural Network (NN)

- Methodology: Injecting faults in individual NN layers
  - Select random bits to inject faults among individual NN layers
- Results:
  - Inner layers (closer to the output) are relatively more vulnerable, because faults injected there pass through fewer activation functions that could threshold them.



#### Vulnerability of Fixed-point Components

- Low-precision fixed-point data representation model:
  - More energy-efficient than full-precision floating point
  - 16 bits composed of sign, digit, and fraction components (the minimum number of bits for sign and digit, and the rest for fraction)



- Methodology: Injecting faults in individual components
  - Select *random bits to inject faults* among individual data components, i.e., sign, digit, and fraction.
- Results: As expected, vulnerability decreases in the order sign, digit, fraction (the sign component is the most vulnerable and the fraction the least).
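One way to see why the ordering is expected is to compare how much a single bit flip changes the represented value, depending on the component it hits. The sketch below assumes a two's-complement 16-bit word and, purely for illustration, 12 fraction bits (1 sign, 3 digit, 12 fraction); the actual per-layer split in the study differs, and the function names are hypothetical.

```python
def to_real(word, frac_bits, total_bits=16):
    """Interpret a two's-complement fixed-point word as a real number."""
    if word & (1 << (total_bits - 1)):
        word -= 1 << total_bits
    return word / (1 << frac_bits)


def flip_error(word, bit, frac_bits, total_bits=16):
    """Absolute change in value caused by flipping one bit of the word."""
    flipped = word ^ (1 << bit)
    return abs(to_real(flipped, frac_bits, total_bits)
               - to_real(word, frac_bits, total_bits))


word = 0x1800  # represents 1.5 with 12 fraction bits
print(to_real(word, 12))          # 1.5
print(flip_error(word, 15, 12))   # sign bit:            8.0
print(flip_error(word, 13, 12))   # a digit bit:         2.0
print(flip_error(word, 0, 12))    # lowest fraction bit: 0.000244140625
```

The size of the value change is only a proxy; the study measures the effect on classification accuracy.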



## Multiple NN Benchmarks

- Validating the generality of results by more benchmarks:
  - **MNIST**: Handwritten digit black-and-white images
    - <u>(|Input|= 784, |Output|= 10)</u>
  - Forest: Cartographic observations for classifying the forest cover type
    - <u>(|Input|= 54, |Output|= 8)</u>
  - **Reuters**: News articles for text categorization
    - (|Input|= 2837, |Output|= 52)
- Discussion on Results:
  - Inherent error rate (without fault): MNIST (2.56%), Forest (5.6%), and Reuters (37.8%)
  - Most of the findings on MNIST, e.g., data sparsity, also hold for the two new benchmarks.
  - Reuters is relatively less sparse, so it is less affected by stuck-at-1 faults.



## Sparsity of NN Benchmarks



- The data of the studied benchmarks are sparse, i.e., they contain more '0' bits than '1' bits.
  - Previous papers show a similar feature for other state-of-the-art benchmarks, e.g., ImageNet and AlexNet.
- Due to the inherent data sparsity of NNs:
  - Stuck-at-1 faults are more destructive than stuck-at-0 faults.
  - This is favorable for aggressive undervolting faults, as observed in our preliminary experiments.
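The link between sparsity and fault type can be made concrete: a stuck-at-1 fault only changes a stored bit that was '0', and a stuck-at-0 fault only changes a bit that was '1'. The sketch below measures the bit-level density of a list of 16-bit words and derives how often each fault type would actually alter the data. The function names and sample words are illustrative assumptions, and the sketch says nothing about the resulting accuracy loss.

```python
def ones_fraction(words, word_bits=16):
    """Fraction of '1' bits across all words."""
    mask = (1 << word_bits) - 1
    ones = sum(bin(w & mask).count("1") for w in words)
    return ones / (len(words) * word_bits)


def effective_fault_probability(words, word_bits=16):
    """Probability that a fault at a random bit changes the stored value."""
    p_one = ones_fraction(words, word_bits)
    return {"stuck_at_0": p_one, "stuck_at_1": 1.0 - p_one}


sparse_words = [0x0000, 0x0001, 0x0003, 0x0000]  # made-up sparse data
print(effective_fault_probability(sparse_words))
# {'stuck_at_0': 0.046875, 'stuck_at_1': 0.953125}
```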



## Outline


- Motivation
  - Why accelerators for NNs? Why Register-Transfer Level (RTL) model?
  - Why study resilience in NN accelerators?
- Fault Characterization of RTL NN
  - Empirical vulnerability analysis of different components of RTL NN
- Fault Mitigation of RTL NN
  - An efficient technique to mitigate faults
- Summary and Future Works

Studied case:

- Brandon Reagen et al., "Minerva: Enabling Low-Power, Highly-Accurate DNN Accelerators" (ISCA 2016).

Fault Detection Assumptions:

- There is no limit on the number of faults that can be detected.
- Information is available on which bits are affected.
- Razor shadow registers are a feasible way to achieve the above goals.

Fault Mitigation Techniques:

- **Bit Masking**: any bit that experiences a fault is replaced with the sign bit.
- **Word Masking**: when a fault is detected, all bits of the word are reset to '0'.
- **Results:** The combination of Razor with Bit Masking allows the NN weights to tolerate **44X** more faults than Word Masking.
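The two masking techniques can be sketched on 16-bit words as follows. The sketch assumes the detection mechanism (e.g., Razor) supplies a `fault_mask` with a 1 at every bit position flagged as faulty; the function names and the mask interface are illustrative assumptions, not the Minerva implementation.

```python
def word_masking(word, fault_mask):
    """Reset the whole word to 0 if any of its bits is flagged as faulty."""
    return 0 if fault_mask else word


def bit_masking(word, fault_mask, word_bits=16):
    """Replace every flagged bit with the value of the sign bit."""
    all_ones = (1 << word_bits) - 1
    sign = (word >> (word_bits - 1)) & 1
    fill = all_ones if sign else 0
    return (word & ~fault_mask & all_ones) | (fill & fault_mask)


# A small positive weight 0x0135 whose bit 14 was corrupted to 1.
observed, mask = 0x4135, 0x4000
print(hex(bit_masking(observed, mask)))   # 0x135, the original value
print(hex(word_masking(observed, mask)))  # 0x0, the weight is lost
```

For small-magnitude two's-complement values the upper bits equal the sign bit, which is why Bit Masking often restores the original word, while Word Masking always discards it.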




#### An Enhanced Fault Mitigation Technique

- A combination of **Bit Masking**, **Word Masking**, and **Sign-bit Masking** (if a fault in the sign bit is detected, it is masked with the MSB).
- It relies on the sparsity of NN data and on the observation that the sign bit and the MSB have the same logic value.



- Experimental Results:
  - The hybrid technique is 47.3% better than Word Masking.
  - Bit Masking is not effective when the sign bit is corrupted.
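The slides do not spell out how the three techniques are combined, so the following is only one plausible reading, sketched under stated assumptions: a faulty sign bit is recovered from the MSB (the bit just below it), other faulty bits are bit-masked with the recovered sign, and Word Masking is used as a fallback when both the sign bit and the MSB are flagged. The fallback rule, the function name, and the `fault_mask` interface are all assumptions.

```python
def hybrid_masking(word, fault_mask, word_bits=16):
    """Sketch of a Bit/Word/Sign-bit Masking combination (assumed policy)."""
    all_ones = (1 << word_bits) - 1
    sign_pos, msb_pos = word_bits - 1, word_bits - 2
    sign_faulty = (fault_mask >> sign_pos) & 1
    msb_faulty = (fault_mask >> msb_pos) & 1

    if sign_faulty and msb_faulty:
        return 0  # assumption: no trustworthy sign left, fall back to Word Masking
    if sign_faulty:
        sign = (word >> msb_pos) & 1   # Sign-bit Masking: copy the MSB
    else:
        sign = (word >> sign_pos) & 1  # sign bit is trusted
    fill = all_ones if sign else 0
    # Bit Masking: every flagged bit (including a faulty sign) takes the sign value.
    return (word & ~fault_mask & all_ones) | (fill & fault_mask)


# Weight 0x0135 with a corrupted sign bit: plain Bit Masking would keep the
# wrong sign, while copying the MSB restores the original value.
print(hex(hybrid_masking(0x8135, 0x8000)))  # 0x135
```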




## Outline


- Motivation
  - Why accelerators for NNs? Why Register-Transfer Level (RTL) model?
  - Why study resilience in NN accelerators?
- Fault Characterization of RTL NN
  - Empirical vulnerability analysis of different components of RTL NN
- Fault Mitigation of RTL NN
  - An efficient technique to mitigate faults
- Summary and Future Works

## Summary & Future Works



#### Summary

- We showed that NN accelerators are susceptible to faults, e.g., <u>Undervolting</u> faults.
- For a more comprehensive analysis, we analyzed the <u>Resilience</u> of NN accelerators at RTL, which is a model close to the hardware.
- We characterized the vulnerability of different components of the NN accelerator to faults (<u>Fault</u> <u>Characterization</u>).
- We evaluated an efficient technique to minimize the effect of faults on NN accuracy (<u>Fault</u> <u>Mitigation</u>).

#### **Future Works**

- Study advanced Neural Network models such as CNNs, LSTMs, etc.
- Evaluate the mitigation technique on silicon.
- Confirm the experimental results through analytical analysis.



# Thanks!



Contact: Behzad Salami behzad.salami@bsc.es

The research leading to these results has received funding from the European Union's Horizon 2020 Programme under the LEGaTO Project (www.legato-project.eu), grant agreement n° 780681.





# Backup



Underscaling the supply voltage *below the nominal level*:

- **Power/Energy Efficiency**: Reduces dynamic power quadratically and static power linearly.
- **Reliability**: Increases the circuit delay, which in turn causes timing faults.



**Aggressive Undervolting is not DVFS!** 

## **Motivation**




The contribution of FPGAs in large data centers is growing; they are expected to be in <u>30%</u> of datacenter servers by 2020 (Top500 news).



## Voltage Scaling Capability in Xilinx




## **Experimental Methodology**


- A detailed study on <u>FPGA BRAMs</u>, which are sets of bitcells arranged in rows and columns.
- Experimental Methodology:
  1. **<u>HW</u>**: Transfer the content of the BRAMs to the host.
  2. **<u>SW</u>**: Analyze the data and adjust the voltage of the BRAMs.



Floorplan of VC707

Operating frequency is set to the maximum, i.e., ~500 MHz.



    V_CCBRAM = V_min;
    while (V_CCBRAM >= V_crash) begin
        while (numRun <= 100) begin
            delay(1sec);
            Transfer content of BRAMs to the host;
            Analyse faulty data (rate and location);
            numRun++;
        end
        V_CCBRAM -= 10 (mV);
    end
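The sweep in the pseudocode above can be sketched as a host-side loop. This is a hedged illustration rather than the actual measurement software: `set_voltage_mv` and `read_brams` are hypothetical callbacks standing in for the PMBus voltage control and the BRAM readout, and the simulated board and its voltage values are made up. The sketch restarts the run counter at each voltage step, which the slide's pseudocode leaves implicit.

```python
import time


def undervolting_sweep(set_voltage_mv, read_brams, expected,
                       v_min_mv, v_crash_mv, step_mv=10, runs=100, delay_s=1.0):
    """Lower the BRAM voltage step by step and record the mean fault count."""
    results = {}
    voltage = v_min_mv
    while voltage >= v_crash_mv:
        set_voltage_mv(voltage)
        fault_counts = []
        for _ in range(runs):
            time.sleep(delay_s)
            data = read_brams()
            # Locations (word indices) whose content differs from the written pattern.
            faulty = [i for i, (got, want) in enumerate(zip(data, expected))
                      if got != want]
            fault_counts.append(len(faulty))
        results[voltage] = sum(fault_counts) / runs
        voltage -= step_mv
    return results


# Simulated board with illustrative numbers: one word fails below 600 mV.
pattern = [0x3FFFF] * 4
board = {"mv": None}


def fake_set_voltage(mv):
    board["mv"] = mv


def fake_read():
    return pattern if board["mv"] >= 600 else [0x3FFFF, 0x00000, 0x3FFFF, 0x3FFFF]


print(undervolting_sweep(fake_set_voltage, fake_read, pattern,
                         v_min_mv=610, v_crash_mv=590, runs=2, delay_s=0))
```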

#### **Overall Behavior- Power & Reliability**




#### Fault Characterization at **CRITICAL** Region


#### Fault Variability between BRAMs

- BRAMs are clustered using K-means clustering.
- The majority of BRAMs have low vulnerability.
- ~36% of BRAMs never experience faults.
- The fault distribution is fully non-uniform.
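Clustering BRAMs by vulnerability can be illustrated with a small one-dimensional K-means over per-BRAM fault counts. This is a minimal sketch under stated assumptions: the fault counts are made up, the deterministic centroid initialization is a simplification chosen for repeatability, and the function name `kmeans_1d` is hypothetical; the slides do not describe the features or the initialization actually used.

```python
def kmeans_1d(values, k, iterations=50):
    """Plain K-means on scalar values; assumes len(values) >= k >= 2."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids over the sorted data.
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                  for v in values]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            new_centroids.append(sum(members) / len(members) if members
                                 else centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Made-up fault counts for seven BRAMs: mostly low, a few high.
fault_counts = [0, 0, 1, 2, 50, 55, 400]
labels, centroids = kmeans_1d(fault_counts, k=3)
print(labels)     # [0, 0, 0, 0, 1, 1, 2]
print(centroids)  # [0.75, 52.5, 400.0]
```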



Figure panels: VC707 and KC705, at VCCBRAM = Vcrash. (Different scales on the y-axis; pattern = 18'h3FFFF.)

## **Environmental Temperature**

- **Methodology:** Adjusting environmental temperature, monitoring on-board temperature via PMBus.
- Experimental Observation:
  - <u>At higher temperatures, fault rate is significantly reduced.</u>
  - The <u>rate of this reduction</u> is highly platform-dependent (VC707 > KC705).
- Inverse Temperature Dependency (ITD):
  - For nano-scale technologies under ultra-low-voltage operation, circuit delay decreases at higher temperatures because the supply voltage approaches the threshold voltage.



\* x-axis: VCCBRAM (V), y-axis: fault rate (per 1 Mbit) \*

## Summary & Future Works




#### Summary

- We <u>experimentally</u> showed how Xilinx FPGAs behave under aggressive low-voltage operation.
- There is a <u>conservative voltage</u> <u>guardband</u> below the nominal level.
- BRAM <u>power</u> is significantly reduced through undervolting; however, <u>reliability</u> degrades below the minimum safe voltage.
- We <u>characterized</u> the behavior of undervolting faults in the critical region.

## **Future Works**

- <u>Dynamic Vmin scaling</u>, adapted to frequency and temperature.
- More advanced designs, where other components such as <u>I/O</u>, <u>DDR, DSP</u> are undervolted.
- Efficient Fault Mitigation Techniques.
- <u>Profiling applications</u> such as Deep Neural Networks (DNNs), among others.
- Extending Undervolting for other commercial FPGAs such as <u>Intel/Altera.</u>



- Background
  - What does Undervolting mean?
  - Motivation: FPGAs Undervolting
- First Contribution: Undervolting Xilinx FPGAs
  - Experimental Methodology
  - Overall Power and Reliability Trade-off
- Second Contribution: Fault Characterization
  - Fault Variability
  - Fault Types
  - Impact of the Environmental Temperature
- Related Work
- Summary and Future Works




## **Related Works of Undervolting**

- Simulation-based: lack precise information about the real hardware.

## **Focus of Previous Works:**

(1) <u>Covered in our work for FPGAs</u>

- Voltage Guardband
- Fault Characterization at Critical Region
- Impact of Environmental Conditions

(2) <u>Not covered in our work on FPGAs (Future Work)</u>

- Dynamic Vmin Prediction
- Fault Mitigation at Critical Region
- Application Profiling



## Future of FPGA Undervolting needs more advanced voltage designs, by <u>vendors</u>:

1. Many FPGA platforms, e.g., Zynq, are not equipped with voltage scaling capability.
2. There is no standard for voltage distribution among platform components.
3. Voltage regulators are hardwired to the host through the PMBus interface.
4. In many cases, several components on the FPGA platform share a single voltage rail.
5. Vendors set unnecessarily conservative voltage guardbands that increase energy consumption.
6. There is no publicly available circuit-level information about FPGAs.