# Analytical Enhancements and Practical Insights for MPCP with Self-Suspensions

Pratyush Patel, <u>Iljoo Baek</u>, Hyoseung Kim\*, Raj Rajkumar





#### High Computational Demand of Safety-Critical Systems





#### **Problems with Hardware Accelerators**

- They do not support preemption

   Due to high context switching overhead\*<sup>†</sup>
- They handle multiple resource requests in any order
   Concurrent execution on GPU may result in unpredictable delays



- 3 identical CUDA kernels on NVIDIA GTX 1070
- ✓ 97% slowdown on two kernels
- ✓ Unpredictable which kernel gets delayed
- They do not respect task priorities or scheduling policies
   May result in unbounded priority inversion

\* I. Tanasicet al. Enabling preemptive multiprogramming on GPUs. In International Symposium on Computer Architecture (ISCA), 2014.

<sup>+</sup> Some recent GPU architectures support preemption - NVIDIA Volta Architecture <u>https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/</u>

### Existing Solution: Synchronization-Based Approaches<sup>\*†‡</sup>



#### **Benefits of synchronization-based approaches**

- ✓ Do not require any change in accelerator device drivers
- ✓ Existing schedulability analyses can be directly re-used

<sup>\*</sup> G. Elliott and J. Anderson. Globally scheduled real-time multiprocessor systems with GPUs. *Real-Time Syst.*, 48(1):34–74, 2012.

<sup>†</sup> G. Elliott and J. Anderson. An optimal k-exclusion real-time locking protocol motivated by multi-GPU systems. *Real-Time Syst.*, 49(2):140–170, 2013.

<sup>‡</sup> G. Elliott et al. GPUSync: A framework for real-time GPU management. In IEEE Real-Time Systems Symposium (RTSS), 2013.

#### Limitations



- Analytical pessimism
  - Traditional recursion-based analysis<sup>#</sup>
  - o Can lead to expensive over-provisioning

<sup>\*</sup> R. Rajkumar, L. Sha, and J. P. Lehoczky. Real-time synchronization protocols for multiprocessors. In *IEEE Real-Time Systems Symposium (RTSS)*, 1988.
† A. Block et al. A flexible real-time locking protocol for multiprocessors. In *IEEE Embedded and Real-Time Comp. Systems and Apps., (RTCSA)*, 2007.
‡ B. Brandenburg and J. Anderson. The OMLP family of optimal multiprocessor real-time locking protocols. *Design Automation for Embedded Systems*, 2013
# K. Lakshmanan, D. de Niz, and R. Rajkumar. Coordinated task scheduling, allocation and synchronization on multiprocessors. In *IEEE Real-Time Systems*

Symposium (RTSS), 2009.

#### **Our Contributions**

- Analytical enhancements for the Multiprocessor Priority Ceiling Protocol (MPCP)
  - ✓ Tighter bounds for task response times
  - ✓ Allow suspensions when executing critical sections
- Extensive schedulability experiments for a variety of task set parameters
- Prototype implementation and evaluation on Nvidia TX2 running Linux
- Extensions can be used with multiple types of computational accelerators, such as a digital signal processor (DSP) and General-Purpose GPU (GP-GPU)

#### Outline

Motivation & Introduction

#### Suspension-based MPCP

- System model
- Comparison with busy-waiting approach
- Task response time analysis
- Evaluation
- Conclusions

#### **Example of GPU Execution**



### System Model

- Sporadic tasks with constrained deadlines
  - Task  $\tau_i := (C_i, G_i, T_i, \eta_i)$ 
    - $C_i$  : Sum of the WCET\* of all non-critical sections
    - $G_i$  : Sum of the WCET\* of all critical sections
    - $T_i$  : Period (Deadline = Period)
    - $\eta_i$  : Maximum number of critical sections
    - $\zeta_{i,j}$  : Maximum number of suspensions in the j<sup>th</sup> critical section
  - Critical segment  $G_{i,j} := (G_{i,j}^e, G_{i,j}^m)$



- Each hardware accelerator is modeled as a distinct shared resource
- Use partitioned fixed-priority preemptive scheduling

#### Example under **Busy-Waiting MPCP**



#### **Example under Suspension-based MPCP**



11/26

#### **Task Response Time Analysis**

• Worst-Case Response Time  $(W_i)^*$  *i* = task number



<sup>\*</sup> N. Audsley, A. Burns, M. Richardson, K. Tindell, and A. Wellings. Applying new scheduling theory to static priority pre-emptive scheduling. Software Engineering Journal, 8(5):284–292, 1993.

<sup>†</sup> J.-J. Chen et al. Many suspensions, many problems: A review of self-suspending tasks in real-time systems. Technical Report 854, Department of Computer Science, TU Dortmund, 2016.

### **Total Blocking Time Analysis**

For each analyzed task,

#### Request-driven (RD) Approach\*

 Consider the sum of the worst-case blocking times for each lockacquisition request issued by the analyzed task

#### • Job-driven (JD) Approach

 Consider the maximum number of lock-acquisition requests issued by other tasks during the execution of the analyzed task

#### Hybrid Approach

- Upper-bound the maximum lock-acquisition requests possible in RD analysis by using JD analysis – obtain the best of both approaches
  - Different from RTAS'14<sup>†</sup> which simply takes the minimum of RD and JD for the blocking delay

<sup>\*</sup> K. Lakshmanan, D. de Niz, and R. Rajkumar. Coordinated task scheduling, allocation and synchronization on multiprocessors. In IEEE Real-Time Systems Symposium (RTSS), 2009.

<sup>†</sup> H. Kim, D. de Niz, B. Andersson, M. Klein, O. Mutlu, and R. Rajkumar. Bounding memory interference delay in COTS-based multi-core systems. In *IEEE Real-Time Technology and Applications Symposium (RTAS)*, 2014. 13/26

### **Request-driven\*** Blocking Time Analysis



For each request made by  $\tau_3$ , its blocking time is given by

 $B_3 = B_{3,1} + B_{3,2} = 102 + 102 = 204$ 

\* K. Lakshmanan, D. de Niz, and R. Rajkumar. Coordinated task scheduling, allocation and synchronization on multiprocessors. In IEEE Real-Time Systems Symposium (RTSS), 2009.

### Job-driven Blocking Time Analysis



## Hybrid Blocking Time Analysis



#### Outline

- Motivation & Introduction
- Suspension-based MPCP
- Evaluation
  - Case Study
  - Schedulability Experiment
- Conclusions

#### **Case Study**

- Motivated by the software system of CMU's self-driving car\*
  - Lane-Change Detector





• Workzone Detector<sup>†</sup>





• Matrix Calculation





\* J. Wei et al. Towards a viable autonomous driving research platform. In IEEE Intelligent Vehicles Symposium (IV), 2013.
 † J. Lee et al. Kernel-based traffic sign tracking to improve highway workzone recognition for reliable autonomous driving. In IEEE International Conference on Intelligent Transportation Systems (ITSC), 2013.
 18/26

#### **Experimental Setup**

- NVIDIA TX2
  - ✓ 6 CPU Cores
  - ✓ 1 GPU (256 Cores, Pascal Arch.)





Task priorities are assigned based on the rate-monotonic policy

#### DEMO



# Suspension-based vs. Busy-waiting MPCP



Performs better in practice, especially for lower-priority tasks
 Allows other tasks to use the CPU while a task is using the GPU

### **Effect of Suspension Overhead**

• Test result w.r.t. the number of co-scheduled tasks w/ small GPU segments



## **Schedulability Experiments**

- Purpose: To explore the impact of the different approaches on task schedulability
- 10,000 randomly-generated tasksets

MPCP – Our Analysis FMLP+ – LP-based Analysis\*

| Parameters                                                    | Values       |  |  |
|---------------------------------------------------------------|--------------|--|--|
| Number of CPUs (m)                                            | 4            |  |  |
| Number of shared resources (g)                                | [1, 3]       |  |  |
| Number of tasks per CPU                                       | [3, 6]       |  |  |
| Percentage of tasks with critical sections                    | [10, 40] %   |  |  |
| Task period and deadline $(T_i = D_i)$                        | [30, 500] ms |  |  |
| Utilization per CPU                                           | [40, 60] %   |  |  |
| Ratio of crit. Sec. len. To non-crit. Sec. len. $(G_i / C_i)$ | [10, 30] %   |  |  |
| Number of critical sections per task $(\eta_i)$               | [1, 3]       |  |  |
| Number of suspensions in a critical section $(\zeta_{i,j})$   | [1, 2]       |  |  |

<sup>\*</sup> The Schedulability Test Collection and Toolkit (SchedCAT). http://github.com/brandenburg/schedcat

#### Schedulability w.r.t. the Percentage of GPU-using Tasks



Hybrid MPCP outperforms both the original MPCP and LP-based FMLP+

#### Schedulability w.r.t. the Percentage of GPU-using Tasks



Hybrid MPCP analysis is over 100x faster than LP-based FMLP+

### Conclusions

#### Suspension-based MPCP

- Motivated by the limitations of the busy-waiting synchronization-based approach
- ✓ Implementation on a real-world embedded platform
- ✓ Significant improvement over the busy-waiting approach
- ✓ Very competitive with and often outperforms LP-based FMLP+
- ✓ 100x better runtime performance compared to LP-based analysis

#### • Future directions

- A detailed study of the suspension overhead trade-offs on modern platforms with accelerators
- Comparison with other synchronization protocols

# Thank You

#### Analytical Enhancements and Practical Insights for MPCP with Self-Suspensions

Pratyush Patel<sup>\*</sup>, <u>Iljoo Baek<sup>\*</sup></u>, <u>Hyoseung Kim<sup>†</sup></u>, Raj Rajkumar<sup>\*</sup>

ibaek@andrew.cmu.edu

\* Carnegie Mellon University <sup>+</sup> University of California, Riverside

# **BACKUP SLIDES**

#### **Self-Suspension Implementation**



- Global Lock 💼
  - ✓ POSIX pthread\_mutex()
  - ✓ Shared memory
- Priority Ceiling
   ✓ sched\_setscheduler()
- CPU Suspension
  - ✓ POSIX pthread\_cond()

#### GPU Execution

- Asynchronous functions ex) cudaMemcpyAsync()
- ✓ Stream and Callback

#### **Experimental Setup**

- NVIDIA TX2
  - 4 ARM Cortex-A9 at 2GHz
  - 2 Denver at 2GHz
  - 1 GPU (256 Cores, Pascal Arch.)
  - Ubuntu 16.04





| 1000                                                                                                                                                                                                                                                                                                                                                                                    |     | $C_i$ | $G_i$ | $T_i = D_i$ | CPU | Test |                      |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----|-------|-------|-------------|-----|------|----------------------|
|                                                                                                                                                                                                                                                                                                                                                                                         | LC  | 13.5  | 3.19  | 39.5        | 1   | 123  | 1                    |
| LC                                                                                                                                                                                                                                                                                                                                                                                      | WZ  | 29.48 | 4.04  | 50          | 2   | 123  | General Use Case     |
|                                                                                                                                                                                                                                                                                                                                                                                         | AM1 | 11.05 | 5.12  | 100         | 1   |      | (2)                  |
|                                                                                                                                                                                                                                                                                                                                                                                         | AM2 | 8.81  | 9.38  | 165         | 1   |      | GPU Overload         |
| WZ                                                                                                                                                                                                                                                                                                                                                                                      | AM3 | 32.97 | 10.88 | 300         | 2   |      |                      |
| Cj.         Gj.         Tj. = 0,         CPU         Test.           LC         13.5         3.19         39.5         1         0.00.00           WZ         29.44         4.64         50         2         0.00.00           AM1         11.65         5.12         100         1         0.0           AM2         8.41         9.36         15.2         100         1         0.0 | AM4 | 1.15  | 2.7   | 120         | 1   | 0    | (3)<br>Overhead Test |
| AM1 - 19<br>AM5 1 - 19<br>AM5 1 - 19<br>AM5 1 - 19<br>AM1 ~ 5                                                                                                                                                                                                                                                                                                                           | AM5 | 1.15  | 0.7   | 120         | 1   | 3    |                      |

\* All times are in milliseconds (ms)

#### **Suspension Overhead**

- CPU-side overhead
  - ✓ Suspension
  - ✓ Context Switching
- GPU-side overhead
  - ✓ Asynchronous Calls
  - Callback functions



GPU access time < the suspension overhead

→ the busy-wait approach is better

#### **Self-Suspension Implementation**



- Global Lock
  - ✓ Shared memory
  - ✓ pthread\_mutex()
- Suspension
   ✓ a POSIX conditional variable
- CUDA-related
  - ✓ Asynchronous CUDA functions ex) cudaMemcpyAsync()
  - ✓ Stream and Callback
- Priority Ceiling
   ✓ sched\_setscheduler()

```
new_pri =
highest_pri + (cur_pri – lowest_pri) + 1
```

Ex)

```
Task 1: 80 (cpu)
Task 2: 54 (gpu) -> 85 = 80 + (54 - 50) + 1
Task 3: 53 (gpu) -> 84 = 80 + (53 - 50) + 1
Task 4: 52 (gpu) -> 83 = 80 + (52 - 50) + 1
Task 6: 50 (gpu) -> 81 = 80 + (50 - 50) + 1
```

#### **Task Response Time with Suspension**

- Worst-Case Response Time  $(W_i)$ 
  - Time span between a request and the end of the request



### Schedulability Experiments

- Purpose: To explore the impact of the different approaches on task schedulability
- Schedulability: How many taskets are schedulable?



# **Blocking Definition**

- A task  $\tau_i$  is said to be blocked
  - $\checkmark$  If a local task  $\tau_i$  with a lower base priority is scheduled while of  $\tau_i$  is pending.
  - ✓ If any task  $\tau_k$  has locked the resource that  $\tau_i$  is waiting for.



## **Blocking Definition**

#### • Direct Blocking (DB)

- is incurred when any task  $\tau_k$  has locked the resource that  $\tau_i$  is waiting for.

#### Prioritized Blocking (PB)

- is incurred when lower-priority tasks executing with priority ceilings preempt the CPU execution of  $\tau_i$ 

#### Indirect Blocking (IB)

- is incurred when a task  $\tau_x$  accessing a resource with a higher priority ceiling preempts the execution of  $\tau_j$ , which is holding the resource that  $\tau_i$  is waiting for.



#### **Task Response Time with Suspension**

• Worst-case response time under partition fixed-priority scheduling



Core 1  $T_i$ 

Core 2  $\tau_k$ 

 $\uparrow$ 

#### • Direct Blocking (DB)

- is incurred when any task  $\tau_k$  has locked the resource that  $\tau_i$  is waiting for.





#### • Direct Blocking (DB)



#### • Direct Blocking (DB)

- is incurred when any task  $\tau_k$  has locked the resource that  $\tau_i$  is waiting for.



#### ✓ Hybrid Approach



#### • Direct Blocking (DB)



#### **Blocking Analysis: Example**

| Task           | C <sub>i</sub> | G <sub>i</sub> | n <sub>i</sub> | <b>G</b> <sub>i,1</sub> | <b>G</b> <sub>i,2</sub> | <b>G</b> <sub>i,3</sub> | Т   |
|----------------|----------------|----------------|----------------|-------------------------|-------------------------|-------------------------|-----|
| T <sub>1</sub> | 20             | 10             | 2              | 5                       | 5                       | -                       | 100 |
| T <sub>2</sub> | 10             | 30             | 3              | 10                      | 10                      | 10                      | 400 |
| T <sub>3</sub> | 10             | 25             | 3              | 10                      | 10                      | 5                       | 200 |



### Nvidia TX2



# Nvidia TX2

|           | NVIDIA<br>Jetson TX1                                                 | NVIDIA<br>Jetson TX2                                                     |  |  |  |
|-----------|----------------------------------------------------------------------|--------------------------------------------------------------------------|--|--|--|
| CPU       | ARM Cortex-A57 (quad-core) @ 1.73GHz                                 | ARM Cortex-A57 (quad-core) @ 2GHz +<br>NVIDIA Denver2 (dual-core) @ 2GHz |  |  |  |
| GPU       | 256-core Maxwell @ 998MHz                                            | 256-core Pascal @ 1300MHz                                                |  |  |  |
| Memory    | 4GB 64-bit LPDDR4 @ 1600MHz   25.6<br>GB/s                           | 8GB 128-bit LPDDR4 @ 1866Mhz   59.7<br>GB/s                              |  |  |  |
| Storage   | 16GB eMMC 5.1                                                        | 32GB eMMC 5.1                                                            |  |  |  |
| Encoder*  | 4Kp30, (2x) 1080p60                                                  | 4Kp60, (3x) 4Kp30, (8x) 1080p30                                          |  |  |  |
| Decoder*  | 4Kp60, (4x) 1080p60                                                  | (2x) 4Kp60                                                               |  |  |  |
| Camera†   | 12 Ianes MIPI CSI-2   1.5 Gb/s per Iane  <br>1400 megapixels/sec ISP | 12 Ianes MIPI CSI-2   2.5 Gb/sec per Iane<br>  1400 megapixels/sec ISP   |  |  |  |
| Display   | 2x HDMI 2.0 / DP 1.2 / eDP 1.2   2x MIPI DSI                         |                                                                          |  |  |  |
| Wireless  | 802.11a/b/g/n/ac 2×2 867Mbps  <br>Bluetooth 4.0                      | 802.11a/b/g/n/ac 2×2 867Mbps  <br>Bluetooth 4.1                          |  |  |  |
| Ethernet  | 10/100/1000 BASE-T Ethernet                                          |                                                                          |  |  |  |
| USB       | USB 3.0 + USB 2.0                                                    |                                                                          |  |  |  |
| PCIe      | Gen 2   1×4 + 1 x1                                                   | Gen 2   1×4 + 1×1 or 2×1 + 1×2                                           |  |  |  |
| CAN       | Not supported                                                        | Dual CAN bus controller                                                  |  |  |  |
| Misc I/O  | UART, SPI, I2C, I2S, GPIOs                                           |                                                                          |  |  |  |
| Socket    | 400-pin Samtec board-to-board connector, 50x87mm                     |                                                                          |  |  |  |
| Thermals‡ | -25°C to 80°C                                                        |                                                                          |  |  |  |
| Power††   | 10W                                                                  | 7.5W                                                                     |  |  |  |
| Price     | \$299 at 1K units                                                    | \$399 at 1K units                                                        |  |  |  |

# Nvidia TX2





- SM is limited to 2,048 per SM.
- Shared memory usage is limited to 64KB per SM and 48KB per block.
- the total number of threads each block can use is limited to 1,024.
- Each thread can use up to 255 registers
- A block can use up to 32,768 registers (regardless of its thread count).
- Additionally, there is a limit of 65,536 registers in total on each SM.