Connecting Things to the IoT by Using Virtual Peripherals on a Dynamically Multithreaded Cortex M3
The Internet of Things communicates with the world through a wide range of sensors and actuators. These interfaces are based on a variety of protocols, such as I2C, SPI, RS232, and 1-wire. There are two conceptually different ways to provide these interfaces. One is to use dedicated hardware, for example a peripheral on a system-on-a-chip (SoC). All SoC providers offer families of SoC solutions with different combinations of hardware peripherals. The alternative is to run virtual peripherals as software routines on a CPU, preferably on a multithreaded CPU. C-Slow Retiming (CSR) is a known design transformation to generate multithreaded CPUs. This paper argues that system hyper pipelining overcomes the limitations of CSR by adding thread stalling, bypassing, and reordering techniques to better cope with the challenges of multithreading. This dynamically multithreaded environment is ideal for running virtual peripherals. The benefits of using system hyper pipelining for virtual peripherals are demonstrated on a Cortex M3-based system.
We will argue that a single CPU manufactured in a low-cost technology and running at 100 MHz is not capable of serving multiple independent virtual peripherals (VPs). We propose a solution that transforms a standard CPU into a multithreaded CPU and achieves a higher performance per area factor at the same time. It will be shown that our proposal is an ideal solution for running multiple VPs. C-Slow Retiming (CSR) is a well-known design transformation that converts a standard CPU into a multithreaded CPU. CSR provides C copies of a given design by inserting registers and by reusing the combinatorial logic in a time-sliced fashion. CSR therefore improves the performance per area factor. Leiserson and Saxe introduced the concept of CSR. Timing-driven CSR on RTL has been shown by Strauch. In recent publications, CSR is used by Su et al. to maximize the throughput-area efficiency, by Akram et al. on SMT processors, and by Cadenas et al. to improve performance in consumer imaging devices. When CSR is applied, only a fixed number of threads can be executed, and these threads can only be executed in one fixed order. These limitations can be overcome by using System Hyper Pipelining (SHP), which was introduced by Strauch. SHP is a combination of CSR and parallel processing. SHP adds flexibility to pure CSR-based multithreading. The number of active threads can range from zero to a number greater than C. Individual threads can be stalled, bypassed, and reordered, and can therefore be executed at different speeds.
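The time-sliced reuse of logic that CSR performs can be illustrated with a small software model. The following is a minimal Python sketch under stated assumptions, not the transformation itself: the names `step` and `run_csr` and the toy next-state function are hypothetical and chosen only for illustration. The C register slots are modeled as a queue that rotates the state of each design copy through one shared logic block.

```python
from collections import deque


def step(state):
    """Toy next-state function (assumption): stands in for the shared combinatorial logic."""
    return state + 1


def run_csr(initial_states, cycles):
    """Model a C-slowed design, where C = len(initial_states).

    Each cycle, the copy at the head of the register chain uses the shared
    logic once and re-enters the chain behind the other copies.
    """
    registers = deque(initial_states)      # one register slot per design copy
    for _ in range(cycles):
        state = registers.popleft()        # copy whose time slot is active
        registers.append(step(state))      # result waits C - 1 cycles for its next slot
    return list(registers)


# Four copies (C = 4) run for eight cycles.
result = run_csr([0, 100, 200, 300], cycles=8)
```

In this model every copy advances exactly once per C cycles, and the copies always take their turns in the same order. That fixed count and fixed order is the limitation of pure CSR that the text describes.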
A thread controller (TC) is proposed, which is accessible by all design copies through one special function register (SFR, Table I). Threads can be added to the active thread list by writing their thread-specific start address and configuration bits to the SFR within a single cycle. Threads can kill themselves by writing a zero into the SFR. Threads resulting from a hardware or software interrupt are started with a specified priority level. If no other thread with a higher priority is due to be executed, hardware and software interrupts start without any cycle delay. Fig. 4 shows how the thread ID (TID) of an active thread passes through the ID-Queue (IDQ). Once a TID exits the IDQ, it is inserted into the ID-FIFO, unless the ID-FIFO is empty or the TID has a higher priority value (PRIORITY) than the thread coming out of the ID-FIFO. In both of these cases, the TID is directly inserted into the IDQ again and is not stored in the ID-FIFO. By default, the IDQ is filled with the next ID-FIFO entry. The IDQ indicates that a specific timeslot is active by setting the relevant valid bit V to one. If a timeslot is not used (e.g., fewer than C threads are active), then the last active thread is re-executed (to avoid switching activity). For these thread "copies", the inactive valid bit V of the IDQ indicates that they should not be processed further when leaving the IDQ. Threads with a higher priority run faster than threads with a lower priority. Threads with the same priority run at the same speed. This means that when at most C threads with the highest priority are active, a thread with the highest priority has a predictable runtime and cannot be delayed by others, for example by a heavy interrupt load. This rule enables the scheduling of real-time tasks, which is outside the scope of this paper.
Fig. 4. Abstract view of the thread controller mechanism.
Table I. Special function registers (SFR) of the thread controller.
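The interplay of the IDQ and the ID-FIFO can be made concrete with a behavioral model. The following Python sketch is a simplified reading of the rules above, not the hardware implementation. The function name `simulate_thread_controller`, the convention that a larger number means a higher priority, the initial fill order, and counting a slot when a TID leaves the IDQ are all assumptions. Stalling, bypassing, interrupts, and the SFR interface are not modeled.

```python
from collections import Counter, deque


def simulate_thread_controller(priorities, c, cycles):
    """Count the time slots each thread receives (hypothetical model).

    priorities: dict mapping TID -> priority (assumption: larger = higher).
    c: number of time slots in the IDQ.
    """
    id_fifo = deque(priorities)            # waiting TIDs, in start order
    idq = deque()                          # entries are (tid, valid bit V)
    last_tid = None
    for _ in range(c):                     # initial fill of the C time slots
        if id_fifo:
            last_tid = id_fifo.popleft()
            idq.append((last_tid, True))
        else:
            idq.append((last_tid, False))  # unused slot: repeat last TID with V = 0
    slots = Counter()
    for _ in range(cycles):
        tid, valid = idq.popleft()         # TID leaving the IDQ this cycle
        if valid:
            slots[tid] += 1
            if not id_fifo or priorities[tid] > priorities[id_fifo[0]]:
                next_tid = tid             # re-enters the IDQ directly
            else:
                id_fifo.append(tid)        # waits in the ID-FIFO
                next_tid = id_fifo.popleft()
            last_tid = next_tid
            idq.append((next_tid, True))
        elif id_fifo:
            last_tid = id_fifo.popleft()   # a free slot picks up a waiting thread
            idq.append((last_tid, True))
        else:
            idq.append((last_tid, False))  # slot stays unused
    return slots


# One high-priority thread and two low-priority threads share C = 2 slots.
usage = simulate_thread_controller({"A": 2, "B": 1, "C": 1}, c=2, cycles=8)
```

In this model, thread A keeps its time slot in every round, while B and C alternate in the remaining slot. This matches the statements that higher-priority threads run faster and that threads of equal priority run at the same speed.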
This paper outlines how C-Slow Retiming (CSR) and parallel programming can be combined into a new method called System Hyper Pipelining (SHP). SHP is a formal process that converts any digital design (e.g., a CPU) into a system with identical and independent design copies, where each running application (or program, thread, etc.) can be stalled, bypassed, or reordered individually. This stringent design transformation process can be carried out automatically within seconds. SHP benefits from the higher performance per area (PpA) factor that can be achieved with CSR, and from a flexible method of reusing the design resources.
 IEEE IoT Initiative, accessed on Nov. 15, 2016. [Online]. Available: https://iot.ieee.org
 C. E. Leiserson and J. B. Saxe, “Retiming synchronous circuitry,” Algorithmica, vol. 6, nos. 1–6, pp. 5–35, Jun. 1991.
 T. Strauch, “Timing driven C-slow retiming on RTL for multicores on FPGAs,” in Proc. Int. Conf. Parallel Comput. (ParCo), Munich, Germany, Sep. 2013, pp. 512–522.
 M. Su, L. Zhou, and C. Shi, “Maximizing the throughput-area efficiency of fully-parallel low-density parity-check decoding with C-slow retiming and asynchronous deep pipelining,” in Proc. ICCD, Lake Tahoe, CA, USA, Oct. 2007, pp. 636–643.
 M. Akram, A. Khan, and M. Sarfaraz, “C-slow technique vs. multiprocessor in designing low area customized set processor for embedded applications,” Int. J. Comput. Appl., vol. 36, no. 7, pp. 30–36, Dec. 2011.
 J. Cadenas, S. Sherratt, P. Huerta, W.-C. Kao, and G. M. Megson, “C-slow retimed parallel histogram architectures for consumer imaging devices,” Trans. Consum. Electron., vol. 59, no. 2, pp. 291–295, May 2013.
 T. Strauch, “The effects of system hyper pipelining on three computational benchmarks using FPGAs,” in Proc. 11th Int. Symp. Appl. Reconfigurable Comput. (ARC), Bochum, Germany, 2015, pp. 280–290.
 ARM Architecture Reference Manual, Thumb-2 Supplement, ARM Limited, Cambridge, U.K., Dec. 2004.
 Parallax Inc. Equip Your Genius. Accessed on Nov. 15, 2016. [Online]. Available: https://www.parallax.com
 G. Martins, D. Lacey, A. Moses, M. Rutherford, and K. Valavanis, “A case for I/O response benchmarking of microprocessors,” in Proc. 38th Annu. Conf. IEEE Ind. Electron. Soc. (IECON), Montreal, QC, Canada, Oct. 2012, pp. 3018–3023.