Fault Tolerant Scheduler: Preparing to start SoC

Before the official start

 General notes:
  • I will be working on an extension to the scheduler in the SuperCore; the extension will have a readable interface for the user
  • The fault tolerant scheduler will be embedded into the scheduling policy
  • We assume that multiple execution versions of a control task are available, each with a different level of protection (three for now: an error-correcting version, an error-detecting version, and a basic version). The fault tolerant scheduler receives these execution versions from the RTEMS user application
  • Not the whole task needs to be protected: the user can choose to protect only a certain block of a task, so that the rest of the code runs without protection
  • Because users implement the execution versions themselves, they decide on the level of protection
  • The RTEMS fault tolerant scheduler provides tools to implement techniques that decide when to execute which version. Since the versions with a high level of protection use a lot of system resources, the goal is to minimize the number of redundant executions, or to use protection only in critical situations
  • Techniques for deciding when to execute which version can be exchanged
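
To make the notes above more concrete, here is a minimal sketch of how the three execution versions and an exchangeable decision technique could be represented. None of this is the final API: all names (ft_version, ft_task, ft_policy_every_third and so on) are placeholders I made up for illustration, and the two policies are deliberately trivial.

```c
#include <assert.h>
#include <stddef.h>

/* The three protection levels mentioned in the notes (names are hypothetical). */
typedef enum {
  FT_VERSION_BASIC = 0,   /* no protection */
  FT_VERSION_DETECTING,   /* error detecting version */
  FT_VERSION_CORRECTING,  /* error correcting version */
  FT_VERSION_COUNT
} ft_version;

/* An execution version is an ordinary function supplied by the application. */
typedef void (*ft_entry)(void *arg);

/* An exchangeable technique: given the job index, pick a version. */
typedef ft_version (*ft_policy)(unsigned job_index);

typedef struct {
  ft_entry  versions[FT_VERSION_COUNT]; /* provided by the user application */
  ft_policy policy;                     /* can be swapped at any time */
  unsigned  job_index;                  /* counts released jobs */
} ft_task;

/* Placeholder technique: always use the strongest protection. */
static ft_version ft_policy_always_correct(unsigned job_index) {
  (void)job_index;
  return FT_VERSION_CORRECTING;
}

/* Placeholder technique: protect only every third job to save resources. */
static ft_version ft_policy_every_third(unsigned job_index) {
  return (job_index % 3 == 0) ? FT_VERSION_CORRECTING : FT_VERSION_BASIC;
}

/* Exchange the technique without touching the registered versions. */
static void ft_task_set_policy(ft_task *t, ft_policy p) {
  assert(t != NULL && p != NULL);
  t->policy = p;
}

/* Ask the current technique which version the next job should run. */
static ft_version ft_task_next_version(ft_task *t) {
  assert(t != NULL && t->policy != NULL);
  return t->policy(t->job_index++);
}
```

Keeping the technique behind a function pointer is only one possible design; it keeps the registered versions and the decision logic independent of each other.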

In the first week I will focus on implementing a version of the scheduler which has very simple fault tolerance techniques.

Using the fault tolerant scheduler needs to be easy for the user. The settings of the scheduler should be configurable with simple function calls. For now, the scenario I imagine is like this:

  1. A task that wants to use the fault tolerant (ft) scheduler is configured to use protection. The RTEMS user passes all relevant information about the protection to the ft scheduler.
  2. The task is scheduled by the standard RTEMS scheduler, e.g. with RM (rate-monotonic) scheduling, and is allocated to the CPU. Before the task executes its protected code, it calls the ft scheduler.
  3. The ft scheduler then decides which execution version of the task needs to be run, and informs the task about which version it needs to execute (or starts that version?)
  4. The task then runs the protected block and ends, or it leaves the protected section, does other work, and then terminates.
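
The four steps can be sketched from the point of view of the task. This is only an illustration under my own assumptions: ft_register, ft_select and ft_run_protected are hypothetical names, not existing RTEMS calls, and the selection rule (alternating between the correcting and the basic version) is a stand-in for a real technique. The sketch shows both open variants of step 3: ft_select only informs the caller, while ft_run_protected also starts the chosen version.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { FT_BASIC, FT_DETECT, FT_CORRECT, FT_NUM_VERSIONS } ft_level;

/* A protected block: takes an input and returns the control output. */
typedef int (*ft_block)(int input);

typedef struct {
  ft_block block[FT_NUM_VERSIONS]; /* versions handed over in step 1 */
  unsigned calls;                  /* how often the ft scheduler was asked */
} ft_config;

/* Step 1: the application passes its versions to the ft scheduler. */
static void ft_register(ft_config *cfg, ft_block basic, ft_block detect,
                        ft_block correct) {
  assert(cfg != NULL && basic != NULL && detect != NULL && correct != NULL);
  cfg->block[FT_BASIC] = basic;
  cfg->block[FT_DETECT] = detect;
  cfg->block[FT_CORRECT] = correct;
  cfg->calls = 0;
}

/* Step 3, variant "inform": decide which version runs next.
 * Placeholder rule: alternate between correcting and basic. */
static ft_level ft_select(ft_config *cfg) {
  assert(cfg != NULL);
  return (cfg->calls++ % 2 == 0) ? FT_CORRECT : FT_BASIC;
}

/* Steps 2 to 4, variant "start": called by the task right before its
 * protected block; selects a version and runs it. */
static int ft_run_protected(ft_config *cfg, int input) {
  ft_level level = ft_select(cfg);
  return cfg->block[level](input);
}

/* Example versions of one control computation; last_run records which
 * version was executed so the effect of the selection is visible. */
static ft_level last_run;
static int control_basic(int x)   { last_run = FT_BASIC;   return 2 * x; }
static int control_detect(int x)  { last_run = FT_DETECT;  return 2 * x; }
static int control_correct(int x) { last_run = FT_CORRECT; return 2 * x; }
```

A task would call ft_register once during setup and ft_run_protected each time it reaches its protected block; code outside the block runs as usual.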

It is also important to find out how things work around the scheduler, and think about the basics:
  • the way scheduling is done in RTEMS
  • how the task versions should be organized in the user application
  • managing the task versions and activating them correctly
  • task to scheduler communication: e.g. when errors are detected, interrupt the current task, terminate it, and trigger an error recovery version
I will post more on the underlying fault tolerance techniques when I'm implementing them. For now I will focus on the basics.
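
As a first idea of what the task to scheduler communication on a detected error could look like, here is a small sketch. It is heavily simplified and rests on my own assumptions: the detecting version is modelled as a duplicated computation with comparison, the recovery version as a triple computation with a majority vote, and "trigger an error recovery version" is just a second function call. Interrupting and terminating a running RTEMS task is not modelled. All names, including the inject_fault hook, are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { FT_OK, FT_ERROR_DETECTED } ft_status;

typedef ft_status (*ft_detecting_fn)(int in, int *out);
typedef void (*ft_recovery_fn)(int in, int *out);

static int inject_fault; /* test hook: non-zero corrupts one replica */

/* Stand-in for the real control computation. */
static int control_law(int in) { return 3 * in + 1; }

/* Detecting version: compute twice and compare the results. */
static ft_status detecting_version(int in, int *out) {
  int a = control_law(in);
  int b = control_law(in);
  if (inject_fault) a ^= 0x4; /* simulated bit flip in one replica */
  *out = a;
  return (a == b) ? FT_OK : FT_ERROR_DETECTED;
}

/* Recovery version: compute three times and take the majority value.
 * Assumes at most one replica is corrupted. */
static void correcting_version(int in, int *out) {
  int a = control_law(in);
  int b = control_law(in);
  int c = control_law(in);
  if (inject_fault) a ^= 0x4;
  *out = (a == b || a == c) ? a : b;
}

/* Run the detecting version; if it reports an error, discard its output
 * and run the recovery version instead. Counts triggered recoveries. */
static ft_status ft_run_with_recovery(ft_detecting_fn detect,
                                      ft_recovery_fn recover,
                                      int in, int *out, unsigned *recoveries) {
  assert(detect != NULL && recover != NULL && out != NULL && recoveries != NULL);
  ft_status status = detect(in, out);
  if (status == FT_ERROR_DETECTED) {
    recover(in, out);
    (*recoveries)++;
  }
  return status;
}
```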

Which faults can we protect the system from using the ft scheduler?

The scheduler will be designed to protect the system from faults in certain parts of the task entity, i.e. variables in main memory, registers, and cache, and from faults during the transfer of these values between hardware components. The system will also be protected from the effects of these faults, i.e. incorrect calculations (missing or faulty control task output): incorrect values can be detected and, if needed, corrected.
The main applications of the scheduler are control tasks such as path tracking, gyro sensor, steering, and stabilization control, since for these we can guarantee correct output values.
We mainly focus on faults that do not lead to an unrecoverable system state. For example, if a fault occurs in an area we do not or cannot protect, such as the instruction code of the scheduler itself, the system may crash. The ft scheduler cannot protect the system from the effects of such faults.
However, we expect that the execution versions provided by the RTEMS user can always (or with very high probability) detect and correct errors. Silent data corruption, for example, where errors remain unnoticed, would be an issue if the detection/correction rate were not sufficiently high.

On scheduling and schedulability:

Since the fault tolerant scheduler manages the execution of the different task versions, schedulability needs to be considered separately. A recent study by Chen et al. proposed a schedulability analysis based on the multiframe task model (a task is available with different execution times and executes these versions following a predefined pattern). For this reason, we assume that the system is schedulable and that no deadlines are missed when the fault tolerance techniques are used. I will add tools to check the schedulability of task configurations later in the project.
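
To illustrate the multiframe idea, the sketch below models a task as a repeating pattern of per-job execution times and applies a simple sufficient test for RM scheduling. Note that this is not the analysis by Chen et al.: it is a deliberately pessimistic stand-in that treats every job as if it ran the most expensive frame and then checks the hyperbolic bound, i.e. the product of (U_i + 1) over all tasks must not exceed 2, assuming periods equal deadlines on a single processor. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* A multiframe task: job k executes for frame_wcet[k % frames] time units. */
typedef struct {
  const unsigned *frame_wcet; /* execution time of each frame in the pattern */
  size_t          frames;     /* length of the pattern */
  unsigned        period;     /* period, also used as the deadline */
} mf_task;

/* Execution time of the job with the given index, following the pattern. */
static unsigned mf_wcet_of_job(const mf_task *t, size_t job) {
  assert(t != NULL && t->frames > 0);
  return t->frame_wcet[job % t->frames];
}

/* Largest execution time in the pattern (the peak frame). */
static unsigned mf_peak_wcet(const mf_task *t) {
  assert(t != NULL && t->frames > 0);
  unsigned peak = 0;
  for (size_t i = 0; i < t->frames; i++) {
    if (t->frame_wcet[i] > peak) peak = t->frame_wcet[i];
  }
  return peak;
}

/* Sufficient (not necessary) RM test: hyperbolic bound on peak utilization.
 * Returns 1 if the task set passes, 0 if the test is inconclusive. */
static int mf_rm_sufficient(const mf_task *tasks, size_t n) {
  assert(tasks != NULL);
  double product = 1.0;
  for (size_t i = 0; i < n; i++) {
    assert(tasks[i].period > 0);
    product *= 1.0 + (double)mf_peak_wcet(&tasks[i]) / (double)tasks[i].period;
  }
  return product <= 2.0;
}
```

A result of 0 does not mean that deadlines will be missed, only that this coarse test cannot prove schedulability; an analysis that exploits the pattern can accept more task sets.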

The approaches for soft-error handling come from the research group at TU Dortmund:
Paper by K.H. Chen

This paper by Gedare Bloom is helpful for my project and for understanding the principles of scheduling in RTEMS:
Paper

These two blog entries, also by Gedare, describe how the scheduler works and how one can add a new scheduler:
Link
Link

The RTEMS C User's Guide also covers a lot:
Link

A talk by Joel Sherrill also helps to get to know RTEMS better:
