Fault Injection and Detection in Static Pattern-based Execution (SDR)
What is Static Detection and Recovery (SDR)?
In an earlier blog post I explained how Static Reliable Execution (SRE) uses a pattern (a predefined sequence of bits) to decide which version of a task to execute. To comply with an (m,k) requirement (m out of k task instances have to be correct to prevent mission failure), SRE executes an error-correction version if there is a "1" at the current bit position, and a basic version if there is a "0".
The problem with the SRE approach is that an error-correction version is executed m times to comply with (m,k). In general, error correction consumes a lot of resources, more than error detection. One way to correct errors in a calculation is to run the calculation three (or more) times and then take a majority vote over the results. This is why we want to execute as few error-correction versions as possible.
Instead of executing an error-correction version every time there is a "1" in the pattern, we can follow the rules of the technique Static Detection and Recovery (SDR). SDR executes an error-detection version of a task if there is a "1" in the pattern, and a basic version if there is a "0".
At first glance it seems like we are simply replacing all the error-correction versions with error-detection versions. That is true as long as no error occurs, but as soon as an error is detected, SDR executes an error-correction version immediately.
As an example, let us consider the pattern {0,1,1,1}. With SDR, the system executes one basic version and then three error-detection versions of a task, provided there are no faults. If there are faults, e.g. in the second and fourth instance, the system executes an error-correction (reliable) version immediately afterwards. With B for basic version, D for detection version and C for correction version, the executions of the task will be as follows: {B, D+C, D, D+C}, where "+C" stands for "execute the correction version immediately after detecting an error".
In reality, errors occur only rarely (though more often in space than on Earth), so executing the correction version only when an error is detected saves us from blindly executing the costly correction version for every "1" in the pattern.
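To make the cost of error correction concrete, here is a minimal sketch of majority voting over three runs. It is not taken from the Fault Tolerant Scheduler: the names majority3, run_with_voting, calc_fn and square are hypothetical, and a real correction version is written by the user of the scheduler.

```c
#include <stdint.h>

/* Hypothetical type for the calculation that should be protected. */
typedef int32_t (*calc_fn)(int32_t input);

/*
 * Return the value that at least two of the three results agree on.
 * *ok is set to 1 if such a majority exists. If all three results
 * differ, *ok is set to 0 and the first result is returned as a fallback.
 */
int32_t majority3(int32_t a, int32_t b, int32_t c, int *ok)
{
    *ok = 1;
    if (a == b || a == c) {
        return a;
    }
    if (b == c) {
        return b;
    }
    *ok = 0;
    return a;
}

/* Run the calculation three times and vote on the results. */
int32_t run_with_voting(calc_fn calc, int32_t input, int *ok)
{
    int32_t a = calc(input);
    int32_t b = calc(input);
    int32_t c = calc(input);
    return majority3(a, b, c, ok);
}

/* Example calculation, only used for illustration. */
int32_t square(int32_t x)
{
    return x * x;
}
```

Calling run_with_voting(square, 4, &ok) runs the calculation three times, which is why a correction version costs roughly three times as much as a basic version, plus the comparison.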
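The example can be reproduced with a small sketch that turns a pattern and a list of faulty instances into the execution sequence. The function sdr_trace and its parameters are hypothetical and only illustrate the rule. The real scheduler releases RTEMS tasks instead of building a string.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Build the SDR execution trace for one pass over the pattern.
 * pattern[i]: 0 = basic version, 1 = detection version.
 * faulty[i]:  nonzero if a fault hits instance i (illustrative input).
 * Writes a string such as "B, D+C, D, D+C" into out and returns its
 * length. Stops early if the buffer is too small.
 */
size_t sdr_trace(const int *pattern, const int *faulty, size_t n,
                 char *out, size_t out_size)
{
    size_t len = 0;

    if (out_size == 0) {
        return 0;
    }
    out[0] = '\0';

    for (size_t i = 0; i < n; i++) {
        const char *step;
        if (pattern[i] == 0) {
            step = "B";    /* basic version: a fault goes unnoticed */
        } else if (faulty[i]) {
            step = "D+C";  /* error detected: run correction right away */
        } else {
            step = "D";    /* detection version, no error found */
        }

        int written = snprintf(out + len, out_size - len, "%s%s",
                               i > 0 ? ", " : "", step);
        if (written < 0 || (size_t)written >= out_size - len) {
            out[len] = '\0';  /* drop the partially written step */
            break;
        }
        len += (size_t)written;
    }
    return len;
}
```

For the pattern {0,1,1,1} with faults in the second and fourth instance, the resulting string is "B, D+C, D, D+C".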
Fault Injection in the Fault Tolerant Scheduler
To implement and test SDR in the Fault Tolerant Scheduler, a fault injection mechanism is needed to simulate the occurrence of faults. A fault rate can be specified, e.g. a fault rate of p% (per task). Then an integer seed has to be chosen, and the random number generator in C has to be initialized with it by calling srand(seed) once. After that, successive calls to rand() return a sequence of pseudo-random numbers. The sequence is fully determined by the seed, i.e. seed = 5 gives us a certain sequence of random numbers and seed = 6 a different sequence, but each sequence is reproducible.
For example, the fault rate could be specified as 3%. Then a predefined number of random values in [0,100] is created and the sequence is stored in an array, as done by the function rand_nr_list() below.
To decide whether a fault should be injected, one value out of the array is considered. A fault is injected only if this random value is less than or equal to the fault rate. The random value is retrieved by the function get_rand() below, and the variable rand_count tracks how many random values have already been used. The function fault_status get_fault(...) performs the comparison rand_nr <= fault_rate and returns the fault status. The decision to inject a fault is therefore made by calling get_fault( get_rand() ). When testing the application with a control task in the future, a certain control value can be manipulated every time get_fault() returns a fault, e.g. the control value could be set to the maximum possible value to simulate the occurrence of a fault.
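The reproducibility of the sequence can be checked with a short sketch. The helper fill_from_seed is hypothetical and not part of the scheduler. It only uses srand() and rand() from the C standard library.

```c
#include <stdlib.h>

/*
 * Re-seed the generator and store the first n values that rand()
 * returns afterwards in out[0..n-1].
 */
void fill_from_seed(unsigned int seed, int *out, size_t n)
{
    srand(seed);
    for (size_t i = 0; i < n; i++) {
        out[i] = rand();
    }
}
```

Filling two arrays with the same seed gives identical contents, so a test run with injected faults can be repeated exactly. The concrete numbers depend on the C library in use, so they may differ between platforms.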
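/* Excerpt: rands_0_100[], rand_count, fault_rate, faults, NR_RANDS and
 * fault_status are defined elsewhere in the scheduler. */

/* Pre-compute NR_RANDS pseudo-random values in [0,100].
 * srand(seed) has to be called once before this function. */
void rand_nr_list(void)
{
    for ( uint16_t i = 0; i < NR_RANDS; i++ )
    {
        /* Scale rand() from [0,RAND_MAX] down to [0,100]. */
        rands_0_100[i] = rand() / (RAND_MAX / (100 + 1) + 1);
    }
}

/* Return the next unused random value. There is no bounds check,
 * so the caller has to keep rand_count below NR_RANDS. */
uint8_t get_rand(void)
{
    return rands_0_100[rand_count++];
}

/* Decide whether a fault is injected for the given value in [0,100]. */
fault_status get_fault(uint8_t rand_nr)
{
    /* A rate of 0 never injects. Without this check, a random value
     * of 0 would pass the <= comparison below. */
    if ( fault_rate == 0 )
    {
        return NO_FAULT;
    }
    /* A rate of 100 always injects. */
    if ( fault_rate == 100 )
    {
        faults++;
        return FAULT;
    }
    if ( rand_nr <= fault_rate )
    {
        faults++;   /* count the injected faults */
        return FAULT;
    }
    else
    {
        return NO_FAULT;
    }
}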
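One detail of the code above is worth noting: the random values lie in [0,100], which are 101 different values, and the comparison is rand_nr <= fault_rate. A fault rate of 3 therefore selects the four values 0 to 3 out of 101, which is slightly more than 3%. The following sketch shows one possible variant that draws values in [0,99] and compares with <, so that a rate of p selects exactly p out of 100 values. This is an alternative for illustration, not the code used in the Fault Tolerant Scheduler. The names rand_0_99 and should_inject are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Scale one rand() result from [0,RAND_MAX] down to [0,99]. */
uint8_t rand_0_99(void)
{
    return (uint8_t)(rand() / (RAND_MAX / 100 + 1));
}

/*
 * Return 1 if a fault should be injected for this draw, else 0.
 * draw_0_99 is a value in [0,99], fault_rate_percent a value in [0,100].
 * Exactly fault_rate_percent of the 100 possible draws return 1, so the
 * rates 0 and 100 need no special handling.
 */
int should_inject(uint8_t draw_0_99, uint8_t fault_rate_percent)
{
    return draw_0_99 < fault_rate_percent;
}
```

A call such as should_inject(rand_0_99(), 3) then plays the role of get_fault( get_rand() ). As with any scaling of rand() by integer division, the buckets are only approximately equal in size, which should be good enough for fault injection in tests.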
Fault Detection in the Fault Tolerant Scheduler
We now know how the occurrence of faults is simulated in the Fault Tolerant Scheduler. To run SDR, we also need a way to detect errors.
As described in earlier posts, the user implements the three execution versions of a task. This means that the user implements their own fault detection and correction.
At the end of each detection version, the task checks whether an error occurred (e.g. by executing the calculation in the task twice and then comparing the results). The resulting fault status is then sent to the Fault Tolerant Scheduler: the detection version has to call the function fault_detection_routine(rtems_id id, fault_status fs), which takes care of the error handling. If a fault occurred while executing a detection version with SDR, this function triggers the release of a reliable (error-correction) task version immediately.
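A detection version along these lines could look like the following sketch. To keep it compilable outside RTEMS, it uses stand-ins: the typedefs for rtems_id and fault_status, the order of the enum values, and the stub body of fault_detection_routine are assumptions, and the helpers detection_version, doubled and glitchy are hypothetical. In the real system the types and the routine come from RTEMS and the Fault Tolerant Scheduler.

```c
#include <stdint.h>

/* Stand-ins (assumptions) so that the sketch compiles without RTEMS. */
typedef uint32_t rtems_id;
typedef enum { NO_FAULT, FAULT } fault_status;

/* Records what the stub below was told, for demonstration only. */
fault_status last_reported = NO_FAULT;

/* Stub: the real routine handles the error and, with SDR, releases
 * the correction version of the task. */
void fault_detection_routine(rtems_id id, fault_status fs)
{
    (void)id;
    last_reported = fs;
}

/* Hypothetical type for the calculation inside the task. */
typedef int32_t (*calc_fn)(int32_t input);

/*
 * Detection version: run the calculation twice and compare.
 * *result is only written if both runs agree.
 */
fault_status detection_version(rtems_id id, calc_fn calc,
                               int32_t input, int32_t *result)
{
    int32_t first = calc(input);
    int32_t second = calc(input);
    fault_status fs = (first == second) ? NO_FAULT : FAULT;

    if (fs == NO_FAULT) {
        *result = first;
    }
    fault_detection_routine(id, fs);  /* report the status */
    return fs;
}

/* Example calculation without faults. */
int32_t doubled(int32_t x)
{
    return 2 * x;
}

/* Example calculation that simulates a transient fault: every call
 * returns a different value. */
int32_t glitchy(int32_t x)
{
    static int32_t calls = 0;
    return x + calls++;
}
```

With doubled the two runs agree and NO_FAULT is reported. With glitchy they differ, so FAULT is reported and the result is left untouched. Comparing two runs can only detect an error, it cannot tell which of the two results is the correct one. That is the job of the correction version.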
In the next post I will describe how the dynamic fault-tolerance techniques DRE and DDR work.
----
The video of today is about the James Webb Space Telescope, scheduled to launch in October 2018: