.. Mirror of https://github.com/ARM-software/arm-trusted-firmware.git
   (synced 2025-04-08 05:43:53 +00:00). Latest commit: "Add details about RAS
   error handling philosophies and its implementation. It also updates the
   tests introduced to verify them." Signed-off-by: Manish Pandey
   <manish.pandey2@arm.com>. Change-Id: Iedc8c1e713dad05baadd58e5752df36fa00121a7

Reliability, Availability, and Serviceability (RAS) Extensions
**************************************************************

This document describes |TF-A| support for Arm Reliability, Availability, and
Serviceability (RAS) extensions. RAS is a mandatory extension for Armv8.2 and
later CPUs, and also an optional extension to the base Armv8.0 architecture.

For the description of Arm RAS extensions, Standard Error Records, and the
precise definition of RAS terminology, please refer to the Arm Architecture
Reference Manual and `RAS Supplement`_. The rest of this document assumes
familiarity with the architecture and terminology.

**IMPORTANT NOTE**: The TF-A implementation assumes that if the RAS extension is
present, then FEAT_IESB is also implemented.

There are two philosophies for handling RAS errors from the Non-secure world
point of view:

- :ref:`Firmware First Handling (FFH)`
- :ref:`Kernel First Handling (KFH)`

.. _Firmware First Handling (FFH):

Firmware First Handling (FFH)
=============================

Introduction
------------

EAs and error interrupts corresponding to NS nodes are handled first in firmware:

- Errors are signaled back to the NS world via a suitable mechanism.
- The kernel is prohibited from accessing the RAS error records directly.
- Firmware creates CPER records for the kernel to navigate and process.
- Firmware signals the error back to the kernel via SDEI.

Overview
--------

FFH works in conjunction with `Exception Handling Framework`. Exceptions resulting from
errors in the Non-secure world are routed to and handled in EL3. These errors are Synchronous
External Aborts (SEA), Asynchronous External Aborts (signalled as SErrors), and Fault Handling
and Error Recovery interrupts.

The RAS Framework in TF-A allows the platform to define an external abort handler and to
register RAS nodes and interrupts. It also provides `helpers`__ for accessing Standard
Error Records as introduced by the RAS extensions.

.. __: `Standard Error Record helpers`_

.. _Kernel First Handling (KFH):

Kernel First Handling (KFH)
===========================

Introduction
------------

EAs originating in, or attributed to, the NS world are handled first in NS, and the kernel
navigates the standard error records directly.

- KFH is the default handling mode if the platform does not explicitly enable FFH mode.
- KFH mode does not need any EL3 involvement except for the reflection of errors back
  to the lower EL. This happens when there is an error (EA) in the system which is not yet
  signaled to the PE while executing at a lower EL. During entry into EL3, the errors (EAs)
  are synchronized, causing the async EA to pend at EL3.

Error Synchronization at EL3 entry
==================================

During entry to EL3 from a lower EL, any pending async EAs are either reflected back
to the lower EL (KFH) or handled in EL3 itself (FFH).

|Image 1|

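The decision taken at EL3 entry can be summarised with a small model. This is a
minimal sketch and not TF-A code: the names ``demo_ea_mode``, ``demo_ea_action``,
and ``demo_el3_entry_action()`` are hypothetical, and the real logic lives in the
EL3 exception entry path.

.. code:: c

    #include <stdbool.h>

    /* Hypothetical model of the two handling philosophies. */
    enum demo_ea_mode { DEMO_MODE_KFH, DEMO_MODE_FFH };

    /* Hypothetical outcomes once errors are synchronized at EL3 entry. */
    enum demo_ea_action {
        DEMO_ACTION_NONE,             /* No async EA pending: carry on. */
        DEMO_ACTION_REFLECT_TO_LOWER, /* KFH: hand the EA back to the lower EL. */
        DEMO_ACTION_HANDLE_IN_EL3     /* FFH: handle the EA in EL3 first. */
    };

    /* Pick the action for a (possibly) pending async EA at EL3 entry. */
    enum demo_ea_action demo_el3_entry_action(enum demo_ea_mode mode,
                                              bool async_ea_pending)
    {
        if (!async_ea_pending)
            return DEMO_ACTION_NONE;

        return (mode == DEMO_MODE_FFH) ? DEMO_ACTION_HANDLE_IN_EL3
                                       : DEMO_ACTION_REFLECT_TO_LOWER;
    }

In this model, a pending async EA in KFH mode always yields
``DEMO_ACTION_REFLECT_TO_LOWER``, which mirrors the statement above that KFH
needs EL3 only for reflection.
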
TF-A build options
==================

- **ENABLE_FEAT_RAS**: Enable RAS extension feature at EL3.
- **HANDLE_EA_EL3_FIRST_NS**: Required for FFH.
- **RAS_TRAP_NS_ERR_REC_ACCESS**: Trap Non-secure access of RAS error record registers.
- **RAS_EXTENSION**: Deprecated macro, equivalent to ENABLE_FEAT_RAS and
  HANDLE_EA_EL3_FIRST_NS put together.

RAS internal macros:

- **FFH_SUPPORT**: Gets enabled if **HANDLE_EA_EL3_FIRST_NS** is enabled.

The RAS feature depends on some other TF-A build flags:

- **EL3_EXCEPTION_HANDLING**: Required for FFH.
- **FAULT_INJECTION_SUPPORT**: Required for testing the RAS feature on the FVP platform.

TF-A Tests
==========

RAS functionality is regularly tested in TF-A CI using the `RAS test group`_, which has multiple
configurations for testing lower EL External Aborts.

All the tests are written in TF-A Tests, which runs as the NS-EL2 payload.

- **FFH without RAS extension**

  *fvp-ea-ffh,fvp-ea-ffh:fvp-tftf-fip.tftf-aemv8a-debug*

  A couple of tests, one each for sync EA and async EA from a lower EL, which get handled in EL3.
  They inject External Aborts (sync/async) which trap to EL3. FVP has a handler which gracefully
  handles these errors and returns to TF-A Tests.

  Build configs: **HANDLE_EA_EL3_FIRST_NS**, **PLATFORM_TEST_EA_FFH**

- **FFH with RAS extension**

  Three tests:

  - *fvp-ras-ffh,fvp-single-fault:fvp-tftf-fip.tftf-aemv8a.fi-debug*

    Inject an unrecoverable RAS error, which gets handled in EL3.

  - *fvp-ras-ffh,fvp-uncontainable:fvp-tftf.fault-fip.tftf-aemv8a.fi-debug*

    Inject uncontainable RAS errors, which cause the platform to panic.

  - *fvp-ras-ffh,fvp-ras-ffh-nested:fvp-tftf-fip.tftf-ras_ffh_nested-aemv8a.fi-debug*

    Test nested exception handling at EL3 for synchronized async EAs. Inject an SError in a
    lower EL which remains pending until EL3 is entered through an SMC call. At EL3 entry, on
    encountering a pending async EA, EL3 handles the async EA first (nested exception) before
    handling the original SMC call.

- **KFH with RAS extension**

  A couple of tests in the group:

  - *fvp-ras-kfh,fvp-ras-kfh:fvp-tftf-fip.tftf-aemv8a.fi-debug*

    Inject and handle RAS errors in TF-A Tests (no EL3 involvement).

  - *fvp-ras-kfh,fvp-ras-kfh-reflect:fvp-tftf-fip.tftf-ras_kfh_reflection-aemv8a.fi-debug*

    Reflection of synchronized errors from EL3 to TF-A Tests; two tests, one each for
    reflecting in the IRQ and SMC paths.

RAS Framework
=============

.. _ras-figure:

.. image:: ../resources/diagrams/draw.io/ras.svg

Platform APIs
-------------

The RAS framework allows the platform to define handlers for External Aborts,
Uncontainable Errors, Double Faults, and errors arising from EL3 execution. Please
refer to :ref:`RAS Porting Guide <External Abort handling and RAS Support>`.

Registering RAS error records
-----------------------------

RAS nodes are components in the system capable of signalling errors to PEs
through one of the notification mechanisms: SEAs, SErrors, or interrupts. RAS
nodes contain one or more error records, which are registers through which the
nodes advertise various properties of the signalled error. Arm recommends that
error records are implemented in the Standard Error Record format. The RAS
architecture allows for error records to be accessible via system or
memory-mapped registers.

The platform should enumerate the error records, providing for each of them:

- A handler to probe error records for errors;
- When the probing identifies an error, a handler to handle it;
- For a memory-mapped error record, its base address and size in KB; for a system
  register-accessed record, the start index of the record and number of
  continuous records from that index;
- Any node-specific auxiliary data.

With this information supplied, when the run time firmware receives a
notification through one of these mechanisms, the RAS framework can iterate
through and probe error records for errors, and invoke the appropriate handler
to handle them.

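The probe-then-handle iteration described above can be sketched in isolation.
The following is a simplified, self-contained model and not the framework's
code: ``struct demo_record``, ``demo_probe_and_handle()``, and the two example
callbacks are hypothetical stand-ins, and the handler here omits the error data
argument that real handlers receive.

.. code:: c

    #include <stddef.h>

    struct demo_record;

    /* Stand-in callback types; a probe returns non-zero when it finds an error. */
    typedef int (*demo_probe_t)(const struct demo_record *rec, int *probe_data);
    typedef int (*demo_handler_t)(const struct demo_record *rec, int probe_data);

    /* Stand-in record descriptor; the fields are illustrative only. */
    struct demo_record {
        demo_probe_t probe;     /* Checks the record for an error. */
        demo_handler_t handler; /* Invoked only after a successful probe. */
        void *aux;              /* Node-specific auxiliary data. */
    };

    /*
     * Probe every record in turn and invoke the handler of each record that
     * reports an error. Returns the number of records that reported an error.
     */
    int demo_probe_and_handle(const struct demo_record *recs, size_t count)
    {
        int found = 0;

        for (size_t i = 0; i < count; i++) {
            int probe_data = 0;

            if (recs[i].probe(&recs[i], &probe_data) == 0)
                continue;

            found++;
            (void)recs[i].handler(&recs[i], probe_data);
        }

        return found;
    }

    /* Example probe: treats aux as a pointer to an int error flag. */
    int demo_probe_flag(const struct demo_record *rec, int *probe_data)
    {
        const int *flag = rec->aux;

        if (*flag == 0)
            return 0;

        *probe_data = *flag; /* Pass the flag value on to the handler. */
        return 1;
    }

    /* Example handler: clears the flag to mark the error as handled. */
    int demo_handler_clear(const struct demo_record *rec, int probe_data)
    {
        int *flag = rec->aux;

        (void)probe_data;
        *flag = 0;
        return 0;
    }

A caller would build an array of ``struct demo_record`` entries and pass it to
``demo_probe_and_handle()``; only the entries whose probe reports an error have
their handler invoked.
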
The RAS framework provides the macros to populate error record information. The
macros are versioned, and the latest version as of this writing is 1. These
macros create a structure of type ``struct err_record_info`` from their arguments,
which is later passed to probe and error handlers.

For memory-mapped error records:

.. code:: c

    ERR_RECORD_MEMMAP_V1(base_addr, size_num_k, probe, handler, aux)

And, for system register ones:

.. code:: c

    ERR_RECORD_SYSREG_V1(idx_start, num_idx, probe, handler, aux)

The probe handler must have the following prototype:

.. code:: c

    typedef int (*err_record_probe_t)(const struct err_record_info *info,
                    int *probe_data);

The probe handler must return a non-zero value if an error was detected, or 0
otherwise. The ``probe_data`` output parameter can be used to pass any useful
information resulting from probe to the error handler (see `below`__). For
example, it could return the index of the record.

.. __: `Standard Error Record helpers`_

The error handler must have the following prototype:

.. code:: c

    typedef int (*err_record_handler_t)(const struct err_record_info *info,
               int probe_data, const struct err_handler_data *const data);

The ``data`` constant parameter describes the various properties of the error,
including the reason for the error, exception syndrome, and also ``flags``,
``cookie``, and ``handle`` parameters from the :ref:`top-level exception handler
<EL3 interrupts>`.

The platform is expected to populate an array using the macros above, and register
it with the RAS framework using the macro ``REGISTER_ERR_RECORD_INFO()``,
passing it the name of the array describing the records. Note that the macro
must be used in the same file where the array is defined.

Standard Error Record helpers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The |TF-A| RAS framework provides probe handlers for Standard Error Records, for
both memory-mapped and System Register accesses:

.. code:: c

    int ras_err_ser_probe_memmap(const struct err_record_info *info,
                int *probe_data);

    int ras_err_ser_probe_sysreg(const struct err_record_info *info,
                int *probe_data);

When the platform enumerates error records, for those records in the Standard
Error Record format, these helpers may be used instead of the platform
implementing its own probe handlers. Both helpers above:

- Return a non-zero value when an error is detected in a Standard Error Record;
- Set ``probe_data`` to the index of the error record upon detecting an error.

Registering RAS interrupts
--------------------------

RAS nodes can signal errors to the PE by raising Fault Handling and/or Error
Recovery interrupts. For the firmware-first handling paradigm for interrupts to
work, the platform must set up and register with |EHF|. See `Interaction with
Exception Handling Framework`_.

For each RAS interrupt, the platform has to provide a structure of type
``struct ras_interrupt`` containing:

- Interrupt number;
- The associated error record information (pointer to the corresponding
  ``struct err_record_info``);
- Optionally, a cookie.

The platform is expected to define an array of ``struct ras_interrupt``, and
register it with the RAS framework using the macro
``REGISTER_RAS_INTERRUPTS()``, passing it the name of the array. Note that the
macro must be used in the same file where the array is defined.

The array of ``struct ras_interrupt`` must be sorted in the increasing order of
interrupt number. This allows for fast lookup of handlers in order to service RAS
interrupts.

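The sorting requirement exists because a sorted array can be searched by
bisection. The sketch below illustrates the idea on a simplified stand-in type;
``struct demo_ras_interrupt`` and ``demo_find_interrupt()`` are hypothetical
names, and the ``record_index`` field merely stands in for the pointer to the
error record information.

.. code:: c

    #include <stddef.h>

    /* Stand-in for struct ras_interrupt: only the lookup key is modelled. */
    struct demo_ras_interrupt {
        unsigned int intr_number; /* Interrupt number (sort key). */
        int record_index;         /* Stand-in for the error record pointer. */
    };

    /*
     * Binary search over an array sorted by increasing intr_number.
     * Returns the matching entry, or NULL if the interrupt is not registered.
     */
    const struct demo_ras_interrupt *demo_find_interrupt(
            const struct demo_ras_interrupt *table, size_t count,
            unsigned int intr_number)
    {
        size_t lo = 0;
        size_t hi = count; /* Search the half-open range [lo, hi). */

        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;

            if (table[mid].intr_number == intr_number)
                return &table[mid];

            if (table[mid].intr_number < intr_number)
                lo = mid + 1;
            else
                hi = mid;
        }

        return NULL;
    }

If the array were not sorted, the search could skip over a registered interrupt
and report it as missing, which is why the ordering is a hard requirement rather
than an optimisation hint.
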
Double-fault handling
---------------------

A Double Fault condition arises when an error is signalled to the PE while
handling of a previously signalled error is still underway. When a Double Fault
condition arises, the Arm RAS extensions only require the handler to perform an
orderly shutdown of the system, as recovery may be impossible.

The RAS extensions part of Armv8.4 introduced new architectural features to deal
with Double Fault conditions, specifically, the introduction of ``NMEA`` and
``EASE`` bits to the ``SCR_EL3`` register. These were introduced to assist EL3
software which runs part of its entry/exit routines with exceptions momentarily
masked. In such systems, External Aborts/SErrors are not immediately
handled when they occur, but only after the exceptions are unmasked again.

|TF-A|, for legacy reasons, executes all of its EL3 code with all exceptions
unmasked. This means that all exceptions routed to EL3 are handled immediately.
|TF-A| is thus able to detect Double Fault conditions in software, without needing
the intended advantages of the Armv8.4 Double Fault architecture extensions.

Double faults are fatal: they terminate at the platform double fault handler,
which does not return.

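Detecting a Double Fault in software amounts to noticing that a new error
arrives while an earlier one is still being handled. The sketch below models
that idea with a simple in-progress flag. It is an illustration only and does
not show the mechanism |TF-A| actually uses; all names are hypothetical, and a
real implementation would keep such state per CPU and call a handler that does
not return.

.. code:: c

    #include <stdbool.h>

    /* Hypothetical result of entering the error handling path. */
    enum demo_ea_result {
        DEMO_EA_FIRST_FAULT, /* No handling underway: proceed normally. */
        DEMO_EA_DOUBLE_FAULT /* Handling already underway: fatal condition. */
    };

    /* Hypothetical per-CPU tracking state. */
    struct demo_ea_state {
        bool handling_in_progress;
    };

    /* Called when an error is signalled; flags nested entry as a Double Fault. */
    enum demo_ea_result demo_ea_enter(struct demo_ea_state *state)
    {
        if (state->handling_in_progress)
            return DEMO_EA_DOUBLE_FAULT;

        state->handling_in_progress = true;
        return DEMO_EA_FIRST_FAULT;
    }

    /* Called once the first error has been fully handled. */
    void demo_ea_exit(struct demo_ea_state *state)
    {
        state->handling_in_progress = false;
    }

In this model, a caller that receives ``DEMO_EA_DOUBLE_FAULT`` would hand
control to the platform double fault handler instead of attempting recovery.
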
Engaging the RAS framework
--------------------------

Enabling RAS support is a platform choice.

The RAS support in |TF-A| introduces a default implementation of
``plat_ea_handler``, the External Abort handler in EL3. When ``ENABLE_FEAT_RAS``
is set to ``1``, it first calls the ``ras_ea_handler()`` function, which is the
top-level RAS exception handler. ``ras_ea_handler`` is responsible for iterating
through platform-supplied error records, probing them, and when an error is
identified, looking up and invoking the corresponding error handler.

Note that, if the platform chooses to override the ``plat_ea_handler`` function
and intends to use the RAS framework, it must explicitly call
``ras_ea_handler()`` from within.

Similarly, for RAS interrupts, the framework defines
``ras_interrupt_handler()``. The RAS framework arranges for it to be invoked
when a RAS interrupt is taken at EL3. The function bisects the platform-supplied
sorted array of interrupts to look up the error record information associated
with the interrupt number. The error handler for that record is then invoked to
handle the error.

Interaction with Exception Handling Framework
---------------------------------------------

As mentioned in earlier sections, the RAS framework interacts with the |EHF| to
arbitrate handling of RAS exceptions with others that are routed to EL3. This
means that the platform must partition a :ref:`priority level <Partitioning
priority levels>` for handling RAS exceptions. The platform must then define
the macro ``PLAT_RAS_PRI`` to the priority level used for RAS exceptions.
Platforms would typically want to allocate the highest secure priority for
RAS handling.

Handling of both :ref:`interrupt <interrupt-flow>` and :ref:`non-interrupt
<non-interrupt-flow>` exceptions follows the sequences outlined in the |EHF|
documentation. That is, for interrupts, the priority management is implicit; but
for non-interrupt exceptions, it is explicit, using :ref:`EHF APIs
<Activating and Deactivating priorities>`.

--------------

*Copyright (c) 2018-2023, Arm Limited and Contributors. All rights reserved.*

.. _RAS Supplement: https://developer.arm.com/documentation/ddi0587/latest
.. _RAS Test group: https://git.trustedfirmware.org/ci/tf-a-ci-scripts.git/tree/group/tf-l3-boot-tests-ras?h=refs/heads/master

.. |Image 1| image:: ../resources/diagrams/bl31-exception-entry-error-synchronization.png