The self-healing function aims to detect and localize most failures automatically and applies self-healing mechanisms to resolve several failure classes, for example by reducing the output power after a temperature failure or by falling back automatically to the previous software version.
All areas of the cellular network can display faults from time to time. Many can be overcome without a major problem and in many cases backup hardware may be available.
The following main areas can be considered in self-healing:
1. Self-recovery of software - the ability to return to a previous software version should issues arise.
2. Self-healing of board faults - this often involves redundant circuits where a spare can be switched in.
3. Cell outage detection - it must be possible to remotely detect when there is an issue with a particular cell.
4. Cell outage recovery - routines to assist with cell recovery. These may include detection and diagnosis along with an automatic recovery solution, together with a report of the outcome of the action.
5. Cell outage compensation - methods of maintaining the best service to users while repairs are effected.
6. Return from cell outage compensation - this action, while obvious, needs to be included, as it must be possible to return easily to the pre-fault status by removing any compensation actions that may have been initiated.
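The last two areas, compensation and the return from it, can be illustrated with a minimal Python sketch. It assumes that compensation works by changing a few tunable parameters on neighbor cells and that returning to the pre-fault status means restoring the saved values. The names `NeighborCell`, `OutageCompensation`, `tx_power_dbm` and `handover_offset_db`, as well as the default step sizes, are hypothetical and chosen only for illustration; they do not come from any standard or vendor interface.

```python
from dataclasses import dataclass


@dataclass
class NeighborCell:
    """Hypothetical view of a neighbor cell's tunable parameters."""
    cell_id: str
    tx_power_dbm: float
    handover_offset_db: float


class OutageCompensation:
    """Applies compensation to neighbor cells and remembers how to undo it."""

    def __init__(self):
        # cell_id -> (tx_power_dbm, handover_offset_db) before any change
        self._saved = {}

    def apply(self, neighbors, power_step_db=3.0, offset_step_db=2.0):
        """Widen the neighbors' coverage (illustrative step sizes)."""
        for cell in neighbors:
            # Save the pre-fault settings only once per cell, so repeated
            # calls never overwrite the original values.
            self._saved.setdefault(
                cell.cell_id, (cell.tx_power_dbm, cell.handover_offset_db)
            )
            cell.tx_power_dbm += power_step_db
            cell.handover_offset_db += offset_step_db

    def revert(self, neighbors):
        """Restore the pre-fault settings of every compensated cell."""
        for cell in neighbors:
            saved = self._saved.pop(cell.cell_id, None)
            if saved is not None:
                cell.tx_power_dbm, cell.handover_offset_db = saved


if __name__ == "__main__":
    cells = [NeighborCell("A", 40.0, 0.0), NeighborCell("B", 43.0, 1.0)]
    compensation = OutageCompensation()
    compensation.apply(cells)   # faulty cell is down: neighbors cover for it
    compensation.revert(cells)  # faulty cell repaired: back to pre-fault values
    print(cells)
```

Storing the original values on the first change, rather than subtracting the steps again later, keeps the return path exact even if compensation is applied more than once.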
Fault management and correction require considerable human intervention and should be automated as much as possible, which makes automatic fault identification and self-healing important. The following points are key parts of the solution:
- Automatic fault identification
Equipment faults are normally detected autonomously by the equipment itself. However, fault detection messages cannot always be generated or transmitted when the detection system itself is damaged. Such unidentified eNodeB faults are commonly referred to as sleeping cells, and they are detected through performance statistics.
- Cell outage compensation
When an equipment fault is detected, SON analyzes the internal logs of the equipment, identifies the root cause, and takes recovery actions such as falling back to the previous software version or switching to backup units. When the equipment fault cannot be resolved by these actions, the affected cell and its neighbor cells take cooperative actions to minimize the degradation in user-perceived quality. For example, in an urban area covered by multiple microcells, it is effective to move users from the faulty cell to healthy neighbor cells by cooperatively adjusting the coverage and handover-related parameters of the nearby cells. This results in a reduced failure recovery time and a more efficient allocation of maintenance personnel.
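The two points above can be sketched in Python as follows. This is only an illustration under stated assumptions: a sleeping cell is flagged when a traffic counter collapses relative to the cell's own history, and healing tries the recovery actions in order before falling back to compensation. The function names `is_sleeping_cell` and `heal`, the thresholds `drop_ratio` and `min_baseline`, and the action names are hypothetical; real SON implementations use vendor-specific counters and procedures.

```python
from statistics import mean


def is_sleeping_cell(history, current, drop_ratio=0.1, min_baseline=50.0):
    """Flag a cell whose traffic counter collapses relative to its own past.

    history: counter values from comparable past periods (assumed input).
    current: counter value for the latest period.
    """
    if not history:
        return False  # no baseline, so no verdict
    baseline = mean(history)
    if baseline < min_baseline:
        return False  # a normally quiet cell; low traffic is not suspicious
    return current < drop_ratio * baseline


def heal(cell_id, recovery_actions, compensate):
    """Try recovery actions in order; compensate if none of them succeeds.

    recovery_actions: list of (name, callable) pairs; each callable takes
    the cell id and returns True when the fault is cleared.
    compensate: callable taking the cell id, invoked as the last resort.
    Returns a short report of the outcome.
    """
    attempted = []
    for name, action in recovery_actions:
        attempted.append(name)
        if action(cell_id):
            return {"cell": cell_id, "resolved_by": name,
                    "attempted": attempted, "compensated": False}
    compensate(cell_id)
    return {"cell": cell_id, "resolved_by": None,
            "attempted": attempted, "compensated": True}


if __name__ == "__main__":
    # Placeholder actions that always fail, to show the escalation path.
    actions = [
        ("software_fallback", lambda cell: False),
        ("switch_to_backup_unit", lambda cell: False),
    ]
    if is_sleeping_cell([420, 390, 455], current=3):
        print(heal("cell-17", actions, compensate=lambda cell: None))
```

Comparing a cell against its own history rather than a fixed threshold avoids flagging cells that are simply quiet, and the returned report corresponds to the outcome report mentioned under cell outage recovery.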