Fault Solution Administration Guide Helix 11.1
Confidentiality, Copyright Notice & Disclaimer
Due to a policy of continuous product development and refinement, TEOCO Corporation or a
TEOCO affiliate company (“TEOCO”) reserves the right to alter the specifications,
representation, descriptions and all other matters outlined in this publication without prior
notice. No part of this document, taken as a whole or separately, shall be deemed to be part
of any contract for a product or commitment of any kind. Furthermore, this document is
provided “As Is” and without any warranty.
This document is the property of TEOCO, which owns the sole and full rights including
copyright. TEOCO retains the sole property rights to all information contained in this
document, and without the written consent of TEOCO given by contract or otherwise in
writing, the document must not be copied, reprinted or reproduced in any manner or form, nor
transmitted in any form or by any means: electronic, mechanical, magnetic or otherwise,
either wholly or in part.
The information herein is designated highly confidential and is subject to all restrictions in any
law regarding such matters and the relevant confidentiality and non-disclosure clauses or
agreements issued with TEOCO prior to or after the disclosure. All the information in this
document is to be safeguarded and all steps must be taken to prevent it from being disclosed
to any person or entity other than the direct entity that received it directly from TEOCO.
TEOCO and Helix are trademarks of TEOCO.
All other company, brand or product names are trademarks or service marks of their
respective holders.
This is a legal notice and may not be removed or altered in any way.
COPYRIGHT © 2020 TEOCO Corporation or a TEOCO affiliate company.
All rights reserved.
Your feedback is important to us: The TEOCO Documentation team takes many measures
in order to ensure that our work is of the highest quality.
If you found errors or feel that information is missing, please send your Documentation-
related feedback to Documentation@teoco.com
Thank you,
Table of Contents
What is the Fault Management Solution?
Who Should Use this Guide?
How this Guide is Organized
Additional Reading
Alarm Collection
Network Alarm Collection
Application Alarm Collection
Correlation Alarms
Service Alarms
TrafficGuard Alarms
Alarm Structure
Alarm Management
Alarm Monitoring
Alarm Class Concept
Toggling Alarms
Repeated Alarms
Maintenance Calendar
Schematic Views for FM
GEO Maps for FM
FaultPro
FM Screener
FM Alarms Summary
Anomaly & Trend Information
Alarm Prediction
Site View Display
FM Notifications
Alarm Correlation
Correlator TRS
Correlator ES
Machine Learning Root-cause Analysis (RCA)
Reporting
Alarm Handling
System Description
Engines
FM Engine(s)
FM History
FaM Admin
FM Analytics
Correlators
External APIs
Clients
Cruiser Client
Light Cruiser Monitoring Client
History Analysis Client
Administration Client
Architecture
Active/Active architecture
Apache Kafka and Zoo Keeper
Distributed Cache Architecture
Workflows
Post-Installation Workflow
Displaying the Cruiser System Folder Names in non-English Languages
Post-Upgrade Workflow
Defining the Operator Working Environment
Configuration
Overview
Enrichment Rules
Action Rules
Condition
Modifications/Actions
Delay
Activation Time
Example of Possible Rules
Association Rules
Condition
Activation Time
Toggle Rules
Repeated Rules
Display Rules
Trouble Ticket Integration
Overview
Trouble Ticket Mapping Rules
NeTkT Plugin
GEO Maps
Setting GEO Maps Configuration
Setting Base Configuration Region Coordinates
Map Display Parameters
MapsConfig-project.xml Structure Example
Flooding Protection
The Flooding Algorithm
FamEngine Flood System Properties
Flooding of History Alarms
Flooding in FamProxy
Client Protection from Large Amount of Alarms
FM Screener
User Actions
Historic Investigations
Severity Management
Worklog Management
Project Fields Configuration
Activating Project Fields
Configuring the Display Name of Alarm Fields
Making Alarm Fields Visible
Configuring “Copy the Alarm Fields as Text”
Summary View Configuration
Project Summary View Icons Configuration
FaultPro Configuration
Site View Configuration
Icons Configuration
Tooltip Configuration
Additional Details Configuration
Service Details Configuration
KPI Presentation
Site View Refresh Rate
Anomaly & Trend Configuration
About config.xml
Selecting the PredictiveObjects (for Both Trend and Anomaly)
Defining the HistoryResolution (for Both Trend and Anomaly)
Configuring the Anomaly Learning Phase
Configuring the Score Coloring (for Both Trend and Anomaly)
config.xml File Example
Alarms Prediction Configuration
Offline
Online
ServiceImpact Configuration
Recognizing PM Entity Name in Alarms
Maintenance Calendar Configuration
Maintenance Calendar Architecture
DB Plug-in Configuration
Maintenance Calendar Module Configuration
FamMaintenace Module Configuration
Machine Learning Root Cause Analysis (RCA) Configuration
Learning
Learning Investigations
Run-Time
Correlation Graph
Opening Clients
Opening FM Cruiser from External Applications
Opening FM History from External Applications
Maintenance
Verifying that All Components are Running
J2EE Components
FM Services
Running FM Modules
Checking the System Queues
Checking the Memory Consumption
History Table Partitioning
TEOCO Monitor
Troubleshooting
Log Files
J2EE Server and Client Log Files
FM Services Log Files
Server Troubleshooting
Server Components Are Up and Functioning
Data Loss and Restart
Alarms Display/Update Is Delayed
History Data Is Delayed
Insufficient Oracle Connections
Hazelcast Disconnections
Client Troubleshooting
The Installation Starts and then Fails and an Error Message Appears
The Installation States that an Old Installation is Interfering with the Installation
The .Net Framework Installation Fails
The Application Starts but Fails with a ‘Could not initialize’ Message
The Application Starts but Fails with an ‘Error Installing application’ Message
The Application Starts but Some Operations are not Available
Drop-down and Context Menus are Displayed Behind the Main Window
Cruiser Shows ‘Disconnected’ Status
Delay in Display/Update of the Alarms
Statistics
FM Module Statistics
Client Performance Considerations
Appendix A: Active Alarm Attributes
Appendix B: History Alarm Attributes
Appendix C: Project Active Alarm Attributes
Appendix D: Modules Configurable Properties
FamAdmin
FamEngine
FamHistory
JFam
FamProxy
FamAnalytics
WinFam (Cruiser Client)
FaMAdminModule (FM Admin Client)
HistoryAnalisysModule (FM History Client)
What is the Fault Management Solution?
Note: To prevent conflicts, we recommend that only one administrator/system integrator
modify the settings at a time. For example, if two users modify the same rule at the same
time, the operation that finishes last takes effect and the other is discarded without any
warning.
Additional Reading
For administration tasks specific to J2EE modules, refer to the Helix Administration Guide.
Alarm Collection
Alarm collection can be divided into two main types:
Network alarms
Application alarms
Service Alarms
ServiceImpact is a Service Management product that can integrate with the Fault
Management Solution. ServiceImpact performs further analysis and abstraction of alarms by
relating them to end-to-end services such as customer line, data service, or IPTV. This
capability enables service providers to prioritize restoration procedures based on the type of
affected services and customers rather than on the type of impacted network resource. The
analysis is based on the relationship between services and equipment as described in the
Base Configuration module and the alarms’ contents. The ServiceImpact module shows faulty
service details, including the impact on the service and customers. The ServiceImpact module
generates service alarms which are displayed in the FM module.
TrafficGuard Alarms
TrafficGuard is a Performance Management (PM) product. It provides enhanced threshold
capabilities based on existing performance data. Whenever threshold conditions are
breached, TrafficGuard generates a Threshold Crossing Alarm which can be sent to FM and
viewed by operators.
Alarm Structure
The available fields of active alarms are listed in Appendix A: Active Alarm Attributes. In
addition, there is a set of fields intended for project-specific use. These fields can be
populated by a project-specific Mediation library, by Enrichment Rules, or by project-specific
logic. Refer to Project Fields Configuration for more information.
Alarm History extends active alarms with additional fields, which are listed in Appendix B:
History Alarm Attributes.
The fields can be customized (that is, their labels or descriptions can be changed) by editing
ProjectActiveAlarm.xml and ProjectHistoryAlarm.xml in the project metadata of the JFam
module. Refer to the Configuration section for more information.
Alarm Management
Alarm management includes the following:
Alarm Monitoring
Alarm Correlation
Reporting
Alarm Handling
Alarm Monitoring
The FM client displays alarms to users in real time. In the Cruiser, it notifies operators about
alarms that are raised or cleared. The basic FM displays are also available, with limited
capabilities, in the Light Cruiser application. Operators can view additional information about
an alarm in the alarm details, and can even view the alarm’s raw data if it is received from the
network element. Audible notification is also available. In addition, FM can send emails or
SMS messages about important alarms, using the FM Notification mechanism.
To reduce the number of alarms that the operator has to handle, FM detects repeated alarms
(sequential UP events for the same logicID) and toggled alarms (UP-DOWN event sequences
for the same logicID that recur frequently) and hides them from the operator. In addition,
various actions can be automated through the rules available in the FM Admin module.
Once alarms are received and displayed, users can investigate them further and handle them
by acknowledging, deferring, or clearing them. Cleared alarms can stay visible in monitoring
clients for a predefined period of time before they are removed. All alarms (active and
cleared) and messages received are stored in a historical database that can be accessed to
produce historical reports using the History Analysis tool. All events and actions applied to a
specific alarm can be investigated through the Event Log display.
FM has a bidirectional connection with NeTkT, TEOCO’s trouble-ticket product, which allows
creating a new ticket for an alarm or appending the alarm to an existing ticket. Integration
with other trouble ticket systems is also available.
Toggling Alarms
The Alarm Toggling feature reduces the number of instances created for flipping alarms.
If X (by default 3) or more instances of the same alarm are raised and cleared within Y
minutes (by default 10), the alarm is marked as “toggle”. The first 2 instances (assuming
X = 3) of the alarm are treated as regular ones, but the third one remains active (with a
“toggle” mark) regardless of its CLEAR event, and the following alarm instances are ignored.
That is, the following alarm instances “belong” to the third instance and are not treated as
separate instances. By default, data from toggling instances is not copied to the “hosting”
alarm, but this behavior can be changed through Toggle Rules.
The alarm remains toggled until there is a “silence” period of Z minutes (by default 15), that
is, a period with no UP or DOWN events for this alarm. If the last event was UP, the third
instance remains in toggle state as an active instance. Otherwise, the instance is treated as
cleared.
The fourth and subsequent instances are not shown in history as separate instances. All
toggle events can be seen in the history log of the third instance. Note, however, that the
server merges all toggling events (per alarm) that occur within the same second, so some
events may be missing from the history. The buffering time can be configured through
FamEngine’s toggleRepeat.bufferingTime property.
Toggle parameters can be configured through FamEngine server properties (see
FamEngine). It is also possible to change the configuration for a certain group of alarms
through the Toggle Rules in the FM Admin application.
Sequence Example
First alarm instance is raised at 10:00 and cleared at 10:01—regular instance
Second instance is raised at 10:02 and cleared at 10:03—regular instance
Third instance is raised at 10:04—FM recognizes that the two previous instances
occurred less than 10 minutes ago and marks this instance as toggled
The third instance is cleared at 10:05—the instance remains active in toggle state
Fourth instance is raised at 10:06 and cleared at 10:07—no new instance is created.
The user still sees the third instance as active
Fifth instance is raised at 10:08 and cleared at 10:09—no new instance is created. The
user still sees the third instance as active
At 10:24 (15 minutes afterward)—the toggle mark is removed from the third instance
and the user sees it as a regular cleared instance
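The toggle behavior described in this section can be sketched in a few lines of Python. This is only an illustrative model, not the FamEngine implementation: the class and function names (Instance, ToggleTracker, hm) are hypothetical, times are minutes of the day, and the rule "mark an instance as toggled when it is raised and X - 1 earlier instances were raised within the last Y minutes" is an interpretation taken from the sequence example above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Instance:
    raised: int                # minute of day at which the instance was raised
    active: bool = True
    toggled: bool = False
    absorbed_events: int = 0   # UP/DOWN events folded into this instance


@dataclass
class ToggleTracker:
    """Hypothetical model of the instances of one alarm (one Logic ID)."""
    x: int = 3    # number of instances that triggers the toggle state
    y: int = 10   # detection window, in minutes
    z: int = 15   # silence period, in minutes
    instances: List[Instance] = field(default_factory=list)
    current: Optional[Instance] = None
    last_event: Optional[Tuple[int, str]] = None   # (minute, "UP" or "DOWN")

    def tick(self, now: int) -> None:
        """After z silent minutes, clear a toggled instance whose last event was DOWN."""
        cur = self.current
        if cur is None or not cur.toggled or self.last_event is None:
            return
        minute, kind = self.last_event
        if now - minute >= self.z and kind == "DOWN":
            cur.toggled = False
            cur.active = False
            self.current = None
        # If the last event was UP, the instance stays active in toggle state.

    def on_event(self, now: int, kind: str) -> None:
        self.tick(now)
        cur = self.current
        if cur is not None and cur.toggled:
            cur.absorbed_events += 1          # no new instance is created
        elif kind == "UP" and cur is None:
            recent = [i for i in self.instances
                      if not i.active and now - i.raised < self.y]
            cur = Instance(raised=now, toggled=len(recent) >= self.x - 1)
            self.instances.append(cur)
            self.current = cur
        elif kind == "DOWN" and cur is not None:
            cur.active = False                # a regular instance is cleared
            self.current = None
        self.last_event = (now, kind)


def hm(hour: int, minute: int) -> int:
    """Convert a clock time to minutes of the day."""
    return hour * 60 + minute


# Replay the sequence example: five raise/clear pairs between 10:00 and 10:09.
tracker = ToggleTracker()
events = [(hm(10, m), "UP" if m % 2 == 0 else "DOWN") for m in range(10)]
for minute, kind in events:
    tracker.on_event(minute, kind)
```

After the replay, the tracker holds three instances, and the third one is still active and marked as toggled. Calling tracker.tick(hm(10, 24)) then removes the toggle mark and clears it, which mirrors the last step of the sequence example.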
Repeated Alarms
When an alarm is raised and another alarm with the same Logic ID is already active, the new
alarm instance is considered a Repeated alarm. Repeated alarms are automatically
suppressed by FM and do not appear as new rows in the Active Alarms display.
The data of the Repeated alarm is copied to the original alarms unless dictated otherwise by
Repeated Rules. The original alarm stores information about the number of occurred
repeated alarms and time of the last occurrence.
All the repeated events can be seen in the history log of the alarm. Note, however, that the
server unifies all repeated events (per alarm) that occur within the same second, so
individual events may be missing from the history. The buffering time can be configured
through FamEngine’s toggleRepeat.bufferingTime property.
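The repeated-alarm behavior can be sketched in Python as follows. This is a minimal illustration and not the FM implementation. The ActiveAlarm class, the raise_alarm function, and the copy_data flag (standing in for the decision made by Repeated Rules) are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActiveAlarm:
    logic_id: str
    data: dict
    repeat_count: int = 0                      # number of repeated alarms so far
    last_repeat_time: Optional[float] = None   # time of the last occurrence


def raise_alarm(active, logic_id, data, when, copy_data=True):
    """Return True if a new row is created, False if the alarm is a repeat.

    `active` maps Logic ID to the original active alarm. `copy_data` is an
    assumed stand-in for what the Repeated Rules dictate.
    """
    original = active.get(logic_id)
    if original is None:
        active[logic_id] = ActiveAlarm(logic_id, dict(data))
        return True
    # Same Logic ID is already active: suppress the new row, update the original.
    original.repeat_count += 1
    original.last_repeat_time = when
    if copy_data:
        original.data.update(data)
    return False


active = {}
first = raise_alarm(active, "LINK-7", {"severity": "Minor"}, when=100.0)
second = raise_alarm(active, "LINK-7", {"severity": "Major"}, when=105.0)
```

In this sketch the second call creates no new row. The original alarm records one repeat, the time of that repeat, and the copied field values.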
Maintenance Calendar
Planned maintenance is part of a communication service provider's routine operations and
includes activities such as fixing network problems, element maintenance, and network
element upgrades. Planned maintenance activities can create many FM alarms that do not
indicate actual problems.
The Maintenance Calendar feature helps NOC operators handle planned maintenance
activities and special event alarms by displaying relevant alarm-maintenance information.
To configure this feature, refer to Maintenance Calendar Configuration.
FaultPro
FaultPro is an optional add-on that helps telecom service providers achieve a high level of
NOC efficiency by providing automatic problem correction. It is designed to solve problems
automatically (or semi-automatically), freeing the NOC personnel from having to deal with
them.
FaultPro operates in the following modes:
Automatic Mode—network commands and scripts are activated via automation rules
that meet predefined conditions. The scripts and commands are developed in the
Mediation layer’s NCI module.
Manual Mode—the Send Network Commands module can be accessed from the
Alarm Monitoring application for manually activating commands, scripts, Telnet
sessions to devices, and so on. The list of available commands and scripts is based on
the alarming network element and on the conditions defined in the association rules.
FM Screener
FM Screener is an optional feature that increases operational efficiency by reducing the
number of alarms that NOC operators have to manage. The FM Screener module analyzes
which alarms are considered unnecessary and automatically marks them as SPAM. In
addition, it gives end users and the system administrator control over the list of SPAM
alarms: they can easily add or remove SPAM indications from Cruiser and/or from the
administrator's GUI.
FM Alarms Summary
The Alarms Summary master mode provides the NOC user with a summary visualization of
the network status. The summary can be displayed for any folder and can therefore be
adjusted for each use case. The summary criteria are configurable and are based on
attributes of the network element instances, for example, types, vendors, and geographic
location. The summary includes color visualization of the network status.
The summary view information can be displayed in both gallery (icon) and list (grid) view.
There are predefined icons in the gallery view that can be configured by the administrator.
In addition, the Alarms Summary provides displays of the alarm distribution by selected alarm
attributes.
If ServiceImpact is installed and the user is permitted to use it, service and customer displays
are available.
Alarm Prediction
Alarm Prediction is a tool that predicts network failures and alerts about them based on an
advanced machine-learning algorithm.
The algorithm scans the alarms history and builds a model that can predict the failure before it
occurs. A mathematical likelihood score is assigned to each predicted alarm and the ones
that receive a high likelihood score are triggered and presented in Cruiser for the NOC
engineers to investigate.
This prediction algorithm is completely network agnostic and fully automated. The tool works
on network data and does not require any hard logic implementation using rules or external
reference data.
For information about configuring the Alarms Prediction options, see Alarms Prediction
Configuration.
FM Notifications
The FM Notification mechanism enables you to notify specified users by email or SMS about
network changes that are reflected in the FM system. The notification mechanism is built from
the following main functionalities:
Notification contacts and groups—managed in the TEOCO Admin GUI and list the
available users and groups to send notification to. Contacts and groups can be
migrated from the Helix user list or the operator's organization LDAP system. For more
information, see the Notification Mechanism chapter in the TEOCO Admin User Guide.
Notification templates—managed in the FM Admin GUI and provide the ability to
create notification emails or SMS templates. The template can contain a placeholder
for any alarm field. For more information, see the Fault Administrator User Guide.
Notification rules—managed in the FM Admin GUI using the Action rule definition and
provide the ability to define the exact criteria, template to use, and users/groups to
send notifications to. Using action rules you can also send notification to an ad-hoc
user that is not listed in the Notification contact list. For more information, see the Fault
Administrator User Guide.
General mail configuration—managed at the infrastructure level. Refer to the Mail
Server Configuration and SmsByMail Service Configuration chapters in the Helix
Administration Guide.
To control the sender’s e-mail and name that will appear in sent mails, refer to the FamEngine
notification.rule.sender.email and notification.rule.sender.name properties.
Alarm Correlation
The Fault Management product offers several modules for identifying the root cause of
network failures. These modules significantly reduce the volume of alarms that network
operators have to manage, and significantly shorten the time required to figure out what went
wrong in the network.
Correlator TRS
Correlator TRS is an optional topology-based Reasoning System that provides a probabilistic
topology-based root cause analysis. It uses the network’s topology and probabilities to identify
the root cause of alarms. It is capable of making correct decisions even when some alarms
arrive late.
Correlator ES
Correlator ES is an FM add-on that uses If-Then type business rules to identify the root cause
of alarms. It uses correlation rules to analyze a group of alarms and identify the root cause
“parent” alarms, which reflect actual faults and require fixing, and symptomatic child alarms
that are secondary reactions to the primary faults, and as such do not require any action.
Correlator ES creates derived alarms when no alarm in the group adequately describes the
root-cause and suppresses false alarms generated as a result of maintenance activities.
Reporting
FM Reporter is an optional product that enables service providers to easily access web-based
reports that provide a detailed view of current and historical alarms. It also enables the users
to detect critical problems and developing trends, and take proactive actions before these
events escalate into a crisis. It includes predefined reports and enables the user to create
customized reports.
Alarm Handling
FM offers the following options for handling alarms:
Opening a trouble ticket using the NeTkT product. Integration with other Trouble Ticket
management systems is also available.
Sending commands to network elements, using the FaultPro module, which is part of
the FM product suite.
Marking alarms as SPAM/Premium (using FM Screener).
Changing the internal state of the alarms using Acknowledge and Defer commands.
Adding comments (using work logs) to the alarm.
Creating manual parent/child correlation between alarms.
Note: Some of these Helix options can be automated through Action Rules.
System Description
FM is based on two main layers: Engines and Clients.
[Figure: FM layers. Clients: Monitoring Clients and Administration Clients (HTTP). Engines:
Fam Engine (J2EE), Fam History (J2EE), Fam Admin (J2EE), Correlation Engines (N2/J2EE),
External APIs (J2EE/N2). Connected systems: Mediation (alarms), PM, Service Impact,
Trouble Tickets (NeTkT).]
Engines
FM Engine(s)
FM Engine is the major component responsible for the handling and distribution of alarms
(including communicating with the Mediation layer that in turn communicates with the
network), manual and automatic alarms command execution, mail/SMS notifications, and
many other activities.
To improve scalability and performance of the FM system it is possible to install multiple FM
Engines that will divide the work between them.
FM Engine is a J2EE module that must be deployed in its own EAR. For more information
about J2EE deployment and configuration, see the Helix Administration Guide.
FamProxy is a supplement to FM Engine, providing infrastructure for developing FM
applications. It is installed automatically in the required EARs.
JFam module is an additional automatically installed supplement.
FM History
The FaM History (J2EE) module is responsible for the persistence of history data and events
in the database.
FaM Admin
FaM Admin (J2EE) is the server-side component responsible for the administration services.
FM Analytics
FM Analytics is an optional module responsible for Analytics Predictive Information
calculation.
Correlators
There are three optional correlation engines:
Correlator TRS—based on N2 technology, supplied as part of the FaM API Service
module.
Correlator ES (drools)—based on RedHat BRMS, “ES” module.
Correlator RCA—“FamRCA” module.
External APIs
There are additional modules that provide capabilities of alarm information communication
with external systems. Available protocols are SNMP, message bus (JMS), and web services.
Clients
Cruiser Client
Cruiser is the Helix Fault Management client. It leverages intelligent event-processing
capabilities, advanced Fault Management concepts, and a new telecom-oriented graphical
interface to create the most comprehensive and robust Fault Management solution. Cruiser
enables users to efficiently identify, monitor, and resolve network incidents detected in hybrid
and Next Generation communication networks. The intuitive graphical user interface
streamlines quick problem resolution by providing a consolidated, highly filtered, and
prioritized view of network faults.
The Cruiser Monitoring client is composed of the following modules:
FamShell
WinFam
MapsModule (optional)
FaultProModule (optional)
Administration Client
The FM Admin client enables administrators to perform the administration tasks that are
required to configure the Fault Management solution to best meet the alarm monitoring
requirements.
The application offers the following main functions: alarm rule creation, Trouble Ticket
Mapping rule definition, FM Configuration, and TRS Correlation rule definition.
The Administration client is composed of the following modules:
FamAdminShell
FamAdminModule
Architecture
The following diagram provides a detailed data flow between server-side components:
[Figure: data flow between server-side components. Labels: Mediation, FM Engine, FM Proxy,
FM History, FM Admin, Distributed FM Data Cache (Hazelcast), Config DB, Historic Alarms,
Active alarms model, FM Events, FM Data.]
Active/Active architecture
To improve scalability, performance, and fault tolerance of the FM system, it is possible to
install multiple instances of FM Engine that will divide the work between them.
The system survives a crash of FamEngine instances as long as at least one instance
continues to work. Note, however, that some events that were being processed by a crashed
FamEngine instance will be lost.
The relevant trouble ticket plugin (when one exists) should be installed on every FamEngine EAR.
Workflows
Post-Installation Workflow
The following workflow defines the post-installation steps required to configure the Fault
Management solution.
1. Install all the required Fault Management solution components. Refer to the Helix
Server Installation Guide.
2. Define the library list and activate the library. See the Fault Solution Implementation
Guide for details.
3. Configure the GUI labels and tooltips.
4. Configure FM.
5. Define the users, groups, and roles in the TEOCO Admin application. See the
TEOCO Admin User Guide for details.
6. Define the alarm classes.
7. If necessary, define the project roles in the TEOCO Admin application. See the
TEOCO Admin User Guide for details.
8. Map the alarm classes to user groups. See the TEOCO Admin User Guide for details.
9. (Optional) Define the NCI Commands. See the NCI2 Admin User Guide for details.
10. (Optional) Complete the Locale Configuration for Projects Displaying UI in non-
English languages.
11. Define the users' working environment (such as folders) for the Cruiser and FM
History applications.
12. (Optional) If NeTkT is installed, integrate the FM system with NeTkT. See the NeTkT
Integration Guide for details. If another Trouble Ticket system is used, perform the
necessary steps to integrate with that system.
13. Verify that all required components are running.
14. Define the operators' working environment.
15. (Optional) Validate that the system is functioning properly by using the bench sim
utility (alarm simulator).
Note: To prevent corrupted text, configure the Oracle client to use the same character set
as the database.
Post-Upgrade Workflow
The following workflow defines the post-upgrade steps required to configure the Fault
Management solution.
1. Check that all the prerequisites are installed on the client and server. See the Helix
Server Installation Guide.
2. Upgrade all the required Fault Management solution components. See the Helix
Server Installation Guide.
3. The following notes are relevant for projects upgrading from versions prior to 8.0:
a. Due to a major change in the FM architecture, the existing project metadata files
may not work. They should be sent to TEOCO S&D for revision. Files supplied
together with the JFam release may be used as a temporary solution until
TEOCO’s recommendation is received.
b. Raise rules and Automation rules were merged into unified Action Rules. While
migration is automatic, we recommend revising the migrated rules.
c. Some alarm fields were removed or made invisible. We recommend revising rules
in FM Admin, folders in Cruiser, and saved queries in FM History. If they use
removed or invisible fields, change the rules to use valid fields.
d. Hook functions of the alarm handler do not exist anymore. Their logic should be
reimplemented using existing means of FM. For example, using Enrichment,
Repeated, and Toggling rules.
e. Alarm Handler Prefs are deprecated. Their values are taken into account during
the upgrade, but from this version onwards, the entire configuration is defined
through the FamEngine properties.
4. Update the library list (if required) and activate the library. See the Fault Solution
Implementation Guide for details.
5. Check and adjust the FM configuration.
6. If required, define new alarm classes and map them to BP classes in TEOCO Admin.
See the TEOCO Admin User Guide for details.
7. Define NCI Commands if required. See the NCI2 User Guide for details.
8. If required, fine-tune the integration between FM and NeTkT.
9. Verify that all required components are running.
This feature enables you to open the FM History display from external applications. It is done
by opening a URL using the appropriate parameters.
The URL prefix is:
http://[your server name]:[port]/FaMHistoryShell/FaMShellActivator.jsp?
The URL parameters are:
Parameter     Type     Description
field         string   The name of the alarm field to filter by (when filtering by a
                       single field).
value         string   The value of the alarm field to filter by (when filtering by a
                       single field).
timecriteria  string   Relative time in the format H/D/W/M<N>, where H=Hours, D=Days,
                       W=Weeks, M=Months, and <N> is the number of units.
                       For example, W10 indicates 10 weeks.
allparents    boolean  Determines whether to open the Correlation Tree window or just
                       filter by the following parameters.
                       Set to true to open the Correlation Tree window.
                       Set to false to filter the records by the following parameters
                       without opening the Correlation Tree window.
                       Omit this parameter to filter by a single field/value (for
                       backward compatibility).
LogicID       string   The value of the LogicID field of the alarm to filter by.
ObjectID      int      The value of the ObjectID field of the alarm to filter by.
ObjectType    int      The value of the ObjectType field of the alarm to filter by.
Example:
http://dc50-dev-helix91:3600/FaMHistoryShell/FaMHistoryShellActivator.jsp?activate=True&PCStatus=PARENT&ObjectID=123456&ObjectType=78&timecriteria=W10&LogicID=comcast_test_3&allparents=false&DateTimeUp=20/03/2017 16:43:31.092
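An external application could assemble such a URL as in the following Python sketch. It is illustrative only. The host name and port are placeholders, and the history_url function is hypothetical. The page name defaults to the one used in the example URL; the URL prefix above spells it differently, so it is kept as a parameter. Whether the server expects percent-encoded values is an assumption.

```python
import re
from urllib.parse import urlencode

# Relative time format described above: one of H/D/W/M followed by a number.
TIMECRITERIA = re.compile(r"^[HDWM]\d+$")


def history_url(server, port, params, activator="FaMHistoryShellActivator.jsp"):
    """Build an FM History URL from a dict of URL parameters.

    `activator` is the JSP page name; the default follows the example URL.
    """
    timecriteria = params.get("timecriteria")
    if timecriteria is not None and not TIMECRITERIA.match(timecriteria):
        raise ValueError("timecriteria must look like H12, D3, W10 or M2")
    # urlencode percent-encodes reserved characters and keeps parameter order.
    query = urlencode(params)
    return f"http://{server}:{port}/FaMHistoryShell/{activator}?{query}"


url = history_url(
    "fm-host.example",   # placeholder server name
    3600,                # placeholder port
    {"LogicID": "test_3", "timecriteria": "W10", "allparents": "false"},
)
```

Validating timecriteria before building the URL catches malformed values, such as 10W, in the calling application.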
Configuration
Overview
The FM system can be configured in the following ways:
FM Admin GUI.
TEOCO Admin GUI.
FaM Engine and other J2EE modules can be configured by changing module
properties in the appropriate jcore_cfg.xml file. The Modules Configurable Properties
chapter details the available module properties. Refer to the Helix Administration
Guide for more details. Changes take effect after the relevant WebLogic server (EAR)
is restarted.
Certain configurations require changing project metadata files.
Notes:
Enrichment Rules
Enrichment rules are a powerful tool for populating or changing the alarm data at any stage
of the alarm life cycle. It is possible to define several rules, each serving its own set of
alarms. The entire configuration is performed using the FM Admin GUI.
Each rule has the following properties:
Condition
A rule is applied only to alarms that match the criteria. Criteria can refer to any alarm field
conditions, combined with nested logical “AND”/“OR”/“NOT” operators. Rules are triggered
only on the events specified in the condition, such as Acknowledge, trouble ticket creation,
parent/child connect/disconnect, and so on.
In addition, a JavaScript expression (including Mediation Lookups) can be used to define the
criteria.
Change Alarm Fields Values
Enrich the alarm by setting or changing the alarm fields with new, updated information.
Lookups and JavaScript can be used to populate the alarm fields.
Modify Alarm Class
Change the alarm class of the alarm.
Activation Time
Defines the date/time period during which the rule is active. It is possible to define start
and end dates and/or weekdays and/or hours of the day.
Example of Possible Rules
Update the Addition Info field with the name of the technician in charge of the alarmed site.
For more information, refer to the Fault Administrator User Guide.
Action Rules
Action rules are a powerful tool for triggering any action at any stage of the alarm life
cycle. It is possible to define several rules, each serving its own set of alarms. The entire
configuration is performed using the FM Admin GUI.
Condition
A rule is applied only to alarms that match the criteria. Criteria can refer to any alarm field
conditions, combined with nested logical “AND”/“OR”/“NOT” operators. Rules are triggered
only on the events specified in the condition, such as Acknowledge, trouble ticket creation,
parent/child connect/disconnect, and so on.
In addition, a JavaScript expression (including Mediation Lookups) can be used to define the
criteria.
Starting from version 8.0, the behavior of the “duration” condition has changed. The duration
of the alarm (the period from the alarm UP time) is checked once at the time of the rule
evaluation. If the alarm duration does not match the condition, the rule is rejected.
Modifications/Actions
The following actions can be applied to the alarm:
Acknowledge/Undo Acknowledge—change of the alarm internal status, usually
means that alarm was noticed by the operator.
Create/Disconnect trouble ticket.
Reject alarm—alarm will be ignored by the system with no further tracking.
Inhibit alarm—alarm will not be shown in the monitoring clients, but will be tracked in
the system.
Apply association—copy work logs and trouble tickets from the previous alarm
instance if it was cleared within X minutes (X is defined in the rule), that is, if the
previous instance is close to the current one. Copying trouble tickets means that the new
alarm instance is appended to the trouble tickets of the previous alarm instance.
Do not send to Correlation—alarm will not be sent to a correlation system.
Create trouble ticket for the alarm.
Defer alarm—change of the alarm internal status, usually means that the alarm will be
handled later.
Apply escalation—alarm severity will be raised automatically if alarm is not
acknowledged or cleared within X (defined in rule) minutes.
Defer/Undo Defer—'snooze' mechanism. The alarm will be in deferred status for the
specified amount of time.
Alarm Down—clear the alarm.
Prioritize—raise the alarm priority.
Create worklog.
Run NCI command.
Notification—send email/SMS for specified users.
Delay
Alarm actions can be delayed for a specific amount of time. The action is performed at the
end of the period if the alarm is still active and still matches the rule criteria. For
example, a rule can create a trouble ticket 10 minutes after the alarm was raised, and only
if the alarm is still active.
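The delay check can be illustrated with the following Python sketch. It is a simplified model and not the FM scheduler. The Alarm class, the due_action function, and the returned strings are hypothetical, and times are plain minute counts for brevity.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    active: bool
    severity: str


def due_action(alarm, rule_matches, raised_at, delay, now):
    """Decide what to do with a delayed action (all times in minutes).

    Returns "wait" before the delay has elapsed, "run" if the alarm is still
    active and still matches the rule criteria, and "skip" otherwise.
    """
    if now < raised_at + delay:
        return "wait"
    if alarm.active and rule_matches(alarm):
        return "run"
    return "skip"


def is_critical(alarm):
    # Hypothetical rule criteria used only for this illustration.
    return alarm.severity == "Critical"


still_up = Alarm(active=True, severity="Critical")
cleared = Alarm(active=False, severity="Critical")

early = due_action(still_up, is_critical, raised_at=0, delay=10, now=5)
on_time = due_action(still_up, is_critical, raised_at=0, delay=10, now=10)
too_late = due_action(cleared, is_critical, raised_at=0, delay=10, now=10)
```

Here the three results are "wait", "run", and "skip": the ticket is created only for the alarm that is still active when the 10 minutes have passed.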
Activation Time
Defines the date/time period during which the rule is active. It is possible to define start
and end dates and/or weekdays and/or hours of the day.
Association Rules
Association rules enable you to “associate” an alarm with programs, web links, and NCI
commands. Invocation parameters are defined using JavaScript expressions that refer to
alarm field values and Mediation Lookup results.
The Cruiser user will be able to execute programs and commands associated with the alarm
using the right-click menu. This differs from action rules that are executed automatically by
the system.
Programs are executed on the user's local PC and therefore must be properly installed and
configured there.
Condition
The rule is applied only to alarms that match the criteria. Criteria can refer to any alarm
field conditions, combined with nested logical “AND”/“OR”/“NOT” operators.
Activation Time
Defines the date/time period during which the rule is active. It is possible to define start
and end dates and/or weekdays and/or hours of the day.
Toggle Rules
The Toggle rules enable you to change the toggling alarm parameters (such as Toggle On,
Toggle Off, and Toggle Depth) and decide which alarm fields should be updated in each
toggling alarm instance.
The rule is applied only to alarms that match the criteria. Criteria can refer to any alarm
field conditions, combined with nested logical “AND”/“OR”/“NOT” operators.
Toggle rules are always defined as Blocking. This means that when a rule is executed, it
prevents the execution of the remaining rules with the same criteria.
For more information, see the Fault Administrator User Guide.
Repeated Rules
The Repeated rules enable you to define whether to update the alarm fields with the
repeating alarm’s fields.
The rule is applied only to alarms that match the criteria. Criteria can refer to any alarm
field conditions, combined with nested logical “AND”/“OR”/“NOT” operators.
Repeated rules are always defined as Blocking. This means that when a rule is executed, it
prevents the execution of the remaining rules with the same criteria.
For more information, see the Fault Administrator User Guide.
Display Rules
Display rules are a powerful tool for setting special display attributes for selected groups
of FM alarms, so that they are displayed using italic, underscore, and/or different text
and/or background colors. This enables the NOC operators to easily notice these special
alarms.
The rule is applied only to alarms that match the criteria. Criteria can refer to any alarm
field conditions, combined with nested logical “AND”/“OR”/“NOT” operators.
In addition, a JavaScript expression (including Mediation Lookups) can be used to define the
criteria.
When the alarm matches several rules, the coloring instructions are unified. In case of
conflict, the later rule overwrites the previous instructions.
For more information, see the Fault Administrator User Guide.
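The way display instructions from several matching Display rules are unified can be pictured with the following Python sketch. It is only an illustration. The attribute names and the merge_display function are hypothetical, and the rules are assumed to be supplied in evaluation order.

```python
def merge_display(matching_rules):
    """Unify the display attributes of all matching rules.

    `matching_rules` is a list of attribute dicts in evaluation order.
    On conflict, the later rule overwrites the earlier instruction.
    """
    merged = {}
    for attributes in matching_rules:
        merged.update(attributes)
    return merged


style = merge_display([
    {"italic": True, "text_color": "red"},        # earlier rule
    {"underline": True, "text_color": "blue"},    # later rule wins on text_color
])
```

In this sketch the alarm is displayed in italic and underlined, with the text color taken from the later rule.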
The following operations exist:
Create new ticket for the alarm.
Append the alarm to an existing ticket (chosen by the user).
Disconnect the alarm from the ticket (the ticket that was created for the alarm or to which
the alarm was appended).
View the ticket details in the TT system.
Fetch tickets from the TT system based on certain criteria. For example, when a user
chooses the ticket for the Append operation.
Pass the originating alarm worklog to the TT system.
Note: Worklogs that existed before ticket creation and worklogs of the appended
alarms are passed too.
Update ticket after originating alarm has changed (for example, cleared).
Update ticket status in FM system after it was changed in the TT system.
NeTkT Plugin
NeTkT Plugin is a TT plugin to TEOCO’s NeTkT system.
For configuration details, refer to the NeTkT Administration Guide.
GEO Maps
To successfully implement GEO Maps into the Fault solution, alarms should be populated
with the correct Object Type and Object ID and the Eqp Num should get the NE’s Object ID.
In addition to running refresh config, we recommend restarting FamEngine and the FAM EAR
after changing the GEO Maps configuration.
These settings result in having the appropriate values in the Cruiser’s SiteID, Lat, and Long
fields.
Note: To provide correct Cruiser GEO Map displays, all the sites and regions stored in the
Base Configuration module should contain correct coordinates.
To manually calculate coordinates for all regions with empty (null) coordinates:
Use the CMM_DB.PA_UPD_REGION_COORD.UPD_REGION_COORD_ALL
function.
Note: To recalculate all region coordinates, empty all the existing coordinates before
running the procedure.
Name                   Description
Level                  Defines the layer’s level in the Maps module. It must match its
                       level definition in Helix’s Network Data Storage. Level 1 is the
                       highest (for example, Country) and Level 5 is the lowest (for
                       example, Secondary Region). Level 0 defines the sites
                       configuration.
Name                   Defines the layer’s name. It must match its name in Helix’s
                       WinFaM (for example, Level 0’s name is Sites, Level 1’s name is
                       Country, and Level 2’s name is State).
Description Template   Defines the name of the template used to display the bubble
                       window that shows the details of an element of this layer on the
                       map. It is taken from the WinFaM Metadata.
Image                  Defines the name of the image file used to display this layer’s
                       icon in the Alarms Map’s Layers pane. It is taken from the WinFam
                       images folder.
MinAlt & MaxAlt        Define the map altitude range (minimum and maximum), in meters,
                       for which this layer is displayed. We recommend that the MinAlt
                       of each layer be equal to the MaxAlt of the layer under it, to
                       make sure that exactly one layer is displayed at any altitude.
BoundingNorth,         Define the layer’s area as a rectangle by its latitude and
BoundingSouth,         longitude boundaries in decimal degrees.
BoundingWest, &
BoundingEast
Categories             Defines the elements included in this layer, as described in the
                       following table.
In addition, the layers entry includes the DefaultDescriptionTemplate entry, which is the
default template used to display the bubble window in levels for which the Description
Template is not defined or not valid. It is taken from the WinFaM Metadata.
The Categories and Category Entries
The Categories entry defines the elements included in the layer. An element in the layer is
defined as a category. A layer can include any number of category items. Usually, the Sites
layer includes several elements and all the other layers include only one (default) element.
Each category entry contains the following elements:
Name           Description
id             The category’s ID. It is relevant only if the layer includes more than
               one element.
IsDefault      If it is true, this is a default category and it is used to define any
               category that does not have a valid id or if no other category is
               defined.
DefaultImage   Defines the default Image. It is used for a category with no alarms
               that does not have an Image or its Image is not valid. It is relevant
               in a default category.
pair           Defines the mapping between the severity and the icon that represents
               it.

Name           Description
Value          One of the severities available for this category. It must be an
               available Helix severity.
Notes:
The severity of a category in layer 1 is defined as the highest alarm severity it has.
The severity of a category in any other layer is defined as the highest severity of the
elements included in it (of a lower level).
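The severity roll-up described in these notes can be sketched in Python as follows. It is illustrative only. The dictionary structure, the function names, and the ranking of the severities (Critical highest, Warning lowest) are assumptions made for the example.

```python
# Assumed ranking, lowest to highest. Real projects may define other severities.
SEVERITY_ORDER = ["Warning", "Minor", "Major", "Critical"]


def highest(severities):
    """Return the highest severity in the list, or None if the list is empty."""
    return max(severities, key=SEVERITY_ORDER.index, default=None)


def rollup(element):
    """Severity of a map element.

    An element has its own alarm severities and, on higher layers, child
    elements from the layer below. The result is the highest severity found.
    """
    found = list(element.get("alarms", []))
    for child in element.get("children", []):
        child_severity = rollup(child)
        if child_severity is not None:
            found.append(child_severity)
    return highest(found)


site_a = {"alarms": ["Minor", "Major"]}
site_b = {"alarms": ["Warning"]}
site_c = {"alarms": []}
city = {"children": [site_a, site_b, site_c]}

city_severity = rollup(city)
```

In this sketch the city element is shown with the Major icon because Major is the highest severity among its sites. A site without alarms yields no severity, which corresponds to using the default image.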
Name Description
Description The home location map name or description. It is not displayed on the
map. It is used to provide information about the homeview location to the
user viewing the MapsConfig-project.xml contents.
It contains the following layers (top-down):
Layer 1’s name is Country. Its image is 003-gray.png, and its altitude range is
350,000-5,000,000 meters.
Layer 2’s name is State. Its image is 004-gray.png, and its altitude range is
200,000-500,000 meters.
Layer 3’s name is City. Its image is 005-gray.png, and its altitude range is
100,000-200,000 meters.
Layer 4’s name is Region4. Its image is 006-gray.png, and its altitude range is
50,000-100,000 meters.
Layer 5’s name is Region5. It is not enabled. Therefore, its image and altitude range
are not defined.
Layer 0’s name is Sites. Its image is 001-gray.png, and its altitude range is
500-50,000 meters. It also has a Description Template (DefaultBubbleTemplate.xml),
which is also the DefaultDescriptionTemplate.
All the layers are between latitudes of 3-31 degrees and longitudes of 65-93 degrees.
Layer 0 contains 9 categories. Its category 1 is a default one and has a general default image
(001-gray.png). Its category 2 is not a default one and has a normal behavior image
(009-gray.png). Both categories have the severity pairs Critical, Major, Minor, and
Warning, with matching icon images.
When the Go to home location toolbar button is clicked or the Alarm Map is displayed
without a view area definition, the Latitude is 21 degrees, the Longitude is 78 degrees, and
the Altitude is 2,600,000 meters. This homeview location defines India.
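To see which layers this example configuration displays at a given altitude, the altitude ranges can be checked with a short Python sketch. It is illustrative only. The layers_at function is hypothetical, and treating MinAlt as inclusive and MaxAlt as exclusive is an assumption.

```python
# (name, MinAlt, MaxAlt) in meters, taken from the example above.
# Region5 is not enabled and therefore has no range.
LAYERS = [
    ("Country", 350_000, 5_000_000),
    ("State", 200_000, 500_000),
    ("City", 100_000, 200_000),
    ("Region4", 50_000, 100_000),
    ("Sites", 500, 50_000),
]


def layers_at(altitude_m):
    """Return the names of the layers whose altitude range contains the altitude.

    Assumption: MinAlt is inclusive and MaxAlt is exclusive.
    """
    return [name for name, min_alt, max_alt in LAYERS
            if min_alt <= altitude_m < max_alt]


home_view = layers_at(2_600_000)
overlap = layers_at(400_000)
boundary = layers_at(100_000)
```

At the home location altitude only the Country layer is displayed. Between 350,000 and 500,000 meters the Country and State ranges of this example overlap, so both layers match. This is the situation that the MinAlt/MaxAlt recommendation is meant to avoid.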
Flooding Protection
Alarm flooding is a situation in which an exceedingly large number of alarms is raised at a
rate higher than the FM Server can handle. When this happens, the flooding protection
mechanism ensures that FM Server keeps processing alarms even though its resources are
busy. This is done by rejecting certain alarms while saving them in files.
The mechanism uses two configurable protection levels:
Level 1: when the level 1 threshold is crossed, only the alarms that match the flooding
reject rules defined in FM Admin are automatically rejected by FM Server and saved to
files (by default, all alarms with priority <= 4).
Flooding reject rules are configurable and can be changed by the project.
Level 2: when the flooding continues, the level 1 reject rules are not enough, and the
level 2 threshold is crossed, all alarms are rejected.
When an alarm flooding situation is detected, an indication is sent to the TEOCO Monitor
application and a special alarm is sent to Cruiser. This alarm indicates the flood and causes
orange borders to be added to the Cruiser grid, warning the NOC operators that certain
alarms are being rejected.
NOC users can download and open the rejected alarms files by right-clicking the flooding
alarm.
When FM Server detects that the alarm flooding is over, it automatically stops rejecting
alarms, the flooding alarms are cleared (and saved to history), and the orange grid border is
removed.
The Flooding algorithm uses 3 configurable thresholds:
T1 [100,000] < T2 [200,000] < T3 [300,000]
Note: We recommend not changing default values of these threshold properties without
consulting TEOCO first.
These thresholds are defined in terms of the total number of events in all the queues.
We assume that the system hardware can process at least 1,000 events/sec.
Therefore, the default of 100,000 queued events is equivalent to a delay of 100 seconds.
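The relationship between the queue thresholds and the processing delay can be checked with a short calculation. The following is a minimal Python sketch, not product code: the function and constant names are illustrative, the threshold values are the defaults listed above, and the rate is the 1,000 events/sec assumption stated above.

```python
# Illustrative arithmetic only; not part of the FM Server.
DEFAULT_THRESHOLDS = {"T1": 100_000, "T2": 200_000, "T3": 300_000}  # queued events
ASSUMED_EVENTS_PER_SECOND = 1_000  # minimum processing rate assumed in this guide


def queue_delay_seconds(queued_events: int, events_per_second: float) -> float:
    """Return the time needed to drain a queue at a constant processing rate."""
    if events_per_second <= 0:
        raise ValueError("events_per_second must be positive")
    return queued_events / events_per_second


for name, size in DEFAULT_THRESHOLDS.items():
    delay = queue_delay_seconds(size, ASSUMED_EVENTS_PER_SECOND)
    print(f"{name}: {size:,} queued events ~ {delay:.0f} seconds of delay")
```

If the hardware processes events faster than the assumed rate, the same thresholds correspond to proportionally shorter delays.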
Note: The files which contain the rejected alarms are saved in the database and the customer
should maintain the database and clear old files.
Flooding in FamProxy
The FamProxy module implements 3 flood mechanism thresholds to prevent it from
crashing when it cannot handle the rate of active alarms received from
FamEngine. A flood in FamProxy usually occurs when it does not have enough
resources (usually memory), when the servers have too many event subscribers, or when
heavy subscribers prevent it from keeping up with the event rate.
When T2 is reached, FamProxy handles the flood with one of 2 strategies, according to the
flood.handler.mode system property setting:
BLOCK_EVENTS: blocks FamEngine. The FamProxy JMS queue causes the
alarms to be aggregated in FamEngine until the queue size is less than T2. This
setting is suitable when FamProxy serves clients such as Cruiser, and it ensures that
the event rate does not surpass the rate that FamProxy can handle.
DISCONNECT_SUBSCRIBERS: disconnects the subscribers that have filled the
queues and reconnects them. This setting is suitable when the FamProxy servers have
heavy subscribers such as the ES or ServiceImpact external modules.
When T2 is reached, a flood indication is also sent to the TEOCO Monitor module and when
T3 is reached, FamProxy restarts itself to cause all subscribers and queues to be reset.
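The T2/T3 behavior described above can be summarized as a small decision function. This is a hedged Python sketch of the documented behavior only, not the FamProxy implementation: the function name, the action strings, and the threshold values passed in the usage example are hypothetical.

```python
from enum import Enum
from typing import List


class FloodHandlerMode(Enum):
    """Values of the flood.handler.mode system property named in this guide."""
    BLOCK_EVENTS = "BLOCK_EVENTS"
    DISCONNECT_SUBSCRIBERS = "DISCONNECT_SUBSCRIBERS"


def famproxy_flood_actions(queue_size: int, t2: int, t3: int,
                           mode: FloodHandlerMode) -> List[str]:
    """Return the actions this guide describes for a given queue size."""
    if not t2 < t3:
        raise ValueError("expected T2 < T3")
    if queue_size >= t3:
        return ["restart FamProxy to reset all subscribers and queues"]
    if queue_size >= t2:
        if mode is FloodHandlerMode.BLOCK_EVENTS:
            handling = "block FamEngine until the queue size is less than T2"
        else:
            handling = "disconnect and reconnect the subscribers that filled the queues"
        return [handling, "send a flood indication to TEOCO Monitor"]
    return []  # below T2: no flood handling described here


# Hypothetical threshold values, for illustration only.
print(famproxy_flood_actions(250, t2=200, t3=300,
                             mode=FloodHandlerMode.DISCONNECT_SUBSCRIBERS))
```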
The following Flooding properties are also available but we recommend not changing their
default values. If you must change them, consult R&D first.
FM Screener
FM Screener is a licensed feature.
The purpose of this engine is to generate two lists of the Logic IDs:
black list—Logic IDs classified as SPAM, that most certainly can be ignored by the
operator
white list—Logic IDs classified as Premium, that most certainly should be handled by
the operator.
When a new alarm is raised with an ID appearing in one of the lists, it will be automatically
classified as SPAM or Premium accordingly. All other alarms will be “standard” as before the
feature was introduced.
The SPAM/Premium status can be used in other Cruiser filters as well.
The engine makes its decision based on the following two inputs: user actions and historic
actions.
User Actions
The following actions will implicitly classify an alarm as Premium:
Worklog (can be changed by “anti.spam.perform.non.spam.action.on.worklog”
property)
Trouble ticket (can be changed by “anti.spam.perform.non.spam.action.on.tt”)
Parent/child relations (can be changed by
“anti.spam.perform.non.spam.action.on.correlation”)
Explicit “mark as non-spam” command
The following action will remove the SPAM indication and make an alarm regular:
Acknowledge (can be changed by “anti.spam.remove.spam.indication.on.ack”)
The following action will explicitly classify an alarm as SPAM:
Explicit “mark as spam” command. However, if the alarm ID is already classified as
Premium, special user privilege is required.
Note: The above actions not only change the state of the current alarm instance, but also add
its ID to SPAM/Premium lists and therefore affect instances that follow.
Historic Investigations
This feature enables investigation of the historic actions that were or were not performed on the alarms.
There are two types of queries:
Queries generating SPAM list
Queries generating Premium list
By default, queries run every day at 1:00 AM. The hour is controlled by the
"anti.spam.system.query.daily.execution.hour" property and the days by the
"anti.spam.system.query.execution.days" property. A new run of a query overrides all IDs
found in its previous run.
An administrator can enable/disable specific queries through the FamAdmin GUI. Disabling a
query removes all the IDs found by it from the black/white lists.
There are two levels of priorities on the decisions above:
1. User actions have priority over historic queries.
Therefore if, for example, a historic query classifies the ID as Premium, but a user
executed a ‘mark as spam’ command, the ID will be classified as SPAM.
2. Premium classification has priority over SPAM classification.
Therefore, if, for example, one query classifies the ID as SPAM and another as
Premium, the ID will be classified as Premium.
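The two priority rules above can be expressed as a short resolution function. This is a minimal Python sketch under stated assumptions, not the FM Screener implementation: the names are illustrative, and applying the Premium-over-SPAM rule among conflicting user actions as well is our assumption, since the rules above describe it only for queries.

```python
from enum import Enum
from typing import Iterable, Optional


class Classification(Enum):
    SPAM = "SPAM"
    PREMIUM = "Premium"


def _strongest(decisions: Iterable[Classification]) -> Optional[Classification]:
    """Rule 2: Premium has priority over SPAM within one group of decisions."""
    found = set(decisions)
    if Classification.PREMIUM in found:
        return Classification.PREMIUM
    if Classification.SPAM in found:
        return Classification.SPAM
    return None


def resolve_classification(user_decisions: Iterable[Classification],
                           query_decisions: Iterable[Classification]
                           ) -> Optional[Classification]:
    """Rule 1: user actions have priority over historic queries.

    Returns None when nothing classifies the ID (a "standard" alarm).
    """
    user_result = _strongest(user_decisions)
    if user_result is not None:
        return user_result
    return _strongest(query_decisions)


# A historic query says Premium, but a user executed 'mark as spam'.
print(resolve_classification([Classification.SPAM], [Classification.PREMIUM]))
```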
An administrator can view the generated SPAM and Premium lists in FM Admin and modify
the SPAM indication (swap SPAM/Premium or make the alarm “regular”).
The decisions that caused an ID to be added to the SPAM/Premium lists are also available in FM
Admin.
Severity Management
FM Admin GUI enables you to specify the conversion of alarm priority (1-9) to severity
(Critical/Major/Minor/Warning).
It is also possible to configure the background and text color for every degree of severity. It
will affect the display of the alarms in the monitoring clients.
The system is supplied with 8 standard severity icons. There are four severity categories
(critical, major, minor, and warning), with two sizes for each (16*16 and 24*24). By default,
they are stored in the Project vdir directory for the JFAM product.
If you change the default severity colors that are supplied with the product, you can also
change the icons that will be displayed by loading your own customized icons.
For more information on working with J2EE, refer to the Helix Administration Guide.
Worklog Management
The FM Admin GUI enables you to specify the worklog types and templates that are
available to the user creating a worklog.
Example
In this example, the project field “Proj_Varchar_255_1” is activated so that it can be populated
through the Mediation library and used as the basis for toggle/repeated rules filtering.
ProjectActiveAlarm.xml:
<Classes>
<Class name="JFam:ProjectActiveAlarm" superClass="JFam:ActiveAlarm">
<!--
<Attribute name="Proj_Varchar_1024_1" type="string" logicalName="Proj_Varchar_1024_1"
label="Proj_Varchar_1024_1" size="1024"/>
<Attribute name="Proj_Varchar_512_1" type="string" logicalName="Proj_Varchar_512_1"
label="Proj_Varchar_512_1" size="512"/>
……
-->
<Attribute name="Proj_Varchar_255_1" type="string" logicalName="Proj_Varchar_255_1"
label="Project field" description="Very useful project field" size="255"/>
<!--
<Attribute name="Proj_Varchar_255_2" type="string" logicalName="Proj_Varchar_255_2" …
<Attribute name="Proj_Varchar_255_3" type="string" logicalName="Proj_Varchar_255_3" …
…….
<Attribute name="Proj_Int_15" type="int" logicalName="Proj_Int_15" label="Proj_Int_15"/>
-->
<ProjectionRef referencedName="MediationMappableAttributes"
name="MediationMappableAttributesEx">
<AddAttributeRef>Proj_Varchar_255_1</AddAttributeRef>
</ProjectionRef>
</Class>
</Classes>
Note: After the change, you will need to restart all FM related EARs.
Key="VendorPathData"
The key has the form <field name>PathData, where field name is the attribute name as defined in
JFam.
<sys:String x:Key="servicePathData" >F1 M 16.000,13.731 C 7.163,13.731 0.000,14.747 0.000,16.000 C
0.000,17.253 7.163,18.269 16.000,18.269 C 24.837,18.269 32.000,17.253 32.000,16.000 C 32.000,14.747
24.837,13.731 16.000,13.731 L 16.000,13.731 Z M 16.000,14.731 C 23.566,14.731 28.335,15.427 30.246,16.000 C
28.335,16.573 23.566,17.269 16.000,17.269 C 8.434,17.269 3.665,16.573 1.754,16.000 C 3.665,15.427
8.434,14.731 16.000,14.731 M 16.000,8.146 C 7.163,8.146 0.000,11.662 0.000,16.000 C 0.000,20.337 7.163,23.854
16.000,23.854 C 24.837,23.854 32.000,20.337 32.000,16.000 C 32.000,11.662 24.837,8.146 16.000,8.146 L
16.000,8.146 Z M 16.000,9.146 C 24.131,9.146 31.000,12.285 31.000,16.000 C 31.000,19.715 24.131,22.854
16.000,22.854 C 7.869,22.854 1.000,19.715 1.000,16.000 C 1.000,12.285 7.869,9.146 16.000,9.146 M 16.000,3.371
C 7.163,3.371 0.000,9.025 0.000,16.000 C 0.000,22.976 7.163,28.629 16.000,28.629 C 24.837,28.629
32.000,22.976 32.000,16.000 C 32.000,9.025 24.837,3.371 16.000,3.371 L 16.000,3.371 Z M 16.000,4.371 C
24.271,4.371 31.000,9.588 31.000,16.000 C 31.000,22.412 24.271,27.629 16.000,27.629 C 7.729,27.629
1.000,22.412 1.000,16.000 C 1.000,9.588 7.729,4.371 16.000,4.371 M 16.000,0.000 C 14.747,0.000 13.731,7.164
13.731,16.000 C 13.731,24.836 14.747,32.000 16.000,32.000 C 17.253,32.000 18.269,24.836 18.269,16.000 C
18.269,7.164 17.253,0.000 16.000,0.000 L 16.000,0.000 Z M 16.000,1.754 C 16.573,3.665 17.269,8.434
17.269,16.000 C 17.269,23.566 16.573,28.335 16.000,30.246 C 15.427,28.335 14.731,23.566 14.731,16.000 C
14.731,8.434 15.427,3.665 16.000,1.754 M 16.000,0.000 C 11.662,0.000 8.146,7.164 8.146,16.000 C 8.146,24.836
11.662,32.000 16.000,32.000 C 20.337,32.000 23.854,24.836 23.854,16.000 C 23.854,7.164 20.337,0.000
16.000,0.000 L 16.000,0.000 Z M 16.000,1.000 C 19.716,1.000 22.854,7.869 22.854,16.000 C 22.854,24.131
19.716,31.000 16.000,31.000 C 12.285,31.000 9.146,24.131 9.146,16.000 C 9.146,7.869 12.285,1.000 16.000,1.000
M 16.000,0.000 C 9.025,0.000 3.371,7.164 3.371,16.000 C 3.371,24.836 9.025,32.000 16.000,32.000 C
22.975,32.000 28.628,24.836 28.628,16.000 C 28.628,7.164 22.975,0.000 16.000,0.000 L 16.000,0.000 Z M
16.000,1.000 C 22.412,1.000 27.628,7.729 27.628,16.000 C 27.628,24.271 22.412,31.000 16.000,31.000 C
9.588,31.000 4.371,24.271 4.371,16.000 C 4.371,7.729 9.588,1.000 16.000,1.000 M 16.000,0.000 C 7.163,0.000
0.000,7.164 0.000,16.000 C 0.000,24.836 7.163,32.000 16.000,32.000 C 24.837,32.000 32.000,24.836
32.000,16.000 C 32.000,7.164 24.837,0.000 16.000,0.000 L 16.000,0.000 Z M 16.000,1.000 C 24.271,1.000
31.000,7.729 31.000,16.000 C 31.000,24.271 24.271,31.000 16.000,31.000 C 7.729,31.000 1.000,24.271
1.000,16.000 C 1.000,7.729 7.729,1.000 16.000,1.000</sys:String>
FaultPro Configuration
FaultPro can be configured to connect to a specific NE using multiple protocols. It can also be
configured to have more than one set of credentials, with different connection parameters, for
each protocol. Each of these sets is defined as an Access in Communication Admin (see the
Communication Admin User Guide).
To configure the required option, set the NeWorkMode property in the JCore cfg of the
FaultProModule EAR. The available values are protocol (default) and access. If there is
more than one access for the selected protocol, FaultPro selects the first in the list.
The ShowNE Commands property determines whether available NCI commands are
displayed in FaultPro (in addition to the ones defined in the Association rules).
Note: Cruiser must be restarted for the changes to take effect.
The Site View option provides the user a custom-based Site topology view for a selected site.
It provides a graphical display of all the objects that are associated with the chosen site,
based on the BC information and the relevant links between the objects and alarm information
for each object. It can be opened for a selected site or for the From Site of a selected alarm.
It is accessible from the alarm display, GEO Map display, and the Ribbon.
Icons Configuration
To change the Site View icon for a specific attribute:
1. In the SiteViewSetIconByFieldName property, set the name of the attribute by which
you want to determine the icons.
2. In the server, go to the path:
$BASE_DIR/j2ee/project/metadata/WinFam/dotNetBundle/
If it does not exist yet, create this folder.
3. In this folder, create/add a file named IconResources.xaml.
4. In this file, define the new node icon as in IconResources.xaml File Example. The
resource key should be <value>PathData, where value is the value of the attribute
selected in step 1 in the relevant node MD Class.
It should look like this example:
<sys:String x:Key="routerPathData" >F1 M 28.000,9.000 L 23.000,14.000
L 23.000,17.000 L 28.000,12.000 L 28.000,9.000 Z M 22.000,14.000 L
0.000,14.000 L 0.000,17.000 L 22.000,17.000 L 22.000,14.000 Z M
4.000,0.000 L 0.000,4.000 L 22.000,4.000 L 26.000,0.000 L 4.000,0.000
Z M 1.000,8.000 L 19.000,8.000 L 19.000,7.000 L 1.000,7.000 L
1.000,8.000 Z M 20.000,8.000 L 21.000,8.000 L 21.000,7.000 L
20.000,7.000 L 20.000,8.000 Z M 1.000,10.000 L 21.000,10.000 L
21.000,9.000 L 1.000,9.000 L 1.000,10.000 Z M 0.000,5.000 L
22.000,5.000 L 22.000,12.000 L 0.000,12.000 L 0.000,5.000 Z M
28.000,4.000 L 28.000,7.000 L 23.000,12.000 L 23.000,9.000 L
28.000,4.000 Z M 28.000,0.000 L 28.000,3.000 L 23.000,8.000 L
23.000,5.000 L 28.000,0.000 Z M 1.000,22.000 L 2.000,22.000 L
2.000,21.000 L 1.000,21.000 L 1.000,22.000 Z M 3.000,22.000 L
21.000,22.000 L 21.000,21.000 L 3.000,21.000 L 3.000,22.000 Z M
1.000,24.000 L 21.000,24.000 L 21.000,23.000 L 1.000,23.000 L
1.000,24.000 Z M 0.000,19.000 L 22.000,19.000 L 22.000,26.000 L
0.000,26.000 L 0.000,19.000 Z M 28.000,14.000 L 28.000,21.000 L
23.000,26.000 L 23.000,19.000 L 28.000,14.000 Z</sys:String>
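The key naming convention from step 4 can be illustrated with a small helper that formats one resource entry. This is a hedged Python sketch for illustration only: the function name is hypothetical, the sys: and x: prefixes are taken from the example above and assumed to be declared in the enclosing IconResources.xaml file, and the path data is a shortened placeholder rather than a real icon.

```python
from xml.sax.saxutils import escape, quoteattr


def icon_resource_entry(attribute_value: str, path_data: str) -> str:
    """Format one resource line whose key is <value>PathData."""
    key = f"{attribute_value}PathData"
    return f"<sys:String x:Key={quoteattr(key)}>{escape(path_data)}</sys:String>"


# "router" is the attribute value; the path data here is a shortened placeholder.
print(icon_resource_entry("router", "F1 M 28.000,9.000 L 23.000,14.000 Z"))
```

The helper only builds the text of a single entry; it does not create or validate the surrounding XAML resource dictionary.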
Tooltip Configuration
To change Site View tooltip details:
1. Go to $BASE_DIR/ttij2ee/project/metadata/SchematicViewsServer/classes/.
2. Select the relevant file. For example, to change the tooltip of a node, go to
ProjectViewNode.xml.
3. Add a projection override called CruiserVisibleFields.
Note: If this projection is not defined, all the following fields are presented.
By default, the available Site View fields are:
Region1_Name The first region level name (for example, country) in which the
network element is located
Region2_Name The second region level name (for example, state) in which the
network element is located
Region3_Name The third region level name (for example, city) in which the
network element is located
For more details about overriding, refer to the MD Class Refinement chapter in the
Helix Administration Guide.
4. Restart the EAR where SchematicViewsServer is deployed.
To control the fields in the details presentation of the selected service in the
Service Status tab:
1. Override the definitions as described in the MD Class Refinement chapter in the Helix
Administration Guide.
2. Restart the EAR where BCAPI is deployed.
By default, the available service fields are:
Service Class Code The service class code (1=Gold, 2=Silver, 3=Bronze)
In the following example, the NEW_FIELD field is added to the new
CMM_DB.NEW_SERVICE_ENRICHMENT_VW view:
<?xml version="1.0" encoding="UTF-8" ?>
<Classes>
<Class name="BCAPI:SOC.SOCServiceForObject"
superClass="BCAPI:SOC.SOCServiceWithCustomers">
<Attribute name="NewField" type="string" />
<DBMapping mainTable="CMM_DB.NEW_SERVICE_ENRICHMENT_VW"
vendor="Oracle" extendsMapping="false">
<Table>
<PrimaryKeys>
<Column name="SERVICE_INSTANCE_ID" />
</PrimaryKeys>
<Attribute name="ObjectId" columnName="SERVICE_INSTANCE_ID" />
<Attribute name="Name" columnName="SERVICE_INST_NAME" />
<Attribute name="ServiceTypeId" columnName="SERVICE_TYPE_ID" />
<Attribute name="ServiceTypeName" columnName="SERVICE_TYPE_NAME" />
<Attribute name="ServiceClassCode"
columnName="SERVICE_INSTANCE_CLASS" />
<Attribute name="ServiceClassDescr" columnName="CODES_DESCRIPTION"
/>
<Attribute name="ServiceImportance"
columnName="SERVICE_INSTANCE_IMPORTANCE" />
<Attribute name="ServiceDescription"
columnName="SERVICE_INSTANCE_DESCRIPTION" />
<Attribute name="ServiceInternalType"
columnName="SERVICE_INTERNAL_TYPE" />
<Attribute name="ServiceInternalId"
columnName="SRVINST_INTERNAL_ID" />
<Attribute name="ServiceExternalId"
columnName="SERVICE_INSTANCE_CODE" />
<Attribute name="ServiceCustomers" columnName="CONCAT_CUST_NAMES"
/>
<Attribute name="NewField" columnName="NEW_FIELD" />
</Table>
</DBMapping>
</Class>
</Classes>
KPI Presentation
This feature presents the status of the KPI selected for the entity as a dot, colored according
to the Coloring Thresholding rules. When hovering over the entity, it presents the name of the
KPI and its value. To enable this feature, you have to configure the mapping between the
entity and its default counter.
The syntax for selecting a pair of alarm attributes is:
<PredictiveObject>
<AlarmAttribute1>attribute name1</AlarmAttribute1>
<AlarmAttribute2>attribute name2</AlarmAttribute2>
</PredictiveObject>
The syntax for selecting an alarm attribute score to be displayed in the Trend/Anomaly alarm
field:
<PredictiveObject AlarmEnrichment="true">
<AlarmAttribute1>attribute name</AlarmAttribute1>
</PredictiveObject>
Only one <PredictiveObject> can be selected as a source for the alarm enrichment.
For example, if you choose FromSite as the Predictive object for the alarm enrichment and
the alarm's FromSite attribute has the value “site1”, the alarm is enriched with the score
calculated for “site1”.
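The rule that only one <PredictiveObject> can be the enrichment source can be checked before deploying a configuration. The following is a minimal Python sketch, not a product tool: the function name is hypothetical, and the sample XML (including the EquipmentType/Vendor pair) is made up for illustration using the syntax shown above.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<TrendConfig>
  <PredictiveObject AlarmEnrichment="true">
    <AlarmAttribute1>FromSite</AlarmAttribute1>
  </PredictiveObject>
  <PredictiveObject>
    <AlarmAttribute1>EquipmentType</AlarmAttribute1>
    <AlarmAttribute2>Vendor</AlarmAttribute2>
  </PredictiveObject>
</TrendConfig>
"""


def read_predictive_objects(xml_text):
    """Return (objects, enrichment_source) and reject multiple enrichment sources."""
    root = ET.fromstring(xml_text)
    objects = []
    for node in root.iter("PredictiveObject"):
        attributes = tuple(
            child.text.strip()
            for child in node
            if child.tag in ("AlarmAttribute1", "AlarmAttribute2") and child.text
        )
        enriches = node.get("AlarmEnrichment", "false").lower() == "true"
        objects.append((attributes, enriches))
    sources = [attributes for attributes, enriches in objects if enriches]
    if len(sources) > 1:
        raise ValueError("only one PredictiveObject may set AlarmEnrichment=\"true\"")
    return objects, (sources[0] if sources else None)


objects, source = read_predictive_objects(SAMPLE)
print(objects)
print("enrichment source:", source)
```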
Formal Schema
Performance Considerations
By default, at least 20 points are required for the score calculation. For example, defining 7
days with daily resolution produces only 7 points, which is not sufficient. On the other hand,
scanning large portions of the history to produce many points may degrade the
performance of the algorithm.
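A quick way to sanity-check a planned resolution is to count the points it would produce. This is a hedged Python sketch: the names are illustrative, and it assumes one point per aggregation period, so that an hourly resolution yields 24 points per day (the guide states only the daily case explicitly).

```python
MIN_POINTS = 20  # default minimum number of points stated above

# Assumption: one point per aggregation period.
POINTS_PER_DAY = {"DAILY": 1, "HOURLY": 24}


def resolution_points(history_days: int, aggregation_type: str) -> int:
    """Return the number of points produced by a history range and aggregation."""
    return history_days * POINTS_PER_DAY[aggregation_type]


def has_enough_points(history_days: int, aggregation_type: str,
                      min_points: int = MIN_POINTS) -> bool:
    """Return True when the resolution yields at least the required points."""
    return resolution_points(history_days, aggregation_type) >= min_points


for days, aggregation in [(7, "DAILY"), (30, "DAILY"), (1, "HOURLY")]:
    points = resolution_points(days, aggregation)
    print(days, aggregation, points, has_enough_points(days, aggregation))
```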
Depending on the configuration, the history data size, and the database performance, you
may need to optimize the database performance by using additional indexes.
We recommend monitoring the database performance during the initial period and validating
smooth and optimal database functionality.
Formal Schema
Syntax
The syntax for defining a history resolution score to appear in the Trend/Anomaly alarm field is:
<HistoryResolution AlarmEnrichment="true">
The syntax for defining the history resolution name in the client is:
<Name>resolution name</Name>
The syntax for defining the history days number in the client is:
<HistoryDaysRange>history days number</HistoryDaysRange>
The syntax for defining the aggregation period in the graph is:
<AggregationType>aggregation period</AggregationType>
Aggregation period can be DAILY or HOURLY.
The syntax for defining how often the learning phase will run is:
<ExecutionDaysInterval>days number</ExecutionDaysInterval>
For example, 7 means that the learning process is scheduled to run every 7 days.
The syntax for defining at what time of day the learning phase will run is:
<ExecutionTimeOfDay>time of day (HH:MM)</ExecutionTimeOfDay>
Performance Considerations
During the learning phase, a new process is forked. This process consumes about
the same amount of memory as the EAR process where the FamAnalytics module is running.
Therefore, you have to plan the resources of the machine accordingly.
Formal Schema
TrendConfig PredictiveObjects
The selected Trend single alarm attributes are EquipmentName, EquipmentType,
DeviceName, DeviceType, FromSite, Domain, and Area.
The EquipmentName score is selected to be displayed in the Trend Analytics alarm field.
TrendConfig HistoryResolution
The defined resolutions are:
Daily for 30 days with title 30 Days and execution at 04:30.
Hourly for 7 days with title 7 Days and execution at 05:30.
Hourly for 1 day with title 1 Day and execution at XX:15 (15 minutes past every hour).
Anomaly PredictiveObjects
The selected Anomaly single alarm attributes are EquipmentType, DeviceName, DeviceType,
FromSite, ServiceName, Vendor, Domain and Area.
Config File
<?xml version="1.0"?>
<AnalyticsConfig>
<TrendConfig>
<PredictiveObject AlarmEnrichment="true">
<AlarmAttribute1>EquipmentName</AlarmAttribute1>
</PredictiveObject>
<PredictiveObject>
<AlarmAttribute1>EquipmentType</AlarmAttribute1>
</PredictiveObject>
<PredictiveObject>
<AlarmAttribute1>DeviceName</AlarmAttribute1>
</PredictiveObject>
<PredictiveObject>
<AlarmAttribute1>DeviceType</AlarmAttribute1>
</PredictiveObject>
<PredictiveObject>
<AlarmAttribute1>FromSite</AlarmAttribute1>
</PredictiveObject>
<PredictiveObject>
<AlarmAttribute1>ServiceName</AlarmAttribute1>
</PredictiveObject>
<PredictiveObject>
<AlarmAttribute1>Vendor</AlarmAttribute1>
</PredictiveObject>
<PredictiveObject>
<AlarmAttribute1>Domain</AlarmAttribute1>
</PredictiveObject>
<PredictiveObject>
<AlarmAttribute1>Area</AlarmAttribute1>
</PredictiveObject>
<HistoryResolution AlarmEnrichment="true">
<Name>30 Days</Name>
<HistoryDaysRange>30</HistoryDaysRange>
<AggregationType>DAILY</AggregationType>
<ExecutionSetting>
<Daily>04:30</Daily>
</ExecutionSetting>
</HistoryResolution>
<HistoryResolution>
<Name>7 Days</Name>
<HistoryDaysRange>7</HistoryDaysRange>
<AggregationType>HOURLY</AggregationType>
<ExecutionSetting>
<Daily>05:30</Daily>
</ExecutionSetting>
</HistoryResolution>
<HistoryResolution>
<Name>1 Day</Name>
<HistoryDaysRange>1</HistoryDaysRange>
<AggregationType>HOURLY</AggregationType>
<ExecutionSetting>
<Hourly>15</Hourly>
</ExecutionSetting>
</HistoryResolution>
</TrendConfig>
<AnomalyConfig>
<GraphField>Keyword</GraphField>
<LearningPhase>
<HistoryDaysRange>90</HistoryDaysRange>
<AlarmNameAttribute>Keyword</AlarmNameAttribute>
<AlarmEqpAttribute>EquipmentName</AlarmEqpAttribute>
<ExecutionDaysInterval>7</ExecutionDaysInterval>
<ExecutionTimeOfDay>02:00</ExecutionTimeOfDay>
<MinSupport>100</MinSupport>
<MinInterest>2</MinInterest>
<MinConf>0</MinConf>
<MinIS>0</MinIS>
</LearningPhase>
<PredictiveObject AlarmEnrichment="true">
<AlarmAttribute1>EquipmentName</AlarmAttribute1>
</PredictiveObject>
<PredictiveObject>
<AlarmAttribute1>EquipmentType</AlarmAttribute1>
</PredictiveObject>
<PredictiveObject>
<AlarmAttribute1>DeviceName</AlarmAttribute1>
</PredictiveObject>
<PredictiveObject>
<AlarmAttribute1>DeviceType</AlarmAttribute1>
</PredictiveObject>
<PredictiveObject>
<AlarmAttribute1>FromSite</AlarmAttribute1>
</PredictiveObject>
<PredictiveObject>
<AlarmAttribute1>ServiceName</AlarmAttribute1>
</PredictiveObject>
<PredictiveObject>
<AlarmAttribute1>Vendor</AlarmAttribute1>
</PredictiveObject>
<PredictiveObject>
<AlarmAttribute1>Domain</AlarmAttribute1>
</PredictiveObject>
<PredictiveObject>
<AlarmAttribute1>Area</AlarmAttribute1>
</PredictiveObject>
<HistoryResolution AlarmEnrichment="true">
<Name>30 Days</Name>
<HistoryDaysRange>30</HistoryDaysRange>
<AggregationType>DAILY</AggregationType>
<ExecutionSetting>
<Daily>03:30</Daily>
</ExecutionSetting>
</HistoryResolution>
<HistoryResolution>
<Name>7 Days</Name>
<HistoryDaysRange>7</HistoryDaysRange>
<AggregationType>HOURLY</AggregationType>
<ExecutionSetting>
<Daily>04:30</Daily>
</ExecutionSetting>
</HistoryResolution>
<HistoryResolution>
<Name>1 Day</Name>
<HistoryDaysRange>1</HistoryDaysRange>
<AggregationType>HOURLY</AggregationType>
<ExecutionSetting>
<Hourly>30</Hourly>
</ExecutionSetting>
</HistoryResolution>
</AnomalyConfig>
<PredictiveRangeConfig>
<PredictiveRangeData>
<Name>Low</Name>
<PredictiveRange>
<MinValue>0</MinValue>
<MaxValue>25</MaxValue>
</PredictiveRange>
<Color>#99bedc</Color>
</PredictiveRangeData>
<PredictiveRangeData>
<Name>Moderate</Name>
<PredictiveRange>
<MinValue>26</MinValue>
<MaxValue>50</MaxValue>
</PredictiveRange>
<Color>#ffc600</Color>
</PredictiveRangeData>
<PredictiveRangeData>
<Name>Significant</Name>
<PredictiveRange>
<MinValue>51</MinValue>
<MaxValue>75</MaxValue>
</PredictiveRange>
<Color>#ff8135</Color>
</PredictiveRangeData>
<PredictiveRangeData>
<Name>Serious</Name>
<PredictiveRange>
<MinValue>76</MinValue>
<MaxValue>100</MaxValue>
</PredictiveRange>
<Color>#ff413f</Color>
</PredictiveRangeData>
<PredictiveRangeData>
<Name>Low (decrease)</Name>
<PredictiveRange>
<MinValue>-25</MinValue>
<MaxValue>-1</MaxValue>
</PredictiveRange>
<Color>#99bedc</Color>
</PredictiveRangeData>
<PredictiveRangeData>
<Name>Moderate (decrease)</Name>
<PredictiveRange>
<MinValue>-50</MinValue>
<MaxValue>-26</MaxValue>
</PredictiveRange>
<Color>#ffc600</Color>
</PredictiveRangeData>
<PredictiveRangeData>
<Name>Significant (decrease)</Name>
<PredictiveRange>
<MinValue>-75</MinValue>
<MaxValue>-51</MaxValue>
</PredictiveRange>
<Color>#ff8135</Color>
</PredictiveRangeData>
<PredictiveRangeData>
<Name>Serious (decrease)</Name>
<PredictiveRange>
<MinValue>-100</MinValue>
<MaxValue>-76</MaxValue>
</PredictiveRange>
<Color>#ff413f</Color>
</PredictiveRangeData>
</PredictiveRangeConfig>
</AnalyticsConfig>
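The PredictiveRangeConfig section above maps a score to a named range and a color. The following minimal Python sketch reproduces that lookup with the default values from the config file. It is illustrative only: the function name is hypothetical, and it assumes integer scores, since the default ranges leave gaps between, for example, 25 and 26.

```python
# (name, min value, max value, color) copied from the default PredictiveRangeConfig.
DEFAULT_RANGES = [
    ("Low", 0, 25, "#99bedc"),
    ("Moderate", 26, 50, "#ffc600"),
    ("Significant", 51, 75, "#ff8135"),
    ("Serious", 76, 100, "#ff413f"),
    ("Low (decrease)", -25, -1, "#99bedc"),
    ("Moderate (decrease)", -50, -26, "#ffc600"),
    ("Significant (decrease)", -75, -51, "#ff8135"),
    ("Serious (decrease)", -100, -76, "#ff413f"),
]


def range_for_score(score: int):
    """Return (name, color) of the range containing the score, or None."""
    for name, min_value, max_value, color in DEFAULT_RANGES:
        if min_value <= score <= max_value:
            return name, color
    return None


for score in (10, 60, -30, 101):
    print(score, range_for_score(score))
```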
Offline
The algorithm runs every 7 days ("offline.day.interval" property) at 2:00 AM ("offline.time"
property). In case of execution failure, the algorithm reruns every 30 minutes
(“predictor.offline.retry.minutes.interval” property).
The algorithm analyzes the last 92 days ("number.of.days.for.offline.algorithm" property) of
historic alarm data.
Part of the historic data is used as control data to check the correctness of the predictions.
This way, every predicted alarm name has the following two KPIs:
Precision—how many alarms the algorithm predicted correctly (correct
predictions / all predictions)
Recall—percentage of total results correctly classified by the algorithm (correct
predictions/ all alarms)
Predictions with precision less than 0.5 ("offline.param.min.precision") or recall less
than 0.5 ("offline.param.min.recall") are ignored and dropped.
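The two KPIs and the default cut-offs can be worked through with a few lines of arithmetic. This is a minimal Python sketch, not the predictor code: the function names are illustrative and the counts in the usage example are made-up numbers.

```python
def precision(correct_predictions: int, all_predictions: int) -> float:
    """correct predictions / all predictions (0.0 when nothing was predicted)."""
    return correct_predictions / all_predictions if all_predictions else 0.0


def recall(correct_predictions: int, all_alarms: int) -> float:
    """correct predictions / all alarms (0.0 when no alarms occurred)."""
    return correct_predictions / all_alarms if all_alarms else 0.0


def keep_prediction(correct_predictions: int, all_predictions: int, all_alarms: int,
                    min_precision: float = 0.5, min_recall: float = 0.5) -> bool:
    """Keep a prediction unless its precision or recall is below the minimum."""
    return (precision(correct_predictions, all_predictions) >= min_precision
            and recall(correct_predictions, all_alarms) >= min_recall)


# Hypothetical counts: 8 correct out of 10 predictions, 12 real alarms.
print(precision(8, 10), round(recall(8, 12), 3), keep_prediction(8, 10, 12))
```

The defaults of 0.5 correspond to the "offline.param.min.precision" and "offline.param.min.recall" properties.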
Pre-requisites
For the prediction algorithm to provide the best results, the following prerequisites are
essential:
1. Preferably 3 months of alarm history data is required (but not less than 1
month).
2. Alarm data should include the following information:
o Alarm Name: the name of the alarm, as provided by the vendor (for example,
AIS, Power Failure, or LOS)
o From Site: the site where the alarm originated
o Area: the geographic hierarchy above SITE
o District: the geographic hierarchy above AREA
o Eqp Name: the equipment that originated the alarm
o Alarmed Object: the non-equipment or sub-equipment entity, such as
Card/Interface/Channel/link
o Object ID
o Ancestor object ID
o Site ID
Alarm Filtering
You can filter the alarms that are used as input to the offline and online
algorithms by specifying SQL criteria in the FmPredictor property "offline.param.where".
The criteria are evaluated against the history_db.NEW_HIST_MAIN table. Alarms for which the
criteria evaluate to true participate in the offline algorithm.
Online
The online algorithm runs every 15 minutes ("predictor.online.minutes.interval"), checks the
alarms raised in the last 2 hours ("hours.of.history.alarms.for.online"), and correlates them
to the model built during the offline algorithm.
The algorithm predicts and raises the most specific alarms in three levels:
1. Alarmed Object (object id) with Alarm Name
2. Equipment Name (ancestor object id) with Alarm Name
3. Site (site id) with Alarm Name
A predicted alarm will be automatically cleared once the real predicted alarm is raised. If
the real alarm has not occurred, the predicted alarm will be cleared after the prediction
max time has expired. The “Clear reason” field of the alarm will contain the reason for the
clearance.
ServiceImpact Configuration
The Cruiser Alarms Summary mode can provide different displays of the existing services and
customers, based on ServiceImpact information.
The ServiceImpact Admin enables the administrator to set the ServiceImpact system
definition as described in the ServiceImpact Admin User Guide.
For more information about ServiceImpact implementation, see the ServiceImpact
Implementation Guide.
Note: A time frame is valid only if it has both the start and end time defined.
DB Plug-in Configuration
When using the DB plug-in, the Maintenance Calendar jobs are defined in the following tables
in the CONFIG_DB database.
Table MAINT_JOB
This table contains the Maintenance Calendar Jobs.
Table MAINT_JOB_EXT
This table can be used to enrich maintenance jobs with project specific attributes. See the
details below.
Table MAINT_JOB_TIME_FRAME
This table contains Maintenance Calendar Job time frames.
Table MAINT_OBJECT
This table contains Maintenance Calendar Job Objects.
Table MAINT_OBJECT_EXT
This table can be used to enrich maintenance objects with project specific attributes.
Table MAINT_OBJECT_TIME_FRAME
This table contains the Maintenance Calendar Job Object time frames.
Learning
The learning algorithm runs periodically. It fetches relevant historic alarm data, analyzes
correlation between alarms and divides related alarms into clusters. Clusters are stored in the
database to be used by the runtime logic.
The learning algorithm is executed in a separate process forked from the Managed Server
process where FamRCA EAR is deployed. The process may require significant memory, CPU
and DB resources. We highly recommend monitoring the first runs of the learning process and
validating that it has all the required resources and completes successfully.
Alarms Dataset Prerequisites and Recommendations
For the Machine Learning RCA algorithms to provide the best clusters and root causes, the
following prerequisites for the fault data are essential:
1. Preferably, 3 months of alarm data is required (but not less than 1 month).
2. The alarm data should be as informative as possible. It must include the following
information in the different fields:
o Alarm Identifier—a specific identifier of the alarm. This means that instances of
the same alarm are raised with the same Alarm Identifier. This field cannot be
empty and should not contain any redundant data, such as time, temperature,
and internal system index.
o Alert Name—the name of the alarm, as provided by the vendor (for example,
AIS, Power Failure, or LOS).
o Severity/Priority—the severity or priority of the alarm.
o Managed Object—the name of the object that raised the alarm.
Configuration
The following attributes control the algorithm execution parameters (such as time of day,
interval, and retry interval) and define the data to be collected (such as alarm name, keyword,
severity, and the range/resolution of the data).
Troubleshooting
To check whether the learning phase has run, look at the
metadata/vdir/learning directory under the EAR folder.
3 files are created during the initial data collection: deep.txt, kwrds.txt, and
lid2kwrd.txt.
Their creation time shows when the learning phase started.
rca_learning_log.out and rca_learning_errors.err are output files of the algorithm
itself, showing its progress. The existence of the .err file means that the learning phase failed.
Usually, this happens when there is insufficient data or when there are
mismatches in the 3 input files due to a momentary discrepancy. If the error occurs again
after rerunning the algorithm, it indicates that some specific data (usually the Keyword
attribute) contains invalid content.
rca_out.txt is the output of the algorithm, which is stored in the table
history_db.rca_scores. The file time is the time at which the phase ended. However, the
database records hold no time information. Therefore, it is not possible to determine whether the
records there are obsolete.
Usually, three areas affect the success of the learning phase and need to be
checked:
Proper configuration: check the time of execution and the existence of sufficient data
in history_db.new_hist_main for the defined ‘historyRange’
Corrupted/mismatching data: check the aforementioned output and .err files of the
rca_learning algorithm
DB issues: if the rca_out.txt file was created, but the table does not contain any info (or
contains old info), check the jcore.log of the EAR for database errors when storing the
records
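The file checks described in this section can be scripted. The following Python sketch is an illustration only and is not part of the product: the function name and the dictionary it returns are assumptions, while the file names are the ones listed above. It uses the file modification time as an approximation of the creation time.

```python
import os

# File names taken from the troubleshooting description above.
INPUT_FILES = ("deep.txt", "kwrds.txt", "lid2kwrd.txt")
ERROR_FILE = "rca_learning_errors.err"
OUTPUT_FILE = "rca_out.txt"


def inspect_learning_dir(learning_dir):
    """Summarize the learning phase state from the files in learning_dir.

    learning_dir is assumed to be the metadata/vdir/learning directory
    under the EAR folder (the caller supplies the full path).
    """
    def mtime(name):
        path = os.path.join(learning_dir, name)
        # Modification time approximates creation time here.
        return os.path.getmtime(path) if os.path.exists(path) else None

    inputs = {name: mtime(name) for name in INPUT_FILES}
    present = [t for t in inputs.values() if t is not None]
    return {
        "missing_inputs": sorted(n for n, t in inputs.items() if t is None),
        # The earliest input file time hints at when learning started.
        "started_at": min(present) if present else None,
        # An existing .err file indicates a failed learning phase.
        "failed": mtime(ERROR_FILE) is not None,
        # The output file time is the time the phase ended.
        "ended_at": mtime(OUTPUT_FILE),
    }
```

If "failed" is true, or "ended_at" is empty long after "started_at", continue with the three areas listed above.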
Manual Run
From the JavaScript console, with the FamRCA application, run the learning phase
with:
Packages.teoco.famrca.offline.LearningTask.runLearning();
Learning Investigations
The Sentinel UI lets you display and explore the results of the learning run:
Visual presentations of the generated clusters
Investigation of a specific cluster:
o Information about the cluster alarms
o Graph displaying the times when alarm instances of the cluster occurred
The data for the investigation is fetched from the database. We recommend monitoring the
first executions of the flow and performing database adjustments if required.
Configuration
<Property name="widget.fault.alarmsList.amountOfAlarms" type="int"
public="true">
<Value>500</Value>
<Description summary="Defines the amount of alarms which
is get from the server for Active Alarms table" />
</Property>
<Property name="widget.mlrca.investigationGraph.maxLogicIds"
type="int" public="true">
<Value>50</Value>
<Description summary="Defines the maximum logic ids that
can be shown in the investigation graph" />
</Property>
Run-Time
Correlation Decisions
The description below is based on the default parameters.
When an alarm belonging to one of the clusters is raised, it forms a “potential family”. New
alarms that belong to the same cluster and are raised within 180 seconds (“online.maxTime”
property) of the raise time of the first alarm are added to this potential family. Cleared
alarms are removed. If, during these 180 seconds, the potential family has at least 4
(“online.minCorrelationCluster” property) active alarms, it becomes a “real family”; otherwise, it
is destroyed.
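The decision rule above can be illustrated with a small simulation. The following Python sketch is a simplified model, not the product implementation: the function name, the event tuple layout, and the treatment of the 180-second boundary as inclusive are assumptions. It handles the events of a single cluster only.

```python
MAX_TIME = 180     # seconds; default value of online.maxTime
MIN_CLUSTER = 4    # default value of online.minCorrelationCluster


def evaluate_family(events, max_time=MAX_TIME, min_cluster=MIN_CLUSTER):
    """Decide the fate of a potential family for ONE cluster.

    events: iterable of (timestamp_seconds, action, alarm_id), where
    action is "raise" or "clear" (hypothetical layout for this sketch).
    Returns "real", "destroyed", or None if no alarm was ever raised.
    """
    active = set()
    start = None
    for ts, action, alarm_id in sorted(events, key=lambda e: e[0]):
        if start is None:
            if action != "raise":
                continue          # nothing to clear before the first raise
            start = ts            # first alarm opens the potential family
        if ts - start > max_time:
            break                 # the time window has expired
        if action == "raise":
            active.add(alarm_id)
        elif action == "clear":
            active.discard(alarm_id)
        if len(active) >= min_cluster:
            return "real"
    return "destroyed" if start is not None else None
```

For example, four alarms raised at 0, 10, 20, and 30 seconds yield "real", whereas the same sequence with the fourth alarm raised at 200 seconds yields "destroyed".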
Alarm Filtering
It is possible to limit the volume of alarms that participate in the RCA correlation process.
To do this, edit the
$BASEDIR/ttij2ee/project/metadata/FamRCA/filter/RCAJSAlarmFilter.filter file so that it contains a JS
expression in terms of alarm attribute names.
The expression should return TRUE for alarms that you DO NOT want to be processed.
For example, to exclude alarms from the site1 site, the expression should be:
FromSite == 'site1'
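Because the filter is an exclusion filter, a TRUE result removes the alarm from RCA processing. The following Python sketch mirrors that semantics for the expression FromSite == 'site1'. It is an analogue for illustration only; the function names and the dictionary representation of an alarm are assumptions, and the product evaluates the JS expression itself.

```python
def is_excluded(alarm):
    """Python analogue of the JS filter expression FromSite == 'site1'."""
    return alarm.get("FromSite") == "site1"


def alarms_for_rca(alarms, exclude=is_excluded):
    """Keep only the alarms for which the filter does NOT return True."""
    return [alarm for alarm in alarms if not exclude(alarm)]


# Illustrative alarms; the attribute values are made up.
sample = [
    {"LogicID": "A1", "FromSite": "site1"},
    {"LogicID": "A2", "FromSite": "site2"},
]
kept = alarms_for_rca(sample)
```

In this example, only the alarm from site2 remains in the kept list.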
Correlation Graph
When looking at a specific correlation in Cruiser, it is possible to see a correlation graph
showing the history of the alarms participating in the correlation.
Configuration
The following properties affect the graph presentation:
WinFam:AlarmConnectionsGraphDisplayFieldName: the name of the alarm field
to be displayed in the Correlation Tree graph. Default: LogicID
FamProxy:alarm.service.offspring.history.timespan: how far back in time the graph
extends, in hours. Default: 720 hours
FamProxy:alarm.service.offspring.history.gap: the graph resolution, in minutes.
Default: 10 minutes
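To get a feeling for how the timespan and gap values interact, the following Python sketch computes the number of time slots in the graph. It assumes that the graph uses one slot per gap interval across the whole timespan, which is an assumption about the presentation and not a documented behavior; the function name is illustrative.

```python
def graph_time_slots(timespan_hours=720, gap_minutes=10):
    """Number of gap-sized slots that fit in the history timespan."""
    return (timespan_hours * 60) // gap_minutes


default_slots = graph_time_slots()          # 720 hours at 10 minutes
coarser_slots = graph_time_slots(720, 60)   # same timespan, 1-hour gap
```

With the default values, 720 hours (30 days) at a 10-minute resolution gives 4320 slots; widening the gap to 60 minutes reduces this to 720.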
It is possible to see a tooltip with additional alarm information.
The displayed fields are determined by the “RCAChartAttributes” projection in the ProjectActiveAlarm
MD class. By default, the following fields are displayed:
EquipmentName
FromSite
AlarmText
The same projection determines which alarm information is saved in the Excel file.
Opening Clients
Opening FM Cruiser from External Applications
This feature enables you to open the FM Cruiser display from external applications. It is done
by opening a URL using the appropriate parameters.
The URL prefix is:
https://[apache-host]:[apache-port]/FaMShell_[EAR name]/FaMShellActivator.jsp?
FilterKey (string): Indicates filtering by a predefined filter. Use the key that
FiltersService provided you.
NavigationMode (string): Sets the Cruiser Master Mode for the display.
The available values are:
  active: Active Alarms
  correlated: Correlated Alarms
  history: Alarms History
  ge: GEO Maps
  summary: Alarms Summary/Analytics
  siteview: Site View
    In this case, you have to specify a site id, as follows:
    FieldName=SiteID&FieldValue=[your site id]&FieldType=int&NavigationMode=siteview
    For opening Site View from CAFÉ by Site Band ID, set FieldName=SBID and
    FieldValue=[its value].
    To open a specific Site View tab, use the siteTabName parameter as described below.
FieldType (string): Indicates the type of FieldName. The default is string; it can also be int.
Notes:
To open a filtered drill down folder, one of the filter parameters (FilterKey,
TempFilterKey, and FieldName) must be set.
All the parameters are optional and single-entry.
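As an illustration of how an external application might assemble such a URL, the following Python sketch builds the Site View example from the prefix and parameters described above. The helper name, the host, the port, the EAR name, and the site id are hypothetical values chosen for the example.

```python
from urllib.parse import urlencode


def cruiser_url(host, port, ear_name, **params):
    """Build an FM Cruiser activation URL from the documented prefix."""
    base = f"https://{host}:{port}/FaMShell_{ear_name}/FaMShellActivator.jsp"
    # urlencode escapes the values and joins them with '&'.
    return f"{base}?{urlencode(params)}"


# Hypothetical host, port, EAR name, and site id.
url = cruiser_url(
    "apache.example.com", 443, "FM",
    FieldName="SiteID", FieldValue=1234,
    FieldType="int", NavigationMode="siteview",
)
```

Using urlencode rather than manual string concatenation keeps values that contain spaces or reserved characters valid in the query string.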
When opening the Cruiser client from an external client, the available applications are Cruiser
and Light Cruiser. The following FamProxy properties are used to determine the required
application:
preferableMonitoringClient—the preferred FM application to open from external client
secondPreferableMonitoringClient—the second FM application choice to open from
external client
The selection of the FM application to be opened changes for the different external
applications as follows.
Opening from Schematic Views:
1. If Schematic Views was triggered from one of the FM clients, it opens the triggering
application.
2. If Schematic Views was triggered from Sentinel:
a. User permissions and project installation are checked.
b. If the user has permission for only one application, this application is opened.
c. If the user has permission for more than one application, the FaM Proxy properties
are checked and the first available application is opened in the following order:
i. preferableMonitoringClient
ii. secondPreferableMonitoringClient
iii. The third one
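The selection order for the Sentinel case can be summarized in a few lines of code. The following Python sketch is a model of the steps above, not the product logic: the function name and the client names used in the usage example are assumptions, and the choice of "the third one" is modeled as the first remaining permitted application in alphabetical order.

```python
def choose_client(permitted, preferred, second_preferred):
    """Pick the FM application to open.

    permitted: set of application names the user has permission for and
    that are installed. preferred and second_preferred stand for the
    values of the preferableMonitoringClient and
    secondPreferableMonitoringClient properties.
    """
    if not permitted:
        return None
    if len(permitted) == 1:
        return next(iter(permitted))       # only one candidate
    for candidate in (preferred, second_preferred):
        if candidate in permitted:
            return candidate               # first available preference
    # "The third one": modeled here as the first remaining application.
    return sorted(permitted)[0]
```

For example, choose_client({"Cruiser", "Light Cruiser"}, "Light Cruiser", "Cruiser") returns "Light Cruiser".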
The URL parameters are:
field (string): The name of the alarm field to filter by (when filtering by a single
field).
value (string): The value of the alarm field to filter by (when filtering by a single
field).
timecriteria (string): Relative time (<N> <Hours, Days, Weeks, Months>), in the
format H/D/W/M<N>, where H=Hours, D=Days, W=Weeks, M=Months.
For example, W10 indicates 10 weeks.
allparents (boolean): Determines whether to open the Correlation Tree window or just
filter by the following parameters.
Set as true if you want to open the Correlation Tree window.
Set as false if you do not want to open the Correlation Tree window, but you
want to filter the records by the following parameters.
Ignore this parameter if you just want to filter by a single
field/value (for backward compatibility).
LogicID (string): The value of the LogicID field of the alarm to filter by.
ObjectID (int): The value of the ObjectID field of the alarm to filter by.
ObjectType (int): The value of the ObjectType field of the alarm to filter by.
Example:
https://dc50-dev-helix91:3600/FaMHistoryShell_FM/FaMHistoryShellActivator.jsp?activate=True&PCStatus=PARENT&ObjectID=123456&ObjectType=78&timecriteria=W10&LogicID=comcast_test_3&allparents=false&DateTimeUp=20/03/2017 16:43:31.092
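The timecriteria format (a unit letter followed by a number) is easy to validate before it is placed in a URL. The following Python sketch parses a value such as W10; the function name and the returned tuple layout are assumptions made for the example.

```python
import re

# Unit letters as defined for the timecriteria parameter.
UNITS = {"H": "Hours", "D": "Days", "W": "Weeks", "M": "Months"}


def parse_timecriteria(value):
    """Split a timecriteria value such as 'W10' into (count, unit name)."""
    match = re.fullmatch(r"([HDWM])(\d+)", value)
    if match is None:
        raise ValueError(f"invalid timecriteria: {value!r}")
    return int(match.group(2)), UNITS[match.group(1)]


weeks = parse_timecriteria("W10")
```

Here, W10 is parsed as 10 weeks, matching the example given for the parameter.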
Maintenance
J2EE Components
To verify the J2EE components are running:
1. From the Sentinel application, open the TEOCO Admin application.
2. Click System Configuration.
A list of all installed J2EE managed servers and their status appears.
FM Services
To verify that the FaMAPI Service is running:
1. Type the following command:
ps -ef | grep fam_api
2. Verify that you receive results similar to the following:
9602 9267 0 10:21:56 pts/34 0:09 connect -daemon fam_api_for_alarms.connect
9267 1359 0 10:21:50 pts/34 0:00 connect -daemon fam_api_parent.connect
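This check can be automated by scanning the ps output for the two process names shown above. The following Python sketch is an illustration under the assumption that both connect processes are expected on the host; the function name is hypothetical.

```python
# Process names taken from the expected ps output above.
EXPECTED = ("fam_api_for_alarms.connect", "fam_api_parent.connect")


def missing_fam_api_processes(ps_output):
    """Return the expected FaMAPI process names not found in ps_output."""
    lines = [
        line for line in ps_output.splitlines()
        if "grep" not in line            # ignore the grep command itself
    ]
    return [
        name for name in EXPECTED
        if not any(name in line for line in lines)
    ]


sample_output = (
    "9602 9267 0 10:21:56 pts/34 0:09 connect -daemon fam_api_for_alarms.connect\n"
    "9267 1359 0 10:21:50 pts/34 0:00 connect -daemon fam_api_parent.connect\n"
)
missing = missing_fam_api_processes(sample_output)
```

An empty list means that both processes were found. On a live server, the text could be obtained, for example, with subprocess.run(["ps", "-ef"], capture_output=True, text=True).stdout.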
Running FM Modules
All modules (N2/J2EE) are started, stopped, and restarted through the
$BASE_DIR/integration/scripts/netrak.ksh utility. It is possible to refer to a specific module (for
example, FamEngine or FAM_SERVICES) or to the entire system.
Note: Because the processes are related to one another, restarting one of them can cause
an implicit restart or reconnect in others.
TEOCO Monitor
TEOCO Monitor is the recommended tool for monitoring FM processes and FM health.
The following parameters can be monitored:
FM processes running status
Memory consumption
FaM Engine up/down status
Number of active alarms in the system
Rate of incoming events
Size of queues in the FM system
Troubleshooting
Log Files
J2EE Server and Client Log Files
For more detailed information about the content of each log file, how to locate errors in the
log files, examples of messages, and a description of how to change a log level, refer to the
Diagnostics and Troubleshooting section of the Helix Administration Guide.
Server Troubleshooting
Server Components Are Up and Functioning
Check server logs to validate that Fam/FamEngine/FamHistory/FamCache EARs are running
smoothly.
Check if these EARs suffer from memory shortage.
Hazelcast Disconnections
In some edge cases (sometimes during a heap dump after an Out Of Memory error), you may see
EARs disconnecting from the Hazelcast cluster. The messages appear in the XXX.stderr log
files.
At this point, the EAR is not part of the cluster, and it is NOT possible to access the cache
alarm information or receive alarm notifications.
If such a disconnection is not justified (a justified case is, for example, an EAR shutdown) and is not
restored shortly, you have to restart the entire FM.
Client Troubleshooting
The following sections discuss various client-related problems that may occur.
Statistics
FM Module Statistics
FM Module Stats Log
Each FM module that processes either active or history alarms at run time (that is, the FaM
Engine, FaM History, and FaM Proxy modules) writes a statistics log that periodically prints its
internal alarm processing state. This feature is active by default, and all statistics data is
written to a cyclic log file in the module’s EAR logs directory.
The stats file names for the three modules are: FamEngineQueueStats.log,
FamHistoryQueueStats.log, and FamProxyQueueStats.log.
The statistics are written at a 5-minute interval by default, which can be changed via the
relevant system property for each module:
FamEngine module system property, “fam.engine.chain.stats.interval”.
FamHistory module system property, “fam.history.chain.stats.interval”.
FamProxy module system property, “fam.proxy.chain.stats.interval”.
The Component element contains the following major data:
name attribute: the chain component’s unique name (for example, FamEventBuilder,
CommandProcessing, AlarmDataEnricher, NetworkEventHandler, AlarmFetcher, or
CommandExecution)
MsgReceived element: how many messages have been received since chain start
time.
MsgProcessed element: how many messages have been successfully processed
since chain start time.
QueueSize element: how many messages are queued and waiting to be processed.
AvgProcessTime element: the component’s average processing time.
CustomData element: a free text element with component specific stats.
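When reviewing the stats log, the most useful quick check is whether a component falls behind, that is, whether it has received more messages than it has processed. The following Python sketch shows one way to extract that from a Component element. The exact XML layout of the log is an assumption based on the attribute and element names listed above, and the sample values are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real log layout may differ.
SAMPLE = """
<Component name="AlarmDataEnricher">
  <MsgReceived>1200</MsgReceived>
  <MsgProcessed>1150</MsgProcessed>
  <QueueSize>50</QueueSize>
  <AvgProcessTime>3</AvgProcessTime>
</Component>
"""


def summarize_component(xml_text):
    """Extract the counters of one Component element into a dictionary."""
    component = ET.fromstring(xml_text.strip())
    received = int(component.findtext("MsgReceived"))
    processed = int(component.findtext("MsgProcessed"))
    return {
        "name": component.get("name"),
        "received": received,
        "processed": processed,
        "queue_size": int(component.findtext("QueueSize")),
        # Messages received but not yet successfully processed.
        "unprocessed": received - processed,
    }


summary = summarize_component(SAMPLE)
```

A value of "unprocessed" or "queue_size" that keeps growing between consecutive log intervals suggests that the component is a bottleneck in the chain.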
RejectRules: shows how many reject rules are currently active and how many
alarms were rejected. The UnRejectedAlarms element indicates the number of alarms
that were once rejected and currently are not (due to a change in either the reject
rules or the alarm data).
AlarmFetcher: handles distributed cache alarm fetching. As this component works
asynchronously, its internal queue is stated in the QueuedCommands
element.
EnrichmentRules: shows how many enrichment rules are deployed and how many
alarms were enriched by them. When queues are observed at this point, they might be
related to a slow enrichment rule using a Mediation Lookup.
DistributedCache: updates alarms in distributed caches and shows the cache sizes
of alarms and alarm-related data (such as work-logs and TTs). As this component
works asynchronously, its internal queue is stated in the QueuedCommands
element. This component also contains the AlarmsPersistencyQueue element, which
shows whether there is any DB persistency queue related to the FAM_DB.active_alarms
table. Such a queue can indicate a DB performance or availability problem.
EventPublish: sends alarm events from the FaM Engine to various FaM Proxy
instances. When queues are growing, it might be related to an event flood, memory, or
networking issues.
HistoryAlarmsPublisher: publishes events to the FM History module for DB
persistency. Queues can grow here due to memory or networking issues affecting
connectivity to the Kafka brokers.
Appendix A: Active Alarm Attributes
Ack User Login Name | The login name of the user that performed the acknowledge action | AckUserLoginName
Ack User Name | The full name of the user that performed the acknowledge action | AckUserName
Alarmed Object Entity | Type of the alarmed object | Alarmed Object Entity
Clear User Login Name | The login name of the user that performed the clear action | ClearUserLoginName
Clear User Name | The full name of the user that performed the clear action | ClearUserName
Defer End Time | The time the alarm will be undeferred | DeferEndTime
Defer Start Time | The time the alarm was deferred | DeferTime
Alarmed Object Name | The complete path of the alarmed object entity | DeviceName
Alarmed Object Type | The type of the alarmed object entity | DeviceType
Prioritize Time | Datetime when the alarm priority was raised | EscalationTime
Original Priority | The priority of the alarm before raising the priority | EscalOriginalPriority
First Ack Date | Datetime the alarm was acknowledged for the first time | FirstAckDate
Work Log | Indicates that the alarm has at least one Work Log | IsWLog
Alarm Last Action | Last action that was performed on the alarm | LastChangeAction
Alarm Last Update Time | Datetime of the last update performed on the alarm | LastChangeTime
Last Change User Login Name | The login name of the user who initiated the last alarm change | LastChangeUserLoginName
Last Change User Name | The full name of the user that initiated the last alarm change | LastChangeUserName
Last Child Change Time | The last time a child alarm was added to or removed from the alarm | LastUpdate
Module Name | MIB name for SNMP alarms or library name for others | ModuleName
Object Type | Object Type identifier (used with 'Object ID' as the alarmed object identifier) | ObjectType
Prediction Median Time | The average datetime when the predicted alarm is expected to be raised | PredictionAvgTime
Prediction Max Time | The maximum datetime when the predicted alarm is expected to be raised | PredictionMaxTime
TT User Name | The user that performed the last TT action | TTUser
TT User Login Name | The login name of the user that performed the last Trouble Ticket action | TTUserLoginName
Unmanaged Object | Alarmed object that is not managed in the Base Configuration module | Unmanaged Object
Last Work Log Date | The time the last Work Log was created | WorkLogDate
Last Work Log Type Name | Last Work Log type name | WorkLogTypeName
Last Work Log User | The full name of the user that added the last Work Log | WorkLogUser
Last Work Log User Login Name | The login name of the user that added the last Work Log | WorkLogUserLoginName
Maintenance Start Datetime | The start date and time of the earliest related maintenance job | MaintenanceStartDatetime
Maintenance End Datetime | The end date and time of the latest related maintenance job | MaintenanceEndDatetime
TT is Assigned | | TTAssigned
Appendix C: Project Active Alarm Attributes
…
Proj_Varchar_512_9
…
Proj_Varchar_255_70
…
Proj_Datetime_5
…
Proj_Int_15
…
Proj_Double_5
Appendix D: Modules Configurable Properties
FamAdmin
Property Name | Type | Mandatory | Default Value | Allowed Values | Description
FamEngine
Property Name | Type | Mandatory | Default Value | Allowed Values | Description
FamHistory
Property Name | Type | Mandatory | Default Value | Allowed Values | Description
JFam
Property Name | Type | Mandatory | Default Value | Allowed Values | Description
FamProxy
Property Name | Type | Mandatory | Default Value | Allowed Values | Description
FamAnalytics
Property Name | Type | Mandatory | Default Value | Allowed Values | Description
winfam.notifications.fadeout_duration_seconds | int | No | 6
winfam.notifications.fadeout_timeout_seconds | int | No | 4