System Management for Networked Embedded Systems and Clusters
Recent years have seen a tremendous increase in the use of embedded systems and their application in mission-critical devices. This development has created a strong demand for management systems that offer advanced services currently available only in the desktop computer world. In contrast to existing management solutions, the vast number of prospective client systems, together with their mobility and limited computing power, requires an efficient, highly scalable architecture. One of the most important challenges is the effective management of the amount of information processed and the organization of the management system itself.
The SysMES architecture is being developed to meet these requirements. It is suitable for use in clusters and microcomputers as well as in embedded systems that meet certain minimum specifications: a processor, an operating system and the capability to run applications written in C or Java.
The SysMES framework is a scalable, decentralized, fault-tolerant, dynamic, rule-based tool set for the monitoring of networks of target systems. The management algorithm consists of the following steps: system and application monitoring, recognition of undesirable states, event (message) generation, local event handling on the target, event forwarding to the management framework, event handling on the management side, rule checking and automatic reaction. The delivered events provide information about the configuration of the targets, and the management layer is capable of starting an update or reconfiguring them.
The SysMES architecture is multi-tiered and consists of many management components that are based on industry standards and operate together to form a complete management system. These standards ensure interoperability and manufacturer independence. They are XML for exchanging data, Enterprise JavaBeans (EJB) technology running on JBoss application servers for the implementation of the server functionality, and the Common Information Model (CIM) from the Distributed Management Task Force (DMTF) for building the object model.
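The target-side steps listed above (monitoring, recognition of undesirable states, event generation, local handling, forwarding) can be pictured with a minimal sketch. It is an illustration only, not SysMES code: the names `Event`, `monitor`, `handle_locally` and `process`, the threshold limits, and the rule that warnings are handled locally while critical events are forwarded are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Event:
    """Hypothetical event message generated on a target system."""
    target: str
    metric: str
    value: float
    severity: str  # "warning" or "critical" (assumed levels)


def monitor(target: str, readings: Dict[str, float],
            limits: Dict[str, float]) -> List[Event]:
    """Monitoring and recognition of undesirable states.

    A reading above its limit produces an event; readings without a
    configured limit are ignored.
    """
    events = []
    for metric, value in readings.items():
        limit = limits.get(metric)
        if limit is not None and value > limit:
            # Assumed policy: more than 20 % over the limit is critical.
            severity = "critical" if value > 1.2 * limit else "warning"
            events.append(Event(target, metric, value, severity))
    return events


def handle_locally(event: Event) -> bool:
    """Local event handling on the target.

    Returns True if the event was dealt with locally. In this sketch
    only warnings are handled on the target itself.
    """
    return event.severity == "warning"


def process(target: str, readings: Dict[str, float],
            limits: Dict[str, float],
            forward: Callable[[Event], None]) -> List[Event]:
    """Run one monitoring cycle and forward unhandled events."""
    forwarded = []
    for event in monitor(target, readings, limits):
        if not handle_locally(event):
            forward(event)  # hand over to the management framework
            forwarded.append(event)
    return forwarded


if __name__ == "__main__":
    sent: List[Event] = []
    process("node01",
            {"cpu_temp": 95.0, "disk_used": 0.85, "load": 0.5},
            {"cpu_temp": 70.0, "disk_used": 0.8},
            sent.append)
    for event in sent:
        print(event)
```

Handling low-severity events on the target keeps traffic to the management servers low, which matters when the number of clients is large and their computing power is limited.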
SysMES Architecture Overview
SysMES Properties and Functionality
SysMES Management Client
Environment Modelling for System Management
The main area of interest is the storage and maintenance of the configuration data in an object-oriented manner. Runtime data and statistical information are not covered because they are already handled by a vast array of management systems and protocols.
The computer-based network environment consists of two layers. The first layer is the physical view, which describes the systems, their features and parameters, and the connections between these systems. The second layer is the logical view, which describes the non-physical elements.
The standards used are the Common Information Model (CIM) and Web-Based Enterprise Management (WBEM), both developed by the Distributed Management Task Force (DMTF), and the Unified Modeling Language (UML), developed by the Object Management Group (OMG).
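The two-layer view described above can be pictured as a small object model. The sketch below is only an illustration under stated assumptions: the class names `PhysicalSystem`, `Connection`, `LogicalElement` and `Environment` are hypothetical and are not taken from the CIM schema or from SysMES.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PhysicalSystem:
    """Physical view: a system with its features and parameters."""
    name: str
    parameters: Dict[str, str] = field(default_factory=dict)


@dataclass
class Connection:
    """Physical view: a link between two systems."""
    source: str
    destination: str
    medium: str


@dataclass
class LogicalElement:
    """Logical view: a non-physical element, e.g. a service."""
    name: str
    hosted_on: List[str] = field(default_factory=list)


@dataclass
class Environment:
    """Configuration model holding both layers."""
    systems: Dict[str, PhysicalSystem] = field(default_factory=dict)
    connections: List[Connection] = field(default_factory=list)
    logical: List[LogicalElement] = field(default_factory=list)

    def add_system(self, system: PhysicalSystem) -> None:
        self.systems[system.name] = system

    def connect(self, source: str, destination: str, medium: str) -> None:
        # Reject connections that refer to systems not in the model.
        for name in (source, destination):
            if name not in self.systems:
                raise KeyError(f"unknown system: {name}")
        self.connections.append(Connection(source, destination, medium))

    def neighbours(self, name: str) -> List[str]:
        """Names of all systems directly connected to the given one."""
        result = set()
        for c in self.connections:
            if c.source == name:
                result.add(c.destination)
            elif c.destination == name:
                result.add(c.source)
        return sorted(result)
```

Only configuration data appears in the model; runtime values and statistics are deliberately left out, in line with the scope stated above.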
Rule Based Event Management Model
The SysMES Rule Evaluation Subsystem
The SysMES framework uses rules to automate the management of a distributed environment. Distributed components generate events to indicate their current state and send them to management servers, where those events are evaluated against predefined rules. If the conditions of a rule are fulfilled, the related management actions are triggered.
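The evaluation step can be pictured with a minimal sketch: each rule pairs a condition with an action, and every incoming event is checked against all rules. The names `Rule` and `evaluate`, the dictionary layout of an event, and the example rules are assumptions made for illustration; they do not describe the actual SysMES rule format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# In this sketch an event is a plain dictionary (assumed layout).
EventDict = Dict[str, object]


@dataclass
class Rule:
    """Hypothetical rule: a condition and the action it triggers."""
    name: str
    condition: Callable[[EventDict], bool]
    action: Callable[[EventDict], str]


def evaluate(event: EventDict, rules: List[Rule]) -> List[str]:
    """Check an event against all rules.

    Returns the results of the actions whose rule conditions were
    fulfilled, in rule order.
    """
    return [rule.action(event) for rule in rules if rule.condition(event)]


# Example rules; metrics, thresholds and actions are invented.
RULES = [
    Rule(
        name="overheat",
        condition=lambda e: e["metric"] == "temperature" and e["value"] > 80,
        action=lambda e: f"shutdown {e['source']}",
    ),
    Rule(
        name="disk-full",
        condition=lambda e: e["metric"] == "disk_used" and e["value"] > 0.9,
        action=lambda e: f"clean-tmp {e['source']}",
    ),
]

if __name__ == "__main__":
    incoming = {"source": "node07", "metric": "temperature", "value": 91}
    for result in evaluate(incoming, RULES):
        print(result)
```

Here the actions only return a description of what would be done; in a real deployment they would trigger the corresponding management operation.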
SysMES @ ALICE HLT
The ALICE heavy-ion particle physics experiment is currently being built at CERN near Geneva. It will use a PC cluster of 400 dual-processor machines for the last stages of the data readout process and a network of 400 microcomputers for the configuration and control of the cluster nodes. One of the most important objectives in such experiments is to guarantee that the devices in use run correctly throughout the lifetime of the experiment. A second aspect is the extremely high availability and reliability required of the applications being run. The third point to be considered is the demand for configuration management, which arises from the use of configurable hardware accelerators (FPGAs) on the cluster nodes. The SysMES architecture is being used to fulfil these requirements.
Several secondary components are in use or under development:
Lemon Monitoring of Systems and Applications
The LHC Era Monitoring (LEMON) system is a CERN-developed system originally designed for monitoring the LHC Computing Grid (LCG). It has since expanded and become one of the most commonly used systems for cluster monitoring, owing to its highly scalable design, good visualisation capabilities and open source nature.
SNMP Monitoring of Racks and Switches
Environmental parameters (such as temperature or mains voltage) are observed by a Rack Monitoring System (RMS). When critical events occur, the RMS notifies SysMES via SNMP traps. The measured values and the network traffic are passed continuously to Lemon via SNMP.
Dynamic Cluster Configuration
One of the main purposes of the HLT cluster is the online reconstruction of detector data. Therefore, several different analysis components run on the nodes according to configurations specific to a given experiment. These configurations can be modified at runtime in order to optimize the cluster usage. A framework is being developed which can be used to change the cluster configuration dynamically.
Cluster Resource Manager
As part of the framework for the dynamic cluster configuration, a resource manager is being developed. Its purpose is to provide the information necessary for creating new configurations, such as the current status of the cluster and the available resources. Resources can then be allocated or reserved and are marked busy until they are freed.
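The allocate, busy, free cycle described above can be pictured with a minimal sketch. The class `ResourceManager` and its methods are hypothetical names chosen for this example and do not reflect the interface of the resource manager under development; treating whole nodes as the unit of allocation is also an assumption.

```python
from typing import Dict, Iterable, List, Optional


class ResourceManager:
    """Tracks which cluster nodes are free and which are busy."""

    def __init__(self, nodes: Iterable[str]) -> None:
        # Maps node name to its current owner; None means free.
        self._owner: Dict[str, Optional[str]] = {n: None for n in nodes}

    def available(self) -> List[str]:
        """Names of all nodes that are currently free, sorted."""
        return sorted(n for n, o in self._owner.items() if o is None)

    def status(self) -> Dict[str, str]:
        """Current status of every node: 'free' or 'busy'."""
        return {n: "free" if o is None else "busy"
                for n, o in self._owner.items()}

    def allocate(self, owner: str, count: int) -> List[str]:
        """Reserve `count` free nodes for `owner` and mark them busy."""
        free = self.available()
        if count > len(free):
            raise RuntimeError(
                f"requested {count} nodes, only {len(free)} available")
        chosen = free[:count]
        for node in chosen:
            self._owner[node] = owner
        return chosen

    def free(self, owner: str) -> List[str]:
        """Release all nodes held by `owner` and return their names."""
        released = sorted(n for n, o in self._owner.items() if o == owner)
        for node in released:
            self._owner[node] = None
        return released


if __name__ == "__main__":
    manager = ResourceManager(["node01", "node02", "node03"])
    print(manager.allocate("config-A", 2))
    print(manager.status())
    print(manager.free("config-A"))
```

A new configuration would first query `available()` and `status()`, then allocate what it needs; the nodes stay busy until `free()` is called for the same owner.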
EPICS / SysMES
EPICS is an internationally developed software infrastructure used for controlling devices attached to networks. It is used all around the world to control equipment in facilities such as particle accelerators, telescopes and large experiments. It began in 1989 as a joint development by the Los Alamos and Argonne National Laboratories, but because it is distributed under an open source license, it is now being developed cooperatively at almost every institution at which it is used. It has a vast knowledge and support base, with frequent conferences, an online support channel and many modules and extensions. It has become a versatile, scalable and diverse distributed control system, which makes it well suited to adaptation as a monitoring system.
EPICS/SNMP based Management of the HLT-Cluster