The future of system management is automation. As the number of systems and virtual machines being managed continues to grow, and the complexity of distributed applications increases, automation is the only way to keep things running smoothly.
Specifically, we need fine-grained control of systems: the ability to configure local storage and networks, start and stop services, install software and patches, and so forth. We are looking at interactive control: making changes, seeing the results of those changes, and making further changes in response. Another aspect of interactive management is responding to changes in a system, such as a hardware failure, a file system running out of space, or an attack on the system. Interactive management may have a human in the loop, or the interaction may be driven by a script, a program, or even an advanced expert system.
This interactive manipulation complements configuration management systems such as Puppet. With a configuration management system, you put a system into a known state. With interactive manipulation you work with the system until it does what you want it to. You will usually want to use both approaches, since each has strengths and weaknesses.
This automation requires several things:
- The ability to query a system. This includes determining its configuration (hardware and software) and its current state and status. As an example, if you are monitoring the temperature of a system, is the lm-sensors service installed, configured, enabled, and currently running? (A sketch after this list illustrates this kind of query.)
- The ability to change a system. This includes things like configuring storage, configuring networks, changing firewall rules, and installing software.
- Generating alerts when something interesting happens. It is not practical to poll 1,000 systems looking for items of interest; the 1,000 systems need to tell you when something you care about happens. Going back to the lm-sensors example, you might want to trigger an alert when the CPU temperature exceeds 150 °F. You might also want to trigger an alert if the lm-sensors service fails.
- Remote operation. In general, you don’t want to put a complete management system on each managed system. You want a centralized management capability containing the business logic that manages large numbers of systems.
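To make the query, change, and alert requirements concrete, here is a minimal local sketch in Python. It is not OpenLMI code: the lm_sensors service name, the sysfs thermal path, and the 150 °F threshold are assumptions for illustration, and names vary by distribution.

```python
#!/usr/bin/env python3
"""Minimal sketch of the query / change / alert ideas on a single host.

Assumptions (not from the original text): a Linux system with systemd,
a service named "lm_sensors", and a CPU thermal zone exposed at
/sys/class/thermal/thermal_zone0/temp.
"""
import subprocess
from pathlib import Path

ALERT_THRESHOLD_F = 150.0  # alert when the CPU temperature exceeds 150 °F


def service_is_active(name: str) -> bool:
    """Query: ask systemd whether a service is currently running."""
    result = subprocess.run(["systemctl", "is-active", "--quiet", name])
    return result.returncode == 0


def restart_service(name: str) -> None:
    """Change: restart a failed service (requires sufficient privileges)."""
    subprocess.run(["systemctl", "restart", name], check=True)


def cpu_temp_fahrenheit() -> float:
    """Query: read the first thermal zone, reported in millidegrees Celsius."""
    millicelsius = int(Path("/sys/class/thermal/thermal_zone0/temp").read_text())
    return millicelsius / 1000.0 * 9 / 5 + 32


if __name__ == "__main__":
    if not service_is_active("lm_sensors"):
        print("ALERT: lm_sensors service is not running; attempting restart")
        restart_service("lm_sensors")
    temp = cpu_temp_fahrenheit()
    if temp > ALERT_THRESHOLD_F:
        print(f"ALERT: CPU temperature {temp:.1f} °F exceeds {ALERT_THRESHOLD_F} °F")
```

Note that this script polls the sensor itself; the alert requirement above calls for the managed system to push notifications instead, so the sketch only shows the threshold logic, not the delivery mechanism.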
Designing a system to support these elements leads to a management console (or management program or framework) that initiates operations on remote systems. These operations are performed by a program on the remote system. A program intended to perform an operation when called from another system is commonly called an agent.
It is straightforward to create an agent for a specific task, which tends to result in large numbers of specialized agents. Unfortunately, these agents don’t always work well with each other, come from multiple sources, have to be individually installed, and together produce a complex environment.
Building on these automation requirements, what if we create:
- A standard way to query systems.
- A standard way to change a system.
- A standard set of alerts.
- A standard remote API to perform these operations.
- A standard infrastructure to handle things like communications, security, and interoperation.
- All included with the operating system and maintained as part of the OS.
Building a system like this means creating a standard set of tools and interfaces that can be shared by any application that needs to interact with the managed system. A standard API means that applications and scripts can easily call the functions they need. A common infrastructure greatly simplifies interoperation and makes it much easier to develop management tools that touch multiple subsystems.
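As a hedged illustration of what calling such a standard remote API from a script could look like, the sketch below uses the pywbem client library to enumerate service instances over WBEM, one way a standard remote management API can be realized. The host name, credentials, and the LMI_Service class name are assumptions; the classes actually available depend on the providers installed on the managed system.

```python
#!/usr/bin/env python3
"""Sketch: querying a remote system through a standard management API.

Assumptions (not from the original text): the managed host runs a WBEM/CIM
broker with service providers installed, reachable at HOST, and exposes a
service class named LMI_Service. Requires the pywbem package.
"""
import pywbem

HOST = "https://managed-host.example.com:5989"  # hypothetical host
CREDENTIALS = ("admin", "secret")               # hypothetical credentials

# Skipping TLS verification keeps the example short; do not do this in production.
conn = pywbem.WBEMConnection(HOST, CREDENTIALS,
                             default_namespace="root/cimv2",
                             no_verification=True)

# Query: list services on the remote system and report which are running.
for service in conn.EnumerateInstances("LMI_Service"):
    state = "running" if service["Started"] else "stopped"
    print(f"{service['Name']}: {state}")
```

Because the API and object model are standard, the same pattern extends to storage, network, or software queries by enumerating different classes, without installing a separate specialized agent for each task.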
Including these tools with the OS means that applications have a known set of tools they can rely on. It also means that the tools are updated and maintained in sync with the OS, that security issues are addressed, and that there is a single place to report problems.
A system that implements these capabilities provides a solid foundation for developing automated tools for system management. “Automated tools” can mean anything from a sophisticated JBoss application that uses Business Rules and Business Process Management to automate responses to a wide range of system alerts, to a custom script that creates a specific storage configuration.
A system that implements these capabilities also provides a great foundation for building interactive client applications: clients that use a command line interface, that are built on scripts, or that provide a GUI.
These are the guiding principles for OpenLMI.