www.optumis.com
Sanjay Raina
December, 2010
Contents
Introduction
Problem Statement
Current Practice
Optumis Concerto
Implementation
Business Benefits
Summary
References

Introduction
IT Systems Management tools and technologies continue to be crucial to the efficient delivery of IT services. The IT landscape is evolving all the time and there are increasing demands placed on the management of IT. Systems Management tools have failed to keep pace with these developments and often fail to deliver the potential value promised by the vendors. The main reason behind this is that IT Systems Management tends to be disjointed and silo-based. This white paper presents a holistic approach to IT Systems and Service Management. The approach advocates a declarative, data-driven framework for specifying management structures and policy. This, combined with the notion of abstraction of management data, results in an integrated paradigm that allows stakeholders to make effective decisions about the complex IT infrastructure and applications in a coherent and consistent way.
Problem Statement
Today, businesses rely heavily on IT to keep revenue streams flowing and to run day-to-day back-office functions. It is therefore more critical than ever that the tools and technologies that manage the IT infrastructure are effective in delivering high levels of availability and productivity whilst keeping costs down.
The same principle is applied to the subsequent layers shown above. The next layer provides an aggregated view of the management data to applications. At this layer, management data from multiple element managers can be combined to provide a more analytical interpretation of the data. Note that the context is still technology related. The next layer titled Business Service Management consists of abstraction that provides a business and

Due to the large number of states and state transitions, representing a complex end to end Enterprise Management system as a single state machine is an impossible task. In the past, techniques such as State Charts [6] have been developed to overcome the state explosion problem. We have used the concept of cooperating state machines. Each state machine represents one Enterprise Management function or process and they link together to form an end to end model. Fig. 3 below shows a chain of state
machines representing the detection, reporting and resolution of a fault.

FIGURE 3. COOPERATING STATE MACHINES

Data manipulation
Efficient manipulation of data is an important aspect of any abstract or physical machine. Operands are commonly used in an instruction set (of an abstract or real machine) and allow instructions to perform operations efficiently. Management data tends to be passed around quite frequently amongst various components in a Systems Management solution and it is essential that the data is optimally formatted and structured. This is addressed by employing two other concepts widely used in general computing: normalization and encapsulation.

Normalization refers to the transformation of the structure of the management data into a canonical form. Without normalization, considerable effort has to be expended in interpreting the data emanating from various management tools.

Encapsulation is most prominently used by the TCP/IP protocol suite to provide abstraction of network protocols and services. As shown in Fig. 4 below, data packets are encapsulated with headers at each layer. The TCP layer adds a TCP/UDP header to identify the source and destination access point. The IP header adds routing information and the data link layer adds information about the physical media. Each layer acts as a provider of service that is consumed by the layer above.

[Figure: Application (Data); Transport (TCP Header, Data); Internet (IP Header, TCP Data); Link (Frame Header, Frame Data)]
FIGURE 4. THE WELL-KNOWN NETWORK PROTOCOL STACK

A similar approach can be applied to Enterprise Management data that is successively enriched by layers in the stack. Each consuming layer enriches the data further before providing it to the next layer.

FIGURE 5. EVENT MANAGEMENT DATA BEING SUCCESSIVELY ENRICHED

Enrichment involves filling in the missing information in the normalized event format as described above. The information can be supplied by external information sources such as the CMDB, an operational data store or the Incident Management database.

The standardization and enrichment of management data provide a number of benefits:
• Management Systems tend to generate large volumes of data, much of which is noise. This data needs to be aggregated and correlated to pinpoint the root cause. Standardization of management
data formats plays a crucial role in this regard. The standardized format makes matching of events efficient. The rules for duplicate detection and suppression become simplified. Detection and prevention of event storms is also simplified. It is also possible to apply more granular rules, e.g. one can put very specific alerts from a particular resource or from a whole datacenter into maintenance.
• Due to the added context information available, it is easier to perform business impact management. Enrichment of management data enables more accurate and automated processing of events within a management system. New, service impacting events can be generated based on location or service information from the CMDB or Incident information from the Incident database.
• Management data often traverses a number of boundaries when various functions are performed. The information conveyed by the data is often interpreted by a multitude of systems and personnel. It helps a great deal if the information being passed around is consistently structured.

Declarative, data-driven programming
Most programs are written in an imperative paradigm where the developer instructs the computer how to get a certain task done. In declarative programming, on the other hand, the developer simply states what is to be achieved, and leaves it up to the system to get the job done. XML and related markup languages are examples of the declarative paradigm. Specialized configuration files can also be considered declarative, and even though they are not programming languages, they do enable computation based on what rather than how. Another example of a declarative language is Prolog, where programs are specified as facts and rules in a knowledge base. An inference engine then attempts to find solutions based on the rules and facts.

Enterprise Management tools are generally programmed in an imperative manner using proprietary rule bases and databases. When implementing Enterprise Management solutions, a significant amount of effort is spent on encoding the control flow, i.e. specifying how particular tasks are to be accomplished. We advocate a declarative approach where the emphasis is on the what rather than the how. So, rather than specifying how to monitor a disk in a particular tool, using tool specific data structures, we can specify the monitoring parameters in an abstract form, as shown below.

<DiskThresh>
  <Hostname>Ferrari</Hostname>
  <Diskname>C:</Diskname>
  <PctUsedWarn>90</PctUsedWarn>
  <PctUsedCrit>95</PctUsedCrit>
</DiskThresh>

A driver component then takes this abstract notation and converts it into tool specific instructions. The declarative approach can be applied right across the board. The fragment below shows how an alert matching certain criteria can be specified to be routed to a resolver group.

<IncidentProfile>
  <hostname> Ferrari </hostname>
  <resource> DISK </resource>
  <threshname> PctUsed </threshname>
  <threshop> LessThan </threshop>
  <threshval> 95 </threshval>
  <resolver_group>GTI_GB_WENG</resolver_group>
  <priority>P2</priority>
  <scim>Server OS, EMEA Intel</scim>
</IncidentProfile>

Similarly, the fragment below shows how an alert matching certain criteria can be specified to be suppressed during a maintenance window.

<MaintenanceMode>
  <hostname> Ferrari </hostname>
  <resource> DISK </resource>
  <threshname> PctUsed </threshname>
  <threshop> LessThan </threshop>
  <threshval> 95 </threshval>
  <suppressstart> 3-Aug-2010 11:00:00 </suppressstart>
  <suppressend> 27-Dec-2010 12:00:00 </suppressend>
  <suppressday>Sunday</suppressday>
  <suppresshour>05:00--11:00</suppresshour>
</MaintenanceMode>

This has a major advantage in that the policies and rules for management have to be specified just once. The underlying tools can then be replaced at any time without having to rewrite the policies and rules for the new tool. Integration to various tools is done via SOAP/WSDL or tool specific APIs.

Another advantage is that the management data in declarative form is easier to understand and you don’t need specialists in the different tools to manage and maintain the management data. The data can be managed by a wider section of the IT service delivery organization rather than just the specialists.

Finally, there is the advantage that all data can now be made available to personnel, based on their role, for configuration and reporting purposes. A user can update monitoring, maintenance windows, enrichment data, Incident resolver group information, notifications calendar and calling tree, all in one place.

Implementation
Although the Enterprise Management Abstract Machine covers a wide range of functions, its implementation is expected to be a veneer of software that runs on top of existing tools and systems. We do not intend to reinvent well established functions of Systems Management and most of the heavy lifting is expected to be done by existing tools and systems. This section describes how the key aspects described in the previous section can be realized. Two scenarios are outlined to demonstrate the use of the concepts discussed.

In common with other abstract (and indeed physical) machines, the operation of EMAM is characterized by:
• A workflow component that executes the logic of the computation being performed. This may take the form of a program of instructions compiled and processed by a CPU or, in the case of an operating system, a sequence of processes being scheduled from a work queue. In the case of EMAM, program execution takes the form of a sequence of state machines.
• Operands used by the instructions in a program. These take the form of local storage (registers or stack) in conventional machines. In the EMAM these operands are typically alert data, incident records, change records etc.
• Reference data is used by the workflow component. This is usually general purpose storage in a conventional machine where the results of computation are stored. In the EMAM, the reference data is typically operational configuration data stored in a configuration/operational data store.

State-machine Workflows
A program in the EMAM is expressed in the form of a workflow of state machines. The control of the program propagates through state machines, with one state machine triggering another. The programming of the EMAM takes place by specifying the sequence in which these state machines are triggered. The program is specified declaratively, in the form of a database table or an XML based markup. Table 1 below shows a sample workflow specification.

TABLE 1. STATE TRANSITION TABLE
State Machine   Current State   Event Condition   Next State   Next State Machine
Monitor         Idle            Check             Checking     Monitor
Monitor         Checking        Breach            Alerted      Monitor
Monitor         Checking        NotBreached       Idle         Monitor
Monitor         Alerted         DupDetected       Duplicate    Monitor
Monitor         Alerted         NotDup            Unique       Normalize
Monitor         Duplicate       Drop              Idle         Monitor
Normalize       "               "                 "            "
"               "               "                 "            "

product [9]. The fragment below shows a XAML representation of a state machine.

<StateMachineWorkflowActivity
    x:Class="EMAMWorkflow.Monitor"
    Name="Monitor"
    InitialStateName="Idle"
    xmlns="http://schemas.microsoft.com/winfx/2006/xaml/workflow"
    xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml">
  <StateActivity x:Name="Idle">
    <EventDrivenActivity x:Name="CheckThreshold">
      <HandleExternalEventActivity
          x:Name="HandleCheckThreshold"
          EventName="Check">
      </HandleExternalEventActivity>
      <CodeActivity
          x:Name="DoCheckCode"
          ExecuteCode="DoCheckCode_ExecuteCode">
      </CodeActivity>
      <SetStateActivity
          x:Name="SetChecking"
          TargetStateName="Checking">
      </SetStateActivity>
    </EventDrivenActivity>
  </StateActivity>
  …………
</StateMachineWorkflowActivity>

The above XAML code can be loaded directly into Microsoft Workflow Foundation to appear as in Fig. 6 below.
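To make the table-driven idea concrete, the transitions listed in Table 1 can be held as plain data and executed by a very small interpreter. The following is only a minimal sketch in Python, not the EMAM implementation: the dictionary layout, the function names `step` and `run`, and the spellings `NotDup` and `Normalize` for the hand-off row are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Key: (state machine, current state, event condition)
# Value: (next state, next state machine)
# Rows follow Table 1; the data layout itself is an assumption.
TRANSITIONS: Dict[Tuple[str, str, str], Tuple[str, str]] = {
    ("Monitor", "Idle", "Check"): ("Checking", "Monitor"),
    ("Monitor", "Checking", "Breach"): ("Alerted", "Monitor"),
    ("Monitor", "Checking", "NotBreached"): ("Idle", "Monitor"),
    ("Monitor", "Alerted", "DupDetected"): ("Duplicate", "Monitor"),
    ("Monitor", "Alerted", "NotDup"): ("Unique", "Normalize"),
    ("Monitor", "Duplicate", "Drop"): ("Idle", "Monitor"),
}


def step(machine: str, state: str, event: str) -> Tuple[str, str]:
    """Look up one transition and return (next_state, next_machine)."""
    key = (machine, state, event)
    if key not in TRANSITIONS:
        raise ValueError(f"no transition defined for {key}")
    return TRANSITIONS[key]


def run(events: List[str], machine: str = "Monitor",
        state: str = "Idle") -> List[Tuple[str, str]]:
    """Feed events through the table and record (machine, state) after each."""
    trace = [(machine, state)]
    for event in events:
        state, machine = step(machine, state, event)
        trace.append((machine, state))
    return trace


if __name__ == "__main__":
    # A threshold breach that is not a duplicate hands control to Normalize.
    for machine, state in run(["Check", "Breach", "NotDup"]):
        print(f"{machine}: {state}")
```

Because the control flow lives entirely in the table, changing the workflow means editing rows rather than code, which is the property the declarative approach relies on.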
Normalize
In this step, the event fields are normalized into a canonical form. The idea is that no matter what tool or method is used to detect the fault, its representation is the same, in an abstract form and not dependent on the underlying tool. Table 3 below shows alert data in normalized form.

TABLE 3. NORMALIZED ALERT DATA

The fields can now be used consistently to perform matching and analytics at various levels. These fields serve as a key in matching against the different types of management data.

Create Problem Ticket
Based on the mapping table, and using the alert data, a new problem ticket is created. The problem ticket follows a standard form just like the alert, to ensure consistency. Whether an alert was generated automatically as above or the ticket created manually by a Service Desk operator, the representation should be the same. The problem ticket now forms the basis for tracking the alert and is used when performing escalation etc.

Escalation
Escalation is a core function of the Incident Management process. The escalation function
can be performed on the problem ticket using escalation data in a table such as below.

TABLE 5. ESCALATION TABLE
Priority          P1; P2; P3; P4
Level of Impact   High, Critical, Fatal; Production severely impacted; Degraded operations; Minimal impact
2 hrs             First response
4 hrs             Work around; First response
24 hrs            Mgmt notification; Work around; First response
48 hrs            Mgmt notification; Work around; First response
1 wk              Resolution; Mgmt notif; Work around
2 wks             Resolution
3 wks             Resolution
Release           Resolution

Notification
Based on a calling tree and calendar information, the problem ticket can generate notifications.

Create Change Ticket
Once the right personnel have been notified and the resolution identified, a change record is created to perform the change. In our example here, the change involves a change to the monitoring thresholds as it was deemed to be a spurious alert. The change request follows the change management process, including appropriate reviews, approvals and assignment of change implementers.

Update monitoring threshold
Once the change has been approved and the implementer notified, the monitoring threshold is updated in the database. Note that no change has been made to the monitoring tool or any rule sets, and such a change can be performed by a non-specialist since it is a simple data change.

Close change, incident and alert
The final step in this sequence is to mark the Change as implemented and close the Incident. The workflow will automatically close or clear the alert in the monitoring tools.

Scenario 2: VM server provisioning
This scenario depicts another common situation of requesting and provisioning a virtual server. As with the previous scenario, the management solution consists of a series of state machines.

The state machine workflow is outlined in Table 6 below with the associated operand and reference data.

TABLE 6. STATE MACHINE WORKFLOW FOR VM SERVER PROVISIONING
State Machine                   Operand                          Ref Data
Create Request                  Service Request
Check Service Catalog           Service Request                  CMDB, Service Catalog
Check capacity                  Service Request                  CMDB
Create Change Request           Service Request, Change Ticket   CMDB
Provision VM                    Change Ticket                    CMDB
Close Change, Service Request   Change Ticket, Service Request

Create request
A Service Request is created manually by a requester. As before, the request is turned into a standardized form so as to make its processing easier. This state machine workflow routes the request to individuals in the organization for action, alerts the manager as necessary when the current owner does not respond to the request, and escalates or transfers the request to the next level of support. At this stage only a few fields such as request number and request owner are populated in the service request.

Supplement information from Service Catalog
This step looks up the Service Catalog to fill in the details about the servers. This step is comparable to the Enrichment step in the previous scenario. Additional attributes include response deadline, server asset data etc.
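The Service Catalog lookup can be pictured as merging catalog attributes into the sparsely populated Service Request. The sketch below is illustrative only, in Python: the field names (`request_number`, `owner`, `service`, `response_deadline_hrs`, and the sizing fields), the catalog entries and the function name `enrich_request` are all hypothetical and not taken from any particular tool.

```python
from typing import Any, Dict

# Hypothetical Service Catalog keyed by service name; values are invented.
SERVICE_CATALOG: Dict[str, Dict[str, Any]] = {
    "vm-small": {"response_deadline_hrs": 48, "cpu_cores": 2,
                 "memory_gb": 4, "storage_gb": 50},
    "vm-large": {"response_deadline_hrs": 72, "cpu_cores": 8,
                 "memory_gb": 32, "storage_gb": 500},
}


def enrich_request(request: Dict[str, Any],
                   catalog: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Return a copy of the request supplemented with catalog attributes.

    Fields already present on the request take precedence over catalog
    defaults, so the step only fills in what is missing.
    """
    service = request["service"]
    if service not in catalog:
        raise KeyError(f"service {service!r} not found in catalog")
    enriched = dict(catalog[service])  # start from the catalog defaults
    enriched.update(request)           # existing request fields win
    return enriched


if __name__ == "__main__":
    # Only a few fields are populated when the request is first created.
    request = {"request_number": "SR-1", "owner": "alice",
               "service": "vm-small"}
    print(enrich_request(request, SERVICE_CATALOG))
```

Returning a new dictionary rather than mutating the input keeps the original request available for audit, which is a design choice of this sketch rather than something the text prescribes.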
Check capacity
Once the service request is sufficiently qualified, the next step checks there is adequate capacity on the physical infrastructure. Checks are performed to determine CPU, Memory and Storage capacity and appropriate personnel notified, if necessary.

Create and manage Change Ticket
Once the right personnel have been notified and the checks performed, a Change Ticket is created to perform the change.

Provision VM
This is essentially a manual step, in which the implementer creates the Virtual Machine.

Close change and service request
The final step in this sequence is to mark the Change as implemented and close the corresponding Service Request.

Business Benefits
The integrated approach to Enterprise Systems Management provides a number of key related benefits to the business.
• The solution enables optimal use of technology and human resources to deliver significant cost reduction in managing IT systems.
• Standardisation and systematic reuse of processes and procedures leads to increased automation and efficient practice.
• The solution significantly improves productivity, allowing support staff to improve service delivery and add value rather than constantly fire fighting.

Summary
The approach described in this white paper is based on ideas and principles widely used in general computing to overcome the problem of complexity and inter-operability. The approach results in a more holistic solution to the problem of Enterprise Management. A concept of Enterprise Management Abstract Machine is presented that utilizes state machine workflows and declarative, data-driven programming to decouple management procedures and data from the underlying tools. Such an approach results in a federated management model that enables optimal use of people, processes and technology. Management applications and processes can be implemented quickly and efficiently, without getting bogged down by the mechanics of the tools.

References
[1] BMC Patrol, http://www.bmc.com/products/product-listing/ProactiveNet-Performance-Management.html
[2] CA, http://www.ca.com/us/products.aspx
[3] HP OpenView Operations, https://h10078.www1.hp.com/cda/hpms/display/main/hpms_home.jsp?zn=bto&cp=1_4011_100
[4] IBM Tivoli, http://www.ibm.com/software/tivoli
[5] IBM, Tivoli Management Framework, http://www-01.ibm.com/software/tivoli/products/mgt-framework
[6] D. Harel. Statecharts: a visual formalism for complex systems. Science of Computer Programming 8:231-274. North-Holland 1987.
[7] Macehiter Ward-Dutton, The New Face of IT Service Management, 2007.
[8] Microsoft System Center Operations Manager, http://www.microsoft.com/systemcenter/en/us/operations-manager.aspx
[9] Microsoft Windows Workflow Foundation, http://msdn.microsoft.com/en-us/library/ms735921(VS.90).aspx
[10] Office of Government Commerce: Best Management Practice, IT Service Management, http://www.best-management-practice.com/IT-Service-Management-ITIL