Mark Hiscock
Software Engineer
IBM Hursley Park Lab
United Kingdom
Simon Gormley
Software Engineer
IBM Hursley Park Lab
United Kingdom
May 11, 2005
Copyright International Business Machines Corporation 2005. All rights reserved.
This whitepaper explains how you can easily configure and achieve high availability
using IBM's enterprise messaging product, WebSphere MQ V5.3 and later. This
paper is intended for:
o Systems architects who make design and purchase decisions for the IT
infrastructure and may need to broaden their designs to incorporate
HA.
o System administrators who wish to implement and configure HA for
their WebSphere MQ environment.
Table of Contents
1. Introduction
2. High availability
3. Implementing high availability with WebSphere MQ
3.1. General WebSphere MQ recovery techniques
3.2. Standby machine - shared disks
3.2.1. HA clustering software
3.2.2. When to use standby machine - shared disks
3.2.3. When not to use standby machine - shared disks
3.2.4. HA clustering active-standby configuration
3.2.5. HA clustering active-active configuration
3.2.6. HA clustering benefits
3.3. z/OS high availability options
3.3.1. Shared queues (z/OS only)
3.4. WebSphere MQ queue manager clusters
3.4.1. Extending the standby machine - shared disk approach
3.4.2. When to use HA WebSphere MQ queue manager clusters
1. Introduction
With an ever-increasing dependence on IT infrastructure to perform critical business
processes, the availability of this infrastructure is becoming more important. The
failure of an IT infrastructure results in large financial losses, which increase with the
length of the outage [5]. The solution to this problem is careful planning to ensure that
the IT system is resilient to any hardware, software, local, or system-wide failure. This
capability is termed resilience computing, which addresses the following topics:
o High availability
o Fault tolerance
o Disaster recovery
o Scalability
o Reliability
o Workload balancing and stress
2. High availability
The HA nature of an IT system is its ability to withstand software or hardware failures
so that it is available as much of the time as possible. Ideally, despite any failure that
may occur, this would be 100% of the time. However, factors both planned and
unplanned prevent this from being a reality for most production IT infrastructures.
These factors lead to periods of unavailability, so availability is measured as the
percentage of the year for which the system was available. For example:
Figure 1. Number of 9s availability per year

Availability %     Downtime per year
99                 3.65 days
99.9               8.76 hours
99.99              52.6 minutes
99.999             5.26 minutes
99.9999            30.00 seconds
Figure 1 shows that a 30-second outage per year is called "six 9s" availability
because of the percentage of the year for which the system was available.
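The downtime figures in Figure 1 follow directly from the availability percentage. As a worked example for five 9s, taking a year as 365 days (31,536,000 seconds):

\[
t_{\text{down}} = (1 - A)\, t_{\text{year}} = (1 - 0.99999) \times 31\,536\,000\ \text{s} \approx 315\ \text{s} \approx 5.26\ \text{minutes}
\]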
Factors that cause a system outage, and so reduce the number of 9s of uptime, fall into
two categories: planned and unplanned. Planned disruptions are either systems
management (upgrading software or applying patches) or data management (backup,
retrieval, or reorganization of data). Conversely, unplanned disruptions are system
failures (hardware or software failures) or data failures (data loss or corruption).
Maximizing the availability of an IT system means minimizing the impact of these
failures on the system. The primary method is the removal of any single point of
failure (SPOF), so that should a component fail, a redundant or backup component is
ready to take over. Also, to make enterprise messaging solutions highly available, the
software's state and data must be preserved in the event of a failure and made
available again as soon as possible. The preservation and restoration of this data
removes it as a single point of failure in the system.
Some messaging solutions remove single points of failure, and make software state
and data available, by using replication technologies. These may take the form of
asynchronous or synchronous replication of data between instances of the software in
a network. However, these approaches are not ideal: asynchronous replication can
cause duplicated or lost data, and synchronous replication incurs a significant
performance overhead.
By reading each section, you can select the best HA methodology for your scenario.
This paper uses the following terminology:
o Machine: A computer running an operating system.
o Queue manager: A WebSphere MQ queue manager that contains queue and log data.
o Server: A machine that runs a queue manager and other third-party services.
o Private message queues: Queues owned by a particular queue manager and
accessible, via WebSphere MQ applications, only when the owning queue manager is
running. These queues are to be contrasted with shared message queues (explained
below), which are a particular type of queue available only on z/OS.
o Shared message queues: Queues that reside in a Coupling Facility and are
accessible by a number of queue managers that are part of a Queue Sharing Group.
They are available only on z/OS and are discussed later.
The standby machine is ready to read the queue manager data and logs from the
shared disk and to assume the IP address of the primary machine [3].
A shared external disk device is used to provide a resilient store for queue data and
queue manager logs, so that replication of messages is avoided. This preserves the
once-and-once-only delivery characteristic of persistent messages. If the data were
replicated to a different system, the messages stored on the queues would be
duplicated on the other system, and once-and-once-only delivery could not be
guaranteed.
For instance, if data were replicated to a standby server and the connection between
the two servers failed, the standby would assume that the master had failed, take over
the master server's role, and start processing messages. However, because the master
is still operational, messages are processed twice and duplicate messages occur. This is
avoided when using a shared disk because the data exists in only one physical
location and concurrent access is not allowed.
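For illustration, the following minimal MQI sketch (C) puts a persistent message; persistence is what ties a message's recoverability to the queue manager's log on the shared disk. The queue manager name QM1 and queue name Q1 are hypothetical, and error handling is abbreviated.

```c
#include <stdio.h>
#include <string.h>
#include <cmqc.h>                         /* WebSphere MQ MQI header */

int main(void)
{
    MQHCONN  hConn;                       /* connection handle */
    MQHOBJ   hObj;                        /* queue handle */
    MQOD     od  = {MQOD_DEFAULT};        /* object descriptor */
    MQMD     md  = {MQMD_DEFAULT};        /* message descriptor */
    MQPMO    pmo = {MQPMO_DEFAULT};       /* put-message options */
    MQLONG   cc, rc;
    MQCHAR48 qmName = "QM1";              /* hypothetical queue manager */
    char     msg[] = "example payload";

    MQCONN(qmName, &hConn, &cc, &rc);
    if (cc == MQCC_FAILED) { printf("MQCONN failed, reason %d\n", (int)rc); return 1; }

    strncpy(od.ObjectName, "Q1", MQ_Q_NAME_LENGTH);   /* hypothetical queue */
    MQOPEN(hConn, &od, MQOO_OUTPUT | MQOO_FAIL_IF_QUIESCING, &hObj, &cc, &rc);

    /* Persistent: the queue manager hardens the message to its log
       (held on the shared disk) before the put completes, so the
       standby machine taking over that disk can recover it.        */
    md.Persistence = MQPER_PERSISTENT;
    MQPUT(hConn, hObj, &md, &pmo, (MQLONG)strlen(msg), msg, &cc, &rc);
    if (cc == MQCC_FAILED) printf("MQPUT failed, reason %d\n", (int)rc);

    MQCLOSE(hConn, &hObj, MQCO_NONE, &cc, &rc);
    MQDISC(&hConn, &cc, &rc);
    return 0;
}
```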
The external disk used to store queue manager data should also be RAID-1 enabled to
prevent it being a single point of failure (SPOF) [8]. The disk device may also have
multiple disk controllers and multiple physical connections to each of the machines, to
provide redundant access channels to the data. In normal operation, the shared disk is
mounted by the master machine, which uses the storage to run the queue manager in
the same way as if it were a local disk, storing both the queues and the WebSphere
MQ logs.
In larger installations, where several resource groups exist and more than one server
needs to be made highly available, it is possible to use one backup machine to cover
several active servers. This setup is known as an n+1 configuration, and it reduces
redundant hardware costs because the servers do not each have a dedicated backup
machine; for example, five active servers protected in dedicated active-standby pairs
require five backup machines, whereas an n+1 configuration requires only one.
However, if several servers fail at the same time, the single backup machine may
become overloaded. These savings must therefore be weighed against the potential
cost of more than one server failing, and more than one backup machine being
required.
Note: a quorum is the minimum number of members of a deliberative body necessary
to conduct the business of that group.
Figure: A queue-sharing group (GRP1) in which queue managers QM1, QM2, and
QM3 access the shared queue QA held in the Coupling Facility.
A further benefit of using shared queues is the ability to use shared channels. You can
use shared channels in two different scenarios to further extend the high availability of
WebSphere MQ.
First, using shared channels, an external queue manager can connect to a specific
queue manager in the QSG and put messages to the shared queue via that queue
manager. This allows queue managers in a distributed environment to utilize the HA
functionality provided by shared queues: the target application for the messages can
be running on any queue manager in the QSG.
Second, you can use a generic port so that a channel connecting to the QSG can be
connected to any queue manager in the QSG. If the channel loses its connection
(because of a queue manager failure), it can connect to
another queue manager in the QSG by simply reconnecting to the same generic port.
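As an illustration of the second scenario, a client application might connect through such a generic port. The sketch below (C, using the MQI) names the queue-sharing group GRP1 rather than an individual queue manager; the channel name TO.GRP1, host qsg.example.com, and port 1414 are assumptions, not values from this paper.

```c
#include <string.h>
#include <cmqc.h>
#include <cmqxc.h>   /* MQCD client channel definition */

/* Sketch: connect a client through a generic port that is advertised
   for the whole queue-sharing group, so any member of GRP1 that is
   still running can accept the connection.                          */
MQHCONN connect_to_qsg(MQLONG *compCode, MQLONG *reason)
{
    MQCNO    cno = {MQCNO_DEFAULT};
    MQCD     cd  = {MQCD_CLIENT_CONN_DEFAULT};
    MQHCONN  hConn = MQHC_UNUSABLE_HCONN;
    MQCHAR48 qsgName = "GRP1";           /* queue-sharing group name */

    strncpy(cd.ChannelName, "TO.GRP1", MQ_CHANNEL_NAME_LENGTH);
    /* the generic port resolves to whichever QSG member is available */
    strncpy(cd.ConnectionName, "qsg.example.com(1414)",
            MQ_CONN_NAME_LENGTH);

    cno.Version = MQCNO_VERSION_2;       /* required for ClientConnPtr */
    cno.ClientConnPtr = &cd;

    /* Name the group rather than a specific queue manager */
    MQCONNX(qsgName, &cno, &hConn, compCode, reason);
    return hConn;
}
```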
3.3.1.1 Benefits of shared message queues
The main benefit of a shared queue is its high availability. There are numerous
customer-selectable configuration options for CF storage, ranging from standalone
processors with their own power supplies to the Internal Coupling Facility
(ICF), which runs on spare processors within a general zSeries server. Another key
factor is that the Coupling Facility Control Code (CFCC) runs in its own LPAR,
where it is isolated from any application or subsystem code.
In addition, a shared queue naturally balances the workload between the queue
managers in the QSG: a queue manager requests a message from the shared queue
only when the application processing messages is free to take one. Therefore, the
availability of the messaging service is improved because queue managers are not
flooded with messages directly. Instead, they consume messages from the shared queue
when they are ready to do so.
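A minimal sketch of this pull model, assuming an already established connection and an open handle to the shared queue (the buffer size and 5-second wait interval are illustrative):

```c
#include <cmqc.h>

/* Sketch: an application attached to one QSG member pulls a message
   from the shared queue only when it is free to process the next one. */
void serve(MQHCONN hConn, MQHOBJ hObj)
{
    MQGMO  gmo = {MQGMO_DEFAULT};
    MQLONG compCode, reason, dataLen;
    char   buffer[1024];

    gmo.Options = MQGMO_WAIT | MQGMO_SYNCPOINT | MQGMO_FAIL_IF_QUIESCING;
    gmo.WaitInterval = 5000;             /* wait up to 5 s for a message */

    for (;;) {
        MQMD md = {MQMD_DEFAULT};        /* fresh MsgId/CorrelId match  */
        MQGET(hConn, hObj, &md, &gmo, sizeof(buffer), buffer,
              &dataLen, &compCode, &reason);
        if (reason == MQRC_NO_MSG_AVAILABLE) continue;   /* nothing yet */
        if (compCode == MQCC_FAILED) break;

        /* ... process the message here ... then commit the unit of
           work so the destructive get becomes permanent              */
        MQCMIT(hConn, &compCode, &reason);
    }
}
```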
Also, should greater message processing performance be required, you can add extra
queue managers to the QSG to process more incoming messages. With persistent
messages, both private and shared, the message processing limit is constrained by the
speed of the log. With shared message queues, each queue manager uses its own log,
so adding queue managers to the QSG also adds logging capacity.
Figure: A WebSphere MQ queue manager cluster (cluster 1) containing queue
managers QM1 through QM6, with an application connected to QM1.
Figure: cluster 1, in which each queue manager (QM1 through QM6) has its own
local queue and log; an application is connected to QM1.
In this example, if queue manager 4 fails, it fails over to the same machine as queue
manager 3, where both queue managers run until the failed machine is repaired.
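For illustration, an application putting messages to such a cluster can leave the binding unfixed, so that each message may be routed to any available instance of the queue, including the instances that survive a machine failure. The queue name CLUSTER.Q below is hypothetical:

```c
#include <string.h>
#include <cmqc.h>

/* Sketch: open a clustered queue with BIND_NOT_FIXED so the cluster
   workload management may choose a different hosting queue manager
   for each message; if one instance's machine fails, messages flow
   to the surviving instances.                                       */
MQHOBJ open_cluster_queue(MQHCONN hConn, MQLONG *compCode, MQLONG *reason)
{
    MQOD   od = {MQOD_DEFAULT};
    MQHOBJ hObj = MQHO_UNUSABLE_HOBJ;

    strncpy(od.ObjectName, "CLUSTER.Q", MQ_Q_NAME_LENGTH);
    /* leave od.ObjectQMgrName blank: let the cluster choose a target */

    MQOPEN(hConn, &od,
           MQOO_OUTPUT | MQOO_BIND_NOT_FIXED | MQOO_FAIL_IF_QUIESCING,
           &hObj, compCode, reason);
    return hObj;
}
```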
Note: for z/OS, pagesets are only consistent on every third checkpoint.
4.3. Automation
The detection of a failure, the failover to a standby machine, and the restart of the
queue manager (and applications) should all be automated. By reducing operator
intervention, the time required to fail over a queue manager to a backup machine is
significantly reduced, allowing normal service to resume as quickly as possible.
You can automate this process by using HA clustering software, as described in
section 3.2.1, HA clustering software.
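As a toy illustration only (HA clustering products perform detection and recovery far more robustly, including failing over to the standby machine), a watchdog might poll the queue manager and restart it when a connection attempt fails; the queue manager name QM1 is hypothetical:

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <cmqc.h>

int main(void)
{
    MQHCONN  hConn;
    MQLONG   cc, rc;
    MQCHAR48 qmName = "QM1";             /* hypothetical queue manager */

    for (;;) {
        MQCONN(qmName, &hConn, &cc, &rc);
        if (cc != MQCC_FAILED) {
            MQDISC(&hConn, &cc, &rc);    /* healthy: just disconnect */
        } else {
            printf("queue manager unavailable (reason %d), restarting\n",
                   (int)rc);
            system("strmqm QM1");        /* restart in place; an HA
                                            cluster would instead fail
                                            over to the standby       */
        }
        sleep(30);                       /* poll every 30 seconds */
    }
}
```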
6. Conclusion
This paper discussed approaches for implementing high availability solutions using
the WebSphere MQ messaging product.
Choosing a solution to achieve a highly available system is based on the HA
requirements of that system. For instance, is each message important? Can a trapped
message wait a few hours until a machine is restarted, or must it be made available as
soon as possible? If the former, then a simple clustering approach is enough; the latter
requires HA clustering software and hardware. Also, are software applications reliant
on specific software or hardware resources? If so, an HA cluster solution is critical,
because interdependent groups of resources must be failed over together.
Note that the approaches discussed in this paper for implementing high availability
with WebSphere MQ all employ common HA principles, and you should adhere to
those principles when implementing any highly available IT system. The first is the
use of a single copy of any data. This makes the data much easier to manage: there are
no ambiguities about who owns the real data, and there are no issues in reconciling the
data after a corruption. When a failover occurs, only one instance of the software
has access to the real data, avoiding any confusion. The only exception to this
principle is a disaster recovery solution that moves copies of critical data off site. In
this case, the copy of the data is not used to remove a single point of failure or to
provide high availability; instead, if a site-wide failure occurs, the backup is used to
restore critical data and to resume services (possibly at another site).
Second, always verify that software which stores persistent state on disk performs
synchronous writes, so that the data is hardened. With asynchronous writes, the
software may believe the data has been hardened to disk when, in fact, it has not.
WebSphere MQ always writes persistent data synchronously to disk to ensure it has
been hardened, and is therefore recoverable, in the event of a queue manager failure.
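As a generic illustration of the principle (POSIX C, not WebSphere MQ internals), a synchronous write forces the data to the device before the call returns, whereas a buffered write may still be sitting in operating system memory when the machine fails:

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Append a record to a file and ensure it is hardened to disk. */
int harden_record(const char *path, const char *record)
{
    /* O_SYNC: write() does not return until the data is on the disk */
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0600);
    if (fd < 0) return -1;

    if (write(fd, record, strlen(record)) < 0) { close(fd); return -1; }

    /* belt and braces: fsync also flushes file metadata */
    if (fsync(fd) != 0) { close(fd); return -1; }
    return close(fd);
}
```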
Third, implementing redundancy at the hard disk level, to remove the disk as a single
point of failure, is a simple step that prevents the loss of critical data if a disk fails.
Although synchronous writes ensure the data has been hardened to disk, a disk failure
can still destroy it. Therefore, implement technologies such as RAID to provide
disk-level redundancy of data.
Fourth, and often overlooked, implement process controls for the administration of
production IT systems. It is often administrative errors that cause outages, through
improperly tested software updates, incorrect parameter settings, or destructive
actions performed by administrators. With proper process controls and security
restrictions, you can minimize these errors. Also, HA clustering software provides a
single administration view of all machines in an HA cluster, which minimizes
administration effort.
Lastly, programming applications to avoid affinities between clients and servers, and
to avoid long-running units of work, are both good practices. The first allows
applications to be reconnected to any available server after a failover.
Resources
[1]. WebSphere MQ for z/OS Concepts and Planning Guide, Chapter 2 (Shared
Queues),
http://www-306.ibm.com/software/integration/mqfamily/library/manualsa/manuals/platspecific.html
[2]. WebSphere MQ Queue Manager Clusters,
http://www-306.ibm.com/software/integration/mqfamily/library/manualsa/manuals/crosslatest.html
[3]. WebSphere MQ High Availability, Mark Taylor, Transaction and Messaging
Technical Conference
[4]. Choosing the Right Availability Solution, L. Sherman,
http://whitepapers.zdnet.co.uk/0,39025945,60018358p-39000482q,00.htm
[5]. Understanding Downtime, Business Continuity Solution Series, Vision
Solutions whitepaper,
http://www.visionsolutions.com/BCSS/White-Paper-102_final_vision_site.pdf
[6]. WebSphere MQ Manuals for z/OS, System Administration Guide, Chapter 14,
page 151,
http://www-306.ibm.com/software/integration/mqfamily/library/manualsa/manuals/platspecific.html
[7]. Achieving High Availability Objectives, CNT whitepapers,
http://www.cnt.com/documents/?ext=pdf&filename=PL581
[8]. A definition of the term RAID, webopedia.com,
http://www.webopedia.com/TERM/R/RAID.html