
Understanding high availability with WebSphere MQ

Mark Hiscock
Software Engineer
IBM Hursley Park Lab
United Kingdom
Simon Gormley
Software Engineer
IBM Hursley Park Lab
United Kingdom
May 11, 2005
Copyright International Business Machines Corporation 2005. All rights reserved.
This whitepaper explains how you can easily configure and achieve high availability
using IBM's enterprise messaging product, WebSphere MQ V5.3 and later. This
paper is intended for:
o Systems architects who make design and purchase decisions for the IT
infrastructure and may need to broaden their designs to incorporate
HA.
o System administrators who wish to implement and configure HA for
their WebSphere MQ environment.
Table of Contents
1. Introduction
2. High availability
3. Implementing high availability with WebSphere MQ
3.1. General WebSphere MQ recovery techniques
3.2. Standby machine - shared disks
3.2.1. HA clustering software
3.2.2. When to use standby machine - shared disks
3.2.3. When not to use standby machine - shared disks
3.2.4. HA clustering active-standby configuration
3.2.5. HA clustering active-active configuration
3.2.6. HA clustering benefits
3.3. z/OS high availability options
3.3.1. Shared queues (z/OS only)
3.4. WebSphere MQ queue manager clusters
3.4.1. Extending the standby machine - shared disk approach
3.4.2. When to use HA WebSphere MQ queue manager clusters
3.4.3. When not to use HA WebSphere MQ queue manager clusters
3.4.4. Considerations for implementation of HA WebSphere MQ queue manager clusters
3.5. HA capable client applications
3.5.1. When to use HA capable client applications
3.5.2. When not to use HA capable client applications
4. Considerations for WebSphere MQ restart performance
4.1. Long running transactions
4.2. Persistent message use
4.3. Automation
4.4. File systems
5. Comparison of generic versus specific failover technology
6. Conclusion
Appendix A. Available SupportPacs
Resources
About the authors


1. Introduction
With an ever increasing dependence on IT infrastructure to perform critical business
processes, the availability of this infrastructure is becoming more important. The
failure of an IT infrastructure results in large financial losses, which increase with the
length of the outage [5]. The solution to this problem is careful planning to ensure that
the IT system is resilient to any hardware, software, local or system wide failure. This
capability is termed resilience computing, which addresses the following topics:
o High availability
o Fault tolerance
o Disaster recovery
o Scalability
o Reliability
o Workload balancing and stress

This whitepaper addresses the most fundamental concept of resilience computing,
high availability (HA). That is, "An application environment is highly available if it
possesses the ability to recover automatically within a prescribed minimal outage
window" [7]. Therefore, an IT infrastructure that recovers from a software or
hardware failure, and continues to process existing and new requests, is highly
available.


2. High availability
The HA nature of an IT system is its ability to withstand software or hardware failures
so that it is available as much of the time as possible. Ideally, despite any failure
which may occur, this would be 100% of the time. However, there are factors, both
planned and unplanned, which prohibit this from being a reality for most production
IT infrastructures. These factors lead to the unavailability of the infrastructure,
meaning that availability (per year) can be measured as the percentage of the year
for which the system was available. For example:
Figure 1. Number of 9s availability per year

Availability %     Downtime per year
99                 3.65 days
99.9               8.76 hours
99.99              52.6 minutes
99.999             5.26 minutes
99.9999            31.5 seconds

Figure 1 shows that an outage of roughly 30 seconds per year is termed six 9s
availability because of the percentage of the year for which the system is available.
Factors that cause a system outage and reduce the number of 9s of uptime fall into two
categories: those that are planned and those that are unplanned. Planned disruptions
are either systems management (upgrading software or applying patches), or data
management (backup, retrieval, or reorganization of data). Conversely, unplanned
disruptions are system failures (hardware or software failures) or data failures (data
loss or corruption).
Maximizing the availability of an IT system means minimizing the impact of these
failures on the system. The primary method is the removal of any single point of
failure (SPOF) so that should a component fail, a redundant or backup component is
ready to take over. Also, to ensure enterprise messaging solutions are made highly
available, the software's state and data must be preserved in the event of a failure and
made available again as soon as possible. The preservation and restoration of this data
removes it as a single point of failure in the system.
Some messaging solutions remove single points of failure, and make software state
and data available, by using replication technologies. These may be in the form of
asynchronous or synchronous replication of data between instances of the software in
a network. However, these approaches are not ideal as asynchronous replication can
cause duplicated or lost data and synchronous replication incurs a significant
performance cost as data is being backed up in real time. It is for these reasons that
WebSphere MQ does not use replication technologies to achieve high availability.
The next section describes methods for making a WebSphere MQ queue manager
highly available. Each method describes a technique for HA and when you should and
should not consider it as a solution.


3. Implementing high availability with WebSphere MQ


This section discusses the various methods of implementing high availability in
WebSphere MQ. Examples show when you can or cannot use HA.

o Standby machine - shared disks and z/OS high availability options describe HA
techniques for distributed and z/OS queue managers, respectively.
o WebSphere MQ queue manager clusters describes a technique available to
queue managers on all platforms.
o HA capable client applications describes a client-side technique applicable
on all platforms.

By reading each section, you can select the best HA methodology for your scenario.
This paper uses the following terminology:
o Machine: A computer running an operating system.
o Queue manager: A WebSphere MQ queue manager that contains queue and
log data.
o Server: A machine that runs a queue manager and other 3rd party services.
o Private message queues: Queues owned by a particular queue manager that
are only accessible, via WebSphere MQ applications, when the owning queue
manager is running. These queues are to be contrasted with shared message
queues (explained below), which are a particular type of queue only available
on z/OS.
o Shared message queues: Queues that reside in a Coupling Facility and are
accessible by a number of queue managers that are part of a Queue Sharing
Group. These are only available on z/OS and are discussed later.

3.1. General WebSphere MQ recovery techniques


On all platforms, WebSphere MQ uses the same general techniques for dealing with
recovery of private message queues after a failure of a queue manager. With the
exception of shared message queues (see Shared queues), messages are cached in
memory and backed by disk storage if the volume of message data exceeds the
available memory cache.
When persistent messaging is used, WebSphere MQ logs messages to disk storage.
Therefore, in the event of a failure, the combination of the message data on disk plus
the queue manager logs can be used to reconstruct the message queues. This restores
the queue manager to a consistent state at the time just before the failure occurred.
This recovery involves completing normal unit of work resolution, with in-flight
messages being rolled back, in-commit messages being completed, and in-doubt
messages waiting for coordinator resolution.
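As a rough illustration of this behavior, the following sketch (using the WebSphere MQ
classes for Java, with hypothetical queue manager and queue names) puts a persistent
message inside a unit of work. A failure before the commit leaves the put in-flight, so
restart recovery rolls it back; after the commit, the logged message is rebuilt onto the
queue during restart.

    import java.io.IOException;
    import com.ibm.mq.*;

    public class PersistentPut {
        public static void main(String[] args) throws MQException, IOException {
            // Hypothetical names; adjust for your environment.
            MQQueueManager qmgr = new MQQueueManager("QM1");
            MQQueue queue = qmgr.accessQueue("ORDERS.QUEUE",
                    MQC.MQOO_OUTPUT | MQC.MQOO_FAIL_IF_QUIESCING);

            MQMessage msg = new MQMessage();
            msg.persistence = MQC.MQPER_PERSISTENT;   // message data is logged to disk
            msg.writeString("order 4711");

            MQPutMessageOptions pmo = new MQPutMessageOptions();
            pmo.options = MQC.MQPMO_SYNCPOINT;        // put is part of a unit of work

            queue.put(msg, pmo);
            // A failure here leaves the put in-flight; restart rolls it back.
            qmgr.commit();                            // the unit of work is now recoverable
            queue.close();
            qmgr.disconnect();
        }
    }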
The following sections describe how the above general restart process is used in
conjunction with platform specific facilities, such as HACMP on AIX or ARM on
z/OS, to quickly restore message availability after failures.

WebSphere MQ also provides a mechanism for improving the availability of new
messages by routing messages around a failed queue manager transparently to the
application producing the messages. This is called WebSphere MQ clustering and is
covered in WebSphere MQ Queue Manager clusters.
Finally on z/OS, WebSphere MQ supports shared message queues that are accessible
to a number of queue managers. Failure of one queue manager still allows the
messages to be accessed by other queue managers. These are covered in z/OS high
availability options.

3.2. Standby machine - shared disks


As described above, when a queue manager fails, a restart is required to make the
private message queues available again. Until then, the messages stored on the queue
manager will be stranded. Therefore, you cannot access them until the machine and
queue manager are returned to normal operation. To avoid the stranded messages
problem, stored messages need to be made accessible, even if the hosting queue
manager or machine is inoperable.
In the standby machine solution, a second machine is used to host a second queue
manager that is activated when the original machine or queue manager fails. The
standby machine needs to be an exact replica, at any given point in time, of the master
machine, so that when failure occurs, the standby machine can start the queue
manager correctly. That is, the WebSphere MQ code on the standby machine should
be at the same level, and the standby machine should have the same security
privileges as the primary machine.
A common method for implementing the standby machine approach is to store the
queue manager data files and logs on an external disk system that is accessible to both
the master and standby machines. WebSphere MQ writes its data synchronously to
disk, which means a shared disk will always contain the most recent data for the
queue manager. Therefore, if the primary machine fails, the secondary machine can
start the queue manager and resume its last known good state.

Figure 2. An active-standby setup

The standby machine is ready to read the queue manager data and logs from the
shared disk and to assume the IP address of the primary machine [3].

A shared external disk device is used to provide a resilient store for queue data and
queue manager logs so that replication of messages is avoided. This preserves the
once-and-once-only delivery characteristic of persistent messages. If the data were
replicated to a different system, the messages stored on the queues would be
duplicated on the other system, and once-and-once-only delivery could not be guaranteed.
For instance, if data was replicated to a standby server, and the connection between
the two servers fails, the standby assumes that the master has failed, takes over the
master server's role, and starts processing messages. However, as the master is still
operational, messages are processed twice and duplicate messages occur. This is
avoided when using a shared hard disk because the data only exists in one physical
location and concurrent access is not allowed.
The external disk used to store queue manager data should also be RAID enabled (a
RAID configuration, such as mirroring, protects against data loss) to prevent it from
being a single point of failure (SPOF) [8]. The disk device may also have multiple
disk controllers and multiple physical connections to each of the machines, to
provide redundant access channels to the data. In normal operation, the shared disk is
mounted by the master machine, which uses the storage to run the queue manager in
the same way as if it were a local disk, storing both the queues and the WebSphere
MQ log files on it. The standby machine cannot mount the shared disk and therefore
cannot start the queue manager because the queue manager data is not accessible.
When a failure is detected, the standby machine automatically takes on the master
machine's role, and as part of that process, mounts the shared disk and starts the
queue manager. The standby queue manager replays the logs stored on the shared
disk to return the queue manager to the correct state, and resumes normal operations.
Note that messages on queues that are failed over to another queue manager retain
their order on the queue. This failover operation can also be performed without the
intervention of a server administrator. It does require external software, known as
HA clustering, to detect the failure and initiate the failover process.
Only one machine has access to the shared disk partition at a time (a more accurate
name would be a switchable disk), and only one instance of the queue manager runs
at any one time to protect the data integrity of
messages. The objective of the shared disk is to move the storage of important data
(for example, queue data and queue manager logs) to a location external to the
machine, so that when the master machine fails, another machine may use the data.

3.2.1. HA clustering software


Much of the functionality in the standby machine configuration is provided by
external software, often termed HA clustering software [4]. This software addresses
high availability issues using a more holistic approach than single applications, such
as WebSphere MQ, can provide. It also recognizes that a business application may
consist of many software packages and other resources, all of which need to be highly
available. This is because another complication is introduced when a solution consists
of several applications that have a dependency on each other. For example, an
application may need access to both WebSphere MQ and a database, and may need to
run on the same physical machine as these services. HA clustering provides the
concept of resource groups, where applications are grouped together. When failure
occurs in one of the applications in the group, the entire group is moved to a
standby server, satisfying the dependency of the applications. However, this only
occurs if the HA clustering software fails to restart the application on its current
machine. It is also possible to move the network address and any other operating
system resources with the group so that the failover is transparent to the client. If an
individual software package was responsible for its own availability, it may not be
able to transfer to another physical machine and will not be able to move any other
resources on which it is dependent.
By using HA clustering to cope with these low level considerations, such as network
address takeover, disk access, and application dependencies, the higher level
applications are relieved of this complexity. Although there are several vendors
providing HA clustering, each package tends to follow the same basic principles and
provide a similar set of basic functionality. Some solutions, such as Veritas Cluster
Server and SteelEye LifeKeeper, are also compatible with multiple platforms to
provide a similar solution in heterogeneous environments.
In the same way that WebSphere MQ removed the complexity of application
connectivity from the programmer, HA clustering techniques help provide a simple,
generic solution for HA. This means applications, such as messaging and data
management, can focus on their core competencies leaving HA clustering to provide a
more reliable availability solution than resource-specific monitors. HA clustering
also covers both hardware and software resources, and is a proven, recognized
technology used in many other HA situations. HA clustering products are designed to
be scalable and extensible to cope with changing requirements. IBM's AIX
HACMP product, SteelEye LifeKeeper, and Veritas Cluster Server scale up to 32
servers. HACMP, LifeKeeper, and Cluster Server have extensions available to allow
replication of disks to a remote site for disaster recovery purposes.

3.2.2. When to use standby machine - shared disks


The standby machine solution is ideal for messages that must be delivered once and
only once. For example, in billing and ordering systems, it is essential that messages are
not duplicated so that customers are not billed twice, or sent two shipments instead of
one.
As HA clustering software is a separate product that sits alongside existing
applications, this methodology is also suited to converting an existing server, or set of
servers, to be highly available, and such a conversion can be made gradually.
In large installations where there are many servers, HA clustering is
a cost effective choice through the use of an n+1 configuration. In this approach, a
single machine is used as a backup for a number of live servers. Hardware redundancy
is reduced and therefore, cost is reduced, as only one extra machine is required to
provide high availability to a number of active servers.
As already shown, HA clustering software is capable of converting an existing
application and its dependent resources to be highly available. It is, therefore, suited
to situations where there are several applications or services that need to be made
highly available. If those applications are dependent on each other, and rely on
operating system resources, such as network addresses, to function correctly, HA
clustering is ideally suited.

3.2.3. When not to use standby machine - shared disks


HA clustering is not always necessary when considering an HA solution. Although
the examples given below are served by an HA clustering method, other solutions
would serve just as well and it would be possible to utilize HA clustering at a later
date if required.
If the trapped messages problem is not applicable, for example because there is no need
to restart a failed queue manager with its messages intact, then shared disks are not necessary.
This occurs if the system is only used for event messages that will be re-transmitted
regularly, for messages that expire in a relatively short time, or for
non-persistent messages (where an application is not relying on WebSphere MQ for
assured delivery). For these situations, you can make a system highly available by
using WebSphere MQ queue manager clustering only. This technology load balances
messages and routes around failed servers. See WebSphere MQ Queue Manager
clusters for more information on queue manager clusters.

In situations where it is not important to process the messages as soon as possible,
HA clustering may provide too much availability at too great an expense. For
example, if trapped messages can wait until an administrator restarts the machine, and
hence the queue manager is restarted (using an internal RAID disk to protect the
queue manager data), then HA clustering is considered too comprehensive a
solution. In this situation, it is possible to allow access for new messages using
WebSphere MQ queue manager clustering, as in the case above.
The shared disk solution requires the machines to be physically close to each other, as
the distance from the shared disk device needs to be small. This makes it unsuitable
for use in a disaster recovery solution. However, some HA clustering software can
provide disaster recovery functionality. For example, IBM's HACMP package has an
extension called HAGEO, which provides data replication to remote sites. By
backing up data in this fashion, it is possible to retrieve it if a site wide failure occurs.
However, the off-site data may not be the most up-to-date because the replication is
often delayed by a few minutes. This is because instantaneous replication of data to an
off-site location incurs a significant performance hit. Therefore, the more important
the data, the smaller the time interval will be, but the greater the performance impact.
Time and performance must be traded against each other when implementing a
disaster recovery solution. Such solutions do not provide all of the benefits of the
shared disk solution and are beyond the scope of this document.
The following sections describe two possible configurations for HA clustering. These
are termed active-active and active-standby configurations.

3.2.4. HA clustering active-standby configuration


In a generic HA clustering solution, when two machines are used in an active-standby
configuration, one machine is running the applications in a resource group and the
other is idle. In addition to network connections to the LAN, the machines also have a
private connection to each other. This is either in the form of a serial link or a private
Ethernet link. The private link provides a redundant connection between the machines
for the purpose of detecting a complete failure. As previously mentioned, if a link
between the machines fails, then both machines may try to become active. Therefore,
the redundant link reduces the risk of communication failure between the two. The
machines may also have two external links to the LAN. Again, this reduces the risk of
external connectivity failure, but also allows the machines to have their own network
address. One of the adapters is used for the service network address, such as the
network address that clients use to connect to the service, and the other adapter has a
network address associated with the physical machine. The service address is moved
between the machines upon failure to provide HA transparency to any clients.
The standby machine monitors the master machine via the use of heartbeats. These
are periodic checks by the standby machine to ensure that the master machine is still
responding to requests. The master machine also monitors its disks and the processes
running on it to ensure that no hardware failure has occurred. For each service
running on the machine, a custom utility is required to inform the HA clustering
software that it is still running. In the case of WebSphere MQ, the SupportPacs
describing HA configurations provide utilities to check the operation of queue
managers, which can easily be adapted for other HA systems. Details of these
SupportPacs are listed in Appendix A.
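The SupportPac monitoring utilities are the recommended starting point; purely as an
illustration of what such a check does, the hedged sketch below (WebSphere MQ classes
for Java, hypothetical queue manager name) attempts a bindings-mode connection and
reports success or failure through its exit code, which an HA clustering monitor could
invoke periodically.

    import com.ibm.mq.MQException;
    import com.ibm.mq.MQQueueManager;

    // Minimal liveness probe: exit code 0 if the queue manager accepts a
    // local bindings connection, 1 otherwise.
    public class QmgrAlive {
        public static void main(String[] args) {
            String qmName = args.length > 0 ? args[0] : "QM1";  // hypothetical default
            try {
                MQQueueManager qmgr = new MQQueueManager(qmName);
                qmgr.disconnect();
                System.exit(0);   // queue manager responded
            } catch (MQException e) {
                System.err.println("Queue manager " + qmName
                        + " is not responding, reason " + e.reasonCode);
                System.exit(1);   // tell the HA monitor that the check failed
            }
        }
    }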
A small amount of configuration is required for each resource group to describe what
should happen at start-up and shutdown, although in most cases this is simple. In the
case of WebSphere MQ, this could be a start up script containing commands to start
the queue manager (for example, strmqm), listener (for example, runmqlsr), or any
other queue manager programs. A corresponding shutdown script is also needed, and
depending on the HA clustering package in use, a number of other scripts may be
required. Samples for WebSphere MQ are provided with the SupportPacs described
in Appendix A.
As the heartbeat mechanism is the primary method of failure detection, if a heartbeat
does not receive a response, the standby machine assumes that the master server has
failed. However, heartbeats may go unanswered for a number of reasons, such as
an overloaded server or a communication failure. There is a possibility that the master
server will resume processing at a later stage, or is still running. This can lead to
duplicate messages in the system and is not desired.
Managing this problem is also the role of the HA clustering package. For example,
Red Hat cluster services and IBM's HACMP work around this problem by having a
watchdog timer with a lower timeout than the cluster. This ensures that the machine
reboots itself before another machine in the cluster takes over its role. Programmable
power supplies are also supported, so other machines in the cluster can power cycle
the affected machine, to ensure that it is no longer operational before starting the
resource group. Essentially, the machines in the cluster have the capability to turn the
other machines off.
Some HA clustering software suites also provide the capability to detect other types of
failure, such as system resource exhaustion, or process failure, and try to recover from
these failures locally. For WebSphere MQ on AIX, you can use the appropriate
SupportPac (see Appendix A) to locally restart a queue manager that is not
responding. This can avoid the more time consuming operation of completely
moving the resource group to another server.
You should design the machines used in HA clustering to have identical
configurations to each other. This includes installed software levels, security
configurations, and performance capabilities, to minimize the possibility of resource
group start-up failure. This ensures that machines in the network all have the
capability to take on another machine's role.
Note that for active-standby configurations, only one instance of an application is
running at any one moment and therefore, software vendors may only charge for one
instance of the application, as is the case for WebSphere MQ.

3.2.5. HA clustering active-active configuration


It is also possible to run services on the redundant machine in what is termed an
active-active configuration. In this mode, the servers are both actively running
programs and acting as backups for each other. If one server fails, the other continues
to run its own services, as well as the failed server's. This enables the backup server
to be used more effectively, although when a failure does occur, the performance of
the system is reduced because it has taken on extra processing.
In Figure 3, the second active machine runs both queue managers if a failure occurs.
Figure 3. An active-active configuration

In larger installations, where several resource groups exist and more than one server
needs to be made highly available, it is possible to use one backup machine to cover
several active servers. This setup is known as an n+1 configuration, and has the
benefit of reduced redundant hardware costs, because the servers do not have a
dedicated backup machine each. However, if several servers fail at the same time, the
backup machine may become overloaded. The hardware savings must therefore be
weighed against the potential cost of more than one server failing and more than one
backup machine being required.

3.2.6. HA clustering benefits


HA clustering software provides the capability to perform controlled failover of
resource groups. This allows administrators to test the functionality of a configured
system, and also allow machines to be gracefully removed from an active cluster. This
can be for maintenance purposes, such as hardware and software upgrades or data
backup. It also allows failed servers, once repaired, to be placed back in the cluster
and to resume their services. This is known as fail-back [4]. A controlled failover
operation also results in less downtime because the cluster does not need to detect the
failure. There is no need to wait for the cluster timeout. Also, as the applications, such
as WebSphere MQ, are stopped in a controlled manner, the start up time is reduced
because there is no need for log replay.
The abstraction of resource groups makes it possible for a service to remain highly
available even when the machine that normally runs it has been removed from the
cluster. This is only true as long as the other machines have
comparable software installed and access to the same data, meaning any machine can
run the resource group. The modular nature of resource groups also helps the gradual
uptake of HA clustering in an existing system and easily allows services to be added
at a later date. This also means that in a large queue manager installation, you can
convert mission critical queue managers to be highly available first, and later convert
the less critical queue managers, or not at all.
Many of the requirements for implementing HA clustering are also desirable in more
bespoke, or product-centric HA solutions. For example, RAID disk arrays [8], extra
network connections and redundant power supplies all protect against hardware
failure. Therefore, improving the availability of a server results in additional cost,
whether a bespoke or HA clustering technique is used. HA clustering may require
additional hardware over and above some application specific HA solutions, but this
enables a HA clustering approach to provide a more complete HA solution.
You can easily extend the configuration of HA clustering to cover other applications
running on the machine. The availability of all services is provided via a standard
methodology and presented through a consistent interface rather than being
implemented separately by each service on the machine. This in turn reduces
complexity and staff training times and reduces errors being introduced during
administration activities.
By using one product to provide an availability solution, you can take a common
approach to decision making. For instance, if a number of the servers in a cluster are
separated from the others by network failure, a unanimous decision is needed to
decide which servers should remain active in the cluster. If there were several HA
solutions in place (such as each product using its own availability solution), each with
separate quorum algorithms (a quorum is the minimum number of members of a
deliberative body necessary to conduct the business of that group), then it is possible
that each algorithm has a different
outcome. This could result in an invalid selection of active servers in the cluster that
may not be able to communicate. By having a separate entity, in the form of the HA
clustering software, to decide which part of the cluster has the quorum, only one
outcome is possible, and the cluster of servers continues to be available.
Summary
The shared disk solution described above is a robust approach to the problem of
trapped messages, and allows access to stored messages in the event of a failure.
However, there will be a short period of time where there is no access to the queue
manager while the failure is being detected, and the service is being transferred to the
standby server. It is possible during this time to use WebSphere MQ clustering to
provide access for new messages because its load balancing capabilities will route
messages around the failed queue manager to another queue manager in the cluster.
How to use HA clustering with WebSphere MQ clustering is described in When to
use HA WebSphere MQ queue manager clusters.


3.3. z/OS high availability options


z/OS provides a facility for operating system restart of failed queue managers called
Automatic Restart Manager (ARM). It provides a mechanism, via ARM policies, for a
failed queue manager to be restarted in place on the failing logical partition
(LPAR) or, in the case of an LPAR failure, to be started on a different LPAR along with
other subsystems and applications grouped together, so that the subsystem
components that provide the overall business solution can be restarted together.
In addition, with a parallel sysplex, Geographically Dispersed Parallel Sysplex
(GDPS) provides the ability for automatic restart of subsystems, via remote DASD
copying techniques, in the event of a site failure.
The above techniques are restart techniques similar to those discussed earlier
for distributed platforms. We will now look at a capability that maximizes the
availability of message queues in the event of queue manager failures and does not
require queue manager restart.

3.3.1. Shared queues (z/OS only)


WebSphere MQ shared queues is an exploitation of the z/OS-unique Coupling
Facility (CF) technology that provides high-speed access to data across a sysplex via a
rich set of facilities to store and retrieve data. WebSphere MQ stores shared message
queues in the Coupling Facility, which in turn means that, unlike private message
queues, they are not owned by any single queue manager.
Queue managers are grouped into Queue Sharing Groups (QSGs), analogous to DB2
Data Sharing Groups. All queue managers within a QSG can access
shared message queues for putting and getting of messages via the WebSphere MQ
API. This enables multiple putters and getters on the same shared queue from within
the QSG. Also, WebSphere MQ provides peer recovery, such that in-flight shared queue
messages are automatically rolled back by another member of the QSG in the event of
a queue manager failure.
WebSphere MQ still uses its logs for capturing persistent message updates so that in
the extremely unlikely event of a CF failure, you can use the normal restart
procedures to restore messages. In addition, z/OS provides system facilities to
automatically duplex the CF structures used by WebSphere MQ. The combination of
these facilities provides WebSphere MQ shared message queues with extremely high
availability characteristics.
Figure 4 shows three queue managers: QM1, QM2 and QM3 in the QSG GRP1
sharing access to queue A in the coupling facility. This setup allows all three queue
managers to process messages arriving on queue A.

Figure 4. Three queue managers in a QSG share queue A on a Coupling Facility

A further benefit of using shared queues is the ability to use shared channels. You can use
shared channels in two different scenarios to further extend the high availability of
WebSphere MQ.
First, using shared channels, an external queue manager can connect to a specific
queue manager in the QSG. It can then put messages to the shared
queue via this queue manager. This allows for queue managers in a distributed
environment to utilize the HA functionality provided by shared queues. Therefore, the
target application of messages put by the queue manager can be any of those running
on a queue manager in the QSG.
Second, you can use a generic port so that a channel connecting to the QSG could be
connected to any queue manager in the QSG. If the channel loses its connection
(because of a queue manager failure), then it is possible for the channel to connect to
another queue manager in the QSG by simply reconnecting to the same generic port.
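As a sketch of the generic port approach (the host name, port, and channel below are
hypothetical, and the code assumes shared channels and a generic address for the QSG are
already configured), a client can connect using the QSG name rather than a specific queue
manager; re-running the same connection logic after a failure reconnects it to whichever
member of the QSG accepts the call.

    import com.ibm.mq.MQEnvironment;
    import com.ibm.mq.MQException;
    import com.ibm.mq.MQQueueManager;

    public class QsgClient {
        public static void main(String[] args) throws MQException {
            // Hypothetical generic address and shared channel for the QSG.
            MQEnvironment.hostname = "qsg1.example.com";
            MQEnvironment.port     = 1414;
            MQEnvironment.channel  = "GRP1.SVRCONN";

            // Naming the QSG (GRP1) rather than an individual queue manager
            // lets any member of the group accept the connection.
            MQQueueManager qmgr = new MQQueueManager("GRP1");
            System.out.println("Connected to a member of QSG GRP1");
            qmgr.disconnect();
        }
    }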
3.3.1.1. Benefits of shared message queues
The main benefit of a shared queue is its high availability. There are numerous
customer-selectable configuration options for CF storage, ranging from running on
standalone processors with their own power supplies to the Internal Coupling Facility
(ICF) that runs on spare processors within a general zSeries server. Another key
factor is that the Coupling Facility Control Code (CFCC) runs in its own LPAR,
where it is isolated from any application or subsystem code.
In addition, a shared queue naturally balances the workload between the queue managers in the
QSG. That is, a queue manager will only request a message from the shared queue
when the application, which is processing messages, is free to do so. Therefore, the
availability of the messaging service is improved because queue managers are not
flooded by messages directly. Instead, they consume messages from the shared queue
when they are ready to do so.
Also, should greater message processing performance be required, you can add extra
queue managers to the QSG to process more incoming messages. With persistent
messages, both private and shared, the message processing limit is constrained by the
speed of the log. With shared message queues, each queue manager uses its own log
for updates. Therefore, deploying additional queue managers to process a shared
queue means the total logging cost is spread across a number of queue
managers. This provides a highly scalable solution.
Conversely, if a queue manager requires maintenance, you can remove it from the
QSG, leaving the remaining queue managers to continue processing the messages.
Both the addition and removal of queue managers in a QSG can be performed without
disrupting the already existing members.
Lastly, should a queue manager fail during the processing of a unit of work, the other
members of the QSG will detect this and initiate peer recovery. That is, if the
unit of work was not completed by the failed queue manager, another queue manager
in the QSG will complete the processing. This arbitration of queue manager data is
achieved via hardware and microcode on z/OS. This means that the availability of the
system is increased as the failure of any one queue manager does not result in trapped
messages or inconsistent transactions. This is because Peer Recovery either completes
the transaction or rolls it back. For more information on Peer Recovery and how to
configure it, see z/OS Systems Administration Guide [6].
The benefits of shared queues are not solely limited to z/OS queue managers.
Although you cannot set up shared queues in a distributed environment, it is possible
for distributed queue managers to place messages onto them through a member of the
QSG. This allows the QSG to process a distributed application's messages in a
z/OS HA environment.
3.3.1.2. Limitations of shared message queues
With WebSphere MQ V5.3, physical messages on shared queues are limited to less
than 63 KB in size. Any application that attempts to put a message greater than this
limit receives an error on the MQPUT call. However, you can use the message
grouping API to construct a logical message greater than 63 KB, which consists of a
number of physical segments.
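As a hedged sketch of the grouping idea (WebSphere MQ classes for Java, hypothetical
queue names, and an arbitrary 32 KB segment size), the code below splits a large payload
into several physical messages and puts them in logical order, marking the last one, so
that each physical message stays below the shared queue limit.

    import java.io.IOException;
    import com.ibm.mq.*;

    public class GroupedPut {
        public static void main(String[] args) throws MQException, IOException {
            MQQueueManager qmgr = new MQQueueManager("QM1");       // hypothetical
            MQQueue queue = qmgr.accessQueue("APP.SHARED.QUEUE",   // hypothetical
                    MQC.MQOO_OUTPUT | MQC.MQOO_FAIL_IF_QUIESCING);

            byte[] payload = new byte[200 * 1024];                 // logical message > 63 KB
            int segment = 32 * 1024;                               // keep each physical message small

            MQPutMessageOptions pmo = new MQPutMessageOptions();
            pmo.options = MQC.MQPMO_LOGICAL_ORDER | MQC.MQPMO_SYNCPOINT;

            for (int offset = 0; offset < payload.length; offset += segment) {
                int len = Math.min(segment, payload.length - offset);
                MQMessage msg = new MQMessage();
                boolean last = (offset + len >= payload.length);
                msg.messageFlags = last ? MQC.MQMF_LAST_MSG_IN_GROUP
                                        : MQC.MQMF_MSG_IN_GROUP;
                msg.write(payload, offset, len);
                queue.put(msg, pmo);           // group id and sequence number are assigned
            }
            qmgr.commit();                     // the puts were made under syncpoint
            queue.close();
            qmgr.disconnect();
        }
    }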
The Coupling Facility is a resilient and durable piece of hardware, but it is a single
point of failure in this high availability configuration. However, z/OS provides
duplexing facilities, where updates to one CF structure are automatically propagated
to a second CF. In the unlikely event of failure of the primary CF, z/OS
automatically switches access to the secondary, while the primary is being rebuilt.
This system-managed duplexing is supported by WebSphere MQ. While the rebuild is
taking place, there is no noticeable application effect. However, this duplexing will
clearly have an effect on overall performance.
Finally, a queue manager can only belong to one QSG and all queue managers in a
QSG must be in the same sysplex. This is a small limitation on the flexibility of
QSGs. Also, a QSG can only contain a maximum of 32 queue managers. For more
information on shared queues, see WebSphere MQ for z/OS Concepts and Planning
Guide [1].


3.4. WebSphere MQ queue manager clusters


A WebSphere MQ queue manager cluster is a cross platform workload balancing
solution that allows WebSphere MQ messages to be routed around a failed queue
manager. It allows a queue to be hosted across multiple queue managers, thus
allowing an application to be duplicated across multiple machines. It provides a
highly available messaging service allowing incoming messages to be forwarded to
any queue manager in the cluster for application processing. Therefore, if any queue
manager in the cluster fails, new incoming messages continue to be processed by the
remaining queue managers.
In Figure 5, an application puts a message to a cluster queue on QM2. This cluster
queue is defined locally on QM1, QM4 and QM5. Therefore, one of these queue
managers will receive the message and process it.
Figure 5. Queue managers 1, 4, and 5 in the cluster receive messages in order

By balancing the workload between QM1, QM4, and QM5, an application is
distributed across multiple queue managers, making it highly available. If a queue
manager fails, the incoming messages are balanced among the remaining queue
managers.
While WebSphere MQ clustering provides continuous messaging for new messages, it
is not a complete HA solution because it is unable to handle messages that have
already been delivered to a queue manager for processing. As we have seen above, if
a queue manager fails, these trapped private messages are only processed when the
queue manager is restarted.
However, by combining WebSphere MQ clustering with the recovery techniques
covered above, you can create an HA solution for both new and existing messages.
The following section shows this in action in a distributed shared disk environment.


3.4.1. Extending the standby machine - shared disk approach


By hosting cluster queue managers on active-standby or active-active setups, trapped
messages, on private or cluster queues, are made available when the queue manager is
failed over to a standby machine and restarted. The queue manager will be failed over
and will begin processing messages within minutes instead of the longer amount of
time it would take to manually recover and repair the failed machine or failed queue
manager in the cluster.
The added benefit of combining queue manager clusters with HA clustering is that the
high availability nature of the system becomes transparent to any clients using it. This
is because they are putting messages to a single cluster queue. If a queue manager in
the cluster fails, the client's outstanding requests are processed when the queue
manager is failed over to a backup machine. In the meantime, the client needs to take
no action because its new requests will be routed around the failure and processed by
another queue manager in the cluster. The client must only tolerate its requests
taking slightly longer than normal to be returned in the event of a failover.
Figure 6 shows each queue manager in the cluster in an active-active, standby
machine-shared disk configuration. The machines are configured with separate shared
disks for queue manager data and logs to decrease the time required to restart the
queue manager. See Considerations for WebSphere MQ restart performance for
more information.
Figure 6. Queue managers 1, 4, and 5 have active standby machines

In this example, if queue manager 4 fails, it fails over to the same machine as queue
manager 3, where both queue managers will run until the failed machine is repaired.


3.4.2. When to use HA WebSphere MQ queue manager clusters


Because this solution is implemented by combining external HA clustering
technology with WebSphere MQ queue manager clusters, it provides the ultimate
high availability configuration for distributed WebSphere MQ. It makes both
incoming and queued messages available and also fails over not only a queue
manager, but also any other resources running on the machine. For instance, server
applications, databases, or user data can fail over to a standby machine along with the
queue manager.
When using HA WebSphere MQ clustering in an active-standby configuration, it is a
simpler task to apply maintenance or software updates to machines, queue managers,
or applications. This is because you can first update a standby machine, then a queue
manager can fail over to it, ensuring that the update works correctly. If it is successful,
you can update the primary machine and then the queue manager can fail back onto it.
HA WebSphere MQ queue manager clusters also greatly reduce the administration of
the queue managers within them, which in turn reduces the risk of administration errors.
Queue managers that are defined in a cluster do not require channel or queue
definitions to be set up for every other member of the cluster. Instead, the cluster handles
these communications and propagates relevant information to each member of the
cluster through a repository.
HA WebSphere MQ queue manager clusters are able to scale applications linearly
because you can add new queue managers to the cluster to aid in the processing of
incoming messages. Conversely, you can remove queue managers from the cluster for
maintenance and the cluster can still continue to process incoming requests. If the
queue manager's presence in the cluster is required, but the hardware must be
maintained, then you can use this technique in conjunction with failing the queue
manager over to a standby machine. This frees the machine, but keeps the queue
manager running.
It is also possible for administrators to write their own cluster workload exits. This
allows for a finer control of how messages are delivered to queue managers in the
cluster. Therefore, you can target messages at machines in different ratios based on
the performance capabilities of the machine (rather than in a simple round robin
fashion).

3.4.3. When not to use HA WebSphere MQ queue manager clusters


HA WebSphere MQ queue manager clusters require additional proprietary HA
hardware (shared disks) and external HA clustering software (such as HACMP). This
increases the administration costs of the environment because you also need to
administer the HA components. This approach also increases the initial
implementation costs because extra hardware and software are required. Therefore,
balance these initial costs with the potential costs incurred if a queue manager fails and
messages become trapped.
Note that non-persistent messages do not survive a queue manager failover. This is
because the queue manager restarts once it has been failed over to the standby
machine, causing it to process its logs and return to its most recent known state. At
this point, non-persistent messages are discarded. Therefore, if your application
requires non-persistent messages, take this factor into account.
If trapped messages are not a problem for the applications (for example, the response
time of the application is irrelevant or the data is updated frequently), then HA
WebSphere MQ queue manager clusters are probably not required. That is, if the
amount of time required to repair a machine and restart its queue manager is
acceptable, then having a standby machine to take over the queue manager is not
necessary. In this case, it is possible to implement WebSphere MQ queue manager
clusters without any additional HA hardware or software.

3.4.4. Considerations for implementation of HA WebSphere MQ queue manager clusters
When configuring an active-active or active-standby setup in a cluster, administrators
should test to ensure that the failover of a given node works correctly. Nodes should
be failed over, when and where possible, to backup machines to ensure the failover
processes work as designed and that no problems are encountered when a failover is
actually required. Perform this procedure at the administrators' discretion; if failover
does not happen smoothly, it may cause problems or outages in a future production
environment.
As with queue manager clusters, do not code WebSphere MQ applications as machine
or queue manager specific, such as relying on resources only available to a single
machine. This is because when applications are failed over to a standby machine,
along with the queue manager they are running on, they may not have access to these
resources. To avoid these administrative problems, machines should be as equal as
possible with respect to software levels, operating system environments, and security
settings. Therefore, any failed-over applications should have no problems running.
Avoid message affinities when programming applications. This is because there is no
guarantee that messages put to the cluster queue will be processed by the same queue
manager every time. It is possible to use the MQOPEN option BIND_ON_OPEN to
ensure an application's messages are always delivered to the same queue manager in
the cluster. However, an application performing this operation incurs reduced
availability because this queue manager may fail during message processing. In this
case, the application must wait until the queue manager is failed over to a backup
machine before it can begin processing the application's requests. If affinities had not
been used, then no delay in message processing would be experienced. Another queue
manager in the cluster would continue processing any new requests.
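A minimal sketch of the two bind choices (WebSphere MQ classes for Java, hypothetical
queue and queue manager names): opening the cluster queue with BIND_NOT_FIXED lets
each put be workload balanced, while BIND_ON_OPEN fixes all puts to the instance
chosen at open time, which is what creates the affinity discussed above.

    import com.ibm.mq.*;

    public class BindOptions {
        public static void main(String[] args) throws MQException {
            MQQueueManager qmgr = new MQQueueManager("QM2");        // hypothetical gateway queue manager

            // Each message may be routed to a different instance of the cluster queue.
            MQQueue balanced = qmgr.accessQueue("CLUSTER.QUEUE",
                    MQC.MQOO_OUTPUT | MQC.MQOO_BIND_NOT_FIXED);

            // All messages go to the instance selected when the queue was opened,
            // so a failure of that instance delays the application's messages.
            MQQueue fixed = qmgr.accessQueue("CLUSTER.QUEUE",
                    MQC.MQOO_OUTPUT | MQC.MQOO_BIND_ON_OPEN);

            balanced.close();
            fixed.close();
            qmgr.disconnect();
        }
    }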
Application programmers should avoid long running transactions in their applications.
This is because these will greatly increase the restart time of the queue manager when
it is failed over to a standby machine. See Considerations for WebSphere MQ restart
performance for more information.
When implementing a WebSphere MQ cluster solution, whether for an HA
configuration or for normal workload balancing, be careful to have at least two full
cluster repositories defined. These repositories should be on machines that are highly
available. For example, they have redundant power supplies, network access and hard
disks, and are not heavily loaded with work. Repositories are vital to the cluster
because they contain cluster wide information that is distributed to each cluster
member. If both of these repositories are lost, it is impossible for the cluster to
propagate any cluster changes, such as new queues or queue managers. However, the
cluster continues to function with each member's partial repository until the full
repositories are restored.


3.5. HA capable client applications


You can achieve high availability on the client side rather than using HA clustering,
HA WebSphere MQ queue manager clusters, or shared queue server side techniques
as previously described. HA capable clients are an inexpensive way to implement
high availability, but they usually result in a large client with complex logic. This is not
ideal and a server side approach is recommended. However, HA capable clients are
discussed here for completeness.
Most occurrences of a queue manager failure result in a connection failure with the
client. Even if the queue manager is returned to normal operation, the client
disconnects and remains disconnected until the code used to connect the client to the
queue manager is executed again.
One possible solution to the problem of a server failure is to design the client
applications to reconnect, or connect to a different, but functionally identical server.
The client's application logic has to detect a failed connection and reconnect to
another specified server.
The method of detecting and handling a failed connection depends on the MQ API in
use. MQ JMS, for instance, provides an exception listener mechanism that allows the
programmer to specify code to be run upon a failure event. The programmer can also
use Java try/catch blocks to allow failures to be handled during code execution.
The MQI API reports a failure upon the next function call that requires
communication with the queue manager. In this scenario, it is the programmer's
responsibility to resolve the failure.
The management of the failure depends on the type of application and also on whether
there are any other high availability solutions in place. A simple reconnect to the same
queue manager may be attempted, and if successful, the application can resume
processing. You can configure the application with a list of queue managers that it
may connect to. Upon failure, it can reconnect to the next queue manager in the list.
In an HA clustering solution, clients still experience a failed connection if a server is
failed-over to a different physical machine. This is because it is not possible to move
open network connections between servers. The client may also need to be configured
to perform several reconnect attempts to the server, or to wait a period of time to
allow the server to restart.
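As a hedged sketch of such an HA capable client (MQ JMS classes, with hypothetical
host names, queue manager names, and channel), the code below walks a list of
functionally identical queue managers, retries with a delay to allow a restart or
failover to complete, and registers an exception listener so that a broken connection
is noticed as soon as it happens.

    import javax.jms.*;
    import com.ibm.mq.jms.JMSC;
    import com.ibm.mq.jms.MQQueueConnectionFactory;

    public class ReconnectingClient {
        // Hypothetical list of functionally identical queue managers.
        private static final String[][] ENDPOINTS = {
            {"mqhost1.example.com", "1414", "QM1"},
            {"mqhost2.example.com", "1414", "QM2"},
        };

        public static QueueConnection connectToAny() throws InterruptedException {
            while (true) {
                for (String[] ep : ENDPOINTS) {
                    try {
                        MQQueueConnectionFactory cf = new MQQueueConnectionFactory();
                        cf.setTransportType(JMSC.MQJMS_TP_CLIENT_MQ_TCPIP);
                        cf.setHostName(ep[0]);
                        cf.setPort(Integer.parseInt(ep[1]));
                        cf.setQueueManager(ep[2]);
                        cf.setChannel("SYSTEM.DEF.SVRCONN");        // hypothetical channel

                        final QueueConnection conn = cf.createQueueConnection();
                        conn.setExceptionListener(new ExceptionListener() {
                            public void onException(JMSException e) {
                                // Connection broken: close it and let the caller reconnect.
                                try { conn.close(); } catch (JMSException ignored) { }
                            }
                        });
                        conn.start();
                        return conn;                                // connected successfully
                    } catch (JMSException e) {
                        // This endpoint is unavailable; try the next one in the list.
                    }
                }
                Thread.sleep(10000);    // allow time for a restart or failover to complete
            }
        }
    }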
If the application is transactional, and the connection fails mid-transaction, the entire
transaction needs to be re-executed when a new connection is established. This is
because WebSphere MQ queue managers will roll back any uncommitted work at
start-up time.
You can supplement many server-side HA solutions with client-side
application code designed to cope with the temporary loss of, or the need to reconnect
to, a queue manager. A client that contains no extra code may need user intervention, or
even need to be completely restarted to resume full functionality. There is obviously
extra effort required to code the client application to be HA aware, but the end result
is a more autonomous client.

3.5.1. When to use HA capable client applications


HA capable clients are ideally suited when an application has a number of clients that
need to reconnect in the event of a failure and no HA solution has been implemented
on the server side. This allows clients to connect themselves to alternative services
while the failed service is restored.

3.5.2. When not to use HA capable client applications


When a robust, extensible high availability solution is required, the HA focus should
be on the server side rather than the client side. Clients with complex HA logic
become large and must be maintained, and new clients coming onto the system must
implement the same logic. A transparent server side HA solution negates the need to
implement this logic in every client.
Also, if there is a requirement for a thin client, there is no room for bulky HA logic,
so the HA solution must be implemented on the server side.

4. Considerations for WebSphere MQ restart performance

The most important factor in making an IT system highly available is the length of
time required to recover from a failure. The methods described for making a
WebSphere MQ queue manager highly available all involve situations where a queue
manager has failed and must be restarted, either on the same machine or on a standby
machine. Therefore, the quicker you can restart a queue manager, the quicker it can
complete any outstanding work and begin to process new requests.
The quickest option is to first attempt to restart the queue manager on the machine it
failed on. This is only possible if the queue manager did not fail because of a
hardware problem (an external HA clustering technology can determine this). This
approach results in a much quicker restart and a less disruptive failover, because
there is no need to move resources, such as network addresses, queue managers,
applications, and shared disks, to the standby machine. If this is not possible, the
queue manager must be failed over to a standby machine.
Therefore, minimizing the amount of start-up processing the queue manager must do
to regain its state will minimize the amount of time the queue manager is unavailable.
The next sections discuss factors that affect the start-up time of the queue manager.

4.1. Long running transactions


If your client applications use long running transactions with persistent messages, the
time a queue manager takes to start up increases.
Design applications to avoid long running transactions, because these affect the
amount of log data that must be replayed during recovery. By committing transactions
as frequently as possible, the amount of log replay required to recover a transaction is
reduced. WebSphere MQ uses automatically generated checkpoints to determine the
point from which the log is replayed. A checkpoint is a point at which the log and the
queue files/pagesets are consistent (see footnote 4). If a transaction remains
uncommitted across several checkpoints, the amount of log required to recover the
queue manager grows. Therefore, short transactions reduce the amount of data to be
processed when recovering a queue manager. On z/OS, a checkpoint is also taken
when the log is archived or when the number of log records written since the last
checkpoint reaches the LOGLOAD value.
Using shorter transactions also reduces the possibility of the queue manager
exhausting the available log space (and reduces the quantity of log space required).
On distributed platforms, a long running transaction is rolled back to release log
space when space runs out. On z/OS, the transaction is not rolled back in this
situation; instead, the archive logs must be accessed if the transaction backs out,
which can significantly extend the time that the backout takes. Note also that if the
transaction backs out and not all of the log records are available, the queue manager
terminates.

Footnote 4: For z/OS, note that pagesets are only consistent on every third checkpoint.

For instance, if the queue manager has a long running unit of work (UOW), it must
scan back over a number of logs to recover it. Introducing frequent commits into the
application code minimizes long start-up times caused by large UOWs, and also
reduces the number of log files required to recover the queue manager. If the required
log files have been moved to another medium, such as tape, retrieving them
significantly increases the restart time of the queue manager.
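For example, a consuming application might commit a transacted JMS session after
every fixed-size batch of messages rather than processing the whole queue in one unit
of work. This is only a sketch; the batch size and the five second receive wait are
illustrative values.

    import javax.jms.JMSException;
    import javax.jms.Message;
    import javax.jms.MessageConsumer;
    import javax.jms.Session;

    public class BatchedConsumer {

        private static final int BATCH_SIZE = 100;   // illustrative batch size

        public void drain(Session transactedSession, MessageConsumer consumer)
                throws JMSException {
            int inBatch = 0;
            Message message;
            while ((message = consumer.receive(5000)) != null) {  // wait up to 5 seconds
                process(message);                                 // application logic
                if (++inBatch >= BATCH_SIZE) {
                    transactedSession.commit();                   // keep each UOW short
                    inBatch = 0;
                }
            }
            if (inBatch > 0) {
                transactedSession.commit();                       // commit the final partial batch
            }
        }

        private void process(Message message) {
            // hypothetical application logic
        }
    }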

4.2. Persistent message use


Persistent messages are first written to the queue manager log (for recovery purposes)
and then to the queue file/pageset if the message is not retrieved immediately. The
queue manager replays the log during recovery, so reducing the amount of log to be
reprocessed reduces the time required for recovery.
Non-persistent messages are not written to the log, so they do not increase the queue
manager's restart time. However, if an application relies on WebSphere MQ for data
integrity, you must use persistent messages to ensure message delivery. Because
non-persistent messages are not logged, they do not normally survive a queue manager
restart. A new class of message service, introduced with WebSphere MQ 5.3 CSD 6,
is positioned between persistent and non-persistent messaging: it allows non-persistent
messages to survive a queue manager restart, although some messages may still be
lost because they do not have the logging protection that persistent messaging
provides.
On non-z/OS platforms, you enable this message class by setting the queue attribute
NPMCLASS to HIGH. On z/OS, the equivalent behavior comes from the use of shared
queues: non-persistent messages are stored in the Coupling Facility and are not
removed when a queue manager restarts.
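Whatever the queue configuration, the application itself decides which messages are
persistent. The sketch below, using the JMS API, sends a message as persistent only
when it must survive a restart; everything else is sent non-persistent so that it adds
nothing to the recovery log. The mustSurviveRestart flag is a hypothetical application
decision.

    import javax.jms.DeliveryMode;
    import javax.jms.JMSException;
    import javax.jms.Message;
    import javax.jms.MessageProducer;

    public class DeliveryModeChooser {

        public void send(MessageProducer producer, Message message,
                boolean mustSurviveRestart) throws JMSException {
            // persistent messages are logged and replayed at restart;
            // non-persistent messages are not logged, so they add no recovery cost
            int deliveryMode = mustSurviveRestart ? DeliveryMode.PERSISTENT
                                                  : DeliveryMode.NON_PERSISTENT;
            producer.send(message, deliveryMode,
                    Message.DEFAULT_PRIORITY, Message.DEFAULT_TIME_TO_LIVE);
        }
    }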

4.3. Automation
The detection of a failure, the failover to a standby machine, and the restart of the
queue manager (and its applications) should all be automated. Reducing operator
intervention significantly reduces the time required to fail a queue manager over to a
backup machine, allowing normal service to resume as quickly as possible.
You can automate this process by using HA clustering software, as described in the
section HA clustering software.

4.4. File systems


Use of a journaled file system is recommended on distributed platforms to reduce the
time required to recover a file system to a working state. A journaling file system
maintains a journal of the file transactions being written to the disk. In the event of a
failure, the disk structure can be returned to a consistent state at boot time (by
recovering from the journal) and used immediately.
On a non-journaling file system, the state of the file system after a failure is not
known, and a utility such as scandisk or e2fsck must be run to find and fix errors.
Because the journal avoids this problem, there is no need to perform a time-consuming
file system scan to verify the integrity of the file system before it can be used. Common
journaled file systems include Windows NTFS, Linux ext3, ReiserFS, and JFS.
On z/OS, WebSphere MQ provides facilities for taking backup copies of the message
data while the system is running. You can use these, in conjunction with the logs, to
recover WebSphere MQ in the event of media failure. Taking periodic backups is
recommended to reduce the amount of log data that needs to be processed at restart.
Finally, to decrease the start-up time of a queue manager that has been failed over,
store the queue manager log and the queue files on separate disks. This improves
recovery performance because the replay of the log does not contend with the queue
files for disk access.

5. Comparison of generic versus specific failover technology

The WebSphere MQ methods for high availability described earlier (standby machine
with shared disks, and HA WebSphere MQ queue manager clusters) both rely on
external HA clustering software and hardware to monitor hardware resources,
application data, and running processes, and to perform a failover if any of these fail.
The alternative is to use a product specific HA approach. These approaches provide an
out of the box experience and are usually tailored to a single software application.
They primarily provide data replication to a specified partner so that failover can
occur if the primary instance fails. You should fully investigate a product specific
high availability approach before considering its use in a serious HA implementation.
The primary reason for this investigation is that such software may rely on the
synchronization of data between product instances. Data replication of this kind is
discussed in the section High availability at the beginning of this paper: asynchronous
replication can cause duplicated or lost data, and synchronous replication incurs a
significant performance cost. Therefore, replicating data in this manner is not a good
basis for high availability.
Another reason to avoid product specific approaches is that they tend to allow only a
single software product to be failed over. An external HA clustering solution, by
contrast, can fail over and restart interdependent groups of resources, such as other
software applications and hardware resources. For instance, it is possible to fail over
WebSphere Business Integration Message Brokers together with WebSphere MQ and
DB2 using IBM's HACMP technology. This extensibility is vital when considering
the wider scope of high availability for all server applications and hardware resources.
An external HA clustering approach also uses the available machines on the network
more effectively. It can dynamically fail over an application and any other resources
to a single backup machine that is shared by a number of queue managers in the
network (often called an N+1 solution). This means a standby machine is not required
for every active machine in the network.
HA clustering technology must also handle subtle failures, such as an unexpected
increase in network latency (so that heartbeats are not received), or the primary
machine stalling for a short period because of increased I/O. In either of these
situations, the secondary machine may believe that its primary peer has failed and
begin to take over its work. External HA clustering technologies, such as HACMP,
handle these complex situations correctly, but product specific technologies may not.
The result can be that both the primary and the secondary machine believe they are
the primary, which leads to a split-brain problem and duplicate message processing.
External HA clustering technology avoids the split-brain situation because it arbitrates
access to all resources in the network and decides which machines may access the
data. In the event of a failure, the HA clustering software grants the standby machine
access to the shared resources, and that machine is then recognized by all as the
primary.
To conclude, investigate product specific approaches carefully, because they may not
be flexible or extensible enough to meet the much wider demands of a highly
available IT infrastructure.

6. Conclusion
This paper discussed approaches for implementing high availability solutions using
the WebSphere MQ messaging product.
Choosing a solution for a highly available system depends on the HA requirements of
that system. For instance, is each message important? Can a trapped message wait a
few hours until a machine is restarted, or must it be made available as soon as
possible? If the former, a simple clustering approach is enough; the latter requires HA
clustering software and hardware. Also, are software applications reliant on specific
software or hardware resources? If so, an HA clustering solution is critical, because
interdependent groups of resources must be failed over together.
Note that the approaches discussed in this paper for implementing high availability
with WebSphere MQ all employ common HA principles. You should adhere to those
principles when implementing any highly available IT system. The first is the use of a
single copy of any data. A single copy is much easier to manage: there are no
ambiguities about who owns the real data, and there are no issues in reconciling the
data if corruption occurs. When a failover occurs, only one instance of the software
has access to the real data, avoiding any confusion. The only exception is a disaster
recovery solution that moves copies of critical data off site. In that case, the copy is
not used to remove a single point of failure or to provide high availability; instead, if
a site-wide failure occurs, the backup is used to restore critical data and to resume
services (possibly at another site).
Second, always verify that software which stores persistent state on disk performs
synchronous writes, so that the data is hardened. Asynchronous writes can result in
software believing that data has been hardened to disk when, in fact, it has not.
WebSphere MQ always writes persistent data synchronously to disk to ensure that it
is hardened and is therefore recoverable in the event of a queue manager failure.
Third, implementing redundancy at the hard disk level, to remove the disk as a single
point of failure, is a simple step that prevents the loss of critical data if a disk fails.
Even though synchronous writes ensure the data has been hardened to disk, a disk
failure can still destroy the data. Therefore, implement technologies such as RAID to
provide disk level redundancy of data.
Fourth, and often overlooked, implement process controls for the administration of
production IT systems. It is often administrative errors that cause outages, whether
through improperly tested software updates, incorrect parameter settings, or
destructive actions performed by administrators. Proper process controls and security
restrictions minimize these errors. HA clustering software also provides a single
administration view of all machines in an HA cluster, which minimizes administration
effort.
Lastly, programming applications to avoid affinities between clients and servers, and
to avoid long running units of work, is good practice. The first allows applications to
be failed over to any machine and continue running; the second allows servers to be
restarted quickly because they do not have large amounts of outstanding work to
process.
We can conclude that implementing high availability using an external HA clustering
solution can bring large benefits to an IT infrastructure: it allows groups of resources
to be failed over, single copies of data to be maintained, and resources to be
administered more simply. IBM WebSphere MQ, DB2, WebSphere Application
Server, and WebSphere Business Integration Message Broker all support high
availability through HA clustering software, and all provide resources that make
configuration straightforward. This approach is considerably more flexible than a
product specific solution and can be expanded well beyond its initial scope.
Ultimately, high availability is a combination of implementing the correct server side
infrastructure, avoiding single points of failure wherever they may lie (in hardware or
software), and being flexible in the HA approach. Implementing HA can initially
appear to be an expensive undertaking, but the cost must always be balanced against
the potential cost of losing IT systems or critical data.
External HA clustering software can solve many of the problems of high availability,
but high availability is only one part of resilient computing. Concepts such as disaster
recovery, fault tolerance, scalability, and reliability must also be addressed to provide
a 24 by 7 solution that is available 100% of the time.

Appendix A Available SupportPacs


These SupportPacs are provided free of charge by IBM and assist in the setup and
configuration of WebSphere MQ with different HA clustering technologies.
MC41 Configuring WebSphere MQ for iSeries High Availability
http://www1.ibm.com/support/docview.wss?rs=203&uid=swg24006894&loc=en_US&cs=utf8&lang=en
MC63 WebSphere MQ for AIX Implementing with HACMP
http://www1.ibm.com/support/docview.wss?rs=203&uid=swg24006416&loc=en_US&cs=utf8&lang=en
MC68 Configuring WebSphere MQ with Compaq Trucluster for high availability
http://www1.ibm.com/support/docview.wss?rs=203&uid=swg24006383&loc=en_US&cs=utf8&lang=en
MC69 Configuring WebSphere MQ with Sun Cluster 2.X
http://www1.ibm.com/support/docview.wss?rs=203&uid=swg24000112&loc=en_US&cs=utf8&lang=en
MC6A Configuring WebSphere MQ for Sun Solaris with Veritas Cluster Server
http://www1.ibm.com/support/docview.wss?rs=203&uid=swg24000678&loc=en_US&cs=utf8&lang=en
MC6B WebSphere MQ for HP-UX Implementing with Multi Computer/Service Guard
http://www.ibm.com/support/docview.wss?rs=203&uid=swg24004772&loc=en_US&cs=utf-8&lang=en


About the authors


Mark Hiscock joined IBM in 1999 while studying for his Computer Science degree.
He has worked in the Hursley Park Laboratory in the United Kingdom testing IBM's
middleware suite of applications, from WebSphere MQ Everyplace to WebSphere
Business Integration Message Brokers. He now works as a customer scenarios tester
for WebSphere MQ and WebSphere Business Integration Message Brokers, basing
his testing on real world customer scenarios.
You can reach him at mark.hiscock@uk.ibm.com.
Simon Gormley joined IBM in 2000 as a software engineer, and works at the Hursley
Park Laboratory in the United Kingdom. He is currently working in the WebSphere
MQ and WebSphere Business Integration Brokers test team, and focusing on
recreating customer scenarios to form the basis of tests. You can reach him at
sgormley@uk.ibm.com.
