Copyright 2008 Hitachi, Ltd. All Rights Reserved. Copyright 2008 Oracle Corporation Japan. All Rights Reserved.
1. Introduction

Oracle Database 11g Release 1 was released in October 2007 as the latest major version of Oracle Database. In this version, Oracle Data Guard offers a number of new features that help ensure business continuity by protecting important corporate data, including a feature that initiates a failover to a remote standby system if the production system fails due to a disaster or emergency.

Oracle Corporation Japan and Hitachi, Ltd. performed verification tests of Oracle Data Guard at the Oracle GRID Center, building a large-scale transaction environment for a simulated production system that combined Hitachi BladeSymphony high-reliability blade servers and Oracle Database 11g Release 1.

This white paper introduces the BCM (Business Continuity Management) platform solution realized by combining Hitachi's hardware and Oracle Database 11g Release 1, and presents the results of verification tests on the effectiveness of features provided by Oracle Active Data Guard, a new option in Oracle Database 11g Release 1.
Acknowledgements

Oracle Corporation Japan established a partnership with Hitachi, Ltd. and other grid strategy partner companies in November 2006, opening the Oracle GRID Center (http://www.oracle.co.jp/solutions/grid_center/index.html), a facility that incorporates the most advanced technologies, with the goal of constructing next-generation business solutions capable of optimizing enterprise system infrastructures. Publication of this white paper was made possible by hardware and software provided to the Oracle GRID Center by Intel Corporation and Cisco Systems G.K., which support the purpose of the Oracle GRID Center, as well as support and aid provided by engineers from these companies. We wish to express our sincere gratitude to the companies and engineers for their support.

*All rights reserved.

Disclaimer

This document is provided for informational purposes only. The contents hereof are subject to change without prior notice. Neither Oracle Corporation Japan nor Hitachi, Ltd. warrants that this document is error-free, nor do they provide any other warranties or conditions, whether expressed or implied, including implied warranties and conditions of merchantability or fitness for a particular purpose. Oracle Corporation Japan and Hitachi, Ltd. specifically disclaim any liability with respect to this document. No contractual obligations are formed by this document, either directly or indirectly. This document may not be reproduced or transmitted in any form or by any means, electronic or mechanical, for any purpose, without prior written permission from Oracle Corporation Japan and Hitachi, Ltd.

Trademarks

BladeSymphony is a registered trademark of Hitachi, Ltd.
ORACLE is a registered trademark of Oracle Corporation.
Intel and Xeon are trademarks of Intel Corporation in the United States and other countries.
Red Hat is a trademark or a registered trademark of Red Hat Inc. in the United States and other countries.
Linux is a registered trademark of Linus Torvalds.
Cisco is a registered trademark of Cisco Systems, Inc. in the United States and other countries.
Other names of companies and products used herein are trademarks or registered trademarks of their respective owners.
2. Contents

1. Introduction
2. Contents
3. Criticality of Business Continuity Management (BCM)
4. Oracle Data Guard
5. Examples of BCM Platform Solutions Realized by Hitachi and Oracle
6. Verifying Oracle Active Data Guard
  6-1 Purpose and specifics of verification tests
  6-2 Verification environment
    6-2-1 System configuration
    6-2-2 Hardware used
    6-2-3 Software used
    6-2-4 About workloads
7. Verification Results
  7-1 Creating a standby database using RMAN network duplicate
  7-2 Effective use of standby site via Oracle Active Data Guard and reductions in system downtime based on effective use of standby site
  7-3 Measuring REDO apply performance for standby database
  7-4 Fast-Start Failover
  7-5 Failover under high-load transaction condition
8. Summary
Figures

Figure 4-1: Schematics of Oracle Data Guard operation
Figure 4-2: Effective use of standby database via Real-time Query
Figure 4-3: Effective use of standby database with Snapshot Standby
Figure 4-4: Fast-Start Failover operation
Figure 5-1: Online system maintenance based on Hitachi hardware and Oracle Data Guard
Figure 5-2: Data protection with rapid application of server resources at reduced standby cost
Figure 6-1: Configuration of the system used in verification tests
Figure 7-1: Conventional standby database production method
Figure 7-2: Creating a standby database using RMAN network duplicate
Figure 7-3: Previous drawbacks: Relationship between standby site use time and system downtimes
Figure 7-4: Effective use of standby site via Oracle Active Data Guard
Figure 7-5: Simulated business scenario used in verification tests
Figure 7-6: Process of failover to physical standby database
Figure 7-7: Low REDO apply performance
Figure 7-8: Adequate REDO apply performance
Figure 7-9: Fast-Start Failover operation
Figure 7-10: Verifying failover under high-load transaction conditions

Tables

Table 7-1: Apply performance comparison patterns
Table 7-2: Verification configuration patterns
Table 7-3: Verified failure patterns
Table 7-4: Verified failure patterns and verification results

Graphs

Graph 6-1: CPU usage of primary database servers during load generation
Graph 7-1: Comparison of standby data production times (via conventional method and using RMAN network duplicate)
Graph 7-2: CPU usage and network transfer volume in creation of standby database via conventional method (top: primary database server, bottom: standby database server)
Graph 7-3: CPU usage and network transfer volume in production of standby database using RMAN network duplicate (top: primary database server, bottom: standby database server)
Graph 7-4: Business transaction throughput, CPU usage of primary database server, and network transfer volumes during creation of standby database using RMAN network duplicate
Graph 7-5: Effective use of CPU resources of standby site with Oracle Active Data Guard
Graph 7-6: Reductions in system downtime via Oracle Active Data Guard during use of physical standby site
Graph 7-7: Comparison of volume of generated REDO against REDO apply performance
Graph 7-8: Apply performance comparison
Graph 7-9: Transactions during failure of all instances for the primary database and patterns in CPU usage for individual database servers
3. Criticality of Business Continuity Management (BCM)

IT systems have grown increasingly important for corporations. Even in the event of an earthquake-induced site failure or a system failure caused by hardware malfunction, corporations must continue to safeguard critical business data such as customer information and must rapidly restore system functionality to keep services running. In particular, corporations must meet the following requirements:

Business continuity
Interruptions or outages affecting important services pose serious threats to the entire business, in certain cases resulting not just in lost income but in serious damage to the confidence of customers and associated companies.

Data protection
Data remains a critical asset for any company. Corporate data (for example, payroll or employee information, client records, valuable research results, financial records, or history information) can require significant money and effort to reconstruct or regenerate once lost, if this is possible at all. In some cases such data loss may impair a company's capacity to continue operating.

System flexibility to adapt to changes
IT systems must ensure business continuity even in the event of unplanned downtime, including system failure. These systems must also minimize the duration of planned downtime, including downtime for software updates and hardware maintenance, to reduce any negative effects on business operations. Particularly in the case of open systems, the rapid pace of software development requires that procedures for updating software and applying patches be kept as short as possible so that systems stay up to date and robust. With respect to hardware, rapid developments in multi-core CPU technology in recent years now make it possible in certain cases to improve performance and reduce TCO simply by replacing existing equipment with the latest hardware. In general, agility and flexibility have become enterprise system requirements.

Cost efficiency: Effective use of standby sites
Cost efficiency also depends on making effective use of the server resources at standby sites set aside for disasters and other emergencies. High cost efficiency makes it easier to justify investment in countermeasures against system failure. Conversely, low resource efficiency at standby sites during ordinary operations generally makes it harder to secure adequate funding for such systems.

Combining Hitachi BladeSymphony or Hitachi Storage hardware with Oracle Real Application Clusters (Oracle RAC) and Oracle Data Guard makes it possible to deliver a solution that resolves these issues.
4. Oracle Data Guard

Oracle Data Guard creates a standby database as a copy of the production database (called the primary database) and provides a comprehensive set of services for that database, including maintenance, management, and monitoring. A standby database is created as a copy that maintains transactional consistency with the primary database. Following the creation of the standby database, REDO data sent from the primary database is applied to the standby database to reflect changes made in the primary database. If the primary database becomes unavailable due to a planned or unplanned outage, the standby database can assume the primary role to minimize downtime. Oracle Data Guard is included in Oracle Database Enterprise Edition.

[Figure: in normal operation, clients connect to the primary database, which is copied to the standby database; in emergencies, the connection switches to the standby database.]
Figure 4-1: Schematics of Oracle Data Guard operation

Standby databases generally come in one of two configurations. One, a physical standby database, is identical to the primary database at the physical block level. The other, a logical standby database, is identical to the primary database at the logical row data level.

The version of Oracle Data Guard in Oracle Database 11g Release 1 features various enhancements. Introduced below are some of the new features examined in our verification testing.

Oracle Active Data Guard
In previous releases, the application of REDO had to be suspended while data in a physical standby database was being accessed. The Oracle Active Data Guard option of Oracle Database 11g Release 1 enables access to data in a physical standby database without suspending the application of REDO. This feature is called Real-time Query. It allows a physical standby database to be used routinely for reporting and other tasks.

[Figure: during normal operation, reporting and backup acquisition are off-loaded from the primary database to the physical standby database maintained by Oracle Data Guard.]
Figure 4-2: Effective use of standby database via Real-time Query

Oracle Active Data Guard also provides fast incremental backup based on a change-tracking file when backups are taken from a standby database. This offers both high availability and convenient data protection against planned downtime or unplanned outages at the production site.

Snapshot Standby
The Snapshot Standby feature enables temporary use of a physical standby database as an easy-to-use read-write test database. Even while it is being used as a test database, the standby database continues to receive REDO from the primary database, so it continues to provide data protection. A snapshot standby database is also easily returned to physical standby database status.

[Figure: during normal operation, the snapshot standby is opened as a temporary read-write test database for test clients; REDO transfers from the primary database through Oracle Data Guard continue while the database is open.]

Figure 4-3: Effective use of standby database with Snapshot Standby
Creating a standby database using RMAN network duplicate
Previous releases required taking a full backup of the primary database at the local site, transferring the backup to the standby site, and restoring the backup to create a standby database. With Oracle Database 11g Release 1, the enhanced Recovery Manager (RMAN) network duplicate feature, used for database duplication, backs up the primary database while simultaneously restoring it over the network to the standby site. Network duplicate saves time and storage.

Fast-Start Failover
Fast-Start Failover automatically detects failures in the primary database and initiates a failover once a failure is detected. Detection of failure and initiation of failover are performed by the observer, which is set up separately from the primary database and the standby database. The observer is a component of Data Guard Broker. Fast-Start Failover enables automatic failover in the event of a primary database failure without administrator intervention.

[Figure: the observer monitors both the primary database and the standby database; REDO is transferred from the primary database to the standby database, and failover to the standby database is automatic.]

Figure 4-4: Fast-Start Failover operation

In previous releases, Fast-Start Failover could be used only in Maximum Availability mode, which required synchronous transfer of REDO. Oracle Database 11g Release 1 also supports Fast-Start Failover in Maximum Performance mode, which allows asynchronous REDO transfer and therefore use in a wider range of operating environments. The new version also provides greater flexibility in determining whether or not to initiate a failover at the time of failure detection, thereby meeting various failover requirements.
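The observer behavior described above can be illustrated with a small simulation. The following is not Data Guard Broker code. It is a minimal Python sketch in which the `Observer` class, the `heartbeat` method, and the `threshold_sec` parameter are all hypothetical names chosen for illustration. It assumes that the observer declares a failover only after the primary database has been continuously unreachable for a configured period, so that a brief interruption does not trigger one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observer:
    """Toy model of a failover observer (hypothetical, not an Oracle API)."""
    threshold_sec: float                 # assumed: tolerated unreachable time
    primary: str = "primary"
    standby: str = "standby"
    _down_since: Optional[float] = None  # time of first failed health check

    def heartbeat(self, now: float, primary_reachable: bool) -> Optional[str]:
        """Record one health check; return the new primary if failover fires."""
        if primary_reachable:
            self._down_since = None      # any success resets the timer
            return None
        if self._down_since is None:
            self._down_since = now
        if now - self._down_since >= self.threshold_sec:
            # Swap roles: the standby becomes the new primary.
            self.primary, self.standby = self.standby, self.primary
            self._down_since = None
            return self.primary
        return None


def simulate(checks, threshold_sec=30.0):
    """Feed (time, reachable) pairs to an observer; list the failover events."""
    observer = Observer(threshold_sec=threshold_sec)
    events = []
    for now, reachable in checks:
        new_primary = observer.heartbeat(now, reachable)
        if new_primary is not None:
            events.append((now, new_primary))
    return events
```

Keeping the decision in a component separate from both databases mirrors the arrangement in Figure 4-4: neither database has to judge its own health.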
5. Examples of BCM Platform Solutions Realized by Hitachi and Oracle

Described below are some examples of the BCM solution realized through the combination of Hitachi hardware and Oracle Database 11g Release 1.

Online system maintenance
Figure 5-1 shows an example of a Data Guard system configuration consisting of a production business environment and a test environment. The test environment is used for reporting tasks using Oracle Active Data Guard features or as a development environment using the Snapshot Standby feature. This sample configuration permits the application of patch sets and version updates to Oracle software. In combination with the Oracle Data Guard switchover feature, it also permits replacement and addition of BladeSymphony server blades, and Hitachi Storage virtualization allows seamless online addition of disks to production environments. The combination of Hitachi hardware and Oracle Database 11g Release 1 thus enables online maintenance of both software and hardware with minimal impact on production operations.

[Figure: an Oracle Data Guard configuration of a production environment and a test environment. (1) Switchover to test environment; (2) replacement with new blade server. Callouts: Oracle rolling upgrades; online blade server replacement; online hard disk addition to storage pool (no need to set LVM, ASM, or other OS; no need to reboot for disk recognition); switchover of production environment to minimize impact on business operations.]
Figure 5-1: Online system maintenance based on Hitachi hardware and Oracle Data Guard

Data protection at reduced standby costs and rapid addition of server resources
Figure 5-2 shows an example of a configuration with minimum allocation of standby database server resources. It provides data protection using Oracle Data Guard while minimizing standby database costs. If the primary database fails due to a disaster or other reason, a failover to the standby database is initiated so that business operations can continue. However, restoring the service levels of the primary database generally requires the allocation of additional resources to provide the same processing capacity as the primary database, a requirement that generally costs a great deal of time and money. Combining the provisioning features of BladeSymphony and Oracle Real Application Clusters can significantly reduce the cost of adding server resources while enabling immediate response.

[Figure: in normal operations, a Data Guard configuration of a 4-node RAC primary database and a 1-node RAC standby database maintains data protection at low initial cost by allocating minimum server resources to the standby database. After a primary database failure due to a disaster, additional server resources are required if the standby database is used to continue business operations; provisioning adds 3 nodes to the standby database, making it a 4-node RAC. Combining BladeSymphony's and Oracle's provisioning functions significantly simplifies the additional tasks and enables immediate response.]

Figure 5-2: Data protection with rapid application of server resources at reduced standby cost
6. Verifying Oracle Active Data Guard
6-1 Purpose and specifics of verification tests

We performed verification testing at the Oracle GRID Center with the following three main goals:

Confirming the effectiveness of new Oracle Data Guard features
We performed verification tests to confirm the effectiveness and usability of the new Oracle Data Guard features and to check for any important considerations when using the features. In the verification testing, we focused mainly on the following features:
  - Creating a standby database using RMAN network duplicate: benefits of creating a standby database using the RMAN network duplicate feature
  - Oracle Active Data Guard: benefits of effectively using the standby database with the Real-time Query feature of Oracle Active Data Guard, and reductions in system downtime based on effective use of the standby database
  - Snapshot Standby
  - Fast-Start Failover

Performance and failover under large-scale, high-volume transactions
We performed the verification tests to check for fast, effective failover to the standby database in the event of a failure while the primary database was under heavy load and with the CPU and network resources at maximum capacity. Another goal was to identify any potential issues associated with use in large-scale, high-volume transaction environments. These represent critical performance aspects, since the primary purpose of introducing Oracle Data Guard is to achieve switchover to the standby site in the event of a primary site failure.

Establishing best practices
We performed verification testing to establish procedures for creating a standby database and managing an Oracle Data Guard environment.

* For a list of the procedures that proved effective in our verification tests, please refer to the separate document titled Oracle Database 11g Release 1 Physical Standby Setting Guide (Japanese only).
6-2 Verification environment
6-2-1 System configuration

Figure 6-1 shows the configuration of the system used in our verification tests. The same public network was used to connect client machines to the database server and to transmit REDO from the primary site to the standby site. The network bandwidth was 1 Gbps.

[Figure: client machines, a primary site, and a standby site connected through Cisco Catalyst 6504 and Cisco Catalyst 3750 switches. Database server: Hitachi BladeSymphony BS320 (primary site: 2-node RAC; standby site: 2-node RAC). Storage: Hitachi Adaptable Modular Storage.]
Figure 6-1: Configuration of the system used in verification tests

6-2-2 Hardware used

Database server
  Model:   Hitachi BladeSymphony BS320 x 4 blades
  CPU:     Dual-Core Intel Xeon processor 3 GHz x 2 sockets/blade
  Memory:  8 GB

Client machine
  Model:   Intel White Box, 4 units
  CPU:     Quad-Core Intel Xeon processor 2.66 GHz x 1 socket/server
  Memory:  4 GB

Storage
  Model:                     Hitachi Adaptable Modular Storage (AMS)
  Hard disk:                 144 GB x 28 HDD (+2 HDD as spare)
  RAID group configuration:  2D+1P x 8 (for Oracle database)
6-2-3 Software used

Database server
  OS:      Red Hat Enterprise Linux 4.5
  Oracle:  Oracle Database 11g Release 1 (11.1.0.6) Enterprise Edition
           Oracle Real Application Clusters
           Oracle Active Data Guard
           Oracle Partitioning

Client machine
  OS:      Red Hat Enterprise Linux 4 Update 3
  Oracle:  Oracle Client 10g Release 2 (10.2)

6-2-4 About workloads

In our verification tests, we used an online transaction processing (OLTP) system for a simulated online Web shopping site as a workload model. SQL statements generated by JPetStore, which is provided as a sample application for Spring Framework (http://www.springframework.org), an open-source J2EE framework, were executed concurrently by a custom application. The process flow is described below. The SQL statements are abbreviated; "..." marks omitted parts.

(1) User sign-on
A user ID was randomly selected and a search performed for user information.
    select ... from account, profile, signon where account.userid = ? and signon.password = ? and ...;

(2) Product search
A keyword for product search was randomly generated and a search performed for the product. Adjustments were made so that the search results totaled approximately 100 on average.
    select ... from category where catid = ?;
    select ... from product where lower(name) like ?;

(3) Product selection
One item was selected from the search results (hits).
    select ... from item, product where i.itemid = ? and ...

(4) Stock quantity check
The quantity of the selected item in stock was checked.
    select ... from inventory where itemid = ?

(5) Order placement
Order data for the specified product was issued.
    insert into orders ...;
    insert into orderstatus ...;
    insert into lineitem ...;
The quantity of ordered products was subtracted from the inventory quantity in the stock management list.
    update inventory set qty = qty - 1 where itemid = ?;

(6) Order finalization
    commit
The above-mentioned processes were executed concurrently by client machines. As shown in Graph 6-1, the workload generated a heavy load on the primary database servers.

[Graph: CPU usage (%), broken down into user, system, and iowait, over time (0 to 1200 sec) for primary database server 1 and primary database server 2.]
Graph 6-1: CPU usage of primary database servers during load generation

7. Verification Results
7-1 Creating a standby database using RMAN network duplicate

Creating a standby database requires copying database files from the primary database to the standby site. With versions up to Oracle Database 10g, this was generally achieved by taking a backup of the primary database and transferring the backup files to the standby site over the network using ftp or scp, or by copying the backup files to tape and sending the tape to the standby site.

Oracle Database 11g Release 1 enhances the RMAN duplicate command to allow database files to be copied from the online primary database directly to the standby site. This eliminates the need to take a backup at the primary site and to produce a duplicate from the backup at the standby site. It also eliminates the need to reserve disk space for the backup files at both the primary and standby sites.

Comparison and verification of standby database creation by the conventional method and using RMAN network duplicate
We created standby databases by the conventional method and from the active database, measuring the time required to create a standby database and the CPU usage during that process. We then compared and examined the results. The total size of the primary database used in this verification test was approximately 170 GB.

Conventional method (Figure 7-1)
(1) A backup file was created by online backup using RMAN.
(2) The backup file was sent from the primary site to the standby site across a network using scp.
(3) The database was restored from the backup file by RMAN.

Creating a standby database from an active database (Figure 7-2)
(1) The online database files of the primary database were copied directly to the standby database.
[Figure: conventional standby database construction method. (1) A backup file is created at the primary site (online backup by RMAN); (2) the backup file is transferred to the standby site by scp; (3) the database is restored from the backup file using RMAN.]
Figure 7-1: Conventional standby database production method
[Figure: creating a standby database using RMAN network duplicate. (1) Online database files are copied directly from the primary database at the primary site to the standby database at the standby site.]
Figure 7-2: Creating a standby database using RMAN network duplicate

Graph 7-1 compares the time required to create a standby database by the conventional method and directly from the active database. Creating a standby database from the active database requires neither the creation of a backup at the primary site nor the restoration of the database at the standby site, so the standby database was created in about one third of the time required by the conventional method.
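To put these times in context, the database size (approximately 170 GB) and the network bandwidth (1 Gbps) stated in this paper give a lower bound on how quickly any method could move the data. The sketch below is a back-of-the-envelope calculation, not a measurement. It assumes decimal units (1 GB = 10^9 bytes), ignores protocol overhead, and ignores the fact that the link was shared with client traffic. The function names are illustrative.

```python
def min_transfer_seconds(size_gb: float, link_gbps: float) -> float:
    """Lower bound on copy time if the link were fully used (decimal units)."""
    size_bits = size_gb * 1e9 * 8
    return size_bits / (link_gbps * 1e9)


def effective_mb_per_sec(size_gb: float, elapsed_sec: float) -> float:
    """Average throughput in MB/s implied by an observed elapsed time."""
    return size_gb * 1000 / elapsed_sec


# Figures taken from the paper: ~170 GB database, 1 Gbps network.
floor_sec = min_transfer_seconds(170, 1.0)
print(f"theoretical minimum: {floor_sec:.0f} s")                      # 1360 s
print(f"at that rate: {effective_mb_per_sec(170, floor_sec):.0f} MB/s")  # 125 MB/s
```

Any elapsed time read from Graph 7-1 can be passed to `effective_mb_per_sec` to see how close a given method came to the 125 MB/s ceiling of the link.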
[Graph: time (sec), on an axis from 0 to 12000, required by the conventional method and by creating a standby database using RMAN network duplicate.]
Graph 7-1: Comparison of standby data production times (via conventional method and using RMAN network duplicate)

Graph 7-2 shows the CPU usage of the primary database server and standby database server and the network transfer volumes during the creation of the standby database by the conventional method. Approximately 30% of the CPU resources were used to create a backup file at the primary site and to restore the database at the standby site.

Graph 7-3 shows the CPU usage of the primary database server and standby database server and the network transfer volumes during the creation of the standby database from the active database. Compared to the conventional method, creating a standby database from the active database kept CPU usage low and transferred the online data files efficiently over the network. Network transfer volume per unit time was also higher, so the copy completed faster than with scp.

[Graph: CPU usage (%), broken down into user, system, and iowait, and network transfer volume (KB/s, receiving and transmitting) over time (sec) for the primary database server and the standby database server. Phases marked: online backup by RMAN and backup file transfer by scp on the primary database server; backup file reception by scp and database restoration by RMAN on the standby database server.]
Graph 7-2: CPU usage and network transfer volume in creation of standby database via conventional method (top: primary database server, bottom: standby database server)
[Graph: four panels over time (sec) for RMAN network duplicate. CPU usage (%) of the primary and standby database servers, broken down into user, system, and iowait; network transfer volume (KB/s) of the primary and standby database servers, showing receiving and transmitting volumes. Annotated phase: direct copying of online database files.]
Graph 7-3: CPU usage and network transfer volume in production of standby database using RMAN network duplicate (top: primary database server, bottom: standby database server)

Effect on business transactions during creation of standby database using RMAN network duplicate

To examine how creating a standby database affects business transactions that are being processed at the same time, we created a standby database from the active database while generating a business transaction load on the primary database. Graph 7-4 shows the measured business transaction throughput, CPU usage of the primary database server, and network transfer volumes.

In this case, contention between business transaction processing and database file transfer reduced business transaction throughput by approximately 20%. Transfer volumes of nearly 80 MB/s were recorded during the transfer of the database files. Since business transactions under ordinary operating conditions used approximately 20 MB/s, the volume of database files transferred to the standby site is estimated at about 60 MB/s. Because this file transfer rate was lower than with no load, creating the standby database took longer in this test case.

The effect on business transaction performance is expected to vary with the characteristics of the transactions being processed. In actual use, we recommend that users consider creating a standby database during a period of low business load to minimize the effect on business operations, and that they configure a separate network for REDO transfer.

In a high-latency network environment such as a WAN, the throughput of network duplicate might be improved by tuning the network I/O buffer size. Please refer to '14.2 Configuring I/O buffer space' of the Net Services Administrator's Guide 11g Release 1 (11.1).
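The transfer-volume estimate in this subsection is a simple subtraction, and buffer tuning for high-latency networks usually starts from the bandwidth-delay product (bandwidth multiplied by round-trip time). The following Python sketch works through both. The 80 MB/s and 20 MB/s figures come from the test; the 1 Gbit/s bandwidth and 25 ms round-trip time are hypothetical values chosen only for illustration, and the function names are ours, not part of any Oracle tool.

```python
def file_transfer_rate_mb_s(total_mb_s: float, business_mb_s: float) -> float:
    """Estimate the share of network throughput used by the database file copy."""
    if business_mb_s > total_mb_s:
        raise ValueError("business traffic cannot exceed total traffic")
    return total_mb_s - business_mb_s


def bandwidth_delay_product_bytes(bandwidth_bits_per_s: float, rtt_ms: float) -> int:
    """Bytes that can be in flight on the link: bandwidth x round-trip time."""
    return int(bandwidth_bits_per_s / 8 * rtt_ms / 1000)


if __name__ == "__main__":
    # Figures reported in the test: about 80 MB/s total, about 20 MB/s business load.
    print(file_transfer_rate_mb_s(80, 20), "MB/s estimated for database file transfer")
    # Hypothetical WAN link (not from the test): 1 Gbit/s, 25 ms round trip.
    print(bandwidth_delay_product_bytes(1_000_000_000, 25), "bytes in flight")
```

The byte count is only a starting point for buffer sizing; the Net Services guide referenced in the text covers the actual configuration.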
[Graph: three panels over time (sec) while the standby database is being created. Network transfer volume (KB/s) of the primary database server, showing receiving and transmitting volumes; CPU usage (%) of the primary database server, broken down into user, system, and iowait; transaction throughput. Annotations: the effect of creating a standby database using RMAN network duplicate on transaction throughput was about 20% in our verification tests. Total transfer volume was about 80 MB/s; subtracting the roughly 20 MB/s used by business transactions, the database file transfer volume is estimated to be about 60 MB/s.]
Graph 7-4: Business transaction throughput, CPU usage of primary database server, and network transfer volumes during creation of standby database using RMAN network duplicate

7-2 Effective use of standby site via Oracle Active Data Guard and reductions in system downtime based on effective use of standby site

Oracle Data Guard versions up to Oracle Database 10g had the following issues related to effective use of the standby site. REDO apply had to be stopped while the standby site was used on a read-only basis through the physical standby feature. Periodic data synchronization was required to reduce downtime caused by a primary site failure, which meant the standby site had to be returned to managed recovery mode at regular intervals, making operations more complicated. A logical standby database is accessible while REDO is being applied, but it has limitations related to data types and other factors.

Because of these restrictions, using the standby site previously required complex procedures. In addition, the longer the standby site was in use, the longer it took to recover the database after a failure, impairing availability (Figure 7-3).
[Figure: as standby site use time (the time during which log data application is stopped) increases, the volume of log data that must be applied after a primary site failure increases; this volume is proportional to the system downtime caused by the failure.]
Figure 7-3: Previous drawbacks: relationship between standby site use time and system downtime

Real-time Query, a feature of Oracle Active Data Guard newly provided with Oracle Database 11g Release 1, resolves these issues and enables effective use of the standby site while ensuring system availability. We verified the following two points to confirm the effectiveness of Oracle Active Data Guard.

(1) Effective use of standby site with Oracle Active Data Guard
We confirmed that the standby site could be used for read-only access at all times while the physical standby database continued to apply REDO.

(2) Reducing system downtimes during effective use of physical standby site
We confirmed that, because of (1), periodic synchronization is no longer needed, which keeps the downtime attributable to a primary site failure within a fixed duration.

Effective use of standby site with Oracle Active Data Guard

In the simulated situation shown in Figure 7-4, we confirmed the behavior that results from placing additional loads on the standby site, such as daily processing and report batch applications, while the primary site was under the online transaction load of an online shopping operation. The Real-time Query feature of Oracle Active Data Guard allowed REDO to be transferred and applied while the additional tasks were performed at the standby site.
[Figure: the primary database handles OLTP transactions for the online shopping business while REDO is transferred to and applied on the standby database; through Real-time Query, the standby database handles a SELECT/query load from additional operations (date/time processing, report batch).]
Figure 7-4: Effective use of standby site via Oracle Active Data Guard

Graph 7-5 compares CPU usage of the standby database server when a SELECT load is applied to the standby site through Real-time Query with CPU usage when no load is applied. With no SELECT load, the standby database server performs only REDO apply, and CPU usage is below 20%. With the additional SELECT load applied through Real-time Query, CPU usage exceeds 90%, confirming that CPU resources that were previously underused can be fully used.

[Graph: CPU usage (%) of the standby database server over time (0 to 600 sec), with and without SELECT load. Annotations: without SELECT load, only REDO log apply is performed and CPU use is low; with SELECT load, the load was applied even as REDO log data was being applied, resulting in effective resource use.]
Graph 7-5: Effective use of CPU resources of standby site with Oracle Active Data Guard

Reduced system downtimes during effective use of physical standby site

The primary site was assumed to run a 24-hour online shopping business, as shown in Figure 7-5, and the standby site was assumed to operate in Read Only mode for report batch applications and daily processing from nighttime to daytime.
[Figure: 24-hour timeline (6:00, 12:00, 18:00, 24:00). Primary site: online shopping service. Standby site: report batch and daily processing. A primary site failure occurs while the standby site is in use; the online shopping service is down until failover completes, then resumes on the standby site.]
Figure 7-5: Simulated business scenario used in verification tests

If a failure occurs at the primary site while the physical standby database is in use, the online shopping service fails over to the standby site, but all REDO transferred from the primary database must first be applied. Under the conventional method, a large volume of transferred REDO may remain to be applied, because REDO cannot be applied while the physical standby database is open for read-only use. If the Real-time Query feature of Oracle Active Data Guard is used, REDO is applied continuously while the standby site is in use, which reduces failover time.

Graph 7-6 gives the results of the verification test performed under this assumption. In the graph, transaction throughput remains at 0 from the time of the failure until load is generated again on the new primary database, that is, after the standby database has changed roles to become the primary database and services have resumed. This duration is defined as the failover time.

We compared one case based on the conventional method against another based on Oracle Active Data Guard. The failover time with Oracle Active Data Guard was greatly reduced compared to the failover time with the conventional method. With the conventional method, the volume of REDO not yet applied at the time of the failover was approximately 20 GB. Larger volumes of unapplied REDO will lengthen failover times accordingly.

[Graph: two panels of transaction throughput over time. Use of standby site with conventional method: extended downtime for online shopping functions. Use of standby site with Oracle Active Data Guard: short failover time resulting from continuous application of log data even during use of the physical standby database.]
Graph 7-6: Reductions in system downtime via Oracle Active Data Guard during use of physical standby site
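The failover time defined above (the period during which transaction throughput stays at 0) can be read off a throughput time series. The following Python sketch is one way to do so, under the assumption that throughput is sampled as (timestamp, transactions per second) pairs; the function name and the sample values are hypothetical and are not measurements from the verification tests.

```python
from typing import Optional, Sequence, Tuple


def failover_time(samples: Sequence[Tuple[float, float]]) -> Optional[float]:
    """Return the length in seconds of the first zero-throughput window.

    samples: (timestamp_sec, transactions_per_sec) pairs in time order.
    The window starts at the first sample with zero throughput and ends at
    the first later sample with non-zero throughput. Returns None if the
    throughput never drops to zero or never recovers.
    """
    start = None
    for timestamp, throughput in samples:
        if start is None:
            if throughput == 0:
                start = timestamp
        elif throughput > 0:
            return timestamp - start
    return None


if __name__ == "__main__":
    # Hypothetical samples: failure at t=20, service resumed at t=50.
    series = [(0, 500), (10, 480), (20, 0), (30, 0), (40, 0), (50, 450)]
    print(failover_time(series))
```

The resolution of the result is limited by the sampling interval, so a finer interval gives a tighter estimate.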
7-3 Measuring REDO apply performance for standby database

Two objectives generally need to be considered when examining system availability: Recovery Point Objective (RPO) and Recovery Time Objective (RTO). In Oracle Data Guard, the RPO is related to the settings for REDO transfer from the primary database to the standby database and to transfer performance, because REDO that has not been transferred to the standby database at the time of failover is lost. REDO apply performance for the standby database affects the RTO, because failover time in Oracle Data Guard includes the time required to process unapplied REDO.(*) Figure 7-6 illustrates the general process of failover to a physical standby database.

[Figure: failover timeline. Generation of failure, time until failure is detected, start of failover operation, completion of failover operation. The Data Guard failover operation consists of application of unapplied REDO, role change, and opening of the instance. The whole span is the downtime from an application perspective.]
Figure 7-6: Process of failover to physical standby database

(*) Although Oracle Data Guard can resume service immediately after a failure without applying unapplied REDO, we recommend processing all applicable REDO before resuming services for maximum data security.

One way to assess the adequacy of REDO apply performance is to compare the REDO apply performance of the standby database against the volume of REDO generated by the primary database. If REDO apply performance falls short of the volume of generated REDO, the standby database falls progressively further behind the most recent data in the primary database, and the volume of unapplied REDO grows. This can extend failover times in the event of a failure.
[Figure: REDO transfer from the primary database to the standby database, shown initially and after time n. With low apply performance, the difference between transferred/received REDO and applied REDO expands.]
Figure 7-7: Low REDO apply performance

REDO apply performance that exceeds the volume of generated REDO minimizes the volume of unapplied REDO and reduces failover times.
[Figure: REDO transfer from the primary database to the standby database, shown initially and after time n. With adequate apply performance, the difference between transferred/received REDO and applied REDO is minimized.]
Figure 7-8: Adequate REDO apply performance
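The relationship illustrated in Figures 7-7 and 7-8 can be expressed as a small model: the unapplied REDO backlog grows at the difference between the generation rate and the apply rate, and the backlog at the time of failure divided by the apply rate approximates the extra failover time. The Python sketch below assumes constant rates and a backlog that starts at zero; all rates shown are hypothetical and are not results from the verification tests.

```python
def unapplied_redo_mb(generation_mb_s: float, apply_mb_s: float, elapsed_s: float) -> float:
    """Unapplied REDO (MB) accumulated after elapsed_s seconds, starting from no backlog.

    The backlog cannot be negative: when apply keeps up, it stays at zero.
    """
    return max(0.0, (generation_mb_s - apply_mb_s) * elapsed_s)


def time_to_apply_backlog_s(backlog_mb: float, apply_mb_s: float) -> float:
    """Seconds needed to apply the backlog once the primary has stopped generating REDO."""
    if apply_mb_s <= 0:
        raise ValueError("apply rate must be positive")
    return backlog_mb / apply_mb_s


if __name__ == "__main__":
    # Hypothetical case for Figure 7-7: apply (8 MB/s) is slower than generation (10 MB/s).
    backlog = unapplied_redo_mb(10, 8, 3600)
    print(backlog, "MB unapplied after one hour")
    print(time_to_apply_backlog_s(backlog, 8), "sec added to failover")
    # Hypothetical case for Figure 7-8: apply (40 MB/s) exceeds generation (10 MB/s).
    print(unapplied_redo_mb(10, 40, 3600), "MB unapplied after one hour")
```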
To assess REDO apply performance, we compared the volume of REDO generated when the primary database is under a large transaction load against the REDO apply performance of the standby database. Oracle statistical information was obtained before and after load generation on the primary database, and the difference between the two values was used to calculate the volume of REDO generated per second. We measured REDO apply performance by applying a group of archived REDO log files totaling about 3 GB. The standby Oracle instances were restarted before measurement began, and the V$RECOVERY_PROGRESS view was used to check the amount of REDO applied per second. Since Oracle Active Data Guard was used during measurement, the Oracle instances for the standby database were open read-only. Graph 7-7 compares the volume of generated REDO against REDO apply performance.

[Graph: volume of generated REDO and REDO apply performance, shown as a ratio on a scale of 0 to 10]
Graph 7-7: Comparison of volume of generated REDO against REDO apply performance

The graph indicates that REDO apply performance far surpassed the total volume of REDO generated by the primary database instances. In Oracle Database 11g Release 1, one instance handles REDO apply for a standby database in an Oracle RAC configuration. Although the configuration of the disks holding the online REDO log files and archived REDO log files affects REDO apply performance, the measurements indicate that performance in the verification test environment is sufficient to apply the REDO generated by multiple nodes without delay.

We then compared REDO apply performance when the physical standby database was in READ ONLY OPEN status against performance when it was in MOUNT status, to determine whether Oracle Active Data Guard affects REDO apply performance. The measurement method was the same as described above. We compared measurements for the following three patterns.

Pattern No. | Standby instance 1 | Standby instance 2
1 | MOUNT | MOUNT
2 | READ ONLY OPEN | MOUNT
3 | READ ONLY OPEN | READ ONLY OPEN

Table 7-1: Apply performance comparison patterns
Graph 7-8 shows the results of the performance comparison, normalized so that the apply performance for pattern 1 equals 1.

[Graph: apply performance ratio (scale 0 to 1.2) for pattern Nos. 1, 2, and 3]

Graph 7-8: Apply performance comparison

The apply performance was consistent regardless of whether the instances of the physical standby database were in MOUNT or READ ONLY OPEN status. This indicates that Oracle Active Data Guard had no impact on REDO apply performance in our tests.
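The two calculations used in this section, the REDO generation rate taken from the difference between two statistics snapshots and the apply performance ratio normalized to pattern 1, can be sketched in Python as follows. The white paper does not name the statistic that was sampled; using the cumulative 'redo size' value from V$SYSSTAT is our assumption, and every number below is hypothetical rather than a measured result.

```python
from typing import Dict


def redo_rate_bytes_per_s(redo_before: int, redo_after: int, interval_s: float) -> float:
    """REDO generated per second, from two cumulative statistic snapshots.

    redo_before / redo_after: cumulative bytes of REDO (assumed here to be the
    'redo size' statistic) sampled before and after load generation.
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    if redo_after < redo_before:
        raise ValueError("cumulative statistic decreased; was the instance restarted?")
    return (redo_after - redo_before) / interval_s


def normalize_to_baseline(rates: Dict[int, float], baseline: int) -> Dict[int, float]:
    """Express each pattern's apply rate as a ratio of the baseline pattern's rate."""
    base = rates[baseline]
    return {pattern: rate / base for pattern, rate in rates.items()}


if __name__ == "__main__":
    # Hypothetical snapshots taken 60 seconds apart.
    print(redo_rate_bytes_per_s(1_000_000, 7_000_000, 60), "bytes/s")
    # Hypothetical apply rates (MB/s) for patterns 1 to 3 of Table 7-1.
    print(normalize_to_baseline({1: 50.0, 2: 50.0, 3: 49.0}, baseline=1))
```

With a RAC primary, the per-instance rates would need to be summed before comparing them with the apply rate of the single applying instance.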
7-4 Fast-Start Failover

The Fast-Start Failover feature automatically detects failures in the primary database and starts a failover once a failure is detected. In Oracle Database 10g Release 2, the protection mode had to be set to Maximum Availability to use the Fast-Start Failover feature, which required synchronous REDO transfer. Synchronous transmission of REDO guarantees commit-level protection of updates to the primary database, but its effects on performance, including slower response times for the primary database due to network performance limitations, must be considered when business functions require high response performance.

In Oracle Database 11g Release 1, Fast-Start Failover can also be used in Maximum Performance protection mode, which allows asynchronous REDO transfer and makes the feature applicable to a wider range of cases. When asynchronous REDO transfer is set, a lag may arise between the most recent data in the primary database and in the standby database, which would result in data loss in a failover. The Fast-Start Failover feature in Oracle Database 11g Release 1 allows the administrator to preset the allowed time lag for failover and, in the event of a failure, determines whether or not to start a failover based on that value.

In our verification testing, we set the time lag value to 60 seconds, then halted all instances of the primary database using the abort option to check Fast-Start Failover operations. Figure 7-9 shows the behavior after the failure.

[Figure: the observer (1) monitors the primary database and (2) checks the standby database.]
Figure 7-9: Fast-Start Failover operation

(1) When the primary database connection remains unavailable for a certain duration, the observer concludes that a failure has occurred. Any value can be set for the time period used to determine a failure.

(2) The observer checks the time lag between the latest update information for the primary database and for the standby database. If the time lag is less than the preset value, a failover is initiated.

The time lag can be checked with the v$dataguard_stats view on the standby database. In our verification testing, the time lag was 0 seconds, as shown below, so a failover was executed.
SQL> select name, value from v$dataguard_stats where name='transport lag';

NAME                           VALUE
------------------------------ ------------
transport lag                  +00 00:00:00

If the lag exceeds the preset threshold value, a failover is not initiated, because a time lag greater than the threshold means the volume of lost data would be unacceptable. In this case, the Fast-Start Failover status is shown as TARGET OVER LAG LIMIT when checked in the v$database view of the standby database.

SQL> select fs_failover_status from v$database;

FS_FAILOVER_STATUS
----------------------
TARGET OVER LAG LIMIT

As shown above, we confirmed that the Fast-Start Failover feature of Oracle Database 11g Release 1 can perform automatic failover that meets the data protection requirements of each system, even with asynchronous REDO transfer in Maximum Performance mode. Oracle Database 11g Release 1 also allows various conditions other than the time lag value to be set, giving detailed control over automatic failover behavior. These extended features should reduce the time and work required for failover management.
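The lag check described in this section can be illustrated with a short Python sketch that parses a lag value in the "+DD HH:MI:SS" form shown in the query output and compares it with the preset limit. This only mimics the decision for illustration and is not how the observer is implemented; the function names are hypothetical, and treating a lag exactly equal to the limit as acceptable is our assumption, since the text only specifies the "less than" and "exceeds" cases.

```python
import re

# Matches lag values such as "+00 00:00:00" (days, then hours:minutes:seconds).
_LAG_PATTERN = re.compile(r"^\+(\d+) (\d{2}):(\d{2}):(\d{2})$")


def lag_to_seconds(value: str) -> int:
    """Convert a '+DD HH:MI:SS' lag string into a number of seconds."""
    match = _LAG_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognized lag value: {value!r}")
    days, hours, minutes, seconds = (int(group) for group in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def failover_allowed(lag_value: str, lag_limit_s: int) -> bool:
    """True when the lag does not exceed the preset limit (equality allowed by assumption)."""
    return lag_to_seconds(lag_value) <= lag_limit_s


if __name__ == "__main__":
    # The verification test used a 60-second limit and observed a lag of 0 seconds.
    print(failover_allowed("+00 00:00:00", 60))
    # Hypothetical lag of 90 seconds: over the limit, so no failover.
    print(failover_allowed("+00 00:01:30", 60))
```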
7-5 Failover under high-load transaction condition

Oracle RAC provides features to ensure business continuity in the event of local failures within a site (for example, single-node failures in the primary database), while Oracle Data Guard helps ensure business continuity even against site failures that involve all nodes of the primary database. In our verification testing, we simulated a number of possible failure types while generating high loads on the primary database, executing failovers to the standby database when necessary to confirm transaction processing continuity.

Figure 7-10 shows the failure cases used in the verification tests. In each of the three Oracle Data Guard configurations (A, B, and C, shown in Table 7-2), failures 1 through 5 (Table 7-3) were simulated.
[Figure: failure verification patterns across the primary site and standby site. (1) Failure of all instances of the primary database. (2) Total primary database server failure. (3) Network communication failure between primary and standby databases. (4) Failure of all instances of the standby database. (5) Listener failure of the standby database.]
Figure 7-10: Verifying failover under high-load transaction conditions

Configuration | Oracle Data Guard protection mode | Status of standby site
A | Maximum Performance mode | Oracle Active Data Guard
B | Maximum Availability mode | Oracle Active Data Guard
C | Maximum Performance mode | Snapshot Standby

Table 7-2: Verification configuration patterns

# | Simulated failure | Failure-reproducing method
1 | Failure of all Oracle instances for the primary database | Execution of the srvctl stop database -o abort command for primary node 1
2 | Failure of all primary database servers | Execution of the halt -n -f command for primary node 1 and node 2
3 | Network communication failure between primary and standby databases | Network cable disconnection
4 | Failure of all Oracle instances for the standby database | Execution of the srvctl stop database -o abort command for standby node 1
5 | Listener failure for the standby database | Simultaneous kill of the listener processes for standby node 1 and node 2

Table 7-3: Verified failure patterns
We used the following verification procedure:
(1) Began generating load on the primary database.
(2) Simulated primary database failure.
(3) Stopped load generation.
(4) Initiated failover to standby database.
(5) Resumed load generation.

In all configurations, the verification results showed the expected behavior (Table 7-4). We confirmed that failover to the standby database enables continuous processing of transactions in cases involving the failure of all Oracle instances for the primary database and the failure of all primary database servers.

# | Simulated failure | Behavior after failure
1 | Failure of all Oracle instances for the primary database | For each configuration, we confirmed continuous processing of transactions following the execution of failover to the standby database.
2 | Failure of all primary database servers | For each configuration, we confirmed continuous processing of transactions following the execution of failover to the standby database.
3 | Network communication failure between primary and standby databases | For each configuration, we confirmed that continuous processing of transactions was possible using the primary database. For configuration B, transaction processing halted for the duration set with the NET_TIMEOUT attribute (30 seconds in the verification test), after which continuous processing was possible.
4 | Failure of all Oracle instances for the standby database | For each configuration, we confirmed that continuous processing of transactions was possible using the primary database. For configuration B, we also confirmed that continuous processing of transactions was possible.
5 | Listener failure for the standby database | For each configuration, we confirmed that continuous processing of transactions was possible using the primary database.

Table 7-4: Verified failure patterns and verification results

The following introduces one characteristic behavior exhibited by the failover operation under high-load transaction conditions.
Graph 7-9 shows transaction throughput during the failure of all instances of the primary database in configuration A, together with the CPU usage of the individual primary and standby servers. After the failure at (1), the failover was completed and transactions resumed at (2). Transaction throughput was reduced until (3) because of contention between the disk I/O from clearing the standby REDO log files, which the database server performs following the failover, and the disk I/O for the online REDO log files generated by the resumed transactions. The time required to clear standby REDO log files depends on total file size and disk I/O performance. This behavior can be avoided by providing enough I/O bandwidth to handle both the normal workload and the additional I/O caused by clearing the standby REDO log files, or by placing online REDO log files and standby REDO log files on separate disks to avoid disk I/O contention.

[Graph: transaction throughput and CPU usage (%) of primary instance 1, primary instance 2, standby instance 1, and standby instance 2 over time, with markers (1), (2), and (3). (1): failure of all instances of the primary database. (1) to (2): failover to standby database. (2) to (3): clear REDO processing.]
Graph 7-9: Transactions during failure of all instances for the primary database and patterns in CPU usage for individual database servers
8. Summary

Verification tests at the Oracle GRID Center confirmed the effectiveness of Oracle Data Guard in Oracle Database 11g Release 1 on a Hitachi platform. Specifically, we confirmed that Oracle Active Data Guard, a new option introduced in Oracle Database 11g Release 1, makes effective use of resources at the standby site and reduces failover times in the event of failures while the standby database is in use. We believe that Oracle Database 11g Release 1, with its new features, can dramatically improve the cost efficiency of disaster recovery systems over previous versions.

We also examined the behavior resulting from failures in a large-scale transaction load environment and confirmed transaction continuity. We are confident that a disaster recovery solution based on a combination of Hitachi hardware and Oracle Database 11g Release 1/Oracle Data Guard will provide the support needed to ensure high levels of BCM for corporate infrastructures.

Precautions concerning use of this document

The contents of this white paper are based on the results of verification tests performed at the Oracle GRID Center. We make no guarantee that the same results will be achieved under all conditions. Actual results will depend on various factors, including the specific conditions of the client's environment.