
Advanced Technical Support, Americas

AIX
Performance Tuning
Customer Technical Session

Steve Nasypany nasypany@us.ibm.com

ATS System p AIX Performance


Including materials from Charlie Cler, Luc Smolders and Dan Braden

5/14/2008 © 2008 IBM Corporation


Advanced Technical Support, Americas

AIX Performance Tuning


 Purpose
– Virtualization issues
– Overview of generic memory, process and IO tuning
– Highlight new tool function in AIX
 Agenda
– Virtual Processors
– Virtual I/O Server
– Generic Memory Tuning
– Generic Process Tuning
– Generic IO Tuning
– LPAR and CEC Recordings/Reports
– AIX TL-06 Impacts on Tools
– AIX TL-07 & AIX 6.1 Impacts on Tools

2 © 2008 IBM Corporation


Advanced Technical Support, Americas

Virtual Processors

3 © 2008 IBM Corporation


Advanced Technical Support, Americas

Shared Processor LPARs (Micro-partitions) - Definitions


 LPARs are defined to be dedicated or shared
ƒ Dedicated partitions use whole number of CPUs
ƒ Shared partitions use whole or fractions of CPUs (smallest increment is 0.1, can be greater than 1.0)

 Shared processor pools - subset (or all) of physical CPUs in a system


 Desire is to have all of the installed processors in the shared pool and no dedicated CPU LPARs.

 Entitled capacity expressed in the form of number of 10% CPU units


ƒ Desired: Size of partition at boot time
ƒ Minimum: Partition will start with less than desired, but won't start if the Minimum capacity is not available
ƒ Maximum: DLPAR changes to desired cannot exceed this capacity
ƒ Divided among all of the LPARs within a shared processor pool
ƒ Uncapped capacity cannot exceed number of virtual processors for an LPAR

 Capped vs uncapped
ƒ Capped: CPU Capacity limited to desired setting.
ƒ Uncapped: CPU Capacity limited by unused capacity in ‘pool’ and cannot exceed number of
virtual processors (not related to maximum processing units)

 Shared Pool LPARs run in ‘virtual’ processors


 Time slicing of CPUs between partitions

 Priority weighting to determine preference for spare cycles


 Automatic Load Balancing (default is 128, 0 implies no use of spare cycles, 255 is max priority)
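A quick way to confirm how a partition was actually defined (mode, entitlement, virtual processors, uncapped weight) is lparstat -i. The output below is an illustrative sketch, not a capture from a real system:

# lparstat -i | egrep "Type|Mode|Entitled Capacity|Virtual CPUs|Capacity Weight"
Type                               : Shared-SMT
Mode                               : Uncapped
Entitled Capacity                  : 0.50
Online Virtual CPUs                : 2
Maximum Virtual CPUs               : 4
Minimum Virtual CPUs               : 1
Variable Capacity Weight           : 128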

4 © 2008 IBM Corporation


Advanced Technical Support, Americas

Physical, Logical, Virtual Layers

[Diagram: a 16-CPU SMP server hosting an AIX 5.2 LPAR with 2 dedicated CPUs, an AIX 5.3 LPAR with 1 dedicated CPU (SMT=on), and three AIX 5.3 micro-partitions of 2.1, 0.8 and 1.2 processing units (SMT on, off and on respectively) drawn from a 13-CPU shared processor pool*. Logical processors (L) run on virtual processors (V), which run on physical CPUs.]
Think “PVL”: P=Physical, V=Virtual, L=Logical (SMT)

* All activated, non-dedicated CPUs are automatically placed into the shared processor pool.
Only 2.1 + 0.8 + 1.2 = 4.1 processing units of “desired capacity” have been allocated from the pool of 13 CPUs.

5 © 2008 IBM Corporation


Advanced Technical Support, Americas

SPLPAR Summary
Shared Processor concepts

 Partitions run on Virtual Processors (VPs).
 A VP runs on a Physical Processor (PP) only part of the time.
 A VP has one or two logical processors depending on the SMT state.
 Minimum size of a partition is 0.1 processing units, with increments of 1/100th of a processing unit.
 A partition’s capacity is defined by its entitlement and, for uncapped partitions, by the number of VPs.
 PHYP (the hypervisor) is responsible for scheduling and dispatching VPs on PPs, using a 10 msec dispatch wheel.
 A partition’s time becomes “virtual”, and is maintained by the PHYP in the partition’s PURR.

[Diagram: four shared partitions (splpar 1-4), each with a virtual timebase and one virtual CPU, dispatched by the 10 ms dispatch wheel onto a physical CPU (100 units of timebase).]
6 © 2008 IBM Corporation


Advanced Technical Support, Americas

Hypervisor Dispatch Algorithm

 The diagram illustrates the hypervisor dispatch algorithm, which can be viewed using the metaphor of a “wheel” with a fixed rotation period of 10 ms, guaranteeing that each VP receives its share of entitlement in a timely fashion.
 At time period 0, a new 10 ms dispatch window begins and splpar 4’s VP is dispatched to a physical processor, where it runs for 5 msecs.
 At time period 5, splpar 3’s VP is dispatched for 1 msec.
 At time period 6, splpar 2’s VP is dispatched for 2 msecs.
 Finally, at the end of the 10 ms dispatch window, splpar 1’s VP is dispatched for 2 msecs.

LPAR #   Entitlement / # of VPs
1        .2 / 1
2        .2 / 1
3        .1 / 1
4        .5 / 1

[Diagram: the 10 ms dispatch wheel dispatching the splpar 1-4 virtual timebases onto a physical CPU, with a timeline from 0 to 10 marking each partition’s dispatch window.]

7 © 2008 IBM Corporation


Advanced Technical Support, Americas

Virtual Processors and Processing Unit Relationship

Virtual Processors      Range of Processing Units
Assigned to LPAR        that the LPAR can utilize
1                       0.1 - 1
2                       0.2 - 2
3                       0.3 - 3
4                       0.4 - 4
...                     ... (10x range)

Example: An LPAR has 2 virtual processors. This means that its minimum must be 0.2 or higher (0.1 per virtual processor). The maximum processing units that it can utilize is 2.0. If we want this LPAR to use more than 2.0 physical CPUs’ worth of cycles, we need to dynamically add more virtual processors, perhaps 2 more. This would make its new minimum 0.4 and its maximum utilization 4.0.

The “desired” number of virtual processors establishes the maximum number of
processing units that an LPAR can access.
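If the partition profile allows it, that change can be made dynamically from the HMC command line. The sketch below assumes an HMC-managed system; the managed-system and partition names are placeholders and the exact options should be verified against your HMC level:

chhwres -m MY_SYSTEM -r proc -o a -p MY_LPAR --procs 2 --procunits 0.2
(adds 2 virtual processors and 0.2 processing units to the running partition)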
8 © 2008 IBM Corporation
Advanced Technical Support, Americas

Virtual Processors and Processing Unit Relationship

Different number of virtual processors, same amount of processing units (1.6):

AIX 5.3 LPAR with 4 virtual processors (V V V V), 1.6 processing units:
– Each virtual processor will receive 0.4 processing units
– Max processing units accessible to handle peak workload is 4
– Individual processes/threads may run slower
– Workloads with a lot of processes/threads may run faster

AIX 5.3 LPAR with 2 virtual processors (V V), 1.6 processing units:
– Each virtual processor will receive 0.8 processing units
– Max processing units accessible to handle peak workload is 2
– Individual processes/threads may run faster
– Workloads with a lot of processes/threads may run slower

Consider the peak processing requirements when setting the desired number of virtual processors. In
addition, the quantity of virtual processors can be adjusted to match the number of processes/threads
present in the workload.
9 © 2008 IBM Corporation
Advanced Technical Support, Americas

Virtual Processors and Processing Unit Relationship

Different number of virtual processors, with excess processing unit capacity available:

AIX 5.3 LPAR with 4 virtual processors (V V V V), 4.0 processing units:
– Each virtual processor will receive 1.0 processing units
– Max processing units accessible to handle peak workload is 4
– Virtual processors receive 1 full CPU’s worth of processing units
– Workloads with a lot of processes/threads may run faster due to the larger number of virtual processors

AIX 5.3 LPAR with 2 virtual processors (V V), 2.0 processing units:
– Each virtual processor will receive 1.0 processing units
– Max processing units accessible to handle peak workload is 2
– Virtual processors receive 1 full CPU’s worth of processing units
– Workloads with a lot of processes/threads may run slower due to the lower number of virtual processors

In the presence of excess processing units, virtual processors receive the same amount of processing units.

10 © 2008 IBM Corporation


Advanced Technical Support, Americas

Sizing Processing Units and Virtual Processors

Peak requirement is 3.5 CPUs (processing units)

Normal requirement is 0.9 CPUs (processing units)


[Chart: processing unit requirement (0 to 4 CPUs) over time, with a normal load of about 0.9 processing units and a peak of 3.5.]

Processing Units Sizing:


Need to size desired processing units to address non-peak, normal workload.
Desired = 0.9 (set to match 0.9 processing units, normal requirement)
Minimum = Starting point might be 0.5, or approx. ½ of Desired.
Maximum = 4+

Virtual Processor Sizing:


We need to size desired number of virtual processors to be able to handle peak load.
Desired = 4 (round 3.5, peak requirement up to next whole number)
Minimum = Starting point might be 2, or ½ of Desired.
Maximum = 4+

11 © 2008 IBM Corporation


Advanced Technical Support, Americas

Operating within the Shared Processor Pool

[Charts: processing-unit and CPU-utilization profiles over time for three cases – an LPAR running at its desired processing units, a “User” of extra processing units whose utilization rises above its desired setting, and a “Donor” of extra processing units whose utilization falls below its desired setting.]

The goal is to match Users and Donors so that the planned overall
shared processing pool CPU utilization does not exceed 100%.
12 © 2008 IBM Corporation
Advanced Technical Support, Americas

Deployment Choices
 No information on application behavior and utilization of resources?
– Use dedicated processors
• Minimize risk, but excess capacity is unused
• Collect performance data to determine suitability for moving to micro-
partitions
– Use shared processors
• Allocate entitlement liberally, uncap, until resource behavior known
 Mixed applications, variable behavior
– Size to known peaks
• Enough application, benchmark or local performance information to model
expected behavior
• Size each micro-partition, allocating extra shared pool and memory resources
– Collect performance data to validate model, free shared pool and memory
allocation to optimize
 Well-defined applications
– Detailed application knowledge allowing for partitions to be individually over-
committed (don’t conflict for shared resources)
– Ideal usage of resources
13 © 2008 IBM Corporation
Advanced Technical Support, Americas

Virtual Processors - Folding


 Problem: Customers specify too many VPs, causing excess dispatches
 Solution: Dynamically adjust active Virtual Processors based on load
– System consolidates loads onto a minimal number of VPs
• Scheduler computes utilization of VPs every second
– If VPs needed to host physical utilization is less than the current active VP count, a VP is
put to sleep
– If VPs needed are greater than the current active VPs, more are enabled
– On by default in AIX 5.3 ML3
• vpm_xvcpus tunable
 Increases processor utilization and affinity
– Inactive VPs don’t get dispatched and waste physical CPU cycles
– Fewer VPs can be more accurately dispatched to physical resources by Hypervisor
 When to adjust
– Burst/Batch workloads with short response-time requirements may need sub-second
dispatch latency
• Disable or manually tune the number of VPs
– # schedo -o vpm_xvcpus=[-1 | N]
– Where N specifies the number of VPs to enable in addition to the number of VPs
needed to consume physical CPU utilization
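For reference, a minimal sketch of checking and changing the folding tunable with schedo (the values shown are only examples):

schedo -o vpm_xvcpus          # display the current value (0 = folding on, no extra VPs)
schedo -o vpm_xvcpus=-1       # disable virtual processor folding
schedo -d vpm_xvcpus          # restore the default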

14 © 2008 IBM Corporation


Advanced Technical Support, Americas

Virtual Processors - Tools


 Folding
– Tools still show data by logical processors, whether folding is active or not!
– mpstat –s will show physical and associated logical utilization
• Disable/enable folding to see how loads are consolidated
 Too many VPs
– High context switch rates
• mpstat -a: the ilcs field monitors the number of involuntary logical context switches
• No rules-of-thumb, must baseline normal operation and watch for changes as loads increase
– Lock contention analysis with tools like trace and splat
 Too few VPs
– Available processor cycles in pool
• Available pool processors: lparstat / topas –C available pool processor (APP) value
• Shared physical processor utilization totals: topas –C/-R
• APP requires Processor Utilization Authority checkbox to be set in HMC processor panel
– High CPU utilization, High entitled capacity, work not getting done
• lparstat / vmstat: high %user + %sys, %entc
 Monitoring Context Switches with mpstat
– Hypervisor
• vlcs: voluntary – the VP gives cycles to the shared pool or to another VP in the same partition
• ilcs: involuntary – the VP is forced to yield (entitlement consumed, etc.)
– Operating system
• cs – ics: voluntary – the thread yields (completes work, issues a read(), etc.)
• ics: involuntary – the thread is forced to yield (reaches the time-slice dispatch length)
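A minimal way to watch the hypervisor-level counters over time is the mpstat dispatcher view; the 5-second interval below is an arbitrary choice:

mpstat -d 5          # per-lcpu vlcs (voluntary) and ilcs (involuntary) logical context switches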
15 © 2008 IBM Corporation
Advanced Technical Support, Americas

CPU Monitoring - mpstat command


 Shows detailed logical processor information
– Up to 29 new metrics (when using the -a option)
– Default mode shows:
• utilization metrics (%user, %sys, %idle, %wait)
• major and minor page faults (with and without disk I/O)
• number of syscalls and interrupts
• dispatcher metrics:
– number of migrations
– voluntary and involuntary context switches
– logical processor affinity (percentage of redispatches inside the MCM)
– run queue size
• fraction of processor consumed
• percentage of entitlement consumed (shared mode)
• number of logical context switches (shared mode)
• hardware preemptions
– Focus of AIX development; the sar command is updated only as required to remain functional
– -d shows detailed software and hardware dispatcher metrics, MCM affinity (dedicated only)
– -i shows detailed interrupt metrics
– -s shows SMT utilization

16 © 2008 IBM Corporation


Advanced Technical Support, Americas

mpstat Example

Note: CPU user & sys values are relative to physical consumed, so an lcpu may look “busy” while actual usage is low.

# mpstat 1

System configuration: lcpu=2 ent=0.5
cpu  min maj mpc int  cs  ics rq mig lpa sysc us sy wa id  pc   %ec  lcs
0    0   0   0   176  128 59  1  0   100 54   33 38 0  29  0.00 0.5  131
1    0   0   0   10   0   0   0  0   -   0    0  3  0  97  0.00 0.3  131
U    -   -   -   -    -   -   -  -   -   -    -  -  0  99  0.50 99.3 -
ALL  0   0   0   186  128 59  1  0   100 54   0  0  0  100 0.00 0.7  131

(U = unused capacity row; ALL = system-wide usage row)

cpu                Logical CPU number
min                Minor page faults – no disk I/O required to satisfy
maj                Major page faults – disk I/O required to satisfy the fault
mpc                Total number of mpc interrupts – interprocessor calls
int                Total number of interrupts
cs                 Total number of context switches
ics                Total number of involuntary context switches
rq                 Total number of processes on the run queue
mig                Total number of thread migrations to another logical processor
lpa                Total number of re-dispatches within affinity domain 3
sysc               Total number of system calls
us / sy / wa / id  The percentage of physical processor utilization consumed. Note: interpret relative to pc.
pc                 Fraction of physical processor consumed (shared partition, or when SMT is enabled). The pc of the U row
                   represents the number of unused physical processors relative to entitlement.
%ec                The percentage of entitled capacity consumed.
lcs                Total number of logical CPU context switches
17 © 2008 IBM Corporation
Advanced Technical Support, Americas

mpstat –s
# mpstat -s

Proc0 Proc2 Proc4 Proc6


80% 78% 75% 82% [shared mode only]
cpu0 cpu1 cpu2 cpu3 cpu4 cpu5 cpu6 cpu7
40% 40% 68% 10% 35% 40% 41% 41%

(delta PURR / delta time-base) * 100


 Represents the percentage of dispatch cycles given to a
logical processor
 Interpreted as the percentage of physical processor
consumed by a logical processor
 Only AIX tool that will dynamically show Virtual Processor Folding (adjust the vpm_xvcpus tunable)
schedo -p -o vpm_xvcpus=[0 | -1]

proc   Physical processor / virtual processor busy. With SMT enabled, each physical/virtual processor has two
       logical processors.
cpu    Logical CPU number and the overall busy percentage, which is the sum of user + system mode utilization.
       Gives the relative SMT split between the logical processors.

18 © 2008 IBM Corporation


Advanced Technical Support, Americas

lparstat Review
# lparstat -h 1 4
System configuration: type=Shared mode=Capped smt=On lcpu=4 mem=4096 psize=2 ent=0.40
%user %sys %wait %idle physc %entc lbusy app vcsw phint %hypv hcalls Additional information when
----- ---- ----- ----- ----- ----- ------ --- ---- ----- ----- ------ “-h” flag is specified

84.9 2.0 0.2 12.9 0.40 99.9 27.5 1.59 521 2 13.5 2093
86.5 0.3 0.0 13.1 0.40 99.9 25.0 1.59 518 1 13.1 490

%user / %sys /   Shows the percentage of the entitled processing capacity used. So, you would say that the system is consuming
%wait / %idle    86.9% (84.9 + 2) of four-tenths of a physical processor. For dedicated partitions, the entitled capacity = # of
                 physical processors.
physc            Shows the number of physical processors consumed. For a capped partition this number will not exceed the entitled
                 capacity. For an uncapped partition this number could match the number of processors in the shared pool; however,
                 it may be limited based on the number of online Virtual Processors.
%entc            Shows the percentage of entitled capacity consumed. For a capped partition the percentage will not exceed 100%;
                 however, for uncapped partitions the percentage can exceed 100%.
lbusy            [Shared mode only] Shows the percentage of logical processor utilization that occurs while executing in user and
                 system mode. Note: in this example we’re using approx. 25% of the logical processors. This is the “traditional”
                 measure of CPU utilization using time-based sampling. As this value approaches 100% it may indicate that the
                 partition could make use of additional VPs.
app Shows the number of available processors in the shared pool. The shared pool ‘psize’ is 2 processors. Must set
‘Allow shared processor pool utilization authority’. View the “properties” for a partition and click the Hardware tab,
then Processors and Memory.
vcsw Shows the number of virtual context switches.

phint Shows the number of phantom interrupts. A phantom interrupt is an interrupt that belongs to another shared partition.

%hypv / hcalls Shows the percentage of time spent in the hypervisor and the number of hypervisor calls.

19 © 2008 IBM Corporation


Advanced Technical Support, Americas

Topas CEC Monitoring Review (AIX 5.3 ML 3)


 topas –C
– Upper section displays aggregated CEC information
– Lower section displays shared/dedicated data – closely mimics lparstat
Topas CEC Monitor Interval: 10 Thu Jul 28 17:04:57 2006
Partition Info Memory (GB) Processor
Monitored : 6 Monitored :24.6 Monitored :1.2 Shr Physical Busy: 0.27
UnMonitored: - UnMonitored: - UnMonitored: - Ded Physical Busy: 2.70
Shared : 3 Available :24.6 Available : -
Dedicated : 3 UnAllocated: 0 UnAllocated: - Hypervisor
Capped : 1 Consumed : 2.7 Shared :1.5 Virt. Context Switch: 632
Uncapped : 1 Dedicated : 5 Phantom Interrupts : 7
Pool Size : 3
Avail Pool :2.6
Host OS M Mem InU Lp Us Sy Wa Id PhysB Ent %EntC Vcsw PhI
-------------------------------------shared-------------------------------------
ptoolsl3 A53 c 4.1 0.4 2 14 1 0 84 0.08 0.50 15.0 208 0
ptoolsl2 A53 C 4.1 0.4 4 20 13 5 62 0.17 0.50 36.5 219 5
ptoolsl5 A53 U 4.1 0.4 4 0 0 0 99 0.02 0.50 0.1 205 2

------------------------------------dedicated-----------------------------------
ptoolsl1 A53 S 4.1 0.5 4 20 10 0 70 0.60
ptoolsl4 A53 4.1 0.5 2 100 0 0 0 2.00
ptoolsl6 A52 4.1 0.5 1 5 5 12 88 0.10

• M – system mode:
• c means capped, C – capped with SMT
• u means uncapped, U – uncapped with SMT
• S means SMT (dedicated partition)
© 2008 IBM Corporation
Advanced Technical Support, Americas

Virtual I/O Server

21 © 2008 IBM Corporation


Advanced Technical Support, Americas

Virtual SCSI - CPU


 Know the application’s I/O behavior
– Review information on application block sizes
– Determine I/O Per Second (IOPS) and bytes transferred
• iostat -D
• filemon
– Otherwise, you need to size to maximum I/O rates
• Will waste entitlement when using dedicated partitions
• Likely just guessing with shared partitions
 Disk guidelines for IOPS
• Use 60 IOPS for planning for 7200 RPM disks
• Use 75 IOPS for planning for 10,000 RPM disks
• Use 100 IOPS for planning for 15,000 RPM disks
• Large block, sequential IO can sustain 2X these rates
• Most disk subsystems get better IOPS on their disks
– Use 150 IOPS for DS4000, DS6000 and DS8000 disks
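As a rough worked example of these planning numbers (the workload figures are assumed, not from the source): a workload peaking at 3,000 IOPS on 15,000 RPM internal disks needs about 3000 / 100 = 30 disks, while the same workload placed on DS8000 storage planned at 150 IOPS per disk needs about 3000 / 150 = 20 disks.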

22 © 2008 IBM Corporation


Advanced Technical Support, Americas

Virtual SCSI - CPU


 CPU sizing method for VSCSI server
– Determine approximate cycles per second for an I/O type
• Available in VIOS Planning Guide
– http://www14.software.ibm.com/webapp/set2/sas/f/vios/documentation/home.html
– Allocation =
• (# of IOPS X Cycles per second) / CPU Frequency

23 © 2008 IBM Corporation


Advanced Technical Support, Americas

Virtual SCSI - CPU


Number of CPU cycles per operation*:

                 4KB     8KB     32KB    64KB    128KB
Physical Disk    45000   47000   58000   81000   120000
LVM              49000   51000   59000   74000   105000

* Based on 1.65 GHz POWER5

Example: Two clients, using physical disk storage


Client1: peak of 10,000 IOPS of size 32KB
Client2: peak of 5,000 IOPS of size 64KB
Allocation = (# of IOPS X Cycles per second) / CPU Frequency
(10,000 X 58,000 + 5,000 X 81,000)/1,650,000,000 = 0.60 processors

* Can adjust cpu cycles required for other processors by computing a ratio between processor speeds
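As a quick sanity check, the same allocation can be recomputed at the command line; the sketch below simply replays the numbers above with bc:

echo "scale=2; (10000*58000 + 5000*81000) / 1650000000" | bc       # prints .59, i.e. ~0.6 processors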

24 © 2008 IBM Corporation


Advanced Technical Support, Americas

Virtual SCSI – Tuning/Tools


 Measure I/O
– Just as you do with a non-VIOS system
– Use iostat –aD for detailed disk and adapter information
• Mitigate any high IOPS
– Transfers per second (tps)
– Now broken down by read/write transfers (a split most customers don’t know for their workload)
• Look for unbalanced IO on PVs
– May indicate data layout issues
• Look for high wait/service times, high queue full (sqfull)
• Adapter totals (-a) can be associated with HW limits
– Use the filemon trace tool (see the sketch after this list)
• To determine average block sizes
• Identify hot LVs and PVs
– Look for high read/write times
> Writes > 2 msec (cached), > 10 msec (non-cached)
> Reads > 20 msec
– Look for large deltas between PV and LV layers
> File system buffer tunings (cross check vmstat –v counters)
 Server-specific tools
– viostat – wrapper around iostat
– topas – global values, disk and network metrics
– netstat – network statistics
– Most modes can be run as padmin, but root shell may have more options
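A minimal filemon invocation matching the bullets above (the output file name and the 60-second trace window are arbitrary choices):

filemon -o /tmp/filemon.out -O lv,pv       # collect LV- and PV-level detail
sleep 60; trcstop                          # stop tracing and flush the report

Review average block sizes, hot LVs/PVs and read/write service times in /tmp/filemon.out.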
25 © 2008 IBM Corporation
Advanced Technical Support, Americas

Virtual Ethernet
 Packets transferred in memory between partitions on the same
server
– Higher throughput than physical ethernet
– Physical devices do not support MTU 65394
 Throughput linearly scales with processor entitlements
– MTU 9000 is 3X MTU 1500
– MTU 65394 is 7X MTU 9000
– Try to use the highest MTU
 No unique TCP/IP tunables methodology
 TCP Checksum Offloading
– Because virtual network does not suffer from physical network link
errors, checksums do not need to be generated (this is the default in
later AIX 5.3 levels)
• # chdev -l <device> -a chksum_offload=yes
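If both ends of the virtual link can be changed, the interface MTU is raised the same way. This is only a sketch; en0 is a placeholder interface and the change should be coordinated on every partition sharing the virtual network:

chdev -l en0 -a mtu=65394          # raise the virtual interface MTU
lsattr -El en0 -a mtu              # verify the new value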

26 © 2008 IBM Corporation


Advanced Technical Support, Americas

Shared Ethernet
 Heavy network load, use same sizings as dedicated systems
– MTU 1500, 1 CPU
– MTU 9000, 0.5 CPU
 Shared processors
– Shared processors can result in higher latency, decreasing throughput
– For bursty network loads, use uncapped and allow for more entitlement than would
be allocated for a dedicated partition hosting the same application
 Tools
• lsattr –El en#
• topas
• entstat, netstat
• seastat
– Tool from Nigel Griffiths simplifies output, provides intervals
– http://www-941.ibm.com/collaboration/wiki/display/WikiPtype/nmon
 Whenever there is a VIO Client/Server issue, check if there is a CPU constraint
first
– Add entitlement (shared), uncap or increase CPUs (dedicated)
– Use larger MTU sizes if possible

27 © 2008 IBM Corporation


Advanced Technical Support, Americas

Generic Memory Tuning

28 © 2008 IBM Corporation


Advanced Technical Support, Americas

AIX Memory Management Overview


 The role of Virtual Memory Manager (VMM) is to provide the capability for
programs to address more memory locations than are actually available in
physical memory.
 On AIX this is accomplished using segments that are partitioned into fixed
sizes called “pages”.
– A segment is 256M
– default page size 4K
– POWER 4+ and POWER5 can define larger page sizes
 The VMM maintains a list of free frames that can be used to retrieve pages
that need to be brought into memory.
– The VMM replenishes the free list by removing some of the current pages from
real memory (i.e., steal memory).
– The process of moving data between memory and disk is called “paging”.
 The VMM uses a Page Replacement Algorithm (implemented in the lrud kernel threads) to select pages that will be removed from memory.
29 © 2008 IBM Corporation
Advanced Technical Support, Americas

AIX Page Replacement Algorithm - Review


 The basic Page Replacement Algorithm uses a technique known as a “clock hand” algorithm.
– Scan the page frame table examining the “reference bit” for each page.
• Reference bit “off” – candidate to be stolen, based on memory type and VMM parameters
• Reference bit “on” – set the reference bit “off”, which “ages” the reference so that the next time the page is
scanned it can be stolen.
 To improve the page selection process the Virtual Memory Manager (VMM) maintains the following information:
– Segments types
• Working – Process data and stack, shared memory, shared library, etc.,
• Persistent – Files from JFS file system
• Clients – Files from JFS2 file system or NFS
– Segments classification
• Computational – Working or program text
• Non-Computational (or File Memory) – Persistent or Client segments
– Number of Non-Computational memory pages
– Re-Page rate for computational and non-computational memory (note: a re-page is a failure of the page replacement
algorithm).
– Counters are displayed with vmstat -v
 A number of tunable parameters can be used to influence the decisions made by the page replacement algorithm
 On an MP system, the page replacement algorithm is called ‘lrud’, which is a multi-threaded process
– Managed per memory pool
 Page Replacement algorithm (lrud) runs under the following conditions
– Number of free frames in a memory pool drop below minfree
– strict_maxclient=1 if at maxclient or numclient drops below minfree
– strict_maxperm=1 if at maxperm or numperm drops below minfree
– Note: maxperm is number of non-computational pages
– WLM Trigger point is hit

30 © 2008 IBM Corporation


Advanced Technical Support, Americas

AIX Page Replacement Algorithm – Clock Hand


Frames are physical; pages are virtual.

Page frame table entries examined by the clock hand (example):

Frame   Seg Class   Seg Type   Ref Bit   Mod Bit
90000   C           W          N         N
90001   C           W          Y         N
90002   C           W          Y         Y
90003   C           W          Y         Y
90004   C           W          N         N
90005   C           P          N         N
90006   C           P          Y         N
90007   NC          C          N         N
90008   NC          C          Y         Y

Frames with the reference bit off are candidates to steal; frames with the reference bit on have the bit set back to N (“aging” the reference). Stolen modified working pages are written to page space; modified persistent pages are written back to JFS; modified client pages are written back to JFS2 / NFS.

lrud is multi-threaded – 2 free lists per lrud thread
– One lrud per memory pool
– Number of memory pools is based on memory affinity: at least one per MCM or DCM in the LPAR (possibly more, controlled via cpu_scale_memp)
– LRU buckets (vmo -a | grep lrubucket) – 128K 4K pages

Parameters maintained by the VMM for each mempool that influence the page replacement process:
– Number of non-computational pages
– Re-page rate for computational pages
– Re-page rate for non-computational pages

31 © 2008 IBM Corporation


Advanced Technical Support, Americas

Page Replacement & VMM Parameters - Review


numperm            Number of non-computational pages in memory. This is not the number of persistent pages in memory.
                   Note: it can be less than numclient because ‘text’ pages are classified as computational.
numclient          Number of client pages in memory.
maxperm%           Configured maximum number of non-computational pages in memory. Enforcement is controlled by
                   strict_maxperm and lrud.
maxclient%         Configured maximum number of client pages in memory. Client pages are a subset of non-computational
                   pages, which is why maxclient% <= maxperm%. Enforcement is controlled by strict_maxclient and lrud.
strict_maxperm &   Set hard or soft enforcement of the file system cache limits. When memory is available, soft enforcement
strict_maxclient   allows memory utilization to grow beyond the configured limit.

[Diagram: computational memory (working segments, plus client/persistent segments holding ‘text’); non-computational memory (persistent and client segments) counted by numperm/numclient against maxperm%/maxclient%, with minperm% as the floor.]

When does lrud steal file pages versus either kind of page?

Steal File Pages   lru_file_repage=0 and numperm > minperm
Steal File Pages   lru_file_repage=1 and computational re-page rate (Rc) > non-computational re-page rate (Rf),
                   or numperm > maxperm / maxclient
Steal Either       lru_file_repage=1 and not (Rc > Rf) and minperm < numperm < maxperm/maxclient
Steal Either       numperm < minperm

32 © 2008 IBM Corporation


Advanced Technical Support, Americas

vmstat – Global settings

# vmstat -v
233472 memory pages
197128 lruable pages
5201 free pages
0 memory pools
53534 pinned pages
80.0 maxpin percentage
20.0 minperm percentage
80.0 maxperm percentage
36.5 numperm percentage
72058 file pages
0.0 compressed percentage
0 compressed pages
39.3 numclient percentage
80.0 maxclient percentage
77641 client pages
0 remote pageouts scheduled
0 pending disk I/Os blocked with no pbuf
0 paging space I/Os blocked with no psbuf
2740 filesystem I/Os blocked with no fsbuf
200 client filesystem I/Os blocked with no fsbuf
0 external pager filesystem I/Os blocked with no fsbuf

Note: the VMM counters provide a snapshot of memory used for file cache.

33 © 2008 IBM Corporation


Advanced Technical Support, Americas

Memory Monitoring with vmstat


# vmstat -I 1

Key Points: Computational (avm), Free Frames, Paging Rates, Scanning Rates

System configuration: lcpu=2 mem=912MB

kthr     memory      page                           faults       cpu
-------- ----------- ------------------------------ ------------ -----------
r b p avm fre fi fo pi po fr sr in sy cs us sy id wa
1 1 0 139893 2340 12288 0 0 0 0 0 200 25283 496 77 16 0 7
1 1 0 139893 1087 4503 0 8 733 3260 126771 415 9291 440 82 15 0 3
3 0 0 139893 1088 9472 0 1 95 9344 100081 191 19414 420 77 20 0 3
1 1 0 139893 1087 12547 0 6 0 12681 13407 207 25762 584 71 21 0 7
1 2 0 140222 1013 6110 1 39 0 6169 6833 160 15451 471 83 11 0 5
1 2 0 139923 1087 6976 0 31 2 7062 7599 183 19306 544 79 14 0 7

kthr
b        The number of threads blocked waiting for a file system I/O operation to complete.
p        The number of threads blocked waiting for a raw device I/O operation to complete.

Memory
avm      The number of active virtual memory pages, which represents the computational memory requirement.
         The maximum avm number divided by the number of real memory frames equals the computational
         memory requirement.
fre      The number of frames of memory on the free list. Note: a frame refers to physical memory, whereas a
         page refers to virtual memory.

Page
fi / fo  File pages in and file pages out per second, which represent I/O to and from a file system.
pi / po  Page space page-ins and page space page-outs per second, which represent paging.
fr / sr  The number of pages scanned (sr) and the number of pages stolen, or freed (fr). The ratio of scanned to
         freed represents relative memory activity; it starts at 1 and increases as memory contention increases
         (lrud had to examine ‘sr’ pages in order to steal ‘fr’ pages). Note: interrupts are disabled at times while
         ‘lrud’ is running.

34 © 2008 IBM Corporation


Advanced Technical Support, Americas

svmon – Global Memory


# svmon -G
               size       inuse        free         pin     virtual
memory       233472      125663      107809      108785      140123
pg space     262144       54233

               work        pers        clnt       lpage
pin           67825           0           0       40960
in use        79725         536        4442           0

             pgsize        size        free
lpage pool    16 MB          10          10

size: total # of memory frames (real memory); inuse: # of frames in use; free: # of frames on the free list;
pin: # of pinned frames; virtual: size of virtual memory (matches vmstat ‘avm’)

Working (or computational) memory = 140123; %Computational = virtual/size = 140123 / 233472
Pers (or JFS file cache) memory = 536
Clnt (or JFS2 and NFS file cache) memory = 4442

In addition, the system has been configured to use large pages: a total of 160 MB (10 x 16 MB large pages), which are all free.

35 © 2008 IBM Corporation


Advanced Technical Support, Americas

svmon – User Memory

The only svmon report that breaks down usage by System, Shared and Exclusive segments (output shown from AIX 5.3 ML3).

# svmon -U db2user
===============================================================================
User                 Inuse     Pin    Pgsp  Virtual
db2inst1             37598    7620       0    35927

PageSize             Inuse     Pin    Pgsp  Virtual
s  4 KB              37598    7620       0    35927
L 16 MB                  0       0       0        0

...............................................................................
SYSTEM segments      Inuse     Pin    Pgsp  Virtual
                     13347    7334       0    13347
(Segments used by the system and shared by all processes)

PageSize             Inuse     Pin    Pgsp  Virtual
s  4 KB              13347    7334       0    13347
L 16 MB                  0       0       0        0

Vsid    Esid      Type Description       PSize  Inuse   Pin  Pgsp Virtual
0       0         work kernel            s      11665  7323     0   11665
1001    9ffffffd  work shared library    s       1612     0     0    1612
14014   9ffffffe  work shared library    s         30     0     0      30
1a58a   -         work                   s         25     7     0      25
1a48a   -         work                   s         15     4     0      15
36 © 2008 IBM Corporation


Advanced Technical Support, Americas

svmon – User Memory


...............................................................................
EXCLUSIVE segments   Inuse     Pin    Pgsp  Virtual
                     11567     286       0     9924
(Segments used exclusively by the user)

PageSize             Inuse     Pin    Pgsp  Virtual
s  4 KB              11567     286       0     9924
L 16 MB                  0       0       0        0

Vsid    Esid      Type Description            PSize  Inuse  Pin  Pgsp Virtual
1f56f   11        work text data BSS heap     s       5670    0     0    5670
1551    11        work text data BSS heap     s        677    0     0     677
1f54f   9001000a  work shared library data    s        618    0     0     618
18588   9001000a  work shared library data    s        497    0     0     497
a59a    11        work text data BSS heap     s        364    0     0     364
8698    -         clnt /dev/hd2:110883        s        323    0     -       -
7b0     11        work text data BSS heap     s        261    0     0     261

...............................................................................
SHARED segments      Inuse     Pin    Pgsp  Virtual
                     12684       0       0    12656
(Segments shared with other users / processes)

PageSize             Inuse     Pin    Pgsp  Virtual
s  4 KB              12684       0       0    12656
L 16 MB                  0       0       0        0

Vsid    Esid      Type Description            PSize  Inuse  Pin  Pgsp Virtual
10a0    90000000  work shared library text    s       9750    0     0    9750
14484   77000000  work default shmat/mmap     s       1537    0     0    1537
a1da    90020014  work shared library         s        745    0     0     745
8478    78000000  work default shmat/mmap     s        620    0     0     620

37 © 2008 IBM Corporation


Advanced Technical Support, Americas

svmon – Process Memory


# svmon -P 401636      (the 64-bit, Mthrd and LPage columns answer: 64-bit? multi-threaded? using large pages?)

-------------------------------------------------------------------------------
Pid Command Inuse Pin Pgsp Virtual 64-bit Mthrd LPage
401636 memget 34758 4798 0 34754 N N N

Vsid Esid Type Description LPage Inuse Pin Pgsp Virtual

19532 2 work process private - 20017 3 0 20017


0 0 work kernel segment - 9258 4795 0 9258
1f09d d work loader segment - 5466 0 0 5466
19612 f work shared library data - 13 0 0 13
18613 1 clnt code,/dev/projlv:8239 - 3 0 - -
c527 - clnt /dev/hd2:4183 - 1 0 - -
# svmon -P 401636      (later: some of the pages have moved to page space)

-------------------------------------------------------------------------------
Pid Command Inuse Pin Pgsp Virtual 64-bit Mthrd LPage
401636 memget 26932 4798 3562 34754 N N N

Vsid Esid Type Description LPage Inuse Pin Pgsp Virtual

19532 2 work process private - 16707 3 3310 20017


0 0 work kernel segment - 9002 4795 174 9258
1f09d d work loader segment - 1216 0 74 5466
19612 f work shared library data - 7 0 4 13
c527 - clnt /dev/hd2:4183 - 0 0 - -
18613 1 clnt code,/dev/projlv:8239 - 0 0 - -

38 © 2008 IBM Corporation


Advanced Technical Support, Americas

Memory Usage – ps and svmon commands


# svmon -P 385212
Pid      Command      Inuse     Pin    Pgsp  Virtual  64-bit  Mthrd  16MB
385212   db2fmp       24330    7854       0    24308       Y      Y     N

Vsid    Esid        Type Description                PSize  Inuse   Pin  Pgsp Virtual
0       0           work kernel                     s      12140  7828     0   12140
10a0    90000000    work shared library text        s       6736     0     0    6736
1001    9ffffffd    work shared library             s       1616     0     0    1616
1a46a   77000000    work default shmat/mmap         s       1537     0     0    1537
5535    9001000a    work shared library data        s        611     0     0     611
7537    11          work text data BSS heap         s        596     0     0     596
1d0     90020014    work shared library             s        594     0     0     594
4534    80020014    work USLA heap                  s        165     0     0     165
1f52f   f00000002   work process private            s        156    22     0     156
470     70000000    work default shmat/mmap         s         83     0     0      83
14014   9ffffffe    work shared library             s         23     0     0      23
6536    ffffffff    work application stack          s         19     0     0      19
1e52e   8fffffff    work private load data          s         14     0     0      14
1d46d   -           work                            s         14     4     0      14
a59a    10          clnt text data BSS heap,        s         10     0     -       -
                         /dev/hd1:337405
a1ba    9fffffff    clnt USLA text,/dev/hd2:957     s         10     0     -       -
1f48f   78000001    work default shmat/mmap         s          4     0     0       4
17047   -           clnt /dev/hd2:12948             s          2     0     -       -

# ps vg 385212
PID    TTY STAT TIME PGIN SIZE RSS  LIM   TSIZ TRS %CPU %MEM COMMAND
385212 -   A    0:00 457  6244 6284 32768 13   40  0.0  1.0  db2hmon

Relating ps and svmon memory usage? Note the different units: svmon reports in 4 KB pages, ps in 1 KB units.
The process (svmon -P) report contains shared segments (kernel, library, persistent/client files, etc.), unlike the
User report. To find private process memory: RSS will match Inuse and SIZE will match Virtual.

39 © 2008 IBM Corporation


Advanced Technical Support, Americas

Shared Memory, ipcs and svmon


The ipcs command reports information about inter-process communication facilities, which include
shared memory, semaphores, and message queues. The following command limits the report to
shared memory segments only.
# ipcs -bmS
IPC status from /dev/mem as of Wed Aug 10 16:20:33 EDT 2005
T          ID     KEY        MODE        OWNER     GROUP       SEGSZ
Shared Memory:
   SID : 0x1e375
m           4 0x080d3d74 --rw-rw-rw-  db2inst1   db2grp1   140665792
   SID : 0x3448
m           5 0x080d3d61 --rw-------  db2inst1   db2grp1    22855680
   SID : 0x440f
m           6 0xffffffff --rw-------  db2fenc1  db2fgrp1   245284864
   SID : 0x18473
m           7 0x080d3e68 --rw-rw----  db2inst1   db2grp1    58720256
   SID : 0xf444
m   282066952 0x0d000ada --rw-rw-rw-      root    system        1440
   SID : 0x145df

Notes:
– SEGSZ is the requested size of the shared memory segment. Real memory is not allocated until the process /
  thread actually uses the memory: when an application requests memory it receives a “pointer” and a “promise”
  from the kernel. Use svmon to determine how much is currently ‘In Use’.
– 140665792 / 4096 ≈ 34343 4 KB pages requested.
– SID is the segment ID used with the command svmon -lS, which will give you information about the segment
  and a listing of the attached processes.
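To drill into one of those segments, feed the SID back to svmon as the slide suggests; the SID below is just the first one from the listing above:

svmon -lS 1e375          # segment details plus every process attached to it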

40 © 2008 IBM Corporation


Advanced Technical Support, Americas

Size of Shared Memory Segments


# svmon -P <ora_pid>     (shows how much of the Oracle SGA has been used)
-------------------------------------------------------------------------------
Pid      Command      Inuse     Pin    Pgsp  Virtual  64-bit  Mthrd  LPage
655398   oracle       75202    4832    5883    67518       Y      N      N

Vsid    Esid      Type Description          LPage  Inuse  Pin  Pgsp Virtual
190d4   70000000  work default shmat/mmap   -      44020    0     0   44020
7c6c    70000001  work default shmat/mmap   -       7793    0     0    7793

(The two shmat/mmap segments total 51813 in-use pages and 51813 virtual pages.)

# ipcs -bm
IPC status from /dev/mem as of Thu Jun 2 16:57:11 EDT 2005
T          ID     KEY        MODE        OWNER   GROUP       SEGSZ
Shared Memory:
m           0 0x58001294 --rw-rw-rw-   root   system   134217728
m     1048578 0x0d000ada --rw-rw-rw-   root   system        1440
m     1048579 0xffffffff --rw-rw----   root   system        4096
m    76546054 0x3f0bcb08 --rw-r-----  oracle    osdba   301998080   (= 73730 pages requested)

Compare In-Use vs. Virtual vs. Requested: the Oracle segment was requested at 301998080 bytes (73730 pages),
but only 51813 pages are currently in use.

41 © 2008 IBM Corporation


Advanced Technical Support, Americas

Memory Pool Balance


 Seeing Mempool Activity
– Previous AIX 5.3 levels have had issues with unbalanced memory pools
• Single pool could be starved while others had memory
• Seen when vmstat reports available memory, but system is scanning or paging
– A variety of command lines for kdb output are floating around. The most readable output I have found:
echo "mempool *\nfrs *" | kdb

(0)> mempool *          (mempool totals are in 4K pages)


VMP MEMP NB_PAGES FRAMESETS NUMFRB
memp_frs+010000 00 000 00133644 000 001 002 003 0009C624
(0)> frs *
VMP MEMP PSZ FRS NEXT_FRS NB_PAGES NUMFRB
memp_frs+000000 00 000 4K 000 00000001 0005A4F0 00014135 321.3MB
memp_frs+000080 00 000 4K 001 FFFFFFFF 0005A4B4 0001434F 323.4MB
memp_frs+000100 00 000 64K 002 00000003 00003F66 00003A07 928.5MB
memp_frs+000180 00 000 64K 003 FFFFFFFF 00003F64 00003A13 929.2MB

Only one pool shown here, but look for large disparity in
pages (NB_PAGES) or free frames (NUMFRB)

42 © 2008 IBM Corporation


Advanced Technical Support, Americas

Other Tools - Memdetails


 Ever look inside PerfPMR? There are some nice scripts in there
 Memdetails.sh
• Runs svmon, ipcs, kdb and vmstat
• Reports kernel and user memory breakdowns
• Much more friendly than decrypting svmon
• Not officially supported as an end-user tool

43 © 2008 IBM Corporation


Advanced Technical Support, Americas

====================================================|==========|===========
Memory Overview | Pages | Megabytes
----------------------------------------------------|----------|-----------
Total memory in system | 524288 | 2048.00
Total memory in use | 518325 | 2024.70
Free memory | 5963 | 23.29
====================================================|==========|===========
Segment Overview | Pages | Megabytes
----------------------------------------------------|---------|-----------
Total segment id mempgs | 486394 | 1899.97
Total fork tree segment pages | 0 | 0.00
Total kernel segment id mempgs | 197484 | 771.42
jfs segment | 32 | 0.12
kernel heap | 134437 | 525.14
kernel segment | 17037 | 66.55
lfs segment | 656 | 2.56
lock instrumentation | 0 | 0.00
mbuf pool | 28128 | 109.87
mpdata debug | 1024 | 4.00
other kernel segments | 7278 | 28.42
page space disk map | 16 | 0.06
page table area | 1237 | 4.83
process and thread tables | 80 | 0.31
vmm ame segment | 16 | 0.06
vmm data segment | 560 | 2.18
….
vmm vmintervals | 16 | 0.06
miscellaneous kernel segs | 3998 | 15.61
Total kernel mem w/ no segment id (wlm_hw_pages) | 31676 | 123.73
RMALLOC | 9 | 0.03
SW_PFT | 12288 | 48.00
PVT | 1024 | 4.00
PVLIST | 16384 | 64.00
RTAS_HEAP | 2396 | 9.35
----------------------------- | |
Total | 32101 | 125.39

44 © 2008 IBM Corporation


Advanced Technical Support, Americas

===========================================================================
Detailed Memory Components | Pages | Megabytes
----------------------------------------------------|----------|-----------
Light Weight Trace memory | 4092 | 15.98
LVM Memory | 928 | 3.62
Total Kernel Heap memory | 134439 | 525.15
JFS2 total non-file memory | 542 | 2.11
metadata_cache | 78 | 0.30
inode_cache | 272 | 1.06
fs bufstructs | 140 | 0.54
misc jfs2 | 52 | 0.20
misc kernel heap | 133897 | 523.03
Total file memory | 228385 | 892.12
Total clnt (JFS2, NFS,...) file memory | 0 | 0.00
Total pers (JFS) memory | 228385 | 892.12
Total text memory | 9863 | 38.52
Total clnt text memory | 0 | 0.00
Total pers text memory | 9863 | 38.52
User memory | |
USER: root | |
total process private memory | 16292 | 63.64
total shared memory | 1543 | 6.02
working (shared w/ other users) | 18674 | 72.94
working (exclusive to user) | 29450 | 115.03
shared memory (exclusive to user) | 5 | 0.01
shared memory (shared w/ other users) | 1538 | 6.00
shlib text (shared w/ other users) | 17136 | 66.93
shlib text (exclusive to user) | 928 | 3.62
file pages | 1588 | 6.20
file pages (exclusive to user) | 1588 | 6.20
file pages (shared w/ other users) | 78 | 0

45 © 2008 IBM Corporation


Advanced Technical Support, Americas

===========================================================================
Memory accounting summary | 4K Pages | Megabytes
----------------------------------------------------|----------|-----------
Total memory in system | 524288 | 2048.00
Total memory in use | 518325 | 2024.70
Kernel identified memory (segids,wlm_hw_pages) | 225162 | 879.53
Kernel un-identified memory | 3998 | 15.61
Fork tree pages | 0 | 0.00
Large Page Pool free pages | 0 | 0.00
Huge Page Pool free pages | 0 | 0.00
User private memory | 18548 | 72.45
User shared memory | 1543 | 6.02
User shared library text memory | 18064 | 70.56
Text memory | 9863 | 38.52
File memory | 228385 | 892.12
User un-identifed memory | 12762 | 49.85
---------------------- | |
Total accounted in-use | 518325 | 2024.70
Free memory | 5963 | 23.29
---------------------- | |
Total identified (total ident.+free) | 507528 | 1982.53
Total unidentified (kernel+user w/ segids) | 16760 | 65.46
---------------------- | |
Total accounted | 524288 | 2048.00
Total unaccounted | 255 | 0.99

Unidentified user could be:


- shared memory segments currently not attached by processes
- shared libraries currently not used by any processes
- miscellaneous

46 © 2008 IBM Corporation


Advanced Technical Support, Americas

Memory Tuning – lru_file_repage


 General AIX 5.3 guidance since early 2006
– Managing maxperm/maxclient is hard, many customers tune too low and have
performance problems even when memory is available
– Support for AIX 5.2 backported in ML-04
• Caveat: AIX 5.2 file cache should be limited to less than 24 GB
 Suggested Combination
– maxperm% = maxclient% = <High Percentage, 80-90%>
– minperm% = <Low Percentage, 3-5%>
• Review numperm via vmstat –v and set this lower
– strict_maxperm = 0
– strict_maxclient = 1
– lru_file_repage = 0 (default = 1)
– lru_poll_interval = 10
• Tells each lrud daemon to poll every 10 msec for interrupts
 The file cache will be allowed to grow; however, when the VMM needs memory it will steal only file pages,
because we’ve set lru_file_repage = 0.
 What is <High Percentage>
– If possible, set it so maxclient% is always greater than numclient% (vmstat –v)
• Why? maxclient is a hard limit; keeping it above numclient means lrud does not have to run to enforce it
 What is <Low Percentage>
– Set so that numperm (vmstat –v) is always greater than minperm%
• Why? If numperm drops below minperm then the VMM can steal either computational
or non-computational memory.
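A sketch of applying the suggested combination with vmo; the 5/90/90 percentages are the illustrative values above and should be validated against the numperm%/numclient% actually observed on the system (-p makes the change persistent across reboots):

vmo -p -o lru_file_repage=0 -o lru_poll_interval=10 \
       -o strict_maxperm=0 -o strict_maxclient=1 \
       -o minperm%=5 -o maxperm%=90 -o maxclient%=90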

47 © 2008 IBM Corporation


Advanced Technical Support, Americas

VMM Tuning Combination Summary – Goal is to prevent paging of computational memory.

Recommended Method:
lru_file_repage = 0
strict_maxperm = 0
strict_maxclient = 1
maxperm% = maxclient% = High Percentage
minperm% = Low Percentage
lru_poll_interval = 10

Classic Method* (appropriate for systems that don’t have the ‘lru_file_repage’ tunable):
lru_file_repage = 1
strict_maxperm = 0
strict_maxclient = 0
maxperm% = maxclient% = 20% (or a small number)
minperm% = 5
lru_poll_interval = 10

Calculated Method:
lru_file_repage = 0
strict_maxperm = 0
strict_maxclient = 1
maxperm% = maxclient% = (1 – %Computational) + 20%
lru_poll_interval = 10
Where %Computational = max. AVM / Real Memory Frames

Avoid:
strict_maxperm = 1 and strict_maxclient = 0
strict_maxperm = strict_maxclient = 0 with lru_file_repage = 0
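A small worked sketch of the Calculated Method, plugging in the avm and memory-frame values from the earlier svmon -G slide (use your own peak avm in practice):

avm=140123; frames=233472
comp=$(( avm * 100 / frames ))                           # ~60 (%Computational)
echo "maxperm% = maxclient% = $(( (100 - comp) + 20 ))"  # ~60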

48 © 2008 IBM Corporation


Advanced Technical Support, Americas

AIX 5.3 – minfree and maxfree changes

 minfree and maxfree on AIX 5.3 are now applied to each memory pool. With AIX 5.3,
total free list = minfree * # of memory pools
 In earlier releases of AIX (5.2 and 5.1), minfree was divided by the number of memory pools
so that the total free list (determined by adding minfree for *each* memory pool) equaled the
vmo/vmtune value of minfree.

AIX Level    minfree    mempools    LRUD starts when
5.1/5.2      1024       4           free_list <= 1024
5.3          1024       4           free_list <= (4 * 1024)

Initial Setting AIX 5.3:
minfree = max( 960, lcpus * 120 ) / # of mempools
maxfree = [ minfree + (Max Read Ahead * lcpus) ] / # of mempools

Initial Setting AIX 5.2:
minfree = max( 960, lcpus * 120 )
maxfree = minfree + (Max Read Ahead * lcpus)

Where,
Max Read Ahead = max( maxpgahead, j2_maxPageReadAhead )
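A hypothetical worked example: with 8 logical CPUs, 2 memory pools and j2_maxPageReadAhead = 128 (all assumed values), the AIX 5.3 formulas give minfree = max(960, 8*120) / 2 = 480 and maxfree = (960 + 128*8) / 2 = 992, which could then be applied with:

vmo -p -o minfree=480 -o maxfree=992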

49 © 2008 IBM Corporation


Advanced Technical Support, Americas

64KB Pages
 64K pages are intended to be general purpose.
 64K pages will be automatically managed by the kernel.
– Automatically used by the kernel and shared library text regions
– Fully pageable
– Size of the 64K page pool is dynamically adjusted and managed by the kernel.
– The kernel will vary the number of 4K and 64K pages to meet system demand
 It is expected that many applications will see performance benefits when using 64K pages
rather than 4K pages.
 Performance Monitoring commands have been updated to reflect and report on memory
usage by page size.
 64K Pages can be used for Data, Stack and Text regions via an environment variable (LDR_CNTRL) or by
modifying an application’s XCOFF binary.
– Data (-bdatapsize / DATAPSIZE)
– Stack (-bstackpsize / STACKPSIZE)
– Text (-btextpsize / TEXTPSIZE)
• Example: ldedit -bdatapsize=64K [binary]
• Example: LDR_CNTRL=DATAPSIZE=64K@TEXTPSIZE=64K@STACKPSIZE=64K
(Note: the environment variable overrides the XCOFF setting)
 64K Pages can be used for shared memory regions; however, application code must be
modified.
 Reference Guide to Multiple Page Size Support for more detail
– http://www-03.ibm.com/servers/aix/whitepapers/multiple_page.pdf
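A quick way to see which page sizes a system offers, and whether a process actually picked up 64K pages (the PID is a placeholder):

pagesize -a          # lists supported page sizes, e.g. 4096 and 65536
svmon -P <pid>       # PSize column: s = 4 KB, m = 64 KB, L = 16 MB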

50 © 2008 IBM Corporation


Advanced Technical Support, Americas

Page Size Performance Considerations


 Memory Usage
– Use of a larger page size may result in memory fragmentation, which could increase an application’s
memory footprint
– An increased footprint (i.e., memory usage) could have a negative impact on performance.
– Use ps, vmstat and svmon to determine the impact
 Page Translation overhead
– Applications with a measurable amount of page translation overhead (for example, OLTP and Java
applications) should see improved performance using 64K pages
 Benefit is reducing the overhead of translating a virtual address to a physical
address (Translation Look-aside Buffers)
– Since there are a limited number of TLB slots, using large page size increases the amount of an
address space that can be accessed without incurring translation delays.
– hpmcount tool reports TLB miss rate (relative to the number of instructions executed). Effective way
to determine if there is a potential performance improvement in using large pages.
 Applications that use shared memory have to be modified to use 64K pages.
 Large and Huge pages (16M / 16G) may improve the performance of applications
that repeatedly access large amounts of memory.
 Will it help? Only way to know is to try
– conduct a benchmark to quickly and accurately determine the impact of large pages.

51 © 2008 IBM Corporation


Advanced Technical Support, Americas

Tuning Memory
 Memory Model Tuning
– %Computational < 80% - Large Memory Model – Goal is to adjust tuning parameters to prevent
paging
• Multiple Memory pools
• Page Space smaller than Memory
• Must Tune VMM key parameters (lru_file_repage)
– %Computational > 80% - Small Memory Model – Goal is to make paging as efficient as possible
• Add multiple page spaces on different spindles
• Make all page spaces the same size to ensure round-robin scheduling
• PS = 1.5 x computational requirement (smaller ratios for systems greater than 16 GB)
– No changes required in AIX 6.1
 Check for unbalanced Memory Pools
– Use kdb “mempool *” to check the number of frames; update to the latest APAR levels
 Application Memory Adjustments
– Consider alternate page sizes
– Reduce SGA, pinned allocations, other application-specific memory tunings
 Implement VMM-related mount options to reduce cache needs
– Use DIO / CIO
– Release-behind on read and/or write (rbr / rbw / rbrw mount options) – see the sketch after this list
 Add additional memory
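A sketch of the mount options referenced above; the file system names are placeholders, and CIO/DIO should only be used where the application (e.g. a database doing its own caching) expects it:

mount -o cio /oradata        # concurrent I/O: bypass file system caching
mount -o rbrw /backup        # release-behind on read and write: pages are freed after the I/O completes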

52 © 2008 IBM Corporation


Advanced Technical Support, Americas

Generic Process Tuning

53 © 2008 IBM Corporation


Advanced Technical Support, Americas

AIX Scheduler Policy - Review


SCHED_OTHER Default AIX scheduling policy (a.k.a Fair Round-Robin). Each thread has an initial priority that is
modified by the scheduler based on the short term CPU utilization. Thread execution is time-sliced.
SCHED_RR Round-Robin (RR) scheduling policy. Each thread has a fixed priority. Threads at the same priority
level run for a fixed time slice in first-in-first-out order. A thread will run until:
a)Yields the CPU voluntarily
b)Blocked waiting for I/O
c)Uses up its time slice
Set via thread_setsched() or setpri() system call – requires root privilege
SCHED_FIFO First-In-First-Out (FIFO) scheduling. Each thread has a fixed priority. Threads at the same priority level
run to completion in FIFO order. Rarely used because of its non-preemptive nature. A thread will run
until:
a) Yields the CPU voluntarily – e.g., sleep() or select()
b) Blocked due to resource contention
c) Blocked waiting for I/O
Set via thread_setsched() system call – requires root privilege.
SCHED_FIFO2 Variation of FIFO policy. A thread is placed at the head of its run queue if it was asleep for only a short
time (less than predefined number of ticks schedo affinity_lim)
SCHED_FIFO3 Variation of FIFO policy. A thread is placed at the head of the queue when it’s ready to run.

Scheduling policy can impact performance


 FIFO might be a good choice for ‘batch’ jobs that use a lot of CPU; however, other processes may be blocked
 Round Robin gives each thread a fixed time slice; however, it tends to favor CPU-bound tasks and penalize I/O-bound tasks.
 Fair Round-Robin gives each thread a fixed time slice and favors I/O-bound tasks while penalizing CPU-bound tasks, because the
priority is adjusted based on CPU usage.

54 5/14/2008 © 2008 IBM Corporation


Advanced Technical Support, Americas

Thread Model - Review


 Threads provide an independent flow of control within a process. If user threads need to access kernel
services, they are serviced by associated kernel threads.
– User threads are implemented in various packages – the most notable being libpthreads
– In the libpthreads implementation, user threads sit on top of a “virtual processor”, which sits on top of a kernel
thread. (Note: this VP is not the same as a Micro-Partition VP.)
– A multithreaded user process can use one of two models:
• 1:1 Thread Model – System Scope
– User threads map 1:1 to kernel threads
– Threads are scheduled by the kernel scheduler
– AIXTHREAD_SCOPE=S
• M:N Thread Model – Process Scope
– Several user threads map to a pool of VPs, which then map to a kernel threads
– pthreads library will handle the scheduling of user threads to VP and then the kernel scheduler will
schedule to associated kernel thread
– The default is 8:1 – 8 user threads mapped to 1 kernel thread
– Default AIXTHREAD_SCOPE=P
– Environment Variable can be used to modify default setting:
> AIXTHREAD_MNRATIO=p:k (User to kernel thread ratio)
> AIXTHREAD_SLPRATIO=k:p (kernel threads held in reserve for sleeping threads)
> AIXTHREAD_MINKTHREADS=n (min number of kernel threads)
 System Contention Scope
– Applications that create and delete user threads can benefit from system contention scope because of
reduced overhead associated with harvesting and library scheduling.
 Process Contention Scope
– When thousands of user threads exist, there may be less overhead to schedule them in the library rather than
manage thousands of kernel threads.
 In general, for most workloads we recommend AIXTHREAD_SCOPE=S – particularly important for multi-
threaded applications (e.g., Java) and Oracle databases (set in /etc/environment or the Oracle profile); see the sketch below.
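A minimal sketch of how that is usually applied (the profile is just an example location; /etc/environment makes it a system-wide default):

export AIXTHREAD_SCOPE=S     # 1:1 thread model; each pthread is scheduled directly by the kernel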

55 5/14/2008 © 2008 IBM Corporation


Advanced Technical Support, Americas

Context Switches - Review


 What is a Context Switch? Action performed to remove one running entity and replace
it with another entity.
– Within AIX – A context switch is the action performed to remove a thread from a processor
and replace it with another.
– Within the Hypervisor – A context switch is the action performed to remove a Virtual
Processor (and it’s associated logical processor) from the physical processor and replace it
with another.
– To perform this action “state” information from the currently running entity must be saved
and “state” information from the new entity must be restored.
 Types of Context Switches
– Voluntary Context Switches
• Operating System Level (mpstat cs - ics) – A thread yields the processor on its own, for example
after issuing a read() system call.
• Hypervisor Level (mpstat vlcs) – A Virtual Processor yields the processor via a CEDE or CONFER call.
– Involuntary Context Switches
• Operating System Level (mpstat ics) – A thread is forced to yield, for example after consuming all of
its time slice (on AIX this is 10 ms).
• Hypervisor Level (mpstat ilcs) – A Virtual Processor is forced to yield the processor, for example when the
Virtual Processor’s entitlement has been consumed.
– Context Switches are essential in a multi-programmed1 environment; however, while a
context switch is occurring no useful work, from the process’s viewpoint, is being done.
Therefore, the time required to perform a context switch is highly optimized.

1 The ability to run processes concurrently.

56 5/14/2008 © 2008 IBM Corporation


Advanced Technical Support, Americas

Context Switches – Review


 The mpstat command provides the following information related to logical process context
switches
– Total number of logical context switches (lcs), which is displayed with default mpstat.
– Total number of voluntary context switches (vlcs), which is displayed using the ‘-d’ flag of mpstat.
– Total number of involuntary context switched (ilcs), which is displayed using the ‘-d’ flag of mpstat.
lcs = vlcs + ilcs
 The mpstat command provides the following information related to process context switches
– Total number of context switches (cs)
– Total number of involuntary context switches (ics)
– Total number of voluntary can be calculated (cs – ics)
 Questions how many context switches are too many?
– No rules of thumb exist
– Voluntary – Usually not an issue, because it means the thread has no work for the CPU. But there could be other
bottlenecks.
– Involuntary – Could be an issue, but generally the bottleneck will materialize in an easier-to-
diagnose metric, such as CPU utilization, run queue size, or physical processors consumed.
 How can the context switch metrics be used?
– Establish a baseline and compare against it when the system encounters performance problems.
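A simple way to baseline these counters is an interval run of mpstat; the interval and count below are only examples:
# mpstat 2 5        per-logical-CPU cs and ics, every 2 seconds for 5 samples
# mpstat -d 2 5     adds vlcs and ilcs (lcs = vlcs + ilcs)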

57 5/14/2008 © 2008 IBM Corporation


Advanced Technical Support, Americas

Context Switches – Which Application?


 How do we determine which threads are performing the most context switches?
– Run a trace on the dispatch hook
– Use the curt performance tool
– ps command showing processes with most activity
 Trace the hook that is collected when a thread is put on the ready queue for execution
– Execution:
trace -aj 11F; sleep 10; trcstop
trcrpt -O exec=on,pid=on,tid=on -o /tmp/trace.out
– Will list the elapsed and delta time, command, PID, TID, priority, scheduler policy and CPU run queue
• May have to grep the file by command/PID/TID to count instances of a specific application/thread (a
simple filter is sketched after this list)
 Use the curt tool to process an AIX trace file and produce a CPU utilization and
process/thread/pthread report
– Execution:
trace -a -T50000000 -L 100000000 [-r PURR] -C all -o /tmp/trc.raw; sleep 1; trcstop
trcrpt -r -C all /tmp/trc.raw > /tmp/trc.fmt
trcnm > /tmp/trc.nm
gensyms > /tmp/trc.gensyms
curt [-r PURR] -epst -i /tmp/trc.fmt -m /tmp/trc.nm -n /tmp/trc.gensyms > curt.out
– Will generate a report that includes the number of process dispatches by processor
– You must use the –r PURR option with trace and curt when executing in a micro-partition or when SMT is
enabled
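As a rough filter on the formatted trace, an ordinary grep can count how often a given command or PID/TID appears; the command name 'myapp' and the numeric ID below are hypothetical:
# grep -c myapp /tmp/trace.out                   dispatch records mentioning the command name
# grep myapp /tmp/trace.out | grep -c 9371738    narrow the count to one PID or TID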

58 5/14/2008 © 2008 IBM Corporation


Advanced Technical Support, Americas

Improving System Performance – CPU Bound


 Setting Contention Scope to System for multithreaded applications
– AIXTHREAD_SCOPE=S
– Use for multi-threaded applications that prefer to manage their own threads, or environments where the total number
of running threads is not a large multiple of the number of CPUs
– Use for monolithic, non-threaded applications when you want them to dominate CPU utilization
 Disable/Enable Simultaneous Multi-Threading
– Environments with fewer, heavier-weight threads may not benefit from SMT
 Manipulating thread / process priorities and/or scheduler
– Setting the “nice value” with nice or renice commands
– Set the Priority and Scheduler policy for a specific thread with thread_setsched()
– Set the Priority to “fixed” and Scheduler policy to SCHED_RR for all threads in the specified process with setpri()
– Scheduler change command does not ship with AIX, but SupportLine PerfPMR script includes it (setsched)
 Adjusting Scheduling algorithm for SCHED_OTHER
– Two parameters can be modified to change the behavior of the scheduling algorithm:
• Short Term CPU Usage Penalty Impact (sched_R). Value range from 0 to 32.
– A value of zero indicates no CPU penalty for short term CPU usage – essentially creating a fixed priority
process.
– A value of 32 indicates that every clock tick would increase a process's priority value by 1.
• How long should the system penalize a process for using CPU resources (sched_D). Value range from 0 to 32.
– A value of zero indicates that the system should “forget” about short term CPU usage every second.
– A value of 32 indicates that the system should always remember about short term CPU usage.
• The default value for both parameters is 16; they can be modified using the schedo command (see the sketch after this list)
 Adding more processors or increasing entitlement
– Will not help when a single thread is CPU bound
– Or when the application is single threaded
 Move workloads to non-peak periods, tune applications, use faster processors
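A sketch of the priority and scheduler adjustments above; the PID and tunable values are illustrative only, and schedo -d restores a tunable's default:
# renice -n 10 -p 409752                   lower the priority of one process
# schedo -a | grep sched_                  display the current sched_R and sched_D values
# schedo -o sched_R=4 -o sched_D=16        change the penalty/decay behavior
# schedo -d sched_R -d sched_D             return to the defaults (16/16)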

59 © 2008 IBM Corporation


Advanced Technical Support, Americas

Generic IO Tuning

60 © 2008 IBM Corporation


Advanced Technical Support, Americas

The AIX IO Stack - Review


The stack, from application down to disk, with notes at each layer:
 Application – an application memory area caches data to avoid IO
 Logical File System – raw LVs, raw disks, local FS and remote FS; NFS caches file attributes
 JFS/JFS2 and NFS – NFS has a cached file system for NFS clients; JFS and JFS2 cache use extra system RAM
 VMM – JFS uses persistent pages for cache; JFS2 uses client pages for cache
 LVM
 Device Driver(s) – queues exist for both adapters and disks; adapter device drivers use DMA for IO
 Disk Subsystem (optional) – read and write cache; with write cache, the ack is sent back to the application
 Disk – disks have memory to store commands/data
61 © 2008 IBM Corporation


Advanced Technical Support, Americas

Memory Buffers and IO - vmstat


Calculate delta values:
# vmstat -v > vmstat_v.before
<< system under stress for some period of time >>
# vmstat -v > vmstat_v.after

# vmstat -v
   233472 memory pages
   217720 lruable pages
     6871 free pages
        1 memory pools
   121384 pinned pages
     80.0 maxpin percentage
     20.0 minperm percentage
     80.0 maxperm percentage
     27.5 numperm percentage
    60024 file pages
      0.0 compressed percentage
        0 compressed pages
     32.2 numclient percentage
     80.0 maxclient percentage
    70289 client pages
        0 remote pageouts scheduled
        0 pending disk I/Os blocked with no pbuf
    25613 paging space I/Os blocked with no psbuf
     2740 filesystem I/Os blocked with no fsbuf
     2965 client filesystem I/Os blocked with no fsbuf
      333 external pager filesystem I/Os blocked with no fsbuf

If a blocked-I/O counter increases over time (raise tunables in multiples of their defaults):
 pending disk I/Os blocked with no pbuf – increment hd_pbuf_cnt or pv_min_pbuf (depends on OS level)
 paging space I/Os blocked with no psbuf – add more paging devices or stop paging
 filesystem I/Os blocked with no fsbuf (JFS) – increment numfsbufs
 client filesystem I/Os blocked with no fsbuf (NFS, rfsbufwaitcnt) – tune nfso's nfs_v3_pdts and nfs_v3_vm_bufs
 external pager filesystem I/Os blocked with no fsbuf (JFS2) – increment j2_dynamicBufferPreallocation and j2_nBufferPerPagerDevice
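A sketch of the corresponding tuning commands; the values are illustrative multiples of the defaults, numfsbufs requires a file system remount, and 'datavg' is a hypothetical volume group (on AIX 5.3 pbufs can also be tuned per VG with lvmo):
# ioo -p -o numfsbufs=392                                                        JFS fsbufs
# ioo -p -o pv_min_pbuf=1024                                                     LVM pbufs
# lvmo -v datavg -o pv_pbuf_count=1024                                           per-VG pbufs
# ioo -p -o j2_dynamicBufferPreallocation=32 -o j2_nBufferPerPagerDevice=1024    JFS2 fsbufs
# nfso -p -o nfs_v3_pdts=2 -o nfs_v3_vm_bufs=10000                               NFS client buffers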

62 © 2008 IBM Corporation


Advanced Technical Support, Americas

I/O problems - Memory


 Look for increased IO service times between the LV and PV layers
ƒ Use filemon to check LV-to-PV deltas

ƒ Inadequate file system buffers (vmstat –v)

ƒ Inadequate disk buffers (vmstat –v)

ƒ Inadequate disk or adapter queue_depth (lsattr –EHl <device>)

ƒ i-node locking: decrease file sizes or use cio mount option if possible

ƒ Interference from daemons that disable interrupts or block IO (see the commands after this list)

– Page replacement daemon (lrud): decrease lru_poll_interval to 1

– syncd: reduce file system cache (< 24 GB on AIX 5.2)

– Insufficient disks/adapters or limits on disk subsystem (iostat –aD, filemon)

 For sequential IO, are IO rates near the disk(s)' capability? If not:
ƒ For reads, FS or disk subsystem “read ahead” is insufficient or inhibited
ƒ IOs are queuing somewhere in the IO stack due to a bottleneck
ƒ We expect a bottleneck somewhere on the IO stack, since we're
pushing the data through as fast as possible
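A few of the checks above, sketched as commands; the disk name is an example:
# vmo -p -o lru_poll_interval=1                       have lrud poll for interrupts while scanning
# lsattr -El hdisk4 -a queue_depth                    check the current disk queue depth
# filemon -O lv,pv -o fmon.out; sleep 30; trcstop     compare LV vs PV service times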

63 © 2008 IBM Corporation


Advanced Technical Support, Americas

Disk internals – Review


One actuator; multiple read/write heads and platters
More sectors per track on the outer edge
Memory on the disk electronics – IO requests and data are temporarily stored here;
this allows data to transfer on the interconnect at rates greater than the disk's transfer rate

IO service time = queuing time + seek + latency + transfer


= queuing time + 0-5 ms + 0-8 ms + 0.15 - 9.6 ms (4 KB to 128 KB)
= queuing time + ( 0.15 to 22.6 ) ms

Additional latencies at interconnection and thru IO layers; note there are


additional latencies for networked disks (e.g. SANs and NAS)
Disk bottleneck = IOs queuing at the disk
We avoid queuing time by proper sizing and data layout
We can reduce seek by good data placement

64 © 2008 IBM Corporation


Advanced Technical Support, Americas

Random vs Sequential IO
 Know Your IO Patterns
 Remember iostat, filemon, application and DB metrics
 Random IO
ƒ Spread your IOs across the disks to balance the IO load

ƒ IO sizes should be less than or equal to strip sizes

 Sequential IO
ƒ Use lspv, lslv, fileplace and filemon commands to determine how localized IO is

ƒ Stripe your IO to maximize throughput

– Use one IO stream when striping or one IO stream per disk


– Avoid multiple sequential IO streams on the same disk – too many sequential
streams could become random
– Outer edge of disks get greater MB/s
– Increase JFS* read ahead (remember to adjust maxfree; see the sketch after the table below)

Random vs. sequential IOPS, 4 KB IOs to 1 disk:
                MB/s    IOPS
  Sequential     7.0    1790
  Random         0.3      80
Assuming 4 ms seeks, 8 ms latency, and a 7 MB/s transfer rate for an older SSA disk
(this ignores some internal disk factors; newer disks get better performance)
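A sketch of checking placement and raising read ahead (file, LV and values are illustrative):
# fileplace -pv /data/bigfile              physical placement of a file
# lslv -l datalv                           how a logical volume is spread across disks
# ioo -p -o j2_maxPageReadAhead=512        JFS2 sequential read ahead (4 KB pages)
# ioo -p -o maxpgahead=32                  JFS read ahead
# vmo -L maxfree                           check maxfree; keep it at least minfree + the read-ahead value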

65 © 2008 IBM Corporation


Advanced Technical Support, Americas

I/O Tuning – iostat -aD

# iostat -a -D
Report sections: adapters, virtual adapters, paths and PVs, with read/write IOPS (rps/wps).
Use the -l option for a wide-column, one-device-per-line format.

System configuration: lcpu=2 drives=3 paths=1 vdisks=1
Adapter:
scsi0         xfer:  bps     tps     bread   bwrtn
                     0.0     0.0     0.0     0.0
Paths/Disk:
hdisk0_path0  xfer:  %tm_act bps     tps     bread   bwrtn
                     0.0     0.0     0.0     0.0     0.0
              read:  rps     avgserv minserv maxserv timeouts fails
                     0.0     0.0     0.0     0.0     0        0
              write: wps     avgserv minserv maxserv timeouts fails
                     0.0     0.0     0.0     0.0     0        0
              queue: avgtime mintime maxtime avgwqsz avgsqsz  sqfull
                     0.0     0.0     0.0     0.0     0.0      0
Vadapter:
vscsi0        xfer:  tps     bread   bwrtn   partition-id
                     0.0     0.0     0.0     ####
              read:  avgserv minserv maxserv
                     0.0     0.0     0.0
              write: avgserv minserv maxserv
                     0.0     0.0     0.0
              queue: avgtime mintime maxtime avgsqsz qfull
                     0.0     0.0     0.0     0.0     0
Disk:
hdisk10       xfer:  %tm_act bps     tps     bread   bwrtn
                     0.0     0.0     0.0     0.0     0.0
              read:  rps     avgserv minserv maxserv timeouts fails
                     0.0     0.0     0.0     0.0     0        0
              write: wps     avgserv minserv maxserv timeouts fails
                     0.0     0.0     0.0     0.0     0        0
              queue: avgtime mintime maxtime avgwqsz avgsqsz  sqfull
                     0.0     0.0     0.0     0.0     0.0      0

© 2008 IBM Corporation


Advanced Technical Support, Americas

I/O Tuning – iostat -D (service times you previously could only get from filemon)

hdisk1 xfer: %tm_act bps tps bread bwrtn


87.7 62.5M 272.3 62.5M 823.7
read: rps avgserv minserv maxserv timeouts fails
271.8 9.0 0.2 168.6 0 0
write: wps avgserv minserv maxserv timeouts fails
0.5 4.0 1.9 10.4 0 0
queue: avgtime mintime maxtime avgwqsz avgsqsz sqfull
1.1 0.0 14.1 0.2 1.2 2374
Notes:
 All –D outputs are rates, except sqfull, which is an interval delta (TL06+ APARs convert it to a rate).
 avgsqsz cannot exceed the queue_depth for the disk.
 If sqfull is often > 0, then increase queue_depth.
 Average IO sizes: read = bread/rps, write = bwrtn/wps.

Virtual adapter's extended throughput report (-D)
Metrics related to transfers (xfer:)
  tps           Indicates the number of transfers per second issued to the adapter.
  recv          The total number of responses received from the hosting server to this adapter.
  sent          The total number of requests sent from this adapter to the hosting server.
  partition id  The partition ID of the hosting server, which serves the requests sent by this adapter.
Adapter read/write service metrics (read:, write:)
  avgserv       Indicates the average time. Default is in milliseconds.
  minserv       Indicates the minimum time. Default is in milliseconds.
  maxserv       Indicates the maximum time. Default is in milliseconds.
Adapter wait queue metrics (queue:)
  avgtime       Indicates the average time spent in the wait queue. Default is in milliseconds.
  mintime       Indicates the minimum time spent in the wait queue. Default is in milliseconds.
  maxtime       Indicates the maximum time spent in the wait queue. Default is in milliseconds.
  avgwqsz       Indicates the average wait queue size.
  avgsqsz       Indicates the average service queue size – IOs waiting to be sent to the disk.
  sqfull        Indicates the number of times the service queue becomes full.

67 © 2008 IBM Corporation


Advanced Technical Support, Americas

I/O Monitoring – filemon most active files/segments


# filemon -O all -o fmon.out; sleep 60; trcstop
Produces all reports: Files, Segments, Logical Volumes, Physical Volumes

Most Active Files
------------------------------------------------------------------------
  #MBs  #opns    #rds  #wrs  file       volume:inode
------------------------------------------------------------------------
 400.0      1  102401     0  big_400    /dev/outfiles:13
   1.2      4     296     0  unix       /dev/hd2:4153
   0.7      2     170     0  services   /dev/hd4:1807
   0.4    202     104     0  group      /dev/hd4:23

Most Active Segments
------------------------------------------------------------------------
  #MBs  #rpgs  #wpgs  segid  segtype  volume:inode
------------------------------------------------------------------------
 256.0  65536      0   6ef6  client
 144.0  36864      0   0ef0  client
   0.0      0      6   63f6  client

Key stats at the file system level: MBs transferred, number of open, read and write system calls.
The segments report shows I/O activity to the VMM – this data is now in the file system cache;
you can use the segid with svmon -S.
Note: 400 MB requested at the file level = 400 MB at the segment level; therefore, all the data was
read from disk – none of the data was cached.
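To drill into one of the segments listed above, pass the segid to svmon; the segid here is taken from the sample report:
# svmon -S 6ef6       pages in memory, pinned pages and paging space use for segment 6ef6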

68 © 2008 IBM Corporation


Advanced Technical Support, Americas

I/O Monitoring – filemon most active LV/PV


# filemon -T 25000000 -P -O lv,pv -o fmon.out; sleep 60; trcstop
Trace buffer size (-T): increase it when you see "trace overflow" events, or reduce the monitoring time.

Most Active Logical Volumes
------------------------------------------------------------------------
  util   #rblk   #wblk    KB/s   volume           description
------------------------------------------------------------------------
  0.51  709808  294608  8306.9   /dev/ofs002lv    N/A
  0.02       0    1744    14.4   /dev/hd8         jfs2log
  0.02       0    1609    13.3   /dev/appuat_c    N/A
  0.02       0    1609    13.3   /dev/appuat_d    N/A
  0.01     888    1521    19.9   /dev/ofs_002_a   N/A
  0.01      40    1521    12.9   /dev/ofs_002_b   N/A

Most Active Physical Volumes
------------------------------------------------------------------------
  util   #rblk   #wblk    KB/s   volume           description
------------------------------------------------------------------------
  1.00  709808  294976  8309.9   /dev/hdisk1      N/A
  0.99      40    6703    55.8   /dev/hdisk7      N/A
  0.98     888    3517    36.4   /dev/hdisk0      N/A

1 block is 512 bytes; # of 4 KB pages = #rblk / 8

69 © 2008 IBM Corporation


Advanced Technical Support, Americas

I/O Tuning – filemon detailed LV


# filemon -T 25000000 -P -v -O pv,lv -o myfile.out; sleep 60; trcstop
------------------------------------------------------------------------
Detailed Logical Volume Stats (512 byte blocks)
------------------------------------------------------------------------
VOLUME: /dev/ofs002lv  description: N/A
reads: 10172 (0 errs)
read sizes (blks): avg 69.8 min 8 max 256 sdev 101.8
read times (msec): avg 2.513 min 0.213 max 108.228 sdev 5.321
read sequences: 6296
read seq. lengths: avg 112.7 min 8 max 10240 sdev 731.9
writes: 16480 (0 errs)
write sizes (blks): avg 17.9 min 8 max 256 sdev 15.2
write times (msec): avg 1.014 min 0.228 max 35.433 sdev 1.539
write sequences: 16414
write seq. lengths: avg 17.9 min 8 max 704 sdev 18.5
seeks: 22502 (84.4%)
seek dist (blks): init 512740232,
avg 10093106.9 min 8 max 480236768 sdev 52384904.5
time to next req(msec): avg 2.263 min 0.014 max 171.678 sdev 5.326
throughput: 8306.9 KB/sec
utilization: 0.51

Key indicators of a performance bottleneck:
 Response times are too long (rules of thumb: reads < 20 msecs, writes with cache < 2 msecs)
 Significant delta between the Logical and Physical layer response times
Sequential vs. random workload:
 # of writes = # of write sequences means the workload is random
 A higher seek percentage indicates a more random workload

70 © 2008 IBM Corporation


Advanced Technical Support, Americas

I/O Tuning – filemon detailed PV

Average IO sizes:
 Blks are 512 bytes in AIX
 69.8 x 512 = ~32 KB average read size

------------------------------------------------------------------------
Detailed Physical Volume Stats (512 byte blocks)
------------------------------------------------------------------------

VOLUME: /dev/hdisk1 description: N/A


reads: 10172 (0 errs)
read sizes (blks): avg 69.8 min 8 max 256 sdev 101.8
read times (msec): avg 2.188 min 0.001 max 108.213 sdev 4.835
read sequences: 6296
read seq. lengths: avg 112.7 min 8 max 10240 sdev 731.9
writes: 16487 (0 errs)
write sizes (blks): avg 17.9 min 8 max 256 sdev 15.2
write times (msec): avg 0.400 min 0.001 max 19.784 sdev 0.783
write sequences: 16421
write seq. lengths: avg 18.0 min 8 max 704 sdev 18.5
seeks: 22509 (84.4%)
seek dist (blks): init 791801736,
avg 10156034.0 min 8 max 6810 sdev 5275.1
seek dist (%tot blks):init 56.76268,
avg 0.72807 min 0.00000 max 48.82100 sdev 3.78211
time to next req(msec): avg 2.262 min 0.013 max 171.681 sdev 5.325
throughput: 8309.9 KB/sec
utilization: 1.00

71 © 2008 IBM Corporation


Advanced Technical Support, Americas

Device tuning
List device attributes with # lsattr -EHl <device>
Attributes with a value of True for user_settable can be changed
Sometimes you can change these via smit
Allowable values can be determined via:
# lsattr -Rl <device> -a attribute
Disks
ƒ Usually a parameter indicating number of commands that can be queued at the disk

device driver, usually queue_depth


ƒ Can be increased for small IOs

ƒ Study the other parameters before changing them

Adapters
ƒ Usually a parameter for number of commands to queue at the adapter device driver

that can be changed (FC num_cmd_elems)


ƒ Usually a parameter for DMA memory area - often needs to be increased for very

large sequential IO (FC devices max_xfer_size = 0x200000 increases driver buffer


from 16MB to 128 MB)
Be careful about using very large queue_depth values - test with these under a
heavy IO load before making major changes
ƒ Note that IO characteristics vary from production and backup/restore

These attributes usually require stopping use of the device to change it
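A sketch of checking and raising these attributes; device names and values are examples, and the -P flag defers the change until the device is reconfigured or the system rebooted:
# lsattr -EHl hdisk4                       list attributes and whether they are user settable
# lsattr -Rl hdisk4 -a queue_depth         show allowable values
# chdev -l hdisk4 -a queue_depth=16 -P     raise the disk queue depth
# chdev -l fcs0 -a num_cmd_elems=1024 -a max_xfer_size=0x200000 -P    FC adapter queue and DMA area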


72 © 2008 IBM Corporation
Advanced Technical Support, Americas

Data placement problems?


 Spread data structures across as many spindles as possible
 Use application striping across separate containers
 Put different data structures on different sets of disks
 Put high sequential IO rate structures on their own disk sets, outer edge
 Use intra-policy of middle for high IOPS structures (minimize seeks)
Move an LV or some of the hot partitions to another hdisk in the VG
 migratelp SourceLV/LP DestPV/PP
 migratepv -l LVname SourcePV DestPV
 Use reorgvg: set new LV attributes, and run it
 Free PPs must exist in the VG
 Large amount of IO will occur, and may take hours
 Can use this for VG or LV
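A sketch of the relocation commands with hypothetical LV, VG and disk names:
# migratelp datalv/27 hdisk5/100           move logical partition 27 of datalv to PP 100 on hdisk5
# migratepv -l datalv hdisk2 hdisk5        move all of datalv's partitions from hdisk2 to hdisk5
# chlv -a e datalv                         example attribute change: intra-policy = outer edge
# reorgvg datavg datalv                    reorganize datalv within datavg according to its policies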

73 © 2008 IBM Corporation


Advanced Technical Support, Americas

New mount option - noatime


 Ingo Molnar (Linux kernel developer) said:
– "It's also perhaps the most stupid Unix design idea of all times. Unix is really
nice and well done, but think about this a bit: 'For every file that is read from
the disk, lets do a ... write to the disk! And, for every file that is already
cached and which we read from the cache ... do a write to the disk!'"
 If you have a lot of file activity, you have to update a lot of timestamps
– File timestamps
• File status (inode) change time (ctime)
• File last modified time (mtime)
• File last access time (atime)
– New mount option noatime disables last access time updates for JFS2
– File systems with heavy inode access activity due to file opens can have
significant performance improvements
 APARs
– IZ11282 AIX 5.3
– IZ13085 AIX 6.1
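With the APARs applied, the option can be set at mount time or made persistent; '/data' is a hypothetical mount point:
# mount -o noatime /data
# chfs -a options=noatime /data            records the option in /etc/filesystems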

74 © 2008 IBM Corporation


Advanced Technical Support, Americas

Tuning and Improving System Performance


 Adjust the key IOO Tuning Parameters
– JFS* read ahead for sequential IO
 Adjust device specific tuning Parameters
– hdisk queue depths, adapter or subsystem read ahead
 Reducing memory/cache needs
– DIO / CIO
• Requires application support
• Direct memory transfers from application memory to drivers
– Release behind on read and/or write mounts
• Useful if files will not benefit from remaining in cache
• mount -o rbr, -o rbw, or -o rbrw (see the example after this list)
• Tells VMM not to cache
 Improve the data layout
 Add additional hardware resources
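A sketch of the mount options mentioned above ('/backup' and '/oradata' are hypothetical mount points; CIO also requires application support):
# mount -o rbrw /backup                    release-behind on both reads and writes
# mount -o cio /oradata                    concurrent IO: bypasses JFS2 caching and inode locking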

75 © 2008 IBM Corporation


Advanced Technical Support, Americas

Tools for
LPAR & CEC
Historical Performance

76 © 2008 IBM Corporation


Advanced Technical Support, Americas

LPAR & CEC Historical Tools


 AIX
– New AIX function with LPAR and CEC performance history
• Local Recording/Reports (emulates topas)
• CEC Recording/Reports (emulates topas –C)
– SMIT configuration updates (TL06)
 Other
– IBM – Free
• Nigel Griffiths Nmon and seastat
• Stephen Atkins Nmon Analyzer and consolidation tools
• Lparmon – IBM Briefing Center, Alphaworks
• Gmon – Alphaworks?
– Tivoli
• Integrated Tivoli Monitor (ITM SE)
– Open Solutions
• Ganglia (see earlier Wiki link for more info)

77 © 2008 IBM Corporation


78 Advanced Technical Support, Americas
IBM Global Services
AIX Recordings/Reports - 5.3 TL6 Update
•SMIT panels
•setup access to partitions not on the local subnet (Rsi.hosts)
•turn on/off CEC and local recordings
•display recording status
•generate reports
  ►to file
  ►to printer
  ►to stdout
•eliminates
  ►need to know file locations and names
  ►topasout syntax

SMIT menu "Configure Topas Options":
  Add Host to topas external subnet search file (Rsi.hosts)
  List hosts in topas external subnet search file (Rsi.hosts)
  Configure Recordings
  List Available Recordings
  Show current recordings status
  Generate Report

"Configure Recordings" submenu:
  Enable CEC Recording
  Disable CEC Recording
  Enable Local Recording
  Disable Local Recording

Sample recording status:
  Recording  Status
  =========  ======
  CEC        Not Enabled
  Local      Not Enabled
  WLE        Not Enabled

Sample list of available recordings:
  Type   Date      Start     Stop
  local  07/01/23  00:02:42  23:58:53
  local  07/01/24  00:03:18  23:59:45
  local  07/01/25  00:04:45  23:56:11
  local  07/01/26  00:01:11  23:57:37
  local  07/01/27  00:04:13  23:59:34
  local  07/01/28  00:04:04  23:57:52
  local  07/01/29  00:02:52  01:12:54

78
© IBM Corporation 2007
What's new in AIX 5.3
Advanced Technical Support, Americas

Topasout Local Report – Detailed


•Detailed report: topasout –R detailed (layout is similar to topas)
Report: System Detailed --- hostname: ptoolsl1 version 1.0
Start:12/21/05 10.00.00 Stop:12/21/05 11.00.00 Int: 5 Min Range: 60 Min
Time: 10.00.00 --------------------------------------------------------------
CPU UTIL MEMORY PAGING EVENTS/QUEUES NFS
Kern 12.0 PhyB 0.7 Sz,GB 16.0 Sz,GB 4.0 Cswth 3213 SrvV2 32
User 8.0 Ent 0.5 InU 4.3 InU 2.3 Syscl 43831 CltV2 12
Wait 0.0 EntC 15.2 %Comp 3.1 Flt 221 RunQ 1 SrvV3 44
Idle 78.0 LP 4 %NonC 9.0 Pg-I 87 WtQ 0 CltV3 18
SMT ON Mode Shr %Clnt 2.0 Pg-O 44 VCSW 1214

Network KBPS I-Pack O-Pack KB-I KB-O


en0 0.6 7.5 0.5 0.3 0.3
en1 22.3 820.1 124.3 410.0 61.2
lo0 0.0 0.0 0.0 0.0 0.0

Disk Busy% KBPS TPS KB-R KB-W


hdisk0 0.0 0.0 0.0 0.0 0.0
hdisk1 0.0 0.0 0.0 0.0 0.0
Time: 10.05.00 --------------------------------------------------------------
CPU UTIL MEMORY PAGING EVENTS/QUEUES NFS
Kern 12.0 PhyB 0.7 Sz,GB 16.0 Sz,GB 4.0 Cswth 3213 SrvV2 32
User 8.0 Ent 0.5 InU 4.3 InU 2.3 Syscl 43831 CltV2 12
Wait 0.0 EntC 15.2 %Comp 3.1 Flt 221 RunQ 1 SrvV3 44
Idle 78.0 LP 4 %NonC 9.0 Pg-I 87 WtQ 0 CltV3 18
SMT ON Mode Shr %Clnt 2.0 Pg-O 44 VCSW 1214

Network KBPS I-Pack O-Pack KB-I KB-O


en0 0.6 7.5 0.5 0.3 0.3
en1 22.3 820.1 124.3 410.0 61.2
lo0 0.0 0.0 0.0 0.0 0.0

Disk Busy% KBPS TPS KB-R KB-W


hdisk0 0.0 0.0 0.0 0.0 0.0
hdisk1 0.0 0.0 0.0 0.0 0.0

© 2008 IBM Corporation


Advanced Technical Support, Americas

Topasout Local Report - Summary


Dedicated partitions: topasout –R summary

Report: System Summary - hostname: ptoolsl1 version 1.0


Start:12/20/05 14.00.00 Stop:12/20/05 15.00.00 Int: 5 Min Range: 60 Min
Mem: 16.2 GB Dedicated SMT:OFF Logical CPUs: 2
Time InU Us Sy Wa Id PhysB RunQ WtQ CSwitch Syscall PgFault
-------------------------------------------------------------------------------
14.00.00 21.1 11 8 0 81 0.2 1 0 3432 5050 17
14.05.00 21.1 16 5 0 79 0.3 1 0 532 3104 14
14.10.00 21.2 13 7 0 20 0.2 1 0 652 4326 13

Shared partitions: topasout –R summary


Report: System Summary - hostname: ptoolsl1 version 1.0
Start:12/21/05 10.00.00 Stop:12/21/05 11.00.00 Int: 5 Min Range: 60 Min
Psize:1.0 Mem: 16.2 GB Shared SMT:OFF Logical CPUs: 2
Time InU Us Sy Wa Id PhysB Ent %EntC RunQ WtQ CSwitch Syscall PgFault
-------------------------------------------------------------------------------
10.00.00 21.1 11 8 0 81 0.2 0.5 23.2 1 0 3432 5050 17
10.05.00 21.1 16 5 0 79 0.3 0.5 25.0 1 0 532 3104 14
10.10.00 21.2 13 7 0 20 0.2 0.5 23.4 1 0 652 4326 13

© 2008 IBM Corporation


Advanced Technical Support, Americas

Topasout Local Report – Adapter I/O


Disk report: topasout –R disk

Report: Total Disk I/O Summary - hostname: ptoolsl1 version:1.0


Start:04/25/06 00.00.00 Stop:04/26/06 00.00.00 Int:05 Min Range:1440 Min
Mem: 8.0 GB Dedicated SMT:ON Logical CPUs:16
Time InU PhysB %Bsy MBPS TPS MB-R MB-W
-------------------------------------------------------------------------------
00.00.05 6.5 12.50 45.5 120.5 300.1 100.1 20.4
00.00.10 6.7 13.40 55.0 240.0 320.2 240.0 0.0
00.00.15 7.0 14.70 60.4 160.2 350.3 40.1 120.1
00.00.20 7.4 15.50 72.3 200.7 410.5 20.3 180.4

LAN report: topasout –R lan


Report: Total LAN I/O Summary - hostname: ptoolsl1 version:1.0
Start:03/12/06 17.15.00 Stop:03/12/06 20.30.00 Int:05 Min Range: 195 Min
Psize:1.0 Mem: 16.2 GB Shared SMT:OFF Logical CPUs: 2
Time InU PhysB MBPS I-Pack O-Pack MB-I MB-O Rcvdrp Xmtdrp
-------------------------------------------------------------------------------
17.15.00 3.2 6.30 20.0 310.5 120.2 16.2 3.8 120 160
17.20.00 3.3 6.45 22.3 220.3 225.7 11.1 11.2 118 165
17.25.00 3.2 6.15 18.5 275.6 158.0 11.6 6.9 121 162
17.30.00 3.4 6.55 19.4 270.2 156.9 11.3 6.1 124 154

© 2008 IBM Corporation


Advanced Technical Support, Americas

Topasout CEC Report - Detailed


Detailed CEC: topasout –R detailed [recording name] (layout is similar to topas -C)
Report: Topas CEC Detailed --- hostname: ptoolsl1 version:1.0
Start:05/02/06 07.00.00 Stop:05/02/06 17.00.00 Int:05 Min Range:600 Min
Time: 07.00.00 -----------------------------------------------------------------
Partition Info Memory (GB) Processors
Monitored : 8 Monitored : 0.0 Monitored : 7 Shr Physical Busy: 2.2
UnMonitored: 0 UnMonitored: 0.0 UnMonitored: 0 Ded Physical Busy: 0.4
Shared : 3 Available :32.0 Available : 7
Dedicated : 2 UnAllocated: 0 UnAllocated: 1 Hypervisor
Capped : 1 Consumed : 8.7 Shared : 4 Virt. Context Switch:332
Uncapped : 2 Dedicated : 3 Phantom Interrupts : 2
Pool Size : 2
Avail Pool : 1
Host OS M Mem InU Lp Us Sy Wa Id PhysB Ent %EntC Vcsw PhI
--------------------------------shared------------------------------------------
ptools1 A53 u 1.1 0.4 4 15 3 0 82 1.30 0.50 22.0 200 5
ptools5 A53 U 12 10 1 12 3 0 85 0.20 0.25 0.3 121 3
ptools3 A53 C 5.0 2.6 1 10 1 0 89 0.15 0.25 0.3 52 2
-------------------------------dedicated----------------------------------------
ptools4 A53 S 0.6 0.3 2 12 3 0 85 0.60
ptools6 A52 1.1 0.1 1 11 7 0 82 0.50
ptools8 A52 1.1 0.1 1 11 7 0 82 0.50
Time: 07.05.00 -----------------------------------------------------------------
Partition Info Memory (GB) Processors
Monitored : 8 Monitored : 0.0 Monitored : 7 Shr Physical Busy: 2.2
UnMonitored: - UnMonitored: 0.0 UnMonitored: 0 Ded Physical Busy: 0.4
Shared : 3 Available :32.0 Available : 7
Dedicated : 2 UnAllocated: - UnAllocated: 1 Hypervisor
Capped : 2 Consumed : 8.7 Shared : 4 Virt. Context Switch:332
Uncapped : 2 Dedicated : 3 Phantom Interrupts : 2
Pool Size : 2
Avail Pool : 1
Host OS M Mem InU Lp Us Sy Wa Id PhysB Ent %EntC Vcsw PhI
--------------------------------shared------------------------------------------
ptools1 A53 u 1.1 0.4 4 15 3 0 82 1.30 0.50 22.0 200 5
ptools5 A53 U 12 10 1 12 3 0 85 0.20 0.25 0.3 121 3
ptools3 A53 C 5.0 2.6 1 10 1 0 89 0.15 0.25 0.3 52 2
-------------------------------dedicated----------------------------------------
ptools4 A53 S 0.6 0.3 2 12 3 0 85 0.60

© 2008 IBM Corporation
Advanced Technical Support, Americas

Topasout CEC Report - Summary


 Summary CEC: topasout –R summary [recording name]
Report: Topas CEC Summary --- hostname: ptoolsl1 version:1.0
Start:02/09/06 00.00.00 Stop:02/09/06 23.55.00 Int: 5 Min Range:1440 Min
Partition Mon: 7 UnM: 1 Shr: 4 Ded: 3 Cap: 3 UnC: 1
-CEC------ -Processors-------------------- -Memory (GB)------------
Time ShrB DedB Mon UnM Avl UnA Shr Ded PSz APP Mon UnM Avl UnA InU
00.05.00 3.2 1.1 5 2 7 1 4 3 2 1 16.0 0.0 32.0 0.0 8.1
00.10.00 2.9 0.9 5 2 7 1 4 3 2 1 16.0 0.0 32.0 0.0 8.3
00.15.00 2.1 1.3 5 2 7 1 4 3 2 1 16.0 0.0 32.0 0.0 8.5

...

configuration change at 02.15.00


Partition Mon: 8 UnM: 0 Shr: 4 Ded: 4 Cap: 3 UnC: 1
-CEC------ -Processors-------------------- -Memory (GB)------------
Time ShrB DedB Mon UnM Avl UnA Shr Ded PSz APP Mon UnM Avl UnA InU
02.15.00 3.1 2.5 7 0 7 1 4 5 2 1 18.0 0.0 32.0 0.0 9.1
02.20.00 1.9 1.5 7 0 7 1 4 5 2 1 18.0 0.0 32.0 0.0 6.8
02.25.00 2.0 3.3 7 0 7 1 4 5 2 1 18.0 0.0 32.0 0.0 7.8

...

© 2008 IBM Corporation


Advanced Technical Support, Americas

Topas CEC Recording - Problems


 Topas –C | -R do not resolve remote hosts
– The CEC function reuses a registered inetd protocol, known as xmquery
• Part of the Performance Toolbox/Aide product for System p AIX
• If the Performance Aide is installed, xmservd is the operating agent started by xmquery
• If the Performance Aide is not installed, xmtopas is the AIX agent started by xmquery
– Topas utilizes the Remote Statistics (RSi) API to connect the local and remote agents
• When initialized, this API issues an xmquery protocol request on a specific port. Each system's inetd service
sees this query and starts the agent
• Agent then replies to host which issued the query
– The xmquery protocol call defaults to polling systems only within the local subnet
• If LPARs within the same CEC operate on multiple subnets:
– Create a $HOME/Rsi.hosts file (an example is shown after this list)
– List the fully qualified hostname or IP address of each remote system, one entry per line
– The API will initialize and use this file to poll systems in other subnets
 Topas is slow to start, recognize new hosts
– Reconfiguration code has been optimized in TL5. While startup time may require up to one
minute, new systems should be recognized within 1-2 minutes.
 Check for service updates to AIX filesets for latest fixes
– bos.perf.tools contains topas
– perfagent.tools contains metrics instrumentation and API
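A minimal Rsi.hosts, assuming two LPARs on another subnet (hostnames are examples):
# cat $HOME/Rsi.hosts
lpar21.example.com
lpar22.example.com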

84 © 2008 IBM Corporation


Advanced Technical Support, Americas

AIX TL06

85 © 2008 IBM Corporation


86 Advanced Technical Support, Americas
IBM Global Services
Dedicated idle cycles donation
•Similar in concept to idle partitions ceding idle cycles to shared pool
•Differences
•there is a guarantee not to get phantom interrupts (interrupts for other partitions)
•the partition keeps running on the same physical processors
•must be enabled on HMC
•New phyp instrumentation collects
•donated cycles
►voluntarily donated by an idle dedicated partition to shared pool
•stolen cycles
►cycles stolen by phyp from a dedicated partition to run maintenance tasks (hypervisor overhead)
►can happen whether donation is enabled or not (just wasn’t instrumented before)

•Tools metrics impact


•processors belonging to donating dedicated partitions are counted in pool size
•PURR stops on context switches
►similar to what happens to shared partitions
►tools will compensate so that dedicated percentages are still relative to total capacity

•Tools updated
•lparstat, mpstat and sar
•topas and topasout reports
86
© IBM Corporation 2007
What's new in AIX 5.3
87 Advanced Technical Support, Americas
IBM Global Services
Dedicated idle cycles donation - lparstat
•New mode shown by lparstat -i and in the interval report header

$ lparstat -i
Node Name                         : va01
Partition Name                    : va
Partition Number                  : 2
Type                              : Dedicated-SMT
Mode                              : Donating
Entitled Capacity                 : 1.00
Partition Group-ID                : 32770
Shared Pool ID                    : -
Online Virtual CPUs               : 1
Maximum Virtual CPUs              : 1
Minimum Virtual CPUs              : 1
Online Memory                     : 800 MB
Maximum Memory                    : 1024 MB
Minimum Memory                    : 128 MB
Variable Capacity Weight          : -
Minimum Capacity                  : 1.00
Maximum Capacity                  : 1.00
Capacity Increment                : 1.00
Maximum Physical CPUs in system   : 4
Active Physical CPUs in system    : 4
Active CPUs in Pool               : -
Unallocated Capacity              : -
Physical CPU Percentage           : 100.00%
Unallocated Weight                : -

# lparstat 1 3
System configuration: type=Dedicated mode=Donating smt=On lcpu=2 mem=800

%user  %sys  %wait  %idle  physc    vcsw
-----  ----  -----  -----  -----  ------
  0.1   0.4    0.0   99.5   0.68  670234
  0.0   0.2    0.0   99.8   0.68  670234
  0.0   0.2    0.0   99.8   0.68  670234

Notes:
•donation causes hardware context switches (vcsw)
•%user/%sys/%wait/%idle stay relative to partition capacity
•physc shows actual physical processor consumption: the number of physical processors
 minus donated and stolen cycles

87
© IBM Corporation 2007
What's new in AIX 5.3
88 Advanced Technical Support, Americas
IBM Global Services
Dedicated idle cycles donation - lparstat details
•New -d flag shows more details:
  %idon, %bdon: percentages of idle and busy times donated
  %istol, %bstol: percentages of idle and busy times stolen

•Example with donation enabled
# lparstat –d

System configuration: type=Dedicated mode=Donating smt=On lcpu=2 mem=800

%user %sys %wait %idle %idon %bdon %istol %bstol


----- ---- ----- ----- ------ ----- ----- ------
0.1 0.2 2.1 97.7 12.79 6.8 4.8 2.75

•Example without donation and in combination with -h


# lparstat -dh

System configuration: type=Dedicated mode=Capped smt=On lcpu=2 mem=800

%user %sys %wait %idle %hypv hcalls %istol %bstol


----- ---- ----- ----- ----- ------ ------ ------
0.1 0.2 2.1 97.7 0.0 391 4.8 2.75

88
© IBM Corporation 2007
What's new in AIX 5.3
89 Advanced Technical Support, Americas

Dedicated idle cycles donation - sar and mpstat


IBM Global Services

•sar
•automatically displays physc when donation is enabled
•mpstat
•automatically displays pc and lcs if donation is enabled
•new -h option to show more details on hypervisor related statistics
►donation enabled
System configuration: lcpu=2 mode=Donating
cpu pc ilcs vlcs idon bdon istol bstol
0 0.3 50327 687231635 10.2 4.5 0.59 0.32
1 0.5 61702 684989764 10.2 4.5 0.59 0.32
ALL 0.8 112029 1372221399 20.4 9.0 1.18 0.64

idon, bdon: percentages of idle and busy times donated
istol, bstol: percentages of idle and busy times stolen

►donation disabled
System configuration: lcpu=2 mode=Capped
cpu    pc     ilcs    vlcs        istol  bstol
0      0.3    503727  687231635   0.59   0.32
1      0.41   61702   684989764   0.59   0.32
ALL    0.71   565429  1372221399  1.18   0.64

►shared partition
System configuration: lcpu=2 ent=0.5 mode=Uncapped
cpu    pc     ilcs    vlcs
0      0.6    503727  687231635
1      0.6    61702   684989764
ALL    0.8    565429  1372221399

89
© IBM Corporation 2007
What's new in AIX 5.3
90 Advanced Technical Support, Americas
IBM Global Services
Dedicated idle cycles donation - topas -L
The topas -L (logical partition) panel includes the same updates as lparstat and mpstat:

Interval: 2    Logical Partition                    Fri Sep 22 09:01:46 2006
Donating       SMT ON                               Online Memory: 3200.0
Partition CPU Utilization      Online Virtual CPUs: 1    Online Logical CPUs: 2
%user %sys %wait %idle %hypv hcalls %istl %bstl %idon %bdon  vcsw
    1    1     0    98     1    200     0   2.1   3.5  10.0   1.0
===============================================================================
LCPU  minpf majpf  intr   csw  icsw  runq  lpa  scalls  usr  sys  _wt  idl    pc
Cpu0      0     0   190   176    84     0  100    5089    1    2    0   97  0.52
Cpu1      0     0    14     0     0     0    0       0    0    0    0  100  0.48

•Topasout report (a new version number, 1.2, marks the new format)
Report: System Detailed --- hostname: ptoolsl1 version: 1.2
Start:12/21/05 10.00.00 Stop:12/21/05 11.00.00 Int: 5 Min Range: 60 Min
Time: 10.00.00 --------------------------------------------------------------
CPU UTIL MEMORY PAGING EVENTS/QUEUES NFS
Kern 12.0 PhyB 0.7 Sz,GB 16.0 Sz,GB 4.0 Cswth 3213 SrvV2 32
User 8.0 Ent 0.0 InU 4.3 InU 2.3 Syscl 43831 CltV2 12
Wait 0.0 EntC 0.0 %Comp 3.1 Flt 221 RunQ 1 SrvV3 44
Idle 78.0 bdon 0.1 %NonC 9.0 Pg-I 87 WtQ 0 CltV3 18
SMT ON idon 1.0 %Clnt 2.0 Pg-O 44 VCSW 1214
LP 4 bstl 0.5
Mode Don istl 0.0

Network KBPS I-Pack O-Pack KB-I KB-O


en0 0.6 7.5 0.5 0.3 0.3
Disk Busy% KBPS TPS KB-R KB-W
hdisk0 0.0 0.0 0.0 0.0 0.0
90
© IBM Corporation 2007
What's new in AIX 5.3
91 Advanced Technical Support, Americas
IBM Global Services
Dedicated idle cycles donation - topas -C
•Example of topasout report for CEC recording – new fields: donated cycles, stolen cycles,
 donated processors, and donating partitions

Report: Topas CEC Detailed --- hostname: ptoolsl1 version: 1.2
Start:02/09/06 06.30.00 Stop:02/09/06 07.30.00 Int:60 Min Range: 600 Min
Partition Info       Memory (GB)         Processors               Avail Pool: 1.3
Monitored  :  8      Monitored  : 0.0    Monitored  : 7    Shr Physical Busy: 2.2
UnMonitored:  -      UnMonitored: 0.0    UnMonitored: 0    Ded Physical Busy: 0.4
Shared     :  6      Available  :32.0    Available  : 7    Donated Physical CPUs: 0.7
Uncapped   :  1      UnAllocated:  -     UnAllocated: 1    Stolen Physical CPUs : 0.1
Capped     :  7      Consumed   : 8.7    Shared     : 4    Hypervisor
Dedicated  :  2                          Dedicated  : 3    Virt. Context Switch: 332
Donating   :  2                          Donated    : 1    Phantom Interrupts  : 2
Pool Size  :  2

Host    OS M Mem  InU Lp Us Sy Wa Id PhysB  Vcsw  Ent %EntC PhI
--------------------------------shared------------------------------------------
ptools1 A53 u 1.1 0.4  1 15  3  0 82  1.30   200 0.50  22.0   5
ptools5 A53 U  12  10  2 12  3  0 85  0.20   121 0.25   0.3   3
ptools3 A53 C 5.0 2.6  2 10  1  0 89  0.15    52 0.25   0.3   2
ptools7 A53 c 2.0 0.4  1  0  1  0 99  0.05     2 0.10   0.3   2
Host    OS M Mem  InU Lp Us Sy Wa Id PhysB  Vcsw %istl %bstl %bdon %idon
------------------------------dedicated-----------------------------------------
ptools4 A53 D 0.6 0.3  2 12  3  0 85  0.60   110     1     2     0     5
ptools6 A52 d 1.1 0.1  1 11  7  0 82  0.50    50    10     5    10     0
ptools8 A52   1.1 0.1  1 11  7  0 82  0.50     5     0     1     -     -
ptools2 A52   1.1 0.1  1 11  7  0 82  0.50     4     0     2     -     -
Time: 07.30.00 -----------------------------------------------------------------

91
© IBM Corporation 2007
What's new in AIX 5.3
92 Advanced Technical Support, Americas
IBM Global Services

AIX 5.3 TL07 & AIX 6.1


92
© IBM Corporation 2007
What's new in AIX 5.3
93 Advanced Technical Support, Americas
IBM Global Services
TL07 & AIX 6.1
•iostat
•Tape device support!
►Uses standard dkstat structures, same as disk devices
►Includes support for read/write service times
►No queuing, so no queue wait metrics
►ATAPE only at this time
•Filesystem and Workload Partition reports (AIX 6)
•svmon
•AIX 6.1 dynamic pages sizes
►4 KB – 64 KB mixed segments
►Short (segment), detailed (by page) and long (by page size) reports
•Major usage improvements coming in 2008
►Reports simplification!
►Better system and process reports
• Similar to current user reports that provide breakdown by system, shared and exclusive resources
• Better answers to “what is the footprint of all my processes?”
•Workload Partitions Support
•Commands
►ps, ipcs, netstat, proc*, trace, vmstat
•Tools
►topas, tprof, filemon, netpmon, pprof, curt
93
© IBM Corporation 2007
What's new in AIX 5.3
94 Advanced Technical Support, Americas
IBM Global Services
TL07 & Power6
•Allows definition of virtual pools
•Subset(s) of shared physical processor pool
•Have their own capacity limits similar to partitions
►entitled capacity
• sum of partitions entitled capacity and pool reserved capacity
►maximum capacity
•Makes it possible to set uncapped partitions maximum capacity
►not necessarily equal to their number of virtual processors
•Can be used to lower cost with virtual pool aware licenses
►4 ABC licenses versus 7
•lparstat
•-i will show pool entitlement and max capacity
•topas
•Updated CEC panel
►new pool section (p option)
•Two roll ups
►CEC level
►by virtual pool (f option)

•topasout reports
•New pool level report
•Shared partitions sorted by pool
94
© IBM Corporation 2007
What's new in AIX 5.3
95 Advanced Technical Support, Americas
IBM Global Services

AIX 5.3 TL08 & AIX 6.1 TL02


95
© IBM Corporation 2007
What's new in AIX 5.3
96 Advanced Technical Support, Americas
IBM Global Services

New Tunables
• vmo additions
► psm_timeout_interval = 20000
• Determines the timeout interval, in milliseconds, to wait for page size
management daemons to make forward progress before LRU page
replacement is started. This setting is only valid on the 64-bit kernel. Default:
20 seconds. Possible values: 0 through 60,000 (1 minute). When page size
management is working to increase the number of page frames of a particular
page size, LRU page replacement is delayed for that page size for up to this
amount of time. On a heavily loaded system, increasing this tunable can give
the page size management daemons more time to create more page frames
before LRU runs.
• Basically, 64 KB page migrations can cause a deadlock between lrud and
psmd
► wlm_mem_limit_nonpg = 1
• Selects whether non-pageable page sizes (16M, 16G) are included in the
WLM realmem and virtmem counts. If 1 is selected, then non-pageable page
sizes are included in the realmem and virtmem limits count. If 0 is selected,
then only pageable page sizes (4K, 64K) are included in the realmem and
virtmem counts. This value can only be changed when WLM Memory
Accounting is off, or the change will fail.
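A sketch of displaying and changing these vmo tunables (the value is illustrative; wlm_mem_limit_nonpg can only be changed while WLM memory accounting is off):
# vmo -L psm_timeout_interval              show the current value, range and default
# vmo -o psm_timeout_interval=40000        give psmd more time before LRU starts (milliseconds)
# vmo -o wlm_mem_limit_nonpg=0             count only pageable page sizes in WLM limits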

96
© IBM Corporation 2007
What's new in AIX 5.3
97 Advanced Technical Support, Americas
IBM Global Services

New Tunables
•ioo JFS2 Sync Tunables
The file system sync operation can be problematic in situations where there is very
heavy random I/O activity to a large file. When a sync occurs all reads and writes from
user programs to the file are blocked. With a large number of dirty pages in the file the
time required to complete the writes to disk can be large. New JFS2 tunables are
provided to relieve that situation.
►j2_syncPageCount
Limits the number of modified pages that are scheduled to be written by sync in one
pass for a file. When this tunable is set, the file system will write the specified number
of pages without blocking i/o to the rest of the file. The sync call will iterate on the
write operation until all modified pages have been written.
Default: 0 (off), Range: 0-65536, Type: Dynamic, Unit: 4KB pages
►j2_syncPageLimit
Overrides j2_syncPageCount when a threshold is reached. This is to guarantee that
sync will eventually complete for a given file. Not applied if j2_syncPageCount is off.
Default: 16, Range: 1-65536, Type: Dynamic, Unit: Numeric
• If application response times impacted by syncd, try j2_syncPageCount settings from
256 to 1024. Smaller values improve short-term response times, but still result in larger
syncs that impact response times over longer intervals.
• These will likely require a lot of experimentation, and detailed analysis of IO
• Does not apply to mmap() or shmat() memory files.
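As a starting point for the experimentation described above, a sketch:
# ioo -a | grep j2_sync                    display the current settings
# ioo -o j2_syncPageCount=256              write at most 256 pages per file per sync pass
# ioo -o j2_syncPageLimit=16               iteration limit so a sync still completes (the default)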

97
© IBM Corporation 2007
What's new in AIX 5.3
98 Advanced Technical Support, Americas
IBM Global Services

New links of interest


• AIX Wiki
•Should be your first stop
•http://www-941.ibm.com/collaboration/wiki/display/WikiPtype/Home
•Links to performance information and tools such as nmon and the nmon analyzer
• Reference links for customers at developerWorks
•Overview of AIX page replacement (David Hepkin, Kernel Architect)
►Simple, clean overview of VMM
►Use as reference for lru_file_repage=0 questions
►http://www.ibm.com/developerworks/aix/library/au-vmm/index.html?S_TACT=105AGX20&S_CMP=EDU
• Optimizing AIX 6.1 Performance Tuning (Ken Milberg)
►vmo AIX 5.3 vs AIX 6.1 table
►http://www.ibm.com/developerworks/aix/library/au-aix6tuning/index.html?S_TACT=105AGX20&S_CMP=EDU
• Network Tuning Series (Ken Milberg)
►Three parts, last one posted at the end of January
►http://www.ibm.com/developerworks/aix/library/au-aixoptimization-netperform1/index.html?S_TACT=105AGX20&S_CMP=EDU

98
© IBM Corporation 2007
What's new in AIX 5.3
99 Advanced Technical Support, Americas
IBM Global Services

Thanks


99
© IBM Corporation 2007
What's new in AIX 5.3
Advanced Technical Support, Americas

References and Sources

 Cler, C., (2006). Configuring POWER5 Shared Pool Partitions Presentation


 IBM System p Advanced POWER Virtualization Best Practices, Redbook Draft, Oct. 2006.
 AIX 5L Virtualization Performance Management Course. 2006. IBM Unix Software Support Education.
 Advanced POWER Virtualization on IBM eServer p5 Servers: Architecture and Performance Considerations.
2006.
 AIX 5.3 Product Documentation. Retrieved Sept 2005, from
http://publib.boulder.ibm.com/infocenter/pseries/index.jsp?topic=/com.ibm.pseries.doc/hardware.htm
 Braden, B., (2006-2007). Disk Sizing, Data Layout and Tuning Presentation.
 Barker, R., (2006). VIO Server Guidelines Presentation.
 Smolders, L., (2006-2007). Performance Tools Presentations.
 Mathis, H.M., Mericas, A.E, McCalpin J.D., Eickemeyer R. J., and Kunkel S.R.. Characterization of
simultaneous multithreading (SMT) efficiency in POWER5. IBM Journal of Research and Development
Volume 49, Number 4/5 2005. Retrieved November 2005 from
http://www.research.ibm.com/journal/rd/494/mathis.html

100 © 2008 IBM Corporation
