Friday, May 10, 2013

AIX 6.1 TL Upgrade using multibos


Starting with AIX 5.3 Technology Level (TL) 3, the multibos utility allows an AIX administrator to create and maintain two separate, bootable instances of the AIX operating system within the same root volume group (rootvg). The second instance is known as a standby Base Operating System (BOS) and is an extremely handy tool for performing AIX TL and Service Pack (SP) updates.
Multibos lets you install, update and customize a standby instance of the AIX OS without impacting the running and active production instance of the AIX OS. This is valuable in environments with tight maintenance windows. Instead of requiring an outage window of several hours to apply a new TL or SP, you’ll only need a small outage at a convenient time to reboot the system. This may help reduce the size of the after-hours effort often required when performing AIX updates, as all the maintenance activities can be performed during business hours. After hours you could log in from home and reboot the system.
Backing out from an AIX TL update is also made easier with multibos. To go back to a previous TL, you reboot the system on the original AIX instance boot logical volume (BLV). It’s also possible to update several AIX systems at once using multibos, which again reduces the amount of after-hours effort required when performing AIX maintenance activities.
Multibos is similar to an alternate disk installation. However, there are several differences between the two methods, one of which is that there’s no need for an additional disk to clone the rootvg. Both utilities can be used to achieve the same goal. Choose the one that’s the best fit for your AIX environment.

Getting Started


Before attempting to use multibos, check that the prerequisites have been met. First, the system must be running AIX 5.3 with TL3 or higher. Next, ensure that there's enough free space in rootvg for a copy of each BOS logical volume (LV). By default, the BOS file systems in rootvg (/, /usr, /var, and /opt) and the BLV are copied. All other file systems and LVs are shared between BOS instances. Check the number of free physical partitions in rootvg (i.e., # lsvg rootvg | grep FREE). If the requirements can't be met, a traditional update should be performed instead.
Ensure that you document the system and perform a mksysb before performing any maintenance activity.
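Before starting, it is worth capturing a quick baseline of the system. A minimal pre-check sketch (output will of course differ on your system):
# oslevel -s
# lsvg rootvg | grep FREE
# lsvg -l rootvg
The first command shows the current TL and SP, the second the free physical partitions in rootvg, and the third the logical volumes that will be copied to the standby BOS.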

Steps to Upgrade using Multibos

 
1. Create a separate filesystem for the updates
2. Create a backup of the system
3. Check current TL level
4. Preview/Create the multibos standby BOS instance.
5. Preview/Patch the OS in the standby instance
6. Verify the oslevel in the standby instance after the update
7. Reboot and check
8. Optionally return to the pre-update BOS image.
 
1. Create a separate file system for the /usr/sys/inst.images folder 

This is where I will put the TL updates.
 # rm -R /usr/sys/inst.images/
 # crfs -v jfs2 -A yes -g rootvg -m /usr/sys/inst.images -a size=6G
 # mount /usr/sys/inst.images/
 # df -g /usr/sys/inst.images
 Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
 /dev/fslv00        6.00      6.00    1%        3     1% /usr/sys/inst.images

==============================================================================
 -v      FS type
 -A yes  Add an entry to /etc/filesystems for automatic mount at boot
 -g      volume group
 -m      mount point
 -a size The expected size of the new FS
==============================================================================

2. Create a backup

If something should go wrong, it is good to have a backup. The following backup is performed from a NIM master server, backing up the client (power2) that will be updated.
 #  nim -o define -t mksysb  -a server=master -a location=/export2/mksysb/P2_MK_PREU -a mk_image=yes -a source=power2 P2_MK_PREU
+---------------------------------------------------------------------+
                System Backup Image Space Information
              (Sizes are displayed in 1024-byte blocks.)
+---------------------------------------------------------------------+

Required = 2732693 (2669 MB)    Available = 5957884 (5819 MB)

Creating information file (/image.data) for rootvg.

Creating list of files to back up 

0512-038 savevg: Backup Completed Successfully.

==============================================================================
-o operation
-t type
-a attribute
   location=where the backup will reside
   mk_image= create mksysb image file
   source= NIM client name
P2_MK_PREU NIM object mksysb name
==============================================================================
Now copy the updates from wherever you have them to the folder /usr/sys/inst.images/
# cp -R -p 6100-07-00-1140 /usr/sys/inst.images/
Next, create a .toc (table of contents) file in the directory where the updates reside using the inutoc command:
# inutoc /usr/sys/inst.images/6100-07-00-1140/
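To confirm that the .toc was built and the update filesets are visible to the installer, a quick check (using the same path as above):
# installp -L -d /usr/sys/inst.images/6100-07-00-1140 | head
This lists the installable filesets found in that directory; if nothing is returned, re-run inutoc before proceeding.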

3. Check the current TL 

# oslevel -s
6100-04-04-1014

This corresponds to AIX Release 6.1, Technology Level 04, Service Pack 04, built in week 14 of 2010.

4. Preview/Create the multibos standby BOS instance

Remove any previous standby BOS instances so that we start with a clean environment.
# multibos -R
...
Return Status = SUCCESS
Create a standby BOS (Base Operating System) instance using the multibos command.
Initially, I will run a preview operation:
# multibos -Xsp
Return Status = SUCCESS

==============================================================================
-X automatically extend FS
-s create standby BOS instance
-p preview mode
==============================================================================
Now we actually run the command without preview mode. This may take a while, maybe even two.
# multibos -Xs
Initializing multibos methods ...
Initializing log /etc/multibos/logs/op.alog ...
Gathering system information ...

+-----------------------------------------------------------------------------+
Setup Operation
+-----------------------------------------------------------------------------+
Verifying operation parameters ...
Creating image.data file ...

+-----------------------------------------------------------------------------+
Logical Volumes
+-----------------------------------------------------------------------------+
Creating standby BOS logical volume hd5
Creating standby BOS logical volume hd4
Creating standby BOS logical volume hd2
Creating standby BOS logical volume hd9var
Creating standby BOS logical volume hd10opt

+-----------------------------------------------------------------------------+
File Systems
+-----------------------------------------------------------------------------+
Creating all standby BOS file systems ...
Creating standby BOS file system /bos_inst on logical volume hd4
Creating standby BOS file system /bos_inst/usr on logical volume hd2
Creating standby BOS file system /bos_inst/var on logical volume hd9var
Creating standby BOS file system /bos_inst/opt on logical volume hd10opt

+-----------------------------------------------------------------------------+
Mount Processing
+-----------------------------------------------------------------------------+
Mounting all standby BOS file systems ...
Mounting /bos_inst
Mounting /bos_inst/usr
Mounting /bos_inst/var
Mounting /bos_inst/opt

+-----------------------------------------------------------------------------+
BOS Files
+-----------------------------------------------------------------------------+
Including files for file system /
Including files for file system /usr
Including files for file system /var
Including files for file system /opt

Copying files using backup/restore utilities ...
Percentage of files copied:   0.00%
Percentage of files copied:   1.35%
...
Percentage of files copied:  98.74%
Percentage of files copied: 100.00%

+-----------------------------------------------------------------------------+
Boot Partition Processing
+-----------------------------------------------------------------------------+
Active boot logical volume is bos_hd5.
Standby boot logical volume is hd5.
Creating standby BOS boot image on boot logical volume hd5
bosboot: Boot image is 43345 512 byte blocks.

+-----------------------------------------------------------------------------+
Mount Processing
+-----------------------------------------------------------------------------+
Unmounting all standby BOS file systems ...
Unmounting /bos_inst/opt
Unmounting /bos_inst/var
Unmounting /bos_inst/usr
Unmounting /bos_inst

+-----------------------------------------------------------------------------+
Bootlist Processing
+-----------------------------------------------------------------------------+
Verifying operation parameters ...
Setting bootlist to logical volume hd5 on hdisk0.
ATTENTION: firmware recovery string for standby BLV (hd5):
boot /pci@800000020000003/pci@2,4/pci1069,b166@1/scsi@0/sd@8:4
ATTENTION: firmware recovery string for active BLV (bos_hd5):
boot /pci@800000020000003/pci@2,4/pci1069,b166@1/scsi@0/sd@8:2

Log file is /etc/multibos/logs/op.alog
Return Status = SUCCESS

5. Preview/Patch the OS in the standby instance
Preview TL update
# multibos -Xacp -l /usr/sys/inst.images/6100-07-00-1140
...
Log file is /etc/multibos/logs/op.alog
Return Status = SUCCESS

==============================================================================
-X automatically extend FS if required
-a update_all install option
-c Performs a customized update of the software in standby BOS
-p preview
-l location
==============================================================================
Now we can run the update without the -p option
# multibos -Xac -l /usr/sys/inst.images/6100-07-00-1140 

Initializing multibos methods ...
Initializing log /etc/multibos/logs/op.alog ...
Gathering system information ...

+-----------------------------------------------------------------------------+
Customization Operation
+-----------------------------------------------------------------------------+
Verifying operation parameters ...
Validating install images location /usr/sys/inst.images/6100-07-00-1140

+-----------------------------------------------------------------------------+
Mount Processing
+-----------------------------------------------------------------------------+
Mounting all standby BOS file systems ...
Mounting /bos_inst
Mounting /bos_inst/usr
Mounting /bos_inst/var
Mounting /bos_inst/opt

+-----------------------------------------------------------------------------+
Software Update
+-----------------------------------------------------------------------------+
Installing software to standby BOS ...

install_all_updates: Initializing system parameters.
install_all_updates: Log file is /var/adm/ras/install_all_updates.log
install_all_updates: Checking for updated install utilities on media
......

installp:  * * * A T T E N T I O N ! ! !
        Software changes processed during this session require
        any diskless/dataless clients to which this SPOT is
        currently allocated to be rebooted. 

install_all_updates: Checking for recommended maintenance level 6100-07.
install_all_updates: Executing /usr/bin/oslevel -rf, Result = 6100-05
install_all_updates: ATTENTION, the system recommended maintenance level
does not correspond to the highest level known to install_all_updates.
For more details, execute /usr/bin/oslevel -rl 6100-07.

install_all_updates: Log file is /var/adm/ras/install_all_updates.log
install_all_updates: Result = SUCCESS

+-----------------------------------------------------------------------------+
Boot Partition Processing
+-----------------------------------------------------------------------------+
Active boot logical volume is bos_hd5.
Standby boot logical volume is hd5.
Creating standby BOS boot image on boot logical volume hd5
Could not load program ls:
Symbol resolution failed for ls because:
        Symbol time64 (number 40) is not exported from dependent
          module /usr/lib/multibos_chroot/usr/ccs/lib/libc.a(shr.o).
Examine .loader section symbols with the 'dump -Tv' command.
/usr/sbin/bosboot[16]: kernsize /= 1024: bad number
multibos: 0565-039 Error processing standby BOS boot image.
Return Status: FAILURE
The filesets were updated fine, but multibos failed to create a boot image on hd5.
The error "Symbol resolution failed for ls because: Symbol time64 (number 40) is not exported from dependent module /usr/lib/multibos_chroot/usr/ccs/lib/libc.a(shr.o)." is addressed in APAR IV03737.
==============================================================================
IV03737 Abstract: symbol resolution failure creating standby BOS boot 

IV03737 Symptom Text:
 Doing multibos with updates can fail:
 Creating standby BOS boot image on boot logical volume
 bos_hd5
 Could not load program ls:
 Symbol resolution failed for ls because:
 Symbol time64 (number 40) is not exported from dependent
 module /usr/lib/multibos_chroot/usr/ccs/lib/libc.a(shr.o).
 Examine .loader section symbols with the 'dump -Tv' command.
 /usr/sbin/bosboot[16]: kernsize /= 1024: 0403-009 The
 specified number is not valid for this command.
 multibos: 0565-039 Error processing standby BOS boot image.
 multibos: 0565-035 Error setting up standby BOS.

==============================================================================
In order to fix this issue with multibos, it is necessary to apply the APAR fixes on the live system (not the standby BOS instance). Afterwards, verify that the APAR fixes have been applied. A system reboot is required after the APAR fixes have been installed.
# inutoc /usr/sys/inst.images/IV03737/

# install_all_updates -d /usr/sys/inst.images/IV03737/ -cxY
==============================================================================
-d Device
-c Commit all
-x Expand FS if necessary
-Y Agree to all SW licenses
==============================================================================
# instfix -ik IV03737
    All filesets for IV03737 were found.
Now, to start from a fresh environment after the APAR fixes have been installed, remove and recreate the multibos standby BOS instance.
# multibos -R

# multibos -Xs
Finally, return to the initial task of updating the system in the standby BOS:
# multibos -Xac -l /usr/sys/inst.images/6100-07-00-1140/
Initializing multibos methods ...
Initializing log /etc/multibos/logs/op.alog ...
Gathering system information ...

+-----------------------------------------------------------------------------+
Customization Operation
+-----------------------------------------------------------------------------+
Verifying operation parameters ...
Validating install images location /usr/sys/inst.images/6100-07-00-1140/

+-----------------------------------------------------------------------------+
Mount Processing
+-----------------------------------------------------------------------------+
Mounting all standby BOS file systems ...
Mounting /bos_inst
Mounting /bos_inst/usr
Mounting /bos_inst/var
Mounting /bos_inst/opt

+-----------------------------------------------------------------------------+
Software Update
+-----------------------------------------------------------------------------+
Installing software to standby BOS ...

install_all_updates: Initializing system parameters.
install_all_updates: Log file is /var/adm/ras/install_all_updates.log
install_all_updates: Checking for updated install utilities on media.
install_all_updates: Processing media.
install_all_updates: Generating list of updatable installp filesets.

...
...
...

install_all_updates: Checking for recommended maintenance level 6100-07.
install_all_updates: Executing /usr/bin/oslevel -rf, Result = 6100-07
install_all_updates: Verification completed.
install_all_updates: Log file is /var/adm/ras/install_all_updates.log
install_all_updates: Result = SUCCESS

+-----------------------------------------------------------------------------+
Boot Partition Processing
+-----------------------------------------------------------------------------+
Active boot logical volume is bos_hd5.
Standby boot logical volume is hd5.
Creating standby BOS boot image on boot logical volume hd5
bosboot: Boot image is 49180 512 byte blocks.

+-----------------------------------------------------------------------------+
Mount Processing
+-----------------------------------------------------------------------------+
Unmounting all standby BOS file systems ...
Unmounting /bos_inst/opt
Unmounting /bos_inst/var
Unmounting /bos_inst/usr
Unmounting /bos_inst

Log file is /etc/multibos/logs/op.alog
Return Status = SUCCESS

6. Verify the oslevel in the standby instance after the update

We can now enter the standby BOS instance (similar to a chroot) and check the oslevel:
# multibos -S
Initializing multibos methods ...
Initializing log /etc/multibos/logs/op.alog ...
Gathering system information ...

+-----------------------------------------------------------------------------+
Multibos Shell Operation
+-----------------------------------------------------------------------------+
Verifying operation parameters ...

+-----------------------------------------------------------------------------+
Mount Processing
+-----------------------------------------------------------------------------+
Mounting all standby BOS file systems ...
Mounting /bos_inst
Mounting /bos_inst/usr
Mounting /bos_inst/var
Mounting /bos_inst/opt

+-----------------------------------------------------------------------------+
Multibos Root Shell
+-----------------------------------------------------------------------------+
Starting multibos root shell ...
Active boot logical volume is bos_hd5.
Script started, file is /etc/multibos/logs/scriptlog.120204105327.txt
MULTIBOS> oslevel -s
6100-07-01-1141
MULTIBOS>exit
Verify the bootlist order:
# bootlist -m normal -o
hdisk0 blv=hd5 pathid=0
hdisk0 blv=bos_hd5 pathid=0

7. Reboot and check

# oslevel -s
6100-07-01-1141
# instfix -i | grep ML
    All filesets for 6100-00_AIX_ML were found.
    All filesets for 6100-01_AIX_ML were found.
    All filesets for 6100-02_AIX_ML were found.
    All filesets for 6100-03_AIX_ML were found.
    All filesets for 6100-04_AIX_ML were found.
    All filesets for 6100-05_AIX_ML were found.
    All filesets for 6100-06_AIX_ML were found.
    All filesets for 6100-07_AIX_ML were found.
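
Once you are satisfied with the new level and are certain you will not need to fall back (see step 8 below), the pre-update instance can optionally be removed to reclaim space in rootvg. A minimal sketch, run from the newly booted instance:
# oslevel -s
# multibos -R
The oslevel check confirms that you are running the updated instance; multibos -R then removes the standby (pre-update) BOS logical volumes and file systems.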


8. Optionally return to the pre-update BOS image.

If necessary, you can return to the initial TL; just change the boot order and reboot.
 # bootlist -m normal -o
 hdisk0 blv=hd5 pathid=0
 hdisk0 blv=bos_hd5 pathid=0

 # bootlist -m normal hdisk0 blv=bos_hd5 hdisk0 blv=hd5

 # bootlist -m normal -o
 hdisk0 blv=bos_hd5 pathid=0
 hdisk0 blv=hd5 pathid=0

 # oslevel -s
 6100-07-01-1141
Reboot the server and check with oslevel.
 # oslevel -s
 6100-04-11-1140
So we are now back on the initial TL.

Thursday, May 9, 2013

Performance Tuning -- VMM


Just what is swap (paging) space? It all starts with the VMM. VMM uses swap space (paging) as a holding bin for a process that is not using active RAM. Because of its purpose, it is a critical component of overall system performance. As an administrator, you need to know how to monitor and tune your paging parameters. The paging space itself is a special logical volume that stores the information that is currently not accessed. You must make sure that your system has adequate paging space. If the paging space is too low, entire processes can be lost and the system can crash when your space fills up. Though it is important to reiterate that paging is a normal part of VMM, it is even more important you really understand how the kernel brings the process into RAM—too much paging definitely hinders performance. AIX, through tight integration of the kernel and VMM, makes use of a methodology called demand paging. In fact, most of the kernel itself resides in virtual memory, which helps free up segments for other processes. I'll dig deeper into how this works and discuss some of the tools you need to use to manage and tune your paging space.

You will find that the tuning you do is based on what type of system you have. For example, systems that are using an Oracle Online Transaction Processing (OLTP) type of database usually have specific recommendations on how much swap space to configure and how to tune the paging parameters. As discussed in previous installments of the series (see Resources), you cannot really tune your paging settings unless you really know what is going on in the host system. You need to understand the tools to use, how best to analyze the data that you will be capturing, and familiarize yourself with best practices for implementing your paging space. It has been my experience that the number one cause for a system crash is running out of paging space. If you read this article carefully and follow its recommendations, this should never happen to you. Obviously, you never want your system to crash but, if it does, you want it to be due to a hardware failure and nothing that you did or forgot to do as the systems administrator.


In this section, I provide an overview of how AIX handles paging, define swapping and paging, and drill down into the different modes of paging space allocation. These concepts help you understand subsequent sections on monitoring, configuring, and tuning.

Most administrators think of paging as something that is onerous. Paging is actually a very normal part of what AIX does, due to the tight integration of its kernel with the VMM and its implementation of demand paging. The way demand paging works is that the kernel only loads a few pages at a time into real memory. When the CPU is ready for another page, it looks at the RAM. If it cannot find it there, a page fault occurs, and this signals the kernel to bring more pages into RAM from disk. One advantage of demand paging is that the paging space does not have to be particularly large, because data is constantly being shuffled between paging space and RAM. On older UNIX® systems, paging was preallocated to disks, whether they were used or not. This caused a condition where disk space would be allocated that was never used. Demand paging, in essence, avoids the condition where this disk space is allocated for no purpose. Swapping of processes is kept to a minimum, because many more jobs can be stored in RAM. This is true, because only parts of processes (pages) are stored in RAM.

What about swapping? Though often used interchangeably, there is a subtle difference between paging and swapping. As discussed, only parts of the process are moved back and forth between disk and RAM with paging. When swapping occurs, you are moving entire processes back and forth. For this to happen, AIX suspends the entire process prior to moving it to paging space. The process can then only continue once it is swapped back into RAM at a later time. This is not good, and you should do everything you can to prevent swapping from occurring, as it can lead to another condition called thrashing (I'll get into this more later).

As a UNIX administrator, you are probably already aware of some of the concepts of paging and swapping. AIX provides three different modes of paging space allocation: deferred page space allocation, late page space allocation, and early page space allocation. The default policy of AIX is deferred page space allocation. This works by making sure that the allocation of paging space is delayed until the time that it is necessary to page out the page, which ensures that there is no wasted paging space. In fact, when you have a large amount of RAM, you might actually never even use any of your paging space (see Listing 1).

Listing 1. Ensuring that there is no wasted paging space

                
# lsps -a

Page Space      Physical Volume   Volume Group    Size %Used Active  Auto  Type
hd6             hdisk0            rootvg        4096MB     1     yes   yes lv   


Only one percent of paging space is used in Listing 1.

Let's view how AIX is currently handling paging space allocation (see Listing 2).

Listing 2. Checking how AIX is handling paging space allocation

                
# vmo -a | grep def
  defps = 1


Listing 2 illustrates that the default method, deferred page space allocation, is being used. To disable this policy, you need to set the parameter to 0, which causes the system to use the late page space allocation policy. Late page space allocation causes paging disk blocks not to be allocated until the corresponding pages in RAM are touched. This method is usually intended for environments where optimum performance is more important than reliability. In the scenario presented here, a program can fail due to the lack of memory. What about early page space allocation? This policy is usually used if you want to make certain that processes will not be killed because of low paging conditions. Early page space allocation preallocates paging space. This is the opposite end of the spectrum from late page space allocation. It is used in environments where reliability rules. The way to turn this on is to set the PSALLOC environment variable to early (PSALLOC=early).
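A hedged sketch of switching between these policies (defps may be a restricted tunable on newer AIX levels, so treat this as illustrative and check your defaults first):
# vmo -o defps=0
# export PSALLOC=early
The first command switches the system from deferred to late page space allocation; the second enables early allocation only for processes started from the current shell.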

You should also be aware of the garbage collection feature first introduced in AIX Version 5.3. This allows you to free up paging-space disk blocks, which allows you to configure less paging space than you would ordinarily need. This feature is only available for the default deferred page space allocation policy.


In this section, I'll show you how to monitor the paging space on your system. I'll also discuss the various commands used for configuring paging space and other tools that help you work with paging space as a systems administrator.

The simplest way of determining the amount of paging space used on your system is by running the lsps command (see Listing 3).

Listing 3. Running the lsps command

                
# lsps -s
Total Paging Space   Percent Used
      4096MB               1%


You looked earlier at the -a flag. I prefer using the -s flag because, while the -a flag shows only paging space that is being used, the -s flag gives you a summary of all paging space allocated, including space allocated using early page space allocation. Of course, this only applies if the default method of paging allocation was turned off.

Next on the plate is vmstat. Part 2 of this series discussed vmstat, one of my favorite VMM monitoring tools, in great detail. I find that it is the quickest way to determine what is going on in your system. If there is a lot of paging and thrashing going on, you will find it here.

Let's look at some output shown in Listing 4.

Listing 4. Using vmstat

                
# vmstat 1 5

System Configuration: lcpu=2 mem=4096MB
kthr     memory             page              faults        cpu
----- ----------- ------------------------ ------------ -----------
r  b   avm   fre  re   pi  po  fr   sr    cy    in   sy  cs  us sy id wa
1  0 166512  627    0   0   1  0   92    0 277  3260 278   3  1  96  0
1  0 166512  623    0   0   1  0   40    0 253  2260 108   2  1  96  1
1  0 166512  627    0   0   0  0   0     0 248  3343  91   0  1  96  2
1  0 166512  627    0   0   0  0   2     0 247  3164  84   0  1  99  0
1  0 166512  627    0   1   0  0   0     0 277  3260  83   2  1  97  0


The columns most meaningful for your purposes here are:

  • avm —This column represents the amount of active virtual memory (in 4k pages) you are using, not including file pages.
  • fre —This column represents the size of your memory free list. In most cases, I don't worry when this is small, as AIX loves using every last drop of memory and does not return it as fast as you might like. This setting is determined by the minfree parameter of the vmo command. At the end of the day, the paging information is more important.
  • pi —This column represents the pages paged in from the paging space.
  • po —This column represents the pages paged out to the paging space.

As you can see in Listing 4, there is essentially no paging going on in the system.

Listing 5 shows an example of a system that is probably thrashing.

Listing 5. Possible thrashing system

                
# vmstat 2 3

System Configuration: lcpu=4 mem=4096MB
kthr     memory             page              faults        cpu
----- ----------- ------------------------ ------------ -----------
r  b   avm   fre  re   pi  po  fr   sr    cy    in   sy  cs  us sy id wa
1  2 166512  7    0    57 127  0   929    0 2779 3260 1278 3 30  50  0 20
1  5 166512  12   0    39 129  0   409    0 2538 2260 1108 2 10  30 10 50
1  6 166512  110  0     8 212  0   480    0 2487 3343 991  0 27  33 20 30


How can you tell this? First of all, look at the po column. This signifies that pages are consistently being moved back and forth between disk and RAM. You should also see a bottleneck on your system, as the blocked processes and wait times are abnormally high. The freelist is also lower than it should be. In looking at the freelist with the vmo command, you determined that the number was 120. This means that this number should not be falling below the 120 mark. Ordinarily, I would say it is not a problem when your freelist is low but, in this case, it is below where it should be. When this occurs, it usually signifies that thrashing is going on in your system. A classic sign of thrashing is when the operating system attempts to release resources by first warning processes to release paging space and then killing entire processes. In tuning vmo parameters, you can help set the thresholds when thrashing starts. You can also look at memory usage with either topas or nmon. Both of these utilities graphically display the paging in a more user-friendly format (see Listing 6).

Listing 6. Paging displayed graphically using topas

                
Topas Monitor for host:    testbox               EVENTS/QUEUES    FILE/TTY
Sun May 20 11:48:42 2007   Interval:  2         Cswitch      86  Readch    90043
                                                Syscall    1173  Writech    1336
Kernel    0.5   |#                           |  Reads       103  Rawin         1
User      0.0   |                            |  Writes       91  Ttyout      157
Wait      0.0   |                            |  Forks         0  Igets         0
Idle     99.5   |############################|  Execs         0  Namei       147
                                                Runqueue    0.0  Dirblk        0
Network  KBPS   I-Pack  O-Pack   KB-In  KB-Out  Waitqueue   0.0
en1       1.6      4.0     4.0     0.2     1.4
en2       0.0      0.0     0.0     0.0     0.0  PAGING           MEMORY
lo0       0.0      0.0     0.0     0.0     0.0  Faults        0  Real,MB    4095
                                                Steals        0  % Comp     16.6
Disk    Busy%     KBPS     TPS KB-Read KB-Writ  PgspIn        0  % Noncomp  84.3
hdisk0    0.0      0.0     0.0     0.0     0.0  PgspOut       0  % Client    0.5
hdisk1    0.0      0.0     0.0     0.0     0.0  PageIn        0
hdisk3    0.0      0.0     0.0     0.0     0.0  PageOut       0  PAGING SPACE
                                                Sios          0  Size,MB    4096
Name            PID  CPU%  PgSp Owner                            % Used      0.5
topas        156220   0.2   2.5 root            NFS (calls/sec)  % Free     99.4
sldf          96772   0.2   0.2 rds             ServerV2       0
syncd         12458   0.0   0.6 root            ClientV2       0   Press:
lrud           9030   0.0   0.0 root            ServerV3       0   "h" for help
gil           10320   0.0   0.1 root            ClientV3       0   "q" to quit


The PAGING column (shown in bold in Listing 6) shows that there is no paging going on at all.

What about maintaining the size of your paging space? You do this with the swap command (see Listing 7) in AIX.

Listing 7. Using the swap command

                
# swap -l
device              maj,min     total       free
/dev/hd6            10,  2      4096MB      4093MB


This tells you that you have one swap partition defined. You'll also notice that only 3MB are actually being used. Listing 8 shows what happens if your paging space utilization is too high.

Listing 8. Running out of paging space

                
# lsps -a

Page Space      Physical Volume   Volume Group    Size %Used Active  Auto  Type
hd6             hdisk0            rootvg        4096MB    78    yes   yes   lv   


In this case, your paging space is starting to get dangerously low. It is possible that your system has been up for a very long time. If you are running a database such as Oracle, virtual memory does not get released until you recycle your database. Let's see how long your system has been up (see Listing 9).

Listing 9. Using the uptime command

                
# uptime
  11:58AM   up 9 days,  15:50,  23 users,  load average: 0.00, 0.03, 0.04


As shown in Listing 9, the system has been up for only nine days. If the paging space utilization has increased to 78 percent in such a short amount of time, you should consider adding more paging space. If you have plenty of space on your system, I would add another partition.

One best practice to keep in mind is to keep your paging spaces at the same size. In this case, I would add another 4GB of paging space to your rootvg volume. You can do this with the System Management Interface Tool (SMIT), using the smit mkps fastpath to create the paging space and smit swapon to activate it. Alternatively, you can use the swapon (and swapoff) commands from the command line. If you can, use disks that are least used for paging areas. Also try not to allocate more than one paging logical volume for each physical disk. Though some administrators don't mind putting paging space on external storage, I personally don't like that practice. If you do this and the external storage is not available on a reboot, your system might crash (depending upon the amount of space allocated to paging). If you can, spread them across multiple platters and, of course, make sure they are online by using the lsps -a command.
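If you prefer the command line to SMIT, here is a sketch of adding a second paging space; hdisk1 and the partition count are assumptions, so check your physical partition size first:
# lsvg rootvg | grep "PP SIZE"
# mkps -s 32 -n -a rootvg hdisk1
# lsps -a
With a 128MB partition size, 32 logical partitions gives a 4GB paging space; -n activates it immediately and -a activates it at every restart.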

How much paging space do you need on your system? What is the rule of thumb? First, start with the folks that own your application. The DB2® or Oracle teams should be able to tell you how much paging space needs to be allocated on your system from a database perspective. If you are a small shop, you'll have to do the research on your own. Be careful, though. Database administrators usually like to request the highest number of everything and might instruct you to double the amount of paging space as your RAM (the old rule of thumb). Generally speaking, if my system has greater than 4GB of RAM, I usually like to create a one-to-one ratio of paging space versus RAM. Monitor your system frequently after going live. If you see that you are never really approaching 50 percent of paging space utilization, don't add the space. A quick look at the recent Oracle for AIX documentation (see Resources) confirms this principle. It states that the recommended initial setting for paging space be half the size of RAM plus 4GB with an upper limit of 32GB. It recommends monitoring space with the lsps -a command and not to worry unless the utilization is over 25 percent on the system. Adding additional space that you won't use gives you absolutely nothing extra.
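To make the guideline concrete: a server with 16GB of RAM would start with roughly 8GB + 4GB = 12GB of paging space under that recommendation, and you would only grow it if lsps -a regularly reported utilization above the 25 percent mark.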

I'm often asked how can you tell if a process is using paging space? Take a look at svmon, as shown in Listing 10.

Listing 10. Using svmon

                
# svmon -P | grep -p 17602
-------------------------------------------------------------------------------
     Pid Command          Inuse      Pin     Pgsp  Virtual 64-bit Mthrd LPage
   17602 sendmail         11877     3211        0    11691      N     N     N


After identifying the PID, you can use svmon to drill down to this level. This can help you determine whether tuning needs to be done to your application to help stop the paging, or whether to tune your operating system. Do a man on svmon, as there are many other uses for this AIX memory-specific utility.


In this section, I use vmo to tune paging parameters that can significantly reduce the amount of paging on your systems. I also discuss thresholds to change and parameters that can influence your overall scanning overhead.

So what can you tune on VMM to cut down on paging? In the first installment of the series (see Resources), I discussed the minperm and maxperm parameters in great detail, and I'll summarize some of the most important concepts here. Tuning vmo settings allows you to favor either working or persistent storage. You want it to favor working storage. The way to prevent AIX from paging working storage, and to utilize the caching from your database, is to set maxperm to a high value (greater than 80) and to make sure that lru_file_repage=0. The lru_file_repage parameter indicates whether the VMM re-page counts should be considered and what type of memory it should steal. The default setting is 1, so you need to change it to 0 using the vmo command. When you set the parameter to 0, it tells the VMM that you prefer that it steal only file pages rather than computational pages. This is what you want to do. You also need to set the minperm, maxperm, and maxclient parameters, as shown in Listing 11 below.

Listing 11. Setting the minperm, maxperm and maxclient parameters

                
vmo -p -o minperm%=5
vmo -p -o maxperm%=90
vmo -p -o maxclient%=90


In prior AIX versions, you would tune strict_maxperm and strict_maxclient from their default numbers. With AIX Version 5.3, changing the lru_file_repage parameter is a far more effective way of tuning, as you would prefer AIX file caching not be used at all. Now let's briefly summarize minfree and maxfree. If the number of pages on your free list falls below the minfree parameter, VMM starts to steal pages until the free list has at least the amount of pages in the maxfree parameter. The default settings in AIX Version 5.3 usually seem to work (see Listing 12).

Listing 12. Default settings for maxfree and minfree

                
# vmo -a | grep free
              maxfree = 1088
              minfree = 960


Let's discuss tuning page space thresholds. As stated earlier, when your paging space becomes very low, the system first warns offending processes and then kills them. What thresholds can you change here to influence this activity? They are npswarn, npskill, and nokilluid. Npswarn is the threshold used to signal processes when space is getting low. Npskill is the threshold at which AIX starts killing processes. If your policy is early page space allocation, it will not kill the process; if you recall, I discussed earlier that this was the most reliable method of paging. Nokilluid is an important threshold because, if it is set to 1, it makes certain that processes owned by root will not be killed, even when the npskill threshold is reached.
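A sketch of inspecting and adjusting these thresholds (the values here are illustrative, not recommendations):
# vmo -a | egrep "npswarn|npskill|nokilluid"
# vmo -p -o nokilluid=1
Setting nokilluid to 1 exempts processes owned by root (UID 0) from being killed when the npskill threshold is reached.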

Further, when a process cannot be forked because of a paging space issue, the scheduler retries the fork up to five times, delaying 10 clock ticks before each retry. You can use the schedo command to increase or decrease the number of retries; the parameter used for this is pacefork. Another important parameter you can look at is lrubucket. Tuning this can reduce the scanning overhead. Because the page replacement algorithm is always looking for free frames while it is scanning, on systems with a lot of memory the number of frames to scan can be significant. Increasing the bucket size decreases the number of buckets that need to be scanned, which can help performance. Listing 13 uses the vmo command with the -a option to display the value of lrubucket.

Listing 13. Displaying the value for lrubucket

                
# vmo -a | grep lru
      lru_file_repage = 1
    lru_poll_interval = 0
            lrubucket = 131072 (this is in 4 KB frames)


To increase the default value from 512MB to 1GB, use # vmo -o lrubucket=262144.

And that's how you can significantly reduce paging on your AIX system using vmo.


Part 3 of this series looked at some of the tools that are available to you in capturing data for swap analysis. You used some system administration commands to display and configure swap on your system, and learned about paging and swapping and the various methods of paging that are available on AIX. You also reviewed some best practices when configuring paging space on your systems. Finally, you studied specific methods of tuning your VMM specific to handle paging and swapping. Parts 1 and 2 of this series went over the VMM in great detail and covered troubleshooting memory bottlenecks. You used various tools to help you monitor your systems for both short-term analysis and long-term trending. You also learned all about the general tuning methodology and the importance of monitoring systems prior to bottlenecks occurring. This enables you to establish a baseline while your system is healthy so that you can practice some of the methods discussed in this series, which include tuning your memory subsystems. Just make sure you test them on your development or test environments prior to deploying any changes to production.

Performance Tuning -- DISK


A critical component of disk I/O tuning involves implementing best practices prior to building your system. Because it is much more difficult to move things around when you are already up and running, it is extremely important that you do things right the first time when planning your disk and I/O subsystem environment. This includes the physical architecture, logical disk geometry, and logical volume and file system configuration.

When a system administrator hears that there might be a disk contention issue, the first thing he or she turns to is iostat. iostat, the equivalent of using vmstat for your memory reports, is a quick and dirty way of getting an overview of what is currently happening on your I/O subsystem. While running iostat is not an inappropriate kneejerk reaction at all, the time to start thinking about disk I/O is long before tuning becomes necessary. All the tuning in the world will not help if your disks are not configured appropriately for your environment from the get-go. Further, it is extremely important to understand the specifics of disk I/O and how it relates to AIX® and your System p™ hardware.

When it comes to disk I/O tuning, generic UNIX® commands and tools help you much less than specific AIX tools and utilities that have been developed to help you optimize your native AIX disk I/O subsystem. This article defines and discusses the AIX I/O stack and correlates it to both the physical and logical aspects of disk performance. It discusses direct, concurrent, and asynchronous I/O: what they are, how to turn them on, and how to monitor and tune them. It also introduces some of the long-term monitoring tools that you should use to help tune your system. You might be surprised to hear that iostat is not one of the tools recommended to help you with long-term gathering of statistical data.

Finally, this article continues to emphasize the point that regardless of which subsystem you are looking to tune, systems tuning should always be thought of as an ongoing process. The best time to start monitoring your systems is when you have first put your system in production and it is running well, rather than waiting until your users are screaming about slow performance. You really need a baseline of what the system looked like when it was behaving normally in order to analyze data when it is presumably not performing adequately. When making changes to your I/O subsystem, make these changes one at a time so that you will be in a position to really assess the impact of each change. In order to assess that impact, you'll be capturing data using one of the long-term monitoring tools recommended in this article.


This section provides an overview of disk I/O as it relates to AIX. It discusses the physical aspects of I/O (device drives and adapters), the AIX I/O stack, and concepts such as direct, concurrent, and asynchronous I/O. The concept of I/O pacing is introduced, along with recent improvements to iostat, to help you monitor your AIX servers.

It shouldn't surprise you that the slowest operation for running any program is the time actually spent retrieving data from disk. This all comes back to the physical component of I/O. The disk arms must find the correct cylinder, the controller needs to access the correct blocks, and the disk heads have to wait while the blocks rotate to them. The physical architecture of your I/O system should be understood prior to any tuning activities, as all the tuning in the world won't help a poorly architected I/O subsystem that consists of slow disks or inefficient use of adapters.

Figure 1 illustrates how tightly the physical I/O components are integrated with the logical disk and its application I/O. This is what is commonly referred to as the AIX I/O stack.

Figure 1. The AIX I/O stack

You need to be cognizant of all the layers when tuning, as each impacts performance in a different way. When first setting up your systems, start from the bottom (the physical layer) as you configure your disk, the device layer, its logical volumes, file systems, and the files and application. I can't emphasize enough the importance of planning your physical storage environment. This involves determining the amount of disk, type (speed), size, and throughput. One important challenge with storage technology to note is that while the storage capacity of disks is increasing dramatically, the rotational speed of the disk increases at a much slower pace. You must never lose sight of the fact that while RAM access takes about 540 CPU cycles, disk access can take 20 million CPU cycles. Clearly, the weakest link on a system is the disk I/O storage system, and it's your job as the system administrator to make sure it doesn't become even more of a bottleneck. As alluded to earlier, poor layout of data affects I/O performance much more than any tunable I/O parameter. Looking at the I/O stack helps you to understand this, as Logical Volume Manager (LVM) and disk placement are closer to the bottom than the tuning parameters (ioo and vmo).

Now let's discuss some best practices of data layout. One important concept is making sure that your data is evenly spread across your entire physical disk. If your data resides on only a few spindles, what is the point exactly of having multiple logical unit numbers (LUNs) or physical disks? If you have a SAN or another type of storage array, you should try to create your arrays of equal size and type. You should also create them with one LUN for each array and then spread all your logical volumes across all the physical volumes in your Volume Group (see the mklv sketch below). As stated previously, the time to do this is when you first configure your system, as it is much more cumbersome to fix I/O problems than memory or CPU problems, particularly if it involves moving data around in a production environment. You also want to make certain that your mirrors are on separate disks and adapters. Databases pose separate, unique challenges so, if possible, your indexes and redo logs should also reside on separate physical disks. The same is true for temporary tablespaces often used for performing sort operations.

Back to the physical side: using high-speed adapters to connect the disk drives is extremely important, but you must make certain that the bus itself does not become a bottleneck. To prevent this from happening, spread the adapters across multiple buses. At the same time, do not attach too many physical disks or LUNs to any one adapter, as this also significantly impacts performance. The more adapters that you configure, the better, particularly if there are large amounts of heavily utilized disk. You should also make sure that the device drivers support multi-path I/O (MPIO), which allows for load balancing and availability of your I/O subsystem.
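As a sketch of what spreading a logical volume across the spindles looks like in practice (the volume group, logical volume name, and size are assumptions):
# mklv -y datalv01 -e x datavg 100
# lslv -l datalv01
The -e x flag tells the LVM to use the maximum inter-physical volume allocation range, and lslv -l confirms how the logical partitions were distributed across the physical volumes.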


Let's return to some of the concepts mentioned earlier, such as direct I/O. What is direct I/O? First introduced in AIX Version 4.3, this method of I/O bypasses the Virtual Memory Manager (VMM) and transfers data directly to disk from the user's buffer. Depending on your type of application, it is possible to see improved performance when implementing this technique. For example, files that have poor cache utilization are great candidates for direct I/O. Direct I/O also benefits applications that use synchronous writes, as these writes have to go to disk. CPU usage is reduced because the double copy of data is eliminated; this copy normally occurs when data is first copied from disk into the buffer cache and then again from the cache to the user's buffer. One of the major performance costs of direct I/O is that while it can reduce CPU usage, it can also result in processes taking longer to complete for smaller requests. Note that this applies to persistent segments (files that have a permanent location on disk). When a file is not accessed through direct I/O with the IBM Enhanced Journaled File System for AIX 5L™ (JFS2), the file is cached as local pages and the data is copied into RAM. Direct I/O, in many ways, gives you performance similar to using raw logical volumes, while still keeping the benefits of having a JFS file system (for example, ease of administration). When mounting a file system using direct I/O, you should avoid large-file-enabled JFS file systems.
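As with the concurrent I/O example further below, direct I/O can be enabled at mount time; a minimal sketch (the mount point is just an example):
# mount -o dio /u
All files in a file system mounted this way are then accessed with direct I/O.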


What about concurrent I/O? First introduced in AIX Version 5.2, this feature invokes direct I/O, so it has all the performance considerations associated with direct I/O. With standard direct I/O, inodes (data structures associated with a file) are locked to prevent a condition where multiple threads might try to change the contents of a file simultaneously. Concurrent I/O bypasses the inode lock, which allows multiple threads to read and write data concurrently to the same file. This is due to the way in which JFS2 is implemented with a write-exclusive inode lock, allowing multiple users to read the same file simultaneously. As you can imagine, this inode locking can cause major problems with databases that continuously read from and write to the same files. Concurrent I/O solves this problem, which is why it's known as a feature used primarily for relational databases. Similar to direct I/O, you can implement this either through an open system call or by mounting the file system, as follows:
# mount -o cio /u




When you mount the file system with this command, all its files use concurrent I/O. Even more so than using direct I/O, concurrent I/O provides almost all the advantages of using raw logical volumes, while still keeping the ease of administration available with file systems. Note that you cannot use concurrent I/O with JFS (only JFS2). Further, applications that might benefit from having a file system read ahead or high buffer cache hit rates might actually see performance degradation.


What about asynchronous I/O? Synchronous and asynchronous I/O refer to whether an application waits for the I/O to complete before continuing processing. Appropriate usage of asynchronous I/O can significantly improve the performance of writes on the I/O subsystem. The way it works is that it essentially allows an application to continue processing while its I/O completes in the background. This improves performance because it allows I/O and application processing to run at the same time. Turning on asynchronous I/O really helps in database environments. How can you monitor asynchronous I/O server utilization? Both iostat (AIX Version 5.3 only) and nmon can monitor asynchronous I/O server utilization. Prior to AIX Version 5.3, the only way to determine this was with the nmon command. The standard command for determining the number of asynchronous I/O (legacy) servers configured on your system is:
pstat -a | egrep ' aioserver' | wc -l 




The iostat -A command reports back asynchronous I/O statistics (see Listing 1).

Listing 1. iostat -A command

                
# iostat -A

System configuration: lcpu=2 drives=3 ent=0.60 paths=4 vdisks=4                 
                                                                                
aio: avgc avfc maxgc maxfc maxreqs avg-cpu: % user % sys % idle % iowait physc % entc
       0   0    32    0      4096            6.4    8.0    85.4    0.2    0.1    16.0
                                                                                
Disks:        % tm_act     Kbps      tps    Kb_read   Kb_wrtn
hdisk0           0.5    2.0       0.5       0         4                   
hdisk1           1.0    5.9       1.5       8         4                   
hdisk2           0.0    0.0       0.0       0         0


What does this all mean?

  • avgc: This reports the average global asynchronous I/O requests per second over the interval you specified.
  • avfc: This reports the average fastpath request count per second over your interval.
  • maxgc: This reports the maximum global asynchronous I/O request count since the last time this value was fetched.
  • maxfc: This reports the maximum fastpath request count since the last time this value was fetched.
  • maxreqs: This is the maximum number of asynchronous I/O requests allowed.

How many should you configure? The rule of thumb is to set the maximum number of servers to ten times the number of disks or ten times the number of processors. MinServers would be set at one half of that amount. Other than having some extra kernel processes hanging around that don't get used (consuming a small amount of kernel memory), there is little risk in oversizing MaxServers, so don't be afraid to bump it up. How is this done? You can use either the chdev command or the smit fastpath:
# smit aio (or smit posixaio)




This is also how you would enable asynchronous I/O on your system.

To increase your maxservers to 100 from the command line, use this command:
# chdev -l aio0 -a maxservers=100




Note that you must reboot before this change takes effect. On occasion, I'm asked what the difference is between aio and posixaio. The major difference between the two involves different parameter passing, so you really need to configure both.

One last concept is I/O pacing. This is an AIX feature that prevents disk I/O-intensive applications from flooding the CPU and disks. Appropriate usage of disk I/O pacing helps prevent programs that generate very large amounts of output from saturating the system's I/O and causing system degradation. Tuning the maxpout and minpout helps prevent threads performing sequential writes to files from dominating system resources.
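Globally, the high and low water marks are attributes of sys0; a hedged sketch of viewing and changing them (the values shown are illustrative only, not recommendations):
# lsattr -El sys0 -a maxpout -a minpout
# chdev -l sys0 -a maxpout=33 -a minpout=24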

You can also limit the effect of the global parameters by mounting file systems with an explicit 0 for minpout and maxpout:
# mount -o minpout=0,maxpout=0 /u





This section provides an overview of the AIX-specific tools (sar, topas, and nmon) available to monitor disk I/O activity. These tools allow you to quickly troubleshoot a performance problem and capture data for historical trending and analysis.

Don't expect to see iostat in this section, as iostat is a UNIX utility that allows you to quickly determine if there is an imbalanced I/O load between your physical disks and adapters. Unless you decide to write your own scripting tools using iostat, it will not help you with long-term trending and capturing data.

sar is one of those older generic UNIX tools that have been improved over the years. While I generally prefer the use of more specific AIX tools, such as topas or nmon, sar provides strong information with respect to disk I/O. Let's run a typical sar command to examine I/O activity (see Listing 2).

Listing 2. Using sar

                
# sar -d 1 2

AIX newdev 3 5    06/04/07

System Configuration: lcpu=4 disk=5

07:11:16     device    %busy    avque    r+w/s   blks/s   avwait   avserv

07:11:17     hdisk1      0      0.0        0        0      0.0      0.0
             hdisk0     29      0.0      129       85      0.0      0.0
             hdisk3      0      0.0        0        0      0.0      0.0
             hdisk2      0      0.0        0        0      0.0      0.0
                cd0      0      0.0        0        0      0.0      0.0


07:11:18     hdisk1      0      0.0        0        0      0.0      0.0
             hdisk0     35      0.0      216      130      0.0      0.0
             hdisk3      0      0.0        0        0      0.0      0.0
             hdisk2      0      0.0        0        0      0.0      0.0
                cd0      0      0.0        0        0      0.0      0.0



Average    hdisk1        0      0.0        0        0      0.0      0.0
             hdisk0     32      0.0      177       94      0.0      0.0
             hdisk3      0      0.0        0        0      0.0      0.0
             hdisk2      0      0.0        0        0      0.0      0.0
                cd0      0      0.0        0        0      0.0      0.0


Let's break down the column headings from Listing 2.

  • %busy: This column reports the portion of time that the device was busy servicing transfer requests.
  • avque: In AIX Version 5.3, this column reports the average number of requests waiting to be sent to the disk.
  • r+w/s: This column reports the number of read and write transfers to or from the device (in 512-byte units).
  • avwait: This column reports the average wait time per request (milliseconds).
  • avserv: This column reports the average service time per request (milliseconds).

You want to be wary of any disk that approaches 100 percent utilization or shows a large number of queued requests waiting for disk. While there is some activity in the sar output, there are no real I/O problems here because nothing is waiting on I/O. You should continue to monitor the system to make sure that other disks besides hdisk0 are also being used. Where sar differs from iostat is its ability to capture data for long-term analysis and trending through the system activity data collector (sadc) utility. Here's how this works: as delivered on AIX systems, two shell scripts that provide daily reports on system activity (/usr/lib/sa/sa1 and /usr/lib/sa/sa2) ship commented out in root's crontab; uncomment them to start capturing data for historical trending and analysis. The sar command then calls the sadc routine to access the collected system data (see Listing 3).

Listing 3. Example cronjob

                
# crontab -l | grep sa1

0 8-17 * * 1-5 /usr/lib/sa/sa1 1200 3 &
0 * * * 0,6 /usr/lib/sa/sa1 &
0 18-7 * * 1-5 /usr/lib/sa/sa1 &
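Once sa1 is collecting, the binary daily files can be replayed later with sar's -f flag; for example (the file name matches the day of the month, so adjust accordingly):
# sar -d -f /var/adm/sa/sa04                      (replay the disk statistics collected on the 4th)
# sar -d -s 07:00 -e 09:00 -f /var/adm/sa/sa04    (restrict the replay to a time window)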


What about something a little more user-friendly? Did you say topas? topas is a nice performance monitoring tool that you can use for a number of purposes, including, but not limited to, your disk I/O subsystem.

Figure 2. topas

Take a look at the topas output from a disk perspective; there is no I/O activity going on here at all. Besides the physical disk figures, pay close attention to "Wait" (in the CPU section at the top), which also helps determine whether the system is I/O bound. If you see high numbers there, you can then use other tools, such as filemon, fileplace, lsof, or lslv, to help you figure out which processes, adapters, or file systems are causing your bottlenecks. topas is good for quickly troubleshooting an issue when you want a little more than iostat. In a sense, topas is a graphical mix of iostat and vmstat, and with recent improvements it can now also capture data for historical analysis. These improvements arrived in AIX Version 5.3, no doubt prompted by the popularity of a similar tool created by someone at IBM, although IBM does not officially support that tool.

That tool is nmon, my favorite AIX performance tool. While nmon provides a front end similar to topas, it is much more useful for long-term trending and analysis. Further, it gives the system administrator the ability to output data to a Microsoft Excel spreadsheet that comes back as polished charts (tailor-made for senior management and functional teams) that clearly illustrate your bottlenecks. This is done through a tool called the nmon analyzer, which provides the hooks into nmon. With respect to disk I/O, nmon reports the following data: disk I/O rates, data transfers, read/write ratios, and disk adapter statistics.

Here is one small example of where nmon really shines. Say you want to know which processes are hogging most of the disk I/O, and you want to correlate that with the actual disks to clearly illustrate I/O per process. nmon helps you here more than any other tool. To do this with nmon, use the -t option, set your timing, and then sort by I/O channel. How do you use nmon to capture data and import it into the analyzer?

Use the sudo command to run nmon in recording mode, taking a snapshot every 30 seconds for 180 intervals (about an hour and a half):
# sudo nmon -f -t -r test1 -s 30 -c 180




Then sort the output file that gets created:
# sort -A testsystem_yymmdd.nmon > testsystem_yymmdd.csv




When this has completed, ftp the .csv file to your PC, start the nmon analyzer spreadsheet (with macros enabled), and click Analyze nmon data. The nmon analyzer is a free download from IBM.

Figure 3 is a screenshot taken from an AIX 5.3 system, which provides a disk summary for each disk in kilobytes per second for reads and writes.

Figure 3. Disk summary for each disk in kilobytes per second for reads and writes

nmon also helps track the configuration of asynchronous I/O servers, as you can see from the output in Listing 4.

Listing 4. Tracking the configuration of asynchronous I/O servers with nmon

                
# lsattr -El aio0
autoconfig available STATE to be configured at system restart True
fastpath   enable    State of fast path                       True
kprocprio  39        Server PRIORITY                          True
maxreqs    16384     Maximum number of REQUESTS               True
maxservers 100       MAXIMUM number of servers per cpu        True
minservers 50        MINIMUM number of servers                True


Before AIX Version 5.3, nmon was the only tool that showed you the number of asynchronous I/O servers configured and the number actually in use. As illustrated in the previous section, iostat has since been enhanced to provide this function.


This article addressed the relative importance of the disk I/O subsystem. It defined and discussed the AIX I/O stack and how it relates to both physical and logical disk I/O. It also covered some best practices for disk configuration in a database environment, looked at the differences between direct and concurrent I/O, and discussed asynchronous I/O and I/O pacing. You tuned your asynchronous I/O servers and configured I/O pacing. You started up file systems in concurrent I/O mode and studied when to best implement concurrent I/O. Further, you learned all about iostat and captured data using sar, topas, and nmon. You also examined different types of output and defined many of the flags used in sar and iostat. Part 2 of this series drills down to the logical volume manager layer of the AIX I/O stack and looks at some of the snapshot-type tools that help you quickly assess the state of your disk I/O subsystem. Part 3 focuses primarily on tracing I/O usage with tools such as filemon and fileplace, and on how to improve overall file system performance.

Network Adapter Issue in AIX

Network Issue Troubleshooting
1. Check the number of network interfaces and their status:

# lsdev -CH | grep en
ent0  Available 10-68  3Com 3C905-TX-IBM Fast EtherLink XL NIC
ent1  Defined   10-80  IBM PCI Ethernet Adapter (22100020)
inet0 Available        Internet Network Extension
en0   Available        Standard Ethernet Network Interface
en1   Defined          Standard Ethernet Network Interface
 
In this example, there are two network interfaces, ent0 and ent1. ent0 is a fast, 100 Mbps card, while ent1 is a 10 Mbps card. ent0's status is "Available", meaning that it is presently active; on the other hand, ent1's status is "Defined", which means that it could be activated but is not at this time (a short sketch of how to activate it follows).
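If you actually need that second adapter, a Defined device can usually be brought online without a reboot; a quick sketch using the devices from the output above:
# mkdev -l en1              (configure just the en1 interface into the Available state)
# cfgmgr                    (or rerun the configuration manager to pick up all Defined devices)
# lsdev -C | grep en1       (confirm that the state is now Available)
You would still need to assign an IP address (for example, with smit mktcpip) before the interface carries traffic.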
 
2. Use the netstat command:
# netstat -in
Name  Mtu    Network        Address               Ipkts  Ierrs  Opkts  Oerrs  Coll
lo0   16896  link#1                                  587      0    695      0     0
lo0   16896  127            127.0.0.1                587      0    695      0     0
lo0   16896  ::1                                     587      0    695      0     0
en0   1500   link#6         2.60.8c.f2.1d.f6        6455      0   1112      0     0
en0   1500   216.131.202.   216.131.202.172         6455      0   1112      0     0
 
Check that the first three lines are for lo0, confirm that en0 is the active interface, and record the IP address.
 
3. Investigate the attributes of the active interface:
# lsattr -El en0
mtu       1500             Max IP Packet Size for this device       True
remmtu    576              Max IP Packet Size for remote networks   True
netaddr   216.131.202.172  Internet address                         True
state     up               Current Interface Status                 True
netmask   255.255.255.0    Subnet mask                              True
security  none             Security level                           True
authority                  Authorized Users                         True
broadcast                  Broadcast Address                        True
netaddr6                   N/A                                      True
alias6                     N/A                                      True
prefixlen                  N/A                                      True
alias4                     N/A                                      True



4. Determine the routing information:

# netstat -rn
Routing tables
Destination      Gateway          Flags  Refs  Use    If   PMTU  Exp  Groups
Route tree for Protocol Family 2 (Internet):
default          216.131.202.10   UG     1     397    en0  -     -
127/8            127.0.0.1        U      4     265    lo0  -     -
216.131.202/24   216.131.202.172  U      3     35419  en0  -     -

Check that the router's IP address is the correct one and that the U (up) and G (gateway) flags are set on the default route. A quick reachability check is sketched below.
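A simple, low-tech way to confirm that the default gateway in the table above is actually reachable (substitute your own gateway address):
# ping -c 3 216.131.202.10       (the default gateway shown in the routing table above)
# traceroute 216.131.202.10      (should complete in a single hop on the local subnet)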

5. Use the arp command to check on address resolution:
# arp -an
? (216.131.202.191) at 8:0:20:92:a1:c6 (ethernet)
? (216.131.202.171) at 0:10:83:27:ba:7f (ethernet)


6. Check transmit and receive stats:
# entstat -d ent0 | more
----------------------------------------------------------
ETHERNET STATISTICS (ent0):
Device Type: 3Com 3C905-TX-IBM Fast EtherLink XL NIC
Hardware Address: 02:60:8c:f2:1d:f6
Elapsed Time: 0 days 2 hours 5 minutes 48 seconds

Transmit Statistics:                  Receive Statistics:
--------------------                  -------------------
Packets: 38269                        Packets: 25841
Bytes: 45846710                       Bytes: 5512839
Interrupts: 38269                     Interrupts: 25651
Transmit Errors: 0                    Receive Errors: 0
Packets Dropped: 0                    Packets Dropped: 0


If there are no packets sent or received, there is probably a cable problem.

7. Look at the duplex and speed setting on the card:
# smit chgenet [choose en0]
Ethernet Adapter ent0
Media Type 100BaseTX
TX to RX Queue Partition Ratio 3:5
Driver TX Waiting Queue Size 32
Driver RX waiting Queue Size 32
Full Duplex? yes
Use alternate address? no
Alternate Ethernet Address 0x
TX Start Threshold - Fragmented 512
Apply change to DATABASE only no


If the card is not set as above, it is recommended that you change it (an alternative command-line approach is sketched after these steps). To change these settings:
a. Telnet to the server's console
b. Detach the card:
# ifconfig en0 detach
c. Reconfigure it:
# smit chgenet
d. Bring it up:
# chdev -l en0 -a state=up

e. Reset tcpip:
# smit tcpip
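Alternatively, on many AIX Ethernet adapters the speed and duplex can be changed directly with chdev against the ent device. This is only a sketch; the attribute name (media_speed here) and its permitted values vary by adapter model, so check lsattr -El ent0 and lsattr -R -l ent0 -a media_speed first:
# ifconfig en0 detach                              (release the adapter so its attributes can be changed)
# chdev -l ent0 -a media_speed=100_Full_Duplex     (illustrative attribute and value; varies by adapter)
# chdev -l en0 -a state=up                         (bring the interface back up)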


8. Try listening on the interface:
# tcpdump -i en0 -I
18:34:20.333473494 ple-dns-01.peoplesoft.com.domain > st-ibm07.peoplesoft.com
If you don't see any output, the cable connecting the server to the switch (a Cisco Catalyst in this example) or the switch port itself could be defective.

PowerVM Disk Virtualization (Hybrid) --- vSCSI + NPIV

PowerVM Disk Virtualization

Virtual SCSI

vSCSI is a mechanism that allows the VIOS to present disk volumes to client LPARs across a virtualized SCSI connection. The VIOS owns the physical disk volumes and they can be locally attached or SAN based. The disk volumes are made available to client LPARs using the vSCSI interface and they appear as locally attached SCSI hard disk drives (hdisks) to the client LPAR.

N_Port ID Virtualization
NPIV provides an alternative method for disk virtualization on the VIOS. With NPIV, a physical Fibre Channel adapter assigned to the VIOS can have up to 64 downstream virtual worldwide names (WWNs) associated with it. Virtual WWNs can then be assigned to client LPARs.

Hybrid vSCSI and NPIV Implementation
vSCSI was the first disk-access virtualization technology provided with PowerVM. It works very well and is in use in many businesses. From a management perspective, vSCSI takes us back to the original UNIX server-implementation model, where all disk is presented to the client LPAR as locally attached SCSI disk. The VIOS administrator manages all of the OS and data disk volumes, much as when servers were configured with local SCSI disk drives. This can become quite a challenging task for servers with large amounts of data, as the responsibility for disk redundancy and backup falls on the VIOS administrator's shoulders. One possible solution is to implement a hybrid disk-virtualization model.

With the hybrid model, vSCSI is used for the OS disks and NPIV is used for the data disks. This emulates the traditional environment where there was local SCSI OS disk and Fibre Channel-attached data disk. The client LPAR administrators see local SCSI disk for installation and management of the OS. All of the data disks are presented to the client LPAR through virtual Fibre-Channel connections on the NPIV interface. With regard to disk configuration and management, this allows the client LPAR administrator to focus on managing the OS disk volumes and the SAN administrator to focus on managing the data volumes.
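On the VIOS itself, the two halves of such a hybrid configuration are typically created with mkvdev (vSCSI) and vfcmap (NPIV). A minimal sketch, run as padmin; the hdisk, vhost, vfchost, and fcs names below are placeholders for your own environment:
$ mkvdev -vdev hdisk4 -vadapter vhost0 -dev lpar1_rootvg    (vSCSI: present an OS disk to the client LPAR)
$ lsmap -all                                                (verify the vSCSI mappings)
$ vfcmap -vadapter vfchost0 -fcp fcs0                       (NPIV: tie the client's virtual FC adapter to a physical port)
$ lsmap -all -npiv                                          (verify the NPIV mappings)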
Active Memory Sharing
Active Memory Sharing (AMS) enables the sharing of a pool of physical memory among partitions on a single IBM Power Systems server (POWER6 or later), helping to increase memory utilization and drive down system costs.

In order to use the Active Memory Sharing feature of IBM PowerVM, the following are the minimum requirements:
  • An IBM Power System server based on the POWER6 processor
  • Enterprise PowerVM activation
  • Firmware level 340_075
  • HMC version 7.3.4 service pack 2 (V7R3.4.0M2) for HMC managed systems
  • Virtual I/O Server Version 2.1.0.1-FP21 for both HMC and IVM managed systems
  • AIX 6.1 TL 3
  • Novell SuSE SLES11

    The memory is dynamically allocated amongst the partitions as needed, to optimize the overall physical memory usage in the pool. Instead of assigning a dedicated amount of physical memory to each logical partition, the POWER Hypervisor constantly provides the physical memory from the Shared Memory Pool as needed.

    Logical memory:
    Quantity of memory that the operating system manages and can access. Logical memory pages that are in use may be backed up by either physical memory or a pool’s paging device.

    For example, four logical partitions with 10 GB of dedicated memory each can be configured to share a memory pool of 40 GB, each with 15 GB of logical memory assigned.


    Paging:
    A Paging Virtual I/O Server is a partition that provides paging services for a shared memory pool and manages the paging spaces for shared memory partitions associated with a shared memory pool. A Virtual I/O Server enabled as a Paging Virtual I/O Server is designed to serve one shared memory pool.


This new configuration does not change the global memory requirements, and every logical partition can have the same amount of physical memory it had before. However, memory allocation is greatly improved, since an unexpected memory demand caused by an unplanned peak in one logical partition can be satisfied from the shared pool. Indeed, unused memory pages from other shared-memory partitions can be automatically assigned to the more demanding partition.

The hypervisor has to use a paging device to back up the excess memory that it cannot back up using the physical memory.

A paging device is required for each shared memory partition. The size of the paging device must be equal to or larger than the maximum logical memory defined in the partition profile. The paging devices are owned by a Virtual I/O Server. A paging device can be a logical volume or a whole physical disk. Disks can be local or provided by an external storage subsystem through a SAN.

A reserved storage device pool is created automatically when AMS is used; it is needed to hold the shared memory paging devices.
(Ensure that PVIDs for paging devices for physical volumes set up by the HMC are cleared before use.)

A Virtual Asynchronous Service Interface (VASI) is a virtual device that allows communication between the Virtual I/O Server and the hypervisor. In an AMS environment, this device is used to handle hypervisor paging activity.
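From inside an AIX client partition, you can verify that shared memory is in effect and keep an eye on hypervisor paging. A hedged sketch (field and column names vary slightly by AIX level):
# lparstat -i | grep -i "memory mode"     (reports Shared for an AMS partition)
# vmstat -h 5 3                           (adds hypervisor paging columns, such as hpi and hpit)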

Active Memory Expansion (AME) -- Power VM

Active Memory Expansion (AME)
Introduction
IBM’s POWER7™ systems with AIX® feature Active Memory™ Expansion, a new technology for expanding a system’s effective memory capacity. Active Memory Expansion employs memory compression technology to transparently compress in-memory data, allowing more data to be placed into memory and thus expanding the memory capacity of POWER7 systems. Utilizing Active Memory Expansion can improve system utilization and increase a system’s throughput.

Active Memory Expansion Overview
Active Memory Expansion relies on compression of in-memory data to increase the amount of data that can be placed into memory and thus expand the effective memory capacity of a POWER7 system. The in-memory data compression is managed by the operating system, and this compression is transparent to applications and users.

Active Memory Expansion is configurable on a per-logical-partition (LPAR) basis. Thus, Active Memory Expansion can be selectively enabled for one or more LPARs on a system.

When Active Memory Expansion is enabled for an LPAR, the operating system will compress a portion of the LPAR's memory and leave the remaining portion of memory uncompressed. This results in memory effectively being broken up into two pools: a compressed pool and an uncompressed pool. The operating system will dynamically vary the amount of memory that is compressed based on the workload and the configuration of the LPAR.

The operating system will move data between the compressed and uncompressed memory pools based on the memory access patterns of applications. When an application needs to access data that is compressed, the operating system will automatically decompress the data and move it from the compressed pool to the uncompressed pool, making it available to the application. When the uncompressed pool is full, the operating system will compress data and move it from the uncompressed pool to the compressed pool. This compression and decompression activity is transparent to the application.

Because Active Memory Expansion relies on memory compression, some additional CPU will be consumed when Active Memory Expansion is in use. The amount of additional CPU needed for Active Memory Expansion will vary based on the workload and the level of memory expansion being used.
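Two AIX-side commands are useful here. amepat (the Active Memory Expansion planning tool) models candidate expansion factors against a live workload, and lparstat -c reports the CPU being spent on compression once AME is enabled. A sketch; the durations and intervals are arbitrary, and column names vary slightly by AIX level:
# amepat 60          (monitor the current workload for 60 minutes and report suggested expansion factors)
# lparstat -c 5 3    (three 5-second samples; the %xcpu column shows CPU consumed by compression)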

System Requirements
Active Memory Expansion is supported across all POWER7 systems. In order to use Active Memory Expansion, the following minimum levels of software are required:
  1. HMC: V7R7.1.0.0
  2. eFW: 7.1
  3. AIX: 6.1 TL4 SP2