
Troubleshooting XenServer deployments

Tomasz Czajka, Sr. Support Engineer 8th of October 2010

Agenda
Case study: Production down
Learn: XenServer crash
Case study: Single-pathing
Q&A

Production down

VMs don't start - why?


Basic troubleshooting in XenCenter

Cannot start a VM: "The SR is not available" error

Storage Repository (SR) in broken state. Repair does not work.

Use CLI to troubleshoot


Broken storage
What is broken?
PBD = Physical Block Device
Volume Group Name: <Prefix>+SR UUID

[Diagram: XenServer_1 and XenServer_2 each connect through their own PBD (SCSI ID) to the shared SR, which has a UUID (unique ID).]

# xe pbd-list currently-attached=false

Storage troubleshooting
Goal: Reproduce and analyse the logs

/var/log/xensource.log* ; SMlog* ; messages* ;


# tail -f /var/log/messages > /tmp/ShortLog
# date
# echo "Unplugging cable" >> messages
Note: messages uses UTC timestamps; xensource.log uses local time.
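A minimal sketch of a marker helper for the reproduce-and-analyse step. The function name `mark_log` and the "MARKER:" text are hypothetical; the timestamp is written in UTC on purpose so it can be lined up with either log.

```shell
# Append a UTC-timestamped marker line to a log file, so the moment of the
# reproduced event (for example, unplugging a cable) is easy to find later.
# Usage: mark_log <file> <text>    (mark_log is a hypothetical helper name)
mark_log() {
    printf '%s MARKER: %s\n' "$(date -u '+%Y-%m-%d %H:%M:%S UTC')" "$2" >> "$1"
}

# Example:
#   mark_log /tmp/ShortLog "Unplugging cable"
```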

PBD unplugged
Plugging PBD manually
# grep PBD.plug xensource.log

# xe pbd-list host-uuid=... sr-uuid=...


# xe pbd-plug uuid=...
SR_BACKEND_FAILURE_47: The SR is not available
no such volume group: VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4

# xe sr-list name-label="My SR" params=uuid
19856cba-830c-e298-79fa-84a79eb658f4
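The error message shows the volume group name as a prefix followed by the SR UUID. A small sketch of that mapping, assuming the prefix is exactly "VG_XenStorage-" as printed in the error; both function names are hypothetical.

```shell
# Build the volume group name expected for an SR UUID.
# The "VG_XenStorage-" prefix is copied from the error message on this slide.
vg_name_for_sr() {
    printf 'VG_XenStorage-%s\n' "$1"
}

# Inverse: recover the SR UUID from a volume group name.
sr_uuid_from_vg() {
    printf '%s\n' "${1#VG_XenStorage-}"
}

# Example:
#   vgs "$(vg_name_for_sr 19856cba-830c-e298-79fa-84a79eb658f4)"
```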

Volume Group
What is VG?

Logical Volume Manager (LVM)
Example: 3 VMs, 1 virtual disk each

[Diagram: each HDD / LUN is a Physical Volume (PV); the PVs form a Volume Group (VG), which corresponds to the Storage Repository (SR); each Logical Volume (LV) in the VG holds one virtual disk (VDI).]

Volume Group
Matching the UUID

# vgs
VG                                                  #PV  #LV  #SN  Attr    VSize    VFree
VG_XenStorage-090d4717-9f91-92de-83c3-5458274802e9    1   18    0  wz--n-   89.99G   19.48G
VG_XenStorage-5239de43-6a74-0365-f825-b799aa6de853    1    2    0  wz--n-  129.07G  129.05G
VG_XenStorage-70a029cf-7f35-c035-4af7-07eaf31e2e88    1   11    0  wz--n-   49.99G    2.84G
VG_XenStorage-9be18df5-3fd2-4835-b864-d0ffbccbaeb3    1    1    0  wz--n-    1.99G    1.98G

# vgs 'VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4'
Volume group "VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4" not found
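The same check can be scripted. This is a sketch only: `vg_listed` is a hypothetical helper that reads `vgs` output on standard input and looks for an exact match in the first column.

```shell
# Succeed (exit 0) if the volume group named $1 appears in the first column
# of `vgs` output supplied on stdin; fail (exit 1) otherwise.
vg_listed() {
    awk -v vg="$1" '$1 == vg { found = 1 } END { exit !found }'
}

# Example:
#   vgs | vg_listed "VG_XenStorage-<SR UUID>" || echo "VG for this SR is missing"
```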

Examining HDD/LUN
Checking SCSI ID

check SCSI ID (unique for each SCSI device)

PBD
SCSI ID

# xe pbd-list params=device-config sr-uuid=...
device-config SCSIid: 360a9800050334f49633459

Examining HDD/LUN
Can Linux kernel see this block device? (SCSI device)

# hdparm -t /dev/disk/by-id/scsi-360a98045234t654...

Timing buffered disk reads: 138 MB in 3.02 seconds = 45.68 MB/sec (LUN readable!)

Addressing SCSI disks


# ls -lR /dev/disk | grep 360a9800050334f4963345767656c546
/dev/disk/by-id
scsi-360a9800050334f4963345767656c546a -> /dev/sde
/dev/disk/by-scsibus
360a9800050334f4963345767656c546a-1:0:0:5 -> /dev/sdc
360a9800050334f4963345767656c546a-2:0:0:5 -> /dev/sde

/dev/mapper/360a9800050334f4963345767656c546
Also check /dev/disk/by-path
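A sketch that derives the two device paths used on these slides from a SCSI ID. `lun_paths` is a hypothetical name, and the path patterns are simply the ones shown above (`/dev/disk/by-id/scsi-<id>` and `/dev/mapper/<id>`).

```shell
# Print the device paths under which a LUN with SCSI ID $1 is expected to
# appear, following the naming shown on this slide.
lun_paths() {
    printf '/dev/disk/by-id/scsi-%s\n' "$1"
    printf '/dev/mapper/%s\n' "$1"
}

# Example: report which of the expected paths exist on this host.
#   lun_paths "<SCSI ID>" | while read -r p; do [ -e "$p" ] && echo "found: $p"; done
```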

Examining HDD/LUN
Is the LUN empty?

# udevinfo -q all -n /dev/disk/by-id/scsi-360a9800050334f496334576765...
...
ID_FS_TYPE=LVM2_member
...
If this is an LVM member, why is there no VG on it?

Examining HDD/LUN
Is there a VG created on PV?

# pvs
PV                               VG                                      Fmt   Attr  PSize    PFree
/dev/mapper/360a9800050334f4963  VG_XenStorage-090d4717-9f91-92de-83c3-  lvm2  a-     89.99G   19.48G
/dev/mapper/360a9800050334f4963  VG_XenStorage-70a029cf-7f35-c035-4af7-  lvm2  a-     49.99G    2.84G
/dev/mapper/360a9800050334f4963  VG_XenStorage-19856cba-830c-e298-79fa-  lvm2  a-     14.99G    6.45G
/dev/mapper/360a9800050334f4965  VG_XenStorage-9be18df5-3fd2-4835-b864-  lvm2  a-      1.99G    1.98G
/dev/sda3                        VG_XenStorage-5239de43-6a74-0365-f825-  lvm2  a-    129.07G  129.05G

# pvs | grep 360a9800050334f496334595a32306431
/dev/mapper/360a9800050334f496334595a32306431  VG_XenStorage-332432-430d-3423-4332434-5485974  lvm2  a-  14.99G  14.99G

# xe sr-list name-label="My SR" params=uuid
19856cba-830c-e298-79fa-84a79eb658f4

VG_XenStorage-<UUID> differs from SR UUID!
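The comparison on this slide can be sketched as two small helpers. The names `vg_on_pv` and `vg_matches_sr` are hypothetical, and the "VG_XenStorage-" prefix is assumed from the output shown above.

```shell
# Print the VG name recorded for physical volume $1 in `pvs` output on stdin.
vg_on_pv() {
    awk -v pv="$1" '$1 == pv { print $2 }'
}

# Succeed only if VG name $1 is the one expected for SR UUID $2.
vg_matches_sr() {
    [ "$1" = "VG_XenStorage-$2" ]
}

# Example:
#   vg=$(pvs | vg_on_pv "/dev/mapper/<SCSI ID>")
#   vg_matches_sr "$vg" "<SR UUID>" || echo "VG on the LUN does not belong to this SR"
```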

No original VG on the LUN


Potential reasons:

(Re)installation of host in the same pool


Unplug FC / Zoning

(Re)installation of host in other pool


Zoning

Adding SR with xe sr-create in CLI ...BE VERY CAREFUL!

Volume Group
...has been recreated!

Lost LVM metadata


Lost 100 MB of the VDI data
Action steps:
Don't shut down running VMs
Online backup for running VMs (now)

Block-level clone of the whole LUN (now)


Assess professional data recovery

Volume Group
Looking for LVM metadata backup

Make a copy first
# cp /etc/lvm/backup/* /root/backup/

/etc/lvm/backup/VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4
Check backup timestamp (within the file)

LVs in backup file
# cat /etc/lvm/backup/VG... | grep VHD

VDIs in xapi database
# xe vdi-list sr-uuid=<uuid> params=uuid

[Diagram: each LV in the backup file corresponds to one VDI in the xapi database.]
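A sketch for comparing the two lists. It assumes LV names in the backup file have the form VHD-<VDI uuid>, as in the grep above; the helper names are hypothetical.

```shell
# Print the VDI UUIDs of all VHD-* logical volumes named in LVM backup text
# read from stdin, one per line, sorted and de-duplicated.
backup_vdi_uuids() {
    grep -o 'VHD-[0-9a-f-]*' | sed 's/^VHD-//' | sort -u
}

# Print the UUIDs listed in file $1 that are absent from file $2
# (one UUID per line in each file). Prints nothing if all are present.
uuids_missing_from() {
    grep -vxF -f "$2" "$1" || true
}

# Example (file names are placeholders):
#   backup_vdi_uuids < /etc/lvm/backup/VG... > /tmp/lv_uuids
#   xe vdi-list sr-uuid=<uuid> params=uuid --minimal | tr ',' '\n' > /tmp/vdi_uuids
#   uuids_missing_from /tmp/lv_uuids /tmp/vdi_uuids   # LVs with no VDI record
```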

Volume Group
Removing new VG and PV

# vgremove "VG_XenStorage-<new SR uuid>"


# pvremove /dev/mapper/<SCSI ID>

Volume Group
Recreating PV and VG from backup

# pvcreate --uuid <PV uuid from backup file> --restorefile /etc/lvm/backup/VG_XenStorage-<SR_UUID> /dev/mapper/<SCSI ID>
# vgcfgrestore VG_XenStorage-<SR UUID> -f /etc/lvm/backup/VG_XenStorage-<SR UUID>
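Because these commands rewrite LVM metadata, it may help to print them for review before running anything. A dry-run sketch: `print_restore_plan` is a hypothetical helper that only echoes the two commands from this slide with the values filled in.

```shell
# Print (do NOT run) the PV/VG restore commands for one SR.
#   $1 = SR UUID, $2 = PV UUID from the backup file, $3 = SCSI ID of the LUN
print_restore_plan() {
    backup="/etc/lvm/backup/VG_XenStorage-$1"
    printf 'pvcreate --uuid %s --restorefile %s /dev/mapper/%s\n' "$2" "$backup" "$3"
    printf 'vgcfgrestore VG_XenStorage-%s -f %s\n' "$1" "$backup"
}

# Example:
#   print_restore_plan "<SR UUID>" "<PV uuid from backup file>" "<SCSI ID>"
```

Reviewing the printed lines against the backup copy first keeps a mistyped UUID from reaching the LUN.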

Examining HDD/LUN
Confirm that VG name contains SR uuid...

# pvs | grep 360a9800050334f496334595a32306431
PV                                             VG                                                  Fmt   Attr  PSize   PFree
/dev/mapper/360a9800050334f496334595a32306431  VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4  lvm2  a-    14.99G  14.99G

# xe sr-list name-label="My SR" params=uuid
19856cba-830c-e298-79fa-84a79eb658f4

VG_XenStorage-<UUID> matches SR UUID

Volume Group
Checking Logical Volumes
# lvs
MGT                                       VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4  -wi---    4.00M
VHD-352d31ec-aeb6-4601-8ea9-990575dab395  VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4  -wi---  520.00M
VHD-7e5f83a7-b6c4-4fae-9899-1e6a2cdabd32  VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4  -wi---    4.02G
VHD-fbce18dd-397e-444e-9470-b6fa240243d9  VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4  -wi---    4.02G
VHD-ff744448-1b7f-4cc8-80b1-cd38b6c90c98  VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4  -wi---  520.00M

Storage Repository
Plugging PBD again...
# xe pbd-plug uuid=
Success! But no VDIs shown...

# xe sr-scan uuid=
Error code: SR_BACKEND_FAILURE_46
Error parameters: , The VDI is not available [opterr=Error scanning VDI 7e5f83a7-b6c4-4fae-9899-1e6a2cdabd32]

# xe vdi-list uuid=<VDI uuid from the error above>
# lvremove /dev/VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4/VHD-7e5f83a7-b6c4-4fae-9899-1e6a2cdabd32
# xe sr-scan uuid=
Success! All VDIs shown...
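A sketch for pulling the failing VDI UUID out of the scan error, assuming the error text has the "Error scanning VDI <uuid>" form shown above. `failed_vdi_uuid` is a hypothetical name.

```shell
# Read sr-scan error text on stdin and print the UUID that follows
# "Error scanning VDI". Prints nothing if the text has no such phrase.
failed_vdi_uuid() {
    sed -n 's/.*Error scanning VDI \([0-9a-f-]*\).*/\1/p'
}

# Example:
#   xe sr-scan uuid=<SR UUID> 2>&1 | failed_vdi_uuid
```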

Well done!

What we've learned

...by troubleshooting Production Down issue
For a PBD to get plugged, it needs: LUN/HDD, PV, VG (SR), LV (VDI)
VG name generated from SR uuid (+ prefix)
LV name generated from VDI uuid (+ prefix)
Displaying VG (vgs), PV (pvs), LV (lvs)
Addressing block devices (/dev/disk)
Examining HDD/LUN with "hdparm -t"
Restoring PV & VG from backup

The XenServer Crash

The XenServer Crash?


Unresponsive or rebooting host

Kernel panic or crash dump
Symptoms: error on console, host locked
Possible causes: memory addressing, bug in OS, hardware failure

No kernel panic and no crash dump
Symptoms: host rebooting / frozen / no errors on the console
Possible causes: hardware failure, OS busy (I/O), user action

[Flowchart]

Symptom: Host is unresponsive
  Serial console:
    Boot the host to the console (CTX120540) & reboot
    Generate crashdump (CTX120540) & reboot
    Review crashdump
  No serial console:
    Connect local console. Any errors on the console?
    Take photos and reboot
    Analyse /var/log/messages, xensource.log

Symptom: Host rebooted itself
  /var/crash/<date> exists:
    Review crashdump
  No crashdump:
    HA disabled:
      Analyse /var/log/messages, xensource.log
      Add noreboot option in extlinux.conf
      Still rebooting? Examine hardware
    HA enabled:
      Disable HA
      Host fenced? Check /var/log/xha.log
      Analyse /var/log/messages, xensource.log for HA reasons

Contact Citrix Tech Support

Getting into details


As easy as grep
Analyse /var/log/messages, xensource.log
Startup strings:
# cd /var/log
# grep klogd messages -B100
# grep "SERVER START" xensource.log -B100
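When a log contains several startups, only the most recent one is usually of interest. A sketch that reports the line number of the last match; `last_match_line` is a hypothetical helper name.

```shell
# Print the line number of the last line in file $2 that matches pattern $1.
# Prints nothing if the pattern never occurs.
last_match_line() {
    grep -n -- "$1" "$2" | tail -n 1 | cut -d: -f1
}

# Example: show the 100 lines leading up to the most recent startup.
#   n=$(last_match_line klogd /var/log/messages)
#   [ -n "$n" ] && head -n "$n" /var/log/messages | tail -n 101
```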

Inside crash log directory


/var/crash/<stamp>
crash.log: Hypervisor console ring
Domain0.log: Domain0 console ring
Domain1,2,3...log
Debug.log
xen-memory-dump

Review crashdump
What to look for: HA activity, page fault, driver, storage issues
CPU stack: to be analysed by Citrix Tech Support
Citrix Confidential - Do Not Distribute

Investigating crash.log
Xen console ring
Located at the bottom of the file
(XEN) Watchdog timer fired for domain 0
(XEN) Domain 0 shutdown: watchdog rebooting machine.
Why was the watchdog triggered? Check /var/log/xha.log (network or storage heartbeat failed)
Why did the heartbeat fail? Check /var/log/messages (DMP, kernel, drivers, I/O errors)
Review crashdump (cont)

Investigating crash.log
Page fault
Other examples:
(XEN) ****************************************
(XEN) Panic on CPU 6:
(XEN) FATAL TRAP: vector = 14 (page fault)
(XEN) [error_code=0000] , IN INTERRUPT CONTEXT
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...
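A first-pass triage sketch for crash.log that looks only for the two message types shown on these slides. `classify_crash_log` and its output labels are hypothetical; anything else is reported as "unknown" and still needs manual review.

```shell
# Print a rough classification of a crash.log file given as $1, based on
# the two hypervisor messages shown on the slides.
classify_crash_log() {
    if grep -q 'Watchdog timer fired' "$1"; then
        echo watchdog        # next stop: /var/log/xha.log
    elif grep -q 'FATAL TRAP' "$1"; then
        echo fatal-trap      # hypervisor panic, e.g. page fault
    else
        echo unknown
    fi
}

# Example:
#   classify_crash_log /var/crash/<stamp>/crash.log
```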

What we've learned

Learn: XenServer crash
Host really crashed?
Kernel panic
Crashdump
Triggering crashdump manually
Locating host reboot in the logs
Reviewing crashdump logs

Single-Pathing

Storage Performance issue


DMP has been enabled to improve performance
Virtual Machines are running on different iSCSI SRs
LinuxGuestVM:~# hdparm -t /dev/xvdb
/dev/xvdb:
Timing buffered disk reads: 96 MB in 3.07 seconds = 30.41 MB/sec

Storage Performance
Checking multipath status

# mpathutil status

360a9800050334f496334596c71665246 dm-13 NETAPP,LUN
[size=2.0G][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=4][enabled]
 \_ 3:0:0:2 sdk 8:160 [active][ready]
 \_ 4:0:0:2 sdj 8:144 [active][ready]

Multipath device: /dev/mapper/<SCSI ID>
Single-path devices: /dev/sdk, /dev/sdj

Storage Performance
Determining current performance on domain0

Testing multi-path device


# hdparm -t /dev/mapper/<scsi id>

Testing single-path devices
# hdparm -t /dev/sdj
# hdparm -t /dev/sdm

In all cases: 30 MB/sec

Storage Performance
Determining usage of paths

# iostat -x <device>
# iostat -x /dev/sdk /dev/sdj 5

Device  Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
sdk         803.50        33.0      4122       160
sdj         784.00        32.8      3922       155

Both paths are used equally
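"Used equally" can be checked with a quick calculation. A sketch, assuming the input is the device rows only (no header), with the device name in column 1 and Blk_read/s in column 2 as on this slide; `read_share` is a hypothetical name.

```shell
# Read iostat device rows on stdin (name in column 1, Blk_read/s in column 2)
# and print each device's share of the total read rate as a whole percentage.
read_share() {
    awk '{ name[NR] = $1; rate[NR] = $2; total += $2 }
         END {
             for (i = 1; i <= NR; i++)
                 printf "%s %.0f%%\n", name[i], (total > 0 ? 100 * rate[i] / total : 0)
         }'
}

# Example with the figures from the slide:
#   printf 'sdk 803.50\nsdj 784.00\n' | read_share
```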

Storage Performance
Checking if there are really 2 iSCSI sessions
# ls -alR /dev/disk/by-path/ | egrep "(sdk|sdj)"
ip-10.1.200.40:3260-iscsi-iqn.1992-08.com.netapp:MyNetapp.luns-lun-2 -> ../../sdk
ip-10.1.201.40:3260-iscsi-iqn.1992-08.com.netapp:MyNetapp.luns-lun-2 -> ../../sdj

Storage Performance
Checking if different paths are really used

# tcpdump -i any port 3260


# watch "ifconfig | egrep '(eth0 |eth1 )' -A5 | egrep '(eth|bytes)' "
eth0  Link encap:Ethernet  HWaddr 00:1D:09:70:88:2C
      RX bytes:1490076463 (1.3 GiB)  TX bytes:170615419 (162.7 MiB)
eth1  Link encap:Ethernet  HWaddr 00:1D:09:70:88:2E
      RX bytes:1801238 (166 MiB)  TX bytes:46695876 (44.5 MiB)

Storage Performance
Checking source IP addresses for iSCSI sessions

# netstat -at | grep iscsi


10.1.200.138:53049  10.1.200.40:iscsi-target  ESTABLISHED
10.1.200.178:46684  10.1.201.40:iscsi-target  ESTABLISHED

Storage Performance
Checking kernel routing table

# route
Destination  Gateway     Genmask        Iface
10.1.200.0   *           255.255.255.0  xenbr0
10.1.200.0   *           255.255.255.0  xenbr1
default      10.1.200.1  0.0.0.0        xenbr0
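The routing table on this slide lists the same 10.1.200.0 network for both xenbr0 and xenbr1, and both iSCSI source addresses are in that network. A sketch of a quick check for this condition, assuming a 255.255.255.0 netmask as shown; `same_subnet24` is a hypothetical name.

```shell
# Succeed if IPv4 addresses $1 and $2 share their first three octets,
# i.e. they are in the same network under a 255.255.255.0 netmask.
same_subnet24() {
    [ "${1%.*}" = "${2%.*}" ]
}

# Example with the two source addresses from the netstat output:
#   same_subnet24 10.1.200.138 10.1.200.178 && echo "both storage interfaces share one subnet"
```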

Storage Performance
Configuration of management interfaces in XenCenter

Change the IP address of the ISCSI_2 interface to 10.1.201.78

Storage Performance
Determining current performance on domain0

# route
Destination  Gateway     Genmask        Iface
10.1.200.0   *           255.255.255.0  xenbr0
10.1.201.0   *           255.255.255.0  xenbr1
default      10.1.200.1  0.0.0.0        xenbr0

Storage Performance
Configuring kernel routing table

...or (not recommended)
Add to /etc/rc.local:
# route add -host 10.1.200.40 xenbr0
# route add -host 10.1.201.40 xenbr1

What about Pool Upgrade and Pool Join?

Storage Performance
Determining current performance on VM
LinuxVM:~# hdparm -t /dev/xvdb
/dev/xvdb:
Timing buffered disk reads: 45 MB/sec

Well Done!

What we've learned

Case study: Single-pathing
/dev/ locations for single and multi-path devices
# mpathutil status
# hdparm -t
# iostat
# ifconfig, # tcpdump, # netstat, # route
# watch
Best practices for iSCSI storage

Questions

Resources
First aid kit
http://docs.xensource.com - XenServer documentation
http://support.citrix.com/product/xens/ - Knowledge Center
http://forums.citrix.com/support - Support forums
http://community.citrix.com/citrixready/xenserver - XenServer Central (one-stop information center)

Before you leave


Session surveys are available online at www.citrixsynergy.com starting Thursday, 7 October
Provide your feedback and pick up a complimentary gift card at the registration desk

Download presentations starting Friday, 15 October, from your My Organiser Tool located in your My Synergy Microsite event account
