2. Introduction
Inserting an FLC to build a redundant FLC configuration requires the correct procedure (attached MOP).
Otherwise the FLC can enter a never-ending synchronization loop as a consequence of a bad
sector (or bad block) on the HDD of the active FLC.
A bad sector on the FLC's HDD can be hidden when it is not on a part of the HDD used by
the application SW, but it can still inhibit the correct synchronization of the FLC (HDD mirroring).
Affected product:
1850TSS-320, 1850TSS-160
Affected releases:
Involved cards:
Reference AR:
AR 1-5226437
OND_OMSN_TSS_0714_1114
Alcatel-Lucent Internal
Proprietary Use pursuant to Company instruction
cod. 3AL92110AA**
3. Problem Description
A bad sector (or bad block) is a sector on an HDD that cannot be used (the OS is unable
to access it successfully) because of permanent damage, e.g. physical damage to the disk surface.
When there is FLC redundancy, bad blocks are recovered automatically (the redundant
disk controller remaps the logical sector to a different physical sector). However, if the bad block
affects a portion of the HDD that the SW does not use, or the working FLC is not in a redundant
configuration, the bad block stays hidden and becomes a potential silent failure when a new FLC
is later inserted (to create FLC redundancy). If this occurs, the HDD synchronization
fails, leaving the working FLC in a never-ending restart loop. In the worst case, the active FLC
restarts and the newly inserted FLC becomes active without the correct RAID card
synchronization, with unpredictable consequences, e.g. the DB is blanked or becomes corrupt.
4. Recommendations
Apply the attached MOP to ensure the safe and correct insertion of a
new FLC when building a redundant FLC configuration.
5. Disclaimer
The information is believed to be accurate at the time of publishing based on currently available
information. Use of the information constitutes acceptance for use in an AS IS condition. There
are no warranties with regard to this information. Neither the author nor the publisher accepts
any liability for any direct, indirect, or consequential loss or damage arising from use of, or
reliance on, this information.
DISTRIBUTION LIST
Alcatel-Lucent
ABSTRACT
This document provides the Method Of Operation Procedure (MOP) for EC-320 spare card
insertion.
ED | DATE | CHANGE NOTE | APPROVAL | ORIGINATOR(S)
01 | 30/05/2014 | Creation | C. Colombo | V. Mascolo
02 | 10/06/2014 | Creation | C. Colombo | V. Mascolo
03 | 20/06/2014 | Creation | C. Colombo | V. Mascolo
04 | 11/07/2014 | Creation | C. Colombo | V. Mascolo
REVIEW
30/05/2014, Creation
10/06/2014, changes Ed.2 (edition only for DTAG)
20/06/2014, changes Ed.3 (edition only for release 4.1.60)
11/07/2014, changes Ed.4 (edition for all releases)
17/11/2014, changes to signal that the fake HWFAIL alarm could affect the standby
FLC when the tool is running.
TABLE OF CONTENTS
1. GLOSSARY
2. PURPOSE
3. PREREQUISITES
4. EC320 SPARE CARD INSTALLATION - AUTOMATIC CHECK VIA SNATCH TOOL
5. HOW TO RECOVER BADBLOCKS
6. EC320 SPARE CARD INSERTION
7. PROCESS TIMELINE
8. EXIT CRITERIA
9. ANNEX: USB-SATA
10. ANNEX: TROUBLESHOOTING
1. Glossary
MOP: Method Of Procedure
TEC
R&D
LDC
FLC
LED
2. Purpose
This document provides the Method Of Operation Procedure (MOP) for EC-320 spare card
insertion for all releases, including the improvements introduced in Rel 4.1.60.
This document refers to the following Operational Notice: ON-1114-OMSN-1850TSS320/160: Maintenance procedure for FLC spare card insertion.
This MOP is applicable to 1850TSS320/160 with the following releases, according to
ND_tool release 1.3b23:
Equipment | NE rel. | SWpackage | Snatch
TSS-320/160 | 3.2., 3.2.3, 3.2.4, 3.2.5, 3.2.6, 3.2.7 | 3.16., 3.23.02, 3.24-06, 3.25-01, 3.26-04, 3.27-06 |
TSS-320/160 | 3.4, 3.4.1, 3.4.2, 3.4.3, 3.4.4, 3.4.5, 3.4.6, 3.4.7, 3.4.8, 3.4.9, 3.4.10, 3.4.20 | 3.41.12, 3.42-40, 3.43-03, 3.43-04, 3.44-04, 3.45-06, 3.46-04, 3.47-35, 3.48-02, 3.49-09, 3.49-10, 3.49-11 |
TSS-320/160 | 3.6, 3.6.1, 3.6.2, 3.6.3, 3.6.4, 3.6.5 | 3.60.35, 3.61-09, 3.62-01, 3.63-54, 3.64-02, 3.65-01 |
TSS-320/160 | 4.0, 4.0.1, 4.0.2, 4.0.3, 4.0.4 | 4.00.99, 4.01-01, 4.02-17, 4.03-20, 4.04-03 |
TSS-320/160 | 4.1, 4.1.1, 4.1.2, 4.1.3, 4.1.40, 4.1.45, 4.1.47, 4.1.50, 4.1.55, 4.1.60 | 4.10-24, 4.11-30, 4.12-11, 4.13-10, 4.10.40-B023, 4.10.45-B055, 4.10.47-B066, 4.10.50-B099, 4.10.55-B107, 4.10.60-E070 |
TSS-320/160 | 5.0 | 5.00-27 |
TSS-320/160 | 5.1, 5.1.15, 5.1.20, 5.1.30, 5.1.35, 5.1.36 | 5.10-63, 5.10.15-AA01, 5.10.20-A059, 5.10.30-B028, 5.10.35-B099, 5.10.36-B104 |
TSS-320/160 | 6.0, 6.0.5 | 6.00.00-B028, 6.00.05-B02, 6.00.05-BC02 |
3. PREREQUISITES
This section details the resources that must be available and the pre-checks that must be
completed prior to physically commencing the procedure.
Resources required
Field engineer on-site equipped with a laptop, a serial debug cable and a LAN cable (the
latter recommended for the ssh connection).
Diagnostic tool (ND_tool release 1.3b23) described in ON-1072 (edition 04)
Preconditions: Node A
EC-320 inserted with no equipment alarms and with no LED indication of hw issues.
Both MT-320 cards with no equipment alarms and with no LED indication of hw
issues.
No provisioning activities ongoing on the node (ZIC, TL1, CLI).
A valid and updated MIB stored on the OMS and ZIC.
Check that alarms can be explained and correlated.
RECOMMENDATION: A HealthCheck should be executed on NodeA in advance. Please
contact Alcatel-Lucent in order to get the HC executed.
NOTE: the estimated time to complete the procedure, including the pre and post checks,
is around two hours.
./hctss_quick_check.sh flcHD
When the snatch has completed, check the srpt log: in case of success, the tag FAILED
does not appear. For the bad blocks in particular, the srpt log contains the
following output (PASSED means OK, FAILED means the partition has block errors):
07:53:14 UTC 05-05-2014; 151.98.28.240; FLC HD check
07:54:24 UTC 05-05-2014; 151.98.28.240; FLC HD check
08:02:13 UTC 05-05-2014; 151.98.28.240; FLC HD check
08:02:14 UTC 05-05-2014; 151.98.28.240; FLC HD check
In the case of FAILED, please open an AR ticket including the generated log files and do not
insert the EC320 spare.
In case of success, skip chapter 5 (How to recover badblocks) and go directly to
chapter 6: EC320 SPARE CARD INSERTION.
NOTE1: while the tool is running, the HWFAIL alarm could be raised on the standby FLC. It is a fake
alarm that disappears after the diagnostic tool has completed the check.
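The PASSED/FAILED decision described above could be automated along the following lines. This is a minimal Python sketch and not part of the ND_tool: the function names are hypothetical, and the only assumption about the srpt log is the one stated in the text, namely that a failing check carries the literal tag FAILED on its line.

```python
def failed_checks(srpt_lines):
    """Return the srpt log lines that carry the FAILED tag."""
    return [line.rstrip("\n") for line in srpt_lines if "FAILED" in line]


def spare_insertion_allowed(srpt_lines):
    """True only when no line of the srpt log is tagged FAILED."""
    return not failed_checks(srpt_lines)
```

For example, `spare_insertion_allowed(open(path))` on the srpt file (where `path` is wherever the tool wrote the log) returns False as soon as one partition reports block errors; in that case the spare must not be inserted and the lines returned by `failed_checks` can be attached to the AR ticket.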
Open a telnet session with the serial debug cable connected to the EC320 ACT.
Configure the EC320 spare before inserting it (also suggested if the NE is configured
in auto-configuration mode).
Insert the EC320 spare card. At this point, according to the issue observed in the field,
the EC320 ACT can sometimes get stuck:
If the following output is observed (lines scrolling at very high speed),
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
EXT3-fs error (device md2): ext3_journal_start_sb: <2>ext3_abort called.
EXT3-fs error (device md2): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
EXT3-fs error (device md2) in start_transaction: Journal has aborted
EXT3-fs error (device md2) in start_transaction: Journal has aborted
EXT3-fs error (device md2) in start_transaction: Journal has aborted
EXT3-fs error (device md2) in start_transaction: Journal has aborted
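A captured console log can be screened for the stuck condition shown above. The following Python sketch is illustrative only: the marker strings are copied from the output above, while the function name and the threshold of five matching lines are assumptions chosen for the example.

```python
# Substrings taken from the console output shown above.
STUCK_MARKERS = (
    "rejecting I/O to dead device",
    "Journal has aborted",
    "Remounting filesystem read-only",
)


def looks_stuck(console_lines, threshold=5):
    """Return True when at least `threshold` lines contain a stuck marker.

    `threshold` is an assumed value; it only avoids reacting to a
    single stray message.
    """
    hits = sum(
        1
        for line in console_lines
        if any(marker in line for marker in STUCK_MARKERS)
    )
    return hits >= threshold
```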
NE release 4.1.60:
NE release 4.1.60:
If the EC-320 spare card insertion behaves as expected, please follow the
remaining steps:
RAID recovery status can be monitored with the following commands:
testraid
cat /proc/mdstat
Repeat the second command from time to time to monitor progress (see the example below).
Figure 1: example of output of testraid command when the disks are not
synchronized.
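Reading the rebuild percentage out of repeated `cat /proc/mdstat` runs can be scripted. The Python sketch below assumes the usual Linux md layout, in which each array starts with a line such as `md2 : active raid1 ...` and a rebuilding array shows a `recovery = NN.N%` or `resync = NN.N%` line. The function name is hypothetical, and the exact output on the FLC may differ from this assumption.

```python
import re

ARRAY_RE = re.compile(r"^(md\d+)\s*:")
PROGRESS_RE = re.compile(r"(recovery|resync)\s*=\s*([\d.]+)%")


def resync_progress(mdstat_text):
    """Map each rebuilding md array to its completion percentage."""
    progress = {}
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            # A new array block starts; remember its name.
            current = header.group(1)
            continue
        found = PROGRESS_RE.search(line)
        if found and current is not None:
            progress[current] = float(found.group(2))
    return progress
```

Calling `resync_progress(open("/proc/mdstat").read())` at intervals gives a dictionary that becomes empty once no array is rebuilding any more.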
It is necessary to kill the process with the following command:
pkill raid-mgt; raid-mgt
Then go to chapter 10 (Troubleshooting).
7. PROCESS TIMELINE
FLC stand-by insertion
Pre-checks (40 minutes)
Contingency (20 min)
1 hour (RAID COPY & RESYNC phase)
NOTE: This timeline ends when the EC-320 spare card has been inserted
and is in service.
8. Exit Criteria
The procedure is considered successfully completed upon confirmation that the spare EC320 has been correctly downloaded and is in service on the TSS-320.
On the local debug console, or in an ssh session at the EC-320 prompt, verify that testraid
shows:
/raid state SYNC
/local disk state PRESENT
/remote disk state PRESENT
In /proc/mdstat, md2, md1 and md0 all report the double U status: [UU]
An example of the information stated above is reported hereafter:
On the local ZIC or embedded ZIC, verify that both EC-320 cards are in service without alarms:
In addition, check that the alarms present before and after the procedure can be explained
and correlated (the objective is to confirm that the condition of the node and its services
is the same before and after the procedure).
9. ANNEX: USB-SATA
SATA device wrong serial number
After a correct plug-in of the spare FLC, the following picture will be present:
Connect to the EC320 ACT via serial debug and enter the following commands:
root@FLC320-1-ACT:/root# cd /proc/cpld/actsby/
root@FLC320-1-ACT:/root# echo 1 > HDD_REQ
Wait a few seconds; you should see output like:
usb 1-1: USB disconnect, address 3
10. ANNEX: Troubleshooting
END OF DOCUMENT