
OPERATIONAL NOTICE

SUBJECT: ON:1114-OMSN-1850TSS-320/160: Procedure for insertion of a
protection FLC card.
SEVERITY: Major
REQUIRED ACTION: Mandatory. Follow the attached MOP to safely insert an FLC
when building a redundant FLC configuration.
REFERENCE ID: OND_OMSN_TSS_0714_1114
DATE OF ISSUE: July 11th, 2014
EDITION: 02
AUTHOR: P. Kaligaric
APPROVAL: C. Colombo
CONTACT: paolo_vittorio.kaligaric@alcatel-lucent.com
1. Reason for the issue

Ed02: 17/11/2014 Changes in the MOP:

NOTE: while the tool is running, an HWFAIL alarm could be raised on the standby FLC. It is a
false alarm that disappears after the diagnosis tool has completed the check.

2. Introduction

Inserting an FLC to build a redundant FLC configuration requires the correct procedure
(attached MOP) to prevent the FLC from entering a never-ending synchronization loop caused by
a bad sector (or bad block) on the HDD of the active FLC.
A bad sector on the FLC's HDD can stay hidden when it is not on a part of the HDD used by
the application SW, but it can still inhibit the correct synchronization of the FLCs (HDD mirroring).
Affected product: 1850TSS-320, 1850TSS-160
Affected releases: 3.x, 4.x, 5.x, 6.x
Involved cards: First Level Controller (EC320)
Reference AR: AR 1-5226437

OND_OMSN_TSS_0714_1114
Alcatel-Lucent Internal
Proprietary Use pursuant to Company instruction

cod. 3AL92110AA**

3. Problem Description

A bad sector (or bad block) is a sector on a hard disk drive (HDD) that cannot be used (the OS
is unable to access it successfully) due to permanent damage, e.g. physical damage to the disk surface.
When there is FLC redundancy, bad blocks are automatically recovered (the redundant
disk controller remaps the logical sector to a different physical sector). However, if the bad block
affects a portion of the HDD that the SW does not use, or the working FLC is not in a redundant
configuration, the bad block stays hidden and becomes a potential silent failure when a new FLC
is later inserted (to create FLC redundancy). If this occurs, the HDD synchronization
fails, leaving the working FLC in a never-ending restart loop or, in the worst case, causing the
active FLC to restart, with the newly inserted FLC becoming active without the correct RAID card
synchronization and with unpredictable consequences, e.g. the DB is blanked or becomes corrupt.

4. Recommendations
Apply the attached MOP to ensure the safe and correct insertion of a
new FLC when building a redundant FLC configuration.

5. Disclaimer
The information is believed to be accurate at the time of publishing based on currently available
information. Use of the information constitutes acceptance for use in an AS IS condition. There
are no warranties with regard to this information. Neither the author nor the publisher accepts
any liability for any direct, indirect, or consequential loss or damage arising from use of, or
reliance on, this information.


Method of Operation Procedure (MOP)


EC-320 spare card insertion

DISTRIBUTION LIST
Alcatel-Lucent

ABSTRACT
This document provides the Method Of Operation Procedure (MOP) for EC-320 spare card
insertion.

ED   DATE         CHANGE NOTE   APPROVAL     ORIGINATOR(S)
01   30/05/2014   Creation      C. Colombo   V. Mascolo
02   10/06/2014   Creation      C. Colombo   V. Mascolo
03   20/06/2014   Creation      C. Colombo   V. Mascolo
04   11/07/2014   Creation      C. Colombo   V. Mascolo


REVIEW






30/05/2014, Creation
10/06/2014, changes Ed.2 (edition only for DTAG)
20/06/2014, changes Ed.3 (edition only for release 4.1.60)
11/07/2014, changes Ed.4 (edition for all releases)
17/11/2014, changes to signal that a false HWFAIL alarm could be raised on the standby
FLC while the tool is running.


TABLE OF CONTENTS
1. GLOSSARY
2. PURPOSE
3. PREREQUISITES
4. EC320 SPARE CARD INSTALLATION - AUTOMATIC CHECK VIA SNATCH TOOL
5. HOW TO RECOVER THE FLC BADBLOCKS FOUND
6. EC-320 SPARE CARD INSERTION
7. PROCESS TIMELINE
8. EXIT CRITERIA
9. ANNEX: USB-SATA DEVICE WRONG SERIAL NUMBER
10. ANNEX: TROUBLESHOOTING


1. Glossary

MOP   Method Of Procedure
TEC   Alcatel-Lucent third level Technical Excellence Centre
R&D   Research and Development
LDC   Local Data Controller
FLC   First Level Controller
LED   Light Emitting Diode


2. Purpose
This document provides the Method Of Operation Procedure (MOP) for EC-320 spare card
insertion for all releases, including the improvements introduced in Rel. 4.1.60.
This document refers to the following Operational Notice: ON-1114-OMSN-1850TSS320/160: Maintenance procedure for FLC spare card insertion.
This MOP is applicable to the 1850TSS320/160 with the following releases, according to
ND_tool release 1.3b23:
Equipment     NE rel.   SWpackage                    Snatch
TSS-320/160   3.2.      3.16.
TSS-320/160   3.2.3     3.23.02
TSS-320/160   3.2.4     3.24-06
TSS-320/160   3.2.5     3.25-01
TSS-320/160   3.2.6     3.26-04
TSS-320/160   3.2.7     3.27-06
TSS-320/160   3.4       3.41.12
TSS-320/160   3.4.1     3.42-40
TSS-320/160   3.4.2     3.43-03
TSS-320/160   3.4.3     3.43-04
TSS-320/160   3.4.4     3.44-04
TSS-320/160   3.4.5     3.45-06
TSS-320/160   3.4.6     3.46-04
TSS-320/160   3.4.7     3.47-35
TSS-320/160   3.4.8     3.48-02
TSS-320/160   3.4.9     3.49-09
TSS-320/160   3.4.10    3.49-10
TSS-320/160   3.4.20    3.49-11
TSS-320/160   3.6       3.60.35
TSS-320/160   3.6.1     3.61-09
TSS-320/160   3.6.2     3.62-01
TSS-320/160   3.6.3     3.63-54
TSS-320/160   3.6.4     3.64-02
TSS-320/160   3.6.5     3.65-01
TSS-320/160   4.0       4.00.99
TSS-320/160   4.0.1     4.01-01
TSS-320/160   4.0.2     4.02-17
TSS-320/160   4.0.3     4.03-20
TSS-320/160   4.0.4     4.04-03
TSS-320/160   4.1       4.10-24
TSS-320/160   4.1.1     4.11-30
TSS-320/160   4.1.2     4.12-11
TSS-320/160   4.1.3     4.13-10
TSS-320/160   4.1.40    4.10.40-B023
TSS-320/160   4.1.45    4.10.45-B055
TSS-320/160   4.1.47    4.10.47-B066
TSS-320/160   4.1.50    4.10.50-B099
TSS-320/160   4.1.55    4.10.55-B107
TSS-320/160   4.1.60    4.10.60-E070
TSS-320/160   5.0       5.00-27
TSS-320/160   5.1       5.10-63
TSS-320/160   5.1.15    5.10.15-AA01
TSS-320/160   5.1.20    5.10.20-A059
TSS-320/160   5.1.30    5.10.30-B028
TSS-320/160   5.1.35    5.10.35-B099
TSS-320/160   5.1.36    5.10.36-B104
TSS-320/160   6.0       6.00.00-B028
TSS-320/160   6.0.5     6.00.05-B02, 6.00.05-BC02


3. PREREQUISITES
This section details the resources that must be available and the pre-checks that must be
completed prior to physically commencing the procedure.

Resources required

Field engineer on site, equipped with a laptop, a serial debug cable and a LAN cable (the
latter is recommended for the ssh connection).
Diagnostic tool (ND_tool release 1.3b23) described in ON-1072 (edition 04)

Preconditions: Node A

EC-320 inserted with no equipment alarms and with no LED indication of hw issues.
Both MT-320 cards with no equipment alarms and with no LED indication of hw
issues.
No provisioning activities ongoing on the node (ZIC, TL1, CLI).
A valid and updated MIB stored on the OMS and ZIC.
Check that alarms can be explained and correlated.
RECOMMENDATION: A HealthCheck (HC) should be executed on Node A in advance. Please
contact Alcatel-Lucent to have the HC executed.

NOTE: the estimated time to complete the procedure, including the pre-checks and
post-checks, is around two hours.


4. EC320 Spare Card Installation - AUTOMATIC Check via snatch tool


NOTE: the procedure will be executed on site by personnel; nevertheless, some preliminary
operations are required, as described below:
a. TEC will provide the TSS320DIAGNOSIS hctss_quick_check tool. Install and run it a
few days in advance to check the general status of the FLC ACT.
b. Run the script with the option flc:

./hctss_quick_check.sh flc

When the snatch has completed, check the srpt log. If you find the tag FAILED anywhere,
please open an AR ticket including the generated log files and do not
insert the EC320 spare.
c. As the day of the FLC spare insertion approaches, make sure that no provisioning
activity is in progress and no users are logged in. Then run the script with the option flcHD
to check that there are no bad blocks on the FLC ACT hard disk:

./hctss_quick_check.sh flcHD

When the snatch has completed, check the srpt log; in the case of success, you will
not find the tag FAILED. Concerning the bad blocks in detail, you will find the
following output in the srpt (PASSED means OK, FAILED means the partition has block errors):
07:53:14 UTC 05-05-2014; 151.98.28.240; FLC HD check sda1 partition any bad block found PASSED
07:54:24 UTC 05-05-2014; 151.98.28.240; FLC HD check sda2 partition any bad block found PASSED
08:02:13 UTC 05-05-2014; 151.98.28.240; FLC HD check sda3 partition any bad block found PASSED
08:02:14 UTC 05-05-2014; 151.98.28.240; FLC HD check sda4 partition any bad block found PASSED

In the case of FAILED, please open an AR ticket including the generated log files and do not
insert the EC320 spare.
In case of success, skip chapter 5 (How to recover the FLC badblocks found) and go directly to
chapter 6: EC-320 SPARE CARD INSERTION.

NOTE1: while the tool is running, an HWFAIL alarm could be raised on the standby FLC. It is a
false alarm that disappears after the diagnosis tool has completed the check.
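
The pass/fail rule for the srpt log can be sketched as a small script for the field laptop. This is only an illustration, not part of the TEC tool: it assumes that each check is reported on a single line ending with the PASSED or FAILED tag, as in the sample above, and the function names are hypothetical.

```python
# Minimal sketch: scan srpt report text for the FAILED tag.
# Assumption: one check per line, tagged PASSED or FAILED as in the sample.

SAMPLE_SRPT = """\
07:53:14 UTC 05-05-2014; 151.98.28.240; FLC HD check sda1 partition any bad block found PASSED
07:54:24 UTC 05-05-2014; 151.98.28.240; FLC HD check sda2 partition any bad block found PASSED
08:02:13 UTC 05-05-2014; 151.98.28.240; FLC HD check sda3 partition any bad block found PASSED
08:02:14 UTC 05-05-2014; 151.98.28.240; FLC HD check sda4 partition any bad block found PASSED
"""


def failed_checks(srpt_text):
    """Return every report line that carries the FAILED tag."""
    return [line.strip() for line in srpt_text.splitlines() if "FAILED" in line]


def spare_insertion_allowed(srpt_text):
    """True only when no line of the report is tagged FAILED."""
    return not failed_checks(srpt_text)


if __name__ == "__main__":
    print("Spare insertion allowed:", spare_insertion_allowed(SAMPLE_SRPT))
```

If any line is returned by failed_checks, the rule of this chapter applies: open an AR ticket with the log files and do not insert the EC320 spare.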


5. HOW TO RECOVER the FLC BADBLOCKS FOUND


If bad blocks are found, either via the automatic tool or via a manual check, we
recommend replacing the FLC according to the following steps:
a) Prepare in the ALU LAB an FLC that is free of bad blocks, with the same SWPKG, the same MIB
and the same XCOMM parameters as the FLC to be replaced.
b) Go to the field and replace the errored FLC with the new one.
c) Bring the errored FLC back to the ALU LAB and provide a remote session connection via
serial debug cable.
d) TEC will take care of performing some checks on the disk health status and, if possible,
recovering the bad blocks.


6. EC-320 SPARE CARD INSERTION


ATTENTION: before plugging in the EC card, read this chapter carefully!
Before the EC320 spare insertion it is MANDATORY to:

Open a telnet session with the serial debug cable connected to the EC320 ACT.

Configure the EC320 spare before inserting it (suggested also if the NE is configured
in auto-configuration mode).

Insert the EC320 spare card. At this point, according to the issue observed in the field, the
EC320 ACT may sometimes get stuck:
If the following output is observed (lines scrolling at very high speed),
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
scsi0 (0:0): rejecting I/O to dead device
EXT3-fs error (device md2): ext3_journal_start_sb: <2>ext3_abort called.
EXT3-fs error (device md2): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
EXT3-fs error (device md2) in start_transaction: Journal has aborted
EXT3-fs error (device md2) in start_transaction: Journal has aborted
EXT3-fs error (device md2) in start_transaction: Journal has aborted
EXT3-fs error (device md2) in start_transaction: Journal has aborted

the next action depends on the NE release:

NE release 3.x, 4.x, 5.x, 6.x (except 4.1.60):
URGENTLY UNPLUG the EC320 spare card just inserted and reset the EC320
ACT via the reset button (the button close to the R symbol on the FLC).
In this condition the procedure is stopped; go to chapter
10. ANNEX: Troubleshooting.
Attention: if the inserted FLC is not removed IMMEDIATELY, there is a
potential risk that it takes control of the NE, with a serious impact on
traffic, since it has no MIB inside. Please highlight this possible condition
to the field operations people.

NE release 4.1.60:
DO NOT UNPLUG the EC320 spare card just inserted; wait for the
EC320 ACT to reboot automatically, auto-recovering the normal condition.

Otherwise the procedure can continue as follows:

after the EC320 ACT reboots, wait for the EC320 to reach the FLC_IN_SERVICE status (about
15 minutes).
The synchronization process will then start automatically and, in the case of success,
you will soon see both orange LEDs blinking, signaling that the FLC alignment process has
started.
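
The release-dependent decision for a stuck EC320 ACT can be summarized in a few lines of code. This is a minimal sketch for illustration only: the marker strings are taken from the console output quoted in this chapter, while the function names, the action labels and the min_hits threshold are assumptions and not part of any Alcatel-Lucent tool.

```python
# Minimal sketch of the decision after the spare insertion.
# Assumptions: function names, action labels and min_hits are illustrative.

# Console messages quoted in this chapter for a stuck EC320 ACT.
STUCK_MARKERS = ("rejecting I/O to dead device", "Journal has aborted")


def act_looks_stuck(console_text, min_hits=3):
    """True when enough console lines contain one of the stuck markers."""
    hits = sum(
        1
        for line in console_text.splitlines()
        if any(marker in line for marker in STUCK_MARKERS)
    )
    return hits >= min_hits


def next_action(ne_release, stuck):
    """Map the NE release and the observed state to the action of this MOP."""
    if not stuck:
        return "continue"          # wait for FLC_IN_SERVICE, then for the sync
    if ne_release == "4.1.60":
        return "wait"              # do not unplug; the ACT reboots on its own
    return "unplug_and_reset"      # unplug the spare at once, reset the ACT
```

For example, next_action("3.4.5", True) yields "unplug_and_reset", while next_action("4.1.60", True) yields "wait".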
NOTE about the issue related to the USB-SATA device wrong serial number:

NE release 3.x, 4.x, 5.x, 6.x (except 4.1.60):
The issue related to the USB-SATA device wrong serial number is still
present; please see the detailed description in chapter
9. ANNEX: USB-SATA device wrong serial number.

NE release 4.1.60:
The issue related to the USB-SATA device wrong serial number (described in
chapter 9 of this document) has been solved.
The software will automatically recover the expected serial number; only if
this is unsuccessful will the EC320 spare be declared in HWFAIL and the
synchronization process stopped.

If the EC-320 spare card insertion behaves as expected, please follow the
remaining steps:
RAID recovery status can be monitored with the following commands:
testraid
cat /proc/mdstat

Repeat the second command from time to time to monitor progress (see the example below).

Figure 1: example of output of testraid command when the disks are not
synchronized.

Figure 2: example of output of cat /proc/mdstat command when the synchronization
is in progress.


Figure 3: example of output of cat /proc/mdstat command when the synchronization
is completed.

When the synchronization is completed (see Figure 3) go to chapter 8.


NOTE: If the synchronization does not conclude successfully (i.e. the output does not
match Figure 3), the following condition will be present:
cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda3[0]
1075136 blocks [2/2] [_U]
bitmap: 0/132 pages [0KB], 4KB chunk
md1 : active raid1 sdb1[0] F sda1[1]
1049472 blocks [2/2] [_U]
bitmap: 0/129 pages [0KB], 4KB chunk
md0 : active raid1 sdb2[0] F sda2[1]
1049536 blocks [2/2] [_U]
bitmap: 1/129 pages [4KB], 4KB chunk
unused devices: <none>

It is necessary to kill the process with the following command: pkill raid-mgt; raid-mgt
and then go to chapter 10. ANNEX: Troubleshooting.
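
Instead of reading the cat /proc/mdstat output by eye, the member status of each array can be extracted programmatically. The following is a minimal sketch, assuming the text has been captured from the console session; the function names are hypothetical and the sample is the failed condition quoted above.

```python
import re

# Failed condition quoted in this chapter (every array shows [_U]).
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md2 : active raid1 sda3[0]
1075136 blocks [2/2] [_U]
bitmap: 0/132 pages [0KB], 4KB chunk
md1 : active raid1 sdb1[0] F sda1[1]
1049472 blocks [2/2] [_U]
bitmap: 0/129 pages [0KB], 4KB chunk
md0 : active raid1 sdb2[0] F sda2[1]
1049536 blocks [2/2] [_U]
bitmap: 1/129 pages [4KB], 4KB chunk
unused devices: <none>
"""


def array_states(mdstat_text):
    """Map each md array name to its member-status string, e.g. 'UU' or '_U'."""
    states = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        status = re.search(r"\[([U_]+)\]", line)
        if current and status:
            states[current] = status.group(1)
            current = None
    return states


def all_synced(mdstat_text):
    """True when at least one array is listed and none has a missing member."""
    states = array_states(mdstat_text)
    return bool(states) and all("_" not in state for state in states.values())
```

With the sample above, array_states reports "_U" for md2, md1 and md0, so all_synced returns False; it returns True only when every array reports [UU].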


7. PROCESS TIMELINE

Figure: process timeline. Phases shown: pre-checks (40 minutes), FLC stand-by insertion,
contingency (20 min), RAID COPY & RESYNC phase (1 hour).

NOTE: This timeline runs until the EC-320 spare card has been inserted
and is in service.


8. Exit Criteria
The procedure is considered successfully completed upon confirmation that the spare EC320 has been correctly downloaded and is in service on the TSS-320.
On the local debug console, or in an ssh session at the EC-320 prompt, verify that testraid
shows:
/raid state SYNC
/local disk state PRESENT
/remote disk state PRESENT
In /proc/mdstat, md2, md1 and md0 must all report the double-U indicator: [UU]
An example of the information above is reported hereafter:

On the local ZIC or embedded ZIC, verify that both EC-320 cards are in service without alarms:


Verify the FileVolumeStatus at the /LTAG and /RTAG paths:

Expected value: /LTAG/FileVolumeStatus.cfg is CONGRUENT;
Expected value: /RTAG/FileVolumeStatus.cfg is CONGRUENT;
Example:

In addition, check that the alarms before and after the procedure can be explained
and correlated (the objective is to confirm that the condition of the node and services is the
same before and after the procedure).
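
The testraid part of the exit criteria can also be checked mechanically. This is a minimal sketch: it assumes that the testraid output contains lines of the form "/raid state SYNC", as listed in this chapter, and the function name is hypothetical.

```python
import re

# Expected values taken from the exit criteria of this chapter.
EXPECTED_STATES = {
    "raid": "SYNC",
    "local disk": "PRESENT",
    "remote disk": "PRESENT",
}


def find_testraid_mismatches(testraid_text):
    """Return (item, observed value) pairs that differ from the expected state.

    The observed value is None when the item is not found in the text.
    """
    mismatches = []
    for item, wanted in EXPECTED_STATES.items():
        found = re.search(rf"/{item} state\s+(\S+)", testraid_text)
        value = found.group(1) if found else None
        if value != wanted:
            mismatches.append((item, value))
    return mismatches
```

An empty list means that the three testraid states match the exit criteria; the /proc/mdstat, ZIC and FileVolumeStatus checks still have to be done separately.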


9. ANNEX: USB-SATA device wrong serial number
After a correct plug-in of the spare FLC, the following picture will be present:

In the case the Serial Number is M0000000000000 instead of the one in the above picture
(1234567890ABCDEF), please operate as follows on the EC320 ACT:
a. Connect to the EC320 ACT via serial debug and enter the following commands:
b. root@FLC320-1-ACT:/root# cd /proc/cpld/actsby/
c. root@FLC320-1-ACT:/root# echo 1 > HDD_REQ
d. wait some seconds; you should see an output like:
usb 1-1: USB disconnect, address 3
e. wait some seconds, 3 or 4.
f. root@FLC320-1-ACT:/root# echo 0 > HDD_REQ
g. wait some seconds; you should see an output like:
usb 1-1: new high speed USB device using ehci_hcd and address 4
usb 1-1: Product: USB2.0 Storage Device
usb 1-1: Manufacturer: Cypress Semiconductor
usb 1-1: SerialNumber: 1234567890ABCDEF

If so, the operation to reconfigure the USB-SATA device has been performed successfully and
the synchronization process will start normally; go to chapter 8.
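
The serial number check of this annex can be sketched as a small parser for captured console text. This is an illustration only: it assumes that the kernel messages contain a "SerialNumber:" field as in the output above, and the function names are hypothetical.

```python
import re

# Wrong serial number described in this annex.
PLACEHOLDER_SERIAL = "M0000000000000"


def usb_serial(console_text):
    """Return the last SerialNumber reported in the console text, or None."""
    matches = re.findall(r"SerialNumber:\s*(\S+)", console_text)
    return matches[-1] if matches else None


def needs_hdd_req_toggle(console_text):
    """True when the USB-SATA device reports the wrong serial number."""
    return usb_serial(console_text) == PLACEHOLDER_SERIAL
```

When needs_hdd_req_toggle returns True, the HDD_REQ sequence of steps a to g applies; the last reported serial number is used so that the check can be repeated on the same capture after step g.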


10. ANNEX: Troubleshooting

In case of unexpected problems, such as:

Redundant EC320 is not correctly displayed in ZIC
Synchronization cannot be completed
Unexpected rebooting of EC320
Any other kind of strange behavior

please inform Alcatel-Lucent by opening an Assistance Request (AR Cares ticket).

END OF DOCUMENT
