+ Add/request New Update: 13170581 - ING BANK Umraniye
2019-03-27
05:39
Notes
Time (SP) Significant Events (note: all these events repeat at various times, but there are too many to list)
14:44:58 first instance of CD23.28 SSP DMA transfer errors against port 0xD on both DS 1c & 2c
:: first instance of 0B3E.01 SET_DA_CHK_RESET bit set on both DS
:: first instance of D52E IOBS_ABORTED_INTERNAL status received from CDI and we see disk connect
messages
:: first instance of BF3E.54 Drive PHY Hard Reset against multiple drives.
14:44:59 first instance of C03E.87 DAE override timer set as eses_state on port 0x17 is < 10 but >= 6
14:34:01 all ESES structure/information is rebuilt on port 0x17
14:34:08 first instance of 0322 task 20 failure for reason_code: E_PHYS_REASON_SAS_NO_DAE_CONNECTION
:: first SCSI check condition from port 0x17 with TARGET OPERATING CONDITIONS HAVE CHANGED sense
(06/3F)
14:34:18 second instance of 0322 task 20 failure for reason_code:
E_PHYS_REASON_SAS_NO_DAE_CONNECTION
14:34:28 3rd instance of 0322 task 20 failure for reason_code: E_PHYS_REASON_SAS_NO_DAE_CONNECTION
14:34:38 4th instance of 0322 task 20 failure for reason_code: E_PHYS_REASON_SAS_NO_DAE_CONNECTION
:: first instance of 700B.C6 port 0xD is disabled by task 20 on both DS for reason code No DAE Connection (on one port)
:: 0B3E.02 CLEAR_DA_CHK_RESET
14:34:41 first instance of BE38.1C to indicate we had a generation code change from 6 to 7.
14:36:21 first SCSI check condition from port 0x13 with POWER ON, RESET, OR BUS DEVICE RESET
OCCURRED sense (06/29)
14:36:24 B53E.02 on port 0xD indicates that our LP task recovered the link to port 0xD as the LCC was powered up again
14:36:32 first 7380.02 from the IMs stating DAE 6 is "inserted" as the LCC recovers temporarily
14:40:31 first instance of BC2B.57 errors with logged return code IOBS_DMA_ERROR from CDI when trying to get page 2
14:41:21 first instance of 700B.C7 port 0xD is disabled by task 20 on both DS for reason code ESES Ready Timeout.
14:41:51 first instance of BE38.1C indicating we had a gen code change on LCC A; other gencode changes are logged in the DLOG
14:42:05 first instance of DE01.CA stuck config bit set on the IO expander for LCC B
14:43:23 first instance of DE01.CC stuck config bit clear on the IO expander for LCC B
14:44:17 first instance of C03E.81/91 DAE/RAID unavailable after an ESES request timeout across both
DS
14:44:19 first instance of C03E.80/90 DAE/RAID Available again across both DS
14:45:23 last instance of BE38.1C on both DS signifying a gencode change 1C->1D
14:45:32 last instance of BE38.99 from LCC A signifying that our ESES request timed out after 3 secs
14:45:34 last instance of C03E.81/91 signifying RAID/DAE unavailability across both DS on port 0x17
14:45:38 02B1.A0, Vault trigger is ON, NTCV is set and the system vault is activated
:: 02B1.14/15 vaulting due to loss of access to the DAE(s)
Gary Ruby
2019-03-27
05:36
Final RCA
Date: 27-Mar-19
Customer: ING BANK UMRANIYE
SR Number: 13170581
Code Level: 5977_1131/0041
Engineers Involved: Gary Ruby, Tim Gaspar
Escalation Engineer: Kevin Crowley
Surge OPT: 552586
KB: KB 531191 has been drafted by CS and is in the process of being published.
The issue started with a HW fault on LCC B of DAE 6 in SB-1. The LCC was experiencing constant resets at a frequency of about 20 seconds, having started off at a lower frequency. The LCC failed and the part was shut down soon after. As the LCC recovered we re-enabled the port, but the board would fail multiple times again.
Each time the LCC failed, we had to rebuild our ESES info/data structure due to a generation code change (signaled by the FW when it has to rebuild its configuration information internally due to some change in the DAE), so the eses_state of our data structure dropped below 10 (ES_READY) during the rebuild. At around 14:45:29 SP time we went to retrieve the configuration information from CDES on LCC A, where we have to send an ESES request (a SCSI 1C command to read the SES pages), but we got request timeouts (BE38.99) on our ESES requests (they took longer than 3 seconds) as the FW on LCC A was going through the process of attempting to get information from the failing LCC B. LCC B was sending "back off" requests to LCC A as it was in the process of booting, so the "good" LCC was giving the failing LCC B more time before it replied to our SES request.
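For context, the ESES request above is an ordinary SCSI RECEIVE DIAGNOSTIC RESULTS (opcode 0x1C) held to a 3-second deadline. A minimal sketch of that path, assuming hypothetical names (eses_send_cdb, log_event, EVT_ESES_REQ_TIMEOUT); the real ucode interfaces are not shown in this OPT:

/*
 * Sketch (not the actual ucode): build a SCSI RECEIVE DIAGNOSTIC RESULTS
 * (opcode 0x1C) CDB to read an SES page from the LCC, hold the reply to
 * the 3-second deadline, and log the BE38.99 "ESES request timed out"
 * event on expiry. The transport hooks are hypothetical.
 */
#include <stddef.h>
#include <stdint.h>

#define ESES_REQ_TIMEOUT_SEC 3         /* per the RCA: requests > 3 s time out */
#define EVT_ESES_REQ_TIMEOUT 0xBE3899  /* stands in for the BE38.99 DLOG code  */

/* hypothetical transport hook: returns 0 on success, nonzero on failure */
extern int eses_send_cdb(int port, const uint8_t *cdb, size_t cdb_len,
                         uint8_t *buf, size_t buf_len, int timeout_sec);
extern void log_event(uint32_t code, int port);

int eses_read_ses_page(int port, uint8_t page_code, uint8_t *buf, size_t len)
{
    uint8_t cdb[6] = { 0 };

    cdb[0] = 0x1C;                /* RECEIVE DIAGNOSTIC RESULTS             */
    cdb[1] = 0x01;                /* PCV=1: return the page named in byte 2 */
    cdb[2] = page_code;           /* SES page to fetch (e.g. config page)   */
    cdb[3] = (uint8_t)(len >> 8); /* allocation length, big-endian          */
    cdb[4] = (uint8_t)len;

    if (eses_send_cdb(port, cdb, sizeof(cdb), buf, len,
                      ESES_REQ_TIMEOUT_SEC) != 0) {
        log_event(EVT_ESES_REQ_TIMEOUT, port);
        return -1;
    }
    return 0;
}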
The DS has a low-priority task that checks the vault triggers (RAID availability, DAE availability, power zones, etc.) every 5 seconds, and if our eses_state is NOT 10 on both ports to the DAE we will pull the trigger on the next occurrence of the LP task (it sets the override timer for 5 seconds, giving us a small window to get out of the vault condition). In this instance, as the good LCC A was in the process of rebuilding its ESES structure due to the generation code change, its eses_state was < 6 while the rebuild was in progress, and the rebuild took over 5 seconds. When the LP task ran, the state on port 0xD on both DS was an initialization state as the board was booting, and our good LCC had not rebuilt its tables yet because the CDES FW was spending far too long on requests to LCC B. As a result of the eses_state on both peer ports not being at the required state, the LP task set the NTV (need to vault) trigger and the box vaulted as it found that DAE 6 was unavailable.
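A sketch of that LP-task check, under only the assumptions stated in this RCA (5-second period, eses_state of 10 = ES_READY, a one-period override window); all identifiers are illustrative, not the shipped ucode:

/*
 * Sketch of the low-priority vault-trigger check described above. Every
 * 5 seconds the task samples eses_state on both DS ports to the DAE;
 * ES_READY (10) means the ESES structures are valid. The first bad pass
 * arms the 5-second override timer (the grace period); if the DAE is
 * still not ready on the next pass, the NTV (need-to-vault) trigger is set.
 */
#include <stdbool.h>

#define ES_READY            10   /* eses_state value once structures are rebuilt */
#define LP_TASK_PERIOD_SEC   5   /* LP task runs every 5 seconds                 */

struct dae_ports {
    int  eses_state_a;           /* state seen through DS 1c's port   */
    int  eses_state_b;           /* state seen through DS 2c's port   */
    bool override_armed;         /* 5-second grace timer already set  */
};

extern void arm_override_timer(int secs);
extern void set_ntv_trigger(void);    /* pulls the vault trigger */

/* called once per LP-task pass, i.e. every LP_TASK_PERIOD_SEC */
void lp_vault_check(struct dae_ports *dae)
{
    bool ready = (dae->eses_state_a == ES_READY) ||
                 (dae->eses_state_b == ES_READY);

    if (ready) {                      /* at least one path is healthy        */
        dae->override_armed = false;
        return;
    }
    if (!dae->override_armed) {       /* first bad pass: grant grace period  */
        arm_override_timer(LP_TASK_PERIOD_SEC);
        dae->override_armed = true;
        return;
    }
    set_ntv_trigger();                /* still not ready: DAE unavailable    */
}

The key point is visible in the sketch: the DAE only has to stay below ES_READY on both ports for two consecutive passes, roughly 5-10 seconds, for NTV to be set, which is exactly the window the slow CDES re-init blew through.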
After extensive testing, a new CDES revision, 15.60, was developed to trim down the time spent by CDES, so it should respond to the re-init sequence much more quickly, within our 5-second window.
• An augmented LCC was inserted into a test frame to inject I2C errors and IO expander resets on one LCC to mimic the constant resetting and TWI errors.
• We could get the LCC to reset at a set frequency by injecting a reset command over the serial port on the LCC. At FW 15.17 we could reliably get the box to vault at even a 40s reset cycle.
• We eliminated the TWI injection as well, and narrowed the problem down to just the resets.
Gary Ruby
2019-03-25
08:47
pre-final sent in, updating for SLO
Gary Ruby
2019-03-21
07:51
pre-final was already sent in. dropping severity to stop SLO warnings.
Severity For Version 5977.1131.1131 Changed from Critical/High to Medium.
2019-03-19
08:18
pre-final mostly done, finishing it off
Gary Ruby
2019-03-15
08:01
still in pre-final, had to deal with other stuff all week.
Gary Ruby
2019-03-13
12:41
************ Surge Request ************
KB: 531191 is in draft for this issue, will write it up and publish tomorrow.
Martin Hayes
2019-03-11
08:02
in pre-fianl
Gary Ruby
2019-03-08
08:06
555152 with relationship duplicate is added to the Related Problems List.
Martin Hayes
2019-03-07
14:20
Can we have a ucode OPT opened up for new CDES firmware to fix this issue?
Problem keys added:
Returned_Copy_OPT
Status For Version 5977.1131.1131 Changed from Assigned to Returned.
Timothy Gaspar
2019-03-07
08:05
Discussion with dev on new FW and possible release to happen later today US time.
Gary Ruby
2019-03-05
09:34
No issues seen in Vmax on both boxes below with 15.50a running over the weekend. Will schedule a meeting for this week to discuss next steps.
1. Modified LCC recreation running with 15.50a
• Injected every 20 seconds. Will run over the weekend and stop all host traffic if Vmax gets a 3-second ESES timeout or our re-init sequence takes > 5 seconds after receiving a gen code change on the expander.
2. Customer LCC in the Sybil room, different box.
• All LCCs in this box are at 15.50a, including the customer LCC. Will run over the weekend with temp at 100F.
• As in recreation 1, will stop host traffic on an ESES timeout or a re-init ESES sequence > 5 seconds.
Gary Ruby
2019-02-28
10:15
Still working with CDES Engineering to fix our issue. After many debug firmwares over the past month, this week I was given 15.50a, which ran for 24 hours with no failures. The recreation was injecting power-off to my modified LCC every 20 seconds. In past recreations I could recreate our Vmax issue within the hour. This is good news and we will continue to try to recreate and work with CDES Engineering. The next step is to change the injection time to every 10 seconds.
15.50a Recreation:
• Inject power-off every 20 seconds. Ran for 24 hours, no issues.
o Serial logs attached to Jira during this run.
• I verified in the Vmax logs that we did NOT see any Vmax ESES timeouts (BE38.99) of 3 seconds for one command, and we didn't see our ucode re-init ESES sequence take > 5 seconds after a gen code change (06/3F) was received from one expander on the good LCC.
o Our re-init ESES sequence sends multiple ESES requests to the expander that just reported the gen code change.
• In the customer issue and previous recreations I would get the 3-second VMAX timeout for one ESES command, and our re-init sequence would take > 7 seconds.
• With 15.50a I see our re-init sequence for one expander take less than 2 seconds for all ESES requests.
Gary Ruby
2019-02-28
08:03
overnight testing from Tuesday seemed fine as we checked it yesterday; awaiting last night's results now as well.
Gary Ruby
2019-02-27
08:29
Still working with CDES Eng. We are getting debug firmware from CDES every other day and recreating our issue with the modified LCC. CDES is working to understand the problem and develop a firmware fix.
Timothy Gaspar
2019-02-27
08:03
running tests on CDES FW 15.50f; it looks to help significantly with the rolling reboots and the timeouts on the good LCC.
Gary Ruby
2019-02-25
06:51
case is still in recreation for CDES.
Gary Ruby
✉ SURGE Support
https://surge.isus.emc.com/surge/ViewOPT.php?OPTNum=552586+ 3/7
11/12/21, 3:03 PM 552586
2019-02-22
06:17
new debug FW is still being tested
Gary Ruby
2019-02-20
12:52
Waiting on new firmware from CDES to fix the issue. The last firmware didn't fix it. The customer LCC is still running in the temp room at 100 degrees and we see the same behavior we saw at the customer site. The customer LCC is rebooting when the temperature in the DAE is increased.
Timothy Gaspar
2019-02-20
07:43
ongoing CDES investigations.
Gary Ruby
2019-02-18
08:16
updating because of SLO; still going back and forth with CDES in the background.
Gary Ruby
2019-02-14
08:02
Still working with CDES Eng. We have loaded multiple debug firmware builds to find root cause. Ongoing.
The customer LCC does show a rolling reboot like our modified LCC when it arrived at 176. The customer LCC is running in the Sybil room with temp at 100 degrees, trying to recreate the issue.
Timothy Gaspar
2019-02-11
08:01
ongoing testing with CDES debug fix to nail the problem down with the delays in the thread processing.
Gary Ruby
2019-02-06
09:29
(INTERNAL ONLY!!!!)
from CDES:
"I have attached 15.50e. It contains a potential fix for the issue where the EMA thread constantly retries, even when the p2p send is failing with the 0x1201e (BACKOFF) error. Please give it a try. I would like to see if preventing the EMA thread from constantly retrying alone alleviates your issue or if we need to also prevent the ThreadMgr thread from constantly retrying.
[^CDES_Bundled_FW_15_50e.bin]"
seems they are working on a possible issue, but since we STILL don't have the customer LCC, not sure if this is something new, or just another hidden issue.
Gary Ruby
"I don't see the signature where ses_peer_new_config() takes a long time which is what I expected with my
"fix". So basically my previous debug build showed that both ThreadMgr thread and EMA thread were
2019-02-05 excessively retrying P2P transactions on 0x1201e. My "fix" in 15.50d tried to eliminate the EMA thread
retries which were slowing down ses_peer_new_config(). Although I don't see anything obvious in the serial
08:09
log yet, it is possible that ThreadMgr thread retrying excessively is also slowing down SES responses.
Gary Ruby Could you please kick off another run?"
it would seem that the slowdown in response to our SES requests to update our ESES pages might be coming
from the ThreadMgr thread of CDES. we are continuing to attempt recreations here. it still looks like CDES
was hung here either way.
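To make the CDES quote concrete, a hypothetical sketch of the retry behavior being discussed; p2p_send, the retry cap, and the wrapper are all illustrative, as the real CDES source is not in this OPT:

/*
 * Illustration of the behavior described above: retrying a P2P
 * transaction to the booting peer LCC without bound is what starved
 * ses_peer_new_config(); capping the retries bounds the worst-case
 * delay so SES responses to the director can still meet the 3-second
 * deadline. The cap value here is assumed, not the actual fix value.
 */
#include <stddef.h>

#define P2P_ERR_BACKOFF 0x1201e  /* peer asks us to back off (per the quote) */
#define P2P_MAX_RETRIES 3        /* assumed cap on retries after the first try */

extern int p2p_send(const void *msg, size_t len);  /* hypothetical transport */

int p2p_send_bounded(const void *msg, size_t len)
{
    /* one initial attempt plus at most P2P_MAX_RETRIES retries */
    for (int attempt = 0; attempt <= P2P_MAX_RETRIES; attempt++) {
        int rc = p2p_send(msg, len);
        if (rc != P2P_ERR_BACKOFF)
            return rc;           /* success, or a hard error to report */
    }
    /* peer is still booting: give up instead of starving other threads */
    return P2P_ERR_BACKOFF;
}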
2019-02-01
08:05
still recreating with CDES. it seems we are spending a lot of time in the ses_peer_new_config() function in CDES, which would be CDES trying to ping the peer LCC when it finds it has a new config page. this could have led to the timeouts on the "good" LCC here. still investigating.
Gary Ruby
2019-01-30
13:00
Still working on recreation with the modified LCC. The customer LCC should be arriving soon and we will put it in my system; when it does, we will pull the LCC logs off to have CDES Eng review them.
Timothy Gaspar
2019-01-28
08:44
!!!!!!!!!!!
note: Eng and L2 CS information only; misrepresenting the good and the bad here will lead to confusion. don't take this as something you can go back to folks with.
!!!!!!!!!!!
recreation ongoing. we can reliably recreate the issue in house using a doctored LCC to kill power, even at 15.50, when we trigger every 20 secs, but we are engaged with CDES eng as to why. the problem doesn't recreate at 15.50 with 60 sec injections.
CDES asked for a few more tests to be run, and we are drilling down into what's going on with CDES during these vault situations to see what, if anything, can be done.
with the box not recreating at 15.50 on an overnight test with the 60 sec injection, I think 15.50 could still provide a lot of relief to most customers, but it seems we were going every 20 secs in the customer case, so we still need to proceed here.
Gary Ruby
2019-01-25
10:26
Update:
Working with CDES Eng.
Still working on recreation with the modified LCC.
Waiting on FA for the LCC.
Reviewing the customer logs and CDES logs.
Timothy Gaspar
2019-01-25
08:15
working on my timeline again.
Gary Ruby
2019-01-22
03:52
worked an ongoing escalation OPT yesterday
Gary Ruby
2019-01-18
06:04
got another case to review today, but made a bit more headway on my own timeline for a write-up. I have to pore over the logs to insert all the relevant moments and filter out some of the "noise" that came with the incident.
Gary Ruby
2019-01-16
08:02
meeting with CDES Engineering later. still working on the timeline
Gary Ruby
2019-01-14
11:26
working on setting up a detailed timeline for a write-up at a later date, and to set the events up for myself. we are still going back and forth with CDES eng on a few questions. will get all the bits together first and then start a write-up at some point in the week, hopefully.
Gary Ruby
2019-01-11
10:14
following up with CDES on a timed-out SCSI command. some good ideas came from our interlock with Dev that we want to follow up on to try and drill down.
Gary Ruby
2019-01-10
14:55
Still reviewing the logs and ucode. Like Gary stated above, we see I2C errors in the CDES logs.
Timothy Gaspar
2019-01-09
12:01
I'm seeing some I2C errors in the ATRC/ELOG at around the time our box dropped; would like to see the logs off the returned LCC as well.
we also have open questions out to CDES eng. we have a fairly good understanding of what happened here, but we still need to dig through some of the other investigations before we can call it a day.
Gary Ruby
SURGE Upload
2019-01-09
03:55 EQA Engineer changed from BLANK to Kevin Crowley
Kevin Crowley
2019-01-08
17:30
Gary updated Surge 552536 instead of this one. His update is below.
at about 14:47:28 we get a generation code change on ESES 8 on port 0x17 (LCC A, DAE 5) and we have to reset our ESES DB and relearn the environment. the ESES gets through to state 5, which is ESES_STATE_INIT_SEQUENCE, but gets stuck in this initialization stage for over 5 seconds, which triggers the vault.
waiting on ATRC and ELOG info to verify any issues in LCC A for DAE 5, which is just prior on the chain to DAE 6 where we had a bad LCC. this blocked access to DAE 6 from both sides, which triggered the vault.
additionally, due to not getting the 08F2.xx lifesigns information, we won't be able to get any information on why DS 2c dropped DD. the info is flushed from the buffers now, so it's a dead investigation before it starts.
Timothy Gaspar
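For reference, the state numbers that keep coming up in this OPT (5, 6, 10) fit together as below. This is a reconstruction inferred only from the values named in this log; the placeholder names are not the actual ucode enum:

/*
 * Hypothetical reconstruction of the eses_state values referenced in
 * this OPT: state 5 is called ESES_STATE_INIT_SEQUENCE above, 10 is
 * ES_READY, and the C03E.87 override logic distinguishes states >= 6
 * from < 6. Intermediate names are placeholders.
 */
enum eses_state {
    /* ... states 0-4: discovery/reset stages (names unknown) ... */
    ESES_STATE_INIT_SEQUENCE = 5,  /* where the DAE 5 rebuild got stuck      */
    ESES_STATE_PARTIAL       = 6,  /* placeholder: >= 6 lets C03E.87 arm     */
                                   /* the DAE override timer                 */
    /* ... intermediate rebuild stages 7-9 (names unknown) ... */
    ES_READY                 = 10, /* structures rebuilt; vault check passes */
};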
2019-01-08
09:22
updating problem keys and reviewing supplied logs. I'll be the Eng owner for this case.
Problem keys added:
DUDL
Gary Ruby
Customer RCA
Martin,
Can you open Surge and assign it to Gary Ruby and collect the additional logs below?
Thank you
Tim
SURGE Upload
SURGE Upload
2019-01-08
05:20
The status of this OPT has been changed to Assigned.
Martin Hayes
Current Impact:
The array vaulted 9 times; once the LCC was removed from the array, it stabilized.
Expectations of Engineering:
Work with TS2 once a review is sent in.