
CPU temperature issues in cp hosts
Open, In Progress, High, Public

Description

We have cp servers in esams and magru with temperature issues:

vgutierrez@cumin1002:~$ sudo -i cumin 'A:cp' 'zgrep "Core temperature is above threshold, cpu clock is throttled" /var/log/kern.lo* > /dev/null && echo "CPU is getting throttled"  || echo "CPU is OK"'
112 hosts will be targeted:
cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1100-1115].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3066-3081].esams.wmnet,cp[7001-7016].magru.wmnet,cp[4037-4052].ulsfo.wmnet
OK to proceed on 112 hosts? Enter the number of affected hosts to confirm or "q" to quit: 112
===== NODE GROUP =====                                                                                                                                                           
(6) cp[3071-3072].esams.wmnet,cp[7009,7011,7015-7016].magru.wmnet                                                                                                                
----- OUTPUT of 'zgrep "Core temp...echo "CPU is OK"' -----                                                                                                                      
CPU is getting throttled                                                                                                                                                         
===== NODE GROUP =====                                                                                                                                                           
(106) cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1100-1115].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3066-3070,3073-3081].esams.wmnet,cp[7001-7008,7010,7012-7014].magru.wmnet,cp[4037-4052].ulsfo.wmnet                                                                                                                                              
----- OUTPUT of 'zgrep "Core temp...echo "CPU is OK"' -----                                                                                                                      
CPU is OK                                                                                                                                                                        
================

impacted hosts:

  • cp3071
  • cp3072
  • cp7009
  • cp7011
  • cp7015
  • cp7016

Note that we already had an SSD crash on cp7015 (T371554)

Event Timeline

@RobH / @wiki_willy could we get this task prioritized on your side?

I'm now looking into these. To confirm: only these specific servers report heat issues, even though they are weighted the same as the other cp hosts within the same fleet?

In the past, heat issues that are sporadic within a fleet have typically been caused by improper application of thermal paste or its degradation over time. I'm going to split the esams items into their own sub-task and open a support case for them. If the issue is fixed by new thermal paste in ESAMS, we'll do the same for MAGRU.

RobH changed the task status from Open to Stalled. Sep 17 2024, 5:20 PM

Stalling the parent task while working on fixing the esams hosts (esams is easier to get parts in and out of than magru, so it makes a better testbed for the repair).

@Vgutierrez: willy mentioned to me in our 1:1 that traffic thought these may be something other than a thermal paste issue and that I should expect an update on this task with details.

As we're planning to move ahead with the thermal paste swap on the two esams hosts next week, should we do something else instead? Please advise.

Apologies for the long text that follows, but the TL;DR is that we think the issues in magru are not confined to just the CPUs on the affected hosts, but rather to the servers themselves and thus possibly the entire rack, given the number of affected hosts.

NVMe temperatures

This task and T374986 document the CPU throttling caused by the increased temperatures, but while digging into this we also observed that the temperature reported by the NVMe drives is higher in magru than in a comparable site, ulsfo, even though magru on average gets almost half the traffic that ulsfo does:

1.png (125 KB)

https://grafana.wikimedia.org/goto/BdVLzEgNR?orgId=1

2.png (87 KB)

https://grafana.wikimedia.org/goto/wajjfUgHg?orgId=1 (comparison between magru, ulsfo, esams)

It seems that in magru, even though we have not hit the warning or critical temperatures for the NVMes (confirmed via a cumin query), we are quite close in some cases if you look at the temperature peaks above:

$ sudo nvme id-ctrl /dev/nvme0n1 # random host in magru, to show the crit/warn temperatures
wctemp    : 343 (69.85 °C)
cctemp    : 350 (76.85 °C)
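
For reference, the NVMe spec reports wctemp/cctemp in Kelvin, which is where the 69.85 °C and 76.85 °C figures above come from. A minimal sketch of that conversion in Python (the observed peak below is illustrative, not a measurement from our hosts):

# Convert the Kelvin thresholds reported by `nvme id-ctrl` to Celsius and
# compare against an observed composite-temperature peak.
def kelvin_to_celsius(kelvin: int) -> float:
    return kelvin - 273.15

wctemp_k = 343          # warning composite temperature threshold (id-ctrl)
cctemp_k = 350          # critical composite temperature threshold (id-ctrl)
observed_peak_c = 65.0  # hypothetical peak taken from the Grafana graphs

warn_c = kelvin_to_celsius(wctemp_k)   # 69.85 °C
crit_c = kelvin_to_celsius(cctemp_k)   # 76.85 °C
print(f"warning at {warn_c:.2f} °C, critical at {crit_c:.2f} °C")
print(f"headroom to warning: {warn_c - observed_peak_c:.2f} °C")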

Given that magru is so far serving traffic for just three countries in South America and is nowhere near its ideal peak capacity, this trend might be worrying. More importantly, it contradicts the assumption that the temperature issues are confined to the CPUs and suggests a problem with the rack(s).

Also affects LVS and DNS hosts in magru

To check that this is not an issue with just the cp hosts, we ran a comparison for the DNS boxes (two each in magru and ulsfo):

3.png (120 KB)

https://grafana.wikimedia.org/goto/iLZVGEgNR?orgId=1

While dns700[12] average ~100 req/sec more than dns400[34], this still does not explain the ~25 °C difference between the hosts at the two sites, so load is unlikely to be a factor here. If we include esams in the comparison, we see that even though esams averages ~1k req/sec per box, the temperatures in magru are still higher.

Similarly, the LVS hosts lvs700[12] are also affected. Note that lvs7003 is not affected, but it is also the backup host and not serving any traffic; even so, it runs ~30 °C hotter than the comparable lvs4010 (the backup in ulsfo).

$ sudo cumin 'A:lvs-magru' 'dmesg -T | grep -i "core temperature is above"'
3 hosts will be targeted:
lvs[7001-7003].magru.wmnet
OK to proceed on 3 hosts? Enter the number of affected hosts to confirm or "q" to quit: 3
===== NODE GROUP =====
(1) lvs7001.magru.wmnet
----- OUTPUT of 'dmesg -T | grep ...rature is above"' -----
[Tue Jun 11 20:48:21 2024] mce: CPU72: Core temperature is above threshold, cpu clock is throttled (total events = 52)
[Wed Jun 26 16:18:21 2024] mce: CPU14: Core temperature is above threshold, cpu clock is throttled (total events = 738)
[Mon Jul 29 08:47:48 2024] mce: CPU68: Core temperature is above threshold, cpu clock is throttled (total events = 1526)
[Mon Jul 29 08:47:48 2024] mce: CPU20: Core temperature is above threshold, cpu clock is throttled (total events = 1522)
[Fri Aug 23 07:17:56 2024] mce: CPU68: Core temperature is above threshold, cpu clock is throttled (total events = 6321)
[Fri Aug 23 07:17:56 2024] mce: CPU20: Core temperature is above threshold, cpu clock is throttled (total events = 6317)
===== NODE GROUP =====
(1) lvs7002.magru.wmnet
----- OUTPUT of 'dmesg -T | grep ...rature is above"' -----
[Wed Jun  5 05:01:11 2024] mce: CPU26: Core temperature is above threshold, cpu clock is throttled (total events = 9)
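
For completeness: the kernel's MCE counters are cumulative, so the same CPU reappearing with a growing "total events" number means the throttling keeps recurring on lvs7001. A small illustrative sketch (not the tooling we actually used) that summarizes such dmesg output per CPU; the sample lines are copied from the lvs7001 output above:

import re

# Keep the latest cumulative "total events" counter seen per CPU.
SAMPLE = """\
[Tue Jun 11 20:48:21 2024] mce: CPU72: Core temperature is above threshold, cpu clock is throttled (total events = 52)
[Fri Aug 23 07:17:56 2024] mce: CPU68: Core temperature is above threshold, cpu clock is throttled (total events = 6321)
[Fri Aug 23 07:17:56 2024] mce: CPU20: Core temperature is above threshold, cpu clock is throttled (total events = 6317)
"""

pattern = re.compile(r"mce: (CPU\d+): Core temperature is above threshold.*total events = (\d+)")
latest: dict[str, int] = {}
for line in SAMPLE.splitlines():
    match = pattern.search(line)
    if match:
        latest[match.group(1)] = int(match.group(2))

for cpu, events in sorted(latest.items(), key=lambda kv: -kv[1]):
    print(f"{cpu}: {events} cumulative throttle events")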

Timeline (and why load is not a factor)

The timeline of setting up magru is as follows:

  1. We finished provisioning and bringing the servers "live" (not serving production traffic) by May 2 2024. SAL
  2. On May 2 2024, we turned on the measure-magru.wikimedia.org domain that points to upload-lb.wikimedia.org. SAL

https://grafana.wikimedia.org/goto/79wLSygHR?orgId=1

We were averaging ~30 rps to the cp servers at this point.

  3. But even then, we were already hitting temperatures in excess of 90 °C a week later, without the site serving any meaningful production traffic.

https://grafana.wikimedia.org/goto/gGW5IsgNR?orgId=1

This again confirms that the issue is not related to load, because we were reaching Tjunction temperatures (and exceeding Tcase) without any real usage of the CPUs.

Ruling out BIOS issues

We started by ruling out a BIOS misconfiguration: we confirmed via Redfish that the system profile (SysProfile) is set to PerfPerWattOptimizedOs on all hosts in magru.

("get", "/redfish/v1/Systems/System.Embedded.1/Bios").json()['Attributes']['SysProfile']

Since the provisioning cookbook was used, it is unlikely that any other settings were missed, but to account for changes in firmware/iDRAC/Redfish we also verified that EnergyPerformanceBias is set to BalancedPerformance and ProcPwrPerf to OsDbpm for all cp hosts.
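
For reference, a minimal standalone sketch of the same Redfish check using plain requests against a Dell iDRAC. The management hostname and credentials below are placeholders (internally this was run through our provisioning/Redfish tooling rather than this exact script):

import requests

# Placeholder iDRAC endpoint and credentials; substitute real values.
IDRAC = "https://cp7001-idrac.example.wmnet"
AUTH = ("root", "changeme")

resp = requests.get(
    f"{IDRAC}/redfish/v1/Systems/System.Embedded.1/Bios",
    auth=AUTH,
    verify=False,  # iDRACs commonly use self-signed certs; adjust as needed
    timeout=30,
)
resp.raise_for_status()
attrs = resp.json()["Attributes"]

# The three BIOS settings verified during this investigation.
checks = [
    ("SysProfile", "PerfPerWattOptimizedOs"),
    ("EnergyPerformanceBias", "BalancedPerformance"),
    ("ProcPwrPerf", "OsDbpm"),
]
for key, expected in checks:
    actual = attrs.get(key)
    status = "OK" if actual == expected else "MISMATCH"
    print(f"{key}: {actual} (expected {expected}) -> {status}")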

Summary

Based on the above, here is a summary of the current observations:

  1. The temperature issues affect not only the CPUs but also the NVMes, so thermal paste on the CPUs is unlikely to be the issue.
  2. The issue is not limited to the cp hosts in magru; it also extends to the DNS hosts, the Ganeti cluster, and the LVSes, where there is a ~30 °C difference.
  3. Load is unlikely to be a factor, given that magru serves the least traffic of all sites, for every cluster: cp, DNS, LVS, Ganeti.
  4. The BIOS settings/provisioning also seem to be an unlikely culprit.
  5. The issue may become more pronounced if we shift more countries to magru.

Thanks for providing all the details on this, @ssingh. @RobH - as we chatted about earlier today, we could ask Ascenty to double-check that there are enough perf tiles in the cold aisle, confirm that the blanking panels are in place (and if not, add them), and possibly get a temperature and humidity reading in that area. Thanks, Willy

Draft body of support request for magru temp investigation: https://docs.google.com/document/d/1T-XwSS_Rwfb9nfC1aHQW4AjptLjxiviyZfGFdFcowZY/edit?usp=sharing

Copied below for ease of reference, but any suggested edits should take place on the google doc:

Support,

We're seeing higher temperature levels from our servers in our racks at your facility than expected when compared to our other sites. When we check the servers' intake temperatures, we're seeing a large divergence between hosts, anywhere from 19°C to 25°C, within the same rack.

We would like to ask for a temperature investigation on our two racks to check for the following items:

  • Ensure blanking panels are installed on the following U spaces. If no panels are installed, does Ascenty provide them for use? If so, please install them onto:
    • B3: U: 1,15-33, 35-36, 38-42, 44-46. Please ensure no blanking panels on U34, 37, 43.
    • B4: U: 1, 14-33, 35-36, 38-39, 41-42, 44-46. Please ensure no blanking panels on U34, 37, 40, 43.
  • Please take temperature measurements after blanking panel installation and adjust perforated floor tiles as needed to ensure all points in the rack (lower, middle, top) are receiving the same level of cooling.

Once the above is complete, please let us know whether panels were installed (if possible, snap some photos) and whether floor tiles had to be adjusted or temps were already consistent across the rack.

I'm going to keep the draft document open and simple-english it more over the next couple of hours before I submit into the Ascenty portal.

Opened ticket CS1011077 for the above updated google doc draft.

Hi @RobH: Any follow-up from Ascenty on when they plan on installing the blanking panels? Thanks!


The panels were installed successfully at the end of last week; we should see better temps out of magru now.

Additionally, the two esams hosts had their CPU thermal paste reapplied about 7 hours ago, so they should stop throttling due to temp issues.

Unfortunately, it appears that we're still having throttling issues in magru:

brett@cumin2002:~$ sudo -i cumin 'A:cp' 'zgrep "Core temperature is above threshold, cpu clock is throttled" /var/log/kern.lo* > /dev/null && echo "CPU is getting throttled"  || echo "CPU is OK"'
112 hosts will be targeted:
cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1100-1115].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3066-3081].esams.wmnet,cp[7001-7016].magru.wmnet,cp[4037-4052].ulsfo.wmnet
OK to proceed on 112 hosts? Enter the number of affected hosts to confirm or "q" to quit: 112
===== NODE GROUP =====
(7) cp1109.eqiad.wmnet,cp[7002,7005,7009,7011,7013,7016].magru.wmnet
----- OUTPUT of 'zgrep "Core temp...echo "CPU is OK"' -----
CPU is getting throttled
===== NODE GROUP =====
(105) cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1100-1108,1110-1115].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3066-3081].esams.wmnet,cp[7001,7003-7004,7006-7008,7010,7012,7014-7015].magru.wmnet,cp[4037-4052].ulsfo.wmnet
----- OUTPUT of 'zgrep "Core temp...echo "CPU is OK"' -----
CPU is OK

max by(instance) (ipmi_temperature_celsius{instance=~"^cp7.*"}) over the last 24 hours yields:

Instance   Temperature (°C)
cp7001     85
cp7002     90
cp7003     88
cp7004     91
cp7005     91
cp7006     87
cp7007     91
cp7008     88
cp7009     90
cp7010     86
cp7011     89
cp7012     91
cp7013     90
cp7014     89
cp7015     81
cp7016     91
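
For reference, the table above can be reproduced outside Grafana by querying the Prometheus HTTP API directly. A minimal sketch with a placeholder Prometheus URL; the metric is wrapped in max_over_time so a single instant query covers the 24-hour window that the dashboard time range provided:

import requests

# Placeholder Prometheus endpoint; substitute the real site-local instance.
PROM = "http://prometheus.example.wmnet/ops"
QUERY = 'max by (instance) (max_over_time(ipmi_temperature_celsius{instance=~"^cp7.*"}[24h]))'

resp = requests.get(f"{PROM}/api/v1/query", params={"query": QUERY}, timeout=30)
resp.raise_for_status()
results = resp.json()["data"]["result"]

print(f"{'Instance':<20} Temperature (°C)")
for result in sorted(results, key=lambda r: r["metric"]["instance"]):
    instance = result["metric"]["instance"]
    temperature = float(result["value"][1])
    print(f"{instance:<20} {temperature:.0f}")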

Some observations:

Has the BIOS version disparity been tested?


That stinks! I'll have to open a ticket with them to confirm the blanking panels have been installed and to report our temp issues so they can investigate on their end, now that our reshuffle is done.

RobH added a subtask: Restricted Task. Dec 11 2024, 6:56 PM
BCornwall changed the task status from Stalled to In Progress. Dec 12 2024, 7:11 PM
BCornwall triaged this task as High priority.