
CPU temperature issues in cp hosts
Open, In Progress, High, Public

Description

We have cp servers in esams and magru with temperature issues:

vgutierrez@cumin1002:~$ sudo -i cumin 'A:cp' 'zgrep "Core temperature is above threshold, cpu clock is throttled" /var/log/kern.lo* > /dev/null && echo "CPU is getting throttled"  || echo "CPU is OK"'
112 hosts will be targeted:
cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1100-1115].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3066-3081].esams.wmnet,cp[7001-7016].magru.wmnet,cp[4037-4052].ulsfo.wmnet
OK to proceed on 112 hosts? Enter the number of affected hosts to confirm or "q" to quit: 112
===== NODE GROUP =====                                                                                                                                                           
(6) cp[3071-3072].esams.wmnet,cp[7009,7011,7015-7016].magru.wmnet                                                                                                                
----- OUTPUT of 'zgrep "Core temp...echo "CPU is OK"' -----                                                                                                                      
CPU is getting throttled                                                                                                                                                         
===== NODE GROUP =====                                                                                                                                                           
(106) cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1100-1115].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3066-3070,3073-3081].esams.wmnet,cp[7001-7008,7010,7012-7014].magru.wmnet,cp[4037-4052].ulsfo.wmnet                                                                                                                                              
----- OUTPUT of 'zgrep "Core temp...echo "CPU is OK"' -----                                                                                                                      
CPU is OK                                                                                                                                                                        
================

impacted hosts:

  • cp3071
  • cp3072
  • cp7009
  • cp7011
  • cp7015
  • cp7016

Note that we already had an SSD crash on cp7015 (T371554)

Event Timeline

@RobH / @wiki_willy could we get this task prioritized on your side?

I'm now looking into these. To confirm: only these specific servers report heat issues, even though they are weighted the same as the other cp hosts within the same fleet?

In the past, heat issues that are sporadic within a fleet have typically been caused by improper application of thermal paste or its degradation over time. I'm going to split the esams items into their own sub-task and open a support case for them. If the issue is fixed by new thermal paste in ESAMS, we'll do the same for MAGRU.

RobH changed the task status from Open to Stalled. Sep 17 2024, 5:20 PM

Stalling the parent task while working on fixing the esams hosts (esams is easier to get parts in and out of than magru, so it makes a better testbed for the repair).

@Vgutierrez: willy mentioned to me in our 1:1 that traffic thought these may be something other than a thermal paste issue and that I should expect an update on this task with details.

As we're planning to move ahead with the thermal paste swap on the two esams hosts next week, should we do something else instead? Please advise.

Apologies for the long text that follows, but the TL;DR is that we think the issues in magru are not confined to just the CPUs on the affected hosts, but rather to the servers themselves and thus possibly the entire rack, given the number of affected hosts.

NVMe temperatures

This task and T374986 document the CPU throttling caused by the increased temperatures, but while digging into this we also observed that the temperature reported by the NVMe drives is higher in magru than in a comparable site, ulsfo, even though magru on average gets almost half the traffic that ulsfo does:

1.png (125 KB)

https://grafana.wikimedia.org/goto/BdVLzEgNR?orgId=1

2.png (87 KB)

https://grafana.wikimedia.org/goto/wajjfUgHg?orgId=1 (comparison between magru, ulsfo, esams)

It seems that in magru, even though we have not hit the warning or critical temperatures for the NVMes (confirmed via a cumin query), we are quite close in some cases if you look at the temperature peaks above:

$ sudo nvme id-ctrl /dev/nvme0n1 # random host in magru, to show the crit/warn temperatures
wctemp    : 343 (69.85 °C)
cctemp    : 350 (76.85 °C)
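
For reference, the NVMe spec reports wctemp/cctemp in Kelvin, which is where the 69.85 °C and 76.85 °C figures above come from. A minimal sketch of that conversion in Python (the observed peak below is illustrative, not a measurement from our hosts):

# Convert the Kelvin thresholds reported by `nvme id-ctrl` to Celsius and
# compare against an observed composite-temperature peak.
def kelvin_to_celsius(kelvin: int) -> float:
    return kelvin - 273.15

wctemp_k = 343          # warning composite temperature threshold (id-ctrl)
cctemp_k = 350          # critical composite temperature threshold (id-ctrl)
observed_peak_c = 65.0  # hypothetical peak taken from the Grafana graphs

warn_c = kelvin_to_celsius(wctemp_k)   # 69.85 °C
crit_c = kelvin_to_celsius(cctemp_k)   # 76.85 °C
print(f"warning at {warn_c:.2f} °C, critical at {crit_c:.2f} °C")
print(f"headroom to warning: {warn_c - observed_peak_c:.2f} °C")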

Given that magru is so far serving traffic for just three countries in South America and is nowhere near its ideal peak capacity, this trend might be worrying. More importantly, it contradicts the assumption that the temperature issues are confined to the CPUs and suggests a problem with the rack(s).

Also affects LVS and DNS hosts in magru

To check that this is not an issue with just the cp hosts, we ran a comparison for the DNS boxes (two each in magru and ulsfo):

3.png (120 KB)

https://grafana.wikimedia.org/goto/iLZVGEgNR?orgId=1

While dns700[12] average ~100 req/sec more than dns400[34], this still does not explain the ~25 °C difference between the hosts at the two sites, so load is unlikely to be a factor here. If we include esams in the comparison, we see that even though esams averages ~1k req/sec per box, the temperatures in magru are still higher.

Similarly, the LVS hosts lvs700[12] are also affected. Note that lvs7003 is not affected, but it is also the backup host and not serving any traffic; even so, it runs ~30 °C hotter than the comparable lvs4010 (the backup in ulsfo).

$ sudo cumin 'A:lvs-magru' 'dmesg -T | grep -i "core temperature is above"'
3 hosts will be targeted:
lvs[7001-7003].magru.wmnet
OK to proceed on 3 hosts? Enter the number of affected hosts to confirm or "q" to quit: 3
===== NODE GROUP =====
(1) lvs7001.magru.wmnet
----- OUTPUT of 'dmesg -T | grep ...rature is above"' -----
[Tue Jun 11 20:48:21 2024] mce: CPU72: Core temperature is above threshold, cpu clock is throttled (total events = 52)
[Wed Jun 26 16:18:21 2024] mce: CPU14: Core temperature is above threshold, cpu clock is throttled (total events = 738)
[Mon Jul 29 08:47:48 2024] mce: CPU68: Core temperature is above threshold, cpu clock is throttled (total events = 1526)
[Mon Jul 29 08:47:48 2024] mce: CPU20: Core temperature is above threshold, cpu clock is throttled (total events = 1522)
[Fri Aug 23 07:17:56 2024] mce: CPU68: Core temperature is above threshold, cpu clock is throttled (total events = 6321)
[Fri Aug 23 07:17:56 2024] mce: CPU20: Core temperature is above threshold, cpu clock is throttled (total events = 6317)
===== NODE GROUP =====
(1) lvs7002.magru.wmnet
----- OUTPUT of 'dmesg -T | grep ...rature is above"' -----
[Wed Jun  5 05:01:11 2024] mce: CPU26: Core temperature is above threshold, cpu clock is throttled (total events = 9)
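
For completeness: the kernel's MCE counters are cumulative, so the same CPU reappearing with a growing "total events" number means the throttling keeps recurring on lvs7001. A small illustrative sketch (not the tooling we actually used) that summarizes such dmesg output per CPU; the sample lines are copied from the lvs7001 output above:

import re

# Keep the latest cumulative "total events" counter seen per CPU.
SAMPLE = """\
[Tue Jun 11 20:48:21 2024] mce: CPU72: Core temperature is above threshold, cpu clock is throttled (total events = 52)
[Fri Aug 23 07:17:56 2024] mce: CPU68: Core temperature is above threshold, cpu clock is throttled (total events = 6321)
[Fri Aug 23 07:17:56 2024] mce: CPU20: Core temperature is above threshold, cpu clock is throttled (total events = 6317)
"""

pattern = re.compile(r"mce: (CPU\d+): Core temperature is above threshold.*total events = (\d+)")
latest: dict[str, int] = {}
for line in SAMPLE.splitlines():
    match = pattern.search(line)
    if match:
        latest[match.group(1)] = int(match.group(2))

for cpu, events in sorted(latest.items(), key=lambda kv: -kv[1]):
    print(f"{cpu}: {events} cumulative throttle events")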

Timeline (and why load is not a factor)

The timeline of setting up magru is as follows:

  1. We finished provisioning and bringing the servers "live" (not serving production traffic) by May 2 2024. SAL
  2. On May 2 2024, we turned on the measure-magru.wikimedia.org domain that points to upload-lb.wikimedia.org. SAL

https://grafana.wikimedia.org/goto/79wLSygHR?orgId=1

We were averaging ~30 rps to the cp servers at this point.

  3. But even then, we were already hitting temperatures in excess of 90 °C a week later, without the site serving any meaningful production traffic.

https://grafana.wikimedia.org/goto/gGW5IsgNR?orgId=1

This again confirms that the issue is not related to load, because we were reaching Tjunction temperatures (and exceeding Tcase) without any real usage of the CPUs.

Ruling out BIOS issues

We started by ruling out a BIOS misconfiguration: we confirmed via Redfish that the system profile (SysProfile) is set to PerfPerWattOptimizedOs on all hosts in magru.

("get", "/redfish/v1/Systems/System.Embedded.1/Bios").json()['Attributes']['SysProfile']

Since the provisioning cookbook was used, it is unlikely that any other settings were missed, but to account for changes in firmware/iDRAC/Redfish we also verified that EnergyPerformanceBias is set to BalancedPerformance and ProcPwrPerf to OsDbpm for all cp hosts.
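
For reference, a minimal standalone sketch of the same Redfish check using plain requests against a Dell iDRAC. The management hostname and credentials below are placeholders (internally this was run through our provisioning/Redfish tooling rather than this exact script):

import requests

# Placeholder iDRAC endpoint and credentials; substitute real values.
IDRAC = "https://cp7001-idrac.example.wmnet"
AUTH = ("root", "changeme")

resp = requests.get(
    f"{IDRAC}/redfish/v1/Systems/System.Embedded.1/Bios",
    auth=AUTH,
    verify=False,  # iDRACs commonly use self-signed certs; adjust as needed
    timeout=30,
)
resp.raise_for_status()
attrs = resp.json()["Attributes"]

# The three BIOS settings verified during this investigation.
checks = [
    ("SysProfile", "PerfPerWattOptimizedOs"),
    ("EnergyPerformanceBias", "BalancedPerformance"),
    ("ProcPwrPerf", "OsDbpm"),
]
for key, expected in checks:
    actual = attrs.get(key)
    status = "OK" if actual == expected else "MISMATCH"
    print(f"{key}: {actual} (expected {expected}) -> {status}")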

Summary

Based on the above, here is a summary of the current observations:

  1. The temperature issues affect not only the CPUs but also the NVMes, so thermal paste on the CPUs is unlikely to be the issue.
  2. The issue is not limited to the cp hosts in magru; it also extends to the DNS hosts, the Ganeti cluster, and the LVSes, where there is a ~30 °C difference.
  3. Load is unlikely to be a factor, given that magru serves the least traffic of all sites, for every cluster: cp, DNS, LVS, Ganeti.
  4. The BIOS settings/provisioning also seem to be an unlikely culprit.
  5. The issue may become more pronounced if we shift more countries to magru.

Thanks for providing all the details on this, @ssingh. @RobH - as we chatted about earlier today, we could ask Ascenty to double-check that there are enough perf tiles in the cold aisle, confirm that the blanking panels are in place (and if not, add them), and possibly get a temperature and humidity reading in that area. Thanks, Willy

Draft body of support request for magru temp investigation: https://docs.google.com/document/d/1T-XwSS_Rwfb9nfC1aHQW4AjptLjxiviyZfGFdFcowZY/edit?usp=sharing

Copied below for ease of reference, but any suggested edits should take place on the google doc:

Support,

We're seeing higher temperature levels from our servers in our racks at your facility than expected when compared to our other sites. When we check the servers' intake temperatures, we're seeing a large divergence between hosts, anywhere from 19°C to 25°C, within the same rack.

We would like to ask for a temperature investigation on our two racks to check for the following items:

  • Ensure blanking panels are installed on the following U spaces. If no panels are installed, does Ascenty provide them for use? If so, please install them onto:
    • B3: U: 1,15-33, 35-36, 38-42, 44-46. Please ensure no blanking panels on U34, 37, 43.
    • B4: U: 1, 14-33, 35-36, 38-39, 41-42, 44-46. Please ensure no blanking panels on U34, 37, 40, 43.
  • Please take temperature measurements after blanking panel installation and adjust perforated floor tiles as needed to ensure all points in the rack (lower, middle, top) are receiving the same level of cooling.

Once the above is complete, please let us know whether panels were installed (if possible, snap some photos) and whether floor tiles had to be adjusted or temps were already consistent across the rack.

I'm going to keep the draft document open and simple-english it more over the next couple of hours before I submit into the Ascenty portal.

Opened ticket CS1011077 for the above updated google doc draft.

Hi @RobH: Any follow-up from Ascenty on when they plan on installing the blanking panels? Thanks!


The panels were installed successfully at the end of last week; we should see better temps out of magru now.

Additionally, the two esams hosts had their CPU thermal paste reapplied about 7 hours ago, so they should stop throttling due to temp issues.

Unfortunately, it appears that we're still having throttling issues in magru:

brett@cumin2002:~$ sudo -i cumin 'A:cp' 'zgrep "Core temperature is above threshold, cpu clock is throttled" /var/log/kern.lo* > /dev/null && echo "CPU is getting throttled"  || echo "CPU is OK"'
112 hosts will be targeted:
cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1100-1115].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3066-3081].esams.wmnet,cp[7001-7016].magru.wmnet,cp[4037-4052].ulsfo.wmnet
OK to proceed on 112 hosts? Enter the number of affected hosts to confirm or "q" to quit: 112
===== NODE GROUP =====
(7) cp1109.eqiad.wmnet,cp[7002,7005,7009,7011,7013,7016].magru.wmnet
----- OUTPUT of 'zgrep "Core temp...echo "CPU is OK"' -----
CPU is getting throttled
===== NODE GROUP =====
(105) cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1100-1108,1110-1115].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3066-3081].esams.wmnet,cp[7001,7003-7004,7006-7008,7010,7012,7014-7015].magru.wmnet,cp[4037-4052].ulsfo.wmnet
----- OUTPUT of 'zgrep "Core temp...echo "CPU is OK"' -----
CPU is OK

max by(instance) (ipmi_temperature_celsius{instance=~"^cp7.*"}) over the last 24 hours yields:

Instance   Temperature (°C)
cp7001     85
cp7002     90
cp7003     88
cp7004     91
cp7005     91
cp7006     87
cp7007     91
cp7008     88
cp7009     90
cp7010     86
cp7011     89
cp7012     91
cp7013     90
cp7014     89
cp7015     81
cp7016     91
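
For reference, the table above can be reproduced outside Grafana by querying the Prometheus HTTP API directly. A minimal sketch with a placeholder Prometheus URL; the metric is wrapped in max_over_time so a single instant query covers the 24-hour window that the dashboard time range provided:

import requests

# Placeholder Prometheus endpoint; substitute the real site-local instance.
PROM = "http://prometheus.example.wmnet/ops"
QUERY = 'max by (instance) (max_over_time(ipmi_temperature_celsius{instance=~"^cp7.*"}[24h]))'

resp = requests.get(f"{PROM}/api/v1/query", params={"query": QUERY}, timeout=30)
resp.raise_for_status()
results = resp.json()["data"]["result"]

print(f"{'Instance':<20} Temperature (°C)")
for result in sorted(results, key=lambda r: r["metric"]["instance"]):
    instance = result["metric"]["instance"]
    temperature = float(result["value"][1])
    print(f"{instance:<20} {temperature:.0f}")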

Some observations:

Has the BIOS version disparity been tested?


That stinks! I'll have to open a ticket with them to confirm the blanking panels have been installed and to report our temp issues so they can investigate on their end, now that our reshuffle is done.

RobH added a subtask: Restricted Task. Dec 11 2024, 6:56 PM
BCornwall changed the task status from Stalled to In Progress. Dec 12 2024, 7:11 PM
BCornwall triaged this task as High priority.