We have cp servers in esams and magru with temperature issues:
vgutierrez@cumin1002:~$ sudo -i cumin 'A:cp' 'zgrep "Core temperature is above threshold, cpu clock is throttled" /var/log/kern.lo* > /dev/null && echo "CPU is getting throttled" || echo "CPU is OK"'
112 hosts will be targeted:
cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1100-1115].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3066-3081].esams.wmnet,cp[7001-7016].magru.wmnet,cp[4037-4052].ulsfo.wmnet
OK to proceed on 112 hosts? Enter the number of affected hosts to confirm or "q" to quit: 112
===== NODE GROUP =====
(6) cp[3071-3072].esams.wmnet,cp[7009,7011,7015-7016].magru.wmnet
----- OUTPUT of 'zgrep "Core temp...echo "CPU is OK"' -----
CPU is getting throttled
===== NODE GROUP =====
(106) cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1100-1115].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3066-3070,3073-3081].esams.wmnet,cp[7001-7008,7010,7012-7014].magru.wmnet,cp[4037-4052].ulsfo.wmnet
----- OUTPUT of 'zgrep "Core temp...echo "CPU is OK"' -----
CPU is OK
================
Impacted hosts:
- cp3071
- cp3072
- cp7009
- cp7011
- cp7015
- cp7016
Note that we already had an SSD crash on cp7015 (T371554).
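
The check above only says whether a host has throttled at all. To gauge how often it happens on each impacted host, the same pattern could be counted rather than just matched. The following is a minimal sketch, not an existing tool: the function name `count_throttle_events` is hypothetical, and it assumes the kernel message text is exactly the string used in the cumin command above.

```shell
#!/bin/bash
# Hypothetical helper (not part of any existing tooling): prints the total
# number of throttle messages found across the given kernel log files.
# zgrep reads both plain and gzip-compressed files, so rotated logs
# matched by /var/log/kern.lo* are included.
count_throttle_events() {
    local pattern="Core temperature is above threshold, cpu clock is throttled"
    local total=0 count file
    for file in "$@"; do
        # Skip arguments that do not exist (e.g. an unexpanded glob).
        [ -e "$file" ] || continue
        # zgrep -c exits non-zero when there are no matches; keep going.
        count=$(zgrep -c -e "$pattern" "$file" 2>/dev/null) || true
        total=$((total + ${count:-0}))
    done
    echo "$total"
}
```

On a single host this would be called as `count_throttle_events /var/log/kern.lo*`. A count near zero versus one in the thousands may help prioritise which of the six hosts to look at first.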