T238751: Only generate maxlag from pooled query service servers.
Closed, ResolvedPublic

Description

In T221774, query service lag started being counted when figuring out the maxlag value.
Everything there is now deployed and working, but the median turns out to not be the best value to use for the lag.

The discovery team pointed us toward https://config-master.wikimedia.org/pybal/eqiad/wdqs and https://config-master.wikimedia.org/pybal/codfw/wdqs, which show the pooled servers.

Using this we would be able to actually base the maxlag for the query service on the maximum lag of pooled servers.

The maintenance script could first query these two config locations, creating a list of instances that we actually care about.
WikimediaPrometheusQueryServiceLagProvider::getLags could then filter out any servers that are not pooled.
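
The filtering step proposed above could look roughly like the following. This is a minimal Python sketch, not the PHP in WikimediaPrometheusQueryServiceLagProvider. It assumes each line of a pybal pool file is a Python-style dict literal with `host` and `enabled` keys. The function names, most host names, and all lag values are made up for illustration.

```python
import ast
from typing import Dict, Optional, Set


def parse_pooled_hosts(pool_config: str) -> Set[str]:
    """Return the hosts marked as enabled in a pybal-style pool file.

    Assumption: one Python-style dict literal per line, for example
    {'host': 'wdqs1004.eqiad.wmnet', 'weight': 10, 'enabled': True}.
    """
    pooled = set()
    for line in pool_config.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        entry = ast.literal_eval(line)
        if entry.get("enabled"):
            pooled.add(entry["host"])
    return pooled


def max_pooled_lag(lags: Dict[str, int], pooled: Set[str]) -> Optional[int]:
    """Return the largest lag among pooled hosts, or None if none match."""
    pooled_lags = [lag for host, lag in lags.items() if host in pooled]
    return max(pooled_lags) if pooled_lags else None


# Hypothetical sample data: wdqs1005 is depooled and far behind.
SAMPLE_POOL = """
{'host': 'wdqs1004.eqiad.wmnet', 'weight': 10, 'enabled': True}
{'host': 'wdqs1005.eqiad.wmnet', 'weight': 10, 'enabled': False}
{'host': 'wdqs1006.eqiad.wmnet', 'weight': 10, 'enabled': True}
"""
SAMPLE_LAGS = {
    "wdqs1004.eqiad.wmnet": 12,
    "wdqs1005.eqiad.wmnet": 7200,
    "wdqs1006.eqiad.wmnet": 30,
}

if __name__ == "__main__":
    pooled = parse_pooled_hosts(SAMPLE_POOL)
    print(max_pooled_lag(SAMPLE_LAGS, pooled))
```

With the sample data, the depooled host's large lag is ignored and only the pooled hosts contribute to the reported value. What to report when no pooled host is found (here `None`) is a design decision this sketch leaves open.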

Event Timeline

Addshore lowered the priority of this task from Medium to Low.
Addshore added a subscriber: Ladsgroup.

@Joe If you get any chance to look at the puppet patch again that would be great!
Any comments that @Ladsgroup and I or others can work toward would be grand, but right now we will wait for that review.

Change 589873 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikidata.org@master] Switch query service maxlag to be median +1

https://gerrit.wikimedia.org/r/589873

Change 589874 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikidata.org@wmf/1.35.0-wmf.28] Switch query service maxlag to be median +1

https://gerrit.wikimedia.org/r/589874

Change 589874 abandoned by Addshore:
Switch query service maxlag to be median +1

https://gerrit.wikimedia.org/r/589874

Change 589873 merged by jenkins-bot:
[mediawiki/extensions/Wikidata.org@master] Switch query service maxlag to be median +1

https://gerrit.wikimedia.org/r/589873

Addshore lowered the priority of this task from Low to Lowest. Jun 15 2021, 3:53 PM

Lowest, as this hasn't come up in quite some time and it is blocked on being able to do this "properly".

Lucas_Werkmeister_WMDE raised the priority of this task from Lowest to Low. Nov 12 2021, 10:45 AM

Bumping priority as it has come up again. Between 5:43 and 8:35 UTC today, wdqs1005 caught up on triples and lag, Wikidata maxlag exceeded 5 s, and bot edits stopped almost completely, which was enough of a decrease in total edit rate that an alert started firing. (I hope that wdqs1005 was actually depooled while it was catching up, but I don’t know how to find that out for sure.)
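
For context on why bot edits stopped: bots conventionally send `maxlag=5` with their MediaWiki API requests, and the API rejects a request with a `maxlag` error while the reported lag is above that value. Below is a minimal Python sketch of the client-side decision. It assumes an error payload with a `code` field and a `Retry-After` response header. The function name and the default wait are illustrative and not taken from any particular bot framework.

```python
from typing import Mapping, Optional


def seconds_to_wait(response: Mapping, headers: Mapping[str, str],
                    default_wait: float = 5.0) -> Optional[float]:
    """Return how long to pause before retrying, or None if no maxlag error."""
    error = response.get("error")
    if not error or error.get("code") != "maxlag":
        return None
    retry_after = headers.get("Retry-After")
    try:
        return float(retry_after) if retry_after is not None else default_wait
    except ValueError:
        # Header present but not a plain number of seconds.
        return default_wait


if __name__ == "__main__":
    lagged = {"error": {"code": "maxlag", "lag": 6.2}}
    print(seconds_to_wait(lagged, {"Retry-After": "5"}))
    print(seconds_to_wait({"edit": {"result": "Success"}}, {}))
```

A well-behaved bot keeps pausing for as long as the error persists, so a single lagging server that feeds into maxlag is enough to hold back bot edits.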

Change 764875 had a related patch set uploaded (by Addshore; author: Addshore):

[operations/puppet@production] Temp remove codfw

https://gerrit.wikimedia.org/r/764875

Change 764875 merged by Ryan Kemper:

[operations/puppet@production] Temp remove codfw from wikidata updateQueryServiceLag check

https://gerrit.wikimedia.org/r/764875

Change 764830 had a related patch set uploaded (by Addshore; author: Addshore):

[operations/puppet@production] Revert "Temp remove codfw from wikidata updateQueryServiceLag check"

https://gerrit.wikimedia.org/r/764830

Lydia_Pintscher raised the priority of this task from Low to High. Edited Mar 30 2022, 4:03 PM
Lydia_Pintscher subscribed.

Changing the priority to high based on today's discussion in the query service sync. This is becoming more important for editors because maxlag is growing when a server is depooled, causing a slowdown of edits for bots.

Change 764830 merged by Bking:

[operations/puppet@production] Revert "Temp remove codfw from wikidata updateQueryServiceLag check"

https://gerrit.wikimedia.org/r/764830

@Joe (also pinging @akosiaris, as I know Joe is out right now).
It seems like the ideal solution, T239392 (Applications and scripts need to be able to understand the pooled status of servers in our load balancers), might not happen for some time.
Would it be possible to resolve this for now with https://gerrit.wikimedia.org/r/c/operations/puppet/+/553097? I believe it would have been "fine"™ for the last 2.5 years, reducing both the need for humans to touch things and the number of issues users see around delayed or broken but depooled wdqs hosts.

That patch is already out of date. lvs2003 is no longer around, which shows exactly why the approach of hardcoding an LVS server in the patch would be problematic.

Do we really need this now that everything is on flink and fancy?

Yes, we've had several occurrences over the last months where editors complained about maxlag being high and it was caused by a server being depooled. (And yes I really believe we need to keep the maxlag coupling.)

I think @Ladsgroup's point is rather that we potentially do not need to include wdqs delay in maxlag now, as the updater is stable and fast.

Indeed, but this was only caused by the fact that this check exists and is broken in a couple of ways.
A depooled server currently has no or minimal user-facing impact by itself; the ONLY impact comes from this check being broken and thus falsely creating maxlag.

Change 797077 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] maintenance::wikidata: Update cron with lb and lb-pool params

https://gerrit.wikimedia.org/r/797077

Given the changes we've made to puppet in the meantime, I am now able to feed the right parameters to the script if we want to.

The following patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/797077

will cause these corresponding changes in puppet: https://puppet-compiler.wmflabs.org/pcc-worker1002/35487/mwmaint1002.eqiad.wmnet/index.html

@Addshore does that look right to you?

The big advantage is that we won't need to change anything when a new LVS server goes into rotation; puppet will take care of updating it, although there will probably be a short interval when a server is not responsive while a new one gets installed. What is the failure mode of the Wikidata code?

I believe that this will do the right thing!

Change 797077 merged by Giuseppe Lavagetto:

[operations/puppet@production] maintenance::wikidata: Update cron with lb and lb-pool params

https://gerrit.wikimedia.org/r/797077

Change 801762 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] Revert "maintenance::wikidata: Update cron with lb and lb-pool params"

https://gerrit.wikimedia.org/r/801762

Sadly I had to revert, because the --lb and --lb-pool options are not recognized by the script.

mwmaint1002:~$ /usr/local/bin/mwscript extensions/Wikidata.org/maintenance/updateQueryServiceLag.php --wiki wikidatawiki --cluster wdqs --prometheus prometheus.svc.eqiad.wmnet --prometheus prometheus.svc.codfw.wmnet --lb-pool wdqs_80 --lb lvs1019:9090 --lb lvs2009:9090
Unexpected option lb-pool!
Unexpected option lb!

Change 801762 merged by Giuseppe Lavagetto:

[operations/puppet@production] Revert "maintenance::wikidata: Update cron with lb and lb-pool params"

https://gerrit.wikimedia.org/r/801762

Right, https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikidata.org/+/552544 needs to go in first!
(Sorry for missing that, @Joe; the extended timeline messed with my head.)

@Addshore Does it mean we need to then de-abandon that change, or should we just create a new patch to re-enable these options? Why was it abandoned in the first place?

I'm not sure if de-abandoning would work (there may be conflicting changes since it was written), but that is certainly worth a try; otherwise, recreate it.

Change 552544 restored by Itamar Givon:

[mediawiki/extensions/Wikidata.org@master] Only generate QS maxlag for pooled servers

https://gerrit.wikimedia.org/r/552544

Joe changed the task status from Open to Stalled. Aug 12 2022, 5:13 AM
Joe removed Joe as the assignee of this task.

Hi, any news on this front? I'll release this bug as its completion doesn't depend on me right now. When the functionality has been merged, please reassign to me.

Change 841136 had a related patch set uploaded (by Hoo man; author: Hoo man):

[mediawiki/extensions/Wikidata.org@master] updateQueryServiceLag: Add lb(-pool) options for forward compatibility

https://gerrit.wikimedia.org/r/841136
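
As the patch title suggests, the idea is to make the script accept `--lb` and `--lb-pool` before any logic depends on them, so that the cron definition and the script can be deployed in either order. A minimal Python sketch of that pattern with `argparse` follows. The real script is a PHP MediaWiki maintenance script; the option names mirror the command line shown earlier in this task, and everything else is illustrative.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="updateQueryServiceLag")
    parser.add_argument("--cluster", required=True)
    parser.add_argument("--prometheus", action="append", default=[])
    # Accepted but not yet used: lets callers pass these options before
    # the pooled-server filtering that consumes them is deployed.
    parser.add_argument("--lb", action="append", default=[])
    parser.add_argument("--lb-pool", default=None)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args([
        "--cluster", "wdqs",
        "--prometheus", "prometheus.svc.eqiad.wmnet",
        "--prometheus", "prometheus.svc.codfw.wmnet",
        "--lb-pool", "wdqs_80",
        "--lb", "lvs1019:9090",
        "--lb", "lvs2009:9090",
    ])
    print(args.lb, args.lb_pool)
```

Once the options parse cleanly, a follow-up change can start using their values without another coordinated rollout.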

Change 841136 merged by jenkins-bot:

[mediawiki/extensions/Wikidata.org@master] updateQueryServiceLag: Add lb(-pool) options for forward compatibility

https://gerrit.wikimedia.org/r/841136

Change 841164 had a related patch set uploaded (by Hoo man; author: Hoo man):

[mediawiki/extensions/Wikidata.org@wmf/1.40.0-wmf.4] updateQueryServiceLag: Add lb(-pool) options for forward compatibility

https://gerrit.wikimedia.org/r/841164

Change 841165 had a related patch set uploaded (by Hoo man; author: Hoo man):

[mediawiki/extensions/Wikidata.org@wmf/1.40.0-wmf.5] updateQueryServiceLag: Add lb(-pool) options for forward compatibility

https://gerrit.wikimedia.org/r/841165

Change 841164 merged by jenkins-bot:

[mediawiki/extensions/Wikidata.org@wmf/1.40.0-wmf.4] updateQueryServiceLag: Add lb(-pool) options for forward compatibility

https://gerrit.wikimedia.org/r/841164

Mentioned in SAL (#wikimedia-operations) [2022-10-11T13:56:47Z] <hoo@deploy1002> Started scap: Backport for [[gerrit:841164|updateQueryServiceLag: Add lb(-pool) options for forward compatibility (T315423 T238751)]]

Mentioned in SAL (#wikimedia-operations) [2022-10-11T13:57:07Z] <hoo@deploy1002> hoo and hoo: Backport for [[gerrit:841164|updateQueryServiceLag: Add lb(-pool) options for forward compatibility (T315423 T238751)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-10-11T14:01:45Z] <hoo@deploy1002> Finished scap: Backport for [[gerrit:841164|updateQueryServiceLag: Add lb(-pool) options for forward compatibility (T315423 T238751)]] (duration: 04m 57s)

Change 841165 merged by jenkins-bot:

[mediawiki/extensions/Wikidata.org@wmf/1.40.0-wmf.5] updateQueryServiceLag: Add lb(-pool) options for forward compatibility

https://gerrit.wikimedia.org/r/841165

Change 844993 had a related patch set uploaded (by Hoo man; author: Hoo man):

[mediawiki/extensions/Wikidata.org@master] Add an integration test for updateQueryServiceLag

https://gerrit.wikimedia.org/r/844993

Change 845016 had a related patch set uploaded (by Hoo man; author: Addshore):

[mediawiki/extensions/Wikidata.org@wmf/1.40.0-wmf.6] Only generate QS maxlag for pooled servers

https://gerrit.wikimedia.org/r/845016

Change 552544 merged by jenkins-bot:

[mediawiki/extensions/Wikidata.org@master] Only generate QS maxlag for pooled servers

https://gerrit.wikimedia.org/r/552544

Change 844993 merged by jenkins-bot:

[mediawiki/extensions/Wikidata.org@master] Add an integration test for updateQueryServiceLag

https://gerrit.wikimedia.org/r/844993

Change 845016 merged by jenkins-bot:

[mediawiki/extensions/Wikidata.org@wmf/1.40.0-wmf.6] Only generate QS maxlag for pooled servers

https://gerrit.wikimedia.org/r/845016

Mentioned in SAL (#wikimedia-operations) [2022-10-20T20:59:37Z] <hoo@deploy1002> Finished scap: Backport for [[gerrit:845016|Only generate QS maxlag for pooled servers (T315423 T238751)]] (duration: 07m 12s)

hoo claimed this task.

Change 553097 abandoned by Majavah:

[operations/puppet@production] Update cron with lb and lb-pool params

Reason:

https://gerrit.wikimedia.org/r/553097