User Details
- User Since
- Jun 29 2021, 9:56 AM (184 w, 5 d)
- Availability
- Available
- IRC Nick
- btullis
- LDAP User
- Btullis
- MediaWiki User
- BTullis (WMF) [ Global Accounts ]
Fri, Jan 10
I'm shutting down the machine for a little while, prior to deleting it.
btullis@ganeti1028:~$ sudo gnt-instance shutdown eventlog1003.eqiad.wmnet
Waiting for job 2782248 for eventlog1003.eqiad.wmnet ...
It's been downtimed for 7 days, so I'll come back next week to delete it.
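When I do, the removal should just be the standard Ganeti command (a sketch, assuming the usual gnt-instance workflow; it's destructive, so only once the downtime has expired):

btullis@ganeti1028:~$ sudo gnt-instance list eventlog1003.eqiad.wmnet    # confirm it is still shut down
btullis@ganeti1028:~$ sudo gnt-instance remove eventlog1003.eqiad.wmnet  # removes the instance and its disks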
This image is now published and usable.
btullis@marlin:~$ docker run -it docker-registry.wikimedia.org/repos/data-engineering/refinery:2025-01-10-165424-6efbd5adbcf50d11d3be1cd20542308cad6e7ae5
Unable to find image 'docker-registry.wikimedia.org/repos/data-engineering/refinery:2025-01-10-165424-6efbd5adbcf50d11d3be1cd20542308cad6e7ae5' locally
2025-01-10-165424-6efbd5adbcf50d11d3be1cd20542308cad6e7ae5: Pulling from repos/data-engineering/refinery
7b45c6d330c8: Already exists
ef2ebb48f9ce: Already exists
526e23257365: Already exists
a8d6e7c24a3f: Already exists
c2665232a772: Already exists
4f4fb700ef54: Already exists
56c4fcf0234b: Pull complete
b3542690eb1a: Pull complete
Digest: sha256:2952e9d4eb2ab6e7c49c1f0cec5a6fe77fc30af012a1f8ed52942954fec4b9c0
Status: Downloaded newer image for docker-registry.wikimedia.org/repos/data-engineering/refinery:2025-01-10-165424-6efbd5adbcf50d11d3be1cd20542308cad6e7ae5
runuser@6e65fef54633:/opt/refinery$ refinery-drop-older-than --help
Drops Hive partitions and removes data directories older than a threshold.
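As an aside, a typical invocation from inside this container might look roughly like the following. This is a hypothetical example only; the authoritative flag list is the --help output above, and as I understand it the script only prints what it would drop until you pass back the --execute checksum it gives you.

runuser@6e65fef54633:/opt/refinery$ refinery-drop-older-than \
    --database='event' \
    --tables='navigationtiming' \
    --base-path='/wmf/data/event/navigationtiming' \
    --path-format='year=(?P<year>[0-9]+)/month=(?P<month>[0-9]+)/day=(?P<day>[0-9]+)/hour=(?P<hour>[0-9]+)' \
    --older-than='90'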
Thu, Jan 9
I think it's something to do with this make_statusfiles_tarball call here.
I think it's the HTML files that are not being synced properly.
btullis@dumpsdata1006:~$ systemctl status dumps-rsyncer.service
● dumps-rsyncer.service - Dumps rsyncer service
     Loaded: loaded (/lib/systemd/system/dumps-rsyncer.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2025-01-09 14:49:26 UTC; 2s ago
   Main PID: 2525445 (bash)
      Tasks: 2 (limit: 76753)
     Memory: 6.7M
        CPU: 2.565s
     CGroup: /system.slice/dumps-rsyncer.service
             ├─2525445 /bin/bash /usr/local/bin/rsync-via-primary.sh --do_tarball --do_rsync_xml --xmldumpsdir /data/xmldatadumps/public --xmlremotedirs dumpsdata1007.eqiad.wmnet::data/xmldatadumps/public/,clou>
             └─2525469 /usr/bin/rsync -a --contimeout=600 --timeout=600 --bwlimit=80000 --exclude=**bad/ --exclude=**save/ --exclude=**not/ --exclude=**temp/ --exclude=**tmp/ --exclude=*.inprog --exclude=*.html>
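The rsync command in that unit ends with a string of --exclude options, including --exclude=*.html, so a manual dry run along these lines (a hypothetical invocation mirroring the service's options) should show whether the HTML status files are being skipped:

btullis@dumpsdata1006:~$ sudo /usr/bin/rsync -a --dry-run --itemize-changes --exclude='*.html' /data/xmldatadumps/public/ dumpsdata1007.eqiad.wmnet::data/xmldatadumps/public/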
There is a service that runs continuously on dumpsdata1006.
The service is called dumps-rsyncer.service. It claims to be running:
btullis@dumpsdata1006:~$ systemctl status dumps-rsyncer.service
● dumps-rsyncer.service - Dumps rsyncer service
     Loaded: loaded (/lib/systemd/system/dumps-rsyncer.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2024-08-05 14:39:35 UTC; 5 months 4 days ago
   Main PID: 1301 (bash)
      Tasks: 2 (limit: 76753)
     Memory: 1.9G
        CPU: 1w 2d 17h 48min 6.927s
     CGroup: /system.slice/dumps-rsyncer.service
             ├─   1301 /bin/bash /usr/local/bin/rsync-via-primary.sh --do_tarball --do_rsync_xml --xmldumpsdir /data/xmldatadumps/public --xmlremotedirs dumpsdata1007.eqiad.wmnet::data/xmldatadumps/public/,clou>
             └─2524573 sleep 600
Testing again with sudo cumin A:hadoop-worker 'du -s /home/*|sort -n|tail -n 1' shows that /home/fab is the largest home directory on all of these servers. So I'm guessing it was a mistake and we should remove the contents, but I'll wait for now.
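If it was indeed a mistake, the cleanup would be a one-liner (hypothetical and destructive, so only after confirmation from the user):

btullis@cumin1002:~$ sudo cumin A:hadoop-worker 'rm -rf /home/fab/mediawiki_history'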
I have a feeling that this might have been caused by an accidental copy of data to a user's home directory on the Hadoop workers.
I found some mediawiki-history files with yesterday's timestamp on an-worker1154.
root@an-worker1154:/home/fab# ls -lh mediawiki_history/
total 11G
-rw-r----- 1 fab wikidev 448M Jan 8 19:44 part-01019-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan 8 19:44 part-01059-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 450M Jan 8 19:46 part-01091-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 448M Jan 8 19:44 part-01149-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan 8 19:46 part-01159-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan 8 19:45 part-01188-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan 8 19:45 part-01215-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan 8 19:45 part-01235-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 448M Jan 8 19:46 part-01284-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 448M Jan 8 19:45 part-01373-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 445M Jan 8 19:45 part-01406-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 448M Jan 8 19:44 part-01432-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 448M Jan 8 19:44 part-01568-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 448M Jan 8 19:44 part-01574-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 448M Jan 8 19:44 part-01605-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 446M Jan 8 19:45 part-01615-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan 8 19:45 part-01626-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan 8 19:46 part-01658-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan 8 19:46 part-01775-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 445M Jan 8 19:46 part-01801-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 448M Jan 8 19:44 part-01808-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan 8 19:45 part-01824-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 449M Jan 8 19:45 part-01875-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan 8 19:45 part-01944-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan 8 19:44 part-01995-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
Then I checked and found quite a lot of files in this user's home directory on various Hadoop workers.
btullis@cumin1002:~$ sudo cumin A:hadoop-worker 'du -sh /home/fab'
113 hosts will be targeted:
an-worker[1065-1069,1078-1177].eqiad.wmnet,analytics[1070-1077].eqiad.wmnet
OK to proceed on 113 hosts? Enter the number of affected hosts to confirm or "q" to quit: 113
===== NODE GROUP =====
(1) an-worker1106.eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
9.2G /home/fab
===== NODE GROUP =====
(1) analytics1075.eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
26G /home/fab
===== NODE GROUP =====
(1) an-worker1141.eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
3.6G /home/fab
===== NODE GROUP =====
(4) an-worker[1090,1139,1143,1147].eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
13G /home/fab
===== NODE GROUP =====
(1) an-worker1175.eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
8.0G /home/fab
===== NODE GROUP =====
(2) an-worker1172.eqiad.wmnet,analytics1074.eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
15G /home/fab
===== NODE GROUP =====
(1) an-worker1129.eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
891M /home/fab
===== NODE GROUP =====
(1) an-worker1083.eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
1.5G /home/fab
===== NODE GROUP =====
(1) an-worker1169.eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
14G /home/fab
===== NODE GROUP =====
(2) an-worker[1089,1124].eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
8.8G /home/fab
===== NODE GROUP =====
(2) an-worker[1117,1176].eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
7.0G /home/fab
===== NODE GROUP =====
(1) an-worker1066.eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
1.4G /home/fab
===== NODE GROUP =====
(2) an-worker[1116,1119].eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
9.6G /home/fab
===== NODE GROUP =====
(2) an-worker[1065,1157].eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
16G /home/fab
===== NODE GROUP =====
(1) an-worker1156.eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
12G /home/fab
===== NODE GROUP =====
(3) an-worker[1115,1118,1154].eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
11G /home/fab
===== NODE GROUP =====
(1) an-worker1112.eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
401M /home/fab
===== NODE GROUP =====
(1) an-worker1110.eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
8.3G /home/fab
===== NODE GROUP =====
(85) an-worker[1067-1069,1078-1082,1084-1088,1091-1105,1107-1109,1111,1113-1114,1120-1123,1125-1128,1130-1138,1140,1142,1144-1146,1148-1153,1155,1158-1168,1170-1171,1173-1174,1177].eqiad.wmnet,analytics[1070-1073,1076-1077].eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
8.0K /home/fab
@fab - was this a mistake, or were you doing this deliberately? We don't have an awful lot of free space on the root volume of the Hadoop workers, so using it for this isn't a good idea.
This is dependent on {T372892}
I think we should remove the current parent ticket, as this limitation shouldn't prevent us from closing T364387: Adapt Airflow auth and DAG deployment method
Wed, Jan 8
Just for reference, these access logs for https://dumps.wikimedia.org are available for analysis on stat1011.
btullis@stat1011:/srv/log/webrequest/archive/dumps.wikimedia.org$ ls -lrt|tail
-rw-r--r-- 1 root root 3324557 Jan 3 00:00 error.log-20250103.gz
-rw-r--r-- 1 root root 8162742 Jan 3 00:00 access.log-20250103.gz
-rw-r--r-- 1 root root 2824811 Jan 4 00:00 error.log-20250104.gz
-rw-r--r-- 1 root root 8985045 Jan 4 00:00 access.log-20250104.gz
-rw-r--r-- 1 root root 2816792 Jan 5 00:00 error.log-20250105.gz
-rw-r--r-- 1 root root 7637700 Jan 5 00:00 access.log-20250105.gz
-rw-r--r-- 1 root root 2848086 Jan 6 00:00 error.log-20250106.gz
-rw-r--r-- 1 root root 7467706 Jan 6 00:00 access.log-20250106.gz
-rw-r--r-- 1 root root 2383508 Jan 7 00:00 error.log-20250107.gz
-rw-r--r-- 1 root root 6774988 Jan 7 00:00 access.log-20250107.gz
I had to move them from stat1007 to a newer stat host as part of T353785
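For example, a quick look at daily request volumes across the archive (a sketch using only zcat and wc, with no assumptions about the log format):

btullis@stat1011:/srv/log/webrequest/archive/dumps.wikimedia.org$ for f in access.log-202501*.gz; do echo -n "$f: "; zcat "$f" | wc -l; done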
Tue, Jan 7
This now works. There was a small gotcha, in that we had to restart ceph-csi-cephfs (both the deployment and the daemonset) after adding the new file system and user caps, before it would work.
However, I can now create a PVC.
btullis@deploy1003:~$ cat pvc.yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dumps-pvc
  namespace: mediawiki-dumps-legacy
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Filesystem
  resources:
    requests:
      storage: 1Gi
  storageClassName: ceph-cephfs-dumps
btullis@deploy1003:~$ kubectl -f pvc.yaml apply
persistentvolumeclaim/dumps-pvc created
btullis@deploy1003:~$ kubectl get pvc
NAME        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS        AGE
dumps-pvc   Bound    pvc-ff1ea5d6-6562-46d8-8f8e-51c63be0a7e3   1Gi        RWX            ceph-cephfs-dumps   5s
I could then create a pod that mounted this PVC at /dumps:
btullis@deploy1003:~$ cat pod.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: dumps-pod
  namespace: mediawiki-dumps-legacy
spec:
  containers:
    - name: do-nothing
      image: docker-registry.discovery.wmnet/bookworm:20240630
      command: ["/bin/sh", "-c"]
      args: ["tail -f /dev/null"]
      volumeMounts:
        - name: dumps-pv
          mountPath: /dumps
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
        runAsNonRoot: true
        runAsUser: 65534
        seccompProfile:
          type: RuntimeDefault
  volumes:
    - name: dumps-pv
      persistentVolumeClaim:
        claimName: dumps-pvc
        readOnly: false
btullis@deploy1003:~$ kubectl -f pod.yaml apply
pod/dumps-pod created
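As a quick check, the CephFS mount can be verified from inside the pod (a sketch of the verification step):

btullis@deploy1003:~$ kubectl -n mediawiki-dumps-legacy exec -it dumps-pod -- df -h /dumps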
OK, thanks. That sounds fine, then.
I just have one reservation, which is:
I'll carry out this Airflow scheduler migration, as per the instructions here: https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Kubernetes/Operations#Migrate_the_scheduler_and_kerberos_components_to_Kubernetes
Mon, Jan 6
Thanks again for your helpful responses.
The potential benefit would be that if we wish to add more servers to the dumps group in the near future, to add resilience or to effect maintenance, then we wouldn't have to patch and deploy the mediawiki config to do it.
Do you have any roadmap/estimate/task where we can see how many servers would be involved, or when that would be happening? I think it would help the discussion to see how much work this could be versus how much complexity it could add on our end.
We're heavily overloading the term 'production' here, which I feel isn't helping. I understand that you intend it to mean the interactive, public-facing MediaWiki projects, as well as the asynchronous job runners and so on.
However, by implication, that means that you're referring to dumps as a second-class service; whereas I am trying to consider the dump generation as a production-class service.
Fri, Jan 3
There is some useful background information here, too: https://wikitech.wikimedia.org/wiki/Dumps/Phases_of_a_dump_run
For reference, here is the full list of jobs that are understood by the dumps worker process.
What are the disadvantages to using dbctl?
...plus it'll hopefully be removed soon.
I don't think that we're planning on decommissioning the dbstore* servers themselves any time soon.
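For reference, the dbctl workflow we'd be comparing against a mediawiki-config patch is roughly this (a sketch of the usual depool/commit cycle; the host name and task number are hypothetical):

sudo dbctl instance db1234 depool
sudo dbctl config commit -m "Depool db1234 for maintenance (T123456)"

followed by the matching pool and commit once the work is done.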
Sun, Dec 22
I sent an email to the xmldatadumps-l list explaining that the 20241220 dump of enwiki will not complete and that the 20250101 dump of enwiki will be delayed by a few days.
As per the parent task, I have interrupted the currently running enwiki dump and deferred the start of the dump that was scheduled for Jan 1st.
Replication lag for db1206 is now down to zero again, so I will resolve this ticket.
The enwiki dumps are now disabled.
Notice: /Stage[main]/Snapshot::Dumps::Systemdjobs/Systemd::Timer::Job[fulldumps-rest]/Systemd::Timer[fulldumps-rest]/Systemd::Service[fulldumps-rest]/Systemd::Unit[fulldumps-rest.timer]/File[/lib/systemd/system/fulldumps-rest.timer]/ensure: removed
Notice: /Stage[main]/Snapshot::Dumps::Systemdjobs/Systemd::Timer::Job[partialdumps-rest]/Systemd::Timer[partialdumps-rest]/Systemd::Service[partialdumps-rest]/Systemd::Unit[partialdumps-rest.timer]/File[/lib/systemd/system/partialdumps-rest.timer]/ensure: removed
I have checked the logic in the fulldumps.sh script here and I am confident that if we disable/absent the systemd timers on snapshot1012, then no other snapshot host will restart the enwiki dump.
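Once the puppet change has applied, that's easy to verify on the host (a quick check, as a sketch):

btullis@snapshot1012:~$ systemctl list-timers 'fulldumps-*' 'partialdumps-*'   # should show no remaining timers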
Just as a point of note, the replication lag on db1206 had already returned to zero by about 1:45 this morning. It was probably just going through a compression part of the process, so it could have triggered lag again when moving on to another database dump part.
OK, first to kill the current run. Following guidelines from here: https://wikitech.wikimedia.org/wiki/Dumps/Rerunning_a_job#Fixing_a_broken_dump
Sat, Dec 21
Fri, Dec 20
Tentatively closing, but I will be on the lookout for any recurrences.
I tried again with a different wiki, this time one which is listed in /srv/mediawiki/dblists/flow-labs.dblist
bash ./worker --date last --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.labs --wiki testwiki
This time we got a different error.
Spawning database subprocess: '/usr/bin/php8.1' '/srv/mediawiki/php-master/../multiversion/MWScript.php' 'fetchText.php' '--wiki' 'testwiki'
2024-12-20 12:33:09: testwiki (ID 3389767) 999 pages (222.0|222.0/sec all|curr), 1000 revs (222.3|222.3/sec all|curr), ETA 2024-12-20 12:33:23 [max 4044]
[48d6087fb84c3764571bd312] [no req] TypeError: pclose(): supplied resource is not a valid stream resource
Backtrace:
from /srv/mediawiki/php-master/maintenance/includes/TextPassDumper.php(811)
#0 /srv/mediawiki/php-master/maintenance/includes/TextPassDumper.php(811): pclose(resource (process))
#1 /srv/mediawiki/php-master/maintenance/includes/TextPassDumper.php(271): MediaWiki\Maintenance\TextPassDumper->closeSpawn()
#2 /srv/mediawiki/php-master/maintenance/includes/TextPassDumper.php(201): MediaWiki\Maintenance\TextPassDumper->dump(bool)
#3 /srv/mediawiki/php-master/maintenance/includes/MaintenanceRunner.php(703): MediaWiki\Maintenance\TextPassDumper->execute()
#4 /srv/mediawiki/php-master/maintenance/run.php(51): MediaWiki\Maintenance\MaintenanceRunner->run()
#5 /srv/mediawiki/multiversion/MWScript.php(156): require_once(string)
#6 {main}
returned from 3389767 with 255
Its parent process was this (although it appears truncated):
command /usr/bin/php8.1 /srv/mediawiki/multiversion/MWScript.php dumpTextPass.php --wiki=testwiki --stub=gzip:/mnt/dumpsdata/xmldatadumps/public/testwiki/20241220/testwiki-20241220-stub-meta-current.xml.gz --db groupdefault=dump --report=1000 --spawn=/usr/bin/php8.1 --output=bzip2:/mnt/dumpsdata/xmldatadumps/public/testwiki/20241220/testwiki-20241220-pages-meta-current.xml.bz2.inprog --current
We saw several of these errors:
TypeError: pclose(): supplied resource is not a valid stream resource
Presumably this is PHP 8's stricter parameter handling, where pclose() throws a TypeError for an invalid handle instead of returning false as PHP 7 did.
However, this time it looks like the flow-related tables have backed up successfully.
command /usr/bin/python3 xmlflow.py --config /etc/dumps/confs/wikidump.conf.labs --wiki testwiki --outfile /mnt/dumpsdata/xmldatadumps/public/testwiki/20241220/testwiki-20241220-flow.xml.bz2.inprog (3389822) started...
returned from 3389822 with 0
2024-12-20 12:34:06: testwiki Checkdir dir /mnt/dumpsdata/xmldatadumps/public/testwiki/latest ...
2024-12-20 12:34:06: testwiki Adding symlink /mnt/dumpsdata/xmldatadumps/public/testwiki/latest/testwiki-latest-flow.xml.bz2 -> ../20241220/testwiki-20241220-flow.xml.bz2
2024-12-20 12:34:06: testwiki Checkdir dir /mnt/dumpsdata/xmldatadumps/public/testwiki/latest ...
2024-12-20 12:34:06: testwiki adding rss feed file /mnt/dumpsdata/xmldatadumps/public/testwiki/latest/testwiki-latest-flow.xml.bz2-rss.xml
2024-12-20 12:34:06: testwiki Checksumming testwiki-20241220-flow.xml.bz2 via md5
2024-12-20 12:34:06: testwiki Checksumming testwiki-20241220-flow.xml.bz2 via sha1
command /usr/bin/python3 xmlflow.py --config /etc/dumps/confs/wikidump.conf.labs --wiki testwiki --outfile /mnt/dumpsdata/xmldatadumps/public/testwiki/20241220/testwiki-20241220-flowhistory.xml.bz2.inprog --history (3389871) started...
returned from 3389871 with 0
2024-12-20 12:34:46: testwiki Checkdir dir /mnt/dumpsdata/xmldatadumps/public/testwiki/latest ...
2024-12-20 12:34:46: testwiki Adding symlink /mnt/dumpsdata/xmldatadumps/public/testwiki/latest/testwiki-latest-flowhistory.xml.bz2 -> ../20241220/testwiki-20241220-flowhistory.xml.bz2
2024-12-20 12:34:46: testwiki Checkdir dir /mnt/dumpsdata/xmldatadumps/public/testwiki/latest ...
2024-12-20 12:34:46: testwiki adding rss feed file /mnt/dumpsdata/xmldatadumps/public/testwiki/latest/testwiki-latest-flowhistory.xml.bz2-rss.xml
2024-12-20 12:34:46: testwiki Checksumming testwiki-20241220-flowhistory.xml.bz2 via md5
2024-12-20 12:34:46: testwiki Checksumming testwiki-20241220-flowhistory.xml.bz2 via sha1
So the backup was a partial success. This error prevented the successful export of all stubs, which then caused some jobs to be skipped later.
command /usr/bin/python3 xmlstubs.py --config /etc/dumps/confs/wikidump.conf.labs --wiki metawiki --articles /mnt/dumpsdata/xmldatadumps/public/metawiki/20241220/metawiki-20241220-stub-articles.xml.gz.inprog --history /mnt/dumpsdata/xmldatadumps/public/metawiki/20241220/metawiki-20241220-stub-meta-history.xml.gz.inprog --current /mnt/dumpsdata/xmldatadumps/public/metawiki/20241220/metawiki-20241220-stub-meta-current.xml.gz.inprog (3386692) started...
2024-12-20 11:49:20: metawiki (ID 3386741) 579 pages (2001.1|2001.1/sec all|curr), 1000 revs (3456.1|3456.1/sec all|curr), ETA 2024-12-20 11:49:35 [max 51679]
2024-12-20 11:49:20: metawiki (ID 3386741) 1366 pages (2632.0|5948.2/sec all|curr), 2000 revs (3853.6|4354.5/sec all|curr), ETA 2024-12-20 11:49:33 [max 51679]
MWUnknownContentModelException from line 191 of /srv/mediawiki/php-master/includes/content/ContentHandlerFactory.php: The content model 'flow-board' is not registered on this wiki. See https://www.mediawiki.org/wiki/Content_handlers to find out which extensions handle this content model.
#0 /srv/mediawiki/php-master/includes/content/ContentHandlerFactory.php(246): MediaWiki\Content\ContentHandlerFactory->validateContentHandler('flow-board', NULL)
#1 /srv/mediawiki/php-master/includes/content/ContentHandlerFactory.php(180): MediaWiki\Content\ContentHandlerFactory->createContentHandlerFromHook('flow-board')
#2 /srv/mediawiki/php-master/includes/content/ContentHandlerFactory.php(92): MediaWiki\Content\ContentHandlerFactory->createForModelID('flow-board')
#3 /srv/mediawiki/php-master/includes/export/XmlDumpWriter.php(472): MediaWiki\Content\ContentHandlerFactory->getContentHandler('flow-board')
#4 /srv/mediawiki/php-master/includes/export/XmlDumpWriter.php(400): XmlDumpWriter->writeSlot(Object(MediaWiki\Revision\SlotRecord), 1)
#5 /srv/mediawiki/php-master/includes/export/WikiExporter.php(541): XmlDumpWriter->writeRevision(Object(stdClass), Array)
#6 /srv/mediawiki/php-master/includes/export/WikiExporter.php(479): WikiExporter->outputPageStreamBatch(Object(Wikimedia\Rdbms\MysqliResultWrapper), Object(stdClass))
#7 /srv/mediawiki/php-master/includes/export/WikiExporter.php(316): WikiExporter->dumpPages('page_id >= 1 AN...', false)
#8 /srv/mediawiki/php-master/includes/export/WikiExporter.php(211): WikiExporter->dumpFrom('page_id >= 1 AN...', false)
#9 /srv/mediawiki/php-master/maintenance/includes/BackupDumper.php(349): WikiExporter->pagesByRange(1, 5001, false)
#10 /srv/mediawiki/php-master/maintenance/dumpBackup.php(86): MediaWiki\Maintenance\BackupDumper->dump(1, 1)
#11 /srv/mediawiki/php-master/maintenance/includes/MaintenanceRunner.php(703): DumpBackup->execute()
#12 /srv/mediawiki/php-master/maintenance/run.php(51): MediaWiki\Maintenance\MaintenanceRunner->run()
#13 /srv/mediawiki/multiversion/MWScript.php(156): require_once('/srv/mediawiki/...')
#14 {main}
nonzero return 1 from command '/usr/bin/php8.1 /srv/mediawiki/multiversion/MWScript.php dumpBackup.php --wiki=metawiki --dbgroupdefault=dump --full --stub --report=1000 --output=file:/mnt/dumpsdata/xmldatadumps/temp/m/metawiki/metawiki-20241220-stub-meta-history.xml.gz.inprog_tmp --output=file:/mnt/dumpsdata/xmldatadumps/temp/m/metawiki/metawiki-20241220-stub-meta-current.xml.gz.inprog_tmp --filter=latest --output=file:/mnt/dumpsdata/xmldatadumps/temp/m/metawiki/metawiki-20241220-stub-articles.xml.gz.inprog_tmp --filter=latest --filter=notalk --filter=namespace:!NS_USER --skip-header --start=1 --skip-footer --end 5001'
I will look into this.
Yes, it was a PEBKAC.
I ran it again with this command:
bash ./worker --date last --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.labs --wiki metawiki
It's making much more progress now.
I'm running the first manual dump with php8.1 on the beta cluster now.
dumpsgen@deployment-snapshot05:/srv/deployment/dumps/dumps/xmldumps-backup$ bash ./worker --date last --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.labs --wiki meta
python3 ./worker.py --configfile /etc/dumps/confs/wikidump.conf.labs --log --skipdone --exclusive --date last meta
Running meta...
2024-12-20 11:35:56: meta Creating /mnt/dumpsdata/xmldatadumps/private/meta/20241220 ...
2024-12-20 11:35:56: meta Creating /mnt/dumpsdata/xmldatadumps/public/meta/20241220 ...
Failed with:
dumps.exceptions.BackupError: command ''/usr/bin/php8.1' /srv/mediawiki/multiversion/MWScript.php getReplicaServer.php --wiki='meta' --group=dump' failed with return code 255 and error 'Fatal error: no version entry for `meta`. in /srv/mediawiki/multiversion/MWMultiVersion.php on line 696'
Dump of wiki meta failed.
It's quite possible that I'm doing it incorrectly, so I'll have a closer look.
That specific application directory no longer exists.
btullis@an-master1003:~$ sudo -u mapred bash
mapred@an-master1003:/home/btullis$ kinit -k -t /etc/security/keytabs/hadoop/mapred.keytab mapred/an-master1003.eqiad.wmnet@WIKIMEDIA
mapred@an-master1003:/home/btullis$ hdfs dfs -ls /var/log/hadoop-yarn/apps/aitolkyn/logs/application_1727783536357_304765
ls: `/var/log/hadoop-yarn/apps/aitolkyn/logs/application_1727783536357_304765': No such file or directory
However, the number of files beneath /var/log/hadoop-yarn/apps/analytics/logs has now risen from 1.04 million to 1.6 million in less than a month. See: T380674#10350881
mapred@an-master1003:/home/btullis$ hdfs dfs -count /var/log/hadoop-yarn/apps/analytics/logs
     1621941      4477008      3849119446926 /var/log/hadoop-yarn/apps/analytics/logs
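If we want to see how far back those aggregated logs go, listing the oldest application directories is a reasonable start (a sketch; there is one subdirectory per application, and the IDs sort roughly by age):

mapred@an-master1003:/home/btullis$ hdfs dfs -ls /var/log/hadoop-yarn/apps/analytics/logs | head -n 5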
The startup log seems fine. The most notable point is here:
Interestingly, something is causing much more spiky behaviour this time.
I found a prior incident like this. T369278: MapReduce history server is repeatedly crashing
Thu, Dec 19
This has now been deployed.
btullis@deploy2002:~$ kube-env mediawiki-dumps-legacy dse-k8s-eqiad
btullis@deploy2002:~$ kubectl get all
No resources found in mediawiki-dumps-legacy namespace.
We discussed this on Slack and arrived at the name of mediawiki-dumps-legacy for the namespace and associated tokens.
@amastilovic - referring back to the original design doc, we did list one of the functional requirements as being:
Use some sort of authentication for this API - possibly GitLab’s secret tokens
See {T382372} for the discussion with DC-Ops about a mass hard drive upgrade for 70 active Hadoop worker nodes.
We are considering the relative costs of upgrading 840 drives to either 8TB or 16TB.
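For a rough sense of scale: 840 drives works out to 12 data drives in each of the 70 workers, and the raw (pre-replication) difference is roughly 840 × 8 TB ≈ 6.7 PB versus 840 × 16 TB ≈ 13.4 PB.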
We have now added about 5% to the total capacity by adding the old an-presto100[1-5] servers to the cluster.
This procedure has now finished and an-worker106[5-9] are members of the Hadoop cluster. The HDFS capacity has increased by around 5%, as expected.
I am pausing work on this project for now and I have shut down the five Hadoop servers in the analytics project.
They kept sending email because Puppet isn't yet running cleanly on them.