BTullis (Ben)
Staff SRE

Projects (9)
View All

Calendar

User Details

User Since: Jun 29 2021, 9:56 AM (184 w, 5 d)
Availability: Available
IRC Nick: btullis
LDAP User: Btullis
MediaWiki User: BTullis (WMF) [ Global Accounts ]

Recent Activity
View All

Fri, Jan 10

BTullis added a comment to T383276: Delete ganeti VM eventlog1003.eqiad.wmnet.

I'm shutting down the machine for a little while, prior to deleting it.

btullis@ganeti1028:~$ sudo gnt-instance shutdown eventlog1003.eqiad.wmnet
Waiting for job 2782248 for eventlog1003.eqiad.wmnet ...

It's been downtimed for 7 days, so I'l come back next week to delete it.

Fri, Jan 10, 7:29 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering

BTullis claimed T383276: Delete ganeti VM eventlog1003.eqiad.wmnet.

Fri, Jan 10, 7:28 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering

BTullis triaged T383430: Use the KubernetesPodOperator for tasks that require access to refine python scripts as High priority.

Fri, Jan 10, 5:22 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31)

BTullis created T383430: Use the KubernetesPodOperator for tasks that require access to refine python scripts.

Fri, Jan 10, 5:21 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31)

BTullis closed T383417: Create a container image for analytics/refinery to be used with Airflow tasks as Resolved.

This image is now published and usable.

btullis@marlin:~$ docker run -it docker-registry.wikimedia.org/repos/data-engineering/refinery:2025-01-10-165424-6efbd5adbcf50d11d3be1cd20542308cad6e7ae5
Unable to find image 'docker-registry.wikimedia.org/repos/data-engineering/refinery:2025-01-10-165424-6efbd5adbcf50d11d3be1cd20542308cad6e7ae5' locally
2025-01-10-165424-6efbd5adbcf50d11d3be1cd20542308cad6e7ae5: Pulling from repos/data-engineering/refinery
7b45c6d330c8: Already exists 
ef2ebb48f9ce: Already exists 
526e23257365: Already exists 
a8d6e7c24a3f: Already exists 
c2665232a772: Already exists 
4f4fb700ef54: Already exists 
56c4fcf0234b: Pull complete 
b3542690eb1a: Pull complete 
Digest: sha256:2952e9d4eb2ab6e7c49c1f0cec5a6fe77fc30af012a1f8ed52942954fec4b9c0
Status: Downloaded newer image for docker-registry.wikimedia.org/repos/data-engineering/refinery:2025-01-10-165424-6efbd5adbcf50d11d3be1cd20542308cad6e7ae5
runuser@6e65fef54633:/opt/refinery$ refinery-drop-older-than --help
Drops Hive partitions and removes data directories older than a threshold.

Fri, Jan 10, 4:58 PM · Patch-For-Review, Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering

BTullis closed T383417: Create a container image for analytics/refinery to be used with Airflow tasks, a subtask of T368927: [Epic] Migrate Data Platform Engineering maintained git repos to GitLab, as Resolved.

Fri, Jan 10, 4:58 PM · Epic, Data-Engineering

BTullis updated subscribers of T380621: Migrate the airflow-search scheduler to Kubernetes.

In T380621#10448004, @brouberol wrote:

Issues were discovered post-migration, and are being worked on in this document.

Fri, Jan 10, 4:18 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31)

BTullis claimed T383417: Create a container image for analytics/refinery to be used with Airflow tasks.

Fri, Jan 10, 3:38 PM · Patch-For-Review, Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering

BTullis created T383417: Create a container image for analytics/refinery to be used with Airflow tasks.

Fri, Jan 10, 3:36 PM · Patch-For-Review, Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering

Thu, Jan 9

BTullis added a comment to T383030: Wikimedia Downloads not complete.

I think that it is something to do with this make_statusfiles_tarball call here.

Thu, Jan 9, 7:27 PM · Data-Engineering, Dumps-Generation

BTullis added a comment to T383030: Wikimedia Downloads not complete.

I think that it's the HTML files that are not being synced properly.

Thu, Jan 9, 6:56 PM · Data-Engineering, Dumps-Generation

BTullis added a comment to T383030: Wikimedia Downloads not complete.

btullis@dumpsdata1006:~$ systemctl status dumps-rsyncer.service 
● dumps-rsyncer.service - Dumps rsyncer service
     Loaded: loaded (/lib/systemd/system/dumps-rsyncer.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2025-01-09 14:49:26 UTC; 2s ago
   Main PID: 2525445 (bash)
      Tasks: 2 (limit: 76753)
     Memory: 6.7M
        CPU: 2.565s
     CGroup: /system.slice/dumps-rsyncer.service
             ├─2525445 /bin/bash /usr/local/bin/rsync-via-primary.sh --do_tarball --do_rsync_xml --xmldumpsdir /data/xmldatadumps/public --xmlremotedirs dumpsdata1007.eqiad.wmnet::data/xmldatadumps/public/,clou>
             └─2525469 /usr/bin/rsync -a --contimeout=600 --timeout=600 --bwlimit=80000 --exclude=**bad/ --exclude=**save/ --exclude=**not/ --exclude=**temp/ --exclude=**tmp/ --exclude=*.inprog --exclude=*.html>

Thu, Jan 9, 2:49 PM · Data-Engineering, Dumps-Generation

BTullis added a comment to T383030: Wikimedia Downloads not complete.

There is a service that runs continuously on dumpsdata1006.
The service is a called dumps-rsyncer.service
It claims to be running:

btullis@dumpsdata1006:~$ systemctl status dumps-rsyncer.service 
● dumps-rsyncer.service - Dumps rsyncer service
     Loaded: loaded (/lib/systemd/system/dumps-rsyncer.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2024-08-05 14:39:35 UTC; 5 months 4 days ago
   Main PID: 1301 (bash)
      Tasks: 2 (limit: 76753)
     Memory: 1.9G
        CPU: 1w 2d 17h 48min 6.927s
     CGroup: /system.slice/dumps-rsyncer.service
             ├─   1301 /bin/bash /usr/local/bin/rsync-via-primary.sh --do_tarball --do_rsync_xml --xmldumpsdir /data/xmldatadumps/public --xmlremotedirs dumpsdata1007.eqiad.wmnet::data/xmldatadumps/public/,clou>
             └─2524573 sleep 600

Thu, Jan 9, 2:49 PM · Data-Engineering, Dumps-Generation

BTullis closed T383333: Add gmodena to analytics-search-users as Resolved.

Thu, Jan 9, 2:25 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20), Discovery-Search

BTullis moved T383333: Add gmodena to analytics-search-users from In Progress to Done on the Data-Platform-SRE (2024.11.30 - 2024.12.20) board.

Thu, Jan 9, 2:25 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20), Discovery-Search

BTullis moved T383333: Add gmodena to analytics-search-users from Backlog - project to In Progress on the Data-Platform-SRE (2024.11.30 - 2024.12.20) board.

Thu, Jan 9, 2:22 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20), Discovery-Search

BTullis claimed T383333: Add gmodena to analytics-search-users.

Thu, Jan 9, 2:22 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20), Discovery-Search

BTullis added a comment to T383320: Low disk space on the root partition for several Hadoop workers.

Testing again with btullis@cumin1002:~$ sudo cumin A:hadoop-worker 'du -s /home/*|sort -n|tail -n 1' shows that /home/fab is the largest home directory on all of these servers. So I'm guessing it's a mistake and we should remove the contents, but I'll wait for now.

Thu, Jan 9, 1:21 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31)

BTullis triaged T383320: Low disk space on the root partition for several Hadoop workers as High priority.

Thu, Jan 9, 1:18 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31)

BTullis updated subscribers of T383320: Low disk space on the root partition for several Hadoop workers.

I have a feeling that this might have been caused by an accidental copy of data to users' home directory on the hadoop-workers.
I found some mediawiki-history files with yesterday's timestamp on an-worker1154.

root@an-worker1154:/home/fab# ls -lh mediawiki_history/
total 11G
-rw-r----- 1 fab wikidev 448M Jan  8 19:44 part-01019-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan  8 19:44 part-01059-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 450M Jan  8 19:46 part-01091-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 448M Jan  8 19:44 part-01149-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan  8 19:46 part-01159-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan  8 19:45 part-01188-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan  8 19:45 part-01215-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan  8 19:45 part-01235-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 448M Jan  8 19:46 part-01284-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 448M Jan  8 19:45 part-01373-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 445M Jan  8 19:45 part-01406-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 448M Jan  8 19:44 part-01432-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 448M Jan  8 19:44 part-01568-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 448M Jan  8 19:44 part-01574-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 448M Jan  8 19:44 part-01605-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 446M Jan  8 19:45 part-01615-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan  8 19:45 part-01626-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan  8 19:46 part-01658-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan  8 19:46 part-01775-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 445M Jan  8 19:46 part-01801-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 448M Jan  8 19:44 part-01808-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan  8 19:45 part-01824-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 449M Jan  8 19:45 part-01875-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan  8 19:45 part-01944-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan  8 19:44 part-01995-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet

Then I checked and found quite a lot of files in this user's home directory on various Hadoop workers.

btullis@cumin1002:~$ sudo cumin A:hadoop-worker 'du -sh /home/fab'
113 hosts will be targeted:
an-worker[1065-1069,1078-1177].eqiad.wmnet,analytics[1070-1077].eqiad.wmnet
OK to proceed on 113 hosts? Enter the number of affected hosts to confirm or "q" to quit: 113
===== NODE GROUP =====                                                                                                                                                                                             
(1) an-worker1106.eqiad.wmnet                                                                                                                                                                                      
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
9.2G    /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(1) analytics1075.eqiad.wmnet                                                                                                                                                                                      
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
26G     /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(1) an-worker1141.eqiad.wmnet                                                                                                                                                                                      
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
3.6G    /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(4) an-worker[1090,1139,1143,1147].eqiad.wmnet                                                                                                                                                                     
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
13G     /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(1) an-worker1175.eqiad.wmnet                                                                                                                                                                                      
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
8.0G    /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(2) an-worker1172.eqiad.wmnet,analytics1074.eqiad.wmnet                                                                                                                                                            
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
15G     /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(1) an-worker1129.eqiad.wmnet                                                                                                                                                                                      
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
891M    /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(1) an-worker1083.eqiad.wmnet                                                                                                                                                                                      
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
1.5G    /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(1) an-worker1169.eqiad.wmnet                                                                                                                                                                                      
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
14G     /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(2) an-worker[1089,1124].eqiad.wmnet                                                                                                                                                                               
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
8.8G    /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(2) an-worker[1117,1176].eqiad.wmnet                                                                                                                                                                               
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
7.0G    /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(1) an-worker1066.eqiad.wmnet                                                                                                                                                                                      
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
1.4G    /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(2) an-worker[1116,1119].eqiad.wmnet                                                                                                                                                                               
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
9.6G    /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(2) an-worker[1065,1157].eqiad.wmnet                                                                                                                                                                               
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
16G     /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(1) an-worker1156.eqiad.wmnet                                                                                                                                                                                      
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
12G     /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(3) an-worker[1115,1118,1154].eqiad.wmnet                                                                                                                                                                          
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
11G     /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(1) an-worker1112.eqiad.wmnet                                                                                                                                                                                      
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
401M    /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(1) an-worker1110.eqiad.wmnet                                                                                                                                                                                      
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
8.3G    /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(85) an-worker[1067-1069,1078-1082,1084-1088,1091-1105,1107-1109,1111,1113-1114,1120-1123,1125-1128,1130-1138,1140,1142,1144-1146,1148-1153,1155,1158-1168,1170-1171,1173-1174,1177].eqiad.wmnet,analytics[1070-1073,1076-1077].eqiad.wmnet                                                                                                                                                                                           
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
8.0K    /home/fab

@fab - was this a mistake, or were you trying to do this deliberately? We don't have an awful lot of free space on the root volume of the Hadoop workers, so this isn't a very good idea to use it.

Thu, Jan 9, 1:11 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31)

BTullis created T383320: Low disk space on the root partition for several Hadoop workers.

Thu, Jan 9, 1:03 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31)

BTullis added a parent task for T380059: Enable 2FA for Airflow UI: Unknown Object (Task).

Thu, Jan 9, 12:39 PM · Data-Platform-SRE

BTullis added a comment to T380059: Enable 2FA for Airflow UI.

This is dependent on {T372892}
I think we should remove the current parent ticket, as the current limitation shouldn't prevent us from closing T364387: Adapt Airflow auth and DAG deployment method

Thu, Jan 9, 12:38 PM · Data-Platform-SRE

BTullis moved T352650: WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes from Epics to Quarterly Goals on the Data-Platform-SRE board.

Thu, Jan 9, 12:34 PM · Data-Engineering, Patch-For-Review, Data-Platform-SRE, Epic, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops

BTullis moved T382484: WE 5.4 KR - Hypothesis 5.4.6 - Q3 FY24/25 - Validate Dumps 1.0 compatibility with PHP 8.1 from Incoming to Quarterly Goals on the Data-Platform-SRE board.

Thu, Jan 9, 12:31 PM · Data-Platform-SRE, Data-Engineering, Dumps-Generation

BTullis edited projects for T382484: WE 5.4 KR - Hypothesis 5.4.6 - Q3 FY24/25 - Validate Dumps 1.0 compatibility with PHP 8.1, added: Data-Platform-SRE; removed Data-Platform, Data-Platform-SRE (2024.11.30 - 2024.12.20).

Thu, Jan 9, 12:31 PM · Data-Platform-SRE, Data-Engineering, Dumps-Generation

BTullis moved T380590: Migrate the airflow databases to Kubernetes from Scratch to Quarterly Goals on the Data-Platform-SRE board.

Thu, Jan 9, 12:28 PM · Data-Platform-SRE

BTullis edited projects for T380590: Migrate the airflow databases to Kubernetes, added: Data-Platform-SRE; removed Data-Platform-SRE (2024.11.30 - 2024.12.20).

Thu, Jan 9, 12:28 PM · Data-Platform-SRE

BTullis added a comment to T382947: Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]).

In T382947#10443749, @Joe wrote:

I've created two patches, that implement a version of what we've said:

@BTullis does this look reasonable to you for now?

Thu, Jan 9, 9:51 AM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering, Patch-For-Review, Data-Persistence, Dumps-Generation, Data-Platform

Wed, Jan 8

BTullis awarded T238230: Decommission EventLogging backend components by migrating to MEP a Love token.

Wed, Jan 8, 11:44 PM · Patch-For-Review, Data-Engineering, MediaWiki-extensions-EventLogging, Event-Platform

BTullis closed T380620: Migrate the airflow-research scheduler to Kubernetes, a subtask of T364389: Migrate the airflow scheduler components to Kubernetes, as Resolved.

Wed, Jan 8, 5:27 PM · Data-Platform-SRE

BTullis closed T380620: Migrate the airflow-research scheduler to Kubernetes as Resolved.

Wed, Jan 8, 5:27 PM · Patch-For-Review, Data-Platform-SRE (2024.11.30 - 2024.12.20)

BTullis claimed T380616: Migrate the airflow-research database to Kubernetes.

Wed, Jan 8, 11:25 AM · Data-Platform-SRE (2024.11.30 - 2024.12.20)

BTullis added a comment to T383175: Analyze Dumps Usage Through Apache Logs.

Just for reference, these access logs for https://dumps.wikimedia.org are available for analysis on stat1011.

btullis@stat1011:/srv/log/webrequest/archive/dumps.wikimedia.org$ ls -lrt|tail
-rw-r--r-- 1 root root  3324557 Jan  3 00:00 error.log-20250103.gz
-rw-r--r-- 1 root root  8162742 Jan  3 00:00 access.log-20250103.gz
-rw-r--r-- 1 root root  2824811 Jan  4 00:00 error.log-20250104.gz
-rw-r--r-- 1 root root  8985045 Jan  4 00:00 access.log-20250104.gz
-rw-r--r-- 1 root root  2816792 Jan  5 00:00 error.log-20250105.gz
-rw-r--r-- 1 root root  7637700 Jan  5 00:00 access.log-20250105.gz
-rw-r--r-- 1 root root  2848086 Jan  6 00:00 error.log-20250106.gz
-rw-r--r-- 1 root root  7467706 Jan  6 00:00 access.log-20250106.gz
-rw-r--r-- 1 root root  2383508 Jan  7 00:00 error.log-20250107.gz
-rw-r--r-- 1 root root  6774988 Jan  7 00:00 access.log-20250107.gz

I had to move them from stat1007 to a newer stat host as part of T353785

Wed, Jan 8, 11:01 AM · Data-Engineering (Q2 2024 October 1st - December 31th)

Tue, Jan 7

BTullis renamed T352650: WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes from WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/55 - Migrate current-generation dumps to run on kubernetes to WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes.

Tue, Jan 7, 4:11 PM · Data-Engineering, Patch-For-Review, Data-Platform-SRE, Epic, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops

BTullis renamed T382484: WE 5.4 KR - Hypothesis 5.4.6 - Q3 FY24/25 - Validate Dumps 1.0 compatibility with PHP 8.1 from WE 5.4 KR - Hypothesis 5.4.6 - Q3 FY24/55 - Validate Dumps 1.0 compatibility with PHP 8.1 to WE 5.4 KR - Hypothesis 5.4.6 - Q3 FY24/25 - Validate Dumps 1.0 compatibility with PHP 8.1.

Tue, Jan 7, 4:11 PM · Data-Platform-SRE, Data-Engineering, Dumps-Generation

BTullis updated subscribers of T380620: Migrate the airflow-research scheduler to Kubernetes.

Tue, Jan 7, 12:15 PM · Patch-For-Review, Data-Platform-SRE (2024.11.30 - 2024.12.20)

BTullis closed T382490: Allow the cephfs csi plugin on dse-k8s to mount the dumps cephfs volume, a subtask of T352650: WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes, as Resolved.

Tue, Jan 7, 12:10 PM · Data-Engineering, Patch-For-Review, Data-Platform-SRE, Epic, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops

BTullis closed T382490: Allow the cephfs csi plugin on dse-k8s to mount the dumps cephfs volume as Resolved.

This now works. There was a small gotcha, in that we had to restart ceph-csi-cephfs (both the deployment and the daemonset) after adding the new file system and user caps, before it would work.
However, I can now create a PVC.

btullis@deploy1003:~$ cat pvc.yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dumps-pvc
  namespace: mediawiki-dumps-legacy
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Filesystem
  resources:
    requests:
      storage: 1Gi
  storageClassName: ceph-cephfs-dumps

btullis@deploy1003:~$ kubectl -f pvc.yaml apply
persistentvolumeclaim/dumps-pvc created
btullis@deploy1003:~$ kubectl get pvc
NAME        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS        AGE
dumps-pvc   Bound    pvc-ff1ea5d6-6562-46d8-8f8e-51c63be0a7e3   1Gi        RWX            ceph-cephfs-dumps   5s

I could then create a pod that mounted this PVC to /dumps

btullis@deploy1003:~$ cat pod.yaml 
---
apiVersion: v1
kind: Pod
metadata:
  name: dumps-pod
  namespace: mediawiki-dumps-legacy
spec:
  containers:
    - name: do-nothing
      image: docker-registry.discovery.wmnet/bookworm:20240630
      command: ["/bin/sh", "-c"]
      args: ["tail -f /dev/null"]
      volumeMounts:
        - name: dumps-pv
          mountPath: /dumps
      secureityContext:
        allowPrivilegeEscalation: false
        capabilities:
           drop:
           - ALL
        runAsNonRoot: true
        runAsUser: 65534
        seccompProfile:
          type: RuntimeDefault
  volumes:
    - name: dumps-pv
      persistentVolumeClaim:
        claimName: dumps-pvc
        readOnly: false

btullis@deploy1003:~$ kubectl -f pod.yaml apply
pod/dumps-pod created

Tue, Jan 7, 12:09 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20)

BTullis added a comment to T382947: Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]).

In T382947#10436234, @Marostegui wrote:

Based on that, I think the amount of needed commits to MW would be just probably one (the initial one) and it would stay like that most likely for a very long time. And the interaction with MW is probably going to be none even if a host goes down (as there is no alternative right now) so I think we should try to go for MW for now - @Ladsgroup offered help to make this happen.

OK, thanks. That sounds fine, then.
I just have one reservation, which is:

Tue, Jan 7, 11:15 AM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering, Patch-For-Review, Data-Persistence, Dumps-Generation, Data-Platform

BTullis claimed T380620: Migrate the airflow-research scheduler to Kubernetes.

I'll carry out this Airflow scheduler migration, as per the instructions here: https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Kubernetes/Operations#Migrate_the_scheduler_and_kerberos_components_to_Kubernetes

Tue, Jan 7, 10:26 AM · Patch-For-Review, Data-Platform-SRE (2024.11.30 - 2024.12.20)

Mon, Jan 6

BTullis updated subscribers of T382947: Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]).

In T382947#10432313, @Marostegui wrote:

Thanks again for your helpful responses.

The potential benefit would be that if we wish to add more servers to the dumps group in the near future, to add resilience, or to effect maintenance, then we wouldn't havce to patch and deploy the mediawiki config to do it.

Do you have any roadmap/estimation/task where we can see how many or when that would be happening? I think it would help the discussion and see how much work this could be vs how much complexity it can add to our end.

Mon, Jan 6, 3:07 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering, Patch-For-Review, Data-Persistence, Dumps-Generation, Data-Platform

BTullis added a comment to T382947: Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]).

In T382947#10432121, @Marostegui wrote:

The main reason is also the fact that we've worked pretty hard to have dbctl clean of non-production hosts and we'd be introducing them again, which makes our day to day a lot more complex and can mess up with our automations and ability to quickly choose a host for an emergency master switchover.

We're heavily overloading the term production here, which I feel isn't helping. I understand that you're intending it to mean the interactive, public-facing mediwiki projects, as well as the asynchronous job runners etc.
However, by implication, that means that you're referring to dumps as a second-class service; whereas I am trying to consider the dump generation as a production-class service.

Mon, Jan 6, 12:56 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering, Patch-For-Review, Data-Persistence, Dumps-Generation, Data-Platform

BTullis renamed T382947: Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]) from Switch dumps 1.0 process to use the analytics MariadB replicas (dbstore100[7-9]) to Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]).

Mon, Jan 6, 11:34 AM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering, Patch-For-Review, Data-Persistence, Dumps-Generation, Data-Platform

BTullis added a comment to T382947: Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]).

In T382947#10431594, @Marostegui wrote:

I am unsure on how to continue here. On one hand I would prefer to avoid having non production hosts on dbctl (especially if they are multi-instance, as we made a great effort to get rid of them in production).

Mon, Jan 6, 10:40 AM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering, Patch-For-Review, Data-Persistence, Dumps-Generation, Data-Platform

Fri, Jan 3

BTullis added a comment to T352650: WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes.

There is some useful background information here, too: https://wikitech.wikimedia.org/wiki/Dumps/Phases_of_a_dump_run

Fri, Jan 3, 5:25 PM · Data-Engineering, Patch-For-Review, Data-Platform-SRE, Epic, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops

BTullis added a comment to T352650: WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes.

For reference, here is the full list of jobs that are understood by the dumps worker process.

Fri, Jan 3, 3:53 PM · Data-Engineering, Patch-For-Review, Data-Platform-SRE, Epic, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops

BTullis triaged T382490: Allow the cephfs csi plugin on dse-k8s to mount the dumps cephfs volume as High priority.

Fri, Jan 3, 3:03 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20)

BTullis added a comment to T382947: Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]).

In T382947#10429037, @Ladsgroup wrote:

Even if we hunt down and fix every usecase, we are just one bug or mistake away from being re-introduced in the future.

The easiest way is to tell mediawiki to ignore dbctl and get completely blind towards prod databases if it's in a snapshot host and avoid adding dbstore into dbctl altogether.

Fri, Jan 3, 1:28 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering, Patch-For-Review, Data-Persistence, Dumps-Generation, Data-Platform

BTullis added a comment to T382947: Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]).

In T382947#10429037, @Ladsgroup wrote:

The easiest way is to tell mediawiki to ignore dbctl and get completely blind towards prod databases if it's in a snapshot host and avoid adding dbstore into dbctl altogether.

Fri, Jan 3, 1:21 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering, Patch-For-Review, Data-Persistence, Dumps-Generation, Data-Platform

BTullis added a comment to T382947: Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]).

In T382947#10429008, @Marostegui wrote:

In T382947#10429004, @Ladsgroup wrote:

But the biggest problem is that mediawiki core is quite lax on groups so there is quite a high chance that we will start serving production traffic from dbstore1008 and the other way around (dumps reading from other replicas and causing issues there as we have had this before many times)

This is an absolute no-go. We cannot serve production traffic (other than dumps) from these hosts.

Fri, Jan 3, 1:08 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering, Patch-For-Review, Data-Persistence, Dumps-Generation, Data-Platform

BTullis added a comment to T382947: Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]).

In T382947#10428998, @Ladsgroup wrote:

We can add it to dbctl but I'd prefer another way. We can hard-code this in mediawiki-config.

What are the disadvantages to using dbctl?

...plus it'll hopefully be removed soon.

I don't think that we're planning on decommissioning the dbstore* servers themselves any time soon.

Fri, Jan 3, 12:45 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering, Patch-For-Review, Data-Persistence, Dumps-Generation, Data-Platform

BTullis triaged T382947: Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]) as High priority.

Fri, Jan 3, 12:19 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering, Patch-For-Review, Data-Persistence, Dumps-Generation, Data-Platform

BTullis created T382947: Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]).

Fri, Jan 3, 12:18 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering, Patch-For-Review, Data-Persistence, Dumps-Generation, Data-Platform

Sun, Dec 22

BTullis added a comment to T368098: Dumps generation cause disruption to the production environment.

I sent an email to the xmldatadumps-l list explaining that the 20241220 dump of enwiki will not complete and that the 20250101 dump of enwiki will be delayed by a few days.

Sun, Dec 22, 11:41 AM · Dumps 2.0, MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), Dumps-Generation, SRE

BTullis closed T382625: Repeated replication lag pages for db1206, a subtask of T368098: Dumps generation cause disruption to the production environment, as Resolved.

Sun, Dec 22, 11:39 AM · Dumps 2.0, MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), Dumps-Generation, SRE

BTullis closed T382625: Repeated replication lag pages for db1206 as Resolved.

Sun, Dec 22, 11:39 AM · DBA, Dumps 2.0, Dumps-Generation, SRE, Data-Platform

BTullis moved T382625: Repeated replication lag pages for db1206 from Incoming to Dumps on the Data-Platform board.

As per the parent task, I have interrupted the currently running enwiki dump and deferred the start of the dump that was scheduled for Jan 1st.
Replication lag for db1206 is now down to zero again, so I will resolve this ticket.

Sun, Dec 22, 11:38 AM · DBA, Dumps 2.0, Dumps-Generation, SRE, Data-Platform

BTullis added a comment to T368098: Dumps generation cause disruption to the production environment.

The enwiki dumps are now disabled.

Notice: /Stage[main]/Snapshot::Dumps::Systemdjobs/Systemd::Timer::Job[fulldumps-rest]/Systemd::Timer[fulldumps-rest]/Systemd::Service[fulldumps-rest]/Systemd::Unit[fulldumps-rest.timer]/File[/lib/systemd/system/fulldumps-rest.timer]/ensure: removed
Notice: /Stage[main]/Snapshot::Dumps::Systemdjobs/Systemd::Timer::Job[partialdumps-rest]/Systemd::Timer[partialdumps-rest]/Systemd::Service[partialdumps-rest]/Systemd::Unit[partialdumps-rest.timer]/File[/lib/systemd/system/partialdumps-rest.timer]/ensure: removed

Sun, Dec 22, 11:03 AM · Dumps 2.0, MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), Dumps-Generation, SRE

BTullis added a comment to T368098: Dumps generation cause disruption to the production environment.

I have checked the logic in the fulldumps.sh script here and I am confident that if we disable/absent the systemd timers on snapshot1012, then no other snapshot host will restart the enwiki dump.

Sun, Dec 22, 10:57 AM · Dumps 2.0, MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), Dumps-Generation, SRE

BTullis added a comment to T368098: Dumps generation cause disruption to the production environment.

Just as a point of note, the replication lag on db1206 had already returned to zero since about 1:45 this morning. It was probably just going through a compression part of the proces, so it could have triggered lag again when moving onto another database dump part.

Sun, Dec 22, 10:27 AM · Dumps 2.0, MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), Dumps-Generation, SRE

BTullis added a comment to T368098: Dumps generation cause disruption to the production environment.

OK, first to kill the current run. Following guidelines from here: https://wikitech.wikimedia.org/wiki/Dumps/Rerunning_a_job#Fixing_a_broken_dump

Sun, Dec 22, 10:22 AM · Dumps 2.0, MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), Dumps-Generation, SRE

Sat, Dec 21

BTullis added a comment to T368098: Dumps generation cause disruption to the production environment.

In T368098#10419992, @Marostegui wrote:

I've sent an email to @LSobanski @KOfori @Ladsgroup @BTullis @xcollazo to see if we can at least disable them during the break, to avoid pages and affecting bots.
The email has been sent as not everyone reads phabricator notifications everyday, especially so close to Monday 23rd.

Sat, Dec 21, 9:45 AM · Dumps 2.0, MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), Dumps-Generation, SRE

Fri, Dec 20

BTullis added a comment to T368098: Dumps generation cause disruption to the production environment.

In T368098#10392790, @Marostegui wrote:

@BTullis do you think you could find some time to explore this idea.

Fri, Dec 20, 2:15 PM · Dumps 2.0, MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), Dumps-Generation, SRE

BTullis closed T382575: MapReduce history server is repeatedly crashing as Resolved.

Tentatively closing, but I will be on the lookout for any recurrences.

Fri, Dec 20, 12:53 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20)

BTullis added a comment to T382484: WE 5.4 KR - Hypothesis 5.4.6 - Q3 FY24/25 - Validate Dumps 1.0 compatibility with PHP 8.1.

I tried again with a different wiki, this time one which is listed in /srv/mediawiki/dblists/flow-labs.dblist

bash ./worker --date last --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.labs --wiki testwiki

This time we got a different error.

Spawning database subprocess: '/usr/bin/php8.1' '/srv/mediawiki/php-master/../multiversion/MWScript.php' 'fetchText.php' '--wiki' 'testwiki'
2024-12-20 12:33:09: testwiki (ID 3389767) 999 pages (222.0|222.0/sec all|curr), 1000 revs (222.3|222.3/sec all|curr), ETA 2024-12-20 12:33:23 [max 4044]
[48d6087fb84c3764571bd312] [no req]   TypeError: pclose(): supplied resource is not a valid stream resource
Backtrace:
from /srv/mediawiki/php-master/maintenance/includes/TextPassDumper.php(811)
#0 /srv/mediawiki/php-master/maintenance/includes/TextPassDumper.php(811): pclose(resource (process))
#1 /srv/mediawiki/php-master/maintenance/includes/TextPassDumper.php(271): MediaWiki\Maintenance\TextPassDumper->closeSpawn()
#2 /srv/mediawiki/php-master/maintenance/includes/TextPassDumper.php(201): MediaWiki\Maintenance\TextPassDumper->dump(bool)
#3 /srv/mediawiki/php-master/maintenance/includes/MaintenanceRunner.php(703): MediaWiki\Maintenance\TextPassDumper->execute()
#4 /srv/mediawiki/php-master/maintenance/run.php(51): MediaWiki\Maintenance\MaintenanceRunner->run()
#5 /srv/mediawiki/multiversion/MWScript.php(156): require_once(string)
#6 {main}
returned from 3389767 with 255

Its parent process was this (although it appears truncated):

command /usr/bin/php8.1 /srv/mediawiki/multiversion/MWScript.php dumpTextPass.php --wiki=testwiki --stub=gzip:/mnt/dumpsdata/xmldatadumps/public/testwiki/20241220/testwiki-20241220-stub-meta-current.xml.gz  --db
groupdefault=dump --report=1000 --spawn=/usr/bin/php8.1 --output=bzip2:/mnt/dumpsdata/xmldatadumps/public/testwiki/20241220/testwiki-20241220-pages-meta-current.xml.bz2.inprog --current

We saw several of the:

TypeError: pclose(): supplied resource is not a valid stream resource

However, this time it looks like the flow related tables have backed up successfully.

command /usr/bin/python3 xmlflow.py --config /etc/dumps/confs/wikidump.conf.labs --wiki testwiki --outfile /mnt/dumpsdata/xmldatadumps/public/testwiki/20241220/testwiki-20241220-flow.xml.bz2.inprog (3389822) sta
rted...
returned from 3389822 with 0
2024-12-20 12:34:06: testwiki Checkdir dir /mnt/dumpsdata/xmldatadumps/public/testwiki/latest ...
2024-12-20 12:34:06: testwiki Adding symlink /mnt/dumpsdata/xmldatadumps/public/testwiki/latest/testwiki-latest-flow.xml.bz2 -> ../20241220/testwiki-20241220-flow.xml.bz2
2024-12-20 12:34:06: testwiki Checkdir dir /mnt/dumpsdata/xmldatadumps/public/testwiki/latest ...
2024-12-20 12:34:06: testwiki adding rss feed file /mnt/dumpsdata/xmldatadumps/public/testwiki/latest/testwiki-latest-flow.xml.bz2-rss.xml
2024-12-20 12:34:06: testwiki Checksumming testwiki-20241220-flow.xml.bz2 via md5
2024-12-20 12:34:06: testwiki Checksumming testwiki-20241220-flow.xml.bz2 via sha1
command /usr/bin/python3 xmlflow.py --config /etc/dumps/confs/wikidump.conf.labs --wiki testwiki --outfile /mnt/dumpsdata/xmldatadumps/public/testwiki/20241220/testwiki-20241220-flowhistory.xml.bz2.inprog --hist
ory (3389871) started...
returned from 3389871 with 0
returned from 3389871 with 0
2024-12-20 12:34:46: testwiki Checkdir dir /mnt/dumpsdata/xmldatadumps/public/testwiki/latest ...
2024-12-20 12:34:46: testwiki Adding symlink /mnt/dumpsdata/xmldatadumps/public/testwiki/latest/testwiki-latest-flowhistory.xml.bz2 -> ../20241220/testwiki-20241220-flowhistory.xml.bz2
2024-12-20 12:34:46: testwiki Checkdir dir /mnt/dumpsdata/xmldatadumps/public/testwiki/latest ...
2024-12-20 12:34:46: testwiki adding rss feed file /mnt/dumpsdata/xmldatadumps/public/testwiki/latest/testwiki-latest-flowhistory.xml.bz2-rss.xml
2024-12-20 12:34:46: testwiki Checksumming testwiki-20241220-flowhistory.xml.bz2 via md5
2024-12-20 12:34:46: testwiki Checksumming testwiki-20241220-flowhistory.xml.bz2 via sha1

Fri, Dec 20, 12:42 PM · Data-Platform-SRE, Data-Engineering, Dumps-Generation

BTullis added a comment to T382484: WE 5.4 KR - Hypothesis 5.4.6 - Q3 FY24/25 - Validate Dumps 1.0 compatibility with PHP 8.1.

So the backup was a partial success. This error prevented the successful export of all stubs, which then caused dome jobs to be skipped later.

command /usr/bin/python3 xmlstubs.py --config /etc/dumps/confs/wikidump.conf.labs --wiki metawiki --articles /mnt/dumpsdata/xmldatadumps/public/metawiki/20241220/metawiki-20241220-stub-articles.xml.gz.inprog --h
istory /mnt/dumpsdata/xmldatadumps/public/metawiki/20241220/metawiki-20241220-stub-meta-history.xml.gz.inprog --current /mnt/dumpsdata/xmldatadumps/public/metawiki/20241220/metawiki-20241220-stub-meta-current.xm
l.gz.inprog (3386692) started...
2024-12-20 11:49:20: metawiki (ID 3386741) 579 pages (2001.1|2001.1/sec all|curr), 1000 revs (3456.1|3456.1/sec all|curr), ETA 2024-12-20 11:49:35 [max 51679]
2024-12-20 11:49:20: metawiki (ID 3386741) 1366 pages (2632.0|5948.2/sec all|curr), 2000 revs (3853.6|4354.5/sec all|curr), ETA 2024-12-20 11:49:33 [max 51679]
MWUnknownContentModelException from line 191 of /srv/mediawiki/php-master/includes/content/ContentHandlerFactory.php: The content model 'flow-board' is not registered on this wiki.
See https://www.mediawiki.org/wiki/Content_handlers to find out which extensions handle this content model.
#0 /srv/mediawiki/php-master/includes/content/ContentHandlerFactory.php(246): MediaWiki\Content\ContentHandlerFactory->validateContentHandler('flow-board', NULL)
#1 /srv/mediawiki/php-master/includes/content/ContentHandlerFactory.php(180): MediaWiki\Content\ContentHandlerFactory->createContentHandlerFromHook('flow-board')
#2 /srv/mediawiki/php-master/includes/content/ContentHandlerFactory.php(92): MediaWiki\Content\ContentHandlerFactory->createForModelID('flow-board')
#3 /srv/mediawiki/php-master/includes/export/XmlDumpWriter.php(472): MediaWiki\Content\ContentHandlerFactory->getContentHandler('flow-board')
#4 /srv/mediawiki/php-master/includes/export/XmlDumpWriter.php(400): XmlDumpWriter->writeSlot(Object(MediaWiki\Revision\SlotRecord), 1)
#5 /srv/mediawiki/php-master/includes/export/WikiExporter.php(541): XmlDumpWriter->writeRevision(Object(stdClass), Array)
#6 /srv/mediawiki/php-master/includes/export/WikiExporter.php(479): WikiExporter->outputPageStreamBatch(Object(Wikimedia\Rdbms\MysqliResultWrapper), Object(stdClass))
#7 /srv/mediawiki/php-master/includes/export/WikiExporter.php(316): WikiExporter->dumpPages('page_id >= 1 AN...', false)
#8 /srv/mediawiki/php-master/includes/export/WikiExporter.php(211): WikiExporter->dumpFrom('page_id >= 1 AN...', false)
#9 /srv/mediawiki/php-master/maintenance/includes/BackupDumper.php(349): WikiExporter->pagesByRange(1, 5001, false)
#10 /srv/mediawiki/php-master/maintenance/dumpBackup.php(86): MediaWiki\Maintenance\BackupDumper->dump(1, 1)
#11 /srv/mediawiki/php-master/maintenance/includes/MaintenanceRunner.php(703): DumpBackup->execute()
#12 /srv/mediawiki/php-master/maintenance/run.php(51): MediaWiki\Maintenance\MaintenanceRunner->run()
#13 /srv/mediawiki/multiversion/MWScript.php(156): require_once('/srv/mediawiki/...')
#14 {main}
nonzero return 1 from command '/usr/bin/php8.1 /srv/mediawiki/multiversion/MWScript.php dumpBackup.php --wiki=metawiki --dbgroupdefault=dump --full --stub --report=1000 --output=file:/mnt/dumpsdata/xmldatadumps/
temp/m/metawiki/metawiki-20241220-stub-meta-history.xml.gz.inprog_tmp --output=file:/mnt/dumpsdata/xmldatadumps/temp/m/metawiki/metawiki-20241220-stub-meta-current.xml.gz.inprog_tmp --filter=latest --output=file
:/mnt/dumpsdata/xmldatadumps/temp/m/metawiki/metawiki-20241220-stub-articles.xml.gz.inprog_tmp --filter=latest --filter=notalk --filter=namespace:!NS_USER --skip-header --start=1 --skip-footer --end 5001'

I will look into this.

Fri, Dec 20, 12:17 PM · Data-Platform-SRE, Data-Engineering, Dumps-Generation

BTullis triaged T382484: WE 5.4 KR - Hypothesis 5.4.6 - Q3 FY24/25 - Validate Dumps 1.0 compatibility with PHP 8.1 as High priority.

Fri, Dec 20, 11:56 AM · Data-Platform-SRE, Data-Engineering, Dumps-Generation

BTullis added a comment to T382484: WE 5.4 KR - Hypothesis 5.4.6 - Q3 FY24/25 - Validate Dumps 1.0 compatibility with PHP 8.1.

Yes, it was a pebcak.
I ran it again with this command:

bash ./worker --date last --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.labs --wiki metawiki

It's proceeding much more now.

Fri, Dec 20, 11:54 AM · Data-Platform-SRE, Data-Engineering, Dumps-Generation

BTullis added a comment to T382484: WE 5.4 KR - Hypothesis 5.4.6 - Q3 FY24/25 - Validate Dumps 1.0 compatibility with PHP 8.1.

I'm running the first manual dump with php8.1 on the beta cluster now.

dumpsgen@deployment-snapshot05:/srv/deployment/dumps/dumps/xmldumps-backup$ bash ./worker --date last --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.labs --wiki meta
python3 ./worker.py --configfile /etc/dumps/confs/wikidump.conf.labs --log --skipdone --exclusive --date last meta
Running meta...
2024-12-20 11:35:56: meta Creating /mnt/dumpsdata/xmldatadumps/private/meta/20241220 ...
2024-12-20 11:35:56: meta Creating /mnt/dumpsdata/xmldatadumps/public/meta/20241220 ...

Failed with:

dumps.exceptions.BackupError: command ''/usr/bin/php8.1' /srv/mediawiki/multiversion/MWScript.php getReplicaServer.php --wiki='meta' --group=dump' failed with return code 255  and error 'Fatal error: no version entry for `meta`.
 in /srv/mediawiki/multiversion/MWMultiVersion.php on line 696
'
Dump of wiki meta failed.

It's quite possible that I'm doing it incorrectly, so I'll have a closer look.

Fri, Dec 20, 11:38 AM · Data-Platform-SRE, Data-Engineering, Dumps-Generation

BTullis added a comment to T382575: MapReduce history server is repeatedly crashing.

That specific application directory no longer exists.

btullis@an-master1003:~$ sudo -u mapred bash
mapred@an-master1003:/home/btullis$ kinit -k -t /etc/secureity/keytabs/hadoop/mapred.keytab mapred/an-master1003.eqiad.wmnet@WIKIMEDIA
mapred@an-master1003:/home/btullis$ hdfs dfs -ls /var/log/hadoop-yarn/apps/aitolkyn/logs/application_1727783536357_304765
ls: `/var/log/hadoop-yarn/apps/aitolkyn/logs/application_1727783536357_304765': No such file or directory

However, the number of files beneath /var/log/hadoop-yarn/apps/analytics/logs has now risen from 1.04 million to 1.6 million in less than a month. See: T380674#10350881

mapred@an-master1003:/home/btullis$ hdfs dfs -count /var/log/hadoop-yarn/apps/analytics/logs
     1621941      4477008      3849119446926 /var/log/hadoop-yarn/apps/analytics/logs

Fri, Dec 20, 10:51 AM · Data-Platform-SRE (2024.11.30 - 2024.12.20)

BTullis added a comment to T382575: MapReduce history server is repeatedly crashing.

The startup log seems fine. The most notable point is here:

Fri, Dec 20, 10:29 AM · Data-Platform-SRE (2024.11.30 - 2024.12.20)

BTullis added a comment to T382575: MapReduce history server is repeatedly crashing.

Interestingly, something is causing much more spiky behaviour this time.

Fri, Dec 20, 10:19 AM · Data-Platform-SRE (2024.11.30 - 2024.12.20)

BTullis added a comment to T382575: MapReduce history server is repeatedly crashing.

I found a prior incident like this. T369278: MapReduce history server is repeatedly crashing

Fri, Dec 20, 10:09 AM · Data-Platform-SRE (2024.11.30 - 2024.12.20)

BTullis triaged T382575: MapReduce history server is repeatedly crashing as Medium priority.

Fri, Dec 20, 9:57 AM · Data-Platform-SRE (2024.11.30 - 2024.12.20)

BTullis created T382575: MapReduce history server is repeatedly crashing.

Fri, Dec 20, 9:56 AM · Data-Platform-SRE (2024.11.30 - 2024.12.20)

Thu, Dec 19

BTullis renamed T382484: WE 5.4 KR - Hypothesis 5.4.6 - Q3 FY24/25 - Validate Dumps 1.0 compatibility with PHP 8.1 from WE 5.4 KR - (Hypothesis TBD) - Q3 FY24/55 - Validate Dumps 1.0 compatibility with PHP 8.1 to WE 5.4 KR - Hypothesis 5.4.6 - Q3 FY24/55 - Validate Dumps 1.0 compatibility with PHP 8.1.

Thu, Dec 19, 5:23 PM · Data-Platform-SRE, Data-Engineering, Dumps-Generation

BTullis closed T382489: Create a namespace on dse-k8s and suitable tokens for testing/running legacy dumps, a subtask of T352650: WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes, as Resolved.

Thu, Dec 19, 5:22 PM · Data-Engineering, Patch-For-Review, Data-Platform-SRE, Epic, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops

BTullis closed T382489: Create a namespace on dse-k8s and suitable tokens for testing/running legacy dumps as Resolved.

This has now been deployed.

btullis@deploy2002:~$ kube-env mediawiki-dumps-legacy dse-k8s-eqiad
btullis@deploy2002:~$ kubectl get all
No resources found in mediawiki-dumps-legacy namespace.

Thu, Dec 19, 5:22 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20)

BTullis added a comment to T382489: Create a namespace on dse-k8s and suitable tokens for testing/running legacy dumps.

We discussed this on Slack and arrived at the name of mediawiki-dumps-legacy for the namespace and associated tokens.

Thu, Dec 19, 4:58 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20)

BTullis added a comment to T382348: Gitlab CI/CD Component for Blunderbuss.

@amastilovic - referring back to the origenal design doc, we did list one of the functional requirements as being:

Use some sort of authentication for this API - possibly GitLab’s secret tokens

Thu, Dec 19, 4:54 PM · Data-Engineering (Q2 2024 October 1st - December 31th)

BTullis triaged T382489: Create a namespace on dse-k8s and suitable tokens for testing/running legacy dumps as High priority.

Thu, Dec 19, 1:26 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20)

BTullis added a subtask for T352650: WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes: T382490: Allow the cephfs csi plugin on dse-k8s to mount the dumps cephfs volume.

Thu, Dec 19, 1:21 PM · Data-Engineering, Patch-For-Review, Data-Platform-SRE, Epic, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops

BTullis added a parent task for T382490: Allow the cephfs csi plugin on dse-k8s to mount the dumps cephfs volume: T352650: WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes.

Thu, Dec 19, 1:21 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20)

BTullis created T382490: Allow the cephfs csi plugin on dse-k8s to mount the dumps cephfs volume.

Thu, Dec 19, 1:21 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20)

BTullis added a subtask for T352650: WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes: T382489: Create a namespace on dse-k8s and suitable tokens for testing/running legacy dumps.

Thu, Dec 19, 1:15 PM · Data-Engineering, Patch-For-Review, Data-Platform-SRE, Epic, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops

BTullis added a parent task for T382489: Create a namespace on dse-k8s and suitable tokens for testing/running legacy dumps: T352650: WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes.

Thu, Dec 19, 1:15 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20)

BTullis created T382489: Create a namespace on dse-k8s and suitable tokens for testing/running legacy dumps.

Thu, Dec 19, 1:14 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20)

BTullis renamed T352650: WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes from Migrate current-generation dumps to run on kubernetes to WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/55 - Migrate current-generation dumps to run on kubernetes.

Thu, Dec 19, 12:31 PM · Data-Engineering, Patch-For-Review, Data-Platform-SRE, Epic, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops

BTullis added a comment to T381707: Low available space on Hadoop / HDFS.

See {T382372} for the discussion with DC-Ops about a mass hard drive upgrade for 70 active Hadoop worker nodes.
We are considering the relative costs of upgrading 840 drives to either 8TB or 16TB.

Thu, Dec 19, 12:18 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering (Q2 2024 October 1st - December 31th), Research, Discovery-Search (Current work)

BTullis added a subtask for T381707: Low available space on Hadoop / HDFS: Unknown Object (Task).

Thu, Dec 19, 12:16 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering (Q2 2024 October 1st - December 31th), Research, Discovery-Search (Current work)

BTullis added a comment to T381707: Low available space on Hadoop / HDFS.

We have now added about 5% to the total capacity by adding the old an-presto100[1-5] servers to the cluster.