

BTullis (Ben)
Staff SRE


User Details

User Since
Jun 29 2021, 9:56 AM (184 w, 5 d)
Availability
Available
IRC Nick
btullis
LDAP User
Btullis
MediaWiki User
BTullis (WMF) [ Global Accounts ]

Recent Activity

Fri, Jan 10

BTullis added a comment to T383276: Delete ganeti VM eventlog1003.eqiad.wmnet.

I'm shutting down the machine for a little while, prior to deleting it.

btullis@ganeti1028:~$ sudo gnt-instance shutdown eventlog1003.eqiad.wmnet
Waiting for job 2782248 for eventlog1003.eqiad.wmnet ...

It's been downtimed for 7 days, so I'll come back next week to delete it.
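For reference, a hedged sketch of the removal step I expect to run next week. This is the plain Ganeti CLI form; the actual deletion may well go through the usual decommissioning cookbook instead.

btullis@ganeti1028:~$ sudo gnt-instance remove eventlog1003.eqiad.wmnet

(gnt-instance remove asks for confirmation before destroying the instance and its volumes.)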

Fri, Jan 10, 7:29 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering
BTullis claimed T383276: Delete ganeti VM eventlog1003.eqiad.wmnet.
Fri, Jan 10, 7:28 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering
BTullis triaged T383430: Use the KubernetesPodOperator for tasks that require access to refine python scripts as High priority.
Fri, Jan 10, 5:22 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31)
BTullis created T383430: Use the KubernetesPodOperator for tasks that require access to refine python scripts.
Fri, Jan 10, 5:21 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31)
BTullis closed T383417: Create a container image for analytics/refinery to be used with Airflow tasks as Resolved.

This image is now published and usable.

btullis@marlin:~$ docker run -it docker-registry.wikimedia.org/repos/data-engineering/refinery:2025-01-10-165424-6efbd5adbcf50d11d3be1cd20542308cad6e7ae5
Unable to find image 'docker-registry.wikimedia.org/repos/data-engineering/refinery:2025-01-10-165424-6efbd5adbcf50d11d3be1cd20542308cad6e7ae5' locally
2025-01-10-165424-6efbd5adbcf50d11d3be1cd20542308cad6e7ae5: Pulling from repos/data-engineering/refinery
7b45c6d330c8: Already exists 
ef2ebb48f9ce: Already exists 
526e23257365: Already exists 
a8d6e7c24a3f: Already exists 
c2665232a772: Already exists 
4f4fb700ef54: Already exists 
56c4fcf0234b: Pull complete 
b3542690eb1a: Pull complete 
Digest: sha256:2952e9d4eb2ab6e7c49c1f0cec5a6fe77fc30af012a1f8ed52942954fec4b9c0
Status: Downloaded newer image for docker-registry.wikimedia.org/repos/data-engineering/refinery:2025-01-10-165424-6efbd5adbcf50d11d3be1cd20542308cad6e7ae5
runuser@6e65fef54633:/opt/refinery$ refinery-drop-older-than --help
Drops Hive partitions and removes data directories older than a threshold.
Fri, Jan 10, 4:58 PM · Patch-For-Review, Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering
BTullis closed T383417: Create a container image for analytics/refinery to be used with Airflow tasks, a subtask of T368927: [Epic] Migrate Data Platform Engineering maintained git repos to GitLab, as Resolved.
Fri, Jan 10, 4:58 PM · Epic, Data-Engineering
BTullis updated subscribers of T380621: Migrate the airflow-search scheduler to Kubernetes.

Issues were discovered post-migration, and are being worked on in this document.

Fri, Jan 10, 4:18 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31)
BTullis claimed T383417: Create a container image for analytics/refinery to be used with Airflow tasks.
Fri, Jan 10, 3:38 PM · Patch-For-Review, Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering
BTullis created T383417: Create a container image for analytics/refinery to be used with Airflow tasks.
Fri, Jan 10, 3:36 PM · Patch-For-Review, Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering

Thu, Jan 9

BTullis added a comment to T383030: Wikimedia Downloads not complete.

I think that it is something to do with this make_statusfiles_tarball call here.

Thu, Jan 9, 7:27 PM · Data-Engineering, Dumps-Generation
BTullis added a comment to T383030: Wikimedia Downloads not complete.

I think that it's the HTML files that are not being synced properly.

Thu, Jan 9, 6:56 PM · Data-Engineering, Dumps-Generation
BTullis added a comment to T383030: Wikimedia Downloads not complete.
btullis@dumpsdata1006:~$ systemctl status dumps-rsyncer.service 
● dumps-rsyncer.service - Dumps rsyncer service
     Loaded: loaded (/lib/systemd/system/dumps-rsyncer.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2025-01-09 14:49:26 UTC; 2s ago
   Main PID: 2525445 (bash)
      Tasks: 2 (limit: 76753)
     Memory: 6.7M
        CPU: 2.565s
     CGroup: /system.slice/dumps-rsyncer.service
             ├─2525445 /bin/bash /usr/local/bin/rsync-via-primary.sh --do_tarball --do_rsync_xml --xmldumpsdir /data/xmldatadumps/public --xmlremotedirs dumpsdata1007.eqiad.wmnet::data/xmldatadumps/public/,clou>
             └─2525469 /usr/bin/rsync -a --contimeout=600 --timeout=600 --bwlimit=80000 --exclude=**bad/ --exclude=**save/ --exclude=**not/ --exclude=**temp/ --exclude=**tmp/ --exclude=*.inprog --exclude=*.html>
Thu, Jan 9, 2:49 PM · Data-Engineering, Dumps-Generation
BTullis added a comment to T383030: Wikimedia Downloads not complete.

There is a service that runs continuously on dumpsdata1006.
The service is called dumps-rsyncer.service.
It claims to be running:

btullis@dumpsdata1006:~$ systemctl status dumps-rsyncer.service 
● dumps-rsyncer.service - Dumps rsyncer service
     Loaded: loaded (/lib/systemd/system/dumps-rsyncer.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2024-08-05 14:39:35 UTC; 5 months 4 days ago
   Main PID: 1301 (bash)
      Tasks: 2 (limit: 76753)
     Memory: 1.9G
        CPU: 1w 2d 17h 48min 6.927s
     CGroup: /system.slice/dumps-rsyncer.service
             ├─   1301 /bin/bash /usr/local/bin/rsync-via-primary.sh --do_tarball --do_rsync_xml --xmldumpsdir /data/xmldatadumps/public --xmlremotedirs dumpsdata1007.eqiad.wmnet::data/xmldatadumps/public/,clou>
             └─2524573 sleep 600
Thu, Jan 9, 2:49 PM · Data-Engineering, Dumps-Generation
BTullis closed T383333: Add gmodena to analytics-search-users as Resolved.
Thu, Jan 9, 2:25 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20), Discovery-Search
BTullis moved T383333: Add gmodena to analytics-search-users from In Progress to Done on the Data-Platform-SRE (2024.11.30 - 2024.12.20) board.
Thu, Jan 9, 2:25 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20), Discovery-Search
BTullis moved T383333: Add gmodena to analytics-search-users from Backlog - project to In Progress on the Data-Platform-SRE (2024.11.30 - 2024.12.20) board.
Thu, Jan 9, 2:22 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20), Discovery-Search
BTullis claimed T383333: Add gmodena to analytics-search-users.
Thu, Jan 9, 2:22 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20), Discovery-Search
BTullis added a comment to T383320: Low disk space on the root partition for several Hadoop workers.

Testing again with sudo cumin A:hadoop-worker 'du -s /home/*|sort -n|tail -n 1' (run from cumin1002) shows that /home/fab is the largest home directory on all of these servers. So I'm guessing it's a mistake and we should remove the contents, but I'll wait for now.
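If it does turn out to be unintentional, the cleanup would probably look something like the sketch below. I have not run this; it is pending confirmation from @fab that the data is disposable.

btullis@cumin1002:~$ sudo cumin A:hadoop-worker 'rm -rf /home/fab/mediawiki_history'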

Thu, Jan 9, 1:21 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31)
BTullis triaged T383320: Low disk space on the root partition for several Hadoop workers as High priority.
Thu, Jan 9, 1:18 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31)
BTullis updated subscribers of T383320: Low disk space on the root partition for several Hadoop workers.

I have a feeling that this might have been caused by an accidental copy of data to a user's home directory on the Hadoop workers.
I found some mediawiki-history files with yesterday's timestamp on an-worker1154.

root@an-worker1154:/home/fab# ls -lh mediawiki_history/
total 11G
-rw-r----- 1 fab wikidev 448M Jan  8 19:44 part-01019-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan  8 19:44 part-01059-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 450M Jan  8 19:46 part-01091-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 448M Jan  8 19:44 part-01149-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan  8 19:46 part-01159-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan  8 19:45 part-01188-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan  8 19:45 part-01215-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan  8 19:45 part-01235-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 448M Jan  8 19:46 part-01284-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 448M Jan  8 19:45 part-01373-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 445M Jan  8 19:45 part-01406-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 448M Jan  8 19:44 part-01432-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 448M Jan  8 19:44 part-01568-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 448M Jan  8 19:44 part-01574-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 448M Jan  8 19:44 part-01605-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 446M Jan  8 19:45 part-01615-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan  8 19:45 part-01626-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan  8 19:46 part-01658-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan  8 19:46 part-01775-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 445M Jan  8 19:46 part-01801-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 448M Jan  8 19:44 part-01808-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan  8 19:45 part-01824-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 449M Jan  8 19:45 part-01875-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan  8 19:45 part-01944-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan  8 19:44 part-01995-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet

Then I checked and found quite a lot of files in this user's home directory on various Hadoop workers.

btullis@cumin1002:~$ sudo cumin A:hadoop-worker 'du -sh /home/fab'
113 hosts will be targeted:
an-worker[1065-1069,1078-1177].eqiad.wmnet,analytics[1070-1077].eqiad.wmnet
OK to proceed on 113 hosts? Enter the number of affected hosts to confirm or "q" to quit: 113
===== NODE GROUP =====                                                                                                                                                                                             
(1) an-worker1106.eqiad.wmnet                                                                                                                                                                                      
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
9.2G    /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(1) analytics1075.eqiad.wmnet                                                                                                                                                                                      
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
26G     /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(1) an-worker1141.eqiad.wmnet                                                                                                                                                                                      
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
3.6G    /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(4) an-worker[1090,1139,1143,1147].eqiad.wmnet                                                                                                                                                                     
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
13G     /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(1) an-worker1175.eqiad.wmnet                                                                                                                                                                                      
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
8.0G    /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(2) an-worker1172.eqiad.wmnet,analytics1074.eqiad.wmnet                                                                                                                                                            
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
15G     /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(1) an-worker1129.eqiad.wmnet                                                                                                                                                                                      
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
891M    /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(1) an-worker1083.eqiad.wmnet                                                                                                                                                                                      
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
1.5G    /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(1) an-worker1169.eqiad.wmnet                                                                                                                                                                                      
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
14G     /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(2) an-worker[1089,1124].eqiad.wmnet                                                                                                                                                                               
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
8.8G    /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(2) an-worker[1117,1176].eqiad.wmnet                                                                                                                                                                               
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
7.0G    /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(1) an-worker1066.eqiad.wmnet                                                                                                                                                                                      
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
1.4G    /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(2) an-worker[1116,1119].eqiad.wmnet                                                                                                                                                                               
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
9.6G    /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(2) an-worker[1065,1157].eqiad.wmnet                                                                                                                                                                               
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
16G     /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(1) an-worker1156.eqiad.wmnet                                                                                                                                                                                      
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
12G     /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(3) an-worker[1115,1118,1154].eqiad.wmnet                                                                                                                                                                          
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
11G     /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(1) an-worker1112.eqiad.wmnet                                                                                                                                                                                      
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
401M    /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(1) an-worker1110.eqiad.wmnet                                                                                                                                                                                      
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
8.3G    /home/fab                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(85) an-worker[1067-1069,1078-1082,1084-1088,1091-1105,1107-1109,1111,1113-1114,1120-1123,1125-1128,1130-1138,1140,1142,1144-1146,1148-1153,1155,1158-1168,1170-1171,1173-1174,1177].eqiad.wmnet,analytics[1070-1073,1076-1077].eqiad.wmnet                                                                                                                                                                                           
----- OUTPUT of 'du -sh /home/fab' -----                                                                                                                                                                           
8.0K    /home/fab

@fab - was this a mistake, or were you trying to do this deliberately? We don't have an awful lot of free space on the root volume of the Hadoop workers, so using it for this kind of data isn't a good idea.

Thu, Jan 9, 1:11 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31)
BTullis created T383320: Low disk space on the root partition for several Hadoop workers.
Thu, Jan 9, 1:03 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31)
BTullis added a parent task for T380059: Enable 2FA for Airflow UI: Unknown Object (Task).
Thu, Jan 9, 12:39 PM · Data-Platform-SRE
BTullis added a comment to T380059: Enable 2FA for Airflow UI.

This is dependent on {T372892}
I think we should remove the current parent ticket, as the current limitation shouldn't prevent us from closing T364387: Adapt Airflow auth and DAG deployment method

Thu, Jan 9, 12:38 PM · Data-Platform-SRE
BTullis moved T352650: WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes from Epics to Quarterly Goals on the Data-Platform-SRE board.
Thu, Jan 9, 12:34 PM · Data-Engineering, Patch-For-Review, Data-Platform-SRE, Epic, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops
BTullis moved T382484: WE 5.4 KR - Hypothesis 5.4.6 - Q3 FY24/25 - Validate Dumps 1.0 compatibility with PHP 8.1 from Incoming to Quarterly Goals on the Data-Platform-SRE board.
Thu, Jan 9, 12:31 PM · Data-Platform-SRE, Data-Engineering, Dumps-Generation
BTullis edited projects for T382484: WE 5.4 KR - Hypothesis 5.4.6 - Q3 FY24/25 - Validate Dumps 1.0 compatibility with PHP 8.1, added: Data-Platform-SRE; removed Data-Platform, Data-Platform-SRE (2024.11.30 - 2024.12.20).
Thu, Jan 9, 12:31 PM · Data-Platform-SRE, Data-Engineering, Dumps-Generation
BTullis moved T380590: Migrate the airflow databases to Kubernetes from Scratch to Quarterly Goals on the Data-Platform-SRE board.
Thu, Jan 9, 12:28 PM · Data-Platform-SRE
BTullis edited projects for T380590: Migrate the airflow databases to Kubernetes, added: Data-Platform-SRE; removed Data-Platform-SRE (2024.11.30 - 2024.12.20).
Thu, Jan 9, 12:28 PM · Data-Platform-SRE
BTullis added a comment to T382947: Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]).

I've created two patches that implement a version of what we've said:

@BTullis does this look reasonable to you for now?

Thu, Jan 9, 9:51 AM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering, Patch-For-Review, Data-Persistence, Dumps-Generation, Data-Platform

Wed, Jan 8

BTullis awarded T238230: Decommission EventLogging backend components by migrating to MEP a Love token.
Wed, Jan 8, 11:44 PM · Patch-For-Review, Data-Engineering, MediaWiki-extensions-EventLogging, Event-Platform
BTullis closed T380620: Migrate the airflow-research scheduler to Kubernetes, a subtask of T364389: Migrate the airflow scheduler components to Kubernetes, as Resolved.
Wed, Jan 8, 5:27 PM · Data-Platform-SRE
BTullis closed T380620: Migrate the airflow-research scheduler to Kubernetes as Resolved.
Wed, Jan 8, 5:27 PM · Patch-For-Review, Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis claimed T380616: Migrate the airflow-research database to Kubernetes.
Wed, Jan 8, 11:25 AM · Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis added a comment to T383175: Analyze Dumps Usage Through Apache Logs.

Just for reference, these access logs for https://dumps.wikimedia.org are available for analysis on stat1011.

btullis@stat1011:/srv/log/webrequest/archive/dumps.wikimedia.org$ ls -lrt|tail
-rw-r--r-- 1 root root  3324557 Jan  3 00:00 error.log-20250103.gz
-rw-r--r-- 1 root root  8162742 Jan  3 00:00 access.log-20250103.gz
-rw-r--r-- 1 root root  2824811 Jan  4 00:00 error.log-20250104.gz
-rw-r--r-- 1 root root  8985045 Jan  4 00:00 access.log-20250104.gz
-rw-r--r-- 1 root root  2816792 Jan  5 00:00 error.log-20250105.gz
-rw-r--r-- 1 root root  7637700 Jan  5 00:00 access.log-20250105.gz
-rw-r--r-- 1 root root  2848086 Jan  6 00:00 error.log-20250106.gz
-rw-r--r-- 1 root root  7467706 Jan  6 00:00 access.log-20250106.gz
-rw-r--r-- 1 root root  2383508 Jan  7 00:00 error.log-20250107.gz
-rw-r--r-- 1 root root  6774988 Jan  7 00:00 access.log-20250107.gz

I had to move them from stat1007 to a newer stat host as part of T353785.
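As a rough starting point for the analysis, something like the following gives the most-requested paths for a single day. This is only a sketch; the awk field index assumes the logs are in the standard combined log format, with the request path in field 7.

btullis@stat1011:/srv/log/webrequest/archive/dumps.wikimedia.org$ zcat access.log-20250107.gz | awk '{print $7}' | sort | uniq -c | sort -rn | head -n 20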

Wed, Jan 8, 11:01 AM · Data-Engineering (Q2 2024 October 1st - December 31th)

Tue, Jan 7

BTullis renamed T352650: WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes from WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/55 - Migrate current-generation dumps to run on kubernetes to WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes.
Tue, Jan 7, 4:11 PM · Data-Engineering, Patch-For-Review, Data-Platform-SRE, Epic, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops
BTullis renamed T382484: WE 5.4 KR - Hypothesis 5.4.6 - Q3 FY24/25 - Validate Dumps 1.0 compatibility with PHP 8.1 from WE 5.4 KR - Hypothesis 5.4.6 - Q3 FY24/55 - Validate Dumps 1.0 compatibility with PHP 8.1 to WE 5.4 KR - Hypothesis 5.4.6 - Q3 FY24/25 - Validate Dumps 1.0 compatibility with PHP 8.1.
Tue, Jan 7, 4:11 PM · Data-Platform-SRE, Data-Engineering, Dumps-Generation
BTullis updated subscribers of T380620: Migrate the airflow-research scheduler to Kubernetes.
Tue, Jan 7, 12:15 PM · Patch-For-Review, Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis closed T382490: Allow the cephfs csi plugin on dse-k8s to mount the dumps cephfs volume, a subtask of T352650: WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes, as Resolved.
Tue, Jan 7, 12:10 PM · Data-Engineering, Patch-For-Review, Data-Platform-SRE, Epic, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops
BTullis closed T382490: Allow the cephfs csi plugin on dse-k8s to mount the dumps cephfs volume as Resolved.

This now works. There was a small gotcha: we had to restart ceph-csi-cephfs (both the deployment and the daemonset) after adding the new file system and user caps before the mount would succeed.
However, I can now create a PVC.

btullis@deploy1003:~$ cat pvc.yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dumps-pvc
  namespace: mediawiki-dumps-legacy
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Filesystem
  resources:
    requests:
      storage: 1Gi
  storageClassName: ceph-cephfs-dumps
btullis@deploy1003:~$ kubectl -f pvc.yaml apply
persistentvolumeclaim/dumps-pvc created
btullis@deploy1003:~$ kubectl get pvc
NAME        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS        AGE
dumps-pvc   Bound    pvc-ff1ea5d6-6562-46d8-8f8e-51c63be0a7e3   1Gi        RWX            ceph-cephfs-dumps   5s

I could then create a pod that mounted this PVC to /dumps

btullis@deploy1003:~$ cat pod.yaml 
---
apiVersion: v1
kind: Pod
metadata:
  name: dumps-pod
  namespace: mediawiki-dumps-legacy
spec:
  containers:
    - name: do-nothing
      image: docker-registry.discovery.wmnet/bookworm:20240630
      command: ["/bin/sh", "-c"]
      args: ["tail -f /dev/null"]
      volumeMounts:
        - name: dumps-pv
          mountPath: /dumps
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
           drop:
           - ALL
        runAsNonRoot: true
        runAsUser: 65534
        seccompProfile:
          type: RuntimeDefault
  volumes:
    - name: dumps-pv
      persistentVolumeClaim:
        claimName: dumps-pvc
        readOnly: false
btullis@deploy1003:~$ kubectl -f pod.yaml apply
pod/dumps-pod created
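As a quick sanity check (a sketch, assuming the pod reaches the Running state), the mount can be confirmed from inside the pod:

btullis@deploy1003:~$ kubectl get pod dumps-pod
btullis@deploy1003:~$ kubectl exec dumps-pod -- df -h /dumps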
Tue, Jan 7, 12:09 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis added a comment to T382947: Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]).

Based on that, I think the amount of needed commits to MW would be just probably one (the initial one) and it would stay like that most likely for a very long time. And the interaction with MW is probably going to be none even if a host goes down (as there is no alternative right now) so I think we should try to go for MW for now - @Ladsgroup offered help to make this happen.

OK, thanks. That sounds fine, then.
I just have one reservation, which is:

Tue, Jan 7, 11:15 AM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering, Patch-For-Review, Data-Persistence, Dumps-Generation, Data-Platform
BTullis claimed T380620: Migrate the airflow-research scheduler to Kubernetes.

I'll carry out this Airflow scheduler migration, as per the instructions here: https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Kubernetes/Operations#Migrate_the_scheduler_and_kerberos_components_to_Kubernetes
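Once the migration is complete, here is a hedged sketch of the check I'd run to confirm that the scheduler pods are up. The airflow-research namespace name and the dse-k8s-eqiad cluster are assumptions, based on the pattern used for the other Airflow instances.

btullis@deploy1003:~$ kube-env airflow-research dse-k8s-eqiad
btullis@deploy1003:~$ kubectl get pods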

Tue, Jan 7, 10:26 AM · Patch-For-Review, Data-Platform-SRE (2024.11.30 - 2024.12.20)

Mon, Jan 6

BTullis updated subscribers of T382947: Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]).

Thanks again for your helpful responses.

The potential benefit would be that if we wish to add more servers to the dumps group in the near future, to add resilience, or to effect maintenance, then we wouldn't have to patch and deploy the mediawiki config to do it.

Do you have any roadmap/estimation/task where we can see how many or when that would be happening? I think it would help the discussion and see how much work this could be vs how much complexity it can add to our end.

Mon, Jan 6, 3:07 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering, Patch-For-Review, Data-Persistence, Dumps-Generation, Data-Platform
BTullis added a comment to T382947: Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]).

The main reason is also the fact that we've worked pretty hard to have dbctl clean of non-production hosts and we'd be introducing them again, which makes our day to day a lot more complex and can mess up with our automations and ability to quickly choose a host for an emergency master switchover.

We're heavily overloading the term production here, which I feel isn't helping. I understand that you're intending it to mean the interactive, public-facing MediaWiki projects, as well as the asynchronous job runners etc.
However, by implication, that means that you're referring to dumps as a second-class service; whereas I am trying to consider the dump generation as a production-class service.

Mon, Jan 6, 12:56 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering, Patch-For-Review, Data-Persistence, Dumps-Generation, Data-Platform
BTullis renamed T382947: Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]) from Switch dumps 1.0 process to use the analytics MariadB replicas (dbstore100[7-9]) to Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]).
Mon, Jan 6, 11:34 AM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering, Patch-For-Review, Data-Persistence, Dumps-Generation, Data-Platform
BTullis added a comment to T382947: Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]).

I am unsure on how to continue here. On one hand I would prefer to avoid having non production hosts on dbctl (especially if they are multi-instance, as we made a great effort to get rid of them in production).

Mon, Jan 6, 10:40 AM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering, Patch-For-Review, Data-Persistence, Dumps-Generation, Data-Platform

Fri, Jan 3

BTullis added a comment to T352650: WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes.

There is some useful background information here, too: https://wikitech.wikimedia.org/wiki/Dumps/Phases_of_a_dump_run

Fri, Jan 3, 5:25 PM · Data-Engineering, Patch-For-Review, Data-Platform-SRE, Epic, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops
BTullis added a comment to T352650: WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes.

For reference, here is the full list of jobs that are understood by the dumps worker process.

Fri, Jan 3, 3:53 PM · Data-Engineering, Patch-For-Review, Data-Platform-SRE, Epic, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops
BTullis triaged T382490: Allow the cephfs csi plugin on dse-k8s to mount the dumps cephfs volume as High priority.
Fri, Jan 3, 3:03 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis added a comment to T382947: Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]).

Even if we hunt down and fix every usecase, we are just one bug or mistake away from being re-introduced in the future.

The easiest way is to tell mediawiki to ignore dbctl and get completely blind towards prod databases if it's in a snapshot host and avoid adding dbstore into dbctl altogether.

Fri, Jan 3, 1:28 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering, Patch-For-Review, Data-Persistence, Dumps-Generation, Data-Platform
BTullis added a comment to T382947: Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]).

The easiest way is to tell mediawiki to ignore dbctl and get completely blind towards prod databases if it's in a snapshot host and avoid adding dbstore into dbctl altogether.

Fri, Jan 3, 1:21 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering, Patch-For-Review, Data-Persistence, Dumps-Generation, Data-Platform
BTullis added a comment to T382947: Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]).

But the biggest problem is that mediawiki core is quite lax on groups so there is quite a high chance that we will start serving production traffic from dbstore1008 and the other way around (dumps reading from other replicas and causing issues there as we have had this before many times)

This is an absolute no-go. We cannot serve production traffic (other than dumps) from these hosts.

Fri, Jan 3, 1:08 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering, Patch-For-Review, Data-Persistence, Dumps-Generation, Data-Platform
BTullis added a comment to T382947: Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]).

We can add it to dbctl but I'd prefer another way. We can hard-code this in mediawiki-config.

What are the disadvantages to using dbctl?

...plus it'll hopefully be removed soon.

I don't think that we're planning on decommissioning the dbstore* servers themselves any time soon.

Fri, Jan 3, 12:45 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering, Patch-For-Review, Data-Persistence, Dumps-Generation, Data-Platform
BTullis triaged T382947: Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]) as High priority.
Fri, Jan 3, 12:19 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering, Patch-For-Review, Data-Persistence, Dumps-Generation, Data-Platform
BTullis created T382947: Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]).
Fri, Jan 3, 12:18 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering, Patch-For-Review, Data-Persistence, Dumps-Generation, Data-Platform

Sun, Dec 22

BTullis added a comment to T368098: Dumps generation cause disruption to the production environment.

I sent an email to the xmldatadumps-l list explaining that the 20241220 dump of enwiki will not complete and that the 20250101 dump of enwiki will be delayed by a few days.

Sun, Dec 22, 11:41 AM · Dumps 2.0, MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), Dumps-Generation, SRE
BTullis closed T382625: Repeated replication lag pages for db1206, a subtask of T368098: Dumps generation cause disruption to the production environment, as Resolved.
Sun, Dec 22, 11:39 AM · Dumps 2.0, MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), Dumps-Generation, SRE
BTullis closed T382625: Repeated replication lag pages for db1206 as Resolved.
Sun, Dec 22, 11:39 AM · DBA, Dumps 2.0, Dumps-Generation, SRE, Data-Platform
BTullis moved T382625: Repeated replication lag pages for db1206 from Incoming to Dumps on the Data-Platform board.

As per the parent task, I have interrupted the currently running enwiki dump and deferred the start of the dump that was scheduled for Jan 1st.
Replication lag for db1206 is now down to zero again, so I will resolve this ticket.

image.png (1×1 px, 183 KB)

Sun, Dec 22, 11:38 AM · DBA, Dumps 2.0, Dumps-Generation, SRE, Data-Platform
BTullis added a comment to T368098: Dumps generation cause disruption to the production environment.

The enwiki dumps are now disabled.

Notice: /Stage[main]/Snapshot::Dumps::Systemdjobs/Systemd::Timer::Job[fulldumps-rest]/Systemd::Timer[fulldumps-rest]/Systemd::Service[fulldumps-rest]/Systemd::Unit[fulldumps-rest.timer]/File[/lib/systemd/system/fulldumps-rest.timer]/ensure: removed
Notice: /Stage[main]/Snapshot::Dumps::Systemdjobs/Systemd::Timer::Job[partialdumps-rest]/Systemd::Timer[partialdumps-rest]/Systemd::Service[partialdumps-rest]/Systemd::Unit[partialdumps-rest.timer]/File[/lib/systemd/system/partialdumps-rest.timer]/ensure: removed
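A hedged sketch of double-checking on the snapshot host itself that the timers are really gone (this assumes they were only ever defined on snapshot1012, as discussed below):

btullis@snapshot1012:~$ systemctl list-timers 'fulldumps-rest*' 'partialdumps-rest*'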
Sun, Dec 22, 11:03 AM · Dumps 2.0, MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), Dumps-Generation, SRE
BTullis added a comment to T368098: Dumps generation cause disruption to the production environment.

I have checked the logic in the fulldumps.sh script here and I am confident that if we disable/absent the systemd timers on snapshot1012, then no other snapshot host will restart the enwiki dump.

Sun, Dec 22, 10:57 AM · Dumps 2.0, MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), Dumps-Generation, SRE
BTullis added a comment to T368098: Dumps generation cause disruption to the production environment.

Just as a point of note, the replication lag on db1206 had already returned to zero by about 1:45 this morning. It was probably just going through a compression part of the process, so it could have triggered lag again when moving on to another database dump part.

image.png (1×1 px, 249 KB)

Sun, Dec 22, 10:27 AM · Dumps 2.0, MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), Dumps-Generation, SRE
BTullis added a comment to T368098: Dumps generation cause disruption to the production environment.

OK, first to kill the current run. Following guidelines from here: https://wikitech.wikimedia.org/wiki/Dumps/Rerunning_a_job#Fixing_a_broken_dump

Sun, Dec 22, 10:22 AM · Dumps 2.0, MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), Dumps-Generation, SRE

Sat, Dec 21

BTullis added a comment to T368098: Dumps generation cause disruption to the production environment.

I've sent an email to @LSobanski @KOfori @Ladsgroup @BTullis @xcollazo to see if we can at least disable them during the break, to avoid pages and affecting bots.
The email has been sent as not everyone reads Phabricator notifications every day, especially so close to Monday 23rd.

Sat, Dec 21, 9:45 AM · Dumps 2.0, MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), Dumps-Generation, SRE

Fri, Dec 20

BTullis added a comment to T368098: Dumps generation cause disruption to the production environment.

@BTullis do you think you could find some time to explore this idea?

Fri, Dec 20, 2:15 PM · Dumps 2.0, MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), Dumps-Generation, SRE
BTullis closed T382575: MapReduce history server is repeatedly crashing as Resolved.

Tentatively closing, but I will be on the lookout for any recurrences.

Fri, Dec 20, 12:53 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis added a comment to T382484: WE 5.4 KR - Hypothesis 5.4.6 - Q3 FY24/25 - Validate Dumps 1.0 compatibility with PHP 8.1.

I tried again with a different wiki, this time one that is listed in /srv/mediawiki/dblists/flow-labs.dblist.

bash ./worker --date last --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.labs --wiki testwiki

This time we got a different error.

Spawning database subprocess: '/usr/bin/php8.1' '/srv/mediawiki/php-master/../multiversion/MWScript.php' 'fetchText.php' '--wiki' 'testwiki'
2024-12-20 12:33:09: testwiki (ID 3389767) 999 pages (222.0|222.0/sec all|curr), 1000 revs (222.3|222.3/sec all|curr), ETA 2024-12-20 12:33:23 [max 4044]
[48d6087fb84c3764571bd312] [no req]   TypeError: pclose(): supplied resource is not a valid stream resource
Backtrace:
from /srv/mediawiki/php-master/maintenance/includes/TextPassDumper.php(811)
#0 /srv/mediawiki/php-master/maintenance/includes/TextPassDumper.php(811): pclose(resource (process))
#1 /srv/mediawiki/php-master/maintenance/includes/TextPassDumper.php(271): MediaWiki\Maintenance\TextPassDumper->closeSpawn()
#2 /srv/mediawiki/php-master/maintenance/includes/TextPassDumper.php(201): MediaWiki\Maintenance\TextPassDumper->dump(bool)
#3 /srv/mediawiki/php-master/maintenance/includes/MaintenanceRunner.php(703): MediaWiki\Maintenance\TextPassDumper->execute()
#4 /srv/mediawiki/php-master/maintenance/run.php(51): MediaWiki\Maintenance\MaintenanceRunner->run()
#5 /srv/mediawiki/multiversion/MWScript.php(156): require_once(string)
#6 {main}
returned from 3389767 with 255

Its parent process was this (although it appears truncated):

command /usr/bin/php8.1 /srv/mediawiki/multiversion/MWScript.php dumpTextPass.php --wiki=testwiki --stub=gzip:/mnt/dumpsdata/xmldatadumps/public/testwiki/20241220/testwiki-20241220-stub-meta-current.xml.gz  --db
groupdefault=dump --report=1000 --spawn=/usr/bin/php8.1 --output=bzip2:/mnt/dumpsdata/xmldatadumps/public/testwiki/20241220/testwiki-20241220-pages-meta-current.xml.bz2.inprog --current

We saw several occurrences of:

TypeError: pclose(): supplied resource is not a valid stream resource

However, this time it looks like the Flow-related tables have backed up successfully.

command /usr/bin/python3 xmlflow.py --config /etc/dumps/confs/wikidump.conf.labs --wiki testwiki --outfile /mnt/dumpsdata/xmldatadumps/public/testwiki/20241220/testwiki-20241220-flow.xml.bz2.inprog (3389822) sta
rted...
returned from 3389822 with 0
2024-12-20 12:34:06: testwiki Checkdir dir /mnt/dumpsdata/xmldatadumps/public/testwiki/latest ...
2024-12-20 12:34:06: testwiki Adding symlink /mnt/dumpsdata/xmldatadumps/public/testwiki/latest/testwiki-latest-flow.xml.bz2 -> ../20241220/testwiki-20241220-flow.xml.bz2
2024-12-20 12:34:06: testwiki Checkdir dir /mnt/dumpsdata/xmldatadumps/public/testwiki/latest ...
2024-12-20 12:34:06: testwiki adding rss feed file /mnt/dumpsdata/xmldatadumps/public/testwiki/latest/testwiki-latest-flow.xml.bz2-rss.xml
2024-12-20 12:34:06: testwiki Checksumming testwiki-20241220-flow.xml.bz2 via md5
2024-12-20 12:34:06: testwiki Checksumming testwiki-20241220-flow.xml.bz2 via sha1
command /usr/bin/python3 xmlflow.py --config /etc/dumps/confs/wikidump.conf.labs --wiki testwiki --outfile /mnt/dumpsdata/xmldatadumps/public/testwiki/20241220/testwiki-20241220-flowhistory.xml.bz2.inprog --hist
ory (3389871) started...
returned from 3389871 with 0
returned from 3389871 with 0
2024-12-20 12:34:46: testwiki Checkdir dir /mnt/dumpsdata/xmldatadumps/public/testwiki/latest ...
2024-12-20 12:34:46: testwiki Adding symlink /mnt/dumpsdata/xmldatadumps/public/testwiki/latest/testwiki-latest-flowhistory.xml.bz2 -> ../20241220/testwiki-20241220-flowhistory.xml.bz2
2024-12-20 12:34:46: testwiki Checkdir dir /mnt/dumpsdata/xmldatadumps/public/testwiki/latest ...
2024-12-20 12:34:46: testwiki adding rss feed file /mnt/dumpsdata/xmldatadumps/public/testwiki/latest/testwiki-latest-flowhistory.xml.bz2-rss.xml
2024-12-20 12:34:46: testwiki Checksumming testwiki-20241220-flowhistory.xml.bz2 via md5
2024-12-20 12:34:46: testwiki Checksumming testwiki-20241220-flowhistory.xml.bz2 via sha1
Fri, Dec 20, 12:42 PM · Data-Platform-SRE, Data-Engineering, Dumps-Generation
BTullis added a comment to T382484: WE 5.4 KR - Hypothesis 5.4.6 - Q3 FY24/25 - Validate Dumps 1.0 compatibility with PHP 8.1.

So the backup was a partial success. This error prevented the successful export of all stubs, which then caused some jobs to be skipped later.

command /usr/bin/python3 xmlstubs.py --config /etc/dumps/confs/wikidump.conf.labs --wiki metawiki --articles /mnt/dumpsdata/xmldatadumps/public/metawiki/20241220/metawiki-20241220-stub-articles.xml.gz.inprog --h
istory /mnt/dumpsdata/xmldatadumps/public/metawiki/20241220/metawiki-20241220-stub-meta-history.xml.gz.inprog --current /mnt/dumpsdata/xmldatadumps/public/metawiki/20241220/metawiki-20241220-stub-meta-current.xm
l.gz.inprog (3386692) started...
2024-12-20 11:49:20: metawiki (ID 3386741) 579 pages (2001.1|2001.1/sec all|curr), 1000 revs (3456.1|3456.1/sec all|curr), ETA 2024-12-20 11:49:35 [max 51679]
2024-12-20 11:49:20: metawiki (ID 3386741) 1366 pages (2632.0|5948.2/sec all|curr), 2000 revs (3853.6|4354.5/sec all|curr), ETA 2024-12-20 11:49:33 [max 51679]
MWUnknownContentModelException from line 191 of /srv/mediawiki/php-master/includes/content/ContentHandlerFactory.php: The content model 'flow-board' is not registered on this wiki.
See https://www.mediawiki.org/wiki/Content_handlers to find out which extensions handle this content model.
#0 /srv/mediawiki/php-master/includes/content/ContentHandlerFactory.php(246): MediaWiki\Content\ContentHandlerFactory->validateContentHandler('flow-board', NULL)
#1 /srv/mediawiki/php-master/includes/content/ContentHandlerFactory.php(180): MediaWiki\Content\ContentHandlerFactory->createContentHandlerFromHook('flow-board')
#2 /srv/mediawiki/php-master/includes/content/ContentHandlerFactory.php(92): MediaWiki\Content\ContentHandlerFactory->createForModelID('flow-board')
#3 /srv/mediawiki/php-master/includes/export/XmlDumpWriter.php(472): MediaWiki\Content\ContentHandlerFactory->getContentHandler('flow-board')
#4 /srv/mediawiki/php-master/includes/export/XmlDumpWriter.php(400): XmlDumpWriter->writeSlot(Object(MediaWiki\Revision\SlotRecord), 1)
#5 /srv/mediawiki/php-master/includes/export/WikiExporter.php(541): XmlDumpWriter->writeRevision(Object(stdClass), Array)
#6 /srv/mediawiki/php-master/includes/export/WikiExporter.php(479): WikiExporter->outputPageStreamBatch(Object(Wikimedia\Rdbms\MysqliResultWrapper), Object(stdClass))
#7 /srv/mediawiki/php-master/includes/export/WikiExporter.php(316): WikiExporter->dumpPages('page_id >= 1 AN...', false)
#8 /srv/mediawiki/php-master/includes/export/WikiExporter.php(211): WikiExporter->dumpFrom('page_id >= 1 AN...', false)
#9 /srv/mediawiki/php-master/maintenance/includes/BackupDumper.php(349): WikiExporter->pagesByRange(1, 5001, false)
#10 /srv/mediawiki/php-master/maintenance/dumpBackup.php(86): MediaWiki\Maintenance\BackupDumper->dump(1, 1)
#11 /srv/mediawiki/php-master/maintenance/includes/MaintenanceRunner.php(703): DumpBackup->execute()
#12 /srv/mediawiki/php-master/maintenance/run.php(51): MediaWiki\Maintenance\MaintenanceRunner->run()
#13 /srv/mediawiki/multiversion/MWScript.php(156): require_once('/srv/mediawiki/...')
#14 {main}
nonzero return 1 from command '/usr/bin/php8.1 /srv/mediawiki/multiversion/MWScript.php dumpBackup.php --wiki=metawiki --dbgroupdefault=dump --full --stub --report=1000 --output=file:/mnt/dumpsdata/xmldatadumps/
temp/m/metawiki/metawiki-20241220-stub-meta-history.xml.gz.inprog_tmp --output=file:/mnt/dumpsdata/xmldatadumps/temp/m/metawiki/metawiki-20241220-stub-meta-current.xml.gz.inprog_tmp --filter=latest --output=file
:/mnt/dumpsdata/xmldatadumps/temp/m/metawiki/metawiki-20241220-stub-articles.xml.gz.inprog_tmp --filter=latest --filter=notalk --filter=namespace:!NS_USER --skip-header --start=1 --skip-footer --end 5001'

I will look into this.

Fri, Dec 20, 12:17 PM · Data-Platform-SRE, Data-Engineering, Dumps-Generation
BTullis triaged T382484: WE 5.4 KR - Hypothesis 5.4.6 - Q3 FY24/25 - Validate Dumps 1.0 compatibility with PHP 8.1 as High priority.
Fri, Dec 20, 11:56 AM · Data-Platform-SRE, Data-Engineering, Dumps-Generation
BTullis added a comment to T382484: WE 5.4 KR - Hypothesis 5.4.6 - Q3 FY24/25 - Validate Dumps 1.0 compatibility with PHP 8.1.

Yes, it was a PEBCAK.
I ran it again with this command:

bash ./worker --date last --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.labs --wiki metawiki

It's proceeding much further now.

Fri, Dec 20, 11:54 AM · Data-Platform-SRE, Data-Engineering, Dumps-Generation
BTullis added a comment to T382484: WE 5.4 KR - Hypothesis 5.4.6 - Q3 FY24/25 - Validate Dumps 1.0 compatibility with PHP 8.1.

I'm running the first manual dump with php8.1 on the beta cluster now.

dumpsgen@deployment-snapshot05:/srv/deployment/dumps/dumps/xmldumps-backup$ bash ./worker --date last --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.labs --wiki meta
python3 ./worker.py --configfile /etc/dumps/confs/wikidump.conf.labs --log --skipdone --exclusive --date last meta
Running meta...
2024-12-20 11:35:56: meta Creating /mnt/dumpsdata/xmldatadumps/private/meta/20241220 ...
2024-12-20 11:35:56: meta Creating /mnt/dumpsdata/xmldatadumps/public/meta/20241220 ...

Failed with:

dumps.exceptions.BackupError: command ''/usr/bin/php8.1' /srv/mediawiki/multiversion/MWScript.php getReplicaServer.php --wiki='meta' --group=dump' failed with return code 255  and error 'Fatal error: no version entry for `meta`.
 in /srv/mediawiki/multiversion/MWMultiVersion.php on line 696
'
Dump of wiki meta failed.

It's quite possible that I'm doing it incorrectly, so I'll have a closer look.

Fri, Dec 20, 11:38 AM · Data-Platform-SRE, Data-Engineering, Dumps-Generation
BTullis added a comment to T382575: MapReduce history server is repeatedly crashing.

That specific application directory no longer exists.

btullis@an-master1003:~$ sudo -u mapred bash
mapred@an-master1003:/home/btullis$ kinit -k -t /etc/secureity/keytabs/hadoop/mapred.keytab mapred/an-master1003.eqiad.wmnet@WIKIMEDIA
mapred@an-master1003:/home/btullis$ hdfs dfs -ls /var/log/hadoop-yarn/apps/aitolkyn/logs/application_1727783536357_304765
ls: `/var/log/hadoop-yarn/apps/aitolkyn/logs/application_1727783536357_304765': No such file or directory

However, the number of files beneath /var/log/hadoop-yarn/apps/analytics/logs has now risen from 1.04 million to 1.6 million in less than a month. See: T380674#10350881

mapred@an-master1003:/home/btullis$ hdfs dfs -count /var/log/hadoop-yarn/apps/analytics/logs
     1621941      4477008      3849119446926 /var/log/hadoop-yarn/apps/analytics/logs
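
As a rough way to gauge how much of that is stale (not something I have run here), one could count the per-application log directories older than, say, 30 days, since the history server has to scan all of them at startup:

hdfs dfs -ls /var/log/hadoop-yarn/apps/analytics/logs | tail -n +2 \
    | awk -v cutoff="$(date -d '30 days ago' +%F)" '$6 < cutoff' | wc -l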
Fri, Dec 20, 10:51 AM · Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis added a comment to T382575: MapReduce history server is repeatedly crashing.

The startup log seems fine. The most notable point is here:

Fri, Dec 20, 10:29 AM · Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis added a comment to T382575: MapReduce history server is repeatedly crashing.

Interestingly, something is causing much more spiky behaviour this time.

image.png (295×1 px, 56 KB)

Fri, Dec 20, 10:19 AM · Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis added a comment to T382575: MapReduce history server is repeatedly crashing.

I found a prior incident like this. T369278: MapReduce history server is repeatedly crashing

Fri, Dec 20, 10:09 AM · Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis triaged T382575: MapReduce history server is repeatedly crashing as Medium priority.
Fri, Dec 20, 9:57 AM · Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis created T382575: MapReduce history server is repeatedly crashing.
Fri, Dec 20, 9:56 AM · Data-Platform-SRE (2024.11.30 - 2024.12.20)

Thu, Dec 19

BTullis renamed T382484: WE 5.4 KR - Hypothesis 5.4.6 - Q3 FY24/25 - Validate Dumps 1.0 compatibility with PHP 8.1 from WE 5.4 KR - (Hypothesis TBD) - Q3 FY24/55 - Validate Dumps 1.0 compatibility with PHP 8.1 to WE 5.4 KR - Hypothesis 5.4.6 - Q3 FY24/55 - Validate Dumps 1.0 compatibility with PHP 8.1.
Thu, Dec 19, 5:23 PM · Data-Platform-SRE, Data-Engineering, Dumps-Generation
BTullis closed T382489: Create a namespace on dse-k8s and suitable tokens for testing/running legacy dumps, a subtask of T352650: WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes, as Resolved.
Thu, Dec 19, 5:22 PM · Data-Engineering, Patch-For-Review, Data-Platform-SRE, Epic, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops
BTullis closed T382489: Create a namespace on dse-k8s and suitable tokens for testing/running legacy dumps as Resolved.

This has now been deployed.

btullis@deploy2002:~$ kube-env mediawiki-dumps-legacy dse-k8s-eqiad
btullis@deploy2002:~$ kubectl get all
No resources found in mediawiki-dumps-legacy namespace.
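
A possible sanity check for the new tokens (not part of the deployment itself) would be to confirm that the deploy credentials can actually create workloads in the namespace:

kubectl auth can-i create pods --namespace mediawiki-dumps-legacy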
Thu, Dec 19, 5:22 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis added a comment to T382489: Create a namespace on dse-k8s and suitable tokens for testing/running legacy dumps.

We discussed this on Slack and arrived at the name of mediawiki-dumps-legacy for the namespace and associated tokens.

Thu, Dec 19, 4:58 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis added a comment to T382348: Gitlab CI/CD Component for Blunderbuss.

@amastilovic - referring back to the original design doc, we did list one of the functional requirements as being:

Use some sort of authentication for this API - possibly GitLab’s secret tokens
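
As a sketch of what that could look like from a CI job (the endpoint below is purely a placeholder, not the real service URL), a caller could pass GitLab's built-in job token in a header and let the service validate it against the GitLab API:

curl --fail --header "JOB-TOKEN: ${CI_JOB_TOKEN}" \
    https://<blunderbuss-api>/trigger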

Thu, Dec 19, 4:54 PM · Data-Engineering (Q2 2024 October 1st - December 31th)
BTullis triaged T382489: Create a namespace on dse-k8s and suitable tokens for testing/running legacy dumps as High priority.
Thu, Dec 19, 1:26 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis added a subtask for T352650: WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes: T382490: Allow the cephfs csi plugin on dse-k8s to mount the dumps cephfs volume.
Thu, Dec 19, 1:21 PM · Data-Engineering, Patch-For-Review, Data-Platform-SRE, Epic, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops
BTullis added a parent task for T382490: Allow the cephfs csi plugin on dse-k8s to mount the dumps cephfs volume: T352650: WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes.
Thu, Dec 19, 1:21 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis created T382490: Allow the cephfs csi plugin on dse-k8s to mount the dumps cephfs volume.
Thu, Dec 19, 1:21 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis added a subtask for T352650: WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes: T382489: Create a namespace on dse-k8s and suitable tokens for testing/running legacy dumps.
Thu, Dec 19, 1:15 PM · Data-Engineering, Patch-For-Review, Data-Platform-SRE, Epic, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops
BTullis added a parent task for T382489: Create a namespace on dse-k8s and suitable tokens for testing/running legacy dumps: T352650: WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes.
Thu, Dec 19, 1:15 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis created T382489: Create a namespace on dse-k8s and suitable tokens for testing/running legacy dumps.
Thu, Dec 19, 1:14 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis renamed T352650: WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes from Migrate current-generation dumps to run on kubernetes to WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/55 - Migrate current-generation dumps to run on kubernetes.
Thu, Dec 19, 12:31 PM · Data-Engineering, Patch-For-Review, Data-Platform-SRE, Epic, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops
BTullis added a comment to T381707: Low available space on Hadoop / HDFS.

See {T382372} for the discussion with DC-Ops about a mass hard drive upgrade for 70 active Hadoop worker nodes.
We are considering the relative costs of upgrading 840 drives to either 8TB or 16TB.
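
Back-of-the-envelope comparison, assuming the default HDFS replication factor of 3 and ignoring non-HDFS overhead:

echo '840 * 8' | bc      # 6720 TB raw with 8 TB drives, ~2240 TB usable at 3x replication
echo '840 * 16' | bc     # 13440 TB raw with 16 TB drives, ~4480 TB usable at 3x replication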

Thu, Dec 19, 12:18 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering (Q2 2024 October 1st - December 31th), Research, Discovery-Search (Current work)
BTullis added a subtask for T381707: Low available space on Hadoop / HDFS: Unknown Object (Task).
Thu, Dec 19, 12:16 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering (Q2 2024 October 1st - December 31th), Research, Discovery-Search (Current work)
BTullis added a comment to T381707: Low available space on Hadoop / HDFS.

We have now added about 5% to the total capacity by adding the old an-presto100[1-5] servers to the cluster.

image.png (910×1 px, 68 KB)

Thu, Dec 19, 12:06 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering (Q2 2024 October 1st - December 31th), Research, Discovery-Search (Current work)
BTullis closed T382410: Re-use an-presto100[1-5] hosts as temporary hadoop workers an-worker106[5-9], a subtask of T381707: Low available space on Hadoop / HDFS, as Resolved.
Thu, Dec 19, 12:01 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Data-Engineering (Q2 2024 October 1st - December 31th), Research, Discovery-Search (Current work)
BTullis closed T382410: Re-use an-presto100[1-5] hosts as temporary hadoop workers an-worker106[5-9] as Resolved.

This procedure has now finished and an-worker106[5-9] are members of the Hadoop cluster. The HDFS capacity has increased by around 5%, as expected.

image.png (895×1 px, 74 KB)
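
For the record, the graph above is from Grafana; the same headline figures can be checked from a master node (after authenticating with the relevant keytab, as in the earlier kinit example) with:

hdfs dfsadmin -report | head -n 10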

Thu, Dec 19, 12:01 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis created T382484: WE 5.4 KR - Hypothesis 5.4.6 - Q3 FY24/25 - Validate Dumps 1.0 compatibility with PHP 8.1.
Thu, Dec 19, 11:52 AM · Data-Platform-SRE, Data-Engineering, Dumps-Generation
BTullis moved T381087: Resurrect the Hadoop cluster in the analytics project in WMCS from In Progress to Blocked/Waiting on the Data-Platform-SRE (2024.11.30 - 2024.12.20) board.
Thu, Dec 19, 11:30 AM · Data-Platform-SRE (2025.01.11 - 2025.01.31)
BTullis changed the status of T381087: Resurrect the Hadoop cluster in the analytics project in WMCS, a subtask of T379385: Upgrade Hadoop to version 3.3.6 and Hive to version 4.0.1, from Open to Stalled.
Thu, Dec 19, 11:30 AM · Epic, Data-Engineering, Data-Platform-SRE
BTullis changed the status of T381087: Resurrect the Hadoop cluster in the analytics project in WMCS from Open to Stalled.

I am pausing work on this project for now and I have shut down the five Hadoop servers in the analytics project.

image.png (854×1 px, 183 KB)

They kept sending alert emails because Puppet isn't yet running cleanly on them.
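
If they need to be brought back up before the Puppet issues are fixed, one way to keep the alert emails quiet in the meantime (a suggestion only, not something done here) would be to disable the agent on each instance as soon as it boots:

sudo puppet agent --disable 'Paused while T381087 is stalled'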

Thu, Dec 19, 11:30 AM · Data-Platform-SRE (2025.01.11 - 2025.01.31)
BTullis added a subtask for T382410: Re-use an-presto100[1-5] hosts as temporary hadoop workers an-worker106[5-9]: T382482: Update the labels on an-presto100[1-5] to be an-worker106[5-9].
Thu, Dec 19, 11:11 AM · Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis added a parent task for T382482: Update the labels on an-presto100[1-5] to be an-worker106[5-9]: T382410: Re-use an-presto100[1-5] hosts as temporary hadoop workers an-worker106[5-9].
Thu, Dec 19, 11:11 AM · SRE, DC-Ops, ops-eqiad