Using The Batch Farm: Technische Universität München


Prologue

• All information and scripts from this talk are also available at:
A) transfer.ktas.ph.tum.de
B) /home/www/papers/computing
Overview

• Infrastructure
• Parallel vs single job computing
• Basic commands
• How to …
… arrange a job
… send a job
… monitor my stuff
• Please don’t…
Infrastructure

• 21 compute nodes → 570 cores
• ~2 GB RAM per core
• 20 GPU job slots
• Standard queue: 2.5 h per job
• Long queue: 12 h per job
• Local storage: ~100 GB per node
• 1/10 Gbit/s network connection per node
• SLURM job scheduler: https://www.schedmd.com
Parallel vs single job

[Figure: "Parallel running" shows Jobs 1 to 9 arranged in three columns; "Single running" shows Jobs 1 to 3 in one column]

Parallel running
• Independent jobs
• Parameter scans
• MC production
• Data analysis (runwise)
• Creation of independent output files

Single running
• Code development
• Compiling
• Creating plots / graphs
• Small nTuple analysis
• Merging of several files
Example: Parallel job
Detector Summary Tape File Analysis (DSTs)

Problem
• 1000 files with 250 events/file

Solution
• Create code locally
• Analyse 1 file per job
• Create 1 output file per job (Plots, Ntuples…)
• Send 1000 jobs to farm
• Merge plots/ntuples afterwards
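
The "1 file per job" mapping above can be sketched as follows. This is only an illustration: the directory is a throwaway one created with mktemp, and the run_*.dst file names and the helper file_for_job are hypothetical stand-ins, not the real DST layout.

```shell
#!/bin/sh
# Sketch: map a job index to exactly one input file.
# The directory and the run_*.dst names are hypothetical stand-ins for the DSTs.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
for n in 001 002 003; do
    : > "$workdir/run_$n.dst"          # create empty demo input files
done

# Build the file list once, sorted so the job numbering is reproducible.
filelist="$workdir/filelist.txt"
ls "$workdir"/run_*.dst | sort > "$filelist"

# Print the input file that belongs to job number $1 (1-based).
file_for_job() {
    sed -n "${1}p" "$filelist"
}

nJobs=$(wc -l < "$filelist" | tr -d ' ')
for i in $(seq 1 "$nJobs"); do
    echo "job $i -> $(basename "$(file_for_job "$i")")"
done
```

With 1000 real files the same list would have 1000 lines, and job i would analyse line i and write its own output file.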
Example: Single job
Fitting of a peak in plot

Problem
• Fit peaks in 1 or 2 plots

Solution
• Create a macro / program to fit
• Do it locally and check the output

Don’t make life more complicated than it is!
Basic commands

Most of these commands provide more information, see “command --help”.

• sview: SLURM overview. Job, partition and node information in a graphical overview. Just enter “sview” in a terminal.
• sshare: “Fair share” ranking (how fast do I get the slot for the next job?). Just enter “sshare --all” in a terminal.
• sbatch: Submit a job to the farm. Enter “sbatch --help” for info about the parameters (described later).
• scancel: Kill your jobs by id, or all of your jobs using “scancel -u [ADS]”.
• squeue: Gives information about the status of the running jobs and the queue. Just enter “squeue” in a terminal.
• sinfo: Gives information about the nodes, queues and users of the farm. Just enter “sinfo” in a terminal.
• Monitoring software
– Graphical: a short graphical overview of the users’ currently running jobs on the farm. https://transfer.ktas.ph.tum.de/django/monitor/1/
– Text based: a short text-based overview of the users’ currently running jobs on the farm. https://transfer.ktas.ph.tum.de/webpage/monitoring_batchfarm.html
How to… arrange a job

• Input:
– Which file to analyse? (File list?)
– Which parameters?
• Output:
– Different names / directories for each job
• Compile before sending to the farm
• How much CPU time / RAM is needed?
• Do I need temporary space?
• Do I need access to /scratch?
• Test everything locally before using the farm
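
Part of this checklist can be automated before submitting. The sketch below is an assumption about what such a check might look like: preflight is a hypothetical helper (not part of SLURM), and the demo uses throwaway files instead of a real program.

```shell
#!/bin/sh
# Sketch: sanity checks to run on the login machine before submitting.
# preflight is a hypothetical helper, not part of SLURM.
preflight() {
    program=$1
    output_path=$2
    status=0
    if [ ! -x "$program" ]; then
        echo "not an executable: $program" >&2
        status=1
    fi
    if [ ! -d "$output_path" ] || [ ! -w "$output_path" ]; then
        echo "output directory missing or not writable: $output_path" >&2
        status=1
    fi
    return $status
}

# Demo with throwaway files so the sketch runs anywhere.
demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
printf '#!/bin/sh\necho ok\n' > "$demo/Example"
chmod +x "$demo/Example"
mkdir "$demo/out"

if preflight "$demo/Example" "$demo/out"; then
    echo "ready to submit"
fi
```

CPU time and RAM needs are not covered here; those are best measured during a local test run.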
Example: Random Numbers

Problem
• Create a file with 10 different lines of random numbers
• Must be scalable to the farm

Solution
• Input: the name of the output file has to be given
• Compile
• “Full program”
– Example.cc
– Makefile
→ This generates an executable program
Example: Random Numbers

• Run it locally to check if it works
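
A local check could look like the sketch below. The source of Example.cc is not reproduced here, so run_example is a stand-in written in awk that mimics the described behaviour (output file name as the only argument, 10 lines of random numbers); on the real system the compiled program would be called instead.

```shell
#!/bin/sh
# Sketch: local test of a program that takes the output file name as its
# only argument. run_example is a stand-in for the compiled Example binary.
run_example() {
    awk 'BEGIN { srand(); for (i = 1; i <= 10; i++) print i, rand() }' > "$1"
}

out=$(mktemp)
trap 'rm -f "$out"' EXIT
run_example "$out"

# The job is expected to produce exactly 10 lines.
lines=$(wc -l < "$out" | tr -d ' ')
if [ "$lines" -eq 10 ]; then
    echo "local test passed: $lines lines"
else
    echo "local test FAILED: $lines lines" >&2
fi
```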


How to… send a job

• Select your parameters:
– CPU
– RAM
– Partition
• SLURM can only submit scripts
• Loop over all the jobs you want to submit
• Create a bash / Python script
• Example:
– Create a script with a submit loop (submit.sh)
– Inside, create a temporary script with your job inside
– Run your script
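
The three parameters map onto sbatch options as in the dry-run sketch below. build_sbatch is a hypothetical helper and the partition name "long" is a guess; the real partition names are listed by sinfo.

```shell
#!/bin/sh
# Sketch: assemble the sbatch call from the three choices above.
# The partition name "long" is a guess; list the real names with sinfo.
build_sbatch() {
    minutes=$1      # CPU: time limit in minutes
    mem_mb=$2       # RAM: limit in MB per CPU
    partition=$3    # partition (queue) to submit to
    script=$4       # the job script; SLURM only accepts scripts
    echo "sbatch --time=$minutes --mem-per-cpu=$mem_mb --partition=$partition $script"
}

# Dry run: print the command instead of executing it.
build_sbatch 5 100 long job.sh
```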
Example: Send 10 jobs
#!/bin/bash
# bash is required because $RANDOM is used further down
cpu=5      # time limit in minutes for your job,
           # it will be killed after that time!
mem=100    # RAM limit in MB for your job,
           # it will be killed if it exceeds this
nJobs=10   # number of jobs to submit

# the program is defined here
program=/home/www/papers/computing/programs/Example
name=Example

# the output parameters are defined here
output_path=/home/www/papers/computing/testoutput
output_name=Event
output_end=txt
Example: Send 10 jobs
# generate a random number so that files from different submissions do not clash
randomID=$RANDOM

for i in `seq 1 $nJobs`; do

    tmp_scriptname=/var/tmp/sub_${randomID}_$i.sh

    # set your default environment
    echo "#!/bin/sh" > $tmp_scriptname
    echo ". ~/.bashrc" >> $tmp_scriptname

    # run your program, writing its output to the local disk
    echo "${program} /var/tmp/local_${randomID}_$i.txt" >> $tmp_scriptname

    # copy the completed output to your location
    echo "cp /var/tmp/local_${randomID}_$i.txt ${output_path}/${output_name}-$i.${output_end}" >> $tmp_scriptname

    # clean up your stuff
    echo "rm /var/tmp/local_${randomID}_$i.txt" >> $tmp_scriptname

    # submit your temporary script to the farm
    # (the job name uses the loop variable $i)
    sbatch --mem-per-cpu=${mem} --time=${cpu} --job-name=$name-$i ${tmp_scriptname}

    # delete your temporary script
    rm -f ${tmp_scriptname}


done
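
As an alternative to the chain of echo lines, the temporary script can be written with a heredoc and carry its limits as #SBATCH directives. The sketch below is a dry run under assumptions: write_job is a hypothetical helper, only 3 jobs are generated, and the sbatch call is printed rather than executed. The program and output paths are the ones used above.

```shell
#!/bin/sh
# Sketch: same job as above, written with a heredoc and #SBATCH directives.
# write_job is a hypothetical helper; the paths are the ones from the talk.
program=/home/www/papers/computing/programs/Example
output_path=/home/www/papers/computing/testoutput

# write_job INDEX SCRIPT: write the job script for one job.
# "\$" keeps a variable unexpanded until the job runs on the node.
write_job() {
    cat > "$2" <<EOF
#!/bin/sh
#SBATCH --job-name=Example-$1
#SBATCH --time=5
#SBATCH --mem-per-cpu=100
local_out=/var/tmp/local_\${SLURM_JOB_ID}_$1.txt
$program "\$local_out"
cp "\$local_out" "$output_path/Event-$1.txt"
rm -f "\$local_out"
EOF
}

tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
for i in 1 2 3; do
    write_job "$i" "$tmp/sub_$i.sh"
    echo "would run: sbatch $tmp/sub_$i.sh"   # replace echo with the real call
done
```

Using SLURM_JOB_ID (set by SLURM inside each job) in the local file name avoids clashes between jobs that land on the same node.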
How to… monitor my stuff

• Check your jobs frequently (squeue…)
– Do they disappear suddenly?
– Do they finish suspiciously fast?

• Check the log files in case of problems
– What is written there?
– Does the problem depend on one machine?

• Try to run a job locally
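
A quick way to see whether jobs vanish or pile up is to count them per state. The sketch below assumes one state name per input line; count_states is a hypothetical helper, and a canned sample replaces the squeue output so the example runs without SLURM.

```shell
#!/bin/sh
# Sketch: summarise job states, one state name per input line.
count_states() {
    sort | uniq -c | awk '{ print $2 "=" $1 }'
}

# On the farm the input would come from squeue, for example:
#   squeue -h -u "$USER" -o "%T" | count_states
# Here a canned sample keeps the sketch runnable without SLURM.
summary=$(printf 'RUNNING\nPENDING\nRUNNING\nRUNNING\n' | count_states)
echo "$summary"
```

Running this every few minutes and comparing the numbers shows whether jobs finish at the expected rate.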


Error handling

• Have you checked the log file?
• Are your scripts and code valid?
• Is your data available?
• Is the file server present or under heavy usage?
• Do your jobs last unusually long?

Don’t call an admin without having checked all points!
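
Checking many log files by hand is tedious, so a first pass can be scripted. By default sbatch writes slurm-<jobid>.out into the directory the job was submitted from. In the sketch below failed_logs is a hypothetical helper, the search words are only a starting point, and the two log files are fake ones created for the demo.

```shell
#!/bin/sh
# Sketch: list log files in directory $1 that mention an error.
# The search words are only a starting point and should be adapted.
failed_logs() {
    grep -l -i -E 'error|killed|cancelled|no such file' "$1"/slurm-*.out 2>/dev/null
}

# Demo with two fake logs so the sketch runs without SLURM.
logs=$(mktemp -d)
trap 'rm -rf "$logs"' EXIT
echo "all fine" > "$logs/slurm-101.out"
echo "slurmstepd: error: Exceeded job memory limit" > "$logs/slurm-102.out"

failed_logs "$logs"
```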
Important Notes

• Don’t use /tmp. Use /var/tmp
• Don’t write directly to /scratch, copy at the end of the job
• Clean up after your job
• Try to stay under 50k jobs at one time
• Set your CPU and RAM requests to reasonable values
• Always check your work
• Be friendly to the others
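
The first three notes combine into one pattern for the job body: work on the node-local disk, copy the result once at the end, then remove the local files. The sketch below is an illustration under assumptions: run_job is a hypothetical wrapper, the "program" is a single echo, and throwaway directories stand in for /var/tmp and /scratch.

```shell
#!/bin/sh
# Sketch of a job body: work on node-local disk, copy the result once, clean up.
# run_job is a hypothetical wrapper; on the farm local_base would be /var/tmp
# and final_dir a directory on /scratch or in your home.
run_job() {
    local_base=$1
    final_dir=$2
    work=$(mktemp -d "$local_base/job_XXXXXX") || return 1
    echo "result" > "$work/out.txt"     # stand-in for the real program
    cp "$work/out.txt" "$final_dir/"    # one single copy at the end
    rc=$?
    rm -rf "$work"                      # clean up even if the copy failed
    return $rc
}

# Demo with throwaway directories so the sketch runs anywhere.
base=$(mktemp -d)
trap 'rm -rf "$base"' EXIT
mkdir "$base/local" "$base/final"
run_job "$base/local" "$base/final" && echo "output copied, local disk clean"
```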


Questions?
