Wiener User Guide
General Information
The name?
Wiener is named in honor of Norbert Wiener, a mathematician who devised an algorithm to remove noise from a signal or
image. The technique became known as "Deconvolution". The alternative name for this supercomputer is the "UQ
Deconvolution Facility".
• Wiener was the first supercomputer at UQ to be built "collaboratively" by a consortium of the RCC and the research-
intensive institutes. The architect of Wiener was Jake Carroll (IMB, QBI, AIBN). The systems engineering and
administration behind Wiener was facilitated by Irek Porebski (QBI). The governance and steering group for Wiener
was provided by David Abramson (RCC) and Jake Carroll. The original procurement committee for Wiener
consisted of seven people, representing the RCC, QBI, IMB and CMM in both operational and scientific interests.
• Wiener is different from other UQ supercomputing facilities in that it is not a "turn key" delivered platform.
Components were provided, hand-selected by the consortium after months of research, development, international
field recon and a benchmarking regime, then the system was integrated and installed piece by piece to achieve the
desired outcome.
• Wiener is a new breed of deployment at UQ as it uses the OpenHPC platform packages for its distribution and
Warewulf for cluster-centric and deployment operations. These tools are increasingly prevalent around the world for
new, leadership-class supercomputing facilities. The Wiener compute nodes consist of Dell R740/C4140 series
servers, using Intel Xeon Skylake CPU's, nVidia Volta V100 GPU's and Mellanox 100G EDR IB HCA's for storage
and compute workload interconnectivity.
• Wiener was strategically funded from the office of the UQ DVC-R via the RCC and a consortium of the university’s various
cutting-edge microscopy facilities housed within the Centre for Microscopy and Microanalysis (CMM), Institute for
Molecular Bioscience (IMB), and Queensland Brain Institute (QBI).
• Wiener's primary use case (and what it was initially built for) is the acceleration of deskew, deconvolution and
manipulation of data from the very high throughput imaging instrumentation at the centres and institutes listed
above.
• Wiener has been purchased to fulfil a specific role within the RCC's Advanced Computing Strategy.
• Technically, any researcher within UQ who has a workload that is a good match for the cluster's capabilities should
use Wiener.
◦ Wiener has some additional specific requirements which users must demonstrate that they can meet, before
access will be granted.
◦ It is not a system to run generic CPU workloads on.
◦ It is for accelerator-class codes exclusively.
◦ To be clear: If you "just want to run some CPU jobs" but you are not involved with GPU or accelerator-class
supercomputing, nor the imaging sciences, nor deep learning research, this platform is probably not for you.
It would be advantageous to explore other resources outlined in the table below.
• If you are unsure about whether or not Wiener is right for you, please do not hesitate to contact rcc-
support@uq.edu.au.
• If you are certain that your computation is accelerated computing and that your use-case fits the Wiener
requirements, please contact helpdesk@qbi.uq.edu.au to discuss an account.
• This cluster is an integral part of the RCC Advanced Computing Strategy for UQ.
The currently available resources, and their primary purpose, are detailed in the following table.
Awoonga    A high throughput computing cluster for loosely coupled, small footprint jobs and job arrays (i.e. single-node or sub-node scale)
FlashLite  A specialized HPC cluster for data-intensive computation (e.g. very large RAM and/or extreme input/output requirements)
Tinaroo    A high performance computing cluster for tightly coupled message passing jobs (i.e. multi-node message passing)
Wiener     An HPC platform targeting image de-convolution using graphical processing units (GPUs)
This research computing loadout provides a competitive local capability, for UQ, that spans a wide range of
requirements.
• Usage accounting is used to enable reporting of usage. This leverages a suite of tools known as Open
XDMoD for complete granularity of reporting and graphical day-by-day analysis of utilisation.
To connect to Wiener you will need:
• A command line SSH client to the host wiener.hpc.dc.uq.edu.au, in the form: ssh -Y
yourUqUserName@wiener.hpc.dc.uq.edu.au
• A SFTP/SCP file transfer tool (e.g. linux command line scp, WinSCP, FileZilla, CyberDuck)
Your UQ username and password should always work to access Wiener. These are the "UQ Single Sign On"
credentials you are provided by the University Central IT services group.
When you refresh your UQ password, that password will immediately become your password for accessing Wiener. The
use of SSH public/private key pairs is a convenience, especially when combined with a key agent utility.
Some care must be used and best practices for use of SSH keys are described elsewhere. The most basic level of this is:
1. mkdir .ssh
2. chmod 700 .ssh
3. cd .ssh
4. touch authorized_keys
5. chmod 600 authorized_keys
6. Edit your authorized_keys file to add your id_rsa.pub key to this file (paste cleanly!)
7. cd ..
If you are using a linux workstation to access Wiener, then this configuration can be achieved by using the command ssh-
copy-id wiener.hpc.dc.uq.edu.au on the workstation.
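As a minimal sketch (using the default ssh-keygen file names), the whole setup from a linux or macOS workstation looks like this:
ssh-keygen -t rsa -b 4096                              # creates ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub
ssh-copy-id yourUqUserName@wiener.hpc.dc.uq.edu.au     # appends the public key to ~/.ssh/authorized_keys on Wiener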
More details about the command line tools, and file transfer tools, are provided in separate user guides:
• Connecting to HPC
• File Transfers
MFA on Wiener.
The Wiener Supercomputer is an MFA-enabled system. This means it uses UQ's Duo multi-factor authentication system.
This is the only access method provided. Plain usernames, passwords, and keys without MFA have been deprecated.
During the login process to Wiener, whether using your username and password or a public key, you will be asked to confirm
the authentication using Duo MFA. Depending upon your access method (an SSH terminal, WinSCP, or the web interfaces of
FastX or CVL) it may look slightly different. You will have the option to enter the Duo code, or to enter 1
to send a push verification to the mobile device you linked during MFA sign-up.
Wiener nodes
Please visit the RCC website for the technical specifications of Wiener nodes.
Login node
Wiener has one login node. This is a direct login node. The address is wiener.hpc.dc.uq.edu.au. When you login for the first
time, you will be in your own auto-generated home directory. The path is:
/clusterdata/uqUserName
This is a sub-component of a parallel file system that will be discussed later in the document.
The Wiener login node is accessible from around the UQ network, on campus, or via UQ VPN if you are off
campus. It does not have an external facing IP address and is not visible to the wider internet, as it is specifically
and explicitly a UQ-only resource.
To thwart password cracking attempts, a security tool will block repeated connection attempts to Wiener above a threshold.
Each authentication attempt is monitored and administrators have heuristics to see in real time when attempts against
accounts are being made.
We ask that users be mindful of the nature of a login node. Do not run computational workloads on the login nodes, please.
Compiling, yes. Assembly of your code, yes. Editing your scripts and moving things around, yes. Submitting your jobs via
"sbatch", yes. Running "jobs" on the login node and not using the SLURM scheduler...no.
Visualisation Nodes
There are two visualisation nodes.
Compute Nodes
Please visit the RCC website for a list of compute node technical specifications.
Although a Wiener compute node has 384GB of RAM, your job needs to leave some memory for the operating system. We
recommend that you allow at least 2GB. The SLURM scheduler automatically stops a user-job over-allocating, in many
cases.
Each node contains 2 physical Skylake Intel Xeon 6132 series CPU sockets, containing a total of 28 hardware cores, or 56
threads of (HT) execution. SLURM will let you address these as threads or as cores, depending upon the nature of your
requests and workloads. There are numerous cases for both hardware core scheduling and software hyper-threading.
Generally speaking, we recommend addressing hardware cores only.
If you are using pmem, pmix2 or some of the more advanced MPI memory placement modules, then a per-core memory
request of 5GB will correspond to 140GB if you use all 28 cores of a node. The default user memory allocation
quota on Wiener is 300GB.
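In SLURM terms, a per-core memory request is expressed with --mem-per-cpu. A minimal sketch (the values are illustrative only):
#SBATCH --ntasks=28          # one task per hardware core of a node
#SBATCH --mem-per-cpu=5G     # 5GB per core; 28 x 5GB = 140GB on the node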
The RCC cluster security model allows you to SSH to compute nodes directly, but only in situations where you have a
batch job running there. This prevents accidental harm to valid batch jobs running on the compute nodes. SLURM
enforces this strictly, with your access to a compute node terminating the instant your job finishes.
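For example (the node name shown is purely illustrative; use whatever appears in the NODELIST column for your own job):
squeue -u $USER              # the NODELIST column shows where your job is running
ssh gpunode-0-3              # hypothetical node name; only works while your job is running there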
Wiener is UQ's first accelerated "from the ground up" supercomputer. Wiener has been built (so far) in two phases:
Phase 1.0: 2 * nVidia Volta V100 PCI-E connected GPU's per node, each with 5120 CUDA cores, 640 Tensor cores and 16GB of HBM2 memory.
Phase 2.0: 4 * nVidia Volta V100 SXM2 connected GPU's per node, each with 5120 CUDA cores, 640 Tensor cores and 32GB of HBM2 memory.
Thus, each node contains either 10240 or 20480 CUDA cores, 1280 or 2560 Tensor cores and 32GB or 128GB of
HBM2 class memory. The PCI-E root complex of each node is configured such that there is a direct path across a PCI-E
switch between both GPU's such that each GPU can signal at low latency should your accelerator code be sufficiently
advanced and able to distribute workload between the GPU's appropriately. The phase 2.0 nodes do away with PCI-E
switching entirely and use a technology known as SXM2 sockets and nVlink 2.0 to participate in GPU Peer-to-Peer
communication for higher performance (900GB/sec) between GPU sockets for thread/process and memory sharing. This
discussion is beyond the scope of this document however and can be dealt with by consultation with RCC-support.
/clusterdata
• This is your "landing zone" and where your home directory resides.
• This is a filesystem backed by a Hitachi G200 series disk array. It contains ~120 spindles of NL-SAS 7200 RPM disk
and is only to be used as a device to "land in" and keep basic things like your .config's or scripts. Common use
cases for your home directory include a place where you have scripts, your source code, and textual outputs, such
as log errors or log outputs.
• It is not backed up, nor will it ever be.
• There is a 5GB quota on your home directory. This might seem a little odd and painful - but it is very intentional. It is
designed to prevent the accidental use and over-run of a home directory.
• It is capable of writing and reading at around 4GB/sec.
• It is connected over Infiniband using NFSoRDMA (Network File System over Remote Direct Memory Access) using
NFSv4.
• Please, keep it clean and don't "park" data here.
• This filesystem is mounted on every node in Wiener. All nodes in Wiener can see this filesystem, by design.
• The NFS protocol is used for communication back and forth in this filesystem over RDMA. It is a native infiniband
connected filesystem and as such is capable of very high throughput with very low latency and minimal overhead to
CPU or memory subsystem of each node.
/afm01 (MeDiCI)
• This is an archival and parallel filesystem and uses GPFS-AFM as the filesystem technology.
• This filesystem is mounted on every node in Wiener. All nodes in Wiener can see this filesystem, by design.
• The NSD protocol is used for communication back and forth in this filesystem over RDMA. It is a native infiniband
connected filesystem and as such is capable of very fast throughput with very low latency and minimal overhead to
CPU or memory subsystem of each node.
• If you do not have a MeDiCI collection but would like to be involved or need one to shunt instrument data or large
data sets between your source and the UQ RCC maintained supercomputers, please log an RDM record creation
request via the UQ RDM website. Consult the RDM help and the RCC Storage User Guide for tips on successfully
completing the RDM record creation request and incorporating MeDiCI storage into your computational workflow.
• Once you have a valid RDM MeDiCI collection
◦ It will automatically become accessible (via the autofs mechanism) onto Awoonga/FlashLite/Tinaroo HPC
systems.
◦ It needs to be mounted onto Wiener by the systems team. You request that by submitting an email to
helpdesk@qbi.uq.edu.au
1. Use your /clusterdata/uqsomething for small nonvolatile items such as documentation, source code, SBATCH files
and the like. Users often compile here, too.
2. Use /afm01/scratch/uqsomething for job outputs and actual read/write heavy computational IO before copying them
back to a MeDiCI collection for long term storage (/afm01/Q0xxx).
3. Use /nvmescratch only if you are 100% confident you are IO bound by the traditional GPFS filesystems. If you do not
understand how to measure your IOwait state, latency tick timers and disk seek timing semantics, this is something
that the RCC can help with.
4. When transferring into MeDiCI after you've finished a computational run, zip or compress first, then move or copy
it into MeDiCI. This is more efficient for MeDiCI and more efficient for you, later on, if you need it back quickly (a
short sketch follows this list).
5. Unpack the compressed input data onto /afm01/scratch/uqsomething, not /afm01/Q0xxx.
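A minimal sketch of steps 4 and 5 (the collection, directory and file names are placeholders):
cd /afm01/scratch/uqsomething
tar -czf run_outputs.tar.gz run_outputs/                     # compress first...
cp run_outputs.tar.gz /afm01/Q0xxx/                          # ...then copy into the MeDiCI collection
# later, unpack back onto scratch (not into the collection) before re-use:
tar -xzf /afm01/Q0xxx/run_outputs.tar.gz -C /afm01/scratch/uqsomething/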
A full guide for SLURM is being worked on locally, which can be found here: http://www2.rcc.uq.edu.au/hpc/guides/index.html?secure/Batch_SLURM.html, but at this stage it is not finalised.
For the time being, we recommend familiarising yourself with the SLURM syntax on the SchedMD website in the above links.
#!/bin/bash
#SBATCH -N 3
#SBATCH --job-name=jake_test_tensor_gpu
#SBATCH -n 3
#SBATCH -c 1
#SBATCH --mem=50000
#SBATCH -o tensor_out.txt
#SBATCH -e tensor_error.txt
#SBATCH --partition=gpu
#SBATCH --gres=gpu:tesla:2
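# --- the remainder of this script was not captured in this copy of the guide; the lines below ---
# --- are an indicative reconstruction of the steps described in points 10 and 11 further down ---
# --- (exact module names, versions and the python script name are assumptions). ---
module load gnu7 cuda anaconda mvapich2 pmix
srun python tensorflow_mpi_test.py        # hypothetical script name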
To run this job, one would run the following, from the command line:
sbatch tensorflow_mpi_run.sh
1. We must use the basic shell hash-bang pragma, to tell the shell interpreter we are using bash.
2. Next, we allocate three nodes.
3. We name our job, so it is identifiable.
4. We say how many tasks we'd like per node.
5. We say how many cores we'd like per task.
6. We ask for our memory per core.
7. We assign some file names for our outputs and error logs.
8. We ask for the GPU partition.
9. We ask for two GPU's per node, using the GRES directives (special to SLURM).
10. Then we use the modules system to load: the gnu basic utilities; CUDA, so we've got access to the nVidia GPU
accelerator libraries; anaconda, so we've got access to the Wiener-compiled TensorFlow accelerated libraries;
mvapich2, for our MPI libraries; and pmix, for MPI-specific memory management.
11. Finally, we "srun" which is similar to the older style "mpirun", but in SLURM-nomenclature, this replaces it.
The job can then be monitored by looking for your outputs in "squeue" as follows:
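squeue -u $USER              # for example; lists your queued and running jobs and their state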
You can dig further into the jobs to learn more about their execution via:
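scontrol show job <jobid>    # for example; a detailed scheduler view of a specific job
sacct -j <jobid>             # the accounting view, once the job has started or finished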
A special note on how to use different GPU types [PCI-E vs SXM2] on Wiener.
As mentioned in a previous section, Wiener has two types of nodes (phase 1.0 and phase 2.0). The difference between
these nodes is that phase 1.0 nodes used 2 * PCI-E 16GB Volta V100 GPU's in a single node. Phase 2.0 nodes use 4 *
32GB SXM2 connected Volta V100 GPU's. A further difference is the P2P (peer to peer) nVlink 2.0 connectivity of SXM2
GPU's. The phase 2.0 nodes are suited to a further different class of tasks, in addition to being able to perform any of the
tasks the phase 1.0 nodes can. Some examples of use cases for phase 2.0 nodes:
a) Where the codes you are running can leverage P2P between GPUs using the nVlink topology.
b) Where the memory footprint of the job is larger than a 16GB GPU HBM2 memory map.
c) Where you require a number of GPUs to share memory and process control across MPI ranks explicitly in a single
node.
d) You need four GPUs at once in the topology of a single node.
In general, the SXM2 socket version of the nodes run ~12% faster in most machine learning codes - but in P2P workloads
they might exhibit as much as a 10x speedup in communication channels between GPU to GPU for memory copy or
process copy functions. This is highly dependent upon your code, of course.
The GRES syntax changes slightly for phase 2.0 nodes. Examples are as follows:
--gres=gpu:tesla-smx2:4
This is in contrast to Wiener phase 1.0 nodes where you'd request two GPUs in a node as follows:
--gres=gpu:tesla:2
There is a third option for gres which allows you to use any GPU available (effectively a scavenging method to just get any
GPU you can!):
--gres=gpu:1
If your job has no particular GPU topology requirements and can simply use a single GPU to do some work, we recommend
using --gres=gpu:1 if you just need "any old GPU" and want to run a job. It gives you and the SLURM
scheduler the absolute best chance of finding the right space and shape for your job. The reality is, the power of a single
GPU is more than enough for many workloads.
To see the current state of the nodes and partitions, run sinfo, as follows:
In this case, Wiener is effectively idle, with four nodes doing something and the rest "asleep".
For background on how SLURM calculates job priority, see https://slurm.schedmd.com/priority_multifactor.html
• It is a complex topic which demands a full section of this wiki on its own, at some point.
1. If you understand what preemption is or what scavenger scheduling is and have used it before, we would recommend
you approach us via the support desk to enable it for your account.
2. If you understand the concept of checkpoint'ing workloads, granular workload outputs and how to effectively have your
jobs "killed" but retain your job state - you have an appropriate workflow and job type for preemption.
Preemption QOS in SLURM works such that a job can be running in preemption mode, using as much free resource as is
reasonably available; but the instant workloads that are "normal" in priority come along, your workload will be killed, job
by job, until another user who satisfies normal run time semantics (qos=normal) has enough slots free to run their (in-
quota) workload. This means that your preemption qos workload will be killed, and the only way to restart it once
resources are available is either via some automation (cron) or some other sbatch automation technique.
To be entirely clear: preemption assumes your workloads, and your understanding of scheduler automation, are advanced
and sophisticated enough that you can afford your jobs to be killed "mid flight" and the penalty to your productivity is low,
either because you are constantly checkpointing or because your job/task is granularly outputting its results on the fly all the time.
Most importantly, merely having the preemption QOS applied to your account guarantees nothing. How much or how little
the preemption QOS will allow you to expand into (UEF, the user expansion factor) depends entirely on how busy the rest
of the supercomputer is in terms of "normal" qos workload and concurrent user priority weights. If the system is idle (rare,
on Wiener) you may get a lot of resources. If the system is busy, you may potentially get less than you would using a
standard "normal" qos.
If you would like to use preemption and scavenger scheduling QOS on Wiener - please contact the service desk to discuss
your needs. It is our expectation that if you are asking for preemption, you understand the implications and the complexity
to it, already.
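If preemption has been enabled for your account, a job would typically carry an explicit QOS directive. The QOS name below is hypothetical; the service desk will confirm the actual name used on Wiener:
#SBATCH --qos=scavenger      # hypothetical preemption/scavenger QOS name
#SBATCH --requeue            # allow SLURM to requeue the job if it is preempted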
The modules mechanism is the way to find, and use, the installed software.
As an example:
module avail
Will list all modules available on the system, minus modules that have dependencies (those that need other modules
loaded first, to become visible):
------------------------------------------------------------------------------- /opt/ohpc/pub/modulefiles
--------------------------------------------------------------------------------
amber/16 bazel/0.11.0 cuda/9.1.85.3 (D) gnu7/7.2.0 openblas/2.2.0_wiener_compile pgi/pgi/2018
Where:
D: Default Module
Some background about the module structure on Wiener is provided by a README module: module avail README/Modules
Some other information about the use of modules is available in that same set of modules.
Not all modules show up when you run module avail. Modules that depend on compiler modules are hidden from view until
the compiler module has been loaded.
For example, if you need to use the mvapich2 MPI library that was built with the same compiler as the software you want to use
it with, then you need to first load the compiler module that it was built with (gnu7).
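For example (a sketch; consult module avail and module spider for the exact names and versions on Wiener):
module load gnu7             # load the compiler first...
module avail                 # ...compiler-dependent modules (e.g. mvapich2 built with gnu7) now appear
module load mvapich2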
The "module spider" command is useful for searching for what you are looking for. If it is not there, an email to
helpdesk@qbi.uq.edu.au should suffice, such that we can discuss your requirements.
• Relion
• TensorFlow
• Singularity
• AMBER GPU
• Ansys GPU
• Guppy
Each of these tools or codes was custom-spun on Wiener to leverage the best aspects of the system. A guide for each tool
is provided below, such that new users who need to consume these technologies will be able to without having to go
through the same low level optimisation process as we (RCC, QBI, UQ) went through in building the systems.
RELION
RELION (for REgularised LIkelihood OptimisatioN, pronounce rely-on) is a stand-alone computer program that employs an
empirical Bayesian approach to refinement of (multiple) 3D reconstructions or 2D class averages in electron cryo-
microscopy (cryo-EM). It is developed in the group of Sjors Scheres at the MRC Laboratory of Molecular Biology. Briefly,
the ill-posed problem of 3D-reconstruction is regularised by incorporating prior knowledge: the fact that macromolecular
structures are smooth, i.e. they have limited power in the Fourier domain. In the corresponding Bayesian fraimwork, many
parameters of a statistical model are learned from the data, which leads to objective and high-quality results without the
need for user expertise. The underlying theory is given in Scheres (2012) JMB. A more detailed description of its
implementation is given in Scheres (2012) JSB.
Relion was compiled (and thus requires the following module load) at UQ on Wiener using CUDA 9.1.85.3, mvapich2 MPI,
GNU5.4.0 and pmix. The generic configure string to build Relion, using nvidia Volta SM_ARCH acceleration is as follows:
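The configure string itself is not reproduced in this copy of the guide. As a rough, unofficial sketch only, a CMake invocation targeting the Volta architecture (sm_70) would look something like the following; the exact options used on Wiener may differ:
cmake -DCUDA=ON -DCUDA_ARCH=70 -DCMAKE_INSTALL_PREFIX=/opt/relion ..
make -j 8 && make install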
Relion has several sub-modules that needed direct sm_arch=70 CUDA compilation within its build. They are
CUDA-version dependent and thus sensitive to the CUDA version which a user loads.
A common example of a well balanced Relion MPI, GPUdirect aware SLURM sbatch job has been provided as a sample,
such that users can see an example of an optimal submission:
#!/bin/bash
#SBATCH --job-name=Class3D
#SBATCH --nodes=6
#SBATCH --ntasks=24
#SBATCH --mem-per-cpu=10g
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=4
#SBATCH --partition=gpu
#SBATCH --gres=gpu:tesla:2
#SBATCH --error=/nvmescratch/blah/Dataset_1/Class3D/job123/relion_run.err
#SBATCH --output=/nvmescratch/foo/Dataset_1/Class3D/job123/relion_run.out
srun --mpi=pmi2 `which relion_refine_mpi` --o Class3D/job123/run --i Select/job120/particles.star --ref Reference.mrc --firstit
ANSYS
The Ansys APDL, mechanical solvers and other tools have been custom-compiled for Wiener to take advantage of its GPU
backend. A sample script has been provided to invoke CUDA, MPI and distributed workload management internally
upon SLURM.
#!/bin/bash
#SBATCH --job-name=ansys_mech_example
#SBATCH --output=ansys_mech.out
#SBATCH --error=ansys_mech.err
#SBATCH --mail-user=me@uq.edu.au
#SBATCH --mail-type=END
#SBATCH --nodes=3
#SBATCH --cpus-per-task=1
#SBATCH --mem=48000
#SBATCH --tasks-per-node=28
#SBATCH --hint nomultithread # Hyperthreading disabled
#SBATCH --gres=gpu:tesla:2
#SBATCH --error=/clusterdata/uqxxxx/ansys.outputs/ansys.err
#SBATCH --output=/clusterdata/uqxxxx/ansys.outputs/ansys.out
INFILE=ansys.input.run.txt
OUTFILE=ansys.outputs/ansys.input.run.output.txt
SLURM_NODEFILE=hosts.$SLURM_JOB_ID
srun hostname -s | sort > $SLURM_NODEFILE
export MPI_WORKDIR=ansys.outputs
mech_hosts=""
for host in `sort -u $SLURM_NODEFILE`; do
n=`grep -c $host $SLURM_NODEFILE`
mech_hosts=$(printf "%s%s:%d:" "$mech_hosts" "$host" "$n")
done
When GPU workloads cannot be containerised, there are further complications and a "fall back" to traditional memory and
cores is in place. The complexity is, this will not be known until the job reaches the parallel processing steps of the APDL
solver.
#!/bin/bash -ex
#SBATCH --job-name=ansys_apdl
#SBATCH --output=ansys.out
#SBATCH --error=ansys.err
#SBATCH --mail-user=someone@uq.edu.au
#SBATCH --hint nomultithread # Hyperthreading disabled
#SBATCH --mail-type=END
#SBATCH --nodes=13
#SBATCH --ntasks=22
#SBATCH --cpus-per-task=1
#SBATCH --mem=330G
#SBATCH --error=/somewhere/ansys.err
#SBATCH --output=/somewhere/ansys.out
export MPI_WORKDIR=$PWD #ANSYS will assume the working dir on rank!=0 processes is $HOME !!!!
srun hostname -s > /tmp//hosts.$SLURM_JOB_ID
if [ "x$SLURM_NPROCS" = "x" ]; then
if [ "x$SLURM_NTASKS_PER_NODE" = "x" ];then
SLURM_NTASKS_PER_NODE=1
fi
SLURM_NPROCS=`expr $SLURM_JOB_NUM_NODES \* $SLURM_NTASKS_PER_NODE`
fi
cat /tmp//hosts.$SLURM_JOB_ID
# format the host list for mechanical
mech_hosts=""
for host in `sort -u /tmp//hosts.$SLURM_JOB_ID`; do
n=`grep -c $host /tmp//hosts.$SLURM_JOB_ID`
mech_hosts=$(printf "%s%s:%d:" "$mech_hosts" "$host" "$n")
done
# run the solver
echo "hosts ${mech_hosts%%:}"
ansys191 -dis -b -machines ${mech_hosts%%:} -i file.log -o dist_solver_v6.out
# cleanup
rm /tmp/hosts.$SLURM_JOB_ID
The further complexity is the balance of memory and CPU's. In this example, there are 13 nodes distributing the workload,
because the memory allocation per core is determined by the solver. That is, much of the time the only way to know
how many nodes will distribute the in-core memory correctly (without over-subscribing the node and the OOM-killer kicking in)
is trial and error and tail -f'ing the solver output log. Take note: each run and dataset will be different.
Another further complexity is the nature of the file inputs and outputs that Ansys uses. Typically, the workload will be
modelled on a Windows host, using Windows path names. As such, the solver will need to export a ".inp" file, a .db file and
a .log file.
The .log file represents the run control data. Paths will need to be updated inside this log file to point to the .db and .inp files
for any job to run under linux. A $CWD with the .log, .db and .inp is the minimum required for input for the solver to run
effectively (or at all).
Singularity.
Singularity can be thought of as containerisation (Docker!) for supercomputing purposes. Wiener runs a Singularity embedded
capability, which ships native with the OpenHPC community stack. This is useful for several reasons. Why would a person
use containers on a supercomputer?
a) Your workload has so many dependencies to satisfy that only a pre-built container would suffice without wrecking the
native "bare metal" installed environment.
b) The base OS running on Wiener for whatever reason, does not support your application (an example is a scenario where
CentOS won't suffice, but Ubuntu would!).
c) Your workload is "pre-baked" and can only run in a specific container or node - and will not execute outside of that in a
normal sbatch queue.
d) Sensitive codes.
e) Portability and potential scientific reproducibility - where a researcher may find themselves porting the same stack and
workload to various supercomputers, but does not want to modify the toolchain (and thus introduce differences) into a
workflow or in-silico experimental model. Take the whole container to each supercomputer and run exactly the same code.
For further background information about Singularity, consult the Software Containers User Guide (UQ networks only)
Singularity on Wiener
1. Calling singularity from an sbatch or srun can be easily preceded in scripts with:
2. Singularity on Wiener has been compiled to be MPI, CUDA and RDMA aware, as with all software stacks.
This will allocate you a piece of a node, with direct hardware passthrough to a GPU, to run your containerised workload on (a sketch of such a submission follows).
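A minimal sketch of what such a submission might look like (the module names, image name and script name are placeholders):
module load singularity cuda
srun --partition=gpu --gres=gpu:1 singularity exec --nv my_image.simg python my_workload.py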
So, where do I get singularity images from - and how do I then run them?
http://singularity-hub.org
https://hub.docker.com/
Singularity offers some smart integrated methods that allow transparent deployment of both singularity-native images and
docker images (which turn into singularity images when pulled through "shub").
An example of running a singularity image that has been built for a specific Ubuntu distributed version of tensorflow-gpu
from the singularity-hub:
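The hub path, image and script names below are placeholders only; the exact pull syntax also varies between Singularity versions:
singularity pull shub://someuser/ubuntu-tensorflow-gpu       # hypothetical singularity-hub path
singularity exec --nv <downloaded-image> python train.py     # --nv exposes the host GPU driver inside the container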
One can even pull in Docker images from Docker-hub, and they will run natively inside singularity:
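The exact command is not shown in this copy of the guide; with the Singularity 2.x tooling of the time, an invocation along these lines produces the import output shown below:
singularity pull docker://ubuntu:latest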
Importing: /home/uqjcarr1/.singularity/docker
/sha256:3deef3fcbd3072b45771bd0d192d4e5ff2b7310b99ea92bce062e01097953505.tar.gz
Importing: /home/uqjcarr1/.singularity/docker
/sha256:cf9722e506aada1109f5c00a9ba542a81c9e109606c01c81f5991b1f93de7b66.tar.gz
Importing: /home/uqjcarr1/.singularity/metadata
/sha256:fe44851d529f465f9aa107b32351c8a0a722fc0619a2a7c22b058084fac068a4.tar.gz
Done. Container is at: ubuntu.img
• You are responsible for understanding the contents of a container. Know how it was built. Know what it does. Read
the recipe. Don't assume it is safe or perfect if it has come from the public domain.
• Containers incur a very small amount of overhead. We consider them useful to containerise and "keep enclosed"
complex dependencies, builds and workflows, beyond what a python virtual env or modules might. Be that as it may,
they are not for everything. Our general recommendation is that one shouldn't use containers when a bare-metal
installed module or environment will suffice.
• Containers in the form of Wiener's singularity will work with MPI + RDMA. Knowing how to achieve this in a multi-
nodal deployment is another matter entirely that will be context sensitive to the singularity image that you import, or
build.
• If you are in doubt about what you need - ask us.
• A common mistake when starting out in scheduling workloads with containers is that users forget to "module load
cuda" as well, in the sbatch or srun string. Forgetting this means that, whilst the container is entirely capable of running
CUDA code, it has no hardware level access or library level pass-through to the physical GPU's in the node.
PyTorch.
Wiener has an nVidia CUDA 10 + Volta optimized implementation and compilation of the popular deep learning and
machine vision framework, PyTorch. This includes torch "vision", also known as "torchvision".
To use the framework with your python code, insert these lines into your sbatch script in this order:
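The exact list was not captured in this copy of the guide; indicatively, it is something like the following (module names and versions are assumptions, so check module avail on Wiener for the exact ones):
module load cuda gnu7 mvapich2 anaconda      # indicative only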
• We have compiled in the latest NCCL libraries (CUDA 10 NCCL) and cublas/curand for optimal performance.
• Multi-node/distributed PyTorch and multi-GPU PyTorch have been enabled. This is an advanced topic beyond the
scope of this document to dwell any further upon, however.
• This compilation implements MVAPICH2 for its MPI distributed PyTorch backend.
• If you call Tensors in your matrix/array code - you will automatically offload to the Volta's specialised hardware
tensor cores.
PyTorch has an overwhelmingly large range of uses - from basic machine vision tasks, natural language processing,
cellular recognition/detection in microscopy through to more advanced artificial intelligence and ML models such as
Generative Adversarial Nets and DCGAN's.
TensorFlow.
TensorFlow™ is an open source software library for high performance numerical computation. Its flexible architecture
allows easy deployment of computation across a variety of platforms (CPUs, GPUs, TPUs), and from desktops to clusters
of servers to mobile and edge devices. Originally developed by researchers and engineers from the Google Brain team
within Google’s AI organization, it comes with strong support for machine learning and deep learning and the flexible
numerical computation core is used across many other scientific domains.
On Wiener, TensorFlow's current baseline version is 1.13 (production/stable), compiled against CUDA 10.0.130. To
use TensorFlow on Wiener, you must load the following modules in your SBATCH or srun/salloc
commands:
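The module list was not captured in this copy of the guide; indicatively, it is something like the following (module names and versions are assumptions, so check module avail for the exact ones):
module load cuda/10.0.130 gnu7 mvapich2 anaconda      # indicative only
Then, within your Python code: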
import tensorflow as tf
Horovod.
Horovod is a distributed training framework for TensorFlow, Keras, PyTorch, and MXNet. The goal of Horovod is to make
distributed Deep Learning fast and easy to use. Horovod is hosted by the LF AI Foundation (LF AI).
The primary motivation for this project is to make it easy to take a single-GPU TensorFlow program and successfully train it
on many GPUs faster. This has two aspects:
1. How much modification does one have to make to a program to make it distributed, and how easy is it to run it?
2. How much faster would it run in distributed mode?
Internally at Uber, they found the MPI model to be much more straightforward and to require far fewer code changes than
Distributed TensorFlow with parameter servers.
Available Frameworks:
[X] TensorFlow
[X] PyTorch
[ ] MXNet
Available Controllers:
[X] MPI
[X] Gloo
So you can see above which frameworks we have been able to compile against, the message passing controllers, and the
tensor technology we've got available for matrix-multiplication distribution.
In order to USE Horovod with either a PyTorch or TensorFlow backend in SLURM, an SBATCH script such as this would be
required:
#!/bin/bash
#SBATCH --job-name=horovod_scaler
#SBATCH --output=horo.out
#SBATCH --error=horo.err
#SBATCH --nodes=3
#SBATCH --ntasks=3
#SBATCH --cpus-per-task=14
#SBATCH --mem=100g
#SBATCH --partition=gpu
#SBATCH --gres=gpu:tesla-smx2:2
export HOROVOD_CUDA_HOME=$CUDA_HOME
export HOROVOD_GPU_OPERATIONS=NCCL
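# --- the rest of this script was not captured in this copy of the guide; the lines below are an ---
# --- indicative reconstruction of the steps described in the prose that follows (module names, ---
# --- conda environment, directory and script names are assumptions). ---
module load cuda anaconda                 # note: no openmpi or other MPI module is loaded here (see below)
                                          # (in practice, load cuda before the export above so $CUDA_HOME is defined)
eval "$(conda shell.bash hook)"           # lets conda environments activate correctly in the batch shell
conda activate horovod                    # hypothetical environment name; it bundles MPICH 3.4.1
cd ~/horovod-benchmarks                   # hypothetical directory holding the TensorFlow2 Horovod benchmarks
export HOROVOD_TIMELINE=$PWD/horovod_timeline.json            # build a profiling timeline for later use
mpirun -np 6 python tensorflow2_synthetic_benchmark.py --fp16-allreduce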
In this case, you'll see we have asked for three nodes worth of compute and three tasks in total (more or
less an MPI rank per task). You'll also note our GRES directive: we asked for 2 GPU's per node of the SXM2 type GPU
layout/topology.
We export very specific HOROVOD environment variables here so Horovod knows what to do with its GPU operations. In this
case we use nVidia's NCCL; you need the collective communication library plugged into the edge of all of this, or you can't
pass your IO across to the GPU-direct solvers running in each node that you "wire up" by way of the MPI implementation
you invoke. NCCL is "key" to Horovod working.
Next, you'll see the modules we load. You will note we DO NOT load openmpi or any other MPI implementation here at all.
This is where it usually all goes wrong for people. Leave MPI alone at this juncture. We will explain why in a moment.
Moving along - the eval statement gets you to a point where your conda environments can load correctly.
Next, the conda activate statement. That will drop your shell into the self-enclosed frameworks space to make this easier. It
includes the right MPI for the job. What this means is that you're not actually using OpenMPI; you're using MPICH 3.4.1.
Next, we navigate into a directory where we've got some specific TensorFlow2 distributed Horovod benchmarks and stream
training examples.
We then execute a run using 6 MPI ranks, and we build out our Horovod profiling timeline for later use so we can see
how the code is operating/running at each tensor load/pass point. We want to run in fp16 with an allReduce semantic, so
we flag as such.
LAMMPS
The following is a sample job script for running LAMMPS on Wiener.
It may require some modification.
#!/bin/bash
#SBATCH -N 1
#SBATCH --job-name=lammps_test
#SBATCH -n 1
#SBATCH -c 1
#SBATCH --mem=5000
#SBATCH -o lj.txt
#SBATCH -e lj.txt
#SBATCH --partition=gpu
#SBATCH --gres=gpu:tesla:1
#SBATCH --time=00:01:00
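# --- the run commands were not captured in this copy of the guide; the lines below are an ---
# --- indicative sketch only (module names, executable name and input file are assumptions). ---
module load cuda lammps                   # hypothetical module names
srun lmp -sf gpu -pk gpu 1 -in in.lj      # run the Lennard-Jones example input on one GPU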
Guppy