Notes
I should also explain why this course exists. A few years ago a member of
the UIS interviewed people in the University about the usability of some
high-end e-science software. The response he got back was not what we were
expecting; it was “Why are you wasting our time with this high-level
nonsense? Help us with the basics!”
So we put together a set of courses and advice on the basics. And this is
where it all starts. I'm going to talk about the elementary material, what you
need to know and the courses available to help you learn it. You do not need
to attend all the courses. Part of the purpose of this course is to help you
decide which ones you need to take and which ones you don't.
This course is the first in that set of courses designed to assist members of
the University who need to program computers to help with their science.
It also teaches you the minimum you need for any of the others.
The final slide of the talk gives a URL through which all the relevant UIS
courses can be found.
Course outline
Basic concepts
Good practice
Specialist applications
Programming languages
Serial computing
Single CPU
Multiple CPUs
Single Instruction Multiple Data
MPI
OpenMP
Parallel Programming: Options and Design
Parallel Programming: Introduction to MPI
Distributed computing
Multiple computers
Universal principles:
0.1 → 0.1000000000001
and worse…
Program Design:
How Computers Handle Numbers
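The “0.1 → 0.1000000000001” effect is easy to see for yourself. Here is a minimal sketch in Python; the function name add_tenths is made up for this illustration. It adds 0.1 ten times and compares the result with 1.0, first exactly and then with a tolerance.

```python
import math


def add_tenths(n):
    """Add 0.1 to a running total n times and return the total."""
    total = 0.0
    for _ in range(n):
        total += 0.1
    return total


total = add_tenths(10)
print(total == 1.0)              # exact comparison: prints False
print(math.isclose(total, 1.0))  # comparison with a tolerance: prints True
```

The point is not this particular sum but the habit: compare floating point numbers with a tolerance, not with ==.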
Text processing
e.g. sequence comparison
text searching
^f.*x$
“Regular expressions”
Matches: fabliaux, factrix, falx, faulx, faux, fax, feedbox, …, fornix, forty-six, fourplex, fowlpox, fox, fricandeaux, frutex, fundatrix
But there's more to life than real numbers. Under the covers computers work
with integers just as much as, and sometimes much more than, floating
point numbers. These tend not to be used to represent numbers directly but
to refer into other sorts of data. Good examples of these situations involve
searching either in databases or in texts.
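As a small illustration of text searching, here is a sketch that applies the pattern ^f.*x$ (words that start with “f” and end with “x”) using Python's standard re module. The helper name words_matching and the sample word list are made up for this example.

```python
import re

# "^f" anchors an f at the start, ".*" allows anything in between,
# and "x$" anchors an x at the end.
F_TO_X = re.compile(r"^f.*x$")


def words_matching(words, pattern=F_TO_X):
    """Return the words that the regular expression matches."""
    return [word for word in words if pattern.search(word)]


sample = ["fax", "fox", "forty-six", "faxes", "box", "fowlpox"]
print(words_matching(sample))  # ['fax', 'fox', 'forty-six', 'fowlpox']
```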
Regular expression courses
Programming Concepts:
Pattern Matching Using Regular Expressions
“Divide and conquer”
[Diagram: a complex problem is “divided” into simple problems; each simple problem is “conquered” to give a partial solution; the partial solutions are “glued” into the complete solution.]
Suppose you have a task you need the computer to perform. The key to
succeeding is to split your task up into a sequence of simpler tasks. You
may repeat this trick several times, producing ever simpler sub-tasks.
Eventually you get tasks simple enough that you can code them up.
That sounds trivial. But that trick, repeated often, is how programming
works. “Divide and conquer.”
“Divide and conquer” — the trick
[Diagram: each simple problem is “conquered” separately to give a partial solution.]
The reason that this works so well is that you don't have to use the same
tool for all the subtasks. Different tools are suitable for different bits of your
task. So long as you can glue the parts together again there is no need to
use one tool for everything. While it may sound harder to use lots of different
tools rather than just one, the simplification gained by the splitting and the
specificity of the tools more than makes up for it.
In practice, the instructions you are given to write code look scary at first
glance. The trick is to divide and conquer.
Example
Read 256 lines of data represented in a CSV format.
Keep reading 256-line lumps like this until they’re all done.
[Diagram: the task divided into separate steps: read a file in CSV format, plot a heat graph (graphics), write the output file.]
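To make the splitting concrete, here is a minimal sketch in Python of the “read 256-line lumps until they're all done” part. The function names are made up for this illustration, and summarise is only a stand-in for the real per-lump work (such as plotting), which is not shown.

```python
import csv
import io
from itertools import islice

LUMP_SIZE = 256  # lines per lump, as in the example task


def read_lump(reader, size=LUMP_SIZE):
    """Read up to `size` rows from a csv.reader, converting cells to floats."""
    return [[float(cell) for cell in row] for row in islice(reader, size)]


def summarise(lump):
    """Hypothetical stand-in for the real work: return the largest value."""
    return max(max(row) for row in lump)


def process_stream(stream, size=LUMP_SIZE):
    """Keep reading lumps from the stream until there are none left."""
    reader = csv.reader(stream)
    results = []
    while True:
        lump = read_lump(reader, size)
        if not lump:  # an empty lump means the data has run out
            break
        results.append(summarise(lump))
    return results


# Tiny demonstration with lumps of 2 lines instead of 256.
demo = io.StringIO("1,2\n3,4\n5,6\n")
print(process_stream(demo, size=2))  # [4.0, 6.0]
```

Each function does one simple job, so each can be tested or replaced on its own.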
Programs
Functions
“Lumps” ?
Modules
Units
Now we will look at the bits themselves. They go by various names, such as
“objects”, “functions”, “modules” and “units”. I prefer “lumps”. It's a simple,
no-nonsense word that rather deflates the pompous claims made by some
computing people.
If you split your program up into sets of these lumps, and reuse lumps when
you need the same functionality twice or more, then you stand a good
chance of success. If you don't, and write chaotic, unstructured code then
you will have to work much harder to get a program that works and harder
still to get one that works correctly.
So why am I talking about this before we've even looked at any
programming languages? It's because this rule about splitting your program
up into a structured collection of parts is common to every single
programming language. It's an absolute rule, and they're rare in this
business!
Example: unstructured code
    a_norm = 0.0
    for i in range(0,100):
        a_norm += a[i]*a[i]
    …
    b_norm = 0.0
    for i in range(0,100):
        b_norm += b[i]*b[i]
    …
    c_norm = 0.0
    for i in range(0,100):
        c_norm += c[i]*c[i]
So what do I mean? Let's look at an example of bad code. You don't need to
worry about the language (it's a language called Python that we will talk
about later) because I hope the general principle is clear. We're calculating
\[ \sum_{i=0}^{99} x_i^2 \]
for three different sets of 100 values in three different parts of a program.
This commits the cardinal programming sin of repetition. If we wanted to
improve the way we calculated this sum we would have to do it three times.
(And if we want accurate sums we do need to improve it.)
We might also want to speed up our program. But does our program spend
the majority of its time doing these sums or a tiny fraction of its time? If I
spend an hour speeding it up is that more time than I will ever save running
the slightly faster program? I can't tell until I isolate this operation into one
part of my program where I can time it.
Example: structured code
    # Single instance of the code.
    def norm2(v):
        v_norm = 0.0
        for i in range(0,100):
            v_norm += v[i]*v[i]
        return v_norm

    # Calling the function three times
    a_norm = norm2(a)
    …
    b_norm = norm2(b)
    …
    c_norm = norm2(c)
So let's improve it. We take the repeated operation and move it to a single
place in the code wrapped in a function. Then in the three parts of the
program where we calculate our sum, we simply make use of this function.
(What I am illustrating here is written in Python but the principle is
universal and hopefully it’s simple enough to read that you don't need
Python fluency to follow along.)
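Once the sum lives in one function it can be tested and timed in one place. The following is a sketch only: check_norm2 and time_norm2 are names invented for this illustration, and this norm2 loops over every item in its argument, not a fixed 100 values. The timing uses the standard timeit module.

```python
import timeit


def norm2(v):
    """Return the sum of the squares of the values in v."""
    v_norm = 0.0
    for x in v:
        v_norm += x * x
    return v_norm


def check_norm2():
    """Test the function once, in one place."""
    assert norm2([]) == 0.0
    assert norm2([3.0, 4.0]) == 25.0


def time_norm2(n=100, repeats=1000):
    """Return the seconds taken to call norm2 `repeats` times on n values."""
    data = [1.0] * n
    return timeit.timeit(lambda: norm2(data), number=repeats)


check_norm2()
print(f"{time_norm2():.6f} seconds for 1000 calls")
```

If the time reported is a tiny fraction of your program's run time, speeding up this function is not worth an hour of your effort.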
Structured programming
Write function
Test function
Debug function
Time function
Improve function
Import function
Once!
All good practice follows from structured programming
    # No change to calling function
    a_norm = norm2(a)
    …
    b_norm = norm2(b)
    …
    c_norm = norm2(c)
        v_norm = 0.0
        for w_item in w:
            v_norm += w_item
        return v_norm
    # Still no change to calling function
    a_norm = norm2(a)
    …
    b_norm = norm2(b)
    …
    c_norm = norm2(c)
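As an illustration of improving the function without touching its callers, here is a sketch that replaces the running total with math.fsum from Python's standard library, which keeps track of the rounding error that ordinary addition loses. Whether this is the improvement the slides had in mind is an assumption; the sample data are made up.

```python
import math


def norm2(v):
    """Return the sum of the squares of the values in v.

    math.fsum adds the squares without accumulating the rounding
    error of a naive running total.
    """
    return math.fsum(x * x for x in v)


# The calling code is exactly what it was before the improvement.
a = [3.0, 4.0]
a_norm = norm2(a)
print(a_norm)  # 25.0
```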
Programming Concepts:
Introduction for Absolute Beginners
Libraries
Written by experts
In every area
Learn what
libraries exist
in your area
Use them
These libraries of functions are your salvation. All you need to know is what
libraries exist and how to call them. Much of the content of our programming
courses consists of telling you this and giving you an introduction to a
library so you get a feel for its shape.
Your time is better spent on original research than inferior duplication of
work that already exists. Save your effort for your research!
Example libraries
Scientific Python
Numerical Python
So what can we get from libraries? The question “what can't we get”
probably has a shorter answer. You name it; the libraries have got it.
For example the NAG libraries and the Scientific Python libraries have
functions for at least the following topics:
• Roots of equations
• Differential and Integral Equations
• Interpolation, Fitting & Optimisation
• Linear Algebra
• Special Functions
and many, many more.
Hard to improve on library functions
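\[
\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}
=
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}
\]
\[
\begin{aligned}
M_1 &= (A_{11}+A_{22})(B_{11}+B_{22}) & C_{11} &= M_1+M_4-M_5+M_7 \\
M_2 &= (A_{21}+A_{22})B_{11}          & C_{12} &= M_3+M_5 \\
M_3 &= A_{11}(B_{12}-B_{22})          & C_{21} &= M_2+M_4 \\
M_4 &= A_{22}(B_{21}-B_{11})          & C_{22} &= M_1-M_2+M_3+M_6 \\
M_5 &= (A_{11}+A_{12})B_{22} \\
M_6 &= (A_{21}-A_{11})(B_{11}+B_{12}) \\
M_7 &= (A_{12}-A_{22})(B_{21}+B_{22})
\end{aligned}
\]
Applied recursively: much faster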
Volker Strassen's algorithm, as shown in the slide, is far more efficient for
large matrices but I doubt you would ever have thought of it. I also doubt you
would want to code it, either.
So use a library!
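If you want to convince yourself that seven products really are enough, here is a minimal check in Python on 2×2 matrices of plain numbers, using the standard form of Strassen's formulas. It is an illustration only (the function names are made up, and there is no recursion), not something to use in place of a library.

```python
def strassen_2x2(A, B):
    """Multiply two 2x2 matrices using Strassen's seven products."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]


def naive_2x2(A, B):
    """Multiply two 2x2 matrices the ordinary way (eight products)."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(strassen_2x2(A, B))  # [[19, 22], [43, 50]]
print(naive_2x2(A, B))     # [[19, 22], [43, 50]]
```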
Algorithms
Time taken /
Memory used
vs.
Size of dataset /
Required accuracy
O(n²) notation
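To give a feel for what the notation measures, here is a sketch that counts the work done by two ways of checking a list for duplicates. The function names are made up for this illustration. The first compares every pair, so its work grows as n²; the second remembers what it has seen in a set, so its work grows as n (on average, and assuming the items can be stored in a set).

```python
def has_duplicate_quadratic(items):
    """Compare every pair of items; return (found, comparisons made)."""
    comparisons = 0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            comparisons += 1
            if items[i] == items[j]:
                return True, comparisons
    return False, comparisons


def has_duplicate_linear(items):
    """Remember items already seen in a set; return (found, steps taken)."""
    seen = set()
    steps = 0
    for item in items:
        steps += 1
        if item in seen:
            return True, steps
        seen.add(item)
    return False, steps


data = list(range(100))
print(has_duplicate_quadratic(data))  # (False, 4950)
print(has_duplicate_linear(data))     # (False, 100)
```

Multiply the size of the dataset by ten and the first count grows roughly a hundredfold while the second grows tenfold.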
Communal working
Source code for most programs (but not all, e.g. spreadsheets) consists of
plain text.
Revision control lets you say that “this is version X of this file” (or of
this set of files). You can then say “go back to version 5” or “show me
version 2”. Some systems support locking, where you can say “I'm working on
this file; nobody touch it.” Others don’t lock files but instead merge the
changes that different people make to different parts of the same file.
Revision control systems work over multiple computers too. This lets lots of
people all work on the same project.
Another standard facility is to split off a “branch”. If you want to experiment
with some changes you start with version 5 say and then create version 5.1,
5.2 etc. rather than 6 and 7. At the end of the experiment, if you like what
you have got, you merge your final version 5.x back, or you simply drop it
and return to the original version 5.0 and start again.
Perhaps most importantly if you do build a version 6 and realise you have it
completely wrong you can abandon it and return to version 5!
Revision control
There are two revision control systems in wide current use: subversion and git.
The more recent program, git, was designed by Linus Torvalds in 2005 for his
software project (the Linux kernel) after the project lost free use of
BitKeeper, the revision control system it had relied on.
If you have an existing revision control system, use it, whatever it is.
If you are starting from scratch, use git.
git has an additional advantage. If you are prepared to work in an open
environment (anyone can join in) on open source software there is a
company, GitHub, that will give you a git repository free. (And if not, then
they’re not too expensive.) Better still, they will even teach you how to use
git!
Integrated Development Environment
NetBeans Java
Makefile
But the grand-daddy of all build systems outstrips the IDEs thousands to
one. (In fact most IDEs use it behind the scenes.)
The make program records how one file depends on another, so that if you
update one file make knows which other files now need to be rebuilt. It also
stores information about how to rebuild most types of file.
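The core of that dependency check can be sketched in a few lines of Python. This is a simplified illustration of the idea, not how make itself is implemented; the function name needs_rebuild is made up for the example, and the file names in the usage line are only illustrative.

```python
import os


def needs_rebuild(target, sources):
    """Return True if target is missing or older than any of its sources."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(source) > target_time for source in sources)


# Hypothetical usage: rebuild fubar.o only if fubar.c has changed since.
# if needs_rebuild("fubar.o", ["fubar.c"]): ...run the compiler...
```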
The configuration files for make, known as Makefiles, are a little strange.
The strangeness harks back to the creation of the first make in 1976 by
Stuart Feldman of Bell Labs. The (potentially apocryphal) story is this:
Feldman knocked together a quick version of make which used lots of short
cuts and dirty tricks letting him write a prototype quickly. He let some people
use it and went home for the night. It proved so popular that when he
returned to work the following morning it was in use in so many projects that
any change of specification was vetoed by his colleagues.
Building software courses
Unix:
Building, Installing and Running Software
Specialist applications
Microsoft Excel
LibreOffice Calc
Apple Numbers
Example:
Best selling book,
buggy spreadsheets!
This format does not lend itself well to revision control and shared working
as everything is stored in a single, binary format file.
It is also very easy to corrupt data in a spreadsheet and notoriously hard to
debug problems.
A recent, high-profile example of this came from the book “Capital in the
Twenty-First Century”. This was a politically charged book that made some
very significant claims regarding the concentration of money in the hands of
“the 1%”. The author did his calculations in Excel, and a Financial Times
analysis reported mistakes in the spreadsheets and argued that some of his
conclusions weakened once they were corrected, which the author disputed.
There is, incidentally, a moral in this tale beyond “don’t use spreadsheets for
anything important”. When you are debugging a program it is not enough to
stop looking for bugs when you get answers you agree with.
Excel courses
Excel 2010/2013:
Introduction
Analysing and Summarising Data
Functions and Macros
Managing Data & Lists
Statistical software
Stata: Introduction
Matlab
Octave
Mathematica
There are packages for helping with mathematical manipulation and casual
graphing of interim results. The two big players in the Cambridge
environment are MATLAB and Mathematica with MATLAB tending to
dominate over Mathematica. There is also a free product called Octave
which seeks to be a clone of Matlab.
We tend to recommend people avoid Mathematica. While it seems
deceptively easy to use at first there is no consistency to it. With Matlab and
Octave, once you know the way to do a few tasks the rest seem intuitive.
Mathematical software courses
Matlab:
Manual or automatic?
If you want to plot graphs based on the output of your programs you
will need some sort of plotting package. The “bad way” to do this is to take a
graphics library (in Fortran or C, both exist) and to bolt some graphics code
into your numerical program. The right way is to have your numerical
program produce its results and then write a distinct graphics program in a
graphics-specific language or package. Alternatively you can import the data
into a purely manual graphics package and fiddle to your heart's content. I
assume you will have better things to do with your time and just want a
program to create a graph glued to the other bits of your project. This is
what I mean by “automatic” rather than “manual”.
A very common approach is to export as comma separated value files (CSV)
and to then create graphs in Excel or another spreadsheet.
There are two dedicated graphics languages: gnuplot and ploticus.
Both are available on the MCS. In addition there is a graphics module for
Python called matplotlib.
Note that even if your main program is written in Python and you want to use
the Python graphical module we still advise that you split the two tasks —
creating your data and graphing your data — into two separate programs.
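One way to keep the two tasks apart is to let a CSV file be the only thing they share. The following is a minimal sketch using Python's standard csv module; the function names are made up for this illustration, and the plotting call itself is left out because it depends on the graphics package you choose.

```python
import csv


def write_results(path, rows):
    """Called by the numerical program: save rows of numbers as CSV."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)


def read_results(path):
    """Called by the separate graphing program: load the rows as floats."""
    with open(path, newline="") as handle:
        return [[float(cell) for cell in row] for row in csv.reader(handle)]
```

The numerical program ends with a call to write_results; the graphing program starts with a call to read_results and never needs to know how the numbers were produced.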
Courses for drawing graphs
Python 3: Advanced Topics (Self-paced)
(includes a matplotlib unit)
Computer languages
Interpreted Compiled
Untyped Typed
The shell is the fundamental interpreted language. The commands you type
at the command line are interpreted by the shell and acted on. Similarly we
can put those commands in a file and have the shell interpret them from
that.
Shell scripts are the classic “glue” for holding together a set of programs. If
you have a set of programs which can be run from the command line and
which have to interoperate then a shell script is what you want to use.
They can also be used for “wrapping” programs. This lets you run programs
with your default parameters, or in a certain environment, without having to
set each parameter or change your environment by hand each time you run
them.
Shell scripts can also be used to run certain small tasks themselves. So
long as the task is very simple, and stays very simple then this is OK. Small
scripts like this have a habit of growing with time, though, and very soon you
end up in a situation where you should be using one of the more powerful
scripting languages we will meet later.
Shell scripts are not suitable for computationally intensive work (though they
can call other programs that are, of course) and they are not suitable for
writing GUIs in (though people have tried).
Shell script
/bin/sh
    job="${1}"
    …
Bourne shell family: /bin/sh, /bin/bash, /bin/ksh, /bin/zsh
C-shell family: /bin/csh, /bin/tcsh
There are many shells. The only rational one to choose is bash, the Bourne
again shell, whose name is a play on that of Stephen Bourne, who wrote one
of the very early shells.
The most important schism is between the “C-shell” and the “Bourne shell”
shells. Avoid C-shell; it's dying.
Shell scripting courses
Unix:
Python! ✔
A word of caution is advisable here. We teach quite a bit of shell scripting in
the UCS course, but not all of it. If you ever find yourself looking for an
advanced shell scripting course then our advice is that you have left the
arena where shell script is the right tool for the job. We would recommend
Python as a better alternative.
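For comparison, here is a sketch of what a simple “wrapper” looks like when written in Python instead of shell, using the standard subprocess module. The default parameters and the program name in the comment are hypothetical; substitute those of the program you are wrapping.

```python
import subprocess

# Hypothetical default parameters for the wrapped program.
DEFAULT_ARGS = ["--iterations", "100"]


def build_command(program, extra_args=()):
    """Return the full command line: program, defaults, then extras."""
    return [program, *DEFAULT_ARGS, *extra_args]


def run_wrapped(program, extra_args=()):
    """Run the program with the defaults applied and return its output."""
    result = subprocess.run(build_command(program, extra_args),
                            capture_output=True, text=True, check=True)
    return result.stdout


# "simulate" is a made-up program name.
print(build_command("simulate", ["input.dat"]))
# ['simulate', '--iterations', '100', 'input.dat']
```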
Just because the shell can do a bit more doesn't mean that you should use
it for that. So this leads us on to the more powerful scripting languages…
High power scripting languages
Python
    #!/usr/bin/python
    import library
    …
Perl
    #!/usr/bin/perl
    use library;
    …
Both have extensive libraries of utility functions.
Both can call out to libraries written in other languages.
The shell, which we saw in the previous slides, was designed for launching
other programs rather than being a programming language in its own right.
We will now turn to the two primary scripting languages that were designed
as programming languages in their own right: Python and Perl.
Again, neither is directly appropriate to computationally very intensive work
but both can make use of external libraries that have been written in other
languages that are. Python, in particular, has developed a major following in
the scientific community and is no slacker for medium scale problems.
The “Swiss army knife” language
Perl
Suitable for…
    text processing
    data pre-/post-processing
    small tasks
Bad first language
Very easy to write unreadable code
The other powerful scripting language we will discuss is Python. Python was
written after Perl became widely used and has the benefit that its author
learned from Perl's mistakes. Despite being more recent it has caught up
and is now very common in Cambridge and the scientific community
worldwide, overtaking Perl.
Python is also very easy to learn and we recommend it as a first
programming language.
It comes with its own fairly extensive libraries which give it the slogan
“batteries included”. Most of what you need for general computing comes
with the language.
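As a small taste of “batteries included”, here is a sketch that summarises some readings using only modules that ship with Python (statistics and collections). The function name and the sample readings are made up for this illustration.

```python
import statistics
from collections import Counter


def summarise(readings):
    """Return the mean, median and most common value of the readings."""
    return {
        "mean": statistics.mean(readings),
        "median": statistics.median(readings),
        "most_common": Counter(readings).most_common(1)[0][0],
    }


print(summarise([1, 2, 2, 3, 7]))
# {'mean': 3, 'median': 2, 'most_common': 2}
```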
In addition the scientific community has built the “Scientific Python” (SciPy)
libraries which are in turn built on top of the “Numerical Python” (NumPy)
libraries which provide very efficient array-handling routines (written in a
language other than Python).
You can learn all about SciPy at http://www.scipy.org/.
Python lends itself very naturally to writing well structured and manageable
code. It has a style of code that is unique and which puts off some people
but it's easily dealt with in the editor. The issue is that where most languages
use open and closed brackets to clump instructions together, Python uses
levels of indentation.
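Here is a short example of what that looks like. The function is made up for this illustration; notice that the only thing marking which lines belong to the for loop, and which belong to the if or the else, is how far they are indented.

```python
def classify(values, threshold):
    """Split values into those below the threshold and the rest."""
    small, large = [], []
    for value in values:            # the loop body is indented one level
        if value < threshold:       # the if body is indented two levels
            small.append(value)
        else:
            large.append(value)
    return small, large             # back to one level: the loop is over


print(classify([1, 5, 2, 8], 4))  # ([1, 2], [5, 8])
```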
Python courses
Python 3:
Introduction for Absolute Beginners
Python 3:
Introduction for Those with
Programming Experience
Python 3:
Further Topics
(self paced)
Compiled languages
No specialist system and scripts are not fast enough
Library requirement with no script interface
Compiled language: C, C++, Fortran, Java
Use only as a last resort
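[Diagram: compilation produces the machine code files fubar.o and snafu.o; linking combines them into the executable machine code file fubar; at run-time, execution of fubar also draws on the shared library libc.so.6. Functions shown: main(), pow(), zap(), printf().]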
I'll use C as the example in these slides, but the same applies for C++ and
Fortran.
We start with the source code (typically multiple files).
Compilation proper consists of taking the individual plain text source files
and turning them into machine code for the computer. Each source file,
fubar.c say, is individually converted into a machine code (or “object
code”) file called an object file, fubar.o, which implements exactly the
same functionality as the source code file. Any function calls in the source
code are translated to function calls in the machine code. If the function's
content isn't defined in the source code then it's not defined in the machine
code. And so it goes on. This is a pure “translation” process; source code is
translated, file for file, into machine code.
The next stage is called “linking”. This is the combination of the various
machine code files into a single executable file. The function definitions
defined in the various object files are tied together with their uses in other
object files. Calls to functions in external libraries are tied to the file
containing the library so that, at run-time, the operating system can hook
those function definitions in too.
No need to compile whole program
Critical
function
Python
script
Also note that you don't need to write your entire program in a compiled
language just because you need to write part of it that way. For example,
there are hooks to call Fortran routines from Python, and Python objects
which can be manipulated by Fortran. Many of the support libraries for
Python are written in languages other than Python.
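To show the general idea of calling compiled code from a script, here is a sketch that uses ctypes, a module in Python's standard library, to call the pow() function from the system's compiled C maths library. This is a different mechanism from the Fortran hooks mentioned above, chosen here only because it needs no build step. The function name load_c_pow is made up, and the sketch assumes a system where the maths library can be located (it returns None otherwise).

```python
import ctypes
import ctypes.util


def load_c_pow():
    """Return the C library's pow() as a Python callable, or None."""
    path = ctypes.util.find_library("m")  # locate the C maths library
    if path is None:
        return None
    try:
        libm = ctypes.CDLL(path)
    except OSError:
        return None
    # Tell ctypes the C signature: double pow(double, double).
    libm.pow.argtypes = [ctypes.c_double, ctypes.c_double]
    libm.pow.restype = ctypes.c_double
    return libm.pow


c_pow = load_c_pow()
if c_pow is not None:
    print(c_pow(2.0, 10.0))  # the work is done by compiled C code
```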
No need to write the whole
program in a compiled language
function.f: f2py
function.c: SWIG
For numerical work there's Fortran. There still is no comparison; if you are
doing numerical work you are best off using Fortran. The best numerical
libraries are written in Fortran too.
However, it is probably the wrong choice for more or less anything else.
You also need to be careful about the various different versions of Fortran.
For a long time Fortran 77 was the standard. Now we tend to use a mix of
Fortran 90 and Fortran 95. Fortran 2003 has yet to make a serious impact.
Fortran course
Fortran:
Introduction to Modern Fortran
Excellent libraries
Memory management
The C programming language made its name by being the language used to
write the Unix operating system. As a result it is the best of the compiled
languages for interfacing with the operating system. Because it is the
language for an operating system used by developers a very large number
of libraries and programs have been written in it.
Arguably it has been superseded for application programming by C++ but it
is still very widely used.
The most important problem with C is the issue of “memory management”.
In C you are required to explicitly “free” objects that you no longer need to
return their memory space allocation to general use. Programs that don't do
this suffer from “memory leaks” and tend to grow with time. Once they get
too big for the system running them they become slow as the system has to
compensate for the amount of memory they claim to need. Finally they
collapse. Alternatively, programmers can accidentally free memory that the
program actually does require. These programs tend to die suddenly. It's
also possible to point accidentally to the wrong part of memory and get
nonsense results back.
All these memory management issues can be handled with careful
programming, but the language offers no assistance of its own.
C++
Extension of C
Object oriented
“Programming: principles
and practice using C++”
Stroustrup, Bjarne (2008)
harder but better for scientific computing
From the intro to Stroustrup’s book
C++:
Programming in Modern C++
12 lectures, 3 terms,
significant homework
Object oriented
Finally I'll talk about Java. This, like C++, is a good general purpose
language and is much easier to learn and to use. It implements automatic
memory management so those difficulties are gone too.
Because it is implemented as a byte-code interpreter, interpreting the code
generated by the supposed compiler, its compiled files work on any platform
that has at least the required version of the Java runtime system.
Some of its libraries aren't particularly well thought out, however, and there
is a good deal of difference between the various versions of the language,
though the Java maintainers do guarantee back-compatibility. If you stick to
versions 1.6 or later you should do OK.
Java courses
(also classes,
ask at the CL)
Scientific Computing
training.cam.ac.uk/ucs/theme/scientific-comp
scientific-computing@ucs.cam.ac.uk
www.ucs.cam.ac.uk/docs/course-notes/unix-courses