006 Mold Slides
Rui Ueyama
Blue Whale Systems PTE LTD
Overview of this talk
1. Status of the project
2. What is the linker?
3. Why does the linker's speed matter?
4. Performance comparison of mold, LLVM lld and GNU gold
5. Why is mold so fast?
6. Hints on writing faster programs
Who am I?
Rui Ueyama
https://github.com/rui314
Status of the project
mold for Unix (Linux) is production-ready and is used by various projects and in
many organizations. It's extremely fast (I'll talk about that later).
Unless you are writing a program in a scripting language, chances are you are
using a linker!
The origin of the linker
Linkers and Loaders, ISBN 978-1558604964, John R. Levine (1999) Chapter 1.2
Programmers were using libraries of subprograms even before they used assemblers. By 1947, John Mauchly,
who led the ENIAC project, wrote about loading programs along with subprograms selected from a catalog of
programs stored on tapes, and of the need to relocate the subprograms' code to reflect the addresses at which
they were loaded. Perhaps surprisingly, these two basic linker functions, relocation and library search, appear
to predate even assemblers, as Mauchly expected both the program and subprograms to be written in machine
language. The relocating loader allowed the authors and users of the subprograms to write each subprogram
as though it would start at location zero, and to defer the actual address binding until the subprograms were
linked with a particular main program.
A linker is a program you would want to write right after inventing a digital computer, even before writing an assembler
for it!
For example, if you write a sort() function in machine code in hex, you'd want to use it in many programs. That
naturally led to a desire to write a linker.
Linker in action (1)
$ make
c++ -c -o main.o main.cc
c++ -c -o object_file.o object_file.cc
c++ -c -o input_sections.o input_sections.cc
……
c++ -c -o glob.o glob.cc
c++ main.o object_file.o input_sections.o output_chunks.o mapfile.o perf.o linker_script.o archive_file.o
output_file.o subprocess.o gc_sections.o icf.o symbols.o commandline.o filepath.o glob.o -o mold
-L/home/ruiu/mold/oneTBB/build/linux_intel64_gcc_cc9.3.0_libc2.31_kernel5.10.0_release/
-Wl,-rpath=/home/ruiu/mold/oneTBB/build/linux_intel64_gcc_cc9.3.0_libc2.31_kernel5.10.0_release/
-L/home/ruiu/mold/mimalloc/out/release -Wl,-rpath=/home/ruiu/mold/mimalloc/out/release -L/home/ruiu/mold/xxHash
-lcrypto -pthread -ltbb -lmimalloc -lz xxHash/libxxhash.a
● The final command (shown in bold on the slide) is the command line that invokes the linker.
It usually runs at the very end of a build.
● gcc/clang/cc (or g++/clang++/c++) are compiler drivers, not compilers in the narrow
sense. They invoke the appropriate backend commands based on the file extensions of the
given files.
● The linker command, "ld", is also expected to be invoked indirectly by a compiler driver.
Linker in action (2)
Appending "-###" makes a compiler driver show the actual ld command line:
Separate compilation
● A compiler takes a fragment of a program and emits compiled machine code
for it
● A linker combines machine code fragments into a complete program
Executable file and its memory image
Executable contains code and data
https://github.com/ifduyue/musl/tree/master/src/stdio
You can see the contents of a static libc library with the following command:
$ ar t /usr/lib/x86_64-linux-gnu/libc.a
init-first.o
libc-start.o
sysdep.o
version.o
check_fds.o
libc-tls.o
elf-init.o
dso_handle.o
...
Static library (2)
Reasons to split libc into multiple object files
You generally don't know which object files in libc provide the functions that you use,
but you don't have to!
If an archive file is given, the linker pulls out object files that are necessary to
complete a link. Unnecessary files are not linked.
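The member selection described above can be sketched as a fixed-point loop. This is a minimal illustration, not how any particular linker implements it: the function name, the dictionary layout, and the member and symbol names in the example are all hypothetical.

```python
def select_archive_members(undefined, archive):
    """Return the archive members needed to resolve `undefined`, in pull order.

    `archive` maps a member name to a (defined_symbols, undefined_symbols) pair.
    This data layout is an assumption made for the sketch.
    """
    needed = set(undefined)   # symbols still waiting for a definition
    defined = set()           # symbols defined by members pulled so far
    pulled = []               # members extracted from the archive
    changed = True
    while changed:            # repeat until no new member is pulled
        changed = False
        for name, (defs, undefs) in archive.items():
            if name in pulled:
                continue
            if needed & defs:
                # This member defines something we need, so link it in.
                pulled.append(name)
                defined |= defs
                # Its own undefined symbols may require further members.
                needed |= undefs
                needed -= defined
                changed = True
    return pulled


# Hypothetical archive contents, for illustration only.
archive = {
    "printf.o": ({"printf"}, {"vfprintf"}),
    "vfprintf.o": ({"vfprintf"}, {"write"}),
    "write.o": ({"write"}, set()),
    "qsort.o": ({"qsort"}, set()),
}
print(select_archive_members({"printf"}, archive))
```

In this example qsort.o is never pulled, because nothing that was linked refers to a symbol it defines.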
Linking static libraries
A linker does the following to link static libraries:
● Simple
● Library's code and data are directly copied to an output file, so the output is
self-contained (doesn't depend on another file at runtime.)
● Read the contents of dynamic libraries to find the symbols that are defined by
the libraries
● Create a PLT or GOT for symbols in the dynamic libraries (if the call is within
the same output file, there is no need to create a PLT or GOT)
Summary of linker's tasks
1. Read object files, static libraries, and dynamic libraries given as arguments
2. Resolve symbols
3. Scan relocation tables to determine contents of PLTs and GOTs
4. Determine the layout of the output
5. Copy data from an input file to an output file
6. Apply relocations
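Steps 2, 4, 5 and 6 above can be sketched with a toy in-memory "linker". Everything here is an assumption made for illustration: the object representation (a dict with "data", "defs" and "relocs"), the single relocation type (a 4-byte little-endian absolute address), and the base address. Reading real files (step 1) and PLT/GOT creation (step 3) are left out.

```python
import struct


def link(objects, base=0x1000):
    """Toy linker: returns the output image as bytes.

    Each object is a dict (hypothetical format) with:
      "data":   bytes of the section contents
      "defs":   {symbol name: offset within "data"}
      "relocs": [(offset within "data", symbol name), ...]
    """
    # Step 2: resolve symbols by recording which object defines each name.
    defined_in = {}
    for index, obj in enumerate(objects):
        for name, offset in obj["defs"].items():
            if name in defined_in:
                raise ValueError(f"duplicate symbol: {name}")
            defined_in[name] = (index, offset)

    # Step 4: determine the layout by placing objects back to back.
    starts = []
    addr = base
    for obj in objects:
        starts.append(addr)
        addr += len(obj["data"])

    # Step 5: copy input data to the output image.
    image = bytearray()
    for obj in objects:
        image += obj["data"]

    # Step 6: apply relocations by writing each symbol's final address
    # into the 4-byte slot that refers to it.
    for obj, start in zip(objects, starts):
        for offset, name in obj["relocs"]:
            if name not in defined_in:
                raise ValueError(f"undefined symbol: {name}")
            index, sym_offset = defined_in[name]
            target = starts[index] + sym_offset
            struct.pack_into("<I", image, start - base + offset, target)
    return bytes(image)


# The bytes below are placeholders, not real machine code.
main_o = {"data": b"\xaa\x00\x00\x00\x00\xbb", "defs": {"main": 0}, "relocs": [(1, "sort")]}
sort_o = {"data": b"\xcc", "defs": {"sort": 0}, "relocs": []}
print(link([main_o, sort_o]).hex())
```

Each object is written as though it started at offset zero, and the final addresses are bound only at link time.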
Why does the linker's speed matter?
Because faster is better!
A build system such as make compiles only source files modified since the last
build. If you change one file and rebuild, a compiler and a linker run.
Linking tends to be a bottleneck because the linker always takes all object files as
arguments.
If your build takes 30 seconds, you would switch the window to start web
browsing. But if it's only 3 seconds, you can wait. So it's not just saving 27
seconds for each build. It helps developers to maintain focus.
Performance comparison of mold, LLVM lld and GNU gold
Speed of mold
mold's speed is about 1 second per 1 GB output file on a high-core-count modern
x86 machine.
It's only about twice as slow as copying the output file to another file with cp. That is
extremely fast, given that the linker does substantially more work than cp does.
It's probably almost impossible to create a linker significantly faster than mold, as
mold is already almost I/O-bound.
Why is mold so fast?
Directly use data on mmapped input files without creating intermediate
representations
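As a rough illustration of reading fields straight out of a mapped file, the sketch below maps a file read-only and decodes one ELF header field in place with struct.unpack_from, without first copying the file into another structure. The function name is hypothetical, and the sketch assumes a little-endian ELF file. In the ELF header, the magic occupies the first 4 bytes and e_type is a 16-bit field at offset 16.

```python
import mmap
import struct


def elf_type(path):
    """Return the e_type field of an ELF file, read directly from the mapping."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        # Check the 4-byte ELF magic at the start of the file.
        if m[:4] != b"\x7fELF":
            raise ValueError("not an ELF file")
        # Decode the 16-bit e_type at offset 16 in place.
        # Assumption: the file is little-endian.
        (e_type,) = struct.unpack_from("<H", m, 16)
        return e_type
```

For a relocatable object file (a .o file), e_type is 1 (ET_REL).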
Parallelize all internal passes whenever possible
● Due to Amdahl's law, the performance of a parallelized program tends to be
determined by the non-parallelized parts of the program
● For example, if the single-threaded part of a program accounts for 50% of the total
execution time, the program cannot be sped up by more than 2x no matter how many cores you add
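The 2x limit follows directly from Amdahl's law: with a serial fraction s and n cores, the speedup is 1 / (s + (1 - s) / n). A small sketch (the function name is made up for this example):

```python
def amdahl_speedup(serial_fraction, cores):
    """Upper bound on speedup when only the non-serial part is parallelized."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# With a 50% serial part, adding cores approaches but never reaches 2x.
for cores in (2, 16, 1024):
    print(cores, round(amdahl_speedup(0.5, cores), 3))
```

As the core count grows, the second term goes to zero and the speedup converges to 1 / s, which is 2 for s = 0.5.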
Use clever techniques to make it faster
Comparison with htop (lld and mold)
Scale of data
Here is a list of elements in input files and
[Table: Data type, # of items]
Scales well because there are usually no or very few dependencies between
threads
● It's usually implemented using sophisticated data structures and scales well
with the number of threads.
● mold uses Intel TBB's tbb::concurrent_hash_map
https://github.com/oneapi-src/oneTBB
Symbol resolution in mold
Run a parallel for loop on the input files and add symbols to the parallel hash map at
the same time.
[Figure: Concurrent hashmap. Maps symbol strings to symbol objects.]
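The shape of this pass can be sketched with a lock-per-shard map standing in for tbb::concurrent_hash_map. This is only an illustration of the structure: the ShardedMap class, the rule that the lowest file index wins, and the input format are assumptions, not mold's actual symbol resolution rules. Python threads also won't give the real speedup here.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ShardedMap:
    """A simple concurrent map: each shard has its own lock (hypothetical helper)."""

    def __init__(self, num_shards=16):
        self._shards = [({}, threading.Lock()) for _ in range(num_shards)]

    def insert_if_better(self, key, value, is_better):
        # Only the shard that owns `key` is locked, so threads working on
        # different shards do not block each other.
        table, lock = self._shards[hash(key) % len(self._shards)]
        with lock:
            if key not in table or is_better(value, table[key]):
                table[key] = value

    def to_dict(self):
        merged = {}
        for table, _ in self._shards:
            merged.update(table)
        return merged


def resolve_symbols(files):
    """Map each symbol name to the index of the file chosen to define it.

    `files` is a list of lists of defined symbol names. Assumption for this
    sketch: on a conflict, the file with the lowest index wins.
    """
    symtab = ShardedMap()

    def add_file(item):
        index, names = item
        for name in names:
            symtab.insert_if_better(name, index, lambda new, old: new < old)

    # One task per input file; all tasks insert into the shared map.
    with ThreadPoolExecutor() as pool:
        list(pool.map(add_file, enumerate(files)))
    return symtab.to_dict()


print(resolve_symbols([["main", "foo"], ["foo", "bar"], ["bar", "baz"]]))
```

Because the winner depends only on the file index and not on which thread gets there first, the result is the same on every run.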
Computing a cryptographic hash for a file is time-consuming, so we split the computation
into two stages:
1. Consider the output file as being made up of 10 MiB records and compute
SHA256 for each record. This step can be done in parallel.
2. Compute a SHA256 of the per-record SHA256 hashes
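A minimal sketch of the two stages, using hashlib and a thread pool. The function name and the use of ThreadPoolExecutor are choices made for this example. The 10 MiB record size comes from the text above and is exposed as a parameter.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def tree_sha256(data, record_size=10 * 1024 * 1024):
    """Hash `data` in two stages: per-record SHA256, then SHA256 of the digests."""
    view = memoryview(data)  # slice without copying the underlying bytes
    records = [view[i:i + record_size] for i in range(0, len(view), record_size)]

    # Stage 1: hash each record independently, so the work can run in parallel.
    with ThreadPoolExecutor() as pool:
        digests = list(pool.map(lambda r: hashlib.sha256(r).digest(), records))

    # Stage 2: hash the concatenation of the per-record digests.
    return hashlib.sha256(b"".join(digests)).hexdigest()


print(tree_sha256(b"x" * 1000, record_size=256))
```

Note that the result is not the same value as a plain SHA256 over the whole file. It is a different, but still content-dependent, hash.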
Miscellaneous speed-up techniques
glibc's default malloc doesn't scale well for a large number of cores
It is faster to overwrite a file already in the buffer cache than to create a new file and write data to it
If a large number of files are mmap'ed, it takes several hundred milliseconds to terminate the process
● As a workaround, fork a child process and do the actual processing in that process, and exit from the
parent process as soon as the child process outputs a file
● Exiting from the child process still takes several hundred milliseconds, but that is not a problem
because it is not an interactive process
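The fork workaround can be sketched as follows (POSIX only). This is an illustration under assumptions, not mold's code: the function name, the use of a pipe to signal completion, and the work callback are all hypothetical.

```python
import os


def run_in_child(work, output_path):
    """Run work(output_path) in a forked child; return once the output is written.

    Returns True if the child reported success. The caller does not wait for
    the child to finish exiting.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: do the actual processing, then notify the parent.
        status = 1
        try:
            os.close(read_fd)
            work(output_path)
            os.write(write_fd, b"1")
            status = 0
        finally:
            # Any slow teardown happens here, after the parent was notified.
            # os._exit guarantees the child never returns into the caller's code.
            os._exit(status)
    # Parent: block only until the child says the output file is complete.
    os.close(write_fd)
    done = os.read(read_fd, 1)
    os.close(read_fd)
    return done == b"1"
```

If the child dies before writing to the pipe, the read returns an empty byte string and the function returns False, so the parent can report the failure.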
Hints on writing faster programs
Don't guess, measure.
● Speculations are usually wrong, so don't waste time optimizing code that doesn't matter.
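One simple way to measure rather than guess is to time the candidate code directly. The helper below is a hypothetical sketch. A real investigation would more likely use a profiler.

```python
import time


def time_it(fn, *args, repeat=5):
    """Return the best wall-clock time in seconds over `repeat` runs of fn(*args)."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


# Compare two ways of doing the same job before deciding which one to optimize.
data = list(range(100_000))
print("sorted():", time_it(sorted, data))
print("sum():   ", time_it(sum, data))
```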
Don't try to write faster code. Rather, design your data structures in such a way that the program naturally
becomes faster.
● Learn from the first implementation, then use that knowledge to reimplement
● There are a few types of optimizations that you cannot implement without re-doing everything from scratch
● Developing it a second or third time is faster, so your time on the first iteration won't be completely wasted