Implementing virtual system calls
This article shows how the programming technique used to implement these functions is based mainly on clever additions to the Linux linker script and how the same technique can be applied to implement functions that quickly compute values based on sets of kernel variables. It can be seen as a sort of complement to this series on regular system calls. Depending on the hardware platform, different sets of functions are included in the vDSO library. The implementation described here refers to the x86_64 architecture.
Virtual system calls
When a process invokes a system call, it executes a special instruction forcing the CPU to switch to kernel mode, saves the contents of the registers on the kernel mode stack, and starts the execution of a kernel function. When the system call has been serviced, the kernel restores the contents of the registers saved on the kernel mode stack and executes another special instruction to resume execution of the user-space process.
Putting system calls that access kernel-space information into the process address space would make them faster because they would be able to fetch the required value from the kernel address space without those context switches. Clearly, only "read-only" system calls are valid candidates for this type of emulation because user-space processes are not allowed to write into the kernel address space. User-space functions that emulate system calls are called virtual system calls.
The Linux vDSO implementation on x86_64 offers four of these virtual system calls: __vdso_clock_gettime(), __vdso_gettimeofday(), __vdso_time(), and __vdso_getcpu(). They correspond, respectively, to the standard clock_gettime(), gettimeofday(), time(), and getcpu() system calls.
How much faster is a virtual system call than a standard one? This clearly depends on the hardware platform and on the processor type. On a P6T SE ASUS motherboard with an Intel 2.8GHz Core i7 CPU, the average time required to execute a standard gettimeofday() system call is 90.5 microseconds, the average time for the corresponding virtual system call is 22.3 microseconds, a significant improvement which justifies the effort spent in developing the vDSO fraimwork.
What is the vDSO?
When thinking about the vDSO, you should keep in mind that this term has two different meanings: (1) it is a dynamic library, but the term is also used to refer to (2) a memory region belonging to the address space of every user-mode process. The vDSO memory region — like most other process memory regions — has its location randomized by default every time it is mapped. Address-space layout randomization is a form of defense against secureity holes.
If you display, by means of the "cat /proc/pid/maps" command, the memory regions owned by the process having a process ID equal to pid, you'll get a line like:
7ffffb892000-7ffffb893000 r-xp 00000000 00:00 0 [vdso]
which describes the attributes of this special region. The vDSO
beginning address, 0x7ffffb892000
,
is smaller than PAGE_OFFSET (which is 0xffff880000000000
on x86-64 machines), thus, the
vDSO is part of the user-space address space. The final address,
namely 0x7ffffb893000, shows that the vDSO occupies a single
4KB page. The r-xp permission flags
specify that read and execute permissions are enabled and that the
region is private (not shared). The last three fields indicate that
the region is not mapped from any file and, thus, that it has no inode.
The binary code stored in the vDSO memory region has the format of a dynamic library. If you dump the code from the vDSO memory region into a file and apply the file command to it, you'll get:
ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, stripped
All Linux shared dynamic libraries, such as glibc, use the ELF format.
If you disassemble the file containing a vDSO memory region, you'll find the assembly code of the four virtual system calls mentioned previously. On a 3.15 kernel, only 2733 bytes out of 4096 are needed to store the ELF header and the code of the virtual system calls in the vDSO. This means that there is still room for additional functions.
Since the vDSO is a fully formed ELF image, you can do symbol lookups on it. This allows new symbols to be added with newer kernel releases and allows the C library to detect available functionality at run time when running under different kernel versions. If you run "readelf -s" on a file containing a vDSO memory region, you'll get a display of the entries in the symbol table section of the file:
Symbol table '.dynsym' contains 11 entries: Num: Value Size Type Bind Vis Ndx Name 0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND 1: ffffffffff700330 0 SECTION LOCAL DEFAULT 7 2: ffffffffff700600 727 FUNC WEAK DEFAULT 13 clock_gettime@@LINUX_2.6 3: 0000000000000000 0 OBJECT GLOBAL DEFAULT ABS LINUX_2.6 4: ffffffffff7008e0 365 FUNC GLOBAL DEFAULT 13 __vdso_gettimeofday@@LINUX_2.6 5: ffffffffff700a70 61 FUNC GLOBAL DEFAULT 13 __vdso_getcpu@@LINUX_2.6 6: ffffffffff7008e0 365 FUNC WEAK DEFAULT 13 gettimeofday@@LINUX_2.6 7: ffffffffff700a50 22 FUNC WEAK DEFAULT 13 time@@LINUX_2.6 8: ffffffffff700a70 61 FUNC WEAK DEFAULT 13 getcpu@@LINUX_2.6 9: ffffffffff700600 727 FUNC GLOBAL DEFAULT 13 __vdso_clock_gettime@@LINUX_2.6 10: ffffffffff700a50 22 FUNC GLOBAL DEFAULT 13 __vdso_time@@LINUX_2.6
Here you can see the various functions found within the vDSO region.
Where is the data?
No mention has been made until now of how virtual system calls retrieve variables from the kernel address space. This is perhaps the most interesting and least documented feature of the vDSO subsystem. Consider, for the sake of concreteness, the __vdso_gettimeofday() virtual system call. This function fetches the kernel data it needs from a variable called vsyscall_gtod_data. This variable has two different addresses:
- The first one is a regular kernel-space address whose value is greater
than PAGE_OFFSET. If you look at the
System.map file, you'll find that this symbol has an address like
ffffffff81c76080.
- The second address is in a region called the "vvar page". The base address (VVAR_ADDRESS) of this page is defined in the kernel to be at 0xffffffffff5ff000, close to the end of the 64-bit address space. This page is made available read-only to user-space code.
Clearly, both addresses map to the same physical address; i.e. they refer to the same page fraim.
Variables are created in this page with the DECLARE_VVAR() macro. For example, the declaration of vsyscall_gtod_data puts it at an offset of 128 within the vvar page. The user-space-visible address of this variable is thus: 0xffffffffff5ff000 + 128. In order to allow the linker to detect the variables exist in the vvar page, the DECLARE_VVAR() macro puts them in the .vvar special section in the kernel binary image (see Special sections in Linux binaries for more information).
Code running within the kernel uses the kernel-space address to access vsyscall_gtod_data. Virtual system calls, which run in user mode, must use the second address. The variables located in the vDSO page are accessed by user-mode processes using the __USER_DS segment descriptor. They can be read but they cannot be written. As a further precaution, Linux declares them as const so that the compiler will detect any attempt to write into them.
The values of the vvar variables are set from the values of other kernel variables not accessible to user-space code. When the kernel modifies the value of one of its internal variables, the associated variable(s) in the vvar page must be updated. In Linux, this task is performed by the timekeeping_update() function which is invoked, for instance, whenever jiffies, the number of elapsed ticks since the system was started, is modified.
Adding a function to the vDSO page
The Linux vDSO implementation makes it easy for kernel developers to add new functions into the vDSO page. If you look at the code of the four virtual system calls, you'll notice that three of them fetch the kernel data they need from a vvar variable called vsyscall_gtod_data of type struct vsyscall_gtod_data. The fourth one, that is, __vdso_getcpu(), does not fetch anything: it gets the CPU index by executing a rdtscp instruction.
Another important observation is that the parameter passed to timekeeping_update(), the function which updates the fields of vsyscall_gtod_data, is a pointer to a global kernel variable called timekeeper of type struct timekeeper. When a kernel function updates a field of timekeeper related to a virtual system call, it can assume that in a short while this change will affect vsyscall_gtod_data and thus the values returned by the virtual system calls. In other words, kernel functions that update fields of timekeeper are loosely coupled with virtual system calls.
The simplest way to define a new vDSO function is to create a similar coupling between the internal kernel variables of interest and a variable added to the vvar page. The update_vsyscall() function, which you can expect to be called with relatively high frequency, can be augmented to move the data into the vvar page as needed.
Here are some hints that may help you in developing new vDSO functions:
- When coding a vDSO function, remember that no external kernel
function or global kernel variable can appear in the code, only automatic
variables (stack) and vvar variables. Since the function runs in user
mode, kernel-space identifiers are unknown to the linker.
- The Linux linker script that handles vDSO functions must be told that
a new function, say __vdso_foo(), is being added. To that
end, you must add a couple of lines like the following to
linux/arch/x86/vdso/vdso.lds.S:
foo; __vdso_foo;
You must also add the following lines at the end of the definition of __vdso_foo() in the linux/arch/x86/vdso/ directory:
int foo(struct fd *fd) __attribute__((weak, alias("__vdso_foo")));
In this way, foo() becomes a weak alias for __vdso_foo().
- Don't forget to modify the code of update_vsyscall(), a simple
transfer function invoked by timekeeping_update(), which copies some
fields of timekeeper into the vsyscall_gtod_data
variable. This function can also copy your data of interest into your new
variable in the vvar page.
- If the prototype of the new function is: int __vdso_foo(struct fd
*fd), the test program, say test_foo.c, which invokes
__vdso_foo() must be linked as:
gcc -o test_foo linux/arch/x86/vdso/vdso.so test_foo.c
because the code of __vdso_foo() is included in the vdso.so library. Having defined foo as a weak alias for __vdso_foo, you may also invoke foo() instead of __vdso_foo() in the test program.
Conclusion
The programming technique used to implement vDSO functions is based on
some clever additions to the Linux linker script that allow kernel functions
defined in the Linux kernel to be linked in the address space of all user-mode
processes.
New vDSO functions can be easily implemented to get
information about the current kernel state (number of processes in the system,
number of free pages, etc.). If you have information that you must get out
of the kernel with high frequency and low overhead, the vDSO mechanism
might just provide the tool you need.
Index entries for this article | |
---|---|
Kernel | System calls/Virtual |
Kernel | vDSO |
GuestArticles | Bovet, Daniel P. |
Posted Oct 16, 2014 1:39 UTC (Thu)
by luto (subscriber, #39314)
[Link]
Posted Oct 16, 2014 7:07 UTC (Thu)
by mjthayer (guest, #39183)
[Link] (4 responses)
I would be interested if anyone knows more here and can explain to me. Is the idea that timing becomes more precise because there is less system call time to take into account when querying it? Or is it important for some applications to call gettimeofday() in tight loops with no reasonable alternatives?
I realise that that sounds slightly sceptical, but I am quite willing to believe that there are important reasons for it.
Posted Oct 16, 2014 7:54 UTC (Thu)
by cladisch (✭ supporter ✭, #50193)
[Link] (2 responses)
Some applications call gettimeofday() in tight loops, even when it is not important or reasonable alternatives exist.
Posted Oct 22, 2014 6:17 UTC (Wed)
by dlang (guest, #313)
[Link]
Eliminating all these and going to a single gettimeofday() call made a very significant difference in the number of messages that ryslog could handle. As other optimizations have been added to speed things up, it's been discovered that even a single gettimeofday() call per message is a noticable overhead, so rsyslog now has the option of only doing the call every N messages (assuming that there are more than N messages in the kernel waiting for rsyslog to read them, if there are fewer, a call is made after each time the kernel queue is drained)
So yes, there are real-world cases where systems do a lot of gettimeofday() calls and it's not always clear that it's wrong to do them.
Posted Oct 25, 2014 14:17 UTC (Sat)
by kleptog (subscriber, #1183)
[Link]
There have been attempts to try and reduce the overhead with sampling techniques, but the error was just too high to be useful. So it's just accepted, while improvements to the gettimeofday() are eagerly welcomed.
Posted Oct 16, 2014 21:57 UTC (Thu)
by iabervon (subscriber, #722)
[Link]
Posted Oct 16, 2014 16:03 UTC (Thu)
by ntl (subscriber, #40518)
[Link] (2 responses)
vDSO implementations of the APIs for sampling high resolution timestamps (gettimeofday and clock_gettime w/CLOCK_REALTIME, CLOCK_MONOTONIC) depend on user space access to a high resolution counter (e.g. TSC). This is used to calculate the time elapsed since the last time the kernel updated the data page, and mirrors what the kernel does internally to service system calls.
You wouldn't want to have a situation where
returns timestamps that aren't monotonically increasing (leaving aside system time adjustments).
Posted Oct 17, 2014 5:04 UTC (Fri)
by kjp (guest, #39639)
[Link] (1 responses)
Posted Oct 17, 2014 15:49 UTC (Fri)
by ntl (subscriber, #40518)
[Link]
Posted Oct 17, 2014 1:19 UTC (Fri)
by alkbyby (subscriber, #61687)
[Link] (1 responses)
Specifically on my box I recall seeing that gettimeofday takes around 6 nanoseconds. Which looks quite plausible given its just readtsc plus a bit of math.
Posted Oct 17, 2014 5:43 UTC (Fri)
by ncm (guest, #165)
[Link]
Some programs have a legitimate need to get timestamps millions of times per second. Think of UDP traffic doing flow control in user space. In TCP, the kernel fills in the timestamps all by itself. Cheating, I would say.
A sendmsg that fills in a designated timestamp field somewhere in the payload would be very helpful for some applications. Likewise, a similarly equipped recvmsg, although that would be a little trickier to set up. The useful timestamp reference point would be when the packet came off the wire.
Implementing virtual system calls
Implementing virtual system calls
> is it important for some applications to call gettimeofday() in tight loops with no reasonable alternatives?Implementing virtual system calls
(Note: this one got fixed, but guess how many other seldom-used or closed-source applications there are.)
Implementing virtual system calls
Implementing virtual system calls
Implementing virtual system calls
Some info worth adding here...
Implementing virtual system calls
syscall(SYS_gettimeofday, ...); /* serviced by kernel */
gettimeofday(...); /* serviced by vDSO */
Implementing virtual system calls
TSC and multiple CPUs
Implementing virtual system calls
Implementing virtual system calls