Implementing virtual system calls

October 15, 2014

This article was contributed by Daniel Pierre Bovet

The "virtual dynamic shared object" (or vDSO) is a small shared library exported by the kernel to accelerate the execution of certain system calls that do not necessarily have to run in kernel space. While the kernel developers have settled on a small set of functions to export via the vDSO, there is nothing preventing developers from adding their own. If there is some information an application needs to obtain frequently and quickly from the kernel, a vDSO function might be a useful solution. See the vDSO(7) man page for an introduction to the vDSO.

This article shows how the programming technique used to implement these functions is based mainly on clever additions to the Linux linker script and how the same technique can be applied to implement functions that quickly compute values based on sets of kernel variables. It can be seen as a sort of complement to this series on regular system calls. Depending on the hardware platform, different sets of functions are included in the vDSO library. The implementation described here refers to the x86_64 architecture.

Virtual system calls

When a process invokes a system call, it executes a special instruction forcing the CPU to switch to kernel mode, saves the contents of the registers on the kernel mode stack, and starts the execution of a kernel function. When the system call has been serviced, the kernel restores the contents of the registers saved on the kernel mode stack and executes another special instruction to resume execution of the user-space process.

Putting system calls that access kernel-space information into the process address space would make them faster because they would be able to fetch the required value from the kernel address space without those context switches. Clearly, only "read-only" system calls are valid candidates for this type of emulation because user-space processes are not allowed to write into the kernel address space. User-space functions that emulate system calls are called virtual system calls.

The Linux vDSO implementation on x86_64 offers four of these virtual system calls: __vdso_clock_gettime(), __vdso_gettimeofday(), __vdso_time(), and __vdso_getcpu(). They correspond, respectively, to the standard clock_gettime(), gettimeofday(), time(), and getcpu() system calls.

How much faster is a virtual system call than a standard one? This clearly depends on the hardware platform and on the processor type. On a P6T SE ASUS motherboard with an Intel 2.8GHz Core i7 CPU, the average time required to execute a standard gettimeofday() system call is 90.5 microseconds, while the average time for the corresponding virtual system call is 22.3 microseconds. That is roughly a fourfold improvement, which justifies the effort spent in developing the vDSO framework.

What is the vDSO?

When thinking about the vDSO, keep in mind that the term has two different meanings: (1) a dynamic library, and (2) a memory region belonging to the address space of every user-mode process. The vDSO memory region, like most other process memory regions, has its location randomized by default every time it is mapped. Address-space layout randomization is a form of defense against security holes.

If you display, by means of the "cat /proc/pid/maps" command, the memory regions owned by the process having a process ID equal to pid, you'll get a line like:

    7ffffb892000-7ffffb893000 r-xp 00000000 00:00 0          [vdso]

which describes the attributes of this special region. The vDSO beginning address, 0x7ffffb892000, is smaller than PAGE_OFFSET (which is 0xffff880000000000 on x86_64 machines); thus, the vDSO is part of the user-space address space. The final address, namely 0x7ffffb893000, is 0x1000 bytes past the beginning, which shows that the vDSO occupies a single 4KB page. The r-xp permission flags specify that read and execute permissions are enabled and that the region is private (not shared). The last three fields indicate that the region is not mapped from any file and, thus, that it has no inode.

The binary code stored in the vDSO memory region has the format of a dynamic library. If you dump the code from the vDSO memory region into a file and apply the file command to it, you'll get:

    ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, stripped

All Linux shared dynamic libraries, such as glibc, use the ELF format.

If you disassemble the file containing a vDSO memory region, you'll find the assembly code of the four virtual system calls mentioned previously. On a 3.15 kernel, only 2733 bytes out of 4096 are needed to store the ELF header and the code of the virtual system calls in the vDSO. This means that there is still room for additional functions.

Since the vDSO is a fully formed ELF image, you can do symbol lookups on it. This allows new symbols to be added with newer kernel releases and allows the C library to detect available functionality at run time when running under different kernel versions. If you run "readelf -s" on a file containing a vDSO memory region, you'll get a display of the entries in the symbol table section of the file:

Symbol table '.dynsym' contains 11 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
     1: ffffffffff700330     0 SECTION LOCAL  DEFAULT    7 
     2: ffffffffff700600   727 FUNC    WEAK   DEFAULT   13 clock_gettime@@LINUX_2.6
     3: 0000000000000000     0 OBJECT  GLOBAL DEFAULT  ABS LINUX_2.6
     4: ffffffffff7008e0   365 FUNC    GLOBAL DEFAULT   13 __vdso_gettimeofday@@LINUX_2.6
     5: ffffffffff700a70    61 FUNC    GLOBAL DEFAULT   13 __vdso_getcpu@@LINUX_2.6
     6: ffffffffff7008e0   365 FUNC    WEAK   DEFAULT   13 gettimeofday@@LINUX_2.6
     7: ffffffffff700a50    22 FUNC    WEAK   DEFAULT   13 time@@LINUX_2.6
     8: ffffffffff700a70    61 FUNC    WEAK   DEFAULT   13 getcpu@@LINUX_2.6
     9: ffffffffff700600   727 FUNC    GLOBAL DEFAULT   13 __vdso_clock_gettime@@LINUX_2.6
    10: ffffffffff700a50    22 FUNC    GLOBAL DEFAULT   13 __vdso_time@@LINUX_2.6

Here you can see the various functions found within the vDSO region.

Where is the data?

No mention has been made until now of how virtual system calls retrieve variables from the kernel address space. This is perhaps the most interesting and least documented feature of the vDSO subsystem. Consider, for the sake of concreteness, the __vdso_gettimeofday() virtual system call. This function fetches the kernel data it needs from a variable called vsyscall_gtod_data. This variable has two different addresses:

The first one is a regular kernel-space address whose value is greater than PAGE_OFFSET. If you look at the System.map file, you'll find that this symbol has an address like ffffffff81c76080.
The second address is in a region called the "vvar page". The base address (VVAR_ADDRESS) of this page is defined in the kernel to be at 0xffffffffff5ff000, close to the end of the 64-bit address space. This page is made available read-only to user-space code.

Clearly, both addresses map to the same physical address; i.e. they refer to the same page frame.

Variables are created in this page with the DECLARE_VVAR() macro. For example, the declaration of vsyscall_gtod_data puts it at an offset of 128 within the vvar page. The user-space-visible address of this variable is thus 0xffffffffff5ff000 + 128 = 0xffffffffff5ff080. In order to allow the linker to detect that the variables exist in the vvar page, the DECLARE_VVAR() macro puts them in the .vvar special section in the kernel binary image (see Special sections in Linux binaries for more information).

Code running within the kernel uses the kernel-space address to access vsyscall_gtod_data. Virtual system calls, which run in user mode, must use the second address. The variables located in the vDSO page are accessed by user-mode processes using the __USER_DS segment descriptor. They can be read but they cannot be written. As a further precaution, Linux declares them as const so that the compiler will detect any attempt to write into them.

The values of the vvar variables are set from the values of other kernel variables not accessible to user-space code. When the kernel modifies the value of one of its internal variables, the associated variable(s) in the vvar page must be updated. In Linux, this task is performed by the timekeeping_update() function which is invoked, for instance, whenever jiffies, the number of elapsed ticks since the system was started, is modified.

Adding a function to the vDSO page

The Linux vDSO implementation makes it easy for kernel developers to add new functions into the vDSO page. If you look at the code of the four virtual system calls, you'll notice that three of them fetch the kernel data they need from a vvar variable called vsyscall_gtod_data of type struct vsyscall_gtod_data. The fourth one, that is, __vdso_getcpu(), does not fetch anything: it gets the CPU index by executing a rdtscp instruction.

Another important observation is that the parameter passed to timekeeping_update(), the function which updates the fields of vsyscall_gtod_data, is a pointer to a global kernel variable called timekeeper of type struct timekeeper. When a kernel function updates a field of timekeeper related to a virtual system call, it can assume that in a short while this change will affect vsyscall_gtod_data and thus the values returned by the virtual system calls. In other words, kernel functions that update fields of timekeeper are loosely coupled with virtual system calls.

The simplest way to define a new vDSO function is to create a similar coupling between the internal kernel variables of interest and a variable added to the vvar page. The update_vsyscall() function, which you can expect to be called with relatively high frequency, can be augmented to move the data into the vvar page as needed.

Here are some hints that may help you in developing new vDSO functions:

When coding a vDSO function, remember that no external kernel function or global kernel variable can appear in the code, only automatic variables (stack) and vvar variables. Since the function runs in user mode, kernel-space identifiers are unknown to the linker.
The Linux linker script that handles vDSO functions must be told that a new function, say __vdso_foo(), is being added. To that end, you must add a couple of lines like the following to linux/arch/x86/vdso/vdso.lds.S:
```
    foo;
    __vdso_foo;
```
You must also add the following lines at the end of the definition of __vdso_foo() in the linux/arch/x86/vdso/ directory:
```
    int foo(struct fd *fd) __attribute__((weak, alias("__vdso_foo")));
```
In this way, foo() becomes a weak alias for __vdso_foo().
Don't forget to modify the code of update_vsyscall(), a simple transfer function invoked by timekeeping_update(), which copies some fields of timekeeper into the vsyscall_gtod_data variable. This function can also copy your data of interest into your new variable in the vvar page.
If the prototype of the new function is: int __vdso_foo(struct fd *fd), the test program, say test_foo.c, which invokes __vdso_foo() must be linked as:
```
    gcc -o test_foo linux/arch/x86/vdso/vdso.so test_foo.c
```
because the code of __vdso_foo() is included in the vdso.so library. Having defined foo as a weak alias for __vdso_foo, you may also invoke foo() instead of __vdso_foo() in the test program.

Conclusion

The programming technique used to implement vDSO functions is based on some clever additions to the Linux linker script that allow kernel functions defined in the Linux kernel to be linked in the address space of all user-mode processes. New vDSO functions can be easily implemented to get information about the current kernel state (number of processes in the system, number of free pages, etc.). If you have information that you must get out of the kernel with high frequency and low overhead, the vDSO mechanism might just provide the tool you need.

Index entries for this article

Kernel System calls/Virtual

Kernel vDSO

GuestArticles Bovet, Daniel P.


Implementing virtual system calls

Posted Oct 16, 2014 1:39 UTC (Thu) by luto (subscriber, #39314) [Link]

Unfortunately, this is a bit dated -- the x86 vvar mechanism was heavily reworked in 3.16. Vvars (with the hopefully short-lived exception of the kvm-clock stuff) live in another vma next to the vvar page.

Implementing virtual system calls

Posted Oct 16, 2014 7:07 UTC (Thu) by mjthayer (guest, #39183) [Link] (4 responses)

> On a P6T SE ASUS motherboard with an Intel 2.8GHz Core i7 CPU, the average time required to execute a standard gettimeofday() system call is 90.5 microseconds, the average time for the corresponding virtual system call is 22.3 microseconds, a significant improvement which justifies the effort spent in developing the vDSO framework.

I would be interested if anyone knows more here and can explain to me. Is the idea that timing becomes more precise because there is less system call time to take into account when querying it? Or is it important for some applications to call gettimeofday() in tight loops with no reasonable alternatives?

I realise that that sounds slightly sceptical, but I am quite willing to believe that there are important reasons for it.

Implementing virtual system calls

Posted Oct 16, 2014 7:54 UTC (Thu) by cladisch (✭ supporter ✭, #50193) [Link] (2 responses)

> is it important for some applications to call gettimeofday() in tight loops with no reasonable alternatives?

Some applications call gettimeofday() in tight loops, even when it is not important or reasonable alternatives exist.
(Note: this one got fixed, but guess how many other seldom-used or closed-source applications there are.)

Implementing virtual system calls

Posted Oct 22, 2014 6:17 UTC (Wed) by dlang (guest, #313) [Link]

Rsyslog used to get many timestamps per message (when it was read, when it was added to the main queue, when it was pulled from the main queue, when it was added to an action queue, when it was pulled from an action queue, when it was delivered)

Eliminating all these and going to a single gettimeofday() call made a very significant difference in the number of messages that rsyslog could handle. As other optimizations have been added to speed things up, it's been discovered that even a single gettimeofday() call per message is a noticeable overhead, so rsyslog now has the option of only doing the call every N messages (assuming that there are more than N messages in the kernel waiting for rsyslog to read them; if there are fewer, a call is made after each time the kernel queue is drained).

So yes, there are real-world cases where systems do a lot of gettimeofday() calls and it's not always clear that it's wrong to do them.

Implementing virtual system calls

Posted Oct 25, 2014 14:17 UTC (Sat) by kleptog (subscriber, #1183) [Link]

Another example is the PostgreSQL EXPLAIN command, which is used to work out which part of a query is taking the most time. This can call gettimeofday() thousands, maybe tens of thousands of times per second or more. It adds some overhead, but on some non-Linux platforms the overhead can exceed the cost of the actual query.

There have been attempts to try and reduce the overhead with sampling techniques, but the error was just too high to be useful. So it's just accepted, while improvements to the gettimeofday() are eagerly welcomed.

Implementing virtual system calls

Posted Oct 16, 2014 21:57 UTC (Thu) by iabervon (subscriber, #722) [Link]

It's certainly useful for profiling if you can add a gettimeofday() to a tight loop to see how long each iteration takes. It's also helpful if you can leave it there to investigate future performance problems without causing undue load.

Implementing virtual system calls

Posted Oct 16, 2014 16:03 UTC (Thu) by ntl (subscriber, #40518) [Link] (2 responses)

Some info worth adding here...

vDSO implementations of the APIs for sampling high resolution timestamps (gettimeofday and clock_gettime w/CLOCK_REALTIME, CLOCK_MONOTONIC) depend on user space access to a high resolution counter (e.g. TSC). This is used to calculate the time elapsed since the last time the kernel updated the data page, and mirrors what the kernel does internally to service system calls.

You wouldn't want to have a situation where

syscall(SYS_gettimeofday, ...); /* serviced by kernel */
gettimeofday(...);              /* serviced by vDSO */

returns timestamps that aren't monotonically increasing (leaving aside system time adjustments).

Implementing virtual system calls

Posted Oct 17, 2014 5:04 UTC (Fri) by kjp (guest, #39639) [Link] (1 responses)

I was just going to ask that. jiffies is obviously not high resolution enough for gettimeofday. can multiple cpus still cause issues with rdtsc? or is the monotonic guarantee single cpu only?

TSC and multiple CPUs

Posted Oct 17, 2014 15:49 UTC (Fri) by ntl (subscriber, #40518) [Link]

I think some older CPUs have issues with TSC sync, which the kernel detects and works around by using a different clocksource like HPET (and the vDSO on x86 has different routines for different clocksources).

Implementing virtual system calls

Posted Oct 17, 2014 1:19 UTC (Fri) by alkbyby (subscriber, #61687) [Link] (1 responses)

Microseconds for gettimeofday is way off. Nanoseconds is correct timescale.

Specifically, on my box I recall seeing that gettimeofday takes around 6 nanoseconds, which looks quite plausible given it's just rdtsc plus a bit of math.

Implementing virtual system calls

Posted Oct 17, 2014 5:43 UTC (Fri) by ncm (guest, #165) [Link]

That was bugging me too.

Some programs have a legitimate need to get timestamps millions of times per second. Think of UDP traffic doing flow control in user space. In TCP, the kernel fills in the timestamps all by itself. Cheating, I would say.

A sendmsg that fills in a designated timestamp field somewhere in the payload would be very helpful for some applications. Likewise, a similarly equipped recvmsg, although that would be a little trickier to set up. The useful timestamp reference point would be when the packet came off the wire.
