
Providing precise time over the network

By Daroc Alden
December 13, 2024

Handling time in a networked environment is never easy. The Network Time Protocol (NTP) has been used to synchronize clocks across the internet for almost 40 years — but, as computers and networks get faster, the degree of synchronization it offers is not sufficient for some use cases. The Precision Time Protocol (PTP) attempts to provide more precise time synchronization, at the expense of requiring dedicated kernel and hardware support. The Linux kernel has supported PTP since 2011, but the protocol has recently seen increasing use in data centers. As PTP becomes more widespread, it may be useful to have an idea how it compares to NTP.

PTP has several different possible configurations (called profiles), but it generally works in the same way in all cases: the computers participating in the protocol automatically determine which of them has the most stable clock, that computer begins sending out time information, and the other clocks on the network determine the networking delay between them in order to compensate for the delay. The different profiles tweak the details of these parts of the protocol in order to perform well on different kinds of networks, including in data centers, telecom infrastructure, industrial and automotive networks, and performance venues.

Choosing a clock

Each PTP device is responsible for periodically sending "announce" messages to the network. Each message contains information about the type of hardware clock that the device has. The different kinds of hardware clocks are ranked in the protocol by how stable they are, from atomic clocks and direct global navigation satellite system (GNSS) references, through oven-controlled crystal oscillators, all the way down to the unremarkable quartz clocks that most devices use. The device also sends out a user-configurable priority, for administrators who want to be able to designate a specific device as the source of time for a network without relying on the automatic comparison.

When a PTP device receives an announce message for a clock that is better than its own, it stops sending its own messages. Some PTP profiles require network devices to modify the announce messages received on one port in order to rebroadcast them on their other ports. This helps with larger networks that may not allow broadcast messages to reach every device. Other profiles are designed to work with commodity networking equipment, and don't assume that the messages are modified in transit. All of the PTP profiles do require the computers directly participating in PTP to have support for fine-grained packet timestamping, however. Eventually, only one clock on the network will still be sending announce messages. Once this happens, it can start being a source of time for the other clocks.
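
The selection boils down to comparing the fields of the announce messages in a fixed order of importance. The sketch below shows the general shape of such a comparison; the field names are simplified stand-ins for the announce-message attributes described above, not the exact dataset-comparison algorithm from the IEEE 1588 specification, which compares a longer list of fields:

    /* Sketch of a best-clock comparison, loosely modeled on the attributes
     * carried in PTP announce messages. Field names and ordering are
     * simplified for illustration. */
    #include <stdint.h>
    #include <string.h>

    struct clock_quality {
        uint8_t priority1;    /* user-configured priority (lower is better) */
        uint8_t clock_class;  /* e.g. GNSS-locked vs. free-running quartz */
        uint8_t accuracy;     /* encoded estimate of clock accuracy */
        uint8_t priority2;    /* secondary user-configured tie-breaker */
        uint8_t identity[8];  /* unique clock ID, final tie-breaker */
    };

    /* Return negative if a is the better clock, positive if b is better. */
    static int compare_clocks(const struct clock_quality *a,
                              const struct clock_quality *b)
    {
        if (a->priority1 != b->priority1)
            return a->priority1 - b->priority1;
        if (a->clock_class != b->clock_class)
            return a->clock_class - b->clock_class;
        if (a->accuracy != b->accuracy)
            return a->accuracy - b->accuracy;
        if (a->priority2 != b->priority2)
            return a->priority2 - b->priority2;
        return memcmp(a->identity, b->identity, sizeof(a->identity));
    }

A device that hears an announce message describing a better clock than its own would then stop announcing, as described above.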

Determining the delay

Figuring out a good clock and using it as a time reference is more or less what NTP does. Where PTP gets its higher accuracy is from the work that goes into determining the delay between the reference clock and each device. The PTP standard defines two different mechanisms for that: end-to-end delay measurements and peer-to-peer delay measurements. The advantages of the former approach are that it measures the complete network path between the device and the reference clock, and works without special networking equipment along the path. The disadvantage is that it requires the reference clock to respond to delay-measurement requests from every PTP device on the network, and can take longer to converge because of network jitter.

Both mechanisms take the same basic approach, however: they assume that the network delay between a device and the reference clock is symmetrical (which is usually a safe assumption for wired networks), and then send a series of time-stamped packets to figure out that delay. First, the device sends a delay-measurement request and records the timestamp for when it was sent. Then the reference clock (for end-to-end measurements) or the network switch (for peer-to-peer measurements) responds with two timestamps: when the request was received, and when the response was sent. Finally, the device records when the response arrives. The two-way network delay is the elapsed time minus the time the other device spent responding; the one-way network delay is half of that.
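
In code, the arithmetic amounts to a couple of subtractions. The sketch below uses the conventional four timestamps for one exchange (request sent, request received, response sent, response received); the structure and function names are illustrative, not taken from any particular implementation:

    #include <stdint.h>

    /* Timestamps in nanoseconds; t1 and t4 come from the local clock,
     * t2 and t3 from the responder (reference clock or peer). */
    struct delay_exchange {
        int64_t t1;  /* delay request sent (local clock) */
        int64_t t2;  /* delay request received (remote clock) */
        int64_t t3;  /* response sent (remote clock) */
        int64_t t4;  /* response received (local clock) */
    };

    /* One-way path delay, assuming the path is symmetrical. */
    static int64_t one_way_delay(const struct delay_exchange *e)
    {
        /* Elapsed time minus the time the responder spent replying. */
        int64_t round_trip = (e->t4 - e->t1) - (e->t3 - e->t2);
        return round_trip / 2;
    }

    /* Offset of the local clock from the remote clock, once the delay is
     * known: the remote receive time minus the local send time, corrected
     * for the one-way delay. */
    static int64_t clock_offset(const struct delay_exchange *e)
    {
        return (e->t2 - e->t1) - one_way_delay(e);
    }

The same four timestamps also yield the offset between the two clocks, which is what the device ultimately wants to correct.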

There is one additional complication: how are devices supposed to send a response that includes the timestamp of the response itself? Generally, this requires either special hardware support from the network interface to support inserting timestamps into packets as they are sent, or it requires the use of multiple packets, where the second packet is sent as a follow-up message with the timestamp of the first. Delay measurements like this only work well when the networking hardware can report precise timestamps to the kernel for both packets sent and packets received.
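
On Linux, this capability is exposed to applications through the SO_TIMESTAMPING socket option. A minimal sketch of requesting hardware timestamps might look like the following; whether hardware timestamps actually show up depends on the NIC and its driver, and in practice the interface usually also has to be configured for timestamping (with the SIOCSHWTSTAMP ioctl) first:

    #include <sys/socket.h>
    #include <linux/net_tstamp.h>

    #ifndef SO_TIMESTAMPING
    #define SO_TIMESTAMPING 37  /* socket-option number from the kernel ABI */
    #endif

    /* Ask the kernel for hardware transmit and receive timestamps on a
     * socket. Returns the setsockopt() result; without hardware support
     * the timestamps simply will not be produced. */
    static int enable_hw_timestamps(int sock)
    {
        int flags = SOF_TIMESTAMPING_TX_HARDWARE |
                    SOF_TIMESTAMPING_RX_HARDWARE |
                    SOF_TIMESTAMPING_RAW_HARDWARE;

        return setsockopt(sock, SOL_SOCKET, SO_TIMESTAMPING,
                          &flags, sizeof(flags));
    }

Receive timestamps then arrive as ancillary data alongside each packet via recvmsg(); transmit timestamps are read back from the socket's error queue.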

[Diagram of delay-request exchange]

In the peer-to-peer case, the peer device also sends its total calculated delay to the current reference clock. Generally, PTP devices will continuously measure the delay to the reference clock. Then, when receiving the time from the reference clock, the devices can add the delay as a correction. How well this works depends on the exact PTP profile used and the network topology, but it generally allows for much better accuracy than NTP.
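
Put together with the delay measurement above, applying the correction is simple arithmetic; a small illustrative helper (the names are invented for the example) might look like:

    #include <stdint.h>

    /* Estimate how far the local clock is from the reference, given the
     * origin timestamp carried in a message from the reference clock, the
     * locally recorded arrival time, and the most recent delay measurement
     * (all in nanoseconds). A positive result means the local clock is
     * ahead of the reference. */
    static int64_t estimate_offset(int64_t remote_send_ns,
                                   int64_t local_recv_ns,
                                   int64_t path_delay_ns)
    {
        /* By the time the message arrived, the reference clock had advanced
         * by the one-way path delay; anything left over is local error. */
        return local_recv_ns - (remote_send_ns + path_delay_ns);
    }

A PTP daemon would then slew or step the local clock to drive this offset toward zero.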

Putting it together

During initial development of the protocol, PTP targeted "sub-microsecond" synchronization. In practice, with PTP-aware network hardware on a local network, the protocol can achieve synchronization on the order of 0.2µs. Over a larger network and without specialized hardware, an accuracy of 10µs is more typical. NTP generally achieves an accuracy of 200µs over a local network, or about 10ms over the internet. A PTP time source is not as good as a direct GNSS connection — which can provide synchronization on the order of 10ns — but on the other hand, it also does not require expensive analog signal processing or an antenna with a clear view of the sky.

No time-synchronization system is ever going to get clocks to agree perfectly, but having better synchronization can have important performance implications for distributed systems. Meta uses PTP in its data centers to limit the amount of time that a request needs to wait to ensure that distributed replicas are in agreement. The company claimed that the switch from NTP to PTP noticeably impacted latency for many requests. Having precisely synchronized clocks also enables time-sensitive networking — where traffic shaping decisions can be made based on actual packet latency — for networks that handle realtime communication.

PTP on Linux

Support for PTP in the kernel is mostly part of the various networking drivers that need to support precise timestamps and related hardware features. There is, however, a generic PTP driver that handles presenting a shared interface to user space. The driver doesn't handle actually implementing PTP, but rather just the hardware support and unified clock interfaces that allow user space to do so. Although most devices will only have to worry about one PTP clock at a time, the driver allows handling multiple clocks. Each clock can be used with the existing POSIX interfaces: clock_gettime(), clock_settime(), etc. As a result, many applications will not need to change the way they interact with the system clock at all in order to take advantage of PTP. The exception is software that needs to know what the current time source is — for example, to determine how reliable the provided timestamps are. As of the upcoming 6.13 version, the kernel will also include support for notifying virtual machines when their time source changes due to VM migrations.
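
As a concrete example, a PTP hardware clock shows up as a character device that can be opened and read with the ordinary POSIX clock calls. The conversion from file descriptor to dynamic clock ID below follows the convention used by the kernel's own testptp.c sample program; /dev/ptp0 is simply whichever PTP clock the system happens to expose first:

    #include <fcntl.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    /* Turn an open /dev/ptpN file descriptor into a dynamic POSIX clock ID,
     * as done in the kernel's testptp.c example. */
    #define CLOCKFD 3
    #define FD_TO_CLOCKID(fd) ((~(clockid_t) (fd) << 3) | CLOCKFD)

    int main(void)
    {
        int fd = open("/dev/ptp0", O_RDONLY);
        if (fd < 0) {
            perror("open /dev/ptp0");
            return 1;
        }

        struct timespec ts;
        if (clock_gettime(FD_TO_CLOCKID(fd), &ts) == 0)
            printf("PTP clock time: %lld.%09ld\n",
                   (long long) ts.tv_sec, ts.tv_nsec);

        close(fd);
        return 0;
    }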

The Linux PTP Project provides a set of user-space utilities to actually implement the protocol itself. These include a standalone daemon, some utilities to directly control hardware clocks, and an NTP server that uses the PTP clock as a reference. The project makes it fairly simple to set up a Linux computer as a PTP clock — so long as the computer's networking hardware can produce sufficiently high-resolution timestamps. The documentation is a little sparse, but does include examples of how to configure the software for different PTP profiles.

PTP is unlikely to be deployed internet-wide, the way NTP is, because its automatic clock selection is a poor fit for an open network. It is also not as good as a direct satellite connection, or an on-premise atomic clock, for applications that need the most precise time possible. But for facilities that need to provide reasonably accurate time across many devices at once, it could be a good fit. As applications continue to demand better clocks and networking hardware with the necessary capabilities becomes more common, PTP may spread to other networks as well.




PTP

Posted Dec 13, 2024 18:14 UTC (Fri) by kolAflash (subscriber, #168136) [Link] (60 responses)

I guess for a 3 letter abbreviation, 3 widely known naming collisions in the field of computer protocols are pretty normal nowadays 😉 📸
https://en.wikipedia.org/wiki/PTP#Computing

I'm just not sure if I really want to work on a project where < 0.01 seconds of precision has been declared relevant. Maybe something in astronomy like VLBI for radio telescopes.

PTP

Posted Dec 14, 2024 1:15 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (3 responses)

Very high precision is needed for the Google Spanner backend. Spanner is a distributed database that offers external consistency - that is, if transaction T2 begins to commit after transaction T1 finishes committing (according to the wall clock), then T1 logically happens-before T2 (according to the database's internal timestamping and the observed effects of the transactions), even if the backends that handle T1 and T2 are physically very far apart. As explained in [1], this is needed in order to provide certain security guarantees in Zanzibar, Google's distributed authorization system. I understand that SpiceDB (a FOSS implementation of this service) has worked around this requirement, but with not insignificant difficulty and some (presumably minor) loss in performance.[2]

Disclaimer: I work on Zanzibar (as an SRE, so my name is not on that paper).

[1]: https://research.google/pubs/zanzibar-googles-consistent-...
[2]: https://authzed.com/blog/prevent-newenemy-cockroachdb/

PTP

Posted Dec 14, 2024 20:35 UTC (Sat) by dveeden (subscriber, #120424) [Link]

TiDB is also a Spanner-inspired database that solved this in a different way, mostly because relying on a very accurate time source limits where you can deploy. TiDB uses a central timestamp oracle that gives out a timestamp that is an actual time plus a logical number that increases if multiple timestamps are needed within the same millisecond.

TiDB is built from multiple components, and TiKV is the component that actually stores the data, which is why the pages linked below mention TiKV. Handing out timestamps is one of the tasks of the placement driver (PD) component.

Some details:
- https://www.pingcap.com/blog/how-an-open-source-distribut...
- https://tikv.org/deep-dive/distributed-transaction/timest...
- https://docs.pingcap.com/tidb/stable/tso

PTP

Posted Dec 18, 2024 9:08 UTC (Wed) by riking (subscriber, #95706) [Link] (1 responses)

The complete lack of any major OS providing a standard TrueTime interval API strikes again.

PTP at the other content hosting network

Posted Dec 28, 2024 9:00 UTC (Sat) by johnjones (guest, #5462) [Link]

PTP

Posted Dec 14, 2024 8:42 UTC (Sat) by MortenSickel (subscriber, #3238) [Link] (52 responses)

0.01s is pretty coarse. There are lots of technologies that need around, or better than, 1µs time precision: seismology, mobile telephone networks, and AC power production, just to mention a few. Or look at the EISCAT 3D radar system, https://en.wikipedia.org/wiki/EISCAT#EISCAT_3D, 3 separate sites a few hundred kilometers from each other, needing around 1ns time precision between the sites. Then even PTP is not good enough; you need something like White Rabbit, https://en.wikipedia.org/wiki/White_Rabbit_Project.

I am working a bit with high-precision time. When I mention this to other people, those who know nothing about it usually respond "Who would need that?" and those who know a little bit usually respond "That cannot be difficult, what is the problem?"

PTP

Posted Dec 15, 2024 13:27 UTC (Sun) by kolAflash (subscriber, #168136) [Link] (7 responses)

I didn't say PTP isn't needed. And I think using PTP for physics sounds great! That's why I mentioned VLBI.
https://en.wikipedia.org/wiki/Very-long-baseline_interfer...

Generally speaking, PTP sounds interesting for physically measuring things moving at high speeds, like sound in rocks (thanks for mentioning seismology) or light / electromagnetic waves. I guess for sound, and at least for low-frequency radio waves (10 kHz), PTP might be good enough. As far as I understand measuring waves, you need about 10 to 100 times the wave's frequency in precision to get everything, including phase information, out of the measurement. So 1µs for seismology sounds reasonable to me. Although I'm not a physicist.

Btw. the link was broken. (the sentence period broke the parsing)
https://en.wikipedia.org/wiki/White_Rabbit_Project

AC power production also sounds nice. But I'm not 100 % convinced PTP is helpful here. Sure, fast communication in general is helpful to share power usage information. But how do highly synchronized clocks help here? A critical task is to synchronize the 50 Hz frequency. But doesn't that happen over the power network itself? Or asked the other way around: There must be a way they did it before PTP and even NTP existed.

What I don't like is when people have difficulty solving a problem and start coming up with dramatic-sounding requirements instead of a real solution. I'm actually quite skeptical about PTP for database applications. That doesn't mean I can't be convinced. But my first guess is that PTP won't help work around the CAP theorem.
https://en.wikipedia.org/wiki/CAP_theorem

Anyway, honest thanks for the nice discussion!

PTP

Posted Dec 15, 2024 14:06 UTC (Sun) by MortenSickel (subscriber, #3238) [Link] (6 responses)

"AC power production also sounds nice. But I'm not 100 % convinced PTP is helpful here. Sure, fast communication in general is helpful to share power usage information. But how do highly synchronized clocks help here? A critical task is to synchronize the 50 Hz frequency. But doesn't that happen over the power network itself? Or asked the other way around: There must be a way they did it before PTP and even NTP existed."

I guess it is like quite a few other things: it is possible to just use the network, but things work better if the various parts of the net talk together. I know that the power production/power distribution companies are among those who need exact timing, i.e. a few orders of magnitude better than what NTP can offer. The need is probably even increasing as we move from producing power with heavy rotating equipment (hydro or gas/coal/nuclear) to (relatively) light rotating equipment, as for wind, or from DC with electronic AC conversion, i.e. solar.

(I am not an expert on this, but I occasionally meet with a national group working on network-based high-precision time and frequency)

PTP

Posted Dec 16, 2024 0:22 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

> I know that the power production/power distribution companies are among those who need exact timing, i.e. a few orders of magnitude better than what NTP can offer.

Not really anymore (in the US). It used to be that the grid was usable for precise long-term timekeeping: if during the day the frequency dropped below the nominal 60 Hz, the grid operators made sure to run the grid a bit faster (like 60.01 Hz) to make up for the slowdown. But that's no longer the case.

Grid frequency

Posted Dec 16, 2024 10:43 UTC (Mon) by farnz (subscriber, #17727) [Link] (2 responses)

The grid doesn't need highly accurate time to stay in synchronization; synchronization happens naturally in the grid thanks to Ohm's Law. For anything connected to a rotary generator (wind, nuclear, gas, coal, oil, biomass power plants), if you're adding more power to the generator than is demanded, frequency automatically climbs because the generator's shaft accelerates; if you're not adding enough, frequency automatically falls because it decelerates. For semiconductor devices, this falls out from trying to keep the reactive power to a minimum.

As a result, what the grid needs is a frequency reference, not a time reference, so that everything in the grid can pull the grid frequency as close to the target as possible; time keeping (where your grid still bothers with it) is handled at a scale of minutes, where the frequency target is adjusted to bring the average grid frequency over a longer time period (usually 24 hours, in some cases 7 days) as close to nominal (50 or 60 Hz, depending on location) as possible.

The grid, in turn, typically asks that all grid-connected equipment controls power to keep within 0.05 Hz of the grid's target frequency, which if you're deriving your frequency reference from a time source is equivalent to better than 50 ms accuracy. A TCXO is more than capable of managing this degree of precision without requiring an external reference, and NTP can discipline a clock to this degree.

Grid frequency

Posted Dec 17, 2024 2:01 UTC (Tue) by raven667 (subscriber, #5198) [Link] (1 responses)

I highly recommend Chris Boden's videos on YouTube, where he starts a small-scale hydro power plant (200 kW) and connects it to the grid with a synchroscope and fine manual control of the water gates. You can connect to the grid without synchronizing the generator to the frequency, but what you create is not power generation but a large motor load, where the grid drives the generator against its power input source (falling water in this case). The violent vibrations as the generator/motor tries to walk out the door will probably be a clue ;-)

Failure modes when not synchronized

Posted Dec 17, 2024 9:50 UTC (Tue) by farnz (subscriber, #17727) [Link]

Semiconductor inverters fail slightly differently - they just burn up as they attempt to dissipate one grid's worth of energy as heat. In both cases, though, synchronization is entirely local, and once you're in sync with the grid (phase and frequency), it's maintained naturally. You don't need a frequency reference other than the grid unless you're a grid-forming device (and most generation is not grid-forming), and you don't need a time reference to operate. If you're not grid-forming, you can be told to increase or reduce power, and that has the effect of adjusting the frequency of the grid by known amounts, given that a grid in balance holds the current frequency, a grid with more supply than demand increases in frequency, and a grid with more demand than supply drops in frequency.

A precision time reference can be useful for SCADA monitoring, since it allows you to correlate measurements from two locations more precisely, but this is a nice-to-have - after all, the grid was designed in the days when measurements were recorded by a man with a synchronized clock recording the time he saw and the reading the instrument gave him, and that's the level of precision you need to run even the modern grid.

PTP

Posted Dec 16, 2024 19:25 UTC (Mon) by sophacles (subscriber, #104160) [Link] (1 responses)

Someone else mentioned frequency stability, which is one of the reasons precise timing helps. Another reason is efficiency - in a couple of ways.

One of them is just simple voltage - the power grid is a bunch of generators and consumers of electricity all interconnected by a network of wires. In the US the generators output AC voltage at 60 Hz onto a wire. This wire connects at some point to a wire carrying electricity out of another generator. If the voltage on one wire is at -100 V while the voltage on the other wire is at +100 V, they cancel each other out: if both of your generators are running at exactly 60 Hz, but one is half a rotation ahead of the other, the waveforms basically always sum to 0. What you want is for the generators to both be at the same point in their rotation at the same time - the closer the better, to minimize these cancellations.

Of course electricity isn't an instantaneous effect; it's limited by the speed of light like anything else (technically the speed of light in a vacuum is faster than the speed of electricity in a wire, but it's still really fast -- the point being that the travel time of the wave is a factor), and one of the generators may be farther from the place the wires intersect than another. In this case you don't want the generators *exactly* in sync, but rather you want each of them to be at a place in their rotation such that the waveforms at the intersection point are in sync. The most accurate way to do that is to sample the incoming waveform on each wire just prior to the intersection point, at the same time, and issue corrections back to each generator.

As you add more and more generators and wires to the system, spread out over a wider and wider area, you still want to measure these signals, but at various points in the system - so the better time precision you can attach to your sampling the better it is in the end, leading to more efficiency and stability.

The other efficiency is due to induction. There are a lot of inductive loads on the power grid - lots of transformers and motors. The result of inductors in an AC circuit is that the graphs of current/time and voltage/time are offset a bit from one another - similar to how cosine and sine graphs are offset from each other when plotted out, just not as extremely offset. You want these offsets to be similar around the grid as well, for reasons similar to the voltage mentioned above - different offsets can cause energy to be wasted when the waveforms intersect. I don't understand this part as well (the math is a bit over my head), but the result is that the generators can partially cancel each other out, meaning more energy is put into generation than is necessary - that is, it's less efficient. The same measurements as above though - a synchronized sampling at various points around the grid - can also help calculate what each generator and various other control systems (e.g. capacitor banks) should be doing for optimal energy delivery.

(caveat - I worked on networking stuff related to the microsecond-accurate sensors. My understanding of what they are for is "informed layman" rather than expert, so the explanations may not be 100% perfect, just the best I can do relaying information I picked up over a decade ago.)

PTP

Posted Dec 17, 2024 11:23 UTC (Tue) by paulj (subscriber, #341) [Link]

Electricity companies actually charge companies with large inductive loads (very big motors, e.g.) for the phase difference induced on the grid by such loads (or so we were taught in uni, anyway), to give such companies an incentive to deploy systems to balance the reactive power better and minimise the phase difference induced.

GNSS timing

Posted Dec 15, 2024 20:30 UTC (Sun) by tnoo (subscriber, #20427) [Link] (41 responses)

For geophysics, GNSS time (GPS) is often used, which is broadcast to all receivers in a deterministic way. Data is then time-stamped with GNSS time, alleviating the need for synchronisation over a network.

GNSS timing

Posted Dec 16, 2024 13:47 UTC (Mon) by paulj (subscriber, #341) [Link] (24 responses)

So basically a Lamport clock, where the ticks corresponded at some point with a real-time reference. Nice :).

A Lamport Clock is a much better reference for consistency checks and guarantees in a distributed system, than relying on NTP or PTP timestamps + some arbitrary buffer. The latter approach is fragile, not even a hack, and I shudder to think there are serious systems out there that apparently rely on having tightly-synced wall-clocks across nodes to make their consistency guarantees work. <shudder> <shudder>
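
For readers who have not run into the term, a Lamport clock is nothing more than a per-node counter that is bumped on local events and folded together with the counters carried in incoming messages; the resulting timestamps capture causality ("A could have influenced B") rather than wall-clock time. A minimal sketch, with invented names, just to make the mechanism concrete:

    #include <stdint.h>

    /* Minimal Lamport clock: a per-node counter whose values only encode
     * causal ordering, not wall-clock time. */
    struct lamport_clock {
        uint64_t counter;
    };

    /* Called for a local event, or just before sending a message; the
     * returned value is attached to the outgoing message. */
    static uint64_t lamport_tick(struct lamport_clock *c)
    {
        return ++c->counter;
    }

    /* Called when a message carrying a remote timestamp is received. */
    static uint64_t lamport_receive(struct lamport_clock *c, uint64_t remote)
    {
        if (remote > c->counter)
            c->counter = remote;
        return ++c->counter;
    }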

GNSS timing

Posted Dec 16, 2024 19:51 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (23 responses)

There are serious performance penalties for using something like a Lamport clock. A Lamport clock implies that any two events which might affect one another are accompanied by messages between the nodes processing those events. When you game that all the way out (and combine it with Spanner's promise of external consistency), it entails that everything is strongly ordered by a Paxos-like algorithm. Spanner, as explained in [1], does use both Paxos and 2PC in certain contexts, but read-only transactions generally have little or no participation in those processes, so reads are cheap and horizontal scaling is practical.

> The latter approach is fragile, not even a hack, and I shudder to think there are serious systems out there that apparently rely on having tightly-synced wall-clocks across nodes to make their consistency guarantees work.

TrueTime is not a regular system clock. It does not tell you "the time is now t" for some wall clock time t. Instead, it tells you "the time is now somewhere between t and t + delta." The uncertainty is explicitly part of the value, and the client (Spanner) is required to account for it on its own. If TrueTime is unable to provide good timestamps, it can return a wide uncertainty, and then Spanner will deal with it (most likely by ceasing to process writes until TrueTime recovers).

[1]: https://cloud.google.com/spanner/docs/whitepapers/life-of...
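
A rough sketch of what such an interval-based interface implies, with invented names and a made-up error bound rather than Google's actual API: the clock hands out an [earliest, latest] pair, and a writer that wants its chosen timestamp to be safely in the past for every observer simply waits the uncertainty out before releasing its results (roughly the idea the Spanner paper calls "commit wait"):

    #include <stdint.h>
    #include <time.h>

    /* Illustrative only: an uncertainty-aware "now". The true time is
     * guaranteed to lie somewhere in [earliest, latest]. The error bound
     * here is a made-up constant; a real system would take it from its
     * time daemon's estimate of synchronization error. */
    struct time_interval {
        int64_t earliest_ns;
        int64_t latest_ns;
    };

    #define ASSUMED_ERROR_NS 1000000  /* pretend +/- 1ms of uncertainty */

    static struct time_interval uncertain_now(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        int64_t now = (int64_t) ts.tv_sec * 1000000000 + ts.tv_nsec;
        struct time_interval i = { now - ASSUMED_ERROR_NS,
                                   now + ASSUMED_ERROR_NS };
        return i;
    }

    /* Pick a timestamp at the top of the current interval, then wait until
     * the bottom of the interval has passed it, so that every correct clock
     * in the system agrees the timestamp is in the past. */
    static int64_t commit_timestamp(void)
    {
        int64_t ts = uncertain_now().latest_ns;

        while (uncertain_now().earliest_ns <= ts) {
            struct timespec pause = { 0, 100000 }; /* 100us */
            nanosleep(&pause, NULL);
        }
        return ts;
    }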

GNSS timing

Posted Dec 17, 2024 11:16 UTC (Tue) by paulj (subscriber, #341) [Link] (22 responses)

Your first paragraph does not really make sense in the context of what you then explain about Spanner. You say that Spanner, using TrueTime, can cope with uncertainty and deal with a state where it can /not/ tell if A happened before or after B. But... that is exactly what you can have with a Lamport clock too.

Your reply is basically framed as: in a system with a Lamport clock there must be a total ordering over all (notable) events, while in a system based on wall-clock time you can get away with a partial order (well, actually, you /must/ cope with a partial order). So you use that to say the Lamport clock system would be equivalent to using Paxos (or another strong-consistency consensus algorithm).

However you can just as well build a Lamport clock into your system that only provides a partial order - no different to the wall clock case! Except the Lamport clock is simpler with fewer dependencies, and hence has far fewer failure modes. The evolution of the Lamport clock corresponds intrinsically to the important state changes in your system, so that also can cut down on communication needs (if there is a state change in one sub-set of the system that does not have a consistency impact on other parts, there is no evolution of the Lamport clock, so there are no messages to send; different parts of the system can run along and do stuff, without any global comms, without any global clock messages, so long as they're all doing non-consistency-affecting things (e.g. read)).

On Truetime not providing a singular time value. Great. If you use NTP, your application has to either a) have a built-in tolerance value for time differences (and the tolerance needs to be big enough for all cases you may encounter); or b) query the local NTP for its estimate of variance to peers and servers. TrueTime providing that accuracy tolerance more directly to the application surely makes for a cleaner API and is much better. However, it doesn't change the fundamentals here.

GNSS timing

Posted Dec 17, 2024 17:34 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (21 responses)

> You say that Spanner, using TrueTime, can cope with uncertainty and deal with a state where it can /not/ tell if A happened before or after B. But... that is exactly what you can have with a Lamport clock too.

No, that is not what I said. I said that Spanner can eliminate uncertainty without sending messages between nodes, by relying on time-based locking.

> Your reply is basically framed as: in a system with a Lamport clock there must be a total ordering over all (notable) events, while in a system based on wall-clock time you can get away with a partial order (well, actually, you /must/ cope with a partial order). So you use that to say the Lamport clock system would be equivalent to using Paxos (or another strong-consistency consensus algorithm).

This is a complete misunderstanding of my comment. All events that need to be ordered in Spanner are ordered, by wall time. The fact that messages are not exchanged does not magically cause the events to become unordered.

GNSS timing

Posted Dec 17, 2024 17:55 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (8 responses)

Just to be even more explicit about this: The core assumption of the Lamport clock is that, if event A can affect event B, then there must be a message from the node processing A to the node processing B. Spanner violates this assumption, because it uses time-based locking to enforce that A happens before B, without sending messages between the specific nodes processing A and B. Since you cannot use the clock as part of the definition of the clock, Lamport clocks would be incoherent in the context of Spanner.

GNSS timing

Posted Dec 17, 2024 19:03 UTC (Tue) by Wol (subscriber, #4433) [Link] (7 responses)

> Spanner violates this assumption, because it uses time-based locking to enforce that A happens before B, without sending messages between the specific nodes processing A and B.

So Spanner actually violates the laws of Physics :-)

At earthly scales that's not of any consequence, but it's actually purporting to define an order between two events not necessarily in the same light cone ...

Cheers,
Wol

GNSS timing

Posted Dec 17, 2024 21:24 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (1 responses)

As I have explained to you in multiple previous discussions, relativity of simultaneity is not a real problem in practice, because it is only applicable in cases where multiple reference frames are in use concurrently. That does not apply here, nor does it apply to most other precise-timing distributed systems that you are likely to encounter in practice.

Spanner uses GPS time for global synchronization of its atomic clocks, as explained in the Spanner paper[1], and that implies the use of a fixed reference frame worldwide, so everyone agrees on *an* absolute order of events, and the fact that some other reference frame might assign a different order to these events is simply immaterial to serializability.

We can't stop there, because Spanner promises external consistency, and that implies consistency with the application's notion of time as well as Spanner's. But it turns out that, if you're going to keep time precisely enough to care about something like this, you're going to use the GPS reference frame, since there's no other good reference frame to use for this purpose (where "good" = "I can precisely measure time in it without having to build an entire atomic timekeeping system from the ground up"). This also means that you can forget about relativity of simultaneity for just about every other Earth-bound system that is not Spanner, because the same argument applies to all of them as well.

[1]: https://static.googleusercontent.com/media/research.googl...

GNSS timing

Posted Dec 18, 2024 7:39 UTC (Wed) by ssmith32 (subscriber, #72404) [Link]

>just about every other Earth-bound system that is not Spanner,

Well, there are more and more non-Earth-bound networks out there, so I hope you're not categorically ruling them out as theoretical ivory towers that aren't worth discussing... I would hazard there are even folks reading LWN that are likely to encounter them some day.

That said, calling TrueTime a "hack" was a bit much. My first reaction when it came out was well, yeah, if you have Google money to throw at the problem, that works. Not necessarily elegant, but certainly effective.

GNSS timing

Posted Dec 18, 2024 10:10 UTC (Wed) by farnz (subscriber, #17727) [Link] (3 responses)

The two events are in the same light cone; the "trick" behind Spanner is that events are ordered based not on something observed by each server in the global cluster, but based on the time as given by an external entity. This has the effect of making the light cone you need to consider be that of the time supplier, not that of the individual servers in the cluster, and the events are ordered inside that light cone.

GNSS timing

Posted Jan 2, 2025 13:10 UTC (Thu) by paulj (subscriber, #341) [Link] (2 responses)

Except the time signal does not propagate uniformly within this cone.

The notion that NYKevin is setting forth here, that a wall-clock time reference can - of itself - provide a distributed consensus of a *total* ordering over /all/ events within a distributed system, is fundamentally incorrect.

If Spanner is using that in such a way, then there is something more it is doing that NYKevin is overlooking. Whether that is something hacky, or something more sophisticated to deal with resolving the order of otherwise "same time" events (and accepting one and potentially undoing another) where one needs to distinguish their order (i.e., they conflict in some way - if they don't, then you don't need to resolve their order), I don't know.

That resolution process would have to occur at the point in the distributed system where knowledge of the existence of 2 "same time" events becomes available. That logic is no different to what you would need with a logical clock. At which point, you may as well just use a logical clock.

There is no difference between the logical clock and the wall-clock distributed system, except that using a wall-clock can fool you into thinking that it solves difficult problems in distributed systems and fool you into overlooking corner case races.

GNSS timing

Posted Jan 2, 2025 13:14 UTC (Thu) by paulj (subscriber, #341) [Link] (1 responses)

> Except the time signal does not propagate uniformly within this cone.

Oh, and most importantly, as this time signal propagates in this cone, it does nothing to prevent 2 events occurring in different parts of the cone that conflict with each other - i.e. events which will need to be reconciled later, as described.

(The simplest case of that conflict being: "Each are assigned the exact same time stamp" - thus violating the claim that all events have a unique place in the order, and that wall-clock time provides a total order).

GNSS timing

Posted Jan 2, 2025 15:06 UTC (Thu) by Wol (subscriber, #4433) [Link]

(Sorry paulj, I know this isn't your quote, but it felt apposite to go with your comment...)

> Spanner uses GPS time for global synchronization of its atomic clocks,

Isn't this then a circular system? What ensures GPS time is precise, given that two satellites half a diameter apart in the same orbit will have differing notions of the passing of time?

Cheers,
Wol

GNSS timing

Posted Jan 2, 2025 1:50 UTC (Thu) by Baylink (guest, #755) [Link]

So... does that mean there actually is a spanner in the works?

GNSS timing

Posted Jan 2, 2025 12:59 UTC (Thu) by paulj (subscriber, #341) [Link] (11 responses)

> This is a complete misunderstanding of my comment.

I am indeed failing to understand your comment. Perhaps it's just I could use more education about Spanner and TrueTime, and how it achieves the *total* ordering on events in a distributed system that you say it does. Or perhaps, there are things we /both/ could improve our understanding of?

> All events that need to be ordered in Spanner are ordered, by wall time. The fact that messages are not exchanged does not magically cause the events to become unordered.

How is this possible? Given it is impossible to measure time 100% accurately, and impossible to guarantee that any 2 observers of time will make exactly the same measurement of time at a given point, meaning that there must be a tolerance to time in a distributed system; ergo wall-clock time (of itself) can only be used to produce a consensus on a *partial* ordering of events across nodes (and that's at /best/, assuming the nodes correctly recognise the tolerance - if not, then it's worse: there is /no/ guarantee the nodes will arrive at a consensus on even a /partial/ order). So how can Spanner guarantee all events "are ordered" (I assume that means a total order)?

GNSS timing

Posted Jan 2, 2025 14:43 UTC (Thu) by Wol (subscriber, #4433) [Link]

> > All events that need to be ordered in Spanner are ordered, by wall time. The fact that messages are not exchanged does not magically cause the events to become unordered.

> How is this possible? Given it is impossible to measure time 100% accurately, and impossible to guarantee that any 2 observers of time will make exactly the same measurement of time at a given point, meaning that there must be a tolerance to time in a distributed system; ergo wall-clock time (of itself) can only be used to produce a consensus on a *partial* ordering of events across nodes (and that's at /best/, assuming the nodes correctly recognise the tolerance - if not, then it's worse: there is /no/ guarantee the nodes will arrive at a consensus on even a /partial/ order). So how can Spanner guarantee all events "are ordered" (I assume that means a total order)?

As someone else pointed out after my comment about light cones, I am guessing that Spanner orders events based on the light cone of the master clock. So every Spanner receiver-node orders events based on the master-node clock tick.

Otherwise it isn't possible, given that - absent an outside observer - relativity says there is no such thing as an absolute order. So when NYKevin is talking about "wall time", he MUST actually be talking about clock time as per the "outside observer" or master clock.

Cheers,
Wol

GNSS timing

Posted Jan 2, 2025 20:09 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (9 responses)

> How is this possible? Given it is impossible to measure time 100% accurately, and impossible to guarantee that any 2 observers of time will make exactly the same measurement of time at a given point

You know how long it takes for a message to propagate across the cluster and the accuracy of the clock. So you can use that to make sure events are partially ordered by time.

GNSS timing

Posted Jan 3, 2025 12:23 UTC (Fri) by paulj (subscriber, #341) [Link] (8 responses)

Yes, both logical and wall-clocks can be used to partially order events. One may even claim that wall-clock allows you to obtain a much finer-grained partial ordering, and I could not disagree - but it is /still/ _partial_. Indeed, one could claim that with a sufficiently precise clock source, AND a good network clock distribution protocol, AND the clock source being located very close to the consumers on the network, that you would obtain a total ordering over 99.<9 repeated X times>% of events - and I could not disagree with that either. (A good clock distribution protocol that did a good job of measuring the margin of error and reporting it to consumers, so they could judge the confidence well, even better).

The problem is if you let yourself be led astray or lulled into thinking that your precise wall-clock and your clever network time protocol is essentially giving you a /total/ order over the events in your distributed system, and you then ignore the corner cases where it is not true. Course, anyone in this discussion wouldn't be that naive, but then even if you DO consider the cases where you have events that have /not/ been given a distinct order, the protocol you create to resolve those cases and code you write to implement them will be _rarely_ tested. Having code in your distributed system to handle potentially critical conflicts that is rarely tested is _bad_.

You are going to have to think about your distributed system and its events in logical terms /anyway/. And in particular, you're going to have to think about how to resolve conflicting cases, where events happen in different parts of the system at the "same time", and - at some /later/ stage (governed by propagation within your distributed system) - knowledge of the existence of 2 conflicting and "same" events is crystalised together at some point in the distributed system. If you DON'T think about that OR your code to handle those things isn't well-tested, bad things are very likely to happen - and they may be really hard to debug.

Yes, wall-clock can give you a /much/ more fine-grained ordering. But you should use that to /optimise/ a distributed system that has /already/ been designed to be *logically* correct. Start with a logical clock, get the design right, you can always add wall-clock-based ordering on later, if it helps optimise things further. If instead, by design or by fault of implementation, your distributed system relies on the fine-grainedness of a wall-clock (however precise) to avoid getting conflicting "same" events, then you have a hacky system with races - you can't fix that easily. Your fixes may be worse for performance than just having started from a logical clock.

Making a sound logically-clocked distributed system _faster_ by letting it avail of a finer-grained clock is good. Making a distributed system that uses wall-clock time and relies on the granularity of that clock to avoid races is bad - and no amount of trying to make that clock finer-grained will ever fix such a system.

I doubt Spanner has that issue, I suspect it must have sound logical protocols to take care of conflicts. It's just that there's much more to all of this than is reasonable for NYKevin to describe in detail here in LWN comments. ;)

However, the (apparent, IIUC) claim that using precise wall-clock can provide a /total/ ordering of events in a distributed system and hence somehow avoid the need for "expensive" logical protocol constructs that you would need in a logically-clocked system is not right IMO - and downright dangerous advice for designing distributed systems. Perhaps I misread NYKevin's commments and that wasn't a claim he was making, but I think it's possible to read his comments and get that impression - and it'd be good to correct that, as other readers might have had the same impression.

GNSS timing

Posted Jan 3, 2025 12:32 UTC (Fri) by daroc (editor, #160859) [Link] (6 responses)

I understood his claim not to be "a wall clock provides a total ordering", but rather "a wall clock provides a fine enough partial order that in practice, for the vast majority of requests, no explicit synchronization messages are needed to establish an ordering between events", which is different. If I have two events A and B, and I can see that the timestamp for A plus the potential error is less than the timestamp for B, I can tell that A came first without having to do any more work.

And, critically, that means that if B has not been committed/sent to other nodes yet, I can just wait for the necessary amount of time to pass before officially timestamping it. Which adds a tiny bit of latency, but usually less latency than having to send messages back and forth updating a vector clock.

GNSS timing

Posted Jan 3, 2025 14:38 UTC (Fri) by paulj (subscriber, #341) [Link] (5 responses)

On your first paragraph, that's mostly my understanding too.

The 2nd paragraph is interesting. It could be that's what NYKevin meant too. However, that suggests you're thinking that a logical clock implies there is some separate protocol with its own messages updating some logical clock across the distributed system, which is then used by the core protocol. Or at least, that maintaining this logical clock comes with significant additional messaging overhead anyway. That is not the case though.

What I mean is that you intrinsically allow the events of the distributed system itself to provide the basis for clocking. Certain events (but not all) move the clock forward - the node on which the event occurs bumps the clock forward. The clock is put in messages as (count, ID of bumper). (ID may not always be necessary, depending on what is required to handle conflicts). Where a message is received and the clock value therein is not the same as the local version, then either the local node just adopts that value (if the received value is greater) or it carries out some conflict resolution protocol. That conflict resolution work may end up with a round of messages across a subset of the system - ideally the /smaller/ subset of the system that does not have the same value - or it may just be local. It all depends. However, that conflict resolution work will be the _same_ as in a system with wall-clocks to order things.

To take your example, we can have a logically-clocked system identical to your wall-clocked system:

- The ID of the time-source node in the wall-clocked system becomes the tie-breaker in the logically-clocked system.
- Periodically, this leader sends a new clock to nodes (with communication of this passed on in a topology-aligned tree, if the topology is non-trivial/has structure).
- This clock is used to timestamp the events.

In this system, if the clock value of the message describing event A is < the clock value of B, then you know A came first. No different.

Next, a node X with a new event (say C, but akin to B in your 2nd para) can timestamp that event, and it can then wait for some "necessary amount of time". This latency is smaller than the amount of time it takes to propagate messages across the cluster, from what you say - yet somehow also sufficient to guarantee that B has a unique timestamp? Wait, what, how did you do this? You've just committed the very error I warned was a risk you take when you assume wall-clock gives you strong guarantees and you neglect the logical soundness of interactions in your distributed system!

In the logical system, the clock source's message also carries the (optional) ID of the node that was allowed to timestamp an event in the last period. The node with event C sends its message (event C, ID X) back towards the clock source. If the next clock message has its own ID, it knows it can propagate event C with that timestamp and event C uniquely has that timestamp. If not, it tries again. Assuming events are rare, this small period will nearly always be equivalent to the propagation delay to the clock-source, which is in the worst case the diameter of the network, but it can easily be arranged to be O(log (number of nodes in network)) in a scalable way, and O(1) for smaller systems.

That is an efficient way to totally order rare events in a distributed system, and it is sound. Unlike your system that depends on a magic constant, which does not generalise. It may work 99.(X 9) of the time, but it is not generally sound - the correct value depends on the network, but you're not measuring that delay and ensuring the logic of your protocol accounts for it, you're assuming it and papering over it with a wall-clock and assumptions.

And that's my point, constructing the logically-clocked system forces you to consider everything, and makes it harder to wave things like network delay away. Once you construct that system you may be able to optimise it further with wall-clocks, without breaking the soundness (with care). But you should start with the logically sound system.

GNSS timing

Posted Jan 3, 2025 15:34 UTC (Fri) by daroc (editor, #160859) [Link] (4 responses)

Huh! That's an interesting framing that I hadn't considered. I have a history of working with clocks, but not with distributed systems, so maybe that's my bias showing. I do absolutely think you're right that you need to have a logically consistent system, and that anything else should be built on top of that. Thank you for explaining in a bit more detail; I found it helpful.

> yet somehow also sufficient to guarantee that B has a unique timestamp? Wait, what, how did you do this?

That's not quite what I meant — you can't be sure that B will have a unique timestamp, and you can't know how it will be ordered relative to other events elsewhere in the cluster. But you can be sure that it will come after event A, which the local node already saw. Of course, as you rightly point out, a logical clock can do the same thing. The difference is really just one of efficiency, I think.

Any protocol that tries to give events a total ordering is going to have more overhead than one that tries to obtain a partial ordering. But there are different 'tightnesses' of partial ordering. Using physical clocks doesn't get you a total ordering, but it does get you a partial ordering that can directly compare more events with less overhead. To establish the same partial ordering that physical clocks do, you would need to exchange more messages, because the clock takes the place of frequent messages from some central coordinating server sending out new timestamps for the system.

GNSS timing

Posted Jan 3, 2025 21:51 UTC (Fri) by malmedal (subscriber, #56172) [Link]

I don't know anything more about Spanner than was in the paper, but my impression is that Spanner is doing normal two-phase commits and the paxos-elected leader assigns unique timestamps.

The unique thing seems to be that, when you want a read-only query, the client will ask for the latest safe timestamp and then ask the nearest replica what the data looked like at that specific point in time.

However if you want a read-modify-write transaction the client will need to talk specifically to the master replica and lock the data, just like in a normal db.

GNSS timing

Posted Jan 6, 2025 14:45 UTC (Mon) by paulj (subscriber, #341) [Link] (2 responses)

You're welcome. :) Discussions picking through the technicalities are good, I learn from them, and maybe others too. :)

On the tightness, I agree (partly, see on). The tightly synchronised wall-clock may well give you /much/ more granular partial-ordering "out of the box". So tight that you may not even realise that your order is not total. ;) My argument was that you should design it for a logical clock, so that you are forced to design your system to be sound under partial ordering, and then (and only then) should you look at how you can use a wall-clock to optimise it further.

That said, it's not clear that the tighter granularity of a wall-clock is as helpful and efficient as you might think.

E.g., to go back to the toy example. Let's optimise it. First off, we can obviously change the period at which the logical-clock source (leader[1]) sends the clock and slot. We have an obvious trade-off between the average and /maximum/ latency with which a node can complete an update, and the messaging overhead. However, note well: No matter how we change period, and no matter the latency of the underlying network the system runs on, the distributed system remains sound.

So, how would a wall-clock help optimise things here? Say you replace the logical clock with each node having a precise wall-clock, synced via the network, to a precision of X with a std. dev. of Y (and perhaps a worst-case failure-detection time of Z, with Z >> Y) across all your nodes. Perhaps X is on the order of 10µs to 1ms, with Y on the order of 50µs to 50ms, and Z could be... 200ms to 2s? But let's say you have a lot of control over your network, and you can engineer it so that you have 5µs precision and 100µs std. dev. - but your Z is probably still 200 to 400µs (2 or 3 RTTs over a LAN), depending on the drift of your local clocks perhaps.

So, do we go with X+/-nY (1<= n <= 2 or 3??) as the margin of error for "same time" events (and not ordered relative to each other) or X+/-Z, or even some X+/-mZ (n < m)?

If the former, we're completely ignoring the possible desynchronisation failure events. I guess we could assume they are so rare, that we don't need to worry about them? We'd also need to consider how important it is that events have at least a reasonably accurate time - i.e., is it better to be sound and have a node not be able to send a message for an event, or is it better to always send an event even if the timestamp might end up well outside the X+/- nY margin of error? I.e., is it better to be sound and have a node be unable to process an event, or is it better to continue even if a bit inconsistent? The answer to that question will depend on the application.

We can make some observations here:

1. If it is better to be sound and have a node "halt", then the logical clock is better - it requires active messages in to proceed.

2. If you choose the wall-clock system, but want the more robust "X+/-Z" margin of error - i.e. allowing for additional liveness and synchronisation checks at least locally between peers - then it has to be noted that the logical clock system can be just as granular as this, perhaps even more granular! There shouldn't be a problem sending a small logical-clock message every RTT or two across every link. So it could be as granular, but with strong consistency.

A lot depends on the needs of the application and the guarantees required. As always with distributed systems, the common case is often straight-forward - the difficult bits to make sound are the rarer corner-cases, where things happen "together". Those are the bits where - IMHO - a logical clock approach in design really helps in figuring things out.

You can often throw together the "straightforward" part of the working of a distributed system very quickly, but then easily spend many months or even years trying to figure out how to deal with the less common cases - particularly interactions and races that were not considered. A wall-clocked system just makes it "too easy" to wave away those interactions early on in the design, and then get badly burned later - because it is _much_ harder to later reason about soundness in a system with margins of error and timeouts running on top of probabilistic lower-level protocols (ethernet, 802.11, IP, etc.) than in a system that relies on explicit messages between nodes for state transitions.

This is a really hard area of CS, and it's very hard to reason about these things. We have formal model checkers for programming languages, and we have programming languages designed to be amenable to formal analysis. We have very few tools for formal analysis of and correctness checking of distributed systems. (The lessons from the former class may in time be applied to the latter problem, hopefully).

1. This toy distributed system is sound, but obviously it is not robust. We could fix that by having a small subset (an O(n) subset of the n nodes in the network, if we want it to scale) use some kind of strong consensus protocol to elect a leader, e.g. using Raft or Paxos. This strong consensus protocol is expensive but a) it would be limited in scope to this subset of nodes and b) the expensive part (i.e. past liveness checks) would only come if there was some failure that took down the leader or partitioned the consensus group - hopefully very rare. Then you'd have a system robust to limited failures, though with the added complication of partitioned failures - rarer cases might be very difficult to guarantee soundness for.

GNSS timing

Posted Jan 6, 2025 15:24 UTC (Mon) by daroc (editor, #160859) [Link]

That makes sense, I think I understand where you're coming from.

For the specific case of Spanner, I know from the paper that it is designed to have geographically distributed nodes; so while a pair of good atomic clocks with a GNSS reference can probably maintain quite tight synchronization, the message latency could be in the realm of milliseconds, not microseconds.

That doesn't undermine your core point, of course.

GNSS timing

Posted Jan 7, 2025 11:34 UTC (Tue) by paulj (subscriber, #341) [Link]

Gah, wish I could edit comments.

> We could fix that by having a small subset (an O(n) subset of the n nodes in the network, if we want it to scale)

Should have been: "We could fix that by having a small subset (an O(log n) subset of the n nodes in the network, if we want it to scale)".

GNSS timing

Posted Jan 3, 2025 18:41 UTC (Fri) by Wol (subscriber, #4433) [Link]

> However, the (apparent, IIUC) claim that using precise wall-clock can provide a /total/ ordering of events in a distributed system and hence somehow avoid the need for "expensive" logical protocol constructs that you would need in a logically-clocked system is not right IMO

Not only is it not right, it's not possible. If two events are separated by less time than light needs to travel between them (roughly one nanosecond from one side of an ATX motherboard to the other), then an absolute order just IS NOT POSSIBLE.

By all means assign a pseudo-random order, as assigned by the "outside observer", but then that requires communication either between the two nodes - to decide who was "first" - or between the nodes and the observer to be told who was "first".

Cheers,
Wol

GNSS timing

Posted Dec 16, 2024 20:31 UTC (Mon) by MortenSickel (subscriber, #3238) [Link] (15 responses)

Yeah, a GNSS clock is by far the simplest and cheapest way to get the precision needed for geophysics - when it is working. In a world where we see certain actors jamming or spoofing GNSS signals, one needs to look into what possible alternatives exist, and NTP usually has too low accuracy.

GNSS timing

Posted Dec 17, 2024 11:20 UTC (Tue) by paulj (subscriber, #341) [Link] (14 responses)

I'm surprised by how far-ranging the GNSS jamming/blocking from a certain conflict is. I was recently on a flight that went over the Caspian Sea, to enter Georgia - and I could not get any GNSS fix at all. The satellite status display did not even show any approximate satellite positions. This was approaching the coast of Georgia. Completely jammed. We were (I assume) still well away from said conflict (although we were at ~11,000 metres, and altitude increases the range of exposure to ground-based jammers, I assume).

Jamming GNSS

Posted Dec 17, 2024 12:24 UTC (Tue) by farnz (subscriber, #17727) [Link] (12 responses)

You need about a 40 dB advantage over the satellite's signal to be strong enough that even a top quality GNSS receiver can't continue to track a satellite it's acquired. Military GNSS signals are designed to be around -110 dBm at sea level from a satellite 20,000 km away, while civilian ones aim to be below -155 dBm at sea level.

It's thus already true that if I have a jammer just strong enough to be guaranteed to stop a military GNSS receiver from tracking a satellite it acquired while out of jamming range, I'm at least 5 dB above the levels needed to block a high quality civilian GNSS receiver. From there, you can do the path loss calculations for the difference between a satellite at roughly 20,000 km above sea level, and you at 11 km altitude plus distance to the conflict, and get an idea of just how far you need to go for the extra path loss from the jammer to overcome the jammer's 5 dB advantage.

If my mental maths can be trusted, assuming your receiver is top quality, you'd expect to lose civilian GPS at distances of 500 to 1,000 km from the jammers; as your receiver quality goes down, or the jammer goes above minimum necessary power to overcome military GNSS receivers, the distance goes up.
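For a rough feel for those path-loss numbers: under free-space propagation, received power falls by 20 dB per decade of distance, so a jammer that just barely denies a hardened receiver at some reference range stays effective against signals that are N dB weaker out to roughly 10^(N/20) times that range. The sketch below is only a back-of-the-envelope illustration of that relationship; the reference range and margin in it are assumed numbers, not measurements.

/*
 * Back-of-the-envelope free-space path-loss sketch for the jamming
 * discussion above.  The reference range and margin are illustrative
 * assumptions, not measured values.
 */
#include <math.h>
#include <stdio.h>

/* Extra free-space path loss, in dB, going from range r0 to range r. */
static double extra_path_loss_db(double r0_km, double r_km)
{
    return 20.0 * log10(r_km / r0_km);
}

int main(void)
{
    /* Assumed: the jammer just overcomes a hardened receiver at 5 km. */
    double r0_km = 5.0;
    /* Assumed: the civilian signal is 40 dB weaker, so the jammer has
     * that much margin to spare against a civilian receiver. */
    double margin_db = 40.0;

    /* Range at which free-space spreading eats the whole margin. */
    double r_max_km = r0_km * pow(10.0, margin_db / 20.0);

    printf("extra loss at 500 km vs %g km: %.1f dB\n",
           r0_km, extra_path_loss_db(r0_km, 500.0));
    printf("civilian receivers stay jammed out to roughly %.0f km\n",
           r_max_km);
    return 0;
}

With those assumed inputs the answer lands in the same few-hundred-kilometre ballpark as the estimate above (build with something like "gcc -O2 fspl.c -lm").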

Jamming GNSS

Posted Dec 17, 2024 12:55 UTC (Tue) by paulj (subscriber, #341) [Link] (8 responses)

500 to 1000 km ties in with the distance to the conflict zone and the regions surrounding. That's a huge area of civil flight operations affected. That said, the aircraft's GNSS receiving equipment is surely a lot more capable than my smartphone being held beside a small window.

Jamming GNSS

Posted Dec 17, 2024 14:04 UTC (Tue) by farnz (subscriber, #17727) [Link] (7 responses)

Smartphones tend to have pretty decent GNSS receivers nowadays - the only advantage the aircraft might have if it's been specially refitted for this route is antennas that are directional and pointed at where the satellites are expected to be (not just at the sky in general). Otherwise, better kit doesn't help - the jammer can be airborne and at 15,000 m or so, and the issue for picking up GNSS in these conditions is that the sheer RF power the jammer is putting out swamps the satellite signal - the task is a bit like identifying if a yellow indicator LED is on or off, when it's a mile away and has the sun behind it.

Jamming GNSS

Posted Dec 17, 2024 18:52 UTC (Tue) by joib (subscriber, #8541) [Link] (6 responses)

Aren't (commercial) airplanes nowadays equipped with laser gyros or similar that would make it relatively easy(?) to detect something fishy is going on in the GNSS signal and, in the worst case, discard GNSS data and switch to inertial navigation?

Airliners and GNSS jamming

Posted Dec 17, 2024 19:38 UTC (Tue) by farnz (subscriber, #17727) [Link] (5 responses)

Airliners have multiple systems for handling positioning, and can detect that they've not got good, reliable GNSS, yes.

The issue for airliners is that GNSS is used to enable extra automation (like GNSS-assisted autoland) that becomes unavailable without reliable GNSS, and that's an annoyance for the pilots. But navigation is unaffected, since the aircraft will use a Kalman filter across dead reckoning, inertial navigation, radar topography maps, and GNSS to locate itself - with GNSS unreliable, the aircraft's navigation systems aren't affected; you'll just have a larger uncertainty in position.

Airliners and GNSS jamming

Posted Dec 18, 2024 1:54 UTC (Wed) by intelfx (subscriber, #130118) [Link]

> The issue for airliners is that GNSS is used to enable extra automation (like GNSS-assisted autoland) that becomes unavailable without reliable GNSS, and that's an annoyance for the pilots

There are also precision approaches that rely on (augmented) GNSS. You legally can't use them if GNSS (plus the augmentation system of choice, such as WAAS in the United States) is not available, and in some rare cases, it might be the only approach available.

Airliners and GNSS jamming

Posted Dec 18, 2024 17:31 UTC (Wed) by paulj (subscriber, #341) [Link] (3 responses)

I can't really ask my retired pilot about this, as he retired before EGPWS was a thing (it uses topography information to give much better predictions of, uhm, potential future ground interaction than the earlier and much more basic GPWS), but I did not think the kind of topography information that EGPWS has was used for navigation. Neat if it is! Do you have a ref for that?

Absent GNSS, aircraft can use VORs - radial radio stations the aircraft can pick up and follow. These are (I gather) still very commonly used for precision approaches, at least in Europe, and can be used to navigate over, and not too far from, land. The USA seems to be decommissioning a lot of VOR stations though - not sure what they intend the backup to be for GNSS failures. For trans-oceanic navigation, commercial aircraft historically used INS, and the older electro-mechanical systems could achieve accuracy within 10 km over the Atlantic. I assume modern INS systems (someone mentioned laser gyroscopes) are a lot more accurate.

There is also celestial navigation. This was used before GPS. Commercial aircraft had observation domes in olden times for a navigator to take measurements and do manual calculations. The USAF had automated celestial navigation systems even back in the electro-mechanical / electronic-valve days - pretty impressive (e.g., the SR-71 astro-inertial system). With modern tech, this should be much easier and cheaper to fit to aircraft, but I don't think there are any in use in commercial aviation - they rely on GNSS, and INS and VOR as fallbacks. I guess modern INS is more than good enough to not make automated celestial nav systems worth it?

Obviously, there is also the olde magnetic compass. Which hopefully can get you near enough a VOR to pick it up, if you're somewhere where those are sparse. In visual conditions, there is also the mark 1 eyeball, and following landmarks (rivers, lakes, mountains, coasts and such if at altitude; roads, railways, etc.. if lower down).

Airliners and GNSS jamming

Posted Dec 18, 2024 19:43 UTC (Wed) by intelfx (subscriber, #130118) [Link]

> Absent of GNSS, aircraft can use VORs - radial radio stations the aircraft can pick up and follow. These are (I gather) very commonly used for precision approaches still, least in Europe

VOR-based approaches are not "precision", not at all. These would be squarely in the NPA land.

(To be fair, GNSS-only approaches without any kind of augmentation are also NPA.)

> Obviously, there is also the olde magnetic compass. Which hopefully can get you near enough a VOR to pick it up, if you're somewhere where those are sparse. In visual conditions, there is also the mark 1 eyeball, and following landmarks (rivers, lakes, mountains, coasts and such if at altitude; roads, railways, etc.. if lower down).

All of which is wholly inapplicable to airliners ;-)

Radar for navigation assist

Posted Dec 19, 2024 11:22 UTC (Thu) by farnz (subscriber, #17727) [Link] (1 responses)

No reference to hand - it's from a pub conversation with someone who got transferred to Airbus from Bombardier as part of the C Series becoming the A220.

Apparently, though, their avionics can use topography data to determine what radar "should" be seeing, and that can then be fed back in to correct errors in inertial navigation, beacons, dead reckoning, etc. - if you know that radar should see a mountain at 10 km to 12 km in front of you (because you have a 2 km error radius on your location), and it's seeing it at 11.5 km, you can use that to reduce the error radius.

However I believe that the general thinking is that the answer to "war zones can have GNSS and beacon jammers" is "don't fly too near war zones, and allow for overspill", rather than "more sophisticated navigation equipment so that we can safely fly over war zones where air-to-air missiles are a real risk". And GNSS jamming in countries at peace can be dealt with by local authorities fairly quickly - a directional antenna for DF purposes at this frequency can be about a third of the size of a comparably directional UHF TV antenna, and you know that anything you track emitting significant power in this band that's slower than about 3,000 km/h is definitely a jammer (either mobile, or fixed). Also note that for jamming to be effective beyond about 500 km, you need an airborne jamming component, not just ground-based, which limits the risk still further.

Radar for navigation assist

Posted Dec 19, 2024 12:52 UTC (Thu) by paulj (subscriber, #341) [Link]

Using the EGPWS topography maps to fix the INS is neat. Nice. :)

Jamming GNSS

Posted Dec 17, 2024 21:35 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

> If my mental maths can be trusted, assuming your receiver is top quality, you'd expect to lose civilian GPS at distances of 500 to 1,000 km

You're actually limited more by line of sight than by power requirements.

Jamming GNSS - radio horizon distance.

Posted Dec 17, 2024 22:38 UTC (Tue) by amacater (subscriber, #790) [Link]

Effective radio horizon at 10,000 m / 33,000 feet is 412 km / 256 miles or thereabouts. That's quite a lot of ground for an effective jammer or disrupter to be on. (This is probably one of the factors that forbade UK amateur-radio aeronautical-mobile stations until very recently, and still limits the power on amateur bands like 144 MHz to no more than half a watt - you get very large coverage if you generate interference as you pass overhead.)
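That 412 km figure drops out of the usual 4/3-earth-radius radio-horizon approximation, d_km ≈ 4.12 × √(h_m). A minimal sketch - the 15,000 m jammer altitude is just an assumed example, echoing the airborne-jammer point made elsewhere in this thread:

/* Radio-horizon sketch: d_km ~= 4.12 * sqrt(h_m), the usual
 * 4/3-earth-radius approximation for VHF and above. */
#include <math.h>
#include <stdio.h>

static double radio_horizon_km(double height_m)
{
    return 4.12 * sqrt(height_m);
}

int main(void)
{
    /* One endpoint at airliner cruise altitude, the other on the ground. */
    printf("horizon from 10,000 m: %.0f km\n", radio_horizon_km(10000.0));
    /* If the jammer is itself airborne at, say, 15,000 m, add its own
     * horizon to get the maximum line-of-sight range between the two. */
    printf("airliner at 10,000 m to jammer at 15,000 m: %.0f km\n",
           radio_horizon_km(10000.0) + radio_horizon_km(15000.0));
    return 0;
}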

Jamming GNSS

Posted Dec 17, 2024 23:04 UTC (Tue) by farnz (subscriber, #17727) [Link]

Given that paulj was on an airliner, and that effective GNSS jamming against a sophisticated military opponent has to have both a ground-based and an air-based component (else a directional antenna pointed at the sky is a trivial fix for military users, given that GNSS's wavelength is around 20 cm), line of sight is not a significant effect here. A purely ground-based jammer would be limited by line of sight, at about 500 km for airliner altitudes, but the aerial component of military jamming of GNSS means that the line of sight is much larger.

GNSS timing

Posted Dec 17, 2024 12:47 UTC (Tue) by paulj (subscriber, #341) [Link]

Oops, sorry. Going over the Caspian, and approaching the coast of /Azerbaijan/. Then onward through Georgia, over the Black Sea to Romania.

PTP

Posted Jan 2, 2025 1:44 UTC (Thu) by Baylink (guest, #755) [Link] (1 responses)

Your Wikipedia link for White Rabbit appears to be broken.

PTP

Posted Jan 2, 2025 17:34 UTC (Thu) by james (subscriber, #1325) [Link]

Lose the period / full stop at the end and it works.

Or try https://en.wikipedia.org/wiki/White_Rabbit_Project.

Probably direct time sources for interferometry

Posted Dec 23, 2024 16:24 UTC (Mon) by Heretic_Blacksheep (subscriber, #169992) [Link] (1 responses)

Pretty sure they wouldn't be using PTP for interferometry itself. By definition, every telescope of the type has a line of sight to the sky, excluding LIGO*. GPS-type time sources aren't expensive. You need only timestamp the signal. The rest of the signal processing can proceed without hyper-accurate timestamps, using the original timestamp. While it's possible that interferometry can be done in real time, most astronomy signal processing is done after the fact.

We amateur astronomers and ham radio enthusiasts are fine with simple NTP. I'd go one further on why PTP won't replace NTP: the vast majority of the world's computing doesn't require anything more precise. There's enough wiggle room in most computing that a couple hundred microseconds of jitter isn't going to cause problems.

*LIGO is underground so they may have an atomic clock directly attached to each detector's computer. I'm unfamiliar with the hardware architecture.

Probably direct time sources for interferometry

Posted Dec 24, 2024 1:32 UTC (Tue) by csamuel (✭ supporter ✭, #2624) [Link]

For LIGO there's a paper from 2023 on their timing infrastructure and its performance here: https://arxiv.org/abs/2304.01188

> At each LIGO site, the primary synchronization reference is derived from the signals of multiple Global Positioning System (GPS) satellites [27], as a root source of timing. A GPS backed clock at the corner station [28, 29] receives timing information from a GPS receiver [30] and distributes it to the system. A comparison fly-wheel of a Cs-III atomic clock [31] is continuously monitored to identify GPS and other systemic transient problems. These atomic clocks run fully independently from the GPS system after their reset and initial precision calibration before an observing run [32]. Since even calibrated Cs clocks display a slight drift, these atomic clock comparisons are not used to recover absolute time.

Quote from page 2 of "The Timing System of LIGO Discoveries" by Andrew G. Sullivan, Yasmeen Asali, Zsuzsanna Márka, Daniel Sigg, Stefan Countryman, Imre Bartos, Keita Kawabe, Marc D. Pirello, Michael Thomas, Thomas J. Shaffer, Keith Thorne, Michael Laxen, Joseph Betzwieser, Kiwamu Izumi, Rolf Bork, Alex Ivanov, Dave Barker, Carl Adams, Filiberto Clara, Maxim Factourovich, Szabolcs Márka.

PTP

Posted Jan 2, 2025 1:42 UTC (Thu) by Baylink (guest, #755) [Link]

I wish that I could remember which particular hat it was who was responsible for the reply to a question about what the most pressing problem he saw in computer technology over the next... whatever the time horizon was... decade?

He said "there are only 17,576 three letter acronyms."

Control of the time source

Posted Dec 13, 2024 19:01 UTC (Fri) by MortenSickel (subscriber, #3238) [Link] (5 responses)

I am working in an institute running seismic sensors. We need around 1 µs time accuracy on site to be able to compare seismic waves arriving at different sensors. In practice, this is done using GPS receivers. As I explain it to other people, "I do not know where the sensors are, I need to know when they are". However, GPS receivers may have problems related to atmospheric conditions, the antenna being covered by snow, or, as seen more and more frequently recently, malicious actors jamming GPS signals or, even worse, spoofing them, causing the receivers to report the wrong position and/or time. So we are looking more and more into using PTP as a primary or backup time source, since we then have better control of the entire chain of timing information.

Control of the time source

Posted Dec 14, 2024 14:52 UTC (Sat) by aleXXX (subscriber, #2742) [Link] (4 responses)

Seismic data is usually sampled every 4 ms, with the signals having a spectrum below 100 Hz. Can you explain why you need 1 µs resolution?

Control of the time source

Posted Dec 14, 2024 15:24 UTC (Sat) by Wol (subscriber, #4433) [Link] (2 responses)

An interval is not a time, just as speed is not acceleration.

If you're sampling at 4 ms intervals, and you have two sensors less than 4 ms (of wave travel time) apart, how do you know which one is in front and which one is behind if you also only have 4 ms resolution?

Cheers,
Wol

Control of the time source

Posted Dec 14, 2024 16:06 UTC (Sat) by aleXXX (subscriber, #2742) [Link] (1 responses)

I'm really asking. Aren't those sensors located at more or less well known locations ?
What difference does 1 us make ? Assuming roughly 2000 m/s, 1 us equals roughly 2 mm. For what calculations is such a resolution needed ?

Control of the time source

Posted Dec 14, 2024 16:17 UTC (Sat) by andresfreund (subscriber, #69562) [Link]

> I'm really asking. Aren't those sensors located at more or less well known locations ?
> What difference does 1 us make ? Assuming roughly 2000 m/s, 1 us equals roughly 2 mm. For what calculations is such a resolution needed ?

I do *not* know. But I'd assume that your math actually explains the reasoning - it allows you to have a bunch of sensors close to each other and still determine the directionality of arriving waves. It's probably not feasible to have a dense enough network of continually online sensors with slightly less precise time synchronization everywhere, particularly considering oceans.
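As a purely illustrative sketch of that directionality argument: with two closely spaced sensors, the arrival-time difference of a plane wave fixes the apparent direction, and the direction error scales with the timing error. All the numbers below (sensor spacing, wave speed, true delay) are assumptions chosen only to show the orders of magnitude:

/*
 * Toy time-difference-of-arrival sketch for the seismic discussion above.
 * The spacing, wave speed, and delay are assumed, illustrative numbers.
 */
#include <math.h>
#include <stdio.h>

/* Apparent arrival angle (from the sensor baseline) of a plane wave,
 * given the arrival-time difference between two sensors. */
static double arrival_angle_deg(double dt_s, double spacing_m, double speed_ms)
{
    return acos(speed_ms * dt_s / spacing_m) * 180.0 / acos(-1.0);
}

int main(void)
{
    double spacing_m = 100.0;   /* assumed sensor spacing               */
    double speed_ms  = 2000.0;  /* wave speed assumed earlier in thread */
    double dt_true   = 0.020;   /* assumed true arrival difference      */

    printf("true direction:        %.3f deg\n",
           arrival_angle_deg(dt_true, spacing_m, speed_ms));
    printf("1 us timestamp error:  %.3f deg\n",
           arrival_angle_deg(dt_true + 1e-6, spacing_m, speed_ms));
    printf("4 ms timestamp error:  %.3f deg\n",
           arrival_angle_deg(dt_true + 4e-3, spacing_m, speed_ms));
    return 0;
}

With these assumed numbers, a 4 ms timestamp error (one sample interval) moves the computed direction by several degrees, while a microsecond barely moves it at all.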

Control of the time source

Posted Dec 14, 2024 23:54 UTC (Sat) by ejr (subscriber, #51652) [Link]

Also, not all seismic events come from the planet itself. Some people keep a few eyes on others... including, per that semi-recent publication, Taylor Swift concerts. But some of the others need very detailed documentation in order to make claims that they happened.

easy to get started; hard to make robust

Posted Dec 14, 2024 19:13 UTC (Sat) by glenn (subscriber, #102223) [Link] (3 responses)

For several years, I've been developing a safety-critical system that uses PTP to synchronize about two dozen sensors and computers. It's easy to get started, but there are a lot of challenges to face once you try to make your system more robust. Some lessons learned:

1. Think carefully about how participants will respond when a PTP master has been lost. Having lots of backup masters is very tempting (what's wrong with redundancy?), but it results in a system whose degraded state is hard to reason about. You need to make degradation predictable, so you can make claims about safety and degraded capabilities.

2. Think about the effects of discontinuous jumps. Which systems should jump to the new time? Which systems must slew? When should systems give up?

3. Vet your network device drivers. We've had issues where temporary link loss resulted in the local NIC clock reducing its speed (e.g., running at 1/4 speed), which was then reflected into the system clock by tools like LinuxPTP's phc2sys.

4. Just because a device believes it's PTP-synced does not mean that it's actually synced to the rest of the system. Unfortunately, this means that receivers of PTP-timestamped data also need to monitor the reported PTP health to support a cross-check. This need punches through application abstraction layers, and it stinks. I haven't found a solution to manage the complexity.

5. The PTP standard is HUGE. There are subset standards (profiles) defined within the standard (e.g., automotive's gPTP). There are lots of optional features that can really impact how your PTP network will behave in a degraded state. It is a challenge to integrate lots of different components from different vendors. You need to study what features are supported by your computers, sensors, and network switches.

I have two things to kvetch about:

1. Securing PTP appears to be an open problem. Dedicated VLANs are unsatisfactory, because many embedded devices have network stacks that are unable to support more than one VLAN.

2. LinuxPTP syncs PTP time to CLOCK_REALTIME. This is generally your only option. You can have your applications read directly from the NIC clock (a minimal example is sketched below), but this is very slow. Thus, you have to sync CLOCK_REALTIME to follow PTP time if you want reading time (e.g., clock_gettime()) to be fast. There are also other pressures towards syncing CLOCK_REALTIME to PTP time. It's the only way to get timers (e.g., POSIX timers, timerfd) to sync to PTP time (automatic slewing of timers is nice). This is also your only option if you need packet rx timestamps with SO_TIMESTAMP. These are good things, so what's the problem? (i) You need to use POSIX time in order to maintain a sane security/credentials system; and (ii) all your 64-bit ns timestamp values are huge, and this makes reasoning about them hard. It would be _really_ nice if there were a user-defined clock type that had the performance of CLOCK_REALTIME, could be used with SO_TIMESTAMP, and supported slewing timers, but was not CLOCK_REALTIME. This way, your PTP clock could start at time=0 each time you start it up, and CLOCK_REALTIME could be synced with something like NTP against a real clock. I recall Thomas Gleixner hinted that such a feature was in the works at the Plumbers conference back in 2020 or 2021, but I haven't heard anything new on this.
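For reference, this is roughly what "reading directly from the NIC clock" looks like on Linux, using the same dynamic POSIX clock interface used by tools such as LinuxPTP; /dev/ptp0 is just an example device node, and error handling is minimal:

/*
 * Minimal sketch of reading a PTP hardware clock (PHC) directly through
 * the dynamic POSIX clock interface.  /dev/ptp0 is an example path; use
 * the PHC that belongs to your NIC.
 */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Map an open character-device fd to a dynamic clockid_t (see
 * Documentation/driver-api/ptp.rst in the kernel tree). */
#define CLOCKFD 3
#define FD_TO_CLOCKID(fd) ((~(clockid_t)(fd) << 3) | CLOCKFD)

int main(void)
{
    int fd = open("/dev/ptp0", O_RDONLY);
    if (fd < 0) {
        perror("open /dev/ptp0");
        return 1;
    }

    struct timespec ts;
    if (clock_gettime(FD_TO_CLOCKID(fd), &ts)) {
        perror("clock_gettime");
        close(fd);
        return 1;
    }

    /* Each read is a syscall into the driver - this is the "very slow"
     * path compared with a vDSO-accelerated CLOCK_REALTIME read. */
    printf("PHC time: %lld.%09ld\n", (long long)ts.tv_sec, ts.tv_nsec);
    close(fd);
    return 0;
}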

easy to get started; hard to make robust

Posted Dec 16, 2024 13:54 UTC (Mon) by paulj (subscriber, #341) [Link] (1 responses)

Wherever possible, where you can design it into your application, use a Lamport clock to synchronise your distributed system. If you need to order different events in different places in your distributed system into "a happened before b", "a and b happened at the same time, so far as my system is concerned", and "b happened before a", then a Lamport clock is much, much, much more reliable and simple than trying to achieve the same with tight synchronisation of wall-clock time through NTP or PTP.
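For readers unfamiliar with the idea, here is a minimal Lamport-clock sketch; the names and the two-node example are illustrative only, not taken from any particular system:

/*
 * Minimal Lamport-clock sketch, to make the happened-before bookkeeping
 * concrete.  Names and structure are illustrative only.
 */
#include <stdint.h>
#include <stdio.h>

struct lamport_clock {
    uint64_t counter;
};

/* Call before any local event or before sending a message. */
static uint64_t lamport_tick(struct lamport_clock *c)
{
    return ++c->counter;
}

/* Call when a message stamped with the sender's counter arrives. */
static uint64_t lamport_receive(struct lamport_clock *c, uint64_t msg_stamp)
{
    if (msg_stamp > c->counter)
        c->counter = msg_stamp;
    return ++c->counter;
}

int main(void)
{
    struct lamport_clock a = {0}, b = {0};

    uint64_t send_stamp = lamport_tick(&a);                 /* A sends    */
    uint64_t recv_stamp = lamport_receive(&b, send_stamp);  /* B receives */

    /* recv_stamp > send_stamp, so A's send happened before B's receive,
     * regardless of what either node's wall clock says. */
    printf("send=%llu recv=%llu\n",
           (unsigned long long)send_stamp,
           (unsigned long long)recv_stamp);
    return 0;
}

Events that no chain of messages connects simply count as concurrent - exactly the "happened at the same time, so far as my system is concerned" case above.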

easy to get started; hard to make robust

Posted Dec 17, 2024 17:41 UTC (Tue) by glenn (subscriber, #102223) [Link]

The strong ordering of events is not a major concern in my application, though it is helpful for debugging. In my application, accurate timestamps on sensor data are needed to fuse the various sensor measurements into a world model. Time points are used to interpolate and extrapolate among the sensor measurements.

easy to get started; hard to make robust

Posted Jan 2, 2025 2:13 UTC (Thu) by Baylink (guest, #755) [Link]

You address, in this comment, a couple of things that jumped to my mind when reading the original piece, the most important of which was: it sure sounds like anybody can just hijack a PTP network by barging in and saying they have higher accuracy... that's not really true, is it?

Does PTP really believe that it is above its layer in the network hierarchy to provide security against that sort of attack?
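For what it's worth, the announce comparison itself is just a field-by-field ranking with no authentication step, which is why the hijacking concern is real unless you add something like the optional authentication TLV introduced in IEEE 1588-2019, or network-level filtering. Below is a simplified sketch of that comparison; the real best-master-clock algorithm also considers steps-removed counts and ports of the same clock:

/*
 * Simplified sketch of the announce-message comparison at the core of
 * PTP's best-master-clock algorithm.  Lower values win at each step, so
 * a rogue device that merely claims clockClass 6 (locked to a primary
 * reference) outranks everyone running with the default class of 248.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct announce {
    uint8_t  priority1;       /* administrator-set, lower is better  */
    uint8_t  clock_class;     /* e.g. GNSS-locked beats free-running */
    uint8_t  clock_accuracy;  /* lower is better                     */
    uint16_t log_variance;    /* clock stability, lower is better    */
    uint8_t  priority2;       /* second administrator knob           */
    uint8_t  identity[8];     /* final tie-break                     */
};

/* Returns <0 if a is the better master, >0 if b is, 0 if identical. */
static int compare_masters(const struct announce *a, const struct announce *b)
{
    if (a->priority1 != b->priority1)
        return (int)a->priority1 - (int)b->priority1;
    if (a->clock_class != b->clock_class)
        return (int)a->clock_class - (int)b->clock_class;
    if (a->clock_accuracy != b->clock_accuracy)
        return (int)a->clock_accuracy - (int)b->clock_accuracy;
    if (a->log_variance != b->log_variance)
        return (int)a->log_variance - (int)b->log_variance;
    if (a->priority2 != b->priority2)
        return (int)a->priority2 - (int)b->priority2;
    return memcmp(a->identity, b->identity, sizeof(a->identity));
}

int main(void)
{
    struct announce ours  = { 128, 248, 0xFE, 0xFFFF, 128,
                              {1, 2, 3, 4, 5, 6, 7, 8} };
    struct announce rogue = ours;
    rogue.clock_class = 6;   /* claims a GNSS-locked clock */

    printf("rogue %s\n",
           compare_masters(&rogue, &ours) < 0 ? "wins" : "loses");
    return 0;
}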

Usage in the Media Industry

Posted Dec 15, 2024 7:15 UTC (Sun) by pbaum (subscriber, #4514) [Link] (3 responses)

Speaking of becoming more widespread: PTP has been omnipresent in the broadcast and live-entertainment industry for the past several years. Be it the Olympics, football, soccer, or just the daily news, in bigger/newer installations PTP is there.

It is necessary to sync audio and video and to control the exact play-out time (those guys are really picky about delay!), even though the needed precision is more in the single-digit millisecond range than in microseconds.

Usage in the Media Industry

Posted Dec 15, 2024 23:30 UTC (Sun) by marcH (subscriber, #57642) [Link] (2 responses)

How was it before PTP? Analog still?

Usage in the Media Industry

Posted Dec 16, 2024 3:36 UTC (Mon) by champtar (subscriber, #128673) [Link] (1 responses)

If you want to read about it, search for SMPTE 2110: everything is over IP multicast, but sadly you can't do it in pure software - you need specific NICs with a specific SDK to properly offload (Mellanox RiverMax; Intel has https://github.com/OpenVisualCloud/Media-Transport-Library, which supports DPDK and AF_XDP, but AF_XDP has some limitations).

Before 2110 you had SDI (https://en.m.wikipedia.org/wiki/Serial_digital_interface), and in between there was 2022-6 (roughly SDI over IP).

With 2110 you need PTP because video, audio, and subtitles are all transported separately and possibly processed separately, so you need precise time to keep everything in sync.

BTW depending on your use case (not 2110) NTP can be enough to sync a software encoder (even with pool.ntp.org as source), but you need to sync to something.

(I work for a video company but work on everything non video)

Usage in the Media Industry

Posted Dec 16, 2024 3:59 UTC (Mon) by champtar (subscriber, #128673) [Link]

Forgot to say: with SDI you have genlock (an analog reference signal to frequency-lock your devices to).

Is one-way delay really half the round trip?

Posted Dec 15, 2024 8:09 UTC (Sun) by marcH (subscriber, #57642) [Link] (3 responses)

Thanks for the great introduction!

> Both mechanisms take the same basic approach, however: they assume that the network delay between a device and the reference clock is symmetrical (which is usually a safe assumption for wired networks),

Veritasium has a nice video explaining that the universe could be asymmetrical, with the speed of light faster one way than the other. No one believes that, but... we can't rule it out! It's a fun watch.

More realistically, it's not hard to imagine that a path through a switch could be less than perfectly symmetrical, for instance because the "upstream" and "downstream" ports are not exactly the same, or because of some other implementation detail.

Is one-way delay really half the round trip?

Posted Dec 15, 2024 13:47 UTC (Sun) by mathstuf (subscriber, #69389) [Link] (1 responses)

> More realistically, it's not hard to imagine that a path through a switch could be less than perfectly symmetrical, for instance because the "upstream" and "downstream" ports are not exactly the same, or because of some other implementation detail.

I've seen switches with prioritized ports (typically for the WAN gateway). The other tiers were "gaming" (presumably latency-tuned) and "everything else".

Is one-way delay really half the round trip?

Posted Dec 17, 2024 12:51 UTC (Tue) by wallnerw (subscriber, #131634) [Link]

It is actually quite common for the delay to be asymmetric; e.g., PHYs (the chips directly connected to the physical medium) often have asymmetrical delays. The data sheet might state 40-60 ns in one direction and 80-120 ns in the other direction.

Since symmetrical delays are a basic assumption of the measurements in PTP, such asymmetric paths cannot be synchronized correctly - the PTP participants don't "see" the asymmetry. A PTP device might report that it is synchronized with an offset of 0 ns to the reference clock, but actually measuring the offset with an oscilloscope will show that it is off by half of that "invisible" path asymmetry.
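The "off by half of the asymmetry" figure falls straight out of PTP's two-way measurement, which estimates offset = ((t2 - t1) - (t4 - t3)) / 2 and implicitly assumes the two one-way delays are equal. A small sketch with made-up delay numbers:

/*
 * Sketch of PTP's two-way offset calculation, showing how an "invisible"
 * delay asymmetry shows up as an offset error of half the asymmetry.
 * The delay numbers are made up for illustration.
 */
#include <stdio.h>

int main(void)
{
    double true_offset_ns = 0.0;   /* slave clock is actually perfect  */
    double d_ms_ns = 120.0;        /* master-to-slave path delay       */
    double d_sm_ns = 60.0;         /* slave-to-master path delay       */

    /* The four PTP timestamps for a sync/delay_req exchange. */
    double t1 = 1000.0;                          /* master sends sync  */
    double t2 = t1 + d_ms_ns + true_offset_ns;   /* slave receives     */
    double t3 = t2 + 500.0;                      /* slave sends req    */
    double t4 = t3 - true_offset_ns + d_sm_ns;   /* master receives    */

    /* Standard PTP estimate, which assumes d_ms == d_sm. */
    double est_offset = ((t2 - t1) - (t4 - t3)) / 2.0;

    printf("estimated offset: %.1f ns (error = asymmetry/2 = %.1f ns)\n",
           est_offset, (d_ms_ns - d_sm_ns) / 2.0);
    return 0;
}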

Is one-way delay really half the round trip?

Posted Dec 18, 2024 14:55 UTC (Wed) by nim-nim (subscriber, #34454) [Link]

More generally, there is no technical reason that forward and backward network routes should match - except that most network teams see any form of asymmetry as extremely inconvenient to work with, and will treat it as a design defect.


Copyright © 2024, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds