The PTP guy

All things time sync and the documentation that never was. @ThePTPGuy

PTP(d) on legacy systems: Solaris 10 part 1 - how we got there



Legacy systems

One of the issues often discussed in the light of time sync requirements for software applications coming from the MiFID II and NMS CAT regulations is the one of legacy systems. There are two parts to this.

First is legacy applications, being applications which do not support the capture of timestamps with the required granularity. The problem lies either with their design (simply put, fields, structures and protocols with no room for nanosecond or even microsecond resolution), or with the technology they use (older Java versions for example). Both of these issues can be remedied with some development and migration effort. Another issue is the accuracy and non-uniformity of timestamps retrieved from the operating system - this is not easily fixed, as it is mostly out of the developers' control. In non-virtualised systems, clock_gettime() or even gettimeofday() usually meet the regulatory requirements. I will not discuss their performance here, but that is not to say that they have no effect. For older Java applications, one can use JNI to call these functions, at the cost of even greater call latency.
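As a minimal sketch of what nanosecond-granularity timestamp capture looks like from an application's point of view, the following reads the wall clock with clock_gettime() and formats it with all nine fractional digits. The function names are made up for this illustration, and the granularity of the output says nothing about the accuracy of the clock behind it.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>

/* Read the system wall clock. Returns 0 on success, -1 on failure. */
static int captureTimestamp(struct timespec *ts) {
    assert(ts != NULL);
    return clock_gettime(CLOCK_REALTIME, ts);
}

/*
 * Format a timestamp as "seconds.nanoseconds", always with nine
 * fractional digits. Returns the number of characters produced,
 * as reported by snprintf().
 */
static int formatTimestamp(const struct timespec *ts, char *buf, size_t len) {
    assert(ts != NULL && buf != NULL);
    return snprintf(buf, len, "%lld.%09ld",
                    (long long)ts->tv_sec, (long)ts->tv_nsec);
}
```

A record format that stores the two fields separately, or a single 64-bit nanosecond count, avoids the "no room for nanoseconds" problem described above.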

The second part is legacy operating systems, and this post is about one of them: Solaris 10.

When we say legacy operating systems in the context of achieving some degree of precision time synchronisation, this usually translates to a) systems which do not implement kernel socket timestamp options, b) systems which do not support any timestamping hardware and are limited to software packet timestamps, and c) systems which by design are not "time sync friendly", such as Windows (awkwardly implemented clock steering, and also no socket timestamps). The inability to use hardware-assisted time sync mostly moves the given environment out of the sub-microsecond range of precision, and incurs a largely unknown accuracy penalty, but it does not, in my opinion, prevent those systems from staying compliant, provided that the application is able to achieve what is required in terms of timestamp granularity.

Seeing that Solaris 10 is still used in many finance applications (it just won't die, and never mind Solaris 10 - Solaris 8 and 9 are out there as well), I decided to see if and what could be done to achieve a decent level of time sync performance with it. Solaris 11 supports socket timestamps and hardware PTP (surprise: Oracle uses modified PTPd in Solaris 11). Solaris 10 supports neither PTP hardware nor socket timestamps, and that is why it has been disregarded by many, myself included, when talking about time sync requirements. In the following sections I will share what it took to overcome this and get PTPd rocking.

Since this post fully qualifies for "Too Long; Didn't Read" - the short summary is that yes, it worked, and yes, performance looked good at first glance - see "Building PTPd on Solaris 10" and further if you are not interested in the development details.

No sockets? No problem

There is a simple trick that PTPd has used for quite a while to emulate socket timestamps, and that is to use libpcap to receive PTP messages - instead of network sockets. Libpcap is a library utilising various packet capture interfaces on many operating systems to provide a common API to filter and capture packets. If you have ever used tcpdump, tshark or Wireshark, you have used libpcap. When receiving packets with libpcap, we also receive packet timestamps captured by the kernel as soon as the data was moved off the NIC buffer into the kernel socket buffer (in Linux terms) - which is by no means immediate. We do not need to transmit packets using libpcap, because it will capture outgoing packets as well, allowing us to closely approximate transmission timestamps by capturing our own packets, sent using sockets. PTPd at its core supports receiving these looped back packets. On some systems, libpcap transports provide the only reliable way to get software timestamps.

Based on the above, it may look like the only effort required was to make use of libpcap on Solaris 10. Unfortunately it was not that simple. While libpcap does work on Solaris 10, one has no control over the native mechanisms it uses to capture - namely buffering and timeouts. It turned out that packets captured with libpcap were buffered and then clumped together into large bursts, and nothing was received unless a long timeout was set. It quickly became clear that libpcap was not the way to go.

As anyone familiar with Solaris will know, tcpdump is not the tool of choice when inspecting packets. The tool of choice is snoop(1), a somewhat ancient, and yet still terribly useful piece of software. Snoop obviously does not use libpcap. Snoop is a native Solaris utility, and knows nothing of such whippersnapper contenders. Snoop does it the way Unix always has: it uses STREAMS / DLPI (see Open Group: STREAMS and DLPI), with the help of Solaris' libdlpi(3). We are talking late 1980s and early 1990s here. Developing against libdlpi definitely was fun that way. The snoop man page states four-microsecond timestamp accuracy, whatever this may mean.

Getting there

Note: the code resulting from this work is part of the upcoming rewrite of PTPd I am developing, hosted on GitHub under the wowczarek-2.3.2-libcck branch

The task at hand was to first make sure PTPd builds cleanly and functions on Solaris 10, and then implement some libdlpi-based UDP transports for PTPd: IPv4, and (because it is fun) also IPv6. With some initial help from Dan McDonald of Sun and eventually OmniTI fame (OmniOS is facing an uncertain future, unfortunately), PTPd has supported "Solar-ish" systems for a while now (credit to Dan for the word). This mostly meant the Solaris 11 level (OpenSolaris / Illumos), as it made use of some features only present in the recent Solaris versions: firstly the SO_TIMESTAMP socket option, but also some library calls I had mistakenly taken for granted.

The first part was reasonably easy; your regular porting effort. Fun with autotools, hunting for headers, linking libraries, chasing ioctl data type quirks - and finally implementing the missing getifaddrs(3) and daemon(3): see the resulting code on GitHub. Simple stuff, mostly. Getifaddrs() still needs a bit of work, mostly to deal with logical interfaces.

Adding a new set of transports: easy, courtesy of libcck

I hope I will find the time to write a little bit more about libcck at some point. It is one of the two libraries at the core of the "new" PTPd I have been developing for the last year or so. LibCCK, or Clock Construction Kit, is a portable C99-based library providing a few core components for building time sync applications: a clock driver, a timestamping transport and an event timer, with more components coming eventually. A number of implementations are provided for each, and LibCCK comes bundled with a number of transports: Ethernet, IPv4 and IPv6 over regular sockets, Linux hardware timestamping, and libpcap. Libcck supports inet, inet6, ethernet and local address families which transports can implement. I am planning to add support for generic address families such as arbitrary32, arbitrary64, or similar, to allow for some more exotic protocols to be used. All mundane things such as managing address conversion to/from strings are taken care of.

There is a semi-well defined API and a new transport mostly needs to implement init(), shutdown(), sendMessage() and receiveMessage() to function. A new transport implementation is registered in a definition file, where other than the enum, name and address family, a set of capability flags is presented. These flags are checked against what the user has requested and what the given interface supports, and unless the user requested a specific implementation to be used, the default behaviour is to probe for the "best" available implementation using the given address family.

PTPd itself, or lib1588, has been made completely transport- and address-agnostic, so that when developing against IEEE 1588, one does not care what transport is used and can focus on the protocol itself. At this stage, neither lib1588 nor libcck is a standalone library yet and the two are not 100% decoupled, but I am getting there.
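To make the registration-and-probing idea more concrete, here is a minimal sketch of a table of transport implementations with capability flags, and a probe that returns the first entry matching the address family, the user's request and what the interface can provide. None of this is libcck's actual API: the type names, flag names, the separate "needs" field and the table entries are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capability flags, not libcck's real definitions */
enum {
    CAP_UNICAST   = 1 << 0,
    CAP_MULTICAST = 1 << 1,
    CAP_SW_TSTAMP = 1 << 2,
    CAP_HW_TSTAMP = 1 << 3
};

enum Family { FAM_INET, FAM_INET6, FAM_ETHERNET };

typedef struct Transport Transport;   /* opaque instance type */

/* One entry per implementation: identity, capabilities and the four calls */
struct TransportImpl {
    const char *name;
    enum Family family;
    uint32_t caps;    /* what the implementation offers */
    uint32_t needs;   /* what the interface must offer for it to work */
    int (*init)(Transport *self);
    int (*shutdown)(Transport *self);
    int (*sendMessage)(Transport *self, const void *buf, size_t len);
    int (*receiveMessage)(Transport *self, void *buf, size_t len);
};

/* Ordered from "best" to "worst"; function pointers omitted in this sketch */
static const struct TransportImpl implTable[] = {
    { .name = "hw_udpv4", .family = FAM_INET,
      .caps = CAP_UNICAST | CAP_MULTICAST | CAP_HW_TSTAMP,
      .needs = CAP_HW_TSTAMP },
    { .name = "sw_udpv4", .family = FAM_INET,
      .caps = CAP_UNICAST | CAP_MULTICAST | CAP_SW_TSTAMP, .needs = 0 },
    { .name = "sw_udpv6", .family = FAM_INET6,
      .caps = CAP_UNICAST | CAP_MULTICAST | CAP_SW_TSTAMP, .needs = 0 }
};

/*
 * Return the first implementation of the given family that offers every
 * capability the user asked for and whose requirements the interface meets,
 * or NULL if there is none.
 */
static const struct TransportImpl *
probeTransport(const struct TransportImpl *table, size_t count,
               enum Family family, uint32_t wanted, uint32_t ifaceCaps) {
    assert(table != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        const struct TransportImpl *t = &table[i];
        if (t->family != family) continue;
        if ((t->caps & wanted) != wanted) continue;
        if ((t->needs & ifaceCaps) != t->needs) continue;
        return t;
    }
    return NULL;
}
```

With this table, probing for an inet transport on an interface without hardware timestamping falls through to the software implementation, which is the kind of "best available" behaviour described above.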

Libdlpi, bufmod and pfmod: don't ENF_PUSH me or I'll ENF_POP you

A simple copy and some inline sed editing gave me a template for the two new transports, which compiled cleanly, and I could start turning them into libdlpi implementations. For the init() part we needed a regular inet socket, because unlike with Ethernet, if we do not listen on a port, the host will start sending ICMP unreachables. The socket will sit there listening, joining multicast groups if need be, and any data arriving on it will be discarded. The socket will also be used to send data - no point spending time crafting IP/UDP datagrams. Then for the DLPI part, the transport first has to open the capture device, bind it to a SAP (yes, SAP - luckily it accepts ethertypes), set the right promiscuous mode parameters, and push two modules onto the STREAMS stack: pfmod(7) (so we can filter packets) and bufmod(7) (we will actually disable buffering, but bufmod will provide us with timestamps). Then we will set timeouts, buffering etc., grab a select()-able file descriptor, dynamically build a packet filter expression, finally flush the dlpi handle to drop anything that arrived in the meantime, and that is that. No, wait, build a packet filter expression?
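The "listen and discard" socket described above can be sketched with plain POSIX calls. This is not the libcck code: the function names are invented, and multicast group membership (which would be added with setsockopt() and IP_ADD_MEMBERSHIP) is left out to keep the example short.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/*
 * Open a non-blocking UDP socket bound to the given port on all local
 * IPv4 addresses, so that the host accepts the traffic instead of
 * answering with ICMP port unreachable.
 * Returns the descriptor, or -1 on failure.
 */
static int openSinkSocket(uint16_t port) {
    struct sockaddr_in sa;
    int flags;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0) return -1;

    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        close(fd);
        return -1;
    }

    memset(&sa, 0, sizeof sa);
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl(INADDR_ANY);
    sa.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&sa, sizeof sa) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/*
 * Throw away everything currently queued on the socket; the real data
 * is read from the capture handle instead.
 * Returns the number of datagrams discarded.
 */
static int drainSinkSocket(int fd) {
    char scratch[2048];
    int dropped = 0;

    assert(fd >= 0);
    while (recv(fd, scratch, sizeof scratch, 0) >= 0) {
        dropped++;
    }
    return dropped;
}
```

The same descriptor can then be used with sendto() for transmission, while reception happens on the DLPI side.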

There is a simple set of human-readable filter expressions that snoop accepts, far less sophisticated than libpcap's, but adequate. If you have ever used libpcap, you will know that you pass it the filter expression string and get it to compile it to BPF bytecode. Here is the catch and the key "D'oh!" style highlight of this whole mini-project: unlike with libpcap, the support for filter expressions is only implemented in snoop itself, not in pfmod. Pfmod filtering uses bytecode directly, and pfmod's interpreter is much, much simpler than BPF's (I'm not calling it a virtual machine, it's a switch statement working on an array). Firstly, you operate on 16-bit words only, so you have to do masking to match on single bytes. The only operations available are to push the given word from the packet onto the stack, push a literal value onto the stack, set an offset, do a branch/jump, pop, or perform a logical operation, which can be combined with a push. This is it. All she wrote. No registers available to store anything. Forget about variable header lengths - you can only load a word from a constant offset, and if you want to shift the offset, it is only by a constant value. Snoop even implements its own interpreter as "user filters", which are engaged when the filter is more complex than what pfmod allows you to do. For libcck, I wrote some building blocks to match on bytes, words and whole data blocks, and finally, building on those, to match on ethertypes, VLANs and source/destination addresses, which resulted in the following code: dlpi_common.c. The downside is that fixed IPv4 and IPv6 header lengths are assumed, rather than respecting the reported header length.
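To give a feel for this style of filtering, here is a toy stack interpreter working on 16-bit words, with two helpers that emit "match a word" and "match a single byte via masking" sequences. It is a simplified model only: the opcodes, the instruction layout and the helper names are invented for this sketch and do not correspond to pfmod's ENF_* encoding or to the libcck building blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Invented opcodes for a toy model, NOT pfmod's real instruction set */
enum FilterOp { OP_PUSHWORD, OP_PUSHLIT, OP_EQ, OP_AND, OP_OR };

struct FilterInsn { enum FilterOp op; uint16_t arg; };

#define STACK_MAX 32

/* Append one instruction; the caller makes sure prog has room */
static size_t emit(struct FilterInsn *prog, size_t n, enum FilterOp op, uint16_t arg) {
    prog[n].op = op;
    prog[n].arg = arg;
    return n + 1;
}

/* Match a whole 16-bit word at an even byte offset */
static size_t emitMatchWord(struct FilterInsn *prog, size_t n, size_t byteOff, uint16_t value) {
    assert(byteOff % 2 == 0);
    n = emit(prog, n, OP_PUSHWORD, (uint16_t)(byteOff / 2));
    n = emit(prog, n, OP_PUSHLIT, value);
    return emit(prog, n, OP_EQ, 0);
}

/* Match one byte: load the word containing it, mask out the other half */
static size_t emitMatchByte(struct FilterInsn *prog, size_t n, size_t byteOff, uint8_t value) {
    int high = (byteOff % 2 == 0);   /* even offset = high byte of the word */
    n = emit(prog, n, OP_PUSHWORD, (uint16_t)(byteOff / 2));
    n = emit(prog, n, OP_PUSHLIT, high ? 0xFF00 : 0x00FF);
    n = emit(prog, n, OP_AND, 0);
    n = emit(prog, n, OP_PUSHLIT, (uint16_t)(high ? value << 8 : value));
    return emit(prog, n, OP_EQ, 0);
}

/*
 * Run the program against a packet seen as big-endian 16-bit words.
 * Returns 1 if the value left on top of the stack is non-zero,
 * 0 otherwise (including malformed programs and short packets).
 */
static int runFilter(const struct FilterInsn *prog, size_t proglen,
                     const uint8_t *pkt, size_t pktlen) {
    uint16_t stack[STACK_MAX];
    size_t sp = 0;

    assert(prog != NULL && pkt != NULL);
    for (size_t i = 0; i < proglen; i++) {
        uint16_t a, b;
        size_t off;
        switch (prog[i].op) {
        case OP_PUSHWORD:
            off = (size_t)prog[i].arg * 2;
            if (off + 1 >= pktlen || sp == STACK_MAX) return 0;
            stack[sp++] = (uint16_t)((pkt[off] << 8) | pkt[off + 1]);
            break;
        case OP_PUSHLIT:
            if (sp == STACK_MAX) return 0;
            stack[sp++] = prog[i].arg;
            break;
        case OP_EQ:
        case OP_AND:
        case OP_OR:
            if (sp < 2) return 0;
            b = stack[--sp];
            a = stack[--sp];
            if (prog[i].op == OP_EQ)       stack[sp++] = (uint16_t)(a == b);
            else if (prog[i].op == OP_AND) stack[sp++] = (uint16_t)(a & b);
            else                           stack[sp++] = (uint16_t)(a | b);
            break;
        }
    }
    return sp > 0 && stack[sp - 1] != 0;
}
```

Matching "IPv6 ethertype AND next header is UDP" on an untagged frame then takes nine instructions in this model: three for the word match, five for the masked byte match, and one AND to combine them. Note that the bitwise AND serves both as the mask operation and as the logical conjunction of two 0/1 results.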

The rest was easy, and here is how we set up a filter for all things IPv6 for example:

createFilterExpr()
static void
createFilterExpr(TTransport *self, struct pfstruct *pf) {

    CCK_GET_PCONFIG(TTransport, dlpi_udpv6, self, myConfig);

    /* skip VLAN tag if present */
    pfSkipVlan(pf);

    /* IP ethertype */
    pfMatchEthertype(pf, ETHERTYPE_IPV6);

    /* match my own source or my own destination */
    if(TT_UC(self->config.flags)) {
        pfMatchAddressDir(pf, &self->ownAddress, PFD_ANY);
    }

    /* OR on any multicast destinations */
    if(TT_MC(self->config.flags)) {
        bool first = true;
        if(myConfig->multicastStreams) {

            CckTransportAddress *mcAddr;
            LL_FOREACH_DYNAMIC(myConfig->multicastStreams, mcAddr) {

                pfMatchAddressDir(pf, mcAddr, PFD_TO);

                /* if we have no unicast, no need for OR after first mcast group, otherwise an OR */
                if(!first || TT_UC_MC(self->config.flags)) {
                    PFPUSH(ENF_OR);
                }
                first = false;
            }
        }
    }

    /* AND previous */
    PFPUSH(ENF_AND);

    /* UDP */
    pfMatchIpProto(pf, self->family, IPPROTO_UDP);

    /* AND previous */
    PFPUSH(ENF_AND);

    /* match UDP destination port: Ethernet, assumed 40 bytes of IPv6 header, 2 bytes into UDP */
    pfMatchWord(pf, TT_HDRLEN_ETHERNET + 40 + 2, htons(myConfig->listenPort));

    /* AND previous */
    PFPUSH(ENF_AND);

}

This produces a stack of 85 words for unicast. The C code for the IPv4 counterpart is pretty much identical, bar the ethertype and IP header offset (note to self: de-duplicate some code).
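The offset arithmetic behind the destination port match, and the limitation of assuming fixed header lengths, can be shown in a few lines. This is an illustrative sketch only: the constant names are mine, the 14-byte value assumes an untagged Ethernet header, and the second function shows what an interpreter with variable offsets could do, which a constant-offset filter cannot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    ETH_HDRLEN        = 14,  /* untagged Ethernet header (assumed) */
    IPV4_HDRLEN_FIXED = 20,  /* IPv4 header without options */
    IPV6_HDRLEN_FIXED = 40,  /* IPv6 header without extension headers */
    UDP_DSTPORT_OFF   = 2    /* destination port follows the source port */
};

/* Byte offset of the UDP destination port when header lengths are assumed */
static size_t udpDstPortOffsetFixed(int isIpv6) {
    return (size_t)(ETH_HDRLEN
                    + (isIpv6 ? IPV6_HDRLEN_FIXED : IPV4_HDRLEN_FIXED)
                    + UDP_DSTPORT_OFF);
}

/*
 * Byte offset of the UDP destination port in an IPv4 frame, honouring the
 * header length (IHL) field: the low nibble of the first IP byte, counted
 * in 32-bit words.
 */
static size_t udpDstPortOffsetIpv4(const uint8_t *frame, size_t len) {
    size_t ihlBytes;

    assert(frame != NULL && len > (size_t)ETH_HDRLEN);
    ihlBytes = (size_t)(frame[ETH_HDRLEN] & 0x0F) * 4;
    return (size_t)ETH_HDRLEN + ihlBytes + (size_t)UDP_DSTPORT_OFF;
}
```

For IPv6 the fixed calculation gives 14 + 40 + 2 = 56. For IPv4 it gives 36, which is only right when the header carries no options: with an IHL of 6 the port actually sits at offset 40, and a fixed-offset filter would be comparing the wrong word.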

This was the main challenge. After that, it was receiving the message, grabbing the bufmod timestamp (microsecond resolution only), capturing the source and destination addresses from protocol headers and getting hold of the payload itself. Finally I had a decent set of software timestamping transports for PTPd to run on Solaris 10. There is a chance this may work on Solaris 9, but I have no Sol9 hosts to hand at the moment.

One mildly interesting fact is that the libcck socket transports have native support for "naïve" timestamps in the absence of any socket timestamping options. This means that the transport calls self->clockDriver->getTime() when a message is received, and as soon as it is transmitted. This in turn calls whatever is available - clock_gettime() or gettimeofday(). The big surprise was that while with Linux, for example, this method proves absolutely useless, on Solaris, when running at high message rates, PTPd was actually able to achieve sub-10 us self-reported precision. Paradoxically, using this method, system call delay is incorporated into the mean path delay computed, so if this provided any level of guaranteed worst-case execution times under load, it would align the clock marginally better. I have not spent enough time testing this, but my bet is that while precision will be pretty awful, accuracy may not be as bad as one could suspect.
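The point about delay being absorbed into the mean path delay follows from the standard end-to-end delay request-response arithmetic, sketched below with integer nanoseconds and correction fields ignored. The struct and function names are made up for this example, and whether real system call latencies are symmetric enough for the cancellation to hold is exactly the open question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct E2eResult {
    int64_t offsetNs;         /* slave clock minus master clock */
    int64_t meanPathDelayNs;
};

/*
 * t1: Sync sent (master clock)      t2: Sync received (slave clock)
 * t3: DelayReq sent (slave clock)   t4: DelayReq received (master clock)
 */
static void computeE2e(int64_t t1, int64_t t2, int64_t t3, int64_t t4,
                       struct E2eResult *out) {
    int64_t masterToSlave = t2 - t1;
    int64_t slaveToMaster = t4 - t3;

    assert(out != NULL);
    out->meanPathDelayNs = (masterToSlave + slaveToMaster) / 2;
    out->offsetNs = (masterToSlave - slaveToMaster) / 2;
}
```

Take a true path delay of 30000 ns and a true offset of 500 ns: t1 = 0, t2 = 30500, t3 = 100000, t4 = 129500 gives back exactly those values. If the same 4000 ns of latency inflates both measured directions (t2 = 34500, t3 = 96000), the offset stays at 500 ns and the mean path delay becomes 34000 ns. If only the receive side is late (t2 = 34500, t3 = 100000), half of the error leaks into the offset, which comes out as 2500 ns.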

Building PTPd on Solaris 10

Please note that I have not tried building with Sun Studio. I built it with GCC and autotools from OpenCSW. OpenCSW is easy to start with, and chances are that you already have it in your system. It has been a while since I was last this intimate with Solaris. To give you an idea how long, I will just say that OpenCSW was not there yet; there was Sunfreeware and blastwave.org (now defunct).

Packages which needed installing were GCC (I went for GCC4), coreutils, binutils, libtool, m4, autoconf, automake and gmake. Git is also available, and can be used to grab the source from GitHub: wowczarek-2.3.2-libcck branch. The code archive can be downloaded from here: https://github.com/wowczarek/ptpd/archive/wowczarek-2.3.2-libcck.zip. To generate the configure script, you need to run the standard autoreconf -vi, then ./configure and finally /opt/csw/bin/gmake. Unless you add /opt/csw/bin to your $PATH, you have to pass CC=/opt/csw/bin/gcc MAKE=/opt/csw/bin/gmake to ./configure, and M4=/opt/csw/bin/gm4 to autoreconf, which then cannot find autoconf, so best just save yourself some time and add /opt/csw/bin to $PATH, even temporarily. Once a release is out, the tarball will come with a configure script, so configure + gmake will suffice.

This should build cleanly, although I have only tested the latest Sol10 update (U11). This was tested on x86 hardware - not that this matters for the build process.

Please note that I am not posting more detailed instructions here because, well, anyone comfortable with Solaris should have no issues with this. I am hoping to be able to add PTPd to the OpenCSW repository, to make this even easier.

Testing it

The initial testing was with a hardware PTP GM, over two non-PTP aware switch hops, gigabit, using a mix of copper and fibre, and running at a modest rate of 4/sec sync and 4/sec delay request. Once the slave stabilised, it pretty much settled on a sub-microsecond offset from master with some sub-50 ns mean path delay std dev:

Every 1.0s: cat /var/run/ptpd.status 2>/dev/null                   Wed May 24 19:29:38 2017

Host info          :  xxxxxxxxx, PID 25251
Local time         :  Wed May 24 19:29:38 BST 2017
Kernel time        :  Wed May 24 18:29:38 GMT 2017
Interface          :  bnx1, status: OK
Transport          :  dlpi_udpv4, unicast negotiation
PTP preset         :  slaveonly
Delay mechanism    :  E2E
Sync mode          :  ONE_STEP
PTP domain         :  30, default 30
Port state         :  PTP_SLAVE
Local port ID      :  xxxxxxfffexxxxxx(unknown)/1
Best master ID     :  xxxxxxfffexxxxxx(unknown)/1
Best master addr   :  xxxxxxxxxxxxxxxxx
Master priority    :  Priority1 0, Priority2 20, clockClass 6, localPref 0
Time properties    :  PTP timescale, tracbl: time Y, freq Y, src: GPS(0x20)
UTC properties     :  UTC valid: Y, UTC offset: 37
Offset from Master : -0.000000269 s, mean  0.000000338 s, dev  0.000000579 s
Mean Path Delay    :  0.000032884 s, mean  0.000032876 s, dev  0.000000039 s
PTP Clock status   :  calibrated, stabilised
Message rates      :  4/s sync, 4/s delay, 1/2s announce
Performance        :  Message RX 9/s 1212 Bps, TX 3/s 268 Bps
Announce received  :  4728
Sync received      :  37810
DelayReq sent      :  34019
DelayResp received :  34019
Domain Mismatches  :  477568
Denied Unicast     :  4
State transitions  :  3
PTP Engine resets  :  1

Clock 1: syst      :  * name:  syst         state: LOCKED    ref: PTP
Clock 1: syst      :  * offs:  0.000000269   adev: 23.701   freq: 8616.796

This was only a quick test, without a config file, using unicast and running defaults apart from engaging filters. All of the "domain mismatch" errors come from other PTP domains running in parallel. Unicast was used for no particular reason. The status file output was taken about two and a half hours after ptpd was started - but the performance holds up; it looked the same on the next day and I have started collecting statistics to show some more meaningful results. This was on a quiet network and with very little CPU load, but this performed exactly the way I would expect software timestamping to perform.

The command line I used was:

./src/ptpd -C -i bnx1 -d 30 --global:status_file=/var/run/ptpd.status --ptpengine:delay_outlier_filter_enable=y --ptpengine:sync_outlier_filter_enable=y --ptpengine:delay_outlier_filter_autotune_minthreshold=0.8 --ptpengine:sync_outlier_filter_autotune_minthreshold=0.8 --ptpengine:sync_outlier_filter_capacity=40 --ptpengine:delay_outlier_filter_capacity=40 --ptpengine:transport_mode=unicast --ptpengine:unicast_negotiation=y --ptpengine:unicast_destinations=x.x.x.x  --ptpengine:log_delayreq_interval=-4 --ptpengine:sync_outlier_filter_enable=y --ptpengine:delay_outlier_filter_enable=y --ptpengine:sync_outlier_filter_capacity=60 --ptpengine:delay_outlier_filter_capacity=60 --ptpengine:sync_outlier_filter_autotune_minthreshold=0.9 --ptpengine:delay_outlier_filter_autotune_minthreshold=0.9 --ptpengine:log_sync_interval=-4 --ptpengine:sync_stat_filter_enable=y --ptpengine:delay_stat_filter_enable=y --ptpengine:calibration_delay=20
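A note on reading the interval settings: as in IEEE 1588, message intervals are expressed as base-2 logarithms of seconds, so a negative value means more than one message per second. The helper below is a small illustrative sketch (the function name is mine) that converts such a value into a message rate.

```c
#include <assert.h>

/*
 * Convert a log2 message interval (in seconds) to messages per second:
 * interval = 2^logInterval seconds, rate = 1 / interval.
 */
static double messagesPerSecond(int logInterval) {
    double interval = 1.0;
    int i;

    assert(logInterval >= -30 && logInterval <= 30);
    for (i = 0; i < logInterval; i++) interval *= 2.0;  /* positive exponents */
    for (i = 0; i > logInterval; i--) interval /= 2.0;  /* negative exponents */
    return 1.0 / interval;
}
```

A value of -2 corresponds to 4 messages per second, -4 to 16 per second, and 1 to one message every two seconds, which is how the "1/2s announce" figure in the status output above reads.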

Closing remarks

Low-latency, DLPI-based transports as an alternative to sockets are nothing new. Veritas have done this for their clustering software for example, but nobody, to my knowledge, has used it for time sync on Solaris before. With PTPd's modular design, adding new transports was really easy. I would not call this task the biggest challenge I have come across, even if it was time-consuming and took a lot of digging - and the three evenings' worth of coding were definitely worth it, I think.

Does this mean that your dusty old Solaris 10 system can comply with MiFID II with this? Not necessarily - see the first section; but this definitely gets it closer. In the next post I will share some "proper" test results in an attempt to verify how well or how badly PTPd really performed on an x86 Solaris 10 machine - using some hardware-based measurements and longer measurement periods. Stay tuned.
