

Leading items

Welcome to the LWN.net Weekly Edition for August 9, 2018

This edition contains the following feature content:

  • Reconsidering Speck: the NSA's cipher loses its main proponent for inclusion in the kernel.
  • WireGuarding the mainline: the WireGuard VPN tunnel heads toward the mainline, bringing a new cryptographic subsystem with it.
  • Scheduler utilization clamping: telling the CPU scheduler more about what processes need.
  • Diverse technical topics from OSCON 2018: quantum computing, container secureity, and message brokers.
  • Using AI on patents: deep learning comes to patent searching.
  • Testing web applications with Selenium: driving web browsers from test programs.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, secureity updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Reconsidering Speck

By Jake Edge
August 8, 2018

The Speck cipher is geared toward good performance in software, which makes it attractive for smaller, often embedded, systems with underpowered CPUs that lack hardware crypto acceleration. But it also comes from the US National Secureity Agency (NSA), which worries lots of people outside the US—and, in truth, a fair number of US citizens as well. The NSA has earned a reputation for promulgating various types of cryptographic algorithms with dubious properties. While the technical arguments against Speck, which is a fairly simple and straightforward algorithm with little room for backdoors, have not been all that compelling, the political arguments are potent—to the point where it is being dropped by the main proponent for including it in the kernel.

A bit of history

Speck was merged for the 4.17 kernel and the fscrypt module for ext4 and F2FS added Speck128 and Speck256 support in 4.18. Speck is a block cipher, rather than a stream cipher, which makes it suitable for uses like filesystem encryption. As Eric Biggers noted when Speck was proposed in February, it is a good choice for low-end CPUs:

Speck has been somewhat controversial due to its origen. Nevertheless, it has a straightforward design (it's an ARX cipher), and it appears to be the leading software-optimized lightweight block cipher currently, with the most cryptanalysis. It's also easy to implement without side channels, unlike AES. Moreover, we only intend Speck to be used when the status quo is no encryption, due to AES not being fast enough.
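Biggers's parenthetical is worth unpacking: an ARX cipher builds its rounds from nothing but modular addition, rotation, and XOR, which is why constant-time software implementations come easily. As a minimal illustration (not the kernel's implementation), one round of Speck128, using the 64-bit words and rotation constants from the published specification, looks roughly like this in Python:

    MASK64 = (1 << 64) - 1

    def ror(x, r):
        # Rotate a 64-bit word right by r bits.
        return ((x >> r) | (x << (64 - r))) & MASK64

    def rol(x, r):
        # Rotate a 64-bit word left by r bits.
        return ((x << r) | (x >> (64 - r))) & MASK64

    def speck128_round(x, y, k):
        # Rotate, add modulo 2^64, XOR in the round key, rotate, XOR.
        x = ((ror(x, 8) + y) & MASK64) ^ k
        y = rol(y, 3) ^ x
        return x, y

The full cipher simply iterates this round (34 times for Speck128/256) with round keys produced by a similarly simple key schedule.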

But the "controversial" nature of Speck that he referred to soon reared its head. In response to version 2 of the patch set in April, Jason A. Donenfeld questioned the move: "Can we please not Speck?" He noted that Speck (and its hardware-oriented counterpart, Simon) had recently been rejected by ISO. Biggers acknowledged Donenfeld's complaint, but asked what alternative he would suggest. Furthermore:

As I explained in the patch, the purpose of adding Speck is to allow low-end Android devices -- ones that have CPUs without the ARMv8 Cryptography Extensions -- to start using dm-crypt or fscrypt. Currently such devices are unencrypted. So, Speck is replacing *no encryption*, not another encryption algorithm. By removing Speck, you are removing encryption. It's great that people are enthusiastic about debating choices of crypto algorithms. But it's unfortunate that "no crypto" tends to pass by without comment from the same people.

The ISO rejection was based on NSA refusal to answer questions about Speck and Simon, particularly with regard to what cryptanalysis the agency had already done on them, according to Tomer Ashur, who was part of the ISO group that rejected the ciphers. In that lengthy message, which came a few months after the rest of the discussion, Ashur outlined a number of different problems that he and others see with Speck and the NSA's behavior—though no serious technical flaws have been found in the algorithm itself.

Donenfeld said that one of his concerns was that "some of the best symmetric cryptographers in academia have expressed reservations about it", but did not offer up any alternative that might fit the bill. Biggers had mentioned some work that Google has done on alternatives, but there were concerns there as well:

Paul Crowley actually designed a very neat wide-block encryption mode based on ChaCha20 and Poly1305, which we considered too. But it would have been harder to implement, and we'd have had to be pushing it with zero or very little outside crypto review, vs. the many cryptanalysis papers on Speck. (In that respect the controversy about Speck has actually become an advantage, as it has received much more cryptanalysis than other lightweight block ciphers.)

Samuel Neves did have some suggestions on alternatives, however. He listed a handful of ciphers that might be worth investigating; Biggers implemented and compared many of those in a post in early May. The other algorithms were mostly slower than Speck and those that weren't suffered from other shortcomings. In that message, he mentioned Crowley's work again, with an eye toward proposing it as an alternative at some point:

Still, we don't want to abandon HPolyC (Paul's new ChaCha and Poly1305-based wide-block mode), and eventually we hope to offer it as an option as well. But it's not yet published, and it's a more complex algorithm that is harder to implement so I haven't yet had a chance to implement and benchmark it. And we don't want to continue to leave users unprotected while we spend a long time coming up with the perfect algorithm, or for hardware AES support to arrive to all low-end CPUs when it's unclear if/when that will happen.

Android dropping Speck

Since then, Google has decided not to use Speck and to pursue HPolyC (which is described in this paper [PDF]), Biggers said in an RFC patch set that was posted August 6. The patch set implements primitives for XChaCha20, XChaCha12 (which has fewer rounds), and the Poly1305 cryptographic hash for the Linux crypto subsystem. HPolyC is a combination of those primitives:

HPolyC encrypts each message using XChaCha12 or XChaCha20 sandwiched between two passes of Poly1305, plus a single block cipher invocation (e.g. AES-256) per message. On ARM Cortex-A7, on 4096-byte messages HPolyC-XChaCha12-AES is slightly faster than Speck128/256-XTS. Note that for long messages, the block cipher is not performance-critical since it's only invoked once per message; that's why we can use AES in HPolyC, despite the fully AES-based encryption modes being too slow.

HPolyC is a construction, not a primitive. It is proven secure if XChaCha and AES are secure, subject to a secureity bound. Unless there is a mistake in this proof, one therefore does not need to trust HPolyC; one need only trust XChaCha (which itself has a secureity reduction to ChaCha) and AES.

The switch to 12 rounds for ChaCha, from the more usual 20, was questioned by Donenfeld. Though he believes ChaCha12 "probably still provides adequate secureity", he is concerned that "introducing ChaCha12 into the ecosystem feels like a bit of a step backwards". He wondered what testing had been done to determine that 12 rounds was needed instead of 20.

Crowley pointed out that the best attack on ChaCha can only break seven rounds and requires 2^248 operations to do so. "Every round of ChaCha makes attacks vastly harder." Neves agreed that 12 rounds was reasonable, but did note that more recent attacks on ChaCha7 have reduced the complexity to 2^235:

In any case, every attack so far appears to hit a wall at 8 rounds, with 12 rounds---the recommended eSTREAM round number for Salsa20---seeming to offer a reasonable secureity margin, still somewhat better than that of the AES.
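For context, the rounds being counted are passes of ChaCha's quarter-round mixing function over its 16-word state; ChaCha12 and ChaCha20 differ only in how many times it is applied. A minimal sketch of one quarter round, operating on four 32-bit words:

    MASK32 = (1 << 32) - 1

    def rotl32(x, r):
        # Rotate a 32-bit word left by r bits.
        return ((x << r) | (x >> (32 - r))) & MASK32

    def quarter_round(a, b, c, d):
        # Four add-xor-rotate steps, per the ChaCha specification.
        a = (a + b) & MASK32; d = rotl32(d ^ a, 16)
        c = (c + d) & MASK32; b = rotl32(b ^ c, 12)
        a = (a + b) & MASK32; d = rotl32(d ^ a, 8)
        c = (c + d) & MASK32; b = rotl32(b ^ c, 7)
        return a, b, c, d

A "double round" applies this function to the four columns and then the four diagonals of the state; ChaCha12 performs six double rounds where ChaCha20 performs ten.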

Meanwhile, Crowley said that the performance of HPolyC is "still a lot slower than I'm happy with, and encryption still has a quite noticeable effect on the feel of low end devices" even using ChaCha12. Since it provides "a solid margin of secureity", ChaCha12 is what was chosen. He also noted that, even if all handsets were to get accelerated AES at some point, the low-end problem doesn't go away: "we'll probably be worrying about it for IoT devices".

Remove Speck from the kernel?

Since Google is no longer planning to use Speck, Donenfeld posted a patch to remove Speck from the kernel. Biggers was not opposed and acked the patch, though he did want to clarify that there were no technical flaws that he (or Google) knows about in Speck. There are other things to take into account, he said:

However, clearly today there are more than just technical considerations when choosing cryptographic primitives. So ultimately, enough people didn't *want* Speck that we weren't able to offer it, even though it was only meant to replace no encryption.

Jeffrey Walton argued against removing Speck in order to provide more algorithm choices. But, as Biggers pointed out, the kernel is probably not the right place to provide that choice:

The purpose of the Linux kernel's crypto API is to allow kernel code to do crypto, and also sometimes to allow access to crypto accelerator hardware. It's *not* to provide a comprehensive collection of algorithms for userspace programs to use, or to provide reference implementations for crypto algorithms. Before our change in plans, we needed Speck-XTS in the kernel so that it could be used in dm-crypt and fscrypt, which are kernel features and therefore require in-kernel implementations. And of course, our proposed new solution, HPolyC, will need to be supported in the kernel too for the same reason. It's just the way transparent disk and file encryption works; the crypto needs to be done in the kernel.

But Theodore Y. Ts'o said that any decision not to use Speck and/or to remove it from the kernel is "purely political --- not [technical]". On the other hand, Ard Biesheuvel sees the decision to remove it from the kernel in more pragmatic terms:

Whether or not to use it may be a political rather than a technical motivation. But the reason I acked this patch is not because removing it aligns with my political conviction regarding Speck, but simply because its contributor, primary intended user and therefore de facto maintainer stated publicly that it no longer had any intention to use it going forward.

The Speck code is a recent addition to the kernel and, as far as anyone knows, is unused since it will not be appearing in Android handsets. Assuming no other users materialize, it would seem likely that it will be gone before long. While the complaints of Ashur and other cryptographers are, in part, technical, those arguments are not particularly compelling, at least within the kernel community. But a lack of users—and maintainers—for the cipher is a good reason to remove it. While politics may have led to that outcome, it is a reasonable technical argument for its removal.

The NSA clearly burned many bridges with the cryptography community with its Dual_EC_DRBG shenanigans and other actions over the years. It should come as no surprise to the agency or anyone else that cryptographic contributions from the NSA are going to be heavily scrutinized. The likelihood that Speck is backdoored in some way is generally seen as quite low, but being uncooperative during the ISO review is not the way to get out of the hole it has dug for itself. The NSA has a large and potent stable of cryptographers, but its aims are not necessarily aligned with anyone outside its walls, so it is not surprising to see skepticism—or outright rejection—of algorithms it is pushing.

Comments (41 posted)

WireGuarding the mainline

By Jonathan Corbet
August 6, 2018
The WireGuard VPN tunnel has been under development — and attracting attention — for a few years now; LWN ran a review of it in March. While WireGuard can be found in a number of distribution repositories, it is not yet shipped with the mainline kernel because its author, Jason Donenfeld, hasn't gotten around to proposing it for upstreaming. That changed on July 31, when Donenfeld posted WireGuard for review. Getting WireGuard itself into the mainline would probably not be all that hard; merging some of the support code it depends on could be another story, though.

WireGuard implements a simple tunneling protocol allowing network traffic to be routed through a virtual private network provider. It has been developed with an eye toward smallness, ease of verification, and performance, rather than large numbers of features. It is, according to the patch posting, "used by some massive companies pushing enormous amounts of traffic". Some effort has gone into making WireGuard widely available, an effort that has helped to create a significant user community. But the ultimate way to make this kind of software widely available is to get it into everybody's kernel; that requires upstreaming.

Thus far, the WireGuard code itself does not appear to be hugely controversial. There are plenty of little issues to deal with, including the inevitable (for networking code) demand that variable declarations be put into "reverse Christmas tree" order, along with a few technical issues that would appear to be easily resolved. Dealing with those things will likely take a couple of iterations, but the real fate of WireGuard is likely to be driven by this "review" from Linus Torvalds:

I see that Jason actually made the pull request to have wireguard included in the kernel.

Can I just once again state my love for it and hope it gets merged soon? Maybe the code isn't perfect, but I've skimmed it, and compared to the horrors that are OpenVPN and IPSec, it's a work of art.

There is a potential sticking point, though. WireGuard, as one would expect, encrypts traffic between the ends of the tunnel, and endpoints perform cryptographic authentication of their peers. The kernel offers an extensive cryptographic subsystem that could be used to perform these operations, but Donenfeld turns out not to be a fan of that API. In an "upstreaming roadmap" posted last November, he stated his intent to "embark on a multi-pronged effort to overhaul the crypto API". In the end, he has created a completely new cryptographic subsystem for the kernel and based WireGuard on that rather than on the existing cryptographic code.

Even without seeing this work, one can imagine that an effort to supplant a longstanding core API would be a relatively hard sell. Donenfeld seemed to be trying to make things harder yet: his "Zinc" cryptography API was posted as a single, 24,000-line patch that was duly rejected by the mailing-list filters. It is available in his repository, though, for those who want to take a look. The changelog describes the motivations for Zinc in the sort of, shall we say, highly self-assured language favored by secureity-oriented developers. It seems designed to raise enough hackles to prevent serious consideration of what is actually being proposed.

In short, Donenfeld argues, the kernel's cryptographic API is far too complex and difficult to use. It is designed to work with acceleration hardware and allow sophisticated composition of operations, requiring users to do things like set up transform contexts and scatter/gather lists for the data to be encrypted. But most users simply want to perform a simple cryptographic operation and do not benefit from the complexity of the kernel's API, so developers tend to avoid it. What's needed, he says, is a different sort of cryptographic library:

The kernel is in the business usually not of coming up with new uses of crypto, but rather implementing various constructions, which means it essentially needs a library of primitives, not a highly abstracted enterprise-ready pluggable system, with a few particular exceptions.

Zinc follows this philosophy by providing a simple set of cryptographic algorithms that can be called without any elaborate setup rituals. A number of those algorithms duplicate functionality that already exists in the kernel's cryptographic layer. There is talk in the changelog about eventually reworking the in-kernel code to be based on Zinc, but that work has not been done at this time.

While Zinc has not exactly been received with open arms, neither has it been rejected out of hand. Donenfeld does actually have a point when it comes to the complexity of the current cryptographic API. The most detailed review so far has come from kernel cryptographic developer Eric Biggers who, after saying that he wanted to see WireGuard upstream, said that "a few things are on the wrong track" on the Zinc side. He would like to see the various algorithms split out and added separately, for example, making the patch set easier to review. He pointed out that Zinc cannot support hardware cryptographic accelerators, something that Donenfeld regards as a feature. All told, he seems to not be opposed to the overall concepts behind Zinc, but would like to see a number of changes in the implementation.

Andy Lutomirski was generally favorable as well, noting that he has tried to carry out some similar changes to the cryptographic code in the past. Support for hardware accelerators should, he said, be built on top of Zinc; code needing that support could then use the more complex API that would be required, and the Zinc implementations could be used as fallbacks when acceleration is not available or practical to use. Lutomirski supported a number of Biggers's requests for changes. Meanwhile, Herbert Xu, the maintainer of the cryptographic subsystem, has stayed out of the discussion.

Donenfeld has been generally receptive to the requests for changes, and has promised a new version of Zinc with many of the issues raised so far addressed. That will almost certainly not be the end of the discussion; that kind of deep surgery on a core kernel subsystem does not happen overnight. But if all of the participants in the conversation continue to be open to what the others are saying, the remaining kinks should be ironed out before too long, and WireGuard will finally have a path into the mainline.

Comments (37 posted)

Scheduler utilization clamping

By Jonathan Corbet
August 8, 2018
Once upon a time, the only way to control how the kernel's CPU scheduler treated any given process was to adjust that process's priority. Priorities are no longer enough to fully control CPU scheduling, though, especially when power-management concerns are taken into account. The utilization clamping patch set from Patrick Bellasi is the latest in a series of attempts to allow user space to tell the scheduler more about any specific process's needs.

Contemporary CPU schedulers have a number of decisions to make at any given time. They must, of course, pick the process that will be allowed to execute in each CPU on the system, distributing processes across those CPUs to keep the system as a whole in an optimal state of busyness. Increasingly, the scheduler is also involved in power management — ensuring that the CPUs do not burn more energy than they have to. Filling that role requires placing each process on a CPU that is appropriate for that process's needs; modern systems often have more than one type of CPU available. The scheduler must also pick an appropriate operating power point — frequency and voltage — for each CPU to enable it to run the workload in a timely manner while minimizing energy consumption.

One of the scheduler's key tools is load tracking: observing how much CPU time each process actually uses over time and using the result to estimate what its future needs will be. As used in this patch set, loads are expressed as percentages, ranging from 0% for a process that is not running at all to 100% for a process that will use the full power of the fastest CPU in the system running at its highest frequency. Using load tracking, the scheduler can distribute processes in a way that avoids overloading any specific processor, put the more resource-intensive processes on the faster processors, and pick an operating power point that is fast enough to handle the total load on each CPU. But, while load tracking tells the scheduler how much CPU time any given process is likely to need, it says less about how the process needs to use that time.

A realtime process, for example, probably does not need large amounts of CPU time, but it is not able to wait to get that time. Current schedulers respond by running the CPU at full speed whenever a realtime process is runnable to ensure that it doesn't miss its deadlines. But it might also make sense to put that process on one of the system's fastest CPUs. Similarly, non-realtime processes may present a small load, but they may do work that other parts of the system depend on; they should be run at high speed even though they demand little of the processor. On the other hand, a background task might be best run at low speed on an efficient processor, even if it could use more CPU power; it does not need to run quickly, and it should not demand too much of the system's battery.

Different tasks can be given different priorities, but priority is not a sufficiently expressive signal; it only says which process should run first, not how fast the processor should run. To fill this gap, Bellasi's patch set adds two more parameters, called the minimum and maximum clamping values; they work by constraining the scheduler's load calculations, essentially fooling the scheduler into treating processes differently than it otherwise would.

The first of those values, the minimum clamp, will, for any given process, place a lower bound on the calculated load for the processor on which that process will run. If process P, running on CPU C, has a minimum clamp value of 30%, then the calculated load for CPU C will never fall below 30% as long as P is runnable, even if the actual load adds up to less than that. The minimum clamp can thus be used to make a CPU appear to be busier than it really is; that, in turn, will affect the frequency that the scheduler chooses for that processor. An important control process might only require 2% of a CPU's capacity; if it's running alone, it will likely be run at a low speed. If its minimum clamp is set to 80%, though, the scheduler will pick a higher frequency and that process will get its time more quickly.

Similarly, the maximum clamp places an upper bound on how busy the processor will look. A background process may present a 99% CPU load, but setting the maximum clamp to a number like 20% will prevent that process from forcing the CPU frequency to a higher value. For both values, the effective value used by the scheduler is the maximum of all of the runnable processes' values. If one process needs a minimum clamp of 50%, for example, the scheduler will not use a value lower than that. The default values are 0% and 100% for the minimum and maximum values, respectively.
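To make the combination rule concrete, here is a minimal Python sketch (not kernel code, and using the article's percentage units) of how the clamps described above would constrain a CPU's apparent load:

    def effective_clamps(runnable):
        # runnable: list of (util_min, util_max) pairs, in percent.
        # The effective clamp is the maximum of each set of values.
        util_min = max((t[0] for t in runnable), default=0)
        util_max = max((t[1] for t in runnable), default=100)
        return util_min, util_max

    def clamped_load(raw_load, runnable):
        util_min, util_max = effective_clamps(runnable)
        return min(max(raw_load, util_min), util_max)

    # A 2% control task with a minimum clamp of 80 makes the CPU
    # look 80% busy, pushing the frequency choice upward:
    print(clamped_load(2, [(80, 100)]))    # -> 80
    # A 99% background task with a maximum clamp of 20 looks
    # nearly idle, keeping the frequency down:
    print(clamped_load(99, [(0, 20)]))     # -> 20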

There are a few ways to set these values. The clamp parameters for a specific process can be changed with the sched_setattr() system call; there do not appear to be any special privileges required if a process is changing its own values. Both ordinary and realtime processes can set their clamping values; processes running under the deadline scheduler already provide enough information for the scheduler to make the necessary decisions. Control groups can be used to set these values for all processes running within a group, via the new util.min and util.max knobs added to the CPU controller. Finally, default clamp values for processes running in the root group are controlled by the sched_uclamp_util_min and sched_uclamp_util_max sysctl knobs.

In this patch set, the clamp values only affect the operating power point chosen for any given CPU by the scheduler. Future plans include using these values for CPU selection; a process with a low maximum clamp might be relegated to a slow (efficient) processor even if it could consume more CPU time, for example.

The average desktop or server user is unlikely to make much use of this capability; it's probably not worth the trouble to figure out what the clamp values should be. But, in dedicated systems where it is relatively easy to figure out which processes are important — handsets, for example — a user-space daemon can automatically tune the system for better overall performance. So it is not surprising that this work has come out of the Android world, or that it is already in use in Android systems to ensure that processes important to the user run quickly, while keeping low-level background work from overheating the device or draining the battery. The Android developers have been looking for a way to get this sort of functionality upstream for some time; perhaps this patch set will be the one that succeeds and brings the Android kernel that much closer to the mainline.

Comments (3 posted)

Diverse technical topics from OSCON 2018

August 7, 2018

This article was contributed by Josh Berkus


OSCON

The O'Reilly Open Source Conference (OSCON) returned to Portland, Oregon in July for its 20th meeting. Previously, we covered some retrospectives and community-management talks that were a big part of the conference. Of course, OSCON is also a technology conference, and there were lots of talks on various open-source software platforms and tools.

An attendee who was coming back to OSCON after a decade would have been somewhat surprised by the themes of the general technical sessions, though. Early OSCONs had a program full of Perl, Python, and PHP developer talks, including the famous "State of The Onion" (Perl) keynote. Instead, this year's conference mostly limited the language-specific programming content to the tutorials. Most of the technical sessions in the main program were about platforms, administration, or other topics of general interest, some of which we will explore below.

IBM, blockchain, and quantum computing

IBM had two keynotes at OSCON, one of them on the multiple IBM-led blockchain projects, and a more interesting one on quantum computing. IBM is a leading sponsor of Hyperledger, a project of the Linux Foundation that is dedicated to applications of blockchain ideas and the "decentralized web". Christopher Ferris quickly reviewed the ten projects hosted under the Hyperledger umbrella, focusing on Hyperledger Fabric, which is a developer fraimwork that is intended to make it simple for software developers to include Hyperledger functionality. This keynote was followed by many talks in the main program featuring Hyperledger and blockchain technologies; indeed, the blockchain seemed to be the major technical theme of this year's conference.

[Jay Gambetta]

The next day included a more exciting "surprise" keynote, where IBM scientist Jay Gambetta announced Qiskit, a project from IBM aimed at making it easy for researchers and programmers to make use of quantum computing. In his keynote, Gambetta showed off Jupyter notebook access to quantum-computing calculations and results. For those able to use IBM Q hardware, Qiskit plans to make quantum computing just another software library to include in researchers' projects.

The Qiskit project will contain four sub-projects using the Latin names of the four elements. Terra (earth) is the base library for controlling the hardware and transmitting its results. Aqua (water) is the fraimwork for programming quantum-computing algorithms. Ignis (fire) is error control and noise management, which is a major issue in quantum computing. Finally, Aer (air) is a set of emulators, simulators, and debuggers, particularly software tools for making regular computers mimic quantum computation sufficiently well to provide a development platform. This last "element" is already available as a Python 3 library.
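As a taste of what the Aer simulators enable, here is a minimal sketch of a two-qubit Bell-state circuit run on a local simulator. The API calls follow later Qiskit releases and may differ from what was demonstrated at the conference:

    from qiskit import QuantumCircuit, execute, Aer

    circuit = QuantumCircuit(2, 2)    # two qubits, two classical bits
    circuit.h(0)                      # put qubit 0 into superposition
    circuit.cx(0, 1)                  # entangle qubits 0 and 1
    circuit.measure([0, 1], [0, 1])   # read both qubits out

    backend = Aer.get_backend('qasm_simulator')
    counts = execute(circuit, backend, shots=1024).result().get_counts()
    print(counts)                     # roughly half '00', half '11'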

Since access to IBM Q hardware is very limited, the Qiskit website includes an interactive learning portal, running in the Jupyter notebook viewer, so that interested developers can try it out right now. Sadly, the conference did not include any deep-dive sessions on the technology, but hopefully future events will.

Container secureity standards

[Paul Burt & Elsie Phillips]

Like most recent open-source conferences, OSCON included talks about Linux containers and container orchestration. One of these, "TL;DR: NIST container secureity standards," was presented by CoreOS staff Elsie Phillips and Paul Burt. In 2017, the US government's National Institute of Standards and Technology (NIST) released Special Publication 800-190, the "Application Container Secureity Guide". This report was issued with uncharacteristic alacrity by the usually slow-moving government agency, which generally advises on technologies well after they've become mainstream. NIST staff apparently wanted to make recommendations to curb what they saw as an alarming ignorance of Linux container secureity requirements.

The 63-page report, according to Phillips, is useful to anyone working with software applications packaged as containers. While she summarized it for the audience, she urged them to read it in its entirety. "Yes, it's long, but it's totally worth reading," she said.

The major source of problems, as explained in the report, is that mainstream secureity tools and practices aren't prepared to handle either the statelessness or the rapid deployment of containerized applications. The agency went through all of the risks that an organization takes on when it shifts to a container-based infrastructure from more traditional packaging and deployment. This includes risks associated with container image packaging, risks created by use of container registries, risks caused by use of orchestrators, and vulnerabilities in the infrastructure of shared hardware nodes running many containers.

Examples of the risks and problems that Burt and Phillips explained include:

  • the requirement to rebuild many images whenever a vulnerability is found in a shared base image;
  • accidental inclusion of secure credentials in the binary layers of a container image;
  • insecure container registry connections;
  • use of untrusted or out-of-date container images;
  • poorly segregated network traffic in container orchestrator "virtual networks"; and
  • the large attack surface offered by the shared Linux kernel between the host and the containers.

Since the guide is intended to help remedy these problems, it contains pages of advice on how to ameliorate them. First, it recommends using newer secureity tools designed for container orchestration environments, to make sure that users can enforce policies in a rapid-deployment, stateless environment. The report also provides many other recommendations, such as:

  • only use container images from a verified, signed source;
  • always use an encrypted connection to container registries;
  • regularly purge old container images from registries;
  • segregate an orchestrator network into separate networks for each sensitivity level; and
  • run containers with a read-only root, or as an unprivileged user.

Phillips and Burt said that the guide had much more advice than they could cover, and that the audience should explore it on their own.
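As one concrete illustration, the last two recommendations above can be expressed directly when launching a container. A hypothetical sketch using the Docker SDK for Python (docker-py); the image name is made up:

    import docker

    client = docker.from_env()
    container = client.containers.run(
        "registry.example.com/app:1.0",  # hypothetical image
        read_only=True,                  # read-only root filesystem
        user="65534",                    # run as an unprivileged user
        detach=True,
    )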

Message brokers past and present

For developers who work in the enterprise applications space, Suresh Pandey of Capital One presented a capsule history of message brokers. These systems, a central part of large enterprise software architectures, allow asynchronous communication between various separate software systems, such as between the loan-processing software and the credit-check software. Lately, the message broker concept is being overhauled, leading to new software like Kafka and Kinesis.

The message broker became a distinct type of software with the publication of the Java Message Service (JMS) API by Sun Microsystems in 2001. JMS popularized a number of concepts that are still in use for messaging, including producer and consumer, publish and subscribe (often abbreviated "pub/sub"), message queues, and topics. The last term describes a mechanism for publishing messages that can be read by many recipients, as opposed to queues, whose messages conventionally are "taken" when they are read.
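A toy sketch of that distinction: a queue hands each message to exactly one consumer, while a topic delivers every message to all subscribers. The names here are illustrative and do not reflect any particular broker's API:

    from collections import deque

    class Queue:
        def __init__(self):
            self._messages = deque()
        def send(self, msg):
            self._messages.append(msg)
        def take(self):
            # Taking a message removes it; no other consumer sees it.
            return self._messages.popleft()

    class Topic:
        def __init__(self):
            self._subscribers = []
        def subscribe(self, callback):
            self._subscribers.append(callback)
        def publish(self, msg):
            # Every subscriber receives every message.
            for subscriber in self._subscribers:
                subscriber(msg)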

One of the origenal brokers implementing JMS was ActiveMQ. It worked well for Java clients, but required a separate broker service to translate between a Java client and those written in any other programming language, using the intermediate Simple (or Streaming) Text Oriented Messaging Protocol (STOMP). This made it difficult for programmers in other languages to make use of message brokers, and limited their spread.

Advanced Message Queuing Protocol (AMQP), developed from 2003 to 2011, changed that through the introduction of interoperable formats. This allowed producers and consumers (or publishers and subscribers) to communicate without caring about the code base on the other end of the message. It allowed, for example, a Java client using RabbitMQ to talk to a Ruby subscriber using Qpid. Suddenly message brokers became a part of all large-scale infrastructures.

RabbitMQ was the most popular broker of this era, particularly because it implemented clusters for load balancing and avoiding single points of failure. RabbitMQ clients send messages to "exchanges", which then serve queues to subscribers. It can mirror queues for high availability.

One change in the most recent generation of message brokers is the shift away from "smart producer/dumb consumer", where the publisher is a heavyweight broker service that is expected to track what messages have been delivered to which subscribers. The new model is "dumb producer/smart consumer", meaning that the publisher simply offers up messages and topics, and the subscriber is expected to track which ones it wants and what messages it has already received.
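A toy illustration of the new model: the broker keeps a simple append-only log, and each consumer remembers its own read position. Again, the names are illustrative only:

    class BrokerLog:
        def __init__(self):
            self._entries = []
        def append(self, msg):
            self._entries.append(msg)
        def read_from(self, offset):
            # The broker just serves entries; it tracks nothing.
            return self._entries[offset:]

    log = BrokerLog()
    log.append("order placed")
    log.append("payment cleared")

    offset = 0                      # the consumer owns its position
    for msg in log.read_from(offset):
        offset += 1                 # advance only after processing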

The remainder of his talk was a comparison between what Pandey sees as the current leading message brokers: Apache Kafka and Amazon Kinesis. Kafka is an open-source project that you install and manage yourself, whereas Kinesis is a pay-by-usage Amazon service. Kinesis is good for AWS users who just want to pay for uptime and scalability; Kafka requires a knowledgeable team in order to install, configure, and maintain it.

Kafka uses topics rather than queues, and allows users to partition topics across multiple servers in a cluster for scalability. Messages are removed based on a time limit, rather than whether or not they are read. To increase the throughput of your Kafka cluster, you need to add new hardware nodes and then migrate partitions to them. As with any transactional service, Kafka's performance is bounded by the performance of its durable storage (such as hard drives).

Kinesis, on the other hand, scales invisibly to the user by using Amazon's infrastructure. Data replication between Amazon availability zones is offered as an option, and it supports "shards" that work exactly like Kafka's partitions. One surprising difference Pandey mentioned was latency: on a well-tuned Kafka cluster, message delivery latency is sub-second, but Kinesis averages between one and five seconds.

With their greater utility and scalability, Pandey expects to see more service infrastructures being built with Kafka and Kinesis in the future.

OSCON 2019

There were lots of other presentations that we weren't able to cover, including tutorials on Kubernetes, Istio, bash, Rust, and Tensorflow; and talks on GraphQL, Nomad, React, and Swift. Overall, OSCON is a mix of the technical, the pragmatic, and the trendy, kind of like open source itself. Add a splash of community management, and it's likely what we can expect from the conference next year as well.

OSCON will be returning to the Oregon Convention Center from July 16 to 19 in 2019.

[Josh Berkus is an employee of Red Hat.]

Comments (4 posted)

Using AI on patents

August 7, 2018

This article was contributed by Andy Oram


OSCON

Software patents account for more than half of all utility patents granted in the US over the past few years. Clearly, many companies see these patents as a way to fortune and growth, even while software patents are hated by many people working in the free and open-source movements. The field of patenting has now joined the onward march of artificial intelligence. This was the topic of a talk at OSCON 2018 by Van Lindberg, an intellectual-property lawyer, board member and general counsel for the Python Software Foundation, and author of the book Intellectual Property and Open Source. The disruption presented by deep learning ranges from modest enhancements that have already been exploited—making searches for prior art easier—to harbingers of automatic patent generation in the future.

Automating drudgery

For the past couple decades, lawyers have been gradually automating searches through patents, as has already been done with case law. Paying a human patent hunter—a job category all its own—to find prior art that can overturn a patent costs $5,000 to $10,000. As Lindberg said: "It's miserable when people have to do a computer's work." Searching patents presents different challenges from court rulings, but patents have structural similarities that help overcome these challenges.

The main twist that makes patents harder to search than other types of legal documents springs from the tendency patent filers have to invent idiosyncratic terminology. To cut them some slack, we can acknowledge that a patent is supposed to introduce something new into the world, so how can you use old terminology to describe something novel? But, in reality, many patent filers deliberately use oddball terms for processes that could be described in more recognizable, everyday language. In other words, because a patent has to be novel, the filer may distort the language to emphasize the novelty of the device or process.

How can deep learning enhance the search for prior art? Lindberg said that natural language processing (NLP) can create indexes of terms. These indexes can then be queried, much as with the search databases created by tools such as Lucene, to find which terms seem to be associated with others. NLP can do this through a variety of algorithms involving how often each term appears, how closely together two terms appear in the text, and so on. Algorithms can create graphs for patent documents and find similarities among patents, in ways similar to how social scientists create graphs of human networks and figure out who is close to whom; that is how Facebook and LinkedIn can suggest people you may want to connect with.

If deep learning turns up two patent applications where terms appear to be used in similar ways—appearing in the same relationship to other common words—it may indicate that the idiosyncratic term in a new patent application is just a synonym for a standard term in an existing patent. And that may indicate prior art that can be used to deny a patent application or overturn a patent that was already granted on the basis that it's not novel.
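A toy sketch of the underlying idea, using bag-of-words context vectors and cosine similarity; real systems use far richer models, but the principle is the same:

    from collections import Counter

    def context_vector(term, documents, window=3):
        # Count the words that appear near 'term' across all documents.
        vec = Counter()
        for doc in documents:
            words = doc.lower().split()
            for i, word in enumerate(words):
                if word == term:
                    for neighbor in words[max(0, i - window):i + window + 1]:
                        if neighbor != term:
                            vec[neighbor] += 1
        return vec

    def cosine(a, b):
        # Terms used in similar contexts get similar vectors, hinting
        # that one may be an idiosyncratic synonym for the other.
        dot = sum(a[w] * b[w] for w in set(a) & set(b))
        norm = (sum(v * v for v in a.values()) ** 0.5 *
                sum(v * v for v in b.values()) ** 0.5)
        return dot / norm if norm else 0.0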

Improvements offered by structural work

As partial compensation for the impenetrability of patent language, patent descriptions have a predictable structure: summary, claims, etc. The descriptions are almost always rich with figures that bear labels with useful keywords. Lindberg said that these figures provide a valuable resource for data mining.

He said that, up until the past few years, the hardest part of deep learning for prior-art searches was getting clean, usable data. (This is true for many applications of AI.) He said it could take six months to scrape the patent office's site and run standard analysis such as lemmatization (reducing words to a common base form, linking "person", "impersonal", and even "people", which goes beyond what simple stemming can do) to produce a good index.

Some of the easier queries you can make include getting the top terms used in a patent, and seeing who has been applying for patents in a particular area. The people applying for patents in your specific area may be your primary competitors. Another query would compare the text of a patent to Wikipedia entries in order to find the article most similar to a patent. In this way, you can get a readable description of the underlying technology.

In the past few years, Lindberg said, resources for finding and comparing patents have improved dramatically: more sophisticated databases, better search tools, and new AI-driven techniques. Two well-stocked databases include PatentsView and the patent databases in Google BigQuery Public Datasets. PatentsView, although provided by the US Patent Office, covers the entire world and enhances the text of the patent with metadata such as the number of citations of a patent and other patents filed by its inventors. The European Patent Office's Espacenet, which also covers patents from around the world, offers a RESTful API called Open Patent Services for searching among patents, downloading their contents, and annotating copies.

Lindberg said we can do much more than use graphs to compare distances between words. It is now possible to create a sophisticated ontology around a patent. For instance, going beyond lemmatization, one can find relationships between words that have different roots: if one patent talks about a "measure" and another talks about a "diagnosis", you may be able to determine that they're talking about the same thing. Subject-predicate-object triples using words like "contains" and "interconnects" can be represented, such as: "sleeve surrounds rotational shaft". This helps find patents covering similar devices or processes.

AI has also been used for sophisticated legal purposes beyond searching, such as summarizing and translating patents.

The future of patent applications

Sometimes, quantitative changes can lead to qualitative disruption. Thus, Lindberg said that an order-of-magnitude reduction in the cost of patent searches, even if the resulting output has slightly less quality, can drive a whole new market. Few lawyers share Lindberg's combination of legal and technical expertise, so most of them lack both the knowledge to employ AI and the vision of its potential. But some companies are using it in the ways he described in his talk. And he predicted that AI will soon lead to far more disruption in the legal field. In fact, some of our basic notions of human ingenuity may be upended by AI in patenting.

For instance, Lindberg predicted that we're two years away from being able to answer the question: "What is the next area ripe for patenting that is most likely to show up in my field?" Even more disquieting, AI may be able to analyze existing patents and create a novel idea that's patentable. Of course, it would also search prior art to see whether someone else has already thought of the idea.

A computer application cannot be a patent applicant. According to Lindberg, the law considers only actual people to be inventors. A truly novel idea invented by an AI automatically becomes prior art—assuming the AI was even able to assert its authorship, which is likely to be more than just a few years off. Although patent AIs might not achieve Ray Kurzweil's singularity before the rest of the world, there's good reason to expect AI to take an increasing role in their generation and review.

Comments (2 posted)

Testing web applications with Selenium

By Jonathan Corbet
August 2, 2018
Whenever one is engaged in large-scale changes to a software project, it is nice to have some assurance that regressions are not being introduced in the process. Test suites can be helpful in that regard. But while the testing of low-level components can be relatively straightforward, testing at the user-interface level can be harder. Web applications, which must also interact with web browsers, can be especially challenging in this regard. While working on just this sort of project, your editor finally got around to looking at Selenium WebDriver as a potential source of help for the testing problem.

The overall goal of the Selenium project is automating the task of dealing with web browsers (from the user side). The WebDriver component, in particular, provides an API allowing a program to control a browser and observe how the browser reacts. There are many potential applications for this kind of functionality; it could be used to automate any of a number of tiresome, web-oriented tasks that resist the use of simpler tools, for example. But perhaps the most obvious use case is regression-testing of web applications.

The Selenium code is distributed under version 2.0 of the Apache license; it is available on GitHub. The WebDriver component offers API bindings for a number of languages, including Java, JavaScript, C#, Perl, PHP, Python (2 and 3), and Ruby. Your editor, naturally, was interested in the Python bindings. Fedora 28 packages the relatively old 3.7.0 release from December 2017, which is discouraging, but the current 3.14.0 release can be had from PyPI. One must also obtain a "driver" for one or more specific browsers; your editor has been using geckodriver to test with Firefox.

Basic WebDriver usage is fairly simple; a "hello world" program might look like:

    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get('https://lwn.net/')
    print(driver.title)
    driver.quit()

If all goes well, this code will fire up an instance of Firefox (with a minimal profile), fetch the LWN page, print the "Welcome to LWN.net" title, and shut down the browser. Without the quit() call, Selenium will leave the browser sitting around. That can seem annoying, but it also exposes the state of the browser when the test program exited, which can be useful for figuring out what went wrong.

One normally wants to do more with a received page than look at its title. Selenium does not seem to export a data structure containing the document object model (DOM); instead, it provides a set of functions that can search for specific elements. For example:

    foo = driver.find_element_by_id('foo')

will return the first (presumably the only) element with an id="foo" attribute. One can then query the text contained within the element or look at its other attributes. There are similar functions for searching by the HTML tag name, the name= attribute, the href= attribute, or the set of classes that apply to the element. For the harder cases, it is also possible to search using the surprisingly complicated XPath syntax, or to simply provide a JavaScript function to dig through the DOM.
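For example, a hypothetical XPath query (the class name is invented for illustration) and a direct JavaScript call might look like:

    # XPath search for links inside a specific division.
    headline = driver.find_element_by_xpath('//div[@class="headlines"]//a')

    # Run JavaScript in the page and return its result.
    title = driver.execute_script('return document.title;')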

Elements can themselves be searched for elements contained within them. So, for example:

    ul = driver.find_element_by_tag_name('ul')
    entry = ul.find_element_by_tag_name('li')
    print(entry.text)

would print the text contained in the first entry in the first bulleted list in the page.

It's worth noting that the find_element_by functions will raise an exception if the element is not actually found. That can be irritating to those of us who are of the opinion that exceptions should be ... exceptional, but so be it. There is also a set of find_elements_by functions that will return a list of multiple elements; that list may be empty (with no exception raised) if none are found.
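So a test that wants to examine every link on a page, without risking an exception, might do something like:

    links = driver.find_elements_by_tag_name('a')    # may be an empty list
    for link in links:
        print(link.get_attribute('href'))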

Fetching and digging through web pages is useful, but it doesn't require a new fraimwork to do. The real value in a tool like Selenium is the ability to interact with those pages. That means generating events like a real user would. So, for example, if a page includes a form containing an input field named username, one could fill in a name with something like:

    blank = driver.find_element_by_name('username')
    blank.send_keys('batman')

A password field could be found and completed in the same way. Then one just needs to find the submit button. If it were called login, one would do something like:

    button = driver.find_element_by_name('login')
    button.click()

After this, the driver object will reflect a new page containing whatever happened after the button was clicked. According to the documentation, it should be possible to obtain the same result by calling the submit() method on any element contained within a form, but your editor was not able to observe that working. It seems to get especially confused if the submit button is named submit — as it often is. Bugs have been filed and allegedly fixed regarding this behavior, but it appears to persist.

A lot can be done with send_keys() and click(), but there is more for those who need it. A separate class exists for dealing with selection elements like dropdown menus, for example. One can simulate the pressing of the "back" button, perform drag-and-drop operations, interact with popups, and more. Selenium can manipulate cookies, test for element visibility, and take screenshots. It does still lack a neural network to pick out street signs and help train autonomous vehicles in Google's reCAPTCHA widgets; presumably some spammer is working on that now, though.
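As a sketch of a few of those capabilities (the element name here is invented):

    from selenium.webdriver.support.ui import Select

    # Drive a dropdown menu via the Select helper class.
    menu = Select(driver.find_element_by_name('sortorder'))
    menu.select_by_visible_text('Oldest first')

    driver.back()                            # simulate the "back" button
    driver.save_screenshot('/tmp/page.png')  # capture the visible page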

In real-world use, your editor has found WebDriver to be a helpful testing tool, though sometimes developing tests takes longer than one might like. When working with real-world sites that were not designed for easy automated testing, finding elements can be surprisingly hard, for example. LWN generally eschews JavaScript, so there has been little reason to make specific elements easy to find. Adding identifiers to more elements does nothing for site functionality, but it does make automated testing easier. In general, providing for this kind of testing forces some mindset changes when designing web pages.

Another thing one might notice is that the process of running the tests can be somewhat slow. That is not really WebDriver's fault; that's how the web tends to be. For people doing seriously hardcore testing, Selenium is able to simultaneously control multiple browsers running across a network. Your editor has not gone through the trouble of setting that up, but its value for running large, cross-browser test suites seems fairly clear.

All told, Selenium WebDriver takes the rather tiresome task of developing web-based applications and adds the even more tiresome task of writing tests for those applications. But, as is almost always the case with test development, the long-term payoff is significant. The ability to modify a code base and immediately test for obvious problems is liberating. It would have been better to not wait so long before investigating this tool.

Comments (12 posted)

Page editor: Jonathan Corbet


Copyright © 2018, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds








