Guarding personally identifiable information
There is no viable way to prevent data from being collected about us in the current age of computing. But if institutions insist on knowing our financial status, purchasing habits, health information, political preferences, and so on, they have a responsibility to keep this data—known as personally identifiable information (PII)—from leaking to unauthorized recipients. At the 2017 Strata data conference in London, Steve Touw presented a session on privacy-enhancing technologies. In a fast-paced 40 minutes he covered the EU regulations about privacy, the most popular technical measures used to protect PII, and some pointed opinions about what works and what should be thrown into the dustbin.
To jump straight to Touw's conclusions: we need to maintain much tighter control over data that we share. Like most who have studied the question of PII, Touw finds flaws in current forms of de-identification, which is the technique we rely on most often for protecting PII. He suggests combining de-identification with restrictions on the frequency and types of queries executed against data sets, along with a context-based approach to data protection that is much more sophisticated than current access controls.
No single session offers enough time to explain all the issues in protecting PII thoroughly. Touw focused on European legal requirements (which made sense for a conference held in London), technical difficulties in de-identifying data, and good organizational practices for protecting privacy. This article also fills in some of the background underlying these issues.
Common constraints on data collection
Although people viscerally fear the collection of personal data, and alternatives such as Vendor Relationship Management have been suggested for leaving control over data in the hands of the individual, there are few barriers in the way of organizations that collect this data. The EU has regulated data collection for decades, and its General Data Protection Regulation (GDPR), which is supposed to come into force on May 25, 2018, requires limitations that are familiar to those in the privacy field. These include minimization, data retention limits, and restrictions on use to the original purpose for collecting the data. I'll offer a brief overview of these key concepts.
Minimization means collecting as little data as you can to meet your purpose. If you need to know whether someone is old enough to drive, you can record that as a binary field without recording the person's age. If you need to know how many cars pass down a street each day in order to plan traffic flow, you don't need to record the license plates of the cars.
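As a trivial sketch of the principle (hypothetical code; the driving age is assumed to be 17, which varies by jurisdiction), an application can derive and store only the answer it needs:

```python
from datetime import date

DRIVING_AGE = 17  # assumption: the legal threshold varies by jurisdiction

def old_enough_to_drive(birth_date: date, as_of: date) -> bool:
    """Derive the yes/no answer the application needs; the birth date
    itself is never stored (data minimization)."""
    had_birthday = (as_of.month, as_of.day) >= (birth_date.month, birth_date.day)
    age = as_of.year - birth_date.year - (0 if had_birthday else 1)
    return age >= DRIVING_AGE

# The stored record keeps only the boolean, not the age or birth date.
record = {"user": "u123", "can_drive": old_enough_to_drive(date(2001, 6, 1), date.today())}
```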
Data retention limits are a form of minimization. Most data's value diminishes greatly after a few months. For instance, a person's income may change, so income information collected a year ago may no longer be useful for marketing. Therefore, without much of a sacrifice in accuracy, an organization can protect privacy by discarding data after a certain time interval.
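A retention limit can be implemented as a routine purge; this minimal sketch assumes a one-year window and an invented record layout:

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=365)  # assumption: a one-year retention window

def purge_stale(records, now=None):
    """Keep only records collected within the retention window;
    everything older is discarded rather than archived."""
    now = now or datetime.utcnow()
    return [r for r in records if now - r["collected_at"] <= RETENTION]
```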
Restricting use to the original purpose of data collection is an even stricter criterion. Supposedly it would mean that a retailer who collects your information in order to charge your credit card should not use that information to improve its marketing campaigns.
Governments in the US impose restrictions only on specific classes of information, such as data collected by health care providers. Fair Information Practices, which cover some broad issues such as transparency and the right to correct errors, are widely praised but not required by law. They also go nowhere near as far as EU laws in granting rights to individuals for their data.
Although the GDPR does not require organizations to obtain consent for data collection, Touw advised them always to do so. Otherwise, the organizations may be asked to demonstrate in court that they had a "legitimate interest" in the data, which is a subjective judgment. Touw did not go into the problems of consent forms, so his advice was really aimed at protecting the company doing the collection, not the individuals.
The dilemma of data sharing
Protection of personal data takes place on two levels: while storing it at the site collecting the data, and while granting access to other parties. Why would sites offer data to other parties? Touw did not cover this question, but there are a few reasons behind that practice. Organizations can realize a large income stream from selling the data, which can then be used for purposes ranging from benign to ill. Governments collect and share data that is supposed to be for the public benefit (e.g. race and gender, incidences of communicable diseases). Public agencies, and even some companies, believe their data could contribute to initiatives in health, anti-corruption efforts, and other areas. Some institutions also anticipate that they might benefit from tools developed by others. Thus, Netflix released data on who viewed its video content for the Netflix prize of 2009, hoping to get a better algorithm for video recommendations from experts in the field.
When data is shared publicly, the organization tries to strip direct identifiers, such as names and social security numbers, and tries to reduce the risk that indirect identifiers such as postal codes can be used to re-identify individuals. Even when organizations sell their data privately, they often try to de-identify it in similar ways. The GDPR gives organizations pretty much free rein to use and release data, so long as it is correctly de-identified.
Problems with de-identification
The bulk of Touw's talk was devoted to the risks of de-identification, also known as anonymization. His skepticism about de-identification is shared by most experts in computing who have examined the field. In particular, he looked at techniques for pseudonymity and K-anonymity, claiming that they can't prevent re-identification unless they're pursued so far that they render the output data useless.
Touw predicted that organizations will stop releasing free, de-identified data sets, because de-identification has too often proven insufficient and too many embarrassing breaches have been publicized. Besides the Netflix prize mentioned earlier, where researchers re-identified Netflix users from the data [PDF], Touw mentioned some other open data sets and spent a good deal of time on New York City taxi data.
All these re-identification attacks depended on the mosaic effect, or finding other publicly available sources and joining them with the released data set. (Touw called this a "link attack.") In the case of the New York taxi data, most of us would have nothing to fear, but celebrities who are sighted at the beginning or end of their rides could potentially be re-identified. Touw claimed that New York City could not have prevented the re-identification by fuzzing or removing fields from the data, a point also made by the researcher who originally performed the re-identification attack. I believe Touw moved the goalposts a bit by adding new sources of information to fuel possible attacks as he removed existing information. Still, he made a case that the only way to protect celebrities would be to remove everything of value from the data.
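To make the mechanics of a link attack concrete, here is a toy example with invented data: a "de-identified" release is joined with a public sighting on the quasi-identifier fields they share.

```python
# Invented, simplified data for illustration only.
released = [   # direct identifiers stripped, quasi-identifiers left in
    {"zip": "10014", "pickup": "2013-07-08 23:41", "tip": 0.0},
    {"zip": "10065", "pickup": "2013-07-09 01:05", "tip": 12.5},
]
public = [     # e.g. a photo caption giving a time and place for a celebrity
    {"name": "Famous Person", "zip": "10065", "pickup": "2013-07-09 01:05"},
]

def link(released_rows, public_rows, keys):
    """Re-identify rows by matching on the quasi-identifiers both sets share."""
    for r in released_rows:
        for p in public_rows:
            if all(r[k] == p[k] for k in keys):
                yield {**p, **r}

print(list(link(released, public, ["zip", "pickup"])))
# [{'name': 'Famous Person', 'zip': '10065', 'pickup': '2013-07-09 01:05', 'tip': 12.5}]
```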
Pseudonymization is the easiest way to de-identify data. It consists of putting a meaningless value in place of a personally identifying field. People may still be re-identified, though, if they possess unique values for other fields. For instance, if someone is the only Hispanic person in a particular apartment building, a combination of race and address can identify them. If someone suffers from a rare disease, a hospital listing with diagnoses may reveal sensitive information to someone who knows they have that disease.
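As a minimal sketch of pseudonymization (not a production design; the token table itself must be protected, or replaced by keyed encryption, since anyone holding it can reverse the pseudonyms):

```python
import secrets

_token_table = {}  # mapping from real identifiers to random tokens

def pseudonymize(identifier):
    """Replace a direct identifier (name, SSN, ...) with a meaningless token,
    reusing the same token for repeated occurrences of the same value."""
    if identifier not in _token_table:
        _token_table[identifier] = secrets.token_hex(8)
    return _token_table[identifier]

row = {"name": "Alice Example", "race": "Hispanic", "address": "12 Main St, Apt 3"}
released = {**row, "name": pseudonymize(row["name"])}
# The indirect identifiers (race, address) pass through untouched, which is
# exactly why pseudonymization alone can still allow re-identification.
```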
K-anonymity addresses the problem of unique values, known also as high cardinality values. The technique makes sure there are enough duplicate values in different rows of data so that no individual is identified by a particular combination of fields. K-anonymity works by making values in fields more general: a common example is offering just the first three digits of a five-digit ZIP code. Because the digits are hierarchical (the code 200 is a single contiguous geographic area that contains 20001, 20002, etc.), generalizing the ZIP code exposes data that is still useful but is less specific.
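A small, hypothetical sketch of both halves of the idea: generalizing ZIP codes and then checking that every combination of quasi-identifiers appears at least K times:

```python
from collections import Counter

def generalize_zip(zip5):
    """Generalize a five-digit ZIP code to its first three digits."""
    return zip5[:3] + "**"

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs in at
    least k rows, so no combination singles out fewer than k people."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in counts.values())

rows = [
    {"zip": generalize_zip("20001"), "age_band": "30-39", "diagnosis": "flu"},
    {"zip": generalize_zip("20002"), "age_band": "30-39", "diagnosis": "diabetes"},
    {"zip": generalize_zip("20009"), "age_band": "30-39", "diagnosis": "flu"},
]
print(is_k_anonymous(rows, ["zip", "age_band"], k=3))  # True: all rows share ("200**", "30-39")
```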
Touw briefly mentioned two enhancements to K-anonymity, known as L-diversity and T-closeness, that are more restrictive. L-diversity [PDF] requires that each group of records sharing the same quasi-identifiers contain at least L well-represented values of the sensitive attribute, which limits how confidently an attacker who has located the target's group can guess something about the target (such as their diagnosis). T-closeness [PDF] tries to prevent re-identification by making sure that each division in the data (such as a ZIP code) contains sensitive values with about the same frequency as the data set as a whole. Touw claimed L-diversity and T-closeness are more trouble than they're worth, and that all these techniques leave people at risk of re-identification unless the data is generalized to the point where it's worthless.
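For concreteness, the simplest "distinct values" form of L-diversity can be checked with a few more lines, reusing the rows from the K-anonymity sketch above; this is an illustration, not the full probabilistic definitions from the papers:

```python
def is_l_diverse(rows, quasi_identifiers, sensitive, l):
    """True if each group of rows sharing the same quasi-identifier values
    contains at least l distinct values of the sensitive attribute."""
    groups = {}
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups.setdefault(key, set()).add(row[sensitive])
    return all(len(values) >= l for values in groups.values())

# The single ("200**", "30-39") group above contains two distinct diagnoses,
# so the table is 2-diverse but not 3-diverse.
print(is_l_diverse(rows, ["zip", "age_band"], "diagnosis", l=2))  # True
```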
When you listen to data scientists like Touw who have investigated the limitations of anonymization, you come away feeling that there's no point to doing it. But let's step back and consider whether this is a constructive conclusion. Nearly all published examples of re-identification took advantage of poor de-identification techniques. Done right, according to proponents, de-identification is still safe. On the other hand, it's easy for proponents of de-identification to say that a technique was flawed after the fact.
To resolve the dilemma, one can look at de-identification like encryption. We can be fairly certain that, within a few decades, increased computing power and new algorithms will allow attackers to break our encryption. We keep increasing key sizes over the decades to compensate for this certainty. And yet we keep using encryption, because nothing better exists. De-identification is still worth using too. But Touw has some alternative ways to carry it out.
Proposed remedies
In addition to advising that organizations obtain consent for data collection, Touw offered two practices that are more effective than the previous methods of data protection: restricting data requests to a safe set of queries and using context-based restrictions. Neither practice is in common use now, but models exist for their use.
If an organization does not release data in the open, it can achieve some of the organizational and social benefits of open data by offering a limited set of queries to third parties. Touw promoted the concept of differential privacy, which is a complex technology understood by relatively few data experts. The concept has been attributed [PDF] to Cynthia Dwork, who co-authored a key paper [PDF] laying out the theory. She explains differential privacy there (on page 6) by saying, "it ensures that any sequence of outputs (responses to queries) is 'essentially' equally likely to occur, independent of the presence or absence of any individual." It never reveals any specific fields in the underlying data, but provides a set of aggregate queries—such as sums or averages—that mathematical analysis of the data set has shown to be privacy-preserving.
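As a rough illustration only (not Touw's implementation, and omitting the calibration and budget accounting a real system needs), here is a minimal sketch of the textbook Laplace mechanism applied to a counting query, whose sensitivity is 1:

```python
import random

def dp_count(values, predicate, epsilon):
    """Answer a counting query with Laplace noise of scale 1/epsilon,
    the standard mechanism for a query with sensitivity 1."""
    true_count = sum(1 for v in values if predicate(v))
    # The difference of two exponentials with rate epsilon is Laplace(1/epsilon).
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

incomes = [23_000, 48_000, 51_000, 250_000, 62_000]   # invented data
print(dp_count(incomes, lambda x: x > 50_000, epsilon=0.5))  # roughly 3, plus noise
```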
Touw demonstrated how a specific value for a specific person might be obtained by asking the same question—or to disguise the attack, many questions that differ slightly—over and over. Each question produces a slightly different result in the field you're interested in, but if you take the average of these results you can get very close to the original value. So some form of rate-limiting must be imposed on queries.
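Continuing the hypothetical sketch above, a couple of lines show why unlimited queries defeat the noise: averaging many answers converges on the true count, which is exactly the attack Touw described.

```python
# Repeat the "private" query many times and average the answers.
answers = [dp_count(incomes, lambda x: x > 50_000, epsilon=0.5) for _ in range(10_000)]
print(sum(answers) / len(answers))  # converges toward the true count of 3
```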
Touw's other major recommendation involves context-based or purpose-based restrictions, which he called "the future of privacy controls". They go far beyond individual or group access controls used by most sites.
One example of context-based restrictions is time-based access. A conventional employer might allow access by its employees from 9:00 AM to 5:00 PM. In a more flexible environment, such as a hospital where nurses' shifts have irregular beginnings and ends, the hospital may allow each nurse access to data when their schedule indicates they are on duty.
Another type of context-based restriction grants users limited access to data under a license that spells out what they want to do (say, cancer research) and how they may use the data. If the user starts issuing requests for combinations of rows or columns that don't seem to fulfill the basis on which the license was granted, access can be denied.
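As a sketch of the idea (not any particular product's API; all names and fields are invented), the following hypothetical check combines the shift-based and purpose-based restrictions from the last two paragraphs:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class License:
    """Hypothetical record of what a data user has been approved to do."""
    user: str
    purpose: str           # e.g. "cancer-research" or "patient-care"
    allowed_columns: set   # columns the license covers
    shift_start: datetime  # context: when this user is on duty
    shift_end: datetime

def allow_query(lic, purpose, columns, now):
    """Grant a query only when the stated purpose, the requested columns,
    and the current time all fall within the user's license."""
    return (purpose == lic.purpose
            and set(columns) <= lic.allowed_columns
            and lic.shift_start <= now <= lic.shift_end)

lic = License("nurse_7", "patient-care", {"vitals", "medications"},
              datetime(2017, 6, 7, 19, 0), datetime(2017, 6, 8, 7, 30))
print(allow_query(lic, "patient-care", ["vitals"], datetime(2017, 6, 7, 23, 15)))  # True
print(allow_query(lic, "marketing", ["vitals"], datetime(2017, 6, 7, 23, 15)))     # False
```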
Touw advises organizations not to try to combine all their data in a single data lake—or worse still, to copy data into a new repository in order to perform access controls. Maintaining two copies of data is always cumbersome and error-prone. In addition, you now offer attackers twice the opportunities to break into the data. Instead, he suggests that an organization set up what he calls a "data control plane". It implements all the policies defined by the organization and covers all data stores. The control plane should expose easy ways to create rules, make sure new policies take effect immediately, recognize the types of context mentioned earlier, and maintain audits that show what the data was used for. Organizations must also exercise governance over data so they know who owns it, who has access to it and under what circumstances, and how to manage the data's lifecycle (acquiring, storing, selling, purging). They can't just rely on the IT department to define and implement policies.
Few if any commercial vendors offer the advanced privacy-protecting technologies recommended by Touw. So at this point, attackers run ahead of most organizations that maintain data on us. Still, Touw's talk opens up a valuable debate about what real privacy protection looks like in 2017.
| Index entries for this article | |
|---|---|
| Security | Privacy |
| GuestArticles | Oram, Andy |
Posted Jun 7, 2017 22:51 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (31 responses)
Posted Jun 7, 2017 23:20 UTC (Wed)
by NightMonkey (subscriber, #23051)
[Link] (27 responses)
Posted Jun 7, 2017 23:23 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (26 responses)
Also credit card number information is not personal information.
Posted Jun 8, 2017 14:28 UTC (Thu)
by dskoll (subscriber, #1630)
[Link] (25 responses)
Credit card numbers are considered personally-identifying information. Wikipedia references this document which says that "Examples of PII include... Personal identification number, such as social security number (SSN), passport number, driver's license number, taxpayer identification number, or financial account or credit card number."
Posted Jun 8, 2017 15:06 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (24 responses)
Posted Jun 8, 2017 15:23 UTC (Thu)
by pizza (subscriber, #46)
[Link]
And there is quite a lot of "PII" in "Personal Data"
Posted Jun 8, 2017 18:22 UTC (Thu)
by dskoll (subscriber, #1630)
[Link] (21 responses)
You are splitting hairs. Of course it's "personal data". It's data that belongs to you and only you. And it can be used by someone to impersonate you, or at least as part of such impersonation.
Posted Jun 11, 2017 22:40 UTC (Sun)
by Cyberax (✭ supporter ✭, #52523)
[Link] (20 responses)
If we had payment systems with more security than a wet towel, I wouldn't even care if my credit card numbers leaked. It's also absolutely not a problem to secure CC numbers - simply replace them with unique tokens and keep the mapping secret. Or if you're feeling fancy, you can use encryption instead.
Anonymization solves a completely different problem and I'm amazed that people don't understand that.
Posted Jun 12, 2017 13:11 UTC (Mon)
by nijhof (subscriber, #4034)
[Link] (1 responses)
Posted Jun 12, 2017 15:45 UTC (Mon)
by nybble41 (subscriber, #55106)
[Link]
First, something you have in common with the vast majority of the population is hardly "personal information". Simply guessing that a given individual has a credit card would be correct most of the time. Second, the card number doesn't say anything about its owner by itself; if all you have is a card number then all you can say is that _someone_ has a credit card, which isn't personal at all.
> And besides, the number also contains the Issuer Identification Number, i.e. it identifies the bank or other provider that gave you the card -- which will narrow down where you live, etc.
That is a bit closer to personal data, but the same caveat applies: by itself that doesn't say anything about any particular individual, only the issuing bank. To make inferences about "where you live" one would first need to link the card to _you_. Otherwise all they can say is that _someone_ has a card from that bank.
The problem isn't a special class of "personal data", with a few obvious exceptions like name and address which are always filtered out anyway. Even a credit card number is not an issue in isolation (or wouldn't be given a reasonable minimum standard of security in payments). The problem is data sets which allow one to correlate _multiple_ types of otherwise _non-personal_ data in order to identify specific individuals. The data becomes personal only when aggregated together: the Latino Netflix subscriber, age 18-25, with a zip code starting with 407 and a credit card from Springfield Credit Union. No one part of that data is "personal", but taken together it can potentially single out a specific individual. The key is that almost any sort of data can be used for that sort of "fingerprinting", even data which no one considers personal.
Posted Jun 12, 2017 19:33 UTC (Mon)
by dskoll (subscriber, #1630)
[Link] (17 responses)
What do you mean, it doesn't say anything "about" you? How do you define "about"?
Your name doesn't say anything about you. It's just an identifier that (probably) your parents assigned to you. It might help someone guess at your sex, but even that's not foolproof.
Your marital status doesn't say anything about you. Plenty of people are single. Plenty are married.
Basically, you can define "about" as absurdly as you want to the point that no information is personal.
Posted Jun 12, 2017 20:18 UTC (Mon)
by nybble41 (subscriber, #55106)
[Link] (16 responses)
That's the thing, very little information _is_ personal on its own. "There exists a person with this credit card number." "There exists a person with this birthday." "There exists a person with this ethnicity." "There exists a person with this common first name." None of that is particularly personal; many other individuals share the same characteristics. It's only when multiple facts are aggregated together that one can start to draw conclusions about specific individuals—and that remains true even if the individual facts do not appear to be the least bit "personal".
We should not be looking at this as a question of whether a particular bit of information is or is not "personal data". The question is what conclusions can be drawn from a complete data set, not one isolated fact. Trying to classify types of information as either "personal" or "non-personal" leads to the equally absurd position that _all_ information is personal, because _any_ form of information can potentially be used for that purpose.
Posted Jun 12, 2017 22:42 UTC (Mon)
by flussence (guest, #85566)
[Link]
People think they're safe leaving this information in public, and then things like this happen: https://medium.com/@CodyBrown/how-to-lose-8k-worth-of-bit...
Posted Jun 13, 2017 15:57 UTC (Tue)
by ssmith32 (subscriber, #72404)
[Link] (14 responses)
If that's not personal information, what do you define as personal information?
Posted Jun 13, 2017 21:43 UTC (Tue)
by nybble41 (subscriber, #55106)
[Link] (13 responses)
Here's a credit card number: 4024007129431648. Just from the number, what can you tell me about the owner's shopping habits or location?
Of course that number was fake, but the point remains: on its own a CC# says very little. To get shopping habits or location you would need to correlate it with other data about where and how the card was used. It is the connection between the CC# and this other data (e.g. order history) which is "personal", not the CC# itself.
Posted Jun 14, 2017 9:33 UTC (Wed)
by dgm (subscriber, #49227)
[Link] (2 responses)
Posted Jun 14, 2017 9:39 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Jun 14, 2017 12:58 UTC (Wed)
by paulj (subscriber, #341)
[Link]
Posted Jun 14, 2017 10:13 UTC (Wed)
by tao (subscriber, #17563)
[Link] (5 responses)
The information doesn't even have to be recent. Let's say I find 5-year old account info, with a CC used at two different sites. The former being "Explicit gay porn!" and the latter being "Reactionary Bible-thumpers united".
Now simply obtaining the CC:NAME tuple would yield pretty damn good material for blackmailing.
So yes, CC might not be a breach of personal integrity, but as soon as you have the CC:NAME tuple you're well on your way towards nasty integrity violations. This goes for all kinds of tuples of data; NAME:EMAIL, NAME:NICKNAME, NICKNAME:EMAIL, NAME:FAVOURITE RESTAURANT, etc. A typical tuple would be <just about anything>:IP ADDRESS.
I always read the same set of webpages in the morning. I open them all at once. If you could track this + the IP-address you could easily find out that "Oh, tao is at the airport today" no matter which of my laptops I use, even if I use a brand new one, simply by recognising the pattern + the IP-address ("This address belongs to airport X").
Individual pieces of data are almost never personal information. A city, a gender, a CC number, a film. Even a license plate. But as soon as you can correlate the data ("X lives in city Y", "X is of gender Y", "X owns credit card Y", "X likes film Y", "X has at some point driven car Y"), it starts to become personal.
Enough data points will tell a story. Whether the story is the right one or not isn't always clear ("X has driven car Y" doesn't discern between "X owns car Y", "X borrowed car Y" or "X rented car Y", but finding out more facts about car Y might be enough to clear that up, without finding out anything else about X).
Some information might seem trivial; "X is male", for instance. But if you combine that with "X regularly buys women's underwear"?
Posted Jun 14, 2017 14:21 UTC (Wed)
by nybble41 (subscriber, #55106)
[Link] (3 responses)
Exactly. It's those connections which are personal, not the individual pieces of data.
Posted Jun 14, 2017 14:38 UTC (Wed)
by anselm (subscriber, #2796)
[Link] (2 responses)
But that doesn't mean that the individual pieces aren't worth protecting on their own, on general principles.
In any case, in practice if a cracker steals company XYZ's customer database, chances are that it will already come with people's names, street addresses, e-mail addresses, and credit card numbers nicely prepackaged.
Posted Jun 17, 2017 5:08 UTC (Sat)
by jtc (guest, #6246)
[Link] (1 responses)
I don't think that's particularly useful or practical, if you're talking about, e.g., protecting an individual CC#, street address, etc. I could, for example, take a walk around my neighborhood and write down a house's address, the license plate # of a car parked on the road, or look in the phone book and write down a phone number, etc. I could then publish this information (with no other associated data), legally, on the internet and, of course, anyone else could do the same. That can't be prevented, which shows why it's not practical.
Furthermore, publishing such info without any other data to go with it (such as a name, or, worse [whether true or false] an accusation that the person owning the car/house/etc. committed a felony or a particular crime) is extremely unlikely to cause any harm to the person associated with that data (house owner, car owner, ...).
To extend my example to the point of absurdity, I could write in a blog: "somebody has heart disease and his or her doctor has recommended heart-bypass surgery" (As a matter of fact, I've just done that!). This is certain to be true for more than one person in the world right now. But since there's no identifying data to go along with this claim, it does no harm whatsoever.
Maybe this is not what you meant, but if so, what you meant is not at all clear, IMO.
Posted Jun 19, 2017 9:14 UTC (Mon)
by farnz (subscriber, #17727)
[Link]
Have you read up on Data Protection legislation (which is about to be beefed up by the GDPR)? It actually implements the sorts of protections we're talking about.
Posted Jun 16, 2017 15:50 UTC (Fri)
by Wol (subscriber, #4433)
[Link]
We had a classic case a few years back. Rape victims are supposed to be kept anonymous. But one newspaper printed a story about "a vicar's daughter" while another said "from Ealing". Both bits, in isolation, could refer to many thousands of people. Put together, the victim's identity was revealed almost instantly.
Cheers,
Wol
Posted Jun 14, 2017 12:20 UTC (Wed)
by hummassa (guest, #307)
[Link] (3 responses)
If I have the access to the Visa/Master/AmEx database (even hacked, dated versions of it), you bet I can. Try posting your real CC# in this forum and you'll see.
Posted Jun 14, 2017 14:17 UTC (Wed)
by nybble41 (subscriber, #55106)
[Link] (2 responses)
You're making my point for me. To get any personal information you need those databases, not just the CC#.
Posted Jun 19, 2017 12:07 UTC (Mon)
by hummassa (guest, #307)
[Link] (1 responses)
Posted Jun 19, 2017 15:01 UTC (Mon)
by nybble41 (subscriber, #55106)
[Link]
Agreed, but the point still stands. You can't get anyone's shopping habits or location from just a credit card number. When combined with additional information, sure, but not from the number alone. It's not the number itself which is personal, but rather the web of connections linking the number to other (likewise individually non-personal) pieces of data.
Posted Jun 10, 2017 15:46 UTC (Sat)
by paulj (subscriber, #341)
[Link]
Posted Jun 9, 2017 15:32 UTC (Fri)
by hkario (subscriber, #94864)
[Link]
Posted Jun 16, 2017 16:59 UTC (Fri)
by Wol (subscriber, #4433)
[Link]
> I don't. I don't really care about it.
Not even when it places your livelihood (or life!) at risk?
Someone leaked my name, phone number, and the fact that I'd had an accident. It ended up on a list being sold to ambulance chasers. I really do NOT appreciate receiving a *flood* of nuisance calls to my mobile, seeing as they mostly arrive while I'm driving and it's not only illegal, but also dangerous, to talk on a mobile while driving.
Cheers,
Wol
Posted Jun 17, 2017 12:02 UTC (Sat)
by nix (subscriber, #2304)
[Link]
Posted Jun 8, 2017 1:32 UTC (Thu)
by gdt (subscriber, #6284)
[Link] (12 responses)
> When you listen to data scientists like Touw who have investigated the limitations of anonymization, you come away feeling that there's no point to doing it. But let's step back and consider whether this is a constructive conclusion. Nearly all published examples of re-identification took advantage of poor de-identification techniques. Done right, according to proponents, de-identification is still safe. On the other hand, it's easy for proponents of de-identification to say that a technique was flawed after the fact. To resolve the dilemma, one can look at de-identification like encryption. We can be fairly certain that, within a few decades, increased computing power and new algorithms will allow attackers to break our encryption. We keep increasing key sizes over the decades to compensate for this certainty. And yet we keep using encryption, because nothing better exists. De-identification is still worth using too. But Touw has some alternative ways to carry it out.

Could we please encourage authors to do more reportage and less editorial? The parallel with encryption doesn't hold up, and the article would have been stronger without it.
Posted Jun 8, 2017 19:19 UTC (Thu)
by kfiles (subscriber, #11628)
[Link] (2 responses)
Posted Jun 11, 2017 10:32 UTC (Sun)
by jaromil (guest, #97970)
[Link] (1 responses)
Posted Jun 16, 2017 16:04 UTC (Fri)
by Wol (subscriber, #4433)
[Link]
Cheers,
Wol
Posted Jun 8, 2017 22:54 UTC (Thu)
by bronson (subscriber, #4806)
[Link] (8 responses)
Care to say more? The article did a good job of describing the similarities. You can probably find 50 ways that they're different but that's not important to literary comparison -- it doesn't affect the ways that they're similar.
Posted Jun 9, 2017 8:49 UTC (Fri)
by tialaramex (subscriber, #21167)
[Link] (4 responses)
Mechanically we're probably past the point where you can expect incremental technical advances to have any effect on symmetric encryption. DES key lengths were already too short in 1975, you should be able to find contemporary writing that backs up this criticism. Sure enough the only attack that's actually been successfully used on DES is a brute force attack on the key. Rijndael increases the minimum key size to 128-bits, which puts a brute force attack likely permanently out of reach, but even the original DES algorithm - in the back-to-back-to-back 3DES construction so as to use longer keys - is still safe today if you don't mind it being slow and awkward.
We have tended to abandon encryption algorithms once someone demonstrates that even in theory they can sometimes successfully break it with practical resources. In contrast anonymization techniques are _always_ theoretically broken, it's just that sometimes nobody bothers to break them in practice. If we're going to compare to something, how about the Yale lock on a farmhouse back door. Probably the door isn't locked anyway, and if it is anyone who spends a few minutes learning how online can break the lock. That's where we are with anonymization. Our best hope is that nobody even _wants_ to de-anonymize the data, not that they can't.
Posted Jun 15, 2017 4:31 UTC (Thu)
by bronson (subscriber, #4806)
[Link] (3 responses)
Here is a nice picture of the treadmill, for hash functions anyway: http://valerieaurora.org/hash.html
Posted Jun 26, 2017 14:42 UTC (Mon)
by tialaramex (subscriber, #21167)
[Link] (2 responses)
And _that_ is why, again exactly as I wrote, 3DES is still safe. It's not a new algorithm using the same name, it's just three lots of DES because the algorithm is still fine; specifically 3DES is E(key3, D(key2, E(key1, message))) so that if you set key1 = key2 = key3 you get DES as before, but if you set them differently the attacker must either brute force all 168 bits of key or they must rely on the Meet-in-the-middle attack and do 2^112 operations, which is tighter but still impractical today.
Also Val's page is a member of the set of things which assume relatively short past trends will continue in order to predict the future. Her warning that you should plan on being _able_ to replace the hash in your shiny new thing is a sensible one, but the thing about such trends is that they're only notable while they stay true. Nobody is going to make a web page called "To our astonishment SHA-2 is still fine after 75 years". I always want to point to Disco Stu's graph of Disco record sales here...
And finally, while Val's advice is all very well, probably even _more_ useful would be to take the extra hour and learn more about what these things are. On the other side of the fence lots of effort has gone into making modern algorithms and libraries have fewer "sharp edges", such as SHA-3's elimination of length extension, but the edges are only sharp if you have no idea what these crypto algorithms do and do not promise for you. People writing SHA2(m1) and being surprised an adversary can use that to produce SHA2(m1 | chosen suffix) correctly without knowing m1 are protected by using SHA3() instead where their adversary can't pull that off, but they'd _also_ be protected, and better, by knowing what's going on here so they wouldn't fall for that mistake in the first place.
Posted Jun 27, 2017 11:01 UTC (Tue)
by paulj (subscriber, #341)
[Link] (1 responses)
Posted Jun 27, 2017 11:50 UTC (Tue)
by anselm (subscriber, #2796)
[Link]
With MD5 or SHA-1, the hash is basically identical to the internal state of the hash function after it has seen the input up to that point. You can use this to hash more stuff and the result will be indistinguishable from what would have been returned if you had applied SHA-1 to the concatenation of the original input (which you technically don't know) and your stuff in one go.
SHA-3 avoids this by using an algorithm where the hash value it outputs doesn't let you infer its internal state. This means that even if you know the hash value, you don't know everything needed to set up your own instance of the hash function as if it had already processed the material whose hash you hold, which is what the extension attack requires. (For the gory details, check Wikipedia.)
Posted Jun 9, 2017 20:40 UTC (Fri)
by frostsnow (subscriber, #114957)
[Link] (2 responses)
The *security* of a strong encryption algorithm rests in the *secrecy* of a key. A *perfect* encryption scheme from a security perspective does exist, it's called a One-Time Pad (https://en.wikipedia.org/wiki/One-time_pad). The point about a perfect encryption scheme is that, given a ciphertext, *any* plaintext of the same size is *equally* likely. This is a fundamental point of encryption's security. To quote Schneier's "red book":
>A random key sequence added to a nonrandom plaintext message produces a completely random ciphertext message and no amount of computing power can change that.
Thus to compare de-identification to encryption on the notion that an increase in computing power will allow users to "break our encryption" and that "nothing better exists" shows a fundamental lack of understanding of what encryption is. It's not a good comparison.
THAT BEING SAID, I understand *why* the author wrote what they did, and what they *meant*. They are looking at the security that we expect from commonly-implemented block and public/private key ciphers, which make security/time/space trade-offs that (hopefully) give us a security that lasts relative to a certain growth in computing power that is projected to take place over the next couple of decades. They then take this idea and claim that it would be useful to look at de-identification with regards to this concept that we only need security over a certain time period, after which it becomes drastically less useful, and this point may be true.
However, it is *not* clear to me that the security provided by commonly-implemented encryption algorithms and the "security" provided by de-identification are in *any* way equivalent whatsoever. Encryption has a theoretical component that allows *perfect* security, de-identification is entirely based on obscurity. Again, with encryption, we have purposefully chosen weaker algorithms for pragmatic reasons, but it is not clear that such a choice even *exists* for de-identification algorithms.
Encryption and de-identification aren't analogous.
Disclaimer: I am not a cryptography expert and may have gotten something wrong. Also: https://www.xkcd.com/386/
Posted Jun 15, 2017 1:56 UTC (Thu)
by andyo (guest, #30)
[Link] (1 responses)
Posted Jun 15, 2017 5:28 UTC (Thu)
by bronson (subscriber, #4806)
[Link]
"But that doesn't mean that the individual pieces aren't worth protecting on their own, on general principles."