Privacy Enhancing Technologies - An introduction

by Lucie Hoffmann
on July 18, 2024
time to read: 11 minutes

Keypoints

PETs empower you

  • They aim to protect your personal data
  • They are diverse and still in development
  • Their designs are innovative and challenging
  • They try to combine functionality with privacy

Choosing what you share about yourself, with whom, and what they may do with it is an important basis for determining and expressing yourself freely.

The growing amount of information flowing on the Internet challenges the privacy of online users and their control over their data. Privacy Enhancing Technologies (PETs) aim to provide tools for protecting users’ personally identifiable information by minimizing the data they share and giving them control over who can access it and how it is used. In this article, we give an overview of what PETs can be used for, examples of such tools and challenges that remain in their current development. We chose to divide this article into four main sections, each focusing on an area where PETs can enhance privacy: data analytics, computing tools, access management and communications.

Anonymized data analytics

In data analytics, sets of individuals’ data are used to derive results. These results may be statistics for research purposes or data models for prediction purposes. The issue here is that individual data may contain personal information which is then made accessible to the parties involved in processing it. Users may not want to trust these parties with their data. PETs like synthetic data, differential privacy and federated learning offer different approaches to obfuscating this data or preventing access to its raw content.

Synthetic data

Synthetic data is data that is generated artificially instead of being “real” data from “real” individuals. To do so, statistical models capturing the properties of the “real” data are used to construct synthetic data sets. One of the remaining challenges is to make sure that original records cannot be recovered from synthetic records and that no re-identification attacks are possible.

The following illustration gives a simple example of a data set showing the age of people depending on their department. The statistical properties of the original data are extracted; in this case, the average age per department defines the statistical model. Synthetic data is then generated from this model, keeping the selected properties.

synthetic data illustration
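To make this concrete, the following Python sketch fits a simple per-department model (mean and standard deviation of ages) and samples new records from it. The toy data set and the function name synthesize are illustrative assumptions, not a real-world generator:

import random
import statistics

# Toy records of (department, age); real data sets are far richer.
original = [("IT", 25), ("IT", 31), ("HR", 42), ("HR", 38), ("HR", 47)]

def synthesize(records, n_per_dept=3):
    by_dept = {}
    for dept, age in records:
        by_dept.setdefault(dept, []).append(age)
    synthetic = []
    for dept, ages in by_dept.items():
        mu = statistics.mean(ages)
        sigma = statistics.stdev(ages) if len(ages) > 1 else 1.0
        # Sample fresh ages from the fitted per-department model;
        # no original record is copied into the synthetic set.
        synthetic.extend((dept, round(random.gauss(mu, sigma)))
                         for _ in range(n_per_dept))
    return synthetic

print(synthesize(original))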

Differential privacy

In differential privacy, noise is added to individual data so that the original data cannot be re-identified, while aggregated noisy data still leads to results similar to those obtained with raw data. The challenge here is to find a noise level that provides a reasonable degree of deniability for the original individual data while not distorting the results of analyses on large data sets.

Using the previous data set example, the following illustration shows noise between -2 and +2 being added to the age fields. The statistical results remain similar, while individuals in the original data set can plausibly deny their real age.

differential privacy illustration
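As a minimal sketch of this idea, the snippet below implements the classic Laplace mechanism for a differentially private mean age in Python; the epsilon value, bounds and data are illustrative assumptions:

import random
import statistics

ages = [25, 31, 42, 38, 47]

def dp_mean(values, epsilon=1.0, lower=0, upper=120):
    clipped = [min(max(v, lower), upper) for v in values]
    # Sensitivity of the mean of n values bounded to [lower, upper].
    sensitivity = (upper - lower) / len(clipped)
    # The difference of two exponential samples is Laplace noise with
    # scale sensitivity/epsilon: smaller epsilon, more noise, more privacy.
    rate = epsilon / sensitivity
    noise = random.expovariate(rate) - random.expovariate(rate)
    return statistics.mean(clipped) + noise

print(statistics.mean(ages), dp_mean(ages, epsilon=1.0))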

Federated learning

Federated learning consists in training a model locally at each data source and then sharing only this locally trained model with others to obtain a general model aggregated from the individual ones. The main challenge, here again, is to prevent personal information from being recovered from either the local models or the aggregated model.

The following illustration shows four different organisations with their own private database. They first derive a local model before sharing it with the other organisations to create a common model.

federated learning illustration
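The following Python sketch shows the core of federated averaging under strong simplifying assumptions: each organisation fits a one-parameter linear model on its private data, and only the model parameters, never the raw records, are shared and averaged:

import statistics

def train_local(xs, ys, lr=0.01, epochs=200):
    # Fit y = w*x with plain gradient descent on local data only.
    w = 0.0
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w

# Four organisations, each with a private data set that never leaves them.
private_data = [
    ([1, 2, 3], [2.1, 3.9, 6.2]),
    ([1, 2, 4], [1.8, 4.1, 8.0]),
    ([2, 3, 5], [4.2, 5.8, 10.1]),
    ([1, 3, 4], [2.0, 6.1, 7.9]),
]

local_models = [train_local(xs, ys) for xs, ys in private_data]
common_model = statistics.mean(local_models)  # aggregation step
print(local_models, common_model)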

Processing encrypted data

Computing tools, for example in cloud computing, are used to perform operations on data. If data is sensitive or needs to remain private, a solution is to find ways to process encrypted data so that computations carried out on the ciphertext carry over to the decrypted data. Homomorphic encryption, secure multi-party computation and trusted execution environments provide different approaches for processing encrypted data.

Homomorphic encryption

Homomorphic encryption relies on a mathematical object called a homomorphism. A homomorphism is a function f that preserves a specific operation ∘ such that f(a ∘ b) = f(a) ∘ f(b). If f is an encryption function and ∘ is the operation we want to perform on the data, then the operation can be carried out directly on the encrypted values f(a) and f(b). The result is the same as if we had performed the operation on the cleartext data and then encrypted it. The main challenges here are the increase in computation time compared to computation on unencrypted data, and the need for skilled developers to properly implement this more complex setup.
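As a minimal illustration of this property, the following Python sketch uses textbook RSA, which is multiplicatively homomorphic, with tiny and deliberately insecure parameters; real homomorphic encryption schemes such as BFV or CKKS are far more involved:

p, q = 61, 53                      # toy primes, never this small in practice
n = p * q
e = 17                             # public exponent
d = pow(e, -1, (p - 1) * (q - 1))  # private exponent (Python 3.8+)

def enc(m): return pow(m, e, n)
def dec(c): return pow(c, d, n)

a, b = 7, 6
product_of_ciphertexts = (enc(a) * enc(b)) % n
assert product_of_ciphertexts == enc(a * b)  # f(a) * f(b) = f(a * b)
assert dec(product_of_ciphertexts) == a * b  # decrypts to 42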

Secure multi-party computation

As described in more detail in this article, Secure Multi-Party Computation enables different parties to compute a common operation on their respective private data. All parties obtain a common result without ever learning the private inputs of the other parties. Here again, an important challenge resides in the computation costs.
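A classic building block of such protocols is additive secret sharing, sketched below in Python with illustrative inputs: three parties learn the sum of their private values without revealing the values themselves:

import random

Q = 2**31 - 1  # all arithmetic is done modulo a public prime

def share(secret, n_parties=3):
    # Split a secret into random shares that sum to it modulo Q.
    shares = [random.randrange(Q) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % Q)
    return shares

inputs = [46000, 52000, 61000]           # e.g. private salaries
all_shares = [share(x) for x in inputs]  # each party shares its input
# Each party locally adds the one share it received from every input ...
partial_sums = [sum(column) % Q for column in zip(*all_shares)]
# ... and only these partial sums are combined into the public result.
print(sum(partial_sums) % Q)  # 159000, the sum and nothing more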

Trusted execution environment

A Trusted Execution Environment is an isolated area in a computer processor providing sealed storage for data and attestation that all operations performed on it are the expected and intended ones. Compared to the other solutions mentioned in this article, which are software-based, TEEs are hardware privacy technologies, which makes them more expensive to deploy and maintain.

Anonymous access management

Regarding access management, a common way to implement it is to have a user provide information on their identity, for example their name, email and birthday, and then link an account to them. Zero-knowledge proofs and attribute-based credentials show ways to provide authentication and authorization while limiting what the user discloses about themselves.

Zero-knowledge proof

Zero-knowledge proof is a cryptographic method that allows a user to prove to another party that a given statement is true without revealing any further information. In other words, the user proves they possess certain information or a certain status without revealing its content, for example proving they have a valid account on a platform without revealing the identity behind it. Protocols implementing this concept for access management are not trivial and remain mostly theoretical for now.
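One well-understood example is the Schnorr protocol, sketched below in Python with tiny, insecure group parameters: the prover convinces the verifier that they know the secret x behind the public value y = g^x mod p, without ever revealing x:

import random

# Toy safe-prime group: p = 2q + 1, g generates the subgroup of prime order q.
p, q, g = 227, 113, 4

x = random.randrange(1, q)  # prover's secret
y = pow(g, x, p)            # public value, published by the prover

# One round of the interactive protocol:
r = random.randrange(q)
t = pow(g, r, p)            # prover's commitment
c = random.randrange(q)     # verifier's random challenge
s = (r + c * x) % q         # prover's response

# The check only passes if the prover knows x, yet t, c and s
# reveal nothing about x itself.
assert pow(g, s, p) == (t * pow(y, c, p)) % p
print("proof accepted")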

Attribute-based credentials

Attribute-based credentials (ABC) allow a user to authenticate with the minimal set of attributes needed for authentication, without giving away their full identity. The attributes can additionally be protected using zero-knowledge proofs. For example, consider a streaming platform that offers different types of content, with different subscription packages available depending on which content a user wants to access. In this case, a user’s subscription determines the categories of content they can access: they pay for a certain package of categories and can then access any content in these categories, but none in the categories they did not pay for. The subscription package is linked to a given user, but what this user then accesses on the platform does not need to be linked to them. The platform only needs to verify that anyone accessing a given content has a subscription for the corresponding category. Here, the attributes could be the set of categories a user has paid for, and ABC provides ways for the user to prove these access rights without linking them to the person who made the subscription. These protocols are complex, rely on advanced mathematical constructions and are still at a theoretical stage.
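The following Python sketch shows only the selective-disclosure interface behind this idea, with a hypothetical issuer key and category names: the content server checks a per-category credential and never sees the user’s identity. Real ABC schemes add zero-knowledge machinery so that presentations are also unlinkable to each other, which this toy version omits entirely:

import hashlib, hmac, os

ISSUER_KEY = os.urandom(32)  # held by the subscription service

def issue_credential(category):
    # Issued at subscription time, one credential per paid category.
    return hmac.new(ISSUER_KEY, category.encode(), hashlib.sha256).digest()

def verify_access(category, credential):
    # The content server checks the category credential, nothing else.
    expected = hmac.new(ISSUER_KEY, category.encode(), hashlib.sha256).digest()
    return hmac.compare_digest(expected, credential)

# The user paid for "documentaries" and "sports" only.
wallet = {c: issue_credential(c) for c in ["documentaries", "sports"]}
print(verify_access("sports", wallet["sports"]))  # True
print(verify_access("movies", wallet["sports"]))  # False, not paid for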

Anonymous communications

Finally, the metadata of communications on the Internet is observable by anyone. Even if the content is encrypted, the entities involved in Internet exchanges remain visible, disclosing which endpoints communicate with each other, what type of data is exchanged and at what frequency. Onion and garlic routing are two approaches to anonymizing these communications.

Onion routing

Onion routing is based on the concept of mix networks, where routers receiving communication packets on the Internet send them to the next router in random order to make it hard to link a packet entering the router with a packet leaving it. “Onion” comes from the fact that each router on the communication path between two endpoints (e.g. you and the website you are visiting) adds a layer of encryption, such that each router only knows the previous and the next router on the path. This adds some delay to communications, and implementations like Tor still leave a potential surface for traffic analysis attacks, enabling an attacker to re-link the two communicating endpoints.

In the following illustration, we show a simplified view of the layered encryption in onion routing. Alice sends a message “Hello” to Bob, and Clara sends her own message “Hi” to Dave. The first router on the communication path removes its encryption layer, represented in yellow, and forwards both messages in random order to the next router on the path. This last router removes its own encryption layer and passes the messages, again in random order, to their respective destinations.

onion routing illustration
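The layering itself can be sketched in a few lines of Python using the Fernet primitive from the third-party cryptography package; the two-router path is an illustrative assumption and a far cry from a real mix network:

from cryptography.fernet import Fernet

# Each router on the path holds its own key.
router1 = Fernet(Fernet.generate_key())
router2 = Fernet(Fernet.generate_key())

# The sender wraps the message in one layer per router:
# the outermost layer belongs to the first router on the path.
message = b"Hello"
onion = router1.encrypt(router2.encrypt(message))

# Router 1 peels only its own layer; what it forwards is still
# ciphertext for router 2.
layer_for_router2 = router1.decrypt(onion)
assert layer_for_router2 != message

# Router 2 removes the last layer and delivers the plaintext.
assert router2.decrypt(layer_for_router2) == message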

Garlic routing

Garlic routing is based on onion routing. The main difference is that it encrypts multiple packets together to make it more difficult for attackers to perform traffic analysis. This works best if the overall traffic is uniformly distributed among routers, which may not always be the case.

Using the same example as for onion routing, we illustrate below the gathering of several messages into a single one before a router sends them to their common next destination. In this case, it is harder for an attacker between routers 1 and 2 to analyse packet sizes and try to link them to the messages received by router 1, especially in the more realistic case where router 1 receives additional messages from other sources going to different destinations.

garlic routing illustration
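Building on the previous sketch, the snippet below bundles the two messages into a single encrypted payload before the hop to router 2, again using the cryptography package and the same illustrative setup:

import json
from cryptography.fernet import Fernet

router2 = Fernet(Fernet.generate_key())

messages = ["Hello", "Hi"]  # Alice's and Clara's messages
# Encrypting both messages as one blob hides their individual sizes
# from an observer between the two routers.
garlic = router2.encrypt(json.dumps(messages).encode())

# Router 2 decrypts once and splits the bundle back into messages.
assert json.loads(router2.decrypt(garlic)) == messages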

Conclusion

We gave an overview of different types of PETs for various kinds of applications and goals. We focused on PETs that protect data during analysis or computation by third parties and in communications, as well as PETs providing new paradigms for unlinking identity from access management. This is not an exhaustive list of existing PETs, and each of them would deserve its own article. It should nevertheless give a first taste and hopefully spark curiosity to learn more about these emerging technologies, considering the many challenges that come with privacy protection. Data has already been established as a precious and sensitive resource, be it in research, business, information or intelligence. Developing concrete tools to protect this data has become crucial on an individual as much as on a societal scale.

About the Author

Lucie Hoffmann

Lucie Hoffmann completed a Bachelor in Information and Communication Systems at EPFL, followed by a Master in Cybersecurity offered jointly by EPFL and ETH. She gained working experience with the new network architecture SCION during her master’s thesis. Her focus is web application security.
