
Zheng Feng, Chief Scientist of CRI TSING'S TECH: 10 Privacy-Enhancing Technologies

Published: September 14, 2023

What are Privacy-enhancing Technologies?

Privacy-enhancing technologies include any technology that increases the privacy and security of sensitive data – critical information like customer names, phone numbers, email addresses and social security numbers (SSNs).  This article focuses on 10 privacy-enhancing technologies.

Secure Multiparty Computation
Secure multiparty computation (MPC) is a branch of cryptography that distributes a computation across multiple parties such that no individual party can see another party's data. The core idea is to give different parties a way to jointly compute over their data and arrive at a mutually desired result without divulging their private data to the other parties.

A classic example use case for this protocol is the average salary problem: how can a group of workers compute their average salary without divulging their individual salaries to one another?

MPC can solve this problem using the following series of operations:

Worker A chooses a random number W and adds it to their wage, then passes the sum to worker B, who adds their wage and passes the new sum to worker C, and so on. When the sum comes back around, worker A subtracts W and divides by the number of workers to get the average wage.
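To make this concrete, here is a minimal Python sketch of the masking trick (the function name and salary figures are invented for illustration). In a real MPC deployment each addition would happen on a different party's machine, so every party only ever sees a running sum masked by W, never an individual salary:

```python
import random

# Sketch of the average-salary protocol using additive masking.
def average_salary(salaries):
    w = random.randrange(10**9)      # Worker A's random mask W
    running = w + salaries[0]        # Worker A adds their own salary
    for salary in salaries[1:]:      # each other worker adds theirs in turn
        running += salary
    total = running - w              # Worker A removes the mask
    return total / len(salaries)

print(average_salary([52_000, 61_500, 48_000, 75_000]))  # 59125.0
```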
MPC was introduced in the early 1970s and first formalized in the late 1980s, but only recently has it moved beyond academic study to solving real-world problems in commercial products. MPC is now used for a wide range of applications, including fraud detection, heart disease analysis, and feature aggregation across private datasets.

It's an extremely powerful approach to privacy, but it has some limitations. MPC adds computational overhead to an existing system as well as high communication costs between the parties, which makes it impractical for many problems.

De-identification Techniques
De-identification is the process of removing personal information from a data set. There are multiple de-identification methods, such as tokenization and k-anonymization.
Tokenization
Tokenization is a non-algorithmic approach to data obfuscation that swaps sensitive data for tokens. For example, a customer’s name like “John Smith” could be replaced by a tokenized string like “7afd3186-369f-4898-ac93-3a4e732ebf7c”. Since there’s no mathematical relationship between “John Smith” and the string “7afd3186-369f-4898-ac93-3a4e732ebf7c”, there’s no way to get the original data from the tokenized data without access to the tokenization process. 
A simple tokenization system for names and email addresses
There are a variety of techniques and styles of tokenization, including length-preserving tokenization, format-preserving tokenization, and random versus consistent tokenization. Different approaches have different tradeoffs and can help support different use cases.
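As a rough illustration, the sketch below implements a toy random-tokenization vault in Python (the TokenVault class and its methods are hypothetical, not any particular product's API). A consistent-tokenization variant would instead return the same token every time it sees the same value, so tokenized data could still be joined or counted:

```python
import uuid

# Random tokenization: each sensitive value is swapped for a token with no
# mathematical relationship to the original. The mapping lives only inside
# the trusted token vault.
class TokenVault:
    def __init__(self):
        self._token_to_value = {}

    def tokenize(self, value: str) -> str:
        token = str(uuid.uuid4())
        self._token_to_value[token] = value
        return token

    def detokenize(self, token: str) -> str:
        return self._token_to_value[token]

vault = TokenVault()
token = vault.tokenize("John Smith")
print(token)                    # e.g. "7afd3186-369f-4898-ac93-3a4e732ebf7c"
print(vault.detokenize(token))  # "John Smith"
```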
K-anonymization
K-anonymization was first proposed in the late 90s by researchers Latanya Sweeney and Pierangela Samarati. K-anonymity is a property of data: a data set is said to have the k-anonymity property if the information for each person in it can't be distinguished from that of at least k-1 other individuals in the data set.

This is a "safety in numbers" approach. Essentially, if everyone belongs to a group, then any of the records within the group could correspond to any single individual in it. The downside of k-anonymization is that there is no randomization, so an attacker can still make inferences about the data. Additionally, if an attacker knows something about an individual, they can use the groups to learn more about that person. For example, if all women in our data set over the age of 60 have breast cancer, and an attacker knows that Julie is over 60 and in the data set, then the attacker knows Julie has breast cancer.

Like many of the PETs on this list, k-anonymization is a powerful tool when combined with additional techniques and reinforced by putting the right safeguards in place.
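One way to make the k-anonymity property concrete is to check that every combination of quasi-identifier values appears at least k times. The sketch below (with invented column names and records) does exactly that; note that its first group also illustrates the inference problem described above, since everyone in the "60+" group shares the same diagnosis:

```python
from collections import Counter

# Check whether a table satisfies k-anonymity with respect to a chosen set
# of quasi-identifier columns.
def is_k_anonymous(rows, quasi_identifiers, k):
    groups = Counter(tuple(row[col] for col in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())

rows = [
    {"age_range": "60+",   "zip3": "021", "diagnosis": "breast cancer"},
    {"age_range": "60+",   "zip3": "021", "diagnosis": "breast cancer"},
    {"age_range": "40-59", "zip3": "021", "diagnosis": "flu"},
    {"age_range": "40-59", "zip3": "021", "diagnosis": "healthy"},
]
print(is_k_anonymous(rows, ["age_range", "zip3"], k=2))  # True
```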

Pseudoanonymization

Pseudoanonymization is a form of obfuscation that hides the identity of an individual by replacing field values with pseudonyms. With pseudoanonymization, only a portion of the identifying data is removed – enough that the data values can't be linked to the person or thing they refer to (the "data subject").

There are a variety of pseudoanonymization methods, including scrambling, where the original values are mixed with obfuscated letters, and data masking, where some part of the original data is hidden.
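For example, a minimal data-masking sketch might look like the following (the masking rules here are illustrative conventions, not a standard):

```python
# Data masking: part of the original value is hidden while the rest stays usable.
def mask_email(email: str) -> str:
    local, _, domain = email.partition("@")
    return local[0] + "*" * (len(local) - 1) + "@" + domain

def mask_ssn(ssn: str) -> str:
    return "***-**-" + ssn[-4:]

print(mask_email("john.smith@example.com"))  # j*********@example.com
print(mask_ssn("078-05-1120"))               # ***-**-1120
```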

With pseudoanonymization, there's always a risk of re-identification. Different methods carry different risks, and some methods aren't compliant with certain regulations. And while pseudoanonymization has many uses, this method alone isn't a complete data privacy solution.


Homomorphic Encryption

Basic forms of encryption have been around since 1900 BC and modern techniques like RSA, AES, and DES are widely used for secure data transmission and storage. However, the downside of all of these techniques is that in order to perform operations on the data, you need to decrypt it first. This opens up a potential attack surface as decrypted data is cached in memory, or because small coding errors can result in sensitive unencrypted data showing up in your log files or other places. This increases the scope of your security and compliance problems.

Homomorphic encryption is widely considered the “gold standard” of encryption. It’s a form of encryption that permits you to perform computations on encrypted data without first decrypting it.
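For intuition, here is a minimal and deliberately insecure sketch of the Paillier cryptosystem, a partially (additively) homomorphic scheme: multiplying two ciphertexts produces a ciphertext of the sum of the plaintexts, so a sum can be computed without decrypting the inputs. Real deployments use keys of 2048 bits or more; the primes below are toy-sized, and the scheme only supports addition, which hints at why supporting arbitrary operations (full homomorphism) is so much harder:

```python
import math
import random

def lcm(a, b):
    return a * b // math.gcd(a, b)

def l_func(x, n):
    return (x - 1) // n

def keygen(p, q):
    n = p * q
    lam = lcm(p - 1, q - 1)
    g = n + 1
    mu = pow(l_func(pow(g, lam, n * n), n), -1, n)  # modular inverse (Python 3.8+)
    return (n, g), (lam, mu, n)

def encrypt(pub, m):
    n, g = pub
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(priv, c):
    lam, mu, n = priv
    return (l_func(pow(c, lam, n * n), n) * mu) % n

pub, priv = keygen(293, 433)            # toy primes, for illustration only
c1 = encrypt(pub, 15)
c2 = encrypt(pub, 27)
c_sum = (c1 * c2) % (pub[0] ** 2)       # homomorphic addition of plaintexts
print(decrypt(priv, c_sum))             # 42, computed without decrypting c1 or c2
```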

In practice, however, fully homomorphic encryption suffers from performance challenges because it's very complex and needs to support nearly any operation on encrypted data. This is an ideal capability in theory, but the reality for most software engineers is that they don't need to perform arbitrary operations on data; they typically need to perform a few very specific ones.

Supporting unnecessary operations on encrypted data leads to slow performance. Enter polymorphic encryption. Like homomorphic encryption, polymorphic encryption supports operations on encrypted data, but it's designed specifically to support the use cases and types of operations that engineers typically need to perform on sensitive data. Examples include indexing and decrypting only the last four digits of a social security number, or determining whether a customer's credit score is above or below a given threshold without ever seeing the actual score.

Polymorphic encryption is a very powerful PET that is ready for businesses today.


Federated Learning

Federated learning is a decentralized form of machine learning. In machine learning, typically training data is aggregated from multiple sources (mobile phones, laptops, IoT devices, etc.) and brought to a central server for training. However, from a privacy standpoint, this obviously has issues.

Example of federated learning. Training happens locally and results are reported centrally.

In a federated learning model, training happens locally and the results, rather than the raw inputs, are reported to the central server. For example, a copy of the central machine learning model is available directly on each device. The local model learns and trains itself on-device based on the user's input and device usage. The device then transfers only the training results from the local copy back to the central server.

The results from all decentralized devices are aggregated together, but without seeing any of the user data. Users’ devices can then be updated with the newly-trained central model as shown in the image above.
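The sketch below simulates this idea with federated averaging of a simple linear model (the data, model, and hyperparameters are invented for illustration). Each simulated device runs a few gradient steps on its own private data, and the server only ever sees and averages the resulting weights:

```python
import numpy as np

# Federated averaging for a linear model y ≈ w·x: raw data never leaves a device.
def local_update(w, x, y, lr=0.01, epochs=20):
    for _ in range(epochs):
        grad = 2 * x.T @ (x @ w - y) / len(y)   # gradient of mean squared error
        w = w - lr * grad
    return w

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
devices = []
for _ in range(5):                               # 5 devices, each with private data
    x = rng.normal(size=(50, 2))
    y = x @ true_w + rng.normal(scale=0.1, size=50)
    devices.append((x, y))

global_w = np.zeros(2)
for _ in range(10):                              # each round: broadcast, train locally, average
    local_ws = [local_update(global_w.copy(), x, y) for x, y in devices]
    global_w = np.mean(local_ws, axis=0)         # server sees only model weights

print(global_w)                                  # approaches [2.0, -1.0]
```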

Federated learning has some limitations, though. It requires frequent communication between the devices and the central server, and therefore a lot of network bandwidth. Additionally, local training requires enough on-device computing power and memory to actually train the model.


Zero-Knowledge Proof

Imagine if you could go to a bar and order a drink, and instead of the bartender checking your driver's license for your birthdate and doing the math to figure out whether you're old enough, they could simply ask, "Are you old enough to purchase alcohol?", you could answer "Yes," and they could know for certain whether your "Yes" was true. This type of scenario is possible with zero-knowledge proofs.

A zero-knowledge proof is a method that lets one party prove to another party that a given statement is true without revealing any information beyond the fact that the statement is true.

A classic example of a zero-knowledge proof is the cave of Ali Baba.

In this scenario there are two characters, Peggy (the prover) and Victor (the verifier). Both Peggy and Victor find themselves at the entrance to the cave, which has two distinct entrances to two different paths (A and B). Inside the cave there’s a door that connects both paths, but can only be opened with a secret code. Peggy owns the code and Victor wants to have it, but first Victor wants to make sure Peggy isn’t lying about owning the code.

How can Peggy show Victor that she owns the code without revealing it?

To do this, Peggy enters the cave by either door (A or B). Once she's inside, Victor approaches the cave and shouts to Peggy to return by one of the two paths, chosen arbitrarily. If Victor yells to return via path A but Peggy is currently in path B, she can only return via A if she really has the secret code. Of course, if Peggy is lying she might get lucky and already be in path A, but as long as the process is repeated enough times, only a Peggy who truly has the code can return along the correct path every time.
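A toy simulation of the cave protocol shows why the repetition matters: an honest Peggy always passes, while a cheating Peggy, who can only comply when she happens to have picked the challenged path, passes all 20 rounds with a probability of about one in a million:

```python
import random

# Toy simulation of the Ali Baba cave protocol described above.
def run_protocol(knows_code, rounds=20):
    for _ in range(rounds):
        peggy_path = random.choice("AB")        # path Peggy entered by
        victor_challenge = random.choice("AB")  # path Victor shouts
        # With the code Peggy can open the door and exit by either path;
        # without it she can only comply if she happened to pick that path.
        if not knows_code and peggy_path != victor_challenge:
            return False  # caught cheating
    return True  # Victor is convinced

print(run_protocol(knows_code=True))   # always True
cheats = sum(run_protocol(knows_code=False) for _ in range(10_000))
print(cheats)  # almost always 0: a cheater passes 20 rounds with probability 2**-20
```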

In terms of practical applications, a zero-knowledge cryptographic method is used by the cryptocurrency Zcash. It's also used by customers of ING to prove that their income is within an admissible range without revealing their exact salary.

The challenges with zero-knowledge proofs are that the answer isn't 100% guaranteed and that the process is computationally intensive. Many interactions between the prover and the verifier are required to reduce the probability of misrepresentation to an acceptable level, which makes the approach impractical for slow or low-powered devices.

Differential Privacy

Differential privacy is a PET that’s seeing growing commercial interest. The idea of differential privacy is that by introducing a small amount of noise into a data set, a query result can’t be used to infer much about a single individual. Differential privacy isn’t a specific process like de-identification, but a property that an algorithm or process can have.

With differential privacy, you’re adding a layer of privacy by adding noise to the original data. As a consequence, any modeling or analytics performed using the resulting data has diminished accuracy.

The key to using differential privacy well is to carefully balance the need for accurate results with the need to protect an individual’s privacy by obfuscating those results.
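A common building block here is the Laplace mechanism. The sketch below adds Laplace noise scaled by 1/epsilon to a counting query (whose sensitivity is 1, since adding or removing one person changes the count by at most 1). A smaller epsilon gives more privacy and noisier answers; a larger epsilon gives the reverse, which is exactly the balance described above:

```python
import numpy as np

# Minimal sketch of the Laplace mechanism for a counting query.
def private_count(data, predicate, epsilon):
    true_count = sum(predicate(x) for x in data)
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

ages = [34, 61, 67, 45, 70, 29, 63]          # true count of ages > 60 is 4
print(private_count(ages, lambda a: a > 60, epsilon=0.5))  # very noisy
print(private_count(ages, lambda a: a > 60, epsilon=5.0))  # usually close to 4
```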


Synthetic Data

Synthetic data (or “fake data”) is artificially generated data that mirrors the patterns and composition of the original dataset. It’s a type of data anonymization. Synthetic data is a great solution for certain use cases, like giving engineers something that has similar properties to production data for testing purposes without exposing actual customer PII. Synthetic data is also widely used for training the machine learning models used in fraud detection systems.

It’s a complex process to synthetically generate data that looks and feels real enough to be useful for testing or other applications. For example, if a business has multiple databases with independent tables containing a customer name, the same customer in both databases would have the same name. Ideally the synthetic data that’s generated for this business is able to capture this pattern.

Trusted Execution Environment
A Trusted Execution Environment (TEE) is a hardware-based approach to privacy. With a TEE, a hardware partition separates part of the CPU from the main computer's processes and memory. Data within the TEE can't be accessed from the main processor, and communication between the TEE and the rest of the CPU is encrypted. Operations on the protected data can only take place within the TEE.
Intel, AMD, and other chip manufacturers now offer chips with TEE support. AWS Nitro Enclaves uses this technology to create isolated compute environments suitable for processing highly sensitive data like PII while keeping it secure and private.

Data Privacy Vault
A data privacy vault isolates, secures, and tightly controls access to manage, monitor, and use sensitive data. It's both a technology and an architectural design pattern. A data privacy vault combines multiple PETs, like encryption, tokenization, and data masking, along with data security techniques like zero trust, logging, and auditing, and the principle of isolation.

Example of using a data privacy vault. Sensitive data is sent to the vault at collection. Tokenized data is stored downstream.

A business that uses a data privacy vault moves their customer PII out of their existing data storage and infrastructure and into the vault, as shown in the image above. The data privacy vault becomes the single source of truth for all customer PII, effectively de-scoping the existing application infrastructure from the responsibilities of data security and compliance. This architectural design approach to data privacy carries another benefit: it significantly reduces the complexity of data residency or localization.
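A highly simplified sketch of the pattern is shown below (the roles, policies, and masking rule are invented, not any particular vault product's API): PII enters the vault at collection, downstream systems keep only tokens, and detokenization is mediated by per-role policies and recorded for auditing:

```python
import uuid

# Toy data privacy vault: PII lives only inside the vault, downstream systems
# store tokens, and detokenization is governed by per-role policies.
class DataPrivacyVault:
    POLICIES = {
        "support_agent": "masked",    # sees only a masked form
        "payments_service": "plain",  # allowed to see the real value
    }

    def __init__(self):
        self._records = {}
        self._audit_log = []

    def collect(self, pii: dict) -> str:
        token = str(uuid.uuid4())
        self._records[token] = pii
        return token                           # only the token leaves the vault

    def detokenize(self, token: str, role: str, field: str) -> str:
        self._audit_log.append((role, token, field))
        policy = self.POLICIES.get(role)
        value = self._records[token][field]
        if policy == "plain":
            return value
        if policy == "masked":
            return "*" * (len(value) - 4) + value[-4:]
        raise PermissionError(f"role {role!r} may not read {field!r}")

vault = DataPrivacyVault()
token = vault.collect({"name": "John Smith", "ssn": "078-05-1120"})
orders_db_row = {"customer": token, "item": "book"}        # downstream stores the token only
print(vault.detokenize(token, "support_agent", "ssn"))     # *******1120
print(vault.detokenize(token, "payments_service", "ssn"))  # 078-05-1120
```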

The Future of Privacy

Data privacy isn’t just about compliance, it’s also the right thing to do for your users. There’s a tremendous amount of growth and momentum in this space that’s going to help engineers build better, more secure systems while still allowing businesses to use the data they collect about customers to improve their products.

As these technologies become commercially available and abstraction layers like APIs and SDKs are developed, utilizing these technologies in our everyday engineering tasks will become as easy and commonplace as programmatically placing a phone call or issuing a credit card transaction.

Data privacy isn't the CISO's job or the privacy officer's job; it's everyone's job. As engineers who are often tasked with the technical aspects of securing sensitive data, it's critical that we understand the landscape of privacy-enhancing tools and technologies. And it's vitally important that we use this understanding to follow privacy best practices and honor the trust placed in us when our users share their sensitive personal data.

CRI TSING'S TECH is a high-tech company under State Broadcasting Holdings, a group with a central state media background. Its core team comes from well-known universities and research institutions such as Tsinghua University, Renmin University of China, Communication University of China, and the Chinese Academy of Sciences, along with industry experts from companies such as Meituan, Alibaba, Huawei, and Inspur. With big data, artificial intelligence, and privacy computing as its main lines of business, CRI TSING'S TECH provides full-stack privacy computing technology services, digital resource management solutions, data systems, and data services, helping data elements circulate and driving organizations toward digital and intelligent transformation.

Reference:

https://softwareengineeringdaily.com/2022/07/21/10-privacy-enhancing-technologies-every-engineer-should-know-about/