Privacy-enhancing technologies (PETs) include any technology that improves the privacy and security of sensitive data, such as customer names, phone numbers, email addresses, and Social Security numbers (SSNs). This article introduces 10 such technologies.
A classic example use case for secure multi-party computation (MPC) is the average salary problem: How can a group of workers compute their average salary without divulging their own personal salaries to others?
MPC can solve this problem using the following series of operations:
Worker A picks a random number W, adds it to their salary, and passes the sum to worker B. Worker B adds their salary to this sum and passes it on to worker C, and so on. Worker A receives the final total, subtracts W, and can then compute the average salary.
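The steps above can be sketched in Python. This is a toy single-process simulation; real MPC runs across separate parties communicating over a network, so no single process ever sees all the salaries.

```python
import random

def average_salary(salaries):
    """Toy simulation of the additive-masking protocol described above."""
    # Worker A picks a random mask W and adds it to their own salary.
    W = random.randint(0, 10**12)
    running_total = W + salaries[0]
    # Each remaining worker adds their salary to the running total;
    # the total they see reveals nothing about any individual salary.
    for s in salaries[1:]:
        running_total += s
    # The total returns to worker A, who removes W and averages.
    return (running_total - W) / len(salaries)

print(average_salary([50_000, 60_000, 70_000]))  # 60000.0
```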
This is a very powerful approach to privacy, but it has some limitations. Using MPC adds both computational overhead to existing systems and high communication costs between the different parties. This makes MPC impractical for many problems.
K-anonymization was first proposed in the late 90s by researchers Latanya Sweeney and Pierangela Samarati. K-anonymity is a property of data. A data set is said to have the k-anonymity property if the information for each person in the data set can’t be distinguished from at least k-1 individuals.
This is a “safety in numbers” approach. Essentially, if everyone belongs to a group, then any of the records within the group could correspond to a single individual. The downside of k-anonymization is that there is no randomization, so an attacker can still make inferences about the data. Additionally, if an attacker knows some information about an individual, they can use the groups to learn additional information about that person. For example, if all women in our data set over the age of 60 have breast cancer and the attackers know that Julie is over 60 and in the data set, then now the attackers know Julie has breast cancer.
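Checking the k-anonymity level of a data set follows directly from the definition: group records by their quasi-identifiers and find the smallest group. A minimal sketch, with hypothetical column names:

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the k-anonymity level: the size of the smallest group of
    records sharing identical values for the quasi-identifiers."""
    groups = Counter(
        tuple(r[q] for q in quasi_identifiers) for r in records
    )
    return min(groups.values())

records = [
    {"age": "60+",   "zip": "021**", "diagnosis": "cancer"},
    {"age": "60+",   "zip": "021**", "diagnosis": "flu"},
    {"age": "30-59", "zip": "100**", "diagnosis": "flu"},
    {"age": "30-59", "zip": "100**", "diagnosis": "cancer"},
]
print(k_anonymity(records, ["age", "zip"]))  # 2
```

Here every record is indistinguishable from at least one other on `age` and `zip`, so the data set is 2-anonymous with respect to those quasi-identifiers.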
Pseudonymization is a form of obfuscation that hides the identity of an individual by replacing field values with pseudonyms. With pseudonymization, only a portion of the identifying data is removed: enough that the data values can't be linked to the person or thing they refer to (the "data subject").
There are a variety of pseudonymization methods, including scrambling, whereby the original values are mixed with obfuscated letters, and data masking, where some part of the original data is hidden.
With pseudonymization, there's always a risk of re-identification. Different methods carry different risks, and some methods aren't compliant with certain regulations. And while pseudonymization has many uses, this method alone isn't a complete data privacy solution.
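As a minimal sketch, replacing a value with a generated pseudonym and masking part of an SSN might look like the following. The in-memory pseudonym table is hypothetical; a real system would persist the mapping securely and separately from the data.

```python
import secrets

# Hypothetical in-memory pseudonym table for illustration only.
_pseudonyms = {}

def pseudonymize(value):
    """Replace a value with a stable, meaningless pseudonym."""
    if value not in _pseudonyms:
        _pseudonyms[value] = "user_" + secrets.token_hex(4)
    return _pseudonyms[value]

def mask_ssn(ssn):
    """Data masking: hide all but the last four digits."""
    return "***-**-" + ssn[-4:]

print(mask_ssn("123-45-6789"))  # ***-**-6789
```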
Basic forms of encryption have been around since 1900 BC and modern techniques like RSA, AES, and DES are widely used for secure data transmission and storage. However, the downside of all of these techniques is that in order to perform operations on the data, you need to decrypt it first. This opens up a potential attack surface as decrypted data is cached in memory, or because small coding errors can result in sensitive unencrypted data showing up in your log files or other places. This increases the scope of your security and compliance problems.
Homomorphic encryption is widely considered the “gold standard” of encryption. It’s a form of encryption that permits you to perform computations on encrypted data without first decrypting it.
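A toy way to see the idea: textbook (unpadded) RSA is multiplicatively homomorphic, meaning the product of two ciphertexts decrypts to the product of the two plaintexts. This sketch uses deliberately tiny, insecure parameters purely for illustration:

```python
# Toy illustration (NOT secure): textbook RSA with tiny primes.
p, q, e = 61, 53, 17
n = p * q                      # public modulus
phi = (p - 1) * (q - 1)
d = pow(e, -1, phi)            # private exponent (Python 3.8+ modular inverse)

def enc(m):
    return pow(m, e, n)

def dec(c):
    return pow(c, d, n)

a, b = 7, 6
# Multiply the ciphertexts without ever decrypting them...
product_cipher = (enc(a) * enc(b)) % n
# ...and the decrypted result is the product of the plaintexts.
print(dec(product_cipher))  # 42
```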
However, in practice, fully homomorphic encryption suffers from performance challenges because it's very complex and needs to support nearly any operation on encrypted data. This is an ideal capability in theory, but the reality for most software engineers is that they don't need to perform arbitrary operations on data. They typically need to perform a few very specific operations.
However, supporting unnecessary operations on encrypted data can lead to slow performance. Enter polymorphic encryption. Like homomorphic encryption, polymorphic encryption supports operations on encrypted data, but it is designed specifically to support the use cases and types of operations that engineers typically need to perform on sensitive data. Examples include indexing and decrypting only the last four digits of a social security number, or determining whether a customer's credit score is above or below a given threshold without actually seeing the score.
Polymorphic encryption is a very powerful PET that is ready for businesses today.
Federated learning is a decentralized form of machine learning. In machine learning, typically training data is aggregated from multiple sources (mobile phones, laptops, IoT devices, etc.) and brought to a central server for training. However, from a privacy standpoint, this obviously has issues.
Example of federated learning. Training happens locally and results are reported centrally.
In a federated learning model, training happens locally and the results are reported to the centralized server, rather than the raw inputs. For example, the centralized machine learning application is available directly on all devices. The local model learns and trains itself on the device based on the user’s input and device usage. The device then transfers only the training results from the local copy back to the central server.
The results from all decentralized devices are aggregated together, but without seeing any of the user data. Users’ devices can then be updated with the newly-trained central model as shown in the image above.
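The flow described above can be sketched as federated averaging of a one-parameter linear model. This is a toy single-process simulation; the model, learning rate, and per-device data are all hypothetical.

```python
def local_train(weight, data, lr=0.01):
    """One pass of gradient descent on a device for the model y = w * x.
    Only the updated weight leaves the device, never the raw data."""
    w = weight
    for x, y in data:
        grad = 2 * (w * x - y) * x   # gradient of squared error
        w -= lr * grad
    return w

def federated_average(local_weights):
    """Server-side step: average the weights reported by all devices."""
    return sum(local_weights) / len(local_weights)

global_w = 0.0
device_data = [[(1.0, 2.0)], [(2.0, 4.0)], [(3.0, 6.0)]]  # stays on-device
local_weights = [local_train(global_w, d) for d in device_data]
global_w = federated_average(local_weights)
print(global_w)
```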
Federated learning has some limitations, though. It requires frequent communication between devices and the central server, so it requires a lot of network bandwidth. Additionally, local training requires enough local computing power and memory to actually train the model.
Imagine if you could go to a bar and order a drink and instead of the bartender asking to see your driver’s license to check your birthdate and then do the math to figure out if you’re old enough, they could simply ask, “Are you old enough to purchase alcohol?”, and you could respond “Yes” and they could know for sure whether your “Yes” was true or not? This type of scenario is possible with zero-knowledge proofs.
A zero-knowledge proof is a method that lets one party prove to another that a given statement is true without revealing any information beyond the fact that the statement is true.
A classic example of a zero-knowledge proof is the cave of Ali Baba.
In this scenario there are two characters, Peggy (the prover) and Victor (the verifier). Both Peggy and Victor find themselves at the entrance to the cave, which has two distinct entrances to two different paths (A and B). Inside the cave there’s a door that connects both paths, but can only be opened with a secret code. Peggy owns the code and Victor wants to have it, but first Victor wants to make sure Peggy isn’t lying about owning the code.
How can Peggy show Victor that she owns the code without revealing it?
To do this, Peggy can enter the cave by either door (A or B). Once inside, Victor approaches the cave and shouts to Peggy to return by one of the two paths, chosen arbitrarily. If Victor yells to return via the A path but Peggy is currently in the B path, she can only return via A if she indeed has the secret code. Of course, Peggy might get lucky if she’s lying and already be in the A path, but as long as this process is repeated enough times, Peggy can only always return along the correct path if she has the secret code.
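The repeated challenge can be simulated to see why cheating becomes vanishingly unlikely: after n rounds, a lying Peggy survives with probability 1/2^n. A toy simulation:

```python
import random

def run_protocol(peggy_knows_code, rounds=20):
    """Simulate the Ali Baba cave. Returns True if Peggy passes every
    round, i.e. Victor becomes convinced she knows the code."""
    for _ in range(rounds):
        peggy_path = random.choice("AB")      # path Peggy entered by
        victor_demand = random.choice("AB")   # path Victor shouts
        # Without the code Peggy can't pass through the door, so she
        # can only return the way she came in.
        if not peggy_knows_code and peggy_path != victor_demand:
            return False  # caught lying
    return True
```

With the code, Peggy passes every round; without it, she must get lucky on every single round, so her odds of fooling Victor shrink exponentially with the number of rounds.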
In terms of practical applications, a zero-knowledge cryptographic method is being used by the cryptocurrency Zcash. It's also being used by customers of ING to prove that their income is within an admissible range without revealing their exact salary.
The challenges with zero-knowledge proofs are that the answer isn't 100% guaranteed and that they're computationally intensive. Many interactions between the prover and the verifier are required to reduce the probability of a false statement to an acceptable level. This makes them impractical for slow or low-powered devices.
Differential privacy is a PET that’s seeing growing commercial interest. The idea of differential privacy is that by introducing a small amount of noise into a data set, a query result can’t be used to infer much about a single individual. Differential privacy isn’t a specific process like de-identification, but a property that an algorithm or process can have.
With differential privacy, you’re adding a layer of privacy by adding noise to the original data. As a consequence, any modeling or analytics performed using the resulting data has diminished accuracy.
The key to using differential privacy well is to carefully balance the need for accurate results with the need to protect an individual’s privacy by obfuscating those results.
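A common way to add such noise is the Laplace mechanism, where Laplace noise can be drawn as the difference of two exponential samples. A minimal sketch; `epsilon` controls the privacy/accuracy trade-off (smaller epsilon means more noise and more privacy), and the default of 0.5 is purely illustrative:

```python
import random

def dp_count(true_count, epsilon=0.5):
    """Return a noisy count via the Laplace mechanism for a query with
    sensitivity 1 (adding/removing one person changes the count by 1)."""
    # The difference of two Exp(epsilon) samples is Laplace(0, 1/epsilon).
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise
```

Repeated queries average out to the true count, but any single released value reveals little about whether a particular individual is in the data set.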
Synthetic data (or "fake data") is artificially generated data that mirrors the patterns and composition of an original data set. It's a form of data anonymization. Synthetic data is a great solution for certain use cases, such as giving engineers something with similar properties to production data for testing purposes without exposing actual customer PII. Synthetic data is also widely used to train the machine learning models used in fraud detection systems.
Generating synthetic data that looks and feels realistic enough to be useful for testing or other applications is a complex process. For example, if a business has multiple databases with separate tables containing customer names, the same customer should have the same name in both databases. Ideally, synthetic data generated for that business captures this pattern.
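A minimal sketch of that idea, with hypothetical table shapes: a seeded generator produces fake customers, and a second table reuses the same ids and names so the cross-table pattern is preserved.

```python
import random

FIRST_NAMES = ["Ana", "Ben", "Chen", "Dita"]

def synth_customers(n, seed=0):
    """Generate n fake customer records, deterministic for a given seed."""
    rng = random.Random(seed)
    return [
        {"id": i, "name": rng.choice(FIRST_NAMES), "age": rng.randint(18, 90)}
        for i in range(n)
    ]

customers = synth_customers(3)
# Reusing ids and names in a second table keeps the same synthetic
# customer consistent across tables, as real data would be.
orders = [
    {"customer_id": c["id"], "name": c["name"], "total": 10.0}
    for c in customers
]
```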
A data privacy vault isolates, secures, and tightly controls access to manage, monitor, and use sensitive data. It's both a technology and an architectural design pattern. A data privacy vault combines multiple PETs, like encryption, tokenization, and data masking, along with data security techniques like zero trust, logging, and auditing, and the principle of isolation.
Example of using a data privacy vault. Sensitive data is sent to the vault at collection. Tokenized data is stored downstream.
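A toy sketch of the tokenization half of this pattern. Real vaults add encryption, access control, and auditing; the token format here is hypothetical.

```python
import secrets

class PrivacyVault:
    """Toy sketch: the vault keeps the real value and hands back an
    opaque token; downstream systems store only the token."""

    def __init__(self):
        self._store = {}

    def tokenize(self, value):
        token = "tok_" + secrets.token_hex(8)
        self._store[token] = value
        return token

    def detokenize(self, token):
        # In a real vault this call would be access-controlled and audited.
        return self._store[token]

vault = PrivacyVault()
token = vault.tokenize("123-45-6789")   # token goes to downstream systems
original = vault.detokenize(token)      # only the vault can reverse it
```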
The Future of Privacy
Data privacy isn't just about compliance; it's also the right thing to do for your users. There's tremendous growth and momentum in this space that will help engineers build better, more secure systems while still allowing businesses to use the data they collect about their customers to improve their products.
As these technologies become commercially available and abstraction layers like APIs and SDKs are developed, using them in our day-to-day engineering tasks will become as simple and commonplace as programmatically making a phone call or issuing a credit card transaction.
Data privacy isn't the CISO's job or the privacy officer's job; it's everyone's job. As engineers, who are often responsible for the technical side of protecting sensitive data, it's critical that we understand the landscape of privacy-enhancing tools and technologies. And it's critical that we use that understanding to follow privacy best practices and honor the trust users place in us when they share their sensitive personal data.
Reference:
https://softwareengineeringdaily.com/2022/07/21/10-privacy-enhancing-technologies-every-engineer-should-know-about/