Differential Privacy

Table 1: Example dataset where ‘Answer’ is the sensitive attribute

There are two common ways to measure the utility of a differentially private output:

1. Noise measurement:

  • This method measures how much noise is added to the query results.
  • It is the most commonly used way to measure utility.
  • The smaller the noise, the higher the utility.

2. Error measurement:

  • In this method, we calculate the difference between the non-private and the private output.
  • The smaller the difference, the higher the utility.
  • The error is generally represented by a bound with accuracy parameters: for example, an output is (α, β)-accurate if the private answer is within α of the true answer with probability at least 1 − β.
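As a concrete illustration of both utility measures, here is a minimal sketch in Python (assuming NumPy; the count query, epsilon, and accuracy parameters are illustrative choices, not taken from the text above) that measures the noise added by the Laplace mechanism and checks an accuracy bound empirically:

```python
import numpy as np

rng = np.random.default_rng(42)
epsilon = 0.5          # privacy budget (illustrative)
sensitivity = 1.0      # a count query changes by at most 1 per person

true_count = 130       # hypothetical non-private query answer

# Noise measurement: magnitude of Laplace noise added to the answer.
noise = rng.laplace(0.0, sensitivity / epsilon, size=100_000)
print("average |noise| added:", np.abs(noise).mean())  # ~ sensitivity/epsilon

# Error measurement: difference between private and non-private output,
# expressed as a bound with accuracy parameters (alpha, beta):
#   Pr[|private - true| <= alpha] >= 1 - beta
alpha = 5.0
private_answers = true_count + noise
within_bound = np.abs(private_answers - true_count) <= alpha
print("empirical Pr[error <= alpha]:", within_bound.mean())
# For Lap(b), Pr[|noise| <= alpha] = 1 - exp(-alpha/b); with b = 2 and
# alpha = 5 this is about 0.918, i.e. beta is roughly 0.082.
```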
Until this point, we have understood the key components and operating mechanisms of differential privacy. In the next sections, we are going to see two important applications where differential privacy is highly applicable: data publishing and data analysis.

When answering queries, the data curator can operate in two settings:

1. Interactive setting:

  • In this setting, the data curator applies each query to the dataset interactively.
  • The next query cannot be issued to the curator until the answer to the previous query has been published.

2. Non-interactive setting:

  • In this setting, all the queries are given to the curator at one time, and the curator can provide answers with full knowledge of the query set.

To see the difference between the two settings, consider the following two queries over the dataset in Table 1:

  1. How many patients aged 40-79 have diabetes?
  2. How many patients aged 40-59 have diabetes?
In the interactive setting:

  • The curator receives the first query.
  • It computes the number of patients matching the query’s criteria.
  • The curator then adds independent Laplace noise calibrated to sensitivity 1, i.e., Lap(1/epsilon).
  • Now the second query comes in.
  • It is answered with a sensitivity of 2, because changing one person in the table may change the results of both queries.
  • The total noise added to the query set is therefore Lap(1/epsilon) + Lap(2/epsilon).
In the non-interactive setting:

  • Both queries are submitted to the curator at the same time.
  • The sensitivity is 2, since changing one person in the table may change the results of both queries.
  • Therefore, the total noise added is 2 * Lap(2/epsilon).
  • This noise is greater than in the interactive setting, because correlation between queries leads to higher sensitivity.

So what do we learn from this? Non-interactive settings normally incur more noise than interactive settings. More importantly, both examples indicate that the amount of noise increases dramatically when queries are correlated with each other, which makes the Laplace mechanism impractical in scenarios that require answering large numbers of queries (a short simulation comparing the two settings follows the list below). Several other mechanisms are available and can be used depending on the required number of queries, the accuracy of the output, and the computational efficiency:
    • Transformation
    • Dataset partitioning
    • Iteration
    • Query Separation
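As promised above, here is a minimal simulation sketch in Python (assuming NumPy; epsilon = 1.0 and the sample size are illustrative choices) comparing the average total noise magnitude across the two answers in each setting:

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon = 1.0    # privacy budget (illustrative)
n = 100_000      # number of simulated runs

def mean_abs_lap(scale):
    """Average magnitude of Laplace noise with the given scale."""
    return np.abs(rng.laplace(0.0, scale, size=n)).mean()

# Interactive: first answer gets Lap(1/eps), second gets Lap(2/eps).
interactive = mean_abs_lap(1 / epsilon) + mean_abs_lap(2 / epsilon)

# Non-interactive: both answers get Lap(2/eps).
non_interactive = 2 * mean_abs_lap(2 / epsilon)

print(f"interactive total noise:     {interactive:.2f}")     # ~3.0
print(f"non-interactive total noise: {non_interactive:.2f}") # ~4.0
```

With epsilon = 1, the interactive setting adds about 1 + 2 = 3 units of noise on average, while the non-interactive setting adds about 2 + 2 = 4, matching the Lap(1/epsilon) + Lap(2/epsilon) versus 2 * Lap(2/epsilon) comparison above.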
Differentially private data publishing covers a variety of scenarios:

  1. Transaction data publishing
  2. Histogram publishing
  3. Stream data publishing
  4. Graph data publishing
These scenarios are typically addressed through two broad publishing approaches:

  1. Batch query publishing
  2. Synthetic data publishing
For differentially private data analysis, two main types of frameworks are used:

  1. Laplace/exponential frameworks
  2. Private learning frameworks
Laplace/exponential framework:

  • Most types of analysis algorithms can be implemented using this framework.
  • For example, noise can be added to counting steps, or the exponential mechanism can be applied when making selections (see the sketch after this list).
  • In practice, it can be applied to supervised learning, unsupervised learning, and frequent itemset mining problems.
  • There are, however, certain challenges associated with using this framework on each of these problem types.
  • With supervised learning, the privacy budget has to be consumed multiple times due to the ensemble process involved in most of these algorithms, so there is a high possibility of inducing large amounts of noise.
  • With unsupervised learning, the high sensitivity of clustering centers can cause issues.
  • With frequent itemset mining, the most common problem is the large number of candidate itemsets: adding noise or randomizing the patterns may cause some of the important patterns in the dataset to be missed.
  • There are, nonetheless, modern implementations of differentially private platforms that use this framework, one of them being Microsoft’s PINQ (Privacy Integrated Queries), which can automatically implement clustering algorithms and return differentially private results.
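To make these building blocks concrete, here is a minimal sketch in Python (assuming NumPy; the toy data, budget split, and helper names are invented for illustration, and this is not PINQ’s actual API) showing a noisy count step and an exponential-mechanism selection:

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy categorical dataset (one record per individual).
data = ["apple", "apple", "banana", "apple", "cherry", "banana"]
items = sorted(set(data))

def noisy_count(value, eps):
    """Laplace mechanism applied to a count step (sensitivity 1)."""
    true_count = sum(1 for x in data if x == value)
    return true_count + rng.laplace(0.0, 1.0 / eps)

def select_most_frequent(eps):
    """Exponential mechanism: select an item with probability
    proportional to exp(eps * count / 2), counts having sensitivity 1."""
    counts = np.array([sum(1 for x in data if x == it) for it in items],
                      dtype=float)
    weights = np.exp(eps * counts / 2.0)
    return rng.choice(items, p=weights / weights.sum())

epsilon = 1.0
# Split the budget between the two analysis steps (an illustrative choice).
print("noisy count of 'apple':", noisy_count("apple", epsilon / 2))
print("selected item:", select_most_frequent(epsilon / 2))
```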
Private learning framework:

  • This framework treats data analysis as a learning problem, framed in terms of optimization.
  • It can only deal with a limited set of learning problems.
  • Private learning frameworks combine differential privacy with diverse learning algorithms to preserve the privacy of the learning samples.
  • The foundation of this framework is ERM (empirical risk minimization) and PAC (probably approximately correct) learning theory.
  • ERM helps select the optimal learning model by transforming the learning process into a convex minimization problem; differential privacy then either adds noise to the output model or to the convex objective function. PAC learning estimates the relationship between the number of learning samples and the model’s accuracy (a sketch of the output-perturbation variant follows).
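As a sketch of the output-perturbation route described above (assuming NumPy; the toy data, hyperparameters, and simplified per-coordinate Laplace noise are illustrative assumptions rather than a faithful implementation of any specific paper), the following trains an L2-regularized logistic regression by convex minimization and then perturbs the learned model:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy training data with labels in {-1, +1} (invented for illustration).
n, d = 200, 2
X = rng.normal(size=(n, d))
y = np.sign(X[:, 0] + 0.5 * X[:, 1])

lam = 0.1      # L2 regularization strength
epsilon = 1.0  # privacy budget

# ERM: minimize the regularized logistic loss by gradient descent
# (the convex minimization problem mentioned above).
w = np.zeros(d)
for _ in range(500):
    margins = y * (X @ w)
    grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0) + lam * w
    w -= 0.1 * grad

# Output perturbation: noise scale 2 / (n * lam * epsilon), following the
# sensitivity of regularized ERM; per-coordinate Laplace noise is used here
# as a simplification of the exact mechanism.
w_private = w + rng.laplace(0.0, 2.0 / (n * lam * epsilon), size=d)

print("non-private model:", w)
print("private model:    ", w_private)
```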
Differential privacy is also being applied in several application domains:

  1. Location-Based Services
  2. Recommender Systems
  3. Genetic Data
References:

  1. Zhu et al. (2017). “Differentially Private Data Publishing and Analysis: A Survey.” IEEE Transactions on Knowledge and Data Engineering. http://ezproxy.library.uvic.ca/login?url=https://ieeexplore.ieee.org/abstract/document/7911185 (Recent survey of the state of the art in differential privacy.)
  2. Dwork (2006). “Differential Privacy.” ICALP. http://ezproxy.library.uvic.ca/login?url=https://link.springer.com/chapter/10.1007/11787006_1 (Original paper introducing the notion of differential privacy.)
  3. McSherry (2009). “Privacy Integrated Queries: An Extensible Platform for Privacy-Preserving Data Analysis.” SIGMOD. http://ezproxy.library.uvic.ca/login?url=https://dl.acm.org/doi/10.1145/1559845.1559850 (Microsoft paper integrating differential privacy into a database system; also available as a lighter read in Communications of the ACM.)
  4. https://www.johndcook.com/blog/2019/10/14/privacy-budget/ (Accessed on: 05 March 2020.)
  5. https://accuracyandprivacy.substack.com/ (Accessed on: 05 March 2020.)
