AN ITERATIVE CLASSIFICATION SCHEME FOR SANITIZING LARGE-SCALE DATASETS
Cheap ubiquitous computing enables the collection of massive amounts of personal data in a wide variety of domains. Many organizations aim to share such data while obscuring features that could disclose personally identifiable information. Much of this data exhibits weak structure (e.g., text), such that machine learning approaches have been developed to detect and remove identifiers from it. While learning is never perfect, and relying on such approaches to sanitize data can leak sensitive information, a small risk is often acceptable. Our goal is to balance the value of published data and the risk of an adversary discovering leaked identifiers. We model data sanitization as a game between 1) a publisher who chooses a set of classifiers to apply to data and publishes only instances predicted as non-sensitive and 2) an attacker who combines machine learning and manual inspection to uncover leaked identifying information. We introduce a fast iterative greedy algorithm for the publisher that ensures a low utility for a resource-limited adversary. Moreover, using five text data sets, we illustrate that our algorithm leaves virtually no automatically identifiable sensitive instances for a state-of-the-art learning algorithm, while sharing over 93% of the original data, and completes after at most 5 iterations.
There has been a substantial amount of research conducted in the field of privacy-preserving data publishing (PPDP) over the past several decades. Much of this work is dedicated to methods that transform well-structured (e.g., relational) data to adhere to a certain criterion, or set of criteria, such as k-anonymization, among a multitude of others. These criteria attempt to offer guarantees about the ability of an attacker to either distinguish between different records in the data or make inferences tied to a specific individual. There is now an extensive literature aiming to operationalize such PPDP criteria in practice through the application of techniques such as generalization, suppression (or removal), and randomization. All of these techniques, however, rely on a priori knowledge of which features in the data are either themselves sensitive or can be linked to sensitive attributes. In the context of privacy preservation for unstructured data, such as text, various approaches have been proposed for the automatic discovery of sensitive entities, such as identifiers. The simplest of these rely on a large collection of rules, dictionaries, and regular expressions. Subsequent work proposed an automated data sanitization algorithm aimed at removing sensitive identifiers while inducing the least distortion to the contents of documents.
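To make the rule-based end of this spectrum concrete, a minimal dictionary-and-regex scrubber can be sketched as follows. The patterns, labels, and name list here are illustrative stand-ins, not the rule sets of any system discussed above:

```python
import re

# Hypothetical rules mapping a category label to a pattern for it.
PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

# Stand-in for a large dictionary of personal names.
NAME_DICTIONARY = {"alice", "bob"}

def scrub(text: str) -> str:
    """Replace every rule match with its category placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    # Dictionary pass: redact tokens that appear in the name list.
    tokens = [
        "[NAME]" if token.lower().strip(".,") in NAME_DICTIONARY else token
        for token in text.split()
    ]
    return " ".join(tokens)
```

For example, `scrub("Call Alice at 615-555-1234")` yields `"Call [NAME] at [PHONE]"`. The brittleness of this approach is exactly the motivation for the learning-based methods the text turns to next: any identifier not covered by a rule or dictionary entry passes through untouched.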
We use this threat model to construct a game between a publisher, who 1) applies a collection of classifiers to an original data set, 2) prunes all the positives predicted by any classifier, and 3) publishes the remainder, and an adversary acting according to our threat model. The data publisher's ultimate goal is to release as much data as possible while at the same time redacting sensitive information to the point where re-identification risk is sufficiently low. In support of the second goal, we show that any locally optimal publishing strategy exhibits the following two properties when the loss associated with exploited personal identifiers is high: a) an adversary cannot learn a classifier with a high true positive count, and b) an adversary with a large inspection budget cannot do much better than manually inspecting and confirming instances chosen uniformly at random (i.e., the classifier adds little value). Moreover, we introduce a greedy publishing strategy which is guaranteed to converge to a local optimum, and consequently guarantees the above two properties, in a linear (in the size of the data) number of iterations. At a high level, the greedy algorithm iteratively executes learning and redaction. It repeatedly learns a classifier to predict sensitive entities on the remaining data, and then removes the predicted positives, until a local optimum is reached. The intuition behind the iterative redaction process is that, in each iteration, the learner essentially checks whether an adversary could obtain utility by uncovering residual identifiers; if so, these instances are redacted, and otherwise the process terminates.
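The learn-then-redact loop described above can be sketched as follows. The `train` callback and the labeled-instance representation are illustrative stand-ins for the paper's actual learner and instance encoding; a real learner would fit to the sensitivity labels of the remaining data at each round:

```python
from typing import Callable, List, Sequence, Tuple

Instance = str
Labeled = Tuple[Instance, bool]  # (instance, is_sensitive)

def iterative_redaction(
    data: Sequence[Labeled],
    train: Callable[[Sequence[Labeled]], Callable[[Instance], bool]],
    max_iters: int = 100,
) -> Tuple[List[Labeled], int]:
    """Repeatedly fit a classifier on the remaining data and prune its
    predicted positives; stop once no instance is flagged, i.e., the
    learner can no longer find residual identifiers (a local optimum)."""
    remaining = list(data)
    for iteration in range(1, max_iters + 1):
        classifier = train(remaining)
        flagged = [item for item in remaining if classifier(item[0])]
        if not flagged:
            # No predicted positives remain: publish what is left.
            return remaining, iteration
        remaining = [item for item in remaining if not classifier(item[0])]
    return remaining, max_iters
```

With a toy classifier that flags any instance containing a digit, the loop prunes all digit-bearing instances in the first round and terminates in the second, mirroring the few-iteration convergence reported in the experiments.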
Our experiments on two distinct electronic health record data sets demonstrate the power of our approach, showing that 1) the number of residual true positives is always quite small, addressing the goal of reducing privacy risk, 2) an attacker with a large budget cannot do much better than choosing entities uniformly at random for manual inspection, 3) most (> 93%) of the original data is published, thereby supporting the goal of maximizing the quantity of released data, and 4) in practice, the number of required algorithm iterations (< 5) is a small fraction of the size of the data. Additional experiments, involving three data sets that are unrelated to the health domain, corroborate these findings, demonstrating the generalizability of our approach.
Our ability to take full advantage of large amounts of unstructured data collected across a broad array of domains is limited by the sensitive information contained therein. This paper introduced a novel framework for the sanitization of such data that relies upon 1) a principled threat model, 2) a very general class of publishing strategies, and 3) a greedy, yet effective, data publishing algorithm. The experimental evaluation shows that our algorithm a) performs substantially better than existing approaches for suppressing sensitive data and b) retains most of the value of the data, suppressing less than 10% of the information on all four data sets we considered in evaluation. In contrast, cost-sensitive variants of standard learning methods yield virtually no residual utility, suppressing most, if not all, of the data when the loss associated with privacy risk is even moderately high. Since our adversarial model is deliberately extremely strong (far stronger, indeed, than is plausible), our results suggest the feasibility of data sanitization at scale.