PRIVACY MODELS AND ANONYMIZATION TECHNIQUES FOR TABULAR HEALTHCARE DATA
DOI:
https://doi.org/10.20998/2079-0023.2024.02.12
Keywords:
healthcare data, tabular data, data anonymization, privacy models, k-anonymity, l-diversity, t-closeness, data anonymization techniques, differential privacy
Abstract
Privacy and personal data protection are increasingly pressing concerns, especially in healthcare, where large volumes of data are routinely used for research. The use of personal data is regulated by laws that require anonymization to minimize the risk of identifying individuals. Anonymization makes it possible to use sensitive data while limiting the risk of disclosing personal information and preserving the data's utility. This article discusses the main privacy models and anonymization techniques used to protect tabular healthcare data. The privacy models are k-anonymity, l-diversity, and t-closeness. The k-anonymity model ensures that every combination of quasi-identifier values present in the dataset is shared by at least k records; such a group of records forms an equivalence class. The l-diversity model complements k-anonymity by requiring at least l distinct combinations of sensitive attribute (SA) values in each equivalence class. The t-closeness model considers the distribution of the sensitive attribute values, ensuring that the distance between the SA distribution in each equivalence class and the SA distribution in the whole dataset does not exceed a specified threshold t. The anonymization techniques are generalization, suppression, relocation, permutation, perturbation, slicing, differential privacy, and synthetic data. Generalization reduces the precision of quasi-identifiers. Suppression removes certain values from the dataset to improve its statistical characteristics. Relocation changes a limited number of values in the data to enhance protection. Permutation shuffles the values of quasi-identifiers between records while preserving the overall statistical features of the dataset. Perturbation adds noise to the data, increasing privacy. Differential privacy also involves adding noise, but does so at the query processing stage. Synthetic data generation creates new datasets whose characteristics are similar to those of the original data.
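The three privacy models and two of the techniques named in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the article: the toy records, the attribute names (age, zip, diagnosis), the 10-year age bands, the 3-digit ZIP prefix, and all function names. For t-closeness the sketch measures distance as half the sum of absolute differences between the two distributions, which is one common choice for categorical attributes; other distance measures are possible. The noisy count follows the usual Laplace-mechanism pattern for a counting query with sensitivity 1.

```python
import random
from collections import Counter, defaultdict


def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    classes = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        classes[key].append(record)
    return classes


def k_anonymity(records, quasi_identifiers):
    """Largest k the table satisfies: size of the smallest equivalence class."""
    classes = equivalence_classes(records, quasi_identifiers)
    return min(len(group) for group in classes.values())


def l_diversity(records, quasi_identifiers, sensitive):
    """Largest l: fewest distinct sensitive values found in any class."""
    classes = equivalence_classes(records, quasi_identifiers)
    return min(len({r[sensitive] for r in group}) for group in classes.values())


def distribution(records, sensitive):
    """Relative frequency of each sensitive value."""
    counts = Counter(r[sensitive] for r in records)
    return {value: n / len(records) for value, n in counts.items()}


def t_closeness(records, quasi_identifiers, sensitive):
    """Smallest t: worst-case distance between a class and the whole table."""
    overall = distribution(records, sensitive)
    worst = 0.0
    for group in equivalence_classes(records, quasi_identifiers).values():
        local = distribution(group, sensitive)
        # Assumed distance: half the L1 difference of the two distributions.
        distance = 0.5 * sum(abs(local.get(v, 0.0) - p) for v, p in overall.items())
        worst = max(worst, distance)
    return worst


def generalize(record, age_width=10, zip_prefix=3):
    """Generalization: replace exact age with a band and mask ZIP digits."""
    low = (record["age"] // age_width) * age_width
    zip_code = record["zip"]
    return {
        **record,
        "age": f"{low}-{low + age_width - 1}",
        "zip": zip_code[:zip_prefix] + "*" * (len(zip_code) - zip_prefix),
    }


def noisy_count(records, predicate, epsilon, rng=random):
    """Counting query with Laplace noise of scale 1/epsilon (sensitivity 1)."""
    true_count = sum(1 for r in records if predicate(r))
    scale = 1.0 / epsilon
    # The difference of two exponential draws with mean `scale` is Laplace(0, scale).
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


# Hypothetical toy table; not real patient data.
RECORDS = [
    {"age": 23, "zip": "61001", "diagnosis": "flu"},
    {"age": 27, "zip": "61002", "diagnosis": "asthma"},
    {"age": 25, "zip": "61003", "diagnosis": "flu"},
    {"age": 41, "zip": "61010", "diagnosis": "diabetes"},
    {"age": 45, "zip": "61011", "diagnosis": "flu"},
    {"age": 48, "zip": "61012", "diagnosis": "asthma"},
]
QI = ["age", "zip"]
GENERALIZED = [generalize(r) for r in RECORDS]

print("k before generalization:", k_anonymity(RECORDS, QI))
print("k after generalization:", k_anonymity(GENERALIZED, QI))
print("l after generalization:", l_diversity(GENERALIZED, QI, "diagnosis"))
print("t after generalization:", t_closeness(GENERALIZED, QI, "diagnosis"))
print("noisy flu count:", noisy_count(RECORDS, lambda r: r["diagnosis"] == "flu", epsilon=1.0))
```

In this toy table every raw record is unique on its quasi-identifiers, so generalization is what creates equivalence classes large enough to satisfy a useful k. The noisy count differs from run to run, which is the intended behavior of the noise added at query time.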
License
This work is licensed under a Creative Commons Attribution 4.0 International License.