Database anonymization and protections of sensitive attributes
Abstract
Database anonymization has become increasingly important for organizations
that publish their databases to the public. Current anonymization measures
each have drawbacks: k-anonymity is prone to many varieties of attack;
ℓ-diversity does not work well with categorical or numerical attributes; and
t-closeness erases too much information from the database. Moreover, some
measures of information loss are designed for anonymization measures, such as
k-anonymity, in which sensitive attributes play no part in measuring the
database's security. Because they do not measure the redistribution of
sensitive attributes, these measures underestimate the information loss of
schemes such as ℓ-diversity or t-closeness, which intentionally remove the
association between non-sensitive and sensitive attributes to better protect
individuals from being identified.
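As a rough illustration of why k-anonymity alone can be attacked, the sketch
below checks a small generalized table for k-anonymity and for distinct
ℓ-diversity (the simplest textbook variant, not the generalized scheme
proposed in this thesis). The table, the column names, and the helper
functions are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def equivalence_classes(rows, quasi_identifiers):
    """Group rows that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups[key].append(row)
    return groups


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every equivalence class contains at least k rows."""
    groups = equivalence_classes(rows, quasi_identifiers)
    return all(len(group) >= k for group in groups.values())


def is_distinct_l_diverse(rows, quasi_identifiers, sensitive, l):
    """True if every equivalence class has at least l distinct sensitive values."""
    groups = equivalence_classes(rows, quasi_identifiers)
    return all(
        len({row[sensitive] for row in group}) >= l
        for group in groups.values()
    )


# Hypothetical generalized table: "zip" and "age" are quasi-identifiers,
# "disease" is the sensitive attribute.
rows = [
    {"zip": "130**", "age": "<30", "disease": "flu"},
    {"zip": "130**", "age": "<30", "disease": "flu"},
    {"zip": "148**", "age": ">=40", "disease": "cancer"},
    {"zip": "148**", "age": ">=40", "disease": "flu"},
]

k_ok = is_k_anonymous(rows, ["zip", "age"], k=2)
l_ok = is_distinct_l_diverse(rows, ["zip", "age"], "disease", l=2)
print(k_ok, l_ok)
```

In this toy table every equivalence class has two rows, so the table is
2-anonymous, yet the first class contains a single disease value, so anyone
known to fall in that class has their sensitive attribute revealed. The
distinct ℓ-diversity check with ℓ=2 rejects the table for that reason.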
This thesis provides a more generalized version of ℓ-diversity that better
protects categorical and numerical attributes, and analyzes the effectiveness
and complexity of this new security scheme. Another focus of this thesis is to
design a better approach to measuring information loss and to lay down a new
standard for evaluating information loss under security measures such as
ℓ-diversity and t-closeness, one that quantifies the actual information loss
caused by deliberately hiding relations between non-sensitive and sensitive
attributes. This new standard should provide a better estimate of the data
mining potential remaining in a generalized database.
This thesis also proves that, unlike k-anonymity, which can be solved in
polynomial time when k=2, ℓ-diversity in fact remains NP-hard in the special
case where ℓ=2, even when there are only 2 possible sensitive attributes in
the alphabet.