Hashing email addresses for GDPR compliance
UPDATED
We have a fairly unusual scenario: we have several old databases of user accounts. We'd like a new system to be able to connect these old accounts to new accounts on the new system, if the user wishes it.
So for example, on System X you have an old account, with an old, (let's say) RPG character. On System Y you have another old account, with another RPG character on it.
On our new system, with their new account, we'd like our users to be able to search these old databases and claim their old RPG characters. (Our users want this functionality, too.)
We'd like to keep users' old account PII in our database for the sole purpose of allowing them to reconnect old accounts to their new accounts. This would benefit them and be a cool feature, but under GDPR and our privacy policy we will eventually need to delete this old PII from our databases.
BUT: what if we stored this old PII in such a way that it was irreversible? I.e. only someone who already has the information would ever get a positive match.
I'm not a security expert, but I understand that simple hashing (e.g. MD5) is far too easy to crack (to put it mildly), and (technically) doesn't require "additional information" (i.e. a key).
The good thing about MD5 is that it's fast and deterministic (the same input always gives the same hash), meaning we could scan a database of hundreds of thousands of rows very quickly looking for a match.
If MD5 (and SHA) are considered insecure to the point of being pointless, what else can we do to scan a database looking for a match? I'm guessing modern hashing, like bcrypt, is designed to be slow for this very reason, and given that it salts each hash randomly (so the same input doesn't give the same output), it seems unsuitable.
If we merged several aspects of PII into a field (e.g. FirstnameLastnameEmailDOB) and then hashed that, it would essentially become heavily salted. Is this a silly solution?
hash privacy anonymity gdpr pseudonymization
edited Jan 24 at 16:59
Django Reinhardt
asked Jan 23 at 12:33
Django Reinhardt
2
Why do you need to pseudonymize them? You might have a specific need to, but it is not a typical thing to need to do in this use case.
– schroeder♦
Jan 23 at 12:34
@schroeder Sorry, I thought I'd explained. Some of this PII is about to expire as per our privacy policy. Pseudonymization would allow us to keep this functionality without keeping their data.
– Django Reinhardt
Jan 23 at 13:52
6
Yep, that is a great situation for this use case. Kudos to your team for such great understanding of your policies!
– schroeder♦
Jan 23 at 13:54
17
"The good thing about MD5 is that it's fast, however, meaning we could scan a database of 100,000s rows" - not sure how the speed of MD5 plays a part here, since you are presumably only hashing the email once and searching a database of hashed emails? (And the DB search presumably uses an index...?)
– MrWhite
Jan 23 at 16:25
3
Isn't the point of that bit of the GDPR specifically to stop this? If I tell you "delete everything you have on me, GDPR says so", I want that gone from your records and never again relatable to me. I don't want an undo button for that.
– Adam Barnes
Jan 24 at 14:11
2 Answers
MD5 or SHA is not the concern. Hashes can be used for pseudonymization. The problem is that the hash would need to be salted (or peppered) so that data from other sources could not be used to identify the person.
My email is the same everywhere, so a hash of it would also be the same everywhere. That means that, in this case, the hash and my email become synonymous, just as a username and a person's legal name do once they are paired. If you use a plain hash in this case, you actually gain nothing in terms of GDPR.
Hashing with a salt (or pepper) makes de-anonymising nearly impossible without knowing the added value. The salt (or pepper) almost becomes the token, in this case.
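A minimal sketch of the difference, in Python. It assumes a keyed hash (HMAC-SHA256) as one way to apply a pepper; the function names, the example pepper values, and the normalization rule (trim and lower-case) are illustrative assumptions, not something taken from the question.

```python
import hashlib
import hmac


def normalize_email(email: str) -> str:
    # Assumption: trimming and lower-casing is enough normalization for matching.
    return email.strip().lower()


def plain_hash(email: str) -> str:
    # Unkeyed hash: anyone who knows the email can recompute this value.
    return hashlib.sha256(normalize_email(email).encode("utf-8")).hexdigest()


def peppered_hash(email: str, pepper: bytes) -> str:
    # Keyed hash: the result cannot be recomputed without the secret pepper.
    return hmac.new(pepper, normalize_email(email).encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical secrets; in practice they would live outside the database and the source code.
PEPPER_SYSTEM_A = b"example-pepper-for-system-a"
PEPPER_SYSTEM_B = b"example-pepper-for-system-b"

email = "Alice@Example.com "

# The plain hash is identical wherever it is computed, so it links data sets together.
same_everywhere = plain_hash(email) == plain_hash("alice@example.com")

# With different peppers, the same email yields unrelated values.
differs_by_pepper = peppered_hash(email, PEPPER_SYSTEM_A) != peppered_hash(email, PEPPER_SYSTEM_B)
```

Because the peppered value is still deterministic for a given pepper, it can be stored in an indexed column and looked up directly when a user supplies their old email.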
As always, check with your DPO.
edited Jan 24 at 20:28
answered Jan 23 at 12:42
schroeder♦
2
You should probably still use a password hash, not one designed for speed. Email addresses follow common patterns and may have only very short unique parts, which would leave some of them equivalent to short passwords that can be brute-forced if protected only by a single pass of MD5 or SHA.
– Dan Neely
Jan 23 at 16:03
5
"Hashing with a salt makes de-anonymising nearly impossible without knowing the salt." Since the salt is usually stored right next to the hash, shouldn't it be assumed that the salt is known?
– kapex
Jan 23 at 17:33
9
For efficient database lookups, consider using a pepper instead.
– NieDzejkob
Jan 23 at 20:49
6
@DanNeely using a password-grade hash and a proper salt (unique for each user) would make the lookups prohibitively expensive; with password verification, you have already selected the user and know which salt to use, but in this case, you don't know which user it is and so have to try all of the salts
– kbolino
Jan 24 at 2:42
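The cost described in this comment can be sketched as follows. This is only an illustration: SHA-256 stands in for a slow password hash so the example runs quickly, and `build_rows` and `find_row` are hypothetical helper names. With a unique salt per row there is nothing to index on, so every stored salt has to be tried.

```python
import hashlib
import hmac
import os


def salted_hash(email: str, salt: bytes) -> bytes:
    # SHA-256 is a stand-in here; a real password hash would be far slower per call.
    normalized = email.strip().lower().encode("utf-8")
    return hashlib.sha256(salt + normalized).digest()


def build_rows(emails):
    # Each row gets its own random salt, as with password storage.
    rows = []
    for email in emails:
        salt = os.urandom(16)
        rows.append((salt, salted_hash(email, salt)))
    return rows


def find_row(rows, candidate):
    """Return (row_index, hashes_computed); row_index is None when nothing matches."""
    for i, (salt, digest) in enumerate(rows):
        # The candidate must be re-hashed with every row's salt in turn.
        if hmac.compare_digest(salted_hash(candidate, salt), digest):
            return i, i + 1
    return None, len(rows)
```

A miss costs one hash per stored row, which is what makes this layout impractical once each hash is deliberately slow.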
2
@kbolino The lookup should still be fast; as NieDzejkob pointed out, you just can't use a unique salt. Since the actual recovery process should rarely be run, you can compensate for that with much higher difficulty factors than would otherwise be acceptable for a login. 10 or 20 seconds to hash the candidate email is fine, since once you've done it once you can do a fast DB lookup afterward, while the extreme slowness of the hash means that, even without the need to attack each user separately, a brute-force attack is prohibitively expensive. Just rent a big cloud VM for the initial seeding.
– Dan Neely
Jan 24 at 3:09
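A rough sketch of this approach, assuming Python's `hashlib.scrypt` (available when Python is built against OpenSSL 1.1 or later) as the slow hash and a single secret value shared by all rows, passed as scrypt's `salt` argument. The pepper value, the helper names, and the cost setting `n=2**14` are placeholders; a real deployment would pick a much higher cost, as the comment suggests, and keep the secret outside the database.

```python
import hashlib

# Hypothetical secret shared by every row; it must not be stored next to the tokens.
PEPPER = b"example-secret-pepper"


def slow_token(email: str, n: int = 2**14) -> str:
    # One deliberately expensive hash per email; n is the scrypt cost factor.
    normalized = email.strip().lower().encode("utf-8")
    return hashlib.scrypt(normalized, salt=PEPPER, n=n, r=8, p=1, dklen=32).hex()


def seed_index(old_accounts):
    """One-off seeding: map token -> old account id, so the emails need not be kept."""
    return {slow_token(email): account_id for email, account_id in old_accounts}


def claim(index, email):
    # Hash the candidate once (slow), then do a single exact-match lookup (fast).
    return index.get(slow_token(email))
```

The dictionary stands in for an indexed database column: the lookup cost is one slow hash plus one index probe, regardless of how many old accounts exist.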
Realistically, pseudonymization is any method of obfuscating someone's PII/NPI so that it can't reasonably be traced back to one particular individual. GDPR doesn't dictate which hashing algorithm you must use in order to comply with its standard, and to be honest, it's best that it doesn't: if everyone were using the exact same method of obfuscation, you'd be creating a massive single point of failure all around. Your best bet (as mentioned above) is to use some form of tokenization with salt, to add extra randomness to your algorithm so that it can't be easily brute-forced.
8
From an information security perspective, the idea that it's bad to have a single widely used obfuscation method is dubious (it's either secure or not). However, it is accurate that standardizing the method by law could pose a problem, since it could become outdated.
– Christoph Burschka
Jan 23 at 16:18
The legislation that the GDPR replaced (the data protection directive 95/46/EG) is over 20 years old. IIRC, in the mid-1990s, MD5 was a pretty decent choice, and certainly among the better that were generally available; these days it's considered horribly inadequate, and even SHA-1 (which was designed to replace it) is a bad choice. Who knows what will happen to hash algorithms in the next 20-25 years? I agree, mandating any particular method or algorithm in the regulations themselves would be a bad thing to do.
– a CVn
Jan 24 at 9:42
add a comment |
Realistically, pseudonymization is any method of obfuscating someone's PII/NPI so that it can't be reasonably traced back to one certain individual. GDPR doesn't necessarily dictate what hashing algorithm you are required to use in order to comply with it's standard, and to be honest - it's best that it doesn't, because if you consider the fact that if everyone was using the exact same method of obfuscation, you're creating a massive single point of failure all around. Your best bet, (as mentioned above) is to use some form of tokenization with salt, to add extra randomness to your algorithm so that it can't be easily bruteforced.
8
From an information security perspective, the idea that it's bad to have a single widely used obfuscation method is dubious (it's either secure or not). However, it is accurate that standardizing the method by law could pose a problem, since it could become outdated.
– Christoph Burschka
Jan 23 at 16:18
The legislation that the GDPR replaced (the data protection directive 95/46/EG) is over 20 years old. IIRC, in the mid-1990s, MD5 was a pretty decent choice, and certainly among the better that were generally available; these days it's considered horribly inadequate, and even SHA-1 (which was designed to replace it) is a bad choice. Who knows what will happen to hash algorithms in the next 20-25 years? I agree, mandating any particular method or algorithm in the regulations themselves would be a bad thing to do.
– a CVn
Jan 24 at 9:42
add a comment |
Realistically, pseudonymization is any method of obfuscating someone's PII/NPI so that it can't reasonably be traced back to a specific individual. GDPR doesn't dictate which hashing algorithm you must use to comply with its standard, and to be honest, it's best that it doesn't: if everyone used the exact same method of obfuscation, that would create a massive single point of failure. Your best bet (as mentioned above) is to use some form of tokenization with a salt, which adds extra randomness so that the result can't be easily brute-forced.
answered Jan 23 at 14:50
GhostInTheShell
From an information security perspective, the idea that it's bad to have a single widely used obfuscation method is dubious (it's either secure or not). However, it is accurate that standardizing the method by law could pose a problem, since it could become outdated.
– Christoph Burschka
Jan 23 at 16:18
The legislation that the GDPR replaced (the Data Protection Directive 95/46/EC) is over 20 years old. IIRC, in the mid-1990s, MD5 was a pretty decent choice, and certainly among the better that were generally available; these days it's considered horribly inadequate, and even SHA-1 (which was designed to replace it) is a bad choice. Who knows what will happen to hash algorithms in the next 20-25 years? I agree, mandating any particular method or algorithm in the regulations themselves would be a bad thing to do.
– a CVn
Jan 24 at 9:42
Why do you need to pseudonymize them? You might have specific need to, but it is not a typical thing to need to do in this use case.
– schroeder♦
Jan 23 at 12:34
@schroeder Sorry, I thought I'd explained. Some of this PII is about to expire as per our privacy policy. Pseudonymization would allow us to keep this functionality without keeping their data.
– Django Reinhardt
Jan 23 at 13:52
Yep, that is a great situation for this use case. Kudos to your team for such great understanding of your policies!
– schroeder♦
Jan 23 at 13:54
"The good thing about MD5 is that it's fast, however, meaning we could scan a database of 100,000s rows" - not sure how the speed of MD5 plays a part here, since you are presumably only hashing the email once and searching a database of hashed emails? (And the DB search presumably uses an index...?)
– MrWhite
Jan 23 at 16:25
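To illustrate the point in the comment above, here is a rough sketch, assuming Python with an in-memory SQLite database and an HMAC-SHA256 token under a secret key. The table and column names, `SECRET_KEY`, and the helper functions are hypothetical, and the keyed hash is a swapped-in choice for the example rather than anything the question specifies. The candidate email is hashed exactly once; the search itself is an indexed equality lookup, so the speed of the hash function does not scale with the number of rows.

```python
import hashlib
import hmac
import sqlite3

# Hypothetical secret key, assumed to be stored outside the database.
SECRET_KEY = b"hypothetical-key-kept-outside-the-database"


def token(email: str) -> str:
    # Keyed, deterministic token of the normalized email (illustrative choice).
    message = email.strip().lower().encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def make_db(rows) -> sqlite3.Connection:
    # `rows` is an iterable of (email, character_name) pairs.
    # Only the token is stored; PRIMARY KEY gives the column an index.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE legacy_accounts ("
        "email_token TEXT PRIMARY KEY, character_name TEXT NOT NULL)"
    )
    db.executemany(
        "INSERT INTO legacy_accounts VALUES (?, ?)",
        [(token(email), name) for email, name in rows],
    )
    db.commit()
    return db


def find_character(db: sqlite3.Connection, email: str):
    # Hash the candidate once, then let the index do the search.
    row = db.execute(
        "SELECT character_name FROM legacy_accounts WHERE email_token = ?",
        (token(email),),
    ).fetchone()
    return row[0] if row else None
```

With this layout the per-lookup cost is one hash plus one index probe, whether the table holds a hundred rows or several hundred thousand.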
Isn't the point of that bit of the GDPR specifically to stop this? If I tell you "delete everything you have on me, GDPR says so", I want that gone from your records and never again relatable to me. I don't want an undo button for that.
– Adam Barnes
Jan 24 at 14:11