Analyzing Sony's leaked databases

By mux on Sunday 5 June 2011 00:21 - Comments (10)
Category: Filosofisch gezwam, Views: 7,960

There has been a lot of press surrounding the supposedly weak security of Sony's web services. LulzSec and other hackers have been able to download plaintext databases containing a lot of user info. Many malevolent people will use this information to try to log into, for instance, banking accounts, and to hack e-mail boxes to send out spam. However, we can also use these databases to learn a thing or two about people's behaviour, for instance with respect to password strength. Let's try and learn something from this!



First of all, let me clear one thing up: I am by no means a security specialist. None at all. Also, this analysis is only very shallow. I won't even be saying anything you don't already know. So why read on?

For science! The Sony databases that were leaked on June 3rd by LulzSec are excellent data mining subjects. The plaintext databases contain no hashes, let alone salted hashes, and have a very straightforward structure. Also, the registration forms that Sony used for gathering this information did not put any constraints whatsoever on the information entered, apart from requiring a valid e-mail address. This means that we get a very 'pure' look at, for instance, what type of passwords users generally enter when they're not confronted with some indication of their password strength. This in turn enables us to analyze what aspects of password strength we should focus on when putting together our own registration forms and storing personal data.
The database
I will be analyzing the plaintext database 'Sony_Pictures_International_AUTOTRADER_USERS.txt'. I will not publish it here because I'm pretty sure that is illegal. You can find a link to this database in, for instance, the Tweakers.net news coverage of these events. I have not used this information for anything but mathematical analysis. The analysis will be done with a bit of JavaScript code. This database is structured as follows:
  • Standard MySQL-ish plaintext database output, newline and | (bar) separated rows/columns
  • Columns: date of birth - sex - phone number - ZIP - State - City - Address line 1 - Address line 2 - Last name - First name - Password - e-mail address - UID
  • Non-optional columns are date of birth, password, e-mail and UID. Optional columns are left out when not filled in
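Because optional columns are simply left out, a row can only be indexed reliably from its ends: date of birth comes first, and password, e-mail and UID are the last three columns. A minimal sketch of that idea follows. The helper name `parseRow` is hypothetical, and it assumes the " | " delimiter used in the code further down.

```javascript
// Hypothetical helper (not part of the original script): split one row of the
// dump into named fields. Assumes columns are separated by " | " and that the
// last three columns are always password, e-mail address and UID, in that order.
function parseRow(line) {
    var cols = line.split(" | ");
    if (cols.length < 4) return null;              // need date of birth plus the last three columns
    return {
        dateOfBirth: cols[0],
        password:    cols[cols.length - 3],
        email:       cols[cols.length - 2],
        uid:         cols[cols.length - 1],
        optional:    cols.slice(1, cols.length - 3) // whichever optional columns were filled in
    };
}
```

Note that the optional columns cannot be told apart by position alone, which is why the sketch returns them as an unlabeled list.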
The analysis
We start off with a bit of a problem: the database contains lots and lots of duplicates, probably remnants of failed registration attempts. The duplicates all have more or less the same information but a different user ID. Of course, we want to weed these out, so I check each e-mail address against the rows that came before it. If the address has already been seen, we toss the new entry out. It's brute force, but spot-checking it against a few samples suggests it is an okay strategy:

code:
var str = document.getElementById("autotrader").value; //textarea containing the copypasted plaintext database
var arr = str.split("\n"); //row delimiter = newline

for(var i = 0; i < arr.length; i++) {
    arr[i] = arr[i].split(" | ");                  //column delimiter = |
    var email_field = arr[i].length - 2;
    if(in_array(arr[i][email_field],arr,-2, i)) { //in_array is a custom function
        arr.splice(i,1);                          //throw the line out if it's somewhere else in the database already
        i--;
    }
}


Where in_array is:

code:
//check if str exists somewhere in array[0...scanto][num_columns - offset]
function in_array(str, array, offset, scanto){ 
    for(var h = 0; h < scanto; h++){
        if(str == array[h][array[h].length + offset]) return 1;
    }
    return 0;
}
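For larger dumps, the same first-occurrence-wins filter can be sketched in a single pass by remembering the addresses seen so far in a Set. This is an alternative, not the code used for the analysis. The name `dedupeByEmail` is hypothetical, and it assumes every row has already been split into columns with the e-mail address second to last.

```javascript
// Alternative sketch (not the author's code): keep only the first row for each
// e-mail address. Assumes each row is an array of columns with the e-mail
// address in the second-to-last position, as in the loop above.
function dedupeByEmail(rows) {
    var seen = new Set();
    return rows.filter(function (row) {
        var email = row[row.length - 2];
        if (seen.has(email)) return false;  // duplicate: drop it
        seen.add(email);                    // first occurrence: remember and keep it
        return true;
    });
}
```

Like the loop above, this compares addresses exactly, so two registrations that differ only in capitalization still count as different users.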


Alright, now that we have a big array (1532 entries) without duplicates, we can check for the most basic of insecure password strategies:
  • Check if part or all of the password is composed of entire other fields
  • Check if the password overlaps with the e-mail address in a meaningful way (defined as 4 or more characters overlap)
  • Check if the password contains the year of birth
  • Check if the password is sufficiently long (defined as 8 or more characters)
  • Check if the password is composed of lowercase characters, uppercase characters and numbers
As I really only spent a short time analyzing this and quickly found out this is already a pretty discriminative test, I did not bother to implement things like checking for simple obfuscation (replacing o by 0, a by @ and such). All tests are really just one or two lines like this one:

code:
//password contains lowercase, uppercase and numeric
if(password == password.toUpperCase() || password == password.toLowerCase() ||  !(/\d/.test(password))){
    fail = 1;       //global password strongness failure
    strongness[i][4] = 1;   //bookkeeping
    pwcasenum++;        //increase bookkeeping for total number of this type of violation
}
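The other four checks from the list above can be sketched in the same spirit. These are illustrations rather than the code behind the results. All function names are hypothetical. The case-insensitive matching and the `minFieldLength` parameter, which keeps one-letter fields such as sex from matching almost every password, are my own assumptions.

```javascript
// Sketches of the other four checks; names and parameters are hypothetical.

// Password contains an entire other field. Assumption: compare case-insensitively
// and skip fields shorter than minFieldLength to avoid trivial matches.
function containsOtherField(password, otherFields, minFieldLength) {
    var pw = password.toLowerCase();
    return otherFields.some(function (field) {
        return field.length >= minFieldLength && pw.indexOf(field.toLowerCase()) !== -1;
    });
}

// Password shares a run of minOverlap or more consecutive characters with the e-mail address.
function overlapsEmail(password, email, minOverlap) {
    var pw = password.toLowerCase();
    var mail = email.toLowerCase();
    for (var start = 0; start + minOverlap <= pw.length; start++) {
        if (mail.indexOf(pw.slice(start, start + minOverlap)) !== -1) return true;
    }
    return false;
}

// Password contains the four-digit year of birth.
function containsBirthYear(password, birthYear) {
    return password.indexOf(String(birthYear)) !== -1;
}

// Password is shorter than minLength characters.
function tooShort(password, minLength) {
    return password.length < minLength;
}
```

With the thresholds from the list, the calls would be `overlapsEmail(password, email, 4)` and `tooShort(password, 8)`.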
The results
If you're really interested in the (admittedly, sloppy proof-of-conceptish) code behind the analysis you can visit the webpage containing this script and some rudimentary output here. Most of you, however, want to know the results. So here we go:

Even with just five simple password strength conditions, only 20 out of 1532 passwords (about 1.3%) pass the test.



Most of them fail for simply using only lowercase letters, with the occasional number tacked on:



Of course, these tests are not nearly enough to guarantee a password with sufficient entropy that is not easily guessable with the help of other information. Therefore, my code spits out all the passwords that end up passing all the tests. Among these passwords are things like:

code:
Tiffany1
Cherokee3


Sure enough, these are strong according to my tests. They are 8 characters or more, contain upper case, lower case and numbers, don't include information from other fields and don't have overlap with the e-mail address. But clearly they're still not quite as strong as we would like the ideal password to be. Therefore, I ended with a human inspection of the remaining passwords. Out of 1532 samples, how many of these are actually high-entropy, generally considered strong passwords? The answer is exactly 2.

code:
FrES3fFS
YE8y64S2


Let me visualize that for you:

[Image: two green boxes among a field of red ones]

Conclusion
This is what happens if you don't enforce a strong password. It's a widely known fact that people are pretty lousy password generators, but instead of accepting that as gospel I went out and checked with a real database. Well, uhm. It's true. If you leave password generation up to the average person signing up for Sony Autotrader services, two people out of 1532 will come up with a decent password on their own.

Now, I have said this before and I will say it again: this analysis is very shallow and doesn't account for a lot of things of interest to real security experts. Real security experts know how to do it right. However, a lot of people here are not security experts but still occasionally program something that requires username-password security. I hope that this blog post serves as a reminder that you should really try to enforce a reasonably strong password from your users. Your services may be very easily bruteforced otherwise. But even more importantly: I hope this blog post may convince even a small number of lay users out there to re-evaluate their password choices.

I don't want to end this blog post with recommendations. I feel I've crossed the line already saying people should reconsider their passwords, because I know that passwords are always a trade-off between being easy to remember and having high entropy. What I actually want to say is that we can use these leaked databases for science! Data mining and analysis of personal information can show us all kinds of interesting behavioural patterns. Lots of companies are doing these things all the time, compiling and matching databases to find patterns, adjusting their businesses accordingly, but they keep this information secret as it may give them the competitive edge (e.g. Google). Today we had a look at password strength. Tomorrow we may use these databases to find out how sincere people are at filling out their personal details if a money prize is at stake compared to a free internet service. Think of the science that can be done!


Comments


By Tweakers user KoertW, Sunday 5 June 2011 01:17

A really nice way of showing people what is wrong with the general passwords they use.

I personally see, in a work environment with over 250 users, that people (when they need to change their Windows passwords) don't put much effort into their passwords.
Relatives, pets, birthdates, names of children combined with a date/number... not a real challenge...

Not to mention those who "wrote down" their password so that they don't forget it.

The picture with the 2 green boxes against all reds really shows how few people actually try to get/make up a proper password.

By Tweakers user Edek, Sunday 5 June 2011 09:14

Great, nice Blog!
Could you please release a sequel about the most common passwords? It should be funny because there are so many weak passwords that people use such as "Dolphin" and "hello"!

By Tweakers user spone, Sunday 5 June 2011 09:40

Would using a strong password in this case have made any difference since they are stored unencrypted/non-hashed anyway?

[Comment edited on Sunday 5 June 2011 09:41]


By Tweakers user ViperNL, Sunday 5 June 2011 10:07

There should be a law against saving someone's password as plain text. Seriously.

1. It's very personal information.
2. There is absolutely no reason to save it as plain text. None.
3. It's fair to assume people reuse their password for other websites (pretty much no one can remember a different password for each of the hundreds of websites they register on).

[Comment edited on Sunday 5 June 2011 10:08]


By Tweakers user chaozz, Sunday 5 June 2011 10:08

spone wrote on Sunday 05 June 2011 @ 09:40:
Would using a strong password in this case have made any difference since they are stored unencrypted/non-hashed anyway?
Good point. Even if the database had not been compromised, the strength of these passwords would not be very important (although stronger is better), because you would also need to know the username.

So great analysis, but not really relevant to the hack itself.

[Comment edited on Sunday 5 June 2011 10:09]


By Tweakers user mux, Sunday 5 June 2011 10:09

@spone: well, obviously no. That's not the point of this blog. A password is supposed to be stored using a cryptographically strong one-way salted hash, not plaintext, exactly for this reason. Because most sane programmers actually do this, there aren't many databases of usernames, passwords and most importantly: accompanying information to mine from. If we don't know and analyze how people behave, it becomes harder for us to analyze security holes. This blog serves explicitly to use this data for good, and implicitly to:
- For the love of god, don't store personal data plaintext
- SQL injection attacks are sooooo 1989
- You should do proper password strength tests (maybe you don't really need to enforce password strength, but you need to at least indicate the password strength as a user types it)
- more?

@Edek: I don't really intend on making this a series. If anybody is interested: I've put a link to my JavaScript in the text and the databases are freely available for download. It took me just 30 minutes to do this. Go ahead, try it for yourself :)

@chaozz: if anything, the actual hack was even less interesting. It was nothing more than a 'bobby tables'-sql injection.

[Comment edited on Sunday 5 June 2011 10:11]


By Tweakers user YopY, Sunday 5 June 2011 10:11

I like your graphs, :+.

And no, strong passwords wouldn't help much if the database was encrypted. In fact, I think you should always assume a worst-case scenario when you create a new account on any site - i.e. the password is stored as plaintext. You should therefore keep a few things in mind:

* Is my password guessable or easily brute-forced?
* (if applicable) Is my recovery question / answer easily guessed or simpler / brute-forceable?
* Do I have a unique password for this site?

There's people that have a system for passwords on multiple websites, for example they pre- or suffix the password with the website's name (for example, sony-JGIJW445). In this case, i.e. when the password is stored as plaintext, that's not a very good idea as hax0rz may try using the same thing on other sites (for example, paypal-JGIJW445).

I personally use auto-generated 12-character alphanumeric passwords stored in a single database (keepass). Of course, this means that it only takes one keylogger and a hacked Dropbox account for a hacker to access all of my passwords (including the one for Paypal, so I could actually lose $8,62).

By Tweakers user dcm360, Sunday 5 June 2011 12:38

Hm, if one of my passwords was in that database (fortunately not), it would have ended up in the list of weak passwords. My passwords contain normal characters, numbers and special characters, and they are at least 8 characters long. They only fail the upper/lowercase test...

By Tweakers user Zsub, Saturday 1 October 2011 12:10

I'd like to point out this xkcd.

Actually putting constraints on password length or enforcing the use of characters like @, $, etc. doesn't do any good, apparently.

By Tweakers user akakiwi, Tuesday 8 November 2011 12:51

My password wouldn't fail, since it meets all the requirements, but it's only 10 characters long, so, what @Zsub said.
Better a very long password that you can remember than a relatively short one with weird characters.
