Password Leak Analysis - Extensive Analysis of Passwords

by Marc Ruef
on April 15, 2021
time to read: 18 minutes

Keypoints

How Passwords are Chosen

  • As part of our darknet monitoring service, we collect information on password leaks
  • These are imported into an analysis system, which makes extensive statistical evaluation possible
  • The top 10 passwords usually consist of simple number sequences such as 123456 or primitive passwords such as qwertz
  • An examination of the different countries reveals linguistic and cultural differences in password selection
  • This information can be used to optimize brute-force attacks

Password security is often seen as a classic entry-level topic in the field of information security. Accordingly, much has already been said and written about it. Based on an extensive collection of leaked account data, this article takes a broad look at passwords. The number of passwords examined and their statistical analysis provide a unique insight into how passwords are chosen.

As part of our services on the Darknet, we have infiltrated a number of data markets. In these, stolen data, such as credit card information and passwords, are traded. The latter are of particular interest to us. This is because whenever a password leak is provided, we import it into an internal database in anonymized form. This system is then used to alert our customers about leaks that affect them.

This system also makes extensive statistical analysis of passwords possible (e.g. the analysis of the popular leak of libero.it in Italy). In our Red Teaming projects we tend to use specially prepared password lists, which allows brute-force attacks to be highly optimized. Some of these lists are freely available in our public GitHub repository.

Statistical Data of the Collected Leaks

Currently, our database stores over 733 million unique mail addresses and over 265 million unique passwords. Meta-information about the leaks (e.g. source, date), usernames and hashes are also stored, but these are negligible in the context of this analysis. Based on these basic key figures, a password is used 2.76 times on average.
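The average can be checked roughly from the two rounded totals. The sketch below assumes one password per mail address and uses the rounded lower bounds from the text, so it only approximates the reported value (about 2.77 here instead of 2.76, which presumably derives from the exact counts). The function name is illustrative.

```python
# Rounded totals quoted in the text; both are stated as "over ...",
# so they are lower bounds rather than exact counts.
unique_mail_addresses = 733_000_000
unique_passwords = 265_000_000


def average_reuse(accounts: int, passwords: int) -> float:
    """Average number of accounts per unique password (assumes one password per account)."""
    return accounts / passwords


ratio = average_reuse(unique_mail_addresses, unique_passwords)
print(f"{ratio:.2f}")  # 2.77
```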

The following graphic shows the most frequently used domains. Most of them belong to well-known freemailers and large providers. These include the international companies Yahoo with its freemailer (23.7%), Microsoft with Hotmail (16.6%) and Google with Gmail (11.1%). The first two are also represented by additional regional top-level domains such as .fr (France), .co.uk (United Kingdom) and .de (Germany).

Distribution of domains in the password database

Companies with more local characteristics follow. These include, for example, mail.ru in Russia (6.7%), AOL with a focus on the USA (2.5%) and qq.com in China (0.6%).

By their nature, these figures represent the amount of leaked data per service. At the same time, they allow indirect conclusions, to a certain extent, about the market penetration of the individual providers.

The following analyses have not been fully normalized. This means that anomalies have not been manually isolated or corrected. Such an anomaly occurs, for example, if a bot created several accounts and always used the same password, or if certain terms (e.g. names, music groups, sports teams) are disproportionately represented due to a passing trend. These anomalies can directly skew the statistics and therefore distort the underlying linguistic and cultural observations. Moreover, the analyses are always a snapshot and reflect the idiosyncrasies of that moment.

Top Passwords over All Data

Security companies like to publish lists of top passwords for advertising purposes. A closer look reveals that these lists are rarely based on an accurate and methodical approach. Many lists are outdated, copied together, or based on very limited data sets.

The following is an analysis of the top passwords across our entire dataset. The 20 most popular passwords are shown in descending order. We will encounter these passwords again and again in the following observations, as they dominate the individual breakdowns to a large extent.

Top 20 passwords over all data

Different Countries

This section breaks the data down by country, using a selection of example states. This allows linguistic and cultural peculiarities to be identified. The association of a leaked data set with a country is primarily based on the domain used for the mail address. Either the provider is typical for the country (e.g. 126.com in China) or the top-level domain targets a specific audience (e.g. hotmail.fr for France). The origin of a leak or the affected service itself has no influence on our investigations.
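The association rule can be sketched as a two-step lookup: first a table of country-typical providers, then the country-code top-level domain. This is a minimal illustration, not the actual implementation; the table contents and the function name are assumptions, with only the example domains taken from this article.

```python
from typing import Optional

# Hypothetical provider table; the real mapping is not published.
# The entries are the example domains mentioned in this article.
PROVIDER_COUNTRY = {
    "126.com": "cn",
    "163.com": "cn",
    "qq.com": "cn",
    "mail.ru": "ru",
}


def country_for_address(address: str) -> Optional[str]:
    """Return a two-letter code for a mail address, or None if no rule applies."""
    domain = address.rsplit("@", 1)[-1].strip().lower()
    # Rule 1: the provider is typical for one country.
    if domain in PROVIDER_COUNTRY:
        return PROVIDER_COUNTRY[domain]
    # Rule 2: a two-letter top-level domain is treated as a country code.
    tld = domain.rsplit(".", 1)[-1]
    if len(tld) == 2 and tld.isalpha():
        return tld
    # Generic domains such as gmail.com stay unassigned.
    return None


print(country_for_address("user@hotmail.fr"))  # fr
print(country_for_address("user@126.com"))     # cn
print(country_for_address("user@gmail.com"))   # None
```

Addresses on generic domains remain unassigned in this sketch, which matches the idea that the country breakdown only covers records with a recognizable domain.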

Breakdown of leaked data by country

The distribution of the collected data shows that a large part can be assigned to Russia (46.8%), followed by the United Kingdom (12.5%), France (10.4%) and Germany (7.5%). In the country-specific evaluations, states are omitted or given lower priority if only a small amount of data is available to us. These include, for example, Guinea-Bissau (0.0000111%), the Marshall Islands (0.0000115%), and Bonaire, Sint Eustatius and Saba (0.000013%).

Switzerland – Straightforward and Direct

Top passwords in Switzerland

For Switzerland with its top-level domain .ch, the usual standard passwords, such as 123456 (0.59%) and 123456789 (0.14%), rank at the top.

Unusual but regionally explainable are the German spelling passwort (0.04%) and the greeting hallo (0.46%) or its variation hallo123 (0.029%). With daniel (0.029%), a widespread male name is represented. It has been very popular for years and last topped the rankings in Switzerland in 2019.

Particularly unusual, and probably a statistical anomaly, are !~!1 (0.91%) and tintenprofi (0.031%). These presumably come from specific leaks in which a separate clustering of just these passwords has occurred, resulting in a distortion of the overall results for the country.

Germany – Cultural Differences

Top passwords in Germany

If we now compare Switzerland with the top-level domain .de in Germany, we quickly see obvious parallels that are due to linguistic and cultural similarities. Again, of course, 123456 (0.64%) and 123456789 (0.22%) are at the top. The distributions are not exactly the same, but they play out in a similar range. Likewise, hallo123 (0.052%) and hallo (0.042%) are also found in the top spots, albeit with a reversal in terms of the order.

Passwords that can be identified as typically German come primarily from the subject area of football. These include schalke04 (0.023%) and fussball (0.020%). Mercedes (0.012%) can also be clearly assigned to records from Germany. Likewise, an above-average number of vulgar expressions like fi--en (0.028%) and ar--hole (0.019%) can be found.

The analysis of our dataset also shows statistical anomalies here that can be attributed to specific leaks. Foremost among these is Y57gjng4gH (0.032%), which looks like a random choice of characters. It is odd that this is represented so often. Accordingly, it can be assumed that it is the password of an automated mechanism or a fabricated record.

Looking at the top 20 passwords in Germany, 1q2w3e4r (0.031%) also stands out. Here, too, it appears at first glance to be a random and correspondingly fabricated password. However, this is not correct, since it corresponds to the merging of the strings 1234 and qwer. The characters are entered alternately from left to right on the keyboard (see picture).
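The merging described here is a plain character-by-character interleaving of two strings. A minimal sketch, with an illustrative function name:

```python
def interleave(first: str, second: str) -> str:
    """Merge two equally long strings by alternating their characters."""
    if len(first) != len(second):
        raise ValueError("both strings must have the same length")
    return "".join(a + b for a, b in zip(first, second))


print(interleave("1234", "qwer"))    # 1q2w3e4r
print(interleave("12345", "qwert"))  # 1q2w3e4r5t
```

The second call reproduces the longer variant that appears in the Russian top list.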

Graphical illustration of 1q2w3e4r

Entering characters from left to right is particularly popular, as the examples qwerty and qwertz show. The fact that qwerty (0.086%) is more common in Germany than qwertz (0.029%) is surprising, since German keyboards use the latter layout. It must be assumed that either English-speaking users or English keyboard layouts are dominant in the German-speaking area.

Russia – Multiple Typical Patterns

Top passwords in Russia

The evaluation of .ru for Russia is quite extensive, as these records currently account for almost half of our database. The most used passwords can be divided into five categories:

  1. ascending number sequences (e.g. 123456 and 123456789)
  2. repeating numbers (e.g. 1111 and 666666)
  3. letter input from left to right (e.g. qwerty and zxcvbnm)
  4. merging strings (e.g. 1q2w3e4r5t and 1q2w3e4r)
  5. duplication or mirroring of strings (e.g. 123123 and 123321)

This exposes most of the typical password patterns. Chances are very high that the choice of a classic password falls into one of these categories. The distribution of the most used passwords is also more even than in other countries.
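The five categories can be turned into simple checks. The following is a rough sketch under stated assumptions: the keyboard rows, the minimum lengths and the category labels are chosen for illustration and are not taken from the analysis system.

```python
from typing import Optional

# Letter rows of QWERTY and QWERTZ keyboards (assumed reference layouts).
KEYBOARD_ROWS = ("qwertyuiop", "qwertzuiop", "asdfghjkl", "zxcvbnm", "yxcvbnm")


def is_ascending_digits(s: str) -> bool:
    """True for digit strings where every digit is one higher than the previous."""
    if len(s) < 3 or not (s.isascii() and s.isdigit()):
        return False
    return all(int(b) - int(a) == 1 for a, b in zip(s, s[1:]))


def is_keyboard_run(s: str) -> bool:
    """True if the string is a left-to-right run within one keyboard row."""
    return len(s) >= 3 and any(s in row for row in KEYBOARD_ROWS)


def classify(password: str) -> Optional[str]:
    """Return one of the five pattern labels, or None if no pattern matches."""
    p = password.lower()
    if len(p) >= 3 and len(set(p)) == 1:
        return "repeating"
    if is_ascending_digits(p):
        return "ascending"
    if is_keyboard_run(p):
        return "keyboard"
    # Merged strings: characters at even positions form one sequence,
    # characters at odd positions the other (e.g. 1234 and qwer).
    if is_ascending_digits(p[0::2]) and is_keyboard_run(p[1::2]):
        return "merged"
    half = len(p) // 2
    if len(p) % 2 == 0 and half >= 2:
        if p[:half] == p[half:] or p[:half] == p[half:][::-1]:
            return "duplicated"
    return None


for example in ("123456", "666666", "zxcvbnm", "1q2w3e4r5t", "123321", "hallo"):
    print(example, classify(example))
```

A password such as hallo falls through all checks and is reported as None, which is the expected result for dictionary words.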

China – Numbers Preferred

Top passwords in China

The evaluation of China was based on the one hand on the top-level domain .cn, but also on some well-known domains of the largest providers such as 163.com and 126.com. This means that Chinese users on non-typical top-level domains could also be taken into account.

Looking at the top 100, it is immediately noticeable that 53% of the passwords consist of pure numbers. This is in contrast to 20% of the passwords in Germany. This is probably due to the fact that many systems do not provide for password entry with Chinese characters. And users are more likely to want to remember numerical combinations than foreign-language – mostly English – words.
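The share of purely numeric passwords in a top list can be computed as follows. This is a minimal sketch; the function name and the sample list are illustrative and do not reflect the real top 100.

```python
def numeric_share(passwords: list[str]) -> float:
    """Fraction of entries that consist of ASCII digits only."""
    if not passwords:
        return 0.0
    numeric = sum(1 for p in passwords if p.isascii() and p.isdigit())
    return numeric / len(passwords)


# Illustrative sample, not real data.
sample = ["123456", "qwerty", "111111", "hallo123"]
print(f"{numeric_share(sample):.0%}")  # 50%
```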

Trying just the top 100 passwords yields a success rate of 3.01% for Germany. For Russia, the rate is above average at 5.46%, and for China it even reaches 7.11%, which is due to the reduced entropy identified above.
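Such a success rate corresponds to the share of all accounts whose password is among the n most frequent ones. A sketch of that calculation, assuming the input is a flat list with one password per account (the function name and sample data are hypothetical):

```python
from collections import Counter


def top_n_success_rate(passwords: list[str], n: int = 100) -> float:
    """Share of accounts whose password is among the n most frequent passwords."""
    counts = Counter(passwords)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(n))
    return covered / total


# Illustrative sample, not real data: ten accounts, three distinct passwords.
sample = ["123456"] * 5 + ["qwerty"] * 3 + ["hallo"] * 2
print(top_n_success_rate(sample, n=1))  # 0.5
print(top_n_success_rate(sample, n=2))  # 0.8
```

The more a country's accounts concentrate on a few passwords, the higher this value becomes for the same n.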

Passwords by Services

Top passwords of the domain yahoo.com

Just as a study of the most popular passwords per country can be implemented, this can be done for individual providers. As a sample, we have examined the data sets of the three most popular mail providers: yahoo.com (see graphic), hotmail.com and gmail.com.

At least among these three, there are no significant differences from the country-specific observations. The usual passwords are found at the top. These include 123456 (0.73-1.02%) in first place and 123456789 (0.26-0.47%) in second place for all of them. This is normally followed by password (0.11-0.22%), but at hotmail.com fuk19600 (0.19%) has displaced it from third place. This is probably again a password of automatically generated accounts or a fabricated record. Similar effects can be observed again and again.

Password and Year Numbers

Year numbers are popular 4-digit PINs. Here 2005 (2.62%) tops the list and is immediately followed by 2004 (2.55%), then by 2000 (2.32%) and 2003 (2.14%). Only then comes the first year from the 1900s, namely 1987 (2.12%).

Top passwords of year numbers

Statistically, this observation becomes more exciting when the years are summarized in decades. Here we see that the decade 2000-2009 (18.71%) is closely followed by 1980-1989 (17.70%). After that, we find 1970-1979 (14.56%) and 1990-1999 (14.15%). This corresponds quite well to the primary audience of today’s Internet.
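Grouping by decade amounts to integer-dividing each year by ten and summing the counts. A minimal sketch with made-up counts; the function name and the sample numbers are illustrative only:

```python
from collections import Counter


def shares_by_decade(year_counts: dict[int, int]) -> dict[int, float]:
    """Aggregate per-year counts into percentage shares per decade."""
    total = sum(year_counts.values())
    if total == 0:
        return {}
    decades: Counter = Counter()
    for year, count in year_counts.items():
        decades[(year // 10) * 10] += count  # 1987 -> 1980, 2005 -> 2000
    return {decade: 100 * count / total for decade, count in decades.items()}


# Illustrative counts, not real data.
sample = {2005: 3, 2004: 2, 1987: 4, 1983: 1}
print(shares_by_decade(sample))  # {2000: 50.0, 1980: 50.0}
```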

The origin of the years must be considered here. In most cases, the user’s own year of birth will probably be chosen. However, it is not unusual for important events (e.g. a wedding) or the birth years of one’s own children to be used as well.

Top passwords grouped by decades

PINs (digits only)

Next, we look at PINs (Personal Identification Numbers) of all types that consist exclusively of digits. The 4-digit PINs show that the usual standard patterns also apply to such short PINs. These include incremental number sequences such as 1234 (10.5%) and 4321 (0.25%), full repetitions such as 1111 (1.62%) and 0000 (1.15%), and partial repetitions such as 1212 (0.52%) and 6969 (0.26%) (see the breakdown in the analysis of Russia).

Top 4-digit PINs

Some number series also look like years, for example 2005 (0.26%) and 2004 (0.25%). Whether these are really years or just patterns is difficult to say. Years were examined separately in the preceding section.

Dates, for example birthdays or anniversaries, could also occur. However, they are statistically underrepresented. In the top 100, only a few PINs qualify. The clearest candidate is 1231 (0.08%), since it is neither a plausible year nor an obvious pattern.
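Whether a 4-digit PIN can be read as a date at all can be tested mechanically. The sketch below is one possible check, not the method used for this analysis; it tries both a month-day and a day-month reading, and the function name and labels are assumptions.

```python
from datetime import date


def date_readings(pin: str) -> list[str]:
    """Return the formats (MMDD, DDMM) in which a 4-digit PIN is a valid date."""
    readings: list[str] = []
    if len(pin) != 4 or not (pin.isascii() and pin.isdigit()):
        return readings
    first, second = int(pin[:2]), int(pin[2:])
    for label, month, day in (("MMDD", first, second), ("DDMM", second, first)):
        try:
            date(2000, month, day)  # leap year, so 29 February is accepted
        except ValueError:
            continue
        readings.append(label)
    return readings


print(date_readings("1231"))  # ['MMDD']
print(date_readings("1212"))  # ['MMDD', 'DDMM']
print(date_readings("6969"))  # []
```

Note that many PINs pass this test without being meant as dates (1212 is more likely a repetition pattern), so the check only narrows down candidates.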

It is questionable whether these PINs also allow conclusions to be drawn about the relative distribution on devices such as smartphones and input terminals at ATMs. However, a comparison with other studies shows that the expected distribution can very well be observed here.

Pos  PIN   Datagenetics  scip     Difference (percentage points)
1    1234  10.713%       10.492%  -0.220
2    1111  6.016%        1.624%   -4.391
3    0000  1.881%        1.156%   -0.724
4    1212  1.197%        0.528%   -0.668
5    7777  0.745%        0.362%   -0.382
6    1004  0.616%        0.095%   -0.520
7    2000  0.613%        0.232%   -0.380
8    4444  0.526%        0.358%   -0.168
9    2222  0.516%        0.363%   -0.152
10   6969  0.512%        0.267%   -0.244

Average Password Lengths

Finally, the average length of the passwords should be discussed. This amounts to 9.7771 characters across all data sets and is thus above the traditional rule of thumb of 8 characters.

The breakdown by country provides a basic overview of how firmly the understanding of security is anchored in the respective cultures. Although this may indicate a trend, the data must be seen in context. For some countries, there is relatively little leaked data, and others are dominated by generated and fabricated passwords. This can distort the view.

This distortion applies to Eswatini (sz). The average length for this relatively small country with only about 1.1 million inhabitants is 24.6 characters. This is more than twice as long as for the Cook Islands (ck), which come second with 11.5 characters. Their population of just over 18,000 is also very small, and the significance of the figure is correspondingly limited. For this reason, we have made an adjustment and only included countries in the statistics where there are at least 10,000 leaked accounts.
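The adjustment can be sketched as an average per country combined with a minimum sample size. The record format, function name and sample data below are assumptions for illustration; only the threshold of 10,000 accounts comes from the text.

```python
from collections import defaultdict
from typing import Iterable


def average_length_by_country(
    records: Iterable[tuple[str, str]], min_accounts: int = 10_000
) -> dict[str, float]:
    """Average password length per country, skipping countries with too few accounts.

    records: (country_code, password) pairs, one per leaked account.
    """
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # [accounts, summed length]
    for country, password in records:
        totals[country][0] += 1
        totals[country][1] += len(password)
    return {
        country: length_sum / accounts
        for country, (accounts, length_sum) in totals.items()
        if accounts >= min_accounts
    }


# Illustrative sample with a lowered threshold, not real data.
sample = [("ch", "hallo"), ("ch", "123456789"), ("sz", "x" * 24)]
print(average_length_by_country(sample, min_accounts=2))  # {'ch': 7.0}
```

In the sample, the single long password for sz would dominate its average, but the country is dropped because it does not reach the minimum number of accounts.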

Distribution of average password lengths by country

After adjusting the data, a very concrete picture emerges. Ireland (ie) is in first place with 11.2 characters. This could be due to the fact that many large tech companies have corresponding connections to Ireland and thus exert direct or indirect influence on password structures.

This is followed by Kazakhstan (kz) with 9.674, Palau (pw) with 9.673 and Tokelau (tk) with 9.63. None of these countries is known for technical developments. In the case of the top-level domain pw, the association with the term password could lead security-savvy users to prefer such addresses without otherwise having any connection to the country.

Another western country is Belgium, with an average password length of 9.55. The United Kingdom with 9.43, Russia with 9.10 and Germany with 8.92 are also in the top quarter.

Rather average are Italy with 8.44, China with 8.33, USA with 8.21 and France with 8.18 characters.

Switzerland ranks in the last quarter with an average length of only 8.03 characters. It is difficult to say whether this is due to a lack of understanding of security, an increased willingness to take risks, or a deliberately differentiated risk assessment.

Conclusion

Password lists seem archaic, but on closer inspection they still hold some hidden secrets. Skillful statistical analyses can identify cultural and technical conditions, thanks to which highly specialized and optimized password lists can be compiled.

In many places, the top passwords are the same or similar. In particular, ascending numerical strings and simple letter combinations are encountered time and again. The situation is similar for PINs, with anniversaries and years also playing an important role.

The study of average password lengths holds a few surprises in store, since above-average lengths can also be observed in countries that would not necessarily be considered highly risk-sensitive. It is therefore all the more unfortunate to see that Switzerland remains in the last quarter in this respect, even though we are considered a country with a high understanding of risk.

About the Author

Marc Ruef

Marc Ruef has been working in information security since the late 1990s. He is well-known for his many publications and books. His most recent book, The Art of Penetration Testing, discusses security testing in detail. He is a lecturer at several faculties, like ETH, HWZ, HSLU and IKF. (ORCID 0000-0002-1328-6357)
