How Passwords are Chosen
As part of our services on the Darknet, we have infiltrated a number of data markets. In these, stolen data, such as credit card information and passwords, are traded. The latter are of particular interest to us. This is because whenever a password leak is provided, we import it into an internal database in anonymized form. This system is then used to alert our customers about leaks that affect them.
This system also makes extensive statistical analysis of passwords possible (e.g. the analysis of the well-known leak of libero.it in Italy). In our Red Teaming projects we tend to use specially prepared password lists, which allow brute-force attacks to be highly optimized. Some of these lists are freely available in our public GitHub repository.
Currently, our database stores over 733 million unique mail addresses and over 265 million unique passwords. Meta-information about the leaks (e.g. source, date), usernames and hashes are stored as well, but they are not relevant to this analysis. Based on these basic key figures, it can be said that a password is used 2.76 times on average.
The following graphic shows the most frequently used domains. It can be seen that mainly those of well-known freemailers and large providers are to be found. These include the international companies Yahoo with their freemailer (23.7%), Microsoft with Hotmail (16.6%) and Google with Gmail (11.1%). The first two are represented by additional regional top-level domains such as
.co.uk (United Kingdom) and
Companies with more local characteristics follow. These include, for example, mail.ru in Russia (6.7%), AOL with a focus on the USA (2.5%) and qq.com in China (0.6%).
By their nature, these figures represent the number of leaked records per service. At the same time, they allow indirect conclusions, to a certain extent, about the market penetration of the individual providers.
The following analyses have not been fully normalized. This means that anomalies have not been manually isolated or corrected. Such an anomaly occurs, for example, if a bot created several accounts and always used the same password, or if certain terms (e.g. names, music groups, sports teams) are disproportionately represented due to a passing trend. These anomalies can directly influence the statistics and therefore distort the underlying linguistic and cultural observations. Moreover, the analyses are always a snapshot and reflect the idiosyncrasies of that moment.
Security companies like to publish lists of top passwords for advertising purposes. A closer look reveals that these lists are rarely based on an accurate and methodical approach. Many lists are outdated, copied together, or based on very limited data sets.
The following is an analysis of the top passwords across our entire dataset. The 20 most popular passwords are shown in descending order. We will encounter these passwords again and again in the following observations. They dominate most of the individual breakdowns.
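A ranking of this kind can be sketched in a few lines. The following is only an illustration under the assumption that the passwords are available as a plain list of strings; the function name `top_passwords` and the toy sample are hypothetical and do not come from our actual database.

```python
from collections import Counter

def top_passwords(passwords, n=20):
    """Return the n most frequent passwords as (password, count, share in %)."""
    counts = Counter(passwords)
    total = sum(counts.values())
    # most_common() already returns the entries in descending order of frequency
    return [(pw, count, 100.0 * count / total)
            for pw, count in counts.most_common(n)]

# Hypothetical toy sample, not taken from the real dataset
sample = ["123456", "123456", "123456", "password", "password", "hallo"]
for pw, count, share in top_passwords(sample, n=3):
    print(f"{pw:<10} {count:>3} {share:6.2f}%")
```

The share is calculated relative to all records, which is also how the percentages in the following sections should be read.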
This section breaks the data down by country. A selection of countries is examined as examples so that linguistic and cultural peculiarities can be identified. The assignment of a leaked record to a country is primarily based on the domain used for the mail address. Either the provider is typical for the country (e.g.
126.com in China) or the top-level domain provides for use by a specific target group (e.g.
hotmail.fr for France). The origin of a leak or the affected service itself has no influence on our investigations.
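The assignment described above could look roughly like the following sketch. It is an assumption about how such a mapping might be implemented, not our production logic: the lookup table `PROVIDER_COUNTRY`, the set `GENERIC_TLDS` and the function `country_for_address` are hypothetical, and only the two examples from the text (126.com and hotmail.fr) are taken from the article.

```python
# Hypothetical lookup table; only 126.com -> China is taken from the text
PROVIDER_COUNTRY = {"126.com": "cn"}
# Generic top-level domains that say nothing about a country (illustrative subset)
GENERIC_TLDS = {"com", "net", "org", "info"}

def country_for_address(address):
    """Return a lowercase country code for a mail address, or None if unknown."""
    _, sep, domain = address.strip().lower().rpartition("@")
    if not sep or "." not in domain:
        return None
    # 1) Provider that is typical for a country
    if domain in PROVIDER_COUNTRY:
        return PROVIDER_COUNTRY[domain]
    # 2) Two-letter top-level domain, treated here as a country code
    tld = domain.rsplit(".", 1)[-1]
    if len(tld) == 2 and tld not in GENERIC_TLDS:
        return tld
    return None

print(country_for_address("alice@126.com"))    # cn
print(country_for_address("bob@hotmail.fr"))   # fr
print(country_for_address("carol@yahoo.com"))  # None
```

Addresses on generic domains stay unassigned in this sketch, which matches the statement that the origin of the leak itself plays no role.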
The distribution of the collected data shows that a large part can be assigned to Russia (46.8%). It is followed by the United Kingdom (12.5%), France (10.4%) and Germany (7.5%). In the country-specific evaluations, states are omitted or given lower priority if only a small amount of data is available to us. These include, for example, Guinea-Bissau (0.0000111%), the Marshall Islands (0.0000115%), and Bonaire, Sint Eustatius and Saba (0.000013%).
For Switzerland with its top-level domain
.ch, the usual standard passwords, such as
123456 (0.59%) and
123456789 (0.14%) are found in first place.
Unusual but regionally explainable are the German spelling
passwort (0.04%) and the expression
hallo (0.46%) or the variation of it as
hallo123 (0.029%). With
daniel (0.029%) a widespread male name is represented. This name has been very popular for years and last topped the rankings in Switzerland in 2019.
Particularly unusual, and probably a statistical anomaly, are
!~!1 (0.91%) and
tintenprofi (0.031%). These presumably come from specific leaks in which exactly these passwords are clustered, which distorts the overall results for the country.
If we now compare Switzerland with the top-level domain
.de in Germany, we quickly see obvious parallels that are due to linguistic and cultural similarities. Again, of course,
123456 (0.64%) and
123456789 (0.22%) are at the top. The distributions are not exactly the same, but they fall within a similar range. Likewise,
hallo123 (0.052%) and
hallo (0.042%) are also found in the top spots, albeit in reverse order.
Passwords that can be identified as typically German come primarily from the subject area of football. These include
schalke04 (0.023%) and
fussball (0.020%) itself. But
Mercedes (0.012%) can also be clearly assigned to records from Germany. Likewise, an above-average number of vulgar expressions like
fi--en (0.028%) and
ar--hole (0.019%) can be found.
The analysis of our dataset also shows statistical anomalies here that can be attributed to specific leaks. Foremost among these is
Y57gjng4gH (0.032%), which looks like a random choice of characters. It is odd that this is represented so often. Accordingly, it can be assumed that it is the password of an automated mechanism or a fabricated record.
Looking at the top 20 passwords in Germany,
1q2w3e4r (0.031%) also stands out. At first glance, this too appears to be a random and therefore fabricated password. However, that is not the case, since it results from interleaving the strings
1234 and
qwer. The characters are entered alternately, moving from left to right across the keyboard (see picture).
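The construction of this password can be reproduced with a short sketch. The function name `interleave` is hypothetical; the digit row 1234 is used as the counterpart of qwer because that is what the password 1q2w3e4r implies.

```python
def interleave(first, second):
    """Merge two equally long strings by alternating their characters."""
    if len(first) != len(second):
        raise ValueError("both strings must have the same length")
    return "".join(a + b for a, b in zip(first, second))

password = interleave("1234", "qwer")
print(password)        # 1q2w3e4r

# The reverse direction: every second character belongs to one keyboard row
print(password[0::2])  # 1234
print(password[1::2])  # qwer
```

Splitting a password into its even and odd positions in this way is a simple check for spotting interleaved keyboard rows in a list that otherwise looks random.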
The input of characters from left to right is particularly popular, as can be seen in examples such as
qwertz. The fact that
qwerty (0.086%) is more common in Germany than
qwertz (0.029%) is surprising, since German-language keyboards use the latter layout. It must be assumed that either English-speaking users or English-language keyboard layouts are dominant in the German-speaking area.
The evaluation of
.ru for Russia is quite extensive, as these records currently account for almost half of our database. The most used passwords can be divided into five categories:
This exposes most of the typical password patterns. Chances are very high that the choice of a classic password falls into one of these categories. The distribution of the most used passwords is also more even than in other countries.
The evaluation of China was based on the one hand on the top-level domain
.cn, but also on some well-known domains of the largest providers such as
126.com. This means that Chinese users on non-typical top-level domains could also be taken into account.
Looking at the top 100, it is immediately noticeable that 53% of the passwords consist of pure numbers. In Germany, by contrast, this is the case for only 20% of the passwords. This is probably because many systems do not allow passwords to be entered with Chinese characters. In addition, users are more likely to want to remember numerical combinations than foreign-language, mostly English, words.
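The share of purely numeric passwords in such a top list is easy to determine. The following sketch assumes the top list is given as a list of strings; the function name `numeric_share` and the sample values are hypothetical.

```python
def numeric_share(passwords):
    """Return the percentage of passwords that consist only of ASCII digits."""
    if not passwords:
        return 0.0
    # isascii() excludes digits from other scripts, which isdigit() would accept
    numeric = sum(1 for pw in passwords if pw.isascii() and pw.isdigit())
    return 100.0 * numeric / len(passwords)

# Hypothetical toy list, not the real top 100
print(numeric_share(["123456", "password", "111111", "qwerty"]))  # 50.0
```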
By trying the top 100 passwords, a success rate of 3.01% is achieved for Germany. For Russia, the rate is above average at 5.46%. For China, as much as 7.11% is possible, which is due to the reduced entropy identified above.
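How such a success rate could be derived is sketched below. This assumes that the success rate means the share of a country's records whose password appears among the n most frequent passwords of that country; the function name `top_n_success_rate` is hypothetical.

```python
from collections import Counter

def top_n_success_rate(passwords, n=100):
    """Percentage of records whose password is among the n most frequent ones."""
    counts = Counter(passwords)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(n))
    return 100.0 * covered / total

# Hypothetical toy sample: the single most frequent password covers 3 of 5 records
print(top_n_success_rate(["a", "a", "a", "b", "c"], n=1))  # 60.0
```

The more evenly the passwords of a country are distributed, the lower this value becomes for a fixed n.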
Just as the most popular passwords can be studied per country, they can also be studied per provider. As a sample, we have examined the data sets of the three most popular mail providers:
yahoo.com (see graphic),
At least among these three, there are no significant differences from the country-specific observations. The usual passwords are found at the top. These include
123456 (0.73-1.02%) in first place and
123456789 (0.26-0.47%) in second place for all of them. Third place would normally go to
password (0.11-0.22%), but it has been displaced from that position by
fuk19600 (0.19%) at
hotmail.com. This is probably again a password of automatically generated accounts or a fabricated record. Other similar effects can be observed again and again.
Year numbers are popular 4-digit PINs. Here we can see that
2005 (2.62%) tops the list and is immediately followed by
2004 (2.55%). This is followed by
2000 (2.32%) and
2003 (2.14%). Only now comes the first year from the 20th century: Namely the year
Statistically, this observation becomes more interesting when the years are grouped into decades. Here we see that the decade 2000-2009 (18.71%) is closely followed by 1980-1989 (17.70%). After that, we find 1970-1979 (14.56%) and 1990-1999 (14.15%). This corresponds quite well to the primary audience of today’s Internet.
Here, the origin of the years must be distinguished. In most cases, the user’s own year of birth will probably be chosen. However, it is not unusual for important events (e.g. a wedding) or the birth years of one’s own children to be used as well.
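Grouping years into decades can be sketched as follows. The function name `shares_by_decade` is hypothetical, and the input contains only the four year shares quoted in the text, so the result is a partial sum and not the full decade value.

```python
from collections import defaultdict

def shares_by_decade(year_shares):
    """Sum per-year shares (in %) into decades keyed by their first year."""
    totals = defaultdict(float)
    for year, share in year_shares.items():
        decade = int(year) // 10 * 10  # e.g. 2005 -> 2000
        totals[decade] += share
    return dict(totals)

# Only the four shares quoted in the text; all other years are omitted here
quoted = {"2005": 2.62, "2004": 2.55, "2000": 2.32, "2003": 2.14}
for decade, share in sorted(shares_by_decade(quoted).items()):
    print(f"{decade}-{decade + 9}: {share:.2f}%")
```

These four years alone already account for roughly half of the decade 2000-2009.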
Next, we look at PINs (Personal Identification Numbers) of all kinds, i.e. passwords that consist exclusively of digits. For 4-digit PINs, the usual standard patterns appear even at this short length. These include ascending and descending number sequences such as
1234 (10.5%) and
4321 (0.25%), overall repetitions such as
1111 (1.62%) and
0000 (1.15%), and partial repetitions such as
1212 (0.52%) and
6969 (0.26%) (see the breakdown in the analysis of Russia).
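The pattern groups named above can be told apart with a small heuristic. The following is a sketch only; the function name `classify_pin` and the exact rules are assumptions, chosen so that the examples from the text end up in the groups they are listed under.

```python
def classify_pin(pin):
    """Assign a 4-digit PIN to one of the pattern groups named in the text."""
    if len(pin) != 4 or not (pin.isascii() and pin.isdigit()):
        raise ValueError("expected exactly four digits")
    digits = [int(ch) for ch in pin]
    # Differences between neighbouring digits, e.g. 4321 -> {-1}
    steps = {b - a for a, b in zip(digits, digits[1:])}
    if steps == {0}:
        return "full repetition"     # e.g. 1111, 0000
    if steps == {1} or steps == {-1}:
        return "number sequence"     # e.g. 4321
    if pin[:2] == pin[2:]:
        return "partial repetition"  # e.g. 1212, 6969
    return "other"

for pin in ["4321", "1111", "0000", "1212", "6969", "2005"]:
    print(pin, classify_pin(pin))
```

PINs such as 2005 fall into the remaining group, which is where years and possible dates have to be examined separately.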
Some number series also look like years. For example
2005 (0.26%) and
2004 (0.25%). Whether these are really years or just patterns is difficult to say. Years were already considered in a separate section above.
Dates, for example birthdays or anniversaries, could also occur. However, they are underrepresented in the statistical distribution. In the top 100, only a few candidates come into question. The clearest assignment of this kind, since it is neither a known year nor an obvious pattern, is possible with
It is questionable whether these PINs also allow conclusions to be drawn about the relative distribution on devices such as smartphones and input terminals at ATMs. However, a comparison with other studies shows that the expected distribution can very well be observed here.
Finally, the average length of the passwords should be discussed. This amounts to 9.7771 characters across all data sets and is thus above the traditional rule of thumb of 8 characters.
The breakdown by country provides a basic overview of how firmly security awareness is anchored in the respective cultures. Although this may indicate a trend, the data must be seen in context. For some countries, there is relatively little leaked data, and others are dominated by generated and fabricated passwords. This can distort the view.
Such a distortion can be seen for Eswatini (sz). The average password length for this relatively small country with only about 1.1 million inhabitants is 24.6 characters. This is more than twice as long as for the Cook Islands (ck), which come second with 11.5 characters. With more than 18,000 inhabitants, the population there is also very small, and the significance of the figure is correspondingly limited. For this reason, we have made an adjustment and only included countries in the statistics for which there are at least 10,000 leaked accounts.
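The adjustment can be illustrated with the following sketch. It assumes that the records are available as pairs of country code and password and that each record counts as one leaked account; the names `MIN_ACCOUNTS` and `average_length_by_country` are hypothetical.

```python
from collections import defaultdict

MIN_ACCOUNTS = 10_000  # threshold named in the text

def average_length_by_country(records, min_accounts=MIN_ACCOUNTS):
    """records: iterable of (country_code, password) pairs.

    Returns (country_code, average length) pairs, longest average first,
    skipping countries with fewer than min_accounts records.
    """
    length_sum = defaultdict(int)
    count = defaultdict(int)
    for country, password in records:
        length_sum[country] += len(password)
        count[country] += 1
    averages = {c: length_sum[c] / count[c]
                for c in count if count[c] >= min_accounts}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy data with a lowered threshold for demonstration
toy = [("ch", "123456"), ("ch", "hallo123"), ("sz", "x" * 24)]
print(average_length_by_country(toy, min_accounts=2))  # [('ch', 7.0)]
```

In the toy data the single very long password for sz is dropped by the threshold, which is the effect the adjustment is meant to achieve.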
After adjusting the data, a much clearer picture emerges. Ireland (ie) is in first place with 11.2 characters. This could be due to the fact that many large tech companies have ties to Ireland and thus exert direct or indirect influence on password structures.
This is followed by Kazakhstan (kz) with 9.674, Palau (pw) with 9.673 and Tokelau (tk) with 9.63. None of these countries is known for technical developments. In the case of the top-level domain
pw, the association with the term password could lead security-savvy users to prefer such addresses without otherwise having any connection to the country.
Another Western country near the top is Belgium, which has an average password length of 9.55. The United Kingdom with 9.43, Russia with 9.10 and Germany with 8.92 are also in the top quarter.
Italy with 8.44, China with 8.33, the USA with 8.21 and France with 8.18 characters are rather average.
Switzerland ranks in the bottom quarter with an average length of only 8.03 characters. It is difficult to say whether this is due to a lack of understanding of security, an increased willingness to take risks, or a deliberately graded risk assessment.
Password lists seem archaic, but on closer inspection they still hold some hidden secrets. Skillful statistical analyses can identify cultural and technical conditions, thanks to which highly specialized and optimized password lists can be compiled.
In many places, the top passwords are the same or similar. In particular, ascending numerical strings and simple letter combinations are encountered time and again. The situation is similar for PINs, with anniversaries and years also playing an important role.
The study of average password lengths holds a few surprises, since above-average lengths can also be observed in countries that would not necessarily be considered highly risk-sensitive. It is therefore all the more unfortunate to see that Switzerland remains in the bottom quarter in this respect, even though we are considered a country with a high understanding of risk.