Enhancing Data Understanding
Rocco Gagliardi
We tried importing the data into MariaDB and Elasticsearch, but both solutions were slow. MariaDB offered better and more flexible search performance in the short term, but it does not scale very well. Elasticsearch scales well, but requires more code to keep the data under control, and more hardware. In addition, the search engine was just a PoC, so we did not want to go big on hardware.
We recycled an old Shuttle/i5/8GB, replaced the HD with a new SSD (Samsung 850 Pro, 1TB) and installed Debian.
After the hardware preparation and some tests, we decided to go for a filesystem index.
We prepared some Python and shell scripts to deal with the data.
We pre-parsed the files with specialized tools or sed scripts until we got a reasonable file base, as described in the preceding chapter. Then we applied the following pattern (sketched in Python below):
```
for each file (f):
    for each email (e) found in f:
        "e, hash(f)" >> left(e,3)__.idx
```
In two days, all files were scanned and indexed. After that, a search for records containing a specific email address took only a few seconds.
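A minimal Python sketch of this pattern, assuming an MD5-truncated hash of the source file's path and the simple regex below (both illustrative choices of ours, not necessarily what ran in production):

```python
import hashlib
import re
from pathlib import Path

# Assumed email regex; see the note on regex balance below.
EMAIL_RE = re.compile(rb"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def index_file(path: Path, index_dir: Path) -> None:
    # One short digest per source file, stored with every record so the
    # full path is not repeated hundreds of millions of times.
    file_hash = hashlib.md5(str(path).encode()).hexdigest()[:8]
    with path.open("rb") as fh:
        for line in fh:
            for match in EMAIL_RE.findall(line):
                e = match.decode("ascii", "ignore").lower()
                # left(e,3)__.idx: one index file per 3-char prefix.
                idx = index_dir / f"{e[:3]}__.idx"
                with idx.open("a") as out:
                    out.write(f"{e},{file_hash}\n")
```

A real run would keep the per-prefix file handles open in a small cache rather than reopening them for every record; the open/append here is kept only for brevity.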
Care was taken in the choice of the hashing algorithm; a short digest made it possible to save some GB of redundant information. It was also important to balance the regular expression used to match email addresses: neither too loose nor too complex.
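To make both trade-offs concrete (the regexes and digests below are illustrative, not the exact ones used): a loose pattern pollutes the index with false positives, an RFC-complete one is slow over ~1.9 billion records, and every extra character of digest stored per record costs roughly 757 MB on disk at ~757 million records:

```python
import hashlib
import re

# Too loose: grabs almost anything around an '@'.
LOOSE = re.compile(r"\S+@\S+")

# Balanced: constrained character classes, with no attempt at the
# full RFC 5322 grammar.
BALANCED = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

# Digest length drives index size.
print(len(hashlib.sha256(b"dump_001.txt").hexdigest()))   # 64 chars/record
print(len(hashlib.md5(b"dump_001.txt").hexdigest()[:8]))  # 8 chars/record
```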
| Parameter | Value |
|---|---|
| Number of files to analyze | 534 |
| Number of records to analyze | 1’895’800’000 |
| Total size of files | 222 GB |
| Parameter | Value |
|---|---|
| Number of chunks produced | 189’588 |
| Size of chunks produced | 222 GB |
| Number of indices produced | 63’242 |
| Size of indices produced | 42 GB |
| Number of extracted email addresses | 757’036’765 |
The chart shows the number of chunks (c) on the x-axis and the number of files that were split into c chunks on the y-axis. It illustrates how the original files were chunked to flatten access time.
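The chunking step itself is simple; with 222 GB spread over 189’588 chunks, the average chunk is a bit over 1 MB, small enough to be scanned almost instantly. A sketch, with chunk size and naming being our assumptions:

```python
from pathlib import Path

CHUNK_LINES = 10_000  # assumed; tuned so any single chunk scans quickly

def chunk_file(path: Path, out_dir: Path) -> int:
    """Split a large dump into consecutive chunks of CHUNK_LINES lines."""
    out = None
    chunks = 0
    with path.open("rb") as fh:
        for i, line in enumerate(fh):
            if i % CHUNK_LINES == 0:
                if out is not None:
                    out.close()
                out = (out_dir / f"{path.stem}.{chunks:05d}").open("wb")
                chunks += 1
            out.write(line)
    if out is not None:
        out.close()
    return chunks
```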
| Parameter | Value |
|---|---|
| Preparation | ca. 1 day |
| Indexing | ca. 1 day |
| Speed of indexing by filesystem | ca. 9’000 records/second |
| Speed of indexing by MariaDB | ca. 350 records/second |
| Speed of indexing by Elastic | ca. 500 records/second |
| Parameter | Value |
|---|---|
| Search time without indexing | ca. 960 s (16 min) |
| Search time with indices, best case | 0.005 s |
| Search time with indices, worst case | 66 s |
| Number of indexed email addresses with access time < 5 s | 559’356’992 (75%) |
| Number of indexed email addresses with access time between 5 s and 10 s | 110’331’635 (15%) |
| Number of indexed email addresses with access time > 10 s | 77’810’353 (10%) |
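The lookup behind these numbers is a two-step affair: pick the index file from the first three characters of the query, scan it linearly for the address, then resolve the stored file hashes back to the chunks to scan in full. A sketch of the first step (the function and naming are ours):

```python
from pathlib import Path

def search(email: str, index_dir: Path) -> list[str]:
    """Return the file hashes of all sources containing `email`."""
    email = email.lower()
    idx = index_dir / f"{email[:3]}__.idx"  # same naming as at indexing time
    if not idx.exists():
        return []
    hashes = []
    with idx.open() as fh:
        for line in fh:  # linear scan: cost = size of this one index
            e, file_hash = line.rstrip("\n").split(",")
            if e == email:
                hashes.append(file_hash)
    return hashes
```

Since the scan is linear, the best and worst cases in the table fall directly out of the size of the one index file that gets hit, which is why the skewed prefixes discussed below dominate the worst case.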
The following chart shows on the x-axis the indices grouped in 50K-cardinality groups (cg), on the left y-axis the number of email addresses in the cg, and on the right y-axis the worst access time [seconds] for the cg.
Circa 90% of the email addresses can be found within the 10 s range, but the remaining 10% have a high access time. Based on these statistics, a second indexing round re-indexed the top 10 indices by 4 characters; as a result, the access time was flattened to a maximum of 10 s.
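Re-indexing an oversized prefix is just a second pass over its index file, fanning the records out by a longer prefix; a sketch under the same naming assumptions as above:

```python
from pathlib import Path

def reindex_by_4(idx_path: Path, index_dir: Path) -> None:
    """Split a 3-char index such as mar__.idx into 4-char indices
    (mara__.idx, marb__.idx, ...) to flatten its access time."""
    with idx_path.open() as fh:
        for line in fh:
            email = line.split(",", 1)[0]
            out = index_dir / f"{email[:4]}__.idx"
            with out.open("a") as dst:  # again, cache handles in a real run
                dst.write(line)
    idx_path.unlink()  # the oversized 3-char index is no longer needed
```

The search side then has to try the 4-char index first and fall back to the 3-char one; we assume a check of this kind was added to the lookup script.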
Indexing by the leftmost 3 characters creates indices of very different sizes, depending on how many email addresses start with those 3 characters. The top 3 indices hold millions of email addresses:
| Index | Number of email addresses |
|---|---|
| mar__.idx | 9’763’226 |
| www__.idx | 5’548’694 |
| ale__.idx | 4’605’329 |
We note that most of the extracted email addresses start with `mar`, with `www` in second place, which means that there are many “service” email addresses.
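Ranking the indices by record count is also how a table like the one above, and the re-indexing candidates, can be produced; a quick sketch (the `idx` directory name is a placeholder):

```python
from pathlib import Path

# Rank index files by record count to spot re-indexing candidates.
sizes = sorted(
    ((sum(1 for _ in p.open()), p.name) for p in Path("idx").glob("*__.idx")),
    reverse=True,
)
for count, name in sizes[:10]:
    print(f"{name}\t{count}")
```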
The following chart shows a treemap of the index sizes.
Using dedicated tools, we were able to decrease the query time by many orders of magnitude. Depending on the goals, instead of throwing hardware or big and complex tools at the problem, it can be better to create an ad-hoc solution and use the power of the OS tools.