
Specialized Search Engines

Searching big data at high speed

by Rocco Gagliardi
time to read: 9 minutes

Keypoints

  • Build your tools to efficiently reach the goal
  • Complex products are not always the best answer
  • OS tools are very powerful: use them

We received a big set of files to analyze: different formats, different content. The assignment was pretty simple: search for an email address and return all records matching that address.

We tried to import the data into MariaDB and Elasticsearch, but both solutions were slow. MariaDB had better and more flexible search performance in the short term, but it does not scale very well. Elastic scales well, but requires more code to keep the data under control and more hardware. In addition, the search engine was just a PoC, so we didn’t want to go big on hardware.

Challenges

Solution

We recycled an old Shuttle (i5, 8 GB RAM), replaced the HD with a new SSD (Samsung 850 Pro, 1 TB) and installed Debian.

After the hardware preparation and some tests, we decided to go for a filesystem-based index.

We prepared some Python and shell scripts to deal with the data.

We pre-parsed the files with specialized tools or some sed scripts until we got a reasonable file base, as described in the preceding section. Then we applied the following pattern (a minimal Python sketch follows the list):

  1. scan the file (f)
  2. grep for email (e)
  3. append "e, hash(f)" >> left(e,3)__.idx
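Our original Python and shell scripts are not reproduced here, but the pattern can be sketched in a few lines of Python. In this sketch the index directory, the blake2b file hash and the exact regular expression are illustrative assumptions, not the productive choices:

    #!/usr/bin/env python3
    # Minimal sketch of the indexing pattern described above.
    # Assumptions (not from the original scripts): index directory name,
    # blake2b file hash, and the exact email regular expression.
    import hashlib
    import re
    import sys
    from pathlib import Path

    # Deliberately balanced regex: strict enough to avoid noise,
    # simple enough to keep the scan fast.
    EMAIL_RE = re.compile(rb"[\w.+-]+@[\w.-]+\.[a-zA-Z]{2,}")

    INDEX_DIR = Path("indices")          # hypothetical output directory
    INDEX_DIR.mkdir(exist_ok=True)

    def file_hash(path):
        # Short hash replacing the full file reference in every index record.
        return hashlib.blake2b(str(path).encode(), digest_size=8).hexdigest()

    def index_file(path):
        h = file_hash(path)
        handles = {}                     # open index files, keyed by 3-char prefix
        with path.open("rb") as fh:
            for line in fh:                              # 1. scan the file (f)
                for raw in EMAIL_RE.findall(line):       # 2. grep for email (e)
                    email = raw.lower().decode("ascii", "ignore")
                    key = email[:3]
                    out = handles.get(key)
                    if out is None:
                        out = (INDEX_DIR / f"{key}__.idx").open("a")
                        handles[key] = out
                    out.write(f"{email},{h}\n")          # 3. append "e,hash(f)"
        for out in handles.values():
            out.close()

    if __name__ == "__main__":
        for arg in sys.argv[1:]:
            index_file(Path(arg))

Run over the chunk files, this produces one .idx file per 3-character prefix (e.g. mar__.idx), each holding email,hash pairs.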

In two days, all files were scanned and indexed. After that, it was possible to search for records containing a specific email address in a few seconds.
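The reason the lookup is fast: a query only has to read the single index file named after the first three characters of the address. A minimal sketch under the same assumptions as above; resolving the file hashes back to the original chunks requires a separate hash-to-file map, not shown here:

    #!/usr/bin/env python3
    # Minimal search sketch, matching the hypothetical layout of the indexer above.
    import sys
    from pathlib import Path

    INDEX_DIR = Path("indices")          # same assumption as in the indexer sketch

    def lookup(email):
        email = email.lower()
        idx = INDEX_DIR / f"{email[:3]}__.idx"
        if not idx.exists():
            return []
        hits = []
        with idx.open() as fh:
            for line in fh:
                addr, h = line.rstrip("\n").split(",", 1)
                if addr == email:
                    hits.append(h)       # hash of a chunk containing the address
        return hits

    if __name__ == "__main__":
        for h in lookup(sys.argv[1]):
            print(h)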

Attention was paid to the choice of the hashing algorithm; storing a short hash instead of the full file reference saved some GB of redundant information. It was also important to balance the regular expression used to match email addresses: neither too simple nor too complex.

Measures

Dataset to Analyze

Parameter                       Value
Number of files to analyze      534
Number of records to analyze    1’895’800’000
Total size of files             222 GB

Results

Parameter                             Value
Number of chunks produced             189’588
Size of chunks produced               222 GB
Number of indices produced            63’242
Size of indices produced              42 GB
Number of extracted email addresses   757’036’765

The chart shows on the x-axis the number of chunks (c) and on the y-axis the number of files split into c chunks. It shows how the original files were chunked to flatten the access time.

Chunked to flatten access time

Timing

Parameter                          Value
Preparation                        ca. 1 day
Indexing                           ca. 1 day
Speed of indexing by filesystem    ca. 9’000 records/second
Speed of indexing by MariaDB       ca. 350 records/second
Speed of indexing by Elastic       ca. 500 records/second

Search Time

Parameter                                                                  Value
Search time without indexing                                               ca. 960 s (16 min)
Search time with indices, best case                                        0.005 s
Search time with indices, worst case                                       66 s
Number of indexed email addresses with access time < 5 s                   559’356’992 (75%)
Number of indexed email addresses with access time between 5 s and 10 s    110’331’635 (15%)
Number of indexed email addresses with access time > 10 s                  77’810’353 (10%)

The following chart shows on the x-axis the indices in 50K cardinality groups (cg), on the left y-axis the number of email addresses in the cg, and on the right y-axis the worst access time [seconds] for the cg.

Number of email addresses by indices vs. average worst access time

Circa 90% of the email addresses can be found within the 10 s range, but the remaining 10% has a high access time. Based on these statistics, a second indexing round re-indexed the top 10 indices by 4 characters; as a result, the access time was flattened to a maximum of 10 s.
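A possible implementation of this second round, again only a sketch under the same hypothetical layout: split an oversized 3-character index into 4-character indices; the search script then has to try the longer prefix first.

    #!/usr/bin/env python3
    # Hedged sketch of the second indexing round: re-split an oversized
    # 3-character index (e.g. mar__.idx) into 4-character indices.
    import sys
    from pathlib import Path

    INDEX_DIR = Path("indices")          # same assumption as in the previous sketches

    def resplit(idx_path):
        handles = {}
        with idx_path.open() as fh:
            for line in fh:
                key = line.split(",", 1)[0][:4]      # first 4 chars of the email
                out = handles.get(key)
                if out is None:
                    out = (INDEX_DIR / f"{key}__.idx").open("a")
                    handles[key] = out
                out.write(line)
        for out in handles.values():
            out.close()
        idx_path.unlink()                # drop the oversized 3-character index

    if __name__ == "__main__":
        for arg in sys.argv[1:]:
            resplit(Path(arg))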

Top Indices

Indexing by the leftmost 3 characters creates indices of different sizes, depending on how many email addresses start with those 3 characters. The top 3 indices contain millions of email addresses:

Index       Number of email addresses
mar__.idx   9’763’226
www__.idx   5’548’694
ale__.idx   4’605’329

We note that the most common prefix among the extracted email addresses is mar, followed by www, which means that there are many “service” email addresses.

The following chart shows a treemap of the indices sizes.

Overview of indices cardinality

Conclusion

Using dedicated tools, we were able to decrease the query time by many orders of magnitude. Depending on the goals, instead of throwing more hardware or big, complex tools at the problem, it can be better to create an ad-hoc solution and use the power of the OS tools.

About the Author

Rocco Gagliardi

Rocco Gagliardi has been working in IT since the 1980s and specialized in IT security in the 1990s. His main focus lies in security frameworks, network routing, firewalling and log management.

