Enhancing Data Understanding
Rocco Gagliardi
There are many ready-to-use log analysis tools that you can download, unpack and run, each with many great features and a user-friendly interface. In special cases, however, it may be more efficient to extract and manipulate the data with a self-engineered tool, mainly because you gain a deeper understanding of the data and the ability to create very specific, customized outputs.
Dealing with large logfiles is pretty simple. We will see how to extract information from an exported Firewall-1 logfile and format it for our needs.
Last year, we worked on a project to protect the production servers from the internal users by splitting the internal network into many separate segments and controlling the traffic flowing between them. The idea was to install an open firewall, log all traffic (6 to 8 million log lines per day) to find connections, decide whether each connection was needed and, if so, implement an allow rule. At the end of the project, it was only necessary to swap the last rule from Allow to Drop to block all communication that was not required.
The main problem was to automatically keep track of the analyzed connections, and of the decisions about their validity made by an external process, over the entire six-month duration of the project. We decided to develop some tools for analyzing the firewall logs and a database to track and synchronize the implemented rules.
The required outputs were defined as:

- [date, rulenr, accept, drop]
- [action, source, destination, service]
Basically, the analysis process entailed the following steps:
The Firewall-1 log is stored in binary format in a protected location, and a privileged account is required to start the Checkpoint log export utility on the management/log server. Normally, a firewall engineer does not have access to that location (generally he is a normal user on a production system and only has SmartConsole read access). In this case it is still possible to export the logfile from the SmartConsole Tracker; it takes time and generates a big text file, but it is a good compromise between security and usability.
Once exported, we have a lot of lines to work with. By default, the Checkpoint tools export the following fields:
num;date;time;orig;type;action;alert;i/f_name;i/f_dir;product;src;dst;proto;rule;service;s_port;icmp-type;icmp-code;message_info;xlatesrc;xlatedst;NAT_rulenum;NAT_addtnl_rulenum;xlatedport;xlatesport;th_flags;Attack Info;attack
We use Perl for many reasons, but mainly because it is available by default on most systems. Even though it would have been possible to use a CSV parser, for simplicity we decided to analyze the file with a simple split on ";". Performance is not really an issue on a normal computer.
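To illustrate the trade-off, here is a small sketch (not part of the original tool) contrasting the simple split with a CSV parser; the commented-out alternative assumes the Text::CSV module is installed and is only needed if field values may contain quoted delimiters:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Example header line, using the default Check Point field names.
my $line = 'num;date;time;orig;type;action;alert;i/f_name;i/f_dir;product;src;dst;proto;rule;service;s_port';

# Simple approach used here: split on the semicolon delimiter.
my @fields = split /;/, $line;
print "column 5 is: $fields[5]\n";    # prints "action"

# Alternative: a real CSV parser (assumes Text::CSV is available),
# useful if values can contain embedded or quoted semicolons.
# use Text::CSV;
# my $csv = Text::CSV->new({ sep_char => ';', binary => 1 });
# $csv->parse($line) and my @f = $csv->fields();
```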
The heart of the extraction and analysis procedure comes down to these few lines of code:
```perl
038 my @counts = ('rule','src','dst','proto','service');
...
129 while ( $ln = <INPUT> ) {
130   chomp $ln;
131   @hdr = split (/;/, $ln);
132   for ($i=0; $i<@hdr; $i++) {
133     chomp $hdr[$i];
134     $header->{$hdr[$i]} = $i;
135   }
136   while ( $line = <INPUT> ) {
137     $linecounts++;
...
139     chomp $line;
140     @content = split (/;/, $line);
...
145     foreach $counter ( @counts ) {
146       $stats->{$content[$header->{action}]}->{$counter}->{$content[$header->{$counter}]}++;
147     }
148     $stats->{$content[$header->{action}]}->{cnn}->{$content[$header->{src}]}->{$content[$header->{dst}]}->{$content[$header->{service}]}++;
149     $date = $content[$header->{date}];
150   }
151   last;
152 }
```
Basically, we read a line, split it into columns at the ";" and get an array of values. We can store all the information we need in a relational form using a hash ($stats), which is a very flexible and powerful structure in Perl.
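As a minimal, self-contained sketch (not the original tool) of why the hash is so convenient: nested keys autovivify as they are incremented, so no initialization is needed. The sample values below are taken from the output shown further down.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

# Each missing hash level is created automatically ("autovivified")
# the first time it is incremented.
my $stats = {};
my @samples = (
    # action,   src,           dst,            service
    [ 'accept', '192.168.2.1', '192.168.20.1', 'http'  ],
    [ 'accept', '192.168.2.1', '192.168.20.1', 'http'  ],
    [ 'accept', '192.168.3.1', '192.168.10.1', 'https' ],
);

for my $s (@samples) {
    my ($action, $src, $dst, $service) = @$s;
    $stats->{$action}->{cnn}->{$src}->{$dst}->{$service}++;
}

print Dumper($stats);
```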
Lines | Result |
---|---|
131-135 | To make the tool robust (the order of the fields might change depending on the version), we do not hard-code the positions of the elements in the array; instead, we read the header and refer to each position by its column name. |
145-147 | For the basic statistics, we add a key for each combination of action, defined counter and counter value, and increment it by 1 each time the combination is found. |
148 | For the connection statistics, we add a key for each combination of action, source IP, destination IP and service, and increment it by 1 each time the combination is found. |
With just two lines of code, once the whole logfile has been parsed, we have in memory a hash with all our statistics on the used rules/sources/destinations/protocols/services and a traffic matrix of who talks to whom:
```perl
print Dumper($stats)

$VAR1 = {
  'accept' => {
    'proto' => {
      'tcp' => 4
    },
    'src' => {
      '192.168.2.1' => 3,
      '192.168.3.1' => 1
    },
    'rule' => {
      '1' => 4
    },
    'service' => {
      'http' => 3,
      'https' => 1
    },
    'dst' => {
      '192.168.20.1' => 3,
      '192.168.10.1' => 1
    },
    'cnn' => {
      '192.168.2.1' => {
        '192.168.20.1' => {
          'http' => 3
        }
      },
      '192.168.3.1' => {
        '192.168.10.1' => {
          'https' => 1
        }
      }
    }
  }
};
```
As defined in the specifications, it was necessary to generate a table of daily connections; it is very simple to produce such outputs: basically, we iterate over the hash(es) and print the results, as in the following example:
```perl
167 %actions = %$stats;
168 foreach $action (keys %actions) {
169   print OUTPUT "-"x20 ."\n"." Action: $action"."\n"."+"."-"x19 ."\n";
170   foreach $counter ( @counts ) {
171     foreach $count ($stats->{$action}->{$counter}) {
172       %hs = %$count;
173       print OUTPUT "+- $action - $counter" ."\n";
174       foreach $k ( (sort { $hs{$b} <=> $hs{$a} } keys %hs) ) {
175         printf(OUTPUT ("%6s => %s\n", $hs{$k}, $k)) if $hs{$k} > $filter;
176       }
177     }
178   }
179 }
```
The resulting summary in text format:
```
--------------------
 Action: accept
+-------------------
+- accept - rule
     4 => 1
+- accept - src
     3 => 192.168.2.1
     1 => 192.168.3.1
+- accept - dst
     3 => 192.168.20.1
     1 => 192.168.10.1
+- accept - proto
     4 => tcp
+- accept - service
     3 => http
     1 => https
```
The CSV for export:
```
date,rulenr,accept,drop
20121128,1,4,0
```

```
date,rulenr,src,dst,proto,service,accept,drop
20121128,1,192.168.2.1,192.168.20.1,tcp,http,3,0
...
```
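As an illustration only (the original export code is not reproduced here), the first of these CSV tables could be derived from the rule counters in $stats roughly as sketched below, assuming $date and the OUTPUT filehandle are already set up as in the listings above:

```perl
# Hypothetical sketch: build the daily per-rule summary
# (date,rulenr,accept,drop) from the 'rule' counters in $stats.
my %per_rule;
for my $action ( 'accept', 'drop' ) {
    my $rules = $stats->{$action}->{rule} || {};
    for my $rulenr ( keys %$rules ) {
        $per_rule{$rulenr}{$action} += $rules->{$rulenr};
    }
}

print OUTPUT "date,rulenr,accept,drop\n";
for my $rulenr ( sort { $a <=> $b } keys %per_rule ) {
    printf OUTPUT "%s,%s,%d,%d\n",
        $date, $rulenr,
        $per_rule{$rulenr}{accept} || 0,
        $per_rule{$rulenr}{drop}   || 0;
}
```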
Various other functions can be, and have been, added to facilitate the extraction and increase the efficiency of the tool, but with this minimum you can read, interpret and summarize millions of lines of information.
The daily extraction of nearly 7 million log lines took approximately one hour (including export, transfer and parsing); the parsing part alone took approximately 20 minutes.
Once created, the CSV file was imported into an MS Access database used for storage and correlation (about 1 GB in size at the end of the project); through a separate front-end MS Access database it was possible to analyze, track and report all the required information.
Once it is clearly and exactly defined what is needed, from where, and how it will be used, the analysis of the information is easier than it may appear at first sight; handling large amounts of data is also relatively simple with the CPU and memory available today and a few lines of simple code.
Sometimes it is better to use technically simple but very flexible mechanisms tailored to an existing process than to adapt a process to a good tool. And, finally, hashes are great!