Source Code Analysis

A Beginner's Guide

by Marc Ruef

time to read: 30 minutes

There are many ways to do a security check. Traditionally, network-based analyses (e.g. Vulnerability Scans or Penetration Tests) are performed. There are, however, other approaches such as local audits, Configuration Reviews and Firewall Rule Reviews.

A very powerful type of security check is the Source Code Analysis (SCA). In such an analysis, the code of specific software is examined to find vulnerabilities. This article will show you the basics of such a review.

Starting Point and Goal

The starting point for any SCA is the availability of the code of the software that is to be examined. Closed-source solutions can therefore not be examined without preparatory measures. One such measure is Reverse Engineering, which usually adds considerable effort to the project.

Depending on a wide variety of factors, the source can sometimes be acquired. Certain projects even offer access. And some vendors are willing to give their code – occasionally only snippets – to select customers. If the software is being developed by a third party, then the contract with that third party should state that the source code is being handed over or that the source code is accessible.

As soon as the source code is available (in open source projects this is never an issue), nothing stands in the way of a SCA. The search for vulnerabilities in the application can begin.

Entry Points

Software usually provides more than one entry and exit point:

Entry Points allow the entry of data.
This data is being processed and the result is normally being delivered via an Exit Point.

Entry Points typically are User Entries. In command-line software, these are passed as arguments or parameters, or made during runtime as interactive entries. In software that has a graphical user interface (GUI), the entries can be performed with the mouse or using gestures.

Software does, however, also receive input from other sources, mainly through interfaces to non-human sources. For example, environment variables can be taken into account, files can be read or database queries can be performed. These accesses usually do not require any interaction with the user.

Identify Entry Point

During a SCA it is of vital importance to identify any and all entry points, be they parameters, calls to functions or objects. They depend on the language, and within a given language they are usually largely standardized. The following table lists the typical entry points for various languages.

Entry Point	PHP	ASP	JSP	Ruby
Argument	$argv, $_SERVER['argv']	–	–	ARGV
Interactive Entry	fgets()	–	–	gets(), stdin.read()
Environment Variable	$_ENV, getenv(), apache_getenv(), $_SERVER	Request.ServerVariables, oShell.ExpandEnvironmentStrings()	System.getenv(), System.getProperty()	ENV
Files	fread(), fgets(), file(), file_get_contents()	file.OpenAsTextStream(), oBinRead.readBinFile()	InitialContext().lookup()	target.read(), File.readlines()
MySQL Query	mysqli_query(), mysqli_multi_query(), mysqli_real_query(), mysqli_send_query(), mysqli_stmt_execute()	oConn.Execute()	executeQuery(), executeUpdate(), execute()	Active Record: Model.find(), Model.take(), Model.first(), Model.last(), Model.find_by()
HTTP File Uploads	$_FILES	FileUploadControl, oUpload("foo").SaveAs	request.getPart()	params
HTTP GET Parameter	$_GET, $_REQUEST, $_SERVER['QUERY_STRING'], $HTTP_GET_VARS	Request.QueryString	getParameter(), getParameterValues(), ${param['foo']}, ${param.foo}	params, GET, query_parameters()
HTTP POST Parameter	$_POST, $_REQUEST, $HTTP_RAW_POST_DATA, $HTTP_POST_VARS	Request.Form, Request["foo"]	getInputStream(), getReader(), ${param['foo']}, ${param.foo}	POST, request_parameters(), raw_post()
HTTP Cookies	$_COOKIE	Request.Cookies	request.getCookies(), ${cookie['foo']}, ${cookie.foo}	Client: cookie_jar()
HTTP Sessions	$_SESSION	Session.Contents, Session("foo")	session.getAttribute()	–

As soon as you know which language is in use and how it’s constructed, you can find the entry points using a text search. For example, you could use grep:

maru@debian:~$ grep -H -n -r '$_GET\|$_POST\|$_SERVER\|$_COOKIE\|$_FILE' *.php
foo.php:3:if($_GET['a'] == 'foo'){
foo.php:5:}elseif($_POST['b'] == 'bar'){
foo.php:6:    echo htmlentities($_POST['c']);

Write down these entry points (either the line number or the name of the function) so that you can reference them in the documentation.
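The same search can be scripted so that the findings end up in a list that is ready for the documentation. The following is a minimal Python sketch and not a complete scanner; the function name find_entry_points() and the sample source are assumptions made for illustration, and only a subset of the PHP superglobals from the table is covered.

```python
import re

# PHP superglobals that commonly act as entry points (subset of the table).
ENTRY_POINT = re.compile(r"\$_(GET|POST|REQUEST|SERVER|COOKIE|FILES|ENV|SESSION)\b")

def find_entry_points(filename, source):
    """Return (filename, line number, variable, code) for every match."""
    findings = []
    for number, line in enumerate(source.splitlines(), start=1):
        for match in ENTRY_POINT.finditer(line):
            findings.append((filename, number, match.group(0), line.strip()))
    return findings

# Hypothetical sample file, loosely modelled on the grep output above.
SAMPLE = """<?php
// demo
if($_GET['a'] == 'foo'){
    echo 'a';
}elseif($_POST['b'] == 'bar'){
    echo htmlentities($_POST['c']);
}
"""

for finding in find_entry_points("foo.php", SAMPLE):
    print("%s:%d: %s" % finding[:3])
```

Keeping the line number and the matched variable together makes it easy to reference each entry point later on.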

Difficulties of Alternative Referencing

The biggest difficulty while identifying the entry points (this also applies to exit points) is the fact that alternative referencing is possible. This is the case when data from a single channel can be accessed via different mechanisms.

PHP offers a good example with its HTTP GET parameters. When the URL http://example.com/?foo=bar is called, the parameter foo can be read using $_GET['foo']. Searching for $_GET will show you all entry points of this type. However, the same parameter can also be referenced using $_REQUEST['foo'] or $_SERVER['QUERY_STRING'] (the latter needs a substr() or a regex). In old installations, $HTTP_GET_VARS can be used as well and, with register_globals turned on, simply $foo suffices. Neither of the latter two mechanisms is available in current versions of PHP.

In JSP there is the special case of the Unified Expression Language. This enables access to objects using simplified tags. Instead of getParameter("foo") you can use ${param['foo']} or ${param.foo}.

This shows that the understanding of the language in use has to be rather high in order to find and identify all variants of a single mechanism. The more flexible a language is, the more time you need to put into the analysis. PHP software stands out due to its lack of uniformity and comprehensibility. Other languages such as JSP or Perl can be complex as well. Even a simple search using grep can become quite difficult because you will need to work with very complex regular expressions.

In addition to that, the special case of register_globals shows us that a deterministic SCA is only possible when the platform on which the software is to run is also factored in.
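One way to keep such variants manageable is to maintain them as a list of aliases per channel and combine them into a single pattern. The following Python sketch covers the PHP GET variants named above; the names GET_ALIASES and reads_get_parameter() are assumptions for illustration, and register_globals is deliberately left out because a plain $foo cannot be told apart from an ordinary variable by pattern matching alone.

```python
import re

# Known ways of reaching HTTP GET data in PHP, one regex per alias.
GET_ALIASES = [
    r"\$_GET\b",
    r"\$_REQUEST\b",
    r"\$_SERVER\s*\[\s*['\"]QUERY_STRING['\"]\s*\]",
    r"\$HTTP_GET_VARS\b",
]
GET_PATTERN = re.compile("|".join(GET_ALIASES))

def reads_get_parameter(line):
    """True if the line touches GET data through any known alias."""
    return GET_PATTERN.search(line) is not None

print(reads_get_parameter("$q = $_SERVER['QUERY_STRING'];"))
print(reads_get_parameter("$p = $_POST['foo'];"))
```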

Entry Points Enable Manipulation

The identified entry points will have to be factored into the further progress of the SCA. In most cases, they define the starting point of an attack, because it is the entry points that make conscious manipulation possible. The Tractatus Logico-Philosophicus Instrumentum Computatorium states this in Derivation 1.1.1:

Security is a state. It allows a well-defined number of actions. An action is defined as a task that has been completed.

Subsequently, the entry point must be judged by its exposure and its possibilities. An HTTP GET parameter is, for example, very simple and can be accessed by anyone who is able to interact with a web application. It offers a simple and uncomplicated way of influencing the behaviour of a web application. The following table seeks to classify the entry points based on a generic base score. Actual ratings can diverge from the base rating depending on the technology and mechanisms in use (e.g. complexity of the data points) and are to be refined per product and customer.

Entry point	Exposure	Simplicity
Arguments	medium-high	high
Environment Variables	low	high
Files	medium-high	low-high
Database	low-high	low-high
HTTP File Uploads	medium-high	high
HTTP GET Parameter	high	medium-high
HTTP POST Parameter	medium-high	medium-high
HTTP Cookies	medium	medium-high
HTTP Sessions	low	low

If an application is based on environment variables, there are other prerequisites that must be fulfilled in order to manipulate the application. Environment variables can usually only be adjusted by means of local access and require a corresponding interactive login on the target system.

Entry points such as environment variables can thus only be manipulated indirectly. The same applies when the output of database queries is taken into account. This is typically the case with persistent cross site scripting. In such a case, the payload needs to be entered into the database beforehand so that it can be read and output in a subsequent step. The attack therefore goes through two phases and is more difficult to trace in the context of a SCA.
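The two phases can be illustrated with a small simulation. This is a sketch in Python with an in-memory SQLite database, not the code of any real application; the table comments and the functions store_comment() and render_comments() are hypothetical.

```python
import html
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE comments (body TEXT)")

def store_comment(body):
    """Phase 1: the payload enters through an ordinary entry point."""
    db.execute("INSERT INTO comments (body) VALUES (?)", (body,))

def render_comments(escape):
    """Phase 2: the database query is now the entry point for the output."""
    rows = db.execute("SELECT body FROM comments").fetchall()
    if escape:
        return "".join("<p>%s</p>" % html.escape(body) for (body,) in rows)
    return "".join("<p>%s</p>" % body for (body,) in rows)

store_comment("<script>alert('xss');</script>")
print(render_comments(escape=False))  # payload reaches the page unchanged
print(render_comments(escape=True))   # payload is encoded before output
```

Note that the parameterized INSERT prevents SQL injection but does nothing against the stored payload. That is why the database has to be treated as an entry point of its own.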

Exit Points

Normally, software takes in input, processes it and outputs the result. In the second step of the analysis it is of great importance to understand which exit points the application provides. The process of finding them is similar to the search for entry points: here, too, there are language-dependent constructs that can be found and analysed.

Exit Point	PHP	ASP	JSP	Ruby
Output	echo, print, printf(), fprintf(), sprintf(), vprintf(), print_r()	Response.Write()	out.println(), out.print()	puts(), print(), printf(), putc(), ios.write(), ios.puts()
Write File	fwrite(), file_put_contents()	file.write(), file.WriteLine()	InitialContext().bind()	target.write()
MySQL Query	mysqli_query(), mysqli_multi_query(), mysqli_real_query(), mysqli_send_query(), mysqli_stmt_execute()	oConn.Execute()	executeQuery(), executeUpdate(), execute()	Model.find(), Model.take(), Model.first(), Model.last(), Model.find_by()
Environment Variable	putenv()	–	System.setenv(), System.setProperty()	ENV

Entry Points Show Vulnerabilities

Vulnerabilities in software only show themselves when a manipulation introduced at an entry point is carried through processing up to and including the exit point. This results in the classic attack classes:

Buffer Overflow
Format String
Directory Traversal
Cross Site Scripting (XSS)
SQL Injection (SQLi)
etc.

Example of a Cross Site Scripting Attack

Let us take a look at a cross site scripting vulnerability as an example. During an XSS attack, the attacker's goal is to alter the output of the application, usually substantially. By injecting HTML and JavaScript elements, the attacker aims to indirectly attack the user.

In order for the attacker to deliberately execute said attack, there must be an entry point through which the payload can be injected into the application. Traditionally, GET parameters are used for this. An attack can, for example, look like this:

http://example.com/?foo=<script>alert('xss');</script>

If there is a vulnerability and calling this URL injects the script, then it is safe to assume that the code producing the output looks something like this:

echo $_GET['foo'];

Forward and Backward Slicing

In order to find most vulnerabilities, Forward Slicing as well as Backward Slicing can be used. You pick a point at one end of the data flow and follow the data to the corresponding point at the other end.

Forward: Entry Point ⇒ Exit Point
Backward: Exit Point ⇒ Entry Point

Naturally, applying and comprehending forward slicing is a lot easier because code is more easily read in a sequential manner.

Various classes of vulnerabilities are more easily identifiable when using Backward Slicing. This only leaves the question of whether, and which, variables can be influenced in order to specifically exploit the application.
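Both directions can be sketched on a toy data flow. The following Python example is a strong simplification, assuming straight-line code in which every variable is assigned exactly once; the statement list and both function names are hypothetical and only meant to show the direction of travel.

```python
# Each statement is (assigned variable, variables it is built from).
STATEMENTS = [
    ("name", ["$_GET"]),            # $name  = $_GET['name'];
    ("title", []),                  # $title = 'Welcome';
    ("greet", ["name"]),            # $greet = 'Hello ' . $name;
    ("page", ["title", "greet"]),   # $page  = $title . $greet; echo $page;
]

def forward_slice(statements, source):
    """Walk from an entry point forward to everything it influences."""
    tainted = {source}
    for target, sources in statements:
        if tainted.intersection(sources):
            tainted.add(target)
    return tainted

def backward_slice(statements, sink):
    """Walk from an exit point back to everything that influences it."""
    wanted = {sink}
    for target, sources in reversed(statements):
        if target in wanted:
            wanted.update(sources)
    return wanted

print(sorted(forward_slice(STATEMENTS, "$_GET")))
print(sorted(backward_slice(STATEMENTS, "page")))
```

The forward slice leaves out title because it is never influenced by the entry point, while the backward slice includes it because it contributes to the output.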

Functions and Methods

Up until now entry and exit points have only been looked at in the context of software as a whole. Modern programming languages offer the possibility of dividing code into different routines. In higher-level programming languages, these are mainly functions and methods. In object-oriented programming there are classes in addition to functions and methods. Here, you have to observe and respect the paradigms that the language uses.

An entry point is therefore not only the argument that a user gives the software when calling the program. An internal entry point is also used when a function or a method is called with an argument. The big advantage of internal entry points over external ones is that they are a lot easier to identify because they follow fixed rules. Defining a function in ANSI C always looks the same. For example:

int foo(int param1, int param2){
   //some code
}

Here, the function foo() is defined. The prefixed int indicates that it returns an integer. In the parentheses, the function expects two parameters which are, in turn, also declared as integers. The parameters are referenced internally (locally) as param1 and param2.

Exit points in functions are also defined depending on the language. The ANSI C example above illustrates the type safety of the language: it is the declaration of the function that indicates which data type is returned. Other languages such as PHP do not require this.

A function is by definition meant to return something. In most languages, this happens by means of a statement named return (e.g. in PHP).

Most programming languages subject their subroutines to a scope. This defines the area of visibility for objects. Without further intervention, functions are thus not able to access variables outside their scope. In PHP, global variables can be tied in using global $foo. This makes direct manipulation of the global variable possible, which creates a new entry point within the function that is not attached to the arguments of the function call.
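The effect of such a hidden entry point can be shown in a few lines. The sketch below uses Python instead of PHP, so the scoping rules differ slightly: Python reads module-level names without any declaration and only needs the global statement for writing, whereas PHP needs global $foo for any access. The names greeting, greet() and set_greeting() are hypothetical.

```python
greeting = "Hello"  # module-level (global) state

def greet(name):
    """The signature shows one entry point, but the result also depends
    on the global variable 'greeting'."""
    return "%s, %s" % (greeting, name)

def set_greeting(value):
    """Any code path that reaches this function feeds greet() indirectly."""
    global greeting  # comparable in spirit to 'global $foo' in PHP
    greeting = value

print(greet("Bob"))
```

When slicing backwards from the output of greet(), both the argument and every writer of the global variable have to be followed.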

Logical Sequencing

Up until now we have covered entry and exit points. We have made a short digression into subroutines, but we have not touched upon the logical sequence of events in software.

Software rarely runs in a strictly linear fashion. Instead, there are decisions built in that influence the further progression of processing. For example, output might only be produced when the input is equal to the number 1:

if($_GET['foo'] == 1){
   echo $_GET['foo'];
}

A vulnerability manifests itself only when the logical sequence of events permits it. The earlier example of an XSS vulnerability requires that the data flows from input to output without any intervention that would be detrimental to the attack. If there had been a htmlentities($foo) somewhere along the way, then there would not have been a vulnerability.

$foo_sanitized = htmlentities($foo);
echo $foo_sanitized;
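Both situations, a branch that restricts which values reach the exit point and an intervention on the way to it, can be mirrored in a short Python sketch. It is an illustration only: the function names are hypothetical, html.escape() stands in for htmlentities(), and Python compares the strings strictly, whereas the == in the PHP fragment above compares loosely.

```python
import html

def page(foo):
    """Output only happens on one branch of the decision."""
    if foo == "1":
        return foo   # exit point reached, but only with the value "1"
    return ""        # every other input never reaches the exit point

def page_sanitized(foo):
    """Variant with an intervention between entry and exit point."""
    return html.escape(foo)

payload = "<script>alert('xss');</script>"
print(repr(page(payload)))
print(page_sanitized(payload))
```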

Logical Errors

Apart from the fact that a logical sequence can contain a vulnerability, it is entirely possible for the logic itself to create one. A great many vulnerabilities arise this way. It is impossible to give a complete list of them seeing as they are based on individual properties of a wide range of software. The following list only serves to illustrate some possible occurrences of logical errors.

Example 1: Assignment instead of Comparison

A classic mistake is mainly one of negligence. In most higher-level programming languages, a variable is assigned a value by means of a single equals sign: $foo = 'bar'; whereas a comparison within an expression is made by using two equals signs: if($foo == 'bar').

Due to negligence while writing code, people sometimes write a single equals sign instead of two:

if($foo = 'bar'){
   //is always executed
}else{
   //is never executed
}

The result is that the content of the variable is overwritten in what the developer intended to be a comparison. The expression evaluates to the assigned value; since a non-empty string such as 'bar' counts as TRUE, the first branch is always taken. An elseif or an else will never even be considered for execution.

Vulnerabilities like this are easily recognized by a pattern search. Recognizing these errors only becomes more difficult in languages that use a single equals sign for both assignment and comparison. This includes languages of the Visual Basic family:

VB6
VB.NET
VBA
VBScript
ASP
ASP.NET
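For languages that do distinguish between = and ==, the pattern search mentioned above can be sketched as follows. This Python example is a heuristic under stated assumptions: it only catches the simplest form, in which a plain PHP variable directly follows the opening parenthesis of an if or elseif, and the names ASSIGN_IN_CONDITION and suspicious_conditions() are hypothetical.

```python
import re

# 'if(' or 'elseif(' followed by a variable and a single '=' that is not
# the start of '==' or '==='. Operators such as '!=' or '<=' do not match
# because the '=' has to follow the variable name directly.
ASSIGN_IN_CONDITION = re.compile(r"\b(?:else\s*)?if\s*\(\s*\$\w+\s*=(?!=)")

def suspicious_conditions(source):
    """Return (line number, code) for every condition that assigns."""
    return [
        (number, line.strip())
        for number, line in enumerate(source.splitlines(), start=1)
        if ASSIGN_IN_CONDITION.search(line)
    ]

SAMPLE = """if($foo = 'bar'){
}elseif($foo == 'baz'){
}elseif($foo != 'qux'){
}"""

print(suspicious_conditions(SAMPLE))
```

Findings of this kind still need manual review, because an assignment inside a condition is occasionally intended.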

Example 2: Dead Code

Another classic mistake can be found in Dead Code: parts of code that are never executed. This happens when the logical sequence of events makes it impossible to reach these parts because the prerequisites to reach them are never fulfilled. Optimizing compilers can eliminate Dead Code automatically. There are, however, cases in which Dead Code cannot be identified by deterministic means or the development environment does not offer such optimisations. The following fragment of code checks the value of foo:

if($foo > 0){
   //can be executed
}elseif($foo > 10){
   //can never be executed
}else{
   //can be executed
}

The elseif branch can never be executed because every value greater than 10 is also greater than 0 and is therefore already handled by the first branch. This can lead to strange behaviour during execution because situations that are expected to occur never actually do.
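A quick way to confirm such a suspicion is to run the decision logic over a range of inputs and record which branches are reached. The following Python sketch mirrors the fragment above; branch_fixed() is only an assumption about what the developer may have intended, namely checking the narrower condition first.

```python
def branch(foo):
    """Same ordering of conditions as the fragment above."""
    if foo > 0:
        return "first"
    elif foo > 10:
        return "second"   # unreachable: every value > 10 is also > 0
    else:
        return "else"

def branch_fixed(foo):
    """Narrower condition first, so that every branch is reachable."""
    if foo > 10:
        return "second"
    elif foo > 0:
        return "first"
    else:
        return "else"

reached = {branch(value) for value in range(-100, 101)}
reached_fixed = {branch_fixed(value) for value in range(-100, 101)}
print(sorted(reached))
print(sorted(reached_fixed))
```

Sampling inputs like this can only show that a branch is reachable. That a branch is truly dead still has to be argued from the conditions themselves.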

Example 3: Type Unsafe Checks

Another problem arises in languages that are not type safe. Once more, we will get back to our example of XSS based on PHP. A validation function could use the following check in order to recognize angle brackets. The check uses the function strpos(), which returns the position of the bracket if it is found and FALSE otherwise. It is assumed that the first condition is true whenever there is a finding. This works as long as the searched-for character is not at the beginning of the string. If it is, the returned value is 0. Due to PHP's loose typing, 0 is evaluated as FALSE, which leads to the else-code being executed.

if(strpos($foo, '<')){
   //only if in position >0
}else{
   //even if in position 0
}

The only way to counteract this is to implement a type safe check using === or !== respectively. A similar effect occurs when a function has no return value: by default PHP returns NULL, which evaluates to FALSE in a type unsafe check. This kind of vulnerability is dependent on the paradigms of the language it is based on.

A similar effect is created in JavaScript. The operation "5" - 2 results in 3 because the minus sign is interpreted as a mathematical operator and the string is converted to a number. However, the operation "5" + 2 produces the string "52" and not 7. This is because the plus sign, given a string operand, converts both operands to strings and concatenates them.
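The pitfall is not limited to PHP and JavaScript. As an analogous illustration in Python (not a translation of the PHP code), str.find() returns the position of a match and -1 if there is none. Used directly as a truth value, a match at position 0 counts as false and a miss counts as true. The function names are hypothetical.

```python
def contains_bracket_buggy(value):
    # Pitfall: find() returns 0 for a match at the start (falsy)
    # and -1 for no match (truthy), so the truth value is misleading.
    return bool(value.find("<"))

def contains_bracket(value):
    # Explicit comparison, comparable to '!== false' in PHP.
    return value.find("<") != -1

for sample in ("<script>", "a<b", "abc"):
    print(sample, contains_bracket_buggy(sample), contains_bracket(sample))
```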

The Order of Business Becomes Important

In order to recognize logical errors you will need to understand what the application does. This understanding starts within the application itself, but a complete understanding is often only possible if the underlying business principles are understood as well. For example, certain checks within a stock trading system can only be assessed if you understand the importance of margins and their connection to sell-outs. A good SCA focuses not only on the lines of code but also on how the code is applied in the company itself.

Data Processing

Apart from the logical sequence of events, the functionality of software is also influenced by the data that is available and used. The data held in variables is processed by various functions. These can be built-in functions offered by the programming language itself, or custom functions provided by external components such as libraries or APIs. Depending on how a variable is processed, a vulnerability might or might not arise.

Constructed Records

Strings can be processed by various generic functions that can change the behaviour of software quite a lot. Using substr() or explode() you can construct partial strings. Applying explode('<', $foo) to the input, for example, makes a successful or at least coordinated XSS attack far less likely.

Advanced Functions

Similarly, custom-developed functions can turn out to be important when it comes to validating and securing entries. Therefore, it is important to look for hints of validation and sanitization when looking at the code. These functions need to be analysed in great detail, because an error in them can itself give rise to a vulnerability.

Modern languages offer built-in support for defending against typical attacks. People like to rely on these functions because they have proven to be solid.

Attack	PHP	ASP	JSP
Cross Site Scripting	htmlentities(), htmlspecialchars()	Server.HTMLEncode()	escapeHtml()
Directory Traversal	basename(), realpath()	–	–
SQL Injection	mysql_real_escape_string(), mysqli_real_escape_string(), sqlite_escape_string(), addslashes(), PDO::quote()	–	–

Assertions

Assertions work in a similar way in that they enforce expected prerequisites or recognize contradictions so that a specific error can be handled. The following example shows an assertion in ANSI C that can easily be implemented with glibc (C99). If the assertion turns out to be false, the program is aborted with an error message.

#include <assert.h>

int main(void){
   int x = 1;
   int y = x + 2;
   assert(y > 1); /* aborts the program if the condition is false */
   return 0;
}

Assertions are rather unpopular. They’re mainly used in highly professional projects during which the checking of data processing is of the utmost importance.

Accordingly, Program Slicing must be used to check whether data processing neglects security-critical functions, which could enable a successful attack.

Data Manipulation Shows Vulnerabilities

On the other hand, it is entirely possible that certain data manipulation is the reason for the existence of a vulnerability. Below, you will see code that reads two values into $var1 and $var2 from the corresponding HTTP GET parameters. Each value is truncated to 15 characters. On its own, that is not enough for a classic XSS payload.

$var1 = substr($_GET['foo'], 0, 15);
$var2 = substr($_GET['bar'], 0, 15);
echo $var1.$var2;

Because the two strings are concatenated when being output, the payload can be distributed across the two variables. Multiple factors must therefore come together for the error to occur.

http://example.com/?foo=<script>alert('&bar=xss');</script>

Vulnerabilities like this often occur where data sources are stacked or where they are being combined. This includes explicit functions like implode() and array_merge() (PHP has a lot of different possible sources of errors when it comes to array-functions). Also, simple String Concatenation can be the basis for an attack.
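That the split payload from the URL above fits the length limit exactly can be checked with a short Python sketch. The function render() is a hypothetical stand-in for the PHP fragment: it truncates each value to 15 characters and concatenates the results.

```python
MAX_LENGTH = 15

def render(foo, bar):
    """Truncate each value on its own, then concatenate for output."""
    return foo[:MAX_LENGTH] + bar[:MAX_LENGTH]

# The two halves of the payload as they appear in the URL above.
foo = "<script>alert('"
bar = "xss');</script>"

print(len(foo), len(bar))
print(render(foo, bar))

# Sent through a single parameter, the same payload is cut off.
print(render(foo + bar, ""))
```

Each half stays within the limit on its own, so a check that looks at the variables individually does not notice the attack.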

Analysis of Graphic Interfaces

Identifying entry points in procedural projects is relatively easy. It is considerably more difficult when the application provides a graphical user interface. The objects in it are often abstracted in code and require additional effort to understand in the context of how the code ultimately processes them. The properties of a graphical user interface can themselves give rise to vulnerabilities.

Erroneous Events

Objects in a graphical interface expose events that trigger an action. For most objects, for example, you can define the event Click. If a button named cmdButton is clicked, Visual Basic executes the following code:

Private Sub cmdButton_Click()
   MsgBox "foo"
End Sub

The term click implies that a mouse click is being performed. The event can, however, also be triggered by keyboard input: in Windows, this usually happens when the space bar is pressed while the object is selected, or the enter key when Default = Yes is activated. So even if, for example, the mouse pointer is deactivated, the user still has ways to trigger the event.

This behaviour has made many an application unsafe when it comes to restricting user entry. For example, you can define a textbox whose KeyPress event only allows numbers:

Private Sub txtTextbox_KeyPress(KeyAscii As Integer)
   Select Case KeyAscii
      Case vbKey0 To vbKey9
      Case vbKeyBack, vbKeyClear, vbKeyDelete
      Case vbKeyLeft, vbKeyRight, vbKeyUp, vbKeyDown, vbKeyTab
      Case Else
         KeyAscii = 0
         Beep
   End Select
End Sub

Now, if an arbitrary string is copied to the clipboard and then pasted by means of a right click, the KeyPress event is not triggered. Undesired data can be entered that way. Ideally, the data is validated before it is processed. This can be done with the event Validate or, at the latest, when LostFocus occurs.
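The underlying idea is to check the complete value at the time of processing instead of filtering individual keystrokes. A language-neutral sketch in Python could look as follows; the function name is_valid_number_field() is hypothetical and stands for whatever is called from a Validate or LostFocus handler.

```python
import re

DIGITS_ONLY = re.compile(r"[0-9]+")

def is_valid_number_field(value):
    """Check the complete value, regardless of how it got into the field
    (typed, pasted from the clipboard or set programmatically)."""
    return DIGITS_ONLY.fullmatch(value) is not None

print(is_valid_number_field("12345"))
print(is_valid_number_field("12<script>"))
```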

Parallelizing, Multi-Threading and Multi-Tasking

In addition to all that, applications with graphical interfaces often offer parallelization and run within a multi-tasking environment. There, serialized processing cannot be guaranteed. Therefore, problems similar to those known from multi-threading solutions can occur during runtime.

The following code sets the title of a form right after opening by means of Me.Caption and only then disables the button tlbMenu.Buttons.Item(3), which is intended for debugging, so that it cannot be used anymore.

Private Sub Form_Load()
   Me.Caption = "Foo 2.0"
   Screen.MousePointer = VbHourglass
   With tlbMenu.Buttons
      .Item(1).Enabled = True
      .Item(2).Enabled = True
      .Item(3).Enabled = False 'deactivate possibility to debug
   End With
   Screen.MousePointer = vbDefault
End Sub

From the user's subjective point of view, these steps occur at the same time. Users will not notice that the title is changed first and the buttons are then activated or deactivated sequentially. If the system is very busy, or the application has additional dependencies that can be manipulated to cause lag, problems can arise. If a user manages to click the button before .Item(3).Enabled = False is executed, the button's function is performed before it can be deactivated. You are therefore dealing with a possible race condition.

It is almost impossible to avoid these problems completely. However, you can try to enforce a strict serialized sequence of events by avoiding parallelized mechanisms and enforcing the completion of commands using DoEvents. The latter is a feature unique to the Visual Basic Family.

In addition to that, you can try to complete all tasks relevant to security first. Deactivate the button first and then adjust the title of the frame and activate the other buttons. Should there be a lag with this configuration, there will be some optical distractions, but the security-related integrity of the software is optimized.

Maximum protection is reached by hiding or locking unused objects as early as possible. Ideally, the debug-button is defined as locked when instantiation occurs. That way, it is not even necessary to lock it later on. Unlocking the button would require explicit unlocking when necessary. It is – basically – a Whitelist-Approach when it comes to the unlocking of objects.
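The whitelist approach can be sketched independently of Visual Basic. In the following Python example, the class Toolbar and its methods are hypothetical: every button starts locked at instantiation and has to be unlocked explicitly, so there is no window of time in which the debug button is usable by accident.

```python
class Toolbar:
    """Hypothetical toolbar in which every button starts locked."""

    def __init__(self, names):
        self._enabled = {name: False for name in names}

    def enable(self, name):
        self._enabled[name] = True   # explicit unlock (whitelist approach)

    def click(self, name):
        if not self._enabled[name]:
            return "ignored"
        return "executed " + name

toolbar = Toolbar(["open", "save", "debug"])
print(toolbar.click("debug"))   # locked from the start, nothing to race against
toolbar.enable("open")
toolbar.enable("save")
print(toolbar.click("open"))
```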

Documentation of Results

If weaknesses are discovered in source code, then they usually need to be documented. The documentation serves as a basis for the correction and addressing of the discoveries.

It is important that the affected parts of code are referenced as exactly as possible. Ideally, the following data points are used:

Data Point	Example
Software Name	Foo Forum 2.0
Filename	post.php
Function Name	createNewPost()
Code Lines	Lines 23-42

In addition to that, you can quote the entire code block directly in the report so that checking and referencing become even easier.

Further information concerning the weakness (classification, description, scenario of attack, example of exploit) adds quality to the report. Ideally, there is even a suggestion for a countermeasure. Even better, that countermeasure is expressed in functional code, because then you are presenting a solution that is as viable as it gets.
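If findings are collected in a structured way during the review, the report can be generated from them later. A minimal Python sketch of such a record follows; the class Finding and its field names are assumptions, the first four values are taken from the table above, and the classification and countermeasure are made-up sample values.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    software: str
    filename: str
    function: str
    first_line: int
    last_line: int
    classification: str = ""
    countermeasure: str = ""

    def reference(self):
        """Exact location of the affected code for the report."""
        return "%s, %s, %s, lines %d-%d" % (
            self.software, self.filename, self.function,
            self.first_line, self.last_line)

finding = Finding(
    "Foo Forum 2.0", "post.php", "createNewPost()", 23, 42,
    classification="Cross Site Scripting",               # sample value
    countermeasure="Encode output with htmlentities()",  # sample value
)
print(finding.reference())
```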

Summary

Source code analysis is an important alternative or supplement to other security checks. If the source code of an application is available, there is a chance of efficiently and reliably finding vulnerabilities in said application.

You should set out to find all entry and exit points in order to identify possibilities of manipulation and effects of such a manipulation. The logical sequence of events and the data processing have to be taken into account as well in order to be able to effectively detect attack vectors.

About the Author

Marc Ruef has been working in information security since the late 1990s. He is well-known for his many publications and books. His latest book, The Art of Penetration Testing, discusses security testing in detail. He is a lecturer at several institutions, such as ETH, HWZ, HSLU and IKF. (ORCID 0000-0002-1328-6357)
