Tomaso Vasella
How to master Data Leakage Prevention
Simply put, data leakage occurs when data leaves its intended habitat. In practice, this means that sensitive data such as confidential documents or customer information leaves system or organizational boundaries and is placed in an environment that does not offer the level of protection demanded by the data's content. In most cases, this also means that the data becomes accessible to unauthorized parties. Two important elements can be observed here:

- The content of the data determines the level of protection it requires.
- The leakage happens where data crosses system or organizational boundaries, i.e. at interfaces.
These two points play a central role in data leakage prevention: they determine what must be detected (the content) and where unwanted data leakage can occur and be prevented (system boundaries and interfaces).
The principle consists of detecting relevant data flows and reacting to them based on defined rules. Data flows are analyzed in terms of the exchanged content and the communication partners involved; depending on the rule, the flow is blocked, an alarm is generated, or the user is warned. For example, sending sensitive information via email, downloading a document to an untrusted end device, or copying a confidential document to a USB memory device might be prevented.
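To illustrate the principle, here is a minimal sketch of such a rule evaluation in Python. The rule set, the sensitivity detector, and the channel names are hypothetical and merely stand in for the far richer policies of a real DLP product:

```python
import re
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    ALLOW = "allow"
    WARN = "warn"    # let the flow pass, but warn the user
    ALERT = "alert"  # let the flow pass, but notify security
    BLOCK = "block"

@dataclass
class DataFlow:
    content: str      # the payload being transferred
    channel: str      # e.g. "email", "usb", "cloud_upload"
    destination: str  # recipient domain, device ID, URL, ...

# Hypothetical detector: does the payload look sensitive?
SENSITIVE = re.compile(r"\b(confidential|customer list|secret)\b", re.I)

# Assumed allowlist of trusted destinations.
TRUSTED_DESTINATIONS = {"partner.example.com"}

def evaluate(flow: DataFlow) -> Action:
    """Apply simple, ordered rules to a data flow."""
    if not SENSITIVE.search(flow.content):
        return Action.ALLOW
    if flow.channel == "usb":
        return Action.BLOCK  # never copy sensitive data to USB
    if flow.channel == "cloud_upload" and flow.destination not in TRUSTED_DESTINATIONS:
        return Action.BLOCK
    if flow.channel == "email":
        return Action.WARN   # a business need is plausible: warn, don't block
    return Action.ALERT

print(evaluate(DataFlow("confidential report", "usb", "device-42")))  # Action.BLOCK
```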
The idea is simple in concept but can be enormously complex in practical implementations. On the one hand, there is a huge number of data formats that must be read and correctly interpreted; on the other hand, it is often very difficult to automatically distinguish between data flows that are permissible and necessary from a business perspective and those that are not.
An additional factor adds to this complexity: the widespread use of cloud services blurs system and organizational boundaries and leads to an increasing separation between owning data and being able to exercise technical control over it. Concepts such as Zero Trust can be helpful here.
DLP rule definitions often make use of so-called regular expressions, which are machine-readable descriptions of text patterns. They make it quite easy to describe things like an ISIN (International Securities Identification Number) based on its format: two letters for the country code, followed by nine alphanumeric characters and a check digit. The corresponding regex might look like this:
```
[A-Z]{2}[A-Z0-9]{9}[0-9]{1}
```
But what happens if the ISIN appears in a document with one or more spaces in it? Or if a character string matches the pattern, but its first two letters are not a valid country code at all? Then a regular expression containing all valid country codes would have to be used – that's over 250! What to do if the relevant pattern is not text but is contained in an image? Should one use optical character recognition? How to deal with container formats? Simply renaming a Word document that is blocked by DLP from .docx to .zip is sometimes enough to bypass detection. And so on. With increasing technical effort, more such cases can be handled. However, this quickly becomes confusing and can lead to very resource-intensive data analysis, yet a hundred percent hit rate will never be achieved.
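To make this concrete, here is a sketch of a somewhat more robust ISIN detector: it tolerates whitespace, checks the country prefix against an allowlist (deliberately abbreviated here; a real rule would need all valid codes), and verifies the check digit using the Luhn algorithm specified for ISINs:

```python
import re
import string

# Tolerate optional whitespace inside the ISIN (e.g. "US 0378331005").
ISIN_RE = re.compile(r"([A-Z]{2})\s*([A-Z0-9]{9})\s*([0-9])")

# Deliberately incomplete allowlist for illustration; a real rule would
# need all ~250 ISO 3166 codes plus special prefixes such as XS.
COUNTRY_CODES = {"US", "DE", "CH", "GB", "FR", "XS"}

def luhn_ok(digits: str) -> bool:
    """Standard Luhn check over a string of digits."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def is_valid_isin(candidate: str) -> bool:
    m = ISIN_RE.fullmatch(candidate.strip())
    if not m or m.group(1) not in COUNTRY_CODES:
        return False
    isin = "".join(m.groups())
    # Convert letters to numbers (A=10 ... Z=35) before the Luhn check.
    expanded = "".join(
        str(string.ascii_uppercase.index(c) + 10) if c.isalpha() else c
        for c in isin
    )
    return luhn_ok(expanded)

print(is_valid_isin("US0378331005"))   # True (Apple Inc.)
print(is_valid_isin("US 0378331005"))  # True, despite the space
print(is_valid_isin("ZZ0378331005"))   # False: ZZ is not in the allowlist
```

Even this small step up from the bare regex requires a country-code list, a check-digit routine, and normalization of whitespace – and it still says nothing about images or container formats.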
The next level of challenges involves data that belongs together in terms of content but is unstructured from a machine perspective. Using rules to identify which combination of data elements is harmful in a data leak and which flows of unstructured data are permissible from a business perspective can be very difficult. The increasing collection of data (think Big Data) and the proliferation of NoSQL databases pose particular challenges to traditional DLP methods.
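A common heuristic for such combinations is a proximity rule: each data element is tolerated on its own, but two related elements appearing close together are flagged. A minimal sketch, with crude, assumed patterns purely for illustration:

```python
import re

NAME = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")               # crude person-name pattern
CARD = re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b")   # crude card-number pattern

def flag_combination(text: str, window: int = 100) -> bool:
    """Flag text if a name and a card number occur within `window` characters."""
    names = [m.start() for m in NAME.finditer(text)]
    cards = [m.start() for m in CARD.finditer(text)]
    return any(abs(n - c) <= window for n in names for c in cards)

print(flag_combination("Invoice 4111 1111 1111 1111"))         # False: number alone
print(flag_combination("Jane Doe, card 4111 1111 1111 1111"))  # True: combination
```

Even this toy rule shows the dilemma: widen the window or loosen the patterns and false positives soar; tighten them and real leaks slip through.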
Similar observations can be made concerning the flow of data: sending confidential documents via unencrypted email is clearly undesirable. But what if confidential documents are uploaded to a business partner's cloud application? Should that cloud application be trusted? And if so, how does the DLP solution know that this cloud application is trustworthy but others are not? And how do you know that the data does not subsequently leak out of the cloud application, where it is completely outside your own control?
Another challenge is posed by the strong increase in location-independent working, working from home, and BYOD concepts. It is usually no longer possible to analyze the relevant data flows at a central location: they often take place directly between the endpoints and the various cloud applications, requiring measures on the end device itself to control unwanted data flows.
These circumstances typically lead to a high number of false positives. In other words, it is almost impossible to create precise rules for blocking data outflows without impacting business processes. As a result, a logging-only mode is often chosen in practice and automated blocking is frequently omitted. These considerations give an idea of how well the automatic detection and protection of sensitive data often touted by DLP products really work in practice. Machine learning and algorithms for recognizing data content, behavior patterns and anomalies can help, but they cannot solve these problems entirely.
In view of the above, one might conclude that data leakage cannot be prevented at all. To a certain extent this is true, because it will never be possible to fully protect against a sufficiently motivated adversary: someone with access to data will always be able to leak it somehow, even if only via a screenshot. However, there are several security measures that can reduce these risks to an acceptable level. None of them are new:

- An inventory and classification of the data worth protecting
- Solid baseline security measures
- Appropriate, well-established processes
- Powerful security monitoring
The risk of data leakage is real and major damage can result. DLP solutions can help, but they cannot make up for poor processes or substantial gaps in the security posture. The measures listed above are well known and commonly recommended, and they are effective in helping to prevent data leaks. With the increasing use of cloud applications, monitoring and controlling endpoints – and generally all those elements that can still be controlled – is more important than ever. Should an emergency occur, a communication strategy and predefined procedures are needed in order to react quickly and effectively. That might also be a good occasion to think again about data economy.