Password Leak Analysis
With the recent advent of digital assistants, people invite into their homes not only a whole host of privacy concerns but the successful result of decades-long research into artificial intelligence. Of course, the term artificial intelligence amounts to hardly more than a diffuse description of the underlying concepts, and its definition has changed constantly over the past two decades in computer science. Still, many a person would agree that the likes of Alexa and Cortana constitute the closest humanity has come to conceiving of artificial intellects.
These devices are a long shot away from being anywhere close to human-like intelligence, but they are already capable of processing a few spoken and written languages, distinguishing faces, and adequately responding to commands and questions. But how do they manage to understand spoken language, let alone distinguish persons from one another?
In its most rudimentary sense of the term, Artificial Intelligence refers to the fact that computers are able to learn a certain set of skills without developers explicitly having to program them. This means in turn that these devices commonly need to be trained before they are accurate enough to be productive.
Computer science techniques that allow devices to learn belong to the diverse category of regression algorithms, or in buzzword-lingo, machine learning and big data. In essence, learning algorithms try to build a mathematical function that approximates the problem at hand, and their memory consists of a set of variables that are continually adapted to the training data in a feedback loop.
In order to perform their actions, digital assistants always use a combination of learning algorithms to transform and classify data or to perform predictions. For example, they would first use a recurrent neural network to translate spoken language from sound to the equivalent textual representation, then an expert system based on a Bayesian inference engine or a random forests algorithm to extract semantic meaning from the aforementioned textual representation in order to interpret the data as command or question.
Since ca. 2010, so-called artificial neural networks have surged in popularity over other regression algorithms due to their astounding capability of automatically extracting relevant characteristics from source data, which means that developers hardly need to characterize a problem before they can formulate a suitable learning algorithm. This means in turn, that the internal logic developed by artificial neural networks is often impossible to comprehend or reproduce. Moreover, artificial neural networks can require much longer learning phases and they are prone to bias and overfitting unless the training data is scrutinized for confounding factors. The effect of bias and overfitting in learning algorithms is humorously illustrated by the infamous tank detection experiment, but it can have deleterious effects in many real-life applications where learning algorithms replicate human bias and stereotype discrimination due to implicit bias present within the corpus of training data.
In order to understand how learning algorithms work in principle, we have to work our way through something less complex than spoken language or semantic analysis. Instead, we will go through the process of building a learning system that is capable of distinguishing SPAM from legitimate email.
At the most superficial level, a learning algorithm will take a corpus of complete email messages as input, and produce as output whether it thinks that a particular message is SPAM or legitimate. This problem belongs to the group of binary classification problems because the final outcome is only one of two discrete things; SPAM or HAM. The input data consists of a certain number of sets of characters of varying length.
In mathematical terms, the input
x is defined as an
N x F matrix, where
N is the number of email messages and
F is the number of input characteristics or features. The output, in turn, is defined as a vector
y of 0s and 1s with
N dimensions. The relationship between
y is defined as the function
f, which will depend on the actual choice of algorithm. Thus, our algorithm
f will effectively reduce
F dimensions of source data into a single dimension in the output.
But there’s a problem: Our source data isn’t an
N x F matrix of real numbers but a series of email messages. In order to get there, we need to perform so-called feature extraction, the process in which we reduce text messages of varying length into a series of equal-length numbers or features. There is no one true way to do this, and understandably, the final performance of a learning algorithm will strongly depend on how we represent the source data as a set of features.
One way to do this in our example would be to count the occurrence of each symbol (e.g.
<) which, if we assume UTF-8 format, will result in
F = 256. Another would be to count the frequency of particular n-grams (the more general definition of words). The final number of features will thus equal the total number of unique n-grams within our source data set. Given experience, it is immediately obvious that it makes more sense to count the n-gram frequency rather than the symbol frequency.
But we also need to account for n-grams that are so common that they do not contain any relevant information (words like “the” or “a”). If we do not take those out of our feature space, the algorithm is more prone to underfitting, and will also be less efficient due to the (usually) high number of features involved.
In the end, we want to produce those features that best explain the variance in the source data set, statistically speaking. This process is what is called feature selection:
Given a set of features, transform the feature data such that you have the least amount of features that explain as much of the problem domain as possible.
This is where artificial neural networks shine because they are capable of performing the feature selection process automatically (assuming that the network is suited for the problem). Going back to our SPAM example, we would probably want to filter out the most common n-grams using Term Frequency – Inverse Document Frequency (Tf-Idf) weighting. This statistical technique prioritizes words that are frequent in a single document but decreases the importance of words that are common across all documents, such as the or a in English.
Given our problem, it is best that we evaluate multiple algorithms, or models, for performance, because we don’t really know the shape of the multi-dimensional curve that separates SPAM from legitimate email. The relationship between the frequency of each n-gram and the legitimacy of the email could be linear, it could be polynomial, or it could be a much more complex, non-linear relationship. For example, it could be that the more complex HTML code an email contains, the more likely it is to be SPAM. But it could just as well be that once the terms review or draft appear in the same email, then it’s more likely to be from the marketing division of your company. Given our experience, we would probably catch Viagra SPAM with a linear model, but more involved or targeted SPAM will require a non-linear model. The choice of models will further involve statistical considerations and whether we know which training emails are SPAM, and which are not (i.e. whether we are working with labeled data or not).
In our example, we would likely work with labeled training data, therefore we would select from supervised learning algorithms. Common supervised models are:
The latter is actually a class of algorithms more commonly referred to as artificial neural networks because they were originally designed to model interconnected neurons analogous to the brains of vertebrates.
Each learning model has a set of parameters that are changed either by us prior to training or as part of the training process itself. Any pre-processing steps such as feature selection will also have parameters that influence the final performance of the model.
Given this fact, we might also want to optimize the choice of parameters set prior to training, so-called hyperparameters instead of just finding the best model. Therefore, a model is evaluated using either three separate data sets (for training, validation, and testing), or cross-validation followed by a testing phase on a separate data set.
The principle is the same in both cases, however: In a loop, train the model with a set of hyperparameters, and evaluate the current model using the validation set (or via cross-validation), before modifying the hyperparameters. After the best-performing hyperparameters have been found, evaluate that model against the final testing set. Results from this final evaluation are then used to compare the different algorithms against each other.
The performance of a model with respect to the underlying problem is assessed through multiple metrics based on the confusion matrix as shown below. The most commonly used metrics are:
As with any metric, the reliability of the evaluation depends on the distribution of the classes (e.g. SPAM or HAM) within the data set, and no one metric will perfectly capture the model performance.
Specific to our SPAM example is, that it is much worse for the SPAM filter to falsely classify legitimate email as SPAM. Therefore, any algorithm that will have a high false positive rate will be out of the game quick.
|Confusion Matrix||Condition True||Condition Negative|
|Predicted True||True positive||False positive|
|Predicted False||False negative||True negative|
Once the model is complete and training has concluded, the algorithm will tend to be surprisingly small and require only limited resources to run. Even old smart phones would be capable of running a decent-sized neural network once the training stage has completed. However, in the case of spoken language, our learning algorithm will likely have to continue learning in order to be able to deal with the ever changing nature of human language (or the economy).
Therefore, all of the aforementioned processes will need to be run periodically or continually while the learning algorithm is in a production environment. This requires vast amounts of computer resources to maintain, and it’s impossible to fit this all within a smart phone or even a dedicated consumer device. Unfortunately, this also means that learning algorithms effectively enforce some sort of cloud-based architecture, and privacy issues are a direct consequence of how the learning algorithm is kept up-to-date.
Our experts will get in contact with you!
Our experts will get in contact with you!
Further articles available here