Machine Learning in Cybersecurity, Part 1: Core Concepts and Examples
Spam detection, facial recognition, market segmentation, social network analysis, personalized product recommendations, self-driving cars: applications of machine learning (ML) are everywhere around us. Many security companies are adopting it as well, to solve security problems such as intrusion detection, malware analysis, and vulnerability prioritization. In fact, terms such as machine learning, artificial intelligence, and deep learning get thrown around so much these days that you may be tempted to dismiss them as hype.
To provide a framework for thinking about machine learning and security, last week’s webinar on Machine Learning in Cybersecurity covered the core concepts of machine learning, the types of machine learning algorithms, and the flow of a typical machine learning process. It then went into detail on three security problems that are being successfully tackled with machine learning (spam detection, intrusion detection, and vulnerability prioritization) and closed with common challenges encountered when designing machine learning systems, adversarial machine learning in particular. This post summarizes the first part of that webinar, leaving out the three applications and the challenges. We will cover those in Part 2 of this blog series, so stay tuned for next week’s post.
For more details, the entire webinar is now available on demand here. For even more depth, and implementation examples in Python, we highly recommend the book Machine Learning and Security by Clarence Chio and David Freeman; many of the examples we used in the webinar and in this post are discussed in more detail there.
Definitions: Artificial Intelligence, Machine Learning, Deep Learning
Artificial Intelligence (AI) is a loosely defined term that refers to algorithmic solutions to complex problems that are traditionally solved by humans. Sometimes there is an added requirement that these algorithms achieve near-human-level intelligence in order to be considered AI, although what counts as near-human-level intelligence is often left open to interpretation. AI includes using logic, rules, machine learning, and other techniques to solve complex problems.
Machine Learning (ML) is a subset and a core building block of AI. The formal definition, due to Tom M. Mitchell, reads: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.” In other words, ML refers to algorithms that learn by generalizing from past data and experience in order to predict future outcomes, and that get better at it as they see more data.
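To make Mitchell's E, T, and P concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for illustration: the task T is telling apart two synthetic one-dimensional classes, the experience E is a set of labelled samples, the performance measure P is accuracy on held-out data, and the "learner" simply places a threshold halfway between the two class means.

```python
import random

def sample(n_per_class, rng):
    """Draw labelled 1-D points: class 0 ~ N(0, 1), class 1 ~ N(2, 1)."""
    data = [(rng.gauss(0.0, 1.0), 0) for _ in range(n_per_class)]
    data += [(rng.gauss(2.0, 1.0), 1) for _ in range(n_per_class)]
    return data

def learn_threshold(experience):
    """Experience E in, model out: the midpoint between the two class means."""
    xs0 = [x for x, y in experience if y == 0]
    xs1 = [x for x, y in experience if y == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2.0

def accuracy(threshold, test_set):
    """Performance measure P: fraction of held-out points classified correctly."""
    correct = sum(1 for x, y in test_set if (x > threshold) == (y == 1))
    return correct / len(test_set)

rng = random.Random(0)
test_set = sample(2500, rng)  # held-out data for task T

results = {}
for n in (1, 10, 1000):
    # Average over 20 independent training sets to smooth out lucky draws
    scores = [accuracy(learn_threshold(sample(n, rng)), test_set) for _ in range(20)]
    results[n] = sum(scores) / len(scores)
    print(f"{n:>5} examples per class -> mean accuracy {results[n]:.3f}")
```

With more experience the learned threshold settles near 1.0, and accuracy should approach roughly 0.84, which is the best a single threshold can do on these two overlapping distributions.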
Deep Learning is a strict subset of ML referring to a specific class of algorithms – multilayered neural networks.
Types of Machine Learning Algorithms
There are two main types of ML algorithms: supervised and unsupervised.
Supervised learning refers to problems where you have labels or outcomes for past data and you want to predict labels for future data. There are two subtypes of supervised learning: classification, where the task is to predict discrete categories, and regression, where the task is to predict continuous numerical values.
On the other hand, in unsupervised learning you draw abstractions from unlabeled data and apply those abstractions to new data. A common example of unsupervised learning is clustering.
Let’s say you manage a large online video rental platform with many subscribers. You may find all three kinds of problems (classification, regression, and clustering) relevant. For example,
- Supervised classification: you look at user characteristics and their past behavior in order to predict whether they will cancel their subscription in the next month or not.
- Supervised regression: you would like to predict the amount of time they will spend on your platform or the number of videos they will rent in some time period.
- Unsupervised: you would like to group your users by their similarities in order to improve your marketing strategy.
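The three cases above can be sketched in a few lines with scikit-learn. This is an illustration under stated assumptions, not a recipe: the features (months subscribed, rentals last month), the synthetic targets, and the choice of logistic regression, linear regression, and k-means are all invented for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
n_users = 500

# Hypothetical per-user features: months subscribed, videos rented last month
months = rng.integers(1, 36, size=n_users)
rentals = rng.poisson(4, size=n_users)
X = np.column_stack([months, rentals]).astype(float)

# Synthetic targets, invented for this sketch
churned = (rentals + rng.normal(0, 1, n_users) < 3).astype(int)  # discrete label
hours_watched = 1.5 * rentals + rng.normal(0, 1, n_users)        # continuous value

# Supervised classification: will the user cancel?
churn_model = LogisticRegression(max_iter=1000).fit(X, churned)
# Supervised regression: how many hours will the user watch?
hours_model = LinearRegression().fit(X, hours_watched)
# Unsupervised clustering: group similar users; note that no labels are passed in
segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

new_user = np.array([[6.0, 2.0]])  # 6 months subscribed, 2 rentals
churn_probability = churn_model.predict_proba(new_user)[0, 1]
predicted_hours = hours_model.predict(new_user)[0]
print("churn probability:", round(float(churn_probability), 3))
print("predicted hours:  ", round(float(predicted_hours), 2))
print("segment sizes:    ", np.bincount(segments))
```

The two supervised models receive both the features and a target, while the clustering call receives the features only; that difference is the whole distinction between supervised and unsupervised learning.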
Two additional types of ML that are encountered less often in the security domain are reinforcement learning and recommendation systems. In reinforcement learning, an agent learns from rewards via a feedback loop with its environment; an example is a computer program learning to play a game (e.g., Google DeepMind’s AlphaGo). Recommendation systems are used by platforms such as Netflix, Amazon, Spotify, and many others to make personalized recommendations. They boil down to predicting the preference or rating a user would give to an item, and they can involve a mix of supervised and unsupervised techniques.
Supervised Machine Learning Process
The process of going from a business problem to a predictive model usually involves the following steps:
- Defining the problem and setting performance metrics and objectives. For example, we may want to identify fraudulent transactions in order to raise alerts for new transactions we believe to be fraudulent. This is a supervised classification problem where the aim is to use past data to predict if a transaction is legitimate or fraudulent. We may decide that we care most about identifying as many fraudulent transactions as possible correctly, even if that comes at the expense of raising a few false alarms.
- Data identification. In our example, let’s say we have a dataset of past transactions including properties (features) of those transactions such as transaction date, number of items purchased, account age, and payment method. We also have labels for those transactions that tell us whether they are legitimate (class 0) or fraudulent (class 1).
- Data preprocessing. Depending on the problem, this may involve data cleaning, conversion of categorical features to numerical values, data filtering, dealing with missing values, and feature engineering (creating new features).
- Splitting the data into training and testing. Usually the data are divided into training and testing sets either randomly or based on time, in cases where the temporal dimension of the data is important.
- Algorithm selection. There are many kinds of machine learning algorithms, differing mostly in how the decision boundary between the classes is decided and how complicated it may be. Examples include logistic regression, decision trees, random forests, support vector machines, neural networks, etc.
- Training & parameter tuning. The model is trained on the training data. Hyperparameters (settings chosen before training, such as regularization strength or tree depth) are tuned using cross-validation on the training data or a separate validation set, never the test set.
- Model evaluation. Model performance is evaluated on the test dataset with respect to the metric established at the beginning of this process. If the performance on the test dataset is good and close to the performance on the training dataset, that is good evidence that the model generalizes well to new examples (does not overfit the training data).
Going from a business problem to a predictive model is usually an iterative process that may involve experimentation at every step: if, after the model evaluation step, we are not satisfied with the performance, we may want to tune the hyperparameters further, add new features, or pick a different algorithm. At its core, an ML algorithm takes a training dataset and outputs a model. The model is itself an algorithm that takes in new data points in the same form as the training data and outputs a prediction.
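The steps above can be sketched end to end for the fraud example with scikit-learn. Treat this as a sketch under stated assumptions rather than a reference implementation: the transactions are synthetic, the labelling rule is invented, logistic regression stands in for whichever algorithm you would actually select, and recall is used as the metric because the stated objective is to catch as many fraudulent transactions as possible.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 3000

# Data identification: a synthetic stand-in for past transactions (values invented)
account_age_days = rng.exponential(200, n)
n_items = rng.poisson(3, n) + 1
payment_method = rng.integers(0, 3, n)  # hypothetical codes: 0=card, 1=wallet, 2=gift card
X = np.column_stack([account_age_days, n_items, payment_method])

# Invented labelling rule: new accounts, large baskets, and gift cards are riskier
logit = -3.0 + 2.5 * (account_age_days < 30) + 0.3 * (n_items - 4) + 1.5 * (payment_method == 2)
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)  # 0 = legitimate, 1 = fraudulent

# Splitting: random and stratified, so both sets keep the same fraud rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Preprocessing and algorithm selection, bundled so they are fit on training data only
preprocess = ColumnTransformer([
    ("num", StandardScaler(), [0, 1]),                      # scale numeric columns
    ("cat", OneHotEncoder(handle_unknown="ignore"), [2]),   # encode payment method
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Training and hyperparameter tuning: 5-fold cross-validation on the training set
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, scoring="recall", cv=5)
search.fit(X_train, y_train)

# Model evaluation: compare training and test recall to check for overfitting
train_recall = recall_score(y_train, search.predict(X_train))
test_recall = recall_score(y_test, search.predict(X_test))
print("best C:", search.best_params_["clf__C"])
print(f"recall on train: {train_recall:.2f}, on test: {test_recall:.2f}")
```

Wrapping preprocessing and the classifier in one pipeline means the scaler and encoder are refit inside every cross-validation fold, so no information from held-out data leaks into training. When the temporal dimension matters, the random split would be replaced with a split by date.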
Classification, Regression, and Clustering Problems in Security
Let’s imagine you are in charge of computer security at your company. Here are some questions you may encounter, grouped by how they could be framed:
- Classification problems:
  - For every file sent through the network, does it contain malware?
  - For every login attempt, has someone’s password been compromised?
  - For every email received, is it a phishing attempt?
  - For every request to your services, is it a denial of service (DoS) attack?
  - For every outbound request from your network, is it a bot communicating with its command-and-control server?
- Regression problems:
  - Predict the number of phishing emails an employee will receive in a given month, given historical data about phishing emails and the employee.
  - Predict the number of account sign-ins from a specific user or a specific office location, given a known history.
  - Predict the number of database requests during a specific time period.
- Clustering problems:
  - Given a large dataset of internet traffic to your site, you may want to know which requests group together: some may come from botnets, others from legitimate users. You could do this using request times, frequency, source, etc.
  - Given a large group of malware samples, you may want to know how they group into malware families.
  - Given online chatter or descriptions of vulnerabilities, you may want to know what different kinds of vulnerabilities are out there.
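As an illustration of the traffic clustering question, the sketch below groups clients by request behavior with k-means. The two features (requests per minute and mean seconds between requests), the synthetic "human" and "bot" populations, and the choice of two clusters are all assumptions made for the example; real traffic is rarely this cleanly separated.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Hypothetical per-client features: [requests per minute, mean seconds between requests]
humans = np.column_stack([rng.normal(2, 0.5, 200), rng.normal(30, 5, 200)])
bots = np.column_stack([rng.normal(60, 5, 50), rng.normal(1, 0.2, 50)])
traffic = np.vstack([humans, bots])  # the algorithm never sees which row is which

# Scale the features so neither dominates the distance computation
scaled = StandardScaler().fit_transform(traffic)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)

for cluster in np.unique(labels):
    members = traffic[labels == cluster]
    print(f"cluster {cluster}: {len(members)} clients, "
          f"mean rate {members[:, 0].mean():.1f} req/min")
```

The clusters come back as anonymous integers. Deciding that the high-rate group is a botnet is a judgment an analyst makes afterwards, which is typical of unsupervised results.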
Machine Learning Use Cases in Security – Pattern Recognition versus Anomaly Detection
Machine learning applications in cybersecurity come in two main forms: pattern recognition (called pattern detection below) and anomaly detection.
In pattern detection, we try to discover explicit or latent characteristics hidden in the data, and use them to teach an algorithm to recognize other forms of the data that exhibit the same set of characteristics. Patterns are inferred from the training data.
In anomaly detection, instead of learning specific patterns that exist in the data, the goal is to establish a notion of normality that describes most of a given dataset. Deviations from this norm are flagged as anomalies (outliers). Because an anomaly is defined only by what it is not, the set of possible anomalies is effectively unbounded.
Since the two applications are often conflated with each other, it helps to consider specific examples. If you are looking for fraudulent credit card transactions, it might make sense to use a supervised learning model (pattern detection), provided you have a large number of both legitimate and fraudulent transactions with which to train the model. The assumption is that fraudulent transactions have some characteristics in common and that future fraudulent transactions will look similar to the ones we have already seen. In other scenarios, it can be difficult to find a pool of positive examples representative enough for the algorithm to get a sense of what positive events look like. In intrusion detection, for example, breaches can be caused by zero-day attacks or newly disclosed vulnerabilities, with no previously known attack signature. In such cases, anomaly detection is the better approach.
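A very simple anomaly detector shows the idea of learning normality rather than attack patterns. The sketch below uses only the Python standard library and is deliberately basic: the baseline of hourly failed-login counts is invented, and the rule of flagging anything more than three standard deviations from the mean is one common convention, not the only option. No examples of attacks are needed to build it.

```python
from statistics import mean, stdev

# Hypothetical baseline: failed logins per hour during normal operation
baseline = [3, 5, 4, 6, 5, 4, 3, 5, 6, 4, 5, 4]
mu = mean(baseline)      # 4.5
sigma = stdev(baseline)  # 1.0 (sample standard deviation)

def is_anomalous(count, threshold=3.0):
    """Flag a count that lies more than `threshold` standard deviations from normal."""
    return abs(count - mu) / sigma > threshold

new_counts = [4, 6, 40, 5]
flags = [is_anomalous(c) for c in new_counts]
print(flags)  # [False, False, True, False]
```

The spike to 40 is flagged even though the detector has never seen an attack, which is exactly the property that matters when there is no known signature. The trade-off is that unusual but harmless events are flagged too.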
For a detailed description of three specific applications of ML in cybersecurity (spam detection, intrusion detection, and our favorite, vulnerability prioritization), as well as an overview of the challenges encountered when designing ML systems, stay tuned for next week’s Part 2 of this blog series, or watch the on-demand webinar.