NIST Teams up with IBM Watson AI System to Score Vulnerabilities, but Hold your Excitement!

The Great News

It was recently reported that NIST, the agency hosting the National Vulnerability Database (NVD), plans to replace its manual scoring of software vulnerabilities with a new automated process that leverages IBM Watson’s artificial intelligence system. The report carries a catchy title: “NIST Teams Up with IBM Watson to Rate How Dangerous Computer Bugs Are.” According to Matthew Scholl, chief of NIST’s computer security division (as quoted in the linked report), the AI system should be “assigning risk scores” to most publicly reported computer bugs by October 2019. It promises to alleviate the burden placed on numerous human analysts by automating the tedious work of assigning CVSS scores to newly discovered vulnerabilities.

The news comes as a great relief at a time when the number of reported vulnerabilities is skyrocketing. Delays in analysis have grown to the point where it is not unusual to see vulnerabilities widely reported on the web, or even used by ransomware in the wild, before they are officially published in the NVD and assigned a CVSS score. In the meantime, it is not uncommon to see these entries labeled “reserved” or “awaiting analysis.”

So, what seems to be the problem? Why are we warning you to contain your excitement? To see how terminology such as “rate how dangerous computer bugs are” and “assigning risk scores” could be overpromising and borderline misleading, let’s first look at the problem IBM Watson is trained to solve in this context.

How is NIST Using IBM Watson?

IBM Watson is a (very) advanced computer system originally intended to apply advanced natural language processing (NLP) to the field of open domain question answering. It was later extended to take advantage of advanced machine learning capabilities to provide assistance to businesses in need of predictive analytics.

Machine learning is a subfield of AI concerned with the study and construction of algorithms that can learn from data and make predictions on new data in order to solve practical problems. Learning here refers to progressively improving performance on a specific task with respect to a certain performance metric. In the context of cybersecurity, tasks have ranged from detecting spam and phishing emails, flagging anomalies in real time, and automatically detecting malicious code, to predicting the risk posed by a newly discovered software vulnerability in order to prioritize remediation. The last is a problem we at NopSec care deeply about, and one you might be misled into believing is now solved thanks to NIST partnering with IBM Watson.

The actual task IBM Watson was trained to learn here is to look at thousands of historical vulnerability descriptions and their corresponding CVSS scores (assigned by human analysts over the last two decades), find an algorithm that relates those descriptions to those scores, and then apply it to score new vulnerabilities based on their descriptions. The more a new vulnerability resembles a previously studied one, the more likely Watson is to score it the way a human analyst would, and the better its performance is deemed to be.
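This similarity-driven behavior can be sketched with a toy scorer. The snippet below is a minimal illustration of the idea, not Watson’s actual (far more sophisticated) NLP pipeline: it predicts a score for a new description as a similarity-weighted average of historical scores. All descriptions and scores here are made up for illustration.

```python
from collections import Counter
import math

# Toy "historical" data: (description, human-assigned CVSS base score).
# These entries are illustrative, not real NVD records.
HISTORY = [
    ("remote code execution via crafted http request", 9.8),
    ("sql injection in login form allows data disclosure", 8.6),
    ("cross site scripting in comment field", 6.1),
    ("local denial of service via malformed config file", 4.4),
]

def vectorize(text):
    """Bag-of-words term counts for a description."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def predict_score(description, history=HISTORY):
    """Similarity-weighted average of historical scores."""
    v = vectorize(description)
    weighted = [(cosine(v, vectorize(d)), s) for d, s in history]
    total = sum(w for w, _ in weighted)
    if total == 0:
        return None  # nothing similar on record
    return round(sum(w * s for w, s in weighted) / total, 1)
```

A description resembling a past remote-code-execution bug gets a score close to that bug’s score, while a description with no overlap at all yields no prediction, which is exactly the generalization limit of learning from past labels.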

So, Watson is Assigning CVSS Score to Vulnerabilities – What Might be a Problem with That?

CVSS was introduced in 2005, with version 2 following in 2007, as a standardized way of communicating the characteristics of software vulnerabilities. It contains three metric groups: base, temporal, and environmental. The base group (the one Watson is tasked with assigning) covers the intrinsic characteristics of a vulnerability, including its exploitability (how technically easy or difficult it is to exploit) and its impact (loss of confidentiality, integrity, or availability) on the affected system if the vulnerability is exploited. These are combined to produce a score on a scale of 0 to 10. The base score on its own does NOT take into account the existence of exploits or patches, nor the value of the asset the vulnerability is detected on (the user’s unique context). A vulnerability with no PoC exploit code could be assigned the same score as one being actively used in attacks. A vulnerability detected on a database server exposed to the Internet could get the same score as one found on a printer that is disconnected from the Internet.
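For the curious, the base score is a fixed formula over those exploitability and impact sub-metrics. The sketch below implements the CVSS v2 base equation, with metric values and constants as published in the CVSS v2 specification:

```python
# CVSS v2 base-score equation (constants per the CVSS v2 specification).
AV = {"L": 0.395, "A": 0.646, "N": 1.0}    # Access Vector
AC = {"H": 0.35, "M": 0.61, "L": 0.71}     # Access Complexity
AU = {"M": 0.45, "S": 0.56, "N": 0.704}    # Authentication
CIA = {"N": 0.0, "P": 0.275, "C": 0.660}   # Conf./Integ./Avail. impact

def base_score(av, ac, au, c, i, a):
    """Combine exploitability and impact sub-metrics into a 0-10 score."""
    impact = 10.41 * (1 - (1 - CIA[c]) * (1 - CIA[i]) * (1 - CIA[a]))
    exploitability = 20 * AV[av] * AC[ac] * AU[au]
    f = 0.0 if impact == 0 else 1.176
    return round((0.6 * impact + 0.4 * exploitability - 1.5) * f, 1)
```

Note that nothing in the formula encodes exploit existence, patch status, or asset value: exactly the missing context discussed above.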

To compensate for the lack of context, users are encouraged to adjust this score by invoking the temporal and environmental groups, which take into account the time evolution of the vulnerability and the user’s unique environment, respectively. Temporal characteristics include the existence and maturity of exploit code, the existence of unofficial or official fixes (patches), and reporting confidence. Environmental characteristics should account for the business value of the asset the vulnerability resides on and the collateral damage potential, such as loss of property, revenue, or productivity if exploited.
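The temporal adjustment, for example, is a straightforward multiplicative discount on the base score. A sketch following the CVSS v2 temporal equation (values per the v2 specification):

```python
# CVSS v2 temporal adjustment: the base score is scaled down by exploit
# maturity (E), remediation level (RL), and report confidence (RC).
E = {"U": 0.85, "POC": 0.9, "F": 0.95, "H": 1.0, "ND": 1.0}    # Exploitability
RL = {"OF": 0.87, "TF": 0.90, "W": 0.95, "U": 1.0, "ND": 1.0}  # Remediation Level
RC = {"UC": 0.90, "UR": 0.95, "C": 1.0, "ND": 1.0}             # Report Confidence

def temporal_score(base, e, rl, rc):
    """Discount a base score for exploit maturity, patches, and confidence."""
    return round(base * E[e] * RL[rl] * RC[rc], 1)
```

So a 10.0 with a functional exploit but an official fix drops to 8.3, while leaving every sub-vector as "ND" (not defined) leaves the base score untouched, the neutral default discussed below.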

By default, temporal and environmental components are left neutral: they do not affect the score unless their sub-vectors are specified manually by the end user. This has unfortunately led to the now-standard practice of using the CVSS base score alone as a proxy for risk. Cut-off points are set so that only high-severity vulnerabilities (those with a CVSS base score of 7 or higher) are remediated and all others are ignored. We have found this to be both an inefficient and an unsafe strategy. It is inefficient because there are many more high-severity vulnerabilities by CVSS score (~38% of all historical vulnerabilities) than are actually leveraged by malware or used in targeted attacks in the wild (~2%). It is unsafe because a significant portion of the vulnerabilities that do get weaponized have medium or low CVSS severity (~44%). As such, the CVSS base score is an insufficient measure of the risk posed by a vulnerability. For details, please refer to our 2018 State of Vulnerability Risk Management Report.
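A quick back-of-the-envelope calculation with the rounded figures above, applied to a hypothetical population of 100,000 vulnerabilities, shows why the cut-off strategy fails on both counts:

```python
# Illustrative arithmetic using the rounded report figures quoted above
# (38% high severity, 2% weaponized, 44% of weaponized are medium/low).
total = 100_000                      # hypothetical vulnerability population
high_severity = int(total * 0.38)    # flagged by the CVSS >= 7 cut-off
weaponized = int(total * 0.02)       # actually used in attacks
missed = int(weaponized * 0.44)      # weaponized but scored medium/low
caught = weaponized - missed

precision = caught / high_severity   # share of remediation effort that pays off
recall = caught / weaponized         # share of real threats covered

print(f"precision ~ {precision:.1%}, recall ~ {recall:.1%}")
```

Under these figures, roughly 3% of the remediation effort targets vulnerabilities that are ever weaponized, while nearly half of the real threats slip below the cut-off.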

Conclusion

Watson likely does what it has been trained to do: it predicts the CVSS base score of a newly described vulnerability. While it promises greater efficiency in reporting just that, the CVSS score, it does not promise better vulnerability risk management. It inherits all the weaknesses of the CVSS system. If CVSS V3 (the ‘improvement’ to the CVSS system introduced in 2015), as assigned by human analysts, tells us that 60% of vulnerabilities published in 2017 are high severity, then so will Watson. Both will be out of touch with reality, considering only ~3.4% of those have posed actual threats so far by being used in malware, exploit kits, ransomware, trojans, and targeted attacks.

The CVSS score coming out of Watson’s analysis is as flawed as the CVSS score we have been dealing with so far, and it is NOT to be confused with the risk of exploitation in the wild or in targeted attacks.

At the end of the day, humans decide what goes into machine learning algorithms and what they are trying to get out of them. We at NopSec use machine learning to prioritize the remediation of vulnerabilities that are likely to be exploited in the real world. In addition to the CVSS score, our model is trained to look at many other vulnerability characteristics, such as the vendors and products affected, vendor prevalence, vulnerability age, vulnerability descriptions, the existence of exploits in many public and commercial exploit databases, and social media mentions, to learn the relationships that may lead to one vulnerability being used in targeted attacks while many others are not. We find that prioritizing based on our model outperforms CVSS-based prioritization both in terms of false positives (being efficient and not prioritizing too many) and false negatives (being safe and prioritizing the right vulnerabilities).
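As an illustration only (our production model and its features are not public), a multi-feature approach can be sketched as a logistic combination of signals like these, with entirely hypothetical weights:

```python
import math

# Hypothetical feature weights for illustration only -- the actual model,
# features, and weights are not public and would be learned from data.
WEIGHTS = {
    "cvss_base": 0.3,          # CVSS is one signal among many, not the signal
    "public_exploit": 2.5,     # exploit code available in a public database
    "social_mentions": 1.2,    # normalized social-media chatter
    "vendor_prevalence": 0.8,  # how widely deployed the affected product is
    "age_years": -0.2,         # old, never-exploited bugs decay in priority
}
BIAS = -4.0  # hypothetical intercept: most vulnerabilities are never exploited

def exploitation_likelihood(features):
    """Logistic combination of features into a probability-like score."""
    z = BIAS + sum(WEIGHTS[k] * features.get(k, 0.0) for k in WEIGHTS)
    return 1 / (1 + math.exp(-z))
```

With weights like these, a medium-CVSS vulnerability with a public exploit and active chatter outranks a high-CVSS vulnerability with neither, which is precisely the distinction a base-score cut-off cannot express.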