Use Case for AI in Cybersecurity: the DGA Detection Algorithm

Since 2018 and the release of the first version of Reveelium, the team at ITrust has been convinced of the value of Artificial Intelligence for cybersecurity. In recent years, market trends have proven us right: organizations of all sizes, from SMEs to industry giants, now rely on this technology to strengthen their threat detection capabilities.

But what do we mean by AI?

Reveelium is not yet able to think like a human, and therefore does not replace the expertise of an accomplished SOC analyst. That said, the practice of AI at ITrust goes far beyond the IF/THEN rules, however optimized, that have long been the basis of heavily human-supervised detection systems.

Making a decision in a predefined situation is not proof of intelligence. Traditionally, SIEM correlation rules are static: they do not adapt to new situations, nor do they learn from their mistakes.

Yet the ability to learn is crucial for detecting the many new threats that companies face, and for countering the increasingly complex and skilled attacks of cybercriminals.

In response to this reality, ITrust’s data scientists have been developing our own machine learning algorithms for several years. Their goal: to continuously improve threat detection by identifying relationships among thousands of logs that a human would not necessarily see.

In a previous post, we talked about the Kryptis malware and how we managed to detect it thanks to the triptych: SIEM UEBA / Threat Intelligence / Human Expertise.

Today we focus on the UEBA engine of our Reveelium technology by presenting one of these algorithms: DGA detection. Without detailing its precise inner workings, we explain the temporal and syntactic analysis methods that make this algorithm useful for malware detection.

Reminder about the Kryptis malware:

It is a Trojan horse that, once installed, can be remotely controlled by the attacker (Command & Control). It is used to steal information such as usernames, passwords or sensitive files (exfiltration). It can also take screenshots, record keystrokes (keylogging), monitor network traffic and launch executable files, then send all this information to a remote server controlled by the attacker.

DGA:

In this cyber-attack, it was our DGA detection algorithm that detected the spread of the Kryptis malware, thanks to the correlation of one of our alerts with SIEM alerts.

But what is a DGA?

A DGA (Domain Generation Algorithm) is a technique used by cyber attackers to generate large numbers of new domain names for malware Command and Control servers. Because these names change constantly, a static blocklist cannot keep up with them. Detecting DGAs is therefore a crucial issue, as it can lead to the early detection of malware, whether known or unknown.
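To make the idea concrete, here is a deliberately simplified toy generator. It is only a sketch for illustration: the function name `toy_dga`, the seed, the hashing scheme and the reserved `.example` suffix are all assumptions made for this post, and it does not reproduce the algorithm of Kryptis or of any real malware family. It shows why the resulting names look random while remaining predictable for anyone who knows the seed.

```python
import hashlib
from datetime import date


def toy_dga(seed, day, count=5, length=12):
    """Derive `count` pseudo-random domain names from a shared seed and a date.

    Hypothetical helper for illustration only, not a real malware DGA.
    """
    domains = []
    for i in range(count):
        # Anyone hashing the same inputs obtains the same names.
        material = f"{seed}-{day.isoformat()}-{i}".encode()
        digest = hashlib.sha256(material).hexdigest()
        # Map each hex digit (0-15) to a lowercase letter (a-p).
        label = "".join(chr(ord("a") + int(c, 16)) for c in digest[:length])
        # ".example" is a reserved suffix, so these names never resolve.
        domains.append(label + ".example")
    return domains


if __name__ == "__main__":
    for name in toy_dga("demo-seed", date(2024, 1, 1)):
        print(name)
```

Changing the date changes the whole list, which is what makes such names hard to block with a fixed list.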

What is the value of AI?

What does AI bring to the detection of DGAs? The answer is simple: AI enables a more advanced analysis of domain names, combining a semantic analysis with a syntactic analysis. This makes the classification more accurate and therefore more effective.

Semantic analysis:

Semantic analysis assesses the pronounceability of a domain name: not only whether the name we are studying is pronounceable, but above all to what extent.

To perform this semantic analysis, the algorithms most often used are deep learning algorithms, more specifically RNNs (Recurrent Neural Networks). An RNN is a neural network that processes data sequentially (hence "recurrent"): it reads a domain name character by character while keeping a memory of what it has already seen. The great strength of this type of algorithm is that it can pick up on recurring patterns in a sequence.

Here is how it applies to DGA detection: in ordinary words (French or English, for example) and in legitimate domain names, patterns appear, such as the alternation of consonants and vowels.
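The sequential reading performed by an RNN can be sketched in a few lines. What follows is a minimal illustration, not Reveelium's model: the class name `TinyRNN`, the alphabet, the hidden size and the random weights are all assumptions, and because the network is untrained its score is meaningless. The point is only to show how the hidden state carries information from one character to the next.

```python
import math
import random

# Characters the sketch knows about; anything else becomes a zero vector.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-"


def one_hot(ch):
    """Encode a character as a one-hot vector over ALPHABET."""
    vec = [0.0] * len(ALPHABET)
    idx = ALPHABET.find(ch)
    if idx >= 0:
        vec[idx] = 1.0
    return vec


class TinyRNN:
    """Untrained Elman-style RNN (hypothetical, for illustration only)."""

    def __init__(self, hidden_size=8, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                    for _ in range(rows)]

        self.hidden_size = hidden_size
        self.w_in = rand_matrix(hidden_size, len(ALPHABET))   # input -> hidden
        self.w_rec = rand_matrix(hidden_size, hidden_size)    # hidden -> hidden
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]

    def score(self, label):
        """Read the label one character at a time and return a value in (0, 1)."""
        hidden = [0.0] * self.hidden_size
        for ch in label.lower():
            x = one_hot(ch)
            # The new state depends on the current character AND the previous
            # state: this recurrence is what lets an RNN model character order.
            hidden = [
                math.tanh(
                    sum(w * xi for w, xi in zip(self.w_in[j], x))
                    + sum(w * h for w, h in zip(self.w_rec[j], hidden))
                )
                for j in range(self.hidden_size)
            ]
        logit = sum(w * h for w, h in zip(self.w_out, hidden))
        return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    rnn = TinyRNN()
    print(rnn.score("example"))
```

In a real system the weights would be learned from labeled domain names, typically with a deep learning framework rather than hand-written loops.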

Observation of the patterns of a normal domain name. c = consonant / v = vowel / red circle = pattern to observe

In most DGA-generated domain names, however, this alternation between vowels and consonants is absent.

Example of a DGA-generated domain name in which no consonant/vowel alternation pattern is found. c = consonant / v = vowel

It is the presence (or absence) of this pattern that the RNN detects. This allows it to estimate how pronounceable a name is and, consequently, to classify it as DGA-generated or not.
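The consonant/vowel pattern described above can be made visible with a small hand-written heuristic. This is not an RNN and not ITrust's algorithm; it is only a sketch of the signal such a model can learn. The helper names `cv_pattern` and `alternation_ratio` are hypothetical, and treating "y" as a consonant is a simplifying assumption.

```python
VOWELS = set("aeiou")  # assumption: "y" is counted as a consonant


def cv_pattern(label):
    """Map each character to 'c' (consonant), 'v' (vowel) or '-' (other)."""
    out = []
    for ch in label.lower():
        if ch in VOWELS:
            out.append("v")
        elif ch.isalpha():
            out.append("c")
        else:
            out.append("-")
    return "".join(out)


def alternation_ratio(label):
    """Fraction of adjacent character pairs that switch between c and v."""
    pattern = cv_pattern(label)
    pairs = list(zip(pattern, pattern[1:]))
    if not pairs:
        return 0.0
    alternating = sum(1 for a, b in pairs if {a, b} == {"c", "v"})
    return alternating / len(pairs)


if __name__ == "__main__":
    print(cv_pattern("banana"), alternation_ratio("banana"))          # cvcvcv 1.0
    print(cv_pattern("xkqzwrtpbn"), alternation_ratio("xkqzwrtpbn"))  # cccccccccc 0.0
```

A pronounceable word scores close to 1, while a random string of consonants scores close to 0. An RNN learns richer versions of this kind of regularity directly from the data instead of relying on a fixed rule.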

Syntactic analysis:

Another way to detect DGAs is to use a syntactic approach, which analyzes the content of the domain name itself, for example by counting the number of vowels or consonants it contains.

In this type of approach, we build a list of features that describe the domain name. These features can be quite simple, such as the number of vowels or consonants in the domain name, or more complex. All these features are then provided as input to a machine learning algorithm, which learns to distinguish normal domain names from DGA-generated ones.
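As an illustration, the counting described above might look like the sketch below. The function name `syntactic_features` and the choice of features are assumptions made for this example: length, digit count and character entropy are common additions to the vowel and consonant counts, but they are not necessarily the features used in Reveelium.

```python
import math
from collections import Counter

VOWELS = set("aeiou")  # assumption: "y" is counted as a consonant


def syntactic_features(label):
    """Return simple content-based features for one domain label."""
    label = label.lower()
    letters = [ch for ch in label if ch.isalpha()]
    n_vowels = sum(1 for ch in letters if ch in VOWELS)
    n_digits = sum(1 for ch in label if ch.isdigit())

    # Shannon entropy of the character distribution, in bits per character.
    entropy = 0.0
    if label:
        for count in Counter(label).values():
            p = count / len(label)
            entropy -= p * math.log2(p)

    return {
        "length": len(label),
        "vowels": n_vowels,
        "consonants": len(letters) - n_vowels,
        "digits": n_digits,
        "entropy": entropy,
    }


if __name__ == "__main__":
    print(syntactic_features("banana"))
    print(syntactic_features("q7w3rt9zx1"))
```

Each domain name is thus turned into a small vector of numbers that a classifier can work with.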
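The learning step can be sketched with a very small logistic regression written in plain Python. Everything here is an assumption made for illustration: the two features, the hyperparameters, and above all the eight training names, which are invented toy examples and not real detection data. A production model would use far more features and far more data.

```python
import math

VOWELS = set("aeiou")


def features(label):
    """Two toy features plus a constant bias term."""
    label = label.lower()
    n = max(len(label), 1)
    vowel_ratio = sum(1 for ch in label if ch in VOWELS) / n
    digit_ratio = sum(1 for ch in label if ch.isdigit()) / n
    return [vowel_ratio, digit_ratio, 1.0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, label):
    """Estimated probability that the label is DGA-generated."""
    return sigmoid(sum(w * x for w, x in zip(weights, features(label))))


def train(samples, epochs=2000, lr=0.5):
    """Fit logistic regression weights with stochastic gradient descent."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for label, target in samples:
            x = features(label)
            error = target - predict(weights, label)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
    return weights


# Invented toy data: 0 = normal-looking name, 1 = random-looking name.
TOY_SAMPLES = [
    ("google", 0), ("wikipedia", 0), ("amazon", 0), ("example", 0),
    ("xkqzwrtpbn", 1), ("q7w3rt9zx1", 1), ("bcdfg8hjkl", 1), ("zz4x9qv2kp", 1),
]

if __name__ == "__main__":
    w = train(TOY_SAMPLES)
    for name in ("banana", "xq9zt7kpwr"):
        print(name, round(predict(w, name), 3))
```

The model is never told that "few vowels" is suspicious; it infers a rule of that kind from the labeled examples, which is the difference with a static IF/THEN rule.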

Our method:

ITrust’s algorithm combines both aspects, semantic analysis and syntactic analysis. To detect DGAs, we therefore perform a double analysis, so that no aspect of the domain names is left unexamined.

In addition, our algorithms are trained on ITrust’s own databases, which improves the reliability of their results: as a rule, the larger the volume of data on which the algorithms are trained, the more robust they are.

AI: the key to contemporary cybersecurity

In conclusion, this post illustrates the usefulness of machine learning techniques for cybersecurity. They allow an in-depth analysis of data from several angles (here syntactic and semantic, to evaluate the structure and pronounceability of the domain name). Uses and needs keep growing, and with them the perimeters to be supervised, which are increasingly exposed to vulnerabilities. Faced with this growing complexity, the advantages of machine learning justify its gradual replacement of traditional methods: