
Commit e3698f0

Merge pull request #332 from Winterflower/dga-part-2

Adds separate READMEs for each DGA blogpost

Authored by Camilla
2 parents d5b3aaf + 18cb937 commit e3698f0

File tree: 3 files changed, +300 −106 lines changed

Lines changed: 7 additions & 106 deletions
@@ -1,111 +1,12 @@
-# Supplementary materials for "Machine learning in cybersecurity: Training supervised models to detect DGA activity"
-This folder contains the supplementary materials for the blogpost ["Machine learning in cybersecurity: Training supervised models to detect DGA activity"](https://www.elastic.co/blog/machine-learning-in-cybersecurity-training-supervised-models-to-detect-dga-activity).

-## Training the classification model
+This folder contains the supplementary materials for the following blogposts:

-The raw data we used to train the model has the format

-```
-domain,dga_algorithm,malicious
-pdtmstring,banjori,true
-umfpstring,banjori,true
-cmzmstring,banjori,true
-hrynstring,banjori,true
-nhdjstring,banjori,true
-ppkustring,banjori,true
-```
+### Machine learning in cybersecurity: Training supervised models to detect DGA activity
+* [blogpost](https://www.elastic.co/blog/machine-learning-in-cybersecurity-training-supervised-models-to-detect-dga-activity)
+* [supplementary materials](training-supervised-models-to-detect-dga-activity.md)

-We are interested in extracting unigrams, bigrams and trigrams from the
-`domain` field. Thus, we have to first define a Painless script that is
-capable of taking a string as an input and expanding the string
-into a set of n-grams, for some value of `n`.
+### Machine learning in cybersecurity: Detecting DGA activity in network data
+* [blogpost]()
+* [supplementary materials](detecting-dga-activity-in-network-data.md)

-The script is available in the file `ngram-extractor-reindex.json`, but is also
-reproduced below.
-
-You can store it in Elasticsearch using the following Dev Console command.
-As you can see, the script accepts one parameter, `n`, the length of the n-gram.
-
-```
-POST _scripts/ngram-extractor-reindex
-{
-  "script": {
-    "lang": "painless",
-    "source": """
-      String nGramAtPosition(String fulldomain, int fieldcount, int n){
-        String domain = fulldomain.splitOnToken('.')[0];
-        if (fieldcount+n>=domain.length()){
-          return '';
-        }
-        else
-        {
-          return domain.substring(fieldcount, fieldcount+n);
-        }
-      }
-      for (int i=0;i<ctx['domain'].length();i++){
-        ctx[Integer.toString(params.ngram_count)+'-gram_field'+Integer.toString(i)] = nGramAtPosition(ctx['domain'], i, params.ngram_count);
-      }
-    """
-  }
-}
-```
-
-We can then use the stored script to configure an Ingest Pipeline as follows.
-
-```
-PUT _ingest/pipeline/dga_ngram_expansion_reindex
-{
-  "description": "Expands a domain into unigrams, bigrams and trigrams",
-  "processors": [
-    {
-      "script": {
-        "id": "ngram-extractor-reindex",
-        "params": {
-          "ngram_count": 1
-        }
-      }
-    },
-    {
-      "script": {
-        "id": "ngram-extractor-reindex",
-        "params": {
-          "ngram_count": 2
-        }
-      }
-    },
-    {
-      "script": {
-        "id": "ngram-extractor-reindex",
-        "params": {
-          "ngram_count": 3
-        }
-      }
-    }
-  ]
-}
-```
-
-Once the Ingest Pipeline has been configured, we can re-index our original index with the raw data into a new index which will contain the n-gram expansion of each domain.
-
-```
-POST _reindex
-{
-  "source": {
-    "index": "dga_raw"
-  },
-  "dest": {
-    "index": "dga_ngram_expansion",
-    "pipeline": "dga_ngram_expansion_reindex"
-  }
-}
-```
-
-Once you have all of the data re-indexed through the Ingest Pipeline, you can follow the screenshots in the [blog post](https://www.elastic.co/blog/machine-learning-in-cybersecurity-training-supervised-models-to-detect-dga-activity) to configure your ML job.

Lines changed: 182 additions & 0 deletions
@@ -0,0 +1,182 @@

# Supplementary materials for "Machine learning in cybersecurity: Detecting DGA activity in network data"

This folder contains the supplementary materials for the blogpost "Machine learning in cybersecurity: Detecting DGA activity in network data".
These configurations have been tested on Elasticsearch version 7.6.2.

## Painless Script for Extracting Unigrams, Bigrams and Trigrams from Packetbeat Data

Because our model was trained on unigrams, bigrams and trigrams, we have to extract the same features from any new domains we wish to score using the model. Hence, before passing the domains from Packetbeat DNS requests into the inference processor, we first have to pass them through a Painless script processor that invokes the stored script below.

```
POST _scripts/ngram-extractor-packetbeat
{
  "script": {
    "lang": "painless",
    "source": """
      // Returns the n-gram of length n starting at position fieldcount,
      // or an empty string if the n-gram would run past the end of the domain.
      String nGramAtPosition(String fulldomain, int fieldcount, int n){
        // Only the portion before the first '.' is expanded into n-grams.
        String domain = fulldomain.splitOnToken('.')[0];
        if (fieldcount+n>=domain.length()){
          return '';
        }
        else
        {
          return domain.substring(fieldcount, fieldcount+n);
        }
      }
      // Adds one field per position, e.g. field_2_gram_0, field_2_gram_1, ...
      for (int i=0;i<ctx['dns']['question']['registered_domain'].length();i++){
        ctx['field_'+Integer.toString(params.ngram_count)+'_gram_'+Integer.toString(i)] = nGramAtPosition(ctx['dns']['question']['registered_domain'], i, params.ngram_count);
      }
    """
  }
}
```
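
Before building the full pipeline, it can be useful to sanity-check the stored script on its own. Below is a minimal sketch using the ingest `_simulate` API with an inline pipeline definition; the document under `docs` is a hypothetical Packetbeat-style record, and `ngram_count: 2` extracts bigrams:

```
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "script": {
          "id": "ngram-extractor-packetbeat",
          "params": { "ngram_count": 2 }
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "dns": {
          "question": {
            "registered_domain": "elastic.co"
          }
        }
      }
    }
  ]
}
```

If the script is stored correctly, the simulated document in the response should come back enriched with fields such as `field_2_gram_0`, `field_2_gram_1`, and so on.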

## Painless Script for Removing Unigrams, Bigrams and Trigrams from Packetbeat Data

Since we don't want the extra unigrams, bigrams and trigrams to be ingested together with our Packetbeat data, we also configure a script to remove these features. This script is invoked in the ingest pipeline after the inference processor. If you wish, you can instead leave the features in the documents and omit this script from your ingest pipeline configuration.

```
POST _scripts/ngram-remover-packetbeat
{
  "script": {
    "lang": "painless",
    "source": """
      // Removes the n-gram fields added by ngram-extractor-packetbeat.
      for (int i=0;i<ctx['dns']['question']['registered_domain'].length();i++){
        ctx.remove('field_'+Integer.toString(params.ngram_count)+'_gram_'+Integer.toString(i));
      }
    """
  }
}
```

## Ingest Pipeline Configuration for Packetbeat DNS Data

Once we have stored both of the Painless scripts above, we can move on to configuring the ingest pipeline for the DNS data. Since we are only interested in performing classification on DNS data, we will show later in this document how to make the ingest pipeline below execute conditionally, only when the required DNS fields are present in the Packetbeat document. For now, let's assume the document redirected to the pipeline has the required DNS fields.

First, let's get the model ID; we will need it to configure our inference processor.
You can obtain the model ID by using the Kibana Dev Console and running the command

```
GET _ml/inference
```

You can then scroll through the response to find the model you trained for DGA detection.
Make note of the model ID value. Below is a snippet of the model data showing the
`model_id` field.

```
{
  "model_id" : "dga-ngram-job-1587729368929",
  "created_by" : "_xpack",
  "version" : "7.6.2",
  "description" : "",
  "create_time" : 1587729368929,
  "tags" : [
    "dga-ngram-job"
  ],
```

If you have many models in your cluster, it can be easier to search with part of your ML job's name in a pattern like this:

```
GET _ml/inference/dga-ngram-job*
```

Once we have the model ID, we can configure the ingest pipeline for DNS data as below.

```
PUT _ingest/pipeline/dga_ngram_expansion_inference
{
  "description": "Expands a domain into unigrams, bigrams and trigrams and makes a prediction of maliciousness",
  "processors": [
    {
      "script": {
        "id": "ngram-extractor-packetbeat",
        "params": {
          "ngram_count": 1
        }
      }
    },
    {
      "script": {
        "id": "ngram-extractor-packetbeat",
        "params": {
          "ngram_count": 2
        }
      }
    },
    {
      "script": {
        "id": "ngram-extractor-packetbeat",
        "params": {
          "ngram_count": 3
        }
      }
    },
    {
      "inference": {
        "model_id": "dga-ngram-job-1587729368929",
        "target_field": "predicted_label",
        "field_mappings": {},
        "inference_config": { "classification": { "num_top_classes": 2 } }
      }
    },
    {
      "script": {
        "id": "ngram-remover-packetbeat",
        "params": {
          "ngram_count": 1
        }
      }
    },
    {
      "script": {
        "id": "ngram-remover-packetbeat",
        "params": {
          "ngram_count": 2
        }
      }
    },
    {
      "script": {
        "id": "ngram-remover-packetbeat",
        "params": {
          "ngram_count": 3
        }
      }
    }
  ]
}
```

In the pipeline above, the first three processors invoke the same Painless script, `ngram-extractor-packetbeat`, to extract unigrams, bigrams and trigrams respectively (note the `ngram_count` parameter, which varies in each processor). They are followed by the inference processor, which references our trained model (your model ID will be different). Finally, three Painless script processors reference the script `ngram-remover-packetbeat` to remove the n-gram features once the model has made its prediction.
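
Before routing live data through the pipeline, it can be worth checking it end to end with the `_simulate` API. A minimal sketch follows; the domain is an illustrative DGA-style string (borrowed from the sample training data shown elsewhere in this repository), and your model ID inside the pipeline will differ:

```
POST _ingest/pipeline/dga_ngram_expansion_inference/_simulate
{
  "docs": [
    {
      "_source": {
        "dns": {
          "question": {
            "registered_domain": "pdtmstring.com"
          }
        }
      }
    }
  ]
}
```

The simulated document in the response should contain a `predicted_label` object holding the model's prediction, while the temporary n-gram fields should already have been stripped by the remover scripts.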

## Conditional Ingest Pipeline Execution

Not every document ingested through Packetbeat will describe a DNS flow. Hence, it would be ideal to make the pipeline we configured above execute conditionally, only when the document contains the desired fields. There are a couple of ways to achieve this; both use a Pipeline processor and check for the presence of specific fields in the Packetbeat document before deciding whether or not to direct it to the pipeline that contains our inference processor.

```
PUT _ingest/pipeline/dns_classification_pipeline
{
  "description": "A pipeline of pipelines for performing DGA detection",
  "version": 1,
  "processors": [
    {
      "pipeline": {
        "if": "ctx.containsKey('dns') && ctx['dns'].containsKey('question') && ctx['dns']['question'].containsKey('registered_domain') && !ctx['dns']['question']['registered_domain'].empty",
        "name": "dga_ngram_expansion_inference"
      }
    }
  ]
}
```

In the conditional above, we first check whether the Packetbeat document contains the nested structure `dns.question.registered_domain`, and then do a further check to make sure the field is not empty.

Alternatively, one could check `"if": "ctx?.type=='dns'"` in the conditional.
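
For completeness: one way to route incoming Packetbeat documents through this top-level pipeline is to make it the default pipeline of the Packetbeat indices. A minimal sketch, assuming your indices match the illustrative pattern `packetbeat-*`:

```
PUT packetbeat-*/_settings
{
  "index.default_pipeline": "dns_classification_pipeline"
}
```

Note that this updates existing indices only; to cover indices created later (for example by daily rollover), the same setting would go into the relevant index template.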

For a production use case, please also make sure you think about error handling in the ingest pipeline.
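
As a starting point for that error handling: ingest pipelines accept an `on_failure` block, either per processor or at the pipeline level, and the failure message is available in the `_ingest.on_failure_message` metadata field. The sketch below re-registers the top-level pipeline with a pipeline-level handler; the field name `dga_detection.failure_reason` is just an illustrative choice:

```
PUT _ingest/pipeline/dns_classification_pipeline
{
  "description": "A pipeline of pipelines for performing DGA detection",
  "version": 2,
  "processors": [
    {
      "pipeline": {
        "if": "ctx.containsKey('dns') && ctx['dns'].containsKey('question') && ctx['dns']['question'].containsKey('registered_domain') && !ctx['dns']['question']['registered_domain'].empty",
        "name": "dga_ngram_expansion_inference"
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "field": "dga_detection.failure_reason",
        "value": "{{ _ingest.on_failure_message }}"
      }
    }
  ]
}
```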
