# Supplementary materials for "Machine learning in cybersecurity: Detecting DGA activity in network data"

This folder contains the supplementary materials for the blog post "Machine learning in cybersecurity: Detecting DGA activity in network data".
These configurations have been tested on Elasticsearch version 7.6.2.

## Painless Script for Extracting Unigrams, Bigrams and Trigrams from Packetbeat Data

Because our model was trained on unigrams, bigrams and trigrams, we have to extract these same features from any new domains we wish to score with the model. Hence, before passing the domains from Packetbeat DNS requests into the Inference processor, we first have to pass them through a Painless script processor that invokes the stored script below.

```
POST _scripts/ngram-extractor-packetbeat
{
  "script": {
    "lang": "painless",
    "source": """
    String nGramAtPosition(String fulldomain, int fieldcount, int n) {
      // Work on the leftmost label of the domain only, e.g. 'elastic' from 'elastic.co'
      String domain = fulldomain.splitOnToken('.')[0];
      // Positions whose n-gram would reach the end of the label yield an empty string
      if (fieldcount + n >= domain.length()) {
        return '';
      } else {
        return domain.substring(fieldcount, fieldcount + n);
      }
    }
    // Emit one feature field per position, e.g. field_2_gram_0, field_2_gram_1, ...
    for (int i = 0; i < ctx['dns']['question']['registered_domain'].length(); i++) {
      ctx['field_' + Integer.toString(params.ngram_count) + '_gram_' + Integer.toString(i)] = nGramAtPosition(ctx['dns']['question']['registered_domain'], i, params.ngram_count);
    }
    """
  }
}
```
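
Before wiring the script into a pipeline, you can sanity-check it with the simulate API against a hand-built document. The domain `elastic.co` below is just a stand-in; with `ngram_count: 2`, the response should contain `field_2_gram_0: "el"`, `field_2_gram_1: "la"`, and so on, with empty strings at the remaining positions.

```
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "script": {
          "id": "ngram-extractor-packetbeat",
          "params": {
            "ngram_count": 2
          }
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "dns": {
          "question": {
            "registered_domain": "elastic.co"
          }
        }
      }
    }
  ]
}
```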

## Painless Script for Removing Unigrams, Bigrams and Trigrams from Packetbeat Data

Since we don't want the extra unigrams, bigrams and trigrams to be ingested together with our Packetbeat data, we will also configure a script to remove these features. This script will be invoked in the ingest pipeline after the Inference processor. Of course, if you wish to keep the features in the documents, you can simply leave this script out of your ingest pipeline configuration.

```
POST _scripts/ngram-remover-packetbeat
{
  "script": {
    "lang": "painless",
    "source": """
    // Remove the n-gram feature fields added by ngram-extractor-packetbeat
    for (int i = 0; i < ctx['dns']['question']['registered_domain'].length(); i++) {
      ctx.remove('field_' + Integer.toString(params.ngram_count) + '_gram_' + Integer.toString(i));
    }
    """
  }
}
```
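
If you want to confirm that both scripts were stored correctly, you can retrieve them by ID:

```
GET _scripts/ngram-extractor-packetbeat
GET _scripts/ngram-remover-packetbeat
```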

## Ingest Pipeline Configuration for Packetbeat DNS Data

Once we have stored both of the Painless scripts above, we can move on to configuring the ingest pipeline for the DNS data. Since we are only interested in performing classification on DNS data, we will, later in this document, show you how to make the ingest pipeline below execute conditionally, only if the required DNS fields are present in the Packetbeat document. For now, let's assume the document redirected to the pipeline has the required DNS fields.

First, let's get the model ID. We will need this to configure our Inference processor.
You can obtain the model ID by using the Kibana Dev Console and running the command:

```
GET _ml/inference
```

You can then scroll through the response to find the model you trained for DGA detection.
Make note of the model ID value. Below is a snippet of the model data showing the
`model_id` field.

```
    {
      "model_id" : "dga-ngram-job-1587729368929",
      "created_by" : "_xpack",
      "version" : "7.6.2",
      "description" : "",
      "create_time" : 1587729368929,
      "tags" : [
        "dga-ngram-job"
      ],
```

If you have many models in your cluster, it can be easier to use part of your ML job's name in a search pattern like this:

```
GET _ml/inference/dga-ngram-job*
```

Once we have the model ID, we can configure the ingest pipeline for DNS data as shown below.

```
PUT _ingest/pipeline/dga_ngram_expansion_inference
{
  "description": "Expands a domain into unigrams, bigrams and trigrams and makes a prediction of maliciousness",
  "processors": [
    {
      "script": {
        "id": "ngram-extractor-packetbeat",
        "params": {
          "ngram_count": 1
        }
      }
    },
    {
      "script": {
        "id": "ngram-extractor-packetbeat",
        "params": {
          "ngram_count": 2
        }
      }
    },
    {
      "script": {
        "id": "ngram-extractor-packetbeat",
        "params": {
          "ngram_count": 3
        }
      }
    },
    {
      "inference": {
        "model_id": "dga-ngram-job-1587729368929",
        "target_field": "predicted_label",
        "field_mappings": {},
        "inference_config": { "classification": { "num_top_classes": 2 } }
      }
    },
    {
      "script": {
        "id": "ngram-remover-packetbeat",
        "params": {
          "ngram_count": 1
        }
      }
    },
    {
      "script": {
        "id": "ngram-remover-packetbeat",
        "params": {
          "ngram_count": 2
        }
      }
    },
    {
      "script": {
        "id": "ngram-remover-packetbeat",
        "params": {
          "ngram_count": 3
        }
      }
    }
  ]
}
```

In the pipeline above, the first three processors invoke the same Painless script `ngram-extractor-packetbeat` to extract unigrams, bigrams and trigrams respectively (note the `ngram_count` parameter, which varies from processor to processor). They are followed by the Inference processor, which references our trained model (your model ID will be different). Finally, three more script processors invoke the Painless script `ngram-remover-packetbeat` to strip out the n-gram features once inference is complete.
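
Before sending live Packetbeat traffic through the pipeline, it is worth exercising it end to end with the simulate API (again, your model ID will differ). The domain below is a made-up example of the sort of random-looking name a DGA might produce; in the response, each document should gain a `predicted_label` field with the model's prediction, while the intermediate n-gram fields should be gone.

```
POST _ingest/pipeline/dga_ngram_expansion_inference/_simulate
{
  "docs": [
    {
      "_source": {
        "dns": {
          "question": {
            "registered_domain": "iglxbvkw.net"
          }
        }
      }
    }
  ]
}
```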

## Conditional Ingest Pipeline Execution

Not every document ingested through Packetbeat will describe a DNS flow. Hence, it would be ideal to make the pipeline we configured above execute conditionally, only when our document contains the desired fields. There are a couple of ways to achieve this. Both use a Pipeline processor whose `if` condition inspects the Packetbeat document before deciding whether or not to direct it to the pipeline that contains our Inference processor.

```
PUT _ingest/pipeline/dns_classification_pipeline
{
  "description": "A pipeline of pipelines for performing DGA detection",
  "version": 1,
  "processors": [
    {
      "pipeline": {
        "if": "ctx.containsKey('dns') && ctx['dns'].containsKey('question') && ctx['dns']['question'].containsKey('registered_domain') && !ctx['dns']['question']['registered_domain'].empty",
        "name": "dga_ngram_expansion_inference"
      }
    }
  ]
}
```

In the conditional above, we first check whether the Packetbeat document contains the nested structure `dns.question.registered_domain` and then do a further check to make sure the field is not empty.

Alternatively, one could use `"if": "ctx?.type == 'dns'"` as the condition.
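
Once the entry pipeline exists, Packetbeat events still need to be routed through it. A minimal sketch, assuming Packetbeat ships directly to Elasticsearch (the host below is a placeholder), is to set the `pipeline` option in the `output.elasticsearch` section of `packetbeat.yml`:

```
output.elasticsearch:
  hosts: ["localhost:9200"]
  pipeline: dns_classification_pipeline
```

Alternatively, the `index.default_pipeline` index setting achieves the same result without touching the Beats configuration.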
For a production use case, please also make sure you think about error handling in the ingest pipeline; by default, a failure in any processor causes the whole document to be rejected.
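
One option is to re-create the inference pipeline with a pipeline-level `on_failure` handler, which records the error on the document instead of rejecting it. The `error.ingest_failure` field name below is just an illustrative choice.

```
PUT _ingest/pipeline/dga_ngram_expansion_inference
{
  "description": "Expands a domain into unigrams, bigrams and trigrams and makes a prediction of maliciousness",
  "processors": [
    { "script": { "id": "ngram-extractor-packetbeat", "params": { "ngram_count": 1 } } },
    { "script": { "id": "ngram-extractor-packetbeat", "params": { "ngram_count": 2 } } },
    { "script": { "id": "ngram-extractor-packetbeat", "params": { "ngram_count": 3 } } },
    { "inference": { "model_id": "dga-ngram-job-1587729368929", "target_field": "predicted_label", "field_mappings": {}, "inference_config": { "classification": { "num_top_classes": 2 } } } },
    { "script": { "id": "ngram-remover-packetbeat", "params": { "ngram_count": 1 } } },
    { "script": { "id": "ngram-remover-packetbeat", "params": { "ngram_count": 2 } } },
    { "script": { "id": "ngram-remover-packetbeat", "params": { "ngram_count": 3 } } }
  ],
  "on_failure": [
    { "set": { "field": "error.ingest_failure", "value": "{{ _ingest.on_failure_message }}" } }
  ]
}
```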