First, we estimate the entropy of each nybble in the IPv6 addresses, across the whole dataset. For example, if the last nybble is highly variable, its entropy will be high; conversely, the entropy will be zero for nybbles that stay constant across the dataset. Below we plot the normalized entropy for each of the 32 nybbles, along with the 4-bit Aggregate Count Ratio introduced by Plonka and Berger (2015).
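As a rough illustration of this step, the sketch below computes the normalized per-nybble entropy. It is a minimal sketch, assuming addresses expanded to 32 hex characters, and is not the exact code behind this report.

```python
# Minimal sketch: normalized Shannon entropy of each nybble position,
# assuming fully expanded 32-character hex addresses (no colons).
import ipaddress
import math
from collections import Counter

def expand(addr):
    """Return an IPv6 address as 32 hex characters, e.g. '2001:db8::1' -> '20010db8...0001'."""
    return ipaddress.IPv6Address(addr).exploded.replace(":", "")

def nybble_entropies(addresses):
    """Return 32 entropies in [0, 1], one per hex character position."""
    entropies = []
    for pos in range(32):
        counts = Counter(addr[pos] for addr in addresses)
        total = sum(counts.values())
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(h / 4.0)  # normalize by log2(16) = 4 bits per nybble
    return entropies

# Example with two addresses under the 2001:db8::/32 documentation prefix
addrs = [expand("2001:db8:10::1"), expand("2001:db8:10::a2f")]
print(nybble_entropies(addrs))
```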
Second, we group adjacent nybbles with similar entropy into larger segments, with the expectation that they correspond to semantically different parts of each address. We label these segments with letters and mark them with dashed lines in the plot below.
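A minimal sketch of one possible grouping rule, assuming a simple threshold on the entropy difference between neighbouring nybbles; the threshold and the exact criterion used for this report may differ.

```python
# Sketch: merge adjacent nybbles into segments while their entropies stay close.
# The threshold is an illustrative assumption, not the value used in the report.
def segment_nybbles(entropies, threshold=0.05):
    """Return (start, end) nybble index ranges, end exclusive."""
    segments, start = [], 0
    for i in range(1, len(entropies)):
        if abs(entropies[i] - entropies[i - 1]) > threshold:
            segments.append((start, i))  # close the current segment
            start = i                    # begin a new one
    segments.append((start, len(entropies)))
    return segments
```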
Next, we search the segments for the most popular values and ranges of values within them. For that purpose, we analyze the distribution and frequency of values inside each segment, using statistical methods for detecting outliers and the DBSCAN machine-learning algorithm.
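The sketch below illustrates the idea for a single segment: a frequency cut-off picks out outstanding individual values, and a one-dimensional DBSCAN (from scikit-learn) finds dense ranges among the rest. The `top_freq`, `eps`, and `min_samples` parameters are illustrative assumptions, not the settings used for this report.

```python
# Sketch: mine popular values and dense ranges within one address segment.
import numpy as np
from sklearn.cluster import DBSCAN

def mine_segment(values, top_freq=0.01, eps=16, min_samples=8):
    """values: the hex string observed in one segment, one entry per address."""
    n = len(values)
    ints = np.array([int(v, 16) for v in values])
    uniq, counts = np.unique(ints, return_counts=True)

    # Outstanding single values: anything above the frequency cut-off.
    popular_ints = uniq[counts / n >= top_freq]
    popular = {format(u, 'x'): c / n
               for u, c in zip(uniq, counts) if u in set(popular_ints)}

    # Dense ranges: DBSCAN on the remaining values, treated as 1-D points.
    rest = ints[~np.isin(ints, popular_ints)]
    ranges = []
    if len(rest) >= min_samples:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(rest.reshape(-1, 1))
        for lab in set(labels) - {-1}:                  # -1 marks DBSCAN noise
            members = rest[labels == lab]
            ranges.append((format(members.min(), 'x'),  # low bound
                           format(members.max(), 'x'),  # high bound
                           len(members) / n))           # relative frequency
    return popular, ranges
```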
Below we present the results. Ranges of values are marked with an asterisk and shown as two bounds (low to high); the last column gives the relative frequency across the whole dataset. The /32 prefixes are anonymized.
A: bits 0-32 (hex chars 1-8)
- 20010db8 63.50%
- 30010db8 36.50%
B: bits 32-40 (hex chars 9-10)
- 10 77.80%
- 08 15.42%
- 09 5.05%
- 07 0.70%
- 00 0.55%
- 05 0.47%
C: bits 40-48 (hex chars 11-12)
- 00 67.02%
- 01 11.13%
- c2 0.67%
- fe 0.41%
- ff 0.41%
- * 02-5b 11.94%
- * 5c-fd 8.42%
D: bits 48-52 (hex chars 13-13)
- 0 10.10%
- 5 9.24%
- 4 9.11%
- 2 9.05%
- 1 8.90%
- * 3-f 53.61%
E: bits 52-56 (hex chars 14-14)
- 0 69.69%
- 1 5.41%
- 2 4.72%
- 3 3.75%
- 5 2.23%
- * 4-f 14.20%
F: bits 56-64 (hex chars 15-16)
- 00 14.18%
- 53 0.65%
- * 01-ff 85.17%
G: bits 64-116 (hex chars 17-29)
- 0000000000000 0.29%
- 0127016000630 0.11%
- 0127020800160 0.11%
- 0127020801800 0.08%
- 0127007100620 0.07%
- 0127022700290 0.06%
- 0127016001550 0.06%
- 0127016001130 0.06%
- 0127016000620 0.06%
- 0127022702170 0.06%
- * 0000000000001-0000000000af0 13.02%
- * 0000d9a050050-0000d9a053f90 0.39%
- * 0127022701090-0127022701270 0.20%
- * 010332b0b1e17-fffd8c3ab1643 84.90%
- * 0000000001a10-00fd12c41fce6 0.55%
H: bits 116-120 (hex chars 30-30)
- 0 49.51%
- 8 37.35%
- * 1-f 13.14%
I: bits 120-124 (hex chars 31-31)
- 0 51.62%
- 1 19.90%
- 2 9.63%
- 3 4.46%
- 4 2.38%
- * 5-f 12.02%
J: bits 124-128 (hex chars 32-32)
- 0 16.44%
- 1 8.20%
- 2 7.69%
- 3 6.93%
- 4 6.54%
- * 5-f 54.21%
Next, we search for statistical dependencies between the segments. For that purpose, we learn a Bayesian Network (BN) from the data.
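For illustration only, the sketch below learns such a model from a per-segment table using pgmpy with hill-climbing search and a BIC score; the library, the scoring setup, and the `segments.csv` input file are assumptions, not the exact pipeline behind this report.

```python
# Sketch: learn a BN over the segment columns with pgmpy (an assumption; the
# report does not prescribe a specific library). One row per address, one
# column per segment, e.g. df['A'] in {'20010db8', '30010db8'}.
import pandas as pd
from pgmpy.estimators import BicScore, HillClimbSearch, MaximumLikelihoodEstimator
from pgmpy.models import BayesianNetwork

df = pd.read_csv("segments.csv", dtype=str)   # hypothetical per-segment table

structure = HillClimbSearch(df).estimate(scoring_method=BicScore(df))
bn = BayesianNetwork(structure.edges())
bn.fit(df, estimator=MaximumLikelihoodEstimator)  # estimate the conditional probability tables
print(sorted(bn.edges()))                         # learned direct dependencies
```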
Below we show the structure of the resulting BN model. Arrows indicate direct statistical influence. Note that directly connected segments can probabilistically influence each other in both directions (upstream and downstream). Under some conditions, segments without a direct connection can still influence each other through intermediate segments: e.g., A can influence C through B if C depends on B and B depends on A (even if there is no direct arrow between A and C).
Learning a BN structure from data is, in general, a hard optimization problem that is typically solved with heuristic search. Hence, there may be more than one plausible BN structure graph for the same dataset.
Finally, below we show an interactive browser that decomposes IPv6 addresses into segments, values, ranges, and their corresponding probabilities. The browser lets you explore the underlying BN model and see how certain segment values probabilistically influence the other segments.
Try clicking on the colored boxes below. You should see the colors change, which reflects the fact that some segment values can make other values more (or less) likely. For instance, in the Sample Report, you may find that clicking on J1 (i.e., the first value in segment J) makes segments C, D, F, H, and I largely predictable (see our paper for more examples).
You may condition the model on multiple segment values at once. Clicking on a selected value deselects it, and clicking on the red "Clear" above the color map deselects them all. Below the browser we show the estimated proportion of the dataset addresses that match your selection. If the browser cannot estimate the probabilities in a reasonable time, it asks for confirmation before trying harder.
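Conceptually, clicking a value amounts to conditioning the BN on that segment value and querying the posterior distributions of the remaining segments. A minimal sketch, reusing the hypothetical pgmpy model from the earlier example (the evidence value is illustrative):

```python
# Sketch: "clicking a value" = conditioning the BN on a segment value and
# recomputing another segment's distribution. Uses the pgmpy model `bn`
# from the previous sketch.
from pgmpy.inference import VariableElimination

infer = VariableElimination(bn)

# Fix segment J to one of its values and see how segment C reacts.
posterior_c = infer.query(variables=["C"], evidence={"J": "0"})
print(posterior_c)
```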
Using the BN model, below we generate a few candidate target IPv6 addresses matching the selection above. Note that we anonymize the IPv6 addresses in this report.
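One straightforward way to generate such candidates is to sample from the learned BN and keep the samples that match the selection. The sketch below reuses the hypothetical pgmpy model, assumes the A-J segment layout reported above, and ignores range-encoded segment values for brevity.

```python
# Sketch: sample candidate addresses from the learned BN and keep those that
# match the selection (forward sampling + filtering; simple, not the most
# efficient approach).
from pgmpy.sampling import BayesianModelSampling

samples = BayesianModelSampling(bn).forward_sample(size=100_000)
matching = samples[samples["J"] == "0"]   # keep only the selected value(s)

def to_address(row):
    # Concatenate the segment values A..J back into 32 hex characters.
    # NOTE: range-encoded values (e.g. '02-5b') would first have to be
    # expanded to a concrete value; this is omitted here for brevity.
    hexstr = "".join(row[seg] for seg in "ABCDEFGHIJ")
    return ":".join(hexstr[i:i + 4] for i in range(0, 32, 4))

candidates = matching.head(10).apply(to_address, axis=1)
print(candidates.to_list())
```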
As we show in the paper, this technique allowed us to successfully scan IPv6 networks of servers and routers, and to predict the IPv6 network identifiers of active client IPv6 addresses.