Heterogeneous Benchmark and Multi-Branch Baseline for Cross-Domain IoT Botnet Detection
Submitted to Journal of Network and Computer Applications
Authors: Ammar Bhilwarawala, Likhamba Rongmei, Harsh Sharma, Arya Jena, Kaushal Singh, Jayashree Piri, Raghunath Dey
Affiliation: School of Computer Engineering, KIIT University, Bhubaneswar, India
The IoT botnet detection field has a quiet problem that's been building for years. Almost every published system trains on one dataset, reports numbers in the high 90s, and treats that as evidence of a solved problem. But models trained in one environment fall apart when you point them at a different one — different capture tools, different devices, different attack toolkits. The benchmark looked easy because it was a closed world.
This paper makes two contributions toward fixing that.
BRIDGE (Benchmark Reference for IoT Domain Generalisation Evaluation) is the first formally specified heterogeneous multi-dataset benchmark for IoT intrusion detection. It unifies five structurally distinct public datasets through a 46-feature semantic canonical vocabulary, with genuine-equivalence-only feature mapping, explicit zero-filling for absent features, and full coverage disclosure for every dataset. The leave-one-dataset-out (LODO) evaluation protocol establishes, for the first time with a formally reproducible methodology, just how large the cross-dataset generalisation gap is: all five evaluated deep learning architectures achieve LODO F1 between 0.39–0.47, and TCH-Net achieves the highest at 0.5577 — the first formally quantified community generalisation baseline in heterogeneous IoT intrusion detection.
TCH-Net is a multi-branch neural architecture proposed as a strong and well-characterised baseline for BRIDGE. It integrates three parallel branches — a three-path Temporal branch with residual convolutional-BiGRU, stride-downsampled BiGRU, and full-resolution pre-LayerNorm Transformer; a provenance-conditioned Contextual branch; and an aggregate Statistical branch — fused via Cross-Branch Gated Attention Fusion (CB-GAF) with learnable per-branch sigmoid vector gates. Evaluated across five independent seeds on BRIDGE, TCH-Net achieves F1 = 0.8296 ± 0.0028, AUC = 0.9380 ± 0.0025, and MCC = 0.6972 ± 0.0056, outperforming all 12 baseline models with statistical significance (p < 0.05).
| Model | F1 | ROC-AUC | MCC | PR-AUC |
|---|---|---|---|---|
| TCH-Net (Ours) | 0.8296 ± 0.0028 | 0.9380 | 0.6972 | 0.8912 |
| Transformer-IDS | 0.7958 ± 0.0030 | 0.9147 | 0.6255 | 0.8699 |
| 1D-CNN-IDS | 0.7932 ± 0.0076 | 0.9076 | 0.6213 | 0.8601 |
| CNN-LSTM | 0.7919 ± 0.0137 | 0.9056 | 0.6208 | 0.8578 |
| BiLSTM-IDS | 0.7805 ± 0.0010 | 0.8975 | 0.5972 | 0.8441 |
| BiGRU-IDS | 0.7805 ± 0.0011 | 0.8962 | 0.5987 | 0.8438 |
| DeepDefense | 0.7627 ± 0.0011 | 0.8776 | 0.5638 | 0.8192 |
| XGBoost | 0.7265 ± 0.0014 | 0.8704 | 0.5542 | 0.7989 |
| GraphSAGE-Approx | 0.7097 ± 0.0004 | 0.8259 | 0.4465 | 0.7403 |
| Kitsune-AE | 0.7045 ± 0.0007 | 0.8200 | 0.4362 | 0.7348 |
| MLP-IDS | 0.7039 ± 0.0008 | 0.8152 | 0.4348 | 0.7311 |
| IoT-DNN | 0.7009 ± 0.0002 | 0.8146 | 0.4278 | 0.7286 |
| Random Forest | 0.4323 ± 0.0082 | 0.8005 | 0.3557 | 0.6214 |
| Held-Out Dataset | TCH-Net LODO F1 | Best Baseline LODO F1 |
|---|---|---|
| CICIDS-2017 | 0.3128 | 0.19 (BiGRU-IDS) |
| CIC-IoT-2023 | 0.6013 | 0.47 (CNN-LSTM) |
| Bot-IoT | 0.5934 | 0.50 (CNN-LSTM) |
| Edge-IIoTset | 0.6791 | 0.56 (CNN-LSTM) |
| N-BaIoT | 0.6021 | 0.47 (BiGRU-IDS) |
| Mean | 0.5577 | 0.4297 |
Generalisation gap (in-distribution vs LODO): +0.2719. This is not a TCH-Net failure — all architectures show the same structural gap. The 0.5577 LODO mean is the first formally specified community baseline for cross-dataset IoT intrusion detection.
TCH-Net/
│
├── BRIDGE.csv # Unified 549 MB benchmark dataset
│
├── BRIDGE and TCH-Net (FULL PAPER).ipynb # Complete experimental notebook
│ # All 12 baselines, branch ablation,
│ # novelty ablation, LODO, temporal split,
│ # adversarial robustness, HP sensitivity
│
├── bridge-and-tch-net.ipynb # Clean training-only notebook
│ # TCH-Net, 5 seeds, saves checkpoints
│
├── checkpoints/
│ ├── tch_net_best.pth # Best checkpoint (highest F1)
│ ├── tch_net_seed_42.pth
│ ├── tch_net_seed_123.pth
│ ├── tch_net_seed_456.pth
│ ├── tch_net_seed_789.pth
│ ├── tch_net_seed_2024.pth
│ ├── scaler.pkl # RobustScaler — required for inference
│ └── manifest.json # Config, metrics, feature names
│
└── README.md
BRIDGE maps five publicly available IoT network security datasets into a single shared 46-feature canonical vocabulary based on CICFlowMeter nomenclature. Features that don't exist in a given dataset are zero-filled explicitly — nothing is fabricated, and coverage is fully disclosed.
| Dataset | Capture Tool | Year | Coverage | Role | Official | Kaggle |
|---|---|---|---|---|---|---|
| CICIDS-2017 | CICFlowMeter | 2017 | 93% (43/46) | Primary | UNB CIC | Kaggle |
| CIC-IoT-2023 | CICFlowMeter | 2023 | 87% (40/46) | Primary | UNB CIC | Kaggle |
| Bot-IoT | Argus | 2019 | 39% (18/46) | Primary | UNSW | Kaggle |
| Edge-IIoTset | Wireshark | 2022 | 22% (10/46) | Supplementary | IEEE DataPort | Kaggle |
| N-BaIoT | Kitsune | 2018 | 15% (7/46) | Supplementary | UCI | Kaggle |
| Dataset | Benign | Attack | Total | Atk% |
|---|---|---|---|---|
| CICIDS-2017 | 19,321 | 14,350 | 33,671 | 42.6% |
| CIC-IoT-2023 | 3,964 | 3,001 | 6,965 | 43.1% |
| Bot-IoT | 22 | 16 | 38 | 42.1% |
| Edge-IIoTset | 30,951 | 23,435 | 54,386 | 43.1% |
| N-BaIoT | 13,557 | 10,055 | 23,612 | 42.6% |
| Combined | 67,815 | 50,857 | 118,672 | 42.9% |
| Group | Category | Indices | Count |
|---|---|---|---|
| 1 | Flow rates, durations, pkt/byte counts | 0–16 | 17 |
| 2 | Packet size & IAT statistics | 17–37 | 21 |
| 3 | TCP flag indicators | 38–43 | 6 |
| 4 | Header length & window size | 44–45 | 2 |
Full feature list: flow_duration, pkt_count_fwd, pkt_count_bwd, byte_count_fwd, byte_count_bwd, pkt_rate, byte_rate, fwd_pkt_rate, bwd_pkt_rate, fwd_byte_rate, bwd_byte_rate, pkt_count_total, byte_count_total, fwd_pkt_len_total, bwd_pkt_len_total, subflow_fwd_pkts, subflow_bwd_pkts, pkt_len_min, pkt_len_max, pkt_len_mean, pkt_len_std, pkt_len_var, fwd_pkt_len_min, fwd_pkt_len_max, fwd_pkt_len_mean, fwd_pkt_len_std, bwd_pkt_len_min, bwd_pkt_len_max, bwd_pkt_len_mean, bwd_pkt_len_std, iat_mean, iat_std, iat_max, iat_min, fwd_iat_mean, fwd_iat_std, bwd_iat_mean, bwd_iat_std, flag_syn, flag_ack, flag_fin, flag_rst, flag_psh, flag_urg, fwd_header_len, init_win_fwd
import pandas as pd
df = pd.read_csv("BRIDGE.csv")
# Columns: 46 canonical features + 'label' (0=benign, 1=attack)
# + 'dataset_source' (which of the 5 datasets each row came from)Or via Hugging Face:
from datasets import load_dataset
dataset = load_dataset("Ammar-ss/BRIDGE")Three branches process a (B, 32, 46) sequence window in parallel, fused through CB-GAF.
Input: (B, 32, 46) — batch × 32-step window × 46 canonical features
Shared Feature Projection (residual):
Linear(46→92) → LayerNorm → GELU → Dropout
→ Linear(92→46) → LayerNorm [X̃ = X + f_proj(X)]
T-Branch (Temporal) — 512d:
Path 1: ResConvSE×3 + MaxPool → BiGRU(128/dir, 2L) → 8×256
Path 2: StrideConv(s=2) → BiGRU(64/dir, 1L) → AvgPool(8) → 8×128
Path 3: Linear(46→128) + LearnablePE
→ TransformerEncoder(Pre-LN, 2L, 8H) → AvgPool(8) → 8×128
Merge: concat → LayerNorm(512) → MHA(8H) → mean pool → 512d
H-Branch (Statistical) — 64d:
mean(X̃, time) → MLP(46→128→64, BN+GELU+Dropout) → 64d
C-Branch (Contextual) — 64d:
Embed_dataset(5,32) ‖ Embed_device(6,32) → 64d
CB-GAF (Cross-Branch Gated Attention Fusion):
Each branch → Linear projection → 128d
Each branch queries both others via cross-attention
Per-branch vector gate g^i ∈ (0,1)^128 (feature-wise, not scalar)
x_fused = g^i ⊙ x_self + (1 − g^i) ⊙ x_cross
concat(T, C, H fused) → LayerNorm → 384d
Classifier (residual head):
raw_proj(mean X̃) → 64d
concat(384d, 64d) → Linear(448→256) → Linear(256→128) + skip
→ Linear(128→2) → softmax
Aux Decoder (training only, λ=0.05):
MLP(384→64→46) — prevents information collapse in CB-GAF
2,692,696 trainable parameters. Single-sample inference latency: 6.43 ± 0.18ms on NVIDIA Tesla T4.
pip install torch numpy pandas scikit-learn huggingface_hub tqdmOpen bridge-and-tch-net.ipynb on Kaggle (GPU recommended). Update the dataset paths in Cell 3:
class Config:
CICIDS_PATH = "/kaggle/input/datasets/dhoogla/cicids2017"
CICIOT_PATH = "/kaggle/input/datasets/raqeeb24/ciciot-2023-stratified-dataset"
BOTIOT_PATH = "/kaggle/input/datasets/vigneshvenkateswaran/bot-iot-5-data"
EDGE_PATH = "/kaggle/input/datasets/mohamedamineferrag/edgeiiotset-cyber-security-dataset-of-iot-iiot"
NBAIOT_PATH = "/kaggle/input/datasets/mkashifn/nbaiot-dataset"Run all cells. Checkpoints save automatically to tch_net_results/checkpoints/.
import torch
import pickle
import numpy as np
import torch.nn.functional as F
from huggingface_hub import hf_hub_download
# Download pretrained weights and scaler
ckpt_path = hf_hub_download("Ammar-ss/BRIDGE_and_TCH-Net", "tch_net_best.pth")
scaler_path = hf_hub_download("Ammar-ss/BRIDGE_and_TCH-Net", "scaler.pkl")
# Load scaler
with open(scaler_path, 'rb') as f:
scaler = pickle.load(f)
# Load checkpoint
ckpt = torch.load(ckpt_path, map_location='cpu')
config = ckpt['config']
# Instantiate model (copy TCHNet class from bridge-and-tch-net.ipynb)
model = TCHNet(
nf=config['n_features'], # 46
ws=config['window_size'], # 32
nc=config['n_classes'], # 2
)
model.load_state_dict(ckpt['state_dict'])
model.eval()
# Preprocess: scale raw features
# X_raw: np.ndarray of shape (N, 46)
X_scaled = np.clip(scaler.transform(X_raw), -10, 10).astype(np.float32)
# Inference
# x: FloatTensor (B, 32, 46) — sliding window sequences
# ctx: LongTensor (B, 2) — [dataset_source_id, device_category_id]
#
# dataset_source_id: 0=CICIDS-2017 1=CIC-IoT-2023 2=Bot-IoT
# 3=Edge-IIoTset 4=N-BaIoT
# device_category_id: 0=sensor 1=camera 2=appliance 3=IIoT 4=server 5=unknown
#
# If unknown, pass ctx = torch.zeros(B, 2, dtype=torch.long)
# (contextual branch has no independent predictive power — degrades gracefully)
with torch.no_grad():
logits, _ = model(x, ctx)
probs = F.softmax(logits, dim=-1)
preds = logits.argmax(dim=-1) # 0=benign, 1=attack| Variant | F1 | ΔF1 |
|---|---|---|
| T+C+H Full | 0.8296 | — |
| T+H | 0.7756 | −0.054 |
| T+C | 0.7752 | −0.054 |
| T only | 0.7753 | −0.054 |
| H only | 0.7054 | −0.124 |
| C+H | 0.7061 | −0.124 |
| C only | 0.6000 | −0.230 |
C-branch alone = near-random (AUC ≈ 0.50). Its role is fusion conditioning, not independent prediction.
| Variant | F1 | ΔF1 |
|---|---|---|
| Full TCH-Net | 0.8296 | — |
| w/o CB-GAF | 0.7759 | −0.054 |
| w/o MSTE (Three-Path) | 0.7760 | −0.054 |
| w/o Aux Loss | 0.7755 | −0.054 |
| w/o All (v2) | 0.7752 | −0.054 |
Removing any single novel component costs ~0.054 F1.
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 5×10⁻⁴ |
| Weight decay | 5×10⁻⁵ |
| Scheduler | Cosine annealing, 2-epoch warmup |
| Loss | Focal (γ=2.0, α-weighted, ε=0.05) + Aux (λ=0.05) |
| Batch size | 512 |
| Max epochs / patience | 30 / 5 |
| Window size / stride | 32 / 4 |
| Max train sequences | 800,000 |
| Max test sequences | 200,000 |
| Dropout | 0.15 |
| Augmentation | Gaussian noise (σ=0.01, p=0.30, train only) |
| AMP | fp16 on CUDA |
| Hardware | NVIDIA Tesla T4 (Kaggle) |
| Model | Params | Latency | Throughput | Memory | F1 |
|---|---|---|---|---|---|
| TCH-Net | 2.692M | 6.43±0.18ms | 20.5k sps | 10.27MB | 0.8296 |
| Transformer-IDS | 0.618M | 1.22±0.03ms | 36.8k sps | 2.36MB | 0.7958 |
| BiLSTM-IDS | 0.609M | 0.74±0.02ms | 34.2k sps | 2.32MB | 0.7805 |
| 1D-CNN-IDS | 0.068M | 0.69±0.03ms | 406.9k sps | 0.26MB | 0.7932 |
| CNN-LSTM | 0.142M | 0.88±0.05ms | 273.5k sps | 0.54MB | 0.7919 |
Cross-dataset generalisation. The LODO mean F1 of 0.5577 confirms TCH-Net doesn't generalise robustly to entirely unseen network environments in its current form. This isn't a TCH-Net-specific failure — all 12 evaluated baselines show the same structural gap. Deployment in a new environment requires fine-tuning on local traffic or domain adaptation components not present in the current architecture.
Testbed data. All five source datasets were collected in controlled environments, not live operational networks.
Binary classification only. TCH-Net is evaluated as a binary detector (benign vs. attack). Multi-class attack type identification is a natural extension not covered here.
Edge deployment. The 10.27MB footprint and 6.43ms latency are fine for NVIDIA Jetson hardware. For microcontroller-class endpoints (Cortex-M, ESP32), quantisation or knowledge distillation would be needed.
If you use BRIDGE, TCH-Net, or the canonical vocabulary in your work, please cite:
@article{bhilwarawala2026bridge,
title = {{BRIDGE} and {TCH-Net}: Heterogeneous Benchmark and Multi-Branch
Baseline for Cross-Domain {IoT} Botnet Detection},
author = {Bhilwarawala, Ammar and Rongmei, Likhamba and Sharma, Harsh
and Jena, Arya and Singh, Kaushal and Piri, Jayashree and Dey, Raghunath},
journal = {arXiv preprint arXiv:2604.11324},
year = {2026}
}This repository is released under the Apache 2.0 License.
The five source datasets (CICIDS-2017, CIC-IoT-2023, Bot-IoT, Edge-IIoTset, N-BaIoT) are subject to their own respective licenses. Please consult each dataset's original source before commercial or derivative use.
We thank the creators of CICIDS-2017, CIC-IoT-2023, Bot-IoT, Edge-IIoTset, and N-BaIoT for making their datasets publicly available. Experiments were conducted on Kaggle with NVIDIA Tesla T4 GPU instances.