I Built a Botnet to Catch One and Here's What I Learned
A security research project using BYOB, Ubuntu Multipass VMs, and machine learning to detect malicious network traffic
There is a notable irony in constructing the very threats one aims to defend against. This project embodied that concept: it involved deploying a botnet within a controlled environment, monitoring its network impact, and subsequently training a machine learning model to detect it. While machine learning-based traffic inspection has been utilized by enterprise solutions like Zscaler and Palo Alto for years, these tools are often prohibitively expensive and designed for large-scale operations. The core objective of this research was more fundamental: determining whether a botnet can be detected purely through statistical analysis of traffic behavior, without the need for deep packet inspection.
In the lab setting, the answer is largely yes: a classifier trained only on per-flow statistics reached 86.23% accuracy on held-out data.
Setting Up the Lab Environment
To ensure security, no physical machines were used for deployment. The entire environment operated within isolated virtual machines utilizing Ubuntu Multipass. Multipass provides a lightweight, command-line-driven method for rapidly provisioning Ubuntu VMs locally.
The infrastructure was configured as follows:
One VM designated as the target (victim) machine.
One VM functioning as the Command and Control (C2) server.
Wireshark running in the background to capture all network traffic between the nodes.
The botnet infrastructure utilized BYOB (Build Your Own Botnet), an open-source educational framework developed by malwaredllc. As a post-exploitation framework, it is well-suited for security research. The payload establishes a reverse TCP shell to the C2 server, encrypted with AES-256 under a session key negotiated through a Diffie-Hellman key exchange. Upon connection, standard toolkit modules become available, including keyloggers, packet sniffers, port scanners, screenshot utilities, and persistence mechanisms.
Payload delivery simulated common malware vectors by masquerading as a standard document file, replicating social engineering tactics within a sandboxed environment.
Analysis of Malicious Traffic Characteristics
During active communication between the botnet and the C2 server, Wireshark captured the network traffic. Subsequently, 32 statistical features were extracted per network flow, including metrics such as bytes transmitted and received, packet length variance, and response time distributions.
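As a rough illustration of what per-flow feature extraction can look like, the sketch below groups packets by 5-tuple and computes a handful of statistics. It is not the project's actual extractor: the packet dictionary keys, the function name flow_features, and the choice to treat each direction as its own flow are all assumptions made for the example, and it covers only a few of the 32 features.

```python
from collections import defaultdict
from statistics import mean, pvariance


def flow_features(packets):
    """Group packets by 5-tuple and compute a few per-flow statistics.

    Each packet is a dict with hypothetical keys:
    src, dst, sport, dport, proto, ts (seconds), length (bytes).
    Each direction is treated as a separate flow (a simplification).
    """
    flows = defaultdict(list)
    for p in packets:
        key = (p["src"], p["dst"], p["sport"], p["dport"], p["proto"])
        flows[key].append(p)

    features = {}
    for key, pkts in flows.items():
        lengths = [p["length"] for p in pkts]
        times = [p["ts"] for p in pkts]
        features[key] = {
            "packet_count": len(pkts),
            "total_bytes": sum(lengths),
            "duration": max(times) - min(times),
            "mean_length": mean(lengths),
            "length_variance": pvariance(lengths),  # population variance
        }
    return features


# Made-up packets for one flow: small command, large reply, small command.
example_packets = [
    {"src": "10.0.0.2", "dst": "10.0.0.3", "sport": 50000, "dport": 1337,
     "proto": "tcp", "ts": 0.0, "length": 60},
    {"src": "10.0.0.2", "dst": "10.0.0.3", "sport": 50000, "dport": 1337,
     "proto": "tcp", "ts": 0.5, "length": 1500},
    {"src": "10.0.0.2", "dst": "10.0.0.3", "sport": 50000, "dport": 1337,
     "proto": "tcp", "ts": 2.0, "length": 60},
]
example_features = flow_features(example_packets)
```

In practice the packet records would come from a capture export rather than a hand-written list; the grouping and aggregation steps stay the same.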
The statistical discrepancies between malicious and benign traffic were significant.
Bytes Transmitted: Malicious flows transmitted approximately 24 times more data than normal traffic within the same timeframe. The C2 connection proved highly conversational; even during idle periods, the system continuously maintained state, received instructions, and returned bulk data.
Session Duration: Malicious sessions lasted roughly twice as long as benign activity. While standard browser sessions tend to be brief and bursty, a persistent reverse shell maintains a continuous open connection.
Packet Length Variance: The variance in malicious traffic was nearly 2.7 times higher than that of benign traffic. This is logically consistent, as a C2 session alternates between minuscule command packets and extensive data exfiltration, resulting in a highly variable length distribution compared to the more uniform nature of standard web browsing.
One metric that remained consistent across both categories was the relationship between mean response time and packet length. Both traffic types exhibited similar baseline patterns; however, the malicious traffic displayed intermittent high-throughput bursts that were distinctly visible during visual data analysis.
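Ratios like the ones above come from comparing per-class means of each feature. A minimal sketch of that comparison follows; the function name class_ratios and the toy numbers are invented for illustration and are not the project's measurements.

```python
from statistics import mean


def class_ratios(flows, feature_names):
    """Return mean(malicious) / mean(benign) for each named feature.

    Each flow is a dict with a "label" key (1 = malicious, 0 = benign)
    plus one numeric entry per feature.
    """
    ratios = {}
    for name in feature_names:
        malicious = [f[name] for f in flows if f["label"] == 1]
        benign = [f[name] for f in flows if f["label"] == 0]
        ratios[name] = mean(malicious) / mean(benign)
    return ratios


# Toy values only, chosen to keep the arithmetic easy to follow.
toy_flows = [
    {"label": 0, "total_bytes": 1000, "duration": 10.0},
    {"label": 0, "total_bytes": 3000, "duration": 30.0},
    {"label": 1, "total_bytes": 15000, "duration": 50.0},
    {"label": 1, "total_bytes": 25000, "duration": 70.0},
]
toy_ratios = class_ratios(toy_flows, ["total_bytes", "duration"])
```

Means are sensitive to outliers, so comparing medians as well would be a reasonable cross-check on a small dataset.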
Model Training and Evaluation
Using the labeled dataset of benign baseline activity and malicious C2 flows, a binary classifier was trained with scikit-learn. The data was split 80/20 into training and test sets, stratified by label so that both subsets preserve the original class ratio.
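A stratified 80/20 split with scikit-learn might look like the sketch below. Two things here are assumptions: the data is a synthetic stand-in generated by make_classification (32 features to mirror the feature count, nothing more), and the classifier is a random forest chosen only for illustration, since the model type is not specified here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the flow dataset: 32 features per flow, binary label.
X, y = make_classification(
    n_samples=1000,
    n_features=32,
    n_informative=8,
    weights=[0.7, 0.3],  # imbalanced on purpose so stratification matters
    random_state=42,
)

# stratify=y keeps the benign/malicious ratio the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Hypothetical model choice; any scikit-learn classifier fits this pattern.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.4f}")
```

The accuracy printed here reflects the synthetic data only and says nothing about real traffic.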
The model achieved an 86.23% accuracy rate on the testing subset. Furthermore, it demonstrated the capability to process live flow data directly from Wireshark, enabling real-time session flagging, which was a primary objective of the research.
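A single accuracy figure does not show how the errors split between the two classes. The sketch below derives false positive and false negative rates from predictions in plain Python so the arithmetic is visible; the function name error_rates and the ten example labels are made up for illustration.

```python
def error_rates(y_true, y_pred):
    """Return confusion counts and error rates (1 = malicious, 0 = benign)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # malicious missed
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": (tp + tn) / len(pairs),
        # Share of benign flows that were wrongly flagged.
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        # Share of malicious flows that slipped through.
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }


# Made-up labels: six benign flows followed by four malicious ones.
example_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
example_pred = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
example_rates = error_rates(example_true, example_pred)
```

In this toy case accuracy is 80%, yet one in four malicious flows is missed, which is the kind of detail the headline number hides.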
While an 86% accuracy rate represents a strong baseline, production-grade security applications require a more granular analysis of false positive and false negative rates. A model that achieves high accuracy but incorrectly flags a significant portion of legitimate traffic lacks practical viability. Addressing this balance remains a priority for future iterations.
Proposed Architectural Enhancements
In its current state, the model functions passively by logging flagged packets. A potential advancement involves dynamic network rerouting rather than mere observation. Under this concept, when a flow is classified as suspicious, it is neither permitted to reach its destination nor silently dropped. Instead, the traffic is steered toward a specialized security analysis server. This environment handles intensive processing tasks such as deep packet inspection (DPI), sandboxing, threat intelligence correlation, and potential reverse engineering before determining the appropriate traffic disposition.
Normal traffic ──────────────────────────────────► Destination

Suspicious flow ──► ML classifier flags it
                              │
                              ▼
                High-security analysis server
                (DPI, sandbox, threat intel)
                              │
                        Allow / Block
While the policy-based routing mechanism is a standard networking function, the implementation of an automated feedback loop offers significant value. Traffic that is rerouted and conclusively verified as malicious can be integrated as new training data. This enables continuous model retraining, transitioning the system from a static ruleset into a self-improving detection engine.
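The steering decision and the feedback loop can be sketched at a very high level as follows. Everything named here is hypothetical: route_flow, FeedbackLoop, record_verdict, the 0.5 threshold, and the batch size are illustrative choices, and the retrain callback stands in for whatever training routine would actually be used. The sketch also stores flows that analysis cleared as benign examples, which goes slightly beyond the text above.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


def route_flow(score: float, threshold: float = 0.5) -> str:
    """Pick a next hop from the classifier's suspicion score (0.0 to 1.0)."""
    return "analysis_server" if score >= threshold else "destination"


@dataclass
class FeedbackLoop:
    """Collect verified verdicts and trigger retraining in batches."""

    # Hypothetical callback: receives all stored features and labels.
    retrain: Callable[[List[Sequence[float]], List[int]], None]
    batch_size: int = 3
    features: List[Sequence[float]] = field(default_factory=list)
    labels: List[int] = field(default_factory=list)
    pending: int = 0

    def record_verdict(self, flow: Sequence[float], is_malicious: bool) -> bool:
        """Store one verified flow; return True if retraining was triggered."""
        self.features.append(flow)
        self.labels.append(1 if is_malicious else 0)
        self.pending += 1
        if self.pending >= self.batch_size:
            self.retrain(self.features, self.labels)
            self.pending = 0
            return True
        return False


# Usage with a stub that only records how much data it was handed.
retrain_calls = []
loop = FeedbackLoop(
    retrain=lambda X, y: retrain_calls.append((len(X), sum(y))),
    batch_size=2,
)
first = loop.record_verdict([1.0, 2.0], is_malicious=True)
second = loop.record_verdict([0.5, 0.1], is_malicious=False)
```

Batching the retraining keeps a single mislabeled verdict from immediately shifting the model, at the cost of a delay before new patterns are learned.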
Although high-end enterprise products utilize similar architectures, they remain costly, proprietary, and reliant on dedicated security operations centers. The ongoing research question is whether a comparable, lightweight system can be engineered for deployment at the ISP tier, within localized networks, or at the IoT edge: environments where traditional enterprise tools are impractical.
Limitations and Future Improvements
Several areas require refinement in future studies:
Omission of Port Analysis: Within the simulated environment, port-level characteristics were not fully utilized. In production networks, outbound connections via non-standard ports from unexpected processes represent a robust indicator of compromise and must be incorporated into the feature set.
Single Model Evaluation: The current iteration reports the accuracy of a single classifier. A rigorous methodological approach requires benchmarking multiple algorithms (e.g., Random Forests, XGBoost, and neural networks) against the same feature set, utilizing cross-validation to identify the optimal model.
Dataset Constraints: The experimental dataset was relatively small and highly controlled. Real-world network traffic is characterized by significantly higher noise levels, variance, and labeling complexities. Consequently, the 86% accuracy metric observed in the laboratory environment may degrade in production settings until the model is exposed to a broader diversity of traffic patterns.
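For the multi-model benchmarking point, a cross-validated comparison could be structured like the sketch below. The data is again a synthetic stand-in from make_classification, the three candidates are illustrative picks, and scikit-learn's GradientBoostingClassifier is used in place of XGBoost to keep the example to a single library.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 32-feature flow dataset.
X, y = make_classification(
    n_samples=600, n_features=32, n_informative=8, random_state=0
)

# Candidate models, all evaluated on the same features and the same folds.
candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, model in candidates.items():
    # F1 is used instead of accuracy so both error types affect the score.
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: f1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reusing one StratifiedKFold object with a fixed seed means every model sees identical folds, so differences in score reflect the models rather than the split.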
Key Takeaways
Constructing offensive infrastructure provides profound insights into defensive strategies. Observing a self-deployed C2 server process traffic and analyzing live packet captures offers practical understanding that theoretical study cannot replicate.
The research confirmed that statistical signatures are both tangible and robust. Threat detection can be achieved without decrypting packets or reverse-engineering payloads; the fundamental shape and behavior of the traffic provide sufficient diagnostic indicators.
The dynamic rerouting concept represents the primary focus for subsequent phases of this project. Collaboration and insights regarding SDN-based traffic steering and advanced detection methodologies are highly encouraged.
Tools utilized: BYOB (malwaredllc/byob), Ubuntu Multipass, Wireshark, Python, scikit-learn. All testing was conducted within an isolated virtual environment.