Technology

Unsupervised anomaly detection

Conventional machine learning approaches to network attack detection rely on recognizing known patterns of behavior: a classification model is built and trained on large labeled datasets. However, the rapid pace and unpredictability of cyber-attacks make such labeling impossible in real time and extremely time-consuming post-incident. In addition, a signature-based approach is inherently biased toward previous incidents and can be outmaneuvered by new, previously unseen patterns.

  • No labeling required: We rely on unsupervised anomaly detection (e.g., Isolation Forest and sequence models), so we don’t need large, hand-labeled datasets—labeling is time-consuming, costly, and impractical in real time.
  • Catches the unknown: Instead of memorizing past signatures, models learn each site’s normal behavior and flag deviations, including novel AI-driven bots and slow/low attacks.

Data pipeline (real-time, Kafka-first)

  • Inputs: Layer-7 weblogs from CDN edges (Cloudflare, Amazon CloudFront) and origin proxies.
  • Backbone: Apache Kafka with host-keyed partitions to keep all events for the same site ordered together.
  • Sessionization: Real-time creation of user sessions (via KSQL/Kafka Streams), grouping requests by session cookie (with IP/host fallbacks).
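
The production sessionization runs in KSQL/Kafka Streams; the Python sketch below only illustrates the grouping logic, consuming host-keyed events and cutting a session once the same key goes idle. The topic name, broker address, field names, and 30-minute idle gap are assumptions for illustration, not the production configuration.

```python
# Illustrative sessionization logic only; production uses KSQL/Kafka Streams.
# Topic, broker, field names, and the idle gap below are assumptions.
import json
from collections import defaultdict
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",   # assumed broker address
    "group.id": "sessionizer",
    "auto.offset.reset": "latest",
})
consumer.subscribe(["weblogs"])          # hypothetical topic name

SESSION_IDLE_GAP_S = 30 * 60             # close a session after 30 min of silence
sessions = defaultdict(list)             # session key -> buffered events

def session_key(event: dict) -> str:
    # Prefer the session cookie; fall back to client IP + host.
    return event.get("session_cookie") or f'{event["client_ip"]}|{event["host"]}'

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    key = session_key(event)
    buffer = sessions[key]
    if buffer and event["ts"] - buffer[-1]["ts"] > SESSION_IDLE_GAP_S:
        # The closed session would be emitted downstream here (e.g., to a
        # "sessions" topic) before starting a fresh one.
        buffer.clear()
    buffer.append(event)
```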

Feature extraction

  • Statistical features: Request rates, status-code histograms, bytes, burstiness, sliding-window aggregates (see the sketch after this list).
  • Text/sequence features: Encoded URL paths and navigation sequences (time-ordered URLs).
  • Browser environment metrics: TLS/cipher hints, language/timezone, headless signals, and lightweight fingerprint sketches (canvas/WebGL quirks, WebRTC traits) to reveal swarms of “different” browsers sharing the same environment.
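
As a rough illustration of the statistical features above, a per-session vector might be computed along these lines. The exact feature set and field names are assumptions, not the production schema.

```python
# Illustrative per-session feature vector; names and features are assumptions.
import numpy as np

def session_features(events):
    """events: time-ordered dicts with 'ts', 'status', 'bytes', 'path'."""
    ts = np.array([e["ts"] for e in events], dtype=float)
    duration = max(ts[-1] - ts[0], 1.0)
    gaps = np.diff(ts) if len(ts) > 1 else np.array([0.0])
    statuses = np.array([e["status"] for e in events])
    return {
        "req_rate": len(events) / duration,                  # requests per second
        "err_4xx_ratio": float(np.mean((statuses >= 400) & (statuses < 500))),
        "err_5xx_ratio": float(np.mean(statuses >= 500)),
        "bytes_mean": float(np.mean([e["bytes"] for e in events])),
        # Burstiness: coefficient of variation of inter-arrival gaps.
        "burstiness": float(np.std(gaps) / (np.mean(gaps) + 1e-9)),
        "unique_paths": len({e["path"] for e in events}),
    }
```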

Human vs automated separation

  • A lightweight gatekeeper classifies traffic as human vs automated (incl. verified search bots), reducing noise and enabling tighter anomaly thresholds on the automated side.
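
A minimal sketch of such a gatekeeper is shown below. The reverse-then-forward DNS check is the standard way to verify search crawlers; the remaining rules, fields, and thresholds are placeholders for the real classifier.

```python
# Hedged gatekeeper sketch; signals and thresholds here are assumptions.
import socket

VERIFIED_BOT_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

def is_verified_search_bot(ip: str) -> bool:
    """Reverse-then-forward DNS check used to verify search crawlers."""
    try:
        host = socket.gethostbyaddr(ip)[0]
        return (host.endswith(VERIFIED_BOT_SUFFIXES)
                and ip in socket.gethostbyname_ex(host)[2])
    except OSError:
        return False

def classify(session: dict) -> str:
    if is_verified_search_bot(session["client_ip"]):
        return "verified_bot"
    if session.get("headless_signal") or session["req_rate"] > 10:  # assumed threshold
        return "automated"
    return "human"
```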

Website-specific anomaly models (trained online)

  • Per-site models: Each domain gets its own unsupervised model, continuously trained on the fly from live sessions (see the sketch below).
  • Algorithms: Isolation Forest for statistical features plus behavior models that learn URL-sequence norms—effective against slow, content-aware bots that mimic humans.
  • Drift-tolerant: Models adapt to changing traffic without relabeling or offline retraining cycles.
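
A simplified sketch of a per-site model follows. scikit-learn's IsolationForest has no incremental mode, so continuous training is approximated here by refitting on a sliding window of recent sessions; window size, refit cadence, and the scoring convention are illustrative assumptions rather than the production design.

```python
# Per-site anomaly model sketch; parameters below are assumptions.
from collections import deque
import numpy as np
from sklearn.ensemble import IsolationForest

class SiteAnomalyModel:
    def __init__(self, window=5000, refit_every=500):
        self.window = deque(maxlen=window)   # recent feature vectors for this site
        self.refit_every = refit_every
        self.model = None
        self._seen = 0

    def observe(self, features: np.ndarray):
        self.window.append(features)
        self._seen += 1
        if self._seen % self.refit_every == 0 and len(self.window) > 100:
            # Refit on the sliding window to track drift without offline retraining.
            self.model = IsolationForest(n_estimators=100, contamination="auto")
            self.model.fit(np.vstack(self.window))

    def score(self, features: np.ndarray) -> float:
        """Higher = more anomalous (negated scikit-learn score)."""
        if self.model is None:
            return 0.0
        return float(-self.model.score_samples(features.reshape(1, -1))[0])

models = {}  # one model per domain, e.g. models.setdefault(host, SiteAnomalyModel())
```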

Action loop: challenge → verify → block

  • Suspicious sessions receive progressive challenges (low-friction JS checks up to CAPTCHAs) through CDN integrations.
  • Outcomes (pass/fail/hesitation) feed back into the pipeline; confirmed malicious actors are blocked or rate-limited, while legitimate users proceed.
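
The escalation logic of this action loop can be summarized roughly as follows. The tier names, threshold, and session fields are placeholders, and the actual enforcement happens through the CDN integrations.

```python
# Progressive escalation sketch; tiers and thresholds are assumptions.
CHALLENGE_TIERS = ["js_check", "managed_challenge", "captcha", "block"]

def next_action(session: dict, anomaly_score: float) -> str:
    if anomaly_score < 0.5:                      # assumed threshold
        return "allow"
    failed = session.get("failed_challenges", 0)
    # Escalate one tier per failed (or suspiciously hesitant) challenge.
    tier = min(failed, len(CHALLENGE_TIERS) - 1)
    return CHALLENGE_TIERS[tier]

def on_challenge_result(session: dict, outcome: str):
    """pass/fail/hesitation outcomes feed back into the pipeline."""
    if outcome in ("fail", "hesitation"):
        session["failed_challenges"] = session.get("failed_challenges", 0) + 1
```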

Observability & storage

  • Durable store: All session summaries, decisions, and scores are saved in Postgres for audit and analytics.
  • Metrics: We publish challenge rate, bot rate, AI-bot rate, fingerprinting scores, and model health to Prometheus, with Grafana dashboards for real-time visibility (see the sketch after this list).
  • Search at scale: Historical weblogs/attacks can be indexed in Elasticsearch for fast investigations and offline experiments.
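
For the metrics bullet above, the instrumentation could look roughly like this with the prometheus_client library; the metric names and port are assumptions mirroring the dashboard described, not the exact production series.

```python
# Illustrative metric definitions; names and port are assumptions.
from prometheus_client import Counter, Gauge, start_http_server

CHALLENGES = Counter("challenges_total", "Challenges issued", ["site", "tier"])
BOT_RATE = Gauge("bot_traffic_ratio", "Share of sessions classified as automated", ["site"])
AI_BOT_RATE = Gauge("ai_bot_traffic_ratio", "Share of sessions attributed to AI bots", ["site"])
MODEL_AGE = Gauge("model_seconds_since_refit", "Freshness of the per-site model", ["site"])

start_http_server(9100)   # scraped by Prometheus, visualized in Grafana
# e.g. CHALLENGES.labels(site="example.com", tier="captcha").inc()
```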

Platform & integrations

  • Kubernetes-native services for elastic scaling.
  • Generic APIs to plug into CDNs (Cloudflare, CloudFront) and site backends, decoupling model logic from edge enforcement.
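
One way to express that decoupling is a small CDN-agnostic interface like the sketch below; the adapter bodies are placeholders rather than real Cloudflare or CloudFront API bindings.

```python
# CDN-agnostic enforcement interface sketch; adapters are placeholders.
from abc import ABC, abstractmethod

class EdgeEnforcer(ABC):
    """Decouples model decisions from how a given CDN applies them."""
    @abstractmethod
    def challenge(self, session_id: str, tier: str) -> None: ...
    @abstractmethod
    def block(self, session_id: str) -> None: ...

class CloudflareEnforcer(EdgeEnforcer):
    def challenge(self, session_id: str, tier: str) -> None:
        ...  # would call the CDN's rules/WAF API here

    def block(self, session_id: str) -> None:
        ...  # would create a block or rate-limit rule for the session

# Models emit (session_id, action); any EdgeEnforcer implementation applies it.
```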

What’s next

  • Cluster-level detection: Cross-session/cross-IP clustering to catch AI botnets whose individual sessions look human but align into inhuman patterns at scale (see the sketch below).
  • Deception/honeypots: Rotating low-SEO bait pages to attract content-hungry AI crawlers and safely exhaust their compute.
  • Synthetic adversaries: Continuous red-teaming with AI-driven bot agents.
  • Reinforcement loop: Operator feedback (FP/FN tags) closes the loop for ongoing model improvement.
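
As an exploratory illustration of the cluster-level idea in the first bullet, coordinated swarms could be surfaced by density-based clustering over cross-IP session vectors; the algorithm choice and parameters below are assumptions, not a committed design.

```python
# Exploratory cross-session clustering sketch; eps/min_samples are assumptions.
import numpy as np
from sklearn.cluster import DBSCAN

def find_coordinated_swarms(session_vectors: np.ndarray, min_swarm: int = 50):
    labels = DBSCAN(eps=0.3, min_samples=min_swarm).fit_predict(session_vectors)
    # Sessions that individually look human but collapse into one dense cluster
    # across many IPs are candidates for an AI botnet.
    return [c for c in set(labels) if c != -1 and np.sum(labels == c) >= min_swarm]
```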