Technology

Unsupervised anomaly detection

Conventional machine learning approaches to network attack detection rely on recognizing known patterns of behavior: a classification model is built and trained on large labeled datasets. However, the rapid pace and unpredictability of cyber-attacks make such labeling impossible in real time and extremely time-consuming post-incident. In addition, a signature-based approach is inherently biased toward previous incidents and can be outmaneuvered by new, previously unseen patterns.

  • No labeling required: We rely on unsupervised anomaly detection (e.g., Isolation Forest and sequence models), so we don’t need large, hand-labeled datasets—labeling is time-consuming, costly, and impractical in real time.
  • Catches the unknown: Instead of memorizing past signatures, models learn each site’s normal behavior and flag deviations, including novel AI-driven bots and slow/low attacks.

Data pipeline (real-time, Kafka-first)

  • Inputs: Layer-7 weblogs from CDN edges (Cloudflare, Amazon CloudFront) and origin proxies.
  • Backbone: Apache Kafka with host-keyed partitions to keep all events for the same site ordered together.
  • Sessionization: Real-time creation of user sessions (via KSQL/Kafka Streams), grouping requests by session cookie (with IP/host fallbacks).
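
The production sessionization runs in KSQL/Kafka Streams; the Python sketch below only illustrates the grouping logic, consuming host-keyed events and cutting a session once the same key goes idle. The topic name, broker address, field names, and 30-minute idle gap are assumptions for illustration, not the production configuration.

```python
# Illustrative sessionization logic only; production uses KSQL/Kafka Streams.
# Topic, broker, field names, and the idle gap below are assumptions.
import json
from collections import defaultdict
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",   # assumed broker address
    "group.id": "sessionizer",
    "auto.offset.reset": "latest",
})
consumer.subscribe(["weblogs"])          # hypothetical topic name

SESSION_IDLE_GAP_S = 30 * 60             # close a session after 30 min of silence
sessions = defaultdict(list)             # session key -> buffered events

def session_key(event: dict) -> str:
    # Prefer the session cookie; fall back to client IP + host.
    return event.get("session_cookie") or f'{event["client_ip"]}|{event["host"]}'

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    key = session_key(event)
    buffer = sessions[key]
    if buffer and event["ts"] - buffer[-1]["ts"] > SESSION_IDLE_GAP_S:
        # The closed session would be emitted downstream here (e.g., to a
        # "sessions" topic) before starting a fresh one.
        buffer.clear()
    buffer.append(event)
```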

Feature extraction

  • Statistical features: Request rates, status-code histograms, bytes, burstiness, sliding-window aggregates (see the sketch after this list).
  • Text/sequence features: Encoded URL paths and navigation sequences (time-ordered URLs).
  • Browser environment metrics: TLS/cipher hints, language/timezone, headless signals, and lightweight fingerprint sketches (canvas/WebGL quirks, WebRTC traits) to reveal swarms of “different” browsers sharing the same environment.
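
As a rough illustration of the statistical features above, a per-session vector might be computed along these lines. The exact feature set and field names are assumptions, not the production schema.

```python
# Illustrative per-session feature vector; names and features are assumptions.
import numpy as np

def session_features(events):
    """events: time-ordered dicts with 'ts', 'status', 'bytes', 'path'."""
    ts = np.array([e["ts"] for e in events], dtype=float)
    duration = max(ts[-1] - ts[0], 1.0)
    gaps = np.diff(ts) if len(ts) > 1 else np.array([0.0])
    statuses = np.array([e["status"] for e in events])
    return {
        "req_rate": len(events) / duration,                  # requests per second
        "err_4xx_ratio": float(np.mean((statuses >= 400) & (statuses < 500))),
        "err_5xx_ratio": float(np.mean(statuses >= 500)),
        "bytes_mean": float(np.mean([e["bytes"] for e in events])),
        # Burstiness: coefficient of variation of inter-arrival gaps.
        "burstiness": float(np.std(gaps) / (np.mean(gaps) + 1e-9)),
        "unique_paths": len({e["path"] for e in events}),
    }
```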

Human vs automated separation

  • A lightweight gatekeeper classifies traffic as human vs automated (incl. verified search bots), reducing noise and enabling tighter anomaly thresholds on the automated side.
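
A minimal sketch of such a gatekeeper is shown below. The reverse-then-forward DNS check is the standard way to verify search crawlers; the remaining rules, fields, and thresholds are placeholders for the real classifier.

```python
# Hedged gatekeeper sketch; signals and thresholds here are assumptions.
import socket

VERIFIED_BOT_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

def is_verified_search_bot(ip: str) -> bool:
    """Reverse-then-forward DNS check used to verify search crawlers."""
    try:
        host = socket.gethostbyaddr(ip)[0]
        return (host.endswith(VERIFIED_BOT_SUFFIXES)
                and ip in socket.gethostbyname_ex(host)[2])
    except OSError:
        return False

def classify(session: dict) -> str:
    if is_verified_search_bot(session["client_ip"]):
        return "verified_bot"
    if session.get("headless_signal") or session["req_rate"] > 10:  # assumed threshold
        return "automated"
    return "human"
```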

Website-specific anomaly models (trained online)

  • Per-site models: Each domain gets its own unsupervised model, continuously trained on the fly from live sessions (see the sketch below).
  • Algorithms: Isolation Forest for statistical features plus behavior models that learn URL-sequence norms—effective against slow, content-aware bots that mimic humans.
  • Drift-tolerant: Models adapt to changing traffic without relabeling or offline retraining cycles.
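
A simplified sketch of a per-site model follows. scikit-learn's IsolationForest has no incremental mode, so continuous training is approximated here by refitting on a sliding window of recent sessions; window size, refit cadence, and the scoring convention are illustrative assumptions rather than the production design.

```python
# Per-site anomaly model sketch; parameters below are assumptions.
from collections import deque
import numpy as np
from sklearn.ensemble import IsolationForest

class SiteAnomalyModel:
    def __init__(self, window=5000, refit_every=500):
        self.window = deque(maxlen=window)   # recent feature vectors for this site
        self.refit_every = refit_every
        self.model = None
        self._seen = 0

    def observe(self, features: np.ndarray):
        self.window.append(features)
        self._seen += 1
        if self._seen % self.refit_every == 0 and len(self.window) > 100:
            # Refit on the sliding window to track drift without offline retraining.
            self.model = IsolationForest(n_estimators=100, contamination="auto")
            self.model.fit(np.vstack(self.window))

    def score(self, features: np.ndarray) -> float:
        """Higher = more anomalous (negated scikit-learn score)."""
        if self.model is None:
            return 0.0
        return float(-self.model.score_samples(features.reshape(1, -1))[0])

models = {}  # one model per domain, e.g. models.setdefault(host, SiteAnomalyModel())
```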

Action loop: challenge → verify → block

  • Suspicious sessions receive progressive challenges (low-friction JS checks up to CAPTCHAs) through CDN integrations.
  • Outcomes (pass/fail/hesitation) feed back into the pipeline; confirmed malicious actors are blocked or rate-limited, while legitimate users proceed.
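
The escalation logic of this action loop can be summarized roughly as follows. The tier names, threshold, and session fields are placeholders, and the actual enforcement happens through the CDN integrations.

```python
# Progressive escalation sketch; tiers and thresholds are assumptions.
CHALLENGE_TIERS = ["js_check", "managed_challenge", "captcha", "block"]

def next_action(session: dict, anomaly_score: float) -> str:
    if anomaly_score < 0.5:                      # assumed threshold
        return "allow"
    failed = session.get("failed_challenges", 0)
    # Escalate one tier per failed (or suspiciously hesitant) challenge.
    tier = min(failed, len(CHALLENGE_TIERS) - 1)
    return CHALLENGE_TIERS[tier]

def on_challenge_result(session: dict, outcome: str):
    """pass/fail/hesitation outcomes feed back into the pipeline."""
    if outcome in ("fail", "hesitation"):
        session["failed_challenges"] = session.get("failed_challenges", 0) + 1
```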

Observability & storage

  • Durable store: All session summaries, decisions, and scores are saved in Postgres for audit and analytics.
  • Metrics: We publish challenge rate, bot rate, AI-bot rate, fingerprinting scores, and model health to Prometheus, with Grafana dashboards for real-time visibility (see the sketch after this list).
  • Search at scale: Historical weblogs/attacks can be indexed in Elasticsearch for fast investigations and offline experiments.
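
For the metrics bullet above, the instrumentation could look roughly like this with the prometheus_client library; the metric names and port are assumptions mirroring the dashboard described, not the exact production series.

```python
# Illustrative metric definitions; names and port are assumptions.
from prometheus_client import Counter, Gauge, start_http_server

CHALLENGES = Counter("challenges_total", "Challenges issued", ["site", "tier"])
BOT_RATE = Gauge("bot_traffic_ratio", "Share of sessions classified as automated", ["site"])
AI_BOT_RATE = Gauge("ai_bot_traffic_ratio", "Share of sessions attributed to AI bots", ["site"])
MODEL_AGE = Gauge("model_seconds_since_refit", "Freshness of the per-site model", ["site"])

start_http_server(9100)   # scraped by Prometheus, visualized in Grafana
# e.g. CHALLENGES.labels(site="example.com", tier="captcha").inc()
```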

Platform & integrations

  • Kubernetes-native services for elastic scaling.
  • Generic APIs to plug into CDNs (Cloudflare, CloudFront) and site backends, decoupling model logic from edge enforcement.
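
One way to express that decoupling is a small CDN-agnostic interface like the sketch below; the adapter bodies are placeholders rather than real Cloudflare or CloudFront API bindings.

```python
# CDN-agnostic enforcement interface sketch; adapters are placeholders.
from abc import ABC, abstractmethod

class EdgeEnforcer(ABC):
    """Decouples model decisions from how a given CDN applies them."""
    @abstractmethod
    def challenge(self, session_id: str, tier: str) -> None: ...
    @abstractmethod
    def block(self, session_id: str) -> None: ...

class CloudflareEnforcer(EdgeEnforcer):
    def challenge(self, session_id: str, tier: str) -> None:
        ...  # would call the CDN's rules/WAF API here

    def block(self, session_id: str) -> None:
        ...  # would create a block or rate-limit rule for the session

# Models emit (session_id, action); any EdgeEnforcer implementation applies it.
```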

What’s next

  • Cluster-level detection: Cross-session/cross-IP clustering to catch AI botnets whose individual sessions look human but align into inhuman patterns at scale (see the sketch below).
  • Deception/honeypots: Rotating low-SEO bait pages to attract content-hungry AI crawlers and safely exhaust their compute.
  • Synthetic adversaries: Continuous red-teaming with AI-driven bot agents.
  • Reinforcement loop: Operator feedback (FP/FN tags) closes the loop for ongoing model improvement.
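
As an exploratory illustration of the cluster-level idea in the first bullet, coordinated swarms could be surfaced by density-based clustering over cross-IP session vectors; the algorithm choice and parameters below are assumptions, not a committed design.

```python
# Exploratory cross-session clustering sketch; eps/min_samples are assumptions.
import numpy as np
from sklearn.cluster import DBSCAN

def find_coordinated_swarms(session_vectors: np.ndarray, min_swarm: int = 50):
    labels = DBSCAN(eps=0.3, min_samples=min_swarm).fit_predict(session_vectors)
    # Sessions that individually look human but collapse into one dense cluster
    # across many IPs are candidates for an AI botnet.
    return [c for c in set(labels) if c != -1 and np.sum(labels == c) >= min_swarm]
```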