Transaction Anomaly Detection Simulation
Case Study Summary
Client: A B2B SaaS Company
Impact Metrics:
- Real-time anomaly detection with 180-day sliding window training
- Stateless architecture enabling horizontal scaling via Redis model store
- Interactive monitoring dashboard for model inspection and anomaly analysis
Description
This proof of concept demonstrates per-wallet anomaly detection on transactional data. It simulates and/or ingests historical transactions, trains per-wallet models, flags unusual behaviour, stores model state, and provides a lightweight UI for analysis.
Problem Statement
We need a lightweight, per-wallet anomaly detector that works with only three fields (`wallet_id`, `timestamp`, `amount`) and can surface unusual behaviour despite heavy traffic skew (a few wallets generate most transactions). The PoC must simulate "live" operation: for each day, retrain on the preceding 180 days and score that day's transactions using the latest model state. Outputs are daily parquet files with unified anomaly flags and scores, plus full per-wallet model objects in Redis. A Streamlit app provides drill-downs to understand model behaviour and outcomes.
Approach Overview
Two complementary detectors are trained per wallet (only for wallets that meet a minimum transaction threshold):
- Statistical detector (z‑score) on transaction amount distributions (log-scaled).
- Machine‑learning detector (Isolation Forest) on simple features (amount + time encodings such as hour-of-day one-hot).
Both models are retrained daily on the last 180 days, then used to score all transactions on the current day. We persist full model objects per wallet in Redis to keep the system stateless between runs and enable fine‑grained inspection in the UI.
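For context, a minimal sketch of how the configurable pieces mentioned throughout this document (window length, eligibility thresholds, detector parameters) could be grouped into a pydantic settings model; field names and any value not quoted in the text are assumptions, not the project's actual schema:

```python
# Illustrative config sketch; only the commented "documented" defaults come
# from this case study, the rest are placeholder assumptions.
from pydantic import BaseModel


class StatisticalConfig(BaseModel):
    min_transactions: int = 30          # documented default
    zscore_threshold: float = 3.5       # documented default


class IsolationForestConfig(BaseModel):
    min_transactions: int = 100         # documented default
    n_estimators: int = 100             # assumed value
    contamination: float = 0.01         # assumed value
    random_state: int = 42              # assumed value


class PipelineConfig(BaseModel):
    training_window_days: int = 180     # documented sliding window
    statistical: StatisticalConfig = StatisticalConfig()
    isolation_forest: IsolationForestConfig = IsolationForestConfig()
```

In the real pipeline these values would be read from YAML and environment variables and validated by pydantic, per the tech stack listed below.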
Architecture at a Glance
- Data ingestion: Load historical transactions (Snowflake or mock extractors) with configurable filters.
- Modeling: Train both a simple statistical detector (z-score) and a machine learning model (Isolation Forest) per wallet.
- Model storage: Persist model parameters/state in Redis for reuse and scoring.
- Scoring & outputs: Produce daily parquet results and aggregated metrics; combine model signals into a single anomaly flag.
- Visualization: Streamlit app to explore results, distributions, timelines, and per-wallet details.
Tech Stack & Key Technologies
- Language: Python (3.x)
- Data & numerics: `pandas`, `numpy`
- ML: `scikit-learn` (IsolationForest), custom statistical z-score detector
- Configuration & validation: `pydantic`, `pyyaml`, `python-dotenv`
- Data access: Snowflake via `snowflake-sqlalchemy`
- Storage: Redis (model parameter store)
- Visualization/UI: Streamlit; plotting with `plotly`, `matplotlib`, `seaborn`
Inputs & Outputs (High Level)
- Inputs: Historical transactions (from Snowflake or mock sources), environment-configured credentials
- Outputs: Parquet result files with anomaly flags and scores; model parameters and metadata in Redis
Design Decisions & Rationale
- Per-wallet modeling: Each wallet exhibits unique behaviour; a single global model underfits with sparse features. We train separate models per `WALLET_ID`.
- Sliding window (180 days): Captures recent behaviour and adapts to drift while keeping training bounded.
- Eligibility thresholds: Skip wallets with too few transactions to avoid unstable parameters.
  - Statistical: min 30 tx (configurable).
  - Isolation Forest: min 100 tx (configurable).
- Feature minimalism: Only `TOTAL_AMOUNT` and time of day are required, matching the sparse input contract.
- Statelessness via Redis: Store full per-wallet model objects in Redis under `model:{model_type}:{wallet_id}` and track recency via a `model_updates` sorted set (see the sketch after this list).
- Multiprocessing: Train wallets in parallel with a process pool for throughput on CPU-bound workloads (a distributed architecture is possible but out of scope for this PoC).
- Decision rule: Combine detectors with an OR: a transaction is anomalous if either model flags it. This yields a unified flag for downstream consumers.
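A minimal sketch of the statelessness scheme, assuming the key layout and `model_updates` sorted set described above; the JSON envelope and helper names are illustrative (the serialization note in the Isolation Forest section says models are pickled and hex-encoded in JSON):

```python
# Illustrative persistence helpers; the key layout follows the convention
# above, the envelope fields are assumptions.
import json
import pickle
import time

import redis

r = redis.Redis(host="localhost", port=6379, db=0)


def save_model(model_type: str, wallet_id: str, model_obj) -> None:
    """Store a full model object under model:{model_type}:{wallet_id}."""
    payload = json.dumps({
        "model": pickle.dumps(model_obj).hex(),  # pickled, hex-encoded
        "trained_at": time.time(),
    })
    r.set(f"model:{model_type}:{wallet_id}", payload)
    # Track recency in a sorted set so the freshest models are easy to list.
    r.zadd("model_updates", {f"{model_type}:{wallet_id}": time.time()})


def load_model(model_type: str, wallet_id: str):
    raw = r.get(f"model:{model_type}:{wallet_id}")
    if raw is None:
        return None
    return pickle.loads(bytes.fromhex(json.loads(raw)["model"]))
```

The unified decision rule is then a plain OR over the two detector outputs, e.g. `is_anomaly = statistical_anomaly or isolation_forest_anomaly`.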
Training & Scoring Lifecycle
Daily sliding-window simulation (retraining on the preceding 180 days, then scoring the current day's transactions):
```mermaid
sequenceDiagram
    participant E as Extract (Snowflake/Cache)
    participant P as Pipeline
    participant S as Redis Store
    participant O as Output Parquet
    participant UI as Streamlit
    loop For each day D in range
        E->>P: Load tx in [D-180, D)
        P->>P: Group by WALLET_ID and compute stats
        par
            P->>P: Fit Statistical if count ≥ 30
        and
            P->>P: Fit IsolationForest if count ≥ 100
        end
        P->>S: Save models (statistical, isolation_forest)
        E->>P: Load tx in [D, D+1)
        P->>P: Score with both models and combine (OR)
        P->>O: Write labeled parquet for day D
    end
    O->>UI: Load parquet for analysis
```
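The diagram condenses to a loop like the following sketch; `load_transactions`, `train_wallet_models`, and `score_day` are hypothetical stand-ins for the extractor, per-wallet trainer, and scorer rather than the project's real functions:

```python
# Hypothetical orchestration of the daily sliding-window simulation.
from datetime import date, timedelta

WINDOW_DAYS = 180


def run_simulation(start: date, end: date, output_dir: str = "output") -> None:
    day = start
    while day <= end:
        # Retrain every eligible wallet on the preceding 180 days.
        train_df = load_transactions(day - timedelta(days=WINDOW_DAYS), day)
        models = train_wallet_models(train_df)          # also persisted to Redis

        # Score the current day's transactions with the fresh models.
        day_df = load_transactions(day, day + timedelta(days=1))
        labeled = score_day(day_df, models)             # adds flags + is_anomaly

        labeled.to_parquet(f"{output_dir}/labeled_transactions_{day:%Y%m%d}.parquet")
        day += timedelta(days=1)
```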
Statistical (z-score)
- Log-scale amounts with `log1p` for robustness to fat tails (configurable).
- Parameters per wallet: mean, std, median, training count.
- Predict: compute the z-score and flag if `z > zscore_threshold`.
- Config defaults: `min_transactions=30`, `zscore_threshold=3.5`.
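A minimal sketch of this detector as a fit/predict class; the stored parameters and defaults mirror the bullets above, everything else is illustrative:

```python
import numpy as np


class ZScoreDetector:
    """Per-wallet z-score detector on log-scaled amounts (illustrative)."""

    def __init__(self, zscore_threshold: float = 3.5, min_transactions: int = 30):
        self.zscore_threshold = zscore_threshold
        self.min_transactions = min_transactions

    def fit(self, amounts: np.ndarray) -> "ZScoreDetector":
        if len(amounts) < self.min_transactions:
            raise ValueError("wallet below the eligibility threshold")
        log_amounts = np.log1p(amounts)        # log-scale for fat-tailed amounts
        self.mean_ = float(log_amounts.mean())
        self.std_ = float(log_amounts.std())
        self.median_ = float(np.median(log_amounts))
        self.train_count_ = len(amounts)
        return self

    def predict(self, amounts: np.ndarray):
        z = (np.log1p(amounts) - self.mean_) / max(self.std_, 1e-9)
        return z, z > self.zscore_threshold    # flag when z exceeds the threshold
```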
Isolation Forest
- Preprocessing per wallet: `RobustScaler(TOTAL_AMOUNT_log)` + `OneHot(hour_of_day)`.
- Model: `IsolationForest(n_estimators, contamination, random_state)`.
- Eligibility: `min_transactions=100` by default.
- Outputs: prediction. Optional scores (`decision_function`) can be exposed if needed.
- Serialized to Redis including the preprocessor; models are pickled and hex-encoded in JSON.
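A sketch of this per-wallet pipeline using standard scikit-learn components; hyperparameter values are placeholders (the project reads them from config), and the feature-building helper is illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler


def make_features(tx: pd.DataFrame) -> pd.DataFrame:
    """Build the minimal feature frame: log amount + hour of day."""
    return pd.DataFrame({
        "TOTAL_AMOUNT_log": np.log1p(tx["TOTAL_AMOUNT"]),
        "hour_of_day": pd.to_datetime(tx["TX_COMPLETED_AT"]).dt.hour,
    })


def build_iforest_pipeline(n_estimators: int = 100,
                           contamination: float = 0.01,
                           random_state: int = 42) -> Pipeline:
    preprocessor = ColumnTransformer([
        ("amount", RobustScaler(), ["TOTAL_AMOUNT_log"]),
        ("hour", OneHotEncoder(handle_unknown="ignore"), ["hour_of_day"]),
    ])
    return Pipeline([
        ("prep", preprocessor),
        ("iforest", IsolationForest(n_estimators=n_estimators,
                                    contamination=contamination,
                                    random_state=random_state)),
    ])


# Fit on the wallet's 180-day window, then flag a day's transactions:
# pipe = build_iforest_pipeline().fit(make_features(train_tx))
# flags = pipe.predict(make_features(day_tx)) == -1   # -1 marks anomalies
```

Pickling the whole pipeline object (preprocessor included) matches the serialization note above.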
Outputs & UI
- Daily parquet written to `output/labeled_transactions_YYYYMMDD.parquet` with columns including: `TX_ID`, `WALLET_ID`, `TX_COMPLETED_AT`, `TOTAL_AMOUNT`, `ORG_NAME`, `zscore` (float), `statistical_anomaly` (bool), `isolation_forest_anomaly` (bool), `is_anomaly` (bool).
- Streamlit app (`streamlit/app.py`) loads all parquet files and provides summary metrics, model comparison, z-score distributions, a timeline, a wallet explorer, and org summaries.
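A sketch of how the app could load every daily parquet file into one dataframe for those views; the caching decorator and metrics shown are standard Streamlit usage, not necessarily the app's exact code:

```python
import glob

import pandas as pd
import streamlit as st


@st.cache_data
def load_results(output_dir: str = "output") -> pd.DataFrame:
    """Concatenate all daily labeled_transactions_*.parquet files."""
    files = sorted(glob.glob(f"{output_dir}/labeled_transactions_*.parquet"))
    return pd.concat(map(pd.read_parquet, files), ignore_index=True)


df = load_results()
st.metric("Transactions scored", len(df))
st.metric("Anomaly rate", f"{df['is_anomaly'].mean():.2%}")
st.metric("Statistical flags", int(df["statistical_anomaly"].sum()))
```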
Ready to detect anomalies in your data?
- Free 30-minute ML consultation
- Custom anomaly detection strategy
- Risk reduction & model performance analysis

Need to identify unusual patterns in your transaction data, user behavior, or business metrics? Let's design a tailored anomaly detection system that scales with your data and provides actionable insights.