Experimental

Machine Learning Research

This area covers everything beyond deterministic extraction and rule-based analysis: NER during ingestion, optional heavy NLP libraries, ML classification, and the supervised triage experiments. In development are semantic similarity, named entity recognition, and content sanitization, which complement the existing pattern-based extraction, clustering, and visualizations. Also in the works are ML models run on the cases within each extraction feature to surface similarities, differences, and distinguishing patterns across statistical cohorts: for example FBI cases vs local law enforcement investigations, stranger cases vs family or teacher cases, and possible offender patterns across ages.

The main triage workflow lives on the Triage tab (/triage): rule-based tiers and evaluation, a Random Forest model run over the live database, and paste-in “live triage” that never writes to storage. This page explains scope, status, and how those pieces fit together.

Research Objectives

Pattern extraction and explicit tags remain the source of truth for case schema and filtering. ML here either augments extraction (NER) or learns from the existing rule-based priority scores (supervised triage tiers) so we can compare model behavior to transparent rules—not replace them silently. Semantic search, richer clustering, and comparative group studies are being explored where data supports it.
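As a rough illustration of how a transparent rubric can supply the training labels, the sketch below scores a case from a few boolean features and bins the score into a tier. The feature names, weights, and cutoffs are invented for this example and are not the project's actual rubric.

```python
from dataclasses import dataclass

# Hypothetical rubric: names, weights, and cutoffs are illustrative only.
RULE_WEIGHTS = {
    "multiple_victims": 3,
    "federal_agency_involved": 2,
    "online_platform_named": 1,
}
TIER_CUTOFFS = [(4, "high"), (2, "medium")]  # checked in order; otherwise "low"


@dataclass
class RuleResult:
    score: int
    tier: str
    fired: list  # names of the rules that contributed, kept for review


def rule_tier(features: dict) -> RuleResult:
    """Score a case from boolean features and bin the score into a tier."""
    fired = [name for name in RULE_WEIGHTS if features.get(name)]
    score = sum(RULE_WEIGHTS[name] for name in fired)
    tier = "low"
    for cutoff, label in TIER_CUTOFFS:
        if score >= cutoff:
            tier = label
            break
    return RuleResult(score=score, tier=tier, fired=fired)


result = rule_tier({"multiple_victims": True, "online_platform_named": True})
print(result.tier, result.score, result.fired)
```

Keeping the list of fired rules next to the tier is what makes the label reviewable: a model trained on these tiers can always be checked against the rules that produced them.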

Active Development Areas

Random Forest Triage (shipped)

A random forest (and optionally a simpler decision tree) is trained on the structured features the system already derives from each case. Training targets are priority tiers from an explicit, rule-based rubric, so model output stays anchored to a policy that investigators can review rather than to an opaque score. The Triage workspace presents evaluation metrics (including accuracy), a clear view of where the model disagrees with the rubric, and case lists by tier against the current case database. The same model can be run across the full corpus on demand and narrowed with standard filters, always reflecting live data rather than a frozen export. Live triage lets analysts paste narrative text and see suggested tiers in the session only; nothing is stored. The model file can be retrained and checked against benchmarks as the dataset grows. Overall, this is an interpretable assist with measurable performance alongside transparent rules, built for scale, accountability, and analyst support rather than black-box automation.

Named Entity Recognition (deployed)

The pipeline default is Stanza NER, with fallbacks when libraries or models are missing. It pulls organizations, dates, locations, and related agencies where regex coverage is thin, and its output is merged with pattern output under the same case schema.

Semantic similarity & transformers (optional install)

Sentence embeddings and related tooling live behind requirements-ml.txt and are not required for the triage bundle. They are intended for future similarity search and embedding-backed clustering without disrupting the deterministic baseline.

Content sanitization

Summarization / redaction-style helpers remain experimental in the ML processing layer. They may be useful for safer review workflows once evaluated, but they are not a default view in production.

Comparative case group analysis

Still a research direction: statistical and model-based contrasts between cohorts (e.g. agency mix, relationship classes) on top of facet and tag APIs—not a separate shipped feature yet.
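A minimal sketch of the statistical side of such a contrast, assuming cases arrive as dictionaries with a cohort field and a list of tags: it computes the share of cases carrying each tag per cohort, then the gap between two cohorts. The field names and sample records are made up for illustration and do not reflect the real facet or tag APIs.

```python
from collections import Counter


def tag_rates(cases, cohort_key):
    """Per-cohort share of cases that carry each tag."""
    totals = Counter()
    tag_counts = {}
    for case in cases:
        cohort = case[cohort_key]
        totals[cohort] += 1
        # set() so a tag repeated within one case is counted once
        tag_counts.setdefault(cohort, Counter()).update(set(case["tags"]))
    return {
        cohort: {tag: n / totals[cohort] for tag, n in counts.items()}
        for cohort, counts in tag_counts.items()
    }


# Invented sample records; "agency" and "tags" are hypothetical field names.
cases = [
    {"agency": "FBI", "tags": ["online", "multi_state"]},
    {"agency": "FBI", "tags": ["online"]},
    {"agency": "local", "tags": ["online"]},
    {"agency": "local", "tags": []},
]

rates = tag_rates(cases, "agency")
gap = rates["FBI"].get("online", 0.0) - rates["local"].get("online", 0.0)
print(rates, gap)
```

Rate gaps like this are only descriptive. With small cohorts they should be read alongside the raw counts before drawing any conclusion.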

Technical Stack

scikit-learn / joblib — Triage training, metrics, saved bundle (core requirements.txt)
stanza — Primary NER path in ingestion when installed and models are present
sentence-transformers — Optional embeddings (requirements-ml.txt)
spaCy / transformers — Optional NER or summarization paths where enabled
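The optional embeddings entry above is what a future similarity search would build on. The sketch below shows only the ranking step, using cosine similarity over vectors that are assumed to have been computed already. The three-dimensional vectors and case IDs are toy stand-ins, not real sentence-transformers output, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_vec, case_vecs, top_k=2):
    """Return the IDs of the top_k cases ranked by similarity to the query."""
    ranked = sorted(
        case_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [case_id for case_id, _ in ranked[:top_k]]


# Toy vectors standing in for precomputed sentence embeddings.
case_vecs = {
    "case-a": [1.0, 0.0, 0.0],
    "case-b": [0.9, 0.1, 0.0],
    "case-c": [0.0, 0.0, 1.0],
}
hits = most_similar([1.0, 0.05, 0.0], case_vecs)
print(hits)
```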

Architecture

ML components are optional enhancements, not replacements. Pattern-based extraction remains primary, and ML provides additional capabilities. The system falls back gracefully to pattern-based extraction if ML models are unavailable or not wanted.
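One common way to get this kind of fallback is to probe for optional packages before using them. The sketch below is an assumption about how that could look, not the project's actual loader. The function name and the preference order are hypothetical.

```python
import importlib.util


def pick_ner_backend(candidates=("stanza", "spacy")):
    """Return the first importable backend name, or None for pattern-only mode.

    Only package availability is checked here. A real loader would also need
    to confirm that the language models themselves have been downloaded.
    """
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            return name
    return None


backend = pick_ner_backend()
mode = f"patterns + {backend} NER" if backend else "patterns only"
print(mode)
```

Because the check returns None instead of raising, ingestion can continue with pattern extraction alone on a machine that has only the core requirements installed.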

Ingestion flow: Case text → pattern extraction → NER merge → storage. Triage inference does not affect stored cases.

What runs where: Cluster summaries for the Clusters view are precomputed and cached in the database for large corpora. The triage model classifies cases on demand by running the saved model over the current database (no static snapshot).
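To make the "no static snapshot" point concrete, here is a small sketch using an in-memory SQLite database. The table layout, column names, and the stub predictor are all assumptions for illustration. The real system would call the saved model instead of the stub.

```python
import sqlite3


def triage_on_demand(conn, predict, agency=None):
    """Read the current rows, classify them, and return tiers without writing."""
    sql = "SELECT id, victim_count FROM cases"
    params = ()
    if agency is not None:  # optional filter, like narrowing in the UI
        sql += " WHERE agency = ?"
        params = (agency,)
    rows = conn.execute(sql, params).fetchall()
    return {case_id: predict(victim_count) for case_id, victim_count in rows}


def stub_predict(victim_count):
    """Stand-in for the saved model's prediction."""
    return "high" if victim_count >= 3 else "low"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cases (id INTEGER PRIMARY KEY, victim_count INTEGER, agency TEXT)"
)
conn.executemany(
    "INSERT INTO cases (victim_count, agency) VALUES (?, ?)",
    [(1, "local"), (4, "FBI")],
)

first = triage_on_demand(conn, stub_predict)
# A case ingested later shows up in the next run, with no export step in between.
conn.execute("INSERT INTO cases (victim_count, agency) VALUES (?, ?)", (5, "FBI"))
second = triage_on_demand(conn, stub_predict)
fbi_only = triage_on_demand(conn, stub_predict, agency="FBI")
print(first, second, fbi_only)
```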

Deployment: The default requirements.txt includes scikit-learn and joblib for triage. Install requirements-ml.txt when you need embeddings, spaCy, or full transformers stacks locally.

Current Status

Triage experiments

UI on /triage: rule tiers, train/test-style metrics via API, demo Random Forest model over live data, paste-in live triage
Bundle path configurable; train with python3 scripts/train_triage_model.py --out models/triage_bundle.joblib
Labeled as experimental: models mirror rule bins—always compare to transparent rule scores
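The training step behind that command is not shown on this page, so the following is only a sketch of the general approach using scikit-learn and joblib: fit a random forest on structured features with rule-derived tiers as labels, measure agreement with the rubric on held-out cases, and save a bundle. The synthetic data, feature names, rubric weights, and bundle layout are all assumptions.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

FEATURES = ["feat_a", "feat_b", "feat_c", "feat_d"]  # hypothetical names
TIERS = ["low", "medium", "high"]

# Synthetic stand-in for the case database: 300 cases, binary features.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, len(FEATURES)))

# Hypothetical rubric: weighted score, binned into tier indices 0, 1, 2.
rule_scores = X @ np.array([3, 2, 1, 1])
y = np.digitize(rule_scores, bins=[2, 4])  # <2 low, 2..3 medium, >=4 high

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Agreement with the rubric on held-out cases, plus where the two differ.
pred = model.predict(X_test)
accuracy = accuracy_score(y_test, pred)
disagreements = np.flatnonzero(pred != y_test)
print(f"accuracy vs rubric: {accuracy:.3f}, disagreements: {len(disagreements)}")

# Save model and metadata together so inference uses the same feature order.
bundle_path = os.path.join(tempfile.mkdtemp(), "triage_bundle.joblib")
joblib.dump({"model": model, "features": FEATURES, "tiers": TIERS}, bundle_path)
loaded = joblib.load(bundle_path)
```

High agreement here says only that the model reproduces the rubric. It is not evidence that the rubric itself is right, which is why the disagreement list is worth reviewing alongside the score.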

NER

Status: Integrated on the ingestion path (Stanza-first; graceful without models)
Usage: Organizations, dates, locations, ages where the model fires—merged with regex features
Merge policy: Pattern extraction wins on conflicts; NER fills gaps (MergeProcessing)
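The merge policy above can be sketched as a field-level merge in which pattern output is copied first and NER values are used only where a field is missing or empty. This is an assumption about the behavior, not the actual MergeProcessing code. The function name and sample fields are hypothetical.

```python
def merge_entities(pattern_out: dict, ner_out: dict) -> dict:
    """Pattern values win on conflicts; NER only fills missing or empty fields."""
    merged = dict(pattern_out)
    for field, value in ner_out.items():
        if not merged.get(field):  # field absent, None, or an empty list
            merged[field] = value
    return merged


# Invented sample output from the two extractors.
pattern_out = {"agencies": ["FBI"], "locations": [], "dates": ["2021-03-04"]}
ner_out = {
    "agencies": ["Federal Bureau of Investigation"],
    "locations": ["Ohio"],
    "organizations": ["Example Org"],
}

merged = merge_entities(pattern_out, ner_out)
print(merged)
```

A field-level rule is the simplest reading of "pattern wins". A value-level merge that deduplicates entries within a field would be a different design, and nothing here shows which one the pipeline uses.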

Planned Capabilities

Semantic Case Search

Find cases by meaning, not keywords. Discover relationships through semantic understanding.

Enhanced Clustering

Case grouping using semantic embeddings to reveal patterns traditional methods miss.

Entity Networks

Visualize relationships between organizations, platforms, and entities extracted via NER.

Clean Case Views

Sanitized summaries preserving analytical value without explicit details.

Comparative Group Analysis

ML models run on the cases within each statistical category to compare case groups, helping us better understand these cases and the wider landscape of exploitation.