Experimental

Machine Learning Research

This area covers everything beyond deterministic extraction and rule-based analysis: NER during ingestion, optional heavy NLP libraries, ML classification, and the supervised triage experiments. Semantic similarity, named entity recognition, and content sanitization are in development to complement the existing pattern-based extraction, clustering, and visualizations.

Also in progress: ML models run over cases within each extraction feature to surface similarities, differences, and distinguishing patterns across statistics-style cohorts. Examples include FBI cases vs local law enforcement investigations, stranger cases vs family or teacher cases, and possible offender patterns across ages.

The main triage workflow lives on the Triage tab (/triage): rule-based tiers and evaluation, a Random Forest model run over the live database, and paste-in “live triage” that never writes to storage. This page explains scope, status, and how those pieces fit together.

Research Objectives

Pattern extraction and explicit tags remain the source of truth for case schema and filtering. ML here either augments extraction (NER) or learns from the existing rule-based priority scores (supervised triage tiers), so that model behavior can be compared against transparent rules rather than silently replacing them. Semantic search, richer clustering, and comparative group studies are being explored where the data supports them.

Active Development Areas
Technical Stack
  • scikit-learn / joblib — Triage training, metrics, saved bundle (core requirements.txt)
  • stanza — Primary NER path in ingestion when installed and models are present
  • sentence-transformers — Optional embeddings (requirements-ml.txt)
  • spaCy / transformers — Optional NER or summarization paths where enabled
Architecture

ML components are optional enhancements, not replacements. Pattern-based extraction remains primary, and ML adds capabilities on top of it. The system falls back gracefully to pattern-based extraction alone if ML models are unavailable or not wanted.
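The fallback behavior described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the age regex, and the shape of the feature dictionary are all assumptions made for the example. Only importlib.util.find_spec and re are real standard-library calls.

```python
import importlib.util
import re

# Hypothetical pattern for illustration: matches "age 14" or "aged 9".
AGE_PATTERN = re.compile(r"\baged?\s+(\d{1,2})\b", re.IGNORECASE)


def ner_available(module_name: str = "stanza") -> bool:
    """Return True if the optional NER package is importable."""
    return importlib.util.find_spec(module_name) is not None


def extract_patterns(text: str) -> dict:
    """Deterministic, regex-based extraction (always runs)."""
    return {"ages": [int(m) for m in AGE_PATTERN.findall(text)]}


def extract(text: str, ner=None) -> dict:
    """Run pattern extraction, then let an optional NER callable fill gaps.

    `ner` is a hypothetical callable returning a dict of features.
    If it is missing or raises, the pattern-only result is returned.
    """
    features = extract_patterns(text)
    if ner is None:
        return features
    try:
        extra = ner(text)
    except Exception:
        # Degrade to pattern-only output instead of failing ingestion.
        return features
    for key, value in extra.items():
        # Pattern results are kept; NER only adds keys that are absent.
        features.setdefault(key, value)
    return features
```

A caller could pass ner=None whenever ner_available() returns False, so ingestion never depends on the optional stack being installed.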

Ingestion flow: Case text → pattern extraction → NER merge → storage. Triage inference does not affect stored cases.

What runs where: Cluster summaries for the Clusters view are precomputed and cached in the database for large corpora. The triage model classifies cases on demand by running the saved model over the current database (no static snapshot).

Deployment: The default requirements.txt includes scikit-learn and joblib for triage. Install requirements-ml.txt when you need embeddings, spaCy, or the full transformers stack locally.

Current Status
Triage experiments
  • UI on /triage: rule tiers, train/test-style metrics via API, demo Random Forest model over live data, paste-in live triage
  • Bundle path configurable; train with python3 scripts/train_triage_model.py --out models/triage_bundle.joblib
  • Labeled as experimental: models mirror the rule bins, so always compare their output to the transparent rule scores
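One way to make the comparison against rule scores concrete is to tally where model tiers and rule tiers agree or differ. The sketch below uses only the standard library; the tier labels ("high", "medium", "low") and the function name are hypothetical, and the project's own metrics API may report this differently.

```python
from collections import Counter


def compare_to_rules(rule_tiers, model_tiers):
    """Summarize agreement between rule-based tiers and model-predicted tiers.

    Both arguments are equal-length sequences of tier labels, one per case.
    Returns the agreement rate plus a count of each (rule, model) mismatch.
    """
    if len(rule_tiers) != len(model_tiers):
        raise ValueError("rule_tiers and model_tiers must have the same length")

    pairs = Counter(zip(rule_tiers, model_tiers))
    total = len(rule_tiers)
    agree = sum(n for (rule, model), n in pairs.items() if rule == model)
    disagreements = {
        pair: n for pair, n in pairs.items() if pair[0] != pair[1]
    }
    return {
        "agreement": agree / total if total else 0.0,
        "disagreements": disagreements,
    }


# Hypothetical example data: four cases, one mismatch.
summary = compare_to_rules(
    ["high", "low", "medium", "low"],
    ["high", "low", "low", "low"],
)
```

Because the model is trained on the rule bins, a high agreement rate is expected; the mismatches are the cases worth inspecting by hand.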
NER
  • Status: Integrated on the ingestion path (Stanza-first; degrades gracefully when models are missing)
  • Usage: Organizations, dates, locations, and ages where the model fires, merged with regex features
  • Merge policy: Pattern extraction wins on conflicts; NER fills gaps (MergeProcessing)
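The merge policy above can be illustrated with a short sketch. This is not the MergeProcessing implementation, whose details are not shown here; the function name, the feature keys, and the definition of an "empty" value are assumptions chosen for the example.

```python
def is_empty(value) -> bool:
    """Treat None, empty strings, and empty containers as gaps."""
    return value is None or value == "" or value == [] or value == {}


def merge_features(pattern: dict, ner: dict) -> dict:
    """Merge NER output into pattern-extracted features.

    Pattern extraction wins on conflicts; NER only fills keys that are
    missing or empty in the pattern result.
    """
    merged = dict(pattern)  # copy so the inputs are not mutated
    for key, value in ner.items():
        if key not in merged or is_empty(merged[key]):
            merged[key] = value
    return merged


# Hypothetical feature dictionaries for one case.
pattern_features = {"ages": [14], "locations": [], "date": "2021-03-04"}
ner_features = {"ages": [15], "locations": ["Ohio"], "organizations": ["FBI"]}

merged = merge_features(pattern_features, ner_features)
```

In this example the conflicting age from NER is discarded, while the empty locations field and the absent organizations field are filled from NER.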
Planned Capabilities