Experimental

Machine Learning Research

Current research focuses on applying ML/NLP enhancements to improve case analysis capabilities. Specifically, it evaluates semantic similarity, named entity recognition, and content sanitization as complements to the existing pattern-based extraction, clustering, and visualizations. Additionally, ML models will be run on cases within each extraction feature to identify similarities, differences, and distinguishing patterns between cases in each stat category (e.g., FBI cases vs. local law enforcement investigations, stranger cases vs. family or teacher cases, possible offender patterns across ages).

Research Objectives

The core system demonstrates that pattern-based extraction works well for structured case data. This phase explores where ML can add value: understanding semantic relationships, extracting entities that regex might miss, generating case summaries that preserve insights while reducing exposure to harmful content, and comparing case groups to identify their distinguishing features, similarities, and differences.

Active Development Areas
Technical Stack
  • sentence-transformers — Semantic embeddings (all-MiniLM-L6-v2)
  • spaCy — Named entity recognition (en_core_web_sm)
  • transformers — Summarization models (BART/T5)
Architecture

ML components are optional enhancements, not replacements. Pattern-based extraction remains primary; ML provides additional capabilities. The system falls back gracefully to pattern-based extraction alone if ML models are unavailable or disabled.

Processing flow: Case Text → Pattern extraction → ML enhancement (optional) → Storage. Components are modular and can be evaluated independently.
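The processing flow above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names extract_patterns and process_case, the record layout, and the age regex are all hypothetical. The sketch only shows that the ML step is optional and that a failure in it leaves the pattern-based result intact.

```python
import re
from typing import Callable, Optional

# Hypothetical signature for an optional ML step: takes the case text and the
# pattern-extracted fields, returns an enriched copy of those fields.
Enhancer = Callable[[str, dict], dict]


def extract_patterns(text: str) -> dict:
    """Pattern-based extraction (the primary step). The regex is illustrative only."""
    fields = {}
    match = re.search(r"\baged?\s+(\d{1,2})\b", text, flags=re.IGNORECASE)
    if match:
        fields["age"] = int(match.group(1))
    return fields


def process_case(text: str, enhancer: Optional[Enhancer] = None) -> dict:
    """Case Text -> Pattern extraction -> ML enhancement (optional) -> record for storage."""
    record = {"text": text, "fields": extract_patterns(text), "ml_applied": False}
    if enhancer is not None:
        try:
            # Pass a copy so a failing enhancer cannot corrupt the pattern results.
            record["fields"] = enhancer(text, dict(record["fields"]))
            record["ml_applied"] = True
        except Exception:
            # ML unavailable or failed: keep the pattern-based fields as they are.
            pass
    return record
```

Because the enhancer is passed in as a plain callable, each ML component can be swapped out or evaluated on its own without touching the pattern-extraction step.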

Analysis Execution: All ML analysis (comparative case group analysis, semantic similarity, clustering) is pre-computed offline, not performed on live CaseLinker instances. Results are stored and served via API endpoints, ensuring performance and consistency without impacting real-time case processing.
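One way to picture the offline pre-computation is sketched below. It assumes embeddings have already been produced by a separate offline job and are available as plain lists of floats keyed by case ID. The function names, the JSON file layout, and the lookup helper are hypothetical and do not describe CaseLinker's actual storage or API.

```python
import json
import math
from pathlib import Path


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def precompute_similarities(embeddings: dict, out_path: Path) -> None:
    """Offline step: rank every other case by similarity and write the result to disk."""
    results = {}
    for case_id, vec in embeddings.items():
        scored = [
            (other_id, cosine(vec, other_vec))
            for other_id, other_vec in embeddings.items()
            if other_id != case_id
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        results[case_id] = scored
    out_path.write_text(json.dumps(results))


def get_similar(results_path: Path, case_id: str, top_k: int = 5) -> list:
    """Serving step: read stored results only; no model is loaded at request time."""
    results = json.loads(results_path.read_text())
    return [(other_id, score) for other_id, score in results.get(case_id, [])[:top_k]]
```

An API endpoint would then only need to call something like get_similar, so request latency does not depend on model inference.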

Deployment: ML dependencies are listed in requirements-ml.txt for local development. Production builds use requirements.txt (with the ML dependencies commented out) for faster deployment; all ML code remains in the repository.

Current Status
Initial Development
  • ML Processing Layer components implemented
  • Evaluating performance on case data
  • Integration pending evaluation results
  • Comparative case group analysis models in development
NER Development
  • Status: Active and integrated. NER extraction operational for law enforcement agency identification
  • Usage: Extracts organizations, ages, dates, and locations from case text using Stanza/Transformers NER models
  • Integration: Merges with regex-based pattern extraction via the MergeProcessing class. NER fills in missing data; pattern-based results take precedence when both sources have data
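The precedence rule described in the Integration bullet can be illustrated with a small sketch. This is not the MergeProcessing class itself, whose interface is not shown here. The function merge_fields and the flat field dictionaries are assumptions made for illustration, as is the choice to treat None and empty values as "missing".

```python
def _has_value(value) -> bool:
    """Treat None and empty strings, lists, or dicts as missing data."""
    return value not in (None, "", [], {})


def merge_fields(pattern_fields: dict, ner_fields: dict) -> dict:
    """Merge two extraction results; pattern-based values win on conflict."""
    # Start from every pattern-based field that actually holds a value.
    merged = {k: v for k, v in pattern_fields.items() if _has_value(v)}
    # NER only fills fields that pattern extraction left missing.
    for key, value in ner_fields.items():
        if key not in merged and _has_value(value):
            merged[key] = value
    return merged


pattern_result = {"agency": "FBI", "age": None}
ner_result = {"agency": "Federal Bureau", "age": 14, "location": "Ohio"}
print(merge_fields(pattern_result, ner_result))
# {'agency': 'FBI', 'age': 14, 'location': 'Ohio'}
```

Here the pattern-based "agency" value is kept, while "age" and "location" come from NER because pattern extraction had nothing for them.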
Planned Capabilities