CaseLinker - System Architecture

Case Reports

Structured Case Reports / Case Related Data

Input documents: PDFs containing case narratives, investigation reports, and case-related data.

Layer 1

Ingestion

Text Extraction & Normalization

# Extract text from PDF

text = extract_pdf_text("case_report.pdf")

# Normalize format

normalized = normalize_text(text)

# Returns: Clean, structured text ready for processing

PDF text extraction using pdfplumber
Organization detection from filenames
Batch processing support
Input validation and error handling
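
The ingestion steps above can be sketched as follows. Only the names extract_pdf_text and normalize_text and the use of pdfplumber come from this document; the function bodies, the detect_organization helper, and the filename convention (organization as the leading alphabetic token, e.g. azicac_2023_report.pdf) are assumptions made for illustration.

```python
import re
from pathlib import Path


def extract_pdf_text(path: str) -> str:
    """Concatenate the text of every page in a PDF using pdfplumber."""
    import pdfplumber  # third-party; imported lazily so the helpers below work without it

    with pdfplumber.open(path) as pdf:
        # extract_text() can return None for a page without a text layer
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def normalize_text(text: str) -> str:
    """Assumed normalization: unify newlines, trim lines, collapse blank-line runs."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


def detect_organization(filename: str) -> str:
    """Assumed convention: the organization is the leading alphabetic token of the file name."""
    match = re.match(r"[A-Za-z]+", Path(filename).stem)
    if not match:
        raise ValueError(f"cannot detect organization from {filename!r}")
    return match.group(0).lower()
```

Raising ValueError on an unrecognized file name is one way to surface bad inputs early during batch processing; the actual validation rules are not described here.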

Layer 2

Processing

Feature Extraction & Case Schema

How it's done: Text is first cleaned (URLs removed, extraction artifacts normalized), cases are batched by temporal patterns, and each case is given a unique ID. Feature extraction is hybrid: regex patterns capture structured data (demographics, platforms, evidence, prosecution), pattern-based matching captures semantic features (severity indicators, case topics, severity phrases), and ML/NER extraction supplements both with law enforcement agencies, ages, dates, and locations.

# 1. Clean URLs and artifacts

cleaned_text = clean_urls_from_text(raw_text)

# 2. Batch cases by month patterns

cases = case_batching(cleaned_text, org_name="azicac")

# 3. Extract features (regex + patterns + NER)

features = [extract_features(case) for case in cases]

# Regex: Demographics, platforms, evidence, prosecution

# Patterns: Severity indicators, case topics, phrases

# NER: Law enforcement agencies, ages, dates, locations
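
Step 2 above, batching by month patterns with unique case IDs, might look roughly like this. The function name batch_by_month, the assumption that each batch starts with a heading line such as "March 2023", and the hash-based ID scheme are all hypothetical; the real case_batching logic is not shown in this document.

```python
import hashlib
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# Assumed layout: each batch begins with a heading line such as "March 2023".
MONTH_HEADING = re.compile(rf"^(?:{MONTHS})[ \t]+\d{{4}}[ \t]*$", re.MULTILINE)


def batch_by_month(text: str, org_name: str) -> list:
    """Split text at month headings and attach a deterministic case ID to each batch."""
    headings = list(MONTH_HEADING.finditer(text))
    cases = []
    for i, heading in enumerate(headings):
        end = headings[i + 1].start() if i + 1 < len(headings) else len(text)
        body = text[heading.end():end].strip()
        if not body:
            continue  # heading with no narrative under it
        period = heading.group(0).strip()
        # Hypothetical ID scheme: hash of organization, period, and text
        digest = hashlib.sha256(f"{org_name}|{period}|{body}".encode("utf-8")).hexdigest()
        cases.append({
            "case_id": f"{org_name}-{digest[:12]}",
            "period": period,
            "case_text": body,
        })
    return cases
```

Deriving the ID from the content keeps it stable across re-ingestion of the same report, which is one plausible reason to prefer a hash over a running counter.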

Demographics

Victim age, count, gender
Perpetrator age, RSO status

Platforms

Social media, online methods, communication channels

Severity

Infant, very young, rape, production indicators

Topics

Hands-on, possession, online-only, family, stranger

Prosecution

Charges, booking status, outcomes

Evidence

Images, videos, storage volume, messages
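
A minimal sketch of the regex and pattern-matching side of extraction for a few of the categories above. The vocabularies, regular expressions, output keys, and the name extract_features_sketch are illustrative assumptions, not the project's actual pattern lists; the NER step is omitted.

```python
import re

# Illustrative vocabularies only; the real pattern lists are not shown in this document.
PLATFORM_TERMS = ("snapchat", "instagram", "discord", "kik")
TOPIC_PATTERNS = {
    "possession": re.compile(r"\bpossess(?:ion|ed|ing)\b", re.I),
    "production": re.compile(r"\bproduc(?:tion|ed|ing)\b", re.I),
}
VICTIM_AGE = re.compile(r"\b(\d{1,2})[- ]year[- ]old\s+(?:victim|child|girl|boy)\b", re.I)
EVIDENCE_COUNT = re.compile(r"\b(\d[\d,]*)\s+(images?|videos?)\b", re.I)


def extract_features_sketch(case_text: str) -> dict:
    """Return a small feature dict built from regex and keyword matches."""
    lowered = case_text.lower()
    evidence = {}
    for count, kind in EVIDENCE_COUNT.findall(case_text):
        key = kind.lower().rstrip("s") + "s"  # "image" and "images" both map to "images"
        evidence[key] = evidence.get(key, 0) + int(count.replace(",", ""))
    return {
        "victim_ages": [int(age) for age in VICTIM_AGE.findall(case_text)],
        "platforms": [p for p in PLATFORM_TERMS if re.search(rf"\b{p}\b", lowered)],
        "case_topics": [t for t, pat in TOPIC_PATTERNS.items() if pat.search(case_text)],
        "evidence": evidence,
    }
```

Keeping each category behind its own compiled pattern makes it straightforward to audit which rule produced which tag.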

Layer 3

Storage

Case schema & persistence

# Production: DATABASE_URL → PostgreSQL (Railway).

# Local/dev: SQLite file (caselinker.db).

storage = CaseStorage()  # or CaseStorage("caselinker.db")

storage.store_case(case)

# cases: JSON columns (topics, severity, platforms, …),

# raw_data (ingestion + case_text), extracted_features (slim schema)

# Postgres extras: precomputed_clusters, cluster_groups_slim (fast /api/cluster-groups)

Deployed on Railway with PostgreSQL; SQLite for local development
Normalized columns plus raw_data / extracted_features JSON
Optional slim cluster caches on Postgres for large corpora
Shared CaseStorage interface; hydrate/slim via case_storage_utils
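
To illustrate the storage layout described above, here is a minimal SQLite-only sketch. SqliteCaseStore, get_case, backend_name, and the exact column set are hypothetical stand-ins; only store_case, the JSON-column idea, and the DATABASE_URL rule come from this document, and the real CaseStorage may differ.

```python
import json
import os
import sqlite3
from typing import Optional

# Assumed subset of columns; values are stored as JSON text.
JSON_COLUMNS = ("topics", "severity", "platforms", "raw_data", "extracted_features")


class SqliteCaseStore:
    """Hypothetical stand-in for the SQLite side of CaseStorage."""

    def __init__(self, path: str = ":memory:"):
        self.conn = sqlite3.connect(path)
        columns = ", ".join(f"{col} TEXT" for col in JSON_COLUMNS)
        self.conn.execute(
            f"CREATE TABLE IF NOT EXISTS cases (case_id TEXT PRIMARY KEY, org_name TEXT, {columns})"
        )

    def store_case(self, case: dict) -> None:
        row = [case["case_id"], case.get("org_name")]
        row += [json.dumps(case.get(col)) for col in JSON_COLUMNS]
        placeholders = ", ".join("?" for _ in row)
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(f"INSERT OR REPLACE INTO cases VALUES ({placeholders})", row)

    def get_case(self, case_id: str, slim: bool = False) -> Optional[dict]:
        found = self.conn.execute(
            "SELECT * FROM cases WHERE case_id = ?", (case_id,)
        ).fetchone()
        if found is None:
            return None
        case = {"case_id": found[0], "org_name": found[1]}
        for col, value in zip(JSON_COLUMNS, found[2:]):
            case[col] = json.loads(value)
        if slim:
            case.pop("raw_data")  # slim rows omit the bulky ingestion payload
        return case


def backend_name(env=os.environ) -> str:
    """Mirror the documented rule: DATABASE_URL set -> PostgreSQL, otherwise SQLite."""
    return "postgresql" if env.get("DATABASE_URL") else "sqlite"
```

Storing list-valued fields as JSON text keeps the same schema usable on both backends, at the cost of filtering in application code rather than in SQL.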

Layer 4

Analysis

Filtering, facets, clustering, triage

# Tag intersection (topics, severity, platforms, investigation, relationship, custom)

cases = return_tagged_cases(all_cases, [

  {'tag': 'production', 'category': 'case_topics'},

  {'tag': 'infant', 'category': 'severity_indicators'},

])

# HTTP: /api/facet-tree, /api/facet-distinct, /api/facet-cohort-members

# (navigable tag combinations + cohort case IDs from live DB)

# Five cluster families + Jaccard “general”; /api/cluster-groups, /api/automated-analysis

triaged = triage_cases(cases)  # rule-based 5–10 score

# Experimental ML: saved sklearn bundle; /api/triage-model-corpus (live DB), /api/triage-live
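
The tag-intersection call above can be read as an AND filter over (category, tag) pairs. The sketch below assumes each case is a dict that stores a list of tags under each category key; that layout and the name return_tagged_cases_sketch are assumptions, not the project's confirmed implementation.

```python
def return_tagged_cases_sketch(all_cases, required_tags):
    """Keep only the cases that carry every requested tag (intersection logic)."""
    def has_tag(case, spec):
        # A missing or empty category simply fails the membership test
        return spec["tag"] in (case.get(spec["category"]) or [])

    return [
        case for case in all_cases
        if all(has_tag(case, spec) for spec in required_tags)
    ]


example_cases = [
    {"case_id": "a", "case_topics": ["production"], "severity_indicators": ["infant"]},
    {"case_id": "b", "case_topics": ["production"], "severity_indicators": []},
    {"case_id": "c", "case_topics": ["possession"]},
]
matched = return_tagged_cases_sketch(example_cases, [
    {"tag": "production", "category": "case_topics"},
    {"tag": "infant", "category": "severity_indicators"},
])
```

With an empty tag list every case passes, which matches the usual convention for an intersection over zero constraints.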

Filter cases sharing the same tags (intersection logic)
Facet tree APIs for exploration with processed features
Clustering and automated insights in analysis.py; optional ML triage alongside rules
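
Two of the analysis ideas above, Jaccard similarity between cases and a rule-based triage score bounded to 5-10, can be sketched as follows. The tag categories used for similarity, the 0.5 threshold, and the severity weights are hypothetical; this document states only that a Jaccard "general" family exists and that the rule-based score spans 5-10.

```python
from itertools import combinations

TAG_CATEGORIES = ("case_topics", "severity_indicators", "platforms")  # assumed subset
# Hypothetical weights; only the 5-10 range comes from this document.
SEVERITY_WEIGHTS = {"infant": 3, "very_young": 2, "production": 2}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection over size of the union; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def case_tags(case: dict) -> set:
    """Flatten all tag categories into (category, tag) pairs."""
    return {(cat, tag) for cat in TAG_CATEGORIES for tag in case.get(cat) or []}


def similar_pairs(cases, threshold: float = 0.5):
    """Return (id_a, id_b, score) for every pair at or above the threshold."""
    pairs = []
    for left, right in combinations(cases, 2):
        score = jaccard(case_tags(left), case_tags(right))
        if score >= threshold:
            pairs.append((left["case_id"], right["case_id"], score))
    return pairs


def triage_score(case: dict) -> int:
    """Start at the floor of 5, add weights for severity indicators, cap at 10."""
    bonus = sum(SEVERITY_WEIGHTS.get(tag, 0) for tag in case.get("severity_indicators") or [])
    return min(10, 5 + bonus)
```

The pairwise comparison is quadratic in the number of cases, which may be one reason the storage layer keeps precomputed cluster caches for large corpora.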

Layer 5

Visualization

Analyst UI: static pages + JSON APIs

# FastAPI serves visualization/*.html; browser fetches /api/* JSON.
  
# Examples: /api/cases (full row on demand), /api/case-count, /api/stats, /api/cluster-groups, /api/facet-tree, /api/triage-eval, /api/triage-live

# D3 (and small helpers) in-page: clusters, stats, search, audit trail, live triage, analysis facet explorer

Slim list endpoints where possible; click-through loads full case text from API
Same nav across pages; server-side caching where it helps (e.g. cases list, clusters)
Modular: new views can sit on the same APIs and storage
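
The slim-list and server-side caching ideas above can be illustrated independently of the web framework. SLIM_FIELDS, slim_case, and TTLCache are hypothetical names, and the time-to-live strategy is an assumption; the document does not say how its caches are invalidated.

```python
import time

SLIM_FIELDS = ("case_id", "org_name", "topics", "severity")  # assumed list-view fields


def slim_case(case: dict) -> dict:
    """Project a full case row down to the fields a list view needs."""
    return {key: case[key] for key in SLIM_FIELDS if key in case}


class TTLCache:
    """Tiny time-based cache for expensive responses such as the cases list."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}

    def get_or_compute(self, key, compute):
        now = self.clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # still fresh: reuse the cached value
        value = compute()
        self._entries[key] = (now, value)
        return value
```

An endpoint handler could wrap its database query in get_or_compute so repeated page loads reuse the slim list until the entry expires, while a click-through still fetches the full case row on demand.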