Case Reports
Structured Case Reports / Case Related Data

Input: PDF documents containing case narratives, investigation reports, and related case data.

Layer 1
Ingestion
Text Extraction & Normalization
# Extract text from PDF
text = extract_pdf_text("case_report.pdf")
# Normalize format
normalized = normalize_text(text)
# Returns: Clean, structured text ready for processing
  • PDF text extraction using pdfplumber
  • Organization detection from filenames
  • Batch processing support
  • Input validation and error handling
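The ingestion steps above can be sketched roughly as follows. The pdfplumber call matches the library named in the bullets; `detect_org_from_filename` and the normalization rules are illustrative assumptions, not the project's actual implementation (pdfplumber is imported lazily so the pure-text helpers run without it):

```python
import re
from pathlib import Path

def extract_pdf_text(path: str) -> str:
    """Extract text from every page of a PDF (pdfplumber, per Layer 1)."""
    import pdfplumber  # lazy import: the text helpers below work without it
    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)

def detect_org_from_filename(path: str) -> str:
    """Hypothetical convention: org name is the first underscore token."""
    return Path(path).stem.split("_")[0].lower()

def normalize_text(text: str) -> str:
    """Collapse whitespace and strip common PDF extraction artifacts."""
    text = text.replace("\u00a0", " ")      # non-breaking spaces
    text = re.sub(r"-\n(?=\w)", "", text)   # rejoin hyphenated line breaks
    return re.sub(r"[ \t]+", " ", text).strip()
```

Batch processing is then just a loop over a directory of PDFs, calling these helpers per file with validation around the extraction step.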
Layer 2
Processing
Feature Extraction & Case Schema

How it's done: a hybrid approach. Regex patterns extract structured data (demographics, platforms, evidence, prosecution); pattern-based matching captures semantic features (severity indicators, case topics, severity phrases); ML/NER extraction supplements these with law enforcement agencies, ages, dates, and locations. Text is cleaned (URL removal, artifact normalization), cases are batched by temporal patterns, and unique case IDs are generated.

# 1. Clean URLs and artifacts
cleaned_text = clean_urls_from_text(raw_text)

# 2. Batch cases by month patterns
cases = case_batching(cleaned_text, org_name="azicac")

# 3. Extract features (regex + patterns + NER)
features = extract_features(case)
# Regex: Demographics, platforms, evidence, prosecution
# Patterns: Severity indicators, case topics, phrases
# NER: Law enforcement agencies, ages, dates, locations
  • Demographics: victim age, count, gender; perpetrator age, RSO status
  • Platforms: social media, online methods, communication channels
  • Severity: infant, very young, rape, production indicators
  • Topics: hands-on, possession, online-only, family, stranger
  • Prosecution: charges, booking status, outcomes
  • Evidence: images, videos, storage volume, messages
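The regex half of the hybrid extractor can be sketched like this. The patterns below are illustrative stand-ins, far simpler than the project's real ones, but show the shape of the per-category extraction:

```python
import re

# Illustrative patterns only -- the project's actual regexes are richer.
PATTERNS = {
    "victim_age": re.compile(r"(\d{1,2})[- ]year[- ]old victim", re.I),
    "platform": re.compile(r"\b(Snapchat|Instagram|Kik|Discord)\b", re.I),
    "severity_infant": re.compile(r"\binfant\b", re.I),
    "charges": re.compile(r"charged with ([\w\s]+?)(?:\.|,|$)", re.I),
}

def extract_features(case_text: str) -> dict:
    """Run each category's pattern and collect matches into one dict."""
    features = {}
    m = PATTERNS["victim_age"].search(case_text)
    if m:
        features["victim_age"] = int(m.group(1))
    features["platforms"] = sorted(
        {p.title() for p in PATTERNS["platform"].findall(case_text)})
    features["severity_indicators"] = (
        ["infant"] if PATTERNS["severity_infant"].search(case_text) else [])
    m = PATTERNS["charges"].search(case_text)
    if m:
        features["charges"] = m.group(1).strip()
    return features
```

The pattern-based and NER layers described above would merge their results into the same feature dict, each category writing its own keys.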
Layer 3
Storage
Case schema & persistence
# Production: DATABASE_URL → PostgreSQL (Railway).
# Local/dev: SQLite file (caselinker.db).
storage = CaseStorage() # or CaseStorage("caselinker.db")
storage.store_case(case)

# cases: JSON columns (topics, severity, platforms, …),
# raw_data (ingestion + case_text), extracted_features (slim schema)
# Postgres extras: precomputed_clusters, cluster_groups_slim (fast /api/cluster-groups)
  • Deployed on Railway with PostgreSQL; SQLite for local development
  • Normalized columns plus raw_data / extracted_features JSON
  • Optional slim cluster caches on Postgres for large corpora
  • Shared CaseStorage interface; hydrate/slim via case_storage_utils
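A minimal sketch of the `CaseStorage` idea, using stdlib sqlite3 for the local path. The column names follow the description above (JSON columns plus `raw_data` and `extracted_features`), but the schema details and method signatures are assumptions; the production PostgreSQL branch keyed off `DATABASE_URL` is not shown:

```python
import json
import sqlite3

class CaseStorage:
    """Sketch: SQLite locally; production would swap in PostgreSQL
    when DATABASE_URL is set (branch not shown here)."""

    def __init__(self, path: str = ":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            """CREATE TABLE IF NOT EXISTS cases (
                   case_id TEXT PRIMARY KEY,
                   topics TEXT,              -- JSON array
                   severity TEXT,            -- JSON array
                   platforms TEXT,           -- JSON array
                   raw_data TEXT,            -- JSON: ingestion + case_text
                   extracted_features TEXT   -- JSON: slim schema
               )""")

    def store_case(self, case: dict) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO cases VALUES (?, ?, ?, ?, ?, ?)",
            (case["case_id"],
             json.dumps(case.get("topics", [])),
             json.dumps(case.get("severity", [])),
             json.dumps(case.get("platforms", [])),
             json.dumps(case.get("raw_data", {})),
             json.dumps(case.get("extracted_features", {}))))
        self.conn.commit()

    def load_case(self, case_id: str) -> dict:
        row = self.conn.execute(
            "SELECT * FROM cases WHERE case_id = ?", (case_id,)).fetchone()
        keys = ["case_id", "topics", "severity", "platforms",
                "raw_data", "extracted_features"]
        return {k: (v if k == "case_id" else json.loads(v))
                for k, v in zip(keys, row)}
```

Keeping list-valued features as JSON columns lets both backends share one interface, with hydrate/slim conversion handled in one utilities module as the bullets describe.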
Layer 4
Analysis
Filtering, facets, clustering, triage
# Tag intersection (topics, severity, platforms, investigation, relationship, custom)
cases = return_tagged_cases(all_cases, [
  {'tag': 'production', 'category': 'case_topics'},
  {'tag': 'infant', 'category': 'severity_indicators'},
])

# HTTP: /api/facet-tree, /api/facet-distinct, /api/facet-cohort-members
# (navigable tag combinations + cohort case IDs from live DB)

# Five cluster families + Jaccard “general”; /api/cluster-groups, /api/automated-analysis
triaged = triage_cases(cases) # rule-based 5–10 score
# Experimental ML: saved sklearn bundle; /api/triage-model-corpus (live DB), /api/triage-live
  • Filter cases sharing the same tags (intersection logic)
  • Facet tree APIs for exploration with processed features
  • Clustering and automated insights in analysis.py; optional ML triage alongside rules
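The intersection filter and the Jaccard "general" grouping can be sketched as below. `return_tagged_cases` mirrors the call shown in the snippet above; the greedy clustering internals and the choice of tag categories are illustrative assumptions:

```python
def return_tagged_cases(all_cases, tag_filters):
    """Keep only cases carrying EVERY requested (tag, category) pair."""
    def has(case, f):
        return f["tag"] in case.get(f["category"], [])
    return [c for c in all_cases if all(has(c, f) for f in tag_filters)]

def jaccard(a, b):
    """Jaccard similarity of two tag collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def general_clusters(cases, threshold=0.5):
    """Greedy single-pass grouping by tag-set Jaccard similarity."""
    clusters = []
    for case in cases:
        tags = [t for cat in ("case_topics", "severity_indicators")
                for t in case.get(cat, [])]
        for cluster in clusters:
            if jaccard(tags, cluster["tags"]) >= threshold:
                cluster["members"].append(case["case_id"])
                break
        else:
            clusters.append({"tags": tags, "members": [case["case_id"]]})
    return clusters
```

The rule-based triage score would sit alongside these, summing weighted contributions from severity and evidence tags into the 5–10 range.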
Layer 5
Visualization
Analyst UI: static pages + JSON APIs
# FastAPI serves visualization/*.html; the browser fetches /api/* JSON.

# Examples: /api/cases (full row on demand), /api/case-count, /api/stats, /api/cluster-groups, /api/facet-tree, /api/triage-eval, /api/triage-live

# D3 (and small helpers) in-page: clusters, stats, search, audit trail, live triage, analysis facet explorer
  • Slim list endpoints where possible; click-through loads full case text from API
  • Same nav across pages; server-side caching where it helps (e.g. cases list, clusters)
  • Modular: new views can sit on the same APIs and storage
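The production app uses FastAPI; as a dependency-free stand-in, here is a stdlib http.server sketch of the same pattern: one JSON API route, with static HTML for everything else. The route name, the in-memory `CASES` list, and the placeholder page are illustrative:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for rows served from the live DB.
CASES = [{"id": "azicac-2024-01-001", "topics": ["production"]},
         {"id": "azicac-2024-01-002", "topics": ["possession"]}]

class ApiHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/api/case-count":
            body = json.dumps({"count": len(CASES)}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            # In the real app, FastAPI serves visualization/*.html here.
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(b"<html><body>placeholder page</body></html>")

    def log_message(self, *args):  # silence per-request logging
        pass

# Bind to an ephemeral port, serve in the background, fetch one route.
server = HTTPServer(("127.0.0.1", 0), ApiHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]
with urllib.request.urlopen(f"http://127.0.0.1:{port}/api/case-count") as resp:
    payload = json.loads(resp.read())
server.shutdown()
```

The same split (slim JSON endpoints, full case text fetched on click-through) is what keeps the D3 pages fast against a large corpus.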

Open source on GitHub
