Case Reports
Structured Case Reports / Case-Related Data

Input documents are PDFs containing case narratives, investigation reports, and related case data.

Layer 1
Ingestion
Text Extraction & Normalization
# Extract text from PDF
text = extract_pdf_text("case_report.pdf")
# Normalize format
normalized = normalize_text(text)
# Returns: Clean, structured text ready for processing
  • PDF text extraction using pdfplumber
  • Organization detection from filenames
  • Batch processing support
  • Input validation and error handling
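The helpers in the snippet above are not shown in full. A minimal sketch of what they might look like, using pdfplumber as the bullets note (the function bodies and normalization rules here are assumptions, not the tool's actual implementation):

```python
import re

def extract_pdf_text(path: str) -> str:
    """Concatenate the text of every page of a PDF."""
    import pdfplumber  # third-party; the extraction library named in this layer
    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)

def normalize_text(text: str) -> str:
    """Collapse whitespace runs and strip common PDF extraction artifacts."""
    text = text.replace("\u00ad", "")       # soft hyphens from PDF layout
    text = re.sub(r"[ \t]+", " ", text)     # runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # excess blank lines
    return text.strip()
```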
Layer 2
Processing
Feature Extraction & Case Schema

How it's done: A hybrid approach: regex patterns extract structured data (demographics, platforms, evidence, prosecution), while pattern-based matching captures semantic features (severity indicators, case topics, severity phrases). ML-based named-entity recognition (NER) supplements these with law enforcement agencies, ages, dates, and locations. Text is cleaned (URL removal, artifact normalization), cases are batched by temporal patterns, and unique case IDs are generated.

# 1. Clean URLs and artifacts
cleaned_text = clean_urls_from_text(raw_text)

# 2. Batch cases by month patterns
cases = case_batching(cleaned_text, org_name="azicac")

# 3. Extract features (regex + patterns + NER)
features = extract_features(case)
# Regex: Demographics, platforms, evidence, prosecution
# Patterns: Severity indicators, case topics, phrases
# NER: Law enforcement agencies, ages, dates, locations
  • Demographics: victim age, count, gender; perpetrator age, RSO status
  • Platforms: social media, online methods, communication channels
  • Severity: infant, very young, rape, production indicators
  • Topics: hands-on, possession, online-only, family, stranger
  • Prosecution: charges, booking status, outcomes
  • Evidence: images, videos, storage volume, messages
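A stripped-down sketch of the regex-and-pattern side of this extraction. The pattern lists, field names, and naive substring matching below are illustrative placeholders, not the tool's actual vocabularies:

```python
import re

# Illustrative vocabularies only; the real extractor covers many more fields.
AGE_PATTERN = re.compile(r"\b(\d{1,2})[- ]year[- ]old\b", re.I)
PLATFORMS = ("facebook", "instagram", "snapchat", "kik", "discord")
SEVERITY_PHRASES = ("infant", "toddler", "production", "rape")

def extract_features(case_text: str) -> dict:
    """Pull a few demographic, platform, and severity signals from one case."""
    text = case_text.lower()
    return {
        "ages_mentioned": [int(m) for m in AGE_PATTERN.findall(text)],
        # naive substring checks; a real extractor would use word boundaries
        "platforms_used": [p for p in PLATFORMS if p in text],
        "severity_indicators": [s for s in SEVERITY_PHRASES if s in text],
        "rso_status": bool(
            re.search(r"registered sex offender|\bRSO\b", case_text, re.I)
        ),
    }
```

NER-based extraction (agencies, dates, locations) would layer on top of this, filling fields the regexes miss.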
Layer 3
Storage
Case Schema & Database Storage
# Store case with schema
storage = CaseStorage("caselinker.db")
storage.store_case(case)

# Normalized tables:
# - cases (main table)
# - victim_demographics
# - perpetrator_demographics
# - prosecution_outcomes
# - Raw data preserved in JSON
  • SQLite database with normalized schema
  • Raw data preservation in JSON format
  • Indexed fields for fast queries
  • Modular interface for database swapping
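A minimal sketch of what CaseStorage might look like under these bullets. The source names the tables but not their columns, so the schema and case-dict shape below are assumptions:

```python
import json
import sqlite3

class CaseStorage:
    """Sketch of the normalized SQLite storage layer (columns assumed)."""

    def __init__(self, db_path: str):
        self.conn = sqlite3.connect(db_path)
        self.conn.executescript("""
            CREATE TABLE IF NOT EXISTS cases (
                case_id  TEXT PRIMARY KEY,
                org_name TEXT,
                raw_json TEXT               -- full case preserved as JSON
            );
            CREATE TABLE IF NOT EXISTS victim_demographics (
                case_id TEXT REFERENCES cases(case_id),
                age     INTEGER,
                gender  TEXT
            );
            CREATE INDEX IF NOT EXISTS idx_cases_org ON cases(org_name);
        """)

    def store_case(self, case: dict) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO cases (case_id, org_name, raw_json) "
            "VALUES (?, ?, ?)",
            (case["case_id"], case.get("org_name"), json.dumps(case)),
        )
        for victim in case.get("victims", []):
            self.conn.execute(
                "INSERT INTO victim_demographics (case_id, age, gender) "
                "VALUES (?, ?, ?)",
                (case["case_id"], victim.get("age"), victim.get("gender")),
            )
        self.conn.commit()
```

Keeping `raw_json` alongside the normalized columns means the schema can evolve without re-extracting from the PDFs, and a different database can be swapped in behind the same interface.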
Layer 4
Analysis
Tag-Based Filtering & Triage
# Tag-based filtering (intersection logic)
# Select case features to analyze across categories:
# - Case Topics
# - Severity Indicators
# - Platforms & Environments
# - Investigation Types
# - Perpetrator Relationships
# - Custom topics
cases = return_tagged_cases(all_cases, [
  {'tag': 'production', 'category': 'case_topics'},
  {'tag': 'infant', 'category': 'severity_indicators'},
  {'tag': 'facebook', 'category': 'platforms_used'},
])
# Returns: Cases matching ALL selected tags

# Priority triage scoring
triaged = triage_cases(cases)
# Multi-factor scoring (5-10 scale):
# - Severity (35%), Victim count (30%)
# - Case type (25%), Evidence (10%), RSO (10%)
  • Tag-based filtering with intersection logic
  • Select cases by common features/tags
  • Multi-factor priority triage (5-10 scale)
  • Automated pattern detection and insights
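Assuming each case carries per-category tag lists and pre-normalized component scores (0-5 here, an assumption), the filtering and triage above can be sketched as follows. The quoted weights sum to 1.10, so the sketch clamps results into the stated 5-10 band:

```python
def return_tagged_cases(all_cases, selected_tags):
    """Keep only cases carrying ALL selected tags (intersection logic)."""
    def has_tag(case, sel):
        return sel["tag"] in case.get(sel["category"], [])
    return [c for c in all_cases if all(has_tag(c, s) for s in selected_tags)]

# Weights as quoted above (they sum to 1.10, hence the clamp below).
WEIGHTS = {"severity": 0.35, "victim_count": 0.30, "case_type": 0.25,
           "evidence": 0.10, "rso": 0.10}

def triage_cases(cases):
    """Attach a clamped 5-10 priority score and sort highest-first."""
    scored = []
    for case in cases:
        raw = sum(w * case.get(f"{k}_score", 0) for k, w in WEIGHTS.items())
        scored.append({**case, "priority": min(10.0, max(5.0, 5.0 + raw))})
    return sorted(scored, key=lambda c: c["priority"], reverse=True)
```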
Layer 5
Visualization
Interactive Visualizations & Analyst Interface
# D3.js visualizations
renderTimeline(cases)
renderSeverityViz(cases)
renderOutcomesViz(cases)
renderAutomatedAnalysis(results)

# Interactive features:
# - Click to view case details
# - Text highlighting for verification
# - Temporal filtering
# - Case grouping visualization
  • Timeline visualization (chronological view)
  • Severity indicators (color-coded analysis)
  • Prosecution outcomes (categorized display)
  • Perpetrator demographics (pie charts)
  • Platform/environment distribution
  • Organizations involved (agency analysis)
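The rendering functions above run in the browser; on the Python side, a small export step (an assumption, not shown in the source) could reduce stored cases to the JSON payload the D3.js views fetch:

```python
import json

def cases_to_viz_json(cases):
    """Reduce stored cases to the fields the visualizations consume.

    Field names here are illustrative; they would need to match what
    renderTimeline / renderSeverityViz actually expect.
    """
    return [
        {
            "case_id": c.get("case_id"),
            "date": c.get("date"),                          # timeline view
            "severity_indicators": c.get("severity_indicators", []),
            "prosecution_outcome": c.get("prosecution_outcome"),
            "priority": c.get("priority"),                  # triage score
        }
        for c in cases
    ]

# Example: write the payload for the front end to load with d3.json():
# json.dump(cases_to_viz_json(all_cases), open("cases.json", "w"), indent=2)
```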

Open source on GitHub
