Input: PDF documents containing case narratives, investigation reports, and case-related data.
How it's done: A hybrid approach. Regex patterns extract structured data (demographics, platforms, evidence, prosecution), and pattern-based matching captures semantic features (severity indicators, case topics, severity phrases). ML/NER extraction supplements these with law enforcement agencies, ages, dates, and locations. Text is cleaned (URL removal, artifact normalization), cases are batched by temporal patterns, and a unique case ID is generated for each case.
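The cleaning, regex, and phrase-matching steps can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: every pattern, phrase list, and function name here (`clean_text`, `extract_features`, `make_case_id`) is a hypothetical stand-in, the hash-based case ID is only one plausible scheme, and the ML/NER and temporal batching steps are left out because they depend on details not described here.

```python
import hashlib
import re

# Hypothetical patterns and phrases, for illustration only.
URL_RE = re.compile(r"https?://\S+")
AGE_RE = re.compile(r"\b(\d{1,2})[- ]year[- ]old\b", re.IGNORECASE)
PLATFORM_RE = re.compile(r"\b(Snapchat|Instagram|Discord|Facebook)\b", re.IGNORECASE)
SEVERITY_PHRASES = ("multiple victims", "prior conviction")


def clean_text(raw: str) -> str:
    """Remove URLs and normalize common PDF extraction artifacts."""
    text = URL_RE.sub("", raw)
    text = re.sub(r"-\n(?=\w)", "", text)  # rejoin words hyphenated across lines
    text = re.sub(r"\s+", " ", text)       # collapse newlines and repeated spaces
    return text.strip()


def extract_features(text: str) -> dict:
    """Regex for structured fields, phrase lookup for semantic features."""
    lowered = text.lower()
    return {
        "ages": [int(age) for age in AGE_RE.findall(text)],
        "platforms": sorted({p.lower() for p in PLATFORM_RE.findall(text)}),
        "severity_phrases": [p for p in SEVERITY_PHRASES if p in lowered],
    }


def make_case_id(text: str) -> str:
    """Derive a stable ID from the cleaned text (assumed scheme)."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return "case-" + digest[:12]


if __name__ == "__main__":
    raw = (
        "A 34-year-old suspect contacted multiple victims on Discord and Insta-\n"
        "gram. See https://example.com/report"
    )
    cleaned = clean_text(raw)
    print(make_case_id(cleaned), extract_features(cleaned))
```

Hashing the cleaned text keeps the ID stable across reruns on the same document; a counter or UUID would work just as well if identical narratives must receive distinct IDs.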
Open source on GitHub