Pattern Detection Documentation
Overview
The Pattern Detection Module provides utilities for detecting predefined patterns within text streams. This module leverages the Aho-Corasick algorithm — an efficient string-searching technique that matches multiple patterns simultaneously in linear time.
Core Components
Automaton Classes | Processor Classes |
---|---|
AhoCorasickAutomaton ✓ Trie-based pattern matching engine ✓ Linear-time complexity ✓ Simultaneous multi-pattern search |
BaseBufferedProcessor ✓ Abstract base for text streaming ✓ Buffer management ✓ Chunk-based processing |
AhoCorasickAutomatonNormalized ✓ Whitespace-insensitive matching ✓ Pattern normalization ✓ Original-to-normalized index mapping |
AhoCorasickBufferedProcessor ✓ Exact pattern matching ✓ YAML-configurable patterns ✓ Streaming-ready implementation |
AhoCorasickBufferedProcessorNormalized ✓ Whitespace-invariant detection ✓ Flexible text matching ✓ Preserves original text positions |
Utility Functions
load_patterns(yaml_path)
— Loads patterns from YAML configurationnormalize_and_map(text)
— Removes whitespace while tracking original character positions
How It Works
The module processes text in two primary stages:
-
Pattern Preprocessing
- Patterns are loaded from YAML configuration
- The Aho-Corasick automaton is constructed from patterns
- Failure links connect states for efficient pattern transitions
-
Buffered Text Processing
- Text is processed in manageable chunks
- Partial matches at chunk boundaries are preserved
- Match information includes pattern name and position
Text Normalization Pipeline
┌─────────────┐ ┌─────────────┐ ┌───────────────┐ ┌─────────────┐
│ Input Text │ ──► │ Normalize │ ──► │ Pattern Match │ ──► │ Map to │
│ │ │ (Remove │ │ (Using │ │ Original │
│ │ │ Whitespace)│ │ Automaton) │ │ Positions │
└─────────────┘ └─────────────┘ └───────────────┘ └─────────────┘
Performance Considerations
Time Complexity O(n + m + k) where:
- n = length of input text
- m = total length of all patterns
- k = number of pattern occurrences
Space Efficiency:
- Buffered processing minimizes memory usage
- Suitable for streaming applications with unbounded input
Flexibility vs. Performance:
- Standard processors offer exact matching with minimal overhead
- Normalized processors provide flexibility with slight computational cost
Usage Examples
# Standard pattern matching (exact)
processor = AhoCorasickBufferedProcessor('patterns.yaml')
result = await processor.process_chunk("input text")
print(f"Pattern found: {result.pattern_name}" if result.matched else "No match")
# Whitespace-insensitive pattern matching
norm_processor = AhoCorasickBufferedProcessorNormalized('patterns.yaml')
result = await norm_processor.process_chunk("input text with spacing")
print(f"Pattern found: {result.pattern_name}" if result.matched else "No match")