src.llm.pattern_detection.aho_corasick_normalized.AhoCorasickAutomatonNormalized

A wrapper for normalized pattern matching using the Aho-Corasick algorithm.

This class normalizes patterns by removing whitespace variations before building the underlying Aho-Corasick automaton. This allows for pattern matching that is insensitive to whitespace differences.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `normalized_patterns` | | Dictionary mapping pattern names to their normalized forms. |
| `pattern_lengths` | | Dictionary storing the lengths of normalized patterns. |
| `automaton` | | The underlying AhoCorasickAutomaton instance. |

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `patterns` | `Dict[str, str]` | Dictionary mapping pattern names to their original string patterns. | required |
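
The constructor's bookkeeping can be sketched on its own. The `normalize_and_map` below is a hypothetical stand-in, assumed to drop every whitespace character and to return an index map as its second value; the real helper may normalize differently.

```python
from typing import Dict, List, Tuple


def normalize_and_map(text: str) -> Tuple[str, List[int]]:
    """Hypothetical stand-in: drop all whitespace and remember original positions."""
    kept_chars: List[str] = []
    index_map: List[int] = []  # index_map[i] = position of normalized char i in `text`
    for pos, ch in enumerate(text):
        if not ch.isspace():
            kept_chars.append(ch)
            index_map.append(pos)
    return "".join(kept_chars), index_map


patterns: Dict[str, str] = {"greeting": "hello world", "farewell": "good  bye"}

# Mirrors the two dictionaries the constructor fills in.
normalized_patterns = {name: normalize_and_map(pat)[0] for name, pat in patterns.items()}
pattern_lengths = {name: len(norm) for name, norm in normalized_patterns.items()}

print(normalized_patterns)
print(pattern_lengths)
```

Under that assumption, patterns that differ only in spacing collapse to the same normalized string, which is what makes the matching whitespace-insensitive.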
Source code in src/llm/pattern_detection/aho_corasick_normalized.py
class AhoCorasickAutomatonNormalized:
    """A wrapper for normalized pattern matching using the Aho-Corasick algorithm.

    This class normalizes patterns by removing whitespace variations before building
    the underlying Aho-Corasick automaton. This allows for pattern matching that is
    insensitive to whitespace differences.

    Attributes:
        `normalized_patterns`: Dictionary mapping pattern names to their normalized forms.
        `pattern_lengths`: Dictionary storing the lengths of normalized patterns.
        `automaton`: The underlying AhoCorasickAutomaton instance.

    Args:
        patterns: Dictionary mapping pattern names to their original string patterns.
    """

    def __init__(self, patterns: Dict[str, str]):
        self.normalized_patterns = {}
        self.pattern_lengths = {}
        for name, pat in patterns.items():
            norm_pat, _ = normalize_and_map(pat)
            self.normalized_patterns[name] = norm_pat
            self.pattern_lengths[name] = len(norm_pat)

        self.automaton = AhoCorasickAutomaton(self.normalized_patterns)

    def reset_state(self):
        """Resets the automaton to its initial state.

        Should be called before starting a new search if the automaton has been
        used previously.
        """
        self.automaton.reset_state()

    def search_chunk(self, norm_chunk: str) -> List[Tuple[int, str]]:
        """Searches for pattern matches in normalized text.

        Args:
            `norm_chunk`: The normalized text chunk to search in. Should be
                pre-normalized before calling this method.

        Returns:
            A list of tuples, where each tuple contains:

                - The ending index of the match in the normalized text (int)
                - The name of the matched pattern (str)

        ``` python title="Example usage"
        automaton = AhoCorasickAutomatonNormalized({'pat1': 'hello world'})
        matches = automaton.search_chunk('helloworld')
        print(len(matches))  # 1
        ```
        """
        return self.automaton.search_chunk(norm_chunk)

    def get_pattern_length(self, pattern_name: str) -> int:
        """Returns the length of a normalized pattern.

        Args:
            `pattern_name`: The name of the pattern whose length is required.

        Returns:
            The length of the normalized pattern as an integer.

        Raises:
            `KeyError`: Raised if `pattern_name` is not in the patterns dictionary.
        """
        return self.pattern_lengths[pattern_name]

get_pattern_length(pattern_name)

Returns the length of a normalized pattern.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `pattern_name` | `str` | The name of the pattern whose length is required. | required |

Returns:

| Type | Description |
| --- | --- |
| `int` | The length of the normalized pattern as an integer. |

Raises:

| Type | Description |
| --- | --- |
| `KeyError` | Raised if `pattern_name` is not in the patterns dictionary. |
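
One likely use of the stored lengths is turning the end indices reported by `search_chunk` into full spans. The sketch below assumes the end index is inclusive (the position of the match's last character); `match_spans` is a hypothetical helper, not part of the class.

```python
from typing import Dict, List, Tuple


def match_spans(
    matches: List[Tuple[int, str]], pattern_lengths: Dict[str, int]
) -> List[Tuple[int, int, str]]:
    """Convert (end_index, name) pairs into (start_index, end_index, name) triples."""
    spans = []
    for end_index, name in matches:
        length = pattern_lengths[name]  # unknown names raise KeyError, as get_pattern_length does
        spans.append((end_index - length + 1, end_index, name))
    return spans


# Illustrative values only: lengths of normalized patterns and reported matches.
pattern_lengths = {"pat1": 10, "pat2": 3}
matches = [(9, "pat1"), (14, "pat2")]
print(match_spans(matches, pattern_lengths))  # [(0, 9, 'pat1'), (12, 14, 'pat2')]
```

If the underlying automaton reports an exclusive end index instead, the `+ 1` should be dropped.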

Source code in src/llm/pattern_detection/aho_corasick_normalized.py
def get_pattern_length(self, pattern_name: str) -> int:
    """Returns the length of a normalized pattern.

    Args:
        `pattern_name`: The name of the pattern whose length is required.

    Returns:
        The length of the normalized pattern as an integer.

    Raises:
        `KeyError`: Raised if `pattern_name` is not in the patterns dictionary.
    """
    return self.pattern_lengths[pattern_name]

reset_state()

Resets the automaton to its initial state.

Should be called before starting a new search if the automaton has been used previously.
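
To illustrate why a reset matters, here is a minimal sketch of a stateful Aho-Corasick matcher. `TinyAutomaton` is a hypothetical stand-in for the wrapped automaton, and it assumes that matching state carries over from one `search_chunk` call to the next and that end indices are relative to the current chunk; the real class may differ on both points.

```python
from collections import deque
from typing import Dict, List, Tuple


class TinyAutomaton:
    """Hypothetical stand-in for a stateful Aho-Corasick automaton."""

    def __init__(self, patterns: Dict[str, str]) -> None:
        self.goto: List[Dict[str, int]] = [{}]  # trie edges per node
        self.fail: List[int] = [0]              # failure link per node
        self.out: List[List[str]] = [[]]        # pattern names ending at each node
        for name, pat in patterns.items():
            node = 0
            for ch in pat:
                if ch not in self.goto[node]:
                    self.goto.append({})
                    self.fail.append(0)
                    self.out.append([])
                    self.goto[node][ch] = len(self.goto) - 1
                node = self.goto[node][ch]
            self.out[node].append(name)
        # Breadth-first pass to fill failure links and inherit outputs.
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)
        self.state = 0

    def reset_state(self) -> None:
        self.state = 0

    def search_chunk(self, chunk: str) -> List[Tuple[int, str]]:
        matches = []
        for i, ch in enumerate(chunk):
            while self.state and ch not in self.goto[self.state]:
                self.state = self.fail[self.state]
            self.state = self.goto[self.state].get(ch, 0)
            matches.extend((i, name) for name in self.out[self.state])
        return matches


automaton = TinyAutomaton({"pat1": "helloworld"})
first = automaton.search_chunk("hello")   # no match yet, but state is kept
second = automaton.search_chunk("world")  # completes the match across chunks
print(first, second)

automaton.search_chunk("hello")           # leftover prefix from an old stream
automaton.reset_state()                   # start a new, unrelated search
after_reset = automaton.search_chunk("world")
print(after_reset)
```

Without the reset, the leftover `hello` prefix from the earlier stream would combine with `world` and report a match that never occurred in the new input.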

Source code in src/llm/pattern_detection/aho_corasick_normalized.py
def reset_state(self):
    """Resets the automaton to its initial state.

    Should be called before starting a new search if the automaton has been
    used previously.
    """
    self.automaton.reset_state()

search_chunk(norm_chunk)

Searches for pattern matches in normalized text.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `norm_chunk` | `str` | The normalized text chunk to search in. Should be pre-normalized before calling this method. | required |

Returns:

| Type | Description |
| --- | --- |
| `List[Tuple[int, str]]` | A list of tuples, where each tuple contains the ending index of the match in the normalized text (`int`) and the name of the matched pattern (`str`). |

Example usage

```python
automaton = AhoCorasickAutomatonNormalized({'pat1': 'hello world'})
matches = automaton.search_chunk('helloworld')
print(len(matches))  # 1
```

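
Because `search_chunk` expects pre-normalized input, a caller typically normalizes the text first and then maps a match back to the original string. The sketch below assumes a hypothetical `normalize_and_map` that strips whitespace and returns an index map, and it uses `str.find` as a stand-in for the end index the automaton would report (assumed inclusive).

```python
from typing import List, Tuple


def normalize_and_map(text: str) -> Tuple[str, List[int]]:
    """Hypothetical stand-in: drop whitespace, record each kept character's original index."""
    index_map = [pos for pos, ch in enumerate(text) if not ch.isspace()]
    return "".join(text[pos] for pos in index_map), index_map


def original_span(index_map: List[int], end_index: int, pattern_length: int) -> Tuple[int, int]:
    """Map a normalized match back to a half-open [start, stop) slice of the original text."""
    start_index = end_index - pattern_length + 1
    return index_map[start_index], index_map[end_index] + 1


text = "say hello   world now"
norm_text, index_map = normalize_and_map(text)

pattern = "helloworld"  # normalized form of 'hello world'
end_index = norm_text.find(pattern) + len(pattern) - 1  # stand-in for a reported match end

start, stop = original_span(index_map, end_index, len(pattern))
print(repr(text[start:stop]))  # 'hello   world'
```

The recovered slice keeps the original spacing, even though the match was found in the whitespace-free text.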
Source code in src/llm/pattern_detection/aho_corasick_normalized.py
def search_chunk(self, norm_chunk: str) -> List[Tuple[int, str]]:
    """Searches for pattern matches in normalized text.

    Args:
        `norm_chunk`: The normalized text chunk to search in. Should be
            pre-normalized before calling this method.

    Returns:
        A list of tuples, where each tuple contains:

            - The ending index of the match in the normalized text (int)
            - The name of the matched pattern (str)

    ``` python title="Example usage"
    automaton = AhoCorasickAutomatonNormalized({'pat1': 'hello world'})
    matches = automaton.search_chunk('helloworld')
    print(len(matches))  # 1
    ```
    """
    return self.automaton.search_chunk(norm_chunk)