Aho-Corasick algorithm has been widely used for string matching due to its advantage of matching multiple string patterns in a single pass.
- compile multiple string patterns into a state machine
- Patterns = {he, she, his, hers}
AhoCorasick tree can be represented as table
State ID |
Value |
goto |
Failure State |
isFinal | output |
0 (Root) |
None |
h->1 s->3 |
0 |
| |
1 |
h |
e->2 i->6 |
0 |
| |
2 |
e |
r->8 |
0 |
Yes | {he} |
3 |
s |
h->4 |
0 |
| |
4 |
h |
e->5 |
1 |
| |
5 |
e |
|
2 |
Yes | {he, she} |
6 |
i |
s->7 |
0 |
| |
7 |
s |
|
3 |
Yes | {his} |
8 |
r |
s->9 |
0 |
| |
9 | s | 3 | Yes | {hers} |
from collections import deque
class State:
sid = None
value = None
isFinal = False
tranList = None
failState = 0
outputSet = None
def __init__(self, sid, val):
self.sid = sid
self.value = val
self.tranList = {}
self.failState = 0
self.outputSet = set()
def goto(self, key):
if self.tranList.has_key(key) == True:
return self.tranList[key]
def addOutput(self, key):
self.outputSet = self.outputSet ^ key
def display(self):
print "State:", self.sid, "Outgoing:", len(self.tranList.keys()), "Failure:", self.failState
if self.isFinal == True:
print " +--Output:", self.outputSet
for node in self.tranList.keys():
s = self.tranList[node]
s.display()
K (keyword) = {y1, y2, ... yn}
x : arbitrary string
functions:
1) goto function : g
his, hers, she 에 대해서 tree build를 수행
. 노드가 final node 인 경우 output 에 keyword를 추가함
2) failure function : f
. when a character is not matching
. 지금까지 찾은 결과는 다른 pattern의 prefix 임을 표시
. 최초 모든 state 의 initial failure state 는 root 로 지정함
. failure state 는 parent state 의 failure state 에서 key 을 search 한 결과임
(예를 들어 5번 node 의 failure node를 찾는 방법은,
1) 5 번 노드의 부모인 4번 노드의 failure state 를 찾음 => 1번 노드
2) 1번 노드에서 key 인 (e)에 대해서 goto function 을 구함 => 2번 노드
3) failure node의 output 을 자신의 output 에 추가함 => 5번 노드의 output = {she} + 2번 노드(failure node)의 output {he} => {she, he}
3) output function : output
. output 은 (1) tree build 할 때 추가되며, (2) failure function을 build 할 때 추가됨
Parallel Failureless Aho-Corasick Algorithm
- Parallel Failureless Aho-Corasick(PFAC) algorithm on GPU
- Allocate each byte of input an individual thread to traverse a state machine
- Remove all failure transitions as well as the self-loop transitions backing to the initial state
. minimum number of valid transitions
. Thread is terminated when no valid transitions
- Reference: Accelerating String Matching Using Multi-Threaded Algorithm on GPU
Implementation
ref) http://stackoverflow.com/questions/22398190/state-transition-table-for-aho-corasick-algorithm/22398959#22398959
http://snasrullah.blogspot.kr/2011/02/aho-corasick-string-matching-in-python.html
ref) http://multifast.sourceforge.net/
@ node.h
/* automata node */
typedef struct AC_NODE
{
int id; /* Node ID : for debugging purpose */
short int final; /* 0: no ; 1: yes, it is a final node */
struct AC_NODE * failure_node; /* The failure node of this node */
unsigned short depth; /* depth: distance between this node and the root */
/* Matched patterns */
AC_PATTERN_t * matched_patterns; /* Array of matched patterns */
unsigned short matched_patterns_num; /* Number of matched patterns at this node */
unsigned short matched_patterns_max; /* Max capacity of allocated memory for matched_patterns */
/* Outgoing Edges */
struct edge * outgoing; /* Array of outgoing edges */
unsigned short outgoing_degree; /* Number of outgoing edges */
unsigned short outgoing_max; /* Max capacity of allocated memory for outgoing */
} AC_NODE_t;
/* The Edge of the Node */
struct edge
{
AC_ALPHABET_t alpha; /* Edge alpha */
AC_NODE_t * next; /* Target of the edge */
};
@ ahocorasick.h
typedef struct AC_AUTOMATA
{
/* The root of the Aho-Corasick trie */
struct AC_NODE * root;
/* maintain all nodes pointers. it will be used to access or release
* all nodes. */
struct AC_NODE ** all_nodes;
unsigned int all_nodes_num; /* Number of all nodes in the automata */
unsigned int all_nodes_max; /* Current max allocated memory for *all_nodes */
/* this flag indicates that if automata is finalized by
* ac_automata_finalize() or not. 1 means finalized and 0
* means not finalized (is open). after finalizing automata you can not
* add pattern to automata anymore. */
unsigned short automata_open;
/* It is possible to feed a large input to the automata chunk by chunk to
* be searched using ac_automata_search(). in fact by default automata
* thinks that all chunks are related unless you do ac_automata_reset().
* followings are variables that keep track of searching state. */
struct AC_NODE * current_node; /* Pointer to current node while searching */
unsigned long base_position; /* Represents the position of current chunk
* related to whole input text */
/* The input text.
* used only when it is working in settext/findnext mode */
AC_TEXT_t * text;
/* The lase searched position in the chunk.
* used only when it is working in settext/findnext mode */
unsigned long position;
/* Statistic Variables */
/* Total patterns in the automata */
unsigned long total_patterns;
} AC_AUTOMATA_t;
References
Aho-Corasick string matching algorithm : in 1975 IBM researcher, http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
http://on-demand.gputechconf.com/gtc/2012/presentations/S0054-PFAC-Library-GPU-Based-String-Matching-Algorithm.pdf
Construction of Aho Corasick Automaton in Linear Time for Integer Alphabets
http://cs.haifa.ac.il/~landau/gadi/shiri.pdf
Exact string searching algorithms
http://www.cs.otago.ac.nz/cosc348/alignments/Lecture04_StringSearch.pdf
Heterogeneous Parallelization of Aho Corasick Algorithm (PFAC) 보다 15% speedup
http://rd.springer.com/chapter/10.1007%2F978-3-319-07581-5_19
High performance pattern matching on heterogeneous platform : PFAC보다 15% 빠름
http://www.ncbi.nlm.nih.gov/pubmed/25339087
http://journal.imbio.de/articles/pdf/jib-253.pdf