Passion/Algorithm

Aho-Corasick Algorithm Python implementation

sunshout 2014. 12. 3. 16:26

Aho-Corasick algorithm has been widely used for string matching due to its advantage of matching multiple string patterns in a single pass.


- compile multiple string patterns into a state machine

- Patterns = {he, she, his, hers}





AhoCorasick tree can be represented as table


State ID 

 Value

 goto

Failure State 

 isFinal

 output

 0 (Root)

 None

 h->1

 s->3

 0

 

 

 1

 h

 e->2

 i->6

 0

 

 

 2

 e

 r->8

 0

 Yes

 {he}

 3

 s

 h->4

 0

 

 

 4

 h

 e->5

 1

 

 

 5

 e

 

 2

 Yes

 {he, she}

 6

 i

 s->7

 0

 

 

 7

 s

 

 3

 Yes

 {his}

 8

 r

 s->9

 0

 

 
 9 s  3 Yes {hers}




from collections import deque


class State:

    sid = None

    value = None

    isFinal = False

    tranList = None

    failState = 0

    outputSet = None


    def __init__(self, sid, val):

        self.sid = sid

        self.value = val

        self.tranList = {}

        self.failState = 0

        self.outputSet = set()


    def goto(self, key):

        if self.tranList.has_key(key) ==  True:

            return self.tranList[key]


    def addOutput(self, key):

        self.outputSet = self.outputSet ^ key


    def display(self):

        print "State:", self.sid, "Outgoing:", len(self.tranList.keys()), "Failure:", self.failState


        if self.isFinal == True:

            print " +--Output:", self.outputSet


        for node in self.tranList.keys():

            s = self.tranList[node]

            s.display()


class AhoCorasick:
    root = None
    sid = 0
    table = {}

    def __init__(self):
        self.root = State(0, None)
        self.table[0] = self.root

    def addKeyword(self, keyword):
        current = self.root

        for key in keyword:
            print key
            if current.tranList.has_key(key) != True:
                self.sid = self.sid + 1
                current.tranList[key] = State(self.sid, key)
                self.table[self.sid] = current.tranList[key]

            current = current.tranList[key]

        current.isFinal = True
        current.outputSet.add(keyword)
    def setFailure(self):
        queue = deque()
        current = self.root

        for k in self.root.tranList:
            queue.append(self.root.tranList[k])

        while len(queue) != 0:
            r = queue.popleft()
            for k in r.tranList:
                queue.append(r.tranList[k])
                nd = r.tranList[k]
                # node = parent's failState
                # goto(node, value)
                sid = r.failState
                value = nd.value
                current = self.table[sid]

                while True:
                    if current.goto(value) == None and current.sid != 0:
                        new_sid = current.failState
                        current = self.table[new_sid]
                    else:
                        break
                child = current.goto(value)

                if child == None:
                    nd.failState = current.sid
                else:
                    nd.failState = child.sid

                nd.addOutput(self.table[nd.failState].outputSet)

    def findString(self, str):
        current = self.root

        for key in str:
            print "Check:", key
            while True:
                if current.goto(key) == None and current.sid != 0:
                    current = self.table[current.failState]
                    print "failure:", current.sid
                else:
                    child = current.goto(key)
                    break;
            if child != None:
                current = child
                if len(child.outputSet) > 0:
                    print "Sid:", child.sid , child.outputSet

    def display(self):
        self.root.display()

if __name__ == "__main__":
    x =AhoCorasick()
    x.addKeyword("he")
    x.addKeyword("she")
    x.addKeyword("his")
    x.addKeyword("hers")
    x.setFailure()
    x.display()

    a = "she is hers"
    print a
    x.findString(a)


K (keyword) = {y1, y2, ... yn}

x : arbitrary string

functions:

 1) goto function : g

his, hers, she 에 대해서  tree build를 수행

   . 노드가 final node 인 경우 output 에 keyword를 추가함



 2) failure function : f

   . when a character is not matching

   . 지금까지 찾은 결과는 다른 pattern의 prefix 임을 표시

   . 최초 모든 state 의 initial failure state 는 root 로 지정함

   . failure state 는 parent state 의 failure state 에서 key 을 search 한 결과임

     (예를 들어 5번 node 의 failure node를 찾는 방법은, 

         1) 5 번 노드의 부모인 4번 노드의 failure state 를 찾음 => 1번 노드

         2) 1번 노드에서 key 인 (e)에 대해서 goto function 을 구함 => 2번 노드

         3) failure node의 output 을 자신의 output 에 추가함 => 5번 노드의 output = {she} + 2번 노드(failure node)의 output {he} => {she, he}



 3) output function : output

    . output 은 (1) tree build 할 때 추가되며, (2) failure function을 build 할 때 추가됨





Parallel Failureless Aho-Corasick Algorithm

- Parallel Failureless Aho-Corasick(PFAC) algorithm on GPU

- Allocate each byte of input an individual thread to traverse a state machine

- Remove all failure transitions as well as the self-loop transitions backing to the initial state

  . minimum number of valid transitions

  . Thread is terminated when no valid transitions



- Reference: Accelerating String Matching Using Multi-Threaded Algorithm on GPU 


Implementation

ref) http://stackoverflow.com/questions/22398190/state-transition-table-for-aho-corasick-algorithm/22398959#22398959

http://snasrullah.blogspot.kr/2011/02/aho-corasick-string-matching-in-python.html

ref) http://multifast.sourceforge.net/


@ node.h

/* automata node */

typedef struct AC_NODE

{

    int id; /* Node ID : for debugging purpose */

    short int final; /* 0: no ; 1: yes, it is a final node */

    struct AC_NODE * failure_node; /* The failure node of this node */

    unsigned short depth; /* depth: distance between this node and the root */


    /* Matched patterns */

    AC_PATTERN_t * matched_patterns; /* Array of matched patterns */

    unsigned short matched_patterns_num; /* Number of matched patterns at this node */

    unsigned short matched_patterns_max; /* Max capacity of allocated memory for matched_patterns */


    /* Outgoing Edges */

    struct edge * outgoing; /* Array of outgoing edges */

    unsigned short outgoing_degree; /* Number of outgoing edges */

    unsigned short outgoing_max; /* Max capacity of allocated memory for outgoing */

} AC_NODE_t;


/* The Edge of the Node */

struct edge

{

    AC_ALPHABET_t alpha; /* Edge alpha */

    AC_NODE_t * next; /* Target of the edge */

};



@ ahocorasick.h

typedef struct AC_AUTOMATA

{

    /* The root of the Aho-Corasick trie */

    struct AC_NODE * root;


    /* maintain all nodes pointers. it will be used to access or release

    * all nodes. */

    struct AC_NODE ** all_nodes;


    unsigned int all_nodes_num; /* Number of all nodes in the automata */

    unsigned int all_nodes_max; /* Current max allocated memory for *all_nodes */


    /* this flag indicates that if automata is finalized by

     * ac_automata_finalize() or not. 1 means finalized and 0

     * means not finalized (is open). after finalizing automata you can not

     * add pattern to automata anymore. */

    unsigned short automata_open;


    /* It is possible to feed a large input to the automata chunk by chunk to

     * be searched using ac_automata_search(). in fact by default automata

     * thinks that all chunks are related unless you do ac_automata_reset().

     * followings are variables that keep track of searching state. */

    struct AC_NODE * current_node; /* Pointer to current node while searching */

    unsigned long base_position; /* Represents the position of current chunk

                                  * related to whole input text */


    /* The input text.

     * used only when it is working in settext/findnext mode */

    AC_TEXT_t * text;


    /* The lase searched position in the chunk.

     * used only when it is working in settext/findnext mode */

    unsigned long position;


    /* Statistic Variables */


    /* Total patterns in the automata */

    unsigned long total_patterns;


} AC_AUTOMATA_t;



References

Aho-Corasick string matching algorithm : in 1975 IBM researcher, http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm


http://on-demand.gputechconf.com/gtc/2012/presentations/S0054-PFAC-Library-GPU-Based-String-Matching-Algorithm.pdf


Construction of Aho Corasick Automaton in Linear Time for Integer Alphabets

http://cs.haifa.ac.il/~landau/gadi/shiri.pdf


Exact string searching algorithms 

http://www.cs.otago.ac.nz/cosc348/alignments/Lecture04_StringSearch.pdf


Heterogeneous Parallelization of Aho Corasick Algorithm (PFAC) 보다 15% speedup

http://rd.springer.com/chapter/10.1007%2F978-3-319-07581-5_19


High performance pattern matching on heterogeneous platform : PFAC보다 15% 빠름

http://www.ncbi.nlm.nih.gov/pubmed/25339087

http://journal.imbio.de/articles/pdf/jib-253.pdf