The Need for Speed - JAPE Plus
During the GATE training event of 2009, Andrew Borthwick from Spock Corp. asked about ways to improve performance when running large JAPE grammars. At the time, the only help I could offer was to suggest the usual things: don’t overuse Kleene operators, reduce the number of phases by grouping more rules in the same phase, and do some profiling to identify problem spots and then try to redesign them. I don’t imagine that was very useful to him, but it did motivate me to have another look at JAPE and see whether I could improve on the execution algorithm. Thus the JAPE Plus project was born.
While implementing the JAPE runtime the first time around, one idea I had was to split all constraints into atomic predicates and use those atomic predicates (and not the full constraints) as transition labels in the state machine. This was intended to allow the minimisation algorithm to do a better job of optimising the state machine. A constraint like {Person gender=="female", affiliation=="academic", age > 30} would then get converted into the following atomic predicates:
<annType: name="Person">
<annFeature: name="gender", value="female", predicate="eq">
<annFeature: name="affiliation", value="academic", predicate="eq">
<annFeature: name="age", value=30, predicate="gt">
This allows input annotations to match partially, e.g. an annotation of type Person whose gender is female, but which does not satisfy the other conditions. These partial matches are represented by a set of transitions that can help other constraints succeed even when this particular one fails. However, given that in the end we didn’t do any proper minimisation for the JAPE state machine (more on that later!), this idea never got implemented.
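To make the decomposition idea concrete, here is a minimal, self-contained sketch in Java. It is only an illustration of the principle: the FeaturePredicate record, the plain Map used to stand in for an annotation’s features, and the eq/gt comparators are my own assumptions, not the actual JAPE Plus data structures.

import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

// Illustrative only: a constraint is decomposed into independent atomic
// predicates, each of which can be evaluated on its own against an annotation.
public class AtomicPredicateDemo {

  // One atomic predicate: a feature name, a target value and a comparison.
  record FeaturePredicate(String feature, Object value,
                          BiPredicate<Object, Object> op) {
    boolean test(Map<String, Object> features) {
      return features.containsKey(feature)
          && op.test(features.get(feature), value);
    }
  }

  public static void main(String[] args) {
    BiPredicate<Object, Object> eq = Object::equals;
    BiPredicate<Object, Object> gt =
        (a, b) -> ((Number) a).doubleValue() > ((Number) b).doubleValue();

    // {Person gender=="female", affiliation=="academic", age > 30}
    // broken up into its atomic parts (the annotation type check is
    // left out here for brevity).
    List<FeaturePredicate> predicates = List.of(
        new FeaturePredicate("gender", "female", eq),
        new FeaturePredicate("affiliation", "academic", eq),
        new FeaturePredicate("age", 30, gt));

    // A Person annotation that only matches partially: gender succeeds,
    // affiliation and age fail, but the gender result could still help
    // another constraint that asks the same 'question'.
    Map<String, Object> person =
        Map.of("gender", "female", "affiliation", "industry", "age", 25);

    for (FeaturePredicate p : predicates) {
      System.out.println(p.feature() + " -> " + p.test(person));
    }
  }
}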
For JAPE Plus I decided to revisit the idea of atomic predicates, but to completely bypass the issue of full minimisation of the state machine and replace it with a dynamic programming approach. Let’s think about state transitions as ‘questions’ asked about the input, e.g. “is the current input annotation of type Person?”. Minimising the state machine would create a state/transition graph where the path from the initial state to the final one is the shortest possible, thus making sure you are not asking more ‘questions’ than you need to. In the case of standard JAPE, the state machine is not minimised, which results in the same ‘questions’ being asked about the same input annotations on different graph traversal paths, which is not optimal.

JAPE Plus changes that by first breaking up all constraints into atomic predicates, and then keeping track of the values of these predicates when applied to each of the input annotations. This means that, even though the same ‘questions’ are asked more than once, the result is only calculated once and stored. The stored value is then reused for all future evaluations of the same predicate on the same input annotation, which greatly reduces the overhead.
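The memoisation itself can be pictured as a per-document cache keyed by predicate and annotation. The sketch below is only my own illustration of that idea, assuming a hypothetical PredicateCache class and a simple string key; it is not the actual JAPE Plus implementation.

import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Illustrative only: cache the result of each atomic predicate per input
// annotation, so that the same 'question' is computed at most once.
public class PredicateCache<A> {

  // One entry per (predicate id, annotation id) pair.
  private final Map<String, Boolean> cache = new HashMap<>();

  public boolean evaluate(String predicateId, int annotationId,
                          A annotation, Predicate<A> predicate) {
    String key = predicateId + "#" + annotationId;
    // computeIfAbsent evaluates the predicate only on the first call;
    // later traversals that ask the same question just read the stored value.
    return cache.computeIfAbsent(key, k -> predicate.test(annotation));
  }

  public void clear() {   // would be called between documents
    cache.clear();
  }
}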
I promised above to come back to the minimisation issue. The reason why the state machines generated for standard JAPE grammars are not minimised is that rule fragments can be associated with bindings. The standard finite state machine minimisation algorithms rely on the fact that multiple transitions with the same label can be collapsed into a single one. In the case of JAPE, each transition has a label, but it can also potentially have a binding. To fix that, one would need to change the input alphabet from Σ to Σ × Β, where Β is the set of all possible bindings. When I wrote the first JAPE runtime, I considered that this would make the alphabet much larger, which greatly reduces the benefits of minimisation, while making the implementation more complicated. Because of that, and perhaps a bit of laziness too, I decided not to bother implementing any minimisation.
By a fortunate coincidence, while I was working on implementing my dynamic programming optimisations, Petar Mitankin from Ontotext came up with a very clever solution to allow minimisation to take place in JAPE. Briefly put, he generates synthetic state transitions when brackets are opened and closed, and delays the application of the bindings until the closing-bracket transition. This requires a bit of extra state to be stored during execution, but it also allows the state machine to be minimised. The nice thing is that his work on minimising the state machine and my work on optimising the execution engine are somewhat orthogonal: while some of the benefits can be obtained by either method, there are elements that are not shared. This allowed us to combine our ideas into the same engine, which has now become the new JAPE_Plus GATE plugin. JAPE_Plus has recently been added to the GATE SVN repository, and will be part of the next release, whenever that may come.
I have performed some evaluation experiments to see how much faster JAPE Plus is compared with the standard JAPE engine. To find out, I ran the ANNIE pipeline over a set of just over 8,000 web pages from the BBC news website. The largest part of the ANNIE application is a Named Entity Recogniser (NER), implemented in JAPE. I created two versions of the pipeline: one that uses the standard JAPE runtime, and another based on JAPE Plus. For each document, I measured the time taken by the execution of the NER, running each version of the pipeline five times and averaging the results. The graph on the right maps document size to execution time for both versions of the runtime. Each position on the x axis represents all the documents of a certain size (e.g. all documents larger than 7 KB and smaller than 8 KB); the corresponding data points represent the average execution time for all those documents. The lines are a bit jagged due to data sparseness - there weren’t documents for every size interval. As you can see, JAPE Plus is consistently faster: about 3 times faster when averaging all the execution times for all the documents (60 ms for JAPE Plus, compared to 196 ms for JAPE).
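For completeness, the timing scheme described above (several runs per document, then an average) can be sketched roughly as follows. This is not the code used for the experiments; the Consumer standing in for the NER invocation is a placeholder I introduced, since the real GATE pipeline call is not shown here.

import java.util.function.Consumer;

// Illustrative only: average the NER execution time over several runs of the
// same document, as described in the evaluation above.
public class NerTimer {

  static <D> long averageMillis(D doc, Consumer<D> runNer, int repetitions) {
    long totalNanos = 0;
    for (int i = 0; i < repetitions; i++) {
      long start = System.nanoTime();
      runNer.accept(doc);                        // one NER run over the document
      totalNanos += System.nanoTime() - start;
    }
    return totalNanos / repetitions / 1_000_000; // average, in milliseconds
  }
}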
If you want to try it, you’ll have to use the current snapshot version of GATE Developer, as JAPE Plus wasn’t part of the GATE distribution at the time of the last release. If you have any comments, or find any bugs, please report them to the GATE Users’ mailing list.