OpenFst: Racket Bindings
(require openfst) | package: openfst |
This package provides Racket bindings for OpenFst, "a library for constructing, combining, optimizing, and searching weighted finite-state transducers (FSTs)" [openfst]. Extensions to OpenFst based on the Python package pynini are also included [pynini].
1 Getting Started
A weighted finite-state transducers (FSTs) is a type of automata that consists of:
A set of states, represented by natural numbers. Each state is also assigned a final weight representing the weight of terminating computation in this state (Infinity is used to represent non-final states).
A set of arcs (or transitions) between states, consisting of a current and a next state, an input and output label, and a weight. labels are integers representing the unicode encoding of characters with 0 representing Ξ΅, or no-character.
A special state designated as the start state, from which computation begins. Empty or non-sane FSTs may lack a start state.
This library provides a high- and low-level interface for interacting with FSTs. The high-level abstract functional interface contains functions over FSTs such as fst-union and fst-compose. The low-level direct interface allows for inspection and stateful mutation of the structure of an FST.
1.1 Example: Thousand Separators
"7577130400" β "7,577,130,400"
To do this, we first define an FST that accepts any single digit called digit. Next, we take the closure over this FST with a lower and upper bound of 3. The resulting FST, digits3, will accept only strings of exactly 3 digits and rewrite them unchanged.
(define digit (fst-union "0" "1" "2" "3" "4" "5" "6" "7" "8" "9")) (define digits3 (fst-closure digit #:lower 3 #:upper 3))
Next we add a comma inserting FST before digits3 and take the Kleene * of this to get digits3*. The resulting FST will accept a string of digits with length divisible by 3 and insert a comma before every 3rd digit. Finally we concatenate this FST with an FST taking 1, 2, or 3 digits representing the leading digit of the number, to produce add-commas-fst.
(define digits3* (fst-closure (fst-concat (fst-cross "" ",") digits3))) (define add-commas-fst (fst-concat (fst-closure digit #:lower 1 #:upper 3) digits3*))
To actually apply this FST to a string we use fst-compose to implicitly convert the string to an FST itself and then compose that FST with the one we build for thousand separation. Finally, we convert the resulting FST back into a string. For the purposes of this example we do this explicitly, but fst-rewrite wraps this functionality up nicely.
(define (add-commas number-string) (fst->string (fst-compose number-string add-commas-fst)))
> (add-commas "7577130400") "7,577,130,400"
> (add-commas "81") "81"
> (add-commas "100000") "100,000"
1.2 Example: Leading Zeros
To see how weights can be applied to FSTs, consider the problem of removing leading zeros from a number. While this task can be accomplished with an unweighted-FST, introducing weights simplifies the solution.
"007" β "7"
The leading-zero removing transducer is defined as the concatenation of an FST that removes zero or more "0"s and an FST that accepts one or more digits. While the second FST could process the leading "0"s without removing them, this path will have more weight associated with it.
(define leading-0s (fst-concat (fst-closure (fst-cross "0" "")) (fst-closure (fst-add-weight digit 1.0) #:lower 1)))
Note that this FST can be composed with the thousand separator from the previous problem into a single new FST. When applying this FST, fst-shortest-path must be used since now multiple computations exist for some strings.
(define cleanup-fst (fst-compose leading-0s add-commas-fst)) (define (cleanup-number number-string) (fst->string (fst-shortest-path (fst-compose number-string cleanup-fst))))
> (cleanup-number "07577130400") "7,577,130,400"
> (cleanup-number "00000000") "0"
> (cleanup-number "100000") "100,000"
2 Abstract Automata Manipulation
procedure
(fst-accept str [#:weight weight]) β fst?
str : string? weight : real? = 0
> (fst-rewrite (fst-cross "hello" "world") "hello") "world"
procedure
(fst-closure fst [ #:lower lower #:upper upper]) β fst? fst : fst-like? lower : exact-nonnegative-integer? = 0 upper : (or/c #f exact-positive-integer?) = #f
(fst-closure fst) is equivalent to the Kleene *.
(fst-closure fst #:lower 1) is equivalent to the Kleene +.
(fst-closure fst #:upper 1) is equivalent to the ?.
> (define f (fst-closure (fst-cross "A" "B") #:lower 2)) > (fst-rewrite f "AAAAA") "BBBBB"
> (fst-rewrite f "AA") "BB"
> (fst-rewrite f "A") #f
procedure
(fst-compose fst ...+) β fst?
fst : fst-like?
> (define more+s (fst-closure (fst-cross "+" "++"))) > (fst->string (fst-compose "+" more+s)) "++"
> (fst->string (fst-compose "+" more+s more+s more+s)) "++++++++"
procedure
(fst-concat fst ...+) β fst?
fst : fst-like?
> (define f (fst-concat (fst-cross "a" "b") (fst-cross "x" "y"))) > (fst-rewrite f "ax") "by"
> (fst-rewrite f "a") #f
procedure
(fst-project fst type) β fst?
fst : fst-like? type : (or/c 'input 'output)
> (define A->B (fst-cross "A" "B")) > (fst->string (fst-project A->B 'input)) "A"
> (fst->string (fst-project A->B 'output)) "B"
procedure
(fst-shortest-path fst [n]) β fst?
fst : fst-like? n : exact-positive-integer? = 1
procedure
(fst->string fst) β string?
fst : fst-like?
procedure
(fst-inverse fst) β fst?
fst : fst-like?
> (define f (fst-inverse (fst-cross "hello" "world"))) > (fst-rewrite f "world") "hello"
procedure
(fst-reverse fst) β fst?
fst : fst-like?
> (define f (fst-reverse (fst-cross "hello" "world"))) > (fst-rewrite f "olleh") "dlrow"
procedure
(fst-optimize fst) β fst?
fst : fst-like?
procedure
(fst-difference fst1 fst2) β fst?
fst1 : fst-like? fst2 : fst-like?
> (define f (fst-difference (fst-union "a" "b" "c") "c")) > (fst-rewrite f "a") "a"
> (fst-rewrite f "c") #f
2.1 File I/O
Because of the limiations of the interfaces between Racket, C, and C++ these functions work by creating a named temporary file and passing the name of this file to the underlying C++ implementation. This should not be apparent to users of the library, but this approach has not been fully tested on all systems.
procedure
fst : fst-like? port : output-port?
procedure
port : input-port?
2.2 Derived Utilities
(require openfst/utils) | package: openfst |
procedure
(fst-add-weight fst weight) β fst?
fst : fst-like? weight : real?
This is accomplished by adding an Ξ΅ transition to the start of fst with the given weight.
procedure
(fst-insert fst [#:weight weight]) β fst?
fst : fst-like? weight : real? = 0
This function has the effect of replacing the input-label of every arc in the input FST with Ξ΅.
procedure
(fst-delete fst [#:weight weight]) β fst?
fst : fst-like? weight : real? = 0
This function has the effect of replacing the output-label of every arc in the input FST with Ξ΅.
procedure
(fst-rewrite fst input) β (or/c string #f)
fst : fst-like? input : string?
(fst->string (fst-shortest-path (fst-compose input fst)))
3 Direct Automata Access
procedure
state-info :
(cons/c (or/c symbol? (list/c symbol? real?)) (listof (list/c label? label? real? symbol?)))
(make-fst '(a (#\A #\B 1 a) (#\B #\C 1 b)) '((b 0) (#\A #\B 1 a)))
procedure
(fst-add-state! fst) β exact-nonnegative-integer?
fst : fst?
procedure
(fst-add-arc! fst state arc) β void?
fst : fst? state : exact-nonnegative-integer? arc : arc?
procedure
(fst-set-start! fst state) β void?
fst : fst? state : exact-nonnegative-integer?
procedure
(fst-set-final! fst state weight) β void?
fst : fst? state : exact-nonnegative-integer? weight : real?
procedure
(fst-num-states fst) β exact-nonnegative-integer?
fst : fst?
procedure
(fst-num-arcs fst state) β exact-nonnegative-integer?
fst : fst? state : exact-nonnegative-integer?
procedure
(fst-start fst) β (or/c #f exact-nonnegative-integer?)
fst : fst?
procedure
fst : fst? state : exact-nonnegative-integer?
procedure
(fst-states fst) β (listof exact-nonnegative-integer?)
fst : fst?
procedure
fst : fst? state : exact-nonnegative-integer?
3.1 Transition Arcs
Arcs represent transitions between states in an finite-state transducer. Each arc consists of a in-label, an out-label, a weight, and a next state. Arcs are added to the automata at the states from which they originate.
As an implementation note, arc objects are in fact pointers to underlying C++ objects in foreign memory. From the perspective of the user, however they conform to the struct interface.
procedure
ilabel : label? olable : label? weight : real? next-state : exact-nonnegative-integer?
procedure
(arc-ilabel arc) β exact-nonnegative-integer?
arc : arc?
procedure
(arc-olabel arc) β exact-nonnegative-integer?
arc : arc?
procedure
(arc-weight arc) β real?
arc : arc?
procedure
(arc-next-state arc) β exact-nonnegative-integer?
arc : arc?
4 Limitations
This library currently only supports x86_64 architecture. Getting it working on different hardware should be as simple as running the build script corresponding to the OS in the lib/ directory of the GitHub Repo, but I havenβt tested this...
While it should not be visible to the consumers of this library, the Windows implementation of this library is using an unofficial port of OpenFST [winfst] because the official OpenFST distribution does not support Windows.
Bibliography
[pynini] | K. Gorman, βPynini: A Python library for weighted finite-state grammar compilation,β Proc. ACL Workshop on Statistical NLP and Weighted Automata, 75-80, 2016. https://pypi.org/project/pynini/ | |
[openfst] | Cyril Allauzen, M. Riley, J. Schalkwyk, Wojciech Skut, Mehryar Mohri, βOpenFst: A General and Efficient Weighted Finite-State Transducer Library.β 16 July 2007. https://www.openfst.org | |
[winfst] | βOpenFST port for Windows.β https://github.com/kkm000/openfst |