Formatting serial streams in hardware
12 August 2024 (programming haskell clash fpga)I've been playing around with building a Sudoku solver circuit on an FPGA: you connect to it via a serial port, send it a Sudoku grid with some unknown cells, and after solving it, you get back the solved (fully filled-in) grid. I wanted the output to be nicely human-readable, for example for a 3,3-Sudoku (i.e. the usual Sudoku size where the grid is made up of a 3 ⨯ 3 matrix of 3 ⨯ 3 boxes):
4 2 1 9 5 8 6 3 7 8 7 3 6 2 1 9 5 4 5 9 6 4 7 3 2 1 8 3 1 2 8 4 6 7 9 5 7 6 8 5 1 9 3 4 2 9 4 5 7 3 2 8 6 1 2 8 9 1 6 4 5 7 3 1 3 7 2 9 5 4 8 6 6 5 4 3 8 7 1 2 9
This post is about how I structured the stream transformer that produces all the right spaces and newlines, yielding a clash-protocols based circuit.
With clash-protocols, the type of a circuit that transforms a serial stream of a values into a stream of b values is Circuit (Df dom a) (Df dom b), with the Df type constructor taking care of representing acknowledgement signals. So for our formatter, we are looking to implement it as a Circuit (Df dom a) (Df dom (Either Char a): the output is a mixed stream of forwarded data a and punctuation characters.
If we were writing normal (software) Haskell, we could write the corresponding [a] -> [Either Char a] in many different ways, but since we want to describe a hardware circuit, there are a couple more constraints:
- There is limited control over our input and output. The Circuit/Df abstraction provides a way to exert backpressure upstream, but that's a double edged sword: it also means whatever circuit we write has to be ready to handle backpressure from further downstream. And upstream can stall us: sometimes there is no input available.
- Everything, of course, has to be finite. This includes both data and recursion depth.
- We don't want to waste parts on our FPGA. If a counter can be just 11 bits, using an 11-bit data type instead of a Word16 can translate to actual savings.
Luckily, clash-protocols takes care of the first constraint via the expander function:
expander :: forall dom i o s. (HiddenClockResetEnable dom, NFDataX s) => s -> (s -> i -> (s, o, Bool)) -- ^ Return `True` when you're finished -- with the current input value and are ready for the next one. -> Circuit (Df dom i) (Df dom o)
So basically expander reduces the problem to just writing a "normal" pure Haskell function s -> i -> (s, o, Bool). This is reassuring since I am otherwise not too familiar with clash-protocols but this take mes back to the terra firma of Haskell.
A simpler formatter
Let's warm up by writing a state machine for expander that just puts two spaces between each input:
data DoubleSpacedState = CopyTheInput | FirstSpace | SecondSpace deriving (Generic, NFDataX) -- So that Clash can store the state in a register -- | Transform the stream 'a','b',... to Right 'a', Left ' ', Left ' ', Right 'b', Left ' ', Left ' ', ... doubleSpaced :: (HiddenClockResetEnable dom) => Circuit (Df dom a) (Df dom (Either Char a)) doubleSpaced = expander CopyTheInput \s x -> case s of CopyTheInput -> (FirstSpace, Right x, True) FirstSpace -> (SecondSpace, Left ' ', False) SecondSpace -> (CopyTheInput, Left ' ', False)
We can try it out using the Circuit simulator simulateCSE which on its own has an intimidating type because clash-protocols supports more protocols than just Df:
λ» :t simulateCSE simulateCSE :: (Protocols.Internal.Drivable a, Protocols.Internal.Drivable b, KnownDomain dom) => (Clock dom -> Reset dom -> Enable dom -> Circuit a b) -> Protocols.Internal.ExpectType a -> Protocols.Internal.ExpectType b
But once we partially apply it on doubleSpaced with the Clash clock, reset and enable lines made explicit, the type suddenly makes a ton of sense:
λ» :t simulateCSE @System (exposeClockResetEnable doubleSpaced) simulateCSE @System (exposeClockResetEnable doubleSpaced) :: (NFDataX a, ShowX a, Show a) => [a] -> [Either Char a]
Let's try applying it on a string like "Hello" to make sure it works correctly:
λ» simulateCSE @System (exposeClockResetEnable doubleSpaced) "Hello" [Right 'H',Left ' ',Left ' ', Right 'e',Left ' ',Left ' ', Right 'l',Left ' ',Left ' ', Right 'l',Left ' ',Left ' ', Right 'o'
The simulation seemingly hangs because simulateCSE starts feeding input is not yet available signals to the simulated circuit after it runs out of the user-specified input. So it's not hanging, our circuit is just not producing any more output. And when we look at the type of expander, we see that this has to be how it works, since if there is no input, there is no i argument to pass to the state machine function. But this is not exactly what we want, since we want the two spaces to be sent as soon as possible, after the preceding character, not before the next character. So we change doubleSpaced to only consume the input after all spaces are output:
doubleSpaced :: (HiddenClockResetEnable dom) => Circuit (Df dom a) (Df dom (Either Char a)) doubleSpaced = expander CopyTheInput \s x -> case s of CopyTheInput -> (FirstSpace, Right x, False) -- Here... FirstSpace -> (SecondSpace, Left ' ', False) SecondSpace -> (CopyTheInput, Left ' ', True) -- ... and here
λ» simulateCSE @System (exposeClockResetEnable doubleSpaced) "Hello" [Right 'H',Left ' ',Left ' ', Right 'e',Left ' ',Left ' ', Right 'l',Left ' ',Left ' ', Right 'l',Left ' ',Left ' ', Right 'o',Left ' ',Left ' '
Getting rid of the bespoke state datatype
If we want to write similar state machines for more complex formats, we should start thinking about composability. For example, we can think of our simple double-spaced formatter as a combination of a formatter that just forwards the input with another that outputs two spaces.
At any given moment, we are either in the forwarding state or the space-printing state:
data Forward = Forward deriving ... data Spaces = FirstSpace | SecondSpace deriving ... type DoubleSpacedState = Either Forward Spaces
Or using even more generic types, we can say that there is just one Forwarding state, and two state for the Spaces:
type DoubleSpacedState = Either (Index 1) (Index 2)
Here, Index :: Nat -> Type is a Clash-provided type with an exact number of distinct values 0, 1, ..., n-1, with Index n represented as ⌈log₂ n⌉ bits.
One nice thing about using only Either, Index and tuples for our state representation is that we can then use Clash's Counter class to iterate through the states:
doubleSpaced :: (HiddenClockResetEnable dom) => Circuit (Df dom a) (Df dom (Either Char a)) doubleSpaced = expander (Left 0 :: Either (Index 1) (Index 2)) \s x -> let output = case s of Left 0 -> Right x Right 0 -> Left ' ' Right 1 -> Left ' ' s' = countSucc s consume = case s' of Left 0 -> True _ -> False in (s', output, consume)
Here, I've changed the code computing whether the current input should be consumed so that it looks at the next state instead of the current one. Because this is really what we are doing – we want to go through as many states as we can, until we get to the point that the next time around we will need new input.
Declarative formatting
Compared to the initial version with three distinct, named constructors, we have gained generality, in that we can now imagine what the state would need to look like for our original formatting example. But already at this simplified example, it has cost us legibility: looking at the latest definition of doubleSpaced, it is not immediately obvious what format it corresponds to.
So of course the next thing we want to do is use a declarative syntax for the format, and derive everything else from that. We can take a page out of Servant and give users a library of type-level combinators corresponding to regular expressions without alternatives. Our end goal is to be able to write our Sudoku example as just a single type definition, using :++ for concatenation, :* for repetition, and type-level strings for literals:
type GridFormat n m = ((((Forward :++ " ") :* n :++ " ") :* m :++ "\r\n") :* m :++ "\r\n") :* n
So we Forward the data and follow it up with a single space, then after each nth repetition, we insert one more space. Do this whole thing m times, and end the line (using the old serial format instead of the "modern" newline-only Unix format), then after this is done m times, we insert the extra newline between the blocks.
Implementing this idea starts with capturing the essence of a format specifier: it needs to be associated with a counter type used for the given formatter's state, and we need to know how to produce the next single formatting token in any given state.
data PunctuatedBy c = Literal c | ForwardData deriving (Generic, NFDataX) class (Counter (State fmt), NFDataX (State fmt)) => Format (fmt :: k) where type State fmt format1 :: proxy fmt -> State fmt -> PunctuatedBy Char
Then, using this interface, we can write a generic formatter using expander, similar to our earlier attempt:
format :: (HiddenClockResetEnable dom, Format fmt) => Proxy fmt -> Circuit (Df dom a) (Df dom (Either Word8 a)) format fmt = Df.expander countMin \s x -> let output = case format1 fmt s of ForwardData -> Right x Literal sep -> Left (ascii sep) s' = countSucc s consume = case format1 fmt s' of ForwardData -> True _ -> False in (s', output, consume)
The easy formatters
Let's get all the easy cases out of the way. These are the formatters where we can either directly write format1, or it can be delegated to other formatters:
-- | Consume one token of input and forward it to the output data Forward instance Format Forward where type State Forward = Index 1 -- Here, `Index 1` stands in for `()` but with a (trival) `Counter` instance format1 _ _ = ForwardData -- | Concatenation data a :++ b instance (Format a, Format b) => Format (a :++ b) where type State (a :++ b) = Either (State a) (State b) -- The order is important, since `countMin @(Either a b) = Left countMin` format1 _ = either (format1 (Proxy @a)) (format1 (Proxy @b)) -- | Repetition data a :* (rep :: Nat) instance (Format a, KnownNat rep, 1 ≤ rep) => Format (a :* rep) where type State (a :* rep) = (Index rep, State a) -- The order is important, since that's how the `Counter` instance for tuples cascades increments format1 _ (_, fmt) = format1 (Proxy @a) fmt
Reflecting symbols character by character
What we want to do for the Format (sep :: Symbol) instance is to use Index n as the state, where n is the length of the symbol, and then format1 _ i would return the ith character of our separator sep.
Unfortunately, this requires considerably more elbow grease than the previous instances. Currently, there aren't many type-level functions over Symbol in base so we have to implement it all ourselves based on just the UnconsSymbol type family.
type SymbolLength s = SymbolLength' (UnconsSymbol s) type IndexableSymbol s = IndexableSymbol' (UnconsSymbol s) class (KnownNat (SymbolLength' s)) => IndexableSymbol' (s :: Maybe (Char, Symbol)) where type SymbolLength' s :: Nat symbolAt :: proxy s -> Index (SymbolLength' s) -> Char instance IndexableSymbol' Nothing where type SymbolLength' Nothing = 0 symbolAt _ i = error "impossible" instance (IndexableSymbol s, KnownChar c) => IndexableSymbol' (Just '(c, s)) where type SymbolLength' (Just '(c, s)) = 1 + SymbolLength s {-# INLINE symbolAt #-} symbolAt _ i | i == 0 = charVal (Proxy @c) | otherwise = symbolAt (Proxy @(UnconsSymbol s)) (fromIntegral (i - 1))
With the help of these utility classes, we can now write the formatter for Symbols. The lower bound on SymbolLength is needed because the degenerate type Index 0 (isomorphic to Void) would just screw everything up.
-- | Literal instance (IndexableSymbol sep, KnownNat (SymbolLength sep), 1 ≤ SymbolLength sep) => Format sep where type State sep = Index (SymbolLength sep) format1 _ i = Literal $ symbolAt (Proxy @(UnconsSymbol sep)) i
Note that the indirection between a format fmt and its state type State fmt was only needed in the first place because we wanted Symbols to be valid formatters without wrapping them in an extra layer. If we were content with extra noise like Literal "\r\n" instead of just "\r\n", we could collapse the two types.
In my real code, I ended up having to change format slightly so that it directly produces 8-bit ASCII values instead of Chars, because I found that composing it with another Circuit that does the conversion wasn't getting inlined enough to produce Verilog that avoids the large 21-bit-wide multiplexers for Char.