hBPF = eBPF in hardware
Saturday, April 3, 2021
Introducing hBPF - an eBPF implementation for FPGAs.
This project started at the beginning of 2021 as an experiment to see how fast and how far one can get using alternative hardware description languages (compared to the classic ‘V’ languages VHDL and Verilog), mostly open-source tools (compared to expensive commercial toolchains), and cheap development boards (e.g. Arty-S7).
It implements an eBPF CPU using LiteX/Migen, a Python 3 based SoC builder and hardware description language (HDL).
eBPF
Back in 1992, the original Berkeley Packet Filter (BPF) was designed for capturing and filtering network packets that match specific rules. Filters are implemented as programs that run on a register-based virtual RISC machine with a small instruction set inside the Linux Kernel.
At some point in 2014, work started to extend the existing BPF virtual machine to make it useful in other parts of the Linux Kernel. More and wider registers, additional instructions, and a JIT eventually resulted in extended BPF (eBPF). The original and now obsolete BPF version has been retroactively renamed to classic BPF (cBPF). Nowadays, the Linux Kernel runs eBPF only, and loaded cBPF bytecode is transparently translated into eBPF before execution.
hBPF
The hBPF project implements most eBPF features in hardware (FPGA). The main purpose of implementing an eBPF CPU in hardware is the same as that of the original cBPF: processing network packets. By attaching an hBPF CPU directly to a network PHY/MAC, a form of smart NIC could be created. Such a NIC is capable of performing tasks on packets that the host CPU offloads to it for performance reasons.
The following picture shows an overview of how hBPF can be used.
The hBPF CPU has access to separate program and data memories (Harvard architecture). In this example, data memory (8-bit wide) holds network packets, which are processed according to the instructions in program memory (64-bit wide).
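To make the split between the two memories more concrete, here is a minimal software sketch in Python. It is not hBPF's hardware implementation. It only models the standard eBPF instruction layout (one 64-bit word holding an 8-bit opcode, two 4-bit register fields, a 16-bit offset, and a 32-bit immediate) and a byte-addressable data memory. The helper names (encode, decode, run), the three supported opcodes, and the assumption that packet data starts at data address 0 are all illustrative choices made for this example.

```python
import struct
from collections import namedtuple

# One eBPF instruction is a single 64-bit word (little-endian):
# opcode (8 bit), dst/src registers (4 bit each), offset (16 bit), immediate (32 bit).
Insn = namedtuple("Insn", "opcode dst src off imm")

# Standard eBPF opcodes used in this sketch.
MOV64_IMM = 0xB7  # dst = imm
LDXB = 0x71       # dst = *(u8 *)(src + off)
EXIT = 0x95       # return r0


def encode(opcode, dst=0, src=0, off=0, imm=0):
    """Pack one instruction into its 8-byte program memory word."""
    return struct.pack("<BBhi", opcode, (src << 4) | dst, off, imm)


def decode(word):
    """Unpack one 8-byte program memory word into its fields."""
    opcode, regs, off, imm = struct.unpack("<BBhi", word)
    return Insn(opcode, regs & 0x0F, regs >> 4, off, imm)


def run(program, data):
    """Toy interpreter: 64-bit program words, 8-bit data memory (hypothetical model)."""
    regs = [0] * 11  # r0..r10
    pc = 0
    while True:
        insn = decode(program[pc * 8:(pc + 1) * 8])
        pc += 1
        if insn.opcode == MOV64_IMM:
            # The immediate is sign-extended to 64 bits.
            regs[insn.dst] = insn.imm & 0xFFFFFFFFFFFFFFFF
        elif insn.opcode == LDXB:
            # Assumption: data memory addresses start at 0.
            regs[insn.dst] = data[regs[insn.src] + insn.off]
        elif insn.opcode == EXIT:
            return regs[0]
        else:
            raise ValueError(f"unsupported opcode 0x{insn.opcode:02x}")


# Program memory: load byte 12 of the packet (high byte of the EtherType) into r0.
program = b"".join([
    encode(MOV64_IMM, dst=1, imm=0),     # r1 = 0 (packet base address, assumed)
    encode(LDXB, dst=0, src=1, off=12),  # r0 = data[r1 + 12]
    encode(EXIT),                        # return r0
])

# Data memory: a 14-byte Ethernet header with EtherType 0x0800 (IPv4).
packet = bytes(12) + bytes([0x08, 0x00])

result = run(program, packet)
print(hex(result))
```

Each program memory word is exactly 8 bytes, while data memory is read one byte at a time, which mirrors the 64-bit and 8-bit widths mentioned above.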
The implementation requires about 10,500 LUTs, including the Wishbone bridge and LiteScope debugger. The CPU core alone requires about 8,000 LUTs. The implementation was tested at 100 MHz.
The source code and additional information can be found on GitHub.