RaiderChip presents the GenAI v1, a state-of-the-art hardware IP core


Designed specifically for the demands of Generative AI, the most challenging AI workload to date. Our IP core is engineered to maximize efficiency in processing and memory utilization, setting new standards in AI inference speed.

RaiderChip GenAI v1 running Phi-2 model on a Versal FPGA



GenAI v1 IP core on DDR FPGAs

This IP core is readily available for devices across all families of AMD Versal FPGAs.

The following table shows resource usage and AI inference speed for the raw, unquantized Microsoft Phi-2 LLM:

The GenAI v1 IP core has also been verified on AMD UltraScale+ devices (including AWS F1 instances) running Meta’s Llama-2 and Llama-3 models with state-of-the-art efficiency.

Please contact us if you would like more information about any FPGA device or vendor.

GenAI v1 IP core on HBM FPGAs

Coming soon…

Designed for Maximum Efficiency


Generative AI requires extensive computational power and memory bandwidth for processing:

The unit of AI inference is the token (roughly one syllable to one word, depending on the model). Since generating a token takes about 2 floating-point operations per parameter, a typical 7B-parameter LLM requires approximately 14 GFLOPs (14 billion floating-point operations) per token, i.e. 14 GFLOP/s of compute just to produce 1 token per second.

However, the real challenge is not FLOPs: the critical bottleneck in AI processing is the memory bandwidth required to transfer billions of parameters to the processing engines once for every generated token. Our GenAI v1 IP core tackles this with low-level, cycle-by-cycle design that extracts the highest possible memory bandwidth, delivering more than 20% higher processing speed, in tokens per second, than top competitors.
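As a back-of-the-envelope illustration of why bandwidth, rather than compute, bounds the token rate, the sketch below estimates both limits for a hypothetical accelerator. All hardware figures in it are illustrative assumptions, not GenAI v1 specifications:

```python
# Roofline-style estimate of LLM token rate: compute-bound vs. bandwidth-bound.
# All hardware figures below are illustrative assumptions, not GenAI v1 specs.

PARAMS = 7e9              # 7B-parameter model
FLOPS_PER_PARAM = 2       # ~2 FLOPs per parameter per generated token
BYTES_PER_PARAM = 2       # FP16/BF16 weights

peak_flops = 1e12         # assumed 1 TFLOP/s of floating-point compute
mem_bandwidth = 50e9      # assumed 50 GB/s DDR memory bandwidth

flops_per_token = PARAMS * FLOPS_PER_PARAM    # 14 GFLOPs per token
bytes_per_token = PARAMS * BYTES_PER_PARAM    # full weight set streamed per token

compute_bound_tps = peak_flops / flops_per_token       # ~71.4 tokens/s
bandwidth_bound_tps = mem_bandwidth / bytes_per_token  # ~3.6 tokens/s

# The achievable rate is the lower of the two limits: memory bandwidth wins.
print(f"compute-bound:   {compute_bound_tps:5.1f} tokens/s")
print(f"bandwidth-bound: {bandwidth_bound_tps:5.1f} tokens/s")
print(f"achievable:      {min(compute_bound_tps, bandwidth_bound_tps):5.1f} tokens/s")
```

Under these assumptions the compute limit sits roughly 20x above the bandwidth limit, which is why extracting every last byte per second from the memory interface matters far more than adding FLOPs.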

Our technology achieves superior throughput on the same memory technology, offering higher value: higher performance at lower power, for the same cost.



Advanced Parallelism and Memory Utilization


The RaiderChip GenAI v1 IP Core features:

  • Massive Floating Point (FP) Parallelism: To handle extensive computations simultaneously.
  • Optimized Memory Bandwidth Utilization: Ensuring peak efficiency in data handling. The IP core’s design is fully parametrizable, allowing it to scale seamlessly and maximize efficiency on the target architecture, thanks to our sophisticated scheduling and flow-control logic (see the sketch after this list).
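
As a minimal sketch of what parametrizable, bandwidth-matched scaling can mean in practice, the snippet below sizes the number of parallel FP lanes so that compute throughput keeps pace with a given memory interface. The function, parameter names, and figures are hypothetical illustrations, not the actual GenAI v1 configuration interface:

```python
import math

# Hypothetical sizing helper: pick enough parallel FP lanes so the compute
# datapath keeps pace with the memory interface (bandwidth-matched design).
# Names and figures are illustrative, not the real GenAI v1 parameters.

def fp_lanes_needed(mem_bandwidth_bps: float,
                    clock_hz: float,
                    bytes_per_param: int = 2,     # FP16 weights
                    flops_per_param: int = 2,     # multiply + accumulate
                    flops_per_lane_per_cycle: int = 2) -> int:
    """Minimum FP lanes needed to consume the weight stream at full bandwidth."""
    params_per_sec = mem_bandwidth_bps / bytes_per_param
    required_flops_per_sec = params_per_sec * flops_per_param
    return math.ceil(required_flops_per_sec / (clock_hz * flops_per_lane_per_cycle))

# Example: a 50 GB/s DDR interface and a 300 MHz FPGA fabric clock,
# with one fused multiply-add (2 FLOPs) per lane per cycle.
print(fp_lanes_needed(mem_bandwidth_bps=50e9, clock_hz=300e6))  # -> 84 lanes
```

Matching compute to bandwidth this way avoids both idle FP units and a starved memory interface, whatever the target device provides.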



Benchmarking Against the Best


The normalized throughput metric (tokens per second per unit of memory bandwidth) measures the quality of each accelerator design independently of the memory technology and bandwidth selected by each vendor. By this metric, the GenAI v1 accelerator design outperforms all major competitors (see the worked example after this list):

  • +37% over Intel Gaudi
  • +28% beyond Nvidia’s cloud GPUs
  • +25% above Google’s latest TPU
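
A minimal sketch of the normalization arithmetic, using made-up throughput and bandwidth figures purely to show how the metric is computed (the percentages above are RaiderChip's benchmark results, not derived from these placeholder numbers):

```python
# Normalized throughput: tokens per second per GB/s of memory bandwidth.
# The figures below are made-up placeholders to show the arithmetic,
# not measurements of any vendor's hardware.

accelerators = {
    "design A": {"tokens_per_s": 12.0, "bandwidth_gb_s": 100.0},
    "design B": {"tokens_per_s": 80.0, "bandwidth_gb_s": 900.0},
}

for name, spec in accelerators.items():
    normalized = spec["tokens_per_s"] / spec["bandwidth_gb_s"]
    print(f"{name}: {normalized:.3f} tokens/s per GB/s")

# Design A wins on normalized throughput (0.120 vs 0.089) despite far less
# raw bandwidth: the metric isolates the accelerator architecture itself.
```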



Looking to the Future


Our current demonstrators achieve real-time interactive AI speeds, and thorough verification has confirmed the IP core’s performance and precision. With proven scalability and state-of-the-art efficiency, the GenAI v1 is set to revolutionize both edge and cloud markets, offering solutions that exceed conventional speed thresholds and meet high-performance demands.