RaiderChip presents the GenAI v1, a state-of-the-art hardware IP core designed specifically for the demands of Generative AI, the most challenging AI workload to date. Our IP core is engineered to maximize efficiency in processing and memory utilization, setting new standards in AI inference speed.
GenAI v1 IP core on DDR FPGAs
This IP core is readily available for all device families of the AMD Versal FPGA.
The following table shows the resource usage and AI inference speed (for the raw, unquantized Microsoft Phi-2 LLM):
The GenAI v1 IP core has also been verified on AMD UltraScale+ devices (including on AWS F1 instances) running Meta’s Llama 2, 3.0, 3.1 and 3.2 LLM models with state-of-the-art efficiency.
Please contact us if you would like more information about any FPGA device or vendor.
GenAI v1-Q IP core with Quantization support
The GenAI v1-Q IP core product adds 4-bit and 5-bit quantization support (Q4_K and Q5_K) to the extraordinary efficiency of the base GenAI v1.
It is ideal for boosting systems built on low-cost DDR and LPDDR memories, increasing inference speed by 276%.
4-bit quantization lowers memory requirements by up to 75%, allowing the largest and most intelligent LLM models to fit into smaller systems. This reduces overall cost and energy consumption while keeping real-time speed, all with minimal impact on model accuracy and perceived intelligence.
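The 75% figure follows directly from the bit widths. A minimal sketch of the arithmetic (the 7B parameter count is an illustrative assumption; Q4_K additionally stores small per-block scales, which this sketch ignores):

```python
def weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Weight-storage footprint in GiB at a given precision."""
    return n_params * bits_per_weight / 8 / 2**30

params_7b = 7e9                    # illustrative: a typical 7B-parameter LLM
fp16 = weights_gib(params_7b, 16)  # unquantized baseline
q4 = weights_gib(params_7b, 4)     # 4-bit quantized weights

# 4 bits is a quarter of 16 bits, hence a 75% memory saving.
print(f"FP16: {fp16:.1f} GiB  Q4: {q4:.1f} GiB  saving: {1 - q4 / fp16:.0%}")
```

The same model that needs roughly 13 GiB of weight storage in FP16 fits in about 3.3 GiB at 4 bits, which is what lets large models run from commodity DDR and LPDDR.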
GenAI v1 IP core on HBM FPGAs
Coming soon…
Designed for Maximum Efficiency
Generative AI requires extensive computational power and memory bandwidth for processing:
The unit of AI inference is the token (from one syllable to one word, depending on the model). A typical 7B LLM model requires approximately 14 GFLOPS (14 billion floating-point operations per second) to produce just 1 token per second.
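The 14 GFLOPS figure comes from the common rule of thumb that a dense LLM performs roughly two floating-point operations (one multiply, one add) per parameter per generated token:

```python
def flops_per_second(n_params: float, tokens_per_second: float) -> float:
    """Compute required to sustain a given generation speed,
    assuming ~2 FLOPs per parameter per token (dense model)."""
    return 2 * n_params * tokens_per_second

# 7B parameters at 1 token/s:
req = flops_per_second(7e9, 1.0)
print(f"{req / 1e9:.0f} GFLOPS")  # 14 GFLOPS
```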
However, the challenge is not FLOPs: the critical bottleneck in AI processing is the memory bandwidth needed to transfer billions of parameters to the processing engines, once for every token generated per second. Our GenAI v1 IP core tackles this through low-level, cycle-by-cycle design that extracts the highest memory bandwidth, increasing processing speed in tokens per second by more than 20% over top competitors.
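Because every generated token requires streaming the full set of weights from memory, bandwidth puts a hard ceiling on generation speed. A rough upper-bound sketch (the bandwidth and model figures are illustrative assumptions, not vendor data):

```python
def max_tokens_per_second(bandwidth_gbs: float, n_params: float,
                          bytes_per_weight: float) -> float:
    """Bandwidth-bound ceiling on generation speed: each token
    requires streaming all weights once from memory."""
    model_bytes = n_params * bytes_per_weight
    return bandwidth_gbs * 1e9 / model_bytes

# Illustrative: a 7B model in FP16 (2 bytes/weight) on a 25.6 GB/s DDR4 channel
print(f"{max_tokens_per_second(25.6, 7e9, 2):.2f} tokens/s")
```

This is why a design that extracts a higher fraction of the theoretical bandwidth directly translates into more tokens per second on the same memory.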
Our technology achieves superior throughput on the same memory technology, offering higher value: more performance at lower power, for the same cost.
Advanced Parallelism and Memory Utilization
The RaiderChip GenAI v1 IP Core features:
- Massive Floating Point (FP) Parallelism: To handle extensive computations simultaneously.
- Optimized Memory Bandwidth Utilization: Ensuring peak efficiency in data handling.
Our IP core’s design is fully parametrizable, allowing it to scale seamlessly and maximize efficiency based on the target architecture, thanks to our sophisticated scheduling and flow control logic.
Benchmarking Against the Best
The normalized throughput metric, tokens per second per unit of memory bandwidth, differentiates the quality of each accelerator design independently of the memory technology and bandwidth selected by each vendor. This metric shows that the GenAI v1 accelerator design outperforms all major competitors:
- +37% over Intel Gaudi
- +28% beyond Nvidia’s cloud GPUs
- +25% above Google’s latest TPU
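The normalized metric above can be sketched as a simple ratio; the two designs and all numbers below are hypothetical placeholders, used only to show why the normalization makes accelerators on different memory technologies comparable:

```python
def normalized_throughput(tokens_per_second: float, bandwidth_gbs: float) -> float:
    """Tokens per second extracted per GB/s of available memory bandwidth."""
    return tokens_per_second / bandwidth_gbs

# Hypothetical: design B is faster in absolute terms, but design A
# extracts more performance from each unit of bandwidth.
a = normalized_throughput(10.0, 100.0)  # modest DDR-class bandwidth
b = normalized_throughput(40.0, 500.0)  # HBM-class bandwidth
print(f"A: {a:.3f}  B: {b:.3f}  A vs B: {a / b - 1:+.0%}")
```

On this metric, raw speed bought purely with more expensive memory no longer masks an inefficient design.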
Looking to the Future
Our current demonstrators achieve real-time interactive AI speeds, and thorough verification confirmed the IP core’s performance and precision. With proven scalability and state-of-the-art efficiency, the GenAI v1 is set to revolutionize both edge and cloud markets, offering solutions that exceed conventional speed thresholds and meet high-performance demands.