At RaiderChip, we specialize in designing cutting-edge hardware accelerators for Generative Artificial Intelligence, delivering solutions that drastically reduce computational and energy requirements.
With a design philosophy centered on efficiency, we set new benchmarks in AI inference speed, achieving the industry’s highest tokens per second for any given memory bandwidth.
Fully Hardware-Based Generative AI Accelerator
RaiderChip’s GenAI NPU is the industry’s first fully hardware-based Generative AI accelerator for local inference at the Edge, whether in on-premises servers or embedded devices, setting a new standard in performance, privacy and efficiency.
This groundbreaking NPU design delivers significantly higher performance than hybrid solutions while remaining fully stand-alone: it needs no external CPU or Internet connection, since the LLM models are embedded directly in the accelerator.
Key Benefits of RaiderChip’s NPU
Thanks to its custom design, the GenAI NPU processes more tokens per unit of memory bandwidth, achieving unparalleled efficiency.
- Extraordinary scalability: Eliminates latency introduced by HW-SW communication, enhancing overall system speed.
- Superior performance with cost-effective hardware: Delivers top-tier results using fewer hardware components.
- Enhanced energy efficiency: Operates without external processors, reducing power consumption.
- Predictable, real-time performance: In a fully hardware-based design, performance is deterministic, unlike CPU/GPU solutions.
Optimized Resource Utilization
This design leverages higher computational density, with pipelines tailored to Generative AI’s Transformers architecture, ensuring better utilization of FPGA and ASIC resources. It is an ideal solution for applications requiring consistent, reliable performance at very low cost, such as embedded systems and real-time AI solutions.
Maximum efficiency
Our design maximizes the number of tokens generated per effective GByte/s of available memory bandwidth, effectively addressing the primary bottleneck in Generative AI acceleration.
By optimizing the utilization of this critical resource, our technology delivers exceptional performance, regardless of the device in which it is integrated.
The normalized performance metric —tokens per second per unit of memory bandwidth— serves as a benchmark for evaluating the efficiency and quality of accelerator designs. This metric is independent of the memory technology or bandwidth selected by each vendor, ensuring a fair comparison.
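As a simple illustration, this normalized metric is just measured throughput divided by available memory bandwidth. The sketch below shows how it can be computed and compared across devices; all figures are hypothetical placeholders, not published benchmark results.

```python
# Illustration of the normalized efficiency metric:
# tokens generated per second, per GB/s of memory bandwidth.
# All figures below are hypothetical placeholders, not benchmark results.

def normalized_efficiency(tokens_per_second: float, bandwidth_gb_s: float) -> float:
    """Tokens per second produced per GB/s of available memory bandwidth."""
    return tokens_per_second / bandwidth_gb_s

# Two hypothetical accelerators with very different memory subsystems:
accel_a = normalized_efficiency(tokens_per_second=25.0, bandwidth_gb_s=50.0)    # LPDDR-class device
accel_b = normalized_efficiency(tokens_per_second=400.0, bandwidth_gb_s=1000.0) # HBM-class device

print(f"A: {accel_a:.3f} tokens/s per GB/s")  # 0.500
print(f"B: {accel_b:.3f} tokens/s per GB/s")  # 0.400
# Although B is faster in absolute terms, A extracts more tokens from each
# GB/s of bandwidth, which is what the normalized metric is designed to expose.
```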
Our results demonstrate that our design outperforms all major competitors:
- +37% over Intel’s Gaudi 2 server AI accelerator
- +28% beyond Nvidia’s cloud GPUs
- +25% above Google’s Cloud TPU v5e accelerators
By setting a new standard in memory efficiency, our accelerators enable faster, more scalable, and cost-effective solutions for generative AI applications.
Hardware cost and energy consumption savings
- Reduced Investment in High-Bandwidth Memory:
By maximizing token generation per GB/s, our design reduces reliance on expensive high-bandwidth memory such as HBM, allowing the use of more cost-effective memory like DDR or LPDDR without sacrificing performance (see the sizing sketch at the end of this section).
- Higher Performance with Less Hardware:
Thanks to its superior efficiency, our accelerator achieves equivalent or better results with fewer components, lowering the size and cost of final devices.
- Optimized Memory Usage:
Efficient memory bandwidth utilization decreases the energy consumed by memory operations, which represent a significant portion of energy consumption in generative AI acceleration.
- Greater Efficiency per Token:
Our design processes more tokens with less work per token, reducing overall system energy consumption for inference and text generation tasks, even in real-time applications.
This enables generative AI solutions that are more cost-effective and sustainable, achieving faster return on investment, lowering operational costs, and making it easier to integrate the technology into a wide range of products, tailored to different budgets and applications.
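To make the memory trade-off above concrete, here is a rough sizing sketch for bandwidth-bound, single-stream inference, where generating each token requires streaming roughly the full set of weights from memory. The model size, bandwidth, and efficiency figures are illustrative assumptions, not product specifications.

```python
# Back-of-the-envelope sizing sketch for bandwidth-bound LLM inference.
# Assumption: generating one token requires reading roughly all model weights
# once, so tokens/s <= effective_bandwidth / model_size_in_bytes.
# All numbers are illustrative, not RaiderChip specifications.

def peak_tokens_per_second(model_gbytes: float, bandwidth_gb_s: float,
                           efficiency: float = 1.0) -> float:
    """Upper bound on token rate given model footprint and memory bandwidth.

    `efficiency` models how much of the raw bandwidth the accelerator
    actually turns into useful weight traffic (0.0 to 1.0).
    """
    return efficiency * bandwidth_gb_s / model_gbytes

MODEL_GB = 3.5  # e.g. a hypothetical 7B-parameter model at ~4 bits per weight

# A modest LPDDR interface used efficiently can already reach interactive rates,
# reducing the need for expensive HBM parts:
print(peak_tokens_per_second(MODEL_GB, bandwidth_gb_s=40.0, efficiency=0.9))   # ~10 tokens/s on LPDDR
print(peak_tokens_per_second(MODEL_GB, bandwidth_gb_s=400.0, efficiency=0.5))  # ~57 tokens/s on HBM used at 50%
```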
Target-agnostic technology: ASIC and FPGA
Our designs are target-agnostic and can be integrated into a wide range of FPGAs of different sizes and costs, as well as manufactured into ASIC devices.
Our goal is to provide exceptional flexibility: configurable hardware that lets clients adjust parameters such as precision, inference speed, model size, hardware cost, and energy consumption to reach the balance that best aligns with their needs and objectives.
Run all Transformers-based LLM models
Our accelerators are designed to process any LLM built on Transformers technology, which includes the vast majority of Generative Artificial Intelligence models, such as Meta’s Llama models or Microsoft’s Phi models.
This also offers automatic support for any fine-tuned LLM derived from a compatible base model, with no modifications required.
Additionally, we can handle confidential customer models without ever needing access to their weights. All we require is the model’s configuration file and the structure of its weight tables to provide full acceleration capabilities while maintaining data security and privacy.
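For reference, the architecture metadata involved is of the same kind as the config.json that accompanies Hugging Face Transformers checkpoints. The sketch below shows typical fields for a Llama-style model; it contains no weight values, and the specific numbers are illustrative.

```python
# Sketch of the architecture metadata needed to configure acceleration for a
# Transformers-based model. It describes layer structure and tensor shapes
# only; no weight values are present. Field values are illustrative.
llama_style_config = {
    "architectures": ["LlamaForCausalLM"],
    "hidden_size": 4096,          # embedding width
    "intermediate_size": 11008,   # feed-forward width
    "num_hidden_layers": 32,      # number of Transformer blocks
    "num_attention_heads": 32,
    "num_key_value_heads": 32,    # grouped-query attention setting
    "vocab_size": 32000,
    "max_position_embeddings": 4096,
    "rope_theta": 10000.0,        # rotary positional embedding base
    "torch_dtype": "float16",     # native weight precision
}

# From fields like these, the shape of every weight table can be derived, e.g.:
attn_q_proj_shape = (llama_style_config["hidden_size"],
                     llama_style_config["hidden_size"])  # (4096, 4096) per layer
```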
Supporting vanilla and quantized LLMs
Vanilla Models
Our solution accelerates generative AI models in their native floating-point precision (FP32, FP16, BF16, FP8), ensuring maximum accuracy and consistency in inference results. This approach achieves interactive speeds even on low-cost devices and is ideal for applications where precision is critical.
Examples where precision is crucial:
- Medical diagnostics: Models generating or analyzing medical data require the highest accuracy to ensure reliable outcomes that exactly match the behavior of the originally trained model.
- Financial modeling: Predictive models for stock markets or risk assessment demand absolute precision to avoid costly errors.
Thanks to the extraordinary efficiency of our design, we can execute complex models in their original (vanilla) version, even on devices with very limited computational resources.
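As a quick illustration of what vanilla precision implies for memory footprint, the sketch below estimates the weight storage of a hypothetical 7-billion-parameter model at each of the floating-point formats listed above (weights only, excluding KV cache and activations); the figures are rough approximations.

```python
# Estimated weight footprint of a hypothetical 7B-parameter model at the
# floating-point precisions mentioned above. Covers weights only
# (no KV cache or activations); figures are rough illustrations.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "BF16": 2.0, "FP8": 1.0}

params = 7e9  # hypothetical 7B-parameter model

for fmt, nbytes in BYTES_PER_PARAM.items():
    gbytes = params * nbytes / 1e9
    print(f"{fmt}: ~{gbytes:.0f} GB of weights")
# FP32: ~28 GB, FP16/BF16: ~14 GB, FP8: ~7 GB
```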
Quantization
Additionally, RaiderChip’s solutions support 4-bit and 5-bit quantization (Q4_K, Q5_K), offering higher performance with low-cost DDR and LPDDR memory.
Quantized models significantly increase inference speed and reduce memory requirements with minimal impact on precision.
Examples where speed is paramount:
- Real-time language translation: Accelerated models provide instantaneous translations for live conversations or broadcasts.
- Interactive customer service: Chatbots need to process and respond rapidly to maintain a seamless user experience.
- Edge AI applications: Autonomous vehicles or IoT devices rely on high-speed decision-making to operate effectively.
The use of quantized models enables:
- A 276% increase in inference speed.
- Up to a 75% reduction in memory requirements.
This makes it possible to run the largest and most advanced LLM models on embedded systems, significantly reducing overall costs, maintaining real-time performance, and lowering energy consumption, while preserving almost all of the model’s intelligence.
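Those figures are consistent with simple back-of-the-envelope arithmetic for bandwidth-bound inference, where token rate scales roughly inversely with the bytes read per weight. The sketch below assumes an FP16 baseline and an effective footprint of about 4.25 bits per weight for a Q4_K-style format; both are assumptions, not measured values.

```python
# Rough consistency check of the quantization figures above, assuming
# bandwidth-bound inference where speed scales inversely with bytes per weight.
# Bit widths are assumptions (FP16 baseline, ~4.25 effective bits for a
# Q4_K-style format including per-block scales), not measured values.
baseline_bits = 16.0   # FP16 vanilla weights
quantized_bits = 4.25  # 4-bit weights plus scaling metadata

memory_reduction = 1.0 - quantized_bits / baseline_bits
speedup = baseline_bits / quantized_bits

print(f"Memory reduction: ~{memory_reduction:.0%}")   # ~73%
print(f"Speed increase:   ~{(speedup - 1.0):.0%}")    # ~276%
# Close to the "up to 75% less memory" and "276% faster" figures quoted above.
```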