What is GenAI v1?
RaiderChip’s GenAI v1 is the first licensable Hardware accelerator for Generative Artificial Intelligence, designed for use in FPGA-based products in the form of an IP core.
What is Generative AI?
Generative AI is the most advanced form of Artificial Intelligence. It is capable of creating original content, such as text, music, or images, and can also perceive, think, analyze, predict, and interact using human forms of expression.
Why a Hardware device?
Generative AI models demand significant computational power and memory bandwidth. Running increasingly capable and complex models at interactive speeds is a major challenge and requires the development of specialized hardware acceleration.
The RaiderChip GenAI series of Hardware accelerators is designed to maximize memory bandwidth and computing performance, ensuring real-time AI capabilities in the most efficient way, while keeping hardware costs and energy consumption as low as possible.
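As a back-of-envelope illustration of why memory bandwidth is the limiting factor: during autoregressive decoding, every model weight must be streamed from memory once per generated token, so the attainable token rate is roughly the usable memory bandwidth divided by the model's size in bytes. The sketch below uses purely illustrative figures, not GenAI v1 measurements:

```python
# Back-of-envelope estimate: LLM decoding is typically memory-bandwidth bound,
# because every model weight must be streamed from DRAM once per generated token.
# All figures below are illustrative assumptions, not GenAI v1 measurements.

def tokens_per_second(params_billions: float, bytes_per_weight: float,
                      bandwidth_gb_s: float) -> float:
    """Upper-bound token rate = usable memory bandwidth / bytes read per token."""
    model_bytes = params_billions * 1e9 * bytes_per_weight
    return bandwidth_gb_s * 1e9 / model_bytes

# A 1B-parameter model in FP16 (2 bytes/weight) on a hypothetical board with
# ~10 GB/s of effective DDR bandwidth, versus an 8B-parameter model on the same board.
print(f"1B FP16: {tokens_per_second(1.0, 2.0, 10.0):.1f} tokens/s")   # ~5.0
print(f"8B FP16: {tokens_per_second(8.0, 2.0, 10.0):.1f} tokens/s")   # ~0.6
```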
Why LLM on the Edge?
Because it runs fully offline, you can quite literally run a “ChatGPT”-style assistant on an FPGA board containing our GenAI inference engine, attached to nothing but a solar panel for power, since no network connection is even needed.
As a consequence, since the Generative AI LLM model is stored and run locally, your product needs neither external monthly subscriptions nor cloud infrastructure to provide cutting-edge intelligence.
What LLM models can it run?
RaiderChip’s Generative AI hardware accelerator is designed to run any model built on Transformers technology, which underpins the vast majority of LLMs. The selection of additional supported models specifically for FPGA depends on customer choices of target devices and foundational LLMs.
The various FPGA options available on the market differ in speed, size, and memory bandwidth. These technical factors, combined with commercial considerations such as unit cost, power consumption, and final functionality, mean that the ideal FPGA and LLM vary for each final product. For instance, using “smaller” models (such as Meta’s Llama 3.2 1B and Microsoft’s Phi-2 2.7B) or 4-bit Quantization is ideal for products based on simpler, more economical FPGAs, while larger LLMs in their original format, preserving full floating-point precision (FP32, FP16, BF16, FP8), are also possible on larger FPGA devices.
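To make the trade-off concrete, the following sketch estimates weight-storage footprints for a few representative model sizes and numeric formats. These are rough approximations, not RaiderChip specifications; Q4_K and Q5_K are assumed to average roughly 4.5 and 5.5 bits per weight:

```python
# Illustrative weight-storage footprints (weights only, excluding KV cache and
# runtime overhead). Rough approximations to show why model size and numeric
# format drive the choice of FPGA and attached DRAM; not RaiderChip figures.

BITS_PER_WEIGHT = {"FP32": 32, "FP16/BF16": 16, "FP8": 8, "Q5_K": 5.5, "Q4_K": 4.5}

def footprint_gb(params_billions: float, fmt: str) -> float:
    """Approximate weight storage in GB for a given numeric format."""
    return params_billions * 1e9 * BITS_PER_WEIGHT[fmt] / 8 / 1e9

for model, size_b in [("Llama 3.2 1B", 1.2), ("Phi-2 2.7B", 2.7), ("Llama 3.1 8B", 8.0)]:
    row = ", ".join(f"{fmt}: {footprint_gb(size_b, fmt):.2f} GB" for fmt in BITS_PER_WEIGHT)
    print(f"{model:13s} -> {row}")
```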
The models specifically supported for FPGA to date include Meta’s Llama 2, Llama 3, Llama 3.1, and Llama 3.2, as well as models from other providers (Microsoft’s Phi-2 and Phi-3). RaiderChip’s strategy is to add the most relevant models as they are released.
It’s also important to highlight that supporting a foundational model allows any customer-specific fine-tuned derivative to be accelerated seamlessly, without the need to share its weights.
How do you support a new LLM model?
Our Generative AI acceleration solution distinguishes itself by its highly optimized and flexible design. For example, more than 99.99% of the algorithm (the parts that determine inference speed) is implemented directly in hardware, leaving only a very thin software layer in charge of adding support for new models. This software layer is barely 200 KB in size and fully autonomous, with no external dependencies: there is no need for huge libraries such as Transformers or Torch.
The result is an efficient solution that lets us support new models without any hardware changes, making the process fast and straightforward and giving our acceleration system extraordinary flexibility and adaptability. In practice, a new model can be supported immediately when it is a new version of a previously supported one, or within a few days to a few weeks in other cases.
It is important to note that any fine-tune or LLM derived from a supported base model (e.g., fine-tunes of any Llama LLM) is supported out of the box, without any changes.
Can you support a confidential LLM whose weights can not be shared?
Yes. We can usually work with only the structure of the model: little more than the names and sizes of its weight tables are needed to infer its structure. Under no circumstances do we need the contents of the model, its weights.
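As a general illustration of the idea (not RaiderChip’s actual tooling): the structure of a checkpoint, meaning its tensor names, data types, and shapes, can be read from file metadata alone, without ever touching the weight values. The sketch below assumes the standard safetensors layout, whose header is a JSON block describing every tensor:

```python
import json
import struct

def read_structure(path: str) -> dict:
    """Return tensor names, dtypes and shapes from a .safetensors file header,
    without reading any weight data (standard safetensors layout assumed:
    an 8-byte little-endian header length, followed by a JSON header)."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # Keep only structural information; drop optional metadata and byte offsets.
    return {name: {"dtype": t["dtype"], "shape": t["shape"]}
            for name, t in header.items() if name != "__metadata__"}

# Hypothetical usage: the customer shares only this structural summary.
# for name, info in read_structure("confidential_model.safetensors").items():
#     print(name, info["dtype"], info["shape"])
```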
LLM on the Edge and on FPGA?
In a fast-paced technological environment where numerous companies introduce innovations weekly and increasingly complex and capable models are rapidly launched to the market, the reprogrammable nature of FPGAs makes them a safeguard against obsolescence.
Moreover, local, stand-alone, and fully offline Artificial Intelligence offers the significant added value of complete confidentiality and security against external attacks. This makes it ideal for sensitive applications such as automotive, defense, home appliances (e.g., assistants, cameras), and intellectual property (e.g., coding assistants), which remain fully protected from cyberattacks and leaks of important or sensitive data. All of this is achieved without sacrificing access to sophisticated consumer assistants or industrial agents and predictors that work anytime and anywhere, without relying on the cloud or an internet connection, and, importantly, without the need for monthly subscriptions (such as ChatGPT’s).
Do you support Quantization?
Yes, our base product, the GenAI v1, supports full precision Floating Point arithmetic (FP32, FP16, BF16, FP8), delivering the highest precision from LLMs. This results in inference outcomes of superior quality and coherence, preserving the original model’s reasoning abilities. This demonstrates the extraordinary efficiency of our technology, capable of running complex models with limited system memory bandwidth and computational resources, even without resorting to techniques like quantization.
Additionally, the upgraded GenAI v1-Q version adds support for 4-bit and 5-bit Quantization (Q4_K and Q5_K) on top of the already extraordinary efficiency of the base GenAI v1.
This new Generative AI LLM hardware accelerator, running inside FPGA devices, is ideal for boosting performance with low-cost DDR and LPDDR memories, increasing inference speed by 276%. It also reduces memory requirements by up to 75%, allowing the largest and most intelligent LLM models to fit into smaller systems, lowering the overall cost while maintaining real-time performance and reducing energy consumption. All of this is achieved with minimal impact on model accuracy and intelligence perception.
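As a rough sanity check of those figures (illustrative arithmetic only, assuming about 4.5 effective bits per weight for Q4_K and a memory-bandwidth-bound decoder):

```python
# Illustrative arithmetic only; assumes ~4.5 effective bits per weight for Q4_K
# versus 16 bits for FP16, and a memory-bandwidth-bound decoder.
fp16_bits, q4k_bits = 16.0, 4.5

memory_reduction = 1 - q4k_bits / fp16_bits        # ~72%, in line with "up to 75%"
bandwidth_bound_speedup = fp16_bits / q4k_bits     # ~3.6x, close to "almost four times"

print(f"memory reduction: {memory_reduction:.0%}")
print(f"bandwidth-bound speed-up: {bandwidth_bound_speedup:.1f}x")
```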
This allows us to offer maximum flexibility to our customers, with highly configurable hardware that enables them to balance criteria such as accuracy, inference speed, model size, hardware unit cost, or energy consumption goals according to their needs, finding the perfect balance that best meets their objectives.
Why add Generative AI to our products?
Enhance your solutions with stand-alone, offline AI capabilities using our GenAI series of accelerators, which require no internet connection or subscriptions and ensure full privacy, to transform your products into:
Smart Assistants: Capable of performing countless tasks in technical, medical, educational, customer-service, and administrative fields. This adds enormous functionality to devices, with interactions similar to what we have today with Amazon Alexa or Google Nest, but with a far more capable brain and without the need for an Internet connection.
THE Universal User Interface: Revolutionizing the way we communicate with all kinds of electronic devices, from large industrial machinery to small household appliances, enabling interaction in a much more natural and intuitive way: through our own language. The AI becomes both the user manual and the operator.
Advanced Predictors: Generative AI represents the cutting edge in prediction capabilities, identifying patterns and trends with higher accuracy than previous Machine Learning algorithms. This has numerous applications in sectors such as security (e.g., video-surveillance), aviation (e.g., radars), defense, industry, and finance.
Generative AI as a universal predictor?
GenAI is so revolutionary because it predicts what the user wants to read when using a chat interface, or what the user wants to see when used as an image generator. In both cases the underlying reason is that the Generative AI algorithm is the world’s best predictor today. This means you can train it on anything, not just text or pixels, and it will outperform today’s machine learning algorithms at understanding obscure patterns and predicting future outcomes: from detecting false positives in radar echoes or a CCTV surveillance system, to predicting a failure in a remote critical infrastructure whose sensors are monitored on-site by a GenAI-capable system.
How can we evaluate your product?
At RaiderChip, we prefer facts over words, which is why we provide our clients with an interactive demo that they can use and analyze at their own site. Our Edge demo runs Meta’s Llama 2, 3.0, 3.1, and 3.2 models, as well as Microsoft’s Phi-2 model, on cost-effective AMD Versal FPGA devices, achieving interactive speeds on the vanilla LLM and almost four times that speed when using quantization. This lets you extract the maximum intelligence from the AI model or, optionally, use quantization to reach maximum speed. To facilitate this, we provide a download link for an evaluation SD card image to boot your FPGA, containing everything necessary for it to power on as a GenAI accelerator, ready to run the Generative AI model of your choice.
This demonstrator can be used as an interactive chat, allowing the client to engage directly with the model, discussing whatever they prefer while evaluating factors such as latency, speed (tokens per second), model intelligence, and more. Optionally, you can also connect to your FPGA board through its OpenAI-compatible API to serve remote requests from computers on your network, integrating it seamlessly with most AI workloads that use this common protocol.
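For reference, a minimal client for such an OpenAI-compatible endpoint might look like the sketch below; the board address, port, and model name are placeholders to adapt to your own setup:

```python
# Minimal client for an OpenAI-compatible chat endpoint (standard
# /v1/chat/completions route). Board address, port and model name are placeholders.
import requests

BOARD = "192.168.1.50"          # placeholder: your FPGA board's IP address
MODEL = "llama-3.2-1b"          # placeholder: whichever model the board is serving

resp = requests.post(
    f"http://{BOARD}:8000/v1/chat/completions",
    json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "Hello! What can you do offline?"}],
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```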
Need more information on any of these items? Please don’t hesitate to contact us, and we will be happy to provide it as soon as possible!