EDGE AI POD

Generative AI on NXP Microprocessors

EDGE AI FOUNDATION

Stepping into a future where AI doesn't require the cloud, NXP is revolutionizing edge computing by bringing generative AI directly to microprocessors. Alberto Alvarez offers an illuminating journey through NXP's approach to private, secure, and efficient AI inference that operates entirely at the edge.

The heart of NXP's innovation is their eIQ GenAI Flow, a comprehensive software pipeline designed for i.MX SoCs that enables both fine-tuning and optimization of AI models. This dual capability allows developers to adapt openly available Large Language Models for specific use cases without compromising data privacy, while also tackling the challenge of memory footprint through quantization techniques that maintain model accuracy. The conversational AI implementation creates a seamless experience by combining wake word detection, speech recognition, language processing with retrieval-augmented generation, and natural speech synthesis, all accelerated by NXP's Neutron NPU.

Most striking is NXP's partnership with Kinara, which introduces truly groundbreaking multimodal AI capabilities running entirely at the edge. Their demonstration of the LLaVA model, pairing Llama 3's 8 billion parameters with CLIP vision encoding, showcases the ability to process both images and language queries without any cloud connectivity. Imagine industrial systems analyzing visual scenes, detecting subtle anomalies like water spills, and providing spoken reports, all while keeping sensitive data completely private. With quantization reducing these massive models to manageable 4-bit and 8-bit precision, NXP is making previously impossible edge AI applications a practical reality.

Ready to experience the future of edge intelligence? Explore NXP's application code hub to start building with eIQ GenAI resources on compatible hardware and discover how your next project can harness the power of generative AI without surrendering privacy or security to the cloud.

Learn more about the EDGE AI FOUNDATION - edgeaifoundation.org

Speaker 1:

Welcome to Alberto Alvarez from NXP. Hi, Alberto, thanks for joining us and for finding the time to present Generative AI on NXP Microprocessors: Bringing Intelligence to the Edge. The floor is yours.

Speaker 2:

Thank you very much. Hi everyone, my name is Alberto Alvarez. I am a systems and applications engineer at NXP, and today I'm going to present a little bit of what we have for generative AI on NXP microprocessors. So today's agenda is the following: we will briefly talk about the background of AI at NXP, and specifically about GenAI, and then we are going to introduce the eIQ GenAI Flow. After that, we are going to share with you our latest multimodal GenAI at the edge demo, and finally we will give a very brief summary of what was presented and have a short Q&A session. So we are very excited to share this with you.

Speaker 2:

So let's start with the background. We know that AI spans from training in the cloud to inference at the edge, and at NXP we are mainly focused on AI deployment, given our robust processing capabilities that provide a unique opportunity to shape the deployment of AI at the edge. NXP is not in the business of training huge models in the cloud and then running them on big GPUs. Instead, NXP is targeting opportunities for private, secure and efficient AI inference at the edge, for areas such as industrial automation, smart homes, smart cities, autonomous driving systems and others. We know that every day we hear about AI: a new model comes out of some paper, and with the introduction of all these new models, multimodal models, and also very powerful discrete NPUs that allow us to accelerate these models at the edge, the possibilities for creating new solutions are expanding every single day. Let's think about one example: mobile robotics. NXP can now give a robot a brain, an LLM, for smarter reasoning capabilities, but not only that, we can also give it the eyes, the ears and the mouth to interact with and analyze its surroundings and become a smarter robot. Another example is superior reasoning combined with visual analysis, enabled by multimodal models for vision: think about a smart assistant for medical applications in the healthcare industry, where you can have preliminary diagnostic analysis of images. The possibilities are many; this is just a small, non-exhaustive list of examples, but NXP is targeting these kinds of applications and trying to leverage all the big models coming out of the large AI community.

Speaker 2:

So now let me introduce the eIQ GenAI Flow, which is a software pipeline that NXP offers to our customers so they can create new smart, intelligent solutions with our i.MX SoCs. eIQ GenAI Flow mainly targets two capabilities: first fine-tuning, and then optimization. Fine-tuning allows our customers to enhance and adapt openly available LLMs for specific use cases without exposing their private data to anonymous servers. By fine-tuning the model, it will reduce hallucinations and errors during the reasoning process, which is particularly important when applied to medical, automotive or factory automation solutions. Optimization is also a very important aspect of this pipeline. We know that without optimizing our models, the memory footprint of even the smallest LLMs is just too big for edge deployment. So we have the capability of quantizing these LLMs to four bits or eight bits while maintaining the accuracy of the model, and then accelerating them with powerful NPUs at the edge. We are releasing the first version of this pipeline for the i.MX 95 SoC, and we are going to very quickly go over this end-to-end eIQ GenAI Flow pipeline.
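
To make the 4-bit and 8-bit quantization idea concrete, here is a minimal, illustrative Python sketch of symmetric per-tensor weight quantization. It is not the eIQ toolchain, whose API is not shown in the talk; real flows use per-channel or per-group schemes with calibration, and this only demonstrates the core rounding step.

```python
import numpy as np

def quantize(weights: np.ndarray, bits: int):
    # Symmetric quantization: map floats to signed integers of the given width.
    qmax = 2 ** (bits - 1) - 1                      # 7 for int4, 127 for int8
    scale = np.abs(weights).max() / qmax            # one scale for the whole tensor
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
for bits in (8, 4):
    q, s = quantize(w, bits)
    err = np.abs(w - dequantize(q, s)).mean()
    print(f"int{bits}: mean absolute reconstruction error {err:.4f}")
```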

Speaker 2:

So yeah, as I mentioned, the very first release, which we published last month, is for conversational AI, so we are using audio to drive the entire pipeline with ASR, TTS and RAG, and then generate a very seamless response out of the system. So how does this work? We first need a wake word engine. In this case we are using VIT, which is a technology from NXP. We are not going to cover it in this presentation, but basically it allows the system to start listening to the incoming audio. You can trigger it with just one phrase, let's say "Hey, NXP", and it will start listening to the audio. Then automatic speech recognition converts the incoming spoken audio into text that can then be interpreted by the LLM. Together with RAG, we can enhance the LLM with specific knowledge relevant to our use case. Again, this knowledge was not previously seen by the LLM during the training process, and this is something you can do offline without exposing your data to the cloud, which makes for a very private implementation and solution for everyone. With RAG, the LLM can generate a response based on the input context that was given to the system. Finally, TTS converts the response that the LLM generates into speech. All of this is currently available for you on the i.MX 95 SoC, and in a few slides I'm going to share the links so you can go and access and play with this new technology. In the future we are also looking to implement more modalities such as text, image and video. So the idea is to eventually have a multimodal pipeline for eIQ GenAI.
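
As a purely structural sketch of the stage chaining just described (wake word, then ASR, then RAG and the LLM, then TTS), the Python outline below shows how the pieces hand data to each other. Every function body is a placeholder, not the eIQ GenAI Flow API, which is not shown in the talk.

```python
# Placeholder conversational pipeline: wake word -> ASR -> RAG -> LLM -> TTS.

def detect_wake_word(audio: bytes) -> bool:
    # Stand-in for a wake word engine such as NXP VIT.
    return b"hey nxp" in audio.lower()

def speech_to_text(audio: bytes) -> str:
    # Stand-in for the 8-bit Whisper ASR stage.
    return "is there water on the floor?"

def retrieve_context(query: str, knowledge_base: list[str]) -> str:
    # Toy RAG step: return the document sharing the most words with the query.
    overlap = lambda doc: len(set(query.lower().split()) & set(doc.lower().split()))
    return max(knowledge_base, key=overlap)

def generate_answer(query: str, context: str) -> str:
    # Stand-in for the on-device LLM (Danube 3 in the talk).
    return f"Based on '{context}': yes, please check the floor."

def text_to_speech(text: str) -> bytes:
    # Stand-in for the TTS stage.
    return text.encode()

def handle_utterance(audio: bytes, knowledge_base: list[str]) -> bytes:
    if not detect_wake_word(audio):
        return b""
    query = speech_to_text(audio)
    context = retrieve_context(query, knowledge_base)
    return text_to_speech(generate_answer(query, context))

if __name__ == "__main__":
    kb = ["Spills on the factory floor must be reported immediately.",
          "Robots are charged overnight in bay 3."]
    print(handle_utterance(b"Hey NXP, is there water on the floor?", kb))
```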

Speaker 2:

Okay, so let me go deeper into each of these components. Let's start with ASR. ASR, as I mentioned, converts the input speech into text. In this implementation we are using the Whisper architecture. The model we are using has 244 million parameters and is quantized to eight bits, and with this we can have real-time speech recognition at the edge. The system encodes the audio in segments of three seconds, and for each segment it predicts multiple words. NXP also offers the Pro version of eIQ GenAI Flow, which allows you to explore more ASR models and also personalize your ASR models with tuning.
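
To illustrate the three-second segmentation just described, here is a small sketch that assumes 16 kHz mono PCM audio; the transcribe stub stands in for the quantized Whisper model, whose actual runtime calls are not shown in the talk.

```python
import numpy as np

SAMPLE_RATE = 16_000        # assumed sample rate, not stated in the talk
SEGMENT_SECONDS = 3         # segment length quoted in the talk

def split_into_segments(pcm: np.ndarray) -> list[np.ndarray]:
    step = SAMPLE_RATE * SEGMENT_SECONDS
    return [pcm[i:i + step] for i in range(0, len(pcm), step)]

def transcribe(segment: np.ndarray) -> str:
    # Stand-in for running the 8-bit Whisper encoder/decoder on one segment.
    return "<words predicted for this segment>"

audio = np.zeros(SAMPLE_RATE * 10, dtype=np.int16)   # 10 s of dummy audio
text = " ".join(transcribe(seg) for seg in split_into_segments(audio))
print(text)
```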

Speaker 2:

Now, the LLM that we are using in this implementation is Danube 3. It has 500 million parameters and is quantized to 8 bits, and with the LLM we can generate very coherent responses in natural language, depending on the context in which we use it in our application. NXP is optimizing these models to perform fast on our SoCs, accelerated by our NPUs. This model is accelerated by the Neutron NPU, which offers 2 TOPS of compute power, and we can see in this table that when we run the model on the CPU with six threads, the time to first token is around one second. So it takes around a second to start generating the first token, but the CPU load is very high: we are utilizing 60% of the CPU, which might not be desirable for our application, right? Now we can accelerate this model with the Neutron NPU and cut the time to first token in half, so it's twice as fast as the CPU execution. And the key part is that we have offloaded a lot of the cycles from the CPU, so we can use the CPU for other processes in our application. We are moving forward; we know that in the future these models will be optimized and accelerated even further.
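
Time to first token, the metric in that comparison, can be measured the same way for any streaming generation API. The sketch below uses a fake token stream as a stand-in, since the actual CPU and Neutron NPU runtime calls are not shown in the talk.

```python
import time

def fake_token_stream():
    # Stand-in generator: a prefill delay, then per-token decode delays.
    time.sleep(0.5)
    for tok in ["The", " floor", " looks", " wet", "."]:
        time.sleep(0.05)
        yield tok

start = time.perf_counter()
stream = fake_token_stream()
first = next(stream)                       # time to first token ends here
ttft = time.perf_counter() - start
rest = list(stream)
print(f"time to first token: {ttft:.2f} s, output: {''.join([first] + rest)}")
```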

Speaker 2:

All right, so now let's talk a little bit about RAG. We already know what RAG is doing here: it's basically enhancing the knowledge of the LLM so it can generate very accurate responses in the application, with specific knowledge that we don't want to share with any anonymous servers. NXP is providing a tool that you can use to build a database that's fully compatible with our pipeline, the eIQ GenAI Flow pipeline, so you can very easily start creating an application that utilizes all these GenAI tools at the edge. You can go and check it out at the links in the next slides, so you can start building your database and testing this with an i.MX 95 board. So yeah, very interesting things that we are starting to build.
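
At its core, the RAG step means building an embedding index offline and doing a similarity search at query time. The self-contained illustration below uses a hashed bag-of-words embedding purely for demonstration; a real database, such as the one NXP's tool produces, would use a proper sentence encoder and keep the index on the device.

```python
import numpy as np

DIM = 256

def embed(text: str) -> np.ndarray:
    # Toy embedding: hashed bag of words, normalized to unit length.
    vec = np.zeros(DIM)
    for word in text.lower().split():
        vec[hash(word) % DIM] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

documents = [
    "Pump 4 must be inspected after any spill on the factory floor.",
    "Visitors require badges in the assembly area.",
]
index = np.stack([embed(d) for d in documents])     # built offline, stays on-device

def retrieve(query: str, top_k: int = 1) -> list[str]:
    scores = index @ embed(query)                   # cosine similarity (unit vectors)
    return [documents[i] for i in np.argsort(scores)[::-1][:top_k]]

print(retrieve("what should I do about water on the floor?"))
```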

Speaker 2:

And finally, TTS. With text-to-speech we can convert the output text that the LLM generates into speech. The model we are using here is based on the VITS architecture; it has 19.5 million parameters and is quantized to 8 bits. The Pro version of eIQ GenAI Flow offers more voice selection options and also supports more languages, not only English. So with TTS we can synthesize the text that the LLM generates into natural speech, which gives a very seamless interaction for the end user.

Speaker 2:

And yeah, I'm just giving a very brief overview of what we have created with eIQ GenAI Flow, and we are very excited to start seeing what our customers can develop with these tools. So we invite you to go to our application code hub and look at what we are offering for eIQ GenAI. This is a demonstrator, so you can go and look at the information, replicate it, and start using it on your i.MX 95. Feel free to go and look at it and start exploring these new tools.

Speaker 2:

We have presented at some events, like Embedded World and CES, some reference designs that already utilize eIQ GenAI Flow. Here we have three of them: the AI-Enhanced eCockpit, the AI Controller for Health Insights PoC, and the Advanced Digital Connected Cluster. You can go and check all three of them out at the links on YouTube; they are amazing reference designs for our customers, and all of them are starting to utilize these GenAI capabilities running completely at the edge. All right, so that was it for the eIQ GenAI Flow. I know it's a lot of information in a very short time, but at the end we can have a session for quick questions. Now we are very excited to present our multimodal GenAI at the edge demo, which we have created in partnership with Kinara. Here we are going to show you our first demo that can run multimodal GenAI completely at the edge, with no connection to the cloud, which is very private, secure and efficient. This is made possible thanks to NXP's partnership with Kinara, a pioneering company for GenAI acceleration at the edge. They have the Ara-2 NPU, which allows us to deploy these big models at the edge with very efficient acceleration in terms of power consumption and performance for the entire application.

Speaker 2:

So, very briefly, I want to touch on what we are doing with Kinara. We are basically scaling up our existing portfolio of AI acceleration at NXP. We have the i.MX 8M Plus, which has an NPU that offers 2.3 TOPS. Then we also have the i.MX 93, which includes the NPU from Arm, a collaboration between NXP and Arm: the Ethos-U NPU with 0.5 TOPS. Those NPUs are very good for computer vision but are not intended to be used for GenAI.

Speaker 2:

Now, with the Kinara partnership, we can bring in a discrete NPU that allows us to enhance our portfolio of i.MX SoCs with these GenAI capabilities. Kinara's first-generation discrete NPU, Ara-1, offers 6 TOPS of compute power, and recently they released the second-generation Ara-2 NPU, which offers 48 TOPS. With this NPU we can now bring multimodal models and big LLMs to the edge and start creating smarter, more intelligent solutions for many of the different applications we can find in industry and other sectors. Utilizing this discrete NPU with our SoCs is very seamless: it's compatible with our vision pipeline, so you can have a fully accelerated pipeline with the input video stream from the camera using our ISPs, accelerate all the pre-processing and post-processing with our GPUs, and have these multimodal models accelerated by the discrete NPU from Kinara. With this, we are now allowing our customers to create very powerful applications.

Speaker 2:

But the beauty of this, we think, is that it's all running locally at the edge. There is no connection to the cloud, so all the data is kept very private. All right, so let me move to the next one. We are very excited to say that we presented this first demo to the public at Embedded World this past March, and it was a big success; we had a lot of engagement from our customers and the attendees at the event. Basically, when we bring these multimodal capabilities to the edge, we can start thinking of new applications, right? Like a system that is analyzing a scene with a camera and issuing reports every now and then, telling you what's going on in an industrial complex, or in your home if you have some cameras and want to know what's going on, or if at the end of the day you want a report of what happened in your house. With this you don't need a human just looking at something the entire day; you can have a very powerful system analyzing the scenes, all at the edge, without sharing this data with anonymous servers. The video is playing, sorry for that. Okay, so what did we do with this implementation?

Speaker 2:

So this demo utilizes the LLaVA model, which stands for Large Language and Vision Assistant. It's an openly available model that you can find in the open-source community, and it combines language and vision capabilities to understand and generate responses based on both text and image inputs. Internally, the model has two components: an LLM together with a vision encoder. The LLM is Llama 3 from Meta; it's open and has 8 billion parameters. That model is quantized to 4 bits and is fully accelerated by the Ara-2 NPU. The vision encoder is CLIP from OpenAI; it's also openly available and open source, and it has 428 million parameters. We quantized it to 8 bits, and it's also fully accelerated by the Ara-2 NPU. So altogether, LLaVA is running on the Ara-2 NPU, and this is made possible thanks to the 16 gigabytes of DDR memory on the discrete NPU fully dedicated to the models, which you can see in this image. The models are fully loaded into the DDR, and that's why we can use them very efficiently.

Speaker 2:

The performance of this model on the i.MX 8M Plus Freedom board with the Ara-2 NPU is the following: the CLIP model extracts the features of the image in around 430 milliseconds, so under half a second, very quickly. Then the Llama 3 model with 8 billion parameters generates around 6.5 tokens per second, and the time to first token, the time it takes to generate the first token, is seven seconds for the first image that you give it; then, if you cache the image and want to keep asking questions, within just 1.5 seconds you will start getting new tokens.
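
For a rough sense of end-to-end answer latency from those figures, you can combine the time to first token with the steady decode rate; the 50-token answer length below is just an assumed example.

```python
def answer_latency_s(num_tokens: int, ttft_s: float, tokens_per_s: float) -> float:
    # First token arrives after ttft_s; the rest stream at the decode rate.
    return ttft_s + (num_tokens - 1) / tokens_per_s

print(answer_latency_s(50, ttft_s=7.0, tokens_per_s=6.5))   # first question on a new image: ~14.5 s
print(answer_latency_s(50, ttft_s=1.5, tokens_per_s=6.5))   # follow-up on the cached image: ~9.0 s
```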

Speaker 2:

All right, so here we see a very high-level overview of what LLaVA is and how it works. First, we have the vision encoder, which takes the input image, extracts its features, and encodes those features in a language that the LLM can understand. Together with the input prompt, the LLM will start generating an output with the context of the image and the question that was given to the system. But not only that: in this demo we have also integrated the eIQ GenAI Flow technology. This means you can trigger the system with a wake word engine, just say "Hey, NXP", and it will start listening to the question you want to ask. It will then process the image together with the input prompt and generate a response about the image. Once the response is generated, the system synthesizes the text into speech, which results in a very seamless interaction: the user can very easily interact with the system, ask a question and get a spoken answer. We are very excited about this technology and this implementation. We think it's going to revolutionize what we have in the industry, and we know it's going to be a very solid reference design for our customers in the near future.
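
Structurally, a LLaVA-style pipeline is the composition sketched below: a vision encoder, a learned projection into the LLM's embedding space, and the language model itself. Everything here is a toy stand-in with tiny dimensions and a stubbed decoder, not the Kinara/NXP runtime described in the talk.

```python
import numpy as np

EMBED_DIM = 64   # illustrative size, far smaller than CLIP or Llama 3

def vision_encoder(image: np.ndarray) -> np.ndarray:
    # Stand-in for CLIP: produce one embedding per fake image patch.
    patches = image.reshape(-1, 16)
    return np.tanh((patches @ np.ones((16, EMBED_DIM))) * 0.01)

def project_to_llm_space(vision_embeddings: np.ndarray) -> np.ndarray:
    # Stand-in for the learned projection that aligns image and text tokens.
    projection = np.eye(EMBED_DIM)
    return vision_embeddings @ projection

def llm_generate(image_tokens: np.ndarray, prompt: str) -> str:
    # Stand-in for the 4-bit Llama 3 decoder: real code would prepend the
    # projected image tokens to the prompt tokens and decode autoregressively.
    return f"<answer conditioned on {len(image_tokens)} image tokens and '{prompt}'>"

image = np.random.rand(64, 64)                      # dummy grayscale image
tokens = project_to_llm_space(vision_encoder(image))
print(llm_generate(tokens, "Is there water on the floor?"))
```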

Speaker 2:

Here we can see a short video of this running on the i.MX 8M Plus Freedom board. Again, what you are seeing here is fully accelerated at the edge; there is no connection to the cloud. We are selecting an image and asking the system to tell us if there is water on the floor. This image is very challenging; even for humans it's hard to see the water. But we see that the LLaVA model is capable of detecting the water, and it tells us: yes, there is water on the floor, so it might be a spill or a leak and it needs attention to prevent any potential damage. We can think of this kind of application in industry, where we cannot afford to put a lot of sensors on the floor; with just one picture we can ask the system to describe what's going on and get a very accurate response.

Speaker 2:

And finally, I want to share a little bit of what we have with other models. Here we are showing Llama 2, fully accelerated by the Ara-2 NPU. It has 7 billion parameters, and we are getting a performance of around 17 tokens per second with a time to first token of only 1.25 seconds, so very fast responses. With this, together with RAG and the other tools from eIQ GenAI Flow, we can start building a lot of new, very smart, intelligent solutions that run privately, securely and efficiently at the edge. Last but not least, I want to invite you all to go and look at the Computex 2025 NXP keynote, where Jens Hinrichsen, executive VP at NXP, talks about what's going on with GenAI at the edge at NXP, but mostly about agentic AI, which we think is the next step going forward, and what we have presented here is the base for you to start building agentic AI solutions at the edge.

Speaker 2:

So, to summarize this very brief presentation that tried to cover a lot of topics and technologies from NXP: NXP is definitely transforming intelligent systems with generative AI at the edge. We utilize powerful NPUs like Neutron and Ara-2, which are of course available for our customers, and we also offer very modular frameworks, such as eIQ GenAI Flow, to allow our customers to easily start building and creating smart, intelligent solutions. The key benefit, we think, is that we can enable real-time, efficient and secure AI applications that run on edge devices, and this will enhance user experiences across a lot of diverse applications. NXP is committed to shaping the future of smart connected devices, and we are driving the evolution of application-ready AI at the edge. So thanks everyone. I know it was a lot of information, trying to cover a lot in just 25 minutes, but hopefully it gives you a perspective of what NXP is currently doing with GenAI. Thanks, everyone.

Speaker 1:

Thanks a lot, Alberto. Your talk really comprehensively explained all the NXP solutions for bringing GenAI to the edge. Congratulations; NXP is doing a lot of fantastic work. We're running out of time, so maybe one very quick question: generative AI and language models are a subject of deep quantization, and there are also approaches like 1-bit LLMs and the like. Do you think your solutions are ready for such extremely quantized models?

Speaker 2:

Yeah, well, definitely, I don't think we are ready for 1-bit quantization at this moment for this kind of application, but for LLMs we are using 4-bit quantization, and we know from our tests that the use cases we have shown are ready for creating very robust applications. So yeah, all of this is very new; there are a lot of new methods out there for optimizing these big models, and I know we are going to keep improving this in the future.

Speaker 2:

This is something that is moving forward very quickly, but for now we are very happy to see that with 4-bit quantization we can shrink these big models, fit them onto these discrete NPUs, and allow everyone to start building applications that are very private and still leverage all these GenAI technologies. For vision encoders like CLIP, we use 8 bits; we've seen that gives the best accuracy with this type of model. But for LLMs, we are definitely sticking with 4 bits for now.