EDGE AI POD

Bridging the Digital Divide by Generative AI through the Edge

EDGE AI FOUNDATION

The technological revolution sparked by generative AI threatens to create the deepest digital divide we've ever seen. In this illuminating talk, Danilo Pau from STMicroelectronics reveals how only a handful of companies worldwide possess the resources to fully harness large-scale generative AI, while the rest of humanity risks being left behind.

Pau takes us through the sobering reality of today's AI landscape: hyperparameterized models requiring nuclear power plants for training, hundreds of millions in costs, and worrying environmental impacts. But rather than accept this centralized future, he presents a compelling alternative path – bringing generative AI to edge devices.

Through a comprehensive survey of recent research, Pau demonstrates that generative AI is already running on edge devices ranging from smartphones to microcontrollers. His team's work with STMicroelectronics processors showcases practical implementations including style transfer, language models, and perhaps most impressively, an intelligent thermostat capable of natural language interaction with reasoning capabilities.

What emerges is a vision for AI not as another backend classifier but as a transformative interface between humans and machines. "GenAI is not for another detector," Pau explains. "We need to offer new added value" through natural interactions that understand context and can reason about the world.

For researchers and developers, this talk provides concrete pathways to explore: from audio processing as a "low-hanging fruit" to visual question answering systems that run on minimal hardware. The future of AI isn't just in massive data centers – it's in the devices all around us, waiting to be unleashed through energy-efficient processing and innovative approaches to model optimization.

Ready to join the movement bringing AI capabilities to everyone? Explore how edge-based generative AI could transform your products and help bridge the growing digital divide.

Learn more about the EDGE AI FOUNDATION - edgeaifoundation.org

Speaker 1:

Our first speaker is someone very special, not just to this forum but to me personally. Danilo Pau from STMicroelectronics is co-chairing this working group and forum alongside me, and I've had the real pleasure of working closely, very closely, with him over the past months. His insights and dedication, his spirit, his energy have been absolutely instrumental in shaping this event and this forum. Today he'll be introducing papers from a survey of edge GenAI research trends, devices and techniques. And, Danilo, the stage is yours.

Speaker 2:

Thank you so much, dear Ajara. It's really very nice the way you introduced me, and, yeah, I feel at home when I'm with all of you, as I do all the time in the working group. So let me share my screen and the presentation. During the presentation I cannot see the dashboard, so if I do anything wrong, please let me know. I assume the slides are visible.

My contribution for today is about a different perspective on generative AI, and I will also talk a little bit about agentic AI from a practical, example-driven point of view. And then there is a couple of words that really touch me, and these are "digital divide". So let me go through all of that. Why spend some of your precious time hearing me? Number one, because generative AI, and the gap in deploying it, is the biggest source of the digital divide, and I will talk a little bit about it. Second, I will touch on the state of the art of GenAI at the edge, trying to prove that this is already a reality, in the sense that there are lots of contributions already in place. I would also like to show some personal contributions on bringing GenAI to the edge using a couple of processors, the STM32MP2 and the STM32N6, which are initial demonstrations of our capability to go in this direction. And then there is something I really love: why use GenAI at the edge at all? I mean, we have spent many years bringing conventional AI into products, since 2018 and even before, and now there is this new wave, and that clearly creates difficulties for everyone.

But let me first say, from my personal point of view, and that's my opinion, that GenAI is the source of a digital divide, in the sense that the digital divide in 2000 divided humankind into people who had access to the PC and people who didn't have this opportunity. Well, generative AI is orders of magnitude greater in terms of its impact on the digital divide.

Speaker 2:

Few companies around the world, I would say five, can master it: they have the assets, technical assets, technological assets, financial assets, and they have the operations in place to handle it. Everyone else is out of it. And why? Because, you know, the turning point was 2019. Until 2019 there was a memory opportunity: the gap between the memory capacity of GPUs and the parameter counts of fixed-function AI models was an opportunity, because people could devise more and more complex workloads, comforted by their deployability. On GPUs there was enough memory, there was enough computational power. But from 2019 this trend ran into the memory wall, and the memory wall is an orders-of-magnitude gap between hyperparameterized models, trillions of weights or even more, not to speak of the data, and the capability of the GPU. You may know that the latest from Nvidia is Grace Hopper.

Speaker 2:

It's a very brilliant piece of technology, but you need thousands, I would say millions, of them to build up a computing center able to manage these kinds of workloads, and that's why only a few companies can manage them and can afford it, but everybody else cannot. And this happened in just 13 years, with the transition starting in 2019. The consequence is that GenAI drives the cloud. What does that mean? You need hundreds of trillions of tokens to build up a dataset to train such workloads, which means that the few players who can master it need gigawatts, nuclear reactors, to power their compute centers, and there is a clear sign that the majors are investing hundreds of billions of dollars, even bringing back closed nuclear plants like Three Mile Island, to generate enough energy. The amount of money needed to train these workloads is hundreds of millions of dollars. Who can afford this cost? And by using nuclear reactors, the impact on natural resources is huge: 2.5 liters of water per kilowatt-hour, just to manage the uranium all the way through the process of feeding the nuclear reactors. And GenAI will generate, you know, millions of tons of electronic waste over the next five years. CO2 will increase; it's ironic when we speak of sustainability, but we are looking at a double-digit increase in CO2.

Speaker 2:

And how can we think that GenAI services for the masses can scale to everyone? Last year we had 5.16 billion users across Android and iOS. How can all these people get access simultaneously to any service that fits their needs? Clearly, yes, maybe it's possible, but at the cost of several tens of thousands of compute servers, and therefore it's like saying goodbye to research in universities. It's a farewell to deployment and accessibility for small and medium enterprises, the ones that are today enjoying fixed-function AI, conventional AI; there is a huge bottleneck in the accessibility of this technology for all of them. And, after all, it's like what happened in 2017, when fixed-function AI was on the cloud. Everybody was speaking about smart sensors that, essentially, were streaming data to the cloud, where unlimited assets were available. That trend didn't last long because, fortunately, this community was built in 2018, and within a few years, I would say five, products were available: standard microcontrollers at the beginning, and now single chips with hardware acceleration, with neural processing units, and clearly in the coming years there will be a huge proliferation of fixed-function AI. Fortunately, to cope with the cloud gap due to generative AI, the tinyML Foundation evolved into what is today the EDGE AI Foundation, and this created a new hope for all of us.

Speaker 2:

So let me elaborate a little bit. Is this something new, or have there been brilliant examples in the last few years? Obviously yes. In fact, Gloria, a student of mine, and I did a review of the state of the art of GenAI at the edge over the last three years, 2022 to 2024, covering people who designed and published generative models on edge devices. Not just claiming that this is a possibility: these authors not only claimed it but proved that their models were deployable on edge devices, so it was possible to analyze their performance.

Speaker 2:

Essentially, we reviewed 135 selected papers, and only 66 were in scope, in the sense that they fulfilled the requirement of proving that GenAI workloads were deployable on embedded devices. A little bit of statistics: 20% were journal papers, 63% conference papers and 17% preprints; 48% from IEEE, 17% from arXiv, 20% from Springer, 3% Elsevier, 5% other, 3% ACM and 3% AAAI. So you see, these are people who put their names to written scientific papers that have mostly been peer-reviewed, I would say the majority, so this is serious stuff. What emerged is that optimization techniques have been used to reduce the cost, for example quantization; there are people who achieved one-bit LLMs. Knowledge distillation is the proof that foundational models are very useful for deriving hypo-parameterized models for GenAI at the edge, together with pruning techniques. So optimizations are already there, they are known, and they are available in many tools already used to design neural networks. Optimization techniques are mature, and everybody can use them, starting from a foundational model, to achieve different types of use cases.

Speaker 2:

The tasks in this survey were essentially: more than 45% visual processing; 5% to 10% each for image generation, small language models, text-to-speech, style transfer and visual question answering; and the remaining 5% face swapping, speech enhancement, captioning, translation and so on. The visual processing papers were about super resolution, image enhancement, signal processing, denoising, restoration, inpainting and bokeh rendering, with inference times from a few milliseconds to a couple of seconds. All but three models were deployed on a smartphone, so the smartphone emerged as a fundamental asset. For image generation, the processors were the A17, A18, A16, A15, Exynos and Snapdragon; these are the few processors that allowed people to experiment with image generation, and models ranged from 400 million up to around a billion parameters. Small language models, once again on smartphones, utilized transformer architectures with between 125 million and a little under 4 billion parameters.

Speaker 2:

Quantization was used massively. And then visual question answering, which I like very much: I would like to mention especially TinyVQA, from many authors, in particular the team of Professor Tinoosh Mohsenin, whom I know very well, who achieved it on GAP8. GAP8 is a very low-power processor, so with knowledge distillation and 8-bit integer quantization they proved that VQA can be achieved on low-power processors. And then text-to-speech. For text-to-speech a fundamental metric is the real-time factor, the ratio between processing time and audio length. We had end-to-end approaches like Nix-TTS and two-stage approaches, and RTF ranged between 1/100 and close to 2. People also designed vocoders and implemented them on Raspberry Pi and even on microcontrollers.
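As a concrete illustration of the metric (numbers chosen here for clarity, not taken from the survey): RTF = processing time / audio length, so a TTS model that needs 2 seconds to synthesize a 10-second clip runs at RTF = 0.2, five times faster than real time, while an RTF above 1 means the system cannot keep up with real-time playback.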

Speaker 2:

There were other approaches, mostly deployed on Jetson, on smartphones, on Raspberry Pi and some also on microcontrollers, which is good. A bit of statistics about the deployment devices, the edge devices: fewer than 10% used microcontrollers. This suggests that microcontrollers will perhaps have a marginal role in GenAI at the edge, so let's face it. Me personally, I tried with some colleagues to deploy some GenAI tasks on the STM32N6, which also embodies an integrated neural processing unit at three TOPS per watt, for style transfer, and others on the STM32MP2, achieving some functional implementations of image generation, SLM, TTS and the like. Ten percent were on Jetson, and let's remember that this is 15 to 50 watts, off the shelf obviously, and the majority, more than 70%, were on application processors mastered by Qualcomm, MediaTek, Apple, Samsung, HiSilicon and Google.

So let me go into a little more detail on what we did on our processors. In partnership with Fondazione Bruno Kessler, I had the privilege of cooperating with Alberto Ancilotto and Elisabetta Farella. They developed a style transfer: essentially, do you want your face to be painted as a masterpiece? With it we would like to achieve anonymization, personalization and content adaptation. Then a colleague of mine implemented it on the STM32N6, leveraging the neural processing unit, at 10 frames per second, and that's real-time processing. We also brought this demo to Embedded World this year; I don't know who was there, but it was possible to see it.

Another project I started was to compile on the Arm Cortex-A35 at 1.5 gigahertz. Our product is called STM32MP2, and it supports OpenSTLinux. So I compiled Moonshine, the Useful Sensors speech-to-text model from Pete Warden, an amazing model which is much better than Whisper, but I also compiled Whisper for some comparisons. And then all the C++ projects: Llama 2, llama.cpp, Mamba, LLaVA and Ollama. I quite recommend Ollama because it's easy to manage and gives you access to many foundational models: DeepSeek, Gemma, Mistral, Granite and so on. I also had some experience with Stable Diffusion. By the way, don't use Stable Diffusion 1.5: there is a brilliant article published by IEEE which strongly advises against it, because the material that was used to train Stable Diffusion 1.5 is a bit controversial.

So I compiled a few demonstrations as they are. The first is Ollama with Qwen 0.5 billion, and I'm just chatting with this Alibaba model at 2.4 tokens per second. Not that much, but it is an initial step. The second is the 1-bit LLM, BitNet from Microsoft; that's a brilliant project, and this too I got running on the MP2 with some initial performance numbers. By the way, if there are Arm people here, please consider accelerating 1-bit operations in your instruction set, because we are using the Cortex-A35 and we need 1-bit instruction acceleration. This gives enormous acceleration, 32 times, and a 1-bit multiply-accumulate is so cheap in hardware, just an XOR and a bit count, that everybody can implement it. So please do that.
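As a minimal sketch of that 1-bit trick (not the BitNet or ST implementation, just an illustration): a dot product between two vectors binarized to {-1, +1} and packed into machine words reduces to an XOR followed by a population count.

    # Illustrative sketch only: 1-bit dot product via XOR + popcount.
    # Values are binarized to {-1, +1}; bit 1 encodes +1, bit 0 encodes -1.

    def pack_bits(values):
        """Pack a sequence of +1/-1 values into the bits of an integer."""
        word = 0
        for i, v in enumerate(values):
            if v > 0:
                word |= 1 << i
        return word

    def binary_dot(a_bits, b_bits, n):
        """Dot product of two packed {-1, +1} vectors of length n.
        Matching bits contribute +1 and differing bits contribute -1,
        so the result is n - 2 * popcount(a XOR b)."""
        return n - 2 * bin(a_bits ^ b_bits).count("1")

    a = [+1, -1, +1, +1, -1, -1, +1, -1]
    b = [+1, +1, -1, +1, -1, +1, +1, -1]
    assert binary_dot(pack_bits(a), pack_bits(b), len(a)) == sum(x * y for x, y in zip(a, b))

In hardware, the XOR and the bit count each act on a whole register of packed weights at once, which is the kind of gain alluded to when 32 one-bit multiply-accumulates fit into a single 32-bit word.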

Speaker 2:

Then I moved on to reasoning, and I tried Ollama with DeepSeek 1.5 billion against logical reasoning questions with answers, the kind asked of freshers, and it actually proved to be accurate; I also posted some demonstrations on LinkedIn. But all of that for what? Why are we so passionate about generative AI? Here is my answer: because we need it for a natural user-machine interface. GenAI is not for yet another detector; it is not another classifier. People would not understand that; they have already invested so much time there that they cannot go back. So we need to offer a new added value. How to substantiate that?

Speaker 2:

Together with a student, we prototyped an interesting thermostat, a consumer thermostat, and the natural interaction, the way to use it, was to speak with it, let it reason depending on the query, and let the thermostat speak back to you. It was implemented on a Raspberry Pi. Why on a Raspberry Pi? It is off the shelf, a brilliant Linux system with a lot of projects available, and it offers nice performance, though not as nice as real time would require. So please, Rosina, if you could kindly share the demo, the video demo. "Can you warm up the room a bit? I feel cold in the house."

Speaker 3:

Your room will start to feel warmer as I slowly increase the temperature by 2 degrees. I aim to reach a comfortable reference temperature based on today's forecast, ensuring we don't overheat. Would you like to provide any additional preferences, or should I proceed with these settings?

Speaker 2:

I'm out in a couple of hours and I'll be back tomorrow morning. I want the home to be warm when I'm back, but I don't need to heat the house when I'm away.

Speaker 2:

Okay, okay. Maybe, Rosina, can I take back control and go back to my presentation?

Speaker 1:

Yeah, you can move on. Those are your slides, Danilo.

Speaker 2:

Thank you so much, Rosina, very kind. So let me explain what's going on. First of all, we were speaking into the mic; the Raspberry Pi was running Moonshine from Useful Sensors, out of the box, for transcription, and then Granite 3.3. Why Granite 3.3, through Ollama? Because it has interesting reasoning capability, which means that after the transcription happened, Granite decided to launch a query on the web to find the temperature outside my room, so that it could know not just the in-room temperature but the outside temperature, decide the gradient by which to warm the room, and then report that back to me and let me continue the discussion.

Speaker 2:

For text-to-speech we used Piper, which is also quite a tiny model, just a few million parameters. That's the pipeline: Moonshine, then Granite 3.3, then Piper. This comes with the complexity profiling: Granite requires 2.36 gigabytes of memory, with a throughput of only a few tokens per second, and the majority of the complexity was there.
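As a rough sketch of how such a pipeline can be wired together on a Linux board (package names, model tags and file paths below are assumptions for illustration, not the actual demo code):

    # Illustrative pipeline only: speech -> reasoning -> speech, in the spirit
    # of the thermostat demo. Package names, model tags and voices are assumed.
    import subprocess

    import moonshine   # from the useful-moonshine package (assumed import name and API)
    import ollama      # client for a locally running Ollama server

    def ask_thermostat(wav_path: str, out_wav: str = "reply.wav") -> str:
        # 1) Speech-to-text with Moonshine (tiny on-device ASR).
        text = moonshine.transcribe(wav_path, "moonshine/base")[0]

        # 2) Reasoning with a Granite 3.3 model served locally by Ollama.
        reply = ollama.chat(
            model="granite3.3",  # assumed model tag
            messages=[
                {"role": "system",
                 "content": "You are a home thermostat. Reason about the request "
                            "and describe the heating action you will take."},
                {"role": "user", "content": text},
            ],
        )["message"]["content"]

        # 3) Text-to-speech with the Piper CLI: text on stdin, WAV file out.
        subprocess.run(
            ["piper", "--model", "en_US-lessac-medium.onnx",
             "--output_file", out_wav],
            input=reply.encode(), check=True,
        )
        return reply

    print(ask_thermostat("request.wav"))

In this arrangement the local reasoning step is the one that dominates memory and latency, which is consistent with the profiling figures quoted above.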

Speaker 2:

Moonshine: real-time factor out of the box, without optimization, 0.2; execution time, 0.5% of the total; 65 megabytes of RAM. Clearly Granite dominates. And then Piper: RTF of 0.13, just 1% of the overall profiling, and 185 megabytes of RAM. So clearly these use cases require lots of RAM, and this is due to Granite, but it is Granite with reasoning capability. So clearly there is a need for energy-efficient hardware acceleration; I think that is beyond discussion. So let me conclude, and apologies if I took too much time. Generative AI implementations have clearly been appearing on edge devices over the last three years, 2022 to 2024, even before the Foundation decided to address GenAI with our working group.

Speaker 2:

So this is established, let me say, and some use cases, even if a minority were meant for microcontrollers, are still possible on off-the-shelf devices like the STM32N6 and the STM32MP2. Fortunately, the community has made available lots of foundational models, and these are brilliant because, to derive the students, we need teachers; with optimization techniques, which are well established, including hyperparameter optimization, we can derive such models. And then it's clear that generative AI, the LLM, is just a step. Really, the milestone is the LLM with reasoning capability: more complexity, more memory, clearly, but it enables true human-machine interaction at an unprecedented level of interactivity. In my opinion, HMI has a new opportunity, which is to take advantage of the agentic wave, and we need to push forward accelerators with unprecedented energy efficiency, and when I say unprecedented energy efficiency, I mean hundreds of TOPS per watt at least. So thank you so much for your attention; for any questions, don't hesitate to ask now or just to write to me. Thank you.

Speaker 1:

Thank you. Thank you so much, Danilo, for the thoughtful and data-driven overview, and also for the demos; I loved them. It's very courageous of you to show them live. In particular there is this remark, two to three tokens per second, which I'm going to come back to later. But it's clear that the field of edge GenAI is maturing quickly, right, Danilo? And of course this survey helps many of us map where we've been and where the opportunities lie. So I have two questions from the audience. The first question is by F Wang: what development environments would you recommend for deploying models on microcontrollers?

Speaker 2:

Okay, well, it depends on which kind of models we are referring to. For me, the models can be divided into fixed-function AI and agentic AI. In the second case, in my, I would say, initial experiments, the capability to use a strong and stable embedded Linux is fundamental. We deliver OpenSTLinux, but the Raspberry Pi was also fundamental because, for example, take Moonshine: you need uv, and then Moonshine was brilliantly engineered, you need some libraries to be installed, and therefore the support of Linux and of the other Python libraries is fundamental.

Speaker 2:

So Python abstraction and Linux, open Linux, I think are fundamental there. For fixed-function AI, actually, solutions like ours, which are based on the ST Edge AI Core technology, do not need an RTOS. We just need a strong IDE with the capability to compile C code, with a strong API, a few APIs to allow application integration. And ST makes available not only the ST Edge AI Core technology but also other kinds of tools, depending on the product: just quickly, CubeMX for the STM32 microcontrollers, Stellar Studio for the automotive microcontrollers, MEMS Studio for the sensors, and the Developer Cloud, just to have a try.

Speaker 1:

Thank you. Thank you, Danilo. Another question, from a LinkedIn user: where do you see the bottleneck breaking through? Do you think it will come from community model and software improvements first, from ST architecture improvements, or from Arm NPU capabilities, or all three?

Speaker 2:

I am for open innovation, so I need innovation, and the innovation I would really love to see is processors that set themselves the ambitious goal of reaching the energy efficiency of the human brain. So, wherever it comes from, even a small lab, I don't know where on the planet, I think this is what we need. We are today at hundreds of TOPS per watt, but that is with in-memory computing, as published at ISSCC; our papers are there. We need more. That's an initial start, and, you know, it will come through open innovation; I cannot imagine another way to progress than this.

Speaker 1:

Thanks, Danilo. Maybe one last question from me. Given the explosion of interest in edge GenAI, what key areas of research or experimentation would you encourage young researchers, for example, to focus on next? Because I see many young researchers here. What do you recommend?

Speaker 2:

Well, about generative AI, definitely, I was touched by the paper from Tinoosh's group, the visual question answering one, and also, you know, by what Pete Warden said yesterday in his demo: focus on audio. I think audio may be the low-hanging fruit for generative AI; for VQA, obviously, the complexity is higher. So those are the two things, but absolutely with reasoning capability. I think it will be more complex, but we need it. I think we are just at the beginning of LLMs with reasoning capability, and what I saw is really intricate, it is complex, it is not linear in terms of reasoning, so there is plenty of progress to be made. In terms of fixed-function AI, I'm still in love with on-device learning.