EDGE AI POD

Applying GenAI to Mice Monitoring

EDGE AI FOUNDATION

The AI revolution isn't just for tech giants with unlimited computing resources. Small and medium enterprises represent a crucial frontier for edge generative AI adoption, but they face unique challenges when implementing these technologies. This fascinating exploration takes us into an unexpected application: smart laboratory mouse cages enhanced with generative AI.

Laboratory mice represent valuable assets in pharmaceutical research, with their welfare being a top priority. While fixed-function AI already monitors basic conditions like water and food availability through camera systems, the next evolution requires predicting animal behavior and intentions. By analyzing just 16 frames of VGA-resolution video, this edge-based system can predict a mouse's next actions, potentially protecting animals from harm when human intervention isn't immediately possible due to clean-room protocols.

The technical journey demonstrates how generative AI can be scaled appropriately for edge devices. Starting with a 240-million parameter model (far smaller than headline-grabbing LLMs), the team optimized to 170 million parameters while actually improving accuracy. Running on a Raspberry Pi 5 without hardware acceleration, the system achieves inference times under 300 milliseconds – and could potentially reach real-time performance (30ms) with specialized hardware. The pipeline combines three generative neural networks: a video encoder, an OPT transformer, and a text-to-speech component for natural interaction.

This case study provides valuable insights for anyone looking to implement edge generative AI in resource-constrained environments. While currently limited to monitoring single mice, the approach demonstrates that meaningful AI applications don't require supercomputers or billion-parameter models – opening doors for businesses of all sizes to harness generative AI's potential.

Learn more about the EDGE AI FOUNDATION - edgeaifoundation.org

Speaker 1:

There are three aspects of this talk that I would like to suggest you consider. First, what edge generative AI can be useful for; I will focus on only one use case. This use case is representative because behind it there is a problem affecting the operations of some small or even medium enterprises. The next topic is how to generate data in a way that the problem can be shaped and captured properly. And the third is, once the algorithm, the pipeline, is shaped, on which edge processor to deploy it and what type of features these edge processors should offer.

Essentially, generative AI is today dominated by large enterprises, while fixed-function AI is what we have been working on in this community since 2018, 2019. So imagine being a small or medium enterprise which took years to get familiar with fixed-function AI, detectors and the like, and now generative AI comes into the picture. That can easily be confusing, and not for large enterprises: for small and medium enterprises. That's what this is all about. Okay, it's going to be a long talk, so next slide, please.

The business is not large and it is quite flat; it will grow slowly over the next five years. So from a money point of view you may not find it very interesting, but still, these are small and medium enterprises, and if we want to scale generative AI we need to care about this type of company. You may like or may not like the problem, but the point is how to get them into the picture of edge generative AI.

One of these companies already developed fixed-function AI. On which type of target? Well, these are cages. The cages host a small number of mice, they are used for research and experimentation, and the trend is to sensorize the cage with imagers and microcontrollers. To achieve what? Fixed-function AI, which means we are focused on low-power processing, on analyzing the data locally and providing some assessment. What type of assessment? That's the dashboard: different cameras capture different views of the cage, and the exercise is to understand how much water is available, how much food is available and whether water is present in the cage, because these companies are very concerned about the wellness of the mice; it is truly at the top of their priority. So fixed-function AI is used there, classifiers and even detectors.

Also, lighting conditions may be very poor, because these cages sit in a rack. A rack typically holds 90 to 100 cages, and the cages are placed in clean rooms, because no contaminant should affect the life of these animals, and the value of these mice increases as the big pharma experimentation proceeds. So that's fixed-function AI: you may use your kind of YOLO or convolutional neural network to classify and so on.

Speaker 1:

But then why do we need generative AI? Because we need to move to the next level: process the information such that we can assess past information and create new information. What type of information? Information that enables a natural machine-interaction kind of application. You may smile: what does it mean to interact in a natural way with mice? Well, with their behavior. Yes, that's the evolution, because fixed-function AI is already in place, so we need to move forward, always keeping their wellness in mind.

Speaker 1:

Here, energy-efficient processing is the key. So what is the problem? Essentially, I want to assess the past visual information over a very short window, for example 20 frames, and then predict the next intent of the mice, such that I can predict their behavior. If I can predict their behavior, and these animals may behave strangely, especially when many share the same cage, I can protect them. Predicting this is important because you cannot enter the clean room at any point in time: you have to gown up and be fully decontaminated. So this type of processing becomes important. VGA resolution is enough, the frame rate is four frames per second, and we had to work on dataset creation. In this case we captured 3.6 million frames, and these are examples of the frames from the detection; the mice can really be placed anywhere. Here we simplified the problem to one cage with one mouse.
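
A minimal sketch of what this capture setup might look like, assuming an OpenCV camera source; the 16-frame window, 4 fps rate and VGA size come from the talk, everything else is illustrative:

```python
# Illustrative sketch (not the team's code): keep a rolling window of the last
# 16 VGA frames sampled at roughly 4 fps, ready to feed to the intent predictor.
import time
from collections import deque

import cv2  # pip install opencv-python

WINDOW = 16          # number of past frames the model looks at
FPS = 4              # frame rate mentioned in the talk
SIZE = (640, 480)    # VGA resolution

def capture_window(camera_index: int = 0) -> list:
    """Return the 16 most recent VGA frames from the cage camera."""
    cap = cv2.VideoCapture(camera_index)
    frames = deque(maxlen=WINDOW)
    try:
        while len(frames) < WINDOW:
            ok, frame = cap.read()
            if not ok:
                raise RuntimeError("camera read failed")
            frames.append(cv2.resize(frame, SIZE))
            time.sleep(1.0 / FPS)  # approximate 4 frames per second
    finally:
        cap.release()
    return list(frames)
```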

Speaker 1:

Now, this is the pipeline to achieve labeling, because all these images need to be labeled. Imagine that on top I have a textual query. The pipeline is composed of two parts. The bottom part creates a handcrafted representation: we use the standard deviation to understand the type of motion present in the past 16 frames, we detect the level of motion, we also detect whether the mouse is sleeping, and then we create a first one-hot encoding. The upper part of the pipeline instead uses a state-of-the-art algorithm such as Video-LLaVA, which generates draft labels; these are combined with the encoding coming from the handcrafted branch and then further processed by Llama 3. At the end, all the images have been labeled, so now I have these 3.6 million frames with labels.
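
A minimal sketch of what the handcrafted branch could look like: the temporal standard deviation over the 16-frame window is thresholded into a coarse motion/sleep class and one-hot encoded. The thresholds and class names are assumptions, not the team's values:

```python
# Illustrative handcrafted motion encoding: temporal std-dev over the window,
# thresholded into "sleeping" / "low_motion" / "high_motion" and one-hot encoded.
import numpy as np

MOTION_CLASSES = ["sleeping", "low_motion", "high_motion"]

def motion_one_hot(frames: np.ndarray,
                   sleep_thr: float = 2.0,
                   high_thr: float = 12.0) -> np.ndarray:
    """frames: (16, H, W) grayscale uint8 array -> one-hot motion encoding."""
    temporal_std = frames.astype(np.float32).std(axis=0).mean()
    if temporal_std < sleep_thr:
        label = "sleeping"
    elif temporal_std < high_thr:
        label = "low_motion"
    else:
        label = "high_motion"
    one_hot = np.zeros(len(MOTION_CLASSES), dtype=np.float32)
    one_hot[MOTION_CLASSES.index(label)] = 1.0
    return one_hot
```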

Speaker 1:

So what type of pipeline are we devising? The baseline is called video-to-action (V2A). Video-to-action is composed of three parts: VideoMAE, which is state-of-the-art, 114 million parameters; an adapter, which is a fully connected layer, 0.6 million parameters; and then OPT, the Open Pre-trained Transformer, 125 million parameters. We feed in the 16 past frames and the intents are predicted out of this pipeline.
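
A hedged PyTorch sketch of this wiring: a video encoder produces visual tokens, a fully connected adapter projects them into the language-model embedding space, and OPT decodes the predicted intent. The class, method signatures and dimensions are assumptions, not the actual implementation:

```python
# Illustrative wiring of the video-to-action pipeline described in the talk.
import torch
import torch.nn as nn

class VideoToAction(nn.Module):
    def __init__(self, video_encoder: nn.Module, opt_lm: nn.Module,
                 vis_dim: int = 768, lm_dim: int = 768):
        super().__init__()
        self.video_encoder = video_encoder          # ~114M parameters in the talk
        self.adapter = nn.Linear(vis_dim, lm_dim)   # ~0.6M parameters
        self.opt_lm = opt_lm                        # OPT-125M causal LM

    def forward(self, frames: torch.Tensor, prompt_embeds: torch.Tensor):
        # frames: (batch, 16, 3, H, W) -- the 16 past frames
        vis_tokens = self.video_encoder(frames)      # (batch, T, vis_dim)
        vis_embeds = self.adapter(vis_tokens)        # project into LM space
        inputs = torch.cat([vis_embeds, prompt_embeds], dim=1)
        # Assumes a Hugging Face-style causal LM accepting inputs_embeds.
        return self.opt_lm(inputs_embeds=inputs)     # predicted intent logits
```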

Speaker 1:

Now, one exercise was to decrease the complexity of the pipeline. By the way, you see the order of magnitude: it's essentially 230 to 240 million parameters. We don't need billions of parameters or more; that's an example of the scale we are looking for in edge generative AI. Then we simply stripped some parts of VideoMAE, saving 62% of its parameters, and achieved something like 170 million parameters in total.
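
As a rough illustration only, this sketch strips the tail blocks of a dummy encoder and reports the parameter savings; which blocks the team actually removed, and how many, is not specified here:

```python
# Illustrative slimming: drop the last transformer blocks of an encoder and
# check how many parameters that saves. The real selection of what to strip
# is an assumption in this sketch.
import torch.nn as nn

def count_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

def strip_tail_blocks(blocks: nn.ModuleList, keep: int) -> nn.ModuleList:
    """Keep only the first `keep` blocks of the encoder."""
    return nn.ModuleList(list(blocks)[:keep])

# Dummy blocks standing in for the encoder layers:
dummy_blocks = nn.ModuleList([nn.Linear(768, 768) for _ in range(12)])
before = count_params(dummy_blocks)
after = count_params(strip_tail_blocks(dummy_blocks, keep=5))
print(f"saved {(1 - after / before) * 100:.0f}% of the encoder parameters")
```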

Speaker 1:

So there were two versions, and the training was composed of three steps; essentially it was incremental training. In the first step we trained VideoMAE and the fully connected layer while OPT was frozen. In the second step we used parameter-efficient fine-tuning to fine-tune OPT. And in the third step there was a joint adjustment of all the weights of the three networks to achieve the final result. That was not very time-consuming because we used the Hugging Face tooling to run this training, so I didn't need an AI supercluster or a super-powerful supercomputer to run these things. Once again, from the perspective of a small or medium enterprise, these companies cannot afford a Corvex or Colosseum supercomputer; there is simply no budget for that.
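
A hedged sketch of these three stages, using the Hugging Face peft library for the parameter-efficient step; the LoRA rank, alpha and target modules are illustrative assumptions, not the values used by the team:

```python
# Illustrative three-stage incremental training schedule.
import torch.nn as nn
from peft import LoraConfig, get_peft_model  # pip install peft

def stage1_freeze_opt(opt_lm: nn.Module) -> None:
    # Stage 1: train the video encoder and the adapter; keep OPT frozen.
    for p in opt_lm.parameters():
        p.requires_grad = False

def stage2_peft_opt(opt_lm):
    # Stage 2: parameter-efficient fine-tuning of OPT (LoRA adapters only).
    cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
    return get_peft_model(opt_lm, cfg)  # expects a Hugging Face OPT model

def stage3_joint(*modules: nn.Module) -> None:
    # Stage 3: joint fine-tuning of all weights across the three networks.
    for m in modules:
        for p in m.parameters():
            p.requires_grad = True
```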

Speaker 1:

So these are the performances. We went down 29% in terms of trainable parameters, the accuracy of the tiny V2A improved by 1.86% in validation and testing, and the perplexity decreased by 37%. These are the key parameters to validate and assess the quality of the neural network.
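
For reference, perplexity is derived from the mean token-level cross-entropy of the language model; a minimal sketch of the standard definition (not code from the talk):

```python
# Perplexity is the exponential of the mean token-level cross-entropy (in nats).
import math

def perplexity(mean_cross_entropy_nats: float) -> float:
    return math.exp(mean_cross_entropy_nats)

# For example, a mean cross-entropy of about 0.97 nats corresponds to a
# perplexity of roughly 2.64, the figure reported later in the talk.
print(perplexity(0.97))
```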

Speaker 1:

And then, what type of target did we use? Well, I'm from ST, but it was not an ST processor, not for the moment. It was a Raspberry Pi, a Pi 4 and a Pi 5, obviously: a 2.4 GHz quad-core processor with a lot of LPDDR DRAM. That's not a microcontroller, it's a multicore processor. So that's the type of architecture we are looking at. Think about it.

Speaker 1:

I also did a survey of implementations from recent years, and the multi-processing unit dominates; this is not yet the space for microcontrollers, there were very few examples. The best results we obtained were lower than 300 milliseconds per inference, which is not bad. Hardware acceleration for this, to my knowledge, doesn't exist at the moment. When and if it exists, we will certainly go down at least one order of magnitude. You know what that means? 30 milliseconds, which means real time. So we are looking to see someone in this community develop such devices. The memory footprint is huge, 130, but that's because we use the ONNX Runtime. If you strip down the ONNX Runtime you can certainly lower the amount of RAM in a massive way, and lowering the RAM plus hardware acceleration will certainly make the implementation more energy efficient. So, the conclusions. That's the pipeline, okay, you understood.
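
A minimal sketch of timing a single inference with ONNX Runtime on the Pi's CPU; the model filename and input shape are assumptions:

```python
# Illustrative timing of one inference with ONNX Runtime (CPU only).
import time

import numpy as np
import onnxruntime as ort  # pip install onnxruntime

session = ort.InferenceSession("video_to_action.onnx",  # hypothetical export
                               providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name
frames = np.random.rand(1, 16, 3, 224, 224).astype(np.float32)  # assumed shape

start = time.perf_counter()
outputs = session.run(None, {input_name: frames})
print(f"inference took {(time.perf_counter() - start) * 1e3:.1f} ms")
```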

Speaker 1:

And how do we make the interaction natural? At least one way is using text-to-speech. What kind of text-to-speech? There is a nice project on GitHub called Piper. Essentially, the processing-time penalty relative to the length of the audio is 40%, which is nice. It's a 15 million parameter generative neural network. So in the end you see that three out of four neural networks in this pipeline are generative; the fully connected layer is just an adapter to align the representations between the two neural networks. And we got less than 300 milliseconds on the Raspberry Pi 5.
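
A minimal sketch of driving Piper from Python through its command-line interface; the voice model filename and the spoken sentence are assumptions:

```python
# Illustrative use of the Piper TTS CLI: text comes in on stdin, audio goes to a WAV file.
import subprocess

def speak(text: str, wav_path: str = "intent.wav") -> None:
    """Render a predicted intent as speech with Piper (expects `piper` on PATH)."""
    subprocess.run(
        ["piper", "--model", "en_US-lessac-medium.onnx", "--output_file", wav_path],
        input=text.encode("utf-8"),
        check=True,
    )

speak("The mouse is likely to approach the water dispenser next.")
```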

Speaker 1:

Perplexity is 2.64. So I think it can fly, and in the future it will fly even better. What are the limitations of this work? One mouse, one cage. Obviously, in a cage usually three to five mice live together; they have more complex behavior and their own complex language. So maybe this application is not huge in terms of business, but technology-wise I hope it gives you a sense of the complexity behind it and of the opportunity, which certainly represents one example of a possible deployment of edge generative AI, especially from the perspective of small and medium enterprises, which are of paramount importance to reach. And with that I'm done; sorry for the slide issues, I hope it was smooth.