EDGE AI POD
Discover the cutting-edge world of energy-efficient machine learning, edge AI, hardware accelerators, software algorithms, and real-world use cases with this podcast feed covering all things from the world's largest EDGE AI community.
These are shows like EDGE AI Talks and EDGE AI Blueprints, as well as EDGE AI FOUNDATION event talks on a range of research, product, and business topics.
Join us to stay informed and inspired!
Real World Deployment and Industry Applications
The humble printer - that device gathering dust in the corner of your office - is about to undergo a remarkable transformation. Thanks to advancements in generative AI, printers and scanners are evolving from passive endpoints into intelligent document processing powerhouses.
Anirban from Wipro Limited unveils how visual language models (VLMs) like Qwen2.5-VL and LayoutLMv3 are being deployed directly on edge devices rather than in the cloud. This breakthrough approach addresses critical data privacy concerns while eliminating the need for continuous network connectivity - perfect for sensitive enterprise environments where document security is paramount.
These multimodal AI implementations enable remarkable capabilities that were previously impossible. Imagine a printer that can automatically extract complex tables from documents and convert them into visually appealing charts. Or one that can intelligently correct errors, translate content between languages, adapt layouts for visually impaired users, or even remove advertisements when printing web pages - all without sending your data to external servers.
The technical implementation involves clever optimizations to run these sophisticated models on relatively constrained hardware. Through techniques like 4-bit quantization, image downscaling, and leveraging NVIDIA's optimized libraries, these models can function effectively on devices with 16GB of GPU memory - bringing AI intelligence directly to the point where documents are produced.
While challenges remain in handling large documents and managing the thermal constraints of embedded devices, this technology marks the beginning of a new era in intelligent document processing. The days of printers as "dumb" input-output machines are numbered. The future belongs to intelligent endpoints that understand what they're printing and can transform it in ways that add tremendous value to users.
Imagine what your workflow could look like when your printer becomes your intelligent document assistant. The possibilities are just beginning to unfold.
Learn more about the EDGE AI FOUNDATION - edgeaifoundation.org
Hello, Anirban. Good to see you, Anirban, from Wipro Limited. Thanks for being with us, Anirban, and for contributing to the third forum of today. I think you are going to speak about industry applications and deployments of generative AI at the edge. So whenever you feel ready to share your presentation, please do so, and the floor is yours. Thank you.
Speaker 2:Thank you so much, Danilo. Is my presentation visible?
Speaker 1:Yes, it is, it is okay.
Speaker 2:Thank you so much. Yeah, thank you, Danilo. So today, my name is Anirban and I am representing Wipro Limited, to talk about some of the real-world deployments and industry applications that we are seeing. This is the agenda for my presentation: some background on where we come from, and the relevance of generative AI...
Speaker 3:I think maybe it's another presentation you're showing, because we just see your title slide right now. I don't know exactly, but you may want to try again, because I think it's just the one that you uploaded previously; you are not sharing your screen.
Speaker 2:Okay, okay, let me share that. Is it visible now?
Speaker 3:Let me see. No, I think it's still the other one. Or maybe now, let's see. Is that the one?
Speaker 2:And then you just go full screen. Table of contents. Yeah, table of contents now.
Speaker 3:You just need to go to full screen, because we just see all the slides on the left, and then I'll let you take it over. No, it's not; we are still seeing everything on the left, all your slides. Are you using different screens, maybe? Or can you just go to slideshow?
Speaker 2:OK, let me try. OK, let me share again. Yes, the screen.
Speaker 3:Is it visible in full screen now? No, it's not visible in full screen. If not, yeah, we just have to go that way. If you just go and take slideshow, it's showing your whole screen video. Right, Danilo, you can also see that? Right, it's not in full screen mode.
Speaker 1:I see. No, not in full screen.
Speaker 3:No, no. Okay. Any technical difficulty, Anirban? Yeah, if not, we have to go that way; it's not really what we want, but it's okay.
Speaker 2:It's set to full screen on my side, but I'm not sure. Yeah, maybe it's the multiple screens, or maybe you can share the presentation that I uploaded.
Speaker 3:Yeah, I can share that one.
Speaker 2:But why do you want that, actually? Without the video it is fine, I think. Okay, then we go with this one.
Speaker 3:Okay, can you move forward here in that slide, and then I'll?
Speaker 2:Yeah, you can go ahead to the next slide. Okay, so you want me to? Okay, I'll just follow you then. Yeah, so this is the agenda, and then some background on our team. For those who are not aware, Wipro is a leading global information technology consulting and software development company. We are present in more than 167 countries and we focus on different areas, including AI, cloud computing and data analytics, for example. I come from a background of embedded development and have been working on AI for the last five years, and I would like to acknowledge the bigger team behind this as well. Yeah, we can go to the next slide, thank you. So, to talk about the use cases: what we see are different trends in AI. There are advancements in generative AI models, with a lot of multimodal AI models coming in, and these models are driving new use cases as well as enabling improvements to older implementation efforts.
Speaker 2:A lot of the implementation for multimodal AI happens in the cloud, but there are certain concerns with that, mainly around data privacy, where people may not be ready to share all their information with the cloud, as well as the need for the availability of a high-bandwidth network. So in some use cases it is beneficial to have the inferencing done in an offline mode, without dedicated network connectivity. In terms of embedded devices, we looked at several categories and saw that printers and scanners are one category where there may be a lot of applications of multimodal AI. Today, the printers and scanners that we see are mostly passive. They don't process the content themselves. They struggle with inconsistent formats and layouts, and these formats are largely unstructured. A lot of documents come in multilingual formats, especially in an enterprise context, and with traditional rule-based approaches, and even traditional deep learning techniques, there is very limited reformatting support available. With the coming of multimodal AI, these passive endpoints can become more intelligent: they can help change this unstructured content into a more structured format, they have the power to translate content from one language to another, and they have this special ability to understand both images and text. We can go to the next slide. Yeah, so we took one use case, mostly on the document understanding part of it.
Speaker 2:A lot of multimodal AI models are available, even in open source. For example, we have Qwen2.5-VL, which is a very popular model. It is capable of handling text, images and even other modalities. We have specialized models like LayoutLMv3 from Microsoft, which is trained specifically for document understanding. Then there are other popular models like Donut, and TATR from Microsoft, for table understanding. These are the multimodal series of models, and then we have another series, more like the diffusion models; for example, the FLUX model. These are very creative in generating new content.
Speaker 2:We can go to the next slide. This is our solution approach for how we can bring these multimodal AI models onto printers and scanners. We have an input image, or a PDF, as well as a possible user input, which may be in the form of a text prompt or given by voice. Once this is acquired, we do some conversion, for example converting PDF to images. Then we do a number of pre-processing tasks: resolution conversion, page segmentation, noise reduction and so on. After that, we can pretty much use the visual language models directly for optimized inference. We can apply quantization, and without much supervision the VLMs are capable of interpreting the content, identifying key elements and then restructuring the layouts.
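As a rough illustration of the acquisition and pre-processing stage described above, a minimal Python sketch could look like the following. The library choices (pdf2image for PDF rendering, Pillow for filtering), the DPI value, and the file names are assumptions for the sketch; the talk does not name the exact stack used on the device.

```python
# Minimal sketch: convert a PDF into page images and apply light noise reduction
# before handing the pages to a visual language model.
from pdf2image import convert_from_path   # requires the poppler utilities
from PIL import ImageFilter

def preprocess_document(pdf_path: str, dpi: int = 150) -> list:
    """Render PDF pages at a modest resolution and denoise them."""
    pages = convert_from_path(pdf_path, dpi=dpi)            # resolution set via DPI
    cleaned = []
    for page in pages:
        img = page.convert("RGB")
        img = img.filter(ImageFilter.MedianFilter(size=3))  # simple noise reduction
        cleaned.append(img)
    return cleaned

if __name__ == "__main__":
    for i, img in enumerate(preprocess_document("scanned_report.pdf")):
        img.save(f"page_{i:03d}.png")                       # inputs for the VLM step
```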
Speaker 2:The VLMs are also capable of generating output in the format that we desire, and then we have some post-processing activities where we clean up and prepare the output in that format. The interesting part is that, even without doing a lot of fine-tuning of the base foundation models, we can get quite decent results. Primarily, in our implementation we have used two models. One is the Qwen2.5-VL model. It excels in handling complex visual inputs, including images of different sizes, while still maintaining good linguistic performance. We have tried this model with 3 billion parameters, 7 billion parameters and so on, and the configuration that we used had a GPU RAM of 16 GB. On the other hand, we have also used the FLUX diffusion model. This has a hybrid architecture of multimodal and parallel diffusion transformer blocks, and it has a size of around 12 billion parameters, so it needs VRAM of around 24 GB. We can go to the next slide.
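For reference, a sketch of a document-understanding call with Qwen2.5-VL, following the publicly documented Hugging Face usage. A recent transformers release and the qwen-vl-utils package are assumed, the page image name is a placeholder, and the deployment described in the talk used further optimizations (such as 4-bit quantization) on top of this.

```python
# Sketch: ask Qwen2.5-VL to read a page image and return its tables as CSV.
import torch
from PIL import Image
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

page = Image.open("page_000.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": page},
        {"type": "text", "text": "Extract every table on this page and output it as CSV."},
    ],
}]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[prompt], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
# Drop the prompt tokens so only the generated answer is decoded.
answer_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(answer_ids, skip_special_tokens=True)[0])
```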
Speaker 2:So here we have primarily two use cases. One is extracting tabular data and reformatting it into graphs. We can import a PDF document which has tables and other textual content. The VLMs are quite skilled at extracting the table information, and once we have the tabular data, we can convert it into a different representation, say a pie chart or a bar chart, and then overwrite the tabular data with this visualization. A lot of these creative things can be done with documents.
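A minimal sketch of that reformatting step, assuming the VLM has already returned the table as CSV text; pandas and matplotlib are illustrative choices, and the sample CSV below is made up for the example.

```python
# Sketch: turn a CSV table returned by the VLM into a bar chart image that can
# be placed back into the document before printing.
import io
import pandas as pd
import matplotlib
matplotlib.use("Agg")                     # headless rendering on an embedded target

csv_from_vlm = """Quarter,Revenue
Q1,120
Q2,135
Q3,160
Q4,180
"""

df = pd.read_csv(io.StringIO(csv_from_vlm))
ax = df.plot.bar(x="Quarter", y="Revenue", legend=False)
ax.set_ylabel("Revenue")
ax.figure.tight_layout()
ax.figure.savefig("table_as_chart.png", dpi=150)   # overlaid on the original page
```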
Speaker 2:The next use case is more about image generation and modification. Here we were thinking about a greeting card use case. With a textual prompt we can create a greeting card from the model, and then we can do incremental modification steps by prompting the model. For example, here the greeting card says "Happy New Year 2023", and with a suitable text prompt we can modify the year to 2025. Some of these modifications are done very simply. Then, in the third step, we wanted to change the outer boundary of the image, and we were able to do it with a simple prompt to the model. You can go to the next slide.
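As an illustration of the initial text-to-image step only (the incremental, prompt-driven edits shown in the demo would need an image-editing pipeline on top), here is a sketch using the diffusers FluxPipeline with the openly licensed FLUX.1-schnell checkpoint as a stand-in. A recent diffusers release and a GPU with enough memory are assumed.

```python
# Sketch: generate a greeting card with a FLUX diffusion model via diffusers.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()           # trades speed for a smaller VRAM footprint

prompt = ("A festive greeting card with the text 'Happy New Year 2025', "
          "gold lettering, confetti, clean white background")
image = pipe(
    prompt,
    num_inference_steps=4,                # schnell is distilled for few-step sampling
    guidance_scale=0.0,                   # guidance is disabled for this checkpoint
    height=768,
    width=512,
).images[0]
image.save("greeting_card.png")
```

The num_inference_steps and guidance_scale arguments are examples of the sampler and scheduler knobs mentioned later in the deployment discussion.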
Speaker 2:This is another use case, where we do the text correction that I already discussed. Then, coming to other use cases, this is what we have seen in a printer and scanner context.
Speaker 2:We can automatically understand scanned documents and classify them into specific categories. In an enterprise context it really helps that the system can also trigger a notification that a document is intended for a particular group of people. Then we have multilingual content handling: these models can easily translate between different languages, which is very helpful in typical enterprises. Another use case is adapting the layout for visually impaired users by increasing the font size and contrast and simplifying the design. Then, in a scanner use case, the text may have some defects, and those gaps can be filled in and errors corrected automatically by the model. We have also seen other use cases; for example, when printing from a web page, a lot of advertisements and other content comes in, and all of that can be automatically detected by the AI model and removed.
Speaker 2:Yeah, we can go to the next slide.
Speaker 2:Yeah, unfortunately I think this video will not run, but basically, where we were showing the table extraction, this is the text correction part of it. Yeah, you can go to the next slide. This is about changing the borders, and this is about changing the color of the bird: by just a simple prompt to the model, it is able to update the bird's color. Yeah, you can go to the next slide.
Speaker 2:These are some of the deployment aspects that we considered. The aim is to deploy on the edge device, as we discussed, but these models still need considerable resources to run. So first of all, we took a lot of steps to optimize. One step was downscaling the image: during the pre-processing step, we reduced the resolution significantly, which improves processing efficiency with some loss of accuracy. Then, these VLM models are very good at object localization and grounding, so tasks such as accurately locating objects within images using bounding boxes and point coordinates can be done by the VLMs themselves. As application developers, we don't need to set up a complicated pipeline to do this; we use the capabilities that are built into the VLMs.
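A minimal sketch of that downscaling step; the 1280-pixel cap on the longest side is an assumed value for illustration, not a figure from the talk, and the file names are placeholders.

```python
# Sketch: cap the longest side of a page image before VLM inference to cut
# memory use and latency, at some cost in accuracy.
from PIL import Image

def downscale(img: Image.Image, max_side: int = 1280) -> Image.Image:
    """Return a copy shrunk so that neither side exceeds max_side (aspect kept)."""
    out = img.copy()
    out.thumbnail((max_side, max_side), Image.Resampling.LANCZOS)  # no-op if already small
    return out

page = Image.open("page_000.png")
downscale(page).save("page_000_small.png")
```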
Speaker 2:In terms of model quantization, we quantized the 7-billion-parameter model using a 4-bit quantization technique and used the GGUF format for the model. GGUF is a more efficient and flexible way of storing and using models for inference and is designed to run well on consumer-grade hardware. For the diffusion model, we did a lot of hyperparameter optimization. There are certain parameters for the diffusion model, such as the sampler and the scheduler, and these can be tweaked for better performance. By enabling better control of the sampling process, how the image gets generated, and by fine-tuning the sampling algorithms and parameters, we could strike a balance between accuracy and performance. And, last but not least, we utilized NVIDIA's optimized libraries to get the best performance out of the GPU. Overall, the deployment configuration was based on an x86 processor and an NVIDIA T4 GPU with around 16 GB of memory.
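A sketch of a 4-bit GGUF workflow of the kind described here. The conversion and quantization commands reflect current upstream llama.cpp tooling and may differ from the exact toolchain used in the talk; the file names are placeholders, and only the language-model side is shown (the vision projector is handled separately).

```python
# Offline, roughly (command names depend on the llama.cpp version):
#   python convert_hf_to_gguf.py ./Qwen2.5-VL-7B-Instruct --outfile qwen2.5-vl-7b-f16.gguf
#   ./llama-quantize qwen2.5-vl-7b-f16.gguf qwen2.5-vl-7b-q4_k_m.gguf Q4_K_M
#
# At runtime, load the 4-bit weights with llama-cpp-python:
from llama_cpp import Llama

llm = Llama(
    model_path="qwen2.5-vl-7b-q4_k_m.gguf",
    n_gpu_layers=-1,        # offload all layers to the GPU (e.g. a 16 GB T4)
    n_ctx=4096,
)
out = llm.create_chat_completion(
    messages=[{"role": "user",
               "content": "Correct the spelling errors in this OCR text: Teh quaterly report shows strong growht."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```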
Speaker 2:All the mentioned steps helped in reducing both the inference time and the memory footprint for deployment on a typical printer or companion-device SoC. Our view is that it will still take time to deploy these kinds of models directly on the printer, but the printer also has companion devices, like an AI PC or a smartphone, which are themselves quite capable. So some of this can be orchestrated between the two, and some things may also be done in the cloud. Yeah, we can go to the next slide. These are some of the challenges that still remain. In terms of large-document handling, we need to do a lot of segmentation and batch processing to manage the memory load.
Speaker 2:Then, though we did not really fine-tune these base models, for optimum accuracy we would need to fine-tune for the specific use case, and that will help us maintain performance. When it comes to embedded deployment, we have to keep in mind the thermal and power constraints, using efficient scheduling and hardware acceleration to minimize power consumption. And, last but not least, the compute requirements are still high, but as SoCs develop and we get better optimization methods, I think deployment on the target lower-cost devices will become a reality. We can go to the next slide.
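To illustrate the batch-processing idea for large documents mentioned above, a small sketch of bounding peak memory by handling pages in fixed-size batches; run_vlm_on_page is a hypothetical stand-in for the quantized-VLM call shown earlier, and the batch size is an assumption.

```python
# Sketch: process a long document in small batches so only a few page images
# are resident in memory at any time.
from pathlib import Path
from PIL import Image

def run_vlm_on_page(img: Image.Image) -> str:
    # Placeholder: in the real pipeline this would invoke the quantized VLM.
    return f"<analysis of a {img.width}x{img.height} page>"

def process_document(page_paths, batch_size: int = 4) -> list:
    results = []
    for start in range(0, len(page_paths), batch_size):
        batch = page_paths[start:start + batch_size]
        images = [Image.open(p) for p in batch]       # decode only this batch
        results.extend(run_vlm_on_page(img) for img in images)
        for img in images:
            img.close()                               # free buffers before the next batch
    return results

pages = sorted(Path(".").glob("page_*.png"))
print(process_document(pages))
```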
Speaker 2:These are some of the conclusions and future directions. Multimodal AI models are a transformative advancement, especially if we look at document reformatting tasks in printers, and by deploying such models directly, either on the device or on a companion device, we can offer customers much better and smarter printing and scanning solutions. Multiple innovative use cases can be accomplished using these models, and this approach sets the stage for a new era of intelligent edge printing, where both content understanding and reformatting happen seamlessly at or near the point of output. So I think with this we come to the end of the presentation, and I would be happy to take some questions.
Speaker 1:Thanks, thanks, Anirban. Your presentation was smooth, even though we missed your video demos. I don't know if you could find a way to share links to the videos later on so that we can enjoy your demonstrations. Can you quickly comment on what the demos were about?
Speaker 2:Sorry, can you repeat your question?
Speaker 1:About what your demos were about. What was the concept of the videos?
Speaker 2:Correct, correct. The demos were about some very simple examples. For example, from a document we can do a table extraction. If you think of earlier times, we would have had to use a very specific template-based approach to do this extraction, or we would have had to train a deep learning model using a lot of data. With these visual language models, the advantage is that a lot of the knowledge for these kinds of tasks is already built into the foundation models. They understand both the image context and the prompts that the user gives, so they can relate the two. When I say, okay, extract this table and just output it in CSV format, the model itself has the capability to do it. This really shortens the development time for application developers, and I think that is one of the key areas to look forward to.
Speaker 1:And Anirban, a curiosity, also from a business perspective: how do you think about using a foundation model in a solution that you sell on the market? After all, you adopt the foundation model, but someone else prepared and trained it with lots of data, and then you are trying to exploit it, for example for the purposes you spoke about. How do you manage a solution that depends on a foundation model? Is it something that you consider viable and reliable, or do you see any cons?
Speaker 2:Yeah, so I think people have concerns about foundation models, like what kind of dataset they have been trained on, and so on. So I think we should look at it use case by use case. The power of AI comes from the data, and just using the foundation models as-is may not be appropriate for a commercial solution.
Speaker 2:So we really need to fine-tune with whatever contextual data belongs to the enterprise. And ultimately, the models will still hallucinate at the end of the day, so we need to build in a lot of guardrails so that, if it really does, there is an exit path and it gives something that is more logical.
Speaker 2:So I think a lot of the work with AI will be in how you write the prompts, how you experiment with it, how you see what it is doing, and building some checks on top of it. It will still be a kind of fuzzy system, meaning you can never guarantee that it will repeat the same response every time, but it still gives a benefit in certain use cases. I think that is the balance we need to draw: where it can be applied and where it cannot.
Speaker 1:And I like that you mentioned that this technology can help, for example, visually impaired people. You made me think that this great technology can help create new services, products and applications for people with disabilities, who can really enjoy the capability of generative AI in providing machine interaction at the next level. So thanks for underlining this possibility as well.