EDGE AI POD
Discover the cutting-edge world of energy-efficient machine learning, edge AI, hardware accelerators, software algorithms, and real-world use cases with this podcast feed covering all things edge AI from the world's largest EDGE AI community.
It features shows like EDGE AI Talks and EDGE AI Blueprints, as well as EDGE AI FOUNDATION event talks on a range of research, product, and business topics.
Join us to stay informed and inspired!
How AI Zip Is Shrinking Models for a Device-First Future
What if we could put powerful AI anywhere—from underwater cameras monitoring fish to the phone in your pocket? That's the vision driving AI Zip, a company building ultra-efficient AI models that can run on virtually any device.
During this fascinating conversation, we explore how AI Zip is pioneering a different path in artificial intelligence by focusing on extreme compression and efficiency. While most companies pursue ever-larger models requiring massive cloud infrastructure, AI Zip is shrinking intelligence to fit where it's needed most—at the edge where 99% of data originates.
The numbers are staggering: edge devices collectively possess about 100 times more computing power than all cloud resources combined, yet 95% of AI workloads run in the cloud. This disconnect represents an enormous untapped opportunity that AI Zip is addressing through innovations in model compression and deployment.
We dive into real-world applications, including an award-winning smart fish farming solution developed with SoftBank that uses underwater computer vision to optimize feeding and dramatically reduce waste. This practical example shows how specialized AI can deliver enormous value when deployed directly at the data source.
Perhaps most thought-provoking is the efficiency gap between artificial and natural intelligence. While our most advanced AI systems require thousands of watts of power, the human brain operates on just 20 watts—about the same as a smartphone. Similarly, a jumping spider can navigate complex 3D environments with millions of neurons, while autonomous vehicles need billions of parameters. Closing this three-orders-of-magnitude efficiency gap represents an exciting frontier for AI research.
The future of AI won't just be about bigger models—it will be hierarchical, with specialized intelligence at every level. Subscribe now to hear more conversations with pioneers who are reimagining what's possible at the intersection of AI and edge computing.
Learn more about the EDGE AI FOUNDATION - edgeaifoundation.org
So, Yubei, nice to meet you. I know we've met before too. AI Zip is actually a really interesting part of our community, and it's good to be able to chat with you today.
Speaker 1:Nice to meet you too.
Speaker 1:It's a pleasure to be here.
Speaker 2:Well, why don't we get into it? As I mentioned in the preamble before we started taping, there are lots of interesting companies in our community doing all sorts of things. I talk about the whole tech stack, from metal to cloud; there are all these layers of companies in there. It's like a team sport where you have to work together to actually solve problems with edge AI and tiny ML. And you guys are right in there in the stack. So maybe you can give us a little background: who is AI Zip? What's the origin story? Where did you come up with this idea? What's the whole deal?
Speaker 1:Definitely. So AI Zip is a company that specializes in building the world's smallest and most efficient AI software, AI models that we hope can go into essentially any device. We're software only, and we like to partner with all the hardware leaders in order to bring AI to the next generation of devices. Over the years we have partnered with many very well-known brands in this community. We won this year's CES Innovation Award together with SoftBank, we won an award together with Bosch, and we won the Best AI Product of the Year award together with Analog Devices. Along the way we have formed a lot of these collaborations. The team was almost born out of an accident: we had no plan to start a company at the beginning. Then COVID came, and everyone was quarantined at home.
Speaker 1:My friend came to me and asked, how do you actually build a working AI system? I had no idea, and that was a shocking moment, because I had been doing fundamental AI research for a long time. When we say fundamental AI research, we sometimes really mean useless, right? I know how to push the benchmark, I know how to write a sometimes very good paper, but how to build a thing that actually works, which is the key part as an engineer, I didn't really know. So that was an interesting question. We started to experiment, and that eventually turned into a company, because we all loved it and it seemed like a very useful thing to do at the time. So we decided to create this company called AI Zip in late 2020.
Speaker 2:I have to ask about the name. When I think of AI Zip, I think of WinZip, where you zip your files together. Was there some sort of inspiration there? Was that the idea?
Speaker 1:Yeah, so the intuition behind the name of the company is really the format you compress your files into. You want to compress AI, essentially. That's why we put AI and zip together to make this name. It's almost like you have a zip file with AI inside, right? That's the intuition, yeah.
Speaker 2:Yeah, you guys actually had a really impressive booth at CES. I kind of stumbled upon it. I was wandering around and all of a sudden I saw this huge thing with all these people in there, and I was like, oh, AI Zip, wow, you guys are really kicking butt. So yeah, you won an award with SoftBank at CES. I think we actually just did a livestream about that. That was for the aquaculture, the fish farming solution.
Speaker 1:Yeah, exactly. We have a very good collaboration with SoftBank, and this year that CES Innovation Award was about the next generation of smart fish farming, which is really important. If you know this market, a lot of fish farmers actually went out of business. Everyone loves sushi, but fish food prices increased about three times during the COVID period, which is crazy, and feed now accounts for about 70% of a fish farmer's costs. So if you're not careful, you can easily become an unprofitable business, because the food is very expensive and you're feeding it to the fish.
Speaker 2:Right, and a lot of that food just ends up...
Speaker 1:Yeah, it's wasted, right. So the idea is really to put AI there under the water to count the fish and also gauge their appetite, so you know when to feed and how to feed, precisely. It's almost like watering your plants very precisely to save water; it's the same idea here.
Speaker 1:And that turned out to be a great success, and now it's evolving into the next generation of fish farming infrastructure technology, with these computer-vision-powered techniques. And of course the Japanese artists are very creative. They designed that booth with all of these fish, almost like a movie.
Speaker 2:Yeah, it attracted a lot of the traffic. We all loved that design; it was really interesting.
Speaker 1:And we don't stop there. Now our collaboration has extended beyond fish farming to agriculture more broadly, but also to small language models, ultra-efficient language models that can run on your device and do Q&A and basic RAG, retrieval-augmented generation. That's another large scope: enterprise on-device language models. We all know that language models have a huge market at the moment, but the on-device market has been really underestimated, and we see that market coming very quickly, so we formed a team to tackle it.
Speaker 2:So the models you were using for the fish farming, the aquaculture, those were like DNNs running on phones, basically? Is that kind of it?
Speaker 1:Yes, it's vision models running on all levels of devices.
Speaker 2:Moving the DNNs maybe into the cameras themselves at some point and doing it that way, right. But now you're also talking about language models, because I think you guys did a talk at Embedded World in Austin in the fall around running language models, I think on brownfield systems or something. You were demonstrating how efficiently you can get language models to run on fairly unaccelerated platforms, right? So is that a big frontier for you? It was like, how do you take language models and zip them up?
Speaker 1:Yeah, same philosophy. Language models are one of the new frontiers we've started working on. Actually, we started more than two years ago, around the time the TinyStories paper was published and Sébastien Bubeck at Microsoft picked up that thread and started to lead the small language model charge at Microsoft. That was some of the early work: TinyStories and Textbooks Are You All You Need, these "all you need" category papers. We love that, because we see that a lot of the future of AI is going to be on device. I can show you a few numbers to support that.
Speaker 1:And just as a matter of fact, Jensen Huang, NVIDIA's CEO, recently went on the No Priors podcast and talked about what, in his vision, the future of AI would be. He said the future of inference will be multi-scale: you will have these large models and you will have these small models, just like natural intelligence. Look at this world: we have humans, but we also have a lot of dogs, we have cats, we have jumping spiders, we even have ants. There's intelligence at every level. So if you think about current inference, it's estimated that 99% of the data actually originates on device, which is not surprising, because that's where the sensors are, where the data is.
Speaker 2:The real world, as we call it.
Speaker 1:Exactly, exactly. But another surprising part, which I think is even more interesting, is that if you add up the world's compute capacity in all of these HPC centers, it's also roughly 1% compared to what's on device. That's because of the sheer number of devices out there: the phone in everyone's pocket, the cars, the smart cameras. When you add it up, it's almost two orders of magnitude more than the cloud computing resources. Yet more than 95% of the compute, the AI traffic, the AI workload, goes to the cloud, and the majority of the investment also goes to the cloud. That creates a huge opportunity for on-device AI, and we see that opportunity coming very close, mainly because of three forces: the market is there for it, and the hardware's capability is quickly increasing.
Speaker 1:Just as a matter of fact, think about seven years ago: Apple's A11 chip, I believe, was the first generation that carried a Neural Engine, and its compute capacity was 0.6 TOPS. Last year the compute capacity of that Neural Engine was 35 TOPS, roughly sixty-fold more in seven years. The A19 is around the corner, and within two years we're going to see more than 100 TOPS of compute capacity on device. A lot of that capacity is sitting in your pocket. Language models will be on device. A lot of this day-to-day elementary intelligence, everyday intelligence we call it, will be in your pocket, on your device, available there. That's the future we see, and language models will not be an exception. So we formed a group and started to tackle that problem, started to put AI agents on device.
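For readers who want to check the arithmetic, here is a minimal back-of-the-envelope sketch in Python of the growth described above. The 0.6 and 35 TOPS figures are the approximate ones quoted in the conversation rather than official specifications, so treat the outputs as rough estimates.

```python
import math

# Back-of-the-envelope sketch of the on-device NPU growth described above.
# 0.6 TOPS (A11, 2017) and 35 TOPS (seven years later) are the approximate
# figures quoted in the conversation, not official specifications.
a11_tops = 0.6
recent_tops = 35.0
years = 7

total_growth = recent_tops / a11_tops          # ~58x overall
annual_growth = total_growth ** (1 / years)    # ~1.8x per year, compounded

# Years of the same compound growth needed to go from 35 TOPS past 100 TOPS.
years_to_100 = math.log(100 / recent_tops) / math.log(annual_growth)

print(f"overall improvement: ~{total_growth:.0f}x ({annual_growth:.2f}x per year)")
print(f"years from 35 to 100+ TOPS at that rate: ~{years_to_100:.1f}")
```

At that compound rate, crossing 100 TOPS takes a little under two years, which is consistent with the "within two years" expectation above.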
Speaker 2:Right, yeah, it makes total sense. Like you said, 99% of the data originates at the edge, in the real world, and it's an interesting stat to add up all the compute out there relative to the data centers. And there are all those benefits of doing things at the edge, doing things where the data is created. So that's kind of your frontier. My understanding, though, is that beyond TOPS there are also memory constraints, and one of the things holding back language models in resource-constrained environments is that they're kind of memory hogs. Do you see a lot of progress happening in that space in the next couple of years?
Speaker 1:Yeah, there are a couple of very exciting opportunities in that direction. On memory, you're absolutely right; we've really started to realize that compute is not everything. A lot of the time the bottleneck can be your memory.
Speaker 1:But one thing that is very interesting to see, taking Apple as the example again, is that a lot of the chip design is about how to resolve that memory bottleneck, like the unified memory architecture. All of these designs are for the next generation of AI; if they weren't designed for AI, a lot of these hardware decisions would make no sense. So I think the hardware leaders are preparing for that future and trying to co-design their architectures so that they will be appropriate for the future AI workload. AI is just a new type of software, and you need the right hardware to support it. I think this community is working in the right direction with hardware and AI software co-design, and that will be appropriate for the future AI workload on device.
Speaker 2:Yeah, no, for sure. And in the commercial space, as you know, there's always a challenge with greenfield deployments versus brownfield: how do you add AI capabilities into existing deployments and things like that? So it'll be interesting to see how that plays out. But for sure, it's almost hard to believe, even compared to a few years ago, how much more compute is possible, and you guys are focused on the software part to make it happen, so that's pretty exciting. Are you primarily focused, well, you talked about SoftBank, so is Japan a big market for you? You're in California. How big is AI Zip? Are you a multinational conglomerate, or what's going on?
Speaker 1:Well, we are still pretty lean. We have thirty-some full-time people, and including part-time we're looking at around fifty.
Speaker 1:The majority of us are in California, but we do have employees from Japan, from Korea, from Germany, from China, from Taiwan, so we're somewhat decentralized. But the team has very strong bonding; we have zero attrition, which is hard to do in this space. And yeah, we love the Japanese market. Myself, I'm a judoka, I've done judo for eight years, and we have a lot of very fruitful collaborations with Japanese companies. SoftBank is one, we now have a long-term collaboration with Renesas, and we have a very strong strategic partnership, to be announced soon, that's also in Japan. Japanese companies are leaders in many dimensions.
Speaker 1:So that's a very important marketplace.
Speaker 2:Yeah, and they've been deploying leading technology in retail and healthcare and industry and robotics there for decades, so it's a good place to be.
Speaker 1:I'm surprised at how many people speak Japanese at ASM. I was surprised, right, at how many people at ASM actually speak very fluent Japanese. I was like, hey, wow, good for you guys.
Speaker 2:That's a good skill. What do you think about model portability? I mean, you're in the model business, the software business, and the edge is notoriously fragmented and heterogeneous, right? As you mentioned, there's always a new chip and a new thing, and they tend to have their own interesting, unique designs. How do you manage portability of your models and software across these different, heterogeneous platforms?
Speaker 1:I love this question, and it's very important; I think what I'm about to say matters a lot for the hardware leaders. One thing we've started to observe is that, of course, you need a lot of deep hardware knowledge to make this happen. Even though we are software only, we are very strong in terms of hardware expertise, and that definitely helps. We have built what we call AI design automation toolchains that deploy your model onto different target platforms. Just to name a few: Arm is a good platform, and then RISC-V, Qualcomm's Hexagon, and Cadence's HiFi series. There are many very important platforms, and we have formed automatic deployment flows for them.
Speaker 1:And one thing I do realize is that a remaining challenge, which I think requires collaboration between us and the hardware leaders, is really to form this new kind of partnership to co-design hardware and software. That's a gap we're starting to see. I think NVIDIA is doing great in that dimension. We all know Moore's Law has been slowing down on the hardware side, so a lot of the new innovation comes from co-design. For example, NVIDIA designed their data center almost like a new compute unit, and a lot of the co-design is there to facilitate communication.
Speaker 1:And trying to compress a lot of things into a smartphone form factor, Apple is also doing great, right? There's internal AI application development at Apple and a strong hardware component there too, so they can do this co-design. What we'd like to see for the rest of the community is that we really need to form strong partnerships, so that the AI experts, the software experts, and the hardware experts sit together and define the next generation of hardware. By doing that, I think we can continue the scaling and keep improving capability, rather than assuming the hardware is going to look a certain way, developing our network, and then realizing it's not deployable on that hardware, which would be a pity. So more seamless collaboration and integration would definitely be very important for this community.
Speaker 2:Interesting. Yeah, it's a tough nut to crack, because I think a lot of the semiconductor partners want to have their special sauce, their architectures that have their advantages. But from the software side that can be challenging, because any time you start doing abstractions and things like that, you get performance issues and size issues. So it'll be interesting to see how that plays out. It's something I've heard a lot about, so it's good that you guys are thinking about it.
Speaker 1:Yeah, and just to complement this point, I think it's not merely a hardware issue. For example, let's say you're selling Windows, and another hardware supplier builds a new processor. It would be a pity if that new processor couldn't support Windows, because Windows is a useful operating system to have on your hardware. So having these experts in the same room to co-define the future, I think that is important, so that we don't waste our capabilities.
Speaker 2:True. By the way, it's hard to get a new processor to support Windows, just FYI. I was on the Windows team for a little while. It's a lot of work.
Speaker 1:Yeah, definitely, it's a lot of work.
Speaker 2:Yeah, it's doable, but no, I hear you. I think it's going to be an interesting bridge to cross. I know there's work at Microsoft, for example, with ONNX and some other frameworks, so maybe there's a point where we'll see more cohesion. Cool. Any last words of wisdom about AI Zip that you want people to know? I think we got a pretty good view, but what else? What's the thing that people should know about AI Zip that you don't think they know yet?
Speaker 1:That there's a big opportunity in on-device resources, right?
Speaker 1:It has not been tapped into, and the end game we see is that the future of AI is going to be hierarchical, where you have these large models and all the models in between, each doing useful things with their own specializations while being ultra-efficient. We envision really being the portal for future AI traffic, where there are sensors and there are AI models, so that even though it's very simple, it could be the first station where a lot of this data interacts with the AI ecosystem. That's one of our visions. Another thing that was quite unexpected, but that over time I've started to think about more and more, is that efficiency is actually more than just an application problem. I shared several numbers with you very early on, but I want to share another three numbers, and the first one is always going to be the boring one. Do you know how much power the human brain consumes?
Speaker 2:It's a large model, right? So large. I heard it's about 20 watts, 20 to 25 watts.
Speaker 1:Here we go, very nice, you are experts. It's about 20 watts, which is about the power consumption of an iPhone 15 Pro Max.
Speaker 2:Oh, the 15 Pro Max, yeah.
Speaker 1:Yeah, it's a big device, but still a lot more efficient than the large language models. If you have a 600-billion-parameter model and you deploy it onto two A100 servers, you're looking at around 10 kilowatts, or with four servers, 20 kilowatts.
Speaker 2:Oh yeah.
Speaker 1:Huge. That is the power gap, roughly three orders of magnitude. But now let me tell you about the model-size difference. Take Tesla: Tesla is very nice, everything is on device. You have on-device perception, 3D perception, planning, navigation in 2D space, with a few billion parameters; that's the model size. Based on the foundation model, you build different heads for different detectors. That's the model. But I'll give you an example from the natural intelligence world: the jumping spider. The jumping spider also has eight cameras, two in front, two on the sides, two in the back for peripheral vision, and another two. Eight cameras, 360 degrees of vision, right?
Speaker 1:So it can scan the environment, build a 3D perception, and do not just 2D navigation but 3D navigation. It will look at its prey and figure out, okay, I need to go down here, go through the bush, go up here, and get to the prey. That's very tricky 3D navigation. How many neurons does a jumping spider have to accomplish all of this?
Speaker 2:I don't know, a thousand?
Speaker 1:No, not that small, but a few million neurons. A few million neurons instead of a few billion parameters. So again, we're seeing this three-orders-of-magnitude difference.
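To put the comparison in concrete terms, here is a minimal Python sketch of the two gaps described above, using the rough wattages and model sizes quoted in the conversation; they are illustrative figures, not measurements.

```python
import math

# Rough sketch of the two "three orders of magnitude" gaps described above.
# All figures are the approximate ones quoted in the conversation.
def orders_of_magnitude(big: float, small: float) -> float:
    return math.log10(big / small)

brain_watts = 20            # human brain, roughly 20 W
llm_watts = 10_000          # ~600B-parameter model on two A100 servers, as quoted

spider_neurons = 3e6        # jumping spider: "a few million" neurons
driving_params = 3e9        # on-device driving stack: "a few billion" parameters

print(f"power gap:      ~{orders_of_magnitude(llm_watts, brain_watts):.1f} orders of magnitude")
print(f"model-size gap: ~{orders_of_magnitude(driving_params, spider_neurons):.1f} orders of magnitude")
```

With these quoted figures the power gap comes out at a bit under three orders of magnitude and the model-size gap at about three, which is the point being made here.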
Speaker 1:And finally, I think this number is going to be even more interesting. We love large language models, and scaling is so far one of the most robust laws we have found. But every one and a half years or so, the training token count becomes roughly 10x larger. We started with Chinchilla, which I believe was around 1.5 trillion tokens, and then now Llama 3.
Speaker 1:Last year that was 15 trillion tokens, so we're constantly seeing bigger and bigger. Apple just released a new training set which is around 200 trillion tokens. That's a lot of tokens. But for humans, how many tokens can we actually acquire before the age of 20?
Speaker 2:I don't know, I couldn't even guess, but that number is not going to be a big one.
Speaker 1:Well, let me do this simple exercise for you. Let's say, starting from day one, we read text at 10 words per second, which is a lot; usually people cannot read more than a couple of words per second. And let's assume each word is three tokens rather than 1.3, where 1.3 is roughly the average number of tokens per word for general English text, I believe. But let's say you're reading really complicated words, three tokens per word, so that makes 30 tokens per second. And you do this 12 hours per day for 20 years. That comes to roughly 9.5 billion tokens over 20 years. That's three orders of magnitude again.
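Here is the same exercise written out as a short Python sketch, so the assumptions can be checked or adjusted. The inputs are the deliberately generous figures stated above, and the 15-trillion-token comparison point is the approximate Llama 3 number mentioned earlier.

```python
# The reading-budget exercise above, written out so the inputs can be adjusted.
# All inputs are the deliberately generous assumptions stated in the conversation.
words_per_second = 10        # generous; most people read a couple of words per second
tokens_per_word = 3          # generous; ~1.3 is typical for general English text
hours_per_day = 12
years = 20

seconds_of_reading = hours_per_day * 3600 * 365 * years
human_tokens = words_per_second * tokens_per_word * seconds_of_reading  # ~9.5 billion

llm_training_tokens = 15e12  # roughly the Llama 3 training set mentioned above

print(f"tokens a human could read by age 20: ~{human_tokens / 1e9:.1f} billion")
print(f"gap vs a 15-trillion-token training run: ~{llm_training_tokens / human_tokens:,.0f}x")
```

Even with these generous assumptions the human budget lands around 9.5 billion tokens, roughly three orders of magnitude below a 15-trillion-token training run.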
Speaker 2:Right, so you're seeing this.
Speaker 1:Yeah, yeah: learning efficiency, model efficiency, energy efficiency. You constantly see this three-orders-of-magnitude difference. I think that's one of the next scientific frontiers: how do we actually decipher this gap? Just as a matter of fact, when the steam engine first came out, do you know what its energy efficiency was?
Speaker 2:I don't know, probably not much.
Speaker 1:Not much. It was a very tiny number; I was surprised at how small it was. I remember it was about 0.2%. Nowadays we constantly see 20% to 50%, depending on what type of engine we're talking about. That is the energy gap: energy efficiency has improved by orders of magnitude as well. So I think data is a new type of entropy, right?
Speaker 1:So we're seeing this. We're at the beginning of this revolution. It's almost like we now have a steam engine which is not super efficient, but I think getting this efficiency up is going to be an important factor on the way towards AGI, actually, which was an unexpected thing.
Speaker 2:Yeah, that's true. Those are some fantastic analogies, and I agree, I think we're sort of at the steam engine phase, so we know there's a lot of upside there, especially in efficiency. I always say efficiency is the new currency; the new thing is how efficient you can be with power and cost and everything else to get these things done. So, cool. This was really helpful to get a good picture of where you guys are at, and I'm looking forward to more success. For folks who didn't know who AI Zip is, now you do. So that's great, and I appreciate the time.
Speaker 1:Awesome, all right, thank you. Thank you for this interesting conversation.