EDGE AI POD

Energy-efficient and high-throughput inference using compressed Tsetlin Machines

EDGE AI FOUNDATION

Logic beats arithmetic in the machine learning revolution happening at Newcastle University. From a forgotten Soviet mathematician's work in the 1960s to modern embedded systems, the Tsetlin Machine represents a paradigm shift in how we approach artificial intelligence.

Unlike traditional neural networks that rely on complex mathematical operations, the Tsetlin Machine harnesses Boolean logic - simple yes/no questions similar to how humans naturally think. This "white box" approach creates interpretable models using only AND gates, OR gates, and NOT gates, without any multiplication operations. The result? Machine learning that's not only understandable but dramatically more efficient.

The technical magic happens through a process called Booleanization, converting input data into binary questions that feed learning automata. These finite state machines work in parallel, creating logical patterns that combine to make decisions. What's remarkable is the natural sparsity of the resulting models - for complex tasks like image recognition, more than 99% of potential features are automatically excluded. By further optimizing this sparsity and removing "weak includes," Newcastle's team has achieved astonishing efficiency improvements.

The numbers don't lie: 10x faster inference time than Binarized Neural Networks, dramatically lower memory footprint, and energy efficiency improvements around 20x on embedded platforms. Their latest microchip implementation consumes just 8 nanojoules per frame for MNIST character recognition - likely the lowest energy consumption ever published for this benchmark. For edge computing and IoT applications where power constraints are critical, this breakthrough opens new possibilities.

Beyond efficiency, the Tsetlin Machine addresses the growing demand for explainable AI. As regulations tighten around automated decision-making, the clear logical propositions generated by this approach provide transparency that black-box neural networks simply can't match. Ready to explore this revolutionary approach? Visit tsetlinmachine.org or search for the unified GitHub repository (TMU) to get started with open-source implementations.

Learn more about the EDGE AI FOUNDATION - edgeaifoundation.org

Speaker 1:

Good morning and thank you for the opportunity. Just a bit of a disclaimer to start with: this talk was originally supposed to be presented by Shengyu Duan, who told me at the very last minute that his visa appointment at the US Embassy is in June, which is well past the date of this symposium. This talk pretty much follows on from Tousif's talk on Monday, or was it Tuesday, on the Tsetlin Machine, so I'm not going to do yet another tutorial on the Tsetlin Machine, the new machine learning ecosystem we're building in Newcastle. Instead, I'll talk through a more embedded-systems-oriented piece of work. So, again, a bit of marketing first.

Speaker 1:

I come from Newcastle University, in the beautiful northeast of England. Newcastle is probably the largest city in the entire northeast of England. We have a beautiful campus, as you can see, and you're welcome to visit anytime.

Speaker 1:

We have a history, over the last 35-plus years, of working with asynchronous computing as well as self-aware, self-managed circuits and systems. That strength led to a few strands of research and research outcomes. For example, we developed several design automation tools, such as Workcraft, which has now been adopted by some big industrial players. We also have several batches of ICs that we have designed following the principles of the Tsetlin Machine. I can also break the news that just last week we released the results of our third microchip, again a Tsetlin Machine design, which can do 8 nanojoules per frame for MNIST handwritten character recognition, which, to my knowledge, is probably the lowest ever reported in the published literature. So, in the next few minutes I'll talk through some ideas about the Tsetlin Machine, complementing what Tousif spoke about, and then I'll talk about some interesting opportunities you can pick up from the Tsetlin Machine by exploiting sparsity. Sparsity is a great characteristic of many machine learning algorithms, including neural networks as well as other types of algorithms, and it's one of the reasons I personally got interested in the Tsetlin Machine.

Speaker 1:

Again, a little bit of history about the Tsetlin Machine. The Tsetlin Machine was invented only in November 2018, but it's not actually that new if you go back to the history of the name itself. Mikhail Tsetlin was a mathematician and cybernetics researcher in the 1960s who invented the principles of learning automata, famously called Tsetlin automata. At that point he was also called the father of a new type of logical machine learning, but unfortunately, after his short life, there was a gap in this line of research in very many forms. So in November 2018 one of our colleagues, Ole-Christoffer Granmo, and a few of us joined hands to create this new machine learning ecosystem. We were only five researchers in November 2018 and, as of today, I am part of the steering committee that runs the international working group on Tsetlin machines. We have more than 50 research groups working on different types of machine learning models, and also circuits and systems, using the Tsetlin Machine.

Speaker 1:

Okay, so why the Tsetlin Machine? Unlike the arithmetic principles you will hear about from neural networks, neuromorphic computing or other types of machine learning for that matter, the Tsetlin Machine brings in what we call a white-box machine learning approach. It is a logic-based approach. If humans are given two different fruits, we don't go about calculating a multivariate regression over the features, the depth, the colour, the patterns and everything else in between. What we do is use simple yes-or-no questions in our brain, ontologically, to come up with a set of questions that can answer the classification problem. That is the inspiration for this white-box approach, because it is rule-based and also logic-based.

Speaker 1:

It's largely interpretable and very low complexity, because you're speaking in the language of the computer: AND gates, OR gates, NOT gates, and some summation operations. That's all there is. There's no multiplication whatsoever, zero multiplications, and very little arithmetic at all; mostly OR gates, NOT gates and AND gates. So an example of what a rule would look like is something like this. If some of you know about decision trees, you will probably recall that in decision trees we use a lot of information-based analysis to build up the tree nodes, that is, the thresholds you will use for classifying your problem.

Speaker 1:

In so many ways the Tsetlin Machine is very similar, but it does this in a very dynamic way through training. So after you have trained on a certain problem, the kind of proposition you get looks like this: a certain type of iris flower, called setosa, can be defined as a very simple logical proposition, where "is the petal length below this value?" is one yes-or-no question, this is another yes-or-no question, and so on. It's a very simple example, just to get you thinking about what we're talking about. And again, as I mentioned, there are only three types of gates, NOT gates and OR gates followed by AND gates, and finally you have some summations as well. Now, again, a little bit of tutorial, though I don't want to do too much of it, because Tousif has done a good job of creating, or helping with, that understanding.
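To make the shape of such a rule concrete, here is a minimal Python sketch of a single clause for Iris setosa written as a conjunction of yes-or-no literals; the thresholds and feature choices are hypothetical, not the ones from the slides.

```python
# A minimal sketch (not the exact clause from the slides): one Tsetlin Machine
# clause is just an AND over a few yes/no literals. Thresholds are made up.

def setosa_clause(petal_length_cm: float, petal_width_cm: float) -> bool:
    """Evaluate one hypothetical clause voting for Iris setosa."""
    q1 = petal_length_cm < 2.5          # yes/no question 1
    q2 = petal_width_cm < 1.0           # yes/no question 2
    q3 = not (petal_length_cm > 4.0)    # a negated literal (NOT gate)
    return q1 and q2 and q3             # AND gate over the included literals

print(setosa_clause(1.4, 0.2))  # True: looks like setosa
print(setosa_clause(5.1, 1.8))  # False
```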

Speaker 1:

So, in order to create those interpretable, Boolean-logic-based patterns, the first thing you need to do is extract the yes-or-no questions from the data, and that process is called Booleanization. You create Boolean questions, yes-or-no questions. How do you do that? Let's say you have raw data samples that come in the form of floating point numbers. You can actually use the raw binary form directly, for example the bit-level representation of those floating point numbers. Or, if you want something a bit more minimal, you can do something like quantile binning. That means you decide how many bins you want your raw data to fall into, each value is then placed into a bin based on the intervals you have defined, and those bins are then binarized; those binaries become your yes-or-no questions.
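Here is a minimal sketch of that quantile-binning Booleanization step, assuming NumPy is available; real Tsetlin Machine pipelines may use a different encoding, such as thermometer coding, but the idea is the same: each bin becomes one yes-or-no question.

```python
# A minimal Booleanization sketch using quantile binning (illustrative only).
import numpy as np

def booleanize_quantile(x: np.ndarray, n_bins: int = 4) -> np.ndarray:
    """Turn one real-valued feature column into n_bins yes/no features."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])   # bin boundaries
    bin_ids = np.digitize(x, edges)                                # which bin each value falls in
    # One Boolean question per bin: "does this value fall in bin b?"
    return np.stack([(bin_ids == b) for b in range(n_bins)], axis=1).astype(np.uint8)

x = np.array([0.1, 0.4, 1.2, 3.5, 7.8, 9.9])
print(booleanize_quantile(x, n_bins=4))
```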

Speaker 1:

Your yes-or-no questions become as simple as these: a binary representation of the bins that you have, and that gives you an interesting starting point. That is the Boolean dataset you start with. Once you have the Boolean dataset, the next thing you do is, for every Boolean, also create its complement. So if you have M Booleans, you will have 2M literals, including their complements. For the 2M literals you then have 2M learning automata, or Tsetlin automata, and a Tsetlin automaton is nothing but a finite state machine. It's like a bounded counter with 2N states. The upper half is dedicated to deciding that you include that particular literal as part of your proposition (when I say proposition, that is the logical conjunction you want to propose from each clause), and the lower half means you exclude it. Now, interestingly, if you include something, it becomes important to your logical proposition, but if you exclude it, it becomes completely redundant. It does not matter anymore; it does not participate in your logical conjunction at all. So that leads to the next point.
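A minimal sketch of one such Tsetlin automaton as a bounded counter follows; the state bound `n` and the class layout are illustrative, not the packed representation used in real implementations.

```python
# A minimal sketch of one Tsetlin automaton: a bounded counter with 2*N states.
# States 1..N mean "exclude" the literal; states N+1..2N mean "include" it.

class TsetlinAutomaton:
    def __init__(self, n: int = 100):
        self.n = n
        self.state = n  # start on the exclude side of the boundary

    def include(self) -> bool:
        return self.state > self.n          # upper half of the counter

    def reward(self):
        # Reinforce the current decision: move deeper into its half.
        if self.include():
            self.state = min(self.state + 1, 2 * self.n)
        else:
            self.state = max(self.state - 1, 1)

    def penalize(self):
        # Weaken the current decision: move toward the other half.
        self.state += -1 if self.include() else 1

ta = TsetlinAutomaton()
ta.penalize()            # nudged across the boundary
print(ta.include())      # True: the literal is now included
```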

Speaker 1:

This is just to talk through how the actual training happens. Initially the automata make random decisions, so you make wrong decisions and get poor accuracy. But then you go back and ask: does this particular clause (the conjunction of the included literals is what we call a clause) contribute to a true positive or a true negative outcome? If it does, you keep making moves in the same direction, because you are making the right moves. If instead you get a false negative, you start reinforcing the patterns in the other direction. And if you get a false positive, there is a separate type of feedback that modifies the incorrect clause output, flipping it from one value to the other. For example, if the clause originally output one, the team of automata has to be changed to make sure the output changes from one to zero. So that is really the internal working principle. One clause is like one snapshot of a proposition, but if you have 100 clauses, those 100 will pick up their own propositions independently, in parallel.
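Below is a highly simplified sketch of that feedback loop, reusing the `TsetlinAutomaton` class from the sketch above. It is only loosely inspired by the Tsetlin Machine's Type I and Type II feedback, not a faithful reproduction, and the specificity parameter `s` is illustrative.

```python
# A highly simplified feedback sketch (loosely inspired by Tsetlin Machine
# Type I / Type II feedback; not the exact update rules).
import random

def clause_output(include_mask, literals):
    """AND over the included literals; an empty clause outputs 1 by this convention."""
    return all(lit for lit, inc in zip(literals, include_mask) if inc)

def feedback(automata, literals, clause_out, target, s: float = 3.0):
    """Apply one feedback step to the clause's team of automata."""
    for ta, lit in zip(automata, literals):
        if target == 1 and clause_out:
            # Clause fired on a positive sample: strengthen literals that are true,
            # and occasionally forget the rest (controls how specific the clause gets).
            if lit:
                if ta.include():
                    ta.reward()
                else:
                    ta.penalize()
            elif ta.include() and random.random() < 1.0 / s:
                ta.penalize()
        elif target == 0 and clause_out:
            # Clause fired on a negative sample: include a currently-false, excluded
            # literal so the conjunction turns the clause off next time.
            if not lit and not ta.include():
                ta.penalize()
```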

Speaker 1:

Right. Now, one of the beautiful properties I want you to look at quickly is that the same data is thrown at the team of automata for each clause independently, with all the other clauses running completely in parallel. So the same data gets fed into many clauses in parallel, all of them start to create their own logical patterns, and they then come together at one point, called the class summation point. Some of them will propose for a certain class, and some of them will oppose the definition of that class. This is where the binary nature of the Tsetlin Machine comes in: some are called positive clauses and some negative clauses, and once you have enough net support from the positive clauses, you can say that this indeed is a certain class, with a certain confidence.
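A minimal sketch of that class summation is below, with a hypothetical data layout in which each class has a list of 0/1 outputs from its positive and its negative clauses.

```python
# A minimal sketch of the class summation: positive clauses vote for a class,
# negative clauses vote against it, and the class with the highest sum wins.

def class_sum(positive_outputs, negative_outputs):
    return sum(positive_outputs) - sum(negative_outputs)

def classify(clauses_per_class):
    """clauses_per_class: {class_name: (positive_outputs, negative_outputs)}"""
    sums = {c: class_sum(pos, neg) for c, (pos, neg) in clauses_per_class.items()}
    return max(sums, key=sums.get), sums

label, sums = classify({
    "setosa":     ([1, 1, 0, 1], [0, 0, 1, 0]),   # strong net support
    "versicolor": ([0, 1, 0, 0], [1, 0, 1, 1]),   # mostly opposed
})
print(label, sums)   # setosa {'setosa': 2, 'versicolor': -2}
```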

Speaker 1:

Now, interestingly, to relate back to human nature: some of us are very confident in the things we do and say, and we say the same thing over and over again. Sometimes we are also gullible in matters of judgement, where we see that others are saying similar sorts of things, so we may well change our mind often. The same happens in the learning automata, because when the clauses pick their logical patterns independently, the positive clauses sometimes pick up weak correlations that are also present in the negative clauses. That is property number one that we'll be exploiting today. And the second property, which Tousif will have mentioned already on Tuesday, is sparsity.

Speaker 1:

If you look at the number of includes as opposed to the total number of learning automata, then for a very simple toy example like Iris classification, 79.2% are excludes. But if you go to MNIST, you're looking at 99%, and for CIFAR-10 you're in the order of 99.3%. For higher-order problems the proportion of excludes becomes really high, which leaves only a few patterns of includes that you're really interested in. Now, this is property number two that we'll be exploiting, sparsity, and property number one was that some weak includes appear in both the positive clauses and the negative clauses at the same time. If we exploit these two properties together, some interesting patterns come out. For example, across these different clauses you'll see that some includes appear in the positive clauses and the negative clauses at the same time: here, for example, and here, and so on. So what we're going to do for opportunity number one, the weak includes shared by positive and negative clauses, is take them out, because they are not really confident parts of the logical patterns we're trying to propose.
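Here is a minimal sketch of how such shared weak includes could be spotted, assuming each clause is represented as the set of indices of its included literals; this illustrates the idea, not the exact criterion used in the Newcastle work.

```python
# A minimal sketch of spotting "weak includes": literals that are included in
# both the positive and the negative clauses of the same class.

def weak_includes(positive_clauses, negative_clauses):
    in_pos = set().union(*positive_clauses)
    in_neg = set().union(*negative_clauses)
    return in_pos & in_neg   # candidates for removal

pos = [{2, 7, 11}, {2, 5}]       # include sets of the positive clauses
neg = [{7, 30}, {41, 5, 60}]     # include sets of the negative clauses
print(weak_includes(pos, neg))   # {5, 7}
```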

Speaker 1:

For those of you who want to understand a bit more about the Tsetlin Machine, we created a website as a single point of reference for all the information you need: tsetlinmachine.org. The working group is also maintaining a unified GitHub repository; if you search for the TMU (Tsetlin Machine Unified) GitHub repository, you will find a whole collection of datasets, algorithms and some ready-made examples.

Speaker 1:

So one of the other things I did not talk about today is the idea that, because you produce logical propositions, you are naturally interpretable, and that's really interesting as a property. One of the things I've seen across the three days, by the way, I really enjoyed this; it's the first time I've come to this symposium, and what I really liked is that there's a lot of practical, application-driven, industry focus and some really interesting demos. I would have liked to see a bit more on the explainability aspects, because this is emerging as an important topic, and when I wear my hat as a founder of Literal Labs, this is one of the USPs we want to sell: that the Tsetlin Machine, and similar algorithms, offer the interpretability properties we need to empower edge machine learning.

Speaker 1:

So let's look at the pattern of logics. If I aggregate all the logical patterns across all clauses, this is the shape of a two. I'm sorry, I don't think you can see it very well, but if you aggregate all the positive clauses together, you will see a nice little white gap in the middle, a "two" pattern emerging across all the twos that people have written; there are many different types of twos in MNIST, and this is just an example. But if you take all the negative clauses together, all the other numerical characters, the ones, threes, fours, fives and so on, will create another pattern, and that pattern is not similar to this one at all. They are very different.
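A minimal sketch of that aggregation follows, assuming each positive clause is stored as a 784-element include mask aligned with the 28x28 MNIST pixels; summing the masks gives the kind of heatmap described above.

```python
# A minimal sketch of the interpretability visualisation: count, pixel by pixel,
# how often each literal is included across the positive clauses of a class.
import numpy as np

def include_heatmap(clause_include_masks, shape=(28, 28)):
    """Count, per pixel, how many clauses include that pixel's literal."""
    counts = np.sum(np.asarray(clause_include_masks, dtype=np.int32), axis=0)
    return counts.reshape(shape)

# Toy example: 100 random positive clauses for the class "2".
rng = np.random.default_rng(0)
masks = rng.random((100, 784)) < 0.01     # ~1% includes, mirroring the sparsity
heat = include_heatmap(masks)
print(heat.shape, heat.max())             # (28, 28) and a small count per pixel
```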

Speaker 1:

However, because of the weak includes we talked about, some of the patterns here in the positive clauses are also included in the negative clauses; they are gullible, not confident. That leads to an interesting proposition: we can take them out. This is exactly what we did. We went class by class, independently, looking at the weak includes, and took them out. We also exploited the sparsity property by keeping only the chain of includes that matter for the logical proposition. So I took out the 99% of excludes, leaving only 1% of the model size, and then I took out roughly 50% of the remaining includes, the weak ones, halving the model size again. And that has an impact. Have a look.
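Below is a minimal sketch of the resulting compression, assuming the model is re-encoded as, per clause, a short list of the surviving include indices; the actual encoding used on the microchip and embedded platforms may differ.

```python
# A minimal sketch of the compression step: store each clause as a sparse list
# of its included literal indices only, after dropping the weak includes.

def compress_clause(include_mask, weak_literals):
    """Dense 0/1 include mask -> sparse list of kept literal indices."""
    return [i for i, inc in enumerate(include_mask) if inc and i not in weak_literals]

mask = [0] * 1568
mask[12] = mask[400] = mask[1001] = 1               # three includes out of 1568 literals
print(compress_clause(mask, weak_literals={400}))   # [12, 1001]
```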

Speaker 1:

The thing is that sometimes those gullible logical patterns are important, and I'll tell you why. In a voting exercise, campaigners often court the weak, less confident voters as their possible voters, because they know the strong voters will not change their mind; but if you court the weak, more gullible voters in the right way, you can actually get their votes. The same thing happens with the Tsetlin automata: because of the stochastic nature of the reinforcement, you have includes appearing on both sides, and if I remove them, we sometimes lose the ability to oppose. For example, I wanted to oppose something, but because you have taken me out, I cannot oppose it strongly enough.

Speaker 1:

That impact is seen as a loss of accuracy, but that loss is around, let's say, 0.8% for something like MNIST. If you look at the results, the accuracy stays pretty much in the same ballpark, dropping by only about 0.8%, but look at the reduction in model size. This is just for MNIST; for higher-order machine learning problems the model size reduction is even larger. Okay, now let's look at some quantitative comparisons with other machine learning approaches. The slide transition is not happening for some reason. Okay, now it's working.

Speaker 1:

So what Shengyu and the team did is go through a few tinyML machine learning problems. Point it that way? Yeah, okay, right. So we did the exercise of understanding how this new type of sparsity exploitation, with the discarding of weak includes, affects accuracy. As you can see, across the seven different datasets the accuracy changes by, let's say, 2.97% or 3.38%; it really depends on the size of the problem and the internal structure of the logical propositions. So you're looking at something like 3.38% at most. We're losing some accuracy by discarding those includes, so what do we get in return?

Speaker 1:

Let's look at the accuracy relative to the other algorithms: random forests, binarized neural networks, and the traditional Tsetlin Machine implementation that we have. You can see that across the board the accuracy is very comparable to the BNN implementations and even to the vanilla Tsetlin Machine implementations, but you are significantly faster: the inference time is almost 10x lower than some of the BNN implementations, and the memory footprint is significantly smaller than the BNN implementations as well.

Speaker 1:

BNNs, binarized neural networks, are supposed to be the lowest-complexity neural network implementations you will see out there. And then, of course, there are the results not included here: if you reduce latency and also reduce the memory footprint, that has a direct impact on your energy efficiency. We saw on the order of 20x energy efficiency improvement on the STMicro platforms across the different benchmarks as well. So this was an embedded systems exercise that went really well, and the next aim is to design a new type of IC around it. Thank you very much for listening. Thank you.