Self-modifying neural network

Recently, I’ve been working on a model I’d like to share. It’s not finished yet; I am struggling to get everything I’ve written down on paper into actual code. So I’ll be presenting the theory instead of the results.

My model is a merging of two interesting models: the differentiable neural computer (DNC) and the sparse, distributed, automatic, procedural memory (SDAPM). (The SDAPM is a model I designed myself, but have never published.)

The DNC is essentially a miniature CPU designed to be completely differentiable. It consists of two parts: the controller and the memory. The controller processes information and generates control signals to determine what and where memories get memorized and recalled. I suggest you read about it; it’s pretty interesting.

The SDAPM is a variant of heteroassociative sparse distributed memory, with an additional ‘consolidate’ function. As opposed to SDM’s two arrays for keys & values, SDAPM has an additional array called ‘transformations.’ When a memory is added to SDAPM, the corresponding spot in the transformations array is filled with an empty matrix. During consolidation, multiple memories are selected, and a Hebbian learning rule is used to combine these memories into a single transformation matrix which perfectly mimics the key-value pairings for each key, but linearly interpolates between them otherwise. (There’s some more complex stuff going on in the background, but that’s not super relevant.) The SDAPM is very interesting, because the consolidation step can be interpreted as generating a single layer for an MLP.

My combination of the two is (as a temporary name) called DNC-SDAPM. It replaces the memory of the DNC with the SDAPM. Since the DNC can select arbitrary memories and string them together, it should be able to compose multiple transformative memories into a tiny neural network for any given input. And since the DNC memorizes and the SDAPM consolidates during inference, it is capable of adaptation and the creation of new neural networks to accomplish new tasks. Another technique I plan on using is adaptive computation time (ACT), which would allow the DNC-SDAPM to construct a network with a potentially infinite number of layers.

Consolidation is super powerful just by itself. Even ignoring the benefits to differentiability and memory consumption, it allows a network to ‘cache’ multiple non-linear layers as a single linear one. A linear layer is significantly less powerful than a non-linear one, of course, but when you have fifty of them, you can view the collection as a piece-wise approximation of a non-linear function. This means that a network with ACT will become faster at processing the more it learns about a particular task.
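
To make the ‘caching’ idea concrete, here’s a toy sketch (not the model’s consolidation code) of the purely linear case: a stack of affine layers collapsed into a single cached layer that gives the same answer with one matrix multiply. The non-linear case is what consolidation is for; this just shows where the speed-up comes from.

    import numpy as np

    rng = np.random.default_rng(0)

    # Three affine layers, each computing y = x @ W + b.
    Ws = [rng.normal(size=(8, 8)) for _ in range(3)]
    bs = [rng.normal(size=8) for _ in range(3)]

    def run_stack(x):
        # Apply the layers one after another (three matmuls).
        for W, b in zip(Ws, bs):
            x = x @ W + b
        return x

    # Collapse the whole stack into one cached layer (W_total, b_total).
    W_total = np.eye(8)
    b_total = np.zeros(8)
    for W, b in zip(Ws, bs):
        W_total = W_total @ W
        b_total = b_total @ W + b

    x = rng.normal(size=8)
    print(np.allclose(run_stack(x), x @ W_total + b_total))  # True, with a single matmul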


This seems extremely interesting, as it’s a clear improvement over current static models that cannot adapt. It looks like a better simulation of intelligence, or of current human thought. I wonder what this could be applied to. If this is a model that works with LLMs, then I would suggest patenting it rather than open-sourcing it, as this could be a massive game changer for the industry.

I’ll be looking forward to how this develops.

Update:

I’m still working out various issues, but I do have some progress to note. Creating new layers during training is incredibly finicky. To a human, the benefit of doing so is obvious, but heteroassociative memory is chaotic, so gradient descent is very bad at estimating how useful the memory can be. People have gotten around this issue by adding artificial unpredictability to more stable parts of the model, making the memory more stable by comparison. (I’m specifically referring to Joerg Franke’s ADNC here.)

The way I designed the memory consolidator keeps it very separate from the controller of the model. Admittedly, it’s a strange choice, but one of the goals with a DNC is to have a model whose behavior doesn’t change as you add more memory slots. Given that constraint, you can’t have a lot of control while maintaining efficiency. The result is that consolidation requires a lot more attention to actually be trainable than the other parts of the model. The other day, I halved the loss on a copy task by including metadata in the consolidator’s input.

I haven’t gotten around to including adaptive computation time, since I’m starting small and trying to prove the effectiveness of the layer generation. I’m using a copy and paste task where the number of datapoints is much greater than the slots in the memory, forcing the use of layers to store the data more densely.

Here’s an image of the matrices when loss was 7-ish

This is fascinating. Transformers are outside of my comfort zone as of now, but I thank you for sharing. It brightens my day when I see stuff like this. Since I can’t be certain I understand everything you’ve said, I must continue to think from my own sphere of understanding. Your stuff seems a little dreamy.

Might there be a better way to learn about Transformers than to read ‘Attention Is All You Need’?

‘Attention Is All You Need’ is kind of foundational, but as the title says, the main thing to know is attention. Attention is pretty basic, but no one really describes it in a simple way.

So, let’s say that we have a sequence of vectors A through D, and another set of vectors E through H. We can build a table which pairs every single combination of A-D and E-H together.

     A       B       C       D
E 𝑓(A, E) 𝑓(B, E) 𝑓(C, E) 𝑓(D, E)
F 𝑓(A, F) 𝑓(B, F) 𝑓(C, F) 𝑓(D, F)
G 𝑓(A, G) 𝑓(B, G) 𝑓(C, G) 𝑓(D, G)
H 𝑓(A, H) 𝑓(B, H) 𝑓(C, H) 𝑓(D, H)

We can put each pair of vectors into a function 𝑓(x, y). Let’s define the function as:

𝑓(x,y) = (x_1 * y_1) + (x_2 * y_2) + (x_3 * y_3) + … and so on … + (x_N * y_N)

If we plug in some test values (one-dimensional, to keep it simple), we can see that 𝑓(x,y) tells us how similar two vectors are.

𝑓(-1,-1) = -1 * -1 =  1
𝑓( 1,-1) =  1 * -1 = -1
𝑓(-1, 1) = -1 *  1 = -1
𝑓( 1, 1) =  1 *  1 =  1

When x and y have different signs, the result is negative, but if they have the same sign, the result is positive.
Now, if we look back at our original table, we can see that we are actually calculating how similar each vector is to every other vector.
Attention is a little bit more complicated than this, but not by much. Attention uses the ‘softmax’ function, which takes in a vector and normalizes it so that every value is between 0 and 1 and the values all add up to one. Since 𝑓(x,y) takes in two vectors and returns a single value, we can apply the softmax function to each row of the table.

   A   B   C   D
E 0.1 0.3 0.2 0.4
F 0.1 0.2 0.4 0.3
G 0.2 0.4 0.2 0.2
H 0.2 0.2 0.1 0.5

As you can see, each row sums to one, and all values are between 0 and 1.
The next step for attention is to multiply each entry in the table by a vector.

     A       B       C       D
E 0.1 * A 0.3 * B 0.2 * C 0.4 * D
F 0.1 * A 0.2 * B 0.4 * C 0.3 * D
G 0.2 * A 0.4 * B 0.2 * C 0.2 * D
H 0.2 * A 0.2 * B 0.1 * C 0.5 * D

By multiplying each value by its column’s vector, we have turned the single values back into vectors. Now, we add up the entries in each row.

E : 0.1 * A + 0.3 * B + 0.2 * C + 0.4 * D = OUT_1
F : 0.1 * A + 0.2 * B + 0.4 * C + 0.3 * D = OUT_2
G : 0.2 * A + 0.4 * B + 0.2 * C + 0.2 * D = OUT_3
H : 0.2 * A + 0.2 * B + 0.1 * C + 0.5 * D = OUT_4

Doing this gives us our result, a sequence of vectors called ‘OUT.’
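
Here’s the whole walkthrough as a few lines of NumPy, with random vectors standing in for A-D and E-H:

    import numpy as np

    rng = np.random.default_rng(0)
    AD = rng.normal(size=(4, 3))   # rows are the vectors A, B, C, D
    EH = rng.normal(size=(4, 3))   # rows are the vectors E, F, G, H

    def softmax(s, axis=-1):
        e = np.exp(s - s.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    scores = EH @ AD.T         # the table of f(x, y) values (rows E-H, columns A-D)
    weights = softmax(scores)  # each row now sums to 1
    OUT = weights @ AD         # weighted sum of A-D per row: OUT_1 .. OUT_4
    print(OUT.shape)           # (4, 3)
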
This is a very basic example, however. The attention used in transformers is usually done like this:

𝑓(x) = softmax( [ Q(x_1) * K(x_1), Q(x_1) * K(x_2), ..., Q(x_1) * K(x_N) ] ) * V(x)

Where Q, K, and V are layers in a neural network. (The scores inside the softmax form a vector, one entry per position; V(x) means V applied to every vector in the sequence, and the results are weighted and summed using the softmax output. The line above gives the output for x_1; the other outputs are computed the same way with Q(x_2), Q(x_3), and so on.) This is different from our earlier 𝑓 function because it only takes in a single sequence of vectors instead of two. This is how real transformers work; it’s called ‘self-attention,’ because vectors in the sequence are compared to other vectors in the same sequence.
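
And the self-attention version as a minimal sketch; Q, K, and V are plain linear projections here, and I’m leaving out the score scaling and the multiple heads that real transformers use:

    import numpy as np

    rng = np.random.default_rng(1)
    N, d = 5, 8                          # sequence length, vector size
    x = rng.normal(size=(N, d))          # one sequence of vectors
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

    def softmax(s, axis=-1):
        e = np.exp(s - s.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    Q, K, V = x @ Wq, x @ Wk, x @ Wv     # Q(x), K(x), V(x) for every vector
    OUT = softmax(Q @ K.T) @ V           # the sequence attends to itself
    print(OUT.shape)                     # (N, d)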

The Transformer shares its other components with many other neural networks, so I won’t go over those here.

I taught myself by reading papers, but I’ve heard from others that 3blue1brown’s video was useful for them, as was Andrej Karpathy’s series.

The 3blue1brown video was very well done.

In your case, you mentioned that the DNC-SDAPM model is

capable of adaptation and the creation of new neural networks to accomplish new tasks

There must be several use-cases for such a model. Perhaps simple ones, such as marking a change, or complex ones, such as reducing loss.

I’m a little puzzled by the loss of 7, though. How good is that, and does the model show signs of capturing patterns? For example, is there a plottable decrease in loss and a plottable increase in scoring metrics on train, test, and validation data?

a single transformation matrix which perfectly mimics the key-value pairings for each key, but linearly interpolates between them otherwise.

This sounds interesting. The 3blue1brown video covered the key and value matrices, but this linear interpolation between them may not have been in the video.

The SDAPM is very interesting, because the consolidation step can be interpreted as generating a single layer for an MLP.

It seems like an interaction term acting to connect layers. I’ve never really gotten past nodes that are logistic, so anything linear is new to me.

The use of the Hessian is interesting. The Hessians I use track movement as a second-order rate of change. In fact, I’ve never used a key-value matrix in a NN layer. So, a small thing I’m missing is understanding what exactly is being captured inside the memory slots. Since the Hessian is used, my immediate thought was that it is capturing a point in the gradient. In some imagined space, it makes sense that all the memory slots would be uniform and descending along with the loss. And if the memory is combining two layers, then that fits too. Although, I’m not sure I got that quite right.

You mentioned a lot of benefits and features, such as adaptability and self-forming mini-neural-networks,

But also mentioned that

So I’m not sure how the new mini-neural-networks achieve a result. You highlighted differentiability, so any mini-neural-net must crunch weights and either reintroduce its output at the same node where the mini-neural-net was created, or take another path to the loss function. That would make this statement particularly interesting for a number of reasons.

The controller processes information and generates control signals to determine what and where memories get memorized and recalled

Am I at all close, or have I gotten too far off?

You are close, but my model is strictly more powerful than a transformer. SDAPM is a transformer-like type of memory, but it is capable of significantly more complicated associations. You should put transformers to the side when thinking about the DNC-SDAPM; it’s better to start from a clean slate, since there are more differences than similarities.

The SDAPM works like this:
The memory is split into three arrays: Bv, M, and Bw.

  1. Each step, it receives three vectors, which we will call ‘words.’
  2. Word 1 is the read word. The read word is compared against all the keys in memory. The keys are stored in Bv. This gives us a vector for the similarity of each key compared to the read word, which we will call A. (It is of size batch_size x number_of_memory_slots.)
  3. We then construct a weighted average of all the memory slots by multiplying their values by A, and then summing.
    Bv' = sum(Bv*A)
    M' = sum(M*A)
    Bw' = sum(Bw*A)
  4. Now, we can calculate the result like so:
    result = (word_1 - Bv') @ M' + Bw'
    Where @ is vector-matrix multiplication.
    This is equivalent to a linear layer with weights of M'.
  5. Writing is much simpler. We add a new slot to memory (i.e. we add another slot to Bv, M, and Bw), then store word 2 in the new Bv slot, an identity matrix in the new M slot, and word 3 in the new Bw slot.
  6. For consolidation, we repeat steps 1 and 2. However, we now use A to identify memory slots which should be combined together.
  7. Using a Moore-Penrose pseudoinverse, we can construct a matrix which would turn a key into its corresponding value.
  8. Then, we use A to construct a weighted average of all generated matrices.
  9. The newly generated matrix is put into a new memory slot, along with its corresponding pre-bias and post-bias.
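
Here’s a rough sketch of those steps in plain NumPy. I’ve written the similarity step as a dot product followed by a softmax, and kept everything single-example and single-head, so this is an approximation of the scheme above rather than the actual implementation:

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    class SDAPM:
        def __init__(self, dim):
            self.dim = dim
            self.Bv = []   # keys / pre-biases
            self.M = []    # transformation matrices
            self.Bw = []   # values / post-biases

        def write(self, word_2, word_3):
            # Step 5: new slot = (key, identity transform, value).
            self.Bv.append(word_2)
            self.M.append(np.eye(self.dim))
            self.Bw.append(word_3)

        def read(self, word_1):
            # Steps 2-4: similarity weights A (softmax is my choice here),
            # weighted averages of the slots, then one linear layer.
            A = softmax(np.stack(self.Bv) @ word_1)
            Bv_avg = A @ np.stack(self.Bv)
            M_avg = np.tensordot(A, np.stack(self.M), axes=1)
            Bw_avg = A @ np.stack(self.Bw)
            return (word_1 - Bv_avg) @ M_avg + Bw_avg

        def consolidate(self, word_1):
            # Steps 6-9: per-slot key->value matrices via the Moore-Penrose
            # pseudoinverse, averaged with A and stored as a new slot.
            A = softmax(np.stack(self.Bv) @ word_1)
            mats = [np.linalg.pinv(k[None, :]) @ v[None, :]
                    for k, v in zip(self.Bv, self.Bw)]
            # The new pre-/post-bias below is just one plausible choice.
            self.Bv.append(A @ np.stack(self.Bv))
            self.M.append(np.tensordot(A, np.stack(mats), axes=1))
            self.Bw.append(A @ np.stack(self.Bw))

    mem = SDAPM(dim=4)
    mem.write(np.ones(4), np.arange(4.0))
    mem.write(np.arange(4.0), np.ones(4))
    mem.consolidate(np.ones(4))
    print(mem.read(np.ones(4)))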

This is a very simplified overview. In reality, there’s some additional work we need to do to ensure each consolidated matrix actually maps the keys to values correctly. There’s also the fact that the DNC uses multiple addressing modes, which seriously changes the result of the read. (Also, this is a single-head example, whereas the model has multiple read/write heads.)

Anyway, this complicated algorithm means that every consolidated matrix is differentiable with respect to the controller, which allows them to be part of the backward pass and to affect how the DNC controller handles reading and writing to memory (and consequently the generation of consolidated matrices).


As for your question about an infinite number of layers: that would only really be possible when using ACT. If we have some number of consolidated memory slots (a.k.a. linear layers) and we have a task where the model is trying to generate the next word in a sequence, we can feed the output of the entire model back into its input. Basically, if layer 1 feeds into layer 2, and layer 2 feeds into layer 3, and so on all the way to layer N, that gives us the next token in the sequence. But if instead we want to know the token after that one, we can feed the result of layer N back into layer 1 and repeat the process. We can repeat this any number of times, effectively making a model with an arbitrary (and potentially infinite) number of layers.
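
A toy loop to illustrate the idea (the matrices here are just stand-ins for consolidated slots, not the real model):

    import numpy as np

    rng = np.random.default_rng(1)
    # Stand-ins for N consolidated memory slots, i.e. linear layers.
    layers = [rng.normal(size=(6, 6)) * 0.2 for _ in range(4)]

    def run_once(x):
        # One pass through layers 1..N: this would give the next token.
        for W in layers:
            x = x @ W
        return x

    x = rng.normal(size=6)
    outputs = []
    for _ in range(3):
        x = run_once(x)     # feed the output of layer N back into layer 1
        outputs.append(x)   # three passes = effectively 3 * N layers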

ACT can also lead to multiple layers being merged into one. If we are still doing a generation task, there might be 10 intermediate layers between the input we receive and the output we generate. But if we memorize the input as a key and the output as a value, we can collect examples of the inputs to those 10 layers and their corresponding outputs. Then, if we consolidate and make a new layer out of these examples, we have essentially ‘compressed’ ten layers into one. And running a single layer is significantly faster than running 10 of them, giving roughly a 10x speed-up at the cost of some accuracy.
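
Here’s a sketch of that compression step as a toy, with a tanh stack standing in for the 10 intermediate layers and a pseudoinverse standing in for consolidation:

    import numpy as np

    rng = np.random.default_rng(2)
    d = 16
    Ws = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(10)]

    def ten_layers(x):
        # The 10 intermediate layers we want to compress.
        for W in Ws:
            x = np.tanh(x @ W)
        return x

    keys = rng.normal(size=(8, d))                    # memorized inputs
    values = np.stack([ten_layers(k) for k in keys])  # memorized outputs

    # Fit a single affine layer to the (key, value) examples.
    X = np.hstack([keys, np.ones((len(keys), 1))])
    Wb = np.linalg.pinv(X) @ values

    # Exact on the memorized examples, one matmul instead of ten.
    print(np.allclose(X @ Wb, values))  # True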


Also, I have to confess that the results of the model have been lackluster so far. It’s very difficult to manage the ‘bigness’ of the values stored in memory. In instances where I don’t completely reset memory between epochs, the memory will gradually grow in size. This effectively adds noise to the SDAPM and makes it unreliable in the long term. The model will increase its dependence on the SDAPM for the first thousand epochs, but after that it will completely forsake it (which causes the loss to spike).

I’m recording loss differently than before. Now the lowest the loss gets is around 12, which corresponds to 86% accuracy on the copy-and-paste task. I’ve been taking a break from working on it the past few days, but I’ll change a couple of things and see if I can get the accuracy any better.