Recently, I’ve been working on a model I’d like to share. It’s not finished yet; I am struggling to get everything I’ve written down on paper into actual code. So I’ll be presenting the theory instead of the results.
My model is a merger of two interesting models: the differentiable neural computer (DNC) and the sparse distributed automatic procedural memory (SDAPM). (The SDAPM is a model I designed myself but have never published.)
The DNC is essentially a miniature CPU designed to be completely differentiable. It consists of two parts: the controller and the memory. The controller processes information and generates control signals that determine which memories get written and recalled, and where. I suggest reading about it; it's pretty interesting.
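To make the "differentiable CPU" idea concrete, here's a minimal sketch of content-based addressing over an external memory, assuming numpy and made-up sizes. It is not the full DNC (which adds usage-based allocation, temporal links, and a learned controller emitting these signals), just the soft read/write core that keeps everything differentiable.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def content_weights(memory, key, beta):
    """Cosine-similarity addressing: how strongly each memory slot matches the key."""
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    return softmax(beta * sims)

def read(memory, weights):
    """Weighted sum over slots -- a differentiable recall."""
    return weights @ memory

def write(memory, weights, erase, add):
    """Differentiable write: each slot is erased and updated in proportion to its weight."""
    return memory * (1 - np.outer(weights, erase)) + np.outer(weights, add)

# Toy usage with a 16-slot, 8-dimensional memory.
M = np.random.randn(16, 8) * 0.1
key = np.random.randn(8)
w = content_weights(M, key, beta=5.0)
M = write(M, w, erase=np.ones(8) * 0.5, add=np.random.randn(8))
r = read(M, w)
```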
The SDAPM is a variant of heteroassociative sparse distributed memory (SDM) with an additional 'consolidate' function. In addition to SDM's two arrays for keys and values, the SDAPM has a third array called 'transformations.' When a memory is added to the SDAPM, the corresponding spot in the transformations array is filled with an empty matrix. During consolidation, multiple memories are selected, and a Hebbian learning rule combines them into a single transformation matrix that perfectly reproduces the key-value pairing for each stored key and linearly interpolates between them otherwise. (There's some more complex stuff going on in the background, but it's not super relevant here.) The SDAPM is very interesting because the consolidation step can be interpreted as generating a single layer of an MLP.
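Here is a sketch of that structure as I've described it, assuming numpy: three parallel arrays for keys, values, and transformation slots, with consolidation folding a group of key-value pairs into one matrix via a Hebbian (outer-product) rule. With orthonormal keys the consolidated matrix recovers each stored value exactly and interpolates linearly for everything else; the "more complex stuff in the background" is omitted.

```python
import numpy as np

class SDAPMSketch:
    def __init__(self, key_dim, value_dim):
        self.keys, self.values, self.transforms = [], [], []
        self.key_dim, self.value_dim = key_dim, value_dim

    def add(self, key, value):
        self.keys.append(key)
        self.values.append(value)
        # The transformation slot starts out as an empty (zero) matrix.
        self.transforms.append(np.zeros((self.value_dim, self.key_dim)))

    def consolidate(self, indices):
        # Hebbian rule: accumulate the outer product of each value with its key.
        T = np.zeros((self.value_dim, self.key_dim))
        for i in indices:
            T += np.outer(self.values[i], self.keys[i])
        # Store the consolidated matrix back into the selected slots.
        for i in indices:
            self.transforms[i] = T
        return T

# Toy usage: two orthonormal keys, so the consolidated matrix reproduces both pairings.
mem = SDAPMSketch(key_dim=2, value_dim=3)
mem.add(np.array([1.0, 0.0]), np.array([1.0, 2.0, 3.0]))
mem.add(np.array([0.0, 1.0]), np.array([-1.0, 0.0, 1.0]))
T = mem.consolidate([0, 1])
assert np.allclose(T @ np.array([1.0, 0.0]), [1.0, 2.0, 3.0])
```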
My combination of the two is called the DNC-SDAPM (a temporary name). It replaces the DNC's memory with the SDAPM. Since the DNC can select arbitrary memories and string them together, it should be able to compose multiple transformation memories into a tiny neural network for any given input. And since the DNC memorizes and the SDAPM consolidates during inference, it is capable of adaptation: it can create new neural networks to accomplish new tasks. Another technique I plan on using is adaptive computation time (ACT), which would allow the DNC-SDAPM to construct a network with a potentially unbounded number of layers.
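A rough sketch of what composition plus ACT might look like, again in numpy. The `select_transform` and `halting_prob` functions are hypothetical placeholders for what the DNC controller would actually learn (and in the real model the selection would be soft and differentiable, not an argmax).

```python
import numpy as np

def select_transform(state, transforms):
    # Hypothetical content-based pick: the consolidated matrix whose action
    # on the current state has the largest norm.
    scores = [np.linalg.norm(T @ state) for T in transforms]
    return transforms[int(np.argmax(scores))]

def halting_prob(state):
    # Hypothetical halting signal; in ACT this is a learned sigmoid output.
    return 1.0 / (1.0 + np.exp(-state.mean()))

def run(x, transforms, max_steps=50, threshold=0.99):
    """Chain consolidated transformations into a per-input 'tiny network',
    stopping once the accumulated halting probability crosses the threshold."""
    state, cumulative = x, 0.0
    for _ in range(max_steps):
        state = select_transform(state, transforms) @ state
        cumulative += halting_prob(state)
        if cumulative >= threshold:
            break
    return state

# Toy usage with two random consolidated matrices.
transforms = [np.random.randn(4, 4) * 0.5 for _ in range(2)]
y = run(np.random.randn(4), transforms)
```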
Consolidation is super powerful just by itself. Even ignoring the benefits to differentiability and memory consumption, it allows a network to 'cache' multiple non-linear layers into a single linear one. A single linear layer is, of course, far less expressive than a non-linear one, but when you have fifty of them, selected per input, you can view the result as a piecewise-linear function, which is non-linear overall. This means that a network with ACT will become faster at processing the more it learns about a particular task.
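To illustrate the caching idea as I understand it: a ReLU network is piecewise linear, so for any fixed on/off pattern of the ReLUs, the stacked non-linear layers collapse into a single linear map. If that map is consolidated, inputs falling in the same region need one matrix multiply instead of the full depth, which is where the ACT speed-up would come from. A sketch in numpy (biases omitted):

```python
import numpy as np

def collapse(weights, x):
    """Fold a stack of ReLU layers into one matrix, valid for any input that
    shares x's ReLU activation pattern."""
    T = np.eye(x.shape[0])
    h = x
    for W in weights:
        pre = W @ h
        mask = (pre > 0).astype(float)   # which units are active in this region
        T = np.diag(mask) @ W @ T        # cached linear map so far
        h = mask * pre                   # same as relu(pre)
    return T

# Toy usage: three ReLU layers collapse into one matrix that matches the full pass.
weights = [np.random.randn(5, 5) for _ in range(3)]
x = np.random.randn(5)
T = collapse(weights, x)
full = x
for W in weights:
    full = np.maximum(W @ full, 0)
assert np.allclose(T @ x, full)
```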