General Architectural Design Considerations for Neural Networks
Universal Approximation Theorem, Depth, Connections
The Setup
A neural network with n layers has n-1 hidden layers and 1 output layer. We have already studied hidden units and output units. Layers 1 through n-1 are hidden layers, in other words every layer except the last one, whereas the last one is called the output layer. We studied what these layers consist of, units, as well as what these units consist of, weights and activation functions.
But what else can vary in a neural network? The other feature we can vary is the overall structure of the network, in other words the shape of the graph: a discrete structure with nodes and edges.
However, neural nets are only a subset of all graphs. A prospective neural net architecture needs to satisfy certain requirements before it can be certified as a valid architecture.
Looking at a sheet of neural network architectures, it's not immediately obvious what these requirements are. But what we can see is that there are many varieties of structure. If we simplify down to only nodes and edges, then we can begin the study. Here's the taxonomy of neural networks:
- Neural networks — the largest unit of organization, consisting of layers
- Layers — second largest unit of organization, there can be input layers, hidden layers, and output layers, these layers consist of units or nodes
- Units — the basic building blocks of a neural network, these are made up of weights and an activation function, and are connected to each other by edges or connections
- Connections — the abstract concept of connections between nodes, in mathematical graphs, they’re called edges or arcs
Many neural networks compute layer by layer, meaning each layer interacts only with the preceding layer and the following layer. But this strict requirement is not a hard rule; there are existing, and likely undiscovered, architectures that relax it.
The notation for a layer can be understood as:

h⁽¹⁾ = g⁽¹⁾(W⁽¹⁾ᵀx + b⁽¹⁾)

Here the input layer is equivalent to the vector x, and g, W, and b are the layer's activation function, weights, and biases. And we build the second layer recursively by putting the first layer into it:

h⁽²⁾ = g⁽²⁾(W⁽²⁾ᵀh⁽¹⁾ + b⁽²⁾)
You can continue this recursion as long as you like. For each network we can vary the number of layers, the depth of the network, and for each layer we can vary the number of nodes, also referred to as the width of the network.
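This recursion is easy to sketch in plain NumPy (the widths and random weights below are arbitrary, chosen purely for illustration): each layer applies h = g(Wh + b) to the previous layer's output, so depth is just the number of (W, b) pairs and width is the size of each one.

```python
import numpy as np

def forward(x, layers, activation=np.tanh):
    """Apply h = activation(W @ h + b) once per (W, b) pair in `layers`."""
    h = x
    for W, b in layers:
        h = activation(W @ h + b)
    return h

rng = np.random.default_rng(0)
widths = [4, 8, 8, 3]  # input width 4, two hidden layers of width 8, output width 3
layers = [
    (rng.normal(size=(n_out, n_in)), np.zeros(n_out))
    for n_in, n_out in zip(widths[:-1], widths[1:])
]

x = rng.normal(size=4)
y = forward(x, layers)
print(y.shape)  # (3,)
```

Changing `widths` is all it takes to make the network deeper (more entries) or wider (larger entries).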
Deeper Networks
Deeper networks can generally get the job done with fewer units per layer and far fewer parameters, which tends to lead to good generalization on a test set. But they can also be difficult to optimize.
Shallow Networks
Shallow networks generally need more units per layer and more parameters to accommodate the same complexity. With those additional parameters comes a greater risk of overfitting, so one might not reach acceptable levels of accuracy. But shallow networks are easier to optimize, and a single hidden layer can certainly get the job done on simpler problems.
Experimentation
For the same problem, if we are so inclined, we can try as many different configurations as time permits and find the ideal architecture. This means keeping most things constant while varying the structure and monitoring the validation set error, prediction accuracy, and so on.
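A minimal sketch of this protocol, on a made-up toy regression problem: everything is held fixed except the width of a single hidden layer, and the configuration with the lowest validation error wins. To keep the sketch short and runnable, the hidden weights here are random and only the output layer is fit by least squares, rather than training end to end by gradient descent; the search loop is the point, not the training method.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D regression problem: y = sin(3x) plus noise (hypothetical data).
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=200)
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

def fit_and_score(width):
    """One hidden tanh layer of `width` random features; linear output by least squares."""
    W = rng.normal(size=(1, width))
    b = rng.normal(size=width)
    H_train = np.tanh(X_train @ W + b)
    w_out, *_ = np.linalg.lstsq(H_train, y_train, rcond=None)
    H_val = np.tanh(X_val @ W + b)
    return np.mean((H_val @ w_out - y_val) ** 2)  # validation MSE

# Vary only the width, monitor validation error, keep the best.
scores = {width: fit_and_score(width) for width in [1, 4, 16, 64]}
best_width = min(scores, key=scores.get)
print(best_width, scores[best_width])
```

The same loop works for depth or any other structural choice: fix the rest, sweep one knob, and let the validation set decide.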
Universal Approximation Properties and Depth
Without activation functions, a series of matrix operations on an input collapses into a single equivalent linear operation, so these matrix multiplications end up being able to represent only a linear function. This can be useful if the real data generating function is linear, but when we want to represent nonlinear functions, it's not very useful. You might think that for every different class of nonlinear function we would have to come up with a suited model. But it turns out that a basic feed-forward neural network is a universal approximator, and that activation functions on the nodes are the key.
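The collapse of stacked linear layers into a single linear map is easy to verify numerically (a small NumPy check with arbitrary shapes):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 3))   # first "layer" (no activation function)
W2 = rng.normal(size=(2, 5))   # second "layer" (no activation function)
x = rng.normal(size=3)

# Two linear layers applied in sequence...
two_layers = W2 @ (W1 @ x)
# ...are exactly one linear layer whose matrix is the product W2 @ W1.
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layers, one_layer))  # True
```

Inserting a nonlinearity such as `np.tanh` between the two products breaks this equivalence, which is exactly why activation functions matter.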
Universal Approximation Theorem
The universal approximation theorem states that a feed-forward network with a single hidden layer containing a finite number of neurons can approximate arbitrarily well real-valued continuous functions on compact subsets of Rⁿ. The theorem thus states that simple neural networks can represent a wide variety of interesting functions when given appropriate parameters or weights. On the other hand, it does not provide a construction for the weights; it merely states that such a construction is possible.
-Wikipedia
Specifically, the theorem states that a feed-forward neural network with a linear output layer and at least one hidden layer with a squashing activation function can approximate any Borel measurable function from one finite-dimensional space to another with any desired non-zero amount of error, provided that the network is given enough hidden units.
-Deep Learning
Initially, the universal approximation theorem was proven only for sigmoid activation functions. It was eventually proven for other squashing activation functions as well, and later results show it holds for any non-polynomial activation function.
Borel Measurable Function
A Borel measurable function is, for our purposes, a function mapping data from one topological space to another such that the preimage of every Borel set is again a Borel set. The Borel sets of a space are formed by starting from the open sets and repeatedly taking complements, countable unions, and countable intersections.
For our purpose we consider any continuous function on a closed and bounded subset of Rn as Borel Measurable.
Failure to Learn
The universal approximation theorem says that given enough hidden units and parameters, a multilayer perceptron can represent any function. But representation and learning are separate rungs on a ladder. In order to get useful information out of a model, we need our neural network to be able to both represent and learn from the data.

First the neural network must be able to represent the real data generating function; only then can it learn the function. This learning can fail for several reasons. For one, the optimization algorithm may not be able to find the right values of the parameters. For another, the training algorithm may overfit the data.
Starting Point of Depth
The universal approximation theorem says that there exists a large enough network to represent any complicated function. But it doesn’t say where it is or when to stop as we increase the size.
In many ways, the best place to start is one hidden layer, because a single hidden layer is in principle sufficient to represent any function. But since it may not learn the function in practice, one can keep adding layers until the generalization error is minimized.
Interpretations on Additional Layers
We can understand the step from each layer to the next as a fold in the space, where each fold simplifies the shape of the data.
Another way to interpret each step is to think of them as mini functions, that eventually get us to the final result. And yet another way to interpret them is to think of each layer as learning a specific feature of the data, where each step uses the output of the previous layer.
You can also interpret these layers as each learning a single step of a multi step process, for example, first select all blue objects, then select all spherical objects, then select all objects made of metal… Which in the end gets you all the blue spherical metal objects out of a pool of random objects.
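That multi-step interpretation can be mirrored in plain Python (a toy, non-neural illustration with a made-up pool of objects): each "layer" applies one selection step to the output of the previous one.

```python
# Hypothetical pool of objects, each described by (color, shape, material).
objects = [
    ("blue", "sphere", "metal"),
    ("blue", "cube", "metal"),
    ("red", "sphere", "metal"),
    ("blue", "sphere", "wood"),
]

# Each "layer" consumes the previous layer's output and refines it one step.
step1 = [o for o in objects if o[0] == "blue"]    # first select all blue objects
step2 = [o for o in step1 if o[1] == "sphere"]    # then all spherical objects
step3 = [o for o in step2 if o[2] == "metal"]     # then all objects made of metal

print(step3)  # [('blue', 'sphere', 'metal')]
```

The composition of simple steps yields the complicated overall selection, just as composed layers yield a complicated overall function.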
Other Architectural Considerations
So far we have limited our design considerations to networks with a chained layer architecture, but there are other possibilities.
CNNs and RNNs
Since CNNs and RNNs involve special considerations for computer vision and sequence processing, their design will be considered separately.
Skip Connections
In general, there's no rule that says layers have to be chained. Skip connections are edges in the graph that skip a layer, connecting layer i to layer i+2 or higher.
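A minimal NumPy sketch of a skip connection, assuming (for simplicity) that the skipped-from and skipped-to layers have matching widths: the input is carried past the first layer by an extra edge and added into the second layer's pre-activation.

```python
import numpy as np

rng = np.random.default_rng(0)
width = 6
W1, W2 = rng.normal(size=(2, width, width))
b1, b2 = np.zeros(width), np.zeros(width)

x = rng.normal(size=width)
h1 = np.tanh(W1 @ x + b1)

# Plain chained layer: layer 2 sees only h1.
h2_chained = np.tanh(W2 @ h1 + b2)

# Skip connection: an extra edge carries x past layer 1, directly into layer 2.
h2_skip = np.tanh(W2 @ h1 + b2 + x)

print(np.allclose(h2_chained, h2_skip))  # False (the skip edge changes the output)
```

Residual networks use exactly this pattern; the extra edge also gives gradients a shorter path back to earlier layers.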
Dense vs Sparse Connections
And the last key consideration is how each layer is connected to the next. By default, layers are connected in a dense, fully connected manner, but using fewer than the full set of connections is popular too. Fewer connections generally mean less computation and fewer parameters. The right sparse connectivity is often problem dependent, which we will study separately when considering CNNs, RNNs, and so on.
Other Articles
This post is part of a series of stories that explores the fundamentals of deep learning:
1. Linear Algebra Data Structures and Operations: Objects and Operations
2. Computationally Efficient Matrices and Matrix Decompositions: Inverses, Linear Dependence, Eigendecompositions, SVD
3. Probability Theory Ideas and Concepts: Definitions, Expectation, Variance
4. Useful Probability Distributions and Structured Probabilistic Models: Activation Functions, Measure and Information Theory
5. Numerical Method Considerations for Machine Learning: Overflow, Underflow, Gradients and Gradient Based Optimizations
6. Gradient Based Optimizations: Taylor Series, Constrained Optimization, Linear Least Squares
7. Machine Learning Background Necessary for Deep Learning I: Generalization, MLE, Kullback Leibler Divergence
8. Machine Learning Background Necessary for Deep Learning II: Regularization, Capacity, Parameters, Hyperparameters
9. Principal Component Analysis Breakdown: Motivation, Derivation
10. Feedforward Neural Networks: Layers, Definitions, Kernel Trick
11. Gradient Based Optimizations Under The Deep Learning Lens: Stochastic Gradient Descent, Cost Function, Maximum Likelihood
12. Output Units For Deep Learning: Stochastic Gradient Descent, Cost Function, Maximum Likelihood
13. Hidden Units For Deep Learning: Activation Functions, Performance, Architecture
14. The Common Approach to Binary Classification: The most generic way to setup your deep learning models to categorize movie reviews
Up Next…
Coming up next is n-class categorization with neural networks. If you would like me to write another article explaining a topic in depth, please leave a comment.
For the table of contents and more content click here.
References
Adams, R. A. (2017). Calculus. Prentice-Hall.
Chollet, F. (2018). Deep Learning with Python and Keras. MITP-Verlags GmbH & Co. KG.
Goodfellow, I. (2017). Deep Learning. MIT Press.
Nicholson, K. (2009). Linear Algebra with Applications.
Sutton, R. S. (2018). Reinforcement Learning. A Bradford Book.
Wackerly, D. D. (2007). Mathematical Statistics with Applications. Belmont, CA: Nelson Education.
(n.d.). A First Course In Linear Algebra — Open Textbook Library. Retrieved February 24, 2020, from https://open.umn.edu/opentextbooks/textbooks/a-first-course-in-linear-algebra-2017