General Architectural Design Considerations for Neural Networks
Universal Approximation Theorem, Depth, Connections
The Setup
A neural network with n layers has n-1 hidden layers and 1 output layer. We have already studied hidden units and output units. Layers 1 through n-1 are hidden layers, in other words every layer except the last one, whereas the last one is called the output layer. We studied what these layers consist of, units, as well as what these units consist of, weights and activation functions.
But what else can vary in a neural network? The other feature we can vary is the overall structure of the network, in other words the shape of the graph: a discrete structure with nodes and edges.
However, neural nets are only a subset of all graphs. A prospective neural net architecture needs to satisfy certain requirements before it can be certified as a valid architecture.
Looking at a sheet of neural network architectures, it's not immediately obvious what these requirements are. But what we can see is that there are many varieties of structure. If we simplify down to only nodes and edges, then we can begin the study. Here's the taxonomy of neural networks:
- Neural networks — the largest unit of organization, consisting of layers
- Layers — second largest unit of organization, there can be input layers, hidden layers, and output layers, these layers consist of units or nodes
- Units — the basic building blocks of a neural network, these are made up of weights and an activation function, and are connected to each other by edges or connections
- Connections — the abstract concept of connections between nodes, in mathematical graphs, they’re called edges or arcs
Many neural networks compute layer by layer, meaning each layer interacts only with the preceding layer and the following layer. But this strict requirement is not a hard rule; there are existing, and likely undiscovered, architectures that relax it.
The notation for a layer can be understood as:

h⁽¹⁾ = g⁽¹⁾(W⁽¹⁾ᵀx + b⁽¹⁾)

Here the input layer is equivalent to the vector x, and g, W, and b are the layer's activation function, weights, and biases. And we build the second layer recursively by putting the first layer into it:

h⁽²⁾ = g⁽²⁾(W⁽²⁾ᵀh⁽¹⁾ + b⁽²⁾)
You can continue this recursion as long as you like. For each network we can vary the number of layers, the depth of the network, and for each layer we can vary the number of nodes, also referred to as the width of the network.
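This recursion is easy to sketch in plain NumPy (the widths and random weights below are arbitrary, chosen purely for illustration): each layer applies h = g(Wh + b) to the previous layer's output, so depth is just the number of (W, b) pairs and width is the size of each one.

```python
import numpy as np

def forward(x, layers, activation=np.tanh):
    """Apply h = activation(W @ h + b) once per (W, b) pair in `layers`."""
    h = x
    for W, b in layers:
        h = activation(W @ h + b)
    return h

rng = np.random.default_rng(0)
widths = [4, 8, 8, 3]  # input width 4, two hidden layers of width 8, output width 3
layers = [
    (rng.normal(size=(n_out, n_in)), np.zeros(n_out))
    for n_in, n_out in zip(widths[:-1], widths[1:])
]

x = rng.normal(size=4)
y = forward(x, layers)
print(y.shape)  # (3,)
```

Changing `widths` is all it takes to make the network deeper (more entries) or wider (larger entries).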
Deeper Networks
Deeper networks can generally get the job done with fewer units per layer and far fewer parameters, which tends to lead to good generalization on a test set. But they can also be difficult to optimize.
Shallow Networks
Shallow networks generally need more units per layer and more parameters to accommodate the same complexity. With those additional parameters comes a greater risk of overfitting, so one might not reach acceptable levels of accuracy. But shallow networks are easier to optimize, and a single hidden layer can certainly get the job done on simpler problems.
Experimentation
For the same problem, if we are so inclined, we can try as many different configurations as time permits and find the ideal architecture. This means keeping most things constant while varying the structure and monitoring the validation set error, prediction accuracy, and so on.
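A minimal sketch of this protocol, on a made-up toy regression problem: everything is held fixed except the width of a single hidden layer, and the configuration with the lowest validation error wins. To keep the sketch short and runnable, the hidden weights here are random and only the output layer is fit by least squares, rather than training end to end by gradient descent; the search loop is the point, not the training method.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D regression problem: y = sin(3x) plus noise (hypothetical data).
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=200)
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

def fit_and_score(width):
    """One hidden tanh layer of `width` random features; linear output by least squares."""
    W = rng.normal(size=(1, width))
    b = rng.normal(size=width)
    H_train = np.tanh(X_train @ W + b)
    w_out, *_ = np.linalg.lstsq(H_train, y_train, rcond=None)
    H_val = np.tanh(X_val @ W + b)
    return np.mean((H_val @ w_out - y_val) ** 2)  # validation MSE

# Vary only the width, monitor validation error, keep the best.
scores = {width: fit_and_score(width) for width in [1, 4, 16, 64]}
best_width = min(scores, key=scores.get)
print(best_width, scores[best_width])
```

The same loop works for depth or any other structural choice: fix the rest, sweep one knob, and let the validation set decide.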
Universal Approximation Properties and Depth
Without activation functions, a series of matrix operations on an input collapses into a single equivalent linear operation, so these matrix multiplications end up being able to represent only a linear function. This can be useful if the real data generating function is linear, but when we want to represent nonlinear functions, it's not very useful. You might think that for every different class of nonlinear function we would have to come up with a suited model. But it turns out that a basic feed-forward neural network is a universal approximator, and that activation functions on the nodes are the key.
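The collapse of stacked linear layers into a single linear map is easy to verify numerically (a small NumPy check with arbitrary shapes):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 3))   # first "layer" (no activation function)
W2 = rng.normal(size=(2, 5))   # second "layer" (no activation function)
x = rng.normal(size=3)

# Two linear layers applied in sequence...
two_layers = W2 @ (W1 @ x)
# ...are exactly one linear layer whose matrix is the product W2 @ W1.
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layers, one_layer))  # True
```

Inserting a nonlinearity such as `np.tanh` between the two products breaks this equivalence, which is exactly why activation functions matter.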
Universal Approximation Theorem
The universal approximation theorem states that a feed-forward network with a single hidden layer containing a finite number of neurons can approximate arbitrarily well real-valued continuous functions on compact subsets of Rⁿ. The theorem thus states that simple neural networks can represent a wide variety of interesting functions when given appropriate parameters or weights. On the other hand, it does not provide a construction for the weights; it merely states that such a construction is possible.
-Wikipedia
Specifically, the theorem states that a feed-forward neural network with a linear output layer and at least one hidden layer with a squashing activation function can approximate any Borel measurable function from one finite-dimensional space to another with any desired non-zero amount of error, provided that the network is given enough hidden units.
-Deep Learning
Initially, the universal approximation theorem was proven only for sigmoid activation functions. It was eventually proven for other squashing activation functions as well, and later results show it holds for any non-polynomial activation function.
Borel Measurable Function
A Borel measurable function is, for our purposes, a function mapping data from one topological space to another such that the preimage of every Borel set is again a Borel set. The Borel sets of a space are formed by starting from the open sets and repeatedly taking complements, countable unions, and countable intersections.
For our purpose we consider any continuous function on a closed and bounded subset of Rn as Borel Measurable.
Failure to Learn
The universal approximation theorem says that given enough hidden units and parameters, a multilayer perceptron can represent any function. But representation and learning are separate rungs on a ladder. In order to get useful information out of a model, we need our neural network to be able to both represent and learn from the data.

First the neural network must be able to represent the real data generating function; only then can it learn the function. This learning can fail for several reasons. For one, the optimization algorithm may not be able to find the right values of the parameters. For another, the training algorithm may overfit the data.
Starting Point of Depth
The universal approximation theorem says that there exists a large enough network to represent any complicated function. But it doesn’t say where it is or when to stop as we increase the size.
In many ways, the best place to start is one hidden layer, because a single hidden layer is in principle sufficient to represent any function. But since it may not learn the function in practice, one can keep adding layers until the generalization error is minimized.
Interpretations on Additional Layers
We can understand the step from each layer to the next as a fold in the space, where each fold simplifies the shape of the data.
Another way to interpret each step is to think of them as mini functions, that eventually get us to the final result. And yet another way to interpret them is to think of each layer as learning a specific feature of the data, where each step uses the output of the previous layer.
You can also interpret these layers as each learning a single step of a multi step process, for example, first select all blue objects, then select all spherical objects, then select all objects made of metal… Which in the end gets you all the blue spherical metal objects out of a pool of random objects.
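That multi-step interpretation can be mirrored in plain Python (a toy, non-neural illustration with a made-up pool of objects): each "layer" applies one selection step to the output of the previous one.

```python
# Hypothetical pool of objects, each described by (color, shape, material).
objects = [
    ("blue", "sphere", "metal"),
    ("blue", "cube", "metal"),
    ("red", "sphere", "metal"),
    ("blue", "sphere", "wood"),
]

# Each "layer" consumes the previous layer's output and refines it one step.
step1 = [o for o in objects if o[0] == "blue"]    # first select all blue objects
step2 = [o for o in step1 if o[1] == "sphere"]    # then all spherical objects
step3 = [o for o in step2 if o[2] == "metal"]     # then all objects made of metal

print(step3)  # [('blue', 'sphere', 'metal')]
```

The composition of simple steps yields the complicated overall selection, just as composed layers yield a complicated overall function.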
Other Architectural Considerations
So far we have limited our design considerations to networks with a chained layer architecture, but there are other possibilities.
CNNs and RNNs
Since CNNs and RNNs involve special considerations for computer vision and sequence processing, their design will be considered separately.
Skip Connections
In general, there's no rule that says layers have to be chained. Skip connections are edges in the graph that skip a layer, connecting layer i to layer i+2 or higher.
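A minimal NumPy sketch of a skip connection, assuming (for simplicity) that the skipped-from and skipped-to layers have matching widths: the input is carried past the first layer by an extra edge and added into the second layer's pre-activation.

```python
import numpy as np

rng = np.random.default_rng(0)
width = 6
W1, W2 = rng.normal(size=(2, width, width))
b1, b2 = np.zeros(width), np.zeros(width)

x = rng.normal(size=width)
h1 = np.tanh(W1 @ x + b1)

# Plain chained layer: layer 2 sees only h1.
h2_chained = np.tanh(W2 @ h1 + b2)

# Skip connection: an extra edge carries x past layer 1, directly into layer 2.
h2_skip = np.tanh(W2 @ h1 + b2 + x)

print(np.allclose(h2_chained, h2_skip))  # False (the skip edge changes the output)
```

Residual networks use exactly this pattern; the extra edge also gives gradients a shorter path back to earlier layers.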
Dense vs Sparse Connections
And the last key consideration is how each layer is connected to the next. By default, layers are connected in a dense, fully connected manner, but using fewer than the full set of connections is popular too. Fewer connections generally mean less computation and fewer parameters. The right sparse connectivity is often problem dependent, which we will study separately when considering CNNs, RNNs, and so on.
Other Articles
This post is part of a series of stories that explores the fundamentals of deep learning:
1. Linear Algebra Data Structures and Operations: Objects and Operations
2. Computationally Efficient Matrices and Matrix Decompositions: Inverses, Linear Dependence, Eigendecompositions, SVD
3. Probability Theory Ideas and Concepts: Definitions, Expectation, Variance
4. Useful Probability Distributions and Structured Probabilistic Models: Activation Functions, Measure and Information Theory
5. Numerical Method Considerations for Machine Learning: Overflow, Underflow, Gradients and Gradient Based Optimizations
6. Gradient Based Optimizations: Taylor Series, Constrained Optimization, Linear Least Squares
7. Machine Learning Background Necessary for Deep Learning I: Generalization, MLE, Kullback Leibler Divergence
8. Machine Learning Background Necessary for Deep Learning II: Regularization, Capacity, Parameters, Hyperparameters
9. Principal Component Analysis Breakdown: Motivation, Derivation
10. Feedforward Neural Networks: Layers, Definitions, Kernel Trick
11. Gradient Based Optimizations Under The Deep Learning Lens: Stochastic Gradient Descent, Cost Function, Maximum Likelihood
12. Output Units For Deep Learning: Stochastic Gradient Descent, Cost Function, Maximum Likelihood
13. Hidden Units For Deep Learning: Activation Functions, Performance, Architecture
14. The Common Approach to Binary Classification: The most generic way to setup your deep learning models to categorize movie reviews
Up Next…
Coming up next is n-class categorization with neural networks. If you would like me to write another article explaining a topic in depth, please leave a comment.
For the table of contents and more content click here.
References
Adams, R. A. (2017). Calculus. Prentice-Hall.
Chollet, F. (2018). Deep Learning with Python and Keras. MITP-Verlags GmbH & Co. KG.
Goodfellow, I. (2017). Deep Learning. MIT Press.
Nicholson, K. (2009). Linear Algebra with Applications.
Sutton, R. S. (2018). Reinforcement Learning. A Bradford Book.
Wackerly, D. D. (2007). Mathematical Statistics with Applications. Belmont, CA: Nelson Education.
(n.d.). A First Course In Linear Algebra — Open Textbook Library. Retrieved February 24, 2020, from https://open.umn.edu/opentextbooks/textbooks/a-first-course-in-linear-algebra-2017