What are Activation Functions in Neural Networks and how do they work?

Unleashing the Power Within

RAHULA RAJ
Level Up Coding


In the mesmerizing realm of neural networks, activation functions reign supreme. These functions, with their ability to inject dynamic non-linearities into the network, pave the way for complex function approximation and accurate predictions. In this captivating journey, we will delve deep into the definition, purpose, and pivotal role of activation functions in the enchanting world of neural network models.

Decoding the Enigma: Understanding Activation Functions

An activation function serves as the guiding beacon for a neuron in a neural network, illuminating its output based on the input. By applying fascinating non-linear transformations to the input data, these functions empower the network to gain knowledge and comprehend the intricate relationships between the input and output.

The purpose of activation functions is twofold. Firstly, they embrace the allure of non-linearities, granting neural networks the power to decipher and emulate the complex patterns present in real-world data. With the beauty of the natural world often manifesting in non-linear forms, confining neural networks to solely linear functions would stifle their learning and generalization capacities.

Secondly, activation functions act as wise guardians, ensuring the output of neurons stays within a specific range. This normalization is of paramount importance, for it shields neurons from becoming saturated or trapped during the magical process of training. When inputs grow very large in magnitude, the gradients of saturating functions shrink toward zero, leading to sluggish convergence or the dreaded stalling of training.

The Heartbeat of Neural Networks: Activation Functions in Action

Activation functions pulsate with life at the core of neural network models, influencing their performance and aptitude in profound and wondrous ways. To truly grasp the essence of neural networks, one must appreciate the importance of these powerful functions. So, let us unravel the enigmatic secrets behind the profound significance of activation functions:

  1. Non-linearity Unleashed: Activation functions unlock the captivating realm of non-linearity, enabling neural networks to capture and represent the intricate web of relationships within data. Without activation functions, neural networks would be confined to treading the linear path, with their ability to capture and model intricate patterns greatly diminished.
  2. Taming Complexity: By embracing the magic of non-linear transformations, activation functions empower neural networks to conquer highly complex tasks. From deciphering images and speech to translating the nuances of languages and predicting the twists and turns of financial trends, these functions serve as the bridge between the non-linear patterns inherent in real-world phenomena and the network’s capability to comprehend and harness them.
  3. Preventing Saturation: Activation functions act as the wise guardians of neurons, shielding them from saturation and stagnation during the mystical process of training. By meticulously regulating the range of neuron outputs, activation functions ensure that gradients during the mesmerizing dance of backpropagation remain sufficiently substantial for effective weight updates. This results in the graceful convergence of the network and enhances the accuracy of its predictions.
  4. Gatekeepers of Knowledge: Certain activation functions, such as the sigmoid and tanh functions, possess the magical power of gatekeeping. They skillfully regulate the flow of information within neural networks, showcasing their brilliance in recurrent neural networks (RNNs) and long short-term memory (LSTM) networks. Through their astute control, they govern memory and information retention, propelling neural networks to new heights of wisdom and understanding.

Activation functions are the bedrock of neural networks. They bestow the captivating gift of non-linear transformations, enable the modeling of intricate functions, prevent saturation during mesmerizing training journeys, and expertly employ gatekeeping mechanisms. The thoughtful selection and design of activation functions are indispensable for achieving optimal performance and awakening the true potential of neural network models.
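
To see why this matters, here is a minimal NumPy sketch (illustrative only, with made-up random weights): two stacked linear layers with no activation in between collapse into a single linear layer, while inserting a simple non-linearity breaks that equivalence.

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # weights of a first "layer"
W2 = rng.normal(size=(2, 4))   # weights of a second "layer"
x = rng.normal(size=3)         # a toy input vector

# Two linear layers without an activation...
two_linear = W2 @ (W1 @ x)
# ...are exactly one linear layer whose weights are W2 @ W1.
print(np.allclose(two_linear, (W2 @ W1) @ x))   # True

# A ReLU between the layers breaks this collapse, so depth adds expressive power.
with_relu = W2 @ np.maximum(0, W1 @ x)
print(np.allclose(with_relu, (W2 @ W1) @ x))    # False in general

With that motivation in place, the block below collects simple NumPy reference implementations of the activation functions discussed throughout this article.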

import numpy as np

def sigmoid(x):
    # Squashes any real input into the range (0, 1)
    return 1 / (1 + np.exp(-x))

def relu(x):
    # Passes positive inputs through unchanged; clips negatives to 0
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    # Like ReLU, but negative inputs keep a small slope alpha
    return np.where(x > 0, x, alpha * x)

def tanh(x):
    # Squashes any real input into the range (-1, 1), centered at 0
    return np.tanh(x)

def softmax(x):
    # Converts a 1-D vector of scores into probabilities that sum to 1;
    # subtracting the max improves numerical stability
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps, axis=0)
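
As a quick sanity check of these implementations (a usage sketch on a small made-up input vector), each function produces its characteristic output range:

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(np.round(sigmoid(x), 3))     # values squashed into (0, 1)
print(relu(x))                     # negatives clipped to 0
print(leaky_relu(x))               # negatives scaled by 0.01 instead of clipped
print(np.round(tanh(x), 3))        # values squashed into (-1, 1)
print(np.round(softmax(x), 3))     # non-negative values that sum to 1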

Unleashing the Power of Activation Functions

Activation functions are the superheroes that give deep learning models their incredible abilities. They bring non-linearity to neural networks, enabling them to identify and predict intricate patterns. Join us on this thrilling adventure as we dive deep into the secrets of the most popular activation functions, discovering their unique characteristics, advantages, and real-world applications.

1. Sigmoid Activation Function: The Enchanting Logistic Function

The sigmoid activation function, also known as the logistic function, took center stage in the early days of neural networks. Imagine it as a shape-shifter that squashes inputs into a fascinating range between 0 and 1. Witness its mathematical wizardry:

f(x) = 1 / (1 + exp(-x))

Special Traits and Historical Significance

  • The sigmoid function captivates us with its smoothness and differentiability, making it a perfect match for optimization algorithms based on gradients.
  • Historically, it drew attention due to its striking resemblance to the firing rate of biological neurons.
  • It played a starring role in binary classification problems, as its output served as a probability of belonging to a particular class.

Limitations and Peculiarities of the Sigmoid Function

  • The sigmoid function hides a villain called the “vanishing gradient.” For inputs of large magnitude, the gradients with respect to the weights shrink toward zero during backpropagation, slowing learning or stalling it entirely.
  • Its outputs are not zero-centered, which can slow optimization because the gradients flowing into the next layer’s weights all share the same sign.
  • The exponential computations involved can be resource-intensive and add computational complexity.
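
A short numerical sketch (reusing the NumPy sigmoid defined earlier) makes the vanishing gradient concrete: the sigmoid’s derivative, f'(x) = f(x)(1 - f(x)), peaks at 0.25 and collapses toward zero as |x| grows.

def sigmoid_grad(x):
    # Derivative of the sigmoid: s * (1 - s)
    s = sigmoid(x)
    return s * (1 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, sigmoid_grad(x))
# 0.0  -> 0.25
# 2.0  -> ~0.105
# 5.0  -> ~0.0066
# 10.0 -> ~0.000045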

2. Rectified Linear Unit (ReLU): The Fearless Hero of Today

Enter ReLU, the courageous champion and one of the most celebrated activation functions in the modern world. With unwavering determination, ReLU banishes negative values to the shadows, allowing non-negative values to shine. Hear its battle cry:

f(x) = max(0, x)

Abundant Benefits and Valuable Virtues of ReLU

  • ReLU mitigates the vanishing gradient problem: for any positive input, its gradient is exactly 1, so gradients pass through unchanged during backpropagation.
  • Its computational efficiency is truly formidable, triumphing over complex calculations with simple operations.
  • ReLU’s sparse activation optimizes computational resources, since many neurons output exactly zero for a given input.

Defeating the Vanishing Gradient with ReLU

With ReLU as a weapon, deep neural networks rise against the threat of vanishing gradients. Gradients flow freely through active units, making the training of deeper networks faster and more stable.
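
The contrast with the sigmoid is easy to check numerically (a small sketch reusing the sigmoid_grad helper from the sigmoid section above): ReLU’s derivative is exactly 1 for every positive input, no matter how large, so the gradient signal does not decay.

def relu_grad(x):
    # ReLU's derivative (taking the value 0 at x = 0)
    return 1.0 if x > 0 else 0.0

for x in [0.5, 2.0, 10.0, 100.0]:
    print(x, relu_grad(x), sigmoid_grad(x))
# ReLU's gradient stays at 1.0; the sigmoid's shrinks toward 0.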

3. Hyperbolic Tangent (tanh): The Sibling Saga of Sigmoid

Witness the tale of the hyperbolic tangent function, a captivating sibling of the sigmoid function. Created from the same mold, it maps inputs to the mesmerizing range of -1 to 1. Dive into its mathematical journey:

f(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))

United Siblings: Tanh’s Similarities to Sigmoid and its Applications

  • Tanh’s captivating S-shaped curve echoes the essence of the sigmoid function, but with a range spanning from -1 to 1 and centered at zero.
  • Recurrent Neural Networks (RNNs) often utilize tanh, appreciating its ability to capture both positive and negative values.

Comparing Tanh with Sigmoid and ReLU

  • Tanh emerges as the champion of zero-centeredness: because its outputs are symmetric around zero, the activations it passes forward are not biased toward one sign, which eases optimization.
  • However, it also falls prey to the vanishing gradient problem when faced with extreme inputs, sharing the fate of the sigmoid function.
  • Meanwhile, ReLU stands tall, unyielding to the vanishing gradient curse, and boasts computational efficiency.
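
A brief comparison of the two (using the NumPy definitions from earlier in the article) makes the zero-centering visible:

x = np.linspace(-3, 3, 7)        # [-3, -2, -1, 0, 1, 2, 3]
print(np.round(sigmoid(x), 3))   # all outputs positive, centered near 0.5
print(np.round(tanh(x), 3))      # outputs symmetric around 0, in (-1, 1)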

4. Softmax Activation Function: The Maestro of Multi-Class Classification

In the vast realm of multi-class classification, the melodious notes of the softmax activation function resonate. Its mission? To harmonize inputs into a symphony of probabilities, gracefully embracing the essence of multiple output classes.

The Brilliant Role of Softmax in Multi-Class Classification

  • Softmax orchestrates the convergence of output probabilities, creating a harmonious symphony that always adds up to one. The world of probability-based decision-making unfolds.
  • The model reveals its prediction, selecting the class with the highest probability as the star of the show.

Decoding the Enigma: The Normalized Output of Softmax

  • The voices of the softmax chorus reveal their true identities as probabilities, each representing the chance of an input belonging to a particular class.
  • The grand finale arrives, with the class boasting the highest probability taking center stage as the predicted class.
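
A minimal worked example (using the NumPy softmax defined earlier and made-up logits for three classes) shows the normalization and the final prediction:

logits = np.array([2.0, 1.0, 0.1])   # raw scores for three classes
probs = softmax(logits)
print(np.round(probs, 3))            # ~[0.659, 0.242, 0.099]
print(probs.sum())                   # ~1.0: a valid probability distribution
print(np.argmax(probs))              # 0: the class with the highest probability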

In this captivating journey, we have unlocked the brilliance of these activation functions, each with a specific purpose in the grand tapestry of deep learning. We explored the historical significance and limitations of the sigmoid function, the vanishing gradient-defying heroics of ReLU, the sibling tale of tanh and sigmoid, and the harmonious symphony of softmax in multi-class classification. Equipped with this knowledge, you can now make informed decisions and choose the activation function that best suits your neural network architecture. But remember, this journey is only the beginning; the realm of activation functions is filled with countless wonders waiting to be discovered.

Discovering the Surprising Benefits of Different Activation Functions

When it comes to neural networks, activation functions play a critical role. They determine how a neuron outputs data, allowing for non-linear transformations. While widely known functions like ReLU and sigmoid are popular choices, there are lesser-known activation functions that offer unique advantages, potentially improving the performance of models. In this article, let’s explore three such functions: Leaky ReLU, Parametric ReLU (PReLU), and Swish.

Leaky ReLU: Resurrecting Dormant Neurons

Imagine this: the traditional ReLU function sets any negative input to zero, rendering it lifeless. But fear not, as Leaky ReLU swoops in to save the day with its refreshing take. It introduces a small incline for negative inputs, reviving their activation values.

Leaky ReLU follows the equation:

f(x) = max(ax, x)

In this equation, ‘x’ represents the input, and ‘a’ is a small positive constant that determines the slope of the negative portion. Typically, ‘a’ is assigned a value close to 0.01.

The advantage of Leaky ReLU lies in its solution to the problem of “dying ReLU.” In traditional ReLU, when a neuron becomes inactive due to negative inputs, its gradient becomes zero, causing it to remain inactive during subsequent training. Leaky ReLU addresses this issue by allowing a small gradient for negative inputs, reigniting the potential hidden within the “dead neuron.”

Leaky ReLU has been shown to enhance model performance, especially with complex and intricate datasets. By introducing a hint of slope for negative inputs, Leaky ReLU keeps the gradient from collapsing to exactly zero. This ensures smooth gradient flow, facilitating seamless information transmission during the training process. As a result, it promotes better learning and superior convergence.
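
The following NumPy sketch (with hypothetical gradient helpers, assuming the typical alpha = 0.01) shows exactly what Leaky ReLU changes: negative inputs keep a small, non-zero gradient instead of being silenced.

def relu_gradient(x):
    return (x > 0).astype(float)          # 0 for negative inputs: the "dead" case

def leaky_relu_gradient(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)    # negative inputs still receive a small gradient

x = np.array([-3.0, -1.0, 0.5, 2.0])
print(relu_gradient(x))        # [0.   0.   1.   1.  ]
print(leaky_relu_gradient(x))  # [0.01 0.01 1.   1.  ]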

Parametric ReLU (PReLU): Empowering the Slope

Expanding on the principles of Leaky ReLU, Parametric ReLU (PReLU) encourages us to dream bigger. It introduces a learnable parameter to redefine the negative slope, enhancing adaptability.

PReLU follows the same equation as Leaky ReLU:

f(x) = max(ax, x)

However, it takes things a step further by incorporating a mutable parameter ‘a’. This learnable parameter enables the neural network to explore and determine the optimal slope for negative inputs. As gradients are calculated through backpropagation, the value of ‘a’ is continuously adjusted, allowing the function to improve its performance over time.

PReLU proves particularly advantageous in handling negative gradients. By granting the network the ability to fine-tune the slope through the adaptable ‘a’, PReLU accommodates scenarios where specific neurons benefit from a more aggressive slope for negative inputs.

Empirical studies have highlighted PReLU’s ability to amplify model performance, especially in the realm of deep neural networks. By offering greater flexibility in approximating non-linear functions, PReLU paves the way for improved learning in the face of complex and high-dimensional datasets.

import torch
import torch.nn as nn

class PReLUModel(nn.Module):
    def __init__(self):
        super(PReLUModel, self).__init__()
        self.fc1 = nn.Linear(784, 128)   # input layer for flattened 28x28 images
        self.prelu = nn.PReLU()          # negative slope 'a' is a learnable parameter
        self.fc2 = nn.Linear(128, 10)    # output layer for 10 classes

    def forward(self, x):
        x = self.fc1(x)
        x = self.prelu(x)
        x = self.fc2(x)
        return x

# Example usage
model = PReLUModel()
print(model)
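
As a quick follow-up (assuming the model defined above), you can confirm that the negative slope ‘a’ really is a learnable parameter by inspecting it and passing a dummy batch through the network; PyTorch initializes it to 0.25 by default.

print(model.prelu.weight)              # the learnable slope 'a', initialized to 0.25
out = model(torch.randn(32, 784))      # a dummy batch of 32 flattened 28x28 inputs
print(out.shape)                       # torch.Size([32, 10])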

Swish: The Embodiment of Smoothness and Versatility

Introducing Swish, a relatively new addition to the activation function family that has captivated attention with its unique charm.

Swish can be represented by the equation:

f(x) = x * sigmoid(bx)

In this equation, ‘x’ represents the input, while ‘b’ (often written β) is a constant or learnable parameter that controls how sharply the curve bends; with b = 1, the function is also known as SiLU.

Swish combines the non-linear properties of ReLU with the smoothness of the sigmoid. For large negative inputs it gracefully approaches zero, and for large positive inputs it grows approximately linearly, closely tracking the identity. Unlike ReLU, Swish has no hard zero region, ensuring the smooth flow of gradients across the entire range of inputs, and it is slightly non-monotonic, dipping below zero for small negative inputs.

Experiments and comparisons with other functions have showcased Swish’s ability to outperform ReLU in terms of accuracy and convergence speed. Its continuous and non-monotonic characteristics facilitate optimal gradient propagation during training, resulting in improved optimization and generalization.

In conclusion, lesser-known activation functions such as Leaky ReLU, Parametric ReLU (PReLU), and Swish offer remarkable advantages over their more widely used counterparts. By addressing issues like inactive neurons, introducing adaptable parameters, and fostering smooth non-linearity, these functions enhance the performance and flexibility of neural networks. Take a leap into the world of these activation functions to supercharge your models and unleash their full potential.

import tensorflow as tf
from tensorflow.keras.layers import Activation

def swish(x):
    # Swish with b = 1: x * sigmoid(x)
    return x * tf.keras.activations.sigmoid(x)

# Model incorporating the custom Swish activation
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, input_dim=784),
    tf.keras.layers.Lambda(swish),  # Using Lambda to apply the custom Swish function
    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
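
To see the “no flat areas” claim numerically, here is a small NumPy sketch (reusing the sigmoid and relu helpers defined near the top of the article, with b fixed at 1): Swish dips slightly below zero for small negative inputs and then smoothly approaches the identity for positive ones.

def swish_np(x, b=1.0):
    return x * sigmoid(b * x)

x = np.array([-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0])
print(np.round(relu(x), 3))       # hard cutoff: every negative input maps to 0
print(np.round(swish_np(x), 3))   # ~[-0.142, -0.269, -0.189, 0., 0.311, 0.731, 2.858]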

The Impact of Activation Functions: Boosting Neural Network Performance

Activation functions play a pivotal role in the success of neural networks, injecting them with non-linear capabilities. This allows models to unravel intricate patterns and make precise predictions. In this section, we’ll explore how various activation functions affect model accuracy and reveal strategies for finding the perfect fit for different tasks.

The Power of Activation Functions in Model Accuracy

The choice of activation function can heavily influence the accuracy of your neural network model. Let’s dive into some popular activation functions and witness their impact on model performance:

  1. Sigmoid Function: The sigmoid function maps inputs to a range of 0 to 1. It excels in binary classification tasks, providing probability outputs. However, sigmoid functions face the vanishing gradient problem, which can hinder training, especially in deep neural networks.
  2. ReLU (Rectified Linear Unit): ReLU is a highly favored activation function due to its simplicity and effectiveness. It eliminates negative values by setting them to zero, leaving positive values untouched. ReLU tackles the vanishing gradient problem and is computationally efficient. But, beware! It can lead to dead neurons, for example when a large learning rate pushes a neuron permanently into the negative region.
  3. Leaky ReLU: Leaky ReLU is an enhanced version of ReLU that tackles the dead neuron issue. It introduces a small, non-zero gradient for negative input values, preventing neurons from going entirely dormant.
  4. Tanh Function: The hyperbolic tangent function, tanh, maps inputs to a range of -1 to 1. It addresses the sigmoid’s non-zero-centered output while retaining its non-linear properties, though, like the sigmoid, it still saturates for large inputs. Tanh is often utilized in models where the output needs to be centered around zero.

Choosing the Right Activation Function for Different Tasks

Selecting the most appropriate activation function depends on the task at hand and the architecture of your neural network. Let’s explore some guidelines to help you make informed decisions:

  1. Binary Classification: If you’re dealing with binary classification tasks that require outputs in the [0, 1] range, the sigmoid function fits the bill. It can provide probability estimates for the positive class.
  2. Multi-Class Classification: For multi-class classification scenarios with more than two classes, the softmax activation function prevails. It generates a probability distribution across all classes, making it ideal for such tasks.
  3. Regression: In regression tasks where outputs can be any real value, employing a linear activation function or even no activation function at all may suffice.
  4. Hidden Layers: When selecting activation functions for hidden layers, ReLU and its variations (Leaky ReLU, Parametric ReLU, etc.) are commonly employed. They have proven to be effective in deep learning architectures.

Remember, the choice of activation function should align with the nature and distribution of your data. Experimentation and fine-tuning might be necessary to discover the best activation functions for your specific task.
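
As a concrete illustration of these guidelines, here is a minimal Keras sketch (with made-up layer sizes and input dimensions, not a tuned architecture): the hidden layers use ReLU throughout, while the output activation changes with the task.

import tensorflow as tf

def make_model(task: str) -> tf.keras.Sequential:
    # ReLU is a reasonable default for the hidden layers in all three cases.
    layers = [
        tf.keras.layers.Dense(64, activation='relu', input_shape=(20,)),
        tf.keras.layers.Dense(64, activation='relu'),
    ]
    if task == 'binary':
        layers.append(tf.keras.layers.Dense(1, activation='sigmoid'))    # probability of the positive class
    elif task == 'multiclass':
        layers.append(tf.keras.layers.Dense(10, activation='softmax'))   # distribution over 10 classes
    else:  # regression
        layers.append(tf.keras.layers.Dense(1, activation='linear'))     # unbounded real-valued output
    return tf.keras.Sequential(layers)

make_model('multiclass').summary()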

Conclusion

In conclusion, activation functions breathe life into neural networks, infusing non-linearity at their very core. They empower networks to unravel intricate data relationships, enabling accurate predictions and adaptive learning. As the field of deep learning continues its tumultuous voyage, understanding the nuances of activation functions will remain a guiding light for researchers, practitioners, and curious minds alike.
