Single-Neuron Fundamentals

A neural network’s simplest building block is a single neuron. It receives an input feature vector x, weights it with a learnable weight vector w, and adds a learnable bias b to form a weighted sum:

z = w · x + b

The neuron’s output (its activation a) is obtained by passing z through a chosen activation function g(·):

a = g(z)

With different choices of g(·), a single neuron can perform both regression and classification. For example:

  • Linear activation g(z) = z yields a linear regression model.
  • Sigmoid activation g(z) = σ(z) yields logistic regression.

This equivalence highlights why neurons are flexible primitives and why networks of neurons can represent complex functions.
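
As a concrete illustration, here is a minimal NumPy sketch of both cases; the input values, weights, and bias below are made up for demonstration:

python
import numpy as np

x = np.array([0.5, -1.2, 3.0])   # input feature vector (illustrative values)
w = np.array([0.1, 0.4, -0.2])   # learnable weights (illustrative values)
b = 0.3                          # learnable bias

z = np.dot(w, x) + b             # weighted sum: z = w · x + b

a_linear = z                             # linear activation -> linear regression output
a_sigmoid = 1.0 / (1.0 + np.exp(-z))     # sigmoid activation -> logistic regression output

print(z, a_linear, a_sigmoid)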

Why Bias Matters

The bias term b shifts the decision boundary, allowing the model to fit data that does not pass through the origin. Without b, the neuron’s expressiveness is unnecessarily constrained.

Activation Functions That Introduce Nonlinearity

Activation functions enable neural networks to model nonlinear patterns that linear models cannot capture.

Step Function

  • Fires (outputs 1) if z > 0, otherwise outputs 0.
  • Historically used in perceptrons; not differentiable, so unsuitable for modern gradient-based training.

Linear

  • g(z) = z; equivalent to no activation.
  • Useful for regression outputs, but insufficient for building deep nonlinear hierarchies.

Sigmoid

  • g(z) = σ(z) ∈ (0, 1), often interpreted as probability for binary classification.
  • When used in a single neuron, reproduces logistic regression.

ReLU (Rectified Linear Unit)

  • g(z) = max(0, z); the most widely used activation in deep learning.
  • Efficient to compute, helps mitigate vanishing gradients, and often accelerates training.
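
The four activations above are one-liners in NumPy; this sketch simply restates their definitions for a vector of sample pre-activations z:

python
import numpy as np

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # sample pre-activations

step = (z > 0).astype(float)          # step: 1 if z > 0, else 0
linear = z                            # linear: identity (no activation)
sigmoid = 1.0 / (1.0 + np.exp(-z))    # sigmoid: squashes into (0, 1)
relu = np.maximum(0.0, z)             # ReLU: max(0, z)

print(step, linear, sigmoid, relu, sep='\n')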

From Single Neuron to Multi-Layer Networks

Stacking neurons in fully connected (dense) layers yields feedforward neural networks (also called multilayer perceptrons, MLPs).

  • Input layer: where data enters the model.
  • Hidden layers: where nonlinear feature transformations occur.
  • Output layer: where predictions are produced.

Only hidden and output layers count toward network “depth.” A model with one hidden layer and one output layer is a two-layer network.

Hidden-Layer Computations

For hidden unit j in layer i:

z[j]^(i) = W[j,:]^(i) · a^(i−1) + b[j]^(i)

a[j]^(i) = g^(i)(z[j]^(i))

Where:

  • W^(i): weight matrix connecting layer i−1 to i
  • b^(i): bias vector of layer i
  • g^(i): activation function for layer i
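
The per-unit formula can be evaluated for all units of a layer at once; below is a minimal sketch with made-up layer sizes, where a_prev, W_i, and b_i stand in for a^(i−1), W^(i), and b^(i), and ReLU is assumed for g^(i):

python
import numpy as np

rng = np.random.default_rng(0)

a_prev = rng.normal(size=3)        # a^(i-1): activations of the previous layer (3 units)
W_i = rng.normal(size=(4, 3))      # W^(i): 4 units in layer i, each with 3 incoming weights
b_i = np.zeros(4)                  # b^(i): one bias per unit

z_i = W_i @ a_prev + b_i           # z^(i) = W^(i) · a^(i-1) + b^(i), all units at once
a_i = np.maximum(0.0, z_i)         # a^(i) = g^(i)(z^(i)) with ReLU as g^(i)

print(z_i.shape, a_i.shape)        # both (4,)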

Capacity vs. Complexity

  • More layers/neurons increase representational power.
  • Larger models require more data and compute, and they are more susceptible to overfitting.

Regularization with Dropout

Dropout randomly “turns off” a fraction of neuron activations during training (multiplying them by zero). This:

  • Reduces co-adaptation
  • Encourages robustness
  • Often improves generalization
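
As a rough illustration of the mechanism (not how any particular library implements it), this sketch zeroes a random fraction of activations during training and rescales the survivors so their expected value is unchanged (the common "inverted dropout" formulation):

python
import numpy as np

rng = np.random.default_rng(42)

def dropout(a, rate=0.5, training=True):
    """Randomly zero a fraction `rate` of activations; rescale the rest."""
    if not training or rate == 0.0:
        return a
    keep_prob = 1.0 - rate
    mask = rng.random(a.shape) < keep_prob   # 1 where the unit is kept
    return a * mask / keep_prob              # rescaling keeps the expected activation unchanged

a = np.array([0.2, 1.5, 0.0, 3.1, 0.7])
print(dropout(a, rate=0.4))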

Matrix Notation and Efficient Batching

Neural networks scale via vectorized linear algebra.

Define the activations of layer 0 as the input features: A^(0) = X when working with a batch of examples.

For layer i:

Z^(i) = A^(i−1) · (W^(i))^T + 1 · (b^(i))^T

A^(i) = g^(i)(Z^(i))

Notes:

  • Each row of A and Z corresponds to one example in the batch.
  • Bias addition is broadcast across rows.
  • Transposes align dimensions for matrix multiplication.
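
These batched formulas map directly onto NumPy; the sketch below uses made-up sizes, with A_prev, W_i, and b_i standing in for A^(i−1), W^(i), and b^(i), and ReLU assumed for g^(i):

python
import numpy as np

rng = np.random.default_rng(0)

batch_size, n_prev, n_units = 32, 5, 4
A_prev = rng.normal(size=(batch_size, n_prev))   # A^(i-1): one example per row
W_i = rng.normal(size=(n_units, n_prev))         # W^(i): one row per unit in layer i
b_i = np.zeros(n_units)                          # b^(i)

Z_i = A_prev @ W_i.T + b_i      # Z^(i) = A^(i-1) · (W^(i))^T + bias broadcast over rows
A_i = np.maximum(0.0, Z_i)      # A^(i) = g^(i)(Z^(i))

print(Z_i.shape, A_i.shape)     # both (32, 4)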

Vectorization enables modern hardware (especially GPUs) to process thousands of operations in parallel, making large-scale deep learning practical.

Training Neural Networks

Neural networks learn by minimizing a loss (cost) function J(W, b) that measures prediction error.

Common Losses

  • Classification: cross-entropy
  • Regression: mean squared error

Initialization

  • Weights are typically initialized with small random values (from uniform or normal distributions) to break symmetry.
  • Biases are often initialized to zero.

Gradient Descent

Weights and biases are updated iteratively in the direction that reduces the loss:

  • Compute gradients via backpropagation.
  • Update parameters using learning rate-controlled steps.

Because most real-world losses are non-convex, gradient descent may converge to local minima or saddle points; however, in practice it works well with appropriate architectures and hyperparameters.
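
To make the update rule concrete, here is a minimal sketch of gradient descent for a single sigmoid neuron trained with cross-entropy loss on a tiny made-up dataset; it illustrates the mechanics (random initialization, gradient computation, learning-rate-controlled updates) rather than the procedure used internally by any particular library:

python
import numpy as np

rng = np.random.default_rng(0)

# Tiny made-up binary-classification dataset
X = rng.normal(size=(20, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Initialization: small random weights break symmetry; bias starts at zero
w = rng.normal(scale=0.01, size=2)
b = 0.0
lr = 0.5   # learning rate

for _ in range(200):
    z = X @ w + b
    a = 1.0 / (1.0 + np.exp(-z))   # sigmoid activation
    dz = (a - y) / len(y)          # gradient of mean cross-entropy w.r.t. z
    grad_w = X.T @ dz              # gradient w.r.t. weights
    grad_b = dz.sum()              # gradient w.r.t. bias
    w -= lr * grad_w               # learning-rate-controlled step
    b -= lr * grad_b

print(w, b)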

Softmax and Logits for Multi-Class Classification

When predicting across multiple classes, the output layer typically produces one score (logit) per class:

softmax(z)_i = e^{z_i} / Σ_j e^{z_j}

  • Converts logits into a probability distribution (all outputs in (0, 1), summing to 1).
  • The class with the highest probability is the predicted label.

Example: For z = [1, 2], softmax yields approximately [0.27, 0.73].
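
A quick NumPy check of the formula and the example above (subtracting the maximum logit first is a standard trick for numerical stability and does not change the result):

python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shift by the max logit for numerical stability
    return e / e.sum()

print(softmax(np.array([1.0, 2.0])))   # approximately [0.269, 0.731]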

Design Choices and Nonlinear Decision Boundaries

Model design affects the learned decision boundary:

  • A small network (e.g., one hidden neuron with sigmoid) may produce only a near-linear boundary—insufficient for curved, complex datasets (e.g., two moons).
  • A deeper/wider network can learn nonlinear boundaries that separate such data effectively.

This illustrates the depth/width–capacity trade-off and why deeper architectures are favored for complex tasks.

Practical Implementation in Python (scikit-learn)

Below is a minimal workflow using scikit-learn’s MLPClassifier to train a basic neural network on a toy two-moons dataset, including preprocessing and evaluation.

Data Preparation

  • Create two-moons dataset
  • Split into train/test
  • Standardize features

python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Create a toy dataset
X, y = make_moons(n_samples=300, noise=0.1, random_state=42)

# Train/test split (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Standardize features (important for MLPs)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

Define and Train the Network

  • One hidden layer with two neurons
  • ReLU activation in hidden layer
  • SGD optimizer
  • Extended iterations for convergence

python
hidden_layer_depth = 1
hidden_layer_width = 2
hidden_layer_sizes = (hidden_layer_width,) * hidden_layer_depth

mlp = MLPClassifier(
    hidden_layer_sizes=hidden_layer_sizes,
    activation='relu',   # hidden layers
    solver='sgd',        # optimization algorithm
    max_iter=2000,
    random_state=42
)

mlp.fit(X_train, y_train)

Evaluate the Model

python
y_pred = mlp.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Test Accuracy: {accuracy * 100:.2f}%')

Visualize Decision Boundary and Training Loss

python
# Meshgrid for decision boundary visualization
x_min, x_max = X_test[:, 0].min() - 0.5, X_test[:, 0].max() + 0.5
y_min, y_max = X_test[:, 1].min() - 0.5, X_test[:, 1].max() + 0.5
xx, yy = np.meshgrid(
    np.linspace(x_min, x_max, 300),
    np.linspace(y_min, y_max, 300)
)
grid = np.c_[xx.ravel(), yy.ravel()]

# Probabilities for class 1 across the grid
probs = mlp.predict_proba(grid)[:, 1].reshape(xx.shape)

fig, axes = plt.subplots(1, 2, figsize=(12, 6))

# Decision boundary
ax1 = axes[0]
ax1.contourf(xx, yy, probs, levels=[0, 0.5, 1], alpha=0.3, cmap=plt.cm.Greys)
ax1.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=plt.cm.Greys,
            edgecolors='k', s=60)
ax1.set_title(
    f'Decision Boundary\nHidden Layers: {hidden_layer_depth}, '
    f'Neurons per Layer: {hidden_layer_width}'
)
ax1.set_xlabel('Feature $x_1$')
ax1.set_ylabel('Feature $x_2$')

# Training loss curve
ax2 = axes[1]
ax2.plot(mlp.loss_curve_)
ax2.set_title('Training Loss Curve')
ax2.set_xlabel('Iteration')
ax2.set_ylabel('Loss')
ax2.grid(True)

plt.tight_layout()
plt.show()

If the boundary is too simple to separate the classes, increase depth and/or width to improve representational capacity.
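
For example, reusing the data preparation from above, a wider and deeper configuration can be tried by changing only the two size variables; the sizes below are illustrative rather than tuned:

python
# Illustrative, untuned sizes: two hidden layers of 16 neurons each
hidden_layer_depth = 2
hidden_layer_width = 16
hidden_layer_sizes = (hidden_layer_width,) * hidden_layer_depth

mlp = MLPClassifier(
    hidden_layer_sizes=hidden_layer_sizes,
    activation='relu',
    solver='sgd',
    max_iter=2000,
    random_state=42
)
mlp.fit(X_train, y_train)
print(f'Test Accuracy: {accuracy_score(y_test, mlp.predict(X_test)) * 100:.2f}%')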

Matrix Math Under the Hood

Neural networks rely on matrix multiplication for efficient forward and backward passes:

  • Forward pass: computes Z and A per layer for a full batch.
  • Backpropagation: propagates gradients layer-by-layer using chain rule and matrix operations.
  • Broadcasting adds per-neuron biases across the batch efficiently.

The reliance on vectorized operations is why GPUs, which excel at parallel linear algebra, are pivotal to modern deep learning.

Three Expert Q&As

Q1: How do I choose between sigmoid, ReLU, and linear activations?

  • Use ReLU (or variants) in hidden layers for efficient training and better gradient flow.
  • Use sigmoid for binary classification outputs (single neuron).
  • Use softmax for multi-class outputs.
  • Use linear activation for regression outputs.

Q2: When should I add more layers vs. more neurons?

  • Add layers to build hierarchical features and capture more complex patterns.
  • Add neurons within layers to increase capacity at a given level of abstraction.
  • If training accuracy is low, increase capacity; if generalization suffers, consider regularization (e.g., dropout), more data, or better preprocessing.

Q3: Why standardize features for MLPs?

  • Standardization (zero mean, unit variance) stabilizes optimization, helps gradient descent converge faster, and prevents certain features from dominating due to scale differences.