Coffee Perceptron
Walkthrough: The Differentiable Perceptron (Sigmoid Neuron)
This tutorial bridges the gap between abstract neural network theory and concrete implementation. By dissecting the simplest differentiable unit of Deep Learning—the Sigmoid Neuron—we will manually trace the path of “learning” from a raw error signal back to a specific weight update.
Specifically, this document covers:
- The Mechanics of Learning: How Forward Propagation (prediction) and Backward Propagation (correction) mathematically interact via the Chain Rule.
- Batch Processing: Transitioning from simple vector logic to Matrix Calculus (Z = X · W + b) to handle entire datasets efficiently.
- Manual Verification: Calculating gradients by hand to demystify what “Autograd” engines do behind the scenes.
- Implementation: Translating these manual derivation steps directly into PyTorch code to verify our arithmetic.
1. Intuitive Analogy: “The Coffee Log” (Batch Data)
To understand how a model learns general rules (rather than just reacting to one event), we need a dataset. Imagine looking back at your purchase history for three different mornings.
The Goal: Find a set of preferences (Weights) that explains all these decisions reasonably well.
The Dataset (Matrix X)
We have 3 examples (rows). Each has 2 features (columns).
| Scenario | Caffeine (x₁) | Price (x₂) | Decision (y) | Reasoning |
|---|---|---|---|---|
| 1. Desperate Monday | 1.0 (Strong) | 2.0 (Expensive) | 1 (Buy) | “I needed energy, ignored price.” |
| 2. Tourist Trap | 0.2 (Weak) | 2.0 (Expensive) | 0 (Skip) | “Weak AND expensive? No way.” |
| 3. The Daily Grind | 1.0 (Strong) | 0.5 (Cheap) | 1 (Buy) | “Strong and cheap. Easy yes.” |
The Initial Weights (W) - “Conflicted Personality”
Our model starts with the same initial guess as before.
- Caffeine (w₁ = 0.5): “I kind of like caffeine.”
- Price (w₂ = -0.5): “I really hate paying money.”
- Bias (b = 0.0): Neutral mood.
The Conflict
- In Scenario 1, the high price (2.0) multiplied by the hate for price (-0.5) creates a strong negative signal (-1.0).
- The model currently thinks Monday’s coffee is a bad deal.
- Learning Goal: The model needs to realize that Caffeine is more important than Price to satisfy all 3 examples (since we bought 2/3 coffees).
2. Theoretical Foundation: Matrix Notation
To process multiple examples at once, we switch from vectors (x) to matrices (X).
2.1 The Computational Graph (Batch Version)
- Linear Aggregation (Z): Z = X · W + b
  - X is shape (3, 2).
  - W is shape (2, 1).
  - b is a scalar (broadcast to shape (3, 1)).
  - Result Z is shape (3, 1) (one score for each coffee).
- Activation (A): A = σ(Z), with σ(z) = 1 / (1 + e⁻ᶻ) applied element-wise.
- Loss (L): Mean Squared Error, averaged across the N examples: L = (1/N) Σᵢ (aᵢ − yᵢ)².
3. Experimental Setup (The Matrices)
We translate our “Coffee Log” into PyTorch-style tensors.
Input Matrix (X):
\[X = \begin{bmatrix} 1.0 & 2.0 \\ 0.2 & 2.0 \\ 1.0 & 0.5 \end{bmatrix}\]
Target Vector (Y):
\[Y = \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix}\]
Weights (W) & Bias (b):
\[W = \begin{bmatrix} 0.5 \\ -0.5 \end{bmatrix}, \quad b = 0.0\]
4. Forward Pass (Matrix Multiplication)
Step A: Linear Aggregation (Z = X · W + b)
We perform the dot product for all 3 days simultaneously.
\[Z = \begin{bmatrix} 1.0 & 2.0 \\ 0.2 & 2.0 \\ 1.0 & 0.5 \end{bmatrix} \cdot \begin{bmatrix} 0.5 \\ -0.5 \end{bmatrix} + 0.0\]
Row 1 (Monday): (1.0 · 0.5) + (2.0 · -0.5) = 0.5 - 1.0 = -0.5
Row 2 (Trap): (0.2 · 0.5) + (2.0 · -0.5) = 0.1 - 1.0 = -0.9
Row 3 (Daily): (1.0 · 0.5) + (0.5 · -0.5) = 0.5 - 0.25 = 0.25
\[Z = \begin{bmatrix} -0.5 \\ -0.9 \\ 0.25 \end{bmatrix}\]
Step B: Activation (A = σ(Z))
We apply the sigmoid function to each score.
- σ(-0.5) ≈ 0.3775
- σ(-0.9) ≈ 0.2890
- σ(0.25) ≈ 0.5621
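The forward pass above can be reproduced without any framework. The following is a minimal sketch in plain Python, assuming the data and initial weights from Section 3. The helper name `sigmoid` and the list-based layout are illustrative choices, not part of the original tutorial.

```python
import math

# Coffee Log from Section 3 (rows: Monday, Tourist Trap, Daily Grind).
X = [[1.0, 2.0], [0.2, 2.0], [1.0, 0.5]]
W = [0.5, -0.5]  # initial weights: caffeine, price
b = 0.0          # initial bias


def sigmoid(z):
    """Logistic function 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))


# Step A: one dot product per row, plus the broadcast bias.
Z = [sum(x_j * w_j for x_j, w_j in zip(row, W)) + b for row in X]

# Step B: element-wise activation.
A = [sigmoid(z) for z in Z]

print("Z:", [round(z, 4) for z in Z])
print("A:", [round(a, 4) for a in A])
```

The printed activations should agree with the hand-computed values to about three decimal places. Small differences in the fourth decimal come from rounding.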
Step C: Loss Calculation (L)
Compare Predictions (A) vs Reality (Y).
- Monday: (0.3775 - 1)² = (-0.6225)² ≈ 0.387
- Trap: (0.2890 - 0)² = (0.2890)² ≈ 0.083
- Daily: (0.5621 - 1)² = (-0.4379)² ≈ 0.191
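The three per-example errors still need to be averaged to obtain the single loss value defined in Section 2.1. The sketch below does that, assuming the rounded predictions listed above as inputs. The variable names are illustrative.

```python
# Predictions from Step B (rounded as in the text) and the targets.
A = [0.3775, 0.2890, 0.5621]
Y = [1.0, 0.0, 1.0]

# Squared error for each morning, then the mean over N = 3 examples.
squared_errors = [(a - y) ** 2 for a, y in zip(A, Y)]
mse = sum(squared_errors) / len(squared_errors)

print("Per-example squared errors:", [round(e, 3) for e in squared_errors])
print("Mean squared error:", round(mse, 4))
```

With these inputs the mean comes out at roughly 0.221, and Monday contributes more than half of the total error.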
5. Backward Pass (Matrix Gradients)
We need to find the “Average Direction” to move the weights to satisfy all three days.
Step A: Gradient w.r.t. Activation (∇A)
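\[\nabla A = (A - Y)\]
Strictly, (A − Y) is the gradient of the summed half-squared error ½ Σᵢ (aᵢ − yᵢ)². The Mean Squared Error from Section 2.1 has the gradient (2/N)(A − Y), which points in the same direction and differs only by the constant factor 2/N. The manual trace keeps the simpler form.
\[\nabla A = \begin{bmatrix} 0.3775 - 1 \\ 0.2890 - 0 \\ 0.5621 - 1 \end{bmatrix} = \begin{bmatrix} -0.6225 \\ 0.2890 \\ -0.4379 \end{bmatrix}\]
Interpretation: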
- Row 1 (neg): Prediction was too low. Push UP.
- Row 2 (pos): Prediction was too high. Push DOWN.
- Row 3 (neg): Prediction was too low. Push UP.
Step B: Gradient w.r.t. Z (δ)
Element-wise multiplication: δ = ∇A ⊙ σ’(Z).
Recall σ’(z) = a(1 - a), where a = σ(z) is the activation already computed in the forward pass.
- Sigmoid Derivatives:
  - Monday: 0.3775 · (1 - 0.3775) ≈ 0.235
  - Trap: 0.2890 · (1 - 0.2890) ≈ 0.205
  - Daily: 0.5621 · (1 - 0.5621) ≈ 0.246
- Chain Rule:
  - Monday: -0.6225 · 0.235 ≈ -0.146
  - Trap: 0.2890 · 0.205 ≈ 0.059
  - Daily: -0.4379 · 0.246 ≈ -0.108
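The two element-wise steps above (error signal, then sigmoid derivative) can be sketched in a few lines of plain Python. This assumes the rounded activations from the forward pass and the unscaled error signal A − Y used in this manual trace. The names `grad_A`, `sigmoid_prime` and `delta` are illustrative.

```python
# Rounded activations from the forward pass and the targets.
A = [0.3775, 0.2890, 0.5621]
Y = [1.0, 0.0, 1.0]

# Step A: error signal per example (unscaled convention, A - Y).
grad_A = [a - y for a, y in zip(A, Y)]

# Sigmoid derivative expressed through the activation: a * (1 - a).
sigmoid_prime = [a * (1.0 - a) for a in A]

# Step B: chain rule as an element-wise (Hadamard) product.
delta = [g * s for g, s in zip(grad_A, sigmoid_prime)]

print("delta:", [round(d, 3) for d in delta])
```

Because σ’(z) is always positive, each entry of `delta` keeps the sign of the corresponding error: negative for the two under-predicted “Buy” mornings and positive for the over-predicted “Skip”.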
Step C: Gradient w.r.t. Weights (∇W)
This is the crucial matrix operation: Xᵀ · δ.
We check how each feature contributed to the error across all examples.
\[\nabla W = \begin{bmatrix} 1.0 & 0.2 & 1.0 \\ 2.0 & 2.0 & 0.5 \end{bmatrix} \cdot \begin{bmatrix} -0.146 \\ 0.059 \\ -0.108 \end{bmatrix}\]
- For Caffeine (w₁): (1.0 · -0.146) + (0.2 · 0.059) + (1.0 · -0.108) = -0.146 + 0.0118 - 0.108 = -0.242
- For Price (w₂): (2.0 · -0.146) + (2.0 · 0.059) + (0.5 · -0.108) = -0.292 + 0.118 - 0.054 = -0.228
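(Note: These are raw sums over the three examples. Differentiating the Mean Squared Error exactly would multiply them by 2/N = 2/3, which is what PyTorch does in Section 7. For this manual trace we keep the sums; the constant factor only rescales the effective learning rate.)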
Step D: Gradient w.r.t Bias (∇b)
Sum of δ.
-0.146 + 0.059 - 0.108 = -0.195
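Steps C and D can be checked with a short plain-Python sketch. It assumes the rounded δ values from Step B and writes the product Xᵀ · δ as explicit sums. The last lines apply the 2/N factor that a mean-reduced squared error introduces, which is the scale PyTorch's `nn.MSELoss` reports by default. All variable names are illustrative.

```python
X = [[1.0, 2.0], [0.2, 2.0], [1.0, 0.5]]
delta = [-0.146, 0.059, -0.108]  # rounded values from Step B
N = len(X)
num_features = len(X[0])

# Step C: entry j of X^T . delta is column j of X dotted with delta.
grad_W_sum = [
    sum(X[i][j] * delta[i] for i in range(N)) for j in range(num_features)
]

# Step D: the bias sees every example with an input of 1.
grad_b_sum = sum(delta)

# Same gradients on the scale of the mean squared error: multiply by 2/N.
scale = 2.0 / N
grad_W_mean = [scale * g for g in grad_W_sum]
grad_b_mean = scale * grad_b_sum

print("Summed gradients:", [round(g, 3) for g in grad_W_sum], round(grad_b_sum, 3))
print("Mean-MSE gradients:", [round(g, 4) for g in grad_W_mean], round(grad_b_mean, 4))
```

The summed values match the hand calculation (about -0.242, -0.228 and -0.195). The rescaled values (about -0.161, -0.152 and -0.130) are the ones to compare against an autograd engine that averages the loss.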
6. Weight Update (Learning)
We use a learning rate of η = 0.1 and the summed gradients calculated above.
Update Caffeine Weight (w₁)
\[w_{1,new} = 0.5 - 0.1(-0.242) = 0.5 + 0.0242 = 0.5242\]
Reasoning: Caffeine was high in both “Buy” scenarios, and the model under-predicted both of them. The gradient is negative (meaning “Loss goes down if Weight goes up”), so we increase the weight.
Update Price Weight (w₂)
\[w_{2,new} = -0.5 - 0.1(-0.228) = -0.5 + 0.0228 = -0.4772\]
Reasoning: The strongly negative price weight (-0.5) was causing too much error on Desperate Monday. The update makes it slightly less negative, reducing the price penalty to accommodate that purchase.
Conclusion
By processing the matrix, the model learned to prioritize Caffeine more and penalize Price slightly less to fit the aggregated behavior of the user.
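One way to confirm that the update actually helped is to recompute the loss with the new parameters. The sketch below does this in plain Python. It assumes the updated weights from Section 6, plus a bias update b = 0.0 - 0.1 · (-0.195) = 0.0195, which the text above does not carry out explicitly and is included here as an assumption. The function name `mean_squared_loss` is illustrative.

```python
import math

X = [[1.0, 2.0], [0.2, 2.0], [1.0, 0.5]]
Y = [1.0, 0.0, 1.0]


def mean_squared_loss(W, b):
    """Forward pass over the whole Coffee Log, returning the mean squared error."""
    total = 0.0
    for row, y in zip(X, Y):
        z = sum(x_j * w_j for x_j, w_j in zip(row, W)) + b
        a = 1.0 / (1.0 + math.exp(-z))
        total += (a - y) ** 2
    return total / len(X)


# Initial parameters from Section 3.
loss_before = mean_squared_loss([0.5, -0.5], 0.0)

# Parameters after one update (bias step is an assumption, see lead-in).
loss_after = mean_squared_loss([0.5242, -0.4772], 0.0195)

print(f"Loss before: {loss_before:.4f}")
print(f"Loss after:  {loss_after:.4f}")
```

The loss should drop by a small amount, roughly from 0.221 to 0.211. A single gradient step with η = 0.1 nudges the weights rather than solving the problem.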
7. Code Verification & Execution Trace
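The following Python script implements the matrices exactly as defined in Section 3 and executes one step of learning. Because nn.MSELoss averages the squared errors, its gradients equal the Section 5 sums multiplied by 2/N = 2/3. The script's updated weights (about 0.5161 and -0.4848) therefore move in the same direction as the hand-computed ones in Section 6, but by a smaller amount.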
```python
import torch
import torch.nn as nn


def verify_batch_perceptron():
    print("--- 1. Setup (The Coffee Log) ---")
    # Input Matrix (3 examples, 2 features)
    # Row 1: Desperate Monday, Row 2: Tourist Trap, Row 3: Daily Grind
    X = torch.tensor([
        [1.0, 2.0],
        [0.2, 2.0],
        [1.0, 0.5]
    ])
    # Target Vector (The Purchase Decisions)
    Y = torch.tensor([
        [1.0],
        [0.0],
        [1.0]
    ])
    # Initialize Weights and Bias (Matches Section 3)
    # w1 = 0.5 (Caffeine), w2 = -0.5 (Price)
    W = torch.tensor([[0.5], [-0.5]], requires_grad=True)
    b = torch.tensor([0.0], requires_grad=True)
    print(f"Initial Weights (W):\n{W.detach().numpy().T}")

    print("\n--- 2. Forward Pass ---")
    # Step A: Linear Aggregation Z = XW + b (Matches Section 4, Step A)
    Z = torch.matmul(X, W) + b
    print(f"Z (Linear Scores):\n{Z.detach().numpy().T}")
    # Step B: Activation A = sigmoid(Z) (Matches Section 4, Step B)
    # retain_grad() keeps the gradients of these intermediate tensors
    # available after backward(), in case you want to inspect them.
    Z.retain_grad()
    A = torch.sigmoid(Z)
    A.retain_grad()
    print(f"A (Predictions):\n{A.detach().numpy().T}")
    # Step C: Loss (Mean Squared Error)
    # Note: nn.MSELoss averages the squared errors over N=3 by default
    loss_fn = nn.MSELoss()
    loss = loss_fn(A, Y)
    print(f"Average Loss: {loss.item():.4f}")

    print("\n--- 3. Backward Pass ---")
    loss.backward()
    print("Gradient w.r.t Weights (grad_W):")
    print(W.grad.T)
    # Manual Check:
    # In Section 5C, we calculated sums: w1_sum = -0.242, w2_sum = -0.228
    # The mean squared error differentiates to 2 * (A - Y) / N, so PyTorch's
    # gradients are those sums scaled by 2/N = 2/3.
    # Expected w1 grad = 2 * -0.242 / 3 ≈ -0.161
    # Expected w2 grad = 2 * -0.228 / 3 ≈ -0.152

    print("\n--- 4. Weight Update (Learning) ---")
    lr = 0.1
    with torch.no_grad():
        W_new = W - lr * W.grad
        b_new = b - lr * b.grad
    print(f"New Weights:\n{W_new.numpy().T}")
    print(f"New Bias:\n{b_new.numpy()}")


if __name__ == "__main__":
    verify_batch_perceptron()
```
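The mean-loss gradients can also be checked without an autograd engine, using central finite differences. This technique is not part of the script above. The sketch perturbs each parameter by a small `eps` and measures how the mean squared error changes. The helper names `mean_squared_loss` and `central_difference`, and the choice `eps = 1e-6`, are assumptions made for illustration.

```python
import math

X = [[1.0, 2.0], [0.2, 2.0], [1.0, 0.5]]
Y = [1.0, 0.0, 1.0]


def mean_squared_loss(w1, w2, b):
    """Mean squared error of the sigmoid neuron over the Coffee Log."""
    total = 0.0
    for (x1, x2), y in zip(X, Y):
        a = 1.0 / (1.0 + math.exp(-(x1 * w1 + x2 * w2 + b)))
        total += (a - y) ** 2
    return total / len(X)


def central_difference(f, params, index, eps=1e-6):
    """Approximate d f / d params[index] as (f(p + eps) - f(p - eps)) / (2 * eps)."""
    up = list(params)
    down = list(params)
    up[index] += eps
    down[index] -= eps
    return (f(*up) - f(*down)) / (2.0 * eps)


params = [0.5, -0.5, 0.0]  # w1, w2, b from Section 3
numeric_grad = [central_difference(mean_squared_loss, params, i) for i in range(3)]

print("Numeric gradient (w1, w2, b):", [round(g, 4) for g in numeric_grad])
```

The three numbers should land close to 2/3 of the hand-computed sums from Section 5 (about -0.161, -0.152 and -0.130). That is the same scale `W.grad` and `b.grad` use when the loss is averaged.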