Neural Nets for People Who Think in Trees
Building One From Scratch, One Limitation at a Time.
June 7, 2026
The Motivation
If you think in decision trees, neural nets were probably explained to you backwards — handed over fully assembled, instead of built up one limitation, intuition, and innovation at a time, the way gradient boosted decision trees are.
So, let's build one from scratch.
Stacking Is Free and Worthless
A neuron is just a linear regression: ŷ = Wx + b. Stack two linear layers and look at the algebra: W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂). Two matrices multiply into one, the biases fold into one, and you're back to a single linear rule. You bought nothing: any pile of linear layers collapses to one.
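The collapse is easy to check by hand. Here is a minimal sketch in plain Python, with made-up integer weights (W1, b1, W2, b2 and x are illustrative values, not anything from a trained model), showing that two stacked linear layers and one precomputed layer give the same answer.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, x):
    """Apply matrix A to vector x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def add(u, v):
    """Add two vectors elementwise."""
    return [a + b for a, b in zip(u, v)]

# Illustrative weights: 2 inputs -> 3 hidden units -> 1 output, no activation.
W1, b1 = [[1, -2], [0, 3], [2, 1]], [1, 0, -1]
W2, b2 = [[1, 1, -1]], [5]
x = [4, 5]

# Two layers applied one after the other.
hidden = add(matvec(W1, x), b1)
stacked = add(matvec(W2, hidden), b2)

# The same rule as a single layer: W = W2 W1, b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
collapsed = add(matvec(W, x), b)

print(stacked, collapsed)  # [3] [3]
```

Integers keep the comparison exact; with floats the two results would agree only up to rounding.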
A Hinge Is a Tree Split
The fix is an activation, or "kink", the most popular being ReLU, max(0, w·x + b): flat on one side of a hyperplane, a ramp on the other. That single nonlinearity is all it takes to break the collapse. A kinked unit is no longer a plain line, and no matrix can flatten it back, so the rule the network represents finally curves.
One ReLU unit is essentially one tree split — "w·x + b > 0". Every unit is the same ReLU, just as every tree node is the same "feature > threshold"; what differs is the weights.
Each unit, then, is one kink you can place anywhere — the same move a tree makes when it splits, just sloped instead of stepped.
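To make the comparison concrete, here is a small sketch that puts a hinge and a split at the same spot on a one-dimensional input. The function names and the choice of w = 1, b = -2 are illustrative assumptions.

```python
def relu_unit(x, w, b):
    """One ReLU unit: flat where w*x + b <= 0, a ramp where it is positive."""
    return max(0.0, w * x + b)

def tree_split(x, threshold):
    """One tree split: a step from 0 to 1 at the threshold."""
    return 1.0 if x > threshold else 0.0

# With w = 1 and b = -2 the hinge sits at x = 2, the same place as the split.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
hinge = [relu_unit(x, w=1.0, b=-2.0) for x in xs]
step = [tree_split(x, threshold=2.0) for x in xs]

print(hinge)  # [0.0, 0.0, 0.0, 1.0, 2.0]
print(step)   # [0.0, 0.0, 0.0, 1.0, 1.0]
```

Both outputs are zero up to x = 2 and switch on after it; the split jumps to a constant, the hinge keeps climbing.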
The Return on Width
Line those kinks up in a single layer and add them together: that running sum is the network's output. A pile of identical ramps, each turning on in a different spot, sums to an arbitrary wiggle — enough hinges → any curve, the way enough leaves → any staircase. And the sum is exactly what width buys you: more units in the layer, more hinges in the sum, more wiggles you can trace.
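One way to see the sum at work is to build a zigzag out of hinges by hand. The sketch below assumes a one-dimensional input and sets the weights directly from the target points rather than training them; fit_hinges and predict are hypothetical helper names.

```python
def relu(z):
    return max(0.0, z)

def fit_hinges(knots, values):
    """Return (offset, [(knot, slope_change), ...]) for the piecewise-linear
    curve through the given points: one hinge per segment."""
    hinges = []
    prev_slope = 0.0
    for i in range(len(knots) - 1):
        slope = (values[i + 1] - values[i]) / (knots[i + 1] - knots[i])
        # Each hinge turns on at its knot and changes the running slope.
        hinges.append((knots[i], slope - prev_slope))
        prev_slope = slope
    return values[0], hinges

def predict(x, offset, hinges):
    """A single layer of ReLU units, summed."""
    return offset + sum(c * relu(x - k) for k, c in hinges)

knots = [0.0, 1.0, 2.0, 3.0, 4.0]
values = [0.0, 1.0, 0.0, 1.0, 0.0]  # a zigzag target
offset, hinges = fit_hinges(knots, values)
fitted = [predict(k, offset, hinges) for k in knots]

print(len(hinges), fitted)  # 4 [0.0, 1.0, 0.0, 1.0, 0.0]
```

Four segments cost four hinges. Adding another zig to the target adds exactly one more entry to the list.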
But "enough" has a cost: the return is strictly linear. A unit of width is one more ReLU — one more kink you can bend into the curve. Want another wiggle in the fit? That's another neuron, another hinge. A target with a hundred wiggles costs a hundred hinges; the count climbs one-for-one with the detail you're chasing. That count is what we should be watching.
The Return on Depth
Everything so far has been width: more units in a single layer, each laying down one more hinge, more cuts side by side carving the input into finer pieces. Depth is the other lever: instead of widening one layer, stack the kinked layers so each one bends what the layer below already bent. Now they compose, and the nonlinearity is what keeps that composition from collapsing back into a single rule. It works the same way in nets and decision trees: depth buys exponential regions, and a depth-d tree has 2^d leaves.
Why Depth Wins
So "depth multiplies regions" isn't the new idea; the price is. A tree pays for its 2^d regions with 2^d − 1 independent splits: exponential regions, exponential parameters.
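The tree's bill can be tallied by walking a complete binary tree and counting as you go. This is a sketch for a full tree only (every branch split all the way down); full_tree_counts is a hypothetical helper.

```python
def full_tree_counts(depth):
    """Count (leaves, splits) of a complete binary tree by walking it."""
    if depth == 0:
        return 1, 0  # a lone leaf, no split
    left_leaves, left_splits = full_tree_counts(depth - 1)
    right_leaves, right_splits = full_tree_counts(depth - 1)
    # This node contributes one split of its own on top of its two subtrees.
    return left_leaves + right_leaves, left_splits + right_splits + 1

for d in (1, 2, 3, 10):
    leaves, splits = full_tree_counts(d)
    print(d, leaves, splits)
# 1 2 1
# 2 4 3
# 3 8 7
# 10 1024 1023
```

Every region is bought with a split of its own, which is why the two columns grow together.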
Width is the expensive route — k units carve on the order of k regions, one per hinge, linear in the count. Depth is the efficient one: each layer folds the space beneath it, so a single boundary drawn upstream reappears in every region the layers below already carved, and the regions multiply per layer instead of adding.
Put numbers on it. To carve a thousand-odd regions, a single wide layer needs about a thousand units, roughly 3,000 weights. Stack ten narrow layers of two units and you reach the same thousand regions (2^10 = 1,024) for about 60. The same number of regions, fifty times fewer weights.
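The arithmetic behind those figures can be spelled out under one plausible accounting, which is an assumption here rather than something the numbers above state: a one-dimensional input, three parameters per wide unit (input weight, bias, output weight), and a 2×2 weight matrix plus two biases per narrow layer.

```python
def wide_layer_weights(units):
    # Assumed accounting: one input weight, one bias, one output weight per unit.
    return units * 3

def narrow_stack_weights(layers, units=2):
    # Assumed accounting: a units x units weight matrix plus one bias per unit.
    return layers * (units * units + units)

wide = wide_layer_weights(1000)
deep = narrow_stack_weights(10)
regions = 2 ** 10
ratio = wide / deep

print(wide, deep, regions, ratio)  # 3000 60 1024 50.0
```

The exact totals shift a little with how the first and last layers are counted, but the ratio stays in the same neighborhood.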
The Fold
A net reaches that same explosion for far less. Picture folding a sheet of paper a few times and making one cut: unfold it and that snip shows up many times over, though you only ever cut once.
That is depth. Each layer folds the space beneath it, so a single boundary — drawn one time — reappears across every fold. Regions multiply per layer instead of adding, but the weights behind them grow only polynomially — exponential regions, polynomial parameters (Montúfar et al., 2014).
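The fold can be played out on a one-dimensional input. In the sketch below each layer is two ReLU units with hand-picked weights that fold the interval [0, 1] onto itself; the weights are an illustrative assumption, not learned and not taken from the cited paper. Stacking the same layer and counting straight-line pieces shows the doubling.

```python
def relu(z):
    return max(0.0, z)

def fold(x):
    """One two-unit ReLU layer: rises on [0, 0.5], falls on [0.5, 1]."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep_net(x, depth):
    """Apply the same folding layer `depth` times."""
    for _ in range(depth):
        x = fold(x)
    return x

def count_linear_pieces(depth, grid=4096):
    """Count straight-line pieces on [0, 1] by watching the slope flip sign.
    The grid is a power of two, so every fold line lands on a grid point."""
    ys = [deep_net(i / grid, depth) for i in range(grid + 1)]
    rising = [ys[i + 1] > ys[i] for i in range(grid)]
    return 1 + sum(rising[i] != rising[i + 1] for i in range(grid - 1))

pieces = {d: count_linear_pieces(d) for d in (1, 2, 3, 5, 10)}
print(pieces)  # {1: 2, 2: 4, 3: 8, 5: 32, 10: 1024}
```

Each extra layer adds the same two units and doubles the piece count: ten layers, twenty units, 1,024 pieces.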
One boundary, paid for once, reused across every fold.