Neural Nets for People Who Think in Trees

Building One From Scratch, One Limitation at a Time.

June 7, 2026

The Motivation

If you think in decision trees, neural nets were probably explained to you backwards — handed over fully assembled, instead of built up one limitation, intuition, and innovation at a time, the way gradient boosted decision trees are.

So, let's build one from scratch.

Stacking Is Free and Worthless

A neuron is just a linear regression: ŷ = Wx + b. Stack two linear layers and look at the algebra: W₂(W₁x) = (W₂W₁)x. Two matrices multiply into one, and you're back to a single linear rule. The biases don't rescue you either: W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂), which is still one matrix and one offset. You bought nothing: any pile of linear layers collapses to one.

No nonlinearity, no bend: W₂(W₁x) = (W₂W₁)x. The two-layer line tilts through each layer and lands right on the dashed guide — exactly where the one-layer line already is. Either way it's a straight line through a curved cloud; the extra layer buys nothing.
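The collapse is easy to check numerically. The sketch below is only an illustration: the layer shapes, the random weights, and the names `two_layers` and `one_layer` are assumptions made for the example, not anything specific to a real model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linear layers with arbitrary (random) weights and biases.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

def two_layers(x):
    # Layer 2 applied to the output of layer 1, no activation in between.
    return W2 @ (W1 @ x + b1) + b2

# The single layer the stack collapses into.
W = W2 @ W1
b = W2 @ b1 + b2

def one_layer(x):
    return W @ x + b

x = rng.normal(size=3)
print(np.allclose(two_layers(x), one_layer(x)))  # True
```

Whatever weights you draw, the two functions agree up to floating-point rounding, which is the whole point: without a nonlinearity, the second layer adds parameters but no new shapes.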

A Hinge Is a Tree Split

The fix is an activation, or "kink," the most popular being ReLU, max(0, w·x + b): flat on one side of a hyperplane, a ramp on the other. That single nonlinearity is all it takes to break the collapse. A kinked unit is no longer a plain line, and no matrix can flatten it back, so the rule the network represents finally curves.

Same split, two outputs: the tree gives one flat value on each side (a step), while ReLU ramps — the soft, sloped version of the same split.

One ReLU unit is essentially one tree split — "w·x + b > 0". Every unit is the same ReLU, just as every tree node is the same "feature > threshold"; what differs is the weights.

Each unit, then, is one kink you can place anywhere — the same move a tree makes when it splits, just sloped instead of stepped.
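To see the step and the ramp side by side, here is a minimal sketch in one dimension. The weight `w`, the bias `b`, and the leaf values `left` and `right` are hypothetical numbers chosen so the split lands at x = 2.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

w, b = 1.0, -2.0  # hypothetical weights: the split sits at x = 2

def tree_split(x, left=0.0, right=1.0):
    # A tree node: one flat value on each side of the threshold.
    return np.where(w * x + b > 0, right, left)

def relu_unit(x):
    # A ReLU unit: flat on one side, a ramp on the other.
    return relu(w * x + b)

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
print(tree_split(x))  # [0. 0. 0. 1. 1.]
print(relu_unit(x))   # [0. 0. 0. 1. 2.]

# The kink breaks linearity: relu(a + b) is not relu(a) + relu(b).
print(relu(-1.0 + 1.0), relu(-1.0) + relu(1.0))  # 0.0 1.0
```

Both functions switch on at the same place; only the shape past the threshold differs. The last line is the reason the collapse no longer happens: a function that fails additivity cannot be written as a matrix.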

The Return on Width

Line those kinks up in a single layer and add them together: that running sum is the network's output. A pile of identical ramps, each turning on in a different spot, sums to an arbitrary wiggle — enough hinges → any curve, the way enough leaves → any staircase. And the sum is exactly what width buys you: more units in the layer, more hinges in the sum, more wiggles you can trace.

Three units, three kinks, summed into one curve; ↑ or ↓ shows which way each unit bends it. The decision tree (dashed) hands each section a single flat value instead — a staircase, not a sloped curve.

But "enough" has a cost: the return is strictly linear. A unit of width is one more ReLU — one more kink you can bend into the curve. Want another wiggle in the fit? That's another neuron, another hinge. A target with a hundred wiggles costs a hundred hinges; the count climbs one-for-one with the detail you're chasing. That count is what we should be watching.

Width adds hinges one for one. Each dot on the line is a hinge — a unit of width. Two hinges (left) cut straight across the cloud; matching every wiggle (right) takes a hinge apiece.
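The one-for-one cost can be made concrete by writing the hinge weights down by hand instead of training them. The sketch below assumes a 1-D input and a sine target purely for illustration; `hinge_fit` is a hypothetical helper, and setting each output weight to the change in slope at its knot is one convenient construction, not the only one.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def hinge_fit(knots, values):
    """Build a one-hidden-layer ReLU net in 1-D through (knots, values).

    Unit i computes relu(x - knots[i]); its output weight is the change
    in slope at that knot, so each unit contributes exactly one bend.
    """
    slopes = np.diff(values) / np.diff(knots)  # slope of each segment
    out_w = np.diff(slopes, prepend=0.0)       # slope change per hinge
    starts = knots[:-1]                        # where each unit turns on

    def net(x):
        x = np.asarray(x, dtype=float)
        hidden = relu(x[..., None] - starts)   # shape (..., n_units)
        return values[0] + hidden @ out_w      # sum of ramps plus offset

    return net, len(starts)

knots = np.linspace(0.0, 2 * np.pi, 9)
net, n_units = hinge_fit(knots, np.sin(knots))

xs = np.linspace(0.0, 2 * np.pi, 200)
print(n_units)  # 8
print(np.allclose(net(xs), np.interp(xs, knots, np.sin(knots))))  # True
```

Nine knots make eight straight segments, and the net needs eight units to trace them. Doubling the number of segments means doubling the units, which is the linear return described above.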

The Return on Depth

Everything so far has been width — more units in a single layer, each laying down one more hinge, more cuts side by side carving the input into finer pieces. Depth is the other lever: instead of widening one layer, stack the kinked layers so each one bends what the layer below already bent. Now they compose — and the nonlinearity is what keeps that composition from collapsing back into a single rule. It works the same way in nets and decision trees: depth buys exponential regions — a depth-d tree has 2^d leaves.

The dots are samples from a smooth underlying relationship; each line is the network's prediction at a given depth, told apart by its dash. With two units per layer, one layer manages just a bend or two (its 7 weights buy two kinks); a second layer folds those into more; a third tracks the curve. Each added layer is only two more units — about six more weights — yet roughly doubles the regions the fit can carve.

Why Depth Wins

So "depth multiplies regions" isn't the new idea — the price is. A tree pays for its 2^d regions with 2^d − 1 independent splits: exponential regions, exponential parameters.

Width is the expensive route — k units carve on the order of k regions, one per hinge, linear in the count. Depth is the efficient one: each layer folds the space beneath it, so a single boundary drawn upstream reappears in every region the layers below already carved, and the regions multiply per layer instead of adding.

Put numbers on it. To carve a thousand-odd regions, a single wide layer needs about a thousand units, roughly 3,000 weights (an input weight, a bias, and an output weight for each unit). Stack ten narrow layers of two units and you can reach the same thousand regions (2¹⁰) for about 60, provided every layer actually folds what came before it. Same region count, fifty times fewer weights.

Top row: on a wiggly target like this, a wide-but-shallow net and a deep net fit equally well. Bottom row: the difference is cost as the target gets more complex. To carve ~1,000 regions (2¹⁰), one wide layer needs ~1,000 units — about 3,000 weights; ten narrow layers of two reach the same count for about 60. Same fit, ~50× fewer weights.
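The weight counts quoted in this article can be reproduced with a few lines of arithmetic. The sketch assumes fully connected layers with biases, one input and one output; `count_params` is a hypothetical helper written for this check.

```python
def count_params(hidden_widths, n_in=1, n_out=1):
    """Weights plus biases of a fully connected net with the given hidden layers."""
    sizes = [n_in, *hidden_widths, n_out]
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes, sizes[1:]))

print(count_params([2]))        # 7     one layer of two units
print(count_params([2, 2]))     # 13    a second layer adds six
print(count_params([1000]))     # 3001  one wide layer
print(count_params([2] * 10))   # 61    ten narrow layers
print(2 ** 10 - 1)              # 1023  splits in a full depth-10 tree
```

The ratio 3001 / 61 comes out just under 50, which is where the "fifty times fewer" figure comes from. The count says nothing about whether training will find the folding weights; it only compares what each shape costs to store.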

The Fold

A net reaches that same explosion for far less. Picture folding a sheet of paper a few times and making one cut: unfold it and that snip shows up many times over, though you only ever cut once.

That is depth. Each layer folds the space beneath it, so a single boundary — drawn one time — reappears across every fold. Regions multiply per layer instead of adding, but the weights behind them grow only polynomially — exponential regions, polynomial parameters (Montúfar et al., 2014).

Folding reuses one cut. Fold a square once across and once down — two creases at right angles — then make one L-shaped snip and unfold. That single cut is now four, one per quadrant, each flipped across a crease. Each fold doubles the copies; two folds give 2². A layer of depth does the same: it reuses one boundary across every region the layers below already carved.
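The paper fold has a well-known one-dimensional counterpart: a "tent" built from two ReLU units, composed with itself. The sketch below uses hand-picked weights rather than trained ones, and the names `fold`, `UNIT_BIASES`, `COMBINE`, and `count_linear_pieces` are assumptions made for this example. In a real network the recombination step would be absorbed into the next layer's weights, so each fold still costs only two units.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# One fold layer: tent(x) = 2*relu(x) - 4*relu(x - 0.5).
# It rises from 0 to 1 on [0, 0.5] and falls back to 0 on [0.5, 1],
# so it maps [0, 1] onto itself by folding the interval in half.
UNIT_BIASES = np.array([0.0, -0.5])
COMBINE = np.array([2.0, -4.0])

def fold(x):
    hidden = relu(x[:, None] + UNIT_BIASES)  # two ReLU units
    return hidden @ COMBINE

def count_linear_pieces(depth, grid_bits=12):
    # A dyadic grid keeps every breakpoint exactly on a grid point.
    x = np.linspace(0.0, 1.0, 2 ** grid_bits + 1)
    y = x
    for _ in range(depth):
        y = fold(y)
    slopes = np.diff(y) / np.diff(x)
    # A new linear piece starts wherever the slope changes.
    changes = ~np.isclose(np.diff(slopes), 0.0)
    return 1 + int(np.count_nonzero(changes))

for depth in range(1, 7):
    print(depth, count_linear_pieces(depth))
# 1 2
# 2 4
# 3 8
# 4 16
# 5 32
# 6 64
```

Each extra layer reuses the same two units' worth of machinery and doubles the number of straight pieces, which is the best case the region counts in this article rely on. Weights found by training are not guaranteed to fold this cleanly.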

One boundary, paid for once, reused across every fold.