Steve Ellis

Neural Nets for People Who Think in Trees

Building One From Scratch, One Limitation at a Time.

June 7, 2026

The Motivation

If you think in decision trees, neural nets were probably explained to you backwards — handed over fully assembled, instead of built up one limitation, intuition, and innovation at a time, the way gradient boosted decision trees are.

So, let's build one from scratch.

Stacking Is Free and Worthless

A neuron is just a linear regression: ŷ = Wx + b. Stack two linear layers and look at the algebra: W₂(W₁x) = (W₂W₁)x. Two matrices multiply into one, and you're back to a single linear rule. Biases don't rescue it: W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂), still one weight matrix and one bias. You bought nothing: any pile of linear layers collapses to one.

[Figure: one layer, Wx + b, versus two layers, W₂(W₁x)]
No nonlinearity, no bend: W₂(W₁x) = (W₂W₁)x. The two-layer line tilts through each layer and lands right on the dashed guide — exactly where the one-layer line already is. Either way it's a straight line through a curved cloud; the extra layer buys nothing.
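The collapse is easy to check numerically. Here is a minimal sketch in NumPy; the layer shapes and random weights are arbitrary choices made for illustration, not anything taken from the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linear layers with biases (shapes chosen arbitrarily for the demo).
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

def two_layers(x):
    # Layer 1 then layer 2, with no activation in between.
    return W2 @ (W1 @ x + b1) + b2

# The single layer they collapse to: W = W2 W1, b = W2 b1 + b2.
W = W2 @ W1
b = W2 @ b1 + b2

def one_layer(x):
    return W @ x + b

x = rng.normal(size=3)
collapsed = np.allclose(two_layers(x), one_layer(x))
print(collapsed)
```

Whatever weights you draw, the stacked version and the single merged layer should agree up to floating-point rounding.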

A Hinge Is a Tree Split

The fix is an activation, or "kink", the most popular being ReLU, max(0, w·x + b): flat on one side of a hyperplane, a ramp on the other. That single nonlinearity is all it takes to break the collapse. A kinked unit is no longer a plain line, and no matrix can flatten it back, so the rule the network represents finally curves.

[Figure: ReLU, max(0, w·x + b), plotted against x alongside a tree split]
Same split, two outputs: the tree gives one flat value on each side (a step), while ReLU ramps — the soft, sloped version of the same split.

One ReLU unit is essentially one tree split — "w·x + b > 0". Every unit is the same ReLU, just as every tree node is the same "feature > threshold"; what differs is the weights.

Each unit, then, is one kink you can place anywhere — the same move a tree makes when it splits, just sloped instead of stepped.
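To see the step and the ramp side by side, here is a small sketch. The helper names (relu_unit, tree_split) and the choice of w = 1, b = -1 are assumptions made for this example; the tree's leaf values of 0 and 1 are likewise arbitrary.

```python
import numpy as np

def relu_unit(x, w, b):
    # One ReLU unit: zero on one side of the boundary, a ramp on the other.
    return np.maximum(0.0, w * x + b)

def tree_split(x, w, b, left=0.0, right=1.0):
    # One tree split on the same boundary: a flat value on each side.
    return np.where(w * x + b > 0, right, left)

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
w, b = 1.0, -1.0  # boundary sits at x = 1

ramp = relu_unit(x, w, b)
step = tree_split(x, w, b)

print(ramp)  # zero up to the boundary, then rising
print(step)  # zero up to the boundary, then flat
```

Both functions switch on at exactly the same inputs; they differ only in what they do afterwards.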

The Return on Width

Line those kinks up in a single layer and add them together: that running sum is the network's output. A pile of identical ramps, each turning on in a different spot, sums to an arbitrary wiggle — enough hinges → any curve, the way enough leaves → any staircase. And the sum is exactly what width buys you: more units in the layer, more hinges in the sum, more wiggles you can trace.

f(x) = c + Σ aᵢ·max(0, wᵢ·x + bᵢ)
[Figure: output f(x) against input x, neural net versus decision tree]
Three units, three kinks, summed into one curve; ↑ or ↓ shows which way each unit bends it. The decision tree (dashed) hands each section a single flat value instead — a staircase, not a sloped curve.
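The formula f(x) = c + Σ aᵢ·max(0, wᵢ·x + bᵢ) can be sketched directly. The three units below use made-up weights (they are not the ones behind the figure), chosen so the kinks land at x = -1, 1, and 3.

```python
import numpy as np

def net(x, c, a, w, b):
    # x has shape (n,); a, w, b have shape (k,), one entry per hidden unit.
    hidden = np.maximum(0.0, np.outer(x, w) + b)  # shape (n, k)
    return c + hidden @ a

c = 0.5
a = np.array([1.0, -2.0, 1.5])   # sign of a_i sets which way unit i bends the curve
w = np.array([1.0, 1.0, 1.0])
b = np.array([1.0, -1.0, -3.0])

kinks = -b / w                   # each unit switches on where w_i * x + b_i = 0

# Measure the slope inside each of the four segments between the kinks.
probes = np.array([-2.0, 0.0, 2.0, 4.0])
h = 0.5
slopes = (net(probes + h, c, a, w, b) - net(probes, c, a, w, b)) / h

print(kinks)
print(slopes)
```

With every wᵢ equal to 1, each kink changes the slope by exactly aᵢ, so the slopes read off as running sums of a: three units, three bends, four straight segments.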

But "enough" has a cost: the return is strictly linear. A unit of width is one more ReLU — one more kink you can bend into the curve. Want another wiggle in the fit? That's another neuron, another hinge. A target with a hundred wiggles costs a hundred hinges; the count climbs one-for-one with the detail you're chasing. That count is what we should be watching.

[Figure: two panels, 2 hinges (left) and 7 hinges (right)]
Width adds hinges one for one. Each dot on the line is a hinge — a unit of width. Two hinges (left) cut straight across the cloud; matching every wiggle (right) takes a hinge apiece.

The Return on Depth

Everything so far has been width: more units in a single layer, each laying down one more hinge, more cuts side by side carving the input into finer pieces. Depth is the other lever: instead of widening one layer, stack the kinked layers so each one bends what the layer below already bent. Now they compose, and the nonlinearity is what keeps that composition from collapsing back into a single rule. It works the same way in nets and decision trees: depth buys exponentially many regions. A depth-d tree has 2ᵈ leaves.

[Figure: output f(x) against input x; data points with fits from 1 layer (7 params), 2 layers (13 params), and 3 layers (19 params)]
The dots are samples from a smooth underlying relationship; each line is the network's prediction at a given depth, told apart by its dash. With two units per layer, one layer manages just a bend or two (its 7 parameters buy two kinks); a second layer folds those into more; a third tracks the curve. Each added layer is only two more units, six more parameters (four weights and two biases), yet roughly doubles the regions the fit can carve.
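The 7, 13, and 19 in the figure follow from counting weights and biases layer by layer. A quick sketch, assuming one input, one output, and fully connected layers (count_params is a helper written for this example):

```python
def count_params(n_in, hidden_widths, n_out=1):
    # Each fully connected layer has fan_in * fan_out weights plus fan_out biases.
    sizes = [n_in, *hidden_widths, n_out]
    return sum(
        fan_in * fan_out + fan_out
        for fan_in, fan_out in zip(sizes[:-1], sizes[1:])
    )

# One, two, and three hidden layers of two units each.
counts = [count_params(1, [2] * depth) for depth in (1, 2, 3)]
print(counts)  # [7, 13, 19]
```

Each extra hidden layer of two units adds 2×2 weights and 2 biases, which is where the step of six comes from.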

Why Depth Wins

So "depth multiplies regions" isn't the new idea; the price is. A tree pays for its 2ᵈ regions with 2ᵈ − 1 independent splits: exponential regions, exponential parameters.

Width is the expensive route — k units carve on the order of k regions, one per hinge, linear in the count. Depth is the efficient one: each layer folds the space beneath it, so a single boundary drawn upstream reappears in every region the layers below already carved, and the regions multiply per layer instead of adding.

Put numbers on it. To carve a thousand-odd regions, a single wide layer needs about a thousand units, roughly 3,000 weights. Stack ten narrow layers of two units and you reach the same thousand regions (2¹⁰ = 1,024) for about 60. Same region count, about fifty times fewer weights.

[Figure: width versus depth. Top row, the fit: both fit the data equally well. Bottom row, the cost: ≈ 3,000 params (~1,000 units, one layer) versus ≈ 60 params (10 layers of 2)]
Top row: on a wiggly target like this, a wide-but-shallow net and a deep net fit equally well. Bottom row: the difference is cost as the target gets more complex. To carve ~1,000 regions (2¹⁰), one wide layer needs ~1,000 units, about 3,000 weights; ten narrow layers of two reach the same count for about 60. Same fit, ~50× fewer weights.
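Here is the same bookkeeping as a rough sketch, for a one-dimensional input and a scalar output. It assumes the best case for depth (every two-unit layer doubles the region count) and the tightest case for width (k hinges on a line give at most k + 1 pieces); the helper names are made up for this example.

```python
def wide_params(units):
    # One hidden layer on a 1-D input: units weights + units biases,
    # then units output weights + 1 output bias.
    return 3 * units + 1

def deep_params(layers, width=2):
    first = 1 * width + width                        # input -> first hidden layer
    middle = (layers - 1) * (width * width + width)  # hidden -> hidden
    last = width + 1                                 # last hidden -> output
    return first + middle + last

rows = []
for layers in (2, 5, 10):
    regions = 2 ** layers             # best case: each layer doubles the count
    tree_splits = regions - 1         # a full binary tree with that many leaves
    wide = wide_params(regions - 1)   # one hinge per extra region
    deep = deep_params(layers)
    rows.append((layers, regions, tree_splits, wide, deep))

for row in rows:
    print(row)  # (layers, regions, tree splits, wide params, deep params)
```

At ten layers this gives 1,024 regions for 61 parameters against 3,070 for the single wide layer, a ratio of about 50. At two layers the wide net is actually the cheaper one (10 parameters against 13), so the gap only opens up as the target gets more complex.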

The Fold

A net reaches that same explosion for far less. Picture folding a sheet of paper a few times and making one cut: unfold it and that snip shows up many times over, though you only ever cut once.

That is depth. Each layer folds the space beneath it, so a single boundary — drawn one time — reappears across every fold. Regions multiply per layer instead of adding, but the weights behind them grow only polynomially — exponential regions, polynomial parameters (Montúfar et al., 2014).

[Figure: 1. a flat sheet, folded twice; 2. folded both ways, one snip; 3. unfolded, four cuts: cut once, appears four times]
Folding reuses one cut. Fold a square once across and once down — two creases at right angles — then make one L-shaped snip and unfold. That single cut is now four, one per quadrant, each flipped across a crease. Each fold doubles the copies; two folds give 2². A layer of depth does the same: it reuses one boundary across every region the layers below already carved.
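The fold can be written down as an actual two-unit ReLU layer. The sketch below hand-sets the weights to a "tent" shape on [0, 1] (a standard textbook construction, not a trained network and not necessarily the one behind the figures above), stacks it, and counts the straight pieces in the result.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def fold_layer(x):
    # Two ReLU units, hand-picked so the layer folds [0, 1] onto itself:
    # rises from 0 to 1 on [0, 0.5], falls back to 0 on [0.5, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def count_linear_pieces(depth, grid_bits=12):
    # Grid points are multiples of 2**-grid_bits, so every breakpoint of the
    # stacked folds (multiples of 2**-depth, depth <= grid_bits) lies on the grid.
    x = np.linspace(0.0, 1.0, 2**grid_bits + 1)
    y = x
    for _ in range(depth):
        y = fold_layer(y)
    # The slope flips sign at every breakpoint, so pieces = sign flips + 1.
    slope_signs = np.sign(np.diff(y))
    return int(np.sum(slope_signs[1:] != slope_signs[:-1])) + 1

pieces = [count_linear_pieces(depth) for depth in range(1, 11)]
print(pieces)
```

Each extra layer costs the same handful of weights but doubles the count, reaching 1,024 pieces at depth ten. Trained networks rarely land on folds this clean, so read it as the upper end of what depth can buy rather than what any given fit will do.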

One boundary, paid for once, reused across every fold.