What It’s Worth
Dopamine, the Ventral Striatum, and the Learning of Value
The problem the selector cannot solve on its own
We ended the last chapter with a machine that can choose. The basal ganglia hold the motor programs switched off by default and release one by lifting its inhibition while clamping the rest — selection by disinhibition, the trick conserved since the lamprey. We even said plainly what the machine is for: a hungry animal is more likely to release the eating action than the throwing one, because hunger has raised the value of eating. And then we helped ourselves, repeatedly and without apology, to the very thing we had not explained. We said some actions are worth more than others. We never said where the worth comes from, or how it reaches the striatum and tips the contest.
That is the gap this chapter fills, and it is not a small one. To see why it is hard, consider what an animal would need in order to learn, on its own, which actions are worth selecting. The difficulty is not recognizing a reward when it arrives — a mouthful of food, a swallow of water. The difficulty is everything that comes before the reward, and the long, uncertain gap in between.
Think about what actually happens when you are hungry at midnight. You do not simply eat; there is no food in your hand. You get out of bed, you cross the room, you open the refrigerator, you find it empty, you check the pantry, and there — finally — is something to eat. The reward sits at the end of a long chain of actions, most of which produced nothing at all. And even once you have eaten, the physiological payoff that the whole exercise was for — the restoration of blood glucose, the quieting of the signals that woke you — arrives slowly, minutes to hours later, long after the behavior that earned it is over. We met this lag in the very first unit, when we distinguished the fast hedonic response from the slow homeostatic correction it predicts.
So an animal trying to learn from experience faces two problems at once, and they compound each other. The first is the credit assignment problem: when reward finally arrives, which of the many actions that preceded it deserves the credit? You opened the refrigerator (nothing), checked the oven (nothing), opened the pantry (food). Reinforcing all of it would teach you to check the oven again next time, which is useless. The reward signal somehow has to reach back through time and reinforce the right link in the chain. The second problem is the one that makes the first so acute: synaptic plasticity, as we developed it earlier in the book, is a fast and local affair. The coincidence detection that strengthens a synapse (pre-before-post, calcium through the NMDA receptor, the backpropagating action potential) operates on a timescale of tens of milliseconds. But the reward is seconds, even minutes, removed from the action that earned it. The synapse that needs strengthening went quiet long ago. How can an outcome reinforce a synapse that is no longer active?
This chapter is about the system that solves both problems, and about the molecule at its center. We will build it in stages. First the learning theory — the surprisingly simple idea that you learn only when you are surprised — and the signal that implements it: the dopamine reward-prediction error, the spine of this chapter and one of the genuine triumphs of systems neuroscience. Then the place where that signal does its most important work, the ventral striatum, built on the same plan as the dorsal striatum of the last chapter but wired to a different world. Then a complication that turns out to be central rather than peripheral: dopamine is doing two jobs at once, and keeping them apart is the key to the whole picture. We will pay off the worked example the last chapter promised — the contest between eating and throwing, and how a shift in drive flips it. And we will close, as this unit must, by handing the next unit its opening question: not which action to take, but what each option is actually worth.
A note on what we will lean on rather than rebuild. The dopamine system itself — the midbrain nuclei, the VTA and substantia nigra, where their axons go — we met in the last chapter and in the unit on neuromodulation; we will recall it, not re-derive it. The same goes for synaptic plasticity and the machinery of credit assignment at the single synapse, and for the hypothalamic drive states that set behavior in motion in the first place. This chapter assumes all of it and builds upward.
You learn only when you are surprised
Start with the learning theory, because the biology turns out to be its near-literal implementation, and the theory is older than any of the recordings.
The intuitive view of associative learning — the one built into “cells that fire together, wire together” — is that learning tracks co-occurrence. Ring a bell before food often enough, and the bell-food association strengthens in proportion to how often the two coincide. For a long time this seemed obviously right. It is also, as it turns out, wrong, and the experiment that broke it is worth stating because it forces the better theory.
The phenomenon is called blocking, demonstrated by Leon Kamin in the late 1960s. Train an animal until a light reliably predicts a shock. Now present the light and a tone together, still followed by the shock, for many trials. By the co-occurrence view, the tone is paired with shock just as often as the light was, so it should become an equally good predictor. It does not. The animal learns almost nothing about the tone. The light already predicted the shock perfectly; the tone added no new information; and so — this is the crucial inference — no learning occurred. Co-occurrence was present in full. Learning was absent.
What blocking reveals is that the brain does not learn about what happens; it learns about what it failed to predict. The tone is redundant — the outcome was already expected — and the redundant cue is ignored. Learning is driven not by the reward but by the error: the discrepancy between the reward you predicted and the reward you got.
Robert Rescorla and Allen Wagner formalized exactly this in 1972, in what became the most influential model in the psychology of learning. Its content can be stated in one sentence: the associative strength of a cue is updated in proportion to the difference between the reward actually received and the total reward predicted by all cues present. When the outcome matches the prediction, the difference is zero and nothing changes — which is precisely why a well-predicted reward teaches nothing, and why the blocked tone stays mute. Learning happens only at the surprise. As the prediction improves, the error shrinks, and learning slows and stops on its own. The model is sometimes written as an equation, but the idea is the whole of it: learn from prediction error, not from coincidence.
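The one-sentence rule is small enough to run. What follows is a minimal sketch, not a fitted model: the learning rate `alpha`, the trial counts, and the cue names `LIGHT` and `TONE` are illustrative assumptions, and the outcome is coded simply as a magnitude of 1. It applies the update (error equals outcome minus the summed prediction of all cues present) to Kamin's design and to a control group that never had the light pretrained.

```python
def rescorla_wagner(trials, alpha=0.3, n_cues=2):
    """Update associative strengths V with the Rescorla-Wagner rule.

    Each trial is (present, outcome): `present` is a tuple of cue indices
    shown on that trial and `outcome` is the magnitude of what followed.
    Rule: error = outcome - sum of V over cues present; every present cue
    then moves by alpha * error.
    """
    V = [0.0] * n_cues
    for present, outcome in trials:
        prediction = sum(V[i] for i in present)  # total prediction from all cues present
        error = outcome - prediction             # the surprise on this trial
        for i in present:
            V[i] += alpha * error                # only cues that were present are updated
    return V


LIGHT, TONE = 0, 1  # hypothetical cue labels

# Blocking design: light alone first, then light and tone together.
phase1 = [((LIGHT,), 1.0)] * 50
phase2 = [((LIGHT, TONE), 1.0)] * 50
blocked = rescorla_wagner(phase1 + phase2)

# Control: the same compound training with no pretraining of the light.
control = rescorla_wagner(phase2)

print(f"blocked group, tone strength: {blocked[TONE]:.4f}")
print(f"control group, tone strength: {control[TONE]:.4f}")
```

In the blocked group the light has already absorbed nearly all of the prediction by the time the tone appears, so the error is close to zero and the tone gains almost nothing. In the control group the two cues share the error and each ends near half the total strength. The co-occurrence of tone and outcome is identical in both groups; only the error differs.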
The Rescorla-Wagner model has one limitation that matters for everything below: it is blind to time within a trial. It treats each trial as a single event — cue, then outcome — and asks only how the association changes from one trial to the next. But our midnight problem was a problem within the episode: a chain of actions, and a reward at the end, with the credit needing to flow back across the gap between the steps. For that we need a model that handles time inside the trial.
That extension is temporal-difference (TD) learning, introduced by Richard Sutton in 1988 and developed with Andrew Barto — the pairing that founded the modern field of reinforcement learning, the same framework that now trains artificial agents. (Much of the vocabulary in this chapter — reward, value, prediction error, credit assignment — flows as much from this computational tradition as from neuroscience, and the two have been entangled ever since.) The move TD makes is elegant. Instead of waiting for the final reward and then asking what predicted it, the system maintains, at every moment, a running prediction of future reward — a value. And it learns by comparing each moment’s prediction to the next. If things are suddenly looking better than they did an instant ago — you have just opened the pantry and seen food — that improvement is a prediction error, and it reinforces whatever you just did, immediately, without waiting for the food to be eaten. The error need not span the whole gap from action to final reward. It only ever spans one step, from each moment to the next, and the value information ratchets backward through the chain one link at a time, trial by trial.
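For readers who want the comparison of "each moment's prediction to the next" in symbols, the standard reinforcement-learning form is given here. The notation is the conventional one from that literature and is an assumption of this aside, not something the chapter depends on: $V(s_t)$ is the running prediction at moment $t$, $r_{t+1}$ is any reward received on the way to the next moment, $\gamma$ is a discount factor between 0 and 1, and $\alpha$ is a learning rate.

```latex
\delta_t = r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t),
\qquad
V(s_t) \leftarrow V(s_t) + \alpha\, \delta_t
```

A positive $\delta_t$ means things look better than they did a moment ago; a negative one means they look worse. Note that the error only ever involves two adjacent moments.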
This solves the credit-assignment problem in a way that is biologically usable. The reward does not have to reach backward across seconds to find the action that earned it. Instead, value propagates backward step by step, so that each action comes to be reinforced by the improvement in prospects it produces — by the moment-to-moment change in expected reward, available right when the action occurs. (In full reinforcement-learning models this bootstrapping can be set up to learn the value of states, the value of actions, or both; the biological point that matters here is simply that value can travel backward through a sequence before the final reward arrives.) Hold that result in mind, because we are about to watch a population of neurons behave very much as though they were reporting this quantity.
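The backward ratcheting can be watched directly in a toy version of the midnight problem. This is a sketch under stated assumptions: the four step names are hypothetical labels for moments in the episode, the sequence is fixed rather than chosen, reward arrives only at the last step, and the learning rate of 0.5 is picked so the numbers are easy to read.

```python
def td0_chain(n_trials, steps=("bed", "kitchen", "pantry_open", "eat"),
              alpha=0.5, gamma=1.0, reward=1.0):
    """Tabular TD(0) over a fixed sequence of steps.

    Reward is delivered only on the final step. Returns a list with one
    snapshot of the value table per trial.
    """
    V = {s: 0.0 for s in steps}
    history = []
    for _ in range(n_trials):
        for i, s in enumerate(steps):
            last = i == len(steps) - 1
            r = reward if last else 0.0
            v_next = 0.0 if last else V[steps[i + 1]]
            delta = r + gamma * v_next - V[s]  # one-step prediction error
            V[s] += alpha * delta
        history.append(dict(V))
    return history


for trial, snapshot in enumerate(td0_chain(4), start=1):
    print(trial, snapshot)
```

After the first trial only the final step carries any value. After the second, the step before it does too, and so on: with these settings the value reaches one link further back on each trial, even though no single error ever spans more than one step.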
Dopamine carries a teaching error
Here is where the theory meets the tissue, and the fit is close enough to have reorganized a field.
In a now-famous series of recordings beginning in the 1980s and culminating in the 1990s, Wolfram Schultz recorded from individual dopamine neurons in the midbrain of monkeys while the animals learned to associate cues with rewards. Many of those neurons behaved, trial after trial, as if they were reporting a reward-prediction error. Three results define the finding, and together they are as close to a smoking gun as systems neuroscience offers. The hedge in “many” and “as if” is not throat-clearing — it is the seam along which the modern picture has grown more complicated, as we will see — but the core fit is real, and it is worth meeting in its clean form first.
- An unpredicted reward drives a burst. When a reward arrives with no warning, the dopamine neurons fire a sharp phasic burst. The reward was unexpected; the prediction error is positive; dopamine reports it.
- A predictive cue shifts the burst earlier in time. Once the animal learns that a cue reliably precedes the reward, the burst moves. The neurons no longer respond to the reward itself — that is now fully predicted, error zero — and instead fire to the cue, the earliest moment at which the prospect of reward improved. This is precisely the backward propagation that TD learning predicts: the signal migrates to the earliest reliable predictor.
- An omitted reward drives a dip below baseline. Now train the cue, then withhold the promised reward. At the moment the reward was due, the dopamine neurons pause, dropping below their baseline firing. The reward was predicted and did not come; the prediction error is negative; dopamine reports that too, by falling silent.
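The three results fall out of a very small temporal-difference simulation. The sketch below is illustrative rather than a model of any recorded neuron: time within a trial is cut into five steps after cue onset, the value before the cue is held at zero on the assumption that the cue arrives at an unpredictable moment, and the learning rate and trial count are arbitrary. The naive trial stands in for the unpredicted reward, since the cue does not yet predict anything.

```python
def run_trial(V, reward, alpha=0.2, gamma=1.0, learn=True):
    """One cue-then-outcome trial; returns the TD error at each moment.

    V[t] is the predicted future reward t steps after cue onset.
    errors[0] is the error at cue onset (coming from a pre-cue value of 0);
    errors[-1] is the error at the moment the reward is due.
    """
    n = len(V)
    errors = [gamma * V[0] - 0.0]            # stepping from "nothing expected" into the cue
    for t in range(n):
        r = reward if t == n - 1 else 0.0    # the outcome is due at the last step
        v_next = V[t + 1] if t + 1 < n else 0.0
        delta = r + gamma * v_next - V[t]
        errors.append(delta)
        if learn:
            V[t] += alpha * delta
    return errors


V = [0.0] * 5
naive = run_trial(V, reward=1.0, learn=False)    # before any learning
for _ in range(500):
    run_trial(V, reward=1.0)                     # cue reliably followed by reward
trained = run_trial(V, reward=1.0, learn=False)  # reward delivered as promised
omitted = run_trial(V, reward=0.0, learn=False)  # reward withheld

for name, e in [("naive", naive), ("trained", trained), ("omitted", omitted)]:
    print(f"{name:8s} cue: {e[0]:+.2f}   outcome time: {e[-1]:+.2f}")
```

Before learning, the error sits at the reward and the cue produces none. After learning, the error has moved to the cue and the delivered reward produces essentially none. When the reward is withheld, the error at the moment it was due is negative.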
That third result deserves a moment, because it depends on an otherwise puzzling feature of the dopamine system: these neurons fire tonically, maintaining a low background rate even when nothing is happening. That tonic rate is what lets the phasic signal be signed. A prediction error can be positive or negative — better than expected, or worse — and you cannot express “worse than expected” by dropping below zero if you are already at zero. Because the neurons sit at a low baseline, they can dip beneath it to signal a shortfall, just as they can burst above it to signal a windfall. This does not mean that error coding is the purpose of tonic firing — tonic activity in these neurons does several things, and we should not read a single function into it — but it is what makes a negative prediction error physiologically expressible at all.
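The arithmetic behind that point fits in a few lines. The numbers here are hypothetical (a 5 Hz baseline and a gain of 10 Hz per unit of error are chosen only for illustration), and the linear mapping is an assumption of the sketch; the one feature that matters is that a firing rate cannot go below zero.

```python
def firing_rate(delta, baseline_hz, gain_hz=10.0):
    """Map a signed prediction error onto a firing rate floored at zero."""
    return max(0.0, baseline_hz + gain_hz * delta)


for baseline in (5.0, 0.0):
    rates = {delta: firing_rate(delta, baseline) for delta in (+1.0, 0.0, -0.4)}
    print(f"baseline {baseline} Hz -> {rates}")
```

With a nonzero baseline, a negative error produces a rate visibly below the rate for no error. With a baseline of zero, the negative error and the zero error both map to silence, and a downstream reader of the signal could not tell them apart.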
Put the three together and the dopamine burst looks less like a reward signal than like an error signal — the difference between predicted and received, the quantity that Rescorla-Wagner and TD learning say a system must compute in order to learn. Schultz, with Peter Dayan and Read Montague, made the identification explicit in 1997: the phasic firing of midbrain dopamine neurons behaves like the temporal-difference prediction error. A theory built to explain the behavior of animals turned out to describe, to a striking degree, the firing of neurons — which is what gives the account its force, and why it reorganized the field even though, as the next box describes, it has not turned out to be the whole story.
A few refinements sharpen the picture rather than complicating it. The dopamine error tracks not just whether reward came but how much relative to expectation: a larger-than-expected reward drives a bigger burst, a smaller-than-expected one a smaller response or a dip, grading smoothly with the size of the surprise. And the signal is sensitive to probability and uncertainty, though in a more interesting way than a single number. In work by Fiorillo, Tobler, and Schultz, the phasic response to the cue varied monotonically with the probability of reward — a more likely reward producing a larger cue response, consistent with the cue improving the prospect more. On top of that, they found a separate, slower signal: a gradual ramp of activity building through the waiting period between cue and reward, and this sustained response was largest when reward was maximally uncertain, near a 50% chance. When the reward finally arrived, the response still obeyed prediction — a fully predicted reward drained the moment of any dopamine response, while a reward delivered under uncertainty still carried news and still drove a response.
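The probability results can be laid out as simple bookkeeping. This is a sketch under assumptions: the cue is taken to raise the prediction from zero to the expected reward, and the variance of a yes-or-no outcome is used as one convenient quantity that peaks at a 50% chance. The sketch does not claim that the sustained ramp computes variance; it only shows that a phasic term and an uncertainty term behave differently as probability changes.

```python
def cue_summary(p, magnitude=1.0):
    """Prediction errors and outcome uncertainty for a cue that pays off with probability p."""
    expected = p * magnitude
    return {
        "cue_error": expected,                        # improvement in prospects at cue onset
        "error_if_delivered": magnitude - expected,   # news carried by the reward itself
        "error_if_omitted": -expected,                # shortfall when the reward is withheld
        "outcome_variance": p * (1.0 - p) * magnitude ** 2,
    }


probabilities = [0.0, 0.25, 0.5, 0.75, 1.0]
table = {p: cue_summary(p) for p in probabilities}
for p, row in table.items():
    print(p, row)

most_uncertain = max(probabilities, key=lambda p: table[p]["outcome_variance"])
print("uncertainty peaks at p =", most_uncertain)
```

The cue term grows steadily with probability, the error at a delivered reward shrinks to nothing when the reward is certain, and the uncertainty term is largest in the middle of the range.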
How does this signal solve the second of our two problems, the timescale mismatch, in which the synapse needing change went quiet long before the reward arrived? The answer is the eligibility trace, an idea we can now give its proper home. When a synapse is active, it does not simply do its job and reset. It leaves a lingering molecular “tag,” a trace marking it as having recently participated, and therefore as eligible for modification if a reinforcing signal should arrive in the next short while. The dopamine error, broadcast some time later, finds these tagged synapses and adjusts them; synapses without a trace are left untouched. The trace is what bridges the gap between the action and its consequence: it holds the synapse eligible across the delay, so that a teaching signal seconds later can still reach the right connection. The precise molecular identity of the trace is still being worked out, and the important point for this chapter is not that there is one known molecule called “the trace.” It is the functional requirement: recently active synapses must remain temporarily eligible for dopamine-gated plasticity, or else a delayed reward could never teach the correct action. There is good evidence that corticostriatal plasticity is gated by the timing of dopamine relative to synaptic activity, on roughly the timescale of seconds that the behavioral problem demands; the mechanism that implements the eligibility window is what remains unsettled.
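The functional requirement can be written as a three-factor rule: a weight changes only in proportion to the product of a learning rate, the dopamine error, and whatever eligibility remains at the synapse. The sketch below assumes an exponentially decaying trace with a two-second time constant; that shape and that number are illustrative assumptions, as are the synapse names, since the mechanism of the real window is unsettled.

```python
import math


def trace_at(delay_s, tau_s=2.0):
    """Eligibility remaining delay_s seconds after the synapse was last active."""
    return math.exp(-delay_s / tau_s)


def dopamine_gated_update(weights, last_active_s, dopamine_time_s, delta,
                          lr=0.1, tau_s=2.0):
    """Three-factor sketch: weight change = lr * delta * eligibility."""
    updated = {}
    for synapse, w in weights.items():
        t_active = last_active_s.get(synapse)  # None: never active, so no tag
        if t_active is None or t_active > dopamine_time_s:
            updated[synapse] = w               # untagged synapses are left untouched
        else:
            eligibility = trace_at(dopamine_time_s - t_active, tau_s)
            updated[synapse] = w + lr * delta * eligibility
    return updated


weights = {"open_pantry": 0.5, "check_oven": 0.5, "unused": 0.5}
last_active_s = {"open_pantry": 9.0, "check_oven": 4.0}  # seconds into the episode
updated = dopamine_gated_update(weights, last_active_s, dopamine_time_s=10.0, delta=1.0)
print(updated)
```

The synapse active one second before the dopamine signal changes the most, the one active six seconds before changes only slightly, and the synapse that never participated does not change at all. A negative `delta` would weaken the tagged synapses by the same proportions.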
The reward-prediction-error account is one of the best-supported theories in systems neuroscience, and the main text presents it as such because it has earned that standing. But it would be against the spirit of this book to leave the impression that the matter is closed. It is not, and the past decade has made the picture richer and more contested. The honest summary is that the RPE story is securely part of what dopamine does, and almost certainly not the whole of it.
Several complications are now well established. The midbrain dopamine neurons are not a uniform population broadcasting one signal: they differ by where they project and by their molecular identity (single-cell studies propose several distinct types), and some co-release other transmitters — glutamate, GABA — alongside dopamine. Different dopamine pathways appear to carry different information. Within this heterogeneity, signals that are not pure reward-prediction error turn up: responses to salience (how attention-grabbing a stimulus is, regardless of reward), to novelty, and to the physical intensity of events. Schultz’s own resolution of the salience challenge is worth knowing: he argues the dopamine response has two components in time — an initial, unselective burst that registers any salient event before its value is known, followed a fraction of a second later by the selective, value-specific prediction-error component. On this view the early salience-like response is real but is the opening of the signal, not its substance.
A second line of complication concerns dopamine and movement itself, which we will see again from the other side later in this chapter. Dopamine is not only a teaching signal arriving after the fact; it also rises in anticipation of and during the pursuit of reward, and it scales with the vigor of movement — how fast and forcefully an animal moves toward something it wants. Recent work even finds reach velocity tracking dopaminergic learning signals moment to moment. Whether this is a separate function or another face of the same value signal is exactly the kind of question still in play.
Where does this leave the field? Split, productively. Some researchers read the new findings as refinements — adjustments to a fundamentally sound prediction-error model. Others are less convinced the prediction-error framework can carry the full weight of explaining behavior, and press the alternatives — salience, novelty, retrospective rather than prospective learning — more seriously. What no one argues is that the original recordings were wrong: the data were never the problem. Better tools simply reveal more, and more changes the interpretation. This is what a healthy theory under pressure looks like — not overturned, but no longer the only thing in the room. (The reader encountering this years from now should expect the balance to have shifted; the safe summary is that reward-prediction error is a real and central part of what dopamine signals, and that “central part” is doing real work in that sentence.)
The ventral striatum: the same machine, wired to a different world
We now have a teaching signal. Where does it teach? The answer returns us to the architecture of the last chapter — with a deliberate economy, because the point is that we have seen this circuit before.
The ventral striatum, whose principal part is the nucleus accumbens, is the input structure of a basal-ganglia loop built on the same plan as the motor loop. It is striatal tissue: it is populated largely by the same inhibitory medium spiny neurons, many of them enriched for D1- or D2-type dopamine receptors. Its output passes through a pallidal stage — here the ventral pallidum, the portion of the pallidum lying beneath the anterior commissure — and on through the thalamus (the mediodorsal nucleus, the one that serves prefrontal cortex) back to the cortex that fed it. The disinhibitory logic is the same; the default-suppression-and-release principle is the same. This is the architectural claim that matters: it is the same selection-and-learning machine, another of the parallel loops we met as the framework of Alexander, DeLong, and Strick.
But one piece of the dorsal picture does not carry over cleanly, and it is worth flagging before we lean on it, because the temptation to import the whole motor-loop diagram is strong. In the dorsal striatum we could line up receptor identity with pathway and, roughly, with function: D1 with the direct “go” pathway, D2 with the indirect “stop” pathway. In the ventral striatum that alignment breaks down. Kupchik and colleagues showed directly that coding the accumbens projections as “direct” versus “indirect” by D1 versus D2 receptor identity is not valid in the dorsal sense — the receptor classes, their projection targets, and their behavioral roles simply do not sort into two clean opposing channels the way the motor model implies. The dorsal loop is the right visual analogy for the ventral one, and the right way in; it is not a rulebook to be applied receptor by receptor.
What differs is everything plugged into it — and that is the whole point of the Difference-Engine view we developed last chapter: a stereotyped circuit performs one computation, and what it computes depends entirely on its inputs and outputs. Wire the machine to motor and premotor cortex and it selects movements; that was the last chapter. Wire an identical copy to the limbic world and it selects something else. The cortex and structures feeding the ventral striatum are not the motor map but the apparatus of motivation and emotion: the amygdala, the hippocampus, the anterior cingulate, and the ventromedial and orbital prefrontal cortex. Its dopamine arrives not from the substantia nigra pars compacta — the dorsal striatum’s supplier — but from its neighbor, the ventral tegmental area (VTA), by the mesolimbic projection. The inputs carry information about drives, contexts, places, and emotional associations; the loop selects among those. What the dorsal loop does for movements, this loop does for goals — for the things worth pursuing. It is, to put it in the terms the last chapter set up, a major motivational entry point into the action system, and the striato-nigro-striatal spiral we described is one anatomical route by which what this loop computes about worth can climb toward the dorsal territories that select and automatize movement.
Two anatomical points are worth having, and no more. First, the nucleus accumbens is often divided into a shell and a core, and the distinction has some functional teeth: the shell, with its especially heavy VTA dopamine input and its connections to the hypothalamus and ventral pallidum, sits closest to raw motivational and hedonic processing, while the core, more like the rest of the striatum in its connections, sits closer to the translation of motivation into action. We will not lean on the subdivision heavily, but it matters once for the wanting/liking story below. Second — the same honesty we owed the motor loop, now doubled — what the direct and indirect pathways do in this loop is genuinely less settled than in the dorsal striatum, on top of the receptor-mapping problem just noted. The framings on offer (“go” versus “no-go,” reward versus punishment, approach versus avoid, prepare versus select) overlap but do not coincide, and the field has not converged. The optogenetic evidence is instructive precisely because it does not deliver the clean rule one might hope for. When the D1 (so-called direct-pathway) cells of the accumbens are excited directly, animals will work intensely to self-stimulate them — D1 excitation in the accumbens is reliably rewarding. But the D2 side does not come out as a tidy mirror image: stimulating D2 cells tends to produce ambivalent, weaker, or context-dependent effects — Berridge’s group characterizes it as “D1 reward versus D2 ambivalence” — including avoidance under some conditions but not the simple “D2 = punishment” rule the dorsal analogy might lead you to expect. This is a frontier, not a foundation, and we flag it as such rather than building the chapter on it.
In 1954 James Olds and Peter Milner placed stimulating electrodes in the brains of rats and let the animals press a lever to deliver the stimulation to themselves. With the electrode in the right place (along the medial forebrain bundle, a pathway that includes ascending dopamine fibers among many other systems) the rats pressed, and pressed, and did not stop. They would press thousands of times an hour. They would cross an electrified grid to reach the lever, tolerating shocks they would not tolerate for food. Left to it, they would forgo eating and drinking to the point of starvation, pressing for stimulation until the experimenter intervened.
It was irresistible to call this a “pleasure center,” and for decades it was. But the language was wrong in a way this chapter is built to expose, and the rats were trying to tell us. An animal in the grip of pleasure would, presumably, at some point be satisfied. These animals were not satisfied; they were driven. What the electrode tapped was not a wellspring of pleasure but the engine of wanting — the dopaminergic motivation system that makes a reward worth pursuing — and stimulating it produced not contentment but an unquenchable pursuit of more. The distinction between wanting and liking, which the next section develops, is exactly what the self-stimulating rat dramatizes. (The idea, remarkably, is not dead as therapy: there have been modern attempts to stimulate this same pathway for treatment-resistant depression.) Olds and Milner found something real and important. They just misnamed it — and the misnaming is instructive.
Dopamine’s two jobs: wanting versus learning
Here is the complication that organizes the rest of the chapter, and it is best introduced by noticing that the account so far is incomplete in a specific way. We have described dopamine as a teaching signal — the prediction error that arrives after an outcome and reshapes which actions the striatum will select next time. That is a signal about the past: this just happened, update accordingly. But a great deal of what dopamine does is about the future and the present: getting the animal to pursue a reward in the first place, and keeping it pursuing across the long gap before the reward arrives. These are not the same job, and dopamine does both.
Call them the two heads of the system. One head is reward learning — the phasic prediction error of the last three sections, which reshapes which behaviors the striatum will select by finding the eligible synapses and adjusting them. The other head is motivation, or wanting — a signal that energizes behavior, that rises in anticipation of a reward and sustains the effort to obtain it. The evidence that these are genuinely distinct, rather than one signal described two ways, is some of the most illuminating in the whole field, and it comes from asking a deceptively simple question: what, exactly, does dopamine make you do?
The decisive work is Kent Berridge’s, and it turns on separating two things that ordinary language fuses. We say we “want” what we “like” and “like” what we “want,” as if these were one. Berridge’s central finding is that the brain does not treat them as one. Wanting — the motivation to pursue and work for a reward, which Berridge calls incentive salience — is dissociable from liking — the hedonic pleasure actually taken in consuming it. And dopamine, it turns out, is a major mechanism for wanting — for the cue-triggered incentive salience that makes an animal pursue a reward — and is not the same thing as the hedonic pleasure of liking.
The experiments make the dissociation concrete. Strip dopamine from a rat — lesion the system, or deplete it pharmacologically — and the animal stops pursuing food. It will not work for it, will not seek it out; left alone, it would starve in the presence of food it has not been moved to approach. By the “pleasure” account of dopamine, such an animal should also have lost its capacity to enjoy food. It has not. Place food in its mouth, and it shows the same hedonic reactions as before — the species-typical facial expressions of “liking” that Berridge learned to read in rodents are fully intact. The liking is there. Only the wanting is gone. Dopamine was never the pleasure; it was the motivation to pursue.
And the converse: where does the pleasure live, if not in dopamine? In a different system, anatomically tiny and chemically distinct. Berridge’s group identified hedonic hotspots, sub-regions about a cubic millimeter in volume found especially in the shell of the nucleus accumbens and the posterior ventral pallidum, where stimulating the opioid system (and certain related systems) can genuinely amplify “liking” reactions, more than doubling the hedonic response to a sweet taste. These hotspots do not contain all pleasure, and dopamine does not contain all wanting; the systems overlap and interact, and the full circuitry of each is broader than any one node. But the dissociation itself is decisive: the mechanisms that make a reward wanted can be pulled apart from the mechanisms that make it liked. The buttercream that delighted you before dinner and seems excessive after has not changed; what changed is whether your hedonic systems still paint it as pleasant and, separately, whether your dopamine system still makes you want to cross the room for it.
This is the chapter’s deepest structural point, and it is worth stating as plainly as the disinhibition principle of the last chapter. There is no inner connoisseur who likes the reward and then decides to want it. Wanting and liking are produced by separable machinery — dopamine prominent in the one, opioid hotspots central to the other — dissociable by lesion, by drug, by anatomical locus, and by timescale. They normally travel together because in a well-ordered animal the things worth wanting are the things worth liking, but the brain computes them apart, and they can come apart. This is the same anti-homunculus move this book has made at every level. Just as there is no commander in the basal ganglia who chooses the movement, there is no evaluator in the ventral striatum who feels the pleasure and issues the desire. There are dissociable dynamics riding the same neighborhood — and, in the case of dopamine, partly the same molecule — each doing its own job.
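The logic of the dissociation can be put as bookkeeping with two independent knobs. This is a deliberately crude toy, not a model of the circuitry: the scales, the multiplicative form, and the names `dopamine` and `opioid_hotspot` are assumptions made for illustration, and the toy ignores the overlap and interaction between the systems that the text acknowledges.

```python
from dataclasses import dataclass


@dataclass
class RewardState:
    dopamine: float = 1.0        # 0.0 stands for a depleted system (hypothetical scale)
    opioid_hotspot: float = 1.0  # above 1.0 stands for a stimulated hotspot (hypothetical scale)


def wanting(state, cue_value):
    """Motivation to pursue: scaled by dopamine, untouched by the hotspot knob."""
    return state.dopamine * cue_value


def liking(state, taste_value):
    """Hedonic reaction on consumption: scaled by the hotspot, untouched by dopamine."""
    return state.opioid_hotspot * taste_value


intact = RewardState()
depleted = RewardState(dopamine=0.0)
stimulated = RewardState(opioid_hotspot=2.0)

for name, s in [("intact", intact), ("depleted", depleted), ("stimulated", stimulated)]:
    print(f"{name:10s} wanting={wanting(s, 1.0):.1f} liking={liking(s, 1.0):.1f}")
```

Turning the dopamine knob to zero abolishes pursuit and leaves the reaction to food in the mouth as it was; turning up the hotspot knob amplifies the reaction and leaves pursuit as it was. Two quantities that move independently under manipulation cannot be one quantity described two ways.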
We can now also close the loop with the motor system in a way that would have looked like a digression a moment ago. Recall, from the controversy box, that dopamine scales with the vigor of movement. That is the wanting head showing up in the motor stream: the same motivational signal that gets you out of bed also sets how forcefully you move once you are up. There is direct evidence that the motivational head is tied to movement specifically — a cue that requires the animal to act for its reward drives dopamine that rises and stays elevated until the action is performed, whereas a cue promising the same reward for staying still does not. Wanting is not an abstract registering of worth; it is, in part, the energizing of action toward the worthwhile. The teaching head looks backward and reshapes the synapses; the wanting head looks forward and drives the body.
The worked example: eating, throwing, and the flip
We promised, in the last chapter, to follow a single contest all the way through once we had the machinery in hand. We now have it. Return to the sandwich.
The dorsal stream, we said, delivers a set of affordances: the sandwich affords eating, but its shape also affords throwing, and both action programs are specified in parietal and premotor cortex and arrive at the dorsal striatum as candidate movements. The basal ganglia resolve that competition by disinhibition — releasing one program, clamping the other. The question the last chapter could not answer was what weights the contest. Now we can trace it, and we can watch the weighting change.
Begin with a hungry animal. A homeostatic signal — the hunger we traced to the hypothalamus in the first unit, ghrelin and the rest — does something we can now state mechanistically: it acts on the dopaminergic motivation system, the VTA and its projection to the ventral striatum, raising the incentive value of food-related actions. This is the wanting head at work. The eating affordance, in this drive state, is painted with high incentive salience; the throwing affordance is not. Through the ventral loop, and along the spiral that lets motivational signals reach the more dorsal action territories, the contest in the dorsal striatum is biased: the eating program is the one more likely to have its brake lifted. The animal eats. None of this required a chooser weighing eating against throwing; it required a drive state setting a dopaminergic bias, and the selection machinery resolving the contest under that bias. The architecture is the chooser.
Now let the outcome teach. Suppose the food turns out better than expected: you went looking for carrots and found the thin mints. At consumption, the teaching head fires a positive prediction error, because the reward exceeded prediction. That error finds the synapses left eligible by the actions that led here (the searching, the opening of the pantry) and strengthens them, so that next time hunger strikes, those actions are likelier to be selected. The credit, propagating backward, lands most heavily on the actions closest to the reward and only faintly on the dead ends (the empty refrigerator, the empty oven), exactly as the temporal structure of the error dictates. The eating contest has not merely been resolved this once; it has been reshaped for the future.
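The way a single late error can apportion credit along the chain can be caricatured in a few lines of Python. This is a minimal sketch under loud assumptions, not a model of any real synapse: the function name credit_from_one_error, the exponential decay of the trace, and every number in it are ours, chosen only for illustration. Each action tags its synapse with a trace that fades at every later step, and the prediction error delivered at the end is multiplied by whatever trace remains.

```python
def credit_from_one_error(actions, delta, decay=0.5, alpha=0.1):
    """Weight change each action's synapse receives from ONE prediction
    error delivered after the last action in the chain.

    Assumption: every action sets its own trace to 1.0, and all older
    traces shrink by the factor `decay` at each subsequent step.
    """
    traces = {}
    for action in actions:
        for earlier in traces:
            traces[earlier] *= decay      # older tags fade
        traces[action] = 1.0              # the newest tag is fresh
    # Dopamine-gated update: learning rate x error x remaining eligibility.
    return {a: alpha * delta * trace for a, trace in traces.items()}


chain = ["open refrigerator", "check oven", "open pantry", "eat"]
expected, received = 0.4, 1.0             # hypothetical: carrots vs. thin mints
delta = received - expected               # positive prediction error

updates = credit_from_one_error(chain, delta)
for action in chain:
    print(f"{action:18s} {updates[action]:.4f}")
```

The printed updates grow from the first action to the last, so the step nearest the reward is strengthened most. Note what the sketch does not do: it discriminates by recency alone. The oven gets less credit than the pantry only because it came earlier, not because anything in the code knows it was a dead end.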
And now the flip the last chapter promised. Let the drive state change — let the animal eat to satiety. The hunger signals fall and the meal-related satiety signals rise: gastric distension, gut peptides such as CCK, GLP-1, and PYY, and post-ingestive insulin, the fast satiety machinery of the first unit. (Leptin belongs here too, but in a different role — a slower background signal about energy stores, modulating the system over a longer horizon rather than ending this particular meal.) Together these reduce the incentive salience of food-related actions, because that incentive value is computed relative to the current drive state, not fixed to the object. The very same sandwich, unchanged in itself, no longer biases the contest toward eating. The eating affordance loses its weight; other affordances — including, now, the idle throwing of a thing one no longer wants to eat — are no longer outcompeted. The contest tips the other way. Nothing about the sandwich changed and nothing about the basal-ganglia circuitry changed. What changed was a drive state, reaching the selection machinery through the dopaminergic valuation system, re-weighting a competition that the architecture then resolves. That is what it means to say the ventral striatum builds the value that the dorsal striatum spends.
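The hungry contest and its flip can be set side by side in a toy sketch. Everything here is an assumption made for illustration: the names incentive_values and release_probabilities, the exponential gain by which drive scales learned value, the softmax standing in for the disinhibition contest, and all of the numbers. The one feature that matters is structural: the learned values are never touched, and only the drive term changes between the two runs.

```python
import math

# What experience has taught about each affordance (hypothetical numbers).
LEARNED_VALUE = {"eat": 1.0, "throw": 0.6}
# How strongly each action bears on the hunger drive (hypothetical).
HUNGER_RELEVANCE = {"eat": 1.0, "throw": 0.0}


def incentive_values(hunger, satiety):
    """Assumed rule: drive acts as a multiplicative gain on learned value,
    and only for actions relevant to that drive."""
    drive = hunger - satiety
    return {
        action: value * math.exp(HUNGER_RELEVANCE[action] * drive)
        for action, value in LEARNED_VALUE.items()
    }


def release_probabilities(values, temperature=0.5):
    """Softmax competition: a stand-in for which program has its brake lifted."""
    peak = max(values.values())
    weights = {a: math.exp((v - peak) / temperature) for a, v in values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


hungry = release_probabilities(incentive_values(hunger=1.0, satiety=0.0))
sated = release_probabilities(incentive_values(hunger=0.0, satiety=1.0))

for label, probs in (("hungry", hungry), ("sated", sated)):
    print(label, {a: round(p, 2) for a, p in probs.items()})
```

Run as written, eating dominates the contest in the hungry state and loses it in the sated one, while LEARNED_VALUE is identical in both. The sandwich and the selector are constants; the drive is the only variable.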
This is also the place to see, in miniature, why the two heads must be distinct. The wanting head is what flips with the drive state: satiety sharply reduces the wanting, at least for that food in that moment. But the learning the animal did along the way is not erased by becoming full: you still know where the thin mints are. Wanting and learning came apart cleanly in the example, just as liking and wanting came apart in the lesion and hotspot studies, because all three are separable functions in the tissue. A single “reward signal” could not behave this way.
The system hijacked: a note on addiction
Everything in this chapter predicts a characteristic way for the system to fail, and the prediction is borne out in one of the most consequential disorders in medicine. We flagged it two chapters ago as the habit bargain pathologically amplified; we can now say what the amplification is.
Many addictive drugs, through very different primary pharmacological actions, raise or dysregulate dopamine transmission in the mesolimbic circuitry of the ventral striatum. By the lights of this chapter, that does not mean they simply deliver pleasure: dopamine is a mechanism of wanting, not the hedonic gloss of liking. It means that drug-taking and the cues that predict it can become unusually powerful teachers and unusually powerful motivational magnets. One influential account (and we should be clear it is one account, not the whole disorder) runs as follows. A drug-induced dopamine surge mimics, or exaggerates, the teaching signal normally reserved for a better-than-expected outcome. The learning machinery therefore treats the cues and actions that preceded the drug as unusually important, strengthening drug-seeking in striatal circuits in a way that ordinary rewards, with their ordinary and self-limiting prediction errors, do not. The synapses left eligible by drug-seeking are found and strengthened, and the goal-directed-to-habit shift we described in the dorsal striatum is driven toward its pathological extreme, so that seeking becomes compulsive and cue-triggered rather than chosen. This is the ventral-to-dorsal progression that Everitt and Robbins have emphasized.
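What “self-limiting” means, and how a pharmacological surge could defeat it, can be shown with a toy learner. The sketch is an illustration of the idea, not a claim about mechanism: the function learn_value, the parameter drug_surge, and the rule that the surge sets a floor under the error are all assumptions of ours, in the spirit of computational accounts in which the drug adds a component to the teaching signal that learning cannot predict away.

```python
def learn_value(reward, drug_surge=0.0, trials=200, alpha=0.1):
    """Learn the value of one outcome by repeated prediction-error updates.

    Assumption: `drug_surge` is a dopamine component produced directly by
    the drug. It adds to the ordinary error and also sets a floor under it,
    so a correct prediction cannot cancel it.
    """
    value = 0.0
    errors = []
    for _ in range(trials):
        ordinary = reward - value                       # shrinks as value is learned
        delta = max(ordinary + drug_surge, drug_surge)  # never drops below the surge
        value += alpha * delta
        errors.append(delta)
    return value, errors


food_value, food_errors = learn_value(reward=1.0)
drug_value, drug_errors = learn_value(reward=1.0, drug_surge=0.5)

print(f"food: value {food_value:.2f}, last error {food_errors[-1]:.4f}")
print(f"drug: value {drug_value:.2f}, last error {drug_errors[-1]:.4f}")
```

For the ordinary reward the error decays toward zero and the learned value settles at the reward’s size. With the surge in place the error never falls below the surge itself, so the learned value keeps climbing for as long as the trials continue. That unbounded growth is an artifact of the toy, but it makes the contrast with a self-limiting error plain.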
It is worth saying plainly that this is not all there is to addiction. Real dependence also involves tolerance and withdrawal, negative reinforcement (taking the drug to escape a bad state rather than to reach a good one), recruitment of stress systems, the erosion of prefrontal control over behavior, individual genetic and developmental vulnerability, and pharmacology specific to each drug. Hyman, Malenka, and Nestler frame the disorder as a pathological usurpation of the brain’s reward-related learning systems across these striatal and prefrontal circuits; Everitt and Robbins emphasize the habit progression; Berridge emphasizes sensitized wanting. These accounts are compatible, and the wanting/liking story this chapter can tell is one genuine piece of a larger picture, not the whole of it.
One of the most striking features of addiction follows from the wanting/liking dissociation, and it is what makes that dissociation worth teaching here. Because the drug’s grip works substantially through dopamine-driven wanting, while liking depends on the separable opioid-hotspot system, the two can be pulled apart by repeated use: the wanting can sensitize and grow while the liking does not keep pace, or even fades. The result is the state Berridge’s incentive-sensitization theory describes — intense, cue-triggered craving for a substance that may no longer deliver much pleasure at all. The addicted person, on this view, is not chasing a pleasure they keep catching; they are wanting out of proportion to liking, driven to pursue a reward whose hedonic payoff has decoupled from the urge to get it. The self-stimulating rat that opened our detour, pressing past satisfaction to starvation, was an early glimpse of the same dissociation in its purest experimental form. This does not make addiction only a disorder of wanting — the previous paragraph’s complications are all real — but it does explain why “just stop liking it” was never a coherent description of what an addicted person is up against.
(We met the basal ganglia’s motor failures in the last chapter — dopamine loss in Parkinson’s disease, and striatal degeneration in Huntington’s disease — as two ways the selector breaks. Addiction is a different kind of failure again: not too much or too little release of movement, but a pathological capture of learning, wanting, and cue control — the valuation side of the same machinery, hijacked by an agent the system reads as the best reward it has ever encountered.)
Where this leaves us, and where the next unit begins
Step back and take in what this loop accomplishes. The basal ganglia of the last chapter could choose among actions but could not say which were worth choosing. This chapter supplied the worth, and it did so without ever installing an evaluator to compute it. A teaching signal — the dopamine prediction error, which fires only at the surprise and goes silent once the world is predicted — reshapes which actions the striatum will select, finding the synapses that experience left eligible. A motivation signal, riding partly the same molecule, energizes the pursuit of reward and tilts the action contest according to the animal’s current drives. Pleasure itself lives elsewhere again, in the opioid hotspots, dissociable from the wanting that the dopamine system supplies. Value, in this picture, is not a number an inner judge assigns. It is the settling of a learned, drive-weighted competition — the architecture, once again, doing the work that a homunculus would otherwise have to do.
That closes the unit’s long argument. We began this whole unit with a body that could move and asked how it comes to move well and to the point: from the spinal pattern generators and the muscle, up through cortical control of the seen world, through the cerebellum that smooths and corrects, to the basal ganglia that select among candidate actions and — in this final chapter — learn which selections are worth making. The through-line was never anatomy for its own sake; it was the control hierarchy that turns undirected capacity into directed, learned, well-chosen action.
But notice the question this chapter kept almost answering and then setting aside. We have spoken of value throughout (incentive value, reward value, things “worth” pursuing) and we have shown how the striatum learns and spends it. We have not asked how the brain represents value as such: how it compares unlike options on a common scale, how it decides that this reward is worth more than that one, how preference itself is computed and read out. Our worked example flipped a contest by changing a drive; it did not weigh two genuine alternatives and judge between them. Human imaging offers a hint of where that judgment happens, though not as a clean anatomical handoff. The ventral striatum is prominent when rewards are anticipated and when prediction errors drive learning; the ventromedial prefrontal and orbitofrontal cortex are especially engaged when options must be represented and compared on a common scale of worth. Both regions carry value-related signals, so the division is not “one anticipates, the other receives,” but the emphasis differs, and it points beyond the striatum: the explicit valuation that guides a deliberate choice between alternatives appears to lean heavily on the prefrontal cortex that sends its projections into the accumbens and receives the loop’s output back by way of the pallidum and thalamus.
That is the seam where this unit ends and the next begins. Having built the machinery that selects actions and learns their value, we turn next to the machinery that evaluates — to valuation, preference, and decision-making, and to the ventromedial and orbitofrontal cortex where the brain seems to represent what its options are worth. The accumbens has told us that a reward is coming and taught us how to get it. What that reward is worth, and how the brain decides between one worth and another, is the subject of the unit that follows.
What we are sure of, and what is still open
As in earlier chapters, it is worth separating the settled core from the frontier.
What is well established. Learning is driven by prediction error, not by mere co-occurrence — the lesson of blocking, formalized by Rescorla and Wagner and extended across time by temporal-difference learning. The phasic firing of midbrain dopamine neurons behaves like a reward-prediction-error signal: a burst to unpredicted reward, a shift of that burst to the earliest reliable predictor, and a dip below baseline when a predicted reward is omitted. This identification, made explicit by Schultz, Dayan, and Montague, is one of the strongest links between a computational theory and a neural signal that neuroscience has produced. The ventral striatum (nucleus accumbens) is the input stage of a basal-ganglia loop built on the same plan as the motor loop but wired to limbic and motivational structures and supplied with dopamine from the VTA. Dopamine is a major mechanism of wanting (incentive salience) and is distinct from hedonic liking: depleting it abolishes the pursuit of reward while leaving the pleasure of consuming it largely intact, and “liking” reactions can be amplified by opioid and related activity in small hedonic hotspots of the accumbens shell and posterior ventral pallidum. The two can be dissociated, even if neither system holds the whole of its function alone. Most drugs of abuse raise or dysregulate ventral-striatal dopamine, and one robust feature of addiction — intense, cue-triggered craving that can outlast pleasure — follows from the wanting/liking dissociation, though it is not the whole of the disorder.
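The three signatures of the phasic signal (burst, shift, dip) fall out of a very small temporal-difference learner. The sketch below is a textbook-style toy, not a fit to any recording. It assumes that each moment after cue onset is its own state, that the cue arrives at unpredictable times so the pre-cue baseline stays at zero, and that the learning rate, discount, and trial count are arbitrary; the function name run_trial is ours.

```python
def run_trial(values, reward, alpha=0.2, gamma=0.95, learn=True):
    """Run one cue-to-reward trial; return the prediction error at each step.

    values[0] is the pre-cue baseline; values[1:] are successive moments
    after cue onset, the last being the moment of reward delivery.
    errors[0] is therefore the error at cue onset, errors[-1] at reward time.
    """
    last = len(values) - 1
    errors = []
    for step in range(len(values)):
        next_value = values[step + 1] if step < last else 0.0
        outcome = reward if step == last else 0.0
        delta = outcome + gamma * next_value - values[step]
        # Assumption: the baseline is never updated, because nothing
        # earlier can come to predict an unpredictably timed cue.
        if learn and step > 0:
            values[step] += alpha * delta
        errors.append(delta)
    return errors


values = [0.0] * 6                                   # baseline + 5 post-cue moments
naive = run_trial(values, reward=1.0)                # first ever pairing
for _ in range(500):
    run_trial(values, reward=1.0)                    # training
trained = run_trial(values, reward=1.0, learn=False)
omitted = run_trial(values, reward=0.0, learn=False)

for label, errs in (("naive", naive), ("trained", trained), ("omitted", omitted)):
    print(f"{label:8s} cue {errs[0]:+.2f}   reward time {errs[-1]:+.2f}")
```

In the naive trial the only error is a positive one at reward delivery. After training the error at reward time is essentially zero and a positive error appears at cue onset instead. When the predicted reward is withheld, the cue response is unchanged and the error at the expected time of reward is negative: the dip below baseline.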
What remains contested or unsettled. The reward-prediction-error account, though strongly supported, is very likely not the whole of what dopamine does. The midbrain dopamine neurons are heterogeneous in their projections and molecular identity and in some cases co-release other transmitters; some dopamine signals track salience, novelty, or stimulus intensity rather than reward-prediction error as such; and dopamine’s relation to movement (its scaling with vigor and its role in energizing and sustaining action) is not fully reconciled with its role as a teaching signal. Researchers differ on whether these findings are refinements of a sound model or grounds for a deeper revision. The molecular identity of the eligibility trace, the mechanism that holds a synapse modifiable across the delay between action and reward, is not settled; several calcium-, cAMP-, and kinase-dependent mechanisms are plausible candidates, and the implementation remains open, even though the functional requirement (a temporary window of eligibility for dopamine-gated plasticity) is not in doubt. And the organization of the ventral loop is considerably less well understood than that of the motor loop: coding its projections as “direct” versus “indirect” by D1 versus D2 receptor identity is not valid in the dorsal sense (Kupchik and colleagues), and the behavioral roles of the two cell classes do not resolve into a clean rule. Excitation of D1-expressing neurons in the accumbens is reliably rewarding, but the D2 side produces ambivalent and context-dependent effects rather than a simple mirror-image “punishment.” As always, the schematic is cleaner than the tissue, and in the case of dopamine the tissue is, at the time of writing, an unusually active frontier.