35 What It’s Worth

How Consequences Reweight Future Action

35.1 From selecting now to learning for next time

The preceding chapter described how basal-ganglia loops regulate access to action. They help one set of cortical and brainstem controllers gain influence, keep incompatible actions from interfering, scale the vigor of the selected action, and reorganize control when circumstances change. That account explains how several current possibilities become one course of behavior. It leaves open a second problem: why the competition has the weights it does.

An action changes the body and world. A reach obtains the cup or knocks it over. A turn reveals the source of a sound. Opening one door produces food; opening another produces nothing. Some consequences restore a regulated variable, some cause injury, and some provide new information. Once those consequences have occurred, the same actions should not enter the next competition with their original weights.

The change can be specific. Finding food in the pantry should make opening the pantry more likely when the same need returns. It should not strengthen every action that occurred during the evening. The change must also remain flexible. The food can still be remembered when satiety makes it temporarily unimportant. A familiar route can still be avoided after the destination becomes dangerous. Repeated practice can eventually produce the opposite problem: an action may be triggered so reliably by its context that it persists after the outcome that established it has lost its value.

These are problems of reinforcement learning. Reinforcement does not mean that a reward simply stamps in whatever happened before it. Outcomes alter predictions, action–outcome relations, motivational pull, and corticostriatal plasticity. The resulting changes affect which actions later gain access and how strongly they are expressed.

Consequences reweight future action. Dopamine-related prediction errors help update recently active corticostriatal circuits; current bodily state changes how learned outcomes matter now; and repeated experience can shift relative control from outcome-sensitive action toward cue-triggered habit.

The word value will appear throughout the chapter, but it does not name one substance or one signal. Several distinctions will matter:

Term	The question it answers
Outcome prediction	What is likely to happen?
Reward-prediction error	Was the outcome better or worse than expected?
Incentive value	How worth pursuing is that outcome in the organism’s present state?
Hedonic impact	How pleasurable or aversive is the outcome when it is experienced?
Learned action weight	How strongly should this action be favored in this context?

These variables interact, but they are not interchangeable. A cue can predict food accurately when an animal is sated. An action can still be known to produce an outcome that is no longer wanted. A dopamine transient can teach without being the pleasure produced by the outcome. A habitual action can remain strong even when the outcome’s current incentive value has fallen.

The chapter begins with an evolutionary point. Dopamine did not first appear as a chemical for pleasure or monetary choice. It entered an ancient vertebrate system that linked sensory conditions and bodily state to action. Mammalian evolution elaborated that arrangement by linking it to larger cortical, limbic, and mnemonic loops. The conserved principle is that experience changes the conditions under which an embodied controller will act again.

35.2 Dopamine entered an ancient action system

The evolutionary argument in the preceding chapter concerned the architecture of the basal ganglia. Lampreys possess recognizable striatal, pallidal, subthalamic, and nigral components, including direct- and indirect-like pathways and tonic inhibitory output to motor-related targets. The dopaminergic organization inside that circuit is ancient as well.

In mammals, direct-pathway spiny projection neurons are enriched in D1 dopamine receptors, whereas indirect-pathway neurons are enriched in D2 receptors. D1 and D2 receptors activate different intracellular signaling cascades and change the excitability and plasticity of their host neurons in different directions. Work from Sten Grillner’s group showed the same principal organization in lamprey. D1- and D2-receptor expression distinguishes corresponding striatal populations, and dopamine increases the excitability of direct-pathway neurons while decreasing the excitability of indirect-pathway neurons [@ericssonetal2013dopamine]. The molecular arrangement that allows dopamine to bias access to action therefore predates the mammalian neocortex.

The lamprey circuit also reveals why it is misleading to begin with reward. Neurons in a region homologous to the mammalian substantia nigra pars compacta and ventral tegmental area project to the striatum, as expected, but they also project directly to the optic tectum. Dopamine delivered to the tectum changes the visuomotor transformations that organize orienting and avoidance [@perezfernandezetal2017tectum]. Some individual dopamine neurons branch to both striatum and tectum, linking modulation of basal-ganglia circuits to modulation of the sensory–motor structure that selects where the animal will orient [@vontwickeletal2019corelease].

This arrangement places dopamine inside an ancient control system. Visual evidence reaches tectal and striatal circuits. Basal-ganglia output alters access to orienting and locomotor systems. Dopamine changes the responsiveness and plasticity of the neurons participating in that process. It can therefore influence what the animal does now and what similar circumstances will elicit later.

Across vertebrate evolution, these loops became linked with increasingly elaborated pallial, amygdalar, hippocampal, insular, hypothalamic, and thalamic systems. In mammals, the expanding pallium supplied cortical networks capable of representing objects, rules, actions, contexts, and delayed consequences. Dopaminergic projections diversified across ventral, associative, and sensorimotor striatal territories while retaining strong relationships with collicular and brainstem action systems.

The resulting system can learn that a tone predicts food, that one action produces sucrose while another produces grain, that the food is currently undesirable after satiety, or that a drug-predictive cue deserves immediate pursuit. Those are mammalian elaborations of an older principle: dopamine modulates circuits that convert evidence and state into action. It is not a molecule that evolved to label pleasure.

The evolutionary evidence has a clear limit. Lamprey studies establish an ancient receptor architecture and direct modulation of action-related circuits. They do not establish that lamprey dopamine neurons carry the same temporal-difference reward-prediction-error signal recorded in trained monkeys. The ancient scaffold and the modern computational account support one another without being identical claims.

35.3 Three things an animal can learn

Learning from an outcome requires more than remembering that something good or bad occurred. The nervous system must preserve different kinds of relation. Three are especially important here.

35.3.1 A cue can predict an outcome

In Pavlovian learning, an event predicts an outcome whether or not the animal’s action produces it. A sound may precede food. A place may predict danger. A light may indicate that water will soon become available. Once the relation has been learned, the cue can orient attention, evoke anticipatory physiology, and prepare actions appropriate to the expected outcome.

Prediction does not imply desire. A sated animal can continue to recognize a cue as a reliable predictor of food. A person can know that an alarm predicts an aversive event without wanting the event. Pavlovian learning establishes what the cue forecasts.

35.3.2 An action can cause an outcome

In instrumental learning, the animal learns a relation between what it does and what follows. Pressing one lever produces sucrose; pressing another produces a food pellet. Turning left reaches shelter; turning right reaches exposed ground. The consequence depends on the action.

This relation contains more information than a generalized history of reinforcement. It can preserve the identity of the outcome and the causal structure of the action. An animal can therefore reduce the action producing one devalued food while preserving another action that produces a still-desirable food. That result will become the chapter’s principal test of flexible action.

35.3.3 The current body changes what an outcome is worth

The learned identity and probability of an outcome do not determine its present incentive value by themselves. Water has a different claim on action during dehydration. Salt can change from aversive to intensely attractive during sodium depletion. Food loses much of its immediate pull after sensory-specific satiety while remaining recognizable and remembered.

The nervous system therefore combines learned outcome representations with the current state of the organism. This is one point at which the regulated body enters action control. Hypothalamic, brainstem, hormonal, visceral, insular, and amygdalar signals do not simply send a scalar drive to a central chooser. They change the activity and plasticity of distributed networks representing cues, outcomes, and actions.

The three relations can change independently. A cue may continue to predict an outcome after the outcome has been devalued. An animal may retain an action–outcome relation while no longer performing the action because the outcome is unwanted. A habit may continue after both the current incentive value and the action–outcome contingency have changed. The behavioral preparation matters because each pattern supports a different conclusion.

35.3.4 Credit must be assigned across more than one timescale

Consider a hungry animal that leaves its nest, crosses a room, searches an empty location, opens a container, and finally finds food. Several temporal problems are embedded in that episode.

At the local synapse, cortical and thalamic activity in the striatum may last only briefly. If a dopamine signal arrives a second later, the recently active connections need a temporary state that marks them as eligible for change. At the level of the sequence, predictive significance must propagate from the eventual food toward earlier places and actions over repeated experience. Across still longer intervals, the physiological consequences of eating unfold over minutes or hours and must be related to remembered episodes, internal-state changes, and intermediate predictors.

No single mechanism solves all three delays. A synaptic eligibility trace helps bridge a short interval between recent neural activity and a teaching signal. Temporal-difference learning describes how predictive information can propagate across a sequence of states. Long-horizon learning also depends on memory for events and contexts, reactivation of prior experience, hierarchical organization of action, and repeated contact with intermediate consequences.

This separation prevents a common mistake. Dopamine does not need to search backward through every event of the night to find the one correct synapse. Nor does a seconds-long biochemical trace explain how a meal hours earlier becomes related to later restoration of energy balance. Reinforcement learning is distributed across levels and timescales.

35.4 Unexpected outcomes revise prediction

Simple co-occurrence cannot explain what animals learn. The decisive demonstration is blocking, introduced by Leon Kamin [@kamin1969predictability].

An animal first learns that a light predicts an outcome. The outcome may be a shock in the original aversive preparation or food in an appetitive version. Once the light is a reliable predictor, a tone is added. Light and tone now occur together before the same outcome. The tone has been paired with the outcome repeatedly, yet it acquires little predictive control. The established light already accounted for what happened. The new tone supplied no error that required the prediction to be revised.

Blocking shows that associative change depends strongly on what the outcome adds to the current prediction. It is not enough that two events occur together. A redundant cue can be present on every trial and still acquire little control.

Rescorla and Wagner formalized this principle by proposing that associative strength changes according to the discrepancy between the outcome obtained and the outcome predicted by all cues present [@rescorlawagner1972theory]. A larger positive discrepancy produces greater strengthening. An outcome smaller than expected produces a negative discrepancy. When prediction and outcome match, little updating is required.

The prediction error is not subjective astonishment. It is a computational difference. A tiny, unnoticed change can generate an error if it violates a precise expectation; a dramatic event can generate little new learning if it was fully predicted.

The Rescorla–Wagner model treats each trial as a unit. It does not represent the successive moments between cue and outcome. Temporal-difference learning extends the same principle through time [@sutton1988temporal]. At each moment, the system compares its present estimate of future outcome with the reward and prediction available at the next moment. A cue that improves the forecast generates a positive error before the final outcome arrives. With repeated experience, predictive significance can move toward earlier states in the sequence.

Opening a container and seeing food therefore matters immediately. The sight of food raises the estimate of what is about to occur, even before ingestion changes physiology. Earlier cues can acquire predictive significance because they lead to that improved state. Learning proceeds by successive changes in expectation rather than by one signal spanning the entire delay.

Temporal-difference models can learn the expected return associated with a state, often written as V, or the expected return associated with taking an action in a state, often written as Q. These are useful formal quantities. They are not assumed to reside in one nucleus or to exist as a single number in every biological decision. The organism’s representations determine what counts as the current state, which actions are distinguished, and which features of an outcome are preserved.

Deeper Dive: prediction-error equations

The Rescorla–Wagner update can be written in simplified form as:

\[ \Delta V = \alpha\beta(\lambda - \sum V) \]

The change in associative strength, \(\Delta V\), depends on two learning-rate terms, \(\alpha\) (the salience of the cue) and \(\beta\) (a rate parameter associated with the outcome), and on the difference between the obtained outcome, \(\lambda\), and the outcome predicted by all cues present, \(\sum V\). Blocking occurs because the established cue already predicts the outcome, leaving little error for the added cue.
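The blocking result can be reproduced with a few lines of simulation. The sketch below is a minimal illustration, not a fitted model: the function name, the parameter values, the trial counts, and the assumption that light and tone share the same salience are all choices made here for convenience.

```python
def rescorla_wagner(trials, strengths=None, alpha=0.3, beta=1.0, lam=1.0):
    """Run Rescorla-Wagner updates over a sequence of reinforced trials.

    trials    -- list of tuples naming the cues present on each trial
    strengths -- starting associative strengths (cue name -> V); copied, not mutated
    alpha, beta, lam -- illustrative values, not fitted to any data set
    """
    V = dict(strengths or {})
    for cues in trials:
        predicted = sum(V.get(cue, 0.0) for cue in cues)  # sum of V over cues present
        error = lam - predicted                           # one error shared by all cues
        for cue in cues:
            V[cue] = V.get(cue, 0.0) + alpha * beta * error
    return V


# Blocking group: light alone first, then light and tone together.
after_light = rescorla_wagner([("light",)] * 30)
blocked = rescorla_wagner([("light", "tone")] * 30, strengths=after_light)

# Control group: light and tone together from the start.
control = rescorla_wagner([("light", "tone")] * 30)

print(f"tone after blocking: {blocked['tone']:.3f}")
print(f"tone in control:     {control['tone']:.3f}")
```

Under these assumptions the tone ends with almost no associative strength in the blocking group and about half of \(\lambda\) in the control group, although it was paired with the outcome equally often in both.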

A temporal-difference error can be written as:

\[ \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) \]

The error at time \(t\) compares the current prediction, \(V(s_t)\), with the immediate reward, \(r_t\), plus the prediction at the next state, \(V(s_{t+1})\), discounted by a factor \(\gamma\) between 0 and 1. A positive error means that prospects improved; a negative error means that they worsened. The same framework can be written for action values, \(Q(s,a)\), and combined with eligibility traces that allow several recently visited states or actions to be updated.
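As a worked illustration with hypothetical numbers (chosen for convenience, not taken from any experiment), let \(\gamma = 0.9\). Before a container is opened the estimate is \(V(s_t) = 0.5\), and no reward is delivered at that moment, so \(r_t = 0\). If the sight of food raises the next estimate to \(V(s_{t+1}) = 0.8\):

\[ \delta_t = 0 + 0.9 \times 0.8 - 0.5 = 0.22 \]

If the container is instead empty, so that \(V(s_{t+1}) = 0\):

\[ \delta_t = 0 + 0.9 \times 0 - 0.5 = -0.5 \]

In the first case prospects improved before any food was eaten. In the second they worsened although nothing aversive occurred.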

The equations identify a computational problem and a family of solutions. They do not imply that dopamine is the only neural error signal, that the brain represents one complete state variable, or that every form of learning is temporal-difference learning. The earlier chapter How a Synapse Learns develops the synaptic and computational machinery in greater detail.

Prediction error explains a large and important class of learning. It does not justify the claim that organisms learn only when surprised. Perceptual learning, motor adaptation, imitation, latent structure learning, and several forms of synaptic plasticity involve other errors and other teaching relations. The more useful conclusion is narrower and stronger: unexpected outcomes revise predictions, and many reward-guided actions depend on those revisions.

35.5 Dopamine can supply a teaching signal

The computational account became a neural account when recordings from midbrain dopamine neurons revealed a striking temporal pattern. Wolfram Schultz and colleagues trained monkeys in tasks in which a sensory cue predicted a drop of juice. Many dopamine neurons in substantia nigra pars compacta and ventral tegmental area responded according to three conditions [@schultzdayanmontague1997].

When juice arrived unexpectedly, the neurons produced a brief increase in firing. After learning, the main increase occurred at the earliest reliable cue rather than at delivery of the now-predicted juice. When expected juice was omitted, firing fell below its ongoing level at the time the juice should have arrived.

These responses resemble a signed reward-prediction error:

better than expected produces a positive response;
as expected produces little outcome response;
worse than expected produces a negative response.

The cue response does not literally travel backward from the reward. Learning changes when the discrepancy occurs. Early in training, the juice improves the forecast. Later, the cue produces the improvement, while the predicted juice adds little.
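A small temporal-difference simulation shows how the discrepancy can change its timing. This is a sketch under stated assumptions, not a model of the recorded neurons: the trial is divided into five hypothetical time steps beginning with the cue, the value of the intertrial period is held at zero because the cue arrives at an unpredictable moment, and the constants `GAMMA`, `ALPHA`, and `N_STEPS` are illustrative.

```python
GAMMA = 0.95   # discount factor (illustrative)
ALPHA = 0.1    # learning rate (illustrative)
N_STEPS = 5    # cue step plus delay steps; juice arrives at the end of the last step


def run_trial(V, reward=1.0, learn=True):
    """Return the TD errors for one cue-to-outcome trial.

    deltas[0]  -- error at cue onset (transition out of the intertrial
                  state, whose value is assumed to stay at 0)
    deltas[-1] -- error at the moment the outcome is due
    """
    deltas = [GAMMA * V[0] - 0.0]              # cue appears unpredictably
    for t in range(N_STEPS):
        last = t == N_STEPS - 1
        r = reward if last else 0.0            # juice only at the final step
        v_next = 0.0 if last else V[t + 1]     # the trial ends after the outcome
        delta = r + GAMMA * v_next - V[t]
        if learn:
            V[t] += ALPHA * delta              # TD(0) update of the state just left
        deltas.append(delta)
    return deltas


V = [0.0] * N_STEPS
naive = run_trial(V, learn=False)              # before any learning
for _ in range(2000):
    run_trial(V)                               # repeated cue-juice pairings
trained = run_trial(V, learn=False)            # predicted juice delivered
omitted = run_trial(V, reward=0.0, learn=False)  # predicted juice withheld

for name, d in [("naive", naive), ("trained", trained), ("omitted", omitted)]:
    print(f"{name:8s} cue: {d[0]:+.2f}   outcome: {d[-1]:+.2f}")
```

Before training the only error occurs at the juice. After training the positive error occurs at the cue and the delivered juice produces almost none. Withholding the juice produces a negative error at the time it was due.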

The correspondence survives stronger tests than the three-part illustration. Dopamine-neuron responses scale with differences between expected and obtained outcomes. In blocking designs, a redundant cue evokes little new dopamine-related teaching and acquires little associative control [@waeltietal2001blocking]. In rodents, temporally precise optogenetic activation of dopamine neurons at outcome delivery can supply the positive error needed to overcome blocking or alter extinction, establishing a causal relation between the signal and learning rather than a correlation alone [@steinbergetal2013causal].

The signal also reaches instrumental action. Dopamine release in dorsal striatum can reflect errors tied to the expected outcome of a particular action or position within an action sequence. In a carefully structured task, nigrostriatal dopamine reported whether the obtained outcome violated the sequence-specific action–outcome prediction, not merely whether reward occurred [@hollonetal2021sequence]. The result connects the classic cue–reward finding directly to the problem of reweighting future action.

35.5.1 A brief signal can modify recently active synapses

The earlier synaptic chapter introduced the three-factor rule. Presynaptic activity and postsynaptic depolarization create a local eligibility state at selected corticostriatal synapses. A dopamine signal arriving within a limited window then changes the probability and direction of lasting plasticity. Experiments on striatal spines show that dopamine delivered within roughly a second of relevant glutamatergic activity can stabilize structural change, whereas the same dopamine signal arriving too late is ineffective [@yagishitaetal2014window].

The trace is not a label attached to an entire action. It is a transient biochemical state at recently active connections. Dopamine is also not acting alone. Receptor type, postsynaptic pathway identity, acetylcholine, endocannabinoids, calcium, local inhibition, and the timing of cortical and thalamic input all shape what changes. The functional point is nevertheless clear: a brief, broadly distributed dopamine signal can selectively modify the subset of synapses made eligible by recent activity.

Temporal-difference learning and eligibility traces solve complementary problems. Temporal-difference learning propagates predictive significance across successive states over experience. Eligibility traces allow a delayed teaching signal to modify connections that were active shortly before it arrived. Neither mechanism is a complete account of learning across an entire day.
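The three-factor idea can be sketched as a product of a decaying eligibility state and a later dopamine signal. The exponential decay, the one-second time constant `TAU`, the learning rate, and the synapse names are simplifying assumptions made for this illustration. The measured window has a more specific shape, and the sketch ignores the receptor and pathway differences that determine the direction of change in real neurons.

```python
import math

TAU = 1.0  # hypothetical decay time constant of the eligibility trace (seconds)


def eligibility(elapsed, tau=TAU):
    """Trace remaining `elapsed` seconds after pre/post coincidence.

    Dopamine arriving before the coincidence (elapsed < 0) finds no trace.
    """
    return math.exp(-elapsed / tau) if elapsed >= 0 else 0.0


def weight_change(coactive_at, dopamine_at, dopamine=1.0, lr=0.5):
    """Three-factor update: (eligibility left by recent activity) x (dopamine signal)."""
    return lr * dopamine * eligibility(dopamine_at - coactive_at)


# Time (s) at which each hypothetical synapse was last co-active; None = never.
synapses = {"recently_active": 9.5, "active_long_ago": 4.0, "silent": None}
dopamine_time = 10.0

changes = {
    name: 0.0 if t is None else weight_change(t, dopamine_time)
    for name, t in synapses.items()
}
for name, dw in changes.items():
    print(f"{name:16s} {dw:.4f}")
```

The same broadly delivered signal changes the recently active synapse substantially, the synapse active six seconds earlier almost not at all, and the silent synapse not at all.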

35.5.2 Dopamine signals differ across place and time

The reward-prediction-error pattern is robust, but dopamine is not one scalar broadcast. Dopamine neurons differ in their inputs, molecular properties, projection targets, and behavioral responses. Local terminal release is further shaped by striatal circuitry. A signal measured in nucleus accumbens need not match a signal measured in dorsolateral striatum, and cell-body firing need not map one-for-one onto transmitter release at every terminal.

Fast measurements illustrate the distinction. During learning, dopamine release in nucleus accumbens can shift from an unexpected reward toward its predictive cue [@dayetal2007shifts]. In mice moving through a virtual environment, separate dopaminergic axons can carry rapid locomotion-related and reward-related signals [@howedombeck2016rapid]. During reward-guided action, dopamine and striatal cholinergic activity show different temporal relationships in ventral, dorsomedial, and dorsolateral territories [@duhneetal2024mismatch]. These findings do not erase prediction error. They locate it within a differentiated system.

Dopamine also influences the action that generates the outcome. Activity in substantia nigra dopamine neurons before self-initiated movement can change the probability and vigor of a future movement without specifying its direction [@dasilvaetal2018dopamine]. Striatal dopamine release can integrate expected benefit, cost already invested, and current motivation on different timescales [@esheletal2024cost]. The same transmitter can therefore participate in learning from an outcome, energizing pursuit, and adjusting action to current conditions.

This is not a reason to abandon the teaching-signal account. It is a reason to state it precisely:

Many dopamine signals report changes in expected outcome strongly enough, and at the right time, to teach corticostriatal circuits. Other dopamine signals and timescales contribute more directly to movement, salience, effort, and motivational activation.

35.6 Learning is distributed across a striatal gradient

Dopamine does much of its work in the striatum, but the striatum does not contain one reward compartment beside one motor compartment. Its territories form gradients with strong biases and extensive convergence.

The ventral striatum includes the nucleus accumbens and adjoining portions of caudate and putamen [@zahmbrog1992accumbens]. The accumbens is commonly divided into core and shell. The boundary is anatomically useful, and the two territories differ in their inputs, outputs, neurochemistry, and contributions to behavior. The shell is especially connected with medial prefrontal, amygdalar, hippocampal, hypothalamic, and brainstem systems. The core has strong relations with anterior cingulate, orbitofrontal, amygdalar, thalamic, and motor-related systems. Neither territory belongs to only one function.

Farther dorsally and laterally, striatal input becomes increasingly biased toward associative and sensorimotor cortex. Posterior dorsomedial territories are especially important for using action–outcome relations. Dorsolateral territories receive dense sensorimotor input and become prominent when familiar contexts trigger well-practiced actions. The transition is graded. Cortical and thalamic projections overlap, and the same action can engage ventral, associative, and sensorimotor territories at different phases of learning and performance.

This organization is suited to the chapter’s problem. An odor can predict a particular food. The hippocampus can represent where that food was found. The amygdala can represent its biological significance. The insula can represent its sensory and visceral consequences. Hypothalamic and brainstem systems can signal current metabolic and hormonal state. Cortical systems can represent the available actions and the rules relating them to outcomes. These influences converge in partially overlapping striatal populations rather than arriving at a single value register.

The major output of nucleus accumbens reaches the ventral pallidum, which in turn communicates with mediodorsal thalamus, hypothalamus, lateral habenula-related circuitry, midbrain, and brainstem. Accumbens neurons also project to midbrain dopamine regions and other targets. Ventral-striatal activity can therefore influence cortical action systems, autonomic and endocrine control, orienting, locomotion, and the dopamine signals that reshape later learning. The circuit is recurrent from the beginning.

The core and shell make different contributions within this recurrent network. Lesions of accumbens core impair the use of predictive cues and the ability of learned outcome information to guide instrumental performance in several tasks, whereas shell lesions produce a different profile involving context, motivational state, and response allocation [@corbitetal2001accumbens]. Those dissociations are important. They do not support the formula that shell generates raw motivation while core converts it into action.

The familiar D1/D2 diagram also becomes less reliable in ventral striatum. In dorsal striatum, D1-enriched neurons project mainly toward basal-ganglia output nuclei and D2-enriched neurons mainly toward external pallidum. In nucleus accumbens, both receptor classes project substantially to ventral pallidum, and receptor identity does not cleanly specify a direct versus indirect output route [@kupchiketal2015coding]. Dopamine still changes the excitability and plasticity of accumbens neurons. The canonical dorsal pathway labels should not be imported as a complete ventral circuit map.

The positive account is straightforward. Ventral striatum is a major interface through which learned predictions, current bodily state, context, and reward-related cues alter the control of action. Associative and sensorimotor striatal territories contribute other information and become increasingly influential as action–outcome relations are established and behavior is practiced. Learning is distributed across the gradient because natural behavior requires the whole relation: what is present, what can be done, what will follow, and what that outcome means now.

35.7 The ascending spiral links striatal territories

The gradient is connected not only through cortex and thalamus but through dopamine neurons themselves. Suzanne Haber and colleagues injected anatomical tracers into different striatal territories in primates and mapped their projections to midbrain dopamine nuclei and the return projections from those nuclei [@haberetal2000spiral].

Part of the organization is reciprocal. Ventral striatum projects to a medial midbrain region that returns dopamine to ventral striatum. Associative striatum has corresponding relations with more lateral midbrain dopamine neurons, and sensorimotor striatum with still more lateral regions. These relatively closed components can support plasticity within a functional territory.

A second part is nonreciprocal. A striatal territory influences dopamine neurons whose projections extend beyond the territory that provided the input. The accumbens shell can thereby influence dopamine delivery to accumbens core; core-related circuitry can influence more dorsal associative striatum; and associative circuitry can influence dopamine delivery to dorsolateral sensorimotor striatum. Repeated across the midbrain, these open components form an ascending striato–nigro–striatal spiral from ventromedial toward progressively more dorsal and lateral territories.

The anatomy provides a route by which activity related to motivation and expected outcome can alter dopaminergic modulation of action-related circuits. A cue that first gains importance within ventral-striatal networks can, through repeated experience, affect plasticity in territories that represent the sequence used to obtain the outcome. An abstract rule represented in associative cortex can likewise influence sensorimotor action through several cortical, thalamic, and midbrain routes.

The spiral is not a conveyor belt carrying a commodity called value. Nor does it prove that every goal-directed action becomes a habit by moving through the same fixed sequence. Cortical convergence, amygdalar and hippocampal projections, thalamostriatal input, local striatal interactions, and direct hypothalamic and brainstem influences provide additional paths across the system. The spiral is one anatomically demonstrated route for coordinated learning across territories.

Functional experiments support that role. In rats with extensive cocaine-seeking experience, disconnecting nucleus accumbens core from dopamine-dependent dorsolateral-striatal circuitry reduces well-established drug seeking [@belineveritt2008serial]. Longitudinal recordings also show that phasic dopamine signaling related to cocaine seeking emerges in dorsolateral striatum over weeks and depends on antecedent ventral-striatal circuitry [@willuhnetal2012hierarchical]. These results demonstrate serial recruitment during a particular learned behavior. They do not establish that dorsolateral recruitment is necessary for addiction-like behavior in every task or every animal.

35.8 The decisive test: does the action still care about its outcome?

A reinforced action can be strong for at least two reasons. The organism may perform it because it represents the outcome and currently wants that outcome. Or the context may trigger the action because repetition has strengthened the relation between situation and response. Ordinary performance often looks the same in both cases. The distinction appears only when the outcome or contingency changes.

The clearest test uses two actions and two outcomes. A rat learns that pressing one lever produces sucrose and pressing another produces a food pellet. Both actions become frequent. One outcome is then devalued, commonly by allowing the animal to consume it to satiety before the test. The animal is then returned to the chamber, and both levers are tested briefly in extinction, so that no food is delivered during the test.

An outcome-sensitive animal selectively reduces the action that had produced the devalued food. It continues performing the action that produced the other outcome. This result is more informative than a general decline in responding. The animal must preserve the identity of both action–outcome relations, update or retrieve the current incentive value of one outcome, and use that combination to guide action without receiving the outcome during the test [@balleinedickinson1998goal].

The procedure separates knowledge from present motivation. Selective satiety does not need to erase the memory that a lever produces sucrose. The animal may continue to approach or consume sucrose later when hunger returns. What changes during the test is the current claim of that outcome on action.

Devaluation can also be produced by pairing one outcome with illness. The sensory identity of the outcome then predicts an aversive consequence. Again, selective reduction of the corresponding action shows that performance depends on an internal model of what the action produces, not merely on the number of prior reinforcements.
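The logic of the test can be made concrete by contrasting two hypothetical controllers. The sketch below is an illustration of the inference, not a model of the circuitry: the lever names, the numerical values, and both functions are assumptions introduced here. Because the test is run in extinction, neither controller receives an outcome from which it could relearn.

```python
# Learned action-outcome relations: which lever produced which food.
ACTION_OUTCOME = {"left_lever": "sucrose", "right_lever": "pellet"}

# Cached context-response weights built up by equal prior reinforcement.
HABIT_WEIGHT = {"left_lever": 1.0, "right_lever": 1.0}


def incentive_value(outcome, sated_on=None):
    """Current worth of an outcome: low if the animal was just sated on it."""
    return 0.1 if outcome == sated_on else 1.0


def outcome_sensitive_drive(action, sated_on=None):
    """Retrieve the action's outcome, then evaluate that outcome in the current state."""
    return incentive_value(ACTION_OUTCOME[action], sated_on)


def habitual_drive(action, sated_on=None):
    """Read the cached weight; the outcome's current value is never consulted."""
    return HABIT_WEIGHT[action]


# Extinction test after selective satiety on sucrose.
for controller in (outcome_sensitive_drive, habitual_drive):
    drives = {a: controller(a, sated_on="sucrose") for a in ACTION_OUTCOME}
    print(controller.__name__, drives)
```

Only the outcome-sensitive controller lowers responding on the lever that had produced sucrose while leaving the other lever unchanged. The cached controller responds equally on both, which is the pattern that identifies habit.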

Several brain systems are required to make this flexibility possible. Posterior dorsomedial striatum is important for acquiring and expressing action–outcome control. Lesions there impair sensitivity to both outcome devaluation and changes in the causal contingency [@yinetal2005dms]. Insular cortex is important for retrieving the current incentive value of a specific sensory outcome. Animals with insular lesions can learn the action–outcome relation yet fail to adjust performance after the outcome is revalued through selective satiety [@balleinedickinson2000insula]. Interactions involving nucleus accumbens core allow current outcome information and predictive cues to influence instrumental action [@corbitetal2001accumbens].

The circuit is distributed because the computation is distributed. The action is represented in corticostriatal networks. The sensory identity and remembered consequences of the outcome involve insular, amygdalar, orbitofrontal, and related systems. Current physiological state changes the response to that representation. Dorsomedial striatum helps bring the relation back into the competition among actions.

35.8.1 State can revalue an outcome before new experience with it

The most dramatic demonstration comes from sodium appetite. A highly concentrated salt solution is normally aversive. Rats can learn that a cue predicts access to it while they are in a normal physiological state. If the rats are then made sodium depleted, the cue can become strongly attractive on its first presentation in the new state, before the animal has had an opportunity to taste the salt and learn that it is now useful [@tindelletal2009dynamic].

The learned cue–outcome relation remained available. A new internal state changed how the predicted outcome entered motivation and action. Neural activity in ventral pallidal and accumbens-related circuits reflected the new incentive status. This is not ordinary reinforcement accumulated through repeated reward. It is a state-dependent reinterpretation of an already learned outcome.

The example captures the embodied character of incentive value. Salt did not acquire an enduring positive property. The depleted body changed what the same salt represented for control. A theory that treats value as a fixed label attached to an object cannot explain the reversal.

Outcome devaluation therefore provides more than a laboratory definition of flexible action. It shows how memory and regulation meet. An organism acts according to expected consequences, but the consequences are evaluated from within a changing body.

35.9 When control shifts toward habit

With repetition, behavior can become less dependent on the current outcome. This is the defining property of an instrumental habit [@dickinson1985actions].

A habit is an action whose performance has become relatively insensitive to changes in the current value of its outcome or to changes in the causal relation between action and outcome.

That definition is behavioral. It does not require the action to be unconscious. It does not mean that the movement is smooth, skilled, or fast. It does not mean that the actor cannot describe the outcome. A practiced action may be automatic in one sense and remain fully sensitive to its goal. Conversely, an awkward response can be habitual if it persists after the outcome or contingency has changed.

35.9.1 Devaluation and contingency test different relations

Outcome devaluation asks whether the action still depends on the present desirability of its consequence. Contingency degradation asks whether the action still depends on causing that consequence.

Suppose a lever press initially produces food. Food is then delivered just as often without the press, removing the causal advantage of acting. Goal-directed performance declines because the action no longer improves the probability of the outcome. Habitual performance can persist because the context continues to trigger the response even after the causal relation has been degraded.

An omission schedule makes the test stronger: the outcome is delivered only if the animal refrains from the previously reinforced action. Persistent responding now postpones or prevents the outcome. If the action continues, its control cannot be explained by a currently useful action–outcome relation.
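One common way to summarize these schedules is the difference between the probability of the outcome given the action and its probability without the action, often written ΔP. The sketch below uses that measure, which this section does not itself define. The function name and the probabilities are illustrative assumptions.

```python
def delta_p(p_outcome_given_action, p_outcome_given_no_action):
    """Contingency as the change in outcome probability produced by acting."""
    return p_outcome_given_action - p_outcome_given_no_action


# Hypothetical per-interval outcome probabilities for three schedules.
training = delta_p(0.5, 0.0)   # food only follows the press
degraded = delta_p(0.5, 0.5)   # food is equally likely without the press
omission = delta_p(0.0, 0.5)   # pressing cancels food that would otherwise arrive

print(training, degraded, omission)
```

A positive value means acting helps, zero means acting makes no difference, and a negative value means acting costs the animal the outcome. Responding that continues under the second and third schedules is what the tests above treat as evidence of habit.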

No single negative result proves habit. Satiety can reduce motor vigor; illness can generalize beyond one food; extinction can alter motivation; and an animal may fail a devaluation test because it cannot represent outcome identity. Strong experiments use two actions, two outcomes, sensory-specific revaluation, and controls showing that the animal can still discriminate and consume the outcomes.

35.9.2 Dorsomedial and dorsolateral striatum have different biases

The dorsal striatum contains a useful, though incomplete, anatomical gradient. Posterior dorsomedial striatum receives associative cortical input and is strongly involved in action–outcome learning. Dorsolateral striatum receives dense sensorimotor input and becomes increasingly important when familiar stimuli and action sequences control performance.

Lesions of dorsolateral striatum can preserve knowledge of an expected outcome while preventing the normal emergence of outcome-insensitive responding after extended training [@yinetal2004dls]. Reversible inactivation of dorsolateral striatum can restore sensitivity to a degraded action–outcome contingency in animals whose behavior had become habitual [@yinetal2006contingency]. Conversely, disrupting dorsomedial striatum impairs goal-directed control even when the animal can still move and obtain reinforcement [@yinetal2005dms].

This evidence supports a shift in relative control, not a handoff from one complete system to another. Dorsomedial and dorsolateral circuits are active during the same periods of training. Orbitofrontal and striatal populations can encode changes between goal-directed and habitual strategies as task conditions change [@gremelcosta2013shift]. Molecular manipulations of histone deacetylase 3 in either dorsomedial or dorsolateral striatum can accelerate or prevent habit formation, showing that both territories participate in the plasticity governing the transition [@malvaezetal2018habits].

The balance also remains reversible. A strongly practiced action can return to outcome-sensitive control after context change, contingency change, or circuit perturbation. Habit is not the permanent transfer of a behavior to a lower neural level. It is one way control is allocated among concurrently available systems.

35.9.3 Sequence chunking is related to habit but not identical to it

Well-practiced action sequences often develop a distinctive neural structure. Early in learning, striatal activity can mark individual turns, pauses, and choices. With practice, activity in dorsolateral striatum can become especially prominent near the beginning and end of a sequence, creating a task bracket around the intervening actions [@jogetal1999habits]. Start- and stop-related signals also emerge in nigrostriatal circuits during sequence learning [@jincosta2010startstop].

This organization can allow several components to be initiated and completed as a unit. It reduces the need to reconsider every element and can increase speed and consistency. It does not by itself establish that the sequence is habitual. A skilled pianist can execute a fluent phrase while remaining sensitive to the intended sound. A driver can perform a practiced gear change yet alter the sequence immediately when traffic demands it. Chunking concerns the organization of a sequence; habit concerns its sensitivity to outcome and contingency.

The distinction matters for this unit. Motor skill, automaticity, habit, and compulsion can all produce rapid, practiced behavior. They describe different properties and require different tests.

Human studies face an additional difficulty. Instructions and explicit knowledge can keep behavior outcome sensitive, while laboratory training is usually too brief to reproduce years of everyday repetition. Even so, extended training can increase activity in posterior putamen and reduce sensitivity to outcome devaluation in humans [@tricomietal2009human]. The result supports continuity with the animal literature while leaving room for language, explicit strategy, and cultural routines to reshape human habit.

Deeper Dive: goal-directed and habitual are not synonyms for model-based and model-free

Computational reinforcement learning often distinguishes model-based control from model-free control. A model-based agent uses a representation of transitions and outcomes to evaluate possible action sequences. A model-free agent learns cached action values from prediction errors without explicitly simulating the transition structure.

The distinction resembles the behavioral contrast between goal-directed action and habit. Outcome-sensitive action normally requires knowledge of what the action produces, whereas a habitual response can be supported by a cached relation between situation and action. The pairs are not exact synonyms. A model-based computation need not be conscious or slow. A model-free value can remain sensitive to some motivational changes. Behavioral habit is defined by performance after devaluation or contingency change, not by fitting one computational model.

The two vocabularies answer related questions at different levels. The behavioral tests establish what information controls the action. The computational models propose how that information might be learned and used.
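A deliberately simple sketch shows why the two computational schemes respond differently to devaluation. The cached value is learned with a basic delta rule, and the model-based value is recomputed from a stored action–outcome map. All names, the learning rate, and the reward values are assumptions made for illustration. Real model-free systems need not be this insensitive to motivational change.

```python
def train_cached_value(rewards, alpha=0.2):
    """Model-free: update one cached action value from reward prediction errors."""
    q = 0.0
    for r in rewards:
        q += alpha * (r - q)  # prediction error scaled by the learning rate
    return q


def model_based_value(action, action_outcome, outcome_value):
    """Model-based: look up what the action produces, then its current value."""
    return outcome_value[action_outcome[action]]


action_outcome = {"press": "pellet"}
outcome_value = {"pellet": 1.0}

# Training: fifty rewarded presses build up the cached value.
cached_q = train_cached_value([1.0] * 50)
mb_before = model_based_value("press", action_outcome, outcome_value)

# Devaluation happens away from the lever, so no new press is experienced.
outcome_value["pellet"] = 0.0
mb_after = model_based_value("press", action_outcome, outcome_value)

print(cached_q, mb_before, mb_after)
```

The model-based value falls immediately because it consults the current outcome value. The cached value stays high until the action is performed again and generates new prediction errors.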

35.10 Learning, wanting, and liking can separate

The word reward can refer to several operations. An outcome can teach the actions and cues that precede it. A cue can make the outcome worth pursuing now. Consumption can produce positive hedonic reactions. These processes normally cooperate, but experiments can separate them.

35.10.1 Learning changes future weights

The teaching function has already been developed. A better-than-expected outcome produces dopamine-related signals that modify eligible corticostriatal connections. A worse-than-expected outcome can weaken the predictions or actions that led to it. Over repeated experience, cues and actions acquire control because they predict specific consequences.
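The idea of an eligible connection can be sketched as a three-factor rule: the weight change is the product of a learning rate, the prediction error, and an eligibility term that decays after the synapse was last active. The exponential decay, the time constant, and the parameter names are assumptions chosen for illustration, not measured values.

```python
import math


def weight_change(delta, time_since_activity_s, tau_s=1.0, alpha=0.1):
    """Three-factor sketch: learning rate * prediction error * eligibility.

    tau_s is a hypothetical decay constant for the eligibility trace.
    """
    eligibility = math.exp(-time_since_activity_s / tau_s)
    return alpha * delta * eligibility


recent = weight_change(delta=1.0, time_since_activity_s=0.5)
stale = weight_change(delta=1.0, time_since_activity_s=10.0)
negative = weight_change(delta=-1.0, time_since_activity_s=0.5)

print(recent, stale, negative)
```

Under these assumptions the same positive error strengthens a recently active connection far more than one that was active long before, and a negative error weakens the recently active connection by the same amount.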

Learning does not require the outcome to remain pleasurable forever. A drug-predictive cue can remain an effective trigger after tolerance has changed the drug’s hedonic effect. A habitual action can persist after the outcome has been devalued. Prediction and action weight are historical products; current incentive value is a state-dependent use of that history.

35.10.2 Cues can acquire motivational pull

A reward-predictive cue does more than report information. It can attract attention, evoke approach, increase autonomic preparation, and invigorate actions that obtain the expected outcome. Berridge and colleagues call this cue-triggered motivational property incentive salience, or wanting.

A useful assay is Pavlovian-instrumental transfer. A cue is first paired with an outcome independently of the animal’s action. The animal separately learns an instrumental response. When the cue is later presented during an extinction test, it can selectively or generally increase instrumental performance. The cue has gained access to action even though it does not deliver the outcome during the test.

Dopamine in nucleus accumbens is important for this motivational influence. Infusing amphetamine into accumbens can magnify the capacity of a sucrose-predictive cue to provoke instrumental pursuit without increasing positive taste reactions to sucrose and without simply reinforcing the response during the test [@wyvellberridge2000wanting]. The manipulation increases cue-triggered wanting without demonstrating increased liking.

Current state changes the effect. A food cue has greater motivational pull during hunger. The salt cue described earlier becomes attractive during sodium depletion before new experience with the outcome. Dopamine therefore does not assign one fixed incentive value to a cue. It participates in the interaction between learned prediction, current state, and available action.

Wanting is also related to vigor. An animal can know which action produces food yet perform it slowly when the expected benefit is low or the effort is high. Nigrostriatal and mesolimbic dopamine influence the probability, latency, and intensity with which an action is pursued [@dasilvaetal2018dopamine; @esheletal2024cost]. Motivational activation is not a separate prelude to motor control. It changes the action that gains access and the energy committed to it.

35.10.3 Hedonic impact has partly different causal machinery

Liking refers to the positive hedonic impact of an outcome when it is consumed. In rodents, investigators infer part of this impact from stereotyped orofacial reactions to sweet and bitter tastes. The assay is limited, but it allows causal manipulations to distinguish appetitive pursuit from the reaction to the outcome itself.

Small regions within medial nucleus accumbens shell and posterior ventral pallidum function as hedonic hotspots. Local activation of mu-opioid or endocannabinoid signaling within these sites can amplify positive taste reactions to sweetness. The same manipulation just outside the hotspot may increase food intake without increasing hedonic reactions, and some neighboring regions can suppress positive reactions [@pecinaberridge2005hotspot; @smithberridge2005pallidum].

The hotspots are not pleasure centers. They are small causal nodes within a wider network that includes brainstem taste circuits, parabrachial and hypothalamic systems, amygdala, insula, orbitofrontal cortex, accumbens, and ventral pallidum. Rodent taste reactions also capture only one component of human pleasure. The experiments nevertheless establish a decisive dissociation: increasing dopamine-related pursuit is not the same manipulation as increasing hedonic impact.

The three operations can now be stated cleanly:

learning changes the predictions and action weights that will govern future behavior;
wanting gives cues and outcomes motivational control over present action;
liking is the hedonic reaction when the outcome is experienced.

Dopamine contributes strongly to the first two. It does not provide the whole of the third.

Olds and Milner: reinforcement is not a meter of pleasure

In 1954, James Olds and Peter Milner implanted electrodes in rat brains and discovered that animals would repeatedly perform an action that produced brief electrical stimulation at certain sites [@oldsmilner1954reinforcement]. The finding was revolutionary because stimulation itself acted as a powerful reinforcer. It could shape approach, lever pressing, and repeated return to a location.

The experiment did not identify a single pleasure center. The effective sites included fibers and regions connected through the medial forebrain bundle, and electrical current activated mixed axons and cells in both directions. Lever pressing established that the stimulation strengthened the actions that produced it. It did not reveal whether the stimulation generated subjective pleasure, incentive motivation, arousal, action activation, or some combination.

Later work on dopamine, opioid hotspots, and incentive salience explains why the distinction matters. A consequence can powerfully reinforce and motivate action without serving as a direct meter of hedonic experience.

35.11 Addiction recruits several forms of learning and control

Addiction is often described as a reward system being hijacked. The metaphor captures the extraordinary control that drugs and drug-predictive cues can acquire. It obscures the fact that substance-use disorders recruit several mechanisms, and that different drugs perturb the nervous system in different ways.

35.11.1 Drug pharmacology enters learning circuits

Cocaine blocks monoamine transporters. Amphetamine reverses or disrupts monoamine transport. Opioids act at opioid receptors and strongly alter pain, brainstem, striatal, and stress circuits. Nicotine activates nicotinic acetylcholine receptors. Alcohol changes several transmitter and membrane systems. Their acute subjective and physiological effects are not interchangeable.

Many addictive drugs nevertheless increase or dysregulate dopamine in mesolimbic striatum, either directly or through upstream circuitry [@dichiaraimperato1988drugs]. This allows drug delivery and its predictors to engage the same plasticity that ordinarily relates cues and actions to biologically important outcomes. The drug does not need to be interpreted by an inner evaluator as the best possible reward. Its pharmacology changes the signals through which learning and motivation are implemented.

The consequence can be unusually persistent cue control. Places, people, odors, paraphernalia, internal states, and action sequences become predictors of drug availability. Those predictors can capture attention and provoke approach long after withdrawal. Human PET studies show that cocaine-predictive cues can increase dopamine in dorsal striatum in relation to craving [@volkowetal2006cues]. Cue reactivity is therefore not a mere memory of prior pleasure. It is a learned pathway into present action.

35.11.2 Repeated seeking can recruit dorsal habit circuitry

Early drug use can be flexible and explicitly outcome directed. With repetition, the actions used to obtain a drug can become increasingly controlled by familiar cues and sequences. In rats, prolonged cocaine self-administration can produce seeking that becomes less sensitive to outcome devaluation and increasingly dependent on dorsolateral striatum [@zapataetal2010habitual]. Serial ventral-to-dorsal striatal connectivity and the ascending dopamine spiral provide one route for that recruitment [@belineveritt2008serial; @willuhnetal2012hierarchical].

This does not make addiction a habit by definition. Animals can develop addiction-like persistence in tasks that require flexible, newly solved actions and do not show the predicted transfer to dorsolateral dopamine control [@singeretal2018habits]. Human drug seeking can involve planning, deception, long delays, and deliberate pursuit. Habit contributes to some stages and forms of addiction, but complex goal-directed behavior can also serve a destructive outcome.

35.11.3 Relief from a bad state can become the reinforcer

As dependence develops, drug taking can be maintained by negative reinforcement. The action removes or reduces withdrawal, stress, dysphoria, pain, or other aversive states. The relevant prediction error is no longer only the arrival of a positive drug effect. Relief is better than the state that preceded it.

Extended daily access to cocaine in rats can produce a transition from stable to escalating, excessive intake, together with altered hedonic regulation [@ahmedkoob1998escalation]. Opioid withdrawal, alcohol withdrawal, and stimulant crashes each engage partially different stress and homeostatic systems. These internal states become contexts in which drug-seeking actions acquire renewed value.

Negative reinforcement links addiction back to the regulated body. The drug changes the body; adaptation changes the state produced by absence of the drug; and the aversive state gives the drug-taking action a new function. The same action can therefore be maintained first by anticipated positive effects and later by relief, habit, cue-triggered wanting, or several processes at once.

35.11.4 Compulsion is not simply a strong habit

A habit is diagnosed by relative insensitivity to outcome devaluation or contingency change. Compulsion is persistence despite conflict, punishment, or known harm. The concepts overlap but are not equivalent.

In animal models, only a subset of subjects continues seeking or taking a drug despite punishment, high effort, or signals that the drug is unavailable. That individual variation resembles one feature of human vulnerability [@derochegammonetetal2004addiction]. In a mouse model based on optogenetic self-stimulation of dopamine neurons, stochastic strengthening of particular corticostriatal synapses predicted the emergence of punishment-resistant responding, and reversing that plasticity reduced the behavior [@pascolietal2018compulsion]. Compulsion therefore reflects specific circuit changes and conflict among controllers, not an all-purpose increase in one reward value.

Human addiction adds levels that animal models only partly capture. Development, trauma and chronic stress, social isolation, poverty, availability, cultural practice, psychiatric illness, pain treatment, and drug potency alter exposure and vulnerability. Prefrontal, insular, amygdalar, hippocampal, hypothalamic, striatal, and brainstem systems all contribute. No single dopamine transient, striatal territory, or habit score defines the disorder.

The wanting–liking distinction remains valuable inside this larger account. Drug-predictive cues can acquire intense motivational pull even when tolerance reduces pleasure or when the person expects serious harm. Incentive motivation can sensitize while hedonic impact does not increase in parallel. But declining liking is not required for every substance-use disorder, and craving is not the only cause of relapse.

The strong conclusion is therefore plural:

Addiction can combine pharmacological distortion of teaching signals, cue-triggered wanting, recruitment of habitual action, relief from aversive states, and impaired reconfiguration in the face of harm. Different people and drugs reach persistent use through different mixtures of these processes.

35.12 How reinforcement and habit are studied

Claims about value and reward are especially vulnerable to circular reasoning. If an animal presses a lever, the outcome is called valuable; if a brain region becomes active, it is called a value region; and the action is then explained by the represented value. Strong experiments break that circle by manipulating prediction, outcome identity, bodily state, causal contingency, or the timing of a candidate teaching signal.

35.12.1 Behavioral designs separate the controlling relation

Blocking asks whether a redundant cue acquires associative control. Extinction presents the cue or action without its expected outcome and measures how prediction changes. Outcome devaluation changes the current incentive value of one outcome while preserving the learned action–outcome relation. Contingency degradation changes whether the action causes the outcome. Omission makes the previously reinforced action postpone or cancel the outcome. Pavlovian-instrumental transfer asks whether a predictive cue invigorates or redirects an independently learned action.
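The logic of blocking can be sketched with a Rescorla–Wagner-style update, in which every cue present on a trial shares one prediction error. This section does not itself specify that model. It is used here as a standard illustration, and the cue names, trial counts, and learning rate are assumptions.

```python
def rescorla_wagner(trials, alpha=0.2, reward=1.0):
    """Update associative strengths; `trials` lists the cues present on each rewarded trial."""
    strength = {}
    for cues in trials:
        prediction = sum(strength.get(c, 0.0) for c in cues)
        error = reward - prediction  # one shared prediction error per trial
        for c in cues:
            strength[c] = strength.get(c, 0.0) + alpha * error
    return strength


# Blocking group: the light alone predicts reward before the tone is added.
blocked = rescorla_wagner([("light",)] * 40 + [("light", "tone")] * 40)

# Control group: light and tone are paired with reward from the start.
control = rescorla_wagner([("light", "tone")] * 40)

print(round(blocked["tone"], 4), round(control["tone"], 4))
```

The tone is paired with reward equally often in both groups. It gains almost no strength in the blocking group because the light already predicts the reward, so little error remains to drive learning.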

Each design supports a different inference. Lever pressing alone does not reveal whether the action is goal directed, habitual, cue driven, or maintained by relief. The critical evidence comes from what happens after a controlled change.

35.12.2 Recording methods measure different dopamine variables

Single-neuron electrophysiology measures spikes in identified or putative dopamine cells with millisecond resolution. It revealed the canonical prediction-error pattern but samples a small population and does not directly measure transmitter at the target.

Microdialysis measures extracellular transmitter over minutes and is well suited to slower changes during drug exposure or sustained motivation. Fast-scan cyclic voltammetry measures subsecond changes in local dopamine concentration and was central to showing cue-related release in ventral and dorsal striatum [@dayetal2007shifts; @willuhnetal2012hierarchical]. Genetically encoded fluorescent dopamine sensors now permit rapid optical measurement across many trials and locations [@patriarchietal2018sensor].

These signals should not be treated as interchangeable. Cell-body firing, axonal calcium, extracellular dopamine, receptor occupancy, and postsynaptic signaling are linked but distinct. Release can be regulated locally by acetylcholine and other transmitters. A nucleus-accumbens signal cannot be assumed to describe dorsolateral striatum.

35.12.3 Causal timing matters

Optogenetic methods can activate or inhibit genetically defined dopamine neurons or striatal populations with subsecond timing. Chemogenetic methods alter defined populations over longer intervals. The strongest teaching experiments place a brief optogenetic manipulation at the exact moment a positive or negative error should occur. If the manipulation changes what is learned without simply changing movement or consumption, it supports a teaching role [@steinbergetal2013causal].

Lesions and reversible inactivation answer different questions. A lesion made before training can disrupt acquisition, performance, or compensation over time. A brief inactivation after learning can test whether a structure is needed to express outcome-sensitive or habitual control. Receptor antagonists can distinguish dopamine-dependent performance from the storage of the learned relation, but they can also alter movement and effort.

35.12.4 Anatomy establishes possible routes

Classical anterograde and retrograde tracers map projections with synaptic and topographic precision in animals. The primate striato–nigro–striatal spiral is based on such tracer evidence [@haberetal2000spiral]. Cell-type-specific viral methods can identify projections and manipulate them in rodents. Diffusion MRI in humans estimates the orientation of white-matter pathways but cannot establish the direction of a projection, its synaptic targets, or its transmitter identity.

35.12.5 Human studies add translation and new constraints

Functional MRI can identify signals correlated with expected outcome, prediction error, effort, and devaluation. Temporal-difference models have explained activity in human ventral striatum and orbitofrontal regions during learning [@odohertyetal2003td]. Pharmacological manipulation can change the relation between model-derived prediction errors and behavior [@pessiglioneetal2006human]. PET can measure receptor availability and infer task-related dopamine release from competition with radioligands, as in cue-evoked dorsal-striatal dopamine during cocaine craving [@volkowetal2006cues].
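A model-derived prediction error of the kind regressed against imaging signals can be sketched with a minimal temporal-difference learner. The three-state trial, the learning rate, and the discount factor below are illustrative assumptions and do not reproduce the models fitted in the cited studies.

```python
def td_learn(n_episodes, alpha=0.1, gamma=1.0):
    """TD(0) over a fixed trial: cue -> delay -> reward delivery -> end."""
    states = ["cue", "delay", "reward"]
    rewards = {"cue": 0.0, "delay": 0.0, "reward": 1.0}  # reward obtained in each state
    value = {s: 0.0 for s in states}
    last_errors = {}
    for _ in range(n_episodes):
        for i, s in enumerate(states):
            # The value after the final state is zero because the trial ends.
            next_value = value[states[i + 1]] if i + 1 < len(states) else 0.0
            delta = rewards[s] + gamma * next_value - value[s]  # TD error
            value[s] += alpha * delta
            last_errors[s] = delta
    return value, last_errors


early_value, early_errors = td_learn(1)
late_value, late_errors = td_learn(500)

print(early_errors["reward"], round(late_errors["reward"], 4), round(late_value["cue"], 4))
```

On the first trial the full reward is unexpected, so the error at delivery is large. After many trials the value has propagated back to the cue and the error at delivery approaches zero. The resulting trial-by-trial series of errors is the kind of regressor compared with BOLD signals.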

Human work also exposes the limits of translation. A BOLD prediction-error correlate is not a direct dopamine measurement. A PET displacement signal integrates over a much longer interval than a phasic burst. People can use verbal rules and explicit strategies that laboratory rodents cannot. Patients taking dopaminergic medication or living with addiction provide important causal constraints but are not unaltered models of normal function.

The chapter’s conclusions therefore come from convergence. Behavioral designs define the relation controlling action. Recordings reveal when and where candidate signals occur. Temporally precise manipulations test causality. Anatomy establishes the routes through which learning can spread. Human studies determine which principles survive language, instruction, long experience, and disease.

35.13 Coda: when action becomes history

Unit 5 began with the problem of controlling a body. Lower motor neurons translate neural activity into force. Spinal and brainstem circuits organize posture, withdrawal, orienting, and rhythmic action. Cortical systems specify flexible movements in relation to objects and rules. Cerebellar loops predict the consequences of those movements before delayed feedback can arrive. Basal-ganglia loops regulate which controllers gain influence, how vigorously they act, and when their claim should end.

This final chapter adds history to that hierarchy.

An action does not end when its muscles relax. Its consequences alter predictions, synapses, bodily state, and the later competition among actions. Better-than-expected outcomes produce teaching signals. Recently active corticostriatal connections become more or less influential. Learned cues gain access to attention, physiology, and movement. The current body changes whether the predicted outcome deserves pursuit. Repetition can reduce dependence on the outcome and allow familiar contexts to trigger a practiced sequence directly.

The result is adaptive because an animal should not solve the same control problem from the beginning each time it encounters it. Experience should make successful routes easier to recruit, dangerous actions easier to suppress, and familiar sequences faster to organize. The result is also vulnerable. A cue can acquire too much motivational control. A useful action can persist after its consequence changes. A drug can alter the teaching and state signals that ordinarily keep learning tied to the needs of the body.

The basal ganglia do not receive a finished value from a separate evaluator. Striatal, cortical, amygdalar, hippocampal, insular, hypothalamic, thalamic, pallidal, and midbrain systems jointly preserve relations among cues, actions, outcomes, and current state. Dopamine-related prediction errors modify parts of that network. Habit changes which relations dominate performance. Worth is therefore not one number stored in one place. It is expressed in what the organism predicts, pursues, enjoys, avoids, and continues doing when conditions change.

Unit 6 will extend the temporal and representational horizon. Organisms can compare delayed and uncertain possibilities, imagine outcomes not currently present, follow symbolic rules, and make choices whose consequences depend on other agents. Orbitofrontal, ventromedial prefrontal, hippocampal, amygdalar, insular, and striatal systems will all remain involved. The next unit does not begin where a subcortical reward machine hands value to a cortical chooser. It asks how a distributed controller constructs and acts upon possible futures.

Unit 5 ends with a simpler and more fundamental transformation:

Action becomes consequence. Consequence becomes learning. Learning changes the next action.

A note on what this chapter is sure of, and what it isn’t

We are confident that:

Associative learning can depend on prediction error rather than mere co-occurrence, as blocking demonstrates.
Many midbrain dopamine neurons show the canonical signed pattern expected of a reward-prediction-error signal in controlled learning tasks.
Temporally precise dopamine-neuron activity can causally alter cue–outcome learning.
Dopamine gates corticostriatal plasticity within a limited temporal relation to recent neural activity.
Cue–outcome prediction, action–outcome knowledge, current incentive value, and hedonic impact are experimentally separable.
Outcome devaluation and contingency degradation distinguish flexible action–outcome control from habitual performance.
Posterior dorsomedial and dorsolateral striatal territories make different, interacting contributions to goal-directed and habitual action.
The striato–nigro–striatal spiral from ventromedial toward dorsolateral striatum is anatomically established in primates.
Cue-triggered incentive motivation can be increased without a corresponding increase in hedonic taste reactions.
Small opioid- and endocannabinoid-sensitive regions of accumbens shell and posterior ventral pallidum make causal contributions to hedonic reactions.
Addiction recruits several learning and control processes and cannot be reduced to pleasure, dopamine, habit, or one striatal region alone.

We have good reason to think that:

Dopamine evolved as a modulator of action-related vertebrate circuits before the expansion of mammalian cortical reward and decision systems.
Reward-prediction-error signals help reweight the cues and actions that enter future corticostriatal competition.
Ventral, associative, and sensorimotor striatal territories form interacting gradients rather than independent motivational, cognitive, and motor machines.
The ascending spiral provides one route through which motivational learning can alter dopamine-dependent plasticity in more dorsal action systems.
Repetition can shift relative control toward dorsolateral sensorimotor circuitry while action–outcome knowledge remains available elsewhere.
Dopamine helps couple expected benefit, current state, and effort to the vigor with which an action is pursued.
Incentive sensitization, habit recruitment, negative reinforcement, and impaired flexible control each contribute to some forms and stages of addiction.

We remain genuinely unsure about:

How reward-prediction error, movement, salience, effort, and state information are divided among dopamine neurons, axons, release sites, and timescales.
How credit is assigned across long natural action sequences whose important physiological consequences unfold over minutes or hours.
How the brain chooses the state and action representations over which a prediction error is computed.
How precisely rodent dorsomedial–dorsolateral distinctions map onto human caudate and putamen during everyday habits.
When a practiced human action should be classified as skilled, automatic, habitual, or compulsive.
How consistently different addictions recruit a ventral-to-dorsal progression, and whether that progression is necessary in any given person.
How rodent taste-reactivity hotspots relate to the full range of conscious human pleasure.

The unresolved questions concern the detailed implementation and the diversity of natural behavior. They do not weaken the central conclusion: consequences alter future control by changing the distributed relations among cues, bodily state, outcomes, and action.