The Numerator

Attribution choice is incentive design

Harel Rechavia — Tue, 19 May 2026 16:49:21 GMT

Most of my job is using metrics to tell a story. You decide what to measure, and that number starts steering things: the roadmap, what gets celebrated in the launch review, what counts as a win. I’m comfortable on that side of the table.

On one of those launches I watched it run the other way. I was on one of three recommendations teams at a large ecommerce platform, and the attribution model we’d been handed started steering us. The notifications team, and to be honest my own team too, started trying to predict anything a user might buy in the next 30 days and push it in front of them now, just to land the credit before the other surfaces did. We never asked whether it was the right call. We just did whatever moved the metric.

The first time the org rolled up the three recommendations teams as a cluster, the number was flat. Not declining, not growing. Just flat across the quarter. Each team’s own dashboard told a different story: green here, red there, green again, all of it inside the band that looks like normal quarter-to-quarter noise. No team was obviously broken. The aggregate said something was. That was the gap that started the conversation.

The Numerator is practitioner notes on product data science at consumer scale. Written by Harel Rechavia, formerly Google Waze, Amazon Alexa Shopping, and Viber. Subscribe for one or two posts a month.

The setup

A large ecommerce platform had three recommendations surfaces, each owned by a different team. Replenishment notifications pushed a message when the platform predicted a user was running low (“Heads up, you’re probably running low on coffee pods”). Search recommendations surfaced products when a user typed a query. High personalization recommendations was the always-on, model-driven surface that picked products based on a behavioral profile.

One user, one coffee pods purchase, three teams that could plausibly claim credit. So the platform picked an attribution model. Three options were on the table: First-Touch (credit to whichever surface introduced the product to the user first inside a 30-day window), Even-Split (credit divided across every surface the user saw inside the window), and Last-Touch (credit to whichever surface the user saw closest to the moment of purchase). The org picked First-Touch. The reasoning sounded right: reward the team that introduced the product. I was on one of the three teams when this rolled out and saw firsthand how our decisions started changing.
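To make the three options concrete, here is a minimal sketch of how each one splits credit for a single purchase. The touch log, surface names, and function names are all illustrative assumptions, not the platform's actual implementation.

```python
# Hypothetical touch log for one purchase: (day inside the 30-day window, surface).
# Surface names and days are illustrative, not data from the platform.
touches = [(3, "replenishment"), (12, "personalization"), (27, "search")]


def first_touch(touches):
    """All credit to the surface with the earliest touch."""
    _, surface = min(touches, key=lambda t: t[0])
    return {surface: 1.0}


def last_touch(touches):
    """All credit to the surface with the touch closest to purchase."""
    _, surface = max(touches, key=lambda t: t[0])
    return {surface: 1.0}


def even_split(touches):
    """Credit divided equally across every distinct surface the user saw."""
    surfaces = sorted({surface for _, surface in touches})
    return {surface: 1.0 / len(surfaces) for surface in surfaces}


print(first_touch(touches))   # {'replenishment': 1.0}
print(last_touch(touches))    # {'search': 1.0}
print(even_split(touches))    # one third each

# Under First-Touch, pushing the same product earlier takes all the credit
# without changing whether the purchase happens.
raced = [(1, "personalization")] + touches
print(first_touch(raced))     # {'personalization': 1.0}
```

The last two lines are the whole incentive problem in miniature: the purchase is the same, only the owner of the credit changes.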

What the teams actually optimized for

The behavior change was rational. If credit went to whoever was first, the dominant move was not “recommend coffee pods when they are most relevant.” It was “recommend coffee pods now, before the other two surfaces do.” Anything a user might plausibly buy in the next 30 days became fair game to push in front of them today.

The catch is that most of those purchases would have happened anyway. The user was going to buy coffee pods because they were running low, not because three different surfaces had impressed them with the idea. The platform was crediting baseline purchases, not incremental purchases the teams had caused. This is the same mechanic Uber’s growth team eventually surfaced when an incrementality test against a no-paid-ads control group led them to cut roughly $30M in annual U.S. Meta spend. Their paid ads were mostly reaching people who would have signed up anyway. Standard attribution had been crediting them for it.
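The gap between attributed and incremental purchases is easy to see with a holdout. This is a rough sketch with made-up numbers; the function name, the counts, and the simplification that every exposed buyer gets attributed are my assumptions, not figures from the platform or from Uber.

```python
def incrementality(exposed_users, exposed_buyers, holdout_users, holdout_buyers):
    """Compare purchase rates between exposed users and a randomized holdout.

    Assumes the holdout saw no recommendation and is otherwise comparable.
    """
    exposed_rate = exposed_buyers / exposed_users
    baseline_rate = holdout_buyers / holdout_users
    # Buyers the surface plausibly caused, scaled to the exposed population.
    incremental_buyers = (exposed_rate - baseline_rate) * exposed_users
    return {
        "attributed_buyers": exposed_buyers,  # simplification: all exposed buyers credited
        "incremental_buyers": incremental_buyers,
        "incremental_share": incremental_buyers / exposed_buyers,
    }


# Hypothetical counts: 12% of exposed users buy, 11% of holdout users buy.
result = incrementality(
    exposed_users=10_000, exposed_buyers=1_200,
    holdout_users=10_000, holdout_buyers=1_100,
)
print(result)
```

With these illustrative inputs the dashboard would show 1,200 credited purchases while only about 100 of them are incremental.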

Because the underlying purchases were largely fixed, the attribution credit was not being created. It was being shuffled. When the Replenishment team won the race to be first on a cohort of coffee-pods buyers, its metric went up and one of the other two surfaces dropped. Next quarter, the trade ran the other way. A senior Airbnb growth-marketing lead described the same pattern publicly as teams using attribution “to re-slice the pie” instead of growing the pie. From any one team’s seat, the metric was volatile but the direction was unclear. No one was lying. No one was sandbagging. Everyone was rationally chasing the credit the model handed out.

The slow realization

No single postmortem caught it. Each team’s DS saw their own metric move, sometimes up, sometimes down, mostly inside the band that quarter-to-quarter noise normally lives in. From inside one team, there was nothing to see. The behavior was only visible if you stood far enough back, and nobody was standing that far back.

The catalyst was structural, not analytical. The org started measuring the recommendations cluster as a group, on top of measuring each team. At the group level, the number was flat. The aggregate told a different story than any individual dashboard told. That gap was the first thing that made anyone ask why three teams that each looked roughly fine were collectively not moving the number that mattered.

The thesis

Attribution choice is incentive design. Picking a model is not a measurement decision. It is a behavioral one. It will tell you how your teams are going to race.

Three teams plus First-Touch is a race to be first on purchases the user would have made anyway. Three teams plus Even-Split is either cooperation or quiet collusion on volume. Three teams plus Last-Touch is a race to intercept buyers at the bottom of the funnel. There is no neutral option in this set. Every model selects for a behavior. The right question before picking one is not “which model is most accurate.” It is “which behavior do we want from the teams.”

Mature measurement systems sidestep some of this by calibrating attribution against incrementality experiments. Most orgs picking their first attribution model are not running both, and the behavioral consequence shows up first. When a metric becomes the target teams are graded on, the teams will optimize for the metric.


What to do Monday

When you are asked to design attribution for a multi-team product, treat it as a behavior design problem, not a measurement problem. Bring the team leads in before the model is picked, not after. Ask “what behavior do we want from these teams” before “what attribution model fits the data.” If the org cannot answer the first question, the model will answer it for them. And the answer will look like normal noise on every individual dashboard for a long time before it shows up at the aggregate.

Two ways to measure long-term value and the assumption they share

Harel Rechavia — Thu, 07 May 2026 12:54:21 GMT

You have probably had this conversation, from one side or the other. A feature shipped, the short-term metric is flat or down, and someone in the room says “future value will catch what the experiment cannot.” The model gets pulled up. Sometimes it agrees with the team and the feature ships. Sometimes it disagrees and the feature does not. Either way, nobody in that room asks a question that should be the first one: how often does this model actually agree with the long-term truth when we wait long enough to check?

Netflix asked it of themselves. Across 200 A/B tests and 1,098 test arms, they checked their 14-day surrogate index against the day-63 ground truth. The headline number, the one most posts about this paper quote, is 95% directional consistency. The number that should follow it, and almost never does, is 65% recall against the day-63 ship decision. That is not the model failing on edge cases. That is the model deciding the next quarter of your roadmap.

This post is the practitioner version of long-term value measurement, focused on the two methods you will see most in published industrial work: PSM-style action scoring (Airbnb FIV) and the surrogate index (Netflix, Booking.com). Other families exist, including one built on heterogeneous treatment effects with policy trees, but the two we focus on are the ones that show up in the most launch reviews and the ones whose tradeoffs are best documented in public.

I have watched some version of this play out at Google Waze and Amazon Alexa Shopping. Different names, different math, same shape. Sometimes it held us back from rolling out a feature the team had spent months on. Sometimes it gave us the language to measure a win and defend it to leadership when the short-term metrics could not. Either way, you pay twice. Once in model maintenance, and once in the stakeholder time it takes to explain what the model did and did not say. Here is what the published implementations look like and where they break.

Airbnb FIV: scoring actions from the past

Some product actions create most of their value over a horizon experiments cannot reach. Account recovery setup is the canonical example. The 14-day metric is the percentage of accounts that completed the recovery setup flow. The actual value is the percentage of takeover incidents resolved in five minutes instead of churning. Six months downstream. Invisible inside the test window.

The same shape repeats:

First-time wow moments. First booking, first ride, first family-plan invite. The action’s value lives in retention curves that bend months later.
Latency and crash fixes. A single bad session raises churn risk that does not appear in the same session’s metrics.
Trust and safety actions. Two-factor enrollment, blocking, reporting. Rare events with enormous per-event value and volume too low to move aggregates.
Onboarding milestones. Profile completion, payment method added, social graph seeding. Each one is friction in the moment.

If your org has no way to count downstream value, the org systematically does not ship these features. The portfolio bends toward features whose value is visible inside a 14-day window. That is a poor proxy for what users actually care about.

Airbnb’s Future Incremental Value (FIV) is the most fully published implementation of the first family. For each of about 150 action events, Airbnb estimates the propensity that a given user takes the action, matches each user who took it to a “twin” who looks similar in propensity but did not take it, and reads the long-term outcome difference between the two groups as the action’s incremental value over a 1-year horizon. The propensity model uses around 1,000 control features pulled from their internal feature store.

Output is split into “Present” (immediate) and “Future” (residual downstream) value. PMs see what fraction of an action’s value is downstream rather than realized today. That UX choice does real work. It lets a team defending a slow-payoff feature point at the Future component instead of arguing in the abstract.

The math

Let’s walk through FIV first. For each user, estimate the probability they took the action of interest, given their pre-action covariates. This is the propensity score, learned across about 1,000 covariates from the feature store. Match each user who took the action to a “twin” with a similar propensity who did not. The action’s average effect on the treated is the difference in long-term outcomes between focal users and their matched complements.

The long-term outcomes here are observed, not predicted. The action happened in the past, you waited the full horizon, you read the actual metric. This is why FIV needs historical data and cannot score brand-new product surfaces.

Identification rests on three assumptions. Unconfoundedness: there are no unobserved confounders once you condition on the covariates. Common support: every region of the propensity is represented in both the focal and the complement populations. SUTVA: no spillovers between users. SUTVA fails in two-sided marketplaces, which is why Airbnb adds an explicit cannibalization haircut.

The practical breakpoint is overlap. If your propensity model has high AUC, the treated and untreated populations are too separable, the matched twins are not really twins, and the estimator becomes unstable. You actually want a propensity model that struggles at the boundaries.
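The matching step and the overlap breakpoint can be sketched in a few lines. This assumes propensity scores have already been estimated upstream, and it uses one-nearest-neighbor matching with replacement plus a caliper. Those choices, the field names, and the toy data are my assumptions for illustration, not Airbnb's published configuration.

```python
def matched_att(users, caliper=0.05):
    """Average effect on the treated via nearest-neighbor propensity matching.

    users: dicts with 'took_action' (bool), 'propensity' (float), and
    'outcome' (observed long-term metric). Focal users with no twin inside
    the caliper are dropped and counted as unmatched.
    """
    focal = [u for u in users if u["took_action"]]
    complement = [u for u in users if not u["took_action"]]
    diffs, unmatched = [], 0
    for f in focal:
        twin = min(complement, key=lambda c: abs(c["propensity"] - f["propensity"]))
        if abs(twin["propensity"] - f["propensity"]) <= caliper:
            diffs.append(f["outcome"] - twin["outcome"])
        else:
            unmatched += 1  # no real twin: an overlap failure
    att = sum(diffs) / len(diffs) if diffs else None
    return att, unmatched


# Hypothetical users. The 0.95-propensity user has no comparable non-taker.
users = [
    {"took_action": True, "propensity": 0.30, "outcome": 5.0},
    {"took_action": True, "propensity": 0.60, "outcome": 7.0},
    {"took_action": True, "propensity": 0.95, "outcome": 9.0},
    {"took_action": False, "propensity": 0.28, "outcome": 4.0},
    {"took_action": False, "propensity": 0.62, "outcome": 6.5},
    {"took_action": False, "propensity": 0.10, "outcome": 3.0},
]

att, unmatched = matched_att(users)
print(att, unmatched)  # 0.75 1
```

The unmatched count is the number to watch. When the propensity model separates the two groups too cleanly, that count climbs and the estimate ends up resting on a shrinking, unrepresentative slice of focal users.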

The PSM recipe: ~1000 pre-action covariates feed a propensity score, which splits users into focal and complement, matches them by similar propensity, then reads the difference in observed long-term outcomes as the action’s incremental value.

The surrogate index: predicting an experiment’s long-term effect

Long before Netflix put the recall problem on paper, Pinterest published a 1% holdout on a badging feature. The short-term A/B test showed +7% DAU. The year-long holdout settled at +2.5% DAU. The short-term test overstated the win by roughly 2.8x. That is not a Pinterest fact. That is a category fact about novelty inflation in self-serve consumer products. Carry it as the working order-of-magnitude assumption for any short-term readout you cannot validate against a long-term holdout.

The surrogate index is what you use when you want a long-term answer but cannot wait a year. It replaces the slow primary outcome with a model-predicted index built from short-run signals. Netflix’s surrogate-index validation paper walks through the method and validates it against 200 of their own A/B tests. They estimate, from historical data, a function that maps short-term proxies S and pre-treatment covariates X to a predicted long-term outcome. In a new experiment, they observe S during the test window, predict the long-term outcome per unit, and treat the difference between treatment and control on that prediction as their long-term effect estimate.

The identification assumption that does the heavy lifting is called surrogacy. It says S contains all the information about how treatment affects the long-term outcome. If treatment moves the long-term outcome through a channel that is not captured in S, the estimate is biased and you will not know it.

Netflix’s numbers on their own tests: 95% directional agreement, 65% recall against the day-63 ship decision. The recall gap is where senior DS should focus. Directional agreement covers most of your routine launches. The recall gap is concentrated on the launches with the largest upside, where the cost of a false negative is highest.
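It helps to see how a high agreement rate and a much lower recall can sit side by side. The sketch below uses made-up decisions and my own simplified definitions: agreement is the share of arms where the surrogate and the long-term readout reach the same ship or no-ship call, and recall is the share of true long-term ships the surrogate also called. The paper's exact definitions may differ.

```python
# Each arm: (surrogate_says_ship, long_term_says_ship). Hypothetical data,
# chosen so that most arms are no-ship under both readouts.
arms = (
    [(True, True)] * 2       # real wins the surrogate caught
    + [(False, True)] * 1    # real win the surrogate missed
    + [(False, False)] * 17  # flat or negative under both
)


def agreement(arms):
    """Share of arms where both readouts make the same call."""
    return sum(s == t for s, t in arms) / len(arms)


def recall(arms):
    """Of the arms the long-term readout would ship, the share the surrogate also ships."""
    true_ships = [(s, t) for s, t in arms if t]
    return sum(s for s, _ in true_ships) / len(true_ships)


print(agreement(arms))  # 0.95
print(round(recall(arms), 3))  # 0.667
```

Because most arms are no-ship either way, agreement is dominated by the easy cases. Recall only looks at the wins, which is why it is the number that decides what gets shipped.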

Booking.com’s KDD 2024 paper generalizes the proxy choice as a portfolio optimization problem over a corpus of past experiments. Treat each prior experiment as labeled data: a vector of short-term effects and a long-term effect. Learn the linear combination of short-term metrics that best predicts long-term effect for new experiments. The unsubtle practical insight: the optimal proxy varies with experiment sample size. There is no single “best company-wide proxy.” The published shorthand papers over a real cost.

The math

In a historical sample, fit a model that predicts the long-term outcome from a vector of short-run surrogate metrics plus pre-treatment covariates. In a new experiment where only the short-run surrogates are observed, predict the long-term outcome per user, then take the difference between treatment and control on that prediction.
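A stripped-down version of that recipe, assuming a single short-run surrogate, no pre-treatment covariates, and a plain least-squares line. Real systems use many surrogates and richer models, and every number and name below is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Historical users where both the short-run surrogate and the long-term
# outcome were observed (hypothetical values).
hist_surrogate = [1.0, 2.0, 3.0, 4.0]
hist_long_term = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(hist_surrogate, hist_long_term)

# New experiment: only the short-run surrogate is observed per user.
treatment_surrogate = [3.0, 4.0, 5.0]
control_surrogate = [2.0, 3.0, 4.0]


def predict(s):
    """Predicted long-term outcome from the surrogate alone."""
    return intercept + slope * s


# Long-term effect estimate: difference in mean predicted outcomes.
effect = mean(predict(s) for s in treatment_surrogate) - mean(
    predict(s) for s in control_surrogate
)
print(effect)
```

Note what the estimate cannot see: anything the treatment does to the long-term outcome that does not pass through the surrogate never enters `predict`, so it never enters the effect.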

Identification rests on Prentice surrogacy: the long-term outcome is independent of treatment once you condition on the surrogates. If treatment moves the long-term outcome through a channel the surrogates do not capture, the estimate is biased and you will not know it. There is also a comparability assumption: the relationship between surrogates and outcome must be the same in the historical and experimental samples.

Booking.com’s portfolio framing replaces the per-experiment estimation with a learned linear combination of short-run metrics, optimized across a corpus of past experiments.

The practical breakpoint is the validation result that started this post. 65% recall against the day-63 ship decision, even with a careful production system at Netflix’s scale. That is the cost of the assumption.

The surrogate index recipe: a model trained on past experiments where both short-run signals and long-term outcomes were observed, applied to a new test to predict the long-term treatment effect from short-run signals alone.

How they differ

The two methods sit in different places in your workflow, and the easiest way to keep them straight is to ask what counts as a row of training data.

For Airbnb FIV, a training row is a (user, took the action or not, observed long-term outcome) triple. The unit of analysis is the user-action pair. The output is a fixed table: each tracked action has an FIV score that says “this action is worth X on the north star metric over the long term.” A PM whose A/B test shows the feature drives 5% more recovery setups multiplies that 5% by the recovery-setup score to get a long-term value estimate. Real features usually move more than one action, so the same test might also show +2% on 2FA enrollment and -1% on profile completion. The feature’s total long-term value is the sum of (action lift × FIV score) across every action it moved, without waiting a year for the long-term outcome to land.
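The rollup is simple arithmetic once the score table exists. In this sketch the lifts match the example above, but the baseline action counts and per-action FIV scores are invented, and converting a relative lift into incremental actions via a baseline count is my assumption about units.

```python
# Relative lifts from the A/B test (from the example above).
lifts = {
    "recovery_setup": 0.05,
    "two_factor_enrollment": 0.02,
    "profile_completion": -0.01,
}

# Hypothetical baseline action counts in the test population.
baseline_actions = {
    "recovery_setup": 10_000,
    "two_factor_enrollment": 5_000,
    "profile_completion": 20_000,
}

# Hypothetical FIV scores: long-term north-star value per incremental action.
fiv_per_action = {
    "recovery_setup": 4.0,
    "two_factor_enrollment": 6.0,
    "profile_completion": 1.5,
}


def long_term_value(lifts, baseline_actions, fiv_per_action):
    """Sum of (incremental actions x FIV score) over every action the feature moved."""
    return sum(
        lifts[action] * baseline_actions[action] * fiv_per_action[action]
        for action in lifts
    )


total = long_term_value(lifts, baseline_actions, fiv_per_action)
print(round(total, 2))  # about 2300 with these illustrative inputs
```

The negative term matters. A feature that lifts one action while suppressing another can net out lower than the headline lift suggests.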

For the surrogate index, a training row is a past experiment (or a user inside one) with both a short-term effect vector and a long-term effect. The unit of analysis is the experiment. The output is per-experiment: this specific test predicts +1.2% on the long-term metric. You use it after a test ends, before you have day-63 data.

Horizons differ too. Airbnb publishes at 365 days. Netflix validates at 63 days. Most of that difference is business cycle, not method. Streaming engagement stabilizes inside two months because users return weekly. Travel is low-frequency and seasonal, so a 60-day window misses most of the signal. The smaller part is method economics: a surrogate index needs the long-term outcome to materialize on a cadence that lets you retrain, which biases the family toward shorter horizons even when the business case might want longer. PSM does not have this problem, which is part of why FIV-style work tends to publish at longer horizons than surrogate-index work. If Netflix’s NSM stabilized at 365 days they would still pick a shorter surrogate horizon for retraining sanity. If Airbnb’s NSM stabilized at 60 days they would happily run FIV at 60.

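The two methods look different, and the math is genuinely different, but each rests on a claim the data cannot check. For FIV, the claim that does the heavy lifting is that nothing unobserved drives both the action and the outcome. For the surrogate index, it is that the short-run signals capture every channel through which treatment moves the long-term outcome. Neither can be verified from inside the data the model was fit on.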

The honest version is not “our model agrees with truth 95% of the time.” It is “our model agrees with truth 95% of the time on the experiments and the time horizons where we could check, and we cannot test whether this generalizes.”

Where these methods fall apart

It does not work for new features.

The historical data the surrogate index needs to validate against does not exist for a brand new product surface. The honest answer when leadership asks for the future value of a feature you launched last month is “we cannot estimate it for another quarter.” Some leaders accept this. Others do not.

Eligibility-gated features are hard.

If only some users qualify, due to subscription tier, region, or device class, matching has to happen within eligibility strata. The number of models you maintain explodes and the matched cohorts get small fast.

It needs population scale.

PSM and surrogate models both need enough observations to estimate stable propensities and stable proxy-to-outcome relationships. Airbnb explicitly says their host-level FIV is underdeveloped because there are not enough hosts. If you are at a sub-million MAU consumer product or B2B SaaS, the method may be infeasible.

Two-sided marketplaces have a double-counting problem.

A guest action and a listing action can both legitimately claim credit for the same booking. Airbnb’s Platform FIV is their attempt to net this out. It is not solved.

Maintenance is expensive, and the cost is mostly human rather than technical.

The model needs ongoing feature engineering as new product surfaces ship, but that is the easy part. I have watched the DS who built one of these systems end up spending more of their time onboarding new contributors and walking stakeholders through how to use, and how not to misuse, the framework than on the model itself.

The model can also become political.

PMs who like a feature might trust the model when it agrees with them and challenge it when it does not. Teams whose features get killed by a future-value verdict might request re-runs, alternative specifications, sensitivity analyses. This is healthy in moderation and corrosive at scale. Maintaining trust in the model is a real cost on top of maintaining the model itself.

A simplified proxy is gameable in the Goodhart sense.

Gibson Biddle’s hidden-800-number example is the evergreen case study: a PM at a streaming company decides “contacts per 1000 streams” should go down, hides the customer support phone number, and the metric improves while the customer experience degrades. Once the proxy becomes the target, the proxy stops measuring what it was designed to measure.

The honest takeaway

Long-term value measurement is a complement to clean experimentation, not a substitute for it. Short-term A/B tests are still the right tool for short-term questions, guardrail metrics still matter, pre-registered hypotheses still help. The future value layer sits on top and answers questions experiments cannot.


The features it surfaces are real. Account recovery setups, latency improvements, fraud reporting flows, the unsexy infrastructure work whose value is invisible inside a 14-day window. These features ship in companies that have a long-term value layer and quietly do not ship in companies that do not. Over years, the gap compounds.

The cost is also real. A serious program needs population scale, ongoing data science investment, leadership trust, stakeholder patience, and a long-term holdout you keep funded. Most companies should not build the full platform first. The companies that should build one and do not are the ones leaving the most value on the table.

If your org is in that second category, the move is not to build the full machinery. The move is to start with one specific action where the case is obvious and the data is available. Build a small surrogate or PSM model for that one action. Validate it once against a long-term holdout. Show the result. If the model holds and the result lands, the second model is easier to fund. After about ten such actions, you have a program.

That is roughly how every company in this post got there. Nobody built the platform first. And nobody published a recall number that suggested their model was anywhere close to perfect. The 65% number is not a Netflix problem. It is a structural cost of operating under an assumption you cannot test. The mature move is to plan for it instead of hoping it does not apply to you.
