
Understanding Hewitt et al. and Using Nalanda Today
Source: vignettes/understanding-hewitt-et-al.Rmd

Overview
This vignette is for users who want a practical introduction to the
simulation approach in Hewitt, Ashokkumar, Ghezae, and Willer (Hewitt et al. 2024a, 2024b), and who want to
understand what the nalanda package can already do in that
spirit.
The short version is:
- the papers show that large language models can often predict the direction and relative size of social science experimental effects surprisingly well;
- this works best for text-based survey experiments;
- the method uses careful prompting, demographic conditioning, and averaging over multiple prompts;
raw model-predicted effects were too large on average in their main archive, so the authors recommend shrinking effect estimates by about 0.56 in that specific setting (Hewitt et al. 2024b);
- nalanda already supports several related simulation workflows, but users still need to do some parts manually.
What the papers did, in simple language
The central question of the papers is whether a language model can be used to simulate how people would respond in real social science experiments (Hewitt et al. 2024a). Rather than asking the model to guess an effect size directly, the authors prompted the model to act like many hypothetical survey participants, each exposed to a study condition and then asked the outcome question.
The broad workflow was:
- describe the study setting,
- describe a hypothetical participant,
- show the treatment text,
- ask the outcome question on the original response scale,
- repeat this many times across conditions, people, and prompt variants, and
- compare average responses across conditions.
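Conceptually, that loop can be sketched in a few lines of base R. Everything below is illustrative: ask_model() is a hypothetical stand-in for a real model call (mocked here so the sketch runs), and the prompt texts are invented.

```r
# Mock model call: replace with a real API request in practice.
ask_model <- function(prompt) sample(0:100, 1)

# Assemble a prompt from the parts described above.
build_prompt <- function(setting, participant, stimulus, question) {
  paste(setting, participant, stimulus, question, sep = "\n\n")
}

setting     <- "You are taking part in a short opinion study."
participant <- "You are a 45-year-old college-educated moderate."
question    <- "On a scale from 0 to 100, how much do you support this policy?"
conditions  <- list(control = "Neutral text.", treatment = "Persuasive text.")

# Repeat per condition and compare average responses.
means <- sapply(conditions, function(stimulus) {
  mean(replicate(20, ask_model(build_prompt(setting, participant, stimulus, question))))
})
means["treatment"] - means["control"]  # simulated effect on the original scale
```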
The main finding was that model-based predictions were strongly correlated with the real treatment effects in their primary archive of U.S. survey experiments (Hewitt et al. 2024a). In plain terms, the model was often good at telling which interventions would work better than others, even if it was not perfect at recovering the exact numeric size of the effects.
What the supplement adds
The supplement is especially useful because it clarifies what parts of the procedure mattered most in practice (Hewitt et al. 2024b).
1. Prompting strategy matters
The prompts were not just a bare stimulus plus question. They included:
- an introductory framing sentence,
- a short description of the study context,
- a description of the hypothetical participant,
- the experimental stimulus, and
- the outcome question and scale.
This matters because users trying to apply the same general approach should not assume that any single ad hoc prompt will behave like the published method.
2. Ensemble prompting matters
The supplement reports that averaging over more prompt variants improved accuracy (Hewitt et al. 2024b). In practice, this means that, whenever cost permits, users should average over several prompt wordings rather than treating any single wording as decisive.
3. Demographic conditioning is part of the method
The main paper describes prompting the model with specific participant profiles, including fields such as gender, age, race, education, ideology, and party (Hewitt et al. 2024a). The supplement suggests that matched demographic profiles gave only small or modest gains in some subgroup analyses, but they are still part of the paper’s design (Hewitt et al. 2024b).
4. Absolute effect sizes need caution
The model was good at predicting relative effects, but its raw effect
estimates were too large on average in the primary archive. The
supplement reports a shrinkage coefficient of about 0.56,
meaning that raw predicted effects in that setting should be multiplied
by 0.56 to improve absolute calibration (Hewitt et al. 2024b).
This is one of the most important practical take-aways from the paper.
Recommended take-aways for users
If you want to use these papers as a guide, the safest practical lessons are:
- focus first on relative comparisons and ranking, not only exact effect-size recovery;
- use multiple prompt variants rather than relying on one wording;
- be explicit about who is being simulated;
- preserve the original response scale in the prompt;
- keep raw outputs and calibrated outputs separate;
- treat the 0.56 factor as a useful paper-specific calibration, not as a universal law for every possible application.
How this relates to nalanda
nalanda was not originally built as a line-by-line
reproduction of Hewitt et al. It was built around chapter-based
simulations for questions such as whether books or book chapters shift
attitudes. Even so, there is substantial overlap in spirit and
implementation.
At the moment, nalanda already supports:
- Pre/post chapter simulations with run_ai_on_chapters().
- Post-only one-turn simulations with run_ai_on_chapters_one_turn().
- Custom prompt sequences with simulate_treatment().
- Summary functions for package-native outputs, such as compute_run_ai_metrics() and compute_run_ai_metrics_one_turn().
What nalanda does not yet fully provide out of the
box:
- a built-in prompt bank matching the paper’s exact strategy,
- a first-class ensemble prompting object,
- a first-class demographic profile object,
- a dedicated condition-based experimental wrapper modeled directly on the Hewitt et al. survey workflow,
- built-in between-condition inferential contrasts, and
- a standard calibration helper for externally estimated effect columns.
A concrete way to use nalanda today
Below is a practical way to use the package while staying close to the lessons from Hewitt et al.
Scenario 1: Pre/post chapter simulations
This is the most native current workflow in nalanda.
library(nalanda)
res <- run_ai_on_chapters(
book_texts = my_book_texts,
groups = c("Democrat", "Republican"),
context_text = "You are simulating an American adult who politically identifies as a {identity}.",
question_text = "On a scale from 0 to 100, how warmly do you feel towards {group}s?",
n_simulations = 20,
temperature = 0,
model = "gemini-2.5-flash-lite"
)
chapter_metrics <- compute_run_ai_metrics(res)

This gives you package-native pre/post summaries such as deltas and gap changes. That is already useful for understanding whether a chapter seems to shift the simulated participant before versus after exposure.
Scenario 2: Post-only simulations
If you want something closer to the paper’s single-prompt logic, use the one-turn interface.
res_one_turn <- run_ai_on_chapters_one_turn(
book_texts = my_book_texts,
groups = c("Democrat", "Republican"),
context_text = "You are simulating an American adult who politically identifies as a {identity}.",
question_text = "On a scale from 0 to 100, how warmly do you feel towards {group}s?",
n_simulations = 20,
temperature = 0,
model = "gemini-2.5-flash-lite"
)
one_turn_metrics <- compute_run_ai_metrics_one_turn(res_one_turn)

This is not identical to the Hewitt et al. survey design, but it is closer to a post-only prompt structure where the chapter serves as the intervention text.
Scenario 3: Control-versus-treatment chapter comparisons
Some users will have treatment books and control books.
nalanda can already help generate the simulated outcomes
for each condition, even if the formal between-condition comparison
happens elsewhere.
One practical workflow is:
- run the simulation separately or jointly on all chapters or books,
- attach a condition label such as control, treatment_a, or treatment_b,
- compute the package-native summaries,
- pass the resulting data frame to your preferred downstream tool for mean comparisons, contrasts, regression, or meta-analytic summaries.
Conceptually:
chapter_metrics$condition <- c("control", "treatment", "treatment")
# then analyze in your preferred downstream workflow
# for example with dplyr summaries, rempsyc helpers, or regression models

This is an important design point: nalanda can own the simulation workflow without needing to own every inferential comparison.
What users should currently do themselves
At this stage, users wanting to approximate the Hewitt et al. method more closely should handle several parts themselves.
1. Build or manage a prompt bank manually
Right now, users still need to manage alternative prompt wordings in their own scripts. A good practical habit is to write down:
- multiple introductory variants,
- the study-setting text,
- the participant-profile text,
- the intervention text wrapper, and
- the outcome question.
Even if these are stored as plain character vectors in a script, that is better than repeatedly editing one long prompt string by hand.
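For instance, a small prompt bank can live in a script as plain character vectors. All names and texts below are illustrative, not a nalanda API.

```r
# Several introductory framings, to be averaged over later.
intro_variants <- c(
  "You are simulating a participant in a U.S. survey experiment.",
  "Imagine you are an American adult answering a short survey.",
  "Respond as a typical U.S. survey respondent would."
)

# The remaining fixed pieces of the prompt.
setting_text  <- "This short study examines attitudes toward political outgroups."
profile_text  <- "You politically identify as a {identity}."
wrap_stimulus <- function(x) paste("You just read the following passage:", x, sep = "\n\n")
outcome_text  <- "On a scale from 0 to 100, how warmly do you feel towards {group}s?"
```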
2. Run ensembles manually
If you want to average over several prompt variants, you currently need to do this by running several simulations and combining them yourself. This is important if you want to remain closer to the published workflow.
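One manual pattern is to rerun the same design under several context wordings and pool the package-native summaries. The variant texts below are invented, and the aggregation column (delta) is an assumption; adapt it to the columns compute_run_ai_metrics() actually returns in your version.

```r
context_variants <- c(
  "You are simulating an American adult who politically identifies as a {identity}.",
  "Answer as an American adult whose political identity is {identity}.",
  "Imagine you are a U.S. adult identifying politically as a {identity}."
)

# One full run per context wording, summarized with the package-native helper.
runs <- lapply(context_variants, function(ctx) {
  res <- run_ai_on_chapters(
    book_texts    = my_book_texts,
    groups        = c("Democrat", "Republican"),
    context_text  = ctx,
    question_text = "On a scale from 0 to 100, how warmly do you feel towards {group}s?",
    n_simulations = 20,
    temperature   = 0,
    model         = "gemini-2.5-flash-lite"
  )
  compute_run_ai_metrics(res)
})

# Pool the runs, then average a metric of interest across variants, e.g.:
ensemble <- do.call(rbind, runs)
# aggregate(delta ~ chapter, data = ensemble, FUN = mean)
```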
3. Manage demographic profiles manually
Users who want more than identity-only conditioning should currently define their own participant profiles in a data frame or list and loop over them.
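A minimal base-R pattern is to store profiles in a data frame and build one context string per row; the fields echo those listed in the paper, but the specific values and wording here are invented.

```r
# Illustrative participant profiles.
profiles <- data.frame(
  age      = c(24, 51, 67),
  gender   = c("woman", "man", "woman"),
  ideology = c("liberal", "moderate", "conservative"),
  stringsAsFactors = FALSE
)

# One context string per profile; pass each element as context_text
# in your simulation calls and loop over them.
contexts <- sprintf(
  "You are simulating a %d-year-old American %s who is politically %s.",
  profiles$age, profiles$gender, profiles$ideology
)
```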
4. Estimate between-condition contrasts downstream
If your design involves treatment versus control comparisons,
nalanda can help produce the simulated outcomes, but you
will currently need to estimate contrasts yourself using your preferred
analysis workflow.
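As one sketch of that downstream step, suppose you have assembled a data frame with one simulated outcome per run and a condition label attached (the numbers below are made up):

```r
sim <- data.frame(
  outcome   = c(62, 58, 65, 71, 74, 69),
  condition = c("control", "control", "control",
                "treatment", "treatment", "treatment")
)

t.test(outcome ~ condition, data = sim)       # simple two-group comparison
summary(lm(outcome ~ condition, data = sim))  # the same contrast as a regression
```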
5. Apply effect calibration explicitly
If you estimate an effect column downstream and want to apply the paper’s primary-archive calibration, you should currently do so yourself and document it clearly.
For example:
results$adjusted_effect <- results$raw_effect * 0.56
results$calibration_source <- "Hewitt et al. 2024 primary survey archive"

How to think about the 0.56 factor
This is the single most tempting number to overgeneralize from the papers, so it is worth being explicit.
When it is reasonable to use it
Using 0.56 is most defensible when:
- your application is fairly close to the paper’s primary setting,
- your outcome is a text-based survey-style response,
- you are estimating condition differences on the original response scale, and
- you want a rough calibration of absolute effect magnitudes.
When to be cautious
You should be more cautious when:
- your design is very different from their survey archive,
- your intervention is a long book chapter rather than a short survey treatment,
- your outcome is behavioral or cumulative rather than a direct survey item,
- your analysis focuses on subgroup heterogeneity, or
- you are working outside the kind of U.S. survey setting used in the paper.
A simple recommended workflow for users
If you want a practical, cautious workflow inspired by Hewitt et al., the following is a reasonable current recipe.
Minimal workflow
- choose a clear outcome question and scale;
- decide whether your design is pre/post or post-only;
- define at least a small set of prompt variants;
- run multiple simulations per chapter or condition;
- compute package-native summary metrics in nalanda;
- if needed, estimate between-condition comparisons downstream;
- optionally create a separate calibrated version of any effect column.
Better workflow
If cost and time allow, improve on the minimal workflow by:
- using several prompt variants rather than one,
- simulating several participant profiles rather than one generic identity,
- checking whether conclusions are stable across prompts and profiles,
- comparing raw and adjusted effects side by side, and
- treating conclusions as stronger when rank-order patterns are robust.
What nalanda may support later
Planned or plausible future extensions include:
- built-in prompt bank objects,
- first-class ensemble controls,
- first-class demographic profile objects,
- a condition-based experimental simulation wrapper,
- small calibration helpers, and
- cumulative chapter simulation designs.
These features would make it easier for users to get closer to the Hewitt et al. workflow without writing as much scaffolding themselves.
Final practical advice
For most current users, the best way to understand the Hewitt et al. papers is to think of them as a guide to disciplined simulation rather than as a recipe that transfers mechanically to every design.
The main practical lessons are:
- prompt carefully,
- average across prompts when possible,
- be explicit about who is being simulated,
- distinguish simulation from inference,
- keep raw and adjusted outputs separate, and
- use the 0.56 correction cautiously and transparently.