
Roadmap for Paper-Faithful Simulation Workflows
Source: vignettes/simulation-roadmap.Rmd
Purpose
This vignette is a planning note for collaborators. It records where
the nalanda package currently stands, how the package could
be extended toward the simulation strategy described by Hewitt,
Ashokkumar, Ghezae, and Willer (Hewitt et al.
2024a, 2024b), and which implementation steps seem most important
for the next phase of work.
The immediate aim is not to claim that nalanda already
reproduces the design of these papers. Rather, the goal is to identify a
realistic path for building a user-facing workflow that supports:
- paper-faithful simulation of survey experiments,
- the existing pre/post chapter workflow already implemented in nalanda,
- control-versus-treatment chapter comparisons when some books act as control conditions, and
- future extensions for cumulative reading designs across multiple chapters.
Current package status
At present, nalanda already supports several useful
pieces of the broader simulation agenda:
- A two-turn pre/post workflow for chapter interventions via run_ai_on_chapters(), where an identity-conditioned baseline is collected before exposure to a chapter and a post-reading measure is collected after exposure.
- A one-turn workflow via run_ai_on_chapters_one_turn(), where identity context, chapter text, and the outcome question are presented in a single prompt.
- A prompt-first multi-turn interface via simulate_treatment(), which is flexible enough to support more customized simulation sequences.
- Summary helpers that keep raw model output separate from derived metrics.
This means the package already contains the core execution machinery needed for prompt construction, repeated simulation, structured extraction, and summary pipelines. The main gap is not basic infrastructure. The gap is a paper-faithful experimental abstraction layer.
It is also useful to separate simulation design from statistical
analysis. nalanda already supports workflows that can later
be used for group comparisons, including cases where some books act as
control conditions. That does not mean nalanda itself needs
to become the main home for inferential contrast estimation or
hypothesis testing. Those tasks may still belong downstream in other
tools, including rempsyc.
What the papers add
The Hewitt et al. workflow differs from the current chapter workflow in a few important respects (Hewitt et al. 2024a, 2024b):
- The main design is condition-based rather than pre/post. The model simulates responses to each experimental condition, with group comparisons then performed downstream.
- Prompts are built from a bank of introductory variants rather than a single fixed wording.
- Simulations include demographically described participants rather than only broad identity labels.
- Predictions are averaged over many prompts as an ensemble strategy.
- Raw predicted treatment effects are useful for ranking conditions, but absolute effect magnitudes appear to benefit from linear calibration. In the primary survey archive, the paper estimates a shrinkage factor of approximately 0.56 (Hewitt et al. 2024b).
Taken together, these papers suggest that nalanda should
support at least three closely related simulation families:
- condition-labeled post-only simulations, where outputs are ready for later group comparisons but inferential contrasts can remain downstream;
- baseline -> exposure -> post outcome simulations, where within-unit change metrics are native to the package; and
- cumulative exposure simulations, where multiple chapters or interventions are allowed to build on one another over time.
These are simulation families, not analysis families. The package does not need to choose only one family, and it does not need to absorb every downstream statistical task. The stronger design goal is to share infrastructure across them.
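As a minimal sketch of the shared infrastructure, the condition-labeled post-only family could produce output shaped roughly like the frame below. The column names (run_id, condition, prompt_variant, raw_response) are illustrative assumptions, not the current nalanda schema:

```r
# Hypothetical layout for condition-labeled post-only output; columns are
# illustrative, not existing nalanda output names.
condition_results <- data.frame(
  run_id         = 1:4,
  condition      = c("control", "control", "treatment", "treatment"),
  prompt_variant = c(1L, 2L, 1L, 2L),
  raw_response   = c(3, 4, 5, 6)
)

# Group comparisons stay downstream; a condition-level mean is all that is
# computed here.
condition_means <- aggregate(raw_response ~ condition, condition_results, mean)
```

Keeping the output in this long, condition-labeled form is what lets inferential contrasts remain a downstream concern.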
Recommended implementation steps
The table below reflects my current view of the most useful staged plan. Impact scores are on a 1 to 10 scale, where 10 means the step is especially important for scientific usefulness and for alignment with the published papers.
| Step | Description | Difficulty | Impact |
|---|---|---|---|
| 1 | Build a paper-faithful prompt layer, including reusable prompt bank objects and prompt constructors for survey experiments and book-based designs. | Low to medium | 10 |
| 2 | Add a condition-based simulation wrapper that runs control and treatment conditions, stores condition labels and simulation metadata, and returns outputs ready for downstream comparison. | Medium to high | 10 |
| 3 | Add a descriptive summary and calibration layer for package-native metrics, while leaving formal inferential contrasts to downstream tools. | Medium | 8 |
| 4 | Add demographic profile infrastructure, including profile samplers and weighted profile sets, so users can simulate subgroup-specific or population-matched runs. | Medium | 7 |
| 5 | Add ensemble controls that formalize how many prompt variants are used, how they are sampled, and how outputs are pooled. | Low to medium | 8 |
| 6 | Extend the framework to cumulative chapter designs, where earlier chapters can remain in memory or be summarized forward into later prompts. | High | 8 |
I would still group these into three practical phases:
- Phase 1: prompt layer plus ensemble controls.
- Phase 2: condition-based experimental wrapper plus descriptive summaries and calibration helpers.
- Phase 3: richer demographic sampling and cumulative exposure designs.
Why these steps matter
Step 1: Prompt layer
This is the highest-leverage near-term step because it creates a common language for all downstream workflows. The supplement describes a structured prompting strategy with an introductory sentence, a study-setting description, participant information, treatment content, and the outcome question (Hewitt et al. 2024b). That same structure can be reused for chapter simulations even when the design is not identical.
For the current package, this would let us move away from hard-coding prompts in user scripts and toward explicit prompt templates that are inspectable, versionable, and easier to document.
Step 2: Condition-based simulation wrapper
This is the step that would bring nalanda closest to the
paper’s core design. The current one-turn workflow already has much of
the required mechanics, but it is organized around books and chapters
rather than experimental conditions. A dedicated wrapper should make
conditions, control groups, and outcomes first-class objects. The
resulting outputs can then be handed off to downstream tools for mean
comparisons, contrasts, or other inferential analyses when needed.
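A rough sketch of that boundary, assuming a hypothetical wrapper name (simulate_conditions() is not existing nalanda API) and a stand-in model_fn in place of the real model call:

```r
# Hypothetical condition-based simulation wrapper: runs each named
# condition n_per_condition times and returns condition-labeled rows.
simulate_conditions <- function(conditions, n_per_condition, model_fn) {
  runs <- expand.grid(
    condition = names(conditions),
    rep = seq_len(n_per_condition),
    stringsAsFactors = FALSE
  )
  runs$stimulus <- unlist(conditions[runs$condition])
  runs$raw_response <- vapply(runs$stimulus, model_fn, numeric(1))
  runs
}

# Toy stand-in for a model call, keyed on stimulus length.
toy_model <- function(stimulus) as.numeric(nchar(stimulus) %% 7)

out <- simulate_conditions(
  conditions = list(control = "No reading.", treatment = "Chapter 1 text."),
  n_per_condition = 2,
  model_fn = toy_model
)
```

The point of the sketch is the return shape: conditions, not books, are the first-class unit, and the rows are ready for downstream comparison.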
Step 3: Descriptive summaries and calibration
This step should be narrower than a full contrast-analysis framework.
The role of nalanda here is to produce package-native
summaries that are useful for inspection, plotting, and workflow
handoff. For pre/post designs, that includes metrics such as within-unit
deltas. For post-only condition-labeled designs, that includes
condition-level summaries and calibration helpers. The papers show
strong rank-order prediction, but they also argue that absolute effect
magnitudes are systematically overstated without calibration (Hewitt et al. 2024a, 2024b). That makes
calibration worth supporting, even if inferential testing remains
outside the package.
Step 4: Demographic profile infrastructure
This matters, but I do not think it should block the earlier steps. The subgroup analysis in the supplement suggests that matched demographic prompts gave only small or no predictive advantages for gender and ethnicity, with somewhat more benefit for party (Hewitt et al. 2024b). That makes demographic conditioning important, but not the first dependency.
Step 5: Ensemble controls
The supplement explicitly reports that predictive accuracy improved as the number of prompts in the ensemble increased (Hewitt et al. 2024b). For that reason, ensemble prompting should not remain an implicit user choice. It should be represented as a documented object or argument in the package API.
Step 6: Cumulative chapter designs
This step is especially relevant for the book project, even though it goes beyond the paper’s main experimental setup. The existing pre/post framework is a natural base for cumulative designs because it already keeps the logic of before/after change separate from prompt execution. The hard part is deciding how accumulated reading history should enter later prompts.
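One way to sketch the "summarized forward" option, assuming hypothetical names (carry_forward(), summarise_fn) and a toy summarizer in place of a real summarization step:

```r
# Hypothetical cumulative-exposure sketch: each chapter's text is
# compressed into an accumulated summary that enters the next prompt.
carry_forward <- function(chapters, summarise_fn) {
  state <- character(0)
  prompts <- character(length(chapters))
  for (i in seq_along(chapters)) {
    history <- if (length(state)) paste(state, collapse = " ") else "None yet."
    prompts[i] <- paste0("Previously read: ", history,
                         "\nNow read: ", chapters[i])
    state <- c(state, summarise_fn(chapters[i]))
  }
  prompts
}

# Toy summarizer: keep the first five words of each chapter.
toy_summarise <- function(x) paste(head(strsplit(x, " ")[[1]], 5), collapse = " ")

prompts <- carry_forward(
  chapters = c("Chapter one is about trust.", "Chapter two builds on trust."),
  summarise_fn = toy_summarise
)
```

Whether history enters as full text, a summary state, or persistent model memory is exactly the design decision this step would need to settle.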
What can be applied directly to the existing chapter workflow?
Even though the paper’s design is not identical to the chapter workflow, several ideas transfer well.
Transferable immediately
- Prompt standardization. The package would benefit from prompt templates that separate intro text, study framing, identity or profile information, chapter text, and outcome questions.
- Prompt ensembles. Rather than relying on one canonical chapter prompt, we could average over several introductory phrasings or framing variants.
- Optional richer participant context. The current identity-based context could be extended to richer demographic profiles, especially in cases where subgroup interpretation matters.
- Separation of raw and calibrated results. The package already tends to preserve raw outputs and compute metrics downstream. That is a good fit for calibration as well.
Transferable with design adaptation
- Condition-ready chapter outputs. A chapter can be treated as a treatment condition and stored alongside a no-reading control, placebo chapter, or alternative chapter, with the resulting outputs passed downstream for comparison.
- Megastudy-style ranking. Sets of chapters or chapter framings could be compared as candidate interventions, much as the papers compare many treatments within one study.
- Cumulative exposure. Later chapters could be modeled as interventions delivered after earlier ones, with prior material either preserved in memory or compressed into an accumulated summary state.
Less transferable without stronger validation
- The exact 0.56 calibration factor. This number was estimated for the paper’s primary archive of U.S. survey experiments and should not be assumed to transfer automatically to chapter-level reading interventions.
- Claims about subgroup benefits from demographic matching. The paper’s subgroup findings are informative, but chapter interventions may produce different patterns of heterogeneity.
Where should calibration happen?
My current recommendation is:
- keep raw simulated responses unchanged,
- compute only package-native descriptive summaries in nalanda,
- leave formal group comparisons and inferential contrasts to downstream tools, and
- optionally add calibration helpers that work on summary outputs.
In practice, that means calibration should not be applied inside the low-level simulation functions themselves.
Why not pre-adjust inside simulation functions?
This would make the raw model output harder to inspect, harder to compare across calibration schemes, and harder to validate later. It would also blur the line between model execution and statistical post-processing.
Why not leave calibration entirely to user scripts?
That is flexible, but it is easy to do inconsistently. If calibration is part of the recommended workflow, the package should provide a standard path for it.
Recommended compromise
For metrics that nalanda already owns conceptually,
summary functions should be able to compute both raw and optional
adjusted outputs. For example, if a summary function computes
delta_outgroup, a calibrated variant could appear as
adjusted_delta_outgroup when a calibration factor is
supplied.
This has several advantages:
- raw outputs remain accessible,
- calibration remains explicit,
- multiple calibration schemes can coexist,
- user scripts remain simpler and less error-prone.
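A minimal sketch of that compromise, using the delta_outgroup naming from above; summarise_prepost() is an assumed function name, not existing nalanda API:

```r
# Hypothetical pre/post summary that reports the raw delta and, only when
# a calibration factor is supplied, an explicitly adjusted variant.
summarise_prepost <- function(pre, post, calibration = NULL) {
  out <- data.frame(delta_outgroup = post - pre)
  if (!is.null(calibration)) {
    out$adjusted_delta_outgroup <- out$delta_outgroup * calibration
  }
  out
}

raw_only <- summarise_prepost(pre = c(3, 4), post = c(5, 5))
with_adj <- summarise_prepost(pre = c(3, 4), post = c(5, 5), calibration = 0.56)
```

Because the adjusted column only appears when a factor is supplied, raw and calibrated results can never be silently conflated.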
For future condition-based workflows, a better boundary may be to
provide a small helper that adjusts already-computed effect columns,
regardless of where those effects were estimated. In other words,
nalanda does not need to own contrast estimation in order
to support calibration.
Such a helper could work on a user-supplied column and append:
- adjusted_effect
- calibration_factor
- calibration_source
The package default should probably be
calibration = NULL, with named presets available for known
settings. A preset corresponding to the Hewitt et al. primary archive
could reasonably use 0.56, but that should be framed as a
setting-specific option rather than a universal package default.
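A sketch of such a helper, under stated assumptions: calibrate_effects() and the preset name hewitt_primary_archive are illustrative, and the 0.56 value is the setting-specific estimate discussed above:

```r
# Hypothetical calibration helper for already-computed effect columns.
# Defaults to calibration = NULL (no adjustment); named presets cover
# known settings.
calibrate_effects <- function(data, effect_col,
                              calibration = NULL,
                              presets = c(hewitt_primary_archive = 0.56)) {
  if (is.null(calibration)) return(data)  # default: leave effects untouched
  if (is.character(calibration)) {
    source <- calibration
    calibration <- presets[[calibration]]
  } else {
    source <- "user_supplied"
  }
  data$adjusted_effect    <- data[[effect_col]] * calibration
  data$calibration_factor <- calibration
  data$calibration_source <- source
  data
}

effects <- data.frame(condition = c("a", "b"), effect = c(0.40, 0.10))
calibrated <- calibrate_effects(effects, "effect",
                                calibration = "hewitt_primary_archive")
```

The helper is agnostic about where the effect column came from, which is the point: nalanda supports calibration without owning contrast estimation.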
Proposed object designs
prompt_bank
The prompt_bank object would formalize the reusable
prompt pieces that are currently spread across ad hoc strings in
scripts. Conceptually, it should be a named list or tibble-backed object
with a small, inspectable schema.
At minimum, a prompt_bank should contain:
- intro_variants: short opening instructions or framing sentences.
- setting_template: the general study description, such as survey context or reading-task context.
- profile_template: a template for participant description, including placeholders for identity or demographic fields.
- stimulus_template: a wrapper describing how the treatment text or chapter text is introduced.
- outcome_template: the question and response-scale wording.
- scenario: a label such as "survey_experiment", "book_prepost", or "book_cumulative".
- metadata: version, source paper, and notes.
In practice, one useful design would be:
prompt_bank <- list(
intro_variants = c(
"You will be asked to predict how people respond to various messages.",
"Can reading a message affect people's attitudes and actions?"
),
setting_template =
"Social scientists often conduct research studies using online surveys. The text below is from one such survey conducted on a large, diverse population of research participants.",
profile_template =
"Participant X is a {ideology}, {age}, {ethnicity}, {gender} participant with {education}. Politically, Participant X identifies as '{party}'.",
stimulus_template =
"Please read the material below. {stimulus_text}",
outcome_template =
"{outcome_text} Please choose a number from {scale_low} to {scale_high}.",
scenario = "survey_experiment",
metadata = list(source = "hewitt_ashokkumar_2024")
)

For book workflows, a related prompt bank might swap in a
chapter-specific setting_template while keeping the same
overall structure.
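To show how such a bank could be consumed, here is a sketch of a prompt constructor; build_prompt() and fill_template() are illustrative names rather than existing nalanda functions, and placeholder substitution uses the {field} convention from the templates above:

```r
# Hypothetical helpers that assemble one prompt from a prompt_bank-style
# list. fill_template() replaces {field} placeholders with supplied values.
fill_template <- function(template, fields) {
  for (nm in names(fields)) {
    template <- gsub(paste0("{", nm, "}"), fields[[nm]], template, fixed = TRUE)
  }
  template
}

build_prompt <- function(bank, intro_index, profile, stimulus_text, outcome) {
  paste(
    bank$intro_variants[intro_index],
    bank$setting_template,
    fill_template(bank$profile_template, profile),
    fill_template(bank$stimulus_template, list(stimulus_text = stimulus_text)),
    fill_template(bank$outcome_template, outcome),
    sep = "\n\n"
  )
}

# A reduced bank, just enough to exercise the constructor.
bank <- list(
  intro_variants    = c("You will predict how people respond."),
  setting_template  = "The text below is from an online survey.",
  profile_template  = "Participant X identifies as '{party}'.",
  stimulus_template = "Please read: {stimulus_text}",
  outcome_template  = "{outcome_text} Choose {scale_low} to {scale_high}."
)
prompt <- build_prompt(
  bank, intro_index = 1,
  profile = list(party = "Lean Democrat"),
  stimulus_text = "Chapter 1 text.",
  outcome = list(outcome_text = "How warm do you feel?",
                 scale_low = "1", scale_high = "7")
)
```

Because each piece is filled separately, the same constructor can serve survey-experiment and chapter scenarios by swapping templates rather than rewriting prompts.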
ensemble_size
I am imagining ensemble_size as more than a bare
integer, even if the user API initially accepts an integer. Internally,
it would be useful to represent the ensemble settings as a small control
object.
At minimum, this object should capture:
- n: number of prompt variants to use per condition or chapter.
- method: whether prompts are sampled randomly, cycled deterministically, or exhaustively enumerated.
- replace: whether prompt variants may repeat.
- weights: optional prompt weights if some variants are meant to count more.
- pooling: whether outputs are averaged at the response level, condition mean level, or effect level.
- seed: a seed strategy for reproducibility.
Conceptually:
ensemble_size <- list(
n = 8L,
method = "sample",
replace = TRUE,
weights = NULL,
pooling = "effect",
seed = 42L
)

For an early implementation, it would be enough to let users pass
ensemble_size = 1, 4, or 8, while
storing the richer object internally. That would keep the public API
simple but leave room to expand.
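The integer-to-object normalization could look like the sketch below; as_ensemble() is an assumed name, and the defaults mirror the control object above:

```r
# Hypothetical normalizer: a bare integer from the public API is expanded
# into the richer internal ensemble control object; a full list passes
# through after light validation.
as_ensemble <- function(ensemble_size) {
  if (is.numeric(ensemble_size) && length(ensemble_size) == 1) {
    ensemble_size <- list(
      n       = as.integer(ensemble_size),
      method  = "sample",
      replace = TRUE,
      weights = NULL,
      pooling = "effect",
      seed    = NULL
    )
  }
  stopifnot(is.list(ensemble_size), ensemble_size$n >= 1L)
  ensemble_size
}

ens <- as_ensemble(4)
```

This keeps ensemble_size = 4 valid at the user level while making the sampling and pooling choices explicit internally.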
demographic_profiles
The demographic_profiles object should represent either
a fixed set of profiles or a sampling frame from which profiles are
drawn. This is important because the papers do not only vary wording;
they also vary the participant being simulated (Hewitt et al. 2024a).
At minimum, each profile should be able to store:
- profile_id
- gender
- age
- ethnicity
- education
- ideology
- party
- weight
- label
Conceptually:
demographic_profiles <- tibble::tibble(
profile_id = c("p1", "p2"),
gender = c("Female", "Male"),
age = c("30-39", "Over 60"),
ethnicity = c("White", "Black"),
education = c("College", "Some college"),
ideology = c("Conservative", "Moderate"),
party = c("Strong Republican", "Lean Democrat"),
weight = c(0.5, 0.5),
label = c("profile_1", "profile_2")
)

For nalanda, this object could support at least three
modes:
- Identity-only mode, close to the current package design.
- Fixed-profile mode, for exact reproducibility with a specified profile set.
- Weighted-sampling mode, for approximating a target population.
For chapter work, the immediate value may be greatest for party or ideology, with richer demographic fields becoming more important when studying subgroup heterogeneity or matching a target population.
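The weighted-sampling mode could be a thin layer over base sampling; sample_profiles() is an illustrative name, and the frame mirrors the reduced schema sketched above:

```r
# Hypothetical weighted-sampling mode: draw simulated participants from a
# profile frame in proportion to the weight column.
sample_profiles <- function(profiles, n, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  idx <- sample(nrow(profiles), size = n, replace = TRUE,
                prob = profiles$weight)
  profiles[idx, , drop = FALSE]
}

profiles <- data.frame(
  profile_id = c("p1", "p2"),
  party      = c("Strong Republican", "Lean Democrat"),
  weight     = c(0.5, 0.5),
  stringsAsFactors = FALSE
)
drawn <- sample_profiles(profiles, n = 10, seed = 42)
```

Fixed-profile mode falls out for free: passing the frame through unsampled reproduces an exact profile set.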
Proposed next steps
If the goal is to move one small piece at a time, my current order of work would be:
- implement a prompt_bank constructor and prompt-building helpers,
- expose ensemble controls in a minimal but explicit form,
- design a condition-based simulation wrapper around the existing one-turn execution logic,
- add summary functions that return raw and optional adjusted package-native metrics,
- add a small calibration helper for user-supplied effect columns,
- add demographic profile objects and sampling helpers,
- revisit cumulative chapter designs after the first three pieces are stable.
The main reason for this order is that prompt standardization and package-native summaries are useful immediately for the existing chapter workflow, whereas population-matched demographic simulation and cumulative exposure are likely to require more validation work.