How Deliberative Experiences Shape Subjective Outcomes: A Study of Fifteen Minipublics from 2010–2018

In the twenty-first century, deliberative democracy has grown exponentially both as a subject of scholarship and a public practice. Though governments and civic organizations have sponsored thousands of deliberative forums across the globe, it remains unclear how strongly participants’ experiences of deliberative processes connect to their sense of satisfaction, knowledge gains, and opinion change. In addition, the dearth of comparative studies makes it unclear whether those process-outcome relationships vary depending on the context of a deliberative event. To address those questions, we analyzed survey data collected at fifteen Citizens’ Initiative Reviews held from 2010–2018. The findings show strong relationships between process and outcome perceptions, though weaker linkages to opinion change. The duration, official authorization, and ideological diversity of participants also shaped many process and outcome measures, with the duration of process and ideological diversity moderating even some process-outcome linkages. The results support the argument that the subjective experience of deliberation is important for achieving its aims.

After years of critique that deliberative theory was divorced from practice, scholars and practitioners have begun to apply theories of public deliberation to the development and evaluation of real public engagement processes (Gastil 2018;Nabatchi et al. 2012;Neblo 2015). This has necessitated creating a definition of deliberation that can be operationalized when collecting data about events (Black, Burkhalter, et al. 2011). Though contextual needs and constraints shape what deliberation looks like in practice, most scholars have coalesced around a shared description of deliberative processes. Namely, for a public engagement process to count as deliberative it should prompt participants to critically analyze relevant information, arguments, and values (Burkhalter et al. 2002;Steenbergen et al. 2003) and engage in an egalitarian discussion that demonstrates mutual respect and consideration of different perspectives (Benhabib 1996;Gutmann & Thompson 1996;Mansbridge 1983).
Deliberation, however, is about more than a normative model of open-ended discussion. Those organizing deliberative events or institutions aim to achieve specified outcomes, such as more nuanced opinions, better decisions, and increased public engagement (Chambers 2003;Goodin & Dryzek 2006;Kuyper 2018;Niemeyer & Dryzek 2007). For an event to be deliberative, therefore, participants must move beyond the sharing of evidence and opinions and towards the formation of an informed judgment that takes into account multiple options and perspectives. Sometimes this means the post-forum recording of a considered private judgment (Fishkin 2018), but other times it means rendering verdicts, making decisions, or at least arriving at concrete recommendations or shared judgments (Crosby & Nethercutt 2005;Grönlund et al. 2014;Hendriks 2005).
With this working conception of deliberation broadly shared (e.g., Black, Welser, et al. 2011; Karpowitz & Raphael 2014; Nabatchi 2012; Neblo 2015), scholars have begun to assess the strength of the theorized link from deliberative inputs to outputs (Farrar et al. 2010). Too often, however, such research lacks repeated iterations of a deliberative design, a sufficiently large sample of participants, or subjective measures of participants' individual experiences. Without comparative data, scholars have difficulty testing the impact of contextual variables, such as duration and level of ideological diversity among participants. Rarer still are datasets that permit inspection of consequential deliberation, in which there exist real political or policy stakes for the process (C. Johnson & Gastil 2015).

Evaluating Face-to-Face Deliberative Events
Deliberative public events like the Citizens' Initiative Review (CIR) have proliferated in this century (Abdullah & Rahman 2015; Fishkin 2009; Gastil & Levine 2005; Grönlund et al. 2014; Nabatchi et al. 2012; Neblo 2015; Setälä & Smith 2018). What was once theorized as an ideal way of reaching decisions (Barber 1984; Bohman 1996; Chambers 2003; Habermas 1998) has been realized through the development of varied process designs in settings ranging from local contexts, such as individual workplaces, neighborhoods, and schools, to global ones, such as the World Wide Views forums that link participants across continents to discuss transnational policy problems (Herriman et al. 2011).
As more public officials and democratic reformers have come to champion this type of engagement, deliberative processes have gained legitimacy and power (Fagotto & Fung 2009; Fung & Wright 2003; Hartz-Karp & Briand 2009). Deliberation has been used by citizens and governments across the world to draft policy, propose laws, set budgets, and initiate constitutional reform (Farrell & Suiter 2019; Gilman 2016; Warren & Pearse 2008). In the context of direct democracy, one especially useful deliberative intervention is to convene a 'minipublic', a body of randomly selected citizens gathered to study and assess a public issue (Grönlund et al. 2014; Setälä & Smith 2018). Deliberative Polling and Citizens' Assemblies have been used to develop ballot measures (Fishkin et al. 2015; Warren & Pearse 2008). The process studied herein takes a different path, asking panelists to study an already developed ballot initiative and write an assessment for lay voters who may otherwise have trouble finding well-reasoned arguments and reliable information (Burnett 2019). In the present age of disinformation, the CIR may be equally critical as a means of countering deliberate deceptions disseminated via social media during elections (Schia & Gjesvik 2020).
Such institutionalization could improve opportunities for informed and effective citizen engagement (Warren & Gastil 2015), but scholars must maintain a critical eye when evaluating whether the deliberative experience leads to its assumed outcomes (Spada & Ryan 2017). Similarly, because deliberation is resource intensive, those promulgating its expansion should identify under what contexts it can be most effective.

The Deliberative Experience
To advance our understanding of the deliberative process, we begin by identifying its two key components. At a minimum, deliberation entails rigorous analysis of information and policy alternatives, along with an inclusive and respectful discussion process (Burkhalter et al. 2002).

Analytic rigor and democratic relations
The analytic portion of deliberation requires participants to examine pertinent evidence (Gouran & Hirokawa 1996). Such evidence may be established facts but can also include narratives and personal experiences related to the policy or decision in question (Black 2008;Polletta & Lee 2006). Another requirement asks that participants consider tradeoffs or weigh the pros and cons of implementing any decision (Barber 1984;Fishkin 1991;Gouran & Hirokawa 1996;Mathews 1994). This means that they should consider the benefits of potential solutions and the consequences of implementation and seek to uncover any unintended consequences that might arise as a result of a specific decision. Finally, participants should consider the relevant values underlying arguments. This requires considering what goals might be reached through particular decisions and acknowledging that multiple, competing values are often at play (Anderson 1993;Benhabib 1996).
Equally important to the deliberative process is inclusive and respectful discussion (Gutmann & Thompson 1996;Mansbridge 1983). Inclusivity has two primary components: external and internal. The external component requires events to seek a diversity of participants so that stakeholders or traditionally marginalized individuals are not excluded from the discussion and decision making (Young 2002). Internal inclusivity occurs once participants have been assembled. This requires not simply equal speaking opportunities among participants but also fair and full consideration regardless of demographic characteristics or different ways of speaking (Benhabib 1996;Gutmann & Thompson 1996;Mansbridge 1983). Finally, a deliberative discussion seeks to provide information in a way that is accessible to all participants so that each discussant has an opportunity to both consider the information and weigh in on the discussion (Gastil 1993;Gutmann & Thompson 2004).

Objective and subjective experience
These criteria can be considered from both an objective and subjective perspective. Objectively, scholars can attempt to measure the degree to which participants have investigated an issue or shown respect to one another by counting the times participants ask questions or interpreting transcripts of an event. Subjectively, we can look to participants' experiences as measures of whether these goals have been achieved.
For questions of democratic quality, subjective experiences may be a more useful measure. Participants may be better at measuring the presence of respect and mutual consideration, for example, than experts because they are the ones who experience and engage in those activities (Black 2012; Gastil et al. 2012). Though measures like turn-taking can assess the level of equality, they may not be the best judge of equity. Some participants may need to speak more to lend voice to minority experiences. Conversely, some participants may feel less comfortable talking in group settings. Though experts can lend a more objective evaluation framework that provides consistency across participants, if deliberation is an inherently subjective process, then participant experiences provide a critical measure of whether deliberation occurred (Neblo 2007). Indeed, in one study attempting to connect deliberative quality and outcomes, participant ratings of deliberative quality were linked to convergence on policy attitudes whereas expert coding of deliberative quality was not.
Still, perceptions of inclusion likely reflect one's individual perspective. For instance, a study found that people of color were more positive in their ratings of deliberative quality than were white participants, because participants of color more readily juxtaposed the opportunity to have their voices heard with contexts outside of the deliberative setting in which those opportunities are scarcer (Abdel-Monem et al. 2010). This discrepancy indicates that experience matters, and subjective experiences may rely on different standards than objective measures of process quality.
Findings in small group literature also encourage caution in conflating experience with theoretically derived measures of quality, particularly in relation to the analytic components of the process. This body of literature suggests that participants may not be able to judge the analytic quality of a conversation because they may not have the information available to make such a judgment. Unfortunately, participants in small group discussions sometimes withhold pertinent information and rely too heavily on information already known to the whole group. As a result, groups can make flawed judgments when their shared information supports a bad choice (Lu et al. 2012;Stasser & Titus 2003).
Group decision-making scholarship, however, typically takes place in laboratory settings with tight interactional constraints, severe time limits, and minimally motivated participants. By contrast, deliberative events are designed specifically to encourage information sharing among participants, such as ensuring a diversity of participants, training them in deliberative practices, and providing them with evidence to reference during discussion. Even researchers working in the 'hidden profile' research paradigm, which showed information processing biases in small groups, have recognized that real-world groups may lie outside the scope of such theories (Sohrab et al. 2015), precisely because deliberative discussion can help participants overcome self-defeating tendencies (Myers 2018).
For the same reason, participants in deliberative events may be better equipped to judge the quality of their discussion than those who are in non-deliberative contexts. Without more research on comparisons of objective and subjective assessment, however, subjective measures of analytic rigor might be understood as a useful but incomplete measure of discussion quality (Gastil 2013).

Three Outputs: Knowledge Gains, Opinion Change, and Satisfaction
Advocates of deliberative democracy have highlighted a number of potential benefits of deliberation. Deliberation has the potential to shift policy opinions, encourage consensus decision making, and increase democratic legitimacy (Fishkin 2018;Gastil 2018;Goodin & Dryzek 2006;Grönlund et al. 2014;Neblo et al. 2018). Additionally, well-structured deliberation can change the cognitions and actions of participants. Engagement in a deliberative process has been shown to increase participants' sense of political efficacy, policy knowledge, and civic engagement (Fishkin 2009;Gastil 2004;Jacobs et al. 2009;Morrell 2005;Nabatchi 2010).
Though all such changes are laudable, the success of a deliberative event often depends on achieving two interrelated goals: knowledge gains and opinion change (Farrar et al. 2010; Niemeyer & Dryzek 2007). One of the first metrics used to empirically test deliberative events was whether participants learned relevant policy information through their experience (Fishkin & Luskin 1999; Gastil & Dillard 1999). Such research has continued and now shows ample evidence that deliberation can lead to policy-specific knowledge gains (Barabas 2004; Fishkin 2018; Gastil 2006; Richards 2018). This reflects the core premise of deliberative democracy that, all other things being equal, a pluralistic and deliberative process should yield better decisions (Landemore 2013).
Even so, the ability of deliberation to generate higher quality judgments is contested (Pincock 2012). Some argue that deliberation can lead to opinions that are more consistent with available knowledge or underlying values (Barabas 2004; Fishkin 2018). Others are wary that social pressure, rather than knowledge gains or perspective taking, may determine shifts in opinion (Karpowitz & Mendelberg 2014; Sanders 1997). Those who take the latter stance, however, are often studying events that may not actually be thoroughly deliberative (Pincock 2012). Connecting broad variations in the quality of deliberation to the likelihood of opinion change among participants may shed light on this debate, even if it cannot discern the more fine-grained mechanisms whereby deliberation shapes opinion.
Finally, in addition to knowledge gains and opinion change, participants' subjective experience also provides a measure of process quality, with participants' self-reported satisfaction providing one important indicator of success (Abdullah & Rahman 2015; Foels et al. 2000; Gastil et al. 2012; Hickerson & Gastil 2008). Participants who are satisfied with their experience of deliberation are more likely to see the host of attitudinal and behavioral changes mentioned above (Gastil, Deess, et al. 2010). Thus, participants' satisfaction can be an important metric for assessing the success of a deliberative event.

Hypotheses
Having described key features of deliberation and three intended outputs, we examine the relationships between these process and outcome variables. Simply put, when participants subjectively experience deliberation as a rigorous and respectful process, do they become more likely to report process satisfaction, knowledge gains, and shifts in their opinions?

Main Hypothesis
Consistent with the preceding literature review, our principal hypothesis predicts that CIR panelists' assessments of both the analytic rigor and democratic quality of their deliberative event will be positively associated with our three focal outcomes. These include participants' process satisfaction, their sense of having learned enough to reach a good decision on the ballot measure, and their degree of reported individual opinion change.
H1: CIR panelists' assessments of the review's analytic rigor and democratic quality will be positively associated with participants' (a) satisfaction with the process, (b) their sense of having learned enough to reach a good decision, and (c) their level of opinion change.

Contextual Predictors
This first hypothesis generalized across all instances of the CIR, but with over a dozen different iterations of this process in hand, our data permit us to advance hypotheses about how different CIR deliberations and outcomes link back to variations in the Review's design and setting. Though deliberative theorists often talk about the importance of institutional context (G. F. Johnson 2009), the nature of a discussion issue and its framing (Gastil, Bacci, et al. 2010;Lee 2014), or the particular deliberative design being employed (Carman et al. 2015;Himmelroos 2017), the effects of these differences have not been empirically tested in a comparative manner. The present research context makes it possible to look at a handful of such variations across deliberative forums, including differences in the political context, event duration, and political division.

Duration of deliberation
In the case of the CIR, one straightforward variation concerns the Review's duration. Although the question of whether adequate time is given for deliberation should be a fundamental piece of any process evaluation (J. Abelson et al. 2003;Coote & Lenaghan 1997), studies rarely address this directly. Advocates of deliberation often claim that extensive time is needed to engage in substantive deliberation. Greater time can allow participants to delve deeply into an issue and provide them the space to gather information, hear from witnesses, and collectively scrutinize evidence (Coote & Lenaghan 1997). Time also affords participants the opportunity to develop mutual respect and understanding (Renn et al. 1993).
The length of time required for participation in such events, however, can place a considerable burden on everyday citizens. This could prevent otherwise willing community members from engaging in deliberative processes and ultimately result in less inclusivity (J. Abelson et al. 2003; Barnes 1999; Coote & Lenaghan 1997; Dienel & Renn 1995; French & Laver 2009). Though one study of trial jurors found that quality of the deliberations, rather than length of time, was the deciding factor in whether jury experience led to increased voting (Gastil, Deess, et al. 2010), some citizens' jury participants have lamented that four days was not enough time to adequately grapple with the issue in question (Barnes 1999). Also, shorter deliberative processes do not appear to provide the same civic motivation that more rigorous processes engender. If duration does matter, such tradeoffs may be worth the costs, but if it has no influence on participant experience and process outcomes, shorter processes may be able to generate more inclusivity at a lower cost.
The first four iterations of the CIR (2010-2012) all lasted five days, but from 2014-2018, the Reviews have all lasted three and a half days (hereafter labeled as 'four days'). Concerns about controlling the cost of the CIR's implementation prompted this foreshortening, but it went against the traditional model of Citizens' Juries that the CIR aimed to reproduce (Crosby & Nethercutt 2005). We hypothesized that the participants' subjective ratings of deliberative quality would decline as a result of this abbreviation of the CIR process. Moreover, because time gives participants a greater opportunity to weigh arguments and evidence and engage in democratic discussion, we predict that participants will be more likely to change their opinion if they took part in the longer process.
H2a: Relative to shorter ones, longer CIRs will lead to higher participant assessments of the process' analytic rigor and democratic quality and will be more likely to result in knowledge gains and opinion change.

Degree of authority
Part of what sets the Oregon CIR apart from so many other minipublics is its authority: its ability to put its findings in the official state pamphlet mailed to every Oregon voter (Warren & Gastil 2015). From 2014 through 2018, however, unofficial CIRs have been held in a county (Jackson County, Oregon), two municipalities (Phoenix, Arizona and Portland, Oregon), and four states (Arizona, California, Colorado, and Massachusetts). Lacking formal authorization from the government, these CIRs have no reliable method for sharing their findings with voters. By contrast, the exercise of real political power in the Oregon CIR raises the stakes for deliberation (Levine et al. 2005), which means that deliberation is more consequential. When participants are aware of this state authority, they may be more likely to take the task of deliberation seriously, enacting the ground rules meant to assure the process' analytic rigor and democratic quality. Similarly, they may be more likely to keep an open mind, and thereby learn more information and reconsider their initial judgments.
H2b: Empowered CIRs, compared to unofficial ones, will lead to higher participant assessments of the process' analytic rigor and democratic quality and will be more likely to result in knowledge gains and opinion change.

Ideological divergence
The degree of ideological divergence within a CIR panel may also influence the likelihood of achieving desired outcomes. Deliberative processes are designed to bring competing perspectives into conversation with one another. Though deliberation may at times be used to allow homogenous groups to engage in preference identification, more often designers seek to engage individuals from across the political spectrum so that participants may learn about competing perspectives and evaluate a range of potential solutions (Gutmann & Thompson 1996;Mansbridge 1983;Young 2002). In this sense, the presence of ideological divergence among panelists is a necessary requirement for knowledge gains and opinion change.
The presence of high degrees of difference, however, may have adverse effects on the quality of deliberation if participants polarize in opposition to one another (Mendelberg & Karpowitz 2007;Mendelberg & Oleske 2000;Sanders 1997). The CIR design offers an opportunity to explore the effects of political division on process outcomes. One criterion used to select participants is partisan affiliation, resulting in a participant pool reflective of local political divisions. Even so, because CIRs have been conducted in different locations and at different points in time, and because simple partisan affiliation does not provide an indication of ideological strength, the level of ideological division among participants varies across reviews.
H2c: Compared to relatively homogenous CIRs, those that have more ideological divergence among their citizen panelists will produce lower participant assessments of the process' analytic rigor and democratic quality, but they will be more likely to result in knowledge gains and opinion change.

Context as Moderating the Influence of Participant Experiences on Outcomes
Our hypotheses for the contextual variables concern not only their association with deliberative process perceptions but with the association between those perceptions and subjective outcomes. These amount to contextual qualifications of H1, which predicts that positive experiences of analytic rigor and democratic quality will lead to the intended deliberative outcomes.
Because time is theorized as essential for the development of both the analytic and democratic aspects of deliberation (Coote & Lenaghan 1997;Dienel & Renn 1995), we predict that a shorter CIR duration will place greater stress on deliberative processes. In the foreshortened instances of the CIR, which last four days instead of five, favorable outcomes will depend to a greater extent on the levels of analytic rigor and democratic discussion achieved.
H3a: The duration of a CIR will moderate the relationship between participant assessments of process quality and deliberative outcomes, such that shorter processes will show stronger relationships between participants' assessment of the process quality and their likelihood of experiencing satisfaction, knowledge gains, and opinion change.
Similar effects may occur in relation to empowerment. If participants take their duty to voters seriously, state-authorized instances of this process should create stronger ties between participants' ratings of the deliberative process and its outcomes. In the context of amplified authority, CIR panelists may think more critically about whether the process met the deliberative criteria when rating their satisfaction with its performance. Likewise, we predict that panelists' readiness to learn new information or alter their opinions will hinge on deliberative process quality more in the higher-stakes official CIRs than in the pilot tests thereof.
H3b: The presence or absence of legislative authorization for a CIR will moderate the relationship between participant assessments of process quality and deliberative outcomes, such that empowered processes will show stronger relationships between participant assessments of deliberative quality and satisfaction, knowledge gains, and opinion change.
Finally, high levels of disagreement, especially if those become personal, make maintaining respectful democratic relations among members all the more important (Hwang et al. 2018;Zhang 2012). Substantive disagreement also could heighten the impact of analytic rigor, which would become more crucial as a means of analyzing conflicting information and perspectives and generating attitude change (Burgess et al. 2008;Caluwaerts & Deschouwer 2014; also see Esterling et al. 2015). Thus, we anticipate that CIR panelists participating in events populated with more ideologically diverse participants will show stronger positive relationships between deliberation perceptions and outcomes.
H3c: The level of ideological divergence among CIR panelists will moderate the relationship between participant assessments of process quality and deliberative outcomes, such that more diverse panels will yield stronger relationships between participant ratings of deliberative quality and satisfaction, knowledge gains, and opinion change.
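As an expository sketch only, the three moderation hypotheses can be read as claims about the interaction coefficient in a regression of the following general form. The symbols are introduced here for illustration and are assumptions rather than notation from the analysis itself: $Y_{ij}$ is an outcome for panelist $i$ in CIR $j$, $P_{ij}$ is that panelist's rating of process quality (Analytic Rigor or Democratic Quality), and $C_j$ is a contextual variable for the panel. Control variables are omitted for brevity.

```latex
\begin{equation*}
Y_{ij} = \beta_0 + \beta_1 P_{ij} + \beta_2 C_j + \beta_3 \,(P_{ij} \times C_j) + \varepsilon_{ij}
\end{equation*}
% The slope of the outcome on process quality depends on the context:
\begin{equation*}
\frac{\partial Y_{ij}}{\partial P_{ij}} = \beta_1 + \beta_3 C_j
\end{equation*}
```

Under this reading, H1 concerns $\beta_1$, H2a through H2c concern $\beta_2$, and H3a through H3c concern $\beta_3$. If $C_j$ is coded so that higher values mean a shorter process (H3a), an empowered process (H3b), or a more ideologically divergent panel (H3c), each moderation hypothesis predicts $\beta_3 > 0$; the sign flips if the coding is reversed.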

Methods
To test the relationship between deliberative experiences and outputs, researchers surveyed participants in fifteen CIRs between 2010 and 2018. At the end of each day's discussions, participants at each review took a brief survey that asked them to rate the democratic quality of the discussion. The participants took an additional, and slightly longer, survey at the end of the review that asked them to reflect on the entirety of the CIR process to assess its analytic rigor and their overarching satisfaction with the process. With very few exceptions, every participant completed every survey on every day, for an N of 318 (with a survey response rate above 98 percent).

Deliberative Experience Measures
In determining how well the CIR embodied a deliberative process, we relied on an operationalization adapted from Gastil (2008) and used previously in other studies (Black, Welser, et al. 2011;Gastil 2013). This approach defines analytic rigor as establishing a solid information base, analyzing all available options, identifying pertinent values, and weighing the pros and cons of an issue. Democratic discussion is defined as providing equal opportunity to participate and creating an atmosphere of mutual comprehension, consideration, and respect.
Because participants were asked to deliberate about measures that often necessitated the discussion of complex scientific or economic evidence, the research team chose to assess analytic rigor at the end of the review, rather than on a daily basis. The assumption here was that although participants might at times feel they did not have all of the necessary information at the end of each day, by the end of the review they would be better able to assess whether they had been provided the information needed to thoroughly understand the evidence, arguments, and relevant values. Participants were, however, asked to rate the democratic quality of the discussion at the end of each day. Here, the research team believed it was important to assess whether participants felt they were being respected and included in the conversation throughout the entirety of the process and that they could follow the discussion even if they didn't have all of the information needed to reach their final decision until the end.
To assess its analytic rigor, participants were asked at the end of the review how well the process performed in 'weighing the most important arguments and evidence' in favor of/opposing the measure and in 'consideration of the values and deeper concerns motivating' those in favor of/opposing the measure on a scale from 'very poor' (1) to 'excellent' (5). These four items were then combined into a single scale assessing the CIR's Analytic Rigor (α = 0.89, M = 4.21, SD = 0.72).
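For readers less familiar with scale reliability, the following minimal sketch shows how Cronbach's α is conventionally computed for a multi-item scale such as this one. The function names and the example ratings are hypothetical, and averaging the four items to form a scale score is an assumption; the text above says only that the items were 'combined'.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for k items.

    `items` is a list of k lists; each inner list holds one item's
    ratings for the same n respondents, in the same order.
    """
    k = len(items)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    n = len(items[0])
    if any(len(col) != n for col in items):
        raise ValueError("every item needs the same number of respondents")
    # Total score per respondent across all items.
    totals = [sum(col[i] for col in items) for i in range(n)]
    # Sum of the item variances (sample variance, n - 1 denominator).
    item_variance_sum = sum(variance(col) for col in items)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))


def scale_score(responses):
    """Mean of one respondent's item ratings (assumed combination rule)."""
    return sum(responses) / len(responses)


# Hypothetical ratings (1-5) from five panelists on four rigor items.
example_items = [
    [4, 5, 3, 4, 5],  # arguments and evidence in favor
    [4, 5, 3, 5, 5],  # arguments and evidence opposing
    [3, 5, 3, 4, 4],  # values of those in favor
    [4, 4, 2, 4, 5],  # values of those opposing
]
alpha = cronbach_alpha(example_items)
```

Alpha approaches 1 as the items covary more strongly, which is why identical items yield exactly 1.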
Participants rated the democratic quality of the discussion at the end of each day of the CIR by responding to how often they engaged in the following activities on a scale from 'never' (1) to 'almost always' (5): carefully considered 'views different from your own' when expressed by 'experts or other CIR participants,' felt that 'other participants treated you with respect today,' or had 'trouble understanding or following the discussion' (reverse coded). Participants were also asked whether they had 'sufficient opportunity to express [their] views today' on a scale from 'definitely no' (1) to 'definitely yes' (5). Participants' individual scores on each question were averaged across days, and then those four average scores were combined into a scale assessing the CIR's Democratic Quality (α = 0.66, M = 4.41, SD = 0.36).
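The two-step aggregation described above (average each item across days, then combine the four item averages) can be sketched as follows. The item keys are hypothetical labels, reverse coding is taken as 6 minus the rating on the five-point scale, and the final combination is assumed to be a simple mean.

```python
# Hypothetical item labels; only the 'trouble following' item is reverse coded.
REVERSED_ITEMS = {"trouble_following"}


def reverse_code(score, low=1, high=5):
    """Flip a rating so that higher always means better (1<->5, 2<->4)."""
    return low + high - score


def democratic_quality(daily_ratings):
    """Scale score for one participant.

    `daily_ratings` is a list with one dict per day, each mapping an
    item label to that day's rating on a 1-5 scale.
    """
    item_means = []
    for item in daily_ratings[0]:
        scores = [day[item] for day in daily_ratings]
        if item in REVERSED_ITEMS:
            scores = [reverse_code(s) for s in scores]
        # Step 1: average this item across days.
        item_means.append(sum(scores) / len(scores))
    # Step 2: combine the item averages (assumed here to be a mean).
    return sum(item_means) / len(item_means)


# Hypothetical two-day record for a single panelist.
panelist = [
    {"considered_views": 4, "respected": 5, "trouble_following": 2, "opportunity": 5},
    {"considered_views": 5, "respected": 5, "trouble_following": 1, "opportunity": 4},
]
score = democratic_quality(panelist)
```

This sketch assumes complete daily surveys; a participant with a missing day would need the per-item average taken over the days actually answered.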

Deliberative Output Measures
The three focal output variables in this study were satisfaction with the deliberative process, the perception of gaining sufficient knowledge to make a sound judgment, and the perception of having changed one's opinion on the policy issue under discussion.

Process satisfaction
On their end-of-review survey, participants rated their 'overall satisfaction with the CIR process' on a scale from 'very dissatisfied' (1) to 'very satisfied' (5). This was used as the Satisfaction measure (M = 4.49, SD = 0.80).

Subjective knowledge gain
To measure whether participants believed they had gained an adequate amount of knowledge during the review, participants were asked whether they believed 'that [they] learned enough this week to make an informed decision?' Responses were given on a scale from 'definitely no' (1) to 'definitely yes' (5), and the average response on this Learned Enough measure was very high (M = 4.65, SD = 0.70).

Subjective opinion change
Opinion change was assessed by asking participants about their opinions prior to the review and at the end of the review on a scale from 'strongly support' (1) to 'strongly oppose' (5). Because the CIR event organizers did not want to encourage participants to reach an opinion before deliberation occurred, for all but the 2014 reviews, researchers were required to measure both pre- and post-review opinions during the end-of-review survey. The first question asked, 'Before you participated in the CIR, what was your position on this measure?' The second then asked, 'At the end of the CIR process, what is your position now on this measure?' The absolute value of the difference between their pre- and post-CIR positions on that five-point scale was calculated to determine their degree of Opinion Change (M = 1.21, SD = 0.88). 1
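As a small worked illustration of this measure, the sketch below computes the absolute difference between two positions on the five-point scale. The function name and the range check are illustrative assumptions.

```python
def opinion_change(pre, post):
    """Absolute shift between pre- and post-review positions.

    Both positions use the scale from 'strongly support' (1) to
    'strongly oppose' (5), so the result ranges from 0 to 4.
    """
    for position in (pre, post):
        if position not in range(1, 6):
            raise ValueError("positions must be integers from 1 to 5")
    return abs(post - pre)


# A panelist who moved from 'support' (2) to 'oppose' (4) changed by 2,
# the same magnitude as one who moved from 4 to 2.
shift = opinion_change(2, 4)
```

Because the measure uses the absolute value, it captures how far a panelist moved but not the direction of that movement.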

Contextual Variations
For the fifteen CIR panels, Table 1 describes the key features of each panel, including its official authorization (or pilot test status), its duration, and the level of dispersion of participants' left-right ideological identities. These variations result in relatively balanced splits that divide the CIR panelists into four-day CIRs (n = 222) versus five-day CIRs (n = 96) and unofficial CIR pilot tests (n = 162) versus official Oregon CIRs (n = 156). As for Ideological Divergence, this continuous variable (M = 1.55, SD = 0.18) was created by measuring the SD of ideology within each CIR panel, using participants' self-identification on a scale from 'extremely liberal' (1) to 'extremely conservative' (7). The CIRs conducted in 2012 did not include the ideology variable, so their Ideological Divergence scores were estimated based on panelists' party membership. 2
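The panel-level divergence score can be sketched as a within-panel standard deviation, as below. The panel labels and ideology values are hypothetical, and the use of the sample SD (n - 1 denominator) is an assumption, since the text does not say whether the sample or population formula was used.

```python
from collections import defaultdict
from statistics import stdev


def ideological_divergence(records):
    """Within-panel SD of ideology.

    `records` is an iterable of (panel_id, ideology) pairs, with ideology
    on the scale from 'extremely liberal' (1) to 'extremely conservative' (7).
    Returns a dict mapping each panel_id to its SD.
    """
    by_panel = defaultdict(list)
    for panel_id, ideology in records:
        by_panel[panel_id].append(ideology)
    # Sample SD (assumed); statistics.pstdev would give the population SD.
    return {panel: stdev(scores) for panel, scores in by_panel.items()}


# Hypothetical panelists from two panels.
records = [
    ("panel_A", 1), ("panel_A", 4), ("panel_A", 7),  # widely spread
    ("panel_B", 4), ("panel_B", 4), ("panel_B", 4),  # fully homogeneous
]
divergence = ideological_divergence(records)
```

Every panelist in a given panel then receives that panel's score, which is why the variable takes only as many distinct values as there are panels.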
Because of inconsistent use of some of these demographics across the CIRs, along with reluctance to answer particular questions, three of these variables had excessive levels of missing data (age, income, education, ethnicity). Their inclusion would have reduced the effective sample size from 314 to 222, and multiple imputation would have limited utility given the distinctness of these variables. Their inclusion did not change the findings, so we dropped them from further analysis. 3

Statistical Analysis
The nature of our sample made multi-level modeling impractical owing to the small sample size at the group level of analysis (i.e., fifteen CIR processes). In terms of statistical power (Cohen 1988), the repetition of a small-scale deliberative process over the years yielded a sufficiently large sample of individual participants but an insufficient number of separate deliberative events.
To acknowledge the non-independence of individual panelists nested within each of those fifteen CIRs, however, we utilized cluster-robust standard errors (Cameron & Miller 2015;Esarey & Menger 2019;Liang & Zeger 1986). After making this adjustment, we used general linear regression models to test each hypothesis, 4 including the use of interaction terms to test the contextual moderating effects in the third hypothesis. 5
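To show what a cluster-robust adjustment does, the following sketch fits a one-predictor regression and computes a sandwich standard error that sums residual scores within each cluster. It is not the SPSS Complex Samples procedure used in this study (see Note 4); it is a simplified stand-in that assumes a single predictor and applies no finite-sample correction (the so-called CR0 form), and all names and data are hypothetical.

```python
import math
from collections import defaultdict


def ols_cluster_robust(x, y, cluster):
    """Fit y = a + b*x by OLS; return (a, b, cluster-robust SE of b), CR0 form."""
    n = len(x)
    # Elements of X'X for the design matrix with columns [1, x].
    sx = sum(x)
    sxx = sum(v * v for v in x)
    det = n * sxx - sx * sx
    # Inverse of the 2x2 matrix X'X.
    inv = [[sxx / det, -sx / det], [-sx / det, n / det]]
    sy = sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    a = inv[0][0] * sy + inv[0][1] * sxy
    b = inv[1][0] * sy + inv[1][1] * sxy
    # Sum the score vectors X_g' u_g within each cluster g.
    scores = defaultdict(lambda: [0.0, 0.0])
    for xi, yi, g in zip(x, y, cluster):
        resid = yi - (a + b * xi)
        scores[g][0] += resid
        scores[g][1] += xi * resid
    # "Meat" of the sandwich: sum over clusters of s_g s_g'.
    meat = [[sum(s[i] * s[j] for s in scores.values()) for j in range(2)]
            for i in range(2)]
    # Variance of b is the [1][1] element of inv @ meat @ inv.
    row = [sum(inv[1][k] * meat[k][j] for k in range(2)) for j in range(2)]
    var_b = sum(row[j] * inv[j][1] for j in range(2))
    return a, b, math.sqrt(var_b)


# Hypothetical data: six panelists nested in three panels.
x = [1, 2, 3, 4, 5, 6]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.2]
panel = ["p1", "p1", "p2", "p2", "p3", "p3"]
a, b, se_b = ols_cluster_robust(x, y, panel)
print(round(b, 2), round(se_b, 4))
```

The key design choice is that residuals are summed within a panel before being squared, so correlated errors among panelists who deliberated together inflate the standard error rather than being treated as independent information.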

H1: General Hypothesis
H1 predicted that the two measures of deliberative experience (analytic rigor and democratic discussion) would each predict three outcomes commonly associated with deliberative events: participant satisfaction, knowledge gains, and opinion change. Table 2 shows the results of the three corresponding regression equations, which include both process variables and control variables as predictors for each outcome measure. 6 For Satisfaction, analysis showed significant independent associations for both Analytic Rigor (B = 0.50) and Democratic Discussion (B = 0.53). Analytic Rigor (B = 0.32) and Democratic Discussion (B = 0.44) also predicted panelists' sense that they had learned enough to make an informed decision. Neither of these process variables, however, predicted Opinion Change. Democratic Discussion had a non-significant association in the predicted direction but also considerable variance in this statistical relationship (B = 0.20, p = 0.07). 7

H2: Effect of Context on Experience and Outcomes
To test whether context influenced both perceptions of deliberation and process outcomes in the predicted directions, we conducted separate regression equations for each of these predictors, paired with the same control variables as in the preceding analyses. Table 3 shows the key results from each of these five equations, with context serving as independent variables and the process quality and outcomes acting as dependent variables. 8 Looking first at the duration of the CIR (H2a), the length of the review did not correspond to Analytic Rigor, though it had a nearly significant association with Democratic Discussion ratings (B = 0.11, p = 0.08), which were higher for the five-day CIRs (M = 4.49) than for four-day reviews (M = 4.38), t = 2.58, p = 0.005. Duration was associated with Learned Enough ratings (B = 0.26), with the longer processes yielding higher average scores (M = 4.83) than did the shorter ones (M = 4.57). There was also more change in opinion reported during the five-day CIRs (M = 1.42) compared to the four-day processes (M = 1.12), B = 0.31.
Turning to official authorization of the CIR, the results showed associations with two outcome measures. Participants in state-authorized Oregon CIRs reported a greater sense of having learned enough about the issue (M = 4.73) than did those in pilot processes (M = 4.58), though this did not reach significance, B = 0.15, p = 0.07. Empowered processes were also more likely to lead to opinion change (M = 1.35) than were pilot processes (M = 1.09), B = 0.31.
On average, the CIR panels' Ideological Divergence was associated with differences in panelists' process ratings and various outcomes, but the standard errors of these coefficients were unusually large. Parameter estimates for Ideological Divergence were high and in the predicted directions for Analytic Rigor (B = -0.29), Learned Enough (B = 0.27), and Opinion Change (B = 0.31), but the error terms of these regression coefficients were substantial as well. As a result, the only near-significant effect for this contextual variable was Learned Enough, p = 0.08. 9

H3: Contextual Moderators of Main Associations
Our third set of hypotheses predicted that each of these contextual variables would moderate the associations between the process and outcome measures. To test for moderation, we began with the same regression models used to test the first hypothesis but added a contextual variable and its interactions with the two process measures, Analytic Rigor and Democratic Discussion. 10 Significant interaction terms indicated that the relationship between a process and outcome measure was moderated by a contextual variable. Moreover, we hypothesized negative interaction terms, meaning that the process-outcome relationship would be stronger for shorter CIRs, pilot tests versus official CIRs, and panels with less ideological diversity. After running nine separate regressions (one for each pairing of contextual and outcome variables), Table 4 shows the four equations that resulted in statistically significant interactions. All four showed interactions for Analytic Rigor, but not Democratic Discussion.
Consistent with Hypothesis 3a, the four-day CIR processes had stronger process-outcome associations for Satisfaction (B = -0.25) and Learned Enough (B = -0.28). To illustrate these interactions, linear regressions showed higher coefficients for Analytic Rigor predicting Satisfaction in four-day CIRs (B = 0.54) versus five-day CIRs (B = 0.29), with a similar difference for Learned Enough (B = 0.38 vs. B = 0.14). The equation for Opinion Change, however, had an unanticipated result, with the process-outcome link being stronger for the five-day CIR (B = 0.49). 11 Expressed as linear regression coefficients, this was the difference between a negative association in four-day CIRs (B = -0.19) versus five-day (B = 0.32).
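The link between an interaction coefficient and the subgroup slopes reported in the preceding paragraph can be made concrete with a short sketch. It assumes duration is coded 0 for four-day and 1 for five-day CIRs, which the text implies but does not state, and the function name is hypothetical. Because the subgroup coefficients come from separate regressions, their gaps only approximate the interaction coefficients from the pooled models.

```python
def implied_slope(b_process, b_interaction, moderator):
    """Slope of the process measure at a given moderator value in
    y = b0 + b1*process + b2*moderator + b3*(process*moderator)."""
    return b_process + b_interaction * moderator


# Analytic Rigor coefficients reported in the text as (four-day, five-day).
reported = {
    "Satisfaction": (0.54, 0.29),
    "Learned Enough": (0.38, 0.14),
    "Opinion Change": (-0.19, 0.32),
}
gaps = {}
for outcome, (four_day, five_day) in reported.items():
    # Gap in slopes between five-day (coded 1, assumed) and four-day (coded 0).
    gaps[outcome] = round(five_day - four_day, 2)
    print(outcome, gaps[outcome])
# Satisfaction -0.25
# Learned Enough -0.24
# Opinion Change 0.51
```

These gaps sit close to the reported interaction coefficients (-0.25, -0.28, and 0.49), which is what one would expect if the interaction term captures the change in the Analytic Rigor slope when moving from four-day to five-day CIRs.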
The only other statistically significant association was consistent with Hypothesis 3c: Learned Enough had a significant interaction with Ideological Divergence (B = -0.82). A median split on this contextual variable showed that the linear regression coefficient for Analytic Rigor was relatively high (B = 0.45) for low-diversity CIRs versus those with more ideological divergence (B = 0.23).

Discussion
This paper asked three primary questions. First, do participants' experiences of analytic rigor and democratic discussion predict process satisfaction, a sense of learning, and opinion change? Second, does context influence how participants assess either the deliberative quality of the event or its outcomes? Third, are the relationships between participants' experiences of deliberative quality and process outcomes moderated by contextual variables?

Summary of Findings
Findings indicate that it matters whether participants believe deliberation occurred during their sessions. Higher participant assessments of analytic rigor and democratic discussion were associated with both participant satisfaction and participants' belief that they had learned enough to reach a good decision. Neither measure of deliberative experience, however, was associated with variations in opinion change. In sum, we found a clear relationship between the ideal definition of deliberation and two of its expected outputs: satisfaction and knowledge gain. This aligns with previous theory and research. Deliberation is designed to foster more informed decisions (Niemeyer & Dryzek 2007), and a plethora of research shows that participation can lead to knowledge gains (Barabas 2004;Fishkin 2018;Gastil 2006;Richards 2018). Both analytic rigor and democratic discussion may be essential to produce participant satisfaction and participants' confidence that they have learned enough to reach a good decision.
Our second set of hypotheses sought to understand the influence of contextual variables. Hypothesis 2a predicted that more time spent deliberating would lead to higher participant assessments of the CIR process and a greater sense of learning and opinion change. Duration proved unrelated to process assessments, but longer CIR processes yielded a stronger sense of having learned enough and a greater frequency of changing one's opinion. Hypothesis 2b predicted that the Oregon CIRs authorized by government would lead to more favorable process assessments and outcomes relative to CIR pilot projects. Once again, learning and opinion change were the only results consistent with hypotheses, though the former result fell just short of the conventional threshold for significance. Ideological diversity had process and outcome associations in the predicted directions, but high standard errors rendered all of these findings non-significant.
Our third set of hypotheses predicted that these same three contextual variables would moderate the strength of the process-outcome relationship. A moderation effect was clearest in regard to the four- versus five-day CIR duration. In the shorter processes, analytic rigor was more strongly associated with process satisfaction and the sense of having learned enough to make an informed decision. In the shorter CIR processes, however, scores on analytic rigor had a significant negative relationship with opinion change. One interpretation of the latter finding is that citizen bodies like the CIR may have a turning point for opinion change between the fourth and fifth day. Without that extra day, greater rigor can produce rapid learning but a modest resistance to opinion change if the process feels rushed. Unfortunately, this poses a dilemma for practitioners, who recognize that high-quality deliberation can be expensive to arrange and burdensome for citizen participants (Barnes 1999;French & Laver 2009). Researchers should continue to explore exactly how much rigorous deliberation is required to produce desired outcomes, thereby ensuring that any added cost is worth the marginal benefit.
The other significant contextual moderator found that CIR panels with low ideological diversity had a stronger relationship between analytic rigor and opinion change. Consistent with predictions, this result suggested that a more rigorous deliberative process can help make up for low ideological diversity when it comes to generating shifts in panelist opinions about the ballot measure under discussion.
An indirect implication of the moderation findings concerns the distinction between analytic rigor and democratic process quality. The fact that the former variable was the only one moderated by contextual factors is one more validation of the difference between these two process measures. This conceptual and methodological note has special significance for deliberation scholars, who continue to seek a robust approach to measuring process quality (Black, Burkhalter, et al. 2011). Such assessments should at least make the distinction between the depth of problem and solution analysis versus the relational dynamic among the participants (Gastil & Black 2007).
Finally, we note the relative weakness of the demographic control variables in our analysis, including those we dropped from analysis to avoid significant data loss (see Footnote 3). The pattern we see here is consistent with other studies that have found relatively small or non-existent demographic variations in deliberative experiences (e.g., Hickerson & Gastil 2008;Siu 2009;Sumaktoyo et al. 2016). Given the importance of potential inequalities in deliberative events (e.g., Mendelberg & Karpowitz 2007;Young 2002), it is noteworthy when these variations fail to emerge in such analyses.

Study Limitations
Though pooling survey results across fifteen deliberative events provided a rare glimpse at cross-event patterns, limitations in this study warrant caution when generalizing from our results. Fifteen is a small number when considering the CIR as a unit of analysis, and this made multilevel modeling impossible. As detailed in the Methods section (and in Sumaktoyo et al. 2016), the use of cluster-robust standard errors might offer some reassurance to those who wish to account for the interdependence of the participants within a single event, but it is a poor substitute for modeling effects at different levels of analysis.
With a sufficient number of event-level cases, one could begin to tease apart the effects of contextual variables, such as duration, official authorization, and ideological diversity. Not only did this study analyze those variables separately, but the use of just three contextual variables obscured other group-level differences and interrelationships among them. For example, the four- versus five-day duration variable overlapped with time, since the only five-day CIRs were the first four events, held in 2010-2012. Likewise, authorization covaried with geography: The only officially authorized CIRs all happened in Oregon, with all but two of the pilot tests occurring elsewhere.
Others may wish to improve on this study's measure of deliberative quality, which relied on participant assessments of the process. Though we believe subjective experiences are valid measures of process quality (Knobloch et al. 2013), utilizing more objective measures of process quality could help validate, or raise doubts about, those assumptions. A related problem was that CIR participants rated all aspects of their process highly, resulting in low variance for some variables, particularly democratic discussion.
Similarly, our measures of opinion change and knowledge gains relied on participants' subjective sense of those changes. Pre- and post-deliberation measures would provide more validity to these tests. This is particularly important in the case of opinion change. When participants rated their prior opinion after they deliberated, they were more likely to say that they had previously been undecided, resulting in lower levels of opinion change. These discrepancies may obscure actual opinion change that did take place and subsequently may have reduced our ability to find opinion change or sort out the relationships between opinion change and other relevant variables.

Conclusion
If deliberation hopes to achieve its basic goals, then participant experience matters. Processes that participants rated as more analytically rigorous and democratic produced higher levels of satisfaction and feelings of having the knowledge necessary to make a good decision. Such ideal deliberation, however, did not necessarily result in opinion change.
Context also matters. Longer processes increased participants' confidence in their issue-specific knowledge and led to greater levels of opinion change. Shorter processes were more reliant on analytic rigor to produce panelist satisfaction and learning, though the reverse was true for opinion change. Empowered processes were more likely to yield opinion change. Finally, processes with relatively low ideological division among the panelists were most reliant on analytic rigor to produce opinion change.
A broader view of these findings can provide lessons for both practitioners and scholars. If those who promote deliberation do so with the goal of achieving better democracy, then meeting the ideals of deliberation is necessary. Simply calling something deliberative engagement does not make it so. For such interactions to make a difference for participants, the process needs to foster the careful weighing of information, arguments, and values under conditions that engender equality, respect, and consideration of diverse perspectives. This is important to remember as the CIR model gets adapted in other countries, with recent pilot tests occurring in Korsholm (Finland) and Sion (Switzerland), but it applies equally well to the wider array of deliberative designs being developed in Ireland (Farrell & Suiter 2019), Belgium (Reuchamps 2020), and elsewhere.
Looking across all the findings from this study, the basis of deliberative theory appears sound, though the details remain hazy. Across contexts, positive participant experiences of analytic rigor and democratic quality led to satisfaction and knowledge gains. Though scholars have long made this claim, seldom have outcomes been empirically connected to the presence or absence of deliberative components. This paper attempted to fill that gap. Its findings bolster claims that deliberation itself is what leads to the broader goals of a more informed, engaged, and legitimate democracy. Equally, however, it complicates questions for deliberative proponents and highlights the need to continue to test its basic presumptions rather than assuming that theoretical arguments will be realized in practice. These results indicate that opinion change may be more elusive and context-dependent than previously theorized. Researchers must continue to explore what makes participants change their minds during deliberative events and debate whether opinion change should be considered a measure of deliberative success.

Notes
1 In 2014, participants provided their opinion in a presurvey conducted before the review and reported only their post-event opinion on the end-of-review survey. To test whether this discrepancy influenced reported levels of opinion change, a t-test was performed comparing the absolute value of opinion change for participants in 2014 versus the other years. That test did find a significant difference between the two groups, with those who reported their opinions in a pre-survey showing higher levels of change than those who reported their pre-review opinions retrospectively in the post-CIR survey (t = -2.60, p = 0.01). We discuss this issue in the conclusion as a limitation of our study.
2 The 2010 and 2014 Oregon CIRs were combined to create a distribution of ideology by partisanship among Oregon panelists. CIR panelists were assigned ideology scores within party to match this overall distribution, then ideology SDs were calculated for both of the 2012 CIRs. This simplified form of missing data imputation for Ideological Divergence was used because this dataset had no other significant data loss.
3 We checked the robustness of the regression analyses reported below to make certain that there was no difference in main results depending on the inclusion or exclusion of these three demographic control variables. There were no such differences, as only one of these dropped demographic variables had a significant association with a dependent variable in regression: Education was associated with Learned Enough (B = 0.08, SE = 0.03, p = 0.011) and opinion change (B = 0.15, SE = 0.07, p = 0.038). The minor effects of these controls are consistent with another study using clustered standard errors (Sumaktoyo et al. 2016).
4 We conducted these analyses in SPSS using the Complex Samples analysis commands. This procedure began with the Prepare for Analysis (CSPLAN syntax keyword) that specified the CIRs as the grouping variable and the size of each CIR panel, then we proceeded to regression (CSGLM syntax keyword). A previous analysis had used conventional regression without clustering and produced approximately the same results as reported herein.
5 A previous analysis produced approximately similar results with a different approach, conducting separate regressions for each contextual variable, with partisanship dispersion split at the median, then comparing regression coefficients using a q-test (Cohen 1988).
6 Because the measures of satisfaction and knowledge gain had potential ceiling effects (i.e., the modal response in both cases being the highest value of a 1-5 scale), we reexamined the effects of deliberative process variables using Tobit regression (McBee 2010) in SPSS via the R statistical package extensions. The coefficients were similar in every case, without any changes in their level of statistical significance.
7 We used one-tailed p values throughout our analyses to reflect the directional nature of each hypothesis. On this choice of a significance threshold, see Abelson (1995: 64-67).
8 Because partisanship was used to assign ideological dispersion scores for two of the CIRs, alternative analyses were run for this contextual variable dropping partisanship as a control. The results were nearly identical.
9 Dropping the two CIR cases with imputed values (see Note 2) caused no change in results.
10 Each interaction term was calculated by multiplying a contextual variable by a process variable.
11 Because this result ran contrary to predictions, the stricter two-tailed threshold was applied, p = 0.012.