Update – April 29, 2025:
The Answer Assistant experiment has concluded.
Answer Assistant is an experiment in which AI-generated answers are verified, edited, and curated by the community before becoming publicly visible. We want to test whether this feature can improve the answer experience and encourage knowledge sharing by helping users get unstuck or get a jump-start on content curation, while maintaining quality.
As we kick off the experiment, we want the Stack Exchange community to know that the team is:
Committed to the Stack Exchange network being a place for human-curated knowledge and information. This experiment explores how that can remain the case in a world with GenAI, while investigating possible new workflows that could benefit existing community members and the next generation of users.
Committed to building solutions that add value for users on the platform. LLMs are part of the world now, and any potential integration must be explored responsibly, in ways that not only provide value (task completion, closing knowledge gaps, etc.) but also create transparency, keep humans in the loop, and encourage human contributions.
Not interested in any outcomes that might dilute the value of the platform. It is not a goal of this experiment to get GenAI content into public view. The goal is to see how users interact with the clearly labeled LLM-originated content and assess it for potential inclusion in the public knowledge base.
At this time we do not plan to expand the experiment to other sites on the network, unless other sites are open to volunteering. The goal is to learn and to take any next steps cautiously, and we will share learnings as this moves forward.
Overview of Answer Assistant experiment
The Answer Assistant experiment will run on several Stack Exchange sites whose moderators agreed to participate in the initial test. Only site moderators and logged-in users with a certain amount of reputation (which can vary per site) will be able to see and verify private AI-generated answers. This curation process ensures that answers are verified by humans, and edited when appropriate, before they become public to all users viewing the Stack Exchange site.
A private answer will be generated by an LLM if the question meets site-specific criteria. The answer will look visually different from human-authored answers, and it will be clearly labeled as AI-generated, private, possibly incorrect, and in need of verification by members of the community. If the answer becomes public, it will be attributed to an account labeled “Answer Bot.”
Human verification determines whether a private AI-suggested answer becomes public or not
A private AI-suggested answer as it would appear to users eligible to view it
Flexible settings for each participating Stack Exchange site
Each community in the Stack Exchange network is unique. The ability to customize the experiment settings — such as what questions might get an AI-generated answer, who can see/evaluate the private answers, and requirements for an answer to become public — allows room to leverage differences between communities and try variations in a controlled way. Limiting the visibility of the private answers helps prevent exposing the answer to users unfamiliar with the topic or community norms and ensures the intended community members make the judgments on answer quality.
The settings below are the defaults for question and answer visibility, but they can be customized per community over the course of the experiment.
Questions that meet the following criteria may receive an AI-generated answer (a rough sketch of this check follows the list):
Older than 72 hours, to leave time for human curation
Posted in 2024 or 2025
Non-negative score (0 or higher)
Unanswered, defined as having no upvoted or accepted answer
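As a rough illustration only, the default eligibility criteria above can be thought of as a simple filter over question records. The field names and the is_eligible helper below are hypothetical, and every threshold is configurable per site:

```python
from datetime import datetime, timedelta, timezone

def is_eligible(question: dict, now: datetime) -> bool:
    """Sketch of the default check for whether a question may get a private AI answer."""
    created = question["created_at"]                             # hypothetical field name
    old_enough = now - created >= timedelta(hours=72)            # leave time for human curation
    recent = created.year in (2024, 2025)                        # posted in 2024 or 2025
    non_negative = question["score"] >= 0                        # score of 0 or higher
    unanswered = not question["has_upvoted_or_accepted_answer"]  # no upvoted or accepted answer
    return old_enough and recent and non_negative and unanswered

# Example usage with a made-up question record:
question = {
    "created_at": datetime(2025, 1, 10, tzinfo=timezone.utc),
    "score": 1,
    "has_upvoted_or_accepted_answer": False,
}
print(is_eligible(question, now=datetime(2025, 2, 1, tzinfo=timezone.utc)))  # True
```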
Users with at least 50 reputation (the default value) on the specific Stack Exchange site will be able to see and evaluate the private answers. This reputation requirement can also be customized per site.
A private answer becomes public if multiple users mark it as ‘correct’. A private answer moves to a deleted state, visible only to site moderators, if multiple users mark it as ‘incorrect’. To handle mixed results, a net “score” must also be reached for an answer to become public or deleted. These specific thresholds (the number of user votes needed and the net score) can be set differently based on the level of site activity. If a user marks an answer as ‘partially correct’, that assessment does not count toward either outcome.
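To make the verification flow concrete, here is a minimal sketch of how the vote and score thresholds might be applied. The specific numbers, vote labels, and function names are illustrative assumptions, not the platform’s actual values or implementation; as noted above, ‘partially correct’ votes count toward neither outcome:

```python
# Illustrative defaults only; real thresholds vary per site and are not disclosed.
MIN_VOTES = 3            # hypothetical: "correct" or "incorrect" votes needed for an outcome
MIN_NET_SCORE = 2        # hypothetical: required margin between correct and incorrect votes
MIN_REP_TO_REVIEW = 50   # default reputation needed to see and evaluate private answers

def can_review(user_reputation: int) -> bool:
    """Only users above the per-site reputation threshold see private AI answers."""
    return user_reputation >= MIN_REP_TO_REVIEW

def resolve(votes: list[str]) -> str:
    """Decide the fate of a private AI-suggested answer from reviewer votes (sketch)."""
    correct = votes.count("correct")
    incorrect = votes.count("incorrect")
    # "partially correct" votes are ignored for both outcomes.
    if correct >= MIN_VOTES and correct - incorrect >= MIN_NET_SCORE:
        return "public"    # published and attributed to the "Answer Bot" account
    if incorrect >= MIN_VOTES and incorrect - correct >= MIN_NET_SCORE:
        return "deleted"   # moved to a deleted state visible only to site moderators
    return "private"       # mixed or insufficient votes: stays private for more review

print(resolve(["correct", "correct", "correct", "partially correct"]))  # -> public
print(resolve(["correct", "incorrect", "incorrect"]))                   # -> private
```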
Answers can be edited while in the private state. If an answer becomes public, editing history on the private version is visible only to moderators and those users who made edits, to prevent the original AI-generated draft from being indexed/crawled. Comments left on the private answer are not displayed if an answer becomes public.
Further details can be found in the help center article.
A careful and cautious approach
You might wonder why we’re moving forward with the experiment, in the face of concerns and sensitivity around AI-generated content across the network. Put simply – in this time of foundational change, we must prepare for many possible futures. User expectations around seeking and contributing knowledge are rapidly shifting, and it’s important to both understand the nuances of that shift and have plans for how to address it. Knowing more about the pitfalls and opportunities related to human/AI collaboration is the key to making informed decisions as the technology evolves. And we learn by experimenting.
We conducted research and facilitated discussions with Stack Exchange community members and moderators about this concept throughout the latter half of 2024, both to get initial feedback from various types of users and to identify communities that saw value in testing it out. Many of the people involved in those conversations expressed concerns about moving forward with this experiment; we also heard from people who were cautiously optimistic about the concept. That feedback was instrumental in shaping how the experiment is designed today.
A common concern expressed was about AI-generated answer quality and the impact these answers could have on the platform. We share many of these concerns. While LLMs currently produce mixed results in terms of quality and accuracy, they are continually improving. We feel that it’s vital to begin experimenting now with ways to safely and responsibly offer this functionality, so Stack Exchange can be better prepared for a time when AI-generated answer quality and accuracy may be more reliable.
Others, particularly Stack Exchange site moderators, were more broadly concerned about the precedent such an experiment might set for the future of the network. To those of you who share these concerns, please know that the moderators of your communities represented them very well. We recognize the need to move slowly and judiciously into any new path for answer creation and validation, with the priority of ensuring that the human-curated nature of Stack Exchange remains intact.
We are conducting this experiment in a manner that we believe is respectful to the concerns expressed. Several Stack Exchange communities have volunteered to be part of the initial test group, and they’re interested in seeing how this could impact goals like reducing unanswered questions and increasing engagement within their community. If there are encouraging signals and results, we can look at possible next steps and other potential goals.
Goals and metrics
These are the primary metrics we’ll be tracking on the Stack Exchange sites participating in the experiment (a rough sketch of how they could be computed follows the list):
% of unanswered questions
% of AI-suggested answers that were voted on (private and public)
# of AI-suggested answers that were deleted/closed or became public
% of public AI-suggested answers with a positive score
# of secondary engagement interactions (votes, edits, comments, views)
# of users performing those interactions
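Purely for illustration, the primary metrics above could be computed from per-question and per-answer records along the following lines; the record shapes and field names are assumptions made for this sketch, not the actual instrumentation:

```python
def primary_metrics(questions: list[dict], ai_answers: list[dict]) -> dict:
    """Sketch of the experiment's primary metrics over hypothetical records."""
    public = [a for a in ai_answers if a["state"] == "public"]
    resolved = [a for a in ai_answers if a["state"] in ("public", "deleted", "closed")]

    def pct(part: int, whole: int) -> float:
        return round(100 * part / whole, 1) if whole else 0.0

    return {
        "pct_unanswered_questions": pct(sum(q["unanswered"] for q in questions), len(questions)),
        "pct_ai_answers_voted_on": pct(sum(a["vote_count"] > 0 for a in ai_answers), len(ai_answers)),
        "num_ai_answers_deleted_or_public": len(resolved),
        "num_ai_answers_public": len(public),
        "pct_public_with_positive_score": pct(sum(a["score"] > 0 for a in public), len(public)),
    }
```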
As with the Question Assistant experiment within Staging Ground on Stack Overflow, the high-level goals are to increase user success and maintain content quality by leveraging AI/ML assistance in contribution.
Next steps
The initial phase of the experiment was made visible to moderators of participating Stack Exchange sites in December. The goals of that phase were to assess initial answer quality and to test for feature bugs.
The current expansion makes private answers visible to some community members on participating test sites, based on reputation. The goal here is to monitor engagement and answer quality, as detailed in the previous section, as well as any fraud or abuse signals. During this stage, we plan to review engagement and quality metrics and may adjust the settings to expand visibility and/or the set of eligible questions.
Initial learnings against the goals and metrics outlined in this post will help shape decisions on any changes or future expansion of this experiment. We’re taking it one step at a time, and we look forward to understanding the various values or benefits each participating community experiences during this stage. We will continue to keep this post updated with findings and any details of next steps as we have them.
As we have stated in other recent communications, the company remains committed to testing AI/ML thoughtfully and purposefully to support the core values of Stack Overflow: human connection, collaboration, and knowledge sharing. The goal is to build and support a healthy ecosystem of active users and community contributors. This experiment will be run in a safe, controlled, and transparent way where humans are always in the loop. We remain open to concluding the experiment early if we find the results unfavorable for any reason.
What would need to be in place for you to feel comfortable seeing Answer Assistant implemented as a controlled experiment in your Stack Exchange community?
FAQ
Which Stack Exchange sites are participating in the experiment?
Arts & Crafts, Raspberry Pi, and User Experience (UX) are currently participating in the experiment. Web Apps was a participant in an earlier stage.
Which LLMs are being used to generate the private AI-suggested answers?
The current experiment is integrated with an existing data partner; however, the feature is designed to work with any LLM in the future. We are not able to disclose the specifics during this phase of the experiment.
Will Answer Bot answers be subject to the same human oversight once 'approved'? Could other community members add comments, downvote, flag as spam, vote to delete etc, as with human answers?
Absolutely. If a private AI-suggested answer becomes public, it is subject to all of the same actions and processes that a human answer would be. Even in the private state, an answer can be flagged for moderator review, and moderators can delete it directly if they see fit.
Will the Answer Bot also respond to questions or feedback, potentially editing or deleting its answer if it's convinced that it's wrong?
At this time, the private AI-suggested answer is fixed and is not subject to further revision based on updates to the question, or any comments on the question or private answer. That is something that could be explored if the initial stages of the experiment point us in that direction.
What about attribution/citation/sourcing on the answers that are suggested?
Right now, sourcing is not included, since the GenAI output does not consistently provide it. We’ve stated that attribution is non-negotiable, and that goes both ways. We are determining what sourcing data will be delivered from LLM providers along with the private AI-suggested answers. For the purposes of this limited experiment, we feel it’s still worthwhile to test the concept and user interactions.
