Timeline – 4 weeks

Advisor – Steven Dow


During Spring 2017, my teammate Tushar and I, supervised by Prof. Scott Klemmer (Design Lab), conducted a study exploring the effectiveness of conversational interfaces as a tool to encourage learning.

The project involved a 5-week study with 19 participants (15 in the main study plus 4 in the pilot study). My responsibilities included:

  • Planning the study: recruiting participants, performing the study preflight check (data analysis, plans, and measurements), and working with our advisor on validation.
  • Working with the developer to help design the prototype and iterate on the system design.
  • Designing the Wizard of Oz prototype for the chatbot experience using a chat API called Drift.
  • Developing the conversational interface (chatbot) production rules and script for an effective Wizard of Oz prototype.
  • Designing surveys to collect participant feedback and gain insights into their experience with the beta.
  • Performing usability tests and an observational study on our prototype with participants.
  • Evaluating study results and analyzing quantitative and qualitative data to test the statistical significance of the impact of conversational interfaces on learning (non-parametric Mann-Whitney U test).




Summary / Overview



There is a substantial body of research indicating that engaging learners in a planned discussion can be a powerful tool to motivate and encourage learning.

With the rapid growth of conversational interfaces, driven by recent advances in Artificial Intelligence (AI) and Natural Language Processing (NLP), we felt that such discussions could be generated for learners artificially through chatbots.

Hence, our challenge was to “Explore the effectiveness of Conversational Interfaces (such as chatbots) as a tool for stimulating such discussions and aiding learning on online learning platforms.”


We hypothesized that learners who engaged in such discussions with an advanced chatbot would show improved learning metrics and higher-quality knowledge retention.

For our research project, we decided to focus on the impact of conversational interfaces on a well-established citizen science platform called “Gut Instinct”, developed by researchers from the American Gut Project and the Design Lab (ACM CHI 2017), which teaches users how to write good research questions for effective citizen science contribution.





“Gut Instinct” is a platform where users learn about the human gut microbiome by collaboratively generating ideas and questions [1]. To enable meaningful contributions to gut microbiome research, part of the platform focuses on teaching its users how to create “good” questions.

We designed a between-subjects experiment with 15 participants. It compared the learning metrics of users on the current platform without a conversational interface (the control condition) against those of users on the platform with an advanced chatbot that could engage them in discussion (the treatment condition).


We simulated the AI chatbot using a Wizard of Oz technique: a researcher acted as the chatbot (without the user’s knowledge), responding to users’ queries with guidance from a pre-decided script and production rules that we developed for the study.
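
To illustrate what a production-rule lookup for the wizard could look like, here is a minimal sketch. The rules, wording, and the `suggest_reply` helper are all hypothetical and are not the script we used in the study; the only thing taken from the study is the list of concept names the platform teaches.

```python
# Hypothetical production rules for a Wizard of Oz chatbot script.
# The keywords are the concept names taught on the platform; the reply
# wording is invented for illustration and is NOT the study's script.
CONCEPT_KEYWORDS = ["answerable", "definite", "cause", "operational", "simple"]

FALLBACK = "Can you tell me more about what you are trying to do?"


def suggest_reply(message: str) -> str:
    """Return a scripted reply the human wizard could send or adapt."""
    text = message.lower()
    # Rule 1: the learner mentions a concept by name.
    for keyword in CONCEPT_KEYWORDS:
        if keyword in text:
            return (
                f"Let's look at the '{keyword}' concept. "
                "Which part is unclear: the definition or the example?"
            )
    # Rule 2: the learner asks a question without naming a concept.
    if "?" in text:
        return "Good question. Which module are you working on right now?"
    # Rule 3: nothing matched, so ask for more context.
    return FALLBACK


print(suggest_reply("I don't understand what Operational means"))
```

Keeping the rules in a fixed order means the wizard always sees the same suggestion for the same kind of message, which helps keep the simulated chatbot consistent across sessions.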

Participants were tested using 2 metrics:

  1. Quiz scores that tested their understanding of the concepts, and
  2. Quality of the research questions they generated after their training phase (graded by a subject expert).



Experiment Design and Iteration – After formulating our hypothesis, we designed an analysis plan that established objective measures and data analysis methodologies.

We got feedback from our advisor (Scott Klemmer) and from the original developers of the platform (Vineet from the American Gut Project), and iterated on our prototype design for the control condition.


 Analysis Plan


 Chatbot Script


 Pilot Study (Chatbot simulation using Messenger)


We also conducted pilot sessions over Facebook Messenger with a few participants before the study to iterate on our initial chatbot production rules. These sessions helped us understand the biases users had toward chatbots, as well as the quirks and expectations participants might have while using a chatbot to learn the task.


<Messenger photo> <photos of chatbot scripts>



Our between-subjects experiment involved 15 students (3 female, 12 male) recruited from an American university. The participants chosen were health-conscious individuals with little exposure to research.

Their ages ranged from 19 to 28 years and the median age was 24 years.

Participants were randomly assigned to one of two conditions:

  1. Control Condition – Gut-Instinct without a chatbot
  2. Treatment Condition – Gut-Instinct with a chatbot
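
As a rough sketch of how such a random split can be done reproducibly, the snippet below shuffles participant IDs and cuts the list into an 8-person control group and a 7-person treatment group (the group sizes from our study). The participant IDs, the seed, and the `assign_conditions` helper are illustrative assumptions, not the procedure we actually followed.

```python
import random


def assign_conditions(participant_ids, n_control, seed=None):
    """Randomly split participants into control and treatment groups."""
    rng = random.Random(seed)  # a seeded generator makes the split repeatable
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    return {
        "control": sorted(shuffled[:n_control]),
        "treatment": sorted(shuffled[n_control:]),
    }


# Hypothetical participant IDs P01..P15; the seed value is arbitrary.
participants = [f"P{i:02d}" for i in range(1, 16)]
groups = assign_conditions(participants, n_control=8, seed=2017)
print(groups)
```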

Each participant spent 30-45 minutes in the study session, which involved three modules that taught participants how to ask great research questions based on 5 concepts:

  1. Answerable
  2. Definite
  3. A link between a cause and an effect
  4. Operational
  5. Simple

They were tested on their performance in a quiz and on the questions they generated for citizen science research. These individual questions were graded on a rubric by a subject expert from the American Gut Project (Vineet Pandey).


  • I moderated and acted as an observer and note-taker in all the control-condition sessions (8/15) along with my teammate Tushar.
  • For the treatment-condition sessions (7/15), I controlled the chatbot (as the wizard for our Wizard of Oz prototype) from another room over Drift API chat, using the AI production rules we had developed and iterated on previously.
  • I also conducted post-session interviews with the participants to gain feedback and insights on their interaction with the prototype and chatbot. We also sent surveys to our participants to get additional feedback about their experience with our prototype.


Although the results did show improved quiz scores for participants who used the chatbot as a learning aid, the difference wasn't significant* (p=0.06432 > 0.05).

p=0.06432 (Mann-Whitney test U=64, n1=7, n2=8, p > 0.05 two-tailed)

p=0.865 (Mann-Whitney test U=26, n1=7, n2=8, p > 0.05 two-tailed)

Similarly, for the quality of the questions generated by the participants, our experiment did not support our hypothesis: there was no significant difference* between the treatment and control conditions (p=0.865 > 0.05; Mann-Whitney U=26, n1=7, n2=8, two-tailed).


* (Note: We performed a non-parametric Mann-Whitney U test here instead of a t-test to measure statistical significance, because our data was not normally distributed and our two groups had unequal numbers of participants.)
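
For readers unfamiliar with the test, here is a minimal pure-Python sketch of a two-tailed Mann-Whitney U test with an exact permutation p-value, which is feasible for groups as small as ours (7 and 8 participants). The scores below are made up for illustration and are not our study data, and the helper names are our own; this is not necessarily how we ran the analysis.

```python
from itertools import combinations


def mann_whitney_u(a, b):
    """U statistic for sample a: count of pairs (x, y) with x > y; ties count 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


def exact_two_sided_p(a, b):
    """Exact two-sided p-value by enumerating every relabeling of the pooled data."""
    pooled = list(a) + list(b)
    n1 = len(a)
    mean_u = n1 * len(b) / 2  # expected U under the null hypothesis
    observed = abs(mann_whitney_u(a, b) - mean_u)
    indices = range(len(pooled))
    hits = total = 0
    for chosen in combinations(indices, n1):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        # Count relabelings at least as extreme as the observed split.
        if abs(mann_whitney_u(group_a, group_b) - mean_u) >= observed - 1e-9:
            hits += 1
    return hits / total


# Hypothetical quiz scores (NOT the study data): 7 treatment, 8 control.
treatment = [9, 8, 10, 7, 9, 8, 10]
control = [7, 6, 8, 9, 5, 7, 6, 8]

u_treatment = mann_whitney_u(treatment, control)
u_control = mann_whitney_u(control, treatment)
print("U (treatment):", u_treatment, "U (control):", u_control)
print("two-sided p:", exact_two_sided_p(treatment, control))
```

The two U values always add up to n1 × n2, and the test compares how far the observed U lies from the midpoint n1 × n2 / 2. Because it works on the ordering of scores rather than their raw values, it does not assume normally distributed data.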

To better understand the results, we turned to our post-experiment interviews and survey. We realized that although many users did engage with the chatbot at the start, they did not get involved in the extended, constructive discussion we had hoped for (which might eventually have yielded better learning).

In fact, our qualitative feedback revealed that most users had a preconceived bias that an AI chatbot would not be able to help them in the first place and that chatbots would be an annoyance to interact with. One participant said, “If I knew I was talking to a person instead of an AI chatbot, I would’ve argued my case further and had a discussion…”, even after being instructed to ask the chatbot whenever they had any confusion or doubt.


Also, while analyzing the quiz scores for each individual concept, we saw the highest gains in scores for the Answerable and Operational concepts (Figure 5).

Comparing observational notes and user feedback, we noticed that people did engage with the chatbot for clarification on those two concepts, because their examples were relatively difficult and confusing.

However, it’s hard to conclude that the chatbot was necessarily responsible for the slight increase in scores for the other concepts.




So in conclusion, the prototype "failed" in our user testing phase, and we failed to reject our null hypothesis. But it definitely taught us a lot about introducing a new technology toward which users might already hold a negative bias.