Creating Persona Chatbots (with a focus on evaluation)

CS263A: Natural Language Processing, UCLA

Explored different aspects of persona injection in open-ended dialogue systems using prompt tuning techniques: in-context learning and chain-of-thought (CoT) prompting. The baseline model is Mistral 7B, and we compare persona-injected outputs against GPT-3.5, GPT-4, Llama 3, and Phi.

Prompt Tuning Techniques

  • In-Context Learning: Example persona-consistent dialogues provided within the prompt, allowing the model to adapt behavior based on demonstrated examples without parameter updates.
  • Chain-of-Thought (CoT): The model reasons step-by-step about persona consistency, explicitly considering the persona's background, speech patterns, and knowledge boundaries before generating a response.
  • Combined (CoT + In-Context): Both techniques applied together for maximum persona alignment.

Evaluation Framework

  • LLMEval (GPT-4 as judge): Automated evaluation of fluency and coherency by prompting GPT-4 to score generated dialogues on semantic consistency with the assigned persona.
  • MMLU: 57-subject benchmark spanning STEM, humanities, and social sciences to measure whether persona injection degrades general reasoning ability.
  • Toxicity and Bias (HDS Score): Harmful content variance across demographic groups to ensure persona-injected models do not introduce disproportionate toxicity.

Results

MetricModel / ConditionScore
Fluency & Coherency (GPTEval)Mistral 7B3.18
GPT-3.53.32
GPT-43.32
Llama 33.79
Phi3.95
Reasoning (MMLU)Baseline47.13%
CoT45.83%
CoT + In-Context43.24%
PerplexityMistral 7B13.9
ToxicityGPT-20.029 ± 0.0066
Our Model0.104 ± 0.047
HDS ScoreGPT-20.076
Our Model0.0258

Key Observations

  • All persona-injected models achieved higher fluency and coherency than human-generated dialogues, with Phi scoring highest at 3.95.
  • MMLU scores decreased with stronger persona alignment (47.13% to 43.24%), suggesting a trade-off between persona fidelity and general reasoning.
  • Our model showed increased but controlled toxicity (0.104 vs. GPT-2's 0.029), while the HDS score dropped from 0.076 to 0.0258, indicating reduced harmful content variance across demographic groups.

Follow this repository for more details.

GPTEval trends

Figure (a): Coherency score trend. Figure (b): Fluency score trend for GPT4Eval.

Correlation matrices

Correlation matrices for (a) coherency and (b) fluency across models and human evaluation. Lighter colors indicate higher correlation.

KDE plots

Kernel density estimate plots of coherency score distribution across models and human evaluators.