Conversational UX Design - Metrics

4. Sequence Metrics

If a conversation design includes expandable sequences, then the base sequences and their expansions can each be measured. The Natural Conversation Framework provides unique metrics for the effectiveness and efficiency of the conversation itself, independently of customer satisfaction. In addition, these are two-way metrics: they apply both to the automated agent and to the user. Our prototype dashboard offers the following three metrics...

Number of Sequences: This is a measure of how many primary social actions were initiated by the user or by the agent. Instead of simply reporting the total number of turns, our dashboard groups expansion turns with their base sequences.

Sequence Completion Rate: This is the percentage of initiated base sequences that were completed by the recipient, whether agent or user. In other words, it measures the rate of success across the sequences in a conversation, which is closely related to the level of mutual understanding: if the agent or the user does not understand what the other means or is doing, they cannot complete the sequence the other initiated. This is critical because mutual understanding is one of the distinctive goals of conversation.

Interactional Efficiency: This is a measure of how much extra work the user or the agent had to do in the conversation. The expandable-sequence interaction model gives us a theoretical minimum number of turns a given conversation could have. That is, if every completed sequence were compact, the theoretical minimum would be the number of completed base sequences times two. We can then compare that minimum to the actual number of turns in the conversation and get a measure of the rate of sequence expansion.
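As a concrete illustration, all three metrics can be computed from per-sequence counts. The following Python sketch assumes a hypothetical data model, in which each logged sequence records its total turn count and whether the recipient completed it; the names and structure are illustrative, not the dashboard's actual implementation.

```python
# Sketch of the three sequence metrics, under an assumed data model:
# each conversation is a list of sequences, and each sequence records
# its total number of turns and whether the recipient completed it.

def sequence_metrics(conversations):
    """Return (number of sequences, completion rate, efficiency)."""
    sequences = 0
    completed = 0
    actual_turns = 0
    for conv in conversations:
        for seq in conv:
            sequences += 1
            if seq["completed"]:
                completed += 1
            actual_turns += seq["turns"]
    completion_rate = completed / sequences
    # Theoretical minimum: every completed sequence is compact,
    # i.e. exactly two turns (initiation + completion).
    efficiency = (2 * completed) / actual_turns
    return sequences, completion_rate, efficiency

# Example: one completed base sequence that took eight turns in all
# (two base turns plus six expansion turns).
print(sequence_metrics([[{"completed": True, "turns": 8}]]))
# -> (1, 1.0, 0.25)
```

On one reading, this example corresponds to a transcript like Set B below: the base sequence completes, so the completion rate is 1.0, but the expansions drag the efficiency down to 2/8 = 0.25, the high-completion, low-efficiency pattern discussed in this section.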

Taken together, these sequence metrics tell us different things about conversations with our agent, Alma. Using extreme sets of artificial data, we can demonstrate how the metrics change with different kinds of sequences.

Example of Set A
01 U:  what computer beat Jeopardy?
02 A:  An IBM computer named Watson.
03 U:  what computer beat a human chess champion?
04 A:  It was named Deep Blue.

Example of Set B
01 U:  what computer beat a human chess champion?
02 A:  It was named Deep Blue.
03 U:  Deep what?
04 A:  Blue.
05 U:  what do you mean?
06 A:  The IBM computer was named Deep Blue.
07 U:  oh ok
08 A:  Is there anything else I can do for you?

Example of Set C
01 U:  what's in theaters?
02 A:  What do you mean?
03 U:  what movies are playing in theaters now?
04 A:  I'm sorry, I still don't understand.

The resulting sequence metrics for these three sets of extreme data look like this...

      Number of    Sequence           Interactional
      Sequences    Completion Rate    Efficiency
A     17.4         0.88               0.92
B     10.4         0.83               0.49
C      7.2         0.14               0.09


In data set A, we included only completed sequences initiated by the user and no expansions by the user. The combined Sequence Completion Rate for agent and user was 88%, and the Interactional Efficiency was 92%. In data set B, we again included only completed sequences initiated by the user, but also numerous expansions by the user. In this case, the combined Sequence Completion Rate was still high, at 83%, but Interactional Efficiency dropped significantly to 49%. Finally, in data set C, we included only conversations in which none of the substantive sequences initiated by the user were completed. The Sequence Completion Rate plummeted to 14% and Interactional Efficiency to 9%.

In short, if both the Sequence Completion Rate and Interactional Efficiency are high, the conversations themselves are good. If both are very low, the conversations are bad. But if Sequence Completion is high while Interactional Efficiency is moderate, the conversations are successful, but the user or agent is doing additional work to achieve that success. This invites the conversation designer to explore the nature of those sequence expansions. If they are eliciting details, the topic of conversation may be inherently complex: buying airline tickets, for example, involves many details and decisions, so moderate Interactional Efficiency may be normal for that activity. However, if the expansions are primarily understanding repairs, the conversation designer should re-evaluate the terminology the agent uses and the knowledge it assumes, and determine whether the conversation can be redesigned to be more understandable from the start. With inherently complex topics, it may not be possible. Note, however, that the fact that the user and agent can still succeed in the face of understanding troubles demonstrates the value of conversational repair features.

These Sequence Metrics also help us disentangle user dissatisfaction with the agent itself from dissatisfaction with its message, for example, company policies. If a customer reports dissatisfaction after an interaction with a company's virtual agent, and the Sequence Completion and Interactional Efficiency rates for that conversation are high, then we know that the customer did not have trouble understanding the agent, nor the agent the customer. Rather, the dissatisfaction must have come from the message delivered by the agent, not from the quality of the conversation itself. In other words, if the user complains and the agent recognizes and responds appropriately to that complaint, then the problem lies not in the agent's ability to understand but in the substance of the complaint itself.

How it works: In order to measure the occurrence of base sequences and their expansions in conversation logs, we label both the user's and the agent's actions inside the conversation flow itself. We set context variables on each node in the dialog tree to indicate whether the user inputs and agent outputs associated with that node are parts of base sequences or of expansions. Then, as users interact with the agent, the conversations label themselves! The only complication is when a user's input defaults, that is, matches no condition in the dialog tree. For this case, we provide a modifier that represents the average percentage of defaulted inputs that are initiators of base sequences. We set this modifier based on prior data in order to correct the metrics.
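The self-labeling approach and the default-input modifier can be sketched as follows. The node names, the context-variable scheme, and the modifier value below are all hypothetical, invented for illustration; they are not taken from an actual Watson dialog tree.

```python
# Hypothetical sketch of self-labeling dialog nodes and the
# defaulted-input correction described above. Node names and the
# context-variable scheme are invented for illustration.

# Context variables attached to each dialog node: the role its
# turn plays in a sequence (base initiation, base completion,
# or expansion).
NODE_CONTEXT = {
    "ask_computer_fact": {"role": "base_initiation"},
    "answer_computer_fact": {"role": "base_completion"},
    "repair_what_do_you_mean": {"role": "expansion"},
}

def count_base_initiations(logged_turns, default_modifier=0.6):
    """Count base-sequence initiations in a conversation log.

    logged_turns: list of matched node names, or None where the
        user's input defaulted (matched no condition in the tree)
    default_modifier: assumed average fraction of defaulted inputs
        that actually initiate base sequences, set from prior data
    """
    labeled = sum(
        1 for node in logged_turns
        if node is not None
        and NODE_CONTEXT[node]["role"] == "base_initiation"
    )
    defaulted = sum(1 for node in logged_turns if node is None)
    # Correct the count for defaulted inputs that were, on average,
    # really initiations of base sequences.
    return labeled + default_modifier * defaulted

turns = ["ask_computer_fact", "answer_computer_fact", None]
result = count_base_initiations(turns)  # 1 labeled + 0.6 of 1 defaulted
```

Because the labels travel with the nodes, no post-hoc annotation of the logs is needed; only the defaulted inputs require the statistical correction.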


Summary: Natural Conversation Framework

We see, then, the four components of our Natural Conversation Framework for Conversational UX Design: 1) an Interaction Model, 2) Conversation Navigation, 3) Common Activities and 4) Sequence Metrics. Each component is based on the model of expandable sequences documented in the literature of Conversation Analysis. The framework is abstract enough to apply to any conversation platform, but so far we have implemented it on two: Watson Dialog and Watson Conversation.

Continue on to learn more about the Conversational UX Design Process...

Project Members

Dr. Robert J. Moore,
Conversation Analyst, Lead

Raphael Arar,
UX Designer & Researcher

Dr. Margaret H. Szymanski,
Conversation Analyst

Dr. Guang-Jie Ren,