An Evaluation Framework for Multimodal Interaction

(Wechsung, 2013)
Wechsung, I. (2013). An evaluation framework for multimodal interaction. Berlin: Springer.



Over the last fifteen years, multimodality has broadened its horizons
thanks to the development of technologies such as speech. Although input
devices such as the mouse or the keyboard were favored by users decades
ago, there is now a growing preference for touch and voice-recognition
interfaces.

Another trend that has increased the popularity of speech and weakened the
dominance of the keyboard and mouse is the touchscreen, whose use has
intensified as users have gained the ability to buy smartphones, tablets,
and touchscreen computers. (Explain with figures from the Common Sense
Media survey; the example of BlackBerry's decline could also be
presented; 9 out of 10 laptop sales in North America are touchscreen
models). (cite the example of Siri or of

Definitions of multimodal

But what is multimodal? How can it be defined without getting caught in
the mutually exclusive tensions between psychological and technological
orientations?

Multimodal dialogue systems are systems which enable human-
machine interaction through a number of media, making use of
different sensory channels. The understanding of the term media
in the scientific community is, in contrast to the term modality,
mostly uniform. Media is associated with the physical realization
respectively presentation of information via input and output
devices (cf e.g. Bernsen, 1997; Gibbon, Moore, & Winski, 1998;
Hovy & Arens, 1990; Jokinen & Raike, 2003; Sturm, 2005).

Description of other definitions

Thus, the three senses sight, hearing, and touch correspond to the three
perceptual channels. Thereby the terms visual and auditive refer to the
perception and the sensory modalities; the terms optical and acoustical refer
to physical (and not physiological) parameters (Schomaker et al., 1995).

According to Charwat's (1992) definition only three different modalities
respectively three different human senses can be distinguished. Although
the aforementioned senses are nowadays those with the highest relevance
for human-computer-interaction (HCI), at least three more senses (smell,
vestibular, taste) are defined in physiology.
Another definition of modality is offered by Bernsen (2008):
Modality is a way of representing information in some physical medium.
Thus, a modality is defined by its physical medium and its particular way of
With this definition, Bernsen (2008) moves away from the physiological
understanding of modality: Modalities according to Bernsen (2008) refer to
ways of information representation rather than to the human sense. Thus,
he broadens the term modality: Humans use many ways to represent
information and these different ways of information representation may refer
to the same sensory modality (e.g. images and text are different ways of
representation but both refer to vision). A multimodal system in Bernsen's
sense is a system employing at least two different modalities (ways of
information representation) for input and/or output.

Multimodal systems

A multimodal system employs different interaction techniques, and a user
needs different sensory modalities to interact with such systems. However,
most definitions, such as those presented above, focus on one point of
view or the other. Möller et al. (2009) consider both points of view,
stating that "multimodal dialogue systems are systems which enable
human-machine interaction through a number of media, making use of
different sensory channels".

Types of multimodal systems

Nigay and Coutaz (1993) identified four types of multimodal systems:

Exclusive interactions: the system offers different modalities, but they
can only be used sequentially (one modality at a time).
Alternate interactions: the system offers different modalities. As with
exclusive interactions, the modalities can only be used one after another,
but the information they carry may be related.
Simultaneous interactions: the modalities can be used in parallel.
Synergistic interactions: the modalities can be used in parallel and the
information they carry may be related.
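The four types can be read as the combination of two yes/no properties: whether modalities may be used in parallel, and whether the information from different modalities may be related. The following is a minimal sketch of that reading, not a formulation taken from Nigay and Coutaz (1993); the names `InteractionType` and `classify` and both parameter names are assumptions made for illustration.

```python
from enum import Enum


class InteractionType(Enum):
    EXCLUSIVE = "exclusive"
    ALTERNATE = "alternate"
    SIMULTANEOUS = "simultaneous"
    SYNERGISTIC = "synergistic"


def classify(parallel_use: bool, related_information: bool) -> InteractionType:
    """Map the two properties from the list above to one of the four types.

    parallel_use: modalities can be used at the same time (hypothetical flag).
    related_information: information across modalities may be related
    (hypothetical flag).
    """
    if parallel_use:
        if related_information:
            return InteractionType.SYNERGISTIC
        return InteractionType.SIMULTANEOUS
    if related_information:
        return InteractionType.ALTERNATE
    return InteractionType.EXCLUSIVE
```

For example, a system that accepts speech and touch at the same time and combines them into one command would be classified with `classify(True, True)`.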


The types of multimodal systems show that multimodality can take
different forms in the interaction with the user, but for multimodality
to exist there must be different inputs and outputs for interaction.
Successful functioning of multimodality does not by itself imply user
satisfaction. This research therefore addresses multimodal educational
content that enables natural interaction, where what counts as natural is
defined by relevance, context, and the specific needs of the user.

Cognitive foundations of multimodal interaction

By providing multiple communication channels, multimodal systems are
assumed to support human information processing by drawing on different
cognitive resources. This assumption is largely based on cognitive
theories that postulate multiple modality-specific processing resources.

Baddeley's working memory model describes the processes of short-term
memory in terms of three components: the central executive and the two
subsystems it coordinates, the visuo-spatial sketchpad and the
phonological loop.

With respect to long-term memory, the encoding of information into mental
representations is assumed to be largely modality-specific. Paivio's
(1986) dual coding theory postulates two largely independent cognitive
systems: the nonverbal imagery system and the verbal system. Similar to
Baddeley's phonological loop, the verbal system processes verbal
information, while the imagery system, analogous to the visuo-spatial
sketchpad, processes visual-spatial information. According to Paivio
(1986), findings from neuropsychology support these hypotheses. It has
been shown that, depending on the type of information (verbal vs.
spatial), different brain areas are
Even so, these systems are connected. This explains why multimodal
presentations, e.g. verbal-auditory information paired with visual
information, can be superior to unimodal presentations. Dual coding leads
to higher recognition performance and
Multiple Resource Theory (MRT) by Wickens (2002, cf. Section 3.2.1).
This theory holds that presenting redundant information and distributing
information across several channels reduces the overall cognitive load
experienced by the user. With lower cognitive load, errors are less
likely and the interaction is more robust.


Interaction Performance Aspects on the User Side
All processing steps described above can be mapped to the interaction performance aspects proposed by Möller et
al. (2009) and Wechsung et al. (2012a). These aspects are perceptual effort, cognitive workload and response effort.
Perceptual effort. Perceptual effort is the effort required for decoding the system messages, and for understanding
and extracting their meaning (Zimbardo, 1995), e.g. listening-effort or reading effort. This aspect refers to the
perceptual modalities described above. The Borg scale can be used to assess perceptual effort (Borg, 1982).
Cognitive workload. The cognitive workload is defined as the specification of the costs of task performance, such
as the necessary information processing capacity and resources (De Waard, 1996). It refers to the processing codes
and processing stages. An overview of methods assessing cognitive workload is given in De Waard (1996) and Jahn
et al. (2005). A popular method is the NASA-TLX questionnaire (Hart & Staveland, 1988). A lightweight
instrument shown to have excellent psychometric properties (Sauro & Dumas, 2009) even in comparison to more
elaborate measures (De Waard, 1996) is the Rating Scale Mental Effort (RSME) by Zijlstra (1993). Note that the
RSME is also known as the SMEQ (Subjective Mental Effort Questionnaire).
Physical response effort. The physical response effort is the effort required to communicate with the system
(Möller et al., 2009), such as the effort required for typing in an answer or pushing a button. This aspect refers to the
response codes.
A scale specifically designed to measure physical response effort is, to the author's knowledge, not available.
However, the questionnaire proposed in the ITU-T Recommendation P.851 (ITU-T Rec. P.851, 2003) contains items
related to physical response effort. Also, an adapted version of the RSME (Zijlstra, 1993) may be used.
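As a rough illustration of how a self-report workload score might be computed, the sketch below calculates the unweighted ("raw") NASA-TLX score as the mean of the six subscale ratings. The function name `raw_tlx`, the dictionary keys, and the 0 to 100 rating range are assumptions of this example; the weighted procedure based on pairwise comparisons is not shown.

```python
from typing import Mapping

# The six NASA-TLX subscales; the key spellings are an assumption of this sketch.
TLX_SUBSCALES = (
    "mental_demand",
    "physical_demand",
    "temporal_demand",
    "performance",
    "effort",
    "frustration",
)


def raw_tlx(ratings: Mapping[str, float]) -> float:
    """Return the unweighted mean of the six subscale ratings (assumed 0-100)."""
    missing = [name for name in TLX_SUBSCALES if name not in ratings]
    if missing:
        raise ValueError(f"missing subscales: {missing}")
    for name in TLX_SUBSCALES:
        if not 0 <= ratings[name] <= 100:
            raise ValueError(f"rating for {name} is outside 0-100")
    return sum(ratings[name] for name in TLX_SUBSCALES) / len(TLX_SUBSCALES)
```

A single-item instrument such as the RSME needs no aggregation of this kind, which is part of what makes it lightweight.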

The degree of interference (and consequently the workload and the effort) increases with the degree to which
different tasks or information refer to the same processing dimensions (see Sec. Processing Steps on the User
To measure performance on the user side, peri-physiological parameters, derived from the user's body, can be used.
These measures include pupil diameter, eye tracking and psycho-physiological measures like electrocardiography
(ECG), electromyography (EMG), electroencephalography (EEG) and electro-dermal activity (EDA) (Schleicher,
2009). Generally, these measures are rather unspecific and the valence of a situation (positive or negative) is not
determinable even for EMG measures (Mahlke & Minge, 2006). Consequently, drawing inference based solely on
these methods is difficult. Other possible data sources are log-files, which may be employed to record task success,
task duration, or modality choice. Please note that for all the performance aspects, questionnaires using self-report
are mentioned. Self-reports require the user to judge their performance. If such measurements are taken, the
experienced workload is measured. The experienced performance and the performance assessed via indirect
measurements as described above do not necessarily have to correspond.
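The log-file measures mentioned above (task success, task duration, modality choice) can be aggregated in a few lines. The following is a minimal sketch under assumed names: the `TaskLog` record, its fields, and the `summarize` function are hypothetical and do not correspond to any particular logging format.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Union


@dataclass
class TaskLog:
    """One logged task attempt (hypothetical record layout)."""
    task_id: str
    success: bool
    duration_s: float
    modality: str


def summarize(logs: List[TaskLog]) -> Dict[str, Union[float, Dict[str, int]]]:
    """Aggregate task success, task duration and modality choice."""
    if not logs:
        raise ValueError("no log entries to summarize")
    return {
        # Share of task attempts that were completed successfully.
        "success_rate": sum(entry.success for entry in logs) / len(logs),
        # Average task duration in seconds.
        "mean_duration_s": mean(entry.duration_s for entry in logs),
        # How often each modality was chosen.
        "modality_counts": dict(Counter(entry.modality for entry in logs)),
    }
```

Such indirect measures can then be set against the self-reported scores, since the two need not agree.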

Processing Steps on the System Side
On the system side, seven processing steps have been identified based on the frameworks of Lopez Cozar and Araki
(2005) and Herzog and Reithinger (2006).
Input Processing. In the first step the input of the various sensors (e.g. microphones, face recognition, gesture
recognition) is processed (Herzog & Reithinger, 2006). The input is decoded into a format understandable to the
system, e.g. from acoustics to a text string in the case of speech input.
Modality Specific Interpretation. In this step, the decoded input is further transformed into symbolic
information and meaning is assigned to the data (Herzog & Reithinger, 2006). For example, a sequence of words is
analysed to obtain its meaning (Lopez Cozar & Araki, 2005).
Fusion. This is the stage in which the meaning obtained from the different sensors is merged and combined into
one coherent representation, in order to determine the user's intention (Lopez Cozar & Araki, 2005; Herzog &
Reithinger, 2006).
Dialogue Management. The dialogue management decides on the next steps or actions to be taken, in order to
maintain dialogue coherence and to lead the dialogue to the intended goal (Gibbon, Moore, & Winski, 1998; Lopez
Cozar & Araki, 2005; Herzog & Reithinger, 2006).
Fission. The fission operation selects the modalities that present the output and coordinates them (Lopez Cozar &
Araki, 2005; Herzog & Reithinger, 2006).
Modality Specific Response Generation. After fission, modality specific responses are generated; here the abstract
output information is transformed into media objects understandable to the user (Lopez Cozar & Araki, 2005;
Herzog & Reithinger, 2006).
Output Rendering. Finally, output rendering takes place: the actual presentation of the coordinated system response
in the defined media channels, such as speakers and displays (Lopez Cozar & Araki, 2005; Herzog & Reithinger,
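The processing steps listed above can be pictured as a chain of functions, each consuming the output of the previous one. The following toy sketch only illustrates that ordering; every function body is a placeholder assumption (string normalization instead of speech recognition, dictionary merging instead of real fusion), and none of the names come from the cited frameworks.

```python
def input_processing(raw):
    # Decode raw sensor input into a system-readable form (here: tidy strings).
    return {modality: signal.strip().lower() for modality, signal in raw.items()}


def modality_specific_interpretation(decoded):
    # Assign meaning to each modality's input separately.
    meanings = {}
    words = decoded.get("speech", "").split()
    if len(words) >= 2:
        meanings["speech"] = {"action": words[0], "object": words[1]}
    if "touch" in decoded:
        meanings["touch"] = {"target": decoded["touch"]}
    return meanings


def fusion(meanings):
    # Merge the per-modality meanings into one representation of the intention.
    intention = {}
    for meaning in meanings.values():
        intention.update(meaning)
    return intention


def dialogue_management(intention):
    # Decide the next action: confirm if complete, otherwise ask again.
    if all(key in intention for key in ("action", "object", "target")):
        return {"act": "confirm", **intention}
    return {"act": "ask_again"}


def fission(action):
    # Select the output channels that will present the response.
    return ["display", "speaker"] if action["act"] == "confirm" else ["speaker"]


def modality_specific_response_generation(action, channels):
    # Turn the abstract action into one media object per channel.
    if action["act"] == "confirm":
        text = f"{action['action']} {action['object']} at {action['target']}"
    else:
        text = "please repeat"
    return {channel: text for channel in channels}


def output_rendering(responses):
    # Present the responses (here: return them as labeled strings).
    return [f"[{channel}] {text}" for channel, text in responses.items()]


def run_pipeline(raw):
    decoded = input_processing(raw)
    meanings = modality_specific_interpretation(decoded)
    intention = fusion(meanings)
    action = dialogue_management(intention)
    channels = fission(action)
    responses = modality_specific_response_generation(action, channels)
    return output_rendering(responses)
```

Calling `run_pipeline({"speech": " Show hotels ", "touch": "Map-Area-3"})` passes a spoken command and a touch target through all stages; with incomplete input the toy dialogue manager falls back to asking the user to repeat.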

