Gesture input and annotation for interactive systems*

Paolo Barattini and Andrea Corradini

*The research presented in this paper as part of the LOCOBOT Project has been financed by the European Commission grant N. FP7 NMP 260101. Paolo Barattini (corresponding author) is with Ridgeback sas, Turin, Italy; phone: +39-0172-575087; e-mail: paolo.barattini@yahoo.it. Andrea Corradini is with the IT College of Media and Design, Copenhagen, Denmark; e-mail: andc@kea.dk.

Abstract - This paper addresses the need, within the robotics community, for a shared definition and repository of gestures (referred to as a gestabulary) and for a common annotation system. This need arises from the necessity to create a common ground on which to build effective Human Robot Interaction (HRI) systems.
Over the last couple of decades, significant efforts have been
made towards the development of user interfaces for human
robot interaction by means of a combination of natural input
modes such as visual, audio, pen, gesture, etc.
These body-centered intelligent interfaces not only substitute
for the common interface devices but can also be exploited to
extend their functionality.
While earlier systems and prototypes considered the input modes individually, it quickly became apparent that the different modalities should be considered in combination. The rationale behind this finding is that each single modality can be used to reinforce and complement the semantic information delivered over the other input channels.
One of the most promising interaction modes is the use of
natural gestures. For gestural interaction between a mobile system such as a robot and human users, visual information is particularly relevant because it gives the system the capability to observe its operational environment in an active manner.
Despite its relative success, the use of gesture has so far been confined to a few scenarios and application contexts. This is due to the lack of a technical definition of what a gesture is, which in turn results in the lack of a classification of the different kinds of human gestures.


I. ON GESTURES
The keyboard has been the main input device for many
years. Thereafter, the widespread introduction of the mouse in the early 1980s changed the way people interacted with computers. Lately, a large number of input devices, such as those based on pen, haptics and finger movements, have
appeared. The main impetus driving the development of new
input technologies has been the demand for more natural
interaction systems. Several promising user interfaces that
integrate various natural interaction modes [7,12,15] and/or
use tangible objects [1,2,24], have been put forward.
Speech, the primary human communication mode, has
been successfully integrated in several commercial and
prototype systems. From voice commands [10], speech
interfaces have evolved into conversational interfaces [6]
which are a metaphor of a conversation modeled after human-
human conversations. Several gesture systems have also been proposed to date, yet we are not aware of any of them capable
of reaching near-human recognition performance.
In the areas of computer science and engineering, gesture
recognition has been approached within a general pattern
recognition framework and therefore with the same tools and
techniques adopted in other research areas like speech and
handwriting recognition.
While speech is fundamentally a sound wave, i.e. a temporal sequence of alternating high and low pressure pulses in the medium through which the wave travels, and while handwriting can be seen as a temporal sequence of ink on a 2D surface, gestures are interpreted as a set of connected spatial movements. From this perspective, a gesture is a trajectory in 3D space, and as such it resembles handwriting in a higher-dimensional space. The difficulty in dealing with gesture is thus mainly due to its spatio-temporal variation.
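As a purely illustrative sketch (not part of the original work), this trajectory view can be made concrete by representing a gesture as a time-stamped sequence of 3D points; the class and the normalization step below are assumptions chosen for clarity, not a prescribed representation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class GestureSample:
    """One tracked 3D point of a hand or limb at a given time."""
    t: float  # timestamp in seconds
    x: float
    y: float
    z: float

def normalize(traj: List[GestureSample]) -> List[GestureSample]:
    """Translate a (non-empty) trajectory to its centroid and scale it to unit
    size, so that matching is less sensitive to where, and how large, the
    gesture was performed, one source of intra- and inter-personal variation."""
    n = len(traj)
    cx = sum(p.x for p in traj) / n
    cy = sum(p.y for p in traj) / n
    cz = sum(p.z for p in traj) / n
    scale = max(max(abs(p.x - cx), abs(p.y - cy), abs(p.z - cz)) for p in traj) or 1.0
    return [GestureSample(p.t, (p.x - cx) / scale,
                          (p.y - cy) / scale, (p.z - cz) / scale)
            for p in traj]
```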
Similarly to speech and handwriting, intrinsic intra- and
inter-personal differences can be found in the production of
gesture. The same gesture usually varies when performed by a
different person. Moreover, even the same person is never
able to exactly reproduce a gesture. Gesture, however, poses an additional problem of a technical nature: gesture recognition is influenced by the devices used to capture the underlying movement as well as by the environmental conditions in which gestures are performed.
Tracking of the hands, limbs and arms is the principal requirement of gesture-centered applications. Users of such applications were usually required to wear a suit or glove equipped with sensors to measure their 3D position and orientation. These input devices deliver very accurate data, but they are uncomfortable and cumbersome to wear. Furthermore, they are of little use in real-world contexts in which human users happen to encounter a service or assistant robot. The same holds for many work environments in which the user is performing multiple tasks and operations and cannot be encumbered by additional devices like gloves or overalls with markers, which usually also restrict the user's natural movements. Camera-based input devices are much more user-friendly as they are less intrusive. Given their modest hardware requirements, they represent a cheap and feasible alternative to wearable sensors. Nonetheless, they introduce problems of their own, rooted in both the computational cost of real-time image processing and the difficulty of extracting 3D information from 2D images. Sensing gestures with a camera is still a fragile task that usually works only in constrained environments.
The idea of using gestures and/or speech to interact with a robot has begun to emerge only recently, since in the field of robotics most efforts have hitherto been concentrated on navigation issues. Several generations of service robots operating in commercial or service surroundings, both to give support to and to interact with people, have been deployed to date [3,16], and this field of research is gaining more and more attention from industry and academia.


Gesture type (according to Kendon [14]) | Explication | Robot use
Gesticulation | Idiosyncratic movements that accompany speech. | Commands for industrial and service co-workers; speech disambiguation.
Language-like | Like gesticulation, but grammatically integrated in the utterance. | Comprehension of natural conversation between humans; speech disambiguation.
Pantomime | Gesture sequences that mime a story without speech. | Natural, non-coded communication at a distance or in noisy environments; robot teaching.
Emblems | Conventional gestures such as e.g. the thumbs-up. | Simple, effective basic commands for HRI.
Sign language | Signs are like words in speech, with lexicon and grammar. | For specific categories of users already proficient in sign language, in special application domains.
Table 1: Gestures ordered according to their degree of independent comprehension
Gesture type (according to McNeill [19]) | Explication | Robot use
Iconic | Images of concrete entities and/or actions. | Commands for industrial and service robotic co-workers with semantic interpretation capacity.
Metaphoric | They present abstract content. | Higher level of communication for context and for robots with higher-level planning capacity.
Deictic | Pointing gestures used to identify an object or to indicate a position. | Teaching the robot; space- and motion-related commands.
Beat | Rhythmic movements supporting prosody. | Modifying the execution speed of robot actions.
Table 2: Classification of gestures in relation to their semantic value
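For an HRI system, such taxonomies would have to be encoded explicitly. The fragment below is only one possible illustration of how the categories of Tables 1 and 2 could be represented in software, with the robot-use hints paraphrased from the tables; nothing here is mandated by either classification.

```python
from enum import Enum, auto

class KendonType(Enum):
    """Gesture categories after Kendon [14] (Table 1)."""
    GESTICULATION = auto()
    LANGUAGE_LIKE = auto()
    PANTOMIME = auto()
    EMBLEM = auto()
    SIGN_LANGUAGE = auto()

class McNeillType(Enum):
    """Gesture categories after McNeill [19] (Table 2)."""
    ICONIC = auto()
    METAPHORIC = auto()
    DEICTIC = auto()
    BEAT = auto()

# Illustrative mapping from category to an intended robot-side use,
# paraphrasing the right-hand columns of Tables 1 and 2.
ROBOT_USE = {
    KendonType.EMBLEM: "simple, effective basic commands for HRI",
    McNeillType.DEICTIC: "robot teaching; space- and motion-related commands",
    McNeillType.BEAT: "modulation of the execution speed of robot actions",
}
```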
Each human language has several expressions indicating hand, arm and body movements and gestures. In their daily use, gestures are an integral part of human-human communication, although some gestures can also be produced in isolation. They can have different meanings according to the culture and the location in which they are expressed. If we look up the word gesture as a noun in any dictionary, we obtain a set of meanings for its common use. From any of these definitions, it is clear that gestures are intimately connected with human movements and actions. Gestures are specialized actions and movements, but they work in ways that not all other motions do. They operate under psychological and cognitive constraints, not just anatomical constraints, since their intent is mainly communicative. Manual skills are usually not essential for natural gesturing, unless we consider the production of sign languages. These are, however, fully fledged languages in their own right and as such may not be considered natural gesturing.
Since the definition of gesture is blurry and given only for its common use, there are more open questions than answers when it comes to gestures. Scholars in gesture studies do not
agree on whether gestures are an integrated form and a
complement of spoken utterances or a spill-over of speech
production. It is not clear to what extent a gesture is a
voluntary action and if, in general, there is any level of
control or gestural awareness.
The lack of an algorithmic definition represents an obstacle to the development of gestural interfaces that exhibit near-human performance. This also brings along another problem: the lack of an exhaustive, clear-cut gesture categorization. Several useful classifications and dichotomies have been proposed according to many different criteria [4,8,9,14,19,20,25]. Most of them have been put forward by psychologists, psycholinguists, cognitive scientists, biologists and/or linguists who focused on characteristics of the movement, on insights into motor dysfunctions, on the reasons why specific movements occur or do not occur, etc. The categories are not always discrete and not mutually exclusive, since they display some overlap (i.e. a gesture may fall into more than one category). As a result, none of these classifications is universal and none can be used to algorithmically define even a few gesture categories. None of these taxonomies provides a rule base for making gestures understandable by, or synthesizable for, computers.
Researchers who create computational gesture-based
systems have been using their own definition of gesture. As
such, in computing-related areas, the definition of gesture tends to be application-specific, less spontaneous and linked to a learning process for those who are supposed to understand the set of gestures considered in the specific application. In this way, a gesture becomes a predefined spatio-temporal template to include in, or recognize from, a library of predefined movements, which we refer to as a gestabulary.
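To make the notion concrete, the sketch below shows one possible shape of such a gestabulary: a dictionary of labelled spatio-temporal templates queried with a nearest-template search based on dynamic time warping. The class, the distance measure and the rejection rule are illustrative assumptions on our part, not a specification.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]  # (x, y, z), already normalized

def dtw_distance(a: List[Point], b: List[Point]) -> float:
    """Dynamic time warping distance between two 3D trajectories;
    it tolerates differences in execution speed between performances."""
    n, m = len(a), len(b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = sum((p - q) ** 2 for p, q in zip(a[i - 1], b[j - 1])) ** 0.5
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

class Gestabulary:
    """A library of predefined spatio-temporal gesture templates."""

    def __init__(self) -> None:
        self.templates: Dict[str, List[Point]] = {}

    def add(self, label: str, template: List[Point]) -> None:
        self.templates[label] = template

    def recognize(self, observed: List[Point], reject_ratio: float = 1.0) -> str:
        """Return the label of the closest template, or 'unknown' if even the
        best match is too far away (the rejection rule is arbitrary)."""
        if not self.templates:
            return "unknown"
        label, dist = min(((lbl, dtw_distance(observed, tpl))
                           for lbl, tpl in self.templates.items()),
                          key=lambda item: item[1])
        return label if dist <= reject_ratio * max(len(observed), 1) else "unknown"
```

Whether templates, statistical models or learned classifiers are used, the practical consequence is the same: the set of recognizable gestures is fixed in advance and must be learned by the user.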
Some scholars refer to and deal with gestures as 3D
predefined movements of body and/or limbs. Others define
gesture as a set of motions and orientations of both hand and
fingers. Some other researchers regard gestures as (ink)
markers entered with a mouse, fingertip, joystick or electronic


pen [21]. Human facial expressions or lip-reading [5] have also been considered as gestures.
Speech can be successfully recognized because it relies on the peculiar sound wave characteristics of each of the phonemes that make up a sentence. Similarly, handwriting recognition relies on the sequence of specific spatial characteristics of the symbols representing each letter of the alphabet that makes up a sentence. Although it is unknown, and even arguable, whether there is an equivalent of the phoneme or the letter in the realm of human gestures, in computer systems gestures are treated as a language in itself, made up of a limited set of building blocks that parallel phonemes and letters. With this implicit assumption, and with the implications that derive from it, gestures can be exactly defined in their own right and can be associated with a pre-defined semantics.
II. DO WE NEED STANDARDIZATION OF GESTURE AND
ANNOTATION?
In the HRI field, one of the ongoing technological goals is the development of a system in which a robot understands natural communication with human users without any ambiguity (or, in other words, without errors).
The ideal feature of an application that can understand human multimodal communication is the capability of completely disambiguating the communicative message conveyed and intended by the human and delivered over different modes. The message must be univocally mapped to a meaning, as is done e.g. for spoken language with words, phonemes and syllables.
On the one hand, this may well be impossible, especially in unconstrained contexts. Ambiguity is an inherent and implicit part of human-human communication. It can also be seen as the room in which communication has leeway for the evolution of human relationships and emotions. Quoting Piantadosi et al. [22]: "Syntactic and semantic ambiguity are frequent enough to present a substantial challenge to natural language processing. The fact that ambiguity occurs on so many linguistic levels suggests that a far-reaching principle is needed to explain its origins and persistence."
On the other hand, a computational system usually produces an interpretation of the intended meaning that tends to be univocal within and for that system, according to its interpretational choices and its technical and contextual limitations.
The basis for the interpretation of gesture is annotation. An annotation system maps the gesture in the space domain (for example by subdividing the space around the subject into quadrants or sectors) and in the time domain.
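As a toy example only, the function below annotates each sample of a tracked trajectory with a time bin and a spatial sector around the subject; the eight-sector layout, the upper/lower split and the 0.25 s resolution are arbitrary assumptions, not a proposed scheme.

```python
import math
from typing import List, Tuple

def annotate(traj: List[Tuple[float, float, float, float]],
             n_sectors: int = 8, time_bin: float = 0.25) -> List[Tuple[int, str]]:
    """Map each (t, x, y, z) sample, expressed in a frame centred on the
    subject, to a (time-bin index, sector label) pair. The horizontal plane
    around the subject is split into n_sectors angular sectors, each divided
    into an upper and a lower half."""
    labels = []
    for t, x, y, z in traj:
        angle = math.atan2(y, x) % (2 * math.pi)
        sector = int(angle / (2 * math.pi) * n_sectors) % n_sectors
        half = "upper" if z >= 0 else "lower"
        labels.append((int(t / time_bin), f"sector{sector}-{half}"))
    return labels
```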
The use of different annotation systems could lead to different interpretations of the same human gestures. In a robotic system, the reaction of the robot (movement, gesture, feedback tones and light signals, etc.) would differ accordingly. This would add to the inherent ambiguity of human communication and language.
On the other hand, the use of systematically collected and annotated multimodal corpora would facilitate:
- a principled understanding of modality integration;
- generalized guidance on media allocation;
- cross-modal reference resolution/generation;
- the anticipation of phenomena and/or patterns in a given mode;
- a gold standard against which the output of multimodal systems can be evaluated;
- interface design in the development of intelligent multimodal systems.
Different annotation systems and related corpora have been proposed and used in different domains of interactive systems and robotics, such as robot expressive communication with speech and gestures [13], avatar animation [11], cognitive robotics and the evaluation of systems from a usability perspective, including the analysis of miscommunication, user positioning and task strategies [11], multimodal instruction dialogues between human and robot [23], the capture of higher-level dialogue structures during human-robot interaction [18], the design and implementation of expressive gestures in a humanoid robot [17], and others.
We advocate a standard annotation system as well as a shared gestabulary for HRI applications, especially those in critical environments. We believe that this is needed in order to diminish any potential source of ambiguity while also decreasing the effort required from human users to understand the reactions and feedback (of whatever kind) produced by robots.
It is the direct experience of the authors that, in a simple interaction with an industrial robot driven by a few simple commands, i.e. five common natural gestures, the need immediately arose for the human to adapt to the limitations of the technology in order to obtain the desired interpretation of the gesture by the robot (i.e. the execution of the command). By this we mean that care and attention are needed to produce the gesture within the disambiguation capability of the robot (which in our case amounted to about 80% correct interpretation), given its allotted range of speed, acceleration, spatial orientation and visual perspective.
Low-cost systems for a wide range of real-world applications, capable of becoming market products, are prone to several limitations in HRI.
The establishment of a common ground, a shared gestabulary and a standard annotation to build on would reduce the need for human users to learn anew, each time they meet a different brand or model of robot, how to interpret robot communication based on speech and gesture and how to communicate with the robot. As an immediate benefit, this would enhance the effectiveness and efficiency of HRI and reduce the effort required on the human side to produce unambiguous gestures.
For robotic systems such as industrial or service robot co-workers, in which the communication consists essentially of commands issued by the human, the effort of building a gestabulary as a set of pre-coded human gestures (i.e. gestures having a specific and unique meaning) will allow the production of low-cost systems (in terms of cameras, computational capacity and power expenditure) with high efficacy and low ambiguity in the interpretation of human gestures by the robot.
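A sketch of how the robot side might consume such pre-coded gestures is given below; the gesture labels, the robot interface and its method names are invented for illustration, and only the one-gesture/one-command mapping reflects the idea discussed here.

```python
from typing import Callable, Dict

class RobotCoworker:
    """Stand-in for a real robot control interface; method names are assumptions."""
    def stop(self) -> None: print("robot: stopping")
    def resume(self) -> None: print("robot: resuming task")
    def slow_down(self) -> None: print("robot: reducing speed")
    def hand_over(self) -> None: print("robot: handing over the part")
    def go_home(self) -> None: print("robot: moving to home position")

def build_command_map(robot: RobotCoworker) -> Dict[str, Callable[[], None]]:
    """Each pre-coded gesture label maps to exactly one command, mirroring the
    one-gesture/one-meaning principle of a gestabulary."""
    return {
        "open-palm": robot.stop,
        "thumbs-up": robot.resume,
        "patting-down": robot.slow_down,
        "beckoning": robot.hand_over,
        "pointing-home": robot.go_home,
    }

# Example use: dispatch the label returned by a recognizer such as the one
# sketched in Section I, falling back to an explicit "not understood" action.
robot = RobotCoworker()
commands = build_command_map(robot)
commands.get("thumbs-up", lambda: print("robot: gesture not understood"))()
```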
This can be considered the equivalent of the command icons that we find on the GUI of an MP3 player (those once used on cassette recorders), such as the pause button (two vertical bars), the play button (a right-pointing triangle), the fast-forward button (two right-pointing triangles), the stop button (a square), etc., which are standardized in IEC 60417 and sketched in Figure 1.

Figure 1: Standardized IEC 60417 control symbols for commercial electronics
To date, there is no ISO or IEC standard, nor a scientifically or commercially agreed and shared set of gestures for HRI.
The European Council Directive 92/58/EEC on minimum requirements for the provision of safety signs at work includes just a few hand signals (see Figure 2). This set is intended for communication between two workers in situations in which one subject controls a machine that moves or has moving parts, for example a crane lifting containers or a wheeled forklift, and the other worker provides directions. Most of these signals require the use of both hands and need a rather wide spatial envelope (the space around the person must be free of obstacles). They do not appear adaptable to other contexts; nevertheless, they show that the evolution of robotics and the adoption of robot co-workers in regulated environments immediately brings about issues related to standardization in HRI.




Figure 2: Hand signals from European Council Directive 92/58/EEC
Similar pre-coded gestures can also be found in aircraft marshalling, a one-to-one form of communication that relies on visual signals for aircraft ground handling. A marshaller usually wears a reflective safety vest and uses marshalling wands or handheld illuminated beacons to give instructions. The marshaller assists aircraft pilots at the airport with signals such as keep turning, stop, shut down the engine, slow down, etc., used to lead the aircraft to the runway or to its parking stand. Most of these important visual codes for use in international aviation are standardized by the International Civil Aviation Organization (ICAO).
REFERENCES
[1] Amaro, S., and Sugimoto, M. (2012) Novel interaction techniques
using touch-sensitive tangibles in tabletop environments. Proceedings
of the ACM international conference on Interactive tabletops and
surfaces. p. 347-350.
[2] Bradley, D., and Roth, G. (2005). Natural interaction with virtual
objects using vision-based six DOF sphere tracking. Proceedings of the
ACM SIGCHI International Conference on Advances in computer
entertainment technology, p.19-26.
[3] Breazeal, C. (2004). Social interactions in HRI: the robot view. IEEE
Transactions on Systems, Man, and Cybernetics, Part C: Applications
and Reviews. 34(2):181-186.
[4] Cadoz, C. (1994). Le geste, canal de communication homme/machine:
la communication instrumentale. Technique et Science de
l'Information. 13(1):31-61.
[5] Campbell, R., Landis, T., & Regard, M. (1986). Face Recognition and
lipreading: a neurological dissociation. Brain. 109 (3): 509-521.
[6] Dahl, D. (Ed). (2005). Practical Spoken Dialog System. Kluwer
Academic Publishers.
[7] Demirdjian, D., Ko, T. & Darrell, T. (2005) Untethered Gesture
Acquisition and Recognition for Virtual World Manipulation. Virtual
Reality.
[8] Efron, D. (1972). Gesture, Race and Culture. Mouton Press.
[9] Ekman, P., & Friesen, W. (1969). The repertoire of nonverbal behavior: categories, origin, usage and coding. Semiotica, 1, p. 49-98.
[10] Goulati, A. & Szostak, D. (2011). Proceedings of the 13th International
Conference on Human Computer Interaction with Mobile Devices and
Services (MobileHCI). P. 517-520.
[11] Green, A., et al. (2006). Developing a contextualized multimodal
corpus for human-robot interaction. Proceedings of 5th international
conference on Language Resources and Evaluation (LREC).
[12] Kaiser, E., et al. (2003). Mutual disambiguation of 3D multimodal
interaction in augmented and virtual reality. Proceedings of the 5th
international conference on Multimodal interfaces. p. 12-19.
[13] Kanis, J., & Krňoul, Z. (2008). Interactive HamNoSys Notation Editor for Signed Speech Annotation. ELRA Proceedings, p. 88-93.
[14] Kendon, A. (1986). The Biological Foundations of Gestures: Motor
and Semiotic Aspects. Lawrence Erlbaum Associates.
[15] Kölsch, M., et al. (2006). Multimodal interaction with a wearable augmented reality system. IEEE Computer Graphics and Applications. 26(3):62-71.
[16] Kragic, D., Petersson, L. & Christensen, H.I. (2002). Visually guided manipulation tasks. Robotics and Autonomous Systems, 40(2):193-203.
[17] Le, Q. A., Hanoune, S., & Pelachaud, C. (2011). Design and
implementation of an expressive gesture model for a humanoid robot,
Proceedings of 11th IEEE-RAS International Conference on Humanoid
Robots, p. 134-140.
[18] Maas, J. F., & Wrede, B. (2006). BITT: A corpus for topic tracking
evaluation on multimodal human-robot-interaction. Proceedings of the
International Conference on Language and Evaluation (LREC).
[19] McNeill, D. (1992). Hand and Mind: what gestures reveal about
thought. The University of Chicago Press.
[20] Nespoulos, J.L. & Roch Lecours, A., (1986). Gestures: nature and
function. In: Nespoulos, J.L., Perron, P., and Roch Lecours, A. The
biological foundations of gestures: motor and semiotic aspects, 49-62.
Lawrence Erlbaum Associates.
[21] Oviatt, S.L. et al. (2000). Designing the User Interface for Multimodal
Speech and Pen-based Gesture Applications: State-of-the-Art Systems
and Future Research Directions. Human Computer Interaction. 15(4):
263-322.
[22] Piantadosi, S. T., Tily, H., & Gibson, E. (2012). The communicative
function of ambiguity in language. Cognition, 122(3), 280-291.
[23] Wolf, J. C., & Bugmann, G. (2006). Linking Speech and Gesture in
Multimodal Instruction Systems. Proceedings of the 15th IEEE
International Symposium on Robot and Human Interactive
Communication (RO-MAN), pp. 141-144.
[24] Wu, A. et al. (2011). Tangible navigation and object manipulation in
virtual environments. Proceedings of the 5th international conference
on Tangible, embedded, and embodied interaction. p. 37-44.
[25] Wundt, W.M. (1973). The language of gestures. Mouton Press.
