Gesture input and annotation for interactive systems*
Paolo Barattini and Andrea Corradini
Abstract - This presentation deals with the need for a shared definition and repository of gestures (referred to as a gestabulary) and for an annotation system within the robotics community. This need arises from the necessity to create a common ground on which to build effective Human Robot Interaction (HRI) systems. Over the last couple of decades, significant efforts have been made towards the development of user interfaces for human-robot interaction by means of a combination of natural input modes such as vision, audio, pen, gesture, etc. These body-centered intelligent interfaces not only substitute for common interface devices but can also be exploited to extend their functionality. While earlier systems and prototypes considered the input modes individually, it has quickly become apparent that the different modalities should be considered in combination. The rationale behind this finding is that each single modality can be used to leverage and complement the semantic information delivered on every other input channel. One of the most promising interaction modes is the use of natural gestures. For gestural interaction between a mobile system such as a robot and human users, visual information in particular is highly relevant because it gives the system the capability to observe its operational environment in an active manner. Although relatively successful, the use of gesture has so far been rather confined to a few scenarios and application contexts. This is due to the lack of a technical definition of what a gesture is, which consequently results in a lack of a classification for the different kinds of human gestures.
I. ON GESTURES

The keyboard was the main input device for many years. Thereafter, the widespread introduction of the mouse in the early 1980s changed the way people interacted with computers. Lately, a large number of input devices, such as those based on pen, haptics, or finger movements, have appeared. The main impetus driving the development of new input technologies has been the demand for more natural interaction systems. Several promising user interfaces that integrate various natural interaction modes [7,12,15] and/or use tangible objects [1,2,24] have been put forward. Speech, the primary human communication mode, has been successfully integrated in several commercial and prototype systems. From voice commands [10], speech interfaces have evolved into conversational interfaces [6], a metaphor modeled after human-human conversations.
*The research presented in this paper, as part of the LOCOBOT project, has been financed by the European Commission under grant N. FP7 NMP 260101. Paolo Barattini (corresponding author) is with Ridgeback sas, Turin, Italy; phone: +39-0172-575087; e-mail: paolo.barattini@yahoo.it. Andrea Corradini is with the IT College of Media and Design, Copenhagen, Denmark; e-mail: andc@kea.dk.
Several gesture systems have also been proposed to date, yet we are not aware of any of them capable of reaching near-human recognition performance. In the areas of computer science and engineering, gesture recognition has been approached within a general pattern recognition framework, and therefore with the same tools and techniques adopted in other research areas like speech and handwriting recognition. While speech is fundamentally a sound wave, i.e. a temporal sequence of alternating high and low pressure pulses in the medium through which the wave travels, and while handwriting can be seen as a temporal sequence of ink on a 2D surface, gestures are interpreted as a set of connected spatial movements. From this perspective, a gesture is a trajectory in 3D space, and as such it is like handwriting in a higher-dimensional space. The difficulty in dealing with gesture is thus mainly due to its spatio-temporal variation. Similarly to speech and handwriting, intrinsic intra- and inter-personal differences can be found in the production of gesture. The same gesture usually varies when performed by different people. Moreover, even the same person is never able to exactly reproduce a gesture. Gesture, however, has an additional problem of a technical nature. Gesture recognition is influenced by the devices used to capture the movement that underlies the gesture as well as by the environmental conditions in which it is performed. Hand, limb and arm tracking is the principal requirement for gesture-centered applications. Users of such applications were usually required to wear a suit or glove equipped with sensors measuring their 3D position and orientation. These input devices allow very accurate input data to be picked up, but are uncomfortable and cumbersome for the user to wear. Furthermore, they are not useful in any real-world context in which human users happen to encounter a service or assistant robot. The same holds for many work environments in which the user is performing multiple tasks and operations and cannot be encumbered by additional devices like gloves or overalls with markers, which usually even restrict the user's natural movements. Camera-based input devices are much more user-friendly as they are less intrusive. Because of their modest hardware requirements, they represent a cheap and feasible alternative to wearable sensors. Nonetheless, they introduce problems of their own, rooted in both the computational cost of real-time image processing and the difficulty of extracting 3D information from 2D images. Sensing gestures with a camera is still a fragile task that usually works only in a constrained environment. The idea of using gestures and/or speech to interact with a robot has begun to emerge only recently, as most efforts in the field of robotics have hitherto been concentrated on navigation issues. Several generations of service robots operating in commercial or service surroundings, both to give support to and interact with people, have been deployed to date [3,16], while this field of research is gaining more and more attention from industry and academia.
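To make the trajectory view of gesture concrete, the sketch below (an illustration added here, not part of the LOCOBOT system described in this paper) represents a gesture as a time-ordered list of 3D points and matches a candidate against stored templates with dynamic time warping, one standard pattern-recognition technique for absorbing the spatio-temporal variation discussed above. All names (Gesture, dtw_distance, recognize) are hypothetical.

# Illustrative sketch only: gesture as a 3D trajectory compared to templates with DTW.
from math import dist
from typing import Dict, List, Tuple

Point3D = Tuple[float, float, float]
Gesture = List[Point3D]          # trajectory sampled over time

def dtw_distance(a: Gesture, b: Gesture) -> float:
    """Classic O(len(a)*len(b)) dynamic time warping on Euclidean point distances."""
    n, m = len(a), len(b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # insertion
                                 d[i][j - 1],      # deletion
                                 d[i - 1][j - 1])  # match
    return d[n][m]

def recognize(candidate: Gesture, templates: Dict[str, Gesture]) -> str:
    """Return the label of the stored template closest to the candidate trajectory."""
    return min(templates, key=lambda label: dtw_distance(candidate, templates[label]))

Under the assumption that labeled templates have been recorded beforehand, a trajectory captured from any tracking device could then be labeled with recognize(candidate, {"stop": stop_template, "come-here": come_template}); the person- and repetition-dependent variation mentioned above is what the warping step tries to absorb.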
Gesture type (according to Kendon [14]) | Explication | Robot use
Gesticulation | Idiosyncratic movements that accompany speech. | Commands for industrial and service co-workers; speech disambiguation.
Language-like | Like gesticulation, but grammatically integrated in the utterance. | Comprehension of natural conversation between humans; speech disambiguation.
Pantomime | Gesture sequences that mimic a story without speech. | Natural, non-coded communication at a distance or in noisy environments; robot teaching.
Emblems | Conventional gestures such as, e.g., thumbs-up. | Simple, effective basic commands for HRI.
Sign language | Signs are like words for speech, with lexicon and grammar. | For specific categories of users already proficient in sign language, in special application domains.
Table 1: Gestures ordered according to their degree of independent comprehension

Gesture type (according to McNeill [19]) | Explication | Robot use
Iconic | Images of concrete entities and/or actions. | Commands for industrial and service robotic co-workers with semantic interpretation capacity.
Metaphoric | They present abstract content. | Higher level of communication for context and for robots with higher-level planning capacity.
Deictic | Pointing gestures to identify an object or to indicate a position. | Teaching the robot; space- and motion-related commands.
Beat | Rhythmic movements supporting prosody. | Modifying the speed of execution of the robot's actions.
Table 2: Classification of gestures in relation to their semantic value

Each human language has several expressions indicating hand, arm and body movements and gestures. In daily use, gestures are an integral part of human-human communication, although some gestures can also be produced in isolation. They can have different meanings according to the culture and the location in which they are expressed. If we look up the word gesture as a noun in any dictionary, we obtain a set of meanings for its common use. From any such definition, it is clear that gestures are intimately connected with human movements and actions. Gestures are specialized actions and movements, but they work in ways that not all other motions do. They operate under psychological and cognitive constraints, not just anatomical constraints, since their intent is mainly of a communicative nature. Manual skills are usually not essential for natural gesturing, unless we consider the production of sign languages. These are, however, fully fledged artificial languages in their own right and as such may not be considered natural gesturing. Since the definition of gesture is blurry and given only for its common use, there are more open questions than answers when it comes to gestures. Scholars in gesture studies do not agree on whether gestures are an integrated form and complement of spoken utterances or a spill-over of speech production. It is not clear to what extent a gesture is a voluntary action and whether, in general, there is any level of control or gestural awareness. The lack of an algorithmic definition represents an obstacle to the development of gestural interfaces that exhibit nearly human performance. This also brings along another problem: the lack of an exhaustive, clear-cut gesture categorization. Several useful classifications and dichotomies have been proposed according to many different criteria [4,8,9,14,19,20,25].
Most of them have been put forward by psychologists, psycholinguists, cognitive scientists, biologists and/or linguists who focused on characteristics of the movement, insights into motor dysfunctions, reasons why specific movements occur or do not occur, etc. The categories are not always discrete and not mutually exclusive, since they display some overlap (i.e. a gesture may involve more than one category). As a result, none of these classifications is universal and none can be used to algorithmically define even a few gesture categories. None of these taxonomies provides any rule base for making gestures understandable or synthesizable by computers. Researchers who create computational gesture-based systems have been using their own definitions of gesture. As such, in computing-related areas, the definition of gesture tends to be application-specific, less spontaneous and linked to a learning process for those who are supposed to understand the set of gestures considered in the specific application. In this way, a gesture becomes a predefined spatial-temporal template to include into, or recognize from, a library of predefined movements which we can refer to as a gestabulary. Some scholars refer to and deal with gestures as predefined 3D movements of body and/or limbs. Others define a gesture as a set of motions and orientations of hand and fingers. Still other researchers regard gestures as (ink) markers entered with a mouse, fingertip, joystick or electronic
pen [21]. Human facial expressions or lip-reading [5] have also been considered as gestures. Speech can be successfully recognized because it relies on the peculiar sound-wave characteristics of each phoneme that makes up a sentence. Similarly, handwriting recognition relies on the sequence of specific spatial characteristics of the symbols representing each letter of the alphabet that makes up a sentence. Although it is unknown and even arguable whether there is an equivalent of a phoneme or a letter in the realm of human gestures, in computer systems gestures are treated as a language in its own right, made up of a limited set of building blocks paralleling the phonemes and letters. With this implicit assumption and the implications that derive from it, gestures can be exactly defined in their own right and can be associated with pre-defined semantics.

II. DO WE NEED STANDARDIZATION OF GESTURE AND ANNOTATION?

In the HRI field, one of the ongoing technological goals is the development of a system where a robot understands natural communication with human users without any ambiguity (or, in other words, without errors). The ideal feature of an application that can understand human multimodal communication is the capability of complete disambiguation of the communicative message conveyed and intended by the human and delivered over different modes. The message must be univocally mapped to meaning, as is done, e.g., for spoken language with words, phonemes and syllables. On the one hand, this may well be impossible, especially in unconstrained contexts. Ambiguity is an inherent and implicit part of human-human communication. It could also be envisioned as the room in which communication has leeway for the evolution of human relationships and emotions. To quote S.T. Piantadosi [22]: "Syntactic and semantic ambiguity are frequent enough to present a substantial challenge to natural language processing. The fact that ambiguity occurs on so many linguistic levels suggests that a far-reaching principle is needed to explain its origins and persistence." On the other hand, a computational system usually produces an interpretation of the intended meaning that tends to be univocal within and for that system, according to its interpretational choices and its technical as well as contextual limitations. The basis of the interpretation of gesture is annotation. An annotation system maps the gesture in the space domain (for example by subdividing the space around the subject into quadrants or sectors) and in the time domain. The use of different annotation systems could create different interpretations of human gestures; in a robotic system, the reaction of the robot (movement, feedback tone and light signals, gesture, motion, etc.) would then differ. This would add to the inherent ambiguity of human communication and language.
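As a purely illustrative sketch of such an annotation system (the sector layout and function names are assumptions, not taken from any of the cited corpora), the code below divides the horizontal space around the subject into eight angular sectors and maps each timed sample of a gesture trajectory to a sector index. Choosing a different subdivision would, as argued above, yield a different annotation and hence potentially a different interpretation of the same gesture.

# Illustrative sketch only: spatial/temporal annotation by angular sectors around the subject.
from math import atan2, degrees
from typing import List, Tuple

def sector_of(x: float, y: float, n_sectors: int = 8) -> int:
    """Index (0..n_sectors-1) of the angular sector of a point relative to the
    subject, assumed to stand at the origin facing the +x axis."""
    angle = degrees(atan2(y, x)) % 360.0
    return int(angle // (360.0 / n_sectors))

def annotate(trajectory: List[Tuple[float, float, float, float]]) -> List[Tuple[float, int]]:
    """Map timed samples (t, x, y, z) to (t, sector); a different sector layout
    would yield a different annotation of the very same movement."""
    return [(t, sector_of(x, y)) for (t, x, y, z) in trajectory]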
On the other side, the use of systematically collected and annotated multimodal corpora would facilitate:
- a principled understanding of modality integration
- generalized guidance on media allocation
- cross-modal reference resolution/generation
- anticipation of phenomena and/or patterns in a certain mode
- a gold standard against which the output of multimodal systems could be evaluated
- interface design in the development of intelligent multimodal systems

Different annotation systems and related corpora have been proposed and used in different domains of interactive systems and robotics, such as robot expressive communication with speech and gestures [13], avatar animation [11], cognitive robotics, system evaluation from a usability perspective including the analysis of miscommunication and of users' positioning and task strategies [11], multimodal instruction dialogues between human and robot [23], the capture of higher dialogue structures during human-robot interaction [18], the design and implementation of expressive gestures in a humanoid robot [17], and others. We advocate for a standard annotation system as well as for a gestabulary for HRI applications, especially those in critical environments. We believe that this is needed in order to diminish any potential source of ambiguity while also decreasing the effort required from human users to understand the reactions and feedback (of whatever kind) produced by the robots. It is the direct experience of the authors that, in the frame of a simple interaction with an industrial robot driven by a few simple commands, i.e. five common natural gestures, the need immediately arose for the human to adapt to the limitations of the technology in order to obtain the desired interpretation of the gesture by the robot (i.e. the execution of the command). By this we mean that care and attention are needed to produce the gesture within the frame of the disambiguation capability of the robot (which in our case was about 80% correct interpretation), given its allotted range of speed, acceleration, spatial orientation and visual perspective. Low-cost systems for wide-ranging real-world applications, capable of becoming market products, are prone to have several limitations in HRI. The establishment of a common ground, a shared gestabulary, and a standard annotation to build on would help lower the need for human users to learn anew how to interpret robot communication based on speech and gesture, and how to communicate with a robot, each time they meet a different brand or model of robot. As an immediate benefit, this would enhance the effectiveness and efficiency of HRI while reducing the effort required on the human side to produce unambiguous gestures. For robotic systems such as industrial or service robotic co-workers, in which the communication consists essentially of commands issued by the human, the effort of building a gestabulary as a set of pre-coded human gestures (i.e. each having a specific and unique meaning) will allow the production of low-cost systems (in terms of cameras, computational capacity, power
expenditure) with high efficacy and low ambiguity in the robot's interpretation of human gestures. This can be considered the equivalent of the command icons that we find on the GUI of an MP3 player, those once used on cassette recorders, such as the pause button (two vertical bars), the play button (right-pointing triangle), the fast-forward button (two right-pointing triangles), the stop button (square), etc., which are standardized in IEC 60417 and sketched in Figure 1.

Figure 1: Standardized IEC 60417 control symbols for commercial electronics
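A minimal sketch of what such a gestabulary could look like in software is given below: a fixed mapping from pre-coded gesture labels to robot commands, in the spirit of the standardized media-control icons just mentioned. The gesture labels and command names are hypothetical examples, not the actual set used in the authors' experiments.

# Illustrative sketch only: a gestabulary as a fixed, unambiguous gesture-to-command mapping.
GESTABULARY = {
    "open-palm-raised": "STOP",
    "thumbs-up":        "CONFIRM",
    "beckoning":        "APPROACH",
    "pointing":         "GO_TO_TARGET",
    "flat-hand-down":   "SLOW_DOWN",
}

def command_for(gesture_label: str) -> str:
    """Resolve a recognized gesture label to its single pre-coded command."""
    return GESTABULARY.get(gesture_label, "IGNORE")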
To date, there is no ISO or IEC standard, nor any scientifically or commercially agreed and shared set of gestures in HRI. The European Council Directive 92/58/EEC on minimum requirements for the provision of safety signs at work includes just a few hand signals (see Figure 2). This set is intended for communication between two workers in situations in which one subject controls a moving machine, or a machine with moving parts, for example a crane lifting containers or a mobile wheeled forklift, and the other worker provides directions. Most of these signals require the use of two hands and need quite a wide spatial envelope (there must be space around the person that is free of obstacles). They do not appear adaptable to other contexts; nevertheless, they show that the evolution of robotics and the adoption of robotic co-workers in regulated environments immediately brings about issues related to standardization in HRI.
Figure 2: Hand signals from European Council Directive 92/58/EEC

Similar pre-coded gestures can also be found in aircraft marshalling, a one-to-one communication that relies on visual signals for aircraft ground handling. A marshaller usually wears a reflective safety vest and uses marshalling wands and handheld illuminated beacons for instructions. The marshaller assists aircraft pilots at the airport with signals such as keep turning, stop, shut down the engine, slow down, etc., used to lead the aircraft to the runway or to its parking stand. Most of these important visual codes for use in international aviation are standardized by the International Civil Aviation Organization (ICAO).

REFERENCES
[1] Amaro, S., and Sugimoto, M. (2012). Novel interaction techniques using touch-sensitive tangibles in tabletop environments. Proceedings of the ACM International Conference on Interactive Tabletops and Surfaces, p. 347-350.
[2] Bradley, D., and Roth, G. (2005). Natural interaction with virtual objects using vision-based six DOF sphere tracking. Proceedings of the ACM SIGCHI International Conference on Advances in Computer Entertainment Technology, p. 19-26.
[3] Breazeal, C. (2004). Social interactions in HRI: the robot view. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 34(2):181-186.
[4] Cadoz, C. (1994). Le geste, canal de communication homme/machine: la communication instrumentale. Technique et Science de l'Information, 13(1):31-61.
[5] Campbell, R., Landis, T., & Regard, M. (1986). Face recognition and lipreading: a neurological dissociation. Brain, 109(3):509-521.
[6] Dahl, D. (Ed.). (2005). Practical Spoken Dialog Systems. Kluwer Academic Publishers.
[7] Demirdjian, D., Ko, T., & Darrell, T. (2005). Untethered gesture acquisition and recognition for virtual world manipulation. Virtual Reality.
[8] Efron, D. (1972). Gesture, Race and Culture. Mouton Press.
[9] Ekman, P., & Friesen, W. (1969). The repertoire of nonverbal behavior: categories, origin, usage and coding. Semiotica, 1, p. 49-98.
[10] Goulati, A., & Szostak, D. (2011). Proceedings of the 13th International Conference on Human Computer Interaction with Mobile Devices and Services (MobileHCI), p. 517-520.
[11] Green, A., et al. (2006). Developing a contextualized multimodal corpus for human-robot interaction. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC).
[12] Kaiser, E., et al. (2003). Mutual disambiguation of 3D multimodal interaction in augmented and virtual reality. Proceedings of the 5th International Conference on Multimodal Interfaces, p. 12-19.
[13] Kanis, J., and Krňoul, Z. (2008). Interactive HamNoSys notation editor for signed speech annotation. ELRA Proceedings, p. 88-93.
[14] Kendon, A. (1986). The Biological Foundations of Gestures: Motor and Semiotic Aspects. Lawrence Erlbaum Associates.
[15] Kölsch, M., et al. (2006). Multimodal interaction with a wearable augmented reality system. IEEE Computer Graphics and Applications, 26(3):62-71.
[16] Kragic, D., Petersson, L., & Christensen, H.I. (2002). Visually guided manipulation tasks. Robotics and Autonomous Systems, 40(2):193-203.
[17] Le, Q. A., Hanoune, S., & Pelachaud, C. (2011). Design and implementation of an expressive gesture model for a humanoid robot. Proceedings of the 11th IEEE-RAS International Conference on Humanoid Robots, p. 134-140.
[18] Maas, J. F., & Wrede, B. (2006). BITT: A corpus for topic tracking evaluation on multimodal human-robot-interaction. Proceedings of the International Conference on Language Resources and Evaluation (LREC).
[19] McNeill, D. (1992). Hand and Mind: What Gestures Reveal about Thought. The University of Chicago Press.
[20] Nespoulous, J.L., & Roch Lecours, A. (1986). Gestures: nature and function. In: Nespoulous, J.L., Perron, P., and Roch Lecours, A. (Eds.), The Biological Foundations of Gestures: Motor and Semiotic Aspects, p. 49-62. Lawrence Erlbaum Associates.
[21] Oviatt, S.L., et al. (2000). Designing the user interface for multimodal speech and pen-based gesture applications: state-of-the-art systems and future research directions. Human Computer Interaction, 15(4):263-322.
[22] Piantadosi, S. T., Tily, H., & Gibson, E. (2012). The communicative function of ambiguity in language. Cognition, 122(3):280-291.
[23] Wolf, J. C., & Bugmann, G. (2006). Linking speech and gesture in multimodal instruction systems. Proceedings of the 15th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), p. 141-144.
[24] Wu, A., et al. (2011). Tangible navigation and object manipulation in virtual environments. Proceedings of the 5th International Conference on Tangible, Embedded, and Embodied Interaction, p. 37-44.
[25] Wundt, W.M. (1973). The Language of Gestures. Mouton Press.
LOCOBOT ICRA 2013 Workshop
Human Robot Interaction (HRI) for Assistance and Industrial Robots. Scientific Knowledge, Standards and Regulatory Framework. How do I design for the real world?
Organizers: Gurvinder S. Virk, A. Tapus, F. Bonsignorio, N. Mirnig, M. Tscheligi, S. Haddadin, M. Vincze, Han Boon Siew, H. Samani, N. Bellotto, A. Corradini, P. Barattini, N. Robertson, C. Morand, A. Rovetta, C. Woegerer, A. Pichler.