Bateman - Wildfeuer JOP Article

Journal of Pragmatics 74 (2014) 180--208
A multimodal discourse theory of visual narrative

John A. Bateman *, Janina Wildfeuer 1
Faculty of Linguistics and Literary Sciences, Bremen University, Bremen 28334, Germany
Received 28 May 2014; received in revised form 27 September 2014; accepted 5 October 2014
Abstract
There have been many attempts to provide accounts of visually expressed narratives by drawing on our understandings of linguistic
discourse. Such approaches have however generally proceeded piecemeal --- particular phenomena appearing similar to phenomena in
verbal discourse are selected for discussion with insufficient consideration of just what it means to treat visual communication as
discourse at all. This has limited discussions in several ways. Most importantly, analysis is deprived of effective methodologies for
approaching visual artefacts so that it remains unclear what units of analysis should be selected and how they can be combined. In this
paper, we articulate a model of discourse pragmatics that is sufficiently general to apply to the specifics of visually communicated
information and show this at work with respect to several central aspects of visual narrative. We suggest that the framework provides an
effective and general foundation for reengaging with visual communicative artefacts in a manner compatible with methods developed for
verbal linguistic artefacts.
© 2014 Elsevier B.V. All rights reserved.
Keywords: Narrative; Discourse; Semantics; Visuals; Comics; Multimodality
1. Introduction
The ability of sequences of visual materials to fulfil communicative purposes that are in many ways analogous to
those achieved by sequences of linguistic elements is now broadly recognized. Indeed, the question of the relationship
of film --- i.e., sequences of moving images --- to language and even film considered as a language has been with us for
almost as long as film itself (see the thorough discussion and further references in, e.g., Metz, 1964). An explosion of
interest in the same question when raised with respect to sequences of static images --- as in comics, graphical
instruction leaflets, picturebooks and so on --- is now also to be observed (cf. Evans, 2009; Cohn, 2013a; Miodrag, 2013;
Schumacher, 2013). Nevertheless, the general area of applying linguistic approaches to the visual remains one that is
hotly contested. Many working primarily on the visual side have voiced fundamental critiques of the relevance of
linguistic theories and methods when addressing visual communication. And much of this critique draws attention to the
very different basis that verbal language, with its reliance on convention and the construction of larger elements from
smaller elements, appears to have when compared with visual artefacts, exhibiting strong iconicity and an apparent lack
of basic non-iconic elements that may be defined independently of their use in larger wholes (cf. Gombrich, 1959:6;
Bateman, 2014:46--47).
This continues to raise difficult questions not only concerning the theoretical relevance of properties derived from
the study of language but also with respect to basic methodological issues for the determination of units of analysis.
* Corresponding author. Tel.: +49 421 218 68120; fax: +49 421 218 98 68120.
E-mail addresses: bateman@uni-bremen.de (J.A. Bateman), wildfeuer@uni-bremen.de (J. Wildfeuer).
1 Tel.: +49 421 218 68287; fax: +49 421 218 98 68287.
http://dx.doi.org/10.1016/j.pragma.2014.10.001
0378-2166/© 2014 Elsevier B.V. All rights reserved.
In fact, a broad feeling can be observed that applications of insights gained within linguistics to the visual
are inherently guilty of the kind of linguistic imperialism commonly attributed to Saussure and Barthes, in which
language is taken as the model, the master pattern, for all semiotic, i.e., meaning creating, behaviour. Given the
rather marked lack of success exhibited to date by linguistically-oriented semiotic accounts of meaning construction
in visual media, imperialism of this kind seems unwarranted. Indeed, although sometimes insightful, it is striking
how little acceptance such approaches have found --- semiotic accounts remain by and large at rather abstract,
illustrative levels of application and even these are commonly considered with suspicion by visual analysts more
generally.
In this paper we will suggest that one of the main reasons for this unsatisfactory state of affairs is the lack of analytic
bite that linguistically-motivated accounts of the visual have traditionally been able to bring to bear on their objects of
analysis. The resulting descriptions in fact lack much of the general method of a linguistic analysis: particular forms or
aspects of an artefact under study are extracted and discussed as if they were linguistic in nature without the foundation
of the broad raft of empirical methodologies usual in linguistic work proper. This leaves few principles for demarcating units
--- neither within images nor across sequences of images --- and, once identified, consequences for interpretation are only
informally specified. We believe that one of the most significant underlying problems to be resolved here is that of invoking
inappropriate levels of abstraction. Verbal and visual media are fundamentally different at lower levels of descriptive
abstraction and so drawing analogies too soon will tend to distort the respective subject matters. In contrast, we will argue
that significant similarities between verbal and visual communicative artefacts can be located at the more abstract levels
of discourse. It is then only at these higher levels of abstraction that insights from linguistic models can be beneficially
applied and we begin to regain the sense in which sequences of both verbal and visual material can function similarly as
communicative artefacts or performances.
The focus of this paper is therefore to articulate a model of discourse pragmatics that is sufficiently general to apply to
the specifics of visually communicated information. We will suggest that this framework provides a new foundation for
reengaging with visual communicative artefacts in a manner compatible with the methods developed for verbal linguistic
artefacts, but without positing misleading analogies with linguistic syntax, morphology or phonetics. In the past there have
been many attempts to develop accounts of visual artefacts and of the operations of visual communication involved that
have characterized themselves in terms of linguistic discourse. But there has been relatively little critical discussion of just
what is meant by considering visual communication as discourse in the first place. This has limited progress in several
ways. Most importantly, analysis has been deprived of effective methodologies for approaching visual artefacts. It is
unclear what units of analysis should be selected and how they can be combined, which naturally compromises empirical
application. By working through an explicitly formulated framework for analysing discourse that we have extended to
operate across verbal and visual materials (as well as their combinations), we will show how a more robust position for
analysis can be achieved.
We structure the discussion as follows. Crucial to the account is a reconstruction of the basic notion of semiotic
mode that draws, at the lowest level of abstraction, on an acceptance of materiality and, at the highest level of abstraction,
on a detailed model of discourse and its operation. For this, we also require an appropriate acknowledgement of the
fact that properties of perception enter into the operation of many visual media in a fundamentally different way to that
found with verbal language. First, therefore, we introduce our definition of semiotic mode in some detail because it is
this that gives us the foundation necessary for applying notions of discourse across a variety of media. Second, we set
out the model of discourse that we employ within our notion of semiotic mode, showing both how this can be made to
lend itself to descriptions of non-verbal material and how a suitable relationship to perception can be drawn. Third, we
illustrate the framework in action by taking several cases of visual narrative. Here we move progressively away from
the treatment of verbal language to non-naturalistic visual narrative communicative artefacts. Fourth and finally,
we summarize what has been achieved and briefly discuss how our approach is relevant for multimodal communication
in general.
2. Semiotic modes and multimodality
Approaches to understanding the workings of complex multimodal artefacts continue to be hampered by
inadequate characterizations of the notion of multimodality itself. It is common for sensory modalities and semiotic
modalities to be conflated. This is, however, a confusion of sense and perception. There is now considerable
evidence that sensory inputs are combined and influence each other at very early stages in neural processing, well
before perceptions are formed (cf., e.g., Kluss et al., 2012; Seeley, 2012). Since, as we shall argue in more detail
below, semiotic modes require, and only operate in terms of, perceptible distinctions, taking sensory channels as the
basic ingredients of multimodality and multimodal artefacts is not going to be adequate. More effective models of
semiotic mode require that we first return to the basic notion of materiality, i.e., the stuff which is used when making
meanings.
2.1. Semiotic modes and materiality
Our redefinition of the notion of semiotic mode is situated within the paradigm of socio-semiotics as developed by Kress
and colleagues (cf. Kress et al., 2000; Kress, 2010). Within this account, modes are pre-given neither by material nor by
sensory access to material. In particular:
what counts as a mode is a matter for a community and its social-representational needs. What a community
decides to regard and use as mode is mode. If the community of designers have a need to develop the potentials
of font or of colour into full means for representation, then font and colour will be mode in that community.
(Kress, 2010: 87; original emphasis)
Semiotic modes therefore may grow whenever a community of users puts work into their use and the material employed is
sufficiently manipulable to show the traces necessary for revealing that choices between semiotically-charged
alternatives have been made (cf. Kress et al., 2000:43). One practical consequence of this modelling decision is that there
are many more semiotic modes than are generally discussed in the literature. As we shall set out in detail below, for
example, it is generally insufficient to talk of the visual, or of sound, etc. as corresponding to semiotic modes.
Considering just the visual sensory channel to illustrate the point, we find that comics (or sequential art: Eisner, 1992)
have their own quite distinctive ways of connecting together elements; graphs made up of axes and curves rely on a more
restricted and also quite separate set of interpretative conventions; information set out in tables is restricted in other ways;
and within diagrammatic representations it is possible to delineate an entire host of quite distinct usages (cf. Bertin, 1983;
Kostelnick and Hassett, 2003). The semiotic modes operating can only be uncovered by detailed empirical investigation of
meaning-making practices and are not obviously given by fixing a sensory channel.
This situation demands that we make more explicit the role played by the material distinctions employed for indicating
that meaning-making is underway. In this respect, we move beyond traditional semiotic accounts where materiality was
generally excluded from the characterization of sign systems (Saussure, 1959; Hjelmslev, 1953) and so recognize the
growing awareness being given to the importance of materiality in meaning-making. The lack of a proper consideration not
only of the materiality of semiotic artefacts, but of the semiotic systems employing distinct forms of materiality, has worked
against a sufficiently powerful semiotics of multimodality that can be appropriately extended to the visual. Our embedding
of materiality within semiotic modes makes the claim that any semiotic mode will reach into a particular material substrate
to leave traces for interpretation there and that any given material substrate may support multiple semiotic modes
simultaneously.
Different materialities are clearly capable of supporting different uses --- i.e., in the sense of Gibson (1977), they afford
different kinds of traces. For example, the materials of a photograph or of a painting do not readily support movement. This
does not mean that it is impossible to express movement within this material, since expression is a question of what
meanings material traces are used to convey rather than of the material itself; what it does mean is that particular, material-
specific solutions for expressing this meaning will need to be sought (cf. Cutting, 2002). The materiality of a semiotic
mode is then furthermore not limited to particular sensory channels because it is up to the semiotic mode to determine
which properties of the material manipulated may be meaning-bearing. We will not be arguing, however, that all
regularities observed and interpretable in some material are semiotic in the more specific sense we develop here, but our
position does hold the converse to be true: i.e., no semiotic mode can be considered without attention to its material.
Evidence for this inseparable relation between semiotic modes and material, and hence the perception of that material,
can now be drawn from many sources. A particularly striking phenomenon is offered by synaesthesia, which under certain
circumstances can reveal the effects of semiotic distinctions directly on perception (cf., e.g., Cytowic and Eagleman,
2011:75--76). Even for verbal language, the most well studied of all semiotic modes, an explicit reconsideration in terms of
the material properties of semiotic modes is beneficial. We know for language that the visual input of, for example, mouth
shape and movements can directly influence the perception of phonetics (cf. McGurk and MacDonald, 1976). Thus, even
for spoken language an a priori restriction of material to the audial channel is at best an approximation. Similarly, the fact
that film is an audiovisual medium --- i.e., employs a material limited to audial and visual material traces --- actually says
rather little about its modal deployment as a richly organized visual, spatial, audial, haptic, vestibular, gustatory (and so on)
aesthetic experience. This point is now being argued increasingly often for a variety of media/materials, although the
precise use of terms mode, medium and so on still shows considerable variation (cf. Sobchack, 2004; Mitchell, 2005;
Hague, 2014).
Given patterned traces left in some material, we then need to characterize just how those traces are employed for
meaning-making. Our model for this process posits three abstract semiotic strata that are definitional for each semiotic
mode. Materiality occupies the least abstract semiotic stratum in such a system --- each semiotic mode is defined by
particular combinations of material properties which act as traces of the semiotic decisions made at the more abstract
levels of organization. To account for meaning attribution, we need to turn to these more abstract strata of semiotic modes
and, in particular, to their discourse organization.
2.2. Semiotic modes and discourse
Many traditional models or approaches to multimodality drawing on semiotics posit a relatively direct relationship
between signs (formed out of some material) and meanings for those signs. In contrast, our approach insists on a more
indirect and pragmatically-aware relationship between material traces and attributions of meaning and incorporates a
further mediating stratum of discourse semantics. We consider this a fundamental ingredient of a usable definition of
semiotic modes and the key to achieving analytical precision when confronting multimodal artefacts and behaviours.
The specific task of the semiotic stratum of discourse semantics within any semiotic mode is to relate particular
deployments of semiotically-charged material to their contexts of use and the communicative purposes they can take
up. The discourse semantics therefore provides the pragmatic interpretative mechanisms necessary for relating the
forms a semiotic mode distinguishes to their contexts of use and for demarcating the intended range of interpretation of
those forms. Such interpretations can vary with respect to just how tightly constrained they are intended to be, ranging
from the very specific to rather more loose directions for interpretation --- however, we consider an ordering of some
directions for interpretation with respect to others as definitional for the kind of intendedly communicative artefacts with
which we are concerned.
The explicit incorporation of a discourse semantics for each semiotic mode allows us to address a recurrent
problem faced within previous approaches to explaining multimodal meaning. Earlier conceptions of signs and
semiotic codes promoted accounts in which both the active involvement of an interpreter and the insistent materiality
of individual objects of interpretation were either backgrounded or disappeared. However, in more contemporary
theories of the interpretation of non-linguistic communicative artefacts --- communicative here being understood in
the deliberately broad sense of an artefact intentionally created to be interpreted --- it is broadly recognized that more
explicitly dialogic treatments of the relation between recipient and artefacts are required. We see this in art theory
and aesthetics in Gombrich's discussion of "the beholder's share" (Gombrich, 1959, Part III) and notions of image
acts (Bredekamp, 2010), in the literary reader-response theory involving intended gaps constructed in texts of Iser
(1978), in Bakhtin's (1981) tenets of inherent dialogicism, in Barthes's (1977 [1964]) focus on connotative meaning, in
views of visual material and images as necessarily embedded within intersubjective communicative actions
demanding pragmatic principles of interpretation (Sachs-Hombach, 2001; Schirra and Sachs-Hombach, 2007), in
narratological descriptions in terms of the necessary filling out of the story world guided by the plot (Ryan, 2003), and
many more.2
Reception studies and proposals for modelling interpretation have therefore given increasing centrality to the
interaction entered into by text/artefact and its recipient. The majority of such accounts now see interpretation as
operating by virtue of historically and socially situated dynamic processes of response formation unfolding between
artefact and recipient/audience. The lack of dynamic historicity and social placement in traditional semiotic accounts was
a significant factor in the subsequent parting of the ways between linguistically-inflected, semiotic notions of text
(construed broadly) and approaches to aesthetic artefacts such as literature, film, painting and the media (cf. Hall, 1980;
Bordwell, 1982; Hatt and Klonk, 2006). A basic distrust of linguistically-inflected semiotics persists in visual studies, art
history and related fields on these grounds to this day.
We consider it essential for any appropriate treatment of the meaning of complex communicative artefacts to deal with
this foundational dialogic property directly and in a manner that also does justice to the distinct materialities that may be
involved. This is the role we allocate to our semiotic stratum of discourse semantics. Furthermore, one inherent property of
the discourse semantic stratum in any semiotic mode is that it operates abductively: i.e., as a process of defeasible
hypothesis formation. The mechanism of abduction goes back to the semiotic work of Peirce (cf. Wirth, 2005) and has now
been suggested to play a role in several media --- for example, by Moriarty (1996) for visual semiotics and by the authors of
the present paper for film (Bateman, 2007; Wildfeuer, 2014a). Below, this position will enable us to make ready contact with
several well articulated theories of dynamic semantics developed for verbal discourse and natural language pragmatics ---
these will then be extended to cover visual artefacts.
The significance of adopting an abductive approach to semantics can be shown briefly by the following example.
Consider two mini-texts made up of the same sentences presented in contrasting orders:
(1) Mary went to the park and played football.

(2) Mary played football and went to the park.
2 We cannot trace the history of this very large body of disparate research in any detail here other than to draw attention to a marked
reoccurrence of the requirements of dynamic interaction between artefact and recipient. One could also track these positions further back in time,
involving considerations of the philosophy of aesthetics, of perception and so on --- which would similarly lead us too far from our principal point of
concern here.
A traditional, static logical conception of conjunction would suggest that these two variants should have the same
truth value because "A and B" is equivalent to "B and A". This is evidently far from an ideal treatment when addressing
natural language because the order in which the sentences are presented is significant in its own right: very different
interpretations are naturally made of these two mini-texts, usually described in terms of linguistic iconicity and
differing assumptions concerning the temporal order in which the events unfold. This divergence between a
compositional semantics and the usual interpretation of sentences needs then in traditional treatments to be
accounted for by additional rules of pragmatic entailment or implication. This moves immediately to models of
knowledge, belief, expectations, etc. and a significant loosening of the relation between compositional semantics and
contextual interpretation.
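The commutativity point can be made concrete with a minimal sketch in Python. The function names and the encoding of the two clauses as plain truth values are illustrative assumptions, not part of any formalism discussed here; the sketch only shows that a static, truth-functional conjunction cannot distinguish mini-text (1) from mini-text (2).

```python
from itertools import product


def conj(a: bool, b: bool) -> bool:
    """Static, truth-functional conjunction (hypothetical helper)."""
    return a and b


def is_commutative() -> bool:
    """Check 'A and B' == 'B and A' over every assignment of truth values."""
    return all(conj(a, b) == conj(b, a)
               for a, b in product([False, True], repeat=2))


# Assumed encoding: each clause is reduced to a bare truth value.
went_to_park = True
played_football = True

text_1 = conj(went_to_park, played_football)   # mini-text (1)
text_2 = conj(played_football, went_to_park)   # mini-text (2)
```

On this encoding the order of presentation leaves no trace in the result, which is exactly why the temporal reading has to be supplied by some further mechanism.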
The extent to which this is a problem is often severely underestimated or downplayed. What is compromised by this
weakening of the connection between compositional semantics and interpretation is the possibility of a principled
account of how fine-grained differences in grammatical form can set significant constraints both on the range of
interpretations to be considered and on the need to consider some interpretations at all. This area of strongly
conventionalized implicature (cf., e.g., Meibauer, 2009) is distinct from further inferences that may be induced by virtue
of economy or violated conversational maxims and is taken here to be always necessarily present in discourse
interpretation. Gaining maximal control of this kind of pragmatic inference is important since, as a further refinement of
Levinson's (2000:29) slogan "inference is cheap, articulation expensive", we know from a formal perspective that there
are many different kinds of inference and some of these are not at all cheap. In order to ensure that the highly
conventionalized implicatures at issue here are performed as quickly and automatically as many pragmatic accounts
would desire and actual discourse production and interpretation suggest, the formal complexity of the operations
involved needs to be constrained. Simply assuming that the necessary work will be performed by some theoretically
under-differentiated procedural component (cf., e.g., de Beaugrande and Dressler, 1981) offers no guarantees that this
will be the case.
The evidence in favour of some kind of procedural, or dynamic, account of discourse interpretation is nevertheless
irrefutable. Early work on the cognitive processing of discourse proposed that comprehension operates by constructing
explicit mental models representing the situations that a text describes (cf. van Dijk and Kintsch, 1983; Johnson-Laird,
1983). The general utility of such models has since been confirmed in many experimental studies and has been
extended to cover varied types of texts as well as, particularly relevant for us here, communicative situations involving
various combinations of media. In this latter case, shared situation models offer a relatively natural way of accounting for
integration of information coming from quite different sensory modalities. Numerous questions remain, however,
concerning just how such situation models are constructed and developed during discourse comprehension. Situation
models are now generally assumed to be multidimensional constructs spanning at least spatial, causal, temporal and
motivational facets --- and all of these then require updating as discourse proceeds (cf. Zwaan and Radvansky, 1998).
The construction and update of situation models is commonly taken to rely on approximately sentence level semantic
representations plus various mechanisms for integrating meanings in order to achieve global coherence. Experimental
research has examined the possible influences on these processes, including working memory, access to background
knowledge and inferences, and has succeeded in identifying a variety of linguistic cues that appear to trigger integration
by forming links between the incoming material and a growing integrated model. These cues include reference chains,
causal inferences, spatial relations and descriptions of events. The issue of precisely which elements are to be pursued for
linking is a critical one, however, since comprehension evidently operates with considerable efficiency. Indeed, as
recently summarized by Ferstl: ". . . in the absence of an overt, demanding comprehension task, language processing in
context proceeds with surprisingly little brain power" (Ferstl, 2007:66).
The precise forms of the linguistic material in a verbal discourse are commonly taken as one major constraint on the
process of situation model update that contributes to this efficiency (Zwaan and Radvansky, 1998:177). Particular
linguistic forms and grammatical constructions have been explored as more or less explicit instructions for textual uptake:
that is, as long suggested by functional linguists of various persuasions: "grammar is an automated discourse processing
strategy" (Givón, 2005:92). However, and just as suggested above with respect to our example mini-texts, a focus on
grammar (and its compositional semantics) leaves many of the details of such textual updates unclarified. There are then
several paths open for exploring this crucial area further. Psychological and neurocognitive experimentation, for example,
continues to probe the underlying comprehension processes involved with ever-growing precision, not only for verbal
discourse but now increasingly for visual communicative forms such as comics (Cohn et al., 2012) and films (Zacks and
Magliano, 2011) as well. Moreover, there is also a complementary need for appropriate linguistic and semiotic models
which can, on the one hand, be used to generate more refined hypotheses for experimentation and, on the other, be
assessed with respect to their alignment with existing experimental results, corpus studies and other empirical and
analytic research methods commonly employed within linguistics.
This is then precisely the role we suggest for dynamic discourse semantics. In contrast to static approaches to
compositional semantics, dynamic semantics responds to the challenges of describing dynamic discourse development
by providing an abstract, but nevertheless formally detailed, characterization of how the process of meaning construction
is guided by the linguistic information present in interaction with further semantic and contextual knowledge. Rather than
constructing a semantics for sentences combined in a text compositionally or by more informal linking, dynamic
semantics captures formally the idea that discourse interpretation is a process of progressively adding information into a
growing discourse context: semantics is thus formally specified in terms of principles of discourse semantic update that
determine precisely how such growth occurs. This then gives rise to dynamically produced structures predicting testable
consequences for discourse interpretation --- particularly in terms of the accessibility of discourse referents and other
discourse-related phenomena that we will return to below.
One consequence of this shift in perspective is that there is no longer any reason, even when considered in terms of
a logical account, to assume that the "A and B" and "B and A" patterns seen in our mini-texts will lead inexorably to the
same result. A different discourse context is being updated by a different semantic contribution in each case and so the
results of discourse update may diverge.3 Moreover, accounts of this kind usually make crucial reliance on logics
which are non-monotonic, i.e., defeasible and abductive. Within such non-monotonic accounts, dynamic discourse
interpretations may be overridden as more information becomes available --- all interpretations are provisional. The
non-monotonic logics employed allow us to specify precisely just how and under what conditions such revisions may
occur. In the case of our mini-texts, for example, it cannot be the case that the temporal ordering is read
monotonically off the clause semantics because it is possible to find contexts of use where the ordering does not
apply --- e.g., if the sentences were to occur while simply listing the activities performed during some time period
without regard to their ordering.
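A minimal Python sketch may help to fix the idea of order-sensitive, defeasible update. It is not the formalism of any particular dynamic semantic theory: the class Context, the functions update, interpret and revise_as_list, and the event labels are all hypothetical names introduced for illustration. Each incoming event updates a growing context, a temporal ordering is hypothesised by default, and that hypothesis is withdrawn when the context signals a mere listing of activities.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    """A growing discourse context (hypothetical): events in order of mention
    plus provisional temporal-precedence hypotheses."""
    events: list = field(default_factory=list)
    precedes: set = field(default_factory=set)
    list_reading: bool = False  # True when the context signals mere enumeration


def update(ctx: Context, event: str) -> Context:
    """Add one event; by default hypothesise that it follows the previous one."""
    if ctx.events and not ctx.list_reading:
        ctx.precedes.add((ctx.events[-1], event))
    ctx.events.append(event)
    return ctx


def interpret(events, list_reading: bool = False) -> Context:
    """Process events one at a time, in order of presentation."""
    ctx = Context(list_reading=list_reading)
    for event in events:
        update(ctx, event)
    return ctx


def revise_as_list(ctx: Context) -> Context:
    """Later information defeats the default: retract the ordering hypotheses."""
    ctx.list_reading = True
    ctx.precedes.clear()
    return ctx


text_1 = interpret(["go_to_park", "play_football"])   # mini-text (1)
text_2 = interpret(["play_football", "go_to_park"])   # mini-text (2)
```

Here text_1 and text_2 end up with different precedence hypotheses although they contain the same events, and calling revise_as_list on either removes the ordering again. The design choice worth noting is that ordering is stored as a retractable hypothesis in the context rather than computed from the clauses themselves.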
The dynamic growth of discourse meaning thus covered also provides a means of avoiding a longstanding criticism of
applications of grammar (in a syntactic sense) to both text --- e.g., in the form of story grammars (cf. de Beaugrande,
1982) --- and film (cf., e.g., Möller-Naß, 1986) concerning notions of grammaticality. It is often claimed that the notion of
grammaticality is misguided for texts and extended sequences of visual materials, such as films, because there are no
ungrammatical sequences, simply sequences that may be more or less difficult to interpret. If this position is accepted, it
raises significant problems for traditional accounts of meaning because compositional semantics requires well-formed
structures to operate on. If the notion of grammaticality is relaxed, it is then unclear formally how meanings are to be
generated. Whereas there are accounts that appear to wish to maintain the grammaticality assumption for discourse
(e.g., Cohn, 2013b, whose position we return to below), the dynamic semantics approach instead sidesteps the problem
of needing to determine which meaning-bearing structures are possible in advance --- i.e., specifying grammaticality ---
because it does not rely on syntactic-structural descriptions to drive semantic interpretation. This role is taken over by the
principles of discourse update. Thus, in this respect also, the principles of discourse interpretation within such a
framework differ fundamentally from the kinds of interpretative mechanisms derived for treating syntax. Below we will
see all of these aspects of dynamic semantics in operation in our examples where the need and value of an abductive
discourse semantics already found for natural language texts becomes even more striking when visual material is
considered.
For the present, however, we can also relate this directly to the notions mentioned above of guided interaction between
artefact and recipient of the general form suggested by Iser (1978), in which verbal texts construct gaps in interpretation
that stand as explicit indications of necessary interpretative work on the part of the recipient. A detailed discourse
semantics expressed within a dynamic logic identifies structurally determined gaps in knowledge of very specific kinds
that must then be filled abductively from context. Discourse semantic principles then control when and how world
knowledge may be accessed in this interpretation process --- on the one hand, candidate semantic interpretations are
signalled as relevant by the unfolding discourse and, on the other, formal mechanisms operate so that these candidates
are resolved against more abstract semiotic levels, such as context, style or genre.
The adoption of Iser's general scheme of interpretation for further modalities other than language and literary texts has
now been suggested by several researchers. We also adopt this perspective here but provide considerably more detail
and explicit consideration of just how such gaps are both created by artefacts and resolved by recipients. In short, our
more explicit discourse semantics seeks to characterize more precisely, and in a manner that is multimodally viable, just
what kinds of gaps are created in a work, how they are created, and how they may be filled. This also places us firmly in the
realm of pragmatics, since we will be addressing the relationship between some communicative artefact and its context of
use, while at the same time maintaining a close hold on the concrete technical details of any artefact as providing
guidelines or instructions for carrying out that contextualization. The result of our investigations can thus best be seen as a
pragmatically-grounded theory of multimodal discourse.
3
Also noteworthy here is Currie's (1995:134) reliance on older, non-dynamic views of semantics to argue against the relevance of
linguistic treatments for film interpretation precisely on the basis of the fact that "A and B" differs from "B and A" --- a singularly overhasty conclusion
as the rest of this paper demonstrates.
2.3. Consequences of our definition of semiotic mode
Our basic account of semiotic modes therefore combines fine-grained detail concerning the workings of a discoursal
component with considerations of materiality and that material's perception. We arrange these analogously to the view of
the linguistic system proposed within systemic-functional socio-semiotics, i.e., each semiotic mode is seen as a stratified
system. A material substrate must first be fixed as an essential component for any semiotic mode. Collections of
distinguishable marks with particular meanings-in-context (e.g., traffic lights, patterns of sticks left at decision points to
indicate which path to follow, etc.) then make up sign repertoires. Here, rather than adopting the metaphor of the sign
catalogue that places prominence on individual signs, we draw again on linguistic insights and consider sign-vehicles, i.e.,
particular physically accessible traces, in terms of sets of minimal distinctions organized into hierarchies of more or less
specific choices. This mid-level, or mediating stratum generally operates compositionally and can be characterized
independently of context. Following this we place our more abstract stratum of discourse semantics operating abductively
as set out above. The model as a whole then gives us a tri-stratal organization for semiotic modes, depicted graphically in
Fig. 1. This general organization is posited to hold for all semiotic modes, although the precise manner in which the
material and discourse semantics are filled in each mode is distinct. Further discussion and motivation of this definition of
semiotic modes is given in Bateman (2011).
3. Multimodal discourse semantics
Before proceeding to the application of our framework for the analysis of particular multimodal artefacts, we need to specify
two vital aspects of the model. First, we must commit to some specific discourse semantic mechanisms in order to pursue
analysis; and second, we must explicitly consider the relation between our model and perception --- whereas accounts of
verbal language are often able to background issues of perception (for solid semiotic reasons related to the double
articulation of language: Martinet, 1960), for visual semiotic modes this is, for reasons we shall see, not a viable option.
3.1. Monomodal segmented discourse representation theory
We have proposed discourse semantics as a definitional ingredient of any semiotic mode. The essential task of this
level is to mediate between details of the form of artefacts --- captured in terms of those artefacts' technical features --- and
contextual interpretation. There are now several linguistically-inflected approaches to text and discourse that address
the precise characterization of the dynamic aspects of this process of artefact-guided interpretation. The one which we
will build on here is Segmented Discourse Representation Theory (SDRT: Asher and Lascarides, 2003), a further
dynamic semantic development in the spirit of Kamp's original Discourse Representation Theory (DRT: Kamp, 1981;
Kamp and Reyle, 1993). SDRT has shown itself particularly well suited for natural language texts and provides compelling
accounts of a host of discourse-related phenomena. It has also been articulated in sufficient detail to support corpus-
based studies and automatic analysis, both of which we consider important criteria to be met by any account of discourse,
including accounts of multimodal discourse.
The SDRT framework has received extensive discussion and introduction elsewhere and so for current purposes we
will provide only a very abbreviated summary so that the subsequent discussion can be followed. One central aspect of
the design of SDRT is its reliance on several distinct logics that combine to form an overall logic of discourse
interpretation. These component logics cover and distribute the distinct kinds of inferential work required for differing
levels of linguistic abstraction. The resulting modularity enables relatively simple operations, such as syntactic and
semantic compositionality, to be insulated from inferentially very complex areas, such as general problem solving. Two
Fig. 1. Abstract semiotic mode combining three semiotic strata: material substrate, technical features of form, and discourse semantics.
Table 1
DRS logical forms of the clauses (a) Max fell, (b) John helped him up and (c) John pushed Max.
(a) K_i   referents: x, e_i                       conditions: Max(x); fall(e_i, x)
(b) K_j   referents: x, y, e_j / y = him[Max]     conditions: John(x); help-up(e_j, x, y)
(c) K_k   referents: x, y, e_k                    conditions: John(x); Max(y); push(e_k, x, y)
intermediate, and thereby relatively cheap logics defined in SDRT are of central concern here: the logic of information
content and the logic of information packaging and discourse update, both of which we extend for the multimodal case.
Discourse interpretation in SDRT operates by constructing a semantic representation for each incoming discourse
contribution, traditionally a sentence or utterance, which is then linked by means of discourse relations into a growing
discourse structure. Discourse relations are defined so that both their applicability to particular semantic representations
and the requirements they make of context are made explicit. They thus look both downwards towards concrete linguistic
forms (and their semantics) and upwards towards context. The requirements made of context define precisely the ways
in which identifiable gaps in interpretation are both created and resolved. Below we will adopt the overall form of this
framework for characterizing the discourse semantic stratum of all semiotic modes, independently of those modes
materiality.
The following standard example from Asher and Lascarides (2003:201) illustrates how two very different
interpretations result for the two mini-texts (3) and (4), despite the fact that in both cases we have a sequence of
actions presented in simple past tense; most of the logical formulae used for illustration here are taken directly from
Asher and Lascarides (2003: §4.3 and §5.6), to which the reader is referred for more details.
(3) Max fell. John helped him up.

(4) Max fell. John pushed him.
First, the logic of information content provides the syntax and semantics of a formal language that, on the one hand, is
used to construct and represent the logical forms of discourse segments and, on the other, specifies constraints on the
contextual knowledge to be consulted when attributing discourse relations. The logical forms are represented as
Discourse Representation Structures (DRSs), usually labelled K_i and written using the DRT box-style notation introduced
by Kamp and Reyle (1993). DRSs corresponding to the clauses of the sentences making up our mini-texts are shown in
Table 1(a--c). These structures group the discourse referents introduced by a sentence, shown by the variables in the
upper part of each box, into single domains of accessibility for further discourse operations, such as participating in
anaphoric binding as the discourse develops. Structures of this kind can be produced as the result of standard
compositional clause analyses.4
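As a purely illustrative aside, the box structures of Table 1 can be mirrored in a few lines of Python. This is a minimal sketch under our own assumptions: the class names Condition and DRS and the method is_closed are hypothetical and not part of DRT or SDRT, and box (b) is left out because anaphora resolution is not modelled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """A predicate over discourse referents, e.g. fall(e_i, x)."""
    predicate: str
    args: tuple

    def __str__(self):
        return f"{self.predicate}({', '.join(self.args)})"


@dataclass(frozen=True)
class DRS:
    """One box: referents in the upper part, conditions in the lower part."""
    label: str
    referents: tuple
    conditions: tuple

    def is_closed(self):
        # Every argument used in a condition must be declared in the box.
        return all(arg in self.referents
                   for cond in self.conditions
                   for arg in cond.args)


# Boxes (a) and (c) of Table 1 (hypothetical encoding).
K_i = DRS("K_i", ("x", "e_i"),
          (Condition("Max", ("x",)), Condition("fall", ("e_i", "x"))))
K_k = DRS("K_k", ("x", "y", "e_k"),
          (Condition("John", ("x",)), Condition("Max", ("y",)),
           Condition("push", ("e_k", "x", "y"))))

for box in (K_i, K_k):
    print(box.label, box.referents, [str(c) for c in box.conditions])
```

The check in is_closed only captures the idea that a box groups its referents into a single domain; it says nothing about accessibility across boxes.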
Next, to characterize the sentences' discourse contributions and their progressive construction of text, the individual
semantic specifications need to be combined by finding appropriate discourse relations. Discourse relations are
characterized from two perspectives. First, there are hard constraints that need to be met from context knowledge for a
discourse relation to obtain. These are termed meaning postulates and follow the general form illustrated in Rule 1 (Asher
and Lascarides, 2003:159), expressed within the monotonic (i.e., non-abductive) logic of information content:
$R(a, b) \Rightarrow conditions(a, b)$ (1)
The lefthand side of the rule picks out a particular discourse relation R being added to the current discourse structure
between the segments labelled a and b. If this relation holds, the conditions on the righthand side are required to follow by
regular, non-defeasible material implication.
The second perspective on discourse relations provides abductive inference rules, called default axioms, that specify
which discourse relations may apply given specified properties of the discourse elements being related. The definitions of
default axioms follow the schema given in Rule 2 (Asher and Lascarides, 2003:199):
$(?(a, b, l) \wedge \text{some stuff}) > R(a, b, l)$ (2)
where ?(a, b, l) indicates an underspecified discourse relation holding between segments a and b in the context of the
discourse structure labelled l, R is the specific abduced discourse relation and > is defeasible implication, typically read
as "if ... then normally ...". Asher and Lascarides use "some stuff" to represent the conditions that have to hold in the
4
Since our concern here is with discourse meaning construction, these representations are only suggestive. Our only commitment is to a
broadly event-based semantics (cf. Davidson, 1967; Parsons, 1990). We do not address here further specifically linguistic issues such as
anaphora resolution, etc.
Table 2
Meaning postulates and default axioms for the discourse relations Narration and Explanation in
verbal discourse (identified for current purposes by the prefix L for language).
Meaning postulates:
L.MP.Narration: $Narration(a, b) \Rightarrow overlap(prestate(e_b), Adv_b(poststate(e_a)))$
L.MP.Explanation: $Explanation(a, b) \Rightarrow (event(e_b) \Rightarrow e_b \prec e_a)$
Default axioms:
L.A.Narration: $(?(a, b, l) \wedge occasion(a, b)) > Narration(a, b, l)$
L.A.Explanation: $(?(a, b, l) \wedge cause_D(b, a)) > Explanation(a, b, l)$
antecedent in order to state that there is evidence for the relation. This class of rules is abductive because the attribution of
a discourse relation can always be overridden should more information become available. Constructing a discourse
relation between two discourse elements is as a consequence always an interpretative hypothesis. Nevertheless,
whereas a discourse relation can only be hypothesized, once that hypothesis has been made, certain constraints on the
state of affairs that holds necessarily follow, as given by the meaning postulates. If they can subsequently be shown not to
hold, then the original hypothesis of that discourse relation must be retracted.
This perspectival separation between abductive hypotheses and necessary consequences is important and provides a
manageable link with descriptions of the world and background knowledge external to the text. The default axioms
describe which information the discursive context must provide in order to interpret discourse relations between the
segments. This then offers precisely the mechanism needed for identifying text-determined gaps in information that need
to be filled in from context for a coherent discourse to result. The reasoning process as a whole is then described as one of
gluing together the logical forms of clauses according to applicable default axioms in order to build a maximally coherent
overall logical form for a discourse (cf. Asher and Lascarides, 2003:184).
We now characterize specifically the interpretations relevant for the example mini-texts above. For this we require just
two discourse relations: Narration and Explanation. The meaning postulates and abductive axioms for these relations are
shown in Table 2 in the form proposed by Asher and Lascarides, which we will also employ below multimodally. Let us
consider the meaning postulate for narration (Rule: L.MP.Narration) as an example. This rule states that whenever a
discourse segment a is related to a discourse segment b in the discourse relationship of Narration, then the initial state of
the event expressed in the second segment (i.e., prestate(e_b)) must overlap with the final state of the event expressed in
the first segment (i.e., poststate(e_a)), potentially shifted by some adverbial modifiers of time or place added by the second
segment (i.e., Adv_b) --- a case of such shifting can be seen in "John fell. Three hours later Max helped him up." This
enforces that a spatiotemporal consequence should hold between the related events whenever Narration is abduced. The
corresponding default axiom (Rule: L.A.Narration) then specifies the discourse conditions that must hold for such a
hypothesis to be pursued. The predicate occasion holding between the first discourse segment a and the second
discourse segment b in this rule indicates a natural-event-sequence such that events of the sort described by a lead to
events of the sort described by b (Asher and Lascarides, 2003:200). When such a condition holds, the hypothesis on the
righthand side, i.e., that a Narration(a, b) holds, may be entertained.
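The overlap requirement of L.MP.Narration can be made concrete with a toy timeline. The sketch below is only an approximation under strong assumptions of our own: pre- and post-states are modelled as closed numeric intervals (in arbitrary hours), the adverbial shift Adv_b is reduced to a single numeric offset, and the names Interval, Event and narration_postulate_holds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    start: float
    end: float

    def shifted(self, offset):
        return Interval(self.start + offset, self.end + offset)

    def overlaps(self, other):
        # Closed intervals overlap when the later start is not after the earlier end.
        return max(self.start, other.start) <= min(self.end, other.end)


@dataclass(frozen=True)
class Event:
    name: str
    prestate: Interval
    poststate: Interval


def narration_postulate_holds(e_a, e_b, adv_offset=0.0):
    """Toy version of overlap(prestate(e_b), Adv_b(poststate(e_a)))."""
    return e_b.prestate.overlaps(e_a.poststate.shifted(adv_offset))


fall = Event("fall", Interval(-1.0, 0.0), Interval(0.0, 1.0))
help_up = Event("help-up", Interval(0.5, 1.5), Interval(1.5, 2.5))
late_help = Event("help-up", Interval(3.0, 3.5), Interval(3.5, 4.0))
push = Event("push", Interval(-2.0, -1.0), Interval(-1.0, 0.0))

print(narration_postulate_holds(fall, help_up))                    # immediate help
print(narration_postulate_holds(fall, late_help))                  # no adverbial shift
print(narration_postulate_holds(fall, late_help, adv_offset=3.0))  # "three hours later"
print(narration_postulate_holds(fall, push))                       # push came first
```

With these invented intervals, the postulate is satisfied for the immediate and the adverbially shifted helping events, and fails both for the unshifted late event and for a pushing event that lies before the fall.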
In mini-text (3), therefore, we first assume that some relation links the two sentences, i.e., ? (a, b, l). Moreover, the first
event of Max falling is readily seen as occasioning the second of John helping him up. The preconditions of Rule L.A.
Narration are therefore met and the hypothesis of a Narration relation between the two discourse elements can be
entertained. With this hypothesis in place, the meaning postulates are considered in order to check that the hypothesis is
compatible with world or context knowledge (Asher and Lascarides, 2003:201). The corresponding meaning postulate
(L.MP.Narration) requires quite specific and restricted access to just one aspect of contextual knowledge: temporal
relations. In the present case, if it is known (or can be assumed as an abductive hypothesis) that the pre- and post-states of
the respective events overlap as indicated, then the meaning postulate for narration is fulfilled and the hypothesis of
narration stands. For mini-text (3), we can assume this to be the case and we have therefore succeeded in abducing a
coherent discourse interpretation over the two sentences.
For mini-text (4), however, this does not work since the temporal relations are reversed. The meaning postulate for
Narration therefore fails to hold and so this relation must be rejected. Working through the catalogue of possible discourse
relations, the interpretation process can then consider Explanation. The default axiom for explanation specifies that
explanation may be hypothesized when a causal relationship is suggested by the juxtaposition of discourse elements
in the discourse itself (indicated by the subscript D). Placing two sentences in succession involving elements exhibiting
lexical or activity sequence relationships such as "fall" and "push" suggests such a relationship and so this alternative can
be pursued further. This invokes the meaning postulate for Explanation, which again calls for temporal relationships to be
checked. In this case, the second event can readily be taken as preceding the first event as L.MP.Explanation requires
and so the hypothesis of an Explanation-relation is supported. We thus have two distinct discourse structures suggested
Table 3
Segmented discourse representation structure of the example discourse "Max met John. Max fell.
John pushed him." and its corresponding graphical representation. In the graph, coordinating
discourse relations are shown running horizontally, and subordinating discourse relations vertically.
SDRS (outer box p0 containing an inner box):
p0
  p1, p2, p3
  p1 : K_m, p2 : K_i, p3 : K_k
  p0 : Narration(p1, p2)
       Explanation(p2, p3)
Graph:
  p0 (top node)
  p1 --Narration--> p2 (horizontal)
  p2 --Explanation--> p3 (vertical, p3 below p2)
for the two superficially very similar mini-texts, one preferring Narration and the other Explanation. These differ by virtue of
the discourse relations that have been induced and the corresponding logical constraints on contextual knowledge that
must hold.
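The hypothesize-and-check cycle just described for mini-texts (3) and (4) can be sketched as a small procedure. This is a deliberately reduced illustration rather than an implementation of SDRT: the knowledge base kb, its keys occasion, cause_D and precedes, and the function interpret are hypothetical names of our own, and both meaning postulates are simplified to a test on known temporal precedence between event types.

```python
# Default axioms: evidence that licenses hypothesizing a relation.
AXIOMS = {
    "Narration": lambda a, b, kb: (a, b) in kb["occasion"],
    "Explanation": lambda a, b, kb: (b, a) in kb["cause_D"],
}

# Meaning postulates, simplified to consistency with known temporal order.
POSTULATES = {
    # Narration is ruled out if b is known to precede a.
    "Narration": lambda a, b, kb: (b, a) not in kb["precedes"],
    # Explanation needs e_b before e_a, so a must not be known to precede b.
    "Explanation": lambda a, b, kb: (a, b) not in kb["precedes"],
}


def interpret(a, b, kb, catalogue=("Narration", "Explanation")):
    """Return the first relation that survives its postulate, plus a trace."""
    trace = []
    for relation in catalogue:
        if not AXIOMS[relation](a, b, kb):
            trace.append((relation, "no evidence"))
        elif POSTULATES[relation](a, b, kb):
            trace.append((relation, "accepted"))
            return relation, trace
        else:
            trace.append((relation, "retracted"))
    return None, trace


kb = {
    "occasion": {("fall", "help-up")},   # falling occasions being helped up
    "cause_D": {("push", "fall")},       # pushing is a discourse-suggested cause of falling
    "precedes": {("push", "fall")},      # the push is known to come before the fall
}

print(interpret("fall", "help-up", kb))  # mini-text (3)
print(interpret("fall", "push", kb))     # mini-text (4)

# If an interpreter nevertheless entertains Narration for (4), the postulate forces retraction.
eager_kb = dict(kb, occasion=kb["occasion"] | {("fall", "push")})
print(interpret("fall", "push", eager_kb))
```

The last call follows the walk-through above most closely: Narration is hypothesized, its meaning postulate fails against the known temporal order, and the hypothesis is withdrawn in favour of Explanation.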
Finally, SDRT formulates the process of gluing together the semantics of discourse elements in terms of a discourse
update procedure (Asher and Lascarides, 2003:212). This procedure takes as inputs the discourse so far and the new
discourse element to be added, considers the discourse relations that might hold, and produces as output a single
segmented discourse representation structure (SDRS) capturing the result of the update. On the left of Table 3 we see a
corresponding completed discourse semantics for a slightly longer mini-text involving both narrative and explanation
relations: "Max met John. Max fell. John pushed him." Here we have two box structures, one embedded within the other
reflecting the discourse dependencies holding. The first lines of each box label the corresponding discourse segments
(p0--p3). The outer box corresponds to the entire discourse considered as a unit, while the inner box identifies the three
sentences constituting that discourse's sub-units. The lower parts of each box list the logical relations holding between
segments, reusing two of the DRSs defined in Table 1 above, and their abduced discourse relationships. p1, meeting, and
p2, falling, are interpreted as following each other sequentially and so are related by a Narration-relation; p2, falling, and
p3, pushing, are in contrast related via an Explanation-relation as described above. Thus the first two sentences follow our
argumentation concerning mini-text (3) and the last two sentences follow that concerning mini-text (4).5 An equivalent
graphical representation highlighting just the discourse structure dependencies and their status as coordinating or
subordinating (cf. Asher and Vieu, 2005) is shown on the right of the table. The geometry of such graphs is used for
specifying discourse accessibility conditions (cf. Asher and Lascarides, 2003:44) and for describing how distinct parts of a
text can play differing roles relative to other parts; we return to this for the visual case below.
Progressively larger discourse structures are constructed by invoking discourse relations that impose segmentations
over collections of individual discourse representation structures introduced by a discourse. The selection of discourse
relations always attempts to maximise the coherence of the discourse as a whole, which may well lead to retractions of
previous hypotheses as more information becomes available.
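To make the incremental character of discourse update more tangible, the structure of Table 3 can be built up one segment at a time. The following is a minimal sketch under our own assumptions: the names SDRS, update and edges_by_kind are hypothetical, the discourse relation for each step is supplied by hand rather than abduced, and the classification of Narration as coordinating and Explanation as subordinating is taken from the layout of Table 3.

```python
from dataclasses import dataclass

COORDINATING = {"Narration"}      # drawn horizontally in Table 3
SUBORDINATING = {"Explanation"}   # drawn vertically in Table 3


@dataclass(frozen=True)
class SDRS:
    label: str
    segments: tuple = ()    # (segment label, DRS name) pairs
    relations: tuple = ()   # (relation, from label, to label) triples


def update(sdrs, new_label, drs_name, relation=None, attach_to=None):
    """Return a new SDRS extended by one segment and, optionally, one relation."""
    known = {label for label, _ in sdrs.segments}
    if new_label in known:
        raise ValueError(f"segment {new_label} already present")
    relations = sdrs.relations
    if relation is not None:
        if attach_to not in known:
            raise ValueError(f"cannot attach to unknown segment {attach_to}")
        relations = relations + ((relation, attach_to, new_label),)
    return SDRS(sdrs.label, sdrs.segments + ((new_label, drs_name),), relations)


def edges_by_kind(sdrs):
    """Group relation edges by their coordinating or subordinating status."""
    kinds = {"coordinating": [], "subordinating": []}
    for relation, a, b in sdrs.relations:
        if relation in COORDINATING:
            kinds["coordinating"].append((a, relation, b))
        elif relation in SUBORDINATING:
            kinds["subordinating"].append((a, relation, b))
        else:
            raise ValueError(f"unclassified relation {relation}")
    return kinds


# "Max met John. Max fell. John pushed him."
p0 = SDRS("p0")
p0 = update(p0, "p1", "K_m")
p0 = update(p0, "p2", "K_i", relation="Narration", attach_to="p1")
p0 = update(p0, "p3", "K_k", relation="Explanation", attach_to="p2")

print(edges_by_kind(p0))
```

Each call to update returns a fresh structure rather than modifying the old one, which keeps earlier discourse states available should a hypothesis later need to be retracted.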
3.2. Multimodal perception and discourse interpretation
The previous subsection introduced a method of formal discourse analysis well established for verbal discourse. Below
we extend this framework to provide a process of dynamic meaning construction for multimodal discourse in general. For
this, we must also incorporate the very different role played by perception when we deal with non-linguistic artefacts.
Approaches that begin from a linguistic perspective always face the danger that the basic mechanisms of perception fail to
be given sufficient weight, leading to two unfortunate consequences. First, the entire process of, for example, visual
meaning construction can readily come to be treated as one of conventionalized interpretation, thereby disregarding (or
rejecting) natural or iconic (in the traditional Peircean sense) sources of meaning. And second, visual (and other
non-linguistic) artefacts may come to be modelled along the lines of syntax, positing basic elements which need to be
combined to derive their meanings (cf., e.g., Saint-Martin, 1990; Groupe μ, 1992 and many others). In visual studies more
broadly, however, the existence of such elements has been, and continues to be, strongly contested (cf., e.g., Gombrich,
1959:6; Eco, 1976:215; Groensteen, 2007:3--7; Sachs-Hombach and Schirra, 2011:103--104) and so we take any
account that avoids reliance on such an assumption --- such as the framework we propose here --- to be preferable. There
5
A similar argumentation applies to our first examples, mini-texts (1) and (2), suggesting a Narration discourse relation for mini-text (1) and a
Continuation relation (cf. Asher and Lascarides, 2003:461), i.e., a relation that does not entail temporal succession, for mini-text (2). We omit the
details here.
Fig. 2. Detail extracted from a panel used as an example of pictorial runes in Forceville (2011:886) (taken from Soirs de Paris: 52 Avenue de la
Motte, panel 3.2.2, Avril and Petit-Roulet).
are also, conversely, dangers when working purely from the side of perception, where it is rare for the particular role of
discoursal interpretations to find sufficient consideration. In short, we need various sources of information and
interpretative constraints to coexist.
We illustrate this briefly here by considering the problem of achieving the visual segmentation necessary when
interpreting constructed images such as those used in comics. We suggest that some of the fine-grained inferential gaps
that require filling in order to construct coherent discourse structures also operate to drive goals and hypotheses during
visual perception. Moreover, it is then the perceptually interpreted visual material that establishes an appropriate
alternative to the logical representations delivered by compositional semantics in the case of linguistic discourse
interpretation. That is: perception reveals potential analytic units, among which discourse organization further refines and
selects when placing these units into a communicatively motivated structure. This provides the raw material for the
formation of discourse hypotheses in visual media.
To show this in action, consider the small extract taken from Forceville (2011) depicted in Fig. 2. As we suggest further
with our examples below, comics are a very good source of discussion points in this area precisely because of their
combination of both iconic and conventionalized semiotic material. In the rest of this paper, therefore, we will continue to
draw our examples from comics so as to improve the consistency of the discussion --- however, the mechanisms we
discuss are equally applicable to visual sequences in general and are by no means restricted to comics in their application.
Certain information concerning any visual material being interpreted as communicative will be revealed directly by visual
perception. In the extract in Fig. 2, for example, Gestalt processing and the human propensity to orient towards faces will
deliver the information that we are seeing a human face. It is this Gestalt organization that imposes particular communicative
roles on the lines and curves present in the image --- that is, certain lines take on the roles of mouth, eyebrows, eyes, certain
closed areas become face, neck, hair and so on. Once fixed in this way, additional interpretations of the facial expression may
be delivered similarly, although with more or less reliance on established semiotic conventions depending on how
naturalistic a rendition is being employed. Feng and O'Halloran (2012), for example, offer a detailed characterization of the
various kinds of depictions available in comics for representing emotions via iconicity of bodily representations. In their
system, the facial expression and posture of the woman in Fig. 2 would be characterized as "eyebrows: inner corners
lowered, eyes: narrow/closed, mouth: tensed" (Feng and O'Halloran, 2012:2071). This overlaps well (although not
completely) with their characterization of expressions of anger (Feng and O'Halloran, 2012:2076).
In addition, we see in the present example that there is further information to be extracted since there are several visual
components that are not exhausted, or consumed, by the Gestalt recognition of a woman's face. First, above the
woman's head is a grouping of three slightly wavy vertical lines: these will be separated by Gestalt principles from the face
below them and so are available for interpretation in their own right. Second, and considerably more challenging, are the
two slightly curved horizontal lines to the right of the woman's head. Demarcating and interpreting these lines constitute an
almost prototypical case of the need for discourse interpretation --- they overlap spatially and perceptually with lines
depicting the woman and so might have a range of candidate interpretations, such as, for example, two long, out of place
hairs sticking out.
Various authors have proposed labels for expressive devices of this kind, which generally add conventionalized non-
iconically represented content modifying the information depicted in comics panels --- typically specifying types or
manners of action, the emotional state of characters, etc. Forceville presents a very useful and detailed classification of
this class of expressive forms, which he terms pictorial runes (following an early proposal from Kennedy, 1982); Cohn
(2013a:37--48) offers a similarly detailed characterization, drawing an analogy with the principles of linguistic morphology.
In this respect, the notion of bound visual morphemes used by Cohn captures more appropriately the necessary
dependence between such expressive elements and the iconic elements they modify, but meshes less well with our
Fig. 3. Details extracted from two consecutive panels from Soirs de Paris: 52 Avenue de la Motte, Avril and Petit-Roulet, showing the importance
of difference for assigning meanings.
orientation to discourse rather than grammatical (which we take here to include morphological concerns) levels of
description. For ease of reference, therefore, we will continue to use Forceville's "runes" for the time being, although a
discourse-oriented term would be preferable.
We will now re-construct the informal characterization of the interpretation of such devices set out by Forceville from the
perspective of a multimodal discourse semantics of the kind we have introduced. We focus on two important aspects here:
how do we determine just which visual elements are pictorial runes --- i.e., what drives an appropriate segmentation ---
and, once they are identified, how is their meaning to be established? As Forceville demonstrates, the interpretation of these
expressive devices is often subject to indeterminacies concerning just what meaning may be being made. However, these
indeterminacies are generally resolvable when other features of the panels and their combined role in discourse are
considered. In the section following, we will set out the defeasible consequences of some pictorial runes in more detail;
here we simply sketch the interpretation process involved to show that an explicit representation of the operations of
discourse coherence helps clarify precisely how segmentation and interpretation can occur.
To begin, the more straightforward marks above the woman's head in Fig. 2 are an example of the expressive device
that Forceville labels as spirals (which, in contrast to the twirls we will see below, are drawn without loops). Forceville
(2011:881) describes spirals as having several distinct meanings, although when placed around someone's head, they
typically indicate a generically negative emotion, such as anger, disgust, or frustration. In this case, therefore, we can
directly restrict among these three most likely hypotheses by considering them together with the information yielded by the
depicted facial expression: that is, we extend the context of interpretation. This is typical of the process of discourse
interpretation --- cues are combined from an immediate context of relevance (related to the domain of accessibility in
linguistic interpretation) following the search for applicable discourse relations that can bind the information together in a
way that maximizes discourse coherence.
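The restriction among competing readings described here can be approximated very crudely as an intersection of the candidate readings contributed by each cue. This is a sketch only: the function narrow is a hypothetical name of our own, set intersection is a stand-in for the search over discourse relations, and reducing the facial expression to the single reading "anger" is a simplifying assumption.

```python
def narrow(readings_by_cue):
    """Intersect candidate readings across cues; cues without readings are ignored."""
    remaining = None
    for candidates in readings_by_cue.values():
        if not candidates:          # a silent cue places no constraint
            continue
        candidates = set(candidates)
        remaining = candidates if remaining is None else remaining & candidates
    return remaining if remaining is not None else set()


# The spirals alone leave three readings open.
spirals_only = {"spirals around head": {"anger", "disgust", "frustration"}}
print(sorted(narrow(spirals_only)))

# Extending the context of interpretation to the facial expression narrows the choice.
with_face = dict(spirals_only, **{"facial expression": {"anger"}})
print(sorted(narrow(with_face)))
```

An empty result would correspond to cues that cannot be reconciled, in which case a different segmentation or a wider context would have to be tried.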
The two lines to the right of the woman's head are more complex. In fact, these lines also constitute a pictorial rune in
Forceville's terms, in this case a form of motion lines or movement lines (Forceville, 2011:877; Cohn, 2013a:38--39) --- i.e.,
indicators of rapid movement showing that the woman is turning her head quickly to her right (i.e., leftwards in the image).
The central question is then how an interpreter can determine this since there is actually rather little visual material here to
draw on. However, the interpretation choice between stray hairs and motion lines can also be made by extending the
context of interpretation --- this time taking in the immediately preceding panel as well. In that panel, we see the same
woman depicted but with her head facing in the other direction. Thus, when coreference is hypothesized, the face has
evidently turned round across the two panels and the additional lines can be interpreted as showing the manner of that
movement, i.e., she turns very quickly. We see this in Fig. 3 where the relevant aspects of the two panels under
examination have been extracted and placed in their original order.6
We therefore consider both the identification of elements that may be functioning as pictorial runes and the attribution
of meanings to those elements to be a result of pursuing discourse coherence in the manner described above for verbal
discourse. Only when a discourse relation can be found that succeeds in offering a coherent interpretation of the material
on offer is the material (a) identified as such and (b) integrated within the growing discourse interpretation. Since, as we
have explained, this is always an abductive, and hence defeasible, operation, the process can go wrong and false interpretations
can be entertained (for example, that the motion lines are stray hairs). There may also be differences in interpretation due
to individual differences in knowledge (for example, that one viewer does not know the convention of using spirals). Such
potential differences are natural but still fall within the operations of a discourse semantics as we have defined it.
6
There are also other narrative events in the originals of these panels not shown here that raise the likelihood of the anger-interpretation of the
spirals still further --- for those events, the reader is referred to Forceville (2011) directly.
Table 4
Discourse Representation for the panel extract
depicted in Fig. 2 and on the right of Fig. 3.
e_j = anger
[v] head of represented participant (q)
[v] eyebrows: inner corners lowered (r)
[v] eyes: narrow/closed (s)
[v] mouth: tensed (t)
[v] spirals: negative mental state (u)
[v] horizontal runes: rapid movement (v)
q, r, s, t, u |~ anger(e_j)
Formally, we make the visual and layout materials that are being combined here accessible to the process of discourse
interpretation by providing propositional descriptions similar to those used above for verbal discourse. In particular, we
take the SDRT box-style notation and extend this to capture information delivered visually such as the results of visual
Gestalt processing and the abductive identification of expressive devices such as Forceville's "pictorial runes" as
described above. This then combines to yield a perceptually-motivated logical form, an example of which for our first
example panel from Fig. 2 (and so for the second panel of Fig. 3 also) is shown in Table 4.
In these logical forms we extend the DRT-box representations for multimodal use in several ways. First, the upper
portion of the structure shows an inferred verbalization for the eventuality or object identified by the structure; this
represents a propositional description that would normally be inferred (abductively) by the recipient because of his/her
world knowledge. The last line of the box representation explicitly documents the sources of evidence leading to this
inference by referring to discourse referents identified in the middle portions of the representation. Thus, in the present
case, it is the combination of the discourse referents q, r, s, t, u that allows the inference of the proposition "anger"; this is
described as a defeasible consequence, which we indicate with the logical operator $\mathrel{|\!\sim}$. This means that whereas in the
case of language, it is the compositional semantics derived for clauses that provides the content of such boxed discourse
structure representations, in the visual case we derive a similarly structured representation on the basis of what is being
attended to visually (cf. Wildfeuer, 2012).
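To make the three-part structure of such a box concrete, the contents of Table 4 can be written down as a small data structure. The following Python sketch is purely illustrative and is not part of SDRT or of any existing implementation; the names Referent, MultimodalDRS, evidence and expressive_device are assumptions introduced here for exposition.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Referent:
    label: str                       # discourse referent, e.g. "q"
    modality: str                    # contributing modality, "v" = visual
    description: str                 # Gestalt or technical description
    expressive_device: bool = False  # True for runes, spirals and the like


@dataclass
class MultimodalDRS:
    event: str                 # eventuality label, e.g. "e_j"
    inferred: str              # inferred verbalization (upper portion)
    referents: List[Referent]  # middle portions of the box
    evidence: List[str]        # labels licensing the defeasible inference

    def supporting(self) -> List[Referent]:
        """Referents named in the last line of the box."""
        by_label = {r.label: r for r in self.referents}
        return [by_label[label] for label in self.evidence]

    def unused(self) -> List[str]:
        """Referents perceived but not used as evidence for the inference."""
        return [r.label for r in self.referents if r.label not in self.evidence]


# Hypothetical encoding of the contents of Table 4.
TABLE4 = MultimodalDRS(
    event="e_j",
    inferred="anger",
    referents=[
        Referent("q", "v", "head of represented participant"),
        Referent("r", "v", "eyebrows: inner corners lowered"),
        Referent("s", "v", "eyes: narrow closed"),
        Referent("t", "v", "mouth: tensed"),
        Referent("u", "v", "spirals: negative mental state", True),
        Referent("v", "v", "horizontal runes: rapid movement", True),
    ],
    evidence=["q", "r", "s", "t", "u"],
)
```

In this encoding the horizontal runes (v) are recorded as perceptually available but do not figure among the evidence for the anger inference, which mirrors the last line of the table. The defeasibility of the inference itself is not modelled here.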
The role of these representations is to make explicit a distinction between some open-ended notion of the complete
information theoretically present in a visual depiction and the much more constrained selection from this information
necessary for explaining and driving discourse coherence. This division echoes several positions taken with respect to
visual artefacts. Bryson (1981:1--6) distinguishes similarly between figure and discourse in images and their
interpretation, while the comics semiotician Groensteen, in his detailed semiotic analysis of the visual material of comics
to which we will return below, argues:
Indeed, the reduction of the utterable [the visual] to a statement mobilizes, in the image, only the elements directly
concerned with the narrative process, that is to say, those engaged in action. . . . If, for example, I convert the [panel]
into a statement such as "raising his saber, the first horseman falls upon Corentin and Zaila immobilized on the
ground", the details relative to the attitudes of the protagonists that are retained as pertinent . . . do not exhaust the
visual information contained in the image. (Groensteen, 2007:121)
The procedure we adopt for extracting propositional-like information from visual material draws very similar conclusions
and serves precisely the same purpose of identifying (discursively) pertinent information. In the terms of the account
developed here, this then corresponds to the statement that the information used in our formal representations is just the
information that is licensed by the need to pursue coherence with respect to a specific repertoire of discourse relations. If
those discourse relations are pursuing narrative ends, then the effect will be quite similar to that set out by Groensteen or
Bryson (although more formally and completely specified).7
The discourse referents given in the middle portions of the representation in Table 4 then explicitly list the multimodal
composition of the contributing sensory modalities --- i.e., the specific Gestalt interpretations and technical features of the
elements abduced. In the present case, these are all visual and so are marked as [v].8 The elements made available by the
Gestalt assumption of a face are given first (q--t) using the descriptive categories motivated by Feng and O'Halloran
(2012), while the next portion of the representation shows the expressive devices identified in the elements (u--v). These
latter are maintained separately in the representation in order to document their very different potential for supporting
coreference and discourse development --- two spirals or other similar expressive devices in two successive panels will
not, for example, be taken as coreferential in the same manner that the depiction of the same individual in two panels may
be, although there may be cohesive relations characterizing continuations or changes in the states of affairs depicted.9
7
And, in fact, we also open up the account of discourse to include all discourse relations licensed by a semiotic mode (see below), which
generally includes other discoursal uses of visual material than narrative.
8
The selection of candidates for modalities is itself an important area of multimodal theorizing; we will not address this in any detail further at this
point, however. In addition, individual visual events are classified and decomposed according to the framework of visual transitivity developed by
Kress and van Leeuwen (2006 [1996]); again, we omit this level of detail here.
The logical form as a whole gives ample material for setting in motion the process of abductively maximizing discourse
coherence by searching for applicable discourse relations as shown above for verbal discourse. In contrast to
the corresponding logical forms for verbal language, however, the representations here are already the result of extensive
abductive reasoning on the one hand and delivered packets of information from visual perception on the other. The
selection of the desired meanings for technical devices such as pictorial runes, for example, must already be made with
respect to their support for discourse coherence so that mutually compatible collections of meanings result. This means
that the meanings selected for the spirals and horizontal lines will have already been classified in terms of their abductively
hypothesized interpretations as indicators of a negative mental state and motion respectively. Such interpretations are
often, we suggest, not even recognizable without the broader context of at least the material depicted in the panel in which
they occur and often of surrounding panels as well, thereby constituting a larger coherent segment of discourse. We
described this informally above and will present a more formal characterization with respect to defined discourse relations
in the section following.
In contrast to the case with verbal discourse, therefore, we have made no assumption for the visual case of some
compositional construction of logical form from elements present in the image. We avoid the need for this by attributing
meanings to visual depictions via a combination of top-down and bottom-up mechanisms drawn from appropriate models
of perception. The events that are taken for discourse construction (e.g., anger in the present case as supported by the
facial expression and negative emotion indicated by the spirals) are then themselves crucially seen as defeasible
hypotheses made in the course of maximizing discourse coherence.
All events which are described as discourse representations can be interpreted and described in a similar way and may
be picked up subsequently if discourse coherence makes them relevant: i.e., they are available perceptually and may play
roles for discourse interpretation when creating hypotheses concerning particular discourse referents or activities that
may be needed to achieve coherence. By these means our account aligns itself with some of the critiques brought against
linguistically-inspired semiotic accounts of visual media --- such as, for example, Miodrag's (2013:188) re-statement of the
argument that "[t]he pertinent distinctions on which differential meanings depend are, in visual signification, context-specific"
--- while at the same time moving beyond such critiques to demonstrate precisely in what way such context-specificity
can operate. On the one hand, since we can specify quite precisely just what context requirements are raised
when seeking discourse coherence, it becomes possible to identify pertinent distinctions that can be sought to meet those
requirements; on the other hand, we can suggest how the assumption of a more or less specific context can itself make an
entire repertoire of relevant distinctions available for driving and verifying interpretations.
This relation between discourse hypothesis formation and perception also suggests potential strategies for empirical
investigation --- in particular, since discourse hypotheses may set up cues that need to be attended to in order to be
perceived, techniques for exploring attention allocation, such as eye-tracking (cf. Boeriis and Holsanova, 2012), may
prove revealing. Such mechanisms may clarify considerably how a dynamically developing discourse context can
contribute to the isolation of intended meanings of underspecified visual signs. Our general position is compatible with all
models of visual perception that treat perception as a process of actively interrogating the environment for useful
information rather than as passive consumption (cf. Bruner and Postman, 1949). Moreover, it is well established that this
interrogation is both highly sensitive to the particular goals and tasks being undertaken by a perceiving agent (cf. Yarbus,
1967) and highly selective (cf. Simons and Rensink, 2005). In this spirit, Schill et al. (2001) present an explicit model of
perception in terms of information gain, whereby attention is directed towards those parts of the perceptual field that will
provide maximal information for resolving hypotheses concerning what is being perceived.
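The idea of directing attention to wherever the most information can be gained for resolving competing hypotheses can be illustrated with a standard expected-information-gain calculation. The Python sketch below is not a reimplementation of the model of Schill et al. (2001); it only shows the general principle. The two hypotheses are taken from the motion-lines example above, while the region names, the possible observations and all probabilities are invented for illustration.

```python
import math
from typing import Dict

Dist = Dict[str, float]


def entropy(dist: Dist) -> float:
    """Shannon entropy (in bits) of a probability distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def expected_information_gain(prior: Dist, likelihood: Dict[str, Dist]) -> float:
    """Prior entropy minus expected posterior entropy.

    likelihood[observation][hypothesis] = P(observation | hypothesis).
    """
    gain = entropy(prior)
    for given_hypothesis in likelihood.values():
        p_obs = sum(given_hypothesis[h] * prior[h] for h in prior)
        if p_obs == 0:
            continue
        posterior = {h: given_hypothesis[h] * prior[h] / p_obs for h in prior}
        gain -= p_obs * entropy(posterior)
    return gain


def best_region(prior: Dist, regions: Dict[str, Dict[str, Dist]]) -> str:
    """Region whose inspection is expected to reduce uncertainty most."""
    return max(regions, key=lambda name: expected_information_gain(prior, regions[name]))


# Invented example: are the extra lines motion lines or stray hairs?
PRIOR = {"motion_lines": 0.5, "stray_hairs": 0.5}
REGIONS = {
    # Hypothetical region whose appearance depends on the hypothesis.
    "figure_posture": {
        "turned": {"motion_lines": 0.9, "stray_hairs": 0.2},
        "static": {"motion_lines": 0.1, "stray_hairs": 0.8},
    },
    # Hypothetical region that looks the same under either hypothesis.
    "panel_border": {
        "plain": {"motion_lines": 0.5, "stray_hairs": 0.5},
        "ornate": {"motion_lines": 0.5, "stray_hairs": 0.5},
    },
}
```

With these invented numbers, best_region(PRIOR, REGIONS) selects the posture of the figure, since inspecting the panel border cannot discriminate between the two hypotheses.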
Now, although we do not want (or need) to claim that natural scene perception of this kind is the same as interpreting a
text, a more abstract similarity is revealed by addressing more finely the question of the distinct kinds of hypotheses that
are being pursued. Important here is to note that our concern is solely with goals that are established as part of the process
of producing and interpreting discourse --- we are not considering visual or audiovisual perception in general. This is then
relevant for visual processing whenever the visual material is taken as embedded in, or constitutes, an unfolding
discourse, which certainly applies to visual artefacts that have been created in order to communicate and guide
interpretation on the part of their viewers. In short, we see aesthetically and textually constructed materials as necessarily
leading to classes of hypotheses that hold in addition to those operating in natural visual perception.10 Viewing and
interpreting visual material such as comics, graphic novels, film or other designed images thus involves the mechanisms
both of natural perception and of discourse interpretation. This is particularly important for aesthetically overcoded
artefacts, such as comics and film.
9
As pointed out to us by one anonymous reviewer, there may well be examples such as a sequence in which a panel showing a heart above
someone's head as an indicator of being in love is followed by a similar panel where the heart is visually broken in two. We interpret this in terms of
cohesion and not in terms of co-reference because we see each application of the expressive device as re-stating or changing the attitude
attributed to the bearer rather than as picking out the entity depicted as such; further discussion is, unfortunately, beyond the scope of the present
paper.
Fig. 4. Example of use of pictorial runes in panel 26 B3 (original in colour) from Hergé's Tintin et les Picaros (1975--1976). © Hergé/Moulinsart
2014. Used by permission.
4. Multimodal discourse interpretation in comics
We have suggested that the formal perspective on discourse provided by Asher and Lascarides offers an appropriate
framework for exploring how meaning can be dynamically constructed in discourse, even when that discourse involves
multiple modalities. With the help of these descriptive logical tools and their adaptation for multimodal discourses, it is
possible to uncover formalized patterns of meaning construction in a far wider range of discourses than previously
possible. We now illustrate this further with respect to some more complex examples, all drawn from comics but
considering progressively more challenging aspects of their strategies for meaning-making. In addition, we provide a
more formal characterization of the multimodal discourse relations that must be applied in order to construct coherent
multimodal discourse interpretations.
4.1. More pictorial runes and interpretation indeterminacy
In Fig. 4, taken from one of Hergé's well-known The Adventures of Tintin volumes, we see two examples of the kinds of
pictorial runes that Forceville (2011) labels as "twirls". These graphical elements can be seen immediately above the
head of Professor Calculus (the character being supported as he descends the stairs) and also below him, to the left of
his feet in the image. Although the graphical elements themselves are similar in form, they end up with quite different
interpretations. Here too, therefore, we have an interesting case of interpretative indeterminacy for which multimodal
discourse interpretation offers a resolution. Forceville's suggestion is that one of the twirls indicates dizziness, while the
other indicates motion and direction. These differences are related to where the graphical elements appear spatially with
respect to the figures to which they are attached: dizziness twirls, as here, go up from the head; movement twirls, also as
here, run in a parallel (i.e., usually horizontal) orientation to . . . feet (Forceville, 2011:882). In the terms we have
developed in this article, these differing meanings can be characterized as distinct discourse purposes or
communicative goals. This then makes them available as discourse hypotheses that can be probed during discourse
interpretation by combining the meanings they offer with the spatial attachment points available plus any other
(discoursally) relevant information derivable from the text.
10
The conflation of these two distinct kinds of concerns is analogous to the effacement of dynamic discourse semantics in considerations of text
comprehension that focus on story content at the expense of the discoursal construction of that content.
Table 5
Propositional representations of the Tintin et les Picaros example.
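Table 6
Meaning postulates and default axioms for discourse relations in comics (identified for current purposes by the prefix C for comics).

Meaning postulates:
C.MP.Enhancement: $\text{Enhancement}(\alpha, \beta) \Rightarrow \text{event}(e_\alpha) \wedge \text{Circumstantial}(e_\alpha, e_\beta)$
C.MP.Property:    $\text{Property-of}(\alpha, \beta) \Rightarrow \text{object}(e_\alpha) \wedge e_\beta(e_\alpha)$
C.MP.Part:        $\text{Part-of}(\alpha, \beta) \Rightarrow \text{part-of}(e_\beta, e_\alpha)$
C.MP.Contrast:    $\text{Contrast}(\alpha, \beta) \Rightarrow \Box(K_\alpha \sim K_\beta)$
C.MP.Parallel:    $\text{Parallel}(\alpha, \beta) \Rightarrow \Box(K_\alpha \sim K_\beta)$
C.MP.Narration:   $\text{Narration}(\alpha, \beta) \Rightarrow \text{overlap}(\text{prestate}(e_\beta), \text{poststate}(e_\alpha))$

Default axioms:
C.A.Enhancement: $(?(\alpha, \beta, \lambda) \wedge \text{proximity}_D(\alpha, \beta)) > \text{Enhancement}(\alpha, \beta, \lambda)$
C.A.Property:    $(?(\alpha, \beta, \lambda) \wedge \text{proximity}_D(\alpha, \beta)) > \text{Property-of}(\alpha, \beta, \lambda)$
C.A.Part:        $(?(\alpha, \beta, \lambda) \wedge \text{sequence}_D(\alpha, \beta)) \wedge \text{detail}_D(\beta, \alpha) > \text{Part-of}(\alpha, \beta, \lambda)$
C.A.Contrast:    $(?(\alpha, \beta, \lambda) \wedge \text{semantic dissimilarity}(\alpha, \beta)) > \text{Contrast}(\alpha, \beta, \lambda)$
C.A.Parallel:    $(?(\alpha, \beta, \lambda) \wedge \text{semantic similarity}(\alpha, \beta)) > \text{Parallel}(\alpha, \beta, \lambda)$
C.A.Narration:   $(?(\alpha, \beta, \lambda) \wedge \text{occasion}(\alpha, \beta)) > \text{Narration}(\alpha, \beta, \lambda)$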
This operates as follows. As explained in the previous section, the normal workings of perception will already raise a
range of potentially relevant events: visual salience will lead to the underspecified recognition of some human
participants, their interactions, objects of attention and activities or motions. The potential discourse representation of
these is then seen as a collection of such event descriptions, which will be further refined by the application of discourse
relations while attempting to maximize discourse coherence. As before, we gloss these here in propositional
representations as given in Table 5. We consider the presence of the twirls in the image as technical features of the
material defined within this semiotic mode and so again place these in a distinct portion of the representation. These
representations are still underspecified in that they do not yet include the finer determination of their meanings that will be
abduced as hypotheses by application of further discourse rules.
The next step is then to apply discourse relations in order to combine these contributions in a coherent discourse. To do
this, we need to provide corresponding definitions of discourse relations that appear appropriate for the putative semiotic
mode at work in comics. As we shall see, these definitions show interesting similarities but also differences with those
employed for verbal language. In the present case, for example, we can note that for comics and many other depictive
visual modes, notions of spatial proximity will often provide important evidence for constructing coherence --- that is,
the fact that some visual element is in the vicinity of another may well indicate that one is a candidate for modifying the
meaning of the other (cf. Cohn, 2013a:34). This and several other related notions give us a first approximation for a set of
discourse relation definitions that may subsequently be investigated empirically. As in our discussion of verbal discourse
above, we first summarize in a single overview table the discourse relations that we propose for the semiotic mode of
comics (cf. Table 6). We will refer to these rules in the discussions below, providing more information about
their definitions as needed.
The rules that we will employ for characterizing the Tintin et les Picaros example are C.MP.Enhancement and
C.MP.Property, both of which employ spatial proximity within panels; the former is intended to hold when the modified element
depicts an activity of some kind; the latter when some entity is receiving additional properties --- that is, if Enhancement
holds, then the circumstantial information specified is added to the modified eventuality11 and, if Property-of holds, then
the indicated property is applied to the modified entity. These rules will apply broadly within this semiotic mode whenever
there are non-iconic expressive devices of the kind Forceville, Cohn and others list.
11
Circumstantial is taken as a relation in a linguistically-motivated semantic ontology of the kind described in considerable detail in, for
example, Bateman et al. (2010).
These distinctions then allow the two distinct senses of twirls suggested by Forceville to be captured straightforwardly.
For this, it is useful to add some semantic content to characterize the specifically conventional contribution of these
devices more clearly: in one case, the line of the twirl visually indicates a spatial path and the tight curves are suggestive of
unsteady motion; in the other case, the line of the twirl is more indicative of the mental state or subjective feeling invoked.
Rules 3 and 4 then refine the rules C.A.Enhancement and C.A.Property respectively and can be seen as analogous to
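lexical entries for devices of the form twirl, the perceptible presence of which in the visual material is indicated by $\text{twirl}_D$:
\[ \text{Enhancement}(\alpha, \beta, \lambda) \wedge \text{twirl}_D(\beta) \wedge \text{proximity}_D(\beta, \text{movement}(\alpha)) > \text{path}(\alpha, \text{shape-of}(\beta), \lambda) \tag{3} \]
\[ \text{Property-of}(\alpha, \beta, \lambda) \wedge \text{twirl}_D(\beta) \wedge \text{proximity}_D(\beta, \text{head}(\alpha)) > \text{dizzy}(\alpha, \lambda) \tag{4} \]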
These rules are then explored when attempting to construct a maximally coherent interpretation over the hypothesized
discourse elements. The rules themselves indicate which information must be present (or be assumed to be present) and
how that information may be combined. In the case of the lefthand side of Table 5, the preconditions for Enhancement
hold: the visible twirl is spatially proximal to an actor depicted as moving. This then allows an interpretation by Rule 3 in
which the shape of the twirl can be abduced to suggest the manner of the motion. In the case shown on the righthand side, the
preconditions for Property-of hold, which then leads to the abduction of the actor having a particular mental state, that of
being dizzy. Thus we can see how in both cases a very similar visual entity is defeasibly hypothesized to take on very
different meanings, precisely those meanings that Forceville proposes to hold on empirical grounds. A formal treatment of
the spirals discussed in the previous section can be set out in exactly the same way.
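The way in which Rules 3 and 4 act like lexical entries for the twirl can be sketched in a few lines of Python. This is a deliberately simplified illustration and not an implementation of SDRT's glue logic: defeasibility and the competition between hypotheses are not modelled, and the names Device, rule3, rule4 and interpret, as well as the string encoding of the conclusions, are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Device:
    kind: str  # perceived device type, e.g. "twirl"
    near: str  # what the device is spatially proximal to


def rule3(device: Device) -> Optional[str]:
    """Enhancement: a twirl near a depicted movement suggests a path."""
    if device.kind == "twirl" and device.near == "movement":
        return "path(actor, shape-of(twirl))"
    return None


def rule4(device: Device) -> Optional[str]:
    """Property-of: a twirl near a head suggests dizziness."""
    if device.kind == "twirl" and device.near == "head":
        return "dizzy(actor)"
    return None


def interpret(device: Device) -> Optional[Tuple[str, str]]:
    """Return (discourse relation, hypothesized meaning), or None if no rule fires."""
    for relation, rule in (("Enhancement", rule3), ("Property-of", rule4)):
        meaning = rule(device)
        if meaning is not None:
            return relation, meaning
    return None
```

Under these assumptions, interpret(Device("twirl", "movement")) yields the Enhancement reading and interpret(Device("twirl", "head")) the Property-of reading, so that the same graphical form receives two different meanings depending on its point of attachment.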
Although there is much more to be drawn out here, particularly important for us in the present discussion is to
emphasize how our account meets a standard criticism brought against linguistically-based models of text-image and
image-image combinations, namely that inappropriate grammar-like structuring relationships need to be assumed in the material
being analysed (cf., e.g., Bucher, 2011). In our account, such relationships and their meanings are always abductive
hypotheses that are introduced by the coherence-making process; they are not simply assumed to be present, to be read
off when relating components.
4.2. Sequential visual narrative and narrative structures
The examples we have shown so far have focused primarily on single panels, their visual details and the discourse
relations to be interpreted between these details. In the following, we set out how our model of discourse interpretation
helps to identify and describe the overall narrative structure of comic pages or more complex sequences of images with
regard to panel transitions. Panel transitions are probably the most frequently discussed aspect of comics interpretation of
all. McCloud (1994) describes this process in terms of "closure", a very particular kind of textual gap that the recipients of
comics need to fill in --- "comics panels fracture both time and space, offering a jagged staccato rhythm of unconnected
moments. But closure allows us to connect these moments and mentally construct a continuous, unified reality"
(McCloud, 1994:67).
McCloud does not provide further explanations of how precisely this might operate, however. Relying on
subjective and impressionistic descriptions of what occurs during closure does not appear to do justice to the
reliability of interpretations actually observed. Thus the reader's involvement in this interpretation is evidently
being constrained considerably. Here, we will argue further that we can characterize the reader's involvement in
terms of discourse interpretation. We will also use some previous discussions of the functioning of panel sequences
in comics to support our claim that discourse interpretation of the kind we have introduced is a beneficial path to follow.
In particular, Cohn has convincingly demonstrated in several publications that the simple linear transition model
inherent in McCloud's characterization is insufficient and more structured interpretations are essential (Cohn, 2010,
2013a,b). We will now build on this, showing how similar effects accrue within our model of multimodal discourse
semantics.
Fig. 5 shows an early example used in Cohn (2010) to demonstrate the insufficiency of the linear transition account.
Although his actual treatment of this example given in the earlier publication has now been superseded and considerably
extended (cf., in particular, Cohn, 2013a,b), the example and its early analysis still stand as a straightforward and relatively
clear-cut illustration of the issues that need to be addressed and so we shall remain with this example for the purposes of
the current discussion. The example also allows us to introduce rather more of our framework without needing to
complicate the discussion overly. Nevertheless, we will also make reference to Cohn's current position as we proceed.
Cohn shows that the panels in the sequence in the figure need to be grouped in certain ways rather than others in order for
an acceptable interpretation to result. Moreover, that grouping is not compatible with a simple chaining together of
transitions as they occur in sequence. Cohn argues, for example, that the two images in the second and third panel refer to
the same character and that this then demands the kind of bracketing shown on the lefthand side of Fig. 5. It is then
necessary to make sure that the close-up of the eye shown in the third panel is grouped together with the person it belongs
to in the second panel before proceeding to the final panel, which relates to both the first panel and the second and third
panels understood as a single unit.
Fig. 5. Example taken from Cohn (2010:143): acceptable and non-acceptable grouping of panels according to time and space.
This then overrides the linear order of the images in favour of grouping them together hierarchically in a manner that
cannot be described from the perspective of McCloud's sequential view. In Cohn's terms, this hierarchical combination of
the two panels is the only acceptable grouping and is thus the preferred interpretation for this short sequence of images.
Cohn thus refutes McCloud's principle of the temporal map in comics where space equals time by moving from one
panel to another (Cohn, 2010:130). Instead, Cohn underlines the general ability of several panels that show the same
action or circumstance to be interpreted as belonging together according to specific constraints --- what in linguistics
would be called a grammar --- that differentiate the acceptable groupings from the unacceptable ones (Cohn, 2010:142).
Both Cohn (2010) and Cohn (2013a,b) also discuss further examples where structural ambiguity can be documented ---
again, ambiguity is only possible if there are underlying structural descriptions since the linear sequence itself remains
unchanged and so cannot be described as ambiguous.
The notion of grammar employed by Cohn relies on his extensive argument for a productive treatment of comics that
draws connections with the kind of structured information processing found in cognitive studies of verbal syntax. Here we
will suggest in contrast that rather similar structural effects can be achieved by using a discourse semantics and the
provided set of discourse relations. The precisely defined conditions of application to be fulfilled within the discourse
interpretation then dynamically grow hierarchical interpretations of the kind required. More specifically, we refer back to
our discourse analytical approach which provides constraints both on the semantic content of the discourse as well as on
the contextual knowledge necessary to infer its logical form. With this dynamic discourse interpretation, applying the
principles of SDRT as introduced above, we will now show that both a linear (reflecting left-to-right processing) as well as
hierarchical (reflecting the constructed discourse structure) grouping can be upheld for the example sequence shown.
Table 7 briefly summarizes the eventualities that can be visually derived for the four panels and which then serve as
inputs to the discourse structure construction process. Successive discourse segments are attached progressively to the
growing discourse context following our definitions of discourse relations. Crucially, although the transitions
are considered as they occur --- i.e., linearly left-to-right --- the resulting discourse structures may be substantially
more complex, as we shall now see. This is because the transitions are not transitions between panels as in McCloud's
case but between an input discourse semantic structure and the semantics of the incoming panel, effected by discourse
semantic update using discourse relations as described above.
Table 7
Propositional representations of the respective panels in Cohn's example with resolved visual coreference.

$e_i$ = raise | $e_{j_a}$ = look | $e_{j_b}$ = look | $e_k$ = punch
[v] participant (m) | [v] participant (o) | [v] eye (CU) (p) | [v] participant (m)
[v] arm (n) | | | [v] participant (o)
$m, n \mathrel{|\!\sim} \text{raise}(e_i)$ | $o \mathrel{|\!\sim} \text{look}(e_{j_a})$ | $p \mathrel{|\!\sim} \text{look}(e_{j_b})$ | $m, o \mathrel{|\!\sim} \text{punch}(e_k)$

The first two events, $e_{\pi_i}$ and $e_{\pi_{j_a}}$, are related to each other contrastively, showing two different participants with
different action processes that share a general semantic similarity with specific differences: that is, they each show a figure
with certain commonalities in stance and attitude. This general compatibility is enforced by the consequent of the meaning
postulate for Contrast (see Table 6), which states that it is necessarily the case ($\Box$) that the states of knowledge
concerning the two events ($K_\alpha$ and $K_\beta$) are structurally similar ($\sim$). The panels at issue then fulfil both the meaning
postulate and the default axiom for the discourse relation Contrast thus:
\[ \text{Contrast}(\pi_i, \pi_{j_a}) \Rightarrow \Box(K_{\pi_i} \sim K_{\pi_{j_a}}) \tag{5} \]
\[ (?(\pi_i, \pi_{j_a}, \lambda) \wedge \text{semantic dissimilarity}(\pi_i, \pi_{j_a})) > \text{Contrast}(\pi_i, \pi_{j_a}, \lambda) \tag{6} \]
The next event, $e_{\pi_{j_b}}$, is then attached to this short two-part discourse via a further relation which can also be inferred on the
basis of its logical form. Corresponding to the well-known right frontier constraint introduced for discourse parsing by
Polanyi (1988) and developed further by Asher and Lascarides for SDRT, the most accessible point of attachment for the
close-up of an eye ($\text{detail}_D$) is picked out as the represented participant shown in the panel before. This, together with the
immediate sequentiality of the two panels given by the text ($\text{sequence}_D$), then verifies both the meaning postulate and
default axiom for the discourse relation Part-of:
\[ \text{Part-of}(\pi_{j_a}, \pi_{j_b}) \Rightarrow \text{part-of}(e_{\pi_{j_b}}, e_{\pi_{j_a}}) \tag{7} \]
\[ (?(\pi_{j_a}, \pi_{j_b}, \lambda) \wedge \text{sequence}_D(\pi_{j_a}, \pi_{j_b})) \wedge \text{detail}_D(\pi_{j_b}, \pi_{j_a}) > \text{Part-of}(\pi_{j_a}, \pi_{j_b}, \lambda) \tag{8} \]
As a consequence, the DRS of the third panel is attached immediately to the panel before and, moreover, in a
subordinating relationship analogous to the Explanation discourse relation we saw in the analyses of our verbal examples
above. The resulting discourse structure for these three panels is shown in Table 8.
Here we see that the visual coreference required to form a coherent interpretation has been achieved by solving the
task of maximizing discourse coherence using the given discourse relations and the points of accessibility that they
define. Visual coreference as such has also been considered recently for comics in a formal discourse representation
style by Abusch (2013), who adopts traditional, unsegmented discourse representation theory (DRT) with the aim of
showing visual reference accessibility in a manner similar to that observed for verbal discourse. Here we prefer the
additional structure provided by discourse relations in SDRT, and our extension of these for multimodality, since this
provides a stronger basis for constructing the larger discourse structures we now require.
The next and final step is to consider the fourth panel and its attachment to the growing discourse context. This time,
the right frontier constraint makes available three potential attachment points in the discourse context: i.e., the nodes
making up the righthand edge of the discourse structure graph shown in Table 8. In this case, the decision is relatively
straightforward to make and follows from a combination of temporal constraints and topicality. First, the action depicted,
the actual punching, may be assumed to temporally follow the depictions in the preceding three panels. Among the
discourse relations in Table 6, Narration would therefore be a logical choice to consider. This requires verification of the
corresponding meaning postulate and necessary default axiom as follows:
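\[ \text{Narration}(\alpha, \pi_k) \Rightarrow \text{overlap}(\text{prestate}(e_{\pi_k}), \text{poststate}(e_\alpha)) \tag{9} \]
\[ (?(\alpha, \pi_k, \lambda) \wedge \text{occasion}(\alpha, \pi_k)) > \text{Narration}(\alpha, \pi_k, \lambda) \tag{10} \]
The unresolved segment $\alpha$ denotes the as yet undecided point of attachment in the discourse structure overall.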
An additional constraint defined by Asher and Lascarides (2003:219) for several discourse relations, Narration
included, is then the Update Topic Constraint. This relates discourse coherence to shared and developing topics and
functions to favour discourse structures where topicality is maintained over discourse structures that are purely temporally
or spatially connected. The constraint requires the existence of a node in the discourse capturing shared topicality. Among
the nodes available in the current example, this is clearly fulfilled by the node dominating the Contrast relation, which,
however, represents the whole DRS as given in Table 8. This structure is the first point where both the depicted figures are
involved and gives a suitable identification of the discourse segment $\alpha$ in both the partially instantiated rules (9) and (10).
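Table 8
The segmented discourse representation structure of the first three panels of the Cohn example and its
corresponding discourse structure graph.

SDRS (top label $\pi_0$):
    Labels: $\pi_i$, $\pi_{j_a}$, $\pi_{j_b}$
    $\pi_i : K_{\pi_i}$; $\pi_{j_a} : K_{\pi_{j_a}}$; $\pi_{j_b} : K_{\pi_{j_b}}$
    $\pi_0$ : Contrast($\pi_i$, $\pi_{j_a}$)
    Part-of($\pi_{j_a}$, $\pi_{j_b}$)

Discourse structure graph:
    $\pi_0$ at the top; beneath it, $\pi_i$ and $\pi_{j_a}$ joined horizontally by Contrast;
    $\pi_{j_b}$ attached below $\pi_{j_a}$ by Part-of.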
Fig. 6. Graphical representation of the complete SDRS for Cohn's example.
Interpreting this node as the point where new information is attached results in its re-labelling as $\pi'$ in the expanded SDRS
for the whole sequence of panels, displayed graphically in Fig. 6.
The attachment process for the latter panel then constitutes a so-called discourse pop (cf. Asher and Lascarides,
2003:223) of not attaching $\pi_k$ to the previous node $\pi_{j_b}$, but rather to the preferred higher node $\pi'$. Asher and Lascarides
describe this process of maximizing coherence when updating the discourse as 'popping' because of the leap the recipient
has to make from the last utterance or event back to the one where new information is finally attached. This phenomenon is
described in detail for filmic discourse in Wildfeuer (2014a,b:120--121); the mechanism appears equally applicable for
comics.
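The attachment decision just described can be sketched in Python. This is a minimal illustration under stated assumptions rather than an implementation of SDRT: Part-of is treated as subordinating and Contrast as coordinating, the right frontier is computed by walking upwards from the last attached segment, and the Update Topic Constraint is approximated by requiring that the chosen node involve every participant of the new panel. The participant labels `figure_A` and `figure_B` and all function and class names are hypothetical.

```python
from dataclasses import dataclass

# Assumption: Part-of is subordinating; Contrast is coordinating.
SUBORDINATING = {"Part-of"}


@dataclass
class DiscourseStructure:
    last: str        # label of the most recently attached segment
    relations: list  # (relation name, source label, target label)
    outscopes: dict  # complex segment label -> set of labels inside it


def right_frontier(ds):
    """Return the open attachment points in discovery order."""
    frontier = [ds.last]
    todo = [ds.last]
    while todo:
        node = todo.pop()
        # A node that subordinates a frontier node is itself on the frontier.
        for name, source, target in ds.relations:
            if target == node and name in SUBORDINATING and source not in frontier:
                frontier.append(source)
                todo.append(source)
        # So is any complex segment that contains a frontier node.
        for parent, members in ds.outscopes.items():
            if node in members and parent not in frontier:
                frontier.append(parent)
                todo.append(parent)
    return frontier


def choose_attachment(frontier, participants, new_participants):
    """Pick the first frontier node whose participants cover the new panel's."""
    for node in frontier:
        if new_participants <= participants.get(node, set()):
            return node
    return frontier[0]  # fall back to the last attached segment


# The structure of Table 8, with ASCII stand-ins for the labels.
table_8 = DiscourseStructure(
    last="pi_jb",
    relations=[("Contrast", "pi_i", "pi_ja"), ("Part-of", "pi_ja", "pi_jb")],
    outscopes={"pi_0": {"pi_i", "pi_ja", "pi_jb"}},
)

# Hypothetical participant sets; only the top node involves both figures.
participants = {
    "pi_jb": {"figure_B"},
    "pi_ja": {"figure_B"},
    "pi_0": {"figure_A", "figure_B"},
}

frontier = right_frontier(table_8)
site = choose_attachment(frontier, participants, {"figure_A", "figure_B"})
```

Under these assumptions the frontier contains exactly the three nodes on the righthand edge of the graph, and the fourth panel is attached to the node dominating the Contrast relation rather than to the most recent segment, which is the discourse pop described above.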
The discourse structure unfolding on the basis of our interpretation exhibits several significant parallels to Cohn's
analyses which would be worth discussing in considerably more detail than we can do here. To pick out just two: first, the
overall dependency structure derived during discourse interpretation is equivalent both to Cohn's structure as displayed in
Fig. 5 and to the dependency structure that would be given in his more recently articulated account of narrative grammar as a
separate level of description. Second, the constraints on relations between events that are brought about by the discourse
relations --- such as the shift in time that relates the final panel to the others caused by the attachment point required
for $e_{\pi_k}$ and the meaning postulate of Narration in rule (9) --- also correspond well to Cohn's currently proposed separation of
the narrative grammar itself and further sources of constraint, such as temporal or spatial relations, as distinct levels of
description (cf. Cohn, 2013b). The modularities required may thus be generalizable across accounts in interesting ways.
One principal difference between the approaches, however, lies in the formal mechanisms that are assumed to operate
for describing and constructing discourse structures. Cohn (2013a,b) contrasts three approaches to explaining meaning
construction in comics: (i) panel transitions, corresponding broadly to McCloud's sequential description, (ii) promiscuous
transitions, referring to the comics semiotician Groensteen's proposal, to which we will return briefly below, that
connections may exist between any panels in a comic regardless of where situated, and (iii) general cognitive scripts,
corresponding to problem-solving approaches of the kind briefly mentioned above in our introduction to monomodal
SDRT. Cohn suggests that all of these are problematic precisely because they do not provide the kind of structural
discrimination required for interpreting panel sequences. This is, in fact, a general problem that can be observed in all
approaches that take simple linguistic notions of connection --- such as, for example, cohesive links (cf. Royce, 1998;
Saraceni, 2001) --- and apply these without further structural constraint to visual or combined text-image materials
(for detailed discussion from a linguistic/semiotic perspective, see Bateman, 2014:172--173). Cohn considers that a
distinct visual narrative grammar is necessary to capture and impose the structural regularities just illustrated and to
make fine-grained predictions concerning how such sequences are processed (Cohn, 2013a:66--69). Although also seen
abstractly as a model of discourse, the mechanisms Cohn adopts for this visual grammar are syntactic, in particular
drawing considerably on the notion of phrase structure given in X-bar theory (Jackendoff, 1977) and extended by
Jackendoff's current parallel architecture model in which several modules combine independent forms of constraint so
as to co-determine linguistic (and other) units (Jackendoff, 2007). Cohn's approach thus relies on structural configurations
that make available functional roles for the individual panels occurring in a sequence (i.e., the slots in the syntactic
structure). Only certain syntactically-governed patterns of functional roles are licensed, and it is these that are intended to
give the structural properties observed above.
Regardless of individual details, syntactic accounts require that the conditions licensing acceptable syntactic
structures be specified --- that is, it is possible to distinguish grammatical from ungrammatical structural patterns. It is
generally these conditions that allow well-formed semantic composition to be defined and, without them, it is questionable
whether one is dealing with a syntactic model at all. Thus, on the one hand, to the extent that Cohn's approach maintains
functionally distinguished structural slots, concerns over the appropriateness of a grammatical account of discourse ---
particularly for longer stretches of discourse --- remain. One of the standard objections to traditional applications of
grammar-like notions at the narrative level has always been the rigidity that such approaches appear to impose. Although
broader empirical studies are still necessary, naturally occurring narratives (and discourses in general) suggest a
contingent flexibility of a different order to that provided by the syntactic mechanisms of recursion (for further discussion of
this issue for verbal discourse, see Martin, 1992:163; Bateman, 2001). On the other hand, as the syntactic structures
become increasingly generalized to their X-bar core, it becomes interesting to consider whether what is actually being
expressed are discourse-level dependency structures along the lines of subordination and coordination posited by many
accounts of discourse, SDRT included. There is much more to be considered here, as for certain genres of sequential
images functional slots such as Cohn proposes do appear to offer a useful level of description --- and genres are precisely
the kinds of linguistic units that are thought to provide functional slots or stages on top of more dynamic discourse
organizations (cf., e.g., Martin, 1992:546--573). A more exact comparison of the two approaches would therefore be very
worthwhile: on the one hand, many of the constraints that Cohn suggests may well support a more detailed definition of the
discourse relations relevant for comics; on the other, further empirical investigations of a broader range of audiovisual
artefacts are clearly necessary.
For current purposes, however, we have shown that much of the effect of Cohn's visual narrative grammar can be
produced in a more dynamic fashion by an operative level of discourse semantics --- the structures that abductively result
from application of discourse relations capture dependencies between panels and impose the kind of hierarchical
structure that Cohn builds relying on notions of phrase structure. Our interpretation exemplifies how concrete conditions
and constraints on both the actual semantic representation of the comic panels as well as the contextual circumstances
help in filling in the gaps between panels with explicitly described discourse relations. The interplay of the conceptual
representations of the logical forms presented in Table 7 on the one hand and the available discourse relations for comics
on the other overrides the criticized default principles of space = time and panels = moments suggested by McCloud
and others and, moreover, produces a coherent hierarchical interpretation without the need for postulating specific
structural relationships in advance. This then achieves a high degree of dynamic structure in a visual narrative without
appealing to a syntactic notion of grammar.
Also of note here is how our framework naturally addresses the six attributes of visual narratives that Cohn (2013a:69,
2013b:417) picks out as targets necessary for any adequate account, while raising several further interesting research
questions. In particular, we have seen how the process of discourse interpretation enforces groupings on panels (Cohn's
attribute 1), generates (discourse) structural ambiguities whenever coherence does not distinguish between contrasting
discourse hypotheses (attribute 5), and provides a detailed formalization of the interaction between bottom-up
information from the materials under analysis and top-down constraints from context and the discourse relation
definitions (attribute 2) as well as the interaction between discourse structure and inference (attribute 6) via, for example,
accessibility constraints. Moreover, long-distance dependencies between panels (attribute 4) become relatable to a
broader range of discourse phenomena, including discourse pops and considerations of the general properties of
discourse structures as such (cf., e.g., Wolf and Gibson, 2005; Danlos, 2008); exploring the particular properties of visual
discourse in these terms would constitute a significant research challenge of its own. And finally, narrative pacing
(attribute 3) may be considered not only in terms of the delay or acceleration of information delivery via embedding but
also as a product of, for example, parallelism or misleading interpretative cues that are only resolved (abductively) in
subsequent discourse interpretation (cf., e.g., Bateman, 2007:51--59).
4.3. Comics and layout: two-dimensional meaning-making on the page
In our final example, we take the considerations of the previous sections one step further and show how our approach
also extends to the rather more varied relations discussed in terms of general arthrology and braiding by Groensteen
(2007). Relationships of this kind may be drawn across any panels in a comic and are not restricted to panel sequences.
Important here is to manage this without falling foul of Cohn's critique of promiscuous relations on the grounds of memory
overload when the constraint of physical panel sequentiality is relaxed. Including meaning-making of this kind will suggest
how our account generalises beyond considerations of sequentiality and offers methods for dealing with the deployment
of spatial resources and layout for discourse construction more generally. This, as Groensteen convincingly argues, is an
area where detailed formalized accounts have been almost entirely lacking. Our more general foundation in semiotic
modes that are equally capable of drawing on the spatial extent of material as a manipulable resource for meaning
construction allows us to offer a well-founded approach for this area of comics (and other media) as well.
This is also a logical extension of our first visual example above concerning segmentation of the available cues in a
comics panel: there the selection and isolation in the visual field of a particular group of material qualities was dependent
on the discourse hypotheses being formed. Here we see that this applies equally well in cases where it is not even clear
prior to attempted discourse interpretation what the signs at issue may be. Moreover, in this last example, the kinds of
elements identified (or not) are considerably more abstract and impinge directly on the vexed question of units of analysis
in comics in general, particularly at the page level. Although analytic decisions can be relatively clear in particular cases
due, for example, to perceptual salience, there are few methodological principles offered that would allow those decisions
to be made on general grounds and in a manner that reveals systematically the meanings they help create. Here we will
see that this is, again, precisely what the mechanisms of discourse interpretation can provide.
Fig. 7. Page from Civil War #1 (May 3rd, 2006, p. 22; Writer: Mark Millar, Penciller: Steve McNiven, Cover Artist: Steve McNiven). MARVEL.
Used by permission.
The example we discuss is shown in Fig. 7. Here we see a climactic confrontation between two major characters in a
complex superhero narrative, Marvel Comics' Civil War (2006--2007). The character shown first is Captain America, who
is about to be persuaded to act against some of his (former) superhero colleagues by the female character who first
appears lower right in the second panel and then in the two righthand panels below this. The individual panels clearly can
be identified as units on the basis of their strong framing: each of these is a candidate for a single logical form
representation within the developing discourse structure as we have shown above for our other comics examples.
However, the entire page needs also to be seen as a potential unit, functioning as a perceptual unity grouping its panels
together within the visual field. And, moreover, there may also be intermediate structures in addition to this evinced by
particular visual patterns in which the panels are organized. Such intermediate structures are particularly problematic for
analysis in that they arise spontaneously from visual features of the layout in combination with supportive discourse
coherence interpretations. Attempting descriptions of such units without an explicit account of the discourse coherence
mechanisms is then difficult on any basis other than an on-demand one that lacks methodological rigour.
We treat this within the methodology provided by our framework as follows. The extract in the example begins with a
close-up depiction of an unhappy-looking Captain America stating that a suggested plan is going to "split us down the
middle". The second panel jumps behind Captain America (identified visibly by the star on the back of his costume) to
include a portion of dialogue (thereby demonstrating that individual panels can have quite complex temporal extents: a
further property that can be inferred from the discourse structure as argued, for example, by McCloud (1994:95)). The
dialogue is to be read from left to right, top to bottom, ending with the question raised by the woman in the lower righthand
corner: "Any majors?" (i.e., any major superheroes who might therefore pose a threat). The next group of panels is
distinguished visually from the previous two: instead of two page-width panels there is a collection of four organized in a
2 × 2 matrix. This segmentation reinforces the dialogic structure of the discourse up to that point: the woman asks a
question, and the four panels following are embedded as the answer or response to this question. In one sense, these
panels play out as usual, in that they support a temporally sequential reading following the turn-structure of the dialogue;
however, in another sense, they simultaneously establish an intermediate structure, emphasizing the two sides of the
discussion as it reaches its climax and Captain America's sudden realization (in the bottom panel on the left) of what is
being demanded of him. Thus, the individual panels show moves in a dialogue; the more extended visual grouping of the
panels shows simultaneously the emergence of the conflict --- this is precisely the kind of meaning attributed specifically to
comics as a unique feature of the medium, a feature Groensteen (2007 [1999]:21--23) terms 'spatiotopia', whereby the
particular layout of panels and groups of panels (multiframes) within a work can be meaningful in its own right.
This structure may be constructed directly by the mechanisms of discourse coherence. For this, we first build logical
forms for each panel as a discourse segment as usual. The narrative processes inferable from the images and the speech
balloons are given in Table 9. Most of these forms can simply be described by predicates such as 'say' or 'ask', since,
together, they construct the dialogue situation characterized above. The four short turns within the second panel can thus
be subsumed as one event, $e_{\pi_2}$, which, from a more fine-grained perspective, is constructed of four sub-events, $e_{\pi_{2a}}$--$e_{\pi_{2d}}$.
The four panels in the second part of the tableau are then embedded in a sub-structure whose discourse subordination is
described by the label $e_{\pi'}$ and which we will explain below in more detail. The discourse structure here summarizes the
events $e_{\pi_3}$--$e_{\pi_6}$, which can again be inferred as parts of the dialogue. Since these events are clearly separated as different
panels, we do not represent them as sub-events.

Table 9
Logical forms of the Civil War example. [v] marks visually depicted participants, [t] marks speech balloon text.

$e_{\pi_1}$ = state
    [v] Captain America (m); [t] I think this plan... (n); m, n | state($e_{\pi_1}$)
$e_{\pi_2}$ = dialogue
    $e_{\pi_{2a}}$ = ask: [v] male person (o); [t] What's the matter... (p); o, p | ask($e_{\pi_{2a}}$)
    $e_{\pi_{2b}}$ = say: [v] Captain America (q); [t] A lot. (r); q, r | say($e_{\pi_{2b}}$)
    $e_{\pi_{2c}}$ = ask: [v] second male person (s); [t] How many rebels... (t); s, t | ask($e_{\pi_{2c}}$)
    $e_{\pi_{2d}}$ = ask: [v] female person (u); [t] Any majors? (v); u, v | ask($e_{\pi_{2d}}$)
$e_{\pi'}$ = dialogue
    $e_{\pi_3}$ = say: [v] Captain America (w); [t] A few, but mostly... (x); w, x | say($e_{\pi_3}$)
    $e_{\pi_4}$ = ask: [v] female person (y); [t] So nobody... (z); y, z | ask($e_{\pi_4}$)
    $e_{\pi_5}$ = ask: [v] Captain America (a); [t] Excuse me? (b); a, b | ask($e_{\pi_5}$)
    $e_{\pi_6}$ = say: [v] female person (c); [t] You heard. (d); c, d | say($e_{\pi_6}$)

Fig. 8. (left) Graphical representation of the Civil War discourse structure; (right) Corresponding spatiotopical panel organization suggested by the
layout of the Civil War page.
The logical forms can now be related to each other via discourse relations that can be inferred between the events of
the discourse in the usual way. Since these events as turns in a dialogue occasion each other and thus are given in a
temporal sequence, we can infer Narration-relations between them. As has already been indicated within Table 9, the
following group of panels is then organized as a 2 × 2 matrix. This can be interpreted as two discourse structures that
unfold in parallel, but which contrast with each other in their semantic content. Whereas $e_{\pi_3}$ and $e_{\pi_5}$ as well as $e_{\pi_4}$ and $e_{\pi_6}$
can also be related via Narration, since they follow each other in a temporal sequence, we can infer a further discourse
relation between $e_{\pi_3}$ and $e_{\pi_5}$ and between $e_{\pi_4}$ and $e_{\pi_6}$, namely that of Contrast. The default axioms and meaning
postulates for the Contrast-relation were given in Table 6 above and are derived directly from Wildfeuer's (2014b) similar
relation for film. Thus, although the logical forms of the discourse segments resemble each other and show a structural
similarity, which is also a condition for a Parallel-relation, the semantic content of the utterances given in the speech
balloons leads to the inference of a Contrast-relation between these forms. Parallel is thus overruled by Contrast here,
since the inference of a semantic contrast or conflict maximizes the discourse's coherence overall. The representation of
the overall discourse structure in Fig. 8 depicts this resulting coherence graphically.12
The achieved overall discourse coherence makes it clear that the events in the second half of the tableau are
subordinated to the first dialogue. This can be explained by the embedding of the four panels as answers to the questions
raised before. Since the logical forms all refer to the event of asking, they are dependent on this structure and thus are
subordinated to it. With this structure in place, therefore, we can see how the additional schematic structure sketched in
Fig. 8 is also formally derived. This structure shows more of the potential layout interpretations offered by the page in
addition to simple sequence. The additional discourse dependencies uncovered correspond closely to Groensteen's
(2007 [1999]:144--149) notion of braiding, while the discourse structure itself shows how the emergence of an
intermediate multiframe consisting of the 2 × 2 matrix is directly motivated during the process of constructing a coherent
discourse structure. Accounting for such larger scale panel organizations has been an open problem within comics
analysis for some time; here we have suggested that one means of inducing such organizations when relevant is again
provided by an appropriate representation of the underlying formal discourse structure.
12
Here we make a further simplification since we in fact run out of dimensions on the page --- both narration and contrast should be horizontal in
a classical SDRT graph representation but here they refer to differing layers of interpretation. Asher and Lascarides (2003:154--155) show
something similar with contrast relations by adding dashed links across different portions of their graphs; here we maintain a closer iconic relation
between the graph and the comic's physical page layout to assist comparison.
5. Conclusions and discussion
There is now an increasing number of frameworks being proposed where linguistically-inflected accounts are being
extended into the analysis of multimodal communicative artefacts. Earlier work such as the socio-functional approach of
Kress and van Leeuwen (2006) has been substantially extended (e.g., Royce, 1998; Saraceni, 2001; Lim, 2007; Liu and
O'Halloran, 2009; Painter et al., 2013) and joined by text-linguistic oriented approaches from other traditions (e.g., Stöckl,
2004). Several of these also stress the importance of inference, and range from applications of linguistic relevance theory
(cf. Forceville, 2014) to more cognitive or neurocognitive approaches (cf., e.g., Saraceni, 2001; Coëgnarts and Kravanja,
2012; Zacks and Magliano, 2011; Cohn, 2013b). Valuable empirical investigations are also now beginning to appear,
applying a variety of techniques for measuring recipients' responses during processing of multimodal artefacts of various
kinds (e.g., Levin and Simons, 2000; Holsanova and Nord, 2010; Bucher and Niemann, 2012; Cohn, 2013b). Broader
overviews can be found in Bateman and Schmidt (2012) for film and in Bateman (2014) for static combinations of visual
and verbal materials; an extensive review of the application particularly of narrative models, such as story grammars, to
visual materials can be found in Cohn (2013b).
In this paper we have set out what we believe to be a stronger theoretical framework for addressing the close
relationship that often exists between fine-grained details of form and the guidance of recipients' interpretative
processes --- a property we characterize in terms of an extended notion of textuality. Previous accounts may draw
attention to places where inference needs to occur in specific cases but have so far failed to provide general analytic
frameworks and methodologies that (a) mesh well with what is currently known about perception, (b) are applicable to
multimodal artefacts in general, and (c) are capable of generating specific predictions concerning interpretations and
interpretative difficulties.
We have suggested that one prime contributing factor to the lack of development of appropriate linguistically-inflected
and semiotic accounts of the workings of multimodal artefacts lies in inadequate characterizations of the phenomenon of
multimodality itself. The current received position in which sensory and semiotic modalities are often conflated developed
piecemeal from earlier discussions that themselves grew out of a strongly linguistically-inflected structural semiotics (cf.
Barthes, 1977) that did not offer a sufficiently strong foundation for visual artefacts. Thus, while important, this model had
neither a sufficiently general notion of discourse as a dynamic process of making the interpretation of an artefact's formal
features relevant to its context of use and to its communities of users, nor a sufficiently incisive means of identifying and
delimiting semiotic modes in the first place. As a consequence, there has been a broad orientation to individual sensory
channels rather than to the actual meaning-making potentials that complex materialities offer.
This notwithstanding, our framework also offers a refined view of the constitutive role played by materiality in the
formation of semiotic systems. This style of account is then open not only to considerations of higher-level sociocultural
meanings and value-attributions but also to treatments of more basal involvement in the somatic physical experience of a
spectator. Both facets must be taken to be fundamentally entwined in the aesthetic experience of a communicative
artefact and, moreover, can now be seen to be mediated in the manner that our account of multimodality provides. We
thus see our approach as a general way of engaging with multimodal communicative artefacts that is compatible with a
broad range of investigative styles. On the one hand, principles of discourse update and their application may generate
empirically testable predictions (both experimental and analytic) concerning discourse interpretations while, on the other,
an objectively more stable basis for discussion is offered even for aesthetic, literary, and socio-cultural considerations.
Previous approaches have also been limited by their lack of a sound foundation for treating the inherently dynamic
nature of discourse phenomena. Inferencing has either been equated with general cognitive processing mechanisms or
been treated only schematically or informally. The account of multimodality we have set out is in contrast specifically
responsive to the advances made in the understanding and formal modelling of discourse as a dynamic process of
artefact-guided interpretation. Applying common principles of discourse to multimodal artefacts in general has allowed us
to move beyond debates couched within the more static and syntactic views of semiotics developed within the 1960s and
still at work in many discussions today. We have consequently characterized in more detail than hitherto possible just how
information is made accessible for combination in multimodal artefacts, as well as articulating some of the important
media specificities that need to be considered. We see this moreover as a further response to the striking degree of
convergence that can now be seen across (i) the formal, dynamic semantic and semiotic requirements of avoiding
explosive computational complexity during discourse processing, (ii) observational and introspective indications of fast
yet highly reliable discourse interpretations, and (iii) brain studies showing that discourse interpretation must be
tightly constrained. Multimodal dynamic discourse semantics suggests a general framework strongly motivating a
convergence of this kind.
Our account therefore differs from previous frameworks for discussing multimodality in several important respects ---
including the details of its internal organization, its level of explicitness, and its acknowledgement of the essential role played
by an extended dynamic notion of textuality. As a consequence, our framework supports a finer discrimination of just what
the relevant context for top-down interpretation is and precisely how it is allowed to enter into the interpretation process. We
suggest that significantly more focused interpretative models of what is occurring with multimodal artefacts can then be
formulated building on the crucial role of discourse as we have defined it. That is: we assume that in many situations the tasks
that recipients are involved in are not to be described in terms of general problem solving, of somehow working out what was
meant from a broad understanding of the context, but instead in terms of a very much more focused development of
discourse --- of understanding what discourse purposes are being pursued as defined by explicit repertoires of discourse
relations. Our concern is then with hypotheses that are driven particularly by the requirements of creating coherent discourse.
We have argued that this plays a far more extensive role than previously explored within many accounts of verbal discourse
and offers important insights for the operations and mechanisms of multimodal meaning-making overall.
Thus, while maintaining a distinction similar in some respects to relevance theory's separation of decoding and
inference (Sperber and Wilson, 1995), this distinction is in our case modelled in terms of distinct logics and includes in
addition fine-grained guidance for discourse inference with respect to any decoded semantics; this enables us to be more
specific in analysis, which is particularly important for multimodality. The result also offers, we claim, an advance over
non-linguistically-inflected accounts, where interactions between recipient and artefact have been focused typically within
more informal discussions of individual kinds of media, e.g., literature, painting, etc., and have not been able to draw on the
more extensive theoretical foundation necessary for driving empirical research.
Finally, in order to imbue observable material distinctions with interpretations it is always necessary to fix the semiotic
mode(s) that apply; this in turn is necessarily taken to involve a corresponding selection of the discourse semantics
associated with that semiotic mode. If we consider, for example, verbal language, then this semiotic mode has its own typical
discourse semantics made up of resources for connecting messages, maintaining identification chains, evaluating entities
and so on, as is well discussed in the linguistics literature. Different semiotic modes will in general have their discourse
semantics filled in differently with, at the very least, differing repertoires of discourse relations. We have suggested this above
in our separate listings of discourse relations for each of verbal language and comics, while Wildfeuer (2014a) articulates a
further repertoire specifically for film. One primary challenge for multimodal research at this time, therefore, will be to map out
the varied collections of discourse relations and their definitions as required to capture the workings of distinct semiotic
modes. It is this inclusion of discourse semantics that offers a principled pragmatic framework within which highly explicit
characterizations of multimodal meaning making can be made and empirically investigated.
We still stand very much at the beginning of this programme of research and have mentioned above several areas
where empirical studies are now necessary. The connection of discourse hypothesis formation with perception suggests
possibilities for eye-tracking experiments with controlled variation of discourse contexts; the accessibility conditions
defined by discourse dependency structures suggest experiments where interpretability is evaluated with respect to
varying discourse structures; the linking of discourse structures to fine-grained details of form suggests experiments in
which those details are varied to assess the implications of those changes for discourse interpretation and situation model
updates; and behavioural correlates of the particular inferential chains suggested by hypothesized courses of discourse
interpretation may also be sought to help refine or change the model further. Moreover, corpus-based evaluations of the
coverage of particular repertoires of discourse relations may also be undertaken in a manner precisely analogous to
corpus studies of verbal discourse. Overall, we expect such investigations to lead to significant improvements in our
understandings not only of commonalities across media and modalities but also of their differences.
Acknowledgements
This work was partially funded by an internal research grant from Bremen University as well as by a cooperative
research mobility grant (PPP HK, Project: 56156404) from the German Academic Exchange Service (DAAD).
References
Abusch, Dorit, 2013. Applying discourse semantics and pragmatics to co-reference in picture sequences. In: Chemla, Emmanuel, Homer,
Vincent, Winterstein, Grégoire (Eds.), Proceedings of Sinn und Bedeutung 17, Paris, pp. 9--25. http://hdl.handle.net/1813/30598.
Asher, Nicholas, Lascarides, Alex, 2003. Logics of Conversation. Cambridge University Press, Cambridge.
Asher, Nicholas, Vieu, Laure, 2005. Subordinating and coordinating discourse relations. Lingua 115 (4), 591--610.
Bakhtin, Mikhail M., 1981. The Dialogic Imagination: Four Essays. University of Texas Press, Austin. Translated by Caryl Emerson and Michael
Holquist.
Barthes, Roland, 1977. Image--Music--Text. Fontana Press, London. Translated by Stephen Heath.
Barthes, Roland, 1977[1964]. The rhetoric of the image. In: Image--Music--Text. Fontana, London, pp. 32--51.
Bateman, John A., 2001. Between the leaves of rhetorical structure: static and dynamic aspects of discourse organisation. Verbum: revue de
linguistique 23 (1), 31--58.
Bateman, John A., 2007. Towards a grande paradigmatique of film: Christian Metz reloaded. Semiotica 167 (1/4), 13--64.
Bateman, John A., 2011. The decomposability of semiotic modes. In: O'Halloran, Kay L., Smith, Bradley A. (Eds.), Multimodal Studies: Multiple
Approaches and Domains. Routledge Studies in Multimodality. Routledge, London, pp. 17--38.
Bateman, John A., 2014. Text and Image: A Critical Introduction to the Visual/Verbal Divide. Routledge, London/New York.
Bateman, John A., Hois, Joana, Ross, Robert J., Tenbrink, Thora, 2010. A linguistic ontology of space for natural language processing. Artif. Intell.
174 (14), 1027--1071. http://dx.doi.org/10.1016/j.artint.2010.05.008.
Bateman, John A., Schmidt, Karl-Heinrich, 2012. Multimodal Film Analysis: How Films Mean. Routledge Studies in Multimodality. Routledge,
London.
Bertin, Jacques, 1983. Semiology of Graphics. University of Wisconsin Press, Madison. Translated from Sémiologie graphique (1967) by William J.
Berg.
Boeriis, Morten, Holsanova, Jana, 2012. Tracking visual segmentation: connecting semiotic and cognitive perspectives. Vis. Commun. 11 (3),
259--281. http://dx.doi.org/10.1177/1470357212446408.
Bordwell, David, 1982. Textual analysis, etc. Double Issue: International Conference on the Textual Analysis of Film. Enclitic 6 (1), 125--136.
Bredekamp, Horst, 2010. Theorie des Bildakts. Suhrkamp, Berlin. Frankfurter Adorno-Vorlesungen 2007.
Bruner, Jerome S., Postman, Leo, 1949. On the perception of incongruity: a paradigm. J. Pers. 18 (2), 206--223.
Bryson, Norman, 1981. Word and Image: French Painting of the Ancien Régime. Cambridge University Press, Cambridge.
Bucher, Hans-Jürgen, 2011. Multimodales Verstehen oder Rezeption als Interaktion. Theoretische und empirische Grundlagen einer
systematischen Analyse der Multimodalität. In: Diekmannshenke, Hans-Joachim, Klemm, Michael, Stöckl, Hartmut (Eds.), Bildlinguistik. Theorien --
Methoden -- Fallbeispiele. Erich Schmidt, Berlin, pp. 123--156.
Bucher, Hans-Jürgen, Niemann, Philipp, 2012. Visualizing science: the reception of PowerPoint presentations. Vis. Commun. 11 (3), 283--306.
Coëgnarts, Maarten, Kravanja, Peter, 2012. The visual and multimodal representation of time in film, or how time is metaphorically shaped in
space. Image Narrative 13 (3), 85--100.
Cohn, Neil, 2010. The limits of time and transitions: challenges to theories of sequential image comprehension. Stud. Comics 1 (1), 127--147.
Cohn, Neil, 2013a. The Visual Language of Comics: Introduction to the Structure and Cognition of Sequential Images. Bloomsbury, London.
Cohn, Neil, 2013b. Visual narrative structure. Cogn. Sci. 37 (3), 413--452.
Cohn, Neil, Paczynski, Martin, Jackendoff, Ray, Holcomb, Phillip J., Kuperberg, Gina R., 2012. (Pea)nuts and bolts of visual narrative: structure
and meaning in sequential image comprehension. Cogn. Psychol. 65 (1), 1--38.
Currie, Gregory, 1995. Image and Mind: Film, Philosophy and Cognitive Science. Cambridge University Press, Cambridge, UK.
Cutting, James E., 2002. Representing motion in a static image: constraints and parallels in art, science, and popular culture. Perception 31 (10),
1165--1193.
Cytowic, Richard E., Eagleman, David M., 2011. Wednesday Is Indigo Blue: Discovering the Brain of Synesthesia. MIT Press, Cambridge, MA.
Danlos, Laurence, 2008. Strong generative capacity of RST, SDRT and discourse dependency DAGS. In: Benz, Anton, Kühnlein, Peter (Eds.),
Constraints in Discourse. John Benjamins, Amsterdam, pp. 69--95.
Davidson, Donald, 1967. The logical form of action sentences. In: Rescher, N. (Ed.), The Logic of Decision and Action. University of Pittsburgh
Press, Pittsburgh, PA, pp. 81--95.
de Beaugrande, Robert, 1982. The story of grammars and the grammar of stories. J. Pragmat. 6 (5/6), 383--422.
de Beaugrande, Robert, Dressler, Wolfgang U., 1981. Introduction to Text Linguistics. Longman, London.
Eco, Umberto, 1976. A Theory of Semiotics. Indiana University Press, Bloomington.
Eisner, Will, 1992. Comics and Sequential Art. Kitchen Sink Press Inc., Princeton, WI.
Evans, Janet (Ed.), 2009. Talking Beyond the Page: Reading and Responding to Picturebooks. Routledge, London.
Feng, Dezheng, O'Halloran, Kay L., 2012. Representing emotive meaning in visual images: a social semiotic approach. J. Pragmat. 44,
2067--2084.
Ferstl, Evelyn C., 2007. The functional neuroanatomy of text comprehension: what's the story so far? In: Schmalhofer, Franz, Perfetti, Charles
A. (Eds.), Higher Level Language Processes in the Brain: Inference and Comprehension Processes. Lawrence Erlbaum, Mahwah, NJ, pp.
53--102.
Forceville, Charles J., 2011. Pictorial runes in Tintin and the Picaros. J. Pragmat. 43 (3), 875--890.
Forceville, Charles J., 2014. Relevance theory as model for analyzing visual and multimodal communication. In: Machin, David (Ed.), Multimodal
Communication. Mouton de Gruyter, Berlin, pp. 51--70.
Gibson, James J., 1977. The theory of affordances. In: Shaw, Robert, Bransford, John (Eds.), Perceiving, Acting, and Knowing: Toward an
Ecological Psychology. Erlbaum, Hillsdale, NJ, pp. 62--82.
Givon, Talmy (Ed.), 2005. Grammar as an adaptive evolutionary product. Context as Other Minds: The Pragmatics of Sociality, Cognition and
Communication. John Benjamins, Amsterdam, pp. 91--123.
Gombrich, Ernst H., 1959. Art and Illusion: A Study in the Psychology of Pictorial Representation. Phaidon Press, Oxford.
Groensteen, Thierry, 2007[1999]. The System of Comics. Studies in Popular Culture. University Press of Mississippi, Jackson, MS. Translated by
Bart Beaty and Nick Nguyen, from the original French Système de la bande dessinée (1999).
Groupe μ, 1992. Traité du signe visuel: pour une rhétorique de l'image. Editions du Seuil, Paris.
Hague, Ian, 2014. Comics and the Senses: A Multisensory Approach to Comics and Graphic Novels. Routledge, New York/London.
Hall, Stuart, 1980. Encoding/decoding. In: Hall, Stuart, Hobson, Dorothy, Lowe, Andrew, Willis, Paul (Eds.), Culture, Media, Language.
Working Papers in Cultural Studies, 1972--79. Routledge, London, pp. 128--138. Transferred to digital printing 2006.
Hatt, Michael, Klonk, Charlotte, 2006. Art History: A Critical Introduction to Its Methods. Manchester University Press, Manchester.
Hjelmslev, Louis, 1953. Prolegomena to a Theory of Language. Indiana University Publications in Anthropology and Linguistics, Bloomington.
Translated by Francis J. Whitfield.
Holsanova, Jana, Nord, Andreas, 2010. Multimedia design: media structures, media principles and users' meaning-making in newspapers and net
papers. In: Bucher, Hans-Jürgen, Gloning, Thomas, Lehnen, Katrin (Eds.), Neue Medien -- neue Formate. Ausdifferenzierung und
Konvergenz in der Medienkommunikation (Interaktiva. Schriftenreihe des Zentrums für Medien und Interaktivität (ZMI), Gießen 10). Campus
Verlag, Frankfurt/New York, pp. 81--103.
Iser, Wolfgang, 1978. The Act of Reading: A Theory of Aesthetic Response. Johns Hopkins University Press, Baltimore.
Jackendoff, Ray, 1977. X̄ Syntax: A Study of Phrase Structure. MIT Press, Cambridge, MA.
Jackendoff, Ray, 2007. A parallel architecture perspective on language processing. Brain Res. 1146, 2--22.
Johnson-Laird, Philip N., 1983. Mental Models: Towards a Cognitive Science of Language, Inference, and Consciousness. Cambridge University
Press, Cambridge.
Kamp, Hans, 1981. A theory of truth and semantic representation. In: Groenendijk, Jeroen A.G., Janssen, T.M.V., Stokhof, Martin B.J. (Eds.),
Formal Methods in the Study of Language. 136. Mathematical Centre Tracts, Amsterdam, pp. 277--322.
Kamp, Hans, Reyle, Uwe, 1993. From Discourse to Logic: Introduction to Model-Theoretic Semantics of Natural Language, Formal Logic and
Discourse Representation Theory. Studies in Linguistics and Philosophy, vol. 42. Kluwer Academic Publishers, London/Boston/Dordrecht.
Kennedy, John M., 1982. Metaphor in pictures. Perception 11 (5), 589--605.
Kluss, Thorsten, Schult, Niclas, Schill, Kerstin, Fahle, Manfred, Zetzsche, Christoph, 2012. Investigating the in-between: multisensory integration
of auditory and visual motion streams. Seeing Perceiving 25 (1), 45--69.
Kostelnick, Charles, Hassett, Michael, 2003. Shaping Information: The Rhetoric of Visual Conventions. Southern Illinois University Press,
Carbondale, Illinois.
Kress, Gunther, 2010. Multimodality: A Social Semiotic Approach to Contemporary Communication. Routledge, London.
Kress, Gunther, Jewitt, Carey, Ogborn, Jon, Tsatsarelis, Charalampos, 2000. Multimodal Teaching and Learning. Continuum, London.
Kress, Gunther, van Leeuwen, Theo, 2006[1996]. Reading Images: The Grammar of Visual Design. Routledge, London/New York.
Levin, Daniel T., Simons, Daniel J., 2000. Perceiving stability in a changing world: combining shots and integrating views in motion pictures and
the real world. Media Psychol. 2 (4), 357--380.
Levinson, Stephen C., 2000. Presumptive Meanings: The Theory of Generalized Conversational Implicature. MIT Press, Cambridge, MA.
Lim, Fei Victor, 2007. The visual semantics stratum: making meaning in sequential images. In: Royce, Terry D., Bowcher, Wendy L. (Eds.), New
Directions in the Analysis of Multimodal Discourse. Lawrence Erlbaum Associates, Mahwah, New Jersey, pp. 195--214.
Liu, Yu, O'Halloran, Kay L., 2009. Intersemiotic texture: analyzing cohesive devices between language and images. Soc. Semiotics 19 (4), 367--388.
Martin, James R., 1992. English Text: Systems and Structure. Benjamins, Amsterdam.
Martinet, André, 1960. Elements of General Linguistics. Faber and Faber, London.
McCloud, Scott, 1994. Understanding Comics: The Invisible Art. Harper Perennial, New York.
McGurk, Harry, MacDonald, John, 1976. Hearing lips and seeing voices. Nature 264 (5588), 746--748. http://dx.doi.org/10.1038/264746a0.
Meibauer, Jörg, 2009. Implicature. In: Mey, Jacob L. (Ed.), Concise Encyclopedia of Pragmatics. 2nd ed. Elsevier, Kidlington, Oxford, pp.
365--378.
Metz, Christian, 1964. Le cinéma: langue ou langage? Communications 4, 52--90. Appears in English as 'The cinema: language or language
system?' in Metz (1974: 31--91).
Metz, Christian, 1974. Film Language: A Semiotics of the Cinema. Oxford University Press/Chicago University Press, Oxford/Chicago. Translated
by Michael Taylor.
Miodrag, Hannah, 2013. Comics and Language: Reimagining Critical Discourse on the Form. University Press of Mississippi, Jackson, MS.
Mitchell, W.J.T., 2005. There are no visual media. J. Vis. Cult. 4 (2), 257--266. http://dx.doi.org/10.1177/1470412905054673.
Möller-Naß, Karl-Dietmar, 1986. Filmsprache: eine kritische Theoriegeschichte. MAKS (Münsteraner Arbeitskreis für Semiotik) Publikationen,
Münster.
Moriarty, Sandra E., 1996. Abduction: a theory of visual interpretation. Commun. Theory 6 (2), 167--187.
Painter, Claire, Martin, James R., Unsworth, Len, 2013. Reading Visual Narratives: Image Analysis of Children's Picture Books. Equinox, London.
Parsons, Terence, 1990. Events in the Semantics of English: A Study in Subatomic Semantics. MIT Press, Cambridge, MA/London.
Polanyi, Livia, 1988. A formal model of the structure of discourse. J. Pragmat. 12, 601--638.
Royce, Terry D., 1998. Synergy on the page: exploring intersemiotic complementarity in page-based multimodal text. Jpn. Assoc. Syst. Funct.
Linguist. (JAarea.sfl) Occas. Pap. 1 (1), 25--49.
Ryan, Marie-Laure, 2003. Cognitive maps and the construction of narrative space. In: Herman, David (Ed.), Narrative Theory and the Cognitive
Sciences. CSLI, Stanford, CA, pp. 214--242.
Sachs-Hombach, Klaus (Ed.), 2001. Bildhandeln: interdisziplinäre Forschungen zur Pragmatik bildhafter Darstellungsformen (Reihe
Bildwissenschaft 3). Scriptum-Verlag, Magdeburg.
Sachs-Hombach, Klaus, Schirra, Jörg R.J., 2011. Prädikative und modale Bildtheorie. In: Diekmannshenke, Hans-Joachim, Klemm, Michael,
Stöckl, Hartmut (Eds.), Bildlinguistik. Theorien -- Methoden -- Fallbeispiele. Erich Schmidt, Berlin, pp. 97--120.
Saint-Martin, Fernande, 1990. Semiotics of Visual Language. Bloomington University Press, Bloomington, IN.
Saraceni, Mario, 2001. Relatedness: aspects of textual connectivity in comics. In: Baetens, Jan (Ed.), The Graphic Novel. Leuven University
Press, Leuven, pp. 167--179.
Saussure, Ferdinand de, 1959[1915]. In: Bally, Charles, Sechehaye, Albert (Eds.), Course in General Linguistics. McGraw-Hill/The Philosophical
Library, Inc., New York/Toronto/London. In collaboration with Albert Riedlinger; translated by Wade Baskin.
Schill, Kerstin, Umkehrer, Elisabeth, Beinich, Stephan, Krieger, Gerhard, Zetzsche, Christoph, 2001. Scene analysis with saccadic eye
movements: top-down and bottom-up modeling. J. Electron. Imaging 10 (1), 152--160.
Schirra, Jörg R.J., Sachs-Hombach, Klaus, 2007. To show and to say: comparing the uses of pictures and language. Stud. Commun. Sci. 7 (2),
35--62.
Schumacher, Peter, 2013. A pattern language for pictorial assembly instructions (PAIs). Inf. Des. J. 20 (2), 111--135.
Seeley, William P., 2012. Hearing how smooth it looks: selective attention and crossmodal perception in the arts. Essays Philos. 13 (2), 498--517.
Special issue: Aesthetics and the Senses, edited by Cynthia Freeland.
http://commons.pacificu.edu/cgi/viewcontent.cgi?article=1434&context=eip.
Simons, D., Rensink, R., 2005. Change blindness: past, present, and future. Trends Cogn. Sci. 9 (1), 16--20.
Sobchack, Vivian, 2004. Carnal Thoughts: Embodiment and Moving Image Culture Chap. What My Fingers Knew: The Cinesthetic Subject, or
Vision in the Flesh. University of California Press, Berkeley/Los Angeles/London, pp. 53--84.
Sperber, Dan, Wilson, Deirdre, 1995. Relevance: Communication and Cognition, 2nd ed. Blackwell, Oxford [1986].
Stöckl, Hartmut, 2004. Die Sprache im Bild -- Das Bild in der Sprache: Zur Verknüpfung von Sprache und Bild im massenmedialen Text. Konzepte
-- Theorien -- Analysemethoden. Walter de Gruyter, Berlin.
van Dijk, Teun A., Kintsch, Walter, 1983. Strategies of Discourse Comprehension. Academic Press, New York.
Wildfeuer, Janina, 2012. Intersemiosis in film: towards a new organisation of semiotic resources in multimodal filmic text. Multimodal Commun.
1 (3), 233--304.
Wildfeuer, Janina, 2014a. Film Discourse Interpretation. Towards a New Paradigm for Multimodal Film Analysis. Routledge Studies in
Multimodality. Routledge, London/New York.
Wildfeuer, Janina, 2014b. Coherence in film: analysing the logical form of multimodal narrative discourse. In: Maiorani, Arianna, Christie, Christine
(Eds.), Multimodal Epistemologies: Towards an Integrated Framework. Routledge Studies in Multimodality. Routledge, London, pp. 260--274.
Wirth, Uwe, 2005. Abductive reasoning in Peirce's and Davidson's account of interpretation. Semiotica 153 (1), 199--208.
Wolf, Florian, Gibson, Edward, 2005. Representing discourse coherence: a corpus-based study. Comput. Linguist. 31 (2), 249--287.
Yarbus, Alfred Lukyanovich, 1967. Eye Movements and Vision. Plenum Press, New York, NY.
Zacks, Jeffrey M., Magliano, Joseph P., 2011. Film, narrative and cognitive neuroscience. In: Bacci, Francesca, Melcher, David P. (Eds.), Art and
the Senses. Oxford University Press, Oxford/New York, pp. 435--454.
Zwaan, Rolf A., Radvansky, Gabriel A., 1998. Situation models in language comprehension and memory. Psychol. Bull. 123 (2), 162--185.
