Brief History:
In July 2001 the French Research Ministry accepted the "RoboKid" project, with the objective of combining visual scene analysis, speech perception, speech production and language processing research into a single platform for studying human cognitive development. As part of that activity, a working meeting was held in June 2003 at the Institut des Sciences Cognitives in Lyon, France. The meeting included an extension of the initial ACI group, and its objective was to identify a long-term strategy for the development of cognitive capabilities (including vision, language and motor control) for humanoid robots. This web site is intended to serve both as a mechanism for communication and collaboration between the group members, and for diffusion to the scientific and lay communities.
Working Group Members:
Peter F. Dominey, Institut des Sciences Cognitives, CNRS UMR 5015, Lyon
Jean-Luc Schwartz, Institut de la Communication Parlée, Grenoble Cedex 1
Deb Roy, Cognitive Machines Group, MIT Media Laboratory, Cambridge, MA, USA
Luc Steels, AI Laboratory, Vrije Universiteit Brussel, Brussels, Belgium
Michael Arbib, Human Brain Project, University of Southern California
Ram Nevatia, ISI, University of Southern California
Laurent Itti, Human Brain Project, University of Southern California
Aude Billard, EPFL, Lausanne, Switzerland
Jeffrey Mark Siskind, Purdue University, Electrical and Computer Engineering, IN, USA
M. Anthony Lewis, Iguana Robotics, IL, USA
Andrew H. Fagg, University of Massachusetts Amherst, Computer Science Dept, USA
Pat Zukow-Goldring, Linguistics, University of Southern California
Global Activity Objective:
The global objective of this working group is to develop humanoid robot
capabilities that can be used to test developmental theories of cognitive
development and language acquisition. This includes the identification
of perceptual, conceptual and processing primitives, and of the derived capabilities
that are constructed from them. The essential idea is that
“meaning” is extracted from the visual (perceptual) world via perceptual
scene analysis, and the mapping between the structure of this meaning,
and the structure of language is learned. The system should be able
to learn language and then demonstrate this knowledge in behavioral contexts
that include: (1) interactive dialogs in scene analysis/description, and
(2) sensorimotor imitation and action. In this context we want to clearly
identify the functional capabilities whose ecological development we truly
want to address, versus those that we will be satisfied to take
as “off the shelf” capabilities. Thus we can consider that this type
of modeling can serve two related goals:
(1) To understand how the real system might function, and
(2) To produce a non-human system that can display interesting
language performance/behavior
Global Outcome of the June 10-12 Meeting:
Each of the participants approaches robot cognition from a different
perspective. One of the unusual benefits of this meeting derived
from the mixing of diverse disciplines including computer vision and scene
analysis, human event analysis, robot sensorimotor control and locomotion,
primate and human visual psychophysics and modeling, natural language processing,
and various degrees of intersection between these disciplines. Combined
with a second aspect of the meeting - the provision for ample time for
discussion during and after the talks - the interdisciplinarity provided
for some interesting cross-fertilization. In particular, the
idea of sensorimotor affordances can be extended into the domain of lexical
and possibly phrasal semantics. A central issue that was extensively
discussed was the nature of representation of
meaning. In addition, we considered that the most straightforward
method to define functionality and measure performance is to define specific
behavioral tasks in a continuum of complexity.
In this context we considered a 4-dimensional space
in which task complexity could be defined. A central point of
collaboration in this working group will
be the identification of shared resources, and the development of protocols/standards
for representations of events, objects, relations etc. These activities
will be pursued in a follow-up meeting in the Spring of 2004 at USC.
The Centrality of Representational Structure and Function
The representation of meaning, clearly, is a central issue. Word
to referent mapping requires that the referent has a representation (that
may be modified in the mapping process). A given behavioral function
or task (e.g. the ability to recognize simple events given a segmented
video image) will drive the definition of the representations that support
that function. If we are interested in the representations that support
human cognition, then the choice of the target task (and corresponding
representations) is of fundamental importance. One approach
would consider the specification of a series of successively more complex
tasks that require successively more complex representations that build
upon earlier versions in a developmental manner. The pieces required
for the blueprint or roadmap are partially available in the developmental
literature. The roadmap should include constraints on human processing
limits (memory, etc.).
Note that there are representational characteristics, including particularly
1. “compositionality” i.e. that we can represent new combinations of
things “on the fly” , and
2. the assumed internal representations that support “language” that
have become the object of scientific investigation or curiosity as representations
per se, rather than simply as part of an interesting behavior.
Successively complex robot cognitive tasks:
Robot-human interaction about static object configurations (static
states):
Names, locations,
relations between objects
Robot-human interaction about single dynamic object “events” (state
transitions) that require learning about:
Classical events (touch; push, take, give),
"Siskind" events (pouring, folding, bending, tearing)
Physical regularities
Naive physics
Gravity, falling
Support
Affordances, Effectivities
Robot-human interaction about multiple dynamic object configurations
(composed/multiple interrelated state transitions) that require learning
about:
How a succession of events contribute to a state
change, i.e. how a block house can be constructed.
Robot-human interaction about intentions that require learning
about:
Intentions, joint attention, theory of mind (Token
Task)
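The distinction drawn above between static states, single events (state transitions) and composed events can be sketched in code. The following is only an illustrative sketch, not a representation adopted by the group: it assumes that a state is a set of relations between objects and that an event is the set of relations that start and stop holding. All names (Relation, Transition, apply_event, and the "give" example with anna and ben) are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Tuple

# A relation is a predicate name plus its arguments, e.g. ("on", "block_a", "table").
Relation = Tuple[str, ...]
State = FrozenSet[Relation]


@dataclass(frozen=True)
class Transition:
    """An event viewed as the difference between two static states."""
    added: State
    removed: State


def transition(before: State, after: State) -> Transition:
    """Return the relations that start and stop holding between two states."""
    return Transition(added=after - before, removed=before - after)


def apply_event(state: State, event: Transition) -> State:
    """Apply one transition to a state, yielding the resulting state."""
    return (state - event.removed) | event.added


def compose(state: State, events: Iterable[Transition]) -> State:
    """Apply a succession of events in order (a composed state change)."""
    for event in events:
        state = apply_event(state, event)
    return state


# Hypothetical "give" event: possession of the ball passes from anna to ben.
before = frozenset({("has", "anna", "ball"), ("near", "anna", "ben")})
after = frozenset({("has", "ben", "ball"), ("near", "anna", "ben")})
give = transition(before, after)
```

Under these assumptions, a multi-step task such as building a block house would be a list of transitions passed to compose. Less discrete events such as pouring or bending do not reduce cleanly to a single before/after difference, which is one reason they count as more complex.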
Complexity Space:
Complexity can be defined along at least four principal dimensions:
1. Discreteness (of states and state transitions): While push
and take are relatively simple, state transitions involving pouring,
bending, opening etc. are less discrete and more complex.
2. Compositional: Single events are “simpler” than the sequence
of events and state transitions involved in e.g. building a toy house,
and the associated narrative or discourse.
3. Causal-intentional: Corresponding to the Leslie-style formulation
of agency as self motion, building to intentionality, belief attribution,
mind reading etc.
4. Affordance & Effectivity: Agent must plan how to
use things.
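One way to make the four dimensions listed above concrete is to place each task at a point in the space and compare tasks dimension by dimension. This is a minimal sketch under stated assumptions: the integer scores, the field names and the example placements of push, pour and build_house are all illustrative, not values agreed at the meeting.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class TaskComplexity:
    """Position of a task in the four-dimensional complexity space.

    Each field is an illustrative ordinal score (0 = simplest); the scale
    is an assumption made for this sketch.
    """
    discreteness: int        # 0 = discrete transitions (push, take); higher = pouring, bending
    compositional: int       # 0 = single event; higher = sequences such as building a toy house
    causal_intentional: int  # 0 = self motion; higher = intentionality, belief attribution
    affordance: int          # 0 = no planning of object use; higher = more planning


def at_least_as_complex(a: TaskComplexity, b: TaskComplexity) -> bool:
    """True if task a is at least as complex as task b along every dimension."""
    return all(x >= y for x, y in zip(astuple(a), astuple(b)))


# Hypothetical placements of three tasks in the space.
push = TaskComplexity(0, 0, 0, 0)
pour = TaskComplexity(2, 0, 0, 1)
build_house = TaskComplexity(0, 2, 0, 1)
```

The comparison is only a partial order: in this sketch pour and build_house are each more complex than push, but neither is more complex than the other, since they differ along different dimensions.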
Collaborative Dimensions:
0. "Promotion" of "Robot Cognition" through a web page
summarizing some of these ideas, with links to participants pages, powerpoint
presentations etc.
1. Data collection/sharing:
Various databases are available
Corpus of sentence, scene pairs
Eye tracking data
200 hrs of CDS, labeled data (Siskind)
10 hrs CDS transcribed and context (Roy)
and context (PZG)
100
2. System sharing:
E.g. use of Leonard
Ram and Laurent may provide more robust vision systems; again, what
is the interface?
3. Joint projects and Interface Definition
Nevatia: Event Description Language.
4. Next Meeting
Spring 2004 at USC. Avoid March 31 - April 6 if possible (Evolution of Language Meeting, then a European Project meeting both in Leipzig)
Any interaction implies meaning and representation
Are representations for learning motor control part of the foundation
of more complex representations?
Observations like “infants learn words from syntax” levy requirements
on a system and can correspond to a set of algorithms but they cannot be
verified
Make a distinction between observations vs. theories that can be implemented.
If we use Ram’s vision system, what is the interface?
Why are we here; infant studies vs machine studies
Is there an “atom” of representation from which complex representations
are constructed?
Attention as a more general method for traversal of representational
structures
Representation shall vary along at least two dimensions, one is temporal
(immediate to long term), and the other is concrete object related, to
human related (e.g. intentional states of others).
If the concern is with human cognition, then we must take human memory/processing
limits into account?
So consider a task like the token task, then later ToM issues
Three levels of representational complexity:
1) event types including pour, tear, bend
2) compositionality
3) Affordances
“Next level”
have a system watch humans play a board game that involves pickup putdown,
and then have the system learn the game
Perhaps games involving the required use of affordances: pick up sticks,
or “equilibrium/balanced” structure building
Cross-modal vs. amodal representations
Learning about physics