Brief History:
In July 2001 the French Research Ministry accepted the "RoboKid" project, whose objective was to bring together research on visual scene analysis, speech perception, speech production and language processing in a single platform for studying human cognitive development. As part of that activity, a working meeting was held in June 2003 at the Institut des Sciences Cognitives in Lyon, France. The meeting brought together an extended version of the initial ACI group, with the objective of identifying a long-term strategy for the development of cognitive capabilities (including vision, language and motor control) for humanoid robots. This web site is intended to serve both as a mechanism for communication and collaboration between the group members, and as a means of dissemination to the scientific and lay communities.
Working Group Members:
Peter F. Dominey, Institut des Sciences Cognitives, CNRS UMR 5015, Lyon
Jean-Luc Schwartz, Institut de la Communication Parlée, Grenoble Cedex 1
Deb Roy, Cognitive Machines Group, MIT Media Laboratory, Cambridge, MA, USA
Luc Steels, AI Laboratory, Vrije Universiteit Brussel, Brussels, Belgium
Michael Arbib, Human Brain Project, University of Southern California
Ram Nevatia, ISI, University of Southern California
Laurent Itti, Human Brain Project, University of Southern California
Aude Billard, EPFL, Lausanne, Switzerland
Jeffrey Mark Siskind, Purdue University, Electrical and Computer Engineering, IN, USA
M. Anthony Lewis, Iguana Robotics, IL, USA
Andrew H. Fagg, University of Massachusetts Amherst, Computer Science Dept, USA
Pat Zukow-Goldring, Linguistics, University of Southern California
Global Activity Objective:
The global objective of this working group is to develop humanoid robot capabilities that can be used to test theories of cognitive development and language acquisition. This includes the identification of perceptual, conceptual and processing primitives, and of the derived capabilities that are constructed from them. The essential idea is that “meaning” is extracted from the visual (perceptual) world via perceptual scene analysis, and that the mapping between the structure of this meaning and the structure of language is learned. The system should be able to learn language and then demonstrate this knowledge in behavioral contexts that include (1) interactive dialogs in scene analysis/description, and (2) sensorimotor imitation and action.
In this context we want to clearly identify the functional capabilities whose ecological development we truly want to address, versus those that we will be satisfied to take as “off the shelf” capabilities. This type of modeling can thus serve two related goals:
(1) To understand how the real system might function, and
(2) To produce a non-human system that can display interesting language performance/behavior.
Global Outcome of the June 10-12 Meeting:
Each of the participants approaches robot cognition from a different perspective. One of the unusual benefits of this meeting derived from the mixing of diverse disciplines, including computer vision and scene analysis, human event analysis, robot sensorimotor control and locomotion, primate and human visual psychophysics and modeling, natural language processing, and various degrees of intersection between these disciplines. Combined with a second aspect of the meeting, the provision of ample time for discussion during and after the talks, this interdisciplinarity provided for some interesting cross-fertilization. In particular, the idea of sensorimotor affordances can be extended into the domain of lexical and possibly phrasal semantics. A central issue that was extensively discussed was the nature of the representation of meaning. In addition, we considered that the most straightforward method to define functionality and measure performance is to define specific behavioral tasks in a continuum of complexity. In this context we considered a 4-dimensional space in which task complexity could be defined. A central point of collaboration in this working group will be the identification of shared resources, and the development of protocols/standards for representations of events, objects, relations etc. These activities will be pursued in a follow-up meeting in the Spring of 2004 at USC.
The Centrality of Representational Structure and Function
The representation of meaning is clearly a central issue. Word-to-referent mapping requires that the referent have a representation (that may be modified in the mapping process). A given behavioral function or task (e.g. the ability to recognize simple events given a segmented video image) will drive the definition of the representations that support that function. If we are interested in the representations that support human cognition, then the choice of the target task (and corresponding representations) is of fundamental importance. One approach would be to specify a series of successively more complex tasks that require successively more complex representations, each building upon earlier versions in a developmental manner. The pieces required for this blueprint or roadmap are partially available in the developmental literature. The roadmap should also include constraints on human processing limits (memory, etc.).
Note that certain representational characteristics have become the object of scientific investigation or curiosity as representations per se, rather than simply as part of an interesting behavior. These include, in particular:
1. “compositionality”, i.e. that we can represent new combinations of things “on the fly”, and
2. the assumed internal representations that support “language”.
Successively complex robot cognitive tasks:
Robot-human interaction about static object configurations (static states):
    Names, locations, relations between objects
Robot-human interaction about single dynamic object “events” (state transitions) that require learning about:
    Classical events (touch, push, take, give)
    "Siskind" events (pouring, folding, bending, tearing)
    Physical regularities
    Naive physics
        Gravity, falling
        Support
    Affordances, Effectivities
Robot-human interaction about multiple dynamic object configurations (composed/multiple interrelated state transitions) that require learning about:
    How a succession of events contributes to a state change, i.e. how a block house can be constructed.
Robot-human interaction about intentions that require learning about:
    Intentions, joint attention, theory of mind (Token Task)
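The notion of an event as a state transition can be made concrete with a minimal sketch. It assumes, purely for illustration, that a scene is described as a set of relations between objects and that an event is read off from the relations that disappear and appear between two scenes. The predicate names ("on", "holds"), the rule separating "take" from "give", and the function names are hypothetical and are not a representation agreed on by the group.

```python
from typing import FrozenSet, Optional, Tuple

# A relation is a predicate name followed by its arguments,
# e.g. ("holds", "robot", "block"). These predicate names are hypothetical.
Relation = Tuple[str, ...]
State = FrozenSet[Relation]


def transition(before: State, after: State) -> Tuple[State, State]:
    """Return the relations that disappeared and the relations that appeared."""
    return before - after, after - before


def classify(before: State, after: State) -> Optional[str]:
    """Label a single state transition as 'take' or 'give', or None if neither applies."""
    removed, added = transition(before, after)
    for relation in sorted(added):
        if relation[0] != "holds":
            continue
        _, new_holder, obj = relation
        # Did some other agent hold the same object before the transition?
        previous = sorted(r[1] for r in removed if r[0] == "holds" and r[2] == obj)
        if previous:
            return f"give({previous[0]}, {obj}, {new_holder})"
        return f"take({new_holder}, {obj})"
    return None


on_table = frozenset({("on", "block", "table")})
human_holds = frozenset({("holds", "human", "block")})
robot_holds = frozenset({("holds", "robot", "block")})

print(classify(on_table, human_holds))     # take(human, block)
print(classify(human_holds, robot_holds))  # give(human, block, robot)
print(classify(on_table, on_table))        # None
```

Less discrete events such as pouring or bending do not reduce to a single change of relation in this way, which is one reason they sit higher in the task hierarchy.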
Complexity Space:
Complexity can be defined along at least four principal dimensions:
1. Discreteness (of states and state transitions): While push and take are relatively simple, state transitions involving pouring, bending, opening etc. are less discrete and more complex.
2. Compositionality: Single events are “simpler” than the sequence of events and state transitions involved in e.g. building a toy house, and the associated narrative or discourse.
3. Causal-intentional: Corresponding to the Leslie-style formulation of agency as self motion, building to intentionality, belief attribution, mind reading etc.
4. Affordance & Effectivity: The agent must plan how to use things.
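As a rough sketch of how the four dimensions listed above could be used to arrange tasks along a continuum, each task can be treated as a point in the complexity space. The 0-3 scores assigned below are hypothetical placeholders chosen only for illustration, and the class and function names are assumptions rather than anything defined by the group.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TaskComplexity:
    """A task as a point in the four-dimensional complexity space (hypothetical 0-3 scores)."""
    name: str
    discreteness: int        # 0 = discrete transitions (push, take); higher = less discrete (pour, bend)
    compositionality: int    # 0 = single event; higher = longer sequences of events
    causal_intentional: int  # 0 = no agency; higher = intentions, belief attribution
    affordance: int          # 0 = no planning of object use; higher = more planning

    def as_vector(self) -> Tuple[int, int, int, int]:
        return (self.discreteness, self.compositionality,
                self.causal_intentional, self.affordance)


def dominates(a: TaskComplexity, b: TaskComplexity) -> bool:
    """True if a is at least as complex as b on every dimension and differs on at least one."""
    va, vb = a.as_vector(), b.as_vector()
    return all(x >= y for x, y in zip(va, vb)) and va != vb


def curriculum(tasks: List[TaskComplexity]) -> List[TaskComplexity]:
    """Order tasks from simple to complex by total score (ties broken by name)."""
    return sorted(tasks, key=lambda t: (sum(t.as_vector()), t.name))


naming = TaskComplexity("name static objects", 0, 0, 0, 0)
push_take = TaskComplexity("describe push/take", 0, 0, 1, 0)
pouring = TaskComplexity("describe pouring", 2, 0, 1, 1)
block_house = TaskComplexity("build a block house", 1, 3, 1, 2)
token_task = TaskComplexity("token task", 0, 1, 3, 1)

for task in curriculum([block_house, token_task, pouring, naming, push_take]):
    print(task.name, task.as_vector())
```

Sorting by total score is only one possible linearization. The `dominates` relation is a partial order, so tasks such as the block house and the token task, which are complex along different dimensions, remain incomparable rather than being forced onto a single scale.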
Collaborative Dimensions:
0. "Promotion" of "Robot Cognition" through a web page summarizing some of these ideas, with links to participants' pages, PowerPoint presentations etc.
1. Data collection/sharing:
Various databases are available
Corpus of sentence, scene pairs
Eye tracking data
200 hrs of CDS, labeled data (Siskind)
10 hrs CDS transcribed and context (Roy)
and context (PZG)
100
2. System sharing:
E.g. use of Leonard
Ram and Laurent may provide more robust vision systems; again, what is the interface?
3. Joint projects and Interface Definition
Nevatia: Event Description Language.
4. Next Meeting
Spring 2004 at USC. Avoid March 31 - April 6 if possible (Evolution of Language Meeting, then a European Project meeting both in Leipzig)
Any interaction implies meaning and representation
Are representations for learning motor control part of the foundation of more complex representations?
Observations like “infants learn words from syntax” levy requirements on a system and can correspond to a set of algorithms, but they cannot be verified
Make a distinction between observations vs. theories that can be implemented.
If we use Ram’s vision system, what is the interface?
Why are we here; infant studies vs. machine studies
Is there an “atom” of representation from which complex representations are constructed?
Attention as a more general method for traversal of representational structures
Representation shall vary along at least two dimensions: one is temporal (immediate to long term), and the other runs from concrete object related to human related (e.g. intentional states of others).
If the concern is with human cognition, then must we take human memory/processing limits into account?
So consider a task like the token task, then later ToM issues
Three levels of representational complexity:
1) event types including pour, tear, bend
2) compositionality
3) Affordances
“Next level”: have a system watch humans play a board game that involves pickup/putdown, and then have the system learn the game
Perhaps games involving the required use of affordances: pick up sticks, or “equilibrium/balanced” structure building
Cross-modal vs. amodal representations
Learning about physics
Institut des Sciences Cognitives, UMR 5015 CNRS UCB Lyon 1
67, boulevard Pinel, 69675 BRON cedex
33 (0)4 37 91 12 12
33 (0)4 37 91 12 10
web@isc.cnrs.fr