Robot Cognition Working Group

Brief History:

In July 2001 the French Research Ministry accepted the "RoboKid" project, whose objective is to combine research on visual scene analysis, speech perception, speech production and language processing into a single platform for studying human cognitive development. As part of that activity, a working meeting was held in June 2003 at the Institut des Sciences Cognitives in Lyon, France. The meeting included an extension of the initial ACI group, and its objective was to identify a long-term strategy for the development of cognitive capabilities (including vision, language and motor control) for humanoid robots.  This web site is intended to serve both as a mechanism for communication and collaboration between the group members, and for diffusion to the scientific and lay communities.

Working Group Members:

Peter F. Dominey, Institut des Sciences Cognitives, CNRS UMR 5015, Lyon
Jean-Luc Schwartz, Institut de la Communication Parlée, Grenoble Cedex 1
Deb Roy, Cognitive Machines Group, MIT Media Laboratory, Cambridge, MA, USA
Luc Steels, AI Laboratory, Vrije Universiteit Brussel, Brussels, Belgium
Michael Arbib, Human Brain Project, University of Southern California
Ram Nevatia, ISI, University of Southern California
Laurent Itti, Human Brain Project, University of Southern California
Aude Billard, EPFL, Lausanne, Switzerland
Jeffrey Mark Siskind, Purdue University, Electrical and Computer Engineering, IN, USA
M. Anthony Lewis, Iguana Robotics, IL, USA
Andrew H. Fagg, University of Massachusetts Amherst, Computer Science Dept, USA
Pat Zukow-Goldring, Linguistics, University of Southern California
 

Global Activity Objective:

The global objective of this working group is to develop humanoid robot capabilities that can be used to test theories of cognitive development and language acquisition. This includes the identification of perceptual, conceptual and processing primitives, and of the derived capabilities that are constructed from them.  The essential idea is that “meaning” is extracted from the visual (perceptual) world via perceptual scene analysis, and that the mapping between the structure of this meaning and the structure of language is learned.  The system should be able to learn language and then demonstrate this knowledge in behavioral contexts that include: (1) interactive dialogs in scene analysis/description, and (2) sensorimotor imitation and action.  In this context we want to clearly distinguish the functional capabilities whose ecological development we truly want to address from those that we will be satisfied to take as “off the shelf” capabilities.  Thus we can consider that this type of modeling can serve two related goals:
(1)   To understand how the real system might function, and
(2)   To produce a non-human system that can display interesting language performance/behavior.

Global Outcome of the June 10-12 Meeting:

Each of the participants approaches robot cognition from a different perspective.  One of the unusual benefits of this meeting derived from the mixing of diverse disciplines including computer vision and scene analysis, human event analysis, robot sensorimotor control and locomotion, primate and human visual psychophysics and modeling, natural language processing, and various degrees of intersection between these disciplines.  Combined with a second aspect of the meeting, the provision of ample time for discussion during and after the talks, this interdisciplinarity provided for some interesting cross-fertilization.  In particular, the idea of sensorimotor affordances can be extended into the domain of lexical and possibly phrasal semantics.  A central issue that was extensively discussed was the nature of the representation of meaning.  In addition, we considered that the most straightforward method to define functionality and measure performance is to define specific behavioral tasks along a continuum of complexity.  In this context we considered a four-dimensional space in which task complexity could be defined.  A central point of collaboration in this working group will be the identification of shared resources, and the development of protocols/standards for representations of events, objects, relations etc.  These activities will be pursued in a follow-up meeting in the Spring of 2004 at USC.
 

The Centrality of Representational Structure and Function
The representation of meaning, clearly, is a central issue.  Word-to-referent mapping requires that the referent have a representation (that may be modified in the mapping process).  A given behavioral function or task (e.g. the ability to recognize simple events given a segmented video image) will drive the definition of the representations that support that function.  If we are interested in the representations that support human cognition, then the choice of the target task (and corresponding representations) is of fundamental importance.  One approach would be to specify a series of successively more complex tasks that require successively more complex representations, each building upon earlier versions in a developmental manner.  The pieces required for the blueprint or roadmap are partially available in the developmental literature.  The tasks should also include constraints reflecting human processing limits (memory, etc.).

Note that there are representational characteristics, including particularly
1. “compositionality”, i.e. that we can represent new combinations of things “on the fly”, and
2. the assumed internal representations that support “language” that have become the object of scientific investigation or curiosity as representations per se, rather than simply as part of an interesting behavior.

Successively more complex robot cognitive tasks:
Robot-human interaction about static object configurations (static states):
    Names, locations,
    relations between objects

Robot-human interaction about single dynamic object “events” (state transitions) that require learning about:
    Classical events (touch; push, take, give),
    "Siskind" events (pouring, folding, bending, tearing)
    Physical regularities
        Naive physics
            Gravity, falling
            Support
        Affordances, Effectivities

Robot-human interaction about multiple dynamic object configurations (composed/multiple interrelated state transitions) that require learning about:
    How a succession of events contributes to a state change, e.g. how a block house can be constructed.

Robot-human interaction about intentions that require learning about:
    Intentions, joint attention, theory of mind (Token Task)
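
The first two task levels above treat static object configurations as states and events as state transitions. The following is a minimal Python sketch of that idea, not a representation agreed on by the group. It assumes that a state is a set of ground relations between named objects and that an event type is defined by the relations it removes and adds. All names (Relation, State, EventType, recognize, and the relation labels "on", "holding", "left_of") are hypothetical and chosen only for illustration.

```python
from typing import FrozenSet, List, NamedTuple, Tuple

# Assumption: a relation is a tuple such as ("on", "block", "table"),
# and a state is the set of relations that hold at one moment.
Relation = Tuple[str, ...]
State = FrozenSet[Relation]


class EventType(NamedTuple):
    """Hypothetical event definition expressed as a state transition."""
    name: str
    removed: State  # relations that hold before the event but not after
    added: State    # relations that hold after the event but not before


def recognize(before: State, after: State,
              event_types: List[EventType]) -> List[str]:
    """Return the names of event types consistent with the observed change."""
    gone = before - after   # relations that disappeared
    new = after - before    # relations that appeared
    return [e.name for e in event_types
            if e.removed <= gone and e.added <= new]


# Illustrative definitions for two of the "classical" events.
TAKE = EventType("take",
                 removed=frozenset({("on", "block", "table")}),
                 added=frozenset({("holding", "agent", "block")}))
GIVE = EventType("give",
                 removed=frozenset({("holding", "agent", "block")}),
                 added=frozenset({("holding", "human", "block")}))

before = frozenset({("on", "block", "table"), ("left_of", "block", "cup")})
after = frozenset({("holding", "agent", "block")})

events = recognize(before, after, [TAKE, GIVE])
```

Here recognize(before, after, [TAKE, GIVE]) reports only "take". The sketch uses ground relations with no variable binding, so it does not generalize over objects, and it says nothing about less discrete events such as pouring or bending, which would need a richer description of the transition.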
 

Complexity Space:
Complexity can be defined along at least four principal dimensions:
1. Discreteness (of states and state transitions):  While push and take are relatively simple, state transitions involving pouring, bending, opening etc. are less discrete and more complex.
2. Compositional:  Single events are “simpler” than the sequence of events and state transitions involved in e.g. building a toy house, and the associated narrative or discourse.
3. Causal-intentional:  Corresponding to the Leslie-style formulation of agency as self motion, building to intentionality, belief attribution, mind reading etc.
4. Affordance & Effectivity:  Agent must plan how to use things.
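
One way to make this space concrete is to place each task at a point with one ordinal score per dimension listed above. The Python sketch below is only an illustration under stated assumptions: the integer scale, the placement of each task, and the names Complexity, TASKS and curriculum are all hypothetical and were not specified by the group.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class Complexity:
    """Hypothetical position of a task in the complexity space (0 = simplest)."""
    discreteness: int        # higher = less discrete transitions (pour, bend)
    compositional: int       # higher = longer sequences of interrelated events
    causal_intentional: int  # higher = intentions, belief attribution
    affordance: int          # higher = more planning about how to use things

    def at_most(self, other: "Complexity") -> bool:
        """True if this task is no more complex than `other` on every axis."""
        return all(a <= b for a, b in zip(astuple(self), astuple(other)))


# Illustrative placements only; the scores are assumptions, not measurements.
TASKS = {
    "static object description": Complexity(0, 0, 0, 0),
    "push/take events": Complexity(0, 0, 1, 0),
    "pouring/bending events": Complexity(1, 0, 1, 1),
    "block house construction": Complexity(1, 2, 1, 2),
    "token task": Complexity(0, 1, 3, 0),
}


def curriculum(tasks: dict) -> list:
    """Order task names by summed score, as one crude developmental sequence."""
    return sorted(tasks, key=lambda name: sum(astuple(tasks[name])))


ordered = curriculum(TASKS)
```

Summing the axes imposes a single ordering, which is a simplification. The at_most method keeps the partial order visible: with these placements the token task and block house construction are not comparable, because each exceeds the other on some dimension.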

Collaborative Dimensions:
0.  "Promotion" of "Robot Cognition" through a web page summarizing some of these ideas, with links to participants' pages, PowerPoint presentations etc.
 

1. Data collection/sharing:
Various databases are available
Corpus of sentence, scene pairs
Eye tracking data
200 hrs of CDS, labeled data (Siskind)
10 hrs CDS transcribed and context (Roy)
 and context (PZG)
100

2. System sharing:

E.g. use of Leonard
Ram and Laurent may provide more robust vision systems: again, what is the interface?
 

3.  Joint projects and Interface Definition

Nevatia:  Event Description Language.
 

4.  Next Meeting

Spring 2004 at USC.  Avoid March 31 - April 6 if possible (Evolution of Language Meeting, then a European Project meeting, both in Leipzig).

Any interaction implies meaning and representation
Are representations for learning motor control part of the foundation of more complex representations?
Observations like “infants learn words from syntax” levy requirements on a system and can correspond to a set of algorithms but they cannot be verified
Make a distinction between observations vs. theories that can be implemented.
If we use Ram’s vision system, what is the interface?
Why are we here; infant studies vs machine studies
Is there an “atom” of representation from which complex representations are constructed
 
 

Attention as a more general method for traversal of representational structures
Representations will vary along at least two dimensions: one is temporal (immediate to long term), and the other runs from concrete and object-related to human-related (e.g. intentional states of others).

If the concern is with human cognition, then we must take human memory/processing limits into account?
So consider a task like the token task, then later ToM issues

Three levels of representational complexity:
1) event types including pour, tear, bend
2) compositionality
3) Affordances
 

“Next level”
Have a system watch humans play a board game that involves pick-up and put-down actions, and then have the system learn the game
Perhaps games involving the required use of affordances: pick up sticks, or “equilibrium/balanced” structure building
Cross-modal vs. amodal representations
Learning about physics