Grounded Language Processing 21-22
The Grounded Language Processing course is taught by Raffaella Bernardi (UniTN); the TA is Alberto Testoni. Classes are on Tuesdays (15:00-17:00), Wednesdays (13:00-15:00) and Thursdays (13:00-15:00) in Rovereto, Palazzo Fedrigotti, Corso Bettini 31, 3rd floor (seminar room).
The course is part of the degree in Artificial Intelligence Systems, but any UniTN student interested in the topic can attend it as a Free Choice Course -- following the rules of the program in which they are enrolled.
UniTN students who are interested in attending the course but cannot attend it in person are welcome to email me -- we plan to teach the course using a digital board to facilitate virtual participation.
If you are planning to attend the course, please add some information about yourself in this form; it will help us plan the course better.
What this course is about
This course focuses on the emerging field of Grounded Language Processing (GLP), a subarea of AI that studies the connection between natural language, perception and action in the world. It gives students an overview of recent advances while also revisiting the long-standing challenges set by the AI community at its start. It draws connections between natural language processing (NLP), computer vision and robotics. It covers both grounded natural language understanding and grounded natural language generation, as well as unified architectures for these two crucial components of AI agents. If time allows, the course ends by hinting at the connection between GLP and robotics and by comparing the neural representations and attention mechanisms behind grounded natural language in humans with those of state-of-the-art multimodal models.
Each main section consists of both frontal lectures and hands-on sessions.
Prerequisites: The course presupposes knowledge of Machine Learning and Natural Language Processing, and ideally of Computer Vision.
Grading criteria: paper review 15%, presentation of a research question and its SOTA 35%, project 50%. Details can be found here.
- WEEK 1: The Grounding Problem. Readings: Harnad (1990), Pulvermüller (2005), Kafle et al. (2019)
- Tuesday 14.09.21 Intro to the course. Why grounding?
- Wednesday 15.09.21 From the distant to the recent computational past
- Thursday 16.09.21 Reading Group: Mayo (2003)
- WEEK 2: Computational Models for Multimodal Concept Representations. Readings: Baroni (2016), Beinborn et al. (2018)
- Tuesday 21.09.21 Word representation
- Wednesday 22.09.21 (13:15-14:45) Practical Lab (multimodal concept representations)
- Thursday 23.09.21 (13:15-14:45) Reading Group: Lazaridou, Bruni and Baroni (2014) -- see the cross-modal mapping sketch below
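For the lab and the Lazaridou, Bruni and Baroni (2014) reading, a minimal sketch of cross-modal mapping, assuming toy random vectors in place of real word embeddings and CNN image features: learn a ridge regression from text space to visual space on seen concepts, then project an unseen word (the paper's zero-shot "wampimuk" setting) into visual space and retrieve its nearest visual neighbour.

```python
# Cross-modal mapping sketch in the spirit of Lazaridou, Bruni & Baroni (2014).
# All vectors are random placeholders; a real lab would use e.g. word2vec
# embeddings on the text side and CNN features on the visual side.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
n_train, d_text, d_vis = 100, 50, 20

X_text = rng.normal(size=(n_train, d_text))                        # word vectors
W_true = rng.normal(size=(d_text, d_vis))
Y_vis = X_text @ W_true + 0.1 * rng.normal(size=(n_train, d_vis))  # image vectors

# Fit the text -> vision mapping on seen concepts
mapping = Ridge(alpha=1.0).fit(X_text, Y_vis)

# Zero-shot: project an unseen word into visual space ...
unseen_word = rng.normal(size=(1, d_text))
predicted_vis = mapping.predict(unseen_word)

# ... and retrieve its nearest neighbour among candidate images
candidates = rng.normal(size=(10, d_vis))
sims = cosine_similarity(predicted_vis, candidates)[0]
print("closest image index:", int(sims.argmax()))
```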
- WEEK 3: Grounded NL Understanding
- Tuesday 28.09.21 (15:15-16:45) Sentence representation
- Wednesday 29.09.21 Practical Lab
- Thursday 30.09.21 Reading Group: Kamath et al. (2021)
- WEEK 4: Visual Question Answering. Readings: Kafle and Kanan (2017), Bernardi and Pezzelle (2021), Srivastava et al. (2021)
- Tuesday 05.10.21 Task, datasets and models -- a minimal baseline sketch follows this week's block
- Wednesday 06.10.21 Practical Lab
- Thursday 07.10.21 Reading Group: Bugliarello et al. (TACL 2021)
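To make Tuesday's task and models concrete, a minimal sketch of the classic VQA-as-classification baseline: encode the question, fuse it with precomputed image features by concatenation, and classify over a fixed answer vocabulary. All names, layer sizes and the bag-of-words question encoder (standing in for an LSTM) are illustrative assumptions, not any specific published model.

```python
# Minimal VQA-as-classification baseline sketch: fuse a question encoding with
# image features and predict one of a fixed set of answers. Dimensions are
# simplifying assumptions chosen for illustration only.
import torch
import torch.nn as nn

class TinyVQA(nn.Module):
    def __init__(self, vocab_size=1000, d_word=64, d_img=512, n_answers=100):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, d_word)   # defaults to mean pooling
        self.classifier = nn.Sequential(
            nn.Linear(d_word + d_img, 256),
            nn.ReLU(),
            nn.Linear(256, n_answers),
        )

    def forward(self, question_ids, image_feats):
        q = self.embed(question_ids)                 # (batch, d_word)
        fused = torch.cat([q, image_feats], dim=1)   # late fusion by concatenation
        return self.classifier(fused)                # logits over answer vocabulary

# Toy forward pass: a batch of 2 questions (token ids) and 2 image vectors
model = TinyVQA()
questions = torch.randint(0, 1000, (2, 8))           # 8 token ids per question
images = torch.randn(2, 512)                         # stand-in for pooled CNN features
print(model(questions, images).shape)                # torch.Size([2, 100])
```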
- WEEK 5: Grounded NL Generation. Readings: Hossain et al. (2019)
- Tuesday 12.10.21 Datasets and models
- Wednesday 13.10.21 Practical Lab
- Thursday 14.10.21 Reading Group: Vinyals et al. (2015) -- see the decoding sketch below
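Ahead of the Vinyals et al. (2015) reading group, a minimal sketch of "Show and Tell"-style caption generation: an image feature conditions an LSTM decoder, which is then decoded greedily until an end token. The weights are untrained and all sizes, the random "image" and the special token ids are placeholder assumptions.

```python
# Greedy caption decoding sketch in the spirit of Show and Tell (Vinyals et al.
# 2015): an image feature initializes an LSTM, which emits one word at a time.
import torch
import torch.nn as nn

vocab_size, d_model, BOS, EOS = 50, 32, 0, 1
embed = nn.Embedding(vocab_size, d_model)
lstm = nn.LSTMCell(d_model, d_model)
to_vocab = nn.Linear(d_model, vocab_size)
img_proj = nn.Linear(512, d_model)      # map the CNN feature to decoder space

image_feat = torch.randn(1, 512)        # stand-in for a CNN image encoding
h = img_proj(image_feat)                # the image initializes the hidden state
c = torch.zeros_like(h)

token, caption = torch.tensor([BOS]), []
for _ in range(20):                     # hard cap on caption length
    h, c = lstm(embed(token), (h, c))
    token = to_vocab(h).argmax(dim=1)   # greedy: most probable next word
    if token.item() == EOS:
        break
    caption.append(token.item())
print(caption)                          # a (meaningless) untrained caption
```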
- WEEK 6: Visual Dialogues. Readings: Chen et al. (2020), TBC
- Tuesday 19.10.21 Datasets and models
- Wednesday 20.10.21 Practical Lab
- Thursday 21.10.21 Reading Group: Holtzman et al. (2020) -- see the top-p sampling sketch below
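The Holtzman et al. (2020) reading concerns decoding; a minimal sketch of its core proposal, nucleus (top-p) sampling, on a toy logits vector (the vocabulary and values are invented placeholders): keep the smallest set of tokens whose cumulative probability reaches p, renormalize, and sample from that set.

```python
# Nucleus (top-p) sampling sketch (Holtzman et al. 2020): sample only from the
# smallest set of tokens whose cumulative probability mass reaches p.
import numpy as np

def nucleus_sample(logits, p=0.9, seed=0):
    rng = np.random.default_rng(seed)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                       # softmax over the vocabulary
    order = np.argsort(probs)[::-1]            # tokens by descending probability
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, p)) + 1  # smallest prefix with mass >= p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()  # renormalize
    return int(rng.choice(nucleus, p=nucleus_probs))

toy_logits = np.array([2.0, 1.0, 0.5, -1.0, -3.0])  # placeholder 5-word vocabulary
print(nucleus_sample(toy_logits))
```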
- WEEK 7: Neural Representations (Stefania Bracci)
- Tuesday 26.10.21 Vision in the Brain
- Wednesday 27.10.21 Practical Lab (with Alberto and Raffaella)
- Thursday 28.10.21 Reading Group (Stefania's paper)
- WEEK 8: Work on Language and Vision at CIMeC (?)
- Tuesday 02.11.21 Lab on annotation, inter-annotator agreement and correlation -- see the sketch after this week's block
- Wednesday 03.11.21 GLP at LaVi's -- Alberto's current project. Define the research questions of each group.
- Thursday 04.11.21 Reading Group: Suglia et al. (2020)
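For Tuesday's lab, a minimal sketch of the two statistics named there, computed on invented placeholder data: Cohen's kappa for inter-annotator agreement on categorical labels, and Spearman correlation for rank agreement between two sets of ratings.

```python
# Toy computation of the lab's two statistics: Cohen's kappa for
# inter-annotator agreement on categorical labels, and Spearman correlation
# for rank agreement between two raters' scores. All data are placeholders.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import spearmanr

annotator_a = ["yes", "no", "yes", "yes", "no", "yes"]
annotator_b = ["yes", "no", "no",  "yes", "no", "yes"]
print("Cohen's kappa:", cohen_kappa_score(annotator_a, annotator_b))

ratings_a = [4, 2, 5, 3, 1]
ratings_b = [5, 1, 4, 3, 2]
rho, pval = spearmanr(ratings_a, ratings_b)
print("Spearman rho:", rho, "p-value:", pval)
```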
- WEEK 9: Project Design
- Tuesday 09.11.21 NO CLASS
- Wednesday 10.11.21 Converge towards the main idea
- Thursday 11.11.21 GLP at LaVi's running projects (Claudio's PhD overview, Emma's and David's MSc plans)
- WEEK 10: Project Proposal Discussion
- Tuesday 16.11.21 Search for existing code and related work
- Wednesday 17.11.21 Get your hands dirty with the code
- Thursday 18.11.21 Set the specific research questions: update the other groups
- WEEK 11:
- Tuesday 23.11.21 Design the experiments and evaluation method
- Wednesday 24.11.21 Peer-to-peer supervision
- Thursday 25.11.21 Reading Group: Hawkins et al. (TBC)
- WEEK 12:
- Tuesday 30.11.21 Schema of relevant literature
- Wednesday 01.12.21 Frontal class: Embodied Agents
- Thursday 02.12.21 (14:00-16:00) Groups 1, 2 and 3 (project proposal literature-overview presentations)
- WEEK 13: Project Proposal Presentations
- Thursday 09.12.21 (10:00-12:00) by the three groups
Main Surveys
- Harnad, S. (1990). The Symbol Grounding Problem. Physica D, 42, 335-346.
- Pulvermüller, F. (2005). Brain mechanisms linking language and action.
- Kafle, K., Shrestha, R., & Kanan, C. (2019). Challenges and prospects in vision and language research. Frontiers in Artificial Intelligence, 2, 28.
- Baroni, M. (2016). Grounding distributional semantics in the visual world.
- Beinborn, L., Botschen, T., & Gurevych, I. (2018). Multimodal grounding for language processing.
- Hossain, M. Z., Sohel, F., Shiratuddin, M. F., & Laga, H. (2019). A comprehensive survey of deep learning for image captioning.
- Bernardi, R., Cakici, R., Elliott, D., Erdem, A., Erdem, E., Ikizler-Cinbis, N., Keller, F., Muscat, A., & Plank, B. (2016). Automatic description generation from images: A survey of models, datasets, and evaluation measures. Journal of Artificial Intelligence Research (JAIR), 55, 409-442.
- Kafle, K., & Kanan, C. (2017). Visual question answering: Datasets, algorithms, and future challenges.
- Bernardi, R., & Pezzelle, S. (2021). Linguistic issues behind visual question answering.
- Srivastava, Y., Murali, V., Dubey, S. R., & Mukherjee, S. (2021). Visual question answering using deep learning: A survey and performance analysis. In S. Singh, P. Roy, B. Raman, & P. Nagabhushan (Eds.), Computer Vision and Image Processing (CVIP 2020), Communications in Computer and Information Science, vol. 1377. Springer. (Pre-print)
- Chen, Lao, & Duan (2020). Multimodal fusion of visual dialog: A survey.
Papers for Reading Groups
- Mayo, M. J. (2003). Symbol grounding and its implications for artificial intelligence.
- Lazaridou, A., Bruni, E., & Baroni, M. (2014). Is this a wampimuk? Cross-modal mapping between distributional semantics and the visual world. In Proceedings of ACL 2014.
- Kiela, D., Conneau, A., Jabri, A., & Nickel, M. (2018). Learning visually grounded sentence representations.
- Cao, J., Gan, Z., Cheng, Y., Yu, L., Chen, Y.-C., & Liu, J. (2020). Behind the scene: Revealing the secrets of pre-trained vision-and-language models.
- Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image caption generator.
- Monroe, W., Hawkins, R. X. D., Goodman, N. D., & Potts, C. (2017). Colors in context: A pragmatic neural model for grounded language understanding.
- Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., & Zhang, L. (2018). Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6077-6086).
- Thomason, J., Padmakumar, A., Sinapov, J., Walker, N., Jiang, Y., Yedidsion, H., Hart, J., Stone, P., & Mooney, R. J. (2019). Improving grounded natural language understanding through human-robot dialog. [pre-print]