Grounded Language Processing 22-23
The Grounded Language Processing course is taught by Raffaella Bernardi (UniTN); the TA is Alberto Testoni. Classes are on Tuesdays (13:00-15:00), Wednesdays (13:00-15:00) and Thursdays (15:00-17:00) in Rovereto, Palazzo Fedrigotti, Corso Bettini 31, 3rd floor (seminar room, "aula seminari").
The course is part of the degree in Artificial Intelligence Systems, but any UniTN student interested in the topic can attend it as a Free Choice Course, following the rules of the Program in which they are enrolled.
UniTN students who are interested in the course but cannot attend in person are welcome to email me -- we plan to teach the course using a digital board so as to facilitate virtual participation.
If you are planning to attend the course, please add some information about yourself in this form; it will help us plan the course better.
What this course is about
This course focuses on the emerging field of Grounded Language Processing (GLP), a subarea of AI that studies the connection between natural language, perception and action in the world. It gives students an overview of recent advances while also revisiting the long-standing challenges set by the AI community at its inception. It makes connections between natural language processing (NLP), computer vision and robotics. It covers both grounded natural language understanding and grounded natural language generation, as well as unified architectures for these two crucial components of AI agents. If time allows, the course ends with hints about the connection between GLP and robotics, and by comparing the neural representations and attention mechanisms behind grounded natural language in humans with those of state-of-the-art multimodal models.
Each main section consists of both frontal lectures and hands-on experience.
Prerequisites: The course presupposes knowledge of Machine Learning, Natural Language Processing and possibly Computer Vision.
Grading Criteria: paper review 15%, presentation of a research question and its SOTA 35%, project 50%. Details can be found here
- WEEK 1: The Grounding Problem Harnad (1990), Pulvermüller (2005), Kafle et al (2019)
- Tuesday 20.09.22 Intro to the course. Why grounding?
- Wednesday 21.09.22 From the distant to the recent computational past
- Thursday 22.09.22 Reading Group: Mayo (2003)
- WEEK 2: Computational models for Multimodal Concept Representations Baroni (2016), Beinborn et al (2018)
- Tuesday 04.10.22 Word representation
- Wednesday 05.10.22 Practical Lab (with Raffa -- MM conceptual representation; see the sketch after this week's entries)
- Thursday 06.10.22 Reading Group: Lazaridou, Bruni and Baroni (2014)
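As a preview of this week's lab, here is a minimal sketch of the cross-modal mapping idea behind the Lazaridou, Bruni and Baroni (2014) reading: learn a linear map from distributional word vectors to visual feature vectors, then use it to retrieve the right visual vector for held-out words. All vectors below are random toy data standing in for the real embeddings used in the lab; names like `X_text` and `Y_vis` are purely illustrative.

```python
# Cross-modal mapping in the spirit of Lazaridou, Bruni and Baroni (2014):
# learn a linear map from text-based word vectors to visual feature vectors,
# then retrieve visual vectors for held-out ("unseen") words. All data here
# is random toy data standing in for real embeddings.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
n_words, d_text, d_vis = 100, 300, 512
X_text = rng.normal(size=(n_words, d_text))  # distributional word vectors
Y_vis = rng.normal(size=(n_words, d_vis))    # visual vectors (e.g. CNN features)

# Fit a regularised linear mapping text -> vision on 80 "seen" words.
mapping = Ridge(alpha=1.0).fit(X_text[:80], Y_vis[:80])

# Project the 20 held-out words into visual space and rank candidates by cosine.
Y_pred = mapping.predict(X_text[80:])
sims = cosine_similarity(Y_pred, Y_vis[80:])
top1 = (sims.argmax(axis=1) == np.arange(20)).mean()
print(f"Top-1 retrieval accuracy on toy data: {top1:.2f}")
```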
- WEEK 3: Grounded NL understanding
- Tuesday 11.10.22 Sentence representation
- Wednesday 12.10.22 Practical Lab (ALBERTO -- MDETR code)
- Thursday 13.10.22 Reading Group: Aishwarya Kamath et al (2021)
- WEEK 4: Visual Question Answering Kafle and Kanan 2017, Bernardi and Pezzelle 2021, Srivastava et al 2021
- Tuesday 18.10.22 Task, datasets and models
- Wednesday 19.10.22 Practical Lab (ALBERTO -- VQA with MDETR; see the sketch after this week's entries)
- Thursday 20.10.22 Reading Group: Parcalabescu et al (2022)
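If you want to try VQA before the lab (which uses MDETR), a lighter off-the-shelf stand-in is a pretrained ViLT model. A minimal sketch, assuming the `dandelin/vilt-b32-finetuned-vqa` checkpoint and the sample COCO image from the HuggingFace transformers documentation; it requires `pip install torch transformers pillow requests`.

```python
# Zero-shot VQA with a pretrained ViLT model -- a lightweight stand-in for
# the MDETR-based lab. Checkpoint and example image follow the HuggingFace
# transformers documentation; treat both as assumptions, not course material.
import requests
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # two cats on a couch
image = Image.open(requests.get(url, stream=True).raw)
question = "How many cats are there?"

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

# Encode the image-question pair and pick the highest-scoring answer class.
inputs = processor(image, question, return_tensors="pt")
logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```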
- WEEK 5: Grounded NL generation Hossain et al (2019)
- Tuesday 25.10.22 Datasets and models
- Wednesday 26.10.22 Practical Lab (ALBERTO -- IC with MDETR; see the sketch after this week's entries)
- Thursday 27.10.22 Reading Group: Vinyals et al (2015)
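As with the VQA sketch above, image captioning (IC) can be tried with an off-the-shelf model before the MDETR-based lab. This is not the lab's code, just a minimal sketch assuming the `Salesforce/blip-image-captioning-base` checkpoint from HuggingFace transformers.

```python
# Image captioning with a pretrained BLIP model -- a lightweight stand-in for
# the MDETR-based IC lab. Checkpoint name follows the HuggingFace transformers
# documentation; treat it as an assumption.
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Encode the image and generate a short caption.
inputs = processor(image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True))
```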
- WEEK 6: Visual Dialogue Chen et al (2020)
- Wednesday 02.11.22 Datasets and models (CANCELLED)
- Thursday 03.11.22 Practical Lab (ALBERTO -- decoding strategies; see the sketch after this week's entries)
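A minimal sketch of two decoding strategies in the spirit of this lab and of next week's Holtzman et al (2020) reading: greedy decoding versus nucleus (top-p) sampling. The "model" is just a fixed toy next-token distribution; in the lab you would use the per-step probabilities of a real captioning model.

```python
# Two decoding strategies over a toy next-token distribution: greedy decoding
# vs. nucleus (top-p) sampling (Holtzman et al., 2020). The distribution is
# made up for illustration; a real model would supply per-step probabilities.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["a", "cat", "dog", "sits", "runs"]
probs = np.array([0.40, 0.25, 0.20, 0.10, 0.05])  # toy next-token probabilities

def greedy(probs):
    """Always pick the single most likely token."""
    return int(np.argmax(probs))

def nucleus(probs, p=0.9):
    """Sample from the smallest set of top tokens whose cumulative mass >= p."""
    order = np.argsort(probs)[::-1]                   # tokens sorted by probability
    cum = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cum, p)) + 1]  # the "nucleus"
    renorm = probs[keep] / probs[keep].sum()          # renormalise inside the nucleus
    return int(rng.choice(keep, p=renorm))

print("greedy :", vocab[greedy(probs)])
print("nucleus:", [vocab[nucleus(probs)] for _ in range(5)])
```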
- WEEK 7 and 8: Work on Language and Vision at CIMeC
- Monday 07.11.22 (11:30-13:00, online via Zoom) Datasets and models for Visual Dialogue
- Tuesday 08.11.22 Alex and Federico
- Wednesday 09.11.22 Lab on annotation, inter-annotator agreement and correlation (RAFFA; see the sketch at the end of this block)
- Thursday 10.11.22 Reading Group: Holtzman et al (2020)
- Tuesday 15.11.22 GLP at LaVi: past and current projects
- Wednesday 16.11.22 Continuation of the previous Lab (ALBERTO)
- Thursday 17.11.22 Reading Group: Mazuecos, Benotti et al (EMNLP 2021)
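A minimal sketch of the measures behind Wednesday's annotation lab: Cohen's kappa for inter-annotator agreement on categorical labels, and Spearman correlation between an automatic metric and human ratings. All labels and scores below are invented for illustration.

```python
# Toy versions of the lab's measures: Cohen's kappa for inter-annotator
# agreement and Spearman correlation between an automatic metric and human
# ratings. All labels and scores below are made-up toy data.
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

annotator_a = ["yes", "no", "yes", "yes", "no", "yes"]
annotator_b = ["yes", "no", "no", "yes", "no", "yes"]
print("Cohen's kappa:", cohen_kappa_score(annotator_a, annotator_b))

human_ratings = [4.5, 3.0, 2.0, 5.0, 1.5]       # e.g. caption quality judgements
metric_scores = [0.80, 0.55, 0.40, 0.90, 0.20]  # e.g. an automatic metric
rho, pval = spearmanr(human_ratings, metric_scores)
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
```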
- WEEK 9: Project Design
- Tuesday 22.11.22 (YOUR PROJECT) Converge towards the main idea of your projects
- Wednesday 23.11.22 (YOUR PROJECT) Search for existing code and related work
- Thursday 24.11.22 Embodied AI
- WEEK 10: Project Proposal Discussion
- Monday 28.11.22 (15:00-17:00) (YOUR PROJECT) Get your hands on the code (ALBERTO)
- Tuesday 29.11.22 (YOUR PROJECT) Group 1 and Group 2: present relevant literature
- Wednesday 30.11.22 (13:00-14:30) (YOUR PROJECT) Group 3 and Group 4: present relevant literature
- Thursday 01.12.22 Evaluation methods in NLP (TBC)
- WEEK 11:
- Tuesday 06.12.22 (YOUR PROJECT) Design the experiments and evaluation method (Raffa and Alberto)
- Wednesday 07.12.22 (YOUR PROJECT) Peer-to-peer supervision (exchange between groups -- only Raffa)
- WEEK 12:
- Tuesday 13.12.22 (4 hrs, YOUR PROJECT) Groups 1, 2, 3 and 4: project proposals
Main Surveys
- Harnad, S. (1990) The Symbol Grounding Problem. Physica D 42: 335-346.
- F. Pulvermüller (2005) Brain mechanisms linking language and action
- Kafle, K., Shrestha, R., & Kanan, C. (2019). Challenges and prospects in vision and language research. Frontiers in Artificial Intelligence, 2, 28.
- Marco Baroni (2016) Grounding Distributional Semantics in the Visual World
- Lisa Beinborn, Teresa Botschen and Iryna Gurevych (2018) Multimodal Grounding for Language Processing
- Hossain, Sohel, Shiratuddin and Laga (2019) A Comprehensive Survey of Deep Learning for Image Captioning
- Raffaella Bernardi, Ruket Cakici, Desmond Elliott, Aykut Erdem, Erkut Erdem, Nazli Ikizler-Cinbis, Frank Keller, Adrian Muscat, and Barbara Plank. 2016. Automatic description generation from images: A survey of models, datasets, and evaluation measures. Journal of Artificial Intelligence Research (JAIR) 55 (2016), 409–442.
- Kushal Kafle and Christopher Kanan (2017) Visual Question Answering: Datasets, Algorithms, and Future Challenges
- Raffaella Bernardi and Sandro Pezzelle (2021) Linguistic issues behind visual question answering
- Srivastava, Y., Murali, V., Dubey, S. R., & Mukherjee, S. (2021). Visual question answering using deep learning: A survey and performance analysis. In S. Singh, P. Roy, B. Raman, & P. Nagabhushan (Eds.), Computer vision and image processing. CVIP 2020, volume 1377 of communications in computer and information science. Springer. (Pre-print)
- Chen, Lao and Duan (2020) Multimodal Fusion of Visual Dialog: A Survey
Papers for Reading Groups
- Michael J. Mayo (2003) Symbol Grounding and its Implications for Artificial Intelligence
- Lazaridou, Bruni and Baroni (2014) Is this a wampimuk? Cross-modal mapping between distributional semantics and the visual world. ACL 2014
- Douwe Kiela, Alexis Conneau, Allan Jabri, Maximilian Nickel (2018) Learning Visually Grounded Sentence Representations
- Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu, Yen-Chun Chen, Jingjing Liu Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models
- Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan Show and Tell: A Neural Image Caption Generator
- Will Monroe, Robert X.D. Hawkins, Noah D. Goodman and Christopher Potts Colors in Context: A Pragmatic Neural Model for Grounded Language Understanding
- Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., & Zhang, L. (2018). Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6077–6086).
- Jesse Thomason, Aishwarya Padmakumar, Jivko Sinapov, Nick Walker, Yuqian Jiang, Harel Yedidsion, Justin Hart, Peter Stone, Raymond J. Mooney (2019) Improving Grounded Natural Language Understanding through Human-Robot Dialog [pre-print]
Open Access Codes
Other interesting papers
- Goel, Ashok K. 2021. “Looking back, looking ahead: Symbolic versus connectionist AI.” AI Magazine 42: 83–85. https://doi.org/10.1609/aaai.12026
- Tadas Baltrusaitis, Chaitanya Ahuja, Louis-Philippe Morency Multimodal Machine Learning: A Survey and Taxonomy IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 41, Issue 2. 2019. Video by Louis-Philippe Morency
- Laura Ruis, Jacob Andreas, Marco Baroni, Diane Bouchacourt, Brenden M. Lake A Benchmark for Systematic Generalization in Grounded Language Understanding
- Kexin Yi, Jiajun Wu, Chuang Gan, Antonio Torralba, Pushmeet Kohli, and Josh Tenenbaum. (2018) Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding , NeurIPS 2018
- Yonatan Bisk, Ari Holtzman, Jesse Thomason, Jacob Andreas, Yoshua Bengio, Joyce Chai, Mirella Lapata, Angeliki Lazaridou, Jonathan May, Aleksandr Nisnevich, Nicolas Pinto, Joseph Turian (2020) Experience Grounds Language
- Felix Hill, Stephen Clark, Karl Moritz Hermann, Phil Blunsom Understanding Early Word Learning in Situated Artificial Agents
- Leonardo Fernandino, Jeffrey R. Binder, Rutvik H. Desai, Suzanne L. Pendl, Colin J. Humphries, William L. Gross, Lisa L. Conant, Mark S. Seidenberg Concept Representation Reflects Multimodal Abstraction: A Framework for Embodied Semantics
- Emmanuel Dupoux Cognitive Science in the era of Artificial Intelligence: A roadmap for reverse-engineering the infant language-learner
- Armand S. Rotaru, Gabriella Vigliocco (2020) Constructing Semantic Models From Words, Images, and Emojis
- Gabriella Vigliocco, Lotte Meteyard, Mark Andrews, Stavroula Kousta (2009) Toward a theory of semantic representation
- L. Smith and M. Gasser. The development of embodied cognition: Six lessons from babies. Artificial life, 11(1-2):13–29, 2005.
- Harnad, S. (1994) Computation Is Just Interpretable Symbol Manipulation: Cognition Isn't.
- Gabriella Vigliocco, Pamela Perniss, David Vinson (2014) Language as a multimodal phenomenon: implications for language learning, processing and evolution
- Vogt, Paul. "Language evolution and robotics: issues on symbol grounding and language acquisition." Artificial cognition systems. IGI Global, 2007. 176–209.
- Michael F. Bonner and Russell A. Epstein Object representations in the human brain reflect the co-occurrence statistics of vision and language. Nature Communications.
Last modified: Tue Dec 6 14:21:05 CET 2022