How to Do Anything with AI | MIT How to AI (Almost) Anything, Spring 2025
https://www.bilibili.com/video/BV1agH8zCE1V/?spm_id_from=333.1387.upload.video_card.click&vd_source=0645a76390602d5640c372c2f44d99e1
Lecturer: https://pliang279.github.io/

https://mit-mi.github.io/how2ai-course/spring2025/schedule/

Research Project
Research Projects on New Modalities
Motivation: Many tasks of real-world impact go beyond image and text.
Challenges:
- AI for modalities where non-deep-learning methods are effective (e.g., tabular, time-series)
- Multimodal deep learning + time-series analysis + tabular models
- AI for physiological sensing, IoT sensing in cities, climate and environment sensing
- Smell, taste, art, music, tangible and embodied systems
Potential models and datasets to start with
- Brain EEG Signal: https://arxiv.org/abs/2306.16934
- Speech: https://arxiv.org/pdf/2310.02050.pdf
- Facial Motion: https://arxiv.org/abs/2308.10897
- Tactile: https://arxiv.org/pdf/2204.00117.pdf
Research Projects on AI Reasoning
Motivation: Robust, reliable, interpretable reasoning in (multimodal) LLMs.
Challenges:
- Fine-grained and compositional reasoning
- Neuro-symbolic reasoning
- Emergent reasoning in foundation models
Potential models and datasets to start with
- Can LLMs actually reason and plan?
- Code for VQA:
CodeVQA: https://arxiv.org/pdf/2306.05392.pdf,
VisProg: https://prior.allenai.org/projects/visprog,
Viper: https://viper.cs.columbia.edu/
- Cola: https://openreview.net/pdf?id=kdHpWogtX6Y
- NLVR2: https://arxiv.org/abs/1811.00491
- Reference games: https://mcgill-nlp.github.io/imagecode/,
https://github.com/Alab-NII/onecommon,
https://dmg-photobook.github.io/
Research Projects on Interactive Agents
Motivation: Grounding AI models in the web, computer, or other virtual worlds to help humans with digital tasks.
Challenges:
- Web visual understanding is quite different from natural image understanding
- Instructions and language grounded in web images, tools, APIs
- Asking for human clarification, human-in-the-loop
- Search over environment and planning
Potential models and datasets to start with
- WebArena: https://arxiv.org/pdf/2307.13854.pdf
- AgentBench: https://arxiv.org/pdf/2308.03688.pdf
- ToolFormer: https://arxiv.org/abs/2302.04761
- SeeAct: https://osu-nlp-group.github.io/SeeAct/
Research Projects on Embodied and Tangible AI
Motivation: Building tangible and embodied AI systems that help humans in physical tasks.
Challenges:
- Perception, reasoning, and interaction
- Connecting sensing and actuation
- Efficient models that can run on hardware
- Understanding the influence of actions on the world (world model)
Potential models and datasets to start with
- Virtual Home: http://virtual-home.org/paper/virtualhome.pdf
- Habitat 3.0: https://ai.meta.com/static-resource/habitat3
- RoboThor: https://ai2thor.allenai.org/robothor
- LangSuite-E: https://github.com/bigai-nlco/langsuite
- Language models and world models: https://arxiv.org/pdf/2305.10626.pdf
Research Projects on Socially Intelligent AI
Motivation: Building AI that can understand and interact with humans in social situations.
Challenges:
- Social interaction, reasoning, and commonsense.
- Building social relationships over months and years.
- Theory-of-Mind and multi-party social interactions.
Potential models and datasets to start with
- Multimodal WereWolf: https://persuasion-deductiongame.socialai-data.org/
- Ego4D: https://arxiv.org/abs/2110.07058
- MMToM-QA: https://openreview.net/pdf?id=ibLM1yvxaL
- 11866 Artificial Social Intelligence: https://cmu-multicomp-lab.github.io/asi-course/spring2023/
Research Projects on Human-AI Interaction
Motivation: What is the right medium for human-AI interaction? How can we really trust AI? How do we enable collaboration and synergy?
Challenges:
- Modeling and conveying model uncertainty: text input uncertainty, visual uncertainty, multimodal uncertainty, cross-modal interaction uncertainty?
- Asking for human clarification, human-in-the-loop, types of human feedback and ways to learn from human feedback through all modalities.
- New mediums to interact with AI. New tasks beyond imitating humans, leading to collaboration.
Potential models and datasets to start with
- MMHal-Bench: https://arxiv.org/pdf/2309.14525.pdf (aligning multimodal LLMs)
- HACL: https://arxiv.org/pdf/2312.06968.pdf (hallucination + LLM)
Research Projects on Ethics and Safety
Motivation: Large AI models can emit unsafe text content and generate or retrieve biased images.
Challenges:
- Taxonomizing types of biases: text, vision, audio, generation, etc.
- Tracing biases to pretraining data, seeing how bias can be amplified during training and fine-tuning.
- New ways of mitigating biases and aligning to human preferences.
Potential models and datasets to start with
- Many works on fairness in LLMs -> how to extend to multimodal?
- Mitigating bias in text generation, image-captioning, image generation
How Do We Get Research Ideas?
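1 Bottom-up
Turn a concrete understanding of existing research's failings into a higher-level experimental question.
That is:
- First, understand in depth where existing research falls short, has gaps, or fails
- Then abstract that concrete problem into a higher-level experimental question
In other words: start from "what others did not do well" and distill "a more fundamental scientific question".
This is typical problem-driven research generation.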
Bottom-up discovery
Going "from details to theory":
- Start from concrete experimental results, local problems, and technical bottlenecks
- Abstract upward step by step
- Form a new research direction
Great tool for incremental progress, but may preclude larger leaps
The risk is always patching the existing system rather than redefining the problem framing.
2 Top-down
Move from a higher-level question to a lower-level concrete testing of that question.
- First, pose a broad, high-level research question
- Then break it down into concrete, actionable experiments or ways to validate it
This is theory-driven or theme-driven research.
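Favors bigger ideas, but can be disconnected from reality
- Tends to produce big questions
- Can lead to breakthrough research
- More strategic and systematic
However, it may be detached from reality, lack feasibility, and sometimes stay at the conceptual level.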
Bottom-up = problem-solving mindset
Top-down = question-asking mindset
Strong researchers usually need to switch between the two modes.
