3D视觉深度学习教程 3D Deep Learning Tutorial from SU Lab UC San Diego 2020(数据表示、常见任务与分析方法)
https://www.youtube.com/watch?v=vfL6uJYFrp4
3D 深度学习教程详细笔记 (Deep Learning for 3D Data)
第一部分:引言与数据基础 (Introduction & Fundamentals)
3D 学习的必要性
- 应用场景: 我们生活在3D世界中,3D数据对于机器人感知环境、增强现实(AR)、自动驾驶(检测车辆和行人)、医疗影像(CT/MRI)等领域至关重要。
- 核心挑战: 3D数据是非结构化的,这与2D图像的规则网格(Rasterized grid)不同,因此需要设计专门的深度学习架构。
传统视觉 vs. 3D深度学习
- 传统3D视觉 (Traditional 3D Vision): 主要基于多视图几何(Multi-view Geometry)。假设从不同视点观察物体,通过寻找对应点和解方程来恢复3D几何结构。这种方法植根于成像过程的逆过程。
- 基于学习的方法 (Learning-based): 类似于人类通过经验“脑补”空间结构。输入可能是2D的,但通过训练获得的“先验知识(Priors)”,网络能够推断出3D结构,而不需要显式地解几何方程。
3D 数据表示形式 (Data Representations)
不同的应用决定了不同的数据表示,也决定了网络架构的设计:
- 点云 (Point Cloud): 激光雷达(Lidar)的原始输出,一组(x, y, z)坐标的集合。
- 网格 (Mesh): 图形学常用,包含点(Vertices)和面(Faces),描述了拓扑连接关系。
- 体素 (Voxels): 类似于3D像素(Pixels),是对空间的规则离散化。
- 数据集:
- 物体级: ShapeNet, ModelNet, PartNet (带有点级别的部件标签)。
- 室内场景: ScanNet (RGB-D扫描重建)。
- 室外场景: KITTI, Waymo (激光雷达数据,自动驾驶标杆)。
第二部分:3D 分类任务 (3D Classification)
1. 多视图方法 (Multi-view Representation)
- 原理: 在物体周围放置多个虚拟相机,拍摄多张2D图像,分别用成熟的2D CNN提取特征,最后聚合(View Pooling)。
- 优缺点: 能够利用ImageNet预训练的强大特征,性能很好(约90%准确率)。但投影过程会丢失几何信息,且难以处理遮挡和不完整的输入。
2. 体素化方法 (Volumetric/Voxelization)
- 原理: 将3D空间划分为小立方体网格,使用3D卷积(3D CNN)。
- 计算复杂度问题: 传统的密集3D卷积在计算量和内存上随分辨率呈立方级增长(\(O(N^3)\)),导致只能处理低分辨率(如30x30x30)。
- 解决方案 - 八叉树 (Octree):
- 利用3D数据的稀疏性(物体只占据表面,大部分空间是空的)。
- 使用八叉树结构(如 O-CNN, Sparse ConvNet)仅在有数据的叶子节点进行计算,可将分辨率扩展到256甚至更高。
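下面用一个小数值实验演示"信息集中在表面"这一稀疏性论断(假设物体为单位球面,采样点数与分辨率均为随意取值):占据体素数随分辨率大致按平方增长,而总体素数按立方增长,因此占据比例随分辨率升高而下降,八叉树只需细分表面附近的节点。

```python
import numpy as np

def occupancy_fraction(n, num_points=200_000):
    """把单位球面上的采样点体素化到 n^3 网格, 返回被占据体素的比例。"""
    rng = np.random.default_rng(0)
    v = rng.normal(size=(num_points, 3))
    v /= np.linalg.norm(v, axis=1, keepdims=True)           # 球面上的点
    idx = np.clip(((v + 1) / 2 * n).astype(int), 0, n - 1)  # 坐标 -> 体素下标
    occ = np.zeros((n, n, n), dtype=bool)
    occ[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return occ.sum() / n**3

for n in (16, 32, 64, 128):
    print(n, round(occupancy_fraction(n), 4))   # 占据比例大致按 1/n 下降
```

占据体素数 ~ 表面积 ~ n²,总数为 n³,这正是稀疏卷积/八叉树能把分辨率推到 256³ 量级的原因。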
3. 点云方法 (Point Cloud Processing)
这是目前最主流的研究方向之一,直接处理原始点云数据。
- 核心挑战: 置换不变性 (Permutation Invariance)。点云是一组点的集合,点的存储顺序改变不应影响分类结果。
- PointNet (2017):
- 设计: 对每个点独立使用MLP提取特征,然后通过对称函数(如 Max Pooling)聚合全局特征。
- 解释: 数学上证明了该结构可在任意精度下逼近任何连续的对称(集合)函数;可视化显示它能捕捉物体的轮廓/骨架。
- PointNet++:
- 改进: PointNet缺乏局部特征(Locality)。PointNet++ 引入了分层结构:采样中心点 -> 寻找邻域(球查询或KNN) -> 局部应用PointNet。这类似于CNN的卷积核感受野。
- 图卷积与连续卷积 (Graph & Continuous Conv):
- 将点云视为图结构。
- 连续卷积: 考虑到点是从连续曲面采样的,设计卷积核时结合密度估计(Density Estimation),使卷积具有几何感知能力,并能处理非均匀采样。
4. 谱分析方法 (Spectral Methods)
- 目标: 实现等距不变性 (Isometric Invariance)(即物体发生非刚性形变,如人体弯曲,识别结果不变)。
- Spherical CNN: 将数据投影到球面上进行卷积,对旋转具有等变/不变性。
第三部分:3D 分割与检测 (Segmentation & Detection)
1. 3D 语义分割 (Semantic Segmentation)
- 任务: 预测点云中每个点的语义标签(如椅子、地板)。
- 解码器 (Decoder): 由于PointNet++等网络会对点云进行降采样,分割需要上采样恢复分辨率。通常使用插值法(根据距离加权平均)将特征传播回原始点。
- 多模态融合 (3DMV / Multi-view fusion):
- 纯3D几何有时难以区分平面物体(如门 vs. 浴帘)。
- 结合2D图像的纹理信息和3D几何信息(通过反投影 Back-projection),能显著提高分割精度。
2. 3D 目标检测 (Instance-level Understanding)
任务是输出3D边界框(Bounding Box)和类别。
- 自顶向下方法 (Top-down / Proposal-based):
- Frustum PointNet: 先用2D检测器在图像上找到物体,将该区域反投影形成3D视锥(Frustum),再在视锥内进行点云分割和边框回归。优点是利用了成熟的2D检测。
- 自底向上方法 (Bottom-up):
- SGPN / JSIS3D: 学习点的特征嵌入(Embedding),使属于同一实例的点在特征空间距离更近,然后聚类。
- 基于投票的方法 (Voting-based):
- VoteNet: 受Hough Voting启发。点云中的表面点“投票”预测物体的中心点。解决物体中心通常是空的(没有点)的问题。先聚类投票点,再回归边框。
- 鸟瞰图方法 (BEV - Bird's Eye View):
- 自动驾驶(如Lidar数据)常用。将点云压缩到地面平面(Feature map),使用2D CNN处理。
- PointPillars / SECOND / Continuous Fusion: 高效处理大规模室外场景。
3. 少样本与零样本学习 (Few-shot/Zero-shot)
- 优势: 3D数据比2D更适合少样本学习,因为局部几何受视角变化和光照的影响较小,更易建立对应。
- 结构归纳 (Structure Induction): 通过观察物体部件的重复和运动(如门的开关),推断物体的结构,从而识别未见过的类别。
第四部分:3D 生成与重建 (Generation & Reconstruction)
评价指标 (Metrics)
- Chamfer Distance (CD): 计算A中每个点到B中最近点的距离平均值(双向)。
- Earth Mover's Distance (EMD): 计算将点集A变形为点集B所需的最小“搬运成本”,对点密度分布更敏感。
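作为参考,下面给出 Chamfer Distance 的一个极简 numpy 实现(示意用:此处取欧氏距离的双向最近邻平均;也有代码库采用平方距离,具体以所用实现为准):

```python
import numpy as np

def chamfer_distance(a, b):
    """点集 a (N,3) 与 b (M,3) 的对称 Chamfer 距离:
    a 中每点到 b 的最近邻距离取平均, 再加上反方向的同样一项。"""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) 两两距离
    return d.min(axis=1).mean() + d.min(axis=0).mean()

a = np.array([[0., 0., 0.], [1., 0., 0.]])
b = np.array([[0., 0., 0.], [1., 0., 0.], [1., 1., 0.]])
print(chamfer_distance(a, a))  # 0.0
print(chamfer_distance(a, b))  # 0 + 1/3 ≈ 0.333: b 中多出的点只影响 b->a 方向
```

EMD 则需要在两个点集之间求最优一一匹配(如匈牙利算法),因此对点密度分布更敏感,这里不展开实现。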
1. 生成模型 (Generative Models)
目标是从隐向量或图像生成3D形状。
- 体素生成: 如 Octree Generating Net (OGN),由粗到细生成八叉树结构。
- 点云生成: Point Set Generation Net (PSGN),使用全连接层直接预测点坐标。通常使用Chamfer Distance作为损失函数。
- 基于图元 (Primitive-based): 将物体分解为简单的几何体(如平面、长方体)组合。
- 网格变形 (Mesh Deformation): 从一个模板(如球体)开始,预测顶点的偏移量。缺点是无法改变拓扑结构(如不能把球变成甜甜圈)。
- 隐式函数 (Implicit Functions - DeepSDF):
- 不直接生成点或网格,而是学习一个函数 \(f(x,y,z) = s\),表示点到表面的距离(SDF)。
- 优点: 可表示无限分辨率,拓扑灵活。最后通过Marching Cubes算法提取等值面得到网格。
- 结构化生成 (StructureNet): 利用图神经网络(Graph NN)生成物体的层级部件结构,不仅仅是生成几何,还生成部件关系。
2. 网格重建 (Mesh Reconstruction)
- 从点云重建高质量网格。
- 方法: 先构建候选三角形图,利用神经网络判断哪些三角形是真实的表面(基于Intrinsic-Extrinsic Ratio特征),去除错误的连接。
3. 对抗生成网络 (3D GANs)
- 难点: 判别器(Discriminator,如PointNet)通常比生成器强大太多,导致训练不稳定。
- PC-GAN / Tree-GAN: 探索如何让GAN在点云上稳定训练。
4. 多视图立体视觉 (Multi-view Stereo - MVS)
从多张已知参数的图片重建稠密3D模型。
- SurfaceNet: 构建“彩色体素立方体”(Colored Voxel Cubes),将相机参数编码进体素,用3D CNN预测体素占用率。
- Recurrent MVS (R-MVSNet): 为了解决3D CNN显存消耗大的问题,使用RNN沿深度方向逐层正则化代价体(Cost Volume),实现了高分辨率重建。
- Point-based MVS (Point-MVSNet, 含 PointFlow 模块):
- 先生成粗糙深度图转为点云。
- 迭代优化: 使用点流模块(Point Flow Module)预测点的位移残差,并进行上采样(分裂点),从而由粗到细地优化出高精度表面。这种方法比体素方法更节省内存且精度更高。
原文文稿详细整理
Part 0|Opening & Motivation
Intro(引言)
Deep learning for 3D data is introduced as a “new and interesting field.” The talk motivates that we live in a 3D world (objects, houses, environment), and argues that without the ability to understand 3D environments, it is hard to achieve intelligence.
介绍“面向 3D 数据的深度学习”这一新领域。强调我们生活在三维世界中(物体、房屋、环境),并提出:如果不能理解三维环境,就难以实现更高层次的智能。
Broad Applications of 3D data(3D 数据的广泛应用)
3D data is positioned as broadly useful: robotics (robots perceive the environment), augmented reality (perceive environment and present to viewer), autonomous driving (3D sensors detect cars and pedestrians), and medical image processing (CT/MRI capture 3D medical data for downstream processing).
指出 3D 数据应用广泛:机器人(感知环境)、增强现实(理解环境并呈现给用户)、自动驾驶(3D 传感器检测车辆与行人)、医疗影像(CT/MRI 获取三维数据并进行处理)。
Part 1|Traditional 3D Vision vs Learning-based 3D + Representation & Datasets
Part 1|传统 3D 视觉 vs 学习式 3D;表示挑战与数据集
Traditional 3D Vision(传统 3D 视觉)
Traditional 3D vision is described as primarily multi-view geometry: given multiple images from different viewpoints, find correspondences across views, solve a reconstruction/estimation problem to recover 3D geometry. The procedure is “deeply rooted in the imaging formation process,” and the mathematical tools are based on constraint equations.
传统 3D 视觉以多视图几何为核心:从多个视角图像中寻找跨视角对应关系,求解几何估计问题以恢复三维结构。流程深深依赖成像过程与显式几何约束(方程)。
3D Learning: Knowledge Based(3D 学习:基于知识/先验)
The learning-based philosophy is motivated with a 2D chalk drawing that evokes a 3D space in our perception: we do not solve explicit equations, but rely on priors encoded from past experiences. The goal is to learn such data priors via training and impose them on new data for 3D understanding, analysis, or synthesis. The field is described as interdisciplinary (geometry, topology, functional analysis, plus vision/graphics/robotics) and emerging around ~2015.
学习式路线用“街头粉笔画”的例子说明:输入只是 2D,但人会在脑中“补全”3D 空间,因为经验中积累了先验;不是靠硬约束方程,而是靠 learned priors。深度学习要通过训练获得这种先验,并将其施加到新数据上,实现 3D 的理解、分析或生成。该领域跨学科(几何、拓扑、函数分析,以及视觉/图形学/机器人),并提到大约在 2015 年左右兴起。
Instructor Team(讲师团队)
The tutorial is delivered by multiple instructors, and is divided into major parts: 3D data & representation challenges; 3D classification; segmentation & detection; and 3D synthesis (reconstruction and generative literature).
由多位讲者共同授课,内容分为几块:3D 数据与表示挑战、3D 分类、分割与检测、以及 3D 合成(重建与生成方向文献脉络)。
Schedule(日程安排)
A session timeline is mentioned (start now, end around 2:20 with a break). This serves as the framing before the technical content.
说明课程时间线(开始、结束与中间休息),作为进入技术内容前的组织信息。
The Representation Challenge of 3D Deep Learning(3D 深度学习的表示挑战)
Deep learning works naturally on regular/rasterized representations (1D/2D arrays for audio/images). In contrast, 3D data comes with structures that are not as regular, and different applications use different representations, requiring tailored architectures. Examples emphasized: point clouds (from 3D sensors) and meshes (with neighborhood/connectivity), leading to the question of defining learning on graphs/sets beyond standard image processing.
深度学习擅长处理规则栅格数据(音频 1D、图像 2D 数组)。但 3D 数据通常不规则,不同应用对应不同表示,算法往往需要“因表示定制”。重点提到点云(来自 3D 传感器)与网格(包含邻接关系/连通性),进一步引出:如何在图/集合等结构上定义深度学习,而非只在图像栅格上。
Datasets for 3D Objects(3D 物体数据集)
Synthetic object datasets are described: 3D models built by artists using modeling software; used for object-level learning. ShapeNet/ModelNet-like datasets are referenced as representative examples in spirit (the talk lists several synthetic datasets).
介绍合成物体数据集:由艺术家使用建模软件构建 3D 模型,用于物体级学习与评测。讲稿列举了多种类似 ShapeNet/ModelNet 的合成对象数据资源。
Datasets for 3D Object Parts(3D 物体部件数据集)
Part-level datasets are highlighted (e.g., PartNet is mentioned): objects annotated with detailed part information. Part annotations make objects “interactable/animatable,” going beyond object-level labels.
强调部件级数据集(讲稿提到 PartNet):包含精细部件标注。部件信息使对象不仅可分类,还更可交互/可动画化(结构层面的监督更强)。
Datasets for Indoor 3D Scenes(室内 3D 场景数据集)
Two types are contrasted: synthetic indoor scenes with realistic object layouts and high-quality material/renderer outputs, versus real scanned indoor scenes (e.g., ScanNet-like) built via RGB-D scanning and fusion into meshes; real geometry/materials may be less detailed.
室内场景对比两类:合成室内场景(自然布局、高质量材质与渲染,接近真实照片),以及真实扫描室内场景(通过 RGB-D 扫描并融合成网格;物体真实但材质细节往往不如合成精细)。
Part 2|Task: 3D Classification (Families of Methods)
Part 2|任务:3D 分类(表示 → 模型家族对比)
Task: 3D Classification(任务:3D 分类)
Classification is introduced as a simple task: given an input 3D representation, assign a semantic label (chair, etc.). The speaker notes it is relatively mature with many methods and strong baselines.
将分类作为入门任务:输入 3D 数据,输出语义类别标签(如 chair)。讲者认为该方向相对成熟,方法众多、基线强。
Experiments - Classification & Retrieval(实验:分类与检索)
Multi-view methods are noted to perform strongly (e.g., around ~90% accuracy on common benchmarks is mentioned in the script), sometimes surprisingly competitive versus “native” 3D approaches, partly because they leverage the large 2D CNN literature and pretraining.
讲稿强调多视图方法在分类/检索基准上效果很强(脚本提到在常见基准上可达约 90% 精度的量级),并能借助成熟的 2D CNN 体系与预训练模型,因此即使不是“原生 3D”,也常具竞争力。
Part 3|Multi-view → Voxel/Volumetric → Octree/Sparse
Part 3|多视图、体素与复杂度;稀疏体素/八叉树
(Multi-view family) Multi-view representation(多视图表示方法家族)
A 3D object is rendered/projected from multiple camera viewpoints into multiple 2D images (RGB and possibly depth). Each view is processed independently by a 2D CNN, and features are aggregated (often via pooling) into a single object feature for classification/segmentation. Advantages: reuse ImageNet pretraining and 2D recognition advances. Disadvantages: requires projection and camera selection; projection can break symmetries; robustness to noisy/incomplete input and invariances requires extra handling.
将 3D 物体从多个视角投影为多张 2D 图(RGB/深度),每张图用 2D CNN 单独提特征,再聚合(常见为 pooling)得到物体级特征用于分类/分割。优点:可复用 ImageNet 预训练与 2D 视觉成果。缺点:需要投影与选相机位姿;投影会破坏对称性;对噪声/缺失与旋转平移不变性往往需要额外设计。
(Volumetric family) Volumetric CNN(体素/体积方法家族)
Voxelization discretizes 3D space into a grid of cuboids (voxels). Store occupancy (binary indicator) or similar per-voxel signals. Extend 2D convolution to 3D by using 3D kernels sliding over a 3D grid. The central bottleneck is cubic scaling of computation and memory with resolution; early works used small grids (e.g., ~30³), causing information loss and artifacts.
体素化把空间离散成规则体素网格,每个体素记录占据信息(如二值占据)等信号。然后将 2D 卷积直接推广为 3D 卷积核在 3D 网格上滑动。核心瓶颈是计算/内存随分辨率呈立方增长;早期往往只能用较小分辨率(如约 30³ 量级),带来信息损失与伪影。
Complexity Issue(复杂度问题)
The script explicitly stresses “high complexity” (storage + computation) for dense voxels, proportional to resolution cubed, limiting practicality.
脚本明确强调稠密体素的存储与计算复杂度高,随分辨率立方增长,从而限制了可用性。
Idea 1: Learn to Project(想法 1:学习式投影)
One idea is to avoid isotropic 3D kernels by treating one spatial dimension as channels, turning volumetric data into a 2D-like input for standard CNNs (the talk relates it to BEV-like approaches). This is widely used in practice (e.g., autonomous driving), but is less principled because it requires a canonical orientation; different viewing directions can cause ambiguity in how to assign the channel dimension.
一种思路是不做各向同性 3D 卷积核,而把某一空间维度当作通道,将 3D 转为“类 2D”的输入以使用标准 CNN(讲稿将其与 BEV 思路类比,并提到在自动驾驶中常见)。但它不够“原则化”,因为需要规范朝向:不同观察方向下通道维的定义会变得不明确。
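The "one spatial dimension as channels" trick is just a transpose; the sketch below (with arbitrary grid sizes) shows how a dense volume becomes a 2D-CNN-ready input, and why the result depends on which axis is declared "up":

```python
import numpy as np

# Dense occupancy volume with axes (D=height, H, W); sizes are arbitrary here.
rng = np.random.default_rng(0)
vol = (rng.uniform(size=(32, 128, 128)) > 0.99).astype(np.float32)

# Treat the height axis as channels: an H x W "image" with D channels,
# directly consumable by a standard 2D CNN (BEV-style processing).
bev_image = np.moveaxis(vol, 0, -1)
print(bev_image.shape)  # (128, 128, 32)

# The ambiguity the talk mentions: declaring a different axis as "up"
# yields a different (and generally incompatible) channel layout.
side_image = np.moveaxis(vol, 1, -1)
print(side_image.shape)  # (32, 128, 128)
```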
Octree: Recursively Partition the Space(八叉树:递归划分空间)
The talk argues most information lies on surfaces: as voxel resolution increases, occupancy becomes sparse. Use an octree to recursively subdivide only occupied/interesting regions; store data in a hierarchical tree; neighborhood lookup may use hashing. This enables higher effective resolutions (the script mentions reaching ~256³ scale) compared to dense voxel grids.
讲稿指出信息主要在表面上:分辨率越高,体素占据越稀疏。用八叉树递归划分空间,只在与表面相交/有信息的区域继续细分;树结构存储数据,邻域索引常用哈希等。这样可把有效分辨率扩展到更高(脚本提到可到约 256³ 量级),远超稠密体素。
Memory Efficiency(内存效率)
Sparse/octree representations dramatically improve memory efficiency and allow finer geometric detail.
稀疏/八叉树表示显著提升内存效率,从而能处理更细的几何细节。
Implementation(实现/工程实现)
Engineering is described as complex (neighbor indexing, hashing). The script cites well-engineered sparse convolution systems as strong baselines.
讲稿强调工程实现较复杂(邻域索引、哈希结构等),但也指出现有稀疏卷积系统工程化成熟、性能强,可作为重要基线。
Part 4|Native Point Cloud Learning: PointNet → PointNet++ → Geometry-aware Point Convolution
Part 4|点云原生学习:PointNet、PointNet++ 与“几何感知”点卷积
Directly Process Point Cloud Data(直接处理点云数据)
Since sensors (especially LiDAR) output point sets directly, the talk motivates networks that avoid projection/voxelization and instead operate natively on point clouds (a set of XYZ coordinates).
由于传感器(尤其 LiDAR)直接输出点集,讲稿强调希望避免投影或体素化,直接在点云(XYZ 坐标集合)上进行学习。
Permutation invariance(置换不变性)
If points are stored in GPU memory as an array (rows=points, columns=XYZ), permuting rows should not change the output. This leads to the requirement of permutation-invariant architectures.
点云在显存里常以二维数组存储(行是点、列是 XYZ 等维度),若交换点的顺序,网络输出应不变,因此需要置换不变的结构设计。
Construct a Symmetric Function(构造对称函数)
PointNet is described: apply a small neural network (shared MLP) independently to each point to get per-point embeddings; then use a symmetric aggregation function (max pooling) to combine them into a single global feature vector; then apply another network for classification/segmentation. The script states a theoretical result: such architecture can be proven to be a universal approximator for symmetric functions. Visualizations suggest it captures a “skeleton/silhouette” of shapes.
PointNet 的核心:对每个点使用共享 MLP 独立编码,得到高维点特征;再用对称聚合(如 max pooling)汇聚为全局特征向量;最后用后续网络完成分类/分割。脚本提到其可被严格证明具有对称函数的通用逼近能力。可视化显示网络倾向抓住形状的骨架/轮廓信息。
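The construction can be sketched in a few lines of numpy; the weights here are random stand-ins for a trained shared MLP, and the check at the end confirms that max pooling makes the global feature permutation-invariant:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 64)), np.zeros(64)     # shared MLP, layer 1
W2, b2 = rng.normal(size=(64, 128)), np.zeros(128)  # shared MLP, layer 2

def pointnet_global_feature(points):
    """points: (N, 3). Encode each point independently with the shared MLP,
    then aggregate with max pooling (a symmetric function)."""
    h = np.maximum(points @ W1 + b1, 0.0)
    h = np.maximum(h @ W2 + b2, 0.0)
    return h.max(axis=0)                            # (128,) global feature

pts = rng.normal(size=(1024, 3))
f1 = pointnet_global_feature(pts)
f2 = pointnet_global_feature(pts[rng.permutation(len(pts))])
print(np.allclose(f1, f2))  # True: reordering the points changes nothing
```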
Limitations of PointNet(PointNet 的局限)
The limitation is weak local pattern modeling: it respects permutation invariance but does not explicitly encode local neighborhoods, which hurts capturing translation-invariant local structures.
局限在于局部模式建模弱:虽然满足置换不变性,但缺少对局部邻域的显式建模,不利于捕捉平移不变的局部结构。
Points in Metric Space(度量空间中的点)
Points are embedded in a metric space; neighborhoods can be defined via ball queries (radius-based) or k-nearest neighbors (kNN). Once topology is defined, networks can incorporate locality.
点位于度量空间中;可以用球查询(半径邻域)或 kNN 定义邻域。有了拓扑结构后,就能在网络里引入局部性。
(PointNet++) Hierarchical local abstraction(PointNet++:层级局部抽象)
PointNet++ is described as multi-scale: sample a subset of points as anchors; for each anchor, group local neighbors; apply PointNet locally to encode each neighborhood into a feature; obtain a sparser point set with richer features; repeat the process to build hierarchical representations. This improves handling of translation/local patterns.
PointNet++ 采用多尺度层级结构:先采样一部分点作为 anchor;对每个 anchor 聚合局部邻域;在邻域内运行“局部 PointNet”得到局部特征;这样得到更稀疏但特征更高维的点集;再重复“采样—分组—局部抽象”,形成层级表示,从而更好处理平移与局部结构。
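The sample-and-group step can be sketched as follows — a minimal version with greedy farthest point sampling for anchors and a radius-based ball query per anchor (radius and counts are arbitrary choices here):

```python
import numpy as np

def farthest_point_sampling(points, k):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set."""
    chosen = [0]
    d = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        i = int(d.argmax())
        chosen.append(i)
        d = np.minimum(d, np.linalg.norm(points - points[i], axis=1))
    return np.array(chosen)

def ball_query(points, center, radius, max_k=32):
    """Indices of up to max_k points within `radius` of `center`."""
    d = np.linalg.norm(points - center, axis=1)
    return np.flatnonzero(d < radius)[:max_k]

rng = np.random.default_rng(0)
pts = rng.uniform(size=(2048, 3))
anchors = farthest_point_sampling(pts, 128)                  # sparser anchor set
groups = [ball_query(pts, pts[i], radius=0.15) for i in anchors]
print(len(anchors), min(len(g) for g in groups))
# Each group would then be encoded by a small local PointNet, and the whole
# sample-group-encode step repeated to build the hierarchy.
```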
Point Convolution As Graph Convolution(点卷积作为图卷积)
With points as nodes and neighborhood relations as edges, many point-based operations resemble graph convolution; PointNet++ can be interpreted under this lens.
把点看作节点、邻域看作边,点云学习中的局部聚合与图卷积高度相关;PointNet++ 可视作一种图卷积式的层级特征抽象。
Mathematically Proper Conv. Discretization(数学上严格的卷积离散化)
The talk argues generic graph convolution may not be geometry-aware: points are samples from an underlying continuous surface; features should ideally be sampling-invariant. A more principled approach starts from continuous convolution and derives an empirical discretization over point samples, often incorporating density estimation.
讲稿指出通用图卷积不一定“几何感知”:点云是从连续表面采样而来,理想特征应对采样方式更不敏感。更原则的路线是从连续卷积出发,在点采样上推导经验离散形式,并结合密度估计等思想。
Interpolated Kernel for Convolution(用于卷积的插值核)
Kernel design: define kernel values on a set of regularly arranged support points; for arbitrary spatial positions, interpolate kernel values (via kernel density estimation-like interpolation). This yields a continuous kernel evaluable at sampled points. The script also notes deformable convolution: because the kernel is a set of points, you can deform their positions to adapt to shape deformations (animals/humans), improving local modeling flexibility.
核的构造方式:在规则排列的一组支撑点上定义核值;对任意空间位置通过插值(类密度估计插值)得到核值,从而获得可在采样点处求值的连续核。脚本还强调可形变卷积:核由一组点定义,这些点的位置可学习形变,以适应动物/人体等可形变形状,增强局部表达灵活性。
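A sketch of the interpolated-kernel idea (the 3×3×3 support, the Gaussian interpolator, and the neighbor-count normalization are all illustrative choices, not the talk's exact formulation): kernel values live on a regular support grid, arbitrary offsets get interpolated values, and the normalization crudely stands in for a density correction:

```python
import numpy as np

rng = np.random.default_rng(0)
# Kernel values defined on a 3x3x3 grid of support points in [-1, 1]^3
support = np.stack(np.meshgrid(*[np.linspace(-1, 1, 3)] * 3, indexing="ij"),
                   axis=-1).reshape(-1, 3)               # (27, 3)
kernel_vals = rng.normal(size=(27,))                     # learnable in practice

def kernel_at(offsets, sigma=0.5):
    """Evaluate the continuous kernel at arbitrary offsets by RBF-interpolating
    the values stored at the support points (a KDE-style interpolation)."""
    d2 = ((offsets[:, None, :] - support[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2 * sigma**2))
    w /= w.sum(axis=1, keepdims=True)
    return w @ kernel_vals                               # (len(offsets),)

def point_conv(points, feats, center, radius=0.5):
    """One continuous-convolution output at `center`: kernel-weighted sum of
    neighbor features, normalized by neighbor count (a crude density fix)."""
    d = np.linalg.norm(points - center, axis=1)
    nbr = d < radius
    k = kernel_at((points[nbr] - center) / radius)       # offsets scaled to [-1, 1]
    return (k[:, None] * feats[nbr]).sum(0) / max(nbr.sum(), 1)

pts = rng.uniform(-1, 1, size=(500, 3))
feats = rng.normal(size=(500, 8))
out = point_conv(pts, feats, center=np.zeros(3))
print(out.shape)  # (8,)
```

Because the kernel is itself a set of support points, deforming their positions (as the script notes) is a natural extension.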
Part 5|Spectral / Isometric Invariance / Spherical CNN
Part 5|谱域/等距不变性与球面卷积
Recognition with Isometric Invariance?(具有等距不变性的识别?)
Isometric transformations are defined via geodesic distances on surfaces: under isometric deformation, shortest path distances between points on the surface remain unchanged (the script uses the example of hands on a body). The aim is to build networks robust to such deformations. Spectral methods do convolution in the spectral domain of manifold signals.
等距变换通过表面测地距离定义:形变前后,表面上两点最短路径距离保持不变(脚本用人体姿态变化、手之间距离的测地意义作类比)。目标是设计对等距形变稳健的网络。谱方法在流形信号的谱域中进行卷积。
(Spectral challenge) Basis computation and activation(谱方法难点:基函数与激活)
The script notes practical issues: you need eigen-decomposition of the Laplacian operator to obtain spectral bases; nonlinear activations are usually applied in spatial domain because “activation in spectral domain” is not straightforward (though some works attempt fully spectral pipelines).
脚本提到谱方法的实际困难:需要对拉普拉斯算子做特征分解得到谱基;非线性激活通常仍在空间域进行,因为在谱域直接做激活并不自然(尽管也有尝试“全谱域网络”的工作)。
Domain Synchronization(域同步/谱域对齐)
If shapes are not truly isometric, spectral bases across shapes are not aligned, making parameter sharing difficult. A fundamental fix is to relate spectral domains via functional maps (a linear mapping between bases).
若形状并非严格等距,不同形状的谱基可能不对齐,导致共享参数困难。讲稿给出的根本性思路是用 functional map 在不同谱域基之间建立线性映射,实现域同步/对齐。
Learned Filters(学习到的滤波器/球面谱方法)
A practical special case is to project signals onto a sphere: the sphere has a fixed geometry and thus a shared basis, avoiding cross-shape spectral alignment issues. This also enables fast transforms (FFT-like on the sphere) and yields rotation handling (rotation invariance/equivariance is highlighted).
一个更易处理的特例是将信号投影到球面:球面几何固定,因此谱基共享,从而避免跨形状的谱域对齐问题;还可利用球面上的快速变换加速计算,并强调对旋转具有良好处理能力(旋转不变/等变特性)。
Part 6|3D Semantic Segmentation + Multimodal Fusion
Part 6|语义分割:稀疏卷积解码与多模态融合
Task: 3D Semantic Segmentation(任务:3D 语义分割)
Semantic segmentation predicts a semantic label for each point in a point cloud (or each voxel in a voxel grid). Because encoders (PointNet++ / sparse conv nets) downsample, decoders are needed to upsample or interpolate features back to the original resolution.
语义分割是对点云中每个点(或体素网格中每个体素)预测语义类别。由于编码器(PointNet++、稀疏卷积网络等)通常会下采样,因此需要解码器把特征恢复到原始分辨率(上采样/插值)。
(Sparse conv decoding) Sparse convolution and deconvolution(稀疏卷积的解码方式)
For sparse voxel representations, 3D convolution/deconvolution is an extension of 2D; computation only happens at activated voxels. The script emphasizes that active output voxels depend on an “associated sparse structure,” and naive sparse conv needs careful definition of active sites.
对于稀疏体素表示,3D 卷积/反卷积可视为 2D 的直接扩展,但只在激活体素处计算。脚本强调输出激活体素由“关联的稀疏结构”决定,直接套用稀疏卷积需要仔细定义激活位置集合。
(Point decoding) kNN interpolation for point clouds(点云解码:kNN 插值)
For point clouds, a common decoder idea is interpolation: each high-resolution point’s feature is interpolated from its k-nearest neighbors in the lower-resolution set.
对点云,常用的解码方式是插值:高分辨率点的特征由低分辨率点集中其 kNN 邻居的特征插值获得。
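A minimal dense-from-sparse feature propagation step, assuming inverse-distance weights over the 3 nearest low-resolution points (one common choice; exact weighting varies by implementation):

```python
import numpy as np

def interpolate_features(dense_xyz, sparse_xyz, sparse_feat, k=3, eps=1e-8):
    """Propagate features from a downsampled set back to the dense set: each
    dense point takes an inverse-distance-weighted average of the features
    of its k nearest sparse points."""
    d = np.linalg.norm(dense_xyz[:, None, :] - sparse_xyz[None, :, :], axis=-1)
    knn = np.argsort(d, axis=1)[:, :k]                  # (N_dense, k) neighbors
    w = 1.0 / (np.take_along_axis(d, knn, axis=1) + eps)
    w /= w.sum(axis=1, keepdims=True)                   # normalize weights
    return (sparse_feat[knn] * w[..., None]).sum(axis=1)

rng = np.random.default_rng(0)
dense = rng.uniform(size=(1000, 3))
sparse = dense[::10]                  # pretend the encoder kept every 10th point
feat = rng.normal(size=(100, 32))     # features at the sparse points
up = interpolate_features(dense, sparse, feat)
print(up.shape)  # (1000, 32)
```

A dense point that coincides with a sparse point essentially copies that point's feature, since its inverse-distance weight dominates.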
Multimodal(多模态)
The script stresses that using both 2D and 3D information can improve segmentation. One example pipeline: extract 2D features with a 2D CNN from RGB images; back-project these features into 3D voxels using known camera geometry; aggregate multi-view features via voxel-wise max pooling to form a 3D feature volume; then run a 3D CNN combining multi-view features and geometric features. A second described approach: back-project 2D features to a 3D point cloud (not necessarily the target points), then for each target point, gather its neighborhood in the back-projected cloud and use a PointNet-based encoder; experiments suggest fusing 2D+3D features early is better than processing separately. The “shower curtain vs door” bathroom example illustrates that 2D appearance cues correct geometry-only confusion.
脚本强调 2D+3D 融合能显著提升分割。示例流程之一:RGB 图像用 2D CNN 提特征;依据相机几何把 2D 特征反投影到 3D 体素;对多视角特征做体素级 max pooling 得到单一 3D 特征体;再用 3D CNN 将多视图特征与几何特征结合。另一路线:把 2D 特征反投影到 3D 点云(这些点不一定就是要预测标签的目标点),对每个目标点在反投影点云中取邻域,用 PointNet 编码邻域特征;实验上更倾向于“早融合”2D+3D 特征而非分开处理。浴室场景中“浴帘被几何网络误判为门”的例子说明 2D 纹理/外观信息能纠正纯几何的混淆。
Part 7|Instance-level Understanding: Detection & Instance Segmentation
Part 7|实例级理解:检测与实例分割(Top-down / Bottom-up / BEV)
Task: Instance-level Understanding(任务:实例级理解)
The talk frames real-world needs beyond semantic labels: 3D object detection predicts 3D bounding boxes; 3D instance segmentation predicts per-point instance IDs; 3D part segmentation refines instances into parts.
指出真实任务往往需要实例级理解:3D 检测输出 3D 框;实例分割输出逐点实例 ID;部件分割进一步把实例细分为功能/结构部件。
Top-down Methods(自顶向下方法)
The top-down question is: “Does this point cloud contain an object?” Classical sliding window is expensive; two-stage pipelines generate a limited number of proposals then refine them, analogous to R-CNN. Box regression is described via anchors and offsets; orientation may be needed beyond axis-aligned boxes. Instance segmentation inside proposals can be reduced to foreground/background segmentation, but proposals may contain multiple instances, requiring separation.
自顶向下方法围绕“这片点云区域是否包含一个物体”。滑动窗口代价极高,因此常用两阶段:先生成少量 proposals,再做精炼(类似 R-CNN)。3D 框回归通过 anchor + 相对偏移实现,且常需预测朝向(不止轴对齐框)。proposal 内的实例分割可先转化为前景/背景分割,但 proposal 可能包含多个实例,还需进一步分离。
Sliding Shapes(滑动窗口/滑动形状)
Sliding windows scan many candidate regions, causing high computation cost and poor real-time suitability.
滑窗需要扫描大量候选区域,计算开销极大,难以实时部署。
Localization(定位/框回归定位)
The script explains parameterization: predefined anchors; regress offsets to target boxes; sometimes include box orientation estimation.
脚本解释了常见参数化:预定义 anchor;回归到目标框的相对偏移;并可能需要估计框的朝向。
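The anchor-plus-offset parameterization can be made concrete. The sketch below uses the VoxelNet/SECOND-style encoding — a common convention assumed here, not stated in the talk: xy offsets normalized by the anchor diagonal, z by its height, sizes regressed in log space, heading as an additive residual:

```python
import numpy as np

def decode_box(anchor, delta):
    """Decode a 3D box (x, y, z, l, w, h, theta) from an anchor and regressed
    offsets, using a VoxelNet/SECOND-style parameterization."""
    xa, ya, za, la, wa, ha, ta = anchor
    dx, dy, dz, dl, dw, dh, dt = delta
    diag = np.hypot(la, wa)                 # anchor diagonal normalizes xy offsets
    x, y = xa + dx * diag, ya + dy * diag
    z = za + dz * ha                        # z offset normalized by anchor height
    l, w, h = la * np.exp(dl), wa * np.exp(dw), ha * np.exp(dh)
    return np.array([x, y, z, l, w, h, ta + dt])

anchor = np.array([0., 0., -1., 3.9, 1.6, 1.56, 0.])   # a typical car-sized anchor
print(decode_box(anchor, np.zeros(7)))   # zero offsets reproduce the anchor
```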
Volumetric R-CNN(体素 R-CNN)
One described variant voxelizes (e.g., TSDF volume), predicts anchors/proposals in stage 1, then refines with combined 2D+3D networks, similar to a 3D Fast R-CNN.
讲稿描述了一类体素两阶段检测:用体素/TSDF 表示在第一阶段生成 anchors/proposals;第二阶段结合 2D 与 3D 网络细化预测,类比 3D 版 Fast R-CNN。
Stage 2: Coordinate Normalization(阶段 2:坐标归一化)
For frustum-based detection from RGB-D: generate 2D region proposals on RGB; extrude to a 3D viewing frustum; collect points inside; perform instance segmentation to isolate the object of interest. Coordinate normalization is applied: rotate from camera coordinates to frustum coordinates; translate to centroid of foreground points; optionally apply an additional translation predicted by a lightweight network to move to the true box center before final box regression.
针对从 RGB-D 做检测的 frustum 流程:先在 RGB 上做 2D proposal,再沿视线挤出成 3D frustum,取 frustum 内点云;先做实例分割得到目标实例。为减小姿态变化影响,进行坐标归一化:从相机坐标旋转到 frustum 坐标;平移到前景点的质心;还可能用轻量网络再预测一次平移把原点移动到真实框中心,再回归最终 3D 框与类别。
(Point-cloud proposals) Point-based two-stage detection(点云两阶段检测)
Another described family predicts point-wise foreground/background first, then generates proposals from foreground points. In stage 2, each proposal is transformed into a canonical coordinate system (origin at box center; axes aligned with heading), and then refined. A bin-based representation is mentioned to convert parts of regression into classification (discretize ranges into bins).
另一类方法直接从点云出发:先预测逐点前景/背景,再从前景点生成 proposals。第二阶段把每个 proposal 变换到规范坐标系(原点在框中心,坐标轴与朝向对齐)后进行框细化。脚本还提到 bin-based 表示:把部分连续回归转为离散分类(把范围离散成多个 bin)。
Proposal from Voting(基于投票的候选生成)
Voting-based detection is motivated: object centroids can be far from surface points, making direct regression hard. Sample seed points; each seed predicts one or more votes (offset vectors) targeting object centers; cluster votes efficiently (e.g., farthest point sampling); apply a PointNet-like network to each cluster to decide objectness and regress box/class.
投票式检测的动机是:物体中心往往远离表面点,直接回归中心困难。做法是采样 seed points,每个 seed 预测一个或多个投票点(偏移向量指向物体中心),再对 votes 聚类(脚本提到可用 farthest point sampling 等高效实现),对每个聚类用 PointNet 类网络判断是否为物体并回归 3D 框与类别。
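A toy rendition of the voting pipeline (synthetic data; the offsets, which a network would predict in VoteNet, are simulated from ground truth here, and the greedy grouping stands in for FPS-based clustering):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy scene: surface points of two "objects" whose centers carry no points.
centers_gt = np.array([[0., 0., 0.], [5., 0., 0.]])
seeds = np.concatenate([c + rng.normal(scale=1.0, size=(64, 3))
                        for c in centers_gt])

# Simulated vote offsets: ground-truth center minus seed, plus noise.
offsets = np.concatenate([centers_gt[i] - seeds[64 * i:64 * (i + 1)]
                          for i in range(2)])
votes = seeds + offsets + rng.normal(scale=0.05, size=seeds.shape)

def cluster_votes(votes, radius=1.0):
    """Greedy clustering: take an unassigned vote, group everything within
    `radius` of it, and emit the cluster mean (input to box regression)."""
    unassigned = np.ones(len(votes), dtype=bool)
    clusters = []
    while unassigned.any():
        i = int(np.flatnonzero(unassigned)[0])
        member = (np.linalg.norm(votes - votes[i], axis=1) < radius) & unassigned
        clusters.append(votes[member].mean(axis=0))
        unassigned &= ~member
    return np.array(clusters)

print(cluster_votes(votes).round(2))   # close to the two object centers
```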
(Shape proposal generation) Generative proposal idea(生成式 proposal 思路)
A method is described that generates proposals by reconstructing shapes from noisy observations: sample seed points and conditionally generate a point cloud proposal; then compute a tight bounding box from the generated shape.
脚本提到一种“生成式 proposal”思路:从噪声观测出发,采样 seed point,并用条件生成模型生成点云形状 proposal;再由生成形状计算紧致包围盒作为候选框。
Bottom-up Methods(自底向上方法)
Bottom-up asks: “Do these points belong to the same instance?” Learn per-point embeddings such that same-instance points are close; different instances (and often different semantic classes) are far. Post-processing groups high-confidence points; low-confidence/small clusters are discarded; overlapping groups are merged. Another approach combines discriminative embeddings with CRF-style refinement; embeddings are pulled toward instance centroids and pushed away across instances.
自底向上方法从“哪些点属于同一实例”出发。学习逐点 embedding:同实例点更接近,不同实例(甚至不同类别)更远。后处理会从高相似度/高置信点形成分组,过滤低置信或小簇,并合并重叠簇。另一类方法把判别式 embedding 与 CRF 等图模型结合:把点 embedding 拉向实例中心,同时让不同实例中心彼此远离。
BEV (bird-eye view)(BEV 鸟瞰图)
BEV detection is emphasized in self-driving. Voxelize point cloud into a ground-plane grid (H×W), treat it as an image with D channels (features per cell/pillar). Then apply typical 2D CNN detectors; some methods assume each pixel contains at most one object. Variants: pillars encoded by PointNet-like encoders; sparse 3D conv volumes reshaped into 2D BEV. Continuous fusion is described: for each BEV pixel, find k nearby points in BEV plane, project them into camera view, sample image features at projected pixels, then fuse k image features + geometric info via an MLP, using points as intermediates between camera and BEV.
讲稿强调 BEV 在自动驾驶中的重要性。将点云体素化到地面平面网格(H×W),视作带 D 通道的“图像”(每格/每柱记录特征),然后使用 2D CNN 检测器;并提到有方法显式假设每个像素只含一个目标。变体包括:用柱(pillar)并用点网络编码柱特征;或先做稀疏 3D 卷积再 reshape 成 2D BEV。连续融合(continuous fusion)流程更具体:对每个 BEV 像素,在 BEV 平面搜索 k 个近邻点,将点投影到相机视角,在投影像素处采样图像特征,再用 MLP 融合这 k 个图像特征与几何信息;点在这里充当从相机到 BEV 的中介。
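The voxelize-to-BEV step is essentially a scatter. Below is a simplified pillar encoder — per-cell max pooling of raw point features; real pillar methods first run a small PointNet inside each pillar, and the grid/extent values here are arbitrary:

```python
import numpy as np

def points_to_bev(points, feats, grid=(64, 64), extent=((-40, 40), (-40, 40))):
    """Scatter point features into an H x W bird's-eye-view grid by max-pooling
    the features of all points falling into each cell."""
    (x0, x1), (y0, y1) = extent
    h, w = grid
    ix = ((points[:, 0] - x0) / (x1 - x0) * h).astype(int).clip(0, h - 1)
    iy = ((points[:, 1] - y0) / (y1 - y0) * w).astype(int).clip(0, w - 1)
    bev = np.full((h, w, feats.shape[1]), -np.inf)
    np.maximum.at(bev, (ix, iy), feats)      # per-cell max pooling
    bev[np.isinf(bev)] = 0.0                 # empty cells -> zeros
    return bev                               # (H, W, D) "image" for a 2D CNN

rng = np.random.default_rng(0)
pts = rng.uniform(-40, 40, size=(10_000, 3))
feats = rng.normal(size=(10_000, 16))
print(points_to_bev(pts, feats).shape)  # (64, 64, 16)
```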
Part 8|Few-shot / Zero-shot 3D Learning & Structure Induction
Part 8|专题:Few-shot / Zero-shot 3D 学习与结构归纳
Topics(主题)
The speaker flags few-shot and zero-shot 3D learning as promising directions.
将 3D 的小样本/零样本学习作为值得关注的前沿主题。
Why Few-shot/Zero-shot Learning by 3D?(为什么用 3D 做小样本/零样本学习?)
Three advantages of 3D shapes are stated: (1) easier correspondence because local geometry is less affected by viewpoint/illumination; (2) easier similarity computation using distances such as Chamfer distance or Earth Mover’s Distance; (3) easier abstraction into parts / primitives (e.g., compositions of geometric primitives), supporting structure-aware generalization.
讲稿给出三点理由: (1) 3D 局部几何不易受视角与光照影响,更易建立对应;(2) 可用 Chamfer Distance、Earth Mover’s Distance 等度量直接比较形状相似性;(3) 更易把形状抽象为部件或几何基元组合,有利于结构化泛化。
Task: Few-shot Structure Induction(任务:小样本结构归纳)
Example: observing a door’s motion (rotation around an axis) from paired observations allows inferring the door as an object/part under rigidity assumptions; capturing recurring units across observations enables structure induction with weak supervision.
例子是门绕轴旋转:给定成对观测,可在“刚性/一致性”假设下推断门是一个独立对象/部件;从多次观测中捕捉可重复单元,可在弱监督下归纳结构。
Part Induction by Relating Shapes(通过关联形状进行部件归纳)
A pipeline is described with modules: correspondence (matching probability matrix between two point sets), flow (predict deformation from one point cloud to another), and segmentation (group points into parts under partial rigidity). These modules can run iteratively, akin to ICP but with learned robustness.
讲稿描述了模块化流程:对应模块输出两点云间的匹配概率矩阵;flow 模块预测从点云 1 到点云 2 的形变流;分割模块在“局部刚性/部分刚性”假设下把点分组成部件。三个模块可迭代运行,类似 ICP,但用学习模块提升鲁棒性。
Mobility Induction(可动性/运动结构归纳)
After training, the system predicts moving parts and feasible motions, enabling automatic animation of CAD models (the script references a highlighted example within a bounding box).
训练后可以得到运动部件及其可行动作,从而对 CAD 模型进行自动动画化(脚本用红框示例强调“能动起来”)。
Task: Few-shot Part Discovery(任务:小样本/跨类部件发现)
A zero-shot setting is described: train on classes with fine-grained part labels, test on unseen classes without fine-tuning, assuming different classes share similar parts. A policy-driven merging process is described: start from superpixel-like subparts; use a learned policy (reinforcement learning) to pick pairs most likely to form a part; a verifier decides whether to merge; repeat until no valid pairs remain.
讲稿描述零样本场景:训练类有细粒度部件标注,测试类是未见新类且不微调,假设不同类别共享相似部件。算法流程是策略驱动合并:从类似超像素的子块开始,用强化学习学到的策略挑选最可能构成部件的子块对;验证模块决定是否合并;循环直至无可合并对,剩余块即最终部件。
Part 9|3D Generative Modeling: Metrics, Representations, Priors, Reconstruction
Part 9|3D 生成模型:表示、评估与结构先验
3D Generative Models(3D 生成模型)
The talk focuses on generating 3D shapes with deep networks under different conditions: reconstructing from a single image, completing shapes from partial point clouds, or mapping from a low-dimensional latent space to 3D shapes for sampling.
讲稿讨论用深度网络生成 3D 形状:单图重建、部分点云补全、以及从低维潜空间到 3D 形状的映射(可无参考地采样生成)。
Metrics for Comparing Point Clouds(点云对比评估指标)
Chamfer distance is explained via bidirectional nearest neighbor distances averaged over points. Earth Mover’s Distance is described as requiring an optimal bijection (one-to-one matching), more sensitive to density distribution. Precision/recall (and F-score) are computed by checking whether points find neighbors within a threshold. If normals are available, normal consistency can be computed via dot products. Also mentioned: project 3D shapes to 2D and reuse strong 2D metrics/representations.
脚本解释 Chamfer Distance:红点找最近蓝点、蓝点找最近红点,取平均距离衡量相似度。EMD 需要最优一一匹配,对点密度分布更敏感。还可用阈值邻域定义 precision/recall 并计算 F-score。若有法向,可用点积计算 normal consistency。另一个思路是把 3D 投影到 2D,用成熟的 2D 指标或特征空间做比较。
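The threshold-based precision/recall can be sketched as follows (the threshold value is arbitrary; a point counts as matched if it has a neighbor in the other set within the threshold):

```python
import numpy as np

def fscore(pred, gt, threshold=0.01):
    """F-score between two point sets: precision = fraction of predicted points
    with a ground-truth neighbor within `threshold`, recall = the reverse."""
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    precision = (d.min(axis=1) < threshold).mean()
    recall = (d.min(axis=0) < threshold).mean()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

a = np.array([[0., 0., 0.], [1., 0., 0.]])
print(fscore(a, a))                                        # 1.0
print(fscore(a + np.array([5., 0., 0.]), a, threshold=0.1))  # 0.0: no matches
```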
(Voxel & Octree) Volumetric representation and octree generation(体素与八叉树生成)
Voxel representation is intuitive and supports 3D CNNs, but expensive due to cubic scaling; octree-based generation is introduced as a more efficient adaptive-resolution alternative, generating volumetric occupancy layer-by-layer.
体素直观且利于 3D CNN,但计算/内存昂贵;八叉树生成作为替代,通过自适应单元大小逐层生成体素占据,可支持更高分辨率。
(Point-based) Point cloud generation from images(从图像生成点云)
A common point-based paradigm: a 2D CNN encoder plus a fully-connected branch predicts a fixed number of 3D points; losses use point-set distances (Chamfer/EMD) to train end-to-end.
点集生成范式:2D CNN 编码图像,再通过全连接分支预测固定数量的 3D 点;用 Chamfer/EMD 等点集距离作为损失端到端训练。
(Primitive-based) Primitive assembly(基元组合)
The script mentions representing shapes as combinations of primitives (e.g., planes/cubes) by predicting primitive parameters and assembling them; it is straightforward but struggles with fine-grained details.
脚本提到用几何基元组合表示(如平面/立方体等),网络预测基元参数并组装成形状;思路直观,但难生成细粒度复杂细节。
Patch-based Surface Generation(基于 Patch 的表面生成)
A “wrap a 2D plane into 3D surfaces” paradigm: sample 2D points on a square conditioned on a latent code; use MLPs to map them into 3D; assemble multiple patches to form a surface; uses point-set distances for training.
“把 2D 平面包裹成 3D 表面”的思路:在 2D 方形上采样点并结合潜变量,通过 MLP 映射到 3D 坐标,多个 patch 拼接成完整表面;训练仍用点集距离度量。
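The "fold a 2D patch into 3D" mapping is just an MLP over concatenated grid coordinates and a latent code; with random weights standing in for a trained folding network, the sketch shows the shapes involved:

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = np.meshgrid(np.linspace(0, 1, 20), np.linspace(0, 1, 20))
grid2d = np.stack([u.ravel(), v.ravel()], axis=1)   # (400, 2) points on a square

latent = rng.normal(size=(8,))                      # shape code (learned in practice)
W1 = rng.normal(scale=0.5, size=(2 + 8, 64))        # random weights stand in
W2 = rng.normal(scale=0.5, size=(64, 3))            # for a trained folding MLP

def fold(grid2d, latent):
    """Map each 2D grid point, concatenated with the latent code, through an
    MLP to a 3D position -- one patch of the generated surface."""
    x = np.concatenate([grid2d, np.tile(latent, (len(grid2d), 1))], axis=1)
    return np.tanh(x @ W1) @ W2                     # (400, 3)

patch = fold(grid2d, latent)
print(patch.shape)  # (400, 3); several such patches are stitched into a surface
```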
Polygon Meshes Representation(多边形网格表示)
Direct mesh generation is non-trivial due to irregularity. Many works deform a template mesh (sphere/ellipsoid) by predicting vertex positions while keeping connectivity fixed; this cannot change topology (e.g., cannot deform a sphere into a torus). Other approaches decompose shapes into structured parts or update topology dynamically to handle more complex shapes.
直接生成网格较难,因为连通性不规则。常见路线是从模板网格(球/椭球)出发预测顶点位置来形变,但拓扑固定(例如球无法变成甜甜圈)。也有方法通过结构化部件分解或动态拓扑更新来处理更复杂拓扑。
Implicit Function Representation(隐式函数表示)
Implicit methods learn a signed distance function (SDF): a network maps a 3D query point to its signed distance to the surface (inside/outside sign convention), where the zero level set defines the surface. Surfaces are extracted with post-processing such as marching cubes.
隐式表示学习 SDF:网络输入 3D 查询点,输出其到表面的符号距离(内外符号),零等值面即为形状表面;再用 marching cubes 等后处理提取网格。
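The zero-level-set idea can be illustrated with an analytic SDF standing in for the learned network f(x, y, z) → s (a sphere here; the grid size is arbitrary):

```python
import numpy as np

def sdf_sphere(p, r=0.5):
    """Stand-in for a learned implicit function: signed distance to a sphere
    of radius r (negative inside, positive outside)."""
    return np.linalg.norm(p, axis=-1) - r

n = 64
lin = np.linspace(-1, 1, n)
grid = np.stack(np.meshgrid(lin, lin, lin, indexing="ij"), axis=-1)  # (n,n,n,3)
s = sdf_sphere(grid.reshape(-1, 3)).reshape(n, n, n)

inside = s < 0                 # occupancy follows from the sign of the SDF
print(inside.mean())           # ≈ sphere volume / cube volume ≈ 0.065
# A mesh would be extracted from `s` at the zero level set, e.g. with
# skimage.measure.marching_cubes(s, level=0.0).
```

Because f can be queried at any continuous (x, y, z), the representation has no fixed resolution and imposes no topology constraints.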
Intermediate Representations
An alternative to directly decoding a 3D shape end-to-end is to predict intermediate 2.5D cues (depth, normals, silhouettes) from the image, then reconstruct or complete the 3D shape from those cues. This often improves reconstruction quality and transferability; when full 3D ground truth is unavailable, projection consistency can supervise training or fine-tuning.
Structured Shape Generation
The transcript discusses explicitly modeling object structure as a hierarchical part decomposition together with part relationships (adjacency, symmetry). Encoding combines geometry encoders with graph encoders that run graph convolutions over the part-relation graph; decoding recursively reconstructs the parts and their relations, enabling structure-aware generation and downstream structured geometry processing.
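A toy message-passing step over a part graph, assuming simple mean aggregation (the actual models use learned weights and recursive hierarchies; this only shows how relation edges mix part features):

```python
def graph_conv_step(features, edges):
    """One message-passing step over a part graph: each part's new
    feature is the average of its own feature and its neighbours'."""
    neighbours = {i: [] for i in range(len(features))}
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    out = []
    for i, f in enumerate(features):
        group = [f] + [features[j] for j in neighbours[i]]
        out.append([sum(vals) / len(group) for vals in zip(*group)])
    return out

# A chair as parts: 0=seat, 1=back, 2=leg; back and leg attach to the seat.
part_features = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
updated = graph_conv_step(part_features, edges=[(0, 1), (0, 2)])
print(updated[0])  # the seat feature now mixes in back and leg features
```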
Mesh Reconstruction
Reconstructing meshes from point clouds remains hard: traditional Poisson surface reconstruction can fail on ambiguous structures (e.g., two nearby planes), producing sticking or distortion artifacts. Learning-based methods that are not grounded in the input may miss details or generalize poorly.
Bottom-up Reconstruction with Candidate Triangles
One bottom-up strategy: build a kNN graph on the input points, generate many candidate triangles, and train a network to classify which triangles lie on the surface and which should be discarded; the surviving triangles are merged into the final mesh. Supervision can come from a heuristic comparing the intrinsic (geodesic-like) and extrinsic (Euclidean) distances between two points: a ratio near 1 suggests the region between them lies on the true surface.
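The intrinsic-vs-extrinsic heuristic can be sketched with Dijkstra over a point graph; the toy geometry below (a chain of points whose endpoints are close in space) is illustrative:

```python
import heapq, math

def graph_geodesic(points, edges, src, dst):
    """Shortest-path (intrinsic, geodesic-like) distance over a point
    graph, computed with Dijkstra's algorithm."""
    adj = {i: [] for i in range(len(points))}
    for a, b in edges:
        w = math.dist(points[a], points[b])
        adj[a].append((b, w))
        adj[b].append((a, w))
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, math.inf):
            continue
        for v, w in adj[u]:
            if d + w < dist.get(v, math.inf):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return math.inf

# Walking over the graph between the chain's endpoints is much longer
# than cutting straight through space between them.
pts = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
edges = [(0, 1), (1, 2), (2, 3)]
intrinsic = graph_geodesic(pts, edges, 0, 3)   # 3.0 along the chain
extrinsic = math.dist(pts[0], pts[3])          # 1.0 straight-line distance
print(intrinsic / extrinsic)  # ratio >> 1: the region between the two
                              # points is likely NOT on the surface
```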
Part 10 | Multi-view Stereo (MVS): From Classical to Learned Reconstruction Pipelines
Multi-view Stereo (MVS)
Classical MVS pipelines consist of multiple separate steps: sparse feature detection and correspondence followed by densification into point clouds, or per-view depth map estimation followed by depth fusion into 3D. Learning-based MVS aims to learn multi-view consistency and geometric context end-to-end, reducing reliance on hand-designed steps.
Colored Voxel Cubes for MVS
One approach builds "colored voxel cubes": each voxel is projected onto the input images and stores the corresponding colors/features, implicitly encoding the camera parameters into the grid. A 3D CNN then predicts per-voxel surface confidence, and a binarization step yields the reconstructed surface. The noted limitations are coarse voxel resolution and the quantization error introduced by voxelization.
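Projecting a voxel center into an image with a pinhole model is the core lookup step; here is a minimal sketch (square image and principal point at the image center are simplifying assumptions):

```python
def project_voxel(center, focal, image_size):
    """Project a voxel center (camera coordinates, z > 0) with a simple
    pinhole model; return pixel indices, or None if outside the image.
    This is how each voxel fetches its color/feature from a view,
    implicitly baking the camera parameters into the grid."""
    x, y, z = center
    u = int(focal * x / z + image_size / 2)
    v = int(focal * y / z + image_size / 2)
    if 0 <= u < image_size and 0 <= v < image_size:
        return u, v
    return None

print(project_voxel((0.0, 0.0, 2.0), focal=100.0, image_size=64))  # (32, 32)
print(project_voxel((5.0, 0.0, 2.0), focal=100.0, image_size=64))  # None
```

The `int(...)` truncation is one source of the quantization error the transcript mentions.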
2D Feature Back-projection into 3D Grids
Another approach first encodes each image with a strong 2D CNN, then back-projects the 2D feature maps into a 3D feature grid. Recurrent aggregation of the per-view 3D feature grids handles a variable number of views, and 3D convolutions (often sparse) predict occupancy/surface.
Cost Volume
A depth/disparity estimation family: rectify the stereo or multi-view images, extract 2D features, and build a view-aligned 4D cost volume by concatenating features across disparity levels; a 3D CNN processes the volume, and a soft-argmin-like operation regresses sub-pixel disparity/depth. Cost volumes are noted to be very memory-hungry.
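The soft-argmin regression can be shown on a single 1-D cost slice (one pixel's costs over disparity levels); a minimal pure-Python sketch:

```python
import math

def soft_argmin(costs):
    """Soft-argmin over a 1-D cost slice: softmax(-cost) weights each
    disparity level, and the expected disparity gives a sub-pixel,
    differentiable estimate (the regression used after the 3D CNN)."""
    weights = [math.exp(-c) for c in costs]
    total = sum(weights)
    return sum(d * w / total for d, w in enumerate(weights))

# The lowest cost is at disparity 2; the slightly lower cost at 3 than
# at 1 pulls the estimate a bit above 2 -- a sub-pixel answer that a
# hard argmin cannot produce.
costs = [5.0, 3.0, 0.5, 2.5, 5.0]
print(round(soft_argmin(costs), 3))
```

Being a smooth expectation rather than a hard argmin is what makes the operation trainable end-to-end.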
Recurrent Depth Aggregation
To reduce memory, the cost volume is processed layer by layer along the depth dimension with a recurrent network, lowering peak memory and enabling higher resolutions.
Point-based Iterative Refinement
A point-based refinement idea: first predict a low-resolution depth map and convert it to a point cloud, then iteratively update point positions and add points (densification). For each point, candidate positions are hypothesized along its camera ray; a graph neural network aggregates neighborhood information and predicts the update. Iteration progressively improves accuracy and supports a flexible coarse-to-fine, interactive refinement process.
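A toy version of the hypothesize-along-the-ray idea, with a hand-written score function standing in for the graph network (the function names and scoring rule are illustrative assumptions):

```python
def refine_point_along_ray(origin, direction, depth, score_fn, step=0.1, k=2):
    """One refinement iteration: hypothesize 2k+1 candidate positions
    around the current depth along the camera ray, score each one
    (score_fn stands in for the learned network), and move the point
    to the best-scoring candidate depth."""
    candidates = [depth + i * step for i in range(-k, k + 1)]
    return max(candidates, key=lambda d: score_fn(
        tuple(o + d * v for o, v in zip(origin, direction))))

# Toy score: prefer positions close to a "true" surface at depth 2.07
# along a ray pointing down +z from the origin.
true_depth = 2.07
score = lambda p: -abs(p[2] - true_depth)
d = 2.0
for _ in range(3):
    d = refine_point_along_ray((0, 0, 0), (0, 0, 1), d, score)
print(round(d, 2))  # settles near the surface depth, limited by the step size
```

Shrinking `step` across iterations would give the coarse-to-fine behavior described above.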
