𝔖 Scriptorium
✦   LIBER   ✦


Multimodal Scene Understanding: Algorithms, Applications and Deep Learning

✍ Scribed by Michael Ying Yang, Bodo Rosenhahn, Vittorio Murino (eds.)


Publisher: Academic Press
Year: 2019
Tongue: English
Leaves: 419
Category: Library

⬇  Acquire This Volume

No coin nor oath required. For personal study only.

✦ Table of Contents


Cover......Page 1
Multimodal Scene Understanding: Algorithms, Applications and Deep Learning......Page 4
Copyright......Page 5
Contents......Page 6
List of Contributors......Page 8
1.1 Introduction......Page 11
1.2 Organization of the Book......Page 13
References......Page 17
2 Deep Learning for Multimodal Data Fusion......Page 18
2.1 Introduction......Page 19
2.2 Related Work......Page 20
2.3.1 Auto-Encoder......Page 22
2.3.2 Variational Auto-Encoder (VAE)......Page 23
2.3.3 Generative Adversarial Network (GAN)......Page 24
2.3.5 Adversarial Auto-Encoder (AAE)......Page 25
2.3.6 Adversarial Variational Bayes (AVB)......Page 26
2.4 Multimodal Image-to-Image Translation Networks......Page 28
2.4.1 Pix2pix and Pix2pixHD......Page 29
2.4.2 CycleGAN, DiscoGAN, and DualGAN......Page 30
2.4.3 CoGAN......Page 31
2.4.4 UNIT......Page 32
2.4.5 Triangle GAN......Page 33
2.5 Multimodal Encoder-Decoder Networks......Page 34
2.5.1 Model Architecture......Page 36
2.5.2 Multitask Training......Page 37
2.6 Experiments......Page 38
2.6.1 Results on NYUDv2 Dataset......Page 39
2.6.2 Results on Cityscapes Dataset......Page 41
2.6.3 Auxiliary Tasks......Page 42
References......Page 45
3.1 Introduction......Page 49
3.2 Overview......Page 51
3.2.1 Image Classification and the VGG Network......Page 52
3.2.2 Architectures for Pixel-level Labeling......Page 53
3.2.3 Architectures for RGB and Depth Fusion......Page 55
3.3 Methods......Page 57
3.3.1 Datasets and Data Splitting......Page 58
3.3.3 Preprocessing of the ISPRS Dataset......Page 59
3.3.5 Color Spaces for RGB and Depth Fusion......Page 61
3.4 Results and Discussion......Page 63
3.4.1 Results and Discussion on the Stanford Dataset......Page 64
3.4.2 Results and Discussion on the ISPRS Dataset......Page 66
References......Page 70
4 Learning Convolutional Neural Networks for Object Detection with Very Little Training Data......Page 73
4.1 Introduction......Page 74
4.2.1 Types of Learning......Page 76
4.2.2.1 Artificial neuron......Page 78
4.2.2.2 Artificial neural network......Page 79
4.2.2.3 Training......Page 81
4.2.2.4 Convolutional neural networks......Page 83
4.2.3.1 Decision tree......Page 85
4.3 Related Work......Page 87
4.4.1 Feature Learning......Page 89
4.4.3 RF to NN Mapping......Page 90
4.4.4 Fully Convolutional Network......Page 92
4.4.5 Bounding Box Prediction......Page 93
4.5 Localization......Page 94
4.6 Clustering......Page 95
4.7 Dataset......Page 97
4.7.2 Filtering......Page 98
4.8.1 Training and Test Data......Page 99
4.8.3 Object Detection......Page 100
4.8.4 Computation Time......Page 103
4.8.5 Precision of Localizations......Page 104
References......Page 106
5.1 Introduction......Page 109
5.2.1 Visible Pedestrian Detection......Page 113
5.2.2 Infrared Pedestrian Detection......Page 115
5.2.3 Multimodal Pedestrian Detection......Page 116
5.3.1 Multimodal Feature Learning/Fusion......Page 118
5.3.2.1 Baseline DNN model......Page 120
5.3.2.2 Scene-aware DNN model......Page 121
5.3.3 Multimodal Segmentation Supervision......Page 124
5.4.2 Implementation Details......Page 126
5.4.3 Evaluation of Multimodal Feature Fusion......Page 127
5.4.4 Evaluation of Multimodal Pedestrian Detection Networks......Page 129
5.4.5 Evaluation of Multimodal Segmentation Supervision Networks......Page 132
5.4.6 Comparison with State-of-the-Art Multimodal Pedestrian Detection Methods......Page 133
References......Page 138
6 Multispectral Person Re-Identification Using GAN for Color-to-Thermal Image Translation......Page 142
6.1 Introduction......Page 143
6.2.1 Person Re-Identification......Page 144
6.2.3 Generative Adversarial Networks......Page 145
6.3.1 ThermalWorld ReID Split......Page 146
6.3.2 ThermalWorld VOC Split......Page 147
6.3.3 Dataset Annotation......Page 149
6.3.4 Comparison of the ThermalWorld VOC Split with Previous Datasets......Page 150
6.3.5 Dataset Structure......Page 151
6.4 Method......Page 152
6.4.3 Relative Thermal Contrast Generator......Page 154
6.4.4 Thermal Signature Matching......Page 155
6.5.2.1 Qualitative comparison......Page 156
6.5.2.2 Quantitative evaluation......Page 157
6.5.3 ReID Evaluation Protocol......Page 158
6.5.5 Comparison and Analysis......Page 159
6.6 Conclusion......Page 161
References......Page 162
7 A Review and Quantitative Evaluation of Direct Visual-Inertial Odometry......Page 166
7.1 Introduction......Page 167
7.2 Related Work......Page 168
7.2.2 Visual-Inertial Odometry......Page 169
7.3.1 Gauss-Newton Algorithm......Page 170
7.4.1 Notation......Page 172
7.4.3 Interaction Between Coarse Tracking and Joint Optimization......Page 174
7.4.4 Coarse Tracking Using Direct Image Alignment......Page 175
7.4.5 Joint Optimization......Page 177
7.5 Direct Sparse Visual-Inertial Odometry......Page 178
7.5.1 Inertial Error......Page 179
7.5.2 IMU Initialization and the Problem of Observability......Page 180
7.5.4 Scale-Aware Visual-Inertial Optimization......Page 181
7.5.4.1 Nonlinear optimization......Page 182
7.5.4.2 Marginalization using the Schur complement......Page 183
7.5.4.3 Dynamic marginalization for delayed scale convergence......Page 184
7.5.4.4 Measuring scale convergence......Page 187
7.5.5 Coarse Visual-Inertial Tracking......Page 188
7.6 Calculating the Relative Jacobians......Page 189
7.6.1 Proof of the Chain Rule......Page 190
7.6.2 Derivation of the Jacobian with Respect to Pose in Eq. (7.58)......Page 191
7.6.3 Derivation of the Jacobian with Respect to Scale and Gravity Direction in Eq. (7.59)......Page 192
7.7 Results......Page 193
7.7.1 Robust Quantitative Evaluation......Page 194
7.7.2 Evaluation of the Initialization......Page 196
7.7.3 Parameter Studies......Page 200
References......Page 203
8 Multimodal Localization for Embedded Systems: A Survey......Page 206
8.1 Introduction......Page 207
8.2 Positioning Systems and Perception Sensors......Page 209
8.2.1.1 Inertial navigation systems......Page 210
8.2.1.2 Global navigation satellite systems......Page 212
8.2.2.1 Visible light cameras......Page 214
8.2.2.2 IR cameras......Page 216
8.2.2.4 RGB-D cameras......Page 217
8.2.2.5 LiDAR sensors......Page 218
8.2.3.1 Sensor configuration types......Page 219
8.2.3.2 Sensor coupling approaches......Page 220
8.2.3.3 Sensor fusion architectures......Page 221
8.2.4 Discussion......Page 223
8.3 State of the Art on Localization Methods......Page 224
8.3.1.2 GNSS-based localization......Page 225
8.3.1.3 Image-based localization......Page 227
8.3.1.4 LiDAR-map based localization......Page 232
8.3.2 Multimodal Localization......Page 233
8.3.2.1 Classical data fusion algorithms......Page 234
8.3.2.2 Reference multimodal benchmarks......Page 237
8.3.2.3 A panorama of multimodal localization approaches......Page 238
8.3.2.4 Graph-based localization......Page 243
8.3.3 Discussion......Page 244
8.4.1 Application Domain and Hardware Constraints......Page 246
8.4.2.1 SoC constraints......Page 247
8.4.2.2 IP modules for SoC......Page 249
8.4.2.3 SoC......Page 250
8.4.2.5 ASIC......Page 252
8.4.2.6 Discussion......Page 254
8.4.3.2 Smart phones......Page 255
8.4.3.3 Smart glasses......Page 256
8.4.3.5 Unmanned aerial vehicles......Page 258
8.4.3.6 Autonomous driving vehicles......Page 259
8.4.4 Discussion......Page 260
8.5 Application Domains......Page 262
8.5.1.1 Aircraft inspection......Page 263
8.5.1.2 SenseFly eBee classic......Page 264
8.5.2.1 Indoor localization in large-scale buildings......Page 266
8.5.3 Automotive Navigation......Page 267
8.5.3.1 Autonomous driving......Page 268
8.5.4 Mixed Reality......Page 269
8.5.4.1 Virtual cane system for visually impaired individuals......Page 270
8.5.4.2 Engineering, construction and maintenance......Page 271
8.6 Conclusion......Page 272
References......Page 273
9 Self-Supervised Learning from Web Data for Multimodal Retrieval......Page 286
9.1.2 Alternatives to Annotated Data......Page 287
9.2 Related Work......Page 288
9.3 Multimodal Text-Image Embedding......Page 290
9.4 Text Embeddings......Page 292
9.5.2 WebVision......Page 294
9.5.3 MIRFlickr......Page 295
9.6.2 Results and Conclusions......Page 296
9.6.3.3 Words with different meanings or uses......Page 301
9.7.1 Experiment Setup......Page 302
9.7.2 Results and Conclusions......Page 303
9.8.1 Experiment Setup......Page 305
9.8.2 Results and Conclusions......Page 306
9.10.1 Dimensionality Reduction with t-SNE......Page 307
9.10.4 Semantic Space Inspection......Page 309
References......Page 311
10 3D Urban Scene Reconstruction and Interpretation from Multisensor Imagery......Page 314
10.2 Pose Estimation for Wide-Baseline Image Sets......Page 315
10.2.1 Pose Estimation for Wide-Baseline Pairs and Triplets......Page 316
10.2.2 Hierarchical Merging of Triplets......Page 317
10.2.3 Automatic Determination of Overlap......Page 318
10.3 Dense 3D Reconstruction......Page 320
10.3.1 Dense Depth Map Generation and Uncertainty Estimation......Page 321
10.3.2 3D Uncertainty Propagation and 3D Reconstruction......Page 322
10.4.1.1 Color coherence......Page 324
10.4.1.2 Definition of neighborhood......Page 325
10.4.1.3 Relative height......Page 326
10.4.1.4 Coplanarity of 3D points......Page 327
10.4.2.2 Results for Bonnland......Page 328
10.5 Scene and Building Decomposition......Page 329
10.5.1 Scene Decomposition......Page 330
10.5.2 Building Decomposition......Page 331
10.5.2.1 Ridge extraction......Page 333
10.6.1 Primitive Selection and Optimization......Page 334
10.6.2 Primitive Assembly......Page 336
10.6.3 LoD2 Models......Page 338
10.6.4 Detection of Facade Elements......Page 339
10.6.5 Shell Model......Page 342
References......Page 344
11 Decision Fusion of Remote-Sensing Data for Land Cover Classification......Page 348
11.1 Introduction......Page 349
11.1.1.1 Early fusion - fusion at the observation level......Page 350
11.1.1.3 Late fusion - fusion at the decision level......Page 351
11.1.2 Discussion and Proposal of a Strategy......Page 353
11.2 Proposed Framework......Page 354
11.2.1.1 Fuzzy rules......Page 356
11.2.1.3 Margin-based rules......Page 358
11.2.1.4 Dempster-Shafer evidence theory......Page 359
11.2.2.1 Model formulation(s)......Page 360
11.2.2.2 Optimization......Page 362
11.2.2.3 Parameter tuning......Page 363
11.3.1 Introduction......Page 364
11.3.3 Datasets......Page 365
11.3.4.1 Source comparison......Page 367
11.3.4.2 Decision fusion classification......Page 368
11.3.4.3 Regularization......Page 370
11.4.1 Introduction......Page 371
11.4.2 Proposed Framework: A Two-Step Urban Footprint Detection......Page 373
11.4.2.2 First regularization......Page 374
11.4.3 Data......Page 375
11.4.4.1 Five-class classifications......Page 376
11.4.4.2 Urban footprint extraction......Page 380
11.5 Final Outlook and Perspectives......Page 383
References......Page 384
12.1 Introduction......Page 390
12.2.1 Generalized Distillation......Page 393
12.2.2 Multimodal Video Action Recognition......Page 394
12.3.1 Cross-stream Multiplier Networks......Page 395
12.3.2 Hallucination Stream......Page 398
12.3.3 Training Paradigm......Page 399
12.4.1 Datasets......Page 400
12.4.3 Hyperparameters and Validation Set......Page 401
12.4.4 Ablation Study......Page 402
12.4.4.3 Contributions of the proposed training procedure......Page 403
12.4.6 Comparison with Other Methods......Page 404
12.5 Conclusions and Future Work......Page 406
References......Page 407
Index......Page 409
Back Cover......Page 419


📜 SIMILAR VOLUMES


Multimodal Scene Understanding: Algorithms, Applications and Deep Learning
✍ Michael Ying Yang (editor), Bodo Rosenhahn (editor), Vittorio Murino (editor) 📂 Library 📅 2019 🏛 Academic Press 🌐 English

Multimodal Scene Understanding: Algorithms, Applications and Deep Learning presents recent advances in multi-modal computing, with a focus on computer vision and photogrammetry. It provides the latest algorithms and applications that involve combining multiple sources of information…

Multimodal Computational Attention for Scene Understanding and Robotics
✍ Boris Schauerte (auth.) 📂 Library 📅 2016 🏛 Springer International Publishing 🌐 English

This book presents state-of-the-art computational attention models that have been successfully tested in diverse application areas and can build the foundation for artificial systems to efficiently explore, analyze, and understand natural scenes. It gives a comprehensive overview of the most…

Semantic Networks for Understanding Scenes
✍ Gerhard Sagerer, Heinrich Niemann (auth.) 📂 Library 📅 1997 🏛 Springer US 🌐 English

From the book's introduction, Figure 1.1, an outdoor scene: "A bus is passing three cars which are parking between trees at the side of the road. Houses having two storeys are lined up at the street. There seems to be a small open place between the group of houses in the foregroun…" Figure 1.2 shows an assembly scene.

Multimodal Learning Toward Micro-video Understanding
✍ Liqiang Nie, Meng Liu, Xuemeng Song 📂 Library 📅 2019 🏛 Morgan & Claypool 🌐 English

Micro-videos, a new form of user-generated content, have been spreading widely across various social platforms, such as Vine, Kuaishou, and TikTok. Different from traditional long videos, micro-videos are usually recorded by smart mobile devices at any place within a few seconds…

Crime Scene Investigation Mapping: Understanding Hot Spots
✍ U.S. Department of Justice 📂 Library 📅 2005 🏛 Quality Information Publishers, Inc. 🌐 English

Mapping Crime: Understanding Hot Spots. 2005, 79 pages. Contents: About This Report; Chapter 1, "Crime Hot Spots: What They Are, Why We Have Them, and How to Map Them"…