Duino committed on
Commit 01b8a56 · verified · 1 parent: dc92bd1

Update README.md

Files changed (1): README.md (+135 −103)
README.md CHANGED
@@ -1,45 +1,44 @@
- ```yaml
  ---
- title: Duino-Idar: Interactive Indoor 3D Mapping via Mobile Video with Semantic Enrichment
- emoji:
- colorFrom: indigo
- colorTo: blue
- sdk: gradio
- sdk_version: 4.x
- app_file: app.py # Replace with your actual Gradio app file if you have one
  tags:
  - 3d-mapping
- - indoor-reconstruction
  - depth-estimation
  - semantic-segmentation
  - vision-language-model
  - mobile-video
- - gradio
- - point-cloud
  - dpt
  - paligemma
- - computer-vision
- - research-paper
  author: Jalal Mansour (Jalal Duino)
- date: 2025-02-18
  email: [email protected]
- hf_space: Duino # Assuming 'Duino' is your HF username/space name
- license: mit
  ---

- # Duino-Idar: Interactive Indoor 3D Mapping via Mobile Video with Semantic Enrichment

- **Author:** Jalal Mansour (Jalal Duino)
- **Date Created:** 2025-02-18
- **Email:** [[email protected]](mailto:[email protected])
- **Hugging Face Space:** [https://huggingface.co/Duino](https://huggingface.co/Duino)
- **License:** MIT License

- ---
-
- ## Abstract
-
- This paper introduces Duino-Idar, a novel end-to-end system for generating interactive 3D maps of indoor environments from mobile video. Leveraging state-of-the-art monocular depth estimation techniques, specifically DPT (Dense Prediction Transformer)-based models, and semantic understanding via a fine-tuned vision-language model (PaLiGemma), Duino-Idar offers a comprehensive solution for indoor scene reconstruction. The system extracts key frames from video input, computes depth maps, constructs a 3D point cloud, and enriches it with semantic labels. A user-friendly Gradio-based graphical user interface (GUI) facilitates video upload, processing, and interactive 3D scene exploration. This research details the system's architecture, implementation, and potential applications in areas such as indoor navigation, augmented reality, and automated scene understanding, setting the stage for future enhancements including LiDAR integration for improved accuracy and robustness.

  **Keywords:** 3D Mapping, Indoor Reconstruction, Mobile Video, Depth Estimation, Semantic Segmentation, Vision-Language Models, DPT, PaLiGemma, Point Cloud, Gradio, Interactive Visualization.
@@ -49,7 +48,7 @@ This paper introduces Duino-Idar, a novel end-to-end system for generating inter

  Recent advancements in computer vision and deep learning have significantly propelled the field of 3D scene reconstruction from 2D imagery. Mobile devices, now ubiquitous and equipped with high-quality cameras, provide a readily available source of video data suitable for spatial mapping. While monocular depth estimation has matured considerably, enabling real-time applications, many existing 3D reconstruction approaches lack a crucial component: semantic understanding of the scene. This semantic context is vital for enabling truly interactive and context-aware applications, such as augmented reality (AR) navigation, object recognition, and scene understanding for robotic systems.

- To address this gap, we present Duino-Idar, an innovative system that integrates a robust depth estimation pipeline with a fine-tuned vision-language model, PaLiGemma, to enhance indoor 3D mapping. The system's name, Duino-Idar, reflects the vision of combining accessible technology ("Duino," referencing approachability and user-centric design) with advanced spatial sensing ("Idar," hinting at the potential for LiDAR integration in future iterations, although the current prototype focuses on vision-based depth). This synergistic combination not only achieves geometric reconstruction but also provides semantic enrichment, significantly enhancing both visualization and user interaction capabilities. This paper details the architecture, implementation, and potential of Duino-Idar, highlighting its contribution to accessible and semantically rich indoor 3D mapping.

  ---
@@ -58,15 +57,19 @@ To address this gap, we present Duino-Idar, an innovative system that integrates

  Our work builds upon and integrates several key areas of research:

  ### 2.1 Monocular Depth Estimation:
- The foundation of our geometric reconstruction lies in monocular depth estimation. Models such as MiDaS [1] and DPT [2] have demonstrated remarkable capabilities in inferring depth from single images. DPT, in particular, leverages transformer architectures to capture global contextual information, leading to improved depth accuracy compared to earlier convolutional neural network (CNN)-based methods. Equation (2) illustrates the depth normalization process used in DPT-like models to scale the predicted depth map to a usable range.

  ### 2.2 3D Reconstruction Techniques:
- Generating 3D point clouds or meshes from 2D inputs is a well-established field, encompassing techniques from photogrammetry [3] and Simultaneous Localization and Mapping (SLAM) [4]. Our approach utilizes depth maps derived from DPT to construct a point cloud, offering a simpler yet effective method for 3D scene representation, particularly suitable for indoor environments where texture and feature richness can support monocular depth estimation. The transformation from 2D pixel coordinates to 3D space is mathematically described by the pinhole camera model, as shown in Equations (5)-(9).

  ### 2.3 Vision-Language Models for Semantic Understanding:
  Vision-language models (VLMs) have emerged as powerful tools for bridging the gap between visual and textual understanding. PaLiGemma [5] is a state-of-the-art multimodal model that integrates image understanding with natural language processing. Fine-tuning such models on domain-specific datasets, such as indoor scenes, allows for the generation of semantic annotations and descriptions that can be overlaid on reconstructed 3D models, enriching them with contextual information. The fine-tuning process for PaLiGemma, aimed at minimizing the token prediction loss, is formalized in Equation (11).

  ### 2.4 Interactive 3D Visualization:
  Effective visualization is crucial for user interaction with 3D data. Libraries like Open3D [6] and Plotly [7] provide tools for interactive exploration of 3D point clouds and meshes. Open3D, in particular, offers robust functionalities for point cloud manipulation, rendering, and visualization, making it an ideal choice for desktop-based interactive 3D scene exploration. For web-based interaction, Plotly offers excellent capabilities for embedding interactive 3D visualizations within web applications.

  ---
@@ -77,9 +80,11 @@ Effective visualization is crucial for user interaction with 3D data. Libraries

  The Duino-Idar system is structured into three primary modules, as illustrated in Figure 1:

- 1. **Video Processing and Frame Extraction:** This module ingests mobile video input and extracts representative key frames at configurable intervals to reduce computational redundancy and capture scene changes effectively.
- 2. **Depth Estimation and 3D Reconstruction:** Each extracted frame is processed by a DPT-based depth estimator to generate a depth map. These depth maps are then converted into 3D point clouds using a pinhole camera model, transforming 2D pixel coordinates into 3D spatial positions.
- 3. **Semantic Enrichment and Visualization:** A fine-tuned PaLiGemma model provides semantic annotations for the extracted key frames, enriching the 3D reconstruction with object labels and scene descriptions. A Gradio-based GUI integrates these modules, providing a user-friendly interface for video upload, processing, interactive 3D visualization, and exploration of the semantically enhanced 3D scene.

  ```mermaid
  graph LR
@@ -99,7 +104,7 @@ graph LR
  style H fill:#eee,stroke:#333,stroke-width:2px
  style I fill:#ace,stroke:#333,stroke-width:2px
  ```
- *Figure 1: Duino-Idar System Architecture. The diagram illustrates the flow of data through the system modules, from video input to interactive 3D visualization with semantic enrichment.*

  ### 3.2 Detailed Pipeline
@@ -111,20 +116,20 @@ The Duino-Idar pipeline operates through the following detailed steps:

  2. **Depth Estimation Module:**
     * **Preprocessing:** Each extracted frame undergoes preprocessing, including resizing and normalization, to optimize it for input to the DPT model. This ensures consistent input dimensions and value ranges for the depth estimation network.
-    * **Depth Prediction:** The preprocessed frame is fed into the DPT model, which generates a depth map. This depth map represents the estimated distance of each pixel in the image from the camera.
-    * **Normalization and Scaling:** The raw depth map is normalized to a standard range (e.g., 0-1 or 0-255) for subsequent 3D reconstruction and visualization. Equation (2) details the normalization process.

  3. **3D Reconstruction Module:**
-    * **Point Cloud Generation:** A pinhole camera model is applied to convert the depth map and corresponding pixel coordinates into 3D coordinates in camera space. Color information from the original frame is associated with each 3D point to create a colored point cloud. Equations (6), (7), and (8) formalize this transformation.
-    * **Point Cloud Aggregation:** To build a comprehensive 3D model, point clouds generated from multiple key frames are aggregated. In this initial implementation, we assume a static camera or negligible inter-frame motion for simplicity. More advanced implementations could incorporate camera pose estimation and point cloud registration for improved accuracy, especially in dynamic scenes. The aggregation process is mathematically represented by Equation (10).

  4. **Semantic Enhancement Module:**
-    * **Vision-Language Processing:** The fine-tuned PaLiGemma model processes the key frames to generate scene descriptions and semantic labels. The model is prompted to identify objects and provide contextual information relevant to indoor scenes.
     * **Semantic Data Integration:** Semantic labels generated by PaLiGemma are overlaid onto the reconstructed point cloud. This integration can be achieved through various methods, such as associating semantic labels with clusters of points or generating bounding boxes around semantically labeled objects within the 3D scene.

  5. **Visualization and User Interface Module:**
     * **Interactive 3D Viewer:** The final semantically enriched 3D model is visualized using Open3D (or Plotly for web-based deployments). Users can interact with the 3D scene, rotating, zooming, and panning to explore the reconstructed environment.
-    * **Gradio GUI:** A user-friendly Gradio web interface provides a seamless experience, allowing users to upload videos, initiate the processing pipeline, and interactively navigate the resulting 3D scene. The GUI also provides controls for adjusting parameters like frame extraction interval and potentially visualizing semantic labels.

  ---
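The key-frame sampling in step 1 of the pipeline can be sketched as follows. This is an illustrative sketch, not an excerpt from the datasheet code: `select_keyframe_indices` and `extract_keyframes` are hypothetical helper names, and the fixed-interval policy is the simplest variant of the configurable sampling described above.

```python
def select_keyframe_indices(total_frames: int, fps: float, interval_s: float) -> list[int]:
    # Sample one frame every `interval_s` seconds of video.
    step = max(1, round(fps * interval_s))
    return list(range(0, total_frames, step))

def extract_keyframes(video_path: str, interval_s: float = 2.0):
    # Decode only the sampled frames with OpenCV (import deferred so the
    # pure index logic above stays dependency-free).
    import cv2
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in select_keyframe_indices(total, fps, interval_s):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames
```

For a 30 fps clip sampled every 2 seconds, this keeps every 60th frame, which matches the redundancy-reduction goal stated for module 1.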
@@ -136,43 +141,43 @@ The Duino-Idar system relies on several core mathematical principles:

  **1. Depth Estimation via Deep Network:**

- Let $I \in \mathbb{R}^{H \times W \times 3}$ represent the input image of height $H$ and width $W$. The DPT model, denoted as $f$, with learnable parameters $\theta$, estimates the depth map $D$:

- **(1) $D = f(I; \theta)$**

  The depth map $D$ is then normalized to obtain $D_{\text{norm}}$:

- **(2) $D_{\text{norm}}(u,v) = \frac{D(u,v)}{\displaystyle \max_{(u,v)} D(u,v)}$**

  If a maximum physical depth $Z_{\max}$ is assumed, the scaled depth $z(u,v)$ is:

- **(3) $z(u,v) = D_{\text{norm}}(u,v) \times Z_{\max}$**

  For practical implementation and visualization, we often scale the depth to an 8-bit range:

- **(4) $D_{\text{scaled}}(u,v) = \frac{D(u,v)}{\displaystyle \max_{(u,v)} D(u,v)} \times 255$**

  **2. 3D Reconstruction with Pinhole Camera Model:**

  Assuming a pinhole camera model with intrinsic parameters: focal lengths $(f_x, f_y)$ and principal point $(c_x, c_y)$, the intrinsic matrix $K$ is:

- **(5) $K = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}$**

  Given a pixel $(u, v)$ and its depth value $z(u,v)$, the 3D coordinates $(x, y, z)$ in the camera coordinate system are:

- **(6) $x = \frac{(u - c_x) \cdot z(u,v)}{f_x}$**

- **(7) $y = \frac{(v - c_y) \cdot z(u,v)}{f_y}$**

- **(8) $z = z(u,v)$**

  In matrix form:

@@ -181,13 +186,13 @@
- **(9) $\begin{pmatrix} x \\ y \\ z \end{pmatrix} = z(u,v) \, K^{-1} \begin{pmatrix} u \\ v \\ 1 \end{pmatrix}$**

  **3. Aggregation of Multiple Frames:**

  Let $P_i$ be the point cloud from the $i^{th}$ frame, where $P_i = \{(x_{i,j}, y_{i,j}, z_{i,j}) \mid j = 1, 2, \ldots, N_i\}$. The overall point cloud $P$ is the union:

- **(10) $P = \bigcup_{i=1}^{M} P_i$**

  where $M$ is the number of frames.
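Equations (2)-(3), (6)-(8), and (10) can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's actual API; `depth_to_point_cloud` and `aggregate` are hypothetical helper names, and color association is omitted for brevity.

```python
import numpy as np

def depth_to_point_cloud(depth: np.ndarray, fx: float, fy: float,
                         cx: float, cy: float, z_max: float = 10.0) -> np.ndarray:
    # Eqs. (2)-(3): normalize the raw depth map and scale by an assumed Z_max.
    z = depth / depth.max() * z_max
    h, w = depth.shape
    # Pixel grid: u indexes columns, v indexes rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Eqs. (6)-(8): back-project each pixel through the pinhole model.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

def aggregate(clouds) -> np.ndarray:
    # Eq. (10): union of per-frame clouds, under the static-camera assumption.
    return np.vstack(clouds)
```

A real intrinsic matrix would come from camera calibration; here `fx`, `fy`, `cx`, `cy` are passed directly, which is equivalent to applying $K^{-1}$ as in Equation (9).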
@@ -195,7 +200,7 @@ where $M$ is the number of frames.

  For fine-tuning PaLiGemma, given an image $I$ and caption tokens $c = (c_1, c_2, \ldots, c_T)$, the cross-entropy loss $\mathcal{L}$ is minimized:

- **(11) $\mathcal{L} = -\sum_{t=1}^{T} \log P(c_t \mid c_{<t}, I)$**

  where $P(c_t \mid c_{<t}, I)$ is the conditional probability of predicting the $t^{th}$ token given the preceding tokens $c_{<t}$ and the input image $I$.
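Equation (11) amounts to summing the negative log-probabilities the model assigns to the ground-truth caption tokens. A minimal sketch with a hypothetical `caption_nll` helper (in practice the per-token probabilities come from the model's softmax output, not from a list):

```python
import numpy as np

def caption_nll(token_probs) -> float:
    # Eq. (11): L = -sum_t log P(c_t | c_<t, I), given the probability the
    # model assigned to each ground-truth caption token.
    return float(-np.sum(np.log(token_probs)))
```

A perfectly confident model (probability 1.0 for every token) yields zero loss; lower per-token probabilities increase the loss additively.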
@@ -217,7 +222,7 @@ pip install transformers peft bitsandbytes gradio opencv-python pillow numpy tor

  ### 4.3 Code Snippets and Dynamicity

- Here are illustrative code snippets demonstrating key functionalities. These are excerpts from the provided datasheet code and are used for demonstration purposes within this paper.

  #### 4.3.1 Depth Estimation using DPT:
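The datasheet's DPT snippet is elided from this diff view. A representative sketch of DPT inference via the Hugging Face `transformers` API is shown below; the model id `Intel/dpt-large`, the `estimate_depth`/`scale_depth_to_uint8` helper names, and the input filename are assumptions for illustration, not excerpts from the actual code.

```python
import numpy as np

def scale_depth_to_uint8(depth: np.ndarray) -> np.ndarray:
    # Eq. (4): map raw depth onto an 8-bit range for visualization.
    return (depth / depth.max() * 255.0).astype(np.uint8)

def estimate_depth(image_path: str) -> np.ndarray:
    # Heavyweight inference, defined but not invoked here. The call pattern
    # follows the transformers DPT interface; it mirrors, rather than
    # reproduces, the datasheet snippet.
    import torch
    from PIL import Image
    from transformers import DPTForDepthEstimation, DPTImageProcessor

    processor = DPTImageProcessor.from_pretrained("Intel/dpt-large")
    model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")
    inputs = processor(images=Image.open(image_path), return_tensors="pt")
    with torch.no_grad():
        predicted = model(**inputs).predicted_depth  # shape (1, H', W')
    return predicted[0].cpu().numpy()

# Usage (not executed here):
#   depth = estimate_depth("frame_000.png")
#   depth_img = scale_depth_to_uint8(depth)
```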
 
@@ -289,24 +294,58 @@ def visualize_3d_model(ply_file):
      pcd = o3d.io.read_point_cloud(ply_file)
      o3d.visualization.draw_geometries([pcd])  # Interactive window

  with gr.Blocks() as demo:
      gr.Markdown("### Duino-Idar 3D Mapping")
      video_input = gr.Video(label="Upload Video", type="filepath")
      process_btn = gr.Button("Process & Visualize")
-     # ... (integrate process_video function from datasheet here) ...

-     process_btn.click(fn=process_video, inputs=video_input, outputs=None)  # Example interaction

  demo.launch()
  ```

- This Gradio code demonstrates the dynamic user interface for Duino-Idar. The `gr.Blocks()` context and its associated components create an interactive web application: users upload a video and trigger the processing pipeline via the "Process & Visualize" button, after which the `visualize_3d_model` function, together with the integrated `process_video` function (not shown in full here for brevity, but present in the datasheet), handles video processing and 3D visualization.

- *(A conceptual screenshot of a Gradio interface would show a video upload area, a process button, and a placeholder or link for viewing the 3D model; a real screenshot cannot be embedded in this text-based output.)*

- *Figure 2: Conceptual Gradio Interface for Duino-Idar. This illustrates a user-friendly web interface for video input, processing initiation, and 3D model visualization access.*

  ---
 
@@ -314,30 +353,54 @@ This Gradio code demonstrates the dynamic user interface for Duino-Idar. The `g

  While this paper focuses on system design and implementation, a preliminary demonstration was conducted to validate the Duino-Idar pipeline. Mobile videos of indoor environments (e.g., living rooms, kitchens, offices) were captured using a standard smartphone camera. These videos were then uploaded to the Duino-Idar Gradio interface.

  The system successfully processed these videos, extracting key frames, estimating depth maps using the DPT model, and reconstructing 3D point clouds. The fine-tuned PaLiGemma model provided semantic labels, such as "sofa," "table," "chair," and "window," which were (in a conceptual demonstration, as full integration is ongoing) intended to be overlaid onto the 3D point cloud, enabling interactive semantic exploration.

- *(A conceptual visualization of a 3D point cloud generated by Duino-Idar would show a sparse but recognizable room scene with furniture; a real 3D rendering cannot be embedded in this text-based output.)*

  *Figure 3: Conceptual 3D Point Cloud Visualization. This illustrates a representative point cloud output from Duino-Idar, showing the geometric reconstruction of an indoor scene.*

  ## 6. Discussion and Future Work

  Duino-Idar demonstrates a promising approach to accessible and semantically rich indoor 3D mapping using mobile video. The integration of DPT-based depth estimation and PaLiGemma for semantic enrichment provides a valuable combination, offering both geometric and contextual understanding of indoor scenes. The Gradio interface significantly enhances usability, making the system accessible to users with varying technical backgrounds.

  However, several areas warrant further investigation and development:

- * **Enhanced Semantic Integration:** Future work will focus on robustly overlaying semantic labels directly onto the point cloud, potentially using point cloud segmentation techniques to associate labels with specific object regions. This will enable object-level annotation and more granular scene understanding.
- * **Multi-Frame Fusion and SLAM:** The current point cloud aggregation is simplistic. Integrating a robust SLAM or multi-view stereo method is crucial for handling camera motion and improving reconstruction fidelity, particularly in larger or more complex indoor environments. This would also address potential drift and inconsistencies arising from independent frame processing.
- * **LiDAR Integration (Duino-*Idar* Vision):** To truly realize the "Idar" aspect of Duino-Idar, future iterations will explore the integration of LiDAR sensors. LiDAR data can provide highly accurate depth measurements, complementing and potentially enhancing the video-based depth estimation, especially in challenging lighting conditions or for textureless surfaces. A hybrid approach combining LiDAR and vision could significantly improve the robustness and accuracy of the system.
- * **Real-Time Processing and Optimization:** The current implementation is primarily offline. Optimizations, such as using TensorRT or mobile GPU acceleration, are necessary to achieve real-time or near-real-time mapping capabilities, making Duino-Idar suitable for applications like real-time AR navigation.
- * **Improved User Interaction:** Further enhancements to the Gradio interface, or integration with web-based 3D viewers like Three.js, can create a more immersive and intuitive user experience, potentially enabling virtual walkthroughs and interactive object manipulation within the reconstructed 3D scene.
- * **Handling Dynamic Objects:** The current system assumes static scenes. Future research should address the challenge of dynamic objects (e.g., people, moving furniture) within indoor environments, potentially using techniques for object tracking and removal or separate reconstruction of static and dynamic elements.

  ## 7. Conclusion

- Duino-Idar presents a novel and accessible system for indoor 3D mapping from mobile video, enriched with semantic understanding through the integration of deep learning-based depth estimation and vision-language models. By leveraging state-of-the-art DPT models and fine-tuning PaLiGemma for indoor scene semantics, the system achieves both geometric reconstruction and valuable scene context. The user-friendly Gradio interface lowers the barrier to entry, enabling broader accessibility for users to create and explore 3D representations of indoor spaces. While this initial prototype lays a strong foundation, future iterations will focus on enhancing semantic integration, improving reconstruction robustness through multi-frame fusion and LiDAR integration, and optimizing for real-time performance, ultimately expanding the applicability and user experience of Duino-Idar in diverse domains such as augmented reality, robotics, and interior design.

  ---
@@ -355,35 +418,4 @@ Duino-Idar presents a novel and accessible system for indoor 3D mapping from mob

  [6] Zhou, Q.-Y., Park, J., & Koltun, V. (2018). Open3D: A modern library for 3D data processing. *arXiv preprint arXiv:1801.09847*.

- [7] Plotly Technologies Inc. (2015). *Plotly Python Library*. https://plotly.com/python/
-
- ---
-
- ## Citation
-
- If you use Duino-Idar in your research, please cite the following paper:
-
- ```bibtex
- @misc{duino-idar-2025,
-   author = {Jalal Mansour (Jalal Duino)},
-   title = {Duino-Idar: Interactive Indoor 3D Mapping via Mobile Video with Semantic Enrichment},
-   year = {2025},
-   publisher = {Hugging Face Space},
-   howpublished = {Online},
-   url = {https://huggingface.co/Duino/duino-idar}
- }
- ```
-
- ## Contact
-
- For questions or collaborations, please contact:
-
- Jalal Mansour (Jalal Duino) - [[email protected]](mailto:[email protected])
-
- ---
-
- ## License
-
- This project is licensed under the MIT License; see the [LICENSE](LICENSE) file for details. (A LICENSE file containing the MIT license text must be added to the repository.)
-
- ---
 
  ---
+ model-index:
+ - name: Duino-Idar
+   paper: https://huggingface.co/Duino/Duino-Idar/blob/main/README.md # Link to this README.md
+   results:
+   - task:
+       type: 3D Indoor Mapping
+     dataset:
+       name: Mobile Video
+       type: Video
+     metrics:
+     - name: Qualitative 3D Reconstruction
+       type: Visual Inspection
+       value: "Visually inspected; subjectively assessed for geometric accuracy and completeness of the point cloud."
+     - name: Semantic Accuracy (Conceptual)
+       type: Qualitative Assessment
+       value: "Qualitatively assessed; subjectively evaluated for the relevance and coherence of semantic labels generated for indoor scenes."
+ language: en
+ license: mit
  tags:
  - 3d-mapping
  - depth-estimation
  - semantic-segmentation
  - vision-language-model
+ - indoor-scene-understanding
  - mobile-video
  - dpt
  - paligemma
+ - gradio
+ - point-cloud
  author: Jalal Mansour (Jalal Duino)
+ date_created: 2025-02-18
  email: [email protected]
+ hf_hub_url: https://huggingface.co/Duino/Duino-Idar
  ---

+ # Duino-Idar: An Interactive Indoor 3D Mapping System via Mobile Video with Semantic Enrichment

+ **Abstract**

+ This paper introduces Duino-Idar, a novel end-to-end system for generating interactive 3D maps of indoor environments from mobile video. Leveraging state-of-the-art monocular depth estimation techniques, specifically DPT (Dense Prediction Transformer)-based models, and semantic understanding via a fine-tuned vision-language model (PaLiGemma), Duino-Idar offers a comprehensive solution for indoor scene reconstruction. The system extracts key frames from video input, computes depth maps, constructs a 3D point cloud, and enriches it with semantic labels. A user-friendly Gradio-based graphical user interface (GUI) facilitates video upload, processing, and interactive 3D scene exploration. This research details the system's architecture, implementation, and potential applications in areas such as indoor navigation, augmented reality, and automated scene understanding, setting the stage for future enhancements including LiDAR integration for improved accuracy and robustness.

  **Keywords:** 3D Mapping, Indoor Reconstruction, Mobile Video, Depth Estimation, Semantic Segmentation, Vision-Language Models, DPT, PaLiGemma, Point Cloud, Gradio, Interactive Visualization.
 
  Recent advancements in computer vision and deep learning have significantly propelled the field of 3D scene reconstruction from 2D imagery. Mobile devices, now ubiquitous and equipped with high-quality cameras, provide a readily available source of video data suitable for spatial mapping. While monocular depth estimation has matured considerably, enabling real-time applications, many existing 3D reconstruction approaches lack a crucial component: semantic understanding of the scene. This semantic context is vital for enabling truly interactive and context-aware applications, such as augmented reality (AR) navigation, object recognition, and scene understanding for robotic systems.

+ To address this gap, we present Duino-Idar, an innovative system that integrates a robust depth estimation pipeline with a fine-tuned vision-language model, PaLiGemma, to enhance indoor 3D mapping. The system's name, Duino-Idar, reflects the vision of combining accessible technology ("Duino," referencing approachability and user-centric design) with advanced spatial sensing ("Idar," hinting at the potential for LiDAR integration in future iterations, although the current prototype focuses on vision-based depth). This synergistic combination not only achieves geometric reconstruction but also provides semantic enrichment, significantly enhancing both visualization and user interaction capabilities. This paper details the architecture, implementation, and potential of Duino-Idar, highlighting its contribution to accessible and semantically rich indoor 3D mapping.

  ---
 
 
  Our work builds upon and integrates several key areas of research:

  ### 2.1 Monocular Depth Estimation:
+
+ The foundation of our geometric reconstruction lies in monocular depth estimation. Models such as MiDaS [1] and DPT [2] have demonstrated remarkable capabilities in inferring depth from single images. DPT, in particular, leverages transformer architectures to capture global contextual information, leading to improved depth accuracy compared to earlier convolutional neural network (CNN)-based methods. Equation (2) illustrates the depth normalization process used in DPT-like models to scale the predicted depth map to a usable range.

  ### 2.2 3D Reconstruction Techniques:
+
+ Generating 3D point clouds or meshes from 2D inputs is a well-established field, encompassing techniques from photogrammetry [3] and Simultaneous Localization and Mapping (SLAM) [4]. Our approach utilizes depth maps derived from DPT to construct a point cloud, offering a simpler yet effective method for 3D scene representation, particularly suitable for indoor environments where texture and feature richness can support monocular depth estimation. The transformation from 2D pixel coordinates to 3D space is mathematically described by the pinhole camera model, as shown in Equations (5)-(9).

  ### 2.3 Vision-Language Models for Semantic Understanding:
+
  Vision-language models (VLMs) have emerged as powerful tools for bridging the gap between visual and textual understanding. PaLiGemma [5] is a state-of-the-art multimodal model that integrates image understanding with natural language processing. Fine-tuning such models on domain-specific datasets, such as indoor scenes, allows for the generation of semantic annotations and descriptions that can be overlaid on reconstructed 3D models, enriching them with contextual information. The fine-tuning process for PaLiGemma, aimed at minimizing the token prediction loss, is formalized in Equation (11).

  ### 2.4 Interactive 3D Visualization:
+
  Effective visualization is crucial for user interaction with 3D data. Libraries like Open3D [6] and Plotly [7] provide tools for interactive exploration of 3D point clouds and meshes. Open3D, in particular, offers robust functionalities for point cloud manipulation, rendering, and visualization, making it an ideal choice for desktop-based interactive 3D scene exploration. For web-based interaction, Plotly offers excellent capabilities for embedding interactive 3D visualizations within web applications.

  ---
 
  The Duino-Idar system is structured into three primary modules, as illustrated in Figure 1:

+ 1. **Video Processing and Frame Extraction:** This module ingests mobile video input and extracts representative key frames at configurable intervals to reduce computational redundancy and capture scene changes effectively.
+ 2. **Depth Estimation and 3D Reconstruction:** Each extracted frame is processed by a DPT-based depth estimator to generate a depth map. These depth maps are then converted into 3D point clouds using a pinhole camera model, transforming 2D pixel coordinates into 3D spatial positions.
+ 3. **Semantic Enrichment and Visualization:** A fine-tuned PaLiGemma model provides semantic annotations for the extracted key frames, enriching the 3D reconstruction with object labels and scene descriptions. A Gradio-based GUI integrates these modules, providing a user-friendly interface for video upload, processing, interactive 3D visualization, and exploration of the semantically enhanced 3D scene.
+
+ **Figure 1: System Architecture Diagram**

  ```mermaid
  graph LR
  style H fill:#eee,stroke:#333,stroke-width:2px
  style I fill:#ace,stroke:#333,stroke-width:2px
  ```
+ *Figure 1: Duino-Idar System Architecture. The diagram illustrates the flow of data through the system modules, from video input to interactive 3D visualization with semantic enrichment.*

  ### 3.2 Detailed Pipeline
 
 
116
 
117
2. **Depth Estimation Module:**
    * **Preprocessing:** Each extracted frame undergoes preprocessing, including resizing and normalization, to optimize it for input to the DPT model. This ensures consistent input dimensions and value ranges for the depth estimation network.
    * **Depth Prediction:** The preprocessed frame is fed into the DPT model, which generates a depth map. This depth map represents the estimated distance of each pixel in the image from the camera.
    * **Normalization and Scaling:** The raw depth map is normalized to a standard range (e.g., 0-1 or 0-255) for subsequent 3D reconstruction and visualization. Equations (2)-(4) detail the normalization and scaling process.

3. **3D Reconstruction Module:**
    * **Point Cloud Generation:** A pinhole camera model is applied to convert the depth map and corresponding pixel coordinates into 3D coordinates in camera space. Color information from the original frame is associated with each 3D point to create a colored point cloud. Equations (5)-(9) formalize this transformation.
    * **Point Cloud Aggregation:** To build a comprehensive 3D model, point clouds generated from multiple key frames are aggregated. In this initial implementation, we assume a static camera or negligible inter-frame motion for simplicity. More advanced implementations could incorporate camera pose estimation and point cloud registration for improved accuracy, especially in dynamic scenes. The aggregation process is represented by Equation (10).

4. **Semantic Enhancement Module:**
    * **Vision-Language Processing:** The fine-tuned PaLiGemma model processes the key frames to generate scene descriptions and semantic labels. The model is prompted to identify objects and provide contextual information relevant to indoor scenes.
    * **Semantic Data Integration:** Semantic labels generated by PaLiGemma are overlaid onto the reconstructed point cloud. This integration can be achieved through various methods, such as associating semantic labels with clusters of points or generating bounding boxes around semantically labeled objects within the 3D scene.

5. **Visualization and User Interface Module:**
    * **Interactive 3D Viewer:** The final semantically enriched 3D model is visualized using Open3D (or Plotly for web-based deployments). Users can interact with the 3D scene, rotating, zooming, and panning to explore the reconstructed environment.
    * **Gradio GUI:** A user-friendly Gradio web interface provides a seamless experience, allowing users to upload videos, initiate the processing pipeline, and interactively navigate the resulting 3D scene. The GUI also provides controls for adjusting parameters such as the frame extraction interval and, potentially, for visualizing semantic labels.
 
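The Semantic Data Integration step above can be sketched as a simple 2D-to-3D label transfer. In this illustrative sketch (not the production implementation), each 3D point remembers the pixel it was back-projected from, and `label_boxes` is a hypothetical mapping from PaLiGemma-derived labels to 2D bounding boxes:

```python
# Sketch: transfer 2D semantic labels onto 3D points via their source pixels.
# `label_boxes` is a hypothetical {label: (u_min, v_min, u_max, v_max)} mapping
# produced by parsing PaLiGemma's output for one key frame.

def label_points(points_with_pixels, label_boxes):
    """Attach a semantic label to every 3D point whose source pixel falls
    inside a labeled 2D bounding box ("unlabeled" otherwise)."""
    labeled = []
    for (x, y, z), (u, v) in points_with_pixels:
        tag = "unlabeled"
        for label, (u0, v0, u1, v1) in label_boxes.items():
            if u0 <= u <= u1 and v0 <= v <= v1:
                tag = label
                break
        labeled.append(((x, y, z), tag))
    return labeled

# Toy example: two points, one originating inside a "sofa" bounding box.
points = [((0.1, 0.2, 1.5), (120, 80)), ((1.0, -0.3, 2.2), (400, 300))]
boxes = {"sofa": (100, 50, 200, 150)}
print(label_points(points, boxes))
```

A production version would amortize the box lookup (e.g., with a spatial index) and resolve overlapping boxes, but the association principle is the same.
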
  ---
 
 
**1. Depth Estimation via Deep Network:**

Let $I \in \mathbb{R}^{H \times W \times 3}$ represent the input image of height $H$ and width $W$. The DPT model, denoted as $f$, with learnable parameters $\theta$, estimates the depth map $D$:

**(1)** $D = f(I; \theta)$

The depth map $D$ is then normalized to obtain $D_{\text{norm}}$:

**(2)** $D_{\text{norm}}(u,v) = \frac{D(u,v)}{\displaystyle \max_{(u,v)} D(u,v)}$

If a maximum physical depth $Z_{\max}$ is assumed, the scaled depth $z(u,v)$ is:

**(3)** $z(u,v) = D_{\text{norm}}(u,v) \times Z_{\max}$

For practical implementation and visualization, we often scale the depth to an 8-bit range:

**(4)** $D_{\text{scaled}}(u,v) = \frac{D(u,v)}{\displaystyle \max_{(u,v)} D(u,v)} \times 255$
 
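The normalization and scaling of Equations (2)-(4) translate directly into code. This minimal sketch operates on a toy 2x2 depth map standing in for a real DPT prediction:

```python
# Sketch of Equations (2)-(4): normalize a raw depth map and rescale it,
# here on a tiny 2x2 "depth map" instead of a real DPT output.

def normalize_depth(depth, z_max=None):
    """Normalize depth to [0, 1] (Eq. 2); optionally scale to metric
    depth via an assumed maximum physical depth z_max (Eq. 3)."""
    d_max = max(max(row) for row in depth)
    norm = [[d / d_max for d in row] for row in depth]
    if z_max is not None:
        return [[d * z_max for d in row] for row in norm]
    return norm

def to_8bit(depth):
    """Scale a raw depth map to the 0-255 range (Eq. 4)."""
    d_max = max(max(row) for row in depth)
    return [[round(d / d_max * 255) for d in row] for row in depth]

raw = [[1.0, 2.0], [3.0, 4.0]]
print(normalize_depth(raw))             # [[0.25, 0.5], [0.75, 1.0]]
print(normalize_depth(raw, z_max=8.0))  # [[2.0, 4.0], [6.0, 8.0]]
print(to_8bit(raw))                     # [[64, 128], [191, 255]]
```

In the actual pipeline the same operations are applied to the full-resolution DPT output (typically as vectorized array operations rather than list comprehensions).
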
**2. 3D Reconstruction with Pinhole Camera Model:**

Assuming a pinhole camera model with intrinsic parameters: focal lengths $(f_x, f_y)$ and principal point $(c_x, c_y)$, the intrinsic matrix $K$ is:

**(5)** $K = \begin{pmatrix}
f_x & 0 & c_x \\
0 & f_y & c_y \\
0 & 0 & 1
\end{pmatrix}$

Given a pixel $(u, v)$ and its depth value $z(u,v)$, the 3D coordinates $(x, y, z)$ in the camera coordinate system are:

**(6)** $x = \frac{(u - c_x) \cdot z(u,v)}{f_x}$

**(7)** $y = \frac{(v - c_y) \cdot z(u,v)}{f_y}$

**(8)** $z = z(u,v)$

In matrix form:

**(9)** $\begin{pmatrix}
x \\
y \\
z
\end{pmatrix} = z(u,v) \, K^{-1} \begin{pmatrix}
u \\
v \\
1
\end{pmatrix}$
 
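The back-projection of Equations (6)-(8) can be sketched as a short routine; the intrinsics below are illustrative values, not calibrated parameters:

```python
# Sketch of Equations (6)-(8): back-project a depth map into camera-space
# 3D points. Intrinsics (fx, fy, cx, cy) here are illustrative, not calibrated.

def backproject(depth, fx, fy, cx, cy):
    """Convert a depth map (list of rows) into a list of (x, y, z) points."""
    points = []
    for v, row in enumerate(depth):    # v: pixel row index
        for u, z in enumerate(row):    # u: pixel column index
            x = (u - cx) * z / fx      # Eq. (6)
            y = (v - cy) * z / fy      # Eq. (7)
            points.append((x, y, z))   # Eq. (8): z is kept as-is
    return points

# Toy 1x2 depth map with unit focal lengths and principal point at the origin.
pts = backproject([[2.0, 4.0]], fx=1.0, fy=1.0, cx=0.0, cy=0.0)
print(pts)  # [(0.0, 0.0, 2.0), (4.0, 0.0, 4.0)]
```

This is exactly the per-pixel application of Equation (9); real implementations vectorize the loop over the whole image.
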
**3. Aggregation of Multiple Frames:**

Let $P_i$ be the point cloud from the $i^{th}$ frame, where $P_i = \{(x_{i,j}, y_{i,j}, z_{i,j}) \mid j = 1, 2, \ldots, N_i\}$. The overall point cloud $P$ is the union:

**(10)** $P = \bigcup_{i=1}^{M} P_i$

where $M$ is the number of frames.
 
 
For fine-tuning PaLiGemma, given an image $I$ and caption tokens $c = (c_1, c_2, \ldots, c_T)$, the cross-entropy loss $\mathcal{L}$ is minimized:

**(11)** $\mathcal{L} = -\sum_{t=1}^{T} \log P(c_t \mid c_{<t}, I)$

where $P(c_t \mid c_{<t}, I)$ is the conditional probability of predicting the $t^{th}$ token given the preceding tokens $c_{<t}$ and the input image $I$.
 
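Equation (11) can be illustrated numerically. In this sketch the per-token probabilities are hypothetical model outputs, not values produced by PaLiGemma:

```python
# Sketch of Equation (11): token-level cross-entropy for one caption, given
# (hypothetical) per-step probabilities the model assigns to the correct tokens.
import math

def caption_loss(token_probs):
    """Negative log-likelihood of a caption: -sum_t log P(c_t | c_<t, I)."""
    return -sum(math.log(p) for p in token_probs)

# Suppose these are the probabilities of the ground-truth tokens of
# "a sofa in a living room" at each decoding step (illustrative numbers).
probs = [0.9, 0.8, 0.95, 0.7, 0.85]
print(round(caption_loss(probs), 3))  # 0.899
```

During fine-tuning this quantity is averaged over a batch and minimized by gradient descent; higher probabilities on the correct tokens drive the loss toward zero.
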
 
### 4.3 Code Snippets

The illustrative code snippets below are excerpted from the Duino-Idar codebase and demonstrate its key functionalities.

#### 4.3.1 Depth Estimation using DPT
 
 
    pcd = o3d.io.read_point_cloud(ply_file)
    o3d.visualization.draw_geometries([pcd])  # Opens an interactive viewer window

def extract_frames(video_path, interval=10):
    """Extract every `interval`-th frame from the video as an RGB PIL image."""
    import cv2
    from PIL import Image
    cap = cv2.VideoCapture(video_path)
    frames = []
    i = 0
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        if i % interval == 0:
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV is BGR; PIL expects RGB
            frames.append(Image.fromarray(frame))
        i += 1
    cap.release()
    return frames

def process_video(video_path):
    """Process video: extract frames, estimate depth, and generate a 3D model."""
    # estimate_depth() and reconstruct_3d() come from the snippets above.
    frames = extract_frames(video_path)
    depth_maps = [estimate_depth(frame) for frame in frames]
    final_pcd = None
    for frame, depth_map in zip(frames, depth_maps):
        pcd = reconstruct_3d(depth_map, frame)
        if final_pcd is None:
            final_pcd = pcd
        else:
            final_pcd += pcd  # Open3D point clouds support in-place concatenation
    o3d.io.write_point_cloud("output.ply", final_pcd)
    return "output.ply"

with gr.Blocks() as demo:
    gr.Markdown("### Duino-Idar 3D Mapping")
    video_input = gr.Video(label="Upload Video")
    process_btn = gr.Button("Process & Visualize")
    output_file = gr.File(label="Generated 3D Model (PLY)")
    view_btn = gr.Button("View 3D Model")

    process_btn.click(fn=process_video, inputs=video_input, outputs=output_file)
    view_btn.click(fn=visualize_3d_model, inputs=output_file, outputs=None)

demo.launch()
  ```

**Figure 2: Example Gradio Interface Screenshot (Conceptual)**

*(Placeholder: a screenshot of the Gradio interface showing the video upload area, the processing button, and access to the generated 3D model.)*

*Figure 2: Conceptual Gradio Interface for Duino-Idar. This illustrates a user-friendly web interface for video input, processing initiation, and 3D model visualization access.*
 
---

While this paper focuses on system design and implementation, a preliminary demonstration was conducted to validate the Duino-Idar pipeline. Mobile videos of indoor environments (e.g., living rooms, kitchens, offices) were captured using a standard smartphone camera. These videos were then uploaded to the Duino-Idar Gradio interface.
To illustrate the depth estimation behavior conceptually, the qualitative sketch below indicates how depth accuracy tends to degrade with distance from the camera.

**Conceptual Depth Accuracy vs. Distance:**

```
Depth Accuracy (Qualitative)
^
| Excellent
|    *
|    *  *
|       *  *
| Good       *
|            *  *
| Moderate         *
+---------------------> Distance from Camera (meters)
```

*This is a highly simplified, qualitative representation. A quantitative evaluation would report metrics such as Root Mean Squared Error (RMSE) or Mean Absolute Error (MAE) against a dataset with ground-truth depth.*
+
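For reference, the RMSE and MAE metrics mentioned above can be computed as follows; the depth values here are toy numbers, not experimental results:

```python
# Sketch: RMSE and MAE between predicted and ground-truth depths, computed
# on toy values (flattened lists of per-pixel depths in meters).
import math

def rmse(pred, gt):
    """Root mean squared error over paired depth values."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))

def mae(pred, gt):
    """Mean absolute error over paired depth values."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)

pred = [1.1, 2.0, 2.9, 4.2]  # hypothetical predicted depths
gt = [1.0, 2.0, 3.0, 4.0]    # hypothetical ground truth
print(round(rmse(pred, gt), 4), round(mae(pred, gt), 4))
```

In practice these would be evaluated per-pixel over a benchmark with sensor-measured depth, often after median scaling to resolve the scale ambiguity of monocular estimation.
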
The system successfully processed these videos, extracting key frames, estimating depth maps using the DPT model, and reconstructing 3D point clouds. The fine-tuned PaLiGemma model provided semantic labels, such as "sofa," "table," "chair," and "window," which were, in this conceptual demonstration (full integration is ongoing), intended to be overlaid onto the 3D point cloud to enable interactive semantic exploration.

**Figure 3: Example 3D Point Cloud Visualization (Conceptual)**

*(Placeholder: a rendering of a 3D point cloud generated by Duino-Idar, showing a sparse but recognizable room scene with furniture.)*

*Figure 3: Conceptual 3D Point Cloud Visualization. This illustrates a representative point cloud output from Duino-Idar, showing the geometric reconstruction of an indoor scene.*

**Figure 4: Semantic Labeling Performance (Conceptual)**

*(Placeholder: a graph illustrating semantic labeling quality, e.g., a bar chart of labeling accuracy per object category.)*

*Figure 4: Conceptual Semantic Labeling Performance.*
## 6. Discussion and Future Work

Duino-Idar demonstrates a promising approach to accessible and semantically rich indoor 3D mapping using mobile video. The integration of DPT-based depth estimation and PaLiGemma for semantic enrichment provides a valuable combination, offering both geometric and contextual understanding of indoor scenes. The Gradio interface significantly enhances usability, making the system accessible to users with varying technical backgrounds.

However, several areas warrant further investigation and development:

* **Enhanced Semantic Integration:** Future work will focus on robustly overlaying semantic labels directly onto the point cloud, potentially using point cloud segmentation techniques to associate labels with specific object regions. This will enable object-level annotation and more granular scene understanding.
* **Multi-Frame Fusion and SLAM:** The current point cloud aggregation is simplistic. Integrating a robust SLAM or multi-view stereo method is crucial for handling camera motion and improving reconstruction fidelity, particularly in larger or more complex indoor environments. This would also address potential drift and inconsistencies arising from independent frame processing.
* **LiDAR Integration (Duino-*Idar* Vision):** To truly realize the "Idar" aspect of Duino-Idar, future iterations will explore the integration of LiDAR sensors. LiDAR data can provide highly accurate depth measurements, complementing and potentially enhancing the video-based depth estimation, especially in challenging lighting conditions or for textureless surfaces. A hybrid approach combining LiDAR and vision could significantly improve the robustness and accuracy of the system.
* **Real-Time Processing and Optimization:** The current implementation is primarily offline. Optimizations, such as using TensorRT or mobile GPU acceleration, are necessary to achieve real-time or near-real-time mapping capabilities, making Duino-Idar suitable for applications like real-time AR navigation.
* **Improved User Interaction:** Further enhancements to the Gradio interface, or integration with web-based 3D viewers like Three.js, can create a more immersive and intuitive user experience, potentially enabling virtual walkthroughs and interactive object manipulation within the reconstructed 3D scene.
* **Handling Dynamic Objects:** The current system assumes static scenes. Future research should address the challenge of dynamic objects (e.g., people, moving furniture) within indoor environments, potentially using techniques for object tracking and removal or separate reconstruction of static and dynamic elements.
 
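The multi-frame fusion direction above amounts to transforming each frame's point cloud by its estimated camera pose before taking the union of Equation (10). A minimal sketch, with hypothetical 4x4 pose matrices standing in for SLAM-estimated poses:

```python
# Sketch: pose-aware point cloud aggregation. Each frame's points are moved
# into a common world frame by a (hypothetical, e.g. SLAM-estimated) 4x4
# homogeneous pose matrix before merging, instead of assuming a static camera.

def transform(points, pose):
    """Apply a 4x4 homogeneous pose matrix to a list of (x, y, z) points."""
    out = []
    for x, y, z in points:
        out.append(tuple(
            pose[r][0] * x + pose[r][1] * y + pose[r][2] * z + pose[r][3]
            for r in range(3)))
    return out

def aggregate(frame_clouds, poses):
    """Union of per-frame clouds, each moved into the world frame (Eq. 10)."""
    merged = []
    for cloud, pose in zip(frame_clouds, poses):
        merged.extend(transform(cloud, pose))
    return merged

identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
shift_x = [[1, 0, 0, 0.5], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
print(aggregate([[(0, 0, 1)], [(0, 0, 1)]], [identity, shift_x]))
# [(0, 0, 1), (0.5, 0, 1)]
```

A full pipeline would estimate the poses themselves (e.g., via SLAM or ICP-style registration) and refine the merged cloud, but the aggregation step reduces to this transform-then-union pattern.
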
## 7. Conclusion

Duino-Idar presents a novel and accessible system for indoor 3D mapping from mobile video, enriched with semantic understanding through deep learning-based depth estimation and vision-language models. By leveraging state-of-the-art DPT models and fine-tuning PaLiGemma for indoor scene semantics, the system delivers both geometric reconstruction and valuable scene context, while the user-friendly Gradio interface lowers the barrier to entry for creating and exploring 3D representations of indoor spaces. This initial prototype lays a strong foundation; future iterations will focus on enhancing semantic integration, improving reconstruction robustness through multi-frame fusion and LiDAR integration, and optimizing for real-time performance, ultimately expanding the applicability of Duino-Idar in domains such as augmented reality, robotics, and interior design.
  ---
 
 
[6] Zhou, Q.-Y., Park, J., & Koltun, V. (2018). Open3D: A modern library for 3D data processing. *arXiv preprint arXiv:1801.09847*.

[7] Plotly Technologies Inc. (2015). *Plotly Python Library*. https://plotly.com/python/