Learning to Predict 3D Meshes from a Single Image via Depth Consistency

Hao Huang; Shaoli Liu; Jianhua Liu; Peng Jin

doi:10.1186/s10033-025-01335-2

Learning to Predict 3D Meshes from a Single Image via Depth Consistency

Huang, Hao; Liu, Shaoli; Liu, Jianhua; Jin, Peng 2025-08-25 00:00:00 Reconstructing three-dimensional (3D) shapes from a single image remains a significant challenge in computer vision due to the inherent ambiguity caused by missing or occluded shape information. Previous studies have predominantly focused on mesh models supervised by multi-view silhouettes. However, such methods are limited in reconstructing fine details. In this study, a 3D mesh model is predicted from a single image, leveraging depth consistency and without requiring viewpoint pose annotations. The model effectively learns strong shape priors that preserve finer structures and accurately predicts view poses from "correlation-supervised" viewpoints. Addition- ally, standard deviation and Laplacian losses were employed to regulate mesh edge distribution, resulting in more precise reconstructions. Differentiable renderer functions were derived from the 3D mesh to generate depth maps. Compared to conventional approaches, the proposed method provided superior representation of subtle structures. When applied to both synthetic and real-world datasets, the model outperformed existing methods in view-based 3D reconstruction tasks. Keywords Depth-consistency, Mesh, Standard deviation loss, View-based reconstruction during reconstruction, as shown in Figure 1, which limits 1 Introduction the utility of traditional reconstruction techniques. With the development of robotics, autonomous driv- Recently, neural networks have been used success- ing, and 3D animation, 3D shape inference of objects fully to infer 3D shape based on a single view [1]. The has become a mainstream research field. When humans convolutional layer of the neural network can resolve obtain 3D shape priors from objects or computer-aided the underlying features from the input image, where the design (CAD) models, they can estimate the 3D shape 3D shape is output as voxels [2–4], point clouds [5], or a of an object from a single view. However, estimating mesh [6]. However, voxels require more memory, which detailed 3D shapes from a single image/perspective is still compromises computational efficiency, while the point challenging for computer vision. It is impossible to find cloud representation loses important surface details [7]. matching points in corresponding images when using Compared with the voxel and point cloud representa- conventional graphics techniques for reconstruction. In tions, the mesh retains more important shape details and addition, the smooth surface or occlusion of the object represents surface topology. Furthermore, the mesh is makes it difficult to obtain significant feature points more likely to find real applications, as it has the capac - ity to efficiently model shape details [7, 8]. Recent stud- ies on single-view mesh reconstruction have proposed *Correspondence: recreating the 3D mesh by deforming a template model Shaoli Liu [email protected] based on perceptual features extracted from the input School of Mechanical Engineering, Beijing Institute of Technology, image. The reconstructed results usually have a topologi - Beijing 100081, China cal structure identical to that of the template model (e.g., Shenzhen Bay Laboratory, Institute of Biomedical Engineering, Shenzhen 518132, China a sphere or unit). Although promising results have been © The Author(s) 2025. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/. Huang et al. Chinese Journal of Mechanical Engineering (2025) 38:165 Page 2 of 13 projection is obtained using the camera pose. The 3D shape can be inferred by rendering with dense pixel- level supervision [6]. Mesh models in the literature mainly used silhouettes as the view data [6, 14, 28, 29, 31], as silhouettes are binary; this approach allows for simplified forward rendering and backward gradient propagation. However, silhouette data do not reflect changes in mesh vertices in the depth direction, which prevents the renderer from moving from flow gradi - Figure 1 The challenges of traditional reconstruction methods: ents to coordinates in the depth direction. Therefore, occlusion and smoothness the learned shape prior is “weaker”, so that the recon- structed shape is similar to a visual hull (VH) [32, 33] lacking detail. Given that normal and depth maps con- achieved, finer structures such as a hole in the back of a tain rich geometric information, they are commonly chair cannot be reconstructed. adopted as supervision signals for 3D shape inference, Most pioneering approaches [1, 7, 9–20] use 3D ground in the form of point clouds and voxels. While recon- truth to learn the shape priors of a mesh. For exam- structing the mesh model using view-based methods, ple, taking the Chamfer distance (CD) with the ground the connection of vertices in the mesh model increases truth point clouds as the loss function allows the point the difficulty of differentiability for the renderer. The clouds [7, 19, 21] and 3D meshes [22] to be reconstructed silhouettes are usually taken as supervisory signals; the directly from a single view. Wei et al. [10] fused the CD [27], l2-norm [28], negative intersection over union unambiguous parts of point clouds inferred from mul- (IoU) [14], and element-wise product [6, 14] are taken tiple views; based on the loss function of voxel models as training losses. The renderer is usually only differen - using binary cross entropy, the latent features of multi- tiable on the silhouettes; however, the silhouettes only ple views were integrated to predict the 3D volume [1, 13, contain VH information lacking detail, resulting in a 17, 18, 23]. Tatarchenko et al. [16] generated high-resolu- coarse 3D shape. tion voxel models using an octree representation. Some Besides, cameras should be calibrated when project- studies have focused on advanced loss functions. Gwak ing a 3D model onto an image. However, it is difficult to et al. [20] used a discriminator to examine generated and calibrate all of the cameras, because the camera poses ground truth shapes and to reconstruct more realistic are almost impossible to obtain in practical application. models. Jiang et al. [12] used the CD and a discrimina- Moreover, training traditional model convolutional neu- tor to constrain an estimated point cloud to follow the ral networks (CNNs) with camera pose requires a large 3D ground truth shape, locally and globally. Some recent amount of annotated data, yet it cannot predict the cor- approaches [9, 11] have constrained the mesh model responding camera pose. Renderer differentiability also using the distance of vertices to the surface of the mesh limits the use of images for supervision. Because the as a signed distance field (SDF); however, obtaining the values between pixels in an image projected by the 3D ground truth SDF of each vertex requires considerable model are discontinuous, the camera parameters are not computational resources. Although 3D shapes of satis- differentiable from the projected points. Thus, backprop - factory quality can be produced with these methods, 3D agation of the loss function is difficult, resulting in failure annotation involves a significant amount of manual work, to update network weights. As such, to predict the cam- which is not practical or feasible in all situations. era poses, it is necessary to make the projection points View-based reconstruction has recently gained sig- and camera parameters differentiable. nificant attention, as a more detailed model can be The objective of this study was to achieve detailed 3D reconstructed using a red-green-blue (RGB) or depth mesh reconstruction through view-based training from map for supervision. Recent work has involved view- a single viewpoint, leveraging the principle of depth based training to reduce the reliance on 3D ground consistency. In contrast to recent methodologies [4, 6, truth. The underlying principle of view-based training 14, 27–31], multi-view depth maps were employed as is photo-consistency [13, 24, 25], i.e., obtaining single- supervisory signals. A graphical representation of the or multiple-view 3D results and minimizing rendering research objective is provided in Figure 2. The rendered loss using observed view data. The view data com - depth maps captured variations in mesh vertex posi- monly consist of two silhouettes [2, 3, 6, 14, 26–31], tions along the depth axis, providing crucial informa- a 2.5D depth map, and a normal map [2–4]. Render- tion for enhancing reconstruction precision. To facilitate ing is the image formulation mechanism; a 3D shape depth error backpropagation, the renderer’s gradient was Huang et al. Chinese Journal of Mechanical Engineering (2025) 38:165 Page 3 of 13 Figure 2 The graphical representation of the research objective approximated using a customized function [14], which applied to synthetic and realistic datasets, and outper- enabled the model to learn robust shape priors and accu- formed the view-based training methods; results com- rately reconstruct fine geometric details. parable to those of typical fully supervised methods with As the 3D ground truth is not given, the learned shape weaker supervision were obtained. priors are unaware of the unobserved views, and may reconstruct ambiguous shapes for occluded or unseen 2 Methods parts [10, 34–36]. Therefore, a depth map with 20 views The proposed framework consists of four parts: fea - was used for supervision, to reduce ambiguity [29, 30]. ture map extraction, initial model deformation, differ - Well-calibrated cameras are necessary when rendering; entiable rendering, and pose prediction of novel views, however, extra work is required, preventing practical which is shown in Figure 3. To reduce the dependence application. Therefore, we propose a network structure, on the camera pose ground truth, the novel viewpoint PoseCNN, to predict the poses of multiple viewpoints; poses prediction network was trained to obtain the dif the framework can be applied in the absence of camera ferent camera parameters. Then the two regularization calibration. The ground truth depth value of the projec - terms (standard deviation and Laplacian) were added to tion was interpolated bilinearly based on local four-pixel improve the reconstruction performance. neighborhoods. The initial mesh created by previous researches [6, 14, 29] was not usable due to the many self-intersections and large numbers of meshes con- 2.1 Proposed SingleV ‑ iew Reconstruction Framework centrated on flat faces (i.e., there were fewer meshes in In the feature extraction step, a shape was generated by the finer structures). To solve this issue, we incorpo - deforming a pre-defined cube, following typical learn - rated the standard deviation loss of the edge length and ing-based mesh reconstruction methods [7, 28, 29]. Laplacian loss [7] to refine the mesh. The integration of ResNet-18 [37] was used as the image encoder (Enc) to these two terms provided more uniform edge lengths, as compute the latent representations F from a single-view well as meshes that were reasonably distributed. u Th s, RGB image I . The face of each cube had 16 × 16 vertices. our inferred 3D models show finer structures and more The associated feature maps F were generated separately details. by the shape decoder (‘Dec’). In summary, the contributions of this work are as fol- In the initial model deformation step, since the vertices lows: (1) A 3D shape reconstruction model is proposed on the cube edges are shared by two faces, a total of 1,352 that uses depth maps for supervision, learning strong vertices are present. The six feature maps F are concat- shape priors for single-view 3D mesh inference. Fine enated to generate the bias F using a fully connected layer. details are addressed by a view-based training strategy. The initial cube is deformed by F to form the 3D mesh (2) Standard deviation and Laplacian losses were incor- model. Then the ground truth depth maps D are scattered porated as regularization terms to improve the recon- and have N novel views. The 3D structures recovered from struction performance; their effects were explored in different depth maps are consistent with each other, based ablation experiments. (3) The proposed model was Huang et al. Chinese Journal of Mechanical Engineering (2025) 38:165 Page 4 of 13 Figure 3 Framework of the proposed method (Our method uses multiple depth maps D as supervisory signals and combines two regularization terms (L , L ) to derive the total loss, which aids learning of strong shape priors for single-view reconstruction) std lap Figure 5 Backpropagation of the loss Figure 4 Projected vertices of one piece of the mesh and the pixels of the image on which the viewpoint poses (R , t ) of novel views can be are the areas of the corresponding triangles. The gradient n n predicted by the proposed network, PoseCNN. of the network loss L with respect to the vertex is given in Finally, with the predicted camera poses (R , t ), the Eq. (2) and (3): n n reconstructed model is rendered to yield novel views, and ∂L ∂L ∂P ∂L the rendered depth maps are D . Inspired by Kato et al. n = = a , 1 (2) ∂V ∂P ∂V ∂P 1 i 1 i [14], the gradient of the renderer is approximated by the difference in depth value between two adjacent pixels, ∂L ∂L ∂d ∂L ∂d which makes the renderer differentiable and the framework i i+1 = + . (3) trainable. ∂P ∂d ∂P ∂d ∂P i i i i+1 i V V V is the projection of one of the mesh m , as 1 2 3 i shown in Figure 4. Pixel P is located in V V V , which i 1 2 3 Only the manually designed renderer-related gradients can be represented as the sum of weights using Eq. (1): ∂d /∂P need to be calculated, as shown in Figure 5. They i i P = a V + a V + a V , i 1 1 2 2 3 3 (1) are approximated by the depth difference of adjacent pixels. Assuming ∂L /∂d = g and for the case of pixel P moving i i i i where a = S /S , a = S /S , 1 �P V V �V V V 2 �P V V �V V V i 1 2 1 2 3 i 1 3 1 2 3 to the right and left, the gradients are approximated as Eq. a = S /S and S , S , S 3 �P V V �V V V P V V P V V P V V i 2 3 1 2 3 i 1 2 i 1 3 i 2 3 (4) and (5), respectively: Huang et al. Chinese Journal of Mechanical Engineering (2025) 38:165 Page 5 of 13 ∂L ∂L ∂d ∂L ∂d i i+1 = + ∂c ∂d ∂c ∂d ∂c i i i+1 i (4) = g (d − d ) + g (d − d ) i i−1 i i+1 i i+1 right =g , ∂L ∂L ∂d ∂L ∂d i i−1 = + ∂d ∂c ∂d ∂c ∂c i i i+1 i (5) = g (d − d ) + g (d − d ) i i i+1 i−1 i−1 i left =g , where d , d , and d are the depth values of P , P , i−1 i i+1 i−1 i and P , resp e ctively . i+1 2.2 P ose Prediction of the Novel Views Some of the calibrated cameras can be taken as the ref- Figure 6 Network architectures: (a) Architecture of shape decoder, erence camera Cam with depth maps dep . The world 0 r (b) Architecture of PoseCNN coordinate system (WCS) is used as their common coor- dinate system. As a novel view, the pose pose = (R , t ) n n n of an uncalibrated camera Cam = (1 < n < N) is given 2.3 Structures of the Shape Decoder and PoseCNN by the rotation R and translation t relative to the WCS, n n The shape decoder, PoseCNN, and dimensions of each which is predicted by PoseCNN from depth maps dep . layer are shown in Figure 6. The layer symbols are Pixel X = (x , y , d ) in dep is unprojected into the WCS i i i n defined as follows: (1) Bn+ relu: Batch normalization −1 by p = R (KX − t ) to obtain the point cloud, where and rectified linear unit (ReLU) layers; (2) Deconv: d is the depth value for pixel X . We assumed a default 2D deconvolution layer with a kernel size of 3×3 and intrinsic matrix with an orthographic camera K . Initially , stride size of 2×2; (3) Conv: 2D convolution layer with the point clouds are projected onto Cam by Eq. (6): a kernel size of 1×1 and stride size of 1×1; (4) FC: fully connected layer; (5) Concat: stacks the feature maps X = K (R p + t ) = (x , yˆ , dep ˆ ). (6) n n i i 0 of six surfaces. The shape decoder outputs the (x , y, z) Because the point cloud is dense, many points are pro- coordinates of the 1352 vertices of the mesh model. The jected onto the same pixel. Therefore, the point nearest PoseCNN output includes a quaternion and the (x, y, z) to the depth map is the only one preserved, and (x ˆ , y ˆ ) is i i positions of the viewpoints. the image coordinate of the projected point. The depth The proposed shape prior learning framework is mainly of the nearest point is the rendered depth dep ˆ , which 0 supervised by the depth-consistency for depth loss, given can be obtained by the method proposed in reference by: [38]. The bilinear sampling at location (x ˆ , y ˆ ) is differen - i i tiable with respect to the camera pose (R , t ) , such that n n L = D − D . (7) dep n n the framework is differentiable with respect to the view - n=1 point pose prediction. PoseCNN is trained by minimiz- ing ||dep ˆ − dep || . However, the mesh surfaces may exhibit serious self- 0 0 2 If no camera is calibrated, we can take any camera in intersection problems, which can be prevented by Lapla- the novel views as Cam , and its camera coordinate cian loss using Eq. (8): system (CCS) is defined as the WCS to predict other   1352 1352 uncalibrated novel views; an absolute camera coordinate � �   L = p − , (8) system is not necessary. More reference cameras Cam lap i �U (p )� i=1 k∈U (p ) contribute to the pose prediction accuracy, which is i explored in the experiments (Section 3). The viewpoint is where p is a vertex of the inferred mesh model and U (p ) i i assumed at a known location, which can also be inferred are the adjacent vertices connected to p . The Laplacian by extending the PoseCNN, such that we only analyze the loss L may not contribute to accuracy improvement lap rotation prediction error ε (in degrees). [14]; In fact, it may reduce the accuracy when a large Huang et al. Chinese Journal of Mechanical Engineering (2025) 38:165 Page 6 of 13 weight is applied (See the experimental results in Sec- Table 1 IoU scores for ShapeNet obtained with view-based training methods (Baselines: NR1 [29], NR2 [14], SOFT [6]) tion 3 for details). Besides the self-intersections, the meshes are placed ran- Methods NR1 NR2 SOFT OURS domly on the object surface. There are numerous dense IoU 60.2 65.5 64.6 67.8 meshes on flat surfaces, but fewer on complex parts, which hinders the reconstruction of finer structures. We propose regularizing the shape using the standard deviation of the edge length of the meshes, which makes the mesh distribu- 3.2.1 Quantitative Evaluations tion more even. L is calculated as follows: std Twenty depth maps D were rendered from random views, and the proposed network was trained accord- M (l ) i=1 l − i ingly. Table 1 presents a comparison between our i=1 M (9) L = , approach and typical view-based training methods [6, std 14, 29] in terms of the IoU score. The results indicate that the proposed method outperformed the others. where l is the length of the i mesh edge and M is the i th Furthermore, since these methods are supervised by number of mesh edges. The experimental results dis - silhouettes, their optimal performance approaches that cussed in Section 3 showed that these two regularization of VH. Table 2 reports the IoU scores of fully super- terms improve reconstruction accuracy by setting rea- vised methods, where most models [1, 4, 10, 13, 15, 16, sonable weights (W and W ). The total loss is given by lap std 30] were represented using voxels. However, generat- the following: ing high-resolution models proves challenging due to L = L + W L + W L . dep lap lap std std (10) the memory inefficiency of voxels, limiting their accu - racy. Other studies employed mesh models [7] or point clouds [19], supervised by CD [7] or Earth Mover’s Dis- 3 Experiments tance [19]. These losses only adjust the mesh vertices In this section, the proposed reconstruction framework within a small field [12]. While the results of Thai et al. was validated using both synthetic (ShapeNet [39]) and [9] were slightly superior to ours, their fully supervised natural image datasets (Pix3D [40]), each containing 3D method utilized RGB images, estimated depth, and CAD models. The IoU was computed, with quantitative normal maps as inputs. When using only a single image results scaled by a factor of 100. All rendered and real RGB as input, our model achieved better performance (67.8 images, as well as the rendered depth maps, were produced vs. 65.0). at a resolution of 224×224. The datasets were randomly partitioned into training (80%) and testing (20%) subsets. 3.2.2 Qualitative Results The network optimization was conducted using the Adam Objects with small structures, particularly concavi- optimizer. ties, were effectively reconstructed using the proposed method. As shown in Figure 7, finer details, such as the 3.1 Implementation Details tail of an airplane and the holes in a chair, were accu- For each 3D model in ShapeNet, 24 RGB images were ren- rately reconstructed by the model. In Figure 8, the dered with an azimuth angle increment of 15°, elevation meshes generated by NR1 [29] and NR2 [14] exhibited angles of 30°, and eight depth maps from fixed viewpoints significant self-intersection issues. In contrast, the pro - corresponding to the eight corners of a central cube. Addi- posed model produced smoother meshes, with more tionally, for each 3D model in ShapeNet and Pix3D, depth reasonable distribution. Finer structures, such as the maps were rendered from 100 randomly selected novel tail and wings of the planes, were better captured with viewpoints. The network was optimized with parameters a a sufficient mesh representation. −4 = 1×10 , b = 0.5, and b = 0.999 across all experiments. 1 2 A batch size of 5 was employed, and the learning rate was −43.2.3 Ablation Study set to 4×10 . The reconstructed and ground truth models Controlled experiments were conducted to validate the were converted into voxel grids, and the IoU was calculated significance of different components. The effects of two for quantitative evaluation. regularization terms, L and L , were investigated. To std lap assess the impact of the proposed regularization terms, 3.2 R econstruction Results on the Synthetic Dataset the following experiments were performed: Laplacian Similar to traditional view-based training methods, 13 regularization (LR) only, standard deviation regulariza- categories from the ShapeNet dataset were utilized to tion (SR), and both Laplacian and standard deviation train the PoseCNN and reconstruction network. Huang et al. Chinese Journal of Mechanical Engineering (2025) 38:165 Page 7 of 13 Table 2 IoU scores for ShapeNet obtained with fully supervised methods (Baselines: P2V [17], 3D-R2N2 [1], PTN [30], OGN [16], LSM [13], Matry [15], VTN [15], PSGN [15], DRC [4], SDFNet-Img [9], SDFNet-Est [9], and P2M [7]) Methods P2V 3D‑R2N2 PTN OGN LSM Matry VTN IoU 66.1 63.4 57.4 59.6 61.5 63.5 64.1 Methods PSGN DRC SDFNet‑Img SDFNet‑Est P2M OURS IoU 64.0 54.5 65.0 68.0 61.0 67.8 Figure 7 Reconstructed models using ShapeNet (GT: ground truth, Baselines NR1 [29], NR2 [14], P2M [7], and SOFT [6]) Figure 8 Mesh distribution (GT: ground truth. Baselines: NR1 [29], NR2 [14]) Huang et al. Chinese Journal of Mechanical Engineering (2025) 38:165 Page 8 of 13 supervision, the improvement was markedly greater, with an increase of 3.6 percentage points (54.5 vs. 58.1), demonstrating that the regularization terms are par- ticularly beneficial for reconstruction under limited supervision. The weights of the regularization terms were exam - ined under the condition of 20 views being used for supervision. The relationships between the weights and the IoU and CD values are illustrated in Figure 9, while Figure 10 presents the reconstruction models gener- ated with varying weights (W and W ). The CD, uti - std lap lized as a metric index, quantifies the distance between Figure 9 IoU obtained using different numbers of views (Baseline: two point cloud sets [19], offering an evaluation of the without Laplacian and standard deviation regularization terms. LR: reconstruction accuracy and consistency of the gener- Laplacian regularization only; SR: standard deviation regularization ated models. only; LSR: both Laplacian and standard deviation regularization) The weights of the regularization terms were inves - tigated using 20 views for supervision. Figure 11 illus- trates the relationships between the weights and the regularization (LSR). The results with varying numbers IoU and CD values. Figure 12 presents reconstruction of views are presented in Figures 9 and 10, where the models with varying weights of W and W . The std lap baselines were trained without the regularization terms results in Figure 11 suggest that W and W should std lap L and L . std lap remain below 3 and 0.003, respectively, to maintain Notably, the visual quality of the generated meshes reconstruction accuracy. Furthermore, Figure 11 indi- was significantly enhanced by the inclusion of regu - cates that the trends in CD and IoU were consistent. larization terms, as indicated by the IoU score, with 20 As shown in Figure 12, the models failed to preserve views used for supervision. As shown in Figure 9, SR details when either W and W was set to relatively std lap had a more substantial impact on the reconstruction high values. For all experiments, W = 3 and W = std lap results than LR. The former better distributes mesh 0.003 were selected. These regularization terms are cru - edges globally, whereas the latter primarily affects local cial for producing an accurate mesh model. mesh edges. When only three views were available for Figure 10 Reconstructed ShapeNet models obtained using different numbers of views (GT: ground truth. ‘B’: baseline trained without the Laplacian and standard deviation regularization terms. ‘P’: proposed method trained with Laplacian and standard deviation regularization terms) Huang et al. Chinese Journal of Mechanical Engineering (2025) 38:165 Page 9 of 13 Figure 11 Relationships between reconstruction accuracy and regularization weights: (a), (c) Relationship between W and reconstruction std accuracy, (b), (d) Relationship between W and reconstruction accuracy lap 3.3 R econstruction on a Natural Image Dataset corners of a central cube, which were used to validate the Evaluating a method on a natural image dataset is criti- prediction accuracy of novel viewpoint poses. cal for assessing its real-world practicality. Pix3D [40], The accuracy of pose predictions was evaluated by a real-world dataset consisting of 3D CAD models and using varying numbers of fixed views as references to corresponding 2D multi-view images, was used for this predict the poses of 10 randomly rendered novel views. purpose. Multi-view depth maps and silhouettes were Table 4 demonstrates the relationship between the num- rendered, with typical studies [6, 7, 14, 29] serving as the ber of reference views and the rotation prediction error. baseline for comparison. When three fixed views were used as references, the Reimplementation results are presented in Figure 13 error was 0.61°, while the error decreased to 0.34° with and Table 3, where the proposed method outperformed eight fixed reference views, which is sufficiently accurate the baseline. The improvement was particularly signifi - for the training framework. cant for the desk category. As shown in Figure 13, the Robust pose prediction was demonstrated by using empty section of the desk was clearly distinguished from view 3 as the reference, successfully predicting the poses the other parts. of the other seven views. Table 5 presents the predicted results, with the notable observation that the prediction 3.4 V iewpoint Pose Predictions for view 6 differed significantly from that of view 3. This Figure 14 presents the rendering of eight depth maps suggests that the pose prediction method is robust to from fixed viewpoints, corresponding to the eight appearance variations and occlusions. Huang et al. Chinese Journal of Mechanical Engineering (2025) 38:165 Page 10 of 13 Figure 12 Reconstructed models with different weights (W =1, 2, 10, 20; W =0.003, 0.005, 0.01, 0.1, GT: ground truth) std lap standard deviation losses were incorporated as regu- 4 Conclusions larization terms. These regularization terms contrib - In this study, a 3D shape reconstruction model utiliz- uted to smoother surfaces and a more reasonable mesh ing multiple depth maps as supervisory signals for mesh distribution. Controlled experiments were conducted to inference from a single view has been presented. The assess the impact of these regularization terms, and the depth maps, containing embedded 3D object structures, results demonstrated substantial improvements when the facilitated effective shape prior learning and multi-view weights for Laplacian and standard deviation losses were pose prediction. A PoseCNN structure was introduced set to 0.003 and 3, respectively. Evaluations across both to predict novel camera views, while Laplacian and Huang et al. Chinese Journal of Mechanical Engineering (2025) 38:165 Page 11 of 13 synthetic and real-world datasets revealed superior per- formance compared to typical view-based training meth- ods [6, 14, 29] and most fully supervised approaches. The generated models achieved comparable quality to those obtained using fully supervised methods, despite relying on weaker supervision. This study demonstrates that the proposed approach successfully learns robust shape priors and reconstructs 3D models with finer structural details, underscoring its potential as an effective solution for 3D reconstruc - tion tasks under limited supervision. Future research will extend the capabilities of the pro- posed 3D shape reconstruction model by incorporat- ing additional image parameters beyond object shapes. Humans can estimate various visual cues, such as light- ing conditions, reflection properties, and camera pose from RGB images, and it is expected that neural net- works are expected to be capable of replicating these abilities. A promising direction for this development is the application of self-supervised learning approaches. Figure 13 Reconstructed models on Pix3D (Baseline: NR1 [29], NR2 As depicted in Figure 15, neural networks can be [14], P2M [7], SOFT [6]) leveraged to infer shape and other image parameters simultaneously. These inferred parameters are then used to render and optimize input images by minimiz- ing the backpropagated RGB rendering loss, eliminat- ing the need for additional ground truth beyond the Table 3 A comparison of the IoU scores for Pix3D (Baselines: NR1 input RGB images. In contrast, current methods often [29], NR2 [14], P2M [7], SOFT [6]) rely on mapping RGB images onto meshes to generate Methods NR1 NR2 P2M SOFT OURS visually appealing results, while the underlying shape is optimized using extra ground truth, such as multi- IoU 58.7 63.3 58.9 61.2 64.9 view silhouettes. Self-supervised 3D reconstruction, by removing the reliance on such external ground truth, represents one of the most promising advancements in this field. In this context, the proposed method’s capacity for shape prior learning, supervised by depth maps, can be extended to RGB images in future work. Given the structural similarities between depth maps and RGB images—despite differences in the number of chan - nels—depth maps, easily obtainable via structured light sensors in practical applications, can serve as a basis for this extension. Thus, the proposed framework lays the groundwork for further exploration in self-supervised shape prior learning, especially in real-world settings where ground truth may not be readily available. Figure 14 Changes among the eight views (especially for views with opposite orientations) Huang et al. Chinese Journal of Mechanical Engineering (2025) 38:165 Page 12 of 13 Table 4 Relationship between the number of reference views N and rotation prediction error N 1 2 3 4 5 6 7 8 ε (°) 2.77 1.30 0.61 0.42 0.38 0.35 0.34 0.34 Table 5 Prediction results and error ε of seven fixed views, taking view 3 as the reference Views Rotation (quaternion) ε View 1 OURS − 0.31 0.18 0.41 0.83 1.04 GT 0.33 − 0.17 − 0.42 − 0.82 View 2 OURS − 0.39 0.83 0.34 0.18 2.87 GT 0.42 − 0.82 − 0.33 − 0.17 View 4 OURS − 0.17 0.33 − 0.83 − 0.42 0.29 GT 0.18 − 0.33 0.82 0.42 View 5 OURS − 0.14 0.33 0.82 0.43 1.35 GT 0.17 − 0.34 − 0.82 − 0.42 View 6 OURS − 0.81 0.43 0.17 0.34 0.99 GT 0.82 − 0.42 − 0.17 − 0.33 View 7 OURS 0.34 − 0.17 0.42 0.82 0.39 GT 0.33 − 0.17 0.42 0.82 OURS 0.41 − 0.82 0.34 0.17 View 8 0.99 GT 0.42 − 0.82 0.33 0.17 Funding Supported by National Key Research and Development Program of China (Grant No. 2024YFB3409800), Postdoctoral Fellowship Program of CPSF of China (Grant No. GZB20240940), and China Postdoctoral Science Foundation (Grant Nos. 2025T181108, 2024M764127). Data availability The raw/processed data required to reproduce these findings cannot be shared at this time as the data also forms part of an ongoing study. Declarations Competing Interests The authors declare no competing financial interests. Received: 23 August 2023 Revised: 10 July 2025 Accepted: 23 July 2025 References [1] C B Choy, D Xu, J Y Gwak, et al. 3d-r2n2: A unified approach for single Figure 15 Self-supervised 3D reconstruction framework (Image and multi-view 3d object reconstruction. Computer Vision–ECCV 2016: formation factors include the camera pose, lighting, shape, 14th European Conference, Amsterdam, The Netherlands, October and reflection) 11-14, 2016, Springer International Publishing, 2016: 628-644. [2] J Wu, Y Wang, T Xue, et al. Marrnet: 3d shape reconstruction via 2.5 d sketches. Advances in Neural Information Processing Systems, 2017, 30. [3] S Tulsiani, A A Efros, J Malik. Multi-view consistency as supervisory sig- Acknowledgements nal for learning shape and pose prediction. Proceedings of the IEEE Con- Not applicable. ference on Computer Vision and Pattern Recognition, 2018: 2897-2905. [4] S Tulsiani, T Zhou, A A Efros, et al. Multi-view supervision for single-view Authors’ Contributions reconstruction via differentiable ray consistency. Proceedings of the IEEE HH wrote the manuscript and was in charge of the whole trial; SLL and JHL Conference on Computer Vision and Pattern Recognition, 2017: 2626-2634. supervised the whole work of this paper; PJ assisted review and editing. All authors read and approved the final manuscript. Huang et al. Chinese Journal of Mechanical Engineering (2025) 38:165 Page 13 of 13 [5] C H Lin, C Kong, S Lucey. Learning efficient point cloud generation for [28] A Kanazawa, S Tulsiani, A Efros, et al. Learning category-specific mesh dense 3d object reconstruction. Proceedings of the AAAI Conference on reconstruction from image collections. Proceedings of the European Con- Artificial Intelligence, 2018, 32(1). ference on Computer Vision (ECCV), 2018: 371-386. [6] S Liu, T Li, W Chen, et al. Soft rasterizer: A differentiable renderer for [29] H Kato, Y Ushiku, T Harada. Neural 3d mesh renderer. Proceedings of image-based 3d reasoning. Proceedings of the IEEE/CVF International the IEEE Conference on Computer Vision and Pattern Recognition, 2018: Conference on Computer Vision, 2019: 7708-7717. 3907-3916. [7] N Wang, Y Zhang, Z Li, et al. Pixel2mesh: Generating 3d mesh models [30] X Yan, J Yang, E Yumer, et al. Perspective transformer nets: Learning single- from single rgb images. Proceedings of the European Conference on Com- view 3d object reconstruction without 3d supervision. Advances in Neural puter Vision (ECCV ), 2018: 52-67. Information Processing Systems, 2016, 29. [8] T Goueix, M Fisher, V G Kim, et al. Atlasnet: A papier-mch approach to [31] S Goel, A Kanazawa, J Malik. Shape and viewpoint without keypoints. learning 3d surface generation. CVPR, 2018. Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, [9] A Thai, S Stojanov, V Upadhya, et al. 3d reconstruction of novel object August 23–28, 2020, Proceedings, Part XV 16. Springer International shapes from single images. 2021 International Conference on 3D Vision Publishing, 2020: 88-104. (3DV ), IEEE, 2021: 85-95. [32] K N Kutulakos, S M Seitz. A theory of shape by space carving. International [10] Y Wei, S Liu, W Zhao, et al. Conditional single-view shape generation for Journal of Computer Vision, 2000, 38: 199-218. multi-view stereo reconstruction. Proceedings of the IEEE/CVF Conference [33] A Laurentini. The visual hull concept for silhouette-based image under- on Computer Vision and Pattern Recognition, 2019: 9651-9660. standing. IEEE Transactions on Pattern Analysis and Machine Intelligence, [11] Q Xu, W Wang, D Ceylan, et al. DISN: Deep implicit surface network for 1994, 16(2): 150-162. high-quality single-view 3d reconstruction. Advances in Neural Informa- [34] Y Yao, N Schertler, E Rosales, et al. Front2back: Single view 3d shape tion Processing Systems, 2019, 32. reconstruction via front to back prediction. Proceedings of the IEEE/CVF [12] L Jiang, S Shi, X Qi, et al. Gal: Geometric adversarial loss for single-view Conference on Computer Vision and Pattern Recognition, 2020: 531-540. 3d-object reconstruction. Proceedings of the European Conference on [35] J Wu, C Zhang, X Zhang, et al. Learning shape priors for single-view 3d Computer Vision (ECCV), 2018: 802-816. completion and reconstruction. Proceedings of the European Conference [13] A Kar, C Häne, J Malik. Learning a multi-view stereo machine. Advances in on Computer Vision (ECCV), 2018: 646-662. Neural Information Processing Systems, 2017, 30. [36] Y Wu, Z Sun. DFR: differentiable function rendering for learning 3D gen- [14] H Kato, T Harada. Learning view priors for single-view 3d reconstruction. eration from images. Computer Graphics Forum, 2020, 39(5): 241-252. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern [37] K He, X Zhang, S Ren, et al. Deep residual learning for image recognition. Recognition, 2019: 9778-9787. Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni- [15] S R Richter, S Roth. Matryoshka networks: Predicting 3d geometry via tion, 2016: 770-778. nested shape layers. Proceedings of the IEEE Conference on Computer Vision [38] P Jin, S Liu, J Liu, et al. Weakly-supervised single-view dense 3D point and Pattern Recognition, 2018: 1936-1944. cloud reconstruction via differentiable renderer. Chinese Journal of [16] M Tatarchenko, A Dosovitskiy, T Brox. Octree generating networks: Mechanical Engineering, 2021, 34: 93. Efficient convolutional architectures for high-resolution 3d outputs. [39] A X Chang, T Funkhouser, L Guibas, et al. Shapenet: An information-rich Proceedings of the IEEE International Conference on Computer Vision, 2017: 3d model repository. arXiv preprint arXiv: 1512. 03012, 2015. 2088-2096. [40] X Sun, J Wu, X Zhang, et al. Pix3d: Dataset and methods for single-image [17] H Xie, H Yao, X Sun, et al. Pix2vox: Context-aware 3d reconstruction from 3d shape modeling. Proceedings of the IEEE Conference on Computer Vision single and multi-view images. Proceedings of the IEEE/CVF International and Pattern Recognition, 2018: 2974-2983. Conference on Computer Vision, 2019: 2690-2698. [18] H Xie, H Yao, S Zhang, et al. Pix2Vox++: Multi-scale context-aware 3D object reconstruction from single and multiple images. International Hao Huang is currently a postdoctoral researcher at School of Journal of Computer Vision, 2020, 128(12): 2919-2935. Mechanical Engineering, Beijing Institute of Technology, China. [19] H Fan, H Su, L J Guibas. A point set generation network for 3d object reconstruction from a single image. Proceedings of the IEEE Conference on Shaoli Liu is currently a professor at School of Mechanical Engineer- Computer Vision and Pattern Recognition, 2017: 605-613. ing, Beijing Institute of Technology, China. [20] J Y Gwak, C B Choy, M Chandraker, et al. Weakly supervised 3d reconstruc- tion with adversarial constraint. 2017 International Conference on 3D Vision Jianhua Liu is currently a professor at School of Mechanical Engi- (3DV ), IEEE, 2017: 263-272. [21] Z Li, Y Yeh, M Chandraker. Through the looking glass: Neural 3d recon- neering, Beijing Institute of Technology, China. struction of transparent shapes. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 1262-1271. Peng Jin is currently an associate research fellow at Shenzhen Bay [22] Y Nie, X Han, S Guo, et al. Total3dunderstanding: Joint layout, object pose Laboratory, Institute of Biomedical Engineering, Shenzhen, China. and mesh reconstruction for indoor scenes from a single image. Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 55-64. [23] Q Chen, V Nguyen, F Han, et al. Topology-aware single-image 3D shape reconstruction. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020: 270-271. [24] G Vogiatzis, P H S Torr, R Cipolla. Multi-view stereo via volumetric graph- cuts. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), IEEE, 2005, 2: 391-398. [25] D Jimenez Rezende, S M Eslami, S Mohamed, et al. Unsupervised learning of 3d structure from images. Advances in Neural Information Processing Systems, 2016, 29. [26] K L Navaneet, P Mandikal, M Agarwal, et al. Capnet: Continuous approxi- mation projection for 3d point cloud reconstruction using 2d supervi- sion. Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 33(01): 8819-8826. [27] A Kar, S Tulsiani, J Carreira, et al. Category-specific object reconstruction from a single image. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015: 1966-1974. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Chinese Journal of Mechanical Engineering Springer Journals http://www.deepdyve.com/lp/springer-journals/learning-to-predict-3d-meshes-from-a-single-image-via-depth-0wCXgCHPCL

Loading next page...

References (34)

Publisher: Springer Journals
Copyright: Copyright © The Author(s) 2025
ISSN: 1000-9345
eISSN: 2192-8258
DOI: 10.1186/s10033-025-01335-2
Publisher site: See Article on Publisher Site

Abstract

Reconstructing three-dimensional (3D) shapes from a single image remains a significant challenge in computer vision due to the inherent ambiguity caused by missing or occluded shape information. Previous studies have predominantly focused on mesh models supervised by multi-view silhouettes. However, such methods are limited in reconstructing fine details. In this study, a 3D mesh model is predicted from a single image, leveraging depth consistency and without requiring viewpoint pose annotations. The model effectively learns strong shape priors that preserve finer structures and accurately predicts view poses from "correlation-supervised" viewpoints. Addition- ally, standard deviation and Laplacian losses were employed to regulate mesh edge distribution, resulting in more precise reconstructions. Differentiable renderer functions were derived from the 3D mesh to generate depth maps. Compared to conventional approaches, the proposed method provided superior representation of subtle structures. When applied to both synthetic and real-world datasets, the model outperformed existing methods in view-based 3D reconstruction tasks. Keywords Depth-consistency, Mesh, Standard deviation loss, View-based reconstruction during reconstruction, as shown in Figure 1, which limits 1 Introduction the utility of traditional reconstruction techniques. With the development of robotics, autonomous driv- Recently, neural networks have been used success- ing, and 3D animation, 3D shape inference of objects fully to infer 3D shape based on a single view [1]. The has become a mainstream research field. When humans convolutional layer of the neural network can resolve obtain 3D shape priors from objects or computer-aided the underlying features from the input image, where the design (CAD) models, they can estimate the 3D shape 3D shape is output as voxels [2–4], point clouds [5], or a of an object from a single view. However, estimating mesh [6]. However, voxels require more memory, which detailed 3D shapes from a single image/perspective is still compromises computational efficiency, while the point challenging for computer vision. It is impossible to find cloud representation loses important surface details [7]. matching points in corresponding images when using Compared with the voxel and point cloud representa- conventional graphics techniques for reconstruction. In tions, the mesh retains more important shape details and addition, the smooth surface or occlusion of the object represents surface topology. Furthermore, the mesh is makes it difficult to obtain significant feature points more likely to find real applications, as it has the capac - ity to efficiently model shape details [7, 8]. Recent stud- ies on single-view mesh reconstruction have proposed *Correspondence: recreating the 3D mesh by deforming a template model Shaoli Liu [email protected] based on perceptual features extracted from the input School of Mechanical Engineering, Beijing Institute of Technology, image. The reconstructed results usually have a topologi - Beijing 100081, China cal structure identical to that of the template model (e.g., Shenzhen Bay Laboratory, Institute of Biomedical Engineering, Shenzhen 518132, China a sphere or unit). Although promising results have been © The Author(s) 2025. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/. Huang et al. Chinese Journal of Mechanical Engineering (2025) 38:165 Page 2 of 13 projection is obtained using the camera pose. The 3D shape can be inferred by rendering with dense pixel- level supervision [6]. Mesh models in the literature mainly used silhouettes as the view data [6, 14, 28, 29, 31], as silhouettes are binary; this approach allows for simplified forward rendering and backward gradient propagation. However, silhouette data do not reflect changes in mesh vertices in the depth direction, which prevents the renderer from moving from flow gradi - Figure 1 The challenges of traditional reconstruction methods: ents to coordinates in the depth direction. Therefore, occlusion and smoothness the learned shape prior is “weaker”, so that the recon- structed shape is similar to a visual hull (VH) [32, 33] lacking detail. Given that normal and depth maps con- achieved, finer structures such as a hole in the back of a tain rich geometric information, they are commonly chair cannot be reconstructed. adopted as supervision signals for 3D shape inference, Most pioneering approaches [1, 7, 9–20] use 3D ground in the form of point clouds and voxels. While recon- truth to learn the shape priors of a mesh. For exam- structing the mesh model using view-based methods, ple, taking the Chamfer distance (CD) with the ground the connection of vertices in the mesh model increases truth point clouds as the loss function allows the point the difficulty of differentiability for the renderer. The clouds [7, 19, 21] and 3D meshes [22] to be reconstructed silhouettes are usually taken as supervisory signals; the directly from a single view. Wei et al. [10] fused the CD [27], l2-norm [28], negative intersection over union unambiguous parts of point clouds inferred from mul- (IoU) [14], and element-wise product [6, 14] are taken tiple views; based on the loss function of voxel models as training losses. The renderer is usually only differen - using binary cross entropy, the latent features of multi- tiable on the silhouettes; however, the silhouettes only ple views were integrated to predict the 3D volume [1, 13, contain VH information lacking detail, resulting in a 17, 18, 23]. Tatarchenko et al. [16] generated high-resolu- coarse 3D shape. tion voxel models using an octree representation. Some Besides, cameras should be calibrated when project- studies have focused on advanced loss functions. Gwak ing a 3D model onto an image. However, it is difficult to et al. [20] used a discriminator to examine generated and calibrate all of the cameras, because the camera poses ground truth shapes and to reconstruct more realistic are almost impossible to obtain in practical application. models. Jiang et al. [12] used the CD and a discrimina- Moreover, training traditional model convolutional neu- tor to constrain an estimated point cloud to follow the ral networks (CNNs) with camera pose requires a large 3D ground truth shape, locally and globally. Some recent amount of annotated data, yet it cannot predict the cor- approaches [9, 11] have constrained the mesh model responding camera pose. Renderer differentiability also using the distance of vertices to the surface of the mesh limits the use of images for supervision. Because the as a signed distance field (SDF); however, obtaining the values between pixels in an image projected by the 3D ground truth SDF of each vertex requires considerable model are discontinuous, the camera parameters are not computational resources. Although 3D shapes of satis- differentiable from the projected points. Thus, backprop - factory quality can be produced with these methods, 3D agation of the loss function is difficult, resulting in failure annotation involves a significant amount of manual work, to update network weights. As such, to predict the cam- which is not practical or feasible in all situations. era poses, it is necessary to make the projection points View-based reconstruction has recently gained sig- and camera parameters differentiable. nificant attention, as a more detailed model can be The objective of this study was to achieve detailed 3D reconstructed using a red-green-blue (RGB) or depth mesh reconstruction through view-based training from map for supervision. Recent work has involved view- a single viewpoint, leveraging the principle of depth based training to reduce the reliance on 3D ground consistency. In contrast to recent methodologies [4, 6, truth. The underlying principle of view-based training 14, 27–31], multi-view depth maps were employed as is photo-consistency [13, 24, 25], i.e., obtaining single- supervisory signals. A graphical representation of the or multiple-view 3D results and minimizing rendering research objective is provided in Figure 2. The rendered loss using observed view data. The view data com - depth maps captured variations in mesh vertex posi- monly consist of two silhouettes [2, 3, 6, 14, 26–31], tions along the depth axis, providing crucial informa- a 2.5D depth map, and a normal map [2–4]. Render- tion for enhancing reconstruction precision. To facilitate ing is the image formulation mechanism; a 3D shape depth error backpropagation, the renderer’s gradient was Huang et al. Chinese Journal of Mechanical Engineering (2025) 38:165 Page 3 of 13 Figure 2 The graphical representation of the research objective approximated using a customized function [14], which applied to synthetic and realistic datasets, and outper- enabled the model to learn robust shape priors and accu- formed the view-based training methods; results com- rately reconstruct fine geometric details. parable to those of typical fully supervised methods with As the 3D ground truth is not given, the learned shape weaker supervision were obtained. priors are unaware of the unobserved views, and may reconstruct ambiguous shapes for occluded or unseen 2 Methods parts [10, 34–36]. Therefore, a depth map with 20 views The proposed framework consists of four parts: fea - was used for supervision, to reduce ambiguity [29, 30]. ture map extraction, initial model deformation, differ - Well-calibrated cameras are necessary when rendering; entiable rendering, and pose prediction of novel views, however, extra work is required, preventing practical which is shown in Figure 3. To reduce the dependence application. Therefore, we propose a network structure, on the camera pose ground truth, the novel viewpoint PoseCNN, to predict the poses of multiple viewpoints; poses prediction network was trained to obtain the dif the framework can be applied in the absence of camera ferent camera parameters. Then the two regularization calibration. The ground truth depth value of the projec - terms (standard deviation and Laplacian) were added to tion was interpolated bilinearly based on local four-pixel improve the reconstruction performance. neighborhoods. The initial mesh created by previous researches [6, 14, 29] was not usable due to the many self-intersections and large numbers of meshes con- 2.1 Proposed SingleV ‑ iew Reconstruction Framework centrated on flat faces (i.e., there were fewer meshes in In the feature extraction step, a shape was generated by the finer structures). To solve this issue, we incorpo - deforming a pre-defined cube, following typical learn - rated the standard deviation loss of the edge length and ing-based mesh reconstruction methods [7, 28, 29]. Laplacian loss [7] to refine the mesh. The integration of ResNet-18 [37] was used as the image encoder (Enc) to these two terms provided more uniform edge lengths, as compute the latent representations F from a single-view well as meshes that were reasonably distributed. u Th s, RGB image I . The face of each cube had 16 × 16 vertices. our inferred 3D models show finer structures and more The associated feature maps F were generated separately details. by the shape decoder (‘Dec’). In summary, the contributions of this work are as fol- In the initial model deformation step, since the vertices lows: (1) A 3D shape reconstruction model is proposed on the cube edges are shared by two faces, a total of 1,352 that uses depth maps for supervision, learning strong vertices are present. The six feature maps F are concat- shape priors for single-view 3D mesh inference. Fine enated to generate the bias F using a fully connected layer. details are addressed by a view-based training strategy. The initial cube is deformed by F to form the 3D mesh (2) Standard deviation and Laplacian losses were incor- model. Then the ground truth depth maps D are scattered porated as regularization terms to improve the recon- and have N novel views. The 3D structures recovered from struction performance; their effects were explored in different depth maps are consistent with each other, based ablation experiments. (3) The proposed model was Huang et al. Chinese Journal of Mechanical Engineering (2025) 38:165 Page 4 of 13 Figure 3 Framework of the proposed method (Our method uses multiple depth maps D as supervisory signals and combines two regularization terms (L , L ) to derive the total loss, which aids learning of strong shape priors for single-view reconstruction) std lap Figure 5 Backpropagation of the loss Figure 4 Projected vertices of one piece of the mesh and the pixels of the image on which the viewpoint poses (R , t ) of novel views can be are the areas of the corresponding triangles. The gradient n n predicted by the proposed network, PoseCNN. of the network loss L with respect to the vertex is given in Finally, with the predicted camera poses (R , t ), the Eq. (2) and (3): n n reconstructed model is rendered to yield novel views, and ∂L ∂L ∂P ∂L the rendered depth maps are D . Inspired by Kato et al. n = = a , 1 (2) ∂V ∂P ∂V ∂P 1 i 1 i [14], the gradient of the renderer is approximated by the difference in depth value between two adjacent pixels, ∂L ∂L ∂d ∂L ∂d which makes the renderer differentiable and the framework i i+1 = + . (3) trainable. ∂P ∂d ∂P ∂d ∂P i i i i+1 i V V V is the projection of one of the mesh m , as 1 2 3 i shown in Figure 4. Pixel P is located in V V V , which i 1 2 3 Only the manually designed renderer-related gradients can be represented as the sum of weights using Eq. (1): ∂d /∂P need to be calculated, as shown in Figure 5. They i i P = a V + a V + a V , i 1 1 2 2 3 3 (1) are approximated by the depth difference of adjacent pixels. Assuming ∂L /∂d = g and for the case of pixel P moving i i i i where a = S /S , a = S /S , 1 �P V V �V V V 2 �P V V �V V V i 1 2 1 2 3 i 1 3 1 2 3 to the right and left, the gradients are approximated as Eq. a = S /S and S , S , S 3 �P V V �V V V P V V P V V P V V i 2 3 1 2 3 i 1 2 i 1 3 i 2 3 (4) and (5), respectively: Huang et al. Chinese Journal of Mechanical Engineering (2025) 38:165 Page 5 of 13 ∂L ∂L ∂d ∂L ∂d i i+1 = + ∂c ∂d ∂c ∂d ∂c i i i+1 i (4) = g (d − d ) + g (d − d ) i i−1 i i+1 i i+1 right =g , ∂L ∂L ∂d ∂L ∂d i i−1 = + ∂d ∂c ∂d ∂c ∂c i i i+1 i (5) = g (d − d ) + g (d − d ) i i i+1 i−1 i−1 i left =g , where d , d , and d are the depth values of P , P , i−1 i i+1 i−1 i and P , resp e ctively . i+1 2.2 P ose Prediction of the Novel Views Some of the calibrated cameras can be taken as the ref- Figure 6 Network architectures: (a) Architecture of shape decoder, erence camera Cam with depth maps dep . The world 0 r (b) Architecture of PoseCNN coordinate system (WCS) is used as their common coor- dinate system. As a novel view, the pose pose = (R , t ) n n n of an uncalibrated camera Cam = (1 < n < N) is given 2.3 Structures of the Shape Decoder and PoseCNN by the rotation R and translation t relative to the WCS, n n The shape decoder, PoseCNN, and dimensions of each which is predicted by PoseCNN from depth maps dep . layer are shown in Figure 6. The layer symbols are Pixel X = (x , y , d ) in dep is unprojected into the WCS i i i n defined as follows: (1) Bn+ relu: Batch normalization −1 by p = R (KX − t ) to obtain the point cloud, where and rectified linear unit (ReLU) layers; (2) Deconv: d is the depth value for pixel X . We assumed a default 2D deconvolution layer with a kernel size of 3×3 and intrinsic matrix with an orthographic camera K . Initially , stride size of 2×2; (3) Conv: 2D convolution layer with the point clouds are projected onto Cam by Eq. (6): a kernel size of 1×1 and stride size of 1×1; (4) FC: fully connected layer; (5) Concat: stacks the feature maps X = K (R p + t ) = (x , yˆ , dep ˆ ). (6) n n i i 0 of six surfaces. The shape decoder outputs the (x , y, z) Because the point cloud is dense, many points are pro- coordinates of the 1352 vertices of the mesh model. The jected onto the same pixel. Therefore, the point nearest PoseCNN output includes a quaternion and the (x, y, z) to the depth map is the only one preserved, and (x ˆ , y ˆ ) is i i positions of the viewpoints. the image coordinate of the projected point. The depth The proposed shape prior learning framework is mainly of the nearest point is the rendered depth dep ˆ , which 0 supervised by the depth-consistency for depth loss, given can be obtained by the method proposed in reference by: [38]. The bilinear sampling at location (x ˆ , y ˆ ) is differen - i i tiable with respect to the camera pose (R , t ) , such that n n L = D − D . (7) dep n n the framework is differentiable with respect to the view - n=1 point pose prediction. PoseCNN is trained by minimiz- ing ||dep ˆ − dep || . However, the mesh surfaces may exhibit serious self- 0 0 2 If no camera is calibrated, we can take any camera in intersection problems, which can be prevented by Lapla- the novel views as Cam , and its camera coordinate cian loss using Eq. (8): system (CCS) is defined as the WCS to predict other   1352 1352 uncalibrated novel views; an absolute camera coordinate � �   L = p − , (8) system is not necessary. More reference cameras Cam lap i �U (p )� i=1 k∈U (p ) contribute to the pose prediction accuracy, which is i explored in the experiments (Section 3). The viewpoint is where p is a vertex of the inferred mesh model and U (p ) i i assumed at a known location, which can also be inferred are the adjacent vertices connected to p . The Laplacian by extending the PoseCNN, such that we only analyze the loss L may not contribute to accuracy improvement lap rotation prediction error ε (in degrees). [14]; In fact, it may reduce the accuracy when a large Huang et al. Chinese Journal of Mechanical Engineering (2025) 38:165 Page 6 of 13 weight is applied (See the experimental results in Sec- Table 1 IoU scores for ShapeNet obtained with view-based training methods (Baselines: NR1 [29], NR2 [14], SOFT [6]) tion 3 for details). Besides the self-intersections, the meshes are placed ran- Methods NR1 NR2 SOFT OURS domly on the object surface. There are numerous dense IoU 60.2 65.5 64.6 67.8 meshes on flat surfaces, but fewer on complex parts, which hinders the reconstruction of finer structures. We propose regularizing the shape using the standard deviation of the edge length of the meshes, which makes the mesh distribu- 3.2.1 Quantitative Evaluations tion more even. L is calculated as follows: std Twenty depth maps D were rendered from random views, and the proposed network was trained accord- M (l ) i=1 l − i ingly. Table 1 presents a comparison between our i=1 M (9) L = , approach and typical view-based training methods [6, std 14, 29] in terms of the IoU score. The results indicate that the proposed method outperformed the others. where l is the length of the i mesh edge and M is the i th Furthermore, since these methods are supervised by number of mesh edges. The experimental results dis - silhouettes, their optimal performance approaches that cussed in Section 3 showed that these two regularization of VH. Table 2 reports the IoU scores of fully super- terms improve reconstruction accuracy by setting rea- vised methods, where most models [1, 4, 10, 13, 15, 16, sonable weights (W and W ). The total loss is given by lap std 30] were represented using voxels. However, generat- the following: ing high-resolution models proves challenging due to L = L + W L + W L . dep lap lap std std (10) the memory inefficiency of voxels, limiting their accu - racy. Other studies employed mesh models [7] or point clouds [19], supervised by CD [7] or Earth Mover’s Dis- 3 Experiments tance [19]. These losses only adjust the mesh vertices In this section, the proposed reconstruction framework within a small field [12]. While the results of Thai et al. was validated using both synthetic (ShapeNet [39]) and [9] were slightly superior to ours, their fully supervised natural image datasets (Pix3D [40]), each containing 3D method utilized RGB images, estimated depth, and CAD models. The IoU was computed, with quantitative normal maps as inputs. When using only a single image results scaled by a factor of 100. All rendered and real RGB as input, our model achieved better performance (67.8 images, as well as the rendered depth maps, were produced vs. 65.0). at a resolution of 224×224. The datasets were randomly partitioned into training (80%) and testing (20%) subsets. 3.2.2 Qualitative Results The network optimization was conducted using the Adam Objects with small structures, particularly concavi- optimizer. ties, were effectively reconstructed using the proposed method. As shown in Figure 7, finer details, such as the 3.1 Implementation Details tail of an airplane and the holes in a chair, were accu- For each 3D model in ShapeNet, 24 RGB images were ren- rately reconstructed by the model. In Figure 8, the dered with an azimuth angle increment of 15°, elevation meshes generated by NR1 [29] and NR2 [14] exhibited angles of 30°, and eight depth maps from fixed viewpoints significant self-intersection issues. In contrast, the pro - corresponding to the eight corners of a central cube. Addi- posed model produced smoother meshes, with more tionally, for each 3D model in ShapeNet and Pix3D, depth reasonable distribution. Finer structures, such as the maps were rendered from 100 randomly selected novel tail and wings of the planes, were better captured with viewpoints. The network was optimized with parameters a a sufficient mesh representation. −4 = 1×10 , b = 0.5, and b = 0.999 across all experiments. 1 2 A batch size of 5 was employed, and the learning rate was −43.2.3 Ablation Study set to 4×10 . The reconstructed and ground truth models Controlled experiments were conducted to validate the were converted into voxel grids, and the IoU was calculated significance of different components. The effects of two for quantitative evaluation. regularization terms, L and L , were investigated. To std lap assess the impact of the proposed regularization terms, 3.2 R econstruction Results on the Synthetic Dataset the following experiments were performed: Laplacian Similar to traditional view-based training methods, 13 regularization (LR) only, standard deviation regulariza- categories from the ShapeNet dataset were utilized to tion (SR), and both Laplacian and standard deviation train the PoseCNN and reconstruction network. Huang et al. Chinese Journal of Mechanical Engineering (2025) 38:165 Page 7 of 13 Table 2 IoU scores for ShapeNet obtained with fully supervised methods (Baselines: P2V [17], 3D-R2N2 [1], PTN [30], OGN [16], LSM [13], Matry [15], VTN [15], PSGN [15], DRC [4], SDFNet-Img [9], SDFNet-Est [9], and P2M [7]) Methods P2V 3D‑R2N2 PTN OGN LSM Matry VTN IoU 66.1 63.4 57.4 59.6 61.5 63.5 64.1 Methods PSGN DRC SDFNet‑Img SDFNet‑Est P2M OURS IoU 64.0 54.5 65.0 68.0 61.0 67.8 Figure 7 Reconstructed models using ShapeNet (GT: ground truth, Baselines NR1 [29], NR2 [14], P2M [7], and SOFT [6]) Figure 8 Mesh distribution (GT: ground truth. Baselines: NR1 [29], NR2 [14]) Huang et al. Chinese Journal of Mechanical Engineering (2025) 38:165 Page 8 of 13 supervision, the improvement was markedly greater, with an increase of 3.6 percentage points (54.5 vs. 58.1), demonstrating that the regularization terms are par- ticularly beneficial for reconstruction under limited supervision. The weights of the regularization terms were exam - ined under the condition of 20 views being used for supervision. The relationships between the weights and the IoU and CD values are illustrated in Figure 9, while Figure 10 presents the reconstruction models gener- ated with varying weights (W and W ). The CD, uti - std lap lized as a metric index, quantifies the distance between Figure 9 IoU obtained using different numbers of views (Baseline: two point cloud sets [19], offering an evaluation of the without Laplacian and standard deviation regularization terms. LR: reconstruction accuracy and consistency of the gener- Laplacian regularization only; SR: standard deviation regularization ated models. only; LSR: both Laplacian and standard deviation regularization) The weights of the regularization terms were inves - tigated using 20 views for supervision. Figure 11 illus- trates the relationships between the weights and the regularization (LSR). The results with varying numbers IoU and CD values. Figure 12 presents reconstruction of views are presented in Figures 9 and 10, where the models with varying weights of W and W . The std lap baselines were trained without the regularization terms results in Figure 11 suggest that W and W should std lap L and L . std lap remain below 3 and 0.003, respectively, to maintain Notably, the visual quality of the generated meshes reconstruction accuracy. Furthermore, Figure 11 indi- was significantly enhanced by the inclusion of regu - cates that the trends in CD and IoU were consistent. larization terms, as indicated by the IoU score, with 20 As shown in Figure 12, the models failed to preserve views used for supervision. As shown in Figure 9, SR details when either W and W was set to relatively std lap had a more substantial impact on the reconstruction high values. For all experiments, W = 3 and W = std lap results than LR. The former better distributes mesh 0.003 were selected. These regularization terms are cru - edges globally, whereas the latter primarily affects local cial for producing an accurate mesh model. mesh edges. When only three views were available for Figure 10 Reconstructed ShapeNet models obtained using different numbers of views (GT: ground truth. ‘B’: baseline trained without the Laplacian and standard deviation regularization terms. ‘P’: proposed method trained with Laplacian and standard deviation regularization terms) Huang et al. Chinese Journal of Mechanical Engineering (2025) 38:165 Page 9 of 13 Figure 11 Relationships between reconstruction accuracy and regularization weights: (a), (c) Relationship between W and reconstruction std accuracy, (b), (d) Relationship between W and reconstruction accuracy lap 3.3 R econstruction on a Natural Image Dataset corners of a central cube, which were used to validate the Evaluating a method on a natural image dataset is criti- prediction accuracy of novel viewpoint poses. cal for assessing its real-world practicality. Pix3D [40], The accuracy of pose predictions was evaluated by a real-world dataset consisting of 3D CAD models and using varying numbers of fixed views as references to corresponding 2D multi-view images, was used for this predict the poses of 10 randomly rendered novel views. purpose. Multi-view depth maps and silhouettes were Table 4 demonstrates the relationship between the num- rendered, with typical studies [6, 7, 14, 29] serving as the ber of reference views and the rotation prediction error. baseline for comparison. When three fixed views were used as references, the Reimplementation results are presented in Figure 13 error was 0.61°, while the error decreased to 0.34° with and Table 3, where the proposed method outperformed eight fixed reference views, which is sufficiently accurate the baseline. The improvement was particularly signifi - for the training framework. cant for the desk category. As shown in Figure 13, the Robust pose prediction was demonstrated by using empty section of the desk was clearly distinguished from view 3 as the reference, successfully predicting the poses the other parts. of the other seven views. Table 5 presents the predicted results, with the notable observation that the prediction 3.4 V iewpoint Pose Predictions for view 6 differed significantly from that of view 3. This Figure 14 presents the rendering of eight depth maps suggests that the pose prediction method is robust to from fixed viewpoints, corresponding to the eight appearance variations and occlusions. Huang et al. Chinese Journal of Mechanical Engineering (2025) 38:165 Page 10 of 13 Figure 12 Reconstructed models with different weights (W =1, 2, 10, 20; W =0.003, 0.005, 0.01, 0.1, GT: ground truth) std lap standard deviation losses were incorporated as regu- 4 Conclusions larization terms. These regularization terms contrib - In this study, a 3D shape reconstruction model utiliz- uted to smoother surfaces and a more reasonable mesh ing multiple depth maps as supervisory signals for mesh distribution. Controlled experiments were conducted to inference from a single view has been presented. The assess the impact of these regularization terms, and the depth maps, containing embedded 3D object structures, results demonstrated substantial improvements when the facilitated effective shape prior learning and multi-view weights for Laplacian and standard deviation losses were pose prediction. A PoseCNN structure was introduced set to 0.003 and 3, respectively. Evaluations across both to predict novel camera views, while Laplacian and Huang et al. Chinese Journal of Mechanical Engineering (2025) 38:165 Page 11 of 13 synthetic and real-world datasets revealed superior per- formance compared to typical view-based training meth- ods [6, 14, 29] and most fully supervised approaches. The generated models achieved comparable quality to those obtained using fully supervised methods, despite relying on weaker supervision. This study demonstrates that the proposed approach successfully learns robust shape priors and reconstructs 3D models with finer structural details, underscoring its potential as an effective solution for 3D reconstruc - tion tasks under limited supervision. Future research will extend the capabilities of the pro- posed 3D shape reconstruction model by incorporat- ing additional image parameters beyond object shapes. Humans can estimate various visual cues, such as light- ing conditions, reflection properties, and camera pose from RGB images, and it is expected that neural net- works are expected to be capable of replicating these abilities. A promising direction for this development is the application of self-supervised learning approaches. Figure 13 Reconstructed models on Pix3D (Baseline: NR1 [29], NR2 As depicted in Figure 15, neural networks can be [14], P2M [7], SOFT [6]) leveraged to infer shape and other image parameters simultaneously. These inferred parameters are then used to render and optimize input images by minimiz- ing the backpropagated RGB rendering loss, eliminat- ing the need for additional ground truth beyond the Table 3 A comparison of the IoU scores for Pix3D (Baselines: NR1 input RGB images. In contrast, current methods often [29], NR2 [14], P2M [7], SOFT [6]) rely on mapping RGB images onto meshes to generate Methods NR1 NR2 P2M SOFT OURS visually appealing results, while the underlying shape is optimized using extra ground truth, such as multi- IoU 58.7 63.3 58.9 61.2 64.9 view silhouettes. Self-supervised 3D reconstruction, by removing the reliance on such external ground truth, represents one of the most promising advancements in this field. In this context, the proposed method’s capacity for shape prior learning, supervised by depth maps, can be extended to RGB images in future work. Given the structural similarities between depth maps and RGB images—despite differences in the number of chan - nels—depth maps, easily obtainable via structured light sensors in practical applications, can serve as a basis for this extension. Thus, the proposed framework lays the groundwork for further exploration in self-supervised shape prior learning, especially in real-world settings where ground truth may not be readily available. Figure 14 Changes among the eight views (especially for views with opposite orientations) Huang et al. Chinese Journal of Mechanical Engineering (2025) 38:165 Page 12 of 13 Table 4 Relationship between the number of reference views N and rotation prediction error N 1 2 3 4 5 6 7 8 ε (°) 2.77 1.30 0.61 0.42 0.38 0.35 0.34 0.34 Table 5 Prediction results and error ε of seven fixed views, taking view 3 as the reference Views Rotation (quaternion) ε View 1 OURS − 0.31 0.18 0.41 0.83 1.04 GT 0.33 − 0.17 − 0.42 − 0.82 View 2 OURS − 0.39 0.83 0.34 0.18 2.87 GT 0.42 − 0.82 − 0.33 − 0.17 View 4 OURS − 0.17 0.33 − 0.83 − 0.42 0.29 GT 0.18 − 0.33 0.82 0.42 View 5 OURS − 0.14 0.33 0.82 0.43 1.35 GT 0.17 − 0.34 − 0.82 − 0.42 View 6 OURS − 0.81 0.43 0.17 0.34 0.99 GT 0.82 − 0.42 − 0.17 − 0.33 View 7 OURS 0.34 − 0.17 0.42 0.82 0.39 GT 0.33 − 0.17 0.42 0.82 OURS 0.41 − 0.82 0.34 0.17 View 8 0.99 GT 0.42 − 0.82 0.33 0.17 Funding Supported by National Key Research and Development Program of China (Grant No. 2024YFB3409800), Postdoctoral Fellowship Program of CPSF of China (Grant No. GZB20240940), and China Postdoctoral Science Foundation (Grant Nos. 2025T181108, 2024M764127). Data availability The raw/processed data required to reproduce these findings cannot be shared at this time as the data also forms part of an ongoing study. Declarations Competing Interests The authors declare no competing financial interests. Received: 23 August 2023 Revised: 10 July 2025 Accepted: 23 July 2025 References [1] C B Choy, D Xu, J Y Gwak, et al. 3d-r2n2: A unified approach for single Figure 15 Self-supervised 3D reconstruction framework (Image and multi-view 3d object reconstruction. Computer Vision–ECCV 2016: formation factors include the camera pose, lighting, shape, 14th European Conference, Amsterdam, The Netherlands, October and reflection) 11-14, 2016, Springer International Publishing, 2016: 628-644. [2] J Wu, Y Wang, T Xue, et al. Marrnet: 3d shape reconstruction via 2.5 d sketches. Advances in Neural Information Processing Systems, 2017, 30. [3] S Tulsiani, A A Efros, J Malik. Multi-view consistency as supervisory sig- Acknowledgements nal for learning shape and pose prediction. Proceedings of the IEEE Con- Not applicable. ference on Computer Vision and Pattern Recognition, 2018: 2897-2905. [4] S Tulsiani, T Zhou, A A Efros, et al. Multi-view supervision for single-view Authors’ Contributions reconstruction via differentiable ray consistency. Proceedings of the IEEE HH wrote the manuscript and was in charge of the whole trial; SLL and JHL Conference on Computer Vision and Pattern Recognition, 2017: 2626-2634. supervised the whole work of this paper; PJ assisted review and editing. All authors read and approved the final manuscript. Huang et al. Chinese Journal of Mechanical Engineering (2025) 38:165 Page 13 of 13 [5] C H Lin, C Kong, S Lucey. Learning efficient point cloud generation for [28] A Kanazawa, S Tulsiani, A Efros, et al. Learning category-specific mesh dense 3d object reconstruction. Proceedings of the AAAI Conference on reconstruction from image collections. Proceedings of the European Con- Artificial Intelligence, 2018, 32(1). ference on Computer Vision (ECCV), 2018: 371-386. [6] S Liu, T Li, W Chen, et al. Soft rasterizer: A differentiable renderer for [29] H Kato, Y Ushiku, T Harada. Neural 3d mesh renderer. Proceedings of image-based 3d reasoning. Proceedings of the IEEE/CVF International the IEEE Conference on Computer Vision and Pattern Recognition, 2018: Conference on Computer Vision, 2019: 7708-7717. 3907-3916. [7] N Wang, Y Zhang, Z Li, et al. Pixel2mesh: Generating 3d mesh models [30] X Yan, J Yang, E Yumer, et al. Perspective transformer nets: Learning single- from single rgb images. Proceedings of the European Conference on Com- view 3d object reconstruction without 3d supervision. Advances in Neural puter Vision (ECCV ), 2018: 52-67. Information Processing Systems, 2016, 29. [8] T Goueix, M Fisher, V G Kim, et al. Atlasnet: A papier-mch approach to [31] S Goel, A Kanazawa, J Malik. Shape and viewpoint without keypoints. learning 3d surface generation. CVPR, 2018. Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, [9] A Thai, S Stojanov, V Upadhya, et al. 3d reconstruction of novel object August 23–28, 2020, Proceedings, Part XV 16. Springer International shapes from single images. 2021 International Conference on 3D Vision Publishing, 2020: 88-104. (3DV ), IEEE, 2021: 85-95. [32] K N Kutulakos, S M Seitz. A theory of shape by space carving. International [10] Y Wei, S Liu, W Zhao, et al. Conditional single-view shape generation for Journal of Computer Vision, 2000, 38: 199-218. multi-view stereo reconstruction. Proceedings of the IEEE/CVF Conference [33] A Laurentini. The visual hull concept for silhouette-based image under- on Computer Vision and Pattern Recognition, 2019: 9651-9660. standing. IEEE Transactions on Pattern Analysis and Machine Intelligence, [11] Q Xu, W Wang, D Ceylan, et al. DISN: Deep implicit surface network for 1994, 16(2): 150-162. high-quality single-view 3d reconstruction. Advances in Neural Informa- [34] Y Yao, N Schertler, E Rosales, et al. Front2back: Single view 3d shape tion Processing Systems, 2019, 32. reconstruction via front to back prediction. Proceedings of the IEEE/CVF [12] L Jiang, S Shi, X Qi, et al. Gal: Geometric adversarial loss for single-view Conference on Computer Vision and Pattern Recognition, 2020: 531-540. 3d-object reconstruction. Proceedings of the European Conference on [35] J Wu, C Zhang, X Zhang, et al. Learning shape priors for single-view 3d Computer Vision (ECCV), 2018: 802-816. completion and reconstruction. Proceedings of the European Conference [13] A Kar, C Häne, J Malik. Learning a multi-view stereo machine. Advances in on Computer Vision (ECCV), 2018: 646-662. Neural Information Processing Systems, 2017, 30. [36] Y Wu, Z Sun. DFR: differentiable function rendering for learning 3D gen- [14] H Kato, T Harada. Learning view priors for single-view 3d reconstruction. eration from images. Computer Graphics Forum, 2020, 39(5): 241-252. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern [37] K He, X Zhang, S Ren, et al. Deep residual learning for image recognition. Recognition, 2019: 9778-9787. Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni- [15] S R Richter, S Roth. Matryoshka networks: Predicting 3d geometry via tion, 2016: 770-778. nested shape layers. Proceedings of the IEEE Conference on Computer Vision [38] P Jin, S Liu, J Liu, et al. Weakly-supervised single-view dense 3D point and Pattern Recognition, 2018: 1936-1944. cloud reconstruction via differentiable renderer. Chinese Journal of [16] M Tatarchenko, A Dosovitskiy, T Brox. Octree generating networks: Mechanical Engineering, 2021, 34: 93. Efficient convolutional architectures for high-resolution 3d outputs. [39] A X Chang, T Funkhouser, L Guibas, et al. Shapenet: An information-rich Proceedings of the IEEE International Conference on Computer Vision, 2017: 3d model repository. arXiv preprint arXiv: 1512. 03012, 2015. 2088-2096. [40] X Sun, J Wu, X Zhang, et al. Pix3d: Dataset and methods for single-image [17] H Xie, H Yao, X Sun, et al. Pix2vox: Context-aware 3d reconstruction from 3d shape modeling. Proceedings of the IEEE Conference on Computer Vision single and multi-view images. Proceedings of the IEEE/CVF International and Pattern Recognition, 2018: 2974-2983. Conference on Computer Vision, 2019: 2690-2698. [18] H Xie, H Yao, S Zhang, et al. Pix2Vox++: Multi-scale context-aware 3D object reconstruction from single and multiple images. International Hao Huang is currently a postdoctoral researcher at School of Journal of Computer Vision, 2020, 128(12): 2919-2935. Mechanical Engineering, Beijing Institute of Technology, China. [19] H Fan, H Su, L J Guibas. A point set generation network for 3d object reconstruction from a single image. Proceedings of the IEEE Conference on Shaoli Liu is currently a professor at School of Mechanical Engineer- Computer Vision and Pattern Recognition, 2017: 605-613. ing, Beijing Institute of Technology, China. [20] J Y Gwak, C B Choy, M Chandraker, et al. Weakly supervised 3d reconstruc- tion with adversarial constraint. 2017 International Conference on 3D Vision Jianhua Liu is currently a professor at School of Mechanical Engi- (3DV ), IEEE, 2017: 263-272. [21] Z Li, Y Yeh, M Chandraker. Through the looking glass: Neural 3d recon- neering, Beijing Institute of Technology, China. struction of transparent shapes. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 1262-1271. Peng Jin is currently an associate research fellow at Shenzhen Bay [22] Y Nie, X Han, S Guo, et al. Total3dunderstanding: Joint layout, object pose Laboratory, Institute of Biomedical Engineering, Shenzhen, China. and mesh reconstruction for indoor scenes from a single image. Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 55-64. [23] Q Chen, V Nguyen, F Han, et al. Topology-aware single-image 3D shape reconstruction. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020: 270-271. [24] G Vogiatzis, P H S Torr, R Cipolla. Multi-view stereo via volumetric graph- cuts. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), IEEE, 2005, 2: 391-398. [25] D Jimenez Rezende, S M Eslami, S Mohamed, et al. Unsupervised learning of 3d structure from images. Advances in Neural Information Processing Systems, 2016, 29. [26] K L Navaneet, P Mandikal, M Agarwal, et al. Capnet: Continuous approxi- mation projection for 3d point cloud reconstruction using 2d supervi- sion. Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 33(01): 8819-8826. [27] A Kar, S Tulsiani, J Carreira, et al. Category-specific object reconstruction from a single image. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015: 1966-1974.

Journal

Chinese Journal of Mechanical Engineering – Springer Journals

Published: Aug 25, 2025

Keywords: Depth-consistency; Mesh; Standard deviation loss; View-based reconstruction

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

Learning to Predict 3D Meshes from a Single Image via Depth Consistency

Learning to Predict 3D Meshes from a Single Image via Depth Consistency

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

Learning to Predict 3D Meshes from a Single Image via Depth Consistency

Learning to Predict 3D Meshes from a Single Image via Depth Consistency

References (34)

Abstract

Journal

Recommended Articles

There are no references for this article.

Our policy towards the use of cookies