A Monocular SLAM System Leveraging Structural Regularity in Manhattan World

Haoang Li, Jian Yao*, Jean-Charles Bazin, Xiaohu Lu and Yazhou Xing

School of Remote Sensing and Information Engineering, Wuhan University, Wuhan, Hubei, P.R.China

*EMail: jian.yao@whu.edu.cn

*Web: http://www.scholat.com/jianyao http://cvrs.whu.edu.cn

1. Abstract

The structural features in Manhattan world encode useful geometric information of parallelism, orthogonality and/or coplanarity in the scene. By fully exploiting these structural features, we propose a monocular SLAM system which can obtain accurate estimation of camera poses and 3D map. The foremost contribution of the proposed system is a structural features based optimization module which contains three novel optimization strategies. First, a rotation optimization strategy using the parallelism and orthogonality of 3D lines is presented. Based on these two geometric cues, we propose a global binding method and an approach for calculating relative rotation to get accurate absolute rotations. Second, a translation optimization strategy leveraging coplanarity is proposed. Coplanar features are effectively identified, and they are exploited by a unified model handling points and lines equivalently to calculate relative translation, followed by obtaining optimal absolute translations. Third, a 3D line optimization strategy utilizing parallelism, orthogonality and coplanarity simultaneously is proposed to obtain an accurate 3D map consisting of structural line segments with low computational complexity. Experiments in man-made environments have demonstrated that the proposed system outperforms existing state-of-the-art monocular SLAM systems in terms of accuracy and robustness.

2. Approach

We first use non-structural features to obtain a rough estimate of the camera poses and the 3D map, following existing methods. We then use the structural features to build an optimization module containing three novel optimization strategies, which are our main contributions:

(1) Accurate rotation optimization strategy leveraging parallelism and orthogonality: a global binding method and an approach for calculating precise relative rotations are proposed to significantly reduce the accumulated error of the absolute rotations;

(2) Accurate translation optimization strategy exploiting coplanarity: coplanar features are identified effectively and then used by a unified model, which handles coplanar points and lines equivalently, to obtain the relative translations, followed by optimization of the absolute translations;

(3) Accurate and efficient 3D map optimization strategy based on parallelism, orthogonality and coplanarity: a novel 3D line parameterization method is designed, along with a reliable cost function based on re-projection error minimization of lines.


Fig. 1. Illustration of the geometric model of structural features in a Manhattan world: (a) structural constraint of parallelism; (b) structural constraint of coplanarity.



The basic geometric model used in this paper is shown in Fig. 1. The details of the algorithms are described in the paper.

3. Experimental Results

To demonstrate the performance of the proposed structural features based SLAM system, we conduct experiments on both simulated data and a real image sequence. We compare our methods with existing state-of-the-art approaches in terms of accuracy and efficiency.

3.1. Simulated Data

Experiments on simulated data are divided into two parts. First, two overlapping images are synthesized to compare the proposed algorithms with existing methods on relative rotation and translation estimation as well as 3D line segment optimization. Second, a long image sequence is synthesized to make comparisons in a large-scale scene. Note that all the tested approaches were run on an Intel Core i7 CPU at 2.40 GHz.

a) Relative rotation estimation

For estimating the relative rotation from vanishing points (VPs), we compare our Gröbner basis based method GB-RR (cf. Section III-C) with a linear system based method [23], denoted as LSRR. We also evaluate GB-RR using the same 6 structural lines as above in a nearly degenerate case in which the camera baseline is fixed to a very small value of 0.05 (shown in Fig. 2(a)).
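As a point of reference for why Manhattan directions constrain rotation, the sketch below recovers a relative rotation in the idealized noise-free case. It is not the Gröbner basis method GB-RR, whose details are given in the paper. It assumes that the three orthogonal dominant directions are already known in both camera frames with consistent ordering and signs; all function names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(a):
    """Transpose a 3x3 matrix."""
    return [[a[j][i] for j in range(3)] for i in range(3)]

def rot_x(theta):
    """Rotation about the x axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def rot_z(theta):
    """Rotation about the z axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def relative_rotation(dirs_1, dirs_2):
    """Return R such that dirs_2 = R * dirs_1.

    Each argument is a 3x3 matrix whose columns are the three orthonormal
    Manhattan directions expressed in one camera frame (an assumption of
    this sketch). Because dirs_1 is orthonormal, its inverse is its transpose.
    """
    return matmul(dirs_2, transpose(dirs_1))

def rotation_angle(r):
    """Angle of a rotation matrix in radians, from its trace."""
    cos_angle = (r[0][0] + r[1][1] + r[2][2] - 1.0) / 2.0
    return math.acos(max(-1.0, min(1.0, cos_angle)))

# Synthetic check: camera 2 is camera 1 rotated by 0.1 rad about z.
dirs_1 = matmul(rot_x(0.3), rot_z(-0.2))  # Manhattan frame seen from camera 1
r_true = rot_z(0.1)
dirs_2 = matmul(r_true, dirs_1)           # same frame seen from camera 2
r_est = relative_rotation(dirs_1, dirs_2)
error_rad = rotation_angle(matmul(r_est, transpose(r_true)))
print(f"recovered angle: {rotation_angle(r_est):.6f} rad, error: {error_rad:.2e} rad")
```

With noisy VPs the two direction matrices are no longer exactly orthonormal, which is one reason a dedicated solver is needed in practice.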


b) Relative translation estimation

We compare our method UM-RT (cf. Section IV-B), which is based on the unified model handling coplanar points and lines, with a non-structural points based method [20], denoted as NP-RT, and a structural lines based approach [23], denoted as SL-RT (shown in Fig. 2(b)).
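One standard way to see why coplanar points and lines can be treated on an equal footing is the plane-induced homography: a single matrix H transfers points between two views, and its transpose transfers lines. The sketch below illustrates this textbook relation only; it is not the UM-RT formulation from the paper. It assumes normalized image coordinates, a plane n·X = d expressed in the first camera frame, and the motion model X2 = R·X1 + t. All names and numeric values are made up for illustration.

```python
import math

def plane_homography(r, t, n, d):
    """H = R + t * n^T / d for the plane n.X = d in the camera-1 frame."""
    return [[r[i][j] + t[i] * n[j] / d for j in range(3)] for i in range(3)]

def apply(h, x):
    """Map a homogeneous point from image 1 to image 2: x2 ~ H x1."""
    return [sum(h[i][j] * x[j] for j in range(3)) for i in range(3)]

def apply_transposed(h, line):
    """Map a homogeneous line from image 2 back to image 1: l1 ~ H^T l2."""
    return [sum(h[j][i] * line[j] for j in range(3)) for i in range(3)]

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical relative pose and plane (Z = 2 in the camera-1 frame).
c, s = math.cos(0.1), math.sin(0.1)
r = [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]  # rotation about y
t = [0.2, 0.0, 0.05]
n, d = [0.0, 0.0, 1.0], 2.0
h = plane_homography(r, t, n, d)

# Two 3D points on the plane; as homogeneous vectors they are also their
# own (unnormalized) projections in camera 1.
p_a, p_b = [0.5, -0.3, 2.0], [-0.4, 0.6, 2.0]
q_a = [dot(r[i], p_a) + t[i] for i in range(3)]  # same points in camera 2
q_b = [dot(r[i], p_b) + t[i] for i in range(3)]

# Point transfer: H p must be parallel to the camera-2 observation.
point_residual = max(abs(v) for v in cross(apply(h, p_a), q_a))

# Line transfer: the image-2 line through both points, mapped back with
# H^T, must pass through both image-1 points.
l_1 = apply_transposed(h, cross(q_a, q_b))
line_residual = max(abs(dot(l_1, p_a)), abs(dot(l_1, p_b)))
print(point_residual, line_residual)
```

Since both residuals depend on the same H, and H depends on t, point and line correspondences on a common plane constrain the translation through the same model.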


Fig. 2. Relative pose estimation results between two simulated images with respect to noise: (a) relative rotation; (b) relative translation.


c) 3D line optimization

We compare our 3D line optimization strategy S-LO, based on structural constraints (cf. Section V), with a traditional non-structural constraint based approach [6], denoted as NS-LO. Both methods aim to minimize the re-projection error, and we solve them with the Levenberg-Marquardt method available in the Ceres Solver [25] (shown in Tab. 1).
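For readers unfamiliar with line re-projection error, the sketch below shows one common definition: the squared distances from the endpoints of a detected 2D segment to the projection of the 3D line. Whether S-LO and NS-LO use exactly this form is not stated here, so treat it as an assumption. The pinhole intrinsics, the endpoint representation of the 3D line, and all function names are hypothetical.

```python
import math

def project(point, fx, fy, cx, cy):
    """Pinhole projection of a 3D point given in the camera frame."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)

def line_through(p, q):
    """Normalized 2D line (a, b, c) through pixels p and q, with a^2 + b^2 = 1."""
    a = p[1] - q[1]
    b = q[0] - p[0]
    c = p[0] * q[1] - q[0] * p[1]
    norm = math.hypot(a, b)
    return (a / norm, b / norm, c / norm)

def line_reprojection_error(p3d, q3d, seg_start, seg_end, fx, fy, cx, cy):
    """Sum of squared point-to-line distances of the detected endpoints."""
    a, b, c = line_through(project(p3d, fx, fy, cx, cy),
                           project(q3d, fx, fy, cx, cy))
    d_start = a * seg_start[0] + b * seg_start[1] + c
    d_end = a * seg_end[0] + b * seg_end[1] + c
    return d_start ** 2 + d_end ** 2

# A horizontal 3D line 4 units in front of the camera projects to the
# image row y = 302.5; the detected segment is off by one pixel at each end.
error = line_reprojection_error(
    (-1.0, 0.5, 4.0), (1.0, 0.5, 4.0),
    (200.0, 303.5), (440.0, 301.5),
    fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(error)  # 2.0
```

In a nonlinear least-squares solver, the two signed distances would be the residuals for one line observation, and the optimizer would adjust the 3D line parameters to reduce them.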


Table 1. A comparison between two 3D optimization strategies


d) Test on long image sequence

To evaluate the proposed system Struct-PL-SLAM (cf. Section II-A), which is based on structural points and lines, in a large-scale scene, an experiment is conducted on the long synthetic image sequence. We compare our Struct-PL-SLAM with a non-structural lines based system [4], denoted as Line-SLAM. Fig. 3 shows a comparison of the recovered absolute poses. Fig. 4 shows a comparison of the reconstructed 3D line segments.


Fig. 3. Absolute pose estimation results on the long synthetic image sequence: (a) absolute rotation; (b) absolute translation.



Fig. 4. Reconstructed 3D maps consisting of line segments: (a) results of Line-SLAM; (b) results of our Struct-PL-SLAM (red, green and blue represent the three dominant directions).

3.2. Real Images

We evaluate the proposed system on the HRBB4 dataset [26], which is well suited to evaluating visual SLAM systems in a typical corridor scene. The image sequence was recorded by a monocular camera mounted on a moving robot and contains 12,000 frames of 640 × 320 pixels. The total length of the square-shaped trajectory is about 70 m, and the ground truth of the camera positions is provided.

We compare our Struct-PL-SLAM with existing state-of-the-art systems:
- non-structural points based ORB-SLAM [3];
- non-structural lines based Line-SLAM [4];
- non-structural points and lines based PL-SLAM [6];
- structural lines based Struct-Line-SLAM [10].

As shown in Fig. 5, the proposed system Struct-PL-SLAM (cf. Section II-A) can effectively detect and exploit the structural features of the corridor scene. Fig. 5(a) presents the results of VP extraction, and Fig. 5(b) shows representative results of identifying coplanar line matches, which is based on clustering the 2D line segments with respect to the VPs.

Fig. 5. Structural features in the corridor scenario: (a) detection result of three orthogonal VPs; (b) four coplanar line matches, which correspond to different 3D planes, are determined by one horizontal VP and marked with respective colors (only one image of the matched image pair is shown).
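Clustering 2D line segments with respect to VPs can be sketched as follows. This is a generic angular-consistency test, not necessarily the criterion used in Struct-PL-SLAM: a segment is assigned to the VP whose direction from the segment midpoint best matches the segment direction, provided the angle is below a threshold. The threshold value, the homogeneous VP representation, and all names are assumptions made for illustration.

```python
import math

def angle_to_vp(segment, vp):
    """Undirected angle (radians) between a segment and the line joining
    its midpoint to a vanishing point given as homogeneous (x, y, w)."""
    (x1, y1), (x2, y2) = segment
    mx, my = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    sx, sy = x2 - x1, y2 - y1
    # Direction from the midpoint to the VP; also valid for VPs at
    # infinity (w = 0), where it reduces to the direction (x, y).
    vx, vy = vp[0] - vp[2] * mx, vp[1] - vp[2] * my
    norm = math.hypot(sx, sy) * math.hypot(vx, vy)
    if norm == 0.0:
        return math.pi / 2.0  # degenerate case: treat as inconsistent
    cos_angle = min(1.0, abs(sx * vx + sy * vy) / norm)
    return math.acos(cos_angle)

def cluster_segments(segments, vps, max_angle_deg=5.0):
    """Label each segment with the index of its best VP, or -1 if none fits."""
    labels = []
    for seg in segments:
        angles = [angle_to_vp(seg, vp) for vp in vps]
        best = min(range(len(vps)), key=lambda k: angles[k])
        labels.append(best if math.degrees(angles[best]) <= max_angle_deg else -1)
    return labels

# Hypothetical corridor-like setup: one finite VP near the image center
# and two VPs at infinity (horizontal and vertical).
vps = [(320.0, 240.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
segments = [
    ((100.0, 100.0), (210.0, 170.0)),  # points toward the finite VP
    ((50.0, 400.0), (150.0, 401.0)),   # nearly horizontal
    ((500.0, 50.0), (501.0, 200.0)),   # nearly vertical
    ((10.0, 10.0), (60.0, 200.0)),     # consistent with no VP
]
labels = cluster_segments(segments, vps)
print(labels)  # [0, 1, 2, -1]
```

Segments that share a label are candidates for being parallel in 3D, which is the cue later used when deciding which line matches lie on a common plane.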


Fig. 6 shows the camera trajectories estimated by the various systems. Overall, the non-structural features based systems listed above perform unsatisfactorily. Among the structural features based systems, Struct-Line-SLAM does not perform as well as we expected. In contrast, the proposed Struct-PL-SLAM achieves high accuracy and robustness.


Fig. 6. Top view of the estimated camera trajectories on the HRBB4 dataset. The black line represents the ground truth. The results of the non-structural features based systems ORB-SLAM [3], Line-SLAM [4] and PL-SLAM [6], as well as the structural features based systems Struct-Line-SLAM [10] and our Struct-PL-SLAM, are presented.
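A trajectory comparison like the one in Fig. 6 is often summarized by a single number such as the root-mean-square position error. The metric used in the paper is not stated here, so the sketch below is only an illustration. It assumes the estimated and ground-truth positions are already time-associated and expressed in the same frame and scale (a monocular system would first need a similarity alignment, which is omitted); the function name and sample values are hypothetical.

```python
import math

def position_rmse(estimated, ground_truth):
    """Root-mean-square distance between corresponding 2D positions."""
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared = [(ex - gx) ** 2 + (ey - gy) ** 2
               for (ex, ey), (gx, gy) in zip(estimated, ground_truth)]
    return math.sqrt(sum(squared) / len(squared))

# Top-view positions in meters (made-up values).
ground_truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
estimated = [(0.0, 0.0), (1.0, 0.3), (2.0, -0.4)]
rmse = position_rmse(estimated, ground_truth)
print(f"RMSE: {rmse:.4f} m")
```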


Next, we evaluate the accuracy of the 3D maps reconstructed by the various systems. Fig. 7 compares the 3D map of line segments produced by PL-SLAM with the 3D structural map produced by Struct-PL-SLAM. The map of PL-SLAM is more disordered, due to the limited accuracy of rotation and translation as well as noise in the image line matches. In contrast, the result of our Struct-PL-SLAM is more accurate.


Fig. 7. Comparison between two 3D maps consisting of line segments: (a) result of PL-SLAM [6]; (b) result of our Struct-PL-SLAM (the three dominant directions are marked in red, green and blue).


Dataset and Codes

1. HRBB4 Dataset: http://telerobot.cs.tamu.edu/MFG/data/hrbb4/index.html

2. Algorithm Codes of Proposed Methods: Source Code



1. Haoang Li, Jian Yao*, Jean-Charles Bazin, Xiaohu Lu and Yazhou Xing, "A Monocular SLAM System Leveraging Structural Regularity in Manhattan World", submitted to the 2018 IEEE International Conference on Robotics and Automation (ICRA 2018), September 2017.