Application of Probabilistic and Machine Learning Models for Groundwater Potentiality Mapping in Damghan Sedimentary Plain, Iran
Application of Probabilistic and Machine Learning Models for Groundwater Potentiality Mapping in...
Arabameri, Alireza;Roy, Jagabandhu;Saha, Sunil;Blaschke, Thomas;Ghorbanzadeh, Omid;Tien Bui, Dieu
2019-12-14 00:00:00
remote sensing Article Application of Probabilistic and Machine Learning Models for Groundwater Potentiality Mapping in Damghan Sedimentary Plain, Iran 1 2 2 3 Alireza Arabameri , Jagabandhu Roy , Sunil Saha , Thomas Blaschke , 3 4 , Omid Ghorbanzadeh and Dieu Tien Bui * Department of Geomorphology, Tarbiat Modares University, Tehran 14117-13116, Iran; [email protected] Department of Geography, University of Gour Banga, Malda, West Bengal 732103, India; [email protected] (J.R.); [email protected] (S.S.) Department of Geoinformatics – Z_GIS, University of Salzburg, 5020 Salzburg, Austria; [email protected] (T.B.); [email protected] (O.G.) Institute of Research and Development, Duy Tan University, Da Nang 550000, Vietnam * Correspondence: [email protected] Received: 12 November 2019; Accepted: 10 December 2019; Published: 14 December 2019 Abstract: Groundwater is one of the most important natural resources, as it regulates the earth’s hydrological system. The Damghan sedimentary plain area, located in the region of a semi-arid climate of Iran, has very critical conditions of groundwater due to massive pressure on it and is in need of robust models for identifying the groundwater potential zones (GWPZ). The main goal of the current research is to prepare a groundwater potentiality map (GWPM) considering the probabilistic, machine learning, data mining, and multi-criteria decision analysis (MCDA) approaches. For this purpose, 80 wells collected from the Iranian groundwater resource department and field investigation with global positioning system (GPS), have been selected randomly and considered as the groundwater inventory datasets. Out of 80 wells, 56 (70%) wells have been brought into play for modeling and 24 (30%) for validation purposes. Elevation, slope, aspect, convergence index (CI), rainfall, drainage density (Dd), distance to river, distance to fault, distance to road, lithology, soil type, land use/land cover (LU/LC), normalized dierence vegetation index (NDVI), topographic wetness index (TWI), topographic position index (TPI), and stream power index (SPI) have been used for modeling purpose. The area under the receiver operating characteristic (AUROC), sensitivity (SE), specificity (SP), accuracy (AC), mean absolute error (MAE), and root mean square error (RMSE) are used for checking the goodness-of-fit and prediction accuracy of approaches to compare their performance. In addition, the influence of groundwater determining factors (GWDFs) on groundwater occurrence was evaluated by performing a sensitivity analysis model. The GWPMs, produced by technique for order preference by similarity to ideal solution (TOPSIS), random forest (RF), binary logistic regression (BLR), weight of evidence (WoE) and support vector machine (SVM) have been classified into four categories, i.e., low, medium, high and very high groundwater potentiality with the help of the natural break classification methods in the GIS environment. The very high groundwater potentiality class is covered 15.09% for TOPSIS, 15.46% for WoE, 25.26% for RF, 15.47% for BLR, and 18.74% for SVM of the entire plain area. Based on sensitivity analysis, distance from river, and drainage density represent significantly eects on the groundwater occurrence. validation results show that the BLR model with best prediction accuracy and goodness-of-fit outperforms the other five models. Although, all models have very good performance in modeling of groundwater potential. Results of seed cell area index model that used for checking accuracy classification of models show that all models have suitable performance. Therefore, these are promising models that can be applied for the GWPZs identification, which will help for some needful action of these areas. Remote Sens. 2019, 11, 3015; doi:10.3390/rs11243015 www.mdpi.com/journal/remotesensing Remote Sens. 2019, 11, 3015 2 of 35 Keywords: groundwater potential mapping (GWPM); probabilistic models; machine learning algorithms; sensitivity analysis; Damghan sedimentary plain 1. Introduction Groundwater plays a crucial role in serving the heterogeneous need of human being such as drinking, agricultural, industrial, etc. [1]. In another way, groundwater availability and accessibility control sustainable development at a global, regional and local scale [2]. Large numbers of countries of the earth are facing the problem of water scarcity at the societal level [3]. In the arid and semi-arid regions, groundwater is the prime source of water and accounts for 80% groundwater resource [4]. Notably, in Iran, groundwater is more demanded source owing to its cleanness, tawdriness, constant chemical composition, constant temperature, lower pollution coecient, and a high certainty [4,5]. Groundwater extensively aects economic development, biological diversity and community health [6]. Similar to Iran country, a major part of the largely arid and semi-arid physiographic regions suers from the scarcity of water. Therefore, the groundwater is a main source of water to serve the dierent purpose and utilization of this region [7]. Iran has received average annual precipitation of 413 mm, and the evapo-transportation rate is 296 mm. Therefore, the 117 billion m of water is stored as groundwater over the whole country. The global per capita annual renewal water is 7600 m while the quantity of per capita global renewable water in Iran is 1900 m . In this region, the average yearly water consumption is 3.4 billion m , out of which about 65% is supplied from groundwater. In the present day, Iran is facing harsh water supply problems [8]. From these data, it is inevitable to implement water resource management policy for continuing the country’s economic and societal development. However, this issue can be short out by taking some necessary steps and decisions such as watershed management, artificial recharge, and management of soil and water [8]. In present the decades, groundwater recharge level has fallen due to unnecessary use and unscientific management plans [9]. Hence, aquifer potential determination through groundwater potentiality analysis is a good strategy in this field [10,11]. Dierent methods and models, techniques and processes were induced and used for the groundwater potentiality map (GWPM) or identification of areas having good potentiality of groundwater recharge. A few decades back, conventional techniques were applied for GWPM. Day by day improvement in technology with regarding scientific approach, the measuring instrument and computerized data analysis able to recognize the groundwater level, flow and other aspects for GWPM. Comparatively, contemporary scientific methods are providing better outcomes than the conventional method. Recently, remote sensing (RS) and geographic information system (GIS) are playing important role in managing the groundwater resource without the computational requirements [12]. RS technique provides the spatial and non-spatial information—even over the inaccessible areas in a short duration [13]. Therefore, RS technique also a powerful, ecient, accurate tool for collecting, restoring, manipulating, analyzing the spatial data of the surface and sub-surface water research, e.g., groundwater recharge, potentiality, evaluation of water quality [2,14]. Specifically, satellite imagery can provide hydrological characteristics, i.e., drainage network, flow accumulation, drainage density, recharge, and other geomorphologic characteristics [15]. The modeling of groundwater potential zones (GWPZ) is not only dependent on the single factors but also dependent on the dierent geo-environmental factors such as elevation, slope, aspect, rainfall, geology, fault, rainfall, drainage density (Dd), land use/land cover (LU/LC), normalized dierence vegetation index (NDVI), topographic wetness index (TWI), stream power index (SPI), soil permeability, topographic position index (TPI), convergence index (CI), infiltration rate, and soil texture. RS and GIS integration with modern groundwater mapping models such as probabilistic, knowledge-driven, machine learning, data mining could provide a powerful way to gain valuable decision-making information. The rapid development of probabilistic, machine learning, data mining, and ensemble models in recent decades is enhancing the basement to determine groundwater recharge opportunity, soil erosion susceptibility, gully erosion susceptibility, and other spatial modelings. Some Remote Sens. 2019, 11, 3015 3 of 35 new methods which were used by the researcher for spatial hazards probability and groundwater potentiality modeling are: evidential belief function (EBF), weights of evidence (WoE), frequency ratio (FR), classification and regression tree (CART,), boosted regression tree (BRT), decision tree (DT), artificial neural network (ANN), multivariate adaptive regression splines (MARS), binary logistic regression (BLR), Shannon’s entropy (SE), analytic hierarchy process (AHP), maximum entropy (ME), random forest (RF), fuzzy logic (FL), support vector machine (SVM), multi-criteria decision analysis (MDCA), logistic model tree (LMT), quadratic discriminate analysis (QDA), K-nearest neighbor (KNN), and certainty factor (CF) [16–22]. In this work, we have used probabilistic, machine learning, data mining, and MDCA methods, namely WoE, BLR, SVM, RF, and a technique for order preference by similarity to ideal solution (TOPSIS). The outcomes of the same model vary depending on the physiographical situation in dierent regions. The suitable models help to demarcate the areas having groundwater potentiality. The models used in this research are accessible and eciently capable of groundwater modeling and are used in various areas for environmental management [23–26]. Thus, the study aims to recognize the GWPZ using five models (RF, TOPSIS, WoE, SVM, and BLR,) in the Damghan sedimentary plain of Semnan province in Iran. The current study will help in determining the proper groundwater resource and to the decision-maker for managing the water resources. 2. Materials and Methods 2.1. Study Area Damghan sedimentary plain, located within the Semnan province in Iran, covers an area of 2 0 0 0 1559 km . Geographically, this plain region stretches from 35 56 to 36 18 N latitude and 54 00 E to 54 40 E longitude (Figure 1). The long-term average of precipitation and long-term evaporation are about 151.01 and 3000 mm, respectively [27]. The arid climate prevails in this plain because the annual evaporation is greater than annual precipitation [28]. The average temperature in the mountainous portion of the study area is 9.8 ºC, and in the plain area, the mean temperature is 23.5 ºC. In the south of Alborz zone, the upland area of the watershed is extended, and the plain’s elevation ranges from 2860 m. a.s.l. in the northwest, to 1043 m a.s.l. in the southeast. Major portions of the study region are composed of Quaternary deposits [29]. The remaining parts of the plain are situated in the Alborz region and are covered by calcrete layers such as Cretaceous formations, as well as sandstone and Paleogene-related conglomerates. The low elevated area, composed of Quaternary deposits, has a high-water yield and recharge rate because of sediment nature and succession [30]. Nevertheless, the upland region in the Alborz zone is not suitable for recharge [31]. The mean depth of alluvial sediment varies from 150 m in north to 240 m in the south. In this area, the unconfined aquifer and bedrock consist of Neogene alluvium, such as marl and conglomerate, and the well logs set out the type of sediment. Remote Sens. 2018, 10, x FOR PEER REVIEW 4 of 37 Remote Sens. 2019, 11, 3015 4 of 35 Figure 1. Location of the study area in Iran and Semnan province and location of training and Figure 1. Location of the study area in Iran and Semnan province and location of training and validations wells in the study area. validations wells in the study area. The sedimentary plain of Damghan is located in arid and semi-arid regions and facing the problem 2.2. Methodology of water supply such as other arid regions. The main source of freshwater is sub-surface water storage For assessing the groundwater potentiality (GWP), some spatial and non-spatial data have been and undergoes the problems of over pumping and lowering of groundwater. In the recent decade, gathered to prepare different datasets for modeling and validation of results. The data consists of due to excessive groundwater exploitation for irrigation and industrial purposes combining with the two, i.e., primary and secondary, data. Primary data are pumping tests and yield measurements in decreasing amount of rainfall, the water table is coming down rapidly. Therefore, immediate planning the field. The secondary data are topographical map (scale 1:50000), lithological map (scale 1:100000), is needed to conserve the groundwater [30]. In this respect, the delineation of GWPZ is essential for Sentinel 2A, Phased Array type L-band synthetic aperture radar (PALSAR) digital elevation model proper planning and sustainable management of water. (DEM), rainfall of different metrological station of last 30 years, well location data from Water 2.2. Resou Methodology rce Management, Iran, soil data from soil department of Iran. Thematic maps of all the data were extracted and analyzed by the RS and GIS. The present work methodologically consists of four For assessing the groundwater potentiality (GWP), some spatial and non-spatial data have been phases (Figure 2) including; (1) preparation of groundwater inventory database thematic data layers gathered to prepare dierent datasets for modeling and validation of results. The data consists of of the groundwater conditioning factors including elevation, slope, aspect, CI, rainfall, lithology, soil two, i.e., primary and secondary, data. Primary data are pumping tests and yield measurements type, LU/LC, Dd, distance to river, distance to fault, distance to road, NDVI, TPI), TWI, and SPI; (2) in the field. The secondary data are topographical map (scale 1:50,000), lithological map (scale multicollinearity assessment of the effective groundwater determining factors (GWDFs); (3) application of models and preparation of GWPMs. The GWPMs were classified according to the four Remote Sens. 2019, 11, 3015 5 of 35 1:100,000), Sentinel 2A, Phased Array type L-band synthetic aperture radar (PALSAR) digital elevation model (DEM), rainfall of dierent metrological station of last 30 years, well location data from Water Resource Management, Iran, soil data from soil department of Iran. Thematic maps of all the data were extracted and analyzed by the RS and GIS. The present work methodologically consists of four phases (Figure 2) including; (1) preparation of groundwater inventory database thematic data layers of the groundwater conditioning factors including elevation, slope, aspect, CI, rainfall, lithology, soil type, LU/LC, Dd, distance to river, distance to fault, distance to road, NDVI, TPI), TWI, and SPI; (2) multicollinearity assessment of the eective groundwater determining factors (GWDFs); (3) application of models and preparation of GWPMs. The GWPMs were classified according to the four classification methods, namely quantile, natural breaks, equal interval, and geometrical interval, into four dierent groundwater susceptibility classes, including low, medium, high, and very high. By comparing the results of each classification method and the distribution of training and validation wells on the high and very high groundwater susceptibility classes, it was found that the natural break classification method gave the most accurate distribution. This agrees with the findings by Arabameri et al. [32], in that natural break method is a good classifier in susceptibility mapping; and (4), evaluation of the models performances using area under receiver operating characteristics (AUROC) curve, sensitivity (SE), specificity (SP), accuracy (AC), mean absolute error (MAE), root mean square error (RMSE) and seed cell area index (SCAI) methods. 2.3. Data Preparation 2.3.1. Groundwater Inventory Map (GWIM) The groundwater inventory database is of a key role in groundwater potentiality mapping. An inventory map is a target variable for any spatial modeling [32]. The well inventory database was prepared after extensive field visit with a hand GPS (global positioning system), and yield data were collected from the Department of Water Resources Management, Iran. Groundwater wells, with high yield of11 m h 1 by pumping test analysis, have been considered for the GWPM. As a result, 80 wells have recognized in the study area. 56 wells (70%) of this dataset, were randomly selected to produce the GWPM models [32], whereas the remaining 24 (30%) wells were considered for validation of GWPMs [11]. The training and testing wells locations have been mentioned in Figure 1. Remote Remote Sens Sens. 2019 . ,2018 11, , 3015 10, x FOR PEER REVIEW 6 of 37 6 of 35 Figure 2. Methodological flowchart of the present work. Figure 2. Methodological flowchart of the present work. 2.3.2. Groundwater Determining Factors (GWDFs) The dierent geo-environmental components play a crucial role in determining the status of groundwater. GWPM represents the association between GWDFs and well locations [21,22]. For the GWPM, 16 GWDFs have been selected including elevation, slope, aspect, CI, rainfall, lithology, soil type, LU/LC, NDVI, Dd, distance to the river, distance to fault, distance to road, TWI, SPI and TPI Remote Sens. 2019, 11, 3015 7 of 35 (Figure 3a–p). The PALSAR DEM (12.5 m resolution) downloaded from the Alaska Satellite Facility (ASF) Distributed Active Archive Center (DAAC). In this study, PALSAR DEM was used to extract the topographical, hydrological factors such as elevation, slope, aspect, CI, drainage, TWI, SPI, and TPI. The slope, aspect, and elevation are the major topographic components, used to determine the groundwater potentiality, erosion probability, etc. [21]. The DEM has been used as the elevation dataset (Figure 3a). The altitudinal fluctuation controls climatic conditions and helps to induce various vegetation types and soil development [33]. The slope data layer has been derived from PALSAR DEM by spatial analysis in the GIS environment (Figure 3b). In the same way, the aspect map has also been extracted from PALSAR DEM imagery using a spatial analysis tool (Figure 3d). CI is an important terrain factor that demonstrates the arrangement of relief as a set of channels and ridges. It is developed by Kiss [34]. The convergence index (CI) has been calculated using Equation (1). CI = 90 , (1) i=1 where indicates the average angle between the aspect of adjacent cells and the direction to the central cell. The CI value ranges from 100 to +100 (Figure 3c). The rainfall map was prepared by the kriging method considering the last 10-year annual rainfall of dierent stations (Figure 3e). The drainage was extracted from the topographical map and PALSAR DEM imagery. The Dd was computed based on Horton’s morphometric formula (Equation (2)). Dd = , (2) where Lu means the total length of all orders streams, A is the area in square kilometer. Finally, the spatial data layer of the Dd has been built using the IDW interpolation method in the GIS environment (Figure 3f). The fault layer has been taken out from Landsat 7 imagery in ENVI software. The road network has been taken o from the topographical map and Google Earth imagery. The distance to river, fault, road data layers have been built using the Euclidian distance buering tool and expressed in km (Figure 3g,h,i) [11]. The lithological information for the study area was gathered from the geological department of Iran [29]. The lithology map has been prepared by the digitization process (Figure 3j). Geologically, the region is composed of nine geological segments, namely A, B, C, D, E, F, G, H, and I (Figure 3j and Table 1). Soil data was collected from the soil department of Iranian and with the help of the digitized process, the thematic dataset of soil has been produced (Figure 3k). The LU/LC map has been produced from Sentinel 2A satellite image (12/08/2017) of 10 m, 20 m, and 60 m spatial resolution for each band using the supervised image classification method (Figure 3l). NDVI has been computed from satellite image (Figure 3m) using Equation (3): NIR Red NDVI = , (3) NIR + Red where NIR is the near-infrared band or band 8 and red band or band 4. The TWI directly aects the topographic conditions, which control the hydrological process. TWI is the function of slope and the upstream area per unit width orthogonal to the direction of flow [35]. TWI plays a major role in the spatial heterogeneity of hydrological conditions such as soil moisture, underwater flow and slope steady-state [32]. The TWI has been introduced by Beven and Kirkby [36]. TWI is calculated from Equation (4): TWI = In(A / tan ), (4) 2 1 where AS represents the cumulative area of the catchment (m m ) and is the slope gradient (degrees). The TWI value ranges from 1.11 to 21.54 (Figure 3n). The SPI is a calculation of water flow erosive power based on the assumption that discharge is commensurate with a given catchment Remote Sens. 2019, 11, 3015 8 of 35 area [37]. One of the most important factors in controlling slope erosion processes is SPI. Regions with high stream power have high erosion potentiality [38]. From Equation (5), SPI has been calculated: SPI = A tan , (5) where A is the upstream contributing area and is slope gradient (in degrees). The spatial allocation of SPI ranges from 6.27 to 24.44 (Figure 3o) in the research area. TPI is defined as the dierence between the middle point elevation (Z ) and the average elevation (Z) in a predetermined radius around it (R) [39]: TPI = Z Z, (6) Z = Z . (7) i2R The TPI has positive and negative value; a positive value demonstrates that the midpoint is located at a higher place than its average while a negative value indicates a lower place than the average. The TPI range depends not only on variations in altitude but also on landscape units (R) [40]. Where large R values mainly depend on the main units of landscape, and small R values show up lower valleys such as small valleys and ridges. The TPI value ranges from 12.16 to 14.67 in this plain (Figure 3p). Spatial resolutions of the selected GWDFs are not the same. For preparing the groundwater potential maps of the study area the resolution of PALSAR DEM, i.e., 12.5 m* 12.5 m has been selected as the base scale and all the GWCFs of which scale are greater or lesser than the PALSAR DEM have been resembled into a 12.5 m* 12.5 m resolution. The data layers of elevation (Figure 3a), slope (Figure 3b), CI (Figure 3c), rainfall (Figure 3e), Dd (Figure 3f), distance to river (Figure 3g), distance to fault (Figure 3h), distance to road (Figure 3i), TWI (Figure 3n), SPI (Figure 3p), TPI (Figure 3o), and NDVI (Figure 3m) have been categorized into five sub-classes using the natural break classification method in GIS environment (Table 2). Aspect (Figure 3d), lithology (Figure 3j), soil type (Figure 3k), and LU/LC (Figure 3l) are the categorical factors. The categorical factors are also mentioned in Table 2. Table 1. Description of lithology units in the study area. Group Unit Description COm Dolomite platy and flaggy limestone containing trilobite; sandstone and shale (MILA FM). Cl Dark red medium-grained arkosic to subarkosic sandstone and micaceous siltstone (LALUN FM). Yellowish, thin to thick-bedded, fossiliferous argillaceous limestone, dark grey limestone, greenish DCkh marl and shale, locally including gypsum Db Grey and black, partly nodular limestone with intercalations of calcareous shale (BAHRAM FM). E1s Sandstone, conglomerate, marl and sandy limestone. Ek Well bedded green tu and tuaceous shale (KARAJ FM). D Jl Light grey, thin-bedded to massive limestone (LAR FM). K2m,l Marl, shale and detritic limestone. K Cretaceous rocks in general. Murmg Gypsiferous marl. Murc Red conglomerate and sandstone. Plc Polymictic conglomerate and sandstone. PlQc Fluvial conglomerate, Piedmont conglomerate and sandstone. P Undierentiated Permian rocks. Pr Dark grey medium-bedded to massive limestone (RUTEH LIMESTONE). Qft2 Low level piedmont fan and valley terrace deposits. Qft1 High level piedmont fan and valley terrace deposits. Qcf Clay flat. Qal Stream channel, braided channel and flood plain deposits. I TRJs Dark grey shale and sandstone (SHEMSHAK FM). Remote Sens. 2018, 10, x FOR PEER REVIEW 10 of 37 Remote Sens. 2019, 11, 3015 9 of 35 Figure 3. Cont. Remote Sens. Remote Sens 2019, . 11 2018 , 3015 , 10, x FOR PEER REVIEW 11 of 10 37 of 35 Figure 3. Groundwater determining factors: (a) elevation, (b) slope, (c) aspect, (d) convergence index, (e) rainfall, Figure 3. Groundwater determining factors: (a) elevation, (b) slope, (c) aspect, (d) convergence index, (f) drainage density, (g) distance to river, (h) distance to fault, (i) distance to road, (j) lithology, (k) soil type, (l) (e) rainfall, (f) drainage density, (g) distance to river, (h) distance to fault, (i) distance to road, (j) land use/land cover (LULC), (m) normalized difference vegetation index (NDVI), (n) topographic wetness index lithology, (k) soil type, (l) land use/land cover (LULC), (m) normalized dierence vegetation index (TWI), (O) topographic position index (TPI), (p) stream power index (SPI). (NDVI), (n) topographic wetness index (TWI), (O) topographic position index (TPI), (p) stream power index (SPI). Remote Sens. 2019, 11, 3015 11 of 35 Table 2. Computation of statistics and classes of groundwater determining factors (GWDFs). Factors Min. Max. Classes Methods (1.) <1155, (2.) 1155 –1297, (3.) 1297–1512, (4.) Natural break Elevation (m) 1043 2869 1512–1993, (5.) >1993 (Jenks) (1.) <2.55, (2.) 2.55–9.35, (3.) 9.35–20.70, Natural break Slope (degree) 0 72.32 (4.) 20.70–34.03, (5.) >34.03 (Jenks) (1.) Flat ( 1), (2.) North (0–22.5), (3.) Northeast (22.5–67.5), (4.) East (67.5–112.5), (5.) Southeast Aspect - - (112.5–157.5), (6.) South (157.5–202.5), (7.) Directional units Southwest (202.5–247.5), (8.) West (247.5–292.5), (9.) Northwest (292.5–337.5) (1.) <