Introduction

What you will learn according to the description of the course:

And besides this:

In the short available time we cannot cover all of these topics in detail. Therefore, we aim to provide a starting point that enables you to continue studying and learning.

We will teach the scheduled content using the open-source software R (https://www.r-project.org/ and http://cran.r-project.org/). The reason: you do not need to learn many different software tools. Once you are able to use R, you can carry out nearly every GIS- and geostatistics-related task with it.

Here is a non-exhaustive list of useful literature about spatial data analysis, (geo-)statistics, and machine learning.

Getting started with R

What is R

R is a high-level programming language and environment for data analysis and graphics (Crawley 2012). What can R do for you? “R can do anything you can imagine” (Zuur et al. 2009, p.1). You can write functions, do calculations, apply hundreds of statistical and geostatistical techniques, create complex graphs, and adapt R to your needs by writing your own library functions. R is supported by a huge user community, so besides continuous development of the software, you will always find experts able to help you with R-related questions, e.g. via mailing lists (general: https://www.r-project.org/mail.html or specific: https://stat.ethz.ch/mailman/listinfo/R-SIG-Geo/), on Stack Overflow (SO, http://stackoverflow.com/questions/tagged/r), in the vignettes that are available for many packages, functions, and problems (https://stat.ethz.ch/R-manual/R-devel/library/utils/html/vignette.html), and finally via a web search for everything not listed here. Ways to find help in R are nicely summarised in this SO answer: http://stackoverflow.com/questions/15289995/how-to-get-help-in-r.

A rising number of research institutes, companies, and universities have migrated to R, which becomes obvious from the number of scientific articles (see http://r4stats.com/articles/popularity/) as well as from the large number of books published about R and topics related to (geo-)statistics. A non-exhaustive collection:

  • Baddeley & Turner (2005)
  • Baddeley (2008)
  • Bivand et al. (2008)
  • Borcard et al. (2011)
  • Crawley (2012)
  • Everitt (2006)
  • Maindonald & Braun (2003)
  • Radziwill (2015)
  • Schumacker & Tomek (2013)
  • Soetaert et al. (2012)
  • Stevens (2010)
  • Wickham (2009)
  • Zhao (2012)
  • Zuur et al. (2009)

By the way: R is available free of charge, for everyone, everywhere, at any time. R is free software. If you want to learn more about these fundamental and important aspects, have a look at https://www.fsf.org/ as well as https://www.gnu.org/.

Having said this, there are many different ways to use R. Besides R's own GUI, my personal favourite is ESS, which stands for Emacs Speaks Statistics, an add-on for the famous GNU Emacs text editor (more information at http://ess.r-project.org/). Nevertheless, since learning Emacs demands a course of its own, we are going to use RStudio, probably the most accessible and popular GUI for R at the moment (more information at https://www.rstudio.com/; a collection of GUIs for R can be found at Wikipedia).

Video lectures (like An Introduction to Quantitative Inference and Thinking); YouTube in general is a great resource for help with R, statistics, etc. A complete lecture series on Geographical Analysis at the University of Utah by Dr. Steven Farber can be found here

Massive Open Online Courses (MOOC) on R e.g. at edX or at Coursera

Interactive, online “Introduction to R”: https://www.datacamp.com/courses/free-introduction-to-r

Another good introduction to R: https://ramnathv.github.io/pycon2014-r/

Stay updated: http://www.r-bloggers.com/

R as a calculator

In the simplest case R can be used directly from the console. Let’s try it by using R as a calculator. Just type the following in and hit enter after each line:

1+1
## [1] 2
10-1
## [1] 9
3*5
## [1] 15
12/3
## [1] 4
16%/%3
## [1] 5
16%%3
## [1] 1
12^2
## [1] 144
sqrt(16)
## [1] 4
log(1)
## [1] 0
log10(100)
## [1] 2
exp(0)
## [1] 1

What is %% and %/%? Let’s find out:

help("%%")

Anything more that you do not understand? Search the help for it! There are many different possibilities to do it…
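As a small illustration of what the help page describes: %/% is integer division and %% is the remainder (modulo). The two belong together, so that quotient times divisor plus remainder gives back the original number. The variable names in this sketch are chosen freely and are not part of R.

```r
# Integer division and remainder, using the numbers from the calculator example
x <- 16
y <- 3

quotient  <- x %/% y   # how many times y fits completely into x
remainder <- x %% y    # what is left over after that

# Both operators belong together: quotient * y + remainder gives x again
check <- quotient * y + remainder == x

# For negative numbers R rounds the quotient down (towards minus infinity),
# so the remainder takes the sign of the divisor
neg.quotient  <- (-16) %/% 3
neg.remainder <- (-16) %% 3

# Operators need quotes in help(); apropos() lists objects matching a pattern
log.functions <- apropos("^log")

print(c(quotient = quotient, remainder = remainder))
print(check)
print(c(neg.quotient = neg.quotient, neg.remainder = neg.remainder))
print(log.functions)
```

apropos() is handy when you only remember part of a function name; help.search() searches the help pages by topic instead.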

Another resource for beginners is the official “An Introduction to R”-documentation.

Learning using swirl package

If you want to learn more about R you can use the interactive tutorial from the swirl package to get started on your own. “The swirl R package makes it fun and easy to learn R programming and data science. If you are new to R, have no fear.” (http://swirlstats.com/students.html)

To install and use swirl type the following code in your R console.

install.packages("swirl", dependencies = TRUE)
library(swirl)
swirl()

Exercise

  1. play around and get familiar with R and RStudio
  2. install the swirl package and do the first unit (“R Programming: The basics of programming in R”).

Scripts

“A R script (basically any script) is simply a text file containing (almost) the same commands that you would enter on the command line of R” (https://cran.r-project.org/doc/contrib/Lemon-kickstart/kr_scrpt.html).

This is a great feature: while writing a script you automatically document your work, so you (and others) can reconstruct how you produced your results. Besides, you can share your script with other researchers in order to debug it, enhance it, get feedback on it, help others with it, …

Think of a script like a publication and follow some basic rules to get the most out of it. Give it a title, state its purpose, give references, set the license, …:

################################################################################ 
## An example of how to write the header of an R script
## =============================================================================
## Project: GIS in Geostatistics in Sri Lanka
## Author: Daniel Knitter 
## Version: 01 
## Date of last changes: So 30. Aug 17:26:27 CEST 2015 
## Data: 
## Author of data: 
## Purpose: just an example 
## Content: nothing yet 
## Licence data: -
## Licence Script: GPL
## 
## how to cite a package? citation(package="PACKAGE-NAME")
################################################################################ 

Please note the # symbol. It marks the rest of the line as a comment, which R does not interpret. Hence, wherever you want to make a remark in your script you can do so with a comment.

sqrt(12) # sqrt() means square root of something in the brackets
## [1] 3.464102
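The last line of the header example mentions citation(). A minimal sketch of how it might be used follows; "stats" only serves as a placeholder for a package that ships with every R installation, so replace it with the package you actually used.

```r
# Citation for R itself
r.citation <- citation()
print(r.citation)

# Citation for a specific package; "stats" is only a placeholder here
pkg.citation <- citation(package = "stats")

# toBibtex() converts the entry for use in a LaTeX bibliography
print(toBibtex(pkg.citation))
```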

Before you fill your script with commands, we have to agree on a style guide. There are some style guides available, for instance:

It does not matter which style guide you use, but be consistent. Here are some examples for points in a style guide:

  • Use short meaningful names
  • For combining parts of a name you can use points or underscores (hyphens only work in file names, because R reads - in an object name as the minus operator). It does not matter which symbol you use, but use the same symbol every time
  • Limit the line length to 80 characters
  • Use spaces before and after operators like +, =, >
  • Try to align similar parts in different rows. You can insert as many spaces as you like
  • An opening curly brace never starts on its own line; a closing curly brace always goes on its own line
  • Use four spaces for indentation
  • Use <- for assignment
  • Use comments in a consistent way
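To make the rules above more concrete, here is a short sketch that tries to follow them (points in names, <- for assignment, four spaces of indentation, spaces around operators, aligned assignments). The temperature values and the function name calc.range are made up for this illustration.

```r
# Made-up monthly mean temperatures in degrees Celsius (illustration only)
temp.monthly <- c(26.5, 27.0, 28.1, 28.4, 28.2, 27.8,
                  27.5, 27.5, 27.4, 27.0, 26.6, 26.4)

# Return the range (maximum minus minimum) of a numeric vector.
calc.range <- function(values) {
    if (length(values) == 0) {
        stop("values must not be empty")
    }
    value.range <- max(values) - min(values)
    return(value.range)
}

temp.range <- calc.range(temp.monthly)
temp.mean  <- mean(temp.monthly)

print(temp.range)
print(temp.mean)
```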

Get data

Open Repositories

At rOpenSci you will find packages that help you to access data repositories through R. “Transforming science through open data – We are changing how science works” https://ropensci.org/.

Since these ideas are important, here are some of these packages that allow data access and analysis.

Geonames repository (optional since it requires free registration)

Install required packages

install.packages("devtools")
require(devtools)
## what the heck is the difference between "library" and "require"? ##
install.packages("rjson")
## current devtools versions expect the repository as "username/repository"
install_github("barryrowlingson/geonames")
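The comment in the code above asks about the difference between library() and require(). One way to explore it yourself is to try both with a package that is not installed. The sketch below assumes that no package with the invented name aPackageThatDoesNotExist is present on your machine.

```r
# A package name that is assumed NOT to be installed (hypothetical name)
pkg <- "aPackageThatDoesNotExist"

# require() only warns and returns FALSE, so the script keeps running
loaded <- suppressWarnings(
    require(pkg, character.only = TRUE, quietly = TRUE)
)

# library() stops with an error; tryCatch() turns it into a message here
result <- tryCatch({
    library(pkg, character.only = TRUE)
    "package loaded"
}, error = function(e) {
    "library() raised an error"
})

print(loaded)
print(result)
```

In a script that depends on a package, an early and clear error is usually what you want, which speaks for library(); require() is mainly useful when you want to test whether a package is available.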

And load the package. Here is the point where your username is required

library(geonames)
options(geonamesUsername="YOURUSERNAME")

World Bank climate data

A tutorial on how to access and use the data is here. We want to use the package to get information about the development of temperature and precipitation between 1960 and 2050.

install.packages("rWBclimate")
library(rWBclimate)

Now, get your ISO 3 country code and start to download some data

Official Sri Lankan departments

Survey Department of Sri Lanka: http://www.survey.gov.lk

We are going to use topographic information that can be downloaded from the homepage of the department. Please download and extract the free shapefile-set they offer here.

Department of Census and Statistics: http://www.statistics.gov.lk/

Census Data of Sri Lanka can be accessed via the great and brand new (12/2014) LankaSIS.

We collected some datasets for you that we thought might be interesting. Wait for the exercises.

Spatial Data - some necessary basics

[The following is to a large extent taken from Knitter & Nakoinz (submitted): “Point Pattern Analysis as Tool for Digital Geoarchaeology – A Case Study of Megalithic Graves in Schleswig-Holstein, Germany”]

Statistics is a very large, sometimes overwhelmingly large, subject. Nevertheless, there is good news: by focusing on “Geostatistics and GIS” we have already defined the focus of our statistical analyses: everything we are investigating is concerned with space and hence spatial data.

Spatial Data are Special

In contrast to normal everyday statistical data, spatial data are special because they do not fulfil one of the most common prerequisites of conventional statistical analyses: they are not random, i.e. not stochastically independent. This causes the specificity of spatial data (collection after O’Sullivan & Unwin 2010, p.34):

  • Spatial autocorrelation is a measure of the importance of a location. It measures to which degree the characteristics at a certain location (or in the study area as a whole) are indicative of other locations in the area. The concept is closely related to Tobler’s first law of geography (at least for positive autocorrelation): “…everything is related to everything else, but near things are more related than distant things” (Tobler 1970, p.236). This means that points next to each other are more likely to have similar or comparable characteristics, e.g. of elevation, than points that are distant. Local similarities are used to describe and differentiate space. For instance, an area with a high concentration of people may be called a settlement; a wetland area with low pH values, dense vegetation and high organic carbon content may be called a swamp, etc. The law also indicates that this holds true for all spatial data. If spatial phenomena varied randomly through space, spatial data would be meaningless (O’Sullivan & Unwin 2010, p.35). There are different techniques for assessing the importance of a location, and hence spatial autocorrelation, in an analysis, e.g. Moran’s I and Geary’s C (e.g. Lloyd 2011, pp.80–82).
  • The modifiable areal unit problem arises when spatial data are compiled or acquired at a certain level of detail but are analysed in aggregated, areally modified form (O’Sullivan & Unwin 2010, pp.36–38). For instance, humans are individuals, but their distribution is reported in census data as a sum per district. Districts are modifiable areal units that are arbitrary in terms of the investigated object. This can lead to problems in subsequent analyses because the unit of aggregation, i.e. the size of the district, influences the outcome of the analysis. The comprehensive discussion of this issue by Openshaw (1984, pp.4–5, 10–11) shows that different aggregation schemes (e.g. different grid cell sizes or shapes) have a very strong effect on correlation measures.
  • The common statistical problem of ecological fallacy is often related to modifiable areal units. It occurs when a statistical relationship is assumed to be present at one level of aggregation because it is present at another (O’Sullivan & Unwin 2010, p.39). Suppose that settlements occur more frequently at more elevated locations: this observation does not allow us to conclude that the sites were placed there in order to gain higher visibility or a better climate. Hence, data on the occurrence of settlements at different altitudes can only support the conclusion that settlements are often elevated in relation to their surroundings.
  • Before starting a spatial analysis it is necessary to decide at which geographic scale the analysis will be conducted, because this affects what we are able to observe. The data we are using here are at a scale of 1:250000 and the settlements are represented as points. This already implies that the scale is too small to investigate, for instance, the shape of settlements. Furthermore, investigating the characteristics of settlements as points gives only one measure (e.g. their altitude), although the settlements cover a certain area, i.e. a certain range of altitudes.
  • Space is not uniform; accordingly, processes measured in space can be heterogeneous although their characteristics do not change. This is an induced spatial dependence (Borcard et al. 2011, p.229).
  • Edge effects are related to the issue of non-uniformity and arise when an artificial boundary is imposed on a study area (Diggle 2013, p.9).
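To give the idea of spatial autocorrelation from the list above a concrete form, here is a minimal sketch of Moran’s I in base R. It assumes the simplest possible setting: values measured along a transect, where only directly adjacent positions count as neighbours. The function name moran.i and both example vectors are invented for this illustration; for real analyses, tested implementations exist in packages such as spdep.

```r
# Moran's I for values along a transect, where only direct neighbours
# (positions i and i + 1) get a weight of 1. Hypothetical helper function.
moran.i <- function(x) {
    n <- length(x)
    # Binary neighbourhood matrix: 1 for adjacent positions, 0 otherwise
    w <- matrix(0, nrow = n, ncol = n)
    w[abs(row(w) - col(w)) == 1] <- 1
    dev   <- x - mean(x)                 # deviations from the mean
    cross <- sum(w * outer(dev, dev))    # weighted cross-products
    return((n / sum(w)) * cross / sum(dev^2))
}

elevation.smooth <- 1:10                 # neighbours are similar
elevation.jumpy  <- rep(c(0, 1), 5)      # neighbours are always different

i.smooth <- moran.i(elevation.smooth)
i.jumpy  <- moran.i(elevation.jumpy)
print(c(smooth = i.smooth, jumpy = i.jumpy))
```

A clearly positive value indicates that neighbouring values resemble each other (positive autocorrelation), a clearly negative value that neighbours tend to differ.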

Many of these points may sound trivial. Nevertheless, it is important to be aware of them since they directly influence the results. Spatial data are the result of processes. In analysing them it is possible to detect functional relationships. But these do not imply causality (see Ahnert 2003, pp.19–20). Hence, it needs to be discussed continuously whether these processes are the actual reason for the configuration of spatial data or just an artefact of the analytical approach.
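How strongly the choice of areal units can influence a result is easy to try out with artificial data. The following sketch is only a toy example under invented assumptions: two variables share a large-scale gradient along a transect of 120 locations and differ in their small-scale variation, and the “districts” are simply blocks of neighbouring locations. All names and values are made up.

```r
# Artificial transect of 120 locations (all values are made up)
location <- 1:120
trend    <- location / 120               # shared large-scale gradient

# Two variables: same gradient plus different small-scale wiggles
var.a <- trend + 0.5 * sin(7 * location)
var.b <- trend + 0.5 * cos(11 * location)

# Correlation at the level of single locations
cor.points <- cor(var.a, var.b)

# Aggregate to 12 "districts" of 10 neighbouring locations each
district   <- rep(1:12, each = 10)
mean.a     <- as.vector(tapply(var.a, district, mean))
mean.b     <- as.vector(tapply(var.b, district, mean))
cor.blocks <- cor(mean.a, mean.b)

# A different zoning: 4 districts of 30 locations each
district.large <- rep(1:4, each = 30)
cor.large <- cor(as.vector(tapply(var.a, district.large, mean)),
                 as.vector(tapply(var.b, district.large, mean)))

print(c(points = cor.points, districts12 = cor.blocks, districts4 = cor.large))
```

In this artificial case the correlation should come out noticeably higher for the aggregated districts than for the single locations, although the underlying values are identical. Only the unit of aggregation has changed.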

Spatial data in R

An overview of the different spatial analytical tools and packages available for R (135 were listed on August 31, 2015) can be found at https://cran.r-project.org/web/views/Spatial.html. A great introduction to the handling of spatial data is given by Lovelace & Cheshire (2015) and can be downloaded from Lovelace’s GitHub account.

Warm up exercise: Interactive Maps in R

“Leaflet is one of the most popular open-source JavaScript libraries for interactive maps. It’s used by websites ranging from The New York Times and The Washington Post to GitHub and Flickr, as well as GIS specialists like OpenStreetMap, Mapbox, and CartoDB” (https://rstudio.github.io/leaflet/)

The R package makes it easy to integrate and control Leaflet maps in R.

First install and load the necessary packages

## devtools::install_github("rstudio/leaflet")
library(leaflet)

Well, before we produce some maps we have to define a location or an area that we want to see. How about this campus? A search for your campus on http://wikimapia.org gives us the geographic coordinates in degrees, minutes, and seconds. This is a small problem, since we need them in decimal degrees. R to the rescue: we just recalculate the values by writing our very own functions.

Convert Geographic coordinates from Decimal Degree to Degree Minute Second (and the other way around).

This is a small task to learn how to write functions. The equations used within the functions can be found at Wikipedia, and via Google I found another version here.

dms.to.dd <- function(d,m,s) {
    dd <- d + (m/60) + (s/3600)
    return(dd)
}

dd.to.dms1 <- function(dd) {
    dd <- as.numeric(dd)
    d <- floor(dd)
    m <- floor((dd - d)*60)
    s <- floor((dd - d - m/60)*3600)
    dms <- paste(d,"°",m,"'",s,"\'\'",sep = "")
    return(dms)
}

dd.to.dms2 <- function(dd) {
    dd <- as.numeric(dd)
    d <- floor(dd)
    m <- floor((abs(dd) * 60))%%60
    s <- floor((abs(dd) * 3600))%%60
    dms <- paste(d,"°",m,"'",s,"\'\'",sep = "")
    return(dms)
}

Question: Which version of the dd.to.dms function is more convenient, and why?

Question: How could the code be improved? What is wrong with the code at the moment?

Let us use our brand new functions. Get some geographic coordinates of your PGIS institute (I found these 7°15'30"N 80°35'47"E on http://wikimapia.org) and try them out.

co.pgis <- c(lat = dms.to.dd(7,15,30),lon = dms.to.dd(80,35,47),name = "Welcome at PGIS :)")
co.pgis
##                  lat                  lon                 name 
##   "7.25833333333333"   "80.5963888888889" "Welcome at PGIS :)"
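Note the quotation marks in the output: the numbers have been turned into text. This happens because c() creates an atomic vector, in which all elements must share one type. The following sketch shows the effect and one possible alternative, a list; the object names co.vector and co.list are invented for this illustration.

```r
dms.to.dd <- function(d, m, s) {
    d + (m / 60) + (s / 3600)
}

# c() builds an atomic vector: one piece of text turns everything into text
co.vector <- c(lat = dms.to.dd(7, 15, 30), name = "PGIS")
print(class(co.vector))

# A list keeps every element in its own type
co.list <- list(lat  = dms.to.dd(7, 15, 30),
                lon  = dms.to.dd(80, 35, 47),
                name = "PGIS")
print(class(co.list$lat))
print(co.list$lat + 1)   # arithmetic works without as.numeric()
```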

Let’s see, whether our functions lead to the same results:

dms.co.pgis1 <- c(dd.to.dms1(co.pgis[1]),dd.to.dms1(co.pgis[2]))
dms.co.pgis2 <- c(dd.to.dms2(co.pgis[1]),dd.to.dms2(co.pgis[2]))
dms.co.pgis1
## [1] "7°15'29''"  "80°35'47''"
dms.co.pgis2
## [1] "7°15'29''"  "80°35'47''"
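Both versions return 29 seconds for the latitude although we started with 30. Presumably this is because floor() cuts off a value that ends up marginally below 30 after the text conversion and the floating-point arithmetic. In addition, floor(-7.25) is -8, so the degree value is wrong for negative (southern or western) coordinates. One possible improvement, offered only as a sketch, is to round once to whole seconds and to treat the sign separately. The name dd.to.dms3 is invented for this illustration.

```r
dd.to.dms3 <- function(dd) {
    dd <- as.numeric(dd)
    # Remember the sign and continue with the absolute value
    hemisphere.sign <- ifelse(dd < 0, "-", "")
    # Round once, to whole seconds, instead of truncating step by step
    total.sec <- round(abs(dd) * 3600)
    d <- total.sec %/% 3600
    m <- (total.sec %% 3600) %/% 60
    s <- total.sec %% 60
    dms <- paste(hemisphere.sign, d, "°", m, "'", s, "''", sep = "")
    return(dms)
}

lat.dms   <- dd.to.dms3("7.25833333333333")
lon.dms   <- dd.to.dms3("80.5963888888889")
south.dms <- dd.to.dms3(-7.25)
print(c(lat.dms, lon.dms, south.dms))
```

Rounding to whole seconds is a design choice of this sketch; if you need fractions of a second, round to the desired number of decimals instead.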

And now, produce some nice interactive maps…and try to make sense of the %>% symbol.

m <- leaflet() %>%
    addTiles() %>%
    addMarkers(lng = as.numeric(co.pgis[2]), lat = as.numeric(co.pgis[1]))
m

This produces an output like this with the default OpenStreetMap background.
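About the %>% symbol: it is a so-called pipe, made available by leaflet (it originates in the magrittr package). It hands the result on its left to the function on its right as the first argument, so leaflet() %>% addTiles() reads as addTiles(leaflet()). To show the idea without any extra package, the sketch below defines a deliberately simplified toy operator called %then%. The name and the implementation are invented for this illustration and are not how the real %>% works internally.

```r
# Toy stand-in for a pipe: hand the left-hand value to the function on the
# right. This is NOT how the real %>% is implemented, it only shows the idea.
`%then%` <- function(lhs, rhs.function) {
    rhs.function(lhs)
}

values <- c(1, 4, 9)

nested <- sum(sqrt(values))                # read from the inside out
piped  <- values %then% sqrt %then% sum    # read from left to right

print(nested)
print(piped)
```

Both lines compute the same thing; the piped form simply lists the steps in the order in which they happen, which is why it is popular for building up maps layer by layer.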

You can also change the map tile provider for a wide range of different maps. An overview can be found here: http://leaflet-extras.github.io/leaflet-providers/preview/index.html

m1 <- leaflet() %>%
    addProviderTiles("Thunderforest.Landscape") %>%
    addMarkers(lng = as.numeric(co.pgis[2]), lat = as.numeric(co.pgis[1]),
               popup = "PGIS")
m1

m2 <- leaflet() %>%
    addProviderTiles("Stamen.Watercolor") %>%
    addMarkers(lng = co.pgis[2], lat = co.pgis[1],
               popup = as.character(co.pgis[3])) %>%
    setView(lng = as.numeric(co.pgis[2]), lat = as.numeric(co.pgis[1]),
            zoom = 10)
m2

Question: What are the differences between m1 and m2 besides the different map tile provider?

Exercise

Change the map tile provider and add another marker to the map (perhaps your hometown?)

References

Ahnert, F., 2003. Einführung in die Geomorphologie, Stuttgart: Eugen Ulmer.

Baddeley, A., 2008. Analysing spatial point patterns in R, CSIRO; University of Western Australia. Available at: http://www.csiro.au/files/files/pn0y.pdf.

Baddeley, A. & Turner, R., 2005. Spatstat: An R package for analyzing spatial point patterns. Journal of Statistical Software, 12(6), pp.1–42. Available at: www.jstatsoft.org.

Bivand, R.S., Pebesma, E.J. & Gómez-Rubio, V., 2008. Applied Spatial Data Analysis with R, New York: Springer.

Borcard, D., Gillet, F. & Legendre, P., 2011. Numerical Ecology with R, New York, NY: Springer New York. Available at: http://link.springer.com/10.1007/978-1-4419-7976-6 [Accessed March 12, 2015].

Crawley, M.J., 2012. The R Book, Chichester, UK: John Wiley & Sons, Ltd.

Diggle, P.J., 2013. Statistical Analysis of Spatial and Spatio-Temporal Point Patterns 3rd ed., Boca Raton: Chapman & Hall/CRC.

Everitt, B., 2006. A handbook of statistical analyses using R, Boca Raton: Chapman & Hall/CRC.

Fortin, M.-J. & Dale, M.R.T., 2005. Spatial analysis: a guide for ecologists, Cambridge, N.Y.: Cambridge University Press.

Friedman, J., Hastie, T. & Tibshirani, R., 2001. The elements of statistical learning, Berlin: Springer (Springer Series in Statistics). Available at: http://statweb.stanford.edu/~tibs/book/preface.ps [Accessed May 22, 2015].

Gaetan, C. & Guyon, X., 2010. Spatial Statistics and Modeling, New York, NY: Springer New York. Available at: http://link.springer.com/10.1007/978-0-387-92257-7 [Accessed March 13, 2015].

Gelfand, A.E. et al., 2010. Handbook of Spatial Statistics, CRC Press.

Glenberg, A.M. & Andrzejewski, M.E., 2008. Learning from data: An introduction to statistical reasoning 3rd ed., New York: Lawrence Erlbaum Associates.

Haining, R.P., 2003. Spatial data analysis: theory and practice, Cambridge, UK; New York: Cambridge University Press.

Hengl, T., 2009. A practical guide to geostatistical mapping 2nd extended ed., Amsterdam: Hengl.

Illian, J. et al., 2008. Statistical Analysis and Modelling of Spatial Point Patterns, West Sussex: John Wiley & Sons.

James, G. et al., 2013. An Introduction to Statistical Learning, New York, NY: Springer New York. Available at: http://link.springer.com/10.1007/978-1-4614-7138-7 [Accessed May 11, 2015].

Legendre, P. & Legendre, L., 2012. Numerical ecology 3rd English ed., Amsterdam: Elsevier.

Lloyd, C.D., 2011. Local Models for Spatial Analysis, Boca Raton: CRC Press.

Lovelace, R. & Cheshire, J., 2015. Introduction to visualising spatial data in R, Available at: https://github.com/Robinlovelace/Creating-maps-in-R/raw/master/intro-spatial-rl.pdf [Accessed November 29, 2014].

Maindonald, J. & Braun, J., 2003. Data Analysis and Graphics Using R 1st ed., Cambridge University Press.

Openshaw, S., 1984. The modifiable areal unit problem, Norwich: Geo Abstracts Univ. of East Anglia.

O’Sullivan, D. & Perry, G.L.W., 2013. Spatial simulation: Exploring pattern and process, Chichester, West Sussex, UK: John Wiley & Sons Inc.

O’Sullivan, D. & Unwin, D., 2010. Geographic information analysis, Hoboken: John Wiley & Sons.

Pilz, J., 2009. Interfacing Geostatistics and GIS, Springer.

Radziwill, N.M., 2015. Statistics (The Easier Way) with R: An informal text on applied statistics, San Francisco, California: Lapis Lucera.

Ripley, B.D., 2004. Spatial statistics, Hoboken, N.J: Wiley-Interscience.

Schabenberger, O. & Gotway, C.A., 2005. Statistical methods for spatial data analysis, Boca Raton: Chapman & Hall.

Schumacker, R. & Tomek, S., 2013. Understanding Statistics Using R, New York, NY: Springer New York. Available at: http://link.springer.com/10.1007/978-1-4614-6227-9 [Accessed March 12, 2015].

Soetaert, K., Cash, J. & Mazzia, F., 2012. Solving differential equations in R, New York: Springer.

Stevens, M.H., 2010. A Primer of Ecology with R 1st ed., Dordrecht; New York: Springer.

Tobler, W.R., 1970. A Computer Movie Simulating Urban Growth in the Detroit Region. Economic Geography, 46, pp.234–240. Available at: http://www.jstor.org/stable/143141 [Accessed August 22, 2012].

Wickham, H., 2009. Ggplot2: Elegant Graphics for Data Analysis, New York: Springer.

Wiegand, T. & Moloney, K.A., 2013. Handbook of Spatial Point-Pattern Analysis in Ecology, CRC Press.

Zhao, Y., 2012. R and Data Mining: Examples and Case Studies, Academic Press, Elsevier. Available at: http://www.rdatamining.com/docs/RDataMining.pdf.

Zuur, A.F., Ieno, E.N. & Meesters, E., 2009. A Beginner’s Guide to R 1st ed., Springer.