Point pattern analyses

THE references for this subject are:

And the best thing is: the authors are active developers of R. The package we will use in this topic very often is spatstat. A package that powerful that the manual alone has over 1400 pages (as of 2015-09-26).

If you use the package in your research, do not forget to cite it (this holds true for every package used, including the base software R):

library(spatstat)
citation(package = "spatstat")
## 
## To cite spatstat in publications use:
## 
##   Adrian Baddeley, Ege Rubak, Rolf Turner (2015). Spatial Point
##   Patterns: Methodology and Applications with R. London: Chapman
##   and Hall/CRC Press, 2015. In press. URL
##   http://www.taylorandfrancis.com/books/details/9781482210200/
## 
##   Adrian Baddeley, Rolf Turner (2005). spatstat: An R Package for
##   Analyzing Spatial Point Patterns. Journal of Statistical
##   Software 12(6), 1-42. URL http://www.jstatsoft.org/v12/i06/.
## 
## If you use hybrid models, please also cite:
## 
##   Adrian Baddeley, Rolf Turner, Jorge Mateu, Andrew Bevan (2013).
##   Hybrids of Gibbs Point Process Models and Their Implementation.
##   Journal of Statistical Software, 55(11), 1-43. URL
##   http://www.jstatsoft.org/v55/i11/.

[following text largely from Knitter & Nakoinz (submitted)]

Point patterns are the result of processes that are influenced by (a) first-order effects, i.e. the location of the point is influenced by the underlying structure of the area but not by the location of other points (Wiegand & Moloney 2004, p.210); (b) second-order effects that occur when the location of a point is influenced by the presence or absence of other points (Wiegand & Moloney 2004, p.210). Point pattern analyses are common in ecological studies (e.g. Legendre & Legendre 2012; Wiegand & Moloney 2013); up to now there are only few applications in archaeological contexts (e.g. Knitter et al. 2014).

Warm up - some exploratory calculations

Data preparation

Load shapefiles and create point dataset of the city polygons.

library(rgdal)
boundary <- readOGR(dsn = "./data/DSL250-Shp/", layer = "Boundary")
## OGR data source with driver: ESRI Shapefile 
## Source: "./data/DSL250-Shp/", layer: "Boundary"
## with 504 features
## It has 5 fields
cities <- readOGR(dsn = "./data/DSL250-Shp", layer = "Cities")
## OGR data source with driver: ESRI Shapefile 
## Source: "./data/DSL250-Shp", layer: "Cities"
## with 27 features
## It has 6 fields
library(rgeos)
cities.p <- gCentroid(cities, byid = TRUE)

par(mfrow = c(1,2))
plot(cities)
plot(cities.p)

Mean center and standard distance

## prepare the data
x.co <- cities.p@coords[,1]
y.co <- cities.p@coords[,2]
n <- length(cities.p@coords[,1])

## calculate mean center
mc <- SpatialPoints(coords = cbind(sum(x.co)/n,sum(y.co)/n), proj4string = cities.p@proj4string)
mc
## class       : SpatialPoints 
## features    : 1 
## extent      : 188800.9, 188800.9, 265830.7, 265830.7  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=tmerc +lat_0=7.000480277777778 +lon_0=80.77171111111112 +k=0.9999238418 +x_0=200000 +y_0=200000 +a=6377276.345 +b=6356075.41314024 +units=m +no_defs
## calculate standard distance
sd <- sqrt(sum((x.co-mean(x.co))^2+(y.co-mean(y.co))^2)/n)
sd
## [1] 128172.4
## plot it
plot(boundary)
points(cities.p, pch = 19, cex = .5)
points(mc, col="red", pch = 19)
library(plotrix)
draw.circle(x = mc@coords[,1], y = mc@coords[,2], radius = sd, border = "red")
title("Mean center and standard distance \nof cities.p in Sri Lanka")

Global intensity

## get Area of Sri Lanka and change it to sqkm
area <- gArea(boundary)
area <- area/1000000

## calculate intensity
intensity <- length(cities.p)/area
intensity
## [1] 0.0004115326

Exercise: what about the villages?!

  1. Calculate mean centre and standard distance for the village dataset.
  2. Plot the results
  3. Calculate the global intensity of the village dataset. What does the value mean?

Working with ppp objects – Conducting Point Pattern Analysis

First, let us create the point pattern.

## another window -- when "boundary" is too detailed
# library(mapdata)
# sl <- map("worldHires","Sri Lanka")
# sl <- data.frame(x=sl$x,y=sl$y)
# sl <- sl[is.na(sl$x)==FALSE & is.na(sl$y)==FALSE,]
# sl.owin <- owin(poly = sl)
# plot(sl.owin) ## quite rough, i know... but it's fast

library(spatstat)
library(maptools)
sl.owin <- as.owin.SpatialPolygons(unionSpatialPolygons(boundary, IDs = boundary@data$OUTERB_ID))
pp.cit <- ppp(x = cities.p@coords[,1], y = cities.p@coords[,2], window = sl.owin)
#str(pp.cit)
pp.vil <- ppp(x = vil@coords[,1], y = vil@coords[,2], window = sl.owin)
## Warning in ppp(x = vil@coords[, 1], y = vil@coords[, 2], window = sl.owin):
## 51 points were rejected as lying outside the specified window
#plot(pp.vil, pch = 3, cex = .5)
plot(pp.cit, pch = 19, cols = "red")

As you can see the plot of pp.cit already shows the boundary of Sri Lanka. Why? And: why is this important? (hint: ?owin)

Quadrat count

qc.cit <- quadratcount(X = pp.cit)
qc.vil <- quadratcount(X = pp.vil)

par(mfrow = c(1,2))
plot(qc.cit)
plot(qc.vil)

So the question is: does the quadratcount indicates CSR? To check we use a \(\chi^2\) test approach (remember: Relation between observed (i.e. empirical) and expected (i.e. theoretical, here CSR) amounts of points in quadrants)

qt.cit <- quadrat.test(X = pp.cit)
## Warning: Some expected counts are small; chi^2 approximation may be
## inaccurate
qt.cit
## 
##  Chi-squared test of CSR using quadrat counts
##  Pearson X2 statistic
## 
## data:  pp.cit
## X2 = 30.323, df = 21, p-value = 0.1714
## alternative hypothesis: two.sided
## 
## Quadrats: 22 tiles (irregular windows)
qt.vil <- quadrat.test(X = pp.vil)
qt.vil
## 
##  Chi-squared test of CSR using quadrat counts
##  Pearson X2 statistic
## 
## data:  pp.vil
## X2 = 304.3, df = 21, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 
## Quadrats: 22 tiles (irregular windows)
# par(mfrow = c(1,2))
# plot(qt.cit)
# plot(qt.vil)

Exercise: What influence does a change in the amount and density of the quadrants have? Is the current approach useful?

Nearest-neighbor distance

nn.cit <- nndist(X = pp.cit)
nn.vil <- nndist(X = pp.vil)

str(nn.cit)
##  num [1:27] 51998 47135 47135 68439 47586 ...
mean(nn.cit)
## [1] 38571.04
mean(nn.vil)
## [1] 2819.191
hist(nn.cit)
abline(v=mean(nn.cit))
abline(v=median(nn.cit), lty=2)

hist(nn.vil)
abline(v=mean(nn.vil))
abline(v=median(nn.vil), lty=2)

Nearest neighbor Distance – Clark and Evans’ R

“An R value of less than 1 indicates of a tendency toward clustering, since it shows that observed nearest neighbor distances are shorter than expected. An R value of more than 1 indicatives of a tendency toward evenly spaced events” (O’Sullivan & Unwin 2010, 144)

nnE <- 1/(2*sqrt((pp.vil$n/gArea(boundary))))
nnE
## [1] 2493.517
R.vil <- mean(nn.vil)/nnE
R.vil
## [1] 1.130608
nnE <- 1/(2*sqrt((pp.cit$n/gArea(boundary))))
nnE
## [1] 24647.22
R.cit <- mean(nn.cit)/nnE
R.cit
## [1] 1.564925

First order effects

Density calculation - 1st approach

Amount of events within pixel…in imprecise language: a histogram in space. This is a step by step example but the code is rather inefficient.

cs <- 50000    # cellsize
# enlarge the study area by ONE pixel (in E-W and N-S direction)
xmin <- pp.cit$window$xrange[1] - cs/2 # enlarge the study area by cs/2
xmax <- pp.cit$window$xrange[2] + cs/2 # enlarge the study area by cs/2
ymin <- pp.cit$window$yrange[1] - cs/2 # enlarge the study area by cs/2
ymax <- pp.cit$window$yrange[2] + cs/2 # enlarge the study area by cs/2
rows  <- round((ymax-ymin)/cs, 0) + 1 # calculate the number of rows (add 1 just because a pixl might get lost through the rounding operation)
columns <- round((xmax-xmin)/cs, 0) + 1 # calculate the number of columns (add 1 just because a pixl might get lost through the rounding operation)
z <- cbind(1:(columns*rows)) # create a vector with all grid cells
df <- data.frame(z) # create a data.frame of it
gt <- GridTopology(c(pp.cit$window$xrange[1] - cs/2,pp.cit$window$yrange[1] - 
                     cs/2), c(cs,cs), c(columns,rows)) # create a topological description of a grid and aftecsards...
gt # ...have a look at it and...
##                       min      min
## cellcentre.offset 37439.5 55431.19
## cellsize          50000.0 50000.00
## cells.dim             7.0    11.00
sgdf <- SpatialGridDataFrame(gt, df, proj4string = cities.p@proj4string) # ...create the grid.

for (i in seq(along=coordinates(gt)[,1])){ # loop for every cell
    x <- coordinates(gt)[i,1] - cs/2 # because the coordinate is define for the center of the cell, the half cellsize is substracted
    y <- coordinates(gt)[i,2] - cs/2 # because the coordinate is define for the center of the cell, the half cellsize is substracted
    xi <- which(pp.cit$x>x & pp.cit$x<x+cs) # which events lie within the x direction?
    yi <- which(pp.cit$y>y & pp.cit$y<y+cs) # which events lie within the y direction? 
    pz <- length(intersect(xi,yi)) # how many objects in x and y direction intersect?
    sgdf@data$z[i]<- pz / (cs/1000)^2  # divide the number of points by the area
}

plot(raster(sgdf), col = gray.colors(25, start = 0.97, end = 0.4))      
plot(boundary, add=TRUE)
points(pp.cit$x, pp.cit$y, pch=16, cex=0.4)  

Density calculation - 2nd approach

Kernel density estimation; again here performed step by step and rather inefficiently coded. Use it to understand but do not run the code.

sgdf_kde <- sgdf # copy the grid into a new variable
sd <- 100000 # define the bandwidth of the kernel - it is NOT the pixel size; this is defined in the beginning of the 1str approach
# ...and think of it, the bandwidth shall not be smaller than the pixelsize, cause points outside the kernel are "underestimated"

for (i in seq(along=coordinates(gt)[,1])){ # loop through all pixel
    x <- coordinates(gt)[i,1]
    y <- coordinates(gt)[i,2]
    g2 <- 0
    for (j in seq(along=pp.cit$x)){ # we use ALL points
        distanz <- sqrt((pp.cit$x[j] - x)^2 + (pp.cit$y[j] - y)^2) # calculate this distance of one point to the centre of the raster cell. this is necessary for the...
        g1 <- dnorm(distanz, mean=0, sd=sd) #...weighting of point in th e cell using a normal distributed kernel
        g2 <- g2 + g1 # now collect the weights for the different points under in the pixel to get the density value for the pixel
    }
    sgdf_kde@data$z[i]<- g2     
}
plot(raster(sgdf_kde), col = gray.colors(25, start = 0.97, end = 0.4))   
plot(boundary, add=TRUE)
points(pp.cit$x, pp.cit$y, pch=16, cex=0.4) 

Density calculation - 3rd approach

Kernel density estimation using the built in function of spatstat. Since the package is great, it is perfectly coded and fast. Play around with the values. Calculate also a KDE for the village data set.

cs <- 1000  
sd <- 50000

dens_r5 <- density(pp.cit, sd, eps=cs, edge=TRUE, at="pixels") 
plot(dens_r5, col = gray.colors(25, start = 0.97, end = 0.4))
#contour(dens_r5, add=T)    
points(pp.cit$x, pp.cit$y, pch=16, cex=0.4)  

# use another measure for sd - in this case three times the mean nearest neighbor distance
sdev <- 3*mean(nndist(pp.cit))  
dens_r6 <- density(pp.cit, sdev, eps=cs, edge=TRUE, at="pixels")
plot(dens_r6, col = gray.colors(25, start = 0.97, end = 0.4))
#contour(dens_r6, add=T)    
points(pp.cit$x, pp.cit$y, pch=16, cex=0.4)  

Exercise

  1. Use the density function to calculate the KDE for the village data set.
  2. Change the size of the kernel and interpret/discuss the results
  3. Get familiar with the function and concepts using the spatstat manual

Second order effects

G and F function

First approach based on theoretical assumptions, i.e. CSR

g.cit <- Gest(pp.cit)
plot(g.cit)

f.cit <- Fest(pp.cit)
plot(f.cit)

Second approach based on simulations of a theortical process, i.e. CSR

g.cit.env <- envelope(pp.cit, fun = Gest, nsim = 10)
## Generating 10 simulations of CSR  ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9,  10.
## 
## Done.
plot(g.cit.env)

f.cit.env <- envelope(pp.cit, fun = Fest, nsim = 10)
## Generating 10 simulations of CSR  ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9,  10.
## 
## Done.
plot(f.cit.env)

Question: Interpret the results. Why two approaches? What are the advantages?

K and L function

First approach based on theoretical assumptions, i.e. CSR

k.cit <- Kest(pp.cit)
plot(k.cit)

l.cit <- Lest(pp.cit)
plot(l.cit)

Second approach based on simulations of a theortical process, i.e. CSR

k.cit.env <- envelope(pp.cit, fun = Kest, nsim = 10)
## Generating 10 simulations of CSR  ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9,  10.
## 
## Done.
plot(k.cit.env)

l.cit.env <- envelope(pp.cit, fun = Lest, nsim = 10)
## Generating 10 simulations of CSR  ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9,  10.
## 
## Done.
plot(l.cit.env)

Question: Interpret the results. Why two approaches? What are the advantages?

Exercise

  1. Calculate G,F,K, and L function for the village data set
  2. Interpret the results. Which approach do you chose? Why?
  3. Get familiar with the function and concepts using the spatstat manual

References

Baddeley, A., 2008. Analysing spatial point patterns in R, CSIRO; University of Western Australia. Available at: http://www.csiro.au/files/files/pn0y.pdf.

Bivand, R.S., Pebesma, E.J. & Gómez-Rubio, V., 2008. Applied Spatial Data Analysis with R, New York: Springer.

Diggle, P.J., 2013. Statistical Analysis of Spatial and Spatio-Temporal Point Patterns, Third Edition 3rd ed., Boca Raton: Chapman; Hall/CRC.

Knitter, D. et al., 2014. The Centrality of Aleppo and its environs. eTopoi. Journal for Ancient Studies, 3, pp.107–127.

Legendre, P. & Legendre, L., 2012. Numerical ecology Third English edition., Amsterdam: Elsevier.

Wiegand, T. & Moloney, K.A., 2013. Handbook of Spatial Point-Pattern Analysis in Ecology, CRC Press.

Wiegand, T. & Moloney, K.A., 2004. Rings, Circles, and Null-Models for Point Pattern Analysis in Ecology. Oikos, 104(2), pp.209–229. Available at: http://www.jstor.org/stable/3547954 [Accessed March 13, 2015].