Inspect data and ask questions

WS 2016/2017

Inspect data and ask questions

Prerequisites

Set working directory

setwd("YOUR-PATH")

The working directory should be at the top of your project directory structure.

Load data

Get the data from xls(x) files – load required package [always troublesome on windows due to missing perl]

Steps:

get help
load required package

##??xls
library(gdata)
raw_data <- read.xls(xls = "./data/Fundstellen_UTM.xlsx", sheet = 1)

A nice tutorial on loading table (including xls) data to R is on datacamp

Load data

A better way: use text files.

Question: Why?

raw_data <- read.table(file = "data/FS_UTM.csv",
                       header = TRUE,
                       sep = ";",
                       stringsAsFactors = FALSE,
                       quote = "")

Play with the data

Check structure and content

str(raw_data)

## 'data.frame':    508 obs. of  36 variables:
##  $ Nr.                                : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Site.Name                          : chr  "Beba Veche - in the ditch" "Dudestii Vechi - Dragomirs' hill" "Nerau - Hunca Mare" "Nerau - Dogans' hill" ...
##  $ Site.type                          : chr  "fortification made from soil" "settlement and necropole" "tumulus and necropole" "tumulus " ...
##  $ settlement                         : int  0 1 0 0 1 0 0 0 1 1 ...
##  $ open.settlement                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cave.settlement                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ tell.settlement                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mound                              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ fortification                      : int  1 0 0 0 1 0 0 0 0 0 ...
##  $ court                              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ depot                              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cult.complex                       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mine                               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ chruch                             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ temple                             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ villa.rustica                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ monastery                          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cemetery                           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ necropolis                         : int  0 1 1 0 0 0 0 0 0 1 ...
##  $ tumulus                            : int  0 0 1 1 0 1 1 1 0 0 ...
##  $ Epoch..site.                       : chr  "Bronze Age" "Latène Period,post-Roman, Middle Ages" "Bronze Age" "Bronze Age" ...
##  $ prehistoric                        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ palaeolithic                       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ neolithic                          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ copper.age                         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ bronze.age                         : int  1 0 1 1 1 1 1 0 1 1 ...
##  $ iron.age                           : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ roman.and.daco.roman.age           : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ post.roman.age.and.migration.period: int  0 1 0 0 0 0 0 1 1 1 ...
##  $ medieval                           : int  0 1 0 0 1 0 0 1 0 1 ...
##  $ modern                             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Code.RAN                           : num  155733 156721 158877 158877 156268 ...
##  $ Code.LMI                           : chr  "" "" "TM-I-s-B-06073" "TM-I-s-B-06073" ...
##  $ Code.Fisenon                       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ xUTM                               : num  443157 459882 462115 466751 467930 ...
##  $ yUTM                               : num  5105906 5095991 5088415 5089685 5109462 ...

Ask questions

How many archaeological sites are in the data set?

length(raw_data$Nr.)

## [1] 508

Exercise: Show another way how to answer the question?

Ask questions

How many settlements?

sum(raw_data$settlement)

## [1] 393

Question:

What other possibilities are there to query the amount of settlements?
What other question could be asked?

Create a dataset for further analyses

Clean data

Some mandatory things:

Are there missing values?
Do you understand every column/variable?
Where are problems with the data?
Typos?
What data are necessary/unnecessary?

The easiest but sometimes quite verbose way to check:

str(raw_data)

## 'data.frame':    508 obs. of  36 variables:
##  $ Nr.                                : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Site.Name                          : chr  "Beba Veche - in the ditch" "Dudestii Vechi - Dragomirs' hill" "Nerau - Hunca Mare" "Nerau - Dogans' hill" ...
##  $ Site.type                          : chr  "fortification made from soil" "settlement and necropole" "tumulus and necropole" "tumulus " ...
##  $ settlement                         : int  0 1 0 0 1 0 0 0 1 1 ...
##  $ open.settlement                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cave.settlement                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ tell.settlement                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mound                              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ fortification                      : int  1 0 0 0 1 0 0 0 0 0 ...
##  $ court                              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ depot                              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cult.complex                       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mine                               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ chruch                             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ temple                             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ villa.rustica                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ monastery                          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cemetery                           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ necropolis                         : int  0 1 1 0 0 0 0 0 0 1 ...
##  $ tumulus                            : int  0 0 1 1 0 1 1 1 0 0 ...
##  $ Epoch..site.                       : chr  "Bronze Age" "Latène Period,post-Roman, Middle Ages" "Bronze Age" "Bronze Age" ...
##  $ prehistoric                        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ palaeolithic                       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ neolithic                          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ copper.age                         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ bronze.age                         : int  1 0 1 1 1 1 1 0 1 1 ...
##  $ iron.age                           : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ roman.and.daco.roman.age           : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ post.roman.age.and.migration.period: int  0 1 0 0 0 0 0 1 1 1 ...
##  $ medieval                           : int  0 1 0 0 1 0 0 1 0 1 ...
##  $ modern                             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Code.RAN                           : num  155733 156721 158877 158877 156268 ...
##  $ Code.LMI                           : chr  "" "" "TM-I-s-B-06073" "TM-I-s-B-06073" ...
##  $ Code.Fisenon                       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ xUTM                               : num  443157 459882 462115 466751 467930 ...
##  $ yUTM                               : num  5105906 5095991 5088415 5089685 5109462 ...

Check data structure 2

Examples of more controlled queries:

head(names(raw_data))

## [1] "Nr."             "Site.Name"       "Site.type"       "settlement"     
## [5] "open.settlement" "cave.settlement"

head(sapply(raw_data, class))

##             Nr.       Site.Name       Site.type      settlement 
##       "integer"     "character"     "character"       "integer" 
## open.settlement cave.settlement 
##       "integer"       "integer"

Check data structure 2a

Examples of more controlled queries (head, tail, sample):

sample(raw_data$Epoch..site., 3)

## [1] "Middle Ages "                "Neolithic, late Bronze Age "
## [3] "Neolithic Period"

The structure of the "Epoch..site." column is a little complicated to use for further analyses. Fortunately, we do not need to use them, since there are already separated and precoded into binary variables.

Questions:

What is the problem with column?
What does "binary variable" mean?
Why is it useful?
Are there other ways to code a variable?

Check data structure 2a

Exercise [for the rest of the course-week…]

Clean the messed variables of the table yourself!

Hint 1: try ?gsub
Hint 2:

grep(pattern = "Bronze Age", x = raw_data$Epoch..site.)

Further questions 1

What is the proportion of the different settlement types?

One ordinary way (by the way, giving variable names "a","b", etc. is really bad style!)

a <- raw_data[,4:7]
b <- cbind(sum(a[,1]),sum(a[,2]),sum(a[,3]),sum(a[,4]))
colnames(b) <- c("settlement","open.settlement","cave.settlement","tell.settlement")
b

##      settlement open.settlement cave.settlement tell.settlement
## [1,]        393             183              19              12

c <- b/b[1]*100
c

##      settlement open.settlement cave.settlement tell.settlement
## [1,]        100        46.56489        4.834606        3.053435

Further questions 2

What is the proportion of the different settlement types?

One of the *apply ways

a <- data.frame(settlement = raw_data$settlement[raw_data$settlement==1],
                open.settlement = raw_data$open.settlement[raw_data$settlement==1],
                cave.settlement = raw_data$cave.settlement[raw_data$settlement==1],
                tell.settlement = raw_data$tell.settlement[raw_data$settlement==1]
                )
sapply(a, sum)

##      settlement open.settlement cave.settlement tell.settlement 
##             389             182              19              12

sapply(a, function(x){sum(x)/length(x)*100})

##      settlement open.settlement cave.settlement tell.settlement 
##      100.000000       46.786632        4.884319        3.084833

Further questions 3

What is the proportion of the different settlement types?

The magrittr and dplyr way (Version 1):

library(magrittr)
library(dplyr)

raw_data %>%
    filter(settlement==1) %>%
    select(settlement, open.settlement, cave.settlement, tell.settlement) %>%
    summarise(settlement=sum(.[,1])/n()*100,
              open.settlement=sum(.[,2])/n()*100,
              cave.settlement=sum(.[,3])/n()*100,
              tell.settlement=sum(.[,4])/n()*100
              )

##   settlement open.settlement cave.settlement tell.settlement
## 1        100        46.78663        4.884319        3.084833

Further questions 3a

What is the proportion of the different settlement types?

The magrittr and dplyr way (Version 2):

library(magrittr)
library(dplyr)

raw_data %>%
    filter(settlement==1) %>%
    select(settlement, open.settlement, cave.settlement, tell.settlement) %>%
    summarise(settlements = sum(settlement),
              open.settlement = sum(open.settlement),
              cave.settlement = sum(cave.settlement),
              tell.settlement = sum(tell.settlement)
              )

##   settlements open.settlement cave.settlement tell.settlement
## 1         389             182              19              12

Further questions 3b

What is the proportion of the different settlement types?

The magrittr and dplyr way (Version 2a):

library(magrittr)
library(dplyr)
raw_data %>%
    filter(settlement==1) %>%
    select(settlement, open.settlement, cave.settlement, tell.settlement) %>%
    summarise(settlements = sum(settlement)/n()*100,
              open.settlement = sum(open.settlement)/n()*100,
              cave.settlement = sum(cave.settlement)/n()*100,
              tell.settlement = sum(tell.settlement)/n()*100
              )

##   settlements open.settlement cave.settlement tell.settlement
## 1         100        46.78663        4.884319        3.084833

Excercise:

Find other ways to answer the question.
Discuss the different advantages of your approach(es) and those presented above.

Further questions 4

What the heck…!? Why not using table and prop.table ?!

Excercise: Why? :)

Clean data continued

Ok, let's continue. First, get rid of all the unnecessary variables. Again, there are many different possibilities to do this.

Here is a "classic" way (approach: I want to have…):

selected_data <- raw_data[,c(2,4:20,22:31,35:36)]
names(selected_data)

##  [1] "Site.Name"                          
##  [2] "settlement"                         
##  [3] "open.settlement"                    
##  [4] "cave.settlement"                    
##  [5] "tell.settlement"                    
##  [6] "mound"                              
##  [7] "fortification"                      
##  [8] "court"                              
##  [9] "depot"                              
## [10] "cult.complex"                       
## [11] "mine"                               
## [12] "chruch"                             
## [13] "temple"                             
## [14] "villa.rustica"                      
## [15] "monastery"                          
## [16] "cemetery"                           
## [17] "necropolis"                         
## [18] "tumulus"                            
## [19] "prehistoric"                        
## [20] "palaeolithic"                       
## [21] "neolithic"                          
## [22] "copper.age"                         
## [23] "bronze.age"                         
## [24] "iron.age"                           
## [25] "roman.and.daco.roman.age"           
## [26] "post.roman.age.and.migration.period"
## [27] "medieval"                           
## [28] "modern"                             
## [29] "xUTM"                               
## [30] "yUTM"

#knitr::kable(head(selected_data))
#str(selected_data)

Clean data continued

Ok, let's continue. First, get rid of all the unnecessary variables. Again, there are many different possibilities to do this.

Here is a "classic" way (approach: I do not want to have…):

selected_data <- raw_data[,c(-1,-3,-21,-32:-34)]
names(selected_data)

##  [1] "Site.Name"                          
##  [2] "settlement"                         
##  [3] "open.settlement"                    
##  [4] "cave.settlement"                    
##  [5] "tell.settlement"                    
##  [6] "mound"                              
##  [7] "fortification"                      
##  [8] "court"                              
##  [9] "depot"                              
## [10] "cult.complex"                       
## [11] "mine"                               
## [12] "chruch"                             
## [13] "temple"                             
## [14] "villa.rustica"                      
## [15] "monastery"                          
## [16] "cemetery"                           
## [17] "necropolis"                         
## [18] "tumulus"                            
## [19] "prehistoric"                        
## [20] "palaeolithic"                       
## [21] "neolithic"                          
## [22] "copper.age"                         
## [23] "bronze.age"                         
## [24] "iron.age"                           
## [25] "roman.and.daco.roman.age"           
## [26] "post.roman.age.and.migration.period"
## [27] "medieval"                           
## [28] "modern"                             
## [29] "xUTM"                               
## [30] "yUTM"

#knitr::kable(head(selected_data))
#str(selected_data)

Clean data continued

Ok, let's continue. First, get rid of all the unnecessary variables. Again, there are many different possibilities to do this.

And here a more modern way (Question: Which approach is used?):

library(dplyr)
selected_data <- select(.data = raw_data,
                        -Nr.,
                        -starts_with("Site.ty"),
                        -starts_with("Ep"),
                        -starts_with("Code")
                        )
names(selected_data)

##  [1] "Site.Name"                          
##  [2] "settlement"                         
##  [3] "open.settlement"                    
##  [4] "cave.settlement"                    
##  [5] "tell.settlement"                    
##  [6] "mound"                              
##  [7] "fortification"                      
##  [8] "court"                              
##  [9] "depot"                              
## [10] "cult.complex"                       
## [11] "mine"                               
## [12] "chruch"                             
## [13] "temple"                             
## [14] "villa.rustica"                      
## [15] "monastery"                          
## [16] "cemetery"                           
## [17] "necropolis"                         
## [18] "tumulus"                            
## [19] "prehistoric"                        
## [20] "palaeolithic"                       
## [21] "neolithic"                          
## [22] "copper.age"                         
## [23] "bronze.age"                         
## [24] "iron.age"                           
## [25] "roman.and.daco.roman.age"           
## [26] "post.roman.age.and.migration.period"
## [27] "medieval"                           
## [28] "modern"                             
## [29] "xUTM"                               
## [30] "yUTM"

#str(selected_data)

Question: Which way do you prefer? And besides your opinion, which is more suitable/robust [if any]?

Create chronologically separated dataset 1

Now, we will create site data sets for different chronological periods. Again there are different ways to achieve this:

classic

1. select relevant data
1. check whether everything is fine
1. drop unnecessary columns

Create chronologically separated dataset 1

prehist <- selected_data[selected_data$prehistoric==1,]

table(selected_data$prehistoric)[2]==length(prehist$prehistoric)

##    1 
## TRUE

prehist <- prehist[,-c(19,25:28)]
#str(prehist)

Question: What does the line beginning with "table(…" mean?

Create chronologically separated dataset 2

a modern way

library(dplyr)
prehist <- selected_data %>%
    filter(prehistoric == 1) %>%
    select(-contains("preh")) %>%
    select(-contains("roman")) %>%
    select(-contains("medie")) %>%
    select(-contains("modern"))

#str(prehist)
table(selected_data$prehistoric)[2]==length(prehist$prehistoric)

##     1 
## FALSE

Question: Why is the "table(…" line now FALSE?

at last: something to see

Now that we have a clean dataset, let's plot some of its characteristics. As always we are going to use the standard and a modern approach.

The standard plotting way.

barplot(cbind(
    paleolithic = sum(prehist$palaeolithic),
    neolithic = sum(prehist$neolithic),
    copper_age = sum(prehist$copper.age),
    bronze_age = sum(prehist$bronze.age),
    iron_age = sum(prehist$iron.age)
),
main = "Number of sites from prehistoric periods"
)

at last: something to see

Exercise:

Plot the different numbers of site types.
Which way do you chose? Why? What other possibilities are there?

at last: something to see – GGPLOT2

For the modern way we need to transform the data first from the a wide to a long format. Infos can be found e.g., here or Wickham (2007) or Wickham (2014). In the case of wide data there is a column for each variable.

Let's first take care of the data structure. We want to investigate the amount of different sity types per period.

types.per.period <- data.frame(palaeolithic = 0,
                               neolithic = 0,
                               copper.age = 0,
                               bronze.age = 0,
                               iron.age = 0
                               )
names(prehist)

##  [1] "Site.Name"       "settlement"      "open.settlement"
##  [4] "cave.settlement" "tell.settlement" "mound"          
##  [7] "fortification"   "court"           "depot"          
## [10] "cult.complex"    "mine"            "chruch"         
## [13] "temple"          "villa.rustica"   "monastery"      
## [16] "cemetery"        "necropolis"      "tumulus"        
## [19] "palaeolithic"    "neolithic"       "copper.age"     
## [22] "bronze.age"      "iron.age"        "xUTM"           
## [25] "yUTM"

at last: something to see – GGPLOT2

Data crunching

library(magrittr); library(dplyr)
for (i in 3:18) {
    j <- i-2
    types.per.period[j,] <- prehist %>%
        filter(.[i] == 1) %>%
        summarise(paleolithic = sum(palaeolithic),
                  neolithic = sum(neolithic),
                  copper.age = sum(copper.age),
                  bronze.age = sum(bronze.age),
                  iron.age = sum(iron.age)
                  )
}

#str(types.per.period)
types.per.period$type <- names(prehist)[3:18]
types.per.period

##    palaeolithic neolithic copper.age bronze.age iron.age            type
## 1             2        10          1         57        9 open.settlement
## 2             4         7          1         13        5 cave.settlement
## 3             0         5          0          6        3 tell.settlement
## 4             0         0          0          1        0           mound
## 5             2         5          4         28       20   fortification
## 6             0         0          0          0        0           court
## 7             0         0          0          0        0           depot
## 8             0         0          0          0        1    cult.complex
## 9             0         1          0          2        4            mine
## 10            0         1          0          1        1          chruch
## 11            0         0          0          0        1          temple
## 12            0         0          0          0        0   villa.rustica
## 13            0         0          0          0        2       monastery
## 14            0         2          0         12        0        cemetery
## 15            1         4          2         10        9      necropolis
## 16            1         3          0         27        1         tumulus

at last: something to see – GGPLOT2

This was again just data crunching/selection. Now, we change the data from wide to long format using the reshape2 package.

library(reshape2)
types2 <- melt(types.per.period, id.vars = c("type"))
str(types2)

## 'data.frame':    80 obs. of  3 variables:
##  $ type    : chr  "open.settlement" "cave.settlement" "tell.settlement" "mound" ...
##  $ variable: Factor w/ 5 levels "palaeolithic",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ value   : num  2 4 0 0 2 0 0 0 0 0 ...

head(types2, 3)

##              type     variable value
## 1 open.settlement palaeolithic     2
## 2 cave.settlement palaeolithic     4
## 3 tell.settlement palaeolithic     0

Question: Do you see the difference? Describe the data new structure.

at last: something to see – GGPLOT2

From this point on it is very easy and effortless to produce informative plots. We use the ggplot2 package for this. Examples and documentation can be found at http://docs.ggplot2.org/current/ or in the manual or in Wickham (2009) as well as Zuur et al. (2009) (for the last on check this link).

ggplot2 package is build on the ideas of "Grammar of graphics". More information about this broad topic can be found in Wilkinson (2005).

ggplot2 expects "tidy" data. To get to know what this exactly is, look at Wickham (2014).

at last: something to see – GGPLOT2

library(ggplot2)
ggplot(types2, aes(x=type,y=value, fill=variable)) +
    geom_bar(stat="identity")

at last: something to see – GGPLOT2

Exercises

Check the help of ggplot and reference pages on the internet to:
- change the theme of the plot; Don't know what a "theme" is? Check help!
- reorder the x-axis.
- give a title
- change the direction of x-labels to make them readable
Create a facet plot differentiated by period.

In the end it might look like this and this…

at last: something to see – GGPLOT2

References

Wickham, H., 2009. Ggplot2: Elegant Graphics for Data Analysis, New York: Springer.

Wickham, H., 2007. Reshaping Data with the reshape Package., 21(12). Available at: http://www.jstatsoft.org/article/view/v021i12 [Accessed November 30, 2015].

Wickham, H., 2014. Tidy data. The Journal of Statistical Software, 59(10). Available at: http://www.jstatsoft.org/v59/i10/.

Wilkinson, L., 2005. The grammar of graphics 2nd ed., New York: Springer.

Zuur, A.F., Ieno, E.N. & Meesters, E., 2009. A Beginner’s Guide to R 1st ed., Springer.