WS 2016/2017
setwd("YOUR-PATH")
The working directory should be at the top of your project directory structure.
Get the data from xls(x) files – load required package [always troublesome on windows due to missing perl]
Steps:
##??xls library(gdata) raw_data <- read.xls(xls = "./data/Fundstellen_UTM.xlsx", sheet = 1)
A nice tutorial on loading table (including xls) data to R
is on datacamp
A better way: use text files.
Question: Why?
raw_data <- read.table(file = "data/FS_UTM.csv", header = TRUE, sep = ";", stringsAsFactors = FALSE, quote = "")
str(raw_data)
## 'data.frame': 508 obs. of 36 variables: ## $ Nr. : int 1 2 3 4 5 6 7 8 9 10 ... ## $ Site.Name : chr "Beba Veche - in the ditch" "Dudestii Vechi - Dragomirs' hill" "Nerau - Hunca Mare" "Nerau - Dogans' hill" ... ## $ Site.type : chr "fortification made from soil" "settlement and necropole" "tumulus and necropole" "tumulus " ... ## $ settlement : int 0 1 0 0 1 0 0 0 1 1 ... ## $ open.settlement : int 0 0 0 0 0 0 0 0 0 0 ... ## $ cave.settlement : int 0 0 0 0 0 0 0 0 0 0 ... ## $ tell.settlement : int 0 0 0 0 0 0 0 0 0 0 ... ## $ mound : int 0 0 0 0 0 0 0 0 0 0 ... ## $ fortification : int 1 0 0 0 1 0 0 0 0 0 ... ## $ court : int 0 0 0 0 0 0 0 0 0 0 ... ## $ depot : int 0 0 0 0 0 0 0 0 0 0 ... ## $ cult.complex : int 0 0 0 0 0 0 0 0 0 0 ... ## $ mine : int 0 0 0 0 0 0 0 0 0 0 ... ## $ chruch : int 0 0 0 0 0 0 0 0 0 0 ... ## $ temple : int 0 0 0 0 0 0 0 0 0 0 ... ## $ villa.rustica : int 0 0 0 0 0 0 0 0 0 0 ... ## $ monastery : int 0 0 0 0 0 0 0 0 0 0 ... ## $ cemetery : int 0 0 0 0 0 0 0 0 0 0 ... ## $ necropolis : int 0 1 1 0 0 0 0 0 0 1 ... ## $ tumulus : int 0 0 1 1 0 1 1 1 0 0 ... ## $ Epoch..site. : chr "Bronze Age" "Latène Period,post-Roman, Middle Ages" "Bronze Age" "Bronze Age" ... ## $ prehistoric : int 1 1 1 1 1 1 1 1 1 1 ... ## $ palaeolithic : int 0 0 0 0 0 0 0 0 0 0 ... ## $ neolithic : int 0 0 0 0 0 0 0 0 0 0 ... ## $ copper.age : int 0 0 0 0 0 0 0 0 0 0 ... ## $ bronze.age : int 1 0 1 1 1 1 1 0 1 1 ... ## $ iron.age : int 0 1 0 0 0 0 0 0 0 0 ... ## $ roman.and.daco.roman.age : int 0 0 0 0 1 0 0 0 0 0 ... ## $ post.roman.age.and.migration.period: int 0 1 0 0 0 0 0 1 1 1 ... ## $ medieval : int 0 1 0 0 1 0 0 1 0 1 ... ## $ modern : int 0 0 0 0 0 0 0 0 0 0 ... ## $ Code.RAN : num 155733 156721 158877 158877 156268 ... ## $ Code.LMI : chr "" "" "TM-I-s-B-06073" "TM-I-s-B-06073" ... ## $ Code.Fisenon : int NA NA NA NA NA NA NA NA NA NA ... ## $ xUTM : num 443157 459882 462115 466751 467930 ... ## $ yUTM : num 5105906 5095991 5088415 5089685 5109462 ...
How many archaeological sites are in the data set?
length(raw_data$Nr.)
## [1] 508
Exercise: Show another way how to answer the question?
How many settlements?
sum(raw_data$settlement)
## [1] 393
Question:
Some mandatory things:
The easiest but sometimes quite verbose way to check:
str(raw_data)
## 'data.frame': 508 obs. of 36 variables: ## $ Nr. : int 1 2 3 4 5 6 7 8 9 10 ... ## $ Site.Name : chr "Beba Veche - in the ditch" "Dudestii Vechi - Dragomirs' hill" "Nerau - Hunca Mare" "Nerau - Dogans' hill" ... ## $ Site.type : chr "fortification made from soil" "settlement and necropole" "tumulus and necropole" "tumulus " ... ## $ settlement : int 0 1 0 0 1 0 0 0 1 1 ... ## $ open.settlement : int 0 0 0 0 0 0 0 0 0 0 ... ## $ cave.settlement : int 0 0 0 0 0 0 0 0 0 0 ... ## $ tell.settlement : int 0 0 0 0 0 0 0 0 0 0 ... ## $ mound : int 0 0 0 0 0 0 0 0 0 0 ... ## $ fortification : int 1 0 0 0 1 0 0 0 0 0 ... ## $ court : int 0 0 0 0 0 0 0 0 0 0 ... ## $ depot : int 0 0 0 0 0 0 0 0 0 0 ... ## $ cult.complex : int 0 0 0 0 0 0 0 0 0 0 ... ## $ mine : int 0 0 0 0 0 0 0 0 0 0 ... ## $ chruch : int 0 0 0 0 0 0 0 0 0 0 ... ## $ temple : int 0 0 0 0 0 0 0 0 0 0 ... ## $ villa.rustica : int 0 0 0 0 0 0 0 0 0 0 ... ## $ monastery : int 0 0 0 0 0 0 0 0 0 0 ... ## $ cemetery : int 0 0 0 0 0 0 0 0 0 0 ... ## $ necropolis : int 0 1 1 0 0 0 0 0 0 1 ... ## $ tumulus : int 0 0 1 1 0 1 1 1 0 0 ... ## $ Epoch..site. : chr "Bronze Age" "Latène Period,post-Roman, Middle Ages" "Bronze Age" "Bronze Age" ... ## $ prehistoric : int 1 1 1 1 1 1 1 1 1 1 ... ## $ palaeolithic : int 0 0 0 0 0 0 0 0 0 0 ... ## $ neolithic : int 0 0 0 0 0 0 0 0 0 0 ... ## $ copper.age : int 0 0 0 0 0 0 0 0 0 0 ... ## $ bronze.age : int 1 0 1 1 1 1 1 0 1 1 ... ## $ iron.age : int 0 1 0 0 0 0 0 0 0 0 ... ## $ roman.and.daco.roman.age : int 0 0 0 0 1 0 0 0 0 0 ... ## $ post.roman.age.and.migration.period: int 0 1 0 0 0 0 0 1 1 1 ... ## $ medieval : int 0 1 0 0 1 0 0 1 0 1 ... ## $ modern : int 0 0 0 0 0 0 0 0 0 0 ... ## $ Code.RAN : num 155733 156721 158877 158877 156268 ... ## $ Code.LMI : chr "" "" "TM-I-s-B-06073" "TM-I-s-B-06073" ... ## $ Code.Fisenon : int NA NA NA NA NA NA NA NA NA NA ... ## $ xUTM : num 443157 459882 462115 466751 467930 ... ## $ yUTM : num 5105906 5095991 5088415 5089685 5109462 ...
Examples of more controlled queries:
head(names(raw_data))
## [1] "Nr." "Site.Name" "Site.type" "settlement" ## [5] "open.settlement" "cave.settlement"
head(sapply(raw_data, class))
## Nr. Site.Name Site.type settlement ## "integer" "character" "character" "integer" ## open.settlement cave.settlement ## "integer" "integer"
Examples of more controlled queries (head
, tail
, sample
):
sample(raw_data$Epoch..site., 3)
## [1] "Middle Ages " "Neolithic, late Bronze Age " ## [3] "Neolithic Period"
The structure of the "Epoch..site." column is a little complicated to use for further analyses. Fortunately, we do not need to use them, since there are already separated and precoded into binary variables.
Questions:
Exercise [for the rest of the course-week…]
Clean the messed variables of the table yourself!
?gsub
grep(pattern = "Bronze Age", x = raw_data$Epoch..site.)
What is the proportion of the different settlement types?
One ordinary way (by the way, giving variable names "a","b", etc. is really bad style!)
a <- raw_data[,4:7] b <- cbind(sum(a[,1]),sum(a[,2]),sum(a[,3]),sum(a[,4])) colnames(b) <- c("settlement","open.settlement","cave.settlement","tell.settlement") b
## settlement open.settlement cave.settlement tell.settlement ## [1,] 393 183 19 12
c <- b/b[1]*100 c
## settlement open.settlement cave.settlement tell.settlement ## [1,] 100 46.56489 4.834606 3.053435
What is the proportion of the different settlement types?
One of the *apply ways
a <- data.frame(settlement = raw_data$settlement[raw_data$settlement==1], open.settlement = raw_data$open.settlement[raw_data$settlement==1], cave.settlement = raw_data$cave.settlement[raw_data$settlement==1], tell.settlement = raw_data$tell.settlement[raw_data$settlement==1] ) sapply(a, sum)
## settlement open.settlement cave.settlement tell.settlement ## 389 182 19 12
sapply(a, function(x){sum(x)/length(x)*100})
## settlement open.settlement cave.settlement tell.settlement ## 100.000000 46.786632 4.884319 3.084833
What is the proportion of the different settlement types?
The magrittr and dplyr way (Version 1):
library(magrittr) library(dplyr) raw_data %>% filter(settlement==1) %>% select(settlement, open.settlement, cave.settlement, tell.settlement) %>% summarise(settlement=sum(.[,1])/n()*100, open.settlement=sum(.[,2])/n()*100, cave.settlement=sum(.[,3])/n()*100, tell.settlement=sum(.[,4])/n()*100 )
## settlement open.settlement cave.settlement tell.settlement ## 1 100 46.78663 4.884319 3.084833
What is the proportion of the different settlement types?
The magrittr and dplyr way (Version 2):
library(magrittr) library(dplyr) raw_data %>% filter(settlement==1) %>% select(settlement, open.settlement, cave.settlement, tell.settlement) %>% summarise(settlements = sum(settlement), open.settlement = sum(open.settlement), cave.settlement = sum(cave.settlement), tell.settlement = sum(tell.settlement) )
## settlements open.settlement cave.settlement tell.settlement ## 1 389 182 19 12
What is the proportion of the different settlement types?
The magrittr and dplyr way (Version 2a):
library(magrittr) library(dplyr) raw_data %>% filter(settlement==1) %>% select(settlement, open.settlement, cave.settlement, tell.settlement) %>% summarise(settlements = sum(settlement)/n()*100, open.settlement = sum(open.settlement)/n()*100, cave.settlement = sum(cave.settlement)/n()*100, tell.settlement = sum(tell.settlement)/n()*100 )
## settlements open.settlement cave.settlement tell.settlement ## 1 100 46.78663 4.884319 3.084833
Excercise:
What the heck…!? Why not using table
and prop.table
?!
Excercise: Why? :)
Ok, let's continue. First, get rid of all the unnecessary variables. Again, there are many different possibilities to do this.
Here is a "classic" way (approach: I want to have…):
selected_data <- raw_data[,c(2,4:20,22:31,35:36)] names(selected_data)
## [1] "Site.Name" ## [2] "settlement" ## [3] "open.settlement" ## [4] "cave.settlement" ## [5] "tell.settlement" ## [6] "mound" ## [7] "fortification" ## [8] "court" ## [9] "depot" ## [10] "cult.complex" ## [11] "mine" ## [12] "chruch" ## [13] "temple" ## [14] "villa.rustica" ## [15] "monastery" ## [16] "cemetery" ## [17] "necropolis" ## [18] "tumulus" ## [19] "prehistoric" ## [20] "palaeolithic" ## [21] "neolithic" ## [22] "copper.age" ## [23] "bronze.age" ## [24] "iron.age" ## [25] "roman.and.daco.roman.age" ## [26] "post.roman.age.and.migration.period" ## [27] "medieval" ## [28] "modern" ## [29] "xUTM" ## [30] "yUTM"
#knitr::kable(head(selected_data)) #str(selected_data)
Ok, let's continue. First, get rid of all the unnecessary variables. Again, there are many different possibilities to do this.
Here is a "classic" way (approach: I do not want to have…):
selected_data <- raw_data[,c(-1,-3,-21,-32:-34)] names(selected_data)
## [1] "Site.Name" ## [2] "settlement" ## [3] "open.settlement" ## [4] "cave.settlement" ## [5] "tell.settlement" ## [6] "mound" ## [7] "fortification" ## [8] "court" ## [9] "depot" ## [10] "cult.complex" ## [11] "mine" ## [12] "chruch" ## [13] "temple" ## [14] "villa.rustica" ## [15] "monastery" ## [16] "cemetery" ## [17] "necropolis" ## [18] "tumulus" ## [19] "prehistoric" ## [20] "palaeolithic" ## [21] "neolithic" ## [22] "copper.age" ## [23] "bronze.age" ## [24] "iron.age" ## [25] "roman.and.daco.roman.age" ## [26] "post.roman.age.and.migration.period" ## [27] "medieval" ## [28] "modern" ## [29] "xUTM" ## [30] "yUTM"
#knitr::kable(head(selected_data)) #str(selected_data)
Ok, let's continue. First, get rid of all the unnecessary variables. Again, there are many different possibilities to do this.
And here a more modern way (Question: Which approach is used?):
library(dplyr) selected_data <- select(.data = raw_data, -Nr., -starts_with("Site.ty"), -starts_with("Ep"), -starts_with("Code") ) names(selected_data)
## [1] "Site.Name" ## [2] "settlement" ## [3] "open.settlement" ## [4] "cave.settlement" ## [5] "tell.settlement" ## [6] "mound" ## [7] "fortification" ## [8] "court" ## [9] "depot" ## [10] "cult.complex" ## [11] "mine" ## [12] "chruch" ## [13] "temple" ## [14] "villa.rustica" ## [15] "monastery" ## [16] "cemetery" ## [17] "necropolis" ## [18] "tumulus" ## [19] "prehistoric" ## [20] "palaeolithic" ## [21] "neolithic" ## [22] "copper.age" ## [23] "bronze.age" ## [24] "iron.age" ## [25] "roman.and.daco.roman.age" ## [26] "post.roman.age.and.migration.period" ## [27] "medieval" ## [28] "modern" ## [29] "xUTM" ## [30] "yUTM"
#str(selected_data)
Question: Which way do you prefer? And besides your opinion, which is more suitable/robust [if any]?
Now, we will create site data sets for different chronological periods. Again there are different ways to achieve this:
classic
prehist <- selected_data[selected_data$prehistoric==1,] table(selected_data$prehistoric)[2]==length(prehist$prehistoric)
## 1 ## TRUE
prehist <- prehist[,-c(19,25:28)] #str(prehist)
Question: What does the line beginning with "table(…" mean?
a modern way
library(dplyr) prehist <- selected_data %>% filter(prehistoric == 1) %>% select(-contains("preh")) %>% select(-contains("roman")) %>% select(-contains("medie")) %>% select(-contains("modern")) #str(prehist) table(selected_data$prehistoric)[2]==length(prehist$prehistoric)
## 1 ## FALSE
Question: Why is the "table(…" line now FALSE?
Now that we have a clean dataset, let's plot some of its characteristics. As always we are going to use the standard and a modern approach.
The standard plotting way.
barplot(cbind( paleolithic = sum(prehist$palaeolithic), neolithic = sum(prehist$neolithic), copper_age = sum(prehist$copper.age), bronze_age = sum(prehist$bronze.age), iron_age = sum(prehist$iron.age) ), main = "Number of sites from prehistoric periods" )
Exercise:
For the modern way we need to transform the data first from the a wide to a long format. Infos can be found e.g., here or Wickham (2007) or Wickham (2014). In the case of wide data there is a column for each variable.
Let's first take care of the data structure. We want to investigate the amount of different sity types per period.
types.per.period <- data.frame(palaeolithic = 0, neolithic = 0, copper.age = 0, bronze.age = 0, iron.age = 0 ) names(prehist)
## [1] "Site.Name" "settlement" "open.settlement" ## [4] "cave.settlement" "tell.settlement" "mound" ## [7] "fortification" "court" "depot" ## [10] "cult.complex" "mine" "chruch" ## [13] "temple" "villa.rustica" "monastery" ## [16] "cemetery" "necropolis" "tumulus" ## [19] "palaeolithic" "neolithic" "copper.age" ## [22] "bronze.age" "iron.age" "xUTM" ## [25] "yUTM"
Data crunching
library(magrittr); library(dplyr) for (i in 3:18) { j <- i-2 types.per.period[j,] <- prehist %>% filter(.[i] == 1) %>% summarise(paleolithic = sum(palaeolithic), neolithic = sum(neolithic), copper.age = sum(copper.age), bronze.age = sum(bronze.age), iron.age = sum(iron.age) ) } #str(types.per.period) types.per.period$type <- names(prehist)[3:18] types.per.period
## palaeolithic neolithic copper.age bronze.age iron.age type ## 1 2 10 1 57 9 open.settlement ## 2 4 7 1 13 5 cave.settlement ## 3 0 5 0 6 3 tell.settlement ## 4 0 0 0 1 0 mound ## 5 2 5 4 28 20 fortification ## 6 0 0 0 0 0 court ## 7 0 0 0 0 0 depot ## 8 0 0 0 0 1 cult.complex ## 9 0 1 0 2 4 mine ## 10 0 1 0 1 1 chruch ## 11 0 0 0 0 1 temple ## 12 0 0 0 0 0 villa.rustica ## 13 0 0 0 0 2 monastery ## 14 0 2 0 12 0 cemetery ## 15 1 4 2 10 9 necropolis ## 16 1 3 0 27 1 tumulus
This was again just data crunching/selection. Now, we change the data from wide to long format using the reshape2
package.
library(reshape2) types2 <- melt(types.per.period, id.vars = c("type")) str(types2)
## 'data.frame': 80 obs. of 3 variables: ## $ type : chr "open.settlement" "cave.settlement" "tell.settlement" "mound" ... ## $ variable: Factor w/ 5 levels "palaeolithic",..: 1 1 1 1 1 1 1 1 1 1 ... ## $ value : num 2 4 0 0 2 0 0 0 0 0 ...
head(types2, 3)
## type variable value ## 1 open.settlement palaeolithic 2 ## 2 cave.settlement palaeolithic 4 ## 3 tell.settlement palaeolithic 0
Question: Do you see the difference? Describe the data new structure.
From this point on it is very easy and effortless to produce informative plots. We use the ggplot2
package for this. Examples and documentation can be found at http://docs.ggplot2.org/current/ or in the manual or in Wickham (2009) as well as Zuur et al. (2009) (for the last on check this link).
ggplot2
package is build on the ideas of "Grammar of graphics". More information about this broad topic can be found in Wilkinson (2005).
ggplot2
expects "tidy" data. To get to know what this exactly is, look at Wickham (2014).
library(ggplot2) ggplot(types2, aes(x=type,y=value, fill=variable)) + geom_bar(stat="identity")
Exercises
Check the help of ggplot and reference pages on the internet to:
Create a facet plot differentiated by period.
In the end it might look like this and this…
Wickham, H., 2009. Ggplot2: Elegant Graphics for Data Analysis, New York: Springer.
Wickham, H., 2007. Reshaping Data with the reshape Package., 21(12). Available at: http://www.jstatsoft.org/article/view/v021i12 [Accessed November 30, 2015].
Wickham, H., 2014. Tidy data. The Journal of Statistical Software, 59(10). Available at: http://www.jstatsoft.org/v59/i10/.
Wilkinson, L., 2005. The grammar of graphics 2nd ed., New York: Springer.
Zuur, A.F., Ieno, E.N. & Meesters, E., 2009. A Beginner’s Guide to R 1st ed., Springer.