The Royal Danish Library has made the OCR text of a large number of newspapers published between 1666 and 1877 publicly available. This newspaper collection can be found in the Royal Library Open Access Repository: LOAR.
The collection can also be accessed through an API. The system behind LOAR is DSpace, and the API is described at DSpace REST API. In this post, I'll explore this API using R.
You can play with this code yourself in a publicly available RStudio Cloud project. The static document is available as an RPub at Playing with the LOAR API.
Start by loading the Tidyverse, a library for working with JSON in R, and a library for dates and times.
library(tidyverse)
library(jsonlite)
library(lubridate)
The top level of the LOAR hierarchy is something called communities. We don't need that concept here, but it gives an idea of where the collections stem from.
fromJSON("https://loar.kb.dk/rest/communities") %>% select(name)
## name
## 1 AU Library
## 2 Aarhus University
## 3 Aquatic Biology
## 4 Arts
## 5 Audio Collections
## 6 Danish School of Education
## 7 Data Management in Practice
## 8 Department of Agroecology
## 9 Department of Bioscience
## 10 Events
## 11 IT Development
## 12 LARM.fm
## 13 LOAR
## 14 Moesgaard
## 15 National Museum of Denmark
## 16 NetLab
## 17 Newspapers from Royal Danish Library
## 18 Open Digital Collections
## 19 Research Data
## 20 Royal Danish Library
## 21 School of Communication and Culture
## 22 School of Culture and Society
## 23 Science and Tecnology
## 24 VDM Video Life Cycle Data Management
But since we don't need that hierarchy level, let's just list the available collections.
fromJSON("https://loar.kb.dk/rest/collections") %>% select(name)
## name
## 1 AnaEE
## 2 Archive for Danish Literature in 3D: Data, Danish Literature & Distant Reading
## 3 Arctic freshwaters
## 4 Beretningsarkiv for Arkæologiske Undersøgelser
## 5 Danmarks Kirker
## 6 Datasprint 2019
## 7 Example Data
## 8 Front pages of Berlingske
## 9 LOAR Legal Documents
## 10 Machine learning
## 11 Military land use in Denmark 1870-2017
## 12 NetLab data
## 13 Newspapers 1666-1678
## 14 Newspapers 1749-1799
## 15 Newspapers 1800-1849
## 16 Newspapers 1850-1877
## 17 Open Access LARM.fm datasets
## 18 Ruben Recordings
## 19 Sandbox
## 20 Skolelove
## 21 Soviet and Warsaw-pact maps
So, the newspapers are split into four collections. To get those collections, we need their ids. We’ll store this list of collections and ids for later
fromJSON("https://loar.kb.dk/rest/collections") %>%
filter(str_detect(name, "Newspaper")) %>%
select(name, uuid) -> newspaper_collections
newspaper_collections
What can we then get from a collection? Let’s look at the first using this URL
str_c(
  "https://loar.kb.dk/rest/collections/",
  first(newspaper_collections %>% pull(uuid))
)
## [1] "https://loar.kb.dk/rest/collections/8a36005e-07c2-4e88-ace2-b02eddef07b9"
fromJSON(str_c(
  "https://loar.kb.dk/rest/collections/",
  first(newspaper_collections %>% pull(uuid))
))
## $uuid
## [1] "8a36005e-07c2-4e88-ace2-b02eddef07b9"
##
## $name
## [1] "Newspapers 1666-1678"
##
## $handle
## [1] "1902/158"
##
## $type
## [1] "collection"
##
## $expand
## [1] "parentCommunityList" "parentCommunity" "items"
## [4] "license" "logo" "all"
##
## $logo
## NULL
##
## $parentCommunity
## NULL
##
## $parentCommunityList
## list()
##
## $items
## list()
##
## $license
## NULL
##
## $copyrightText
## [1] ""
##
## $introductoryText
## [1] "Collection of OCR text in csv files from digitised newspapers. The csv files contain\r\n<ul>\r\n<li>Reference to the scanned newspaper page in <a href=\"http://www2.statsbiblioteket.dk/mediestream/avis\" target=\"_blank\">Newspaper article</a>. This reference will point to the article when there in the search field is inserted recordID: and then the reference surrounded by the sign \".</li>\r\n<li>The date the newspaper was printed</li>\r\n<li>The newspaper id</li>\r\n<li>The scanned newspaper page</li>\r\n<li>Text which was generated by doing OCR of the scanned article</li>\r\n</ul>"
##
## $shortDescription
## [1] "Collection of OCR text in csv files from digitised newspapers"
##
## $sidebarText
## [1] ""
##
## $numberItems
## [1] 13
##
## $link
## [1] "/rest/collections/8a36005e-07c2-4e88-ace2-b02eddef07b9"
Which items do we have in that collection?
str_c(
  "https://loar.kb.dk/rest/collections/",
  first(newspaper_collections %>% pull(uuid)),
  "/items"
)
## [1] "https://loar.kb.dk/rest/collections/8a36005e-07c2-4e88-ace2-b02eddef07b9/items"
fromJSON(str_c(
  "https://loar.kb.dk/rest/collections/",
  first(newspaper_collections %>% pull(uuid)),
  "/items"
))
Let’s pick the first item for a closer look
fromJSON(str_c(
  "https://loar.kb.dk/rest/collections/",
  first(newspaper_collections %>% pull(uuid)),
  "/items"
)) %>%
  pull(uuid) %>%
  first() -> uuid
uuid
## [1] "b4fb558a-1c56-42de-8c56-7fff565bb7b4"
fromJSON(str_c("https://loar.kb.dk/rest/items/",uuid))
## $uuid
## [1] "b4fb558a-1c56-42de-8c56-7fff565bb7b4"
##
## $name
## [1] "Newspapers from 1678"
##
## $handle
## [1] "1902/179"
##
## $type
## [1] "item"
##
## $expand
## [1] "metadata" "parentCollection" "parentCollectionList"
## [4] "parentCommunityList" "bitstreams" "all"
##
## $lastModified
## [1] "2018-02-05 10:24:08.214"
##
## $parentCollection
## NULL
##
## $parentCollectionList
## NULL
##
## $parentCommunityList
## NULL
##
## $bitstreams
## NULL
##
## $archived
## [1] "true"
##
## $withdrawn
## [1] "false"
##
## $link
## [1] "/rest/items/b4fb558a-1c56-42de-8c56-7fff565bb7b4"
##
## $metadata
## NULL
So, this is weird, because even though the bitstreams value is NULL, I know the bitstreams contain the actual content of the record/item. Let's look at them
fromJSON(str_c("https://loar.kb.dk/rest/items/",uuid,"/bitstreams"))
And now we're close to the actual data. In the bitstreams table above, the data is available in the bitstream with id d2d3869f-ad37-461c-bcb4-79ffc7d9d0fe, and we get it by using the retrieve function from the API. The content is delivered as CSV, and normally I would use the read_csv function for such data. But this CSV format has some issues with the encoding of quotes, so we must use the more general read_delim function with the two escape_ parameters.
fromJSON(str_c("https://loar.kb.dk/rest/items/", uuid, "/bitstreams")) %>%
  filter(name == "artikler_1678.csv") %>%
  pull(retrieveLink) -> artikler_1678_link
artikler_1678_link
## [1] "/rest/bitstreams/d2d3869f-ad37-461c-bcb4-79ffc7d9d0fe/retrieve"
artikler_1678 <- read_delim(
  str_c("https://loar.kb.dk/", artikler_1678_link),
  delim = ",",
  escape_backslash = TRUE,
  escape_double = FALSE)
## Parsed with column specification:
## cols(
## recordID = col_character(),
## sort_year_asc = col_character(),
## editionId = col_character(),
## newspaper_page = col_double(),
## fulltext_org = col_character()
## )
artikler_1678
glimpse(artikler_1678)
## Rows: 80
## Columns: 5
## $ recordID <chr> "doms_newspaperCollection:uuid:082e12e4-cbc1-4502-93d1…
## $ sort_year_asc <chr> "1678-01", "1678-09", "1678-03-01", "1678-03-01", "167…
## $ editionId <chr> "dendanskemercurius1666 1678-01 001", "dendanskemercur…
## $ newspaper_page <dbl> 2, 2, 3, 2, 2, 4, 3, 3, 3, 4, 1, 4, 1, 4, 3, 3, 1, 1, …
## $ fulltext_org <chr> "(Pi", "Aff Kaaber stobte Hals dersom den cene broler …
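The recordID column is the reference described in the collection's introductory text: pasted into the Mediestream search field as recordID:"…", it points back to the scanned article. Here is a small sketch that builds such a search string; the helper name mediestream_query is my own.
# Build the search string described in the introductory text: recordID:"<record id>"
mediestream_query <- function(record_id) {
  str_c('recordID:"', record_id, '"')
}

artikler_1678 %>%
  pull(recordID) %>%
  first() %>%
  mediestream_query()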
To get an idea about the amount of data, let’s count pages
artikler_1678 %>%
  group_by(sort_year_asc) %>%
  summarise(page_count = sum(newspaper_page)) %>%
  arrange(desc(sort_year_asc))
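Note that newspaper_page looks like the page number of each article, so summing it is a rough proxy at best. If we instead want the number of distinct pages per period, counting distinct edition/page combinations might be closer; this is a sketch under that assumption.
# Count distinct edition/page combinations per period instead of summing page numbers
artikler_1678 %>%
  group_by(sort_year_asc) %>%
  summarise(distinct_pages = n_distinct(editionId, newspaper_page)) %>%
  arrange(desc(sort_year_asc))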
Unfortunately, to explore the metadata, we have to download all the actual data. Hopefully this will change in the future. Still, the bitstreams do have a sizeBytes value, so let's collect those and see how much bandwidth and storage is needed for the full collection. Well, actually for all four newspaper collections.
So:
for each newspaper collection
  for each item
    sum the sizeBytes of bitstreams with name matching ^artikler_
First look at the first collection
fromJSON(str_c("https://loar.kb.dk/rest/collections/",newspaper_collections %>% pull(uuid) %>% first() , "/items"))
Using that technique, we can map the fromJSON function used above over a list of ids, to get the items from all the newspaper collections
map_df(
  newspaper_collections %>% pull(uuid),
  ~fromJSON(str_c("https://loar.kb.dk/rest/collections/", .x, "/items"))
) -> all_items
all_items
Now, get all the bitstreams associated with those items. First, we can extract the item id
all_items %>% pull(uuid) %>% first()
## [1] "b4fb558a-1c56-42de-8c56-7fff565bb7b4"
and from that item id, get its bitstreams
fromJSON(
  str_c(
    "https://loar.kb.dk/rest/items/",
    all_items %>% pull(uuid) %>% first(), "/bitstreams"
  )
)
The actual bitstream we are interested in is the one named artikler_1678.csv, and we can see that it is the only one with the CSV format. Filter for that, and retain just the name and sizeBytes
fromJSON(
  str_c(
    "https://loar.kb.dk/rest/items/",
    all_items %>% pull(uuid) %>% first(), "/bitstreams"
  )
) %>%
  filter(format == "CSV") %>%
  select(name, sizeBytes)
So, how do we do that for all items? Well, it should be as easy as when getting the items above
map_df(
  all_items %>% filter(row_number() < 4) %>% pull(uuid),
  ~fromJSON(str_c("https://loar.kb.dk/rest/items/", .x, "/bitstreams"))
)
But I get that weird error message?!?
Well, we have this item id: b4fb558a-1c56-42de-8c56-7fff565bb7b4. Which bitstreams does that give?
fromJSON(str_c("https://loar.kb.dk/rest/items/", "b4fb558a-1c56-42de-8c56-7fff565bb7b4", "/bitstreams"))
Okay, can we use map_df for just that one item?
c("b4fb558a-1c56-42de-8c56-7fff565bb7b4") %>%
map_df(
~fromJSON(str_c("https://loar.kb.dk/rest/items/", .x, "/bitstreams"))
)
No?! Okay, what if I then select only the needed columns? The assumption here is that one of the unneeded columns is causing the havoc.
c("b4fb558a-1c56-42de-8c56-7fff565bb7b4") %>%
map_df(
~(fromJSON(str_c("https://loar.kb.dk/rest/items/", .x, "/bitstreams")) %>% select(name,sizeBytes))
)
Oh, that worked! Then try it with a few more items
all_items %>% filter(row_number() < 4) %>% pull(uuid) %>%
  map_df(
    ~(fromJSON(str_c("https://loar.kb.dk/rest/items/", .x, "/bitstreams")) %>% select(name, sizeBytes))
  )
YES! Now build the data frame that I want. First just for 3 rows (row_number() < 4)
all_items %>% filter(row_number() < 4) %>% pull(uuid) %>%
  map_df(
    ~(fromJSON(str_c("https://loar.kb.dk/rest/items/", .x, "/bitstreams")) %>% select(name, sizeBytes, format))
  ) %>%
  filter(format == "CSV")
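My guess, and it is only a guess, is that the earlier error came from one of the dropped columns being a nested list or data frame, which bind_rows inside map_df cannot combine across items. A quick way to check is to look at the class of each column in a single bitstreams response:
# Inspect the class of each column in one bitstreams response;
# list- or data.frame-columns are the likely culprits
fromJSON(str_c(
  "https://loar.kb.dk/rest/items/",
  "b4fb558a-1c56-42de-8c56-7fff565bb7b4",
  "/bitstreams"
)) %>%
  map_chr(~ class(.x)[1])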
And now with everything
all_items %>% pull(uuid) %>%
  map_df(
    ~(fromJSON(str_c("https://loar.kb.dk/rest/items/", .x, "/bitstreams")) %>% select(name, sizeBytes, format, uuid))
  ) %>%
  filter(format == "CSV") %>%
  select(-format) -> all_bitstreams
Let’s have a look
all_bitstreams
summary(all_bitstreams)
## name sizeBytes uuid
## Length:142 Min. : 23490 Length:142
## Class :character 1st Qu.: 17376786 Class :character
## Mode :character Median : 46032351 Mode :character
## Mean : 66204283
## 3rd Qu.: 86858365
## Max. :365887109
So at last, we can get the answer to the question regarding the resources needed for downloading the complete collection. But first, let's load a library for formatting numbers in a more human-readable way.
library(gdata)
Sum all the bytes
all_bitstreams %>%
  summarise(total_bytes = humanReadable(sum(as.numeric(sizeBytes)), standard = "SI"))
That's more or less the same as installing TeX Live ;-)
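Out of curiosity, we can also look at the largest individual files, to get a feel for the biggest single download; just a quick sketch using the same humanReadable formatting (the column name size is mine).
# The five largest CSV bitstreams by size
all_bitstreams %>%
  arrange(desc(as.numeric(sizeBytes))) %>%
  mutate(size = humanReadable(as.numeric(sizeBytes), standard = "SI")) %>%
  select(name, size) %>%
  head(5)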
Let's select a year: 1853, and get all the available text from that year. We can cheat a bit here: since we already have all the bitstreams, we can simply filter their names for 1853
all_bitstreams %>%
  filter(str_detect(name, "1853.csv"))
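Rather than copying the uuid by hand, we could also build the retrieve URL from that filtered row; a small sketch (the names bitstream_1853 and url_1853 are my own), which should give the same URL as the one used below.
# Build the retrieve URL for the 1853 bitstream from all_bitstreams
bitstream_1853 <- all_bitstreams %>%
  filter(str_detect(name, "1853.csv"))

url_1853 <- str_c("https://loar.kb.dk/rest/bitstreams/", first(bitstream_1853$uuid), "/retrieve")
url_1853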
Let's get that bitstream using the GET /bitstreams/{bitstream id}/retrieve endpoint
print(now())
## [1] "2020-05-27 13:23:00 CEST"
articles_1853 <- read_delim(
  "https://loar.kb.dk/rest/bitstreams/f6543ed8-d4ba-40fe-99a8-ba26a5390924/retrieve",
  delim = ",",
  escape_backslash = TRUE,
  escape_double = FALSE)
## Parsed with column specification:
## cols(
## recordID = col_character(),
## sort_year_asc = col_date(format = ""),
## editionId = col_character(),
## newspaper_page = col_double(),
## fulltext_org = col_character()
## )
## Warning: 34 parsing failures.
## row col expected actual file
## 5889 sort_year_asc date like 1853 'https://loar.kb.dk/rest/bitstreams/f6543ed8-d4ba-40fe-99a8-ba26a5390924/retrieve'
## 5890 sort_year_asc date like 1853 'https://loar.kb.dk/rest/bitstreams/f6543ed8-d4ba-40fe-99a8-ba26a5390924/retrieve'
## 7560 sort_year_asc date like 1853 'https://loar.kb.dk/rest/bitstreams/f6543ed8-d4ba-40fe-99a8-ba26a5390924/retrieve'
## 7561 sort_year_asc date like 1853 'https://loar.kb.dk/rest/bitstreams/f6543ed8-d4ba-40fe-99a8-ba26a5390924/retrieve'
## 7562 sort_year_asc date like 1853 'https://loar.kb.dk/rest/bitstreams/f6543ed8-d4ba-40fe-99a8-ba26a5390924/retrieve'
## .... ............. .......... ...... ..................................................................................
## See problems(...) for more details.
print(now())
## [1] "2020-05-27 13:23:29 CEST"
That took less than a minute, so…
Okay, what did we get?
articles_1853
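The parsing failures above come from rows where sort_year_asc holds only a year or a year and month rather than a full date (as we also saw for 1678); those values end up as NA. If you would rather keep them as text and parse the dates yourself afterwards, one option is to read the column as character; a sketch, where the col_types choice is my own.
# Re-read, keeping sort_year_asc as character so partial dates are not lost
articles_1853_raw <- read_delim(
  "https://loar.kb.dk/rest/bitstreams/f6543ed8-d4ba-40fe-99a8-ba26a5390924/retrieve",
  delim = ",",
  escape_backslash = TRUE,
  escape_double = FALSE,
  col_types = cols(sort_year_asc = col_character())
)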