Database-specific package for various big scholarly data on Google BigQuery curated at the SUB Göttingen.
```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```
The goal of bqschol is to provide an interface to SUB Göttingen's big scholarly datasets stored on Google BigQuery.
This package is for internal use.
You can install the development version from GitHub with:
```r
# install.packages("remotes")
remotes::install_github("njahn82/bqschol")
```
Connect to the dataset with Crossref snapshots:
```r
library(bqschol)
my_con <- bqschol::bgschol_con(
  dataset = "cr_history",
  path = "~/hoad-private-key.json"
)
```
You need a service account token to make use of this package!
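The package presumably authenticates against BigQuery via the bigrquery package; the exact mechanism inside `bgschol_con()` is an assumption, but with bigrquery a service account key file can be supplied directly (a minimal sketch):

```r
# Sketch only: authenticate with a downloaded service account key.
# The key path is illustrative and must point to your own JSON key file.
library(bigrquery)
bigrquery::bq_auth(path = "~/hoad-private-key.json")
```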
The package provides wrappers for the most common table operations:

- `bgschol_list()`: List tables
- `bgschol_tbl()`: Access a table
- `bgschol_query()`: Run a SQL query and retrieve the results
- `bgschol_execute()`: Execute a SQL statement on the database

Let's start by listing all Crossref snapshots in SUB Göttingen's BigQuery project:
```r
bgschol_list(my_con)
```
We can determine the top publishers as of April 2018. Note that we only stored Crossref records published after 2007.
```r
library(dplyr)  # loads the %>% pipe used below

cr_instant_df <- bgschol_tbl(my_con, table = "cr_apr18")
cr_instant_df %>%
  # top publishers by number of distinct DOIs
  dplyr::group_by(publisher) %>%
  dplyr::summarise(n = dplyr::n_distinct(doi)) %>%
  dplyr::arrange(desc(n))
```
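Because `bgschol_tbl()` presumably returns a remote dbplyr table, the pipeline above is translated to BigQuery SQL and only executed when the results are requested. Assuming that is the case, the generated SQL can be inspected, and the results pulled into a local tibble, with standard dplyr verbs (a sketch, reusing `cr_instant_df` from above):

```r
library(dplyr)

# Print the SQL that dbplyr would send to BigQuery, without running it
cr_instant_df %>%
  group_by(publisher) %>%
  summarise(n = n_distinct(doi)) %>%
  arrange(desc(n)) %>%
  show_query()

# Execute the query and retrieve the results as a local tibble
top_publishers <- cr_instant_df %>%
  group_by(publisher) %>%
  summarise(n = n_distinct(doi)) %>%
  arrange(desc(n)) %>%
  collect()
```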
For more complex tasks, we use SQL.
```r
cc_query <- "SELECT
  publisher,
  COUNT(DISTINCT(DOI)) AS n
FROM
  `api-project-764811344545.cr_history.cr_apr18`,
  UNNEST(license) AS license
WHERE
  REGEXP_CONTAINS(license.URL, 'creativecommons')
GROUP BY
  publisher
ORDER BY
  n DESC
LIMIT
  10"

bgschol_query(my_con, cc_query)
```
`bgschol_execute()` is used when new tables shall be created or dropped in BigQuery.
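For instance, query results could be persisted as a new table and later dropped. This is a hypothetical sketch: the signature is assumed to mirror `bgschol_query()` (connection plus SQL string), and the target table name is illustrative only.

```r
# Hypothetical usage; the table name cc_top_publishers is made up for this example
create_sql <- "CREATE TABLE `api-project-764811344545.cr_history.cc_top_publishers` AS
SELECT
  publisher,
  COUNT(DISTINCT(DOI)) AS n
FROM
  `api-project-764811344545.cr_history.cr_apr18`
GROUP BY
  publisher"
bgschol_execute(my_con, create_sql)

# Drop the table again
bgschol_execute(my_con, "DROP TABLE `api-project-764811344545.cr_history.cc_top_publishers`")
```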