Title: | Tidy Analysis of Wikipedia |
Version: | 0.1.14 |
Description: | Access 'Wikipedia' through the several 'MediaWiki' APIs (https://www.mediawiki.org/wiki/API), as well as through the 'XTools' API (https://www.mediawiki.org/wiki/XTools/API). Ensure your API calls are correct, and receive results in tidy tibbles. |
License: | MIT + file LICENSE |
URL: | https://wikihistories.github.io/wikkitidy/, https://github.com/wikihistories/wikkitidy |
BugReports: | https://github.com/wikihistories/wikkitidy/issues |
Depends: | R (≥ 4.1.0) |
Imports: | cli, coro, dplyr, glue, httr2, lubridate, magrittr, openssl, pillar, purrr, rlang (≥ 0.4.11), stringr, tibble, vctrs, webfakes |
Suggests: | covr, igraph, roxygen2, testthat (≥ 3.0.0), tidyr |
Config/testthat/edition: | 3 |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.2 |
NeedsCompilation: | no |
Packaged: | 2025-02-13 19:53:27 UTC; falk |
Author: | Michael Falk |
Maintainer: | Michael Falk <michaelgfalk@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2025-02-13 20:10:02 UTC |
wikkitidy: Tidy Analysis of Wikipedia
Description
Access 'Wikipedia' through the several 'MediaWiki' APIs (https://www.mediawiki.org/wiki/API), as well as through the 'XTools' API (https://www.mediawiki.org/wiki/XTools/API). Ensure your API calls are correct, and receive results in tidy tibbles.
Author(s)
Maintainer: Michael Falk michaelgfalk@gmail.com (ORCID) [copyright holder]
See Also
Useful links:
Report bugs at https://github.com/wikihistories/wikkitidy/issues
Pipe operator
Description
See magrittr::%>% for details.
Usage
lhs %>% rhs
Arguments
lhs |
A value or the magrittr placeholder. |
rhs |
A function call using the magrittr semantics. |
Value
The result of calling rhs(lhs).
Combine new results for a query with previously downloaded results
Description
Combine new results for a query with previously downloaded results
Usage
append_query_result(old, new)
Arguments
old |
The query_tbl of previous results |
new |
The query_tbl of new results from the server |
Value
A new query_tbl of the appropriate subclass, depending on whether the batch is complete.
See Also
Ensure that the limit is correct for the endpoint. Raise an error if not.
Description
Ensure that the limit is correct for the endpoint. Raise an error if not.
Usage
check_limit(limit, max)
Arguments
limit |
The limit to be added to the query |
max |
The maximum allowed for the given endpoint |
Value
limit, assuming no errors.
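The check described above can be sketched in base R. This is a minimal illustration, not the package's own code: check_limit_sketch is a hypothetical name, and the requirement that the limit be a whole number of at least 1 is an assumption.

```r
# Hypothetical stand-in for check_limit(); not the package's implementation.
check_limit_sketch <- function(limit, max) {
  # Assumption: a valid limit is a single whole number between 1 and max.
  ok <- is.numeric(limit) && length(limit) == 1 && !is.na(limit) &&
    limit == floor(limit) && limit >= 1 && limit <= max
  if (!ok) {
    stop("`limit` must be a whole number between 1 and ", max, call. = FALSE)
  }
  # Return the limit unchanged so the call can sit inline in a query builder.
  limit
}

check_limit_sketch(10, max = 500)
```

Returning the validated value, rather than TRUE, lets the check wrap an argument in place.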
Ensure namespace arguments are valid
Description
Ensure namespace arguments are valid
Usage
check_namespace(namespace)
Arguments
namespace |
An integer vector of namespace ids, or NULL |
Value
A character vector of namespace ids, spliced together with a "|", or NULL
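A minimal base R sketch of this behaviour follows. The name check_namespace_sketch and the exact validation rules are assumptions made for illustration.

```r
# Hypothetical stand-in for check_namespace(); not the package's implementation.
check_namespace_sketch <- function(namespace) {
  # NULL means "no namespace filter" and is passed through untouched.
  if (is.null(namespace)) {
    return(NULL)
  }
  # Assumption: namespace ids must be whole numbers with no missing values.
  if (!is.numeric(namespace) || anyNA(namespace) ||
      any(namespace != floor(namespace))) {
    stop("`namespace` must be an integer vector or NULL", call. = FALSE)
  }
  # Join the ids with "|" into a single string.
  paste(as.integer(namespace), collapse = "|")
}

check_namespace_sketch(c(0, 14))
```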
Query the Action API continually until a continuation condition no longer holds.
Description
Query the Action API continually until a continuation condition no longer holds.
Usage
continue_query(last_result, predicate, max_requests = 1000)
Arguments
last_result |
The query_tbl of results to complete |
predicate |
The while condition. Results will be continually requested until this evaluates to FALSE. |
Value
A query_tbl: an S3 dataframe that is a subclass of tibble::tibble
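The loop that this function describes can be sketched in base R as follows. The fetch_more argument is hypothetical and stands in for a single API request; the real function takes no such argument, and the toy "server" below is invented for the demonstration.

```r
# Hypothetical sketch of a continue-until-done loop; not the package's code.
continue_query_sketch <- function(last_result, predicate, fetch_more,
                                  max_requests = 1000) {
  n_requests <- 0
  # Keep requesting while the predicate holds and the request cap is not hit.
  while (predicate(last_result) && n_requests < max_requests) {
    last_result <- fetch_more(last_result)
    n_requests <- n_requests + 1
  }
  last_result
}

# Toy "server": each request appends the next page number, up to page 5.
fetch_more <- function(result) {
  result$pages <- c(result$pages, length(result$pages) + 1)
  result$has_more <- length(result$pages) < 5
  result
}

start <- list(pages = 1, has_more = TRUE)
done <- continue_query_sketch(start, function(r) r$has_more, fetch_more)
done$pages
```

The cap on the number of requests guards against a predicate that never becomes FALSE.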
Search for insertions, deletions or relocations of text between two versions of a Wikipedia page
Description
Any two revisions of a Wikipedia page can be compared using the 'diff' tool. The tool compares the 'from' revision to the 'to' revision, looking for insertions, deletions or relocations of text. This operation can be performed in any order, across any span of revisions.
Usage
get_diff(from, to, language = "en", simplify = TRUE)
Arguments
from |
Vector of revision ids |
to |
Vector of revision ids |
language |
Vector of two-letter language codes (will be recycled if length==1) |
simplify |
logical: should R simplify the result (see return) |
Value
The return value depends on the simplify parameter.
If simplify == TRUE: A list of tibble::tbl_df objects the same length as from and to. Most of the response data is stripped away, leaving just the textual differences between the revisions, their location, type and 'highlightRanges' if the textual differences are complicated.
If simplify == FALSE: A list the same length as from and to containing the full wikidiff2 response for each pair of revisions. This response includes additional data for displaying diffs onscreen.
Examples
# Compare revision 847170467 to 851733941 on English Wikipedia
get_diff(847170467, 851733941)
# The function is vectorised, so you can compare multiple pairs of revisions
# in a single call
# See diffs for the last two revisions of the Main Page
revisions <- wiki_action_request() %>%
query_by_title("Main Page") %>%
query_page_properties(
"revisions",
rvlimit = 2, rvprop = "ids", rvdir = "older"
) %>%
gracefully(next_result)
if (tibble::is_tibble(revisions)) {
revisions <- revisions %>%
tidyr::unnest(cols = c(revisions)) %>%
dplyr::mutate(diffs = get_diff(from = parentid, to = revid))
print(revisions)
}
Count how many times Wikipedia articles have been edited
Description
Count how many times Wikipedia articles have been edited
Usage
get_history_count(
title,
type = c("edits", "anonymous", "bot", "editors", "minor", "reverted"),
from = NULL,
to = NULL,
language = "en",
failure_mode = c("error", "quiet")
)
Arguments
title |
A vector of article titles |
type |
|
from |
Optional: a vector of revision ids |
to |
Optional: a vector of revision ids |
language |
Vector of two-letter language codes for Wikipedia editions |
failure_mode |
What to do if no data is found. See |
Value
A tibble::tbl_df with two columns:
'count': integer, the number of edits of the given type
'limit': logical, whether the 'count' exceeds the API's limit. Each type of edit has a different limit. If the 'count' exceeds the limit, then the limit is returned as the count and 'limit' is set to TRUE
Examples
# Get the number of edits made by auto-confirmed editors to a page between
# revisions 384955912 and 406217369
get_history_count(
title="Jupiter",
type="editors",
from=384955912,
to=406217369,
failure_mode="quiet"
)
# Compare which authors have the most edit activity
authors <- tibble::tribble(
~author,
"Jane Austen",
"William Shakespeare",
"Emily Dickinson"
) %>%
dplyr::mutate(get_history_count(author, failure_mode="quiet"))
authors
Perform a query using the MediaWiki Action API
Description
next_result() sends exactly one request to the server.
next_batch() requests results from the server until data is complete for the latest batch of pages in the result.
retrieve_all() keeps requesting data until all the pages from the query have been returned.
Usage
next_result(x)
next_batch(x)
retrieve_all(x)
Arguments
x |
The query. Either a wiki_action_request or a query_tbl. |
Details
It is rare that a query can be fulfilled in a single request to the
server. There are two ways a query can be incomplete. All queries return a
list of pages as their result. The result may be incomplete because not all
the data for each page has been returned. In this case the batch is
incomplete. Or the data may be complete for all pages, but there are more
pages available on the server. In this case the query can be continued.
Hence the three functions next_result(), next_batch() and retrieve_all().
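The difference between the three stopping rules can be illustrated with a toy sequence of responses. This is a sketch in base R: the field names batchcomplete and continue mirror the arguments of query_tbl(), but the responses and the requests_needed helper are invented for illustration.

```r
# Invented responses, in the order a server might send them.
responses <- list(
  list(pages = c("A", "B"), batchcomplete = FALSE, continue = "c1"),
  list(pages = c("A", "B"), batchcomplete = TRUE,  continue = "c2"),
  list(pages = c("C", "D"), batchcomplete = TRUE,  continue = NULL)
)

# Count how many responses are consumed before a stopping rule is satisfied.
requests_needed <- function(stop_when) {
  i <- 1
  while (!stop_when(responses[[i]])) {
    i <- i + 1
  }
  i
}

# One request, whatever it returns (the next_result() rule).
n_result <- requests_needed(function(r) TRUE)
# Stop once the current batch of pages is complete (the next_batch() rule).
n_batch <- requests_needed(function(r) r$batchcomplete)
# Stop once the server offers nothing more (the retrieve_all() rule).
n_all <- requests_needed(function(r) is.null(r$continue))

c(n_result, n_batch, n_all)
```

With these toy responses the three rules consume one, two and three responses respectively.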
Value
A query_tbl containing results of the query. If x is a query_tbl, then the function will return a new query_tbl with the new data appended to it. If x is a wiki_action_request, then the returned query_tbl will contain the necessary data to supply future calls to next_result(), next_batch() or retrieve_all().
Examples
# Try out a request using next_result(), then retrieve the rest of the
# results. The cllimit parameter limits the first request to 40 results.
preview <- wiki_action_request() %>%
query_by_title("Steve Wozniak") %>%
query_page_properties("categories", cllimit = 40) %>%
gracefully(next_result)
preview
all_results <- preview %>%
gracefully(retrieve_all)
all_results
# tidyr is useful for list-columns.
if (tibble::is_tibble(all_results)) {
all_results %>%
tidyr::unnest(cols=c(categories), names_sep = "_")
}
Get resources from one of Wikipedia's two REST APIs
Description
This function is intended for developer use. It makes it easy to quickly generate vectorised calls to the different APIs.
Usage
get_rest_resource(
...,
language = "en",
api = c("core", "wikimedia", "wikimedia_org", "xtools"),
response_format = c("json", "html"),
response_type = NULL,
failure_mode = c("error", "quiet")
)
Arguments
... |
< |
language |
Character vector of two-letter language codes |
api |
The desired REST api: "core", "wikimedia", "wikimedia_org", or "xtools" |
response_format |
The expected Content-Type of the response. Currently "html" and "json" are supported. |
response_type |
The schema of the response. If supplied, the results will be parsed using the schema. |
failure_mode |
How to respond if a request fails. "error", the default: raise an error. "quiet": silently return NA, and include the http error code in the response. |
Details
The key invariant to maintain is the number of rows. Users ought to be able to use this function with dplyr::mutate, which requires the number of rows to be invariant.
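The row-count invariant can be sketched in base R. Both helper names below are hypothetical, and the failure handling only imitates the "quiet" failure mode: a failed item yields NA instead of shortening the output.

```r
# Hypothetical per-item function; it fails for empty titles.
fetch_one <- function(title) {
  if (!nzchar(title)) stop("no such page")
  nchar(title)
}

# Apply fetch_one to every element, substituting NA on failure so that the
# output always has exactly one element per input.
fetch_all_quietly <- function(titles) {
  vapply(
    titles,
    function(t) tryCatch(fetch_one(t), error = function(e) NA_integer_),
    integer(1),
    USE.NAMES = FALSE
  )
}

pages <- data.frame(title = c("Earth", "", "Mars"))
pages$n_chars <- fetch_all_quietly(pages$title)
pages
```

Because the result has one entry per input row, it can be assigned as a new column without any realignment.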
Value
A list of responses. If response_format == "json", then the responses will be simple R lists. If response_format == "html", then the responses will be xml_document objects. If response_type is supplied, the response will be coerced into a tibble::tbl_df or vector using the relevant schema. If the response is a 'scalar list' (i.e. a list of length == 1), then it is silently unlisted, returning a simple list or vector.
Gracefully request a resource from Wikipedia
Description
The main purpose of this function is to enable examples using live resources in the documentation. Examples must not throw errors, according to CRAN policy. If you wrap a requesting method in gracefully, then any errors of type httr2_http will be caught and no error will be thrown.
Usage
gracefully(request_object, request_method)
Arguments
request_object |
A |
request_method |
The desired function for performing the request, typically one of those in get_query_results |
Value
The output of request_method called on request_object, if the request was successful. Otherwise a httr2_response object with details of the failed request.
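The catching behaviour can be sketched with base R's tryCatch(). This is an illustration only: gracefully_sketch and always_404 are hypothetical names, the error is a hand-built condition carrying the httr2_http class rather than a real httr2 error, and the sketch returns the condition object itself instead of a httr2_response.

```r
# Hypothetical sketch: catch conditions of class "httr2_http" and return a
# value instead of throwing. Other kinds of error are not caught.
gracefully_sketch <- function(request_object, request_method) {
  tryCatch(
    request_method(request_object),
    httr2_http = function(cnd) {
      message("Request failed: ", conditionMessage(cnd))
      cnd
    }
  )
}

# A made-up request method that always signals an HTTP-style error.
always_404 <- function(req) {
  stop(structure(
    list(message = "HTTP 404 Not Found", call = NULL),
    class = c("httr2_http_404", "httr2_http", "error", "condition")
  ))
}

result <- gracefully_sketch("dummy request", always_404)
inherits(result, "httr2_http")
```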
Examples
# This fails without throwing an error
req <- httr2::request(httr2::example_url()) |>
httr2::req_url_path("/status/404")
resp <- gracefully(req, httr2::req_perform)
print(resp)
# This request succeeds
req <- httr2::request(httr2::example_url())
resp <- gracefully(req, httr2::req_perform)
print(resp)
Determine if a page parameter comprises titles or pageids, and prefix accordingly.
Description
Determine if a page parameter comprises titles or pageids, and prefix accordingly.
Usage
id_or_title(page, prefix = NULL)
## S3 method for class 'character'
id_or_title(page, prefix = NULL)
## S3 method for class 'numeric'
id_or_title(page, prefix = NULL)
Arguments
page |
Either a character or numeric vector. If a character vector, it is interpreted as a vector of page titles. If a numeric vector, of pageids. |
prefix |
Optional: A prefix to affix to the page titles if it is missing |
Value
A list
Constructor for generator query type
Description
Construct a new query to a generator module of the Action API. This low-level constructor only performs basic type-checking. It is your responsibility to ensure that the chosen generator is an existing API endpoint, and that you have composed the query correctly. For a more user-friendly interface, use query_generate_pages.
Usage
new_generator_query(.req, generator, ...)
Arguments
.req |
A |
generator |
The generator to add to the query. If the generator is based
on a property module, then
|
... |
< |
Value
The output type depends on the input. If .req is a query/action_api/httr2_request, then the output will be a generator/query/action_api/httr2_request. If .req is a prop/query/action_api/httr2_request, then the return object will be a subclass of the passed request, with "generator" as the first term in the class vector, i.e. generator/(titles|pageids|revids)/prop/query/action_api/httr2_request.
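The class-vector manipulation described above can be sketched in base R. A bare list stands in for a real request object, and as_generator is a hypothetical helper, not a function exported by the package.

```r
# A bare list standing in for a request; the class names follow the text above.
req <- structure(
  list(),
  class = c("titles", "prop", "query", "action_api", "httr2_request")
)

# Hypothetical helper: put "generator" at the front of the class vector.
as_generator <- function(.req) {
  class(.req) <- c("generator", class(.req))
  .req
}

gen <- as_generator(req)
class(gen)
# The object still inherits from every class it had before.
inherits(gen, "prop")
```

Because the original classes are kept, S3 methods written for prop or query objects continue to apply to the generator.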
Examples
# Build a generator query using a list module
# List all members of Category:Physics on English Wikipedia
physics <- wiki_action_request() %>%
new_generator_query("categorymembers", gcmtitle = "Category:Physics")
# Build a generator query on a property module
# Generate the pages that are linked to Albert Einstein's page on English
# Wikipedia
einstein_categories <- wiki_action_request() %>%
new_prop_query("titles", "Albert Einstein") %>%
new_generator_query("iwlinks")
Constructor for list queries
Description
This low-level constructor only performs basic type checking.
Usage
new_list_query(.req, list, ...)
## S3 method for class 'list'
new_list_query(.req, list, ...)
## S3 method for class 'generator'
new_list_query(.req, list, ...)
## S3 method for class 'prop'
new_list_query(.req, list, ...)
## S3 method for class 'query'
new_list_query(.req, list, ...)
Arguments
.req |
A |
list |
The list module to add to the query |
... |
< |
Value
An object of type list/query/action_api/httr2_request
.
Examples
# Create a query to list all members of Category:Physics
physics_query <- wiki_action_request() %>%
new_list_query("categorymembers", cmtitle="Category:Physics")
Constructor for the property query type
Description
The intended use for this query is to set the 'titles', 'pageids' or 'revids' parameter, and enforce that only one of these is set. All property modules in the Action API require this parameter to be set, or they require a generator parameter to be set instead. The prop/query type is an abstract type representing the three possible kinds of property query that do not rely on a generator (see below on the return value). A complication is that a prop/query can itself be used as the basis for a generator.
Usage
new_prop_query(.req, by, pages, ...)
Arguments
.req |
A |
by |
The type of page. Allowed values are: pageids, titles, revids |
pages |
A string, the pages to query by, corresponding to the 'by' parameter. Multiple values should be separated with "|" |
... |
< |
Value
A properly qualified prop/query object. There are six possibilities:
- titles/prop/query
- pageids/prop/query
- revids/prop/query
- generator/titles/prop/query
- generator/pageids/prop/query
- generator/revids/prop/query
Examples
# Build a query on a set of pageids
# 963273 and 1159171 are Kate Bush albums
bush_albums_query <- wiki_action_request() %>%
new_prop_query("pageids", "963273|1159171")
Get data about pages from their titles
Description
get_latest_revision() returns metadata about the latest revision of each page.
get_page_html() returns the rendered html for each page.
get_page_summary() returns metadata about the latest revision, along with the page description and a summary extracted from the opening paragraph.
get_page_related() returns summaries for 20 related pages for each passed page.
get_page_talk() returns structured talk page content for each title. Be sure to use the title of the Talk page itself, e.g. "Talk:Earth" rather than "Earth".
get_page_langlinks() returns interwiki links for each title.
Usage
get_latest_revision(title, language = "en", failure_mode = "error")
get_page_html(title, language = "en", failure_mode = "error")
get_page_summary(title, language = "en", failure_mode = "error")
get_page_talk(title, language = "en", failure_mode = "error")
get_page_langlinks(title, language = "en", failure_mode = "error")
Arguments
title |
A character vector of page titles. |
language |
A character vector of two-letter language codes, either of
length 1 or the same length as |
failure_mode |
Either "quiet" or "error." See |
Value
A list, vector or tibble, the same length as title, with the desired data.
Examples
# Get language links for a known page on English Wikipedia
get_page_langlinks("Charles Harpur", failure_mode = "quiet")
# The functions are vectorised over title and language
# Find all articles about Joanna Baillie, and retrieve summary data for
# the first two.
baillie <- get_page_langlinks("Joanna Baillie") %>%
dplyr::slice(1:2) %>%
dplyr::mutate(get_page_summary(title = title, language = code, failure_mode = "quiet"))
baillie
Convert a response from a Wikipedia API into a convenient format
Description
Wikipedia's APIs provide data using a range of different json schemas. This generic function converts the data into a convenient format for use in an R data frame.
Usage
## S3 method for class 'wikidiff2'
parse_response(response)
parse_response(response)
## Default S3 method:
parse_response(response)
## S3 method for class 'row_list'
parse_response(response)
Arguments
response |
The data retrieved from Wikipedia. |
Value
A vector the same length as the response. Generally, this will be a simple vector, a tibble::tbl_df or a list of tibble::tbl_df objects.
Methods (by class)
- parse_response(wikidiff2): Simplify a wikidiff2 response to a dataframe of textual differences, discarding display data
- parse_response(default): By default, create a list of nested tbl_dfs
- parse_response(row_list): Many of the endpoints return a list of named values for each page, which can easily be row-bound. They often contain nested data, however, which is automatically unnested by dplyr::bind_rows. Hence this more basic approach.
Perform a single request to the Action API.
Description
This function is the workhorse behind the user-facing next_result(), next_batch() and retrieve_all().
Usage
perform_query(request, continue)
Arguments
request |
The request object |
continue |
The continue parameter returned by the previous request |
Value
A query_tbl() of the results
See Also
Add required prefix to URL parameters for MediaWiki Action API request
Description
Add required prefix to URL parameters for MediaWiki Action API request
Usage
prefix_params(params, prefix)
Arguments
params |
A character vector |
prefix |
A character vector |
Value
A character vector
Convert passed objects into ISO8601 strings for API requests
Description
Convert passed objects into ISO8601 strings for API requests
Usage
process_timestamps(...)
Arguments
... |
Dynamic dots: the objects to be coerced |
Value
A named list of ISO strings, the same length as ...
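A base R sketch of this conversion is given below. It is not the package's implementation (which may rely on lubridate); process_timestamps_sketch is a hypothetical name, and treating all inputs as UTC is an assumption.

```r
# Hypothetical stand-in for process_timestamps().
process_timestamps_sketch <- function(...) {
  # Convert each argument to a date-time, then format it as an ISO 8601
  # string. Assumption: inputs without a timezone are interpreted as UTC.
  lapply(list(...), function(x) {
    format(as.POSIXct(x, tz = "UTC"), "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
  })
}

stamps <- process_timestamps_sketch(
  start = as.Date("2020-01-01"),
  end = "2020-06-30 12:00:00"
)
stamps
```

The names given to the arguments are carried over to the returned list, so each string can be matched back to the API parameter it belongs to.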
Query the MediaWiki Action API using a vector of Wikipedia pages
Description
These functions help you to build a query for the MediaWiki Action API if you already have a set of pages that you wish to investigate. These functions can be combined with query_page_properties to choose which properties to return for the passed pages.
Usage
query_by_title(.req, title)
query_by_pageid(.req, pageid)
query_by_revid(.req, revid)
Arguments
.req |
A wiki_action_request query to modify |
title |
A character vector of page titles |
pageid |
A character or numeric vector of page ids |
revid |
A character or numeric vector of revision ids |
Details
If you don't already know which pages you wish to examine, you can build a query to find pages that meet certain criteria using query_list_pages or query_generate_pages.
Value
A request object of type pages/query/action_api/httr2_request. To perform the query, pass the object to next_batch or retrieve_all.
See Also
Examples
# Retrieve the categories for Charles Harpur's Wikipedia page
resp <- wiki_action_request() %>%
query_by_title("Charles Harpur") %>%
query_page_properties("categories") %>%
gracefully(next_batch)
Explore Wikipedia's category system
Description
These functions provide access to the CategoryMembers endpoint of the Action API.
query_category_members() builds a generator query to return the members of a given category.
build_category_tree() finds all the pages and subcategories beneath the passed category, then recursively finds all the pages and subcategories beneath them, until it can find no more subcategories.
Usage
query_category_members(
.req,
category,
namespace = NULL,
type = c("file", "page", "subcat"),
limit = 10,
sort = c("sortkey", "timestamp"),
dir = c("ascending", "descending", "newer", "older"),
start = NULL,
end = NULL,
language = "en"
)
build_category_tree(category, language = "en")
Arguments
.req |
|
category |
The category to start from. |
namespace |
Only return category members from the provided namespace |
type |
Alternative to |
limit |
The number to return each batch. Max 500. |
sort |
How to sort the returned category members. 'timestamp' sorts them by the date they were included in the category; 'sortkey' by the category member's unique hexadecimal code |
dir |
The direction in which to sort them |
start |
If |
end |
If |
language |
The language edition of Wikipedia to query |
Value
query_category_members(): A request object of type generator/query/action_api/httr2_request, which can be passed to next_batch() or retrieve_all(). You can specify which properties to retrieve for each page using query_page_properties().
build_category_tree(): A list containing two dataframes. nodes lists all the subcategories and pages found underneath the passed categories. edges records the connections between them. The source column gives the pageid of the parent category, while the target column gives the pageid of any categories, pages or files contained within the source category. The timestamp records the moment when the target page or subcategory was included in the source category. The two dataframes in the list can be passed to igraph::graph_from_data_frame for network analysis.
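The recursive traversal and the resulting nodes and edges tables can be sketched offline in base R. Everything here is invented for illustration: the members list stands in for the server, titles are used in place of pageids, and no timestamp is recorded.

```r
# Invented toy data: each category maps to its direct members.
members <- list(
  "Category:Albums" = c("Category:Live albums", "Album A"),
  "Category:Live albums" = c("Album B", "Album C")
)

# Breadth-first walk: record an edge for every member, and queue any member
# that is itself a category so that its own members are visited later.
build_tree_sketch <- function(root) {
  queue <- root
  visited <- character()
  edges <- data.frame(source = character(), target = character())
  while (length(queue) > 0) {
    current <- queue[[1]]
    queue <- queue[-1]
    if (current %in% visited) next  # guard against category cycles
    visited <- c(visited, current)
    children <- members[[current]]
    if (is.null(children)) next
    edges <- rbind(edges, data.frame(source = current, target = children))
    queue <- c(queue, children[children %in% names(members)])
  }
  nodes <- data.frame(name = unique(c(root, edges$target)))
  list(nodes = nodes, edges = edges)
}

tree_sketch <- build_tree_sketch("Category:Albums")
tree_sketch$edges
```

The visited check matters on real data, where a category can appear beneath more than one parent.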
See Also
Examples
# Get the first 10 pages in 'Category:Physics' on English Wikipedia
physics_members <- wiki_action_request() %>%
query_category_members("Physics") %>%
gracefully(next_batch)
physics_members
# Build the tree of all albums for the Brisbane band Custard
tree <- build_category_tree("Category:Custard_(band)_albums")
tree
# For network analysis and visualisation, you can pass the category tree
# to igraph
tree_graph <- igraph::graph_from_data_frame(tree$edges, vertices = tree$nodes)
tree_graph
Generate pages that meet certain criteria, or which are related to a set of known pages by certain properties
Description
Many of the endpoints on the Action API can be used as generators. Use list_all_generators() to see a complete list. The main advantage of using a generator is that you can chain it with calls to query_page_properties() to find out specific information about the pages. This is not possible for queries constructed using query_list_pages().
Usage
query_generate_pages(.req, generator, ...)
list_all_generators()
Arguments
.req |
A httr2_request, e.g. generated by |
generator |
The generator module you wish to use. Most list and property modules can be used, though not all. |
... |
< |
Details
There are two kinds of generator: list-generators and prop-generators. If using a prop-generator, then you need to use a query_by_() function to tell the API where to start from, as shown in the examples.
To set additional parameters to a generator, prepend the parameter with "g". For instance, to set a limit of 10 to the number of pages returned by the categorymembers generator, set the parameter gcmlimit = 10.
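The "g" prefix rule can be sketched with a small base R helper. The function as_generator_params is hypothetical and only renames arguments; it does not contact the API or validate the parameter names.

```r
# Hypothetical helper: turn list-module parameters into generator parameters
# by prepending "g" to each parameter name.
as_generator_params <- function(...) {
  params <- list(...)
  names(params) <- paste0("g", names(params))
  params
}

gen_params <- as_generator_params(cmtitle = "Category:Physics", cmlimit = 10)
names(gen_params)
```

Here cmtitle and cmlimit become gcmtitle and gcmlimit, matching the gcmlimit example in the text.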
Value
query_generate_pages: The modified request, which can be passed to next_batch or retrieve_all as appropriate.
list_all_generators: a tibble of all the available generator modules. The name column gives the name of the generator, while the group column indicates whether the generator is based on a list module or a property module. Generators based on property modules can only be added to a query if you have already used query_by_ to specify which pages' properties should be generated.
See Also
Examples
# Search for articles about seagulls
seagulls <- wiki_action_request() %>%
query_generate_pages("search", gsrsearch = "seagull") %>%
gracefully(next_batch)
seagulls
List pages that meet certain criteria
Description
See API:Lists for available list actions. Each list action returns a list of pages, typically including their pageid, namespace and title. Individual lists have particular properties that can be requested, which are usually prefixed with a two-letter code based on the name of the list (e.g. specific properties for the categorymembers list action are prefixed with cm).
Usage
query_list_pages(.req, list, ...)
list_all_list_modules()
Arguments
.req |
A httr2_request, e.g. generated by |
list |
The type of list to return |
... |
< |
Details
When the request is performed, the data is returned in the body of the response under the query object, labeled by the chosen list action.
If you want to study the actual pages listed, it is advisable to retrieve the pages directly using a generator, rather than listing their IDs using a list action. When using a list action, a second request is required to get further information about each page. Using a generator, you can query pages and retrieve their relevant properties in a single API call.
Value
An HTTP request: an S3 list with class httr2_request
See Also
Examples
# Get the ten most recently added pages in Category:Physics
physics_pages <- wiki_action_request() %>%
query_list_pages("categorymembers",
cmsort = "timestamp",
cmdir = "desc", cmtitle = "Category:Physics"
) %>%
gracefully(next_batch)
physics_pages
Choose properties to return for pages from the action API
Description
See API:Properties for a list of available properties. Many have additional parameters to control their behavior, which can be passed to this function as named arguments.
Usage
query_page_properties(.req, property, ...)
list_all_property_modules()
Arguments
.req |
A httr2_request, e.g. generated by |
property |
The property to request |
... |
< |
Details
query_page_properties is not useful on its own. It must be combined with a query_by_ function or query_generate_pages to specify which pages' properties are to be returned. Note that many of the API:Properties modules can themselves be used as generators. If you wish to use a property module in this way, then you must use query_generate_pages, passing the name of the property module as the generator.
Value
An HTTP request: an S3 list with class httr2_request
See Also
Examples
# Search for articles about seagulls and retrieve their number of
# watchers
resp <- wiki_action_request() %>%
query_generate_pages("search", gsrsearch = "seagull") %>%
query_page_properties("info", inprop = "watchers") %>%
gracefully(next_batch) %>%
dplyr::select(pageid, ns, title, watchers)
resp
Representation of Wikipedia data returned from an Action API Query module as a tibble, with request metadata stored as attributes.
Description
Representation of Wikipedia data returned from an Action API Query module as a tibble, with request metadata stored as attributes.
Usage
query_tbl(x, request, continue, batchcomplete)
Arguments
x |
A tibble |
request |
The httr2_request object used to generate the tibble |
continue |
The continue parameter returned by the API |
batchcomplete |
The batchcomplete parameter returned by the API |
Value
A tibble: an S3 data.frame with class query_tbl.
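The idea of a data frame that carries its request metadata as attributes can be sketched in base R. This sketch uses a plain data.frame rather than a tibble to stay dependency-free; query_tbl_sketch is a hypothetical name and the attribute values are placeholders.

```r
# Hypothetical stand-in for query_tbl(), built on a base data.frame.
query_tbl_sketch <- function(x, request, continue, batchcomplete) {
  structure(
    x,
    request = request,
    continue = continue,
    batchcomplete = batchcomplete,
    class = c("query_tbl", class(x))
  )
}

tbl <- query_tbl_sketch(
  data.frame(pageid = c(1, 2), title = c("A", "B")),
  request = "placeholder request",
  continue = list(clcontinue = "placeholder token"),
  batchcomplete = FALSE
)

# The metadata travels with the data and can be read back later.
attr(tbl, "batchcomplete")
inherits(tbl, "data.frame")
```

Keeping the request and continuation state on the result is what allows a later call to resume the query from the table alone.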
Tidy eval helpers
Description
This page lists the tidy eval tools reexported in this package from rlang. To learn about using tidy eval in scripts and packages at a high level, see the dplyr programming vignette and the ggplot2 in packages vignette. The Metaprogramming section of Advanced R may also be useful for a deeper dive.
- The tidy eval operators {{, !!, and !!! are syntactic constructs which are specially interpreted by tidy eval functions. You will mostly need {{, as !! and !!! are more advanced operators which you should not have to use in simple cases.
  The curly-curly operator {{ allows you to tunnel data-variables passed from function arguments inside other tidy eval functions. {{ is designed for individual arguments. To pass multiple arguments contained in dots, use ... in the normal way.
    my_function <- function(data, var, ...) {
      data %>%
        group_by(...) %>%
        summarise(mean = mean({{ var }}))
    }
- enquo() and enquos() delay the execution of one or several function arguments. The former returns a single expression, the latter returns a list of expressions. Once defused, expressions will no longer evaluate on their own. They must be injected back into an evaluation context with !! (for a single expression) and !!! (for a list of expressions).
    my_function <- function(data, var, ...) {
      # Defuse
      var <- enquo(var)
      dots <- enquos(...)
      # Inject
      data %>%
        group_by(!!!dots) %>%
        summarise(mean = mean(!!var))
    }
  In this simple case, the code is equivalent to the usage of {{ and ... above. Defusing with enquo() or enquos() is only needed in more complex cases, for instance if you need to inspect or modify the expressions in some way.
- The .data pronoun is an object that represents the current slice of data. If you have a variable name in a string, use the .data pronoun to subset that variable with [[.
    my_var <- "disp"
    mtcars %>% summarise(mean = mean(.data[[my_var]]))
- Another tidy eval operator is :=. It makes it possible to use glue and curly-curly syntax on the LHS of =. For technical reasons, the R language doesn't support complex expressions on the left of =, so we use := as a workaround.
    my_function <- function(data, var, suffix = "foo") {
      # Use `{{` to tunnel function arguments and the usual glue
      # operator `{` to interpolate plain strings.
      data %>%
        summarise("{{ var }}_mean_{suffix}" := mean({{ var }}))
    }
- Many tidy eval functions like dplyr::mutate() or dplyr::summarise() give an automatic name to unnamed inputs. If you need to create the same sort of automatic names by yourself, use as_label(). For instance, the glue-tunnelling syntax above can be reproduced manually with:
    my_function <- function(data, var, suffix = "foo") {
      var <- enquo(var)
      prefix <- as_label(var)
      data %>%
        summarise("{prefix}_mean_{suffix}" := mean(!!var))
    }
  Expressions defused with enquo() (or tunnelled with {{) need not be simple column names, they can be arbitrarily complex. as_label() handles those cases gracefully. If your code assumes a simple column name, use as_name() instead. This is safer because it throws an error if the input is not a name as expected.
Value
Consult the original rlang documentation for the return types of these re-exported functions.
Check that a Wikimedia XML file has not been corrupted
Description
The Wikimedia Foundation publishes checksums for all its database dumps. This function looks up the published SHA-1 checksum based on the file name, then compares it to the locally calculated hash using the openssl package.
Usage
verify_xml_integrity(path)
Arguments
path |
The path to the file |
Value
TRUE (invisibly) if successful; otherwise an error is raised
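The compare-and-fail pattern can be sketched in base R. Two things differ from the function documented here and are assumptions of the sketch: the expected checksum is passed in by hand rather than looked up from the published list, and base R's tools::md5sum() is used as a dependency-free stand-in for the openssl hash.

```r
# Hypothetical sketch: compare a file's checksum with an expected value.
verify_checksum_sketch <- function(path, expected) {
  # tools::md5sum() returns a named character vector; drop the name.
  actual <- unname(tools::md5sum(path))
  if (!identical(tolower(expected), actual)) {
    stop("Checksum mismatch for ", path, call. = FALSE)
  }
  invisible(TRUE)
}

# Write a small file with known contents, then verify it.
tmp <- tempfile()
writeBin(charToRaw("hello"), tmp)
ok <- verify_checksum_sketch(tmp, "5d41402abc4b2a76b9719d911017c592")
ok
```

Returning TRUE invisibly keeps the function quiet when called for its side effect of raising an error on corruption.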
Query Wikipedia using the MediaWiki Action API
Description
Wikipedia exposes a To build up a query, you first call
wiki_action_request()
to create the basic request object, then use the
helper functions query_page_properties()
, query_list_pages()
and
query_generate_pages()
to modify the request, before calling next_batch()
or retrieve_all()
to perform the query and download results from the
server.
Usage
wiki_action_request(..., action = "query", language = "en")
Arguments
... |
< |
action |
The action to perform, typically 'query' |
language |
The language edition of Wikipedia to request, e.g. 'en' or 'fr' |
Details
wikkitidy provides an ergonomic API for the Action API's Query modules. These modules are most useful for researchers, because they allow you to explore the structure of Wikipedia and its back pages. You can obtain a list of available modules in your R console using list_all_property_modules(), list_all_list_modules() and list_all_generators().
Value
An action_api object, an S3 list that subclasses httr2::request.
The dependencies between different aspects of the Action API are complex.
At the time of writing, there are five major subclasses of action_api/httr2_request:
- generator/action_api/httr2_request, returned (sometimes) by query_generate_pages
- list/action_api/httr2_request, returned by query_list_pages
- titles, pageids and revids/action_api/httr2_request, returned by the various query_by_ functions
You can use query_page_properties to modify any kind of query except for list queries: indeed, the central limitation of the list queries is that you cannot choose what properties to return for the pages that meet the given criterion. The concept of a generator is complex. If the generator is based on a property module, then it must be combined with a query_by_ function to produce a valid query. If the generator is based on a list module, then it cannot be combined with a query_by_ query.
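To make the class hierarchy more concrete, here is a small base R sketch that mimics it with bare S3 class vectors. It does not call wikkitidy: the constructor new_fake_request() and the checker can_add_properties() are hypothetical names invented for illustration, and the only rule they encode is the one stated above, that property modules cannot be added to list queries.

```r
# Hypothetical stand-in for a request object: an empty list carrying
# the same kind of class vector as described above
new_fake_request <- function(subclass = NULL) {
  structure(list(), class = c(subclass, "action_api", "httr2_request"))
}

# Encodes the rule from the text: any query except a 'list' query
# can be modified with property modules
can_add_properties <- function(req) {
  inherits(req, "action_api") && !inherits(req, "list")
}

base_req   <- new_fake_request()
titles_req <- new_fake_request("titles")
list_req   <- new_fake_request("list")

base_ok   <- can_add_properties(base_req)   # TRUE
titles_ok <- can_add_properties(titles_req) # TRUE
list_ok   <- can_add_properties(list_req)   # FALSE
```

Because the check uses inherits(), it looks only at the explicit class attribute, so a plain request is not mistaken for a list query merely because it is stored as an R list.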
See Also
Examples
# List the first 10 pages in the category 'Australian historians'
historians <- wiki_action_request() %>%
  query_list_pages(
    "categorymembers",
    cmtitle = "Category:Australian_historians",
    cmlimit = 10
  ) %>%
  gracefully(next_batch)
historians
Build a REST request to one of the Wikimedia Foundation's central APIs
Description
wikimedia_org_rest_request() builds a request for the wikimedia.org REST API, which provides statistical data about Wikimedia Foundation projects.
xtools_rest_request() builds a request to the XTools API, which provides additional statistical data about Wikimedia Foundation projects.
Usage
wikimedia_org_rest_request(endpoint, ..., language = "en")
xtools_rest_request(endpoint, ..., language = "en")
Arguments
endpoint |
The endpoint for the specific kind of request; for Wikimedia APIs, this comprises the path components between the general API endpoint and the component specifying the project to query |
... |
< |
language |
Two-letter language code for the desired Wikipedia edition. |
Value
A wikimedia_org/rest or xtools/rest object, an S3 vector that subclasses httr2::request.
Examples
# Build request for articleinfo about Kate Bush's page on English Wikipedia
request <- xtools_rest_request("page/articleinfo", "Kate_Bush")
# Build request for most-viewed pages on German Wikipedia in July 2020
request <- wikimedia_org_rest_request(
  "metrics/pageviews/top",
  "all-access", "2020", "07", "all-days",
  language = "de"
)
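As a rough picture of how the endpoint, the project and the remaining path components fit together, the following base R sketch assembles the URL for the pageviews example by hand. It does not use wikkitidy or perform a request. sketch_wikimedia_url() is a hypothetical helper, and the base URL and the "<language>.wikipedia" project component are assumptions based on the public Wikimedia pageviews API rather than on the package source.

```r
# Hypothetical helper (not part of wikkitidy): join the general API
# endpoint, the specific endpoint, the project, and any further path
# components into a single URL string.
sketch_wikimedia_url <- function(endpoint, ..., language = "en") {
  base <- "https://wikimedia.org/api/rest_v1"    # assumed general endpoint
  project <- paste0(language, ".wikipedia")      # assumed project component
  paste(c(base, endpoint, project, ...), collapse = "/")
}

url <- sketch_wikimedia_url(
  "metrics/pageviews/top",
  "all-access", "2020", "07", "all-days",
  language = "de"
)
url
```

The point of the sketch is the ordering: the endpoint comes before the project, and the components passed through ... follow it.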
Build a REST request to one of Wikipedia's specific REST APIs
Description
core_rest_request() builds a request for the MediaWiki Core REST API, the basic REST API available on all MediaWiki wikis.
wikimedia_rest_request() builds a request for the Wikimedia REST API, an additional API just for Wikipedia and other wikis managed by the Wikimedia Foundation.
Usage
core_rest_request(..., language = "en")
wikimedia_rest_request(..., language = "en")
Arguments
... |
< |
language |
The two-letter language code for the Wikipedia edition |
Value
A core/rest or wikimedia/rest object, an S3 vector that subclasses httr2_request (see httr2::request). The request needs to be passed to httr2::req_perform to retrieve data from the API.
Examples
# Get the html of the 'Earth' article on English Wikipedia
response <- core_rest_request("page", "Earth", "html") %>%
  httr2::req_perform()
response <- wikimedia_rest_request("page", "html", "Earth") %>%
  httr2::req_perform()
# Some REST requests take query parameters. Pass these as named arguments.
# To search German Wikipedia for articles about Goethe
response <- core_rest_request("search/page", q = "Goethe", limit = 2, language = "de") %>%
  httr2::req_perform() %>%
  httr2::resp_body_json()
Get path to wikkitidy example
Description
wikkitidy comes bundled with a number of sample files in its inst/extdata
directory. This function makes them easy to access.
Usage
wikkitidy_example(file = NULL)
Arguments
file |
Name of file. If |
Value
A character vector, containing either the path of the chosen file, or the nicknames of all available example files.
Examples
wikkitidy_example()
wikkitidy_example("fatwiki_dump")
Access page-level statistics from the XTools Page API endpoint
Description
get_xtools_page_info() returns basic statistics about articles' history and quality, including their total edits, creation date, and assessment value (good, featured, etc.).
get_xtools_page_prose() returns statistics about the word counts and referencing of articles.
get_xtools_page_links() returns the number of incoming and outgoing links to articles, including redirects.
get_xtools_page_top_editors() returns the list of top editors for articles, with optional filters by date range and non-bot status.
get_xtools_page_assessment() returns more detailed statistics about articles' assessment status and Wikiproject importance levels.
Usage
get_xtools_page_info(title, language = "en", failure_mode = "error")
get_xtools_page_prose(
title,
language = "en",
failure_mode = c("error", "quiet")
)
get_xtools_page_links(
title,
language = "en",
failure_mode = c("error", "quiet")
)
get_xtools_page_top_editors(
title,
start = NULL,
end = NULL,
limit = 1000,
nobots = FALSE,
language = "en",
failure_mode = c("error", "quiet")
)
get_xtools_page_assessment(
title,
classonly = FALSE,
language = "en",
failure_mode = c("error", "quiet")
)
Arguments
title |
Character vector of page titles |
language |
Language code for the version of Wikipedia to query |
failure_mode |
What to do if no data is found. See |
start |
A character vector or date object (optional): the start date for calculating top editors |
end |
A character vector or date object (optional): the end date for calculating top editors |
limit |
An integer: the maximum number of top editors to return |
nobots |
TRUE or FALSE: if TRUE, bots are excluded from the top editor calculation |
classonly |
TRUE or FALSE: if TRUE, only return the article's assessment status, without Wikiproject information |
Value
A list or tbl of results, the same length as title. NB: The results for get_xtools_page_assessment are still not parsed properly.
Examples
# Get basic statistics about Erich Auerbach on German Wikipedia
auerbach <- get_xtools_page_info("Erich Auerbach", language = "de", failure_mode = "quiet")
auerbach