Title: | Validating Topic Coherence and Topic Labels |
Version: | 1.2.1 |
Description: | Creates crowd-sourcing tasks that can be easily posted, and their results retrieved, through Amazon's Mechanical Turk (MTurk) API. Researchers can use these tasks to validate the quality of topics obtained from unsupervised or semi-supervised learning methods, and the relevance of the topic labels assigned to them, helping ensure that topic modeling results are accurate and useful for research purposes. See Ying and others (2022) <doi:10.1101/2023.05.02.538599>. For more information, please visit https://github.com/Triads-Developer/Topic_Model_Validation. |
License: | GPL-2 | GPL-3 [expanded from: GPL (≥ 2)] |
Encoding: | UTF-8 |
LazyData: | true |
RoxygenNote: | 7.2.3 |
Imports: | pyMTurkR, rlang (≥ 0.4.11), tm (≥ 0.7-11), here, SnowballC |
Suggests: | roxygen2, testthat |
Author: | Luwei Ying |
Maintainer: | Luwei Ying <triads.developers@wustl.edu> |
Repository: | CRAN |
Depends: | R (≥ 3.5.0) |
NeedsCompilation: | no |
Packaged: | 2023-05-15 12:34:05 UTC; jessiewalker |
Date/Publication: | 2023-05-16 08:20:02 UTC |
Example R4WSI0 Tasks
Description
Data of 15 example R4WSI0 Tasks structured as a matrix.
Usage
data(R4WSItasktest)
Format
A matrix with 15 rows and 6 columns.
topic
Index of topics
doc
Example documents associated with each topic
opt1
Words set option 1
opt2
Words set option 2
opt3
Words set option 3
optcrt
Words set option 4, also the correct choice
Details
Note that, unlike the R4WSI0 examples used here, R4WSI tasks do not present any documents.
Topic_Model_Validation Repository Overview
Description
The 'Topic_Model_Validation' repository is a collection of scripts and functions for performing topic modeling and evaluating topic models. This document provides an overview of the different scripts and functions in the repository and their purpose.
Details
## Python Scripts
### evaluate.py
The 'evaluate.py' script provides functions for evaluating the performance of topic models on different datasets and tasks. The functions within this script include:
- R4WSItasktest(): Evaluates the performance of a topic model on the R4WSI task, which involves predicting the top k words for a given topic.
- allR4WSItasktest(): Evaluates the performance of a topic model on multiple versions of the R4WSI task.
- goldR4WSItest(): Evaluates the performance of a topic model on a gold-standard R4WSI dataset.
- heldouttest(): Evaluates the performance of a topic model on held-out data.
- keypostedtest(): Evaluates the performance of a topic model on a key-posted dataset.
- masstest(): Evaluates the performance of a topic model on a massive dataset.
- modtest(): Evaluates the performance of a topic model on a given dataset.
- resultstest(): Evaluates the performance of a topic model on a given dataset and stores the results.
### record.py
The 'record.py' script provides a function for storing the results of topic model evaluations. The function within this script is:
- record(): Stores the results of topic model evaluations.
## R Scripts
### lda.R
The 'lda.R' script provides functions for performing Latent Dirichlet Allocation (LDA) topic modeling on text data. The functions within this script include:
- lda_model(): Fits an LDA model to text data.
### lsa.R
The 'lsa.R' script provides functions for performing Latent Semantic Analysis (LSA) topic modeling on text data. The functions within this script include:
- lsa_model(): Fits an LSA model to text data.
### evaluate.R
The 'evaluate.R' script provides functions for evaluating the performance of topic models using various metrics, such as perplexity and coherence. The functions within this script include:
- evaluate_model(): Evaluates the performance of a topic model using various metrics.
### helpers.R
The 'helpers.R' script provides various helper functions that are used by the other scripts in the repository. The functions within this script include:
- clean_text(): Cleans and preprocesses text data for use in topic modeling.
- read_data(): Reads in text data from a file.
- write_data(): Writes text data to a file.
Example R4WSI Tasks with Regular and Gold-Standard Tasks
Description
A data frame of 20 example R4WSI0 tasks, 5 of which are gold-standard and 15 of which are not.
Usage
data(allR4WSItasktest)
Format
A data frame of 20 rows and 7 columns.
topic
Index of topics
id
Index of tasks
doc
Example documents associated with each topic
opt1
Words set option 1
opt2
Words set option 2
opt3
Words set option 3
optcrt
Words set option 4, also the correct choice
Check Agreement Rate between Identical Trials
Description
Check Agreement Rate between Identical Trials
Usage
checkAgree(results1, results2, key, type = NULL)
Arguments
results1 |
first batch of results; outputs from getResults() |
results2 |
second batch of results; outputs from getResults() |
key |
the local task record; outputs from recordTasks() |
type |
Task structures to be specified. Must be one of "WI" (word intrusion), "T8WSI" (top 8 word set intrusion), "R4WSI" (random 4 word set intrusion), "LI" (Label Intrusion), and "OL" (Optimal Label) |
Details
Evaluates workers' performance by the agreement rate between identical trials (note that the two inputs, results1 and results2, must come from identical tasks). Returns 1) the exact agreement rate, where both workers agree on the exact same choice, and 2) the binary agreement rate, where both workers get the task either right or wrong simultaneously.
Value
Numeric values: the exact and binary agreement rates described in Details.
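Examples
A minimal sketch of checking agreement between two batches that posted identical tasks; hits1, hits2, and rec are illustrative objects assumed to come from earlier calls to sendTasks() and recordTasks():
res1 <- getResults(batch_id = "trial1", hit_ids = hits1)
res2 <- getResults(batch_id = "trial2", hit_ids = hits2)
checkAgree(results1 = res1, results2 = res2, key = rec, type = "R4WSI")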
Combine the mass of words with the same root
Description
Combine the mass of words with the same root
Usage
combMass(mod = NULL, vocab = NULL, beta = NULL)
Arguments
mod |
Fitted structural topic models. |
vocab |
A character vector specifying the words in the corpus. Usually, it can be found in topic model output. |
beta |
A matrix of word probabilities for each topic. Each row represents a topic and each column represents a word. Note this should not be in the logged form. |
Details
Use as a preparatory step when validating unstemmed topic models.
Value
A list with two elements:
newvocab |
A matrix of new vocabulary. Each row represents a topic and each column represents a unique stemmed word. |
newbeta |
A matrix of new beta. Each row represents a topic and each column represents the sum of the probabilities of the words with the same root. |
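Examples
A sketch of the stemming preparation step using the bundled STM object modtest, assuming the fitted model alone is sufficient (vocab and beta may alternatively be supplied directly):
data(modtest)
mass <- combMass(mod = modtest)
newvocab <- mass$newvocab
newbeta <- mass$newbeta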
Evaluate results
Description
Evaluate results
Usage
evalResults(results, key, type = NULL)
Arguments
results |
results of human choice; outputs from getResults() |
key |
the local task record; outputs from recordTasks() |
type |
Task structures to be specified. Must be one of "WI" (word intrusion), "T8WSI" (top 8 word set intrusion), "R4WSI" (random 4 word set intrusion), "LI" (Label Intrusion), and "OL" (Optimal Label) |
Details
Evaluates worker performance using gold-standard HITs and returns the accuracy rate (proportion correct) for a specified batch.
Value
A list containing the gold-standard HIT correct rate, gold-standard HIT correct rate by workers, and non-gold-standard HIT correct rate
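Examples
A sketch of scoring a retrieved batch against the local record; hits and rec are illustrative objects assumed to come from sendTasks() and recordTasks():
res <- getResults(batch_id = "batch1", hit_ids = hits)
acc <- evalResults(results = res, key = rec, type = "R4WSI")
acc   # gold-standard and non-gold-standard correct rates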
Get results from Mturk
Description
Get results from Mturk
Usage
getResults(
batch_id = "unspecified",
hit_ids,
retry = TRUE,
retry_in_seconds = 60,
AWS_id = Sys.getenv("AWS_ACCESS_KEY_ID"),
AWS_secret = Sys.getenv("AWS_SECRET_ACCESS_KEY"),
sandbox = getOption("pyMTurkR.sandbox", TRUE)
)
Arguments
batch_id |
any number or string to annotate the batch |
hit_ids |
hit ids returned from the MTurk API, i.e., output of sendTasks() |
retry |
if TRUE, retry retrieving results from the MTurk API five times; defaults to TRUE |
retry_in_seconds |
defaults to 60 seconds |
AWS_id |
AWS_ACCESS_KEY_ID |
AWS_secret |
AWS_SECRET_ACCESS_KEY |
sandbox |
sandbox setting |
Details
This function works for both complete and incomplete batches.
Value
a data frame with columns:
batch_id |
an annotation for the batch |
local_task_id |
an identifier for the task in the batch |
mturk_hit_id |
the ID of the HIT in MTurk |
assignment_id |
the ID of the assignment in MTurk |
worker_id |
the ID of the worker who completed the assignment |
result |
the worker's response to the task |
completed_at |
the time when the worker submitted the assignment |
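Examples
A sketch of retrieving a batch in the sandbox, assuming AWS credentials are set as environment variables and sent is the output of an earlier sendTasks() call:
res <- getResults(
  batch_id = "pilot-1",
  hit_ids = sent$current_HIT_ids,
  retry = TRUE,
  retry_in_seconds = 60,
  sandbox = TRUE
)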
Example Gold-Standard R4WSI0 Tasks
Description
Data frame of 5 example gold-standard R4WSI0 Tasks.
Usage
data(goldR4WSItest)
Format
A data frame of 5 rows and 6 columns.
topic
Index of topics
doc
Example documents associated with each topic
opt1
Words set option 1
opt2
Words set option 2
opt3
Words set option 3
optcrt
Words set option 4, also the correct choice
An Example Heldout Test Set
Description
An output from the make.heldout function of the stm package.
Usage
data(heldouttest)
Format
A list of the heldout documents, vocab, and missing.
Source
See https://CRAN.R-project.org/package=stm for more details.
References
Roberts, Margaret E., Brandon M. Stewart, and Dustin Tingley. "Stm: An R package for structural topic models." Journal of Statistical Software 91 (2019): 1-40.
Example Answer Keys
Description
Example Answer Keys
Usage
data(keypostedtest)
Format
A list of two data frames, similar to recordtest.
data.frame1
A data frame of tasks with optcrt indicating the machine-predicted choice.
data.frame2
A data frame of tasks with randomized choices; exactly the same as what would be sent online.
An Example of the Combined Mass for Words with the Same Roots
Description
A list of two elements: the words (the most frequent form in each topic) and the corresponding word probabilities.
Usage
data(masstest)
Format
A list of two.
Details
vocab
A matrix of words for each topic. Each row represents a topic and each column represents a word. Words sharing the same root are represented only by their most common form in that topic.
beta
A matrix of combined word probabilities for each topic. Each row represents a topic and each column represents a combined word.
Mix the gold-standard tasks with the tasks that need to be validated
Description
Mix the gold-standard tasks with the tasks that need to be validated
Usage
mixGold(tasks, golds)
Arguments
tasks |
All tasks that need to be validated |
golds |
Gold standard tasks with the same structure |
Value
A data frame with the same structure as the input, where gold-standard tasks are randomly inserted
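Examples
For instance, mixing the 15 bundled example tasks with the 5 bundled gold-standard tasks reproduces the structure of allR4WSItasktest:
data(R4WSItasktest)
data(goldR4WSItest)
mixed <- mixGold(tasks = R4WSItasktest, golds = goldR4WSItest)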
An Example Topic Model
Description
A structural topic model (STM) object generated from the stm package using a random sample of US senators' Facebook posts.
Usage
data(modtest)
Format
An STM object.
Source
See https://CRAN.R-project.org/package=stm for more details.
References
Roberts, Margaret E., Brandon M. Stewart, and Dustin Tingley. "Stm: An R package for structural topic models." Journal of Statistical Software 91 (2019): 1-40.
Pick the optimal label from candidate labels
Description
Pick the optimal label from candidate labels
Usage
pickLabel(
n,
text.predict = NULL,
text.name = "text",
top1.name = "top1",
labels.index = NULL,
candidate.labels = NULL
)
Arguments
n |
The number of desired tasks |
text.predict |
A data frame or matrix containing both the text and the indicator(s) of the model predicted topic(s). |
text.name |
variable name in 'text.predict' that indicates the text |
top1.name |
variable name in 'text.predict' that indicates the top1 model predicted topic |
labels.index |
The topic index in correspondence with the labels, e.g., c(10, 12, 15). |
candidate.labels |
A list of vectors containing the user-defined labels assigned to the topics. Must be of the same length and order as 'labels.index'. |
Details
Users need to specify four plausible labels for each topic
Value
A matrix with n rows and 6 columns (topic, doc, opt1, opt2, opt3, optcrt) where optcrt is the correct label that was picked.
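Examples
A sketch with an illustrative prediction frame preds holding a 'text' column and a 'top1' column of model-predicted topics; the topic indices and candidate labels are made up for illustration:
tasks <- pickLabel(
  n = 10,
  text.predict = preds,
  text.name = "text",
  top1.name = "top1",
  labels.index = c(10, 12, 15),
  candidate.labels = list(
    c("Economy", "Health", "Energy", "Defense"),
    c("Education", "Crime", "Environment", "Trade"),
    c("Immigration", "Taxes", "Housing", "Agriculture")
  )
)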
Plot results
Description
Plot results
Usage
plotResults(path, x, n, taskname, ...)
Arguments
path |
path to store the plot |
x |
a vector of counts of successes; could be obtained from getResults() |
n |
a vector of counts of trials |
taskname |
the name of the task for labeling, e.g., Word Intrusion, Optimal Label. |
... |
additional arguments to be passed to the plot function |
Details
Visualize the accuracy rate (proportion correct) for a specified batch
Value
Nothing is returned; a plot is created and saved as a pdf file.
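Examples
A sketch that plots accuracy for three batches of 30 tasks each; the counts and the path are illustrative (x could be derived by scoring getResults() output against the key):
plotResults(
  path = "plots/R4WSI-accuracy.pdf",
  x = c(18, 22, 25),
  n = c(30, 30, 30),
  taskname = "Random 4 Word Set Intrusion"
)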
Reform tasks to facilitate sending to Mturk
Description
Reform tasks to facilitate sending to Mturk
Usage
recordTasks(type, tasks, path)
Arguments
type |
(character) one of WI, T8WSI, R4WSI |
tasks |
(data.frame) outputs from validateTopic(), validateLabel(), or mixGold() if users mix in gold-standard HITs |
path |
(character) path to record the tasks (with meta-information) |
Details
Randomize the order of options and record the tasks in a specified local directory
Value
A list of two data frames, containing the original tasks and the randomized options respectively.
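Examples
A sketch of recording a batch before posting, assuming tasks is the output of validateTopic() or mixGold(); the local directory is illustrative:
rec <- recordTasks(type = "R4WSI", tasks = tasks, path = "records/")
rec[[1]]   # original tasks
rec[[2]]   # same tasks with randomized option order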
Example Local Record of the R4WSI Tasks
Description
Local record generated by the recordTasks function.
Usage
data(recordtest)
Format
A list of two data frames.
data.frame1
A data frame of tasks with optcrt indicating the machine-predicted choice.
data.frame2
A data frame of tasks with randomized choices; exactly the same as what would be sent online.
Details
To be compared with the answers from the online workers to evaluate the topic model performance.
Example Results Retrieved from Mturk
Description
Example Results Retrieved from Mturk
Usage
data(resultstest)
Format
A data frame of ten example tasks retrieved from MTurk, with or without online workers' answers.
assignment_id
Assignment id. Mturk assigned. If 0, then the task hasn't been completed.
batch_id
User specified batch id.
completed_at
Timestamp when the task was completed. If 0, then the task hasn't been completed.
local_task_id
Local task id.
mturk_hit_id
Mturk HIT id. Mturk assigned.
result
Choice made by the worker. 1-4. If 0, then the task hasn't been completed.
worker_id
Mturk worker id. If 0, then the task hasn't been completed.
Send prepared task to Mturk and record the API-returned HIT ids.
Description
Send prepared task to Mturk and record the API-returned HIT ids.
Usage
sendTasks(
hit_type = NULL,
hit_layout = NULL,
type = NULL,
tasksrecord = NULL,
tasksids = NULL,
HITidspath = NULL,
n_assignments = "1",
expire_in_seconds = as.character(60 * 60 * 8),
batch_annotation = NULL
)
Arguments
hit_type |
find from the Mturk requester's dashboard |
hit_layout |
find from the Mturk requester's dashboard |
type |
one of WI, T8WSI, R4WSI |
tasksrecord |
output of recordTasks() |
tasksids |
ids of tasks to send in numeric form. If left unspecified, the whole batch will be posted |
HITidspath |
path to record the returned HITids |
n_assignments |
number of assignments per task. For the validation tasks, people almost always want 1 |
expire_in_seconds |
default 8 hours |
batch_annotation |
add if needed |
Details
Pairs the local ids with MTurk ids and saves them to the specified path.
Value
A list containing two elements:
current_HIT_ids: A vector of the HIT ids returned by the API.
map_ids: A data frame that maps the task ids to their corresponding HIT ids.
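Examples
A sketch of posting a recorded batch to the sandbox; the HIT type and layout ids are illustrative placeholders to be copied from the requester's dashboard, and rec is the output of recordTasks():
sent <- sendTasks(
  hit_type = "YOUR_HIT_TYPE_ID",
  hit_layout = "YOUR_HIT_LAYOUT_ID",
  type = "R4WSI",
  tasksrecord = rec,
  HITidspath = "records/hitids.RData",
  n_assignments = "1",
  expire_in_seconds = as.character(60 * 60 * 8),
  batch_annotation = "pilot-1"
)
sent$current_HIT_ids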
An Example Object of Prepared Documents
Description
An output from the prepDocuments function of the stm package.
Usage
data(stmPreptest)
Format
A list containing a documents object and a vocab object.
Source
See https://CRAN.R-project.org/package=stm for more details.
References
Roberts, Margaret E., Brandon M. Stewart, and Dustin Tingley. "Stm: An R package for structural topic models." Journal of Statistical Software 91 (2019): 1-40.
Tidy eval helpers
Description
This page lists the tidy eval tools reexported in this package from rlang. To learn about using tidy eval in scripts and packages at a high level, see the dplyr programming vignette and the ggplot2 in packages vignette. The Metaprogramming section of Advanced R may also be useful for a deeper dive.
The tidy eval operators {{, !!, and !!! are syntactic constructs which are specially interpreted by tidy eval functions. You will mostly need {{, as !! and !!! are more advanced operators which you should not have to use in simple cases.
The curly-curly operator {{ allows you to tunnel data-variables passed from function arguments inside other tidy eval functions. {{ is designed for individual arguments. To pass multiple arguments contained in dots, use ... in the normal way.

my_function <- function(data, var, ...) {
  data %>%
    group_by(...) %>%
    summarise(mean = mean({{ var }}))
}

enquo() and enquos() delay the execution of one or several function arguments. The former returns a single expression, the latter returns a list of expressions. Once defused, expressions will no longer evaluate on their own. They must be injected back into an evaluation context with !! (for a single expression) and !!! (for a list of expressions).

my_function <- function(data, var, ...) {
  # Defuse
  var <- enquo(var)
  dots <- enquos(...)

  # Inject
  data %>%
    group_by(!!!dots) %>%
    summarise(mean = mean(!!var))
}

In this simple case, the code is equivalent to the usage of {{ and ... above. Defusing with enquo() or enquos() is only needed in more complex cases, for instance if you need to inspect or modify the expressions in some way.
The .data pronoun is an object that represents the current slice of data. If you have a variable name in a string, use the .data pronoun to subset that variable with [[.

my_var <- "disp"
mtcars %>% summarise(mean = mean(.data[[my_var]]))

Another tidy eval operator is :=. It makes it possible to use glue and curly-curly syntax on the LHS of =. For technical reasons, the R language doesn't support complex expressions on the left of =, so we use := as a workaround.

my_function <- function(data, var, suffix = "foo") {
  # Use `{{` to tunnel function arguments and the usual glue
  # operator `{` to interpolate plain strings.
  data %>%
    summarise("{{ var }}_mean_{suffix}" := mean({{ var }}))
}

Many tidy eval functions like dplyr::mutate() or dplyr::summarise() give an automatic name to unnamed inputs. If you need to create the same sort of automatic names by yourself, use as_label(). For instance, the glue-tunnelling syntax above can be reproduced manually with:

my_function <- function(data, var, suffix = "foo") {
  var <- enquo(var)
  prefix <- as_label(var)
  data %>%
    summarise("{prefix}_mean_{suffix}" := mean(!!var))
}

Expressions defused with enquo() (or tunnelled with {{) need not be simple column names, they can be arbitrarily complex. as_label() handles those cases gracefully. If your code assumes a simple column name, use as_name() instead. This is safer because it throws an error if the input is not a name as expected.
Value
NULL. This page only serves to document the tidy eval tools reexported in this package from rlang.
Create validation tasks for labels assigned to the topics in the topic model of choice.
Description
Create validation tasks for labels assigned to the topics in the topic model of choice.
Usage
validateLabel(
type,
n,
text.predict = NULL,
text.name = "text",
top1.name = "top1",
top2.name = "top2",
top3.name = "top3",
labels = NULL,
labels.index = NULL,
labels.add = NULL
)
Arguments
type |
Task structures to be specified. Must be one of "LI" (Label Intrusion) and "OL" (Optimal Label). |
n |
The number of desired tasks |
text.predict |
A data frame or matrix containing both the text and the indicator(s) of the model predicted topic(s). |
text.name |
variable name in 'text.predict' that indicates the text |
top1.name |
variable name in 'text.predict' that indicates the top1 model predicted topic |
top2.name |
variable name in 'text.predict' that indicates the top2 model predicted topic |
top3.name |
variable name in 'text.predict' that indicates the top3 model predicted topic |
labels |
The user-defined labels assigned to the topics |
labels.index |
The topic index in correspondence with the labels, e.g., c(10, 12, 15). Must be in the same length and order with 'label'. |
labels.add |
Labels from other broad categories. Defaults to NULL. Users can specify them to evaluate how well different broad categories are distinguished from one another. |
Details
Users need to pick a topic model that they deem to be good and label the topics they later would like to use as measures.
Value
A matrix containing the validation tasks. The matrix has six value columns:
- topic: The topic index associated with the document.
- doc: The text of the document.
- opt1: The first option label presented to the user.
- opt2: The second option label presented to the user.
- opt3: The third option label presented to the user.
- optcrt: The correct label for the document.
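Examples
A sketch of creating Optimal Label tasks; preds is an illustrative data frame with 'text', 'top1', 'top2', and 'top3' columns, and the labels and topic indices are made up for illustration:
tasks <- validateLabel(
  type = "OL",
  n = 20,
  text.predict = preds,
  labels = c("Economy", "Health", "Energy"),
  labels.index = c(10, 12, 15)
)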
Create validation tasks for topic model selection
Description
Create validation tasks for topic model selection
Usage
validateTopic(type, n, text = NULL, vocab, beta, theta = NULL, thres = 20)
Arguments
type |
Task structures to be specified. Must be one of "WI" (word intrusion), "T8WSI" (top 8 word set intrusion), and "R4WSI" (random 4 word set intrusion). |
n |
The number of desired tasks |
text |
The pool of documents to be shown to the Mturk workers |
vocab |
A character vector specifying the words in the corpus. Usually, it can be found in topic model output. |
beta |
A matrix of word probabilities for each topic. Each row represents a topic and each column represents a word. Note this should not be in the logged form. |
theta |
A matrix of topic proportions. Each row represents a document and each column represents a topic. Must be specified if type = "T8WSI" or "R4WSI". |
thres |
the threshold to draw words from; defaults to the top 20 words. |
Details
Users need to fit their own topic models.
Value
A matrix of validation tasks. Each row represents a task and each column represents an aspect of the task, including the topic index, the document text (for "T8WSI" and "R4WSI"), and five words: four non-intrusive words and one intrusive word.
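Examples
A sketch of creating R4WSI tasks from an illustrative fitted stm model fit and document pool docs; note that stm stores beta in logged form, so it is exponentiated first (the beta argument must not be logged):
tasks <- validateTopic(
  type = "R4WSI",
  n = 15,
  text = docs,
  vocab = fit$vocab,
  beta = exp(fit$beta$logbeta[[1]]),
  theta = fit$theta,
  thres = 20
)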