Type: | Package |
Title: | Feature Extraction and Document Classification with Noisy Labels |
Version: | 0.9.5 |
Maintainer: | Kohei Watanabe <watanabe.kohei@gmail.com> |
Description: | Extract features and classify documents with noisy labels given by document-meta data or keyword matching Watanabe & Zhou (2020) <doi:10.1177/0894439320907027>. |
License: | MIT + file LICENSE |
URL: | https://github.com/koheiw/wordmap |
BugReports: | https://github.com/koheiw/wordmap/issues |
LazyData: | TRUE |
Encoding: | UTF-8 |
Depends: | R (≥ 3.5), methods |
Imports: | utils, Matrix, quanteda (≥ 2.1), stringi, ggplot2, ggrepel |
Suggests: | spelling, testthat |
Language: | en-US |
RoxygenNote: | 7.3.2 |
NeedsCompilation: | no |
Packaged: | 2025-07-10 07:59:26 UTC; watan |
Author: | Kohei Watanabe [aut, cre, cph] |
Repository: | CRAN |
Date/Publication: | 2025-07-10 12:50:02 UTC |
Evaluate classification accuracy in precision and recall
Description
accuracy() counts the number of true positive, false positive, true negative, and false negative cases for each predicted class and calculates precision, recall and F1 score based on these counts.
summary() calculates micro-average precision and recall, and macro-average precision and recall based on the output of accuracy().
Usage
accuracy(x, y)
## S3 method for class 'textmodel_wordmap_accuracy'
summary(object, ...)
Arguments
x |
vector of predicted classes. |
y |
vector of true classes. |
object |
output of accuracy(). |
... |
not used. |
Value
accuracy() returns a data.frame with the following columns:
tp |
the number of true positive cases. |
fp |
the number of false positive cases. |
tn |
the number of true negative cases. |
fn |
the number of false negative cases. |
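precision |
the proportion of predicted cases that are correct, tp / (tp + fp). |
recall |
the proportion of true cases that are predicted, tp / (tp + fn). |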
f1 |
the harmonic mean of precision and recall. |
summary() returns a named numeric vector with the following elements:
p |
micro-average precision. |
r |
micro-average recall. |
P |
macro-average precision. |
R |
macro-average recall. |
Examples
class_pred <- c('US', 'GB', 'US', 'CN', 'JP', 'FR', 'CN') # prediction
class_true <- c('US', 'FR', 'US', 'CN', 'KP', 'EG', 'US') # true class
acc <- accuracy(class_pred, class_true)
print(acc)
summary(acc)
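The counts and averages described in this section can be reproduced by hand. The following is a minimal base-R sketch, not the package's implementation. It assumes that scores are computed for each class that appears in the predictions, that micro-averages pool the counts across classes, and that macro-averages are unweighted means of the per-class scores with undefined values dropped. The helper count_class() and the variable names are illustrative only.

```r
class_pred <- c('US', 'GB', 'US', 'CN', 'JP', 'FR', 'CN') # prediction
class_true <- c('US', 'FR', 'US', 'CN', 'KP', 'EG', 'US') # true class

# assumption: one row per class that appears in the predictions
classes <- sort(unique(class_pred))

# count the four outcome types for a single class k (hypothetical helper)
count_class <- function(k) {
  c(tp = sum(class_pred == k & class_true == k),
    fp = sum(class_pred == k & class_true != k),
    tn = sum(class_pred != k & class_true != k),
    fn = sum(class_pred != k & class_true == k))
}
tab <- as.data.frame(t(sapply(classes, count_class)))

# per-class scores; a zero denominator yields NaN
tab$precision <- tab$tp / (tab$tp + tab$fp)
tab$recall <- tab$tp / (tab$tp + tab$fn)
tab$f1 <- 2 * tab$precision * tab$recall / (tab$precision + tab$recall)
print(tab)

# micro-average: pool the counts over classes, then compute the ratio
micro_p <- sum(tab$tp) / sum(tab$tp + tab$fp)
micro_r <- sum(tab$tp) / sum(tab$tp + tab$fn)

# macro-average: average the per-class scores (NaN dropped by assumption)
macro_p <- mean(tab$precision, na.rm = TRUE)
macro_r <- mean(tab$recall, na.rm = TRUE)
print(c(p = micro_p, r = micro_r, P = macro_p, R = macro_r))
```

Micro-averages are dominated by frequent classes, whereas macro-averages weight every class equally, so comparing the two shows whether errors are concentrated in rare classes.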
Compute Average Feature Entropy (AFE)
Description
afe() computes Average Feature Entropy (AFE), which measures randomness of occurrences of features in labelled documents (Watanabe & Zhou, 2020). In creating seed dictionaries, AFE can be used to avoid adding seed words that would decrease classification accuracy.
Usage
afe(x, y, smooth = 1)
Arguments
x |
a dfm for features. |
y |
a dfm for labels. |
smooth |
a numeric value for smoothing to include all the features. |
Value
Returns a single numeric value.
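To give an intuition for what an entropy-based measure captures, the base-R sketch below computes the Shannon entropy of each feature's smoothed frequency distribution over labels and then averages it. This only illustrates the general idea: the exact definition of AFE is given in Watanabe & Zhou (2020), and the toy counts, the feature and label names, and the averaging step are assumptions rather than the code behind afe().

```r
# toy feature-by-label counts (hypothetical data)
counts <- matrix(c(10, 0, 0,
                    4, 3, 3),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("nuclear", "also"),
                                 c("security", "economy", "health")))

# additive smoothing keeps every feature-label cell above zero
smooth <- 1
p <- (counts + smooth) / rowSums(counts + smooth)

# Shannon entropy (in bits) of each feature over the labels
entropy <- -rowSums(p * log2(p))
print(entropy)

# average over the features
print(mean(entropy))
```

A feature concentrated in one label ("nuclear") has low entropy, while a feature spread evenly across labels ("also") approaches the maximum of log2(3) bits for three labels.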
References
Watanabe, Kohei & Zhou, Yuan (2020). "Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches". doi:10.1177/0894439320907027. Social Science Computer Review.
Coerce various objects to coefficients_textmodel
Description
This is a helper function used in summary.textmodel_*.
Usage
as.coefficients_textmodel(x)
Arguments
x |
an object to be coerced |
Value
Returns a coefficients_textmodel object
Create lexicon from a Wordmap model
Description
as.list() returns features with the largest coefficients as a list of character vectors. as.dictionary() returns a quanteda::dictionary object that can be used for dictionary analysis.
Usage
## S3 method for class 'textmodel_wordmap'
as.dictionary(x, separator = NULL, ...)
## S3 method for class 'textmodel_wordmap'
as.list(x, ...)
Arguments
x |
a model fitted by textmodel_wordmap(). |
separator |
the character in between multi-word dictionary values. If
|
... |
passed to coef.textmodel_wordmap |
Value
Returns a list or a quanteda::dictionary object.
Coerce various objects to statistics_textmodel
Description
This is a helper function used in summary.textmodel_*.
Usage
as.statistics_textmodel(x)
Arguments
x |
an object to be coerced |
Value
A statistics_textmodel object
Assign the summary.textmodel class to a list
Description
Assign the summary.textmodel class to a list
Usage
as.summary.textmodel(x)
Arguments
x |
a named list |
Value
Returns a summary.textmodel object.
Extract coefficients from a Wordmap model
Description
coef() extracts the top n features with the largest coefficients for each class.
Usage
## S3 method for class 'textmodel_wordmap'
coef(object, n = 10, select = NULL, ...)
## S3 method for class 'textmodel_wordmap'
coefficients(object, n = 10, select = NULL, ...)
Arguments
object |
a model fitted by textmodel_wordmap(). |
n |
the number of coefficients to extract. |
select |
returns the coefficients for the selected class; specify by the
names of rows in |
... |
not used. |
Value
Returns a list of named numeric vectors sorted in descending order.
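The selection of top features can be pictured with a small base-R sketch. It assumes a coefficient matrix with classes in rows and features in columns. The toy values, the class and feature names, and the helper top_features() are hypothetical and do not reproduce the package's internal code.

```r
# toy coefficient matrix: classes in rows, features in columns (assumption)
model <- matrix(c( 2.1, -0.3, 0.8, 0.1,
                  -0.5,  1.7, 0.2, 0.9),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("security", "economy"),
                                c("nuclear", "trade", "peace", "growth")))

# keep the n largest coefficients of each class, in descending order
top_features <- function(m, n = 10) {
  res <- lapply(rownames(m), function(k) {
    head(sort(m[k, ], decreasing = TRUE), n)
  })
  names(res) <- rownames(m)
  res
}

coefs <- top_features(model, n = 2)
print(coefs)

# dropping the values leaves a list of character vectors, i.e. a lexicon
lexicon <- lapply(coefs, names)
print(lexicon)
```

The second step mirrors how a list of top features can serve as a lexicon: only the feature names are kept for each class.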
UN General Debate speeches from 2017
Description
A corpus of 196 speeches from the 2017 UN General Debate (Mikhaylov and Baturo, 2017). The economic data for 2017 (GDP and GDP per capita) are downloaded from the World Bank website.
Usage
data_corpus_ungd2017
Format
The corpus includes the following document variables:
- country_iso
ISO3c country code, e.g. "AFG" for Afghanistan
- un_session
UN session, a numeric identifier (in this case, 72)
- year
4-digit year (2017).
- country
country name, in English.
- continent
continent of the country, one of: Africa, Americas, Asia, Europe, Oceania. Note that the speech delivered on behalf of the European Union is coded as "Europe".
- gdp
GDP in $US for 2017, from the World Bank. Contains missing values for 9 countries.
- gdp_per_capita
GDP per capita in $US for 2017, derived from the World Bank. Contains missing values for 9 countries.
Source
Mikhaylov, M., Baturo, A., & Dasandi, N. (2017). "United Nations General Debate Corpus". doi:10.7910/DVN/0TJX8Y. Harvard Dataverse, V4.
References
Baturo, A., Dasandi, N., & Mikhaylov, S. (2017). "Understanding State Preferences With Text As Data: Introducing the UN General Debate Corpus". doi:10.1177/2053168017712821. Research and Politics.
Seed topic dictionary
Description
A dictionary with seed words for six common topics at the United Nations General Assembly (Watanabe and Zhou, 2020).
Usage
data_dictionary_topic
Format
An object of class dictionary2 of length 6.
Author(s)
Kohei Watanabe watanabe.kohei@gmail.com
References
Watanabe, Kohei & Zhou, Yuan (2020). "Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches". doi:10.1177/0894439320907027. Social Science Computer Review.
Predict the most likely class of documents
Description
Predict document class using fitted Wordmap models.
Usage
## S3 method for class 'textmodel_wordmap'
predict(
object,
newdata = NULL,
confidence = FALSE,
rank = 1L,
type = c("top", "all"),
rescale = FALSE,
min_conf = -Inf,
min_n = 0L,
...
)
Arguments
object |
a model fitted by textmodel_wordmap(). |
newdata |
a dfm on which prediction will be made. |
confidence |
if |
rank |
rank of the class to be predicted. Only used when |
type |
if |
rescale |
if |
min_conf |
returns |
min_n |
set the minimum number of polarity words in documents. |
... |
not used. |
Value
Returns predicted classes as a vector. If confidence = TRUE, it returns a list of two vectors:
class |
predicted classes of documents. |
confidence.fit |
the confidence of predictions. |
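The two return shapes can be compared with a short usage sketch. It reuses the bundled data_corpus_ungd2017 and data_dictionary_topic objects and assumes that the quanteda and wordmap packages are installed. The object names (map, pred, pred_conf) are illustrative.

```r
library(quanteda)
library(wordmap)

# prepare features and noisy labels from the bundled corpus and dictionary
corp <- corpus_reshape(data_corpus_ungd2017)
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))
dfmt_feat <- dfm(toks)
dfmt_dict <- dfm(tokens_lookup(toks, data_dictionary_topic))
map <- textmodel_wordmap(dfmt_feat, dfmt_dict)

# default: a vector of predicted classes
pred <- predict(map)
head(pred)

# confidence = TRUE: a list holding the classes and their confidence
pred_conf <- predict(map, confidence = TRUE)
str(pred_conf)
```

Inspecting confidence.fit before applying a min_conf threshold is one way to choose a cut-off that suits the corpus at hand.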
Print methods for textmodel features estimates
Description
This is a helper function used in print.summary.textmodel.
Usage
## S3 method for class 'coefficients_textmodel'
print(x, digits = max(3L, getOption("digits") - 3L), ...)
Arguments
x |
a coefficients_textmodel object |
digits |
minimal number of significant digits, see
|
... |
additional arguments not used |
Value
Does not return anything
Implements print methods for textmodel_statistics
Description
Implements print methods for textmodel_statistics
Usage
## S3 method for class 'statistics_textmodel'
print(x, digits = max(3L, getOption("digits") - 3L), ...)
Arguments
x |
a statistics_textmodel object |
digits |
minimal number of significant digits, see
|
... |
further arguments passed to or from other methods |
Value
Does not return anything
print method for summary.textmodel
Description
print method for summary.textmodel
Usage
## S3 method for class 'summary.textmodel'
print(x, digits = max(3L, getOption("digits") - 3L), ...)
Arguments
x |
a |
digits |
minimal number of significant digits, see
|
... |
additional arguments not used. |
Value
Does not return anything
A model for multinomial feature extraction and document classification
Description
Wordmap is a model for multinomial feature extraction and document classification. Its naive Bayesian algorithm allows users to train the model on a large corpus with noisy labels given by document meta-data or keyword matching.
Usage
textmodel_wordmap(
x,
y,
label = c("all", "max"),
smooth = 0.01,
boolean = FALSE,
drop_label = TRUE,
entropy = c("none", "global", "local", "average"),
residual = FALSE,
verbose = quanteda_options("verbose"),
...
)
Arguments
x |
a dfm or fcm created by |
y |
a dfm or a sparse matrix that record class membership of the
documents. It can be created applying |
label |
if "max", uses only labels for the maximum value in each row of
|
smooth |
the amount of smoothing in computing coefficients.
When |
boolean |
if |
drop_label |
if |
entropy |
the scheme to compute the entropy to
regularize likelihood ratios. The entropy of features are computed over
labels if |
residual |
if |
verbose |
if |
... |
additional arguments passed to internal functions. |
Details
Wordmap learns associations between words in x and classes in y based on likelihood ratios. Large likelihood ratios tend to concentrate on a small number of features, but the entropy of their frequencies over labels or documents helps to disperse the distribution.
A residual class is created internally by adding a new column to y. The column is given 1 if the other values in the same row are all zero (i.e. rowSums(y) == 0); otherwise 0. It is useful when users cannot create an exhaustive dictionary that covers all the categories.
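The residual column described above can be sketched in base R on a toy label matrix. The document names, the class names and the column name "other" are placeholders chosen for illustration. The name that the package assigns internally may differ.

```r
# toy label matrix: 4 documents, 2 classes (hypothetical data)
y <- matrix(c(1, 0,
              0, 2,
              0, 0,
              1, 1),
            ncol = 2, byrow = TRUE,
            dimnames = list(paste0("doc", 1:4), c("security", "economy")))

# the residual column is 1 where no class matched, otherwise 0;
# "other" is a placeholder column name
y_res <- cbind(y, other = as.numeric(rowSums(y) == 0))
print(y_res)
```

Only doc3, which matched no dictionary category, is assigned to the residual class. Without the extra column such documents would carry no label.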
Value
Returns a fitted textmodel_wordmap object with the following elements:
model |
a matrix that records the association between classes and features. |
data |
the original input of |
feature |
the feature set in |
class |
the class labels in |
concatenator |
the concatenator in |
entropy |
the scheme to compute entropy weights. |
boolean |
the use of the Boolean transformation of |
call |
the command used to execute the function. |
version |
the version of the wordmap package. |
References
Watanabe, Kohei (2018). "Newsmap: semi-supervised approach to geographical news classification". doi:10.1080/21670811.2017.1293487. Digital Journalism.
Watanabe, Kohei & Zhou, Yuan (2020). "Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches". doi:10.1177/0894439320907027. Social Science Computer Review.
Examples
library(quanteda)
library(wordmap)
# split into sentences
corp <- corpus_reshape(data_corpus_ungd2017)
# tokenize and remove stopwords
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))
# apply seed dictionary
toks_dict <- tokens_lookup(toks, data_dictionary_topic)
# form dfm
dfmt_feat <- dfm(toks)
dfmt_dict <- dfm(toks_dict)
# fit wordmap model
map <- textmodel_wordmap(dfmt_feat, dfmt_dict)
# top features for each class
coef(map)
# predicted class of each sentence
predict(map)
Plot coefficients of words
Description
Plot coefficients of words
Usage
textplot_terms(
x,
highlighted = NULL,
max_highlighted = 50,
max_words = 1000,
...
)
Arguments
x |
a fitted textmodel_wordmap object. |
highlighted |
quanteda::pattern to select words to highlight. If a quanteda::dictionary is passed, words in the top-level categories are highlighted in different colors. |
max_highlighted |
the maximum number of words to highlight. When
|
max_words |
the maximum number of words to plot. Words are randomly sampled to keep the number below the limit. |
... |
passed to underlying functions. See the Details. |
Details
Users can customize the plots through ..., which is passed to ggplot2::geom_text() and ggrepel::geom_text_repel(). The colors are specified internally but users can override the settings by appending ggplot2::scale_colour_manual() or ggplot2::scale_colour_brewer(). The legend title can also be modified using ggplot2::labs().
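The customization described above might look like the following sketch. It assumes that the quanteda, wordmap and ggplot2 packages are installed, that textplot_terms() returns a ggplot object, and that the highlighted categories are mapped to the colour aesthetic. The palette and the legend title are arbitrary choices.

```r
library(quanteda)
library(wordmap)
library(ggplot2)

# fit a model on the bundled corpus with the seed dictionary as labels
corp <- corpus_reshape(data_corpus_ungd2017)
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))
dfmt_feat <- dfm(toks)
dfmt_dict <- dfm(tokens_lookup(toks, data_dictionary_topic))
map <- textmodel_wordmap(dfmt_feat, dfmt_dict)

# highlight the seed words, one colour per top-level category
p <- textplot_terms(map, highlighted = data_dictionary_topic)

# override the internal colours and rename the legend
# (assumes the legend belongs to the colour aesthetic)
p + scale_colour_brewer(palette = "Dark2") + labs(colour = "Topic")
```

Replacing a scale that is already defined may print a message from ggplot2. The plot is still drawn with the new palette.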