Type: Package
Title: Wrapper to the 'spaCy' 'NLP' Library
Version: 1.3.0
Description: An R wrapper to the 'Python' 'spaCy' 'NLP' library, from https://spacy.io.
License: GPL-3
LazyData: TRUE
Depends: R (≥ 3.0.0), methods
Imports: data.table, reticulate (≥ 1.6)
Suggests: dplyr, knitr, quanteda, R.rsp, rmarkdown, spelling, testthat, tidytext, tibble
URL: https://spacyr.quanteda.io
Encoding: UTF-8
BugReports: https://github.com/quanteda/spacyr/issues
RoxygenNote: 7.2.3
Language: en-GB
VignetteBuilder: R.rsp
NeedsCompilation: no
Packaged: 2023-12-07 16:16:42 UTC; kbenoit
Author: Kenneth Benoit
Maintainer: Kenneth Benoit <kbenoit@lse.ac.uk>
Repository: CRAN
Date/Publication: 2023-12-08 15:20:02 UTC
An R wrapper to the spaCy NLP system
Description
An R wrapper to the Python (Cython) spaCy NLP system, from https://spacy.io, nicely integrated with quanteda. spacyr is designed to provide easy access to the powerful functionality of spaCy in a simple format.
Author(s)
Ken Benoit and Akitaka Matsuo
References
https://spacy.io, https://spacyr.quanteda.io.
See Also
Useful links:
- https://spacyr.quanteda.io
- Report bugs at https://github.com/quanteda/spacyr/issues
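A minimal end-to-end sketch of the workflow (assumes spaCy and the en_core_web_sm model have already been installed, e.g. via spacy_install()):
## Not run:
library("spacyr")
spacy_initialize(model = "en_core_web_sm")
parsed <- spacy_parse("spacyr wraps spaCy for use from R.")
head(parsed)
# release the background Python process when finished
spacy_finalize()
## End(Not run)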
A short paragraph of text for testing
Description
A sample of text from the Irish budget debate of 2010 (531 tokens long).
Usage
data_char_paragraph
Format
An object of class character of length 1.
Sample short documents for testing
Description
A character object consisting of 30 short documents in plain text format for testing. Each document is one or two brief sentences.
Usage
data_char_sentences
Format
An object of class character of length 30.
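A brief sketch of how these sample objects can be used with the parsing functions (assumes spaCy has been initialized):
## Not run:
spacy_initialize()
spacy_parse(data_char_sentences[1:3])
spacy_tokenize(data_char_paragraph)
## End(Not run)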
Extract or consolidate entities from parsed documents
Description
From an object parsed by spacy_parse(), extract the entities as a separate object, or convert the multi-word entities into single "tokens" consisting of the concatenated elements of the multi-word entities.
Usage
entity_extract(x, type = c("named", "extended", "all"), concatenator = "_")
entity_consolidate(x, concatenator = "_")
Arguments
x: output from spacy_parse()
type: type of named entities, either "named", "extended", or "all"
concatenator: the character(s) used to join the elements of multi-word named entities
Value
entity_extract() returns a data.frame of all named entities, containing the following fields:
- doc_id: name of the document containing the entity
- sentence_id: the sentence ID containing the entity, within the document
- entity: the named entity
- entity_type: the type of named entity (e.g. PERSON, ORG, PERCENT, etc.)
entity_consolidate() returns a modified data.frame of parsed results, where the named entities have been combined into a single "token". Currently, dependency parsing is removed when this consolidation occurs.
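A short sketch of filtering on the documented entity_type field (assumes spaCy has been initialized; the example text is illustrative):
## Not run:
parsed <- spacy_parse("Steve Jobs founded Apple in California.", entity = TRUE)
ents <- entity_extract(parsed, type = "all")
# keep only person entities, using the entity_type field described above
subset(ents, entity_type == "PERSON")
## End(Not run)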
Examples
## Not run:
spacy_initialize()
# entity extraction
txt <- "Mr. Smith of moved to San Francisco in December."
parsed <- spacy_parse(txt, entity = TRUE)
entity_extract(parsed)
entity_extract(parsed, type = "all")
## End(Not run)
## Not run:
# consolidating multi-word entities
txt <- "The House of Representatives voted to suspend aid to South Dakota."
parsed <- spacy_parse(txt, entity = TRUE)
entity_consolidate(parsed)
## End(Not run)
Find spaCy
Description
Locate the user's version of Python for which spaCy is installed.
Usage
find_spacy(model = "en_core_web_sm", ask)
Arguments
model: name of the language model
ask: logical; if FALSE, use the first spaCy installation found; if TRUE, list the available spaCy installations and ask which one to use
Value
spacy_python
get functions for spaCy
Description
A collection of get methods for spacyr return objects (of spacy_out
class).
Usage
get_tokens(spacy_out)
get_tags(spacy_out, tagset = c("google", "detailed"))
get_attrs(spacy_out, attr_name, deal_utf8 = FALSE)
get_named_entities(spacy_out)
get_dependency(spacy_out)
get_noun_phrases(spacy_out)
get_ntokens(spacy_out)
get_ntokens_by_sent(spacy_out)
Arguments
spacy_out: a spacy_out object
tagset: character label for the tagset to use, either "google" or "detailed"
attr_name: name of the spaCy token attributes to extract
Value
get_tokens() returns a data.frame of tokens from spaCy.
get_tags() returns a tokenized text object with part-of-speech tags. Options exist for using either the Google or Detailed tagsets. See https://spacy.io.
get_attrs() returns a list of attributes from spaCy output.
get_named_entities() returns a list of named entities in texts.
get_dependency() returns a data.frame of dependency relations.
get_noun_phrases() returns a data.frame of noun phrases.
get_ntokens() returns a count of tokens in texts.
get_ntokens_by_sent() returns a count of tokens in texts, by sentence.
Examples
## Not run:
# get_tags examples
txt <- c(text1 = "This is the first sentence.\nHere is the second sentence.",
         text2 = "This is the second document.")
results <- spacy_parse(txt, tag = TRUE)
# detailed part-of-speech tags appear in the tag field of the results
results$tag
## End(Not run)
Extract or consolidate noun phrases from parsed documents
Description
From an object parsed by spacy_parse(), extract the multi-word noun phrases as a separate object, or convert the multi-word noun phrases into single "tokens" consisting of the concatenated elements of the multi-word noun phrases.
Usage
nounphrase_extract(x, concatenator = "_")
nounphrase_consolidate(x, concatenator = "_")
Arguments
x: output from spacy_parse()
concatenator: the character(s) used to join elements of multi-word noun phrases
Value
nounphrase_extract() returns a data.frame of all noun phrases, containing the following fields:
- doc_id: name of the document containing the noun phrase
- sentence_id: the sentence ID containing the noun phrase, within the document
- nounphrase: the noun phrase
- root: the root token of the noun phrase
nounphrase_consolidate() returns a modified data.frame of parsed results, where the noun phrases have been combined into a single "token". Currently, dependency parsing is removed when this consolidation occurs.
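A short sketch of working with the documented root field (assumes spaCy has been initialized; the example text is illustrative):
## Not run:
parsed <- spacy_parse("The quick brown fox jumped over the lazy dog.",
                      nounphrase = TRUE)
nps <- nounphrase_extract(parsed)
# the root field holds the head token of each noun phrase
nps$root
## End(Not run)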
Examples
## Not run:
spacy_initialize()
# noun phrase extraction
txt <- "Mr. Smith moved to San Francisco in December."
parsed <- spacy_parse(txt, nounphrase = TRUE)
nounphrase_extract(parsed)
## End(Not run)
## Not run:
# consolidating multi-word noun phrases
txt <- "The House of Representatives voted to suspend aid to South Dakota."
parsed <- spacy_parse(txt, nounphrase = TRUE)
nounphrase_consolidate(parsed)
## End(Not run)
Tokenize text using spaCy
Description
Tokenize text using spaCy. The result of tokenization is stored as a Python object. To obtain the tokenization results in R, use get_tokens(). See https://spacy.io.
Usage
process_document(x, multithread, ...)
Arguments
x: input text
multithread: logical; if TRUE, the processing is parallelized using spaCy's pipe functionality
...: arguments passed to specific methods
Value
result marker object
Examples
## Not run:
spacy_initialize()
# the result has to be "tag() is ready to run" before running the following
txt <- c(text1 = "This is the first sentence.\nHere is the second sentence.",
text2 = "This is the second document.")
results <- spacy_parse(txt)
## End(Not run)
Download spaCy language models
Description
Download spaCy language models
Usage
spacy_download_langmodel(lang_models = "en_core_web_sm", force = FALSE)
Arguments
lang_models: character; language models to be installed. Defaults to "en_core_web_sm".
force: logical; if TRUE, download the language models even if they are already present.
Value
Invisibly returns the installation log.
Examples
## Not run:
# install medium sized model
spacy_download_langmodel("en_core_web_md")
# install several models
spacy_download_langmodel(c("en_core_web_sm", "de_core_news_sm"))
# install transformer based model
spacy_download_langmodel("en_core_web_trf")
## End(Not run)
Install a language model in a conda or virtual environment
Description
Deprecated. spacyr
now always uses a virtual environment,
making this function redundant.
Usage
spacy_download_langmodel_virtualenv(...)
Arguments
...: not used
Extract named entities from texts using spaCy
Description
This function extracts named entities from texts, based on the entity tag (ent) attributes of document objects parsed by spaCy (see https://spacy.io/usage/linguistic-features#section-named-entities).
Usage
spacy_extract_entity(
x,
output = c("data.frame", "list"),
type = c("all", "named", "extended"),
multithread = TRUE,
...
)
Arguments
x: a character object or a TIF-compliant corpus data.frame (see https://github.com/ropenscilabs/tif)
output: type of returned object, either "data.frame" or "list"
type: type of named entities, either "all", "named", or "extended"
multithread: logical; if TRUE, the processing is parallelized using spaCy's pipe functionality
...: unused
Details
When the option output = "data.frame" is selected, the function returns a data.frame with the following fields:
- text: contents of the entity
- entity_type: type of entity (e.g. ORG for organizations)
- start_id: serial number ID of the starting token. This number corresponds with the number of the data.frame returned from spacy_tokenize(x) with default options.
- length: number of words (tokens) included in a named entity (e.g. for the entity "New York Stock Exchange", length = 4)
Value
either a list or data.frame of tokens
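A sketch of how start_id aligns entities with tokenized output, per the Details above (assumes spaCy has been initialized and default tokenization options):
## Not run:
txt <- c(doc1 = "The Supreme Court is located in Washington D.C.")
ents <- spacy_extract_entity(txt)
toks <- spacy_tokenize(txt, output = "data.frame")
# start_id indexes into the tokens of the same document
toks$token[ents$start_id]
## End(Not run)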
Examples
## Not run:
spacy_initialize()
txt <- c(doc1 = "The Supreme Court is located in Washington D.C.",
doc2 = "Paul earned a postgraduate degree from MIT.")
spacy_extract_entity(txt)
spacy_extract_entity(txt, output = "list")
## End(Not run)
Extract noun phrases from texts using spaCy
Description
This function extracts noun phrases from documents, based on the noun_chunks attributes of document objects parsed by spaCy (see https://spacy.io/usage/linguistic-features#noun-chunks).
Usage
spacy_extract_nounphrases(
x,
output = c("data.frame", "list"),
multithread = TRUE,
...
)
Arguments
x: a character object or a TIF-compliant corpus data.frame (see https://github.com/ropenscilabs/tif)
output: type of returned object, either "data.frame" or "list"
multithread: logical; if TRUE, the processing is parallelized using spaCy's pipe functionality
...: unused
Details
When the option output = "data.frame" is selected, the function returns a data.frame with the following fields:
- text: contents of the noun phrase
- root_text: contents of the root token
- start_id: serial number ID of the starting token. This number corresponds with the number of the data.frame returned from spacy_tokenize(x) with default options.
- root_id: serial number ID of the root token
- length: number of words (tokens) included in a noun phrase (e.g. for the noun phrase "individual car owners", length = 3)
Value
either a list or data.frame of tokens
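A sketch using the documented length field to find the longest noun phrase (assumes spaCy has been initialized):
## Not run:
txt <- c(doc1 = "Natural language processing is a branch of computer science.")
nps <- spacy_extract_nounphrases(txt)
nps[which.max(nps$length), ]
## End(Not run)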
Examples
## Not run:
spacy_initialize()
txt <- c(doc1 = "Natural language processing is a branch of computer science.",
doc2 = "Paul earned a postgraduate degree from MIT.")
spacy_extract_nounphrases(txt)
spacy_extract_nounphrases(txt, output = "list")
## End(Not run)
Finalize spaCy
Description
While running spaCy on Python through R, a Python process is always running in the background and the R session will take up a lot of memory (typically over 1.5GB). spacy_finalize() terminates the Python process and frees up the memory it was using.
Usage
spacy_finalize()
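For example, to release the Python process and reinitialize with a different language model (a sketch; assumes the German model has already been downloaded):
## Not run:
spacy_finalize()
spacy_initialize(model = "de_core_news_sm")
## End(Not run)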
Author(s)
Akitaka Matsuo
Initialize spaCy
Description
Initialize spaCy to call from R.
Usage
spacy_initialize(model = "en_core_web_sm", entity = TRUE, ...)
Arguments
model: Language package for loading spaCy, e.g. "en_core_web_sm" (the default)
entity: logical; if FALSE, named entity recognition is turned off in spaCy, which can speed up parsing
...: not used
Author(s)
Akitaka Matsuo, Johannes B. Gruber
Install spaCy in conda or virtualenv environment
Description
Install spaCy in a self-contained environment, including specified language models.
Usage
spacy_install(
version = "latest",
lang_models = "en_core_web_sm",
ask = interactive(),
force = FALSE,
...
)
Arguments
version: character; spaCy version to install (see Details)
lang_models: character; language models to be installed. Defaults to "en_core_web_sm".
ask: logical; ask whether to proceed during the installation. By default, questions are only asked in interactive sessions.
force: logical; if TRUE, install even if spaCy or the language models are already present.
...: not used
Details
The function checks whether a suitable installation of Python is present on the system and installs one via reticulate::install_python() otherwise. It then creates a virtual environment with the necessary packages in the default location chosen by reticulate::virtualenv_root().
If you want to install a different version of Python than the default, you should call reticulate::install_python() directly. If you want to create or use a different virtual environment, you can use, e.g., Sys.setenv(SPACY_PYTHON = "path/to/directory").
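A sketch of pointing spacyr at a custom virtual environment via SPACY_PYTHON, as described above (the path is hypothetical):
## Not run:
Sys.setenv(SPACY_PYTHON = "~/envs/my-spacyr-env")
spacy_install()
## End(Not run)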
Examples
## Not run:
# install the latest version of spaCy
spacy_install()
# update spaCy
spacy_install(force = TRUE)
# install an older version
spacy_install(version = "3.1.0")
# install with GPU enabled
spacy_install(version = "cuda-autodetect")
# install on Apple ARM processors
spacy_install(version = "apple")
# install an old custom version
spacy_install(version = "[cuda-autodetect]==3.2.0")
# install several models with spaCy
spacy_install(lang_models = c("en_core_web_sm", "de_core_news_sm"))
# install spaCy to an existing virtual environment
Sys.setenv(RETICULATE_PYTHON = "path/to/python")
spacy_install()
## End(Not run)
Install spaCy to a virtual environment
Description
Deprecated. spacy_install
now installs to a virtual environment by default.
Usage
spacy_install_virtualenv(...)
Arguments
...: not used
Parse a text using spaCy
Description
The spacy_parse() function calls spaCy to both tokenize and tag the texts, and returns a data.table of the results. The function provides options on the types of tagsets, either "google" or "detailed" (see the pos and tag arguments), as well as lemmatization (lemma). It also provides dependency parsing and named entity recognition as options. If "full_parse = TRUE" is provided, the function returns the most extensive list of the parsing results from spaCy.
Usage
spacy_parse(
x,
pos = TRUE,
tag = FALSE,
lemma = TRUE,
entity = TRUE,
dependency = FALSE,
nounphrase = FALSE,
multithread = TRUE,
additional_attributes = NULL,
...
)
Arguments
x: a character object, a quanteda corpus, or a TIF-compliant corpus data.frame (see https://github.com/ropenscilabs/tif)
pos: logical; whether to return universal dependency POS tags (see https://universaldependencies.org/u/pos/)
tag: logical; whether to return detailed part-of-speech tags for the language model
lemma: logical; include lemmatized tokens in the output (lemmatization may not work properly for non-English models)
entity: logical; if TRUE, include named entities in the output
dependency: logical; if TRUE, include dependency parsing results
nounphrase: logical; if TRUE, include noun phrases
multithread: logical; if TRUE, the processing is parallelized using spaCy's pipe functionality
additional_attributes: a character vector; this option is for extracting additional attributes of tokens from spaCy. When the names of attributes are supplied, the output data.frame will contain additional variables corresponding to the names of the attributes. For instance, with additional_attributes = c("is_punct"), the output will include a variable is_punct (see the examples below).
...: not used directly
Value
a data.frame of tokenized, parsed, and annotated tokens
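As a sketch of the quanteda integration noted in the package overview, parsed results can be converted to quanteda tokens (assumes quanteda is installed; spacyr provides an as.tokens() method for parsed objects):
## Not run:
parsed <- spacy_parse("And now for something completely different.")
quanteda::as.tokens(parsed)
## End(Not run)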
Examples
## Not run:
spacy_initialize()
# See Chap 5.1 of the NLTK book, http://www.nltk.org/book/ch05.html
txt <- "And now for something completely different."
spacy_parse(txt)
spacy_parse(txt, pos = TRUE, tag = TRUE)
spacy_parse(txt, dependency = TRUE)
txt2 <- c(doc1 = "The fast cat catches mice.\\nThe quick brown dog jumped.",
doc2 = "This is the second document.",
doc3 = "This is a \\\"quoted\\\" text." )
spacy_parse(txt2, entity = TRUE, dependency = TRUE)
txt3 <- "We analyzed the Supreme Court with three natural language processing tools."
spacy_parse(txt3, entity = TRUE, nounphrase = TRUE)
spacy_parse(txt3, additional_attributes = c("like_num", "is_punct"))
## End(Not run)
Tokenize text with spaCy
Description
Efficient tokenization (without POS tagging, dependency parsing, lemmatization, or named entity recognition) of texts using spaCy.
Usage
spacy_tokenize(
x,
what = c("word", "sentence"),
remove_punct = FALSE,
remove_url = FALSE,
remove_numbers = FALSE,
remove_separators = TRUE,
remove_symbols = FALSE,
padding = FALSE,
multithread = TRUE,
output = c("list", "data.frame"),
...
)
Arguments
x: a character object, a quanteda corpus, or a TIF-compliant corpus data.frame (see https://github.com/ropenscilabs/tif)
what: the unit for splitting the text; available alternatives are "word" and "sentence"
remove_punct: remove punctuation tokens
remove_url: remove tokens that look like a URL or email address
remove_numbers: remove tokens that look like a number (e.g. "334", "3.1415", "fifty")
remove_separators: remove spaces as separators when all other remove functionalities (e.g. remove_punct) are set to FALSE
remove_symbols: remove symbol tokens, such as currency symbols
padding: if TRUE, leave an empty string where the removed tokens previously existed; this is useful if a positional match is needed between the pre- and post-selected tokens
multithread: logical; if TRUE, the processing is parallelized using spaCy's pipe functionality
output: type of returned object, either "list" or "data.frame"
...: not used directly
Value
either a list or a data.frame of tokens
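A brief sketch combining several of the removal options above (assumes spaCy has been initialized):
## Not run:
txt <- "A well-known URL is https://spacy.io, isn't it?"
spacy_tokenize(txt, remove_punct = TRUE, remove_url = TRUE, output = "data.frame")
## End(Not run)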
Examples
## Not run:
spacy_initialize()
txt <- "And now for something completely different."
spacy_tokenize(txt)
txt2 <- c(doc1 = "The fast cat catches mice.\\nThe quick brown dog jumped.",
doc2 = "This is the second document.",
doc3 = "This is a \\\"quoted\\\" text." )
spacy_tokenize(txt2)
## End(Not run)
Uninstall the spaCy environment
Description
Removes the virtual environment created by spacy_install()
Usage
spacy_uninstall(confirm = interactive())
Arguments
confirm: logical; confirm before uninstalling spaCy?
Shorthand function to upgrade spaCy
Description
Upgrade spaCy (to a specific version).
Usage
spacy_upgrade(
version = "latest",
lang_models = NULL,
ask = interactive(),
force = TRUE,
...
)
Arguments
version: character; spaCy version to install (see Details)
lang_models: character; language models to be installed. Defaults to NULL (install no additional models).
ask: logical; ask whether to proceed during the installation. By default, questions are only asked in interactive sessions.
force: logical; if TRUE, install even if spaCy or the language models are already present.
...: passed on to spacy_install()
Details
The function checks whether a suitable installation of Python is present on the system and installs one via reticulate::install_python() otherwise. It then creates a virtual environment with the necessary packages in the default location chosen by reticulate::virtualenv_root().
If you want to install a different version of Python than the default, you should call reticulate::install_python() directly. If you want to create or use a different virtual environment, you can use, e.g., Sys.setenv(SPACY_PYTHON = "path/to/directory").
Examples
## Not run:
# install the latest version of spaCy
spacy_install()
# update spaCy
spacy_install(force = TRUE)
# install an older version
spacy_install(version = "3.1.0")
# install with GPU enabled
spacy_install(version = "cuda-autodetect")
# install on Apple ARM processors
spacy_install(version = "apple")
# install an old custom version
spacy_install(version = "[cuda-autodetect]==3.2.0")
# install several models with spaCy
spacy_install(lang_models = c("en_core_web_sm", "de_core_news_sm"))
# install spaCy to an existing virtual environment
Sys.setenv(RETICULATE_PYTHON = "path/to/python")
spacy_install()
## End(Not run)