Version: 0.91
Type: Package
Title: Import and Handling for Plain and Formatted Text Files
Description: Functions for importing and handling text files and formatted text files with additional meta-data, including '.csv', '.tab', '.json', '.xml', '.html', '.pdf', '.doc', '.docx', '.rtf', '.xls', '.xlsx', and others.
License: GPL-3
Depends: R (≥ 3.6)
Imports: antiword, data.table, digest, httr, jsonlite (≥ 0.9.10), pillar, pdftools, readODS (≥ 1.7.0), readxl, streamR, stringi, striprtf, xml2, utils
Suggests: knitr, pkgload, rmarkdown, quanteda (≥ 3.0), testthat, covr
URL: https://github.com/quanteda/readtext
Encoding: UTF-8
BugReports: https://github.com/quanteda/readtext/issues
LazyData: TRUE
VignetteBuilder: knitr
RoxygenNote: 7.3.1
NeedsCompilation: no
Packaged: 2024-02-23 05:01:29 UTC; kbenoit
Author: Kenneth Benoit [aut, cre, cph], Adam Obeng [aut], Kohei Watanabe [ctb], Akitaka Matsuo [ctb], Paul Nulty [ctb], Stefan Müller [ctb]
Maintainer: Kenneth Benoit <kbenoit@lse.ac.uk>
Repository: CRAN
Date/Publication: 2024-02-23 05:40:02 UTC
Import and handling for plain and formatted text files
Description
A set of functions for importing and handling text files and formatted text files with additional meta-data, including .csv, .tab, .json, .xml, .xls, .xlsx, and others.
Details
readtext makes it easy to import text files in various formats, including the use of operating-system filemasks (glob pattern matches) to load groups of files, including files spread across multiple directories or sub-directories. readtext can also read multiple files into R from compressed archive files such as .gz, .zip, .tar.gz, etc. Finally, readtext reads in the document-level meta-data associated with texts, if those texts are in a format (e.g. .csv, .json) that includes additional, non-textual data.
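The glob and archive behaviour described above can be sketched as follows; this is a minimal illustration using the package's bundled extdata files, with the exact paths taken from the examples elsewhere in this manual:

```r
library(readtext)

# locate the sample files shipped with the package
DATA_DIR <- system.file("extdata/", package = "readtext")

# glob pattern: read every UDHR text file in one call
rt_glob <- readtext(paste0(DATA_DIR, "/txt/UDHR/*"))

# compressed archive: readtext reads the files inside a .zip directly
rt_zip <- readtext(paste0(DATA_DIR, "/data_files_encodedtexts.zip"))
```

Each call returns a readtext data.frame with one row per document.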
Package options
readtext_verbosity: Default verbosity for messages produced when reading files. See readtext().
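For example, a short sketch of getting and setting this option (the readtext_options() interface itself is documented later in this manual):

```r
library(readtext)

readtext_options(verbosity = 2)   # set the option
readtext_options("verbosity")     # query it by quoted name
```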
Author(s)
Ken Benoit, Adam Obeng, and Paul Nulty
See Also
Useful links: https://github.com/quanteda/readtext (development site); https://github.com/quanteda/readtext/issues (bug reports)
Set the docid for multi-document objects
Description
Set the docid for multi-document objects
Usage
add_docid(x, path, docid_field)
Arguments
x: data.frame; contains texts and document variables
path: character; file path from which …
docid_field: numeric or character; indicates the position of the document-id column in x
Return only the texts from a readtext object
Description
An accessor function to return the texts from a readtext object as a character vector, with names matching the document names.
Usage
## S3 method for class 'readtext'
as.character(x, ...)
Arguments
x: the readtext object whose texts will be extracted
...: further arguments passed to or from other methods
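A minimal usage sketch; the file paths assume the package's bundled UDHR sample texts, as used in the readtext() examples later in this manual:

```r
library(readtext)

DATA_DIR <- system.file("extdata/", package = "readtext")
rt <- readtext(paste0(DATA_DIR, "/txt/UDHR/*"))

txts <- as.character(rt)  # named character vector of texts
head(names(txts))         # names match the document names (doc_id)
```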
Return basenames that are unique
Description
Return basenames that are unique
Usage
basename_unique(x, path_only = FALSE)
Arguments
x: character vector; file paths
path_only: logical; if TRUE, return only the unique part of the directory paths, without the base filenames (see the examples)
Examples
files <- c("../data/glob/subdir1/test.txt", "../data/glob/subdir2/test.txt")
readtext:::basename_unique(files)
# [1] "subdir1/test.txt" "subdir2/test.txt"
readtext:::basename_unique(files, path_only = TRUE)
# [1] "subdir1" "subdir2"
readtext:::basename_unique(c("../data/test1.txt", "../data/test2.txt"))
# [1] "test1.txt" "test2.txt"
Internal function to cache remote file
Description
Internal function to cache remote file
Usage
cache_remote(url, ignore_missing, cache, basename = NULL, verbosity = 1)
Arguments
url: location of a remote file
ignore_missing: if …
cache: …
basename: name of temporary file to preserve file extensions. If …
verbosity: …
Encoded texts for testing
Description
data_char_encodedtexts is a 10-element character vector with 10 different encodings.
Usage
data_char_encodedtexts
Format
An object of class character of length 10.
Examples
## Not run:
Encoding(data_char_encodedtexts)
data.frame(labelled = names(data_char_encodedtexts),
detected = encoding(data_char_encodedtexts)$all)
## End(Not run)
A .zip file of texts containing a variety of differently encoded texts
Description
A set of translations of the Universal Declaration of Human Rights, plus one or two other miscellaneous texts, for testing the text input functions that need to translate different input encodings.
Source
The Universal Declaration of Human Rights resources, https://www.un.org/en/about-us/universal-declaration-of-human-rights
Examples
## Not run: # unzip the files to a temporary directory
FILEDIR <- tempdir()
unzip(system.file("extdata", "data_files_encodedtexts.zip", package = "readtext"),
exdir = FILEDIR)
# get encoding from filename
filenames <- list.files(FILEDIR, "\\.txt$")
# strip the extension
filenames <- gsub("\\.txt$", "", filenames)
parts <- strsplit(filenames, "_")
fileencodings <- sapply(parts, "[", 3)
fileencodings
# find out which conversions are unavailable (through iconv())
cat("Encoding conversions not available for this platform:")
notAvailableIndex <- which(!(fileencodings %in% iconvlist()))
fileencodings[notAvailableIndex]
# try readtext
require(quanteda)
txts <- readtext(paste0(FILEDIR, "/", "*.txt"))
substring(texts(txts)[1], 1, 80) # gibberish
substring(texts(txts)[4], 1, 80) # hex
substring(texts(txts)[40], 1, 80) # hex
# read them in again
txts <- readtext(paste0(FILEDIR, "/", "*.txt"), encoding = fileencodings)
substring(texts(txts)[1], 1, 80) # English
substring(texts(txts)[4], 1, 80) # Arabic, looking good
substring(texts(txts)[40], 1, 80) # Cyrillic, looking good
substring(texts(txts)[7], 1, 80) # Chinese, looking good
substring(texts(txts)[26], 1, 80) # Hindi, looking good
txts <- readtext(paste0(FILEDIR, "/", "*.txt"), encoding = fileencodings,
docvarsfrom = "filenames",
docvarnames = c("document", "language", "inputEncoding"))
encodingCorpus <- corpus(txts, source = "Created by encoding-tests.R")
summary(encodingCorpus)
## End(Not run)
Detect the encoding of texts
Description
Detect the encoding of texts in a character readtext object and report on the most likely encoding for each document. Useful in detecting the encoding of input texts, so that a source encoding can be (re)specified when inputting a set of texts using readtext(), prior to constructing a corpus.
Usage
encoding(x, verbose = TRUE, ...)
Arguments
x: character vector, corpus, or readtext object whose texts' encodings will be detected
verbose: if …
...: additional arguments passed to stri_enc_detect
Details
Based on stri_enc_detect, which is in turn based on the ICU libraries. See the ICU User Guide, https://unicode-org.github.io/icu/userguide/.
Examples
## Not run: encoding(data_char_encodedtexts)
# show detected value for each text, versus known encoding
data.frame(labelled = names(data_char_encodedtexts),
detected = encoding(data_char_encodedtexts)$all)
# Russian text, Windows-1251
myreadtext <- readtext("https://kenbenoit.net/files/01_er_5.txt")
encoding(myreadtext)
## End(Not run)
Extract texts and meta-data from Nexis HTML files
Description
This function extracts headings, body texts, and meta-data (date, byline, length, section, edition) from items in HTML files downloaded by the scraper.
Usage
get_nexis_html(path, paragraph_separator = "\n\n", verbosity, ...)
Arguments
path: either a path to an HTML file, or a directory that contains HTML files
paragraph_separator: a character string used to separate paragraphs in the body texts
verbosity: …
...: only to trap extra arguments
Examples
## Not run:
irt <- readtext:::get_nexis_html('tests/data/nexis/irish-times_1995-06-12_0001.html')
afp <- readtext:::get_nexis_html('tests/data/nexis/afp_2013-03-12_0501.html')
gur <- readtext:::get_nexis_html('tests/data/nexis/guardian_1986-01-01_0001.html')
sun <- readtext:::get_nexis_html('tests/data/nexis/sun_2000-11-01_0001.html')
spg <- readtext:::get_nexis_html('tests/data/nexis/spiegel_2012-02-01_0001.html',
language_date = 'german')
all <- readtext('tests/data/nexis', source = 'nexis')
## End(Not run)
Get path to temporary file or directory
Description
Get path to temporary file or directory
Usage
get_temp(prefix = "readtext-", temp_dir = NULL, directory = FALSE, seed = NULL)
Arguments
prefix: a string prepended to random file or directory names
temp_dir: a path to a temporary directory. If …
directory: logical; if …
seed: a seed value for …
Detect and set variable types automatically
Description
Detect and set variable types in a similar way as read.csv() does. Should be used when the imported data.frame consists entirely of character columns.
Usage
impute_types(x)
Arguments
x: data.frame; all columns are character vectors
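The behaviour can be sketched in base R with utils::type.convert(), which performs a similar read.csv()-style type detection; this is a rough analogue for illustration, not the package's actual implementation:

```r
# an all-character data.frame, as might come from a raw text import
df <- data.frame(num = c("1", "2"),
                 chr = c("x", "y"),
                 lgl = c("TRUE", "FALSE"),
                 stringsAsFactors = FALSE)

# convert each column to its "natural" type where possible
df[] <- lapply(df, function(col) utils::type.convert(col, as.is = TRUE))

sapply(df, class)
# num: integer, chr: character, lgl: logical
```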
Print method for a readtext object
Description
Print a readtext object in a nicely formatted way.
Usage
## S3 method for class 'readtext'
print(x, n = 6L, text_width = 10L, ...)
Arguments
x: the readtext object to be printed
n: a single integer, the number of rows of the readtext object to print
text_width: number of characters of the text field to display
...: not used here
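For example, a short sketch of printing a truncated view of an imported object (paths assume the package's bundled UDHR sample texts):

```r
library(readtext)

DATA_DIR <- system.file("extdata/", package = "readtext")
rt <- readtext(paste0(DATA_DIR, "/txt/UDHR/*"))

print(rt, n = 2, text_width = 30)  # show 2 rows, 30 characters of each text
```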
Read a text file or files
Description
Read texts and (if any) associated document-level meta-data from one or more source files. The text source files come from the textual component of the files, and the document-level metadata ("docvars") come from either the file contents or filenames.
Usage
readtext(
file,
ignore_missing_files = FALSE,
text_field = NULL,
docid_field = NULL,
docvarsfrom = c("metadata", "filenames", "filepaths"),
dvsep = "_",
docvarnames = NULL,
encoding = NULL,
source = NULL,
cache = TRUE,
verbosity = readtext_options("verbosity"),
...
)
Arguments
file: the complete filename(s) to be read. This is designed to automagically handle a number of common scenarios, so the value can be a "glob"-type wildcard value. Currently available filetypes are: …
ignore_missing_files: if …
text_field, docid_field: a variable (column) name or column number indicating where to find the texts that form the documents for the corpus, and their identifiers. This must be specified for file types …
docvarsfrom: used to specify that docvars should be taken from the filenames, when the …
dvsep: separator (a regular expression character string) used in filenames to delimit docvar elements if …
docvarnames: character vector of variable names for …
encoding: vector: either the encoding of all files, or one encoding for each file
source: used to specify specific formats of some input file types, such as JSON or HTML. Currently supported types are …
cache: if …
verbosity: …
...: additional arguments passed through to the low-level file reading function, such as …
Value
A data.frame consisting of columns doc_id and text, which contain a document identifier and the texts respectively, with any additional columns consisting of document-level variables either found in the file containing the texts, or created through the readtext call.
Examples
## Not run:
## get the data directory
if (!interactive()) pkgload::load_all()
DATA_DIR <- system.file("extdata/", package = "readtext")
## read in some text data
# all UDHR files
(rt1 <- readtext(paste0(DATA_DIR, "/txt/UDHR/*")))
# manifestos with docvars from filenames
(rt2 <- readtext(paste0(DATA_DIR, "/txt/EU_manifestos/*.txt"),
docvarsfrom = "filenames",
docvarnames = c("unit", "context", "year", "language", "party"),
encoding = "LATIN1"))
# recurse through subdirectories
(rt3 <- readtext(paste0(DATA_DIR, "/txt/movie_reviews/*"),
docvarsfrom = "filepaths", docvarnames = "sentiment"))
## read in csv data
(rt4 <- readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv")))
## read in tab-separated data
(rt5 <- readtext(paste0(DATA_DIR, "/tsv/dailsample.tsv"), text_field = "speech"))
## read in JSON data
(rt6 <- readtext(paste0(DATA_DIR, "/json/inaugural_sample.json"), text_field = "texts"))
## read in pdf data
# UNHDR
(rt7 <- readtext(paste0(DATA_DIR, "/pdf/UDHR/*.pdf"),
docvarsfrom = "filenames",
docvarnames = c("document", "language")))
Encoding(rt7$text)
## read in Word data (.doc)
(rt8 <- readtext(paste0(DATA_DIR, "/word/*.doc")))
Encoding(rt8$text)
## read in Word data (.docx)
(rt9 <- readtext(paste0(DATA_DIR, "/word/*.docx")))
Encoding(rt9$text)
## use elements of path and filename as docvars
(rt10 <- readtext(paste0(DATA_DIR, "/pdf/UDHR/*.pdf"),
docvarsfrom = "filepaths", dvsep = "[/_.]"))
## End(Not run)
Get or set package options for readtext
Description
Get or set global options affecting functions across readtext.
Usage
readtext_options(..., reset = FALSE, initialize = FALSE)
Arguments
...: options to be set, as key = value pairs, same as …
reset: logical; if …
initialize: logical; if …
Details
Currently available options are:
verbosity: Default verbosity for messages produced when reading files. See readtext().
Value
When called using a key = value pair (where key can be a label or quoted character name), the option is set and TRUE is returned invisibly.
When called with no arguments, a named list of the package options is returned.
When called with reset = TRUE as an argument, all options are reset to their default values, and TRUE is returned invisibly.
Examples
## Not run:
# save the current options
(opt <- readtext_options())
# set higher verbosity
readtext_options(verbosity = 3)
# read something in here
if (!interactive()) pkgload::load_all()
DATA_DIR <- system.file("extdata/", package = "readtext")
readtext(paste0(DATA_DIR, "/txt/UDHR/*"))
# reset to saved options
readtext_options(opt)
## End(Not run)
Move text to the first column and set types to document variables
Description
Move text to the first column and set types to document variables
Usage
sort_fields(x, path, text_field, impute_types = TRUE)
Arguments
x: data.frame; contains texts and document variables
path: character; file path from which …
text_field: numeric or character; indicates the position of the text column in x
impute_types: logical; if …
Get corpus texts [deprecated]
Description
Get the texts from a readtext object.
Usage
texts(x, ...)
## S3 method for class 'readtext'
texts(x, ...)
Arguments
x: a readtext object
...: not used
Details
This function is deprecated.
Use as.character.readtext()
to turn a readtext object into a simple named
character vector of documents.
Value
a character vector of the texts in the corpus
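A migration sketch from the deprecated accessor to its replacement (paths assume the package's bundled UDHR sample texts, as in the readtext() examples):

```r
library(readtext)

DATA_DIR <- system.file("extdata/", package = "readtext")
rt <- readtext(paste0(DATA_DIR, "/txt/UDHR/*"))

# deprecated:
# txts <- texts(rt)

# preferred:
txts <- as.character(rt)
```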