Help for package arabicStemR

Type: Package
Title: Arabic Stemmer for Text Analysis
Version: 1.3
Date: 2022-07-14
Author: Rich Nielsen
Maintainer: Rich Nielsen <rnielsen@mit.edu>
Description: Allows users to stem Arabic texts for text analysis.
License: GPL-2 | GPL-3 [expanded from: GPL (≥ 2)]
NeedsCompilation:
Packaged: 2022-07-16 13:13:46 UTC; rich
Repository: CRAN
Date/Publication: 2022-07-18 08:20:09 UTC

A package for stemming Arabic for text analysis.

Description

This package is a stemmer for texts in Arabic (Modern Standard). The stemmer is loosely based on the light 10 stemmer, but with a number of modifications.

Details

Use the stemArabic function.

Author(s)

Maintainer: Rich Nielsen <rnielsen@mit.edu>

Examples

## generate some text in Arabic
x <- "\u628\u633\u645 \u0627\u0644\u0644\u0647
     \u0627\u0644\u0631\u062D\u0645\u0646 
     \u0627\u0644\u0631\u062D\u064A\u0645"

## stem and transliterate
stemArabic(x)

## stem while not stemming certain words
stemArabic(x, dontStemTheseWords = c("alr7mn"))

## stem and return the stemlist
out <- stemArabic(x,returnStemList=TRUE)
out$text
out$stemlist

Clean all characters that are not Latin or Arabic

Description

Removes any characters from a string that are in neither the Latin Unicode range nor the Arabic alphabet.

Usage

cleanChars(texts)

Arguments

texts

A string from which characters which are not Latin or Arabic should be removed.

Value

cleanChars returns a string with only Latin and Arabic characters.

Author(s)

Rich Nielsen

Examples


## Create string with Arabic, Latin, and Hebrew characters

x <- '\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627 Hello \u05d0'

## Remove characters from string that are not Arabic or Latin

cleanChars(x)
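
As a rough illustration of what filtering by character range can look like, the sketch below keeps only basic Latin letters, Arabic letters, and spaces by inspecting Unicode code points in base R. The helper name keep_latin_arabic and the exact ranges (A-Z, a-z, U+0621 to U+064A, and the space) are assumptions made for this example. They are not taken from the cleanChars source, which may keep a different set of characters.

```r
# Hypothetical helper for illustration; not part of arabicStemR.
keep_latin_arabic <- function(text) {
  cp <- utf8ToInt(text)                      # code points of a length-one string
  is_latin  <- (cp >= 0x41 & cp <= 0x5A) | (cp >= 0x61 & cp <= 0x7A)
  is_arabic <- cp >= 0x0621 & cp <= 0x064A   # assumed range: hamza through yah
  is_space  <- cp == 0x20
  intToUtf8(cp[is_latin | is_arabic | is_space])
}

x <- "\u0627\u0647\u0644\u0627 Hello \u05d0"
y <- keep_latin_arabic(x)   # the Hebrew alef (U+05D0) is dropped
```

Working on code points rather than regular expressions keeps the sketch independent of the session locale.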

Clean Latin characters

Description

Cleans Latin characters from a string

Usage

cleanLatinChars(texts)

Arguments

texts

A string from which Latin characters should be removed.

Value

cleanLatinChars returns a string with Latin characters removed.

Author(s)

Rich Nielsen

Examples

## Create string with Arabic and Latin characters

x <- '\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627 Hello'

## Remove Latin characters from string

cleanLatinChars(x)

Removes Arabic prefixes and suffixes

Description

Removes prefixes and suffixes, and can return a list matching the words to stemmed words. Does not stem different forms of Allah.

Usage

doStemming(texts, dontstem =  c('\u0627\u0644\u0644\u0647','\u0644\u0644\u0647'))

Arguments

texts

The original texts.

dontstem

Words that should not be stemmed. By default, different forms of Allah are not stemmed.

Value

doStemming returns a named list with the following elements:

text

The stemmed text

stemmedWords

A list matching the words and the stemmed words.

Author(s)

Rich Nielsen

Examples

## Create string with Arabic characters
x <- '\u0627\u0644\u0644\u063a\u0629 \u0627\u0644\u0639\u0631\u0628\u064a\u0629
 \u062c\u0645\u064a\u0644\u0629 \u062c\u062f\u0627'

## Remove prefixes and suffixes
y<-doStemming(x)
y$text
y$stemmedWords

Standardize different hamzas on alif seats

Description

Standardize different hamzas on alif seats in a string.

Usage

fixAlifs(texts)

Arguments

texts

A string in which different alifs should be standardized.

Value

fixAlifs returns a string with standardized alifs.

Author(s)

Rich Nielsen

Examples

## Create string with Arabic characters 
x <- '\u0622 \u0623 \u0675'

## Standardize Alifs
fixAlifs(x)
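
To make the idea of standardizing alifs concrete, here is a minimal base R sketch that maps a few hamza-bearing alif forms to the bare alif. The helper name normalize_alifs and the choice of variants (U+0622, U+0623, U+0625) are assumptions for illustration; the set of characters that fixAlifs actually rewrites may differ.

```r
# Hypothetical helper for illustration; not part of arabicStemR.
normalize_alifs <- function(text) {
  cp <- utf8ToInt(text)
  # Assumed variants: alif with madda (U+0622), alif with hamza above (U+0623),
  # and alif with hamza below (U+0625). Each is mapped to bare alif (U+0627).
  cp[cp %in% c(0x0622L, 0x0623L, 0x0625L)] <- 0x0627L
  intToUtf8(cp)
}

y <- normalize_alifs("\u0623\u0647\u0644\u0627 \u0625\u0644\u0649")
```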

Remove Arabic numbers

Description

Removes Arabic numerals from a string.

Usage

removeArabicNumbers(texts)

Arguments

texts

A string from which Arabic numerals should be removed.

Value

removeArabicNumbers returns a string with Arabic numerals removed.

Author(s)

Rich Nielsen

Examples

## Create string with Arabic characters and numbers 

x <- '\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627 \u0661\u0662\u0663'

## Remove Arabic numbers
removeArabicNumbers(x)

Remove Arabic diacritics

Description

Removes diacritics from Arabic unicode text.

Usage

removeDiacritics(texts)

Arguments

texts

A string from which Arabic diacritics should be removed.

Value

removeDiacritics returns a string with Arabic diacritics removed.

Author(s)

Rich Nielsen

Examples

## Create string with Arabic characters and diacritics 

x<- '\u0627\u0647\u0644\u0627\u064b \u0648\u0633\u0647\u0644\u0627\u064b'

## Remove diacritics
removeDiacritics(x)

Remove English numbers

Description

Removes English numerals from a string.

Usage

removeEnglishNumbers(texts)

Arguments

texts

A string from which English numerals should be removed.

Value

removeEnglishNumbers returns a string with English numerals removed.

Author(s)

Rich Nielsen

Examples


## Create string with Arabic characters and English numbers

x <- '\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627 123'

## Remove English numbers

removeEnglishNumbers(x)

Remove Farsi numbers

Description

Removes Farsi numerals from a string.

Usage

removeFarsiNumbers(texts)

Arguments

texts

A string from which Farsi numerals should be removed.

Value

removeFarsiNumbers returns a string with Farsi numerals removed.

Author(s)

Rich Nielsen

Examples

## Create string with Arabic characters and numbers 

x <- '\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627 \u06f1\u06f2\u06f3\u06f4\u06f5'

## Remove Farsi numbers
removeFarsiNumbers(x)

Remove new line characters

Description

Removes new line characters from a string.

Usage

removeNewlineChars(texts)

Arguments

texts

A string from which new line characters should be removed.

Value

removeNewlineChars returns a string with new line characters removed.

Author(s)

Rich Nielsen

Examples


## Create string with Arabic characters 

x <- '\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627
      \u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627'

## Remove newline characters (gets rid of \n\r\t\f\v)

removeNewlineChars(x)

Remove English, Arabic, and Farsi numerals.

Description

Removes English, Arabic, and Farsi numerals from a string.

Usage

removeNumbers(texts)

Arguments

texts

A string from which English, Arabic, and Farsi numerals should be removed.

Value

removeNumbers returns a string with English, Arabic, and Farsi numerals removed.

Author(s)

Rich Nielsen

Examples


## Create string with Arabic characters and numbers

x <- '\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627 123 \u0661\u0662\u0663'

## Remove Numbers

removeNumbers(x)
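
The three numeral families handled by the number-removal functions occupy different Unicode ranges, which is why they are treated separately. The sketch below labels each character of a string by family. The helper name digit_family is hypothetical, and the ranges are the standard Unicode digit blocks (ASCII digits, Arabic-Indic digits, Extended Arabic-Indic digits); whether the package functions match exactly these ranges is an assumption.

```r
# Hypothetical helper for illustration; not part of arabicStemR.
digit_family <- function(text) {
  cp <- utf8ToInt(text)
  family <- rep("other", length(cp))
  family[cp >= 0x0030 & cp <= 0x0039] <- "english"  # ASCII digits 0-9
  family[cp >= 0x0660 & cp <= 0x0669] <- "arabic"   # Arabic-Indic digits
  family[cp >= 0x06F0 & cp <= 0x06F9] <- "farsi"    # Extended Arabic-Indic digits
  family
}

# One digit from each family, then a space and a Latin letter
fam <- digit_family("1\u0661\u06f1 x")
```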

Remove Arabic prefixes

Description

Removes some Arabic prefixes from a unicode string. The prefixes are: "waw", "alif-lam", "waw-alif-lam", "ba-alif-lam", "kaf-alif-lam", "fa-alif-lam", and "lam-lam." Prefixes are removed from a word (as defined by spaces) only if the remaining stem would not be too short.

Usage

removePrefixes(texts, x1 = 4, x2 = 4, x3 = 5, x4 = 5, x5 = 5, x6 = 5, x7 = 4, 
dontstem = c('\u0627\u0644\u0644\u0647','\u0644\u0644\u0647'))

Arguments

texts

An Arabic-language string in unicode

x1

The number of letters that must be in a word for the function to remove the prefix "waw".

x2

The number of letters that must be in a word for the function to remove the prefix "alif-lam".

x3

The number of letters that must be in a word for the function to remove the prefix "waw-alif-lam".

x4

The number of letters that must be in a word for the function to remove the prefix "ba-alif-lam".

x5

The number of letters that must be in a word for the function to remove the prefix "kaf-alif-lam".

x6

The number of letters that must be in a word for the function to remove the prefix "fa-alif-lam".

x7

The number of letters that must be in a word for the function to remove the prefix "lam-lam".

dontstem

Words that should not be stemmed (entered in unicode).

Value

Returns a string with Arabic prefixes removed.

Author(s)

Rich Nielsen

Examples

## Create string with Arabic characters

x <- '\u0627\u0644\u0644\u063a\u0629 \u0627\u0644\u0639\u0631\u0628\u064a\u0629
 \u062c\u0645\u064a\u0644\u0629 \u062c\u062f\u0627'

# Remove Prefixes

removePrefixes(x)
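
The interaction between a prefix, its length threshold, and the dontstem list can be sketched in a few lines of base R. This is a toy version for a single prefix, not the package's implementation: the helper name strip_prefix is hypothetical, and it assumes that a threshold such as x2 means "strip only when the word has at least this many letters".

```r
# Hypothetical helper for illustration; not part of arabicStemR.
strip_prefix <- function(words, prefix, min_letters, dontstem = character(0)) {
  has_prefix  <- startsWith(words, prefix)
  long_enough <- nchar(words, type = "chars") >= min_letters  # assumed rule
  protected   <- words %in% dontstem
  strip <- has_prefix & long_enough & !protected
  words[strip] <- substring(words[strip], nchar(prefix, type = "chars") + 1L)
  words
}

alif_lam <- "\u0627\u0644"
words <- c("\u0627\u0644\u0639\u0631\u0628\u064a\u0629",  # 7 letters: prefix removed
           "\u0627\u0644\u0644\u0647",                    # listed in dontstem: kept
           "\u0627\u0644\u0645")                          # 3 letters: too short, kept
out <- strip_prefix(words, alif_lam, min_letters = 4,
                    dontstem = "\u0627\u0644\u0644\u0647")
```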

Remove punctuation.

Description

Removes punctuation from a string, including some specialized Arabic characters.

Usage

removePunctuation(texts)

Arguments

texts

A string from which punctuation should be removed.

Value

Returns a string with punctuation removed.

Author(s)

Rich Nielsen

Examples

## Create string with Arabic characters and punctuation

x <- '\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627!!!?'

## Remove punctuation

removePunctuation(x)

Remove Arabic stopwords.

Description

Defines a list of Arabic-language stopwords and removes them from a string.

Usage

removeStopWords(texts, defaultStopwordList=TRUE, customStopwordList=NULL)

Arguments

texts

A string from which Arabic stopwords should be removed.

defaultStopwordList

If TRUE, use the default stopword list of words to be removed. If FALSE, do not use the default stopword list. Default is TRUE.

customStopwordList

Optional user-specified stopword list of words to be removed, supplied as a vector of strings in either Arabic UTF-8 or Latin characters following the stemmer's transliteration scheme (words without Arabic UTF-8 characters are processed with reverse.transliterate()). Default is NULL.

Value

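Returns a list. As the examples show, the element text contains the string with Arabic stopwords removed, and the element arabicStopwordList contains the list of stopwords.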

Author(s)

Rich Nielsen

Examples

## Create string with Arabic characters

x <- '\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627
 \u064a\u0627  \u0635\u062f\u064a\u0642\u064a'

## Remove stop words
removeStopWords(x)$text

## Not run
## To see the full list of stop words 
removeStopWords(x)$arabicStopwordList
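
Conceptually, stopword removal amounts to splitting a text into words and dropping those found in a list. The base R sketch below shows that idea with a one-word custom list. The helper name drop_stopwords is hypothetical, and splitting on single spaces is an assumption; it does not reproduce the default stopword list or the transliteration handling of removeStopWords.

```r
# Hypothetical helper for illustration; not part of arabicStemR.
drop_stopwords <- function(text, stopwords) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]   # words as defined by spaces
  paste(words[!(words %in% stopwords)], collapse = " ")
}

x <- "\u0627\u0647\u0644\u0627 \u064a\u0627 \u0635\u062f\u064a\u0642\u064a"
y <- drop_stopwords(x, stopwords = "\u064a\u0627")  # drop the particle "ya"
```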

Remove Arabic suffixes

Description

Removes some Arabic suffixes from a unicode string. The suffixes (in order of removal) are: "ha-alif", "alif-nun", "alif-ta", "waw-nun", "yah-nun", "yah-heh", "yah-ta marbutta", "heh", "ta marbutta", and "yah." Suffixes are removed from a word (as defined by spaces) only if the remaining stem would not be too short. Only one suffix is removed from each word.

Usage

removeSuffixes(texts, x1 = 4, x2 = 4, x3 = 4, x4 = 4, 
x5 = 4, x6 = 4, x7 = 4, x8 = 3, x9 = 3, x10 = 3, 
dontstem = c('\u0627\u0644\u0644\u0647','\u0644\u0644\u0647'))

Arguments

texts

An Arabic-language string in unicode.

x1

The number of letters that must be in a word for the function to remove the suffix "ha-alif".

x2

The number of letters that must be in a word for the function to remove the suffix "alif-nun".

x3

The number of letters that must be in a word for the function to remove the suffix "alif-ta".

x4

The number of letters that must be in a word for the function to remove the suffix "waw-nun".

x5

The number of letters that must be in a word for the function to remove the suffix "yah-nun".

x6

The number of letters that must be in a word for the function to remove the suffix "yah-heh".

x7

The number of letters that must be in a word for the function to remove the suffix "yah-ta marbutta".

x8

The number of letters that must be in a word for the function to remove the suffix "heh".

x9

The number of letters that must be in a word for the function to remove the suffix "ta marbutta".

x10

The number of letters that must be in a word for the function to remove the suffix "yah".

dontstem

Words that should not be stemmed (entered in unicode).

Value

Returns a string with Arabic suffixes removed.

Author(s)

Rich Nielsen

Examples

## Create string with Arabic characters

x <- '\u0627\u0644\u0644\u063a\u0629 \u0627\u0644\u0639\u0631\u0628\u064a\u0629
 \u062c\u0645\u064a\u0644\u0629 \u062c\u062f\u0627'

# Remove Suffixes

removeSuffixes(x)
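
The rule that suffixes are tried in a fixed order and that only one suffix is removed per word can be sketched as a loop that returns at the first match. This is a toy version with three of the suffixes listed above, not the package's implementation: the helper name strip_one_suffix is hypothetical, and it assumes that a threshold means "strip only when the word has at least this many letters".

```r
# Hypothetical helper for illustration; not part of arabicStemR.
strip_one_suffix <- function(word, suffixes, min_letters) {
  for (i in seq_along(suffixes)) {
    n_word <- nchar(word, type = "chars")
    if (endsWith(word, suffixes[i]) && n_word >= min_letters[i]) {
      # Stop at the first suffix that applies, so only one suffix is removed
      return(substr(word, 1L, n_word - nchar(suffixes[i], type = "chars")))
    }
  }
  word
}

suffixes    <- c("\u064a\u0629", "\u0629", "\u064a")  # yah-ta marbutta, ta marbutta, yah
min_letters <- c(4, 3, 3)
a <- strip_one_suffix("\u0639\u0631\u0628\u064a\u0629", suffixes, min_letters)
b <- strip_one_suffix("\u062c\u062f\u0627", suffixes, min_letters)  # no suffix matches
```

In the first call the two-letter suffix is tried before the one-letter suffixes, so the word loses "yah-ta marbutta" in one step and the loop never reaches "ta marbutta".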

Transliterate Latin characters into Arabic Unicode characters

Description

Transliterates Latin characters into Arabic Unicode characters using a transliteration system developed by Rich Nielsen.

Usage

reverse.transliterate(texts)

Arguments

texts

A string in Latin characters to be transliterated into Arabic characters.

Value

Returns a string in Arabic characters.

Author(s)

Rich Nielsen

Examples

## Create Latin string following the arabicStemR package transliteration scheme.

x <- 'al3rby'

## Convert Latin characters into Arabic Unicode characters

reverse.transliterate(x)

Arabic Stemmer for Text Analysis

Description

Allows users to stem Arabic texts for text analysis. Now deprecated. Please use stemArabic.

Usage

stem(dat, cleanChars = TRUE, cleanLatinChars = TRUE, 
    transliteration = TRUE, returnStemList = FALSE,
	defaultStopwordList=TRUE, customStopwordList=NULL,
	dontStemTheseWords = c("allh", "llh"))

Arguments

dat

The original data, as a vector of length one containing the text.

cleanChars

If TRUE, removes all Unicode characters except Latin characters and the Arabic alphabet.

cleanLatinChars

If TRUE, removes Latin characters.

transliteration

If TRUE, transliterates the text.

returnStemList

If TRUE, returns the stemlist along with the stemmed text (see Value).

defaultStopwordList

If TRUE, use the default stopword list of words to be removed. If FALSE, do not use the default stopword list. Default is TRUE.

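customStopwordList

Optional user-specified stopword list of words to be removed, supplied as a vector of strings in either Arabic UTF-8 or Latin characters following the stemmer's transliteration scheme (see removeStopWords). Default is NULL.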

dontStemTheseWords

Optional vector of strings that should not be stemmed. These words can be supplied as transliterated Arabic (according to the transliteration scheme of transliterate() and reverse.transliterate()) or in unicode Arabic. If a term matches an element of this argument at any intermediate point in stemming, that term will not be stemmed further. The default is c("allh","llh") because in most applications, stemming these common words for "God" creates some confusion by resulting in the string "lh".
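
The "any intermediate point" behaviour described above can be illustrated with a toy pipeline over transliterated text. The sketch below is not the package's stemming code: the helper stem_with_protection and the two stages (strip a leading "w", then a leading "al") are hypothetical stand-ins chosen only to show why protecting "allh" prevents the result "lh".

```r
# Hypothetical helpers for illustration; not part of arabicStemR.
stem_with_protection <- function(word, stages, protected) {
  for (stage in stages) {
    if (word %in% protected) break   # protected terms are not stemmed further
    word <- stage(word)
  }
  word
}

strip_w  <- function(w) sub("^w", "", w)    # toy stage: drop a leading "w"
strip_al <- function(w) sub("^al", "", w)   # toy stage: drop a leading "al"
stages <- list(strip_w, strip_al)

kept     <- stem_with_protection("wallh", stages, protected = c("allh", "llh"))
overstem <- stem_with_protection("wallh", stages, protected = character(0))
```

With the default protected words, the intermediate form "allh" is recognized after the first stage and left alone. With no protected words, the second stage reduces it to "lh".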

Details

stem prepares texts in Arabic for text analysis by stemming.

Value

stem returns a named list with the following elements:

text

The stemmed text

stemlist

A list of the stemmed words.

Author(s)

Rich Nielsen

Examples

## generate some text in Arabic
x <- "\u628\u633\u645 \u0627\u0644\u0644\u0647
     \u0627\u0644\u0631\u062D\u0645\u0646 
     \u0627\u0644\u0631\u062D\u064A\u0645"

## stem and transliterate
## NOTE: the "stem()" function only accepts a vector of length 1. 
## The function is deprecated in favor of stemArabic() which accepts vectors with multiple elements.
stem(x)

## stem while not stemming certain words
stem(x, dontStemTheseWords = c("alr7mn"))

## stem and return the stemlist
out <- stem(x,returnStemList=TRUE)
out$text
out$stemlist

Arabic Stemmer for Text Analysis

Description

Allows users to stem Arabic texts for text analysis.

Usage

stemArabic(dat, cleanChars = TRUE, cleanLatinChars = TRUE, 
    transliteration = TRUE, returnStemList = FALSE,
	defaultStopwordList=TRUE, customStopwordList=NULL,
	dontStemTheseWords = c("allh", "llh"))

Arguments

dat

The original data, as a vector of texts.

cleanChars

If TRUE, removes all Unicode characters except Latin characters and the Arabic alphabet.

cleanLatinChars

If TRUE, removes Latin characters.

transliteration

If TRUE, transliterates the text.

returnStemList

If TRUE, returns the stemlist along with the stemmed text (see Value).

defaultStopwordList

If TRUE, use the default stopword list of words to be removed. If FALSE, do not use the default stopword list. Default is TRUE.

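customStopwordList

Optional user-specified stopword list of words to be removed, supplied as a vector of strings in either Arabic UTF-8 or Latin characters following the stemmer's transliteration scheme (see removeStopWords). Default is NULL.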

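dontStemTheseWords

Optional vector of strings that should not be stemmed. These words can be supplied as transliterated Arabic (according to the transliteration scheme of transliterate() and reverse.transliterate()) or in Unicode Arabic. If a term matches an element of this argument at any intermediate point in stemming, that term will not be stemmed further. The default is c("allh","llh") because in most applications, stemming these common words for "God" creates some confusion by resulting in the string "lh".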

Details

stemArabic prepares texts in Arabic for text analysis by stemming.

Value

stemArabic returns a named list with the following elements:

text

The stemmed text

stemlist

A list of the stemmed words.

Author(s)

Rich Nielsen

Examples

## generate some text in Arabic
x <- "\u628\u633\u645 \u0627\u0644\u0644\u0647
     \u0627\u0644\u0631\u062D\u0645\u0646 
     \u0627\u0644\u0631\u062D\u064A\u0645"

## inspect
print(x)

## stem and transliterate
stemArabic(x)

## stem while not stemming certain words
stemArabic(x, dontStemTheseWords = c("alr7mn"))

## stem and return the stemlist
out <- stemArabic(x,returnStemList=TRUE)
out$text
out$stemlist

Transliterate Arabic Unicode characters into Latin characters

Description

Transliterates Arabic Unicode characters into Latin characters using a transliteration system developed by Rich Nielsen.

Usage

transliterate(texts)

Arguments

texts

A string in Arabic characters to be transliterated into Latin characters.

Value

Returns a string in Latin characters.

Author(s)

Rich Nielsen

Examples

## Create Arabic string 

x <- '\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627'

## Transliterate Arabic into Latin characters
transliterate(x)
