Handling Emojis and Unicode in YouTube Data

YouTube content frequently contains emojis, special Unicode characters, and text in various languages. The tuber package provides built-in functions for detecting, extracting, and manipulating emojis without external dependencies.

Quick Start

library(tuber)

# Get comments from a video
comments <- get_all_comments(video_id = "your_video_id")

# Check which comments contain emojis
comments$has_emoji <- has_emoji(comments$textDisplay)

# Count emojis per comment
comments$emoji_count <- count_emojis(comments$textDisplay)

# Keep only comments that contain at least one emoji
emoji_comments <- comments[comments$emoji_count > 0, ]

Emoji Detection Functions

The package provides five main functions for working with emojis:

has_emoji() - Check for emoji presence

has_emoji("Hello world")
# FALSE

has_emoji("Great video! \U0001F44D")
# TRUE

has_emoji(c("No emoji", "Has emoji \U0001F600", "Also none"))
# c(FALSE, TRUE, FALSE)

count_emojis() - Count emojis in text

count_emojis("Hello world")
# 0

count_emojis("Rating: \U0001F600\U0001F600\U0001F600")
# 3

count_emojis(c("None", "\U0001F44D", "\U0001F600\U0001F601"))
# c(0, 1, 2)

extract_emojis() - Get emojis from text

extract_emojis("Hello \U0001F44B World \U0001F30D!")
# list(c("\U0001F44B", "\U0001F30D"))

extract_emojis(c("No emoji", "\U0001F600\U0001F601"))
# list(character(0), c("\U0001F600", "\U0001F601"))

remove_emojis() - Strip emojis from text

remove_emojis("Hello \U0001F44B World!")
# "Hello  World!"

remove_emojis(c("No emoji", "Has \U0001F600 emoji"))
# c("No emoji", "Has  emoji")
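Removing an emoji leaves the spaces that surrounded it, which is why the outputs above contain double spaces. If that matters for downstream tokenization, a small base R helper can tidy the result. This is only a sketch; `squish_spaces()` is a hypothetical name, not a tuber function.

```r
# Hypothetical helper (not part of tuber): collapse runs of whitespace
# and trim the ends, e.g. after emojis have been removed.
squish_spaces <- function(x) {
  trimws(gsub("\\s+", " ", x))
}

# Inputs copied from the remove_emojis() outputs shown above
stripped <- c("Hello  World!", "Has  emoji")
tidied <- squish_spaces(stripped)
tidied
# "Hello World!" "Has emoji"
```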

replace_emojis() - Substitute emojis

replace_emojis("Hello \U0001F44B World!", replacement = "[emoji]")
# "Hello [emoji] World!"

replace_emojis("Rate: \U0001F600\U0001F600\U0001F600", replacement = "*")
# "Rate: ***"

Common Use Cases

Filter comments with high emoji usage

comments <- get_all_comments(video_id = "your_video_id")
comments$emoji_count <- count_emojis(comments$textDisplay)

# Top 10 most emoji-heavy comments
# (head() avoids NA rows when there are fewer than 10 comments)
top_emoji <- head(comments[order(-comments$emoji_count), ], 10)

Text analysis without emojis

# Remove emojis for text analysis
comments$clean_text <- remove_emojis(comments$textDisplay)

# Now use clean_text for sentiment analysis or word clouds

Emoji frequency analysis

# Extract all emojis from comments
all_emojis <- unlist(extract_emojis(comments$textDisplay))

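# Count frequency and show the 10 most common emojis
# (head() avoids NA entries when fewer than 10 distinct emojis appear)
emoji_freq <- table(all_emojis)
head(sort(emoji_freq, decreasing = TRUE), 10)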
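To see the shape of the data at each step without calling the API, the same pipeline can be run on a hand-built list. The list below is made-up sample data standing in for the output of `extract_emojis()`, following the list-of-character-vectors format shown earlier. The `freq_df` data frame is an optional extra step, not something tuber returns.

```r
# Made-up sample standing in for extract_emojis() output on three comments
extracted <- list(
  character(0),
  c("\U0001F600", "\U0001F601"),
  c("\U0001F600", "\U0001F44D", "\U0001F600")
)

# Flatten to one character vector, then tabulate and sort
all_emojis <- unlist(extracted)
emoji_freq <- sort(table(all_emojis), decreasing = TRUE)

# Convert to a data frame for plotting or joining
freq_df <- data.frame(
  emoji = names(emoji_freq),
  n = as.integer(emoji_freq),
  stringsAsFactors = FALSE
)
freq_df$n
# 3 1 1
```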

Unicode Text Processing

Beyond emojis, tuber includes helpers for fixing text encoding and cleaning up text returned by the API:

safe_utf8() - Ensure UTF-8 encoding

problematic_text <- c("caf\xe9", "na\xefve")
safe_text <- safe_utf8(problematic_text)
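The two strings above contain Latin-1 bytes (`\xe9`, `\xef`) that are not valid UTF-8 on their own. For readers who want to see what such a repair involves, here is a base R sketch. It assumes the input is Latin-1 and does not claim to reproduce how `safe_utf8()` works internally.

```r
# Latin-1 encoded bytes; invalid as UTF-8
problematic_text <- c("caf\xe9", "na\xefve")
validUTF8(problematic_text)
# FALSE FALSE

# Assumption: the source encoding is Latin-1
fixed <- iconv(problematic_text, from = "latin1", to = "UTF-8")
validUTF8(fixed)
# TRUE TRUE
```

When the source encoding is unknown, guessing wrong produces mojibake rather than an error, so it is worth checking a few converted values by eye.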

clean_youtube_text() - Clean HTML and normalize text

raw_text <- "Great video! &lt;3 &amp; more..."
clean_text <- clean_youtube_text(raw_text)
# "Great video! <3 & more..."
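To make the "clean HTML" step concrete, here is a base R sketch that decodes a handful of common HTML entities. `decode_basic_entities()` is a hypothetical helper written for this example; it is not tuber's implementation, and `clean_youtube_text()` may handle more cases.

```r
# Hypothetical helper (not tuber's implementation): decode a few common
# HTML entities. "&amp;" is decoded last so that text like "&amp;lt;"
# becomes "&lt;" rather than "<".
decode_basic_entities <- function(x) {
  x <- gsub("&lt;", "<", x, fixed = TRUE)
  x <- gsub("&gt;", ">", x, fixed = TRUE)
  x <- gsub("&quot;", "\"", x, fixed = TRUE)
  x <- gsub("&#39;", "'", x, fixed = TRUE)
  gsub("&amp;", "&", x, fixed = TRUE)
}

decoded <- decode_basic_entities("Great video! &lt;3 &amp; more...")
decoded
# "Great video! <3 & more..."
```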

Troubleshooting

Emojis appear as question marks

Your console or locale may not be able to display UTF-8. The underlying data is still correct; only the printed output is affected. Check the current locale and, if needed, switch to a UTF-8 one:

# Check locale
Sys.getlocale("LC_CTYPE")

# Set UTF-8 locale on macOS/Linux
Sys.setlocale("LC_CTYPE", "en_US.UTF-8")
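Before changing the locale, it can help to confirm whether the current session is already UTF-8. Base R's `l10n_info()` reports this directly; the message text below is illustrative.

```r
# l10n_info() is base R; its "UTF-8" element is TRUE in a UTF-8 locale
info <- l10n_info()
is_utf8 <- isTRUE(info[["UTF-8"]])

if (!is_utf8) {
  message("Current locale is not UTF-8: ", Sys.getlocale("LC_CTYPE"))
}
```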

Emoji counts seem too high

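Compound emojis may be counted as more than one emoji, because Unicode encodes them as sequences of code points. A family emoji is several person emojis joined by zero-width joiners (U+200D), and a skin-tone variant is a base emoji followed by a modifier (U+1F3FB to U+1F3FF).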
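To see why, one can inspect the code points behind a compound emoji with base R's `utf8ToInt()`. The example strings are illustrative; how `count_emojis()` treats each code point depends on tuber's pattern and is not shown here.

```r
# Family emoji: man + zero-width joiner + woman + zero-width joiner + girl
family <- "\U0001F468\u200D\U0001F469\u200D\U0001F467"
family_cps <- utf8ToInt(family)
length(family_cps)
# 5
sprintf("U+%04X", family_cps)
# "U+1F468" "U+200D" "U+1F469" "U+200D" "U+1F467"

# Thumbs up followed by a medium skin tone modifier
thumbs <- "\U0001F44D\U0001F3FD"
thumbs_cps <- utf8ToInt(thumbs)
length(thumbs_cps)
# 2
```

Both strings render as a single glyph on most systems, yet each contains several code points that a code-point-based pattern can match separately.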

Some emojis not detected

The emoji pattern covers the most common Unicode emoji blocks. Emojis added in recent Unicode versions may not be detected until the pattern is updated.
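A pattern built from Unicode blocks checks whether code points fall inside known ranges, so anything outside those ranges is missed. The sketch below uses two illustrative ranges chosen for the example; they are an assumption and not the pattern tuber actually uses. `in_sketch_ranges()` and `sketch_has_emoji()` are hypothetical names.

```r
# Illustrative ranges only; NOT the pattern tuber uses
in_sketch_ranges <- function(cp) {
  (cp >= 0x1F300 & cp <= 0x1FAFF) |
    (cp >= 0x2600 & cp <= 0x27BF)
}

# Hypothetical detector: TRUE if any code point falls in the ranges
sketch_has_emoji <- function(x) {
  vapply(
    enc2utf8(x),
    function(s) any(in_sketch_ranges(utf8ToInt(s))),
    logical(1),
    USE.NAMES = FALSE
  )
}

detected <- sketch_has_emoji(c("plain text", "\U0001F600", "\U0001F004"))
detected
# FALSE TRUE FALSE
```

U+1F004 (the mahjong red dragon tile) is an emoji, but it sits outside both ranges, so the sketch misses it. Updating a pattern of this kind means extending its ranges to cover such code points.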