Research analysts in quantitative areas such as epidemiology, medical statistics, and ‘-omic’ fields often have considerable expertise in handling and analysing tabular data that is mostly numeric. Working with unstructured, text-heavy data is typically not part of their analytical toolkit. However, there is great value in being able to manipulate and draw inferences from textual data. This course provides a hands-on introduction to some of the tools and methods available in R for processing, visualising and analysing text data.
Topics to be covered in this session:
Text pre-processing (learn how to read in different types of textual data and pre-process text using string manipulation functions)
Run simple lexical analyses on a corpus of texts
Perform a simple sentiment analysis
Build a text classifier (using a simple supervised machine learning algorithm)
Learning Objectives
Understand, build and manipulate a corpus object
Understand, build and manipulate document-term matrices
Obtain simple lexical metrics which can be used in downstream analyses
Perform a simple sentiment analysis
Classify/categorise text-based documents
Participants should be comfortable running commands and basic functions in R via RStudio.
Participants can simply watch the live demo if they prefer. Those who wish to follow along with the demonstration will need:
R and RStudio installed, with the following packages:
tidyverse, lubridate, readtext, quanteda, quanteda.sentiment, caret, quanteda.textmodels, hunspell
Download the toy datasets from the GitHub repo: github.com/shellylac/TextMiningIntro_inR
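For those following along, the packages listed above could be installed with something like the sketch below. It is not taken from the course materials. It assumes that all packages except quanteda.sentiment are available from CRAN, and that quanteda.sentiment is installed from its GitHub repository (quanteda/quanteda.sentiment) using the remotes package; check the package's own documentation if this has changed.

```r
# Packages from the list above that are assumed to be on CRAN
cran_pkgs <- c("tidyverse", "lubridate", "readtext", "quanteda",
               "quanteda.textmodels", "caret", "hunspell")

# Install only the packages that are not already present
missing_pkgs <- setdiff(cran_pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}

# Assumption: quanteda.sentiment is not on CRAN and is installed from GitHub
if (!requireNamespace("quanteda.sentiment", quietly = TRUE)) {
  if (!requireNamespace("remotes", quietly = TRUE)) {
    install.packages("remotes")
  }
  remotes::install_github("quanteda/quanteda.sentiment")
}
```

Running this once in the RStudio console before the session should be enough; packages that are already installed are skipped.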
We will run some analyses and visualisations on a dataset of COVID-19 tweets from Twitter. Participants who wish to can obtain their own dataset of tweets by following the instructions here:
github.com/shellylac/TextMiningIntro_inR/blob/main/TextMining_RScripts/Get_Twitter_Data.R