EPAR has developed a suite of open-source text analytic tools in R and Python for reviewing large numbers of documents. We continue to develop and extend the code as we apply the tools to our research; the current code and guidelines for using it can be accessed on the eparTextTools GitHub page. The eparTextTools package is also available for download by R users.

The eparTextTools tool kit provides a set of resources for analyzing textual documents using the R programming language in portfolio and literature reviews. The tools build on text mining, natural language processing, and machine learning packages developed by other R users, and rely heavily on that code. Thus, they may be thought of as a set of tools enabling document analysis rather than a new package for conducting text analysis.

The tools work towards two broad goals. First, the tools provide a flexible framework for describing and classifying the content of textual documents. This includes analysis of word frequencies, description of common words, testing for correlations between words, and categorization of strings of text into modeled or human-coded topic categories. The text tools, as designed, support query-based description, answering questions such as “How often does the set of pharmaceutical SEC 10-K filings use the phrase ‘tax credit’ versus ‘research subsidy’?” Second, the tools allow a user to explore documents inductively by letting the documents themselves suggest word correlations, commonalities, and topics. This inductive topic modeling provides a different perspective on analysis that is difficult to replicate through human review alone.
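The query-based description mentioned above can be illustrated with a minimal Python sketch. This is not the eparTextTools API: it uses only the Python standard library, the function names (tokenize, word_frequencies, phrase_counts) are hypothetical, and the sample documents are invented purely for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a document and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_frequencies(documents):
    """Count how often each word appears across all documents."""
    counts = Counter()
    for text in documents:
        counts.update(tokenize(text))
    return counts


def phrase_counts(documents, phrases):
    """For each phrase, return a list with its number of occurrences per document.

    A phrase matches when its tokens appear consecutively in the document,
    so punctuation and capitalization differences are ignored.
    """
    results = {phrase: [] for phrase in phrases}
    for text in documents:
        tokens = tokenize(text)
        for phrase in phrases:
            target = tokenize(phrase)
            n = len(target)
            hits = sum(
                1
                for i in range(len(tokens) - n + 1)
                if tokens[i:i + n] == target
            )
            results[phrase].append(hits)
    return results


# Hypothetical sample documents, standing in for a real corpus.
example_docs = [
    "The company claimed a tax credit for research. The tax credit reduced costs.",
    "A research subsidy supported the trial, alongside one tax credit.",
    "No incentives were reported this year.",
]

if __name__ == "__main__":
    print(phrase_counts(example_docs, ["tax credit", "research subsidy"]))
    print(word_frequencies(example_docs).most_common(5))
```

Matching on token sequences rather than raw substrings is a design choice made here so that a phrase split by a comma or written in capitals is still counted. A real corpus would likely also need stemming and stop-word handling, which this sketch leaves out.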

These text analysis tools could be used alone, or to supplement the human research protocols for literature reviews and portfolio reviews that EPAR has developed. The tools identify patterns across a set of documents and can provide a guide for human review. Using machine tools to automate and focus parts of our manual analysis can reduce labor and time costs and limit human error. A presentation describing the potential contributions of the EPAR text analysis tools in the context of an investment portfolio review process is available here.

The eparTextTools GitHub repository includes multiple vignettes that introduce the tools and give guidance for using them, along with a variety of documentation to support new users. The EPAR RAs largely responsible for developing this code are current and former UW Evans students Dr. Ryan Scott, Graham Andrews, and Adam Hayes.