The “R vs. Python” Debate in the Digital Humanities
Introduction
A lively debate has been underway for a number of years concerning the relative merits (or, in some cases, the superiority) of Python and R, the two most popular programming languages for data science and the digital humanities, with committed partisans on both sides. The debate is taking place in various professional communities, particularly in data science, in scientific, engineering, and biomedical computing, and in the digital humanities. JavaScript and compiled languages such as C (or C++) and Java are also used in the digital humanities, but less frequently.
The general “consensus” (very loosely defined) that has emerged from these debates is that Python is recognized as the premier general purpose programming language, but that R, more focused on data analysis and statistics, is more intuitive and facilitates data exploration. To some, Python has a steeper learning curve. It is an extensible language, meaning that its core does not support an expansive array of functionality, but this functionality can be incorporated through interoperable libraries. R, on the other hand, has been praised for its ease of use and for its sophisticated visualization capabilities, and is considered by some to be easily integrated into data analysis workflows. R is particularly popular with statisticians and those working in data mining or with very large data.
As discussed in a previous section, many programming languages are available for a variety of purposes. For general-purpose enterprise computing, where large, complex, reliable systems are needed and where efficiency is at a premium, software developers typically select languages that are compiled into object code and linked with library functions to generate executable code. Among those languages, C is a clear favourite. C ranked #2 in the Tiobe Index (December 2021) [https://www.tiobe.com/tiobe-index/], which measures the popularity of programming languages based on the quantity of search engine results for queries on the languages (note that these rankings do not indicate the “best” language, or the suitability of any given language for a particular purpose). C also ranked #3 on the IEEE (Institute of Electrical and Electronics Engineers) Spectrum Top Programming Languages for 2021 [https://spectrum.ieee.org/top-programming-languages/]. Like the Tiobe Index, the IEEE Spectrum rankings measure programming language popularity. They are calculated through a weighted combination of eleven metrics from eight web-based sources, including Google, CareerBuilder, and Twitter [https://spectrum.ieee.org/top-programming-languages/]. According to that site, C is “…used to write software where speed and flexibility are important, such as in embedded systems or high-performance computing”. Although not prominent in the development of web sites or web applications, C is an important language for mobile computing, desktop and enterprise applications, and embedded systems and device controllers. Originally conceived as a systems language, C has become the primary general-purpose compiled language. Along with the older compiled language Fortran, C is also widely employed in the development of scientific, biomedical, and engineering applications. Java is another popular general-purpose language.
Java is technically compiled to an intermediate format known as byte code, which subsequently runs in a specialized execution environment known as a Java Virtual Machine (JVM). The language ranked #3 on the Tiobe Index (December 2021) and #2 on the 2021 IEEE Spectrum Rankings. The latter describes Java as “[a]n object-oriented language that creates code intended to be run on a virtual machine, allowing it to run on different platforms with little or no modification. Java is a popular choice for Web applications.” Java is also used for the development of mobile and desktop/enterprise applications. Although originally developed as a language for programming small embedded systems, Java is not currently prominent in this area. Other widely used languages include JavaScript (not directly related to Java), for adding interactivity and computational capabilities to websites; C++, an object-oriented enhancement of C; Go, a language developed at Google with built-in support for concurrent programs that make use of multiple cores; and PHP, for supporting dynamic websites. C, C++, and Go are compiled languages, Java is compiled to byte code and run on a JVM, and JavaScript is an interpreted language. Mention should also be made of Assembly language, a processor-dependent low-level language for communicating directly with the computer hardware, and SQL, a special-purpose language for defining, manipulating, and querying relational databases. A large variety of other programming languages are available, and the reader is encouraged to explore them, starting with the Tiobe Index and IEEE Spectrum Rankings.
However, as explained earlier, in the digital humanities, data science, and the overlap between the two, Python and R are dominant. Python and R are both interpreted scripting languages, and both can be run interactively from a variety of graphical user interface (GUI) frameworks, ranging from the most basic to the highly sophisticated. Python, the #1 language on the Tiobe Index (December 2021) [https://www.tiobe.com/tiobe-index/] and #1 in the Language Rankings of IEEE Spectrum [https://spectrum.ieee.org/top-programming-languages/], is a widely used general-purpose programming language. In the IEEE Spectrum Language Ranking, Python is described as “[an] object-oriented, interpreted language that gains much of its power from a large constellation of libraries, including popular modules for machine learning and scientific computing.” It is used for developing web-based applications, large-scale enterprise, desktop, and scientific and engineering applications, and for programming embedded systems and device controllers. Python has also gained a foothold in biomedical research (Deardorff, 2020), and specialized Python tools, such as Biopython, are available for computational molecular biology [https://biopython.org/]. The other participant in this debate is the R programming language. It ranked #11 on the Tiobe Index (December 2021) and #7 on the IEEE Spectrum rankings. In the IEEE Spectrum Language Ranking, R is described as “[a] language and programming environment designed for statistical analysis and data-mining applications”, in contrast to the more general-purpose Python language. Unlike Python, R is not widely used for web sites, web applications, or embedded systems. Its primary deployment platform is desktop systems for large-scale scientific and enterprise applications. However, like Python, it is also heavily used in biomedical applications (Deardorff, 2020).
In the digital humanities, both languages are very popular. There do not seem to be specific applications or categories of applications in humanities scholarship in which one language has a clear advantage over the other. The examples presented in the labs in this course employ both languages.
Head-to-Head Comparison
A decision between the two languages may not be ultimately crucial, as both Python and R have very large, established, and well-developed software ecosystems to support any functionality. Because of their popularity, each language has a very large and active user base and community, and consequently it is relatively easy to find support. For data science applications, both languages are suitable (Python vs. R for Data Science: What’s the Difference?).
Furthermore, both languages have extensive ecosystems: libraries, tools, and frameworks for programming and development, and for a wide breadth of applications, that extend and enhance the functionality of each language. Prominent in Python’s ecosystem is a huge collection of libraries that are easily installed in Python integrated development environments (IDEs) or through the “pip” command. The PyPI (Python Package Index) repository contains a very large number of software packages written and shared by the extensive Python developer community (although this repository is not curated). Popular Python libraries include Matplotlib for plotting and graphing; NumPy, which is essential for numerical computing; SciPy, which provides additional computational and numerical functionality; scikit-learn, which implements machine learning algorithms as well as many algorithms for text analysis; Pandas for data manipulation and data frames; and utilities for systems operations. Additional libraries for visualization include Plotly, Seaborn, Bokeh, and Plotnine. The latter is based on the powerful R package ggplot2 for visualization, described below.
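As a small illustration of how such libraries are obtained, the following Python sketch checks which of the libraries named above can be imported in the current environment and prints the corresponding pip command for any that are missing. It uses only the standard library. The function name missing_packages and the LIBRARIES table are introduced here purely for illustration and are not part of any library. The table maps each import name to the name used with pip, since the two sometimes differ (the machine learning library is imported as sklearn but installed as scikit-learn).

```python
from importlib.util import find_spec

# Import name -> name used with "pip install" (the two sometimes differ).
# This table is illustrative; extend it with whatever libraries are needed.
LIBRARIES = {
    "matplotlib": "matplotlib",
    "numpy": "numpy",
    "scipy": "scipy",
    "sklearn": "scikit-learn",
    "pandas": "pandas",
}


def missing_packages(libraries):
    """Return the pip names of libraries that cannot be imported here."""
    return [
        pip_name
        for import_name, pip_name in libraries.items()
        if find_spec(import_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages(LIBRARIES)
    if missing:
        # Print the command rather than running it, so nothing is installed
        # without the user's knowledge.
        print("python -m pip install " + " ".join(missing))
    else:
        print("All listed libraries are already installed.")
```

The sketch only reports what is missing; whether to install into a virtual environment or through an IDE's package manager is left to the reader.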
The R ecosystem is represented primarily through the Comprehensive R Archive Network (CRAN) repository. According to the CRAN website, “CRAN is a network of ftp and web servers around the world that store identical, up-to-date, versions of code and documentation for R.” In contrast to PyPI, contributions to CRAN are generally curated. Because of the expansive ecosystems for Python and R, both languages are very well suited to digital humanities scholarship and data science.
Main Strengths
For machine learning and deep learning, explained in the previous section on Machine Learning, Python is the clear favourite. The Python library scikit-learn features several supervised and unsupervised machine learning functions that are used in subsequent sections. Many powerful machine learning environments have been developed for Python, including Keras and PyTorch. On the other hand, R excels at statistical tasks and data analysis, and is particularly strong in statistical and data modeling.
For data science and many of the types of analysis used in humanities scholarship, the R ecosystem makes it an attractive option. There are a variety of robust text analysis tools, including readtext for importing and processing text files in a variety of formats (CSV, JSON, HTML, etc.); UDPipe for tokenization, POS tagging, and lemmatization; quanteda for quantitative text analysis; and topicmodels, an interface to C code for Latent Dirichlet Allocation (LDA) and other topic modeling functions. These packages, and many others that are useful for humanities research, are found on the CRAN repository.
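To make the idea of quantitative text analysis more concrete, the following sketch tokenizes a few short texts and counts how often each term occurs in each document, a simplified version of the document-term counts on which packages of this kind build. It is written in Python with only the standard library and does not reproduce the API of quanteda or of any other package mentioned above. The function names, the simple regular-expression tokenizer, and the two sample sentences are assumptions made purely for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def document_feature_matrix(documents):
    """Count how often each term occurs in each document.

    Returns (vocabulary, rows): a sorted list of terms and, for each
    document name, a list of counts aligned with that vocabulary.
    """
    counts = {name: Counter(tokenize(text)) for name, text in documents.items()}
    vocabulary = sorted(set().union(*counts.values()))
    rows = {name: [c[term] for term in vocabulary] for name, c in counts.items()}
    return vocabulary, rows


# Two made-up one-line "documents", purely for illustration.
docs = {
    "doc1": "The whale, the sea, and the ship.",
    "doc2": "The ship left the harbour.",
}

vocabulary, rows = document_feature_matrix(docs)
print(vocabulary)
for name, row in rows.items():
    print(name, row)
```

In practice, a dedicated package would typically add steps such as stop-word removal and lemmatization before counting, and would store the counts in a sparse matrix rather than in plain lists.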
Integration into Other Systems and Workflows
Another specific Python strength is that it can be integrated into other software systems (see R vs Python for Data Science: The Winner is … for a discussion of this topic). Alternatively, entire applications can be written in Python, as it is a general-purpose language. Python can be used for designing and implementing machine learning workflows and pipelines as well as full data science workflows. R, however, features easy mechanisms to create dashboards, which are particularly important in data analytics and visual analytics. Dashboards are visual displays that bring together the data for a particular application. Multiple graphs, visualizations, and text representations of the data, and of its different aspects or components, are presented simultaneously on one screen, giving users a range of analysis and exploration tools in a single display and thereby facilitating analysis and decision-making. The Shiny package allows users to design, create, and implement dashboards in R. Python also features libraries for dashboard development and deployment. The Dash library, related to the Plotly visualization library, is one example of a set of tools that provides this functionality.
User Community
Both languages are supported by large expertise bases. Because it is a general-purpose language and not limited to data science or statistical analysis, the user community for Python is larger than for R. However, this community is generally more diffuse, given the wide range of Python applications and functionalities. Conversely, the R expertise base is more focused, albeit not exclusively, on statistical analysis and data science algorithms and tools. In addition, there is a vast amount of published material from which to learn Python and R, including commercial, open-access, and online resources. Many for-credit and non-credit courses in these languages are available through different types of educational programs, colleges, universities, and online courses.
In addition to the user community for either language, another consideration is what the user’s colleagues are using (see Python vs. R for Data Science: What’s the Difference? for a discussion of this topic). Especially in institutional contexts, some development tools may be preferred over others. Therefore, one factor for the scholar to at least consider is what tools or languages have a higher degree of support institutionally. Although not a deciding factor, if the choice is between Python and R, and both are unfamiliar to the scholar, then the type and degree of local support for a particular language may have an impact on the language choice.
Installation and Getting Started
For those with little or no programming experience, setting up the R environment may be easier and more straightforward than setting up Python. The development environment included in R, simply called RGui, is a basic interface with an editor in which code is written, and from which individual lines or sections of code can be executed and tested. Some data practitioners, however, see R as having a relatively steep learning curve, especially for those new to programming, although learning resources are readily available.
Pedagogy
From a pedagogical viewpoint, it is sensible to learn both languages, if this option is available. In general, knowledge of one programming language lowers the learning curve for other languages, especially if those languages are within the same general category (e.g., scripting languages). Although learning C or Java may still prove difficult even with a knowledge of R or Python, this knowledge will likely ease the learning process for the more complex compiled languages, at least to some degree. Furthermore, it is important to be conversant in both digital humanities/data science languages, and to be able to flexibly use one or the other (or both) as needed. For those humanities scholars and students with some programming experience, Python may be a logical choice, as it is a general-purpose language. Readability and simplicity were among the main design goals of Python, and therefore the learning curve is relatively low, especially for those with some degree of prior programming experience.
R, as explained earlier, is primarily (but not exclusively) a statistical modeling and data analysis language, and its syntax, which is somewhat different from that of Python, C, and Java, reflects this tendency. For those with some coding experience who simply need to analyze data quickly, and who would benefit from a large collection of text processing and analysis packages, R may be a suitable option. The question of which language is easier to learn therefore has no conclusive answer; it depends on the experience of the user, as well as the time allocated to learning the language. For many users, especially the quantitatively inclined, “getting results” may proceed more quickly in R. Compared to R, Python has greater similarities to widely used general-purpose languages such as C. In general, many users find the learning curve for Python relatively low, due to its flexibility and readability.
Summary
From this discussion, it can be concluded, perhaps unsatisfyingly, that the choice between Python and R as a language for the digital humanities and data science is not straightforward. Both languages have their own strengths and emphases. For purely data analysis and statistical modeling applications, as well as for text analysis and text mining, R may have a slight advantage, as the language was specifically developed for statistical analysis and for applications that rely on such analysis, such as data science and text processing. R also features sophisticated data visualization capabilities, enhanced through ggplot2 and other special-purpose packages, such as Plotly (for R). Python, as a general-purpose language, is not focused specifically on data science, but, because of the availability of robust libraries, is also very widely used for both data science and digital humanities scholarship, and in areas where these two fields overlap. In humanities scholarship, machine learning techniques are increasingly important. Where these applications are needed, Python has the edge, due to the availability of advanced numerical, computational, and machine learning libraries. Therefore, again perhaps unsatisfyingly, the best approach for humanists is to become conversant in both languages.
As mentioned in previous sections, digital humanities scholarship will be advanced when humanists themselves design, develop, and implement their own tools that can be integrated into their own workflows (Ramsay, 2016; Tenen, 2016). Programming their own tools liberates scholars from the constraints of pre-programmed software and allows them to adapt sophisticated computational tools to their own research and to answer their own questions. Both Python and R offer a plethora of tools and frameworks to enrich humanities scholarship. Entire workflows can be built with Python and complemented with special-purpose applications in R, particularly for statistics and text processing. Although the current discussion has focused on Python and R, other programming languages may increase in importance in the future. However, at present, and at least for the foreseeable future, a knowledge of Python, R, or preferably both, provides benefits to humanists in their computational endeavors. This knowledge gives scholars the flexibility and adaptability required to move the digital humanities forward.