Introduction

Mark P. Wachowiak

This project is made possible with funding by the Government of Ontario and through eCampusOntario’s support of the Virtual Learning Strategy.

To learn more about the Virtual Learning Strategy visit: https://vls.ecampusontario.ca.

Nipissing University sits on the territory of Nipissing First Nation, the territory of the Anishnabek, within lands protected by the Robinson Huron Treaty of 1850. We are grateful to be able to live and learn on these lands with all our relations.

DIGITAL HUMANITIES TOOLS AND TECHNIQUES I

This course is a 2000-level (second year undergraduate) course that introduces the basic tools and techniques used in modern digital humanities scholarship. Learners are introduced to algorithms used in the field, particularly (but not exclusively) in text analysis, which is historically foundational for the development of the digital humanities and is also relevant to and vitally important in many digital humanities sub-specialities, even those that are not directly related to text analysis.

Other disciplines that have a synergistic relationship with the digital humanities are also presented within the context of digital humanities scholarship. Data science, like the digital humanities, is an emerging field. These two fields, while distinct, naturally overlap. In a nutshell, digital humanities scholarship processes, analyzes, visualizes, and interprets data. Additionally, a large amount of humanistic data falls into the category of “Big Data”. Texts, such as those archived by Google Books, are a canonical example of “Big Humanities Data”. New paradigms in the humanities, including “distant reading”, emerged as a response to this vast volume of complex, mostly unstructured data. Consequently, the synergy between these two fields is examined in this course. For data to be useful for humanities scholarship, however, humanists must use computational tools, which comprise the other major component of this course. These tools are implementations of algorithms that solve specific problems and perform operations that enable and facilitate humanities work. Although commercial and free (open source) computer programs exist to facilitate this work, many digital humanists argue that succeeding in the field and developing it further require a basic knowledge of what has come to be known as “coding”. Coding allows a much greater degree of flexibility to address specific questions than is generally available in “pre-written” software systems.

In other words, humanities scholars should know, at a very basic level, how to write computer programs, or to code (used as a verb), to perform the complex algorithmic manipulations necessary for many applications. Fortunately, learning to program has become much easier with interpreted scripting languages. In humanities disciplines, Python and R, both freely available, are the languages of choice. Python is widely regarded as one of the most popular programming languages in the world at present and is employed in a vast array of applications. In addition, although many of the useful algorithms in humanities scholarship are sophisticated and mathematically complex, libraries and packages that implement these algorithms are available in both languages. Therefore, humanists do not need to develop and write full-scale programs. With only a rudimentary knowledge of Python and/or R, they can integrate algorithms and libraries that have already been written, optimized, and tested into their own workflows to solve their own particular problems. Providing learners with the knowledge and insights needed for these tasks is one of the goals of this course. With these tools, humanists themselves can apply very complex, leading-edge algorithms and build powerful, efficient systems. The course provides gentle introductions to these languages through interactive notebooks (Jupyter) and code examples.

The course also provides an overview of two of the most important computational paradigms in the digital humanities: machine learning and information visualization. These topics are discussed both in general and in terms of how they relate specifically to humanities work. Simple Python and R code examples are provided in interactive virtual labs that illustrate some of these introductory, yet powerful, techniques.

The overview of computational techniques in this course benefits learners who continue in the certificate program and enroll in the subsequent courses in the digital classics and the spatial humanities.

This course will provide learners with valuable knowledge and technical skills that can be applied to humanities work and beyond. Working with code in Python and R is increasingly important in many professions. The experience with coding and computational algorithms gained in this course will benefit learners in their professional careers, even outside humanities disciplines.

This course is an introduction to data science and computational techniques in the digital humanities. The relationships between two emerging fields, the digital humanities and data science, are explored. The two most popular programming languages for humanities scholarship, Python and R, are introduced. Machine learning and information visualization, whose importance in digital humanities scholarship is rapidly increasing, are also presented. Topics include data in the digital humanities, Big Data, creating basic scripts in Python and R, integrating publicly available libraries into workflows, an overview of some of the machine learning techniques used in the digital humanities (including artificial neural networks), and information visualization. Useful algorithms in the digital humanities are presented via virtual labs that allow learners to experiment and interact with Python and R code.

REQUIRED TEXT: Digital Humanities Tools and Techniques I, available as online content for this course.

To follow the code examples and interactive workbooks, downloading and installing the freely available Python (https://www.python.org/) and R (https://www.r-project.org/) languages, along with the libraries and packages for these languages, is recommended.

Each module contains a reading list from publicly available web pages and other Internet sources.

LEARNING OUTCOMES

By the end of this course, successful students will be able to:

Explain in detail the scope and practices of data science, and how data science facilitates many aspects of humanities scholarship.
Apply basic data science methodologies to humanities questions.
Explain in detail the different types of data (Big Data, Smart Data, multimedia data, etc.) and the role of these types of data in humanities scholarship.
Design, implement, test, and analyze short code sections or scripts in Python and R.
Apply basic machine learning techniques to the digital humanities.
Generate interactive graphs and visualizations that are widely used in the digital humanities.
Apply basic text processing and analysis methods using specialized Python libraries and R packages.
Obtain in-depth information about Python and R functions using online resources available to help solve programming problems.
Produce basic plots in Python and R by modifying examples and templates.

TOPICS COVERED IN THIS COURSE

Module Topic

1 What Is Data Science?

2 What Is (Are) Data?

3 Big Data in the Humanities

4 Big Data and Smart Data

5 An Overview of Programming Languages

6 Python Tutorial

7 R Tutorial

8 Machine Learning in the Digital Humanities

9 Python vs. R

10 Introduction to Visualization

11 Visualization in Python with Matplotlib

12 Visual Analytics in the Digital Humanities and Colour Models

13 Basic Text Processing In R

14 Introduction to GIS in the Digital Humanities

A NOTE TO INSTRUCTORS

Instructors should freely use material that is relevant to their class and omit or modify modules, or sections of modules, that are not. Advanced material and computational techniques may also be omitted without interrupting the flow of the course.

Readings accessible through the Web are assigned for most sections. Most of these readings are short and are selected for students first studying tools and techniques used in the digital humanities. Some readings are websites that learners may browse to examine some state-of-the-art and leading-edge digital humanities scholarship. Many optional readings are indicated and may be used according to the instructor’s and learners’ interests. Instructors may supplement these readings with reading assignments of their own that complement the material presented in each section, or that offer different or contrary perspectives. Other readings are reference pages for the Python and R techniques discussed in the corresponding sections. Learners need not peruse these sites, but may use them as reference resources, or to learn more about these methods. As the course mainly focuses on methods, some lab exercises are suggested at the end of some of the sections. However, instructors should prepare assignments and laboratory (Python/R implementation) work specifically for the type of class they are teaching, and tailor these assignments and laboratories to meet the needs of their specific students. A test bank for this course may be requested to assist instructors in preparing midterm and/or final examinations. Instructors may also use or adapt questions from this test bank for short assignments or quizzes, especially the short computational questions.

PYTHON, R, AND JUPYTER INSTALLATION

Python download: https://www.python.org/

This installation also includes the IDLE interface.

Note: When installing Python on Windows, ensure that the installer checkbox for adding Python to the PATH is checked.

R download: https://www.r-project.org/

This installation also includes the RGui interface.

Download and installation instructions for Linux and Apple computers are found on these web sites.

In addition, several Python and R packages are required. These packages are easy to install.

The Python libraries can be installed with pip.

On Windows, the easiest approach is to open the command line with the cmd command. Change the directory to your Python directory, and enter the pip command. For example, to install the NumPy package, enter:

pip install numpy

The following Python packages need to be installed. Other packages will be needed later, but they can also be installed with pip.

numpy https://pypi.org/project/numpy/

pip install numpy

pandas https://pypi.org/project/pandas/

pip install pandas

matplotlib https://pypi.org/project/matplotlib/

pip install matplotlib

scipy https://pypi.org/project/scipy/

pip install scipy

scikit-learn (imported in Python as sklearn) https://pypi.org/project/scikit-learn/

pip install scikit-learn

The following packages are installed in the same way.

networkx https://pypi.org/project/networkx/

regex https://pypi.org/project/regex/

nltk https://pypi.org/project/nltk/

plotly https://pypi.org/project/plotly/

pyvis https://pypi.org/project/pyviz/

graph_tools https://pypi.org/project/graph-tools/

graphviz https://pypi.org/project/graphviz/

utm https://pypi.org/project/utm/

wordcloud https://pypi.org/project/wordcloud/

treelib https://pypi.org/project/treelib/

unicodedata2 https://pypi.org/project/unicodedata2/

unidecode https://pypi.org/project/Unidecode/

bokeh https://pypi.org/project/bokeh/

seaborn https://pypi.org/project/seaborn/
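After installation, it can be useful to confirm which packages Python can actually locate. The following is a minimal sketch, not part of the course distribution: the function names and the mapping from pip names to import names are illustrative assumptions, and only a subset of the packages listed above is included. It relies on the standard-library call importlib.util.find_spec, which returns None when a top-level module cannot be found.

```python
import importlib.util
import sys

# Illustrative mapping (an assumption, not taken from the course files) from
# the name used with pip to the name used in an import statement.
# The two can differ: scikit-learn is imported as sklearn.
IMPORT_NAMES = {
    "numpy": "numpy",
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "scipy": "scipy",
    "scikit-learn": "sklearn",
    "networkx": "networkx",
    "nltk": "nltk",
    "plotly": "plotly",
    "wordcloud": "wordcloud",
    "bokeh": "bokeh",
    "seaborn": "seaborn",
}


def missing_packages(import_names):
    """Return the top-level import names that Python cannot locate."""
    return [name for name in import_names
            if importlib.util.find_spec(name) is None]


def pip_command(package):
    """Build a pip command that targets the interpreter running this script."""
    return [sys.executable, "-m", "pip", "install", package]


if __name__ == "__main__":
    not_found = set(missing_packages(IMPORT_NAMES.values()))
    for pip_name, import_name in IMPORT_NAMES.items():
        if import_name in not_found:
            # Print the command rather than running it, so nothing is
            # installed without the user's knowledge.
            print("Missing:", pip_name, "->", " ".join(pip_command(pip_name)))
    if not not_found:
        print("All listed packages were found.")
```

Running pip through the interpreter (python -m pip) is one way to make sure a package is installed for the same Python that will later run the scripts, which can matter when more than one Python installation is present.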

Syntax for installing R packages: install.packages("name of package"). For example, to install ggplot2, use the following command on the R command line. Note that R requires straight quotation marks, not typographic (“curly”) ones.

install.packages("ggplot2")

The following R packages need to be installed:

ggplot2

plotly

reticulate

cleanNLP

sotu

dplyr

topicmodels

Rcpp

glmnet

readtext

quanteda

quanteda.textplots

quanteda.textstats

plyr

JUPYTER NOTEBOOKS

This course and the subsequent course (which continues this one) also have tutorials employing Jupyter Notebooks (https://jupyter.org/; installation instructions at https://jupyter.org/install). Jupyter is a free, open-source, interactive web tool, or “notebook”, in which text, code, output, explanations, and multimedia resources can be combined into a single document that is presented through a web browser (Perkel, 2018). Jupyter has become very popular as a computational notebook and is also widely used in educational settings. Jupyter supports several programming languages, including Python and R, the two most popular coding tools for the digital humanities and data science. Jupyter is considered to be the “de facto standard” in data science (Perkel, 2018), which, as explained in this course, has a close synergy with the digital humanities. Jupyter enables and facilitates interactive data exploration, where learners can execute code, observe the results, modify and experiment with the code, and engage in an “iterative conversation” between scholars, learners, computations, and data (Perkel, 2018). As Jupyter notebooks are shareable and run in a browser, instructors, through their institutional technology services, may deploy them online as virtual labs, or learners may use them on their local machine, even when offline.

The literature contains several case studies of courses employing Jupyter notebooks, including for teaching and learning in engineering courses (Cardoso, Leitão, & Teixeira, 2018), for problem generation and lecturing (Domínguez et al., 2021), for assignments in natural language processing, an important component of text processing in the digital humanities (Foster & Wagner, 2021), for visualization in Python, a focal point of this course (Pajankar, 2021), for a freely available textbook for an undergraduate scientific computing course in chemistry (Weiss, 2020), and, especially relevant for this course, for creating interactive manuals and tutorials (Perkel, 2018).

This present course employs Jupyter notebooks as interactive tutorial labs for Python and R. The subsequent course contains a Jupyter notebook as an interactive tutorial lab in statistics with Python.

INSTALLING JUPYTER

Installation instructions for Jupyter are found at https://jupyter.org/install. For Python, install the classic Jupyter Notebook with pip:

pip install notebook

To run the notebook from the command line (e.g., in Windows, accessed through cmd):

jupyter notebook

This command will open Jupyter Notebooks in the user’s browser.

When first working with Jupyter, it is easiest to run the above command in the directory (folder) that contains the Jupyter notebook files (files with the ipynb extension). It is also easiest to keep all data files used by the notebooks in that directory. However, the user can navigate to any directory from the main Jupyter interface that opens in the browser after the jupyter notebook command is entered.
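Before starting Jupyter, a quick check that the current directory really contains the notebook files can save some confusion. The sketch below is an illustration only; the function name list_notebooks is hypothetical and does not appear in the course materials. It uses the standard pathlib module to collect files with the ipynb extension.

```python
from pathlib import Path


def list_notebooks(folder="."):
    """Return the sorted names of the .ipynb files directly inside folder."""
    return sorted(path.name for path in Path(folder).glob("*.ipynb"))


if __name__ == "__main__":
    notebooks = list_notebooks()
    if notebooks:
        for name in notebooks:
            print(name)
    else:
        print("No .ipynb files found in", Path.cwd())
```

If the list is empty, the command line is probably pointing at a different directory from the one in which the notebooks were saved.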

Jupyter supports Python and R. To set up Jupyter for R, install the IRkernel package from the R command line (e.g., from the command line in the RGui interface).

install.packages("IRkernel")

After the IRkernel package has been installed, run the following command from the R command line to make the kernel available to Jupyter:

IRkernel::installspec(user = FALSE)

See the document Package ‘IRkernel’ on https://cran.r-project.org/web/packages/IRkernel/IRkernel.pdf for additional information.
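To confirm that the R kernel has been registered, Jupyter can be asked for its list of kernels with the command jupyter kernelspec list. The following is a rough sketch that wraps this command in Python. The function names are hypothetical, and the sketch assumes that the --json option produces an object with a top-level "kernelspecs" key and that IRkernel registers itself under its default name, ir.

```python
import json
import shutil
import subprocess


def kernel_names(json_text):
    """Extract the sorted kernel names from the JSON text.

    Assumes the layout {"kernelspecs": {"<name>": {...}, ...}}.
    """
    return sorted(json.loads(json_text).get("kernelspecs", {}))


def list_installed_kernels():
    """Ask Jupyter for its kernels; return [] if jupyter is not on the PATH."""
    jupyter = shutil.which("jupyter")
    if jupyter is None:
        return []
    result = subprocess.run(
        [jupyter, "kernelspec", "list", "--json"],
        capture_output=True, text=True, check=True,
    )
    return kernel_names(result.stdout)
```

Calling list_installed_kernels() after the IRkernel steps above should, under these assumptions, return a list that includes "ir" alongside the Python kernel.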

INTERACTING WITH THE JUPYTER NOTEBOOKS

Jupyter Notebooks provide the user with flexibility to interact with the code in different ways. For instance, users may choose to run the entire notebook by selecting Cell from the menu, and then selecting Run All. The user may then modify the code, add new cells, and experiment with the code.

DATA USED BY THE PYTHON AND R SCRIPTS AND JUPYTER NOTEBOOKS

Many Python and R scripts and Jupyter notebooks require data files, which are supplied in this distribution. The data may be stored in any directory/folder, but the corresponding code (Python, R, and Jupyter Notebooks) needs to be slightly adjusted for this path. The Python and R code defaults to the Data\ directory, meaning that this code expects to locate any data in a separate Data subdirectory within the directory where the Python and R code executes. For instance, if the Python code is placed in a directory named C:\DIGI2306\Python, the data would be placed into the C:\DIGI2306\Python\Data directory. For R, the directories would be, for example, C:\DIGI2306\R and C:\DIGI2306\R\Data.

For the Jupyter Notebooks, the default is for data to be located in the same location as the notebooks. For instance, if the Jupyter notebooks were placed into the directory C:\DIGI2306\Jupyter, then the data files would be placed there too.

The instructor and/or learner can modify the file path in the code as necessary. The code contains commented sections indicating where the path(s) should be changed.
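One way to keep such a path adjustment in a single place is to define the data directory once and build every file path from it. The sketch below illustrates the idea only: the names DATA_DIR and data_file and the file name corpus.txt are hypothetical, and the course scripts mark their own path settings with comments.

```python
from pathlib import Path

# Hypothetical setting: change this one line if the data are stored elsewhere.
# "Data" mirrors the default described above for the Python scripts; for the
# Jupyter notebooks the equivalent default would be "." (the notebook folder).
DATA_DIR = Path("Data")


def data_file(filename, data_dir=DATA_DIR):
    """Return the path to filename inside the data directory."""
    return Path(data_dir) / filename


if __name__ == "__main__":
    path = data_file("corpus.txt")  # hypothetical file name
    print(path, "exists" if path.exists() else "was not found")
```

Building paths with pathlib, instead of typing backslashes by hand, also lets the same code run on Windows, Linux, and Apple computers.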

JUPYTER NOTEBOOKS AVAILABLE FOR THIS CERTIFICATE

The following Jupyter notebooks are available for the first three courses in this certificate (DIGI 2016, DIGI 2316, and DIGI 3017).

Bigram_Visualization_Example.ipynb

Colours_Example.ipynb

GenderedPerspectives_Visualization.ipynb

GenreTree_Example.ipynb

GIS_Density_Mapping_Example.ipynb

K-Means_Ancient_Authors_Example.ipynb

K-Means_Example.ipynb

K-Means_tSNE_Example.ipynb

N-Gram_Visualization_Example.ipynb

PCA_tSNE_Example.ipynb

PythonStatisticsTutorial.ipynb

PythonTutorial.ipynb

Regression_Example.ipynb

RTutorial.ipynb

Sentences_KMeans_Example.ipynb

SocialNetworks_GIS_Example.ipynb

SocialNetwork_Visualization_Example.ipynb

Sunburst_Example.ipynb

TextAnalysis_Example.ipynb

TF-IDF_Example.ipynb

Visualizations_Matplotlib_Plotly_Example.ipynb

WordCloud_Example_2.ipynb

ACKNOWLEDGEMENT

Thanks are given to Ysabel Castle, MESc., Department of Geography at Nipissing University, for developing the GIS lab example (GIS_Density_Mapping_Example.R and GIS_Density_Mapping_Example.ipynb) that is discussed in the GIS section of this course.

Thanks are also given to Renata Smolíková, Ph.D., for assistance in developing the interactive Jupyter Notebooks and the interactive Python and R tutorials.

REFERENCES

Perkel, J. M. (2018). Why Jupyter is data scientists’ computational notebook of choice. Nature, 563(7732), 145-147.

Cardoso, A., Leitão, J., & Teixeira, C. (2018, September). Using the Jupyter notebook as a tool to support the teaching and learning processes in engineering courses. In International Conference on Interactive Collaborative Learning (pp. 227-236). Springer, Cham.

Domínguez, J. C., Alonso, M. V., González, E. J., Guijarro, M. I., Miranda, R., Oliet, M., … & Yustos, P. (2021). Teaching chemical engineering using Jupyter notebook: Problem generators and lecturing tools. Education for Chemical Engineers, 37, 1-10.

Foster, J., & Wagner, J. (2021, June). Naive Bayes versus BERT: Jupyter notebook assignments for an introductory NLP course. In Proceedings of the Fifth Workshop on Teaching NLP (pp. 112-114).

Pajankar, A. (2021). Exploring Jupyter Notebook. In Practical Python Data Visualization (pp. 17-29). Apress, Berkeley, CA.

Weiss, C. J. (2020). A Creative Commons Textbook for Teaching Scientific Computing to Chemistry Students with Python and Jupyter Notebooks. Journal of Chemical Education, 98(2), 489-494.

License

Creative Commons Attribution-ShareAlike 4.0 International License
