This project is made possible with funding by the Government of Ontario and through eCampusOntario’s support of the Virtual Learning Strategy.
To learn more about the Virtual Learning Strategy visit: https://vls.ecampusontario.ca.
Nipissing University sits on the territory of Nipissing First Nation, the territory of the Anishnabek, within lands protected by the Robinson Huron Treaty of 1850. We are grateful to be able to live and learn on these lands with all our relations.
DIGITAL HUMANITIES TOOLS AND TECHNIQUES I
This course is a 2000-level (second year undergraduate) course that introduces the basic tools and techniques used in modern digital humanities scholarship. Learners are introduced to algorithms used in the field, particularly (but not exclusively) in text analysis, which is historically foundational for the development of the digital humanities and is also relevant to and vitally important in many digital humanities sub-specialities, even those that are not directly related to text analysis.
Other disciplines that have a synergistic and symbiotic relationship with the digital humanities are also presented within the context of digital humanities scholarship. Data science, like the digital humanities, is an emerging field. These two fields, while distinct, naturally overlap. In a nutshell, digital humanities scholarship works with, processes, analyzes, visualizes, and interprets data. Additionally, a large amount of humanistic data falls into the category of “Big Data”. Texts, such as those archived by Google Books, are a canonical example of “Big Humanities Data”. New paradigms in the humanities, including “distant reading”, emerged as a response to this vast volume of complex, mostly unstructured data. Consequently, the synergy between these two fields is examined in this course. For data to be advantageous for humanities scholarship, however, humanists must use computational tools, which comprise the other major component of this course. These tools are implementations of algorithms that solve specific problems and perform various functions and operations that enable and facilitate humanities work. Although commercial and free (open source) computer programs exist to facilitate this work, many digital humanists emphatically argue that succeeding in the field and developing it further require a basic knowledge of what has come to be known as “coding”. Coding allows a much greater degree of flexibility to address specific questions than is generally available in “pre-written” software systems.
In other words, humanities scholars should know, at a very basic level, how to write computer programs, or to code (used as a verb), to perform the complex algorithmic manipulations necessary for many applications. Fortunately, learning to program, or to “code”, has become much easier with interpreted scripting languages. In humanities disciplines, Python and R, both freely available, are the languages of choice. Python is widely regarded as one of the most popular programming languages in the world at present, and is employed in a vast array of applications. In addition, although many of the useful algorithms in humanities scholarship are very sophisticated and mathematically complex, libraries and packages that implement these algorithms are available in these languages, allowing scholars to integrate these functions into their own workflows to solve their own particular problems. Therefore, humanists do not need to develop and write full-scale programs. With only a rudimentary knowledge of Python and/or R, they can integrate algorithms and libraries that have already been written, optimized, and tested. Providing learners with the knowledge and insights needed for these tasks is one of the goals of this course. With these tools, very complex, leading-edge algorithms can be applied, and powerful, efficient systems can be built by humanists themselves. The course provides gentle introductions to these languages through interactive notebooks (Jupyter) and code examples.
The course also provides an overview of two of the most important computational paradigms in the digital humanities: machine learning and information visualization. These topics are discussed both in general and in terms of how they specifically relate to humanities work. Simple Python and R code examples are provided in the context of interactive virtual labs that illustrate some of these introductory, yet powerful techniques.
The overview of the computational techniques discussed in this course is beneficial to learners who continue in the certificate program and enroll in the subsequent courses in the digital classics and the spatial humanities.
This course will provide learners with valuable knowledge and technical skills that can be applied to humanities work and beyond. Working with code in Python and R is increasingly important in many professions. The experience with coding and computational algorithms from this course will benefit learners in their professional careers, even outside humanities disciplines.
This course is an introduction to data science and computational techniques in the digital humanities. The relationships between two emerging fields, the digital humanities and data science, are explored. The two most popular programming languages for humanities scholarship – Python and R – are introduced. Machine learning and information visualization, whose importance in digital humanities scholarship is rapidly increasing, are also presented. Topics include data in the digital humanities, Big Data, creating basic scripts in Python and R and integrating publicly available libraries into workflows, an overview of some of the machine learning techniques in the digital humanities, including artificial neural networks, and information visualization. Useful algorithms in the digital humanities are presented via virtual labs that allow learners to experiment and interact with Python and R code.
REQUIRED TEXT: Digital Humanities Tools and Techniques I, available as online content for this course.
To follow the code examples and interactive workbooks, it is recommended to download and install the freely available Python (https://www.python.org/) and R (https://www.r-project.org/) languages, as well as the libraries and packages for these languages.
Each module contains a reading list from publicly available web pages and other Internet sources.
LEARNING OUTCOMES
By the end of this course, successful students will be able to:
- Explain in detail the scope and practices of data science, and how data science facilitates many aspects of humanities scholarship.
- Apply basic data science methodologies to humanities questions.
- Explain in detail the different types of data (Big Data, Smart Data, multimedia data, etc.) and the role of these types of data in humanities scholarship.
- Design, implement, test, and analyze short code sections or scripts in Python and R.
- Apply basic machine learning techniques to the digital humanities.
- Generate interactive graphs and visualizations that are widely used in the digital humanities.
- Apply basic text processing and analysis methods using specialized Python libraries and R packages.
- Obtain in-depth information about Python and R functions using online resources available to help solve programming problems.
- Produce basic plots in Python and R by modifying examples and templates.
TOPICS COVERED IN THIS COURSE
Module Topic
1 What Is Data Science?
2 What Is (Are) Data?
3 Big Data in the Humanities
4 Big Data and Smart Data
5 An Overview of Programming Languages
6 Python Tutorial
7 R Tutorial
8 Machine Learning in the Digital Humanities
9 Python vs. R
10 Introduction to Visualization
11 Visualization in Python with Matplotlib
12 Visual Analytics in the Digital Humanities and Colour Models
13 Basic Text Processing In R
14 Introduction to GIS in the Digital Humanities
A NOTE TO INSTRUCTORS
Instructors should freely use material that is relevant to their class and omit or modify modules or sections of modules that are not. Advanced material and computational techniques may also be omitted without interrupting the flow of the course.
Readings accessible through the Web are assigned for most sections. Most of these readings are short and are selected for students first studying tools and techniques used in the digital humanities. Some readings are websites that learners may browse to examine some state-of-the-art and leading-edge digital humanities scholarship. Many optional readings are indicated and may be used according to the instructor’s and learners’ interests. Instructors may supplement these readings with reading assignments of their own that complement the material presented in each section, or that offer different or contrary perspectives. Other readings are reference pages for the Python and R techniques discussed in the corresponding sections. Learners need not peruse these sites, but may use them as reference resources, or to learn more about these methods. As the course mainly focuses on methods, some lab exercises are suggested at the end of some of the sections. However, instructors should prepare assignments and laboratory (Python/R implementation) work specifically for the type of class they are teaching, and tailor these assignments and laboratories to meet the needs of their specific students. A test bank for this course may be requested to assist instructors in preparing midterm and/or final examinations. Instructors may also use or adapt questions from this test bank for short assignments or quizzes, especially the short computational questions.
PYTHON, R, AND JUPYTER INSTALLATION
Python download: https://www.python.org/
This installation also includes the IDLE interface.
Note: When installing Python on Windows, ensure that the checkbox that adds Python to the PATH environment variable is checked.
R download: https://www.r-project.org/
This installation also includes the RGui interface.
Download and installation instructions for Linux and Apple computers are found on these web sites.
In addition, several Python and R packages are required. However, these packages are also easy to install.
These Python libraries can be installed with pip, the Python package installer.
On Windows, the easiest approach is to open the command line with the cmd command, change the directory to your Python directory, and enter the pip command. For example, to install the NumPy package, enter:
pip install numpy
The following Python packages need to be installed. Other packages will be needed later, but they can also be installed with pip.
numpy https://pypi.org/project/numpy/
pip install numpy
pandas https://pypi.org/project/pandas/
pip install pandas
matplotlib https://pypi.org/project/matplotlib/
pip install matplotlib
scipy https://pypi.org/project/scipy/
pip install scipy
sklearn https://pypi.org/project/scikit-learn/
pip install scikit-learn
The following packages are installed in the same way.
networkx https://pypi.org/project/networkx/
regex https://pypi.org/project/regex/
nltk https://pypi.org/project/nltk/
plotly https://pypi.org/project/plotly/
pyvis https://pypi.org/project/pyvis/
graph_tools https://pypi.org/project/graph-tools/
graphviz https://pypi.org/project/graphviz/
utm https://pypi.org/project/utm/
wordcloud https://pypi.org/project/wordcloud/
treelib https://pypi.org/project/treelib/
unicodedata2 https://pypi.org/project/unicodedata2/
unidecode https://pypi.org/project/Unidecode/
bokeh https://pypi.org/project/bokeh/
seaborn https://pypi.org/project/seaborn/
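After installing the packages listed above, it can be helpful to confirm which of them Python can actually find. The following is a minimal sketch using only the standard library. The function name missing_packages and the shortened list of import names are illustrative assumptions, not part of the course materials. Note that the import name can differ from the name used with pip (for example, scikit-learn is imported as sklearn).

```python
# Minimal sketch: report which of the course's Python packages can be imported.
# The list of import names is an assumption and covers only some of the packages;
# extend it as needed. An import name may differ from the pip name
# (for example, "pip install scikit-learn" provides the module "sklearn").
from importlib.util import find_spec

IMPORT_NAMES = [
    "numpy", "pandas", "matplotlib", "scipy", "sklearn",
    "networkx", "nltk", "plotly", "seaborn", "wordcloud",
]


def missing_packages(import_names):
    """Return the names from import_names that Python cannot locate for import."""
    return [name for name in import_names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(IMPORT_NAMES)
    if missing:
        print("Not yet installed:", ", ".join(missing))
    else:
        print("All listed packages were found.")
```

Any name reported as missing can then be installed with pip in the same way as shown above.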
Syntax for installing R packages: install.packages("name of package"). For example, to install ggplot2, use the following command on the R command line.
install.packages("ggplot2")
The following R packages need to be installed:
ggplot2
plotly
reticulate
cleanNLP
sotu
dplyr
topicmodels
Rcpp
glmnet
readtext
quanteda
quanteda.textplots
quanteda.textstats
plyr
JUPYTER NOTEBOOKS
This course and the subsequent course (which continues this one) also have tutorials employing Jupyter Notebooks (https://jupyter.org/; installation instructions at https://jupyter.org/install). Jupyter is a free, open-source, interactive web tool, or “notebook”, in which text, code, output, explanations, and multimedia resources can be combined into a single document that is presented through a web browser (Perkel, 2018). Jupyter has become very popular as a computational notebook and is widely used in educational settings. Jupyter supports several programming languages, including Python and R, the two most popular coding tools for the digital humanities and data science. Jupyter is considered to be the “de facto standard” in data science (Perkel, 2018), which, as explained in this course, has a close synergy with the digital humanities. Jupyter enables and facilitates interactive data exploration, where learners can execute code, observe the results, modify and experiment with the code, and engage in an “iterative conversation” between scholars, learners, computations, and data (Perkel, 2018). As Jupyter notebooks are shareable and run in a browser, instructors, through their institutional technology services, may deploy them online as virtual labs, or learners may use them on their local machine, even when offline.
The literature contains several case studies of courses employing Jupyter notebooks, including for teaching and learning in engineering courses (Cardoso, Leitão, & Teixeira, 2018), for problem generation and lecturing (Domínguez et al., 2021), for assignments in natural language processing, an important component of text processing in the digital humanities (Foster & Wagner, 2021), for visualization in Python (a focal point of this course) (Pajankar, 2021), for a freely available textbook for an undergraduate scientific computing chemistry course (Weiss, 2020), and, especially relevant for this course, for creating interactive manuals and tutorials (Perkel, 2018).
This present course employs Jupyter notebooks as interactive tutorial labs for Python and R. The subsequent course contains a Jupyter notebook as an interactive tutorial lab in statistics with Python.
INSTALLING JUPYTER
Installation instructions for Jupyter are found on https://jupyter.org/install. For Python, install the classic Jupyter Notebook with pip:
pip install notebook
To run the notebook from the command line (e.g., in Windows, accessed through cmd):
jupyter notebook
This command will open Jupyter Notebooks in the user’s browser.
When first working with Jupyter, it is easiest to run the above command in the same directory (folder) that contains the Jupyter notebook files (files with the ipynb extension). In addition, it is easiest to keep all data files used by the notebooks in that directory. However, the user can navigate to any directory from the main Jupyter interface that opens in the user’s browser upon entering the jupyter notebook command from the command line.
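Before entering the jupyter notebook command, it may be useful to confirm that the current directory actually contains the notebook files. The following is a small sketch for that check, using only the Python standard library. The function name find_notebooks is a hypothetical helper introduced here for illustration.

```python
# Minimal sketch: list the Jupyter notebook files in a directory.
# find_notebooks is a hypothetical helper, not part of Jupyter or the course code.
from pathlib import Path


def find_notebooks(directory="."):
    """Return the sorted names of all .ipynb files directly inside directory."""
    return sorted(path.name for path in Path(directory).glob("*.ipynb"))


if __name__ == "__main__":
    notebooks = find_notebooks()  # "." means the current working directory
    if notebooks:
        print("Notebooks found:")
        for name in notebooks:
            print(" ", name)
    else:
        print("No .ipynb files were found in", Path.cwd())
```

If no notebooks are listed, change to the directory that holds the ipynb files before starting Jupyter.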
Jupyter supports Python and R. To set up Jupyter for R, install the IRkernel package from the R command line (e.g., from the command line in the RGui interface).
install.packages('IRkernel')
After the IRkernel package has been installed, run the following command from the R command line to make the kernel available to Jupyter:
IRkernel::installspec(user = FALSE)
See the document Package ‘IRkernel’ on https://cran.r-project.org/web/packages/IRkernel/IRkernel.pdf for additional information.
INTERACTING WITH THE JUPYTER NOTEBOOKS
Jupyter Notebooks provide the user with flexibility to interact with the code in different ways. For instance, users may choose to run the entire notebook by selecting Cell from the menu, and then selecting Run All. The user may then modify the code, add new cells, and experiment with the code.
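A notebook file is stored as a JSON document, so its structure can also be inspected outside the browser. The sketch below counts the cells of each type in a notebook. It assumes the current notebook file format (version 4), in which the cells are stored in a top-level "cells" list. The function name count_cell_types is a hypothetical helper, and the notebook name in the usage section is assumed to be present in the current directory.

```python
# Minimal sketch: count the code and markdown cells in a notebook file.
# Assumes notebook format version 4, where cells are in a top-level "cells" list.
import json
from collections import Counter


def count_cell_types(notebook_path):
    """Count the cells of each type (e.g. 'code', 'markdown') in a notebook file."""
    with open(notebook_path, encoding="utf-8") as handle:
        notebook = json.load(handle)
    return Counter(cell["cell_type"] for cell in notebook.get("cells", []))


if __name__ == "__main__":
    # Hypothetical usage: PythonTutorial.ipynb is assumed to be in the current directory.
    try:
        print(count_cell_types("PythonTutorial.ipynb"))
    except FileNotFoundError:
        print("Notebook not found; run this from the notebook directory.")
```

This kind of inspection is optional. The usual way to work with the notebooks remains the browser interface described above.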
DATA USED BY THE PYTHON AND R SCRIPTS AND JUPYTER NOTEBOOKS
Many Python and R scripts and Jupyter notebooks require data files, which are supplied in this distribution. The data may be stored in any directory/folder, but the corresponding code (Python, R, and Jupyter Notebooks) needs to be slightly adjusted for this path. The Python and R code default to the Data\ directory, meaning that this code expects to locate any data in a separate Data subdirectory within the directory where the Python and R code execute. For instance, if the Python code is placed in a directory named C:\DIGI2306\Python, the data would be placed into the C:\DIGI2306\Python\Data directory. For R, the directories would be, for example, C:\DIGI2306\R and C:\DIGI2306\R\Data.
For the Jupyter Notebooks, the default is for data to be located in the same location as the notebooks. For instance, if the Jupyter notebooks were placed into the directory C:\DIGI2306\Jupyter, then the data files would be placed there too.
The instructor and/or learner can modify the file path in the code as necessary. The code contains commented sections indicating where the path(s) should be changed.
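The directory convention described above can be sketched in Python as follows. This is an illustration only, not the code distributed with the course. The helper name data_file and the file name example.csv are hypothetical. Using pathlib avoids hard-coding the backslash separator, so the same code works on Windows, Linux, and macOS.

```python
# Minimal sketch of the "Data subdirectory" convention, using pathlib.
# data_file and "example.csv" are hypothetical names used for illustration.
from pathlib import Path


def data_file(script_dir, filename, data_dirname="Data"):
    """Build the path to filename inside the data subdirectory of script_dir."""
    return Path(script_dir) / data_dirname / filename


if __name__ == "__main__":
    # Path.cwd() is the directory from which the script is executed.
    path = data_file(Path.cwd(), "example.csv")
    print(path, "exists" if path.exists() else "was not found")
```

To use a different location, pass another directory as script_dir or another folder name as data_dirname.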
JUPYTER NOTEBOOKS AVAILABLE FOR THIS CERTIFICATE
The following Jupyter notebooks are available for the first three courses in this certificate (DIGI 2016, DIGI 2316, and DIGI 3017).
Bigram_Visualization_Example.ipynb
Colours_Example.ipynb
GenderedPerspectives_Visualization.ipynb
GenreTree_Example.ipynb
GIS_Density_Mapping_Example.ipynb
K-Means_Ancient_Authors_Example.ipynb
K-Means_Example.ipynb
K-Means_tSNE_Example.ipynb
N-Gram_Visualization_Example.ipynb
PCA_tSNE_Example.ipynb
PythonStatisticsTutorial.ipynb
PythonTutorial.ipynb
Regression_Example.ipynb
RTutorial.ipynb
Sentences_KMeans_Example.ipynb
SocialNetworks_GIS_Example.ipynb
SocialNetwork_Visualization_Example.ipynb
Sunburst_Example.ipynb
TextAnalysis_Example.ipynb
TF-IDF_Example.ipynb
Visualizations_Matplotlib_Plotly_Example.ipynb
WordCloud_Example_2.ipynb
ACKNOWLEDGEMENT
Thanks are given to Ysabel Castle, MESc., Department of Geography at Nipissing University, for developing the GIS lab example (GIS_Density_Mapping_Example.R and GIS_Density_Mapping_Example.ipynb) that is discussed in the GIS section of this course.
Thanks are also given to Renata Smolíková, Ph.D., for assistance in developing the interactive Jupyter Notebooks and the interactive Python and R tutorials.
REFERENCES
Perkel, J. M. (2018). Why Jupyter is data scientists’ computational notebook of choice. Nature, 563(7732), 145-147.
Cardoso, A., Leitão, J., & Teixeira, C. (2018, September). Using the Jupyter notebook as a tool to support the teaching and learning processes in engineering courses. In International Conference on Interactive Collaborative Learning (pp. 227-236). Springer, Cham.
Domínguez, J. C., Alonso, M. V., González, E. J., Guijarro, M. I., Miranda, R., Oliet, M., … & Yustos, P. (2021). Teaching chemical engineering using Jupyter notebook: Problem generators and lecturing tools. Education for Chemical Engineers, 37, 1-10.
Foster, J., & Wagner, J. (2021, June). Naive Bayes versus BERT: Jupyter notebook assignments for an introductory NLP course. In Proceedings of the Fifth Workshop on Teaching NLP (pp. 112-114).
Pajankar, A. (2021). Exploring Jupyter Notebook. In Practical Python Data Visualization (pp. 17-29). Apress, Berkeley, CA.
Weiss, C. J. (2020). A Creative Commons Textbook for Teaching Scientific Computing to Chemistry Students with Python and Jupyter Notebooks. Journal of Chemical Education, 98(2), 489-494.