Web Scraping
Although the Internet, Web services, and specifically the data that they contain constitute a huge data resource conferring immense potential benefits to the humanities, obtaining data from this vast repository is challenging. One of the technologies used only peripherally in the digital humanities is web scraping, although it can potentially support data collection and processing endeavors (Black, 2016). Web scraping is a technique for extracting specific, meaningful data from web pages through specialized software programs and scripts, known as web crawlers, bots, or web APIs. These programs navigate web pages to obtain targeted data directly from the HTML of the page, rather than, for example, from screen captures or from assets directly accessible on the page. The web sites that are “crawled” are selected in a pre-specified, methodical manner. Web scraping enables the collection of large quantities of unstructured data, which must be transformed into a semi-structured or structured format to be useful (Zeng, 2017). Several web scraping libraries written in Python and R, the two main programming languages employed in the digital humanities, are available.
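As a brief illustration, the following sketch in Python uses the widely available requests and BeautifulSoup libraries to extract targeted data directly from a page's HTML. The URL and the CSS selector are hypothetical placeholders rather than references to a real collection; in R, the rvest package offers comparable functionality.

    # Minimal web scraping sketch (Python, requests + BeautifulSoup).
    # The URL and CSS selector below are hypothetical placeholders.
    import requests
    from bs4 import BeautifulSoup

    url = "https://example.org/catalogue"      # hypothetical target page
    response = requests.get(url, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")

    # Pull specific, meaningful data out of the HTML -- here, the text of
    # every heading carrying a (hypothetical) "item-title" class.
    titles = [tag.get_text(strip=True) for tag in soup.select("h2.item-title")]

    for title in titles:
        print(title)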
Although the terms web scraping and web crawling (and, correspondingly, web scraper and web crawler) are often used interchangeably, there are subtle differences between them. Web crawling is generally non-specific: it extracts all Web-accessible information for subsequent aggregation and processing, as a search engine does when indexing the Web. By contrast, web scraping extracts specific information for a targeted purpose.
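The distinction can be made concrete with a small sketch: where the scraper above targets particular fields on known pages, a crawler follows hyperlinks to gather pages broadly. The starting URL and the page limit below are arbitrary illustrative values.

    # Toy crawler sketch (Python): follow links breadth-first from a
    # starting page, collecting page URLs rather than specific fields.
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    def crawl(start_url, max_pages=10):
        seen, queue = set(), [start_url]
        while queue and len(seen) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            html = requests.get(url, timeout=10).text
            # Queue every hyperlink found on the page for later visits.
            for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                queue.append(urljoin(url, link["href"]))
        return seen

    visited = crawl("https://example.org/")    # hypothetical starting point
    print(len(visited), "pages visited")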
Web scraping also differs from how a human user would browse or navigate a web site. Many web sites explicitly prohibit web scraping and web crawling through the robots exclusion standard, implemented in a small text file named robots.txt. This file is the standard means by which web sites communicate with web crawlers and other robots (“bots”) or automated processes that use the resources of the site. The robots.txt file specifies the areas of the web site that are excluded from access, scanning, or other automated processing. Because of the potentially problematic nature of web scraping and web crawling, some leading legal critics have argued that web scraping is a form of intellectual property theft, although the practice is not illegal under U.S. law. It can also be seen as a contravention of fair use, as web scraping may circumvent the guidelines delineated in the robots.txt file of the website. For instance, the robots.txt file for the Project Gutenberg site explicitly prohibits scraping of the website (The Gendered Novels Project).
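A scraper can consult this file programmatically before requesting pages. The sketch below uses the robots.txt parser in Python's standard library; the site URL, page URL, and user agent name are hypothetical examples.

    # Consulting robots.txt before fetching a page (Python standard library).
    # The site URL and the user agent string are hypothetical examples.
    from urllib.robotparser import RobotFileParser

    parser = RobotFileParser()
    parser.set_url("https://example.org/robots.txt")   # hypothetical site
    parser.read()

    target = "https://example.org/ebooks/1234"         # hypothetical page
    if parser.can_fetch("research-bot", target):
        print("robots.txt permits fetching this page.")
    else:
        print("robots.txt disallows fetching this page; it should be skipped.")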
In addition, web scraping consumes a large amount of resources on the website being scraped, and may therefore degrade that site's performance and potentially damage its technological infrastructure.
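One common way to limit the load a scraper places on a site is to pause between successive requests. The sketch below is a minimal illustration, with hypothetical URLs and an arbitrary two-second delay.

    # Rate-limited scraping sketch (Python): pause between requests to
    # reduce the load placed on the target site.  The URLs are hypothetical
    # and the two-second delay is an arbitrary choice.
    import time
    import requests

    urls = [
        "https://example.org/page1",
        "https://example.org/page2",
    ]

    for url in urls:
        response = requests.get(url, timeout=10)
        # ... process response.text here ...
        time.sleep(2)   # wait before issuing the next request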
These problems are brought to the foreground in part because of the commercial and marketing aims of many web crawlers (Black, 2016). However, web scraping is also used in non-commercial, not-for-profit, educational, and research activities, such as harvesting data for the Internet Archive (Black, 2016). As an example of web scraping in digital humanities scholarship, a digital ethnography of Asian-American musicians working in independent rock music (“indie rock”) extracted social interactions and connections, as well as geographic information pertaining to these musicians, from MySpace profile pages using a custom web scraping script written in the Ruby programming language (Hsu, 2014).