Data Mining

Mark P. Wachowiak

Data mining is the process of extracting information from large quantities of data and discovering new, possibly hidden, patterns in them. It is an interdisciplinary field that draws on computational techniques from statistics, machine learning, data analysis, and other branches of computational science. Data mining often involves extracting and discovering knowledge from data held in database management systems. Typical data mining functions include anomaly detection, clustering, classification, and summarization. Common sources of data include data warehouses, which are databases created or repurposed and maintained for analysis and decision support. Data warehouses are not intended for operational purposes or for acquiring and storing up-to-date transactional data.
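Two of the functions named above, summarization and anomaly detection, can be illustrated with a minimal sketch. The function names `summarize` and `find_anomalies`, the z-score rule with a threshold of 2.0, and the `daily_sales` figures are all assumptions chosen for illustration; real data mining systems typically use more robust methods.

```python
from statistics import mean, stdev


def summarize(values):
    """Summarization: reduce a list of numbers to a few descriptive statistics."""
    return {
        "count": len(values),
        "mean": mean(values),
        "stdev": stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def find_anomalies(values, threshold=2.0):
    """Anomaly detection: flag values whose z-score exceeds `threshold`.

    The z-score rule is one simple, assumed criterion; it is not the only one.
    """
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical data: one day is far outside the usual range.
daily_sales = [102, 98, 105, 97, 101, 99, 103, 100, 250, 96]

print(summarize(daily_sales))
print(find_anomalies(daily_sales))
```

With this sample, only the value 250 lies more than two sample standard deviations from the mean, so it is the only value flagged.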

Data mining has been used successfully in many domains for decades, especially in business, decision support, and management. Data mining is also applied in medical and healthcare applications. In the digital humanities, data mining is employed in several areas, with text mining among the most prominent. Text mining, or text data mining, involves processes that facilitate gaining insights from text. The text mining process involves several steps, the first of which is structuring the data for subsequent analysis. This structuring involves parsing the input text, adding and removing linguistic features, and incorporating the structured text into a database. Patterns can then be derived from this structured data using data mining techniques or domain-specific approaches. The results are then evaluated and interpreted. There are several text mining operations. Text categorization assigns a document to a category and is a common application in library science. In text clustering, textual documents are grouped into clusters. Concepts and various entities or objects represented in the text can be extracted and delineated. Sentiment analysis uses natural language processing and methods from computational linguistics to determine subjective information and states in the text. Document summarization is the computational reduction of document data to obtain the most salient features and information from the original documents. Determining the relationships between named entities identified through named-entity recognition methods is another text mining task. Text mining also makes use of visualization and advanced analytical techniques. Beyond the digital humanities, important domain areas employing text mining include medicine, marketing, online media, (computational) sociology, and security.
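The structuring step and one text mining operation, text categorization, can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any particular tool: the stop-word list, the helper names `structure`, `cosine`, and `categorize`, the two categories, and the sample sentences are all hypothetical. Structuring here means lowercasing, tokenizing, and removing stop words to obtain term counts, and categorization picks the labelled example with the highest cosine similarity.

```python
import math
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list (a removed linguistic feature).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "on", "for"}


def structure(text):
    """Parse raw text into a structured form: a bag of term counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def categorize(text, labelled):
    """Assign `text` to the category whose example is most similar."""
    vec = structure(text)
    return max(labelled, key=lambda label: cosine(vec, labelled[label]))


# Hypothetical labelled examples, one per category.
labelled = {
    "poetry": structure("The sonnet and the ode are forms of lyric verse and rhyme"),
    "history": structure("The archive records the treaty, the war and the reign of the king"),
}

print(categorize("A newly found verse with an unusual rhyme", labelled))
```

The sample document shares the terms "verse" and "rhyme" with the poetry example and no terms with the history example, so it is assigned to "poetry". A realistic pipeline would store the structured text in a database and use many labelled documents per category.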

License

Creative Commons Attribution-ShareAlike 4.0 International License
