✦ LIBER ✦

Theory and Algorithms for Information Extraction and Classification in Textual Data Mining

✍ Scribed by Wu T.

Book ID: 127406799
Year: 2003
Tongue: English
Weight: 89 KB
Category: Library

✦ Synopsis

Regular expressions can be used as patterns to extract features from semi-structured and narrative text [8]. For example, in police reports a suspect's height might be recorded as "{CD} feet {CD} inches tall", where {CD} is the part-of-speech tag for a numeric value. The results in [1] show that regular expressions can outperform explicit expressions in some applications, such as Posting Act Tagging. Although much work has been done in the field of information extraction, relatively little has focused on the automatic discovery of regular expressions. Therefore, my Ph.D. research will focus on the automatic generation of reduced regular expressions (RREs) (defined in [8]) for use in Information Extraction (IE).

The learned reduced regular expressions can be used directly to extract features from free text, or they can be used to fill in templates in Eric Brill's Transformation-Based Learning (TBL) [2] framework. The original templates in TBL are explicit expressions, which are weaker than reduced regular expressions. I propose an innovative enhancement to TBL termed "Error-Driven Boolean-Logic-Rule-Based Learning" (BLogRBL) [9], which is strictly more powerful than TBL [2]. As in Brill's method, rules are automatically derived from templates during learning. It differs from Brill's technique in that rules take the form of complex expressions of combinational logic. Therefore, the final contribution of my Ph.D. thesis will be a framework that combines regular expression discovery with BLogRBL.

A necessary component of this research is a study of the various biases inherent in the use of reduced regular expressions in IE. The purpose of this work is to determine the language biases, search biases, and overfitting biases in the RRE discovery and BLogRBL algorithms.
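The height pattern quoted in the synopsis, "{CD} feet {CD} inches tall", can be illustrated with a minimal Python sketch. It is not the RRE formalism defined in [8], nor the discovery algorithm proposed in the thesis. It only shows how a fixed sequence of part-of-speech slots and literal words might be matched against tagged text. The function names `parse_pattern` and `find_matches`, the (word, tag) token representation, and the sample sentence with its tags are all assumptions made for illustration.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag); hypothetical representation


def parse_pattern(pattern: str) -> List[Tuple[str, str]]:
    """Split a pattern into ("tag", name) slots and ("word", literal) elements."""
    elements = []
    for piece in pattern.split():
        if piece.startswith("{") and piece.endswith("}"):
            elements.append(("tag", piece[1:-1]))   # e.g. {CD} matches any numeric token
        else:
            elements.append(("word", piece.lower()))  # literal word, compared case-insensitively
    return elements


def find_matches(pattern: str, tokens: List[Token]) -> List[List[str]]:
    """Return, for each match, the words captured by the tag slots."""
    elements = parse_pattern(pattern)
    matches = []
    # Slide a window of the pattern's length over the tagged tokens.
    for start in range(len(tokens) - len(elements) + 1):
        window = tokens[start:start + len(elements)]
        captured = []
        for (kind, value), (word, tag) in zip(elements, window):
            if kind == "tag":
                if tag != value:
                    break
                captured.append(word)
            elif word.lower() != value:
                break
        else:
            # No element failed, so the whole window matches the pattern.
            matches.append(captured)
    return matches


# Hypothetical tagged sentence; the tags are assumed, not produced by a real tagger.
tagged = [
    ("The", "DT"), ("suspect", "NN"), ("is", "VBZ"),
    ("5", "CD"), ("feet", "NNS"), ("11", "CD"), ("inches", "NNS"), ("tall", "JJ"),
]
print(find_matches("{CD} feet {CD} inches tall", tagged))  # [['5', '11']]
```

Matching on tags rather than on the digits themselves lets one pattern cover every numeric height without listing each value.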

📜 SIMILAR VOLUMES

Textual data mining for industrial knowledge management and text classification: A business oriented approach

✍ N. Ur-Rahman; J.A. Harding 📂 Article 📅 2012 🏛 Elsevier Science 🌐 English ⚖ 537 KB

A comparative analysis of classification algorithms in data mining for accuracy, speed and robustness

✍ Dogan, Neslihan; Tanrikulu, Zuhal 📂 Article 📅 2012 🏛 Springer US 🌐 English ⚖ 808 KB

A semi-supervised active learning algorithm for information extraction from textual data

✍ Tianhao Wu; William M. Pottenger 📂 Article 📅 2005 🏛 John Wiley and Sons 🌐 English ⚖ 141 KB

In this article we present a semi‐supervised active learning algorithm for pattern discovery in information extraction from textual data. The patterns are reduced regular expressions composed of various characteristics of features useful in information extraction. Our major contribution

A supervised clustering and classification algorithm for mining data with mixed variables

✍ Xiangyang Li; Nong Ye 📂 Article 📅 2006 🏛 IEEE 🌐 English ⚖ 212 KB

Information content in textual data: Revisited for Arabic text

✍ Hegazi, Nadia; Ali, Nabil; Abed, Ehsan 📂 Article 📅 1987 🏛 John Wiley and Sons 🌐 English ⚖ 322 KB

Arabic, as opposed to English, is a highly redundant language due to its morphological nature. A study was done to measure this redundancy and compare it to its respective values in English. Samples of books, newspapers, and social magazines were used to measure the entropy of the Arabic language us

Multimedia Information Extraction (Advances in Video, Audio, and Imagery Analysis for Search, Data Mining, Surveillance, and Authoring) || Extracting Information from Human Behavior

✍ Maybury, Mark T. 📂 Article 📅 2012 🏛 John Wiley & Sons, Inc. 🌐 English ⚖ 162 KB

The advent of increasingly large consumer collections of audio (e.g., iTunes), imagery (e.g., Flickr), and video (e.g., YouTube) is driving a need not only for multimedia retrieval but also information extraction from and across media. Furthermore, industrial and government collections fuel requirem