๐”– Bobbio Scriptorium
โœฆ   LIBER   โœฆ

Converting numerical classification into text classification

Scribed by Sofus A. Macskassy; Haym Hirsh; Arunava Banerjee; Aynur A. Dayanik


Publisher: Elsevier Science
Year: 2003
Language: English
File size: 243 KB
Volume: 143
Category: Article
ISSN: 0004-3702


Synopsis


Consider a supervised learning problem in which examples contain both numerical- and text-valued features. To use traditional feature-vector-based learning methods, one could treat the presence or absence of a word as a Boolean feature and use these binary-valued features together with the numerical features. However, the use of a text-classification system on such data is a bit more problematic: in the most straightforward approach each number would be considered a distinct token and treated as a word. This paper presents an alternative approach for the use of text classification methods for supervised learning problems with numerical-valued features in which the numerical features are converted into bag-of-words features, thereby making them directly usable by text classification methods. We show that even on purely numerical-valued data the results of text classification on the derived text-like representation outperform the more naive numbers-as-tokens representation and, more importantly, are competitive with mature numerical classification methods such as C4.5, Ripper, and SVM. We further show that on mixed-mode data adding numerical features using our approach can improve performance over not adding those features.
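
The synopsis does not spell out how the conversion works. As a rough illustration only, the sketch below contrasts the numbers-as-tokens representation with one possible threshold-based encoding. The function names, the token naming scheme, and the cut points are all hypothetical and are not taken from the paper.

```python
def numbers_as_tokens(name, value):
    """Naive representation: every distinct number becomes its own token."""
    return [f"{name}={value}"]


def threshold_tokens(name, value, cut_points):
    """Emit one token per cut point, recording which side the value falls on.

    Nearby values share most of their tokens, so a bag-of-words learner can
    generalize across them. The cut points are assumed to be given; how they
    would be chosen is outside the scope of this sketch.
    """
    tokens = []
    for i, cut in enumerate(cut_points):
        side = "le" if value <= cut else "gt"
        tokens.append(f"{name}_{side}_c{i}")
    return tokens


def example_to_bag(numeric_features, cut_points_by_feature, text=""):
    """Build one bag of words from free text plus encoded numeric features."""
    bag = text.split()
    for name, value in numeric_features.items():
        bag.extend(threshold_tokens(name, value, cut_points_by_feature[name]))
    return bag


if __name__ == "__main__":
    cuts = {"age": [20, 40, 60]}  # hypothetical cut points
    print(numbers_as_tokens("age", 37))
    print(numbers_as_tokens("age", 39))
    print(example_to_bag({"age": 37}, cuts, text="cheap flights"))
    print(example_to_bag({"age": 39}, cuts, text="cheap flights"))
```

Under the naive encoding, 37 and 39 produce unrelated tokens. Under the threshold encoding they produce identical tokens, which is the kind of shared evidence a text classifier can exploit.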


SIMILAR VOLUMES


Passage detection using text classificat
Saket Mengle; Nazli Goharian | Article | 2009 | John Wiley and Sons | English | 589 KB

Abstract: Passages can be hidden within a text to circumvent their disallowed transfer. Such release of compartmentalized information is of concern to all corporate and governmental organizations. Passage retrieval is well studied; we posit, however, that passage detection is not. Passage retriev