Text categorization
What Is Text Categorization?
Text categorization is the task of assigning a document or text passage to one or more predefined categories based on its content. It is a foundational problem in information retrieval and natural language processing, with applications wherever large volumes of unstructured text must be organized, routed, or filtered at scale. The field draws from statistical learning theory, computational linguistics, and data analysis, and its methods have evolved from rule-based keyword matching through probabilistic classifiers to neural architectures trained on billions of tokens.
The dominant approach to text categorization is supervised machine learning: a classifier is trained on a corpus of preclassified examples and then applied to unseen documents. As established in the ACM Computing Surveys treatment of machine learning in automated text categorization, the key technical challenges are document representation (converting variable-length text into fixed-dimensional feature vectors), algorithm selection (choosing a learner well-suited to the feature space and label structure), and evaluation (measuring performance across categories of varying frequency).
Feature Representation
Before a classifier can operate on text, documents must be converted into numerical representations. The bag-of-words model disregards word order and represents each document as a vector of raw term frequencies or of term frequency-inverse document frequency (TF-IDF) weights. TF-IDF down-weights terms that occur in most documents of the collection and therefore carry little discriminative information. It remains a strong baseline for many categorization tasks despite its simplicity. Dimensionality reduction techniques, including latent semantic analysis (LSA) and non-negative matrix factorization (NMF), project high-dimensional term-document matrices into lower-dimensional latent spaces that capture topical co-occurrence structure. More recent representation approaches use word embeddings (dense vector representations trained on large corpora) and contextual encoders based on the transformer architecture, in which the representation of each token depends on the other tokens in its surrounding context. An IEEE survey on unsupervised deep feature representation for text categorization reviews how these learned representations generally outperform handcrafted TF-IDF features on standard benchmarks.
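The TF-IDF weighting described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the function name tfidf_vectors, the toy documents, and the choice of unsmoothed IDF (log of N over document frequency) with length-normalized term frequency are all assumptions made here. Libraries typically apply smoothing and vector normalization on top of this.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) for a list of tokenized documents.

    Hypothetical helper for illustration; uses unsmoothed IDF.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    # A term that appears in every document gets IDF log(1) = 0.
    idf = {t: math.log(n_docs / df[t]) for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        length = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append([tf[t] / length * idf[t] for t in vocab])
    return vocab, vectors

# Toy corpus (assumed data, whitespace-tokenized).
docs = [
    "the market rallied on strong earnings".split(),
    "the team won the match".split(),
    "the chip maker reported strong earnings".split(),
]
vocab, vectors = tfidf_vectors(docs)
```

In this toy corpus "the" occurs in every document and so receives weight zero, while "market", which occurs in only one document, is weighted more heavily than "strong", which occurs in two.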
Classification Methods
The classification step maps document representations to category labels. Naive Bayes classifiers apply Bayes' theorem under the assumption that features are conditionally independent given the category. They are computationally inexpensive and perform surprisingly well on high-dimensional sparse feature spaces, making them a common baseline for email filtering and topic routing. Support vector machines (SVMs) find the maximum-margin hyperplane separating category members from non-members in the feature space. They have historically shown strong performance on text because many categorization problems are linearly separable, or nearly so, in high-dimensional TF-IDF space. Deep learning classifiers, including convolutional neural networks over word embeddings and transformer-based encoders such as BERT fine-tuned on category-labeled datasets, now achieve the best reported results on most benchmark corpora. A performance analysis of machine learning and deep learning models for text classification compares these families across accuracy, training cost, and generalization, finding that transformer models offer the highest accuracy at substantially greater computational expense.
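To make the Naive Bayes description concrete, the following is a minimal sketch of a multinomial Naive Bayes classifier over token counts, written with the standard library only. The class name MultinomialNB, the add-one (Laplace) smoothing parameter alpha, the decision to skip tokens unseen in training, and the toy spam/ham data are assumptions for illustration rather than anything prescribed by the text.

```python
import math
from collections import Counter, defaultdict

class MultinomialNB:
    """Illustrative multinomial Naive Bayes with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength (assumed default: add-one)

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(doc)
        self.vocab = {t for cnt in self.term_counts.values() for t in cnt}
        self.total_terms = {c: sum(cnt.values())
                            for c, cnt in self.term_counts.items()}
        return self

    def predict(self, doc):
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for c in self.class_counts:
            # Log prior plus the sum of per-token log likelihoods; summing
            # is where the conditional independence assumption is used.
            score = math.log(self.class_counts[c] / n_docs)
            denom = self.total_terms[c] + self.alpha * vocab_size
            for t in doc:
                if t in self.vocab:  # ignore tokens never seen in training
                    score += math.log(
                        (self.term_counts[c][t] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = c, score
        return best_label

# Toy training set (assumed data).
train_docs = [
    "win cash prize now".split(),
    "claim your free prize".split(),
    "meeting agenda for monday".split(),
    "project status report attached".split(),
]
train_labels = ["spam", "spam", "ham", "ham"]
clf = MultinomialNB().fit(train_docs, train_labels)
spam_guess = clf.predict("free cash prize".split())
ham_guess = clf.predict("monday status meeting".split())
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters for long documents.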
Evaluation and Multi-label Categorization
Text categorization is evaluated with precision, recall, and F1 score computed per category, then aggregated by macro-averaging (equal weight to each category) or micro-averaging (equal weight to each individual classification decision). Macro-averaging is sensitive to performance on rare categories; micro-averaging reflects overall performance and is dominated by the most frequent categories. Multi-label categorization, in which a document may belong to several categories simultaneously (a news article covering both economics and technology, for example), requires adapted training objectives and evaluation metrics such as label-based F1 and subset accuracy.
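The difference between the two averaging schemes is easiest to see on an imbalanced example. The sketch below assumes the single-label case and uses hypothetical helper names (f1, macro_micro_f1) and made-up labels; it computes per-category true positives, false positives, and false negatives, then averages the per-category F1 scores (macro) or pools the counts before computing F1 (micro).

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_micro_f1(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    counts = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        counts[c] = (tp, fp, fn)
    # Macro: average the per-category F1 scores, one vote per category.
    macro = sum(f1(*counts[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all categories, then compute one F1.
    micro = f1(*(sum(col) for col in zip(*counts.values())))
    return macro, micro

# Assumed toy data: a classifier that always predicts the frequent category.
y_true = ["sports"] * 8 + ["law"] * 2
y_pred = ["sports"] * 10
macro, micro = macro_micro_f1(y_true, y_pred)
```

Here the classifier never predicts the rare category, so its F1 for "law" is zero. The micro-averaged score stays at 0.8, while the macro-averaged score drops to about 0.44, which illustrates why macro-averaging is the more informative figure when rare categories matter.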
Applications
Text categorization has applications in a wide range of fields, including:
- Email spam and phishing detection
- News article topic routing and feed personalization
- Legal document classification and case law retrieval
- Customer support ticket triage and routing
- Content moderation and policy violation detection on digital platforms