Extracting keywords from text

Keywords are used in several fields as a way to summarize and categorize natural text, such as papers, essays, news articles or blog posts. Keywords, or tags, allow a quick and easy categorization of available information, thus facilitating its access and search. However, while many existing texts have been assigned their own keywords, either by their authors or by third parties, the vast majority of the information out there is currently untagged and, in many cases, virtually inaccessible. This begs the question: how difficult is it to develop a system that performs automatic and reliable tagging of written documents?

This post is a quick and dirty attempt to provide an answer to the above question. While the approach presented here is certainly not the most sophisticated, it delivers some encouraging preliminary results and provides a benchmark against which more complex solutions can be compared.

The post is organized into the following sections:

  • Overview
  • Data Loading and Preprocessing
  • Metric Definition
  • Natural Text Processing
    • Defining the Keywords Set
    • Building the Vocabulary
    • Building the TF-IDF matrix
  • Fit and Predict
    • Creating a Multi Label Binarizer
    • Building the Classifier
  • Results
    • Using the Title
    • Using the Body
  • Discussion
  • Appendix

The post is intended as a walkthrough of the methodology and provides a complete codebase to implement your own model.

Overview

This effort aims at extracting a set of keywords from individual documents of the “Ask Ubuntu” database, freely available on Stack Exchange, specifically the subset of questions collected from January 2013 to June 2017(*). The dataset comes with a set of fields, of which only the title, body and keywords are used for this analysis.

The implementation pipeline is fairly simple. The title and body of each document undergo simple NLP preprocessing (e.g. removal of standard stop words and punctuation); title and body TF-IDF matrices are then produced and fed to a One-Vs-Rest classifier, which outputs, for each document, a list of probabilities, each corresponding to one individual keyword. Using a probability threshold - which can also be learned from data - each document is ultimately assigned a set of keywords.

The performance of the model is evaluated using the Mean-F-Score.

(*) Note: Due to computational limitations, the dataset is further reduced in the analysis presented here.

Data Loading and Preprocessing

Data processing consists of dataset loading (from a set of files previously downloaded via a Stack Exchange query) and train-test splitting. For the latter, the data from January 1st, 2017 to June 30th, 2017 is used as the test set. This choice, as opposed to a random split, is motivated by the fact that it is considered good practice to use the most recent data as the test set for applications based on time-stamped data.

def process_source_df(s_df):
    # Keep only the relevant fields and parse the creation date.
    cols = ['Id', 'UserId', 'Title', 'Body', 'Tags', 'CreationDate']
    s_df = s_df[cols].copy()
    s_df['CreationDate'] = pd.to_datetime(s_df['CreationDate'])
    # Drop rows with missing user ids, then cast the remaining ids to integers.
    s_df = s_df[s_df['UserId'].notna()]
    s_df['UserId'] = s_df['UserId'].astype(int)
    # Tags are stored as '<tag1><tag2>...': strip the outer brackets and split on '><'.
    s_df['Tags'] = s_df['Tags'].apply(lambda tag: re.split('><', str(tag).strip('><')))
    return s_df
#end

def read_input():
    # The dataset is split across ten csv files named 'au_01.csv' ... 'au_10.csv'.
    pref = 'au_'
    for sid in range(1, 11):
        fname = source_path + pref + str(sid).zfill(2) + '.csv'
        s_df = pd.read_csv(fname)
        s_df = process_source_df(s_df)
        if sid==1:
            df = s_df
        else:
            df = pd.concat([df, s_df]).reset_index(drop=True)
        #end
    #end
    return df.dropna(how='any')
#end
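
The split helper used later in the Results section is not shown above; a minimal sketch consistent with the cutoff described in this section (the exact boundary handling is an assumption) could be:

def get_train_test(df):
    # Time-based split: questions created on or after January 1st, 2017 form the test set,
    # earlier questions form the training set (a sketch; the boundary handling is an assumption).
    cutoff = pd.Timestamp('2017-01-01')
    train_df = df[df['CreationDate'] < cutoff].reset_index(drop=True)
    test_df = df[df['CreationDate'] >= cutoff].reset_index(drop=True)
    return train_df, test_df
#end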

Metric Definition

It is good practice to start developing any algorithm with a solid idea of how to assess its performance. For this reason, I define the performance metric before diving into the description of the algorithm. As mentioned above, I use the Mean-F-Score, defined as follows.
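
In formulas, denoting by $T$ the true keyword set and by $P$ the predicted keyword set of a document,

$$p = \frac{|T \cap P|}{|P|}, \qquad r = \frac{|T \cap P|}{|T|}, \qquad F = \frac{2\,p\,r}{p + r},$$

and the Mean-F-Score is the average of $F$ over all documents in the test set.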

def f_scoring(true_ls, pred_ls):
    # Precision and recall are computed on the keyword sets of a single document.
    n_matches = len(set(true_ls) & set(pred_ls))
    p = 1.0 * n_matches / len(pred_ls)   # precision: fraction of predicted keywords that are correct
    r = 1.0 * n_matches / len(true_ls)   # recall: fraction of true keywords that are recovered
    if (p + r)==0:
        f_score = 0
    else:
        f_score = 2 * (p * r) / (p + r)
    #end
    return f_score
#end
    
def mean_f_scoring(true_ls_ls, pred_ls_ls):
    N = 1.0 * len(true_ls_ls)
    mean_f_score = 0.0
    for t_ls, p_ls in zip(true_ls_ls, pred_ls_ls):
        mean_f_score += f_scoring(t_ls, p_ls) / N
    #end
    return mean_f_score
#end
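
As a quick sanity check, applying the metric to two made-up documents, each with one correct keyword out of two, gives a score of about 0.67:

true_kw = [['grub2', 'boot'], ['wireless']]
pred_kw = [['grub2'], ['wireless', 'networking']]
print(mean_f_scoring(true_kw, pred_kw))   # both documents score 2/3, so the mean is ~0.67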

Natural Text Processing

Defining the Keywords Set

Each document can potentially be assigned any combination of the unique keywords present in the training set, which therefore defines the candidate keyword set.

def get_unique_kw(df):
    kw_ls = list()
    kw_ls_ls = list(df['Tags'])
    for kwls in kw_ls_ls:
        kw_ls += kwls
    #end
    return list(set(kw_ls))
#end
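
For example, assuming a train_df like the one built in the Results section below, the candidate keyword set is obtained with:

unique_kw = get_unique_kw(train_df)   # union of all tags seen in the training data
print(len(unique_kw))                 # number of candidate keywords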

Building the Vocabulary

To produce the TF-IDF matrices, one has to define a vocabulary. This, in turn, requires a corpus to train on. Here I use the train set as the corpus, define a CountVectorizer using the default scikit-learn class, and extract the corresponding vocabulary, removing the English stop words (from NLTK) plus punctuation. The vectorizer also considers bi-grams, and the size of the vocabulary is limited: the most common terms, as well as the least common ones, are neglected. The thresholds are arbitrary (for example, terms that appear in more than 60% of the titles are neglected), but could in principle be learned from data as well.

def build_vocabulary(df, field='Body'):
    stop_words = stopwords.words('english') + list(punctuation)
    # Different document-frequency thresholds are used for the (short) titles and the (long) bodies.
    if field=='Body':
        vect = CountVectorizer(stop_words=stop_words, min_df=0.01, max_df=0.5, ngram_range=(1,2))
    elif field=='Title':
        vect = CountVectorizer(stop_words=stop_words, min_df=0.001, max_df=0.6, ngram_range=(1,2))
    #end
    vect.fit(df[field])
    # The final vocabulary is the union of the terms kept by the vectorizer and the known keywords.
    vocab_fd_ls = list(vect.vocabulary_.keys())
    vocab_kw_ls = get_unique_kw(df)
    vocabulary = list(set(vocab_fd_ls + vocab_kw_ls))
    return vocabulary
#end

Building the TF-IDF matrix

Using the vocabulary and the text, the TF-IDF matrices are quickly built as follows.

def create_tfidf_smx(trn_text, tst_text, vocab):
    # Raw term counts restricted to the given vocabulary...
    vect = CountVectorizer(vocabulary=vocab)
    freq_trn = vect.fit_transform(trn_text)
    freq_tst = vect.transform(tst_text)
    # ...then rescaled to TF-IDF weights (the transformer is fit on the training set only).
    tfidf = TfidfTransformer()
    tfidf_trn_smx = tfidf.fit_transform(freq_trn)
    tfidf_tst_smx = tfidf.transform(freq_tst)
    return tfidf_trn_smx, tfidf_tst_smx
#end
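
As a quick sanity check (using the variable names introduced in the Results section below), both returned objects are sparse matrices with one row per document and one column per vocabulary entry:

ttle_trn_smx, ttle_tst_smx = create_tfidf_smx(train_df['Title'], test_df['Title'], vocabulary_ttle)
print(ttle_trn_smx.shape)   # (number of training documents, len(vocabulary_ttle))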

Fit and Predict

Creating a Multi Label Binarizer

This section creates a multi-label binarizer that turns the list of keywords assigned to each document into a vector of 0’s and 1’s. This step, which requires the sequential application of a Label Encoder and a Multi Label Binarizer, produces the binary target vectors fed to the fit method of the classifier.
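
To make the transformation concrete, here is a toy illustration with made-up tags:

# Toy illustration (made-up tags): the LabelEncoder maps tags to integers,
# the MultiLabelBinarizer then maps each list of integers to a 0/1 vector.
le = LabelEncoder().fit(['boot', 'grub2', 'wireless', 'other'])
encoded = [le.transform(t).tolist() for t in [['boot', 'grub2'], ['wireless']]]   # [[0, 1], [3]]
mlb = MultiLabelBinarizer()
print(mlb.fit_transform(encoded))
# [[1 1 0]
#  [0 0 1]]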

def build_le(train_df):
    # 'other' is an extra placeholder label, assigned to documents that end up with no predicted keyword.
    LE = LabelEncoder()
    LE.fit(get_unique_kw(train_df) + ['other'])
    return LE
#end

def build_le_mlb_trg(train_df):
    LE = build_le(train_df)
    target_lb = [LE.transform(tags).tolist() for tags in list(train_df['Tags'])]
    MLB = MultiLabelBinarizer()
    target_mlb = MLB.fit_transform(target_lb)
    return target_mlb, LE, MLB
#end

def ml_binarize(LE, MLB, kw_ls_ls):
    # Encode each keyword list with the label encoder before binarizing.
    return MLB.transform([LE.transform(kw_ls).tolist() for kw_ls in kw_ls_ls])
#end

def ml_invert_binarize(LE, MLB, mlb_ar):
    # Invert the binarization: rows with no active keyword fall back to the 'other' placeholder,
    # then the label-encoded integers are mapped back to keyword strings.
    le_ar = MLB.inverse_transform(mlb_ar)
    other_lb = LE.transform(['other'])
    kw_le = [np.array(list(kw)) if kw!=tuple() else np.array(list(other_lb)) for kw in le_ar]
    return [list(LE.inverse_transform(kw_ar)) for kw_ar in kw_le]
#end

Building the Classifier

A set of functions is defined below to build and use a One-Vs-Rest classifier, implemented by wrapping a Random Forest classifier in the OneVsRestClassifier class from scikit-learn. This is equivalent to fitting an independent classifier for each keyword, extracting the corresponding probabilities and consolidating everything into a vector of probabilities, each representing the probability that the corresponding keyword is associated with the given document.

def set_max_kw(prob_ar):
    # Force the highest-probability keyword of each document to pass the threshold,
    # so that every document receives at least one keyword.
    one_kw_ar = prob_ar.copy()
    one_kw_ar[np.arange(len(prob_ar)), prob_ar.argmax(1)] = 0.99
    return one_kw_ar
#end

def get_pred_lb(prob_ar, threshold):
    max_kw_ar = set_max_kw(prob_ar)
    pred_lb = np.zeros_like(max_kw_ar)
    pred_lb[max_kw_ar>=threshold] = 1
    return pred_lb
#end

def fit_clf(tfidf_smx, target_mlb):
    CLF = OneVsRestClassifier(RandomForestClassifier())
    CLF.fit(tfidf_smx, target_mlb)
    return CLF
#end

def predict_clf(CLF, tfidf_tst_smx, LE, MLB, thresh):
    tst_preds_prob = CLF.predict_proba(tfidf_tst_smx)
    tst_pred_lb_ar = get_pred_lb(tst_preds_prob, thresh)
    tst_pred_kw_ls = ml_invert_binarize(LE, MLB, tst_pred_lb_ar)
    return tst_pred_kw_ls
#end

def fit_predict_clf(tfidf_trn_smx, tfidf_tst_smx, target_mlb, LE, MLB, thresh):
    # Convenience wrapper around fit_clf and predict_clf.
    CLF = fit_clf(tfidf_trn_smx, target_mlb)
    tst_pred_kw_ls = predict_clf(CLF, tfidf_tst_smx, LE, MLB, thresh)
    return tst_pred_kw_ls, CLF
#end

Results

The following steps apply the implementation described above. Separate results are provided for using just the title or just the body of each document, as provided in the Ask Ubuntu dataset. As of now, the two implementations are independent; a simple next step would be to combine them, possibly by learning the respective weights from data.

Notice also that the classifier outputs a set of probabilities for each document. A simple analysis (not shown here) indicated that the optimal threshold for assigning a keyword to a given document is 0.2, so this is the value used throughout this document.
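
As a side note, the threshold could be learned rather than hard-coded. The sketch below (where CLF, val_smx and val_true_kw denote a fitted classifier, a validation TF-IDF matrix and the corresponding true keyword lists, none of which are built in this post) illustrates the idea.

# Hypothetical threshold sweep on a held-out validation split (CLF, val_smx and
# val_true_kw are assumed to exist; they are not defined elsewhere in this post).
val_prob = CLF.predict_proba(val_smx)
best_t, best_score = 0.2, -1.0
for t in np.arange(0.05, 0.55, 0.05):
    val_pred_kw = ml_invert_binarize(LE, MLB, get_pred_lb(val_prob, t))
    score = mean_f_scoring(val_true_kw, val_pred_kw)
    if score > best_score:
        best_t, best_score = t, score
    #end
#end

With the fixed threshold of 0.2, the full pipeline is run as follows.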

source_path = './source_dataset/'
path = './'
s = 50000 #this value is necessary to downsize the training set for computational limitations
df = read_input()

train_df, test_df = get_train_test(df)
vocabulary_body = build_vocabulary(train_df, field='Body')
vocabulary_ttle = build_vocabulary(train_df, field='Title')

body_tfidf_trn_smx, body_tfidf_tst_smx = create_tfidf_smx(train_df['Body'].tail(s), test_df['Body'], vocabulary_body)
ttle_tfidf_trn_smx, ttle_tfidf_tst_smx = create_tfidf_smx(train_df['Title'].tail(s), test_df['Title'], vocabulary_ttle)
target_mlb, LE, MLB = build_le_mlb_trg(train_df.tail(s))

Using the Title

threshold = 0.2
tst_tt_pred_kw, CLF = fit_predict_clf(ttle_tfidf_trn_smx, ttle_tfidf_tst_smx, target_mlb, LE, MLB, threshold)

Using the Body

threshold = 0.2
tst_bd_pred_kw, CLF = fit_predict_clf(body_tfidf_trn_smx, body_tfidf_tst_smx, target_mlb, LE, MLB, threshold)
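
The scores reported in the Discussion would then be computed by comparing these predictions with the true tags of the test set, e.g. mean_f_scoring(list(test_df['Tags']), tst_tt_pred_kw). As mentioned above, a natural next step would be to blend the title and body probabilities; a minimal sketch, assuming the two fitted classifiers are kept under separate names (CLF_ttle and CLF_body, which the code above does not do since the single name CLF is reused), could look like this:

# Hypothetical blend of title and body probabilities; CLF_ttle and CLF_body are assumed
# to be the two fitted classifiers (the code above reuses the single name CLF for both).
w = 0.5   # blending weight, which could itself be learned from data
prob_blend = (w * CLF_ttle.predict_proba(ttle_tfidf_tst_smx)
              + (1.0 - w) * CLF_body.predict_proba(body_tfidf_tst_smx))
tst_blend_pred_kw = ml_invert_binarize(LE, MLB, get_pred_lb(prob_blend, threshold))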

Discussion

The approach above attains a Mean-F-Score of 0.32 using just the title and 0.30 using just the body of the question. The result is encouraging, especially because it is obtained with a small training set, without considering title and body simultaneously, and with very simple preprocessing of the dataset; it thus offers a good benchmark for further development and refinements.

Appendix

Imported Libraries

import numpy as np
import pandas as pd
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from string import punctuation
from nltk.corpus import stopwords
from nltk import word_tokenize
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestClassifier