Use Skipgrams with sklearn CountVectorizer and TfidfVectorizer

sklearn doesn’t have a built-in implementation of skipgrams. This post covers how to use the skipgrams function from nltk with sklearn’s CountVectorizer and TfidfVectorizer.
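As a quick refresher, a k-skip-n-gram is an n-gram whose words may be up to k tokens apart in the original text. Here is a minimal sketch of what nltk's skipgrams yields (the token list is just a toy example):

from nltk.util import skipgrams

# 1-skip bigrams: pairs of words with at most one token skipped between them
print(list(skipgrams(["the", "cat", "sat"], n=2, k=1)))
# expected output: [('the', 'cat'), ('the', 'sat'), ('cat', 'sat')]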

We are going to create a skipgram tokenizer that can be passed to the tokenizer parameter of the vectorizer.

First, create a basic tokenizer that splits the original string into tokens. This tokenizer can be as simple as .split(), or a function that filters out non-alphabetic characters, etc. We can use a tokenizer like the one below.

import re

def basic_tokenize(tweet):
    # keep only letters and hashtags; collapse everything else into spaces
    tweet = " ".join(re.split("[^a-zA-Z#]+", tweet)).strip()
    return tweet.split()
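As a quick sanity check (the input is just a made-up tweet), it keeps words and hashtags and drops everything else:

print(basic_tokenize("Loving #nlp & sklearn!"))
# expected: ['Loving', '#nlp', 'sklearn']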

The tokens it produces will be fed to the skipgram tokenizer to get skipgrams.

Below is the function that creates skipgrams from a string, given n and k.

from nltk.util import skipgrams

def skipgram_tokenize(tweet, n=None, k=None, include_all=True):
    tokens = basic_tokenize(tweet)
    if include_all:
        # collect skipgrams for every skip distance from 0 up to k
        # (nltk's skipgrams(tokens, n, i) already includes skipgrams with
        # fewer than i skips, so the set below removes the duplicates)
        result = []
        for i in range(k + 1):
            result += list(skipgrams(tokens, n, i))
    else:
        # all skipgrams with at most k skips
        result = list(skipgrams(tokens, n, k))
    # deduplicate: each skipgram counts at most once per tweet
    return set(result)

def make_skip_tokenize(n, k, include_all=True):
    # bind n and k so the vectorizer can call the tokenizer with just the text
    return lambda tweet: skipgram_tokenize(tweet, n=n, k=k, include_all=include_all)
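For instance, with a made-up string, a tokenizer for 1-skip bigrams returns a set of word tuples:

tokenize = make_skip_tokenize(n=2, k=1)
print(tokenize("the cat sat"))
# expected (set order may vary): {('the', 'cat'), ('the', 'sat'), ('cat', 'sat')}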

The tokenizer can be used with a vectorizer by setting its tokenizer param, as shown below.

## using 3-skip bigrams
# other_exclusions is a custom stop word list defined elsewhere
vectorizer_3skipbigram = TfidfVectorizer(stop_words=other_exclusions,
                                         tokenizer=make_skip_tokenize(n=2, k=3))
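Putting it all together, here is a minimal end-to-end sketch (the sample documents are made up for illustration). One caveat: sklearn applies stop_words to the tokenizer's output, and since that output consists of tuples rather than strings, a list of stop word strings will not filter them; if stop word removal matters, it is safer to do it inside basic_tokenize. The sketch below therefore omits stop_words:

from sklearn.feature_extraction.text import TfidfVectorizer

# made-up example documents
docs = ["the cat sat on the mat", "the dog sat on the log"]

vectorizer = TfidfVectorizer(tokenizer=make_skip_tokenize(n=2, k=3))
X = vectorizer.fit_transform(docs)

print(X.shape)                             # (2, number of distinct skipgrams)
print(sorted(vectorizer.vocabulary_)[:3])  # vocabulary keys are skipgram tuples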

Happy hacking skipgrams with sklearn. Leave a comment if you have questions.
