
Word Embeddings and Keras

Hi! This is my first attempt to publish a Jupyter notebook to WordPress. Jupyter notebooks are an increasingly popular way to combine code, results and commentary in one viewable document. Notebooks use ‘kernels’ as interpreters for scripting languages; so far, I’ve seen Python, Julia and R kernels.
(The full notebook is over at GitHub. WordPress is bailing on me.)

In this post, I’ll be exploring Keras, GloVe word embeddings, deep learning and XGBoost (see the full code). This is a playground, nothing new, since I’ve pulled about 75% of it from all over the web. I’ll be dropping references here and there so you can also enjoy your own playground.

That said, let’s get started! First off, some boilerplate code.

“The Python Preamble”

In this section, most of the code is pulled from the Keras examples on GitHub.
https://github.com/fchollet/keras/tree/master/examples

In [1]:
import os
import numpy as np
np.random.seed(1337)

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical
from keras.layers import Dense, Input, Flatten
from keras.layers import Conv1D, MaxPooling1D, Embedding
from keras.models import Model
import sys

%pylab inline
Using Theano backend.
Using gpu device 0: GeForce GTX 960 (CNMeM is disabled, cuDNN 5005)
In [2]:
BASE_DIR = '.'
GLOVE_DIR = BASE_DIR + '/glove/'
MAX_SEQUENCE_LENGTH = 1000
MAX_NB_WORDS = 20000
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.2

Introducing GloVe and Word Embeddings

Global Vectors for Word Representation (GloVe) is an unsupervised learning algorithm for obtaining vector representations of words. It basically computes word co-occurrence statistics from a large corpus. This differs from Word2Vec’s skip-gram and CBOW models, which are trained on local context windows, predicting a word from its surrounding words (or vice versa).

Ultimately though, GloVe and Word2Vec are both concerned with producing word embeddings. The goal is to find a high-dimensional vector representation for each word. This is different from BOW models, which yield very sparse matrices with few attractive mathematical properties beyond serving as classification features. With word embeddings, practitioners can exploit structure and reason about word distances in this high-dimensional space. We’ll find, for example, that the word “cat” is close to other feline entities such as “tiger”, “kitten” and “lion”. We’ll also see that “cat” is closer to “kitten” than it is to “dog”. Mathematically:

\mathbf{D}(cat, kitten) < \mathbf{D}(cat, dog)

But, we can achieve something like this!

\mathbf{D}(cat, kitten) \simeq \mathbf{D}(dog, puppy)

Thus, we can represent differences:

(kitten - cat) + dog \simeq puppy

Anyway, these are just fun properties. The most important thing about word embeddings is that even when the new corpus is small, broader concepts can be brought to it from the pre-trained word embeddings. For example, if the new corpus mentions “economics”, its word vector carries properties related to broad ideas like “academia” and “social sciences”, as well as narrower concepts such as “supply” and “demand”. This is essentially transfer learning, an attractive concept from deep learning.
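To make these claims concrete, here is a minimal sketch you could run once the embeddings_index dictionary is loaded in In [3] below. The cosine distance helper and the word picks are mine, purely for illustration.

def cos_dist(u, v):
    # cosine distance = 1 - cosine similarity
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

cat, kitten, dog, puppy = [embeddings_index[w] for w in ['cat', 'kitten', 'dog', 'puppy']]

print 'D(cat, kitten) = {:.3f}'.format(cos_dist(cat, kitten))
print 'D(cat, dog)    = {:.3f}'.format(cos_dist(cat, dog))

# the analogy: (kitten - cat) + dog should land somewhere near 'puppy'
print 'D(kitten - cat + dog, puppy) = {:.3f}'.format(cos_dist(kitten - cat + dog, puppy))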

You’ll find a great tutorial for word embeddings in the Udacity course on NLP and deep learning.

https://www.udacity.com/course/deep-learning--ud730

Stanford has a good site for GloVe:

http://nlp.stanford.edu/projects/glove/

GloVe can be downloaded here:

http://nlp.stanford.edu/data/glove.6B.zip

Preprocessing

Next, we’ll be using GloVe to transform our words into high dimensional vectors.

In [3]:
print 'Indexing word vectors.'
embeddings_index = {}
f = open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'))
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

print 'Found {} word vectors.'.format(len(embeddings_index))
Indexing word vectors.
Found 400000 word vectors.
In [4]:
from sklearn.datasets import fetch_20newsgroups
newsgroups_trainval = fetch_20newsgroups(subset='train')
In [5]:
print 'Processing corpora'
texts = [article.encode('latin-1') for article in newsgroups_trainval.data]
labels_index = dict(zip(range(len(newsgroups_trainval.target_names)), newsgroups_trainval.target_names))
labels = newsgroups_trainval.target

print 'Found {} texts'.format(len(texts))
Processing corpora
Found 11314 texts
In [6]:
# quantize the text samples to a 2D integer tensor
# note that the values here are ultimately indexes to the actual words
tokenizer = Tokenizer(nb_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index
print 'Found {} unique tokens.'.format(len(word_index))
Found 134142 unique tokens.
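Just to see what these sequences look like, here is a quick sketch of my own (the sentence is made up; any word the tokenizer hasn’t seen, or whose rank falls outside the top MAX_NB_WORDS, is simply dropped):

# invert the word -> rank mapping so we can read a sequence back as words
index_word = dict((i, w) for w, i in word_index.items())
example = tokenizer.texts_to_sequences(['the quick brown fox jumps over the lazy dog'])[0]
print example
print [index_word[i] for i in example]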
In [7]:
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
sequences = None
texts = None
In [8]:
labels = to_categorical(np.asarray(labels))
print 'Shape of data tensor: ', data.shape
print 'Shape of label tensor: ', labels.shape

indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

X_train = data[:-nb_validation_samples]
y_train = labels[:-nb_validation_samples]
X_val = data[-nb_validation_samples:]
y_val = labels[-nb_validation_samples:]
Shape of data tensor:  (11314, 1000)
Shape of label tensor:  (11314, 20)
In [9]:
dict(word_index.items()[:10])
Out[9]:
{'3ds2scn': 66866,
 'l1tbk': 66868,
 'luanch': 66871,
 'mbhi8bea': 66870,
 'nunnery': 38557,
 'ree84': 47367,
 'sonja': 38558,
 'theoreticaly': 89142,
 'wax3': 111542,
 'woods': 8003}
In [10]:
print 'Preparing embedding matrix.'
nb_words = min(MAX_NB_WORDS, len(word_index))
embedding_matrix = np.zeros((nb_words + 1, EMBEDDING_DIM))
# word index is sorted by rank of the word
for word, i in word_index.items():
    # skip the word if the rank is above MAX_NB_WORDS
    # resulting in a matrix of exactly [MAX_NB_WORDS + 1, EMBEDDING_DIM]
    if i > MAX_NB_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
    # if word is not found in the embeddings index, then it's all 0.

# load pre-trained word embeddings into an Embedding layer
# note that we set trainable = False as to keep the embeddings fixed
embedding_layer = Embedding(nb_words + 1, EMBEDDING_DIM, weights=[embedding_matrix],
                           input_length= MAX_SEQUENCE_LENGTH,
                           trainable=False)
Preparing embedding matrix.
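A quick check I like to add here (my own sketch, not part of the Keras example) is how much of the embedding matrix actually got filled. Rows left at zero are top-ranked tokens, newsgroup artifacts and typos included, that have no GloVe vector:

# count embedding rows that stayed all zeros, i.e. tokens without a pre-trained vector
nb_zero_rows = int(np.sum(~np.any(embedding_matrix, axis=1)))
# row 0 is the padding index and is always zero, so subtract it
print 'Tokens without a pre-trained vector: {} of {}'.format(nb_zero_rows - 1, nb_words)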

Classification using Convolutional Neural Nets

Now, we start using Keras for our CNN. This is a very basic CNN model with a single convolutional branch. More advanced text CNNs run several convolutional branches with different filter widths in parallel and join them with a concatenation layer combined with max pooling, as sketched below.
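For reference, here is what such a multi-branch variant could look like with the functional API used in this post. This is a sketch only; the 3/4/5 filter widths and 128 filters are a common choice, not something tuned here.

# sketch: parallel convolutions with different filter widths over the same
# embedded sequence, each max-pooled over the whole sequence, then concatenated
from keras.layers import merge  # Keras 1.x functional concatenation helper

branch_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded = embedding_layer(branch_input)

branches = []
for filter_width in [3, 4, 5]:
    b = Conv1D(128, filter_width, activation='relu')(embedded)
    b = MaxPooling1D(MAX_SEQUENCE_LENGTH - filter_width + 1)(b)  # pool over the entire sequence
    b = Flatten()(b)
    branches.append(b)

merged = merge(branches, mode='concat', concat_axis=-1)
merged = Dense(128, activation='relu')(merged)
branch_preds = Dense(len(labels_index), activation='softmax')(merged)
multi_branch_model = Model(branch_input, branch_preds)

It compiles and trains exactly like the single-branch model below.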

In [11]:
from keras.layers import Dropout
In [12]:
print 'Preparing model.'
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(256, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(512, 5, activation='relu')(x)
x = Flatten()(x)
x = Dense(1024, activation='relu')(x)
preds = Dense(len(labels_index), activation='softmax')(x)
Preparing model.
In [13]:
from keras.callbacks import History, EarlyStopping, ModelCheckpoint
history = History()
early = EarlyStopping(patience=5)
checkpoint = ModelCheckpoint("model.weights.checkpoint", save_best_only=True, 
                             save_weights_only=True, verbose=1)
callbacks_list =[history, early, checkpoint]
In [14]:
print 'Training model.'
model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])
model.fit(X_train, y_train, validation_data = (X_val, y_val), nb_epoch=20, batch_size=128,
         callbacks = callbacks_list)
Training model.
Train on 9052 samples, validate on 2262 samples
Epoch 1/20
8960/9052 [============================>.] - ETA: 0s - loss: 2.9707 - acc: 0.1191Epoch 00000: val_loss improved from inf to 2.43787, saving model to model.weights.checkpoint.
9052/9052 [==============================] - 18s - loss: 2.9651 - acc: 0.1195 - val_loss: 2.4379 - val_acc: 0.1826
Epoch 2/20
8960/9052 [============================>.] - ETA: 0s - loss: 2.0385 - acc: 0.2961Epoch 00001: val_loss improved from 2.43787 to 1.76717, saving model to model.weights.checkpoint.
9052/9052 [==============================] - 19s - loss: 2.0326 - acc: 0.2981 - val_loss: 1.7672 - val_acc: 0.4076
Epoch 3/20
8960/9052 [============================>.] - ETA: 0s - loss: 1.4873 - acc: 0.4879Epoch 00002: val_loss improved from 1.76717 to 1.46691, saving model to model.weights.checkpoint.
9052/9052 [==============================] - 18s - loss: 1.4899 - acc: 0.4885 - val_loss: 1.4669 - val_acc: 0.4611
Epoch 4/20
8960/9052 [============================>.] - ETA: 0s - loss: 1.0544 - acc: 0.6381Epoch 00003: val_loss improved from 1.46691 to 1.07019, saving model to model.weights.checkpoint.
9052/9052 [==============================] - 18s - loss: 1.0516 - acc: 0.6388 - val_loss: 1.0702 - val_acc: 0.6375
Epoch 5/20
8960/9052 [============================>.] - ETA: 0s - loss: 0.7668 - acc: 0.7400Epoch 00004: val_loss improved from 1.07019 to 0.99900, saving model to model.weights.checkpoint.
9052/9052 [==============================] - 19s - loss: 0.7659 - acc: 0.7403 - val_loss: 0.9990 - val_acc: 0.6888
Epoch 6/20
8960/9052 [============================>.] - ETA: 0s - loss: 0.5928 - acc: 0.8037Epoch 00005: val_loss did not improve
9052/9052 [==============================] - 17s - loss: 0.5943 - acc: 0.8031 - val_loss: 1.1021 - val_acc: 0.6583
Epoch 7/20
8960/9052 [============================>.] - ETA: 0s - loss: 0.3943 - acc: 0.8666Epoch 00006: val_loss did not improve
9052/9052 [==============================] - 17s - loss: 0.3929 - acc: 0.8670 - val_loss: 1.0765 - val_acc: 0.7175
Epoch 8/20
8960/9052 [============================>.] - ETA: 0s - loss: 0.4049 - acc: 0.8906Epoch 00007: val_loss improved from 0.99900 to 0.92310, saving model to model.weights.checkpoint.
9052/9052 [==============================] - 17s - loss: 0.4028 - acc: 0.8910 - val_loss: 0.9231 - val_acc: 0.7524
Epoch 9/20
8960/9052 [============================>.] - ETA: 0s - loss: 0.2101 - acc: 0.9410Epoch 00008: val_loss did not improve
9052/9052 [==============================] - 17s - loss: 0.2088 - acc: 0.9412 - val_loss: 1.1036 - val_acc: 0.7440
Epoch 10/20
8960/9052 [============================>.] - ETA: 0s - loss: 0.2147 - acc: 0.9460Epoch 00009: val_loss did not improve
9052/9052 [==============================] - 17s - loss: 0.2132 - acc: 0.9463 - val_loss: 0.9985 - val_acc: 0.7582
Epoch 11/20
8960/9052 [============================>.] - ETA: 0s - loss: 0.1966 - acc: 0.9565Epoch 00010: val_loss did not improve
9052/9052 [==============================] - 17s - loss: 0.2018 - acc: 0.9551 - val_loss: 2.0937 - val_acc: 0.6454
Epoch 12/20
8960/9052 [============================>.] - ETA: 0s - loss: 0.1380 - acc: 0.9665Epoch 00011: val_loss did not improve
9052/9052 [==============================] - 17s - loss: 0.1373 - acc: 0.9666 - val_loss: 1.1651 - val_acc: 0.7661
Epoch 13/20
8960/9052 [============================>.] - ETA: 0s - loss: 0.1574 - acc: 0.9656Epoch 00012: val_loss did not improve
9052/9052 [==============================] - 17s - loss: 0.1574 - acc: 0.9656 - val_loss: 2.6486 - val_acc: 0.6362
Epoch 14/20
8960/9052 [============================>.] - ETA: 0s - loss: 0.1686 - acc: 0.9633Epoch 00013: val_loss did not improve
9052/9052 [==============================] - 17s - loss: 0.1670 - acc: 0.9637 - val_loss: 1.1794 - val_acc: 0.7754
In [16]:
model.summary()
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
input_4 (InputLayer)             (None, 1000)          0                                            
____________________________________________________________________________________________________
embedding_1 (Embedding)          (None, 1000, 100)     0           input_4[0][0]                    
____________________________________________________________________________________________________
convolution1d_17 (Convolution1D) (None, 996, 128)      64128       embedding_1[3][0]                
____________________________________________________________________________________________________
maxpooling1d_9 (MaxPooling1D)    (None, 199, 128)      0           convolution1d_17[0][0]           
____________________________________________________________________________________________________
convolution1d_18 (Convolution1D) (None, 195, 256)      164096      maxpooling1d_9[0][0]             
____________________________________________________________________________________________________
maxpooling1d_10 (MaxPooling1D)   (None, 39, 256)       0           convolution1d_18[0][0]           
____________________________________________________________________________________________________
convolution1d_19 (Convolution1D) (None, 35, 512)       655872      maxpooling1d_10[0][0]            
____________________________________________________________________________________________________
flatten_4 (Flatten)              (None, 17920)         0           convolution1d_19[0][0]           
____________________________________________________________________________________________________
dense_7 (Dense)                  (None, 1024)          18351104    flatten_4[0][0]                  
____________________________________________________________________________________________________
dense_8 (Dense)                  (None, 20)            20500       dense_7[0][0]                    
====================================================================================================
Total params: 19255700
____________________________________________________________________________________________________
In [17]:
model.load_weights("model.weights.checkpoint")
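Since a History callback was registered earlier, it is also worth glancing at the training curves before evaluating further. A small sketch: plt comes from the %pylab inline magic, and the 'acc'/'val_acc' keys follow from compiling with metrics=['acc'].

# plot training vs. validation loss and accuracy recorded by the History callback
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(history.history['loss'], label='train')
axes[0].plot(history.history['val_loss'], label='validation')
axes[0].set_title('loss')
axes[0].legend()
axes[1].plot(history.history['acc'], label='train')
axes[1].plot(history.history['val_acc'], label='validation')
axes[1].set_title('accuracy')
axes[1].legend()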

Network performance

Pulled from scikit-learn’s pretty confusion matrix example.

http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html

In [18]:
import itertools
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues, figsize=(15,15)):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.figure(figsize=figsize)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
In [19]:
from sklearn import metrics
preds = model.predict(X_val)
preds = np.argmax(preds, axis=1)
y_val_actual = np.argmax(y_val, axis=1)
print "Accuracy: ", metrics.accuracy_score(y_val_actual, preds)
print "Classification report: "
print metrics.classification_report(y_val_actual, preds)
conf_mat = metrics.confusion_matrix(y_val_actual, preds)
plot_confusion_matrix(conf_mat, newsgroups_trainval.target_names)
Accuracy:  0.752431476569
Classification report: 
             precision    recall  f1-score   support

          0       0.75      0.78      0.77        90
          1       0.64      0.75      0.69       104
          2       0.74      0.74      0.74       125
          3       0.65      0.26      0.37       135
          4       0.55      0.34      0.42       108
          5       0.72      0.69      0.70       121
          6       0.73      0.75      0.74       125
          7       0.72      0.85      0.78       111
          8       0.80      0.71      0.75       114
          9       0.89      0.96      0.92       121
         10       0.96      0.93      0.94       118
         11       0.95      0.86      0.90       119
         12       0.42      0.77      0.55       117
         13       0.76      0.91      0.83       103
         14       0.80      0.93      0.86       120
         15       0.85      0.86      0.86       129
         16       0.91      0.75      0.82       114
         17       0.90      0.94      0.92       112
         18       0.73      0.76      0.75       108
         19       0.66      0.46      0.54        68

avg / total       0.76      0.75      0.75      2262

In [20]:
# then let's save the model for future use
model.save("keras_nlp.convnet.model")
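For completeness, the saved file can be restored later with load_model, which rebuilds both the architecture and the weights from that single file:

from keras.models import load_model

# reload the architecture + weights saved above
restored_model = load_model("keras_nlp.convnet.model")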

Build features for other classifiers

http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/

So much for CNNs. We can also turn our word embeddings into document-level features. The linked post above highlights averaging the word vectors of an entire document, here weighted by tf-idf. Much information is lost this way, but it’s a good starting point before moving on to techniques such as Doc2Vec.

In [81]:
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import defaultdict

class TfidfEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        self.word2weight = None
        self.dim = len(word2vec.itervalues().next())

    def fit(self, X, y):
        tfidf = TfidfVectorizer(analyzer=lambda x: x)
        tfidf.fit(X)
        # if a word was never seen - it must be at least as infrequent
        # as any of the known words - so the default idf is the max of 
        # known idf's
        max_idf = max(tfidf.idf_)
        self.word2weight = defaultdict(
            lambda: max_idf,
            [(w, tfidf.idf_[i]) for w, i in tfidf.vocabulary_.items()])

        return self

    def transform(self, X):
        return np.array([
                np.mean([self.word2vec[w] * self.word2weight[w]
                         for w in words if w in self.word2vec] or
                        [np.zeros(self.dim)], axis=0)
                for words in X
            ])
In [21]:
# I'll be repeating the above preprocessing here.
# This time the articles are tokenized (lowercase + whitespace split) so the tokens
# line up with the lowercase GloVe vocabulary that TfidfEmbeddingVectorizer looks up.
texts = [article.encode('latin-1').lower().split() for article in newsgroups_trainval.data]
labels = newsgroups_trainval.target
indices = np.arange(len(texts))
np.random.shuffle(indices)
texts = [texts[i] for i in indices]
labels = labels[indices]

nb_train_samples = int((1 - VALIDATION_SPLIT) * len(texts))

text_train = texts[:nb_train_samples]
text_val = texts[nb_train_samples:]
y_train = labels[:nb_train_samples]
y_val = labels[nb_train_samples:]
In [22]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

vectorizer = Pipeline([('vectorizer', TfidfEmbeddingVectorizer(embeddings_index)),
                       ('standardizer', StandardScaler())])
                       
X_train = vectorizer.fit_transform(text_train)
X_val = vectorizer.transform(text_val)
That’s it for now. The next thing to do is to use these as input matrices for our other models: your SVMs, Multinomial NBs and trees.
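As a small teaser of that next step, here is a hedged sketch of my own (not the version on GitHub; logistic regression simply stands in for the fancier models). The averaged and scaled vectors drop straight into scikit-learn:

from sklearn.linear_model import LogisticRegression

# fit a plain linear classifier on the tf-idf weighted average embeddings
clf = LogisticRegression()
clf.fit(X_train, y_train)
print 'Validation accuracy: {:.3f}'.format(clf.score(X_val, y_val))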
Please head on over to my GitHub page to see the rest of the code.
