Word2Vec in 6 lines of Python

    import requests
    import gensim
    url = 'http://www.gutenberg.org/cache/epub/1041/pg1041.txt'
    text = requests.get(url).text
    tokens = gensim.utils.simple_preprocess(text)
    model = gensim.models.Word2Vec([tokens], min_count=3, size=100)
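
A note on versions: this walkthrough targets gensim 3.x (and Python 2, going by the u'' strings in the outputs below). In gensim 4.0+ the size parameter was renamed to vector_size, and word vectors are looked up through model.wv rather than by indexing the model directly. A minimal sketch of the same six lines for gensim 4.x:

    import requests
    import gensim
    url = 'http://www.gutenberg.org/cache/epub/1041/pg1041.txt'
    text = requests.get(url).text
    tokens = gensim.utils.simple_preprocess(text)
    # gensim 4.x: size was renamed to vector_size
    model = gensim.models.Word2Vec([tokens], min_count=3, vector_size=100)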

Breakdown of this code

In [1]:
import requests
In [2]:
def fetch(url):
    '''Fetch a URL and return the page body as text.'''
    return requests.get(url).text
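
A slightly more defensive variant (just a sketch; fine to skip for a quick experiment) raises on HTTP errors instead of silently returning an error page:

    def fetch(url):
        '''Download a URL, raising on HTTP errors; return the body text.'''
        r = requests.get(url)
        r.raise_for_status()  # fail loudly on 4xx/5xx responses
        return r.text
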
In [7]:
# fetch Shakespeare's sonnets
text = fetch('http://www.gutenberg.org/cache/epub/1041/pg1041.txt')
In [8]:
len(text)
Out[8]:
122778
In [12]:
# take a look at what we have
# it's dirty: carriage returns, punctuation, mixed case
text[1000:3001]
Out[12]:
u"own bright eyes,\r\n  Feed'st thy light's flame with self-substantial fuel,\r\n  Making a famine where abundance lies,\r\n  Thy self thy foe, to thy sweet self too cruel:\r\n  Thou that art now the world's fresh ornament,\r\n  And only herald to the gaudy spring,\r\n  Within thine own bud buriest thy content,\r\n  And tender churl mak'st waste in niggarding:\r\n    Pity the world, or else this glutton be,\r\n    To eat the world's due, by the grave and thee.\r\n\r\n  II\r\n\r\n  When forty winters shall besiege thy brow,\r\n  And dig deep trenches in thy beauty's field,\r\n  Thy youth's proud livery so gazed on now,\r\n  Will be a tatter'd weed of small worth held:\r\n  Then being asked, where all thy beauty lies,\r\n  Where all the treasure of thy lusty days;\r\n  To say, within thine own deep sunken eyes,\r\n  Were an all-eating shame, and thriftless praise.\r\n  How much more praise deserv'd thy beauty's use,\r\n  If thou couldst answer 'This fair child of mine\r\n  Shall sum my count, and make my old excuse,'\r\n  Proving his beauty by succession thine!\r\n    This were to be new made when thou art old,\r\n    And see thy blood warm when thou feel'st it cold.\r\n\r\n  III\r\n\r\n  Look in thy glass and tell the face thou viewest\r\n  Now is the time that face should form another;\r\n  Whose fresh repair if now thou not renewest,\r\n  Thou dost beguile the world, unbless some mother.\r\n  For where is she so fair whose unear'd womb\r\n  Disdains the tillage of thy husbandry?\r\n  Or who is he so fond will be the tomb,\r\n  Of his self-love to stop posterity?\r\n  Thou art thy mother's glass and she in thee\r\n  Calls back the lovely April of her prime;\r\n  So thou through windows of thine age shalt see,\r\n  Despite of wrinkles this thy golden time.\r\n    But if thou live, remember'd not to be,\r\n    Die single and thine image dies with thee.\r\n\r\n  IV\r\n\r\n  Unthrifty loveliness, why dost thou spend\r\n  Upon thy self thy beauty's legacy?\r\n  Nature's bequest gives nothing, but doth lend,\r\n  And being frank she lends to those are free:\r\n  Then, beaute"
In [13]:
import gensim
tokenized = gensim.utils.simple_preprocess(text)
In [14]:
len(tokenized)
Out[14]:
20247
In [15]:
# cleaned up
tokenized[1000:1021]
Out[15]:
[u'in',
 u'single',
 u'life',
 u'ah',
 u'if',
 u'thou',
 u'issueless',
 u'shalt',
 u'hap',
 u'to',
 u'die',
 u'the',
 u'world',
 u'will',
 u'wail',
 u'thee',
 u'like',
 u'makeless',
 u'wife',
 u'the',
 u'world']
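
simple_preprocess did the cleanup: it lowercases, splits on non-alphabetic characters, and discards tokens shorter than 2 or longer than 15 characters, which is why the \r\n noise and punctuation are gone. The defaults can be adjusted through min_len, max_len, and deacc (accent removal):

    # defaults written out explicitly; deacc=True also strips accents
    gensim.utils.simple_preprocess("Thou art thy mother's glass!",
                                   deacc=True, min_len=2, max_len=15)
    # expect roughly: ['thou', 'art', 'thy', 'mother', 'glass']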

Explore the corpus

In [20]:
from collections import Counter

# find the frequency of each word in the list
# NLP speak: the most frequently occurring tokens in the corpus

c = Counter(tokenized)
c.most_common(20) # top 20
Out[20]:
[(u'the', 613),
 (u'and', 560),
 (u'to', 495),
 (u'of', 488),
 (u'in', 380),
 (u'my', 372),
 (u'that', 338),
 (u'thy', 281),
 (u'thou', 235),
 (u'with', 228),
 (u'for', 198),
 (u'love', 195),
 (u'is', 194),
 (u'not', 188),
 (u'you', 183),
 (u'but', 168),
 (u'me', 164),
 (u'thee', 162),
 (u'be', 160),
 (u'or', 157)]
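
The top of the list is dominated by function words ('the', 'and', 'to'), with 'love' the first content word at rank 12. To surface more content words, filter out a stopword list before counting. The list below is a small hand-picked sketch built from the table above, not a definitive one:

    # toy stopword list taken from the top of the frequency table
    stopwords = set("""the and to of in my that thy thou with for is
    not you but me thee be or a i as so this it his do on by all which
    when what her their shall""".split())

    content = Counter(t for t in tokenized if t not in stopwords)
    content.most_common(10)  # 'love' should now sit at or near the top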

Build model

In [37]:
# min_count=3: ignore words that appear fewer than 3 times
# size=100: dimensionality of the word vectors
model = gensim.models.Word2Vec([tokenized], min_count=3, size=100)
In [38]:
model.wv.similarity('man', 'woman')
Out[38]:
0.99722207209177871
In [39]:
model.wv.similarity('woman', 'woman')
Out[39]:
0.99999999999999989
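
Both scores sit suspiciously close to 1.0. That is a symptom of how the corpus was fed in: [tokenized] is a single 20,247-token "sentence", so every word shares nearly the same global context and the vectors barely differentiate. A sketch of a better setup (reusing the text variable from above) trains on one short sentence per line of the raw file:

    # one tokenized "sentence" per line of the file; drop empty lines
    sentences = [gensim.utils.simple_preprocess(line)
                 for line in text.splitlines()]
    sentences = [s for s in sentences if s]

    better = gensim.models.Word2Vec(sentences, min_count=3, size=100)
    better.wv.similarity('man', 'woman')  # expect a far less inflated score
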
In [40]:
# vector representation of the word 'love'
model.wv['love']
Out[40]:
array([ 0.51965952,  0.55235094, -0.17475532,  0.64651948, -0.48291188,
        0.06080499, -0.27537176, -0.53690392, -0.08770045,  0.50699908,
        0.17042752, -0.56355774, -0.26045609,  0.20165233, -0.07852872,
        0.66898388,  0.30179507, -0.32529974,  0.09284518, -0.15472995,
       -0.80867547, -0.37871212,  0.3926107 ,  0.90615118, -0.08342564,
        0.16193688, -0.32557613,  0.65092826, -0.358367  , -0.59081411,
       -0.07039595,  0.87389702,  0.44004893, -0.04361319,  0.13640521,
        0.32703295,  0.03798396,  0.06206302, -0.03943499, -0.43150824,
       -0.25584692,  0.31193838, -0.46332827, -0.24257505,  0.02769107,
       -0.37423518,  0.17743127,  0.04594169, -0.24431312,  0.22337559,
        0.02724983, -0.0218022 , -0.19361265, -0.18199681, -0.17359681,
       -0.0015086 , -0.01472659, -0.16743575, -0.84514213,  0.14478178,
       -0.4065662 ,  0.27313149, -0.16871376, -0.1994134 , -0.13054073,
        0.03255269, -0.55017006,  0.38131151,  0.52704483,  0.08423062,
       -0.24848555, -0.23845927, -0.33051106,  0.16118209, -0.7264629 ,
       -0.33346862, -0.35598928,  0.47968715, -0.11829652, -0.04094103,
       -0.13746099,  0.47757912,  0.08380625, -0.16608757, -0.02269495,
        0.61047113, -0.01637947, -0.55348778,  0.2418807 ,  0.36410737,
       -0.23879786,  0.15609293, -0.07274883, -0.05155676, -0.04249015,
       -0.18196352,  0.36223558, -0.0806519 ,  0.03152369, -0.12127267], dtype=float32)
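
Each word maps to a 100-dimensional float32 array; the length comes straight from the size parameter:

    vec = model.wv['love']
    vec.shape  # (100,)
    vec.dtype  # dtype('float32')
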
In [41]:
vec1 = model.wv['love']
In [42]:
# cosine similarity between 'love' and the 20 most similar words
# in the context of Shakespeare's Sonnets
model.wv.similar_by_vector(vec1, topn=20)
Out[42]:
[(u'love', 1.0),
 (u'and', 0.999948263168335),
 (u'in', 0.99994295835495),
 (u'my', 0.9999422430992126),
 (u'so', 0.9999417066574097),
 (u'to', 0.9999407529830933),
 (u'that', 0.9999404549598694),
 (u'thou', 0.9999399781227112),
 (u'of', 0.9999398589134216),
 (u'thy', 0.9999396204948425),
 (u'me', 0.9999381303787231),
 (u'doth', 0.9999342560768127),
 (u'is', 0.9999341368675232),
 (u'their', 0.9999338984489441),
 (u'the', 0.9999328255653381),
 (u'with', 0.9999327659606934),
 (u'but', 0.9999297857284546),
 (u'this', 0.999929666519165),
 (u'your', 0.9999285340309143),
 (u'it', 0.9999279379844666)]

What does this mean?

It means the model learned vectors for 'and', 'in', 'my', and 'so' that point in nearly the same direction as the vector for 'love', i.e. the model treats them as occurring in similar contexts. It is not a direct co-occurrence count, and since every score here is above 0.999 (a side effect of training on one long sentence, as noted above), the ranking is much weaker evidence than it would be on a larger, properly segmented corpus.
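
The classic Word2Vec demo is vector arithmetic: most_similar accepts positive and negative word lists and searches near their combined vector. On a corpus this small the neighbours will be noisy, but the call looks like this:

    # a toy analogy query: 'love' + 'death' - 'life'
    model.wv.most_similar(positive=['love', 'death'], negative=['life'], topn=5)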

In [43]:
vec2 = model.wv['die']
In [44]:
model.wv.similar_by_vector(vec2, topn=10)
Out[44]:
[(u'die', 1.0000001192092896),
 (u'can', 0.9993754029273987),
 (u'for', 0.9993654489517212),
 (u'sweet', 0.9993617534637451),
 (u'be', 0.9993606209754944),
 (u'upon', 0.9993589520454407),
 (u'nor', 0.9993535280227661),
 (u'now', 0.9993531703948975),
 (u'these', 0.9993516206741333),
 (u'was', 0.9993469715118408)]
In [45]:
vec3 = model.wv['death']
In [46]:
model.wv.similar_by_vector(vec3, topn=10)
Out[46]:
[(u'death', 1.0000001192092896),
 (u'with', 0.9998724460601807),
 (u'is', 0.9998631477355957),
 (u'should', 0.9998618364334106),
 (u'to', 0.9998602867126465),
 (u'their', 0.9998598098754883),
 (u'and', 0.9998587965965271),
 (u'more', 0.9998586773872375),
 (u'me', 0.999858021736145),
 (u'my', 0.9998569488525391)]
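
Finally, a trained model can be persisted so the download and training steps don't have to be repeated; the filenames here are just examples:

    model.save('sonnets.w2v')  # gensim's native format
    model = gensim.models.Word2Vec.load('sonnets.w2v')

    # or export only the word vectors in word2vec's text format
    model.wv.save_word2vec_format('sonnets_vectors.txt', binary=False)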