
I am trying to import pre-trained wiki word embeddings. When I try to read the file, I get the following error:

import gensim
from gensim.models import KeyedVectors
model = gensim.models.KeyedVectors.load_word2vec_format('C:\Users\PHQ-Admin\Downloads\enwiki_20180420_100d.txt')

Error:

model = gensim.models.KeyedVectors.load_word2vec_format('C:\Users\PHQ-Admin\Downloads\enwiki_20180420_100d.txt')
                                                           ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
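The failure can be reproduced without gensim at all: the error is raised at compile time, because \U in a normal string literal starts a \UXXXXXXXX unicode escape. A minimal sketch (using a hypothetical short path):

```python
# Reproducing the error: "\U" in a normal string literal starts a
# \UXXXXXXXX unicode escape, so Python rejects the source before it runs.
src = 's = "C:\\Users\\PHQ-Admin\\file.txt"'  # source text: s = "C:\Users\PHQ-Admin\file.txt"
try:
    compile(src, "<string>", "exec")
except SyntaxError as e:
    print("SyntaxError:", e.msg)
```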

2 Answers


You are using a path with backslashes (\), so Python interprets sequences such as \U in \Users as escape sequences, which produces the error. You can use one of the following solutions.

Use forward slashes:

load_word2vec_format("C:/Users/PHQ-Admin/Downloads/enwiki_20180420_100d.txt")

OR

Escape each backslash with another backslash:

load_word2vec_format("C:\\Users\\PHQ-Admin\\Downloads\\enwiki_20180420_100d.txt")

OR

Put r before your string to make it a raw string, in which backslashes are treated literally:

load_word2vec_format(r"C:\Users\PHQ-Admin\Downloads\enwiki_20180420_100d.txt")
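All three variants denote the same file: the escaped and raw forms produce an identical string object, and Windows accepts forward slashes as path separators. A quick check, using the path from the question:

```python
p1 = "C:/Users/PHQ-Admin/Downloads/enwiki_20180420_100d.txt"      # forward slashes
p2 = "C:\\Users\\PHQ-Admin\\Downloads\\enwiki_20180420_100d.txt"  # escaped backslashes
p3 = r"C:\Users\PHQ-Admin\Downloads\enwiki_20180420_100d.txt"     # raw string
print(p2 == p3)  # True: the escaped and raw forms are the same string
```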

I think you are supposed to pass a word2vec-format file as input to this function; you can also look at changing the encoding parameter to one that fits your file.

    def load_word2vec_format(cls, fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict',
                             limit=None, datatype=REAL):
        """Load the input-hidden weight matrix from the original C word2vec-tool format.

        Note that the information stored in the file is incomplete (the binary tree is missing),
        so while you can query for word similarity etc., you cannot continue training
        with a model loaded this way.

        Parameters
        ----------
        fname : str
            The file path to the saved word2vec-format file.
        fvocab : str
            Optional file path to the vocabulary. Word counts are read from `fvocab`,
            if set (this is the file generated by the `-save-vocab` flag of the original C tool).
        binary : bool
            If True, indicates whether the data is in binary word2vec format.
        encoding : str
            If you trained the C model using non-utf8 encoding for words, specify that
            encoding in `encoding`.
        unicode_errors : str
            default 'strict', is a string suitable to be passed as the `errors`
            argument to the unicode() (Python 2.x) or str() (Python 3.x) function. If your source
            file may include word tokens truncated in the middle of a multibyte unicode character
            (as is common from the original word2vec.c tool), 'ignore' or 'replace' may help.
        limit : int
            Sets a maximum number of word-vectors to read from the file. The default,
            None, means read all.
        datatype : :class:`numpy.float*`
            (Experimental) Can coerce dimensions to a non-default float type (such
            as np.float16) to save memory. (Such types may result in much slower bulk operations
            or incompatibility with optimized routines.)
        """
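The encoding / unicode_errors pair is forwarded to Python's ordinary bytes-decoding machinery, so its effect can be sketched with plain bytes.decode (a standalone illustration, not gensim code):

```python
# A multibyte UTF-8 character cut in half, as can happen with word tokens
# truncated by the original word2vec.c tool.
truncated = "café".encode("utf-8")[:-1]  # b'caf\xc3' - last byte of 'é' missing
try:
    truncated.decode("utf-8", errors="strict")  # what unicode_errors='strict' does
except UnicodeDecodeError as e:
    print("strict raises:", e.reason)
print("ignore: ", truncated.decode("utf-8", errors="ignore"))   # drops the bad byte
print("replace:", truncated.decode("utf-8", errors="replace"))  # substitutes U+FFFD
```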
