KeyError when using array as feature in language detection

Question

I am following this tutorial for language detection using machine learning. My dataset, however, has several feature columns rather than one. In place of X = data["Text"], I tried X = df["message", "fingers", "tail"] (message, fingers, and tail are the three feature columns I am using), but it throws a KeyError:

Traceback (most recent call last):
  File "C:\Users\usr\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\indexes\base.py", line 3805, in get_loc
    return self._engine.get_loc(casted_key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\\_libs\\hashtable_class_helper.pxi", line 7081, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\\_libs\\hashtable_class_helper.pxi", line 7089, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: ('message', 'fingers', 'tail')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\usr\Downloads\thecode.py", line 13, in <module>
    X = df["message", "fingers", "tail"]
        ~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\usr\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\frame.py", line 4102, in __getitem__
    indexer = self.columns.get_loc(key)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\usr\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\indexes\base.py", line 3812, in get_loc
    raise KeyError(key) from err
KeyError: ('message', 'fingers', 'tail')

How should I write the code so that it uses all three features without raising errors?

Hi Harry, are you trying to return all three columns from your data frame? In that case you need to pass your keys to the dataframe as a list: X=df[["message", "fingers", "tail"]] — A10, Commented Sep 23, 2024 at 22:30
@A10: Thanks a lot, that fixes the issue, but then I'm running into a ValueError due to the line x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.20). It says it found input variables with inconsistent numbers of samples: [3,500]. Could you help with that too? — harry, Commented Sep 23, 2024 at 22:39
That error typically comes up when X and y are not the same length. From the error message, one of your variables has 3 rows and the other has 500. — A10, Commented Sep 23, 2024 at 22:42
@A10: But each of the feature variables has a column with 500 entries, so why is that a problem? — harry, Commented Sep 23, 2024 at 22:59
Sorry, I probably shouldn't have used the word variable. What that error message is telling you is that X, the entire data frame, and y, a vector of labels, don't have the same length. For example, it looks like your data frame X only has 3 rows while you have 500 labels in y. I'd check your code to make sure you haven't accidentally sliced your data frame with something like df = df.head() or df = df.dropna() before defining X (and that your data frame is complete). — A10, Commented Sep 23, 2024 at 23:06
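
The length mismatch discussed in the comments can be checked directly before calling train_test_split. The following is a minimal sketch, not the asker's actual code: the 500-row data frame, its values, and the label column name "language" are all assumptions made for illustration. It also shows one way (not necessarily what happened here) that an X of length 3 can arise, namely passing a plain list of three columns instead of a two-dimensional selection.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 500 rows, three feature columns, one label column.
df = pd.DataFrame({
    "message": ["hello"] * 500,
    "fingers": np.arange(500),
    "tail": np.zeros(500),
    "language": ["en"] * 500,  # assumed label column name
})

y = df["language"]

# A plain Python list of three Series has length 3, so scikit-learn
# would see 3 samples in X against 500 labels in y.
X_list = [df["message"], df["fingers"], df["tail"]]
print(len(X_list), len(y))

# A double-bracket selection keeps one row per sample: shape (500, 3).
X = df[["message", "fingers", "tail"]]
print(X.shape, len(y))

# With matching lengths the split runs; random_state is fixed only for repeatability.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)
print(x_train.shape, x_test.shape)
```

Comparing len(X) and len(y) right before the split is usually the quickest way to see which of the two objects has the unexpected size.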

harry · Accepted Answer · 2024-09-24 04:19:12Z

The issue can be solved by selecting the columns with a list (double brackets) and converting the result to a NumPy array: X = np.asarray(df[["message", "fingers", "tail"]]), with numpy imported as np.
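
To make the difference between the two indexing forms concrete, here is a small sketch. The data frame contents are hypothetical; only the column names message, fingers, and tail come from the question. With a comma-separated key, pandas receives the tuple ('message', 'fingers', 'tail') and looks for a single column with that label, which is why the KeyError in the traceback shows a tuple.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's data (500 rows).
df = pd.DataFrame({
    "message": ["hello"] * 500,
    "fingers": np.arange(500),
    "tail": np.zeros(500),
})

# df["message", "fingers", "tail"] passes one tuple as the key,
# so pandas searches for a single column labelled by that tuple.
try:
    df["message", "fingers", "tail"]
except KeyError as err:
    key_error = err
    print("KeyError:", err)

# A list of labels selects several columns and returns a DataFrame.
X_frame = df[["message", "fingers", "tail"]]

# Converting to a NumPy array keeps the two-dimensional (rows, columns) layout.
X = np.asarray(X_frame)
print(X.shape)
```

Because the columns here mix strings and numbers, the resulting array has dtype object; a text column such as message would normally still need to be vectorized before it can be fed to a model.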