
I have a pandas DataFrame with mixed data types (float64 and strings). To use it in a sklearn Pipeline, I need to convert it to a NumPy array. At the end of the Pipeline, I want to turn the result back into a DataFrame.

The problem is that a NumPy array created from mixed types has dtype "object" for all of its data. As a result, when I create a new DataFrame at the end, every column comes back as object, so even the numeric column is treated as categorical.

Example:

Dataframe with mixed data

>>> import pandas as pd
>>> dataframe = pd.DataFrame({"num": [1, 2, 3], "cat": ["a", "b", "c"]})

>>> dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   num     3 non-null      int64 
 1   cat     3 non-null      object
dtypes: int64(1), object(1)
memory usage: 176.0+ bytes

To numpy array

>>> array = dataframe.to_numpy()
>>> array

array([[1, 'a'],
       [2, 'b'],
       [3, 'c']], dtype=object)

Back to dataframe

>>> new_df = pd.DataFrame(array, columns = ["num", "cat"])

>>> new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   num     3 non-null      object
 1   cat     3 non-null      object
dtypes: object(2)
memory usage: 176.0+ bytes

Now both columns have dtype object, including the numeric one.

Is there a way to make pandas recognize the true data types inside the numpy array?

2 Answers

If you are using pandas >= 1.0, there's convert_dtypes:

>>> new_df = pd.DataFrame(array, columns = ["num", "cat"]).convert_dtypes()
>>> new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   num     3 non-null      Int64 
 1   cat     3 non-null      string
dtypes: Int64(1), string(1)
memory usage: 179.0 bytes
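
Note that the Int64 in the output above (capital I) is pandas' nullable extension dtype, not NumPy's int64 from the original frame. The following is a minimal sketch, reusing the toy data from the question, of how one might check this and cast back to the NumPy dtype if a later step requires it. The variable names `converted`, `num_dtype`, and `as_numpy_int` are illustrative only.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype

# Same toy data as in the question.
dataframe = pd.DataFrame({"num": [1, 2, 3], "cat": ["a", "b", "c"]})
array = dataframe.to_numpy()  # mixed columns -> a single object-dtype array

converted = pd.DataFrame(array, columns=["num", "cat"]).convert_dtypes()

# "Int64" (capital I) is pandas' nullable extension dtype, not NumPy's int64.
num_dtype = converted["num"].dtype
print(num_dtype, isinstance(num_dtype, pd.Int64Dtype))
print(is_integer_dtype(num_dtype))

# If a downstream step needs the plain NumPy dtype, cast explicitly.
as_numpy_int = converted["num"].astype("int64")
print(as_numpy_int.dtype)
```

The explicit cast only succeeds when the column has no missing values, since NumPy's int64 cannot hold pd.NA.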

You can also use infer_objects(), which tries to infer a better dtype for each object column:

new_df = pd.DataFrame(array, columns = ["num", "cat"]).infer_objects()
print(new_df,'\n\n',new_df.dtypes)

   num cat
0    1   a
1    2   b
2    3   c 

num     int64
cat    object
dtype: object
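
Since the original DataFrame is available before the conversion, another option (not taken from either answer, so treat it as a sketch) is to record its dtypes up front and reapply them with astype after rebuilding the frame. This assumes the Pipeline keeps the same columns in the same order and that the values still fit the original dtypes. The names `original_dtypes`, `original_columns`, and `restored` are illustrative.

```python
import pandas as pd

dataframe = pd.DataFrame({"num": [1, 2, 3], "cat": ["a", "b", "c"]})

# Remember the schema before the frame is flattened into an array.
original_dtypes = dataframe.dtypes.to_dict()
original_columns = list(dataframe.columns)

array = dataframe.to_numpy()

# Rebuild the frame and cast each column back to its recorded dtype.
restored = pd.DataFrame(array, columns=original_columns).astype(original_dtypes)
print(restored.dtypes)
```

Unlike inference, this fails loudly if a step in between changed the data so that a column no longer fits its original dtype, which may or may not be what you want.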
