
I have created the following function to take a data frame and convert the dtype of its numeric columns to numeric. It does a good job, but the problem is that it also drops the non-numeric columns, which I don't want, because those columns carry important information as well.

import pandas as pd

def convert_dataframe_to_numeric_type(df):
    def is_it_a_number(x):
        try:
            float(x)
            return True
        except (ValueError, TypeError):
            return False
    # keep only cells that parse as numbers; everything else becomes NaN
    df = df[df.applymap(is_it_a_number)]
    # drop columns that are now entirely NaN (this also drops text columns)
    df = df.dropna(how='all', axis=1)
    # convert the remaining values to numeric dtypes
    df = df.transform(pd.to_numeric, errors='ignore')
    return df
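For reference, here is a minimal reproduction of the dropping behaviour (the column names `score` and `name` are made up for illustration); the masking step turns every text cell into NaN, so `dropna(how='all', axis=1)` then removes the text column entirely:

```python
import pandas as pd

# Hypothetical mixed-type frame: 'score' is numeric-as-string, 'name' is text
df = pd.DataFrame({"score": ["1", "2"], "name": ["a", "b"]}, dtype=object)

# Same idea as the function above: mask non-numeric cells,
# then drop columns that are entirely NaN
mask = df.apply(lambda s: pd.to_numeric(s, errors="coerce").notna())
dropped = df[mask].dropna(how="all", axis=1)
print(list(dropped.columns))  # only 'score' survives; 'name' is gone
```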
  • The question is unclear. Are you trying to convert object or string columns to float? Or are you trying to determine the contents of mixed columns? Why not use astype or convert_dtypes? Please post an example of what you want. Commented Jul 14, 2023 at 9:53
  • @PanagiotisKanavos I just have a big data frame of mixed content: numeric columns and non-numeric columns. When I import it, all columns are object type. There are many columns, and I don't want to look through each one; I'd rather automate it.
    – Mainland
    Commented Jul 14, 2023 at 16:27
  • Why not call convert_dtypes then? Or infer_objects? Have you tried them? Commented Jul 14, 2023 at 16:54
  • I posted examples of how to convert types with a single call. No loops. Commented Jul 14, 2023 at 17:28

1 Answer


Attempting to convert to numeric and then checking for nulls won't work. Almost all data files have missing numeric values, which appear as NA. Data-loading functions like read_csv generate NA for every empty field and for common NaN markers:

By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘None’, ‘n/a’, ‘nan’, ‘null’.
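To see this in action, here is a small sketch that feeds read_csv an inline CSV (via io.StringIO, standing in for a real file) containing two of those markers:

```python
import io
import pandas as pd

# Hypothetical CSV with an 'NA' and a '#N/A' field in column 'a'
csv_data = "a,b\n1,x\nNA,y\n#N/A,z\n"
df = pd.read_csv(io.StringIO(csv_data))

# Both marker fields are parsed as NaN, so 'a' loads as a float column
print(df["a"].isna().sum())  # 2
```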

Besides, trying to convert all the values in a series and then checking whether any failed does the same job twice. Pandas has built-in methods to detect/convert types that stop immediately if conversion fails.

One option is infer_objects, which tries to detect the types of any object Series. Another option is convert_dtypes, which will try to find the best type for the values.

Using this dataframe, where every column is object:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": pd.Series([1, 2, 3], dtype=np.dtype("O")),
        "b": pd.Series(["x", "y", "z"], dtype=np.dtype("O")),
        "c": pd.Series([True, False, np.nan], dtype=np.dtype("O")),
        "d": pd.Series(["h", "i", np.nan], dtype=np.dtype("O")),
        "e": pd.Series([10, np.nan, 20], dtype=np.dtype("O")),
        "f": pd.Series([np.nan, 100.5, 200], dtype=np.dtype("O")),
    }
)

infer_objects() produces these types:

df_i=df.infer_objects()
df_i.dtypes
-----------------------
a      int64
b     object
c     object
d     object
e    float64
f    float64
dtype: object

While convert_dtypes goes deeper:

df_c=df.convert_dtypes()
df_c.dtypes
------------------------
a      Int64
b     string
c    boolean
d     string
e      Int64
f    Float64
dtype: object
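Notably, and relevant to the original problem, both methods keep the non-numeric columns in place instead of dropping them. A minimal sketch (hypothetical two-column frame):

```python
import pandas as pd

# One numeric-looking object column, one text column
df = pd.DataFrame({
    "a": pd.Series([1, 2, 3], dtype=object),
    "b": pd.Series(["x", "y", "z"], dtype=object),
})

df_c = df.convert_dtypes()
# 'a' becomes a nullable Int64 column, while 'b' is kept as a string column
print(df_c.dtypes)
print(list(df_c.columns))  # ['a', 'b'] - nothing is dropped
```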
