Selecting only non-empty cells in a row

Question

Using Pandas to read a 14k lines CSV with many empty cells - what I call "Emmenthaler data", lots of holes :-) . A very short, simplified sampler (urlbuilder.csv):

"one","two","three"
"","foo","bacon",""
"spam","bar",""

The cells contain information to match against this web API, like this:

http://base_URL&two="foo"&three="bacon"
http://base_URL&one="spam"&two="bar"

Leaving values empty in the URL (``...one=""&two="bar"`) would give wrong results. So I want to use just the non-empty fields in each row. To illustrate the method:

import pandas as pd


def buildurl(row, colnames):
    URL = 'http://someAPI/q?'
    values = row.one, row.two, row.three  # ..ugly.. 

    for value,field in zip(values, colnames):
        if not pd.isnull(value): 
            URL = '{}{}:{}&'.format(URL, field, value)

    return URL


df = pd.read_csv('urlbuilder.csv', encoding='utf-8-sig', engine='python')
colnames = list(df.columns)

for row in df.itertuples():
    url = buildurl(row, colnames)
    print(url)

It works and it's probably better than a cascade of if not isnull's. But I still have this loud hunch that there are much more elegant ways of doing this. Yet I can't seem to find them, probably I'm not googling for the right jargon.

Please comment

301_Moved_Permanently · Accepted Answer · 2018-07-25 07:30:26Z

For anything related to building URL query strings from key-values pairs, you should let urllib.parse.urlencode do the work for you. It can also handle quoting and other special cases for you. To do that, you have to convert your data to a dictionary.

This is easy if you start with a pd.Series:

from urllib.parse import urlencode


def build_url_params(serie):
    parameters = serie[~pd.isnull(serie)].to_dict()
    return urlencode(parameters)

Then you just need to provide Series to this function instead of tuples:

def populate_url_params(df):
    df['URL parameters'] = df.apply(build_url_params, axis=1)


if __name__ == '__main__':
    df = pd.read_csv('urlbuilder.csv', encoding='utf-8-sig', engine='python')
    populate_url_params(df)
    print(df)

Or if you want to go the manual iteration route, you can use df.iterrows() instead of df.itertuples().

301_Moved_Permanently · Accepted Answer · 2018-07-27 09:28:43Z

This is not an answer but an elaboration on @Mathias Ettinger's answer.

Being new to Pandas, I didn't understand what all is going on in his code. So I dove in, and in the process drew up the text below. I'm posting it here for the sake of completeness and reference. Hopefully it helps others too.

As is often the case, the best way is to try to understand is from the bottom upwards (instead of top-down).

def populate_url_params(df):  
    df['URL parameters'] = df.apply(build_url_params, axis=1)

df['URL parameters'] has data type <class 'pandas.core.series.Series'> allright. But build_url_params takes a parameter series. So what are series and how does apply work? First, apply. The official apply doc reads

Apply a function along an axis of the DataFrame.
Objects passed to the function are Series objects whose index is [in this case] the DataFrame’s columns (axis=1).

Where "Series objects whose index is the DataFrame’s columns", in Excel speak would mean (if axis=1!) 'a row of cells with their column name'.

What confused me was the this explanation of Series where it says

the Series is the datastructure for a single column of a DataFrame.

We're dealing with a row (not a column) of named cells. I think of my dataframe here as a bunch of rows with column names. IMHO a better definition is from Pandas intro to Data Structures

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

It all became much clearer when I put in a bunch of print statements in

def build_url_params(serie):  
    parameters = serie[~pd.isnull(serie)].to_dict()  
    return urlencode(parameters)

serie:

one        NaN  
two        foo  
three    bacon  
Name: 0, dtype: object

~pd.isnull(serie):

one      False  
two       True  
three     True  
Name: 0, dtype: bool

serie[~pd.isnull(serie)]

two      foo  
three    bacon  
Name: 0, dtype: object

parameters:

{'two': 'foo', 'three': 'bacon'}

And so on for each line in the csv. Then the final print(df):

    one  two  three       URL parameters  
0   NaN  foo  bacon  two=foo&three=bacon  
1  spam  bar    NaN     one=spam&two=bar

So yes, you can think of a Series as a single column of mixed data with name labels. In this particular case, you could think of it as a row of named cells turned 90 degrees counterclockwise.

Also, serie[~pd.isnull(serie)] I find really nifty. Note the tilde ~ operator. Demo above.

To further explain things: the "natural unit of work" in a DataFrame are its columns, which are stored as Series. Rows are only built when necessary if you operate on the axis=1 or you explicitly iterate using itertuple or iterrows. — 301_Moved_Permanently
– 301_Moved_Permanently, Commented Jul 27, 2018 at 9:34

Stack Exchange Network

Selecting only non-empty cells in a row

2 Answers 2

You must log in to answer this question.

Linked

Hot Network Questions

Selecting only non-empty cells in a row

2 Answers 2

You must log in to answer this question.

Linked

Related

Hot Network Questions