-2

I have a bunch of data and each row has a 4 date fields. The first date field is a duplicate for the ID number and sometimes not. It looks a little like this:

ID,LName,FName,DateIn,DateOut,Days,ODateIn,ODateOut,Odays
1,Doe,Jay,7/14/2023,8/14/2023,31.00,8/15/2023,4/22/2024,251.00
1,Doe,Jay,3/4/2021,11/5/2021,246.00,11/12/2021,12/31/2021,49.00
1,Doe,Jay,7/14/2023,8/14/2023,31.00,5/30/2024,7/2/2024,33.00
1,Doe,Jay,5/8/2022,1/1/2023,238.00,2/28/2023,4/8/2023,39.00
2,Smith,Dude,4/16/2022,6/2/2022,47.00,7/23/2022,9/13/2022,52.00
2,Smith,Dude,12/5/2022,3/14/2023,99.00,8/30/2023,10/11/2023,42.00
2,Smith,Dude,1/3/2024,3/30/2024,87.00,7/18/2024,9/1/2024,45.00
3,Doe,Jane,4/6/2020,8/10/2020,126.00,11/12/2020,1/18/2021,67.00
3,Doe,Jane,4/6/2020,8/10/2020,126.00,3/27/2021,6/9/2021,74.00
3,Doe,Jane,4/6/2020,8/10/2020,126.00,10/4/2021,11/30/2021,57.00

I would like to clean it up to combine them into one row for each ID number. It would look like this:

ID,DateIn1,DateOut1,Days1,DateIn2,DateOut2,Days2,DateIn3,DateOut3,Days3,ODateIn1,ODateOut1,Days1,ODateIn2,ODateOut2,Days2,ODateIn3,ODateOut3,Days3,ODateIn4,ODateOut4,Days4
1,3/4/2021,11/5/2021,246.00,5/8/2022,1/1/2023,238.00,7/14/2023,8/14/2023,31.00,11/12/2021,12/31/2021,49.00,2/28/2023,4/8/2023,39.00,8/15/2023,4/22/2024,251.00,5/30/2024,7/2/2024,33.00
2,4/16/2022,6/2/2022,47.00,12/5/2022,3/14/2023,99.00,1/3/2024,3/30/2024,87.00,7/23/2022,9/13/2022,52.00,8/30/2023,10/11/2023,42.00,7/18/2024,9/1/2024,45.00,,,
3,4/6/2020,8/10/2020,126.00,,,,,,,11/12/2020,1/18/2021,67.00,3/27/2021,6/9/2021,74.00,10/4/2021,11/30/2021,57.00,,,

I tried the pivot method but it didn't work because of my duplicate values in the first two sets of dates. Anyone have any insight into what could make this work?

10
  • 2
    With a question like this, please share a useable version of your example data, not just an image of the data. You should also share what you tried, even though you know why it doesn't work - in this case your pivot example. It's unclear from the data alone exactly why you expect the result you shared and how you yourself decide between either of the duplicate values - which would likely lead to some implementation. Finally, you're talking about dataframe, but what library? pandas? polars?
    – Grismar
    Commented Sep 14, 2024 at 2:13
  • sorry i'm super new and didn't know that. i don't see how i can attach it now so i'm not sure how to remedy this. also, the dataframe library is pandas. Commented Sep 16, 2024 at 15:48
  • @ThatOneGirl, show raw data (a sample of) - before it was loaded into pandas ? csv? excel,db extract ... help us help you be you doing the basics, what are days,odates representing ?, how many duplicate id's could there be ... 10,100,1000,10000......, what values would go into rows with one entry but other having multiple - give details, be concise. ,use copy/paste ie: copy your code/data and paste into your question ....
    – ticktalk
    Commented Sep 16, 2024 at 21:58
  • @ThatOneGirl that's fine, we all started at 1 points and people can tell you're new here, so don't take getting told what to do as being told to leave. We're trying to help you improve your question, so you can get a better answer - for you and others that find it later. You can edit your question, and make any changes you like. In this case, it would be good if you added a table of data, or just a formatted code section with the data in .csv format, or whatever text format you have it in. As well as a section of the relevant code you have - you can format a block with three backticks.
    – Grismar
    Commented Sep 17, 2024 at 13:18
  • @Grismar not at all. I understand now. I changed the format of the data to CSV. I hope this is what you meant. I really feel like a peon but this is most likely the way I'll get good the best. I learn best by doing and I have a lot to learn so actually learning from everyone here and implementing what I learn will help me get better. I value the community. :) Commented Sep 25, 2024 at 20:36

1 Answer 1

0

since you haven't shared any of your own code am presuming you've not done anything (if you have , exellent, post for feedback) :-(

Below a stab joining your example rows to facilitate panda'ing them ... Could easily be improved.

pseudo code

  • read input csv, appending common key records into buffer array
  • scan buffer to pad 'short' records with empty/null fields
  • write buffer to a 'flattened' csv
  • load flattened csv into a panda dataframe (to test )
  • display end
import csv

key=[]       # fields used to build a unique key
flattened=[] # holds a joined row
header=[]
hdrNum=1 # used when generating new column names
skipHeader=True
hdrTemplate=[] # repeating field names for joined row (incremented to provide unique column name)

mxFlds=0 # used to determine how many ',' need to be added to rows with < mxFlds 

buffer=[] # holds processed rows, written out to csv

with open('flattenMe.csv', newline='') as csvfile:
    spam = csv.reader(csvfile, delimiter=',' )
    for row in spam:
        if skipHeader == True:
            header=row[0:3]     # 
            hdrTemplate=row[3:] # we need this when adding new columns
            for h in row[3:]:
                header.append(h+str(hdrNum))
            hdrNum=hdrNum+1
            skipHeader=False # so we only do this block once

            continue

        if key == []:
            key=row[0:3]
            flattened=row
        else:
            thiskey=row[0:3]
            if thiskey != key: #key change processing

                buffer.append(flattened)
                hdrNum=hdrNum+1

                flattened=row
                key=thiskey
            else:
                for fld in row[3:]:
                    flattened.append(fld)

                if not hdrTemplate[-1]+str(hdrNum) in header: # this key set more rows than any previous

                    for fld in hdrTemplate:
                        header.append(fld+str(hdrNum))

        if len(header) > mxFlds:
            mxFlds=len(header)

buffer.append(flattened)

#
# write results back to a csv
#
with open('flattened.csv', 'w' ) as spam:
    slice = csv.writer(spam, dialect='unix', quoting=csv.QUOTE_NONE )
    slice.writerow( header )
    for r in buffer:
        slice.writerow( r )

#
# pandas
#
import pandas as pd
acsv = pd.read_csv('flattened.csv')

#
# show some detail
#
print(acsv)

running produces:

   ID  LName FName    DateIn1   DateOut1  Days1    ODateIn1   ODateOut1  Odays1  ...    ODateIn4  ODateOut4  Odays4   DateIn5   DateOut5  Days5    ODateIn5   ODateOut5  Odays5
0   1    Doe   Jay  7/14/2023  8/14/2023   31.0   8/15/2023   4/22/2024   251.0  ...   2/28/2023   4/8/2023    39.0       NaN        NaN    NaN         NaN         NaN     NaN
1   2  Smith  Dude  4/16/2022   6/2/2022   47.0   7/23/2022   9/13/2022    52.0  ...         NaN        NaN     NaN       NaN        NaN    NaN         NaN         NaN     NaN
2   3    Doe  Jane   4/6/2020  8/10/2020  126.0  11/12/2020   1/18/2021    67.0  ...         NaN        NaN     NaN       NaN        NaN    NaN         NaN         NaN     NaN
3   4    Pat   May  4/11/2020  8/10/2020  126.0   10/4/2021  11/30/2021    57.0  ...  11/12/2020  1/18/2021    67.0  3/9/2021  11/5/2021  246.0  11/12/2021  12/31/2021    49.0

[4 rows x 33 columns]

feel free use/butcher/ignore as appropriate. E&O acknowledged, all critique gratefully taken onboard but will not be posting any more on this

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.