2

I have a DataFrame df1 that looks like this:

Column1 Column2
13 1
12 1
15 0
16 0
15 1
14 1
12 1
11 0
21 1
45 1
44 0

The 1s indicate that a measurement is running. I don't know in advance how many 1s one measurement will contain, or how many 0s will separate two measurements. What I want are sub-DataFrames, each covering one consecutive run of 1s. For my example, that would be:

df2

Column1 Column2
13 1
12 1

df3

Column1 Column2
15 1
14 1
12 1

and df4

Column1 Column2
21 1
45 1

Alternatively, it would be acceptable to number the measurements in Column2, so that I can split on that value later:

df5

Column1 Column2
13 1
12 1
15 0
16 0
15 2
14 2
12 2
11 0
21 3
45 3
44 0

I have no idea how to approach this, and searching online did not turn up a suitable approach either. Thanks for any help.

1 Comment

Use .diff() and .cumsum(). Commented Nov 25 at 15:49

4 Answers

3

I would use the hint provided in Reinderien's comment to realize your alternative approach (i.e. counting up to get df5 from your question):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "Column1": [13,12,15,16,15,14,12,11,21,45,44],
    "Column2": [1,1,0,0,1,1,1,0,1,1,0]
})

df5 = df1.copy()
starts = (np.diff(df5["Column2"], prepend=0) == 1)
df5["Column2"] = np.cumsum(starts) * df5["Column2"]

This produces the following df5:

    Column1  Column2
0        13        1
1        12        1
2        15        0
3        16        0
4        15        2
5        14        2
6        12        2
7        11        0
8        21        3
9        45        3
10       44        0

Main ideas:

  • With np.diff(…, prepend=0), we get the difference between each value and its predecessor, which is nonzero wherever consecutive values differ:
    [1 0 -1 0 1 0 0 -1 1 0 -1]
    
  • With np.diff(…) == 1, we only keep those where the value changes from 0 to 1 (as opposed to changing from 1 to 0), i.e. we only keep the starts of segments containing measurements:
    [1 0 0 0 1 0 0 0 1 0 0]
    
    or rather, [True False False …], since we have boolean values at this point.
  • With np.cumsum(…), we "spread out" this information to subsequent rows, at the same time incrementing the value at the start of each new segment:
    [1 1 1 1 2 2 2 2 3 3 3]
    
  • With np.cumsum(…) * df5["Column2"], we suppress the gaps between segments again:
      [1 1 1 1 2 2 2 2 3 3 3]
    * [1 1 0 0 1 1 1 0 1 1 0]
    
    = [1 1 0 0 2 2 2 0 3 3 0]
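
The four steps above can be reproduced one at a time. The following is only a sketch that prints each intermediate array for the example data; the variable names (diffs, starts, segment_ids, labels) are chosen here for illustration and are not part of the answer's code.

```python
import numpy as np

column2 = np.array([1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0])

# Step 1: difference to the previous value; prepend=0 treats the
# (non-existent) row before the first one as "no measurement".
diffs = np.diff(column2, prepend=0)

# Step 2: keep only the 0 -> 1 transitions, i.e. the segment starts.
starts = diffs == 1

# Step 3: running count of starts, carried forward to the following rows.
segment_ids = np.cumsum(starts)

# Step 4: multiply by the original flags to zero out the gaps again.
labels = segment_ids * column2

print(diffs.tolist())               # [1, 0, -1, 0, 1, 0, 0, -1, 1, 0, -1]
print(starts.astype(int).tolist())  # [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]
print(segment_ids.tolist())         # [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
print(labels.tolist())              # [1, 1, 0, 0, 2, 2, 2, 0, 3, 3, 0]
```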
    

If you prefer a pandas-only solution, you could replace the last two lines with:

starts = df5["Column2"].diff() == 1
starts[0] = (df5["Column2"][0] == 1)
df5["Column2"] = starts.cumsum() * df5["Column2"]

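Here, some extra effort (starts[0] = …) is necessary to get the correct value for df5's first row: diff() returns NaN there, and NaN == 1 evaluates to False, so a measurement that begins in the very first row would otherwise be missed. In the previous solution, np.diff(…, prepend=0) took care of this.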
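
One possible way to avoid patching the first row by hand, sketched here as an alternative rather than as part of the answer above, is to compare each value with its predecessor via Series.shift(fill_value=0). The fill value plays the same role as prepend=0 in the NumPy version. The name flags is introduced only for this example.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Column1": [13, 12, 15, 16, 15, 14, 12, 11, 21, 45, 44],
    "Column2": [1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0],
})

df5 = df1.copy()
flags = df5["Column2"]

# shift(fill_value=0) pretends the row before the first one was 0,
# so a leading 1 is recognised as a start without a special case.
starts = flags.eq(1) & flags.shift(fill_value=0).eq(0)

# Running count of starts, zeroed out again in the gaps.
df5["Column2"] = starts.cumsum() * flags

print(df5)
```

Because shift() works by position, this variant does not rely on the index containing the label 0.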

1
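# Label every run of identical values: the label increases wherever the value changes
grp = df['Column2'].ne(df['Column2'].shift()).cumsum()
# Keep only the rows that belong to a measurement
cond = df['Column2'].ne(0)
# One sub-DataFrame per remaining run label
out = [d for _, d in df[cond].groupby(grp)]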

out

[   Column1  Column2
 0       13        1
 1       12        1,

    Column1  Column2
 4       15        1
 5       14        1
 6       12        1,

    Column1  Column2
 8       21        1
 9       45        1]

1

Another possible solution:

m = df['Column2'].eq(1)
starts = df['Column2'].diff().fillna(1).eq(1) & m
df['grp_id'] = starts.cumsum()
[df[(df['grp_id'].eq(i)) & m].drop(columns='grp_id') 
 for i in df['grp_id'][m].unique()]

This solution identifies contiguous measurement blocks in four steps. First, it creates a boolean mask m of the rows where Column2 equals 1. Second, it detects the start of each block: diff() finds the transitions from 0 to 1 (fillna(1) handles the first row, where diff() yields NaN), and combining the result with m keeps only true beginnings. Third, cumsum() converts these boolean start flags into sequential group IDs. Finally, a list comprehension iterates over the unique() group IDs and slices the DataFrame via boolean indexing, producing one sub-DataFrame per measurement block and dropping the temporary grp_id column from each.

Output:

[   Column1  Column2
 0       13        1
 1       12        1,
    Column1  Column2
 4       15        1
 5       14        1
 6       12        1,
    Column1  Column2
 8       21        1
 9       45        1]
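
As a variation on the final step, the group IDs can also be handed to groupby(), which splits the measurement rows in a single pass instead of filtering once per ID. This is only a sketch under the same assumptions as the code above; keeping grp_id as a separate Series (rather than a column of df) is a choice made here so that df itself is left unchanged.

```python
import pandas as pd

df = pd.DataFrame({
    "Column1": [13, 12, 15, 16, 15, 14, 12, 11, 21, 45, 44],
    "Column2": [1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0],
})

# Same mask and start detection as in the answer.
m = df["Column2"].eq(1)
starts = df["Column2"].diff().fillna(1).eq(1) & m

# Group IDs are kept in a standalone Series, not added to df.
grp_id = starts.cumsum()

# Group only the measurement rows; each group is one block of 1s.
blocks = [block for _, block in df[m].groupby(grp_id[m])]

for block in blocks:
    print(block)
```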

-2

Here’s one way to solve it using pandas. The key idea is to assign a “measurement number” to each consecutive block of 1s in Column2. Then, if you want, you can split them into separate DataFrames.

import pandas as pd

# Sample DataFrame
data = {
    "Column1": [13,12,15,16,15,14,12,11,21,45,44],
    "Column2": [1,1,0,0,1,1,1,0,1,1,0]
}

df = pd.DataFrame(data)

# Step 1: Add a new column for measurement numbers
df['Measurement'] = 0

measurement = 0       # Counter for each measurement
in_measurement = False  # Flag to track if we are inside a block of 1s

# Step 2: Iterate over rows to assign measurement numbers
for i in range(len(df)):
    if df.loc[i, 'Column2'] == 1:
        if not in_measurement:
            measurement += 1   # Start of a new measurement
            in_measurement = True
        df.loc[i, 'Measurement'] = measurement
    else:
        in_measurement = False  # End of the current measurement

print(df)

Output:

    Column1  Column2  Measurement
0        13        1            1
1        12        1            1
2        15        0            0
3        16        0            0
4        15        1            2
5        14        1            2
6        12        1            2
7        11        0            0
8        21        1            3
9        45        1            3
10       44        0            0

Split into separate DataFrames

Once you have the Measurement column, you can split the DataFrame like this:

# Get unique measurement numbers
measurements = df['Measurement'].unique()

# Create a list of sub-DataFrames
dfs = [df[df['Measurement']==m].drop(columns='Measurement') for m in measurements if m != 0]

# Example: first measurement
df2 = dfs[0]
print(df2)

This will give you:

   Column1  Column2
0       13        1
1       12        1

You can similarly access df3, df4, etc.

How this works

  1. in_measurement keeps track of whether you are currently inside a block of 1s.

  2. measurement increments only when a new block starts.

  3. Every 1 in the same block gets the same number.

  4. 0s are ignored but mark the end of a block.

  5. After that, splitting into sub-DataFrames is easy.
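
The same bookkeeping can be expressed without an explicit Python loop. The sketch below is one possible vectorised equivalent, not the code from this answer; label_measurements is a hypothetical helper name. Note that the loop above uses df.loc[i, …] with i from range(len(df)), which assumes the default 0..n-1 index, whereas this version works by position, so the example deliberately uses a different index.

```python
import pandas as pd


def label_measurements(flags: pd.Series) -> pd.Series:
    """Number each consecutive block of 1s as 1, 2, 3, ...; other rows get 0."""
    inside = flags.eq(1)
    # A block starts where this row is a 1 and the previous row is not.
    block_start = inside & flags.shift(fill_value=0).ne(1)
    # Running count of block starts, reset to 0 outside the blocks.
    return block_start.cumsum().where(inside, 0)


df = pd.DataFrame(
    {
        "Column1": [13, 12, 15, 16, 15, 14, 12, 11, 21, 45, 44],
        "Column2": [1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0],
    },
    index=list("abcdefghijk"),  # deliberately not 0..n-1
)

df["Measurement"] = label_measurements(df["Column2"])

# Split into a dict keyed by measurement number, skipping the 0 rows.
dfs = {
    number: part.drop(columns="Measurement")
    for number, part in df[df["Measurement"] != 0].groupby("Measurement")
}

print(df)
print(dfs[1])
```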

1 Comment

No, don't do this. It fails to vectorise.
