How to unnest a column in a pandas DataFrame?

I have the following DataFrame, in which one of the columns has object dtype (each cell holds a list).

import pandas as pd
df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]]})
df
Out[458]: 
 A B
0 1 [1, 2]
1 2 [1, 2]

My expected output has one row per list element, with the value of A repeated for each element.

How can I achieve this?

asked 4 hours ago

W-B

90.7k72755

Happy to up-vote 2 times (actually 1.5 times tho :D) :-)
– U9-Forward
3 hours ago

@U9-Forward thank you man :-)
– W-B
3 hours ago

Haha, YW :D good question here
– U9-Forward
3 hours ago

python pandas dataframe


3 Answers

As someone who uses both R and Python and has spent a year on this site, I have seen this type of question a couple of times.

In R there is a ready-made function for this, unnest from the tidyr package, but in Python (pandas) there is no built-in function for this type of question.

Object-dtype columns always make the data hard to convert with pandas functions. When I receive data like this, the first thing that comes to mind is to 'flatten' or unnest the column.
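A note for readers on newer pandas: as far as I know, releases from 0.25 onwards do ship a built-in for the single-column case, DataFrame.explode. The sketch below assumes such a version is installed; the manual methods that follow remain useful on older versions or when you want more control.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})

# DataFrame.explode (pandas >= 0.25) gives each list element its own row
# and repeats the other columns; the original index labels are repeated too.
out = df.explode('B')
print(out)

# The exploded column keeps object dtype, so cast it if integers are needed,
# and reset the index if the repeated labels are not wanted.
out = out.astype({'B': 'int64'}).reset_index(drop=True)
print(out)
```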

Method 1
apply + pd.Series (easy to understand, but not recommended in terms of performance)

df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'B'})
Out[463]: 
 A B
0 1 1
1 1 2
0 2 1
1 2 2

Method 2 using repeat with the DataFrame constructor to re-create your dataframe (good performance, but awkward with multiple columns)

df=pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})
df
Out[465]: 
 A B
0 1 1
0 1 2
1 2 1
1 2 2
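Method 2 does not need the lists to have equal lengths. Here is a minimal self-contained sketch with ragged lists; the sample data and the names lens and out are my own, not from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [[1, 2], [3], [4, 5, 6]]})

# Number of elements in each list: 2, 1, 3
lens = df['B'].str.len()

out = pd.DataFrame({
    # Repeat each value of A once per element of its list
    'A': np.repeat(df['A'].to_numpy(), lens),
    # Flatten all lists into one array, in row order
    'B': np.concatenate(df['B'].to_numpy()),
})
print(out)
```

Building A from a plain numpy array gives the result a fresh 0..n-1 index instead of the repeated labels shown above.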

Method 2.1: suppose that besides A we also have A.1 ... A.n. If we still use Method 2 above, we have to re-create those columns one by one, which is tedious.

Solution: join or merge on the index after unnesting the single column

s=pd.DataFrame({'B':np.concatenate(df.B.values)},index=df.index.repeat(df.B.str.len()))
s.join(df.drop(columns='B'),how='left')
Out[477]: 
 B A
0 1 1
0 2 1
1 1 2
1 2 2

If you need the column order to be exactly the same as before, add reindex at the end

s.join(df.drop(columns='B'),how='left').reindex(columns=df.columns)
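To make the multi-column case concrete, here is a small sketch of the join approach with one extra column. The column name A1 and the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'A1': ['x', 'y'], 'B': [[1, 2], [3]]})

# Unnest only B, labelling every element with the index of its source row
s = pd.DataFrame({'B': np.concatenate(df['B'].to_numpy())},
                 index=df.index.repeat(df['B'].str.len()))

# Join the remaining columns back on the index, then restore the column order
out = s.join(df.drop(columns='B'), how='left').reindex(columns=df.columns)
print(out)
```

This relies on the index of df being unique; with duplicate labels the join would match each element to several rows.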

Method 3 recreate the list

pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)
Out[488]: 
 A B
0 1 1
1 1 2
2 2 1
3 2 2

If there are more than two columns

s=pd.DataFrame([[x] + [z] for x, y in zip(df.index,df.B) for z in y])
s.merge(df,left_on=0,right_index=True)
Out[491]: 
 0 1 A B
0 0 1 1 [1, 2]
1 0 2 1 [1, 2]
2 1 1 2 [1, 2]
3 1 2 2 [1, 2]

Method 4 using reindex or loc

df.reindex(df.index.repeat(df.B.str.len())).assign(B=np.concatenate(df.B.values))
Out[554]: 
 A B
0 1 1
0 1 2
1 2 1
1 2 2

#df.loc[df.index.repeat(df.B.str.len())].assign(B=np.concatenate(df.B.values))

Method 5 when the lists contain only unique values (no value repeated, even across rows):

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]]})
from collections import ChainMap
d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A'])))
pd.DataFrame(list(d.items()),columns=df.columns[::-1])
Out[574]: 
 B A
0 1 1
1 2 1
2 3 2
3 4 2
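The uniqueness requirement comes from the list elements being used as dictionary keys. A small sketch of what goes wrong otherwise, using a plain dict comprehension in place of ChainMap (my substitution, chosen so the key order is easy to follow):

```python
import pandas as pd

# The value 2 appears in both rows
df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [2, 3]]})

# Each list element becomes a key, so a repeated element overwrites the earlier pair
d = {z: a for a, lst in zip(df['A'], df['B']) for z in lst}
out = pd.DataFrame(list(d.items()), columns=['B', 'A'])
print(out)

# Four (A, B) pairs went in, but only three rows come out
print(len(out))
```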

Special case: two columns of object type

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]],'C':[[1,2],[3,4]]})
df
Out[592]: 
 A B C
0 1 [1, 2] [1, 2]
1 2 [3, 4] [3, 4]

Self-defined function

def unnesting(df, explode):
    idx=df.index.repeat(df[explode[0]].str.len())
    df1=pd.concat([pd.DataFrame({x:np.concatenate(df[x].values)}) for x in explode],axis=1)
    df1.index=idx
    return df1.join(df.drop(columns=explode),how='left')

unnesting(df,['B','C'])
Out[609]: 
 B C A
0 1 1 1
0 2 2 1
1 3 3 2
1 4 4 2
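The function above takes the repeat counts from the first column in explode, so it assumes that the lists in all exploded columns have matching lengths row by row. A hedged sketch of a check you could run first; the helper name same_lengths is hypothetical, not part of pandas.

```python
import pandas as pd

def same_lengths(df, explode):
    """Return True if, in every row, all columns in `explode` hold lists of equal length."""
    # One column of list lengths per exploded column
    lens = pd.concat([df[c].str.len() for c in explode], axis=1)
    # Compare every length column with the first one, row by row
    return bool(lens.eq(lens.iloc[:, 0], axis=0).all().all())

ok = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [3, 4]], 'C': [[1, 2], [3, 4]]})
bad = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [3, 4]], 'C': [[1], [3, 4]]})

print(same_lengths(ok, ['B', 'C']))
print(same_lengths(bad, ['B', 'C']))
```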

Summary:

I am using pandas and plain Python functions for this type of question. If you are worried about the speed of the solutions above, check user3483203's answer, since it uses numpy, and numpy is faster most of the time. As a suggestion, if speed really matters in your case, I recommend Cython and numba.

edited 1 hour ago

answered 4 hours ago

W-B

90.7k72755

Good one! I like the answers here. Perhaps you could enumerate on some situations where multiple columns need unnesting, so how would a solution like this generalise to N arbitrary columns with even (or uneven) length lists.
– coldspeed
3 hours ago


Option 1

If all of the sublists in the other column are the same length, numpy can be an efficient option here:

vals = np.array(df.B.values.tolist()) 
a = np.repeat(df.A, vals.shape[1])

pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)

Option 2

If the sublists have different lengths, you need an additional step:

vals = df.B.values.tolist()
rs = [len(r) for r in vals] 
a = np.repeat(df.A, rs)

pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)

Option 3

I took a shot at generalizing this to flatten N columns and tile M columns. I'll work on making it more efficient later:

df = pd.DataFrame({'A': [1,2,3], 'B': [[1,2], [1,2,3], [1]],
                   'C': [[1,2,3], [1,2], [1,2]], 'D': ['A', 'B', 'C']})

 A B C D
0 1 [1, 2] [1, 2, 3] A
1 2 [1, 2, 3] [1, 2] B
2 3 [1] [1, 2] C

def unnest(df, tile, explode):
    vals = df[explode].sum(1)
    rs = [len(r) for r in vals]
    a = np.repeat(df[tile].values, rs, axis=0)
    b = np.concatenate(vals.values)
    d = np.column_stack((a, b))
    return pd.DataFrame(d, columns = tile + ['_'.join(explode)])

unnest(df, ['A', 'D'], ['B', 'C'])
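No output is shown for the call above, so here is a sketch of what I understand Option 3 to compute: the lists of B and C are concatenated per row and flattened into a single B_C column, while A and D are tiled. I replace the row-wise sum with an explicit list concatenation, which is my assumption about its intent.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [[1, 2], [1, 2, 3], [1]],
                   'C': [[1, 2, 3], [1, 2], [1, 2]], 'D': ['A', 'B', 'C']})

# Concatenate the B and C lists row by row (stand-in for df[explode].sum(1))
vals = [b + c for b, c in zip(df['B'], df['C'])]
rs = [len(v) for v in vals]                       # combined length per row

a = np.repeat(df[['A', 'D']].to_numpy(), rs, axis=0)   # tile A and D
b = np.concatenate(vals)                                # flatten B and C together
out = pd.DataFrame(np.column_stack((a, b)), columns=['A', 'D', 'B_C'])
print(out)
```

Note that the result has one merged column, not separate B and C columns, and every column ends up with object dtype because the stacked array mixes integers and strings.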

Functions

def wen1(df):
    return df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0: 'B'})

def wen2(df):
    return pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})

def wen3(df):
    s = pd.DataFrame({'B': np.concatenate(df.B.values)}, index=df.index.repeat(df.B.str.len()))
    return s.join(df.drop(columns='B'), how='left')

def wen4(df):
    return pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)

def chris1(df):
    vals = np.array(df.B.values.tolist())
    a = np.repeat(df.A, vals.shape[1])
    return pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)

def chris2(df):
    vals = df.B.values.tolist()
    rs = [len(r) for r in vals]
    a = np.repeat(df.A.values, rs)
    return pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)

Timings

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from timeit import timeit

res = pd.DataFrame(
    index=['wen1', 'wen2', 'wen3', 'wen4', 'chris1', 'chris2'],
    columns=[10, 50, 100, 500, 1000, 5000, 10000],
    dtype=float
)

for f in res.index:
    for c in res.columns:
        df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
        df = pd.concat([df]*c)
        # Build the statement and setup strings from the function name
        stmt = '{}(df)'.format(f)
        setp = 'from __main__ import df, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")
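If you want to try the comparison without the `from __main__ import` setup string, timeit also accepts a callable. Below is a cut-down sketch of the same idea with only two candidates; the function names repeat_concat, list_comp and bench are mine, and the sizes are kept small so it runs quickly.

```python
from timeit import timeit

import numpy as np
import pandas as pd

def repeat_concat(df):
    # numpy repeat + concatenate, as in Method 2 / Option 2
    return pd.DataFrame({'A': np.repeat(df['A'].to_numpy(), df['B'].str.len()),
                         'B': np.concatenate(df['B'].to_numpy())})

def list_comp(df):
    # plain Python list comprehension, as in Method 3
    return pd.DataFrame([[x, z] for x, y in zip(df['A'], df['B']) for z in y],
                        columns=['A', 'B'])

def bench(funcs, sizes, number=5):
    base = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
    res = pd.DataFrame(index=[f.__name__ for f in funcs], columns=sizes, dtype=float)
    for n in sizes:
        df = pd.concat([base] * n, ignore_index=True)
        for f in funcs:
            # Passing a callable avoids building statement strings
            res.at[f.__name__, n] = timeit(lambda: f(df), number=number)
    return res

res = bench([repeat_concat, list_comp], sizes=[10, 100])
print(res.div(res.min()))   # time relative to the fastest function per size
```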

Performance

[Image: log-log plot of relative time against N for each function]

edited 2 hours ago

answered 4 hours ago

user3483203

28.2k72351


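Something that is really not recommended (but at least it works in this case):

df=pd.concat([df]*2).sort_index()
it=iter(df['B'].tolist()[0]+df['B'].tolist()[0])
df['B']=df['B'].apply(lambda x:next(it))

This uses concat + sort_index + iter + apply + next, and it relies on every row holding the same two-element list.

Now print(df) shows one list element per row, with the original index labels repeated (0, 0, 1, 1).

If you care about the index:

df=df.reset_index(drop=True)

Now print(df) shows the same rows with a fresh index from 0 to 3.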

answered 3 hours ago

U9-Forward

8,7912733

