Performant cartesian product (CROSS JOIN) with pandas

up vote
16
down vote

favorite

The contents of this post were originally meant to be a part of
Pandas Merging 101,
but due to the nature and size of the content required to fully do
justice to this topic, it has been moved to its own QnA.

Given two simple DataFrames;

left = pd.DataFrame('col1' : ['A', 'B', 'C'], 'col2' : [1, 2, 3])
right = pd.DataFrame('col1' : ['X', 'Y', 'Z'], 'col2' : [20, 30, 50])

left

 col1 col2
0 A 1
1 B 2
2 C 3

right

 col1 col2
0 X 20
1 Y 30
2 Z 50

The cross product of these frames can be computed, and will look something like:

A 1 X 20
A 1 Y 30
A 1 Z 50
B 2 X 20
B 2 Y 30
B 2 Z 50
C 3 X 20
C 3 Y 30
C 3 Z 50

What is the most performant method of computing this result?

edited Dec 10 at 9:04

asked Dec 10 at 3:12

coldspeed

116k18107185

1

Would you like share your input in Github as well , I think adding cross join in pandas is really good to match all the join function in SQL . github.com/pandas-dev/pandas/issues/5401
– W-B
Dec 11 at 20:58

1

@W-B Thanks! That thread seems interesting, let me see if there's something I can contribute to it.
– coldspeed
Dec 12 at 2:53

add a comment |

up vote
16
down vote

favorite

The contents of this post were originally meant to be a part of
Pandas Merging 101,
but due to the nature and size of the content required to fully do
justice to this topic, it has been moved to its own QnA.

Given two simple DataFrames;

left = pd.DataFrame('col1' : ['A', 'B', 'C'], 'col2' : [1, 2, 3])
right = pd.DataFrame('col1' : ['X', 'Y', 'Z'], 'col2' : [20, 30, 50])

left

 col1 col2
0 A 1
1 B 2
2 C 3

right

 col1 col2
0 X 20
1 Y 30
2 Z 50

The cross product of these frames can be computed, and will look something like:

A 1 X 20
A 1 Y 30
A 1 Z 50
B 2 X 20
B 2 Y 30
B 2 Z 50
C 3 X 20
C 3 Y 30
C 3 Z 50

What is the most performant method of computing this result?

edited Dec 10 at 9:04

asked Dec 10 at 3:12

coldspeed

116k18107185

1

Would you like share your input in Github as well , I think adding cross join in pandas is really good to match all the join function in SQL . github.com/pandas-dev/pandas/issues/5401
– W-B
Dec 11 at 20:58

1

@W-B Thanks! That thread seems interesting, let me see if there's something I can contribute to it.
– coldspeed
Dec 12 at 2:53

add a comment |

up vote
16
down vote

favorite

The contents of this post were originally meant to be a part of
Pandas Merging 101,
but due to the nature and size of the content required to fully do
justice to this topic, it has been moved to its own QnA.

Given two simple DataFrames;

left = pd.DataFrame('col1' : ['A', 'B', 'C'], 'col2' : [1, 2, 3])
right = pd.DataFrame('col1' : ['X', 'Y', 'Z'], 'col2' : [20, 30, 50])

left

 col1 col2
0 A 1
1 B 2
2 C 3

right

 col1 col2
0 X 20
1 Y 30
2 Z 50

The cross product of these frames can be computed, and will look something like:

A 1 X 20
A 1 Y 30
A 1 Z 50
B 2 X 20
B 2 Y 30
B 2 Z 50
C 3 X 20
C 3 Y 30
C 3 Z 50

What is the most performant method of computing this result?

edited Dec 10 at 9:04

asked Dec 10 at 3:12

coldspeed

116k18107185

The contents of this post were originally meant to be a part of
Pandas Merging 101,
but due to the nature and size of the content required to fully do
justice to this topic, it has been moved to its own QnA.

Given two simple DataFrames;

left = pd.DataFrame('col1' : ['A', 'B', 'C'], 'col2' : [1, 2, 3])
right = pd.DataFrame('col1' : ['X', 'Y', 'Z'], 'col2' : [20, 30, 50])

left

 col1 col2
0 A 1
1 B 2
2 C 3

right

 col1 col2
0 X 20
1 Y 30
2 Z 50

The cross product of these frames can be computed, and will look something like:

A 1 X 20
A 1 Y 30
A 1 Z 50
B 2 X 20
B 2 Y 30
B 2 Z 50
C 3 X 20
C 3 Y 30
C 3 Z 50

What is the most performant method of computing this result?

python pandas numpy dataframe merge

edited Dec 10 at 9:04

asked Dec 10 at 3:12

coldspeed

116k18107185

edited Dec 10 at 9:04

asked Dec 10 at 3:12

coldspeed

116k18107185

edited Dec 10 at 9:04

asked Dec 10 at 3:12

coldspeed

116k18107185

asked Dec 10 at 3:12

coldspeed

116k18107185

asked Dec 10 at 3:12

coldspeed

116k18107185

1

Would you like share your input in Github as well , I think adding cross join in pandas is really good to match all the join function in SQL . github.com/pandas-dev/pandas/issues/5401
– W-B
Dec 11 at 20:58

1

@W-B Thanks! That thread seems interesting, let me see if there's something I can contribute to it.
– coldspeed
Dec 12 at 2:53

add a comment |

1

Would you like share your input in Github as well , I think adding cross join in pandas is really good to match all the join function in SQL . github.com/pandas-dev/pandas/issues/5401
– W-B
Dec 11 at 20:58

1

@W-B Thanks! That thread seems interesting, let me see if there's something I can contribute to it.
– coldspeed
Dec 12 at 2:53

Would you like share your input in Github as well , I think adding cross join in pandas is really good to match all the join function in SQL . github.com/pandas-dev/pandas/issues/5401
– W-B
Dec 11 at 20:58

@W-B Thanks! That thread seems interesting, let me see if there's something I can contribute to it.
– coldspeed
Dec 12 at 2:53

add a comment |

3 Answers
3

active

oldest

votes

up vote
16
down vote

accepted

Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:

def cartesian_product_basic(left, right):
 return (
 left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', 1))

cartesian_product_basic(left, right)

 col1_x col2_x col1_y col2_y
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50

How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge then performs a many-to-many JOIN on "key".

While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.

A faster implementation will require NumPy. Here are some famous NumPy implementations of 1D cartesian product. We can build on some of these performant solutions to get our desired output. My favourite, however, is @senderle's first implementation.

def cartesian_product(*arrays):
 la = len(arrays)
 dtype = np.result_type(*arrays)
 arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
 for i, a in enumerate(np.ix_(*arrays)):
 arr[...,i] = a
 return arr.reshape(-1, la)

Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames

Disclaimer

These solutions are optimised for DataFrames with non-mixed scalar dtypes. If dealing with mixed dtypes, use at your
own risk!

This trick will work on any kind of DataFrame. We compute the cartesian product of the DataFrames' numeric indices using the aforementioned cartesian_product, use this to reindex the DataFrames, and

def cartesian_product_generalized(left, right):
 la, lb = len(left), len(right)
 idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])
 return pd.DataFrame(
 np.column_stack([left.values[idx[:,0]], right.values[idx[:,1]]]))

cartesian_product_generalized(left, right)

 0 1 2 3
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50

np.array_equal(cartesian_product_generalized(left, right),
 cartesian_product_basic(left, right))
True

And, along similar lines,

left2 = left.copy()
left2.index = ['s1', 's2', 's1']

right2 = right.copy()
right2.index = ['x', 'y', 'y']


left2
 col1 col2
s1 A 1
s2 B 2
s1 C 3

right2
 col1 col2
x X 20
y Y 30
y Z 50

np.array_equal(cartesian_product_generalized(left, right),
 cartesian_product_basic(left2, right2))
True

This solution can generalise to multiple DataFrames. For example,

def cartesian_product_multi(*dfs):
 idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])
 return pd.DataFrame(
 np.column_stack([df.values[idx[:,i]] for i,df in enumerate(dfs)]))

cartesian_product_multi(*[left, right, left]).head()

 0 1 2 3 4 5
0 A 1 X 20 A 1
1 A 1 X 20 B 2
2 A 1 X 20 C 3
3 A 1 X 20 D 4
4 A 1 Y 30 A 1

Further Simplification

A simpler solution not involving @senderle's cartesian_product is possible when dealing with just two DataFrames. Using np.broadcast_arrays, we can achieve almost the same level of performance.

def cartesian_product_simplified(left, right):
 la, lb = len(left), len(right)
 ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])

 return pd.DataFrame(
 np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))

np.array_equal(cartesian_product_simplified(left, right),
 cartesian_product_basic(left2, right2))
True

Performance Comparison

Benchmarking these solutions on some contrived DataFrames with unique indices, we have

enter image description here

Do note that timings may vary based on your setup, data, and choice of cartesian_product helper function as applicable.

Functions from Other Answers

# Wen's answer: https://stackoverflow.com/a/53699198/4909087
# I've put my own spin on this to make it as fast as possible.
def cartesian_product_itertools(left, right):
 return pd.DataFrame([
 [*x, *y] for x, y in itertools.product(
 left.values.tolist(), right.values.tolist())])

Performance Benchmarking Code

This is the timing script. All functions called here are defined above.

from timeit import timeit
import pandas as pd
import matplotlib.pyplot as plt

res = pd.DataFrame(
 index=['cartesian_product_basic', 'cartesian_product_generalized', 
 'cartesian_product_multi', 'cartesian_product_simplified',
 'cartesian_product_itertools'],
 columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],
 dtype=float
)

for f in res.index: 
 for c in res.columns:
 # print(f,c)
 if f in 'cartesian_product_itertools' and c > 600:
 continue
 left2 = pd.concat([left] * c, ignore_index=True)
 right2 = pd.concat([right] * c, ignore_index=True)
 stmt = '(left2, right2)'.format(f)
 setp = 'from __main__ import left2, right2, '.format(f)
 res.at[f, c] = timeit(stmt, setp, number=5)

ax = res.div(res.min()).T.plot(loglog=True) 
ax.set_xlabel("N"); 
ax.set_ylabel("time (relative)");

plt.show()

edited Dec 12 at 14:58

answered Dec 10 at 3:12

coldspeed

116k18107185

add a comment |

up vote
4
down vote

Using itertools product and recreate the value in dataframe

import itertools
l=list(itertools.product(left.values.tolist(),right.values.tolist()))
pd.DataFrame(list(map(lambda x : sum(x,),l)))
 0 1 2 3
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50

answered Dec 10 at 3:41

W-B

99k73162

add a comment |

up vote
3
down vote

Here's an approach with triple concat

m = pd.concat([pd.concat([left]*len(right)).sort_index().reset_index(drop=True),
 pd.concat([right]*len(left)).reset_index(drop=True) ], 1)

 col1 col2 col1 col2
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50

enter image description here

edited Dec 10 at 14:00

answered Dec 10 at 13:39

Dark

21k31946

add a comment |

Your Answer

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53699012%2fperformant-cartesian-product-cross-join-with-pandas%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

3 Answers
3

active

oldest

votes

3 Answers
3

active

oldest

votes

up vote
16
down vote

accepted

Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:

def cartesian_product_basic(left, right):
 return (
 left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', 1))

cartesian_product_basic(left, right)

 col1_x col2_x col1_y col2_y
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50

How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge then performs a many-to-many JOIN on "key".

While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.

def cartesian_product(*arrays):
 la = len(arrays)
 dtype = np.result_type(*arrays)
 arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
 for i, a in enumerate(np.ix_(*arrays)):
 arr[...,i] = a
 return arr.reshape(-1, la)

Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames

Disclaimer

These solutions are optimised for DataFrames with non-mixed scalar dtypes. If dealing with mixed dtypes, use at your
own risk!

def cartesian_product_generalized(left, right):
 la, lb = len(left), len(right)
 idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])
 return pd.DataFrame(
 np.column_stack([left.values[idx[:,0]], right.values[idx[:,1]]]))

cartesian_product_generalized(left, right)

 0 1 2 3
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50

np.array_equal(cartesian_product_generalized(left, right),
 cartesian_product_basic(left, right))
True

And, along similar lines,

left2 = left.copy()
left2.index = ['s1', 's2', 's1']

right2 = right.copy()
right2.index = ['x', 'y', 'y']


left2
 col1 col2
s1 A 1
s2 B 2
s1 C 3

right2
 col1 col2
x X 20
y Y 30
y Z 50

np.array_equal(cartesian_product_generalized(left, right),
 cartesian_product_basic(left2, right2))
True

This solution can generalise to multiple DataFrames. For example,

def cartesian_product_multi(*dfs):
 idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])
 return pd.DataFrame(
 np.column_stack([df.values[idx[:,i]] for i,df in enumerate(dfs)]))

cartesian_product_multi(*[left, right, left]).head()

 0 1 2 3 4 5
0 A 1 X 20 A 1
1 A 1 X 20 B 2
2 A 1 X 20 C 3
3 A 1 X 20 D 4
4 A 1 Y 30 A 1

Further Simplification

A simpler solution not involving @senderle's cartesian_product is possible when dealing with just two DataFrames. Using np.broadcast_arrays, we can achieve almost the same level of performance.

def cartesian_product_simplified(left, right):
 la, lb = len(left), len(right)
 ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])

 return pd.DataFrame(
 np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))

np.array_equal(cartesian_product_simplified(left, right),
 cartesian_product_basic(left2, right2))
True

Performance Comparison

Benchmarking these solutions on some contrived DataFrames with unique indices, we have

enter image description here

Do note that timings may vary based on your setup, data, and choice of cartesian_product helper function as applicable.

Functions from Other Answers

# Wen's answer: https://stackoverflow.com/a/53699198/4909087
# I've put my own spin on this to make it as fast as possible.
def cartesian_product_itertools(left, right):
 return pd.DataFrame([
 [*x, *y] for x, y in itertools.product(
 left.values.tolist(), right.values.tolist())])

Performance Benchmarking Code

This is the timing script. All functions called here are defined above.

from timeit import timeit
import pandas as pd
import matplotlib.pyplot as plt

res = pd.DataFrame(
 index=['cartesian_product_basic', 'cartesian_product_generalized', 
 'cartesian_product_multi', 'cartesian_product_simplified',
 'cartesian_product_itertools'],
 columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],
 dtype=float
)

for f in res.index: 
 for c in res.columns:
 # print(f,c)
 if f in 'cartesian_product_itertools' and c > 600:
 continue
 left2 = pd.concat([left] * c, ignore_index=True)
 right2 = pd.concat([right] * c, ignore_index=True)
 stmt = '(left2, right2)'.format(f)
 setp = 'from __main__ import left2, right2, '.format(f)
 res.at[f, c] = timeit(stmt, setp, number=5)

ax = res.div(res.min()).T.plot(loglog=True) 
ax.set_xlabel("N"); 
ax.set_ylabel("time (relative)");

plt.show()

edited Dec 12 at 14:58

answered Dec 10 at 3:12

coldspeed

116k18107185

add a comment |

up vote
16
down vote

accepted

Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:

def cartesian_product_basic(left, right):
 return (
 left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', 1))

cartesian_product_basic(left, right)

 col1_x col2_x col1_y col2_y
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50

How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge then performs a many-to-many JOIN on "key".

While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.

def cartesian_product(*arrays):
 la = len(arrays)
 dtype = np.result_type(*arrays)
 arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
 for i, a in enumerate(np.ix_(*arrays)):
 arr[...,i] = a
 return arr.reshape(-1, la)

Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames

Disclaimer

These solutions are optimised for DataFrames with non-mixed scalar dtypes. If dealing with mixed dtypes, use at your
own risk!

def cartesian_product_generalized(left, right):
 la, lb = len(left), len(right)
 idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])
 return pd.DataFrame(
 np.column_stack([left.values[idx[:,0]], right.values[idx[:,1]]]))

cartesian_product_generalized(left, right)

 0 1 2 3
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50

np.array_equal(cartesian_product_generalized(left, right),
 cartesian_product_basic(left, right))
True

And, along similar lines,

left2 = left.copy()
left2.index = ['s1', 's2', 's1']

right2 = right.copy()
right2.index = ['x', 'y', 'y']


left2
 col1 col2
s1 A 1
s2 B 2
s1 C 3

right2
 col1 col2
x X 20
y Y 30
y Z 50

np.array_equal(cartesian_product_generalized(left, right),
 cartesian_product_basic(left2, right2))
True

This solution can generalise to multiple DataFrames. For example,

def cartesian_product_multi(*dfs):
 idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])
 return pd.DataFrame(
 np.column_stack([df.values[idx[:,i]] for i,df in enumerate(dfs)]))

cartesian_product_multi(*[left, right, left]).head()

 0 1 2 3 4 5
0 A 1 X 20 A 1
1 A 1 X 20 B 2
2 A 1 X 20 C 3
3 A 1 X 20 D 4
4 A 1 Y 30 A 1

Further Simplification

A simpler solution not involving @senderle's cartesian_product is possible when dealing with just two DataFrames. Using np.broadcast_arrays, we can achieve almost the same level of performance.

def cartesian_product_simplified(left, right):
 la, lb = len(left), len(right)
 ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])

 return pd.DataFrame(
 np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))

np.array_equal(cartesian_product_simplified(left, right),
 cartesian_product_basic(left2, right2))
True

Performance Comparison

Benchmarking these solutions on some contrived DataFrames with unique indices, we have

enter image description here

Do note that timings may vary based on your setup, data, and choice of cartesian_product helper function as applicable.

Functions from Other Answers

# Wen's answer: https://stackoverflow.com/a/53699198/4909087
# I've put my own spin on this to make it as fast as possible.
def cartesian_product_itertools(left, right):
 return pd.DataFrame([
 [*x, *y] for x, y in itertools.product(
 left.values.tolist(), right.values.tolist())])

Performance Benchmarking Code

This is the timing script. All functions called here are defined above.

from timeit import timeit
import pandas as pd
import matplotlib.pyplot as plt

res = pd.DataFrame(
 index=['cartesian_product_basic', 'cartesian_product_generalized', 
 'cartesian_product_multi', 'cartesian_product_simplified',
 'cartesian_product_itertools'],
 columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],
 dtype=float
)

for f in res.index: 
 for c in res.columns:
 # print(f,c)
 if f in 'cartesian_product_itertools' and c > 600:
 continue
 left2 = pd.concat([left] * c, ignore_index=True)
 right2 = pd.concat([right] * c, ignore_index=True)
 stmt = '(left2, right2)'.format(f)
 setp = 'from __main__ import left2, right2, '.format(f)
 res.at[f, c] = timeit(stmt, setp, number=5)

ax = res.div(res.min()).T.plot(loglog=True) 
ax.set_xlabel("N"); 
ax.set_ylabel("time (relative)");

plt.show()

edited Dec 12 at 14:58

answered Dec 10 at 3:12

coldspeed

116k18107185

add a comment |

up vote
16
down vote

accepted

Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:

def cartesian_product_basic(left, right):
 return (
 left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', 1))

cartesian_product_basic(left, right)

 col1_x col2_x col1_y col2_y
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50

How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge then performs a many-to-many JOIN on "key".

While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.

def cartesian_product(*arrays):
 la = len(arrays)
 dtype = np.result_type(*arrays)
 arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
 for i, a in enumerate(np.ix_(*arrays)):
 arr[...,i] = a
 return arr.reshape(-1, la)

Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames

Disclaimer

These solutions are optimised for DataFrames with non-mixed scalar dtypes. If dealing with mixed dtypes, use at your
own risk!

def cartesian_product_generalized(left, right):
 la, lb = len(left), len(right)
 idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])
 return pd.DataFrame(
 np.column_stack([left.values[idx[:,0]], right.values[idx[:,1]]]))

cartesian_product_generalized(left, right)

 0 1 2 3
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50

np.array_equal(cartesian_product_generalized(left, right),
 cartesian_product_basic(left, right))
True

And, along similar lines,

left2 = left.copy()
left2.index = ['s1', 's2', 's1']

right2 = right.copy()
right2.index = ['x', 'y', 'y']


left2
 col1 col2
s1 A 1
s2 B 2
s1 C 3

right2
 col1 col2
x X 20
y Y 30
y Z 50

np.array_equal(cartesian_product_generalized(left, right),
 cartesian_product_basic(left2, right2))
True

This solution can generalise to multiple DataFrames. For example,

def cartesian_product_multi(*dfs):
 idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])
 return pd.DataFrame(
 np.column_stack([df.values[idx[:,i]] for i,df in enumerate(dfs)]))

cartesian_product_multi(*[left, right, left]).head()

 0 1 2 3 4 5
0 A 1 X 20 A 1
1 A 1 X 20 B 2
2 A 1 X 20 C 3
3 A 1 X 20 D 4
4 A 1 Y 30 A 1

Further Simplification

A simpler solution not involving @senderle's cartesian_product is possible when dealing with just two DataFrames. Using np.broadcast_arrays, we can achieve almost the same level of performance.

def cartesian_product_simplified(left, right):
 la, lb = len(left), len(right)
 ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])

 return pd.DataFrame(
 np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))

np.array_equal(cartesian_product_simplified(left, right),
 cartesian_product_basic(left2, right2))
True

Performance Comparison

Benchmarking these solutions on some contrived DataFrames with unique indices, we have

enter image description here

Do note that timings may vary based on your setup, data, and choice of cartesian_product helper function as applicable.

Functions from Other Answers

# Wen's answer: https://stackoverflow.com/a/53699198/4909087
# I've put my own spin on this to make it as fast as possible.
def cartesian_product_itertools(left, right):
 return pd.DataFrame([
 [*x, *y] for x, y in itertools.product(
 left.values.tolist(), right.values.tolist())])

Performance Benchmarking Code

This is the timing script. All functions called here are defined above.

from timeit import timeit
import pandas as pd
import matplotlib.pyplot as plt

res = pd.DataFrame(
 index=['cartesian_product_basic', 'cartesian_product_generalized', 
 'cartesian_product_multi', 'cartesian_product_simplified',
 'cartesian_product_itertools'],
 columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],
 dtype=float
)

for f in res.index: 
 for c in res.columns:
 # print(f,c)
 if f in 'cartesian_product_itertools' and c > 600:
 continue
 left2 = pd.concat([left] * c, ignore_index=True)
 right2 = pd.concat([right] * c, ignore_index=True)
 stmt = '(left2, right2)'.format(f)
 setp = 'from __main__ import left2, right2, '.format(f)
 res.at[f, c] = timeit(stmt, setp, number=5)

ax = res.div(res.min()).T.plot(loglog=True) 
ax.set_xlabel("N"); 
ax.set_ylabel("time (relative)");

plt.show()

edited Dec 12 at 14:58

answered Dec 10 at 3:12

coldspeed

116k18107185

Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:

def cartesian_product_basic(left, right):
 return (
 left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', 1))

cartesian_product_basic(left, right)

 col1_x col2_x col1_y col2_y
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50

How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge then performs a many-to-many JOIN on "key".

While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.

def cartesian_product(*arrays):
 la = len(arrays)
 dtype = np.result_type(*arrays)
 arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
 for i, a in enumerate(np.ix_(*arrays)):
 arr[...,i] = a
 return arr.reshape(-1, la)

Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames

Disclaimer

These solutions are optimised for DataFrames with non-mixed scalar dtypes. If dealing with mixed dtypes, use at your
own risk!

def cartesian_product_generalized(left, right):
 la, lb = len(left), len(right)
 idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])
 return pd.DataFrame(
 np.column_stack([left.values[idx[:,0]], right.values[idx[:,1]]]))

cartesian_product_generalized(left, right)

 0 1 2 3
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50

np.array_equal(cartesian_product_generalized(left, right),
 cartesian_product_basic(left, right))
True

And, along similar lines,

left2 = left.copy()
left2.index = ['s1', 's2', 's1']

right2 = right.copy()
right2.index = ['x', 'y', 'y']


left2
 col1 col2
s1 A 1
s2 B 2
s1 C 3

right2
 col1 col2
x X 20
y Y 30
y Z 50

np.array_equal(cartesian_product_generalized(left, right),
 cartesian_product_basic(left2, right2))
True

This solution can generalise to multiple DataFrames. For example,

def cartesian_product_multi(*dfs):
 idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])
 return pd.DataFrame(
 np.column_stack([df.values[idx[:,i]] for i,df in enumerate(dfs)]))

cartesian_product_multi(*[left, right, left]).head()

 0 1 2 3 4 5
0 A 1 X 20 A 1
1 A 1 X 20 B 2
2 A 1 X 20 C 3
3 A 1 X 20 D 4
4 A 1 Y 30 A 1

Further Simplification

A simpler solution not involving @senderle's cartesian_product is possible when dealing with just two DataFrames. Using np.broadcast_arrays, we can achieve almost the same level of performance.

def cartesian_product_simplified(left, right):
 la, lb = len(left), len(right)
 ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])

 return pd.DataFrame(
 np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))

np.array_equal(cartesian_product_simplified(left, right),
 cartesian_product_basic(left2, right2))
True

Performance Comparison

Benchmarking these solutions on some contrived DataFrames with unique indices, we have

enter image description here

Do note that timings may vary based on your setup, data, and choice of cartesian_product helper function as applicable.

Functions from Other Answers

# Wen's answer: https://stackoverflow.com/a/53699198/4909087
# I've put my own spin on this to make it as fast as possible.
def cartesian_product_itertools(left, right):
 return pd.DataFrame([
 [*x, *y] for x, y in itertools.product(
 left.values.tolist(), right.values.tolist())])

Performance Benchmarking Code

This is the timing script. All functions called here are defined above.

from timeit import timeit
import pandas as pd
import matplotlib.pyplot as plt

res = pd.DataFrame(
 index=['cartesian_product_basic', 'cartesian_product_generalized', 
 'cartesian_product_multi', 'cartesian_product_simplified',
 'cartesian_product_itertools'],
 columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],
 dtype=float
)

for f in res.index: 
 for c in res.columns:
 # print(f,c)
 if f in 'cartesian_product_itertools' and c > 600:
 continue
 left2 = pd.concat([left] * c, ignore_index=True)
 right2 = pd.concat([right] * c, ignore_index=True)
 stmt = '(left2, right2)'.format(f)
 setp = 'from __main__ import left2, right2, '.format(f)
 res.at[f, c] = timeit(stmt, setp, number=5)

ax = res.div(res.min()).T.plot(loglog=True) 
ax.set_xlabel("N"); 
ax.set_ylabel("time (relative)");

plt.show()

edited Dec 12 at 14:58

answered Dec 10 at 3:12

coldspeed

116k18107185

edited Dec 12 at 14:58

answered Dec 10 at 3:12

coldspeed

116k18107185

answered Dec 10 at 3:12

coldspeed

116k18107185

answered Dec 10 at 3:12

coldspeed

116k18107185

add a comment |

up vote
4
down vote

Using itertools product and recreate the value in dataframe

import itertools
l=list(itertools.product(left.values.tolist(),right.values.tolist()))
pd.DataFrame(list(map(lambda x : sum(x,),l)))
 0 1 2 3
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50

answered Dec 10 at 3:41

W-B

99k73162

add a comment |

up vote
4
down vote

Using itertools product and recreate the value in dataframe

import itertools
l=list(itertools.product(left.values.tolist(),right.values.tolist()))
pd.DataFrame(list(map(lambda x : sum(x,),l)))
 0 1 2 3
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50

answered Dec 10 at 3:41

W-B

99k73162

add a comment |

up vote
4
down vote

Using itertools product and recreate the value in dataframe

import itertools
l=list(itertools.product(left.values.tolist(),right.values.tolist()))
pd.DataFrame(list(map(lambda x : sum(x,),l)))
 0 1 2 3
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50

answered Dec 10 at 3:41

W-B

99k73162

Using itertools product and recreate the value in dataframe

import itertools
l=list(itertools.product(left.values.tolist(),right.values.tolist()))
pd.DataFrame(list(map(lambda x : sum(x,),l)))
 0 1 2 3
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50

answered Dec 10 at 3:41

W-B

99k73162

answered Dec 10 at 3:41

W-B

99k73162

answered Dec 10 at 3:41

W-B

99k73162

answered Dec 10 at 3:41

W-B

99k73162

add a comment |

up vote
3
down vote

Here's an approach with triple concat

m = pd.concat([pd.concat([left]*len(right)).sort_index().reset_index(drop=True),
 pd.concat([right]*len(left)).reset_index(drop=True) ], 1)

 col1 col2 col1 col2
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50

enter image description here

edited Dec 10 at 14:00

answered Dec 10 at 13:39

Dark

21k31946

add a comment |

up vote
3
down vote

Here's an approach with triple concat

m = pd.concat([pd.concat([left]*len(right)).sort_index().reset_index(drop=True),
 pd.concat([right]*len(left)).reset_index(drop=True) ], 1)

 col1 col2 col1 col2
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50

enter image description here

edited Dec 10 at 14:00

answered Dec 10 at 13:39

Dark

21k31946

add a comment |

up vote
3
down vote

Here's an approach with triple concat

m = pd.concat([pd.concat([left]*len(right)).sort_index().reset_index(drop=True),
 pd.concat([right]*len(left)).reset_index(drop=True) ], 1)

 col1 col2 col1 col2
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50

enter image description here

edited Dec 10 at 14:00

answered Dec 10 at 13:39

Dark

21k31946

Here's an approach with triple concat

m = pd.concat([pd.concat([left]*len(right)).sort_index().reset_index(drop=True),
 pd.concat([right]*len(left)).reset_index(drop=True) ], 1)

 col1 col2 col1 col2
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50

enter image description here

edited Dec 10 at 14:00

answered Dec 10 at 13:39

Dark

21k31946

edited Dec 10 at 14:00

answered Dec 10 at 13:39

Dark

21k31946

answered Dec 10 at 13:39

Dark

21k31946

answered Dec 10 at 13:39

Dark

21k31946

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

搜尋此網誌

mjhjmtu