Parallel Computing - 2 vs. 4 processor speed [closed]

I am evaluating code that ends with a Table wrapping ParallelEvaluate of a function XXXX[phi, theta, si]. For a grid of 225 points, an ordinary 2-processor laptop takes 7 h, compared with 8.30 h on a high-end 4-processor Xeon computer. CPU and memory usage on the laptop and the computer are about 66% vs. 99% and 700 MB vs. 900 MB, respectively. I would be thankful for any suggestions on how to improve the evaluation speed on the computer. Thanks.



closed as off-topic by m_goldberg, José Antonio Díaz Navas, PlatoManiac, Yves Klett, b3m2a1 Jan 25 at 4:12


This question appears to be off-topic. The users who voted to close gave this specific reason:


  • "This question cannot be answered without additional information. Questions on problems in code must describe the specific problem and include valid code to reproduce it. Any data used for programming examples should be embedded in the question or code to generate the (fake) data must be included." – m_goldberg, José Antonio Díaz Navas, PlatoManiac, Yves Klett, b3m2a1
If this question can be reworded to fit the rules in the help center, please edit the question.















  • Okay, having skimmed through the code, I can say that your actual problem is not parallelization but your coding style. You use far too much symbolic computation; you use unpacked arrays (in particular because you mix integer, symbolic and machine precision numbers in arrays); you recompute data over and over again (have a look at the many Sort and SortBy commands); instead of concise function calls with purely numerical input and output, you use replace rules;...
    – Henrik Schumacher
    Jan 19 at 2:20










  • Running XXXX[0, 0, 0] once takes 135 s on my machine. I guess this can be executed 100 to 1000 times faster with proper refactoring of your code (and probably by using Compile here and there).
    – Henrik Schumacher
    Jan 19 at 2:22










  • @HenrikSchumacher You are correct, there are many ways to write code, and as someone writing such complicated Mathematica code for the first time, I may not have chosen the most efficient sub-steps. Still, my question about parallelization remains. Suppose some code takes x seconds to evaluate at one point: how can we make the run time scale linearly with the number of points and the number of processors (using ParallelTable, ParallelEvaluate, or any other method)? I would be thankful for your suggestions on that. In the meantime, I will try to modify the code to reduce the time per point x by incorporating your suggestions. Thanks.
    – user49535
    Jan 19 at 4:49






  • @user49535 I think that your replacement rules are killing you. If I run DSC[0, 0, 0, 1] by itself, I get output that's over 12 million bytes because the code is unable to multiply the numbers by your f values since they're one of the last things to be defined. If possible, I would try to store the f values as actual numbers in a matrix. It looks to me like the output of DSC is actually supposed to be a matrix with 36 rows and 3 columns, where the second 2 columns are just indices, so it should be on the order of 1000 bytes.
    – MassDefect
    Jan 19 at 5:52






  • @user49535 Out of curiosity, is there a particular resource (like an algorithm from a book or code in another language) that you're trying to emulate? I'm trying to figure out what f does, but it's difficult. XX1 calls XXXX which calls DSC -> SEM -> SE -> TrueStrain -> EigenStrain which contains these f variables, but the computer doesn't know what they are yet. Then we go all the way back up the stack to DSC, and some of the f variables are replaced with other f variables but still not assigned a value, until we go back up to XXXX and have dsc/.R[m-1]
    – MassDefect
    Jan 19 at 7:39
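The comments above name two habits that can cost more time than the kernel count ever recovers: arrays that mix exact and machine-precision entries (and therefore cannot be packed), and symbolic f placeholders that only receive numbers at the very end. A minimal sketch of the difference follows. The 36-entry vector fvals, the expression symbolic and the helpers viaRules and viaVector are made-up stand-ins, not taken from the actual DSC code, which was not posted.

```mathematica
(* Hypothetical stand-ins: none of these names come from the original code *)
fvals = RandomReal[{0, 1}, 36];              (* numeric coefficients known up front *)

(* Symbolic route: build an expression in f[1], ..., f[36], substitute numbers last *)
symbolic = Sum[f[i] Sin[i x], {i, 36}];
rules = Table[f[i] -> fvals[[i]], {i, 36}];
viaRules[x0_?NumericQ] := symbolic /. rules /. x -> x0;

(* Numeric route: use the coefficient vector directly, no rules involved *)
viaVector[x0_?NumericQ] := fvals . Sin[Range[36] x0];

(* Both routes give the same number; only the amount of symbolic work differs *)
{viaRules[0.3], viaVector[0.3]}

(* Size of the symbolic intermediate versus the plain numeric data *)
{ByteCount[symbolic], ByteCount[fvals]}

(* Packed-array check: a single exact entry prevents packing *)
Developer`PackedArrayQ[fvals]                   (* True *)
Developer`PackedArrayQ[Append[fvals, 1/2]]      (* False, 1/2 is an exact rational *)
Developer`PackedArrayQ[
  Developer`ToPackedArray[N[Append[fvals, 1/2]]]]   (* True once everything is a machine real *)
```

In the real code, the same check with Developer`PackedArrayQ on intermediate arrays is a quick way to find where integers, symbols and reals get mixed.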















performance-tuning parallelization














asked Jan 18 at 7:27 by user49535 (edited Jan 24 at 6:14)














1 Answer












Without knowing the exact function (I assume it's something fairly long, possibly involving integrals or differential equations), I can only make the following suggestions:

It looks like you're using exact numbers. If this is necessary for your application, then there's probably not a lot you can do, but exact numbers usually slow things down substantially. If you can, use Real numbers (just place a dot after the numbers, like {phi, 0., Pi/4., Pi/56.}). If you need more precision than that but don't necessarily require the infinite precision of exact numbers, you can also do this: {phi, 0`50, Pi/4`50, Pi/56`50}. This will give you 50 digits of precision to work with, which should make your final answer pretty close to the exact answer.

The other thing I would try is:

    XX1 = ParallelTable[
      {XXXX[phi, theta, si], phi, theta, si},    (* value together with its grid point *)
      {phi, 0, Pi/4, Pi/56},
      {theta, 0, ArcCot[Cos[phi]], ArcCot[Cos[phi]]/14},
      {si, 0 Pi, 0 Pi, 0}                        (* si range as posted *)
    ]

I think that ParallelTable is a better way to handle this than ParallelEvaluate. On a trial function, I see about a 100x speedup. ParallelEvaluate simply evaluates your exact same function once on every kernel (4 times at each data point on your machine) rather than splitting the grid points across the kernels.

If you can, combine both things for the best speedup.

I hope this helps a bit! There are some people on here who are amazing at optimizing; perhaps they will be able to improve the speed even more. If it's possible, I would recommend posting your XXXX function unless it's insanely long.
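Both suggestions can be tried on a stand-in before touching the real code. The sketch below is only an illustration under stated assumptions: slowF is a hypothetical toy function used in place of XXXX (which was not posted), si is simply held at 0., and the grid uses the dotted, machine-precision iterators in phi and theta. The timings it prints depend entirely on the machine and on how expensive the function is, so no particular speedup is implied.

```mathematica
(* Exact, machine-precision and 50-digit versions of the same step size *)
Precision /@ {Pi/56, Pi/56., Pi/56`50}   (* Infinity, MachinePrecision, and about 50 *)

(* slowF is a hypothetical stand-in for XXXX; replace it with the real function *)
slowF[phi_?NumericQ, theta_?NumericQ, si_?NumericQ] :=
  NIntegrate[Sin[phi t]^2 + Cos[theta t] + si, {t, 0, 10}];

If[$KernelCount == 0, LaunchKernels[]];   (* start subkernels if none are running *)
DistributeDefinitions[slowF];             (* make slowF known on every subkernel *)

(* Serial reference run over the phi-theta grid with machine-precision iterators *)
{tSerial, resSerial} = AbsoluteTiming[
   Table[slowF[phi, theta, 0.],
     {phi, 0., Pi/4., Pi/56.},
     {theta, 0., ArcCot[Cos[phi]], ArcCot[Cos[phi]]/14.}]];

(* Same grid, with the outer iterator split across the subkernels *)
{tParallel, resParallel} = AbsoluteTiming[
   ParallelTable[slowF[phi, theta, 0.],
     {phi, 0., Pi/4., Pi/56.},
     {theta, 0., ArcCot[Cos[phi]], ArcCot[Cos[phi]]/14.}]];

{tSerial, tParallel}                      (* wall-clock seconds for each variant *)
Max[Abs[Flatten[resSerial - resParallel]]]   (* both runs should agree *)
```

If the parallel run is not faster on such a toy, the per-point cost is probably too small to outweigh the communication overhead; the comparison is more telling with a function that takes seconds per point.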












  • Thanks @LukasLang! How do you type grave accents without them being interpreted as the inline code markers? I tried backslashes before them, but that didn't help.
    – MassDefect
    Jan 18 at 9:10







  • You have to increase the number of enclosing accents: ``` `` Codewithaccents`` ```. If you need double accents, you enclose the code with three, and so on. (Edit: for some reason, it doesn't work in the comment section, but you can edit your answer to see how it's done.)
    – Lukas Lang
    Jan 18 at 9:24











  • @LukasLang Oh, I see! Thanks!
    – MassDefect
    Jan 18 at 9:32










  • Thanks, both of you. Three points: 1. I do not necessarily need exact values of (theta, phi); if it speeds things up, I can use ".". 2. I tried ParallelTable first, but in contrast to your experience, it took 30 h/48 h on the 4/2-processor computer, compared with 8 h/7 h for ParallelEvaluate, i.e. 4 to 8 times slower. 3. How can I combine both? Do you mean ParallelTable[ParallelEvaluate[...]]?
    – user49535
    Jan 18 at 11:03






  • @user49535 As MassDefect already pointed out, using ParallelEvaluate here does not make sense at all. It enforces that the same value is computed on each of your CPU cores, which is why you won't gain any speedup. It really depends on your actual function XXXX whether ParallelTable can help at all. If it is a pure function, then ParallelTable should help. But if XXXX has side effects (like modifying data that has to be used by another thread), then it is hard to parallelize the execution. In a nutshell, we cannot give any further suggestions without knowing XXXX.
    – Henrik Schumacher
    Jan 18 at 11:49
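The difference between the two functions discussed in these comments can be seen on a trivial example. This is a small sketch, not the poster's code: the expressions 2 + 2 and i^2 and the names perKernel, same, squares and whoDidWhat are placeholders chosen for illustration.

```mathematica
If[$KernelCount == 0, LaunchKernels[]];   (* start subkernels if none are running *)

(* ParallelEvaluate runs the same expression once on every subkernel *)
perKernel = ParallelEvaluate[$KernelID];  (* one kernel ID per running subkernel *)
Length[perKernel] == $KernelCount         (* True *)

same = ParallelEvaluate[2 + 2];           (* the identical result, repeated once per kernel *)

(* ParallelTable hands different values of the iterator to different subkernels *)
squares = ParallelTable[i^2, {i, 8}];     (* {1, 4, 9, 16, 25, 36, 49, 64} *)

(* Pairing each i with the ID of the kernel that evaluated it shows the split *)
whoDidWhat = ParallelTable[{i, $KernelID}, {i, 8}]
```

With ParallelEvaluate every kernel repeats the whole job, so more kernels only add redundant work; with ParallelTable each kernel receives a share of the iterator values, which is what allows the run time to drop as kernels are added, provided the function has no side effects shared between kernels.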


















1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









6












$begingroup$

Without knowing the exact function (I assume it's something fairly long, possibly involving integrals or differential equations), I can only make the following suggestions:



It looks like you're using exact numbers. If this is necessary for your application, then there's probably not a lot you can do, but exact numbers usually slow things down substantially. If you can, use Real numbers (just place a dot after the numbers like phi, 0., Pi/4., Pi/56.. If you need more precision than that but don't necessarily require the infinite precision of exact numbers, you can also do this: phi, 0`50, Pi/4`50, Pi/56`50. This will give you 50 digits of precision to work with which should make your final answer pretty close to the exact answer.



The other thing I would try is:



XX1 = ParallelTable[
XXXX[phi, theta, si]], phi, theta, si,
phi, 0, Pi/4, Pi/56,
theta, 0, ArcCot[Cos[phi]], ArcCot[Cos[phi]]/14,
si, 0 Pi, 0 Pi, 0
]


I think that ParallelTable is a better way to handle this than ParallelEvaluate. On a trial function, I see about a 100x speedup. ParallelEvaluate is simply evaluating your exact same function 4 times at each data point rather than splitting the task into multiple threads.



If you can, combine both things for the best speedup.



I hope this helps a bit! There are some people on here that are amazing at optimizing, perhaps they will be able to improve the speed even more. If it's possible, I would recommend posting your XXXX function unless it's insanely long.






share|improve this answer











$endgroup$












  • $begingroup$
    Thanks @LukasLang ! How do you type grave accents without it interpreting them as the inline code markers? I tried backslashes before them, but that didn’t help.
    $endgroup$
    – MassDefect
    Jan 18 at 9:10







  • 1




    $begingroup$
    You have to increase the amount of enclosing accents: ``` `` Codewithaccents`` ```. If you need double accents, you enclose the code with three, and so on (edit: for some reason, it doesn't work in the comment section - but you can edit your answer to see how it's done)
    $endgroup$
    – Lukas Lang
    Jan 18 at 9:24











  • $begingroup$
    @LukasLang Oh, I see! Thanks!
    $endgroup$
    – MassDefect
    Jan 18 at 9:32










  • $begingroup$
    Thanks both of you. three points 1. I do not necessary need to use exact values of (theta, phi) if it can speed up, can use ".". 2. I tried to use ParalleleTable first, but in contrast to your experience, it took 30h/48h for 4/2 processor computer as compared to 8h/7h for ParallelEvaluate. 4 - 8 times slower. 3. How can I combine both...you mean ParallelTable[ParallelEvaluate[. ??
    $endgroup$
    – user49535
    Jan 18 at 11:03






  • 1




    $begingroup$
    @user49535 As MassDefect already pointed out, using ParallelEvaluate here does not make sense at all. It enforces that the same value is computed on each of your CPU cores which is why you won't gain any speedup. It really depends on your actual function XXXX whether ParallelTable can help at all. If it is a pure function then ParallelTable should help.But if XXXX has side effects (like modifying data that has to be used by another thread) then it is hard to parallelize the execution. In a nutshell, we cannot give any further suggestions without knowing XXXX.
    $endgroup$
    – Henrik Schumacher
    Jan 18 at 11:49
















6












$begingroup$

Without knowing the exact function (I assume it's something fairly long, possibly involving integrals or differential equations), I can only make the following suggestions:



It looks like you're using exact numbers. If this is necessary for your application, then there's probably not a lot you can do, but exact numbers usually slow things down substantially. If you can, use Real numbers (just place a dot after the numbers like phi, 0., Pi/4., Pi/56.. If you need more precision than that but don't necessarily require the infinite precision of exact numbers, you can also do this: phi, 0`50, Pi/4`50, Pi/56`50. This will give you 50 digits of precision to work with which should make your final answer pretty close to the exact answer.



The other thing I would try is:



XX1 = ParallelTable[
XXXX[phi, theta, si]], phi, theta, si,
phi, 0, Pi/4, Pi/56,
theta, 0, ArcCot[Cos[phi]], ArcCot[Cos[phi]]/14,
si, 0 Pi, 0 Pi, 0
]


I think that ParallelTable is a better way to handle this than ParallelEvaluate. On a trial function, I see about a 100x speedup. ParallelEvaluate is simply evaluating your exact same function 4 times at each data point rather than splitting the task into multiple threads.



If you can, combine both things for the best speedup.



I hope this helps a bit! There are some people on here that are amazing at optimizing, perhaps they will be able to improve the speed even more. If it's possible, I would recommend posting your XXXX function unless it's insanely long.






share|improve this answer











$endgroup$












  • $begingroup$
    Thanks @LukasLang ! How do you type grave accents without it interpreting them as the inline code markers? I tried backslashes before them, but that didn’t help.
    $endgroup$
    – MassDefect
    Jan 18 at 9:10







  • 1




    $begingroup$
    You have to increase the amount of enclosing accents: ``` `` Codewithaccents`` ```. If you need double accents, you enclose the code with three, and so on (edit: for some reason, it doesn't work in the comment section - but you can edit your answer to see how it's done)
    $endgroup$
    – Lukas Lang
    Jan 18 at 9:24











  • $begingroup$
    @LukasLang Oh, I see! Thanks!
    $endgroup$
    – MassDefect
    Jan 18 at 9:32










  • $begingroup$
    Thanks both of you. three points 1. I do not necessary need to use exact values of (theta, phi) if it can speed up, can use ".". 2. I tried to use ParalleleTable first, but in contrast to your experience, it took 30h/48h for 4/2 processor computer as compared to 8h/7h for ParallelEvaluate. 4 - 8 times slower. 3. How can I combine both...you mean ParallelTable[ParallelEvaluate[. ??
    $endgroup$
    – user49535
    Jan 18 at 11:03






  • 1




    $begingroup$
    @user49535 As MassDefect already pointed out, using ParallelEvaluate here does not make sense at all. It enforces that the same value is computed on each of your CPU cores which is why you won't gain any speedup. It really depends on your actual function XXXX whether ParallelTable can help at all. If it is a pure function then ParallelTable should help.But if XXXX has side effects (like modifying data that has to be used by another thread) then it is hard to parallelize the execution. In a nutshell, we cannot give any further suggestions without knowing XXXX.
    $endgroup$
    – Henrik Schumacher
    Jan 18 at 11:49














6

















$begingroup$

Without knowing the exact function (I assume it's something fairly long, possibly involving integrals or differential equations), I can only make the following suggestions:



It looks like you're using exact numbers. If this is necessary for your application, then there's probably not a lot you can do, but exact numbers usually slow things down substantially. If you can, use Real (machine-precision) numbers: just place a dot after the numbers, like {phi, 0., Pi/4., Pi/56.}. If you need more precision than that but don't necessarily require the infinite precision of exact numbers, you can also do this: {phi, 0`50, Pi/4`50, Pi/56`50}. This will give you 50 digits of precision to work with, which should make your final answer pretty close to the exact answer.
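As a small illustration of the three kinds of numbers mentioned above, the following sketch asks Mathematica for the precision of an exact value, a machine-precision value and a 50-digit value. The variable names are made up for this example and have nothing to do with XXXX.

```mathematica
(* Three versions of the same number; names are illustrative only *)
exactValue   = Pi/4;        (* exact: infinite precision, symbolic arithmetic *)
machineValue = N[Pi/4];     (* machine-precision Real, the fastest option *)
bigValue     = N[Pi/4, 50]; (* arbitrary precision with 50 digits *)

(* Compare the precision Mathematica tracks for each of them *)
Precision /@ {exactValue, machineValue, bigValue}
```

The first entry should come back as Infinity, the second as MachinePrecision and the third as about 50. Arithmetic on the machine-precision value is usually the cheapest of the three.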



The other thing I would try is:



XX1 = ParallelTable[
  {XXXX[phi, theta, si], phi, theta, si},
  {phi, 0, Pi/4, Pi/56},
  {theta, 0, ArcCot[Cos[phi]], ArcCot[Cos[phi]]/14},
  {si, 0 Pi, 0 Pi, 0}
]


I think that ParallelTable is a better way to handle this than ParallelEvaluate. On a trial function, I see about a 100x speedup. ParallelEvaluate simply evaluates your exact same function on every parallel kernel (so 4 times at each data point with 4 kernels) rather than splitting the work among the kernels.
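The difference can be seen with a deliberately slow toy function. This is only a sketch: slowSquare is a hypothetical stand-in, since the real XXXX was not posted, and the timings will depend on how many kernels are available.

```mathematica
(* Hypothetical stand-in for an expensive function: waits, then squares *)
slowSquare[x_] := (Pause[0.5]; x^2);
DistributeDefinitions[slowSquare]; (* make the definition known to all kernels *)

(* ParallelEvaluate runs the same expression on every kernel,
   so this returns one identical result per kernel *)
ParallelEvaluate[slowSquare[3]]

(* Table does all 8 evaluations on the main kernel, one after another *)
AbsoluteTiming[Table[slowSquare[x], {x, 8}]]

(* ParallelTable splits the 8 evaluations among the available kernels *)
AbsoluteTiming[ParallelTable[slowSquare[x], {x, 8}]]
```

With 4 kernels, the ParallelTable timing should be noticeably shorter than the Table timing, while wrapping ParallelEvaluate inside Table would repeat each point on every kernel and save nothing.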



If you can, combine both things for the best speedup.
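A rough sketch of what combining the two suggestions could look like is given below. It makes several assumptions that are not in the original post: demoXXXX is a made-up placeholder for the real function, si is simply fixed at 0., and the grid is driven by integer counters (converted with N) so that rounding cannot drop the last grid point.

```mathematica
(* Hypothetical placeholder; the real XXXX was not posted *)
demoXXXX[phi_, theta_, si_] := Sin[phi] Cos[theta] + si;

XX1 = ParallelTable[
   With[{phi = N[i Pi/56]},                    (* 15 phi values from 0. to Pi/4 *)
    With[{theta = N[j ArcCot[Cos[phi]]/14],    (* 15 theta values per phi *)
      si = 0.},                                (* si held fixed in this sketch *)
     {demoXXXX[phi, theta, si], phi, theta, si}]],
   {i, 0, 14}, {j, 0, 14}];

Dimensions[XX1] (* a 15 x 15 grid, i.e. 225 points, with 4 numbers each *)
```

So "combining" here means ParallelTable with machine-precision arguments, not ParallelTable wrapped around ParallelEvaluate.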



I hope this helps a bit! There are some people on here who are amazing at optimizing; perhaps they will be able to improve the speed even more. If possible, I would recommend posting your XXXX function unless it's insanely long.

















$endgroup$











edited Jan 18 at 8:42
Lukas Lang

answered Jan 18 at 8:07
MassDefect














































