
The fabricated example below demonstrates a bizarre case I ran into. (If it matters, this is with Mma 14.1 on Apple ARM, but I see the same behavior with Mma 12.3.1 on Intel x86.)

On a whim, I tried replacing a matrix-like collection of memoized downvalues

n = 10000; sn = Floor[Sqrt[n]];
First@AbsoluteTiming[
  Do[ m[i,j] = N[ (i+j)/n ], {i,0,n}, {j,0,sn} ];
  ]
First@AbsoluteTiming@
  Table[ Product[ m[i,k] + m[n-i,k], {i,0,n-k^2} ], {k,0,sn} ]

0.982767

0.601142

with a 2-dimensional array:

First@AbsoluteTiming[
  t1 = ConstantArray[ N[0], {1+n, 1+sn} ];
  Do[ t1[[1+i,1+j]] = N[ (i+j)/n ], {i,0,n}, {j,0,sn}];
  ]
First@AbsoluteTiming@
  Table[ Product[ t1[[1+i,1+k]] + t1[[1+n-i,1+k]], {i,0,n-k^2} ], {k,0,sn} ]

I did this under the assumption that packed, contiguously stored arrays are inherently faster and require less memory; furthermore, their use sets the stage for other subsequent improvements involving list-related built-ins (e.g., building the matrix with Table instead of ConstantArray/Do, as sketched below).
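As an aside, here is a minimal sketch of that kind of follow-up improvement, building the same matrix in one step (t1alt is my name for it, to keep it separate from the timed t1 above); Developer`PackedArrayQ confirms the result is packed:

t1alt = Table[ N[ (i+j)/n ], {i,0,n}, {j,0,sn} ];
Developer`PackedArrayQ[t1alt]
(*  True  *)
t1alt == t1
(*  True  *)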

Indeed, with just the replacement shown above, the construction phase is already faster, but I was surprised by the dramatic slowdown in the calculation phase:

0.77669

15.733 (?!?)

Transposing the storage doesn't seem to matter:

First@AbsoluteTiming[
  t2 = ConstantArray[N[0], {1+sn, 1+n}];
  Do[ t2[[1+j,1+i]] = N[(i+j)/n], {j,0,sn}, {i,0,n}];
  ]
First@AbsoluteTiming@
  Table[ Product[ t2[[1+k,1+i]] + t2[[1+k,1+n-i]], {i,0,n-k^2} ], {k,0,sn} ]

0.731359

15.5293

Since the calculation phase (in the transposed version) works with one row at a time, I can extract each row and then use 1-dimensional indexing:

First@AbsoluteTiming@
  Table[ With[ {v = t2[[1+k]]}, Product[ v[[1+i]] + v[[1+n-i]], {i,0,n-k^2} ]], {k,0,sn} ]

This gets me back to the speed seen with memoized downvalues.

0.564264

Question: Why is 2-dimensional array access so slow compared to “2-dimensional” downvalue lookup, or compared to row-extraction followed by 1-dimensional array access?

• Wow, interesting catch. Something absolutely weird is going on with Product here. When I replace Product by Times @@ Table in the matrix-access version, things get a lot faster (though of course not as fast as the faster Product version, since some arrays have to be created and filled first; a sketch follows these comments). I think we cannot do anything about it. Please report this to Wolfram Support. I am pretty sure that this is not intended behavior. — Commented Apr 26 at 18:06
• Are the results the same? — Commented Apr 26 at 18:07
• Each version in the original post gives me identical results. Replacing Product with Times @@ Table also gives me identical results and indeed eliminates the dramatic slowdown. Thanks for the insight! The slowdown apparently depends on the context in which matrix access occurs, and thus has more to do with Product than with matrix access! And not just Product: I saw the same type of matrix-access slowdown in a different calculation involving Sum. I'll report to Wolfram Support and update here. — Commented Apr 26 at 19:20
• Yes, please report this. — Commented Apr 27 at 16:32
• This is 300 times faster than Product: First@AbsoluteTiming[Table[Tr[t1[[1 ;; 1 + n - k^2, 1 + k]] + t1[[1 + n ;; 1 + k^2 ;; -1, 1 + k]], Times], {k, 0, sn}]] — Commented Apr 28 at 2:15
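A minimal sketch of the Times @@ Table workaround from the first comment, using the t1, n, and sn from the question (exact timings will vary by machine):

First@AbsoluteTiming@
  Table[ Times @@ Table[ t1[[1+i,1+k]] + t1[[1+n-i,1+k]], {i,0,n-k^2} ], {k,0,sn} ]
(*  much faster than the Product version, per the first comment above  *)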

2 Answers


Here's the solution, deduced from an observation of the OP (see the series of comments on the second answer below):

First@AbsoluteTiming@Table[
   Product[t1[[1 + i, 1 + k]] + t1[[1 + n - i, 1 + k]], {i, 0, n - k^2}
    , Method -> {"Compiled" -> False}]
   , {k, 0, sn}]
(*  0.547137  *)

Compiling and calling a function with a large, constant array is expensive. The indirection provided by t2 := t1 in the original answer below seems to prevent compilation.
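Equivalently, the auto-compilation threshold for Product can be switched off globally. A minimal sketch based on the WithCleanup code from the comments on the other answer, assuming the t1, n, and sn from the question; WithCleanup restores the original settings afterwards:

WithCleanup[
  opts = SystemOptions["CompileOptions"];
  SetSystemOptions["CompileOptions" -> "ProductCompileLength" -> Infinity],
  First@AbsoluteTiming@
   Table[Product[t1[[1 + i, 1 + k]] + t1[[1 + n - i, 1 + k]],
     {i, 0, n - k^2}], {k, 0, sn}],
  SetSystemOptions[opts]]
(*  0.519559 reported in the comments, comparable to Method -> {"Compiled" -> False}  *)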


Original answer

Here's a cute trick:

$Version
(*  "14.2.1 for Mac OS X ARM (64-bit) (March 16, 2025)"  *)

First@AbsoluteTiming[
  glurg = Table[
    Product[t1[[1 + i, 1 + k]] + t1[[1 + n - i, 1 + k]], 
     {i, 0, n - k^2}], {k, 0, sn}]]
(*  16.2223  *)

t2 := t1 (* SetDelayed sleight-of-hand!!! *)
First@AbsoluteTiming[
  hmph = Table[
    Product[t2[[1 + i, 1 + k]] + t2[[1 + n - i, 1 + k]],
     {i, 0, n - k^2}], {k, 0, sn}]]
(*  0.589388  *)

hmph == glurg
(*  True  *)

Still not as fast as Fold[]:

First@RepeatedTiming[
  Table[Fold[Times, 
     t1[[1 ;; 1 + n - k^2, 1 + k]] + 
      t1[[1 + n ;; 1 + k^2 ;; -1, 1 + k]]], {k, 0, sn}];]
(*  0.0190083  *)
• Tr[t1[[1 ;; 1 + n - k^2, 1 + k]] + t1[[1 + n ;; 1 + k^2 ;; -1, 1 + k]], Times] takes 0.045 sec. to compute the product. — Commented Apr 28 at 2:33
• I wonder, in the slow version, what Mma is pondering. Indeed ironic how adding an extra step of SetDelayed indirection ends up speeding things up! — Commented Apr 29 at 6:52
• Wow, Fold[Times, ...] is impressive! There's also Times @@, which in this example slightly beats Tr[..., Times]. Warning: the various approaches don't necessarily give identical numerical results. Here, Times @@ and Tr[..., Times] match the OP, whereas Fold[Times, ...] does not. — Commented Apr 29 at 7:35
• @AlexanderPerlis This is even faster on long enough arrays, with yet again different roundoff errors: Table[Exp@Total@Log[t1[[1 ;; 1 + n - k^2, 1 + k]] + t1[[1 + n ;; 1 + k^2 ;; -1, 1 + k]]], {k, 0, sn}]. The magic of SIMD/pipelining. Compare with Total[list] or Tr[list] for sums, and Dot[list, list] for dot products, and you'll see there's room for improvement. However, I'm not sure the same support for products exists in CPUs. — Commented Apr 29 at 14:33

Here's another matrix access slowdown that yields to the SetDelayed sleight-of-hand. Background: You have a massive matrix (any data type, but the slowdown seems most pronounced with non-native types, so I'll demonstrate with 65-bit integers) and want to extract a corner, say to peek at it with //TableForm. There are many ways to accomplish that...

$Version
(*  "14.1.0 for Mac OS X ARM (64-bit) (July 16, 2024)"  *)

m = RandomInteger[2^64, {Floor@Sqrt[99000], 99000}];

First@AbsoluteTiming@Take[m, 20, 20]
(*  0.000006  *)

First@AbsoluteTiming[ m[[ ;;20, ;;20 ]] ]
(*  0.000005  *)

First@AbsoluteTiming@Table[ With[ {r=m[[i]]}, Table[ r[[j]], {j,20} ]], {i,20} ]
(*  0.000059  *)

First@AbsoluteTiming@Table[ m[[i,j]], {i,20}, {j,20} ]
(*  17.9656   (?!?)  *)

Strange. Now apply the SetDelayed sleight-of-hand:

mm := m
First@AbsoluteTiming@Table[ mm[[i,j]], {i,20}, {j,20} ]
(*  0.000331   (yay!)  *)

There's a precipice to be found:

First@AbsoluteTiming@Table[ m[[i,j]], {i,15}, {j,15} ]
(*  0.000112  (not here)  *)

First@AbsoluteTiming@Table[ m[[i,j]], {i,15}, {j,16} ]
(*  0.000097  (not here)  *)

First@AbsoluteTiming@Table[ m[[i,j]], {i,16}, {j,15} ]
(*  0.000095  (not here)  *)

First@AbsoluteTiming@Table[ m[[i,j]], {i,16}, {j,16} ]
(*  18.0321   (found it!)  *)

Playing around more, I find the slowdown kicks in when the number of entries hits 250. For example, {i,1}, {j,249} and {i,124}, {j,2} are fast, whereas {i,1}, {j,250} and {i,125}, {j,2} are slow.

During those 18 seconds, the kernel's memory usage briefly spikes (it appears to almost quadruple). Reducing the size of the matrix reduces the delay: for example, replace 99000 with 70000 in the definition of m, and the 18 seconds drops to 10 seconds, but the precipice remains at 250. Changing the data type also affects the delay: change 2^64 to 2^63, or change RandomInteger to RandomReal, and the delay goes down a lot, but the precipice remains at 250. If you're thinking it's not the data type but just the ByteCount of the matrix, try this: use a native data type and increase the matrix size until the ByteCount matches the original example; you'll still see an observable delay, but not as pronounced as in the original example.
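As the comments below point out, the precipice at 250 matches the kernel's auto-compilation threshold for Table. A minimal sketch, assuming that threshold ("TableCompileLength") is indeed the culprit; WithCleanup restores the original settings:

SystemOptions["CompileOptions"]
(*  among the suboptions: "TableCompileLength" -> 250  *)

WithCleanup[
  opts = SystemOptions["CompileOptions"];
  SetSystemOptions["CompileOptions" -> "TableCompileLength" -> Infinity],
  First@AbsoluteTiming@Table[ m[[i,j]], {i,20}, {j,20} ],
  SetSystemOptions[opts]]
(*  should be fast again if auto-compilation of Table is the culprit  *)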

• You found it! Examine SystemOptions["CompileOptions"], and look for "TableCompileLength" and "ProductCompileLength". — Commented Apr 29 at 21:51
• You can add this to your answer, or I'll add it to mine (I feel it's a joint effort): n = 10000; sn = Floor[Sqrt[n]]; t1 = ConstantArray[N[0], {1 + n, 1 + sn}]; Do[t1[[1 + i, 1 + j]] = N[(i + j)/n], {i, 0, n}, {j, 0, sn}]; WithCleanup[opts = SystemOptions["CompileOptions"]; SetSystemOptions["CompileOptions" -> "ProductCompileLength" -> Infinity], First@AbsoluteTiming@Table[Product[t1[[1 + i, 1 + k]] + t1[[1 + n - i, 1 + k]], {i, 0, n - k^2}], {k, 0, sn}], SetSystemOptions[opts]] --> 0.519559 — Commented Apr 29 at 21:54
• D'oh, easier: First@AbsoluteTiming@Table[Product[t1[[1 + i, 1 + k]] + t1[[1 + n - i, 1 + k]], {i, 0, n - k^2}, Method -> {"Compiled" -> False}], {k, 0, sn}] — Commented Apr 29 at 21:58
• Oh wow! But hold on, still confused. With the SetDelayed trick, I guess the compiler is still invoked but either gives up quickly ("not much here to compile") or just creates byte code for delayed lookup ("not much here to analyze and optimize"), and it's quickly off to the execution engine; whereas, with the original version, the compiler makes byte code but then gets quite lost in optimization analysis: maybe it tries to unroll loops, hardcode data values, who knows. Does it do that for, say, 17.8 s and then run its masterpiece for 0.2 s, or does it give up after 17.5 s ("optimization dead end, just do it the 0.5 s way")? — Commented Apr 30 at 1:10
• I think the SetDelayed[] prevents compilation. For instance, this has two problems: Block[{k = 1}, cf2 = Compile[{{i, _Integer}}, t2[[1 + i, 1 + k]] + t2[[1 + n - i, 1 + k]], CompilationOptions -> {"InlineExternalDefinitions" -> True}]]. One problem is that there are warning messages. Another is that there are four MainEvaluate[] calls in cf2, which is usually a reason to reject compilation. See cf2 // CompiledFunctionTools`CompilePrint to see the MainEvaluate[] calls. (A runnable version follows these comments.) — Commented Apr 30 at 1:31
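A runnable version of the compilation probe from the last comment (t2 := t1 as in the other answer; the MainEvaluate[] instructions in the printout mark calls back to the main evaluator, which usually defeats the point of compiling):

Needs["CompiledFunctionTools`"]
Block[{k = 1},
  cf2 = Compile[{{i, _Integer}},
    t2[[1 + i, 1 + k]] + t2[[1 + n - i, 1 + k]],
    CompilationOptions -> {"InlineExternalDefinitions" -> True}]
  ];  (* emits warning messages, per the comment above *)
CompilePrint[cf2]  (* look for the four MainEvaluate[] instructions *)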
