After a lengthy design process and preliminary foundations in Julia 0.5, Julia 0.6 includes new facilities for writing code in the “vectorized” style (familiar from Matlab, Numpy, R, etcetera) while avoiding the overhead that this style of programming usually imposes: multiple vectorized operations can now be “fused” into a single loop, without allocating any extraneous temporary arrays.
This is best illustrated with an example (in which we get order-of-magnitude savings in memory and time, as demonstrated below). Suppose we have a function `f(x) = 3x^2 + 5x + 2` that evaluates a polynomial, and we want to evaluate `f(2x^2 + 6x^3 - sqrt(x))` for a whole array `X`, storing the result in-place in `X`. You can now do:
```julia
X .= f.(2 .* X.^2 .+ 6 .* X.^3 .- sqrt.(X))
```
or, equivalently:
```julia
@. X = f(2X^2 + 6X^3 - sqrt(X))
```
and the whole computation will be fused into a single loop, operating in-place, with performance comparable to the hand-written “devectorized” loop:
```julia
for i in eachindex(X)
    x = X[i]
    X[i] = f(2x^2 + 6x^3 - sqrt(x))
end
```
(Of course, like all Julia code, to get good performance both of these snippets should be executed inside some function, not in global scope.) To see the details of a variety of performance experiments with this example code, follow along in the accompanying IJulia/Jupyter notebook: we find that the `X .= ...` code has performance within 10% of the hand-devectorized loop (which itself is within 5% of the speed of C code), except for very small arrays where there is a modest overhead (e.g. 50% overhead for a length-1 array `X`).
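For reference, a minimal sketch of such a wrapper (the function name `update!` is our own illustrative choice, not from the original benchmarks):

```julia
f(x) = 3x^2 + 5x + 2

# hypothetical helper wrapping the fused, in-place computation in a function
function update!(X)
    X .= f.(2 .* X.^2 .+ 6 .* X.^3 .- sqrt.(X))
    return X
end

X = rand(1000)
update!(X)   # a single fused loop over X, with no temporary arrays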
In this blog post, we delve into some of the details of this new development, in order to answer questions that often arise when this feature is presented:
- What is the overhead of traditional “vectorized” code? Isn’t vectorized code supposed to be fast already?
- Why are all these dots necessary? Couldn’t Julia just optimize “ordinary” vector code?
- Is this something unique to Julia, or can other languages do the same thing?
The short answers are:
- Ordinary vectorized code is fast, but not as fast as a hand-written loop (assuming loops are efficiently compiled, as in Julia) because each vectorized operation generates a new temporary array and executes a separate loop, leading to a lot of overhead when multiple vectorized operations are combined.
- The dots allow Julia to recognize the “vectorized” nature of the operations at a syntactic level (before e.g. the type of `x` is known), and hence the loop fusion is a syntactic guarantee, not a compiler optimization that may or may not occur for carefully written code. They also allow the caller to “vectorize” any function, rather than relying on the function author. (The `@.` macro lets you add dots to every operation in an expression, improving readability for expressions with lots of dots.)
- Other languages have implemented loop fusion for vectorized operations, but typically for only a small set of types and operations/functions that are known to the compiler or vectorization library. Julia’s ability to do it generically, even for user-defined array types and functions/operators, is unusual and relies in part on the syntax choices above and on its ability to efficiently compile higher-order functions.
Finally, we’ll review why, since these dots actually correspond to `broadcast` operations, they can combine arrays and scalars, or combine containers of different shapes and kinds, and we’ll compare `broadcast` and `map`. Moreover, Julia 0.6 expanded and clarified the notion of a “scalar” for `broadcast`, so that it is not limited to numerical operations: you can use `broadcast` and fusing “dot calls” for many other tasks (e.g. string processing).
To explore the first of these questions, the overhead of traditional vectorized code (also discussed in this blog post), let’s begin by rewriting the code above in a more traditional vectorized style, without so many dots, such as you might use in Julia 0.4 or in other languages (most famously Matlab, Python/Numpy, or R):
```julia
X = f(2 * X.^2 + 6 * X.^3 - sqrt(X))
```
Of course, this assumes that the functions `sqrt` and `f` are “vectorized,” i.e. that they accept vector arguments `X` and compute the function elementwise. This is true of `sqrt` in Julia 0.4, but it means that we have to rewrite our function `f` from above in a vectorized style, e.g. `f(x) = 3x.^2 + 5x + 2` (changing `f` to use the elementwise operator `.^`, because `vector^scalar` is not defined). (If we were using Julia 0.4 and cared a lot about efficiency, we might have instead used the `@vectorize_1arg Number f` macro to generate more specialized elementwise code.)
As an aside, this example illustrates an annoyance with the vectorized style: you have to decide in advance whether a given function `f(x)` will also be applied elementwise to arrays, and either write it specially or define a corresponding elementwise method. (Our function `f` accepts any type of `x`, and in Matlab or R there is no distinction between a scalar and a 1-element array. However, even if a function accepts an array argument `x`, that doesn’t mean it will work elementwise for an array unless you write the function with that in mind.) For library functions like `sqrt`, this means that the library authors have to guess at which functions should have vectorized methods, and users have to guess at what vaguely defined subset of library functions work for vectors.
One possible solution is to vectorize every function automatically. The language Chapel does this: every function `f(x...)` implicitly defines a function `f(x::Array...)` that evaluates `map(f, x...)` (Chamberlain et al., 2011). This could be implemented in Julia as well via function-call overloading (Bezanson, 2015: chapter 4), but we chose to go in a different direction.
Instead, starting in Julia 0.5, any function `f(x)` can be applied elementwise to an array `X` with the “dot call” syntax `f.(X)`. Thus, the caller decides which functions to vectorize. In Julia 0.6, “traditionally” vectorized library functions like `sqrt(X)` are deprecated in favor of `sqrt.(X)`, and dot operators like `x .+ y` are now equivalent to dot calls `(+).(x,y)`. Unlike Chapel’s implicit vectorization, Julia’s `f.(x...)` syntax corresponds to `broadcast(f, x...)` rather than `map`, allowing you to combine arrays and scalars or arrays of different shapes/dimensions. (`broadcast` and `map` are compared at the end of this post; each has its own unique capabilities.)
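For instance, a quick sketch of the correspondence (using a hypothetical scalar function `g` of our own):

```julia
g(x) = x^2 + 1        # an ordinary scalar function, never "vectorized" by its author

X = [1.0, 2.0, 3.0]
g.(X)                 # caller-side vectorization: yields [2.0, 5.0, 10.0]
broadcast(g, X)       # g.(X) is equivalent to this broadcast call
X .+ 1                # dot operators are dot calls too: same as (+).(X, 1)
```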
From the standpoint of the programmer, this adds a certain amount of clarity because it indicates explicitly when an elementwise operation is occurring. From the standpoint of the compiler, dot-call syntax enables the syntactic loop fusion optimization described in more detail below, which we think is an overwhelming advantage of this style.
In many dynamically typed languages popular for interactive technical computing (Matlab, Python, R, etc.), vectorization is seen as a key (often the key) performance optimization. It allows your code to take advantage of highly optimized (perhaps even parallelized) library routines for basic operations like `scalar*array` or `sqrt(array)`. Those functions, in turn, are usually implemented in a low-level language like C or Fortran. Writing your own “devectorized” loops, in contrast, is too slow, unless you are willing to drop down to a low-level language yourself, because the semantics of those dynamic languages make it hard to compile them to efficient code in general.
Thanks to Julia’s design, a properly written devectorized loop in Julia has performance within a few percent of C or Fortran, so there is no necessity of vectorizing; this is explicitly demonstrated for the devectorized loop above in the accompanying notebook. However, vectorization may still be convenient for some problems. And vectorized operations like `scalar*array` or `sqrt(array)` are still fast in Julia (calling optimized library routines, albeit ones written in Julia itself). Furthermore, if your problem involves a function that does not have a pre-written, highly optimized, vectorized library routine in Julia, and that does not decompose easily into existing vectorized building blocks like `scalar*array`, then you can write your own building block without dropping down to a low-level language. (If all the performance-critical code you will ever need already existed in the form of optimized library routines, programming would be a lot easier!)
There is a tension between two general principles in computing: on the one hand, re-using highly optimized code is good for performance; on the other hand, optimized code that is specialized for your problem can usually beat general-purpose functions. This is illustrated nicely by the traditional vectorized version of our code above:
```julia
f(x) = 3x.^2 + 5x + 2
X = f(2 * X.^2 + 6 * X.^3 - sqrt(X))
```
Each of the operations like `X.^2` and `5*X` individually calls highly optimized functions, but their combination leaves a lot of performance on the table when `X` is an array. To see why, you have to realize that this code is equivalent to:
```julia
tmp1 = X.^2
tmp2 = 2*tmp1
tmp3 = X.^3
tmp4 = 6*tmp3
tmp5 = tmp2 + tmp4
tmp6 = sqrt(X)
tmp7 = tmp5 - tmp6
X = f(tmp7)
```
That is, each of these vectorized operations allocates a separate temporary array, and is a separate library call with its own inner loop. Both of these properties are bad for performance.
First, eight arrays are allocated (`tmp1` through `tmp7`, plus another for the result of `f(tmp7)`), and another four are allocated internally by `f(tmp7)` for the same reasons, for 12 arrays in all. The resulting `X = ...` expression does not update `X` in-place, but rather makes the variable `X` “point” to a new array returned by `f(tmp7)`, discarding the old array `X`. All of these extra arrays are eventually deallocated by Julia’s garbage collector, but in the meantime they waste a lot of memory (an order of magnitude!).
By itself, allocating/freeing memory can take a significant amount of time compared to our other computations. This is especially true if `X` is very small, so that the allocation overhead matters (in our benchmark notebook, we pay a 10× cost for a 6-element array and a 6× cost for a 36-element array), or if `X` is very large, so that the memory churn matters (see below for numbers). Furthermore, you pay a different performance price from the fact that you have 12 loops (12 passes over memory) compared to one, in part because of the loss of memory locality.
In particular, reading or writing data in main computer memory (RAM) is much slower than performing scalar arithmetic operations like `+` and `*`, so computer hardware stores recently used data in a cache: a small amount of much faster memory. Furthermore, there is a hierarchy of smaller, faster caches, culminating in the register memory of the CPU itself. This means that, for good performance, you should load each datum `x = X[i]` once (so that it goes into cache, or into a register for small enough types), and then perform several operations like `f(2x^2 + 6x^3 - sqrt(x))` on `x` while you still have fast access to it, before loading the next datum; this is called “temporal locality.” The traditional vectorized code discards this potential locality: each `X[i]` is loaded once for a single small operation like `2*X[i]`, writing the result out to a temporary array before immediately reading the next `X[i]`.
In typical performance benchmarks (see the notebook), therefore, the traditional vectorized code `X = f(2 * X.^2 + 6 * X.^3 - sqrt(X))` turns out to be about 10× slower than the devectorized or fused-vectorized versions of the same code at the beginning of this article for `X = zeros(10^6)`. Even if we pre-allocate all of the temporary arrays (completely eliminating the allocation cost), our benchmarks show that performing a separate loop for each operation still is about 4–5× slower for a million-element `X`. This is not unique to Julia! Vectorized code is suboptimal in any language unless the language’s compiler can automatically fuse all of these loops (even ones that appear inside function calls), which rarely happens, for the reasons described below.
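If you want to reproduce this kind of measurement yourself, here is a minimal sketch along these lines (assuming the BenchmarkTools package, which is not part of the original post; exact ratios will depend on your machine):

```julia
using BenchmarkTools   # assumed package providing the @btime macro

f(x) = 3x^2 + 5x + 2

fused!(X) = X .= f.(2 .* X.^2 .+ 6 .* X.^3 .- sqrt.(X))   # fused in-place version

function devec!(X)                                        # hand-written loop
    for i in eachindex(X)
        x = X[i]
        X[i] = f(2x^2 + 6x^3 - sqrt(x))
    end
    return X
end

# fresh random data for each sample, so that sqrt always sees valid inputs
@btime fused!(X) setup=(X = rand(10^6)) evals=1
@btime devec!(X) setup=(X = rand(10^6)) evals=1
```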
You might look at an expression like `2 * X.^2 + 6 * X.^3 - sqrt(X)` and think that it is “obvious” that it could be combined into a single loop over `X`. Why can’t Julia’s compiler be smart enough to recognize this? The thing that you need to realize is that, in Julia, there is nothing particularly special about `+` or `sqrt` — they are arbitrary functions and could do anything. `X + Y` could send an email or open a plotting window, for all the compiler knows. To figure out that it could fuse e.g. `2*X + Y` into a single loop, allocating a single array for the result, the compiler would need to:
1. Deduce the types of `X` and `Y` and figure out what `*` and `+` functions to call. (Julia already does this, at least when type inference succeeds.)
2. Look inside of those functions, realize that they are elementwise loops over `X` and `Y`, and realize that they are pure (e.g. `2*X` has no side effects like modifying `Y`).
3. Analyze expressions like `X[i]` (which are calls to a function `getindex(X, i)` that is “just another function” to the compiler), to detect that they are memory reads/writes and determine what data dependencies they imply (e.g. to figure out that `2*X` allocates a temporary array that can be eliminated).
The second and third steps pose an enormous challenge: looking at an arbitrary function and “understanding” it at this level turns out to be a very hard problem for a computer. If fusion is viewed as a compiler optimization, then the compiler is only free to fuse if it can prove that fusion won’t change the results, which requires the detection of purity and other data-dependency analyses.
In contrast, when the Julia compiler sees an expression like `2 .* X .+ Y`, it knows just from the syntax (the “spelling”) that these are elementwise operations, and Julia guarantees that the code will always fuse into a single loop, freeing it from the need to prove purity. This is what we term syntactic loop fusion, described in more detail below.
One approach that may occur to you, and which has been implemented in a variety of languages (e.g. Kennedy & McKinley, 1993; Lewis et al., 1998; Chakravarty & Keller, 2001; Manjikian & Abdelrahman, 2002; Sarkar, 2010; Prasad et al., 2011; Wu et al., 2012), is to only perform loop fusion for a few “built-in” types and operations that the compiler can be designed to recognize. The same idea has also been implemented as libraries (e.g. template libraries in C++: Veldhuizen, 1995) or domain-specific languages (DSLs) as extensions of existing languages; in Python, for example, loop fusion for a small set of vector operations and array/scalar types can be found in the Theano, PyOP2, and Numba software. Likewise, in Julia we could potentially build the compiler to recognize that it can fuse `*`, `+`, `.^`, and similar operations for the built-in `Array` type (and perhaps only for a few scalar types). This has, in fact, already been implemented in Julia as a macro-based DSL (you add `@vec` or `@acc` decorators to a vectorized expression) in the Devectorize and ParallelAccelerator packages.
However, even though Julia will certainly implement additional compiler optimizations as time passes, one of the key principles of Julia’s design is to “build in” as little as possible into the core language, implementing as much as possible of Julia in Julia itself (Bezanson, 2015). Put another way, the same optimizations should be just as available to user-defined types and functions as to the “built-in” functions of Julia’s standard library (`Base`). You should be able to define your own array types (e.g. via the StaticArrays package or PETSc arrays) and functions (such as our `f` above), and have them be capable of fusing vectorized operations.
Moreover, a difficulty with fancy compiler optimizations is that, as a programmer, you are often unsure whether they will occur. You have to learn to avoid coding styles that accidentally prevent the compiler from recognizing the fusion opportunity (e.g. because you called a “non-built-in” function), you need to learn to use additional compiler-diagnostic tools to identify which optimizations are taking place, and you need to continually check these diagnostics as new versions of the compiler and language are released. With vectorized code, losing a fusion optimization may mean wasting an order of magnitude in memory and time, so you have to worry much more than you would for a typical compiler micro-optimization.
In contrast, Julia’s approach is quite simple and general: the caller indicates, by adding dots, which function calls and operators are intended to be applied elementwise (specifically, as `broadcast` calls). The compiler notices these dots at parse time (or technically at “lowering” time, but in any case long before it knows the types of the variables etc.), and transforms them into calls to `broadcast`. Moreover, it guarantees that nested “dot calls” will always be fused into a single `broadcast` call, i.e. a single loop.
Put another way, `f.(g.(x .+ 1))` is treated by Julia as merely syntactic sugar for `broadcast(x -> f(g(x + 1)), x)`. An assignment `y .= f.(g.(x .+ 1))` is treated as sugar for the in-place operation `broadcast!(x -> f(g(x + 1)), y, x)`. The compiler need not prove that this produces the same result as a corresponding non-fused operation, because the fusion is a mandatory transformation defined as part of the language, rather than an optional optimization.
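You can check this equivalence directly; here is a small sketch, with placeholder definitions for `f` and `g` of our own:

```julia
f(x) = 2x           # placeholder scalar functions for illustration
g(x) = x + 10

x = [1, 2, 3]
a = f.(g.(x .+ 1))                  # fused into a single loop over x
b = broadcast(t -> f(g(t + 1)), x)  # what the dot syntax is sugar for
a == b                              # true

y = similar(x)
y .= f.(g.(x .+ 1))                 # in-place: sugar for broadcast!(t -> f(g(t + 1)), y, x)
```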
Arbitrary user-defined functions `f(x)` work with this mechanism, as do arbitrary user-defined collection types for `x`, as long as you define `broadcast` methods for your collection. (The default `broadcast` already works for any subtype of `AbstractArray`.)
Moreover, dotted operators are now available not just for the familiar ASCII operators like `.+`, but for any character that Julia parses as a binary operator. This includes a wide array of Unicode symbols like `⊗`, `∪`, and `⨳`, most of which are undefined by default. So, for example, if you define `⊗(x,y) = kron(x,y)` for the Kronecker product, you can immediately do `[A, B] .⊗ [C, D]` to compute the “elementwise” operation `[A ⊗ C, B ⊗ D]`, because `x .⊗ y` is sugar for `broadcast(⊗, x, y)`.
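A runnable sketch of this, using small random matrices of our own choosing:

```julia
using LinearAlgebra   # provides kron in Julia 1.x (in 0.6 it was in Base)

⊗(x, y) = kron(x, y)  # define the binary operator as the Kronecker product

A, B, C, D = rand(2,2), rand(2,2), rand(2,2), rand(2,2)
[A, B] .⊗ [C, D]      # sugar for broadcast(⊗, [A, B], [C, D]),
                      # i.e. the two-element array [A ⊗ C, B ⊗ D]
```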
Note that “side-by-side” binary operations are actually equivalent to nested calls, and hence they fuse for dotted operations. For example, `3 .* x .+ y` is equivalent to `(+).((*).(3, x), y)`, and hence it fuses into `broadcast((x,y) -> 3*x + y, x, y)`. Note also that the fusion stops only when a “non-dot” call is encountered: e.g. `sqrt.(abs.(sort!(x.^2)))` fuses the `sqrt` and `abs` operations into a single loop, but `x.^2` occurs in a separate loop (producing a temporary array) because of the intervening non-dot function call `sort!(...)`.
For the sake of completeness, we should mention some other possibilities that would partly address the problems of vectorization. For example, functions could be specially annotated to declare that they are pure, one could specially annotate container types with array-like semantics, etcetera, to help the compiler recognize the possibility of fusion. But this imposes a lot of requirements on library authors, and once again it requires them to identify in advance which functions are likely to be applied to vectors (and hence be worth the additional analysis and annotation effort).
Another approach that has been suggested is to define updating operators like `x += y` to be equivalent to calls to a special function, like `x = plusequals!(x, y)`, that can be defined as an in-place operation, rather than `x += y` being a synonym for `x = x + y` as in Julia today. (NumPy does this.) By itself, this can be used to avoid temporary arrays in some simple cases by breaking an expression into a sequence of in-place updates, but it doesn’t handle more complex expressions, is limited to a few operations like `+`, and doesn’t address the cache inefficiency of multiple loops. (In Julia 0.6, you can do `x .+= y` and it is equivalent to `x .= x .+ y`, which does a single fused loop in-place, but this syntax now extends to arbitrary combinations of arbitrary functions.)
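For example, a minimal sketch:

```julia
x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]

x .+= y              # sugar for x .= x .+ y: one fused loop, updating x in-place
x .+= 2 .* y .^ 2    # arbitrary fused right-hand sides also update x in-place
```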
Obviously, Julia’s approach of syntactic loop fusion relies partly on the fact that, as a young language, we are still relatively free to redefine core syntactic elements like `f.(x)` and `x .+ y`. But suppose you were willing to add this or similar syntax to an existing language, like Python or Go, or to create a DSL add-on on top of those languages as discussed above; would you then be able to implement the same fusing semantics efficiently?
There is a catch: `2 .* x .+ x .^ 2` is sugar for `broadcast(x -> 2*x + x^2, x)` in Julia, but for this to be fast we need the higher-order function `broadcast` to be very fast as well. First, this requires that arbitrary user-defined scalar (non-vectorized!) functions like `x -> 2*x + x^2` be compiled to fast code, which is often a challenge in high-level dynamic languages. Second, it ideally requires that higher-order functions like `broadcast` be able to inline the function argument `x -> 2*x + x^2`, and this facility is even less common. (It wasn’t available in Julia until version 0.5.)
Also, the ability of `broadcast` to combine arrays and scalars or arrays of different shapes (see below) turns out to be subtle to implement efficiently without losing generality. The current implementation relies on a metaprogramming feature that Julia provides, called generated functions, in order to get compile-time specialization on the number and types of the arguments. An alternative solution to the inlining and specialization issues would be to build the `broadcast` function into the compiler, but then you would lose the ability of `broadcast` to be overloaded for user-defined containers, and users could not write their own higher-order functions with similar functionality.
In particular, consider a naive implementation of `broadcast` (only for one-argument functions):
```julia
function naivebroadcast(f, x)
    y = similar(x)
    for i in eachindex(x)
        y[i] = f(x[i])
    end
    return y
end
```
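This function can be called with any function and array, for example:

```julia
naivebroadcast(x -> 2x + x^2, [1.0, 2.0, 3.0])   # returns [3.0, 8.0, 15.0]
```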
In Julia, as in other languages, `f` must be some kind of function pointer or function object. Normally, a call `f(x[i])` to a function object `f` must figure out where the actual machine code for the function is (in Julia, this involves dispatching on the type of `x[i]`; in object-oriented languages, it might involve dispatching on the type of `f`), push the argument `x[i]` etcetera to `f` via a register and/or a call stack, jump to the machine instructions to execute them, jump back to the caller `naivebroadcast`, and extract the return value. That is, calling a function argument `f` involves some overhead beyond the cost of the computations inside `f`.
If `f(x)` is expensive enough, then the overhead of the function call may be negligible, but for a cheap function like `f(x) = 2*x + x^2` the overhead can be very significant: with Julia 0.4, the overhead is roughly a factor of two compared to a hand-written loop that evaluates `z = x[i]; y[i] = 2*z + z^2`. Since lots of vectorized code in practice evaluates relatively cheap functions like this, it would be a big problem for a generic vectorization method based on `broadcast`. (The function call also inhibits SIMD optimization by the compiler, which prevents computations in `f(x)` from being applied simultaneously to several `x[i]` elements.)
However, in Julia 0.5, every function has its own type. And, in Julia, whenever you call a function like `naivebroadcast(f, x)`, a specialized version of `naivebroadcast` is compiled for `typeof(f)` and `typeof(x)`. Since the compiled code is specific to `typeof(f)`, i.e. to the specific function being passed, the Julia compiler is free to inline `f(x)` into the generated code if it wants to, and all of the function-call overhead can disappear.
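You can observe this specialization at the REPL; a small sketch (output will vary by Julia version and machine):

```julia
g(x) = 2x + x^2
typeof(g)    # every function has its own concrete type

# a specialized naivebroadcast is compiled for (typeof(g), Vector{Float64}),
# so the inner call f(x[i]) can be inlined; inspect the optimized code with:
@code_llvm naivebroadcast(g, [1.0, 2.0, 3.0])
```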
Julia is neither the first nor the only language that can inline higher-order functions; e.g. it is reportedly possible in Haskell and in the Kotlin language. Nevertheless, it seems to be a rare feature, especially in imperative languages. Fast higher-order functions are a key ingredient of Julia that allows a function like `broadcast` to be written in Julia itself (and hence be extensible to user-defined containers), rather than having to be built into the compiler (and probably limited to “built-in” container types).
Dot calls correspond to the `broadcast` function in Julia. Broadcasting is a powerful concept (also found, for example, in NumPy and Matlab) in which the concept of “elementwise” operations is extended to encompass combining arrays of different shapes, or arrays and scalars. Moreover, this is not limited to arrays of numbers, and starting in Julia 0.6 a “scalar” in a `broadcast` context can be an object of an arbitrary type.
You may have noticed that the examples above included expressions like `6 .* X.^3` that combine an array (`X`) with scalars (`6` and `3`). Conceptually, in `X.^3` the scalar `3` is “expanded” (or “broadcasted”) to match the size of `X`, as if it became an array `[3,3,3,...]`, before performing `^` elementwise. In practice, of course, no array of `3`s is ever explicitly constructed.
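For example:

```julia
X = [1, 2, 3]
X .^ 3         # the scalar 3 is broadcasted: [1, 8, 27]
6 .* X .^ 3    # both scalars broadcast, fused into one loop: [6, 48, 162]
```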
More generally, if you combine two arrays of different dimensions or shapes, any “singleton” (length 1) or missing dimension of one array is “broadcasted” across that dimension of the other array. For example, `A .+ [1,2,3]` adds `[1,2,3]` to each column of a 3×n matrix `A`. Another typical example is to combine a row vector (or a 1×n array) and a column vector to make a matrix (2d array):
```julia
julia> [1 2 3] .+ [10,20,30]
3×3 Array{Int64,2}:
 11  12  13
 21  22  23
 31  32  33
```
(If `x` is a row vector and `y` is a column vector, then `A = x .+ y` makes a matrix with `A[i,j] = x[j] + y[i]`.)
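A small sketch of the column case (with an illustrative 3×4 matrix of ones):

```julia
A = ones(Int, 3, 4)   # a 3×4 matrix
A .+ [1, 2, 3]        # adds [1,2,3] down each of the four columns:
                      # row i of the result is A[i,:] .+ i
```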
Although other languages have also implemented similar `broadcast` semantics, Julia is unusual in being able to support such operations for arbitrary user-defined functions and types with performance comparable to hand-written C loops, even though its `broadcast` function is written entirely in Julia with no special support from the compiler. This not only requires efficient compilation and higher-order inlining as mentioned above, but also the ability to efficiently iterate over arrays of arbitrary dimensionalities determined at compile-time for each caller.
Although the examples above were all for numeric computations, in fact neither the `broadcast` function nor the dot-call fusion syntax is limited to numeric data. For example:
```julia
julia> s = ["The QUICK Brown", "fox jumped", "over the LAZY dog."];

julia> s .= replace.(lowercase.(s), r"\s+", "-")
3-element Array{String,1}:
 "the-quick-brown"
 "fox-jumped"
 "over-the-lazy-dog."
```
Here, we take an array `s` of strings, convert each string to lower case, and then replace any sequence of whitespace (the regular expression `r"\s+"`) with a hyphen (`"-"`). Since these two dot calls are nested, they are fused into a single loop over `s`, and the results are written in-place into `s` thanks to the `s .= ...` (temporary strings are allocated in this process, but not temporary arrays of strings). Furthermore, notice that the arguments `r"\s+"` and `"-"` are treated as “scalars” and are “broadcasted” to every element of `s`.
The general rule (starting in Julia 0.6) is that, in `broadcast`, arguments of any type are treated as scalars by default. The main exceptions are arrays (subtypes of `AbstractArray`) and tuples, which are treated as containers and are iterated over. (If you define your own container type that is not a subtype of `AbstractArray`, you can tell `broadcast` to treat it as a container to be iterated over by overloading `Base.Broadcast.containertype` and a couple of other functions.)
Since the dot-call syntax corresponds to `broadcast`, and `broadcast` is just an ordinary Julia function to which you can add your own methods (as opposed to some kind of privileged compiler built-in), many possibilities open up. Not only can you extend fusing dot calls to your own data structures (e.g. DistributedArrays extends `broadcast` to work for arrays distributed across multiple computers), but you can apply the same syntax to data types that are hardly “containers” at all.
For example, the ApproxFun package defines an object called a `Fun` that represents a numerical approximation of a user-defined function (essentially, a `Fun` is a fancy polynomial fit). By defining `broadcast` methods for `Fun`, you can now take an `f::Fun` and do, for example, `exp.(f.^2 .+ f.^3)`, and it will translate to `broadcast(y -> exp(y^2 + y^3), f)`. This `broadcast` call, in turn, will evaluate `exp(y^2 + y^3)` for `y = f(x)` at cleverly selected `x` points, construct a polynomial fit, and return a new `Fun` object representing the fit. (Conceptually, this replaces elementwise operations on containers with pointwise operations on functions.) In contrast, ApproxFun also allows you to compute the same result using `exp(f^2 + f^3)`, but in this case it will go through the fitting process four times (constructing four `Fun` objects), once for each operation like `f^2`, and is more than an order of magnitude slower due to this lack of fusion.
Finally, it is instructive to compare `broadcast` with `map`, since `map` also applies a function elementwise to one or more arrays. (The dot-call syntax invokes `broadcast`, not `map`.) The basic differences are:
- `broadcast` handles only containers with “shapes” M×N×⋯ (i.e., a size and dimensionality), whereas `map` handles “shapeless” containers like `Set` or iterators of unknown length like `eachline(file)`.
- `map` requires all arguments to have the same length (and hence cannot combine arrays and scalars) and, for array containers, the same shape, whereas `broadcast` does not (it can “expand” smaller containers to match larger ones).
- `map` treats all arguments as containers by default, and in particular expects its arguments to act as iterators. In contrast, `broadcast` treats its arguments as scalars by default (i.e., as 0-dimensional arrays of one element), except for a few types like `AbstractArray` and `Tuple` that are explicitly declared to be broadcast containers. (These differences are illustrated by the example calls below.)
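A few calls that illustrate these differences (a sketch; note that `map` over a `Set` returns its elements in an arbitrary order):

```julia
map(sqrt, Set([1, 4, 9]))        # map handles "shapeless" containers like Set
sqrt.([1, 4, 9])                 # broadcast (dot call) on an array

[1, 2, 3] .+ 10                  # broadcast "expands" the scalar 10
map(+, [1, 2, 3], [10, 10, 10])  # map needs containers of matching length
```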
Sometimes, of course, their behavior coincides, e.g. `map(sqrt, [1,2,3])` and `sqrt.([1,2,3])` give the same result. But, in general, neither `map` nor `broadcast` generalizes the other — each can do things that the other cannot.