Speeding up atan2f

gumby 2021-08-17 14:11:04 +0000 UTC [ - ]

Because of the disregard for the literature common in CS I loved this part:

> This is achieved through ...and some cool documents from the 50s.

A bit of an anecdote: back when I was a research scientist (corporate lab) 30+ years ago, I would in fact go downstairs to the library and read — I was still a kid with a lot to learn (and still am). When I came across (by chance or by someone’s suggestion) something useful to my work, I’d photocopy the article and try it out. I’d put a comment in the code with the reference.

Colleagues in my group would give me (undeserved) credit for my supposed brilliance even though I said in the code where the idea came from and would determinedly point to the very paper on my desk. This attitude seemed bizarre, as the group itself was producing conference papers and even books.

(Obviously this was not a universal phenomenon as there were other people in the lab overall, and friends on the net, suggesting papers to read. But I’ve seen it a lot, from back then up to today)

lmilcin 2021-08-17 15:35:12 +0000 UTC [ - ]

It is very unfortunate that in the world of ubiquitous information, the ability to just look up a reasonable solution to a well-known and well-studied problem that was solved long ago is practically a superpower.

As a developer, I regularly find myself solving some large, long-standing problem with significant impact for the company, mostly with a couple of Google searches.

Once I reduced a batch task that took 11 hours to complete down to about 5s of work by looking up a paper on bitemporal indexes. The reactions ranged from "You are awesome!" to "You read a... paper????" to "No, we can't have this, this is super advanced code and it surely must have bugs. Can't you find an open source library that implements this?"

refset 2021-08-17 17:24:54 +0000 UTC [ - ]

> bitemporal indexes [...] Can't you find an open source library that implements this?

Undoubtedly too late for you, but https://github.com/juxt/crux/blob/master/crux-core/src/crux/... :)

jacobolus 2021-08-17 17:16:27 +0000 UTC [ - ]

One of the most valuable (but often under-appreciated) kinds of work people can do is to take hard-won knowledge that exists somewhere obscure and make it more accessible, e.g. by writing expository survey papers, adding material to Wikipedia, writing documentation, ...

zokier 2021-08-17 16:57:05 +0000 UTC [ - ]

> It is very unfortunate that in the world of ubiquitous information, the ability to just look up a reasonable solution to a well-known and well-studied problem that was solved long ago is practically a superpower

It's exactly because information is so ubiquitous and plentiful that finding a reasonable solution is like looking for a needle in a haystack.

lmilcin 2021-08-17 17:46:06 +0000 UTC [ - ]

In most cases in my experience, the problem wasn't that people struggled to actually grep the internet for the solution.

More likely they just went and rolled their own solution, for any number of other reasons:

-- For the fun of the intellectual challenge (I am guilty of this),

-- For having the hubris to think they are the first to face that challenge (I was guilty of this in the past; I no longer make that mistake),

-- For having the hubris to think their solution is uniquely the best, so unlikely to be beaten that it isn't even worth verifying,

-- For feeling superior to everybody else, and hence seeing no real reason to dignify other, lesser people by taking an interest in their work,

-- For feeling there is just too much pressure and no time to do it right, or even to take a peek at what "right" looks like and how much time it would take,

-- For not having enough experience to suspect there might be a solution to this problem,

-- Books? Only old people read books,

-- Papers? They are for research scientists. I didn't finish a CS degree just to read papers at work.

and so on.

There have really been very few concrete problems that I could not find some solution to on the Internet.

It is more difficult to find solutions to less concrete problems, but even with those, if you have a habit of regularly seeking out knowledge, reading various sources, etc., you will probably amass enough experience and knowledge to deal with most problems, even complex ones, on your own.

djmips 2021-08-17 19:08:09 +0000 UTC [ - ]

I regularly hear coders say they cannot understand other people's code so they _had_ to write it themselves. Most people can't read papers either, it seems.

lmilcin 2021-08-17 20:13:40 +0000 UTC [ - ]

Well, that's another thing.

People have lost the ability to read anything longer than a tweet with focus and comprehension. Especially new developers.

I know this because whatever I put in the third and further paragraphs of my tickets is not being read or understood.

I sometimes quiz guys at review meetings, and I even sometimes put the most important information somewhere in the middle of the ticket. Always with the same result -- they are surprised.

Now, to read a paper is to try to imagine being the person who wrote it and to appreciate not just the written text but also a lot of things that are not necessarily spelled out.

The same goes for code -- when I read somebody's code I want to recognize their style, how they approach various problems. Once you know a little bit about their vocabulary you can really speed up reading the rest, because you can anticipate what is going to happen next.

nuclearnice1 2021-08-17 20:55:49 +0000 UTC [ - ]

People may have a preference for certainty.

If they work on the problem, they will make progress and likely solve it.

If they search the literature, they will spend time on research and quite possibly lose days before emerging with nothing to show for it.

lmilcin 2021-08-18 00:38:35 +0000 UTC [ - ]

> If they search the literature, they will spend time on research and quite possibly lose days before emerging with nothing to show for it.

I see you prefer lousy results.

Also, you read BEFORE you need it.

To be a literate, educated, knowledgeable developer. And to know where the knowledge is when you need it (because you can't remember everything).

nuclearnice1 2021-08-18 03:26:49 +0000 UTC [ - ]

> I see you prefer lousy results.

I wasn’t stating a preference, merely observing another possible reason people don’t research solutions before diving in.

caf 2021-08-18 04:50:09 +0000 UTC [ - ]

Reminds me a little of

If Edison had a needle to find in a haystack, he would proceed at once with the diligence of the bee to examine straw after straw until he found the object of his search. … I was a sorry witness of such doings, knowing that a little theory and calculation would have saved him ninety per cent of his labor.

  (Nikola Tesla, 19 October 1931)

lmilcin 2021-08-18 11:27:22 +0000 UTC [ - ]

Actually, it would save much more than 90 percent. More like 99.9%. Just take a powerful magnet in your hand and push it through the haystack until you find it.

Which, of course, requires that you somehow know of magnets beforehand.

Which is exactly the point -- even if you are in the business of finding needles in haystacks, it pays to have knowledge in other areas, and basically to read books and other knowledge sources. You never know when it is going to pay off.

nextaccountic 2021-08-18 08:08:59 +0000 UTC [ - ]

> "No, we can't have this, this is super advanced code and it surely must have bugs. Can't you find an open source library that implements this?"

Please tell me this person didn't win and they got to keep your solution.

(my response would be "let me open source this then")

lmilcin 2021-08-18 11:24:07 +0000 UTC [ - ]

Yes, they got to keep my solution. I basically told the guy "You are free to propose your own".

As to open sourcing, not going to happen.

Companies treat code as if it were a tangible good. If you give it to somebody else you are losing value.

Which is super strange, because code needs to be maintained, so why not have other people do it for you if they find it valuable and worth maintaining?

teddyh 2021-08-17 14:18:31 +0000 UTC [ - ]

I’ve seen an odd use of language, mostly from U.S. Americans, where “intelligent” = “smart” = “knowledgeable”. And conversely, assuming that knowing many things is what makes you smart. I suspect this is what happened there; they thought you “brilliant” because you knew many things – in their minds, there might not be a difference between the two concepts.

dleslie 2021-08-17 15:18:56 +0000 UTC [ - ]

I find that disregard for education to be tasteless, but perhaps warranted due to the modern prevalence of "bootcamps" and similar fast-but-shallow programmes.

Personally, I love when I get a chance to apply something from a published paper; I will leave a citation as a comment in the code and a brief explanation of how it works and why it was chosen. I have no regrets for having achieved a computing degree, so many years ago.

whatshisface 2021-08-17 14:18:38 +0000 UTC [ - ]

Not brilliant? Use of the written word to pass ideas down between generations is one of the most brilliant ideas in human history, and since nobody around you was doing it, the only explanation is that you independently invented the ethos of the scholar yourself.

gumby 2021-08-17 17:50:11 +0000 UTC [ - ]

Heh heh, Ockham’s razor provides an explanation!

specialist 2021-08-17 17:05:49 +0000 UTC [ - ]

Nice.

I truly miss foraging the stacks. Such delight.

Browsing scholar.google.com, resulting in 100s of tabs open, just isn't the same.

jjgreen 2021-08-17 22:48:38 +0000 UTC [ - ]

I always found that the smell of old books filled me with the urge to defecate, to the extent that I had to deliberately go for a crap before diving into the stacks. Something of a disadvantage as a pre-internet researcher. I thought I was the only person in the world with this condition, but have subsequently found a couple of other similarly afflicted souls.

IntrepidWorm 2021-08-17 23:14:47 +0000 UTC [ - ]

There is a semi-documented phenomenon among book store workers regarding just this: employees at retail stores like Barnes and Noble and Borders Books report famously bad cleanups in their bathrooms on a daily basis, reputedly due to this.

See: https://en.m.wikipedia.org/wiki/Mariko_Aoki_phenomenon

The academic consensus seems understandably mixed, but it is certainly a fascinating ponder (what a wonderful avenue of research)!

I'd wager any significant breakthroughs in this would certainly be up for consideration for an Ig Nobel.

jjgreen 2021-08-18 08:55:03 +0000 UTC [ - ]

It has a name! You have made my day.

touisteur 2021-08-17 18:47:43 +0000 UTC [ - ]

But knowing where to look, being able to suss out details from papers, having the almost-adequate level of conversation to try to contact authors (if still alive...), the ability to turn undecipherable formulas into code and to bear Stack Overflow's requirements for a 'good' question -- that is not a set of skills that amounts to nothing. It's rare nowadays that I see any copy of a reference textbook on numerical recipes, although we do tons of signal processing and HPC stuff.

Recently I rediscovered the ability to compute logs with 64-bit integers with better precision than double precision. Man, jumping back into those depths after so many years... Not as easy as reading other people's code.

_moof 2021-08-17 22:31:50 +0000 UTC [ - ]

No one likes to do lit review, which is a shame. I get it - it's way less fun to find out someone already bested your cool idea than it is to write code. But it's definitely more productive to read the literature!

kbenson 2021-08-17 19:55:59 +0000 UTC [ - ]

> 30+ years ago ... I was still a kid with a lot to learn (and still am).

We should all strive to stay young at heart like this. ;)

jacobolus 2021-08-17 17:01:12 +0000 UTC [ - ]

If someone wants a fast version of x ↦ tan(πx/2), let me recommend the approximation:

  tanpi_2 = function tanpi_2(x) {
    var y = (1 - x*x);
    return x * (((-0.000221184 * y + 0.0024971104) * y - 0.02301937096) * y
      + 0.3182994604 + 1.2732402998 / y);
  }
(valid for -1 <= x <= 1)

https://observablehq.com/@jrus/fasttan with error: https://www.desmos.com/calculator/hmncdd6fuj

But even better is to avoid trigonometry and angle measures as much as possible. Almost everything can be done better (faster, with fewer numerical problems) with vector methods; if you want a 1-float representation of an angle, use the stereographic projection:

  stereo = (x, y) => y/(x + Math.hypot(x, y));
  stereo_to_xy = (s) => {
    var q = 1/(1 + s*s);
    return !q ? [-1, 0] : [(1 - s*s)*q, 2*s*q]; }

leowbattle 2021-08-17 17:47:37 +0000 UTC [ - ]

> Almost everything can be done better ... use the stereographic projection

I was just learning about stereographic projection earlier. Isn't it odd how, once you know about something, you notice it appearing everywhere?

Can you give an example of an operation that could be performed better using stereographic projection rather than angles?

jacobolus 2021-08-17 17:54:12 +0000 UTC [ - ]

Generally you can just stick to storing 2-coordinate vectors and using vector operations.

The places where you might want to convert to a 1-number representation are when you have a lot of numbers you want to store or transmit somewhere. Using the stereographic projection (half-angle tangent) instead of angle measure works better with the floating point number format and uses only rational arithmetic instead of transcendental functions.
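
For illustration, a small C sketch of the round trip this describes (the function names here are ad hoc, not from the comment above): the stereographic coordinate of the direction (x, y) is tan(θ/2), so recovering an angle, when one is truly needed, is a single atan.

  #include <math.h>

  /* 1-float encoding of the direction of (x, y): s = tan(theta/2) = y / (x + r).
     Breaks down only on the negative x-axis, where x + r == 0. */
  float stereo_encode(float x, float y) {
      return y / (x + hypotf(x, y));
  }

  /* Decode back to an angle in (-pi, pi), if an angle is really required. */
  float stereo_decode(float s) {
      return 2.0f * atanf(s);
  }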

aj7 2021-08-17 17:02:53 +0000 UTC [ - ]

Around 1988, I added phase shift to my optical thin film design program written in Excel 4.0 for the Mac. At the time, this was utterly unique: each spreadsheet row represented a layer and the matrices describing each layer could be calculated right in that row by squashing them down horizontally. The S- and P-polarization matrices could be recorded this way, and the running matrix products similarly maintained. Finally, using a simple one-input table, the reflectance of a typically 25-31 layer laser mirror could be calculated. And in less than a second on a 20 MHz 68020 (?) Mac II for about 50 wavelengths. The best part was the graphics, which were instantaneous, beautiful, publishable, and pasteable into customer quotations. Semi-technical people could be trained to use the whole thing.

Now about the phase shift. In 1988, atan2 didn’t exist. Anywhere. Not in FORTRAN, Basic, Excel, or a C library. I’m sure phase shift calculators implemented it, each working alone. For us, it was critical. You see, we actually cared not about the phase shift, but its second derivative, the group delay dispersion. This was the beginning of the femtosecond laser era, and people needed to check whether these broadband laser pulses would be inadvertently stretched by reflection off or transmission through the mirror coating. So atan2, the QUADRANT-PRESERVING arc tangent, is required for a continuous, differentiable phase function. An Excel function macro did this, with IF statements correcting the quadrant. And the irony of all this?

I CALLED it atan2.
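
For illustration, the quadrant correction described above looks roughly like this in C (a sketch, not the original Excel macro; the origin and infinities are left unhandled, and M_PI is assumed to come from math.h):

  #include <math.h>

  /* Quadrant-preserving arctangent built from plain atan(), in the spirit of
     the IF-statement macro described above. */
  double my_atan2(double y, double x) {
      if (x > 0) return atan(y / x);                            /* quadrants I and IV */
      if (x < 0) return atan(y / x) + (y >= 0 ? M_PI : -M_PI);  /* quadrants II and III */
      return (y > 0) ? M_PI / 2 : -M_PI / 2;                    /* on the y-axis */
  }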

dzdt 2021-08-17 19:26:27 +0000 UTC [ - ]

Fortran did have it much earlier. Per wikipedia atan2 appeared in Fortran IV from IBM in 1961. https://en.wikipedia.org/wiki/Atan2

pklausler 2021-08-17 19:48:30 +0000 UTC [ - ]

And it's been in every ANSI & ISO Fortran standard from F'66 through F'2018.

floxy 2021-08-17 18:03:00 +0000 UTC [ - ]

raphlinus 2021-08-17 19:28:39 +0000 UTC [ - ]

floxy 2021-08-17 19:50:23 +0000 UTC [ - ]

Nice find. Anyone have a way to search for old versions of "math.h"? The 1989 version of ANSI C had atan2:

https://web.archive.org/web/20161223125339/http://flash-gord...

...but I wonder when it first hit the scene.

kimixa 2021-08-17 20:57:17 +0000 UTC [ - ]

According to https://unix.superglobalmegacorp.com/ it was in 3bsd libm - https://unix.superglobalmegacorp.com/cgi-bin/cvsweb.cgi/3BSD...

3BSD was released at the end of 1979 according to wikipedia, and that's just the oldest BSD source I could find, so it may have been in even older versions.

Before ANSI/ISO C I don't think there was really a single "math.h" to look for, as it could be different on different systems, but since so many things derive from BSD I wouldn't be surprised if it was "de facto" standardised to what BSD supported.

Narishma 2021-08-17 21:09:46 +0000 UTC [ - ]

> Around 1988, I added phase shift to my optical thin film design program written in Excel 4.0 for the Mac.

Excel 4 was released in 1992. In 1988 the only version of Excel that would have been available on the Mac would be the very first one.

drej 2021-08-17 13:27:32 +0000 UTC [ - ]

Nice. Reminds me of an optimisation trick from a while ago: I remember being bottlenecked by one of these trigonometric functions years ago when working with a probabilistic data structure... then I figured the input domain was pretty small (a couple dozen values), so I precomputed those and used an array lookup instead. A huge win in terms of perf, obviously only applicable in these extreme cases.
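
For illustration, a minimal sketch of that kind of precomputed lookup (the function, table size and indexing are made up here, not the original code):

  #include <math.h>

  #define N_INPUTS 64                /* hypothetical: a couple dozen known inputs */
  static float sin_table[N_INPUTS];

  /* Fill the table once up front... */
  void init_table(void) {
      for (int k = 0; k < N_INPUTS; k++)
          sin_table[k] = sinf(k * (float)M_PI / N_INPUTS);
  }

  /* ...then every "call" in the hot loop is just an array read. */
  static inline float sin_lookup(int k) {
      return sin_table[k];
  }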

ThePadawan 2021-08-17 13:37:01 +0000 UTC [ - ]

One of the things I recently learned that sounded the most "that can't possibly work well enough" is an optimization for sin(x):

If abs(x) < 0.1, "sin(x)" is approximated really well by "x".

That's it. For small x, just return x.

(Obviously, there is some error involved, but for the speedup gained, it's a very good compromise)

tzs 2021-08-17 14:36:09 +0000 UTC [ - ]

To put some numbers on it, using N terms of the Taylor series for sin(x) [1] with |x| <= 0.1, the maximum error percentage cannot exceed [2]:

  N  Error Limit
  1  0.167% (1/6%)
  2  8.35 x 10^-5% (1/11980%)
  3  1.99 x 10^-8% (1/50316042%)
  4  2.76 x 10^-12% (1/362275502328%)
Even for |x| as large as 1 the sin(x) = x approximation is within 20%, which is not too bad when you are just doing a rough ballpark calculation, and a two term approximation gets you under 1%. Three terms is under 0.024% (1/43%).

For |x| up to Pi/2 (90°) the one term approximation falls apart. The guarantee from the Taylor series is within 65% (in reality it does better, but is still off by 36%). Two terms is guaranteed to be within 8%, three within 0.5%, and four within 0.02%.

Here's a quick and dirty little Python program to compute a table of error bounds for a given x [3].

[1] x - x^3/3! + x^5/5! - x^7/7! + ...

[2] In reality the error will usually be quite a bit below this upper limit. The Taylor series for a given x is a convergent alternating series. That is, it is of the form A0 - A1 + A2 - A3 + ... where all the A's have the same sign, |Ak| is a decreasing sequence past some particular k, and |Ak| has a limit of 0 as k goes to infinity. Such a series converges to some value, and if you approximate that by just taking the first N terms the error cannot exceed the first omitted term as long as N is large enough to take you to the point where the sequence from there on is decreasing. This is the upper bound I'm using above.

[3] https://pastebin.com/thN8B7Gf
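
For those who would rather not follow the link, a quick C sketch of the same bound (the first omitted term of the alternating series, taken relative to x) might look like:

  #include <math.h>
  #include <stdio.h>

  /* Error limit for the N-term sin Taylor approximation at |x|:
     bounded by the first omitted term, x^(2N+1) / (2N+1)!. */
  int main(void) {
      double x = 0.1;
      for (int n = 1; n <= 4; n++) {
          double bound = pow(x, 2 * n + 1) / tgamma(2 * n + 2);  /* (2N+1)! = Gamma(2N+2) */
          printf("N=%d  error limit: %.3g%%\n", n, 100.0 * bound / x);
      }
      return 0;
  }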

MauranKilom 2021-08-17 21:19:33 +0000 UTC [ - ]

Some trivia, partly stolen from Bruce Dawson[0]:

The sin(x) = x approximation is actually exact (in terms of doubles) for |x| < 2^-26 = 1.4e-8. See also [1]. This happens to cover 48.6% of all doubles.

Similarly, cos(x) = 1 for |x| < 2^-27 = 7.45e-9 (see [2]).

Finally, sin(double(pi)) tells you the error in double(pi) - that is, how far the double representation of pi is away from the "real", mathematical pi [3].

[0]: https://randomascii.wordpress.com/2014/10/09/intel-underesti...

[1]: https://github.com/lattera/glibc/blob/master/sysdeps/ieee754...

[2]: https://github.com/lattera/glibc/blob/master/sysdeps/ieee754...

[3]: https://randomascii.wordpress.com/2012/02/25/comparing-float...
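
A quick way to check that last point for yourself (an illustrative snippet, nothing more):

  #include <math.h>
  #include <stdio.h>

  int main(void) {
      /* sin(double(pi)) = sin(pi - eps) ~ eps, the gap between M_PI and the real pi. */
      printf("%.17g\n", sin(M_PI));  /* roughly 1.22e-16 */
      return 0;
  }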

quietbritishjim 2021-08-17 13:48:23 +0000 UTC [ - ]

That is precisely the technique discussed in the article: it's the first term of the Taylor expansion. Except that the article used more terms of the expansion, and also used very slightly "wrong" coefficients to improve the overall accuracy within the small region.

RicoElectrico 2021-08-17 14:46:17 +0000 UTC [ - ]

Come on, at this point I've seen this "engineering approximation" memed so many times, even on Instagram ;)

What is more interesting to me is that this can be one of the rationales behind using radians.

And that tan(x)~x also holds for small angles, greatly easing estimations of distance to objects of known size (cf. mil-dot reticle).

lapetitejort 2021-08-17 15:18:08 +0000 UTC [ - ]

tan(x)~x because cos(x)~1. It's approximations all the way down.

chriswarbo 2021-08-17 13:47:40 +0000 UTC [ - ]

This is a very common assumption in Physics, e.g. https://en.wikipedia.org/wiki/Pendulum_(mathematics)#Small-a...

Whether it's appropriate in a numerical calculation obviously depends on the possible inputs and the acceptable error bars :)

sharikone 2021-08-17 13:45:28 +0000 UTC [ - ]

I think that you will find that for subnormal numbers any math library will use the identity function for sin(x) and 1 for cos(x)

ThePadawan 2021-08-17 14:05:24 +0000 UTC [ - ]

Right, but the largest subnormal number in single-precision floats is ~ 10^-38.

That the sin(x) approximation still works well for 10^-1 (with an error of ~0.01%) is the really cool thing!

jvz01 2021-08-17 17:00:35 +0000 UTC [ - ]

Another good hint is the classical half-angle formulas. You can often avoid calling sin() and cos() altogether!

Bostonian 2021-08-17 13:49:37 +0000 UTC [ - ]

Why wouldn't you at least include the x^3 term in the Taylor series for abs(x) < 0.1?

ThePadawan 2021-08-17 13:54:26 +0000 UTC [ - ]

That's what I assumed would have been a reasonable optimization!

What I really found amazing was that rather than reducing the number of multiplications to 2 (to calculate x^3), you can reduce it to 0, and it would still do surprisingly well.

mooman219 2021-08-17 16:26:37 +0000 UTC [ - ]

This would be a decent lookup for the atan2 function:

https://gist.github.com/mooman219/19b18ff07bb9d609a103ef0cd0...

tantalor 2021-08-17 13:34:07 +0000 UTC [ - ]

bluedino 2021-08-17 13:38:13 +0000 UTC [ - ]

Technically it's a lookup table if you pre-compute them. Memoization would just be caching them as you do them.

tantalor 2021-08-17 13:48:22 +0000 UTC [ - ]

Not necessarily, you could "cache" them in a compilation step and then use the table at runtime.

bruce343434 2021-08-17 13:39:46 +0000 UTC [ - ]

Tangential at best, but why was the 'r' dropped from that term? Or why not call it caching? Why the weird "memo-ization"? It makes me think of a mass extinction event where everything is turned into a memo.

franciscop 2021-08-17 13:43:34 +0000 UTC [ - ]

It's explained right in the linked Wikipedia page:

> The term "memoization" was coined by Donald Michie in 1968[3] and is derived from the Latin word "memorandum" ("to be remembered"), usually truncated as "memo" in American English, and thus carries the meaning of "turning [the results of] a function into something to be remembered". While "memoization" might be confused with "memorization" (because they are etymological cognates), "memoization" has a specialized meaning in computing.

wongarsu 2021-08-17 13:54:04 +0000 UTC [ - ]

The term memoization likely precedes the word caching (as related to computing, obviously weapon caches are far older). Memoization was coined in 1968. CPU caches only came about in the 80s as registers became significantly faster than main memory.

As Wikipedia outlines, the r was dropped because of the memo. It's derived from the Latin word memorandum, which does contain the r, just like memory, but apparently it was meant more as an analogy to written memos.

Const-me 2021-08-17 13:46:00 +0000 UTC [ - ]

I wonder how it compares with Microsoft’s implementation here: https://github.com/microsoft/DirectXMath/blob/jan2021/Inc/Di...

Based on the code, your version is probably much faster. It would still be interesting to compare precision; MS uses a 17-degree polynomial there.

pclmulqdq 2021-08-17 14:15:03 +0000 UTC [ - ]

The "small error" in this article isn't so small when people want exact results. Unfortunately, a high-degree polynomial is necessary if you want 24 bit precision across the input range for functions like this.

That said, I wonder if a rational approximation might not be so bad on modern machines...

azalemeth 2021-08-17 14:50:14 +0000 UTC [ - ]

Indeed. I love Padé approximants -- they're very underused. Another really fun trick for approximating expensive functions is turning them into Chebyshev approximations, as in Nick Trefethen's chebfun package. You can trivially differentiate, integrate and compute chebfun approximations of really horrible special functions (e.g. DE solutions) with fantastic accuracy. Criminally underused.

jacobolus 2021-08-17 19:28:27 +0000 UTC [ - ]

If you like Padé approximants, have you been following Trefethen & al.’s recent work on the "AAA" algorithm etc.? https://arxiv.org/pdf/1612.00337.pdf https://arxiv.org/pdf/1705.10132.pdf

See also the lecture https://www.youtube.com/watch?v=S1upJPMIFfg

azalemeth 2021-08-17 21:47:25 +0000 UTC [ - ]

No, I haven't -- thank you very much for sharing. I've only ever used them, e.g. for spectral decomposition in magnetic resonance in a medical setting (1) with the fast Padé transformation, which has a wonderfully deep relationship with the DEs describing magnetic resonance (2).

Thanks also for the lecture - I'll enjoy it. Nick Trefethen is such a good speaker. He taught me in a graduate course in numerical methods -- honestly, I'd watch those lectures on a loop again.

(1) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5179227/

(2) https://iopscience.iop.org/article/10.1088/0031-9155/51/5/00...

petschge 2021-08-17 15:02:33 +0000 UTC [ - ]

24 bits of precision would be possible with the Hastings approximation with 15 terms instead of 6. A naive estimate is that this requires 2.5 times as much work, so it could run at 6 GB/s and take a little less than 5 cycles per element.

Someone 2021-08-17 18:07:25 +0000 UTC [ - ]

I think that remains to be seen. Exact calculations will give you 24 bits, but in computers, floating point calculations typically aren’t exact.

If, as this article does, you compute all your intermediate values as 32-bit IEEE floats, it would hugely surprise me if you still got 24 bits of precision after 1 division (to compute y/x or x/y), 15 multiplications and 14 additions (to evaluate the polynomial) and, sometimes, a subtraction from π/2.

(That’s a weakness in this article. They point out how fast their code is, but do not show any test that it actually computes approximations to atan2, let alone how accurate the results are.)

Getting all the bits right in floating point is expensive.

adgjlsfhk1 2021-08-17 18:23:08 +0000 UTC [ - ]

The good news is that the high-order terms of the polynomial don't notably affect the error. With FMA, Horner's method converges to 0.5 ULP accuracy. The subtraction from pi/2 can also be done without adding inaccuracy by using compensated summation.
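
For concreteness, a sketch of both tricks (the coefficients below are placeholders, not the article's):

  #include <math.h>

  /* Horner's scheme with FMA: one rounding per step, which is what keeps the
     high-order terms from polluting the result. */
  static inline float poly3(float x, float c0, float c1, float c2, float c3) {
      return fmaf(fmaf(fmaf(c3, x, c2), x, c1), x, c0);
  }

  /* pi/2 split into a rounded high part and a small correction; adding the
     correction after the main subtraction is a simple compensated sum. */
  #define PI_2_HI  1.57079637e+0f    /* pi/2 rounded to float */
  #define PI_2_LO -4.37113883e-8f    /* (pi/2 - PI_2_HI), rounded to float */

  static inline float pi_2_minus(float a) {
      return (PI_2_HI - a) + PI_2_LO;
  }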

Const-me 2021-08-17 14:41:01 +0000 UTC [ - ]

> I wonder if a rational approximation might not be so bad on modern machines

Uops.info says the throughput of 32-byte FP32 vector division is 5 cycles on Intel, 3.5 cycles on AMD.

The throughput of 32-byte vector multiplication is 0.5 cycles. The same goes for FMA, which is a single instruction computing a*b+c for 3 float vectors.

If the approximation formula only has 1 division, where both the numerator and denominator are polynomials and the degree of both is much lower than 17, the rational version might not be too bad compared to the 17-degree polynomial version.
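
As a sketch of the shape of such an evaluation (the degree and coefficients are placeholders, not a real atan fit): two low-degree polynomials evaluated with FMAs, then one vector division at the end.

  #include <immintrin.h>

  /* P(x)/Q(x) for 8 floats at a time; P and Q are degree-3 here, evaluated with
     Horner's scheme so each step is a single FMA. Requires AVX and FMA. */
  static inline __m256 rational_approx(__m256 x, const float p[4], const float q[4]) {
      __m256 num = _mm256_set1_ps(p[3]);
      __m256 den = _mm256_set1_ps(q[3]);
      for (int i = 2; i >= 0; i--) {
          num = _mm256_fmadd_ps(num, x, _mm256_set1_ps(p[i]));
          den = _mm256_fmadd_ps(den, x, _mm256_set1_ps(q[i]));
      }
      return _mm256_div_ps(num, den);  /* the single division */
  }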

pclmulqdq 2021-08-17 15:37:41 +0000 UTC [ - ]

A Padé approximation is generally like that. Especially when you are dealing with singularities, it should need much smaller polynomials than 17 terms in the numerator and denominator. I think I'm going to have to start a blog, and this will be on the list.

janwas 2021-08-17 18:02:32 +0000 UTC [ - ]

Indeed, degree-4 rational polynomials are generally enough for JPEG XL. Code for evaluating them using Horner's scheme (just FMAs): https://github.com/libjxl/libjxl/blob/bdde644b94c125a15e532b...

I initially used Chebfun to approximate them, then a teammate implemented the same Caratheodory-Fejer method in C. We subsequently convert from Chebyshev basis to monomials.

Blog sounds good :)

stephencanon 2021-08-17 20:57:24 +0000 UTC [ - ]

> if we’re working with batches of points and willing to live with tiny errors, we can produce an atan2 approximation which is 50 times faster than the standard version provided by libc.

Which libc, though? I assume glibc, but it's frustrating when people talk about libc as though there were a single implementation. Each vendor supplies their own implementation; libc is just a common interface defined by the C standard. There is no "standard version" provided by libc.

In particular, glibc's math functions are not especially fast--Intel's and Apple's math libraries are 4-5x faster for some functions[1], and often more accurate as well, for example (and both vendors provide vectorized implementations). Even within glibc versions, there have been enormous improvements over the last decade or so, and for some functions there are big performance differences depending on whether or not -fno-math-errno is specified. (I would also note that atan2 has a lot of edge cases, and more than half the work in a standards-compliant libc is in getting those edge cases with zeros and infinities right, which this implementation punts on. There's nothing wrong with that, but that's a bigger tradeoff for most users than the small loss of accuracy and important to note.)

So what are we actually comparing against here? Comparing against a clown-shoes baseline makes for eye-popping numbers, but it's not very meaningful.

None of this should really take away from the work presented, by the way--the techniques described here are very useful for people interested in this stuff.

[1] I don't know the current state of atan2f in glibc specifically; it's possible that it's been improved since last I looked at its performance. But the blog post cites "105.98 cycles / element", which would be glacially slow on any semi-recent hardware, which makes me think something is up here.

rostayob 2021-08-17 21:14:48 +0000 UTC [ - ]

(I'm the author)

You're right, I should have specified -- it is glibc 2.32-48. This is the source specifying how glibc is built: https://github.com/NixOS/nixpkgs/blob/97c5d0cbe76901da0135b0... .

I've amended the article so that it says `glibc` rather than `libc`, and added a sidenote specifying the version.

I link to it statically as indicated in the gist, although I believe that shouldn't matter. Also see https://gist.github.com/bitonic/d0f5a0a44e37d4f0be03d34d47ac... .

Note that the hardware is not particularly recent (Q3 2017), but we tend to rent servers which are not exactly on the bleeding edge, so that was my platform.

floxy 2021-08-17 23:49:31 +0000 UTC [ - ]

Thanks for the interesting writeup. I wonder if it is because glibc has a less-optimized atan2f (for floats). The double version seems quite involved in glibc anyway:

https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/iee...

...where the atan2f version ends up calling atan, which doesn't seem as sophisticated:

https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/iee...

stephencanon 2021-08-17 21:21:41 +0000 UTC [ - ]

Thanks for clarifying!

Without benchmarking, I would expect atan2f to be around 20-30 cycles per element or less with either Intel's or Apple's scalar math library, and proportionally faster for their vector libs.

rostayob 2021-08-17 21:28:58 +0000 UTC [ - ]

Thanks for the info.

By the way, your writing on floating point arithmetic is very informative -- I even cite a message of yours on FMA in the post itself!

rostayob 2021-08-17 21:21:20 +0000 UTC [ - ]

Ah, regarding edge cases: the only edge cases I do not handle are infinities and the origin. In my case these are both uninteresting: we deal with points from panoramic images, so those both make no sense.

I did want to deal with 0s in either coordinate though, and to preserve NaNs.

You're right that atan2 has a lot of edge cases, which is why it made for an interesting subject. I thought it was neat that the edge cases I did care about fell out fairly naturally from the efficient bit-twiddling.

That said, the versions presented in the post are _not_ conforming, as I remark repeatedly. The article is more about getting people curious about some topics that I think are neat than a guide on how to implement those standard functions correctly.

hdersch 2021-08-17 17:01:31 +0000 UTC [ - ]

A comparison with one of the many SIMD math libraries would have been fairer than with plain libm. A long time ago I wrote such a dual-platform library for the PS3 (Cell processor) and the x86 architecture (outdated, but still available here [1]). Depending on how the standard libm implements atan2f, a speedup of 3x to 15x is achieved, without sacrificing accuracy.

1. https://webuser.hs-furtwangen.de/~dersch/libsimdmath.pdf

prionassembly 2021-08-17 14:12:06 +0000 UTC [ - ]

I wonder whether Padé approximants are well known by this kind of researcher. E.g. http://www-labs.iro.umontreal.ca/~mignotte/IFT2425/Documents...

nimish 2021-08-17 14:52:17 +0000 UTC [ - ]

Numerical analysts are definitely familiar. Library authors may not be unless they took a course in function approximation.

Personally I prefer Chebyshev approximants (like chebfun), but Padé approximants are really good, at the cost of a division.

stephencanon 2021-08-17 21:18:38 +0000 UTC [ - ]

A math library implementor will generally be familiar with at least minimax and Chebyshev approximations, and generally Padé and Carathéodory-Fejér approximations as well. (Source: I implemented math libraries for a decade. We used all of these frequently.)

jvz01 2021-08-17 16:32:09 +0000 UTC [ - ]

I have developed very fast, accurate, and vectorizable atan() and atan2() implementations, leveraging AVX/SSE capabilities. You can find them here [warning: self-signed SSL-Cert].

https://fox-toolkit.org/wordpress/?p=219

jvz01 2021-08-17 16:57:53 +0000 UTC [ - ]

Little side note: the algorithm as given is scalar; however, it's branch-free and defined entirely in the header file. So compilers will typically be able to vectorize it, and thus achieve a speedup directly based on the vector size. I see a potential [but architecture-dependent] optimization using the Estrin scheme for evaluating the polynomial.
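
For reference, a sketch of Estrin-style evaluation for a degree-7 polynomial (placeholder coefficients): it trades Horner's serial dependency chain for independent FMAs that can execute in parallel.

  #include <math.h>

  /* c[0] + c[1]x + ... + c[7]x^7, grouped so the FMAs computing a, b, d, e are
     independent of each other and can overlap in the pipeline. */
  static inline float estrin7(float x, const float c[8]) {
      float x2 = x * x;
      float x4 = x2 * x2;
      float a = fmaf(c[1], x, c[0]);
      float b = fmaf(c[3], x, c[2]);
      float d = fmaf(c[5], x, c[4]);
      float e = fmaf(c[7], x, c[6]);
      float lo = fmaf(b, x2, a);   /* c0..c3 part */
      float hi = fmaf(e, x2, d);   /* c4..c7 part */
      return fmaf(hi, x4, lo);
  }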

ghusbands 2021-08-17 16:38:25 +0000 UTC [ - ]

Your result is significantly slower than the versions presented in the article, though yours has more terms and so may be more accurate.

jvz01 2021-08-17 16:41:29 +0000 UTC [ - ]

Yes, the aim was to be accurate down to 1 lsb while being significantly faster. Feel free to drop terms from the polynomial if you can live with less accurate results!

The coefficients were generated by a package called Sollya; I've used it a few times to develop accurate Chebyshev approximations for functions.

Abramowitz & Stegun is another good reference.

touisteur 2021-08-17 19:41:17 +0000 UTC [ - ]

Please, would you mind updating your blog post one of these days with the instructions you gave to Sollya? I'm trying something stupid with log1p and can't get Sollya to help, mostly because I'm not putting in enough time to read all the docs...

shoo 2021-08-17 21:20:08 +0000 UTC [ - ]

Related -- there's a 2011 post from Paul Mineiro with fast approximations for logarithm, exponential, power, inverse root. http://www.machinedlearnings.com/2011/06/fast-approximate-lo...

Mineiro's faster approximate log2 has < 1.4% relative error for x in [1/100, 10]. Here's the simple non-SSE version:

  #include <stdint.h>   /* needed for uint32_t */
  static inline float
  fasterlog2 (float x)
  {
    /* Reinterpret the float's bits as an integer; for positive x this is
       roughly 2^23 * (log2(x) + 127). */
    union { float f; uint32_t i; } vx = { x };
    float y = vx.i;
    y *= 1.1920928955078125e-7f;  /* scale by 2^-23 */
    return y - 126.94269504f;     /* subtract the (slightly tweaked) exponent bias */
  }
The fastapprox library also includes fast approximations of some other functions that show up in statistical / probabilistic calculations -- gamma, digamma, the Lambert W function. It is BSD licensed and originally lived on Google Code; copies of the library live on in GitHub, e.g. https://github.com/etheory/fastapprox

It's also interesting to read through libm. E.g. compare Sun's ~1993 atan2 & atan:

https://github.com/JuliaMath/openlibm/blob/master/src/e_atan...

https://github.com/JuliaMath/openlibm/blob/master/src/s_atan...

zokier 2021-08-17 14:12:32 +0000 UTC [ - ]

I would have liked to see the error analysis section expanded a bit, or maybe some sort of test to validate the max error. In particular, if the mathematical approximation arctan* has a max error of 1/10000 of a degree, then I'd naively expect the float-based implementation to have a worse error. Furthermore, it's not obvious whether additional error could be introduced by the division

    float atan_input = (swap ? x : y) / (swap ? y : x);

pklausler 2021-08-17 18:54:31 +0000 UTC [ - ]

(Undoubtedly) stupid question: would it be any faster to project (x, y) to the unit circle (x', y'), then compute acos(x') or asin(y'), and then correct the result based on the signs of x & y? When converting Cartesian coordinates to polar, the value of r=HYPOT(x, y) is needed anyway, so the projection to the unit circle would be a single division by r.

drfuchs 2021-08-17 13:37:02 +0000 UTC [ - ]

Wouldn’t CORDIC have done the trick faster? There’s no mention that they even considered it, even though it’s been around for half a century or so.

adrian_b 2021-08-17 13:46:42 +0000 UTC [ - ]

CORDIC was great for devices that were too simple to include hardware multipliers.

CORDIC, in its basic form, produces just 1 bit of result per clock cycle. Only for very low precisions (i.e. with few bits per result) would you have a chance to overlap enough parallel atan2 computations to achieve a throughput comparable to what you can do with polynomial approximations on modern CPUs with pipelined multipliers.

CORDIC remains useful only for ASIC/FPGA implementations, in cases where area and power consumption are much more important than speed.
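
For context, a minimal scalar sketch of CORDIC in vectoring mode, the variant that computes the angle of a vector; each iteration contributes roughly one bit of the result (a real implementation would use a precomputed atan table and fold the other quadrants in first):

  #include <math.h>

  /* Rotate (x, y) toward the positive x-axis by angles atan(2^-i), accumulating
     the total rotation. Assumes x > 0. */
  float cordic_atan(float y, float x, int iters) {
      double xd = x, yd = y, angle = 0.0;
      for (int i = 0; i < iters; i++) {
          double d = ldexp(1.0, -i);   /* 2^-i */
          double step = atan(d);       /* normally a constant table, not a call */
          if (yd > 0) {
              double xn = xd + yd * d;
              yd -= xd * d;
              xd = xn;
              angle += step;
          } else {
              double xn = xd - yd * d;
              yd += xd * d;
              xd = xn;
              angle -= step;
          }
      }
      return (float)angle;
  }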

mzs 2021-08-17 13:52:02 +0000 UTC [ - ]

CORDIC is unlikely to be faster than this article's approach: "the approximation produces a result every 2 clock cycles"

sorenjan 2021-08-17 14:55:43 +0000 UTC [ - ]

How do you handle arrays of values where the array length is not a multiple of 8 in this kind of vectorized code? Do you zero-pad them before handing them to the vectorized function, or do you run a second loop element by element on the remaining elements after the main one? What happens if you try to do `_mm256_load_ps(&ys[i])` with < 8 elements remaining?

zbjornson 2021-08-17 15:02:40 +0000 UTC [ - ]

You could pad the input as you said, which avoids a "tail loop," but otherwise you usually do a partial load (load <8 elements into a vector) and store. Some instruction set extensions provide "masked load/store" instructions for this, but there are ways to do it without those too.

To your last question specifically, if you _mm256_load_ps(&ys[i]) and you're at the edge of a page, you'll get a segfault. Otherwise you'll get undefined values (which can be okay, you could ignore them instead of dealing with a partial load).
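
For illustration, the second option (a vectorized main loop plus a scalar tail) looks like this; the function and the operation are made up, just to show the loop structure:

  #include <immintrin.h>
  #include <stddef.h>

  void scale(float *xs, size_t n, float k) {
      __m256 vk = _mm256_set1_ps(k);
      size_t i = 0;
      for (; i + 8 <= n; i += 8) {                /* full 8-wide blocks */
          __m256 v = _mm256_loadu_ps(&xs[i]);
          _mm256_storeu_ps(&xs[i], _mm256_mul_ps(v, vk));
      }
      for (; i < n; i++)                          /* scalar tail, at most 7 elements */
          xs[i] *= k;
  }

AVX also has _mm256_maskload_ps / _mm256_maskstore_ps if you would rather handle the tail as one masked 8-wide operation instead of a scalar loop.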

twic 2021-08-17 15:46:44 +0000 UTC [ - ]

RISC-V has a solution to this which seems quite elegant - the vector length is set by the program, up to some maximum, and for the final block, it is simply whatever is left:

https://gms.tf/riscv-vector.html

vlovich123 2021-08-17 15:22:51 +0000 UTC [ - ]

> Do you zero pad them before handling them to the vectorized function, or do you run a second loop element by element on the remaining elements after the main one?

Typically the latter. That's why I find SVE so interesting. It should improve code density.

TonyTrapp 2021-08-17 15:00:17 +0000 UTC [ - ]

Typically you have the vectorized loop followed by a non-vectorized loop handling the remaining items, or even just explicit if-then-else statements for each possible number of remaining items.

aconz2 2021-08-17 16:16:04 +0000 UTC [ - ]

Nice writeup and interesting results. I hadn't seen perf_event_open(2) used directly in code before, which looks cool.

The baseline is at a huge disadvantage here because the call to atan2 in the loop never gets inlined and the loop doesn't seem to get unrolled (which is surprising, actually). Manually unrolling by 8 gives me an 8x speedup. Maybe I'm missing something with the `-static` link, but unless they're using musl I didn't think -lm could get statically linked.

xxpor 2021-08-17 16:22:57 +0000 UTC [ - ]

Could it be that calls to glibc never get inlined, since inlining is essentially static linking by another name? Or since they're in separate compilation units, without LTO you'd never perform the analysis to figure out if it's worth it. Would LTO inspect across a dynamic loading boundary? Just speculating, I really have no idea. Everything I work with is statically linked, so I've never really had to think about it.

aconz2 2021-08-17 19:44:58 +0000 UTC [ - ]

You've made a silly mistake and just did 1/8th of the work, so of course it was an 8x speedup. I am still wondering how the numbers would look if you could inline atan2f from libc. Too bad that isn't easier.

nice2meetu 2021-08-17 17:04:51 +0000 UTC [ - ]

I did something similar for tanh once, though I found I could get to 1 ulp.

Part of the motivation was that I could get 10x faster than libc. However, I then tried on my FreeBSD box and could only get 4x faster. After a lot of head scratching and puzzling it turned out there was a bug in the version of libc on my Linux box that slowed things down. It kind of took the wind out of the achievement, but it was still a great learning experience.

azhenley 2021-08-17 18:54:24 +0000 UTC [ - ]

This is pretty similar to my quest to make my own cos() when my friend didn't have access to libc. It was fun! Though I don't have the math or low-level knowledge that this author does.

https://web.eecs.utk.edu/~azh/blog/cosine.html

spaetzleesser 2021-08-17 18:22:18 +0000 UTC [ - ]

I am envious of people who can deal with such problems. The problem is defined clearly and can be measured easily.

This is so much more fun than figuring out why some SaaS service is misbehaving.

cogman10 2021-08-17 18:34:21 +0000 UTC [ - ]

I'm actually a bit surprised that the x86 SIMD instructions don't support trig functions.

h0mie 2021-08-17 15:58:16 +0000 UTC [ - ]

Love posts like this!

unemphysbro 2021-08-17 18:26:04 +0000 UTC [ - ]

coolest blog post I've seen here in a while. :)