
Complexity theory puts limits on performance of gradient descent

Buttons840 2021-08-18 02:33:50 +0000 UTC [ - ]

> For example, gradient descent is often used in machine learning in ways that don’t require extreme precision. But a machine learning researcher might want to double the precision of an experiment. In that case, the new result implies that they might have to quadruple the running time of their gradient descent algorithm. That’s not ideal, but it is not a deal breaker.

> But for other applications, like in numerical analysis, researchers might need to square their precision. To achieve such an improvement, they might have to square the running time of gradient descent, making the calculation completely intractable.

I think a silver lining here is that a company having access to 10,000 times as much computing power as a common consumer can only achieve, say, a 10x better model. So the inequality isn't as extreme as it could be.

adrianN 2021-08-18 09:53:21 +0000 UTC [ - ]

One has to keep in mind that these are worst-case bounds. It is perfectly compatible with the result that someone with access to 10000x the computing power can achieve a 10000x better model for all the functions that we care about in practice.

kbelder 2021-08-18 17:23:48 +0000 UTC [ - ]

My general impression is that it's fairly easy to get moderately accurate models with simple procedures on cheap hardware... linear regressions, say. Maybe that'll get you 75% predictive accuracy.

Then you apply better techniques that hit the limits of your PC, and you bring that up to maybe 80%.

Then you bring in the PhDs, move your analysis to a cluster or into the cloud, and you gain another 2%.

In other words, it's more often that 10000x the computing power gives you something like 1.2x the model quality, not 10000x or even 100x. Still worth doing for a large enough company, though.

Just my experience, and only a generalization. Some models just can't be built without a minimum amount of horsepower, etc.

mjburgess 2021-08-18 05:03:48 +0000 UTC [ - ]

Is that how scientific modelling (i.e., modelling of non-computable functions) works, though?

It's not necessarily that squaring precision leads to a "better model"; it's that it fails to be a model at all without that precision.

E.g., consider a climate model with effects at the first decimal place and at the 10th decimal place of some measurable quantity vs. time.

The effects at the 10th d.p. could aggregate to overwhelm the effects at the 1st d.p.

It isn't that the model is better if it can "compute at the 10th d.p." -- it's just wildly inaccurate otherwise.

ben_w 2021-08-18 07:48:53 +0000 UTC [ - ]

10 decimal places reminds me of a conversation I had with my father. Last century my dad was working on radar/IFF simulations, and he had an anecdote (if you can call it that) about when they switched from normal to double precision and needed to get more digits of π (I think from 10 to 20; the way he talked implied some very strange things about the programming language(s) he was using). The third time he told this story, late-teenage-me pointed out that the curvature of spacetime under GR had more impact than the extra digits.

While the broader point is valid, IIRC climate modelling (being about averages and broad trends) is fine with the imperfections, and handles this by running multiple simulations with slightly different parameters and seeing what usually happens. Weather forecasting likewise.

TeMPOraL 2021-08-18 09:13:35 +0000 UTC [ - ]

> The third time he told this story, late-teenage-me pointed out that the curvature of spacetime under GR had more impact than the extra digits.

That may be true, but in simulations using floating-point numbers (which are, as 'mjburgess points out, discrete), there are rounding errors that accumulate with almost any operation, on top of any measurement or modelling approximation errors you may have. On some problems, numerical integration on floats bleeds precision really fast.

I recall seeing some problems in numerics labs during my university years where using doubles instead of floats was what made the algorithm converge on a solution instead of endlessly spinning or diverging to infinity. I imagine that ideally you'd use a more sophisticated algorithm instead[0] - perhaps one that simulates real numbers with greater precision. But computing power isn't infinite (and it certainly wasn't in the last century) - floats and doubles are fast, supported by hardware, and you may need that speed to get results in a reasonable timeframe. Using doubles instead of floats may give you just enough precision buffer to make the faster algorithm work.

--

[0] - A trivial example that I'm sure is known to anyone doing physical simulations (including for videogames) is using increasingly sophisticated integrators for kinematics. You may start with naive Euler, calculating v += aΔt, s += vΔt, but this quickly becomes inaccurate in quite spectacular ways (try simulating springs with naive Euler). So you may then progress through backwards / symplectic Euler, and end up with Verlet integration or even reach for 4th-order Runge-Kutta, to keep the energy in the system more or less conserved as it evolves. And yes, the difference between floats and doubles is also very apparent, especially with the Euler methods.
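For illustration, a minimal toy sketch of that first step up (not from any particular engine): a unit-mass spring x'' = -k·x stepped with naive Euler vs. semi-implicit (symplectic) Euler. The naive version visibly pumps energy into the system; the symplectic one keeps it bounded.

    # Toy harmonic oscillator x'' = -k*x (unit mass): naive vs. symplectic Euler.
    def energy(x, v, k=1.0):
        return 0.5 * v * v + 0.5 * k * x * x

    def naive_euler(x, v, dt, k=1.0):
        # both updates use the *old* state
        return x + v * dt, v - k * x * dt

    def symplectic_euler(x, v, dt, k=1.0):
        # velocity first, then position from the *new* velocity
        v = v - k * x * dt
        return x + v * dt, v

    for name, step in [("naive", naive_euler), ("symplectic", symplectic_euler)]:
        x, v, dt = 1.0, 0.0, 0.01
        for _ in range(100_000):  # roughly 160 oscillation periods
            x, v = step(x, v, dt)
        print(f"{name:>10}: energy {energy(x, v):.4f} (started at 0.5)")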

gugagore 2021-08-18 14:22:17 +0000 UTC [ - ]

It sort of sounds like you're confusing two different kinds of errors. Euler vs RK4 is an issue of truncation error. It's called that because when you approximate a derivative with a finite difference, you can see that you've truncated the infinite Taylor series.

Truncation error is present even if there were zero rounding error in the elementary floating-point operations. I'm not sure how you mean that the difference between floats and doubles shows up with Euler methods especially.

With no floating-point error, you can drive truncation error toward zero by driving the step size toward zero. But the floating-point precision sets a floor on how low you can go. Perhaps this is what you meant.
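A quick illustration of that floor (a sketch, assuming NumPy is available): the forward-difference derivative of sin at x = 1. Shrinking h reduces truncation error until rounding error takes over, and the floor is much higher in float32 than in float64.

    import numpy as np

    def fd_error(h, dtype):
        x, h = dtype(1.0), dtype(h)
        approx = (np.sin(x + h) - np.sin(x)) / h   # forward difference
        return abs(float(approx) - np.cos(1.0))    # error vs. the exact derivative

    for dtype in (np.float32, np.float64):
        errs = {h: fd_error(h, dtype) for h in (1e-1, 1e-3, 1e-5, 1e-7, 1e-9)}
        best_h = min(errs, key=errs.get)
        print(dtype.__name__, "best h:", best_h, "error:", errs[best_h])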

mjburgess 2021-08-18 08:22:22 +0000 UTC [ - ]

I'm not sure this is as true as it seems.

When creating simulations, we do so by defining a computable function (i.e., over discrete binary strings) which approximates some non-computable function (i.e., over the reals).

It is important to note that those non-computable functions are derived, in part, analytically, based on theorems which require infinite precision, i.e., real numbers.

When we define a computable approximation we can only do so by narrowing the simulation to a scale where "the numbers have only a few significant digits". This is really the only thing we can simulate, and even then, we are using mathematical analysis to derive much of it -- which cannot be simulated.

If you really wanted to simulate across scales, i.e., run a simulation "from atoms to tables", I think we'd be requiring many, many decimal places. It would likely, I think, be physically impossible to do (or if not, then atoms to planets would be) -- as the discrete precision required would require a theoretical minimum of mass that would create a black hole.

E.g., consider that a table has 10^verylargenumber "spherical" atoms, and so its aggregate shape would be extremely sensitive to a miscalculation of each atom's shape ... and we would require a precision in π of probably something like 10^-verylargenumber.

wcoenen 2021-08-18 11:57:00 +0000 UTC [ - ]

> Consider that a table has 10^verylargenumber of "spherical" atoms

That "verylargenumber" exponent is not actually a large number. It's probably close to 28.

E.g., for a rough estimate, pretend that the table has a mass of 120 kg and is made of only carbon atoms: 120000 g / (12 g/mol) = 10000 mol, and 10000 mol × 6×10^23 atoms/mol ≈ 6×10^27 < 10^28.

On a related note, the estimated number of atoms in the visible universe is around 10^80.
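For concreteness, the same estimate in a couple of lines (same assumptions: a 120 kg table of pure carbon):

    AVOGADRO = 6.022e23                  # atoms per mol
    atoms = 120_000 / 12.0 * AVOGADRO    # 120 kg in grams, at 12 g/mol
    print(f"{atoms:.1e} atoms")          # ~6.0e+27, just under 10^28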

mjburgess 2021-08-18 14:24:24 +0000 UTC [ - ]

So the volume of one atom can be calculated using c. 30 decimal places of pi -- plus 3 so that errors do not compound.

So we need 30 bits/atom to store the volume, for 10^30 atoms -- this is 10^15 petabytes.

Suppose we're trying to estimate the shape of the table using the relevant state vector for each atom (i.e., radius, coordinates, EM shell, etc.) -- i.e., how these atoms interact to produce a structure.

Then we need perhaps 10 numbers/atom, so 10^16 petabytes. And consider the computation.

The issue is that macroscopic structuring effects -- i.e., that the table has its tabular shape -- also exert a top-down effect constraining the structure of the atoms.

We'd have to solve that problem through iteration. I.e., we compute a shape from local effects only, then iteratively recompute until increasing orders of scale are included.

Suppose we perform 1 operation/atom to compute an interaction, that's 10^30 operations. At 1ns/op, we have 10^21 seconds -- optimistically.

That's one scale; now we need to iterate. Given the effects span scales from 1 nm to 1 m, that's c. 10 scales, say. We're now at 10^22 seconds with 10^16 petabytes of data.

And also, since we're running a 10-fold iteration we need precision better than 1 part in 10^30, as errors will compound.

This is for a table! As I said, choose your scale, e.g. atom to star, and the storage requirements alone imply an essentially physically impossible mass.
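Taking the comment's round figures at face value (10^30 atoms, 30 bits per stored number, one operation per atom at 1 ns each), the orders of magnitude do check out:

    atoms = 1e30
    bits_per_number = 30
    petabyte_bits = 8e15                             # bits in a petabyte

    one_number = atoms * bits_per_number / petabyte_bits
    print(f"storage, 1 number/atom : {one_number:.1e} PB")        # ~3.8e+15 PB
    print(f"storage, 10 numbers/atom: {10 * one_number:.1e} PB")  # ~3.8e+16 PB
    print(f"one pass at 1 ns/op     : {atoms * 1e-9:.1e} s")      # 1.0e+21 s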

actually_a_dog 2021-08-18 11:41:50 +0000 UTC [ - ]

That doesn't surprise me a lot. Apparently, you'd only need 40 digits of pi to compute the area of a circle the size of the visible universe[0]. Pi tends to show up in equations where it's being multiplied by relatively small constants like 1/2, 2, or 8. Maybe sometimes you might see a pi^2 times a smallish constant, too. So, not needing a ton of digits in order to be good enough for numerical models doesn't seem crazy at all.

---

[0]: https://www.jpl.nasa.gov/edu/news/2016/3/16/how-many-decimal...

morpheos137 2021-08-18 12:37:07 +0000 UTC [ - ]

>only need 40 digits...to calculate area.

This comment makes no sense without the number of significant digits.

legulere 2021-08-18 05:04:19 +0000 UTC [ - ]

10x is a lot if the market follows a winner-takes-all model.

nightski 2021-08-18 18:44:01 +0000 UTC [ - ]

I thought that Deep Learning primarily used stochastic gradient descent. I did not get deep enough into the article, but wouldn't that change the picture? Or are the computational complexity limits the same?

madsbuch 2021-08-18 06:19:44 +0000 UTC [ - ]

This would only be implied if gradient descent were the only way to do machine learning.

cblconfederate 2021-08-18 09:23:01 +0000 UTC [ - ]

inequality stems from the fact that others cannot reach the same result, no matter how small it is

scotth 2021-08-18 03:04:02 +0000 UTC [ - ]

This is my favorite comment (please don't down vote)

miketery 2021-08-18 00:15:57 +0000 UTC [ - ]

> Brouwer’s fixed point theorem. The theorem says that for any continuous function, there is guaranteed to be one point that the function leaves unchanged — a fixed point, as it’s known. This is true in daily life. If you stir a glass of water, the theorem guarantees that there absolutely must be one particle of water that will end up in the same place it started from.

This seems counterintuitive. Why would this be true? I would have thought that it's almost impossible for a particle to end up in the same place as it started.

fshbbdssbbgdd 2021-08-18 00:26:32 +0000 UTC [ - ]

The paragraph leaves out an important constraint: it’s only for functions that map some compact convex set to itself. So, the theory doesn’t apply to the sibling commenter’s example of f(x) = x + 1 because it doesn’t map a compact set to itself.

One example of a compact convex set is a circle. Imagine some continuous function that maps a circle to a circle. An easy example is rotation. For rotation, the center stays in the same place. Now imagine some more complicated function that might do rotation, stretching, spiral distortions, whatever you can imagine. But it has to be continuous, meaning you can’t go off one edge and come back on the other side (among other things). Try to draw, or even just imagine a transformation like this and you’ll find that there’s always some point that stays in the same place.

For a 3D object like a glass it’s hard to see the same thing intuitively, but Brouwer’s theorem says that it holds (and it also holds for higher-dimensional sets).

kevinventullo 2021-08-18 01:07:58 +0000 UTC [ - ]

Minor wording nit: rather than the circle, I would say “consider the disc”. Usually in math “circle” means just the boundary whereas “disc” means the circle together with its interior.

OrderlyTiamat 2021-08-18 11:47:03 +0000 UTC [ - ]

thank you, that did clear it up for me.

enriquto 2021-08-18 10:46:20 +0000 UTC [ - ]

> For a 3D object like a glass

A glass is not convex so this theorem does not apply. You'd need other fixed-point theorems for that. If the glass has a handle the result is trivially false (consider rotating a doughnut, this does not have any fixed point).

FabHK 2021-08-18 11:27:27 +0000 UTC [ - ]

I think GP meant the water in the glass (as the article).

antman 2021-08-18 08:25:52 +0000 UTC [ - ]

What about a circle with 60cm diameter that I move one meter to the right? Something is missing in my understanding of the definition.

legolas2412 2021-08-18 08:38:17 +0000 UTC [ - ]

It has to map the same set onto itself.

techwizrd 2021-08-18 00:32:44 +0000 UTC [ - ]

The devil lies in the details here. Brouwer's fixed point theorem says that for a continuous function f that maps a _compact, convex_ set to itself, there must exist a point x₀ such that f(x₀) = x₀. The theorem holds for sets that are compact (i.e., closed and bounded) and convex, or homeomorphic to such a set.

I think the 3-dimensional analogy (similar to the stirring of coffee and sugar Brouwer observed) is rather unintuitive. I think some of the proofs are quite beautiful, but I like this analogy from Wikipedia:

> Take an ordinary map of a country, and suppose that that map is laid out on a table inside that country. There will always be a "You are Here" point on the map which represents that same point in the country.

SquishyPanda23 2021-08-18 01:34:40 +0000 UTC [ - ]

As others have mentioned, the article is missing a crucial constraint, which is essentially that it holds on the closed unit ball and spaces homeomorphic to it.

As you mentioned, it's also not intuitive. It's by now well digested, but it was a fairly surprising theorem, even to Brouwer.

Finally, it's not constructive. Generally you can't figure out which point is fixed. If I recall correctly, frustration with this fact fueled Brouwer's interest in intuitionism, which gives us a type of logic where some propositions are neither true nor false but essentially undecidable.

I'm not sure what you mean by a "particle" in your sentence. If you mean to go down to the quantum scale, then I'm sure all sorts of things go wrong in trying to apply the theorem. For one, even talking about the position of a particle at all is very different from talking about a point in Euclidean space.

If you don't mean to go down to the quantum scale, then lots of maps have fixed points. Rotations and reflections are two large classes of examples.

vlovich123 2021-08-18 02:05:15 +0000 UTC [ - ]

Does intuitionism go beyond Gödel’s incompleteness theorem? Or is it an unrelated thing?

SquishyPanda23 2021-08-18 02:39:07 +0000 UTC [ - ]

They are different things, although "unrelated" might be a bit strong. I'm sure there are interesting things to say about the incompleteness theorem and intuitionistic logic.

Intuitionistic logic is a logic that is weaker than classical logic in the sense that there are some proofs in classical logic that don't work in intuitionistic logic. Specifically, the law of the excluded middle ("either A is true or not A is true") does not hold.

You can convince yourself that this is a reasonable type of logic by interpreting "A is true" to mean "I have a proof of A", and likewise "not A is true" to mean "I have a proof of not A". Then it's pretty reasonable that you might have a proof of neither A nor not A.

I haven't read Gödel's proof, but my understanding is that he uses intuitionistic logic in some of the key steps. Since facts in intuitionistic logic are also facts in classical logic, I believe the incompleteness theorem holds in both logics. But I'm not at all an expert here, so someone please correct me if I'm wrong.

"Intuitionism" also refers to a specific philosophy of mathematics. But a lot of people just use the word to refer to the logic that came out of that philosophical school.

Intuitionistic logic is also very common in type theory, so a lot of proof assistants use it.

TeMPOraL 2021-08-18 08:43:53 +0000 UTC [ - ]

>>> intuitionism, which gives us a type of logic where some propositions are neither true nor false but essentially undecidable.

Between math and software development, I feel I've encountered this idea at least three times (+ rediscovered it independently), under different names - "3-value logic", "3-state booleans" - without ever hearing about "intuitionism". Thanks for mentioning it, and now I wonder how it all relates (i.e. I'm about to get stuck on Wikipedia).

> interpreting "A is true" to mean "I have a proof of A", and likewise "not A is true" means "I have a proof of not A".

Makes me think of an interesting progression I've also seen when studying maths and then (classical) AI algorithms. You start with boolean logic, then figure out you need to represent lack of knowledge/certainty and get 3-state logic. Then someone notices that there are degrees to being sure, and you get fuzzy logic[0], representing logical values as real numbers. And then someone notices that this smells an awful lot like probability theory, and you get Bayesian probability as applied to reasoning, with all the extra concepts like using whole probability distributions as first-class logic values.

--

[0] - I'm not sure how much mathematical rigour there is to it; it's something I've learned when studying AI for videogames.

SquishyPanda23 2021-08-18 10:24:30 +0000 UTC [ - ]

This is an interesting point. I haven't really thought about the connection between intuitionistic logic and three-valued logic before.

They are different logics in general, though. 3-valued logic is more like: A is True, False, or None/Unknown.

Another interpretation one can put on intuitionistic logic is game semantics. Here a statement is true if the verifier has a winning strategy to prove the proposition. It's false if the falsifier has a winning strategy. Then intuitionistic logic corresponds roughly to the idea that some games don't have a winning strategy for either player.

So here (and in the proofs example I gave above) the idea of "neither true nor false" is represented by the lack of an explicit rule for concluding truth or falsity in some cases.

In contrast, 3-valued logic gets at the idea of "neither true or false" by introducing another symbol (e.g. "None" or "-1") together with new operators that operate on sentences with that symbol.

So we get one from classical logic by deleting something (namely the law of the excluded middle) and the other by adding something.
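A tiny sketch of the "extra symbol plus new operators" route (Kleene's strong three-valued logic, with None standing in for Unknown; my own illustration):

    def not3(a):
        return None if a is None else (not a)

    def and3(a, b):
        if a is False or b is False:
            return False
        if a is None or b is None:
            return None
        return True

    def or3(a, b):
        return not3(and3(not3(a), not3(b)))    # De Morgan still holds here

    print(and3(True, None))       # None -- unknown
    print(or3(True, None))        # True -- known regardless of the unknown
    print(or3(None, not3(None)))  # None -- "A or not A" need not come out True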


skulk 2021-08-18 02:18:40 +0000 UTC [ - ]

This philosophy.se post answers that question: https://philosophy.stackexchange.com/questions/7194/how-is-g...

im3w1l 2021-08-18 02:07:14 +0000 UTC [ - ]

> Generally you can't figure out which point is fixed.

So, just thinking out loud here, but in the 1d case... The set will be an interval [a, b]. Let g(x) = f(x) - x.

g(a) >= 0 and g(b) <= 0. Now we can use the intermediate value theorem and binary search to find (get arbitrarily close to) a point with g(x) = 0 (i.e. a fixed point of f).
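A minimal sketch of that search (my own, with an arbitrary example f that happens to map [0, 1] into itself):

    import math

    def fixed_point(f, a, b, tol=1e-12):
        g = lambda x: f(x) - x   # g(a) >= 0 and g(b) <= 0 when f maps [a, b] into itself
        while b - a > tol:
            m = (a + b) / 2
            if g(m) >= 0:
                a = m
            else:
                b = m
        return (a + b) / 2

    x = fixed_point(math.cos, 0.0, 1.0)   # cos maps [0, 1] into itself
    print(x, math.cos(x))                 # ~0.739085..., a fixed point of cos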

Not possible to do such searches in higher dimensions?

SquishyPanda23 2021-08-18 10:12:39 +0000 UTC [ - ]

You can generally approximate the fixed point. You can possibly do this to arbitrary precision, but I'm not sure.

This is useful for practical applications, but it's not the same as knowing the fixed point for mathematical applications.

E.g. you may want to prove that some function preserves the irrational number pi. But you may not be able to prove it, only show that computations make your conjecture more and more likely.

ChrisKnott 2021-08-18 06:56:34 +0000 UTC [ - ]

I'm not sure that even works in 1 dimension...?

I don't think you necessarily have g(a) >= 0 and g(b) <= 0. Finding the starting points for binary search might be essentially as hard as finding the fixed point.

im3w1l 2021-08-18 09:58:23 +0000 UTC [ - ]

a <= f(x) <= b for every x in [a, b].

Hence a <= f(a), so 0 <= f(a) - a = g(a). Likewise f(b) <= b, so g(b) = f(b) - b <= 0.

xarici_ishler 2021-08-18 00:23:45 +0000 UTC [ - ]

https://www.youtube.com/watch?v=oX9aPNF6_h8

This is a great video explaining Brouwer's theorem and its (theoretical) application to stirring a cup of water.


hyperbovine 2021-08-18 12:44:00 +0000 UTC [ - ]

It's counter intuitive because it's obviously false as stated: f(x) = 1 + x for x in R. They left out the main assumption of the theorem, namely that the domain of the function is compact, and gets mapped (in)to itself.

max_ 2021-08-18 18:15:30 +0000 UTC [ - ]

>I would have thought that it’s almost impossible for a particle to end up in the same place as it started.

That's true on a short time scale. But on a longer time scale, you will eventually visit all possible states of the system.

I think this is similar to the concept of ergodicity[0].

[0]: https://en.wikipedia.org/wiki/Ergodicity

arketyp 2021-08-18 07:20:35 +0000 UTC [ - ]

I stumbled upon this theorem after trying to come up with a mapping of color space, RGB to RGB, where each color maps to a new one. (It was for a mouse cursor effect in a painting app.) I couldn't come up with one, and indeed it is impossible!

chompychop 2021-08-18 08:15:14 +0000 UTC [ - ]

Are there any constraints on the mapping? If not, then why not map each intensity value to the immediately next value, and wrap it around? For instance, 0 -> 1, 1 -> 2, 2 -> 3, ...., 255 -> 0.

voldacar 2021-08-18 08:33:07 +0000 UTC [ - ]

The function you are describing is not continuous, so the theorem does not apply. But yeah, it has no fixed point.
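A quick check of the wrap-around map suggested above: it indeed has no fixed point, but it jumps at 255 -> 0, so continuity (and hence Brouwer's theorem) is out.

    step = lambda v: (v + 1) % 256
    print(any(step(v) == v for v in range(256)))   # False: no channel value stays put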


joebob42 2021-08-18 00:24:07 +0000 UTC [ - ]

I'm with this guy. Obviously I'm missing something or not understanding what is being said here, because in my head f(x) = x+1 changes every single point and is continuous.

Or, on the glass of water point, what if I pour the water out on the table? It's certainly all been moved, yes?

(To be clear I'm obviously wrong / not understanding the premise, but I would be curious to know what I'm missing)

nonameiguess 2021-08-18 00:34:26 +0000 UTC [ - ]

It only applies to functions on bounded sets. The reals are not bounded. There are other restrictions, but that is one of them.

im3w1l 2021-08-18 00:27:27 +0000 UTC [ - ]

Your example function doesn't map any compact set back to itself.

joebob42 2021-08-18 00:36:06 +0000 UTC [ - ]

Yep. I noticed mention of this additional constraint in another comment. As far as I can tell the article is just incorrect, since it misses this constraint up to that point, unless I'm just a clown and somehow not seeing it.

The upside here is that it seems perhaps I wasn't missing the premise, and it was just mis-stated :)

voldacar 2021-08-18 08:37:09 +0000 UTC [ - ]

No, you are right; Quanta sucks. I'm sure there are thousands of other people thinking of x+1 and scratching their heads.

SilasX 2021-08-18 02:15:54 +0000 UTC [ - ]

So add "mod p" to it.

random314 2021-08-18 02:29:05 +0000 UTC [ - ]

The transformation is not continuous at the boundary p.

cblconfederate 2021-08-18 09:31:27 +0000 UTC [ - ]

I think it's because we think of the drink's points as molecules, which can indeed move and rearrange in ways that continuous points cannot.

im3w1l 2021-08-18 00:19:31 +0000 UTC [ - ]

It only works for some idealized continuous fluid, not for actual water, which is made up of a large but finite number of molecules.

kevinventullo 2021-08-18 01:16:00 +0000 UTC [ - ]

Even then it doesn’t feel like a great example since it’s not obvious to me that stirring should be a continuous function. E.g. if I put half the glass into a container and then pour the container back into the glass, that is definitely a discontinuous operation!

alanbernstein 2021-08-18 06:06:14 +0000 UTC [ - ]

I wonder if there is some idealized definition of "stirring" being used here. It seems that actual stirring could conceivably result in any possible rearrangement of particles. As a simple example, which is illustrative even though it seems unlikely as a result of stirring: leave the (x, y) coordinates of each particle the same, while changing the z coordinate as z -> z+k mod h. The result is effectively a discontinuous map (violating the theorem), even though the path of each individual particle must be continuous for physical reasons. Two different notions of continuity here.

drdeca 2021-08-18 01:56:04 +0000 UTC [ - ]

But stirring doesn't break the connectedness of the water like that?

kevinventullo 2021-08-18 03:39:35 +0000 UTC [ - ]

I guess wherever you dip the spoon in will break it in a kind of discontinuous way?

drdeca 2021-08-18 17:20:32 +0000 UTC [ - ]

If the spoon were one of those spoons with holes in the non-handle part, then I think it would have to produce a discontinuity.

But for a normal spoon shape, I don't think dipping it in makes the map between points in the water fail to be continuous. I'm somewhat confident that there is a smooth path from an embedding of a cylinder into R^3 to an embedding into R^3 of a cylinder which is a little taller and has a spoon shape removed from it.

zbendefy 2021-08-18 11:36:59 +0000 UTC [ - ]

watch this, it explains it well: https://www.youtube.com/watch?v=csInNn6pfT4

zitterbewegung 2021-08-18 00:36:16 +0000 UTC [ - ]

Also, an example of Brouwer's fixed point theorem is the Y Combinator in the lambda calculus. See https://en.wikipedia.org/wiki/Fixed-point_combinator#Y_combi...

drdeca 2021-08-18 01:54:19 +0000 UTC [ - ]

What is the compact space in question there? I don’t see one. Or, I don’t even see an appropriate topology to apply?

Are you sure?

tzs 2021-08-18 13:48:23 +0000 UTC [ - ]

> But for other applications, like in numerical analysis, researchers might need to square their precision. To achieve such an improvement, they might have to square the running time of gradient descent, making the calculation completely intractable

There is an easy way to counteract this. Let’s say your original runs in 1000 seconds. Squaring that would give 1000000 seconds. That’s a jump from under 17 minutes to over 11 days.

The fix: measure everything in hours. Your original runs in 0.278 hours. Squaring that gives 0.077 hours. Squaring the precision now makes it run 3.6 times faster.

jakeinspace 2021-08-18 14:01:23 +0000 UTC [ - ]

I assume this is a joke, but for those not aware: the correct calculation would be to take a measure of the original running time in terms of CPU cycles or some equivalent unitless value, and square that. You don't want to be squaring values with units attached, because squared hours (or squared seconds) are very much not simple units of time.

mmiyer 2021-08-18 18:42:17 +0000 UTC [ - ]

CPU cycles isn't right either - this article is about theoretical CS not about a real machine (squared cycles is meaningless too). It seems the author is saying the algorithm running time is O(n^2) where n is precision. This is approximately cn^2 where c is a constant. Without knowing c we cannot say what squaring the running time precisely means.

dynm 2021-08-18 10:40:20 +0000 UTC [ - ]

> But a machine learning researcher might want to double the precision of an experiment. In that case, the new result implies that they might have to quadruple the running time of their gradient descent algorithm.

> But for other applications, like in numerical analysis, researchers might need to square their precision. To achieve such an improvement, they might have to square the running time of gradient descent

I don't quite understand this. Say f(p) is the complexity of gradient descent for precision p. The first paragraph seems to imply that f(p) = c×p^2, so that f(2p) = c×4×p^2 = 4×f(p). However, the second paragraph then just seems to imply that f(p) = c×p. Am I missing something?

Incidentally, I love that Quanta brings attention to work mostly just on the basis of how important it is. 99% of the other stuff you hear about new research is really just driven by PR departments at the researcher's home institutions.

mmiyer 2021-08-18 18:47:34 +0000 UTC [ - ]

If f(p) = cp^2, then f(p^2) = c(p^2)^2 = cp^4, which is the square of the running time, ignoring constant factors.
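A small numeric check of that reading, with an arbitrary constant c:

    c, p = 3.0, 10.0
    f = lambda q: c * q * q       # running time as a function of precision
    print(f(2 * p) / f(p))        # 4.0 -> doubling precision quadruples the time
    print(f(p ** 2) / f(p) ** 2)  # 1/c -> squaring precision squares the time, up to the constant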

FabHK 2021-08-18 11:30:04 +0000 UTC [ - ]

Now, that article was written by an academic, not a practitioner (my emphasis):

> You can imagine a function as a landscape, where the elevation of the land is equal to the value of the function (the “profit”) at that particular spot. Gradient descent searches for the function’s local minimum [...]

optimalsolver 2021-08-18 01:00:10 +0000 UTC [ - ]

Derivative-free optimization to the rescue!

wenc 2021-08-18 03:05:13 +0000 UTC [ - ]

DFO is in general far slower to converge than gradient-based methods, even close to the optimum, because it doesn't use derivative information. Newton's method, on the other hand, has local quadratic convergence. Not sure I would bet on DFOs to have better performance than gradient methods. DFOs have two advantages: they're embarrassingly parallelizable, and for non-convex problems they don't succumb as easily to local optima. But they're not better performing in general.
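A toy comparison of those convergence rates (my own example, minimizing f(x) = x^2 + exp(x); the 0.2 step size and the starting point are arbitrary):

    import math

    grad = lambda x: 2 * x + math.exp(x)   # f'(x) for f(x) = x^2 + exp(x)
    hess = lambda x: 2 + math.exp(x)       # f''(x)

    def iterations(update, x=1.0, tol=1e-12, cap=10_000):
        n = 0
        while abs(grad(x)) > tol and n < cap:
            x, n = update(x), n + 1
        return n

    print("gradient descent:", iterations(lambda x: x - 0.2 * grad(x)))      # tens of iterations
    print("Newton's method :", iterations(lambda x: x - grad(x) / hess(x)))  # a handful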

gerdesj 2021-08-17 23:35:27 +0000 UTC [ - ]

You say minima and I say maxima; They say maybe over there ... no left a bit.