Program Synthesis with Large Language Models
mountainriver 2021-08-18 03:33:24 +0000 UTC [ - ]
I’ve been thinking on this quite a bit, it’s an interesting property of language models in that they are essentially associative lookup machines.
It feels on some level that we need to be combining this kind of reasoning with something like world models.
tasdfqwer0897 2021-08-18 03:52:51 +0000 UTC [ - ]
There are people who fall more on the side of bitter-lesson/scaling-law-maximalism, and I think it's probably healthy and valuable that there are people in the research community placing both types of bet.
algo_trader 2021-08-18 14:57:07 +0000 UTC [ - ]
But will a branch-and-search with some-noisy-evaluator get us there? Obviously this is easier with some domains.
Or maybe simply using increasing large "prompts" or "input specification" to specify the desired end result. There might be a while scaling law hiding there ..
qualudeheart 2021-08-18 15:10:36 +0000 UTC [ - ]
tasdfqwer0897 2021-08-18 02:55:00 +0000 UTC [ - ]
https://twitter.com/gstsdn/status/1427794393373626368
criticaltinker 2021-08-18 03:41:24 +0000 UTC [ - ]
> On both datasets, we find that synthesis performance scales log-linearly with model size
This "scaling law" observation is starting to become a major trend in NLP [1] and other other modalities such as speech recognition [2] and protein structure/function prediction [3]. Do you have any insight or commentary to offer regarding the next step in improving program synthesis? For example, will techniques to scale up model size continue to be a primary focus (eg as in [4]), or do you see improvements in Transformer and attention based architectures as essential for pushing the limits of what has been achieved by you and your colleagues?
> We find that even our best models are generally unable to predict the output of a program given a specific input.
What do you think about leveraging unsupervised training to improve program synthesis? Could synthesized programs be executed on generated input in a way that supports contrastive learning [5]?
Thanks in advance for your time and comments here.
[1] Scaling Laws for Neural Language Models https://arxiv.org/abs/2001.08361
[2] Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition https://arxiv.org/abs/2010.10504
[3] ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Learning https://arxiv.org/abs/2007.06225
[4] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity https://arxiv.org/abs/2101.03961
[5] A Simple Framework for Contrastive Learning of Visual Representations https://arxiv.org/pdf/2108.00587.pdf
tasdfqwer0897 2021-08-18 03:59:02 +0000 UTC [ - ]
I do personally, but there is some disagreement about this in the field. In fact, I would go further and say that (in addition to using large pre-trained models) we will need methods of training that are pretty substantially different in order to elicit robust reasoning behavior.
Even supposing I'm wrong about this, if you go and look at the scaling plots in figure 3 and try to figure out how big your model would need to be in order to be solving most of these problems, you'd get a really big number. Even if you had such a big model, it would still require post-processing of the samples to actually get the right answers. From the perspective of applications, that's fine, but it's a little unsatisfying from the perspective of studying intelligence. Even with those caveats (!) these problems aren't that hard compared to general software engineering tasks...
> What do you think about leveraging unsupervised training to improve program synthesis? Could synthesized programs be executed on generated input in a way that supports contrastive learning [5]?
I think this is an interesting idea and someone should try it! I do think that, even restricting our attention to just getting neural networks to execute programs, that we will need to do something a little more drastic to robustly get the results we want.
mjburgess 2021-08-18 08:10:33 +0000 UTC [ - ]
This power-law behaviour is exactly what you get when modelling a non-computable function.
Eg., consider approximating the mean of sin^2(x) over -pi to +pi.
If you sample x 10^s times (s = 1, 2, 3, ...) the difference from the non-computable analytic answer (0.5) scales log-linear (ie., power-law).
It is my view that intelligence isn't computable, and that any approximation method to some layer of intelligence will run-up against this problem.nextaccountic 2021-08-18 14:25:03 +0000 UTC [ - ]
I mean, if there is a process to progressively give better estimates of the output of a function, then this function is computable.
mjburgess 2021-08-18 14:43:26 +0000 UTC [ - ]
There is a computable version for computable inputs, ie., we ask "what is sin(x)?" for some computable-number x_computable.
But there is no computable version for a non-computable real number, x_real.
nextaccountic 2021-08-18 16:05:03 +0000 UTC [ - ]
> But there is no computable version for a non-computable real number, x_real.
That's true[0]. But consider that you will never receive something non-computable as input to a program, ever. (if you're allowed to approximate the input until you have enough digits to compute what you need, then the input is computable)
Really, I think the best way to view sin(x) is as a function that receives a stream of digits 0.2345345323.. and returns a stream of digits 0.0040933884... - this computable version of sine completely captures every thing we could possibly do with it in a program. Operating with floating point numbers, then, is just mostly truncating the input and output stream.
[0] At least in classical logic; in intuitionistic logic, sin : Real -> Real is computable and is equivalent to this idea of receiving and returning streams of digits.
mjburgess 2021-08-18 17:02:17 +0000 UTC [ - ]
Yes, but this is far less than nature can do with it -- which is my point, that discrete computation is extremely limited.
nextaccountic 2021-08-19 08:40:59 +0000 UTC [ - ]
That is, we don't know if nature is actually continuous. Perhaps spacetime becomes discrete in the Planck scale or something like that (I know past attempts have been unsuccessful, but still, it might be). But if nature is continuous, there is probably a fundamental limitation on harnessing those precision bits past a certain limit. Nature's continuous variables might have enough symmetries as to make it Turing-complete. In this case, computation with full real numbers wouldn't ever happen in this universe.
I mean, if it happens, the Church-Turing thesis would be false, and the consequences would be far more strange.
YeGoblynQueenne 2021-08-18 13:57:55 +0000 UTC [ - ]
I will have to read the paper more carefully however having quickly scanned the paper it seems that it only reports empirical resuls. In particular, there seem to be no theoretical results about lernability of programs from natural language specifications using large language models. To make it more plain - how do we know these techniques work as well as reported on problems other than the ones in the dataset introduced in your work?
Note that I'm not asking why you introduced a new dataset, this seems to be motivated in the abstract. I'm asking: how do we know how well this kind of thing works ("this kind of thing" being what it says in the title) in the general case?
muds 2021-08-18 03:34:40 +0000 UTC [ - ]
From response 1, it seems that the model has very little confidence in its decision but get the correct answer while, on the contrary, in response 3, the model seems very confident in its incorrect answer. Is this usually a trend that you see with large models? How hard is it, generally, to make such models "aware" of their own shortcomings?
[0]: https://arxiv.org/pdf/2108.07732.pdf#figure.caption.21
tasdfqwer0897 2021-08-18 03:49:44 +0000 UTC [ - ]
More generally, whether these models are well-calibrated (that is, they know what they don't know) is an important area of research. I don't have references offhand, but I think it's true broadly speaking that these larger pre-trained models do tend to be better calibrated.
ipsum2 2021-08-18 03:31:07 +0000 UTC [ - ]
tasdfqwer0897 2021-08-18 03:50:42 +0000 UTC [ - ]
ipsum2 2021-08-18 04:27:19 +0000 UTC [ - ]
rollulus 2021-08-18 08:08:29 +0000 UTC [ - ]
trainwithsgd 2021-08-18 22:05:55 +0000 UTC [ - ]